TensorRT-LLM 0.11.0 Release
Hi,
We are very pleased to announce the 0.11.0 version of TensorRT-LLM. This update includes:
Key Features and Enhancements
- Supported very long context for LLaMA (see the “Long context evaluation” section in examples/llama/README.md).
- Low latency optimization
  - Added a reduce-norm feature that fuses the ResidualAdd and LayerNorm kernels after AllReduce into a single kernel. Enabling it is recommended when the batch size is small and the generation phase dominates the run time.
  - Added FP8 support to the GEMM plugin, which benefits cases where the batch size is smaller than 4.
  - Added a fused GEMM-SwiGLU plugin for FP8 on SM90.
- LoRA enhancements
  - Supported running FP8 LLaMA with FP16 LoRA checkpoints.
  - Added support for quantized base model and FP16/BF16 LoRA.
    - SQ OOTB (- INT8 A/W) + FP16/BF16/FP32 LoRA
    - INT8/ INT4 Weight-Only (INT8 /W) + FP16/BF16/FP32 LoRA
    - Weight-Only Group-wise + FP16/BF16/FP32 LoRA
  - Added LoRA support to Qwen2, see the “Run models with LoRA” section in examples/qwen/README.md.
  - Added support for Phi-3-mini/small FP8 base + FP16/BF16 LoRA, see the “Run Phi-3 with LoRA” section in examples/phi/README.md.
  - Added support for starcoder-v2 FP8 base + FP16/BF16 LoRA, see the “Run StarCoder2 with LoRA” section in examples/gpt/README.md.
- Encoder-decoder models C++ runtime enhancements
  - Supported paged KV cache and inflight batching. (#800)
  - Supported tensor parallelism.
- Supported INT8 quantization with embedding layer excluded.
- Updated default model for Whisper to distil-whisper/distil-large-v3, thanks to the contribution from @IbrahimAmin1 in #1337.
- Supported automatic HuggingFace model download for the Python high-level API.
- Supported explicit draft tokens for in-flight batching.
- Supported local custom calibration datasets, thanks to the contribution from @DreamGenX in #1762.
- Added batched logits post processor.
- Added Hopper qgmma kernel to XQA JIT codepath.
- Supported tensor parallelism and expert parallelism enabled together for MoE.
- Supported the pipeline parallelism cases when the number of layers cannot be divided by PP size.
- Added numQueuedRequests to the iteration stats log of the executor API.
- Added iterLatencyMilliSec to the iteration stats log of the executor API.
- Added HuggingFace model zoo from the community, thanks to the contribution from @matichon-vultureprime in #1674.
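To make the reduce-norm item above more concrete, here is a minimal pure-Python sketch of the idea: the residual add and the layer normalization that follow an AllReduce are computed in one pass instead of two. It only illustrates the arithmetic and is not the TensorRT-LLM kernel. The function names, the list-based "tensors" and the `eps` default are all assumptions made for the example.

```python
import math


def all_reduce_sum(partials):
    """Element-wise sum of one vector per rank (a stand-in for AllReduce)."""
    return [sum(vals) for vals in zip(*partials)]


def layer_norm(x, gamma, beta, eps=1e-5):
    """Plain two-pass layer normalization over one hidden vector."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    inv_std = 1.0 / math.sqrt(var + eps)
    return [(v - mean) * inv_std * g + b for v, g, b in zip(x, gamma, beta)]


def unfused(reduced, residual, gamma, beta):
    """ResidualAdd and LayerNorm as two separate steps (two 'kernels')."""
    hidden = [r + h for r, h in zip(reduced, residual)]
    return hidden, layer_norm(hidden, gamma, beta)


def fused_residual_add_layer_norm(reduced, residual, gamma, beta, eps=1e-5):
    """Hypothetical fused version: one loop adds the residual and
    accumulates the statistics that the normalization needs."""
    n = len(reduced)
    hidden = [0.0] * n
    total = 0.0
    total_sq = 0.0
    for i in range(n):
        h = reduced[i] + residual[i]
        hidden[i] = h
        total += h
        total_sq += h * h
    mean = total / n
    # Variance from the running sums; clamp tiny negative rounding errors.
    var = max(total_sq / n - mean * mean, 0.0)
    inv_std = 1.0 / math.sqrt(var + eps)
    normed = [(hidden[i] - mean) * inv_std * gamma[i] + beta[i] for i in range(n)]
    return hidden, normed
```

Both paths return the updated residual stream and the normalized activations. The fused path reads the all-reduced vector only once, which is the kind of saving that matters most when batches are small and per-kernel overhead dominates.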
API Changes
- [BREAKING CHANGE] trtllm-build command
  - Migrated Whisper to unified workflow (trtllm-build command), see documents: examples/whisper/README.md.
  - max_batch_size in trtllm-build command is switched to 256 by default.
  - max_num_tokens in trtllm-build command is switched to 8192 by default.
  - Deprecated max_output_len and added max_seq_len.
  - Removed unnecessary --weight_only_precision argument from trtllm-build command.
  - Removed attention_qk_half_accumulation argument from trtllm-build command.
  - Removed use_context_fmha_for_generation argument from trtllm-build command.
  - Removed strongly_typed argument from trtllm-build command.
  - The default value of max_seq_len reads from the HuggingFace model config now.
- C++ runtime
  - [BREAKING CHANGE] Renamed free_gpu_memory_fraction in ModelRunnerCpp to kv_cache_free_gpu_memory_fraction.
  - [BREAKING CHANGE] Refactored GptManager API
    - Moved maxBeamWidth into TrtGptModelOptionalParams.
    - Moved schedulerConfig into TrtGptModelOptionalParams.
  - Added some more options to ModelRunnerCpp, including max_tokens_in_paged_kv_cache, kv_cache_enable_block_reuse and enable_chunked_context.
- [BREAKING CHANGE] Python high-level API
  - Removed the ModelConfig class; all of its options are moved to the LLM class.
  - Refactored the LLM class, please refer to examples/high-level-api/README.md
    - Moved the most commonly used options into the explicit argument list, and hid the expert options in the kwargs.
    - Exposed model to accept either a HuggingFace model name or a local HuggingFace model/TensorRT-LLM checkpoint/TensorRT-LLM engine.
    - Supported downloading models from the HuggingFace model hub; currently only Llama variants are supported.
    - Supported a build cache to reuse built TensorRT-LLM engines by setting the environment variable TLLM_HLAPI_BUILD_CACHE=1 or passing enable_build_cache=True to the LLM class.
    - Exposed low-level options including BuildConfig, SchedulerConfig and so on in the kwargs, so that, ideally, you can configure details of the build and runtime phases.
  - Refactored LLM.generate() and LLM.generate_async() API.
    - Removed SamplingConfig.
    - Added SamplingParams with more extensive parameters, see tensorrt_llm/hlapi/utils.py.
      - The new SamplingParams contains and manages fields from Python bindings of SamplingConfig, OutputConfig, and so on.
    - Refactored LLM.generate() output as RequestOutput, see tensorrt_llm/hlapi/llm.py.
  - Updated the apps examples, specifically by rewriting both chat.py and fastapi_server.py using the LLM APIs, please refer to examples/apps/README.md for details.
    - Updated chat.py to support multi-turn conversation, allowing users to chat with a model in the terminal.
    - Fixed fastapi_server.py and eliminated the need for mpirun in multi-GPU scenarios.
- [BREAKING CHANGE] Speculative decoding configurations unification
  - Introduced SpeculativeDecodingMode.h to choose between different speculative decoding techniques.
  - Introduced the SpeculativeDecodingModule.h base class for speculative decoding techniques.
  - Removed decodingMode.h.
- gptManagerBenchmark
  - [BREAKING CHANGE] api in gptManagerBenchmark command is executor by default now.
  - Added a runtime max_batch_size.
  - Added a runtime max_num_tokens.
- [BREAKING CHANGE] Added a bias argument to the LayerNorm module, which now supports non-bias layer normalization.
- [BREAKING CHANGE] Removed GptSession Python bindings.
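When migrating existing build scripts, it can help to scan old trtllm-build invocations for the arguments listed above as removed or deprecated. The sketch below is a hypothetical helper, not part of TensorRT-LLM. The flag names come from the list above, but the leading `--` spelling is assumed for the three arguments that the notes give without it, and the function name `audit_build_args` is invented for illustration.

```python
# Arguments the notes above list as removed from trtllm-build.
# Assumption: each one was spelled with a leading "--" on the command line.
REMOVED_FLAGS = {
    "--weight_only_precision",
    "--attention_qk_half_accumulation",
    "--use_context_fmha_for_generation",
    "--strongly_typed",
}

# Deprecated argument mapped to the argument that was added alongside it.
DEPRECATED_FLAGS = {"--max_output_len": "--max_seq_len"}


def audit_build_args(argv):
    """Return (removed, deprecated) flags found in an argument list.

    `removed` is a list of flag names; `deprecated` is a list of
    (old_flag, newly_added_flag) pairs. Order follows the input.
    """
    removed, deprecated = [], []
    for token in argv:
        if not token.startswith("--"):
            continue  # a value, not a flag
        # Accept both "--flag value" and "--flag=value" spellings.
        flag = token.split("=", 1)[0]
        if flag in REMOVED_FLAGS:
            removed.append(flag)
        elif flag in DEPRECATED_FLAGS:
            deprecated.append((flag, DEPRECATED_FLAGS[flag]))
    return removed, deprecated
```

For example, `audit_build_args(["--checkpoint_dir", "ckpt", "--strongly_typed", "--max_output_len=512"])` reports `--strongly_typed` as removed and `--max_output_len` as deprecated. The helper only reports what it finds. It does not rewrite the command, because the notes do not say how old values should map onto max_seq_len.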
Model Updates
- Supported Jais, see examples/jais/README.md.
- Supported DiT, see examples/dit/README.md.
- Supported VILA 1.5.
- Supported Video NeVA, see the Video NeVA section in examples/multimodal/README.md.
- Supported Grok-1, see examples/grok/README.md.
- Supported Qwen1.5-110B with FP8 PTQ.
- Supported Phi-3 small model with block sparse attention.
- Supported InternLM2 7B/20B, thanks to the contribution from @RunningLeon in #1392.
- Supported Phi-3-medium models, see examples/phi/README.md.
- Supported Qwen1.5 MoE A2.7B.
- Supported Phi-3 vision multimodal.
Fixed Issues
- Fixed broken outputs for the cases when batch size is larger than 1. (#1539)
- Fixed top_k type in executor.py, thanks to the contribution from @vonjackustc in #1329.
- Fixed stop and bad word list pointer offset in Python runtime, thanks to the contribution from @fjosw in #1486.
- Fixed some typos for Whisper model, thanks to the contribution from @Pzzzzz5142 in #1328.
- Fixed export failure with CUDA driver < 526 and pynvml >= 11.5.0, thanks to the contribution from @CoderHam in #1537.
- Fixed an issue in NMT weight conversion, thanks to the contribution from @Pzzzzz5142 in #1660.
- Fixed LLaMA Smooth Quant conversion, thanks to the contribution from @lopuhin in #1650.
- Fixed qkv_bias shape issue for Qwen1.5-32B (#1589), thanks to the contribution from @Tlntin in #1637.
- Fixed the error of Ada traits for fpA_intB, thanks to the contribution from @JamesTheZ in #1583.
- Updated examples/qwenvl/requirements.txt, thanks to the contribution from @ngoanpv in #1248.
- Fixed rsLoRA scaling in lora_manager, thanks to the contribution from @TheCodeWrangler in #1669.
- Fixed Qwen1.5 checkpoint convert failure #1675.
- Fixed Medusa safetensors and AWQ conversion, thanks to the contribution from @Tushar-ml in #1535.
- Fixed convert_hf_mpt_legacy call failure when the function is called in other than global scope, thanks to the contribution from @bloodeagle40234 in #1534.
- Fixed use_fp8_context_fmha broken outputs (#1539).
- Fixed pre-norm weight conversion for NMT models, thanks to the contribution from @Pzzzzz5142 in #1723.
- Fixed random seed initialization issue, thanks to the contribution from @pathorn in #1742.
- Fixed stop words and bad words in python bindings. (#1642)
- Fixed an issue when converting the checkpoint for Mistral 7B v0.3, thanks to the contribution from @Ace-RR: #1732.
- Fixed broken inflight batching for fp8 Llama and Mixtral, thanks to the contribution from @bprus: #1738
- Fixed the failure when quantize.py exports data to config.json, thanks to the contribution from @janpetrov: #1676
- Raised an error when autopp detects an unsupported quant plugin #1626.
- Fixed the issue that shared_embedding_table is not being set when loading Gemma #1799, thanks to the contribution from @mfuntowicz.
- Fixed stop and bad words list contiguous for ModelRunner #1815, thanks to the contribution from @Marks101.
- Fixed missing comment for FAST_BUILD, thanks to the support from @lkm2835 in #1851.
- Fixed the issue that Top-P sampling occasionally produces invalid tokens. #1590
- Fixed #1424.
- Fixed #1529.
- Fixed benchmarks/cpp/README.md for #1562 and #1552.
- Fixed dead links, thanks to the help from @DefTruth, @buvnswrn and @sunjiabin17 in: triton-inference-server/tensorrtllm_backend#478, triton-inference-server/tensorrtllm_backend#482 and triton-inference-server/tensorrtllm_backend#449.
Infrastructure Changes
- Base Docker image for TensorRT-LLM is updated to nvcr.io/nvidia/pytorch:24.05-py3.
- Base Docker image for TensorRT-LLM backend is updated to nvcr.io/nvidia/tritonserver:24.05-py3.
- The dependent TensorRT version is updated to 10.1.0.
- The dependent CUDA version is updated to 12.4.1.
- The dependent PyTorch version is updated to 2.3.1.
- The dependent ModelOpt version is updated to v0.13.0.
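If you build outside the base images, a quick check that the Python-side dependencies match the versions listed above may save some debugging. The following is a rough sketch rather than an official tool. The pip distribution names (`tensorrt`, `torch`, `nvidia-modelopt`) are assumptions about how each dependency is installed, `check_versions` is a hypothetical helper, and CUDA is left out because it is not a Python package.

```python
from importlib import metadata

# Versions taken from the list above. The distribution names on the left
# are assumptions about how each dependency is packaged for pip.
EXPECTED = {
    "tensorrt": "10.1.0",
    "torch": "2.3.1",
    "nvidia-modelopt": "0.13.0",
}


def installed_version(dist_name):
    """Return the installed version string, or None if the package is absent."""
    try:
        return metadata.version(dist_name)
    except metadata.PackageNotFoundError:
        return None


def check_versions(expected=EXPECTED, lookup=installed_version):
    """Map each distribution name to (expected, found, matches)."""
    report = {}
    for name, want in expected.items():
        found = lookup(name)
        # Compare only the part before any local suffix, so a build tag
        # such as "+cu121" does not count as a mismatch.
        public = found.split("+")[0] if found else None
        report[name] = (want, found, public == want)
    return report
```

Calling `check_versions()` with no arguments inspects the current environment. Passing a custom `lookup` callable makes it possible to check a recorded package list instead.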
Known Issues
- In a conda environment on Windows, installation of TensorRT-LLM may succeed. However, when importing the library in Python, you may receive the error message OSError: exception: access violation reading 0x0000000000000000. This issue is under investigation.
Currently, there are two key branches in the project:
- The rel branch is the stable branch for the release of TensorRT-LLM. It has been QA-ed and carefully tested.
- The main branch is the dev branch. It is more experimental.
We are updating the main branch regularly with new features, bug fixes and performance optimizations. The rel branch will be updated less frequently, and the exact frequency depends on your feedback.
Thanks,
The TensorRT-LLM Engineering Team