sync with 0.7.1 #308

dtrifiro · 2025-01-27T10:20:59Z

Changelog:
https://github.com/vllm-project/vllm/releases/tag/v0.7.0
https://github.com/vllm-project/vllm/releases/tag/v0.7.1

…apping (vllm-project#11924) Signed-off-by: Jee Jee Li <pandaleefree@gmail.com>

Signed-off-by: Jee Jee Li <pandaleefree@gmail.com>

…tup.py (vllm-project#12046) Signed-off-by: Konrad Zawora <kzawora@habana.ai>

) Signed-off-by: Shanshan Shen <467638484@qq.com>

Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu>

…oject#10467)

…ect#12051) Signed-off-by: Rui Qiao <ruisearch42@gmail.com>

) Signed-off-by: youkaichao <youkaichao@gmail.com>

…ect#12062) Signed-off-by: Kyle Sayers <kylesayrs@gmail.com>

…Manager (vllm-project#12003)

Signed-off-by: Rahul Tuli <rahul@neuralmagic.com>

…12023) Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com> Signed-off-by: Mengqing Cao <cmq0113@163.com> Co-authored-by: Mengqing Cao <cmq0113@163.com>

Signed-off-by: Yuan Zhou <yuan.zhou@intel.com>

Signed-off-by: kewang-xlnx <kewang@xilinx.com> Signed-off-by: kewang2 <kewang2@amd.com> Co-authored-by: kewang2 <kewang2@amd.com> Co-authored-by: Michael Goin <michael@neuralmagic.com>

…12050) Signed-off-by: Joe Runde <Joseph.Runde@ibm.com>

Signed-off-by: Roger Wang <ywang@roblox.com>

Signed-off-by: Varun Sundar Rabindranath <varun@neuralmagic.com> Co-authored-by: Varun Sundar Rabindranath <varun@neuralmagic.com>

…llm-project#12087)

…s supported. (vllm-project#8651) Signed-off-by: mgoin <michael@neuralmagic.com> Co-authored-by: Michael Goin <mgoin@redhat.com> Co-authored-by: mgoin <michael@neuralmagic.com>

…ct#12105)

Signed-off-by: mgoin <michael@neuralmagic.com>

…m-project#12067) Signed-off-by: Isotr0py <2037008807@qq.com>

Signed-off-by: youkaichao <youkaichao@gmail.com>

) Signed-off-by: youkaichao <youkaichao@gmail.com>

…t#12104) Signed-off-by: Roger Wang <ywang@roblox.com>

Signed-off-by: Isotr0py <2037008807@qq.com>

…project#12555) Signed-off-by: npanpaliya <nishidha.panpaliya@partner.ibm.com>

Signed-off-by: mgoin <michael@neuralmagic.com>

…caling (vllm-project#11868)

Signed-off-by: Aleksandr Malyshev <maleksan@amd.com> Co-authored-by: Aleksandr Malyshev <maleksan@amd.com>

Signed-off-by: Lucas Wilkinson <lwilkinson@neuralmagic.com> Signed-off-by: simon-mo <xmo@berkeley.edu> Co-authored-by: Woosuk Kwon <woosuk.kwon@berkeley.edu> Co-authored-by: simon-mo <simon.mo@hey.com> Co-authored-by: Michael Goin <mgoin64@gmail.com> Co-authored-by: Zhuohan Li <zhuohan123@gmail.com> Co-authored-by: Tyler Michael Smith <tysmith@redhat.com> Co-authored-by: Alexander Matveev <59768536+alexm-neuralmagic@users.noreply.github.com> Co-authored-by: simon-mo <xmo@berkeley.edu>

…2571)

dtrifiro · 2025-01-31T14:45:30Z

/test cuda-pr-image-mirror

Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com>

Co-authored-by: simon-mo <xmo@berkeley.edu>

@hmellor

It's very annoying when I forgot to add `-s` in `git commit` to sign-off, because I then need to `git rebase HEAD~1 --signoff` and `git push -f` to fix the DCO. This PR adds a hook to sign off commits automatically when `-s` is missing to solve this problem. The only change from the user side is now users have to install 2 hooks, so instead of just ``` pre-commit install ``` Now we need to ``` pre-commit install --hook-type pre-commit --hook-type commit-msg ``` Note that even if users still only install the pre-commit hook, they won't get any error in `git commit`. Just the sign-off hook won't run. cc @hmellor @youkaichao --------- Signed-off-by: Cody Yu <hao.yu.cody@gmail.com>

@WoosukKwon

- Create v1 design document section in docs. - Add prefix caching design doc. @WoosukKwon @ywang96 --------- Signed-off-by: Cody Yu <hao.yu.cody@gmail.com>

…oject#12603) This pr adds extra key to block hash, to generate different hash value for two blocks with the same token string but different extra_keys in their parent blocks. For example, it can generate different hash value for the second block of the following two requests: ```python request1 = make_request( request_id=0, prompt_token_ids=[_ for _ in range(6)], mm_positions=[{ "offset": 0, "length": 3 }, { "offset": 3, "length": 3 }], mm_hashes=["hash1", "hash2"], ) request2 = make_request( request_id=1, prompt_token_ids=[_ for _ in range(6)], mm_positions=[{ "offset": 0, "length": 3 }, { "offset": 3, "length": 3 }], mm_hashes=["hash3", "hash2"], ) ``` --------- Signed-off-by: Chen Zhang <zhangch99@outlook.com>

Instead of having to create a new build with release version put in as env var.

SUMMARY: * previous PR for pulling in block configs also changed defaults (https://github.com/vllm-project/vllm/pull/11589/files) for FP8 * this broke L4 MoE since there was not enough SHM for the default configuration * this reverts the non-block example to the default Signed-off-by: rshaw@neuralmagic.com <rshaw@neuralmagic.com>

…DeepSeekV3 (vllm-project#12587) Integrates the block-quantized kernels introduced in vllm-project#11868 for use in linear layers. Signed-off-by: Tyler Michael Smith <tyler@neuralmagic.com>

…2563) **[Guided decoding performance optimization]** Sending the guided decoding bitmask in xgrammar to the GPU (`self.token_bitmask.to(scores.device)`) is a blocking operation that prevents the CPU from pre-launching the sampler kernels. The CPU waits until decode is complete, then copies the bitmask over. This PR changes the operation to async via setting `non-blocking=True`. (Current) The CPU is blocked on a `cudaStreamSynchronize` and only pre-empts the sampling kernels after bitmask application. Below is the Nsys profile for one decode phase from Llama 3.1 8B. ![image](https://github.com/user-attachments/assets/8997eae1-b822-4f52-beb8-ef19a7c6b824) With the optimization, this is no longer the case: ![image](https://github.com/user-attachments/assets/6d5ea83f-f169-4f98-a8c1-41c719b3e1e7) --------- Signed-off-by: Ryan N <ryan.nguyen@centml.ai>

- Make device tab names more explicit - Add comprehensive list of devices to https://docs.vllm.ai/en/latest/getting_started/installation/index.html - Add `attention` blocks to the intro of all devices that don't have pre-built wheels/images --------- Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com>

@mgoin

Based on a request by @mgoin , with @kylesayrs we have added an example doc for int4 w4a16 quantization, following the pre-existing int8 w8a8 quantization example and the example available in [`llm-compressor`](https://github.com/vllm-project/llm-compressor/blob/main/examples/quantization_w4a16/llama3_example.py) FIX #n/a (no issue created) @kylesayrs and I have discussed a couple additional improvements for the quantization docs. We will revisit at a later date, possibly including: - A section for "choosing the correct quantization scheme/ compression technique" - Additional vision or audio calibration datasets --------- Signed-off-by: Brian Dellabetta <bdellabe@redhat.com> Co-authored-by: Michael Goin <michael@neuralmagic.com>

SUMMARY: * avoid crashing the engine when we get an input longer than max_model_len FIX vllm-project#12567(*link existing issues this PR will resolve*)

…llm-project#11161) FIX issue vllm-project#9688 vllm-project#11086 vllm-project#12487 --------- Signed-off-by: Jee Jee Li <pandaleefree@gmail.com> Co-authored-by: weilong.yu <weilong.yu@shopee.com> Co-authored-by: Jee Jee Li <pandaleefree@gmail.com>

…oject#12617) Without this PR --------------- Quantizing models with llm-compressor and a recipe that explicitly lists names of layers produces a model that is not loadable by vLLM (i.e. `vllm serve <model>` fails with `raise ValueError(f"Unable to find matching target for {module} in the ...`). Example recipe: ``` recipe = """ quantization_stage: run_type: oneshot quantization_modifiers: GPTQModifier: ignore: ["lm_head"] config_groups: group_0: weights: num_bits: 4 type: "int" symmetric: true strategy: "group" group_size: 128 targets: [ "model.layers.0.mlp.down_proj", "model.layers.2.mlp.down_proj", "model.layers.3.mlp.down_proj", "model.layers.4.mlp.down_proj", "model.layers.5.mlp.down_proj", "model.layers.6.mlp.down_proj", "model.layers.7.mlp.down_proj", "model.layers.8.mlp.down_proj", "model.layers.9.mlp.down_proj", "model.layers.10.mlp.down_proj", "model.layers.11.mlp.down_proj", "model.layers.12.mlp.down_proj", "model.layers.13.mlp.down_proj", "model.layers.14.mlp.down_proj", "model.layers.15.mlp.down_proj", "model.layers.16.mlp.down_proj", "model.layers.17.mlp.down_proj", "model.layers.19.mlp.down_proj", "model.layers.21.mlp.down_proj", "model.layers.22.mlp.down_proj", . . . ] """ ``` To reproduce the vLLM error: ```bash vllm serve nm-testing/eldar-test ``` With this PR ------------ Models are loaded correctly without any errors.

…12599)

Fixes `is_marlin` not being passed into `get_default_config` Also allow `--tensor-parallel-size` in addition to `-tp` and `--tp-size` Signed-off-by: Tyler Michael Smith <tyler@neuralmagic.com>

…oject#12517) This PR addresses a bug in the Cutlass integration where the `sparsity_config.ignore` list was not being respected. When only a subset of modules were configured as Sparse24, the system incorrectly selected Cutlass for non-sparse modules as well. This update ensures the correct scheme is selected for non-sparse modules, fixing this behavior. --- ### Changes - Updated logic to correctly respect `sparsity_config.ignore`. - Ensured non-sparse modules use the appropriate scheme instead of defaulting to Cutlass. --- <details> <summary>Testing Setup</summary> The fix has been tested on top of [this diff](vllm-project#12097). #### Steps to Test: ```bash git checkout -b my-test-branch origin/rahul-bitmask-additions # compressed Cutlass support git revert --no-edit aa2cd2c # revert Tyler's commit to turn off Cutlass for W16A16 git cherry-pick ca624cd # this branch ``` #### Additional Patch Required: ```diff diff --git a/vllm/model_executor/layers/quantization/compressed_tensors/compressed_tensors.py b/vllm/model_executor/layers/quantization/compressed_tensors/compressed_tensors.py index a54177c1c..f916dd0c9 100644 --- a/vllm/model_executor/layers/quantization/compressed_tensors/compressed_tensors.py +++ b/vllm/model_executor/layers/quantization/compressed_tensors/compressed_tensors.py @@ -9,7 +9,7 @@ from compressed_tensors.quantization import (QuantizationArgs, QuantizationStrategy, QuantizationType) from pydantic import BaseModel - +from vllm.logger import init_logger from vllm.model_executor.layers.fused_moe import FusedMoE from vllm.model_executor.layers.linear import (LinearBase, LinearMethodBase, UnquantizedLinearMethod) @@ -27,7 +27,7 @@ from vllm.model_executor.layers.quantization.compressed_tensors.utils import ( should_ignore_layer) from vllm.model_executor.layers.quantization.kv_cache import BaseKVCacheMethod from vllm.platforms import current_platform - +logger = init_logger(__name__) __all__ = ["CompressedTensorsLinearMethod"] SPARSITY_CONFIG_NAME: Literal["sparsity_config"] = "sparsity_config" ``` Apply using: ```bash git apply logging-patch.patch ``` </details> --- <details> <summary>Models Tested</summary> - `nm-testing/TinyLlama-1.1B-Chat-v1.0-gsm8k-partial-24` - `nm-testing/TinyLlama-1.1B-Chat-v1.0-gsm8k-full-sparse24` - `nm-testing/TinyLlama-1.1B-Chat-v1.0-gsm8k-partial-24-entire-fp8-compressed` - `nm-testing/TinyLlama-1.1B-Chat-v1.0-gsm8k-partial-24-remaining-fp8-compressed` </details> --- <details> <summary>Example Output</summary> #### Layers 0-5 (Sparse24) ``` Using scheme: CompressedTensors24 for model.layers.0.self_attn.qkv_proj Using scheme: CompressedTensors24 for model.layers.0.self_attn.o_proj Using scheme: CompressedTensors24 for model.layers.0.mlp.gate_up_proj Using scheme: CompressedTensors24 for model.layers.0.mlp.down_proj ... ``` #### Layers 6+ (Non-Sparse, FP8) ``` Using scheme: CompressedTensorsW8A8Fp8 for model.layers.6.self_attn.qkv_proj Using scheme: CompressedTensorsW8A8Fp8 for model.layers.6.self_attn.o_proj Using scheme: CompressedTensorsW8A8Fp8 for model.layers.6.mlp.gate_up_proj Using scheme: CompressedTensorsW8A8Fp8 for model.layers.6.mlp.down_proj ... ``` </details> **Note:** Assumed all modules in fused layers such as `QKV_proj` and `Gate_up_proj` follow the same quantization/pruning scheme. --- For related tasks using the Asana app for GitHub, refer to [[this link](https://app.asana.com/0/0/1209227810815160)](https://app.asana.com/0/0/1209227810815160). Signed-off-by: Rahul Tuli <rahul@neuralmagic.com>

) This PR implements the Deepseek V3 support by performing matrix absorption the fp8 weights --------- Signed-off-by: Lucas Wilkinson <lwilkinson@neuralmagic.com> Co-authored-by: Woosuk Kwon <woosuk.kwon@berkeley.edu> Co-authored-by: simon-mo <simon.mo@hey.com> Co-authored-by: Michael Goin <mgoin64@gmail.com> Co-authored-by: Zhuohan Li <zhuohan123@gmail.com> Co-authored-by: Tyler Michael Smith <tysmith@redhat.com> Co-authored-by: Alexander Matveev <59768536+alexm-neuralmagic@users.noreply.github.com>

…coding, v1 (vllm-project#12280) We have `v1`, `structured-output`, and `speculative-decoding` labels on github. This adds automation for applying these labels based on the files touched by a PR. Signed-off-by: Russell Bryant <rbryant@redhat.com> --------- Signed-off-by: Russell Bryant <rbryant@redhat.com>

@mgoin

…lm-project#12642) From @mgoin in vllm-project#12638 I cannot push to that branch, therefore a new PR to unblock release. --------- Signed-off-by: mgoin <michael@neuralmagic.com> Signed-off-by: simon-mo <simon.mo@hey.com> Co-authored-by: mgoin <michael@neuralmagic.com>

openshift-ci · 2025-02-04T23:49:29Z

@dtrifiro: The following tests failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

Test name	Commit	Details	Required	Rerun command
ci/prow/smoke-test	`0a63af8`	link	true	`/test smoke-test`
ci/prow/images	`0a63af8`	link	true	`/test images`
ci/prow/cuda-pr-image-mirror	`0a63af8`	link	true	`/test cuda-pr-image-mirror`

Full PR test history. Your PR dashboard.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.

jeejeelee and others added 30 commits January 15, 2025 07:49

[Misc] Merge bitsandbytes_stacked_params_mapping and packed_modules_m…

a3a3ee4

…apping (vllm-project#11924) Signed-off-by: Jee Jee Li <pandaleefree@gmail.com>

[Kernel] Support MulAndSilu (vllm-project#11624)

42f5e7c

Signed-off-by: Jee Jee Li <pandaleefree@gmail.com>

[HPU][Bugfix] Don't use /dev/accel/accel0 for HPU autodetection in se…

1a51b9f

…tup.py (vllm-project#12046) Signed-off-by: Konrad Zawora <kzawora@habana.ai>

[Platform] move current_memory_usage() into platform (vllm-project#11369

9ddac56

) Signed-off-by: Shanshan Shen <467638484@qq.com>

[V1][BugFix] Fix edge case in VLM scheduling (vllm-project#12065)

b7ee940

Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu>

[Misc] Add multipstep chunked-prefill support for FlashInfer (vllm-pr…

0794e74

…oject#10467)

[core] Turn off GPU communication overlap for Ray executor (vllm-proj…

f218f9c

…ect#12051) Signed-off-by: Rui Qiao <ruisearch42@gmail.com>

[core] platform agnostic executor via collective_rpc (vllm-project#11256

ad34c0d

) Signed-off-by: youkaichao <youkaichao@gmail.com>

[Doc] Update examples to remove SparseAutoModelForCausalLM (vllm-proj…

3f9b7ab

…ect#12062) Signed-off-by: Kyle Sayers <kylesayrs@gmail.com>

[V1][Prefix Cache] Move the logic of num_computed_tokens into KVCache…

994fc65

…Manager (vllm-project#12003)

Fix: cases with empty sparsity config (vllm-project#12057)

cbe9439

Signed-off-by: Rahul Tuli <rahul@neuralmagic.com>

Type-fix: make execute_model output type optional (vllm-project#12020)

ad388d2

[Platform] Do not raise error if _Backend is not found (vllm-project#…

3adf0ff

…12023) Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com> Signed-off-by: Mengqing Cao <cmq0113@163.com> Co-authored-by: Mengqing Cao <cmq0113@163.com>

[Model]: Support internlm3 (vllm-project#12037)

97eb97b

Misc: allow to use proxy in HTTPConnection (vllm-project#12042)

5ecf3e0

Signed-off-by: Yuan Zhou <yuan.zhou@intel.com>

[Misc][Quark] Upstream Quark format to VLLM (vllm-project#10765)

de0526f

Signed-off-by: kewang-xlnx <kewang@xilinx.com> Signed-off-by: kewang2 <kewang2@amd.com> Co-authored-by: kewang2 <kewang2@amd.com> Co-authored-by: Michael Goin <michael@neuralmagic.com>

[Doc]: Update OpenAI-Compatible Server documents (vllm-project#12082)

57e729e

[Bugfix] use right truncation for non-generative tasks (vllm-project#…

edce722

…12050) Signed-off-by: Joe Runde <Joseph.Runde@ibm.com>

[V1][Core] Autotune encoder cache budget (vllm-project#11895)

70755e8

Signed-off-by: Roger Wang <ywang@roblox.com>

[Bugfix] Fix _get_lora_device for HQQ marlin (vllm-project#12090)

ebd8c66

Signed-off-by: Varun Sundar Rabindranath <varun@neuralmagic.com> Co-authored-by: Varun Sundar Rabindranath <varun@neuralmagic.com>

Allow hip sources to be directly included when compiling for rocm. (v…

cd9d06f

…llm-project#12087)

[Core] Default to using per_token quantization for fp8 when cutlass i…

fa0050d

…s supported. (vllm-project#8651) Signed-off-by: mgoin <michael@neuralmagic.com> Co-authored-by: Michael Goin <mgoin@redhat.com> Co-authored-by: mgoin <michael@neuralmagic.com>

[Doc] Add documentation for specifying model architecture (vllm-proje…

f8ef146

…ct#12105)

Various cosmetic/comment fixes (vllm-project#12089)

9aa1519

Signed-off-by: mgoin <michael@neuralmagic.com>

[Bugfix] Remove hardcoded head_size=256 for Deepseek v2 and v3 (vll…

dd7c9ad

…m-project#12067) Signed-off-by: Isotr0py <2037008807@qq.com>

Support torchrun and SPMD-style offline inference (vllm-project#12071)

bf53e0c

Signed-off-by: youkaichao <youkaichao@gmail.com>

[core] LLM.collective_rpc interface and RLHF example (vllm-project#12084

92e793d

) Signed-off-by: youkaichao <youkaichao@gmail.com>

[Bugfix] Fix max image feature size for Llava-one-vision (vllm-projec…

874f7c2

…t#12104) Signed-off-by: Roger Wang <ywang@roblox.com>

[misc] Add LoRA kernel micro benchmarks (vllm-project#11579)

5fd24ec

[Model] Add support for deepseek-vl2-tiny model (vllm-project#12068)

62b06ba

Signed-off-by: Isotr0py <2037008807@qq.com>

npanpaliya and others added 6 commits January 30, 2025 16:29

[CPU][PPC] Updated torch, torchvision, torchaudio dependencies (vllm-…

bd2107e

…project#12555) Signed-off-by: npanpaliya <nishidha.panpaliya@partner.ibm.com>

[V1][Log] Add max request concurrency log to V1 (vllm-project#12569)

4078052

Signed-off-by: mgoin <michael@neuralmagic.com>

[Kernel] Update cutlass_scaled_mm to support 2d group (blockwise) s…

9798b2f

…caling (vllm-project#11868)

[ROCm][AMD][Model] llama 3.2 support upstreaming (vllm-project#12421)

a1fc18c

Signed-off-by: Aleksandr Malyshev <maleksan@amd.com> Co-authored-by: Aleksandr Malyshev <maleksan@amd.com>

[Bugfix] Gracefully handle huggingface hub http error (vllm-project#1…

7a8987d

…2571)

hmellor and others added 20 commits January 31, 2025 09:20

Add favicon to docs (vllm-project#12611)

e3f7ff6

Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com>

[BugFix] Fix Torch.Compile For DeepSeek (vllm-project#12594)

325f679

Co-authored-by: simon-mo <xmo@berkeley.edu>

[Docs][V1] Prefix caching design (vllm-project#12598)

60bcef0

- Create v1 design document section in docs. - Add prefix caching design doc. @WoosukKwon @ywang96 --------- Signed-off-by: Cody Yu <hao.yu.cody@gmail.com>

[release] Add input step to ask for Release version (vllm-project#12631)

415f194

Instead of having to create a new build with release version put in as env var.

[Kernel][Quantization] Integrate block-quantized CUTLASS kernels for …

eb5741a

…DeepSeekV3 (vllm-project#12587) Integrates the block-quantized kernels introduced in vllm-project#11868 for use in linear layers. Signed-off-by: Tyler Michael Smith <tyler@neuralmagic.com>

[V1] Bugfix: Validate Model Input Length (vllm-project#12600)

b1340f9

SUMMARY: * avoid crashing the engine when we get an input longer than max_model_len FIX vllm-project#12567(*link existing issues this PR will resolve*)

[ci] Upgrade transformers to 4.48.2 in CI dependencies (vllm-project#…

35b7a05

…12599)

[Bugfix/CI] Fixup benchmark_moe.py (vllm-project#12562)

cfa134d

Fixes `is_marlin` not being passed into `get_default_config` Also allow `--tensor-parallel-size` in addition to `-tp` and `--tp-size` Signed-off-by: Tyler Michael Smith <tyler@neuralmagic.com>

dtrifiro changed the title ~~sync with 0.7.0~~ sync with 0.7.1 Feb 4, 2025

Sync with upstream @ v0.7.1

0a63af8

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

sync with 0.7.1 #308

sync with 0.7.1 #308

dtrifiro commented Jan 27, 2025 •

edited

Loading

dtrifiro commented Jan 31, 2025

openshift-ci bot commented Feb 4, 2025

sync with 0.7.1 #308

Are you sure you want to change the base?

sync with 0.7.1 #308

Conversation

dtrifiro commented Jan 27, 2025 • edited Loading

dtrifiro commented Jan 31, 2025

openshift-ci bot commented Feb 4, 2025

dtrifiro commented Jan 27, 2025 •

edited

Loading