sync with 0.7.2 #315

dtrifiro · 2025-02-07T18:02:14Z

0.7.2 changelog: https://github.com/vllm-project/vllm/releases/v0.7.2
Dockerfile.ubi: bump flashinfer to v0.2.0.post2

Word "evolved" was mistyped Signed-off-by: Vicente Herrera <vicenteherrera@vicenteherrera.com> --------- Signed-off-by: Vicente Herrera <vicenteherrera@vicenteherrera.com>

Fix vllm-project#12647 The `get_quant_method` of `moe_wna16` always returns a MoE method, a GPTQ-based linear method, or an AWQ-based linear method, even when the target module is an attention layer. https://github.com/vllm-project/vllm/blob/baeded25699f9f4851843306f27f685c4d4ee7c5/vllm/attention/layer.py#L86-L92 Signed-off-by: Jinzhen Lin <linjinzhen@hotmail.com>
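A minimal sketch of the dispatch described in this fix, assuming the method is chosen by layer type. The classes `FusedMoE`, `LinearBase`, and `Attention` below are empty stand-ins, and the string return values and the `linear_backend` parameter are hypothetical; none of this is the actual vLLM code.

```python
from typing import Optional


class FusedMoE:
    """Stand-in for a fused MoE layer (hypothetical, not the vLLM class)."""


class LinearBase:
    """Stand-in for a linear layer (hypothetical, not the vLLM class)."""


class Attention:
    """Stand-in for an attention layer (hypothetical, not the vLLM class)."""


def get_quant_method(layer: object, linear_backend: str = "gptq") -> Optional[str]:
    """Return a label for the quantization method, or None to leave the layer alone."""
    if isinstance(layer, FusedMoE):
        return "moe_wna16"
    if isinstance(layer, LinearBase):
        # GPTQ-based or AWQ-based linear method, depending on the checkpoint.
        return f"{linear_backend}_linear"
    # Anything else, including attention layers, gets no quantization method.
    return None
```

The point of the sketch is the final `return None`: a layer that is neither MoE nor linear falls through instead of being handed a linear method.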

I noticed during testing that I was getting a lot of these deprecation warnings about `lora_local_path`:

```
DeprecationWarning: The 'lora_local_path' attribute is deprecated and will be removed in a future version. Please use 'lora_path' instead.
```

The check used for emitting this warning was always True, even when the parameter was not actually specified, because the field name is always in `__struct_fields__`. We should check for a non-None value instead.

Signed-off-by: Russell Bryant <rbryant@redhat.com>
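The difference between the two checks can be sketched as follows. `LoRARequestSketch` is a hypothetical plain class that imitates a struct-like type by carrying a `__struct_fields__` tuple; it is an illustration under that assumption, not the vLLM class, and the helper `count_deprecation_warnings` exists only for the demonstration.

```python
import warnings
from typing import Optional


class LoRARequestSketch:
    # Declared field names: a name is listed here whether or not the caller
    # actually passed a value for it.
    __struct_fields__ = ("lora_name", "lora_path", "lora_local_path")

    def __init__(self, lora_name: str, lora_path: str = "",
                 lora_local_path: Optional[str] = None) -> None:
        self.lora_name = lora_name
        self.lora_path = lora_path
        self.lora_local_path = lora_local_path

        # Always-true form: "lora_local_path" in self.__struct_fields__
        # Intended form: warn only when a value was really supplied.
        if self.lora_local_path is not None:
            warnings.warn(
                "The 'lora_local_path' attribute is deprecated and will be "
                "removed in a future version. Please use 'lora_path' instead.",
                DeprecationWarning,
                stacklevel=2,
            )


def count_deprecation_warnings(**kwargs) -> int:
    """Construct the sketch class and count the DeprecationWarnings it emits."""
    with warnings.catch_warnings(record=True) as caught:
        warnings.simplefilter("always")
        LoRARequestSketch(**kwargs)
    return sum(issubclass(w.category, DeprecationWarning) for w in caught)
```

With the non-None check, constructing the object with only `lora_path` stays silent, while passing `lora_local_path` emits a single warning.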

A small optimization to avoid creating a new `ConstantList` every time `request.kv_block_hashes` is used. Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu>
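A rough sketch of the idea, assuming the read-only wrapper holds a reference to the underlying list rather than a copy. `ReadOnlyList` and `RequestSketch` are hypothetical stand-ins for `ConstantList` and the request object, not the vLLM implementations.

```python
from collections.abc import Sequence
from typing import List


class ReadOnlyList(Sequence):
    """Hypothetical read-only view over a list (stand-in for ConstantList)."""

    def __init__(self, data: List[int]) -> None:
        self._data = data

    def __getitem__(self, index):
        return self._data[index]

    def __len__(self) -> int:
        return len(self._data)


class RequestSketch:
    def __init__(self) -> None:
        self._kv_block_hashes: List[int] = []
        # Build the view once. It wraps the same list object, so later
        # appends are visible through it without creating a new wrapper.
        self.kv_block_hashes = ReadOnlyList(self._kv_block_hashes)

    def append_kv_block_hash(self, block_hash: int) -> None:
        self._kv_block_hashes.append(block_hash)
```

The alternative, a property that wraps the list on every access, allocates a new object per read, which is what the commit message says it avoids.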


…anager (vllm-project#12608) As mentioned in RFC vllm-project#12254, this PR combines `allocate_slots` and `append_slots`. There should be no functional change, except that decode now also raises an exception when num_tokens is zero (as prefill does); the unit test case is changed accordingly. @comaniac @rickyyx @WoosukKwon @youkaichao @heheda12345 @simon-mo --------- Signed-off-by: Shawn Du <shawnd200@outlook.com>


…lm-project#12628)

- **Add SPDX license headers to python source files**
- **Check for SPDX headers using pre-commit**

commit 9d7ef44
Author: Russell Bryant <rbryant@redhat.com>
Date: Fri Jan 31 14:18:24 2025 -0500

Add SPDX license headers to python source files

This commit adds SPDX license headers to python source files as recommended to the project by the Linux Foundation. These headers provide a concise way that is both human and machine readable for communicating license information for each source file. It helps avoid any ambiguity about the license of the code and can also be easily used by tools to help manage license compliance.

The Linux Foundation runs license scans against the codebase to help ensure we are in compliance with the licenses of the code we use, including dependencies. Having these headers in place helps that tool do its job.

More information can be found on the SPDX site:

- https://spdx.dev/learn/handling-license-info/

Signed-off-by: Russell Bryant <rbryant@redhat.com>

commit 5a1cf1c
Author: Russell Bryant <rbryant@redhat.com>
Date: Fri Jan 31 14:36:32 2025 -0500

Check for SPDX headers using pre-commit

Signed-off-by: Russell Bryant <rbryant@redhat.com>

--------- Signed-off-by: Russell Bryant <rbryant@redhat.com>
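For illustration, a header check of this kind could look roughly like the sketch below. The function names, the exact header string, the shebang handling, and the treatment of empty files are all assumptions made for the example; this is not the hook added by the commit.

```python
from typing import Dict, List

# Assumed header text for an Apache-2.0 licensed Python file.
SPDX_HEADER = "# SPDX-License-Identifier: Apache-2.0"


def has_spdx_header(source: str) -> bool:
    """True if the first non-blank line after an optional shebang is the header."""
    for line in source.splitlines():
        stripped = line.strip()
        if not stripped or stripped.startswith("#!"):
            continue  # skip blank lines and the shebang
        return stripped == SPDX_HEADER
    return False  # assumption: a file with no content counts as missing


def files_missing_header(sources: Dict[str, str]) -> List[str]:
    """Return the sorted names of files whose contents lack the header."""
    return sorted(name for name, text in sources.items()
                  if not has_spdx_header(text))
```

A pre-commit hook would typically run such a function over the staged files and exit non-zero when the returned list is not empty.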

…ct#12667) As more and more people are trying deepseek models with multi-node inference, vllm-project#7815 is becoming more frequent. Let's give users a clear message. Signed-off-by: youkaichao <youkaichao@gmail.com>

sgl_moe_align_block_size is based on: sgl-project/sglang@ded9fcd moe_align_block_size is based on: sgl-project/sglang@ba5112f Signed-off-by: Yang Chen <yangche@fb.com>

…oject#12669) When people use deepseek models, they find that they need to resolve a cv2 version conflict, see https://zhuanlan.zhihu.com/p/21064432691 . I added the check and made all imports of `cv2` lazy. --------- Signed-off-by: youkaichao <youkaichao@gmail.com>


…llm-project#12570) Fix to AWQ quant loading of the new R1 model. The new optimized MoE kernel for a large number of experts, `moe_wna16`, uses AWQ quant, which requires the attention layers to be in 16-bit. The current merge has broken this, and `get_quant_method` must return None for it to work correctly again.

--------- Signed-off-by: Srikanth Srinivas <srikanth@astrum.ai> Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com> Signed-off-by: Beim <beim2015@outlook.com> Signed-off-by: rshaw@neuralmagic.com <rshaw@neuralmagic.com> Signed-off-by: mgoin <michael@neuralmagic.com> Signed-off-by: npanpaliya <nishidha.panpaliya@partner.ibm.com> Signed-off-by: Aleksandr Malyshev <maleksan@amd.com> Signed-off-by: Lucas Wilkinson <lwilkinson@neuralmagic.com> Signed-off-by: simon-mo <xmo@berkeley.edu> Signed-off-by: Cody Yu <hao.yu.cody@gmail.com> Signed-off-by: Chen Zhang <zhangch99@outlook.com> Signed-off-by: Tyler Michael Smith <tyler@neuralmagic.com> Signed-off-by: Ryan N <ryan.nguyen@centml.ai> Signed-off-by: Brian Dellabetta <bdellabe@redhat.com> Signed-off-by: Jee Jee Li <pandaleefree@gmail.com> Signed-off-by: Rahul Tuli <rahul@neuralmagic.com> Signed-off-by: Russell Bryant <rbryant@redhat.com> Signed-off-by: simon-mo <simon.mo@hey.com> Signed-off-by: Vicente Herrera <vicenteherrera@vicenteherrera.com> Signed-off-by: Jinzhen Lin <linjinzhen@hotmail.com> Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu> Signed-off-by: Shawn Du <shawnd200@outlook.com> Signed-off-by: Kunshang Ji <kunshang.ji@intel.com> Signed-off-by: youkaichao <youkaichao@gmail.com> Co-authored-by: Harry Mellor <19981378+hmellor@users.noreply.github.com> Co-authored-by: Beim <805908499@qq.com> Co-authored-by: Robert Shaw <114415538+robertgshaw2-redhat@users.noreply.github.com> Co-authored-by: mgoin <michael@neuralmagic.com> Co-authored-by: simon-mo <xmo@berkeley.edu> Co-authored-by: Nishidha <nishidha.panpaliya@partner.ibm.com> Co-authored-by: Lucas Wilkinson <LucasWilkinson@users.noreply.github.com> Co-authored-by: Aleksandr Malyshev <164964928+maleksan85@users.noreply.github.com> Co-authored-by: Aleksandr Malyshev <maleksan@amd.com> Co-authored-by: Woosuk Kwon <woosuk.kwon@berkeley.edu> Co-authored-by: simon-mo <simon.mo@hey.com> Co-authored-by: Michael Goin <mgoin64@gmail.com> Co-authored-by: Zhuohan Li <zhuohan123@gmail.com> Co-authored-by: Tyler Michael Smith <tysmith@redhat.com> Co-authored-by: Alexander Matveev <59768536+alexm-neuralmagic@users.noreply.github.com> Co-authored-by: Roger Wang <136131678+ywang96@users.noreply.github.com> Co-authored-by: Cody Yu <hao.yu.cody@gmail.com> Co-authored-by: Chen Zhang <zhangch99@outlook.com> Co-authored-by: Kevin H. Luu <kevin@anyscale.com> Co-authored-by: Tyler Michael Smith <tyler@neuralmagic.com> Co-authored-by: Ryan Nguyen <96593302+xpbowler@users.noreply.github.com> Co-authored-by: Brian Dellabetta <brian-dellabetta@users.noreply.github.com> Co-authored-by: fade_away <1028552010@qq.com> Co-authored-by: weilong.yu <weilong.yu@shopee.com> Co-authored-by: Jee Jee Li <pandaleefree@gmail.com> Co-authored-by: Eldar Kurtic <eldarkurtic314@gmail.com> Co-authored-by: Rahul Tuli <rahul@neuralmagic.com> Co-authored-by: Russell Bryant <rbryant@redhat.com> Co-authored-by: Vicente Herrera <vicenteherrera@vicenteherrera.com> Co-authored-by: Jinzhen Lin <linjinzhen@hotmail.com> Co-authored-by: Shawn Du <shawnd200@outlook.com> Co-authored-by: Kunshang Ji <kunshang.ji@intel.com> Co-authored-by: youkaichao <youkaichao@gmail.com>


# Adds support for `transformers` as a backend

Following huggingface/transformers#35235, a bunch of models should already be supported, and we are ramping up support for more models.

Thanks @Isotr0py for the TP support, and @hmellor for his help as well!

This includes:

- `trust_remote_code=True` support: any model on the hub, if it implements attention the correct way, can be natively supported!
- tensor parallel support

--------- Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com> Signed-off-by: Isotr0py <2037008807@qq.com> Co-authored-by: Isotr0py <41363108+Isotr0py@users.noreply.github.com> Co-authored-by: Harry Mellor <19981378+hmellor@users.noreply.github.com> Co-authored-by: Isotr0py <2037008807@qq.com> Co-authored-by: Cyrus Leung <cyrus.tl.leung@gmail.com> Co-authored-by: Michael Goin <mgoin64@gmail.com> Co-authored-by: Isotr0py <mozf@mail2.sysu.edu.cn>


openshift-ci · 2025-02-07T18:02:30Z

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: dtrifiro

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

~~OWNERS~~ [dtrifiro]

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment


…with the existing tuning, prior to moving all the way forward to release/3.2.x; Using the correct hipblaslt version in the name (#315)

mgoin and others added 30 commits February 1, 2025 16:16

Apply torch.compile to fused_moe/grouped_topk (vllm-project#12637)

3194039

doc: fixing minor typo in readme.md (vllm-project#12643)

b4e5c03

Word "evolved" was mistyped Signed-off-by: Vicente Herrera <vicenteherrera@vicenteherrera.com> --------- Signed-off-by: Vicente Herrera <vicenteherrera@vicenteherrera.com>

[V1][Minor] Avoid frequently creating ConstantList (vllm-project#12653)

abfcdcd

A small optimization to avoid creating a new `ConstantList` every time `request.kv_block_hashes` is used. Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu>

[Hardware][Intel GPU] add XPU bf16 support (vllm-project#12392)

f256ebe

Signed-off-by: Kunshang Ji <kunshang.ji@intel.com>

[Doc] Deprecate Discord (vllm-project#12668)

326fcc8

[Kernel] port sgl moe_align_block_size kernels (vllm-project#12574)

95460fc

sgl_moe_align_block_size is based on: sgl-project/sglang@ded9fcd moe_align_block_size is based on: sgl-project/sglang@ba5112f Signed-off-by: Yang Chen <yangche@fb.com>

Properly check if all fused layers are in the list of targets (vllm-p…

c5932e5

…roject#12666) Thanks @kylesayrs for catching this!

[cuda] manually import the correct pynvml module (vllm-project#12679)

ad4a9dc

fixes problems like vllm-project#12635 and vllm-project#12636 and vllm-project#12565 --------- Signed-off-by: youkaichao <youkaichao@gmail.com>

[ci/build] fix gh200 test (vllm-project#12681)

1298a40

Signed-off-by: youkaichao <youkaichao@gmail.com>

[Misc] Fix improper placement of SPDX header in scripts (vllm-project…

33e0602

…#12694) Signed-off-by: Russell Bryant <rbryant@redhat.com>

[Bugfix][Kernel] Fix per-token/per-channel quantization for Hopper sc…

c11de33

…aled mm (vllm-project#12696) Signed-off-by: Tyler Michael Smith <tyler@neuralmagic.com>

Squelch MLA warning for Compressed-Tensors Models (vllm-project#12704)

6dd5e52

Signed-off-by: Kyle Sayers <kylesayrs@gmail.com>

[Model] Add Deepseek V3 fp8_w8a8 configs for B200 (vllm-project#12707)

4797dad

[MISC] Remove model input dumping when exception (vllm-project#12582)

cf58b9c

Signed-off-by: Cody Yu <hao.yu.cody@gmail.com>

[V1] Revert uncache_blocks and support recaching full blocks (vllm-…

5095e96

…project#12415) Signed-off-by: Cody Yu <hao.yu.cody@gmail.com>

[Core] Improve hash collision avoidance in prefix caching (vllm-proje…

73b35cc

…ct#12621) Signed-off-by: Russell Bryant <rbryant@redhat.com>

Support Pixtral-Large HF by using llava multimodal_projector_bias con…

5d98d56

…fig (vllm-project#12710) Signed-off-by: mgoin <michael@neuralmagic.com>

[Doc] Replace ibm-fms with ibm-ai-platform (vllm-project#12709)

bb392af

Signed-off-by: Thomas Parnell <tpa@zurich.ibm.com>

[Quant] Fix use_mla TypeError and support loading pure-sparsity Compr…

4896d0c

…essed Tensors configs (vllm-project#12711)

[AMD][ROCm] Enable DeepSeek model on ROCm (vllm-project#12662)

c36ac98

Signed-off-by: Hongxia Yang <hongxia.yang@amd.com> Co-authored-by: Matthew Wong <Matthew.Wong2@amd.com>

[Misc] Add BNB quantization for Whisper (vllm-project#12381)

96b2362

Signed-off-by: Jee Jee Li <pandaleefree@gmail.com>

[VLM] Merged multi-modal processor for InternVL-based models (vllm-pr…

d1ca7df

…oject#12553) Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk> Signed-off-by: Isotr0py <2037008807@qq.com> Co-authored-by: Isotr0py <2037008807@qq.com>

mgoin and others added 18 commits February 4, 2025 22:43

[Doc] Remove performance warning for auto_awq.md (vllm-project#12743)

c53dc46

[Bugfix] Fix 'ModuleNotFoundError: No module named 'intel_extension_f…

022bcc7

…or_pytorch'' for --tensor-parallel-size more than 1 (vllm-project#12546)

[core][distributed] exact ray placement control (vllm-project#12732)

bc1bdec

Signed-off-by: youkaichao <youkaichao@gmail.com>

Merging PR vllm-project#12536

4c3aac5

Merged via CLI script

[Hardware][Intel-Gaudi] Enable FusedSDPA support for Intel Gaudi (HPU)

af8486d

Add: Support for Sparse24Bitmask Compressed Models

3b2005e

[VLM] Use shared field to pass token ids to model

a4ce74c

[Docs] Drop duplicate [source] links

9a5b155

[VLM] Qwen2.5-VL

bf3b79e

[VLM] Update compatibility with transformers 4.49

75404d0

[ROCm][Kernel] Using the correct warp_size value

5b19b93

[Bugfix] Better FP8 supported defaults

76abd0c

[Misc][Easy] Remove the space from the file name

9cdea30

[Model] LoRA Support for Ultravox model (vllm-project#11253)

d88506d

[Bugfix] Fix the test_ultravox.py's license (vllm-project#12806)

56534cd

Signed-off-by: Lu Fang <lufang@fb.com>

Improve TransformersModel UX (vllm-project#12785)

1a6fcad

[Misc] Remove duplicated DeepSeek V2/V3 model definition (vllm-projec…

449d1bc

…t#12793)

[Misc] Improve error message for incorrect pynvml (vllm-project#12809)

0408efc

Signed-off-by: youkaichao <youkaichao@gmail.com>

dtrifiro requested a review from njhill as a code owner February 7, 2025 18:02

openshift-ci bot requested review from vaibhavjainwiz and Xaenalt February 7, 2025 18:02

openshift-ci bot added the approved label Feb 7, 2025

dtrifiro force-pushed the sync-with-0.7.2 branch from ae28d3e to 6ec0863 Compare February 7, 2025 18:52

dtrifiro added 3 commits February 7, 2025 19:52

Sync with upstream @ v0.7.2

3d84028

Dockerfile.ubi: bump flashinfer to v0.2.0.post2

54bf162

https://github.com/flashinfer-ai/flashinfer/releases/tag/v0.2.0.post2

extras: add sdpx license identifier

6ec0863

dtrifiro merged commit 1da3720 into opendatahub-io:main Feb 7, 2025
2 of 6 checks passed

dtrifiro deleted the sync-with-0.7.2 branch February 7, 2025 18:56
