Sync with upstream@v0.4.3-60-gbaa15a9e #47
Conversation
Co-authored-by: mulin.lyh <mulin.lyh@taobao.com>
[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: github-actions[bot]. The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files. Approvers can indicate their approval by writing /approve in a comment.

Hi @github-actions[bot]. Thanks for your PR. I'm waiting for an opendatahub-io member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test. Once the patch is verified, the new status will be reflected by the ok-to-test label. I understand the commands that are listed here. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.
Switching from torch._scaled_mm to vLLM's CUTLASS FP8 kernels when supported, as we are seeing a 5-15% improvement in end-to-end performance on neuralmagic/Meta-Llama-3-8B-Instruct-FP8. See https://docs.google.com/spreadsheets/d/1GiAnmzyGHgZ6zL_LDSTm35Bdrt4A8AaFEurDlISYYA4/ for some quick e2e benchmarks and #5144 for comparisons across different GEMM sizes.
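For context, a minimal sketch of the dispatch this describes. The vLLM-side names (`ops.cutlass_scaled_mm`, the `_custom_ops` module path) are assumptions modeled on vLLM's custom-op layer, and exact signatures may differ:

```python
# Hedged sketch of dispatching between the CUTLASS FP8 kernel and
# torch._scaled_mm; vLLM-side names and module paths are assumptions.
import torch

def cutlass_fp8_supported() -> bool:
    # Placeholder check; a fuller, version-aware sketch appears below
    # under the bug description. SM89 (Ada) threshold is an assumption.
    major, minor = torch.cuda.get_device_capability()
    return 10 * major + minor >= 89

def apply_fp8_linear(a: torch.Tensor, b: torch.Tensor,
                     scale_a: torch.Tensor, scale_b: torch.Tensor,
                     out_dtype: torch.dtype = torch.float16) -> torch.Tensor:
    if cutlass_fp8_supported():
        # CUTLASS FP8 GEMM path: the 5-15% end-to-end win measured above.
        from vllm import _custom_ops as ops  # assumed module path
        return ops.cutlass_scaled_mm(a, b, scale_a, scale_b, out_dtype)
    # Fallback: PyTorch's native scaled matmul. Note that some older
    # torch versions return an (output, amax) tuple here.
    return torch._scaled_mm(a, b, scale_a=scale_a, scale_b=scale_b,
                            out_dtype=out_dtype)
```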
/ok-to-test
Co-authored-by: Varun Sundar Rabindranath <varunsundar08@gmail.com>
Co-authored-by: Varun Sundar Rabindranath <varun@neuralmagic.com>
Co-authored-by: DarkLight1337 <tlleungac@connect.ust.hk>
Co-authored-by: team <calvinn.ng@ahrefs.com>
Bug description: with torch 2.4.0.dev20240603+cu121, cutlass_fp8_supported returns False, and the (capability, version) pair before the comparison is (90, 11111111112). This PR fixes the support check for FP8 CUTLASS (cutlass_fp8_supported), which was introduced in #5183.
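The shape of the fix can be sketched as follows; the thresholds are assumptions, but the key point is deriving a comparable integer from `torch.version.cuda` rather than concatenating digit characters, which is what produced the bogus 11111111112 value:

```python
import torch

def cutlass_fp8_supported() -> bool:
    # Hedged sketch of a corrected support check; exact thresholds are
    # assumptions, the point is how the two numbers are derived.
    major, minor = torch.cuda.get_device_capability()
    capability = 10 * major + minor

    # torch.version.cuda is a string such as "12.1"; parse it into a
    # comparable integer (121) instead of concatenating digit characters.
    cuda_major, cuda_minor = torch.version.cuda.split(".")[:2]
    version = int(cuda_major) * 10 + int(cuda_minor)

    # CUTLASS FP8 needs at least Ada (SM89) and CUDA 12.0 (assumed).
    return capability >= 89 and version >= 120
```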
[CI/Test] improve robustness of test by replacing del with context manager (hf_runner) (#5347)
[CI/Test] improve robustness of test by replacing del with context manager (vllm_runner) (#5357)
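The pattern behind these two changes, illustrated with a hypothetical runner class (all names here are placeholders, not the actual test fixtures): a context manager makes teardown deterministic where a bare `del` only hints at it.

```python
# Hypothetical runner showing the del -> context-manager change.
import gc

def load_model(name: str):
    # Placeholder standing in for the real model loader.
    return object()

class VllmRunner:
    def __init__(self, model_name: str):
        self.model = load_model(model_name)

    def __enter__(self):
        return self

    def __exit__(self, exc_type, exc, tb):
        # Deterministic cleanup, even if the test body raises.
        del self.model
        gc.collect()

# Before: teardown relied on `del runner` and the garbage collector,
# which is fragile if the test raises first.
#
# After: the `with` block guarantees cleanup on every exit path.
with VllmRunner("facebook/opt-125m") as runner:
    pass  # test body goes here
```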
/ok-to-test
Co-authored-by: Roger Wang <ywang@roblox.com>
* support quark
* using torch/all.h
* loading weight from quark output
* support both ammo and quark
* Update doc
* fix load ammo
* fix linter
* fix isort
Merge vllm-project/vllm:main@v0.4.3-60-gbaa15a9e into main