Sync with upstream@v0.4.3-60-gbaa15a9e #47
Conversation
Co-authored-by: mulin.lyh <mulin.lyh@taobao.com>
[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: github-actions[bot]. The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files. Approvers can indicate their approval by writing /approve in a comment.

Hi @github-actions[bot]. Thanks for your PR. I'm waiting for an opendatahub-io member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test. Once the patch is verified, the new status will be reflected by the ok-to-test label. I understand the commands that are listed here. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.
Switching from torch._scaled_mm to vLLM's CUTLASS FP8 kernels when supported, as we are seeing a 5-15% improvement in end-to-end performance on neuralmagic/Meta-Llama-3-8B-Instruct-FP8. See https://docs.google.com/spreadsheets/d/1GiAnmzyGHgZ6zL_LDSTm35Bdrt4A8AaFEurDlISYYA4/ for some quick e2e benchmarks and #5144 for comparisons across different GEMM sizes.
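For context, a minimal sketch of the dispatch this describes. The vLLM-side names (`ops.cutlass_scaled_mm`, the `_custom_ops` module path) are assumptions modeled on vLLM's custom-op layer, and exact signatures may differ:

```python
# Hedged sketch of dispatching between the CUTLASS FP8 kernel and
# torch._scaled_mm; vLLM-side names and module paths are assumptions.
import torch

def cutlass_fp8_supported() -> bool:
    # Placeholder check; a fuller, version-aware sketch appears below
    # under the bug description. SM89 (Ada) threshold is an assumption.
    major, minor = torch.cuda.get_device_capability()
    return 10 * major + minor >= 89

def apply_fp8_linear(a: torch.Tensor, b: torch.Tensor,
                     scale_a: torch.Tensor, scale_b: torch.Tensor,
                     out_dtype: torch.dtype = torch.float16) -> torch.Tensor:
    if cutlass_fp8_supported():
        # CUTLASS FP8 GEMM path: the 5-15% end-to-end win measured above.
        from vllm import _custom_ops as ops  # assumed module path
        return ops.cutlass_scaled_mm(a, b, scale_a, scale_b, out_dtype)
    # Fallback: PyTorch's native scaled matmul. Note that some older
    # torch versions return an (output, amax) tuple here.
    return torch._scaled_mm(a, b, scale_a=scale_a, scale_b=scale_b,
                            out_dtype=out_dtype)
```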
/ok-to-test
Co-authored-by: Varun Sundar Rabindranath <varunsundar08@gmail.com>
Co-authored-by: Varun Sundar Rabindranath <varun@neuralmagic.com>
Co-authored-by: DarkLight1337 <tlleungac@connect.ust.hk>
Co-authored-by: team <calvinn.ng@ahrefs.com>
Bug description: with torch 2.4.0.dev20240603+cu121, cutlass_fp8_supported returns False, and the (capability, version) pair before the comparison is (90, 11111111112). This PR fixes the support check for FP8 CUTLASS (cutlass_fp8_supported), which was introduced in #5183.
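The shape of the fix can be sketched as follows; the thresholds are assumptions, but the key point is deriving a comparable integer from `torch.version.cuda` rather than concatenating digit characters, which is what produced the bogus 11111111112 value:

```python
import torch

def cutlass_fp8_supported() -> bool:
    # Hedged sketch of a corrected support check; exact thresholds are
    # assumptions, the point is how the two numbers are derived.
    major, minor = torch.cuda.get_device_capability()
    capability = 10 * major + minor

    # torch.version.cuda is a string such as "12.1"; parse it into a
    # comparable integer (121) instead of concatenating digit characters.
    cuda_major, cuda_minor = torch.version.cuda.split(".")[:2]
    version = int(cuda_major) * 10 + int(cuda_minor)

    # CUTLASS FP8 needs at least Ada (SM89) and CUDA 12.0 (assumed).
    return capability >= 89 and version >= 120
```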
[CI/Test] improve robustness of test by replacing del with context manager (hf_runner) (#5347)
[CI/Test] improve robustness of test by replacing del with context manager (vllm_runner) (#5357)
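The pattern behind these two changes, illustrated with a hypothetical runner class (all names here are placeholders, not the actual test fixtures): a context manager makes teardown deterministic where a bare `del` only hints at it.

```python
# Hypothetical runner showing the del -> context-manager change.
import gc

def load_model(name: str):
    # Placeholder standing in for the real model loader.
    return object()

class VllmRunner:
    def __init__(self, model_name: str):
        self.model = load_model(model_name)

    def __enter__(self):
        return self

    def __exit__(self, exc_type, exc, tb):
        # Deterministic cleanup, even if the test body raises.
        del self.model
        gc.collect()

# Before: teardown relied on `del runner` and the garbage collector,
# which is fragile if the test raises first.
#
# After: the `with` block guarantees cleanup on every exit path.
with VllmRunner("facebook/opt-125m") as runner:
    pass  # test body goes here
```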
/ok-to-test
Co-authored-by: Roger Wang <ywang@roblox.com>
* support quark
* using torch/all.h
* loading weight from quark output
* support both ammo and quark
* Update doc
* fix load ammo
* fix linter
* fix isort
Merge vllm-project/vllm:main@v0.4.3-60-gbaa15a9e into main