forked from vllm-project/vllm
[Quant] [Feature] Per-Token-Activation Per-Channel-Weight FP8 Quantization #412
Closed: tjtanaa wants to merge 763 commits into ROCm:llama_fp8_12062024 from EmbeddedLLM:ptpc-fp8-torch_scaled_mm-tj
Conversation
This feature has been supported through PR #445.
Support Per-Token-Activation Per-Channel-Weight FP8 Quantization Inferencing

Note: This PR's feature requires ROCm 6.3 or later and GPU architecture MI300 or later.

Description

This PR adds support for Per-Token-Activation Per-Channel-Weight (PTPC-FP8) FP8 quantized inferencing, with the following enhancements:

- The model is quantized on the fly from BFloat16 to FP8. Model weights stored in Float16 must first be cast to BFloat16.
- It uses PyTorch's recent rowwise scaled GEMM feature in `torch._scaled_mm`, introduced in pytorch/pytorch#144432 ([ROCm] hipblaslt rowwise f8 gemm), which speeds up the previous naive implementation by at least 2x. For more details, check out the Performance section.
- The PyTorch commit pinned in `Dockerfile.rocm_base` has been updated to `3a585126`. `Dockerfile.rocm` is left untouched, as its base image references the AMD Docker Hub registry, and that base image at this point in time already ships with PyTorch commit `3a585126`.
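The PTPC-FP8 scheme above can be sketched numerically: each activation row (token) gets its own scale, each weight output channel gets its own scale, and the rowwise-scaled GEMM dequantizes the integer-domain product with the outer product of the two scale vectors. Below is a minimal NumPy simulation of that scale math (the actual fp8 rounding and the `torch._scaled_mm` kernel path are omitted, and the 448.0 bound assumes the OCP e4m3fn format; MI300's e4m3fnuz variant tops out at 240.0):

```python
import numpy as np

FP8_E4M3_MAX = 448.0  # OCP e4m3fn max; MI300 uses e4m3fnuz, whose max is 240.0

def ptpc_fp8_gemm_sim(x, w):
    """Simulate a PTPC-FP8 GEMM: per-token scales for activations x (M, K),
    per-channel scales for weights w (N, K); returns x @ w.T dequantized."""
    # Per-token activation scales: one amax per row (token) of x.
    s_x = np.abs(x).max(axis=1, keepdims=True) / FP8_E4M3_MAX  # shape (M, 1)
    # Per-channel weight scales: one amax per output channel (row) of w.
    s_w = np.abs(w).max(axis=1, keepdims=True) / FP8_E4M3_MAX  # shape (N, 1)
    # "Quantize" by scaling into the fp8 range (a real kernel would also
    # round and cast to a float8 dtype here).
    xq = np.clip(x / s_x, -FP8_E4M3_MAX, FP8_E4M3_MAX)
    wq = np.clip(w / s_w, -FP8_E4M3_MAX, FP8_E4M3_MAX)
    # Rowwise-scaled GEMM: dequantize with the outer product of the scales.
    return (xq @ wq.T) * s_x * s_w.T

x = np.random.randn(4, 16).astype(np.float32)   # 4 tokens, hidden size 16
w = np.random.randn(8, 16).astype(np.float32)   # 8 output channels
y = ptpc_fp8_gemm_sim(x, w)
```

Because the simulation skips the fp8 rounding step, `y` matches `x @ w.T` up to float32 error; in the real path the per-token/per-channel scales are what keep that rounding error small for outlier-heavy activations. The rowwise variant of `torch._scaled_mm` consumes exactly these two scale vectors in place of the single tensor-wide scale used by per-tensor FP8.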