forked from vllm-project/vllm
[Quant] [Feature] Per-Token-Activation Per-Channel-Weight FP8 Quantization #412
Closed: tjtanaa wants to merge 763 commits into ROCm:llama_fp8_12062024 from EmbeddedLLM:ptpc-fp8-torch_scaled_mm-tj
Conversation
This feature has been supported through PR #445.
Support Per-Token-Activation Per-Channel-Weight FP8 Quantization Inferencing

Note: This PR's feature requires ROCm 6.3 or later and GPU architecture MI300 or later.

Description

This PR adds support for Per-Token-Activation Per-Channel-Weight (PTPC-FP8) FP8 quantized inferencing, with the following enhancements:

- The model is quantized on the fly from BFloat16 to FP8. Model weights stored in Float16 must first be cast to BFloat16.
- It uses PyTorch's recent rowwise scaled GEMM feature in `torch._scaled_mm`, introduced in pytorch/pytorch#144432 ([ROCm] hipblaslt rowwise f8 gemm), which speeds up the previous naive implementation by at least 2x. For more details, check out the Performance section.
- The PyTorch commit pinned in `Dockerfile.rocm_base` has been updated to `3a585126`. `Dockerfile.rocm` is left untouched, as its base image references the AMD Docker Hub registry, and that base image at this point in time already ships with PyTorch commit `3a585126`.
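The PTPC-FP8 scheme above can be sketched numerically: each activation row (token) gets its own scale, each weight output channel gets its own scale, and the rowwise-scaled GEMM dequantizes the integer-domain product with the outer product of the two scale vectors. Below is a minimal NumPy simulation of that scale math (the actual fp8 rounding and the `torch._scaled_mm` kernel path are omitted, and the 448.0 bound assumes the OCP e4m3fn format; MI300's e4m3fnuz variant tops out at 240.0):

```python
import numpy as np

FP8_E4M3_MAX = 448.0  # OCP e4m3fn max; MI300 uses e4m3fnuz, whose max is 240.0

def ptpc_fp8_gemm_sim(x, w):
    """Simulate a PTPC-FP8 GEMM: per-token scales for activations x (M, K),
    per-channel scales for weights w (N, K); returns x @ w.T dequantized."""
    # Per-token activation scales: one amax per row (token) of x.
    s_x = np.abs(x).max(axis=1, keepdims=True) / FP8_E4M3_MAX  # shape (M, 1)
    # Per-channel weight scales: one amax per output channel (row) of w.
    s_w = np.abs(w).max(axis=1, keepdims=True) / FP8_E4M3_MAX  # shape (N, 1)
    # "Quantize" by scaling into the fp8 range (a real kernel would also
    # round and cast to a float8 dtype here).
    xq = np.clip(x / s_x, -FP8_E4M3_MAX, FP8_E4M3_MAX)
    wq = np.clip(w / s_w, -FP8_E4M3_MAX, FP8_E4M3_MAX)
    # Rowwise-scaled GEMM: dequantize with the outer product of the scales.
    return (xq @ wq.T) * s_x * s_w.T

x = np.random.randn(4, 16).astype(np.float32)   # 4 tokens, hidden size 16
w = np.random.randn(8, 16).astype(np.float32)   # 8 output channels
y = ptpc_fp8_gemm_sim(x, w)
```

Because the simulation skips the fp8 rounding step, `y` matches `x @ w.T` up to float32 error; in the real path the per-token/per-channel scales are what keep that rounding error small for outlier-heavy activations. The rowwise variant of `torch._scaled_mm` consumes exactly these two scale vectors in place of the single tensor-wide scale used by per-tensor FP8.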