forked from vllm-project/vllm
GPTQ Fused MoE class #8
Closed
Commits (82)
6753789 GPTQ Fused MoE class (ElizaWszola)
7df4014 Add GPTQMarlinMoEMethod to gptq_marlin.py (ElizaWszola)
c3dc249 Use FusedMoE layer for all loads (ElizaWszola)
2fa03e5 Merge branch 'marlin-moe-8-bit' into gptq_fused_moe (ElizaWszola)
8a504d9 Make sure that GPTQ runs through mixtral.py (ElizaWszola)
689ea0a enforce float16A/scales for marlin moe (ElizaWszola)
ec47561 Merge branch 'marlin-moe-8-bit' into gptq_fused_moe (ElizaWszola)
2ad2e56 cleanup (ElizaWszola)
d1dec64 [MISC] Consolidate FP8 kv-cache tests (#8131) (comaniac)
561d6f8 [CI/Build][ROCm] Enabling LoRA tests on ROCm (#7369) (alexeykondrat)
e02ce49 [CI] Change test input in Gemma LoRA test (#8163) (WoosukKwon)
77d9e51 [Feature] OpenAI-Compatible Tools API + Streaming for Hermes & Mistra… (K-Mistele)
008cf88 [MISC] Replace input token throughput with total token throughput (#8… (comaniac)
32e7db2 [Neuron] Adding support for adding/ overriding neuron configuration a… (hbikki)
e01c2be Bump version to v0.6.0 (#8166) (simon-mo)
1afc931 [Doc] [Misc] Create CODE_OF_CONDUCT.md (#8161) (mmcelaney)
4624d98 [bugfix] >1.43 constraint for openai (#8169) (SolitaryThinker)
ba262c4 [Misc] Clean up RoPE forward_native (#8076) (WoosukKwon)
e39ebf5 [ci] Mark LoRA test as soft-fail (#8160) (khluu)
288a938 [Core/Bugfix] Add query dtype as per FlashInfer API requirements. (#8… (elfiegg)
8685ba1 [Doc] Indicate more information about supported modalities (#8181) (DarkLight1337)
9da25a8 Inclusion of InternVLChatModel In PP_SUPPORTED_MODELS (Pipeline Parall… (Manikandan-Thangaraj-ZS0321)
2ee4528 [MODEL] Qwen Multimodal Support (Qwen-VL / Qwen-VL-Chat) (#8029) (alex-jw-brooks)
2febcf2 Move verify_marlin_supported to GPTQMarlinLinearMethod (#8165) (mgoin)
9f97b3b [Documentation][Spec Decode] Add documentation about lossless guarant… (sroy745)
db3bf7c update/fix weight loading to support tp (dsikka)
baa5467 [Core] Support load and unload LoRA in api server (#6566) (Jeffwan)
b841ac4 [BugFix] Fix Granite model configuration (#8216) (njhill)
a245032 remove 8-bit stuff for now (ElizaWszola)
9d8a80c Merge branch 'gptq_fused_moe' of https://github.com/neuralmagic/vllm … (ElizaWszola)
e5cab71 fix; update large model testing cases (dsikka)
315e22f [Frontend] Add --logprobs argument to `benchmark_serving.py` (#8191) (afeldman-nm)
de80783 add hack to support unfused mixtral pathway for int8 (dsikka)
565cc43 [Misc] Use ray[adag] dependency instead of cuda (#7938) (ruisearch42)
1447c97 fix install for tpu test (dsikka)
9db52ea [CI/Build] Increasing timeout for multiproc worker tests (#8203) (alexeykondrat)
23f3222 [Kernel] [Triton] Memory optimization for awq_gemm and awq_dequantize… (rasmith)
29f49cd [Misc] Remove `SqueezeLLM` (#8220) (dsikka)
12dd715 [Model] Allow loading from original Mistral format (#8168) (patrickvonplaten)
41e95c5 [misc] [doc] [frontend] LLM torch profiler support (#7943) (SolitaryThinker)
2f707fc [Bugfix] Fix Hermes tool call chat template bug (#8256) (K-Mistele)
795b662 [Model] Multi-input support for LLaVA (#8238) (DarkLight1337)
ce2702a Enable Random Prefix Caching in Serving Profiling Tool (benchmark_ser… (wschin)
9f68e00 [tpu][misc] fix typo (#8260) (youkaichao)
e807125 [Bugfix] Fix broken OpenAI tensorizer test (#8258) (DarkLight1337)
8886423 [Model][VLM] Support multi-images inputs for InternVL2 models (#8201) (Isotr0py)
ab27497 Move float16 typecast hack to gptq marlin moe method (ElizaWszola)
36bf815 Move output type conversion to gptq method as well (ElizaWszola)
b962ee1 [Model][VLM] Decouple weight loading logic for `Paligemma` (#8269) (Isotr0py)
cfe712b ppc64le: Dockerfile fixed, and a script for buildkite (#8026) (sumitd2)
4ef41b8 [CI/Build] Use python 3.12 in cuda image (#8133) (joerunde)
847e860 [Bugfix] Fix async postprocessor in case of preemption (#8267) (alexm-redhat)
430a9cb Enable 8-bit weights in Fused Marlin MoE (ElizaWszola)
48047aa fix rocm (ElizaWszola)
bfc4fae bad paste (ElizaWszola)
c5a2f62 add test case; fix imports for tests (dsikka)
2b308c4 fix to adapt custom_routin_function (dsikka)
71256d4 Use select_experts to compute top_k tensors in fused moe (ElizaWszola)
7aa844c bring back fused_moe_marlin -> fused_marlin_moe (ElizaWszola)
0f7bec3 GPTQ Fused MoE class (ElizaWszola)
cb0001e Add GPTQMarlinMoEMethod to gptq_marlin.py (ElizaWszola)
33090a3 Use FusedMoE layer for all loads (ElizaWszola)
d479837 Make sure that GPTQ runs through mixtral.py (ElizaWszola)
8baaec6 enforce float16A/scales for marlin moe (ElizaWszola)
8fbc181 remove large model (dsikka)
839915f Cleanup, comments (ElizaWszola)
a5bc626 cleanup (ElizaWszola)
c573fa1 remove 8-bit stuff for now (ElizaWszola)
a991d82 update/fix weight loading to support tp (dsikka)
d57804d fix; update large model testing cases (dsikka)
96fa486 add hack to support unfused mixtral pathway for int8 (dsikka)
1faab90 fix install for tpu test (dsikka)
970e06a Move float16 typecast hack to gptq marlin moe method (ElizaWszola)
fd0a4f2 Move output type conversion to gptq method as well (ElizaWszola)
3ac9273 typo fix; fix comment (dsikka)
d51a2f4 Merge branch 'gptq_fused_moe' of https://github.com/neuralmagic/vllm … (ElizaWszola)
08287ef Clarify comment, change how we process bias (ElizaWszola)
58fcc85 [Bugfix] Streamed tool calls now more strictly follow OpenAI's format… (K-Mistele)
f9b4a2d [Frontend] Add progress reporting to run_batch.py (#8060) (alugowski)
c7cb5c3 [Bugfix] Correct adapter usage for cohere and jamba (#8292) (vladislavkruglikov)
12f05c5 [Misc] GPTQ Activation Ordering (#8135) (kylesayrs)
Merge branch 'main' into gptq_fused_moe (dsikka)
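Several commits above concern expert routing (for example "Use select_experts to compute top_k tensors in fused moe"). As a rough illustration of what top-k expert selection computes, here is a pure-Python sketch: softmax the router logits per token, keep the `top_k` largest probabilities, and renormalize them. The function name echoes the commit message, but the signature and list-based shapes below are simplifications for illustration, not vLLM's torch-based API.

```python
import math

def select_experts(router_logits, top_k):
    """For each token's row of router logits, return the indices of the
    top_k experts and their softmax weights renormalized to sum to 1.
    Pure-Python sketch; vLLM's real implementation works on torch
    tensors and supports custom routing functions."""
    results = []
    for logits in router_logits:  # one row of expert logits per token
        # numerically stable softmax over all experts
        m = max(logits)
        exps = [math.exp(x - m) for x in logits]
        total = sum(exps)
        probs = [e / total for e in exps]
        # indices of the top_k largest probabilities (stable order on ties)
        topk_ids = sorted(range(len(probs)),
                          key=lambda i: probs[i], reverse=True)[:top_k]
        # renormalize the selected weights so they sum to 1
        denom = sum(probs[i] for i in topk_ids)
        topk_weights = [probs[i] / denom for i in topk_ids]
        results.append((topk_ids, topk_weights))
    return results
```

The selected ids and weights are what a fused MoE kernel then consumes to gather and combine per-expert outputs.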
Filter by extension
Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
There are no files selected for viewing
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Oops, something went wrong.
Add this suggestion to a batch that can be applied as a single commit.
This suggestion is invalid because no changes were made to the code.
Suggestions cannot be applied while the pull request is closed.
Suggestions cannot be applied while viewing a subset of changes.
Only one suggestion per line can be applied in a batch.
Add this suggestion to a batch that can be applied as a single commit.
Applying suggestions on deleted lines is not supported.
You must change the existing code in this line in order to create a valid suggestion.
Outdated suggestions cannot be applied.
This suggestion has been applied or marked resolved.
Suggestions cannot be applied from pending reviews.
Suggestions cannot be applied on multi-line comments.
Suggestions cannot be applied while the pull request is queued to merge.
Suggestion cannot be applied right now. Please check back later.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think `GPTQMarlinMoEMethod` looks good. We'd want to use `FusedMoE` as opposed to introducing `GPTQFusedMoE`; we can work on making this happen tomorrow.
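The reviewer's suggestion, reusing one generic `FusedMoE` layer and plugging the quantization scheme in as a method object rather than adding a `GPTQFusedMoE` subclass, can be sketched as below. The class names mirror the PR discussion, but every interface here (`create_weights`, `apply`, the constructor arguments) is a hypothetical simplification invented for illustration; vLLM's real classes differ.

```python
class UnquantizedFusedMoEMethod:
    """Default method: plain floating-point expert weights."""
    def create_weights(self, num_experts, hidden_size):
        return {"w13": [[0.0] * hidden_size for _ in range(num_experts)]}

    def apply(self, layer, hidden_states):
        return hidden_states  # no-op stand-in for the fused MoE kernel


class GPTQMarlinMoEMethod:
    """GPTQ/Marlin method: packed quantized weights plus scales."""
    def create_weights(self, num_experts, hidden_size):
        # GPTQ stores packed low-bit weights and per-group scales;
        # placeholders here since packing is out of scope for the sketch
        return {"qweight": ..., "scales": ...}

    def apply(self, layer, hidden_states):
        return hidden_states  # stand-in for a fused_marlin_moe call


class FusedMoE:
    """One layer class for all MoE variants; quantization is pluggable
    via quant_method instead of via subclassing."""
    def __init__(self, num_experts, hidden_size, quant_method=None):
        self.quant_method = quant_method or UnquantizedFusedMoEMethod()
        self.weights = self.quant_method.create_weights(num_experts,
                                                        hidden_size)

    def forward(self, hidden_states):
        return self.quant_method.apply(self, hidden_states)
```

With this shape, model code constructs `FusedMoE(...)` once, and a quantization config decides which method object to hand it, which is the same pattern vLLM's linear layers use for their quant methods.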