GPTQ Fused MoE class #8

Closed

ElizaWszola wants to merge 82 commits into marlin-moe-8-bit from gptq_fused_moe

+6,722 -1,916

Commits on Sep 3, 2024

Commits on Sep 4, 2024

Use FusedMoE layer for all loads
ElizaWszola
committed
Merge branch 'marlin-moe-8-bit' into gptq_fused_moe
ElizaWszola
committed
Make sure that GPTQ runs through mixtral.py
ElizaWszola
committed
enforce float16A/scales for marlin moe
ElizaWszola
committed
Merge branch 'marlin-moe-8-bit' into gptq_fused_moe
ElizaWszola
committed
cleanup
ElizaWszola
committed
[MISC] Consolidate FP8 kv-cache tests (vllm-project#8131)
comaniac
authored
[CI/Build][ROCm] Enabling LoRA tests on ROCm (vllm-project#7369)

alexeykondrat
and
simon-mo
authored
[CI] Change test input in Gemma LoRA test (vllm-project#8163)
WoosukKwon
authored
[Feature] OpenAI-Compatible Tools API + Streaming for Hermes & Mistral models (vllm-project#5649)

K-Mistele
and
constellate
authored
[MISC] Replace input token throughput with total token throughput (vllm-project#8164)

comaniac
and
mgoin
authored
[Neuron] Adding support for adding/ overriding neuron configuration a… (vllm-project#8062)

hbikki
and
Harsha Bikki
authored
Bump version to v0.6.0 (vllm-project#8166)
simon-mo
authored
[Doc] [Misc] Create CODE_OF_CONDUCT.md (vllm-project#8161)
mmcelaney
authored

Commits on Sep 5, 2024

Commits on Sep 6, 2024

Commits on Sep 7, 2024

Commits on Sep 8, 2024

[Bugfix] Fix async postprocessor in case of preemption (vllm-project#8267)
alexm-redhat
authored

Commits on Sep 9, 2024

Enable 8-bit weights in Fused Marlin MoE

ElizaWszola
authored and
dsikka
committed
fix rocm

ElizaWszola
authored and
dsikka
committed
bad paste

ElizaWszola
authored and
dsikka
committed
add test case; fix imports for tests
dsikka
committed
fix to adapt custom_routing_function
dsikka
committed
Use select_experts to compute top_k tensors in fused moe

ElizaWszola
authored and
dsikka
committed
bring back fused_moe_marlin -> fused_marlin_moe

ElizaWszola
authored and
dsikka
committed
GPTQ Fused MoE class

ElizaWszola
authored and
dsikka
committed
Add GPTQMarlinMoEMethod to gptq_marlin.py

ElizaWszola
authored and
dsikka
committed
Use FusedMoE layer for all loads

ElizaWszola
authored and
dsikka
committed
Make sure that GPTQ runs through mixtral.py

ElizaWszola
authored and
dsikka
committed
enforce float16A/scales for marlin moe

ElizaWszola
authored and
dsikka
committed
remove large model
dsikka
committed
Cleanup, comments

ElizaWszola
authored and
dsikka
committed
cleanup

ElizaWszola
authored and
dsikka
committed
remove 8-bit stuff for now

ElizaWszola
authored and
dsikka
committed
update/fix weight loading to support tp
dsikka
committed
fix; update large model testing cases
dsikka
committed
add hack to support unfused mixtral pathway for int8
dsikka
committed
fix install for tpu test
dsikka
committed
Move float16 typecast hack to gptq marlin moe method

ElizaWszola
authored and
dsikka
committed
Move output type conversion to gptq method as well

ElizaWszola
authored and
dsikka
committed
typo fix; fix comment
dsikka
committed
Merge branch 'gptq_fused_moe' of https://github.com/neuralmagic/vllm into gptq_fused_moe
ElizaWszola
committed
Clarify comment, change how we process bias
ElizaWszola
committed
[Bugfix] Streamed tool calls now more strictly follow OpenAI's format; ensures Vercel AI SDK compatibility (vllm-project#8272)
K-Mistele
authored
[Frontend] Add progress reporting to run_batch.py (vllm-project#8060)

alugowski
and
Adam Lugowski
authored
[Bugfix] Correct adapter usage for cohere and jamba (vllm-project#8292)
vladislavkruglikov
authored
[Misc] GPTQ Activation Ordering (vllm-project#8135)
kylesayrs
authored
Merge branch 'main' into gptq_fused_moe
dsikka
authored