GPTQ Fused MoE class #8

Closed
wants to merge 82 commits into from
82 commits
db1f07e
GPTQ Fused MoE class
ElizaWszola Sep 3, 2024
6753789
Add GPTQMarlinMoEMethod to gptq_marlin.py
ElizaWszola Sep 3, 2024
7df4014
Use FusedMoE layer for all loads
ElizaWszola Sep 4, 2024
c3dc249
Merge branch 'marlin-moe-8-bit' into gptq_fused_moe
ElizaWszola Sep 4, 2024
2fa03e5
Make sure that GPTQ runs through mixtral.py
ElizaWszola Sep 4, 2024
8a504d9
enforce float16A/scales for marlin moe
ElizaWszola Sep 4, 2024
689ea0a
Merge branch 'marlin-moe-8-bit' into gptq_fused_moe
ElizaWszola Sep 4, 2024
ec47561
cleanup
ElizaWszola Sep 4, 2024
2ad2e56
[MISC] Consolidate FP8 kv-cache tests (#8131)
comaniac Sep 4, 2024
d1dec64
[CI/Build][ROCm] Enabling LoRA tests on ROCm (#7369)
alexeykondrat Sep 4, 2024
561d6f8
[CI] Change test input in Gemma LoRA test (#8163)
WoosukKwon Sep 4, 2024
e02ce49
[Feature] OpenAI-Compatible Tools API + Streaming for Hermes & Mistra…
K-Mistele Sep 4, 2024
77d9e51
[MISC] Replace input token throughput with total token throughput (#8…
comaniac Sep 4, 2024
008cf88
[Neuron] Adding support for adding/ overriding neuron configuration a…
hbikki Sep 4, 2024
32e7db2
Bump version to v0.6.0 (#8166)
simon-mo Sep 4, 2024
e01c2be
[Doc] [Misc] Create CODE_OF_CONDUCT.md (#8161)
mmcelaney Sep 4, 2024
1afc931
[bugfix] >1.43 constraint for openai (#8169)
SolitaryThinker Sep 5, 2024
4624d98
[Misc] Clean up RoPE forward_native (#8076)
WoosukKwon Sep 5, 2024
ba262c4
[ci] Mark LoRA test as soft-fail (#8160)
khluu Sep 5, 2024
e39ebf5
[Core/Bugfix] Add query dtype as per FlashInfer API requirements. (#8…
elfiegg Sep 5, 2024
288a938
[Doc] Indicate more information about supported modalities (#8181)
DarkLight1337 Sep 5, 2024
8685ba1
Inclusion of InternVLChatModel In PP_SUPPORTED_MODELS(Pipeline Parall…
Manikandan-Thangaraj-ZS0321 Sep 5, 2024
9da25a8
[MODEL] Qwen Multimodal Support (Qwen-VL / Qwen-VL-Chat) (#8029)
alex-jw-brooks Sep 5, 2024
2ee4528
Move verify_marlin_supported to GPTQMarlinLinearMethod (#8165)
mgoin Sep 5, 2024
2febcf2
[Documentation][Spec Decode] Add documentation about lossless guarant…
sroy745 Sep 5, 2024
9f97b3b
update/fix weight loading to support tp
dsikka Sep 5, 2024
db3bf7c
[Core] Support load and unload LoRA in api server (#6566)
Jeffwan Sep 6, 2024
baa5467
[BugFix] Fix Granite model configuration (#8216)
njhill Sep 6, 2024
b841ac4
remove 8-bit stuff for now
ElizaWszola Sep 6, 2024
a245032
Merge branch 'gptq_fused_moe' of https://github.com/neuralmagic/vllm …
ElizaWszola Sep 6, 2024
9d8a80c
fix; update large model testing cases
dsikka Sep 6, 2024
e5cab71
[Frontend] Add --logprobs argument to `benchmark_serving.py` (#8191)
afeldman-nm Sep 6, 2024
315e22f
add hack to support unfused mixtral pathway for int8
dsikka Sep 6, 2024
de80783
[Misc] Use ray[adag] dependency instead of cuda (#7938)
ruisearch42 Sep 6, 2024
565cc43
fix install for tpu test
dsikka Sep 6, 2024
1447c97
[CI/Build] Increasing timeout for multiproc worker tests (#8203)
alexeykondrat Sep 6, 2024
9db52ea
[Kernel] [Triton] Memory optimization for awq_gemm and awq_dequantize…
rasmith Sep 6, 2024
23f3222
[Misc] Remove `SqueezeLLM` (#8220)
dsikka Sep 6, 2024
29f49cd
[Model] Allow loading from original Mistral format (#8168)
patrickvonplaten Sep 6, 2024
12dd715
[misc] [doc] [frontend] LLM torch profiler support (#7943)
SolitaryThinker Sep 7, 2024
41e95c5
[Bugfix] Fix Hermes tool call chat template bug (#8256)
K-Mistele Sep 7, 2024
2f707fc
[Model] Multi-input support for LLaVA (#8238)
DarkLight1337 Sep 7, 2024
795b662
Enable Random Prefix Caching in Serving Profiling Tool (benchmark_ser…
wschin Sep 7, 2024
ce2702a
[tpu][misc] fix typo (#8260)
youkaichao Sep 7, 2024
9f68e00
[Bugfix] Fix broken OpenAI tensorizer test (#8258)
DarkLight1337 Sep 7, 2024
e807125
[Model][VLM] Support multi-images inputs for InternVL2 models (#8201)
Isotr0py Sep 7, 2024
8886423
Move float16 typecast hack to gptq marlin moe method
ElizaWszola Sep 7, 2024
ab27497
Move output type conversion to gptq method as well
ElizaWszola Sep 7, 2024
36bf815
[Model][VLM] Decouple weight loading logic for `Paligemma` (#8269)
Isotr0py Sep 7, 2024
b962ee1
ppc64le: Dockerfile fixed, and a script for buildkite (#8026)
sumitd2 Sep 7, 2024
cfe712b
[CI/Build] Use python 3.12 in cuda image (#8133)
joerunde Sep 7, 2024
4ef41b8
[Bugfix] Fix async postprocessor in case of preemption (#8267)
alexm-redhat Sep 8, 2024
847e860
Enable 8-bit weights in Fused Marlin MoE
ElizaWszola Aug 30, 2024
430a9cb
fix rocm
ElizaWszola Aug 30, 2024
48047aa
bad paste
ElizaWszola Aug 30, 2024
bfc4fae
add test case; fix imports for tests
dsikka Aug 30, 2024
c5a2f62
fix to adapt custom_routin_function
dsikka Aug 30, 2024
2b308c4
Use select_experts to compute top_k tensors in fused moe
ElizaWszola Sep 2, 2024
71256d4
bring back fused_moe_marlin -> fused_marlin_moe
ElizaWszola Sep 3, 2024
7aa844c
GPTQ Fused MoE class
ElizaWszola Sep 3, 2024
0f7bec3
Add GPTQMarlinMoEMethod to gptq_marlin.py
ElizaWszola Sep 3, 2024
cb0001e
Use FusedMoE layer for all loads
ElizaWszola Sep 4, 2024
33090a3
Make sure that GPTQ runs through mixtral.py
ElizaWszola Sep 4, 2024
d479837
enforce float16A/scales for marlin moe
ElizaWszola Sep 4, 2024
8baaec6
remove large model
dsikka Sep 4, 2024
8fbc181
Cleanup, comments
ElizaWszola Sep 4, 2024
839915f
cleanup
ElizaWszola Sep 4, 2024
a5bc626
remove 8-bit stuff for now
ElizaWszola Sep 6, 2024
c573fa1
update/fix weight loading to support tp
dsikka Sep 5, 2024
a991d82
fix; update large model testing cases
dsikka Sep 6, 2024
d57804d
add hack to support unfused mixtral pathway for int8
dsikka Sep 6, 2024
96fa486
fix install for tpu test
dsikka Sep 6, 2024
1faab90
Move float16 typecast hack to gptq marlin moe method
ElizaWszola Sep 7, 2024
970e06a
Move output type conversion to gptq method as well
ElizaWszola Sep 7, 2024
fd0a4f2
typo fix; fix comment
dsikka Sep 9, 2024
3ac9273
Merge branch 'gptq_fused_moe' of https://github.com/neuralmagic/vllm …
ElizaWszola Sep 9, 2024
d51a2f4
Clarify comment, change how we process bias
ElizaWszola Sep 9, 2024
08287ef
[Bugfix] Streamed tool calls now more strictly follow OpenAI's format…
K-Mistele Sep 9, 2024
58fcc85
[Frontend] Add progress reporting to run_batch.py (#8060)
alugowski Sep 9, 2024
f9b4a2d
[Bugfix] Correct adapter usage for cohere and jamba (#8292)
vladislavkruglikov Sep 9, 2024
c7cb5c3
[Misc] GPTQ Activation Ordering (#8135)
kylesayrs Sep 9, 2024
12f05c5
Merge branch 'main' into gptq_fused_moe
dsikka Sep 9, 2024
[MISC] Replace input token throughput with total token throughput (vllm-project#8164)

Co-authored-by: Michael Goin <michael@neuralmagic.com>
comaniac and mgoin authored Sep 4, 2024
commit 77d9e514a2284d5d0bd34b1518b9483ae7d8a05a
10 changes: 5 additions & 5 deletions benchmarks/benchmark_serving.py
@@ -56,8 +56,8 @@ class BenchmarkMetrics:
     total_input: int
     total_output: int
     request_throughput: float
-    input_throughput: float
     output_throughput: float
+    total_token_throughput: float
     mean_ttft_ms: float
     median_ttft_ms: float
     std_ttft_ms: float
@@ -283,8 +283,8 @@ def calculate_metrics(
         total_input=total_input,
         total_output=sum(actual_output_lens),
         request_throughput=completed / dur_s,
-        input_throughput=total_input / dur_s,
         output_throughput=sum(actual_output_lens) / dur_s,
+        total_token_throughput=(total_input + sum(actual_output_lens)) / dur_s,
         mean_ttft_ms=np.mean(ttfts or 0) *
         1000,  # ttfts is empty if streaming is not supported by backend
         std_ttft_ms=np.std(ttfts or 0) * 1000,
@@ -426,19 +426,19 @@ async def benchmark(
                                     metrics.total_output))
     print("{:<40} {:<10.2f}".format("Request throughput (req/s):",
                                     metrics.request_throughput))
-    print("{:<40} {:<10.2f}".format("Input token throughput (tok/s):",
-                                    metrics.input_throughput))
     print("{:<40} {:<10.2f}".format("Output token throughput (tok/s):",
                                     metrics.output_throughput))
+    print("{:<40} {:<10.2f}".format("Total Token throughput (tok/s):",
+                                    metrics.total_token_throughput))

     result = {
         "duration": benchmark_duration,
         "completed": metrics.completed,
         "total_input_tokens": metrics.total_input,
         "total_output_tokens": metrics.total_output,
         "request_throughput": metrics.request_throughput,
-        "input_throughput": metrics.input_throughput,
         "output_throughput": metrics.output_throughput,
+        "total_token_throughput": metrics.total_token_throughput,
         "input_lens": [output.prompt_len for output in outputs],
         "output_lens": actual_output_lens,
         "ttfts": [output.ttft for output in outputs],
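For reviewers who want to sanity-check the new metric, here is a minimal standalone sketch of the arithmetic this commit introduces. It is not code from the PR: `token_throughputs` and `input_throughput_from_result` are hypothetical helper names. The second helper assumes the `"duration"` field of the result dict holds the same wall-clock value as `dur_s`, which this diff does not confirm.

```python
from typing import Dict, List


def token_throughputs(total_input: int, output_lens: List[int],
                      dur_s: float) -> Dict[str, float]:
    """Compute output and total token throughput in tokens per second.

    Mirrors the expressions in calculate_metrics():
      output_throughput      = sum(output_lens) / dur_s
      total_token_throughput = (total_input + sum(output_lens)) / dur_s
    """
    if dur_s <= 0:
        raise ValueError("dur_s must be positive")
    total_output = sum(output_lens)
    return {
        "output_throughput": total_output / dur_s,
        "total_token_throughput": (total_input + total_output) / dur_s,
    }


def input_throughput_from_result(result: dict) -> float:
    """Recover the removed input-token metric from fields still reported.

    Assumption: result["duration"] is the same duration used for the other
    throughput numbers.
    """
    return result["total_input_tokens"] / result["duration"]


metrics = token_throughputs(total_input=1000, output_lens=[200, 300], dur_s=10.0)
legacy_input = input_throughput_from_result(
    {"total_input_tokens": 1000, "duration": 10.0})
```

Under that assumption, the dropped `input_throughput` value is still derivable by downstream scripts, either from `total_input_tokens / duration` or as `total_token_throughput - output_throughput`.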