Logprobs #45

Status: Open
Wants to merge 1,231 commits into base: afeldman-nm/logprobs
Commits (1,231 total; changes shown from all commits)
e7cfc4e
[Interleaved ATTN] Support for Mistral-8B (#10591)
patrickvonplaten Nov 30, 2024
7e4bbda
[doc] format fix (#10789)
wangxiyuan Nov 30, 2024
1337071
[Model] Replace embedding models with pooling adapter (#10769)
DarkLight1337 Dec 1, 2024
f877a7d
[Misc] Improve type annotations for `support_torch_compile` (#10763)
DarkLight1337 Dec 1, 2024
d2f058e
[Misc] Rename embedding classes to pooling (#10801)
DarkLight1337 Dec 1, 2024
169a0ff
[doc] add warning about comparing hf and vllm outputs (#10805)
youkaichao Dec 1, 2024
c11f172
[Misc] Adding `MMMU-Pro` vision dataset to serving benchmark (#10804)
ywang96 Dec 1, 2024
f446a78
Merge branch 'main' into v1_logprobs_merge
afeldman-nm Dec 2, 2024
fda0fcb
removed fast tests from pipeline
afeldman-nm Dec 2, 2024
0590ec3
[Core] Implement disagg prefill by StatelessProcessGroup (#10502)
KuntaiDu Dec 2, 2024
b18c9bb
[Model] Add BNB support to Llava and Pixtral-HF (#10795)
Isotr0py Dec 2, 2024
0bcf5f4
Merge branch 'main' into v1_logprobs_merge
afeldman-nm Dec 2, 2024
b795477
[core] Avoid metrics log noise when idle - include speculative decodi…
cduk Dec 2, 2024
073a4bd
[Kernel] Use `out` arg in flash_attn_varlen_func (#10811)
WoosukKwon Dec 2, 2024
e25810a
Fill TorchSDPAAttentionMetadata seq_lens_field for prefill (#10799)
maxdebayser Dec 2, 2024
63a1641
[misc] remove xverse modeling file (#10814)
youkaichao Dec 2, 2024
995a148
[doc]Update config docstring (#10732)
wangxiyuan Dec 2, 2024
ef31eab
[Model]: add some tests for aria model (#10770)
xffxff Dec 2, 2024
ee1b910
Merge branch 'main' into v1_logprobs_merge
afeldman-nm Dec 2, 2024
1fa0b71
Update vllm/outputs.py
afeldman-nm Dec 2, 2024
6ba743f
Merge branch 'afeldman-nm/v1_logprobs' of https://github.com/neuralma…
afeldman-nm Dec 2, 2024
bc1c004
small fixes
afeldman-nm Dec 2, 2024
e95f275
[CI/Build] Update `mistral_common` version for tests and docs (#10825)
DarkLight1337 Dec 2, 2024
bec886b
moved output processing commands into processor
afeldman-nm Dec 2, 2024
fef9f30
Merge branch 'main' into v1_logprobs_merge
afeldman-nm Dec 2, 2024
a4c4daf
[misc] use out argument for flash attention (#10822)
youkaichao Dec 2, 2024
554f431
added explanatory comment to EngineCore.update_from_output()
afeldman-nm Dec 2, 2024
5022307
Merge branch 'main' into v1_logprobs_merge
afeldman-nm Dec 2, 2024
5dea1d5
[misc] move functions to config.py (#10624)
youkaichao Nov 25, 2024
930f2cc
[Model] Support `is_causal` HF config field for Qwen2 model (#10621)
DarkLight1337 Nov 25, 2024
060ca2f
Super tiny little typo fix (#10633)
fzyzcjy Nov 25, 2024
084199b
[Bug]: Authorization ignored when root_path is set (#10606)
chaunceyjiang Nov 25, 2024
ad02c99
[Bugfix] Fix chunked prefill with model dtype float32 on Turing Devic…
wallashss Nov 25, 2024
c76bf01
[Docs] Add Snowflake Slides (#10641)
simon-mo Nov 25, 2024
5e36a52
[Model]: Add support for Aria model (#10514)
xffxff Nov 25, 2024
80a1dd4
[Model] Enable optional prefix when loading embedding models (#10639)
DarkLight1337 Nov 25, 2024
84e74aa
[Doc] Fix typos in docs (#10636)
DarkLight1337 Nov 25, 2024
0b34acf
[Model] Add OLMo November 2024 model (#10503)
2015aroras Nov 25, 2024
61dc22b
[misc] do not read HOST_IP (#10644)
youkaichao Nov 26, 2024
ea0c690
[bugfix] fix aria model and add torch.compile (#10645)
youkaichao Nov 26, 2024
e8d3cc3
[Feature] vLLM ARM Enablement for AARCH64 CPUs (#9228)
sanketkaleoss Nov 26, 2024
ee2c7f5
[v1] EngineArgs for better config handling for v1 (#10382)
rickyyx Nov 26, 2024
0bd61fb
custom allreduce + torch.compile (#10121)
SageMoore Nov 26, 2024
dc8a363
[Misc] Remove outdated init protocols (#10655)
DarkLight1337 Nov 26, 2024
1f74fe9
[ci] add vllm_test_utils (#10659)
youkaichao Nov 26, 2024
53f9d49
[V1] Enable profile for LLMEngine (#10665)
jikunshang Nov 26, 2024
e82fe47
Squash commit of all changes from v1_logprobs
abf149 Nov 26, 2024
e395551
fixed issue with sample-logprob-only batches
abf149 Nov 26, 2024
ae66ae4
refactored logprobs tensor pythonization in scheduler
abf149 Nov 26, 2024
17d858d
added fast logprobs test
abf149 Nov 26, 2024
f5c0afd
wip refactor
abf149 Nov 26, 2024
f7833f3
format
abf149 Nov 26, 2024
704d635
[Bugfix] Fix for Spec model TP + Chunked Prefill (#10232)
andoorve Nov 26, 2024
cec0443
refactor
abf149 Nov 26, 2024
7315781
attempted sample_metadata fix; sample logprobs work, prompt logprobs …
abf149 Nov 26, 2024
2cee231
cleaned up sampling metadata
abf149 Nov 26, 2024
cc1e43a
[Hardware][NVIDIA] Add non-NVML CUDA mode for Jetson (#9735)
conroy-cheers Nov 26, 2024
07f9e89
[Bugfix] Fix using `-O[0,3]` with LLM entrypoint (#10677)
mgoin Nov 26, 2024
27e4923
small change
abf149 Nov 26, 2024
1ccef6c
partially re-enabled detokenize cases in test
abf149 Nov 26, 2024
a293451
deferring support for detokenization feature to subsequent SamplingPa…
abf149 Nov 26, 2024
86d0259
[Bugfix] Check bnb_4bit_quant_storage for bitsandbytes (#10642)
mgoin Nov 26, 2024
1f6d7d2
[V1] Refactor model executable interface for multimodal models (#10570)
ywang96 Nov 26, 2024
95dd578
tweak tolerance; fast check
afeldman-nm Nov 29, 2024
dd8ea8b
Remove hard-dependencies of Speculative decode to CUDA workers (#10587)
xuechendi Nov 27, 2024
d414464
[V1] Update interface for idefics3 (#10680)
ywang96 Nov 27, 2024
0f196ac
[Bugfix][SpecDecode] apply sampling parameters to target probabilitie…
jeongin601 Nov 27, 2024
429d17e
[bugfix] fix the default value of llm_int8_threshold in BitsAndBytesC…
yansh97 Nov 27, 2024
89c4f78
[Hardware][Gaudi]add get_name method for HPUAttentionBackend (#10667)
jikunshang Nov 27, 2024
a809ee1
[Misc]Further reduce BNB static variable (#10597)
jeejeelee Nov 27, 2024
57485ba
[Kernel] Remove if-else with identical branches in marlin 2:4 (#10687)
tlrmchlsmth Nov 27, 2024
e255262
[Model] Support telechat2 (#10311)
shunxing12345 Nov 27, 2024
fcc7172
[Bugfix][Hardware][CPU] Fix intel-omp version to avoid segfault (#10700)
bigPYJ1151 Nov 27, 2024
9cc018a
[V1] Update interface for mistral-format Pixtral (#10703)
ywang96 Nov 27, 2024
d65fc83
[ci] fix slow tests (#10698)
youkaichao Nov 27, 2024
046dfc4
[torch.compile] fix shape specialization (#10722)
youkaichao Nov 27, 2024
9bf5c8d
[Bugfix] Fix GGUF inference with FP16 unquantized checkpoint (#10675)
Isotr0py Nov 27, 2024
4e53851
[Bugfix][Mamba] Fix Multistep on Mamba-like models (#10705)
mzusman Nov 27, 2024
8239c6f
[Bugfix] Ignore `lm_head` when loading embedding models (#10719)
DarkLight1337 Nov 27, 2024
5a3a0eb
[Frontend] don't block event loop in tokenization (preprocess) in Ope…
tomeras91 Nov 27, 2024
b22e27c
[misc] upgrade filelock version (#10731)
youkaichao Nov 28, 2024
b5864e2
[Model] support bitsandbytes quantization with minicpm3 model (#10682)
zixuanzhang226 Nov 28, 2024
b9cabc9
[Doc] Update model in arch_overview.rst to match comment (#10701)
spacewander Nov 28, 2024
d61d661
[Bug][CLI] Allow users to disable prefix caching explicitly (#10724)
rickyyx Nov 28, 2024
39f4494
[V1] Do not allocate beyond the max_model_len (#10730)
WoosukKwon Nov 28, 2024
dcdf2f3
[Kernel] Update vllm-flash-attn version (#10736)
WoosukKwon Nov 28, 2024
ea6ed6b
[TPU] Update requirements-tpu (#10726)
richardsliu Nov 28, 2024
ac0b495
[Model] Added GLM-4 series hf format model support vllm==0.6.4 (#10561)
sixsixcoder Nov 28, 2024
1362dac
[Kernel] Update vllm-flash-attn version to reduce CPU overheads (#10742)
WoosukKwon Nov 28, 2024
bc6637c
[V1] Optimize the CPU overheads in FlashAttention custom op (#10733)
WoosukKwon Nov 28, 2024
3733796
[Model] Add Internlm2 LoRA support (#5064)
Isotr0py Nov 28, 2024
170a30c
[Model] Clean up MiniCPMV (#10751)
DarkLight1337 Nov 29, 2024
8d83244
[Misc] typo find in sampling_metadata.py (#10740)
noooop Nov 29, 2024
d8499c0
[Bugfix] Fix Idefics3 bug (#10778)
jeejeelee Nov 29, 2024
3c8ced2
[platform] Add verify_quantization in platform. (#10757)
wangxiyuan Nov 29, 2024
5146352
[Bugfix] Fix OpenVino/Neuron `driver_worker` init (#10779)
NickLucche Nov 30, 2024
d95da87
[Model] Refactor Molmo weights loading to use AutoWeightsLoader (#10771)
Isotr0py Nov 30, 2024
7831672
[Interleaved ATTN] Support for Mistral-8B (#10591)
patrickvonplaten Nov 30, 2024
a877540
[doc] format fix (#10789)
wangxiyuan Nov 30, 2024
cbf1489
[Model] Replace embedding models with pooling adapter (#10769)
DarkLight1337 Dec 1, 2024
db1ca39
[Misc] Improve type annotations for `support_torch_compile` (#10763)
DarkLight1337 Dec 1, 2024
d198e8f
[Misc] Rename embedding classes to pooling (#10801)
DarkLight1337 Dec 1, 2024
cf04e11
[doc] add warning about comparing hf and vllm outputs (#10805)
youkaichao Dec 1, 2024
b58062b
[Misc] Adding `MMMU-Pro` vision dataset to serving benchmark (#10804)
ywang96 Dec 1, 2024
bcdb5b8
removed fast tests from pipeline
afeldman-nm Dec 2, 2024
88f7f57
[Core] Implement disagg prefill by StatelessProcessGroup (#10502)
KuntaiDu Dec 2, 2024
02eb179
[Model] Add BNB support to Llava and Pixtral-HF (#10795)
Isotr0py Dec 2, 2024
8d5035d
[core] Avoid metrics log noise when idle - include speculative decodi…
cduk Dec 2, 2024
ab21a28
[Kernel] Use `out` arg in flash_attn_varlen_func (#10811)
WoosukKwon Dec 2, 2024
6643bf2
Fill TorchSDPAAttentionMetadata seq_lens_field for prefill (#10799)
maxdebayser Dec 2, 2024
9464931
[misc] remove xverse modeling file (#10814)
youkaichao Dec 2, 2024
777bb76
[doc]Update config docstring (#10732)
wangxiyuan Dec 2, 2024
221ee79
[Model]: add some tests for aria model (#10770)
xffxff Dec 2, 2024
39cd324
Update vllm/outputs.py
afeldman-nm Dec 2, 2024
5757476
small fixes
afeldman-nm Dec 2, 2024
3d1373c
moved output processing commands into processor
afeldman-nm Dec 2, 2024
05f39a9
[CI/Build] Update `mistral_common` version for tests and docs (#10825)
DarkLight1337 Dec 2, 2024
74274c2
added explanatory comment to EngineCore.update_from_output()
afeldman-nm Dec 2, 2024
c9a7b3f
[misc] use out argument for flash attention (#10822)
youkaichao Dec 2, 2024
7ea421d
Merge branch 'afeldman-nm/v1_logprobs' of https://github.com/neuralma…
afeldman-nm Dec 2, 2024
f22facd
constructing dummy logprobs
afeldman-nm Dec 2, 2024
b16dd79
dummy logprobs with decodes
afeldman-nm Dec 2, 2024
0054ece
passing some detokenizer tests
afeldman-nm Dec 2, 2024
59853d5
fixing error during debug
afeldman-nm Dec 2, 2024
193e60c
existing detokenizer test checks are unbroken; need to add logprobs c…
afeldman-nm Dec 2, 2024
b45f0d7
[Misc][LoRA] Move the implementation of lora bias to punica.py (#10829)
jeejeelee Dec 2, 2024
519cc6c
[Misc][XPU] Avoid torch compile for XPU platform (#10747)
yma11 Dec 2, 2024
9b14d97
Fix openvino on GPU (#10793)
janimo Dec 2, 2024
4c05edb
[Model] Add TP and BNB quantization support to LlavaMultiModalProject…
Isotr0py Dec 2, 2024
a078f89
Merge branch 'main' into v1_logprobs_merge
afeldman-nm Dec 2, 2024
4433195
[Bugfix] Prevent benchmark_throughput.py from using duplicated random…
mgoin Dec 3, 2024
d746268
[Model] support bitsandbytes quantization with minicpm model (#10842)
zixuanzhang226 Dec 3, 2024
a4cf256
[Bugfix] Fix QKVParallelLinearWithShardedLora bias bug (#10844)
jeejeelee Dec 3, 2024
21fe7b4
[core][distributed] add pynccl broadcast (#10843)
youkaichao Dec 3, 2024
dc5ce86
[torch.compile] remove compilation_context and simplify code (#10838)
youkaichao Dec 3, 2024
ef51831
[Doc] Add github links for source code references (#10672)
russellb Dec 3, 2024
3257d44
[Misc] Remove deprecated names (#10817)
DarkLight1337 Dec 3, 2024
9323a31
[Core][Performance] Add XGrammar support for guided decoding and set …
aarnphm Dec 3, 2024
f6084f6
[Speculative Decoding] Move indices to device before filtering output…
zhengy001 Dec 3, 2024
3bc94ca
[V1] VLM - Run the mm_mapper preprocessor in the frontend process (#1…
alexm-redhat Dec 3, 2024
15f9825
merge
afeldman-nm Dec 3, 2024
2f2cdc7
[MISC][XPU] quick fix for XPU CI (#10859)
yma11 Dec 3, 2024
7090c27
[Bugfix] Only require XGrammar on x86 (#10865)
mgoin Dec 3, 2024
7c32b68
[Frontend] correctly record prefill and decode time metrics (#10853)
tomeras91 Dec 3, 2024
a061fe6
[Build][Bugfix] Using the correct type hint (#10866)
gshtras Dec 3, 2024
26b165e
Merge branch 'main' into v1_logprobs_merge
afeldman-nm Dec 4, 2024
381ac93
[Benchmark] Benchmark structured output with datasets (#10557)
xuechendi Dec 4, 2024
d2bd88b
[CI/Build] Replace mean with torch.all in test_pynccl.py (#10876)
tlrmchlsmth Dec 4, 2024
30ea722
Merge branch 'main' into v1_logprobs_merge
afeldman-nm Dec 4, 2024
b5b647b
Drop ROCm load format check (#10767)
wangxiyuan Dec 4, 2024
fa2dea6
[ci/build] Change queue name for Release jobs (#10875)
khluu Dec 4, 2024
c9ca4fc
[ci/build] Job to build and push release image (#10877)
khluu Dec 4, 2024
8db957e
[bugfix] fixed parameter “n” when set parameter “bestof” > 1 (#10854)
o2363286 Dec 4, 2024
c92acb9
[ci/build] Update vLLM postmerge ECR repo (#10887)
khluu Dec 4, 2024
4fefd62
Merge branch 'main' into v1_logprobs_merge
afeldman-nm Dec 4, 2024
603f2b5
model runner returns logprobs as np arrays
afeldman-nm Dec 4, 2024
ac602d8
new request types
afeldman-nm Dec 4, 2024
2a9ef8c
first pass at only using numpy in engine core
afeldman-nm Dec 4, 2024
2fe9147
tested removal of pythonization from engine core
afeldman-nm Dec 4, 2024
01d079f
[LoRA] Change lora_tokenizers capacity (#10796)
xyang16 Dec 4, 2024
10398b4
[Model] Consolidate ViTs attention implementation without mask (#10893)
Isotr0py Dec 4, 2024
1283010
Merge branch 'main' into v1_logprobs_merge
afeldman-nm Dec 4, 2024
fee1e8e
Merge branch 'v1_logprobs' into move_pyth
afeldman-nm Dec 4, 2024
a46a8e5
wip detokenizer updates
afeldman-nm Dec 4, 2024
82eb5ea
Benchmark serving structured output (#10880)
xuechendi Dec 4, 2024
e4c34c2
[CI/Build] improve python-only dev setup (#9621)
dtrifiro Dec 4, 2024
2a56e12
[V1] Fix when max_model_len is not divisible by block_size (#10903)
WoosukKwon Dec 5, 2024
7883c2b
[benchmark] Make H100 benchmark optional (#10908)
khluu Dec 5, 2024
8d370e9
[Bugfix] Fallback to outlines for complex json schemas (#10899)
mgoin Dec 5, 2024
aa39a8e
[Doc] Create a new "Usage" section (#10827)
DarkLight1337 Dec 5, 2024
1f958a7
[Bugfix] Fix BNB loader target_modules (#10720)
jeejeelee Dec 5, 2024
0c04576
wip
afeldman-nm Dec 5, 2024
0f04d6e
wip
afeldman-nm Dec 5, 2024
39c89e7
[Misc] Update llama 3.2 template to support system prompt with images…
tjohnson31415 Dec 5, 2024
c6831ca
first pass at pythonization moved out of engine
afeldman-nm Dec 5, 2024
238bc46
Merge branch 'main' into v1_logprobs_merge
afeldman-nm Dec 5, 2024
86b18aa
Merge branch 'v1_logprobs' into move_pyth
afeldman-nm Dec 5, 2024
ae7e10c
incremental/non-incremental detokenized text comparison
afeldman-nm Dec 5, 2024
3cffca3
implemented the sample logprobs N+1 scenario in the front end
afeldman-nm Dec 5, 2024
73e4c12
fixed prompt logprob count bug
afeldman-nm Dec 5, 2024
5b49d36
passing one test!
afeldman-nm Dec 5, 2024
571da8f
[Misc][LoRA] Clean up the function interface of Punica (#10917)
jeejeelee Dec 5, 2024
998eeaf
[CI/Build] Bump test transformers version (#10106)
Isotr0py Dec 5, 2024
a430652
[Misc][Gaudi] Avoid torch.compile and enable lazy collectives (#10897)
kzawora-intel Dec 5, 2024
9743d64
[ci][build] add tests for python only compilation (#10915)
youkaichao Dec 5, 2024
66fe6bc
Merge branch 'main' into v1_logprobs
afeldman-nm Dec 5, 2024
0cf2c79
successfully failing cumulative logprobs test
afeldman-nm Dec 5, 2024
49e0b33
cumulative logprob works
afeldman-nm Dec 5, 2024
db87eb6
[torch.compile] use size tuning for specific sizes (#10933)
youkaichao Dec 6, 2024
b031a45
[torch.compile] add logging for compilation time (#10941)
youkaichao Dec 6, 2024
222f5b0
[CI/Build] Fix broken multimodal test (#10950)
DarkLight1337 Dec 6, 2024
a1887f2
[torch.compile] fix deprecated code (#10948)
youkaichao Dec 6, 2024
8b59631
[Core] Support Lark grammars for XGrammar (#10870)
mgoin Dec 6, 2024
6558b37
Merge branch 'main' into v1_logprobs_merge
afeldman-nm Dec 6, 2024
5d36dcc
Merge branch 'v1_logprobs_merge' into v1_logprobs
afeldman-nm Dec 6, 2024
7406274
[Doc] add KubeAI to serving integrations (#10837)
samos123 Dec 6, 2024
c05cfb6
[misc] fix typo (#10960)
youkaichao Dec 6, 2024
dcdc3fa
[ci] fix broken tests (#10956)
youkaichao Dec 6, 2024
e8bd247
wip
afeldman-nm Dec 6, 2024
9f39817
progress toward detok stop token test
afeldman-nm Dec 7, 2024
867bb71
Merge branch 'main' into v1_logprobs_merge
afeldman-nm Dec 7, 2024
69d357b
[Core] Cleanup startup logging a bit (#10961)
russellb Dec 7, 2024
58bcc5a
detokenizer stop tokens test passing; some slight engine fixes for th…
afeldman-nm Dec 7, 2024
696401e
Merge branch 'main' into v1_logprobs_merge
afeldman-nm Dec 7, 2024
d8361d3
Merge branch 'main' into v1_logprobs
afeldman-nm Dec 7, 2024
85e58c9
Merge branch 'v1_logprobs_merge' into v1_logprobs
afeldman-nm Dec 7, 2024
6320868
refactored detokenizer
afeldman-nm Dec 7, 2024
54abd99
wip
afeldman-nm Dec 7, 2024
7852bb2
incremental detokenization test now also checks logprobs
afeldman-nm Dec 7, 2024
acf092d
[Bugfix] Fix test-pipeline.yaml (#10973)
jeejeelee Dec 7, 2024
8d82049
Merge branch 'main' into v1_logprobs_merge
afeldman-nm Dec 7, 2024
955fa95
[3/N] Support and implement merged input processor for LLaVA model (#…
DarkLight1337 Dec 7, 2024
f13cf9a
[Build] Fix for the Wswitch-bool clang warning (#10060)
gshtras Dec 7, 2024
b26b4cd
[Misc][LoRA] Refactor and clean MergedQKVParallelLinearWithLora imple…
Isotr0py Dec 7, 2024
f6d4329
woosuk code structure suggestion
afeldman-nm Dec 7, 2024
aa15b75
Merge branch 'main' into v1_logprobs_merge
afeldman-nm Dec 7, 2024
a4eb6bc
detokenizer tests refactor
afeldman-nm Dec 7, 2024
06185d0
refactor
afeldman-nm Dec 7, 2024
90ed53d
refactoring
afeldman-nm Dec 7, 2024
48f4671
refactor
afeldman-nm Dec 7, 2024
7121739
refactoring to make logprobs var names clearer, touched a lot of file…
afeldman-nm Dec 7, 2024
bf0e382
[Model] Composite weight loading for multimodal Qwen2 (#10944)
DarkLight1337 Dec 7, 2024
cef5ddb
Merge branch 'main' into v1_logprobs
afeldman-nm Dec 7, 2024
1c768fe
[Doc] Explicitly state that InternVL 2.5 is supported (#10978)
DarkLight1337 Dec 7, 2024
39e227c
[Model] Update multi-modal processor to support Mantis(LLaVA) model (…
DarkLight1337 Dec 7, 2024
c889d58
[Doc] Explicitly state that PP isn't compatible with speculative deco…
DarkLight1337 Dec 7, 2024
78029b3
[BugFix][Kernel]: fix illegal memory access in causal_conv1d when con…
xffxff Dec 7, 2024
bed24db
Merge branch 'main' into v1_logprobs_merge
afeldman-nm Dec 7, 2024
5ce8128
move
afeldman-nm Dec 7, 2024
1b62745
[core][executor] simplify instance id (#10976)
youkaichao Dec 7, 2024
7be15d9
[core][misc] remove use_dummy driver for _run_workers (#10920)
youkaichao Dec 7, 2024
fd57d2b
[torch.compile] allow candidate compile sizes (#10984)
youkaichao Dec 8, 2024
a11f326
[V1] Initial support of multimodal models for V1 re-arch (#10699)
ywang96 Dec 8, 2024
43b05fa
[torch.compile][misc] fix comments (#10993)
youkaichao Dec 8, 2024
46004e8
[misc] clean up and unify logging (#10999)
youkaichao Dec 9, 2024
af7c4a9
[Doc][V1] Add V1 support column for multimodal models (#10998)
ywang96 Dec 9, 2024
d1c2e15
[torch.compile] add dynamo time tracking (#11005)
youkaichao Dec 9, 2024
c690357
[V1] Fix Detokenizer loading in `AsyncLLM` (#10997)
ywang96 Dec 9, 2024
e691b26
[Core] Require xgrammar >= 0.1.6 (#11021)
russellb Dec 9, 2024
aea2fc3
[Platform] Move `async output` check to platform (#10768)
wangxiyuan Dec 9, 2024
25b79d9
[V1] Input Batch Relocation (#10962)
varun-sundar-rabindranath Dec 9, 2024
14c7e56
merge
afeldman-nm Dec 9, 2024
bdd0abf
removed VLLM_USE_V1 checks
afeldman-nm Dec 9, 2024
1fc981e
revert logprobs name changes
afeldman-nm Dec 9, 2024
dc63ac1
removing some unnecessary changes'
afeldman-nm Dec 9, 2024
4f30408
removed fast checks
afeldman-nm Dec 9, 2024
77488cb
wip test_completion
afeldman-nm Dec 12, 2024
f1a689c
toward completion tests
afeldman-nm Dec 12, 2024
e962aa7
serialization fix
afeldman-nm Dec 12, 2024
0f6790d
merged
robertgshaw2-redhat Dec 15, 2024

Files changed
@@ -0,0 +1,11 @@
# bash .buildkite/lm-eval-harness/run-lm-eval-gsm-vllm-baseline.sh -m nm-testing/Meta-Llama-3-8B-Instruct-W8-Channel-A8-Dynamic-Asym-Per-Token-Test -b "auto" -l 250 -f 5 -t 1
model_name: "nm-testing/Meta-Llama-3-8B-Instruct-W8-Channel-A8-Dynamic-Asym-Per-Token-Test"
tasks:
- name: "gsm8k"
metrics:
- name: "exact_match,strict-match"
value: 0.764
- name: "exact_match,flexible-extract"
value: 0.764
limit: 250
num_fewshot: 5
@@ -0,0 +1,11 @@
# bash .buildkite/lm-eval-harness/run-lm-eval-gsm-vllm-baseline.sh -m neuralmagic/Llama-3.2-1B-Instruct-quantized.w8a8 -b "auto" -l 1000 -f 5 -t 1
model_name: "neuralmagic/Llama-3.2-1B-Instruct-quantized.w8a8"
tasks:
- name: "gsm8k"
metrics:
- name: "exact_match,strict-match"
value: 0.356
- name: "exact_match,flexible-extract"
value: 0.358
limit: 1000
num_fewshot: 5
3 changes: 2 additions & 1 deletion .buildkite/lm-eval-harness/configs/models-small.txt
@@ -1,6 +1,7 @@
Meta-Llama-3-8B-Instruct.yaml
Meta-Llama-3-8B-Instruct-FP8-compressed-tensors.yaml
Meta-Llama-3-8B-Instruct-INT8-compressed-tensors.yaml
Meta-Llama-3.2-1B-Instruct-INT8-compressed-tensors.yaml
Meta-Llama-3-8B-Instruct-INT8-compressed-tensors-asym.yaml
Meta-Llama-3-8B-Instruct-nonuniform-compressed-tensors.yaml
Meta-Llama-3-8B-Instruct-Channelwise-compressed-tensors.yaml
Minitron-4B-Base-FP8.yaml
8 changes: 4 additions & 4 deletions .buildkite/lm-eval-harness/run-lm-eval-gsm-hf-baseline.sh
@@ -2,7 +2,7 @@
# We can use this script to compute baseline accuracy on GSM for transformers.
#
# Make sure you have lm-eval-harness installed:
# pip install git+https://github.com/EleutherAI/lm-evaluation-harness.git@9516087b81a61d0e220b22cc1b75be76de23bc10
# pip install lm-eval==0.4.4

usage() {
echo``
@@ -41,6 +41,6 @@ while getopts "m:b:l:f:" OPT; do
done

lm_eval --model hf \
--model_args pretrained=$MODEL,parallelize=True \
--tasks gsm8k --num_fewshot $FEWSHOT --limit $LIMIT \
--batch_size $BATCH_SIZE
--model_args "pretrained=$MODEL,parallelize=True" \
--tasks gsm8k --num_fewshot "$FEWSHOT" --limit "$LIMIT" \
--batch_size "$BATCH_SIZE"
8 changes: 4 additions & 4 deletions .buildkite/lm-eval-harness/run-lm-eval-gsm-vllm-baseline.sh
@@ -3,7 +3,7 @@
# We use this for fp8, which HF does not support.
#
# Make sure you have lm-eval-harness installed:
# pip install lm-eval==0.4.3
# pip install lm-eval==0.4.4

usage() {
echo``
@@ -46,6 +46,6 @@ while getopts "m:b:l:f:t:" OPT; do
done

lm_eval --model vllm \
--model_args pretrained=$MODEL,tensor_parallel_size=$TP_SIZE,distributed_executor_backend="ray",trust_remote_code=true,max_model_len=4096 \
--tasks gsm8k --num_fewshot $FEWSHOT --limit $LIMIT \
--batch_size $BATCH_SIZE
--model_args "pretrained=$MODEL,tensor_parallel_size=$TP_SIZE,distributed_executor_backend=ray,trust_remote_code=true,max_model_len=4096" \
--tasks gsm8k --num_fewshot "$FEWSHOT" --limit "$LIMIT" \
--batch_size "$BATCH_SIZE"
2 changes: 1 addition & 1 deletion .buildkite/lm-eval-harness/run-tests.sh
@@ -30,7 +30,7 @@ while getopts "c:t:" OPT; do
done

# Parse list of configs.
IFS=$'\n' read -d '' -r -a MODEL_CONFIGS < $CONFIG
IFS=$'\n' read -d '' -r -a MODEL_CONFIGS < "$CONFIG"

for MODEL_CONFIG in "${MODEL_CONFIGS[@]}"
do
7 changes: 6 additions & 1 deletion .buildkite/lm-eval-harness/test_lm_eval_correctness.py
@@ -49,10 +49,15 @@ def test_lm_eval_correctness():
results = launch_lm_eval(eval_config)

# Confirm scores match ground truth.
success = True
for task in eval_config["tasks"]:
for metric in task["metrics"]:
ground_truth = metric["value"]
measured_value = results["results"][task["name"]][metric["name"]]
print(f'{task["name"]} | {metric["name"]}: '
f'ground_truth={ground_truth} | measured={measured_value}')
assert numpy.isclose(ground_truth, measured_value, rtol=RTOL)
success = success and numpy.isclose(
ground_truth, measured_value, rtol=RTOL)

# Assert at the end, print all scores even on failure for debugging.
assert success
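
The change above accumulates a success flag instead of asserting on the first mismatched metric, so every score is printed before the test fails. A minimal standalone sketch of the same pattern follows; the `eval_config` and `results` structures and the `RTOL` value are assumptions mirroring what the diff shows, not the exact harness code:

```python
import numpy

RTOL = 0.05  # assumed relative tolerance; the harness defines its own value


def check_scores(eval_config: dict, results: dict) -> None:
    """Compare measured lm-eval scores to ground truth, deferring the assert."""
    success = True
    for task in eval_config["tasks"]:
        for metric in task["metrics"]:
            ground_truth = metric["value"]
            measured = results["results"][task["name"]][metric["name"]]
            print(f'{task["name"]} | {metric["name"]}: '
                  f'ground_truth={ground_truth} | measured={measured}')
            # Keep looping even on a mismatch so every metric gets printed.
            success = success and numpy.isclose(ground_truth, measured, rtol=RTOL)
    # Assert once at the end: a single failing metric no longer hides the rest.
    assert success
```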
65 changes: 48 additions & 17 deletions .buildkite/nightly-benchmarks/benchmark-pipeline.yaml
@@ -9,16 +9,19 @@ steps:
- image: badouralix/curl-jq
command:
- sh .buildkite/nightly-benchmarks/scripts/wait-for-image.sh

- wait

- label: "A100"
# skip: "use this flag to conditionally skip the benchmark step, useful for PR testing"
agents:
queue: A100
plugins:
- kubernetes:
podSpec:
priorityClassName: perf-benchmark
containers:
- image: public.ecr.aws/q9t5s3a7/vllm-ci-test-repo:$BUILDKITE_COMMIT
- image: public.ecr.aws/q9t5s3a7/vllm-ci-postmerge-repo:$BUILDKITE_COMMIT
command:
- bash .buildkite/nightly-benchmarks/scripts/run-performance-benchmarks.sh
resources:
@@ -41,20 +41,48 @@ steps:
- name: devshm
emptyDir:
medium: Memory
# - label: "H100"
# agents:
# queue: H100
# plugins:
# - docker#v5.11.0:
# image: public.ecr.aws/q9t5s3a7/vllm-ci-test-repo:$BUILDKITE_COMMIT
# command:
# - bash
# - .buildkite/nightly-benchmarks/run-benchmarks-suite.sh
# mount-buildkite-agent: true
# propagate-environment: true
# ipc: host
# gpus: all
# environment:
# - VLLM_USAGE_SOURCE
# - HF_TOKEN

- label: "H200"
# skip: "use this flag to conditionally skip the benchmark step, useful for PR testing"
agents:
queue: H200
plugins:
- docker#v5.12.0:
image: public.ecr.aws/q9t5s3a7/vllm-ci-postmerge-repo:$BUILDKITE_COMMIT
command:
- bash
- .buildkite/nightly-benchmarks/scripts/run-performance-benchmarks.sh
mount-buildkite-agent: true
propagate-environment: true
ipc: host
gpus: 4,5,6,7
volumes:
- /data/benchmark-hf-cache:/root/.cache/huggingface
environment:
- VLLM_USAGE_SOURCE
- HF_TOKEN

- block: "Run H100 Benchmark"
key: block-h100
depends_on: ~

- label: "H100"
# skip: "use this flag to conditionally skip the benchmark step, useful for PR testing"
agents:
queue: H100
depends_on: block-h100
plugins:
- docker#v5.12.0:
image: public.ecr.aws/q9t5s3a7/vllm-ci-postmerge-repo:$BUILDKITE_COMMIT
command:
- bash
- .buildkite/nightly-benchmarks/scripts/run-performance-benchmarks.sh
mount-buildkite-agent: true
propagate-environment: true
ipc: host
gpus: all # see CUDA_VISIBLE_DEVICES for actual GPUs used
volumes:
- /data/benchmark-hf-cache:/root/.cache/huggingface
environment:
- VLLM_USAGE_SOURCE
- HF_TOKEN
28 changes: 28 additions & 0 deletions .buildkite/nightly-benchmarks/nightly-annotation.md
@@ -0,0 +1,28 @@

## Description

This file contains the download links for the benchmarking results.

- [benchmarking pipeline](artifact://nightly-pipeline.yaml)
- [benchmarking results](artifact://results.zip)
- [benchmarking code](artifact://nightly-benchmarks.zip)

Please download the visualization scripts in the post


## Results reproduction

- Find the docker we use in `benchmarking pipeline`
- Deploy the docker, and inside the docker:
- Download `nightly-benchmarks.zip`.
- In the same folder, run the following code
```
export HF_TOKEN=<your HF token>
apt update
apt install -y git
unzip nightly-benchmarks.zip
VLLM_SOURCE_CODE_LOC=./ bash .buildkite/nightly-benchmarks/scripts/run-nightly-benchmarks.sh
```

And the results will be inside `./benchmarks/results`.

78 changes: 36 additions & 42 deletions .buildkite/nightly-benchmarks/nightly-descriptions.md
@@ -1,45 +1,39 @@

# Nightly benchmark

The main goal of this benchmarking is two-fold:
- Performance clarity: Provide clarity on which one (vllm, tensorrt-llm, lmdeploy and tgi) leads in performance in what workload.
- Reproducible: one can run the exact same set of benchmarking commands inside the exact same docker by following reproducing instructions in [reproduce.md]().


## Docker images

We benchmark vllm, tensorrt-llm, lmdeploy and tgi using the following docker images:
- vllm/vllm-openai:v0.5.0.post1
- nvcr.io/nvidia/tritonserver:24.04-trtllm-python-py3
- openmmlab/lmdeploy:v0.5.0
- ghcr.io/huggingface/text-generation-inference:2.1

<!-- Please check <a href="artifact://workspace/build/buildkite/vllm/performance-benchmark/.buildkite/nightly-benchmarks/nightly-pipeline.yaml">nightly-pipeline.yaml</a> artifact for more details on how we deploy the docker images. -->


## Hardware

One AWS node with 8x NVIDIA A100 GPUs.


## Workload description

We benchmark vllm, tensorrt-llm, lmdeploy and tgi using the following workload:

- Input length: randomly sample 500 prompts from ShareGPT dataset (with fixed random seed).
- Output length: the corresponding output length of these 500 prompts.
- Models: llama-3 8B, llama-3 70B, mixtral 8x7B.
- Average QPS (query per second): 4 for the small model (llama-3 8B) and 2 for other two models. For each QPS, the arrival time of each query is determined using a random Poisson process (with fixed random seed).
- Evaluation metrics: Throughput (higher the better), TTFT (time to the first token, lower the better), ITL (inter-token latency, lower the better).

<!-- Check <a href="artifact://workspace/build/buildkite/vllm/performance-benchmark/.buildkite/nightly-benchmarks/tests/nightly-tests.json">nightly-tests.json</a> artifact for more details. -->

## Plots

In the following plots, the dot shows the mean and the error bar shows the standard error of the mean. Value 0 means that the corresponding benchmark crashed.

<img src="artifact://nightly_results.png" alt="Benchmarking results" height=250 >

## Results

{nightly_results_benchmarking_table}
This benchmark aims to:
- Provide performance clarity: Provide clarity on which one (vllm, tensorrt-llm, lmdeploy and SGLang) leads in performance in what workload.
- Be reproducible: one can run the exact same set of benchmarking commands inside the exact same docker by following reproducing instructions.

Latest results: [results link](https://blog.vllm.ai/2024/09/05/perf-update.html), scroll to the end.

Latest reproduction guide: [github issue link](https://github.com/vllm-project/vllm/issues/8176)


## Setup

- Docker images:
- vLLM: `vllm/vllm-openai:v0.6.2`
- SGLang: `lmsysorg/sglang:v0.3.2-cu121`
- LMDeploy: `openmmlab/lmdeploy:v0.6.1-cu12`
- TensorRT-LLM: `nvcr.io/nvidia/tritonserver:24.07-trtllm-python-py3`
- *NOTE: we use r24.07 as the current implementation only works for this version. We are going to bump this up.*
- Check [nightly-pipeline.yaml](nightly-pipeline.yaml) for the concrete docker images, specs and commands we use for the benchmark.
- Hardware
- 8x Nvidia A100 GPUs
- Workload:
- Dataset
- ShareGPT dataset
- Prefill-heavy dataset (on average 462 input tokens, 16 output tokens)
- Decode-heavy dataset (on average 462 input tokens, 256 output tokens)
- Check [nightly-tests.json](tests/nightly-tests.json) for the concrete configuration of datasets we use.
- Models: llama-3 8B, llama-3 70B.
- We do not use llama 3.1 as it is incompatible with trt-llm r24.07. ([issue](https://github.com/NVIDIA/TensorRT-LLM/issues/2105)).
- Average QPS (query per second): 2, 4, 8, 16, 32 and inf.
- Queries are randomly sampled, and arrival patterns are determined via Poisson process, but all with fixed random seed.
- Evaluation metrics: Throughput (higher the better), TTFT (time to the first token, lower the better), ITL (inter-token latency, lower the better).
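
The Poisson arrival pattern described in the workload above can be illustrated with a short sketch: for a target QPS, inter-arrival gaps are drawn from an exponential distribution with a fixed seed, so every serving framework is replayed against the same request timeline. This is a hypothetical illustration of the technique, not the benchmark's actual implementation; the function name and parameters are assumptions.

```python
import numpy as np


def poisson_arrival_times(num_requests: int, qps: float, seed: int = 0) -> np.ndarray:
    """Arrival timestamps (seconds) for a Poisson request stream at `qps`."""
    rng = np.random.default_rng(seed)  # fixed seed keeps the schedule reproducible
    # Inter-arrival gaps of a Poisson process are exponential with mean 1 / qps;
    # their cumulative sum gives the absolute arrival time of each request.
    gaps = rng.exponential(scale=1.0 / qps, size=num_requests)
    return np.cumsum(gaps)


# Example: 500 requests at 4 QPS, identical across runs and frameworks.
arrivals = poisson_arrival_times(500, qps=4.0, seed=42)
```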

# Known issues

- TRT-LLM crashes with Llama 3.1 8B [issue](https://github.com/NVIDIA/TensorRT-LLM/issues/2105).
- TGI does not support `ignore-eos` flag.