[pull] main from vllm-project:main #15
Conversation
- Co-authored-by: Lei Wen <wenlei03@qiyi.com>
- Signed-off-by: Prashant Gupta <prashantgupta@us.ibm.com> Co-authored-by: Roger Wang <ywang@roblox.com>
- Co-authored-by: Philipp Moritz <pcmoritz@gmail.com> Co-authored-by: Woosuk Kwon <woosuk.kwon@berkeley.edu> Co-authored-by: mgoin <michael@neuralmagic.com> Co-authored-by: Tyler Michael Smith <tyler@neuralmagic.com> Co-authored-by: Cody Yu <hao.yu.cody@gmail.com>
- …int (#3467) Co-authored-by: Lily Liu <lilyliupku@gmail.com> Co-authored-by: Cyrus Leung <tlleungac@connect.ust.hk>
- …#4494) Co-authored-by: Simon Mo <simon.mo@hey.com>
- … obtain the CUDA version. (#4173) Signed-off-by: AnyISalIn <anyisalin@gmail.com>
- Signed-off-by: Travis Johnson <tsjohnso@us.ibm.com>
- Co-authored-by: Lei Wen <wenlei03@qiyi.com>
- Co-authored-by: Lei Wen <wenlei03@qiyi.com> Co-authored-by: Sage Moore <sagemoore@utexas.edu>
This PR updates the tuning script for the fused_moe kernel to support FP8 and also adds configurations for TP4. Note that I removed num_warps and num_stages from the configuration for small batch sizes, since that improved performance and brought the benchmarks in that regime back on par with the previous numbers, so this is a strict improvement over the status quo.

All numbers below are for mistralai/Mixtral-8x7B-Instruct-v0.1 with 1000 input and 50 output tokens, with static activation scaling in both cases.

| qps | Before this PR (ITL / e2e latency) | After this PR (ITL / e2e latency) |
| --- | --- | --- |
| 1 | 9.8 ms / 0.49 s | 9.8 ms / 0.49 s |
| 2 | 9.7 ms / 0.49 s | 9.7 ms / 0.49 s |
| 4 | 10.1 ms / 0.52 s | 10.2 ms / 0.53 s |
| 6 | 11.9 ms / 0.59 s | 11.9 ms / 0.59 s |
| 8 | 14.0 ms / 0.70 s | 11.9 ms / 0.59 s |
| 10 | 15.7 ms / 0.79 s | 12.1 ms / 0.61 s |
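For illustration only, here is a minimal sketch of what a per-batch-size kernel config table of this kind can look like; the dictionary name, the tuned values, and the lookup helper are hypothetical and not taken from this PR:

```python
# Hypothetical illustration, not the actual vLLM config files: fused_moe kernel
# configs are keyed by batch size. In this sketch, the small-batch entries omit
# num_warps / num_stages so the kernel launch falls back to Triton's defaults.
FUSED_MOE_TP4_FP8_CONFIG = {
    # small batch sizes: only block/group sizes are pinned
    1: {"BLOCK_SIZE_M": 16, "BLOCK_SIZE_N": 32, "BLOCK_SIZE_K": 128, "GROUP_SIZE_M": 1},
    8: {"BLOCK_SIZE_M": 16, "BLOCK_SIZE_N": 64, "BLOCK_SIZE_K": 128, "GROUP_SIZE_M": 1},
    # larger batch sizes: fully specified, including warp/stage counts
    64: {"BLOCK_SIZE_M": 64, "BLOCK_SIZE_N": 128, "BLOCK_SIZE_K": 128,
         "GROUP_SIZE_M": 8, "num_warps": 8, "num_stages": 4},
}

def lookup_config(num_tokens: int) -> dict:
    """Pick the tuned entry whose batch-size key is closest to num_tokens."""
    key = min(FUSED_MOE_TP4_FP8_CONFIG, key=lambda k: abs(k - num_tokens))
    return FUSED_MOE_TP4_FP8_CONFIG[key]
```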
Remove the device="cuda" declarations in mixtral as promised in #4343
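For context, a minimal before/after sketch of what dropping a hard-coded device="cuda" argument looks like; the shapes and variable names are illustrative, not the actual Mixtral code:

```python
import torch
from torch import nn

hidden_size, intermediate_size = 4096, 14336

# Before: the weight is pinned to CUDA at construction time
# (so even building the module requires a CUDA-capable machine).
w_old = nn.Parameter(torch.empty(intermediate_size, hidden_size, device="cuda"))

# After: no explicit device; the tensor is created on the default device and
# placed by whatever device/sharding logic the caller applies afterwards.
w_new = nn.Parameter(torch.empty(intermediate_size, hidden_size))
```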
…n is not 1 and max_tokens is large & Add tests for preemption (#4451)
/ok-to-test
PR needs rebase. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
Co-authored-by: Cade Daniel <edacih@gmail.com>
@pull[bot]: The following tests failed, say /retest to rerun all failed tests:
Full PR test history. Your PR dashboard. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.
PR needs rebase. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.
Co-authored-by: Michael Goin <michael@neuralmagic.com>
Closed in favour of #18
This PR updates our grpc_server to add TGIS-style logs similar to https://github.com/IBM/text-generation-inference/blob/main/router/src/grpc_server.rs#L504-L512. It also disables the vLLM per-request logging so that we don't double-log each request.

The timing info collected here is pretty rough: it doesn't plumb into the LLMEngine, it just times the generators to get the total time spent in the engine. We could do better, but this is a start.

Example logs:

```
INFO 04-09 21:51:01 logs.py:43] generate_stream{input=[b'This is the story of Obama ridin...'] prefix_id= input_chars=[70] params=sampling { } stopping { max_new_tokens: 200 min_new_tokens: 16 } response { } decoding { } tokenization_time=0.45ms queue_and_inference_time=1096.67ms time_per_token=5.48ms total_time=1097.12ms input_toks=16}: Streaming response generated 200 tokens before NOT_FINISHED, output 848 chars: b' California. The story is told i...'
INFO 04-09 21:51:08 logs.py:43] generate{input=[b'Lorem ipsum dolor sit amet, cons...', b'foooood man where is it'] prefix_id= input_chars=[469] params=sampling { } stopping { max_new_tokens: 20 min_new_tokens: 16 } response { } decoding { } tokenization_time=2.03ms queue_and_inference_time=122.23ms time_per_token=6.11ms total_time=124.26ms input_toks=124}: Sub-request 0 from batch of 2 generated 20 tokens before MAX_TOKENS, output 25 chars: b'?\\n\\n<!--\\n<!--\\n<!--\\n<!--\\n<!'
INFO 04-09 21:51:08 logs.py:43] generate{input=[b'Lorem ipsum dolor sit amet, cons...', b'foooood man where is it'] prefix_id= input_chars=[469] params=sampling { } stopping { max_new_tokens: 20 min_new_tokens: 16 } response { } decoding { } tokenization_time=2.07ms queue_and_inference_time=122.22ms time_per_token=6.11ms total_time=124.29ms input_toks=7}: Sub-request 1 from batch of 2 generated 20 tokens before MAX_TOKENS, output 70 chars: b"?\\nI don't know.\\nI don't know.\\nI ..."
```

---------

Signed-off-by: Joe Runde <Joseph.Runde@ibm.com>
Signed-off-by: Joe Runde <joe@joerun.de>
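A rough sketch of the generator-timing idea described above; the class and attribute names are assumptions, not the actual grpc_server code:

```python
import time
from typing import AsyncIterator, TypeVar

T = TypeVar("T")

class TimedGenerator:
    """Wrap an async generator and record how long iterating it takes.

    This only measures wall-clock time around the generator (roughly
    queue + inference time); it does not instrument the LLMEngine itself.
    """

    def __init__(self, gen: AsyncIterator[T]) -> None:
        self._gen = gen
        self.elapsed_s = 0.0
        self.num_items = 0

    async def __aiter__(self):
        start = time.monotonic()
        try:
            async for item in self._gen:
                self.num_items += 1
                yield item
        finally:
            self.elapsed_s = time.monotonic() - start
```

With a wrapper like this, a time_per_token figure for the log line can be derived from the recorded elapsed time divided by the number of generated tokens.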
Correctly calculate the same number of required cache blocks across all torchrun processes
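One common way to make every torchrun process arrive at the same block count is to all-reduce the per-rank estimate with MIN; a hedged sketch under that assumption (not necessarily how this commit implements it):

```python
import torch
import torch.distributed as dist

def agree_on_num_cache_blocks(local_num_blocks: int) -> int:
    """Return a block count that is identical on every rank by taking the
    minimum of the per-rank estimates (the most constrained GPU wins)."""
    if not dist.is_initialized():
        return local_num_blocks
    t = torch.tensor([local_num_blocks], dtype=torch.int64)
    dist.all_reduce(t, op=dist.ReduceOp.MIN)
    return int(t.item())
```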
See Commits and Changes for more details.
Created by pull[bot]
Can you help keep this open source service alive? 💖 Please sponsor : )