
🔧 Enable NCCL monitoring in cuda build #309

Merged
2 commits merged into opendatahub-io:main from torch-nccl-monitor on Feb 5, 2025

Conversation

joerunde

Description

This PR updates the environment in the cuda image to enable torch's NCCL monitoring feature. This feature seems to be off by default, as users of our image report deadlocks lasting multiple hours before they manually delete their stuck pods. For more details on the configuration, see here.

There is currently a deadlock problem in vLLM: if one tensor parallel worker dies (say, due to a CUDA illegal memory access while processing a request), all of the workers are cleaned up, but the driver (which is itself also a worker) is left deadlocked waiting on an NCCL collective call that will never return. Ideally this deadlock shouldn't happen, since the NCCL process groups are destroyed when the workers exit. However, there are discussions about how easy it is to trigger deadlocks when multiple process groups are in use (NVIDIA/nccl#1013), and vLLM appears to use multiple process groups to handle TP comms with NCCL.
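To make the failure mode concrete, here is a minimal two-rank sketch of the hang shape the heartbeat monitor is meant to break: one rank stops participating in a collective while the other blocks on it indefinitely. This is not the vLLM code path; the file name, timeout values, and launch command are illustrative.

```python
# nccl_hang_sketch.py -- illustrative only; not the vLLM code path.
# Launch (assumes a node with 2 GPUs):
#   torchrun --nproc-per-node=2 nccl_hang_sketch.py
import os
import time

import torch
import torch.distributed as dist


def main() -> None:
    # Monitoring settings must be in the environment before the NCCL process
    # group is created. Values here are for the demo only; the image sets its
    # own defaults.
    os.environ.setdefault("TORCH_NCCL_HEARTBEAT_TIMEOUT_SEC", "60")
    os.environ.setdefault("TORCH_NCCL_DUMP_ON_TIMEOUT", "0")

    dist.init_process_group(backend="nccl")
    rank = dist.get_rank()
    torch.cuda.set_device(rank)
    x = torch.ones(1, device="cuda")

    # A healthy collective that both ranks join.
    dist.all_reduce(x)
    torch.cuda.synchronize()

    if rank == 1:
        # Stand-in for a worker that died or wedged mid-request:
        # it never joins the next collective.
        time.sleep(3600)
        return

    # Rank 0 (the "driver") now waits on a collective that can never complete.
    # Without monitoring this hangs until someone deletes the pod; with the
    # heartbeat monitor enabled, torch is expected to take the stuck process
    # down once the timeout elapses so the deployment can restart.
    dist.all_reduce(x)
    torch.cuda.synchronize()

    dist.destroy_process_group()


if __name__ == "__main__":
    main()
```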

There is an unreleased feature in torch that allows aborting all the comms waiting on a process group, which should solve this problem in the future (pytorch/pytorch#132291). However, we can't use it without pulling in a nightly build of pytorch and recompiling kernels against it.

So, this PR enables the NCCL monitor so that it can unblock the deadlocked driver when this occurs. I ran a quick benchmark on 4xA100s, and enabling it does not appear to affect throughput. (For this benchmark I slightly modified benchmark_serving.py to remove the pruning of long or short requests.)
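For reference, the pruning that was removed looks roughly like the following. This is a sketch from memory: the exact thresholds and structure in benchmarks/benchmark_serving.py may differ between vLLM versions, and the helper name is made up.

```python
def keep_sharegpt_request(prompt_len: int, output_len: int,
                          prune: bool = True) -> bool:
    """Approximation of the ShareGPT request pruning in benchmark_serving.py.

    Passing prune=False reproduces the "remove the pruning of long or short
    requests" modification used for the benchmarks below.
    """
    if not prune:
        return True
    if prompt_len < 4 or output_len < 4:
        return False  # prune too-short sequences
    if prompt_len > 1024 or prompt_len + output_len > 2048:
        return False  # prune too-long sequences
    return True
```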

Benchmark without monitoring enabled

$ python benchmarks/benchmark_serving.py --model meta-llama/Meta-Llama-3.1-70B-Instruct --dataset-path ShareGPT_V3_unfiltered_cleaned_split.json --num-prompts 1000 --seed 1337 --request-rate 8 --percentile-metrics ttft,tpot,itl,e2el

============ Serving Benchmark Result ============
Successful requests:                     1000      
Benchmark duration (s):                  153.22    
Total input tokens:                      222095    
Total generated tokens:                  180312    
Request throughput (req/s):              6.53      
Output token throughput (tok/s):         1176.85   
Total Token throughput (tok/s):          2626.42   
---------------Time to First Token----------------
Mean TTFT (ms):                          215.70    
Median TTFT (ms):                        203.68    
P99 TTFT (ms):                           483.92    
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          69.42     
Median TPOT (ms):                        70.36     
P99 TPOT (ms):                           122.97    
---------------Inter-token Latency----------------
Mean ITL (ms):                           67.23     
Median ITL (ms):                         47.57     
P99 ITL (ms):                            242.18    
----------------End-to-end Latency----------------
Mean E2EL (ms):                          12270.44  
Median E2EL (ms):                        8886.79   
P99 E2EL (ms):                           48513.64  
==================================================

Benchmark with TORCH_NCCL_HEARTBEAT_TIMEOUT_SEC=1

$ python benchmarks/benchmark_serving.py --model meta-llama/Meta-Llama-3.1-70B-Instruct --dataset-path ShareGPT_V3_unfiltered_cleaned_split.json --num-prompts 1000 --seed 1337 --request-rate 8 --percentile-metrics ttft,tpot,itl,e2el

============ Serving Benchmark Result ============
Successful requests:                     1000      
Benchmark duration (s):                  152.97    
Total input tokens:                      222095    
Total generated tokens:                  180639    
Request throughput (req/s):              6.54      
Output token throughput (tok/s):         1180.88   
Total Token throughput (tok/s):          2632.77   
---------------Time to First Token----------------
Mean TTFT (ms):                          226.92    
Median TTFT (ms):                        209.52    
P99 TTFT (ms):                           595.92    
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          70.10     
Median TPOT (ms):                        71.31     
P99 TPOT (ms):                           120.03    
---------------Inter-token Latency----------------
Mean ITL (ms):                           67.94     
Median ITL (ms):                         47.73     
P99 ITL (ms):                            234.46    
----------------End-to-end Latency----------------
Mean E2EL (ms):                          12432.18  
Median E2EL (ms):                        9021.05   
P99 E2EL (ms):                           50406.56  
==================================================

I also ran two large batch tests with a 20k-request batch from ShareGPT, using both the default --max-num-seqs 256 and --max-num-seqs 1024. Even with the timeout set to 1 second and internal batches of ~950 running requests, I didn't observe any NCCL timeouts.

The second setting, TORCH_NCCL_DUMP_ON_TIMEOUT=0, will disable the crash dump. The dump would likely take quite a while, and without pointing it to a volume mount the files would be lost on container restart anyway. Disabling it should lead to speedier restarts on crash.
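As a quick sanity check inside the container (purely illustrative and not part of this PR; the variable names come from the torch NCCL configuration referenced above), one can print what the worker processes will actually see:

```python
# check_nccl_monitor_env.py -- illustrative sanity check, not part of this PR.
import os

SETTINGS = {
    "TORCH_NCCL_HEARTBEAT_TIMEOUT_SEC": "heartbeat monitor timeout",
    "TORCH_NCCL_DUMP_ON_TIMEOUT": "debug dump on timeout (the crash dump above)",
}

for var, meaning in SETTINGS.items():
    value = os.environ.get(var, "<unset: torch default applies>")
    print(f"{var}={value}  # {meaning}")
```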

How Has This Been Tested?

Merge criteria:

  • The commits are squashed in a cohesive manner and have meaningful messages.
  • Testing instructions have been added in the PR body (for PRs involving changes that are not immediately obvious).
  • The developer has manually tested the changes and verified that they work.

Signed-off-by: Joe Runde <Joseph.Runde@ibm.com>
openshift-ci bot requested review from dtrifiro and rpancham on January 29, 2025 21:11

dtrifiro left a comment

Looks good. Is the plan to get rid of this once we bump pytorch and have _abort_process_group()?

Feel free to force merge-rebase.

openshift-ci bot commented Jan 31, 2025

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: dtrifiro, joerunde

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

Signed-off-by: Joe Runde <Joseph.Runde@ibm.com>

joerunde commented Feb 4, 2025

@dtrifiro Unfortunately, I tried out the new _abort_process_group() feature on a build with torch 2.6 and couldn't get it to break the deadlocks :(
I think this will be okay to leave in for the foreseeable future; torch recommends turning it on to avoid deadlocking cluster resources.

openshift-ci bot commented Feb 5, 2025

@joerunde: The following tests failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

Test name          | Commit  | Details | Required | Rerun command
ci/prow/smoke-test | 4cfd469 | link    | true     | /test smoke-test
ci/prow/images     | 4cfd469 | link    | true     | /test images

Full PR test history. Your PR dashboard.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.

joerunde commented Feb 5, 2025

Shoot, I was hoping the build would be working yesterday, but no dice. Gonna force merge w/ rebase.

joerunde merged commit fc1c038 into opendatahub-io:main on Feb 5, 2025
3 of 6 checks passed
joerunde deleted the torch-nccl-monitor branch on February 5, 2025 17:07

dtrifiro commented Feb 5, 2025

@joerunde the build actually succeeded; it seems the smoke test is just hanging for some reason. @NickLucche patched it recently, but we're not sure why it's hanging since CI is not producing any logs 😔
