
🔧 Enable NCCL monitoring in cuda build #309

Merged
2 commits merged into opendatahub-io:main from torch-nccl-monitor on Feb 5, 2025

Conversation

joerunde

Description

This PR updates the environment in the cuda image to enable torch's NCCL monitoring feature. This feature seems to be off by default, as users of our image report deadlocks lasting multiple hours before they manually delete their stuck pods. For more details on the configuration, see here.

There is currently a deadlock problem in vLLM: if one tensor parallel worker dies (say, due to a CUDA illegal memory access while processing a request), all of the workers are cleaned up, but the driver (which is itself also a worker) is left deadlocked waiting on an NCCL collective call that will never return. Ideally this deadlock shouldn't happen, since the NCCL process groups are destroyed when the workers exit. However, there are discussions about how easy it is to trigger deadlocks when multiple process groups are in use (NVIDIA/nccl#1013), and vLLM appears to use multiple process groups to handle TP comms with NCCL.
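To make the failure mode concrete, here is a minimal two-rank sketch of the hang shape the heartbeat monitor is meant to break: one rank stops participating in a collective while the other blocks on it indefinitely. This is not the vLLM code path; the file name, timeout values, and launch command are illustrative.

```python
# nccl_hang_sketch.py -- illustrative only; not the vLLM code path.
# Launch (assumes a node with 2 GPUs):
#   torchrun --nproc-per-node=2 nccl_hang_sketch.py
import os
import time

import torch
import torch.distributed as dist


def main() -> None:
    # Monitoring settings must be in the environment before the NCCL process
    # group is created. Values here are for the demo only; the image sets its
    # own defaults.
    os.environ.setdefault("TORCH_NCCL_HEARTBEAT_TIMEOUT_SEC", "60")
    os.environ.setdefault("TORCH_NCCL_DUMP_ON_TIMEOUT", "0")

    dist.init_process_group(backend="nccl")
    rank = dist.get_rank()
    torch.cuda.set_device(rank)
    x = torch.ones(1, device="cuda")

    # A healthy collective that both ranks join.
    dist.all_reduce(x)
    torch.cuda.synchronize()

    if rank == 1:
        # Stand-in for a worker that died or wedged mid-request:
        # it never joins the next collective.
        time.sleep(3600)
        return

    # Rank 0 (the "driver") now waits on a collective that can never complete.
    # Without monitoring this hangs until someone deletes the pod; with the
    # heartbeat monitor enabled, torch is expected to take the stuck process
    # down once the timeout elapses so the deployment can restart.
    dist.all_reduce(x)
    torch.cuda.synchronize()

    dist.destroy_process_group()


if __name__ == "__main__":
    main()
```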

There is an unreleased feature in torch that allows aborting all the comms waiting on a process group, which should solve this problem in the future (pytorch/pytorch#132291). However, we can't use it without pulling in a nightly build of pytorch and recompiling kernels against it.

So, this PR enables the NCCL monitor so that it can unblock the deadlocked driver when this occurs. I ran a quick benchmark on 4xA100s, and enabling it does not appear to affect throughput. (For this benchmark I slightly modified benchmark_serving.py to remove the pruning of long or short requests.)
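For reference, the pruning that was removed looks roughly like the following. This is a sketch from memory: the exact thresholds and structure in benchmarks/benchmark_serving.py may differ between vLLM versions, and the helper name is made up.

```python
def keep_sharegpt_request(prompt_len: int, output_len: int,
                          prune: bool = True) -> bool:
    """Approximation of the ShareGPT request pruning in benchmark_serving.py.

    Passing prune=False reproduces the "remove the pruning of long or short
    requests" modification used for the benchmarks below.
    """
    if not prune:
        return True
    if prompt_len < 4 or output_len < 4:
        return False  # prune too-short sequences
    if prompt_len > 1024 or prompt_len + output_len > 2048:
        return False  # prune too-long sequences
    return True
```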

Benchmark without monitoring enabled

$ python benchmarks/benchmark_serving.py --model meta-llama/Meta-Llama-3.1-70B-Instruct --dataset-path ShareGPT_V3_unfiltered_cleaned_split.json --num-prompts 1000 --seed 1337 --request-rate 8 --percentile-metrics ttft,tpot,itl,e2el

============ Serving Benchmark Result ============
Successful requests:                     1000      
Benchmark duration (s):                  153.22    
Total input tokens:                      222095    
Total generated tokens:                  180312    
Request throughput (req/s):              6.53      
Output token throughput (tok/s):         1176.85   
Total Token throughput (tok/s):          2626.42   
---------------Time to First Token----------------
Mean TTFT (ms):                          215.70    
Median TTFT (ms):                        203.68    
P99 TTFT (ms):                           483.92    
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          69.42     
Median TPOT (ms):                        70.36     
P99 TPOT (ms):                           122.97    
---------------Inter-token Latency----------------
Mean ITL (ms):                           67.23     
Median ITL (ms):                         47.57     
P99 ITL (ms):                            242.18    
----------------End-to-end Latency----------------
Mean E2EL (ms):                          12270.44  
Median E2EL (ms):                        8886.79   
P99 E2EL (ms):                           48513.64  
==================================================

Benchmark with TORCH_NCCL_HEARTBEAT_TIMEOUT_SEC=1

$ python benchmarks/benchmark_serving.py --model meta-llama/Meta-Llama-3.1-70B-Instruct --dataset-path ShareGPT_V3_unfiltered_cleaned_split.json --num-prompts 1000 --seed 1337 --request-rate 8 --percentile-metrics ttft,tpot,itl,e2el

============ Serving Benchmark Result ============
Successful requests:                     1000      
Benchmark duration (s):                  152.97    
Total input tokens:                      222095    
Total generated tokens:                  180639    
Request throughput (req/s):              6.54      
Output token throughput (tok/s):         1180.88   
Total Token throughput (tok/s):          2632.77   
---------------Time to First Token----------------
Mean TTFT (ms):                          226.92    
Median TTFT (ms):                        209.52    
P99 TTFT (ms):                           595.92    
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          70.10     
Median TPOT (ms):                        71.31     
P99 TPOT (ms):                           120.03    
---------------Inter-token Latency----------------
Mean ITL (ms):                           67.94     
Median ITL (ms):                         47.73     
P99 ITL (ms):                            234.46    
----------------End-to-end Latency----------------
Mean E2EL (ms):                          12432.18  
Median E2EL (ms):                        9021.05   
P99 E2EL (ms):                           50406.56  
==================================================

I also ran two large batch tests with a 20k-request batch from ShareGPT, using both the default --max-num-seqs 256 and --max-num-seqs 1024. Even with the timeout set to 1 second and internal batches of ~950 running requests, I didn't observe any NCCL timeouts.

The second setting, TORCH_NCCL_DUMP_ON_TIMEOUT=0, will disable the crash dump. The dump would likely take quite a while, and without pointing it to a volume mount the files would be lost on container restart anyway. Disabling it should lead to speedier restarts on crash.
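As a quick sanity check inside the container (purely illustrative and not part of this PR; the variable names come from the torch NCCL configuration referenced above), one can print what the worker processes will actually see:

```python
# check_nccl_monitor_env.py -- illustrative sanity check, not part of this PR.
import os

SETTINGS = {
    "TORCH_NCCL_HEARTBEAT_TIMEOUT_SEC": "heartbeat monitor timeout",
    "TORCH_NCCL_DUMP_ON_TIMEOUT": "debug dump on timeout (the crash dump above)",
}

for var, meaning in SETTINGS.items():
    value = os.environ.get(var, "<unset: torch default applies>")
    print(f"{var}={value}  # {meaning}")
```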

How Has This Been Tested?

Merge criteria:

  • The commits are squashed in a cohesive manner and have meaningful messages.
  • Testing instructions have been added in the PR body (for PRs involving changes that are not immediately obvious).
  • The developer has manually tested the changes and verified that they work.

Signed-off-by: Joe Runde <Joseph.Runde@ibm.com>
openshift-ci bot requested review from dtrifiro and rpancham on January 29, 2025 21:11

dtrifiro left a comment

Looks good. Is the plan to get rid of this once we bump pytorch and have _abort_process_group()?

Feel free to force merge-rebase.

openshift-ci bot commented Jan 31, 2025

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: dtrifiro, joerunde

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

Signed-off-by: Joe Runde <Joseph.Runde@ibm.com>

joerunde commented Feb 4, 2025

@dtrifiro Unfortunately, I tried out the new _abort_process_group() feature on a build with torch 2.6 and couldn't get it to break the deadlocks :(
I think this will be okay to leave in for the foreseeable future; torch recommends turning it on to avoid deadlocking cluster resources.

openshift-ci bot commented Feb 5, 2025

@joerunde: The following tests failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

Test name          | Commit  | Details | Required | Rerun command
ci/prow/smoke-test | 4cfd469 | link    | true     | /test smoke-test
ci/prow/images     | 4cfd469 | link    | true     | /test images

Full PR test history. Your PR dashboard.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.

joerunde commented Feb 5, 2025

Shoot, I was hoping the build would be working yesterday, but no dice. Gonna force merge w/ rebase.

joerunde merged commit fc1c038 into opendatahub-io:main on Feb 5, 2025
3 of 6 checks passed
joerunde deleted the torch-nccl-monitor branch on February 5, 2025 17:07

dtrifiro commented Feb 5, 2025

@joerunde the build actually succeeded; it seems the smoke test is just hanging for some reason. @NickLucche patched it recently, but we're not sure why it's hanging since CI is not producing any logs 😔
