🔧 Enable NCCL monitoring in cuda build #309
Conversation
Signed-off-by: Joe Runde <Joseph.Runde@ibm.com>
Looks good. Is the plan to get rid of this once we bump pytorch and have `_abort_process_group()`?

Feel free to force merge-rebase.
[APPROVALNOTIFIER] This PR is APPROVED.

This pull-request has been approved by: dtrifiro, joerunde. The full list of commands accepted by this bot can be found here. The pull request process is described here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing `/approve` in a comment.
Signed-off-by: Joe Runde <Joseph.Runde@ibm.com>
@dtrifiro Unfortunately I tried out the new …
@joerunde: The following tests failed, say `/retest` to rerun all failed tests:

Full PR test history. Your PR dashboard. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.
Shoot, I was hoping the build would be working yesterday but no dice - gonna force merge w/ rebase
@joerunde the build actually succeeded; it seems it's just the smoke test that is hanging for some reason. @NickLucche patched it recently, but we're not sure why it's hanging since CI is not producing any logs 😔
Description
This PR updates the environment in the cuda image to enable torch's NCCL monitoring feature. This feature seems to be off by default, as users of our image report deadlocks lasting multiple hours before they manually delete their stuck pods. For more details on the configuration, see here.
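As a rough illustration (not the exact diff in this PR), the change amounts to setting PyTorch's NCCL watchdog knobs in the image environment. The `TORCH_NCCL_*` names below are PyTorch's own ProcessGroupNCCL settings, but the specific variables included and the timeout value shown are placeholders for illustration, not necessarily what this PR ships:

```dockerfile
# Sketch of the kind of ENV lines added to the cuda image; the heartbeat
# timeout value here is illustrative only - see the PR diff for the real settings.
ENV TORCH_NCCL_ENABLE_MONITORING=1 \
    TORCH_NCCL_HEARTBEAT_TIMEOUT_SEC=600 \
    TORCH_NCCL_DUMP_ON_TIMEOUT=0
```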
There is currently a deadlock problem in vLLM: if one tensor parallel worker dies (say, due to a CUDA illegal memory access while processing a request), all of the workers will be cleaned up, but the driver (which is itself also a worker) will be deadlocked waiting on an NCCL collective call that will never return. Ideally this deadlock shouldn't happen, since the NCCL process groups are destroyed when the workers exit. However, I've found discussions about how deadlocks are easy to trigger when multiple process groups are used (NVIDIA/nccl#1013), and vLLM appears to use multiple process groups to handle TP comms with NCCL.
There is an unreleased feature in torch to allow aborting all the comms waiting on a process group, which should solve this problem in the future: pytorch/pytorch#132291. We can't use that without pulling in a nightly build of pytorch and recompiling kernels against it though.
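For reference, a hypothetical sketch of what that future fix could look like once the abort API from pytorch/pytorch#132291 is available. The module path and signature of `_abort_process_group` are assumptions on my part and may differ in the release that ships it:

```python
# Hypothetical sketch only: how a driver could unblock itself once torch
# ships an abort API for process groups (pytorch/pytorch#132291).
# The exact name/location of _abort_process_group is an assumption here.
import torch.distributed as dist

def abort_nccl_groups(process_groups):
    """Abort each NCCL process group so any rank blocked in a collective
    returns with an error instead of hanging forever."""
    for pg in process_groups:
        # Private/unreleased API; the signature may change when it lands.
        dist.distributed_c10d._abort_process_group(pg)
```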
So, this PR enables the NCCL monitor so that it can unblock the deadlocked driver when this occurs. I ran a quick benchmark on 4xA100s, and enabling it doesn't appear to affect throughput. (For this benchmark I slightly modified `benchmark_serving.py` to remove the pruning of long or short requests.)

Benchmark without monitoring enabled

Benchmark with `TORCH_NCCL_HEARTBEAT_TIMEOUT_SEC=1`
I also ran two large batch tests with a 20k batch of requests from ShareGPT, using both the default `--max-num-seqs 256` and `--max-num-seqs 1024`. Even with the timeout set to 1 second and internal batches of ~950 requests, I didn't observe any NCCL timeouts.

The second setting, `TORCH_NCCL_DUMP_ON_TIMEOUT=0`, disables the crash dump. The dump would likely take quite a while, and without pointing it at a volume mount the files would be lost on container restart anyway. Disabling it should lead to speedier restarts on crash.

How Has This Been Tested?
Merge criteria: