fix: Use default torch timeout for nccl watchdog unless overridden #521
We just need some comments explaining why we're doing it this way, otherwise I think it's good. 👍
The default value is recommended, and we should not change it in production. The knob may still be useful for debugging or testing purposes though. Signed-off-by: Ihar Hrachyshka <ihar.hrachyshka@gmail.com>
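A minimal sketch of the "use the default unless overridden" idea from the commit message, not the actual change in this PR. The environment variable name `EXAMPLE_NCCL_TIMEOUT_MS` and the helper `process_group_kwargs` are hypothetical and exist only for illustration. The sketch assumes the override is given in milliseconds and relies on `torch.distributed.init_process_group` accepting a `timeout` argument as a `datetime.timedelta`.

```python
import os
from datetime import timedelta

# Hypothetical variable name, used only for this sketch.
TIMEOUT_ENV_VAR = "EXAMPLE_NCCL_TIMEOUT_MS"


def process_group_kwargs(environ=None):
    """Build optional keyword arguments for process group initialization.

    Returns an empty dict when the override is unset, so the caller
    falls back to whatever default timeout torch itself uses.
    """
    environ = os.environ if environ is None else environ
    raw = environ.get(TIMEOUT_ENV_VAR)
    if raw is None:
        # No override: do not pass `timeout` at all.
        return {}
    # Override present: interpret it as milliseconds (an assumption).
    return {"timeout": timedelta(milliseconds=int(raw))}
```

The result could then be spread into the call, for example `torch.distributed.init_process_group("nccl", **process_group_kwargs())`, so that the `timeout` argument is only passed when someone explicitly sets the knob for debugging or testing.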
I like this solution.
I'm curious whether the torch environment variable `TORCH_NCCL_HEARTBEAT_TIMEOUT_SEC`
from the docs overrides this, or whether this overrides that variable.
https://pytorch.org/docs/stable/torch_nccl_environment_variables.html
@JamesKunstle below is my understanding of how this works. I may be wrong since I'm new to the topic, but I'll try to link the relevant code for reference. Please double-check me: it's important that we understand how this works.

So, there are three separate entities: a timeout for the process group, an NCCL monitoring thread, and an NCCL watchdog. They serve separate needs.

The process group timeout is what you configure when passing [...] Each backend will implement it in some way; for example, NCCL will assign the timeout to each work item. This parameter is not backend specific. If we ever use a different backend (not [...]

Now, to watchdogs. These are backend specific. It's up to the backend to run a watchdog or some other mechanism to implement the PG timeout. Each NCCL rank starts a separate native thread running a peer watchdog. The watchdog thread periodically checks on each worker to see if it failed or timed out, and reports back. It has its own sleep timer between iterations (just 100ms).

There's also an NCCL monitoring thread, which is also backend specific. This thread runs separately from the watchdog and monitors the watchdog itself. Specifically, it watches for its heartbeats. If a single heartbeat is not detected in [...]

In case you wonder, the monitoring thread is enabled by default since 2.3.0. It is controlled by [...]

What's a heartbeat? Just an atomic integer increment on a shared variable that is executed by the watchdog thread and watched by the monitoring thread.

Important to note: for a watchdog thread heartbeat to happen, currently running collectives don't have to make any progress. As long as the watchdog thread is alive, it will heartbeat. So tuning [...]

The watchdog may even be disabled with [...]
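To make the heartbeat description above concrete, here is a toy model in plain Python. It is not the PyTorch implementation (which is native code); the names `Heartbeat`, `watchdog_loop`, and `heartbeat_missed` are hypothetical, and the timings are shortened for illustration. It only shows the relationship described: one thread increments a shared counter on every loop iteration, and a separate check reports a miss if the counter did not move within a timeout.

```python
import threading
import time


class Heartbeat:
    """Shared counter: the watchdog increments it, the monitor reads it."""

    def __init__(self):
        self._value = 0
        self._lock = threading.Lock()

    def beat(self):
        with self._lock:
            self._value += 1

    def read(self):
        with self._lock:
            return self._value


def watchdog_loop(heartbeat, stop, sleep_s=0.01):
    """Toy watchdog: beats on every iteration until asked to stop.

    Note that it beats whether or not any "work" is progressing;
    the beat only proves that this thread is still alive.
    """
    while not stop.is_set():
        heartbeat.beat()
        stop.wait(sleep_s)


def heartbeat_missed(heartbeat, timeout_s):
    """Toy monitor check: True if the counter did not change in timeout_s."""
    before = heartbeat.read()
    time.sleep(timeout_s)
    return heartbeat.read() == before
```

In this model, a stuck collective would not cause a missed heartbeat as long as `watchdog_loop` keeps iterating, which mirrors the point above that the heartbeat reflects watchdog liveness rather than progress of the running collectives.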