Replies: 10 comments
-
Hi @wlu1998, have you tried running this on a single GPU?
-
Due to device limitations, running on a single GPU may result in "out of memory" issues.
-
For the default configs, you need a GPU with at least 80GB of memory.
-
My setup is 4 × 24 GB GPUs, so I can only do multi-GPU runs. I want to solve the "Some NCCL operations have failed or timed out" issue.
-
@wlu1998 If this is resolved, let us know so we can close the issue.
-
Resolved.
-
Help! Could you tell me how you resolved it?
-
I scaled the model architecture down and debugged on a single GPU. The NCCL problem was never actually solved, though; then during one multi-GPU run it suddenly worked, without me changing anything.
-
Thanks, it's fixed now. ZeRO-3 requires all ranks to save the checkpoint at the same time 😅
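For anyone hitting the same timeout: with ZeRO-3 the model weights are sharded across ranks, so saving a checkpoint is itself a collective operation, and a rank-0-only save leaves the other ranks blocked until the NCCL watchdog fires. A minimal sketch of the fix, assuming a DeepSpeed-style engine (the function name below is just illustrative):

# Minimal sketch, assuming a DeepSpeed ZeRO-3 engine (names are illustrative).
# With ZeRO stage 3 the parameters are sharded, so save_checkpoint is a
# collective: every rank must call it, or the remaining ranks block in an
# all-gather until the NCCL watchdog times out.
import torch.distributed as dist

def save_on_every_rank(model_engine, save_dir, tag):
    # Wrong for ZeRO-3 (fine for plain DDP):
    #     if dist.get_rank() == 0:
    #         model_engine.save_checkpoint(save_dir, tag)
    # Correct: all ranks participate in the save.
    model_engine.save_checkpoint(save_dir, tag)
    dist.barrier()  # optional: keep the ranks in lockstep afterwards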
-
Hi, I have encountered a similar situation. Could you tell me how you resolved it?
-
Version
24.07
On which installation method(s) does this occur?
Docker
Describe the issue
When I run train_graphcast on a single node with multiple GPUs, I encounter the following error message:
[rank3]:[E1016 03:33:45.317536384 ProcessGroupNCCL.cpp:568] [Rank 3] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=50, OpType=ALLREDUCE, NumelIn=1, NumelOut=1, Timeout(ms)=600000) ran for 600011 milliseconds before timing out.
[rank3]:[E1016 03:33:45.319844411 ProcessGroupNCCL.cpp:1583] [PG 0 Rank 3] Exception (either an error or timeout) detected by watchdog at work: 50, last enqueued NCCL work: 50, last completed NCCL work: 49.
[rank3]:[E1016 03:33:45.319862434 ProcessGroupNCCL.cpp:1628] [PG 0 Rank 3] Timeout at NCCL work: 50, last enqueued NCCL work: 50, last completed NCCL work: 49.
[rank3]:[E1016 03:33:45.319870659 ProcessGroupNCCL.cpp:582] [Rank 3] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data.
[rank2]:[E1016 03:33:45.380539738 ProcessGroupNCCL.cpp:568] [Rank 2] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=50, OpType=ALLREDUCE, NumelIn=1, NumelOut=1, Timeout(ms)=600000) ran for 600076 milliseconds before timing out.
[rank2]:[E1016 03:33:45.382855328 ProcessGroupNCCL.cpp:1583] [PG 0 Rank 2] Exception (either an error or timeout) detected by watchdog at work: 50, last enqueued NCCL work: 50, last completed NCCL work: 49.
[rank2]:[E1016 03:33:45.382875465 ProcessGroupNCCL.cpp:1628] [PG 0 Rank 2] Timeout at NCCL work: 50, last enqueued NCCL work: 50, last completed NCCL work: 49.
[rank2]:[E1016 03:33:45.382884031 ProcessGroupNCCL.cpp:582] [Rank 2] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data.
[rank1]:[E1016 03:33:45.400176821 ProcessGroupNCCL.cpp:568] [Rank 1] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=50, OpType=ALLREDUCE, NumelIn=1, NumelOut=1, Timeout(ms)=600000) ran for 600095 milliseconds before timing out.
[rank1]:[E1016 03:33:45.401444623 ProcessGroupNCCL.cpp:1583] [PG 0 Rank 1] Exception (either an error or timeout) detected by watchdog at work: 50, last enqueued NCCL work: 50, last completed NCCL work: 49.
[rank1]:[E1016 03:33:45.401455763 ProcessGroupNCCL.cpp:1628] [PG 0 Rank 1] Timeout at NCCL work: 50, last enqueued NCCL work: 50, last completed NCCL work: 49.
[rank1]:[E1016 03:33:45.401460371 ProcessGroupNCCL.cpp:582] [Rank 1] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data.
[rank0]:[E1016 03:34:00.349118707 ProcessGroupNCCL.cpp:568] [Rank 0] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=50, OpType=ALLTOALL, NumelIn=50507776, NumelOut=50507776, Timeout(ms)=600000) ran for 600018 milliseconds before timing out.
[rank0]:[E1016 03:34:00.351482167 ProcessGroupNCCL.cpp:1583] [PG 0 Rank 0] Exception (either an error or timeout) detected by watchdog at work: 50, last enqueued NCCL work: 50, last completed NCCL work: 49.
[rank0]:[E1016 03:34:00.351502204 ProcessGroupNCCL.cpp:1628] [PG 0 Rank 0] Timeout at NCCL work: 50, last enqueued NCCL work: 50, last completed NCCL work: 49.
[rank0]:[E1016 03:34:00.351511431 ProcessGroupNCCL.cpp:582] [Rank 0] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data.
[rank0]:[E1016 03:52:46.582729551 ProcessGroupNCCL.cpp:1304] [PG 0 Rank 0] First PG on this rank that detected no heartbeat of its watchdog.
[rank0]:[E1016 03:52:46.582797502 ProcessGroupNCCL.cpp:1342] [PG 0 Rank 0] Heartbeat monitor timed out! Process will be terminated after dumping debug info. workMetaList_.size()=0
[rank0]:[F1016 04:02:46.583303966 ProcessGroupNCCL.cpp:1168] [PG 0 Rank 0] [PG 0 Rank 0] ProcessGroupNCCL's watchdog got stuck for 600 seconds without making progress in monitoring enqueued collectives. This typically indicates a NCCL/CUDA API hang blocking the watchdog, and could be triggered by another thread holding the GIL inside a CUDA api, or other deadlock-prone behaviors.If you suspect the watchdog is not actually stuck and a longer timeout would help, you can either increase the timeout (TORCH_NCCL_HEARTBEAT_TIMEOUT_SEC) to a larger value or disable the heartbeat monitor (TORCH_NCCL_ENABLE_MONITORING=0).If either of aforementioned helps, feel free to file an issue to PyTorch about the short timeout or false positive abort; otherwise, please attempt to debug the hang. workMetaList_.size() = 0
I added the following environment variables:
export NCCL_IB_DISABLE=1
export TORCH_NCCL_ENABLE_MONITORING=0
but this didn't solve the problem.
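For reference, the 600000 ms in the log above is the default NCCL process-group timeout, and rank 0 being stuck in an ALLTOALL while ranks 1-3 wait in an ALLREDUCE usually means the ranks have desynchronized rather than that a collective is merely slow. A minimal sketch of two further knobs, assuming the training script initializes torch.distributed itself:

# Minimal sketch (assumes the training script sets up the process group itself).
import datetime
import os
import torch.distributed as dist

os.environ.setdefault("NCCL_DEBUG", "INFO")  # verbose NCCL logs to locate the hang

dist.init_process_group(
    backend="nccl",
    timeout=datetime.timedelta(minutes=60),  # default is 10 minutes (600000 ms)
)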
Minimum reproducible example
No response
Relevant log output
No response
Environment details
docker run -dit --restart always --shm-size 64G --ulimit memlock=-1 --ulimit stack=67108864 --runtime nvidia -v /data/modulus:/data/modulus --name modulus --gpus all nvcr.io/nvidia/modulus/modulus:24.07 /bin/bash