How to use NVIDIA SHARP for a training job #39

Open
Lzhang-hub opened this issue May 30, 2024 · 1 comment
Comments

@Lzhang-hub

Thanks for your great work! It has helped me a lot.
I have run nccl-test with NVIDIA SHARP successfully. I then tried to run a training job with the NGC torch image, setting the same NCCL environment variables as for the nccl-test job, but it cannot use SHARP.

Is some other configuration needed for a training job? Or is there an image that can be used for training? Thank you very much.

@Lzhang-hub
Author

Update:
With the same Docker image, I get the errors below in the NCCL log during training. They did not show up when running nccl-test.

hostname:177:4050 [6] NCCL INFO CollNet 06/0 : 6 [receive] via COLLNET/SHARP/6/GDRDMA
hostname:174:4057 [3] NCCL INFO SHARP rank 0/8 initialized on mlx5_4:1
[hostname:0:174 - tree_ops.c:314][2024-05-31 02:30:21] ERROR Failed to lock SAT tree(ID:0x208. ret:0x4)
[hostname:0:174 - comm.c:379][2024-05-31 02:30:21] ERROR Failed to lock SAT tree(ID:0x208 ret:0xffffffee)

hostname:174:4057 [3] sharp_plugin.c:354 NCCL WARN SHARP group create: Streaming Tree lock failed (-18)

hostname:174:4044 [3] NCCL INFO transport.cc:327 -> 2
hostname:176:4046 [5] NCCL INFO CollNet 05/0 : 5 [receive] via COLLNET/SHARP/5/GDRDMA
[hostname:0:175 - context.c:660][2024-05-31 02:30:21] INFO job (ID: 35887388397189922) resource request quota: ( osts:0 user_data_per_ost:0 max_groups:0 max_qps:1 max_group_channels:1, num_trees:1)
[hostname:0:175 - context.c:857][2024-05-31 02:30:22] INFO sharp_job_id:425  resv_key: tree_type:LLT tree_idx:0  treeID:8 caps:0x66 quota:(osts:11 user_data_per_ost:1024 max_groups:6 max_qps:1 max_group_channels:1)
[hostname:0:175 - context.c:872][2024-05-31 02:30:22] INFO sharp_job_id:425  tree_type:SAT tree_idx:1  treeID:520 caps:0x76
hostname:175:4059 [4] NCCL INFO SHARP rank 0/8 initialized on mlx5_5:1
[hostname:0:175 - tree_ops.c:314][2024-05-31 02:30:22] ERROR Failed to lock SAT tree(ID:0x208. ret:0x4)
[hostname:0:175 - comm.c:379][2024-05-31 02:30:22] ERROR Failed to lock SAT tree(ID:0x208 ret:0xffffffee)

hostname:175:4059 [4] sharp_plugin.c:354 NCCL WARN SHARP group create: Streaming Tree lock failed (-18)

hostname:175:4049 [4] NCCL INFO transport.cc:327 -> 2
hostname:177:4050 [6] NCCL INFO CollNet 06/0 : 6 [receive] via COLLNET/SHARP/6/GDRDMA
[hostname:0:176 - context.c:660][2024-05-31 02:30:22] INFO job (ID: 35887946087231154) resource request quota: ( osts:0 user_data_per_ost:0 max_groups:0 max_qps:1 max_group_channels:1, num_trees:1)
[hostname:0:176 - context.c:857][2024-05-31 02:30:23] INFO sharp_job_id:427  resv_key: tree_type:LLT tree_idx:0  treeID:8 caps:0x66 quota:(osts:11 user_data_per_ost:1024 max_groups:6 max_qps:1 max_group_channels:1)
[hostname:0:176 - context.c:872][2024-05-31 02:30:23] INFO sharp_job_id:427  tree_type:SAT tree_idx:1  treeID:520 caps:0x76
hostname:176:4063 [5] NCCL INFO SHARP rank 0/8 initialized on mlx5_6:1
[hostname:0:176 - tree_ops.c:314][2024-05-31 02:30:23] ERROR Failed to lock SAT tree(ID:0x208. ret:0x4)
[hostname:0:176 - comm.c:379][2024-05-31 02:30:23] ERROR Failed to lock SAT tree(ID:0x208 ret:0xffffffee)

hostname:176:4063 [5] sharp_plugin.c:354 NCCL WARN SHARP group create: Streaming Tree lock failed (-18)
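
A note on reading the numbers in the log above: the `ret:0xffffffee` printed by `comm.c` and the `(-18)` printed by `sharp_plugin.c` appear to be the same value, once as an unsigned 32-bit hex number and once as a signed integer. Likewise, the `treeID:520` in the `context.c` line is `0x208` in hex, which matches the SAT tree ID in the lock error. The sketch below only checks that arithmetic. `to_signed32` is a hypothetical helper written for this illustration, not part of NCCL or SHARP, and it says nothing about what the error code means.

```python
def to_signed32(value: int) -> int:
    """Interpret an unsigned 32-bit value as a two's-complement signed integer."""
    value &= 0xFFFFFFFF  # keep only the low 32 bits
    # If the sign bit is set, subtract 2**32 to get the negative value.
    return value - (1 << 32) if value & 0x80000000 else value


# Values copied from the log lines above.
ret_from_comm_c = 0xFFFFFFEE      # "ret:0xffffffee" in comm.c:379
sat_tree_id_decimal = 520         # "treeID:520" in context.c:872

print(to_signed32(ret_from_comm_c))   # signed view of the return code
print(hex(sat_tree_id_decimal))       # hex view of the SAT tree ID
```

So the two ERROR lines and the NCCL WARN line seem to describe one failure: locking the SAT tree with ID 520 (0x208) that was reported for the job a moment earlier.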
