[Question] Configuration issues with MLCommons benchmarking #421

raghavendrachari08 · 2023-09-22T13:30:42Z

Hi,
I am trying to bring up a multi-node GPU HugeCTR training benchmark using the code at https://github.com/mlcommons/training_results_v3.0/tree/main/NVIDIA/benchmarks/dlrm_dcnv2/implementations/hugectr

On a single node I am able to run the benchmark, but when I run it on multiple nodes (say, 2 nodes) I hit the issue shown below. Could you please help me resolve it?

[HCTR][17:28:15.456][WARNING][RK0][main]: The model name is not specified when creating the solver.
[1695144496.484294] [hpci5201:103648:0] ib_device.c:1250 UCX ERROR ibv_create_ah(dlid=49152 sl=0 port=1 src_path_bits=0 dgid=::ffff:192.160.0.55 sgid_index=3 traffic_class=106) for UD verbs connect on bnxt_re0 failed: Connection timed out
[hpci5201:103648] pml_ucx.c:419 Error: ucp_ep_create(proc=1) failed: Endpoint timeout
[hpci5201:103648] pml_ucx.c:472 Error: Failed to resolve UCX endpoint for rank 1
Traceback (most recent call last):
File "/dev/shm/data/hugectl/train.py", line 344, in
model = hugectr.Model(solver, reader, optimizer)
RuntimeError: Runtime error: MPI_ERR_OTHER: known error not in list
MPI_Bcast(&seed, 1, (static_cast<MPI_Datatype> (static_cast<void *> (&(ompi_mpi_unsigned_long_long)))), 0, (static_cast<MPI_Comm> (static_cast<void *> (&(ompi_mpi_comm_world))))) at create (/workspace/dlrm/hugectr/HugeCTR/src/resource_managers/resource_manager_ext.cpp:39)

shijieliu · 2023-09-25T06:55:43Z

Hi @raghavendrachari08

[1695144496.484294] [hpci5201:103648:0] ib_device.c:1250 UCX ERROR ibv_create_ah(dlid=49152 sl=0 port=1 src_path_bits=0 dgid=::ffff:192.160.0.55 sgid_index=3 traffic_class=106) for UD verbs connect on bnxt_re0 failed: Connection timed out
[hpci5201:103648] pml_ucx.c:419 Error: ucp_ep_create(proc=1) failed: Endpoint timeout
[hpci5201:103648] pml_ucx.c:472 Error: Failed to resolve UCX endpoint for rank 1

It looks to me like the error is related to your multi-node MPI setup. Could you check that setup by running some demo code?
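
As a rough illustration of the kind of demo code that could be used for such a check, here is a minimal sketch that mirrors the failing step in the traceback (a broadcast of a seed from rank 0). It assumes mpi4py is installed in the same environment as the training script; the function name check_broadcast, the seed value, and the file name mpi_check.py are hypothetical and not part of HugeCTR or the MLCommons code.

```python
"""Minimal multi-node MPI sanity check (a sketch, assuming mpi4py is available)."""
import socket


def check_broadcast(comm, seed=1234):
    """Broadcast `seed` from rank 0 and collect the hostname of every rank.

    `comm` is expected to behave like mpi4py's MPI.COMM_WORLD.
    """
    rank = comm.Get_rank()
    # Only rank 0 supplies the value; all other ranks receive it.
    value = comm.bcast(seed if rank == 0 else None, root=0)
    # Every rank reports its hostname so we can see which nodes took part.
    hosts = comm.allgather(socket.gethostname())
    return rank, value, hosts


def main():
    try:
        from mpi4py import MPI  # assumption: mpi4py is installed
    except ImportError:
        print("mpi4py is not installed; cannot run the MPI check")
        return
    rank, value, hosts = check_broadcast(MPI.COMM_WORLD)
    print(f"rank {rank}: received seed {value}, hosts seen: {sorted(set(hosts))}")


if __name__ == "__main__":
    main()
```

With Open MPI, this could be launched across two nodes with something like `mpirun -np 2 --host nodeA,nodeB python mpi_check.py`, where the host names are placeholders. If this small broadcast also fails with the same UCX endpoint timeout, the problem most likely lies in the MPI or network configuration rather than in HugeCTR.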

Abatpool · 2024-08-05T12:08:20Z

Hello @raghavendrachari08, did your DLRM MLCommons training on a single node go through successfully? I am asking because I am facing issues related to the error in #445, mentioned in the previous link. How did you solve it? I am running on a single NVIDIA DGX H100 node. Any help would be highly appreciated.

raghavendrachari08 added the "question" label (Further information is requested) on Sep 22, 2023
