[Question] Configuration issues with MLCommons benchmarking #421

Open
raghavendrachari08 opened this issue Sep 22, 2023 · 2 comments
Labels
question Further information is requested

Comments

@raghavendrachari08

Hi,
I am trying to bring up a multi-node GPU HugeCTR training benchmark using the code at https://github.com/mlcommons/training_results_v3.0/tree/main/NVIDIA/benchmarks/dlrm_dcnv2/implementations/hugectr

On a single node I can run the benchmark, but when I run it on multiple nodes (for example, 2 nodes) I hit the error shown below. Could you please help me resolve it?

[HCTR][17:28:15.456][WARNING][RK0][main]: The model name is not specified when creating the solver.
[1695144496.484294] [hpci5201:103648:0] ib_device.c:1250 UCX ERROR ibv_create_ah(dlid=49152 sl=0 port=1 src_path_bits=0 dgid=::ffff:192.160.0.55 sgid_index=3 traffic_class=106) for UD verbs connect on bnxt_re0 failed: Connection timed out
[hpci5201:103648] pml_ucx.c:419 Error: ucp_ep_create(proc=1) failed: Endpoint timeout
[hpci5201:103648] pml_ucx.c:472 Error: Failed to resolve UCX endpoint for rank 1
Traceback (most recent call last):
File "/dev/shm/data/hugectl/train.py", line 344, in
model = hugectr.Model(solver, reader, optimizer)
RuntimeError: Runtime error: MPI_ERR_OTHER: known error not in list
MPI_Bcast(&seed, 1, (static_cast<MPI_Datatype> (static_cast<void *> (&(ompi_mpi_unsigned_long_long)))), 0, (static_cast<MPI_Comm> (static_cast<void *> (&(ompi_mpi_comm_world))))) at create (/workspace/dlrm/hugectr/HugeCTR/src/resource_managers/resource_manager_ext.cpp:39)

@shijieliu
Collaborator

Hi @raghavendrachari08

[1695144496.484294] [hpci5201:103648:0] ib_device.c:1250 UCX ERROR ibv_create_ah(dlid=49152 sl=0 port=1 src_path_bits=0 dgid=::ffff:192.160.0.55 sgid_index=3 traffic_class=106) for UD verbs connect on bnxt_re0 failed: Connection timed out
[hpci5201:103648] pml_ucx.c:419 Error: ucp_ep_create(proc=1) failed: Endpoint timeout
[hpci5201:103648] pml_ucx.c:472 Error: Failed to resolve UCX endpoint for rank 1

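It looks to me like the error is related to your multi-node MPI setup rather than to HugeCTR itself: UCX times out while creating the endpoint for rank 1, and the MPI_Bcast of the seed then fails. Could you check your multi-node MPI setup by running a small MPI demo program across the same nodes?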
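As a rough illustration of such a demo, here is a minimal sketch that repeats the kind of call that fails in the log (a broadcast of a seed from rank 0) and reports which host each rank runs on. It assumes `mpi4py` is installed in the container, which this thread does not confirm; the file name `mpi_bcast_check.py` and the function `check_broadcast` are hypothetical names used only for this example.

```python
# mpi_bcast_check.py: hypothetical file name, not part of the benchmark repo.
import socket


def check_broadcast(comm, seed=1234):
    """Broadcast `seed` from rank 0 and gather (rank, host, value) on rank 0."""
    rank = comm.Get_rank()
    # Only rank 0 holds the value before the broadcast, similar to the
    # MPI_Bcast(&seed, ...) call that fails in the reported traceback.
    value = comm.bcast(seed if rank == 0 else None, root=0)
    # Collect one tuple per rank on the root so it can print a single summary.
    return comm.gather((rank, socket.gethostname(), value), root=0)


def main():
    try:
        # Assumption: mpi4py is available in the environment used for training.
        from mpi4py import MPI
    except ImportError:
        print("mpi4py not found; install it or use an equivalent C MPI test")
        return
    report = check_broadcast(MPI.COMM_WORLD)
    if report is not None:  # gather returns the list only on the root rank
        for rank, host, value in report:
            print(f"rank {rank} on {host} received seed {value}")


if __name__ == "__main__":
    main()
```

With Open MPI, this could be launched with something like `mpirun -np 2 -H node1,node2 python mpi_bcast_check.py`, where `node1` and `node2` are placeholders for your two hosts. If this small script fails with the same UCX endpoint timeout, the problem most likely lies in the MPI or network configuration between the nodes and not in the benchmark code.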

@Abatpool

Abatpool commented Aug 5, 2024

Hello @raghavendrachari08, did your DLRM MLCommons training on a single node complete successfully? I am asking because I am facing an issue related to the error in #445, mentioned in the previous link. How did you solve it? I am running on a single NVIDIA DGX H100 node. Any help would be highly appreciated.
