No speedup in distributed GMRES? #1732
---
Looking at your code, I can't see any obvious issues. To check that there is nothing wrong with your system, you could try using our benchmarks, which are part of our repository. You have to build the benchmarks (the `GINKGO_BUILD_BENCHMARKS` CMake option) and then run the distributed SpMV benchmark with an input file such as:

```json
[
  {
    "size": 1400000,
    "stencil": "27pt",
    "comm_pattern": "stencil",
    "optimal": { "spmv": "csr-csr" }
  }
]
```

If you save this as, e.g., `input.json`, you can feed it to the distributed SpMV benchmark via stdin under `mpirun` with different numbers of ranks and compare the reported timings.
If this gives normal speedup behavior, then I would guess that the performance issues are due to the matrix partitioning. Maybe something more sophisticated like METIS or Scotch is necessary to reduce the communication overhead; see the sketch below. Also, how is your MPI configured? Does it support communication with device pointers? If so, you can enable that during the CMake configuration (the `GINKGO_FORCE_GPU_AWARE_MPI` option), so that MPI calls receive device buffers directly instead of data staged through the host.
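As a concrete illustration of the partitioning suggestion, here is a minimal sketch of turning an externally computed row-to-rank mapping (e.g. from METIS or Scotch) into a Ginkgo partition. The `row_to_rank` vector, the helper name, and how the mapping is produced are assumptions for illustration; `Partition::build_from_mapping` is the relevant Ginkgo API:

```cpp
#include <ginkgo/ginkgo.hpp>

#include <memory>
#include <vector>

using comm_index = gko::experimental::distributed::comm_index_type;
using partition_type =
    gko::experimental::distributed::Partition<gko::int32, gko::int64>;

// Build a Ginkgo partition from a row-to-rank mapping, e.g. one computed by
// METIS_PartGraphKway on the matrix adjacency graph (not shown here).
std::unique_ptr<partition_type> make_partition(
    std::shared_ptr<const gko::Executor> exec,
    const std::vector<comm_index>& row_to_rank,  // owner rank per global row
    comm_index num_ranks)
{
    // copy the mapping into a Ginkgo array on the host executor
    gko::array<comm_index> mapping(exec->get_master(), row_to_rank.begin(),
                                   row_to_rank.end());
    return partition_type::build_from_mapping(exec, mapping, num_ranks);
}
```

The resulting partition can then be passed to `read_distributed` in place of a uniform one; whether it helps depends on how much communication volume the reordering actually removes.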
---
Hi everyone,
I am using Ginkgo to solve large sparse complex linear systems with a system matrix A of dimensions up to about 14400000 x 14400000. On a single Nvidia A100, solving the system takes about ten minutes. I employ Ginkgo in an iterative optimization setting, where performance is a critical factor. Each node I run my computations on is equipped with up to four (some with eight) Nvidia A100 GPUs. Therefore, I want to use Ginkgo's distributed solving feature.
I tried to adapt the distributed-solver example program from the documentation. However, I do not see any speedup. Worse, my MPI implementation becomes slower the more MPI processes I use.
[Plot: performance vs. number of A100s employed]
I checked through `nvidia-smi` that the problem indeed seems to be distributed over multiple GPUs. The memory allocated on each A100 in the two-process case is about half of the memory allocated in the single-GPU case, which also makes sense to me. However, only slightly above 50% of each GPU is utilized, whereas a single process uses 100%. The solution is correct, independent of how many GPUs are involved.

Below is my code; does anyone see any obvious mistakes? Loading the matrices is a bit of a mess because I employ the fix mentioned in #1731. The correct functionality of this approach is confirmed by a Ginkgo program that does not use any distributed features.
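(The code itself is not reproduced in this excerpt. For orientation only, a minimal sketch of a distributed GMRES setup along the lines of Ginkgo's distributed-solver example might look like the following; the executor selection, stopping criteria, and matrix loading are assumptions, not the poster's actual code.)

```cpp
#include <ginkgo/ginkgo.hpp>

#include <complex>

int main(int argc, char* argv[])
{
    // RAII wrapper around MPI_Init/MPI_Finalize
    const gko::experimental::mpi::environment env(argc, argv);
    const gko::experimental::mpi::communicator comm(MPI_COMM_WORLD);

    // one GPU per rank, chosen among the ranks on the same node
    const auto device_id = gko::experimental::mpi::map_rank_to_device_id(
        MPI_COMM_WORLD, gko::CudaExecutor::get_num_devices());
    const auto exec = gko::CudaExecutor::create(
        device_id, gko::ReferenceExecutor::create());

    using ValueType = std::complex<double>;
    using LocalIndexType = gko::int32;
    using GlobalIndexType = gko::int64;
    using dist_mtx = gko::experimental::distributed::Matrix<
        ValueType, LocalIndexType, GlobalIndexType>;

    // contiguous 1D row partition; global size taken from the question
    const GlobalIndexType num_rows = 14400000;
    auto partition = gko::share(
        gko::experimental::distributed::Partition<
            LocalIndexType, GlobalIndexType>::
            build_from_global_size_uniform(exec, comm.size(), num_rows));

    // matrix entries would be read from file here (omitted)
    gko::matrix_data<ValueType, GlobalIndexType> A_data;
    auto A = gko::share(dist_mtx::create(exec, comm));
    A->read_distributed(A_data, partition);

    // distributed GMRES with iteration and residual-norm stopping criteria
    auto solver =
        gko::solver::Gmres<ValueType>::build()
            .with_criteria(
                gko::stop::Iteration::build().with_max_iters(1000u).on(exec),
                gko::stop::ResidualNorm<ValueType>::build()
                    .with_reduction_factor(1e-8)
                    .on(exec))
            .on(exec)
            ->generate(A);
    // solver->apply(b, x);  // with b, x distributed Vectors (omitted)
}
```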
Any help would be very much appreciated.
Best regards,
Marco