No speedup in distributed GMRES? #1732
---
Looking at your code, I can't see any obvious issues. To check that there is nothing wrong with your system, you could try using our benchmarks, which are part of our repository. You have to build the benchmarks (the `GINKGO_BUILD_BENCHMARKS` CMake option) and then run the distributed SpMV benchmark with an input file such as:

```json
[
  {
    "size": 1400000,
    "stencil": "27pt",
    "comm_pattern": "stencil",
    "optimal": { "spmv": "csr-csr" }
  }
]
```

If you save this as, e.g., `input.json`, you can feed it to the distributed SpMV benchmark via stdin under `mpirun` with different numbers of ranks and compare the reported timings.
If this gives normal speedup behavior, then I would guess that the performance issues are due to the matrix partitioning. Maybe something more sophisticated like METIS or Scotch is necessary to reduce the communication overhead; see the sketch below. Also, how is your MPI configured? Does it support communication with device pointers? If so, you can enable that during the CMake configuration (the `GINKGO_FORCE_GPU_AWARE_MPI` option), so that MPI calls receive device buffers directly instead of data staged through the host.
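As a concrete illustration of the partitioning suggestion, here is a minimal sketch of turning an externally computed row-to-rank mapping (e.g. from METIS or Scotch) into a Ginkgo partition. The `row_to_rank` vector, the helper name, and how the mapping is produced are assumptions for illustration; `Partition::build_from_mapping` is the relevant Ginkgo API:

```cpp
#include <ginkgo/ginkgo.hpp>

#include <memory>
#include <vector>

using comm_index = gko::experimental::distributed::comm_index_type;
using partition_type =
    gko::experimental::distributed::Partition<gko::int32, gko::int64>;

// Build a Ginkgo partition from a row-to-rank mapping, e.g. one computed by
// METIS_PartGraphKway on the matrix adjacency graph (not shown here).
std::unique_ptr<partition_type> make_partition(
    std::shared_ptr<const gko::Executor> exec,
    const std::vector<comm_index>& row_to_rank,  // owner rank per global row
    comm_index num_ranks)
{
    // copy the mapping into a Ginkgo array on the host executor
    gko::array<comm_index> mapping(exec->get_master(), row_to_rank.begin(),
                                   row_to_rank.end());
    return partition_type::build_from_mapping(exec, mapping, num_ranks);
}
```

The resulting partition can then be passed to `read_distributed` in place of a uniform one; whether it helps depends on how much communication volume the reordering actually removes.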
---
Hi everyone,
I am using Ginkgo to solve large sparse complex linear systems with a system matrix A of dimensions up to about 14400000 x 14400000. On a single Nvidia A100, solving the system takes about ten minutes. I employ Ginkgo in an iterative optimization setting, where performance is a critical factor. Each node I run my computations on is equipped with up to four (some with eight) Nvidia A100 GPUs. Therefore, I want to use Ginkgo's distributed solving feature.
I tried to adapt the distributed-solver example program from the documentation. However, I do not see any speedup. Worse, my MPI implementation becomes slower the more MPI processes I use.
[Plot: performance vs. number of A100s employed]
I checked through `nvidia-smi` that the problem indeed seems to be distributed over multiple GPUs. The memory allocated on each A100 in the two-process case is about half of the memory allocated in the single-GPU case, which also makes sense to me. However, only slightly above 50% of each GPU is utilized, whereas a single process uses 100%. The solution is correct, independent of how many GPUs are involved.

Below is my code; does anyone see any obvious mistakes? Loading the matrices is a bit of a mess because I employ the fix mentioned in #1731. The correct functionality of this approach is confirmed by a Ginkgo program that does not use any distributed features.
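(The code itself is not reproduced in this excerpt. For orientation only, a minimal sketch of a distributed GMRES setup along the lines of Ginkgo's distributed-solver example might look like the following; the executor selection, stopping criteria, and matrix loading are assumptions, not the poster's actual code.)

```cpp
#include <ginkgo/ginkgo.hpp>

#include <complex>

int main(int argc, char* argv[])
{
    // RAII wrapper around MPI_Init/MPI_Finalize
    const gko::experimental::mpi::environment env(argc, argv);
    const gko::experimental::mpi::communicator comm(MPI_COMM_WORLD);

    // one GPU per rank, chosen among the ranks on the same node
    const auto device_id = gko::experimental::mpi::map_rank_to_device_id(
        MPI_COMM_WORLD, gko::CudaExecutor::get_num_devices());
    const auto exec = gko::CudaExecutor::create(
        device_id, gko::ReferenceExecutor::create());

    using ValueType = std::complex<double>;
    using LocalIndexType = gko::int32;
    using GlobalIndexType = gko::int64;
    using dist_mtx = gko::experimental::distributed::Matrix<
        ValueType, LocalIndexType, GlobalIndexType>;

    // contiguous 1D row partition; global size taken from the question
    const GlobalIndexType num_rows = 14400000;
    auto partition = gko::share(
        gko::experimental::distributed::Partition<
            LocalIndexType, GlobalIndexType>::
            build_from_global_size_uniform(exec, comm.size(), num_rows));

    // matrix entries would be read from file here (omitted)
    gko::matrix_data<ValueType, GlobalIndexType> A_data;
    auto A = gko::share(dist_mtx::create(exec, comm));
    A->read_distributed(A_data, partition);

    // distributed GMRES with iteration and residual-norm stopping criteria
    auto solver =
        gko::solver::Gmres<ValueType>::build()
            .with_criteria(
                gko::stop::Iteration::build().with_max_iters(1000u).on(exec),
                gko::stop::ResidualNorm<ValueType>::build()
                    .with_reduction_factor(1e-8)
                    .on(exec))
            .on(exec)
            ->generate(A);
    // solver->apply(b, x);  // with b, x distributed Vectors (omitted)
}
```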
Any help would be very much appreciated.
Best regards,
Marco