Segmentation fault using MACE with LAMMPS on GPU #823

Open
bgruzs opened this issue Feb 10, 2025 · 0 comments

I have been following the instructions for installing MACE in LAMMPS at https://mace-docs.readthedocs.io/en/latest/guide/lammps.html, with a few modifications: I downloaded libtorch for CUDA 11.7 and changed the Kokkos_ARCH options in the LAMMPS build to what I believe is compatible with my system. When I try to run a simulation in LAMMPS, I get segmentation faults, and I'm not sure how to fix the issue. My installation steps are:

  1. git clone --branch=mace --depth=1 https://github.com/ACEsuit/lammps
  2. wget https://download.pytorch.org/libtorch/cu117/libtorch-shared-with-deps-2.0.1%2Bcu117.zip
  3. unzip libtorch-shared-with-deps-2.0.1+cu117.zip
  4. mv libtorch libtorch-gpu
  5. Request an interactive job
  6. Loading the following modules:
[image: list of loaded modules]
  7. Compiling LAMMPS:
cmake \
    -D CMAKE_BUILD_TYPE=Release \
    -D CMAKE_INSTALL_PREFIX=$(pwd) \
    -D CMAKE_CXX_STANDARD=17 \
    -D CMAKE_CXX_STANDARD_REQUIRED=ON \
    -D BUILD_MPI=ON \
    -D BUILD_SHARED_LIBS=ON \
    -D PKG_KOKKOS=ON \
    -D Kokkos_ENABLE_CUDA=ON \
    -D CMAKE_CXX_COMPILER=$(pwd)/../lib/kokkos/bin/nvcc_wrapper \
    -D Kokkos_ARCH_SKX=ON \
    -D Kokkos_ARCH_TURING75=ON \
    -D CMAKE_PREFIX_PATH=$(pwd)/../../libtorch-gpu \
    -D PKG_ML-MACE=ON \
    ../cmake
  8. make -j 20
  9. make install
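
One way to double-check the Kokkos_ARCH choice in the cmake step is to compare it with the compute capability that the driver reports on the GPU node. The following is a minimal sketch, not part of the MACE or LAMMPS instructions: the helper name `kokkos_arch_for_cc` is hypothetical, the table covers only a few common capabilities, and the `compute_cap` query field is assumed to be supported by the installed nvidia-smi (older drivers may not have it).

```shell
# Map a CUDA compute capability (e.g. "7.5") to the matching Kokkos_ARCH
# CMake option. Hypothetical helper; only a few common values are listed.
kokkos_arch_for_cc() {
    case "$1" in
        6.0) echo "Kokkos_ARCH_PASCAL60" ;;
        6.1) echo "Kokkos_ARCH_PASCAL61" ;;
        7.0) echo "Kokkos_ARCH_VOLTA70" ;;
        7.5) echo "Kokkos_ARCH_TURING75" ;;
        8.0) echo "Kokkos_ARCH_AMPERE80" ;;
        8.6) echo "Kokkos_ARCH_AMPERE86" ;;
        *)   echo "unknown"; return 1 ;;
    esac
}

# On a GPU node, ask the driver for the capability of the first GPU.
# Assumption: this nvidia-smi version supports the compute_cap field.
if command -v nvidia-smi >/dev/null 2>&1; then
    cc=$(nvidia-smi --query-gpu=compute_cap --format=csv,noheader | head -n 1)
    echo "compute capability ${cc} -> $(kokkos_arch_for_cc "${cc}")"
fi
```

A Quadro RTX 6000 is a Turing card with compute capability 7.5, so the sketch would be expected to agree with the Kokkos_ARCH_TURING75 option used above.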

While the command make -j 20 is running, I get several warning messages:

[ 21%] Building CXX object CMakeFiles/lammps.dir/home/bgruzs1/tjo_common/groupshared/bgruzs1/mace-lammps-env/lammps/src/compute_chunk.cpp.o
/home/bgruzs1/tjo_common/groupshared/bgruzs1/mace-lammps-env/lammps/src/atom_vec.cpp: In member function ‘virtual void LAMMPS_NS::AtomVec::write_data_restricted_to_general()’:
/home/bgruzs1/tjo_common/groupshared/bgruzs1/mace-lammps-env/lammps/src/atom_vec.cpp:2272:21: warning: ‘void* memcpy(void*, const void*, size_t)’ specified bound between 18446744056529682432 and 18446744073709551592 exceeds maximum object size 9223372036854775807 [-Wstringop-overflow=]
 2272 |   if (nlocal) memcpy(&x_hold[0][0],&x[0][0],3*nlocal*sizeof(double));
      |               ~~~~~~^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ 
[ 58%] Building CXX object CMakeFiles/lammps.dir/home/bgruzs1/tjo_common/groupshared/bgruzs1/mace-lammps-env/lammps/src/kspace_deprecated.cpp.o
In function ‘fmt::v10_lmp::detail::format_decimal_result<Char*> fmt::v10_lmp::detail::format_decimal(Char*, UInt, int) [with Char = char; UInt = unsigned int]’,
    inlined from ‘fmt::v10_lmp::detail::format_decimal_result<Iterator> fmt::v10_lmp::detail::format_decimal(Iterator, UInt, int) [with Char = char; UInt = unsigned int; Iterator = fmt::v10_lmp::appender; typename std::enable_if<(! std::is_pointer<typename std::remove_cv<typename std::remove_reference<_Arg>::type>::type>::value), int>::type <anonymous> = 0]’ at /home/bgruzs1/tjo_common/groupshared/bgruzs1/mace-lammps-env/lammps/src/fmt/format.h:1402:28,
    inlined from ‘void fmt::v10_lmp::detail::format_hexfloat(Float, int, fmt::v10_lmp::detail::float_specs, fmt::v10_lmp::detail::buffer<char>&) [with Float = double; typename std::enable_if<(! std::integral_constant<bool, (std::numeric_limits<_Tp>::digits == 106)>::value), int>::type <anonymous> = 0]’ at /home/bgruzs1/tjo_common/groupshared/bgruzs1/mace-lammps-env/lammps/src/fmt/format.h:3319:23:
/home/bgruzs1/tjo_common/groupshared/bgruzs1/mace-lammps-env/lammps/src/fmt/format.h:1358:7: warning: writing 2 bytes into a region of size 0 [-Wstringop-overflow=]
 1358 |     memcpy(dst, src, 2);
      |     ~~^~~~~~~~~~~~~
/home/bgruzs1/tjo_common/groupshared/bgruzs1/mace-lammps-env/lammps/src/fmt/format.h: In function ‘void fmt::v10_lmp::detail::format_hexfloat(Float, int, fmt::v10_lmp::detail::float_specs, fmt::v10_lmp::detail::buffer<char>&) [with Float = double; typename std::enable_if<(! std::integral_constant<bool, (std::numeric_limits<_Tp>::digits == 106)>::value), int>::type <anonymous> = 0]’:
/home/bgruzs1/tjo_common/groupshared/bgruzs1/mace-lammps-env/lammps/src/fmt/format.h:1401:6: note: at offset -2 to object ‘buffer’ with size 10 declared here
 1401 |   Char buffer[digits10<UInt>() + 1] = {};
      |      ^~~~~~
[ 87%] Building CXX object CMakeFiles/lammps.dir/home/bgruzs1/tjo_common/groupshared/bgruzs1/mace-lammps-env/lammps/src/KOKKOS/npair_halffull_kokkos.cpp.o
In function ‘fmt::v10_lmp::detail::format_decimal_result<Char*> fmt::v10_lmp::detail::format_decimal(Char*, UInt, int) [with Char = char; UInt = unsigned int]’,
    inlined from ‘fmt::v10_lmp::detail::format_decimal_result<Iterator> fmt::v10_lmp::detail::format_decimal(Iterator, UInt, int) [with Char = char; UInt = unsigned int; Iterator = fmt::v10_lmp::appender; typename std::enable_if<(! std::is_pointer<typename std::remove_cv<typename std::remove_reference<_Arg>::type>::type>::value), int>::type <anonymous> = 0]’ at /home/bgruzs1/tjo_common/groupshared/bgruzs1/mace-lammps-env/lammps/src/fmt/format.h:1402:28,
    inlined from ‘void fmt::v10_lmp::detail::format_hexfloat(Float, int, fmt::v10_lmp::detail::float_specs, fmt::v10_lmp::detail::buffer<char>&) [with Float = double; typename std::enable_if<(! std::integral_constant<bool, (std::numeric_limits<_Tp>::digits == 106)>::value), int>::type <anonymous> = 0]’ at /home/bgruzs1/tjo_common/groupshared/bgruzs1/mace-lammps-env/lammps/src/fmt/format.h:3319:23:
/home/bgruzs1/tjo_common/groupshared/bgruzs1/mace-lammps-env/lammps/src/fmt/format.h:1358:7: warning: writing 2 bytes into a region of size 0 [-Wstringop-overflow=]
 1358 |     memcpy(dst, src, 2);
      |     ~~^~~~~~~~~~~~~
/home/bgruzs1/tjo_common/groupshared/bgruzs1/mace-lammps-env/lammps/src/fmt/format.h: In function ‘void fmt::v10_lmp::detail::format_hexfloat(Float, int, fmt::v10_lmp::detail::float_specs, fmt::v10_lmp::detail::buffer<char>&) [with Float = double; typename std::enable_if<(! std::integral_constant<bool, (std::numeric_limits<_Tp>::digits == 106)>::value), int>::type <anonymous> = 0]’:
/home/bgruzs1/tjo_common/groupshared/bgruzs1/mace-lammps-env/lammps/src/fmt/format.h:1401:6: note: at offset -2 to object ‘buffer’ with size 10 declared here
 1401 |   Char buffer[digits10<UInt>() + 1] = {};
      |      ^~~~~~

On my HPC system I'm using a Quadro RTX 6000 GPU, and my Slurm submission script loads the same modules I used during the installation. I use the pair commands as described in the MACE in LAMMPS docs, and my Slurm submission script looks something like this:

#!/bin/bash

#SBATCH --job-name=mace_test        
#SBATCH --output=slurm.out       
#SBATCH --error=slurm.err          
#SBATCH --time=01:00:00           
#SBATCH --mem=10000               
#SBATCH --gres=gpu:1               
#SBATCH --constraint=rtx_6000     

module purge
module load slurm/ada-slurm/23.02.1
module load imkl/2019.1.144-iimpi-2019a
module load gcc/10.2.0
module load OpenMPI/4.1.1-GCC-10.3.0
module load CUDA/11.7.0
module load cuDNN/8.4.1.50-CUDA-11.7.0

eval "$(conda shell.bash hook)"
conda activate path/to/mace-lammps-env
unset I_MPI_PMI_LIBRARY
export I_MPI_JOB_RESPECT_PROCESS_PLACEMENT=0
mpirun path/to/lmp -k on g 1 -sf kk -in in.mace_test
conda deactivate
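
The script above loads both an Intel MPI toolchain module (imkl/2019.1.144-iimpi-2019a) and OpenMPI/4.1.1-GCC-10.3.0, so it may be worth confirming which MPI library the lmp binary resolves at run time and which mpirun is first on the PATH. The following is a minimal sketch of such a check; the helper name `mpi_libs_from_ldd` is hypothetical, and it assumes the usual `name => path (address)` layout of ldd output.

```shell
# Read `ldd` output on stdin and print the resolved path of every
# library whose name contains "libmpi". Hypothetical helper.
mpi_libs_from_ldd() {
    awk '/libmpi/ && $2 == "=>" { print $3 }'
}

# Intended usage inside the job, after the module loads
# (path/to/lmp is the placeholder used in the submission script):
#   ldd path/to/lmp | mpi_libs_from_ldd
#   command -v mpirun
```

If the library path printed for lmp and the mpirun found on the PATH come from different MPI installations, the build and run environments are probably not consistent.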

Upon running the job, I receive the following error message:

[g10:11681:0:11681] Caught signal 11 (Segmentation fault: address not mapped to object at address 0x440000e8)
[g10:11682:0:11682] Caught signal 11 (Segmentation fault: address not mapped to object at address 0x440000e8)
==== backtrace (tid:  11681) ====
 0 0x000000000002137e ucs_debug_print_backtrace()  /umbc/ebuild-soft/cascade-lake/build/UCX/1.10.0/GCCcore-10.3.0/ucx-1.10.0/src/ucs/debug/debug.c:656
 1 0x000000000006c247 MPI_Comm_rank()  ???:0
 2 0x0000000000b2684b LAMMPS_NS::Universe::Universe()  ???:0
 3 0x00000000009657d4 LAMMPS_NS::LAMMPS::LAMMPS()  ???:0
 4 0x000000000040499f main()  ???:0
==== backtrace (tid:  11682) ====
 5 0x0000000000022555 __libc_start_main()  ???:0
 6 0x0000000000404b4e _start()  ???:0
=================================
[g10:11681] *** Process received signal ***
 0 0x000000000002137e ucs_debug_print_backtrace()  /umbc/ebuild-soft/cascade-lake/build/UCX/1.10.0/GCCcore-10.3.0/ucx-1.10.0/src/ucs/debug/debug.c:656
 1 0x000000000006c247 MPI_Comm_rank()  ???:0
 2 0x0000000000b2684b LAMMPS_NS::Universe::Universe()  ???:0
 3 0x00000000009657d4 LAMMPS_NS::LAMMPS::LAMMPS()  ???:0
 4 0x000000000040499f main()  ???:0
 5 0x0000000000022555 __libc_start_main()  ???:0
 6 0x0000000000404b4e _start()  ???:0
=================================
[g10:11682] *** Process received signal ***
[g10:11681] Signal: Segmentation fault (11)
[g10:11681] Signal code:  (-6)
[g10:11681] Failing at address: 0x2c62900002da1
[g10:11682] Signal: Segmentation fault (11)
[g10:11682] Signal code:  (-6)
[g10:11682] Failing at address: 0x2c62900002da2
[g10:11681] [g10:11682] [ 0] [ 0] /lib64/libpthread.so.0(+0xf630)[0x2aaab0088630]
[g10:11682] [ 1] /lib64/libpthread.so.0(+0xf630)[0x2aaab0088630]
/usr/ebuild/software/OpenMPI/4.1.1-GCC-10.3.0/lib/libmpi.so.40(MPI_Comm_rank+0x37)[0x2aaaaab3e247]
[g10:11682] [ 2] [g10:11681] [ 1] /usr/ebuild/software/OpenMPI/4.1.1-GCC-10.3.0/lib/libmpi.so.40(MPI_Comm_rank+0x37)[0x2aaaaab3e247]
[g10:11681] [ 2] /home/bgruzs1/tjo_common/groupshared/bgruzs1/mace-lammps-env/lammps/build/liblammps.so.0(_ZN9LAMMPS_NS8UniverseC2EPNS_6LAMMPSEi+0xfb)[0x2aaaab7f584b]
[g10:11682] [ 3] /home/bgruzs1/tjo_common/groupshared/bgruzs1/mace-lammps-env/lammps/build/liblammps.so.0(_ZN9LAMMPS_NS8UniverseC2EPNS_6LAMMPSEi+0xfb)[0x2aaaab7f584b]
[g10:11681] [ 3] /home/bgruzs1/tjo_common/groupshared/bgruzs1/mace-lammps-env/lammps/build/liblammps.so.0(_ZN9LAMMPS_NS6LAMMPSC2EiPPci+0xb4)[0x2aaaab6347d4]
[g10:11682] [ 4] /home/bgruzs1/tjo_common/groupshared/bgruzs1/mace-lammps-env/lammps/build/lmp[0x40499f]
[g10:11682] [ 5] /home/bgruzs1/tjo_common/groupshared/bgruzs1/mace-lammps-env/lammps/build/liblammps.so.0(_ZN9LAMMPS_NS6LAMMPSC2EiPPci+0xb4)[0x2aaaab6347d4]
[g10:11681] [ 4] /home/bgruzs1/tjo_common/groupshared/bgruzs1/mace-lammps-env/lammps/build/lmp[0x40499f]
[g10:11681] [ 5] /lib64/libc.so.6(__libc_start_main+0xf5)[0x2aaab09a4555]
/lib64/libc.so.6(__libc_start_main+0xf5)[0x2aaab09a4555]
[g10:11682] [ 6] /home/bgruzs1/tjo_common/groupshared/bgruzs1/mace-lammps-env/lammps/build/lmp[0x404b4e]
[g10:11682] *** End of error message ***
[g10:11681] [ 6] /home/bgruzs1/tjo_common/groupshared/bgruzs1/mace-lammps-env/lammps/build/lmp[0x404b4e]
[g10:11681] *** End of error message ***
--------------------------------------------------------------------------
Primary job  terminated normally, but 1 process returned
a non-zero exit code. Per user-direction, the job has been aborted.
--------------------------------------------------------------------------
--------------------------------------------------------------------------
mpirun noticed that process rank 0 with PID 11681 on node g10 exited on signal 11 (Segmentation fault).
--------------------------------------------------------------------------
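
The trace above shows both ranks failing inside MPI_Comm_rank, called from the LAMMPS_NS::Universe constructor, which runs during startup. The fault address 0x440000e8 lies just above 0x44000000, the value MPICH-derived libraries (including Intel MPI) define for MPI_COMM_WORLD, while Open MPI represents communicators as pointers. The following is a rough sketch of that comparison plus a demangling step. The helper name `near_mpich_comm_world` and the 4 KiB window are assumptions chosen for illustration, and a match is only a hint of a possible header/library mismatch, not proof.

```shell
# Heuristic (an assumption, not a definitive test): succeed if the given
# address is within 4 KiB above 0x44000000, the integer handle that
# MPICH-derived MPI libraries use for MPI_COMM_WORLD.
near_mpich_comm_world() {
    addr=$(( $1 ))                 # accepts hex such as 0x440000e8
    base=$(( 0x44000000 ))
    [ "$addr" -ge "$base" ] && [ "$addr" -lt $(( base + 4096 )) ]
}

if near_mpich_comm_world 0x440000e8; then
    echo "fault address is within 4 KiB of 0x44000000"
fi

# Demangle the two LAMMPS frames from the trace to see how the communicator
# argument was compiled (c++filt ships with GNU binutils).
if command -v c++filt >/dev/null 2>&1; then
    c++filt _ZN9LAMMPS_NS8UniverseC2EPNS_6LAMMPSEi _ZN9LAMMPS_NS6LAMMPSC2EiPPci
fi
```

If the demangled constructors take the communicator as a plain int, that would be consistent with LAMMPS having been compiled against headers where MPI_Comm is an integer while libmpi.so.40 from OpenMPI 4.1.1 is loaded at run time.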

I’m not sure if the problem is due to the installation, the Slurm script, or both. Any insight into how to resolve this issue is greatly appreciated, and if you need any further information, please let me know.
