I have been following the instructions for installing MACE in LAMMPS at https://mace-docs.readthedocs.io/en/latest/guide/lammps.html with a few modifications: I downloaded libtorch for CUDA 11.7 and changed the Kokkos_ARCH options used when compiling LAMMPS to what I believe is compatible with my system. When I try to run a simulation in LAMMPS, I get segmentation faults, and I'm not sure how to fix the issue. The steps I'm taking during installation are:
While the command make -j 20 is running, I get several warning messages:
[ 21%] Building CXX object CMakeFiles/lammps.dir/home/bgruzs1/tjo_common/groupshared/bgruzs1/mace-lammps-env/lammps/src/compute_chunk.cpp.o
/home/bgruzs1/tjo_common/groupshared/bgruzs1/mace-lammps-env/lammps/src/atom_vec.cpp: In member function ‘virtual void LAMMPS_NS::AtomVec::write_data_restricted_to_general()’:
/home/bgruzs1/tjo_common/groupshared/bgruzs1/mace-lammps-env/lammps/src/atom_vec.cpp:2272:21: warning: ‘void* memcpy(void*, const void*, size_t)’ specified bound between 18446744056529682432 and 18446744073709551592 exceeds maximum object size 9223372036854775807 [-Wstringop-overflow=]
2272 | if (nlocal) memcpy(&x_hold[0][0],&x[0][0],3*nlocal*sizeof(double));
| ~~~~~~^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
[ 58%] Building CXX object CMakeFiles/lammps.dir/home/bgruzs1/tjo_common/groupshared/bgruzs1/mace-lammps-env/lammps/src/kspace_deprecated.cpp.o
In function ‘fmt::v10_lmp::detail::format_decimal_result<Char*> fmt::v10_lmp::detail::format_decimal(Char*, UInt, int) [with Char = char; UInt = unsigned int]’,
inlined from ‘fmt::v10_lmp::detail::format_decimal_result<Iterator> fmt::v10_lmp::detail::format_decimal(Iterator, UInt, int) [with Char = char; UInt = unsigned int; Iterator = fmt::v10_lmp::appender; typename std::enable_if<(! std::is_pointer<typename std::remove_cv<typename std::remove_reference<_Arg>::type>::type>::value), int>::type <anonymous> = 0]’ at /home/bgruzs1/tjo_common/groupshared/bgruzs1/mace-lammps-env/lammps/src/fmt/format.h:1402:28,
inlined from ‘void fmt::v10_lmp::detail::format_hexfloat(Float, int, fmt::v10_lmp::detail::float_specs, fmt::v10_lmp::detail::buffer<char>&) [with Float = double; typename std::enable_if<(! std::integral_constant<bool, (std::numeric_limits<_Tp>::digits == 106)>::value), int>::type <anonymous> = 0]’ at /home/bgruzs1/tjo_common/groupshared/bgruzs1/mace-lammps-env/lammps/src/fmt/format.h:3319:23:
/home/bgruzs1/tjo_common/groupshared/bgruzs1/mace-lammps-env/lammps/src/fmt/format.h:1358:7: warning: writing 2 bytes into a region of size 0 [-Wstringop-overflow=]
1358 | memcpy(dst, src, 2);
| ~~^~~~~~~~~~~~~
/home/bgruzs1/tjo_common/groupshared/bgruzs1/mace-lammps-env/lammps/src/fmt/format.h: In function ‘void fmt::v10_lmp::detail::format_hexfloat(Float, int, fmt::v10_lmp::detail::float_specs, fmt::v10_lmp::detail::buffer<char>&) [with Float = double; typename std::enable_if<(! std::integral_constant<bool, (std::numeric_limits<_Tp>::digits == 106)>::value), int>::type <anonymous> = 0]’:
/home/bgruzs1/tjo_common/groupshared/bgruzs1/mace-lammps-env/lammps/src/fmt/format.h:1401:6: note: at offset -2 to object ‘buffer’ with size 10 declared here
1401 | Char buffer[digits10<UInt>() + 1] = {};
| ^~~~~~
[ 87%] Building CXX object CMakeFiles/lammps.dir/home/bgruzs1/tjo_common/groupshared/bgruzs1/mace-lammps-env/lammps/src/KOKKOS/npair_halffull_kokkos.cpp.o
In function ‘fmt::v10_lmp::detail::format_decimal_result<Char*> fmt::v10_lmp::detail::format_decimal(Char*, UInt, int) [with Char = char; UInt = unsigned int]’,
inlined from ‘fmt::v10_lmp::detail::format_decimal_result<Iterator> fmt::v10_lmp::detail::format_decimal(Iterator, UInt, int) [with Char = char; UInt = unsigned int; Iterator = fmt::v10_lmp::appender; typename std::enable_if<(! std::is_pointer<typename std::remove_cv<typename std::remove_reference<_Arg>::type>::type>::value), int>::type <anonymous> = 0]’ at /home/bgruzs1/tjo_common/groupshared/bgruzs1/mace-lammps-env/lammps/src/fmt/format.h:1402:28,
inlined from ‘void fmt::v10_lmp::detail::format_hexfloat(Float, int, fmt::v10_lmp::detail::float_specs, fmt::v10_lmp::detail::buffer<char>&) [with Float = double; typename std::enable_if<(! std::integral_constant<bool, (std::numeric_limits<_Tp>::digits == 106)>::value), int>::type <anonymous> = 0]’ at /home/bgruzs1/tjo_common/groupshared/bgruzs1/mace-lammps-env/lammps/src/fmt/format.h:3319:23:
/home/bgruzs1/tjo_common/groupshared/bgruzs1/mace-lammps-env/lammps/src/fmt/format.h:1358:7: warning: writing 2 bytes into a region of size 0 [-Wstringop-overflow=]
1358 | memcpy(dst, src, 2);
| ~~^~~~~~~~~~~~~
/home/bgruzs1/tjo_common/groupshared/bgruzs1/mace-lammps-env/lammps/src/fmt/format.h: In function ‘void fmt::v10_lmp::detail::format_hexfloat(Float, int, fmt::v10_lmp::detail::float_specs, fmt::v10_lmp::detail::buffer<char>&) [with Float = double; typename std::enable_if<(! std::integral_constant<bool, (std::numeric_limits<_Tp>::digits == 106)>::value), int>::type <anonymous> = 0]’:
/home/bgruzs1/tjo_common/groupshared/bgruzs1/mace-lammps-env/lammps/src/fmt/format.h:1401:6: note: at offset -2 to object ‘buffer’ with size 10 declared here
1401 | Char buffer[digits10<UInt>() + 1] = {};
| ^~~~~~
On my HPC system I'm using a Quadro RTX 6000 GPU, and my Slurm submission script loads the same modules I used during the installation. I use the pair commands as described in the MACE in LAMMPS docs, and my Slurm submission script looks something like this:
Upon running the job, I receive the following error message:
[g10:11681:0:11681] Caught signal 11 (Segmentation fault: address not mapped to object at address 0x440000e8)
[g10:11682:0:11682] Caught signal 11 (Segmentation fault: address not mapped to object at address 0x440000e8)
==== backtrace (tid: 11681) ====
0 0x000000000002137e ucs_debug_print_backtrace() /umbc/ebuild-soft/cascade-lake/build/UCX/1.10.0/GCCcore-10.3.0/ucx-1.
10.0/src/ucs/debug/debug.c:656
1 0x000000000006c247 MPI_Comm_rank() ???:0
2 0x0000000000b2684b LAMMPS_NS::Universe::Universe() ???:0
3 0x00000000009657d4 LAMMPS_NS::LAMMPS::LAMMPS() ???:0
4 0x000000000040499f main() ???:0
==== backtrace (tid: 11682) ====
5 0x0000000000022555 __libc_start_main() ???:0
6 0x0000000000404b4e _start() ???:0
=================================
[g10:11681] *** Process received signal ***
0 0x000000000002137e ucs_debug_print_backtrace() /umbc/ebuild-soft/cascade-lake/build/UCX/1.10.0/GCCcore-10.3.0/ucx-1.
10.0/src/ucs/debug/debug.c:656
1 0x000000000006c247 MPI_Comm_rank() ???:0
2 0x0000000000b2684b LAMMPS_NS::Universe::Universe() ???:0
3 0x00000000009657d4 LAMMPS_NS::LAMMPS::LAMMPS() ???:0
4 0x000000000040499f main() ???:0
5 0x0000000000022555 __libc_start_main() ???:0
6 0x0000000000404b4e _start() ???:0
=================================
[g10:11682] *** Process received signal ***
[g10:11681] Signal: Segmentation fault (11)
[g10:11681] Signal code: (-6)
[g10:11681] Failing at address: 0x2c62900002da1
[g10:11682] Signal: Segmentation fault (11)
[g10:11682] Signal code: (-6)
[g10:11682] Failing at address: 0x2c62900002da2
[g10:11681] [g10:11682] [ 0] [ 0] /lib64/libpthread.so.0(+0xf630)[0x2aaab0088630]
[g10:11682] [ 1] /lib64/libpthread.so.0(+0xf630)[0x2aaab0088630]
/usr/ebuild/software/OpenMPI/4.1.1-GCC-10.3.0/lib/libmpi.so.40(MPI_Comm_rank+0x37)[0x2aaaaab3e247]
[g10:11682] [ 2] [g10:11681] [ 1] /usr/ebuild/software/OpenMPI/4.1.1-GCC-10.3.0/lib/libmpi.so.40(MPI_Comm_rank+0x37)[0x2
aaaaab3e247]
[g10:11681] [ 2] /home/bgruzs1/tjo_common/groupshared/bgruzs1/mace-lammps-env/lammps/build/liblammps.so.0(_ZN9LAMMPS_NS8
UniverseC2EPNS_6LAMMPSEi+0xfb)[0x2aaaab7f584b]
[g10:11682] [ 3] /home/bgruzs1/tjo_common/groupshared/bgruzs1/mace-lammps-env/lammps/build/liblammps.so.0(_ZN9LAMMPS_NS8
UniverseC2EPNS_6LAMMPSEi+0xfb)[0x2aaaab7f584b]
[g10:11681] [ 3] /home/bgruzs1/tjo_common/groupshared/bgruzs1/mace-lammps-env/lammps/build/liblammps.so.0(_ZN9LAMMPS_NS6
LAMMPSC2EiPPci+0xb4)[0x2aaaab6347d4]
[g10:11682] [ 4] /home/bgruzs1/tjo_common/groupshared/bgruzs1/mace-lammps-env/lammps/build/lmp[0x40499f]
[g10:11682] [ 5] /home/bgruzs1/tjo_common/groupshared/bgruzs1/mace-lammps-env/lammps/build/liblammps.so.0(_ZN9LAMMPS_NS6
LAMMPSC2EiPPci+0xb4)[0x2aaaab6347d4]
[g10:11681] [ 4] /home/bgruzs1/tjo_common/groupshared/bgruzs1/mace-lammps-env/lammps/build/lmp[0x40499f]
[g10:11681] [ 5] /lib64/libc.so.6(__libc_start_main+0xf5)[0x2aaab09a4555]
/lib64/libc.so.6(__libc_start_main+0xf5)[0x2aaab09a4555]
[g10:11682] [ 6] /home/bgruzs1/tjo_common/groupshared/bgruzs1/mace-lammps-env/lammps/build/lmp[0x404b4e]
[g10:11682] *** End of error message ***
[g10:11681] [ 6] /home/bgruzs1/tjo_common/groupshared/bgruzs1/mace-lammps-env/lammps/build/lmp[0x404b4e]
[g10:11681] *** End of error message ***
--------------------------------------------------------------------------
Primary job terminated normally, but 1 process returned
a non-zero exit code. Per user-direction, the job has been aborted.
--------------------------------------------------------------------------
--------------------------------------------------------------------------
mpirun noticed that process rank 0 with PID 11681 on node g10 exited on signal 11 (Segmentation fault).
--------------------------------------------------------------------------
I’m not sure whether the problem is due to the installation, the Slurm script, or both. Any insight into how to resolve this issue would be greatly appreciated, and if you need any further information, please let me know.
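One possible reading of the trace, offered as a guess rather than a diagnosis: the first fault address, 0x440000e8, sits just above 0x44000000, which is the integer value MPICH-family headers use for MPI_COMM_WORLD. Open MPI represents communicators as pointers, so a binary compiled against MPICH-style headers but run against the Open MPI library seen in the trace could fault this way inside MPI_Comm_rank. The sketch below only checks whether an address falls in that neighbourhood. The function name and the 4096-byte window are hypothetical choices made for illustration.

```shell
# Return success if a hex address lies just above 0x44000000, the integer
# value MPICH-family headers assign to MPI_COMM_WORLD. The 4096-byte window
# is an arbitrary illustrative choice, not something defined by either MPI.
looks_like_mpich_comm_world() {
  addr=$(( $1 ))                 # shell arithmetic accepts 0x-prefixed hex
  base=$(( 0x44000000 ))
  [ "$addr" -ge "$base" ] && [ "$addr" -lt $(( base + 4096 )) ]
}

# Fault address taken from the first line of the error message above.
if looks_like_mpich_comm_world 0x440000e8; then
  echo "fault address is consistent with an MPICH-style MPI_COMM_WORLD handle"
fi
```

If the check matches, running `ldd` on the `lmp` binary and comparing the MPI library it resolves with the output of `mpirun --version` at run time would help confirm or rule out a mismatch between the MPI used at build time and the one loaded in the job.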
The installation steps referred to above are:
git clone --branch=mace --depth=1 https://github.com/ACEsuit/lammps
wget https://download.pytorch.org/libtorch/cu117/libtorch-shared-with-deps-2.0.1%2Bcu117.zip
unzip libtorch-shared-with-deps-2.0.1+cu117.zip
mv libtorch libtorch-gpu
make -j 20
make install
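Because the Kokkos_ARCH options were changed by hand, a small lookup like the following may help cross-check them against the GPU. It is a minimal sketch: the function name is hypothetical, and the table covers only a few common NVIDIA generations. The option names themselves are standard Kokkos architecture keywords, and the Quadro RTX 6000 is a Turing card with compute capability 7.5.

```shell
# Map a CUDA compute capability (e.g. "7.5") to a Kokkos architecture option.
# Only a handful of NVIDIA generations are listed; extend the table as needed.
kokkos_arch_for_cc() {
  case "$1" in
    6.0) echo "Kokkos_ARCH_PASCAL60" ;;
    7.0) echo "Kokkos_ARCH_VOLTA70" ;;
    7.5) echo "Kokkos_ARCH_TURING75" ;;
    8.0) echo "Kokkos_ARCH_AMPERE80" ;;
    8.6) echo "Kokkos_ARCH_AMPERE86" ;;
    *)   echo "unknown compute capability: $1" >&2; return 1 ;;
  esac
}

# Quadro RTX 6000 (compute capability 7.5); prints Kokkos_ARCH_TURING75
kokkos_arch_for_cc 7.5
```

On drivers whose `nvidia-smi` supports the `compute_cap` query field (an assumption about the installed driver version), the compute capability can be read on the GPU node with `nvidia-smi --query-gpu=compute_cap --format=csv,noheader` and passed to the function.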