Importing nest built with mpi without mpiexec appears to cause segfault (PyNEST-NG) #3400

Open
heplesser opened this issue Feb 7, 2025 · 4 comments
Labels
I: No breaking change Previously written code will work as before, no one should note anything changing (aside the fix) S: High Should be handled next T: Enhancement New functionality, model or documentation

Comments

@heplesser
Contributor

When building the PyNEST-NG variant of NEST with MPI support, importing nest appears to lead to segfaults on Linux, see e.g., https://github.com/heplesser/nest-simulator/actions/runs/13130033031/job/36633256531#step:23:213. Invocation under control of mpiexec works. The problem does not occur on macOS.

I have so far observed this only in the testsuite. We need to understand what is going on and hopefully find a solution or at least a work-around.

I mark this as an "Enhancement", not a bug, because it concerns PyNEST-NG, which is still under development.

@heplesser heplesser added I: No breaking change Previously written code will work as before, no one should note anything changing (aside the fix) S: High Should be handled next T: Enhancement New functionality, model or documentation labels Feb 7, 2025
@github-project-automation github-project-automation bot moved this to To do in PyNEST-NG Feb 7, 2025
@med-ayssar
Contributor

med-ayssar commented Feb 10, 2025

Hey @heplesser, I had a look at the issue. I was able to reproduce the error and also took a look at the core dump associated with the segfault.

  • To reproduce, run pytest on a test file that does nothing but import nest; this may trigger the segfault:

    pytest -v $simple_file_just_import_nest.py
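A crash during import takes the whole pytest process down with it, so it can help to run the import in a child interpreter and inspect the exit status instead. The following is a minimal sketch of that idea using only the Python standard library. The helper names import_in_subprocess and describe_status are hypothetical and not part of NEST or its testsuite.

```python
import signal
import subprocess
import sys


def import_in_subprocess(module_name):
    """Import `module_name` in a fresh interpreter and return its exit status."""
    result = subprocess.run(
        [sys.executable, "-c", f"import {module_name}"],
        capture_output=True,
        text=True,
    )
    return result.returncode


def describe_status(returncode):
    """Turn a subprocess return code into a short human-readable message.

    On POSIX, a negative return code -N means the child was killed by signal N.
    """
    if returncode < 0:
        return f"killed by {signal.Signals(-returncode).name}"
    return f"exited with status {returncode}"
```

Calling describe_status(import_in_subprocess("nest")) on an affected build should then report the signal that terminated the child instead of killing the test runner.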

#5 __strlen_avx2 ()
#6 0x00007ab6bbfa50a5 in opal_argv_join () from /lib/x86_64-linux-gnu/libopen-pal.so.40
#7 0x00007ab6bcdba7a2 in ompi_mpi_init () from /lib/x86_64-linux-gnu/libmpi.so.40
#8 0x00007ab6bcd50eec in PMPI_Init_thread () from /lib/x86_64-linux-gnu/libmpi.so.40
#9 0x00007ab6a448d078 in nest::MPIManager::init_mpi (this=0x56388d70bc60, argc=argc@entry=0x7fff39e5a904, argv=argv@entry=0x7fff39e5a908)

I checked the Open MPI source code for the implementation of opal_argv_join: the function takes a pointer to argv and a delimiter, and iterates over argv starting from position 1 until it reaches a nullptr.

However, the new init function in pynest/nestkernel_api.pyx does not append a nullptr at the end of argv, which leads to an access to uninitialized memory (undefined behavior).
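To illustrate the convention at play, here is a small sketch in Python with ctypes. It is not the NEST or Open MPI code, and the names make_c_argv and join_until_null are hypothetical. It builds a char*[] with one extra slot that stays NULL, and then walks the array the way a consumer without access to argc would, stopping only at the terminator.

```python
import ctypes


def make_c_argv(args):
    """Build a NULL-terminated char*[] from a list of Python strings.

    Returns (argc, argv). The array has len(args) + 1 entries; ctypes
    zero-initializes the unused last slot, so it is a NULL pointer.
    """
    encoded = [a.encode("utf-8") for a in args]
    argv = (ctypes.c_char_p * (len(encoded) + 1))(*encoded)
    return len(encoded), argv


def join_until_null(argv, delimiter=" "):
    """Join entries of argv until the first NULL entry, ignoring argc."""
    parts = []
    i = 0
    while argv[i] is not None:  # ctypes maps a NULL char* to None
        parts.append(argv[i].decode("utf-8"))
        i += 1
    return delimiter.join(parts)
```

In this sketch, an array allocated without the extra slot would make join_until_null raise IndexError, because ctypes arrays are bounds-checked. The equivalent loop in C has no such check and simply keeps reading whatever memory follows the array.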

I don't know the use case for the new init function, but maybe one should take llapi_init_nest as a reference when adjusting the newly implemented function.

@heplesser
Contributor Author

@med-ayssar Thanks for your detective work, that could have led to nasty consequences! Do you want to create a PR against my heplesser/pynest-ng-adac branch?

@med-ayssar
Contributor

Yes, I can do that, but could you please explain to me why the new init function was introduced? Could we not just use the old implementation again?

@med-ayssar
Contributor

@heplesser, Done ✅
