Issue with restarting chains with MPI #304
Hi,

Sorry if I missed something, but I really tried going through all previous issues concerning restarting chains with MPI, and I couldn't resolve the issue I'm having. I think the closest issues I noticed are this one and #189. I am using Monte Python v3.5.0 with CLASS v2.6.3.

I first do this:

python montepython/MontePython.py run -p input/my.param -o chains/mpirun/ -f 0

Then I run an initial MPI run:

mpirun -np 8 python montepython/MontePython.py run -p input/my.param -o chains/mpirun/ -N 10

Then I try restarting:

mpirun -np 8 python montepython/MontePython.py run -p input/my.param -o chains/mpirun/ -r chains/mpirun/2022-12-15_10__1.txt -N 50

What happens is that the new chains are created and they finish; however, my job script fails at the end. I notice that the initial chain (… __1.txt) is removed. When I have a look at the restarted chains, it seems they were not restarted from their original chains, but from some other value, and I am not even sure what that value is. I am really confused about what is happening, so I was wondering if you could please help out?

I am attaching the job output file below.

Thanks and best,
Andrej

mpirun_5780.txt

Comments
Hi Andrej,

There's definitely something odd going on here. We've had problems on some clusters where it wasn't properly dealing with the file number iteration (the __1, __2, etc.), and initially I thought some part of that was happening here, but now I'm fairly convinced it's just bugs in the code. It seems to be correctly creating files with the new names, but then it claims to be restarting only from the __1 chain, and at the end it tries to delete … . This last bug is easy to fix: after line 303 of io_mp.py, …

I actually want to change the behavior so it removes the old files once it's done copying to the new files (instead of at the end of the run), but I never got around to implementing this, and I usually do it by hand once I'm sure it's done copying, or after some hours once I get around to it. It's very rare that my runs hit the limit on -N; it's usually the wallclock limit stopping a run, or me aborting manually after seeing the chains are converged, and in both cases it doesn't remove the old files.

The first bug doesn't change the functionality of the code, as it's just printing the wrong information, so we can safely ignore it for now; I'm not sure off the top of my head how to fix it. Let me know if this seems to solve your problems. It should fix the errors, but you also mentioned it seems to be starting from the wrong point, so get back to me if it still seems to be working incorrectly.

I'm sorry about the bugs!

Best,
Thejs
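For illustration, here is a minimal sketch of the delete-after-copy behavior described above. This is not the actual io_mp.py code; the helper name and the chain file names are hypothetical. The idea is that each process copies its own old chain to the new restart file and removes the old one only after the copy is verified, instead of deleting files at the end of the run:

```python
import os
import shutil

def copy_then_remove(old_chain, new_chain):
    """Copy an existing chain to its restart file, then delete the
    original only once the copy is verified to be complete."""
    shutil.copyfile(old_chain, new_chain)
    # Verify before deleting, so an interrupted run never loses data.
    if (os.path.isfile(new_chain)
            and os.path.getsize(new_chain) == os.path.getsize(old_chain)):
        os.remove(old_chain)
    else:
        raise IOError(f"restart copy of {old_chain} looks incomplete")

# Each MPI process would handle only its own chain (index rank + 1), so
# no two processes ever try to delete the same file:
# from mpi4py import MPI
# rank = MPI.COMM_WORLD.Get_rank()
# copy_then_remove(f"chains/mpirun/2022-12-15_10__{rank + 1}.txt",
#                  f"chains/mpirun/2022-12-15_60__{rank + 1}.txt")
```

Tying the deletion to a verified per-process copy also avoids every process racing to delete the same file at the end of a run.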
Hi Thejs,

Thank you very much for your quick reply! I just had a look at the files, and the initial lines do match the original files (for 2-8; the first one is gone now). I'll definitely try the fix you suggested to resolve the errors.

Concerning the starting point in the restarted chains, what I find odd (perhaps wrongly) is that in general the first step after restarting has a drastically larger -lnLkl, i.e. it's as if the chain starts from some point very far away. For example, in one of the chains, in the initial run -lnLkl went from ~1e4 to ~1e2 in a few steps, while after the restart it goes back up to 1e3. Of course, here we're talking about on the order of tens of steps, but I noticed this behaviour when running longer chains (N=1e5) as well. Maybe I'm missing something here...

Thanks again and best,
Andrej
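As a quick way to inspect the jump described above, one could print the -lnLkl column on either side of the restart boundary. A sketch, assuming MontePython's plain-text chain format (multiplicity, then -lnLkl, then parameters) and hypothetical file names:

```python
import numpy as np

# Chain files are whitespace-separated: multiplicity, -lnLkl, parameters.
old = np.loadtxt("chains/mpirun/2022-12-15_10__2.txt")  # initial chain
new = np.loadtxt("chains/mpirun/2022-12-15_60__2.txt")  # restarted chain
n = len(old)

print("last -lnLkl before restart:", old[-1, 1])
# The restarted file begins with a copy of the old chain, so the first
# genuinely new point is row n.
if len(new) > n:
    print("first -lnLkl after restart:", new[n, 1])
```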
Hi Andrej,

Okay, I tested this, and once the io_mp.py bugfix is applied I get consistent behavior where the starting point is the last entry of the previous chain, for all chains. I checked this at the step-proposal level, to make sure it really started from the correct point. Without the bugfix, the starting point was the last point of the first chain, i.e. the __1 chain, which might have led you to see some unexpected behavior beyond what random steps alone could explain. Probably not significant enough to bias the results of an MCMC run, but a bug nevertheless.

If a bestfit file is passed, this behavior is overridden - in that case the run starts from the bestfit. You're not doing that; I'm just leaving a comment about it in case it's relevant for someone.

Best,
Thejs
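In the spirit of the check described above, a minimal sketch (hypothetical file names; not MontePython code) that verifies each restarted chain begins with an exact copy of its own previous chain:

```python
import numpy as np

def restart_is_consistent(old_chain, new_chain):
    """True if the restarted chain begins with the full previous chain,
    i.e. the run really continued from its own last point."""
    old = np.loadtxt(old_chain)
    new = np.loadtxt(new_chain)
    return len(new) >= len(old) and np.allclose(new[:len(old)], old)

for i in range(1, 9):  # chains __1 ... __8 from an 8-process run
    old = f"chains/mpirun/2022-12-15_10__{i}.txt"
    new = f"chains/mpirun/2022-12-15_60__{i}.txt"
    print(f"chain __{i}:", "ok" if restart_is_consistent(old, new) else "MISMATCH")
```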
Hi Thejs,

Thanks, I applied your bug fix and my (test) script doesn't fail anymore. I'm attaching the job output file below. It still says: …

Best,
Andrej
Hi Andrej,

After line 115 of …, … I hope this helps.

Best,
Thejs