ectrans result is not reproducible on GPU when NPROC changes #144
Comments
Hi Philippe, yes, this is expected. Reproducible results across different NPROC values are rather unlikely because of the batched FFTs and especially the batched GEMMs. The GEMMs run on multiple layers at once, so the result depends on the exact number of layers per rank. What is the use case here? Is this a production requirement or a debugging requirement? Depending on this, I would recommend
Any thoughts?
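A minimal sketch of the general effect behind this, assuming only that floating-point addition is not associative: when the same values are reduced in differently sized groups, the rounding happens at different points and the final result can change. This is not ectrans or GEMM code; the function `chunked_sum` and the input values are hypothetical, chosen only to make the effect visible in double precision.

```python
def chunked_sum(values, nchunks):
    """Reduce `values` in `nchunks` contiguous chunks, then add the partial sums.

    Hypothetical helper: it mimics work being split into groups whose size
    depends on how the data is partitioned.
    """
    n = len(values)
    size = (n + nchunks - 1) // nchunks  # ceiling division
    partials = []
    for start in range(0, n, size):
        acc = 0.0
        for v in values[start:start + size]:
            acc += v  # explicit loop: plain left-to-right accumulation
        partials.append(acc)
    total = 0.0
    for p in partials:
        total += p
    return total


# Same data, different grouping.
values = [1e16, 1.0, -1e16, 1.0]
print(chunked_sum(values, 1))  # one group of four values
print(chunked_sum(values, 2))  # two groups of two values
```

The two calls add exactly the same numbers, yet they round at different intermediate steps, so the totals differ. Whether and how much a real batched kernel differs depends on the library and the batch shape; the sketch only shows why a change in grouping is enough to break bitwise reproducibility.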
Hello Lukas, Thank you for these explanations. Currently we regularly check the reproducibility of our models (ARPEGE & AROME), and it proves quite useful when we need to debug the model, as we can reduce the number of nodes and still reproduce a problem. It is also something we require when writing specifications for buying a new machine. Apparently, everything in ARPEGE except the spectral transforms is reproducible when the number of MPI tasks changes. But I am not the only one who decides on these matters, so I will discuss this with other Météo-France colleagues. I would also be curious to hear ECMWF's opinion on this matter.
As Lukas mentioned, this has been the case for some time due to the batched maths.
Two more points that should help down the line for this
Apparently changing NPROC changes numerical results when running on NVIDIA accelerators.
Is this expected? If so, is it being investigated?
I can provide a small test case if necessary.