ectrans result is not reproducible on GPU when NPROC changes #144

Open
pmarguinaud opened this issue Sep 2, 2024 · 3 comments

Comments

@pmarguinaud
Collaborator

Apparently changing NPROC changes numerical results when running on NVIDIA accelerators.

Is this expected? If so, is it being investigated?

I can provide a small test case if necessary.

@lukasm91
Collaborator

lukasm91 commented Sep 2, 2024

Hi Philippe

Yes, this is expected. It is rather unlikely that we can have reproducible results with different NPROC due to the batched FFTs and especially batched GEMMs. The GEMMs run on multiple layers at once, so it depends on the exact number of layers per rank.
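To illustrate why the grouping of work can change the bits of a result, here is a minimal Python sketch. It is not ectrans code, and the thread does not confirm that this is the exact mechanism inside the batched GEMMs; it only shows that floating-point addition is not associative, so the same numbers reduced in differently sized groups can give different answers. The names `ordered_sum` and `chunked_sum`, and the `chunk` parameter standing in for "how much work ends up in one batch", are hypothetical.

```python
def ordered_sum(values):
    """Accumulate strictly left to right (avoids any compensated built-in sum)."""
    total = 0.0
    for v in values:
        total += v
    return total


def chunked_sum(values, chunk):
    """Reduce fixed-size chunks first, then add the partial sums in order."""
    partials = [ordered_sum(values[i:i + chunk])
                for i in range(0, len(values), chunk)]
    return ordered_sum(partials)


# The exact sum is 2.0, but 1.0 is absorbed whenever it is added to +/-1e20.
values = [1e20, 1.0, -1e20, 1.0]

one_group = chunked_sum(values, 4)   # ((1e20 + 1) - 1e20) + 1
two_groups = chunked_sum(values, 2)  # (1e20 + 1) + (-1e20 + 1)

print(one_group, two_groups)
```

Both results are legitimate roundings of the same mathematical sum; they differ only because the order of operations differs.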

What is the use case here? Is this a production requirement or a debugging requirement? Depending on this, I would recommend one of the following:

  • running the CPU version, if this is for debugging only and you need reproducible results in a different component
  • running with a fixed NPRTRV, which should be reproducible (or, let's say, we could very likely make it reproducible). For any run, you could go down to NPRTRV ranks, i.e. with NPRTRV=1 it would, in theory, be reproducible with 1 rank
  • implementing a slow version of the GEMMs/FFTs that simply iterates instead of doing batched GEMM. IMO it is questionable whether this is useful, because it would be really slow and probably only useful for debugging, in which case one might as well use the CPU version.
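The third option can be sketched as a toy Python model. This is not ectrans code and makes no claim about how ectrans or any GPU library is implemented: it assumes each layer is transformed on its own, with a fixed left-to-right accumulation order, so a layer's result cannot depend on which other layers share its rank. The functions `matvec_fixed_order`, `transform_layers` and `run_distributed`, and the round-robin distribution, are all hypothetical.

```python
def matvec_fixed_order(matrix, vector):
    """Matrix-vector product with each dot product accumulated left to right."""
    out = []
    for row in matrix:
        acc = 0.0
        for a, x in zip(row, vector):
            acc += a * x
        out.append(acc)
    return out


def transform_layers(matrix, layers):
    """Process one layer at a time instead of one batched operation."""
    return [matvec_fixed_order(matrix, layer) for layer in layers]


def run_distributed(matrix, layers, nranks):
    """Deal layers round-robin to `nranks` pretend ranks, then gather by index."""
    result = [None] * len(layers)
    for rank in range(nranks):
        owned = range(rank, len(layers), nranks)
        local = transform_layers(matrix, [layers[i] for i in owned])
        for i, out in zip(owned, local):
            result[i] = out
    return result


# Values chosen so that rounding matters: a different order would change bits.
matrix = [[1e20, 1.0, -1e20], [0.1, 0.2, 0.3]]
layers = [[1.0, 1.0, 1.0], [2.0, 0.5, 2.0], [0.3, 0.7, 0.3], [1.0, -1.0, 1.0]]

reference = run_distributed(matrix, layers, 1)
for nranks in (2, 3, 4):
    print(nranks, run_distributed(matrix, layers, nranks) == reference)
```

The price is the one noted above: every layer is handled by a separate small operation, which gives up the throughput that batching provides.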

Any thoughts?

@pmarguinaud
Collaborator Author

pmarguinaud commented Sep 3, 2024

Hello Lukas,

Thank you for these explanations.

Currently we regularly check the reproducibility of our models (ARPEGE & AROME), and it proves quite useful when we need to debug the model, as we can reduce the number of nodes and still reproduce a problem.

It is also something we demand when writing specifications for buying a new machine.

Apparently, everything in ARPEGE but the spectral transforms is reproducible when the number of MPI tasks changes.

But I am not the only one who decides on these matters, so I will discuss this with other Météo-France colleagues.

I would also be curious to hear ECMWF's opinion on this matter.

@marsdeno
Collaborator

marsdeno commented Sep 4, 2024

As Lukas mentioned, this has been the case for some time due to the batched maths.
My thoughts on this:

  • although the ability to run with task-count-independent results is an important debugging feature, I believe we do not run operationally with this mode activated
  • the task-count independence of results, or at least the ability to run in such a mode, should be maintained going forwards in the CPU codepath in ectrans
  • with these points said, I think that to debug a large GPU-enabled run we would be OK with a multi-step process: if the bug can't be triggered with CPU ectrans, then it is most likely a bug in ectrans; if it can, we regain task-count independence, allowing debugging on a smaller node count

Two more points that should help down the line for this:

  • we are heading towards a unified ectrans library which allows dispatching to GPU or CPU at runtime
  • ectrans testing should be improved, for correctness checking of both CPU and GPU code paths
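As a rough illustration of what such a correctness check might distinguish, the Python sketch below compares two result fields both bitwise and within a relative tolerance. It is not part of the ectrans test suite; `bitwise_equal`, `compare_fields` and the default `rel_tol=1e-12` are hypothetical choices made for the example.

```python
import struct


def bitwise_equal(a, b):
    """True only if the two doubles have the same 64-bit pattern."""
    return struct.pack("<d", a) == struct.pack("<d", b)


def compare_fields(reference, candidate, rel_tol=1e-12):
    """Return (bitwise_identical, within_tolerance, max_relative_difference)."""
    if len(reference) != len(candidate):
        raise ValueError("fields must have the same length")
    identical = all(bitwise_equal(r, c) for r, c in zip(reference, candidate))
    max_rel = 0.0
    for r, c in zip(reference, candidate):
        scale = max(abs(r), abs(c))
        if scale > 0.0:
            max_rel = max(max_rel, abs(r - c) / scale)
    return identical, max_rel <= rel_tol, max_rel


# Stand-ins for the outputs of two code paths; the second value differs
# from 2.0 by one unit in the last place.
cpu_like = [1.0, 2.0, 3.0]
gpu_like = [1.0, 2.0000000000000004, 3.0]

identical, close, max_rel = compare_fields(cpu_like, gpu_like)
print(identical, close, max_rel)
```

A task-count-independence test would require the bitwise check to pass across different task counts, whereas a CPU-versus-GPU comparison would more plausibly rely on the tolerance check.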
