Models are not deterministic / reproducible on GPU #6490
Comments
Thanks for the report! I just double-checked with the latest code from We'll look into this!
Thank you @svlandeg! We can continue our experimentation phase even without determinism, since the losses from runs with different random seeds are more or less the same, with no big fluctuations. However, it would be great to have a new release with the fix soon. Please note that the same thing happens when using a pre-trained model, e.g. en_core_web_lg.
Hi @svlandeg! Do we have any update on this?
I managed to track down the source of this problem. In the backprop in Unfortunately there is no simple substitution for this without consequences. We could unroll the addition to control the order of operations, but that would be too slow. This is also a known issue in PyTorch (which doesn't use cupy but has a similar implementation), but because the actual change in values is small, it's not generally considered a problem (see pytorch/pytorch#50469). That said, we think we can design a deterministic equivalent with a more acceptable speed penalty, and we will be taking a look at it. In the meantime this is something to be aware of. This will be the main issue for it, so just subscribe here if you'd like updates.
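To illustrate why the order of additions matters, here is a minimal pure-Python sketch. It is not spaCy's or cupy's code: `scatter_add` is a hypothetical helper that mimics several gradient updates being accumulated into the same output slot, and the `order` argument stands in for the order in which parallel GPU threads happen to finish. Floating-point addition is not associative, so a different order can give a slightly different sum.

```python
def scatter_add(size, indices, values, order):
    """Accumulate values[i] into out[indices[i]], visiting updates in `order`.

    Hypothetical helper for illustration only; `order` plays the role of the
    thread scheduling that is not under our control on a GPU.
    """
    out = [0.0] * size
    for i in order:
        out[indices[i]] += values[i]
    return out


# Three updates that all land in the same output slot.
indices = [0, 0, 0]
values = [0.1, 0.2, 0.3]

forward = scatter_add(1, indices, values, order=[0, 1, 2])   # (0.1 + 0.2) + 0.3
backward = scatter_add(1, indices, values, order=[2, 1, 0])  # (0.3 + 0.2) + 0.1

# Same inputs, different accumulation order: the sums differ in the last bits.
print(forward[0] == backward[0])
print(abs(forward[0] - backward[0]))
```

Fixing `order` (the "unrolled" addition mentioned above) makes the result repeatable, at the cost of giving up the parallelism that makes the GPU version fast. The difference per operation is tiny, which fits the observation in this thread that losses across runs stay close to each other.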
How to reproduce the behaviour
I cannot get the same results across runs when training an NER model on a GPU in Google Colab.
When running the same code on the CPU, the results appear to be reproducible.
However, after enabling the GPU with prefer_gpu(), the results differ from run to run.
```
# Example code
```
Your Environment