-
Notifications
You must be signed in to change notification settings - Fork 2.7k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Minor device settable edit #251
Conversation
ad5e912
to
a013975
Compare
wait there's no correct way to check if mps is available or not in the code? This looks weird |
It does doesn't it? It is something that happens when we run this in the workflow action, see the logs here: @rosslwheeler also noticed the same on his fork, see this issue: #242. Might be a matter of time before it happens here? There is nothing wrong with the code, one should be able to ask torch if the MPS backend is available. |
In the builds that succeed, the mps available return False and we get this warning: "Skipping device Apple Paravirtual device that does not support Metal 2.0". It is interesting, it seems like on our forks this check returns True and we go ahead using metal even if it isn't really possible. I think this is a bug that is out of our hands and maybe it will get fixed before we have to worry about it. |
And why can't we just add a new flag --device to our argparse? that overrides an autodetect? |
Oh! we already use argparse. Sorry I missed that.. I remember some comments from nanoGPT where you made your own argparser.. Would you like me to write a --device argument that overrides if set? This entire thing is only to get around this problem that appears to be a bug on the platform side; because naturally mps.is_available() should be enough really. |
9a30adb
to
aa7fa8b
Compare
I threw it in as a --device argument. |
pushed the fix latest commit. |
One use case for --device (that is not just to get around a CI bug) is if you want to compare CPU performance with the accelerator at hand. |
In the CI, we run train_gpt2.py on Mac.
But lately some odd error appears. It looks like this:
Before then we had this situation:
It means that we could not really use Metal in the virtual CI workflow, our device became selected to cpu.
Now, it seems pytorch in some cases lets us use Metal, and we end up in that situation where when the API calls are done by Pytorch; we get errors.
So in order to specify that we actually want to run on the CPU (even whn MPS is what appears to be falsely "marked" as available) I made this little change.
I know you have a history with argparse, so I only did this via environment variables.