
Minor device settable edit #251

Closed

Conversation

Ricardicus
Contributor

In the CI, we run train_gpt2.py on Mac.
But lately an odd error has appeared. It looks like this:

2024-04-24T17:47:45.8970760Z RuntimeError: MPS backend out of memory (MPS allocated: 0 bytes, other allocations: 0 bytes, max allowed: 7.93 GB). Tried to allocate 147.24 MB on private pool. Use PYTORCH_MPS_HIGH_WATERMARK_RATIO=0.0 to disable upper limit for memory allocations (may cause system failure).
2024-04-24T17:47:46.2765760Z ##[error]Process completed with exit code 1.

Before that, we had this situation:

/Library/Frameworks/Python.framework/Versions/3.12/lib/python3.12/site-packages/torch/backends/mps/__init__.py:22: UserWarning: Skipping device Apple Paravirtual device that does not support Metal 2.0 (Triggered internally at /Users/runner/work/pytorch/pytorch/pytorch/aten/src/ATen/mps/MPSDevice.mm:101.)
return torch._C._mps_is_available()
Running pytorch 2.2.2
using device: cpu

This meant we could not actually use Metal in the virtualized CI workflow, so the device fell back to cpu.
Now it seems PyTorch sometimes lets us select Metal anyway, and we end up in a situation where the API calls made by PyTorch fail.

So in order to specify that we actually want to run on the CPU (even when MPS falsely appears to be "marked" as available), I made this little change.

I know you have a history with argparse, so I did this via environment variables only.
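The override described above can be sketched roughly like this (a minimal sketch, not the PR's actual code: the function name `select_device` and the variable name `FORCE_DEVICE` are hypothetical, and availability is passed in as plain booleans so the sketch does not depend on torch):

```python
import os

def select_device(mps_available: bool, cuda_available: bool) -> str:
    """Pick a compute device, honoring an environment-variable override.

    The env var name FORCE_DEVICE is illustrative only.
    """
    forced = os.environ.get("FORCE_DEVICE")
    if forced:
        # Explicit override wins, even if an accelerator looks available.
        return forced
    if cuda_available:
        return "cuda"
    if mps_available:
        return "mps"
    return "cpu"

# In CI one would export FORCE_DEVICE=cpu before running the script:
os.environ["FORCE_DEVICE"] = "cpu"
print(select_device(mps_available=True, cuda_available=False))  # cpu
```

This keeps the autodetect path untouched for normal users while letting the workflow pin the device.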

@karpathy
Owner

wait, there's no correct way to check in code whether mps is available or not? This looks weird

@Ricardicus
Contributor Author

It does, doesn't it?

It is something that happens when we run this in the workflow action; see the logs here:
https://github.com/Ricardicus/llm.c/actions/runs/8827724123/job/24235441372

@rosslwheeler also noticed the same on his fork; see issue #242. It might only be a matter of time before it happens here too. There is nothing wrong with the code: one should be able to ask torch whether the MPS backend is available.

@Ricardicus
Contributor Author

Ricardicus commented Apr 25, 2024

In the builds that succeed, the MPS availability check returns False and we get this warning: "Skipping device Apple Paravirtual device that does not support Metal 2.0". Interestingly, on our forks this check returns True and we go ahead using Metal even though it isn't actually possible. I think this is a bug that is out of our hands, and maybe it will get fixed before we have to worry about it.

@karpathy
Owner

And why can't we just add a new flag --device to our argparse that overrides the autodetect?

@Ricardicus
Contributor Author

Oh! We already use argparse; sorry I missed that. I remember some comments from nanoGPT where you made your own argparser. Would you like me to write a --device argument that overrides the autodetect if set? This whole thing is only to work around what appears to be a bug on the platform side; naturally, mps.is_available() should really be enough.

@Ricardicus
Contributor Author

I threw it in as a --device argument.
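The flag discussed above would look roughly like this (a sketch of the general pattern, not the actual train_gpt2.py code: the helper name `pick_device` and the exact flag semantics are assumptions):

```python
import argparse

def pick_device(arg_device: str, mps_available: bool) -> str:
    """Return the device to use: an explicit --device wins over autodetect."""
    if arg_device:
        return arg_device
    return "mps" if mps_available else "cpu"

parser = argparse.ArgumentParser()
parser.add_argument(
    "--device",
    type=str,
    default="",
    help="override the autodetected device, e.g. cpu|cuda|mps",
)

# Simulated CI invocation forcing the CPU despite MPS looking available:
args = parser.parse_args(["--device", "cpu"])
print(pick_device(args.device, mps_available=True))  # cpu
```

With an empty default, users who pass nothing keep the autodetect behavior, while CI can pass `--device cpu` explicitly.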

@karpathy
Owner

pushed the fix in the latest commit.

@karpathy karpathy closed this Apr 25, 2024
@Ricardicus
Contributor Author

One use case for --device (beyond working around a CI bug) is comparing CPU performance against the accelerator at hand.
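Such a comparison could be driven by a small timing harness like the following (a generic sketch; the `benchmark` helper and the commented `train_step` usage are hypothetical, and a pure-Python workload stands in for a real training step):

```python
import time

def benchmark(fn, repeats: int = 5) -> float:
    """Run a callable several times and return the best wall-clock time in seconds."""
    best = float("inf")
    for _ in range(repeats):
        t0 = time.perf_counter()
        fn()
        best = min(best, time.perf_counter() - t0)
    return best

# Hypothetical usage: run the script once with --device cpu and once with
# --device mps, timing the same step on each, e.g.:
#   cpu_t = benchmark(lambda: train_step(model_cpu))
#   mps_t = benchmark(lambda: train_step(model_mps))
elapsed = benchmark(lambda: sum(i * i for i in range(100_000)))
print(f"best of 5: {elapsed:.4f}s")
```

Best-of-N is used rather than an average so that one-off scheduler hiccups don't skew the comparison.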
