
Minor device settable edit #251

Closed

Conversation

Ricardicus
Contributor

In the CI, we run train_gpt2.py on Mac.
But lately an odd error has appeared. It looks like this:

2024-04-24T17:47:45.8970760Z RuntimeError: MPS backend out of memory (MPS allocated: 0 bytes, other allocations: 0 bytes, max allowed: 7.93 GB). Tried to allocate 147.24 MB on private pool. Use PYTORCH_MPS_HIGH_WATERMARK_RATIO=0.0 to disable upper limit for memory allocations (may cause system failure).
2024-04-24T17:47:46.2765760Z ##[error]Process completed with exit code 1.

Before that, we had this situation:

/Library/Frameworks/Python.framework/Versions/3.12/lib/python3.12/site-packages/torch/backends/mps/__init__.py:22: UserWarning: Skipping device Apple Paravirtual device that does not support Metal 2.0 (Triggered internally at /Users/runner/work/pytorch/pytorch/pytorch/aten/src/ATen/mps/MPSDevice.mm:101.)
return torch._C._mps_is_available()
Running pytorch 2.2.2
using device: cpu

This meant we could not actually use Metal in the virtualized CI workflow, so the device fell back to cpu.
Now it seems PyTorch sometimes lets us select Metal anyway, and we end up in a situation where the API calls made by PyTorch fail.

So in order to specify that we actually want to run on the CPU (even when MPS falsely appears to be "marked" as available), I made this little change.

I know you have a history with argparse, so I did this via environment variables only.
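The override described above can be sketched roughly like this (a minimal sketch, not the PR's actual code: the function name `select_device` and the variable name `FORCE_DEVICE` are hypothetical, and availability is passed in as plain booleans so the sketch does not depend on torch):

```python
import os

def select_device(mps_available: bool, cuda_available: bool) -> str:
    """Pick a compute device, honoring an environment-variable override.

    The env var name FORCE_DEVICE is illustrative only.
    """
    forced = os.environ.get("FORCE_DEVICE")
    if forced:
        # Explicit override wins, even if an accelerator looks available.
        return forced
    if cuda_available:
        return "cuda"
    if mps_available:
        return "mps"
    return "cpu"

# In CI one would export FORCE_DEVICE=cpu before running the script:
os.environ["FORCE_DEVICE"] = "cpu"
print(select_device(mps_available=True, cuda_available=False))  # cpu
```

This keeps the autodetect path untouched for normal users while letting the workflow pin the device.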

@karpathy
Owner

wait, there's no correct way to check in code whether mps is available or not? This looks weird

@Ricardicus
Contributor Author

It does, doesn't it?

It is something that happens when we run this in the workflow action; see the logs here:
https://github.com/Ricardicus/llm.c/actions/runs/8827724123/job/24235441372

@rosslwheeler also noticed the same on his fork; see issue #242. It might only be a matter of time before it happens here too. There is nothing wrong with the code: one should be able to ask torch whether the MPS backend is available.

@Ricardicus
Contributor Author

Ricardicus commented Apr 25, 2024

In the builds that succeed, the MPS availability check returns False and we get this warning: "Skipping device Apple Paravirtual device that does not support Metal 2.0". Interestingly, on our forks this check returns True and we go ahead using Metal even though it isn't actually possible. I think this is a bug that is out of our hands, and maybe it will get fixed before we have to worry about it.

@karpathy
Owner

And why can't we just add a new flag --device to our argparse that overrides the autodetect?

@Ricardicus
Contributor Author

Oh! We already use argparse; sorry I missed that. I remember some comments from nanoGPT where you made your own argparser. Would you like me to write a --device argument that overrides the autodetect if set? This whole thing is only to work around what appears to be a bug on the platform side; naturally, mps.is_available() should really be enough.

@Ricardicus
Contributor Author

I threw it in as a --device argument.
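The flag discussed above would look roughly like this (a sketch of the general pattern, not the actual train_gpt2.py code: the helper name `pick_device` and the exact flag semantics are assumptions):

```python
import argparse

def pick_device(arg_device: str, mps_available: bool) -> str:
    """Return the device to use: an explicit --device wins over autodetect."""
    if arg_device:
        return arg_device
    return "mps" if mps_available else "cpu"

parser = argparse.ArgumentParser()
parser.add_argument(
    "--device",
    type=str,
    default="",
    help="override the autodetected device, e.g. cpu|cuda|mps",
)

# Simulated CI invocation forcing the CPU despite MPS looking available:
args = parser.parse_args(["--device", "cpu"])
print(pick_device(args.device, mps_available=True))  # cpu
```

With an empty default, users who pass nothing keep the autodetect behavior, while CI can pass `--device cpu` explicitly.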

@karpathy
Owner

pushed the fix in the latest commit.

@karpathy karpathy closed this Apr 25, 2024
@Ricardicus
Contributor Author

One use case for --device (beyond working around a CI bug) is comparing CPU performance against the accelerator at hand.
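Such a comparison could be driven by a small timing harness like the following (a generic sketch; the `benchmark` helper and the commented `train_step` usage are hypothetical, and a pure-Python workload stands in for a real training step):

```python
import time

def benchmark(fn, repeats: int = 5) -> float:
    """Run a callable several times and return the best wall-clock time in seconds."""
    best = float("inf")
    for _ in range(repeats):
        t0 = time.perf_counter()
        fn()
        best = min(best, time.perf_counter() - t0)
    return best

# Hypothetical usage: run the script once with --device cpu and once with
# --device mps, timing the same step on each, e.g.:
#   cpu_t = benchmark(lambda: train_step(model_cpu))
#   mps_t = benchmark(lambda: train_step(model_mps))
elapsed = benchmark(lambda: sum(i * i for i in range(100_000)))
print(f"best of 5: {elapsed:.4f}s")
```

Best-of-N is used rather than an average so that one-off scheduler hiccups don't skew the comparison.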
