
CUDA error #17

Open
mike-bioinf opened this issue Mar 20, 2025 · 3 comments

Comments

@mike-bioinf

Hello,
I tried the fine-tuning procedure on different datasets and noticed that it fails for the larger ones (starting from around 600 samples). Initially, I was using a previous commit of this package, and the error I encountered was:

torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 3.52 GiB. GPU 0 has a total capacity of 31.74 GiB of which 2.03 GiB is free. Including non-PyTorch memory, this process has 29.71 GiB memory in use. Of the allocated memory 29.17 GiB is allocated by PyTorch, and 168.15 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation.  See documentation for Memory Management  (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)

Then, after cloning and using the current version, I get the following:

  File "/lustre/home/epasolli/tools/miniconda3/envs/tabpfn/lib/python3.12/site-packages/tabpfn/model/multi_head_attention.py", line 710, in compute_attention_heads
    attention_head_outputs = torch.nn.functional.scaled_dot_product_attention(
                             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
RuntimeError: CUDA error: invalid configuration argument
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.

I also tried very small batch sizes, but the problem remains.

In addition, I encountered a second issue related to the batch division process for small datasets. From my understanding (sorry if this is not correct), the data loader uses an sklearn stratified splitter with a fixed number of 10 folds regardless of the specified batch size. However, for small datasets this can be problematic, since the smallest class may not have at least 10 observations, and sklearn enforces this (see the minimal reproduction below).
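
For reference, here is a minimal standalone reproduction of the sklearn behavior I mean (plain StratifiedKFold with made-up data just to trigger the error, independent of this repo's data loader):

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

# Tiny toy dataset where every class has fewer than 10 members.
X = np.random.rand(15, 5)
y = np.array([0] * 8 + [1] * 7)

skf = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)
try:
    list(skf.split(X, y))
except ValueError as err:
    # "n_splits=10 cannot be greater than the number of members in each class."
    print(err)
```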

@LennartPurucker
Owner

LennartPurucker commented Mar 20, 2025

Heyho!

As far as I know from my experience with CUDA VRAM issues, `invalid configuration argument` also points to insufficient VRAM.

How many features do you have? The VRAM requirements scale quadratically (i.e., very expensively) with the number of features and samples, and backpropagation makes this even more expensive. So if you have either a lot of samples or a lot of features, you will quickly run into VRAM issues.
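
To make the quadratic part concrete (purely an illustration of the scaling behavior, not the exact memory model of this code):

```python
# If a term of the memory cost is proportional to n_samples**2 (attention across
# samples), then doubling the sample count multiplies that term by 4.
def relative_cost(n_samples: int, n_ref: int = 600) -> float:
    return (n_samples / n_ref) ** 2

print(relative_cost(600))   # 1.0
print(relative_cost(1200))  # 4.0
print(relative_cost(2400))  # 16.0
```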

If you have a small matrix (a small number of cells, where cells = number of features × number of samples), then it could be a VRAM leak, but I am not aware of a problem in my code that would cause this.

To reduce VRAM usage, you could try training with even lower precision or using more GPUs. However, I have not explored many options in this regard so far.
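
For the lower-precision route, here is a minimal sketch of what that could look like in a generic PyTorch loop (the toy model and data are placeholders, not this repo's fine-tuning code):

```python
import torch
from torch import nn

# Toy stand-ins: the only point here is the autocast pattern.
model = nn.Sequential(nn.Linear(100, 64), nn.ReLU(), nn.Linear(64, 2)).cuda()
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)
X = torch.randn(256, 100)
y = torch.randint(0, 2, (256,))

for step in range(10):
    optimizer.zero_grad(set_to_none=True)
    # Run the forward pass in bfloat16 to roughly halve activation memory.
    # (With float16 instead, you would also want a torch.cuda.amp.GradScaler.)
    with torch.autocast(device_type="cuda", dtype=torch.bfloat16):
        loss = nn.functional.cross_entropy(model(X.cuda()), y.cuda())
    loss.backward()
    optimizer.step()
```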

From my understanding (sorry if this is not correct), the data loader uses an sklearn stratified splitter with a fixed number of 10 folds regardless of the specified batch size. However, for small datasets this can be problematic, since the smallest class may not have at least 10 observations, and sklearn enforces this.

Indeed, this is a known bug / intended behavior for sklearn splitting, and I have not used more clever splitting techniques in this repo. I would recommend manually editing the splitting strategy or folds for now (e.g., see the cross_val_splits argument here).
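
As an illustration of what editing the folds could look like, here is a small sklearn-only sketch that caps the fold count at the size of the smallest class; how exactly this would be passed through cross_val_splits depends on the repo, so treat the wiring as an assumption:

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

def make_safe_stratified_splitter(y, max_splits: int = 10, seed: int = 0) -> StratifiedKFold:
    """Cap the number of folds at the size of the smallest class (but use at least 2)."""
    smallest_class = int(np.bincount(y).min())
    n_splits = max(2, min(max_splits, smallest_class))
    return StratifiedKFold(n_splits=n_splits, shuffle=True, random_state=seed)

# Example: the minority class has only 4 members, so we get 4 folds instead of 10.
y = np.array([0] * 40 + [1] * 4)
splitter = make_safe_stratified_splitter(y)
print(splitter.n_splits)  # 4
```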

@mike-bioinf
Author

Hi Lennart,

The number of features varies across datasets, but in all cases, they are under 500, as I filtered them to remain within TabPFN's recommended limits.

For personal learning, I created a smaller version of this repository focused specifically on classification tasks, following your code and logic (here is a reference if you want to take a peek: https://github.com/mike-bioinf/itertabpfn/tree/master/src/itertabpfn/finetune), and strangely the issue does not occur with that implementation.

One difference that comes to mind is that I used the classic AdamW optimizer instead of the schedule-free one, since I am unfamiliar with it. However, I’m unsure whether this has anything to do with the problem.

Anyway, I’ll dig further into the matter and come back to you if I find something new.

Thanks for your time!

@LennartPurucker
Owner

Nice, great to hear.

AdamW optimizer instead of the schedule-free [...] I’m unsure whether this has anything to do with the problem.

No, that makes sense. The implementation of schedulefree is likely not at a PyTorch-like quality level so far, so this might be a bug on their side. Or there is a problem with the requirements or platform, as schedulefree is more specialized.
