
CUDA error #17

Open
mike-bioinf opened this issue Mar 20, 2025 · 3 comments

Comments

@mike-bioinf

Hello,
I tried the fine-tuning procedure on different datasets and noticed that it fails for the larger ones (starting from around 600 samples). Initially, I was using a previous commit of this package, and the error I encountered was:

torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 3.52 GiB. GPU 0 has a total capacity of 31.74 GiB of which 2.03 GiB is free. Including non-PyTorch memory, this process has 29.71 GiB memory in use. Of the allocated memory 29.17 GiB is allocated by PyTorch, and 168.15 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation.  See documentation for Memory Management  (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)

Then, after cloning and using the current version, I get the following:

  File "/lustre/home/epasolli/tools/miniconda3/envs/tabpfn/lib/python3.12/site-packages/tabpfn/model/multi_head_attention.py", line 710, in compute_attention_heads
    attention_head_outputs = torch.nn.functional.scaled_dot_product_attention(
                             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
RuntimeError: CUDA error: invalid configuration argument
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.

I also tried very small batch sizes, but the problem remains.

In addition, I encountered a second issue related to the batch division process for small datasets. From my understanding (sorry if this is not correct), the data loader uses an sklearn stratified splitter with a fixed number of 10 folds regardless of the specified batch size. However, for small datasets this can be problematic, since the smallest class may not have at least 10 observations, and sklearn enforces this (see the minimal reproduction below).
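
For reference, here is a minimal standalone reproduction of the sklearn behavior I mean (plain StratifiedKFold with made-up data just to trigger the error, independent of this repo's data loader):

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

# Tiny toy dataset where every class has fewer than 10 members.
X = np.random.rand(15, 5)
y = np.array([0] * 8 + [1] * 7)

skf = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)
try:
    list(skf.split(X, y))
except ValueError as err:
    # "n_splits=10 cannot be greater than the number of members in each class."
    print(err)
```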

@LennartPurucker
Owner

LennartPurucker commented Mar 20, 2025

Heyho!

As far as I know from my experience with CUDA VRAM issues, `invalid configuration argument` also points to insufficient VRAM.

How many features do you have? The VRAM requirements scale quadratically (i.e., very expensively) with the number of features and samples, and backpropagation makes this even more expensive. So if you have either a lot of samples or a lot of features, you will quickly run into VRAM issues.
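
To make the quadratic part concrete (purely an illustration of the scaling behavior, not the exact memory model of this code):

```python
# If a term of the memory cost is proportional to n_samples**2 (attention across
# samples), then doubling the sample count multiplies that term by 4.
def relative_cost(n_samples: int, n_ref: int = 600) -> float:
    return (n_samples / n_ref) ** 2

print(relative_cost(600))   # 1.0
print(relative_cost(1200))  # 4.0
print(relative_cost(2400))  # 16.0
```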

If you have a small matrix (a small number of cells, where cells = number of features × number of samples), then it could be a VRAM leak, but I am not aware of a problem in my code that would cause this.

To reduce VRAM usage, you could try training with even lower precision or using more GPUs. However, I have not explored many options in this regard so far.
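
For the lower-precision route, here is a minimal sketch of what that could look like in a generic PyTorch loop (the toy model and data are placeholders, not this repo's fine-tuning code):

```python
import torch
from torch import nn

# Toy stand-ins: the only point here is the autocast pattern.
model = nn.Sequential(nn.Linear(100, 64), nn.ReLU(), nn.Linear(64, 2)).cuda()
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)
X = torch.randn(256, 100)
y = torch.randint(0, 2, (256,))

for step in range(10):
    optimizer.zero_grad(set_to_none=True)
    # Run the forward pass in bfloat16 to roughly halve activation memory.
    # (With float16 instead, you would also want a torch.cuda.amp.GradScaler.)
    with torch.autocast(device_type="cuda", dtype=torch.bfloat16):
        loss = nn.functional.cross_entropy(model(X.cuda()), y.cuda())
    loss.backward()
    optimizer.step()
```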

From my understanding (sorry if this is not correct), the data loader uses an sklearn stratified splitter with a fixed number of 10 folds regardless of the specified batch size. However, for small datasets this can be problematic, since the smallest class may not have at least 10 observations, and sklearn enforces this.

Indeed, this is a known bug / intended behavior for sklearn splitting, and I have not used more clever splitting techniques in this repo. I would recommend manually editing the splitting strategy or folds for now (e.g., see the cross_val_splits argument here).
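
As an illustration of what editing the folds could look like, here is a small sklearn-only sketch that caps the fold count at the size of the smallest class; how exactly this would be passed through cross_val_splits depends on the repo, so treat the wiring as an assumption:

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

def make_safe_stratified_splitter(y, max_splits: int = 10, seed: int = 0) -> StratifiedKFold:
    """Cap the number of folds at the size of the smallest class (but use at least 2)."""
    smallest_class = int(np.bincount(y).min())
    n_splits = max(2, min(max_splits, smallest_class))
    return StratifiedKFold(n_splits=n_splits, shuffle=True, random_state=seed)

# Example: the minority class has only 4 members, so we get 4 folds instead of 10.
y = np.array([0] * 40 + [1] * 4)
splitter = make_safe_stratified_splitter(y)
print(splitter.n_splits)  # 4
```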

@mike-bioinf
Author

Hi Lennart,

The number of features varies across datasets, but in all cases, they are under 500, as I filtered them to remain within TabPFN's recommended limits.

For personal learning, I created a smaller version of this repository focused specifically on classification tasks, following your code and logic (here is a reference if you want to take a peek: https://github.com/mike-bioinf/itertabpfn/tree/master/src/itertabpfn/finetune), and strangely the issue does not occur with that implementation.

One difference that comes to mind is that I used the classic AdamW optimizer instead of the schedule-free one, since I am unfamiliar with it. However, I’m unsure whether this has anything to do with the problem.

Anyway, I’ll dig further into the matter and come back to you if I find something new.

Thanks for your time!

@LennartPurucker
Owner

Nice, great to hear.

AdamW optimizer instead of the schedule-free [...] I’m unsure whether this has anything to do with the problem.

No, that makes sense. The implementation of schedulefree is likely not at a PyTorch-like quality level so far, so this might be a bug on their side. Or there is a problem with the requirements or platform, as schedulefree is more specialized.
