llamamodel: prevent CUDA OOM crash by allocating VRAM early #2393
This is a proposed fix for the issue where CUDA OOM can happen later than expected and crash GPT4All. The question is whether the benefit (falling back early instead of crashing later) is worth the load latency cost.
After a model is loaded onto a CUDA device, we run one full batch of (meaningless) input through it. Small batches don't use as much VRAM, and llama.cpp appears to allocate the full KV cache for the context regardless of where in the context the input lies, so n_batch matters a lot while n_past does not seem to matter at all.
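A minimal sketch of that idea, assuming a llama.cpp model and context already loaded with the requested GPU offload. The function name testModel() comes from the description below, but the body, the use of the BOS token as filler, and the exception-based error handling (which assumes llama.cpp reports the allocation failure via an exception rather than aborting) are illustrative assumptions, not the PR's exact code:

```cpp
#include <exception>
#include <iostream>
#include <llama.h>

// Sketch only: run one full batch of throwaway input so that any CUDA OOM
// happens here, during load, instead of on the user's first real prompt.
static bool testModel(llama_model *model, llama_context *ctx)
{
    const int32_t n_batch = int32_t(llama_n_batch(ctx)); // full batch so worst-case VRAM is allocated now

    llama_batch batch = llama_batch_init(n_batch, /*embd*/ 0, /*n_seq_max*/ 1);

    // Fill the batch with meaningless input: the BOS token repeated n_batch times.
    const llama_token bos = llama_token_bos(model);
    for (int32_t i = 0; i < n_batch; i++) {
        batch.token   [i]    = bos;
        batch.pos     [i]    = i;
        batch.n_seq_id[i]    = 1;
        batch.seq_id  [i][0] = 0;
        batch.logits  [i]    = false;
    }
    batch.logits[n_batch - 1] = true; // only the last token needs logits
    batch.n_tokens = n_batch;

    bool ok = true;
    try {
        // If the compute buffers or KV cache do not fit in VRAM, the failure
        // surfaces here rather than later during normal inference.
        if (llama_decode(ctx, batch) != 0)
            ok = false;
    } catch (const std::exception &e) {
        std::cerr << "warning: model test failed: " << e.what() << '\n';
        ok = false;
    }

    // Discard the throwaway tokens so the real conversation starts from an empty KV cache.
    llama_kv_cache_clear(ctx);
    llama_batch_free(batch);
    return ok;
}
```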
The call to testModel() shows up in the UI as the progress bar sitting near 100% before the load completes. With 24 layers of Llama 3 8B offloaded, this takes about 2 seconds on my GTX 970 and 0.3 seconds on my Tesla P40. Worst-case timing under high memory pressure and a batch size of 512 (which I had to patch in, since the upper limit is normally 128) is about 11.2 seconds. At a batch size of 128 I have seen it take as long as 7.6 seconds.
Testing
You can test this PR by choosing a model that does not fit in your card's VRAM and finding a number of layers to offload that just barely doesn't fit. On the main branch, GPT4All can crash either during load or while you are sending input to it. With this PR, an exception is logged to the console during testModel() and GPT4All falls back to CPU, as it already does for Kompute.
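For context, a hedged sketch of what that fallback path could look like at the call site, under the same assumptions as the testModel() sketch above. The parameter values and variable names are illustrative; the real GPT4All load path differs:

```cpp
// Hypothetical call-site sketch: load with GPU offload, run the dummy-batch
// test, and reload CPU-only if it fails.
llama_model_params mparams = llama_model_default_params();
mparams.n_gpu_layers = 24;                       // e.g. the layer count requested in the UI

llama_context_params cparams = llama_context_default_params();
cparams.n_ctx   = 2048;                          // assumed context size
cparams.n_batch = 128;                           // the usual upper limit mentioned above

llama_model   *model = llama_load_model_from_file("/path/to/model.gguf", mparams);
llama_context *ctx   = model ? llama_new_context_with_model(model, cparams) : nullptr;

if (!ctx || !testModel(model, ctx)) {
    std::cerr << "warning: GPU offload failed the VRAM test, falling back to CPU\n";
    if (ctx)   llama_free(ctx);
    if (model) llama_free_model(model);

    mparams.n_gpu_layers = 0;                    // CPU-only reload
    model = llama_load_model_from_file("/path/to/model.gguf", mparams);
    ctx   = model ? llama_new_context_with_model(model, cparams) : nullptr;
}
```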