How to prevent model from using RAM when offloading full to GPU? #1490
Answered by yassinz

Simplegram asked this question in Q&A
How do I load a GGUF only into VRAM and not into RAM? In the original llama.cpp, the model loads only into VRAM, leaving RAM completely empty, and I can load models as large as the combined size of my RAM and VRAM. I want to use llama-cpp-python because llama.cpp's API endpoint is not compatible with llamaindex's ReAct agent. Thanks in advance!
yassinz answered on May 28, 2024
Set the parameter use_mmap=False.
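For example, a minimal sketch of loading a model fully onto the GPU with llama-cpp-python; the model path and context size are placeholders for your own setup:

```python
from llama_cpp import Llama

# use_mmap=False prevents the weights from staying memory-mapped in system RAM;
# n_gpu_layers=-1 offloads all layers to the GPU.
# model_path and n_ctx below are example values -- adjust for your setup.
llm = Llama(
    model_path="./models/your-model.Q4_K_M.gguf",
    n_gpu_layers=-1,
    use_mmap=False,
    n_ctx=4096,
)

output = llm("Q: What is the capital of France? A:", max_tokens=32)
print(output["choices"][0]["text"])
```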
Answer selected by Simplegram