How to prevent model from using RAM when offloading full to GPU? #1490
Answered by yassinz

Simplegram asked this question in Q&A
How do I load a GGUF only into VRAM and not into RAM? In the original llama.cpp, the model loads only into VRAM, leaving RAM completely empty, and I can load models as large as the combined size of my RAM and VRAM. I want to use llama-cpp-python because llama.cpp's API endpoint is not compatible with llamaindex's ReAct agent. Thanks in advance!
yassinz answered on May 28, 2024
Set the parameter use_mmap=False.
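For example, a minimal sketch of loading a model fully onto the GPU with llama-cpp-python; the model path and context size are placeholders for your own setup:

```python
from llama_cpp import Llama

# use_mmap=False prevents the weights from staying memory-mapped in system RAM;
# n_gpu_layers=-1 offloads all layers to the GPU.
# model_path and n_ctx below are example values -- adjust for your setup.
llm = Llama(
    model_path="./models/your-model.Q4_K_M.gguf",
    n_gpu_layers=-1,
    use_mmap=False,
    n_ctx=4096,
)

output = llm("Q: What is the capital of France? A:", max_tokens=32)
print(output["choices"][0]["text"])
```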
Answer selected by Simplegram