[Doc]: What version of vllm and lmcache does that example use? https://github.com/vllm-project/vllm/blob/main/examples/offline_inference/cpu_offload_lmcache.py #15874
Comments
What is the advantage of using lmcache for CPU offloading? Why not just use cpu_offload_gb when you call LLM?
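For reference, a minimal sketch (not from the thread) of the `cpu_offload_gb` path mentioned above, which offloads part of the model weights to CPU memory rather than the KV cache; the model name and offload size are placeholders:

```python
from vllm import LLM, SamplingParams

# Weight offloading built into vLLM: reserve ~4 GiB of CPU RAM for weights
# that do not fit in GPU memory (model and size are illustrative).
llm = LLM(model="Qwen/Qwen2.5-3B-Instruct", cpu_offload_gb=4)

sampling_params = SamplingParams(temperature=0.8, max_tokens=32)
outputs = llm.generate(["Hello, my name is"], sampling_params)
print(outputs[0].outputs[0].text)
```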
Yes, but if an official example is provided, it should at least be runnable.
I can confirm that the example is working, with the build command fixes that @chaunceyjiang suggested:

```bash
cd LMCache && \
    sed -i 's/2\.5\.1/2.6.0/g' pyproject.toml setup.py && \
    sed -i 's#numpy==1\.26\.4#numpy#g' pyproject.toml setup.py requirements.txt && \
    python setup.py install
```

Additionally, if you are using aarch64 (such as GH200), you might need to disable [screenshot omitted from the original comment].
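Not part of the original comment: a quick sanity check one could run after the rebuild, before launching the example, to confirm the reinstalled LMCache actually imports against the torch version vLLM uses.

```python
# Hypothetical post-rebuild check: a torch/LMCache ABI mismatch usually
# surfaces here as an ImportError or undefined-symbol error.
import torch
import lmcache  # noqa: F401

print("torch:", torch.__version__)
```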
So why should I go through all this hassle to run lmcache instead of just using cpu_offload_gb?
cc @ApostaC
Thank you @rajesh-s and @chaunceyjiang for your suggestions. I have reinstalled the libraries, but for some reason the original issue persists. I am attaching the output of `python collect_env.py` (log collapsed, not reproduced here).
Here is the log output of cpu_offload_lmcache.py. I am using Qwen/Qwen2.5-3B-Instruct for testing instead of Mistral. When lmcache is disabled, everything works fine. I am running this code on a single RTX 3090 GPU. The output of `cpu_offload_lmcache.py` (log collapsed, not reproduced here).
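For context, here is a condensed sketch of how the linked cpu_offload_lmcache.py wires LMCache into vLLM around the versions discussed here (v0.7/v0.8 era). The environment variable names, `KVTransferConfig.from_cli`, and the `LMCacheConnector` name are taken from that snapshot of the example and may differ in other versions:

```python
import os

from vllm import LLM, SamplingParams
from vllm.config import KVTransferConfig

# LMCache settings: enable the experimental engine and the local CPU backend
# (chunk size and CPU cache size are illustrative).
os.environ["LMCACHE_USE_EXPERIMENTAL"] = "True"
os.environ["LMCACHE_CHUNK_SIZE"] = "256"
os.environ["LMCACHE_LOCAL_CPU"] = "True"
os.environ["LMCACHE_MAX_LOCAL_CPU_SIZE"] = "5.0"  # GiB of CPU RAM for KV cache

# Route vLLM's KV cache through the LMCache connector.
ktc = KVTransferConfig.from_cli(
    '{"kv_connector":"LMCacheConnector","kv_role":"kv_both"}')

llm = LLM(model="Qwen/Qwen2.5-3B-Instruct",  # model used by the commenter above
          kv_transfer_config=ktc,
          max_model_len=8000,
          gpu_memory_utilization=0.8)

outputs = llm.generate(["Hello, my name is"],
                       SamplingParams(temperature=0.8, max_tokens=32))
print(outputs[0].outputs[0].text)
```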
Has anyone tried to get gemma3 working with the latest vLLM release, i.e. v0.8.2? LMCache is unusable for recent models like gemma3 unless you use v0.8.2. Has anyone tried new solutions? Thanks for your contribution.
📚 The doc issue
I have tried to run it with lmcache==0.1.4 (built from source, with all experimental features enabled) and vllm==0.8.3.dev136+geffc5d24, and it crashes with a segmentation fault.
Suggest a potential alternative/fix
Add a requirements.txt for this example: https://github.com/vllm-project/vllm/blob/main/examples/offline_inference/cpu_offload_lmcache.py
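For illustration only, a sketch of what such a requirements.txt could contain. The pins below are just the versions mentioned in this thread, not a verified working set; documenting the actual working combination is exactly what this issue requests.

```text
# Hypothetical requirements.txt for examples/offline_inference/cpu_offload_lmcache.py
# (placeholder pins taken from this thread, not a verified combination)
vllm>=0.8.2        # release discussed above
lmcache==0.1.4     # built from source with experimental features enabled
torch==2.6.0       # LMCache must be built against the same torch as vLLM
```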