
[Doc]: What version of vllm and lmcache does that example use https://github.com/vllm-project/vllm/blob/main/examples/offline_inference/cpu_offload_lmcache.py #15874

Open
tanov25 opened this issue Apr 1, 2025 · 8 comments
Labels: documentation (Improvements or additions to documentation)

Comments

tanov25 commented Apr 1, 2025

📚 The doc issue

I have tried to run it with lmcache==0.1.4 with all experimental features (built from source) and vllm==0.8.3.dev136+geffc5d24, and it crashes with a segmentation fault.

Suggest a potential alternative/fix

Add a requirements.txt for this example: https://github.com/vllm-project/vllm/blob/main/examples/offline_inference/cpu_offload_lmcache.py

Before submitting a new issue...

  • Make sure you already searched for relevant issues, and asked the chatbot living at the bottom right corner of the documentation page, which can answer lots of frequently asked questions.
tanov25 added the documentation label on Apr 1, 2025
ymcki commented Apr 1, 2025

What is the advantage of using lmcache for CPU offloading? Why not just use cpu_offload_gb when you call LLM?
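
(For context: roughly speaking, cpu_offload_gb offloads part of the model weights to CPU RAM, while the LMCache connector offloads the KV cache. A minimal sketch of the built-in path follows; the model name and sizes are illustrative and not taken from this thread.)

    # Minimal sketch of vLLM's built-in weight-offload path.
    # Model choice and sizes are illustrative assumptions.
    from vllm import LLM, SamplingParams

    llm = LLM(
        model="Qwen/Qwen2.5-3B-Instruct",  # illustrative model choice
        cpu_offload_gb=4,                  # offload ~4 GiB of model weights to CPU RAM
        gpu_memory_utilization=0.8,
    )
    out = llm.generate(["Hello, my name is"], SamplingParams(max_tokens=32))
    print(out[0].outputs[0].text)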

tanov25 (Author) commented Apr 1, 2025

Yes, but if an official example is provided, it should at least be runnable.

chaunceyjiang (Contributor) commented

I can run this example successfully using the LMCache built from source.

Tip: since the vLLM code uses `torch == 2.6.0`, I made some modifications to the `LMCache` code to avoid dependency conflicts. The modifications are as follows:
diff --git a/pyproject.toml b/pyproject.toml
index a4be411..9bb40ba 100644
--- a/pyproject.toml
+++ b/pyproject.toml
@@ -4,7 +4,7 @@ requires = [
     "setuptools >= 49.4.0",
     "wheel",
     "packaging",
-    "torch == 2.5.1",
+    "torch == 2.6.0",
     "ninja",
     "pybind11",
 ]
diff --git a/setup.py b/setup.py
index 23d87b7..ffd5cb7 100644
--- a/setup.py
+++ b/setup.py
@@ -26,7 +26,7 @@ setup(
     long_description_content_type="text/markdown",
     packages=find_packages(exclude=("csrc")),
     install_requires=[
-        "torch == 2.5.1", "numpy==1.26.4", "aiofiles", "pyyaml", "redis",
+        "torch == 2.6.0", "numpy==1.26.4", "aiofiles", "pyyaml", "redis",
         "nvtx", "safetensors", "transformers", "torchac_cuda >= 0.2.5",
         "sortedcontainers", "prometheus_client", "infinistore", "msgspec"

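A quick way to confirm that the rebuilt environment actually picked up the intended pins (generic Python, not from this thread; the "lmcache" distribution name is assumed):

    # Sanity check of installed versions after rebuilding LMCache.
    from importlib.metadata import version
    import torch, vllm

    print("torch:", torch.__version__)     # expect 2.6.0 after the patch above
    print("vllm:", vllm.__version__)
    print("lmcache:", version("lmcache"))  # distribution name assumed to be "lmcache"
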
rajesh-s commented Apr 1, 2025

I can confirm that the example works (with vllm 0.8.1) when LMCache is built from source as well. Make sure you are using v1, as mentioned here.

Build command with the fixes that @chaunceyjiang suggested:

    cd LMCache && \
    sed -i 's/2\.5\.1/2.6.0/g' pyproject.toml setup.py && \
    sed -i 's#numpy==1\.26\.4#numpy#g' pyproject.toml setup.py requirements.txt && \
    python setup.py install

Additionally, if you are using aarch64 (such as GH200), you might need to disable the infinistore backend; I created a forked version here.


ymcki commented Apr 2, 2025

So why should I go through all this hassle to run lmcache instead of just using cpu_offload_gb?

hmellor (Member) commented Apr 2, 2025

cc @ApostaC

tanov25 (Author) commented Apr 2, 2025

Thank you @rajesh-s and @chaunceyjiang for your suggestions. I have reinstalled the libraries, but for some reason the original issue persists. I am adding the collect_env.py output below.

The output of `python collect_env.py`
INFO 04-02 18:58:45 [__init__.py:256] Automatically detected platform cuda.
Collecting environment information...
PyTorch version: 2.6.0+cu124
Is debug build: False
CUDA used to build PyTorch: 12.4
ROCM used to build PyTorch: N/A

OS: Ubuntu 20.04.6 LTS (x86_64)
GCC version: (Ubuntu 9.4.0-1ubuntu1~20.04.2) 9.4.0
Clang version: Could not collect
CMake version: version 3.16.3
Libc version: glibc-2.31

Python version: 3.12.7 | packaged by conda-forge | (main, Oct  4 2024, 16:05:46) [GCC 13.3.0] (64-bit runtime)
Python platform: Linux-5.4.0-208-generic-x86_64-with-glibc2.31
Is CUDA available: True
CUDA runtime version: Could not collect
CUDA_MODULE_LOADING set to: LAZY
GPU models and configuration: 
GPU 0: NVIDIA GeForce RTX 3090
GPU 1: NVIDIA GeForce RTX 3090

Nvidia driver version: 560.35.03
cuDNN version: Could not collect
HIP runtime version: N/A
MIOpen runtime version: N/A
Is XNNPACK available: True

CPU:
Architecture:                       x86_64
CPU op-mode(s):                     32-bit, 64-bit
Byte Order:                         Little Endian
Address sizes:                      43 bits physical, 48 bits virtual
CPU(s):                             24
On-line CPU(s) list:                0-23
Thread(s) per core:                 2
Core(s) per socket:                 12
Socket(s):                          1
NUMA node(s):                       1
Vendor ID:                          AuthenticAMD
CPU family:                         23
Model:                              1
Model name:                         AMD Ryzen Threadripper 1920X 12-Core Processor
Stepping:                           1
Frequency boost:                    enabled
CPU MHz:                            2166.676
CPU max MHz:                        3500.0000
CPU min MHz:                        2200.0000
BogoMIPS:                           6986.03
Virtualization:                     AMD-V
L1d cache:                          384 KiB
L1i cache:                          768 KiB
L2 cache:                           6 MiB
L3 cache:                           32 MiB
NUMA node0 CPU(s):                  0-23
Vulnerability Gather data sampling: Not affected
Vulnerability Itlb multihit:        Not affected
Vulnerability L1tf:                 Not affected
Vulnerability Mds:                  Not affected
Vulnerability Meltdown:             Not affected
Vulnerability Mmio stale data:      Not affected
Vulnerability Retbleed:             Vulnerable
Vulnerability Spec store bypass:    Mitigation; Speculative Store Bypass disabled via prctl and seccomp
Vulnerability Spectre v1:           Mitigation; usercopy/swapgs barriers and __user pointer sanitization
Vulnerability Spectre v2:           Mitigation; Retpolines; IBPB conditional; STIBP disabled; RSB filling; PBRSB-eIBRS Not affected; BHI Not affected
Vulnerability Srbds:                Not affected
Vulnerability Tsx async abort:      Not affected
Flags:                              fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl nonstop_tsc cpuid extd_apicid amd_dcm aperfmperf pni pclmulqdq monitor ssse3 fma cx16 sse4_1 sse4_2 movbe popcnt aes xsave avx f16c rdrand lahf_lm cmp_legacy svm extapic cr8_legacy abm sse4a misalignsse 3dnowprefetch osvw skinit wdt tce topoext perfctr_core perfctr_nb bpext perfctr_llc mwaitx cpb hw_pstate ssbd ibpb vmmcall fsgsbase bmi1 avx2 smep bmi2 rdseed adx smap clflushopt sha_ni xsaveopt xsavec xgetbv1 clzero irperf xsaveerptr arat npt lbrv svm_lock nrip_save tsc_scale vmcb_clean flushbyasid decodeassists pausefilter pfthreshold avic v_vmsave_vmload vgif overflow_recov succor smca sme sev

Versions of relevant libraries:
[pip3] mypy-extensions==1.0.0
[pip3] numpy==1.26.4
[pip3] nvidia-cublas-cu12==12.4.5.8
[pip3] nvidia-cuda-cupti-cu12==12.4.127
[pip3] nvidia-cuda-nvrtc-cu12==12.4.127
[pip3] nvidia-cuda-runtime-cu12==12.4.127
[pip3] nvidia-cudnn-cu12==9.1.0.70
[pip3] nvidia-cufft-cu12==11.2.1.3
[pip3] nvidia-curand-cu12==10.3.5.147
[pip3] nvidia-cusolver-cu12==11.6.1.9
[pip3] nvidia-cusparse-cu12==12.3.1.170
[pip3] nvidia-cusparselt-cu12==0.6.2
[pip3] nvidia-ml-py==12.570.86
[pip3] nvidia-nccl-cu12==2.21.5
[pip3] nvidia-nvjitlink-cu12==12.4.127
[pip3] nvidia-nvtx-cu12==12.4.127
[pip3] pyzmq==26.3.0
[pip3] torch==2.6.0
[pip3] torchac_cuda==0.2.5
[pip3] torchaudio==2.6.0
[pip3] torchvision==0.21.0
[pip3] transformers==4.50.2
[pip3] triton==3.2.0
[conda] numpy                     1.26.4                   pypi_0    pypi
[conda] nvidia-cublas-cu12        12.4.5.8                 pypi_0    pypi
[conda] nvidia-cuda-cupti-cu12    12.4.127                 pypi_0    pypi
[conda] nvidia-cuda-nvrtc-cu12    12.4.127                 pypi_0    pypi
[conda] nvidia-cuda-runtime-cu12  12.4.127                 pypi_0    pypi
[conda] nvidia-cudnn-cu12         9.1.0.70                 pypi_0    pypi
[conda] nvidia-cufft-cu12         11.2.1.3                 pypi_0    pypi
[conda] nvidia-curand-cu12        10.3.5.147               pypi_0    pypi
[conda] nvidia-cusolver-cu12      11.6.1.9                 pypi_0    pypi
[conda] nvidia-cusparse-cu12      12.3.1.170               pypi_0    pypi
[conda] nvidia-cusparselt-cu12    0.6.2                    pypi_0    pypi
[conda] nvidia-nccl-cu12          2.21.5                   pypi_0    pypi
[conda] nvidia-nvjitlink-cu12     12.4.127                 pypi_0    pypi
[conda] nvidia-nvtx-cu12          12.4.127                 pypi_0    pypi
[conda] pyzmq                     26.3.0                   pypi_0    pypi
[conda] torch                     2.6.0                    pypi_0    pypi
[conda] torchaudio                2.6.0                    pypi_0    pypi
[conda] torchvision               0.21.0                   pypi_0    pypi
[conda] transformers              4.50.2                   pypi_0    pypi
[conda] triton                    3.1.0                    pypi_0    pypi
ROCM Version: Could not collect
Neuron SDK Version: N/A
vLLM Version: 0.8.1
vLLM Build Flags:
CUDA Archs: Not Set; ROCm: Disabled; Neuron: Disabled
GPU Topology:
GPU0    GPU1    NIC0    CPU Affinity    NUMA Affinity   GPU NUMA ID
GPU0     X      PHB     NODE    0-23    0               N/A
GPU1    PHB      X      NODE    0-23    0               N/A
NIC0    NODE    NODE     X 

Legend:

  X    = Self
  SYS  = Connection traversing PCIe as well as the SMP interconnect between NUMA nodes (e.g., QPI/UPI)
  NODE = Connection traversing PCIe as well as the interconnect between PCIe Host Bridges within a NUMA node
  PHB  = Connection traversing PCIe as well as a PCIe Host Bridge (typically the CPU)
  PXB  = Connection traversing multiple PCIe bridges (without traversing the PCIe Host Bridge)
  PIX  = Connection traversing at most a single PCIe bridge
  NV#  = Connection traversing a bonded set of # NVLinks

NIC Legend:

  NIC0: mlx5_0

LD_LIBRARY_PATH=/usr/local/cuda-12.6/lib64:
NCCL_CUMEM_ENABLE=0
TORCHINDUCTOR_COMPILE_THREADS=1
CUDA_MODULE_LOADING=LAZY

Here is the log output of cpu_offload_lmcache.py. I am using Qwen/Qwen2.5-3B-Instruct for testing instead of Mistral. When LMCache is disabled, everything works fine. I am running this code on a single RTX 3090 GPU.

The output of `cpu_offload_lmcache.py`
INFO 04-02 19:08:07 [__init__.py:256] Automatically detected platform cuda.
INFO 04-02 19:08:14 [config.py:583] This model supports multiple tasks: {'reward', 'generate', 'embed', 'classify', 'score'}. Defaulting to 'generate'.
WARNING 04-02 19:08:14 [arg_utils.py:1765] --kv-transfer-config is not supported by the V1 Engine. Falling back to V0. 
INFO 04-02 19:08:14 [llm_engine.py:241] Initializing a V0 LLM engine (v0.8.1) with config: model='/home/tnovik/llm_dev/llm4reva/src/model/model_input', speculative_config=None, tokenizer='/home/tnovik/llm_dev/llm4reva/src/model/model_input', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, override_neuron_config=None, tokenizer_revision=None, trust_remote_code=False, dtype=torch.bfloat16, max_seq_len=8000, download_dir=None, load_format=LoadFormat.AUTO, tensor_parallel_size=1, pipeline_parallel_size=1, disable_custom_all_reduce=False, quantization=None, enforce_eager=False, kv_cache_dtype=auto,  device_config=cuda, decoding_config=DecodingConfig(guided_decoding_backend='xgrammar', reasoning_backend=None), observability_config=ObservabilityConfig(show_hidden_metrics=False, otlp_traces_endpoint=None, collect_model_forward_time=False, collect_model_execute_time=False), seed=None, served_model_name=/home/tnovik/llm_dev/llm4reva/src/model/model_input, num_scheduler_steps=1, multi_step_stream_outputs=True, enable_prefix_caching=None, chunked_prefill_enabled=False, use_async_output_proc=True, disable_mm_preprocessor_cache=False, mm_processor_kwargs=None, pooler_config=None, compilation_config={"splitting_ops":[],"compile_sizes":[],"cudagraph_capture_sizes":[256,248,240,232,224,216,208,200,192,184,176,168,160,152,144,136,128,120,112,104,96,88,80,72,64,56,48,40,32,24,16,8,4,2,1],"max_capture_size":256}, use_cached_outputs=False, 
INFO 04-02 19:08:16 [cuda.py:285] Using Flash Attention backend.
INFO 04-02 19:08:16 [parallel_state.py:967] rank 0 in world size 1 is assigned as DP rank 0, PP rank 0, TP rank 0
INFO 04-02 19:08:16 [lmcache_connector.py:43] Initializing LMCacheConfig under kv_transfer_config kv_connector='LMCacheConnector' kv_buffer_device='cuda' kv_buffer_size=1000000000.0 kv_role='kv_both' kv_rank=None kv_parallel_size=1 kv_ip='127.0.0.1' kv_port=14579 kv_connector_extra_config={}
WARNING LMCache: No LMCache configuration file is set. Trying to read configurations from the environment variables. [2025-04-02 19:08:16,732] -- /home/tnovik/llm_dev/LLM-per-token-backend/.env12/lib/python3.12/site-packages/lmcache/integration/vllm/utils.py:25
WARNING LMCache: You can set the configuration file through the environment variable: LMCACHE_CONFIG_FILE [2025-04-02 19:08:16,732] -- /home/tnovik/llm_dev/LLM-per-token-backend/.env12/lib/python3.12/site-packages/lmcache/integration/vllm/utils.py:27
INFO LMCache: Creating LMCacheEngine instance vllm-instance [2025-04-02 19:08:16,733] -- /home/tnovik/llm_dev/LLM-per-token-backend/.env12/lib/python3.12/site-packages/lmcache/experimental/cache_engine.py:290
INFO LMCache: Initializing usage context. [2025-04-02 19:08:19,558] -- /home/tnovik/llm_dev/LLM-per-token-backend/.env12/lib/python3.12/site-packages/lmcache/usage_context.py:235
DEBUG LMCache: context message updated [2025-04-02 19:08:20,718] -- /home/tnovik/llm_dev/LLM-per-token-backend/.env12/lib/python3.12/site-packages/lmcache/usage_context.py:98
DEBUG LMCache: Unable to send lmcache context message [2025-04-02 19:08:25,725] -- /home/tnovik/llm_dev/LLM-per-token-backend/.env12/lib/python3.12/site-packages/lmcache/usage_context.py:101
DEBUG LMCache: context message updated [2025-04-02 19:08:25,725] -- /home/tnovik/llm_dev/LLM-per-token-backend/.env12/lib/python3.12/site-packages/lmcache/usage_context.py:98
DEBUG LMCache: Unable to send lmcache context message [2025-04-02 19:08:30,731] -- /home/tnovik/llm_dev/LLM-per-token-backend/.env12/lib/python3.12/site-packages/lmcache/usage_context.py:101
DEBUG LMCache: context message updated [2025-04-02 19:08:30,732] -- /home/tnovik/llm_dev/LLM-per-token-backend/.env12/lib/python3.12/site-packages/lmcache/usage_context.py:98
INFO 04-02 19:08:30 [model_runner.py:1110] Starting to load model /home/tnovik/llm_dev/llm4reva/src/model/model_input...
Loading safetensors checkpoint shards:   0% Completed | 0/2 [00:00<?, ?it/s]
Loading safetensors checkpoint shards:  50% Completed | 1/2 [00:00<00:00,  3.62it/s]
Loading safetensors checkpoint shards: 100% Completed | 2/2 [00:00<00:00,  2.52it/s]
Loading safetensors checkpoint shards: 100% Completed | 2/2 [00:00<00:00,  2.64it/s]

INFO 04-02 19:08:31 [loader.py:429] Loading weights took 0.80 seconds
INFO 04-02 19:08:31 [model_runner.py:1146] Model loading took 5.7915 GB and 0.899054 seconds
INFO 04-02 19:08:33 [worker.py:267] Memory profiling takes 1.04 seconds
INFO 04-02 19:08:33 [worker.py:267] the current vLLM instance can use total_gpu_memory (23.59GiB) x gpu_memory_utilization (0.80) = 18.87GiB
INFO 04-02 19:08:33 [worker.py:267] model weights take 5.79GiB; non_torch_memory takes 0.09GiB; PyTorch activation peak memory takes 1.42GiB; the rest of the memory reserved for KV Cache is 11.57GiB.
INFO 04-02 19:08:33 [executor_base.py:111] # cuda blocks: 21063, # CPU blocks: 7281
INFO 04-02 19:08:33 [executor_base.py:116] Maximum concurrency for 8000 tokens per request: 42.13x
INFO 04-02 19:08:34 [model_runner.py:1442] Capturing cudagraphs for decoding. This may lead to unexpected consequences if the model is not static. To run the model in eager mode, set 'enforce_eager=True' or use '--enforce-eager' in the CLI. If out-of-memory error occurs during cudagraph capture, consider decreasing `gpu_memory_utilization` or switching to eager mode. You can also reduce the `max_num_seqs` as needed to decrease memory usage.
Capturing CUDA graph shapes:   6%|███▏                                                   | 2/35 [00:00<00:14,  2.22it/s]DEBUG LMCache: Unable to send lmcache context message [2025-04-02 19:08:35,735] -- /home/tnovik/llm_dev/LLM-per-token-backend/.env12/lib/python3.12/site-packages/lmcache/usage_context.py:101
Capturing CUDA graph shapes: 100%|██████████████████████████████████████████████████████| 35/35 [00:14<00:00,  2.36it/s]
INFO 04-02 19:08:49 [model_runner.py:1570] Graph capturing finished in 15 secs, took 0.21 GiB
INFO 04-02 19:08:49 [llm_engine.py:447] init engine (profile, create kv cache, warmup model) took 17.78 seconds
Processed prompts:   0%|                      | 0/1 [00:00<?, ?it/s, est. speed input: 0.00 toks/s, output: 0.00 toks/s]
/root/miniconda3/lib/python3.12/multiprocessing/resource_tracker.py:254: UserWarning: resource_tracker: There appear to be 1 leaked semaphore objects to clean up at shutdown
  warnings.warn('resource_tracker: There appear to be %d '
Segmentation fault

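For reference, the LMCache setup that the example exercises boils down to roughly the following. The environment-variable names, values, and the plain KVTransferConfig construction are a reconstruction of the example at that time and may differ between versions; the model and limits match the log above.

    # Rough reconstruction of the example's LMCache CPU-offload setup.
    # Env-var names and values are assumptions based on the example at the time.
    import os
    from vllm import LLM, SamplingParams
    from vllm.config import KVTransferConfig

    os.environ["LMCACHE_USE_EXPERIMENTAL"] = "True"   # use LMCache's experimental (v1) engine
    os.environ["LMCACHE_CHUNK_SIZE"] = "256"          # tokens per KV chunk
    os.environ["LMCACHE_LOCAL_CPU"] = "True"          # enable the local CPU offload backend
    os.environ["LMCACHE_MAX_LOCAL_CPU_SIZE"] = "5.0"  # CPU buffer size in GiB

    ktc = KVTransferConfig(kv_connector="LMCacheConnector", kv_role="kv_both")

    llm = LLM(
        model="Qwen/Qwen2.5-3B-Instruct",  # model used in the repro above
        kv_transfer_config=ktc,
        max_model_len=8000,                # matches max_seq_len in the log
        gpu_memory_utilization=0.8,
    )
    print(llm.generate(["Hello"], SamplingParams(max_tokens=16))[0].outputs[0].text)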

gliuck commented Apr 4, 2025

Has anyone tried to use gemma3, which is only compatible with the latest vLLM release, i.e. v0.8.2? LMCache is unusable for recent models like gemma3 unless you use v0.8.2. Has anyone tried new solutions? Thanks for your contribution.
