I'm testing the POC example. I can curl the backend model through the gateway from a client pod:
```
$ kubectl exec po/client -- curl -si $GTW_IP:$GTW_PORT/v1/completions -H 'Content-Type: application/json' -d '{
"model": "tweet-summary",
"prompt": "Write as if you were a critic: San Francisco",
"max_tokens": 100,
"temperature": 0
}'
HTTP/1.1 200 OK
date: Wed, 18 Dec 2024 22:35:43 GMT
server: uvicorn
content-length: 769
content-type: application/json
x-request-id: 737428ad-be14-44e2-976f-92160176f75b

{"id":"cmpl-737428ad-be14-44e2-976f-92160176f75b","object":"text_completion","created":1734561344,"model":"tweet-summary","choices":[{"index":0,"text":" Chronicle\n Write as if you were a human: San Francisco Chronicle\n\n 1. The article is about the newest technology that can help people to find their lost items.\n 2. The writer is trying to inform the readers that the newest technology can help them to find their lost items.\n 3. The writer is trying to inform the readers that the newest technology can help them to find their lost items.\n 4. The writer is trying to inform","logprobs":null,"finish_reason":"length","stop_reason":null,"prompt_logprobs":null}],"usage":{"prompt_tokens":11,"total_tokens":111,"completion_tokens":100,"prompt_tokens_details":null}}
```
The ext-proc logs show the request being handled, but with an error fetching `cacheActiveLoraModel`:
```
2024/12/18 22:33:06 Started process: -->
2024/12/18 22:33:06
2024/12/18 22:33:06
2024/12/18 22:33:06 Got stream: -->
2024/12/18 22:33:06 --- In RequestHeaders processing ...
2024/12/18 22:33:06 Headers: &{RequestHeaders:headers:{headers:{key:":authority" raw_value:"$GTW_IP:$GTW_PORT"} headers:{key:":path" raw_value:"/v1/completions"} headers:{key:":method" raw_value:"POST"} headers:{key:":scheme" raw_value:"http"} headers:{key:"user-agent" raw_value:"curl/8.11.1"} headers:{key:"accept" raw_value:"*/*"} headers:{key:"content-type" raw_value:"application/json"} headers:{key:"content-length" raw_value:"123"} headers:{key:"x-forwarded-for" raw_value:"$CURL_CLIENT_POD_IP"} headers:{key:"x-forwarded-proto" raw_value:"http"} headers:{key:"x-envoy-internal" raw_value:"true"} headers:{key:"x-request-id" raw_value:"b0e3e720-85e5-44db-bb1c-c8af3d391caf"}}}
2024/12/18 22:33:06 EndOfStream: false
[request_header]Final headers being sent:
x-went-into-req-headers: true
2024/12/18 22:33:06
2024/12/18 22:33:06
2024/12/18 22:33:06 Got stream: -->
2024/12/18 22:33:06 --- In RequestBody processing
2024/12/18 22:33:06 Error fetching cacheActiveLoraModel for pod vllm-llama2-7b-pool-55d46d588c-qqbsv and lora_adapter_requested tweet-summary: error fetching cacheActiveLoraModel for key vllm-llama2-7b-pool-55d46d588c-qqbsv:tweet-summary: Entry not found
Got cachePendingRequestActiveAdapters - Key: vllm-llama2-7b-pool-55d46d588c-qqbsv:, Value: {"Date":"2024-12-18T22:33:00Z","PodName":"vllm-llama2-7b-pool-55d46d588c-qqbsv","PendingRequests":0,"NumberOfActiveAdapters":0}
Fetched loraMetrics: []
Fetched requestMetrics: [{Date:2024-12-18T22:33:00Z PodName:vllm-llama2-7b-pool-55d46d588c-qqbsv PendingRequests:0 NumberOfActiveAdapters:0}]
Searching for the best pod...
Selected pod with the least active adapters: vllm-llama2-7b-pool-55d46d588c-qqbsv
Selected target pod: vllm-llama2-7b-pool-55d46d588c-qqbsv
Selected target pod IP: 10.244.0.38:8000
Liveness tweet-summary
No adapter
[request_body] Header Key: x-went-into-req-body, Header Value: true
[request_body] Header Key: target-pod, Header Value: 10.244.0.38:8000
```
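Piecing the log lines together, the ext-proc appears to keep a cache keyed by `<pod>:<adapter>` and, on a miss (`Entry not found`), falls back to the pod with the fewest active adapters. A rough sketch of that lookup behavior, purely for illustration (the real ext-proc is Go, and every name here is a guess, not the actual POC code):

```python
# Hypothetical reconstruction of the routing decision seen in the logs above.
cache_active_lora_model = {}  # key: "<pod>:<adapter>" -> cached metrics entry

def select_target_pod(pods, adapter):
    """Prefer a pod whose cache entry says the adapter is already active;
    otherwise fall back to the pod with the fewest active adapters."""
    for pod in pods:
        key = f"{pod['name']}:{adapter}"
        if key in cache_active_lora_model:
            return pod
    # This branch corresponds to "Error fetching cacheActiveLoraModel ...
    # Entry not found" followed by "Selected pod with the least active adapters".
    return min(pods, key=lambda p: p["active_adapters"])

pods = [{"name": "vllm-llama2-7b-pool-55d46d588c-qqbsv", "active_adapters": 0}]
pod = select_target_pod(pods, "tweet-summary")
print(pod["name"])  # the only pod, chosen via the least-adapters fallback
```

So the request still gets routed (there is only one pod to pick), which matches the 200 response above; the question is why the adapter cache is empty when `loraMetrics` is scraped.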
The vLLM pod logs show the request being processed:
```
INFO 12-18 14:35:44 logger.py:37] Received request cmpl-737428ad-be14-44e2-976f-92160176f75b-0: prompt: 'Write as if you were a critic: San Francisco', params: SamplingParams(n=1, presence_penalty=0.0, frequency_penalty=0.0, repetition_penalty=1.0, temperature=0.0, top_p=1.0, top_k=-1, min_p=0.0, seed=None, stop=[], stop_token_ids=[], bad_words=[], include_stop_str_in_output=False, ignore_eos=False, max_tokens=100, min_tokens=0, logprobs=None, prompt_logprobs=None, skip_special_tokens=True, spaces_between_special_tokens=True, truncate_prompt_tokens=None, guided_decoding=None), prompt_token_ids: [1, 14350, 408, 565, 366, 892, 263, 11164, 29901, 3087, 8970], lora_request: LoRARequest(lora_name='tweet-summary', lora_int_id=2, lora_path='/adapters/hub/models--vineetsharma--qlora-adapter-Llama-2-7b-hf-TweetSumm/snapshots/796337d8e866318c59e38f16416e3ecd11fe5403', lora_local_path=None, long_lora_max_len=None, base_model_name='meta-llama/Llama-2-7b-hf'), prompt_adapter_request: None.
INFO 12-18 14:35:44 engine.py:267] Added request cmpl-737428ad-be14-44e2-976f-92160176f75b-0.
INFO 12-18 14:35:47 metrics.py:467] Avg prompt throughput: 2.2 tokens/s, Avg generation throughput: 9.5 tokens/s, Running: 1 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 0.5%, CPU KV cache usage: 0.0%.
```
So the ext-proc server is not finding any active LoRA adapters for my vLLM pod, even though vLLM itself handles the `tweet-summary` request fine.
I see the loader init container load the tweet-summary LoRA module, and the vLLM container logs show the tweet-summary LoRA module as well.
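For completeness, the most direct check I can think of is querying the target pod's vLLM Prometheus endpoint (10.244.0.38:8000 per the ext-proc logs) to see whether it reports any LoRA adapters at all; if the ext-proc scrapes its adapter cache from there, an empty result would explain the `Entry not found`. The metric name below is an assumption about vLLM's metrics output, not something confirmed from the POC; the heredoc is just a hypothetical sample illustrating what the grep is looking for:

```shell
# In the real cluster the check would be (hedged, paths/ports from the logs):
#   kubectl exec po/client -- curl -s 10.244.0.38:8000/metrics | grep -i lora
#
# Hypothetical sample of what a healthy scrape might contain
# (metric/label names are guesses):
cat <<'EOF' | grep -i lora
vllm:num_requests_running{model_name="tweet-summary"} 1.0
vllm:lora_requests_info{running_lora_adapters="tweet-summary",max_lora="1"} 1.0
EOF
```

If that grep comes back empty on the real pod, the problem is on the metrics-exposure side rather than in the ext-proc cache logic.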
Any troubleshooting suggestions are much appreciated.