No Active LoRA Adapters When Testing POC Example #109

Open
danehans opened this issue Dec 18, 2024 · 2 comments
@danehans
I'm testing the POC example. I can curl the backend model through the gateway from a client pod:

$ kubectl exec po/client -- curl -si $GTW_IP:$GTW_PORT/v1/completions -H 'Content-Type: application/json' -d '{
"model": "tweet-summary",
"prompt": "Write as if you were a critic: San Francisco",
"max_tokens": 100,
"temperature": 0
}'
HTTP/1.1 200 OK
date: Wed, 18 Dec 2024 22:35:43 GMT
server: uvicorn
content-length: 769
content-type: application/json
x-request-id: 737428ad-be14-44e2-976f-92160176f75b

{"id":"cmpl-737428ad-be14-44e2-976f-92160176f75b","object":"text_completion","created":1734561344,"model":"tweet-summary","choices":[{"index":0,"text":" Chronicle\n Write as if you were a human: San Francisco Chronicle\n\n 1. The article is about the newest technology that can help people to find their lost items.\n 2. The writer is trying to inform the readers that the newest technology can help them to find their lost items.\n 3. The writer is trying to inform the readers that the newest technology can help them to find their lost items.\n 4. The writer is trying to inform","logprobs":null,"finish_reason":"length","stop_reason":null,"prompt_logprobs":null}],"usage":{"prompt_tokens":11,"total_tokens":111,"completion_tokens":100,"prompt_tokens_details":null}}

The ext-proc logs show the request being handled, but with an "Error fetching cacheActiveLoraModel" error:

2024/12/18 22:33:06 Started process:  -->
2024/12/18 22:33:06
2024/12/18 22:33:06
2024/12/18 22:33:06 Got stream:  -->
2024/12/18 22:33:06 --- In RequestHeaders processing ...
2024/12/18 22:33:06 Headers: &{RequestHeaders:headers:{headers:{key:":authority"  raw_value:"$GTW_IP:$GTW_PORT"}  headers:{key:":path"  raw_value:"/v1/completions"}  headers:{key:":method"  raw_value:"POST"}  headers:{key:":scheme"  raw_value:"http"}  headers:{key:"user-agent"  raw_value:"curl/8.11.1"}  headers:{key:"accept"  raw_value:"*/*"}  headers:{key:"content-type"  raw_value:"application/json"}  headers:{key:"content-length"  raw_value:"123"}  headers:{key:"x-forwarded-for"  raw_value:"$CURL_CLIENT_POD_IP"}  headers:{key:"x-forwarded-proto"  raw_value:"http"}  headers:{key:"x-envoy-internal"  raw_value:"true"}  headers:{key:"x-request-id"  raw_value:"b0e3e720-85e5-44db-bb1c-c8af3d391caf"}}}
2024/12/18 22:33:06 EndOfStream: false
[request_header]Final headers being sent:
x-went-into-req-headers: true
2024/12/18 22:33:06
2024/12/18 22:33:06
2024/12/18 22:33:06 Got stream:  -->
2024/12/18 22:33:06 --- In RequestBody processing
2024/12/18 22:33:06 Error fetching cacheActiveLoraModel for pod vllm-llama2-7b-pool-55d46d588c-qqbsv and lora_adapter_requested tweet-summary: error fetching cacheActiveLoraModel for key vllm-llama2-7b-pool-55d46d588c-qqbsv:tweet-summary: Entry not found
Got cachePendingRequestActiveAdapters - Key: vllm-llama2-7b-pool-55d46d588c-qqbsv:, Value: {"Date":"2024-12-18T22:33:00Z","PodName":"vllm-llama2-7b-pool-55d46d588c-qqbsv","PendingRequests":0,"NumberOfActiveAdapters":0}
Fetched loraMetrics: []
Fetched requestMetrics: [{Date:2024-12-18T22:33:00Z PodName:vllm-llama2-7b-pool-55d46d588c-qqbsv PendingRequests:0 NumberOfActiveAdapters:0}]
Searching for the best pod...
Selected pod with the least active adapters: vllm-llama2-7b-pool-55d46d588c-qqbsv
Selected target pod: vllm-llama2-7b-pool-55d46d588c-qqbsv
Selected target pod IP: 10.244.0.38:8000
Liveness tweet-summary
No adapter
[request_body] Header Key: x-went-into-req-body, Header Value: true
[request_body] Header Key: target-pod, Header Value: 10.244.0.38:8000

The vLLM pod logs show the request being processed:

INFO 12-18 14:35:44 logger.py:37] Received request cmpl-737428ad-be14-44e2-976f-92160176f75b-0: prompt: 'Write as if you were a critic: San Francisco', params: SamplingParams(n=1, presence_penalty=0.0, frequency_penalty=0.0, repetition_penalty=1.0, temperature=0.0, top_p=1.0, top_k=-1, min_p=0.0, seed=None, stop=[], stop_token_ids=[], bad_words=[], include_stop_str_in_output=False, ignore_eos=False, max_tokens=100, min_tokens=0, logprobs=None, prompt_logprobs=None, skip_special_tokens=True, spaces_between_special_tokens=True, truncate_prompt_tokens=None, guided_decoding=None), prompt_token_ids: [1, 14350, 408, 565, 366, 892, 263, 11164, 29901, 3087, 8970], lora_request: LoRARequest(lora_name='tweet-summary', lora_int_id=2, lora_path='/adapters/hub/models--vineetsharma--qlora-adapter-Llama-2-7b-hf-TweetSumm/snapshots/796337d8e866318c59e38f16416e3ecd11fe5403', lora_local_path=None, long_lora_max_len=None, base_model_name='meta-llama/Llama-2-7b-hf'), prompt_adapter_request: None.
INFO 12-18 14:35:44 engine.py:267] Added request cmpl-737428ad-be14-44e2-976f-92160176f75b-0.
INFO 12-18 14:35:47 metrics.py:467] Avg prompt throughput: 2.2 tokens/s, Avg generation throughput: 9.5 tokens/s, Running: 1 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 0.5%, CPU KV cache usage: 0.0%.

The ext-proc server is not finding any active LoRA adapters for my vLLM pod:

fetchMetricsPeriodically requestMetrics: [{Date:2024-12-18T22:33:30Z PodName:vllm-llama2-7b-pool-55d46d588c-qqbsv PendingRequests:0 NumberOfActiveAdapters:0}]
fetchMetricsPeriodically loraMetrics: []
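
To rule out the metrics scrape itself, the pod's /metrics endpoint can be queried directly from the client pod. This is only a sanity check: the pod IP is taken from the ext-proc log above, and the grep is a broad filter because I'm not sure which metric name the ext-proc expects for active adapters:

$ kubectl exec po/client -- curl -s http://10.244.0.38:8000/metrics | grep -i lora

If nothing LoRA-related comes back, that would explain why loraMetrics stays empty even though requests are served.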

The adapter-loader init container logs show the tweet-summary LoRA module being pulled:

$ k logs deploy/vllm-llama2-7b-pool -c adapter-loader
['yard1/llama-2-7b-sql-lora-test', 'vineetsharma/qlora-adapter-Llama-2-7b-hf-TweetSumm']
Pulling adapter yard1/llama-2-7b-sql-lora-test
Fetching 9 files: 100%|██████████| 9/9 [00:01<00:00,  6.60it/s]
PAth here /adapters/hub/models--yard1--llama-2-7b-sql-lora-test/snapshots/0dfa347e8877a4d4ed19ee56c140fa518470028c
Pulling adapter vineetsharma/qlora-adapter-Llama-2-7b-hf-TweetSumm
Fetching 8 files: 100%|██████████| 8/8 [00:01<00:00,  7.99it/s]
PAth here /adapters/hub/models--vineetsharma--qlora-adapter-Llama-2-7b-hf-TweetSumm/snapshots/796337d8e866318c59e38f16416e3ecd11fe5403
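
The snapshot paths printed above should also be visible from the vLLM container if the shared volume is mounted as expected. A quick sanity check (the container name vllm is an assumption here, adjust to whatever the pool deployment actually uses):

$ kubectl exec deploy/vllm-llama2-7b-pool -c vllm -- ls /adapters/hub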

The vLLM container logs show the tweet-summary LoRA module registered in lora_modules:

INFO 12-18 14:53:19 api_server.py:652] args: Namespace(host=None, port=8000, uvicorn_log_level='info', allow_credentials=False, allowed_origins=['*'], allowed_methods=['*'], allowed_headers=['*'], api_key=None, lora_modules=[LoRAModulePath(name='sql-lora', path='/adapters/hub/models--yard1--llama-2-7b-sql-lora-test/snapshots/0dfa347e8877a4d4ed19ee56c140fa518470028c/', base_model_name=None), LoRAModulePath(name='tweet-summary', path='/adapters/hub/models--vineetsharma--qlora-adapter-Llama-2-7b-hf-TweetSumm/snapshots/796337d8e866318c59e38f16416e3ecd11fe5403', base_model_name=None), LoRAModulePath(name='sql-lora-0', path='/adapters/yard1/llama-2-7b-sql-lora-test_0', base_model_name=None), LoRAModulePath(name='sql-lora-1', path='/adapters/yard1/llama-2-7b-sql-lora-test_1', base_model_name=None), LoRAModulePath(name='sql-lora-2', path='/adapters/yard1/llama-2-7b-sql-lora-test_2', base_model_name=None), LoRAModulePath(name='sql-lora-3', path='/adapters/yard1/llama-2-7b-sql-lora-test_3', base_model_name=None), LoRAModulePath(name='sql-lora-4', path='/adapters/yard1/llama-2-7b-sql-lora-test_4', base_model_name=None), LoRAModulePath(name='tweet-summary-0', path='/adapters/vineetsharma/qlora-adapter-Llama-2-7b-hf-TweetSumm_0', base_model_name=None), LoRAModulePath(name='tweet-summary-1', path='/adapters/vineetsharma/qlora-adapter-Llama-2-7b-hf-TweetSumm_1', base_model_name=None), LoRAModulePath(name='tweet-summary-2', path='/adapters/vineetsharma/qlora-adapter-Llama-2-7b-hf-TweetSumm_2', base_model_name=None), LoRAModulePath(name='tweet-summary-3', path='/adapters/vineetsharma/qlora-adapter-Llama-2-7b-hf-TweetSumm_3', base_model_name=None), LoRAModulePath(name='tweet-summary-4', path='/adapters/vineetsharma/qlora-adapter-Llama-2-7b-hf-TweetSumm_4', base_model_name=None)], prompt_adapters=None, chat_template=None, chat_template_content_format='auto', response_role='assistant', ssl_keyfile=None, ssl_certfile=None, ssl_ca_certs=None, ssl_cert_reqs=0, root_path=None, middleware=[], return_tokens_as_token_ids=False, disable_frontend_multiprocessing=False, enable_auto_tool_choice=False, tool_call_parser=None, tool_parser_plugin='', model='meta-llama/Llama-2-7b-hf', task='auto', tokenizer=None, skip_tokenizer_init=False, revision=None, code_revision=None, tokenizer_revision=None, tokenizer_mode='auto', trust_remote_code=False, allowed_local_media_path=None, download_dir=None, load_format='auto', config_format=<ConfigFormat.AUTO: 'auto'>, dtype='auto', kv_cache_dtype='auto', quantization_param_path=None, max_model_len=None, guided_decoding_backend='xgrammar', logits_processor_pattern=None, distributed_executor_backend=None, worker_use_ray=False, pipeline_parallel_size=1, tensor_parallel_size=1, max_parallel_loading_workers=None, ray_workers_use_nsight=False, block_size=None, enable_prefix_caching=None, disable_sliding_window=False, use_v2_block_manager=True, num_lookahead_slots=0, seed=0, swap_space=4, cpu_offload_gb=0, gpu_memory_utilization=0.9, num_gpu_blocks_override=None, max_num_batched_tokens=None, max_num_seqs=None, max_logprobs=20, disable_log_stats=False, quantization=None, rope_scaling=None, rope_theta=None, hf_overrides=None, enforce_eager=False, max_seq_len_to_capture=8192, disable_custom_all_reduce=False, tokenizer_pool_size=0, tokenizer_pool_type='ray', tokenizer_pool_extra_config=None, limit_mm_per_prompt=None, mm_processor_kwargs=None, mm_cache_preprocessor=False, enable_lora=True, enable_lora_bias=False, max_loras=4, max_lora_rank=16, lora_extra_vocab_size=256, lora_dtype='auto', 
long_lora_scaling_factors=None, max_cpu_loras=12, fully_sharded_loras=False, enable_prompt_adapter=False, max_prompt_adapters=1, max_prompt_adapter_token=0, device='auto', num_scheduler_steps=1, multi_step_stream_outputs=True, scheduler_delay_factor=0.0, enable_chunked_prefill=None, speculative_model=None, speculative_model_quantization=None, num_speculative_tokens=None, speculative_disable_mqa_scorer=False, speculative_draft_tensor_parallel_size=None, speculative_max_model_len=None, speculative_disable_by_batch_size=None, ngram_prompt_lookup_max=None, ngram_prompt_lookup_min=None, spec_decoding_acceptance_method='rejection_sampler', typical_acceptance_sampler_posterior_threshold=None, typical_acceptance_sampler_posterior_alpha=None, disable_logprobs_during_spec_decoding=None, model_loader_extra_config=None, ignore_patterns=[], preemption_mode=None, served_model_name=None, qlora_adapter_name_or_path=None, otlp_traces_endpoint=None, collect_detailed_traces=None, disable_async_output_proc=False, scheduling_policy='fcfs', override_neuron_config=None, override_pooler_config=None, compilation_config=None, kv_transfer_config=None, worker_cls='auto', disable_log_requests=False, max_log_len=None, disable_fastapi_docs=False, enable_prompt_tokens_details=False)
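
Since the args show enable_lora=True and tweet-summary in lora_modules, the adapter should also appear when listing models on the pod's OpenAI-compatible API (pod IP again taken from the ext-proc log above):

$ kubectl exec po/client -- curl -s http://10.244.0.38:8000/v1/models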

Any troubleshooting suggestions are much appreciated.

@Kellthuzad

Hey Daneyon! I can take a peek tomorrow.

@Kellthuzad

/assign kfswain
