No Active LoRA Adapters When Testing POC Example #109

Open
danehans opened this issue Dec 18, 2024 · 2 comments
@danehans
I'm testing the POC example. I can curl the backend model through the gateway from a client pod:

$ kubectl exec po/client -- curl -si $GTW_IP:$GTW_PORT/v1/completions -H 'Content-Type: application/json' -d '{
"model": "tweet-summary",
"prompt": "Write as if you were a critic: San Francisco",
"max_tokens": 100,
"temperature": 0
}'
HTTP/1.1 200 OK
date: Wed, 18 Dec 2024 22:35:43 GMT
server: uvicorn
content-length: 769
content-type: application/json
x-request-id: 737428ad-be14-44e2-976f-92160176f75b

{"id":"cmpl-737428ad-be14-44e2-976f-92160176f75b","object":"text_completion","created":1734561344,"model":"tweet-summary","choices":[{"index":0,"text":" Chronicle\n Write as if you were a human: San Francisco Chronicle\n\n 1. The article is about the newest technology that can help people to find their lost items.\n 2. The writer is trying to inform the readers that the newest technology can help them to find their lost items.\n 3. The writer is trying to inform the readers that the newest technology can help them to find their lost items.\n 4. The writer is trying to inform","logprobs":null,"finish_reason":"length","stop_reason":null,"prompt_logprobs":null}],"usage":{"prompt_tokens":11,"total_tokens":111,"completion_tokens":100,"prompt_tokens_details":null}}

The ext-proc logs show the request being handled, but with an "Error fetching cacheActiveLoraModel" error:

2024/12/18 22:33:06 Started process:  -->
2024/12/18 22:33:06
2024/12/18 22:33:06
2024/12/18 22:33:06 Got stream:  -->
2024/12/18 22:33:06 --- In RequestHeaders processing ...
2024/12/18 22:33:06 Headers: &{RequestHeaders:headers:{headers:{key:":authority"  raw_value:"$GTW_IP:$GTW_PORT"}  headers:{key:":path"  raw_value:"/v1/completions"}  headers:{key:":method"  raw_value:"POST"}  headers:{key:":scheme"  raw_value:"http"}  headers:{key:"user-agent"  raw_value:"curl/8.11.1"}  headers:{key:"accept"  raw_value:"*/*"}  headers:{key:"content-type"  raw_value:"application/json"}  headers:{key:"content-length"  raw_value:"123"}  headers:{key:"x-forwarded-for"  raw_value:"$CURL_CLIENT_POD_IP"}  headers:{key:"x-forwarded-proto"  raw_value:"http"}  headers:{key:"x-envoy-internal"  raw_value:"true"}  headers:{key:"x-request-id"  raw_value:"b0e3e720-85e5-44db-bb1c-c8af3d391caf"}}}
2024/12/18 22:33:06 EndOfStream: false
[request_header]Final headers being sent:
x-went-into-req-headers: true
2024/12/18 22:33:06
2024/12/18 22:33:06
2024/12/18 22:33:06 Got stream:  -->
2024/12/18 22:33:06 --- In RequestBody processing
2024/12/18 22:33:06 Error fetching cacheActiveLoraModel for pod vllm-llama2-7b-pool-55d46d588c-qqbsv and lora_adapter_requested tweet-summary: error fetching cacheActiveLoraModel for key vllm-llama2-7b-pool-55d46d588c-qqbsv:tweet-summary: Entry not found
Got cachePendingRequestActiveAdapters - Key: vllm-llama2-7b-pool-55d46d588c-qqbsv:, Value: {"Date":"2024-12-18T22:33:00Z","PodName":"vllm-llama2-7b-pool-55d46d588c-qqbsv","PendingRequests":0,"NumberOfActiveAdapters":0}
Fetched loraMetrics: []
Fetched requestMetrics: [{Date:2024-12-18T22:33:00Z PodName:vllm-llama2-7b-pool-55d46d588c-qqbsv PendingRequests:0 NumberOfActiveAdapters:0}]
Searching for the best pod...
Selected pod with the least active adapters: vllm-llama2-7b-pool-55d46d588c-qqbsv
Selected target pod: vllm-llama2-7b-pool-55d46d588c-qqbsv
Selected target pod IP: 10.244.0.38:8000
Liveness tweet-summary
No adapter
[request_body] Header Key: x-went-into-req-body, Header Value: true
[request_body] Header Key: target-pod, Header Value: 10.244.0.38:8000

The vLLM pod logs show the request being processed:

INFO 12-18 14:35:44 logger.py:37] Received request cmpl-737428ad-be14-44e2-976f-92160176f75b-0: prompt: 'Write as if you were a critic: San Francisco', params: SamplingParams(n=1, presence_penalty=0.0, frequency_penalty=0.0, repetition_penalty=1.0, temperature=0.0, top_p=1.0, top_k=-1, min_p=0.0, seed=None, stop=[], stop_token_ids=[], bad_words=[], include_stop_str_in_output=False, ignore_eos=False, max_tokens=100, min_tokens=0, logprobs=None, prompt_logprobs=None, skip_special_tokens=True, spaces_between_special_tokens=True, truncate_prompt_tokens=None, guided_decoding=None), prompt_token_ids: [1, 14350, 408, 565, 366, 892, 263, 11164, 29901, 3087, 8970], lora_request: LoRARequest(lora_name='tweet-summary', lora_int_id=2, lora_path='/adapters/hub/models--vineetsharma--qlora-adapter-Llama-2-7b-hf-TweetSumm/snapshots/796337d8e866318c59e38f16416e3ecd11fe5403', lora_local_path=None, long_lora_max_len=None, base_model_name='meta-llama/Llama-2-7b-hf'), prompt_adapter_request: None.
INFO 12-18 14:35:44 engine.py:267] Added request cmpl-737428ad-be14-44e2-976f-92160176f75b-0.
INFO 12-18 14:35:47 metrics.py:467] Avg prompt throughput: 2.2 tokens/s, Avg generation throughput: 9.5 tokens/s, Running: 1 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 0.5%, CPU KV cache usage: 0.0%.

The ext-proc server is not finding any active LoRA adapters for my vLLM pod:

fetchMetricsPeriodically requestMetrics: [{Date:2024-12-18T22:33:30Z PodName:vllm-llama2-7b-pool-55d46d588c-qqbsv PendingRequests:0 NumberOfActiveAdapters:0}]
fetchMetricsPeriodically loraMetrics: []
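
To rule out the metrics scrape itself, the pod's /metrics endpoint can be queried directly from the client pod. This is only a sanity check: the pod IP is taken from the ext-proc log above, and the grep is a broad filter because I'm not sure which metric name the ext-proc expects for active adapters:

$ kubectl exec po/client -- curl -s http://10.244.0.38:8000/metrics | grep -i lora

If nothing LoRA-related comes back, that would explain why loraMetrics stays empty even though requests are served.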

The adapter-loader init container logs show the tweet-summary LoRA module being pulled:

$ k logs deploy/vllm-llama2-7b-pool -c adapter-loader
['yard1/llama-2-7b-sql-lora-test', 'vineetsharma/qlora-adapter-Llama-2-7b-hf-TweetSumm']
Pulling adapter yard1/llama-2-7b-sql-lora-test
Fetching 9 files: 100%|██████████| 9/9 [00:01<00:00,  6.60it/s]
PAth here /adapters/hub/models--yard1--llama-2-7b-sql-lora-test/snapshots/0dfa347e8877a4d4ed19ee56c140fa518470028c
Pulling adapter vineetsharma/qlora-adapter-Llama-2-7b-hf-TweetSumm
Fetching 8 files: 100%|██████████| 8/8 [00:01<00:00,  7.99it/s]
PAth here /adapters/hub/models--vineetsharma--qlora-adapter-Llama-2-7b-hf-TweetSumm/snapshots/796337d8e866318c59e38f16416e3ecd11fe5403
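
The snapshot paths printed above should also be visible from the vLLM container if the shared volume is mounted as expected. A quick sanity check (the container name vllm is an assumption here, adjust to whatever the pool deployment actually uses):

$ kubectl exec deploy/vllm-llama2-7b-pool -c vllm -- ls /adapters/hub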

The vLLM container logs show the tweet-summary LoRA module registered in lora_modules:

INFO 12-18 14:53:19 api_server.py:652] args: Namespace(host=None, port=8000, uvicorn_log_level='info', allow_credentials=False, allowed_origins=['*'], allowed_methods=['*'], allowed_headers=['*'], api_key=None, lora_modules=[LoRAModulePath(name='sql-lora', path='/adapters/hub/models--yard1--llama-2-7b-sql-lora-test/snapshots/0dfa347e8877a4d4ed19ee56c140fa518470028c/', base_model_name=None), LoRAModulePath(name='tweet-summary', path='/adapters/hub/models--vineetsharma--qlora-adapter-Llama-2-7b-hf-TweetSumm/snapshots/796337d8e866318c59e38f16416e3ecd11fe5403', base_model_name=None), LoRAModulePath(name='sql-lora-0', path='/adapters/yard1/llama-2-7b-sql-lora-test_0', base_model_name=None), LoRAModulePath(name='sql-lora-1', path='/adapters/yard1/llama-2-7b-sql-lora-test_1', base_model_name=None), LoRAModulePath(name='sql-lora-2', path='/adapters/yard1/llama-2-7b-sql-lora-test_2', base_model_name=None), LoRAModulePath(name='sql-lora-3', path='/adapters/yard1/llama-2-7b-sql-lora-test_3', base_model_name=None), LoRAModulePath(name='sql-lora-4', path='/adapters/yard1/llama-2-7b-sql-lora-test_4', base_model_name=None), LoRAModulePath(name='tweet-summary-0', path='/adapters/vineetsharma/qlora-adapter-Llama-2-7b-hf-TweetSumm_0', base_model_name=None), LoRAModulePath(name='tweet-summary-1', path='/adapters/vineetsharma/qlora-adapter-Llama-2-7b-hf-TweetSumm_1', base_model_name=None), LoRAModulePath(name='tweet-summary-2', path='/adapters/vineetsharma/qlora-adapter-Llama-2-7b-hf-TweetSumm_2', base_model_name=None), LoRAModulePath(name='tweet-summary-3', path='/adapters/vineetsharma/qlora-adapter-Llama-2-7b-hf-TweetSumm_3', base_model_name=None), LoRAModulePath(name='tweet-summary-4', path='/adapters/vineetsharma/qlora-adapter-Llama-2-7b-hf-TweetSumm_4', base_model_name=None)], prompt_adapters=None, chat_template=None, chat_template_content_format='auto', response_role='assistant', ssl_keyfile=None, ssl_certfile=None, ssl_ca_certs=None, ssl_cert_reqs=0, root_path=None, middleware=[], return_tokens_as_token_ids=False, disable_frontend_multiprocessing=False, enable_auto_tool_choice=False, tool_call_parser=None, tool_parser_plugin='', model='meta-llama/Llama-2-7b-hf', task='auto', tokenizer=None, skip_tokenizer_init=False, revision=None, code_revision=None, tokenizer_revision=None, tokenizer_mode='auto', trust_remote_code=False, allowed_local_media_path=None, download_dir=None, load_format='auto', config_format=<ConfigFormat.AUTO: 'auto'>, dtype='auto', kv_cache_dtype='auto', quantization_param_path=None, max_model_len=None, guided_decoding_backend='xgrammar', logits_processor_pattern=None, distributed_executor_backend=None, worker_use_ray=False, pipeline_parallel_size=1, tensor_parallel_size=1, max_parallel_loading_workers=None, ray_workers_use_nsight=False, block_size=None, enable_prefix_caching=None, disable_sliding_window=False, use_v2_block_manager=True, num_lookahead_slots=0, seed=0, swap_space=4, cpu_offload_gb=0, gpu_memory_utilization=0.9, num_gpu_blocks_override=None, max_num_batched_tokens=None, max_num_seqs=None, max_logprobs=20, disable_log_stats=False, quantization=None, rope_scaling=None, rope_theta=None, hf_overrides=None, enforce_eager=False, max_seq_len_to_capture=8192, disable_custom_all_reduce=False, tokenizer_pool_size=0, tokenizer_pool_type='ray', tokenizer_pool_extra_config=None, limit_mm_per_prompt=None, mm_processor_kwargs=None, mm_cache_preprocessor=False, enable_lora=True, enable_lora_bias=False, max_loras=4, max_lora_rank=16, lora_extra_vocab_size=256, lora_dtype='auto', 
long_lora_scaling_factors=None, max_cpu_loras=12, fully_sharded_loras=False, enable_prompt_adapter=False, max_prompt_adapters=1, max_prompt_adapter_token=0, device='auto', num_scheduler_steps=1, multi_step_stream_outputs=True, scheduler_delay_factor=0.0, enable_chunked_prefill=None, speculative_model=None, speculative_model_quantization=None, num_speculative_tokens=None, speculative_disable_mqa_scorer=False, speculative_draft_tensor_parallel_size=None, speculative_max_model_len=None, speculative_disable_by_batch_size=None, ngram_prompt_lookup_max=None, ngram_prompt_lookup_min=None, spec_decoding_acceptance_method='rejection_sampler', typical_acceptance_sampler_posterior_threshold=None, typical_acceptance_sampler_posterior_alpha=None, disable_logprobs_during_spec_decoding=None, model_loader_extra_config=None, ignore_patterns=[], preemption_mode=None, served_model_name=None, qlora_adapter_name_or_path=None, otlp_traces_endpoint=None, collect_detailed_traces=None, disable_async_output_proc=False, scheduling_policy='fcfs', override_neuron_config=None, override_pooler_config=None, compilation_config=None, kv_transfer_config=None, worker_cls='auto', disable_log_requests=False, max_log_len=None, disable_fastapi_docs=False, enable_prompt_tokens_details=False)
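
Since the args show enable_lora=True and tweet-summary in lora_modules, the adapter should also appear when listing models on the pod's OpenAI-compatible API (pod IP again taken from the ext-proc log above):

$ kubectl exec po/client -- curl -s http://10.244.0.38:8000/v1/models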

Any troubleshooting suggestions are much appreciated.

@Kellthuzad

Hey Daneyon! I can take a peek tomorrow.

@Kellthuzad

/assign kfswain
