Redirect vLLM runtime guide to Hugging Face runtime overview (#408)
* Redirect vLLM runtime guide to Hugging Face runtime overview

Signed-off-by: Yuan Tang <terrytangyuan@gmail.com>

* Update README.md

Signed-off-by: Yuan Tang <terrytangyuan@gmail.com>

---------

Signed-off-by: Yuan Tang <terrytangyuan@gmail.com>
terrytangyuan authored Nov 4, 2024
1 parent acbc8ed commit c0d2ed9
Showing 1 changed file with 2 additions and 82 deletions.
84 changes: 2 additions & 82 deletions docs/modelserving/v1beta1/llm/vllm/README.md
@@ -1,83 +1,3 @@
## Deploy the LLaMA model with vLLM Runtime
Serving LLMs can be surprisingly slow even on high-end GPUs. [vLLM](https://github.com/vllm-project/vllm) is a fast and easy-to-use LLM inference engine that can achieve 10x-20x higher throughput than Hugging Face Transformers.
It supports [continuous batching](https://www.anyscale.com/blog/continuous-batching-llm-inference) for increased throughput and GPU utilization, and
[paged attention](https://vllm.ai) to address the memory bottleneck of autoregressive decoding, where all the attention key-value tensors (KV cache) are kept in GPU memory to generate the next tokens.
## vLLM Runtime

You can deploy the LLaMA model with the pre-built vLLM inference server container image using the `InferenceService` YAML API spec.
Work is in progress to integrate `vLLM` with the `Open Inference Protocol` and the KServe observability stack.

The LLaMA model can be downloaded from [Hugging Face](https://huggingface.co/meta-llama/Llama-2-7b) and uploaded to your cloud storage.
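
For example, a minimal sketch of that step, assuming you have accepted the model license on Hugging Face, have `huggingface-cli` and `gsutil` installed, and substituting `gs://your-bucket/llama-2-7b` with your own bucket path:

```bash
# Download the model weights and tokenizer locally
# (requires `huggingface-cli login` and an accepted license for the repo).
huggingface-cli download meta-llama/Llama-2-7b --local-dir ./Llama-2-7b

# Upload the model directory to your own cloud storage bucket (placeholder path).
gsutil -m cp -r ./Llama-2-7b gs://your-bucket/llama-2-7b
```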

=== "Yaml"


```yaml
kubectl apply -n kserve-test -f - <<EOF
apiVersion: serving.kserve.io/v1beta1
kind: InferenceService
metadata:
name: llama-2-7b
spec:
predictor:
containers:
- args:
- --port
- "8080"
- --model
- /mnt/models
command:
- python3
- -m
- vllm.entrypoints.openai.api_server
env:
- name: STORAGE_URI
value: gs://kfserving-examples/llm/huggingface/llama
image: kserve/vllmserver:latest
name: kserve-container
resources:
limits:
cpu: "4"
memory: 50Gi
nvidia.com/gpu: "1"
requests:
cpu: "1"
memory: 50Gi
nvidia.com/gpu: "1"
```

!!! Warning
    The vLLM runtime is still experimental; expect API changes and further integration in the next KServe release.

=== "kubectl"
```bash
kubectl apply -f ./vllm.yaml
```
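
Once the `InferenceService` is ready, you can send a test request through your ingress. The following is a minimal sketch, assuming `INGRESS_HOST` and `INGRESS_PORT` are already set for your ingress gateway (one way to determine them is shown in the benchmarking section below) and that the model is addressed as `/mnt/models`, matching the `--model` argument above since no served model name is configured:

```bash
# Check that the InferenceService has become ready.
kubectl get inferenceservice llama-2-7b -n kserve-test

# Resolve the service hostname from the InferenceService status URL.
SERVICE_HOSTNAME=$(kubectl get inferenceservice llama-2-7b -n kserve-test \
  -o jsonpath='{.status.url}' | cut -d "/" -f 3)

# Send an OpenAI-style completion request to the vLLM server through the ingress.
curl -H "Host: ${SERVICE_HOSTNAME}" -H "Content-Type: application/json" \
  "http://${INGRESS_HOST}:${INGRESS_PORT}/v1/completions" \
  -d '{"model": "/mnt/models", "prompt": "San Francisco is a", "max_tokens": 32}'
```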

## Benchmarking vLLM Runtime

You can download the benchmark dataset by running:
```bash
wget https://huggingface.co/datasets/anon8231489123/ShareGPT_Vicuna_unfiltered/resolve/main/ShareGPT_V3_unfiltered_cleaned_split.json
```

The tokenizer files can be found in the downloaded LLaMA model directory; the benchmark command below expects them under `./tokenizer` (see the sketch that follows).
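
A minimal sketch of staging those files, assuming the model was downloaded to `./Llama-2-7b` (an illustrative path; the exact tokenizer file names depend on your snapshot):

```bash
# Collect only the tokenizer artifacts the benchmark script needs.
# Adjust file names to what your downloaded snapshot actually contains.
mkdir -p ./tokenizer
cp ./Llama-2-7b/tokenizer.model \
   ./Llama-2-7b/tokenizer_config.json \
   ./Llama-2-7b/special_tokens_map.json \
   ./tokenizer/
```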

This assumes that your ingress can be accessed at `${INGRESS_HOST}:${INGRESS_PORT}`; if you have not determined these yet, follow [these instructions](../../../../get_started/first_isvc.md#4-determine-the-ingress-ip-and-ports)
to find your ingress IP and port, or use the sketch below.
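
For instance, a sketch assuming the default Istio ingress gateway with an external load balancer (adjust the namespace and service name to your environment):

```bash
# Look up the external address and HTTP port of the Istio ingress gateway.
export INGRESS_HOST=$(kubectl -n istio-system get service istio-ingressgateway \
  -o jsonpath='{.status.loadBalancer.ingress[0].ip}')
export INGRESS_PORT=$(kubectl -n istio-system get service istio-ingressgateway \
  -o jsonpath='{.spec.ports[?(@.name=="http2")].port}')
```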

You can then run the [benchmarking script](./benchmark.py) to send inference requests to the exposed URL.

```bash
python benchmark_serving.py --backend openai --port ${INGRESS_PORT} --host ${INGRESS_HOST} --dataset ./ShareGPT_V3_unfiltered_cleaned_split.json --tokenizer ./tokenizer --request-rate 5
```

!!! success "Expected Output"

    ```{ .json .no-copy }
    Total time: 216.81 s
    Throughput: 4.61 requests/s
    Average latency: 7.96 s
    Average latency per token: 0.02 s
    Average latency per output token: 0.04 s
    ```
Official vLLM support is available through the [Hugging Face Serving Runtime](https://kserve.github.io/website/master/modelserving/v1beta1/llm/huggingface/).
