diff --git a/.buildkite/nightly-benchmarks/nightly-annotation.md b/.buildkite/nightly-benchmarks/nightly-annotation.md
new file mode 100644
index 0000000000000..1e33793842bf8
--- /dev/null
+++ b/.buildkite/nightly-benchmarks/nightly-annotation.md
@@ -0,0 +1,28 @@
+
+## Description
+
+This file contains the download links for the benchmarking results.
+
+- [benchmarking pipeline](artifact://nightly-pipeline.yaml)
+- [benchmarking results](artifact://results.zip)
+- [benchmarking code](artifact://nightly-benchmarks.zip)
+
+Please download the visualization scripts from the post.
+
+
+## Results reproduction
+
+- Find the Docker image we use in the `benchmarking pipeline`.
+- Deploy the Docker image, and inside the container:
+  - Download `nightly-benchmarks.zip`.
+  - In the same folder, run the following commands:
+```
+export HF_TOKEN=
+apt update
+apt install -y git
+unzip nightly-benchmarks.zip
+VLLM_SOURCE_CODE_LOC=./ bash .buildkite/nightly-benchmarks/scripts/run-nightly-benchmarks.sh
+```
+
+The results will be inside `./benchmarks/results`.
+
diff --git a/.buildkite/nightly-benchmarks/nightly-descriptions.md b/.buildkite/nightly-benchmarks/nightly-descriptions.md
index c3d3cbf473968..7dec7a0fe0b4e 100644
--- a/.buildkite/nightly-benchmarks/nightly-descriptions.md
+++ b/.buildkite/nightly-benchmarks/nightly-descriptions.md
@@ -1,45 +1,39 @@
# Nightly benchmark
-The main goal of this benchmarking is two-fold:
-- Performance clarity: Provide clarity on which one (vllm, tensorrt-llm, lmdeploy and tgi) leads in performance in what workload.
-- Reproducible: one can run the exact same set of benchmarking commands inside the exact same docker by following reproducing instructions in [reproduce.md]().
-
-
-## Docker images
-
-We benchmark vllm, tensorrt-llm, lmdeploy and tgi using the following docker images:
-- vllm/vllm-openai:v0.5.0.post1
-- nvcr.io/nvidia/tritonserver:24.04-trtllm-python-py3
-- openmmlab/lmdeploy:v0.5.0
-- ghcr.io/huggingface/text-generation-inference:2.1
-
-
-
-
-## Hardware
-
-One AWS node with 8x NVIDIA A100 GPUs.
-
-
-## Workload description
-
-We benchmark vllm, tensorrt-llm, lmdeploy and tgi using the following workload:
-
-- Input length: randomly sample 500 prompts from ShareGPT dataset (with fixed random seed).
-- Output length: the corresponding output length of these 500 prompts.
-- Models: llama-3 8B, llama-3 70B, mixtral 8x7B.
-- Average QPS (query per second): 4 for the small model (llama-3 8B) and 2 for other two models. For each QPS, the arrival time of each query is determined using a random Poisson process (with fixed random seed).
-- Evaluation metrics: Throughput (higher the better), TTFT (time to the first token, lower the better), ITL (inter-token latency, lower the better).
-
-
-
-## Plots
-
-In the following plots, the dot shows the mean and the error bar shows the standard error of the mean. Value 0 means that the corresponding benchmark crashed.
-
-
-
-## Results
-
-{nightly_results_benchmarking_table}
+This benchmark aims to:
+- Provide performance clarity: make it clear which engine (vLLM, TensorRT-LLM, LMDeploy or SGLang) leads in performance on which workload.
+- Be reproducible: one can run the exact same set of benchmarking commands inside the exact same Docker image by following the reproduction instructions.
+
+Latest results: [results link](https://blog.vllm.ai/2024/09/05/perf-update.html) (scroll to the end).
+
+Latest reproduction guide: [GitHub issue link](https://github.com/vllm-project/vllm/issues/8176)
+
+
+## Setup
+
+- Docker images:
+ - vLLM: `vllm/vllm-openai:v0.6.2`
+ - SGLang: `lmsysorg/sglang:v0.3.2-cu121`
+ - LMDeploy: `openmmlab/lmdeploy:v0.6.1-cu12`
+ - TensorRT-LLM: `nvcr.io/nvidia/tritonserver:24.07-trtllm-python-py3`
+    - *NOTE: we use r24.07 because the current implementation only works with this version. We are going to bump this up.*
+ - Check [nightly-pipeline.yaml](nightly-pipeline.yaml) for the concrete docker images, specs and commands we use for the benchmark.
+- Hardware
+  - 8x NVIDIA A100 GPUs
+- Workload:
+ - Dataset
+ - ShareGPT dataset
+    - Prefill-heavy dataset (on average 462 input tokens, 16 output tokens)
+    - Decode-heavy dataset (on average 462 input tokens, 256 output tokens)
+ - Check [nightly-tests.json](tests/nightly-tests.json) for the concrete configuration of datasets we use.
+  - Models: Llama-3 8B, Llama-3 70B.
+    - We do not use Llama 3.1, as it is incompatible with TensorRT-LLM r24.07 ([issue](https://github.com/NVIDIA/TensorRT-LLM/issues/2105)).
+  - Average QPS (queries per second): 2, 4, 8, 16, 32 and inf.
+    - Queries are randomly sampled, and arrival patterns are determined via a Poisson process, all with a fixed random seed (see the sketch below).
+  - Evaluation metrics: throughput (higher is better), TTFT (time to first token, lower is better), ITL (inter-token latency, lower is better).
+
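+The arrival pattern above can be illustrated with a minimal sketch (an illustration only, not the exact benchmarking code): for a finite target QPS, inter-arrival times are drawn from an exponential distribution with a fixed random seed, which yields a Poisson arrival process; `inf` QPS corresponds to issuing all requests back-to-back without waiting.
+
+```
+import numpy as np
+
+def poisson_arrival_times(qps: float, num_requests: int, seed: int = 0) -> np.ndarray:
+    """Request arrival times (in seconds) for a Poisson process at the given QPS."""
+    rng = np.random.default_rng(seed)
+    # Inter-arrival times of a Poisson process are exponentially distributed,
+    # with a mean of 1 / qps seconds between consecutive requests.
+    inter_arrival = rng.exponential(scale=1.0 / qps, size=num_requests)
+    return np.cumsum(inter_arrival)
+
+print(poisson_arrival_times(qps=4.0, num_requests=5))
+```
+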
+## Known issues
+
+- TRT-LLM crashes with Llama 3.1 8B ([issue](https://github.com/NVIDIA/TensorRT-LLM/issues/2105)).
+- TGI does not support the `--ignore-eos` flag.
\ No newline at end of file
diff --git a/.buildkite/nightly-benchmarks/nightly-pipeline.yaml b/.buildkite/nightly-benchmarks/nightly-pipeline.yaml
index 6e399bb936fbc..199517e8b067c 100644
--- a/.buildkite/nightly-benchmarks/nightly-pipeline.yaml
+++ b/.buildkite/nightly-benchmarks/nightly-pipeline.yaml
@@ -13,7 +13,7 @@ common_pod_spec: &common_pod_spec
common_container_settings: &common_container_settings
command:
- - bash .buildkite/nightly-benchmarks/run-nightly-suite.sh
+ - bash .buildkite/nightly-benchmarks/scripts/run-nightly-benchmarks.sh
resources:
limits:
nvidia.com/gpu: 8
@@ -37,7 +37,10 @@ common_container_settings: &common_container_settings
steps:
- block: ":rocket: Ready for comparing vllm against alternatives? This will take 4 hours."
- - label: "A100 trt benchmark"
+
+
+
+ - label: "A100 vllm step 10"
priority: 100
agents:
queue: A100
@@ -46,7 +49,21 @@ steps:
podSpec:
<<: *common_pod_spec
containers:
- - image: nvcr.io/nvidia/tritonserver:24.04-trtllm-python-py3
+ - image: vllm/vllm-openai:v0.6.2
+ <<: *common_container_settings
+
+
+
+ - label: "A100 sglang benchmark"
+ priority: 100
+ agents:
+ queue: A100
+ plugins:
+ - kubernetes:
+ podSpec:
+ <<: *common_pod_spec
+ containers:
+ - image: lmsysorg/sglang:v0.3.2-cu121
<<: *common_container_settings
- label: "A100 lmdeploy benchmark"
@@ -58,11 +75,13 @@ steps:
podSpec:
<<: *common_pod_spec
containers:
- - image: openmmlab/lmdeploy:v0.5.0
+ - image: openmmlab/lmdeploy:v0.6.1-cu12
<<: *common_container_settings
-
- - label: "A100 vllm benchmark"
+
+
+
+ - label: "A100 trt llama-8B"
priority: 100
agents:
queue: A100
@@ -71,10 +90,25 @@ steps:
podSpec:
<<: *common_pod_spec
containers:
- - image: vllm/vllm-openai:latest
+ - image: nvcr.io/nvidia/tritonserver:24.07-trtllm-python-py3
<<: *common_container_settings
+ env:
+ - name: VLLM_USAGE_SOURCE
+ value: ci-test
+ - name: HF_HOME
+ value: /root/.cache/huggingface
+ - name: VLLM_SOURCE_CODE_LOC
+ value: /workspace/build/buildkite/vllm/performance-benchmark
+ - name: HF_TOKEN
+ valueFrom:
+ secretKeyRef:
+ name: hf-token-secret
+ key: token
+ - name: TEST_SELECTOR
+ value: "llama8B"
- - label: "A100 tgi benchmark"
+
+ - label: "A100 trt llama-70B"
priority: 100
agents:
queue: A100
@@ -83,12 +117,54 @@ steps:
podSpec:
<<: *common_pod_spec
containers:
- - image: ghcr.io/huggingface/text-generation-inference:2.1
+ - image: nvcr.io/nvidia/tritonserver:24.07-trtllm-python-py3
<<: *common_container_settings
+ env:
+ - name: VLLM_USAGE_SOURCE
+ value: ci-test
+ - name: HF_HOME
+ value: /root/.cache/huggingface
+ - name: VLLM_SOURCE_CODE_LOC
+ value: /workspace/build/buildkite/vllm/performance-benchmark
+ - name: HF_TOKEN
+ valueFrom:
+ secretKeyRef:
+ name: hf-token-secret
+ key: token
+ - name: TEST_SELECTOR
+ value: "llama70B"
+
+
+ # FIXME(Kuntai): uncomment this after NVIDIA gives us their test docker image
+ # - label: "A100 trt benchmark"
+ # priority: 100
+ # agents:
+ # queue: A100
+ # plugins:
+ # - kubernetes:
+ # podSpec:
+ # <<: *common_pod_spec
+ # containers:
+ # - image: nvcr.io/nvidia/tritonserver:24.07-trtllm-python-py3
+ # <<: *common_container_settings
+
+
+ # FIXME(Kuntai): uncomment this after TGI supports `--ignore-eos`.
+ # - label: "A100 tgi benchmark"
+ # priority: 100
+ # agents:
+ # queue: A100
+ # plugins:
+ # - kubernetes:
+ # podSpec:
+ # <<: *common_pod_spec
+ # containers:
+ # - image: ghcr.io/huggingface/text-generation-inference:2.2.0
+ # <<: *common_container_settings
- wait
- - label: "Plot"
+ - label: "Collect the results"
priority: 100
agents:
queue: A100
@@ -117,4 +193,4 @@ steps:
name: hf-token-secret
key: token
- - wait
\ No newline at end of file
+ - block: ":rocket: check the results!"
\ No newline at end of file
diff --git a/.buildkite/nightly-benchmarks/run-nightly-suite.sh b/.buildkite/nightly-benchmarks/run-nightly-suite.sh
deleted file mode 100644
index 627a3e6971578..0000000000000
--- a/.buildkite/nightly-benchmarks/run-nightly-suite.sh
+++ /dev/null
@@ -1,76 +0,0 @@
-#!/bin/bash
-
-set -o pipefail
-set -x
-
-check_gpus() {
- # check the number of GPUs and GPU type.
- declare -g gpu_count=$(nvidia-smi --list-gpus | wc -l)
- if [[ $gpu_count -gt 0 ]]; then
- echo "GPU found."
- else
- echo "Need at least 1 GPU to run benchmarking."
- exit 1
- fi
- declare -g gpu_type=$(echo $(nvidia-smi --query-gpu=name --format=csv,noheader) | awk '{print $2}')
- echo "GPU type is $gpu_type"
-}
-
-check_hf_token() {
- # check if HF_TOKEN is available and valid
- if [[ -z "$HF_TOKEN" ]]; then
- echo "Error: HF_TOKEN is not set."
- exit 1
- elif [[ ! "$HF_TOKEN" =~ ^hf_ ]]; then
- echo "Error: HF_TOKEN does not start with 'hf_'."
- exit 1
- else
- echo "HF_TOKEN is set and valid."
- fi
-}
-
-main() {
-
- check_gpus
- check_hf_token
-
- df -h
-
- (which wget && which curl) || (apt-get update && apt-get install -y wget curl)
- (which jq) || (apt-get update && apt-get -y install jq)
-
- cd $VLLM_SOURCE_CODE_LOC/benchmarks
- wget https://huggingface.co/datasets/anon8231489123/ShareGPT_Vicuna_unfiltered/resolve/main/ShareGPT_V3_unfiltered_cleaned_split.json
-
-
- # run lmdeploy
- if which lmdeploy >/dev/null; then
- echo "lmdeploy is available, redirect to run-lmdeploy-nightly.sh"
- bash ../.buildkite/nightly-benchmarks/scripts/run-lmdeploy-nightly.sh
- exit 0
- fi
-
- # run tgi
- if [ -e /tgi-entrypoint.sh ]; then
- echo "tgi is available, redirect to run-tgi-nightly.sh"
- bash ../.buildkite/nightly-benchmarks/scripts/run-tgi-nightly.sh
- exit 0
- fi
-
- # run trt
- if which trtllm-build >/dev/null; then
- echo "trtllm is available, redirect to run-trt-nightly.sh"
- bash ../.buildkite/nightly-benchmarks/scripts/run-trt-nightly.sh
- exit 0
- fi
-
- # run vllm
- if [ -e /vllm-workspace ]; then
- echo "vllm is available, redirect to run-vllm-nightly.sh"
- bash ../.buildkite/nightly-benchmarks/scripts/run-vllm-nightly.sh
- exit 0
- fi
-
-}
-
-main "$@"
\ No newline at end of file
diff --git a/.buildkite/nightly-benchmarks/scripts/generate-nightly-markdown.py b/.buildkite/nightly-benchmarks/scripts/generate-nightly-markdown.py
new file mode 100644
index 0000000000000..6059588fe7277
--- /dev/null
+++ b/.buildkite/nightly-benchmarks/scripts/generate-nightly-markdown.py
@@ -0,0 +1,95 @@
+import argparse
+import json
+from pathlib import Path
+
+import numpy as np
+import pandas as pd
+from tabulate import tabulate
+
+
+def parse_arguments():
+ parser = argparse.ArgumentParser(
+ description=
+ 'Parse command line arguments for summary-nightly-results script.')
+ parser.add_argument('--results-folder',
+ type=str,
+ required=True,
+ help='The folder where the results are stored.')
+ parser.add_argument('--description',
+ type=str,
+ required=True,
+ help='Description of the results.')
+
+ args = parser.parse_args()
+ return args
+
+
+def get_perf(df, method, model, metric):
+
+ means = []
+
+ for qps in [2, 4, 8, 16, "inf"]:
+ target = df['Test name'].str.contains(model)
+ target = target & df['Engine'].str.contains(method)
+ target = target & df['Test name'].str.contains("qps_" + str(qps))
+ filtered_df = df[target]
+
+ if filtered_df.empty:
+ means.append(0.)
+ else:
+ means.append(filtered_df[metric].values[0])
+
+ return np.array(means)
+
+
+def get_perf_w_std(df, method, model, metric):
+
+ if metric in ["TTFT", "ITL"]:
+ mean = get_perf(df, method, model, "Mean " + metric + " (ms)")
+ mean = mean.tolist()
+ std = get_perf(df, method, model, "Std " + metric + " (ms)")
+ if std.mean() == 0:
+ std = None
+ success = get_perf(df, method, model, "Successful req.")
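+        # Dividing the per-request std by sqrt(successful requests) below yields the standard error of the mean.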
+ if std is not None:
+ std = std / np.sqrt(success)
+ std = std.tolist()
+
+ else:
+ assert metric == "Tput"
+ mean = get_perf(df, method, model, "Input Tput (tok/s)") + get_perf(
+ df, method, model, "Output Tput (tok/s)")
+ mean = mean.tolist()
+ std = None
+
+ return mean, std
+
+
+def main(args):
+ results_folder = Path(args.results_folder)
+
+ results = []
+
+ # collect results
+ for test_file in results_folder.glob("*_nightly_results.json"):
+ with open(test_file, "r") as f:
+ results = results + json.loads(f.read())
+
+ # generate markdown table
+ df = pd.DataFrame.from_dict(results)
+
+ md_table = tabulate(df, headers='keys', tablefmt='pipe', showindex=False)
+
+ with open(args.description, "r") as f:
+ description = f.read()
+
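+    # The description markdown is expected to contain a {nightly_results_benchmarking_table} placeholder, which is filled with the generated table here.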
+ description = description.format(
+ nightly_results_benchmarking_table=md_table)
+
+ with open("nightly_results.md", "w") as f:
+ f.write(description)
+
+
+if __name__ == '__main__':
+ args = parse_arguments()
+ main(args)
diff --git a/.buildkite/nightly-benchmarks/scripts/launch-server.sh b/.buildkite/nightly-benchmarks/scripts/launch-server.sh
new file mode 100644
index 0000000000000..e9d7d6a8d760a
--- /dev/null
+++ b/.buildkite/nightly-benchmarks/scripts/launch-server.sh
@@ -0,0 +1,241 @@
+#!/bin/bash
+
+# Currently FP8 benchmark is NOT enabled.
+
+set -x
+server_params=$1
+common_params=$2
+
+json2args() {
+    # transforms the JSON string into command line args; '_' is replaced with '-'
+ # example:
+ # input: { "model": "meta-llama/Llama-2-7b-chat-hf", "tensor_parallel_size": 1 }
+ # output: --model meta-llama/Llama-2-7b-chat-hf --tensor-parallel-size 1
+ local json_string=$1
+ local args=$(
+ echo "$json_string" | jq -r '
+ to_entries |
+ map("--" + (.key | gsub("_"; "-")) + " " + (.value | tostring)) |
+ join(" ")
+ '
+ )
+ echo "$args"
+}
+
+launch_trt_server() {
+
+ model_path=$(echo "$common_params" | jq -r '.model')
+ model_name="${model_path#*/}"
+ model_type=$(echo "$server_params" | jq -r '.model_type')
+ model_dtype=$(echo "$server_params" | jq -r '.model_dtype')
+ model_tp_size=$(echo "$common_params" | jq -r '.tp')
+ max_batch_size=$(echo "$server_params" | jq -r '.max_batch_size')
+ max_input_len=$(echo "$server_params" | jq -r '.max_input_len')
+ max_seq_len=$(echo "$server_params" | jq -r '.max_seq_len')
+ max_num_tokens=$(echo "$server_params" | jq -r '.max_num_tokens')
+ trt_llm_version=$(echo "$server_params" | jq -r '.trt_llm_version')
+
+ # create model caching directory
+ cd ~
+ rm -rf models
+ mkdir -p models
+ cd models
+ models_dir=$(pwd)
+ trt_model_path=${models_dir}/${model_name}-trt-ckpt
+ trt_engine_path=${models_dir}/${model_name}-trt-engine
+
+ # clone tensorrt backend
+ cd /
+ rm -rf tensorrtllm_backend
+ git clone https://github.com/triton-inference-server/tensorrtllm_backend.git
+ git lfs install
+ cd tensorrtllm_backend
+ git checkout $trt_llm_version
+ tensorrtllm_backend_dir=$(pwd)
+ git submodule update --init --recursive
+
+ # build trtllm engine
+ cd /tensorrtllm_backend
+ cd ./tensorrt_llm/examples/${model_type}
+ python3 convert_checkpoint.py \
+ --model_dir ${model_path} \
+ --dtype ${model_dtype} \
+ --tp_size ${model_tp_size} \
+ --output_dir ${trt_model_path}
+ trtllm-build \
+ --checkpoint_dir ${trt_model_path} \
+ --use_fused_mlp \
+ --reduce_fusion disable \
+ --workers 8 \
+ --gpt_attention_plugin ${model_dtype} \
+ --gemm_plugin ${model_dtype} \
+ --tp_size ${model_tp_size} \
+ --max_batch_size ${max_batch_size} \
+ --max_input_len ${max_input_len} \
+ --max_seq_len ${max_seq_len} \
+ --max_num_tokens ${max_num_tokens} \
+ --output_dir ${trt_engine_path}
+
+ # handle triton protobuf files and launch triton server
+ cd /tensorrtllm_backend
+ mkdir triton_model_repo
+ cp -r all_models/inflight_batcher_llm/* triton_model_repo/
+ cd triton_model_repo
+ rm -rf ./tensorrt_llm/1/*
+ cp -r ${trt_engine_path}/* ./tensorrt_llm/1
+ python3 ../tools/fill_template.py -i tensorrt_llm/config.pbtxt triton_backend:tensorrtllm,engine_dir:/tensorrtllm_backend/triton_model_repo/tensorrt_llm/1,decoupled_mode:true,batching_strategy:inflight_fused_batching,batch_scheduler_policy:guaranteed_no_evict,exclude_input_in_output:true,triton_max_batch_size:2048,max_queue_delay_microseconds:0,max_beam_width:1,max_queue_size:2048,enable_kv_cache_reuse:false
+ python3 ../tools/fill_template.py -i preprocessing/config.pbtxt triton_max_batch_size:2048,tokenizer_dir:$model_path,preprocessing_instance_count:5
+ python3 ../tools/fill_template.py -i postprocessing/config.pbtxt triton_max_batch_size:2048,tokenizer_dir:$model_path,postprocessing_instance_count:5,skip_special_tokens:false
+ python3 ../tools/fill_template.py -i ensemble/config.pbtxt triton_max_batch_size:$max_batch_size
+ python3 ../tools/fill_template.py -i tensorrt_llm_bls/config.pbtxt triton_max_batch_size:$max_batch_size,decoupled_mode:true,accumulate_tokens:"False",bls_instance_count:1
+ cd /tensorrtllm_backend
+ python3 scripts/launch_triton_server.py \
+ --world_size=${model_tp_size} \
+ --model_repo=/tensorrtllm_backend/triton_model_repo &
+
+}
+
+launch_tgi_server() {
+ model=$(echo "$common_params" | jq -r '.model')
+ tp=$(echo "$common_params" | jq -r '.tp')
+ dataset_name=$(echo "$common_params" | jq -r '.dataset_name')
+ dataset_path=$(echo "$common_params" | jq -r '.dataset_path')
+ port=$(echo "$common_params" | jq -r '.port')
+ num_prompts=$(echo "$common_params" | jq -r '.num_prompts')
+ server_args=$(json2args "$server_params")
+
+ if echo "$common_params" | jq -e 'has("fp8")' >/dev/null; then
+ echo "Key 'fp8' exists in common params."
+ server_command="/tgi-entrypoint.sh \
+ --model-id $model \
+ --num-shard $tp \
+ --port $port \
+ --quantize fp8 \
+ $server_args"
+ else
+ echo "Key 'fp8' does not exist in common params."
+ server_command="/tgi-entrypoint.sh \
+ --model-id $model \
+ --num-shard $tp \
+ --port $port \
+ $server_args"
+ fi
+
+ echo "Server command: $server_command"
+ eval "$server_command" &
+
+}
+
+launch_lmdeploy_server() {
+ model=$(echo "$common_params" | jq -r '.model')
+ tp=$(echo "$common_params" | jq -r '.tp')
+ dataset_name=$(echo "$common_params" | jq -r '.dataset_name')
+ dataset_path=$(echo "$common_params" | jq -r '.dataset_path')
+ port=$(echo "$common_params" | jq -r '.port')
+ num_prompts=$(echo "$common_params" | jq -r '.num_prompts')
+ server_args=$(json2args "$server_params")
+
+ server_command="lmdeploy serve api_server $model \
+ --tp $tp \
+ --server-port $port \
+ $server_args"
+
+ # run the server
+ echo "Server command: $server_command"
+ bash -c "$server_command" &
+}
+
+launch_sglang_server() {
+
+ model=$(echo "$common_params" | jq -r '.model')
+ tp=$(echo "$common_params" | jq -r '.tp')
+ dataset_name=$(echo "$common_params" | jq -r '.dataset_name')
+ dataset_path=$(echo "$common_params" | jq -r '.dataset_path')
+ port=$(echo "$common_params" | jq -r '.port')
+ num_prompts=$(echo "$common_params" | jq -r '.num_prompts')
+ server_args=$(json2args "$server_params")
+
+ if echo "$common_params" | jq -e 'has("fp8")' >/dev/null; then
+ echo "Key 'fp8' exists in common params. Use neuralmagic fp8 model for convenience."
+ model=$(echo "$common_params" | jq -r '.neuralmagic_quantized_model')
+ server_command="python3 \
+ -m sglang.launch_server \
+ --tp $tp \
+ --model-path $model \
+ --port $port \
+ $server_args"
+ else
+ echo "Key 'fp8' does not exist in common params."
+ server_command="python3 \
+ -m sglang.launch_server \
+ --tp $tp \
+ --model-path $model \
+ --port $port \
+ $server_args"
+ fi
+
+ # run the server
+ echo "Server command: $server_command"
+ eval "$server_command" &
+}
+
+launch_vllm_server() {
+
+ export VLLM_HOST_IP=$(hostname -I | awk '{print $1}')
+
+ model=$(echo "$common_params" | jq -r '.model')
+ tp=$(echo "$common_params" | jq -r '.tp')
+ dataset_name=$(echo "$common_params" | jq -r '.dataset_name')
+ dataset_path=$(echo "$common_params" | jq -r '.dataset_path')
+ port=$(echo "$common_params" | jq -r '.port')
+ num_prompts=$(echo "$common_params" | jq -r '.num_prompts')
+ server_args=$(json2args "$server_params")
+
+ if echo "$common_params" | jq -e 'has("fp8")' >/dev/null; then
+ echo "Key 'fp8' exists in common params. Use neuralmagic fp8 model for convenience."
+ model=$(echo "$common_params" | jq -r '.neuralmagic_quantized_model')
+ server_command="python3 \
+ -m vllm.entrypoints.openai.api_server \
+ -tp $tp \
+ --model $model \
+ --port $port \
+ $server_args"
+ else
+ echo "Key 'fp8' does not exist in common params."
+ server_command="python3 \
+ -m vllm.entrypoints.openai.api_server \
+ -tp $tp \
+ --model $model \
+ --port $port \
+ $server_args"
+ fi
+
+ # run the server
+ echo "Server command: $server_command"
+ eval "$server_command" &
+}
+
+main() {
+
+ if [[ $CURRENT_LLM_SERVING_ENGINE == "trt" ]]; then
+ launch_trt_server
+ fi
+
+ if [[ $CURRENT_LLM_SERVING_ENGINE == "tgi" ]]; then
+ launch_tgi_server
+ fi
+
+ if [[ $CURRENT_LLM_SERVING_ENGINE == "lmdeploy" ]]; then
+ launch_lmdeploy_server
+ fi
+
+ if [[ $CURRENT_LLM_SERVING_ENGINE == "sglang" ]]; then
+ launch_sglang_server
+ fi
+
+ if [[ "$CURRENT_LLM_SERVING_ENGINE" == *"vllm"* ]]; then
+ launch_vllm_server
+ fi
+}
+
+main
diff --git a/.buildkite/nightly-benchmarks/scripts/launch-trt-server.sh b/.buildkite/nightly-benchmarks/scripts/launch-trt-server.sh
deleted file mode 100644
index f8262653a6628..0000000000000
--- a/.buildkite/nightly-benchmarks/scripts/launch-trt-server.sh
+++ /dev/null
@@ -1,102 +0,0 @@
-#!/bin/bash
-
-
-server_params=$1
-common_params=$2
-
-
-
-model_path=$(echo "$common_params" | jq -r '.model')
-model_name="${model_path#*/}"
-model_type=$(echo "$server_params" | jq -r '.model_type')
-model_dtype=$(echo "$server_params" | jq -r '.model_dtype')
-model_tp_size=$(echo "$common_params" | jq -r '.tp')
-max_batch_size=$(echo "$server_params" | jq -r '.max_batch_size')
-max_input_len=$(echo "$server_params" | jq -r '.max_input_len')
-max_output_len=$(echo "$server_params" | jq -r '.max_output_len')
-trt_llm_version=$(echo "$server_params" | jq -r '.trt_llm_version')
-
-cd ~
-rm -rf models
-mkdir -p models
-cd models
-models_dir=$(pwd)
-trt_model_path=${models_dir}/${model_name}-trt-ckpt
-trt_engine_path=${models_dir}/${model_name}-trt-engine
-
-cd ~
-rm -rf tensorrt-demo
-git clone https://github.com/neuralmagic/tensorrt-demo.git
-cd tensorrt-demo
-tensorrt_demo_dir=$(pwd)
-
-# make sure the parameter inside tensorrt_demo is consistent to envvar
-sed -i.bak "/key: \"tokenizer_dir\"/,/string_value:/s|string_value: \".*\"|string_value: \"$model_path\"|" ./triton_model_repo/postprocessing/config.pbtxt
-sed -i.bak "/key: \"tokenizer_dir\"/,/string_value:/s|string_value: \".*\"|string_value: \"$model_path\"|" ./triton_model_repo/preprocessing/config.pbtxt
-sed -i.bak "s|\(max_batch_size:\s*\)[0-9]*|\1$max_batch_size|g" ./triton_model_repo/ensemble/config.pbtxt
-sed -i.bak "s|\(max_batch_size:\s*\)[0-9]*|\1$max_batch_size|g" ./triton_model_repo/preprocessing/config.pbtxt
-sed -i.bak "s|\(max_batch_size:\s*\)[0-9]*|\1$max_batch_size|g" ./triton_model_repo/postprocessing/config.pbtxt
-sed -i.bak "s|\(max_batch_size:\s*\)[0-9]*|\1$max_batch_size|g" ./triton_model_repo/tensorrt_llm_bls/config.pbtxt
-
-
-cd /
-rm -rf tensorrtllm_backend
-git clone https://github.com/triton-inference-server/tensorrtllm_backend.git
-git lfs install
-cd tensorrtllm_backend
-git checkout $trt_llm_version
-tensorrtllm_backend_dir=$(pwd)
-git submodule update --init --recursive
-cp -r ${tensorrt_demo_dir}/triton_model_repo ${tensorrtllm_backend_dir}/
-
-cd /tensorrtllm_backend
-cd ./tensorrt_llm/examples/${model_type}
-
-
-if echo "$common_params" | jq -e 'has("fp8")' > /dev/null; then
-
- echo "Key 'fp8' exists in common params. Use quantize.py instead of convert_checkpoint.py"
- echo "Reference: https://github.com/NVIDIA/TensorRT-LLM/blob/main/examples/llama/README.md"
- python ../quantization/quantize.py \
- --model_dir ${model_path} \
- --dtype ${model_dtype} \
- --tp_size ${model_tp_size} \
- --output_dir ${trt_model_path} \
- --qformat fp8 \
- --kv_cache_dtype fp8 \
- --calib_size 2
-
-else
-
- echo "Key 'fp8' does not exist in common params. Use convert_checkpoint.py"
- python3 convert_checkpoint.py \
- --model_dir ${model_path} \
- --dtype ${model_dtype} \
- --tp_size ${model_tp_size} \
- --output_dir ${trt_model_path}
-
-fi
-
-
-
-trtllm-build \
---checkpoint_dir=${trt_model_path} \
---gpt_attention_plugin=${model_dtype} \
---gemm_plugin=${model_dtype} \
---remove_input_padding=enable \
---paged_kv_cache=enable \
---tp_size=${model_tp_size} \
---max_batch_size=${max_batch_size} \
---max_input_len=${max_input_len} \
---max_output_len=${max_output_len} \
---max_num_tokens=${max_output_len} \
---opt_num_tokens=${max_output_len} \
---output_dir=${trt_engine_path}
-
-cd /tensorrtllm_backend/triton_model_repo
-rm -rf ./tensorrt_llm/1/*
-cp -r ${trt_engine_path}/* ./tensorrt_llm/1
-cd /tensorrtllm_backend
-python3 scripts/launch_triton_server.py \
---world_size=${model_tp_size} \
---model_repo=/tensorrtllm_backend/triton_model_repo &
\ No newline at end of file
diff --git a/.buildkite/nightly-benchmarks/scripts/nightly-annotate.sh b/.buildkite/nightly-benchmarks/scripts/nightly-annotate.sh
index 1168912c6e229..c6a1bbdeb7d48 100644
--- a/.buildkite/nightly-benchmarks/scripts/nightly-annotate.sh
+++ b/.buildkite/nightly-benchmarks/scripts/nightly-annotate.sh
@@ -8,6 +8,7 @@ main() {
(which wget && which curl) || (apt-get update && apt-get install -y wget curl)
(which jq) || (apt-get update && apt-get -y install jq)
+    (which zip) || (apt-get update && apt-get install -y zip)
if [ ! -f /workspace/buildkite-agent ]; then
echo "buildkite-agent binary not found. Skip plotting the results."
@@ -24,17 +25,54 @@ main() {
ls
ls results/
- # generate figures
- python3 -m pip install tabulate pandas matplotlib
- python3 $VLLM_SOURCE_CODE_LOC/.buildkite/nightly-benchmarks/scripts/plot-nightly-results.py \
- --description $description \
- --results-folder results/
+ # upload benchmark results
+ zip -r results.zip results/
+ /workspace/buildkite-agent artifact upload "results.zip"
+
+ # upload benchmarking scripts
+ cd $VLLM_SOURCE_CODE_LOC/
+ zip -r nightly-benchmarks.zip .buildkite/ benchmarks/
+ /workspace/buildkite-agent artifact upload "nightly-benchmarks.zip"
+
+ cd $VLLM_SOURCE_CODE_LOC/.buildkite/nightly-benchmarks/
+ # upload benchmarking pipeline
+ /workspace/buildkite-agent artifact upload "nightly-pipeline.yaml"
+
+ cd $VLLM_SOURCE_CODE_LOC/.buildkite/nightly-benchmarks/
+ /workspace/buildkite-agent annotate --style "success" --context "nightly-benchmarks-results" --append < nightly-annotation.md
+
+
+
+    # The figures should be generated by a separate process outside the CI/CD pipeline
+
+ # # generate figures
+ # python3 -m pip install tabulate pandas matplotlib
+
+ # python3 $VLLM_SOURCE_CODE_LOC/.buildkite/nightly-benchmarks/scripts/generate-nightly-markdown.py \
+ # --description $description \
+ # --results-folder results/
+
+
+ # python3 $VLLM_SOURCE_CODE_LOC/.buildkite/nightly-benchmarks/scripts/plot-nightly-results.py \
+ # --description $description \
+ # --results-folder results/ \
+ # --dataset sharegpt
+
+ # python3 $VLLM_SOURCE_CODE_LOC/.buildkite/nightly-benchmarks/scripts/plot-nightly-results.py \
+ # --description $description \
+ # --results-folder results/ \
+ # --dataset sonnet_2048_128
+
+ # python3 $VLLM_SOURCE_CODE_LOC/.buildkite/nightly-benchmarks/scripts/plot-nightly-results.py \
+ # --description $description \
+ # --results-folder results/ \
+ # --dataset sonnet_128_2048
- # upload results and figures
- /workspace/buildkite-agent artifact upload "nightly_results.png"
- /workspace/buildkite-agent artifact upload $VLLM_SOURCE_CODE_LOC/.buildkite/nightly-benchmarks/nightly-pipeline.yaml
- /workspace/buildkite-agent artifact upload $VLLM_SOURCE_CODE_LOC/.buildkite/nightly-benchmarks/tests/nightly-tests.json
- /workspace/buildkite-agent annotate --style "success" --context "nightly-benchmarks-results" --append < nightly_results.md
+ # # upload results and figures
+ # /workspace/buildkite-agent artifact upload "nightly_results*.png"
+ # /workspace/buildkite-agent artifact upload $VLLM_SOURCE_CODE_LOC/.buildkite/nightly-benchmarks/nightly-pipeline.yaml
+ # /workspace/buildkite-agent artifact upload $VLLM_SOURCE_CODE_LOC/.buildkite/nightly-benchmarks/tests/nightly-tests.json
+ # /workspace/buildkite-agent annotate --style "success" --context "nightly-benchmarks-results" --append < nightly_results.md
}
main "$@"
\ No newline at end of file
diff --git a/.buildkite/nightly-benchmarks/scripts/plot-nightly-results.py b/.buildkite/nightly-benchmarks/scripts/plot-nightly-results.py
deleted file mode 100644
index e5cfcc64a9b2a..0000000000000
--- a/.buildkite/nightly-benchmarks/scripts/plot-nightly-results.py
+++ /dev/null
@@ -1,135 +0,0 @@
-import argparse
-import json
-import math
-from pathlib import Path
-
-import matplotlib.pyplot as plt
-import pandas as pd
-from tabulate import tabulate
-
-
-def parse_arguments():
- parser = argparse.ArgumentParser(
- description=
- 'Parse command line arguments for summary-nightly-results script.')
- parser.add_argument('--results-folder',
- type=str,
- required=True,
- help='The folder where the results are stored.')
- parser.add_argument('--description',
- type=str,
- required=True,
- help='Description of the results.')
-
- args = parser.parse_args()
- return args
-
-
-def main(args):
- bar_colors = ['#56B4E9', '#009E73', '#D55E00', '#E69F00']
- results_folder = Path(args.results_folder)
-
- results = []
-
- # collect results
- for test_file in results_folder.glob("*_nightly_results.json"):
- with open(test_file, "r") as f:
- results = results + json.loads(f.read())
-
- # generate markdown table
- df = pd.DataFrame.from_dict(results)
-
- md_table = tabulate(df, headers='keys', tablefmt='pipe', showindex=False)
-
- with open(args.description, "r") as f:
- description = f.read()
-
- description = description.format(
- nightly_results_benchmarking_table=md_table)
-
- with open("nightly_results.md", "w") as f:
- f.write(description)
-
- plt.rcParams.update({'font.size': 20})
-
- # plot results
- fig, axes = plt.subplots(3, 3, figsize=(16, 14))
- fig.subplots_adjust(hspace=1)
- methods = ["vllm", "trt", "lmdeploy", "tgi"]
- for i, model in enumerate(["llama8B", "llama70B", "mixtral8x7B"]):
- for j, metric in enumerate(["TTFT", "ITL"]):
- means, stds = [], []
- for method in methods:
- target = df['Test name'].str.contains(model)
- target = target & df['Engine'].str.contains(method)
- filtered_df = df[target]
-
- if filtered_df.empty:
- means.append(0.)
- stds.append(0.)
- else:
- means.append(filtered_df[f"Mean {metric} (ms)"].values[0])
- std = filtered_df[f"Std {metric} (ms)"].values[0]
- success = filtered_df["Successful req."].values[0]
- stds.append(std / math.sqrt(success))
-
- print(model, metric)
- print(means, stds)
-
- ax = axes[i, j + 1]
-
- bars = ax.bar(
- ["vllm", "trt", "lmdeploy", "tgi"],
- means,
- yerr=stds,
- capsize=10,
- )
- for idx, bar in enumerate(bars):
- bar.set_color(bar_colors[idx])
- ax.set_ylim(bottom=0)
-
- ax.set_ylabel(f"{metric} (ms)")
- ax.set_title(f"{model} {metric}")
- ax.grid(axis='y')
-
- metric = "Tput"
- j = 0
- if True:
- tputs = []
- for method in methods:
- target = df['Test name'].str.contains(model)
- target = target & df['Engine'].str.contains(method)
- filtered_df = df[target]
-
- if filtered_df.empty:
- tputs.append(0.)
- else:
- input_tput = filtered_df["Input Tput (tok/s)"].values[0]
- output_tput = filtered_df["Output Tput (tok/s)"].values[0]
- tputs.append(input_tput + output_tput)
-
- print(model, metric)
- print(tputs)
-
- ax = axes[i, j]
-
- bars = ax.bar(
- ["vllm", "trt", "lmdeploy", "tgi"],
- tputs,
- )
- for idx, bar in enumerate(bars):
- bar.set_color(bar_colors[idx])
-
- ax.set_ylim(bottom=0)
-
- ax.set_ylabel("Tput (token/s)")
- ax.set_title(f"{model} {metric}")
- ax.grid(axis='y')
-
- fig.tight_layout()
- fig.savefig("nightly_results.png", bbox_inches='tight', dpi=400)
-
-
-if __name__ == '__main__':
- args = parse_arguments()
- main(args)
diff --git a/.buildkite/nightly-benchmarks/scripts/run-lmdeploy-nightly.sh b/.buildkite/nightly-benchmarks/scripts/run-lmdeploy-nightly.sh
deleted file mode 100644
index d6f112aaa42fd..0000000000000
--- a/.buildkite/nightly-benchmarks/scripts/run-lmdeploy-nightly.sh
+++ /dev/null
@@ -1,218 +0,0 @@
-#!/bin/bash
-
-set -o pipefail
-
-check_gpus() {
- # check the number of GPUs and GPU type.
- declare -g gpu_count=$(nvidia-smi --list-gpus | wc -l)
- if [[ $gpu_count -gt 0 ]]; then
- echo "GPU found."
- else
- echo "Need at least 1 GPU to run benchmarking."
- exit 1
- fi
- declare -g gpu_type=$(echo $(nvidia-smi --query-gpu=name --format=csv,noheader) | awk '{print $2}')
- echo "GPU type is $gpu_type"
-}
-
-kill_gpu_processes() {
- pkill lmdeploy || true
- # waiting for GPU processes to be fully killed
- sleep 10
- # Print the GPU memory usage
- # so that we know if all GPU processes are killed.
- gpu_memory_usage=$(nvidia-smi --query-gpu=memory.used --format=csv,noheader,nounits -i 0)
- # The memory usage should be 0 MB.
- echo "GPU 0 Memory Usage: $gpu_memory_usage MB"
-}
-
-json2args() {
- # transforms the JSON string to command line args, and '_' is replaced to '-'
- # example:
- # input: { "model": "meta-llama/Llama-2-7b-chat-hf", "tensor_parallel_size": 1 }
- # output: --model meta-llama/Llama-2-7b-chat-hf --tensor-parallel-size 1
- local json_string=$1
- local args=$(
- echo "$json_string" | jq -r '
- to_entries |
- map("--" + (.key | gsub("_"; "-")) + " " + (.value | tostring)) |
- join(" ")
- '
- )
- echo "$args"
-}
-
-wait_for_server() {
- # wait for vllm server to start
- # return 1 if vllm server crashes
- timeout 1200 bash -c '
- until curl -s localhost:8000/v1/completions > /dev/null; do
- sleep 1
- done' && return 0 || return 1
-}
-
-run_serving_tests() {
- # run serving tests using `benchmark_serving.py`
- # $1: a json file specifying serving test cases
-
- local serving_test_file
- serving_test_file=$1
-
- # Iterate over serving tests
- jq -c '.[]' "$serving_test_file" | while read -r params; do
- # get the test name, and append the GPU type back to it.
- test_name=$(echo "$params" | jq -r '.test_name')
-
- # if TEST_SELECTOR is set, only run the test cases that match the selector
- if [[ -n "$TEST_SELECTOR" ]] && [[ ! "$test_name" =~ $TEST_SELECTOR ]]; then
- echo "Skip test case $test_name."
- continue
- fi
-
- # append lmdeploy to the test name
- test_name=lmdeploy_$test_name
-
- # get common parameters
- common_params=$(echo "$params" | jq -r '.common_parameters')
- model=$(echo "$common_params" | jq -r '.model')
- tp=$(echo "$common_params" | jq -r '.tp')
- dataset_name=$(echo "$common_params" | jq -r '.dataset_name')
- dataset_path=$(echo "$common_params" | jq -r '.dataset_path')
- port=$(echo "$common_params" | jq -r '.port')
- num_prompts=$(echo "$common_params" | jq -r '.num_prompts')
-
-
-
- # get client and server arguments
- server_params=$(echo "$params" | jq -r '.lmdeploy_server_parameters')
- client_params=$(echo "$params" | jq -r '.lmdeploy_client_parameters')
- server_args=$(json2args "$server_params")
- client_args=$(json2args "$client_params")
- qps_list=$(echo "$params" | jq -r '.qps_list')
- qps_list=$(echo "$qps_list" | jq -r '.[] | @sh')
- echo "Running over qps list $qps_list"
-
- # check if there is enough GPU to run the test
- if [[ $gpu_count -lt $tp ]]; then
- echo "Required tensor-parallel-size $tp but only $gpu_count GPU found. Skip testcase $test_name."
- continue
- fi
-
- # prepare tokenizer
- rm -rf /tokenizer_cache
- mkdir /tokenizer_cache
- python ../.buildkite/nightly-benchmarks/scripts/download-tokenizer.py \
- --model "$model" \
- --cachedir /tokenizer_cache
-
- server_command="lmdeploy serve api_server $model \
- --tp $tp \
- --server-port $port \
- $server_args"
-
- # run the server
- echo "Running test case $test_name"
- echo "Server command: $server_command"
- bash -c "$server_command" &
-
- # wait until the server is alive
- wait_for_server
- if [ $? -eq 0 ]; then
- echo ""
- echo "lmdeploy server is up and running."
- else
- echo ""
- echo "lmdeploy failed to start within the timeout period."
- break
- fi
-
- # get model name
- model_name=$(python ../.buildkite/nightly-benchmarks/scripts/get-lmdeploy-modelname.py)
-
- # iterate over different QPS
- for qps in $qps_list; do
- # remove the surrounding single quote from qps
- if [[ "$qps" == *"inf"* ]]; then
- echo "qps was $qps"
- qps="inf"
- echo "now qps is $qps"
- fi
-
- new_test_name=$test_name"_qps_"$qps
-
- client_command="python3 benchmark_serving.py \
- --backend lmdeploy \
- --tokenizer /tokenizer_cache \
- --dataset-name $dataset_name \
- --dataset-path $dataset_path \
- --num-prompts $num_prompts \
- --port $port \
- --save-result \
- --result-dir $RESULTS_FOLDER \
- --result-filename ${new_test_name}.json \
- --request-rate $qps \
- --model \"$model_name\" \
- $client_args"
-
- echo "Running test case $test_name with qps $qps"
- echo "Client command: $client_command"
-
- eval "$client_command"
-
- # record the benchmarking commands
- jq_output=$(jq -n \
- --arg server "$server_command" \
- --arg client "$client_command" \
- --arg gpu "$gpu_type" \
- --arg engine "lmdeploy" \
- '{
- server_command: $server,
- client_command: $client,
- gpu_type: $gpu,
- engine: $engine
- }')
- echo "$jq_output" >"$RESULTS_FOLDER/${new_test_name}.commands"
-
- done
-
- # clean up
- kill_gpu_processes
- rm -rf /root/.cache/huggingface/*
- done
-}
-
-
-upload_to_buildkite() {
- # upload the benchmarking results to buildkite
-
- # if the agent binary is not found, skip uploading the results, exit 0
- if [ ! -f /workspace/buildkite-agent ]; then
- echo "buildkite-agent binary not found. Skip uploading the results."
- return 0
- fi
- # /workspace/buildkite-agent annotate --style "success" --context "benchmark-results" --append < $RESULTS_FOLDER/${CURRENT_LLM_SERVING_ENGINE}_nightly_results.md
- /workspace/buildkite-agent artifact upload "$RESULTS_FOLDER/*"
-}
-
-
-main() {
-
- check_gpus
- # enter vllm directory
- cd $VLLM_SOURCE_CODE_LOC/benchmarks
-
- declare -g RESULTS_FOLDER=results/
- mkdir -p $RESULTS_FOLDER
- BENCHMARK_ROOT=../.buildkite/nightly-benchmarks/
-
- python -m pip install transformers==4.41.2
-
- export CURRENT_LLM_SERVING_ENGINE=lmdeploy
- run_serving_tests $BENCHMARK_ROOT/tests/nightly-tests.json
- python -m pip install tabulate pandas
- python $BENCHMARK_ROOT/scripts/summary-nightly-results.py
- upload_to_buildkite
-
-}
-
-main "$@"
diff --git a/.buildkite/nightly-benchmarks/scripts/run-nightly-benchmarks.sh b/.buildkite/nightly-benchmarks/scripts/run-nightly-benchmarks.sh
new file mode 100644
index 0000000000000..dd8c15e0700eb
--- /dev/null
+++ b/.buildkite/nightly-benchmarks/scripts/run-nightly-benchmarks.sh
@@ -0,0 +1,357 @@
+#!/bin/bash
+
+set -o pipefail
+set -x
+
+check_gpus() {
+ # check the number of GPUs and GPU type.
+ declare -g gpu_count=$(nvidia-smi --list-gpus | wc -l)
+ if [[ $gpu_count -gt 0 ]]; then
+ echo "GPU found."
+ else
+ echo "Need at least 1 GPU to run benchmarking."
+ exit 1
+ fi
+ declare -g gpu_type=$(echo $(nvidia-smi --query-gpu=name --format=csv,noheader) | awk '{print $2}')
+ echo "GPU type is $gpu_type"
+}
+
+check_hf_token() {
+ # check if HF_TOKEN is available and valid
+ if [[ -z "$HF_TOKEN" ]]; then
+ echo "Error: HF_TOKEN is not set."
+ exit 1
+ elif [[ ! "$HF_TOKEN" =~ ^hf_ ]]; then
+ echo "Error: HF_TOKEN does not start with 'hf_'."
+ exit 1
+ else
+ echo "HF_TOKEN is set and valid."
+ fi
+}
+
+
+upload_to_buildkite() {
+ # upload the benchmarking results to buildkite
+
+ # if the agent binary is not found, skip uploading the results, exit 0
+ if [ ! -f /workspace/buildkite-agent ]; then
+ echo "buildkite-agent binary not found. Skip uploading the results."
+ return 0
+ fi
+ # /workspace/buildkite-agent annotate --style "success" --context "benchmark-results" --append < $RESULTS_FOLDER/${CURRENT_LLM_SERVING_ENGINE}_nightly_results.md
+ /workspace/buildkite-agent artifact upload "$RESULTS_FOLDER/*"
+}
+
+
+get_current_llm_serving_engine() {
+
+ if which lmdeploy >/dev/null; then
+ echo "Container: lmdeploy"
+ export CURRENT_LLM_SERVING_ENGINE=lmdeploy
+ return
+ fi
+
+ if [ -e /tgi-entrypoint.sh ]; then
+ echo "Container: tgi"
+ export CURRENT_LLM_SERVING_ENGINE=tgi
+ return
+ fi
+
+ if which trtllm-build >/dev/null; then
+ echo "Container: tensorrt-llm"
+ export CURRENT_LLM_SERVING_ENGINE=trt
+ return
+ fi
+
+ if [ -e /sgl-workspace ]; then
+ echo "Container: sglang"
+ export CURRENT_LLM_SERVING_ENGINE=sglang
+ return
+ fi
+
+ if [ -e /vllm-workspace ]; then
+ echo "Container: vllm"
+ # move to a completely irrelevant directory, to avoid import vllm from current folder
+ export CURRENT_LLM_SERVING_ENGINE=vllm
+
+ return
+ fi
+}
+
+json2args() {
+    # transforms the JSON string into command line args; '_' is replaced with '-'
+ # example:
+ # input: { "model": "meta-llama/Llama-2-7b-chat-hf", "tensor_parallel_size": 1 }
+ # output: --model meta-llama/Llama-2-7b-chat-hf --tensor-parallel-size 1
+ local json_string=$1
+ local args=$(
+ echo "$json_string" | jq -r '
+ to_entries |
+ map("--" + (.key | gsub("_"; "-")) + " " + (.value | tostring)) |
+ join(" ")
+ '
+ )
+ echo "$args"
+}
+
+kill_gpu_processes() {
+ pkill -f python
+ pkill -f python3
+ pkill -f tritonserver
+ pkill -f pt_main_thread
+ pkill -f text-generation
+ pkill -f lmdeploy
+
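+    # Block until the memory used on the first GPU drops below 1000 MiB, i.e. until the server processes have fully exited.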
+ while [ $(nvidia-smi --query-gpu=memory.used --format=csv,noheader,nounits | head -n 1) -ge 1000 ]; do
+ sleep 1
+ done
+}
+
+wait_for_server() {
+    # wait for the serving engine to come up on localhost:8000
+    # return 1 if the server does not respond within the timeout
+ timeout 1200 bash -c '
+ until curl -s localhost:8000/v1/completions > /dev/null; do
+ sleep 1
+ done' && return 0 || return 1
+}
+
+ensure_installed() {
+ # Ensure that the given command is installed by apt-get
+ local cmd=$1
+ if ! which $cmd >/dev/null; then
+ apt-get update && apt-get install -y $cmd
+ fi
+}
+
+run_serving_tests() {
+ # run serving tests using `benchmark_serving.py`
+ # $1: a json file specifying serving test cases
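+    #
+    # Each entry in the json file is expected to look roughly like the following
+    # (illustrative values only; see tests/nightly-tests.json for the real configuration):
+    #   {
+    #     "test_name": "llama8B_tp1_sharegpt",
+    #     "qps_list": [4, "inf"],
+    #     "common_parameters": {"model": "...", "tp": 1, "dataset_name": "sharegpt",
+    #                           "dataset_path": "...", "port": 8000, "num_prompts": 500,
+    #                           "reuse_server": false},
+    #     "vllm_server_parameters": {...},
+    #     "vllm_client_parameters": {...}
+    #   }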
+
+ local serving_test_file
+ serving_test_file=$1
+
+ # Iterate over serving tests
+ jq -c '.[]' "$serving_test_file" | while read -r params; do
+ # get the test name, and append the GPU type back to it.
+ test_name=$(echo "$params" | jq -r '.test_name')
+
+ # if TEST_SELECTOR is set, only run the test cases that match the selector
+ if [[ -n "$TEST_SELECTOR" ]] && [[ ! "$test_name" =~ $TEST_SELECTOR ]]; then
+ echo "Skip test case $test_name."
+ continue
+ fi
+
+ # prepend the current serving engine to the test name
+ test_name=${CURRENT_LLM_SERVING_ENGINE}_${test_name}
+
+ # get common parameters
+ common_params=$(echo "$params" | jq -r '.common_parameters')
+ model=$(echo "$common_params" | jq -r '.model')
+ tp=$(echo "$common_params" | jq -r '.tp')
+ dataset_name=$(echo "$common_params" | jq -r '.dataset_name')
+ dataset_path=$(echo "$common_params" | jq -r '.dataset_path')
+ port=$(echo "$common_params" | jq -r '.port')
+ num_prompts=$(echo "$common_params" | jq -r '.num_prompts')
+ reuse_server=$(echo "$common_params" | jq -r '.reuse_server')
+
+ # get client and server arguments
+ server_params=$(echo "$params" | jq -r ".${CURRENT_LLM_SERVING_ENGINE}_server_parameters")
+ client_params=$(echo "$params" | jq -r ".${CURRENT_LLM_SERVING_ENGINE}_client_parameters")
+ client_args=$(json2args "$client_params")
+ qps_list=$(echo "$params" | jq -r '.qps_list')
+ qps_list=$(echo "$qps_list" | jq -r '.[] | @sh')
+ echo "Running over qps list $qps_list"
+
+ # check if there is enough GPU to run the test
+ if [[ $gpu_count -lt $tp ]]; then
+ echo "Required num-shard $tp but only $gpu_count GPU found. Skip testcase $test_name."
+ continue
+ fi
+
+ if [[ $reuse_server == "true" ]]; then
+ echo "Reuse previous server for test case $test_name"
+ else
+ kill_gpu_processes
+ bash $VLLM_SOURCE_CODE_LOC/.buildkite/nightly-benchmarks/scripts/launch-server.sh \
+ "$server_params" "$common_params"
+ fi
+
+ wait_for_server
+
+ if [ $? -eq 0 ]; then
+ echo ""
+ echo "$CURRENT_LLM_SERVING_ENGINE server is up and running."
+ else
+ echo ""
+ echo "$CURRENT_LLM_SERVING_ENGINE failed to start within the timeout period."
+ break
+ fi
+
+ # prepare tokenizer
+ # this is required for lmdeploy.
+ cd $VLLM_SOURCE_CODE_LOC/benchmarks
+ rm -rf /tokenizer_cache
+ mkdir /tokenizer_cache
+ python3 ../.buildkite/nightly-benchmarks/scripts/download-tokenizer.py \
+ --model "$model" \
+ --cachedir /tokenizer_cache
+ cd $VLLM_SOURCE_CODE_LOC/benchmarks
+
+
+ # change model name for lmdeploy (it will not follow standard hf name)
+ if [[ "$CURRENT_LLM_SERVING_ENGINE" == "lmdeploy" ]]; then
+ model=$(python ../.buildkite/nightly-benchmarks/scripts/get-lmdeploy-modelname.py)
+ fi
+
+ # iterate over different QPS
+ for qps in $qps_list; do
+ # remove the surrounding single quote from qps
+ if [[ "$qps" == *"inf"* ]]; then
+ echo "qps was $qps"
+ qps="inf"
+ echo "now qps is $qps"
+ fi
+
+ new_test_name=$test_name"_qps_"$qps
+
+ backend=$CURRENT_LLM_SERVING_ENGINE
+
+ if [[ $backend = "trt" ]]; then
+ backend="tensorrt-llm"
+ fi
+
+ if [[ "$backend" == *"vllm"* ]]; then
+ backend="vllm"
+ fi
+
+ if [[ "$dataset_name" = "sharegpt" ]]; then
+
+ client_command="python3 benchmark_serving.py \
+ --backend $backend \
+ --tokenizer /tokenizer_cache \
+ --model $model \
+ --dataset-name $dataset_name \
+ --dataset-path $dataset_path \
+ --num-prompts $num_prompts \
+ --port $port \
+ --save-result \
+ --result-dir $RESULTS_FOLDER \
+ --result-filename ${new_test_name}.json \
+ --request-rate $qps \
+ --ignore-eos \
+ $client_args"
+
+ elif [[ "$dataset_name" = "sonnet" ]]; then
+
+ sonnet_input_len=$(echo "$common_params" | jq -r '.sonnet_input_len')
+ sonnet_output_len=$(echo "$common_params" | jq -r '.sonnet_output_len')
+ sonnet_prefix_len=$(echo "$common_params" | jq -r '.sonnet_prefix_len')
+
+ client_command="python3 benchmark_serving.py \
+ --backend $backend \
+ --tokenizer /tokenizer_cache \
+ --model $model \
+ --dataset-name $dataset_name \
+ --dataset-path $dataset_path \
+ --num-prompts $num_prompts \
+ --sonnet-input-len $sonnet_input_len \
+ --sonnet-output-len $sonnet_output_len \
+ --sonnet-prefix-len $sonnet_prefix_len \
+ --port $port \
+ --save-result \
+ --result-dir $RESULTS_FOLDER \
+ --result-filename ${new_test_name}.json \
+ --request-rate $qps \
+ --ignore-eos \
+ $client_args"
+
+ else
+
+ echo "The dataset name must be either 'sharegpt' or 'sonnet'. Got $dataset_name."
+ exit 1
+
+ fi
+
+
+
+ echo "Running test case $test_name with qps $qps"
+ echo "Client command: $client_command"
+
+ eval "$client_command"
+
+ server_command="None"
+
+ # record the benchmarking commands
+ jq_output=$(jq -n \
+ --arg server "$server_command" \
+ --arg client "$client_command" \
+ --arg gpu "$gpu_type" \
+ --arg engine "$CURRENT_LLM_SERVING_ENGINE" \
+ '{
+ server_command: $server,
+ client_command: $client,
+ gpu_type: $gpu,
+ engine: $engine
+ }')
+ echo "$jq_output" >"$RESULTS_FOLDER/${new_test_name}.commands"
+
+ done
+
+ done
+
+ kill_gpu_processes
+}
+
+
+prepare_dataset() {
+
+ # download sharegpt dataset
+ cd $VLLM_SOURCE_CODE_LOC/benchmarks
+ wget https://huggingface.co/datasets/anon8231489123/ShareGPT_Vicuna_unfiltered/resolve/main/ShareGPT_V3_unfiltered_cleaned_split.json
+
+ # duplicate sonnet by 4x, to allow benchmarking with input length 2048
+ cd $VLLM_SOURCE_CODE_LOC/benchmarks
+ echo "" > sonnet_4x.txt
+ for _ in {1..4}
+ do
+ cat sonnet.txt >> sonnet_4x.txt
+ done
+
+}
+
+main() {
+
+ # check if the environment variable is successfully injected from yaml
+
+ check_gpus
+ check_hf_token
+ get_current_llm_serving_engine
+
+ pip install -U transformers
+
+ # check storage
+ df -h
+
+ ensure_installed wget
+ ensure_installed curl
+ ensure_installed jq
+
+ prepare_dataset
+
+ cd $VLLM_SOURCE_CODE_LOC/benchmarks
+ declare -g RESULTS_FOLDER=results/
+ mkdir -p $RESULTS_FOLDER
+ BENCHMARK_ROOT=$VLLM_SOURCE_CODE_LOC/.buildkite/nightly-benchmarks/
+
+ # run the test
+ run_serving_tests $BENCHMARK_ROOT/tests/nightly-tests.json
+
+ # upload benchmark results to buildkite
+ python3 -m pip install tabulate pandas
+ python3 $BENCHMARK_ROOT/scripts/summary-nightly-results.py
+ upload_to_buildkite
+
+}
+
+main "$@"
diff --git a/.buildkite/nightly-benchmarks/scripts/run-tgi-nightly.sh b/.buildkite/nightly-benchmarks/scripts/run-tgi-nightly.sh
deleted file mode 100644
index fed03654f8b77..0000000000000
--- a/.buildkite/nightly-benchmarks/scripts/run-tgi-nightly.sh
+++ /dev/null
@@ -1,216 +0,0 @@
-#!/bin/bash
-
-set -o pipefail
-
-check_gpus() {
- # check the number of GPUs and GPU type.
- declare -g gpu_count=$(nvidia-smi --list-gpus | wc -l)
- if [[ $gpu_count -gt 0 ]]; then
- echo "GPU found."
- else
- echo "Need at least 1 GPU to run benchmarking."
- exit 1
- fi
- declare -g gpu_type=$(echo $(nvidia-smi --query-gpu=name --format=csv,noheader) | awk '{print $2}')
- echo "GPU type is $gpu_type"
-}
-
-kill_gpu_processes() {
- pkill text-generation || true
- # waiting for GPU processes to be fully killed
- sleep 10
- # Print the GPU memory usage
- # so that we know if all GPU processes are killed.
- gpu_memory_usage=$(nvidia-smi --query-gpu=memory.used --format=csv,noheader,nounits -i 0)
- # The memory usage should be 0 MB.
- echo "GPU 0 Memory Usage: $gpu_memory_usage MB"
-}
-
-json2args() {
- # transforms the JSON string to command line args, and '_' is replaced to '-'
- # example:
- # input: { "model": "meta-llama/Llama-2-7b-chat-hf", "tensor_parallel_size": 1 }
- # output: --model meta-llama/Llama-2-7b-chat-hf --tensor-parallel-size 1
- local json_string=$1
- local args=$(
- echo "$json_string" | jq -r '
- to_entries |
- map("--" + (.key | gsub("_"; "-")) + " " + (.value | tostring)) |
- join(" ")
- '
- )
- echo "$args"
-}
-
-wait_for_server() {
- timeout 1200 bash -c '
- until curl -s localhost:8000/generate_stream > /dev/null; do
- sleep 1
- done' && return 0 || return 1
-}
-
-run_serving_tests() {
- # run serving tests using `benchmark_serving.py`
- # $1: a json file specifying serving test cases
-
- local serving_test_file
- serving_test_file=$1
-
- # Iterate over serving tests
- jq -c '.[]' "$serving_test_file" | while read -r params; do
- # get the test name, and append the GPU type back to it.
- test_name=$(echo "$params" | jq -r '.test_name')
-
-
- # if TEST_SELECTOR is set, only run the test cases that match the selector
- if [[ -n "$TEST_SELECTOR" ]] && [[ ! "$test_name" =~ $TEST_SELECTOR ]]; then
- echo "Skip test case $test_name."
- continue
- fi
-
- # append tgi to the test name
- test_name=tgi_$test_name
-
- # get common parameters
- common_params=$(echo "$params" | jq -r '.common_parameters')
- model=$(echo "$common_params" | jq -r '.model')
- tp=$(echo "$common_params" | jq -r '.tp')
- dataset_name=$(echo "$common_params" | jq -r '.dataset_name')
- dataset_path=$(echo "$common_params" | jq -r '.dataset_path')
- port=$(echo "$common_params" | jq -r '.port')
- num_prompts=$(echo "$common_params" | jq -r '.num_prompts')
-
- # get client and server arguments
- server_params=$(echo "$params" | jq -r '.tgi_server_parameters')
- client_params=$(echo "$params" | jq -r '.tgi_client_parameters')
- server_args=$(json2args "$server_params")
- client_args=$(json2args "$client_params")
- qps_list=$(echo "$params" | jq -r '.qps_list')
- qps_list=$(echo "$qps_list" | jq -r '.[] | @sh')
- echo "Running over qps list $qps_list"
-
- # check if there is enough GPU to run the test
- if [[ $gpu_count -lt $tp ]]; then
- echo "Required num-shard $tp but only $gpu_count GPU found. Skip testcase $test_name."
- continue
- fi
-
- if echo "$common_params" | jq -e 'has("fp8")' > /dev/null; then
- echo "Key 'fp8' exists in common params."
- server_command="/tgi-entrypoint.sh \
- --model-id $model \
- --num-shard $tp \
- --port $port \
- --quantize fp8 \
- $server_args"
- else
- echo "Key 'fp8' does not exist in common params."
- server_command="/tgi-entrypoint.sh \
- --model-id $model \
- --num-shard $tp \
- --port $port \
- $server_args"
- fi
-
-
-
-
- # run the server
- echo "Running test case $test_name"
- echo "Server command: $server_command"
- eval "$server_command" &
-
- # wait until the server is alive
- wait_for_server
- if [ $? -eq 0 ]; then
- echo ""
- echo "tgi server is up and running."
- else
- echo ""
- echo "tgi failed to start within the timeout period."
- break
- fi
-
- # iterate over different QPS
- for qps in $qps_list; do
- # remove the surrounding single quote from qps
- if [[ "$qps" == *"inf"* ]]; then
- echo "qps was $qps"
- qps="inf"
- echo "now qps is $qps"
- fi
-
- new_test_name=$test_name"_qps_"$qps
-
- client_command="python3 benchmark_serving.py \
- --backend tgi \
- --model $model \
- --dataset-name $dataset_name \
- --dataset-path $dataset_path \
- --num-prompts $num_prompts \
- --port $port \
- --save-result \
- --result-dir $RESULTS_FOLDER \
- --result-filename ${new_test_name}.json \
- --request-rate $qps \
- $client_args"
-
- echo "Running test case $test_name with qps $qps"
- echo "Client command: $client_command"
-
- eval "$client_command"
-
- # record the benchmarking commands
- jq_output=$(jq -n \
- --arg server "$server_command" \
- --arg client "$client_command" \
- --arg gpu "$gpu_type" \
- --arg engine "tgi" \
- '{
- server_command: $server,
- client_command: $client,
- gpu_type: $gpu,
- engine: $engine
- }')
- echo "$jq_output" >"$RESULTS_FOLDER/${new_test_name}.commands"
-
- done
-
- # clean up
- kill_gpu_processes
- rm -rf /root/.cache/huggingface/*
- done
-}
-
-
-
-upload_to_buildkite() {
- # upload the benchmarking results to buildkite
-
- # if the agent binary is not found, skip uploading the results, exit 0
- if [ ! -f /workspace/buildkite-agent ]; then
- echo "buildkite-agent binary not found. Skip uploading the results."
- return 0
- fi
- # /workspace/buildkite-agent annotate --style "success" --context "benchmark-results" --append < $RESULTS_FOLDER/${CURRENT_LLM_SERVING_ENGINE}_nightly_results.md
- /workspace/buildkite-agent artifact upload "$RESULTS_FOLDER/*"
-}
-
-main() {
-
- check_gpus
- # enter vllm directory
- cd $VLLM_SOURCE_CODE_LOC/benchmarks
- declare -g RESULTS_FOLDER=results/
- mkdir -p $RESULTS_FOLDER
- BENCHMARK_ROOT=../.buildkite/nightly-benchmarks/
-
- export CURRENT_LLM_SERVING_ENGINE=tgi
- run_serving_tests $BENCHMARK_ROOT/tests/nightly-tests.json
- python -m pip install tabulate pandas
- python $BENCHMARK_ROOT/scripts/summary-nightly-results.py
- upload_to_buildkite
-
-}
-
-main "$@"
diff --git a/.buildkite/nightly-benchmarks/scripts/run-trt-nightly.sh b/.buildkite/nightly-benchmarks/scripts/run-trt-nightly.sh
deleted file mode 100644
index 4a82b9ec64d71..0000000000000
--- a/.buildkite/nightly-benchmarks/scripts/run-trt-nightly.sh
+++ /dev/null
@@ -1,214 +0,0 @@
-#!/bin/bash
-
-set -o pipefail
-
-check_gpus() {
- # check the number of GPUs and GPU type.
- declare -g gpu_count=$(nvidia-smi --list-gpus | wc -l)
- if [[ $gpu_count -gt 0 ]]; then
- echo "GPU found."
- else
- echo "Need at least 1 GPU to run benchmarking."
- exit 1
- fi
- declare -g gpu_type=$(echo $(nvidia-smi --query-gpu=name --format=csv,noheader) | awk '{print $2}')
- echo "GPU type is $gpu_type"
-}
-
-kill_gpu_processes() {
- pkill tritonserver || true
- # waiting for GPU processes to be fully killed
- sleep 20
- # Print the GPU memory usage
- # so that we know if all GPU processes are killed.
- gpu_memory_usage=$(nvidia-smi --query-gpu=memory.used --format=csv,noheader,nounits -i 0)
- # The memory usage should be 0 MB.
- echo "GPU 0 Memory Usage: $gpu_memory_usage MB"
-}
-
-json2args() {
- # transforms the JSON string to command line args, and '_' is replaced to '-'
- # example:
- # input: { "model": "meta-llama/Llama-2-7b-chat-hf", "tensor_parallel_size": 1 }
- # output: --model meta-llama/Llama-2-7b-chat-hf --tensor-parallel-size 1
- local json_string=$1
- local args=$(
- echo "$json_string" | jq -r '
- to_entries |
- map("--" + (.key | gsub("_"; "-")) + " " + (.value | tostring)) |
- join(" ")
- '
- )
- echo "$args"
-}
-
-wait_for_server() {
- timeout 1200 bash -c '
- until curl -s localhost:8000/generate_stream > /dev/null; do
- sleep 1
- done' && return 0 || return 1
-}
-
-run_serving_tests() {
- # run serving tests using `benchmark_serving.py`
- # $1: a json file specifying serving test cases
-
- local serving_test_file
- serving_test_file=$1
-
- # Iterate over serving tests
- jq -c '.[]' "$serving_test_file" | while read -r params; do
- # get the test name, and append the GPU type back to it.
- test_name=$(echo "$params" | jq -r '.test_name')
-
- # if TEST_SELECTOR is set, only run the test cases that match the selector
- if [[ -n "$TEST_SELECTOR" ]] && [[ ! "$test_name" =~ $TEST_SELECTOR ]]; then
- echo "Skip test case $test_name."
- continue
- fi
-
- # append trt to the test name
- test_name=trt_$test_name
-
- # get common parameters
- common_params=$(echo "$params" | jq -r '.common_parameters')
- model=$(echo "$common_params" | jq -r '.model')
- tp=$(echo "$common_params" | jq -r '.tp')
- dataset_name=$(echo "$common_params" | jq -r '.dataset_name')
- dataset_path=$(echo "$common_params" | jq -r '.dataset_path')
- port=$(echo "$common_params" | jq -r '.port')
- num_prompts=$(echo "$common_params" | jq -r '.num_prompts')
-
- # get client and server arguments
- server_params=$(echo "$params" | jq -r '.trt_server_parameters')
- client_params=$(echo "$params" | jq -r '.trt_client_parameters')
- client_args=$(json2args "$client_params")
- qps_list=$(echo "$params" | jq -r '.qps_list')
- qps_list=$(echo "$qps_list" | jq -r '.[] | @sh')
- echo "Running over qps list $qps_list"
-
- # check if there is enough GPU to run the test
- if [[ $gpu_count -lt $tp ]]; then
- echo "Required model_tp_size $tp but only $gpu_count GPU found. Skip testcase $test_name."
- continue
- fi
-
-
-
- cd $VLLM_SOURCE_CODE_LOC/benchmarks
-
-
- echo "Running test case $test_name"
- bash ../.buildkite/nightly-benchmarks/scripts/launch-trt-server.sh "$server_params" "$common_params"
-
- # wait until the server is alive
- wait_for_server
- if [ $? -eq 0 ]; then
- echo ""
- echo "trt server is up and running."
- else
- echo ""
- echo "trt failed to start within the timeout period."
- break
- fi
-
- # prepare tokenizer
- cd $VLLM_SOURCE_CODE_LOC/benchmarks
- rm -rf /tokenizer_cache
- mkdir /tokenizer_cache
- python ../.buildkite/nightly-benchmarks/scripts/download-tokenizer.py \
- --model "$model" \
- --cachedir /tokenizer_cache
- cd $VLLM_SOURCE_CODE_LOC/benchmarks
-
-
- # iterate over different QPS
- for qps in $qps_list; do
- # remove the surrounding single quote from qps
- if [[ "$qps" == *"inf"* ]]; then
- echo "qps was $qps"
- qps="inf"
- echo "now qps is $qps"
- fi
-
- new_test_name=$test_name"_qps_"$qps
-
- client_command="python3 benchmark_serving.py \
- --backend tensorrt-llm \
- --tokenizer /tokenizer_cache \
- --model $model \
- --dataset-name $dataset_name \
- --dataset-path $dataset_path \
- --num-prompts $num_prompts \
- --port $port \
- --save-result \
- --result-dir $RESULTS_FOLDER \
- --result-filename ${new_test_name}.json \
- --request-rate $qps \
- $client_args"
-
- echo "Running test case $test_name with qps $qps"
- echo "Client command: $client_command"
-
- eval "$client_command"
-
- server_command=""
- # record the benchmarking commands
- jq_output=$(jq -n \
- --arg server "$server_command" \
- --arg client "$client_command" \
- --arg gpu "$gpu_type" \
- --arg engine "trt" \
- '{
- server_command: $server,
- client_command: $client,
- gpu_type: $gpu,
- engine: $engine
- }')
- echo "$jq_output" >"$RESULTS_FOLDER/${new_test_name}.commands"
-
- done
-
- # clean up
- kill_gpu_processes
- rm -rf /root/.cache/huggingface/*
- done
-}
-
-upload_to_buildkite() {
- # upload the benchmarking results to buildkite
-
- # if the agent binary is not found, skip uploading the results, exit 0
- if [ ! -f /workspace/buildkite-agent ]; then
- echo "buildkite-agent binary not found. Skip uploading the results."
- return 0
- fi
- # /workspace/buildkite-agent annotate --style "success" --context "benchmark-results" --append < $RESULTS_FOLDER/${CURRENT_LLM_SERVING_ENGINE}_nightly_results.md
- /workspace/buildkite-agent artifact upload "$RESULTS_FOLDER/*"
-}
-
-
-main() {
-
- check_gpus
-
-
- # enter vllm directory
- cd $VLLM_SOURCE_CODE_LOC/benchmarks
-
- declare -g RESULTS_FOLDER=results/
- mkdir -p $RESULTS_FOLDER
- BENCHMARK_ROOT=../.buildkite/nightly-benchmarks/
-
- # update transformers package, to make sure mixtral tokenizer is available
- python -m pip install transformers -U
-
- export CURRENT_LLM_SERVING_ENGINE=trt
- run_serving_tests $BENCHMARK_ROOT/tests/nightly-tests.json
- python -m pip install tabulate pandas
- python $BENCHMARK_ROOT/scripts/summary-nightly-results.py
- upload_to_buildkite
-
-}
-
-main "$@"
diff --git a/.buildkite/nightly-benchmarks/scripts/run-vllm-nightly.sh b/.buildkite/nightly-benchmarks/scripts/run-vllm-nightly.sh
deleted file mode 100644
index 663045b8a9122..0000000000000
--- a/.buildkite/nightly-benchmarks/scripts/run-vllm-nightly.sh
+++ /dev/null
@@ -1,221 +0,0 @@
-#!/bin/bash
-
-set -o pipefail
-
-check_gpus() {
- # check the number of GPUs and GPU type.
- declare -g gpu_count=$(nvidia-smi --list-gpus | wc -l)
- if [[ $gpu_count -gt 0 ]]; then
- echo "GPU found."
- else
- echo "Need at least 1 GPU to run benchmarking."
- exit 1
- fi
- declare -g gpu_type=$(echo $(nvidia-smi --query-gpu=name --format=csv,noheader) | awk '{print $2}')
- echo "GPU type is $gpu_type"
-}
-
-kill_gpu_processes() {
- # kill all processes on GPU.
- pkill pt_main_thread
- sleep 10
-
- # remove vllm config file
- rm -rf ~/.config/vllm
-
- # Print the GPU memory usage
- # so that we know if all GPU processes are killed.
- gpu_memory_usage=$(nvidia-smi --query-gpu=memory.used --format=csv,noheader,nounits -i 0)
- # The memory usage should be 0 MB.
- echo "GPU 0 Memory Usage: $gpu_memory_usage MB"
-}
-
-json2args() {
- # transforms the JSON string to command line args, and '_' is replaced to '-'
- # example:
- # input: { "model": "meta-llama/Llama-2-7b-chat-hf", "tensor_parallel_size": 1 }
- # output: --model meta-llama/Llama-2-7b-chat-hf --tensor-parallel-size 1
- local json_string=$1
- local args=$(
- echo "$json_string" | jq -r '
- to_entries |
- map("--" + (.key | gsub("_"; "-")) + " " + (.value | tostring)) |
- join(" ")
- '
- )
- echo "$args"
-}
-
-wait_for_server() {
- # wait for vllm server to start
- # return 1 if vllm server crashes
- timeout 1200 bash -c '
- until curl -s localhost:8000/v1/completions > /dev/null; do
- sleep 1
- done' && return 0 || return 1
-}
-
-run_serving_tests() {
- # run serving tests using `benchmark_serving.py`
- # $1: a json file specifying serving test cases
-
- local serving_test_file
- serving_test_file=$1
-
- # Iterate over serving tests
- jq -c '.[]' "$serving_test_file" | while read -r params; do
- # get the test name, and append the GPU type back to it.
- test_name=$(echo "$params" | jq -r '.test_name')
-
- # if TEST_SELECTOR is set, only run the test cases that match the selector
- if [[ -n "$TEST_SELECTOR" ]] && [[ ! "$test_name" =~ $TEST_SELECTOR ]]; then
- echo "Skip test case $test_name."
- continue
- fi
-
- # append vllm to the test name
- test_name=vllm_$test_name
-
-
- # get common parameters
- common_params=$(echo "$params" | jq -r '.common_parameters')
- model=$(echo "$common_params" | jq -r '.model')
- tp=$(echo "$common_params" | jq -r '.tp')
- dataset_name=$(echo "$common_params" | jq -r '.dataset_name')
- dataset_path=$(echo "$common_params" | jq -r '.dataset_path')
- port=$(echo "$common_params" | jq -r '.port')
- num_prompts=$(echo "$common_params" | jq -r '.num_prompts')
-
- # get client and server arguments
- server_params=$(echo "$params" | jq -r '.vllm_server_parameters')
- client_params=$(echo "$params" | jq -r '.vllm_client_parameters')
- server_args=$(json2args "$server_params")
- client_args=$(json2args "$client_params")
- qps_list=$(echo "$params" | jq -r '.qps_list')
- qps_list=$(echo "$qps_list" | jq -r '.[] | @sh')
- echo "Running over qps list $qps_list"
-
- # check if there is enough GPU to run the test
- if [[ $gpu_count -lt $tp ]]; then
- echo "Required tensor-parallel-size $tp but only $gpu_count GPU found. Skip testcase $test_name."
- continue
- fi
-
- if echo "$common_params" | jq -e 'has("fp8")' > /dev/null; then
- echo "Key 'fp8' exists in common params. Use neuralmagic fp8 model for convenience."
- model=$(echo "$common_params" | jq -r '.neuralmagic_quantized_model')
- server_command="python3 \
- -m vllm.entrypoints.openai.api_server \
- -tp $tp \
- --model $model \
- --port $port \
- $server_args"
- else
- echo "Key 'fp8' does not exist in common params."
- server_command="python3 \
- -m vllm.entrypoints.openai.api_server \
- -tp $tp \
- --model $model \
- --port $port \
- $server_args"
- fi
-
- # run the server
- echo "Running test case $test_name"
- echo "Server command: $server_command"
- eval "$server_command" &
-
- # wait until the server is alive
- wait_for_server
- if [ $? -eq 0 ]; then
- echo ""
- echo "vllm server is up and running."
- else
- echo ""
- echo "vllm failed to start within the timeout period."
- break
- fi
-
- # iterate over different QPS
- for qps in $qps_list; do
- # remove the surrounding single quote from qps
- if [[ "$qps" == *"inf"* ]]; then
- echo "qps was $qps"
- qps="inf"
- echo "now qps is $qps"
- fi
-
- new_test_name=$test_name"_qps_"$qps
-
- client_command="python3 benchmark_serving.py \
- --backend vllm \
- --model $model \
- --dataset-name $dataset_name \
- --dataset-path $dataset_path \
- --num-prompts $num_prompts \
- --port $port \
- --save-result \
- --result-dir $RESULTS_FOLDER \
- --result-filename ${new_test_name}.json \
- --request-rate $qps \
- $client_args"
-
- echo "Running test case $test_name with qps $qps"
- echo "Client command: $client_command"
-
- eval "$client_command"
-
- # record the benchmarking commands
- jq_output=$(jq -n \
- --arg server "$server_command" \
- --arg client "$client_command" \
- --arg gpu "$gpu_type" \
- --arg engine "vllm" \
- '{
- server_command: $server,
- client_command: $client,
- gpu_type: $gpu,
- engine: $engine
- }')
- echo "$jq_output" >"$RESULTS_FOLDER/${new_test_name}.commands"
-
- done
-
- # clean up
- kill_gpu_processes
- rm -rf /root/.cache/huggingface/*
- done
-}
-
-
-upload_to_buildkite() {
- # upload the benchmarking results to buildkite
-
- # if the agent binary is not found, skip uploading the results, exit 0
- if [ ! -f /workspace/buildkite-agent ]; then
- echo "buildkite-agent binary not found. Skip uploading the results."
- return 0
- fi
- # /workspace/buildkite-agent annotate --style "success" --context "benchmark-results" --append < $RESULTS_FOLDER/${CURRENT_LLM_SERVING_ENGINE}_nightly_results.md
- /workspace/buildkite-agent artifact upload "$RESULTS_FOLDER/*"
-}
-
-main() {
-
- check_gpus
- # enter vllm directory
- cd $VLLM_SOURCE_CODE_LOC/benchmarks
- declare -g RESULTS_FOLDER=results/
- mkdir -p $RESULTS_FOLDER
- BENCHMARK_ROOT=../.buildkite/nightly-benchmarks/
-
- export CURRENT_LLM_SERVING_ENGINE=vllm
- run_serving_tests $BENCHMARK_ROOT/tests/nightly-tests.json
-
- python3 -m pip install tabulate pandas
- python3 $BENCHMARK_ROOT/scripts/summary-nightly-results.py
- upload_to_buildkite
-
-}
-
-main "$@"
diff --git a/.buildkite/nightly-benchmarks/scripts/summary-nightly-results.py b/.buildkite/nightly-benchmarks/scripts/summary-nightly-results.py
index 782d1ef9aab98..4e4d4cd4ca3c6 100644
--- a/.buildkite/nightly-benchmarks/scripts/summary-nightly-results.py
+++ b/.buildkite/nightly-benchmarks/scripts/summary-nightly-results.py
@@ -17,10 +17,17 @@
"request_throughput": "Tput (req/s)",
"mean_ttft_ms": "Mean TTFT (ms)",
"std_ttft_ms": "Std TTFT (ms)",
+ "median_ttft_ms": "Median TTFT (ms)",
"mean_itl_ms": "Mean ITL (ms)",
"std_itl_ms": "Std ITL (ms)",
- "input_throughput": "Input Tput (tok/s)",
+ "median_itl_ms": "Median ITL (ms)",
+ "mean_tpot_ms": "Mean TPOT (ms)",
+ "std_tpot_ms": "Std TPOT (ms)",
+ "median_tpot_ms": "Median TPOT (ms)",
+ "total_token_throughput": "Total Token Tput (tok/s)",
"output_throughput": "Output Tput (tok/s)",
+ "total_input_tokens": "Total input tokens",
+ "total_output_tokens": "Total output tokens",
"engine": "Engine",
}
diff --git a/.buildkite/nightly-benchmarks/tests/nightly-tests.json b/.buildkite/nightly-benchmarks/tests/nightly-tests.json
index f250833c62710..fda1a7a3ec53c 100644
--- a/.buildkite/nightly-benchmarks/tests/nightly-tests.json
+++ b/.buildkite/nightly-benchmarks/tests/nightly-tests.json
@@ -1,16 +1,18 @@
[
{
- "test_name": "llama8B_tp1",
- "qps_list": [4],
+ "test_name": "llama8B_tp1_sharegpt",
+ "qps_list": [4,8,16,32,"inf"],
"common_parameters": {
- "model": "meta-llama/Meta-Llama-3-8B",
+ "model": "meta-llama/Meta-Llama-3-8B-Instruct",
"tp": 1,
"dataset_name": "sharegpt",
"dataset_path": "./ShareGPT_V3_unfiltered_cleaned_split.json",
"num_prompts": 500,
- "port": 8000
+ "port": 8000,
+ "reuse_server": false
},
"lmdeploy_server_parameters": {
+ "dtype": "bfloat16"
},
"lmdeploy_client_parameters": {
},
@@ -21,34 +23,158 @@
},
"trt_server_parameters": {
"model_type": "llama",
- "model_dtype": "float16",
- "max_batch_size": 256,
+ "model_dtype": "bfloat16",
+ "max_batch_size": 2048,
"max_input_len": 4096,
- "max_output_len": 4096,
- "trt_llm_version": "r24.04"
+ "max_seq_len": 6144,
+ "max_num_tokens": 16384,
+ "trt_llm_version": "v0.11.0"
},
"trt_client_parameters": {
"endpoint": "/v2/models/ensemble/generate_stream"
+ },
+ "vllm_server_parameters": {
+ "disable_log_stats": "",
+ "disable_log_requests": "",
+ "gpu_memory_utilization": 0.9,
+ "num_scheduler_steps": 10,
+ "max_num_seqs": 512,
+ "dtype": "bfloat16"
+ },
+ "vllm_client_parameters": {
+ },
+ "sglang_server_parameters": {
+ "disable_radix_cache": "",
+ "enable_torch_compile": "",
+ "dtype": "bfloat16"
+ },
+ "sglang_client_parameters": {
+ }
+ },
+ {
+ "test_name": "llama8B_tp1_sonnet_512_16",
+ "qps_list": [4,8,16,32,"inf"],
+ "common_parameters": {
+ "model": "meta-llama/Meta-Llama-3-8B-Instruct",
+ "tp": 1,
+ "dataset_name": "sonnet",
+ "dataset_path": "./sonnet_4x.txt",
+ "num_prompts": 500,
+ "port": 8000,
+ "sonnet_input_len": 512,
+ "sonnet_output_len": 16,
+ "sonnet_prefix_len": 50,
+ "reuse_server": true
+ },
+ "lmdeploy_server_parameters": {
+ "dtype": "bfloat16"
+ },
+ "lmdeploy_client_parameters": {
+ },
+ "tgi_server_parameters": {
+ },
+ "tgi_client_parameters": {
+ "endpoint": "/generate_stream"
+ },
+ "trt_server_parameters": {
+ "model_type": "llama",
+ "model_dtype": "bfloat16",
+ "max_batch_size": 2048,
+ "max_input_len": 4096,
+ "max_seq_len": 6144,
+ "max_num_tokens": 16384,
+ "trt_llm_version": "v0.11.0"
+ },
+ "trt_client_parameters": {
+ "endpoint": "/v2/models/ensemble/generate_stream"
+ },
+ "vllm_server_parameters": {
+ "disable_log_stats": "",
+ "disable_log_requests": "",
+ "gpu_memory_utilization": 0.9,
+ "num_scheduler_steps": 10,
+ "max_num_seqs": 512,
+ "dtype": "bfloat16"
+ },
+ "vllm_client_parameters": {
+ },
+ "sglang_server_parameters": {
+ "disable_radix_cache": "",
+ "enable_torch_compile": "",
+ "dtype": "bfloat16"
+ },
+ "sglang_client_parameters": {
+ }
+ },
+ {
+ "test_name": "llama8B_tp1_sonnet_512_256",
+ "qps_list": [4,8,16,32,"inf"],
+ "common_parameters": {
+ "model": "meta-llama/Meta-Llama-3-8B-Instruct",
+ "tp": 1,
+ "dataset_name": "sonnet",
+ "dataset_path": "./sonnet_4x.txt",
+ "num_prompts": 500,
+ "port": 8000,
+ "sonnet_input_len": 512,
+ "sonnet_output_len": 256,
+ "sonnet_prefix_len": 50,
+ "reuse_server": true
+ },
+ "lmdeploy_server_parameters": {
+ "dtype": "bfloat16"
+ },
+ "lmdeploy_client_parameters": {
+ },
+ "tgi_server_parameters": {
+ },
+ "tgi_client_parameters": {
+ "endpoint": "/generate_stream"
+ },
+ "trt_server_parameters": {
+ "model_type": "llama",
+ "model_dtype": "bfloat16",
+ "max_batch_size": 2048,
+ "max_input_len": 4096,
+ "max_seq_len": 6144,
+ "max_num_tokens": 16384,
+ "trt_llm_version": "v0.11.0"
},
+ "trt_client_parameters": {
+ "endpoint": "/v2/models/ensemble/generate_stream"
+ },
"vllm_server_parameters": {
"disable_log_stats": "",
- "disable_log_requests": ""
+ "disable_log_requests": "",
+ "gpu_memory_utilization": 0.9,
+ "num_scheduler_steps": 10,
+ "max_num_seqs": 512,
+ "dtype": "bfloat16"
},
"vllm_client_parameters": {
+ },
+ "sglang_server_parameters": {
+ "disable_radix_cache": "",
+ "enable_torch_compile": "",
+ "dtype": "bfloat16"
+ },
+ "sglang_client_parameters": {
}
},
{
- "test_name": "llama70B_tp4",
- "qps_list": [2],
+ "test_name": "llama70B_tp4_sharegpt",
+ "qps_list": [4,8,16,32,"inf"],
"common_parameters": {
"model": "meta-llama/Meta-Llama-3-70B-Instruct",
"tp": 4,
"dataset_name": "sharegpt",
"dataset_path": "./ShareGPT_V3_unfiltered_cleaned_split.json",
"num_prompts": 500,
- "port": 8000
+ "port": 8000,
+ "reuse_server": false
},
"lmdeploy_server_parameters": {
+ "dtype": "bfloat16"
},
"lmdeploy_client_parameters": {
},
@@ -59,34 +185,50 @@
},
"trt_server_parameters": {
"model_type": "llama",
- "model_dtype": "float16",
- "max_batch_size": 256,
+ "model_dtype": "bfloat16",
+ "max_batch_size": 2048,
"max_input_len": 4096,
- "max_output_len": 4096,
- "trt_llm_version": "r24.04"
+ "max_seq_len": 6144,
+ "max_num_tokens": 16384,
+ "trt_llm_version": "v0.11.0"
},
"trt_client_parameters": {
"endpoint": "/v2/models/ensemble/generate_stream"
- },
+ },
"vllm_server_parameters": {
"disable_log_stats": "",
- "disable_log_requests": ""
+ "disable_log_requests": "",
+ "gpu_memory_utilization": 0.9,
+ "num_scheduler_steps": 10,
+ "max_num_seqs": 512,
+ "dtype": "bfloat16"
},
"vllm_client_parameters": {
+ },
+ "sglang_server_parameters": {
+ "disable_radix_cache": "",
+ "dtype": "bfloat16"
+ },
+ "sglang_client_parameters": {
}
},
{
- "test_name": "mixtral8x7B_tp2",
- "qps_list": [2],
+ "test_name": "llama70B_tp4_sonnet_512_16",
+ "qps_list": [4,8,16,32,"inf"],
"common_parameters": {
- "model": "mistralai/Mixtral-8x7B-Instruct-v0.1",
- "tp": 2,
- "dataset_name": "sharegpt",
- "dataset_path": "./ShareGPT_V3_unfiltered_cleaned_split.json",
+ "model": "meta-llama/Meta-Llama-3-70B-Instruct",
+ "tp": 4,
+ "dataset_name": "sonnet",
+ "dataset_path": "./sonnet_4x.txt",
"num_prompts": 500,
- "port": 8000
+ "port": 8000,
+ "sonnet_input_len": 512,
+ "sonnet_output_len": 16,
+ "sonnet_prefix_len": 50,
+ "reuse_server": true
},
"lmdeploy_server_parameters": {
+ "dtype": "bfloat16"
},
"lmdeploy_client_parameters": {
},
@@ -97,20 +239,85 @@
},
"trt_server_parameters": {
"model_type": "llama",
- "model_dtype": "float16",
- "max_batch_size": 256,
+ "model_dtype": "bfloat16",
+ "max_batch_size": 2048,
"max_input_len": 4096,
- "max_output_len": 4096,
- "trt_llm_version": "r24.04"
+ "max_seq_len": 6144,
+ "max_num_tokens": 16384,
+ "trt_llm_version": "v0.11.0"
},
"trt_client_parameters": {
"endpoint": "/v2/models/ensemble/generate_stream"
+ },
+ "vllm_server_parameters": {
+ "disable_log_stats": "",
+ "disable_log_requests": "",
+ "gpu_memory_utilization": 0.9,
+ "num_scheduler_steps": 10,
+ "max_num_seqs": 512,
+ "dtype": "bfloat16"
+ },
+ "vllm_client_parameters": {
},
+ "sglang_server_parameters": {
+ "disable_radix_cache": "",
+ "dtype": "bfloat16"
+ },
+ "sglang_client_parameters": {
+ }
+ },
+ {
+ "test_name": "llama70B_tp4_sonnet_512_256",
+ "qps_list": [4,8,16,32,"inf"],
+ "common_parameters": {
+ "model": "meta-llama/Meta-Llama-3-70B-Instruct",
+ "tp": 4,
+ "dataset_name": "sonnet",
+ "dataset_path": "./sonnet_4x.txt",
+ "num_prompts": 500,
+ "port": 8000,
+ "sonnet_input_len": 512,
+ "sonnet_output_len": 256,
+ "sonnet_prefix_len": 50,
+ "reuse_server": true
+ },
+ "lmdeploy_server_parameters": {
+ "dtype": "bfloat16"
+ },
+ "lmdeploy_client_parameters": {
+ },
+ "tgi_server_parameters": {
+ },
+ "tgi_client_parameters": {
+ "endpoint": "/generate_stream"
+ },
+ "trt_server_parameters": {
+ "model_type": "llama",
+ "model_dtype": "bfloat16",
+ "max_batch_size": 2048,
+ "max_input_len": 4096,
+ "max_seq_len": 6144,
+ "max_num_tokens": 16384,
+ "trt_llm_version": "v0.11.0"
+ },
+ "trt_client_parameters": {
+ "endpoint": "/v2/models/ensemble/generate_stream"
+ },
"vllm_server_parameters": {
"disable_log_stats": "",
- "disable_log_requests": ""
+ "disable_log_requests": "",
+ "gpu_memory_utilization": 0.9,
+ "num_scheduler_steps": 10,
+ "max_num_seqs": 512,
+ "dtype": "bfloat16"
},
"vllm_client_parameters": {
+ },
+ "sglang_server_parameters": {
+ "disable_radix_cache": "",
+ "dtype": "bfloat16"
+ },
+ "sglang_client_parameters": {
}
}
]
\ No newline at end of file
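
Each test case in `nightly-tests.json` now carries per-engine server/client parameter blocks, a `reuse_server` flag, and a `qps_list` that may contain the literal string `"inf"` (meaning no rate limiting). A hedged sketch of iterating over one entry; the field names come from the JSON above, while the loop itself is illustrative:

```python
import json

with open("nightly-tests.json") as f:
    test_cases = json.load(f)

for case in test_cases:
    common = case["common_parameters"]
    for qps in case["qps_list"]:
        # "inf" means all requests are sent immediately, with no rate limiting.
        request_rate = float("inf") if qps == "inf" else float(qps)
        print(f'{case["test_name"]}: model={common["model"]}, tp={common["tp"]}, '
              f'request_rate={request_rate}')
```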
diff --git a/.buildkite/run-cpu-test.sh b/.buildkite/run-cpu-test.sh
index 73ce82c5857ab..c1c471ec974f8 100644
--- a/.buildkite/run-cpu-test.sh
+++ b/.buildkite/run-cpu-test.sh
@@ -23,6 +23,7 @@ docker exec cpu-test-avx2 bash -c "python3 examples/offline_inference.py"
# Run basic model test
docker exec cpu-test bash -c "
pip install pytest matplotlib einops transformers_stream_generator datamodel_code_generator
+ pytest -v -s tests/models/encoder_decoder/language
pytest -v -s tests/models/decoder_only/language \
--ignore=tests/models/test_fp8.py \
--ignore=tests/models/decoder_only/language/test_jamba.py \
diff --git a/CMakeLists.txt b/CMakeLists.txt
index 8a6c1fb14b2a9..4be524808a23a 100644
--- a/CMakeLists.txt
+++ b/CMakeLists.txt
@@ -433,6 +433,8 @@ if(VLLM_GPU_LANG STREQUAL "CUDA")
"csrc/moe/marlin_kernels/marlin_moe_kernel_ku4b8.cu"
"csrc/moe/marlin_kernels/marlin_moe_kernel_ku8b128.h"
"csrc/moe/marlin_kernels/marlin_moe_kernel_ku8b128.cu"
+ "csrc/moe/marlin_kernels/marlin_moe_kernel_ku4.h"
+ "csrc/moe/marlin_kernels/marlin_moe_kernel_ku4.cu"
"csrc/moe/marlin_moe_ops.cu")
set_gencode_flags_for_srcs(
@@ -482,6 +484,17 @@ if (NOT VLLM_TARGET_DEVICE STREQUAL "cuda")
return()
endif ()
+# vLLM flash attention requires VLLM_GPU_ARCHES to contain the set of target
+# arches in the CMake syntax (75-real, 89-virtual, etc). Since we clear the
+# arches in the CUDA case (and instead set the gencodes on a per-file basis),
+# we need to manually set VLLM_GPU_ARCHES here.
+if(VLLM_GPU_LANG STREQUAL "CUDA")
+ foreach(_ARCH ${CUDA_ARCHS})
+ string(REPLACE "." "" _ARCH "${_ARCH}")
+ list(APPEND VLLM_GPU_ARCHES "${_ARCH}-real")
+ endforeach()
+endif()
+
#
# Build vLLM flash attention from source
#
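
The new CMake block converts each dotted compute capability in `CUDA_ARCHS` (e.g. `8.9`) into the `<digits>-real` form that the vLLM flash-attention build expects. The same string transformation, spelled out in Python purely for illustration:

```python
def to_real_arch(arch: str) -> str:
    """'7.5' -> '75-real', matching the string(REPLACE ...) + '-real' suffix in CMake."""
    return arch.replace(".", "") + "-real"

print([to_real_arch(a) for a in ["7.5", "8.0", "8.9", "9.0"]])
# ['75-real', '80-real', '89-real', '90-real']
```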
diff --git a/README.md b/README.md
index 53749cb36b972..f0b7ce02d556d 100644
--- a/README.md
+++ b/README.md
@@ -15,17 +15,8 @@ Easy, fast, and cheap LLM serving for everyone
----
-
-**vLLM, AMD, Anyscale Meet & Greet at [Ray Summit 2024](http://raysummit.anyscale.com) (Monday, Sept 30th, 5-7pm PT) at Marriott Marquis San Francisco**
-
-We are excited to announce our special vLLM event in collaboration with AMD and Anyscale.
-Join us to learn more about recent advancements of vLLM on MI300X.
-Register [here](https://lu.ma/db5ld9n5) and be a part of the event!
-
----
-
*Latest News* 🔥
+- [2024/10] Ray Summit 2024 held a special track for vLLM! Please find the opening talk slides from the vLLM team [here](https://docs.google.com/presentation/d/1B_KQxpHBTRa_mDF-tR6i8rWdOU5QoTZNcEg2MKZxEHM/edit?usp=sharing). Learn more from the [talks](https://raysummit.anyscale.com/flow/anyscale/raysummit2024/landing/page/sessioncatalog?tab.day=20241001&search.sessiontracks=1719251906298001uzJ2) from other vLLM contributors and users!
- [2024/09] We hosted [the sixth vLLM meetup](https://lu.ma/87q3nvnh) with NVIDIA! Please find the meetup slides [here](https://docs.google.com/presentation/d/1wrLGwytQfaOTd5wCGSPNhoaW3nq0E-9wqyP7ny93xRs/edit?usp=sharing).
- [2024/07] We hosted [the fifth vLLM meetup](https://lu.ma/lp0gyjqr) with AWS! Please find the meetup slides [here](https://docs.google.com/presentation/d/1RgUD8aCfcHocghoP3zmXzck9vX3RCI9yfUAB2Bbcl4Y/edit?usp=sharing).
- [2024/07] In partnership with Meta, vLLM officially supports Llama 3.1 with FP8 quantization and pipeline parallelism! Please check out our blog post [here](https://blog.vllm.ai/2024/07/23/llama31.html).
@@ -137,4 +128,4 @@ If you use vLLM for your research, please cite our [paper](https://arxiv.org/abs
* For technical questions and feature requests, please use Github issues or discussions.
* For discussing with fellow users, please use Discord.
* For security disclosures, please use Github's security advisory feature.
-* For collaborations and partnerships, please contact us at vllm-questions AT lists.berkeley.edu.
\ No newline at end of file
+* For collaborations and partnerships, please contact us at vllm-questions AT lists.berkeley.edu.
diff --git a/benchmarks/backend_request_func.py b/benchmarks/backend_request_func.py
index 3def4a6d67acf..4813fde27f0bc 100644
--- a/benchmarks/backend_request_func.py
+++ b/benchmarks/backend_request_func.py
@@ -23,9 +23,9 @@ class RequestFuncInput:
output_len: int
model: str
best_of: int = 1
- use_beam_search: bool = False
logprobs: Optional[int] = None
multi_modal_content: Optional[dict] = None
+ ignore_eos: bool = False
@dataclass
@@ -48,13 +48,13 @@ async def async_request_tgi(
assert api_url.endswith("generate_stream")
async with aiohttp.ClientSession(timeout=AIOHTTP_TIMEOUT) as session:
- assert not request_func_input.use_beam_search
params = {
"best_of": request_func_input.best_of,
"max_new_tokens": request_func_input.output_len,
"do_sample": True,
"temperature": 0.01, # TGI does not accept 0.0 temperature.
"top_p": 0.99, # TGI does not accept 1.0 top_p.
+ # TGI does not accept ignore_eos flag.
}
payload = {
"inputs": request_func_input.prompt,
@@ -119,7 +119,6 @@ async def async_request_trt_llm(
assert api_url.endswith("generate_stream")
async with aiohttp.ClientSession(timeout=AIOHTTP_TIMEOUT) as session:
- assert not request_func_input.use_beam_search
assert request_func_input.best_of == 1
payload = {
"accumulate_tokens": True,
@@ -129,6 +128,8 @@ async def async_request_trt_llm(
"max_tokens": request_func_input.output_len,
"stream": True,
}
+ if request_func_input.ignore_eos:
+ payload["min_length"] = request_func_input.output_len
output = RequestFuncOutput()
output.prompt_len = request_func_input.prompt_len
@@ -183,7 +184,6 @@ async def async_request_deepspeed_mii(
) -> RequestFuncOutput:
async with aiohttp.ClientSession(timeout=AIOHTTP_TIMEOUT) as session:
assert request_func_input.best_of == 1
- assert not request_func_input.use_beam_search
payload = {
"prompt": request_func_input.prompt,
@@ -231,7 +231,6 @@ async def async_request_openai_completions(
), "OpenAI Completions API URL must end with 'completions' or 'profile'."
async with aiohttp.ClientSession(timeout=AIOHTTP_TIMEOUT) as session:
- assert not request_func_input.use_beam_search
payload = {
"model": request_func_input.model,
"prompt": request_func_input.prompt,
@@ -240,6 +239,7 @@ async def async_request_openai_completions(
"max_tokens": request_func_input.output_len,
"logprobs": request_func_input.logprobs,
"stream": True,
+ "ignore_eos": request_func_input.ignore_eos,
}
headers = {
"Authorization": f"Bearer {os.environ.get('OPENAI_API_KEY')}"
@@ -312,7 +312,6 @@ async def async_request_openai_chat_completions(
), "OpenAI Chat Completions API URL must end with 'chat/completions'."
async with aiohttp.ClientSession(timeout=AIOHTTP_TIMEOUT) as session:
- assert not request_func_input.use_beam_search
content = [{"type": "text", "text": request_func_input.prompt}]
if request_func_input.multi_modal_content:
content.append(request_func_input.multi_modal_content)
@@ -327,6 +326,7 @@ async def async_request_openai_chat_completions(
"temperature": 0.0,
"max_tokens": request_func_input.output_len,
"stream": True,
+ "ignore_eos": request_func_input.ignore_eos,
}
headers = {
"Content-Type": "application/json",
@@ -430,4 +430,5 @@ def get_tokenizer(
"openai-chat": async_request_openai_chat_completions,
"tensorrt-llm": async_request_trt_llm,
"scalellm": async_request_openai_completions,
+ "sglang": async_request_openai_completions,
}
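
With `use_beam_search` dropped from `RequestFuncInput`, the request builders instead forward an optional `ignore_eos` flag (OpenAI-style backends pass it through directly, the TRT-LLM backend emulates it with `min_length`, and TGI has no equivalent). An abridged sketch of the OpenAI completions payload shape; the field values are made up and some payload keys present in the real builder are omitted:

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class RequestFuncInput:
    prompt: str
    prompt_len: int
    output_len: int
    model: str
    best_of: int = 1
    logprobs: Optional[int] = None
    multi_modal_content: Optional[dict] = None
    ignore_eos: bool = False

def completions_payload(req: RequestFuncInput) -> dict:
    # Mirrors the keys kept/added in async_request_openai_completions above.
    return {
        "model": req.model,
        "prompt": req.prompt,
        "best_of": req.best_of,
        "max_tokens": req.output_len,
        "logprobs": req.logprobs,
        "stream": True,
        "ignore_eos": req.ignore_eos,
    }

payload = completions_payload(RequestFuncInput(
    prompt="Hello", prompt_len=1, output_len=16,
    model="meta-llama/Meta-Llama-3-8B-Instruct", ignore_eos=True))
```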
diff --git a/benchmarks/benchmark_latency.py b/benchmarks/benchmark_latency.py
index eadf994cacd34..938d7acd5687c 100644
--- a/benchmarks/benchmark_latency.py
+++ b/benchmarks/benchmark_latency.py
@@ -51,9 +51,8 @@ def main(args: argparse.Namespace):
sampling_params = SamplingParams(
n=args.n,
- temperature=0.0 if args.use_beam_search else 1.0,
+ temperature=1.0,
top_p=1.0,
- use_beam_search=args.use_beam_search,
ignore_eos=True,
max_tokens=args.output_len,
)
diff --git a/benchmarks/benchmark_prefix_caching.py b/benchmarks/benchmark_prefix_caching.py
index 3e90fdfb78e10..eeb43a692076e 100644
--- a/benchmarks/benchmark_prefix_caching.py
+++ b/benchmarks/benchmark_prefix_caching.py
@@ -113,7 +113,7 @@ def repeat_and_sort_requests(requests: List[Tuple[str, int, int]],
def main(args):
tokenizer = get_tokenizer(args.model, trust_remote_code=True)
input_length_range = tuple(map(int, args.input_length_range.split(':')))
-
+ random.seed(args.seed)
if args.dataset_path is not None:
print(f"Start to sample {args.num_prompts} prompts"
"from {args.dataset_path}")
@@ -194,5 +194,9 @@ def main(args):
default='128:256',
help='Range of input lengths for sampling prompts,'
'specified as "min:max" (e.g., "128:256").')
+ parser.add_argument("--seed",
+ type=int,
+ default=0,
+ help='Random seed for reproducibility')
args = parser.parse_args()
main(args)
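
The new `--seed` option (default 0) plus the `random.seed(args.seed)` call make the prompt sampling in this benchmark reproducible across runs. The effect in isolation, as a tiny illustrative sketch:

```python
import random

def sample_prompts(prompts, num_prompts, seed=0):
    """Seeding before sampling makes repeated runs pick the same prompts."""
    random.seed(seed)
    return random.sample(prompts, num_prompts)

assert sample_prompts(list(range(100)), 5, seed=0) == sample_prompts(list(range(100)), 5, seed=0)
```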
diff --git a/benchmarks/benchmark_prioritization.py b/benchmarks/benchmark_prioritization.py
index 0ba29fabca59b..8843e3a927a01 100644
--- a/benchmarks/benchmark_prioritization.py
+++ b/benchmarks/benchmark_prioritization.py
@@ -68,7 +68,6 @@ def run_vllm(
tensor_parallel_size: int,
seed: int,
n: int,
- use_beam_search: bool,
trust_remote_code: bool,
dtype: str,
max_model_len: Optional[int],
@@ -114,9 +113,8 @@ def run_vllm(
sampling_params.append(
SamplingParams(
n=n,
- temperature=0.0 if use_beam_search else 1.0,
+ temperature=1.0,
top_p=1.0,
- use_beam_search=use_beam_search,
ignore_eos=True,
max_tokens=output_len,
))
@@ -144,15 +142,16 @@ def main(args: argparse.Namespace):
args.output_len)
if args.backend == "vllm":
- elapsed_time = run_vllm(
- requests, args.model, args.tokenizer, args.quantization,
- args.tensor_parallel_size, args.seed, args.n, args.use_beam_search,
- args.trust_remote_code, args.dtype, args.max_model_len,
- args.enforce_eager, args.kv_cache_dtype,
- args.quantization_param_path, args.device,
- args.enable_prefix_caching, args.enable_chunked_prefill,
- args.max_num_batched_tokens, args.gpu_memory_utilization,
- args.download_dir)
+ elapsed_time = run_vllm(requests, args.model, args.tokenizer,
+ args.quantization, args.tensor_parallel_size,
+ args.seed, args.n, args.trust_remote_code,
+ args.dtype, args.max_model_len,
+ args.enforce_eager, args.kv_cache_dtype,
+ args.quantization_param_path, args.device,
+ args.enable_prefix_caching,
+ args.enable_chunked_prefill,
+ args.max_num_batched_tokens,
+ args.gpu_memory_utilization, args.download_dir)
else:
raise ValueError(f"Unknown backend: {args.backend}")
total_num_tokens = sum(prompt_len + output_len
@@ -203,7 +202,6 @@ def main(args: argparse.Namespace):
type=int,
default=1,
help="Number of generated sequences per prompt.")
- parser.add_argument("--use-beam-search", action="store_true")
parser.add_argument("--num-prompts",
type=int,
default=200,
diff --git a/benchmarks/benchmark_serving.py b/benchmarks/benchmark_serving.py
index 56c37b241a359..292d1f37fbf3e 100644
--- a/benchmarks/benchmark_serving.py
+++ b/benchmarks/benchmark_serving.py
@@ -391,12 +391,12 @@ async def benchmark(
input_requests: List[Tuple[str, int, int]],
logprobs: Optional[int],
best_of: int,
- use_beam_search: bool,
request_rate: float,
disable_tqdm: bool,
profile: bool,
selected_percentile_metrics: List[str],
selected_percentiles: List[str],
+ ignore_eos: bool,
):
if backend in ASYNC_REQUEST_FUNCS:
request_func = ASYNC_REQUEST_FUNCS[backend]
@@ -418,8 +418,8 @@ async def benchmark(
output_len=test_output_len,
logprobs=logprobs,
best_of=best_of,
- use_beam_search=use_beam_search,
multi_modal_content=test_mm_content,
+ ignore_eos=ignore_eos,
)
test_output = await request_func(request_func_input=test_input)
if not test_output.success:
@@ -439,7 +439,6 @@ async def benchmark(
output_len=test_output_len,
logprobs=logprobs,
best_of=best_of,
- use_beam_search=use_beam_search,
multi_modal_content=test_mm_content,
)
profile_output = await request_func(request_func_input=profile_input)
@@ -462,7 +461,6 @@ async def benchmark(
output_len=output_len,
logprobs=logprobs,
best_of=best_of,
- use_beam_search=use_beam_search,
multi_modal_content=mm_content,
)
tasks.append(
@@ -481,7 +479,6 @@ async def benchmark(
output_len=test_output_len,
logprobs=logprobs,
best_of=best_of,
- use_beam_search=use_beam_search,
)
profile_output = await request_func(request_func_input=profile_input)
if profile_output.success:
@@ -677,7 +674,6 @@ def main(args: argparse.Namespace):
input_requests=input_requests,
logprobs=args.logprobs,
best_of=args.best_of,
- use_beam_search=args.use_beam_search,
request_rate=args.request_rate,
disable_tqdm=args.disable_tqdm,
profile=args.profile,
@@ -685,6 +681,7 @@ def main(args: argparse.Namespace):
selected_percentiles=[
float(p) for p in args.metric_percentiles.split(",")
],
+ ignore_eos=args.ignore_eos,
))
# Save config and results to json
@@ -698,7 +695,6 @@ def main(args: argparse.Namespace):
result_json["model_id"] = model_id
result_json["tokenizer_id"] = tokenizer_id
result_json["best_of"] = args.best_of
- result_json["use_beam_search"] = args.use_beam_search
result_json["num_prompts"] = args.num_prompts
# Metadata
@@ -863,6 +859,11 @@ def main(args: argparse.Namespace):
"{backend}-{args.request_rate}qps-{base_model_id}-{current_dt}.json"
" format.",
)
+ parser.add_argument(
+ "--ignore-eos",
+ action="store_true",
+ help="Set ignore_eos flag when sending the benchmark request."
+ "Warning: ignore_eos is not supported in deepspeed_mii and tgi.")
parser.add_argument(
"--percentile-metrics",
type=str,
diff --git a/benchmarks/benchmark_throughput.py b/benchmarks/benchmark_throughput.py
index 68b401d5bbbb7..3781863f77e64 100644
--- a/benchmarks/benchmark_throughput.py
+++ b/benchmarks/benchmark_throughput.py
@@ -15,6 +15,7 @@
from vllm.entrypoints.openai.api_server import (
build_async_engine_client_from_engine_args)
from vllm.model_executor.layers.quantization import QUANTIZATION_METHODS
+from vllm.sampling_params import BeamSearchParams
from vllm.utils import FlexibleArgumentParser, merge_async_iterators
@@ -72,7 +73,6 @@ def run_vllm(
tensor_parallel_size: int,
seed: int,
n: int,
- use_beam_search: bool,
trust_remote_code: bool,
dtype: str,
max_model_len: Optional[int],
@@ -90,7 +90,6 @@ def run_vllm(
download_dir: Optional[str] = None,
load_format: str = EngineArgs.load_format,
disable_async_output_proc: bool = False,
- use_new_beam_search_impl: bool = False,
) -> float:
from vllm import LLM, SamplingParams
llm = LLM(
@@ -126,29 +125,32 @@ def run_vllm(
sampling_params.append(
SamplingParams(
n=n,
- temperature=0.0 if use_beam_search else 1.0,
+ temperature=1.0,
top_p=1.0,
- use_beam_search=use_beam_search,
ignore_eos=True,
max_tokens=output_len,
))
- if not use_new_beam_search_impl:
+ use_beam_search = False
+
+ if not use_beam_search:
start = time.perf_counter()
llm.generate(prompts, sampling_params, use_tqdm=True)
end = time.perf_counter()
else:
- assert use_beam_search
prompts = [prompt for prompt, _, _ in requests]
# output_len should be the same for all requests.
output_len = requests[0][2]
for prompt, input_len, _output_len in requests:
assert _output_len == output_len
start = time.perf_counter()
- llm.beam_search(prompts,
- beam_width=n,
- max_tokens=output_len,
- ignore_eos=True)
+ llm.beam_search(
+ prompts,
+ BeamSearchParams(
+ beam_width=n,
+ max_tokens=output_len,
+ ignore_eos=True,
+ ))
end = time.perf_counter()
return end - start
@@ -161,7 +163,6 @@ async def run_vllm_async(
tensor_parallel_size: int,
seed: int,
n: int,
- use_beam_search: bool,
trust_remote_code: bool,
dtype: str,
max_model_len: Optional[int],
@@ -220,9 +221,8 @@ async def run_vllm_async(
sampling_params.append(
SamplingParams(
n=n,
- temperature=0.0 if use_beam_search else 1.0,
+ temperature=1.0,
top_p=1.0,
- use_beam_search=use_beam_search,
ignore_eos=True,
max_tokens=output_len,
))
@@ -244,11 +244,9 @@ def run_hf(
model: str,
tokenizer: PreTrainedTokenizerBase,
n: int,
- use_beam_search: bool,
max_batch_size: int,
trust_remote_code: bool,
) -> float:
- assert not use_beam_search
llm = AutoModelForCausalLM.from_pretrained(
model, torch_dtype=torch.float16, trust_remote_code=trust_remote_code)
if llm.config.model_type == "llama":
@@ -280,7 +278,7 @@ def run_hf(
padding=True).input_ids
llm_outputs = llm.generate(
input_ids=input_ids.cuda(),
- do_sample=not use_beam_search,
+ do_sample=True,
num_return_sequences=n,
temperature=1.0,
top_p=1.0,
@@ -336,7 +334,7 @@ def main(args: argparse.Namespace):
if args.backend == "vllm":
run_args = [
requests, args.model, args.tokenizer, args.quantization,
- args.tensor_parallel_size, args.seed, args.n, args.use_beam_search,
+ args.tensor_parallel_size, args.seed, args.n,
args.trust_remote_code, args.dtype, args.max_model_len,
args.enforce_eager, args.kv_cache_dtype,
args.quantization_param_path, args.device,
@@ -351,12 +349,11 @@ def main(args: argparse.Namespace):
run_args.append(args.disable_frontend_multiprocessing)
elapsed_time = uvloop.run(run_vllm_async(*run_args))
else:
- elapsed_time = run_vllm(*run_args, args.use_new_beam_search_impl)
+ elapsed_time = run_vllm(*run_args)
elif args.backend == "hf":
assert args.tensor_parallel_size == 1
elapsed_time = run_hf(requests, args.model, tokenizer, args.n,
- args.use_beam_search, args.hf_max_batch_size,
- args.trust_remote_code)
+ args.hf_max_batch_size, args.trust_remote_code)
elif args.backend == "mii":
elapsed_time = run_mii(requests, args.model, args.tensor_parallel_size,
args.output_len)
@@ -410,8 +407,6 @@ def main(args: argparse.Namespace):
type=int,
default=1,
help="Number of generated sequences per prompt.")
- parser.add_argument("--use-beam-search", action="store_true")
- parser.add_argument("--use-new-beam-search-impl", action="store_true")
parser.add_argument("--num-prompts",
type=int,
default=1000,
@@ -566,8 +561,6 @@ def main(args: argparse.Namespace):
raise ValueError("dtype must be auto for MII backend.")
if args.n != 1:
raise ValueError("n must be 1 for MII backend.")
- if args.use_beam_search:
- raise ValueError("Beam search is not supported for MII backend.")
if args.quantization is not None:
raise ValueError("Quantization is only for vLLM backend.")
if args.hf_max_batch_size is not None:
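
After this change, `benchmark_throughput.py` always builds plain `SamplingParams` (temperature 1.0, `ignore_eos=True`) and routes beam search through `llm.beam_search` with a `BeamSearchParams` object rather than a `use_beam_search` flag. Condensed into a sketch that mirrors the calls in the hunk above; the model name and token counts are placeholders:

```python
from vllm import LLM, SamplingParams
from vllm.sampling_params import BeamSearchParams

llm = LLM(model="meta-llama/Meta-Llama-3-8B-Instruct")  # placeholder model choice

# Regular sampling path (beam search disabled):
params = SamplingParams(n=1, temperature=1.0, top_p=1.0, ignore_eos=True, max_tokens=128)
llm.generate(["Hello, my name is"], params, use_tqdm=True)

# Beam-search path, mirroring the llm.beam_search call in the hunk above:
llm.beam_search(["Hello, my name is"],
                BeamSearchParams(beam_width=4, max_tokens=128, ignore_eos=True))
```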
diff --git a/cmake/cpu_extension.cmake b/cmake/cpu_extension.cmake
index 3c474bd58d04e..bc5f24d3f591c 100644
--- a/cmake/cpu_extension.cmake
+++ b/cmake/cpu_extension.cmake
@@ -84,7 +84,12 @@ endif()
message(STATUS "CPU extension compile flags: ${CXX_COMPILE_FLAGS}")
-list(APPEND LIBS dnnl numa)
+list(APPEND LIBS numa)
+
+# Append the dnnl library only for AVX2 and AVX512 builds; it is not used on the Power architecture.
+if (AVX2_FOUND OR AVX512_FOUND)
+ list(APPEND LIBS dnnl)
+endif()
#
# _C extension
diff --git a/csrc/moe/marlin_kernels/marlin_moe_kernel.h b/csrc/moe/marlin_kernels/marlin_moe_kernel.h
index 0bd3017226c94..a217401b3d7c2 100644
--- a/csrc/moe/marlin_kernels/marlin_moe_kernel.h
+++ b/csrc/moe/marlin_kernels/marlin_moe_kernel.h
@@ -38,6 +38,7 @@ using FragA = Vec;
using FragB = Vec;
using FragC = Vec;
using FragS = Vec; // quantization scales
+using FragZP = Vec;
// Predicated asynchronous global->shared copy; used for inputs A where we apply
// predication to handle batchsizes that are not multiples of 16.
@@ -175,6 +176,46 @@ __device__ inline FragB dequant(int q) {
return frag_b;
}
+template <>
+__device__ inline FragB dequant(int q) {
+ const int LO = 0x000f000f;
+ const int HI = 0x00f000f0;
+ const int EX = 0x64006400;
+ // Guarantee that the `(a & b) | c` operations are LOP3s.
+ int lo = lop3<(0xf0 & 0xcc) | 0xaa>(q, LO, EX);
+ int hi = lop3<(0xf0 & 0xcc) | 0xaa>(q, HI, EX);
+
+ const int SUB = 0x64006400;
+ const int MUL = 0x2c002c00;
+ const int ADD = 0xd400d400;
+ FragB frag_b;
+ frag_b[0] = __hsub2(*reinterpret_cast(&lo),
+ *reinterpret_cast(&SUB));
+ frag_b[1] = __hfma2(*reinterpret_cast(&hi),
+ *reinterpret_cast(&MUL),
+ *reinterpret_cast(&ADD));
+ return frag_b;
+}
+
+template <>
+__device__ inline FragB dequant(int q) {
+ static constexpr uint32_t mask_for_elt_01 = 0x5250;
+ static constexpr uint32_t mask_for_elt_23 = 0x5351;
+ static constexpr uint32_t start_byte_for_fp16 = 0x64646464;
+
+ uint32_t lo = prmt(q);
+ uint32_t hi = prmt(q);
+
+ static constexpr uint32_t I8s_TO_F16s_MAGIC_NUM = 0x64006400;
+
+ FragB frag_b;
+ frag_b[0] = __hsub2(*reinterpret_cast(&lo),
+ *reinterpret_cast(&I8s_TO_F16s_MAGIC_NUM));
+ frag_b[1] = __hsub2(*reinterpret_cast(&hi),
+ *reinterpret_cast(&I8s_TO_F16s_MAGIC_NUM));
+ return frag_b;
+}
+
// Multiply dequantized values by the corresponding quantization scale; used
// only for grouped quantization.
__device__ inline void scale(FragB& frag_b, FragS& frag_s, int i) {
@@ -183,11 +224,10 @@ __device__ inline void scale(FragB& frag_b, FragS& frag_s, int i) {
frag_b[1] = __hmul2(frag_b[1], s);
}
-// Given 2 floats multiply by 2 scales (halves)
-__device__ inline void scale_float(float* c, FragS& s) {
- __half* s_ptr = reinterpret_cast<__half*>(&s);
- c[0] = __fmul_rn(c[0], __half2float(s_ptr[0]));
- c[1] = __fmul_rn(c[1], __half2float(s_ptr[1]));
+__device__ inline void sub_zp(FragB& frag_b, half2& frag_zp, int i) {
+ half2 zp = __half2half2(reinterpret_cast<__half*>(&frag_zp)[i]);
+ frag_b[0] = __hsub2(frag_b[0], zp);
+ frag_b[1] = __hsub2(frag_b[1], zp);
}
// Same as above, but for act_order (each K is multiplied individually)
@@ -205,6 +245,13 @@ __device__ inline void scale4(FragB& frag_b, FragS& frag_s_1, FragS& frag_s_2,
frag_b[1] = __hmul2(frag_b[1], s_val_3_4);
}
+// Given 2 floats multiply by 2 scales (halves)
+__device__ inline void scale_float(float* c, FragS& s) {
+ __half* s_ptr = reinterpret_cast<__half*>(&s);
+ c[0] = __fmul_rn(c[0], __half2float(s_ptr[0]));
+ c[1] = __fmul_rn(c[1], __half2float(s_ptr[1]));
+}
+
// Wait until barrier reaches `count`, then lock for current threadblock.
__device__ inline void barrier_acquire(int* lock, int count) {
if (threadIdx.x == 0) {
@@ -248,10 +295,11 @@ template shared
// fetch pipeline
const bool has_act_order, // whether act_order is enabled
+ const bool has_zp, // whether zero-points are enabled
const int group_blocks = -1 // number of consecutive 16x16 blocks
// with a separate quantization scale
>
-__device__ inline void MarlinMoESingle(
+__device__ void MarlinMoESingle(
const int4* __restrict__ A, // fp16 input matrix of shape mxk
const int4* __restrict__ B, // 4bit quantized weight matrix of shape kxn
int4* __restrict__ C, // fp16 output buffer of shape mxn
@@ -259,6 +307,8 @@ __device__ inline void MarlinMoESingle(
const float* __restrict__ topk_weights, // float topk weights
const int4* __restrict__ scales_ptr, // fp16 quantization scales of shape
// (k/groupsize)xn
+ const int4* __restrict__ zp_ptr, // 4bit packed zero-points of shape
+ // (k/groupsize)x(n/pack_factor)
const int* __restrict__ g_idx, // int32 group indices of shape k
const int* __restrict__ expert_offsets,
int num_groups, // number of scale groups per output channel
@@ -400,8 +450,12 @@ __device__ inline void MarlinMoESingle(
int tb_n_warps = thread_n_blocks / 4;
int act_s_col_tb_stride = act_s_col_warp_stride * tb_n_warps;
- constexpr int sorted_sh_stride = threads;
- constexpr int sorted_gl_stride = threads;
+ // Zero-points sizes/strides
+ int zp_gl_stride = (prob_n / pack_factor) / 4;
+ constexpr int zp_sh_stride = ((16 * thread_n_blocks) / pack_factor) / 4;
+ constexpr int zp_tb_groups = s_tb_groups;
+ constexpr int zp_sh_stage = has_zp ? zp_tb_groups * zp_sh_stride : 0;
+ int zp_gl_rd_delta = zp_gl_stride;
// Global A read index of current thread.
int a_gl_rd = a_gl_stride * (threadIdx.x / a_gl_rd_delta_o) +
@@ -442,6 +496,19 @@ __device__ inline void MarlinMoESingle(
int s_sh_wr = threadIdx.x;
bool s_sh_wr_pred = threadIdx.x < s_sh_stride;
+ // Zero-points
+ int zp_gl_rd;
+ if constexpr (has_zp) {
+ if constexpr (group_blocks == -1) {
+ zp_gl_rd = zp_sh_stride * slice_col + threadIdx.x;
+ } else {
+ zp_gl_rd = zp_gl_stride * ((thread_k_blocks * slice_row) / group_blocks) +
+ zp_sh_stride * slice_col + threadIdx.x;
+ }
+ }
+ int zp_sh_wr = threadIdx.x;
+ bool zp_sh_wr_pred = threadIdx.x < zp_sh_stride;
+
// We use a different scale layout for grouped and column-wise quantization as
// we scale a `half2` tile in column-major layout in the former and in
// row-major in the latter case.
@@ -453,23 +520,29 @@ __device__ inline void MarlinMoESingle(
s_sh_rd = 8 * ((threadIdx.x / 32) % (thread_n_blocks / 4)) +
(threadIdx.x % 32) % 4;
+ // Zero-points have the same read layout as the scales
+ // (without column-wise case)
+ constexpr int num_col_threads = 8;
+ constexpr int num_row_threads = 4;
+ constexpr int num_ints_per_thread = 8 / pack_factor;
+ int zp_sh_rd;
+ if constexpr (has_zp) {
+ zp_sh_rd = num_ints_per_thread * num_col_threads *
+ ((threadIdx.x / 32) % (thread_n_blocks / 4)) +
+ num_ints_per_thread * ((threadIdx.x % 32) / num_row_threads);
+ }
+
int sh_first_group_id = -1;
int sh_num_groups = -1;
constexpr int sh_max_num_groups = 32;
- int shs_size;
- if constexpr (has_act_order)
- shs_size = sh_max_num_groups * s_sh_stride + threads;
- else
- shs_size = group_blocks > 0 ? stages * s_sh_stage : threads;
-
extern __shared__ int4 sh[];
// Shared memory storage for global fetch pipelines.
int4* sh_a = sh;
int4* sh_b = sh_a + (stages * a_sh_stage);
int4* sh_g_idx = sh_b + (stages * b_sh_stage);
- int4* sh_s = sh_g_idx + (stages * g_idx_stage);
- int* sh_sorted = (int*)(sh_s + shs_size);
+ int4* sh_zp = sh_g_idx + (stages * g_idx_stage);
+ int4* sh_s = sh_zp + (stages * zp_sh_stage);
// Precompute which thread should not read memory in which iterations; this is
// needed if there are more threads than required for a certain tilesize or
@@ -525,8 +598,10 @@ __device__ inline void MarlinMoESingle(
FragA frag_a[2][thread_m_blocks];
I4 frag_b_quant[2][b_thread_vecs];
FragC frag_c[thread_m_blocks][4][2];
- FragS frag_s[2][4]; // No act-order
- FragS act_frag_s[2][4][4]; // For act-order
+ FragS frag_s[2][4]; // No act-order
+ FragS act_frag_s[2][4][4]; // For act-order
+ int frag_qzp[2][num_ints_per_thread]; // Zero-points
+ FragZP frag_zp; // Zero-points in fp16
// Zero accumulators.
auto zero_accums = [&]() {
@@ -633,6 +708,28 @@ __device__ inline void MarlinMoESingle(
}
}
}
+
+ if constexpr (has_zp && group_blocks != -1) {
+ int4* sh_zp_stage = sh_zp + zp_sh_stage * pipe;
+
+ if constexpr (group_blocks >= thread_k_blocks) {
+ // Only fetch zero-points if this tile starts a new group
+ if (pipe % (group_blocks / thread_k_blocks) == 0) {
+ if (zp_sh_wr_pred) {
+ cp_async4(&sh_zp_stage[zp_sh_wr], &zp_ptr[zp_gl_rd]);
+ }
+ zp_gl_rd += zp_gl_rd_delta;
+ }
+ } else {
+ for (int i = 0; i < zp_tb_groups; i++) {
+ if (zp_sh_wr_pred) {
+ cp_async4(&sh_zp_stage[i * zp_sh_stride + zp_sh_wr],
+ &zp_ptr[zp_gl_rd]);
+ }
+ zp_gl_rd += zp_gl_rd_delta;
+ }
+ }
+ }
}
}
// Insert a fence even when we are winding down the pipeline to ensure that
@@ -640,15 +737,9 @@ __device__ inline void MarlinMoESingle(
cp_async_fence();
};
- // TODO we are currently hitting illegal memory accesses when fetching
- // sorted_ids to shared data: fix this
- auto fetch_sorted_ids_to_shared = [&]() {
- const int mpt = ceildiv(prob_m, threads);
- for (int i = 0; i < mpt; i++) {
- if ((i * sorted_gl_stride) + threadIdx.x < prob_m) {
- sh_sorted[(i * sorted_sh_stride) + threadIdx.x] =
- sorted_ids[(i * sorted_gl_stride) + threadIdx.x];
- }
+ auto fetch_zp_to_shared = [&]() {
+ if (zp_sh_wr_pred) {
+ cp_async4(&sh_zp[zp_sh_wr], &zp_ptr[zp_gl_rd]);
}
};
@@ -799,8 +890,83 @@ __device__ inline void MarlinMoESingle(
}
};
+ auto fetch_zp_to_registers = [&](int k, int full_pipe) {
+ // This code does not handle group_blocks == 0,
+ // which signifies act_order.
+    // has_zp implies AWQ, which doesn't have act_order.
+ static_assert(!has_zp || group_blocks != 0);
+
+ if constexpr (has_zp) {
+ int pipe = full_pipe % stages;
+
+ if constexpr (group_blocks == -1) {
+ for (int i = 0; i < num_ints_per_thread; i++) {
+ frag_qzp[k % 2][i] = (reinterpret_cast(sh_zp))[zp_sh_rd + i];
+ }
+
+ } else if constexpr (group_blocks >= thread_k_blocks) {
+ int4* sh_zp_stage =
+ sh_zp + zp_sh_stage * ((group_blocks / thread_k_blocks) *
+ (pipe / (group_blocks / thread_k_blocks)));
+ for (int i = 0; i < num_ints_per_thread; i++) {
+ frag_qzp[k % 2][i] =
+ (reinterpret_cast(sh_zp_stage))[zp_sh_rd + i];
+ }
+ } else {
+ int warp_id = threadIdx.x / 32;
+ int n_warps = thread_n_blocks / 4;
+
+ int warp_row = warp_id / n_warps;
+
+ int cur_k = warp_row * 16;
+ cur_k += k_iter_size * (k % b_sh_wr_iters);
+
+ int k_blocks = cur_k / 16;
+ int cur_group_id = 0;
+
+ // Suppress bogus and persistent divide-by-zero warning
+ #pragma nv_diagnostic push
+ #pragma nv_diag_suppress divide_by_zero
+ cur_group_id = k_blocks / group_blocks;
+ #pragma nv_diagnostic pop
+
+ int4* sh_zp_stage = sh_zp + zp_sh_stage * pipe;
+
+ sh_zp_stage += cur_group_id * zp_sh_stride;
+
+ for (int i = 0; i < num_ints_per_thread; i++) {
+ frag_qzp[k % 2][i] =
+ (reinterpret_cast(sh_zp_stage))[zp_sh_rd + i];
+ }
+ }
+ }
+ };
+
// Execute the actual tensor core matmul of a sub-tile.
auto matmul = [&](int k) {
+ if constexpr (has_zp) {
+ FragB frag_zp_0;
+ FragB frag_zp_1;
+ int zp_quant_0, zp_quant_1;
+
+ if constexpr (w_type.size_bits() == 4) {
+ zp_quant_0 = frag_qzp[k % 2][0];
+ zp_quant_1 = zp_quant_0 >> 8;
+ } else {
+ static_assert(w_type.size_bits() == 8);
+ zp_quant_0 = frag_qzp[k % 2][0];
+ zp_quant_1 = frag_qzp[k % 2][1];
+ }
+
+ frag_zp_0 = dequant(zp_quant_0);
+ frag_zp_1 = dequant(zp_quant_1);
+
+ frag_zp[0] = frag_zp_0[0];
+ frag_zp[1] = frag_zp_0[1];
+ frag_zp[2] = frag_zp_1[0];
+ frag_zp[3] = frag_zp_1[1];
+ }
+
// We have the m dimension as the inner loop in order to encourage overlapping
// dequantization and matmul operations.
#pragma unroll
@@ -818,6 +984,10 @@ __device__ inline void MarlinMoESingle(
FragB frag_b0 = dequant(b_quant_0);
FragB frag_b1 = dequant(b_quant_1);
+ // Apply zero-point to frag_b0
+ if constexpr (has_zp) {
+ sub_zp(frag_b0, frag_zp[j], 0);
+ }
// Apply scale to frag_b0
if constexpr (has_act_order) {
@@ -829,6 +999,11 @@ __device__ inline void MarlinMoESingle(
}
}
+ // Apply zero-point to frag_b1
+ if constexpr (has_zp) {
+ sub_zp(frag_b1, frag_zp[j], 1);
+ }
+
// Apply scale to frag_b1
if constexpr (has_act_order) {
scale4(frag_b1, act_frag_s[k % 2][0][j], act_frag_s[k % 2][1][j],
@@ -1062,9 +1237,6 @@ __device__ inline void MarlinMoESingle(
// Start global fetch and register load pipelines.
auto start_pipes = [&]() {
- // TODO re-enable after fixing this function
- // fetch_sorted_ids_to_shared();
- // __syncthreads();
#pragma unroll
for (int i = 0; i < stages - 1; i++) {
@@ -1075,6 +1247,12 @@ __device__ inline void MarlinMoESingle(
}
fetch_scales_to_shared(true, g_idx[slice_k_start], g_idx[last_g_idx]);
}
+
+ if constexpr (has_zp && group_blocks == -1) {
+ if (i == 0) {
+ fetch_zp_to_shared();
+ }
+ }
fetch_to_shared(i, i, i < slice_iters);
}
@@ -1083,6 +1261,7 @@ __device__ inline void MarlinMoESingle(
init_same_group(0);
fetch_to_registers(0, 0);
fetch_scales_to_registers(0, 0);
+ fetch_zp_to_registers(0, 0);
a_gl_rd += a_gl_rd_delta_o * (stages - 1);
slice_k_start_shared_fetch += tb_k * (stages - 1);
};
@@ -1102,6 +1281,7 @@ __device__ inline void MarlinMoESingle(
for (int k = 0; k < b_sh_wr_iters; k++) {
fetch_to_registers(k + 1, pipe % stages);
fetch_scales_to_registers(k + 1, pipe);
+ fetch_zp_to_registers(k + 1, pipe);
if (k == b_sh_wr_iters - 2) {
fetch_to_shared((pipe + stages - 1) % stages, pipe,
slice_iters >= stages);
@@ -1236,7 +1416,9 @@ __device__ inline void MarlinMoESingle(
} else {
s_gl_rd = s_sh_stride * slice_col + threadIdx.x;
+ zp_gl_rd = zp_sh_stride * slice_col + threadIdx.x;
}
+
start_pipes();
}
}
@@ -1250,6 +1432,7 @@ template shared
// fetch pipeline
const bool has_act_order, // whether act_order is enabled
+ const bool has_zp, // whether zero-points are enabled
const int group_blocks = -1 // number of consecutive 16x16 blocks
// with a separate quantization scale
>
@@ -1261,6 +1444,8 @@ __global__ void MarlinMoE(
const float* __restrict__ topk_weights, // float topk weights
const int4* __restrict__ scales_ptr, // fp16 quantization scales of shape
// (k/groupsize)xn
+ const int4* __restrict__ zp_ptr, // 4bit packed zero-points of shape
+ // (k/groupsize)x(n/pack_factor)
const int* __restrict__ g_idx, // int32 group indices of shape k
const int* __restrict__ expert_offsets,
int num_groups, // number of scale groups per output channel
@@ -1309,29 +1494,29 @@ __global__ void MarlinMoE(
if (max_block == 1) {
MarlinMoESingle(
- A, B, C, sorted_ids_expert, topk_weights, scales_ptr, g_idx,
+ stages, has_act_order, has_zp, group_blocks>(
+ A, B, C, sorted_ids_expert, topk_weights, scales_ptr, zp_ptr, g_idx,
expert_offsets, num_groups, expert_idx, num_experts, topk, prob_m,
prob_n, prob_k, tot_m, locks, replicate_input, apply_weights,
current_m_block);
} else if (max_block == 2) {
MarlinMoESingle(
- A, B, C, sorted_ids_expert, topk_weights, scales_ptr, g_idx,
+ stages, has_act_order, has_zp, group_blocks>(
+ A, B, C, sorted_ids_expert, topk_weights, scales_ptr, zp_ptr, g_idx,
expert_offsets, num_groups, expert_idx, num_experts, topk, prob_m,
prob_n, prob_k, tot_m, locks, replicate_input, apply_weights,
current_m_block);
} else if (max_block == 3) {
MarlinMoESingle(
- A, B, C, sorted_ids_expert, topk_weights, scales_ptr, g_idx,
+ stages, has_act_order, has_zp, group_blocks>(
+ A, B, C, sorted_ids_expert, topk_weights, scales_ptr, zp_ptr, g_idx,
expert_offsets, num_groups, expert_idx, num_experts, topk, prob_m,
prob_n, prob_k, tot_m, locks, replicate_input, apply_weights,
current_m_block);
} else {
MarlinMoESingle(
- A, B, C, sorted_ids_expert, topk_weights, scales_ptr, g_idx,
+ stages, has_act_order, has_zp, group_blocks>(
+ A, B, C, sorted_ids_expert, topk_weights, scales_ptr, zp_ptr, g_idx,
expert_offsets, num_groups, expert_idx, num_experts, topk, prob_m,
prob_n, prob_k, tot_m, locks, replicate_input, apply_weights,
current_m_block);
@@ -1347,6 +1532,7 @@ template shared
// fetch pipeline
const bool has_act_order, // whether act_order is enabled
+ const bool has_zp, // whether zero-points are enabled
const int group_blocks = -1 // number of consecutive 16x16 blocks
// with a separate quantization scale
>
@@ -1358,6 +1544,8 @@ __global__ void MarlinMoE(
const float* __restrict__ topk_weights, // float topk weights
const int4* __restrict__ scales_ptr, // fp16 quantization scales of shape
// (k/groupsize)xn
+ const int4* __restrict__ zp_ptr, // 4bit packed zero-points of shape
+ // (k/groupsize)x(n/pack_factor)
const int* __restrict__ g_idx, // int32 group indices of shape k
const int* __restrict__ expert_offsets,
int num_groups, // number of scale groups per output channel
@@ -1374,7 +1562,6 @@ __global__ void MarlinMoE(
int current_m_block, // current m block to start kernel computation from
int max_par, // maximum parallelism
int cfg_max_m_blocks // upper bound on m blocks
-
) {
// Marlin is not implemented yet for SM < 8.0
assert(false);
@@ -1389,37 +1576,41 @@ __global__ void MarlinMoE(
const int USER_THREADS =
256; // Note: This is only used with user-provided thread_k/n
const int STAGES = 4; // 4 pipeline stages fit into shared memory
-// const int SHARED_MEM =
-// 96 * 1024; // max shared memory on compute capability 8.6 (< 8.0)
static constexpr int min_thread_n = 64;
static constexpr int min_thread_k = 64;
#define __CALL_IF_MOE(W_TYPE, THREAD_N_BLOCKS, THREAD_K_BLOCKS, HAS_ACT_ORDER, \
- GROUP_BLOCKS, NUM_THREADS) \
+ HAS_ZP, GROUP_BLOCKS, NUM_THREADS) \
else if (q_type == W_TYPE && thread_n_blocks == THREAD_N_BLOCKS && \
thread_k_blocks == THREAD_K_BLOCKS && \
- has_act_order == HAS_ACT_ORDER && group_blocks == GROUP_BLOCKS && \
- num_threads == NUM_THREADS) { \
+ has_act_order == HAS_ACT_ORDER && has_zp == HAS_ZP && \
+ group_blocks == GROUP_BLOCKS && num_threads == NUM_THREADS) { \
cudaFuncSetAttribute( \
MarlinMoE, \
+ STAGES, HAS_ACT_ORDER, HAS_ZP, GROUP_BLOCKS>, \
cudaFuncAttributeMaxDynamicSharedMemorySize, max_shared_mem); \
MarlinMoE \
+ STAGES, HAS_ACT_ORDER, HAS_ZP, GROUP_BLOCKS> \
<<>>( \
A_ptr, B_ptr, C_ptr, sorted_ids_ptr, topk_weights_ptr, s_ptr, \
- g_idx_ptr, expert_offsets_ptr, num_groups, expert_idx, \
+ zp_ptr, g_idx_ptr, expert_offsets_ptr, num_groups, expert_idx, \
num_experts, topk, prob_m, prob_n, prob_k, tot_m, locks, \
replicate_input, apply_weights, m_block, max_par, \
cfg_max_m_blocks); \
}
-#define GPTQ_CALL_IF_MOE(W_TYPE, N_BLOCKS, K_BLOCKS, NUM_THREADS) \
- __CALL_IF_MOE(W_TYPE, N_BLOCKS, K_BLOCKS, true, 0, NUM_THREADS) \
- __CALL_IF_MOE(W_TYPE, N_BLOCKS, K_BLOCKS, false, -1, NUM_THREADS) \
- __CALL_IF_MOE(W_TYPE, N_BLOCKS, K_BLOCKS, false, 2, NUM_THREADS) \
- __CALL_IF_MOE(W_TYPE, N_BLOCKS, K_BLOCKS, false, 4, NUM_THREADS) \
- __CALL_IF_MOE(W_TYPE, N_BLOCKS, K_BLOCKS, false, 8, NUM_THREADS)
+#define GPTQ_CALL_IF_MOE(W_TYPE, N_BLOCKS, K_BLOCKS, NUM_THREADS) \
+ __CALL_IF_MOE(W_TYPE, N_BLOCKS, K_BLOCKS, true, false, 0, NUM_THREADS) \
+ __CALL_IF_MOE(W_TYPE, N_BLOCKS, K_BLOCKS, false, false, -1, NUM_THREADS) \
+ __CALL_IF_MOE(W_TYPE, N_BLOCKS, K_BLOCKS, false, false, 2, NUM_THREADS) \
+ __CALL_IF_MOE(W_TYPE, N_BLOCKS, K_BLOCKS, false, false, 4, NUM_THREADS) \
+ __CALL_IF_MOE(W_TYPE, N_BLOCKS, K_BLOCKS, false, false, 8, NUM_THREADS)
+
+#define AWQ_CALL_IF_MOE(W_TYPE, N_BLOCKS, K_BLOCKS, NUM_THREADS) \
+ __CALL_IF_MOE(W_TYPE, N_BLOCKS, K_BLOCKS, false, true, -1, NUM_THREADS) \
+ __CALL_IF_MOE(W_TYPE, N_BLOCKS, K_BLOCKS, false, true, 2, NUM_THREADS) \
+ __CALL_IF_MOE(W_TYPE, N_BLOCKS, K_BLOCKS, false, true, 4, NUM_THREADS) \
+ __CALL_IF_MOE(W_TYPE, N_BLOCKS, K_BLOCKS, false, true, 8, NUM_THREADS)
} // namespace marlin_moe
diff --git a/csrc/moe/marlin_kernels/marlin_moe_kernel_ku4.cu b/csrc/moe/marlin_kernels/marlin_moe_kernel_ku4.cu
new file mode 100644
index 0000000000000..77bc0dd90edde
--- /dev/null
+++ b/csrc/moe/marlin_kernels/marlin_moe_kernel_ku4.cu
@@ -0,0 +1,31 @@
+#include "marlin_moe_kernel_ku4.h"
+
+namespace marlin_moe {
+
+// We return bool so we can create these different kernel calls as a sequence
+// of if-elseif's.
+bool call_marlin_moe_kernel_ku4(
+ vllm::ScalarType const& q_type, int thread_n_blocks, int thread_k_blocks,
+ bool has_act_order, int group_blocks, int num_threads, int blocks,
+ int max_shared_mem, cudaStream_t stream, const int4* A_ptr,
+ const int4* B_ptr, int4* C_ptr, const int* sorted_ids_ptr,
+ const float* topk_weights_ptr, const int4* s_ptr, const int4* zp_ptr,
+ const int* g_idx_ptr, int* expert_offsets_ptr, int num_groups,
+ int expert_idx, int num_experts, int topk, int prob_m, int prob_n,
+ int prob_k, int tot_m, int* locks, bool replicate_input, bool apply_weights,
+ int m_block, int max_par, int cfg_max_m_blocks) {
+ bool has_zp = true;
+
+ if (false) {
+ }
+ AWQ_CALL_IF_MOE(vllm::kU4, 16, 4, 256)
+ AWQ_CALL_IF_MOE(vllm::kU4, 8, 8, 256)
+ AWQ_CALL_IF_MOE(vllm::kU4, 8, 4, 128)
+ AWQ_CALL_IF_MOE(vllm::kU4, 4, 8, 128)
+ else {
+ return false;
+ }
+ return true;
+}
+
+} // namespace marlin_moe
diff --git a/csrc/moe/marlin_kernels/marlin_moe_kernel_ku4.h b/csrc/moe/marlin_kernels/marlin_moe_kernel_ku4.h
new file mode 100644
index 0000000000000..833fadf37721f
--- /dev/null
+++ b/csrc/moe/marlin_kernels/marlin_moe_kernel_ku4.h
@@ -0,0 +1,20 @@
+#pragma once
+
+#include "marlin_moe_kernel.h"
+
+namespace marlin_moe {
+
+// We return bool so we can create these different kernel calls as a sequence
+// of if-elseif's.
+bool call_marlin_moe_kernel_ku4(
+ vllm::ScalarType const& q_type, int thread_n_blocks, int thread_k_blocks,
+ bool has_act_order, int group_blocks, int num_threads, int blocks,
+ int max_shared_mem, cudaStream_t stream, const int4* A_ptr,
+ const int4* B_ptr, int4* C_ptr, const int* sorted_ids_ptr,
+ const float* topk_weights_ptr, const int4* s_ptr, const int4* zp_ptr,
+ const int* g_idx_ptr, int* expert_offsets_ptr, int num_groups,
+ int expert_idx, int num_experts, int topk, int prob_m, int prob_n,
+ int prob_k, int tot_m, int* locks, bool replicate_input, bool apply_weights,
+ int m_block, int max_par, int cfg_max_m_blocks);
+
+} // namespace marlin_moe
diff --git a/csrc/moe/marlin_kernels/marlin_moe_kernel_ku4b8.cu b/csrc/moe/marlin_kernels/marlin_moe_kernel_ku4b8.cu
index cbafd9ffe7474..f7e57b0375945 100644
--- a/csrc/moe/marlin_kernels/marlin_moe_kernel_ku4b8.cu
+++ b/csrc/moe/marlin_kernels/marlin_moe_kernel_ku4b8.cu
@@ -9,11 +9,13 @@ bool call_marlin_moe_kernel_ku4b8(
bool has_act_order, int group_blocks, int num_threads, int blocks,
int max_shared_mem, cudaStream_t stream, const int4* A_ptr,
const int4* B_ptr, int4* C_ptr, const int* sorted_ids_ptr,
- const float* topk_weights_ptr, const int4* s_ptr, const int* g_idx_ptr,
- int* expert_offsets_ptr, int num_groups, int expert_idx, int num_experts,
- int topk, int prob_m, int prob_n, int prob_k, int tot_m, int* locks,
- bool replicate_input, bool apply_weights, int m_block, int max_par,
- int cfg_max_m_blocks) {
+ const float* topk_weights_ptr, const int4* s_ptr, const int4* zp_ptr,
+ const int* g_idx_ptr, int* expert_offsets_ptr, int num_groups,
+ int expert_idx, int num_experts, int topk, int prob_m, int prob_n,
+ int prob_k, int tot_m, int* locks, bool replicate_input, bool apply_weights,
+ int m_block, int max_par, int cfg_max_m_blocks) {
+ bool has_zp = false;
+
if (false) {
}
GPTQ_CALL_IF_MOE(vllm::kU4B8, 16, 4, 256)
diff --git a/csrc/moe/marlin_kernels/marlin_moe_kernel_ku4b8.h b/csrc/moe/marlin_kernels/marlin_moe_kernel_ku4b8.h
index 9eacb42c115f0..494da8f10e262 100644
--- a/csrc/moe/marlin_kernels/marlin_moe_kernel_ku4b8.h
+++ b/csrc/moe/marlin_kernels/marlin_moe_kernel_ku4b8.h
@@ -11,10 +11,10 @@ bool call_marlin_moe_kernel_ku4b8(
bool has_act_order, int group_blocks, int num_threads, int blocks,
int max_shared_mem, cudaStream_t stream, const int4* A_ptr,
const int4* B_ptr, int4* C_ptr, const int* sorted_ids_ptr,
- const float* topk_weights_ptr, const int4* s_ptr, const int* g_idx_ptr,
- int* expert_offsets_ptr, int num_groups, int expert_idx, int num_experts,
- int topk, int prob_m, int prob_n, int prob_k, int tot_m, int* locks,
- bool replicate_input, bool apply_weights, int m_block, int max_par,
- int cfg_max_m_blocks);
+ const float* topk_weights_ptr, const int4* s_ptr, const int4* zp_ptr,
+ const int* g_idx_ptr, int* expert_offsets_ptr, int num_groups,
+ int expert_idx, int num_experts, int topk, int prob_m, int prob_n,
+ int prob_k, int tot_m, int* locks, bool replicate_input, bool apply_weights,
+ int m_block, int max_par, int cfg_max_m_blocks);
} // namespace marlin_moe
diff --git a/csrc/moe/marlin_kernels/marlin_moe_kernel_ku8b128.cu b/csrc/moe/marlin_kernels/marlin_moe_kernel_ku8b128.cu
index c46712474f715..a901f0b11cd78 100644
--- a/csrc/moe/marlin_kernels/marlin_moe_kernel_ku8b128.cu
+++ b/csrc/moe/marlin_kernels/marlin_moe_kernel_ku8b128.cu
@@ -9,11 +9,13 @@ bool call_marlin_moe_kernel_ku8b128(
bool has_act_order, int group_blocks, int num_threads, int blocks,
int max_shared_mem, cudaStream_t stream, const int4* A_ptr,
const int4* B_ptr, int4* C_ptr, const int* sorted_ids_ptr,
- const float* topk_weights_ptr, const int4* s_ptr, const int* g_idx_ptr,
- int* expert_offsets_ptr, int num_groups, int expert_idx, int num_experts,
- int topk, int prob_m, int prob_n, int prob_k, int tot_m, int* locks,
- bool replicate_input, bool apply_weights, int m_block, int max_par,
- int cfg_max_m_blocks) {
+ const float* topk_weights_ptr, const int4* s_ptr, const int4* zp_ptr,
+ const int* g_idx_ptr, int* expert_offsets_ptr, int num_groups,
+ int expert_idx, int num_experts, int topk, int prob_m, int prob_n,
+ int prob_k, int tot_m, int* locks, bool replicate_input, bool apply_weights,
+ int m_block, int max_par, int cfg_max_m_blocks) {
+ bool has_zp = false;
+
if (false) {
}
GPTQ_CALL_IF_MOE(vllm::kU8B128, 16, 4, 256)
diff --git a/csrc/moe/marlin_kernels/marlin_moe_kernel_ku8b128.h b/csrc/moe/marlin_kernels/marlin_moe_kernel_ku8b128.h
index 7cd9acafb3b80..f3018aa0c1ab7 100644
--- a/csrc/moe/marlin_kernels/marlin_moe_kernel_ku8b128.h
+++ b/csrc/moe/marlin_kernels/marlin_moe_kernel_ku8b128.h
@@ -9,10 +9,10 @@ bool call_marlin_moe_kernel_ku8b128(
bool has_act_order, int group_blocks, int num_threads, int blocks,
int max_shared_mem, cudaStream_t stream, const int4* A_ptr,
const int4* B_ptr, int4* C_ptr, const int* sorted_ids_ptr,
- const float* topk_weights_ptr, const int4* s_ptr, const int* g_idx_ptr,
- int* expert_offsets_ptr, int num_groups, int expert_idx, int num_experts,
- int topk, int prob_m, int prob_n, int prob_k, int tot_m, int* locks,
- bool replicate_input, bool apply_weights, int m_block, int max_par,
- int cfg_max_m_blocks);
+ const float* topk_weights_ptr, const int4* s_ptr, const int4* zp_ptr,
+ const int* g_idx_ptr, int* expert_offsets_ptr, int num_groups,
+ int expert_idx, int num_experts, int topk, int prob_m, int prob_n,
+ int prob_k, int tot_m, int* locks, bool replicate_input, bool apply_weights,
+ int m_block, int max_par, int cfg_max_m_blocks);
}
diff --git a/csrc/moe/marlin_moe_ops.cu b/csrc/moe/marlin_moe_ops.cu
index 661490d95e791..e2db4e4196b6f 100644
--- a/csrc/moe/marlin_moe_ops.cu
+++ b/csrc/moe/marlin_moe_ops.cu
@@ -30,6 +30,7 @@
#include "core/registration.h"
#include "marlin_kernels/marlin_moe_kernel_ku4b8.h"
#include "marlin_kernels/marlin_moe_kernel_ku8b128.h"
+#include "marlin_kernels/marlin_moe_kernel_ku4.h"
template <typename T>
inline std::string str(T x) {
@@ -157,6 +158,7 @@ thread_config_t small_batch_thread_configs[] = {
{128, 64, 128}, // Reduce N 2X, same K
{64, 256, 256}, // Reduce K 2X, increase N 2X
{64, 128, 128}, // Reduce K 2X, same N
+ {64, 64, 128}, // Reduce both 2X
};
thread_config_t large_batch_thread_configs[] = {
@@ -167,6 +169,7 @@ thread_config_t large_batch_thread_configs[] = {
{128, 128, 256}, // Reduce N 2X, increase K 2X
{64, 128, 128}, // Reduce N 2X, same K
{128, 64, 128}, // Reduce N 4X, increase K 2X
+ {64, 64, 128}, // Reduce N 4X, same K
};
int get_scales_cache_size(thread_config_t const& th_config, int prob_m,
@@ -312,27 +315,28 @@ exec_config_t determine_thread_config(int prob_m, int prob_n, int prob_k,
return exec_config_t{0, {-1, -1, -1}};
}
-#define CALL_MOE_KERNEL_FUNCTION(KERNEL_FUNCTION) \
- else if (KERNEL_FUNCTION(q_type, thread_n_blocks, thread_k_blocks, \
- has_act_order, group_blocks, num_threads, blocks, \
- max_shared_mem, stream, A_ptr, B_ptr, C_ptr, \
- sorted_ids_ptr, topk_weights_ptr, s_ptr, g_idx_ptr, \
- expert_offsets_ptr, num_groups, expert_idx, \
- num_experts, topk, prob_m, prob_n, prob_k, tot_m, \
- locks, replicate_input, apply_weights, m_block, \
- max_par, exec_cfg.max_m_blocks)) { \
+#define CALL_MOE_KERNEL_FUNCTION(KERNEL_FUNCTION) \
+ else if (KERNEL_FUNCTION( \
+ q_type, thread_n_blocks, thread_k_blocks, has_act_order, \
+ group_blocks, num_threads, blocks, max_shared_mem, stream, \
+ A_ptr, B_ptr, C_ptr, sorted_ids_ptr, topk_weights_ptr, s_ptr, \
+ zp_ptr, g_idx_ptr, expert_offsets_ptr, num_groups, expert_idx, \
+ num_experts, topk, prob_m, prob_n, prob_k, tot_m, locks, \
+ replicate_input, apply_weights, m_block, max_par, \
+ exec_cfg.max_m_blocks)) { \
}
void marlin_mm_moe(const void* A, const void* B, void* C,
const void* sorted_ids, const void* topk_weights,
- const void* topk_ids, const void* s, const void* g_idx,
- const void* perm, void* a_tmp, void* expert_offsets,
- int prob_m, int prob_n, int prob_k, void* workspace,
- vllm::ScalarType const& q_type, bool has_act_order,
- bool is_k_full, int num_groups, int group_size,
- int num_experts, int topk, int moe_block_size, int dev,
- cudaStream_t stream, int thread_k, int thread_n, int sms,
- int max_par, bool replicate_input, bool apply_weights) {
+ const void* topk_ids, const void* s, void* zp,
+ const void* g_idx, const void* perm, void* a_tmp,
+ void* expert_offsets, int prob_m, int prob_n, int prob_k,
+ void* workspace, vllm::ScalarType const& q_type,
+ bool has_act_order, bool is_k_full, bool has_zp,
+ int num_groups, int group_size, int num_experts, int topk,
+ int moe_block_size, int dev, cudaStream_t stream,
+ int thread_k, int thread_n, int sms, int max_par,
+ bool replicate_input, bool apply_weights) {
TORCH_CHECK(prob_m > 0 && prob_n > 0 && prob_k > 0, "Invalid MNK = [", prob_m,
", ", prob_n, ", ", prob_k, "]");
@@ -436,6 +440,8 @@ void marlin_mm_moe(const void* A, const void* B, void* C,
const float* topk_weights_ptr = (const float*)topk_weights;
const int* sorted_ids_ptr = (const int*)sorted_ids;
const int4* s_ptr = (const int4*)s + num_groups * prob_n / 8 * expert_idx;
+ const int4* zp_ptr =
+ (const int4*)zp + num_groups * prob_n / (pack_factor * 4) * expert_idx;
const int* g_idx_ptr = (const int*)g_idx + prob_k * expert_idx;
const int* perm_ptr = (const int*)perm + prob_k * expert_idx;
int* locks = (int*)workspace;
@@ -456,6 +462,7 @@ void marlin_mm_moe(const void* A, const void* B, void* C,
}
CALL_MOE_KERNEL_FUNCTION(call_marlin_moe_kernel_ku4b8)
CALL_MOE_KERNEL_FUNCTION(call_marlin_moe_kernel_ku8b128)
+ CALL_MOE_KERNEL_FUNCTION(call_marlin_moe_kernel_ku4)
else {
TORCH_CHECK(false, "Unsupported shapes: MNK = [" + str(prob_m) + ", " +
str(prob_n) + ", " + str(prob_k) + "]" +
@@ -475,13 +482,21 @@ torch::Tensor marlin_gemm_moe(
const torch::Tensor& a, const torch::Tensor& b_q_weights,
const torch::Tensor& sorted_ids, const torch::Tensor& topk_weights,
const torch::Tensor& topk_ids, const torch::Tensor& b_scales,
- const torch::Tensor& g_idx, const torch::Tensor& perm,
- torch::Tensor& workspace, vllm::ScalarTypeTorchPtr const& b_q_type,
- int64_t size_m, int64_t size_n, int64_t size_k, bool is_k_full,
- int64_t num_experts, int64_t topk, int64_t moe_block_size,
- bool replicate_input, bool apply_weights) {
- TORCH_CHECK(*b_q_type == vllm::kU4B8 || *b_q_type == vllm::kU8B128,
- "b_q_type must be uint4b8 or uint8b128. Got = ", b_q_type->str());
+ torch::Tensor& b_zeros, const torch::Tensor& g_idx,
+ const torch::Tensor& perm, torch::Tensor& workspace,
+ vllm::ScalarTypeTorchPtr const& b_q_type, int64_t size_m, int64_t size_n,
+ int64_t size_k, bool is_k_full, int64_t num_experts, int64_t topk,
+ int64_t moe_block_size, bool replicate_input, bool apply_weights) {
+ bool has_zp = b_zeros.size(1) != 0;
+ if (has_zp) {
+ TORCH_CHECK(
+ *b_q_type == vllm::kU4,
+ "b_q_type must be u4 when has_zp = True. Got = ", b_q_type->str());
+ } else {
+ TORCH_CHECK(
+ *b_q_type == vllm::kU4B8 || *b_q_type == vllm::kU8B128,
+ "b_q_type must be uint4b8 or uint8b128. Got = ", b_q_type->str());
+ }
int pack_factor = 32 / b_q_type->size_bits();
@@ -543,14 +558,27 @@ torch::Tensor marlin_gemm_moe(
}
}
+ // Verify b_zeros
+ if (has_zp) {
+ int rank = b_zeros.sizes().size();
+ TORCH_CHECK(rank == 3, "b_zeros rank = ", rank, " is not 3");
+ TORCH_CHECK(b_zeros.size(1) == num_groups,
+ "b_zeros dim 1 = ", b_zeros.size(1),
+ " is not num_groups = ", num_groups);
+ TORCH_CHECK(b_zeros.size(2) == size_n / pack_factor,
+ "b_zeros dim 2 = ", b_zeros.size(2),
+ " is not size_n / pack_factor = ", size_n / pack_factor);
+ }
+
marlin_moe::marlin_mm_moe(
a.data_ptr(), b_q_weights.data_ptr(), c.data_ptr(), sorted_ids.data_ptr(),
topk_weights.data_ptr(), topk_ids.data_ptr(), b_scales.data_ptr(),
- g_idx.data_ptr(), perm.data_ptr(), a_tmp.data_ptr(),
+ b_zeros.data_ptr(), g_idx.data_ptr(), perm.data_ptr(), a_tmp.data_ptr(),
expert_offsets.data_ptr(), size_m, size_n, size_k, workspace.data_ptr(),
- *b_q_type, has_act_order, is_k_full, num_groups, group_size, num_experts,
- topk, moe_block_size, dev, at::cuda::getCurrentCUDAStream(dev), thread_k,
- thread_n, sms, max_par, replicate_input, apply_weights);
+ *b_q_type, has_act_order, is_k_full, has_zp, num_groups, group_size,
+ num_experts, topk, moe_block_size, dev,
+ at::cuda::getCurrentCUDAStream(dev), thread_k, thread_n, sms, max_par,
+ replicate_input, apply_weights);
return c;
}
diff --git a/csrc/moe/torch_bindings.cpp b/csrc/moe/torch_bindings.cpp
index cbc8754f7a5b2..18fbc57ac7834 100644
--- a/csrc/moe/torch_bindings.cpp
+++ b/csrc/moe/torch_bindings.cpp
@@ -12,7 +12,7 @@ TORCH_LIBRARY_EXPAND(TORCH_EXTENSION_NAME, m) {
m.def(
"marlin_gemm_moe(Tensor! a, Tensor! b_q_weights, Tensor! sorted_ids, "
"Tensor! topk_weights, Tensor! topk_ids, Tensor! b_scales, Tensor! "
- "g_idx, Tensor! perm, Tensor! workspace, "
+ "b_zeros, Tensor! g_idx, Tensor! perm, Tensor! workspace, "
"__torch__.torch.classes._core_C.ScalarType b_q_type, int size_m, "
"int size_n, int size_k, bool is_k_full, int num_experts, int topk, "
"int moe_block_size, bool replicate_input, bool apply_weights)"
diff --git a/csrc/quantization/gptq_marlin/gptq_marlin.cu b/csrc/quantization/gptq_marlin/gptq_marlin.cu
index 227bc19b914a0..5efe15d2b2f6b 100644
--- a/csrc/quantization/gptq_marlin/gptq_marlin.cu
+++ b/csrc/quantization/gptq_marlin/gptq_marlin.cu
@@ -2260,7 +2260,7 @@ torch::Tensor gptq_marlin_gemm(torch::Tensor& a, torch::Tensor& b_q_weight,
"b_zeros dim 0 = ", b_zeros.size(0),
" is not num_groups = ", num_groups);
TORCH_CHECK(b_zeros.size(1) == size_n / pack_factor,
- "b_zeros dim 1 = ", b_scales.size(1),
+ "b_zeros dim 1 = ", b_zeros.size(1),
" is not size_n / pack_factor = ", size_n / pack_factor);
}
diff --git a/docs/source/models/adding_model.rst b/docs/source/models/adding_model.rst
index 5cffb58cafd96..fa1003874033e 100644
--- a/docs/source/models/adding_model.rst
+++ b/docs/source/models/adding_model.rst
@@ -85,21 +85,21 @@ When it comes to the linear layers, we provide the following options to parallel
* :code:`ReplicatedLinear`: Replicates the inputs and weights across multiple GPUs. No memory saving.
* :code:`RowParallelLinear`: The input tensor is partitioned along the hidden dimension. The weight matrix is partitioned along the rows (input dimension). An *all-reduce* operation is performed after the matrix multiplication to reduce the results. Typically used for the second FFN layer and the output linear transformation of the attention layer.
* :code:`ColumnParallelLinear`: The input tensor is replicated. The weight matrix is partitioned along the columns (output dimension). The result is partitioned along the column dimension. Typically used for the first FFN layer and the separated QKV transformation of the attention layer in the original Transformer.
-* :code:`MergedColumnParallelLinear`: Column-parallel linear that merges multiple `ColumnParallelLinear` operators. Typically used for the first FFN layer with weighted activation functions (e.g., SiLU). This class handles the sharded weight loading logic of multiple weight matrices.
+* :code:`MergedColumnParallelLinear`: Column-parallel linear that merges multiple :code:`ColumnParallelLinear` operators. Typically used for the first FFN layer with weighted activation functions (e.g., SiLU). This class handles the sharded weight loading logic of multiple weight matrices.
* :code:`QKVParallelLinear`: Parallel linear layer for the query, key, and value projections of the multi-head and grouped-query attention mechanisms. When the number of key/value heads is less than the world size, this class replicates the key/value heads properly. This class handles the weight loading and replication of the weight matrices.
-Note that all the linear layers above take `linear_method` as an input. vLLM will set this parameter according to different quantization schemes to support weight quantization.
+Note that all the linear layers above take :code:`linear_method` as an input. vLLM will set this parameter according to different quantization schemes to support weight quantization.
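
For a concrete picture, here is a minimal, hedged sketch of how an attention block might combine these layers. The constructor arguments are indicative only and may differ from the exact signatures in your vLLM version; note that vLLM's parallel linear layers return an :code:`(output, output_bias)` tuple rather than a plain tensor.

.. code-block:: python

    from torch import nn

    from vllm.model_executor.layers.linear import (QKVParallelLinear,
                                                   RowParallelLinear)


    class MyAttention(nn.Module):

        def __init__(self, hidden_size: int, num_heads: int,
                     num_kv_heads: int):
            super().__init__()
            head_size = hidden_size // num_heads
            # Column-parallel projection producing the fused Q/K/V tensor.
            self.qkv_proj = QKVParallelLinear(hidden_size, head_size,
                                              num_heads, num_kv_heads)
            # Row-parallel output projection; its result is all-reduced
            # across tensor-parallel ranks.
            self.o_proj = RowParallelLinear(num_heads * head_size,
                                            hidden_size)
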
4. Implement the weight loading logic
-------------------------------------
You now need to implement the :code:`load_weights` method in your :code:`*ForCausalLM` class.
-This method should load the weights from the HuggingFace's checkpoint file and assign them to the corresponding layers in your model. Specifically, for `MergedColumnParallelLinear` and `QKVParallelLinear` layers, if the original model has separated weight matrices, you need to load the different parts separately.
+This method should load the weights from the HuggingFace checkpoint file and assign them to the corresponding layers in your model. Specifically, for :code:`MergedColumnParallelLinear` and :code:`QKVParallelLinear` layers, if the original model has separate weight matrices, you need to load the different parts separately, as in the sketch below.
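
A hedged sketch of the common pattern for this method (defined on your :code:`*ForCausalLM` class) is shown below; the mapping entries and the :code:`default_weight_loader` helper follow the style of existing models and are illustrative rather than prescriptive:

.. code-block:: python

    from typing import Iterable, Tuple

    import torch

    from vllm.model_executor.model_loader.weight_utils import (
        default_weight_loader)


    def load_weights(self, weights: Iterable[Tuple[str, torch.Tensor]]):
        # Checkpoint shards that must be merged into fused vLLM parameters.
        stacked_params_mapping = [
            # (fused vLLM param name, checkpoint shard name, shard id)
            ("qkv_proj", "q_proj", "q"),
            ("qkv_proj", "k_proj", "k"),
            ("qkv_proj", "v_proj", "v"),
        ]
        params_dict = dict(self.named_parameters())
        for name, loaded_weight in weights:
            for param_name, shard_name, shard_id in stacked_params_mapping:
                if shard_name not in name:
                    continue
                # Load this shard into its slice of the fused parameter.
                name = name.replace(shard_name, param_name)
                param = params_dict[name]
                param.weight_loader(param, loaded_weight, shard_id)
                break
            else:
                # Regular (non-fused) parameter.
                param = params_dict[name]
                weight_loader = getattr(param, "weight_loader",
                                        default_weight_loader)
                weight_loader(param, loaded_weight)
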
5. Register your model
----------------------
-Finally, register your :code:`*ForCausalLM` class to the :code:`_MODELS` in `vllm/model_executor/models/__init__.py `_.
+Finally, register your :code:`*ForCausalLM` class to the :code:`_MODELS` in `vllm/model_executor/models/registry.py <https://github.com/vllm-project/vllm/blob/main/vllm/model_executor/models/registry.py>`_.
6. Out-of-Tree Model Integration
--------------------------------------------
@@ -114,6 +114,18 @@ Just add the following lines in your code:
from your_code import YourModelForCausalLM
ModelRegistry.register_model("YourModelForCausalLM", YourModelForCausalLM)
+If your model imports modules that initialize CUDA, consider instead lazy-importing it to avoid an error like :code:`RuntimeError: Cannot re-initialize CUDA in forked subprocess`:
+
+.. code-block:: python
+
+ from vllm import ModelRegistry
+
+ ModelRegistry.register_model("YourModelForCausalLM", "your_code:YourModelForCausalLM")
+
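+If you are unsure which import triggers CUDA initialization, the :code:`find_cuda_init.py` helper added at the repository root prints the stack trace of the first CUDA initialization it observes. A hedged example, run from the repository root (:code:`your_code.your_model` is a placeholder for your module path):
+
+.. code-block:: python
+
+    import importlib
+
+    from find_cuda_init import find_cuda_init
+
+    # Prints where CUDA gets initialized (if at all) while importing
+    # your model module.
+    find_cuda_init(
+        lambda: importlib.import_module("your_code.your_model"))
+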
+.. important::
+ If your model is a multimodal model, make sure the model class implements the :class:`~vllm.model_executor.models.interfaces.SupportsMultiModal` interface.
+ Read more about that :ref:`here `.
+
If you are running the API server with :code:`vllm serve <args>`, you can wrap the entrypoint with the following code:
.. code-block:: python
diff --git a/docs/source/models/supported_models.rst b/docs/source/models/supported_models.rst
index 23f08bfa9756e..dea109cb17f58 100644
--- a/docs/source/models/supported_models.rst
+++ b/docs/source/models/supported_models.rst
@@ -7,10 +7,12 @@ vLLM supports a variety of generative Transformer models in `HuggingFace Transfo
The following is the list of model architectures that are currently supported by vLLM.
Alongside each architecture, we include some popular models that use it.
-----
+Text-only Language Models
+^^^^^^^^^^^^^^^^^^^^^^^^^
+
+Text Generation
+---------------
-Decoder-only Language Models
-^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
.. list-table::
:widths: 25 25 50 5 5
:header-rows: 1
@@ -40,6 +42,11 @@ Decoder-only Language Models
- :code:`bigscience/bloom`, :code:`bigscience/bloomz`, etc.
-
- ✅︎
+ * - :code:`BartForConditionalGeneration`
+ - BART
+ - :code:`facebook/bart-base`, :code:`facebook/bart-large-cnn`, etc.
+ -
+ -
* - :code:`ChatGLMModel`
- ChatGLM
- :code:`THUDM/chatglm2-6b`, :code:`THUDM/chatglm3-6b`, etc.
@@ -259,11 +266,55 @@ Decoder-only Language Models
.. note::
Currently, the ROCm version of vLLM supports Mistral and Mixtral only for context lengths up to 4096.
-.. _supported_vlms:
+Text Embedding
+--------------
+
+.. list-table::
+ :widths: 25 25 50 5 5
+ :header-rows: 1
+
+ * - Architecture
+ - Models
+ - Example HuggingFace Models
+ - :ref:`LoRA <lora>`
+ - :ref:`PP <distributed_serving>`
+ * - :code:`Gemma2Model`
+ - Gemma2-based
+ - :code:`BAAI/bge-multilingual-gemma2`, etc.
+ -
+ - ✅︎
+ * - :code:`MistralModel`
+ - Mistral-based
+ - :code:`intfloat/e5-mistral-7b-instruct`, etc.
+ -
+ - ✅︎
+
+Reward Modeling
+---------------
+
+.. list-table::
+ :widths: 25 25 50 5 5
+ :header-rows: 1
+
+ * - Architecture
+ - Models
+ - Example HuggingFace Models
+ - :ref:`LoRA <lora>`
+ - :ref:`PP <distributed_serving>`
+ * - :code:`Qwen2ForRewardModel`
+ - Qwen2-based
+ - :code:`Qwen/Qwen2.5-Math-RM-72B`, etc.
+ -
+ - ✅︎
+
+.. note::
+ As an interim measure, these models are supported via the Embeddings API. See `this RFC `_ for upcoming changes.
Multimodal Language Models
^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+.. _supported_vlms:
+
.. list-table::
:widths: 25 25 25 25 5 5
:header-rows: 1
@@ -378,6 +429,7 @@ Multimodal Language Models
For :code:`openbmb/MiniCPM-V-2`, the official repo doesn't work yet, so we need to use a fork (:code:`HwwwH/MiniCPM-V-2`) for now.
For more details, please see: https://github.com/vllm-project/vllm/pull/4087#issuecomment-2250397630
+----
If your model uses one of the above model architectures, you can seamlessly run your model with vLLM.
Otherwise, please refer to :ref:`Adding a New Model <adding_a_new_model>` and :ref:`Enabling Multimodal Inputs <enabling_multimodal_inputs>`
diff --git a/docs/source/models/vlm.rst b/docs/source/models/vlm.rst
index 3f4f01e3ae7ac..8f5aa58f9f2b9 100644
--- a/docs/source/models/vlm.rst
+++ b/docs/source/models/vlm.rst
@@ -6,10 +6,9 @@ Using VLMs
vLLM provides experimental support for Vision Language Models (VLMs). See the :ref:`list of supported VLMs here <supported_vlms>`.
This document shows you how to run and serve these models using vLLM.
-.. important::
- We are actively iterating on VLM support. Expect breaking changes to VLM usage and development in upcoming releases without prior deprecation.
-
- We are continuously improving user & developer experience for VLMs. Please `open an issue on GitHub `_ if you have any feedback or feature requests.
+.. note::
+ We are actively iterating on VLM support. See `this RFC `_ for upcoming changes,
+ and `open an issue on GitHub `_ if you have any feedback or feature requests.
Offline Inference
-----------------
@@ -23,10 +22,6 @@ The :class:`~vllm.LLM` class can be instantiated in much the same way as languag
llm = LLM(model="llava-hf/llava-1.5-7b-hf")
-.. note::
- We have removed all vision language related CLI args in the ``0.5.1`` release. **This is a breaking change**, so please update your code to follow
- the above snippet. Specifically, ``image_feature_size`` can no longer be specified as we now calculate that internally for each model.
-
To pass an image to the model, note the following in :class:`vllm.inputs.PromptType`:
* ``prompt``: The prompt should follow the format that is documented on HuggingFace.
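
For example, a single-image request using the ``llm`` instance created above could look like the following hedged sketch; the prompt string and image path are placeholders, and the exact prompt format depends on the model:

.. code-block:: python

    from PIL import Image

    # The prompt format should follow the model's HuggingFace documentation.
    prompt = "USER: <image>\nWhat is the content of this image?\nASSISTANT:"
    image = Image.open("example.jpg")

    outputs = llm.generate({
        "prompt": prompt,
        "multi_modal_data": {"image": image},
    })
    print(outputs[0].outputs[0].text)
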
diff --git a/docs/source/serving/openai_compatible_server.md b/docs/source/serving/openai_compatible_server.md
index 8bb7067faa97c..9132e12a36ba5 100644
--- a/docs/source/serving/openai_compatible_server.md
+++ b/docs/source/serving/openai_compatible_server.md
@@ -140,7 +140,7 @@ $ vllm serve SOME_MODEL --config config.yaml
```
---
**NOTE**
-In case an argument is supplied using command line and the config file, the value from the commandline will take precedence.
+If an argument is supplied both via the command line and the config file, the value from the command line takes precedence.
The order of priorities is `command line > config file values > defaults`.
---
diff --git a/examples/llm_engine_example.py b/examples/llm_engine_example.py
index ca41f32b12b31..60d894aae9692 100644
--- a/examples/llm_engine_example.py
+++ b/examples/llm_engine_example.py
@@ -18,9 +18,6 @@ def create_test_prompts() -> List[Tuple[str, SamplingParams]]:
temperature=0.8,
top_p=0.95,
frequency_penalty=0.1)),
- ("It is only with the heart that one can see rightly",
- SamplingParams(n=3, best_of=3, use_beam_search=True,
- temperature=0.0)),
]
diff --git a/examples/multilora_inference.py b/examples/multilora_inference.py
index 6aa25b4689ec8..043220d979c3c 100644
--- a/examples/multilora_inference.py
+++ b/examples/multilora_inference.py
@@ -43,15 +43,6 @@ def create_test_prompts(
max_tokens=128,
stop_token_ids=[32003]),
LoRARequest("sql-lora", 1, lora_path)),
- (
- "[user] Write a SQL query to answer the question based on the table schema.\n\n context: CREATE TABLE table_name_11 (nationality VARCHAR, elector VARCHAR)\n\n question: When Anchero Pantaleone was the elector what is under nationality? [/user] [assistant]", # noqa: E501
- SamplingParams(n=3,
- best_of=3,
- use_beam_search=True,
- temperature=0,
- max_tokens=128,
- stop_token_ids=[32003]),
- LoRARequest("sql-lora", 1, lora_path)),
(
"[user] Write a SQL query to answer the question based on the table schema.\n\n context: CREATE TABLE table_name_74 (icao VARCHAR, airport VARCHAR)\n\n question: Name the ICAO for lilongwe international airport [/user] [assistant]", # noqa: E501
SamplingParams(temperature=0.0,
@@ -60,15 +51,6 @@ def create_test_prompts(
max_tokens=128,
stop_token_ids=[32003]),
LoRARequest("sql-lora2", 2, lora_path)),
- (
- "[user] Write a SQL query to answer the question based on the table schema.\n\n context: CREATE TABLE table_name_11 (nationality VARCHAR, elector VARCHAR)\n\n question: When Anchero Pantaleone was the elector what is under nationality? [/user] [assistant]", # noqa: E501
- SamplingParams(n=3,
- best_of=3,
- use_beam_search=True,
- temperature=0,
- max_tokens=128,
- stop_token_ids=[32003]),
- LoRARequest("sql-lora", 1, lora_path)),
]
diff --git a/examples/offline_inference_with_prefix.py b/examples/offline_inference_with_prefix.py
index 04c2843792a1b..3b3e0ae64a037 100644
--- a/examples/offline_inference_with_prefix.py
+++ b/examples/offline_inference_with_prefix.py
@@ -1,7 +1,8 @@
-from time import time
-
from vllm import LLM, SamplingParams
+# NOTE: This is just a running example. For benchmarking purposes,
+# please see benchmarks/benchmark_prefix_caching.py
+
# Common prefix.
prefix = (
"You are an expert school principal, skilled in effectively managing "
@@ -37,9 +38,7 @@
# Generate texts from the prompts. The output is a list of RequestOutput objects
# that contain the prompt, generated text, and other information.
-start_time_regular = time()
outputs = regular_llm.generate(generating_prompts, sampling_params)
-duration_regular = time() - start_time_regular
regular_generated_texts = []
# Print the outputs.
@@ -55,9 +54,7 @@
prefix_cached_llm.generate(generating_prompts[0], sampling_params)
# Generate with prefix caching.
-start_time_cached = time()
outputs = prefix_cached_llm.generate(generating_prompts, sampling_params)
-duration_cached = time() - start_time_cached
print("Results with `enable_prefix_caching`")
@@ -77,6 +74,3 @@
for i in range(len(prompts))
])
print(f"Generated answers are the same: {generated_same}")
-
-speedup = round(duration_regular / duration_cached, 2)
-print(f"Speed up of cached generation compared to the regular is: {speedup}")
diff --git a/find_cuda_init.py b/find_cuda_init.py
new file mode 100644
index 0000000000000..51db23102f9ac
--- /dev/null
+++ b/find_cuda_init.py
@@ -0,0 +1,33 @@
+import importlib
+import traceback
+from typing import Callable
+from unittest.mock import patch
+
+
+def find_cuda_init(fn: Callable[[], object]) -> None:
+ """
+ Helper function to debug CUDA re-initialization errors.
+
+ If `fn` initializes CUDA, prints the stack trace of how this happens.
+ """
+ from torch.cuda import _lazy_init
+
+ stack = None
+
+ def wrapper():
+ nonlocal stack
+ stack = traceback.extract_stack()
+ return _lazy_init()
+
+ with patch("torch.cuda._lazy_init", wrapper):
+ fn()
+
+ if stack is not None:
+ print("==== CUDA Initialized ====")
+ print("".join(traceback.format_list(stack)).strip())
+ print("==========================")
+
+
+if __name__ == "__main__":
+ find_cuda_init(
+ lambda: importlib.import_module("vllm.model_executor.models.llava"))
diff --git a/tests/basic_correctness/test_preemption.py b/tests/basic_correctness/test_preemption.py
index 05e7859759002..4e502cfb5f4f8 100644
--- a/tests/basic_correctness/test_preemption.py
+++ b/tests/basic_correctness/test_preemption.py
@@ -23,11 +23,9 @@
@pytest.fixture(scope="module", autouse=True)
def check_settings():
assert ENABLE_ARTIFICIAL_PREEMPT is True, (
- "Use an env var VLLM_TEST_ENABLE_ARTIFICIAL_PREEMPT=1, "
- "VLLM_ALLOW_DEPRECATED_BEAM_SEARCH=1. "
+ "Use an env var VLLM_TEST_ENABLE_ARTIFICIAL_PREEMPT=1."
"`VLLM_TEST_ENABLE_ARTIFICIAL_PREEMPT=1 "
- "VLLM_ALLOW_DEPRECATED_BEAM_SEARCH=1 pytest "
- "tests/basic_correctness/test_preemption.py`")
+ "pytest tests/basic_correctness/test_preemption.py`")
@pytest.fixture
@@ -137,114 +135,6 @@ def test_preemption(
assert total_preemption == total_recorded_preemption
-@pytest.mark.parametrize("model", MODELS)
-@pytest.mark.parametrize("dtype", ["float"])
-@pytest.mark.parametrize("max_tokens", [96])
-@pytest.mark.parametrize("beam_width", [4])
-def test_swap(
- caplog_vllm,
- hf_runner,
- vllm_runner,
- example_prompts,
- model: str,
- dtype: str,
- max_tokens: int,
- beam_width: int,
- worker_use_ray: bool,
-) -> None:
- """Use beam search enables swapping."""
- example_prompts = example_prompts[:1]
- with hf_runner(model, dtype=dtype) as hf_model:
- hf_outputs = hf_model.generate_beam_search(example_prompts, beam_width,
- max_tokens)
-
- with vllm_runner(
- model,
- dtype=dtype,
- swap_space=10,
- disable_log_stats=False,
- worker_use_ray=worker_use_ray,
- ) as vllm_model:
- vllm_outputs = vllm_model.generate_beam_search(example_prompts,
- beam_width, max_tokens)
- assert (vllm_model.model.llm_engine.scheduler[0].artificial_preempt_cnt
- < ARTIFICIAL_PREEMPTION_MAX_CNT)
- total_preemption = (
- vllm_model.model.llm_engine.scheduler[0].num_cumulative_preemption)
-
- for i in range(len(example_prompts)):
- hf_output_ids, _ = hf_outputs[i]
- vllm_output_ids, _ = vllm_outputs[i]
- assert len(hf_output_ids) == len(vllm_output_ids)
- for j in range(len(hf_output_ids)):
- assert hf_output_ids[j] == vllm_output_ids[j], (
- f"Test{i} output{j}:\nHF: {hf_output_ids}\n"
- f"vLLM: {vllm_output_ids}")
-
- assert ("is preempted by PreemptionMode.SWAP mode because there "
- "is not enough KV cache space." in caplog_vllm.text)
- # Ensure the count bucket of request-level histogram metrics matches
- # the number of requests as a simple sanity check to ensure metrics are
- # generated
- preemption_metrics = None
- for m in REGISTRY.collect():
- if m.name == "vllm:num_preemptions":
- preemption_metrics = m
- assert preemption_metrics is not None
- total_recorded_preemption = 0
- for sample in preemption_metrics.samples:
- total_recorded_preemption += sample.value
- assert total_preemption == total_recorded_preemption
-
-
-@pytest.mark.parametrize("model", MODELS)
-@pytest.mark.parametrize("dtype", ["float"])
-@pytest.mark.parametrize("max_tokens", [96])
-@pytest.mark.parametrize("beam_width", [4])
-@pytest.mark.parametrize("use_v2_block_manager", [True, False])
-def test_swap_infeasible(
- vllm_runner,
- example_prompts,
- model: str,
- dtype: str,
- max_tokens: int,
- beam_width: int,
- worker_use_ray: bool,
- use_v2_block_manager: bool,
-) -> None:
- """Verify infeasible swap request will be ignored."""
- BLOCK_SIZE = 16
- prefill_blocks = 2
- decode_blocks = max_tokens // BLOCK_SIZE
- example_prompts = example_prompts[:1]
- with vllm_runner(
- model,
- dtype=dtype,
- swap_space=10,
- block_size=BLOCK_SIZE,
- # Since beam search have more than 1 sequence, prefill +
- # decode blocks are not enough to finish.
- num_gpu_blocks_override=prefill_blocks + decode_blocks,
- max_model_len=(prefill_blocks + decode_blocks) * BLOCK_SIZE,
- worker_use_ray=worker_use_ray,
- use_v2_block_manager=use_v2_block_manager,
- ) as vllm_model:
- sampling_params = SamplingParams(n=beam_width,
- use_beam_search=True,
- temperature=0.0,
- max_tokens=max_tokens,
- ignore_eos=True)
- req_outputs = vllm_model.model.generate(
- example_prompts,
- sampling_params=sampling_params,
- )
- assert (vllm_model.model.llm_engine.scheduler[0].artificial_preempt_cnt
- < ARTIFICIAL_PREEMPTION_MAX_CNT)
-
- # Verify the request is ignored and not hang.
- assert req_outputs[0].outputs[0].finish_reason == "length"
-
-
@pytest.mark.parametrize("model", MODELS)
@pytest.mark.parametrize("dtype", ["float"])
@pytest.mark.parametrize("max_tokens", [96])
diff --git a/tests/conftest.py b/tests/conftest.py
index 45dc5e8323ca4..baa6bae03a451 100644
--- a/tests/conftest.py
+++ b/tests/conftest.py
@@ -35,6 +35,7 @@
to_enc_dec_tuple_list, zip_enc_dec_prompts)
from vllm.logger import init_logger
from vllm.outputs import RequestOutput
+from vllm.sampling_params import BeamSearchParams
from vllm.utils import (STR_DTYPE_TO_TORCH_DTYPE, cuda_device_count_stateless,
identity, is_cpu)
@@ -277,6 +278,7 @@ def __init__(
SentenceTransformer(
model_name,
device="cpu",
+ trust_remote_code=True,
).to(dtype=torch_dtype))
else:
model_kwargs = model_kwargs if model_kwargs is not None else {}
@@ -780,7 +782,6 @@ def generate_encoder_decoder_greedy_logprobs(
List[TokensTextLogprobsPromptLogprobs]]:
greedy_logprobs_params = SamplingParams(
temperature=0.0,
- use_beam_search=False,
max_tokens=max_tokens,
logprobs=num_logprobs,
prompt_logprobs=(num_prompt_logprobs),
@@ -793,25 +794,14 @@ def generate_encoder_decoder_greedy_logprobs(
encoder_decoder_prompts, greedy_logprobs_params)
def generate_beam_search(
- self,
- prompts: List[str],
- beam_width: int,
- max_tokens: int,
- ) -> List[Tuple[List[List[int]], List[str]]]:
- beam_search_params = SamplingParams(n=beam_width,
- use_beam_search=True,
- temperature=0.0,
- max_tokens=max_tokens)
- outputs = self.generate(prompts, beam_search_params)
- return outputs
-
- def generate_beam_search_new(
self,
prompts: Union[List[str], List[List[int]]],
beam_width: int,
max_tokens: int,
) -> List[Tuple[List[List[int]], List[str]]]:
- outputs = self.model.beam_search(prompts, beam_width, max_tokens)
+ outputs = self.model.beam_search(
+ prompts,
+ BeamSearchParams(beam_width=beam_width, max_tokens=max_tokens))
returned_outputs = []
for output in outputs:
token_ids = [x.tokens for x in output.sequences]
@@ -879,15 +869,17 @@ def num_gpus_available():
temp_dir = tempfile.gettempdir()
-_dummy_path = os.path.join(temp_dir, "dummy_opt")
+_dummy_opt_path = os.path.join(temp_dir, "dummy_opt")
+_dummy_llava_path = os.path.join(temp_dir, "dummy_llava")
+_dummy_gemma2_embedding_path = os.path.join(temp_dir, "dummy_gemma2_embedding")
@pytest.fixture
def dummy_opt_path():
- json_path = os.path.join(_dummy_path, "config.json")
- if not os.path.exists(_dummy_path):
+ json_path = os.path.join(_dummy_opt_path, "config.json")
+ if not os.path.exists(_dummy_opt_path):
snapshot_download(repo_id="facebook/opt-125m",
- local_dir=_dummy_path,
+ local_dir=_dummy_opt_path,
ignore_patterns=[
"*.bin", "*.bin.index.json", "*.pt", "*.h5",
"*.msgpack"
@@ -898,4 +890,42 @@ def dummy_opt_path():
config["architectures"] = ["MyOPTForCausalLM"]
with open(json_path, "w") as f:
json.dump(config, f)
- return _dummy_path
+ return _dummy_opt_path
+
+
+@pytest.fixture
+def dummy_llava_path():
+ json_path = os.path.join(_dummy_llava_path, "config.json")
+ if not os.path.exists(_dummy_llava_path):
+ snapshot_download(repo_id="llava-hf/llava-1.5-7b-hf",
+ local_dir=_dummy_llava_path,
+ ignore_patterns=[
+ "*.bin", "*.bin.index.json", "*.pt", "*.h5",
+ "*.msgpack"
+ ])
+ assert os.path.exists(json_path)
+ with open(json_path, "r") as f:
+ config = json.load(f)
+ config["architectures"] = ["MyLlava"]
+ with open(json_path, "w") as f:
+ json.dump(config, f)
+ return _dummy_llava_path
+
+
+@pytest.fixture
+def dummy_gemma2_embedding_path():
+ json_path = os.path.join(_dummy_gemma2_embedding_path, "config.json")
+ if not os.path.exists(_dummy_gemma2_embedding_path):
+ snapshot_download(repo_id="BAAI/bge-multilingual-gemma2",
+ local_dir=_dummy_gemma2_embedding_path,
+ ignore_patterns=[
+ "*.bin", "*.bin.index.json", "*.pt", "*.h5",
+ "*.msgpack"
+ ])
+ assert os.path.exists(json_path)
+ with open(json_path, "r") as f:
+ config = json.load(f)
+ config["architectures"] = ["MyGemma2Embedding"]
+ with open(json_path, "w") as f:
+ json.dump(config, f)
+ return _dummy_gemma2_embedding_path
diff --git a/tests/core/block/e2e/test_correctness.py b/tests/core/block/e2e/test_correctness.py
index b3d3667b37d88..033778d2c35e0 100644
--- a/tests/core/block/e2e/test_correctness.py
+++ b/tests/core/block/e2e/test_correctness.py
@@ -85,73 +85,6 @@ def test_v1_v2_greedy_equality_with_preemption(baseline_llm_generator,
assert baseline_token_ids == test_token_ids
-@pytest.mark.parametrize(
- "common_llm_kwargs",
- [{
- # Use a small model for a fast test.
- "model": "facebook/opt-125m",
-
- # skip cuda graph creation for fast test.
- "enforce_eager": True,
-
- # Use a large block size to trigger more copy-on-writes.
- "block_size": 32,
- }])
-@pytest.mark.parametrize("per_test_common_llm_kwargs", [{}])
-@pytest.mark.parametrize("baseline_llm_kwargs", [{
- "use_v2_block_manager": False
-}])
-@pytest.mark.parametrize("test_llm_kwargs", [{
- "use_v2_block_manager": True,
- "preemption_mode": "swap"
-}, {
- "use_v2_block_manager": True,
- "preemption_mode": "recompute"
-}])
-@pytest.mark.parametrize("batch_size", [10])
-@pytest.mark.parametrize("seed", [1])
-def test_v1_v2_greedy_equality_with_cow(baseline_llm_generator,
- test_llm_generator, batch_size):
- """Verify beam search equality with block manager v1 and v2.
-
- This requires copy-on-writes; if the v1 and v2 output is the same, then
- we have some confidence cow is working.
- """
- output_len = 128
- temperature = 0.0
-
- prompts = [
- "Hello, my name is",
- "The president of the United States is",
- "The capital of France is",
- "The future of AI is",
- ]
-
- prompts = [prompt for prompt, _ in zip(cycle(prompts), range(batch_size))]
-
- sampling_params = SamplingParams(
- max_tokens=output_len,
- ignore_eos=True,
- temperature=temperature,
- use_beam_search=True,
- best_of=2,
- )
-
- print('Getting token ids from block manager v1')
- baseline_token_ids = get_token_ids_from_llm_generator(
- baseline_llm_generator, prompts, sampling_params)
-
- print('Getting token ids from block manager v2')
- test_token_ids = get_token_ids_from_llm_generator(test_llm_generator,
- prompts, sampling_params)
-
- for expected_token_ids, actual_token_ids in zip(baseline_token_ids,
- test_token_ids):
- assert expected_token_ids == actual_token_ids
-
- assert baseline_token_ids == test_token_ids
-
-
@pytest.mark.parametrize(
"common_llm_kwargs",
[{
diff --git a/tests/core/test_num_computed_tokens_update.py b/tests/core/test_num_computed_tokens_update.py
new file mode 100644
index 0000000000000..f3ec24e7bee3e
--- /dev/null
+++ b/tests/core/test_num_computed_tokens_update.py
@@ -0,0 +1,81 @@
+import pytest
+
+from tests.conftest import VllmRunner
+from tests.core.utils import create_dummy_prompt
+from vllm.engine.llm_engine import LLMEngine
+from vllm.platforms import current_platform
+from vllm.sequence import SequenceGroup
+
+MODEL = "JackFram/llama-160m"
+
+
+def add_seq_group_to_engine(engine: LLMEngine, seq_group: SequenceGroup):
+ scheduler = engine.scheduler[0]
+ scheduler.add_seq_group(seq_group)
+
+
+@pytest.mark.parametrize("num_scheduler_steps", [1, 8])
+@pytest.mark.parametrize("enable_chunked_prefill", [False, True])
+@pytest.mark.parametrize("enforce_eager", [False, True])
+def test_num_computed_tokens_update(num_scheduler_steps: int,
+ enable_chunked_prefill: bool,
+ enforce_eager: bool):
+
+ is_multi_step = num_scheduler_steps > 1
+ is_multi_step_chunked_prefill = is_multi_step and enable_chunked_prefill
+
+ if is_multi_step_chunked_prefill and current_platform.is_rocm():
+ pytest.skip("Multi-step with Chunked-Prefill does not support "
+ "rocm_flash_attn backend")
+
+ # Make a vllm engine
+ runner = VllmRunner(model_name=MODEL,
+ gpu_memory_utilization=0.7,
+ use_v2_block_manager=True,
+ num_scheduler_steps=num_scheduler_steps,
+ enable_chunked_prefill=enable_chunked_prefill,
+ enforce_eager=enforce_eager)
+ engine: LLMEngine = runner.model.llm_engine
+
+ # In multi-step + chunked-prefill there is no separate single prompt step.
+ # What is scheduled will run for num_scheduler_steps always.
+ num_prompt_steps = num_scheduler_steps \
+ if is_multi_step_chunked_prefill else 1
+
+ num_output_tokens_list = [4, 8, 12, 15, 16, 17]
+
+ # Create sequence and add to engine
+ prompt_len = 10
+
+ for req_idx, num_output_tokens in enumerate(num_output_tokens_list):
+ seq, seq_group = create_dummy_prompt(request_id=str(req_idx),
+ prompt_length=prompt_len,
+ min_tokens=num_output_tokens,
+ max_tokens=num_output_tokens)
+ add_seq_group_to_engine(engine, seq_group)
+
+ assert seq.data.get_num_computed_tokens() == 0
+
+ for _ in range(num_prompt_steps):
+ # prompt steps
+ engine.step()
+
+ if not seq.is_finished():
+ prompt_num_computed_tokens = seq.data.get_num_computed_tokens()
+ # Test correctness of num_computed_tokens after the prompt steps
+ assert prompt_num_computed_tokens == \
+ prompt_len + num_prompt_steps - 1
+
+ decode_step_counter = 0
+ while not seq.is_finished():
+ # Test correctness of num_computed_tokens after the decode steps
+ assert seq.data.get_num_computed_tokens(
+ ) == prompt_num_computed_tokens + decode_step_counter
+ for _ in range(num_scheduler_steps):
+ # decode step
+ engine.step()
+ decode_step_counter += 1
+
+ # Test correctness of num_computed_tokens after the sequence finish.
+ assert seq.data.get_num_computed_tokens(
+ ) == prompt_len + num_output_tokens - 1
diff --git a/tests/core/utils.py b/tests/core/utils.py
index 40d8f51fc186e..a95a573db7cd3 100644
--- a/tests/core/utils.py
+++ b/tests/core/utils.py
@@ -13,9 +13,10 @@ def create_dummy_prompt(
prompt_length: int,
block_size: Optional[int] = None,
lora_request: Optional[LoRARequest] = None,
- use_beam_search: bool = False,
best_of: int = 1,
prompt_tokens: Optional[List[int]] = None,
+ min_tokens: int = 0,
+ max_tokens: int = 16,
) -> Tuple[Sequence, SequenceGroup]:
if not block_size:
block_size = prompt_length
@@ -35,8 +36,9 @@ def create_dummy_prompt(
seqs=[prompt],
arrival_time=time.time(),
sampling_params=SamplingParams(
- use_beam_search=use_beam_search,
- best_of=best_of),
+ best_of=best_of,
+ max_tokens=max_tokens,
+ min_tokens=min_tokens),
lora_request=lora_request)
return prompt, seq_group
@@ -48,7 +50,6 @@ def create_dummy_prompt_encoder_decoder(
encoder_prompt_length: int,
block_size: Optional[int] = None,
lora_request: Optional[LoRARequest] = None,
- use_beam_search: bool = False,
best_of: int = 1,
) -> Tuple[Sequence, Sequence, SequenceGroup]:
if not block_size:
@@ -81,9 +82,7 @@ def create_dummy_prompt_encoder_decoder(
from_decoder_prompt=False)
seq_group = SequenceGroup(request_id=request_id,
seqs=[decoder_prompt],
- sampling_params=SamplingParams(
- use_beam_search=use_beam_search,
- best_of=best_of),
+ sampling_params=SamplingParams(best_of=best_of),
arrival_time=time.time(),
lora_request=lora_request,
encoder_seq=encoder_prompt)
diff --git a/tests/data/test_config.yaml b/tests/data/test_config.yaml
index 20d499624de2e..42f4f6f7bb992 100644
--- a/tests/data/test_config.yaml
+++ b/tests/data/test_config.yaml
@@ -1,2 +1,3 @@
port: 12312
+served_model_name: mymodel
tensor_parallel_size: 2
diff --git a/tests/distributed/test_pipeline_parallel.py b/tests/distributed/test_pipeline_parallel.py
index 1f62cdc7e06a8..88d0a4ba7f57b 100644
--- a/tests/distributed/test_pipeline_parallel.py
+++ b/tests/distributed/test_pipeline_parallel.py
@@ -7,7 +7,7 @@
"""
import os
from dataclasses import dataclass
-from typing import List, NamedTuple, Optional
+from typing import List, Literal, NamedTuple, Optional
import pytest
@@ -97,6 +97,9 @@ def iter_params(self, model_name: str):
self.trust_remote_code, self.tokenizer_mode)
+# NOTE: You can adjust tp_base and/or pp_base locally to fit the model on your GPUs.
+# The values displayed here are only a rough indicator of the model's size.
+
# yapf: disable
GENERATION_MODEL_SETTINGS = {
# [DETAILED TESTS]
@@ -104,15 +107,13 @@ def iter_params(self, model_name: str):
# [FAST TESTS]
# Uses Llama
# "BAAI/AquilaChat-7B": PPTestSettings.fast(),
- # TODO: Test on larger GPU
- # "Snowflake/snowflake-arctic-instruct": PPTestSettings.fast(trust_remote_code=True), # noqa: E501
+ "Snowflake/snowflake-arctic-instruct": PPTestSettings.fast(tp_base=8, trust_remote_code=True), # noqa: E501
"baichuan-inc/Baichuan-7B": PPTestSettings.fast(trust_remote_code=True),
"baichuan-inc/Baichuan2-13B-Chat": PPTestSettings.fast(trust_remote_code=True), # noqa: E501
"bigscience/bloomz-1b1": PPTestSettings.fast(),
"THUDM/chatglm3-6b": PPTestSettings.fast(trust_remote_code=True),
"CohereForAI/c4ai-command-r-v01": PPTestSettings.fast(tp_base=2, trust_remote_code=True), # noqa: E501
- # TODO: Test on larger GPU
- # "databricks/dbrx-instruct": PPTestSettings.fast(),
+ "databricks/dbrx-instruct": PPTestSettings.fast(tp_base=8),
"Deci/DeciLM-7B-instruct": PPTestSettings.fast(trust_remote_code=True),
"deepseek-ai/deepseek-llm-7b-chat": PPTestSettings.fast(),
"deepseek-ai/DeepSeek-V2-Lite-Chat": PPTestSettings.fast(trust_remote_code=True), # noqa: E501
@@ -161,8 +162,9 @@ def iter_params(self, model_name: str):
EMBEDDING_MODEL_SETTINGS = { # type: ignore[var-annotated]
# [FAST TESTS]
- # Uses Llama
- # "intfloat/e5-mistral-7b-instruct": PPTestSettings.fast(),
+ "intfloat/e5-mistral-7b-instruct": PPTestSettings.fast(),
+ "BAAI/bge-multilingual-gemma2": PPTestSettings.fast(),
+ "Qwen/Qwen2.5-Math-RM-72B": PPTestSettings.fast(tp_base=4, trust_remote_code=True), # noqa: E501
}
MULTIMODAL_MODEL_SETTINGS = {
@@ -192,40 +194,35 @@ def iter_params(self, model_name: str):
}
# yapf: enable
-MODEL_SETTINGS = {
- **GENERATION_MODEL_SETTINGS,
- **EMBEDDING_MODEL_SETTINGS,
- **MULTIMODAL_MODEL_SETTINGS,
-}
-
-# You can update this on your local machine to run specific tests
+# NOTE: You can update this on your local machine to run specific tests
TEST_MODELS = [
+ # [LANGUAGE GENERATION]
"meta-llama/Meta-Llama-3-8B",
- "facebook/chameleon-7b",
+ "ibm/PowerLM-3b",
+ # [LANGUAGE EMBEDDING]
+ "intfloat/e5-mistral-7b-instruct",
+ "BAAI/bge-multilingual-gemma2",
+ # [MULTIMODAL GENERATION]
"OpenGVLab/InternVL2-1B",
"microsoft/Phi-3-vision-128k-instruct",
- "mistralai/Pixtral-12B-2409",
"fixie-ai/ultravox-v0_3",
]
-@pytest.mark.parametrize(
- ("model_name", "parallel_setup", "distributed_backend",
- "trust_remote_code", "tokenizer_mode"),
- [
- params for model_name, settings in MODEL_SETTINGS.items()
- for params in settings.iter_params(model_name)
- if model_name in TEST_MODELS
- ],
-)
-@fork_new_process_for_each_test
-def test_compare_tp(model_name: str, parallel_setup: ParallelSetup,
- distributed_backend: str, trust_remote_code: bool,
- tokenizer_mode: Optional[str], num_gpus_available):
+def _compare_tp(
+ model_name: str,
+ parallel_setup: ParallelSetup,
+ distributed_backend: str,
+ trust_remote_code: bool,
+ tokenizer_mode: Optional[str],
+ num_gpus_available: int,
+ *,
+ method: Literal["generate", "encode"] = "encode",
+):
tp_size, pp_size, eager_mode, chunked_prefill = parallel_setup
- if num_gpus_available < tp_size:
- pytest.skip(f"Need at least {tp_size} GPUs to run the test")
+ if num_gpus_available < tp_size * pp_size:
+ pytest.skip(f"Need at least {tp_size} x {pp_size} GPUs")
if VLLM_MULTI_NODE and distributed_backend == "mp":
pytest.skip("Skipping multi-node pipeline parallel test for "
"multiprocessing distributed backend")
@@ -286,10 +283,95 @@ def test_compare_tp(model_name: str, parallel_setup: ParallelSetup,
]
try:
- compare_two_settings(model_name, pp_args, tp_args, pp_env)
+ compare_two_settings(model_name,
+ pp_args,
+ tp_args,
+ pp_env,
+ method=method)
except Exception:
if pp_env is None:
raise
else:
# Ray ADAG tests are flaky, so we don't want to fail the test
logger.exception("Ray ADAG tests failed")
+
+
+@pytest.mark.parametrize(
+ ("model_name", "parallel_setup", "distributed_backend",
+ "trust_remote_code", "tokenizer_mode"),
+ [
+ params for model_name, settings in GENERATION_MODEL_SETTINGS.items()
+ for params in settings.iter_params(model_name)
+ if model_name in TEST_MODELS
+ ],
+)
+@fork_new_process_for_each_test
+def test_tp_language_generation(
+ model_name: str,
+ parallel_setup: ParallelSetup,
+ distributed_backend: str,
+ trust_remote_code: bool,
+ tokenizer_mode: Optional[str],
+ num_gpus_available,
+):
+ _compare_tp(model_name,
+ parallel_setup,
+ distributed_backend,
+ trust_remote_code,
+ tokenizer_mode,
+ num_gpus_available,
+ method="generate")
+
+
+@pytest.mark.parametrize(
+ ("model_name", "parallel_setup", "distributed_backend",
+ "trust_remote_code", "tokenizer_mode"),
+ [
+ params for model_name, settings in EMBEDDING_MODEL_SETTINGS.items()
+ for params in settings.iter_params(model_name)
+ if model_name in TEST_MODELS
+ ],
+)
+@fork_new_process_for_each_test
+def test_tp_language_embedding(
+ model_name: str,
+ parallel_setup: ParallelSetup,
+ distributed_backend: str,
+ trust_remote_code: bool,
+ tokenizer_mode: Optional[str],
+ num_gpus_available,
+):
+ _compare_tp(model_name,
+ parallel_setup,
+ distributed_backend,
+ trust_remote_code,
+ tokenizer_mode,
+ num_gpus_available,
+ method="encode")
+
+
+@pytest.mark.parametrize(
+ ("model_name", "parallel_setup", "distributed_backend",
+ "trust_remote_code", "tokenizer_mode"),
+ [
+ params for model_name, settings in MULTIMODAL_MODEL_SETTINGS.items()
+ for params in settings.iter_params(model_name)
+ if model_name in TEST_MODELS
+ ],
+)
+@fork_new_process_for_each_test
+def test_tp_multimodal_generation(
+ model_name: str,
+ parallel_setup: ParallelSetup,
+ distributed_backend: str,
+ trust_remote_code: bool,
+ tokenizer_mode: Optional[str],
+ num_gpus_available,
+):
+ _compare_tp(model_name,
+ parallel_setup,
+ distributed_backend,
+ trust_remote_code,
+ tokenizer_mode,
+ num_gpus_available,
+ method="generate")
diff --git a/tests/entrypoints/openai/test_audio.py b/tests/entrypoints/openai/test_audio.py
index a9a0ac012c8ff..df8a140283fbb 100644
--- a/tests/entrypoints/openai/test_audio.py
+++ b/tests/entrypoints/openai/test_audio.py
@@ -21,7 +21,9 @@ def server():
"--dtype",
"bfloat16",
"--max-model-len",
- "4096",
+ "2048",
+ "--max-num-seqs",
+ "5",
"--enforce-eager",
]
diff --git a/tests/entrypoints/openai/test_completion.py b/tests/entrypoints/openai/test_completion.py
index d77cd57f12471..61da5513cb130 100644
--- a/tests/entrypoints/openai/test_completion.py
+++ b/tests/entrypoints/openai/test_completion.py
@@ -495,25 +495,30 @@ async def test_batch_completions(client: openai.AsyncOpenAI, model_name: str):
assert len(batch.choices) == 2
assert batch.choices[0].text == batch.choices[1].text
- # test n = 2
- batch = await client.completions.create(
- model=model_name,
- prompt=prompts,
- n=2,
- max_tokens=5,
- temperature=0.0,
- extra_body=dict(
- # NOTE: this has to be true for n > 1 in vLLM, but not necessary
- # for official client.
- use_beam_search=True),
- )
- assert len(batch.choices) == 4
- assert batch.choices[0].text != batch.choices[
- 1].text, "beam search should be different"
- assert batch.choices[0].text == batch.choices[
- 2].text, "two copies of the same prompt should be the same"
- assert batch.choices[1].text == batch.choices[
- 3].text, "two copies of the same prompt should be the same"
+ try:
+ # test n = 2
+ batch = await client.completions.create(
+ model=model_name,
+ prompt=prompts,
+ n=2,
+ max_tokens=5,
+ temperature=0.0,
+ extra_body=dict(
+ # NOTE: this has to be true for n > 1 in vLLM, but
+ # not necessary for official client.
+ use_beam_search=True),
+ )
+ assert len(batch.choices) == 4
+ assert batch.choices[0].text != batch.choices[
+ 1].text, "beam search should be different"
+ assert batch.choices[0].text == batch.choices[
+ 2].text, "two copies of the same prompt should be the same"
+ assert batch.choices[1].text == batch.choices[
+ 3].text, "two copies of the same prompt should be the same"
+ except BadRequestError as e:
+ # the only allowed exception is when beam search is not supported
+ # in the default mqllmengine
+ assert "--disable-frontend-multiprocessing" in str(e)
# test streaming
batch = await client.completions.create(
diff --git a/tests/entrypoints/openai/test_embedding.py b/tests/entrypoints/openai/test_embedding.py
index 3baaeab2feeaf..f119c6c1201c9 100644
--- a/tests/entrypoints/openai/test_embedding.py
+++ b/tests/entrypoints/openai/test_embedding.py
@@ -144,3 +144,64 @@ async def test_batch_base64_embedding(embedding_client: openai.AsyncOpenAI,
0].embedding
assert responses_float.data[1].embedding == responses_default.data[
1].embedding
+
+
+@pytest.mark.asyncio
+@pytest.mark.parametrize(
+ "model_name",
+ [EMBEDDING_MODEL_NAME],
+)
+async def test_single_embedding_truncation(
+ embedding_client: openai.AsyncOpenAI, model_name: str):
+ input_texts = [
+ "Como o Brasil pode fomentar o desenvolvimento de modelos de IA?",
+ ]
+
+ # test single embedding
+ embeddings = await embedding_client.embeddings.create(
+ model=model_name,
+ input=input_texts,
+ extra_body={"truncate_prompt_tokens": 10})
+ assert embeddings.id is not None
+ assert len(embeddings.data) == 1
+ assert len(embeddings.data[0].embedding) == 4096
+ assert embeddings.usage.completion_tokens == 0
+ assert embeddings.usage.prompt_tokens == 10
+ assert embeddings.usage.total_tokens == 10
+
+ input_tokens = [
+ 1, 24428, 289, 18341, 26165, 285, 19323, 283, 289, 26789, 3871, 28728,
+ 9901, 340, 2229, 385, 340, 315, 28741, 28804, 2
+ ]
+ embeddings = await embedding_client.embeddings.create(
+ model=model_name,
+ input=input_tokens,
+ extra_body={"truncate_prompt_tokens": 10})
+
+ assert embeddings.id is not None
+ assert len(embeddings.data) == 1
+ assert len(embeddings.data[0].embedding) == 4096
+ assert embeddings.usage.completion_tokens == 0
+ assert embeddings.usage.prompt_tokens == 10
+ assert embeddings.usage.total_tokens == 10
+
+
+@pytest.mark.asyncio
+@pytest.mark.parametrize(
+ "model_name",
+ [EMBEDDING_MODEL_NAME],
+)
+async def test_single_embedding_truncation_invalid(
+ embedding_client: openai.AsyncOpenAI, model_name: str):
+ input_texts = [
+ "Como o Brasil pode fomentar o desenvolvimento de modelos de IA?",
+ ]
+
+ with pytest.raises(openai.BadRequestError):
+ embeddings = await embedding_client.embeddings.create(
+ model=model_name,
+ input=input_texts,
+ extra_body={"truncate_prompt_tokens": 8193})
+ assert "error" in embeddings.object
+ assert "truncate_prompt_tokens value is greater than max_model_len. "\
+ "Please, select a smaller truncation size." in embeddings.message
diff --git a/tests/entrypoints/openai/test_vision.py b/tests/entrypoints/openai/test_vision.py
index f61fa127b7d06..81d79601124a7 100644
--- a/tests/entrypoints/openai/test_vision.py
+++ b/tests/entrypoints/openai/test_vision.py
@@ -23,9 +23,16 @@
@pytest.fixture(scope="module")
def server():
args = [
- "--dtype", "bfloat16", "--max-model-len", "4096", "--max-num-seqs",
- "5", "--enforce-eager", "--trust-remote-code", "--limit-mm-per-prompt",
- f"image={MAXIMUM_IMAGES}"
+ "--dtype",
+ "bfloat16",
+ "--max-model-len",
+ "2048",
+ "--max-num-seqs",
+ "5",
+ "--enforce-eager",
+ "--trust-remote-code",
+ "--limit-mm-per-prompt",
+ f"image={MAXIMUM_IMAGES}",
]
with RemoteOpenAIServer(MODEL_NAME, args) as remote_server:
diff --git a/tests/kernels/test_awq_marlin.py b/tests/kernels/test_awq_marlin.py
new file mode 100644
index 0000000000000..0738ea9b97edb
--- /dev/null
+++ b/tests/kernels/test_awq_marlin.py
@@ -0,0 +1,160 @@
+"""Test AWQ with fused MoE Marlin kernels.
+
+Run `pytest tests/kernels/test_awq_marlin.py`.
+"""
+import pytest
+import torch
+
+from tests.kernels.utils import (compute_max_diff, stack_and_dev, torch_moe,
+ torch_moe_single)
+from vllm.model_executor.layers.fused_moe.fused_marlin_moe import (
+ fused_marlin_moe, single_marlin_moe)
+from vllm.model_executor.layers.fused_moe.fused_moe import fused_topk
+from vllm.model_executor.layers.quantization.utils.marlin_utils_test import (
+ awq_marlin_quantize)
+from vllm.scalar_type import scalar_types
+
+
+@pytest.mark.parametrize("m", [64, 512, 222, 33, 1])
+@pytest.mark.parametrize("n", [128, 2048, 256, 1024])
+@pytest.mark.parametrize("k", [128, 1024, 512])
+@pytest.mark.parametrize("e", [8, 64])
+@pytest.mark.parametrize("topk", [2, 6])
+@pytest.mark.parametrize("group_size", [-1, 32, 64, 128])
+def test_fused_marlin_moe_awq(
+ m: int,
+ n: int,
+ k: int,
+ e: int,
+ topk: int,
+ group_size: int,
+):
+ torch.manual_seed(7)
+
+ num_bits = 4
+ quant_type = scalar_types.uint4
+ dtype = torch.float16
+ a = torch.randn((m, k), device="cuda", dtype=dtype) / 10
+ w1 = torch.randn((e, 2 * n, k), device="cuda", dtype=dtype) / 10
+ w2 = torch.randn((e, k, n), device="cuda", dtype=dtype) / 10
+
+ w_ref1_l = []
+ qweights1_l = []
+ scales1_l = []
+ zp1_l = []
+
+ for i in range(w1.shape[0]):
+ w_ref1, qweight1, scales1, zp1 = awq_marlin_quantize(
+ w1[i].transpose(1, 0), quant_type, group_size)
+ w_ref1_l.append(w_ref1)
+ qweights1_l.append(qweight1)
+ scales1_l.append(scales1)
+ zp1_l.append(zp1)
+
+ w_ref1 = stack_and_dev(w_ref1_l)
+ qweight1 = stack_and_dev(qweights1_l).contiguous()
+ scales1 = stack_and_dev(scales1_l)
+ zp1 = stack_and_dev(zp1_l)
+
+ w_ref2_l = []
+ qweights2_l = []
+ scales2_l = []
+ zp2_l = []
+
+ for i in range(w2.shape[0]):
+ w_ref2, qweight2, scales2, zp2 = awq_marlin_quantize(
+ w2[i].transpose(1, 0), quant_type, group_size)
+ w_ref2_l.append(w_ref2)
+ qweights2_l.append(qweight2)
+ scales2_l.append(scales2)
+ zp2_l.append(zp2)
+
+ w_ref2 = stack_and_dev(w_ref2_l)
+ qweight2 = stack_and_dev(qweights2_l).contiguous()
+ scales2 = stack_and_dev(scales2_l)
+ zp2 = stack_and_dev(zp2_l)
+
+ score = torch.randn((m, e), device="cuda", dtype=dtype)
+
+ topk_weights, topk_ids = fused_topk(a, score, topk, False)
+ marlin_output = fused_marlin_moe(
+ a,
+ qweight1,
+ qweight2,
+ scales1,
+ scales2,
+ score,
+ topk_weights,
+ topk_ids,
+ w1_zeros=zp1,
+ w2_zeros=zp2,
+ num_bits=num_bits,
+ )
+
+ torch_output = torch_moe(
+ a,
+ w_ref1.transpose(1, 2),
+ w_ref2.transpose(1, 2),
+ score,
+ topk,
+ )
+
+ assert compute_max_diff(marlin_output, torch_output) < 4e-2
+
+
+@pytest.mark.skip("This test is here for the sake of debugging, "
+ "don't run it in automated tests.")
+@pytest.mark.parametrize("m", [64, 512, 222, 33, 1])
+@pytest.mark.parametrize("n", [128, 2048, 256, 1024])
+@pytest.mark.parametrize("k", [128, 1024, 512])
+@pytest.mark.parametrize("e", [8, 64])
+@pytest.mark.parametrize("topk", [2, 6])
+@pytest.mark.parametrize("group_size", [-1, 32, 64, 128])
+def test_single_marlin_moe_multiply_awq(
+ m: int,
+ n: int,
+ k: int,
+ e: int,
+ topk: int,
+ group_size: int,
+):
+ torch.manual_seed(7)
+
+ num_bits = 4
+ quant_type = scalar_types.uint4
+ dtype = torch.float16
+ a = torch.randn((m, k), device="cuda", dtype=dtype) / 10
+ w = torch.randn((e, n, k), device="cuda", dtype=dtype) / 10
+
+ w_ref_l = []
+ qweights_l = []
+ scales_l = []
+ zp_l = []
+
+ for i in range(w.shape[0]):
+ w_ref, qweight, scales, zp = awq_marlin_quantize(
+ w[i].transpose(1, 0), quant_type, group_size)
+ w_ref_l.append(w_ref)
+ qweights_l.append(qweight)
+ scales_l.append(scales)
+ zp_l.append(zp)
+
+ w_ref = stack_and_dev(w_ref_l)
+ qweight = stack_and_dev(qweights_l).contiguous()
+ scales = stack_and_dev(scales_l).contiguous()
+ zp = stack_and_dev(zp_l).contiguous()
+
+ score = torch.randn((m, e), device="cuda", dtype=dtype)
+
+ marlin_output = single_marlin_moe(a,
+ qweight,
+ scales,
+ score,
+ topk,
+ renormalize=False,
+ w_zeros=zp,
+ num_bits=num_bits)
+
+ torch_output = torch_moe_single(a, w_ref.transpose(1, 2), score, topk)
+
+ assert compute_max_diff(marlin_output, torch_output) < 1e-2
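The tolerances in these tests (`4e-2` and `1e-2`) are applied to `compute_max_diff`, the helper this PR moves into `tests/kernels/utils.py` further down. Despite its name, it is a mean absolute difference normalized by the mean magnitude of the reference, so it measures average relative error rather than a per-element maximum. A small standalone illustration:

```
import torch

torch.manual_seed(0)

def compute_max_diff(output, output_ref):
    # same definition as the helper added to tests/kernels/utils.py below
    return torch.mean(torch.abs(output - output_ref)) / torch.mean(
        torch.abs(output_ref))

ref = torch.ones(1024)
close = ref + 0.01 * torch.randn(1024)   # ~1% average deviation
far = ref + 0.20 * torch.randn(1024)     # ~16% average deviation

assert compute_max_diff(close, ref) < 4e-2   # within the AWQ MoE tolerance
assert compute_max_diff(far, ref) > 4e-2     # well outside it
```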
diff --git a/tests/kernels/test_moe.py b/tests/kernels/test_moe.py
index cbbb5c9b79c42..b73c45b9cd198 100644
--- a/tests/kernels/test_moe.py
+++ b/tests/kernels/test_moe.py
@@ -2,16 +2,14 @@
Run `pytest tests/kernels/test_moe.py`.
"""
-from typing import List
-
import pytest
import torch
from transformers import MixtralConfig
from transformers.models.mixtral.modeling_mixtral import MixtralSparseMoeBlock
-from tests.kernels.utils import opcheck
+from tests.kernels.utils import (compute_max_diff, opcheck, stack_and_dev,
+ torch_moe, torch_moe_single)
from vllm import _custom_ops as ops
-from vllm.model_executor.layers.activation import SiluAndMul
from vllm.model_executor.layers.fused_moe import fused_moe
from vllm.model_executor.layers.fused_moe.fused_marlin_moe import (
fused_marlin_moe, single_marlin_moe)
@@ -24,37 +22,6 @@
from vllm.utils import seed_everything
-def torch_moe(a, w1, w2, score, topk):
- B, D = a.shape
- a = a.view(B, -1, D).repeat(1, topk, 1).reshape(-1, D)
- out = torch.zeros(B * topk, w2.shape[1], dtype=a.dtype, device=a.device)
- score = torch.softmax(score, dim=-1, dtype=torch.float32)
- topk_weight, topk_ids = torch.topk(score, topk)
- topk_weight = topk_weight.view(-1)
- topk_ids = topk_ids.view(-1)
- for i in range(w1.shape[0]):
- mask = topk_ids == i
- if mask.sum():
- out[mask] = SiluAndMul()(
- a[mask] @ w1[i].transpose(0, 1)) @ w2[i].transpose(0, 1)
- return (out.view(B, -1, w2.shape[1]) *
- topk_weight.view(B, -1, 1).to(out.dtype)).sum(dim=1)
-
-
-def torch_moe_single(a, w, score, topk):
- B, D = a.shape
- a = a.view(B, -1, D).repeat(1, topk, 1).reshape(-1, D)
- out = torch.zeros(B * topk, w.shape[1], dtype=a.dtype, device=a.device)
- score = torch.softmax(score, dim=-1, dtype=torch.float32)
- _, topk_ids = torch.topk(score, topk)
- topk_ids = topk_ids.view(-1)
- for i in range(w.shape[0]):
- mask = topk_ids == i
- if mask.sum():
- out[mask] = a[mask] @ w[i].transpose(0, 1)
- return (out.view(B, -1, w.shape[1])).sum(dim=1)
-
-
@pytest.mark.parametrize("m", [1024 * 128, 512, 222, 33, 1])
@pytest.mark.parametrize("n", [2048, 256, 1024])
@pytest.mark.parametrize("k", [128, 511, 1024])
@@ -127,20 +94,10 @@ def test_mixtral_moe(dtype: torch.dtype):
atol=mixtral_moe_tol[dtype])
-def stack_and_dev(tensors: List[torch.Tensor]):
- dev = tensors[0].device
- return torch.stack(tensors, dim=0).to(dev)
-
-
-def compute_max_diff(output, output_ref):
- return torch.mean(torch.abs(output - output_ref)) / torch.mean(
- torch.abs(output_ref))
-
-
@pytest.mark.parametrize("m", [64, 512, 222, 33, 1])
@pytest.mark.parametrize("n", [128, 2048, 256, 1024])
@pytest.mark.parametrize("k", [128, 1024, 512])
-@pytest.mark.parametrize("e", [4, 8, 64])
+@pytest.mark.parametrize("e", [8, 64])
@pytest.mark.parametrize("topk", [2, 6])
@pytest.mark.parametrize("group_size", [-1, 32, 64, 128])
@pytest.mark.parametrize("act_order", [True, False])
@@ -159,9 +116,6 @@ def test_fused_marlin_moe(
):
seed_everything(7)
- if topk > e:
- return
-
# Filter act_order
if act_order:
if group_size == -1:
@@ -241,15 +195,15 @@ def test_fused_marlin_moe(
a,
qweight1,
qweight2,
+ scales1,
+ scales2,
score,
- g_idx1,
- g_idx2,
- sort_indices1,
- sort_indices2,
topk_weights,
topk_ids,
- w1_scale=scales1,
- w2_scale=scales2,
+ g_idx1=g_idx1,
+ g_idx2=g_idx2,
+ sort_indices1=sort_indices1,
+ sort_indices2=sort_indices2,
num_bits=num_bits,
is_k_full=is_k_full,
)
@@ -280,9 +234,13 @@ def test_fused_marlin_moe(
device="cuda",
requires_grad=False)
+ zp = torch.empty((0, 0),
+ dtype=dtype,
+ device="cuda",
+ requires_grad=False)
opcheck(torch.ops._moe_C.marlin_gemm_moe,
(a, qweight1, sorted_token_ids, topk_weights, topk_ids,
- scales1, g_idx1, sort_indices1, workspace, quant_type, m,
+ scales1, zp, g_idx1, sort_indices1, workspace, quant_type, m,
2 * n, k, True, e, topk, block_size_m, True, False))
@@ -291,7 +249,7 @@ def test_fused_marlin_moe(
@pytest.mark.parametrize("m", [64, 512, 222, 33, 1])
@pytest.mark.parametrize("n", [128, 2048, 256, 1024])
@pytest.mark.parametrize("k", [128, 1024, 512])
-@pytest.mark.parametrize("e", [4, 8, 64])
+@pytest.mark.parametrize("e", [8, 64])
@pytest.mark.parametrize("topk", [2, 6])
@pytest.mark.parametrize("group_size", [-1, 32, 64, 128])
@pytest.mark.parametrize("act_order", [True, False])
@@ -308,8 +266,6 @@ def test_single_marlin_moe_multiply(
num_bits: int,
is_k_full: bool,
):
- if topk > e:
- return
# Filter act_order
if act_order:
@@ -355,13 +311,14 @@ def test_single_marlin_moe_multiply(
qweight,
scales,
score,
- g_idx,
- sort_indices,
topk,
renormalize=False,
+ g_idx=g_idx,
+ sort_indices=sort_indices,
num_bits=num_bits,
is_k_full=is_k_full,
)
+
torch_output = torch_moe_single(a, w_ref.transpose(1, 2), score, topk)
assert compute_max_diff(marlin_output, torch_output) < 1e-2
diff --git a/tests/kernels/utils.py b/tests/kernels/utils.py
index 08004efe9e2f8..a2d414f636e13 100644
--- a/tests/kernels/utils.py
+++ b/tests/kernels/utils.py
@@ -12,6 +12,7 @@
from torch._prims_common import TensorLikeType
from vllm.attention import AttentionBackend, AttentionMetadata, AttentionType
+from vllm.model_executor.layers.activation import SiluAndMul
from vllm.utils import (STR_BACKEND_ENV_VAR, STR_XFORMERS_ATTN_VAL,
make_tensor_with_pad)
@@ -974,6 +975,50 @@ def fp8_allclose(
equal_nan=equal_nan)).item())
+# Marlin MoE test utils
+
+
+def stack_and_dev(tensors: List[torch.Tensor]):
+ dev = tensors[0].device
+ return torch.stack(tensors, dim=0).to(dev)
+
+
+def compute_max_diff(output, output_ref):
+ return torch.mean(torch.abs(output - output_ref)) / torch.mean(
+ torch.abs(output_ref))
+
+
+def torch_moe(a, w1, w2, score, topk):
+ B, D = a.shape
+ a = a.view(B, -1, D).repeat(1, topk, 1).reshape(-1, D)
+ out = torch.zeros(B * topk, w2.shape[1], dtype=a.dtype, device=a.device)
+ score = torch.softmax(score, dim=-1, dtype=torch.float32)
+ topk_weight, topk_ids = torch.topk(score, topk)
+ topk_weight = topk_weight.view(-1)
+ topk_ids = topk_ids.view(-1)
+ for i in range(w1.shape[0]):
+ mask = topk_ids == i
+ if mask.sum():
+ out[mask] = SiluAndMul()(
+ a[mask] @ w1[i].transpose(0, 1)) @ w2[i].transpose(0, 1)
+ return (out.view(B, -1, w2.shape[1]) *
+ topk_weight.view(B, -1, 1).to(out.dtype)).sum(dim=1)
+
+
+def torch_moe_single(a, w, score, topk):
+ B, D = a.shape
+ a = a.view(B, -1, D).repeat(1, topk, 1).reshape(-1, D)
+ out = torch.zeros(B * topk, w.shape[1], dtype=a.dtype, device=a.device)
+ score = torch.softmax(score, dim=-1, dtype=torch.float32)
+ _, topk_ids = torch.topk(score, topk)
+ topk_ids = topk_ids.view(-1)
+ for i in range(w.shape[0]):
+ mask = topk_ids == i
+ if mask.sum():
+ out[mask] = a[mask] @ w[i].transpose(0, 1)
+ return (out.view(B, -1, w.shape[1])).sum(dim=1)
+
+
# A special version of op check that has a restricted default set of test_utils
# and a patched version of allclose that supports fp8 types.
def opcheck(op: Union[torch._ops.OpOverload, torch._ops.OpOverloadPacket,
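`torch_moe` and `torch_moe_single` above are the dense reference implementations the Marlin MoE kernels are checked against: tokens are routed to their top-k experts, each expert applies a gated MLP, and the outputs are blended with the softmaxed router weights. The sketch below reproduces the same math in plain PyTorch, replacing vLLM's `SiluAndMul` with its usual definition (split the last dimension in half, `silu(x1) * x2`) so it runs without vLLM installed; it illustrates the reference path, not the fused kernels:

```
import torch
import torch.nn.functional as F

def silu_and_mul(x: torch.Tensor) -> torch.Tensor:
    # plain-torch stand-in for vLLM's SiluAndMul: gate with the first half
    d = x.shape[-1] // 2
    return F.silu(x[..., :d]) * x[..., d:]

def torch_moe_reference(a, w1, w2, score, topk):
    # a: [B, D], w1: [E, 2*N, D], w2: [E, D, N] -- same layout torch_moe expects
    B, D = a.shape
    a = a.view(B, -1, D).repeat(1, topk, 1).reshape(-1, D)
    out = torch.zeros(B * topk, w2.shape[1], dtype=a.dtype, device=a.device)
    score = torch.softmax(score, dim=-1, dtype=torch.float32)
    topk_weight, topk_ids = torch.topk(score, topk)
    topk_weight, topk_ids = topk_weight.view(-1), topk_ids.view(-1)
    for i in range(w1.shape[0]):
        mask = topk_ids == i
        if mask.sum():
            out[mask] = silu_and_mul(a[mask] @ w1[i].t()) @ w2[i].t()
    return (out.view(B, -1, w2.shape[1]) *
            topk_weight.view(B, -1, 1).to(out.dtype)).sum(dim=1)

# smoke test with the kind of shapes the MoE tests use
B, D, N, E, topk = 4, 16, 8, 4, 2
out = torch_moe_reference(torch.randn(B, D), torch.randn(E, 2 * N, D),
                          torch.randn(E, D, N), torch.randn(B, E), topk)
assert out.shape == (B, D)
```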
diff --git a/tests/lora/conftest.py b/tests/lora/conftest.py
index 7f6f60f38b5de..da98fac99cf22 100644
--- a/tests/lora/conftest.py
+++ b/tests/lora/conftest.py
@@ -173,6 +173,11 @@ def mixtral_lora_files():
return snapshot_download(repo_id="SangBinCho/mixtral-lora")
+@pytest.fixture(scope="session")
+def mixtral_lora_files_all_target_modules():
+ return snapshot_download(repo_id="dyang415/mixtral-lora-v0")
+
+
@pytest.fixture(scope="session")
def gemma_lora_files():
return snapshot_download(repo_id="wskwon/gemma-7b-test-lora")
diff --git a/tests/lora/test_mixtral.py b/tests/lora/test_mixtral.py
index b5b4a79eb9567..dddc299da446b 100644
--- a/tests/lora/test_mixtral.py
+++ b/tests/lora/test_mixtral.py
@@ -9,12 +9,9 @@
MODEL_PATH = "mistralai/Mixtral-8x7B-Instruct-v0.1"
-def do_sample(llm: vllm.LLM, lora_path: str, lora_id: int) -> List[str]:
- prompts = [
- "[system] Given a target sentence construct the underlying meaning representation\nof the input sentence as a single function with attributes and attribute\nvalues. This function should describe the target string accurately and the\nfunction must be one of the following ['inform', 'request', 'give_opinion',\n'confirm', 'verify_attribute', 'suggest', 'request_explanation',\n'recommend', 'request_attribute'].\n\nThe attributes must be one of the following:\n['name', 'exp_release_date', 'release_year', 'developer', 'esrb', 'rating',\n'genres', 'player_perspective', 'has_multiplayer', 'platforms',\n'available_on_steam', 'has_linux_release', 'has_mac_release', 'specifier'] [/system] [user] Here is the target sentence:\nSpellForce 3 is a pretty bad game. The developer Grimlore Games is clearly a bunch of no-talent hacks, and 2017 was a terrible year for games anyway. [/user] [assistant]", # noqa: E501
- "[system] Given a target sentence construct the underlying meaning representation\nof the input sentence as a single function with attributes and attribute\nvalues. This function should describe the target string accurately and the\nfunction must be one of the following ['inform', 'request', 'give_opinion',\n'confirm', 'verify_attribute', 'suggest', 'request_explanation',\n'recommend', 'request_attribute'].\n\nThe attributes must be one of the following:\n['name', 'exp_release_date', 'release_year', 'developer', 'esrb', 'rating',\n'genres', 'player_perspective', 'has_multiplayer', 'platforms',\n'available_on_steam', 'has_linux_release', 'has_mac_release', 'specifier'] [/system] [user] Here is the target sentence:\nI wanted to like Grimlore Games' 2017 entry, but in SpellForce 3 they just didn't get anything right. [/user] [assistant]", # noqa: E501
- "[system] Given a target sentence construct the underlying meaning representation\nof the input sentence as a single function with attributes and attribute\nvalues. This function should describe the target string accurately and the\nfunction must be one of the following ['inform', 'request', 'give_opinion',\n'confirm', 'verify_attribute', 'suggest', 'request_explanation',\n'recommend', 'request_attribute'].\n\nThe attributes must be one of the following:\n['name', 'exp_release_date', 'release_year', 'developer', 'esrb', 'rating',\n'genres', 'player_perspective', 'has_multiplayer', 'platforms',\n'available_on_steam', 'has_linux_release', 'has_mac_release', 'specifier'] [/system] [user] Here is the target sentence:\nBioShock is a good role-playing, action-adventure, shooter that released for PlayStation, Xbox, and PC in 2007. It is available on Steam, and it has a Mac release but not a Linux release. [/user] [assistant]", # noqa: E501
- ]
+def do_sample(llm: vllm.LLM, lora_path: str, lora_id: int,
+ prompts: List[str]) -> List[str]:
+
sampling_params = vllm.SamplingParams(temperature=0, max_tokens=256)
outputs = llm.generate(
prompts,
@@ -33,22 +30,71 @@ def do_sample(llm: vllm.LLM, lora_path: str, lora_id: int) -> List[str]:
@pytest.mark.parametrize("tp_size", [4])
def test_mixtral_lora(mixtral_lora_files, tp_size):
+ """Original test: the LoRA model has only the common target modules, not all of them."""
if torch.cuda.device_count() < tp_size:
pytest.skip(f"Not enough GPUs for tensor parallelism {tp_size}")
- llm = vllm.LLM(MODEL_PATH,
- enable_lora=True,
- max_num_seqs=16,
- max_loras=4,
- distributed_executor_backend="ray",
- tensor_parallel_size=tp_size)
+ prompts = [
+ "[system] Given a target sentence construct the underlying meaning representation\nof the input sentence as a single function with attributes and attribute\nvalues. This function should describe the target string accurately and the\nfunction must be one of the following ['inform', 'request', 'give_opinion',\n'confirm', 'verify_attribute', 'suggest', 'request_explanation',\n'recommend', 'request_attribute'].\n\nThe attributes must be one of the following:\n['name', 'exp_release_date', 'release_year', 'developer', 'esrb', 'rating',\n'genres', 'player_perspective', 'has_multiplayer', 'platforms',\n'available_on_steam', 'has_linux_release', 'has_mac_release', 'specifier'] [/system] [user] Here is the target sentence:\nSpellForce 3 is a pretty bad game. The developer Grimlore Games is clearly a bunch of no-talent hacks, and 2017 was a terrible year for games anyway. [/user] [assistant]", # noqa: E501
+ "[system] Given a target sentence construct the underlying meaning representation\nof the input sentence as a single function with attributes and attribute\nvalues. This function should describe the target string accurately and the\nfunction must be one of the following ['inform', 'request', 'give_opinion',\n'confirm', 'verify_attribute', 'suggest', 'request_explanation',\n'recommend', 'request_attribute'].\n\nThe attributes must be one of the following:\n['name', 'exp_release_date', 'release_year', 'developer', 'esrb', 'rating',\n'genres', 'player_perspective', 'has_multiplayer', 'platforms',\n'available_on_steam', 'has_linux_release', 'has_mac_release', 'specifier'] [/system] [user] Here is the target sentence:\nI wanted to like Grimlore Games' 2017 entry, but in SpellForce 3 they just didn't get anything right. [/user] [assistant]", # noqa: E501
+ "[system] Given a target sentence construct the underlying meaning representation\nof the input sentence as a single function with attributes and attribute\nvalues. This function should describe the target string accurately and the\nfunction must be one of the following ['inform', 'request', 'give_opinion',\n'confirm', 'verify_attribute', 'suggest', 'request_explanation',\n'recommend', 'request_attribute'].\n\nThe attributes must be one of the following:\n['name', 'exp_release_date', 'release_year', 'developer', 'esrb', 'rating',\n'genres', 'player_perspective', 'has_multiplayer', 'platforms',\n'available_on_steam', 'has_linux_release', 'has_mac_release', 'specifier'] [/system] [user] Here is the target sentence:\nBioShock is a good role-playing, action-adventure, shooter that released for PlayStation, Xbox, and PC in 2007. It is available on Steam, and it has a Mac release but not a Linux release. [/user] [assistant]", # noqa: E501
+ ]
+
+ llm = vllm.LLM(
+ MODEL_PATH,
+ enable_lora=True,
+ max_num_seqs=16,
+ max_loras=4,
+ distributed_executor_backend="ray",
+ tensor_parallel_size=tp_size,
+ )
expected_lora_output = [
"give_opinion(name[SpellForce 3], release_year[2017], developer[Grimlore Games], rating[poor])", # noqa: E501
"give_opinion(name[SpellForce 3], developer[Grimlore Games], release_year[2017], rating[poor])", # noqa: E501
"inform(name[BioShock], release_year[2007], rating[good], genres[action-adventure, role-playing, shooter], platforms[PlayStation, Xbox, PC], available_on_steam[yes], has_linux_release[no], has_mac_release[yes])", # noqa: E501
]
- assert do_sample(llm, mixtral_lora_files,
- lora_id=1) == expected_lora_output
- assert do_sample(llm, mixtral_lora_files,
- lora_id=2) == expected_lora_output
+ assert do_sample(llm, mixtral_lora_files, lora_id=1,
+ prompts=prompts) == expected_lora_output
+ assert do_sample(llm, mixtral_lora_files, lora_id=2,
+ prompts=prompts) == expected_lora_output
+
+
+@pytest.mark.parametrize("tp_size", [4])
+def test_mixtral_lora_all_target_modules(mixtral_lora_files_all_target_modules,
+ tp_size):
+ """This LoRA model has all supported Mixtral target modules"""
+
+ if torch.cuda.device_count() < tp_size:
+ pytest.skip(f"Not enough GPUs for tensor parallelism {tp_size}")
+
+ prompts = [
+ "Question: What happens if you touch the eyes of a blind man?\nChoices:\nB: The man's sight will be restored.\nA: Nothing happens if you touch the eyes of a blind man.\nC: The man will get his eyesight back.\nD: The man will no longer be blind.\nAnswer:", # noqa: E501
+ "Question: Which will increase the energy of motion of water molecules?\nChoices:\nA: add heat\nB: add ice\nC: remove some of the water\nD: place the water in the freezer\nAnswer:", # noqa: E501
+ "Since Craig threw aluminum cans in the trash and Benjamin recycled, _ was environmentally irresponsible.\nChoices:\n1: Craig\n2: Benjamin\nAnswer:", # noqa: E501
+ ]
+
+ llm = vllm.LLM(
+ MODEL_PATH,
+ enable_lora=True,
+ max_num_seqs=16,
+ max_loras=4,
+ distributed_executor_backend="ray",
+ tensor_parallel_size=tp_size,
+ max_lora_rank=32,
+ )
+
+ expected_lora_output = [
+ "A: Nothing happens if you touch the eyes of a blind man.",
+ "A: add heat",
+ "1: Craig",
+ ]
+
+ assert do_sample(llm,
+ mixtral_lora_files_all_target_modules,
+ lora_id=1,
+ prompts=prompts) == expected_lora_output
+ assert do_sample(llm,
+ mixtral_lora_files_all_target_modules,
+ lora_id=2,
+ prompts=prompts) == expected_lora_output
diff --git a/tests/models/decoder_only/language/test_gguf.py b/tests/models/decoder_only/language/test_gguf.py
index 8fc64a10c84af..5dc83942632fd 100644
--- a/tests/models/decoder_only/language/test_gguf.py
+++ b/tests/models/decoder_only/language/test_gguf.py
@@ -19,12 +19,12 @@
# FIXME: Move this to conftest
MODELS = [
- ("TinyLlama/TinyLlama-1.1B-Chat-v1.0",
- hf_hub_download("TheBloke/TinyLlama-1.1B-Chat-v1.0-GGUF",
- filename="tinyllama-1.1b-chat-v1.0.Q4_K_M.gguf")),
- ("TinyLlama/TinyLlama-1.1B-Chat-v1.0",
- hf_hub_download("duyntnet/TinyLlama-1.1B-Chat-v1.0-imatrix-GGUF",
- filename="TinyLlama-1.1B-Chat-v1.0-IQ4_XS.gguf")),
+ ("meta-llama/Llama-3.2-1B-Instruct",
+ hf_hub_download("bartowski/Llama-3.2-1B-Instruct-GGUF",
+ filename="Llama-3.2-1B-Instruct-Q4_K_M.gguf")),
+ ("meta-llama/Llama-3.2-1B-Instruct",
+ hf_hub_download("bartowski/Llama-3.2-1B-Instruct-GGUF",
+ filename="Llama-3.2-1B-Instruct-IQ4_XS.gguf")),
("Qwen/Qwen2-1.5B-Instruct",
hf_hub_download("Qwen/Qwen2-1.5B-Instruct-GGUF",
filename="qwen2-1_5b-instruct-q4_k_m.gguf")),
diff --git a/tests/models/decoder_only/vision_language/test_internvl.py b/tests/models/decoder_only/vision_language/test_internvl.py
index a756f8214edee..49cab75d8ea53 100644
--- a/tests/models/decoder_only/vision_language/test_internvl.py
+++ b/tests/models/decoder_only/vision_language/test_internvl.py
@@ -97,7 +97,8 @@ def __init__(self, hf_runner: HfRunner):
self.tokenizer = hf_runner.tokenizer
self.dtype = hf_runner.model.dtype
- self.config = AutoConfig.from_pretrained(hf_runner.model_name)
+ self.config = AutoConfig.from_pretrained(hf_runner.model_name,
+ trust_remote_code=True)
self.vision_config = self.config.vision_config
self.use_thumbnail = self.config.use_thumbnail
self.min_num = self.config.min_dynamic_patch
diff --git a/tests/models/embedding/language/test_embedding.py b/tests/models/embedding/language/test_embedding.py
index 6556998b68a74..be316c6e12da1 100644
--- a/tests/models/embedding/language/test_embedding.py
+++ b/tests/models/embedding/language/test_embedding.py
@@ -1,6 +1,6 @@
"""Compare the outputs of HF and vLLM for Mistral models using greedy sampling.
-Run `pytest tests/models/test_llama_embedding.py`.
+Run `pytest tests/models/embedding/language/test_embedding.py`.
"""
import pytest
import torch
@@ -8,6 +8,7 @@
MODELS = [
"intfloat/e5-mistral-7b-instruct",
+ "BAAI/bge-multilingual-gemma2",
]
@@ -28,6 +29,14 @@ def test_models(
model: str,
dtype: str,
) -> None:
+ # The example_prompts end with "\n", for example:
+ # "Write a short story about a robot that dreams for the first time.\n"
+ # sentence_transformers strips its input texts, see:
+ # https://github.com/UKPLab/sentence-transformers/blob/v3.1.1/sentence_transformers/models/Transformer.py#L159
+ # This makes the input_ids differ between hf_model and vllm_model.
+ # So we strip the input texts here to avoid the test failing.
+ example_prompts = [str(s).strip() for s in example_prompts]
+
with hf_runner(model, dtype=dtype, is_embedding_model=True) as hf_model:
hf_outputs = hf_model.encode(example_prompts)
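The stripping workaround explained in the comment above can be seen without loading any model: a trailing newline is enough to make the two backends tokenize different strings, so the prompts are normalized before the comparison. A trivial sketch:

```
prompt = "Write a short story about a robot that dreams for the first time.\n"

hf_side = prompt.strip()   # sentence_transformers strips its inputs internally
vllm_side = prompt         # vLLM would tokenize the text as given

assert hf_side != vllm_side
assert hf_side == vllm_side.strip()   # stripping both sides restores agreement
```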
diff --git a/tests/models/encoder_decoder/language/test_bart.py b/tests/models/encoder_decoder/language/test_bart.py
index 758a9b743b397..8e8862fadbf04 100644
--- a/tests/models/encoder_decoder/language/test_bart.py
+++ b/tests/models/encoder_decoder/language/test_bart.py
@@ -4,220 +4,214 @@
"""
from typing import List, Optional, Tuple, Type
-from vllm.utils import is_cpu
-
-if not is_cpu():
- # CPU backend is not currently supported with encoder/decoder models
- # skip test definitions entirely to avoid importing GPU kernel libs
- # (xFormers, etc.)
-
- import pytest
- from transformers import AutoModelForSeq2SeqLM
-
- from vllm.sequence import SampleLogprobs
-
- from ....conftest import (DecoderPromptType, ExplicitEncoderDecoderPrompt,
- HfRunner, VllmRunner)
- from ....utils import multi_gpu_test
- from ...utils import check_logprobs_close
-
- MODELS = ["facebook/bart-base", "facebook/bart-large-cnn"]
-
- def vllm_to_hf_output(
- vllm_output: Tuple[List[int], str, Optional[SampleLogprobs]],
- decoder_prompt_type: DecoderPromptType,
- ):
- """Sanitize vllm output to be comparable with hf output."""
- output_ids, output_str, out_logprobs = vllm_output
-
- hf_output_str = output_str + "</s>"
- if decoder_prompt_type == DecoderPromptType.NONE:
- hf_output_str = "<s>" + hf_output_str
-
- return output_ids, hf_output_str, out_logprobs
-
- def run_test(
- hf_runner: Type[HfRunner],
- vllm_runner: Type[VllmRunner],
- prompts: List[ExplicitEncoderDecoderPrompt[str, str]],
- decoder_prompt_type: DecoderPromptType,
- model: str,
- *,
- dtype: str,
- max_tokens: int,
- num_logprobs: int,
- tensor_parallel_size: int,
- distributed_executor_backend: Optional[str] = None,
- ) -> None:
- '''
- Test the vLLM BART model for a variety of encoder/decoder input prompts,
- by validating it against HuggingFace (HF) BART.
-
- Arguments:
-
- * hf_runner: HuggingFace (HF) test model runner
- * vllm_runner: vLLM test model runner
- * example_encoder_decoder_prompts: test fixture which provides a
- dictionary of dummy prompts
- * model: the HF ID of the specific BART variant under test
- * dtype: the tensor datatype to employ
- * max_tokens
- * num_logprobs
- * decoder_prompt_type: key into the example_encoder_decoder_prompts
- dictionary; selects specific encoder/decoder
- prompt scenarios to test
-
- A note on using HF BART as a baseline for validating vLLM BART,
- specifically when the decoder prompt is None.
-
- The HF GenerationMixin's default behavior is to force the first
- decoded token to be <BOS> if the prompt does not already contain
- <BOS> (this is accomplished using a logit
- processor setting.)
-
- So when we use HF BART as our baseline for comparison, note that
- when the user provides a request with a None decoder prompt
- (i.e. a singleton encoder prompt, or else an explicit encoder/
- decoder prompt with the decoder sub-prompt set to None), HF and
- vLLM handle this in different ways:
-
- * HF will (1) tokenize the None prompt as an empty token-list,
- (2) append <decoder-start-token> to the beginning, yielding
- [<decoder-start-token>], (3) pass this token list to the model, and
- then (4) after computing logits during prefill, override the model
- logits & force <BOS> to be the first generated token.
-
- * vLLM will (1) tokenize the None prompt as [<BOS>], (2) append decoder-
- start-token to the beginning, yielding [<decoder-start-token>, <BOS>],
- (3) pass these tokens to the model & proceed with generation.
-
- The net effect is that compared to vLLM, the list of HF *decoded* tokens
- will contain one more initial <BOS> than the vLLM generated tokens,
- because vLLM's <BOS> token is injected into the prompt rather than into
- the generated output. This is in spite of the fact that overall, the
- complete sequences (prompt + decoded tokens) produced by vLLM will match
- HF.
-
- So when we use HF decoded token output to validate vLLM's decoded token
- output, the testing process must account for the difference in decoded
- token sequences between vLLM and HF specifically in the
- decoder-prompt-is-None case.
-
- One option is to disable the logit processor feature that forces the
- <BOS> token to be decoded (forced_bos_token_id = None), eliminating
- the problem entirely. However this is not "normal" BART usage.
-
- The other option is - only in the decoder-prompt-is-None case - to
- discard the first decoded token from the HF output before comparing it
- to vLLM.
-
- To that end, when testing the scenario where the decoder prompt is None
- (and only in that one scenario), this test skips the first HF decoded
- token during the process of validating the vLLM decoded output.
- '''
-
- # NOTE: take care of the order. run vLLM first, and then run HF.
- # vLLM needs a fresh new process without cuda initialization.
- # if we run HF first, the cuda initialization will be done and it
- # will hurt multiprocessing backend with fork method (the default).
-
- # Note: currently encoder/decoder models are only compatible with
- # enforce_eager=True. Normally this is not a problem because
- # for encoder/decoder models vLLM will
- # default to enforce_eager=True if enforce_eager
- # is left unspecified. However, the
- # VllmRunner test fixture (which wraps around the LLM class) defaults to
- # enforce_eager=False (a behavior which a number of already-existing
- # decoder-only unit tests expect), so when testing an encoder/decoder
- # model we must explicitly specify enforce_eager=True in the VllmRunner
- # constructor.
- with vllm_runner(
- model,
- dtype=dtype,
- tensor_parallel_size=tensor_parallel_size,
- distributed_executor_backend=distributed_executor_backend,
- enforce_eager=True) as vllm_model:
- vllm_outputs = vllm_model.generate_encoder_decoder_greedy_logprobs(
- prompts, max_tokens, num_logprobs)
-
- # Configuration settings for HF baseline
- hf_kwargs = {
- "top_k": None,
- "num_beams": 1,
- "repetition_penalty": 1.0,
- "top_p": 1.0,
- "length_penalty": 1.0,
- "early_stopping": False,
- "no_repeat_ngram_size": None,
- "min_length": 0
- }
-
- with hf_runner(model, dtype=dtype,
- auto_cls=AutoModelForSeq2SeqLM) as hf_model:
- hf_outputs = (
- hf_model.generate_encoder_decoder_greedy_logprobs_limit(
- prompts,
- max_tokens,
- num_logprobs,
- **hf_kwargs,
- ))
-
- hf_skip_tokens = (1 if decoder_prompt_type == DecoderPromptType.NONE
- else 0)
-
- check_logprobs_close(
- outputs_0_lst=hf_outputs,
- outputs_1_lst=[
- vllm_to_hf_output(vllm_output, decoder_prompt_type)
- for vllm_output in vllm_outputs
- ],
- name_0="hf",
- name_1="vllm",
- num_outputs_0_skip_tokens=hf_skip_tokens,
- )
-
- @pytest.mark.parametrize("model", MODELS)
- @pytest.mark.parametrize("dtype", ["float", "bfloat16"])
- @pytest.mark.parametrize("max_tokens", [64])
- @pytest.mark.parametrize("num_logprobs", [5])
- @pytest.mark.parametrize("decoder_prompt_type", list(DecoderPromptType))
- def test_models(hf_runner, vllm_runner, example_encoder_decoder_prompts,
- model, dtype, max_tokens, num_logprobs,
- decoder_prompt_type) -> None:
-
- run_test(
- hf_runner,
- vllm_runner,
- example_encoder_decoder_prompts[decoder_prompt_type],
- decoder_prompt_type,
- model,
- dtype=dtype,
- max_tokens=max_tokens,
- num_logprobs=num_logprobs,
- tensor_parallel_size=1,
- )
-
- @multi_gpu_test(num_gpus=2)
- @pytest.mark.parametrize("distributed_executor_backend", ["ray", "mp"])
- @pytest.mark.parametrize("model", ["facebook/bart-large-cnn"])
- @pytest.mark.parametrize("dtype", ["float"])
- @pytest.mark.parametrize("max_tokens", [64])
- @pytest.mark.parametrize("num_logprobs", [5])
- @pytest.mark.parametrize("decoder_prompt_type", [DecoderPromptType.CUSTOM])
- def test_models_distributed(hf_runner, vllm_runner,
- example_encoder_decoder_prompts,
- distributed_executor_backend, model, dtype,
- max_tokens, num_logprobs,
- decoder_prompt_type) -> None:
- run_test(
- hf_runner,
- vllm_runner,
- example_encoder_decoder_prompts[decoder_prompt_type],
- decoder_prompt_type,
- model,
- dtype=dtype,
- max_tokens=max_tokens,
- num_logprobs=num_logprobs,
- tensor_parallel_size=2,
- distributed_executor_backend=distributed_executor_backend,
- )
+import pytest
+from transformers import AutoModelForSeq2SeqLM
+
+from vllm.sequence import SampleLogprobs
+
+from ....conftest import (DecoderPromptType, ExplicitEncoderDecoderPrompt,
+ HfRunner, VllmRunner)
+from ....utils import multi_gpu_test
+from ...utils import check_logprobs_close
+
+MODELS = ["facebook/bart-base", "facebook/bart-large-cnn"]
+
+
+def vllm_to_hf_output(
+ vllm_output: Tuple[List[int], str, Optional[SampleLogprobs]],
+ decoder_prompt_type: DecoderPromptType,
+):
+ """Sanitize vllm output to be comparable with hf output."""
+ output_ids, output_str, out_logprobs = vllm_output
+
+ hf_output_str = output_str + "</s>"
+ if decoder_prompt_type == DecoderPromptType.NONE:
+ hf_output_str = "<s>" + hf_output_str
+
+ return output_ids, hf_output_str, out_logprobs
+
+
+def run_test(
+ hf_runner: Type[HfRunner],
+ vllm_runner: Type[VllmRunner],
+ prompts: List[ExplicitEncoderDecoderPrompt[str, str]],
+ decoder_prompt_type: DecoderPromptType,
+ model: str,
+ *,
+ dtype: str,
+ max_tokens: int,
+ num_logprobs: int,
+ tensor_parallel_size: int,
+ distributed_executor_backend: Optional[str] = None,
+) -> None:
+ '''
+ Test the vLLM BART model for a variety of encoder/decoder input prompts,
+ by validating it against HuggingFace (HF) BART.
+
+ Arguments:
+
+ * hf_runner: HuggingFace (HF) test model runner
+ * vllm_runner: vLLM test model runner
+ * example_encoder_decoder_prompts: test fixture which provides a
+ dictionary of dummy prompts
+ * model: the HF ID of the specific BART variant under test
+ * dtype: the tensor datatype to employ
+ * max_tokens
+ * num_logprobs
+ * decoder_prompt_type: key into the example_encoder_decoder_prompts
+ dictionary; selects specific encoder/decoder
+ prompt scenarios to test
+
+ A note on using HF BART as a baseline for validating vLLM BART,
+ specifically when the decoder prompt is None.
+
+ The HF GenerationMixin's default behavior is to force the first
+ decoded token to be <BOS> if the prompt does not already contain
+ <BOS> (this is accomplished using a logit
+ processor setting.)
+
+ So when we use HF BART as our baseline for comparison, note that
+ when the user provides a request with a None decoder prompt
+ (i.e. a singleton encoder prompt, or else an explicit encoder/
+ decoder prompt with the decoder sub-prompt set to None), HF and
+ vLLM handle this in different ways:
+
+ * HF will (1) tokenize the None prompt as an empty token-list,
+ (2) append <decoder-start-token> to the beginning, yielding
+ [<decoder-start-token>], (3) pass this token list to the model, and
+ then (4) after computing logits during prefill, override the model
+ logits & force <BOS> to be the first generated token.
+
+ * vLLM will (1) tokenize the None prompt as [<BOS>], (2) append decoder-
+ start-token to the beginning, yielding [<decoder-start-token>, <BOS>],
+ (3) pass these tokens to the model & proceed with generation.
+
+ The net effect is that compared to vLLM, the list of HF *decoded* tokens
+ will contain one more initial <BOS> than the vLLM generated tokens,
+ because vLLM's <BOS> token is injected into the prompt rather than into
+ the generated output. This is in spite of the fact that overall, the
+ complete sequences (prompt + decoded tokens) produced by vLLM will match
+ HF.
+
+ So when we use HF decoded token output to validate vLLM's decoded token
+ output, the testing process must account for the difference in decoded
+ token sequences between vLLM and HF specifically in the
+ decoder-prompt-is-None case.
+
+ One option is to disable the logit processor feature that forces the
+ <BOS> token to be decoded (forced_bos_token_id = None), eliminating
+ the problem entirely. However this is not "normal" BART usage.
+
+ The other option is - only in the decoder-prompt-is-None case - to
+ discard the first decoded token from the HF output before comparing it
+ to vLLM.
+
+ To that end, when testing the scenario where the decoder prompt is None
+ (and only in that one scenario), this test skips the first HF decoded
+ token during the process of validating the vLLM decoded output.
+ '''
+
+ # NOTE: take care of the order. run vLLM first, and then run HF.
+ # vLLM needs a fresh new process without cuda initialization.
+ # if we run HF first, the cuda initialization will be done and it
+ # will hurt multiprocessing backend with fork method (the default).
+
+ # Note: currently encoder/decoder models are only compatible with
+ # enforce_eager=True. Normally this is not a problem because
+ # for encoder/decoder models vLLM will
+ # default to enforce_eager=True if enforce_eager
+ # is left unspecified. However, the
+ # VllmRunner test fixture (which wraps around the LLM class) defaults to
+ # enforce_eager=False (a behavior which a number of already-existing
+ # decoder-only unit tests expect), so when testing an encoder/decoder
+ # model we must explicitly specify enforce_eager=True in the VllmRunner
+ # constructor.
+ with vllm_runner(model,
+ dtype=dtype,
+ tensor_parallel_size=tensor_parallel_size,
+ distributed_executor_backend=distributed_executor_backend,
+ enforce_eager=True) as vllm_model:
+ vllm_outputs = vllm_model.generate_encoder_decoder_greedy_logprobs(
+ prompts, max_tokens, num_logprobs)
+
+ # Configuration settings for HF baseline
+ hf_kwargs = {
+ "top_k": None,
+ "num_beams": 1,
+ "repetition_penalty": 1.0,
+ "top_p": 1.0,
+ "length_penalty": 1.0,
+ "early_stopping": False,
+ "no_repeat_ngram_size": None,
+ "min_length": 0
+ }
+
+ with hf_runner(model, dtype=dtype,
+ auto_cls=AutoModelForSeq2SeqLM) as hf_model:
+ hf_outputs = (hf_model.generate_encoder_decoder_greedy_logprobs_limit(
+ prompts,
+ max_tokens,
+ num_logprobs,
+ **hf_kwargs,
+ ))
+
+ hf_skip_tokens = (1
+ if decoder_prompt_type == DecoderPromptType.NONE else 0)
+
+ check_logprobs_close(
+ outputs_0_lst=hf_outputs,
+ outputs_1_lst=[
+ vllm_to_hf_output(vllm_output, decoder_prompt_type)
+ for vllm_output in vllm_outputs
+ ],
+ name_0="hf",
+ name_1="vllm",
+ num_outputs_0_skip_tokens=hf_skip_tokens,
+ )
+
+
+@pytest.mark.parametrize("model", MODELS)
+@pytest.mark.parametrize("dtype", ["float", "bfloat16"])
+@pytest.mark.parametrize("max_tokens", [64])
+@pytest.mark.parametrize("num_logprobs", [5])
+@pytest.mark.parametrize("decoder_prompt_type", list(DecoderPromptType))
+def test_models(hf_runner, vllm_runner, example_encoder_decoder_prompts, model,
+ dtype, max_tokens, num_logprobs, decoder_prompt_type) -> None:
+
+ run_test(
+ hf_runner,
+ vllm_runner,
+ example_encoder_decoder_prompts[decoder_prompt_type],
+ decoder_prompt_type,
+ model,
+ dtype=dtype,
+ max_tokens=max_tokens,
+ num_logprobs=num_logprobs,
+ tensor_parallel_size=1,
+ )
+
+
+@multi_gpu_test(num_gpus=2)
+@pytest.mark.parametrize("distributed_executor_backend", ["ray", "mp"])
+@pytest.mark.parametrize("model", ["facebook/bart-large-cnn"])
+@pytest.mark.parametrize("dtype", ["float"])
+@pytest.mark.parametrize("max_tokens", [64])
+@pytest.mark.parametrize("num_logprobs", [5])
+@pytest.mark.parametrize("decoder_prompt_type", [DecoderPromptType.CUSTOM])
+def test_models_distributed(hf_runner, vllm_runner,
+ example_encoder_decoder_prompts,
+ distributed_executor_backend, model, dtype,
+ max_tokens, num_logprobs,
+ decoder_prompt_type) -> None:
+ run_test(
+ hf_runner,
+ vllm_runner,
+ example_encoder_decoder_prompts[decoder_prompt_type],
+ decoder_prompt_type,
+ model,
+ dtype=dtype,
+ max_tokens=max_tokens,
+ num_logprobs=num_logprobs,
+ tensor_parallel_size=2,
+ distributed_executor_backend=distributed_executor_backend,
+ )
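To make the docstring's <BOS> bookkeeping concrete: with a None decoder prompt, HF reports the forced <BOS> as its first generated token while vLLM places it in the prompt, so the comparison skips exactly one leading HF token. A sketch with made-up token IDs:

```
# Hypothetical token IDs, purely to illustrate the offset that
# num_outputs_0_skip_tokens compensates for in the decoder-prompt-is-None case.
DECODER_START, BOS = 2, 0

hf_generated = [BOS, 101, 102, 103]   # HF forces <BOS> as its first *output*
vllm_prompt = [DECODER_START, BOS]    # vLLM puts <BOS> into the prompt instead,
vllm_generated = [101, 102, 103]      # so its outputs start one token later

hf_skip_tokens = 1                    # DecoderPromptType.NONE case
assert hf_generated[hf_skip_tokens:] == vllm_generated
assert vllm_prompt[-1] == hf_generated[0] == BOS
```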
diff --git a/tests/models/encoder_decoder/vision_language/test_mllama.py b/tests/models/encoder_decoder/vision_language/test_mllama.py
index 254185537e403..78a5c8158e16e 100644
--- a/tests/models/encoder_decoder/vision_language/test_mllama.py
+++ b/tests/models/encoder_decoder/vision_language/test_mllama.py
@@ -195,11 +195,6 @@ def _run_test(
def process(hf_inputs: BatchEncoding):
return hf_inputs
- from transformers.models.mllama import MllamaConfig as MllamaConfigHf
-
- # use transformer's MllamaConfig for hf_runner
- # and vllm's MllamaConfig for vllm_runner
- AutoConfig.register("mllama", MllamaConfigHf, exist_ok=True)
with hf_runner(model,
dtype=dtype,
model_kwargs={"device_map": "auto"},
@@ -213,8 +208,6 @@ def process(hf_inputs: BatchEncoding):
for prompts, images in inputs
]
- from vllm.transformers_utils.configs.mllama import MllamaConfig
- AutoConfig.register("mllama", MllamaConfig, exist_ok=True)
for hf_outputs, vllm_outputs in zip(hf_outputs_per_image,
vllm_outputs_per_image):
check_logprobs_close(
diff --git a/tests/models/test_oot_registration.py b/tests/models/test_oot_registration.py
index 5cb82a5ac4c7d..94be215258f89 100644
--- a/tests/models/test_oot_registration.py
+++ b/tests/models/test_oot_registration.py
@@ -2,7 +2,8 @@
import pytest
-from vllm import LLM, SamplingParams
+from vllm import LLM, PoolingParams, SamplingParams
+from vllm.assets.image import ImageAsset
from ..utils import fork_new_process_for_each_test
@@ -16,7 +17,7 @@ def test_plugin(dummy_opt_path):
@fork_new_process_for_each_test
-def test_oot_registration(dummy_opt_path):
+def test_oot_registration_text_generation(dummy_opt_path):
os.environ["VLLM_PLUGINS"] = "register_dummy_model"
prompts = ["Hello, my name is", "The text does not matter"]
sampling_params = SamplingParams(temperature=0)
@@ -29,3 +30,52 @@ def test_oot_registration(dummy_opt_path):
# make sure only the first token is generated
rest = generated_text.replace(first_token, "")
assert rest == ""
+
+
+@fork_new_process_for_each_test
+def test_oot_registration_embedding(dummy_gemma2_embedding_path):
+ os.environ["VLLM_PLUGINS"] = "register_dummy_model"
+ prompts = ["Hello, my name is", "The text does not matter"]
+ sampling_params = PoolingParams()
+ llm = LLM(model=dummy_gemma2_embedding_path, load_format="dummy")
+ outputs = llm.encode(prompts, sampling_params)
+
+ for output in outputs:
+ assert all(v == 0 for v in output.outputs.embedding)
+
+
+image = ImageAsset("cherry_blossom").pil_image.convert("RGB")
+
+
+@fork_new_process_for_each_test
+def test_oot_registration_multimodal(dummy_llava_path):
+ os.environ["VLLM_PLUGINS"] = "register_dummy_model"
+ prompts = [{
+ "prompt": "What's in the image?",
+ "multi_modal_data": {
+ "image": image
+ },
+ }, {
+ "prompt": "Describe the image",
+ "multi_modal_data": {
+ "image": image
+ },
+ }]
+
+ sampling_params = SamplingParams(temperature=0)
+ llm = LLM(model=dummy_llava_path,
+ load_format="dummy",
+ max_num_seqs=1,
+ trust_remote_code=True,
+ gpu_memory_utilization=0.98,
+ max_model_len=4096,
+ enforce_eager=True,
+ limit_mm_per_prompt={"image": 1})
+ first_token = llm.get_tokenizer().decode(0)
+ outputs = llm.generate(prompts, sampling_params)
+
+ for output in outputs:
+ generated_text = output.outputs[0].text
+ # make sure only the first token is generated
+ rest = generated_text.replace(first_token, "")
+ assert rest == ""
diff --git a/tests/models/test_registry.py b/tests/models/test_registry.py
index ee5c9e8ccb196..a2194fa15f90e 100644
--- a/tests/models/test_registry.py
+++ b/tests/models/test_registry.py
@@ -3,16 +3,36 @@
import pytest
import torch.cuda
-from vllm.model_executor.models import _MODELS, ModelRegistry
+from vllm.model_executor.models import (is_embedding_model,
+ is_text_generation_model,
+ supports_multimodal)
+from vllm.model_executor.models.registry import (_EMBEDDING_MODELS,
+ _MULTIMODAL_MODELS,
+ _SPECULATIVE_DECODING_MODELS,
+ _TEXT_GENERATION_MODELS,
+ ModelRegistry)
from vllm.platforms import current_platform
from ..utils import fork_new_process_for_each_test
-@pytest.mark.parametrize("model_arch", _MODELS)
+@pytest.mark.parametrize("model_arch", ModelRegistry.get_supported_archs())
def test_registry_imports(model_arch):
# Ensure all model classes can be imported successfully
- ModelRegistry.resolve_model_cls(model_arch)
+ model_cls, _ = ModelRegistry.resolve_model_cls(model_arch)
+
+ if model_arch in _SPECULATIVE_DECODING_MODELS:
+ pass # Ignore these models which do not have a unified format
+ else:
+ assert is_text_generation_model(model_cls) is (
+ model_arch in _TEXT_GENERATION_MODELS
+ or model_arch in _MULTIMODAL_MODELS)
+
+ assert is_embedding_model(model_cls) is (model_arch
+ in _EMBEDDING_MODELS)
+
+ assert supports_multimodal(model_cls) is (model_arch
+ in _MULTIMODAL_MODELS)
@fork_new_process_for_each_test
diff --git a/tests/plugins/vllm_add_dummy_model/vllm_add_dummy_model/__init__.py b/tests/plugins/vllm_add_dummy_model/vllm_add_dummy_model/__init__.py
index dcc0305e657ab..62a8f871fa51b 100644
--- a/tests/plugins/vllm_add_dummy_model/vllm_add_dummy_model/__init__.py
+++ b/tests/plugins/vllm_add_dummy_model/vllm_add_dummy_model/__init__.py
@@ -1,26 +1,20 @@
-from typing import Optional
-
-import torch
-
from vllm import ModelRegistry
-from vllm.model_executor.models.opt import OPTForCausalLM
-from vllm.model_executor.sampling_metadata import SamplingMetadata
-
-
-class MyOPTForCausalLM(OPTForCausalLM):
-
- def compute_logits(
- self, hidden_states: torch.Tensor,
- sampling_metadata: SamplingMetadata) -> Optional[torch.Tensor]:
- # this dummy model always predicts the first token
- logits = super().compute_logits(hidden_states, sampling_metadata)
- if logits is not None:
- logits.zero_()
- logits[:, 0] += 1.0
- return logits
def register():
- # register our dummy model
+ # Test directly passing the model
+ from .my_opt import MyOPTForCausalLM
+
if "MyOPTForCausalLM" not in ModelRegistry.get_supported_archs():
ModelRegistry.register_model("MyOPTForCausalLM", MyOPTForCausalLM)
+
+ # Test passing lazy model
+ if "MyGemma2Embedding" not in ModelRegistry.get_supported_archs():
+ ModelRegistry.register_model(
+ "MyGemma2Embedding",
+ "vllm_add_dummy_model.my_gemma_embedding:MyGemma2Embedding",
+ )
+
+ if "MyLlava" not in ModelRegistry.get_supported_archs():
+ ModelRegistry.register_model("MyLlava",
+ "vllm_add_dummy_model.my_llava:MyLlava")
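The second and third registrations pass the model as a `"module:ClassName"` string instead of a class object, so the heavy model code is only imported when the architecture is actually requested. A minimal sketch of how such a lazy reference can be resolved (illustrative only; vLLM's registry has its own resolution logic):

```
import importlib

def resolve_lazy_class(path: str):
    # "package.module:ClassName" -> the class object, imported on first use
    module_name, _, class_name = path.partition(":")
    return getattr(importlib.import_module(module_name), class_name)

# quick self-check with a stdlib class
from collections import OrderedDict
assert resolve_lazy_class("collections:OrderedDict") is OrderedDict
```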
diff --git a/tests/plugins/vllm_add_dummy_model/vllm_add_dummy_model/my_gemma_embedding.py b/tests/plugins/vllm_add_dummy_model/vllm_add_dummy_model/my_gemma_embedding.py
new file mode 100644
index 0000000000000..1d61f6b74f520
--- /dev/null
+++ b/tests/plugins/vllm_add_dummy_model/vllm_add_dummy_model/my_gemma_embedding.py
@@ -0,0 +1,34 @@
+from typing import List, Optional, Union
+
+import torch
+
+from vllm.attention import AttentionMetadata
+from vllm.model_executor.models.gemma2_embedding import Gemma2EmbeddingModel
+from vllm.sequence import IntermediateTensors
+
+
+class MyGemma2Embedding(Gemma2EmbeddingModel):
+
+ def forward(
+ self,
+ input_ids: torch.Tensor,
+ positions: torch.Tensor,
+ kv_caches: List[torch.Tensor],
+ attn_metadata: AttentionMetadata,
+ intermediate_tensors: Optional[IntermediateTensors] = None,
+ inputs_embeds: Optional[torch.Tensor] = None,
+ ) -> Union[torch.Tensor, IntermediateTensors]:
+ hidden_states = super().forward(
+ input_ids,
+ positions,
+ kv_caches,
+ attn_metadata,
+ intermediate_tensors=intermediate_tensors,
+ inputs_embeds=inputs_embeds,
+ )
+
+ if isinstance(hidden_states, IntermediateTensors):
+ return hidden_states
+
+ # Return all-zero embeddings
+ return torch.zeros_like(hidden_states)
diff --git a/tests/plugins/vllm_add_dummy_model/vllm_add_dummy_model/my_llava.py b/tests/plugins/vllm_add_dummy_model/vllm_add_dummy_model/my_llava.py
new file mode 100644
index 0000000000000..3ebd7864b8fc8
--- /dev/null
+++ b/tests/plugins/vllm_add_dummy_model/vllm_add_dummy_model/my_llava.py
@@ -0,0 +1,28 @@
+from typing import Optional
+
+import torch
+
+from vllm.inputs import INPUT_REGISTRY
+from vllm.model_executor.models.llava import (LlavaForConditionalGeneration,
+ dummy_data_for_llava,
+ get_max_llava_image_tokens,
+ input_processor_for_llava)
+from vllm.model_executor.sampling_metadata import SamplingMetadata
+from vllm.multimodal import MULTIMODAL_REGISTRY
+
+
+@MULTIMODAL_REGISTRY.register_image_input_mapper()
+@MULTIMODAL_REGISTRY.register_max_image_tokens(get_max_llava_image_tokens)
+@INPUT_REGISTRY.register_dummy_data(dummy_data_for_llava)
+@INPUT_REGISTRY.register_input_processor(input_processor_for_llava)
+class MyLlava(LlavaForConditionalGeneration):
+
+ def compute_logits(
+ self, hidden_states: torch.Tensor,
+ sampling_metadata: SamplingMetadata) -> Optional[torch.Tensor]:
+ # this dummy model always predicts the first token
+ logits = super().compute_logits(hidden_states, sampling_metadata)
+ if logits is not None:
+ logits.zero_()
+ logits[:, 0] += 1.0
+ return logits
diff --git a/tests/plugins/vllm_add_dummy_model/vllm_add_dummy_model/my_opt.py b/tests/plugins/vllm_add_dummy_model/vllm_add_dummy_model/my_opt.py
new file mode 100644
index 0000000000000..569ef216c9f0a
--- /dev/null
+++ b/tests/plugins/vllm_add_dummy_model/vllm_add_dummy_model/my_opt.py
@@ -0,0 +1,19 @@
+from typing import Optional
+
+import torch
+
+from vllm.model_executor.models.opt import OPTForCausalLM
+from vllm.model_executor.sampling_metadata import SamplingMetadata
+
+
+class MyOPTForCausalLM(OPTForCausalLM):
+
+ def compute_logits(
+ self, hidden_states: torch.Tensor,
+ sampling_metadata: SamplingMetadata) -> Optional[torch.Tensor]:
+ # this dummy model always predicts the first token
+ logits = super().compute_logits(hidden_states, sampling_metadata)
+ if logits is not None:
+ logits.zero_()
+ logits[:, 0] += 1.0
+ return logits
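`MyOPTForCausalLM` above and `MyLlava` earlier in this diff use the same trick: zero the logits and bump index 0 so that greedy decoding always emits token ID 0, which is what lets `test_oot_registration_*` decode token 0 once and assert that nothing else was generated. The manipulation in isolation:

```
import torch

vocab_size = 8
logits = torch.randn(3, vocab_size)   # pretend logits for 3 positions

# same manipulation as compute_logits above
logits.zero_()
logits[:, 0] += 1.0

# greedy decoding now always picks token ID 0
assert torch.argmax(logits, dim=-1).tolist() == [0, 0, 0]
```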
diff --git a/tests/samplers/test_beam_search.py b/tests/samplers/test_beam_search.py
index a9bedc2956fdd..4d1a6978d4c55 100644
--- a/tests/samplers/test_beam_search.py
+++ b/tests/samplers/test_beam_search.py
@@ -33,8 +33,8 @@ def test_beam_search_single_input(
max_tokens)
with vllm_runner(model, dtype=dtype) as vllm_model:
- vllm_outputs = vllm_model.generate_beam_search_new(
- example_prompts, beam_width, max_tokens)
+ vllm_outputs = vllm_model.generate_beam_search(example_prompts,
+ beam_width, max_tokens)
for i in range(len(example_prompts)):
hf_output_ids, hf_output_texts = hf_outputs[i]
diff --git a/tests/samplers/test_sampler.py b/tests/samplers/test_sampler.py
index 9d4932dd1f5b1..28c34064f670c 100644
--- a/tests/samplers/test_sampler.py
+++ b/tests/samplers/test_sampler.py
@@ -159,26 +159,6 @@ def test_sampler_all_random_seed_deterministic(seed: int, device: str):
assert first_sampler_output == second_sampler_output
-@pytest.mark.parametrize("seed", RANDOM_SEEDS)
-@pytest.mark.parametrize("device", CUDA_DEVICES)
-def test_sampler_all_beam(seed: int, device: str):
- set_random_seed(seed)
- torch.set_default_device(device)
- batch_size = random.randint(1, 256)
- _, fake_logits, sampler = _prepare_test(batch_size)
-
- sampling_params = SamplingParams(
- temperature=0,
- best_of=2,
- use_beam_search=True,
- )
- _do_sample(batch_size, fake_logits, sampler, sampling_params, device)
- # no assertion here as I am not sure how to determine whether
- # the outputs are expected - in other words, this just tests
- # whether there are no exceptions in the sampler
- # when handling an all-beam search case.
-
-
@pytest.mark.parametrize("seed", RANDOM_SEEDS)
@pytest.mark.parametrize("device", CUDA_DEVICES)
def test_sampler_min_tokens_penalty(seed: int, device: str):
@@ -479,7 +459,7 @@ def test_sampler_mixed(seed: int, device: str):
seq_lens: List[int] = []
for i in range(batch_size):
expected: Optional[List[int]] = None
- sampling_type = random.randint(0, 3)
+ sampling_type = random.randint(0, 2)
if sampling_type == 0:
sampling_params = SamplingParams(temperature=0)
expected = [int(torch.argmax(fake_logits[i], dim=-1).item())]
@@ -498,10 +478,7 @@ def test_sampler_mixed(seed: int, device: str):
for idx in range(n):
fake_logits[i, i + idx] = 1e2
expected = list(range(i, i + n))
- else:
- sampling_params = SamplingParams(temperature=0,
- use_beam_search=True,
- best_of=2)
+
expected_tokens.append(expected)
seq_group_metadata_list.append(
SequenceGroupMetadata(
@@ -530,9 +507,6 @@ def test_sampling():
zip(sampler_output, seq_group_metadata_list)):
assert metadata.sampling_params is not None
- if metadata.sampling_params.use_beam_search:
- continue
-
if (metadata.sampling_params.seed is not None
and expected_tokens[i] is None):
# Record seeded random result to compare with results of
diff --git a/tests/test_utils.py b/tests/test_utils.py
index c7cb663068c0f..f3017a8582ea8 100644
--- a/tests/test_utils.py
+++ b/tests/test_utils.py
@@ -136,6 +136,8 @@ def parser():
def parser_with_config():
parser = FlexibleArgumentParser()
parser.add_argument('serve')
+ parser.add_argument('model_tag')
+ parser.add_argument('--served-model-name', type=str)
parser.add_argument('--config', type=str)
parser.add_argument('--port', type=int)
parser.add_argument('--tensor-parallel-size', type=int)
@@ -190,33 +192,47 @@ def test_missing_required_argument(parser):
def test_cli_override_to_config(parser_with_config):
args = parser_with_config.parse_args([
- 'serve', '--config', './data/test_config.yaml',
+ 'serve', 'mymodel', '--config', './data/test_config.yaml',
'--tensor-parallel-size', '3'
])
assert args.tensor_parallel_size == 3
args = parser_with_config.parse_args([
- 'serve', '--tensor-parallel-size', '3', '--config',
+ 'serve', 'mymodel', '--tensor-parallel-size', '3', '--config',
'./data/test_config.yaml'
])
assert args.tensor_parallel_size == 3
+ assert args.port == 12312
+ args = parser_with_config.parse_args([
+ 'serve', 'mymodel', '--tensor-parallel-size', '3', '--config',
+ './data/test_config.yaml', '--port', '666'
+ ])
+ assert args.tensor_parallel_size == 3
+ assert args.port == 666
def test_config_args(parser_with_config):
args = parser_with_config.parse_args(
- ['serve', '--config', './data/test_config.yaml'])
+ ['serve', 'mymodel', '--config', './data/test_config.yaml'])
assert args.tensor_parallel_size == 2
def test_config_file(parser_with_config):
with pytest.raises(FileNotFoundError):
- parser_with_config.parse_args(['serve', '--config', 'test_config.yml'])
+ parser_with_config.parse_args(
+ ['serve', 'mymodel', '--config', 'test_config.yml'])
with pytest.raises(ValueError):
parser_with_config.parse_args(
- ['serve', '--config', './data/test_config.json'])
+ ['serve', 'mymodel', '--config', './data/test_config.json'])
with pytest.raises(ValueError):
parser_with_config.parse_args([
- 'serve', '--tensor-parallel-size', '3', '--config', '--batch-size',
- '32'
+ 'serve', 'mymodel', '--tensor-parallel-size', '3', '--config',
+ '--batch-size', '32'
])
+
+
+def test_no_model_tag(parser_with_config):
+ with pytest.raises(ValueError):
+ parser_with_config.parse_args(
+ ['serve', '--config', './data/test_config.yaml'])
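The new assertions pin down the precedence the parser is expected to follow: values from `--config` act as defaults (hence `port == 12312` straight from the YAML), explicit CLI flags win over them (`--port 666`), and the positional `model_tag` is now required. A generic sketch of that merge order, independent of `FlexibleArgumentParser`'s actual implementation:

```
# Conceptual precedence only; FlexibleArgumentParser's real implementation
# lives in vllm.utils. Order: built-in defaults < config file < CLI flags.
def merge_args(defaults: dict, config: dict, cli: dict) -> dict:
    merged = dict(defaults)
    merged.update(config)   # config file overrides defaults
    merged.update(cli)      # explicit CLI flags override everything
    return merged

config = {"port": 12312, "tensor_parallel_size": 2}   # values the tests expect
cli = {"tensor_parallel_size": 3, "port": 666}

args = merge_args({"port": 8000}, config, cli)
assert args["tensor_parallel_size"] == 3 and args["port"] == 666

args = merge_args({"port": 8000}, config, {"tensor_parallel_size": 3})
assert args["port"] == 12312
```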
diff --git a/tests/utils.py b/tests/utils.py
index 8c8a7c4bf0c70..55c813728b1e0 100644
--- a/tests/utils.py
+++ b/tests/utils.py
@@ -8,13 +8,13 @@
import warnings
from contextlib import contextmanager
from pathlib import Path
-from typing import Any, Callable, Dict, List, Optional, Union
+from typing import Any, Callable, Dict, List, Literal, Optional, Union
import openai
import pytest
import requests
from openai.types.completion import Completion
-from typing_extensions import ParamSpec
+from typing_extensions import ParamSpec, assert_never
from tests.models.utils import TextTextLogprobs
from vllm.distributed import (ensure_model_parallel_initialized,
@@ -163,11 +163,140 @@ def get_async_client(self):
)
+def _test_completion(
+ client: openai.OpenAI,
+ model: str,
+ prompt: str,
+ token_ids: List[int],
+):
+ results = []
+
+ # test with text prompt
+ completion = client.completions.create(model=model,
+ prompt=prompt,
+ max_tokens=5,
+ temperature=0.0)
+
+ results.append({
+ "test": "single_completion",
+ "text": completion.choices[0].text,
+ "finish_reason": completion.choices[0].finish_reason,
+ "usage": completion.usage,
+ })
+
+ # test using token IDs
+ completion = client.completions.create(
+ model=model,
+ prompt=token_ids,
+ max_tokens=5,
+ temperature=0.0,
+ )
+
+ results.append({
+ "test": "token_ids",
+ "text": completion.choices[0].text,
+ "finish_reason": completion.choices[0].finish_reason,
+ "usage": completion.usage,
+ })
+
+ # test seeded random sampling
+ completion = client.completions.create(model=model,
+ prompt=prompt,
+ max_tokens=5,
+ seed=33,
+ temperature=1.0)
+
+ results.append({
+ "test": "seeded_sampling",
+ "text": completion.choices[0].text,
+ "finish_reason": completion.choices[0].finish_reason,
+ "usage": completion.usage,
+ })
+
+ # test seeded random sampling with multiple prompts
+ completion = client.completions.create(model=model,
+ prompt=[prompt, prompt],
+ max_tokens=5,
+ seed=33,
+ temperature=1.0)
+
+ results.append({
+ "test":
+ "seeded_sampling",
+ "text": [choice.text for choice in completion.choices],
+ "finish_reason":
+ [choice.finish_reason for choice in completion.choices],
+ "usage":
+ completion.usage,
+ })
+
+ # test simple list
+ batch = client.completions.create(
+ model=model,
+ prompt=[prompt, prompt],
+ max_tokens=5,
+ temperature=0.0,
+ )
+
+ results.append({
+ "test": "simple_list",
+ "text0": batch.choices[0].text,
+ "text1": batch.choices[1].text,
+ })
+
+ # test streaming
+ batch = client.completions.create(
+ model=model,
+ prompt=[prompt, prompt],
+ max_tokens=5,
+ temperature=0.0,
+ stream=True,
+ )
+
+ texts = [""] * 2
+ for chunk in batch:
+ assert len(chunk.choices) == 1
+ choice = chunk.choices[0]
+ texts[choice.index] += choice.text
+
+ results.append({
+ "test": "streaming",
+ "texts": texts,
+ })
+
+ return results
+
+
+def _test_embeddings(
+ client: openai.OpenAI,
+ model: str,
+ text: str,
+):
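+ """Run an embeddings request and collect comparable results."""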
+ results = []
+
+ # test with text input
+ embeddings = client.embeddings.create(
+ model=model,
+ input=text,
+ encoding_format="float",
+ )
+
+ results.append({
+ "test": "single_embedding",
+ "embedding": embeddings.data[0].embedding,
+ "usage": embeddings.usage,
+ })
+
+ return results
+
+
def compare_two_settings(model: str,
arg1: List[str],
arg2: List[str],
env1: Optional[Dict[str, str]] = None,
env2: Optional[Dict[str, str]] = None,
+ *,
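+ # "generate" exercises the completions API; "encode" exercises embeddings.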
+ method: Literal["generate", "encode"] = "generate",
max_wait_seconds: Optional[float] = None) -> None:
"""
Launch API server with two different sets of arguments/environments
@@ -219,96 +348,12 @@ def compare_two_settings(model: str,
"root": served_model.root,
})
- # test with text prompt
- completion = client.completions.create(model=model,
- prompt=prompt,
- max_tokens=5,
- temperature=0.0)
-
- results.append({
- "test": "single_completion",
- "text": completion.choices[0].text,
- "finish_reason": completion.choices[0].finish_reason,
- "usage": completion.usage,
- })
-
- # test using token IDs
- completion = client.completions.create(
- model=model,
- prompt=token_ids,
- max_tokens=5,
- temperature=0.0,
- )
-
- results.append({
- "test": "token_ids",
- "text": completion.choices[0].text,
- "finish_reason": completion.choices[0].finish_reason,
- "usage": completion.usage,
- })
-
- # test seeded random sampling
- completion = client.completions.create(model=model,
- prompt=prompt,
- max_tokens=5,
- seed=33,
- temperature=1.0)
-
- results.append({
- "test": "seeded_sampling",
- "text": completion.choices[0].text,
- "finish_reason": completion.choices[0].finish_reason,
- "usage": completion.usage,
- })
-
- # test seeded random sampling with multiple prompts
- completion = client.completions.create(model=model,
- prompt=[prompt, prompt],
- max_tokens=5,
- seed=33,
- temperature=1.0)
-
- results.append({
- "test":
- "seeded_sampling",
- "text": [choice.text for choice in completion.choices],
- "finish_reason":
- [choice.finish_reason for choice in completion.choices],
- "usage":
- completion.usage,
- })
-
- # test simple list
- batch = client.completions.create(
- model=model,
- prompt=[prompt, prompt],
- max_tokens=5,
- temperature=0.0,
- )
-
- results.append({
- "test": "simple_list",
- "text0": batch.choices[0].text,
- "text1": batch.choices[1].text,
- })
-
- # test streaming
- batch = client.completions.create(
- model=model,
- prompt=[prompt, prompt],
- max_tokens=5,
- temperature=0.0,
- stream=True,
- )
- texts = [""] * 2
- for chunk in batch:
- assert len(chunk.choices) == 1
- choice = chunk.choices[0]
- texts[choice.index] += choice.text
- results.append({
- "test": "streaming",
- "texts": texts,
- })
+ if method == "generate":
+ results += _test_completion(client, model, prompt, token_ids)
+ elif method == "encode":
+ results += _test_embeddings(client, model, prompt)
+ else:
+ assert_never(method)
n = len(results) // 2
arg1_results = results[:n]
diff --git a/tests/weight_loading/models-large.txt b/tests/weight_loading/models-large.txt
index 3e6eba04f1a87..5fda910fde084 100644
--- a/tests/weight_loading/models-large.txt
+++ b/tests/weight_loading/models-large.txt
@@ -3,3 +3,4 @@ compressed-tensors, nm-testing/Mixtral-8x7B-Instruct-v0.1-W4A16-channel-quantize
compressed-tensors, nm-testing/Mixtral-8x7B-Instruct-v0.1-W8A16-quantized, main
compressed-tensors, mgoin/DeepSeek-Coder-V2-Lite-Instruct-FP8, main
gptq_marlin, TheBloke/Mixtral-8x7B-v0.1-GPTQ, main
+awq_marlin, casperhansen/deepseek-coder-v2-instruct-awq, main
\ No newline at end of file
diff --git a/tests/weight_loading/run_model_weight_loading_test.sh b/tests/weight_loading/run_model_weight_loading_test.sh
index 0cb45d1780c2c..e80c1d6c5849c 100755
--- a/tests/weight_loading/run_model_weight_loading_test.sh
+++ b/tests/weight_loading/run_model_weight_loading_test.sh
@@ -1,7 +1,20 @@
#!/bin/bash
SUCCESS=0
-IFS=$'\n' read -d '' -r -a MODEL_CONFIGS < "weight_loading/models.txt"
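+# Parse -c <file>: the model-config list to test (e.g. models.txt or models-large.txt).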
+while getopts "c:" OPT; do
+ case ${OPT} in
+ c )
+ CONFIG="$OPTARG"
+ ;;
+ \? )
+ echo "Usage: $0 -c <model-config-file>" >&2
+ exit 1
+ ;;
+ esac
+done
+
+
+IFS=$'\n' read -d '' -r -a MODEL_CONFIGS < "$CONFIG"
for MODEL_CONFIG in "${MODEL_CONFIGS[@]}"
do
diff --git a/vllm/_custom_ops.py b/vllm/_custom_ops.py
index 05f036af331f1..24e008dc38022 100644
--- a/vllm/_custom_ops.py
+++ b/vllm/_custom_ops.py
@@ -568,6 +568,20 @@ def gptq_marlin_moe_repack(b_q_weight: torch.Tensor, perm: torch.Tensor,
return output
+def awq_marlin_moe_repack(b_q_weight: torch.Tensor, perm: torch.Tensor,
+ size_k: int, size_n: int,
+ num_bits: int) -> torch.Tensor:
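+ # Repack each expert's AWQ weight separately with the single-matrix AWQ->Marlin repack kernel.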
+ num_experts = b_q_weight.shape[0]
+ assert size_k % 16 == 0
+ output = torch.empty((num_experts, size_k // 16, size_n * (num_bits // 2)),
+ device=b_q_weight.device,
+ dtype=b_q_weight.dtype)
+ for e in range(num_experts):
+ output[e] = torch.ops._C.awq_marlin_repack(b_q_weight[e], size_k,
+ size_n, num_bits)
+ return output
+
+
def gptq_marlin_gemm(a: torch.Tensor,
b_q_weight: torch.Tensor,
b_scales: torch.Tensor,
@@ -828,11 +842,12 @@ def marlin_gemm_moe_fake(a: torch.Tensor, b_q_weights: torch.Tensor,
sorted_ids: torch.Tensor,
topk_weights: torch.Tensor,
topk_ids: torch.Tensor, b_scales: torch.Tensor,
- g_idx: torch.Tensor, perm: torch.Tensor,
- workspace: torch.Tensor, b_q_type: ScalarType,
- size_m: int, size_n: int, size_k: int,
- is_k_full: bool, num_experts: int, topk: int,
- moe_block_size: int, replicate_input: bool,
+ b_zero_points: torch.Tensor, g_idx: torch.Tensor,
+ perm: torch.Tensor, workspace: torch.Tensor,
+ b_q_type: ScalarType, size_m: int, size_n: int,
+ size_k: int, is_k_full: bool, num_experts: int,
+ topk: int, moe_block_size: int,
+ replicate_input: bool,
apply_weights: bool) -> torch.Tensor:
return torch.empty((size_m, topk, size_n),
dtype=a.dtype,
diff --git a/vllm/attention/backends/flashinfer.py b/vllm/attention/backends/flashinfer.py
index 40e804934cbdd..ba9b2d043c640 100644
--- a/vllm/attention/backends/flashinfer.py
+++ b/vllm/attention/backends/flashinfer.py
@@ -26,6 +26,7 @@
compute_slot_mapping_start_idx,
is_block_tables_empty)
from vllm.attention.ops.paged_attn import PagedAttention
+from vllm.forward_context import get_forward_context
from vllm.utils import (async_tensor_h2d, get_kv_cache_torch_dtype,
make_tensor_with_pad)
@@ -761,73 +762,132 @@ def forward(
"encoder/decoder cross-attention "
"are not implemented for "
"FlashInferImpl")
- num_tokens, hidden_size = query.shape
- query = query.view(-1, self.num_heads, self.head_size)
- key = key.view(-1, self.num_kv_heads, self.head_size)
- value = value.view(-1, self.num_kv_heads, self.head_size)
- if attn_metadata.num_prefill_tokens > 0:
- assert attn_metadata.num_decode_tokens == 0, (
- "Chunked prefill is not supported with flashinfer yet.")
- if attn_metadata.num_decode_tokens > 0:
- assert attn_metadata.num_prefill_tokens == 0, (
- "Chunked prefill is not supported with flashinfer yet.")
- if kv_cache.numel() > 0:
- # Use the same reshape and cache kernel as flash attention.
- ops.reshape_and_cache_flash(
- key,
- value,
- kv_cache[:, 0],
- kv_cache[:, 1],
- attn_metadata.slot_mapping.flatten(),
- self.kv_cache_dtype,
- k_scale,
- v_scale,
+ return torch.ops.vllm.unified_flash_infer(
+ query,
+ key,
+ value,
+ self.num_heads,
+ self.head_size,
+ self.num_kv_heads,
+ kv_cache,
+ self.kv_cache_dtype,
+ k_scale,
+ v_scale,
+ self.scale,
+ self.sliding_window,
+ self.alibi_slopes,
+ self.logits_soft_cap,
+ )
+
+
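+# Registered as a torch custom op (with a fake impl below) so the attention call can be traced/compiled as a single opaque op.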
+@torch.library.custom_op("vllm::unified_flash_infer",
+ mutates_args=["kv_cache"])
+def unified_flash_infer(
+ query: torch.Tensor,
+ key: torch.Tensor,
+ value: torch.Tensor,
+ num_heads: int,
+ head_size: int,
+ num_kv_heads: int,
+ kv_cache: torch.Tensor,
+ kv_cache_dtype: str,
+ k_scale: float,
+ v_scale: float,
+ softmax_scale: float,
+ window_size: Optional[List[int]] = None,
+ alibi_slopes: Optional[torch.Tensor] = None,
+ logits_soft_cap: Optional[float] = None,
+) -> torch.Tensor:
+
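+ # Attention metadata is not passed through the custom-op boundary; it is read from the per-forward context instead.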
+ current_metadata = get_forward_context()
+ assert current_metadata is not None
+ assert isinstance(current_metadata, FlashInferMetadata)
+ attn_metadata: FlashInferMetadata = current_metadata
+
+ num_tokens, hidden_size = query.shape
+ query = query.view(-1, num_heads, head_size)
+ key = key.view(-1, num_kv_heads, head_size)
+ value = value.view(-1, num_kv_heads, head_size)
+
+ if attn_metadata.num_prefill_tokens > 0:
+ assert attn_metadata.num_decode_tokens == 0, (
+ "Chunked prefill is not supported with flashinfer yet.")
+ if attn_metadata.num_decode_tokens > 0:
+ assert attn_metadata.num_prefill_tokens == 0, (
+ "Chunked prefill is not supported with flashinfer yet.")
+ if kv_cache.numel() > 0:
+ # Use the same reshape and cache kernel as flash attention.
+ ops.reshape_and_cache_flash(
+ key,
+ value,
+ kv_cache[:, 0],
+ kv_cache[:, 1],
+ attn_metadata.slot_mapping.flatten(),
+ kv_cache_dtype,
+ k_scale,
+ v_scale,
+ )
+ # The FlashInfer api requires data to be in fp8_e4m3 or fp8_e5m2
+ # to process the cache when the kv_cache_dtype is fp8
+ if kv_cache_dtype.startswith("fp8"):
+ torch_dtype = FlashInferBackend.get_fp8_dtype_for_flashinfer(
+ kv_cache_dtype)
+ kv_cache = kv_cache.view(torch_dtype)
+
+ query = query.contiguous() # Flashinfer requires query to be contiguous
+ if prefill_meta := attn_metadata.prefill_metadata:
+ # We will use flash attention for prefill
+ # when kv_cache is not provided.
+ # This happens when vllm runs the profiling to
+ # determine the number of blocks.
+ if kv_cache.numel() == 0:
+ output = flash_attn_varlen_func(
+ q=query,
+ k=key,
+ v=value,
+ cu_seqlens_q=prefill_meta.seq_start_loc,
+ cu_seqlens_k=prefill_meta.seq_start_loc,
+ max_seqlen_q=prefill_meta.max_prefill_seq_len,
+ max_seqlen_k=prefill_meta.max_prefill_seq_len,
+ softmax_scale=softmax_scale,
+ causal=True,
+ window_size=window_size,
+ alibi_slopes=alibi_slopes,
)
- # The FlashInfer api requires data to be in fp8_e4m3 or fp8_e5m2
- # to process the cache when the kv_cache_dtype is fp8
- if self.kv_cache_dtype.startswith("fp8"):
- torch_dtype = FlashInferBackend.get_fp8_dtype_for_flashinfer(
- self.kv_cache_dtype)
- kv_cache = kv_cache.view(torch_dtype)
-
- query = query.contiguous(
- ) # Flashinfer requires query to be contiguous
- if prefill_meta := attn_metadata.prefill_metadata:
- # We will use flash attention for prefill
- # when kv_cache is not provided.
- # This happens when vllm runs the profiling to
- # determine the number of blocks.
- if kv_cache.numel() == 0:
- output = flash_attn_varlen_func(
- q=query,
- k=key,
- v=value,
- cu_seqlens_q=prefill_meta.seq_start_loc,
- cu_seqlens_k=prefill_meta.seq_start_loc,
- max_seqlen_q=prefill_meta.max_prefill_seq_len,
- max_seqlen_k=prefill_meta.max_prefill_seq_len,
- softmax_scale=self.scale,
- causal=True,
- window_size=self.sliding_window,
- alibi_slopes=self.alibi_slopes,
- )
- else:
- assert prefill_meta is not None
- assert prefill_meta.prefill_wrapper is not None
- output = prefill_meta.prefill_wrapper.forward(
- query,
- kv_cache,
- logits_soft_cap=self.logits_soft_cap,
- causal=True)
else:
- assert attn_metadata.decode_metadata is not None
- assert attn_metadata.decode_metadata.decode_wrapper is not None
- output = attn_metadata.decode_metadata.decode_wrapper.forward(
- query,
- kv_cache,
- sm_scale=self.scale,
- logits_soft_cap=self.logits_soft_cap,
- k_scale=k_scale,
- v_scale=v_scale)
- return output.view(num_tokens, hidden_size)
+ assert prefill_meta is not None
+ assert prefill_meta.prefill_wrapper is not None
+ output = prefill_meta.prefill_wrapper.forward(
+ query, kv_cache, logits_soft_cap=logits_soft_cap, causal=True)
+ else:
+ assert attn_metadata.decode_metadata is not None
+ assert attn_metadata.decode_metadata.decode_wrapper is not None
+ output = attn_metadata.decode_metadata.decode_wrapper.forward(
+ query,
+ kv_cache,
+ sm_scale=softmax_scale,
+ logits_soft_cap=logits_soft_cap,
+ k_scale=k_scale,
+ v_scale=v_scale)
+ return output.view(num_tokens, hidden_size)
+
+
+@unified_flash_infer.register_fake
+def _(
+ query: torch.Tensor,
+ key: torch.Tensor,
+ value: torch.Tensor,
+ num_heads: int,
+ head_size: int,
+ num_kv_heads: int,
+ kv_cache: torch.Tensor,
+ kv_cache_dtype: str,
+ k_scale: float,
+ v_scale: float,
+ softmax_scale: float,
+ window_size: Optional[List[int]] = None,
+ alibi_slopes: Optional[torch.Tensor] = None,
+ logits_soft_cap: Optional[float] = None,
+) -> torch.Tensor:
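+ # Shape-only fake implementation used during tracing; no attention is computed.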
+ return torch.empty_like(query).contiguous()
diff --git a/vllm/attention/backends/rocm_flash_attn.py b/vllm/attention/backends/rocm_flash_attn.py
index fb5cd11ec033a..7456aab8b8d2a 100644
--- a/vllm/attention/backends/rocm_flash_attn.py
+++ b/vllm/attention/backends/rocm_flash_attn.py
@@ -191,12 +191,22 @@ def decode_metadata(self) -> Optional["ROCmFlashAttentionMetadata"]:
)
return self._cached_decode_metadata
- def advance_step(self, model_input: "ModelInputForGPUWithSamplingMetadata",
+ def advance_step(self,
+ model_input: "ModelInputForGPUWithSamplingMetadata",
sampled_token_ids: Optional[torch.Tensor],
- block_size: int, num_seqs: int, num_queries: int):
+ block_size: int,
+ num_seqs: int,
+ num_queries: int,
+ turn_prefills_into_decodes: bool = False):
"""
Update metadata in-place to advance one decode step.
"""
+
+ assert not turn_prefills_into_decodes, \
+ ("Chunked prefill is not supported with rocm_flash_attn yet."
+ "turn_prefills_into_decodes is a Multi-Step + Chunked-Prefill "
+ "specific parameter.")
+
# When using cudagraph, the num_seqs is padded to the next captured
# batch sized, but num_queries tracks the actual number of requests in
# the batch. For --enforce-eager mode, num_seqs == num_queries
diff --git a/vllm/attention/backends/torch_sdpa.py b/vllm/attention/backends/torch_sdpa.py
index 2a215331704c1..ef8d576616838 100644
--- a/vllm/attention/backends/torch_sdpa.py
+++ b/vllm/attention/backends/torch_sdpa.py
@@ -75,6 +75,22 @@ class TorchSDPAMetadata(AttentionMetadata, PagedAttentionMetadata):
slot_mapping: torch.Tensor
seq_lens: Optional[List[int]]
+ # Begin encoder attn & enc/dec cross-attn fields...
+ # Encoder sequence lengths representation
+ encoder_seq_lens: Optional[List[int]] = None
+ encoder_seq_lens_tensor: Optional[torch.Tensor] = None
+
+ # Maximum sequence length among encoder sequences
+ max_encoder_seq_len: Optional[int] = None
+
+ # Number of tokens input to encoder
+ num_encoder_tokens: Optional[int] = None
+
+ # Cross-attention memory-mapping data structures: slot mapping
+ # and block tables
+ cross_slot_mapping: Optional[torch.Tensor] = None
+ cross_block_tables: Optional[torch.Tensor] = None
+
def __post_init__(self):
# Set during the execution of the first attention op.
# It is a list because it is needed to set per prompt
@@ -82,6 +98,28 @@ def __post_init__(self):
# from xformer API.
# will not appear in the __repr__ and __init__
self.attn_bias: Optional[List[torch.Tensor]] = None
+ self.encoder_attn_bias: Optional[List[torch.Tensor]] = None
+ self.cross_attn_bias: Optional[List[torch.Tensor]] = None
+
+ @property
+ def is_all_encoder_attn_metadata_set(self):
+ '''
+ All attention metadata required for encoder attention is set.
+ '''
+ return ((self.encoder_seq_lens is not None)
+ and (self.encoder_seq_lens_tensor is not None)
+ and (self.max_encoder_seq_len is not None))
+
+ @property
+ def is_all_cross_attn_metadata_set(self):
+ '''
+ All attention metadata required for enc/dec cross-attention is set.
+
+ Superset of encoder attention required metadata.
+ '''
+ return (self.is_all_encoder_attn_metadata_set
+ and (self.cross_slot_mapping is not None)
+ and (self.cross_block_tables is not None))
@property
def prefill_metadata(self) -> Optional["TorchSDPAMetadata"]:
@@ -101,6 +139,136 @@ def decode_metadata(self) -> Optional["TorchSDPAMetadata"]:
return self
+ def get_seq_lens(
+ self,
+ attn_type: AttentionType,
+ ):
+ '''
+ Extract appropriate sequence lengths from attention metadata
+ according to attention type.
+
+ Arguments:
+
+ * attn_type: encoder attention, decoder self-attention,
+ encoder/decoder cross-attention
+
+ Returns:
+ * Appropriate sequence lengths for the query
+ * Appropriate sequence lengths for the key & value
+ '''
+
+ if attn_type == AttentionType.DECODER:
+ seq_lens_q = self.seq_lens
+ seq_lens_kv = self.seq_lens
+ elif attn_type == AttentionType.ENCODER:
+ seq_lens_q = self.encoder_seq_lens
+ seq_lens_kv = self.encoder_seq_lens
+ elif attn_type == AttentionType.ENCODER_DECODER:
+ seq_lens_q = self.seq_lens
+ seq_lens_kv = self.encoder_seq_lens
+ else:
+ raise AttributeError(f"Invalid attention type {str(attn_type)}")
+ return seq_lens_q, seq_lens_kv
+
+ def get_attn_bias(
+ self,
+ attn_type: AttentionType,
+ ) -> Optional[List[torch.Tensor]]:
+ '''
+ Extract appropriate attention bias from attention metadata
+ according to attention type.
+
+ Arguments:
+
+ * attn_type: encoder attention, decoder self-attention,
+ encoder/decoder cross-attention
+
+ Returns:
+ * Appropriate attention bias value given the attention type
+ '''
+
+ if attn_type == AttentionType.DECODER:
+ return self.attn_bias
+ elif attn_type == AttentionType.ENCODER:
+ return self.encoder_attn_bias
+ elif attn_type == AttentionType.ENCODER_DECODER:
+ return self.cross_attn_bias
+ else:
+ raise AttributeError(f"Invalid attention type {str(attn_type)}")
+
+ def set_attn_bias(
+ self,
+ attn_bias: List[torch.Tensor],
+ attn_type: AttentionType,
+ ) -> None:
+ '''
+ Update appropriate attention bias field of attention metadata,
+ according to attention type.
+
+ Arguments:
+
+ * attn_bias: The desired attention bias value
+ * attn_type: encoder attention, decoder self-attention,
+ encoder/decoder cross-attention
+ '''
+
+ if attn_type == AttentionType.DECODER:
+ self.attn_bias = attn_bias
+ elif attn_type == AttentionType.ENCODER:
+ self.encoder_attn_bias = attn_bias
+ elif attn_type == AttentionType.ENCODER_DECODER:
+ self.cross_attn_bias = attn_bias
+ else:
+ raise AttributeError(f"Invalid attention type {str(attn_type)}")
+
+ def get_seq_len_block_table_args(
+ self,
+ attn_type: AttentionType,
+ ) -> tuple:
+ '''
+ The particular choice of sequence-length- and block-table-related
+ attributes which should be extracted from attn_metadata is dependent
+ on the type of attention operation.
+
+ Decoder attn -> select entirely decoder self-attention-related fields
+ Encoder/decoder cross-attn -> select encoder sequence lengths &
+ cross-attn block-tables fields
+ Encoder attn -> select encoder sequence lengths fields & no block tables
+
+ Arguments:
+
+ * attn_type: encoder attention, decoder self-attention,
+ encoder/decoder cross-attention
+
+ Returns:
+
+ * Appropriate sequence-lengths tensor
+ * Appropriate max sequence-length scalar
+ * Appropriate block tables (or None)
+ '''
+
+ if attn_type == AttentionType.DECODER:
+ # Decoder self-attention
+ # using the decode-time max sequence length and the decoder block tables
+ return (self.seq_lens_tensor, self.max_decode_seq_len,
+ self.block_tables)
+ elif attn_type == AttentionType.ENCODER_DECODER:
+ # Enc/dec cross-attention KVs match encoder sequence length;
+ # cross-attention utilizes special "cross" block tables
+ return (self.encoder_seq_lens_tensor, self.max_encoder_seq_len,
+ self.cross_block_tables)
+ elif attn_type == AttentionType.ENCODER:
+ # No block tables associated with encoder attention
+ return (self.encoder_seq_lens_tensor, self.max_encoder_seq_len,
+ None)
+ else:
+ raise AttributeError(f"Invalid attention type {str(attn_type)}")
+
class TorchSDPABackendImpl(AttentionImpl[TorchSDPAMetadata]):
@@ -171,84 +339,101 @@ def forward(
shape = [num_tokens, num_heads * head_size]
"""
assert k_scale == 1.0 and v_scale == 1.0
- if attn_type != AttentionType.DECODER:
- raise NotImplementedError("Encoder self-attention and "
- "encoder/decoder cross-attention "
- "are not implemented for "
- "TorchSDPABackendImpl")
- num_tokens, hidden_size = query.shape
+ if (attn_type == AttentionType.ENCODER
+ and (not attn_metadata.is_all_encoder_attn_metadata_set)):
+ raise AttributeError("Encoder attention requires setting "
+ "encoder metadata attributes.")
+ elif (attn_type == AttentionType.ENCODER_DECODER
+ and (not attn_metadata.is_all_cross_attn_metadata_set)):
+ raise AttributeError("Encoder/decoder cross-attention "
+ "requires setting cross-attention "
+ "metadata attributes.")
+
# Reshape the query, key, and value tensors.
query = query.view(-1, self.num_heads, self.head_size)
- key = key.view(-1, self.num_kv_heads, self.head_size)
- value = value.view(-1, self.num_kv_heads, self.head_size)
-
- if kv_cache.numel() > 0:
+ if key is not None:
+ assert value is not None
+ key = key.view(-1, self.num_kv_heads, self.head_size)
+ value = value.view(-1, self.num_kv_heads, self.head_size)
+ else:
+ assert value is None
+
+ if (attn_type != AttentionType.ENCODER and kv_cache.numel() > 0):
+ # KV-cache during decoder-self- or
+ # encoder-decoder-cross-attention, but not
+ # during encoder attention.
+ #
+ # Even if there are no new key/value pairs to cache,
+ # we still need to break out key_cache and value_cache
+ # for later use by paged attention
key_cache, value_cache = PagedAttention.split_kv_cache(
kv_cache, self.num_kv_heads, self.head_size)
- PagedAttention.write_to_paged_cache(key, value, key_cache,
- value_cache,
- attn_metadata.slot_mapping,
- self.kv_cache_dtype, k_scale,
- v_scale)
- if attn_metadata.is_prompt:
+ if (key is not None) and (value is not None):
+ if attn_type == AttentionType.ENCODER_DECODER:
+ # Update cross-attention KV cache (prefill-only)
+ # During cross-attention decode, key & value will be None,
+ # preventing this IF-statement branch from running
+ updated_slot_mapping = attn_metadata.cross_slot_mapping
+ else:
+ # Update self-attention KV cache (prefill/decode)
+ updated_slot_mapping = attn_metadata.slot_mapping
+
+ PagedAttention.write_to_paged_cache(key, value, key_cache,
+ value_cache,
+ updated_slot_mapping,
+ self.kv_cache_dtype,
+ k_scale, v_scale)
+
+ if attn_type != AttentionType.ENCODER:
+ # Decoder self-attention supports chunked prefill.
+ # Encoder/decoder cross-attention requires no chunked
+ # prefill (100% prefill or 100% decode tokens, no mix)
+ num_prefill_tokens = attn_metadata.num_prefill_tokens
+ num_decode_tokens = attn_metadata.num_decode_tokens
+ else:
+ # Encoder attention - chunked prefill is not applicable;
+ # derive token-count from query shape and treat them
+ # as 100% prefill tokens
+ assert attn_metadata.num_encoder_tokens is not None
+ num_prefill_tokens = attn_metadata.num_encoder_tokens
+ num_decode_tokens = 0
+
+ if attn_type == AttentionType.DECODER:
+ # Only enforce this shape-constraint for decoder
+ # self-attention
+ assert key.shape[0] == num_prefill_tokens + num_decode_tokens
+ assert value.shape[0] == num_prefill_tokens + num_decode_tokens
+
+ if prefill_meta := attn_metadata.prefill_metadata:
assert attn_metadata.seq_lens is not None
if (kv_cache.numel() == 0
- or attn_metadata.block_tables.numel() == 0):
- if self.num_kv_heads != self.num_heads:
- key = key.repeat_interleave(self.num_queries_per_kv, dim=1)
- value = value.repeat_interleave(self.num_queries_per_kv,
- dim=1)
-
- if attn_metadata.attn_bias is None:
- if self.alibi_slopes is not None:
- att_masks = _make_alibi_bias(
- self.alibi_slopes, query.dtype,
- attn_metadata.seq_lens) # type: ignore
- elif self.sliding_window is not None:
- att_masks = _make_sliding_window_bias(
- attn_metadata.seq_lens, self.sliding_window,
- query.dtype) # type: ignore
- else:
- att_masks = [None] * len(attn_metadata.seq_lens)
- attn_metadata.attn_bias = att_masks
-
- query = query.movedim(0, query.dim() - 2)
- key = key.movedim(0, key.dim() - 2)
- value = value.movedim(0, value.dim() - 2)
-
- start = 0
- output = torch.empty(
- (num_tokens, self.num_heads, self.head_size),
- dtype=query.dtype)
- for seq_len, mask in zip(attn_metadata.seq_lens,
- attn_metadata.attn_bias):
- end = start + seq_len
- sub_out = scaled_dot_product_attention(
- query[None, :, start:end, :],
- key[None, :, start:end, :],
- value[None, :, start:end, :],
- attn_mask=mask,
- dropout_p=0.0,
- is_causal=not self.need_mask,
- scale=self.scale).squeeze(0).movedim(
- query.dim() - 2, 0)
- output[start:end, :, :] = sub_out
- start = end
+ or prefill_meta.block_tables.numel() == 0):
+ output = self._run_sdpa_forward(query,
+ key,
+ value,
+ prefill_meta,
+ attn_type=attn_type)
else:
# prefix-enabled attention
raise RuntimeError(
"Torch SDPA backend doesn't support prefix decoding.")
- else:
+ if decode_meta := attn_metadata.decode_metadata:
# Decoding run.
+ (
+ seq_lens_arg,
+ max_seq_len_arg,
+ block_tables_arg,
+ ) = decode_meta.get_seq_len_block_table_args(attn_type)
+
output = PagedAttention.forward_decode(
query,
key_cache,
value_cache,
- attn_metadata.block_tables,
- attn_metadata.seq_lens_tensor,
- attn_metadata.max_decode_seq_len,
+ block_tables_arg,
+ seq_lens_arg,
+ max_seq_len_arg,
self.kv_cache_dtype,
self.num_kv_heads,
self.scale,
@@ -260,6 +445,59 @@ def forward(
# Reshape the output tensor.
return output.view(-1, self.num_heads * self.head_size)
+ def _run_sdpa_forward(
+ self,
+ query: torch.Tensor,
+ key: torch.Tensor,
+ value: torch.Tensor,
+ attn_metadata: TorchSDPAMetadata,
+ attn_type: AttentionType = AttentionType.DECODER,
+ ):
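+ # For GQA/MQA, repeat the KV heads so torch SDPA sees matching head counts.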
+ if self.num_kv_heads != self.num_heads:
+ key = key.repeat_interleave(self.num_queries_per_kv, dim=1)
+ value = value.repeat_interleave(self.num_queries_per_kv, dim=1)
+
+ attn_masks = attn_metadata.get_attn_bias(attn_type)
+ if attn_masks is None:
+ if self.alibi_slopes is not None:
+ attn_masks = _make_alibi_bias(
+ self.alibi_slopes, query.dtype,
+ attn_metadata.seq_lens) # type: ignore
+ elif self.sliding_window is not None:
+ assert attn_metadata.seq_lens is not None
+ attn_masks = _make_sliding_window_bias(
+ attn_metadata.seq_lens, self.sliding_window,
+ query.dtype) # type: ignore
+ else:
+ seq_lens, _ = attn_metadata.get_seq_lens(attn_type)
+ attn_masks = [None] * len(seq_lens)
+ attn_metadata.set_attn_bias(attn_masks, attn_type)
+
+ output = torch.empty_like(query)
+ query = query.movedim(0, query.dim() - 2)
+ key = key.movedim(0, key.dim() - 2)
+ value = value.movedim(0, value.dim() - 2)
+
+ causal_attn = (attn_type == AttentionType.DECODER)
+
+ seq_lens_q, seq_lens_kv = attn_metadata.get_seq_lens(attn_type)
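+ # Attend to each sequence separately; query and key/value lengths may differ for cross-attention.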
+ start_q, start_kv = 0, 0
+ for seq_len_q, seq_len_kv, mask in zip(seq_lens_q, seq_lens_kv,
+ attn_masks):
+ end_q = start_q + seq_len_q
+ end_kv = start_kv + seq_len_kv
+ sub_out = scaled_dot_product_attention(
+ query[None, :, start_q:end_q, :],
+ key[None, :, start_kv:end_kv, :],
+ value[None, :, start_kv:end_kv, :],
+ attn_mask=mask,
+ dropout_p=0.0,
+ is_causal=causal_attn and not self.need_mask,
+ scale=self.scale).squeeze(0).movedim(query.dim() - 2, 0)
+ output[start_q:end_q, :, :] = sub_out
+ start_q, start_kv = end_q, end_kv
+ return output
+
def _make_alibi_bias(
alibi_slopes: torch.Tensor,
diff --git a/vllm/core/block/block_table.py b/vllm/core/block/block_table.py
index a9f4bd871dfda..d10cb29ef4a7c 100644
--- a/vllm/core/block/block_table.py
+++ b/vllm/core/block/block_table.py
@@ -220,7 +220,6 @@ def free(self) -> None:
occupied by each block. After freeing all the blocks, the `_blocks` list
is set to `None`.
"""
- assert self._is_allocated
for block in self.blocks:
self._allocator.free(block)
self._blocks.reset()
@@ -239,7 +238,6 @@ def physical_block_ids(self) -> List[int]:
List[int]: A list of physical block indices for the blocks in the
BlockTable.
"""
- assert self._is_allocated
return self._blocks.ids()
def get_unseen_token_ids(self, sequence_token_ids: List[int]) -> List[int]:
diff --git a/vllm/core/block_manager_v2.py b/vllm/core/block_manager_v2.py
index 0fad5fa99daf8..c7ee6609306d7 100644
--- a/vllm/core/block_manager_v2.py
+++ b/vllm/core/block_manager_v2.py
@@ -151,7 +151,9 @@ def _allocate_sequence(self, seq: Sequence) -> BlockTable:
block_allocator=self.block_allocator,
max_block_sliding_window=self.max_block_sliding_window,
)
- block_table.allocate(seq.get_token_ids())
+ if seq.get_token_ids():
+ # Add blocks to the block table only if the sequence is non empty.
+ block_table.allocate(seq.get_token_ids())
return block_table
diff --git a/vllm/core/scheduler.py b/vllm/core/scheduler.py
index f3a5016d0e62a..c57e6cd716405 100644
--- a/vllm/core/scheduler.py
+++ b/vllm/core/scheduler.py
@@ -1202,9 +1202,9 @@ def _can_append_slots(self, seq_group: SequenceGroup,
seq_group=seq_group, num_lookahead_slots=num_lookahead_slots)
def _allow_async_output_proc(self, seq_group: SequenceGroup) -> bool:
+ # TODO: does it work with parallel sampling?
no_beam_search = seq_group.sampling_params is None or (
- seq_group.sampling_params.best_of == 1
- and not seq_group.sampling_params.use_beam_search)
+ seq_group.sampling_params.best_of == 1)
return no_beam_search
def schedule(
diff --git a/vllm/distributed/device_communicators/custom_all_reduce.py b/vllm/distributed/device_communicators/custom_all_reduce.py
index c95192a5a1bcc..7de5b05a0b053 100644
--- a/vllm/distributed/device_communicators/custom_all_reduce.py
+++ b/vllm/distributed/device_communicators/custom_all_reduce.py
@@ -265,24 +265,21 @@ def all_reduce_unreg(self, inp: torch.Tensor, out: torch.Tensor = None):
def custom_all_reduce(self, input: torch.Tensor) -> Optional[torch.Tensor]:
# when custom allreduce is disabled, this will be None
- if self.disabled:
+ if self.disabled or not self.should_custom_ar(input):
return None
if self._IS_CAPTURING:
if torch.cuda.is_current_stream_capturing():
- if self.should_custom_ar(input):
- return self.all_reduce_reg(input)
+ return self.all_reduce_reg(input)
else:
- if self.should_custom_ar(input):
- # if warm up, mimic the allocation pattern
- # since custom allreduce is out-of-place
- return torch.empty_like(input)
+ # if warm up, mimic the allocation pattern
+ # since custom allreduce is out-of-place
+ return torch.empty_like(input)
else:
# note: outside of cuda graph context,
# custom allreduce incurs a cost of cudaMemcpy, which should
# be small(<=1% of overall latency) compared to the performance
# gains of using custom kernels
- if self.should_custom_ar(input):
- return self.all_reduce_unreg(input)
+ return self.all_reduce_unreg(input)
return None
diff --git a/vllm/distributed/parallel_state.py b/vllm/distributed/parallel_state.py
index d3ac4eb78b155..6e1970bfed98a 100644
--- a/vllm/distributed/parallel_state.py
+++ b/vllm/distributed/parallel_state.py
@@ -105,7 +105,7 @@ def inplace_all_reduce(tensor: torch.Tensor, group_name: str) -> None:
group = _groups[group_name]()
if group is None:
raise ValueError(f"Group {group_name} is destroyed.")
- group._all_reduce(tensor)
+ group._all_reduce_in_place(tensor)
@inplace_all_reduce.register_fake
def _(tensor: torch.Tensor, group_name: str) -> None:
@@ -118,7 +118,7 @@ def outplace_all_reduce(tensor: torch.Tensor,
group = _groups[group_name]()
if group is None:
raise ValueError(f"Group {group_name} is destroyed.")
- return group._all_reduce(tensor)
+ return group._all_reduce_out_place(tensor)
@outplace_all_reduce.register_fake
def _(tensor: torch.Tensor, group_name: str) -> torch.Tensor:
@@ -338,14 +338,17 @@ def all_reduce(self, input_: torch.Tensor) -> torch.Tensor:
return input_
if not supports_custom_op():
- return self._all_reduce(input_)
+ self._all_reduce_in_place(input_)
+ return input_
if self.tpu_communicator is not None and \
not self.tpu_communicator.disabled:
# TPU handles Dynamo with its own logic.
- return self._all_reduce(input_)
+ return self.tpu_communicator.all_reduce(input_)
- if self.ca_comm is not None and self.ca_comm.should_custom_ar(input_):
+ if self.ca_comm is not None and \
+ not self.ca_comm.disabled and \
+ self.ca_comm.should_custom_ar(input_):
return torch.ops.vllm.outplace_all_reduce(
input_, group_name=self.unique_name)
else:
@@ -353,25 +356,15 @@ def all_reduce(self, input_: torch.Tensor) -> torch.Tensor:
group_name=self.unique_name)
return input_
- def _all_reduce(self, input_: torch.Tensor) -> torch.Tensor:
- """
- The actual all-reduce implementation.
-
- NOTE: This operation will be applied in-place or out-of-place.
- Always assume this function modifies its input, but use the return
- value as the output.
- """
+ def _all_reduce_out_place(self, input_: torch.Tensor) -> torch.Tensor:
ca_comm = self.ca_comm
+ assert ca_comm is not None
+ assert not ca_comm.disabled
+ out = ca_comm.custom_all_reduce(input_)
+ assert out is not None
+ return out
- # For TPUs, use TPU communicator.
- tpu_comm = self.tpu_communicator
- if tpu_comm is not None and not tpu_comm.disabled:
- return tpu_comm.all_reduce(input_)
-
- if ca_comm is not None:
- out = ca_comm.custom_all_reduce(input_)
- if out is not None:
- return out
+ def _all_reduce_in_place(self, input_: torch.Tensor) -> None:
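+ # In-place path: dispatch to pynccl, ipex, or torch.distributed and mutate the input tensor directly.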
pynccl_comm = self.pynccl_comm
if (pynccl_comm is not None and not pynccl_comm.disabled):
pynccl_comm.all_reduce(input_)
@@ -380,7 +373,6 @@ def _all_reduce(self, input_: torch.Tensor) -> torch.Tensor:
ipex.distributed.all_reduce(input_, group=self.device_group)
else:
torch.distributed.all_reduce(input_, group=self.device_group)
- return input_
def all_gather(self, input_: torch.Tensor, dim: int = -1) -> torch.Tensor:
world_size = self.world_size
diff --git a/vllm/engine/arg_utils.py b/vllm/engine/arg_utils.py
index 3f0a8d3df8b32..cae95d20ca23d 100644
--- a/vllm/engine/arg_utils.py
+++ b/vllm/engine/arg_utils.py
@@ -183,6 +183,8 @@ class EngineArgs:
def __post_init__(self):
if self.tokenizer is None:
self.tokenizer = self.model
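+ # Load plugins here so they take effect before the engine and model configs are created.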
+ from vllm.plugins import load_general_plugins
+ load_general_plugins()
@staticmethod
def add_cli_args(parser: FlexibleArgumentParser) -> FlexibleArgumentParser:
diff --git a/vllm/engine/async_llm_engine.py b/vllm/engine/async_llm_engine.py
index e7d770c976319..50269493d64e9 100644
--- a/vllm/engine/async_llm_engine.py
+++ b/vllm/engine/async_llm_engine.py
@@ -14,23 +14,26 @@
from vllm.engine.async_timeout import asyncio_timeout
from vllm.engine.llm_engine import LLMEngine, SchedulerOutputState
from vllm.engine.metrics_types import StatLoggerBase
+from vllm.entrypoints.llm import BeamSearchSequence
from vllm.executor.executor_base import ExecutorAsyncBase
from vllm.executor.gpu_executor import GPUExecutorAsync
from vllm.executor.ray_utils import initialize_ray_cluster
-from vllm.inputs import PromptType
+from vllm.inputs import PromptType, TokensPrompt
from vllm.logger import init_logger
from vllm.lora.request import LoRARequest
from vllm.model_executor.guided_decoding import (
get_guided_decoding_logits_processor)
from vllm.model_executor.layers.sampler import SamplerOutput
-from vllm.outputs import EmbeddingRequestOutput, RequestOutput
+from vllm.outputs import (CompletionOutput, EmbeddingRequestOutput,
+ RequestOutput)
from vllm.pooling_params import PoolingParams
from vllm.prompt_adapter.request import PromptAdapterRequest
-from vllm.sampling_params import SamplingParams
+from vllm.sampling_params import BeamSearchParams, SamplingParams
from vllm.sequence import ExecuteModelRequest
from vllm.transformers_utils.tokenizer import AnyTokenizer
from vllm.usage.usage_lib import UsageContext
-from vllm.utils import deprecate_kwargs, weak_bind
+from vllm.utils import (collect_from_async_generator, deprecate_kwargs,
+ get_beam_search_score, random_uuid, weak_bind)
logger = init_logger(__name__)
ENGINE_ITERATION_TIMEOUT_S = envs.VLLM_ENGINE_ITERATION_TIMEOUT_S
@@ -1036,6 +1039,104 @@ async def generate(
):
yield LLMEngine.validate_output(output, RequestOutput)
+ async def beam_search(
+ self,
+ prompt: Union[PromptType, List[int]],
+ request_id: str,
+ params: BeamSearchParams,
+ ) -> AsyncGenerator[RequestOutput, None]:
+
+ beam_width = params.beam_width
+ max_tokens = params.max_tokens
+ ignore_eos = params.ignore_eos
+ temperature = params.temperature
+ length_penalty = params.length_penalty
+
+ def sort_beams_key(x: BeamSearchSequence) -> float:
+ return get_beam_search_score(x.tokens, x.cum_logprob,
+ tokenizer.eos_token_id,
+ length_penalty)
+
+ tokenizer = await self.get_tokenizer()
+ tokenized_prompt = prompt if isinstance(
+ prompt, list) else tokenizer.encode(prompt)
+ tokenized_length = len(tokenized_prompt)
+
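+ # One new token is sampled per beam per step; 2 * beam_width logprobs keep enough candidates alive after EOS-terminated beams are set aside.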
+ beam_search_params = SamplingParams(logprobs=2 * beam_width,
+ max_tokens=1,
+ temperature=temperature)
+ all_beams = [BeamSearchSequence(tokens=tokenized_prompt, cum_logprob=0)]
+ completed = []
+
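+ # Expand every live beam by one token per step and keep the top beam_width candidates by length-penalized score.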
+ for _ in range(max_tokens):
+ prompts_batch = [
+ TokensPrompt(prompt_token_ids=beam.tokens)
+ for beam in all_beams
+ ]
+
+ tasks = []
+
+ request_id = f"beam_search-{random_uuid()}"
+ for i, individual_prompt in enumerate(prompts_batch):
+ request_id_item = f"{request_id}-{i}"
+ task = asyncio.create_task(
+ collect_from_async_generator(
+ self.generate(individual_prompt, beam_search_params,
+ request_id_item)))
+ tasks.append(task)
+
+ output = await asyncio.gather(*tasks)
+
+ output = [x[0] for x in output]
+
+ logger.info(output)
+
+ new_beams = []
+ for i, current_beam in enumerate(all_beams):
+ result = output[i]
+
+ if result.outputs[0].logprobs is not None:
+ logprobs = result.outputs[0].logprobs[0]
+ for token_id, logprob_obj in logprobs.items():
+ new_beam = BeamSearchSequence(
+ tokens=current_beam.tokens + [token_id],
+ cum_logprob=current_beam.cum_logprob +
+ logprob_obj.logprob)
+
+ if token_id == tokenizer.eos_token_id and \
+ not ignore_eos:
+ completed.append(new_beam)
+ else:
+ new_beams.append(new_beam)
+
+ sorted_beams = sorted(new_beams, key=sort_beams_key, reverse=True)
+ all_beams = sorted_beams[:beam_width]
+
+ completed.extend(all_beams)
+ sorted_completed = sorted(completed, key=sort_beams_key, reverse=True)
+ best_beams = sorted_completed[:beam_width]
+
+ for beam in best_beams:
+ beam.text = tokenizer.decode(beam.tokens[tokenized_length:])
+
+ beam_search_output = RequestOutput(
+ request_id=request_id,
+ prompt=prompt,
+ outputs=[
+ CompletionOutput(
+ text=beam.text,
+ cumulative_logprob=beam.cum_logprob,
+ token_ids=beam.tokens,
+ index=i,
+ logprobs=beam.cum_logprob,
+ ) for (i, beam) in enumerate(best_beams)
+ ],
+ finished=True,
+ prompt_token_ids=tokenized_prompt,
+ prompt_logprobs=None)
+
+ yield LLMEngine.validate_output(beam_search_output, RequestOutput)
+
async def encode(
self,
prompt: PromptType,
diff --git a/vllm/engine/llm_engine.py b/vllm/engine/llm_engine.py
index 89d6bb74e4030..fdc8ca6405375 100644
--- a/vllm/engine/llm_engine.py
+++ b/vllm/engine/llm_engine.py
@@ -290,9 +290,6 @@ def __init__(
model_config.mm_processor_kwargs,
)
# TODO(woosuk): Print more configs in debug mode.
- from vllm.plugins import load_general_plugins
- load_general_plugins()
-
self.model_config = model_config
self.cache_config = cache_config
self.lora_config = lora_config
@@ -972,6 +969,45 @@ def _process_sequence_group_outputs(
return
+ def _update_num_computed_tokens_for_multi_step_prefill(
+ self, seq_group: SequenceGroup,
+ seq_group_meta: SequenceGroupMetadata,
+ is_first_step_output: Optional[bool]):
+ """
+ This function updates num_computed_tokens for prompt sequences
+ when Multi-Step is enabled.
+
+ seq_group: SequenceGroup to update the num_computed_tokens for.
+ seq_group_meta: Metadata of the given SequenceGroup.
+ is_first_step_output: Optional[bool] -
+ When available, is_first_step_output indicates if the appended
+ output token is the output of the first-step in multi-step.
+ A value of None indicates that outputs from all steps in
+ multi-step are submitted in a single burst.
+ """
+
+ assert self.scheduler_config.is_multi_step
+
+ if not seq_group_meta.is_prompt:
+ # num_computed_token updates for multi-step decodes happen after
+ # the tokens are appended to the sequence.
+ return
+
+ do_update: bool = False
+ if self.scheduler_config.chunked_prefill_enabled:
+ # In multi-step + chunked-prefill case, the prompt sequences
+ # that are scheduled are fully processed in the first step.
+ do_update = is_first_step_output is None or is_first_step_output
+ else:
+ # Normal multi-step decoding case. In this case prompt-sequences
+ # are actually single-stepped. Always update in this case.
+ assert seq_group.state.num_steps == 1
+ do_update = True
+
+ if do_update:
+ seq_group.update_num_computed_tokens(
+ seq_group_meta.token_chunk_size)
+
def _process_model_outputs(self,
ctx: SchedulerContext,
request_id: Optional[str] = None) -> None:
@@ -982,64 +1018,6 @@ def _process_model_outputs(self,
request_id: If provided, then only this request is going to be processed
"""
- def update_prefill_num_computed_tokens(
- seq_group: SequenceGroup,
- seq_group_meta: SequenceGroupMetadata, num_outputs: int,
- is_first_step_output: Optional[bool]) -> None:
- """
- When multi-step and chunked-prefill are enabled together, the
- prefill sequence scheduled for multi-step execution turn into
- decodes in the first step itself. This function accounts
- for that conversion.
-
- seq_group: SequenceGroup - A prefill seq_group
- seq_group_meta: SequenceGroupMetadata - Metadata of the given
- prefill seq_group
- num_outputs: int - number of output tokens being processed for the
- given seq_group
- is_first_step_output: Optional[bool] -
- If multi-step is enabled and num_outputs is 1, this value
- indicates if this outputs belongs to the first step in the
- multi-step.
- If multi-step is enabled and num_outputs > 1, this value
- must be None, as num_outputs > 1 indicates that outputs from
- all the steps in multi-step are submitted in a single burst.
- When multi-step is disabled, this value is always True.
- """
-
- assert seq_group_meta.is_prompt
-
- token_chunk_size = seq_group_meta.token_chunk_size
-
- if num_outputs == 1:
- assert is_first_step_output is not None
-
- if seq_group_meta.state.num_steps == 1:
- assert is_first_step_output is True
- seq_group.update_num_computed_tokens(token_chunk_size)
- return
-
- # multi-step prefill is only supported when multi-step is
- # enabled with chunked prefill
- assert self.scheduler_config.is_multi_step and \
- self.scheduler_config.chunked_prefill_enabled
- if is_first_step_output is True:
- # This sequence is a prompt during the first step only.
- seq_group.update_num_computed_tokens(token_chunk_size)
- return
-
- assert is_first_step_output is None
-
- # multi-step prefill is only supported when multi-step is
- # enabled with chunked prefill. Outputs from all the steps are
- # submitted in a single burst.
- assert self.scheduler_config.is_multi_step and \
- self.scheduler_config.chunked_prefill_enabled
- assert num_outputs == seq_group_meta.state.num_steps, \
- f"#outputs {len(outputs)} - num steps {seq_group_meta.state.num_steps}" #noqa
- # This sequence is a prompt during the first step only.
- seq_group.update_num_computed_tokens(token_chunk_size)
-
now = time.time()
if len(ctx.output_queue) == 0:
@@ -1100,7 +1078,7 @@ def update_prefill_num_computed_tokens(
seq_group_meta = seq_group_metadata_list[i]
scheduled_seq_group = scheduler_outputs.scheduled_seq_groups[i]
- seq_group = scheduled_seq_group.seq_group
+ seq_group: SequenceGroup = scheduled_seq_group.seq_group
if seq_group.is_finished():
finished_before.append(i)
@@ -1111,14 +1089,14 @@ def update_prefill_num_computed_tokens(
else:
output = [outputs_by_sequence_group[0][i]]
- if not is_async and seq_group_meta.is_prompt:
- # Updates for all decodes happen when we actually append the
- # token ids to the seq in process_outputs.
- update_prefill_num_computed_tokens(seq_group, seq_group_meta,
- len(output),
- is_first_step_output)
- elif not is_async:
- seq_group.update_num_computed_tokens(1)
+ if not is_async:
+ if self.scheduler_config.is_multi_step:
+ # Updates happen only if the sequence is prefill
+ self._update_num_computed_tokens_for_multi_step_prefill(
+ seq_group, seq_group_meta, is_first_step_output)
+ else:
+ seq_group.update_num_computed_tokens(
+ seq_group_meta.token_chunk_size)
if outputs:
for o in outputs:
@@ -1142,16 +1120,8 @@ def update_prefill_num_computed_tokens(
else:
self.output_processor.process_prompt_logprob(seq_group, output)
if seq_group_meta.do_sample:
- output_token_num = self.output_processor.process_outputs(
+ self.output_processor.process_outputs(
seq_group, output, is_async)
- if self.speculative_config:
- # We -1 here because we always
- # (w/o speculative decoding) add the number of
- # computed tokens by one in the decoding phase.
- # Therefore, we remove that one token that
- # is already added.
- seq_group.update_num_computed_tokens(output_token_num -
- 1)
if seq_group.is_finished():
finished_now.append(i)
@@ -1260,20 +1230,15 @@ def _advance_to_next_step(
if seq_group.is_finished():
continue
- if seq_group_metadata.is_prompt:
- if self.scheduler_config.is_multi_step and \
- self.scheduler_config.chunked_prefill_enabled:
- # Prompts are scheduled in multi-step only when
- # chunking is enabled. These prompts turn into
- # decodes after the very first step. Therefore,
- # we skip the update to the num_computed_tokens
- # here.
- seq_group.update_num_computed_tokens(1)
- else:
- seq_group.update_num_computed_tokens(
- seq_group_metadata.token_chunk_size)
+ if self.scheduler_config.is_multi_step:
+ # Updates happen only if the sequence is prefill
+ self._update_num_computed_tokens_for_multi_step_prefill(
+ seq_group, seq_group_metadata,
+ seq_group.state.num_steps == 1)
else:
- seq_group.update_num_computed_tokens(1)
+ seq_group.update_num_computed_tokens(
+ seq_group_metadata.token_chunk_size)
+
if seq_group_metadata.do_sample:
assert len(sequence_group_outputs.samples) == 1, (
"Async output processor expects a single sample"
@@ -1283,7 +1248,15 @@ def _advance_to_next_step(
assert len(seq_group.seqs) == 1
seq = seq_group.seqs[0]
- seq.append_token_id(sample.output_token, sample.logprobs)
+
+ if self.scheduler_config.is_multi_step:
+ is_prefill_append = seq.data.get_num_uncomputed_tokens(
+ ) == 0
+ seq.append_token_id(sample.output_token, sample.logprobs)
+ if not is_prefill_append:
+ seq_group.update_num_computed_tokens(1)
+ else:
+ seq.append_token_id(sample.output_token, sample.logprobs)
def step(self) -> List[Union[RequestOutput, EmbeddingRequestOutput]]:
"""Performs one decoding iteration and returns newly generated results.
diff --git a/vllm/engine/output_processor/interfaces.py b/vllm/engine/output_processor/interfaces.py
index 554880a3cc438..50adaf4e59188 100644
--- a/vllm/engine/output_processor/interfaces.py
+++ b/vllm/engine/output_processor/interfaces.py
@@ -1,5 +1,5 @@
from abc import ABC, abstractmethod
-from typing import Callable, List, Optional
+from typing import Callable, List
from vllm.config import SchedulerConfig
from vllm.core.scheduler import Scheduler
@@ -58,14 +58,10 @@ def create_output_processor(
@abstractmethod
def process_outputs(self, sequence_group: SequenceGroup,
outputs: List[SequenceGroupOutput],
- is_async: bool) -> Optional[int]:
+ is_async: bool) -> None:
"""Process new token ids for the sequence group. Handles logic such as
detokenization, stop checking, and freeing/forking sequences in the
scheduler.
-
- Return the number of new tokens generated in the sequence group.
- The returned value is optional because it is only used for
- speculative decoding mqa scorer.
"""
pass
diff --git a/vllm/engine/output_processor/multi_step.py b/vllm/engine/output_processor/multi_step.py
index f35b1ba9c2bdd..47de3656ca892 100644
--- a/vllm/engine/output_processor/multi_step.py
+++ b/vllm/engine/output_processor/multi_step.py
@@ -1,5 +1,5 @@
import functools
-from typing import Callable, List, Optional
+from typing import Callable, List
from vllm.core.scheduler import Scheduler
from vllm.engine.output_processor.interfaces import (
@@ -69,7 +69,7 @@ def _log_prompt_logprob_unsupported_warning_once():
def process_outputs(self,
sequence_group: SequenceGroup,
outputs: List[SequenceGroupOutput],
- is_async: bool = False) -> Optional[int]:
+ is_async: bool = False) -> None:
"""Append new tokens in the outputs to sequences in the sequence group.
This only supports sequence groups of size 1. It supports greater than
@@ -84,10 +84,6 @@ def process_outputs(self,
tokens from the previous step. If this is true, then
no tokens need to be appended since it is already done
externally (before the next schedule() call)
-
- Returns:
- The number of tokens appended to the sequence. This is optional
- because only speculative decode uses this return value.
"""
# Sequences can be in RUNNING or FINISHED_ABORTED state
# once scheduled, as a sequence is moved to FINISHED_ABORTED
@@ -110,7 +106,6 @@ def process_outputs(self,
# was already appended, so we only need to do the rest of the
# postprocessor: Detokenization + stopping logic
self._process_decode_and_stop(seq, sequence_group.sampling_params)
- return None
else:
# Standard multi-step case
@@ -126,8 +121,8 @@ def process_outputs(self,
]
assert valid_samples
- return self._process_seq_outputs(seq, valid_samples,
- sequence_group.sampling_params)
+ self._process_seq_outputs(seq, valid_samples,
+ sequence_group.sampling_params)
def _process_decode_and_stop(self, seq: Sequence,
sampling_params: SamplingParams) -> None:
@@ -145,7 +140,7 @@ def _process_decode_and_stop(self, seq: Sequence,
def _process_seq_outputs(self, seq: Sequence,
valid_samples: List[SequenceOutput],
- sampling_params: SamplingParams) -> int:
+ sampling_params: SamplingParams) -> None:
output_token_ids = [sample.output_token for sample in valid_samples]
output_logprobs = [sample.logprobs for sample in valid_samples]
@@ -168,6 +163,7 @@ def _process_seq_outputs(self, seq: Sequence,
output_token_ids = output_token_ids[:i + 1]
break
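+ # When the sequence has no uncomputed tokens, the first sampled token comes straight from a prefill step and must not trigger an extra num_computed_tokens update.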
+ is_prefill_sampled_token = seq.data.get_num_uncomputed_tokens() == 0
# Incrementally append tokens to the sequence, as if we had only one new
# token.
for output_token_id, output_logprob in zip(output_token_ids,
@@ -177,8 +173,14 @@ def _process_seq_outputs(self, seq: Sequence,
logprobs=output_logprob,
)
+ if is_prefill_sampled_token:
+ is_prefill_sampled_token = False
+ else:
+ # Update num_computed_tokens iff the sampled token is not from
+ # a prefill step.
+ seq.data.update_num_computed_tokens(1)
+
self._process_decode_and_stop(seq, sampling_params)
if seq.is_finished():
break
- return len(output_token_ids)
diff --git a/vllm/engine/output_processor/single_step.py b/vllm/engine/output_processor/single_step.py
index e288aa0c4aafd..00d9297e41d99 100644
--- a/vllm/engine/output_processor/single_step.py
+++ b/vllm/engine/output_processor/single_step.py
@@ -1,4 +1,4 @@
-from typing import Dict, List, Optional, Tuple, Union
+from typing import Dict, List, Tuple
from vllm.config import SchedulerConfig
from vllm.core.scheduler import Scheduler
@@ -6,7 +6,6 @@
SequenceGroupOutputProcessor)
from vllm.engine.output_processor.stop_checker import StopChecker
from vllm.logger import init_logger
-from vllm.sampling_params import SamplingParams
from vllm.sequence import (Sequence, SequenceGroup, SequenceGroupOutput,
SequenceOutput, SequenceStatus)
from vllm.transformers_utils.detokenizer import Detokenizer
@@ -113,7 +112,7 @@ def _process_sequence_group_outputs(self, seq_group: SequenceGroup,
outputs: SequenceGroupOutput,
is_async: bool) -> None:
sampling_params = seq_group.sampling_params
- if sampling_params.best_of == 1 and not sampling_params.use_beam_search:
+ if sampling_params.best_of == 1:
# only have one output sample
sample = outputs.samples[0]
# only have one sequence
@@ -142,7 +141,6 @@ def _process_sequence_group_outputs(self, seq_group: SequenceGroup,
# Process samples
samples = outputs.samples
parent_seqs = seq_group.get_seqs(status=SequenceStatus.RUNNING)
- existing_finished_seqs = seq_group.get_finished_seqs()
parent_child_dict: Dict[int, List[SequenceOutput]] = {
parent_seq.seq_id: []
for parent_seq in parent_seqs
@@ -197,106 +195,9 @@ def _process_sequence_group_outputs(self, seq_group: SequenceGroup,
lora_req=seq_group.lora_request,
)
- # Non-beam search case
- if not sampling_params.use_beam_search:
- # For newly created child sequences, add them to the sequence group
- # and fork them in block manager if they are not finished.
- for seq, parent in child_seqs:
- if seq is not parent:
- seq_group.add(seq)
- if not seq.is_finished():
- for scheduler in self.scheduler:
- scheduler.fork_seq(parent, seq)
-
- # Free the finished and selected parent sequences' memory in block
- # manager. Keep them in the sequence group as candidate output.
- # NOTE: we need to fork the new sequences before freeing the
- # old sequences.
- for seq, parent in child_seqs:
- if seq is parent and seq.is_finished():
- for scheduler in self.scheduler:
- scheduler.free_seq(seq)
- return
-
- # Beam search case
- # Select the child sequences to keep in the sequence group.
- selected_child_seqs: List[Tuple[Sequence, Optional[Sequence]]] = []
- unselected_child_seqs: List[Tuple[Sequence, Optional[Sequence]]] = []
- beam_width = sampling_params.best_of
- length_penalty = sampling_params.length_penalty
-
- # Select the newly finished sequences with the highest scores
- # to replace existing finished sequences.
- # Tuple of (seq, parent, is_new)
- existing_finished_seqs = [(seq, None, False)
- for seq in existing_finished_seqs]
- new_finished_seqs = [(seq, parent, True) for seq, parent in child_seqs
- if seq.is_finished()]
- all_finished_seqs = existing_finished_seqs + new_finished_seqs
- # Sort the finished sequences by their scores.
- all_finished_seqs.sort(key=lambda x: x[0].get_beam_search_score(
- length_penalty=length_penalty, eos_token_id=x[0].eos_token_id),
- reverse=True)
- for seq, parent, is_new in all_finished_seqs[:beam_width]:
- if is_new:
- # A newly generated child sequence finishes and has a high
- # score, so we will add it into the sequence group.
- selected_child_seqs.append((seq, parent))
- for seq, parent, is_new in all_finished_seqs[beam_width:]:
- if is_new:
- # A newly generated child sequence finishes but has a low
- # score, so we will not add it into the sequence group.
- # Additionally, if this sequence is a continuation of a
- # parent sequence, we will need remove the parent sequence
- # from the sequence group.
- unselected_child_seqs.append((seq, parent))
- else:
- # An existing finished sequence has a low score, so we will
- # remove it from the sequence group.
- seq_group.remove(seq.seq_id)
-
- # select the top beam_width sequences from the running
- # sequences for the next iteration to continue the beam
- # search.
- running_child_seqs = [(seq, parent) for seq, parent in child_seqs
- if not seq.is_finished()]
- # Sort the running sequences by their scores.
- running_child_seqs.sort(key=lambda x: x[0].get_beam_search_score(
- length_penalty=length_penalty, eos_token_id=x[0].eos_token_id),
- reverse=True)
-
- # Check if we can stop the beam search.
- if len(running_child_seqs) == 0:
- # No running sequences, stop the beam search.
- stop_beam_search = True
- elif len(all_finished_seqs) < beam_width:
- # Not enough finished sequences, continue the beam search.
- stop_beam_search = False
- else:
- # Check the early stopping criteria
- best_running_seq = running_child_seqs[0][0]
- current_worst_seq = all_finished_seqs[beam_width - 1][0]
- stop_beam_search = self._check_beam_search_early_stopping(
- sampling_params.early_stopping, sampling_params,
- best_running_seq, current_worst_seq)
-
- if stop_beam_search:
- # Stop the beam search and remove all the running sequences from
- # the sequence group.
- unselected_child_seqs.extend(running_child_seqs)
- else:
- # Continue the beam search and select the top beam_width sequences
- # to continue the beam search.
- selected_child_seqs.extend(running_child_seqs[:beam_width])
- # The remaining running sequences will not be used in the next
- # iteration. Again, if these sequences are continuations of
- # parent sequences, we will need to remove the parent sequences
- # from the sequence group.
- unselected_child_seqs.extend(running_child_seqs[beam_width:])
-
# For newly created child sequences, add them to the sequence group
# and fork them in block manager if they are not finished.
- for seq, parent in selected_child_seqs:
+ for seq, parent in child_seqs:
if seq is not parent:
seq_group.add(seq)
if not seq.is_finished():
@@ -305,61 +206,10 @@ def _process_sequence_group_outputs(self, seq_group: SequenceGroup,
# Free the finished and selected parent sequences' memory in block
# manager. Keep them in the sequence group as candidate output.
- for seq, parent in selected_child_seqs:
+ # NOTE: we need to fork the new sequences before freeing the
+ # old sequences.
+ for seq, parent in child_seqs:
if seq is parent and seq.is_finished():
for scheduler in self.scheduler:
scheduler.free_seq(seq)
-
- # Remove the unselected parent sequences from the sequence group and
- # free their memory in block manager.
- for seq, parent in unselected_child_seqs:
- if seq is parent:
- # Remove the parent sequence if it is not selected for next
- # iteration
- seq_group.remove(seq.seq_id)
- for scheduler in self.scheduler:
- scheduler.free_seq(seq)
-
- def _check_beam_search_early_stopping(
- self,
- early_stopping: Union[bool, str],
- sampling_params: SamplingParams,
- best_running_seq: Sequence,
- current_worst_seq: Sequence,
- ) -> bool:
- assert sampling_params.use_beam_search
- length_penalty = sampling_params.length_penalty
- if early_stopping is True:
- return True
-
- current_worst_score = current_worst_seq.get_beam_search_score(
- length_penalty=length_penalty,
- eos_token_id=current_worst_seq.eos_token_id)
- if early_stopping is False:
- highest_attainable_score = best_running_seq.get_beam_search_score(
- length_penalty=length_penalty,
- eos_token_id=best_running_seq.eos_token_id)
- else:
- assert early_stopping == "never"
- if length_penalty > 0.0:
- # If length_penalty > 0.0, beam search will prefer longer
- # sequences. The highest attainable score calculation is
- # based on the longest possible sequence length in this case.
- max_possible_length = max(
- best_running_seq.get_prompt_len() +
- sampling_params.max_tokens,
- self.scheduler_config.max_model_len)
- highest_attainable_score = (
- best_running_seq.get_beam_search_score(
- length_penalty=length_penalty,
- eos_token_id=best_running_seq.eos_token_id,
- seq_len=max_possible_length))
- else:
- # Otherwise, beam search will prefer shorter sequences. The
- # highest attainable score calculation is based on the current
- # sequence length.
- highest_attainable_score = (
- best_running_seq.get_beam_search_score(
- length_penalty=length_penalty,
- eos_token_id=best_running_seq.eos_token_id))
- return current_worst_score >= highest_attainable_score
+ return
diff --git a/vllm/entrypoints/llm.py b/vllm/entrypoints/llm.py
index 98d6df944da67..439f3769f9fbd 100644
--- a/vllm/entrypoints/llm.py
+++ b/vllm/entrypoints/llm.py
@@ -22,13 +22,14 @@
from vllm.outputs import EmbeddingRequestOutput, RequestOutput
from vllm.pooling_params import PoolingParams
from vllm.prompt_adapter.request import PromptAdapterRequest
-from vllm.sampling_params import (GuidedDecodingParams, RequestOutputKind,
- SamplingParams)
+from vllm.sampling_params import (BeamSearchParams, GuidedDecodingParams,
+ RequestOutputKind, SamplingParams)
from vllm.transformers_utils.tokenizer import (AnyTokenizer, MistralTokenizer,
get_cached_tokenizer)
from vllm.transformers_utils.tokenizer_group import TokenizerGroup
from vllm.usage.usage_lib import UsageContext
-from vllm.utils import Counter, deprecate_kwargs, is_list_of
+from vllm.utils import (Counter, deprecate_kwargs, get_beam_search_score,
+ is_list_of)
logger = init_logger(__name__)
@@ -180,15 +181,7 @@ def __init__(
if "disable_log_stats" not in kwargs:
kwargs["disable_log_stats"] = True
- removed_vision_keys = (
- "image_token_id",
- "image_feature_size",
- "image_input_shape",
- "image_input_type",
- )
- if any(k in kwargs for k in removed_vision_keys):
- raise TypeError(
- "There is no need to pass vision-related arguments anymore.")
+
engine_args = EngineArgs(
model=model,
tokenizer=tokenizer,
@@ -394,10 +387,7 @@ def generate(
def beam_search(
self,
prompts: List[Union[str, List[int]]],
- beam_width: int,
- max_tokens: int,
- ignore_eos: bool = False,
- temperature: float = 0.0,
+ params: BeamSearchParams,
) -> List[BeamSearchOutput]:
"""
Generate sequences using beam search.
@@ -405,14 +395,23 @@ def beam_search(
Args:
prompts: A list of prompts. Each prompt can be a string or a list
of token IDs.
- beam_width: The number of beams to keep at each step.
- max_tokens: The max number of tokens to generate for each prompt.
- temperature: The temperature to use for generation.
-
+ params: The beam search parameters.
+
TODO: how does beam search work together with length penalty, frequency
penalty, and stopping criteria, etc.?
"""
+ beam_width = params.beam_width
+ max_tokens = params.max_tokens
+ temperature = params.temperature
+ ignore_eos = params.ignore_eos
+ length_penalty = params.length_penalty
+
+ def sort_beams_key(x: BeamSearchSequence) -> float:
+ return get_beam_search_score(x.tokens, x.cum_logprob,
+ tokenizer.eos_token_id,
+ length_penalty)
+
tokenizer = self.get_tokenizer()
# generate 2 * beam_width candidates at each step
# following the huggingface transformers implementation
@@ -474,7 +473,7 @@ def beam_search(
else:
instance_new_beams.append(new_beam)
sorted_beams = sorted(instance_new_beams,
- key=lambda x: x.cum_logprob,
+ key=sort_beams_key,
reverse=True)
instance.beams = sorted_beams[:beam_width]
@@ -482,7 +481,7 @@ def beam_search(
for instance in instances:
instance.completed.extend(instance.beams)
sorted_completed = sorted(instance.completed,
- key=lambda x: x.cum_logprob,
+ key=sort_beams_key,
reverse=True)
best_beams = sorted_completed[:beam_width]
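
The hunk above changes `LLM.beam_search` to take a single `BeamSearchParams` object and to rank beams with `get_beam_search_score` (which honors `length_penalty`) instead of the raw cumulative logprob. A minimal usage sketch of the new signature, assuming vLLM is installed; the model name and parameter values are illustrative, not part of this diff:

```
# Hedged sketch of the updated LLM.beam_search() call shown above.
from vllm import LLM
from vllm.sampling_params import BeamSearchParams

llm = LLM(model="facebook/opt-125m")  # illustrative model choice
params = BeamSearchParams(
    beam_width=4,        # replaces the old beam_width keyword argument
    max_tokens=32,       # replaces the old max_tokens keyword argument
    ignore_eos=False,
    temperature=0.0,
    length_penalty=1.0,  # new field; used by the sort_beams_key above
)
outputs = llm.beam_search(["The capital of France is"], params)
print(outputs[0])  # one BeamSearchOutput per prompt
```
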
diff --git a/vllm/entrypoints/logger.py b/vllm/entrypoints/logger.py
index 091896e1c7a69..584ee0d9e1c54 100644
--- a/vllm/entrypoints/logger.py
+++ b/vllm/entrypoints/logger.py
@@ -4,7 +4,7 @@
from vllm.lora.request import LoRARequest
from vllm.pooling_params import PoolingParams
from vllm.prompt_adapter.request import PromptAdapterRequest
-from vllm.sampling_params import SamplingParams
+from vllm.sampling_params import BeamSearchParams, SamplingParams
logger = init_logger(__name__)
@@ -21,7 +21,8 @@ def log_inputs(
request_id: str,
prompt: Optional[str],
prompt_token_ids: Optional[List[int]],
- params: Optional[Union[SamplingParams, PoolingParams]],
+ params: Optional[Union[SamplingParams, PoolingParams,
+ BeamSearchParams]],
lora_request: Optional[LoRARequest],
prompt_adapter_request: Optional[PromptAdapterRequest],
) -> None:
diff --git a/vllm/entrypoints/openai/protocol.py b/vllm/entrypoints/openai/protocol.py
index 623f1180bb443..6f1135f8093ba 100644
--- a/vllm/entrypoints/openai/protocol.py
+++ b/vllm/entrypoints/openai/protocol.py
@@ -11,8 +11,8 @@
from vllm.entrypoints.chat_utils import ChatCompletionMessageParam
from vllm.pooling_params import PoolingParams
-from vllm.sampling_params import (GuidedDecodingParams, RequestOutputKind,
- SamplingParams)
+from vllm.sampling_params import (BeamSearchParams, GuidedDecodingParams,
+ RequestOutputKind, SamplingParams)
from vllm.sequence import Logprob
from vllm.utils import random_uuid
@@ -184,7 +184,6 @@ class ChatCompletionRequest(OpenAIBaseModel):
min_p: float = 0.0
repetition_penalty: float = 1.0
length_penalty: float = 1.0
- early_stopping: bool = False
stop_token_ids: Optional[List[int]] = Field(default_factory=list)
include_stop_str_in_output: bool = False
ignore_eos: bool = False
@@ -288,6 +287,23 @@ class ChatCompletionRequest(OpenAIBaseModel):
# doc: end-chat-completion-extra-params
+ def to_beam_search_params(self,
+ default_max_tokens: int) -> BeamSearchParams:
+ max_tokens = self.max_tokens
+ if max_tokens is None:
+ max_tokens = default_max_tokens
+
+ n = self.n if self.n is not None else 1
+ temperature = self.temperature if self.temperature is not None else 0.0
+
+ return BeamSearchParams(
+ beam_width=n,
+ max_tokens=max_tokens,
+ ignore_eos=self.ignore_eos,
+ temperature=temperature,
+ length_penalty=self.length_penalty,
+ )
+
def to_sampling_params(self, default_max_tokens: int) -> SamplingParams:
max_tokens = self.max_tokens
if max_tokens is None:
@@ -329,12 +345,9 @@ def to_sampling_params(self, default_max_tokens: int) -> SamplingParams:
ignore_eos=self.ignore_eos,
max_tokens=max_tokens,
min_tokens=self.min_tokens,
- use_beam_search=self.use_beam_search,
- early_stopping=self.early_stopping,
skip_special_tokens=self.skip_special_tokens,
spaces_between_special_tokens=self.spaces_between_special_tokens,
include_stop_str_in_output=self.include_stop_str_in_output,
- length_penalty=self.length_penalty,
truncate_prompt_tokens=self.truncate_prompt_tokens,
output_kind=RequestOutputKind.DELTA if self.stream \
else RequestOutputKind.FINAL_ONLY,
@@ -502,7 +515,6 @@ class CompletionRequest(OpenAIBaseModel):
min_p: float = 0.0
repetition_penalty: float = 1.0
length_penalty: float = 1.0
- early_stopping: bool = False
stop_token_ids: Optional[List[int]] = Field(default_factory=list)
include_stop_str_in_output: bool = False
ignore_eos: bool = False
@@ -567,6 +579,23 @@ class CompletionRequest(OpenAIBaseModel):
# doc: end-completion-extra-params
+ def to_beam_search_params(self,
+ default_max_tokens: int) -> BeamSearchParams:
+ max_tokens = self.max_tokens
+ if max_tokens is None:
+ max_tokens = default_max_tokens
+
+ n = self.n if self.n is not None else 1
+ temperature = self.temperature if self.temperature is not None else 0.0
+
+ return BeamSearchParams(
+ beam_width=n,
+ max_tokens=max_tokens,
+ ignore_eos=self.ignore_eos,
+ temperature=temperature,
+ length_penalty=self.length_penalty,
+ )
+
def to_sampling_params(self, default_max_tokens: int) -> SamplingParams:
max_tokens = self.max_tokens
if max_tokens is None:
@@ -609,13 +638,10 @@ def to_sampling_params(self, default_max_tokens: int) -> SamplingParams:
ignore_eos=self.ignore_eos,
max_tokens=max_tokens if not echo_without_generation else 1,
min_tokens=self.min_tokens,
- use_beam_search=self.use_beam_search,
- early_stopping=self.early_stopping,
prompt_logprobs=prompt_logprobs,
skip_special_tokens=self.skip_special_tokens,
spaces_between_special_tokens=self.spaces_between_special_tokens,
include_stop_str_in_output=self.include_stop_str_in_output,
- length_penalty=self.length_penalty,
truncate_prompt_tokens=self.truncate_prompt_tokens,
output_kind=RequestOutputKind.DELTA if self.stream \
else RequestOutputKind.FINAL_ONLY,
@@ -671,6 +697,7 @@ class EmbeddingRequest(OpenAIBaseModel):
encoding_format: Literal["float", "base64"] = "float"
dimensions: Optional[int] = None
user: Optional[str] = None
+ truncate_prompt_tokens: Optional[Annotated[int, Field(ge=1)]] = None
# doc: begin-embedding-pooling-params
additional_data: Optional[Any] = None
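
To make the request-to-params mapping above concrete: when `use_beam_search` is set, the OpenAI `n` field is reused as the beam width and a missing `max_tokens` falls back to the server-side default. A small sketch with illustrative values:

```
# Hedged sketch mirroring to_beam_search_params(); the numbers are illustrative.
from vllm.sampling_params import BeamSearchParams

default_max_tokens = 128      # what the server derives from max_model_len
request_n = 4                 # OpenAI "n" doubles as the beam width
request_max_tokens = None     # left unset in the request

params = BeamSearchParams(
    beam_width=request_n,
    max_tokens=(request_max_tokens
                if request_max_tokens is not None else default_max_tokens),
    ignore_eos=False,
    temperature=0.0,
    length_penalty=1.0,
)
```
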
diff --git a/vllm/entrypoints/openai/serving_chat.py b/vllm/entrypoints/openai/serving_chat.py
index ce529f6f0ff58..c4652be6fe821 100644
--- a/vllm/entrypoints/openai/serving_chat.py
+++ b/vllm/entrypoints/openai/serving_chat.py
@@ -9,6 +9,7 @@
from fastapi import Request
from vllm.config import ModelConfig
+from vllm.engine.async_llm_engine import AsyncLLMEngine
from vllm.engine.protocol import EngineClient
from vllm.entrypoints.chat_utils import (ConversationMessage,
apply_hf_chat_template,
@@ -33,6 +34,7 @@
from vllm.inputs import TokensPrompt
from vllm.logger import init_logger
from vllm.outputs import CompletionOutput, RequestOutput
+from vllm.sampling_params import BeamSearchParams, SamplingParams
from vllm.sequence import Logprob
from vllm.tracing import (contains_trace_headers, extract_trace_headers,
log_tracing_disabled_warning)
@@ -203,9 +205,15 @@ async def create_chat_completion(
assert prompt_inputs is not None
- sampling_params = request.to_sampling_params(
- default_max_tokens=self.max_model_len -
- len(prompt_inputs["prompt_token_ids"]))
+ sampling_params: Union[SamplingParams, BeamSearchParams]
+ default_max_tokens = self.max_model_len - len(
+ prompt_inputs["prompt_token_ids"])
+ if request.use_beam_search:
+ sampling_params = request.to_beam_search_params(
+ default_max_tokens)
+ else:
+ sampling_params = request.to_sampling_params(
+ default_max_tokens)
self._log_inputs(request_id,
prompt_inputs,
@@ -227,15 +235,26 @@ async def create_chat_completion(
and contains_trace_headers(raw_request.headers)):
log_tracing_disabled_warning()
- result_generator = self.engine_client.generate(
- engine_inputs,
- sampling_params,
- request_id,
- lora_request=lora_request,
- trace_headers=trace_headers,
- prompt_adapter_request=prompt_adapter_request,
- priority=request.priority,
- )
+ if isinstance(sampling_params, BeamSearchParams):
+ if not isinstance(self.engine_client, AsyncLLMEngine):
+ raise ValueError(
+ "Beam search in the API server is only supported with"
+                        " AsyncLLMEngine. Please add "
+ "`--disable-frontend-multiprocessing` to "
+ "use beam search.")
+ result_generator = self.engine_client.beam_search(
+ engine_inputs['prompt_token_ids'], request_id,
+ sampling_params)
+ else:
+ result_generator = self.engine_client.generate(
+ engine_inputs,
+ sampling_params,
+ request_id,
+ lora_request=lora_request,
+ trace_headers=trace_headers,
+ prompt_adapter_request=prompt_adapter_request,
+ priority=request.priority,
+ )
except ValueError as e:
# TODO: Use a vllm-specific Validation Error
return self.create_error_response(str(e))
@@ -283,10 +302,6 @@ async def chat_completion_stream_generator(
finish_reason_sent = [False] * num_choices
num_prompt_tokens = 0
- tool_parsers: List[Optional[ToolParser]] = [
- self.tool_parser(tokenizer) if self.tool_parser else None
- ] * num_choices
-
if isinstance(request.tool_choice, ChatCompletionNamedToolChoiceParam):
tool_choice_function_name = request.tool_choice.function.name
else:
@@ -305,6 +320,21 @@ async def chat_completion_stream_generator(
else:
previous_texts, all_previous_token_ids = None, None
+ # Prepare the tool parser if it's needed
+ try:
+ if tool_choice_auto and self.tool_parser:
+ tool_parsers: List[Optional[ToolParser]] = [
+ self.tool_parser(tokenizer)
+ ] * num_choices
+ else:
+ tool_parsers = [None] * num_choices
+ except RuntimeError as e:
+ logger.error("Error in tool parser creation: %s", e)
+ data = self.create_streaming_error_response(str(e))
+ yield f"data: {data}\n\n"
+ yield "data: [DONE]\n\n"
+ return
+
try:
async for res in result_generator:
if res.prompt_token_ids is not None:
@@ -685,7 +715,12 @@ async def chat_completion_full_generator(
or request.tool_choice is None) and self.enable_auto_tools \
and self.tool_parser:
- tool_parser = self.tool_parser(tokenizer)
+ try:
+ tool_parser = self.tool_parser(tokenizer)
+ except RuntimeError as e:
+ logger.error("Error in tool parser creation: %s", e)
+ return self.create_error_response(str(e))
+
tool_call_info = tool_parser.extract_tool_calls(
output.text, request=request)
tools_called = tool_call_info.tools_called
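
With the routing above, beam search over the OpenAI-compatible chat endpoint only works when the server runs `AsyncLLMEngine`, i.e. it is started with `--disable-frontend-multiprocessing`. A hedged client-side sketch; the model name is illustrative, and `use_beam_search`/`length_penalty` are vLLM extras, so they travel in `extra_body`:

```
# Hedged sketch: beam-search chat request against a vLLM server started with
# --disable-frontend-multiprocessing (required by the check added above).
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")
completion = client.chat.completions.create(
    model="meta-llama/Meta-Llama-3-8B-Instruct",  # illustrative model name
    messages=[{"role": "user", "content": "Name three prime numbers."}],
    max_tokens=64,
    temperature=0.0,
    n=4,  # mapped to beam_width by to_beam_search_params()
    extra_body={"use_beam_search": True, "length_penalty": 1.0},
)
print(completion.choices[0].message.content)
```
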
diff --git a/vllm/entrypoints/openai/serving_completion.py b/vllm/entrypoints/openai/serving_completion.py
index 59e69121deb9e..bf9e9850797a6 100644
--- a/vllm/entrypoints/openai/serving_completion.py
+++ b/vllm/entrypoints/openai/serving_completion.py
@@ -8,6 +8,7 @@
from fastapi import Request
from vllm.config import ModelConfig
+from vllm.engine.async_llm_engine import AsyncLLMEngine
from vllm.engine.protocol import EngineClient
from vllm.entrypoints.logger import RequestLogger
# yapf conflicts with isort for this block
@@ -28,6 +29,7 @@
PromptAdapterPath)
from vllm.logger import init_logger
from vllm.outputs import RequestOutput
+from vllm.sampling_params import BeamSearchParams, SamplingParams
from vllm.sequence import Logprob
from vllm.tracing import (contains_trace_headers, extract_trace_headers,
log_tracing_disabled_warning)
@@ -120,9 +122,15 @@ async def create_completion(
))
for i, prompt_inputs in enumerate(prompts):
- sampling_params = request.to_sampling_params(
- default_max_tokens=self.max_model_len -
- len(prompt_inputs["prompt_token_ids"]))
+ sampling_params: Union[SamplingParams, BeamSearchParams]
+ default_max_tokens = self.max_model_len - len(
+ prompt_inputs["prompt_token_ids"])
+ if request.use_beam_search:
+ sampling_params = request.to_beam_search_params(
+ default_max_tokens)
+ else:
+ sampling_params = request.to_sampling_params(
+ default_max_tokens)
request_id_item = f"{request_id}-{i}"
@@ -141,15 +149,29 @@ async def create_completion(
raw_request.headers):
log_tracing_disabled_warning()
- generator = self.engine_client.generate(
- {"prompt_token_ids": prompt_inputs["prompt_token_ids"]},
- sampling_params,
- request_id_item,
- lora_request=lora_request,
- prompt_adapter_request=prompt_adapter_request,
- trace_headers=trace_headers,
- priority=request.priority,
- )
+ if isinstance(sampling_params, BeamSearchParams):
+ if not isinstance(self.engine_client, AsyncLLMEngine):
+ raise ValueError(
+ "Beam search in the API server is only supported"
+                        " with AsyncLLMEngine. Please add "
+ "`--disable-frontend-multiprocessing` to "
+ "use beam search.")
+ generator = self.engine_client.beam_search(
+ prompt_inputs["prompt_token_ids"], request_id_item,
+ sampling_params)
+ else:
+ generator = self.engine_client.generate(
+ {
+ "prompt_token_ids":
+ prompt_inputs["prompt_token_ids"]
+ },
+ sampling_params,
+ request_id_item,
+ lora_request=lora_request,
+ prompt_adapter_request=prompt_adapter_request,
+ trace_headers=trace_headers,
+ priority=request.priority,
+ )
generators.append(generator)
except ValueError as e:
diff --git a/vllm/entrypoints/openai/serving_embedding.py b/vllm/entrypoints/openai/serving_embedding.py
index d6f337a7236d6..e9504cfa64b65 100644
--- a/vllm/entrypoints/openai/serving_embedding.py
+++ b/vllm/entrypoints/openai/serving_embedding.py
@@ -110,6 +110,17 @@ async def create_embedding(
request_id = f"embd-{random_uuid()}"
created_time = int(time.monotonic())
+ truncate_prompt_tokens = None
+
+ if request.truncate_prompt_tokens is not None:
+ if request.truncate_prompt_tokens <= self.max_model_len:
+ truncate_prompt_tokens = request.truncate_prompt_tokens
+ else:
+ return self.create_error_response(
+ "truncate_prompt_tokens value is "
+ "greater than max_model_len."
+                    " Please select a smaller truncation size.")
+
# Schedule the request and get the result generator.
generators: List[AsyncGenerator[EmbeddingRequestOutput, None]] = []
try:
@@ -123,11 +134,9 @@ async def create_embedding(
pooling_params = request.to_pooling_params()
prompts = list(
- self._tokenize_prompt_input_or_inputs(
- request,
- tokenizer,
- request.input,
- ))
+ self._tokenize_prompt_input_or_inputs(request, tokenizer,
+ request.input,
+ truncate_prompt_tokens))
for i, prompt_inputs in enumerate(prompts):
request_id_item = f"{request_id}-{i}"
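
The embedding path above now honors a vLLM-specific `truncate_prompt_tokens` field and rejects values larger than the model context length. A hedged client sketch; the model name and limit are illustrative:

```
# Hedged sketch: passing the new truncate_prompt_tokens option to /v1/embeddings.
# Values above max_model_len are rejected with the error added in the hunk above.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")
response = client.embeddings.create(
    model="intfloat/e5-mistral-7b-instruct",      # illustrative embedding model
    input="a very long document " * 500,
    extra_body={"truncate_prompt_tokens": 512},   # must be <= max_model_len
)
print(len(response.data[0].embedding))
```
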
diff --git a/vllm/entrypoints/openai/serving_engine.py b/vllm/entrypoints/openai/serving_engine.py
index 1a0669d8d12c5..e6d2ab93d3363 100644
--- a/vllm/entrypoints/openai/serving_engine.py
+++ b/vllm/entrypoints/openai/serving_engine.py
@@ -29,7 +29,7 @@
from vllm.lora.request import LoRARequest
from vllm.pooling_params import PoolingParams
from vllm.prompt_adapter.request import PromptAdapterRequest
-from vllm.sampling_params import SamplingParams
+from vllm.sampling_params import BeamSearchParams, SamplingParams
from vllm.sequence import Logprob
from vllm.transformers_utils.tokenizer import AnyTokenizer
from vllm.utils import AtomicCounter
@@ -371,7 +371,8 @@ def _log_inputs(
self,
request_id: str,
inputs: Union[str, List[int], TextTokensPrompt],
- params: Optional[Union[SamplingParams, PoolingParams]],
+ params: Optional[Union[SamplingParams, PoolingParams,
+ BeamSearchParams]],
lora_request: Optional[LoRARequest],
prompt_adapter_request: Optional[PromptAdapterRequest],
) -> None:
diff --git a/vllm/entrypoints/openai/tool_parsers/hermes_tool_parser.py b/vllm/entrypoints/openai/tool_parsers/hermes_tool_parser.py
index 40f041767190b..6c5bcc7dd59b1 100644
--- a/vllm/entrypoints/openai/tool_parsers/hermes_tool_parser.py
+++ b/vllm/entrypoints/openai/tool_parsers/hermes_tool_parser.py
@@ -50,10 +50,10 @@ def __init__(self, tokenizer: AnyTokenizer):
raise ValueError(
"The model tokenizer must be passed to the ToolParser "
"constructor during construction.")
- self.tool_call_start_token_id: int = self.model_tokenizer.vocab[
- self.tool_call_start_token]
- self.tool_call_end_token_id: int = self.model_tokenizer.vocab[
- self.tool_call_end_token]
+ self.tool_call_start_token_id: int = self.model_tokenizer.vocab.get(
+ self.tool_call_start_token, None)
+ self.tool_call_end_token_id: int = self.model_tokenizer.vocab.get(
+ self.tool_call_end_token, None)
if not self.tool_call_start_token_id or not self.tool_call_end_token_id:
raise RuntimeError(
"Hermes 2 Pro Tool parser could not locate tool call start/end "
diff --git a/vllm/entrypoints/openai/tool_parsers/mistral_tool_parser.py b/vllm/entrypoints/openai/tool_parsers/mistral_tool_parser.py
index 1db30797ac6fc..9580fa115c6b3 100644
--- a/vllm/entrypoints/openai/tool_parsers/mistral_tool_parser.py
+++ b/vllm/entrypoints/openai/tool_parsers/mistral_tool_parser.py
@@ -61,8 +61,13 @@ def __init__(self, tokenizer: AnyTokenizer):
self.streamed_args_for_tool: List[str] = [
] # map what has been streamed for each tool so far to a list
self.bot_token = "[TOOL_CALLS]"
- self.bot_token_id = self.model_tokenizer.get_vocab()[self.bot_token]
+ self.bot_token_id = self.model_tokenizer.get_vocab().get(
+ self.bot_token, None)
self.tool_call_regex = re.compile(r"\[{.*?}\]", re.DOTALL)
+ if not self.bot_token_id:
+ raise RuntimeError(
+ "Mistral Tool Parser could not locate the tool call token in "
+ "the tokenizer!")
def extract_tool_calls(
self,
diff --git a/vllm/envs.py b/vllm/envs.py
index 0f46ac4f61fdf..d15cded416385 100644
--- a/vllm/envs.py
+++ b/vllm/envs.py
@@ -63,7 +63,6 @@
VLLM_TORCH_PROFILER_DIR: Optional[str] = None
VLLM_USE_TRITON_AWQ: bool = False
VLLM_ALLOW_RUNTIME_LORA_UPDATING: bool = False
- VLLM_ALLOW_DEPRECATED_BEAM_SEARCH: bool = False
VLLM_SKIP_P2P_CHECK: bool = False
@@ -198,10 +197,6 @@ def get_default_config_root():
lambda: (os.environ.get("VLLM_USE_TRITON_FLASH_ATTN", "True").lower() in
("true", "1")),
- # If set, allowing the use of deprecated beam search implementation
- "VLLM_ALLOW_DEPRECATED_BEAM_SEARCH":
- lambda: os.environ.get("VLLM_ALLOW_DEPRECATED_BEAM_SEARCH", "0") == "1",
-
# Internal flag to enable Dynamo graph capture
"VLLM_TEST_DYNAMO_GRAPH_CAPTURE":
lambda: int(os.environ.get("VLLM_TEST_DYNAMO_GRAPH_CAPTURE", "0")),
diff --git a/vllm/lora/models.py b/vllm/lora/models.py
index 1f80c716bc481..91e9f55e82433 100644
--- a/vllm/lora/models.py
+++ b/vllm/lora/models.py
@@ -24,8 +24,7 @@
from vllm.lora.punica import PunicaWrapper
from vllm.lora.utils import (from_layer, from_layer_logits_processor,
parse_fine_tuned_lora_name, replace_submodule)
-from vllm.model_executor.models.interfaces import (SupportsLoRA,
- supports_multimodal)
+from vllm.model_executor.models import SupportsLoRA, supports_multimodal
from vllm.model_executor.models.module_mapping import MultiModelKeys
from vllm.model_executor.models.utils import PPMissingLayer
from vllm.utils import is_pin_memory_available
diff --git a/vllm/model_executor/layers/fused_moe/fused_marlin_moe.py b/vllm/model_executor/layers/fused_moe/fused_marlin_moe.py
index 8177e846127ee..5964d5a5465fd 100644
--- a/vllm/model_executor/layers/fused_moe/fused_marlin_moe.py
+++ b/vllm/model_executor/layers/fused_moe/fused_marlin_moe.py
@@ -10,15 +10,24 @@
from vllm.scalar_type import scalar_types
+def get_scalar_type(num_bits: int, has_zp: bool):
+ if has_zp:
+ assert num_bits == 4
+ return scalar_types.uint4
+ else:
+ return scalar_types.uint4b8 if num_bits == 4 else scalar_types.uint8b128
+
+
def single_marlin_moe(
hidden_states: torch.Tensor,
w: torch.Tensor,
scales: torch.Tensor,
gating_output: torch.Tensor,
- g_idx: torch.Tensor,
- perm: torch.Tensor,
topk: int,
renormalize: bool,
+ g_idx: Optional[torch.Tensor] = None,
+ sort_indices: Optional[torch.Tensor] = None,
+ w_zeros: Optional[torch.Tensor] = None,
override_config: Optional[Dict[str, Any]] = None,
num_bits: int = 8,
is_k_full: bool = True,
@@ -34,10 +43,12 @@ def single_marlin_moe(
- scales (torch.Tensor): The quantization scales.
- gating_output (torch.Tensor): The output of the gating operation
(before softmax).
- - g_idx (torch.Tensor): The act_order indices.
- - perm (torch.Tensor): The act_order input permutation.
+ - g_idx (Optional[torch.Tensor]): Optional act_order indices.
+ - sort_indices (Optional[torch.Tensor]): Optional act_order input
+ permutation.
- topk (int): The number of top-k experts to select.
- renormalize (bool): If True, renormalize the top-k weights to sum to 1.
+ - w_zeros (Optional[torch.Tensor]): Optional zero points to be used for w.
- override_config (Optional[Dict[str, Any]]): Optional override
for the kernel configuration.
- num_bits (bool): The number of bits in expert weights quantization.
@@ -79,16 +90,34 @@ def single_marlin_moe(
max_workspace_size = (N // 64) * 16
workspace = torch.zeros(max_workspace_size,
dtype=torch.int,
- device="cuda",
+ device=hidden_states.device,
+ requires_grad=False)
+
+ has_zero_point = w_zeros is not None
+ if w_zeros is None:
+ w_zeros = torch.empty((0, 0),
+ dtype=hidden_states.dtype,
+ device=hidden_states.device,
+ requires_grad=False)
+
+ if g_idx is None:
+ g_idx = torch.empty((0, 0),
+ dtype=torch.int32,
+ device=hidden_states.device,
requires_grad=False)
- scalar_type = (scalar_types.uint4b8
- if num_bits == 4 else scalar_types.uint8b128)
+ if sort_indices is None:
+ sort_indices = torch.empty((0),
+ dtype=torch.int32,
+ device=hidden_states.device,
+ requires_grad=False)
+
+ scalar_type = get_scalar_type(num_bits, has_zero_point)
intermediate_cache = torch.ops._moe_C.marlin_gemm_moe(
hidden_states, w, sorted_token_ids, topk_weights, topk_ids, scales,
- g_idx, perm, workspace, scalar_type, M, N, K, is_k_full, E, topk,
- block_size_m, True, False)
+ w_zeros, g_idx, sort_indices, workspace, scalar_type, M, N, K,
+ is_k_full, E, topk, block_size_m, True, False)
return torch.sum(intermediate_cache.view(*intermediate_cache.shape), dim=1)
@@ -97,16 +126,18 @@ def fused_marlin_moe(
hidden_states: torch.Tensor,
w1: torch.Tensor,
w2: torch.Tensor,
+ w1_scale: torch.Tensor,
+ w2_scale: torch.Tensor,
gating_output: torch.Tensor,
- g_idx1: torch.Tensor,
- g_idx2: torch.Tensor,
- perm1: torch.Tensor,
- perm2: torch.Tensor,
topk_weights: torch.Tensor,
topk_ids: torch.Tensor,
+ g_idx1: Optional[torch.Tensor] = None,
+ g_idx2: Optional[torch.Tensor] = None,
+ sort_indices1: Optional[torch.Tensor] = None,
+ sort_indices2: Optional[torch.Tensor] = None,
+ w1_zeros: Optional[torch.Tensor] = None,
+ w2_zeros: Optional[torch.Tensor] = None,
override_config: Optional[Dict[str, Any]] = None,
- w1_scale: Optional[torch.Tensor] = None,
- w2_scale: Optional[torch.Tensor] = None,
num_bits: int = 8,
is_k_full: bool = True,
) -> torch.Tensor:
@@ -118,21 +149,22 @@ def fused_marlin_moe(
- hidden_states (torch.Tensor): The input tensor to the MoE layer.
- w1 (torch.Tensor): The first set of expert weights.
- w2 (torch.Tensor): The second set of expert weights.
+ - w1_scale (torch.Tensor): Scale to be used for w1.
+ - w2_scale (torch.Tensor): Scale to be used for w2.
- gating_output (torch.Tensor): The output of the gating operation
(before softmax).
- - g_idx1 (torch.Tensor): The first set of act_order indices.
- - g_idx2 (torch.Tensor): The second set of act_order indices.
- - perm1 (torch.Tensor): The first act_order input permutation.
- - perm2 (torch.Tensor): The second act_order input permutation.
+ - g_idx1 (Optional[torch.Tensor]): The first set of act_order indices.
+ - g_idx2 (Optional[torch.Tensor]): The second set of act_order indices.
+ - sort_indices1 (Optional[torch.Tensor]): The first act_order input
+ permutation.
+ - sort_indices2 (Optional[torch.Tensor]): The second act_order input
+ permutation.
- topk_weights (torch.Tensor): Top-k weights.
- topk_ids (torch.Tensor): Indices of topk-k elements.
- - renormalize (bool): If True, renormalize the top-k weights to sum to 1.
- override_config (Optional[Dict[str, Any]]): Optional override
for the kernel configuration.
- - w1_scale (Optional[torch.Tensor]): Optional scale to be used for
- w1.
- - w2_scale (Optional[torch.Tensor]): Optional scale to be used for
- w2.
+ - w1_zeros (Optional[torch.Tensor]): Optional zero points to be used for w1.
+ - w2_zeros (Optional[torch.Tensor]): Optional zero points to be used for w2.
- num_bits (bool): The number of bits in expert weights quantization.
Returns:
@@ -152,6 +184,20 @@ def fused_marlin_moe(
assert hidden_states.dtype == torch.float16
assert num_bits in [4, 8]
+ has_no_act_order = (g_idx1 is None and g_idx2 is None
+ and sort_indices1 is None and sort_indices2 is None)
+ has_all_act_order = (g_idx1 is not None and g_idx2 is not None
+ and sort_indices1 is not None
+ and sort_indices2 is not None)
+ assert has_no_act_order or has_all_act_order, (
+        "g_idx and sort_indices "
+        "must either all be None or all be not None")
+
+ has_no_zp = w1_zeros is None and w2_zeros is None
+ has_all_zp = w1_zeros is not None and w2_zeros is not None
+    assert has_no_zp or has_all_zp, ("zero points must either both be None "
+                                     "or both be not None")
+
M, K = hidden_states.shape
E = w1.shape[0]
N = w2.shape[1] * 16
@@ -172,14 +218,42 @@ def fused_marlin_moe(
sorted_token_ids, _, _ = moe_align_block_size(topk_ids, block_size_m, E)
- max_workspace_size = ((M + 255) // 256) * (max(2 * N, K) // 64) * 16
+ max_workspace_size = (max(2 * N, K) // 64) * 16
workspace = torch.zeros(max_workspace_size,
dtype=torch.int,
device="cuda",
requires_grad=False)
- scalar_type = (scalar_types.uint4b8
- if num_bits == 4 else scalar_types.uint8b128)
+ if has_no_zp:
+ w1_zeros = torch.empty((0, 0),
+ dtype=hidden_states.dtype,
+ device=hidden_states.device,
+ requires_grad=False)
+ w2_zeros = torch.empty((0, 0),
+ dtype=hidden_states.dtype,
+ device=hidden_states.device,
+ requires_grad=False)
+
+ if has_no_act_order:
+ g_idx1 = torch.empty((0, 0),
+ dtype=torch.int32,
+ device=hidden_states.device,
+ requires_grad=False)
+ g_idx2 = torch.empty((0, 0),
+ dtype=torch.int32,
+ device=hidden_states.device,
+ requires_grad=False)
+ sort_indices1 = torch.empty((0),
+ dtype=torch.int32,
+ device=hidden_states.device,
+ requires_grad=False)
+ sort_indices2 = torch.empty((0, 0),
+ dtype=torch.int32,
+ device=hidden_states.device,
+ requires_grad=False)
+
+ scalar_type1 = get_scalar_type(num_bits, has_all_zp)
+ scalar_type2 = get_scalar_type(num_bits, has_all_zp)
intermediate_cache2 = torch.empty(
(M * topk_ids.shape[1], N),
@@ -194,10 +268,11 @@ def fused_marlin_moe(
topk_weights,
topk_ids,
w1_scale,
+ w1_zeros,
g_idx1,
- perm1,
+ sort_indices1,
workspace,
- scalar_type,
+ scalar_type1,
M,
2 * N,
K,
@@ -218,10 +293,11 @@ def fused_marlin_moe(
topk_weights,
topk_ids,
w2_scale,
+ w2_zeros,
g_idx2,
- perm2,
+ sort_indices2,
workspace,
- scalar_type,
+ scalar_type2,
M,
K,
N,
diff --git a/vllm/model_executor/layers/linear.py b/vllm/model_executor/layers/linear.py
index f7af9f328e887..77ba6200e7842 100644
--- a/vllm/model_executor/layers/linear.py
+++ b/vllm/model_executor/layers/linear.py
@@ -440,17 +440,23 @@ def weight_loader(self,
param.shard_weight_type[loaded_shard_id] = loaded_weight.item()
return
- if is_gguf_weight and isinstance(param, UninitializedParameter):
- from gguf.constants import GGML_QUANT_SIZES
+ if is_gguf_weight:
+ tp_size = get_tensor_model_parallel_world_size()
+ tp_rank = get_tensor_model_parallel_rank()
+
+ output_dim = getattr(param, "output_dim", None)
+ shard_size = loaded_weight.size(output_dim) // tp_size
+ start_idx = tp_rank * shard_size
+
+ loaded_weight = loaded_weight.narrow(output_dim, start_idx,
+ shard_size)
- ori_shape = param.tensor_shape
- weight_types = self.qweight_type.shard_weight_type.values()
- row_size = []
- for weight_type in weight_types:
- block_size, type_size = GGML_QUANT_SIZES[weight_type]
- row_size.append(ori_shape[1] // block_size * type_size)
- q_shape = (ori_shape[0], max(row_size))
- param.materialize(q_shape, dtype=loaded_weight.dtype)
+ param.shard_id.append(loaded_shard_id)
+ param.shard_id_map[loaded_shard_id] = len(param.data_container)
+ param.data_container.append(loaded_weight)
+ if len(param.data_container) == 2:
+ self.qweight = param.materialize_nested()
+ return
param_data = param.data
output_dim = getattr(param, "output_dim", None)
@@ -515,18 +521,6 @@ def weight_loader(self,
shard_offset = loaded_weight.shape[output_dim] * \
loaded_shard_id
- if is_gguf_weight:
- tp_size = get_tensor_model_parallel_world_size()
- output_dim = getattr(param, "output_dim", None)
- shard_shape = list(loaded_weight.shape)
- shard_shape[output_dim] = shard_shape[output_dim] // tp_size
- param.shard_id.append(loaded_shard_id)
- param.shard_size[loaded_shard_id] = shard_shape
-
- input_dim = getattr(param, "input_dim", None)
- input_size = loaded_weight.shape[input_dim]
- param_data = param_data.narrow(input_dim, 0, input_size)
-
param_data = param_data.narrow(output_dim, shard_offset,
shard_size)
start_idx = tp_rank * shard_size
@@ -783,17 +777,23 @@ def weight_loader(self,
param.shard_weight_type[loaded_shard_id] = loaded_weight.item()
return
- if is_gguf_weight and isinstance(param, UninitializedParameter):
- from gguf.constants import GGML_QUANT_SIZES
+ if is_gguf_weight:
+ tp_size = get_tensor_model_parallel_world_size()
+ tp_rank = get_tensor_model_parallel_rank()
- ori_shape = param.tensor_shape
- weight_types = self.qweight_type.shard_weight_type.values()
- row_size = []
- for weight_type in weight_types:
- block_size, type_size = GGML_QUANT_SIZES[weight_type]
- row_size.append(ori_shape[1] // block_size * type_size)
- q_shape = (ori_shape[0], max(row_size))
- param.materialize(q_shape, dtype=loaded_weight.dtype)
+ output_dim = getattr(param, "output_dim", None)
+ shard_size = loaded_weight.size(output_dim) // tp_size
+ start_idx = tp_rank * shard_size
+
+ loaded_weight = loaded_weight.narrow(output_dim, start_idx,
+ shard_size)
+
+ param.shard_id.append(loaded_shard_id)
+ param.shard_id_map[loaded_shard_id] = len(param.data_container)
+ param.data_container.append(loaded_weight)
+ if len(param.data_container) == 3:
+ self.qweight = param.materialize_nested()
+ return
param_data = param.data
output_dim = getattr(param, "output_dim", None)
@@ -883,18 +883,6 @@ def weight_loader(self,
shard_size, shard_offset = adjust_bitsandbytes_4bit_shard(
param, orig_qkv_offsets, loaded_shard_id)
- if is_gguf_weight:
- tp_size = get_tensor_model_parallel_world_size()
- output_dim = getattr(param, "output_dim", None)
- shard_shape = list(loaded_weight.shape)
- shard_shape[output_dim] = shard_shape[output_dim] // tp_size
- param.shard_id.append(loaded_shard_id)
- param.shard_size[loaded_shard_id] = shard_shape
-
- input_dim = getattr(param, "input_dim", None)
- input_size = loaded_weight.shape[input_dim]
- param_data = param_data.narrow(input_dim, 0, input_size)
-
param_data = param_data.narrow(output_dim, shard_offset,
shard_size)
if loaded_shard_id == "q":
diff --git a/vllm/model_executor/layers/quantization/awq_marlin.py b/vllm/model_executor/layers/quantization/awq_marlin.py
index fe33b7341fd38..294fe11815c0f 100644
--- a/vllm/model_executor/layers/quantization/awq_marlin.py
+++ b/vllm/model_executor/layers/quantization/awq_marlin.py
@@ -1,16 +1,21 @@
-from typing import Any, Dict, List, Optional
+from typing import Any, Callable, Dict, List, Optional
import torch
+from torch.nn import Parameter
from vllm import _custom_ops as ops
from vllm.logger import init_logger
-from vllm.model_executor.layers.linear import LinearBase, LinearMethodBase
+from vllm.model_executor.layers.fused_moe.layer import (
+ FusedMoE, FusedMoEMethodBase, FusedMoeWeightScaleSupported)
+from vllm.model_executor.layers.linear import (LinearBase, LinearMethodBase,
+ set_weight_attrs)
from vllm.model_executor.layers.quantization.base_config import (
- QuantizationConfig)
+ QuantizationConfig, QuantizeMethodBase)
from vllm.model_executor.layers.quantization.utils import replace_parameter
from vllm.model_executor.layers.quantization.utils.marlin_utils import (
apply_awq_marlin_linear, awq_to_marlin_zero_points, check_marlin_supported,
- marlin_make_empty_g_idx, marlin_make_workspace, marlin_permute_scales,
+ marlin_make_empty_g_idx, marlin_make_workspace, marlin_moe_permute_scales,
+ marlin_permute_scales, moe_awq_to_marlin_zero_points,
verify_marlin_supported, verify_marlin_supports_shape)
from vllm.model_executor.layers.vocab_parallel_embedding import ParallelLMHead
from vllm.model_executor.parameter import (GroupQuantScaleParameter,
@@ -35,12 +40,13 @@ def __init__(self, weight_bits: int, group_size: int, has_zp: bool,
self.group_size = group_size
self.has_zp = has_zp
self.lm_head_quantized = lm_head_quantized
+ self.weight_bits = weight_bits
- if weight_bits not in self.TYPE_MAP:
- raise ValueError(f"Unsupported num_bits = {weight_bits}. "
+ if self.weight_bits not in self.TYPE_MAP:
+ raise ValueError(f"Unsupported num_bits = {self.weight_bits}. "
f"Supported num_bits = {self.TYPE_MAP.keys()}")
- self.quant_type = self.TYPE_MAP[weight_bits]
+ self.quant_type = self.TYPE_MAP[self.weight_bits]
verify_marlin_supported(self.quant_type,
group_size=self.group_size,
@@ -98,10 +104,12 @@ def override_quantization_method(cls, hf_quant_cfg,
return None
def get_quant_method(self, layer: torch.nn.Module,
- prefix: str) -> Optional["AWQMarlinLinearMethod"]:
+ prefix: str) -> Optional["QuantizeMethodBase"]:
if (isinstance(layer, LinearBase) or
(isinstance(layer, ParallelLMHead) and self.lm_head_quantized)):
return AWQMarlinLinearMethod(self)
+ elif isinstance(layer, FusedMoE):
+ return AWQMoEMethod(self)
return None
def get_scaled_act_names(self) -> List[str]:
@@ -271,4 +279,182 @@ def apply(
quant_type=self.quant_config.quant_type,
output_size_per_partition=layer.output_size_per_partition,
input_size_per_partition=layer.input_size_per_partition,
- bias=bias)
\ No newline at end of file
+ bias=bias)
+
+
+class AWQMoEMethod(FusedMoEMethodBase):
+
+ def __init__(self, quant_config: AWQMarlinConfig):
+ self.quant_config = quant_config
+
+ def create_weights(self, layer: torch.nn.Module, num_experts: int,
+ hidden_size: int, intermediate_size: int,
+ params_dtype: torch.dtype, **extra_weight_attrs):
+ extra_weight_attrs.update({
+ "is_transposed":
+ True,
+ "quant_method":
+ FusedMoeWeightScaleSupported.GROUP.value,
+ })
+
+ w13_qweight = Parameter(torch.empty(num_experts,
+ hidden_size,
+ 2 * intermediate_size //
+ self.quant_config.pack_factor,
+ dtype=torch.int32),
+ requires_grad=False)
+ layer.register_parameter("w13_qweight", w13_qweight)
+ set_weight_attrs(w13_qweight, extra_weight_attrs)
+
+ w2_qweight = Parameter(torch.empty(num_experts,
+ intermediate_size,
+ hidden_size //
+ self.quant_config.pack_factor,
+ dtype=torch.int32),
+ requires_grad=False)
+ layer.register_parameter("w2_qweight", w2_qweight)
+ set_weight_attrs(w2_qweight, extra_weight_attrs)
+
+ num_groups_w13 = hidden_size // self.quant_config.group_size
+ num_groups_w2 = intermediate_size // self.quant_config.group_size
+
+ # WEIGHT_SCALES
+ # Allocate 2 scales for w1 and w3 respectively.
+ w13_scales = Parameter(torch.empty(num_experts,
+ num_groups_w13,
+ intermediate_size * 2,
+ dtype=params_dtype),
+ requires_grad=False)
+ layer.register_parameter("w13_scales", w13_scales)
+ set_weight_attrs(w13_scales, extra_weight_attrs)
+
+ w2_scales = Parameter(torch.empty(num_experts,
+ num_groups_w2,
+ hidden_size,
+ dtype=params_dtype),
+ requires_grad=False)
+ layer.register_parameter("w2_scales", w2_scales)
+ set_weight_attrs(w2_scales, extra_weight_attrs)
+
+ # WEIGHT_ZERO_POINT
+ # Allocate 2 zero points for w1 and w3 respectively.
+ w13_qzeros = Parameter(torch.empty(num_experts,
+ num_groups_w13,
+ 2 * intermediate_size //
+ self.quant_config.pack_factor,
+ dtype=torch.int32),
+ requires_grad=False)
+ layer.register_parameter("w13_qzeros", w13_qzeros)
+ set_weight_attrs(w13_qzeros, extra_weight_attrs)
+
+ w2_qzeros = Parameter(torch.empty(num_experts,
+ num_groups_w2,
+ hidden_size //
+ self.quant_config.pack_factor,
+ dtype=torch.int32),
+ requires_grad=False)
+ layer.register_parameter("w2_qzeros", w2_qzeros)
+ set_weight_attrs(w2_qzeros, extra_weight_attrs)
+
+ def process_weights_after_loading(self, layer: torch.nn.Module) -> None:
+ num_experts = layer.w13_qweight.shape[0]
+ device = layer.w13_qweight.device
+
+ layer.w13_g_idx_sort_indices = torch.nn.Parameter(
+ torch.empty((num_experts, 0), dtype=torch.int32, device=device),
+ requires_grad=False,
+ )
+ layer.w2_g_idx_sort_indices = torch.nn.Parameter(
+ torch.empty((num_experts, 0), dtype=torch.int32, device=device),
+ requires_grad=False,
+ )
+
+ marlin_w13_qweight = ops.awq_marlin_moe_repack(
+ layer.w13_qweight,
+ layer.w13_g_idx_sort_indices,
+ size_k=layer.w13_qweight.shape[1],
+ size_n=layer.w13_qweight.shape[2] * self.quant_config.pack_factor,
+ num_bits=self.quant_config.weight_bits,
+ )
+ replace_parameter(layer, "w13_qweight", marlin_w13_qweight)
+
+ marlin_w2_qweight = ops.awq_marlin_moe_repack(
+ layer.w2_qweight,
+ layer.w2_g_idx_sort_indices,
+ size_k=layer.w2_qweight.shape[1],
+ size_n=layer.w2_qweight.shape[2] * self.quant_config.pack_factor,
+ num_bits=self.quant_config.weight_bits,
+ )
+ replace_parameter(layer, "w2_qweight", marlin_w2_qweight)
+
+ # Why does this take the intermediate size for size_k?
+ marlin_w13_scales = marlin_moe_permute_scales(
+ s=layer.w13_scales,
+ size_k=layer.intermediate_size_per_partition,
+ size_n=layer.w13_scales.shape[2],
+ group_size=self.quant_config.group_size,
+ )
+
+ replace_parameter(layer, "w13_scales", marlin_w13_scales)
+
+ marlin_w2_scales = marlin_moe_permute_scales(
+ s=layer.w2_scales,
+ size_k=layer.intermediate_size_per_partition,
+ size_n=layer.w2_scales.shape[2],
+ group_size=self.quant_config.group_size,
+ )
+ replace_parameter(layer, "w2_scales", marlin_w2_scales)
+
+ marlin_w13_zp = moe_awq_to_marlin_zero_points(
+ layer.w13_qzeros,
+ size_k=layer.w13_qzeros.shape[1],
+ size_n=layer.w13_qzeros.shape[2] * self.quant_config.pack_factor,
+ num_bits=self.quant_config.weight_bits)
+ replace_parameter(layer, "w13_qzeros", marlin_w13_zp)
+
+ marlin_w2_zp = moe_awq_to_marlin_zero_points(
+ layer.w2_qzeros,
+ size_k=layer.w2_qzeros.shape[1],
+ size_n=layer.w2_qzeros.shape[2] * self.quant_config.pack_factor,
+ num_bits=self.quant_config.weight_bits)
+ replace_parameter(layer, "w2_qzeros", marlin_w2_zp)
+
+ def apply(
+ self,
+ layer: torch.nn.Module,
+ x: torch.Tensor,
+ router_logits: torch.Tensor,
+ top_k: int,
+ renormalize: bool = True,
+ use_grouped_topk: bool = False,
+ num_expert_group: Optional[int] = None,
+ topk_group: Optional[int] = None,
+ custom_routing_function: Optional[Callable] = None,
+ ) -> torch.Tensor:
+
+ from vllm.model_executor.layers.fused_moe.fused_marlin_moe import (
+ fused_marlin_moe)
+
+ topk_weights, topk_ids = FusedMoE.select_experts(
+ hidden_states=x,
+ router_logits=router_logits,
+ use_grouped_topk=use_grouped_topk,
+ top_k=top_k,
+ renormalize=renormalize,
+ topk_group=topk_group,
+ num_expert_group=num_expert_group,
+ custom_routing_function=custom_routing_function)
+
+ return fused_marlin_moe(
+ x,
+ layer.w13_qweight,
+ layer.w2_qweight,
+ layer.w13_scales,
+ layer.w2_scales,
+ router_logits,
+ topk_weights,
+ topk_ids,
+ w1_zeros=layer.w13_qzeros,
+ w2_zeros=layer.w2_qzeros,
+ num_bits=self.quant_config.weight_bits,
+ )
diff --git a/vllm/model_executor/layers/quantization/compressed_tensors/compressed_tensors_moe.py b/vllm/model_executor/layers/quantization/compressed_tensors/compressed_tensors_moe.py
index 6666a4bf1f26a..af04d725159f9 100644
--- a/vllm/model_executor/layers/quantization/compressed_tensors/compressed_tensors_moe.py
+++ b/vllm/model_executor/layers/quantization/compressed_tensors/compressed_tensors_moe.py
@@ -498,14 +498,14 @@ def apply(
x,
layer.w13_weight_packed,
layer.w2_weight_packed,
+ layer.w13_weight_scale,
+ layer.w2_weight_scale,
router_logits,
- layer.w13_g_idx,
- layer.w2_g_idx,
- layer.w13_g_idx_sort_indices,
- layer.w2_g_idx_sort_indices,
topk_weights,
topk_ids,
- w1_scale=layer.w13_weight_scale,
- w2_scale=layer.w2_weight_scale,
+ g_idx1=layer.w13_g_idx,
+ g_idx2=layer.w2_g_idx,
+ sort_indices1=layer.w13_g_idx_sort_indices,
+ sort_indices2=layer.w2_g_idx_sort_indices,
num_bits=self.num_bits,
)
diff --git a/vllm/model_executor/layers/quantization/gguf.py b/vllm/model_executor/layers/quantization/gguf.py
index dc83017bcc7f9..d73b9f6d92832 100644
--- a/vllm/model_executor/layers/quantization/gguf.py
+++ b/vllm/model_executor/layers/quantization/gguf.py
@@ -86,15 +86,16 @@ def create_weights(self, layer: torch.nn.Module,
output_size_per_partition = sum(output_partition_sizes)
tensor_shape = (output_size_per_partition, input_size_per_partition)
- qweight = UninitializedParameter(requires_grad=False)
+ qweight = GGUFUninitializedParameter(requires_grad=False)
set_weight_attrs(
qweight, {
"input_dim": 1,
"output_dim": 0,
"tensor_shape": tensor_shape,
"is_gguf_weight": True,
- "shard_size": {},
+ "data_container": [],
"shard_id": [],
+ "shard_id_map": {},
})
set_weight_attrs(qweight, extra_weight_attrs)
layer.register_parameter("qweight", qweight)
@@ -116,21 +117,17 @@ def apply(self,
layer: torch.nn.Module,
x: torch.Tensor,
bias: Optional[torch.Tensor] = None) -> torch.Tensor:
- shard_size = getattr(layer.qweight, "shard_size", None)
shard_id = getattr(layer.qweight, "shard_id", None)
- if shard_id and shard_size:
- result = []
- offset = 0
+ if shard_id:
# dequantize shard weights respectively
shard_id = ["q", "k", "v"] if "q" in shard_id else shard_id
+ qweight = layer.qweight.unbind(0)
+ result = []
for id in shard_id:
- shard_weight = layer.qweight[
- offset:offset +
- shard_size[id][0], :shard_size[id][1]].contiguous()
+ q_idx = layer.qweight.shard_id_map[id]
qweight_type = layer.qweight_type.shard_weight_type[id]
- result.append(_fuse_mul_mat(x, shard_weight, qweight_type))
- offset += shard_size[id][0]
+ result.append(_fuse_mul_mat(x, qweight[q_idx], qweight_type))
out = torch.cat(result, axis=1)
else:
qweight = layer.qweight
@@ -162,3 +159,20 @@ def embedding(self, layer: torch.nn.Module,
dequant = ops.ggml_dequantize(quant, qweight_type, hidden_size,
x_flat.shape[0])
return dequant.view(*x.shape, hidden_size)
+
+
+class GGUFUninitializedParameter(UninitializedParameter):
+ cls_to_become = Parameter
+ data_container: List[torch.Tensor]
+
+ def materialize_nested(self) -> Parameter:
+ nested_data = torch.nested.nested_tensor(self.data_container,
+ device=self.device,
+ dtype=torch.uint8)
+ self.data_container.clear()
+ param = torch.Tensor._make_subclass(self.cls_to_become,
+ nested_data,
+ require_grad=False)
+ for k, v in self.__dict__.items():
+ setattr(param, k, v)
+ return param
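
The GGUF change above collects per-shard quantized weights in `data_container` and, once every shard has arrived, packs them into a single nested tensor; `unbind(0)` in `apply` then recovers the individual shards without tracking offsets. A standalone sketch of that pattern with illustrative shapes:

```
# Hedged sketch of the nested-tensor layout used by GGUFUninitializedParameter.
import torch

# Ragged shards, e.g. fused q/k/v projections with different quantized row sizes.
shard_q = torch.randint(0, 256, (1024, 144), dtype=torch.uint8)
shard_k = torch.randint(0, 256, (256, 144), dtype=torch.uint8)
shard_v = torch.randint(0, 256, (256, 160), dtype=torch.uint8)

packed = torch.nested.nested_tensor([shard_q, shard_k, shard_v],
                                    dtype=torch.uint8)
shard_id_map = {"q": 0, "k": 1, "v": 2}      # mirrors the attribute added above
shards = packed.unbind(0)                    # recover shards in insertion order
assert shards[shard_id_map["k"]].shape == shard_k.shape
```
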
diff --git a/vllm/model_executor/layers/quantization/gptq_marlin.py b/vllm/model_executor/layers/quantization/gptq_marlin.py
index 3d3ce711e58b0..e77191796bd7e 100644
--- a/vllm/model_executor/layers/quantization/gptq_marlin.py
+++ b/vllm/model_executor/layers/quantization/gptq_marlin.py
@@ -557,14 +557,14 @@ def apply(
x,
layer.w13_qweight,
layer.w2_qweight,
+ layer.w13_scales,
+ layer.w2_scales,
router_logits,
- layer.w13_g_idx,
- layer.w2_g_idx,
- layer.w13_g_idx_sort_indices,
- layer.w2_g_idx_sort_indices,
topk_weights,
topk_ids,
- w1_scale=layer.w13_scales,
- w2_scale=layer.w2_scales,
+ g_idx1=layer.w13_g_idx,
+ g_idx2=layer.w2_g_idx,
+ sort_indices1=layer.w13_g_idx_sort_indices,
+ sort_indices2=layer.w2_g_idx_sort_indices,
num_bits=self.quant_config.quant_type.size_bits,
).to(orig_dtype)
diff --git a/vllm/model_executor/layers/quantization/utils/marlin_utils.py b/vllm/model_executor/layers/quantization/utils/marlin_utils.py
index 53762965732ce..9a1defa409714 100644
--- a/vllm/model_executor/layers/quantization/utils/marlin_utils.py
+++ b/vllm/model_executor/layers/quantization/utils/marlin_utils.py
@@ -208,6 +208,7 @@ def marlin_moe_permute_scales(
device=s.device,
dtype=s.dtype,
)
+
for e in range(num_experts):
output[e] = marlin_permute_scales(s[e], size_k, size_n, group_size)
return output
@@ -258,6 +259,20 @@ def awq_to_marlin_zero_points(q_zp_packed: torch.Tensor, size_k: int,
return marlin_zp
+def moe_awq_to_marlin_zero_points(q_zp_packed: torch.Tensor, size_k: int,
+ size_n: int, num_bits: int):
+ num_experts = q_zp_packed.shape[0]
+ output = torch.empty(
+ (num_experts, q_zp_packed.shape[1], q_zp_packed.shape[2]),
+ device=q_zp_packed.device,
+ dtype=q_zp_packed.dtype,
+ )
+ for e in range(num_experts):
+ output[e] = awq_to_marlin_zero_points(q_zp_packed[e], size_k, size_n,
+ num_bits)
+ return output
+
+
def apply_gptq_marlin_linear(
input: torch.Tensor,
weight: torch.Tensor,
diff --git a/vllm/model_executor/layers/sampler.py b/vllm/model_executor/layers/sampler.py
index cfa857b8f9606..0b959da79c3be 100644
--- a/vllm/model_executor/layers/sampler.py
+++ b/vllm/model_executor/layers/sampler.py
@@ -947,8 +947,6 @@ def get_logprobs(
# largest num logprobs in this API. If every logprobs is None, it will be
# set to -1.
largest_num_logprobs = -1
- # If beam search is enabled.
- use_beam_search = False
# Select indices to compute logprob from, ranks of token ids, and the top
# k token ids from logprobs.
@@ -981,8 +979,6 @@ def get_logprobs(
largest_num_logprobs = max(largest_num_logprobs,
sampling_params.logprobs)
- use_beam_search = use_beam_search or sampling_params.use_beam_search
-
assert len(next_token_ids) == len(query_indices)
if len(query_indices) == 0:
@@ -995,7 +991,7 @@ def get_logprobs(
# If largest_num_logprobs == -1, i.e. no logprobs are requested, we can
# skip the whole logprob calculation.
- if largest_num_logprobs >= 0 or use_beam_search:
+ if largest_num_logprobs >= 0:
query_indices_gpu = torch.tensor(query_indices, device=logprobs.device)
next_token_ids_gpu = torch.tensor(next_token_ids,
device=logprobs.device)
@@ -1121,13 +1117,12 @@ def _get_sampled_logprob_if_needed(
"""Compute the sample logprob if needed."""
seq_ids = seq_group.seq_ids
num_logprobs = seq_group.sampling_params.logprobs
- use_beam_search = seq_group.sampling_params.use_beam_search
sampled_logprobs: SampleLogprobs = []
next_token_ids, parent_seq_ids = sample_result
if seq_group.do_sample:
assert len(next_token_ids) > 0
- if num_logprobs is None and not use_beam_search:
+ if num_logprobs is None:
for next_token_id in next_token_ids:
# Use a dummy logprob
sampled_logprobs.append({next_token_id: Logprob(inf)})
diff --git a/vllm/model_executor/model_loader/loader.py b/vllm/model_executor/model_loader/loader.py
index d2831f0eb5fcc..444bff430bc16 100644
--- a/vllm/model_executor/model_loader/loader.py
+++ b/vllm/model_executor/model_loader/loader.py
@@ -41,9 +41,8 @@
get_gguf_extra_tensor_names, get_quant_config, gguf_quant_weights_iterator,
initialize_dummy_weights, np_cache_weights_iterator, pt_weights_iterator,
safetensors_weights_iterator)
-from vllm.model_executor.models.interfaces import (has_inner_state,
- supports_lora,
- supports_multimodal)
+from vllm.model_executor.models import (has_inner_state, supports_lora,
+ supports_multimodal)
from vllm.model_executor.utils import set_weight_attrs
from vllm.platforms import current_platform
from vllm.utils import is_pin_memory_available
diff --git a/vllm/model_executor/model_loader/neuron.py b/vllm/model_executor/model_loader/neuron.py
index 594ae442ef328..00c82fb77186c 100644
--- a/vllm/model_executor/model_loader/neuron.py
+++ b/vllm/model_executor/model_loader/neuron.py
@@ -1,4 +1,5 @@
"""Utilities for selecting and loading neuron models."""
+import copy
import importlib
import os
from typing import Dict, List, Optional, Tuple
@@ -13,6 +14,8 @@
from vllm.model_executor.layers.quantization import get_quantization_config
from vllm.model_executor.layers.sampler import Sampler, SamplerOutput
from vllm.model_executor.sampling_metadata import SamplingMetadata
+from vllm.sequence import (CompletionSequenceGroupOutput, Logprob,
+ SequenceOutput)
TORCH_DTYPE_TO_NEURON_AMP = {
"auto": "f32",
@@ -37,15 +40,18 @@
class NeuronCasualLM(nn.Module):
- def __init__(
- self,
- config: PretrainedConfig,
- ) -> None:
+ def __init__(self,
+ config: PretrainedConfig,
+ on_device_sampling_disabled: bool = False) -> None:
super().__init__()
self.config = config
self.logits_processor = LogitsProcessor(config.vocab_size,
logits_as_input=True)
- self.sampler = Sampler()
+
+ self.on_device_sampling_disabled = on_device_sampling_disabled
+ if self.on_device_sampling_disabled:
+ # Use default sampler
+ self.sampler = Sampler()
# Lazy initialized
self.model: nn.Module
@@ -71,8 +77,29 @@ def sample(
logits: torch.Tensor,
sampling_metadata: SamplingMetadata,
) -> Optional[SamplerOutput]:
- next_tokens = self.sampler(logits, sampling_metadata)
- return next_tokens
+
+ if self.on_device_sampling_disabled:
+ next_tokens = self.sampler(logits, sampling_metadata)
+ return next_tokens
+
+ # On-device sampling outputs the token ids directly.
+ sampled_token_ids = logits.flatten()
+ next_tokens = []
+ sample_idx = 0
+ for seq_group in sampling_metadata.seq_groups:
+ samples = []
+ for seq_id in seq_group.seq_ids:
+ token_id = sampled_token_ids[sample_idx].item()
+ samples.append(
+ SequenceOutput(parent_seq_id=seq_id,
+ output_token=token_id,
+ logprobs={token_id: Logprob(token_id)}))
+ sample_idx += 1
+ next_tokens.append(
+ CompletionSequenceGroupOutput(samples=samples,
+ prompt_logprobs=None))
+
+ return SamplerOutput(outputs=next_tokens)
def load_weights(self, model_name_or_path: str, **kwargs):
arch = _get_model_architecture(self.config)
@@ -157,10 +184,22 @@ def _get_default_neuron_config(model_config: ModelConfig,
quant=neuron_quantization_config_builder(model_config.quantization)
if model_config.quantization else None,
continuous_batching=continuous_batching_config,
- weight_tiling=bool(model_config.quantization))
+ weight_tiling=bool(model_config.quantization),
+ on_device_generation=_get_neuron_on_device_generation_config(
+ model_config))
return default_neuron_args
+def _get_neuron_on_device_generation_config(model_config: ModelConfig):
+ if not _is_neuron_on_device_sampling_disabled(model_config):
+ return copy.deepcopy(model_config.neuron_sampling_params)
+ return None
+
+
+def _is_neuron_on_device_sampling_disabled(model_config: ModelConfig) -> bool:
+ return not getattr(model_config, "neuron_sampling_params", None)
+
+
def _get_neuron_config_after_override(default_neuron_config,
overridden_neuron_config):
from transformers_neuronx.config import NeuronConfig
@@ -174,7 +213,9 @@ def get_neuron_model(model_config: ModelConfig,
scheduler_config: SchedulerConfig) -> nn.Module:
# Create a model instance.
- model = NeuronCasualLM(model_config.hf_config)
+ model = NeuronCasualLM(
+ model_config.hf_config,
+ _is_neuron_on_device_sampling_disabled(model_config))
default_neuron_config_args = _get_default_neuron_config(
model_config, parallel_config, scheduler_config)
diff --git a/vllm/model_executor/model_loader/utils.py b/vllm/model_executor/model_loader/utils.py
index 2bfe6ea09bd62..b95c0b7cd0612 100644
--- a/vllm/model_executor/model_loader/utils.py
+++ b/vllm/model_executor/model_loader/utils.py
@@ -23,7 +23,9 @@ def get_model_architecture(
architectures = getattr(model_config.hf_config, "architectures", [])
# Special handling for quantized Mixtral.
# FIXME(woosuk): This is a temporary hack.
- mixtral_supported = ["fp8", "compressed-tensors", "gptq_marlin"]
+ mixtral_supported = [
+ "fp8", "compressed-tensors", "gptq_marlin", "awq_marlin"
+ ]
if (model_config.quantization is not None
and model_config.quantization not in mixtral_supported
diff --git a/vllm/model_executor/models/__init__.py b/vllm/model_executor/models/__init__.py
index 2f9cb2b760a82..eaa2b93eb3331 100644
--- a/vllm/model_executor/models/__init__.py
+++ b/vllm/model_executor/models/__init__.py
@@ -1,325 +1,23 @@
-import importlib
-import string
-import subprocess
-import sys
-import uuid
-from functools import lru_cache, partial
-from typing import Callable, Dict, List, Optional, Tuple, Type, Union
-
-import torch.nn as nn
-
-from vllm.logger import init_logger
-from vllm.utils import is_hip
-
-from .interfaces import supports_multimodal, supports_pp
-
-logger = init_logger(__name__)
-
-_GENERATION_MODELS = {
- "AquilaModel": ("llama", "LlamaForCausalLM"),
- "AquilaForCausalLM": ("llama", "LlamaForCausalLM"), # AquilaChat2
- "ArcticForCausalLM": ("arctic", "ArcticForCausalLM"),
- "BaiChuanForCausalLM": ("baichuan", "BaiChuanForCausalLM"), # baichuan-7b
- "BaichuanForCausalLM": ("baichuan", "BaichuanForCausalLM"), # baichuan-13b
- "BloomForCausalLM": ("bloom", "BloomForCausalLM"),
- "ChatGLMModel": ("chatglm", "ChatGLMForCausalLM"),
- "ChatGLMForConditionalGeneration": ("chatglm", "ChatGLMForCausalLM"),
- "CohereForCausalLM": ("commandr", "CohereForCausalLM"),
- "DbrxForCausalLM": ("dbrx", "DbrxForCausalLM"),
- "DeciLMForCausalLM": ("decilm", "DeciLMForCausalLM"),
- "DeepseekForCausalLM": ("deepseek", "DeepseekForCausalLM"),
- "DeepseekV2ForCausalLM": ("deepseek_v2", "DeepseekV2ForCausalLM"),
- "ExaoneForCausalLM": ("exaone", "ExaoneForCausalLM"),
- "FalconForCausalLM": ("falcon", "FalconForCausalLM"),
- "GemmaForCausalLM": ("gemma", "GemmaForCausalLM"),
- "Gemma2ForCausalLM": ("gemma2", "Gemma2ForCausalLM"),
- "GPT2LMHeadModel": ("gpt2", "GPT2LMHeadModel"),
- "GPTBigCodeForCausalLM": ("gpt_bigcode", "GPTBigCodeForCausalLM"),
- "GPTJForCausalLM": ("gpt_j", "GPTJForCausalLM"),
- "GPTNeoXForCausalLM": ("gpt_neox", "GPTNeoXForCausalLM"),
- "GraniteForCausalLM": ("granite", "GraniteForCausalLM"),
- "GraniteMoeForCausalLM": ("granitemoe", "GraniteMoeForCausalLM"),
- "InternLMForCausalLM": ("llama", "LlamaForCausalLM"),
- "InternLM2ForCausalLM": ("internlm2", "InternLM2ForCausalLM"),
- "JAISLMHeadModel": ("jais", "JAISLMHeadModel"),
- "JambaForCausalLM": ("jamba", "JambaForCausalLM"),
- "LlamaForCausalLM": ("llama", "LlamaForCausalLM"),
- # For decapoda-research/llama-*
- "LLaMAForCausalLM": ("llama", "LlamaForCausalLM"),
- "MistralForCausalLM": ("llama", "LlamaForCausalLM"),
- "MixtralForCausalLM": ("mixtral", "MixtralForCausalLM"),
- "QuantMixtralForCausalLM": ("mixtral_quant", "MixtralForCausalLM"),
- # transformers's mpt class has lower case
- "MptForCausalLM": ("mpt", "MPTForCausalLM"),
- "MPTForCausalLM": ("mpt", "MPTForCausalLM"),
- "MiniCPMForCausalLM": ("minicpm", "MiniCPMForCausalLM"),
- "MiniCPM3ForCausalLM": ("minicpm3", "MiniCPM3ForCausalLM"),
- "NemotronForCausalLM": ("nemotron", "NemotronForCausalLM"),
- "OlmoForCausalLM": ("olmo", "OlmoForCausalLM"),
- "OlmoeForCausalLM": ("olmoe", "OlmoeForCausalLM"),
- "OPTForCausalLM": ("opt", "OPTForCausalLM"),
- "OrionForCausalLM": ("orion", "OrionForCausalLM"),
- "PersimmonForCausalLM": ("persimmon", "PersimmonForCausalLM"),
- "PhiForCausalLM": ("phi", "PhiForCausalLM"),
- "Phi3ForCausalLM": ("phi3", "Phi3ForCausalLM"),
- "Phi3SmallForCausalLM": ("phi3_small", "Phi3SmallForCausalLM"),
- "PhiMoEForCausalLM": ("phimoe", "PhiMoEForCausalLM"),
- "Qwen2ForCausalLM": ("qwen2", "Qwen2ForCausalLM"),
- "Qwen2MoeForCausalLM": ("qwen2_moe", "Qwen2MoeForCausalLM"),
- "Qwen2VLForConditionalGeneration":
- ("qwen2_vl", "Qwen2VLForConditionalGeneration"),
- "RWForCausalLM": ("falcon", "FalconForCausalLM"),
- "StableLMEpochForCausalLM": ("stablelm", "StablelmForCausalLM"),
- "StableLmForCausalLM": ("stablelm", "StablelmForCausalLM"),
- "Starcoder2ForCausalLM": ("starcoder2", "Starcoder2ForCausalLM"),
- "SolarForCausalLM": ("solar", "SolarForCausalLM"),
- "XverseForCausalLM": ("xverse", "XverseForCausalLM"),
- # NOTE: The below models are for speculative decoding only
- "MedusaModel": ("medusa", "Medusa"),
- "EAGLEModel": ("eagle", "EAGLE"),
- "MLPSpeculatorPreTrainedModel": ("mlp_speculator", "MLPSpeculator"),
-}
-
-_EMBEDDING_MODELS = {
- "MistralModel": ("llama_embedding", "LlamaEmbeddingModel"),
- "Qwen2ForRewardModel": ("qwen2_rm", "Qwen2ForRewardModel"),
-}
-
-_MULTIMODAL_MODELS = {
- "Blip2ForConditionalGeneration":
- ("blip2", "Blip2ForConditionalGeneration"),
- "ChameleonForConditionalGeneration":
- ("chameleon", "ChameleonForConditionalGeneration"),
- "FuyuForCausalLM": ("fuyu", "FuyuForCausalLM"),
- "InternVLChatModel": ("internvl", "InternVLChatModel"),
- "LlavaForConditionalGeneration": ("llava",
- "LlavaForConditionalGeneration"),
- "LlavaNextForConditionalGeneration": ("llava_next",
- "LlavaNextForConditionalGeneration"),
- "LlavaNextVideoForConditionalGeneration":
- ("llava_next_video", "LlavaNextVideoForConditionalGeneration"),
- "LlavaOnevisionForConditionalGeneration":
- ("llava_onevision", "LlavaOnevisionForConditionalGeneration"),
- "MiniCPMV": ("minicpmv", "MiniCPMV"),
- "PaliGemmaForConditionalGeneration": ("paligemma",
- "PaliGemmaForConditionalGeneration"),
- "Phi3VForCausalLM": ("phi3v", "Phi3VForCausalLM"),
- "PixtralForConditionalGeneration": ("pixtral",
- "PixtralForConditionalGeneration"),
- "QWenLMHeadModel": ("qwen", "QWenLMHeadModel"),
- "Qwen2VLForConditionalGeneration": ("qwen2_vl",
- "Qwen2VLForConditionalGeneration"),
- "UltravoxModel": ("ultravox", "UltravoxModel"),
- "MllamaForConditionalGeneration": ("mllama",
- "MllamaForConditionalGeneration"),
-}
-_CONDITIONAL_GENERATION_MODELS = {
- "BartModel": ("bart", "BartForConditionalGeneration"),
- "BartForConditionalGeneration": ("bart", "BartForConditionalGeneration"),
-}
-
-_MODELS = {
- **_GENERATION_MODELS,
- **_EMBEDDING_MODELS,
- **_MULTIMODAL_MODELS,
- **_CONDITIONAL_GENERATION_MODELS,
-}
-
-# Architecture -> type.
-# out of tree models
-_OOT_MODELS: Dict[str, Type[nn.Module]] = {}
-
-# Models not supported by ROCm.
-_ROCM_UNSUPPORTED_MODELS: List[str] = []
-
-# Models partially supported by ROCm.
-# Architecture -> Reason.
-_ROCM_SWA_REASON = ("Sliding window attention (SWA) is not yet supported in "
- "Triton flash attention. For half-precision SWA support, "
- "please use CK flash attention by setting "
- "`VLLM_USE_TRITON_FLASH_ATTN=0`")
-_ROCM_PARTIALLY_SUPPORTED_MODELS: Dict[str, str] = {
- "Qwen2ForCausalLM":
- _ROCM_SWA_REASON,
- "MistralForCausalLM":
- _ROCM_SWA_REASON,
- "MixtralForCausalLM":
- _ROCM_SWA_REASON,
- "PaliGemmaForConditionalGeneration":
- ("ROCm flash attention does not yet "
- "fully support 32-bit precision on PaliGemma"),
- "Phi3VForCausalLM":
- ("ROCm Triton flash attention may run into compilation errors due to "
- "excessive use of shared memory. If this happens, disable Triton FA "
- "by setting `VLLM_USE_TRITON_FLASH_ATTN=0`")
-}
-
-
-class ModelRegistry:
-
- @staticmethod
- def _get_module_cls_name(model_arch: str) -> Tuple[str, str]:
- module_relname, cls_name = _MODELS[model_arch]
- return f"vllm.model_executor.models.{module_relname}", cls_name
-
- @staticmethod
- @lru_cache(maxsize=128)
- def _try_get_model_stateful(model_arch: str) -> Optional[Type[nn.Module]]:
- if model_arch not in _MODELS:
- return None
-
- module_name, cls_name = ModelRegistry._get_module_cls_name(model_arch)
- module = importlib.import_module(module_name)
- return getattr(module, cls_name, None)
-
- @staticmethod
- def _try_get_model_stateless(model_arch: str) -> Optional[Type[nn.Module]]:
- if model_arch in _OOT_MODELS:
- return _OOT_MODELS[model_arch]
-
- if is_hip():
- if model_arch in _ROCM_UNSUPPORTED_MODELS:
- raise ValueError(
- f"Model architecture {model_arch} is not supported by "
- "ROCm for now.")
- if model_arch in _ROCM_PARTIALLY_SUPPORTED_MODELS:
- logger.warning(
- "Model architecture %s is partially supported by ROCm: %s",
- model_arch, _ROCM_PARTIALLY_SUPPORTED_MODELS[model_arch])
-
- return None
-
- @staticmethod
- def _try_load_model_cls(model_arch: str) -> Optional[Type[nn.Module]]:
- model = ModelRegistry._try_get_model_stateless(model_arch)
- if model is not None:
- return model
-
- return ModelRegistry._try_get_model_stateful(model_arch)
-
- @staticmethod
- def resolve_model_cls(
- architectures: Union[str, List[str]], ) -> Tuple[Type[nn.Module], str]:
- if isinstance(architectures, str):
- architectures = [architectures]
- if not architectures:
- logger.warning("No model architectures are specified")
-
- for arch in architectures:
- model_cls = ModelRegistry._try_load_model_cls(arch)
- if model_cls is not None:
- return (model_cls, arch)
-
- raise ValueError(
- f"Model architectures {architectures} are not supported for now. "
- f"Supported architectures: {ModelRegistry.get_supported_archs()}")
-
- @staticmethod
- def get_supported_archs() -> List[str]:
- return list(_MODELS.keys()) + list(_OOT_MODELS.keys())
-
- @staticmethod
- def register_model(model_arch: str, model_cls: Type[nn.Module]):
- if model_arch in _MODELS:
- logger.warning(
- "Model architecture %s is already registered, and will be "
- "overwritten by the new model class %s.", model_arch,
- model_cls.__name__)
-
- _OOT_MODELS[model_arch] = model_cls
-
- @staticmethod
- @lru_cache(maxsize=128)
- def _check_stateless(
- func: Callable[[Type[nn.Module]], bool],
- model_arch: str,
- *,
- default: Optional[bool] = None,
- ) -> bool:
- """
- Run a boolean function against a model and return the result.
-
- If the model is not found, returns the provided default value.
-
- If the model is not already imported, the function is run inside a
- subprocess to avoid initializing CUDA for the main program.
- """
- model = ModelRegistry._try_get_model_stateless(model_arch)
- if model is not None:
- return func(model)
-
- if model_arch not in _MODELS and default is not None:
- return default
-
- module_name, cls_name = ModelRegistry._get_module_cls_name(model_arch)
-
- valid_name_characters = string.ascii_letters + string.digits + "._"
- if any(s not in valid_name_characters for s in module_name):
- raise ValueError(f"Unsafe module name detected for {model_arch}")
- if any(s not in valid_name_characters for s in cls_name):
- raise ValueError(f"Unsafe class name detected for {model_arch}")
- if any(s not in valid_name_characters for s in func.__module__):
- raise ValueError(f"Unsafe module name detected for {func}")
- if any(s not in valid_name_characters for s in func.__name__):
- raise ValueError(f"Unsafe class name detected for {func}")
-
- err_id = uuid.uuid4()
-
- stmts = ";".join([
- f"from {module_name} import {cls_name}",
- f"from {func.__module__} import {func.__name__}",
- f"assert {func.__name__}({cls_name}), '{err_id}'",
- ])
-
- result = subprocess.run([sys.executable, "-c", stmts],
- capture_output=True)
-
- if result.returncode != 0:
- err_lines = [line.decode() for line in result.stderr.splitlines()]
- if err_lines and err_lines[-1] != f"AssertionError: {err_id}":
- err_str = "\n".join(err_lines)
- raise RuntimeError(
- "An unexpected error occurred while importing the model in "
- f"another process. Error log:\n{err_str}")
-
- return result.returncode == 0
-
- @staticmethod
- def is_embedding_model(architectures: Union[str, List[str]]) -> bool:
- if isinstance(architectures, str):
- architectures = [architectures]
- if not architectures:
- logger.warning("No model architectures are specified")
-
- return any(arch in _EMBEDDING_MODELS for arch in architectures)
-
- @staticmethod
- def is_multimodal_model(architectures: Union[str, List[str]]) -> bool:
- if isinstance(architectures, str):
- architectures = [architectures]
- if not architectures:
- logger.warning("No model architectures are specified")
-
- is_mm = partial(ModelRegistry._check_stateless,
- supports_multimodal,
- default=False)
-
- return any(is_mm(arch) for arch in architectures)
-
- @staticmethod
- def is_pp_supported_model(architectures: Union[str, List[str]]) -> bool:
- if isinstance(architectures, str):
- architectures = [architectures]
- if not architectures:
- logger.warning("No model architectures are specified")
-
- is_pp = partial(ModelRegistry._check_stateless,
- supports_pp,
- default=False)
-
- return any(is_pp(arch) for arch in architectures)
-
+from .interfaces import (HasInnerState, SupportsLoRA, SupportsMultiModal,
+ SupportsPP, has_inner_state, supports_lora,
+ supports_multimodal, supports_pp)
+from .interfaces_base import (VllmModelForEmbedding,
+ VllmModelForTextGeneration, is_embedding_model,
+ is_text_generation_model)
+from .registry import ModelRegistry
__all__ = [
"ModelRegistry",
+ "VllmModelForEmbedding",
+ "is_embedding_model",
+ "VllmModelForTextGeneration",
+ "is_text_generation_model",
+ "HasInnerState",
+ "has_inner_state",
+ "SupportsLoRA",
+ "supports_lora",
+ "SupportsMultiModal",
+ "supports_multimodal",
+ "SupportsPP",
+ "supports_pp",
]
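With the registry split into its own module, `vllm.model_executor.models` becomes a thin re-export layer. A hedged usage sketch of the public surface re-exported above (the architecture name is only an example):

```
from vllm.model_executor.models import ModelRegistry, is_text_generation_model

# Resolve an architecture string to its implementation class.
model_cls, arch = ModelRegistry.resolve_model_cls(["LlamaForCausalLM"])
print(arch)                                 # LlamaForCausalLM

# The interface helpers re-exported above work on classes or instances.
print(is_text_generation_model(model_cls))  # expected: True
```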
diff --git a/vllm/model_executor/models/gemma2.py b/vllm/model_executor/models/gemma2.py
index 9fddaac3a0837..bd3c1114c929f 100644
--- a/vllm/model_executor/models/gemma2.py
+++ b/vllm/model_executor/models/gemma2.py
@@ -40,7 +40,7 @@
from vllm.sequence import IntermediateTensors
from .interfaces import SupportsLoRA, SupportsPP
-from .utils import (is_pp_missing_parameter,
+from .utils import (group_weights_with_prefix, is_pp_missing_parameter,
make_empty_intermediate_tensors_factory, make_layers)
logger = init_logger(__name__)
@@ -273,16 +273,19 @@ def __init__(
def forward(
self,
- input_ids: torch.Tensor,
+ input_ids: Optional[torch.Tensor],
positions: torch.Tensor,
kv_caches: List[torch.Tensor],
attn_metadata: AttentionMetadata,
intermediate_tensors: Optional[IntermediateTensors],
+ inputs_embeds: Optional[torch.Tensor] = None,
) -> Union[torch.Tensor, IntermediateTensors]:
if get_pp_group().is_first_rank:
- hidden_states = self.embed_tokens(input_ids)
+ if inputs_embeds is not None:
+ hidden_states = inputs_embeds
+ else:
+ hidden_states = self.embed_tokens(input_ids)
hidden_states *= self.normalizer
-
residual = None
else:
assert intermediate_tensors is not None
@@ -305,6 +308,49 @@ def forward(
hidden_states, _ = self.norm(hidden_states, residual)
return hidden_states
+ def load_weights(self, weights: Iterable[Tuple[str, torch.Tensor]]):
+ stacked_params_mapping = [
+ # (param_name, shard_name, shard_id)
+ ("qkv_proj", "q_proj", "q"),
+ ("qkv_proj", "k_proj", "k"),
+ ("qkv_proj", "v_proj", "v"),
+ ("gate_up_proj", "gate_proj", 0),
+ ("gate_up_proj", "up_proj", 1),
+ ]
+ params_dict = dict(self.named_parameters())
+ loaded_params: Set[str] = set()
+ for name, loaded_weight in weights:
+ for (param_name, shard_name, shard_id) in stacked_params_mapping:
+ if shard_name not in name:
+ continue
+ name = name.replace(shard_name, param_name)
+ # Skip loading extra bias for GPTQ models.
+ if name.endswith(".bias") and name not in params_dict:
+ continue
+ if is_pp_missing_parameter(name, self):
+ continue
+ param = params_dict[name]
+ weight_loader = param.weight_loader
+ weight_loader(param, loaded_weight, shard_id)
+ break
+ else:
+ # Skip loading extra bias for GPTQ models.
+ if name.endswith(".bias") and name not in params_dict:
+ continue
+ if is_pp_missing_parameter(name, self):
+ continue
+ param = params_dict[name]
+ weight_loader = getattr(param, "weight_loader",
+ default_weight_loader)
+ weight_loader(param, loaded_weight)
+ loaded_params.add(name)
+
+ unloaded_params = params_dict.keys() - loaded_params
+ if unloaded_params:
+ logger.warning(
+ "Some weights are not initialized from checkpoints: %s",
+ unloaded_params)
+
class Gemma2ForCausalLM(nn.Module, SupportsLoRA, SupportsPP):
packed_modules_mapping = {
@@ -388,48 +434,19 @@ def sample(
return next_tokens
def load_weights(self, weights: Iterable[Tuple[str, torch.Tensor]]):
- stacked_params_mapping = [
- # (param_name, shard_name, shard_id)
- ("qkv_proj", "q_proj", "q"),
- ("qkv_proj", "k_proj", "k"),
- ("qkv_proj", "v_proj", "v"),
- ("gate_up_proj", "gate_proj", 0),
- ("gate_up_proj", "up_proj", 1),
- ]
- params_dict = dict(self.named_parameters())
- loaded_params: Set[str] = set()
- for name, loaded_weight in weights:
- for (param_name, shard_name, shard_id) in stacked_params_mapping:
- if shard_name not in name:
- continue
- name = name.replace(shard_name, param_name)
- # Skip loading extra bias for GPTQ models.
- if name.endswith(".bias") and name not in params_dict:
- continue
- if is_pp_missing_parameter(name, self):
- continue
- param = params_dict[name]
- weight_loader = param.weight_loader
- weight_loader(param, loaded_weight, shard_id)
- break
- else:
- # lm_head is not used in vllm as it is tied with embed_token.
- # To prevent errors, skip loading lm_head.weight.
- if "lm_head.weight" in name:
- continue
- # Skip loading extra bias for GPTQ models.
- if name.endswith(".bias") and name not in params_dict:
- continue
- if is_pp_missing_parameter(name, self):
+ weights_group = group_weights_with_prefix(weights)
+
+ self.model.load_weights(weights_group["model"])
+
+ if not self.config.tie_word_embeddings:
+            # NOTE: self.lm_head is currently not defined because
+            # tie_word_embeddings is assumed to be True for Gemma2,
+            # so this branch is effectively unused.
+ lm_head_dict = dict(self.lm_head.named_parameters())
+ for name, loaded_weight in weights_group["lm_head"]:
+ if is_pp_missing_parameter(name, self.lm_head):
continue
- param = params_dict[name]
+
+ param = lm_head_dict[name]
weight_loader = getattr(param, "weight_loader",
default_weight_loader)
weight_loader(param, loaded_weight)
- loaded_params.add(name)
-
- unloaded_params = params_dict.keys() - loaded_params
- if unloaded_params:
- logger.warning(
- "Some weights are not initialized from checkpoints: %s",
- unloaded_params)
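`Gemma2ForCausalLM.load_weights` now delegates to the inner `Gemma2Model` and routes checkpoint entries through `group_weights_with_prefix` from `.utils`. The real helper is not shown in this diff; the following is only an illustrative stand-in for the grouping idea (bucket `(name, tensor)` pairs by their first dotted component and strip that prefix):

```
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple

import torch

def group_by_prefix(
    weights: Iterable[Tuple[str, torch.Tensor]]
) -> Dict[str, List[Tuple[str, torch.Tensor]]]:
    """Illustrative stand-in for group_weights_with_prefix: bucket
    checkpoint entries by their first dotted component and strip it."""
    groups: Dict[str, List[Tuple[str, torch.Tensor]]] = defaultdict(list)
    for name, tensor in weights:
        prefix, _, rest = name.partition(".")
        groups[prefix].append((rest, tensor))
    return groups

ckpt = [
    ("model.embed_tokens.weight", torch.zeros(4, 2)),
    ("model.layers.0.mlp.gate_proj.weight", torch.zeros(2, 2)),
    ("lm_head.weight", torch.zeros(4, 2)),
]
groups = group_by_prefix(ckpt)
print(sorted(groups))         # ['lm_head', 'model']
print(groups["model"][0][0])  # embed_tokens.weight
```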
diff --git a/vllm/model_executor/models/gemma2_embedding.py b/vllm/model_executor/models/gemma2_embedding.py
new file mode 100644
index 0000000000000..e8e10598c1644
--- /dev/null
+++ b/vllm/model_executor/models/gemma2_embedding.py
@@ -0,0 +1,57 @@
+from typing import Iterable, List, Optional, Tuple, Union
+
+import torch
+from torch import nn
+
+from vllm.attention import AttentionMetadata
+from vllm.model_executor.layers.pooler import Pooler, PoolingType
+from vllm.model_executor.pooling_metadata import PoolingMetadata
+from vllm.sequence import IntermediateTensors, PoolerOutput
+
+from .gemma2 import Gemma2Model
+from .interfaces import SupportsPP
+
+
+class Gemma2EmbeddingModel(nn.Module, SupportsPP):
+ """A model that uses Gemma2 with additional embedding functionalities.
+
+ This class encapsulates the Gemma2Model and provides an interface for
+ embedding operations and customized pooling functions.
+
+ Attributes:
+ model: An instance of Gemma2Model used for forward operations.
+ _pooler: An instance of Pooler used for pooling operations.
+ """
+
+ def __init__(
+ self,
+ **kwargs,
+ ) -> None:
+ super().__init__()
+ self.model = Gemma2Model(**kwargs)
+ self._pooler = Pooler(pooling_type=PoolingType.LAST, normalize=True)
+
+ self.make_empty_intermediate_tensors = (
+ self.model.make_empty_intermediate_tensors)
+
+ def forward(
+ self,
+ input_ids: Optional[torch.Tensor],
+ positions: torch.Tensor,
+ kv_caches: List[torch.Tensor],
+ attn_metadata: AttentionMetadata,
+ intermediate_tensors: Optional[IntermediateTensors] = None,
+ inputs_embeds: Optional[torch.Tensor] = None,
+ ) -> Union[torch.Tensor, IntermediateTensors]:
+ return self.model(input_ids, positions, kv_caches, attn_metadata,
+ intermediate_tensors, inputs_embeds)
+
+ def pooler(
+ self,
+ hidden_states: torch.Tensor,
+ pooling_metadata: PoolingMetadata,
+ ) -> Optional[PoolerOutput]:
+ return self._pooler(hidden_states, pooling_metadata)
+
+ def load_weights(self, weights: Iterable[Tuple[str, torch.Tensor]]):
+ self.model.load_weights(weights)
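`Gemma2EmbeddingModel` wires a `Pooler` with `PoolingType.LAST` and `normalize=True`. A plain-PyTorch sketch of what that pooling configuration amounts to for a single sequence (the real `Pooler` also handles batching via `PoolingMetadata`):

```
import torch
import torch.nn.functional as F

# Toy hidden states for one sequence: [num_tokens, hidden_size].
hidden_states = torch.randn(7, 16)

# LAST pooling keeps the final token's hidden state ...
pooled = hidden_states[-1]
# ... and normalize=True L2-normalizes it into a unit-length embedding.
embedding = F.normalize(pooled, dim=-1)

print(embedding.shape)               # torch.Size([16])
print(torch.linalg.norm(embedding))  # ~1.0
```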
diff --git a/vllm/model_executor/models/interfaces.py b/vllm/model_executor/models/interfaces.py
index 298174fa05965..278dfc52078ef 100644
--- a/vllm/model_executor/models/interfaces.py
+++ b/vllm/model_executor/models/interfaces.py
@@ -1,4 +1,3 @@
-import inspect
from typing import (TYPE_CHECKING, ClassVar, Dict, List, Literal, Optional,
Protocol, Type, Union, overload, runtime_checkable)
@@ -6,9 +5,9 @@
from typing_extensions import TypeIs
from vllm.logger import init_logger
+from vllm.utils import supports_kw
if TYPE_CHECKING:
- from vllm.attention import AttentionMetadata
from vllm.config import LoRAConfig, MultiModalConfig, SchedulerConfig
from vllm.sequence import IntermediateTensors
@@ -142,9 +141,7 @@ def supports_lora(
return result
-def _supports_lora(
- model: Union[Type[object], object],
-) -> Union[TypeIs[Type[SupportsLoRA]], TypeIs[SupportsLoRA]]:
+def _supports_lora(model: Union[Type[object], object]) -> bool:
if isinstance(model, type):
return isinstance(model, _SupportsLoRAType)
@@ -175,10 +172,7 @@ def make_empty_intermediate_tensors(
def forward(
self,
- input_ids: torch.Tensor,
- position_ids: torch.Tensor,
- kv_caches: List[torch.Tensor],
- attn_metadata: "AttentionMetadata",
+ *,
intermediate_tensors: Optional["IntermediateTensors"],
) -> Union[torch.Tensor, "IntermediateTensors"]:
"""
@@ -205,10 +199,7 @@ def make_empty_intermediate_tensors(
def forward(
self,
- input_ids: torch.Tensor,
- position_ids: torch.Tensor,
- kv_caches: List[torch.Tensor],
- attn_metadata: "AttentionMetadata",
+ *,
intermediate_tensors: Optional["IntermediateTensors"],
) -> Union[torch.Tensor, "IntermediateTensors"]:
...
@@ -257,24 +248,19 @@ def supports_pp(
return supports_attributes and supports_inspect
-def _supports_pp_attributes(
- model: Union[Type[object], object],
-) -> Union[bool, TypeIs[Type[SupportsPP]], TypeIs[SupportsPP]]:
+def _supports_pp_attributes(model: Union[Type[object], object]) -> bool:
if isinstance(model, type):
return isinstance(model, _SupportsPPType)
return isinstance(model, SupportsPP)
-def _supports_pp_inspect(
- model: Union[Type[object], object],
-) -> Union[bool, TypeIs[Type[SupportsPP]], TypeIs[SupportsPP]]:
+def _supports_pp_inspect(model: Union[Type[object], object]) -> bool:
model_forward = getattr(model, "forward", None)
if not callable(model_forward):
return False
- forward_params = inspect.signature(model_forward).parameters
- return "intermediate_tensors" in forward_params
+ return supports_kw(model_forward, "intermediate_tensors")
@runtime_checkable
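The direct `inspect.signature` check is replaced by `supports_kw` from `vllm.utils`. Its exact implementation is not part of this diff; a rough, standalone equivalent of the check being performed here might look like this:

```
import inspect
from typing import Callable

def accepts_kw(fn: Callable, kw_name: str) -> bool:
    # Rough stand-in for vllm.utils.supports_kw: does the callable accept
    # kw_name, either as a named parameter or through **kwargs?
    params = inspect.signature(fn).parameters
    if kw_name in params:
        return True
    return any(p.kind is inspect.Parameter.VAR_KEYWORD
               for p in params.values())

def forward(self, *, intermediate_tensors=None):
    ...

print(accepts_kw(forward, "intermediate_tensors"))  # True
print(accepts_kw(forward, "input_ids"))             # False
```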
diff --git a/vllm/model_executor/models/interfaces_base.py b/vllm/model_executor/models/interfaces_base.py
new file mode 100644
index 0000000000000..8d2d422f9891c
--- /dev/null
+++ b/vllm/model_executor/models/interfaces_base.py
@@ -0,0 +1,191 @@
+from typing import (TYPE_CHECKING, List, Optional, Protocol, Type, Union,
+ overload, runtime_checkable)
+
+import torch
+import torch.nn as nn
+from transformers import PretrainedConfig
+from typing_extensions import TypeIs, TypeVar
+
+from vllm.logger import init_logger
+from vllm.utils import supports_kw
+
+if TYPE_CHECKING:
+ from vllm.attention import AttentionMetadata
+ from vllm.config import CacheConfig
+ from vllm.model_executor.layers.pooler import PoolerOutput
+ from vllm.model_executor.layers.quantization import QuantizationConfig
+ from vllm.model_executor.layers.sampler import SamplerOutput
+ from vllm.model_executor.pooling_metadata import PoolingMetadata
+ from vllm.model_executor.sampling_metadata import SamplingMetadata
+
+logger = init_logger(__name__)
+
+# The type of HF config
+C_co = TypeVar("C_co", bound=PretrainedConfig, covariant=True)
+
+# The type of hidden states
+# Currently, T = torch.Tensor for all models except for Medusa
+# which has T = List[torch.Tensor]
+T = TypeVar("T", default=torch.Tensor)
+T_co = TypeVar("T_co", default=torch.Tensor, covariant=True)
+
+# NOTE: Unlike those in `interfaces.py`, we don't define `ClassVar` tags
+# for the base interfaces to avoid breaking OOT registration for existing models
+# that don't inherit from the base interface classes
+
+
+@runtime_checkable
+class VllmModel(Protocol[C_co, T_co]):
+
+ def __init__(
+ self,
+ config: C_co,
+ *,
+ cache_config: Optional["CacheConfig"],
+ quant_config: Optional["QuantizationConfig"],
+ ) -> None:
+ ...
+
+ def forward(
+ self,
+ input_ids: torch.Tensor,
+ positions: torch.Tensor,
+ kv_caches: List[torch.Tensor],
+ attn_metadata: "AttentionMetadata",
+ ) -> T_co:
+ ...
+
+
+def _check_vllm_model_init(model: Union[Type[object], object]) -> bool:
+ model_init = model.__init__
+ vllm_kws = ("cache_config", "quant_config")
+ missing_kws = tuple(kw for kw in vllm_kws
+ if not supports_kw(model_init, kw))
+
+ if missing_kws and (isinstance(model, type)
+ and issubclass(model, nn.Module)):
+ logger.warning(
+ "The model (%s) is missing "
+ "vLLM-specific keywords from its initializer: %s",
+ model,
+ missing_kws,
+ )
+
+ return len(missing_kws) == 0
+
+
+def _check_vllm_model_forward(model: Union[Type[object], object]) -> bool:
+ model_forward = getattr(model, "forward", None)
+ if not callable(model_forward):
+ return False
+
+ vllm_kws = ("input_ids", "positions", "kv_caches", "attn_metadata")
+ missing_kws = tuple(kw for kw in vllm_kws
+ if not supports_kw(model_forward, kw))
+
+ if missing_kws and (isinstance(model, type)
+ and issubclass(model, nn.Module)):
+ logger.warning(
+ "The model (%s) is missing "
+            "vLLM-specific keywords from its forward method: %s",
+ model,
+ missing_kws,
+ )
+
+ return len(missing_kws) == 0
+
+
+@overload
+def is_vllm_model(model: Type[object]) -> TypeIs[Type[VllmModel]]:
+ ...
+
+
+@overload
+def is_vllm_model(model: object) -> TypeIs[VllmModel]:
+ ...
+
+
+def is_vllm_model(
+ model: Union[Type[object], object],
+) -> Union[TypeIs[Type[VllmModel]], TypeIs[VllmModel]]:
+ return _check_vllm_model_init(model) and _check_vllm_model_forward(model)
+
+
+@runtime_checkable
+class VllmModelForTextGeneration(VllmModel[C_co, T], Protocol[C_co, T]):
+
+ def compute_logits(
+ self,
+ hidden_states: T,
+ sampling_metadata: "SamplingMetadata",
+ ) -> Optional[T]:
+ """Return `None` if TP rank > 0."""
+ ...
+
+ def sample(
+ self,
+ logits: T,
+ sampling_metadata: "SamplingMetadata",
+ ) -> "SamplerOutput":
+ """Only called on TP rank 0."""
+ ...
+
+
+@overload
+def is_text_generation_model(
+ model: Type[object]) -> TypeIs[Type[VllmModelForTextGeneration]]:
+ ...
+
+
+@overload
+def is_text_generation_model(
+ model: object) -> TypeIs[VllmModelForTextGeneration]:
+ ...
+
+
+def is_text_generation_model(
+ model: Union[Type[object], object],
+) -> Union[TypeIs[Type[VllmModelForTextGeneration]],
+ TypeIs[VllmModelForTextGeneration]]:
+ if not is_vllm_model(model):
+ return False
+
+ if isinstance(model, type):
+ return isinstance(model, VllmModelForTextGeneration)
+
+ return isinstance(model, VllmModelForTextGeneration)
+
+
+@runtime_checkable
+class VllmModelForEmbedding(VllmModel[C_co, T], Protocol[C_co, T]):
+
+ def pooler(
+ self,
+ hidden_states: T,
+ pooling_metadata: "PoolingMetadata",
+ ) -> "PoolerOutput":
+ """Only called on TP rank 0."""
+ ...
+
+
+@overload
+def is_embedding_model(
+ model: Type[object]) -> TypeIs[Type[VllmModelForEmbedding]]:
+ ...
+
+
+@overload
+def is_embedding_model(model: object) -> TypeIs[VllmModelForEmbedding]:
+ ...
+
+
+def is_embedding_model(
+ model: Union[Type[object], object],
+) -> Union[TypeIs[Type[VllmModelForEmbedding]], TypeIs[VllmModelForEmbedding]]:
+ if not is_vllm_model(model):
+ return False
+
+ if isinstance(model, type):
+ return isinstance(model, VllmModelForEmbedding)
+
+ return isinstance(model, VllmModelForEmbedding)
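Because these interfaces are structural (`runtime_checkable` Protocols) rather than base classes, a model qualifies purely by the methods it exposes. A hedged sketch with a purely illustrative toy class (the import path is the module added in this diff):

```
import torch
import torch.nn as nn

from vllm.model_executor.models.interfaces_base import (
    is_embedding_model, is_text_generation_model)

class ToyGenerator(nn.Module):
    """Illustrative only: satisfies the structural checks above."""

    def __init__(self, config, *, cache_config=None, quant_config=None):
        super().__init__()

    def forward(self, input_ids, positions, kv_caches, attn_metadata):
        return torch.zeros(1)

    def compute_logits(self, hidden_states, sampling_metadata):
        return hidden_states

    def sample(self, logits, sampling_metadata):
        return None

print(is_text_generation_model(ToyGenerator))  # expected: True
print(is_embedding_model(ToyGenerator))        # expected: False (no pooler)
```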
diff --git a/vllm/model_executor/models/jamba.py b/vllm/model_executor/models/jamba.py
index 330a2b6e3fd7f..06ec324b3e108 100644
--- a/vllm/model_executor/models/jamba.py
+++ b/vllm/model_executor/models/jamba.py
@@ -25,20 +25,18 @@
causal_conv1d_fn, causal_conv1d_update)
from vllm.model_executor.layers.mamba.ops.mamba_ssm import (
selective_scan_fn, selective_state_update)
-from vllm.model_executor.layers.quantization.base_config import (
- QuantizationConfig)
+from vllm.model_executor.layers.quantization import QuantizationConfig
from vllm.model_executor.layers.sampler import Sampler, SamplerOutput
from vllm.model_executor.layers.vocab_parallel_embedding import (
DEFAULT_VOCAB_PADDING_SIZE, ParallelLMHead, VocabParallelEmbedding)
from vllm.model_executor.model_loader.weight_utils import default_weight_loader
-from vllm.model_executor.models.interfaces import HasInnerState
from vllm.model_executor.sampling_metadata import SamplingMetadata
from vllm.model_executor.utils import set_weight_attrs
from vllm.sequence import IntermediateTensors
from vllm.worker.model_runner import (_BATCH_SIZES_TO_CAPTURE,
_get_graph_batch_size)
-from .interfaces import SupportsLoRA
+from .interfaces import HasInnerState, SupportsLoRA
KVCache = Tuple[torch.Tensor, torch.Tensor]
diff --git a/vllm/model_executor/models/llama.py b/vllm/model_executor/models/llama.py
index 8cedea73e9968..74a4afbd36550 100644
--- a/vllm/model_executor/models/llama.py
+++ b/vllm/model_executor/models/llama.py
@@ -51,7 +51,8 @@
from vllm.utils import is_hip
from .interfaces import SupportsLoRA, SupportsPP
-from .utils import (PPMissingLayer, is_pp_missing_parameter,
+from .utils import (PPMissingLayer, group_weights_with_prefix,
+ is_pp_missing_parameter,
make_empty_intermediate_tensors_factory, make_layers)
@@ -348,6 +349,90 @@ def forward(
hidden_states, _ = self.norm(hidden_states, residual)
return hidden_states
+ def load_weights(self, weights: Iterable[Tuple[str, torch.Tensor]]):
+ stacked_params_mapping = [
+ # (param_name, shard_name, shard_id)
+ (".qkv_proj", ".q_proj", "q"),
+ (".qkv_proj", ".k_proj", "k"),
+ (".qkv_proj", ".v_proj", "v"),
+ (".gate_up_proj", ".gate_proj", 0),
+ (".gate_up_proj", ".up_proj", 1),
+ ]
+ params_dict = dict(self.named_parameters())
+ for name, loaded_weight in weights:
+ if "rotary_emb.inv_freq" in name:
+ continue
+ if ("rotary_emb.cos_cached" in name
+ or "rotary_emb.sin_cached" in name):
+ # Models trained using ColossalAI may include these tensors in
+ # the checkpoint. Skip them.
+ continue
+ if scale_name := get_compressed_tensors_cache_scale(name):
+ # Loading kv cache scales for compressed-tensors quantization
+ param = params_dict[scale_name]
+ weight_loader = getattr(param, "weight_loader",
+ default_weight_loader)
+ loaded_weight = loaded_weight[0]
+ weight_loader(param, loaded_weight)
+ continue
+ for param_name, weight_name, shard_id in stacked_params_mapping:
+ if weight_name not in name:
+ continue
+ name = name.replace(weight_name, param_name)
+ # Skip loading extra bias for GPTQ models.
+ if name.endswith(".bias") and name not in params_dict:
+ continue
+
+ if is_pp_missing_parameter(name, self):
+ continue
+
+ param = params_dict[name]
+ weight_loader = param.weight_loader
+ weight_loader(param, loaded_weight, shard_id)
+
+ break
+ else:
+ # Skip loading extra bias for GPTQ models.
+ if name.endswith(".bias") and name not in params_dict:
+ continue
+ # Remapping the name of FP8 kv-scale.
+ name = maybe_remap_kv_scale_name(name, params_dict)
+ if name is None:
+ continue
+
+ if is_pp_missing_parameter(name, self):
+ continue
+
+ param = params_dict[name]
+ weight_loader = getattr(param, "weight_loader",
+ default_weight_loader)
+ weight_loader(param, loaded_weight)
+
+ # If this function is called, it should always initialize KV cache scale
+ # factors (or else raise an exception). Thus, handled exceptions should
+ # make sure to leave KV cache scale factors in a known good (dummy) state
+ def load_kv_cache_scales(self, quantization_param_path: str) -> None:
+ tp_size = get_tensor_model_parallel_world_size()
+ tp_rank = get_tensor_model_parallel_rank()
+ for layer_idx, scaling_factor in kv_cache_scales_loader(
+ quantization_param_path, tp_rank, tp_size,
+ self.config.num_hidden_layers,
+ self.config.__class__.model_type):
+ if not isinstance(self.layers[layer_idx], nn.Identity):
+ layer_self_attn = self.layers[layer_idx].self_attn
+
+ if is_hip():
+ # The scaling factor convention we are assuming is
+ # quantized_value * scaling_factor ~= true_value
+ # which is consistent with the practice of setting
+ # scaling_factor = tensor_amax / FPtype_max
+ scaling_factor *= 2
+ if hasattr(layer_self_attn, "kv_scale"):
+ layer_self_attn.attn._kv_scale = scaling_factor
+ else:
+ raise RuntimeError("Self attention has no KV cache scaling "
+ "factor attribute!")
+
class LlamaForCausalLM(nn.Module, SupportsLoRA, SupportsPP):
packed_modules_mapping = {
@@ -373,6 +458,7 @@ class LlamaForCausalLM(nn.Module, SupportsLoRA, SupportsPP):
"gate_proj": ("gate_up_proj", 0),
"up_proj": ("gate_up_proj", 1),
}
+
# Mistral/Llama models can also be loaded with --load-format mistral
# from consolidated.safetensors checkpoints
mistral_mapping = {
@@ -429,7 +515,7 @@ def __init__(
quant_config=quant_config,
)
if config.tie_word_embeddings:
- self.lm_head.weight = self.model.embed_tokens.weight
+ self.lm_head = self.model.embed_tokens
logit_scale = getattr(config, "logit_scale", 1.0)
self.logits_processor = LogitsProcessor(self.unpadded_vocab_size,
@@ -468,106 +554,40 @@ def sample(self, logits: torch.Tensor,
return next_tokens
def load_weights(self, weights: Iterable[Tuple[str, torch.Tensor]]):
- stacked_params_mapping = [
- # (param_name, shard_name, shard_id)
- (".qkv_proj", ".q_proj", "q"),
- (".qkv_proj", ".k_proj", "k"),
- (".qkv_proj", ".v_proj", "v"),
- (".gate_up_proj", ".gate_proj", 0),
- (".gate_up_proj", ".up_proj", 1),
+ weights = [
+ self.maybe_remap_mistral(name, loaded_weight)
+ for name, loaded_weight in weights
]
- params_dict = dict(self.named_parameters())
- for name, loaded_weight in weights:
- name, loaded_weight = self.maybe_remap_mistral(name, loaded_weight)
-
- if "rotary_emb.inv_freq" in name:
- continue
- if ("rotary_emb.cos_cached" in name
- or "rotary_emb.sin_cached" in name):
- # Models trained using ColossalAI may include these tensors in
- # the checkpoint. Skip them.
- continue
- # With tie_word_embeddings, we can skip lm_head.weight
- # The weight might appear unnecessarily in the files if the model is
- # processed with quantization, LoRA, fine-tuning, etc.
- if self.config.tie_word_embeddings and "lm_head.weight" in name:
- continue
- if scale_name := get_compressed_tensors_cache_scale(name):
- # Loading kv cache scales for compressed-tensors quantization
- param = params_dict[scale_name]
- weight_loader = getattr(param, "weight_loader",
- default_weight_loader)
- loaded_weight = loaded_weight[0]
- weight_loader(param, loaded_weight)
- continue
- for param_name, weight_name, shard_id in stacked_params_mapping:
- if weight_name not in name:
- continue
- name = name.replace(weight_name, param_name)
- # Skip loading extra bias for GPTQ models.
- if name.endswith(".bias") and name not in params_dict:
- continue
-
- if is_pp_missing_parameter(name, self):
- continue
-
- param = params_dict[name]
- weight_loader = param.weight_loader
- weight_loader(param, loaded_weight, shard_id)
+ weights_group = group_weights_with_prefix(weights)
- break
- else:
- # Skip loading extra bias for GPTQ models.
- if name.endswith(".bias") and name not in params_dict:
- continue
- # Remapping the name of FP8 kv-scale.
- name = maybe_remap_kv_scale_name(name, params_dict)
- if name is None:
- continue
+ self.model.load_weights(weights_group["model"])
- if is_pp_missing_parameter(name, self):
+ if not self.config.tie_word_embeddings:
+ lm_head_dict = dict(self.lm_head.named_parameters())
+ for name, loaded_weight in weights_group["lm_head"]:
+ if is_pp_missing_parameter(name, self.lm_head):
continue
- param = params_dict[name]
+ param = lm_head_dict[name]
weight_loader = getattr(param, "weight_loader",
default_weight_loader)
weight_loader(param, loaded_weight)
- # If this function is called, it should always initialize KV cache scale
- # factors (or else raise an exception). Thus, handled exceptions should
- # make sure to leave KV cache scale factors in a known good (dummy) state
def load_kv_cache_scales(self, quantization_param_path: str) -> None:
- tp_size = get_tensor_model_parallel_world_size()
- tp_rank = get_tensor_model_parallel_rank()
- for layer_idx, scaling_factor in kv_cache_scales_loader(
- quantization_param_path, tp_rank, tp_size,
- self.config.num_hidden_layers,
- self.config.__class__.model_type):
- if not isinstance(self.model.layers[layer_idx], nn.Identity):
- layer_self_attn = self.model.layers[layer_idx].self_attn
-
- if is_hip():
- # The scaling factor convention we are assuming is
- # quantized_value * scaling_factor ~= true_value
- # which is consistent with the practice of setting
- # scaling_factor = tensor_amax / FPtype_max
- scaling_factor *= 2
- if hasattr(layer_self_attn, "kv_scale"):
- layer_self_attn.attn._kv_scale = scaling_factor
- else:
- raise RuntimeError("Self attention has no KV cache scaling "
- "factor attribute!")
+ self.model.load_kv_cache_scales(quantization_param_path)
# This function is used to remap the mistral format as
# used by Mistral and Llama <=2
def maybe_remap_mistral(
- self, name: str,
- loaded_weight: torch.Tensor) -> Tuple[str, torch.Tensor]:
+ self,
+ name: str,
+ loaded_weight: torch.Tensor,
+ ) -> Tuple[str, torch.Tensor]:
- def permute(w, n_heads):
+ def permute(w: torch.Tensor, n_heads: int):
attn_in = self.config.head_dim * n_heads
attn_out = self.config.hidden_size
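One behavioural detail worth calling out in the llama.py hunk above: weight tying is now expressed by reusing the embedding module itself (`self.lm_head = self.model.embed_tokens`) instead of only aliasing the weight tensor. A plain-PyTorch sketch of the difference; vLLM's actual head is a `ParallelLMHead`, so this only illustrates the tying semantics:

```
import torch.nn as nn

embed = nn.Embedding(8, 4)

# Old style: a separate head module whose weight is aliased to the embedding.
head_weight_tied = nn.Linear(4, 8, bias=False)
head_weight_tied.weight = embed.weight

# New style (as in the diff above): the embedding module is reused directly,
# so there is exactly one module to load, quantize, and track.
head_module_tied = embed

print(head_weight_tied.weight is embed.weight)  # True
print(head_module_tied is embed)                # True
```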
diff --git a/vllm/model_executor/models/llama_embedding.py b/vllm/model_executor/models/llama_embedding.py
index ce05d8e3911bf..13574e84d7aa2 100644
--- a/vllm/model_executor/models/llama_embedding.py
+++ b/vllm/model_executor/models/llama_embedding.py
@@ -5,13 +5,11 @@
from vllm.attention import AttentionMetadata
from vllm.model_executor.layers.pooler import Pooler, PoolingType
-from vllm.model_executor.model_loader.weight_utils import default_weight_loader
-from vllm.model_executor.models.llama import LlamaModel
from vllm.model_executor.pooling_metadata import PoolingMetadata
from vllm.sequence import IntermediateTensors, PoolerOutput
from .interfaces import SupportsPP
-from .utils import is_pp_missing_parameter
+from .llama import LlamaModel
class LlamaEmbeddingModel(nn.Module, SupportsPP):
@@ -44,9 +42,8 @@ def forward(
intermediate_tensors: Optional[IntermediateTensors] = None,
inputs_embeds: Optional[torch.Tensor] = None,
) -> Union[torch.Tensor, IntermediateTensors]:
- return self.model.forward(input_ids, positions, kv_caches,
- attn_metadata, intermediate_tensors,
- inputs_embeds)
+ return self.model(input_ids, positions, kv_caches, attn_metadata,
+ intermediate_tensors, inputs_embeds)
def pooler(
self,
@@ -56,43 +53,7 @@ def pooler(
return self._pooler(hidden_states, pooling_metadata)
def load_weights(self, weights: Iterable[Tuple[str, torch.Tensor]]):
- stacked_params_mapping = [
- # (param_name, shard_name, shard_id)
- ("qkv_proj", "q_proj", "q"),
- ("qkv_proj", "k_proj", "k"),
- ("qkv_proj", "v_proj", "v"),
- ("gate_up_proj", "gate_proj", 0),
- ("gate_up_proj", "up_proj", 1),
- ]
- params_dict = dict(self.model.named_parameters())
- for name, loaded_weight in weights:
- if "rotary_emb.inv_freq" in name:
- continue
- if ("rotary_emb.cos_cached" in name
- or "rotary_emb.sin_cached" in name):
- # Models trained using ColossalAI may include these tensors in
- # the checkpoint. Skip them.
- continue
- for (param_name, weight_name, shard_id) in stacked_params_mapping:
- if weight_name not in name:
- continue
- name = name.replace(weight_name, param_name)
- # Skip loading extra bias for GPTQ models.
- if name.endswith(".bias") and name not in params_dict:
- continue
- if is_pp_missing_parameter(name, self):
- continue
- param = params_dict[name]
- weight_loader = param.weight_loader
- weight_loader(param, loaded_weight, shard_id)
- break
- else:
- # Skip loading extra bias for GPTQ models.
- if name.endswith(".bias") and name not in params_dict:
- continue
- if is_pp_missing_parameter(name, self):
- continue
- param = params_dict[name]
- weight_loader = getattr(param, "weight_loader",
- default_weight_loader)
- weight_loader(param, loaded_weight)
+ self.model.load_weights(weights)
+
+ def load_kv_cache_scales(self, quantization_param_path: str) -> None:
+ self.model.load_kv_cache_scales(quantization_param_path)
diff --git a/vllm/model_executor/models/mixtral.py b/vllm/model_executor/models/mixtral.py
index f93ba0875c8b1..dd384eee7ac79 100644
--- a/vllm/model_executor/models/mixtral.py
+++ b/vllm/model_executor/models/mixtral.py
@@ -322,10 +322,8 @@ class MixtralForCausalLM(nn.Module, SupportsLoRA, SupportsPP):
# LoRA specific attributes
supported_lora_modules = [
- "qkv_proj",
- "o_proj",
- "embed_tokens",
- "lm_head",
+ "qkv_proj", "o_proj", "embed_tokens", "lm_head", "w1", "w2", "w3",
+ "gate"
]
embedding_modules = {
"embed_tokens": "input_embeddings",
diff --git a/vllm/model_executor/models/phi3v.py b/vllm/model_executor/models/phi3v.py
index ebfffb25360cd..b875a83f876be 100644
--- a/vllm/model_executor/models/phi3v.py
+++ b/vllm/model_executor/models/phi3v.py
@@ -467,9 +467,10 @@ def input_processor_for_phi3v(ctx: InputContext,
input_height=h,
num_crops=num_crops))
elif isinstance(image_data, torch.Tensor):
- num_images, image_feature_size, hidden_size = image_data.shape
+ image_feature_size = [image_data.shape[0]]
+ image_data = [image_data]
elif is_list_of(image_data, torch.Tensor):
- image_feature_size = [item.shape[1] for item in image_data]
+ image_feature_size = [item.shape[0] for item in image_data]
else:
raise TypeError(f"Invalid image type: {type(image_data)}")
@@ -611,9 +612,6 @@ def _parse_and_validate_image_input(
image_sizes = kwargs.pop("image_sizes", None)
image_embeds = kwargs.pop("image_embeds", None)
- if pixel_values is None:
- return None
-
if pixel_values is None and image_embeds is None:
return None
@@ -650,7 +648,17 @@ def _process_image_input(
) -> torch.Tensor:
if image_input["type"] == "image_embeds":
- return image_input["data"]
+ image_data = image_input["data"]
+ if is_list_of(image_data, torch.Tensor):
+ # it's already a list of tensors
+ return image_data
+            if len(image_data.shape) == 3:
+                # Split a single batched 3D tensor into a list of 2D tensors
+ return list(torch.unbind(image_data, dim=0))
+ raise ValueError(
+                "We expect batched 2D tensors; "
+ "this can be either a list of 2D tensors or a single 3D tensor."
+ )
assert self.vision_embed_tokens is not None
image_embeds = self.vision_embed_tokens(image_input["data"],
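The image-embeds branch above now normalizes its input: a list of per-image 2D tensors passes through unchanged, while a single batched 3D tensor is split along dim 0. A small sketch of the 3D case:

```
import torch

# Toy batched image embeddings: (num_images, feature_len, hidden_size).
batched = torch.randn(2, 5, 16)

per_image = list(torch.unbind(batched, dim=0))
print(len(per_image))      # 2
print(per_image[0].shape)  # torch.Size([5, 16])

# A list of 2D tensors is already in this normalized form and is
# returned unchanged by the branch above.
```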
diff --git a/vllm/model_executor/models/qwen2.py b/vllm/model_executor/models/qwen2.py
index 04c1a224c981c..f9db87b7a9fbc 100644
--- a/vllm/model_executor/models/qwen2.py
+++ b/vllm/model_executor/models/qwen2.py
@@ -48,7 +48,8 @@
from vllm.sequence import IntermediateTensors
from .interfaces import SupportsLoRA, SupportsPP
-from .utils import (PPMissingLayer, is_pp_missing_parameter,
+from .utils import (PPMissingLayer, group_weights_with_prefix,
+ is_pp_missing_parameter,
make_empty_intermediate_tensors_factory, make_layers)
@@ -300,6 +301,47 @@ def forward(
hidden_states, _ = self.norm(hidden_states, residual)
return hidden_states
+ def load_weights(self, weights: Iterable[Tuple[str, torch.Tensor]]):
+ stacked_params_mapping = [
+ # (param_name, shard_name, shard_id)
+ ("qkv_proj", "q_proj", "q"),
+ ("qkv_proj", "k_proj", "k"),
+ ("qkv_proj", "v_proj", "v"),
+ ("gate_up_proj", "gate_proj", 0),
+ ("gate_up_proj", "up_proj", 1),
+ ]
+ params_dict = dict(self.named_parameters(remove_duplicate=False))
+ for name, loaded_weight in weights:
+ if "rotary_emb.inv_freq" in name:
+ continue
+ for (param_name, weight_name, shard_id) in stacked_params_mapping:
+ if weight_name not in name:
+ continue
+ name = name.replace(weight_name, param_name)
+ # Skip loading extra bias for GPTQ models.
+ if name.endswith(".bias") and name not in params_dict:
+ continue
+ if is_pp_missing_parameter(name, self):
+ continue
+ param = params_dict[name]
+ weight_loader = param.weight_loader
+ weight_loader(param, loaded_weight, shard_id)
+ break
+ else:
+ # Skip loading extra bias for GPTQ models.
+ if name.endswith(".bias") and name not in params_dict:
+ continue
+ # Remapping the name of FP8 kv-scale.
+ name = maybe_remap_kv_scale_name(name, params_dict)
+ if name is None:
+ continue
+ if is_pp_missing_parameter(name, self):
+ continue
+ param = params_dict[name]
+ weight_loader = getattr(param, "weight_loader",
+ default_weight_loader)
+ weight_loader(param, loaded_weight)
+
class Qwen2ForCausalLM(nn.Module, SupportsLoRA, SupportsPP):
packed_modules_mapping = {
@@ -393,44 +435,17 @@ def sample(
return next_tokens
def load_weights(self, weights: Iterable[Tuple[str, torch.Tensor]]):
- stacked_params_mapping = [
- # (param_name, shard_name, shard_id)
- ("qkv_proj", "q_proj", "q"),
- ("qkv_proj", "k_proj", "k"),
- ("qkv_proj", "v_proj", "v"),
- ("gate_up_proj", "gate_proj", 0),
- ("gate_up_proj", "up_proj", 1),
- ]
- params_dict = dict(self.named_parameters(remove_duplicate=False))
- for name, loaded_weight in weights:
- if "rotary_emb.inv_freq" in name:
- continue
- if self.config.tie_word_embeddings and "lm_head.weight" in name:
- continue
- for (param_name, weight_name, shard_id) in stacked_params_mapping:
- if weight_name not in name:
- continue
- name = name.replace(weight_name, param_name)
- # Skip loading extra bias for GPTQ models.
- if name.endswith(".bias") and name not in params_dict:
- continue
- if is_pp_missing_parameter(name, self):
- continue
- param = params_dict[name]
- weight_loader = param.weight_loader
- weight_loader(param, loaded_weight, shard_id)
- break
- else:
- # Skip loading extra bias for GPTQ models.
- if name.endswith(".bias") and name not in params_dict:
- continue
- # Remapping the name of FP8 kv-scale.
- name = maybe_remap_kv_scale_name(name, params_dict)
- if name is None:
- continue
- if is_pp_missing_parameter(name, self):
+ weights_group = group_weights_with_prefix(weights)
+
+ self.model.load_weights(weights_group["model"])
+
+ if not self.config.tie_word_embeddings:
+ lm_head_dict = dict(self.lm_head.named_parameters())
+ for name, loaded_weight in weights_group["lm_head"]:
+ if is_pp_missing_parameter(name, self.lm_head):
continue
- param = params_dict[name]
+
+ param = lm_head_dict[name]
weight_loader = getattr(param, "weight_loader",
default_weight_loader)
weight_loader(param, loaded_weight)
diff --git a/vllm/model_executor/models/qwen2_rm.py b/vllm/model_executor/models/qwen2_rm.py
index 51cef5c47c4d1..1aeab72b46522 100644
--- a/vllm/model_executor/models/qwen2_rm.py
+++ b/vllm/model_executor/models/qwen2_rm.py
@@ -4,7 +4,7 @@
# Copyright 2024 The Qwen team.
# Copyright 2023 The vLLM team.
"""Inference-only Qwen2-RM model compatible with HuggingFace weights."""
-from typing import Iterable, List, Optional, Tuple
+from typing import Iterable, List, Optional, Tuple, Union
import torch
from torch import nn
@@ -15,15 +15,14 @@
from vllm.model_executor.layers.linear import (ColumnParallelLinear,
RowParallelLinear)
from vllm.model_executor.layers.pooler import Pooler, PoolingType
-from vllm.model_executor.layers.quantization.base_config import (
- QuantizationConfig)
-from vllm.model_executor.model_loader.weight_utils import (
- default_weight_loader, maybe_remap_kv_scale_name)
-from vllm.model_executor.models.qwen2 import Qwen2Model
+from vllm.model_executor.layers.quantization import QuantizationConfig
+from vllm.model_executor.model_loader.weight_utils import default_weight_loader
from vllm.model_executor.pooling_metadata import PoolingMetadata
from vllm.sequence import IntermediateTensors, PoolerOutput
-from .utils import is_pp_missing_parameter
+from .interfaces import SupportsPP
+from .qwen2 import Qwen2Model
+from .utils import group_weights_with_prefix
class ReLU(nn.Module):
@@ -37,7 +36,7 @@ def forward(self, input):
return self.activation(input)
-class Qwen2ForRewardModel(nn.Module):
+class Qwen2ForRewardModel(nn.Module, SupportsPP):
packed_modules_mapping = {
"qkv_proj": [
"q_proj",
@@ -97,6 +96,9 @@ def __init__(
)
self._pooler = Pooler(pooling_type=PoolingType.ALL, normalize=False)
+ self.make_empty_intermediate_tensors = (
+ self.model.make_empty_intermediate_tensors)
+
def forward(
self,
input_ids: torch.Tensor,
@@ -104,7 +106,7 @@ def forward(
kv_caches: List[torch.Tensor],
attn_metadata: AttentionMetadata,
intermediate_tensors: Optional[IntermediateTensors] = None,
- ) -> torch.Tensor:
+ ) -> Union[torch.Tensor, IntermediateTensors]:
hidden_states = self.model(input_ids, positions, kv_caches,
attn_metadata, intermediate_tensors)
logits, _ = self.score(hidden_states)
@@ -118,45 +120,13 @@ def pooler(
return self._pooler(hidden_states, pooling_metadata)
def load_weights(self, weights: Iterable[Tuple[str, torch.Tensor]]):
- stacked_params_mapping = [
- # (param_name, shard_name, shard_id)
- ("qkv_proj", "q_proj", "q"),
- ("qkv_proj", "k_proj", "k"),
- ("qkv_proj", "v_proj", "v"),
- ("gate_up_proj", "gate_proj", 0),
- ("gate_up_proj", "up_proj", 1),
- ]
- params_dict = dict(self.named_parameters(remove_duplicate=False))
- for name, loaded_weight in weights:
- # Skip loading lm_head for embedding model
- if name == "lm_head.weight":
- continue
- if "rotary_emb.inv_freq" in name:
- continue
- for (param_name, weight_name, shard_id) in stacked_params_mapping:
- if weight_name not in name:
- continue
- name = name.replace(weight_name, param_name)
- # Skip loading extra bias for GPTQ models.
- if name.endswith(".bias") and name not in params_dict:
- continue
- if is_pp_missing_parameter(name, self):
- continue
- param = params_dict[name]
- weight_loader = param.weight_loader
- weight_loader(param, loaded_weight, shard_id)
- break
- else:
- # Skip loading extra bias for GPTQ models.
- if name.endswith(".bias") and name not in params_dict:
- continue
- # Remapping the name of FP8 kv-scale.
- name = maybe_remap_kv_scale_name(name, params_dict)
- if name is None:
- continue
- if is_pp_missing_parameter(name, self):
- continue
- param = params_dict[name]
- weight_loader = getattr(param, "weight_loader",
- default_weight_loader)
- weight_loader(param, loaded_weight)
+ weights_group = group_weights_with_prefix(weights)
+
+ self.model.load_weights(weights_group["model"])
+
+ score_dict = dict(self.score.named_parameters())
+ for name, loaded_weight in weights_group["score"]:
+ param = score_dict[name]
+ weight_loader = getattr(param, "weight_loader",
+ default_weight_loader)
+ weight_loader(param, loaded_weight)
diff --git a/vllm/model_executor/models/qwen2_vl.py b/vllm/model_executor/models/qwen2_vl.py
index fd8e2436c1e1f..24fd5152ecd09 100644
--- a/vllm/model_executor/models/qwen2_vl.py
+++ b/vllm/model_executor/models/qwen2_vl.py
@@ -967,6 +967,9 @@ def _parse_and_validate_image_input(
image_grid_thw=image_grid_thw)
if image_embeds is not None:
+ image_embeds = self._validate_and_reshape_mm_tensor(
+ image_embeds, "image embeds")
+
if not isinstance(image_embeds, torch.Tensor):
raise ValueError("Incorrect type of image embeddings. "
f"Got type: {type(image_embeds)}")
diff --git a/vllm/model_executor/models/registry.py b/vllm/model_executor/models/registry.py
new file mode 100644
index 0000000000000..46c69f17f4471
--- /dev/null
+++ b/vllm/model_executor/models/registry.py
@@ -0,0 +1,373 @@
+import importlib
+import string
+import subprocess
+import sys
+import uuid
+from functools import lru_cache, partial
+from typing import Callable, Dict, List, Optional, Tuple, Type, Union
+
+import torch.nn as nn
+
+from vllm.logger import init_logger
+from vllm.utils import is_hip
+
+from .interfaces import supports_multimodal, supports_pp
+from .interfaces_base import is_embedding_model, is_text_generation_model
+
+logger = init_logger(__name__)
+
+_TEXT_GENERATION_MODELS = {
+ # [Decoder-only]
+ "AquilaModel": ("llama", "LlamaForCausalLM"),
+ "AquilaForCausalLM": ("llama", "LlamaForCausalLM"), # AquilaChat2
+ "ArcticForCausalLM": ("arctic", "ArcticForCausalLM"),
+ "BaiChuanForCausalLM": ("baichuan", "BaiChuanForCausalLM"), # baichuan-7b
+ "BaichuanForCausalLM": ("baichuan", "BaichuanForCausalLM"), # baichuan-13b
+ "BloomForCausalLM": ("bloom", "BloomForCausalLM"),
+ "ChatGLMModel": ("chatglm", "ChatGLMForCausalLM"),
+ "ChatGLMForConditionalGeneration": ("chatglm", "ChatGLMForCausalLM"),
+ "CohereForCausalLM": ("commandr", "CohereForCausalLM"),
+ "DbrxForCausalLM": ("dbrx", "DbrxForCausalLM"),
+ "DeciLMForCausalLM": ("decilm", "DeciLMForCausalLM"),
+ "DeepseekForCausalLM": ("deepseek", "DeepseekForCausalLM"),
+ "DeepseekV2ForCausalLM": ("deepseek_v2", "DeepseekV2ForCausalLM"),
+ "ExaoneForCausalLM": ("exaone", "ExaoneForCausalLM"),
+ "FalconForCausalLM": ("falcon", "FalconForCausalLM"),
+ "GemmaForCausalLM": ("gemma", "GemmaForCausalLM"),
+ "Gemma2ForCausalLM": ("gemma2", "Gemma2ForCausalLM"),
+ "GPT2LMHeadModel": ("gpt2", "GPT2LMHeadModel"),
+ "GPTBigCodeForCausalLM": ("gpt_bigcode", "GPTBigCodeForCausalLM"),
+ "GPTJForCausalLM": ("gpt_j", "GPTJForCausalLM"),
+ "GPTNeoXForCausalLM": ("gpt_neox", "GPTNeoXForCausalLM"),
+ "GraniteForCausalLM": ("granite", "GraniteForCausalLM"),
+ "GraniteMoeForCausalLM": ("granitemoe", "GraniteMoeForCausalLM"),
+ "InternLMForCausalLM": ("llama", "LlamaForCausalLM"),
+ "InternLM2ForCausalLM": ("internlm2", "InternLM2ForCausalLM"),
+ "JAISLMHeadModel": ("jais", "JAISLMHeadModel"),
+ "JambaForCausalLM": ("jamba", "JambaForCausalLM"),
+ "LlamaForCausalLM": ("llama", "LlamaForCausalLM"),
+ # For decapoda-research/llama-*
+ "LLaMAForCausalLM": ("llama", "LlamaForCausalLM"),
+ "MistralForCausalLM": ("llama", "LlamaForCausalLM"),
+ "MixtralForCausalLM": ("mixtral", "MixtralForCausalLM"),
+ "QuantMixtralForCausalLM": ("mixtral_quant", "MixtralForCausalLM"),
+ # transformers's mpt class has lower case
+ "MptForCausalLM": ("mpt", "MPTForCausalLM"),
+ "MPTForCausalLM": ("mpt", "MPTForCausalLM"),
+ "MiniCPMForCausalLM": ("minicpm", "MiniCPMForCausalLM"),
+ "MiniCPM3ForCausalLM": ("minicpm3", "MiniCPM3ForCausalLM"),
+ "NemotronForCausalLM": ("nemotron", "NemotronForCausalLM"),
+ "OlmoForCausalLM": ("olmo", "OlmoForCausalLM"),
+ "OlmoeForCausalLM": ("olmoe", "OlmoeForCausalLM"),
+ "OPTForCausalLM": ("opt", "OPTForCausalLM"),
+ "OrionForCausalLM": ("orion", "OrionForCausalLM"),
+ "PersimmonForCausalLM": ("persimmon", "PersimmonForCausalLM"),
+ "PhiForCausalLM": ("phi", "PhiForCausalLM"),
+ "Phi3ForCausalLM": ("phi3", "Phi3ForCausalLM"),
+ "Phi3SmallForCausalLM": ("phi3_small", "Phi3SmallForCausalLM"),
+ "PhiMoEForCausalLM": ("phimoe", "PhiMoEForCausalLM"),
+ "Qwen2ForCausalLM": ("qwen2", "Qwen2ForCausalLM"),
+ "Qwen2MoeForCausalLM": ("qwen2_moe", "Qwen2MoeForCausalLM"),
+ "Qwen2VLForConditionalGeneration":
+ ("qwen2_vl", "Qwen2VLForConditionalGeneration"),
+ "RWForCausalLM": ("falcon", "FalconForCausalLM"),
+ "StableLMEpochForCausalLM": ("stablelm", "StablelmForCausalLM"),
+ "StableLmForCausalLM": ("stablelm", "StablelmForCausalLM"),
+ "Starcoder2ForCausalLM": ("starcoder2", "Starcoder2ForCausalLM"),
+ "SolarForCausalLM": ("solar", "SolarForCausalLM"),
+ "XverseForCausalLM": ("xverse", "XverseForCausalLM"),
+ # [Encoder-decoder]
+ "BartModel": ("bart", "BartForConditionalGeneration"),
+ "BartForConditionalGeneration": ("bart", "BartForConditionalGeneration"),
+}
+
+_EMBEDDING_MODELS = {
+ "MistralModel": ("llama_embedding", "LlamaEmbeddingModel"),
+ "Qwen2ForRewardModel": ("qwen2_rm", "Qwen2ForRewardModel"),
+ "Gemma2Model": ("gemma2_embedding", "Gemma2EmbeddingModel"),
+}
+
+_MULTIMODAL_MODELS = {
+ "Blip2ForConditionalGeneration":
+ ("blip2", "Blip2ForConditionalGeneration"),
+ "ChameleonForConditionalGeneration":
+ ("chameleon", "ChameleonForConditionalGeneration"),
+ "FuyuForCausalLM": ("fuyu", "FuyuForCausalLM"),
+ "InternVLChatModel": ("internvl", "InternVLChatModel"),
+ "LlavaForConditionalGeneration": ("llava",
+ "LlavaForConditionalGeneration"),
+ "LlavaNextForConditionalGeneration": ("llava_next",
+ "LlavaNextForConditionalGeneration"),
+ "LlavaNextVideoForConditionalGeneration":
+ ("llava_next_video", "LlavaNextVideoForConditionalGeneration"),
+ "LlavaOnevisionForConditionalGeneration":
+ ("llava_onevision", "LlavaOnevisionForConditionalGeneration"),
+ "MiniCPMV": ("minicpmv", "MiniCPMV"),
+ "PaliGemmaForConditionalGeneration": ("paligemma",
+ "PaliGemmaForConditionalGeneration"),
+ "Phi3VForCausalLM": ("phi3v", "Phi3VForCausalLM"),
+ "PixtralForConditionalGeneration": ("pixtral",
+ "PixtralForConditionalGeneration"),
+ "QWenLMHeadModel": ("qwen", "QWenLMHeadModel"),
+ "Qwen2VLForConditionalGeneration": ("qwen2_vl",
+ "Qwen2VLForConditionalGeneration"),
+ "UltravoxModel": ("ultravox", "UltravoxModel"),
+ "MllamaForConditionalGeneration": ("mllama",
+ "MllamaForConditionalGeneration"),
+}
+
+_SPECULATIVE_DECODING_MODELS = {
+ "EAGLEModel": ("eagle", "EAGLE"),
+ "MedusaModel": ("medusa", "Medusa"),
+ "MLPSpeculatorPreTrainedModel": ("mlp_speculator", "MLPSpeculator"),
+}
+
+_MODELS = {
+ **_TEXT_GENERATION_MODELS,
+ **_EMBEDDING_MODELS,
+ **_MULTIMODAL_MODELS,
+ **_SPECULATIVE_DECODING_MODELS,
+}
+
+# Architecture -> type or (module, class).
+# out of tree models
+_OOT_MODELS: Dict[str, Type[nn.Module]] = {}
+_OOT_MODELS_LAZY: Dict[str, Tuple[str, str]] = {}
+
+# Models not supported by ROCm.
+_ROCM_UNSUPPORTED_MODELS: List[str] = []
+
+# Models partially supported by ROCm.
+# Architecture -> Reason.
+_ROCM_SWA_REASON = ("Sliding window attention (SWA) is not yet supported in "
+ "Triton flash attention. For half-precision SWA support, "
+ "please use CK flash attention by setting "
+ "`VLLM_USE_TRITON_FLASH_ATTN=0`")
+_ROCM_PARTIALLY_SUPPORTED_MODELS: Dict[str, str] = {
+ "Qwen2ForCausalLM":
+ _ROCM_SWA_REASON,
+ "MistralForCausalLM":
+ _ROCM_SWA_REASON,
+ "MixtralForCausalLM":
+ _ROCM_SWA_REASON,
+ "PaliGemmaForConditionalGeneration":
+ ("ROCm flash attention does not yet "
+ "fully support 32-bit precision on PaliGemma"),
+ "Phi3VForCausalLM":
+ ("ROCm Triton flash attention may run into compilation errors due to "
+ "excessive use of shared memory. If this happens, disable Triton FA "
+ "by setting `VLLM_USE_TRITON_FLASH_ATTN=0`")
+}
+
+
+class ModelRegistry:
+
+ @staticmethod
+ def _get_module_cls_name(model_arch: str) -> Tuple[str, str]:
+ if model_arch in _MODELS:
+ module_relname, cls_name = _MODELS[model_arch]
+ return f"vllm.model_executor.models.{module_relname}", cls_name
+
+ if model_arch in _OOT_MODELS_LAZY:
+ return _OOT_MODELS_LAZY[model_arch]
+
+ raise KeyError(model_arch)
+
+ @staticmethod
+ @lru_cache(maxsize=128)
+ def _try_get_model_stateful(model_arch: str) -> Optional[Type[nn.Module]]:
+ try:
+ mod_name, cls_name = ModelRegistry._get_module_cls_name(model_arch)
+ except KeyError:
+ return None
+
+ module = importlib.import_module(mod_name)
+ return getattr(module, cls_name, None)
+
+ @staticmethod
+ def _try_get_model_stateless(model_arch: str) -> Optional[Type[nn.Module]]:
+ if model_arch in _OOT_MODELS:
+ return _OOT_MODELS[model_arch]
+
+ if is_hip():
+ if model_arch in _ROCM_UNSUPPORTED_MODELS:
+ raise ValueError(
+ f"Model architecture {model_arch} is not supported by "
+ "ROCm for now.")
+ if model_arch in _ROCM_PARTIALLY_SUPPORTED_MODELS:
+ logger.warning(
+ "Model architecture %s is partially supported by ROCm: %s",
+ model_arch, _ROCM_PARTIALLY_SUPPORTED_MODELS[model_arch])
+
+ return None
+
+ @staticmethod
+ def _try_load_model_cls(model_arch: str) -> Optional[Type[nn.Module]]:
+ model = ModelRegistry._try_get_model_stateless(model_arch)
+ if model is not None:
+ return model
+
+ return ModelRegistry._try_get_model_stateful(model_arch)
+
+ @staticmethod
+ def resolve_model_cls(
+ architectures: Union[str, List[str]], ) -> Tuple[Type[nn.Module], str]:
+ if isinstance(architectures, str):
+ architectures = [architectures]
+ if not architectures:
+ logger.warning("No model architectures are specified")
+
+ for arch in architectures:
+ model_cls = ModelRegistry._try_load_model_cls(arch)
+ if model_cls is not None:
+ return (model_cls, arch)
+
+ raise ValueError(
+ f"Model architectures {architectures} are not supported for now. "
+ f"Supported architectures: {ModelRegistry.get_supported_archs()}")
+
+ @staticmethod
+ def get_supported_archs() -> List[str]:
+ return list(_MODELS.keys()) + list(_OOT_MODELS.keys())
+
+ @staticmethod
+ def register_model(model_arch: str, model_cls: Union[Type[nn.Module],
+ str]):
+ """
+ Register an external model to be used in vLLM.
+
+ :code:`model_cls` can be either:
+
+ - A :class:`torch.nn.Module` class directly referencing the model.
+        - A string in the format :code:`<module>:<class>` which can be used to
+ lazily import the model. This is useful to avoid initializing CUDA
+ when importing the model and thus the related error
+ :code:`RuntimeError: Cannot re-initialize CUDA in forked subprocess`.
+ """
+ if model_arch in _MODELS:
+ logger.warning(
+ "Model architecture %s is already registered, and will be "
+ "overwritten by the new model class %s.", model_arch,
+ model_cls)
+
+ if isinstance(model_cls, str):
+ split_str = model_cls.split(":")
+ if len(split_str) != 2:
+ msg = "Expected a string in the format `:`"
+ raise ValueError(msg)
+
+ module_name, cls_name = split_str
+ _OOT_MODELS_LAZY[model_arch] = module_name, cls_name
+ else:
+ _OOT_MODELS[model_arch] = model_cls
+
+ @staticmethod
+ @lru_cache(maxsize=128)
+ def _check_stateless(
+ func: Callable[[Type[nn.Module]], bool],
+ model_arch: str,
+ *,
+ default: Optional[bool] = None,
+ ) -> bool:
+ """
+ Run a boolean function against a model and return the result.
+
+ If the model is not found, returns the provided default value.
+
+ If the model is not already imported, the function is run inside a
+ subprocess to avoid initializing CUDA for the main program.
+ """
+ model = ModelRegistry._try_get_model_stateless(model_arch)
+ if model is not None:
+ return func(model)
+
+ try:
+ mod_name, cls_name = ModelRegistry._get_module_cls_name(model_arch)
+ except KeyError:
+ if default is not None:
+ return default
+
+ raise
+
+ valid_name_characters = string.ascii_letters + string.digits + "._"
+ if any(s not in valid_name_characters for s in mod_name):
+ raise ValueError(f"Unsafe module name detected for {model_arch}")
+ if any(s not in valid_name_characters for s in cls_name):
+ raise ValueError(f"Unsafe class name detected for {model_arch}")
+ if any(s not in valid_name_characters for s in func.__module__):
+ raise ValueError(f"Unsafe module name detected for {func}")
+ if any(s not in valid_name_characters for s in func.__name__):
+ raise ValueError(f"Unsafe class name detected for {func}")
+
+ err_id = uuid.uuid4()
+
+ stmts = ";".join([
+ f"from {mod_name} import {cls_name}",
+ f"from {func.__module__} import {func.__name__}",
+ f"assert {func.__name__}({cls_name}), '{err_id}'",
+ ])
+
+ result = subprocess.run([sys.executable, "-c", stmts],
+ capture_output=True)
+
+ if result.returncode != 0:
+ err_lines = [line.decode() for line in result.stderr.splitlines()]
+ if err_lines and err_lines[-1] != f"AssertionError: {err_id}":
+ err_str = "\n".join(err_lines)
+ raise RuntimeError(
+ "An unexpected error occurred while importing the model in "
+ f"another process. Error log:\n{err_str}")
+
+ return result.returncode == 0
+
+ @staticmethod
+ def is_text_generation_model(architectures: Union[str, List[str]]) -> bool:
+ if isinstance(architectures, str):
+ architectures = [architectures]
+ if not architectures:
+ logger.warning("No model architectures are specified")
+
+ is_txt_gen = partial(ModelRegistry._check_stateless,
+ is_text_generation_model,
+ default=False)
+
+ return any(is_txt_gen(arch) for arch in architectures)
+
+ @staticmethod
+ def is_embedding_model(architectures: Union[str, List[str]]) -> bool:
+ if isinstance(architectures, str):
+ architectures = [architectures]
+ if not architectures:
+ logger.warning("No model architectures are specified")
+
+ is_emb = partial(ModelRegistry._check_stateless,
+ is_embedding_model,
+ default=False)
+
+ return any(is_emb(arch) for arch in architectures)
+
+ @staticmethod
+ def is_multimodal_model(architectures: Union[str, List[str]]) -> bool:
+ if isinstance(architectures, str):
+ architectures = [architectures]
+ if not architectures:
+ logger.warning("No model architectures are specified")
+
+ is_mm = partial(ModelRegistry._check_stateless,
+ supports_multimodal,
+ default=False)
+
+ return any(is_mm(arch) for arch in architectures)
+
+ @staticmethod
+ def is_pp_supported_model(architectures: Union[str, List[str]]) -> bool:
+ if isinstance(architectures, str):
+ architectures = [architectures]
+ if not architectures:
+ logger.warning("No model architectures are specified")
+
+ is_pp = partial(ModelRegistry._check_stateless,
+ supports_pp,
+ default=False)
+
+ return any(is_pp(arch) for arch in architectures)
diff --git a/vllm/model_executor/models/ultravox.py b/vllm/model_executor/models/ultravox.py
index daa6e72dd1002..101cf38c96b01 100644
--- a/vllm/model_executor/models/ultravox.py
+++ b/vllm/model_executor/models/ultravox.py
@@ -38,6 +38,7 @@
from vllm.sequence import (VLLM_TOKEN_ID_ARRAY_TYPE, IntermediateTensors,
SequenceData)
from vllm.transformers_utils.configs.ultravox import UltravoxConfig
+from vllm.utils import is_list_of
from .interfaces import SupportsMultiModal, SupportsPP
@@ -119,6 +120,10 @@ def input_mapper_for_ultravox(ctx: InputContext, data: object):
if not isinstance(data, list):
data = [data]
+ # If the audio inputs are embeddings, no need for preprocessing
+ if is_list_of(data, torch.Tensor, check="all"):
+ return MultiModalInputs({"audio_embeds": data})
+
audio_features = []
for audio_input in data:
if not isinstance(audio_input, tuple):
@@ -165,25 +170,30 @@ def input_processor_for_ultravox(ctx: InputContext, llm_inputs: LLMInputs):
audios = [audios]
audio_token_counts = []
- for audio_data, sample_rate in audios:
- audio_length = audio_data.shape[0]
- if sample_rate != feature_extractor.sampling_rate:
- # Account for resampling.
- adjustment = feature_extractor.sampling_rate / sample_rate
- audio_length = math.ceil(adjustment * audio_length)
-
- feature_extractor_output_length = math.ceil(
- (audio_length - (feature_extractor.hop_length - 1)) /
- feature_extractor.hop_length)
-
- uv_config = ctx.get_hf_config(UltravoxConfig)
- audio_num_tokens = min(
- max(
- 1,
- math.ceil(feature_extractor_output_length /
- (uv_config.stack_factor * 2))),
- get_ultravox_max_audio_tokens(ctx))
- audio_token_counts.append(audio_num_tokens)
+ for audio in audios:
+ if isinstance(audio, torch.Tensor):
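+            # Pre-computed audio embeddings: dimension 1 already holds the
+            # number of audio tokens, so no feature extraction is needed.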
+ audio_num_tokens = audio.shape[1]
+ audio_token_counts.append(audio_num_tokens)
+ else:
+ audio_data, sample_rate = audio
+ audio_length = audio_data.shape[0]
+ if sample_rate != feature_extractor.sampling_rate:
+ # Account for resampling.
+ adjustment = feature_extractor.sampling_rate / sample_rate
+ audio_length = math.ceil(adjustment * audio_length)
+
+ feature_extractor_output_length = math.ceil(
+ (audio_length - (feature_extractor.hop_length - 1)) /
+ feature_extractor.hop_length)
+
+ uv_config = ctx.get_hf_config(UltravoxConfig)
+ audio_num_tokens = min(
+ max(
+ 1,
+ math.ceil(feature_extractor_output_length /
+ (uv_config.stack_factor * 2))),
+ get_ultravox_max_audio_tokens(ctx))
+ audio_token_counts.append(audio_num_tokens)
tokenizer = cached_get_tokenizer(ctx.model_config.tokenizer)
diff --git a/vllm/model_executor/models/utils.py b/vllm/model_executor/models/utils.py
index 761f0406b1333..916f373d4481e 100644
--- a/vllm/model_executor/models/utils.py
+++ b/vllm/model_executor/models/utils.py
@@ -306,10 +306,12 @@ def get_pp_missing_layer_names(model: torch.nn.Module) -> List[str]:
def is_pp_missing_parameter(name: str, model: torch.nn.Module) -> bool:
"""Check if a parameter is missing in a pipeline parallel model."""
- for missing_layer_name in get_pp_missing_layer_names(model):
- if name.startswith(missing_layer_name):
- return True
- return False
+ if isinstance(model, PPMissingLayer):
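+        # `model` itself is a placeholder for a layer that lives on another
+        # pipeline stage, so all of its parameters are missing here.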
+ return True
+
+ return any(
+ name.startswith(missing_layer_name)
+ for missing_layer_name in get_pp_missing_layer_names(model))
def make_empty_intermediate_tensors_factory(keys: List[str], hidden_size: int):
diff --git a/vllm/outputs.py b/vllm/outputs.py
index 44cde6b561d85..4f29226aa5128 100644
--- a/vllm/outputs.py
+++ b/vllm/outputs.py
@@ -142,11 +142,7 @@ def from_seq_group(cls, seq_group: SequenceGroup,
else:
# Get the top-n sequences.
n = sampling_params.n
- if sampling_params.use_beam_search:
- sorting_key = lambda seq: seq.get_beam_search_score(
- sampling_params.length_penalty)
- else:
- sorting_key = lambda seq: seq.get_cumulative_logprob()
+ sorting_key = lambda seq: seq.get_cumulative_logprob()
sorted_seqs = sorted(seqs, key=sorting_key, reverse=True)
top_n_seqs = sorted_seqs[:n]
diff --git a/vllm/sampling_params.py b/vllm/sampling_params.py
index 83f76410882de..e074312280584 100644
--- a/vllm/sampling_params.py
+++ b/vllm/sampling_params.py
@@ -10,7 +10,6 @@
from pydantic import BaseModel
from typing_extensions import Annotated
-import vllm.envs as envs
from vllm.logger import init_logger
logger = init_logger(__name__)
@@ -23,7 +22,6 @@ class SamplingType(IntEnum):
GREEDY = 0
RANDOM = 1
RANDOM_SEED = 2
- BEAM = 3
LogitsProcessor = Union[Callable[[List[int], torch.Tensor], torch.Tensor],
@@ -134,16 +132,6 @@ class SamplingParams(
considered, relative to the probability of the most likely token.
Must be in [0, 1]. Set to 0 to disable this.
seed: Random seed to use for the generation.
- use_beam_search: Whether to use beam search instead of sampling.
- length_penalty: Float that penalizes sequences based on their length.
- Used in beam search.
- early_stopping: Controls the stopping condition for beam search. It
- accepts the following values: `True`, where the generation stops as
- soon as there are `best_of` complete candidates; `False`, where an
- heuristic is applied and the generation stops when is it very
- unlikely to find better candidates; `"never"`, where the beam search
- procedure only stops when there cannot be better candidates
- (canonical beam search algorithm).
stop: List of strings that stop the generation when they are generated.
The returned output will not contain the stop strings.
stop_token_ids: List of tokens that stop the generation when they are
@@ -193,9 +181,6 @@ class SamplingParams(
top_k: int = -1
min_p: float = 0.0
seed: Optional[int] = None
- use_beam_search: bool = False
- length_penalty: float = 1.0
- early_stopping: Union[bool, str] = False
stop: Optional[Union[str, List[str]]] = None
stop_token_ids: Optional[List[int]] = None
ignore_eos: bool = False
@@ -238,9 +223,6 @@ def from_optional(
top_k: int = -1,
min_p: float = 0.0,
seed: Optional[int] = None,
- use_beam_search: bool = False,
- length_penalty: float = 1.0,
- early_stopping: Union[bool, str] = False,
stop: Optional[Union[str, List[str]]] = None,
stop_token_ids: Optional[List[int]] = None,
include_stop_str_in_output: bool = False,
@@ -280,9 +262,6 @@ def from_optional(
top_k=top_k,
min_p=min_p,
seed=seed,
- use_beam_search=use_beam_search,
- length_penalty=length_penalty,
- early_stopping=early_stopping,
stop=stop,
stop_token_ids=stop_token_ids,
include_stop_str_in_output=include_stop_str_in_output,
@@ -334,20 +313,13 @@ def __post_init__(self) -> None:
self.output_text_buffer_length = max(len(s) for s in self.stop) - 1
self._verify_args()
- if self.use_beam_search:
- if not envs.VLLM_ALLOW_DEPRECATED_BEAM_SEARCH:
- raise ValueError(
- "Using beam search as a sampling parameter is deprecated, and will be removed in the future release. Please use the `vllm.LLM.use_beam_search` method for dedicated beam search instead, or set the environment variable `VLLM_ALLOW_DEPRECATED_BEAM_SEARCH=1` to suppress this error. For more details, see https://github.com/vllm-project/vllm/issues/8306 ." # noqa
- )
- self._verify_beam_search()
- else:
- self._verify_non_beam_search()
- if self.temperature < _SAMPLING_EPS:
- # Zero temperature means greedy sampling.
- self.top_p = 1.0
- self.top_k = -1
- self.min_p = 0.0
- self._verify_greedy_sampling()
+
+ if self.temperature < _SAMPLING_EPS:
+ # Zero temperature means greedy sampling.
+ self.top_p = 1.0
+ self.top_k = -1
+ self.min_p = 0.0
+ self._verify_greedy_sampling()
# eos_token_id is added to this by the engine
self._all_stop_token_ids = set(self.stop_token_ids)
@@ -417,31 +389,6 @@ def _verify_args(self) -> None:
RequestOutputKind.DELTA):
raise ValueError("best_of must equal n to use output_kind=DELTA")
- def _verify_beam_search(self) -> None:
- if self.best_of == 1:
- raise ValueError("best_of must be greater than 1 when using beam "
- f"search. Got {self.best_of}.")
- if self.temperature > _SAMPLING_EPS:
- raise ValueError("temperature must be 0 when using beam search.")
- if self.top_p < 1.0 - _SAMPLING_EPS:
- raise ValueError("top_p must be 1 when using beam search.")
- if self.top_k != -1:
- raise ValueError("top_k must be -1 when using beam search.")
- if self.early_stopping not in [True, False, "never"]:
- raise ValueError(
- f"early_stopping must be True, False, or 'never', "
- f"got {self.early_stopping}.")
-
- def _verify_non_beam_search(self) -> None:
- if self.early_stopping is not False:
- raise ValueError("early_stopping is not effective and must be "
- "False when not using beam search.")
- if (self.length_penalty < 1.0 - _SAMPLING_EPS
- or self.length_penalty > 1.0 + _SAMPLING_EPS):
- raise ValueError(
- "length_penalty is not effective and must be the "
- "default value of 1.0 when not using beam search.")
-
def _verify_greedy_sampling(self) -> None:
assert isinstance(self.best_of, int)
if self.best_of > 1:
@@ -476,8 +423,6 @@ def update_from_generation_config(
@cached_property
def sampling_type(self) -> SamplingType:
- if self.use_beam_search:
- return SamplingType.BEAM
if self.temperature < _SAMPLING_EPS:
return SamplingType.GREEDY
if self.seed is not None:
@@ -514,9 +459,6 @@ def __repr__(self) -> str:
f"top_k={self.top_k}, "
f"min_p={self.min_p}, "
f"seed={self.seed}, "
- f"use_beam_search={self.use_beam_search}, "
- f"length_penalty={self.length_penalty}, "
- f"early_stopping={self.early_stopping}, "
f"stop={self.stop}, "
f"stop_token_ids={self.stop_token_ids}, "
f"include_stop_str_in_output={self.include_stop_str_in_output}, "
@@ -530,3 +472,16 @@ def __repr__(self) -> str:
f"{self.spaces_between_special_tokens}, "
f"truncate_prompt_tokens={self.truncate_prompt_tokens}), "
f"guided_decoding={self.guided_decoding}")
+
+
+class BeamSearchParams(
+ msgspec.Struct,
+ omit_defaults=True, # type: ignore[call-arg]
+ # required for @cached_property.
+ dict=True): # type: ignore[call-arg]
+ """Beam search parameters for text generation."""
+ beam_width: int
+ max_tokens: int
+ ignore_eos: bool = False
+ temperature: float = 0.0
+ length_penalty: float = 1.0
diff --git a/vllm/sequence.py b/vllm/sequence.py
index 781bcedde2b52..9116408a001ff 100644
--- a/vllm/sequence.py
+++ b/vllm/sequence.py
@@ -577,25 +577,6 @@ def get_output_token_ids(self) -> Tuple[int, ...]:
def get_cumulative_logprob(self) -> float:
return self.data.cumulative_logprob
- def get_beam_search_score(self,
- length_penalty: float = 1.0,
- seq_len: Optional[int] = None,
- eos_token_id: Optional[int] = None) -> float:
- """Calculate the beam search score with length penalty.
-
- Adapted from
-
- https://github.com/huggingface/transformers/blob/ccb92be23def445f2afdea94c31286f84b89eb5b/src/transformers/generation/beam_search.py#L938
- """
- if seq_len is None:
- seq_len = self.get_len()
- # NOTE: HF implementation does not count the EOS token
- # towards the length, we align with that here for testing.
- if (eos_token_id is not None
- and self.get_last_token_id() == eos_token_id):
- seq_len -= 1
- return self.get_cumulative_logprob() / (seq_len**length_penalty)
-
def is_finished(self) -> bool:
return SequenceStatus.is_finished(self.status)
@@ -809,25 +790,18 @@ def set_finished_time(self, time: Optional[float]) -> None:
def get_max_num_running_seqs(self) -> int:
"""The maximum number of sequences running in parallel in the remaining
lifetime of the request."""
- if self.sampling_params and self.sampling_params.use_beam_search:
- # For beam search, maximally there will always be `best_of` beam
- # candidates running in the future.
+ if self.sampling_params:
best_of = self.sampling_params.best_of
assert isinstance(best_of, int)
- return best_of
- else:
- if self.sampling_params:
- best_of = self.sampling_params.best_of
- assert isinstance(best_of, int)
- if best_of > self.num_seqs():
- # At prompt stage, the sequence group is not yet filled up
- # and only have one sequence running. However, in the
- # generation stage, we will have `best_of` sequences
- # running.
- return best_of
- # At sampling stages, return the number of actual sequences
- # that are not finished yet.
- return self.num_unfinished_seqs()
+ if best_of > self.num_seqs():
+ # At prompt stage, the sequence group is not yet filled up
+ # and only have one sequence running. However, in the
+ # generation stage, we will have `best_of` sequences
+ # running.
+ return best_of
+ # At sampling stages, return the number of actual sequences
+ # that are not finished yet.
+ return self.num_unfinished_seqs()
def get_seqs(
self,
diff --git a/vllm/spec_decode/draft_model_runner.py b/vllm/spec_decode/draft_model_runner.py
index 984747c53c6c0..aaf6ec5f508c8 100644
--- a/vllm/spec_decode/draft_model_runner.py
+++ b/vllm/spec_decode/draft_model_runner.py
@@ -6,11 +6,16 @@
from vllm.model_executor.layers.sampler import SamplerOutput
try:
- from vllm.attention.backends.flash_attn import FlashAttentionMetadata
-except ModuleNotFoundError:
- # vllm_flash_attn is not installed, use the identical ROCm FA metadata
- from vllm.attention.backends.rocm_flash_attn import (
- ROCmFlashAttentionMetadata as FlashAttentionMetadata)
+ try:
+ from vllm.attention.backends.flash_attn import FlashAttentionMetadata
+ except (ModuleNotFoundError, ImportError):
+ # vllm_flash_attn is not installed, try the ROCm FA metadata
+ from vllm.attention.backends.rocm_flash_attn import (
+ ROCmFlashAttentionMetadata as FlashAttentionMetadata)
+except (ModuleNotFoundError, ImportError) as err:
+ raise RuntimeError(
+ "Draft model speculative decoding currently only supports"
+ "CUDA and ROCm flash attention backend.") from err
from vllm.config import (CacheConfig, DeviceConfig, LoadConfig, LoRAConfig,
ModelConfig, ObservabilityConfig, ParallelConfig,
diff --git a/vllm/transformers_utils/config.py b/vllm/transformers_utils/config.py
index 0f20e8d0c8213..bfba4ca77e1fe 100644
--- a/vllm/transformers_utils/config.py
+++ b/vllm/transformers_utils/config.py
@@ -1,4 +1,3 @@
-import contextlib
import enum
import json
from pathlib import Path
@@ -61,13 +60,6 @@
**_CONFIG_REGISTRY_OVERRIDE_HF
}
-for name, cls in _CONFIG_REGISTRY.items():
- with contextlib.suppress(ValueError):
- if name in _CONFIG_REGISTRY_OVERRIDE_HF:
- AutoConfig.register(name, cls, exist_ok=True)
- else:
- AutoConfig.register(name, cls)
-
class ConfigFormat(str, enum.Enum):
AUTO = "auto"
diff --git a/vllm/utils.py b/vllm/utils.py
index a025c3c40a434..9c6f1a347fb83 100644
--- a/vllm/utils.py
+++ b/vllm/utils.py
@@ -504,6 +504,15 @@ async def merge_async_iterators(
await it.aclose()
+async def collect_from_async_generator(
+ iterator: AsyncGenerator[T, None]) -> List[T]:
+ """Collect all items from an async generator into a list."""
+ items = []
+ async for item in iterator:
+ items.append(item)
+ return items
+
+
def get_ip() -> str:
host_ip = envs.VLLM_HOST_IP
if host_ip:
@@ -1201,11 +1210,21 @@ def _pull_args_from_config(args: List[str]) -> List[str]:
config_args = FlexibleArgumentParser._load_config_file(file_path)
# 0th index is for {serve,chat,complete}
+ # followed by model_tag (only for serve)
# followed by config args
# followed by rest of cli args.
# maintaining this order will enforce the precedence
# of cli > config > defaults
- args = [args[0]] + config_args + args[1:index] + args[index + 2:]
+ if args[0] == "serve":
+ if index == 1:
+ raise ValueError(
+ "No model_tag specified! Please check your command-line"
+ " arguments.")
+ args = [args[0]] + [
+ args[1]
+ ] + config_args + args[2:index] + args[index + 2:]
+ else:
+ args = [args[0]] + config_args + args[1:index] + args[index + 2:]
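+        # For example (illustrative), `vllm serve <model_tag> --config
+        # cfg.yaml --port 8000` becomes ["serve", "<model_tag>",
+        # *config_args, "--port", "8000"], so flags given on the command line
+        # still override values pulled from the config file.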
return args
@@ -1258,6 +1277,15 @@ async def _run_task_with_lock(task: Callable, lock: asyncio.Lock, *args,
return await task(*args, **kwargs)
+def supports_kw(callable: Callable[..., object], kw_name: str) -> bool:
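+    # Return True if `callable` accepts a keyword named `kw_name`, either as
+    # a named parameter or via a **kwargs catch-all. Illustrative examples:
+    #   supports_kw(lambda x, **kw: x, "y") -> True   (caught by **kw)
+    #   supports_kw(lambda x: x, "y")       -> False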
+ params = inspect.signature(callable).parameters
+ if kw_name in params:
+ return True
+
+ return any(param.kind == inspect.Parameter.VAR_KEYWORD
+ for param in params.values())
+
+
def get_allowed_kwarg_only_overrides(
callable: Callable[..., object],
overrides: Optional[Dict[str, Any]],
@@ -1342,3 +1370,22 @@ def dec(self, num=1):
@property
def value(self):
return self._value
+
+
+def get_beam_search_score(
+ tokens: List[int],
+ cumulative_logprob: float,
+ eos_token_id: int,
+ length_penalty: float = 1.0,
+) -> float:
+ """Calculate the beam search score with length penalty.
+
+ Adapted from
+
+ https://github.com/huggingface/transformers/blob/ccb92be23def445f2afdea94c31286f84b89eb5b/src/transformers/generation/beam_search.py#L938
+ """
+ seq_len = len(tokens)
+ if tokens[-1] == eos_token_id:
+ seq_len -= 1
+
+ return cumulative_logprob / (seq_len**length_penalty)
diff --git a/vllm/worker/cpu_enc_dec_model_runner.py b/vllm/worker/cpu_enc_dec_model_runner.py
new file mode 100644
index 0000000000000..8ebbf6db939bc
--- /dev/null
+++ b/vllm/worker/cpu_enc_dec_model_runner.py
@@ -0,0 +1,311 @@
+import dataclasses
+from typing import TYPE_CHECKING, Any, Dict, List, Optional, Tuple, Type, cast
+
+import torch
+
+from vllm.attention import AttentionMetadata
+from vllm.model_executor.layers.sampler import SamplerOutput
+from vllm.multimodal import MultiModalInputs
+from vllm.sequence import IntermediateTensors, SequenceGroupMetadata
+from vllm.utils import make_tensor_with_pad
+from vllm.worker.cpu_model_runner import (CPUModelRunner,
+ ModelInputForCPUBuilder,
+ ModelInputForCPUWithSamplingMetadata)
+from vllm.worker.model_runner_base import (
+ _add_attn_metadata_broadcastable_dict,
+ _add_sampling_metadata_broadcastable_dict)
+
+if TYPE_CHECKING:
+ from vllm.attention.backends.abstract import AttentionBackend
+
+
+@dataclasses.dataclass(frozen=True)
+class EncoderDecoderModelInputForCPU(ModelInputForCPUWithSamplingMetadata):
+ """
+ Used by the EncoderDecoderModelRunner.
+ """
+ encoder_input_tokens: Optional[torch.Tensor] = None
+ encoder_input_positions: Optional[torch.Tensor] = None
+
+ def as_broadcastable_tensor_dict(self) -> Dict[str, Any]:
+ tensor_dict = {
+ "input_tokens": self.input_tokens,
+ "input_positions": self.input_positions,
+ "encoder_input_tokens": self.encoder_input_tokens,
+ "encoder_input_positions": self.encoder_input_positions,
+ }
+ _add_attn_metadata_broadcastable_dict(tensor_dict, self.attn_metadata)
+ _add_sampling_metadata_broadcastable_dict(tensor_dict,
+ self.sampling_metadata)
+ return tensor_dict
+
+ @classmethod
+ def from_broadcasted_tensor_dict(
+ cls,
+ tensor_dict: Dict[str, Any],
+ attn_backend: Optional["AttentionBackend"] = None,
+ ) -> "EncoderDecoderModelInputForCPU":
+ return cast(
+ EncoderDecoderModelInputForCPU,
+ super().from_broadcasted_tensor_dict(tensor_dict, attn_backend))
+
+
+class CPUEncoderDecoderModelRunner(CPUModelRunner):
+ _model_input_cls: Type[EncoderDecoderModelInputForCPU] = (
+ EncoderDecoderModelInputForCPU)
+ _builder_cls: Type[ModelInputForCPUBuilder] = ModelInputForCPUBuilder
+
+ def _list_to_int32_tensor(
+ self,
+ _list: List[int],
+ ) -> torch.Tensor:
+ return torch.tensor(_list, dtype=torch.int32, device=self.device)
+
+ def _list_to_long_tensor(
+ self,
+ _list: List[int],
+ ) -> torch.Tensor:
+ return torch.tensor(_list, dtype=torch.long, device=self.device)
+
+ def _empty_int32_tensor(self) -> torch.Tensor:
+ return self._list_to_int32_tensor([])
+
+ def _empty_long_tensor(self) -> torch.Tensor:
+ return self._list_to_long_tensor([])
+
+ def make_model_input_from_broadcasted_tensor_dict(
+ self, tensor_dict: Dict[str,
+ Any]) -> EncoderDecoderModelInputForCPU:
+ return EncoderDecoderModelInputForCPU.from_broadcasted_tensor_dict(
+ tensor_dict,
+ attn_backend=self.attn_backend,
+ )
+
+ def prepare_model_input(
+ self,
+ seq_group_metadata_list: List[SequenceGroupMetadata],
+ virtual_engine: int = 0,
+ finished_requests_ids: Optional[List[str]] = None
+ ) -> EncoderDecoderModelInputForCPU:
+ model_input = super().prepare_model_input(seq_group_metadata_list,
+ virtual_engine,
+ finished_requests_ids)
+ model_input = cast(EncoderDecoderModelInputForCPU, model_input)
+ (
+ attn_metadata,
+ encoder_input_tokens_tensor,
+ encoder_input_positions_tensor,
+ ) = self._prepare_encoder_model_input_tensors(seq_group_metadata_list,
+ model_input)
+ return dataclasses.replace(
+ model_input,
+ attn_metadata=attn_metadata,
+ encoder_input_tokens=encoder_input_tokens_tensor,
+ encoder_input_positions=encoder_input_positions_tensor,
+ )
+
+ def _prepare_encoder_model_input_tensors(
+ self,
+ seq_group_metadata_list: List[SequenceGroupMetadata],
+ model_input: EncoderDecoderModelInputForCPU,
+ ) -> Tuple[AttentionMetadata, Optional[torch.Tensor],
+ Optional[torch.Tensor]]:
+ """Helper method to prepare the encoder- and cross-attn-related
+ model inputs based on a given sequence group. These additional inputs
+ are used to augment an already-computed `EncoderDecoderModelInput`
+ data structure which already has decoder-related model inputs
+ populated.
+
+ Sets the following attn_metadata fields:
+ * `num_encoder_tokens`
+ * `encoder_seq_lens`
+ * `encoder_seq_lens_tensor`
+ * `max_encoder_seq_len`
+ * `cross_slot_mapping`
+ * `cross_block_tables`
+
+ Constructs a new model inputs data structure, based on
+        (1) the existing fields in the `model_input` argument,
+ and (2) the following additional fields which are
+ computed (or in the case of `attn_metadata`, updated)
+ by this function:
+ * attn_metadata
+ * encoder_input_tokens
+ * encoder_input_positions
+
+ Arguments:
+
+ * seq_group_metadata_list: list of sequence groups for which to
+ compute inputs
+        * model_input: model inputs data structure with decoder-oriented
+ fields already computed.
+
+ Return:
+
+        * Updated attention metadata, encoder input tokens tensor, and
+          encoder input positions tensor.
+ """
+
+ if len(seq_group_metadata_list) == 0:
+ return (model_input.attn_metadata, None, None)
+
+        # Since chunked prefill is not supported, the entire batch is either
+        # prefill or decode.
+ is_prompt = seq_group_metadata_list[0].is_prompt
+
+ # Build encoder inputs
+ encoder_seq_lens: List[int] = []
+ if is_prompt:
+ # Prefill phase.
+ cross_block_tables = self._empty_int32_tensor().view(
+ len(seq_group_metadata_list), -1)
+
+ # Extract input tokens/positions, cross-attention slot-mapping,
+ # & seq len from each sequence group metadata
+ (
+ encoder_input_tokens,
+ encoder_input_positions,
+ cross_slot_mapping,
+ ) = (
+ [],
+ [],
+ [],
+ )
+ for seq_group_metadata in seq_group_metadata_list:
+ # Build seq lens
+ seq_len = seq_group_metadata.encoder_seq_data.get_len()
+ token_ids = seq_group_metadata.encoder_seq_data.get_token_ids()
+ encoder_seq_lens.append(seq_len)
+
+ # Build slot mapping
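+                # For example (illustrative), with block_size=16 token i=35
+                # falls in cross_block_table[35 // 16] == cross_block_table[2]
+                # at offset 35 % 16 == 3.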
+ for i in range(0, seq_len):
+ block_number = seq_group_metadata.cross_block_table[
+ i // self.block_size]
+ block_offset = i % self.block_size
+ slot = block_number * self.block_size + block_offset
+ cross_slot_mapping.append(slot)
+
+ # Build encoder input tokens
+ encoder_input_tokens.extend(token_ids)
+ encoder_input_positions.extend(list(range(0, seq_len)))
+
+ # Convert tokens/positions & cross-attention
+ # slot-mapping to encoder input tensors
+ encoder_input_tokens_tensor = self._list_to_long_tensor(
+ encoder_input_tokens)
+ encoder_input_positions_tensor = self._list_to_long_tensor(
+ encoder_input_positions)
+ cross_slot_mapping_tensor = self._list_to_long_tensor(
+ cross_slot_mapping)
+
+ else:
+ # Decode phase.
+ encoder_input_tokens_tensor = self._empty_long_tensor()
+ encoder_input_positions_tensor = self._empty_long_tensor()
+ cross_slot_mapping_tensor = self._empty_long_tensor()
+ # Extract cross-attention block tables &
+ # seq len from each sequence group metadata.
+ # Cross-attention block tables are empty
+ # during vLLM memory profiling.
+ cross_block_tables = []
+ for seq_group_metadata in seq_group_metadata_list:
+ for _ in range(len(seq_group_metadata.seq_data)):
+ encoder_seq_lens.append(
+ seq_group_metadata.encoder_seq_data.get_len())
+ cross_block_table = seq_group_metadata.cross_block_table
+ cross_block_tables.append([] if (
+ cross_block_table is None) else cross_block_table)
+
+ max_len_of_block_table = max(
+ len(block_table) for block_table in cross_block_tables)
+
+ cross_block_tables = make_tensor_with_pad(
+ cross_block_tables,
+ max_len=max_len_of_block_table,
+ pad=0,
+ dtype=torch.int32,
+ device=self.device,
+ )
+
+ # Compute encoder sequence lengths & encoder
+ # sequence starting offset tensors
+ max_encoder_seq_len = max(encoder_seq_lens, default=0)
+ encoder_seq_lens_tensor = self._list_to_int32_tensor(encoder_seq_lens)
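+        # e.g. encoder_seq_lens [3, 5] gives encoder_seq_lens_tensor [3, 5]
+        # and, after the cumsum below, encoder_seq_start_loc [0, 3, 8].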
+ encoder_seq_start_loc = torch.zeros(encoder_seq_lens_tensor.shape[0] +
+ 1,
+ dtype=torch.int32,
+ device=self.device)
+ torch.cumsum(encoder_seq_lens_tensor,
+ dim=0,
+ dtype=encoder_seq_start_loc.dtype,
+ out=encoder_seq_start_loc[1:])
+
+ # Update attention metadata with encoder-oriented attributes
+ attn_metadata = model_input.attn_metadata
+ assert attn_metadata is not None
+ (
+ attn_metadata.num_encoder_tokens,
+ attn_metadata.encoder_seq_lens,
+ attn_metadata.encoder_seq_lens_tensor,
+ attn_metadata.max_encoder_seq_len,
+ attn_metadata.cross_slot_mapping,
+ attn_metadata.cross_block_tables,
+ ) = (
+ sum(encoder_seq_lens),
+ encoder_seq_lens,
+ encoder_seq_lens_tensor,
+ max_encoder_seq_len,
+ cross_slot_mapping_tensor,
+ cross_block_tables,
+ )
+
+ return (attn_metadata, encoder_input_tokens_tensor,
+ encoder_input_positions_tensor)
+
+ @torch.no_grad()
+ def execute_model(
+ self,
+ model_input: EncoderDecoderModelInputForCPU,
+ kv_caches: List[torch.Tensor],
+ intermediate_tensors: Optional[IntermediateTensors] = None,
+ num_steps: int = 1,
+ ) -> Optional[List[SamplerOutput]]:
+ if num_steps > 1:
+ raise ValueError(
+ "CPU worker does not support multi-step execution.")
+
+ model_executable = self.model
+ execute_model_kwargs = {
+ "input_ids":
+ model_input.input_tokens,
+ "positions":
+ model_input.input_positions,
+ "encoder_input_ids":
+ model_input.encoder_input_tokens,
+ "encoder_positions":
+ model_input.encoder_input_positions,
+ "kv_caches":
+ kv_caches,
+ "attn_metadata":
+ model_input.attn_metadata,
+ **MultiModalInputs.as_kwargs(model_input.multi_modal_kwargs or {},
+ device=self.device),
+ "intermediate_tensors":
+ intermediate_tensors,
+ }
+
+ hidden_states = model_executable(**execute_model_kwargs)
+
+ # Compute the logits.
+ logits = self.model.compute_logits(hidden_states,
+ model_input.sampling_metadata)
+
+ # Only perform sampling in the driver worker.
+ if not self.is_driver_worker:
+ return []
+
+ # Sample the next token.
+ output = self.model.sample(
+ logits=logits,
+ sampling_metadata=model_input.sampling_metadata,
+ )
+ return [output]
diff --git a/vllm/worker/cpu_model_runner.py b/vllm/worker/cpu_model_runner.py
index cebb0f36a2b28..a03c562532179 100644
--- a/vllm/worker/cpu_model_runner.py
+++ b/vllm/worker/cpu_model_runner.py
@@ -19,7 +19,7 @@
MultiModalInputs)
from vllm.sequence import (IntermediateTensors, SequenceData,
SequenceGroupMetadata)
-from vllm.utils import STR_NOT_IMPL_ENC_DEC_ERR_STRS, make_tensor_with_pad
+from vllm.utils import make_tensor_with_pad
from vllm.worker.model_runner_base import (
ModelRunnerBase, ModelRunnerInputBase, ModelRunnerInputBuilderBase,
_add_attn_metadata_broadcastable_dict,
@@ -133,7 +133,7 @@ def build(self) -> ModelInputForCPU:
(input_tokens, input_positions,
attn_metadata) = self._prepare_decode(
self.seq_group_metadata_list)
- seq_lens = []
+ seq_lens = None
return self.model_input_cls(
input_tokens=input_tokens,
@@ -434,10 +434,6 @@ def __init__(
# Lazy initialization.
self.model: nn.Module # Set after init_Model
- if self.model_config.is_encoder_decoder_model:
- raise NotImplementedError(
- STR_NOT_IMPL_ENC_DEC_ERR_STRS['STR_NOT_IMPL_ENC_DEC_CPU'])
-
@property
def model_is_mrope(self) -> bool:
"""Detect if the model has "mrope" rope_scaling type.
@@ -459,8 +455,8 @@ def load_model(self) -> None:
def make_model_input_from_broadcasted_tensor_dict(
self,
tensor_dict: Dict[str, Any],
- ) -> ModelInputForCPU:
- return ModelInputForCPU.from_broadcasted_tensor_dict(
+ ) -> ModelInputForCPUWithSamplingMetadata:
+ return ModelInputForCPUWithSamplingMetadata.from_broadcasted_tensor_dict( # noqa: E501
tensor_dict,
attn_backend=self.attn_backend,
)
diff --git a/vllm/worker/cpu_worker.py b/vllm/worker/cpu_worker.py
index 5e36fba6ccdea..7384ffcb2c5e5 100644
--- a/vllm/worker/cpu_worker.py
+++ b/vllm/worker/cpu_worker.py
@@ -1,5 +1,5 @@
"""A CPU worker class."""
-from typing import Dict, List, Optional, Tuple
+from typing import Dict, List, Optional, Tuple, Type
import torch
import torch.distributed
@@ -15,6 +15,7 @@
from vllm.model_executor import set_random_seed
from vllm.sequence import ExecuteModelRequest
from vllm.utils import STR_DTYPE_TO_TORCH_DTYPE
+from vllm.worker.cpu_enc_dec_model_runner import CPUEncoderDecoderModelRunner
from vllm.worker.cpu_model_runner import CPUModelRunner
from vllm.worker.worker_base import (LocalOrDistributedWorkerBase,
LoraNotSupportedWorkerBase, WorkerInput)
@@ -163,7 +164,10 @@ def __init__(
else:
self.local_omp_cpuid = omp_cpuids.split("|")[rank]
- self.model_runner: CPUModelRunner = CPUModelRunner(
+ ModelRunnerClass: Type[CPUModelRunner] = CPUModelRunner
+ if self._is_encoder_decoder_model():
+ ModelRunnerClass = CPUEncoderDecoderModelRunner
+ self.model_runner: CPUModelRunner = ModelRunnerClass(
model_config,
parallel_config,
scheduler_config,
@@ -205,6 +209,9 @@ def stop_profile(self):
raise RuntimeError("Profiler is not enabled.")
self.profiler.stop()
+ def _is_encoder_decoder_model(self):
+ return self.model_config.is_encoder_decoder_model
+
def init_device(self) -> None:
if self.local_omp_cpuid != "all":
ret = torch.ops._C_utils.init_cpu_threads_env(self.local_omp_cpuid)
diff --git a/vllm/worker/embedding_model_runner.py b/vllm/worker/embedding_model_runner.py
index 1fd37eac6b851..a7f5b2d4fdd1f 100644
--- a/vllm/worker/embedding_model_runner.py
+++ b/vllm/worker/embedding_model_runner.py
@@ -1,11 +1,12 @@
import dataclasses
-from typing import Any, Dict, List, Optional, Tuple, Type
+from typing import Any, Dict, List, Optional, Tuple, Type, Union
import torch
from vllm.config import (CacheConfig, DeviceConfig, LoadConfig, LoRAConfig,
ModelConfig, ObservabilityConfig, ParallelConfig,
PromptAdapterConfig, SchedulerConfig)
+from vllm.distributed import get_pp_group
from vllm.forward_context import set_forward_context
from vllm.logger import init_logger
from vllm.model_executor.pooling_metadata import PoolingMetadata
@@ -66,7 +67,7 @@ def execute_model(
kv_caches: List[torch.Tensor],
intermediate_tensors: Optional[IntermediateTensors] = None,
num_steps: int = 1,
- ) -> Optional[List[PoolerOutput]]:
+ ) -> Optional[Union[List[PoolerOutput], IntermediateTensors]]:
if num_steps > 1:
raise ValueError(
"EmbeddingModelRunner does not support multi-step execution.")
@@ -107,28 +108,52 @@ def execute_model(
for _ in range(num_layers)
]
- execute_model_kwargs = {
- "input_ids":
- model_input.input_tokens,
- "positions":
- model_input.input_positions,
- "kv_caches":
- kv_caches,
- "attn_metadata":
- model_input.attn_metadata,
- **MultiModalInputs.as_kwargs(model_input.multi_modal_kwargs or {},
- device=self.device),
- }
+ multi_modal_kwargs = model_input.multi_modal_kwargs or {}
+ if (self.observability_config is not None
+ and self.observability_config.collect_model_forward_time):
+ model_forward_start = torch.cuda.Event(enable_timing=True)
+ model_forward_end = torch.cuda.Event(enable_timing=True)
+ model_forward_start.record()
with set_forward_context(model_input.attn_metadata):
- hidden_states = model_executable(**execute_model_kwargs)
+ hidden_or_intermediate_states = model_executable(
+ input_ids=model_input.input_tokens,
+ positions=model_input.input_positions,
+ kv_caches=kv_caches,
+ attn_metadata=model_input.attn_metadata,
+ intermediate_tensors=intermediate_tensors,
+ **MultiModalInputs.as_kwargs(multi_modal_kwargs,
+ device=self.device))
+
+ if (self.observability_config is not None
+ and self.observability_config.collect_model_forward_time):
+ model_forward_end.record()
+
+ # Only perform pooling in the last pipeline stage.
+ if not get_pp_group().is_last_rank:
+ if (self.is_driver_worker
+ and hidden_or_intermediate_states is not None
+ and isinstance(hidden_or_intermediate_states,
+ IntermediateTensors)
+ and self.observability_config is not None
+ and self.observability_config.collect_model_forward_time):
+ model_forward_end.synchronize()
+ model_forward_time = model_forward_start.elapsed_time(
+ model_forward_end)
+ orig_model_forward_time = 0.0
+ if intermediate_tensors is not None:
+ orig_model_forward_time = intermediate_tensors.tensors.get(
+ "model_forward_time", torch.tensor(0.0)).item()
+ hidden_or_intermediate_states.tensors["model_forward_time"] = (
+ torch.tensor(model_forward_time + orig_model_forward_time))
+ return hidden_or_intermediate_states
# Only perform pooling in the driver worker.
if not self.is_driver_worker:
return []
return [
- self.model.pooler(hidden_states=hidden_states,
+ self.model.pooler(hidden_states=hidden_or_intermediate_states,
pooling_metadata=model_input.pooling_metadata)
]
diff --git a/vllm/worker/model_runner.py b/vllm/worker/model_runner.py
index 51f65cbfcf862..9784438841980 100644
--- a/vllm/worker/model_runner.py
+++ b/vllm/worker/model_runner.py
@@ -35,8 +35,7 @@
from vllm.model_executor.layers.sampler import SamplerOutput
from vllm.model_executor.model_loader import get_model
from vllm.model_executor.model_loader.tensorizer import TensorizerConfig
-from vllm.model_executor.models.interfaces import (supports_lora,
- supports_multimodal)
+from vllm.model_executor.models import supports_lora, supports_multimodal
from vllm.model_executor.models.utils import set_cpu_offload_max_bytes
from vllm.multimodal import (MULTIMODAL_REGISTRY, BatchedTensorInputs,
MultiModalInputs, MultiModalRegistry)
diff --git a/vllm/worker/neuron_model_runner.py b/vllm/worker/neuron_model_runner.py
index 0cf7445d4388d..44d4845a838ef 100644
--- a/vllm/worker/neuron_model_runner.py
+++ b/vllm/worker/neuron_model_runner.py
@@ -1,9 +1,11 @@
+import os
from dataclasses import dataclass
from importlib.util import find_spec
from typing import TYPE_CHECKING, Any, Dict, List, Optional, Tuple, Union
import torch
from torch import nn
+from transformers_neuronx.config import GenerationConfig
from vllm.config import (DeviceConfig, ModelConfig, ParallelConfig,
SchedulerConfig)
@@ -50,6 +52,9 @@ def from_broadcasted_tensor_dict(
class NeuronModelRunner(ModelRunnerBase[ModelInputForNeuron]):
+ # NEURON has an upper limit on the top_k
+ _MAX_NEURON_SAMPLING_TOP_K = 256
+
def __init__(
self,
model_config: ModelConfig,
@@ -76,6 +81,34 @@ def __init__(
# Lazy initialization.
self.model: nn.Module # initialize after load_model.
+ # Once NEURON_ON_DEVICE_SAMPLING_DISABLED is set to a non-zero value,
+ # turn off on-device sampling.
+ self._on_device_sampling_disabled = int(
+ os.getenv("NEURON_ON_DEVICE_SAMPLING_DISABLED", "0"))
+
+ # NEURON needs to update sampling parameters when request IDs change
+ # across batches. This variable stores the previous batch's request IDs
+ # to determine if an update is needed.
+ self._previous_batch_request_ids: List[str] = []
+
+ if not self._on_device_sampling_disabled:
+ logger.warning(
+ "On-device sampling is turned on in Neuron by default, only "
+ "top_k, top_p, and temperature are current supported sampling "
+ "parameters. To turn off the on-device sampling, please set "
+ "the environment variable NEURON_ON_DEVICE_SAMPLING_DISABLED=1."
+ )
+ self.model_config.neuron_sampling_params = GenerationConfig(
+ max_length=self.scheduler_config.max_model_len,
+ do_sample=True,
+ per_batch_line=True,
+ top_k=[self._MAX_NEURON_SAMPLING_TOP_K] \
+ * self.scheduler_config.max_num_seqs,
+ top_p=[1.0] * self.scheduler_config.max_num_seqs,
+ temperature=[1.0] * self.scheduler_config.max_num_seqs,
+ dynamic=True,
+ global_top_k=self._MAX_NEURON_SAMPLING_TOP_K)
+
def load_model(self) -> None:
if find_spec("transformers_neuronx") is not None:
self.model = get_neuron_model(
@@ -215,7 +248,7 @@ def prepare_model_input(
else:
(input_tokens, input_positions,
input_block_ids) = self._prepare_decode(seq_group_metadata_list)
- seq_lens = []
+ seq_lens = None
sampling_metadata = SamplingMetadata.prepare(
seq_group_metadata_list,
seq_lens,
@@ -227,12 +260,49 @@ def prepare_model_input(
self.pin_memory,
generators=self.get_generators(finished_requests_ids))
+ if not self._on_device_sampling_disabled:
+ # Once the request IDs are changed in current iteration, we will
+ # update the on-device sampling parameters.
+ current_batch_request_ids = [
+ seq_group_meta_data.request_id
+ for seq_group_meta_data in seq_group_metadata_list
+ ]
+ if current_batch_request_ids != self._previous_batch_request_ids:
+ self._update_neuron_sampling_params(sampling_metadata)
+ self._previous_batch_request_ids = current_batch_request_ids
+
return ModelInputForNeuron(input_tokens=input_tokens,
input_positions=input_positions,
input_block_ids=input_block_ids,
sampling_metadata=sampling_metadata,
multi_modal_kwargs=multi_modal_kwargs)
+ def _update_neuron_sampling_params(self,
+ sampling_metadata: SamplingMetadata):
+ # Update Neuron sampling parameters (GenerationConfig in Neuron)
+ current_sampling_params = self.model_config.neuron_sampling_params
+ assert current_sampling_params is not None, (
+ f"Failed to update sampling_params, "
+ f"current sampling params is {current_sampling_params}")
+
+ top_k = current_sampling_params.top_k
+ top_p = current_sampling_params.top_p
+ temperature = current_sampling_params.temperature
+ for index, sequence_group_to_sample in enumerate(
+ sampling_metadata.seq_groups):
+ top_k[index] = self._convert_to_neuron_top_k(
+ sequence_group_to_sample.sampling_params.top_k)
+ top_p[index] = sequence_group_to_sample.sampling_params.top_p
+ temperature[index] = \
+ sequence_group_to_sample.sampling_params.temperature
+
+ self.model.model.update_generation_config(current_sampling_params)
+
+ def _convert_to_neuron_top_k(self, top_k: int) -> int:
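+        # Clamp to the Neuron limit: e.g. top_k=-1 ("no limit") and
+        # top_k=1000 both become 256, while top_k=50 passes through.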
+ if top_k < 0 or top_k > self._MAX_NEURON_SAMPLING_TOP_K:
+ return self._MAX_NEURON_SAMPLING_TOP_K
+ return top_k
+
@torch.inference_mode()
def execute_model(
self,
@@ -253,9 +323,13 @@ def execute_model(
device=self.device),
)
- # Compute the logits.
- logits = self.model.compute_logits(hidden_states,
- model_input.sampling_metadata)
+ # Compute the logits only if the on-device sampling is turned off as
+ # on-device sampling outputs the token ids.
+ if self._on_device_sampling_disabled:
+ logits = self.model.compute_logits(hidden_states,
+ model_input.sampling_metadata)
+ else:
+ logits = hidden_states
# Sample the next token.
output = self.model.sample(
diff --git a/vllm/worker/tpu_model_runner.py b/vllm/worker/tpu_model_runner.py
index 2472ac25aee44..12e4215038d74 100644
--- a/vllm/worker/tpu_model_runner.py
+++ b/vllm/worker/tpu_model_runner.py
@@ -453,9 +453,6 @@ def _prepare_sample(
f"Best of > {_MAX_NUM_SAMPLES} is not supported by the TPU "
"backend.")
best_of.append(sampling_params.best_of)
- if sampling_params.use_beam_search:
- raise NotImplementedError(
- "Beam search is not supported by the TPU backend.")
if sampling_params.logprobs is not None:
raise NotImplementedError(
"logprobs is not currently supported by the TPU backend.")