(DO NOT MERGE) IBM release WIP #76

Closed
wants to merge 106 commits into from
Commits
f9fa4e4
[Doc] Documentation on supported hardware for quantization methods (#…
mgoin Jun 21, 2024
299af70
[BugFix] exclude version 1.15.0 for modelscope (#5668)
zhyncs Jun 21, 2024
a455d65
[ci][test] fix ca test in main (#5746)
youkaichao Jun 21, 2024
f4c1a10
[LoRA] Add support for pinning lora adapters in the LRU cache (#5603)
rohithkrn Jun 21, 2024
b0b518d
[CI][Hardware][Intel GPU] add Intel GPU(XPU) ci pipeline (#5616)
jikunshang Jun 22, 2024
bc4ae91
[Model] Support Qwen-VL and Qwen-VL-Chat models with text-only inputs…
DamonFool Jun 22, 2024
ec2ed1b
[Misc] Remove #4789 workaround left in vllm/entrypoints/openai/run_ba…
zifeitong Jun 22, 2024
a77856f
[Bugfix] Fix pin_lora error in TPU executor (#5760)
WoosukKwon Jun 22, 2024
06baabc
[Docs][TPU] Add installation tip for TPU (#5761)
WoosukKwon Jun 22, 2024
7cd7a7a
[core][distributed] improve shared memory broadcast (#5754)
youkaichao Jun 22, 2024
945732a
[BugFix] [Kernel] Add Cutlass2x fallback kernels (#5744)
varun-sundar-rabindranath Jun 23, 2024
7923319
[Distributed] Add send and recv helpers (#5719)
andoorve Jun 23, 2024
f41fff4
[Bugfix] Add phi3v resize for dynamic shape and fix torchvision requi…
Isotr0py Jun 24, 2024
a6d3e9e
[doc][faq] add warning to download models for every nodes (#5783)
youkaichao Jun 24, 2024
1408567
[Doc] Add "Suggest edit" button to doc pages (#5789)
mgoin Jun 24, 2024
ab86561
[Doc] Add Phi-3-medium to list of supported models (#5788)
mgoin Jun 24, 2024
657a3f8
[Bugfix] Fix FlexibleArgumentParser replaces _ with - for actual args…
CatherineSue Jun 24, 2024
c9b8f8a
[ci] Remove aws template (#5757)
khluu Jun 25, 2024
65b7543
[Doc] Add notice about breaking changes to VLMs (#5818)
DarkLight1337 Jun 25, 2024
79df20a
[Speculative Decoding] Support draft model on different tensor-paral…
wooyeonlee0 Jun 25, 2024
1187a29
[Misc] Remove useless code in cpu_worker (#5824)
DamonFool Jun 25, 2024
54b3304
[Core] Add fault tolerance for `RayTokenizerGroupPool` (#5748)
Yard1 Jun 25, 2024
976f4fa
[doc][distributed] add both gloo and nccl tests (#5834)
youkaichao Jun 25, 2024
f88e861
[CI/Build] Add unit testing for FlexibleArgumentParser (#5798)
mgoin Jun 25, 2024
caf1017
[Misc] Update `w4a16` `compressed-tensors` support to include `w8a16`…
dsikka Jun 25, 2024
a50a2e9
[Hardware][TPU] Refactor TPU backend (#5831)
WoosukKwon Jun 25, 2024
8509fef
[Hardware][AMD][CI/Build][Doc] Upgrade to ROCm 6.1, Dockerfile improv…
mawong-amd Jun 25, 2024
9595933
[Hardware][TPU] Raise errors for unsupported sampling params (#5850)
WoosukKwon Jun 25, 2024
b2f42b3
[CI/Build] Add E2E tests for MLPSpeculator (#5791)
tdoublep Jun 26, 2024
2100b12
[Bugfix] Fix assertion in NeuronExecutor (#5841)
aws-patlange Jun 26, 2024
6690735
[Core] Refactor Worker and ModelRunner to consolidate control plane c…
stephanie-wang Jun 26, 2024
58ba441
[Misc][Doc] Add Example of using OpenAI Server with VLM (#5832)
ywang96 Jun 26, 2024
9cccdc8
[bugfix][distributed] fix shm broadcast when the queue size is full (…
youkaichao Jun 26, 2024
85de228
[Bugfix] Fix embedding to support 2D inputs (#5829)
WoosukKwon Jun 26, 2024
68f0bf0
[Bugfix][TPU] Fix KV cache size calculation (#5860)
WoosukKwon Jun 26, 2024
1efccbb
[CI/Build] Refactor image test assets (#5821)
DarkLight1337 Jun 26, 2024
3ecbfaa
[Kernel] Adding bias epilogue support for `cutlass_scaled_mm` (#5560)
ProExpertProg Jun 26, 2024
039331a
[Frontend] Add tokenize/detokenize endpoints (#5054)
sasha0552 Jun 26, 2024
36893cd
[Hardware][TPU] Support parallel sampling & Swapping (#5855)
WoosukKwon Jun 26, 2024
4f1c218
[Bugfix][TPU] Fix CPU cache allocation (#5869)
WoosukKwon Jun 26, 2024
1b510c0
Support CPU inference with VSX PowerPC ISA (#5652)
ChipKerchner Jun 26, 2024
e426d1a
[doc] update usage of env var to avoid conflict (#5873)
youkaichao Jun 26, 2024
48a146d
[Misc] Add example for LLaVA-NeXT (#5879)
ywang96 Jun 27, 2024
0d58c6d
[BugFix] Fix cuda graph for MLPSpeculator (#5875)
njhill Jun 27, 2024
220da63
[Doc] Add note about context length in Phi-3-Vision example (#5887)
DarkLight1337 Jun 27, 2024
205d24f
[VLM][Bugfix] Make sure that `multi_modal_kwargs` is broadcasted prop…
xwjiang2010 Jun 27, 2024
8399340
[Model] Add base class for LoRA-supported models (#5018)
DarkLight1337 Jun 27, 2024
9b1a3f6
[Bugfix] Fix img_sizes Parsing in Phi3-Vision (#5888)
ywang96 Jun 27, 2024
1aa6d22
[CI/Build] [1/3] Reorganize entrypoints tests (#5526)
DarkLight1337 Jun 27, 2024
60b6ff7
[Model][Bugfix] Implicit model flags and reenable Phi-3-Vision (#5896)
DarkLight1337 Jun 27, 2024
adc793e
[doc][misc] add note for Kubernetes users (#5916)
youkaichao Jun 27, 2024
ca516bd
[BugFix] Fix `MLPSpeculator` handling of `num_speculative_tokens` (#5…
njhill Jun 27, 2024
c5ff677
[BugFix] Fix `min_tokens` behaviour for multiple eos tokens (#5849)
njhill Jun 27, 2024
98472e6
[CI/Build] Fix Args for `_get_logits_warper` in Sampler Test (#5922)
ywang96 Jun 27, 2024
7e85492
[Model] Add Gemma 2 (#5908)
WoosukKwon Jun 27, 2024
d877872
[core][misc] remove logical block (#5882)
youkaichao Jun 27, 2024
c7e3715
[Kernel][ROCm][AMD] fused_moe Triton configs v2 for mi300X (#5932)
divakar-amd Jun 27, 2024
acd5bee
[Hardware][TPU] Optimize KV cache swapping (#5878)
WoosukKwon Jun 28, 2024
e3e46d2
[VLM][BugFix] Make sure that `multi_modal_kwargs` can broadcast prope…
xwjiang2010 Jun 28, 2024
79b7f2e
[Bugfix][Hardware][Intel CPU] Fix unpassed multi_modal_kwargs for CPU…
Isotr0py Jun 28, 2024
b3099a6
[Core] Registry for processing model inputs (#5214)
DarkLight1337 Jun 28, 2024
7e9585c
Unmark fused_moe config json file as executable (#5960)
tlrmchlsmth Jun 28, 2024
67511e9
[Hardware][Intel] OpenVINO vLLM backend (#5379)
ilya-lavrenov Jun 28, 2024
58bd180
[Bugfix] Better error message for MLPSpeculator when `num_speculative…
tdoublep Jun 28, 2024
befff46
[CI/Build] [2/3] Reorganize entrypoints tests (#5904)
DarkLight1337 Jun 28, 2024
85a7a62
[Distributed] Make it clear that % should not be in tensor dict keys.…
xwjiang2010 Jun 28, 2024
2ab69c8
[Spec Decode] Introduce DraftModelRunner (#5799)
comaniac Jun 28, 2024
28e9598
[Bugfix] Fix compute datatype for cutlass 3.x epilogues (#5931)
tlrmchlsmth Jun 28, 2024
9afc7e5
[ Misc ] Remove `fp8_shard_indexer` from Col/Row Parallel Linear (Sim…
robertgshaw2-redhat Jun 28, 2024
9ee1cb9
[ Bugfix ] Enabling Loading Models With Fused QKV/MLP on Disk with FP…
robertgshaw2-redhat Jun 28, 2024
a23f27b
Support Deepseek-V2 (#4650)
zwd003 Jun 28, 2024
67e8298
[Bugfix] Only add `Attention.kv_scale` if kv cache quantization is en…
mgoin Jun 28, 2024
421564c
Unmark more files as executable (#5962)
tlrmchlsmth Jun 28, 2024
a6b188d
[Bugfix] Fix Engine Failing After Invalid Request - AsyncEngineDeadEr…
robertgshaw2-redhat Jun 28, 2024
96e23ec
[Kernel] Flashinfer for prefill & decode, with Cudagraph support for …
LiuXiaoxuanPKU Jun 28, 2024
28773bf
[Bugfix][TPU] Fix TPU sampler output (#5978)
WoosukKwon Jun 29, 2024
9ea7506
[Bugfix][TPU] Fix pad slot id (#5977)
WoosukKwon Jun 29, 2024
2eae371
[Bugfix] fix missing last itl in openai completions benchmark (#5926)
mcalman Jun 29, 2024
ea2321d
[Misc] Extend vLLM Metrics logging API (#5925)
SolitaryThinker Jun 29, 2024
23534ab
[Kernel] Add punica dimensions for Granite 3b and 8b (#5930)
joerunde Jun 29, 2024
bf2cd68
[Bugfix] Fix precisions in Gemma 1 (#5913)
WoosukKwon Jun 29, 2024
6837825
[Misc] Update Phi-3-Vision Example (#5981)
ywang96 Jun 29, 2024
86d5e5d
[Bugfix] Support `eos_token_id` from `config.json` (#5954)
DarkLight1337 Jun 29, 2024
4951f09
[Core] Optimize `SequenceStatus.is_finished` by switching to IntEnum …
Yard1 Jun 29, 2024
f79e443
[Kernel] Raise an exception in MoE kernel if the batch size is larger…
comaniac Jun 29, 2024
57570df
[ CI/Build ] Added E2E Test For Compressed Tensors (#5839)
robertgshaw2-redhat Jun 29, 2024
46c13c0
[CI/Build] Add TP test for vision models (#5892)
DarkLight1337 Jun 29, 2024
2c3044d
[ CI/Build ] LM Eval Harness Based CI Testing (#5838)
robertgshaw2-redhat Jun 29, 2024
ee9c4d1
[Bugfix][CI/Build][Hardware][AMD] Install matching torchvision to fix…
mawong-amd Jun 29, 2024
eddb80a
[CI/Build] Temporarily Remove Phi3-Vision from TP Test (#5989)
ywang96 Jun 30, 2024
075c3f9
[CI/Build] Reuse code for checking output consistency (#5988)
DarkLight1337 Jun 30, 2024
c14b831
[CI/Build] [3/3] Reorganize entrypoints tests (#5966)
DarkLight1337 Jun 30, 2024
045125b
[ci][distributed] fix device count call
youkaichao Jun 30, 2024
7443549
[Frontend]: Support base64 embedding (#5935)
llmpros Jun 30, 2024
6b3a037
[Lora] Use safetensor keys instead of adapter_config.json to find une…
rkooo567 Jun 30, 2024
1e2049e
[ CI ] Temporarily Disable Large LM-Eval Tests (#6005)
robertgshaw2-redhat Jun 30, 2024
5b65eb0
[Misc] Fix `get_min_capability` (#5971)
dsikka Jun 30, 2024
bde1a5a
[ Misc ] Refactor w8a8 to use `process_weights_after_load` (Simplify …
robertgshaw2-redhat Jun 30, 2024
169f3df
[misc][cuda] use nvml to avoid accidentally cuda initialization (#6007)
youkaichao Jul 1, 2024
b1d1398
[Speculative Decoding 2/2 ] Integrate typical acceptance sampler into…
sroy745 Jul 1, 2024
567df3b
[ CI ] Re-enable Large Model LM Eval (#6031)
robertgshaw2-redhat Jul 1, 2024
7325ac0
[doc][misc] remove deprecated api server in doc (#6037)
youkaichao Jul 1, 2024
979fcb5
[Misc] update benchmark backend for scalellm (#6018)
zhyncs Jul 1, 2024
c544ecf
[doc][misc] further lower visibility of simple api server (#6041)
youkaichao Jul 1, 2024
0558bcc
Squash 4645
prashantgupta24 Jul 1, 2024
2987012
🚧 add adapter changes
prashantgupta24 Jul 1, 2024
11 changes: 11 additions & 0 deletions .buildkite/lm-eval-harness/configs/Meta-Llama-3-70B-Instruct.yaml
@@ -0,0 +1,11 @@
# bash .buildkite/lm-eval-harness/run-lm-eval-gsm-hf-baseline.sh -m meta-llama/Meta-Llama-3-70B-Instruct -b 32 -l 250 -f 5
model_name: "meta-llama/Meta-Llama-3-70B-Instruct"
tasks:
- name: "gsm8k"
  metrics:
  - name: "exact_match,strict-match"
    value: 0.892
  - name: "exact_match,flexible-extract"
    value: 0.892
limit: 250
num_fewshot: 5
11 changes: 11 additions & 0 deletions .buildkite/lm-eval-harness/configs/Meta-Llama-3-8B-Instruct-FP8.yaml
@@ -0,0 +1,11 @@
# bash .buildkite/lm-eval-harness/run-lm-eval-gsm-hf-baseline.sh -m neuralmagic/Meta-Llama-3-8B-Instruct-FP8 -b 32 -l 250 -f 5 -t 1
model_name: "neuralmagic/Meta-Llama-3-8B-Instruct-FP8"
tasks:
- name: "gsm8k"
  metrics:
  - name: "exact_match,strict-match"
    value: 0.756
  - name: "exact_match,flexible-extract"
    value: 0.752
limit: 250
num_fewshot: 5
11 changes: 11 additions & 0 deletions .buildkite/lm-eval-harness/configs/Meta-Llama-3-8B-Instruct.yaml
@@ -0,0 +1,11 @@
# bash .buildkite/lm-eval-harness/run-lm-eval-gsm-hf-baseline.sh -m meta-llama/Meta-Llama-3-8B-Instruct -b 32 -l 250 -f 5 -t 1
model_name: "meta-llama/Meta-Llama-3-8B-Instruct"
tasks:
- name: "gsm8k"
  metrics:
  - name: "exact_match,strict-match"
    value: 0.756
  - name: "exact_match,flexible-extract"
    value: 0.752
limit: 250
num_fewshot: 5
11 changes: 11 additions & 0 deletions .buildkite/lm-eval-harness/configs/Mixtral-8x7B-Instruct-v0.1.yaml
@@ -0,0 +1,11 @@
# bash .buildkite/lm-eval-harness/run-lm-eval-gsm-hf-baseline.sh -m neuralmagic/Mixtral-8x7B-Instruct-v0.1 -b 32 -l 250 -f 5 -t 4
model_name: "mistralai/Mixtral-8x7B-Instruct-v0.1"
tasks:
- name: "gsm8k"
  metrics:
  - name: "exact_match,strict-match"
    value: 0.616
  - name: "exact_match,flexible-extract"
    value: 0.632
limit: 250
num_fewshot: 5
2 changes: 2 additions & 0 deletions .buildkite/lm-eval-harness/configs/models-large.txt
@@ -0,0 +1,2 @@
Meta-Llama-3-70B-Instruct.yaml
Mixtral-8x7B-Instruct-v0.1.yaml
2 changes: 2 additions & 0 deletions .buildkite/lm-eval-harness/configs/models-small.txt
@@ -0,0 +1,2 @@
Meta-Llama-3-8B-Instruct.yaml
Meta-Llama-3-8B-Instruct-FP8.yaml
46 changes: 46 additions & 0 deletions .buildkite/lm-eval-harness/run-lm-eval-gsm-hf-baseline.sh
@@ -0,0 +1,46 @@
#!/bin/bash
# We can use this script to compute baseline accuracy on GSM for transformers.
#
# Make sure you have lm-eval-harness installed:
# pip install git+https://github.com/EleutherAI/lm-evaluation-harness.git@9516087b81a61d0e220b22cc1b75be76de23bc10

usage() {
    echo
    echo "Runs lm eval harness on GSM8k using huggingface transformers."
    echo "This pathway is intended to be used to create baselines for "
    echo "our automated nm-test-accuracy workflow"
    echo
    echo "usage: ${0} <options>"
    echo
    echo " -m - huggingface stub or local directory of the model"
    echo " -b - batch size to run the evaluation at"
    echo " -l - limit number of samples to run"
    echo " -f - number of fewshot samples to use"
    echo
}

while getopts "m:b:l:f:" OPT; do
  case ${OPT} in
    m )
        MODEL="$OPTARG"
        ;;
    b )
        BATCH_SIZE="$OPTARG"
        ;;
    l )
        LIMIT="$OPTARG"
        ;;
    f )
        FEWSHOT="$OPTARG"
        ;;
    \? )
        usage
        exit 1
        ;;
  esac
done

lm_eval --model hf \
  --model_args pretrained=$MODEL,parallelize=True \
  --tasks gsm8k --num_fewshot $FEWSHOT --limit $LIMIT \
  --batch_size $BATCH_SIZE
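
As a point of reference, the comment at the top of each baseline config above records how this script was invoked to produce the ground-truth numbers; for the Meta-Llama-3-8B-Instruct config, that invocation (run from the repository root) is:

bash .buildkite/lm-eval-harness/run-lm-eval-gsm-hf-baseline.sh -m meta-llama/Meta-Llama-3-8B-Instruct -b 32 -l 250 -f 5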
51 changes: 51 additions & 0 deletions .buildkite/lm-eval-harness/run-lm-eval-gsm-vllm-baseline.sh
@@ -0,0 +1,51 @@
#!/bin/bash
# We can use this script to compute baseline accuracy on GSM for vllm.
# We use this for fp8, which HF does not support.
#
# Make sure you have lm-eval-harness installed:
# pip install lm-eval==0.4.2

usage() {
    echo
    echo "Runs lm eval harness on GSM8k using huggingface transformers."
    echo "This pathway is intended to be used to create baselines for "
    echo "our automated nm-test-accuracy workflow"
    echo
    echo "usage: ${0} <options>"
    echo
    echo " -m - huggingface stub or local directory of the model"
    echo " -b - batch size to run the evaluation at"
    echo " -l - limit number of samples to run"
    echo " -f - number of fewshot samples to use"
    echo " -t - tensor parallel size to run at"
    echo
}

while getopts "m:b:l:f:t:" OPT; do
  case ${OPT} in
    m )
        MODEL="$OPTARG"
        ;;
    b )
        BATCH_SIZE="$OPTARG"
        ;;
    l )
        LIMIT="$OPTARG"
        ;;
    f )
        FEWSHOT="$OPTARG"
        ;;
    t )
        TP_SIZE="$OPTARG"
        ;;
    \? )
        usage
        exit 1
        ;;
  esac
done

lm_eval --model vllm \
  --model_args pretrained=$MODEL,tensor_parallel_size=$TP_SIZE \
  --tasks gsm8k --num_fewshot $FEWSHOT --limit $LIMIT \
  --batch_size $BATCH_SIZE
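
Likewise, the FP8 baseline config above notes that its ground truth came from this vLLM-backed script; taking the arguments from that config's comment, the invocation is:

bash .buildkite/lm-eval-harness/run-lm-eval-gsm-vllm-baseline.sh -m neuralmagic/Meta-Llama-3-8B-Instruct-FP8 -b 32 -l 250 -f 5 -t 1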
59 changes: 59 additions & 0 deletions .buildkite/lm-eval-harness/run-tests.sh
@@ -0,0 +1,59 @@
#!/bin/bash

usage() {
    echo
    echo "Runs lm eval harness on GSM8k using vllm and compares to "
    echo "precomputed baseline (measured by HF transformers.)"
    echo
    echo "usage: ${0} <options>"
    echo
    echo " -c - path to the test data config (e.g. configs/small-models.txt)"
    echo " -t - tensor parallel size"
    echo
}

SUCCESS=0

while getopts "c:t:" OPT; do
  case ${OPT} in
    c )
        CONFIG="$OPTARG"
        ;;
    t )
        TP_SIZE="$OPTARG"
        ;;
    \? )
        usage
        exit 1
        ;;
  esac
done

# Parse list of configs.
IFS=$'\n' read -d '' -r -a MODEL_CONFIGS < $CONFIG

for MODEL_CONFIG in "${MODEL_CONFIGS[@]}"
do
    LOCAL_SUCCESS=0

    echo "=== RUNNING MODEL: $MODEL_CONFIG WITH TP SIZE: $TP_SIZE==="

    export LM_EVAL_TEST_DATA_FILE=$PWD/configs/${MODEL_CONFIG}
    export LM_EVAL_TP_SIZE=$TP_SIZE
    pytest -s test_lm_eval_correctness.py || LOCAL_SUCCESS=$?

    if [[ $LOCAL_SUCCESS == 0 ]]; then
        echo "=== PASSED MODEL: ${MODEL_CONFIG} ==="
    else
        echo "=== FAILED MODEL: ${MODEL_CONFIG} ==="
    fi

    SUCCESS=$((SUCCESS + LOCAL_SUCCESS))

done

if [ "${SUCCESS}" -eq "0" ]; then
    exit 0
else
    exit 1
fi
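
In CI this wrapper is driven by the LM Eval steps added to .buildkite/test-pipeline.yaml further below; an equivalent local single-GPU run (executed from .buildkite/lm-eval-harness, matching the small-models step) would be:

bash ./run-tests.sh -c configs/models-small.txt -t 1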
54 changes: 54 additions & 0 deletions .buildkite/lm-eval-harness/test_lm_eval_correctness.py
@@ -0,0 +1,54 @@
"""
LM eval harness on model to compare vs HF baseline computed offline.
Configs are found in configs/$MODEL.yaml

* export LM_EVAL_TEST_DATA_FILE=configs/Meta-Llama-3-70B-Instruct.yaml
* export LM_EVAL_TP_SIZE=4
* pytest -s test_lm_eval_correctness.py
"""

import os
from pathlib import Path

import lm_eval
import numpy
import yaml

RTOL = 0.02
TEST_DATA_FILE = os.environ.get(
    "LM_EVAL_TEST_DATA_FILE",
    ".buildkite/lm-eval-harness/configs/Meta-Llama-3-8B-Instruct.yaml")

TP_SIZE = os.environ.get("LM_EVAL_TP_SIZE", 1)


def launch_lm_eval(eval_config):
    model_args = f"pretrained={eval_config['model_name']}," \
                 f"tensor_parallel_size={TP_SIZE}"

    results = lm_eval.simple_evaluate(
        model="vllm",
        model_args=model_args,
        tasks=[task["name"] for task in eval_config["tasks"]],
        num_fewshot=eval_config["num_fewshot"],
        limit=eval_config["limit"],
        batch_size="auto")

    return results


def test_lm_eval_correctness():
    eval_config = yaml.safe_load(
        Path(TEST_DATA_FILE).read_text(encoding="utf-8"))

    # Launch eval requests.
    results = launch_lm_eval(eval_config)

    # Confirm scores match ground truth.
    for task in eval_config["tasks"]:
        for metric in task["metrics"]:
            ground_truth = metric["value"]
            measured_value = results["results"][task["name"]][metric["name"]]
            print(f'{task["name"]} | {metric["name"]}: '
                  f'ground_truth={ground_truth} | measured={measured_value}')
            assert numpy.isclose(ground_truth, measured_value, rtol=RTOL)
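
The test can also be exercised without the run-tests.sh wrapper, as sketched in the module docstring; a minimal single-GPU run against the small-model config (paths assumed relative to the repository root) might look like:

cd .buildkite/lm-eval-harness
export LM_EVAL_TEST_DATA_FILE=configs/Meta-Llama-3-8B-Instruct.yaml
export LM_EVAL_TP_SIZE=1
pytest -s test_lm_eval_correctness.py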
14 changes: 14 additions & 0 deletions .buildkite/run-openvino-test.sh
@@ -0,0 +1,14 @@
# This script builds the OpenVINO docker image and runs the offline inference inside the container.
# It serves as a sanity check for compilation and basic model usage.
set -ex

# Try building the docker image
docker build -t openvino-test -f Dockerfile.openvino .

# Setup cleanup
remove_docker_container() { docker rm -f openvino-test || true; }
trap remove_docker_container EXIT
remove_docker_container

# Run the image and launch offline inference
docker run --network host --env VLLM_OPENVINO_KVCACHE_SPACE=1 --name openvino-test openvino-test python3 /workspace/vllm/examples/offline_inference.py
49 changes: 37 additions & 12 deletions .buildkite/test-pipeline.yaml
@@ -1,7 +1,10 @@
# In this file, you can add more tests to run either by adding a new step or
# adding a new command to an existing step. See different options here for examples.
# This script will be fed into Jinja template in `test-template-aws.j2` to generate
# the final pipeline yaml file.

# This script will be fed into Jinja template in `test-template-aws.j2` at
# https://github.com/vllm-project/buildkite-ci/blob/main/scripts/test-template-aws.j2
# to generate the final pipeline yaml file.


steps:
- label: Regression Test
@@ -24,7 +27,9 @@ steps:

- label: Core Test
mirror_hardwares: [amd]
command: pytest -v -s core
commands:
- pytest -v -s core
- pytest -v -s distributed/test_parallel_state.py

- label: Distributed Comm Ops Test
#mirror_hardwares: [amd]
@@ -39,19 +44,21 @@ steps:
working_dir: "/vllm-workspace/tests"
num_gpus: 2
commands:
# FIXIT: find out which code initialize cuda before running the test
# before the fix, we need to use spawn to test it
- export VLLM_WORKER_MULTIPROC_METHOD=spawn
- bash ../.buildkite/download-images.sh
- VLLM_TEST_SAME_HOST=1 torchrun --nproc-per-node=4 distributed/test_same_node.py
- TEST_DIST_MODEL=facebook/opt-125m DISTRIBUTED_EXECUTOR_BACKEND=ray pytest -v -s distributed/test_basic_distributed_correctness.py
- TEST_DIST_MODEL=meta-llama/Llama-2-7b-hf DISTRIBUTED_EXECUTOR_BACKEND=ray pytest -v -s distributed/test_basic_distributed_correctness.py
- TEST_DIST_MODEL=facebook/opt-125m DISTRIBUTED_EXECUTOR_BACKEND=ray pytest -v -s distributed/test_chunked_prefill_distributed.py
- TEST_DIST_MODEL=meta-llama/Llama-2-7b-hf DISTRIBUTED_EXECUTOR_BACKEND=ray pytest -v -s distributed/test_chunked_prefill_distributed.py
- TEST_DIST_MODEL=llava-hf/llava-1.5-7b-hf DISTRIBUTED_EXECUTOR_BACKEND=ray pytest -v -s distributed/test_multimodal_broadcast.py
- TEST_DIST_MODEL=microsoft/Phi-3-vision-128k-instruct DISTRIBUTED_EXECUTOR_BACKEND=ray pytest -v -s distributed/test_multimodal_broadcast.py
- TEST_DIST_MODEL=facebook/opt-125m DISTRIBUTED_EXECUTOR_BACKEND=mp pytest -v -s distributed/test_basic_distributed_correctness.py
- TEST_DIST_MODEL=meta-llama/Llama-2-7b-hf DISTRIBUTED_EXECUTOR_BACKEND=mp pytest -v -s distributed/test_basic_distributed_correctness.py
- TEST_DIST_MODEL=facebook/opt-125m DISTRIBUTED_EXECUTOR_BACKEND=mp pytest -v -s distributed/test_chunked_prefill_distributed.py
- TEST_DIST_MODEL=meta-llama/Llama-2-7b-hf DISTRIBUTED_EXECUTOR_BACKEND=mp pytest -v -s distributed/test_chunked_prefill_distributed.py
- pytest -v -s spec_decode/e2e/test_integration_dist.py
- TEST_DIST_MODEL=llava-hf/llava-1.5-7b-hf DISTRIBUTED_EXECUTOR_BACKEND=mp pytest -v -s distributed/test_multimodal_broadcast.py
- TEST_DIST_MODEL=microsoft/Phi-3-vision-128k-instruct DISTRIBUTED_EXECUTOR_BACKEND=mp pytest -v -s distributed/test_multimodal_broadcast.py
- pytest -v -s spec_decode/e2e/test_integration_dist_tp2.py
- CUDA_VISIBLE_DEVICES=0,1 pytest -v -s test_sharded_state_loader.py
- CUDA_VISIBLE_DEVICES=0,1 pytest -v -s distributed/test_utils.py

@@ -60,14 +67,12 @@ steps:
working_dir: "/vllm-workspace/tests"
num_gpus: 4
commands:
# FIXIT: find out which code initialize cuda before running the test
# before the fix, we need to use spawn to test it
- export VLLM_WORKER_MULTIPROC_METHOD=spawn
- pytest -v -s distributed/test_pynccl.py
# We want to test that models which use 2 GPUs work with 4 GPUs, which is why we duplicate them here.
# See https://github.com/vllm-project/vllm/pull/5473#issuecomment-2166601837 for context.
- TEST_DIST_MODEL=facebook/opt-125m DISTRIBUTED_EXECUTOR_BACKEND=ray pytest -v -s distributed/test_basic_distributed_correctness.py
- TEST_DIST_MODEL=facebook/opt-125m DISTRIBUTED_EXECUTOR_BACKEND=mp pytest -v -s distributed/test_basic_distributed_correctness.py
- pytest -v -s spec_decode/e2e/test_integration_dist_tp4.py

- label: Engine Test
mirror_hardwares: [amd]
@@ -77,8 +82,8 @@ steps:
mirror_hardwares: [amd]

commands:
- pytest -v -s entrypoints -m llm
- pytest -v -s entrypoints -m openai
- pytest -v -s entrypoints/llm
- pytest -v -s entrypoints/openai

- label: Examples Test
working_dir: "/vllm-workspace/examples"
@@ -186,6 +191,22 @@ steps:
- pip install aiohttp
- bash run-benchmarks.sh

- label: LM Eval Small Models
  working_dir: "/vllm-workspace/.buildkite/lm-eval-harness"
  commands:
  - pip install lm-eval
  - export VLLM_WORKER_MULTIPROC_METHOD=spawn
  - bash ./run-tests.sh -c configs/models-small.txt -t 1

- label: LM Eval Large Models
  gpu: a100
  num_gpus: 4
  working_dir: "/vllm-workspace/.buildkite/lm-eval-harness"
  commands:
  - pip install lm-eval
  - export VLLM_WORKER_MULTIPROC_METHOD=spawn
  - bash ./run-tests.sh -c configs/models-large.txt -t 4

- label: Documentation Build
working_dir: "/vllm-workspace/test_docs/docs"
no_gpu: True
@@ -202,3 +223,7 @@ steps:
- pytest -v -s distributed/test_custom_all_reduce.py
- TEST_DIST_MODEL=facebook/opt-125m DISTRIBUTED_EXECUTOR_BACKEND=ray pytest -v -s distributed/test_basic_distributed_correctness.py
- TEST_DIST_MODEL=facebook/opt-125m DISTRIBUTED_EXECUTOR_BACKEND=mp pytest -v -s distributed/test_basic_distributed_correctness.py
- pip install https://github.com/flashinfer-ai/flashinfer/releases/download/v0.0.5/flashinfer-0.0.5+cu121torch2.3-cp310-cp310-linux_x86_64.whl
- VLLM_ATTENTION_BACKEND=FLASHINFER TEST_DIST_MODEL=facebook/opt-125m DISTRIBUTED_EXECUTOR_BACKEND=ray pytest -v -s distributed/test_basic_distributed_correctness.py
- VLLM_ATTENTION_BACKEND=FLASHINFER TEST_DIST_MODEL=meta-llama/Meta-Llama-3-8B DISTRIBUTED_EXECUTOR_BACKEND=ray pytest -v -s distributed/test_basic_distributed_correctness.py
- pytest -v -s -x lora/test_mixtral.py