Rebase 2025.01.29 (#751)
kzawora-intel authored Jan 29, 2025
2 parents 2d152ed + 69cb139 commit 1710059
Showing 119 changed files with 5,869 additions and 2,838 deletions.
Empty file modified .buildkite/run-tpu-test.sh (file mode 100644 → 100755)
1 change: 1 addition & 0 deletions .pre-commit-config.yaml
@@ -90,3 +90,4 @@ repos:
entry: bash -c 'echo "To bypass pre-commit hooks, add --no-verify to git commit."'
language: system
verbose: true
pass_filenames: false
4 changes: 2 additions & 2 deletions docs/requirements-docs.txt
@@ -1,10 +1,10 @@
sphinx==6.2.1
sphinx-argparse==0.4.0
sphinx-book-theme==1.0.1
sphinx-copybutton==0.5.2
myst-parser==3.0.1
sphinx-argparse==0.4.0
sphinx-design==0.6.1
sphinx-togglebutton==0.3.2
myst-parser==3.0.1
msgspec
cloudpickle

4 changes: 2 additions & 2 deletions docs/source/api/engine/index.md
@@ -8,10 +8,10 @@
.. currentmodule:: vllm.engine
```

```{toctree}
:::{toctree}
:caption: Engines
:maxdepth: 2

llm_engine
async_llm_engine
```
:::
4 changes: 2 additions & 2 deletions docs/source/api/model/index.md
@@ -2,10 +2,10 @@

## Submodules

```{toctree}
:::{toctree}
:maxdepth: 1

interfaces_base
interfaces
adapters
```
:::
4 changes: 2 additions & 2 deletions docs/source/api/multimodal/index.md
@@ -17,12 +17,12 @@ Looking to add your own multi-modal model? Please follow the instructions listed

## Submodules

```{toctree}
:::{toctree}
:maxdepth: 1

inputs
parse
processing
profiling
registry
```
:::
4 changes: 2 additions & 2 deletions docs/source/api/offline_inference/index.md
@@ -1,9 +1,9 @@
# Offline Inference

```{toctree}
:::{toctree}
:caption: Contents
:maxdepth: 1

llm
llm_inputs
```
:::
4 changes: 2 additions & 2 deletions docs/source/contributing/dockerfile/dockerfile.md
@@ -17,11 +17,11 @@ The edges of the build graph represent:

- `RUN --mount=(.\*)from=...` dependencies (with a dotted line and an empty diamond arrow head)

> ```{figure} /assets/contributing/dockerfile-stages-dependency.png
> :::{figure} /assets/contributing/dockerfile-stages-dependency.png
> :align: center
> :alt: query
> :width: 100%
> ```
> :::
>
> Made using: <https://github.com/patrickhoefler/dockerfilegraph>
>
8 changes: 4 additions & 4 deletions docs/source/contributing/model/basic.md
@@ -10,9 +10,9 @@ First, clone the PyTorch model code from the source repository.
For instance, vLLM's [OPT model](gh-file:vllm/model_executor/models/opt.py) was adapted from
HuggingFace's [modeling_opt.py](https://github.com/huggingface/transformers/blob/main/src/transformers/models/opt/modeling_opt.py) file.

```{warning}
:::{warning}
Make sure to review and adhere to the original code's copyright and licensing terms!
```
:::

## 2. Make your code compatible with vLLM

@@ -80,10 +80,10 @@ def forward(
...
```

```{note}
:::{note}
Currently, vLLM supports the basic multi-head attention mechanism and its variant with rotary positional embeddings.
If your model employs a different attention mechanism, you will need to implement a new attention layer in vLLM.
```
:::

For reference, check out our [Llama implementation](gh-file:vllm/model_executor/models/llama.py). vLLM already supports a large number of models. It is recommended to find a model similar to yours and adapt it to your model's architecture. Check out <gh-dir:vllm/model_executor/models> for more examples.

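As an aside on the `def forward(` hunk above: the surrounding section of `basic.md` is about rewriting the HuggingFace `forward` signature into vLLM's flattened-batch form. A rough sketch of the target shape of that signature is below; it is illustrative only (the class name is a placeholder, and the exact parameter list varies across vLLM versions):

```python
from typing import List, Optional

import torch
from torch import nn

from vllm.attention import AttentionMetadata


class YourModelForCausalLM(nn.Module):  # placeholder name, per the tutorial's convention

    def forward(
        self,
        input_ids: torch.Tensor,         # flattened token ids for all scheduled sequences
        positions: torch.Tensor,         # per-token positions, needed for rotary embeddings
        kv_caches: List[torch.Tensor],   # one paged KV-cache tensor per attention layer
        attn_metadata: AttentionMetadata,  # paged-attention bookkeeping for this batch
    ) -> Optional[torch.Tensor]:
        ...
```
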
12 changes: 6 additions & 6 deletions docs/source/contributing/model/index.md
@@ -4,24 +4,24 @@

This section provides more information on how to integrate a [PyTorch](https://pytorch.org/) model into vLLM.

```{toctree}
:::{toctree}
:caption: Contents
:maxdepth: 1

basic
registration
tests
multimodal
```
:::

```{note}
:::{note}
The complexity of adding a new model depends heavily on the model's architecture.
The process is considerably more straightforward if the model shares a similar architecture with an existing model in vLLM.
However, for models that include new operators (e.g., a new attention mechanism), the process can be a bit more complex.
```
:::

```{tip}
:::{tip}
If you are encountering issues while integrating your model into vLLM, feel free to open a [GitHub issue](https://github.com/vllm-project/vllm/issues)
or ask on our [developer slack](https://slack.vllm.ai).
We will be happy to help you out!
```
:::
32 changes: 16 additions & 16 deletions docs/source/contributing/model/multimodal.md
@@ -48,9 +48,9 @@ Further update the model as follows:
return vision_embeddings
```

```{important}
:::{important}
The returned `multimodal_embeddings` must be either a **3D {class}`torch.Tensor`** of shape `(num_items, feature_size, hidden_size)`, or a **list / tuple of 2D {class}`torch.Tensor`'s** of shape `(feature_size, hidden_size)`, so that `multimodal_embeddings[i]` retrieves the embeddings generated from the `i`-th multimodal data item (e.g, image) of the request.
```
:::

- Implement {meth}`~vllm.model_executor.models.interfaces.SupportsMultiModal.get_input_embeddings` to merge `multimodal_embeddings` with text embeddings from the `input_ids`. If input processing for the model is implemented correctly (see sections below), then you can leverage the utility function we provide to easily merge the embeddings.

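The `important` admonition above pins down the contract for `get_multimodal_embeddings`: a 3D tensor of shape `(num_items, feature_size, hidden_size)`, or a list/tuple of 2D tensors of shape `(feature_size, hidden_size)`, indexable per multimodal item. A minimal sketch of a method satisfying that contract is shown here; the vision-tower and projector attributes are assumptions for illustration, not code from the changed file:

```python
from typing import Tuple

import torch
from torch import nn


class YourModelForImage2Seq(nn.Module):  # class name follows the doc's example

    def get_multimodal_embeddings(self, **kwargs) -> Tuple[torch.Tensor, ...]:
        # Hypothetical vision path -- substitute your model's own modules.
        pixel_values: torch.Tensor = kwargs["pixel_values"]       # (num_items, C, H, W)
        vision_features = self.vision_tower(pixel_values)         # (num_items, feature_size, vision_dim)
        vision_embeddings = self.multi_modal_projector(vision_features)  # (num_items, feature_size, hidden_size)

        # Returning one 2D (feature_size, hidden_size) tensor per item keeps
        # multimodal_embeddings[i] addressable per image, as required above.
        return tuple(vision_embeddings.unbind(dim=0))
```
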
@@ -89,10 +89,10 @@ Further update the model as follows:
+ class YourModelForImage2Seq(nn.Module, SupportsMultiModal):
```

```{note}
:::{note}
The model class does not have to be named {code}`*ForCausalLM`.
Check out [the HuggingFace Transformers documentation](https://huggingface.co/docs/transformers/model_doc/auto#multimodal) for some examples.
```
:::

## 2. Specify processing information

@@ -120,8 +120,8 @@ When calling the model, the output embeddings from the visual encoder are assign
containing placeholder feature tokens. Therefore, the number of placeholder feature tokens should be equal
to the size of the output embeddings.

::::{tab-set}
:::{tab-item} Basic example: LLaVA
:::::{tab-set}
::::{tab-item} Basic example: LLaVA
:sync: llava

Looking at the code of HF's `LlavaForConditionalGeneration`:
@@ -254,12 +254,12 @@ def get_mm_max_tokens_per_item(self, seq_len: int) -> Mapping[str, int]:
return {"image": self.get_max_image_tokens()}
```

```{note}
:::{note}
Our [actual code](gh-file:vllm/model_executor/models/llava.py) is more abstracted to support vision encoders other than CLIP.
```

:::

::::
:::::

## 3. Specify dummy inputs

@@ -315,17 +315,17 @@ def get_dummy_processor_inputs(
Afterwards, create a subclass of {class}`~vllm.multimodal.processing.BaseMultiModalProcessor`
to fill in the missing details about HF processing.

```{seealso}
:::{seealso}
[Multi-Modal Data Processing](#mm-processing)
```
:::

### Multi-modal fields

Override {class}`~vllm.multimodal.processing.BaseMultiModalProcessor._get_mm_fields_config` to
return a schema of the tensors outputted by the HF processor that are related to the input multi-modal items.

::::{tab-set}
:::{tab-item} Basic example: LLaVA
:::::{tab-set}
::::{tab-item} Basic example: LLaVA
:sync: llava

Looking at the model's `forward` method:
@@ -367,13 +367,13 @@ def _get_mm_fields_config(
)
```

```{note}
:::{note}
Our [actual code](gh-file:vllm/model_executor/models/llava.py) additionally supports
pre-computed image embeddings, which can be passed to the model via the `image_embeds` argument.
```

:::

::::
:::::

### Prompt replacements

16 changes: 8 additions & 8 deletions docs/source/contributing/model/registration.md
@@ -17,17 +17,17 @@ After you have implemented your model (see [tutorial](#new-model-basic)), put it
Then, add your model class to `_VLLM_MODELS` in <gh-file:vllm/model_executor/models/registry.py> so that it is automatically registered upon importing vLLM.
Finally, update our [list of supported models](#supported-models) to promote your model!

```{important}
:::{important}
The list of models in each section should be maintained in alphabetical order.
```
:::

## Out-of-tree models

You can load an external model using a plugin without modifying the vLLM codebase.

```{seealso}
:::{seealso}
[vLLM's Plugin System](#plugin-system)
```
:::

To register the model, use the following code:

@@ -45,11 +45,11 @@ from vllm import ModelRegistry
ModelRegistry.register_model("YourModelForCausalLM", "your_code:YourModelForCausalLM")
```

```{important}
:::{important}
If your model is a multimodal model, ensure the model class implements the {class}`~vllm.model_executor.models.interfaces.SupportsMultiModal` interface.
Read more about that [here](#supports-multimodal).
```
:::

```{note}
:::{note}
Although you can directly put these code snippets in your script using `vllm.LLM`, the recommended way is to place these snippets in a vLLM plugin. This ensures compatibility with various vLLM features like distributed inference and the API server.
```
:::
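
Tying the two notes above together, a minimal sketch of packaging the `ModelRegistry.register_model(...)` call as an out-of-tree plugin could look like the following. The `vllm.general_plugins` entry-point group is the mechanism described in vLLM's plugin-system docs; the package and module names are placeholders:

```python
# your_code/__init__.py -- the plugin's registration hook (placeholder names)
def register():
    from vllm import ModelRegistry

    # Register lazily by string reference, exactly as in the snippet above,
    # so the model class is only imported inside the engine/worker processes.
    if "YourModelForCausalLM" not in ModelRegistry.get_supported_archs():
        ModelRegistry.register_model(
            "YourModelForCausalLM", "your_code:YourModelForCausalLM")


# setup.py -- expose the hook through the entry-point group vLLM scans at startup:
#
# from setuptools import setup
#
# setup(
#     name="vllm-your-model-plugin",
#     version="0.1",
#     packages=["your_code"],
#     entry_points={
#         "vllm.general_plugins": ["register_your_model = your_code:register"],
#     },
# )
```
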
8 changes: 4 additions & 4 deletions docs/source/contributing/model/tests.md
@@ -14,14 +14,14 @@ Without them, the CI for your PR will fail.
Include an example HuggingFace repository for your model in <gh-file:tests/models/registry.py>.
This enables a unit test that loads dummy weights to ensure that the model can be initialized in vLLM.

```{important}
:::{important}
The list of models in each section should be maintained in alphabetical order.
```
:::

```{tip}
:::{tip}
If your model requires a development version of HF Transformers, you can set
`min_transformers_version` to skip the test in CI until the model is released.
```
:::

## Optional Tests

12 changes: 6 additions & 6 deletions docs/source/contributing/overview.md
@@ -35,17 +35,17 @@ pre-commit run --all-files
pytest tests/
```

```{note}
:::{note}
Currently, the repository is not fully checked by `mypy`.
```
:::

## Issues

If you encounter a bug or have a feature request, please [search existing issues](https://github.com/vllm-project/vllm/issues?q=is%3Aissue) first to see if it has already been reported. If not, please [file a new issue](https://github.com/vllm-project/vllm/issues/new/choose), providing as much relevant information as possible.

```{important}
:::{important}
If you discover a security vulnerability, please follow the instructions [here](gh-file:SECURITY.md#reporting-a-vulnerability).
```
:::

## Pull Requests & Code Reviews

@@ -81,9 +81,9 @@ appropriately to indicate the type of change. Please use one of the following:
- `[Misc]` for PRs that do not fit the above categories. Please use this
sparingly.

```{note}
:::{note}
If the PR spans more than one category, please include all relevant prefixes.
```
:::

### Code Quality

12 changes: 6 additions & 6 deletions docs/source/contributing/profiling/profiling_index.md
@@ -6,21 +6,21 @@ The OpenAI server also needs to be started with the `VLLM_TORCH_PROFILER_DIR` en

When using `benchmarks/benchmark_serving.py`, you can enable profiling by passing the `--profile` flag.

```{warning}
:::{warning}
Only enable profiling in a development environment.
```
:::

Traces can be visualized using <https://ui.perfetto.dev/>.

```{tip}
:::{tip}
Only send a few requests through vLLM when profiling, as the traces can get quite large. Also, there is no need to untar the traces; they can be viewed directly.
```
:::

```{tip}
:::{tip}
When you stop the profiler, it flushes all the profile trace files to the directory. This takes time: for about 100 requests' worth of data for a Llama 70B, it takes about 10 minutes to flush out on an H100.
Set the environment variable `VLLM_RPC_TIMEOUT` to a large value before you start the server, for example 30 minutes:
`export VLLM_RPC_TIMEOUT=1800000`
```
:::
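
As a concrete illustration of the `VLLM_TORCH_PROFILER_DIR` workflow described at the top of this hunk, an offline profiling run could look roughly like this. It is a sketch, assuming the `LLM` class in your build exposes `start_profile()`/`stop_profile()`; the model name is a placeholder:

```python
import os

# Point the torch profiler at a trace directory before the engine is created.
os.environ["VLLM_TORCH_PROFILER_DIR"] = "./vllm_profile"

from vllm import LLM, SamplingParams

llm = LLM(model="facebook/opt-125m")  # placeholder model

llm.start_profile()
# Keep the profiled workload small -- as noted above, traces grow quickly.
outputs = llm.generate(["Hello, my name is"],
                       SamplingParams(temperature=0.0, max_tokens=16))
llm.stop_profile()

for output in outputs:
    print(output.outputs[0].text)
```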

## Example commands and usage
