From 3de2b1eafb12e420c563cb7153d4d2f0e8451ca9 Mon Sep 17 00:00:00 2001
From: Cyrus Leung
Date: Fri, 10 Jan 2025 11:25:20 +0800
Subject: [PATCH] [Doc] Show default pooling method in a table (#11904)

Signed-off-by: DarkLight1337
---
 docs/source/models/generative_models.md |  8 ++--
 docs/source/models/pooling_models.md    | 59 +++++++++++++++++--------
 2 files changed, 45 insertions(+), 22 deletions(-)

diff --git a/docs/source/models/generative_models.md b/docs/source/models/generative_models.md
index 6228c7c2ac957..a9f74c4d3fbb8 100644
--- a/docs/source/models/generative_models.md
+++ b/docs/source/models/generative_models.md
@@ -8,14 +8,14 @@ In vLLM, generative models implement the {class}`~vllm.model_executor.models.Vll
 Based on the final hidden states of the input, these models output log probabilities of the tokens to generate, which are then passed through {class}`~vllm.model_executor.layers.Sampler` to obtain the final text.
 
+For generative models, the only supported `--task` option is `"generate"`.
+Usually, this is automatically inferred so you don't have to specify it.
+
 ## Offline Inference
 
 The {class}`~vllm.LLM` class provides various methods for offline inference.
 See [Engine Arguments](#engine-args) for a list of options when initializing the model.
 
-For generative models, the only supported {code}`task` option is {code}`"generate"`.
-Usually, this is automatically inferred so you don't have to specify it.
-
 ### `LLM.generate`
 
 The {class}`~vllm.LLM.generate` method is available to all generative models in vLLM.
 
@@ -33,7 +33,7 @@ for output in outputs:
 ```
 
 You can optionally control the language generation by passing {class}`~vllm.SamplingParams`.
-For example, you can use greedy sampling by setting {code}`temperature=0`:
+For example, you can use greedy sampling by setting `temperature=0`:
 
 ```python
 llm = LLM(model="facebook/opt-125m")
diff --git a/docs/source/models/pooling_models.md b/docs/source/models/pooling_models.md
index 3e4407cfdc233..745f3fd81980d 100644
--- a/docs/source/models/pooling_models.md
+++ b/docs/source/models/pooling_models.md
@@ -14,30 +14,53 @@ As shown in the [Compatibility Matrix](#compatibility-matrix), most vLLM feature
 pooling models as they only work on the generation or decode stage, so performance may not improve as much.
 ```
 
-## Offline Inference
-
-The {class}`~vllm.LLM` class provides various methods for offline inference.
-See [Engine Arguments](#engine-args) for a list of options when initializing the model.
-
-For pooling models, we support the following {code}`task` options:
-
-- Embedding ({code}`"embed"` / {code}`"embedding"`)
-- Classification ({code}`"classify"`)
-- Sentence Pair Scoring ({code}`"score"`)
-- Reward Modeling ({code}`"reward"`)
+For pooling models, we support the following `--task` options.
+The selected option sets the default pooler used to extract the final hidden states:
+
+```{list-table}
+:widths: 50 25 25 25
+:header-rows: 1
+
+* - Task
+  - Pooling Type
+  - Normalization
+  - Softmax
+* - Embedding (`embed`)
+  - `LAST`
+  - ✅︎
+  - ✗
+* - Classification (`classify`)
+  - `LAST`
+  - ✗
+  - ✅︎
+* - Sentence Pair Scoring (`score`)
+  - \*
+  - \*
+  - \*
+* - Reward Modeling (`reward`)
+  - `ALL`
+  - ✗
+  - ✗
+```
 
-The selected task determines the default {class}`~vllm.model_executor.layers.Pooler` that is used:
+\*The default pooler is always defined by the model.
 
-- Embedding: Extract only the hidden states corresponding to the last token, and apply normalization.
-- Classification: Extract only the hidden states corresponding to the last token, and apply softmax.
-- Sentence Pair Scoring: Extract only the hidden states corresponding to the last token, and apply softmax.
-- Reward Modeling: Extract all of the hidden states and return them directly.
+```{note}
+If the model's implementation in vLLM defines its own pooler, the default pooler is set to that instead of the one specified in this table.
+```
 
 When loading [Sentence Transformers](https://huggingface.co/sentence-transformers) models,
-we attempt to override the default pooler based on its Sentence Transformers configuration file ({code}`modules.json`).
+we attempt to override the default pooler based on its Sentence Transformers configuration file (`modules.json`).
 
-You can customize the model's pooling method via the {code}`override_pooler_config` option,
+```{tip}
+You can customize the model's pooling method via the `--override-pooler-config` option,
 which takes priority over both the model's and Sentence Transformers's defaults.
+```
+
+## Offline Inference
+
+The {class}`~vllm.LLM` class provides various methods for offline inference.
+See [Engine Arguments](#engine-args) for a list of options when initializing the model.
 
 ### `LLM.encode`
 
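Aside (not part of the patch above): the generative-models hunk states that `temperature=0` yields greedy sampling. A minimal, dependency-free sketch of why that is the limiting case; `temperature_softmax` is a hypothetical helper for illustration, not vLLM's actual `Sampler` code, which handles `temperature=0` as an explicit argmax special case rather than by division.

```python
import math

def temperature_softmax(logits, temperature):
    # Scale logits by 1/temperature, then apply a numerically
    # stabilized softmax. Lower temperatures sharpen the distribution.
    scaled = [l / temperature for l in logits]
    m = max(scaled)
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

logits = [2.0, 1.0, 0.5]

# At temperature 1.0 the distribution stays relatively flat,
# while at 0.1 nearly all probability mass lands on the top token.
flat = temperature_softmax(logits, 1.0)
sharp = temperature_softmax(logits, 0.1)

# temperature=0 cannot be divided by, so it is treated as pure argmax
# (greedy decoding): always pick the highest-logit token.
greedy = max(range(len(logits)), key=logits.__getitem__)
```

As the temperature approaches zero, the scaled softmax converges to a one-hot distribution on `greedy`, which is why `temperature=0` deterministically reproduces the argmax token.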
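Aside (not part of the patch above): the pooling table pairs each task with a pooling type plus optional normalization or softmax. A toy sketch of what those columns mean, using a made-up hidden-state list; this is an illustration of the concepts, not vLLM's `Pooler` implementation.

```python
import math

# Toy "final hidden states": one row per input token, hidden size 3.
hidden_states = [
    [0.5, -1.0, 2.0],
    [1.5, 0.0, -0.5],
    [2.0, 1.0, 1.0],  # last token
]

def pool_last(states):
    # LAST pooling: keep only the final token's hidden state.
    return states[-1]

def pool_all(states):
    # ALL pooling: keep every token's hidden state unchanged.
    return states

def normalize(vec):
    # L2-normalize a vector to unit length.
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]

def softmax(vec):
    # Turn the pooled vector into a probability distribution.
    m = max(vec)
    exps = [math.exp(x - m) for x in vec]
    total = sum(exps)
    return [e / total for e in exps]

embedding = normalize(pool_last(hidden_states))  # embed: LAST + normalization
class_probs = softmax(pool_last(hidden_states))  # classify: LAST + softmax
rewards = pool_all(hidden_states)                # reward: ALL, returned as-is
```

This also shows why the score row is starred in the table: cross-encoder scoring models define their own head, so no single pooling recipe applies and the model's own default is used.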