This document overviews the vLLM request-processing pipeline for encoder/decoder models, highlighting those aspects which differ from vLLM's decoder-only pipeline.
Figure 1: Encoder/decoder architecture during the prefill (left) and decode (right) phases, omitting certain details such as layer norm. Encoder layers are abstracted as gray boxes, while decoder layers are expanded to show how self-attention (blue) and cross-attention (orange) utilize KV caching. The KV caches shown are the decoder self-attention cache (blue; "Self") and the encoder/decoder cross-attention cache (orange; "Cross"). Although the model architecture does not change between the prefill and decode phases, the encoder is omitted from the decode-phase diagram because all computations on the encoder hidden states are handled by the cross-attention KV cache, which is read-only during the decode phase. The decoder self-attention KV cache is updated with each new decoded token.
Figure 1, which reviews the generalized architecture of encoder/decoder models, clarifies why encoder/decoder models place additional requirements on the vLLM processing pipeline compared to decoder-only models:
- Both the encoder and decoder modules accept input prompts (Figure 1). This means that, upon receiving a user request, vLLM must be able to either extract or infer both an encoder input prompt and a decoder input prompt.
- At the other end, the request output - in order to be sensible to a human reader - must include the encoder & decoder input prompts as well as the decoded tokens.
- Once the encoder & decoder input prompts have been extracted, it must be possible to construct the usual vLLM intermediate representations such as `Sequence`, `SequenceGroup`, `SequenceGroupMetadata`, etc. Specifically, the encoder sequence - which is static after prefill - must be tracked alongside the decoder sequences, which are updated at each scheduler step.
- Figure 1 (right) shows that after prefill, it is possible to entirely discard the encoder output hidden states & make sole use of the cached cross-attention KVs to perform decode-phase cross-attention. This has significant implications for the request-processing pipeline:
  - In addition to the pre-existing vLLM "decoder self-attention" KV caches and associated block tables, the vLLM scheduler must now construct an "encoder/decoder cross-attention KV cache" and associated block table for each `SequenceGroupMetadata`.
  - The block manager - orchestrated by the scheduler - must correctly allocate, free and swap the cross-attention KV cache.
  - The model runner must pass cross-attention block tables and slot mappings as input to models. The `AttentionMetadata` structure must include additional fields related to encoder sequences & cross-attention memory-mapping.
- Finally, the attention backends must support encoder attention, decoder self-attention, and encoder/decoder cross-attention. At the time of writing, only the XFormers attention backend supports all of these capabilities.
The following sections add more detail. It may be helpful to review the official vLLM input processing pipeline doc before continuing.
For decoder-only models, a vLLM request specifies a single text or tokens prompt, and possibly multimodal data.
An encoder/decoder model is architecturally capable of accepting two prompts, and currently vLLM does not support multimodal input for encoder/decoder models. So it would seem that vLLM requests should be able to pass in two prompts.
However, taking HF transformers as an example, it is also normal for encoder/decoder models to be invoked by passing in only a single prompt. In these cases, the input to `model.generate()` is typically passed to the encoder. The reason is that the encoder input is usually the "primary" input reflecting the purpose the model was designed for, whereas the decoder input - if specified by the user at all - is for tuning model behavior. For example:
- With HuggingFace (HF) BART, invoking `model.generate(prompt)` passes `prompt` to the encoder input, because the encoder encodes the question or document to summarize (see the sketch after this list).
- With HF Whisper - a speech recognition model - preprocessed audio embeddings are passed to the encoder as input. The user rarely specifies a decoder prompt directly; instead the `WhisperConfig` determines translation language, timestamps, task, and other model behaviors. During inference Whisper effectively injects these settings into the decoder sequence as control tokens.
This suggests that when vLLM is running an encoder/decoder model, requests must at minimum always contain an encoder input prompt (or, in the future, multimodal data.)
However, it may be desirable for a user to be able to tweak the decoder prompt by injecting custom control tokens to tune model behavior. So it should also be possible for a request to specify a decoder prompt in addition to the encoder prompt.
To that end, vLLM supports the following request formats for encoder/decoder models (a short usage sketch follows this list):
- Singleton prompt (implicitly an encoder prompt)
  - Singleton prompt string
    - vLLM will tokenize this prompt and pass the token-list to the encoder.
    - vLLM will pass a default prompt to the decoder.

    For example, passing the singleton prompt below to vLLM BART

    ```
    "The rain in spain falls mainly on the"
    ```

    results in

    ```
    Encoder prompt: tokenize("The rain in spain falls mainly on the")
    Decoder prompt: [2,0] # <DEC><BOS>
    ```

    where `<DEC>` is the decoder start token.
  - Singleton `TextPrompt`
    - vLLM will extract the prompt text, tokenize it and pass the token-list to the encoder.
    - vLLM will pass a default prompt to the decoder.

    For example, passing the `TextPrompt` below to vLLM BART

    ```
    TextPrompt(prompt="The rain in spain falls mainly on the")
    ```

    results in

    ```
    Encoder prompt: tokenize("The rain in spain falls mainly on the")
    Decoder prompt: [2,0] # <DEC><BOS>
    ```

    which is the same as for the raw text prompt; `TextPrompt` will only be differentiated from raw text prompts once multi-modal encoder/decoder support is added, at which point the `multi_modal_data` field of `TextPrompt` may be used.
  - Singleton `TokensPrompt` with prompt tokens
    - vLLM will pass the unmodified token-list directly to the encoder.
    - vLLM will pass a default prompt to the decoder.

    For example, passing the `TokensPrompt` below to vLLM BART

    ```
    TokensPrompt(prompt_token_ids=[2,0,171,5,2])
    ```

    results in

    ```
    Encoder prompt: [2,0,171,5,2]
    Decoder prompt: [2,0] # <DEC><BOS>
    ```

    Note that currently the `multi_modal_data` field of `TokensPrompt` may not be used.
- Explicit encoder/decoder prompt
  - Structure:

    ```
    ExplicitEncoderDecoderPrompt(
        encoder_prompt=<Singleton prompt>,
        decoder_prompt=<Singleton prompt>,
    )
    ```

  - Each sub-prompt may be any of the aforementioned types of singleton prompt.
  - vLLM will tokenize any sub-prompt which is not already a token-list.
  - vLLM will preprocess the decoder prompt; the default behavior is to prepend a `<DEC>` token (ID=2 for the BART tokenizer) to the decoder token list, unless an initial `<DEC>` token is already present.
  - vLLM will pass the encoder prompt tokens to the encoder and the preprocessed decoder prompt tokens to the decoder.

  For example, passing the `ExplicitEncoderDecoderPrompt` below to BART

  ```
  ExplicitEncoderDecoderPrompt(
      encoder_prompt=TextPrompt(prompt="The rain in spain falls mainly on the"),
      decoder_prompt=[2, 0, 51, 178, 2],
  )
  ```

  results in

  ```
  Encoder prompt: tokenize("The rain in spain falls mainly on the")
  Decoder prompt: [2, 0, 51, 178, 2]
  ```
- Additional notes on encoder/decoder prompts
  - With regard to decoder prompt preprocessing, vLLM emulates the default behavior of the HuggingFace transformers `GenerationMixin` for encoder/decoder models:
    - vLLM's default decoder prompt is `<DEC><BOS>`, where `<DEC>` is the decoder start token and `<BOS>` is the beginning-of-sequence token. This approximates `GenerationMixin`'s default behavior when it receives a `None` decoder prompt, which is to (1) choose `<DEC>` as the default prompt, and (2) employ a logit-processing constraint which forces the first decoded token to be `<BOS>`.
    - When the user specifies a decoder prompt that does not begin with `<DEC>`, `<DEC>` will be prepended to the prompt tokens during decoder prompt preprocessing. If the prompt tokens already begin with `<DEC>`, then decoder prompt preprocessing makes no change.
  - However, if you are adding a new encoder/decoder model to vLLM, you should consider whether vLLM's default decoder prompt & decoder prompt preprocessing logic need to be specialized for your model.
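The request formats above can be exercised through vLLM's offline `LLM` API. The sketch below is illustrative only: it assumes the `facebook/bart-large-cnn` checkpoint and that `TextPrompt`, `TokensPrompt`, and `ExplicitEncoderDecoderPrompt` are importable from `vllm.inputs`; exact import paths, field names, and defaults may differ across vLLM versions.

```python
from vllm import LLM, SamplingParams
from vllm.inputs import ExplicitEncoderDecoderPrompt, TextPrompt, TokensPrompt

# Assumed BART checkpoint; any vLLM-supported encoder/decoder model
# follows the same request formats.
llm = LLM(model="facebook/bart-large-cnn")
sampling_params = SamplingParams(temperature=0.0, max_tokens=20)

prompts = [
    # 1. Singleton prompt string: tokenized and sent to the encoder;
    #    the decoder receives the default prompt <DEC><BOS>.
    "The rain in spain falls mainly on the",
    # 2. Singleton TextPrompt: equivalent to the raw string today.
    TextPrompt(prompt="The rain in spain falls mainly on the"),
    # 3. Singleton TokensPrompt: token IDs are passed to the encoder unmodified.
    TokensPrompt(prompt_token_ids=[2, 0, 171, 5, 2]),
    # 4. Explicit encoder/decoder prompt: both sub-prompts are user-controlled.
    ExplicitEncoderDecoderPrompt(
        encoder_prompt=TextPrompt(prompt="The rain in spain falls mainly on the"),
        decoder_prompt=TokensPrompt(prompt_token_ids=[2, 0, 51, 178, 2]),
    ),
]

for output in llm.generate(prompts, sampling_params):
    print(output.outputs[0].text)
```

Each entry in `prompts` exercises one of the request formats above; in the three singleton cases the decoder receives the default `<DEC><BOS>` prompt.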
- Upon receiving a request, `LLMEngine` processes the input prompt into an `EncoderDecoderLLMInputs` instance:

  ```python
  class EncoderDecoderLLMInputs(LLMInputs):
      """
      The inputs in :class:`~vllm.LLMEngine` before they are passed to the
      model executor.

      This specifies the required data for encoder-decoder models.
      """

      encoder_prompt_token_ids: List[int]
      """The token IDs of the encoder prompt."""

      encoder_prompt: NotRequired[Optional[str]]
      """
      The original encoder prompt text corresponding to the token IDs, if
      available.
      """
  ```

  Note that in addition to the encoder-oriented fields shown above, `EncoderDecoderLLMInputs` also retains the decoder-oriented `prompt` and `prompt_token_ids` fields defined for `LLMInputs`; a hypothetical example instance is sketched below.
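For concreteness, here is roughly the content of such an instance for the earlier singleton BART prompt. This is a hand-written illustration (since `EncoderDecoderLLMInputs` is a `TypedDict`, a plain dict is shown), not the output of the actual input processor.

```python
from transformers import AutoTokenizer

# Illustrative only: approximately what LLMEngine's input processing would
# store for the singleton-prompt BART example above. Field names follow
# the EncoderDecoderLLMInputs definition.
tokenizer = AutoTokenizer.from_pretrained("facebook/bart-large-cnn")

inputs = {
    # Encoder-oriented fields (specific to EncoderDecoderLLMInputs)
    "encoder_prompt": "The rain in spain falls mainly on the",
    "encoder_prompt_token_ids": tokenizer.encode(
        "The rain in spain falls mainly on the"),
    # Decoder-oriented field inherited from LLMInputs; here the default
    # decoder prompt <DEC><BOS> = [2, 0] has been applied.
    "prompt_token_ids": [2, 0],
}
```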
- vLLM allows a `Sequence` to be constructed from an `LLMInputs` instance, via the `inputs` constructor argument. Thus it is also possible to construct a `Sequence` from an `EncoderDecoderLLMInputs` instance.
  - The `Sequence` constructor has a `from_decoder_prompt` argument (illustrated in the sketch after this list):
    - `from_decoder_prompt=True` will construct the sequence from `inputs.prompt_token_ids` and `inputs.prompt`
    - `from_decoder_prompt=False` will construct the sequence from `inputs.encoder_prompt_token_ids` and `inputs.encoder_prompt`
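As a rough illustration, constructing the decoder sequence and the (static) encoder sequence for one request might look like the following. The import path and the `seq_id`, `block_size`, and `eos_token_id` arguments are assumptions included only to make the sketch self-contained; the real constructor may differ.

```python
from vllm.sequence import Sequence  # assumed import path

BLOCK_SIZE = 16  # assumed KV-cache block size


def build_request_sequences(inputs, eos_token_id: int):
    """Sketch: build the decoder sequence and the static encoder sequence
    from a single EncoderDecoderLLMInputs instance."""
    # Decoder sequence: grows by one token per scheduler step during decode.
    decoder_seq = Sequence(
        seq_id=0,
        inputs=inputs,
        block_size=BLOCK_SIZE,
        eos_token_id=eos_token_id,
        from_decoder_prompt=True,   # uses inputs.prompt / inputs.prompt_token_ids
    )
    # Encoder sequence: static after prefill, tracked alongside the decoder.
    encoder_seq = Sequence(
        seq_id=1,
        inputs=inputs,
        block_size=BLOCK_SIZE,
        eos_token_id=eos_token_id,
        from_decoder_prompt=False,  # uses inputs.encoder_prompt_token_ids / encoder_prompt
    )
    return decoder_seq, encoder_seq
```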
- `SequenceGroup` represents all sequences associated with a single request. Now `SequenceGroup` has an additional `encoder_seq` member, which allows it to represent the encoder input sequence associated with a request.
- `SequenceGroupMetadata` encapsulates metadata - including but not limited to sequence data & block tables - associated with a given request in a given inference step. Now `SequenceGroupMetadata` has additional `encoder_seq_data` and `cross_block_table` fields for representing the encoder input sequence data and the encoder/decoder cross-attention block table, respectively. A hypothetical example is sketched after this list.
  - `SequenceGroupMetadata.block_table` is a `Dict[int, List[int]]` because the sequence group contains a self-attention block table for each decoder sequence; each decoder sequence has an integer ID which is used to look up its block table.
  - `SequenceGroupMetadata.cross_block_table` is an `Optional[List[int]]` because there is at most one cross-attention block table per sequence group (the decoder-only pipeline employs no cross-attention block tables).
(For brevity, the impact of encoder/decoder on block space manager v2 is omitted here.)
The block manager contains two internal block table representations:
- `block_tables: Dict[int, BlockTable]`: a decoder `Sequence` ID -> self-attention block table mapping
- `cross_block_tables: Dict[str, BlockTable]`: a `SequenceGroup` request ID -> cross-attention block table mapping
  - Rationale: as described earlier, there is at most one cross-attention block table per `SequenceGroup`; therefore `SequenceGroup` request IDs are sufficient for identifying cross-attention block tables
  - Note: (1) `SequenceGroup` IDs are globally-unique in vLLM, (2) `SequenceGroup` request IDs are strings
  - For decoder-only models, `cross_block_tables` will be an empty dictionary

`block_man.get_block_table(seq)` returns the self-attention block table associated with the `Sequence` passed as argument.

`block_man.get_cross_block_table(seq_group)` returns the cross-attention block table associated with the `SequenceGroup` passed as argument.
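A minimal sketch of these two mappings and their lookup helpers is shown below, assuming `BlockTable` can be viewed simply as a list of physical block numbers and that sequences / sequence groups expose `seq_id` / `request_id` attributes (a simplification of the real block manager).

```python
from typing import Dict, List

BlockTable = List[int]  # simplified: a list of physical block numbers


class BlockSpaceManagerSketch:
    """Schematic view of the block manager's two block-table mappings."""

    def __init__(self) -> None:
        # Decoder Sequence ID (int) -> self-attention block table.
        self.block_tables: Dict[int, BlockTable] = {}
        # SequenceGroup request ID (str) -> cross-attention block table.
        # Empty for decoder-only models.
        self.cross_block_tables: Dict[str, BlockTable] = {}

    def get_block_table(self, seq) -> BlockTable:
        # Self-attention block table for one decoder sequence.
        return self.block_tables[seq.seq_id]

    def get_cross_block_table(self, seq_group) -> BlockTable:
        # At most one cross-attention block table per sequence group,
        # keyed by the (string) request ID.
        return self.cross_block_tables[seq_group.request_id]
```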
- The block manager manages $total\ num\ gpu\ blocks$ GPU memory blocks and $total\ num\ cpu\ blocks$ CPU memory blocks.
- `block_man.allocate(seq_group)` provisions:
  - One self-attention KV cache for each decoder sequence in the `SequenceGroup`
  - One KV cache for cross-attention, with the number of token slots equal to the length of the `SequenceGroup`'s encoder sequence
  - Allocation yields a block table for the cross-attention KV cache & one block table for each self-attention KV cache
  - Total # of blocks:

    $$(seq\ group\ blocks) = |(cross\ attn\ blocktable)| + \sum_{i}^{num\ seqs}{|(seq_{i}\ decoder\ self\ attn\ block\ table)|}$$

  - After allocation,

    $$(free\ gpu\ blocks\ after\ alloc) = (free\ gpu\ blocks) - (seq\ group\ blocks)$$

- `block_man.swap_out(seq_group)` accomplishes GPU -> CPU swap for a `SequenceGroup`'s KV caches.
  - After swap,

    $$(free\ gpu\ blocks\ after\ swap\ out) = (free\ gpu\ blocks) + (seq\ group\ blocks)$$
    $$(free\ cpu\ blocks\ after\ swap\ out) = (free\ cpu\ blocks) - (seq\ group\ blocks)$$

- `block_man.swap_in(seq_group)` accomplishes CPU -> GPU swap for a `SequenceGroup`'s KV caches.
  - After swap,

    $$(free\ gpu\ blocks\ after\ swap\ in) = (free\ gpu\ blocks) - (seq\ group\ blocks)$$
    $$(free\ cpu\ blocks\ after\ swap\ in) = (free\ cpu\ blocks) + (seq\ group\ blocks)$$

- `block_man.free(seq)` frees the self-attention KV cache blocks associated with the `Sequence` argument passed in.
  - After `free()`,

    $$(free\ device\ blocks\ after\ free) = (free\ device\ blocks) + |(seq_{i}\ decoder\ self\ attn\ block\ table)|$$

    where $device$ is whichever of $\{CPU, GPU\}$ the `SequenceGroup` currently resides in, and $i$ is the `Sequence` ID.
- `block_man.free_cross(seq_group)` frees the cross-attention KV cache blocks associated with the `SequenceGroup` argument passed in.
  - After `free_cross()`,

    $$(free\ device\ blocks\ after\ free\ cross) = (free\ device\ blocks) + |(cross\ attn\ blocktable)|$$

    where $device$ is whichever of $\{CPU, GPU\}$ the `SequenceGroup` currently resides in.
- `block_man.reset()` frees all cache blocks associated with all block tables managed by the block manager, after which there are $total\ num\ gpu\ blocks$ free GPU memory blocks and $total\ num\ cpu\ blocks$ free CPU memory blocks.
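As a concrete worked example of the block accounting above, suppose (hypothetically) a block holds 16 token slots and a sequence group has two decoder sequences of 19 and 35 tokens plus a 40-token encoder sequence:

```python
BLOCK_SIZE = 16  # token slots per KV-cache block (illustrative)


def blocks_needed(num_tokens: int) -> int:
    # Ceiling division: a partially filled block still occupies a whole block.
    return -(-num_tokens // BLOCK_SIZE)


decoder_seq_lens = [19, 35]   # two decoder sequences in the SequenceGroup
encoder_seq_len = 40          # encoder sequence length

# One self-attention block table per decoder sequence.
self_attn_blocks = [blocks_needed(n) for n in decoder_seq_lens]   # [2, 3]
# A single cross-attention block table sized by the encoder sequence.
cross_attn_blocks = blocks_needed(encoder_seq_len)                # 3

# (seq group blocks) = |cross attn blocktable| + sum_i |seq_i self-attn block table|
seq_group_blocks = cross_attn_blocks + sum(self_attn_blocks)      # 3 + 2 + 3 = 8

free_gpu_blocks = 1000
free_gpu_blocks_after_alloc = free_gpu_blocks - seq_group_blocks  # 992

# swap_out moves the whole group's blocks GPU -> CPU:
free_cpu_blocks = 500
free_gpu_after_swap_out = free_gpu_blocks_after_alloc + seq_group_blocks  # 1000
free_cpu_after_swap_out = free_cpu_blocks - seq_group_blocks              # 492
```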
- For encoder/decoder models only:
  - `Scheduler.schedule()` now has the added behavior of passing `encoder_seq_data` and `cross_block_table` to the `SequenceGroupMetadata` constructor. `Scheduler.schedule()` obtains `encoder_seq_data` from `SequenceGroup.encoder_seq` and `cross_block_table` from `block_man.get_cross_block_table(seq_group)`.
  - `Scheduler.abort_seq_group(req_id)` now has the added effect of freeing the cross-attention block table associated with the `SequenceGroup` with request ID `req_id`. `Scheduler._free_seq_group_cross_attn_blocks(seq_group)` is the helper function which frees the cross-attention block table.
  - `Scheduler.free_finished_seq_groups()` now has the added effect of invoking `Scheduler._free_seq_group_cross_attn_blocks(seq_group)` against all finished `SequenceGroup`s, which frees the `SequenceGroup` cross-attention block tables. `Scheduler.free_finished_seq_groups()` is invoked by `LLMEngine._process_model_outputs()`.
Generally speaking, `ModelRunner` and its subclasses consume the `SequenceGroupMetadata` instances constructed by the `Scheduler`.

For decoder-only models, `ModelRunner` utilizes these `SequenceGroupMetadata` instances to:
- Construct the decoder input tokens/positions
- Build the decoder self-attention slot-mappings data structure
- Compute the decoder sequence lengths, token-counts, etc.
- Construct an `AttentionMetadata` instance

For encoder/decoder models, `EncoderDecoderModelRunner` prepares all of the same decoder-oriented model inputs, but additionally constructs:
- The encoder input tokens/positions
- The encoder/decoder cross-attention slot-mappings data structure
- The encoder sequence lengths, token-counts, etc.

See the how-to guides for encoder/decoder model `forward()` method arguments and suggested encoder/decoder model architecture.
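For intuition, the sketch below shows the kind of additional flattened inputs the encoder/decoder model runner assembles from a batch of `SequenceGroupMetadata`, together with the usual slot-mapping computation (slot = physical block number x block size + offset within block). The field and helper names are illustrative, not the exact vLLM attribute names.

```python
from dataclasses import dataclass, field
from typing import List

BLOCK_SIZE = 16  # illustrative KV-cache block size


def slots_for_tokens(block_table: List[int], num_tokens: int) -> List[int]:
    """Map token positions to flat KV-cache slot indices:
    slot = physical_block_number * BLOCK_SIZE + offset_within_block."""
    return [
        block_table[i // BLOCK_SIZE] * BLOCK_SIZE + (i % BLOCK_SIZE)
        for i in range(num_tokens)
    ]


@dataclass
class EncDecModelInputSketch:
    """Illustrative flattened batch inputs (not the real vLLM structures)."""
    # Decoder-oriented inputs (shared with the decoder-only pipeline)
    input_tokens: List[int] = field(default_factory=list)
    input_positions: List[int] = field(default_factory=list)
    slot_mapping: List[int] = field(default_factory=list)        # decoder self-attn slots
    seq_lens: List[int] = field(default_factory=list)

    # Additional encoder/decoder inputs
    encoder_input_tokens: List[int] = field(default_factory=list)
    encoder_input_positions: List[int] = field(default_factory=list)
    cross_slot_mapping: List[int] = field(default_factory=list)  # cross-attn slots
    encoder_seq_lens: List[int] = field(default_factory=list)
```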
Two factors which complicate scaled dot-product (SDP) attention computation in the vLLM backend are:
- For an $N$-sequence batch, vLLM passes the model a single token vector which is the concatenation of the $N$ sequences (without padding), and which has a total number of tokens equal to the sum of the token-counts of all $N$ sequences. vLLM expects the model to pass tokens to the `Attention` layer in this single-vector format, which means all sequences are handled in a single SDP attention computation. But critically, the sequences must attend only to themselves and not each other during the `Attention` layer computation. This effectively requires discarding the parts of the SDP attention score matrix corresponding to attention between sequences.
- (Encoder/decoder only) By default (i.e. unless a particular model specifies otherwise), non-causal attention is employed for encoder attention & encoder/decoder cross-attention, while causal attention is employed for decoder self-attention.
vLLM addresses both requirements by augmenting SDP attention with a causal or non-causal block-diagonal attention mask. The SDP attention computation may be augmented with a bias or mask matrix $M$. The SDP attention score computation then becomes

$$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\left(\frac{QK^{T}}{\sqrt{d_k}} + M\right)V$$

Figure: vLLM attention backend - prefill-phase block-diagonal masks for encoder, decoder self-, and encoder/decoder cross-attention.
Note the rectangular shape of the diagonal blocks in the prefill cross-attention mask, as compared to the square blocks in the encoder and decoder self-attention masks. In encoder attention and decoder self-attention, Q and K are derived from the same source (previous encoder or decoder layer output, respectively) and thus have the same length during prefill; therefore, the regions of the SDP attention score matrix corresponding to intra-sequence attention will always be square during prefill. In contrast, cross-attention Q is derived from the decoder self-attention hidden states while K is derived from the encoder output hidden states. Since the encoder and decoder have different input prompts during prefill, Q and K may differ in length for cross-attention, which is why the diagonal blocks are rectangular during prefill.
Currently, vLLM does not support passing arbitrary fully-materialized attention bias or mask tensors to the attention backends.
More specifically, the vLLM XFormers attention backend internally constructs instances of `BlockDiagonalMask` or `BlockDiagonalCausalMask` and passes them to the XFormers kernels. `BlockDiagonalMask` is utilized when `Attention.forward()` is invoked with `attn_type=AttentionType.ENCODER` or `attn_type=AttentionType.ENCODER_DECODER`. `BlockDiagonalCausalMask` is utilized when `Attention.forward()` is invoked with `attn_type=AttentionType.DECODER`.
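As an illustration of how such prefill-phase masks can be built directly with xFormers (outside of vLLM), consider a hypothetical batch of two requests; the sequence lengths below are illustrative, and the real backend derives them from `AttentionMetadata` rather than hard-coding them.

```python
from xformers.ops.fmha.attn_bias import (BlockDiagonalCausalMask,
                                          BlockDiagonalMask)

# Hypothetical prefill batch of two requests:
#   request 0: encoder prompt of 7 tokens, decoder prompt of 3 tokens
#   request 1: encoder prompt of 5 tokens, decoder prompt of 4 tokens
encoder_seq_lens = [7, 5]
decoder_seq_lens = [3, 4]

# Encoder attention (AttentionType.ENCODER): non-causal, square diagonal
# blocks, since Q and K both come from the encoder input.
encoder_mask = BlockDiagonalMask.from_seqlens(encoder_seq_lens)

# Decoder self-attention (AttentionType.DECODER): causal, square diagonal
# blocks, since Q and K both come from the decoder input.
decoder_self_mask = BlockDiagonalCausalMask.from_seqlens(decoder_seq_lens)

# Encoder/decoder cross-attention (AttentionType.ENCODER_DECODER):
# non-causal, rectangular diagonal blocks, since Q comes from the decoder
# (query lengths) while K/V come from the encoder (key/value lengths).
cross_mask = BlockDiagonalMask.from_seqlens(decoder_seq_lens,
                                            kv_seqlen=encoder_seq_lens)
```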
The vLLM flash-attention backend does not currently support encoder attention or encoder/decoder cross-attention; however, a logical approach would be to configure the block-diagonal mask shape using `FlashAttentionMetadata.query_start_loc` and `FlashAttentionMetadata.seq_start_loc`, and causality using the flash-attention kernel's `causal=True/False` flag.

The vLLM Flashinfer backend also does not currently support encoder attention or encoder/decoder cross-attention; however, a similar approach based on the `query_start_loc`/`seq_start_loc`/`causal` flags should work.
Note that adding encoder/decoder support to backends other than XFormers is a workstream in the encoder/decoder RFC.
vLLM does not construct any kind of attention bias for decode-phase paged attention.
The paged attention kernels avoid computing inter-sequence attention by design, so this does not need to be enforced by a block-diagonal mask.
There is no need to impose a causal or non-causal attention mask on the paged attention, because the natural behavior of paged attention is to have the query attend to all KVs in the context. This suffices for both decode-phase decoder self-attention (where the context is past decoded tokens) and decode-phase encoder/decoder cross-attention (where the context is the static set of encoder tokens.)
However, once custom attention bias is supported, an explicit decode-phase attention mask will be required, because custom attention bias allows the attention between the query vector and the preceding KVs to be weighted non-uniformly.