Below is a reference to the line where speculative and non-speculative sequences are combined. The order is clearly non-speculative first, followed by speculative.
vllm-fork/vllm/spec_decode/batch_expansion.py
Line 135 in 4d91f3b
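A minimal sketch of the ordering convention described above (illustrative names only, not the actual vLLM identifiers): the expanded batch places all non-speculative sequence groups before the speculative ones, and any code that maps scorer output back to sequences must rely on that order.

```python
# Hypothetical sketch: combine the batch with non-speculative sequence
# groups first, speculative ones after. Downstream code that splits the
# scorer output must assume exactly this ordering.
def combine_batch(non_spec_seqs, spec_seqs):
    return non_spec_seqs + spec_seqs

combined = combine_batch(["ns0", "ns1"], ["s0", "s1", "s2"])
# Indices [0, len(non_spec_seqs)) are non-speculative; the rest are speculative.
assert combined[:2] == ["ns0", "ns1"]
assert combined[2:] == ["s0", "s1", "s2"]
```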
Below is a reference to the lines where the batch is padded with dummy sequences. These padding sequences must also be accounted for.
vllm-fork/vllm/worker/hpu_model_runner.py
Lines 1270 to 1275 in 4d91f3b
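A hypothetical sketch of the padding step (names are illustrative, not the actual vLLM identifiers): the HPU model runner pads the batch up to a bucketed size with dummy sequences, so any logic that splits or maps scorer output back to real sequences has to skip the padding entries.

```python
# Hypothetical sketch: pad the batch to a bucketed size with dummy
# sequences, and keep the real-sequence count so the padding can be
# stripped before results are mapped back to real sequences.
def pad_batch(seqs, bucket_size, dummy="<dummy>"):
    pad = bucket_size - len(seqs)
    return seqs + [dummy] * pad, len(seqs)

padded, real_count = pad_batch(["a", "b", "c"], 4)
assert padded == ["a", "b", "c", "<dummy>"]
# Only the first `real_count` entries correspond to real sequences.
assert padded[:real_count] == ["a", "b", "c"]
```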
Below is the error that is encountered without this fix.
![image](https://private-user-images.githubusercontent.com/62364327/412713203-f4f15d3b-288c-45fd-aa43-cc61ba451c7b.png?jwt=eyJhbGciOiJIUzI1NiIsInR5cCI6IkpXVCJ9.eyJpc3MiOiJnaXRodWIuY29tIiwiYXVkIjoicmF3LmdpdGh1YnVzZXJjb250ZW50LmNvbSIsImtleSI6ImtleTUiLCJleHAiOjE3Mzk0NDUzNDUsIm5iZiI6MTczOTQ0NTA0NSwicGF0aCI6Ii82MjM2NDMyNy80MTI3MTMyMDMtZjRmMTVkM2ItMjg4Yy00NWZkLWFhNDMtY2M2MWJhNDUxYzdiLnBuZz9YLUFtei1BbGdvcml0aG09QVdTNC1ITUFDLVNIQTI1NiZYLUFtei1DcmVkZW50aWFsPUFLSUFWQ09EWUxTQTUzUFFLNFpBJTJGMjAyNTAyMTMlMkZ1cy1lYXN0LTElMkZzMyUyRmF3czRfcmVxdWVzdCZYLUFtei1EYXRlPTIwMjUwMjEzVDExMTA0NVomWC1BbXotRXhwaXJlcz0zMDAmWC1BbXotU2lnbmF0dXJlPTQ4ZDE1NDY2NTcyNzNjN2IwNDUwNDlkNTYwYTQ2ODA3MTllMDY2ZTgyMDFlN2Y0ZjFhNmVmZWRmNGNiZmJiNzQmWC1BbXotU2lnbmVkSGVhZGVycz1ob3N0In0.Mod8Xt3LwL2-PQZJE5gT7-o3FQ8oBStzuEQKTSXJsJY)
With this fix, significantly higher throughput can be achieved and accuracy is unimpacted (examined BERT F1 accuracy; tested Llama-3.1-8B with n-gram speculative decoding).