Below is a reference to the line where speculative and non-speculative sequences are combined. The order is clearly non-speculative first, followed by speculative.
vllm-fork/vllm/spec_decode/batch_expansion.py
Line 135 in 4d91f3b
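A minimal sketch of the ordering convention described above (illustrative names only, not the actual vLLM identifiers): the expanded batch places all non-speculative sequence groups before the speculative ones, and any code that maps scorer output back to sequences must rely on that order.

```python
# Hypothetical sketch: combine the batch with non-speculative sequence
# groups first, speculative ones after. Downstream code that splits the
# scorer output must assume exactly this ordering.
def combine_batch(non_spec_seqs, spec_seqs):
    return non_spec_seqs + spec_seqs

combined = combine_batch(["ns0", "ns1"], ["s0", "s1", "s2"])
# Indices [0, len(non_spec_seqs)) are non-speculative; the rest are speculative.
assert combined[:2] == ["ns0", "ns1"]
assert combined[2:] == ["s0", "s1", "s2"]
```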
Below is a reference to the lines where the batch is padded with dummy sequences. These padding sequences must also be accounted for.
vllm-fork/vllm/worker/hpu_model_runner.py
Lines 1270 to 1275 in 4d91f3b
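A hypothetical sketch of the padding step (names are illustrative, not the actual vLLM identifiers): the HPU model runner pads the batch up to a bucketed size with dummy sequences, so any logic that splits or maps scorer output back to real sequences has to skip the padding entries.

```python
# Hypothetical sketch: pad the batch to a bucketed size with dummy
# sequences, and keep the real-sequence count so the padding can be
# stripped before results are mapped back to real sequences.
def pad_batch(seqs, bucket_size, dummy="<dummy>"):
    pad = bucket_size - len(seqs)
    return seqs + [dummy] * pad, len(seqs)

padded, real_count = pad_batch(["a", "b", "c"], 4)
assert padded == ["a", "b", "c", "<dummy>"]
# Only the first `real_count` entries correspond to real sequences.
assert padded[:real_count] == ["a", "b", "c"]
```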
Below is the error that is encountered without this fix.
![image](https://private-user-images.githubusercontent.com/62364327/412713203-f4f15d3b-288c-45fd-aa43-cc61ba451c7b.png?jwt=eyJhbGciOiJIUzI1NiIsInR5cCI6IkpXVCJ9.eyJpc3MiOiJnaXRodWIuY29tIiwiYXVkIjoicmF3LmdpdGh1YnVzZXJjb250ZW50LmNvbSIsImtleSI6ImtleTUiLCJleHAiOjE3Mzk0NDUzNDUsIm5iZiI6MTczOTQ0NTA0NSwicGF0aCI6Ii82MjM2NDMyNy80MTI3MTMyMDMtZjRmMTVkM2ItMjg4Yy00NWZkLWFhNDMtY2M2MWJhNDUxYzdiLnBuZz9YLUFtei1BbGdvcml0aG09QVdTNC1ITUFDLVNIQTI1NiZYLUFtei1DcmVkZW50aWFsPUFLSUFWQ09EWUxTQTUzUFFLNFpBJTJGMjAyNTAyMTMlMkZ1cy1lYXN0LTElMkZzMyUyRmF3czRfcmVxdWVzdCZYLUFtei1EYXRlPTIwMjUwMjEzVDExMTA0NVomWC1BbXotRXhwaXJlcz0zMDAmWC1BbXotU2lnbmF0dXJlPTQ4ZDE1NDY2NTcyNzNjN2IwNDUwNDlkNTYwYTQ2ODA3MTllMDY2ZTgyMDFlN2Y0ZjFhNmVmZWRmNGNiZmJiNzQmWC1BbXotU2lnbmVkSGVhZGVycz1ob3N0In0.Mod8Xt3LwL2-PQZJE5gT7-o3FQ8oBStzuEQKTSXJsJY)
With this fix, significantly higher throughput can be achieved and accuracy is unimpacted (examined BERT F1 accuracy; tested Llama-3.1-8B with n-gram speculative decoding).