Speed-ups not observed for Quantized models such as GPTQ, AWQ #2
Comments
Hello @NamburiSrinath, using a quantized model as the drafter in speculative decoding is a very interesting idea!

Speculative decoding can be quite sensitive to its parameters, especially when the drafter's inference speed is close to that of the target model. Did you compare the inference speeds of the target model and the drafter? This can be done in the CLI.

Since your acceptance rate is very high (0.988), increasing gamma might help improve the overall throughput: it would allow the drafter to generate more draft tokens per step, and the high acceptance rate would ensure most of them are kept.

PS: Could you share some information about the quantized drafter model (origin model, tokenizer) and your setup (I assume you have at least an A100)? Feel free to share any additional results. I'm very interested in seeing how this develops!
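If it helps, here is a rough sketch of how you could time both models outside the speculative loop (plain `transformers`, greedy decoding; the model names are placeholders for your actual target and drafter, this is not this repo's CLI):

```python
import time
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

def tokens_per_second(model_name, prompt, max_new_tokens=128):
    # Load the model on the GPU and time greedy generation end to end.
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModelForCausalLM.from_pretrained(
        model_name, torch_dtype=torch.float16, device_map="auto"
    )
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    torch.cuda.synchronize()
    start = time.perf_counter()
    out = model.generate(**inputs, max_new_tokens=max_new_tokens, do_sample=False)
    torch.cuda.synchronize()
    elapsed = time.perf_counter() - start
    new_tokens = out.shape[-1] - inputs["input_ids"].shape[-1]
    return new_tokens / elapsed

prompt = "Explain speculative decoding in one paragraph."
# Placeholder names: swap in your actual target and quantized drafter.
print("target :", tokens_per_second("meta-llama/Llama-2-7b-hf", prompt))
print("drafter:", tokens_per_second("NamburiSrinath/Llama2-7b-GPTQ-8b-32g", prompt))
```

If the drafter's tokens-per-second is not clearly higher than the target's, speculative decoding cannot give a speed-up no matter how high the acceptance rate is.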
Thanks @romsto! Let me compare the inference speeds of the target and drafter models. In the meantime, here's some additional info.
My take: I believe that with quantized models in particular, GPU utilization is odd and the GPU may not be used properly. In other words, I suspect some of the work is happening on the CPU, which causes the slowdown. Why? Because when I replaced the drafter with a pruned version (instead of a quantized one), I observed a decent speedup! A quick way to check this is sketched below.
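Something like this (just a sketch, not from this repo; the drafter is the GPTQ checkpoint I link below) would show whether any of the drafter's tensors end up on the CPU:

```python
from collections import Counter
from transformers import AutoModelForCausalLM

# Load the GPTQ drafter and list the devices its tensors actually live on.
drafter = AutoModelForCausalLM.from_pretrained(
    "NamburiSrinath/Llama2-7b-GPTQ-8b-32g", device_map="auto"
)
devices = Counter(
    str(t.device)
    for _, t in list(drafter.named_parameters()) + list(drafter.named_buffers())
)
print(devices)  # anything landing on 'cpu' here would explain a slowdown
```

Watching `nvidia-smi` while the drafter generates would be another quick check.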
@romsto - Here's the model, in case you want to work with the same drafter model I'm using: https://huggingface.co/NamburiSrinath/Llama2-7b-GPTQ-8b-32g
Hi @romsto, I experimented further and here are the new findings. For the compressed drafter, I used two versions: one pruned and one quantized. Ignore the response accuracies; below are the outputs from the terminal.

Pruned model
Quantized model
As you can see, throughput is much lower when we use the quantized model. As I mentioned earlier, I believe something is going on with GPU utilization when quantized models are used, and this might need your attention. One way to narrow it down is to profile a few drafter forward passes, as in the sketch below.
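A possible profiling sketch (assuming the drafter loads through `transformers`; the prompt is arbitrary):

```python
import torch
from torch.profiler import profile, ProfilerActivity
from transformers import AutoModelForCausalLM, AutoTokenizer

name = "NamburiSrinath/Llama2-7b-GPTQ-8b-32g"  # the quantized drafter
tok = AutoTokenizer.from_pretrained(name)
model = AutoModelForCausalLM.from_pretrained(name, device_map="auto")
inputs = tok("Hello, my name is", return_tensors="pt").to(model.device)

# Profile a handful of forward passes; if CPU self-time dominates, the
# quantized kernels are likely falling back to (or waiting on) the CPU.
with profile(activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA]) as prof:
    with torch.no_grad():
        for _ in range(5):
            model(**inputs)

print(prof.key_averages().table(sort_by="self_cuda_time_total", row_limit=15))
```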
Very interesting results, thanks for sharing.
Hi @romsto, just checking in to see whether you were able to come up with something.
Hello @NamburiSrinath, sorry for the delay, but I am currently unavailable. I'll try to get to this as soon as possible.
Hi @romsto,
Excellent repo, thank you :)
I am trying to use a quantized model (e.g., one produced by GPTQ) as the drafter model, but I am unable to observe any speed-up when following the README.
Below I attach code for reproducibility (it only works if you have a GPTQ-based model; let me know if you plan to experiment and I can push the model to HF).
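For reference, a rough stand-in for my setup looks like this (note this sketch uses Hugging Face's built-in assisted generation rather than this repo's API, and the model names are placeholders for my target and GPTQ drafter):

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

target_name = "meta-llama/Llama-2-7b-hf"               # placeholder target
drafter_name = "NamburiSrinath/Llama2-7b-GPTQ-8b-32g"  # GPTQ drafter

tokenizer = AutoTokenizer.from_pretrained(target_name)
target = AutoModelForCausalLM.from_pretrained(
    target_name, torch_dtype=torch.float16, device_map="auto"
)
drafter = AutoModelForCausalLM.from_pretrained(drafter_name, device_map="auto")

inputs = tokenizer("Tell me a short story.", return_tensors="pt").to(target.device)
# Assisted (speculative) generation: the drafter proposes tokens, the target verifies.
out = target.generate(**inputs, assistant_model=drafter, max_new_tokens=128)
print(tokenizer.decode(out[0], skip_special_tokens=True))
```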
And here's the terminal output:
I expected the quantized drafter to yield a higher throughput, but despite the high acceptance rate it does not. Do let me know if I am missing something here.