Speed-ups not observed for Quantized models such as GPTQ, AWQ #2

Open
NamburiSrinath opened this issue Jan 22, 2025 · 7 comments

Hi @romsto,

Excellent repo, thank you :)

I am trying to use a quantized model (e.g., one produced by GPTQ) as the drafter model, but I am unable to observe any speed-up when following the README.

Below I attach code for reproducibility (it only runs if you have a GPTQ-based model, but let me know if you plan to experiment and I can push the model to HF).

import time
from termcolor import colored
from transformers import AutoTokenizer, AutoModelForCausalLM
import torch

target_model_name = 'meta-llama/Llama-2-7b-chat-hf'
target = AutoModelForCausalLM.from_pretrained(target_model_name, torch_dtype=torch.float16, device_map='cuda')

drafter_model_name = 'GPTQ_Llama_models/8b_32g' ## Assume I've a compressed GPTQ model already ready!
drafter = AutoModelForCausalLM.from_pretrained(drafter_model_name, device_map='cuda')

tokenizer = AutoTokenizer.from_pretrained(target_model_name)
tokenizer.pad_token_id = tokenizer.eos_token_id

prefix = "Translate to English: Je m'appelle Romain. N'hésitez pas à contribuer à mon projet !"

chat_templated = f"<s>[INST] {prefix} [/INST]\n"
input_ids = tokenizer(chat_templated, return_tensors="pt").input_ids
input_ids = input_ids[0].tolist() 

from sampling import speculative_generate, autoregressive_generate
# from sampling import speculative_generate_encoder_decoder, autoregressive_generate_encoder_decoder
from utils.logits_processor import NucleusProcessor

# Parameters
gen_len = 100       # Maximum number of tokens generated (can be exceeded when using speculative decoding)
gamma = 4          # Number of drafts generated by the drafter model at each step
logits_processor = NucleusProcessor(temperature=.6, top_p=.9) # Nucleus sampling with p=0.9 and T=0.6

# Generate text using the classic auto-regressive decoding (slow)
ar_start_time = time.time()
output_ids_ar = autoregressive_generate( # or autoregressive_generate_encoder_decoder for encoder-decoder models
                input_ids,
                target,
                logits_processor=logits_processor,
                max_gen_len=gen_len,
                eos_tokens_id=tokenizer.eos_token_id,
                pad_token_id=tokenizer.pad_token_id,
            )
ar_end_time = time.time()
output_ar = tokenizer.decode(output_ids_ar, skip_special_tokens=True)

print(colored("=========== Target AR ===========", "blue"))
print(colored("Out:", "blue"), output_ar)
base_throughput = len(output_ar) / (ar_end_time - ar_start_time)
print(colored(f"Throughput: {base_throughput:.1f} tokens/s", "blue"))
print(colored("=========== Target AR ===========", "blue"))

# Generate text using the speculative decoding (faster)
spec_start_time = time.time()
output_ids_sd, alpha = speculative_generate( # or speculative_generate_encoder_decoder for encoder-decoder models
                input_ids,
                drafter,
                target,
                logits_processor=logits_processor,
                gamma=gamma,
                max_gen_len=gen_len,
                eos_tokens_id=tokenizer.eos_token_id,
                pad_token_id=tokenizer.pad_token_id,
            )
spec_end_time = time.time()
output_sd = tokenizer.decode(output_ids_sd, skip_special_tokens=True)

print(colored("========== Speculative ==========", "green"))
print(colored("Out:", "green"), output_sd)
print(colored(f"Acceptance rate: {alpha:.3f}", "green")) # Number of drafts accepted by the target model divided by the number of drafts generated
spec_throughput = len(output_sd) / (spec_end_time - spec_start_time)
print(colored(f"Throughput: {spec_throughput:.1f} tokens/s", "green"))
print(colored("========== Speculative ==========", "green"))
print(colored(f"Throughput increase: {((spec_throughput / base_throughput)) * 100:.1f}%", "magenta")) 

And here's the terminal output:

=========== Target AR ===========
Out: 
"My name is Romain. Don't hesitate to contribute to my project!"

In English, "m'appelle" is pronounced "my name is" and "n'hésitez pas" is pronounced "don't hesitate". So, the full translation of the sentence is:

"My name is Romain. Don't hesitate to contribute to my project!"
Throughput: 148.0 tokens/s
=========== Target AR ===========
========== Speculative ==========
Out: 
"My name is Romain. Don't hesitate to contribute to my project!"

In English, "m'appelle" is the verb "to call oneself" or "to be called", and "Romain" is the name of the person speaking. "N'hésitez pas" is a polite way of saying "don't hesitate", and "contribuer à mon projet" means "to contribute to my project". So
Acceptance rate: 0.988
Throughput: 35.7 tokens/s
========== Speculative ==========
Throughput increase: 24.1%

I expected that speculative decoding with the quantized drafter would give higher throughput, yet despite the high acceptance rate it is actually slower. Do let me know if I am missing something here.

NamburiSrinath changed the title from "Speed-ups for Quantized models such as GPTQ, AWQ" to "Speed-ups not observed for Quantized models such as GPTQ, AWQ" on Jan 22, 2025
romsto (Owner) commented Jan 22, 2025

Hello @NamburiSrinath,

This is a very interesting idea to use a quantized model as the drafter in speculative decoding!

Speculative Decoding can be quite sensitive to parameters, especially when the drafter’s inference speed is close to that of the target model. Did you compare the inference speeds of the target model and the drafter? This can be done in the CLI using the /drafter command before generating. If their speeds are too similar, the potential for speed-up is reduced.
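
If it helps, outside of the CLI you could also time the two models side by side with something like the following. This is only a rough sketch that reuses the variables (target, drafter, tokenizer, input_ids, logits_processor, gen_len) and the autoregressive_generate function from your script above:

import time
from sampling import autoregressive_generate

# Time plain autoregressive generation for each model separately; if the
# drafter is barely faster than the target, speculative decoding has
# little room to provide a speed-up.
for name, model in [("target", target), ("drafter", drafter)]:
    start = time.time()
    out_ids = autoregressive_generate(
        input_ids,
        model,
        logits_processor=logits_processor,
        max_gen_len=gen_len,
        eos_tokens_id=tokenizer.eos_token_id,
        pad_token_id=tokenizer.pad_token_id,
    )
    elapsed = time.time() - start
    print(f"{name}: {len(out_ids) / elapsed:.1f} tokens/s")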

Since your acceptance rate is very high (0.988), increasing gamma might help improve the overall throughput. This would allow the drafter to generate more drafts, and the high acceptance rate would ensure most of them are utilized.
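
For intuition (roughly following the analysis in the original speculative decoding paper): with acceptance rate $\alpha$, $\gamma$ drafted tokens per step, and a drafter that costs a fraction $c$ of a target forward pass, each verification step produces about $\frac{1 - \alpha^{\gamma+1}}{1 - \alpha}$ tokens at a cost of roughly $c\gamma + 1$ target-equivalent passes, so the expected speed-up is on the order of

$$\frac{1 - \alpha^{\gamma + 1}}{(1 - \alpha)(c\gamma + 1)}.$$

With $\alpha \approx 0.99$ a larger $\gamma$ helps a lot, but only while $c$ stays well below 1; if the quantized drafter runs about as slowly as the target ($c \approx 1$), this ratio drops below 1 and speculative decoding ends up slower than plain decoding.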

PS: Could you provide some information about the quantized drafter model (original model, tokenizer) and your setup (I assume you might have at least an A100)?

Feel free to share any additional results. I’m very interested in seeing how this develops!

@NamburiSrinath (Author)

Thanks @romsto, let me compare the inference speeds of the target and drafter models. In the meantime, here's additional info:

  1. The quantized model is the same Llama-2-7b-chat-hf, just quantized using GPTQ or AWQ. As such, the tokenizer is the same as Llama-2's.
  2. If you specifically want the drafter model I used for debugging, do let me know and I can push the model to HF.

My take - I believe that with quantized models in particular, GPU utilization is somewhat strange and the GPU might not be used properly. That is, I suspect some of the work is running on the CPU, which causes the slow-down.

Why? Because when I replaced the drafter model with a pruned version (instead of the quantized one), I was able to observe a decent speed-up!
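
One sanity check I plan to run (a minimal sketch, reusing the drafter variable from the script above) is to confirm that every parameter and buffer of the quantized model actually lives on the GPU:

# Collect the devices of all parameters and buffers of the drafter; any
# CPU tensor here would force host<->device transfers during generation.
devices = {p.device for p in drafter.parameters()}
devices |= {b.device for b in drafter.buffers()}
print("Drafter tensors live on:", devices)
if any(d.type != "cuda" for d in devices):
    print("Warning: some drafter tensors are not on the GPU.")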

@NamburiSrinath (Author)

@romsto - Here's the model in case you want to work with the same drafter model as I do: https://huggingface.co/NamburiSrinath/Llama2-7b-GPTQ-8b-32g

@NamburiSrinath (Author)

Hi @romsto, I experimented and here are the new findings.

For the compressed model, I used two versions -- one pruned and one quantized. Ignore the response accuracies; below are the outputs from the terminal.

Pruned model

Target model: meta-llama/Llama-2-7b-chat-hf
Drafter model: Pruned model
Loading models...
> /drafter
Drafter generation: True
> What are the states and their respective capitals in USA?
========== Speculative ==========
Out:  Sure! Here are the 50 states of the United States and their respective capitals:
xxx

I hope that helps! Let me know if you have any other questions.
Acceptance rate: 0.941
Throughput: 108.0 tokens/s
========== Speculative ==========
========== Ngram Assisted ==========
Out:  Sure! Here are the 50 states of the United States and their respective capitals:
xxx

I hope that helps! Let me know if you have any other questions.
Acceptance rate: 0.013
Throughput: 117.0 tokens/s
========== Ngram Assisted ==========
Throughput increase: 92.3%
=========== Target AR ===========
Out:  Sure! Here are the 50 states of the United States and their respective capitals:
xxx

I hope that helps! Let me know if you have any other questions.
Throughput: 118.3 tokens/s
=========== Target AR ===========
Throughput increase: 91.3%
========== Drafter AR ==========
Out:  The 50 states of the United States of America are:
xxx

I hope this helps! Let me know if you have any questions.
Throughput: 110.5 tokens/s
========== Drafter AR ==========

Quantized model

Target model: meta-llama/Llama-2-7b-chat-hf
Drafter model: quantized_model
Loading models...
/usr/local/lib/python3.10/dist-packages/auto_gptq/nn_modules/triton_utils/kernels.py:411: FutureWarning: `torch.cuda.amp.custom_fwd(args...)` is deprecated. Please use `torch.amp.custom_fwd(args..., device_type='cuda')` instead.
  def forward(ctx, input, qweight, scales, qzeros, g_idx, bits, maxq):
/usr/local/lib/python3.10/dist-packages/auto_gptq/nn_modules/triton_utils/kernels.py:419: FutureWarning: `torch.cuda.amp.custom_bwd(args...)` is deprecated. Please use `torch.amp.custom_bwd(args..., device_type='cuda')` instead.
  def backward(ctx, grad_output):
/usr/local/lib/python3.10/dist-packages/auto_gptq/nn_modules/triton_utils/kernels.py:461: FutureWarning: `torch.cuda.amp.custom_fwd(args...)` is deprecated. Please use `torch.amp.custom_fwd(args..., device_type='cuda')` instead.
  @custom_fwd(cast_inputs=torch.float16)
CUDA extension not installed.
CUDA extension not installed.
/usr/local/lib/python3.10/dist-packages/transformers/modeling_utils.py:4674: FutureWarning: `_is_quantized_training_enabled` is going to be deprecated in transformers 4.39.0. Please use `model.hf_quantizer.is_trainable` instead
  warnings.warn(
> /drafter
Drafter generation: True
> What are the states and their respective capitals in USA?
========== Speculative ==========
Out:  Sure! Here are the 50 states of the United States and their respective capitals:
xxx

I hope that helps! Let me know if you have any other questions.
Acceptance rate: 1.000
Throughput: 27.4 tokens/s
========== Speculative ==========
========== Ngram Assisted ==========
Out:  Sure! Here are the 50 states of the United States and their respective capitals:
xxx

I hope that helps! Let me know if you have any other questions.
Acceptance rate: 0.013
Throughput: 33.8 tokens/s
========== Ngram Assisted ==========
Throughput increase: 81.1%
=========== Target AR ===========
Out:  Sure! Here are the 50 states of the United States and their respective capitals:
xxx

I hope that helps! Let me know if you have any other questions.
Throughput: 33.7 tokens/s
=========== Target AR ===========
Throughput increase: 81.2%
========== Drafter AR ==========
Out:  Sure! Here are the 50 states of the United States and their respective capitals:
xxx

I hope that helps! Let me know if you have any other questions.
Throughput: 33.7 tokens/s
========== Drafter AR ==========

As you can see, the throughputs are much lower when we use the quantized model. As I mentioned earlier, I believe there's something going on with GPU utilization when using quantized models, and this might need your attention.
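
One more observation: the quantized run above prints "CUDA extension not installed." from auto_gptq, which I suspect means the quantized layers fall back to a slower kernel path. A rough way to check (a sketch, reusing the drafter variable from my script) is to print the module classes inside the model, since the QuantLinear class that auto-gptq picks usually hints at which kernel backend (CUDA, Triton, or a Python fallback) is in use:

# List the distinct module classes inside the drafter; for a GPTQ model
# the qlinear module path hints at which kernel backend was selected.
layer_types = sorted({f"{type(m).__module__}.{type(m).__name__}" for m in drafter.modules()})
for t in layer_types:
    print(t)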

romsto (Owner) commented Jan 30, 2025

Very interesting results, thanks for sharing.
Indeed, the GPTQ quantization seems to have an impact on the whole process, even when the quantized model is not directly used.
I will try to investigate this phenomenon on my side.
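
One thing I plan to try (a sketch, reusing the variables and autoregressive_generate from your first script, not tested yet) is to profile a short generation with the quantized drafter to see whether the time is actually spent in CUDA kernels or in CPU-side ops:

from torch.profiler import profile, ProfilerActivity

# Profile a short drafter-only generation; a large share of CPU time with
# little CUDA time would support the CPU-fallback theory.
with profile(activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA]) as prof:
    _ = autoregressive_generate(
        input_ids,
        drafter,
        logits_processor=logits_processor,
        max_gen_len=20,
        eos_tokens_id=tokenizer.eos_token_id,
        pad_token_id=tokenizer.pad_token_id,
    )
print(prof.key_averages().table(sort_by="self_cuda_time_total", row_limit=15))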

@NamburiSrinath (Author)

Hi @romsto,

Just checking in to see whether you were able to come up with something.

romsto (Owner) commented Feb 14, 2025

Hello @NamburiSrinath, sorry for the delay, but I am currently not available. I'll try to get to this ASAP.
