From 94ab451f09d11ad14325f0a39652fbcee56307d8 Mon Sep 17 00:00:00 2001 From: Kyle Sayers Date: Sat, 25 Jan 2025 04:51:14 +0000 Subject: [PATCH 1/9] squash Signed-off-by: Kyle Sayers --- README.md | 1 + examples/multimodal_vision/README.md | 31 ++++++++++++++++++++++++++++ 2 files changed, 32 insertions(+) create mode 100644 examples/multimodal_vision/README.md diff --git a/README.md b/README.md index fd7f2f3e3..9ba3caae3 100644 --- a/README.md +++ b/README.md @@ -39,6 +39,7 @@ Applying quantization with `llmcompressor`: * [Activation quantization to `fp8`](examples/quantization_w8a8_fp8) * [Weight only quantization to `int4`](examples/quantization_w4a16) * [Quantizing MoE LLMs](examples/quantizing_moe) +* [Quantizing Multimodal VLMs](examples/multimodal_vision) ### User Guides Deep dives into advanced usage of `llmcompressor`: diff --git a/examples/multimodal_vision/README.md b/examples/multimodal_vision/README.md new file mode 100644 index 000000000..e0eb93d94 --- /dev/null +++ b/examples/multimodal_vision/README.md @@ -0,0 +1,31 @@ +# Quantizing Multimodal Vision-Language Models # +This directory contains example scripts for quantizing a variety of vision-language models using the GPTQ W4A16 quantization scheme. + +## Using your own models ## + +```python3 +recipe = [ + GPTQModifier( + targets="Linear", + scheme="W4A16", + sequential_targets=["MistralDecoderLayer"], + ignore=["re:.*lm_head", "re:vision_tower.*", "re:multi_modal_projector.*"], + ), +] +``` + +### Sequential Targets ### + +### Ignore ### + +### Tracing Errors ### +Because the architectures of vision-language models is often times more complex than those of typical decoder-only text models, you may encounter `torch.fx.TraceError`s when attempting to quantize your model. For more information on `torch.fx.TraceError`s, why they occur, and how to resolve them, please see the [Model Tracing Guide](/src/llmcompressor/transformers/tracing/README.md). + +### Adding Smoothquant Mappings ### + +### Adding Data Collator ### +* TODO: create a default "multimodal" collator + +## Customizing Dataset and Quantization Scheme ## +. For a detailed walkthrough of customzing datasets and quantization for W4A16, see the +[Quantization Guide](/examples/quantization_w4a16/README.md). \ No newline at end of file From aae6e4e40adc196b6dd2ca65bb3fa2ac6c4f82d3 Mon Sep 17 00:00:00 2001 From: Kyle Sayers Date: Mon, 27 Jan 2025 19:20:09 +0000 Subject: [PATCH 2/9] finished Signed-off-by: Kyle Sayers --- examples/multimodal_vision/README.md | 51 +++++++++--- .../llama3_small_example.py | 80 +++++++++++++++++++ 2 files changed, 121 insertions(+), 10 deletions(-) create mode 100644 examples/quantization_w4a16/llama3_small_example.py diff --git a/examples/multimodal_vision/README.md b/examples/multimodal_vision/README.md index e0eb93d94..880797328 100644 --- a/examples/multimodal_vision/README.md +++ b/examples/multimodal_vision/README.md @@ -1,7 +1,35 @@ # Quantizing Multimodal Vision-Language Models # -This directory contains example scripts for quantizing a variety of vision-language models using the GPTQ W4A16 quantization scheme. -## Using your own models ## +

+ sample image from MS COCO dataset +

+ +
+

+<|system|>
+You are a helpful assistant.
+<|user|>
+Please describe the animal in this image
+<|assistant|>
+The animal in the image is a white kitten. It has a fluffy coat and is resting on a white keyboard. The kitten appears to be comfortable and relaxed, possibly enjoying the warmth of the keyboard.
+    
+
+ +This directory contains example scripts for quantizing a variety of vision-language models using the GPTQ quantization. Most examples do not demonstrate quantizing separate vision encoder parameters if they exist, as compressing these parameters offers little benefit with repsect to performance-accuracy tradeoff. + +## Compressing Your Own Model ## +To use your own multimodal modal, start with an existing example change the `model_id` to match your own model stub. +```python3 +model_id = "path/to/your/model" +model = AutoModelForCausalLM.from_pretrained( + model_id, + device_map="auto", + torch_dtype="auto", +) +``` + +## Customizing GPTQModifier Parameters ## +The GPTQModifier is the modifier responsible for performing quantization of the model weights. For more information on quantizing with different weight schemes, see the `quantization_` examples in the [examples folder](/examples/). ```python3 recipe = [ @@ -15,17 +43,20 @@ recipe = [ ``` ### Sequential Targets ### +Sequential targets are the modules which determine the granularity of error propagation and activation offloading when performing forward passes of the model. These are typically the "transformer blocks" of the model, also referred to as "layers" with llm-compressor. + +Choosing sequential targets with higher granularity (for example "Linear" instead of "LlamaDecoderLayer") will result in fewer hessians being allocated at the same time, decreasing the memory requirements for compression. This may also increase the recovered accuracy of the model, as compression error is propagated at a higher granularity. However, using higher granularity sequential targets may also increase compression time, as more time is spent offloading and onloading activations. ### Ignore ### +If your model is not traceable for your desired dataset, first consider adding any problematic modules to the ignore list. Doing this prevents the model tracer from tracing the internals of those modules, thereby avoid the untraceable operations. -### Tracing Errors ### -Because the architectures of vision-language models is often times more complex than those of typical decoder-only text models, you may encounter `torch.fx.TraceError`s when attempting to quantize your model. For more information on `torch.fx.TraceError`s, why they occur, and how to resolve them, please see the [Model Tracing Guide](/src/llmcompressor/transformers/tracing/README.md). +For example, in this model graph, the internals of the MllamaVisionModel are not traced (we don't see the individual MllamaVisionEncoder layers, ect.). However, we can no longer target the modules within the MllamaVisionModel such as the MllamaVisionEncoder as sequential targets. If any modules within the MllamaVisionModel are being compressed, their hessians will all be allocated at the same time, increasing peak memory usage. -### Adding Smoothquant Mappings ### +## Tracing Errors ## +Because the architectures of vision-language models is often times more complex than those of typical decoder-only text models, you may encounter `torch.fx.TraceError`s when attempting to quantize your model. For more information on `torch.fx.TraceError`s, why they occur, and how to resolve them, please see the [Model Tracing Guide](/src/llmcompressor/transformers/tracing/README.md). 
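As a concrete illustration of the `sequential_targets` and `ignore` parameters discussed above, a recipe for a hypothetical Llama-based vision-language model might look like the sketch below. The `LlamaDecoderLayer` class name and the `vision_tower`/`multi_modal_projector` regexes are placeholders; check your own model's `named_modules()` output for the actual module names.

```python3
from llmcompressor.modifiers.quantization import GPTQModifier

# Sketch only: sequential_targets selects the decoder blocks that are
# compressed one at a time, while ignore leaves the lm_head, vision encoder,
# and projector in their original precision (and untraced).
recipe = [
    GPTQModifier(
        targets="Linear",
        scheme="W4A16",
        sequential_targets=["LlamaDecoderLayer"],
        ignore=["re:.*lm_head", "re:vision_tower.*", "re:multi_modal_projector.*"],
    ),
]
```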
-### Adding Data Collator ### -* TODO: create a default "multimodal" collator +## Adding Your Own Smoothquant Mappings ## +For a guide on adding smoothquant mappings for your dataset, see the [SmoothQuant Guide](src/llmcompressor/modifiers/smoothquant/README.md). -## Customizing Dataset and Quantization Scheme ## -. For a detailed walkthrough of customzing datasets and quantization for W4A16, see the -[Quantization Guide](/examples/quantization_w4a16/README.md). \ No newline at end of file +## Adding Your Own Data Collator ## +Most examples utilize a generic `data_collator` which correctly correlates data for most multimodal datasets. If you find that your model needs custom data collation (as is the case with [pixtral](/examples/multimodal_vision/pixtral_example.py)), you can modify this function to reflect these model-specific requirements. \ No newline at end of file diff --git a/examples/quantization_w4a16/llama3_small_example.py b/examples/quantization_w4a16/llama3_small_example.py new file mode 100644 index 000000000..036feb5fa --- /dev/null +++ b/examples/quantization_w4a16/llama3_small_example.py @@ -0,0 +1,80 @@ +from datasets import load_dataset +from transformers import AutoModelForCausalLM, AutoTokenizer + +from llmcompressor.modifiers.quantization import GPTQModifier +from llmcompressor.transformers import oneshot + +# Select model and load it. +MODEL_ID = "meta-llama/Llama-3.2-1B-Instruct" + +model = AutoModelForCausalLM.from_pretrained( + MODEL_ID, + device_map="auto", + torch_dtype="auto", +) +tokenizer = AutoTokenizer.from_pretrained(MODEL_ID) + +# Select calibration dataset. +DATASET_ID = "HuggingFaceH4/ultrachat_200k" +DATASET_SPLIT = "train_sft" + +# Select number of samples. 512 samples is a good place to start. +# Increasing the number of samples can improve accuracy. +NUM_CALIBRATION_SAMPLES = 512 +MAX_SEQUENCE_LENGTH = 2048 + +# Load dataset and preprocess. +ds = load_dataset(DATASET_ID, split=DATASET_SPLIT) +ds = ds.shuffle(seed=42).select(range(NUM_CALIBRATION_SAMPLES)) + + +def preprocess(example): + return { + "text": tokenizer.apply_chat_template( + example["messages"], + tokenize=False, + ) + } + + +ds = ds.map(preprocess) + + +# Tokenize inputs. +def tokenize(sample): + return tokenizer( + sample["text"], + padding=False, + max_length=MAX_SEQUENCE_LENGTH, + truncation=True, + add_special_tokens=False, + ) + + +ds = ds.map(tokenize, remove_columns=ds.column_names) + +# Configure the quantization algorithm to run. +# * quantize the weights to 4 bit with GPTQ with a group size 128 +recipe = GPTQModifier(targets="Linear", scheme="W4A16", ignore=["lm_head"]) + +# Apply algorithms. +oneshot( + model=model, + dataset=ds, + recipe=recipe, + max_seq_length=MAX_SEQUENCE_LENGTH, + num_calibration_samples=NUM_CALIBRATION_SAMPLES, +) + +# Confirm generations of the quantized model look sane. +print("\n\n") +print("========== SAMPLE GENERATION ==============") +input_ids = tokenizer("Hello my name is", return_tensors="pt").input_ids.to("cuda") +output = model.generate(input_ids, max_new_tokens=100) +print(tokenizer.decode(output[0])) +print("==========================================\n\n") + +# Save to disk compressed. 
+SAVE_DIR = MODEL_ID.split("/")[1] + "-W4A16-G128" +model.save_pretrained(SAVE_DIR, save_compressed=True) +tokenizer.save_pretrained(SAVE_DIR) From cd7122dfe037d16fe3efb880cfdc2eada906702e Mon Sep 17 00:00:00 2001 From: Kyle Sayers Date: Mon, 27 Jan 2025 19:25:48 +0000 Subject: [PATCH 3/9] change style Signed-off-by: Kyle Sayers --- examples/multimodal_vision/README.md | 8 ++++---- 1 file changed, 4 insertions(+), 4 deletions(-) diff --git a/examples/multimodal_vision/README.md b/examples/multimodal_vision/README.md index 880797328..16c827669 100644 --- a/examples/multimodal_vision/README.md +++ b/examples/multimodal_vision/README.md @@ -3,17 +3,17 @@

sample image from MS COCO dataset

+ -
-

+``` 
 <|system|>
 You are a helpful assistant.
 <|user|>
 Please describe the animal in this image
 <|assistant|>
 The animal in the image is a white kitten. It has a fluffy coat and is resting on a white keyboard. The kitten appears to be comfortable and relaxed, possibly enjoying the warmth of the keyboard.
-    
-
+``` +
This directory contains example scripts for quantizing a variety of vision-language models using the GPTQ quantization. Most examples do not demonstrate quantizing separate vision encoder parameters if they exist, as compressing these parameters offers little benefit with repsect to performance-accuracy tradeoff. From dfad9c670b433326db34d55527303a64252fd48c Mon Sep 17 00:00:00 2001 From: Kyle Sayers Date: Mon, 27 Jan 2025 19:27:00 +0000 Subject: [PATCH 4/9] more style Signed-off-by: Kyle Sayers --- examples/multimodal_vision/README.md | 4 +++- 1 file changed, 3 insertions(+), 1 deletion(-) diff --git a/examples/multimodal_vision/README.md b/examples/multimodal_vision/README.md index 16c827669..1d2aa004c 100644 --- a/examples/multimodal_vision/README.md +++ b/examples/multimodal_vision/README.md @@ -11,7 +11,9 @@ You are a helpful assistant. <|user|> Please describe the animal in this image <|assistant|> -The animal in the image is a white kitten. It has a fluffy coat and is resting on a white keyboard. The kitten appears to be comfortable and relaxed, possibly enjoying the warmth of the keyboard. +The animal in the image is a white kitten. +It has a fluffy coat and is resting on a white keyboard. +The kitten appears to be comfortable and relaxed, possibly enjoying the warmth of the keyboard. ``` From 3c915b2a8f7c9bf3266e2b53d99e1e70fe1fee65 Mon Sep 17 00:00:00 2001 From: Kyle Sayers Date: Mon, 27 Jan 2025 19:27:22 +0000 Subject: [PATCH 5/9] more style Signed-off-by: Kyle Sayers --- examples/multimodal_vision/README.md | 2 ++ 1 file changed, 2 insertions(+) diff --git a/examples/multimodal_vision/README.md b/examples/multimodal_vision/README.md index 1d2aa004c..4ab1a33dd 100644 --- a/examples/multimodal_vision/README.md +++ b/examples/multimodal_vision/README.md @@ -8,8 +8,10 @@ ``` <|system|> You are a helpful assistant. + <|user|> Please describe the animal in this image + <|assistant|> The animal in the image is a white kitten. It has a fluffy coat and is resting on a white keyboard. From 9dc6b58fe76c34ea511658329439b822e55f8080 Mon Sep 17 00:00:00 2001 From: Kyle Sayers Date: Mon, 27 Jan 2025 19:28:47 +0000 Subject: [PATCH 6/9] fix link Signed-off-by: Kyle Sayers --- examples/multimodal_vision/README.md | 2 +- .../llama3_small_example.py | 80 ------------------- 2 files changed, 1 insertion(+), 81 deletions(-) delete mode 100644 examples/quantization_w4a16/llama3_small_example.py diff --git a/examples/multimodal_vision/README.md b/examples/multimodal_vision/README.md index 4ab1a33dd..47eceb116 100644 --- a/examples/multimodal_vision/README.md +++ b/examples/multimodal_vision/README.md @@ -60,7 +60,7 @@ For example, in this model graph, the internals of the MllamaVisionModel are not Because the architectures of vision-language models is often times more complex than those of typical decoder-only text models, you may encounter `torch.fx.TraceError`s when attempting to quantize your model. For more information on `torch.fx.TraceError`s, why they occur, and how to resolve them, please see the [Model Tracing Guide](/src/llmcompressor/transformers/tracing/README.md). ## Adding Your Own Smoothquant Mappings ## -For a guide on adding smoothquant mappings for your dataset, see the [SmoothQuant Guide](src/llmcompressor/modifiers/smoothquant/README.md). +For a guide on adding smoothquant mappings for your dataset, see the [SmoothQuant Guide](/src/llmcompressor/modifiers/smoothquant/README.md). 
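As a rough sketch of what custom mappings might look like, the snippet below pairs each set of projection layers with the preceding norm whose outputs feed them. The regex patterns are placeholders for your model's module names, and the exact mapping format is documented in the guide linked above.

```python3
from llmcompressor.modifiers.smoothquant import SmoothQuantModifier

# Sketch only: each entry pairs the layers whose weights absorb the scales
# (left) with the preceding module whose outputs are smoothed (right).
smoothquant = SmoothQuantModifier(
    smoothing_strength=0.8,
    mappings=[
        [["re:.*q_proj", "re:.*k_proj", "re:.*v_proj"], "re:.*input_layernorm"],
        [["re:.*gate_proj", "re:.*up_proj"], "re:.*post_attention_layernorm"],
    ],
)
```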
## Adding Your Own Data Collator ## Most examples utilize a generic `data_collator` which correctly correlates data for most multimodal datasets. If you find that your model needs custom data collation (as is the case with [pixtral](/examples/multimodal_vision/pixtral_example.py)), you can modify this function to reflect these model-specific requirements. \ No newline at end of file diff --git a/examples/quantization_w4a16/llama3_small_example.py b/examples/quantization_w4a16/llama3_small_example.py deleted file mode 100644 index 036feb5fa..000000000 --- a/examples/quantization_w4a16/llama3_small_example.py +++ /dev/null @@ -1,80 +0,0 @@ -from datasets import load_dataset -from transformers import AutoModelForCausalLM, AutoTokenizer - -from llmcompressor.modifiers.quantization import GPTQModifier -from llmcompressor.transformers import oneshot - -# Select model and load it. -MODEL_ID = "meta-llama/Llama-3.2-1B-Instruct" - -model = AutoModelForCausalLM.from_pretrained( - MODEL_ID, - device_map="auto", - torch_dtype="auto", -) -tokenizer = AutoTokenizer.from_pretrained(MODEL_ID) - -# Select calibration dataset. -DATASET_ID = "HuggingFaceH4/ultrachat_200k" -DATASET_SPLIT = "train_sft" - -# Select number of samples. 512 samples is a good place to start. -# Increasing the number of samples can improve accuracy. -NUM_CALIBRATION_SAMPLES = 512 -MAX_SEQUENCE_LENGTH = 2048 - -# Load dataset and preprocess. -ds = load_dataset(DATASET_ID, split=DATASET_SPLIT) -ds = ds.shuffle(seed=42).select(range(NUM_CALIBRATION_SAMPLES)) - - -def preprocess(example): - return { - "text": tokenizer.apply_chat_template( - example["messages"], - tokenize=False, - ) - } - - -ds = ds.map(preprocess) - - -# Tokenize inputs. -def tokenize(sample): - return tokenizer( - sample["text"], - padding=False, - max_length=MAX_SEQUENCE_LENGTH, - truncation=True, - add_special_tokens=False, - ) - - -ds = ds.map(tokenize, remove_columns=ds.column_names) - -# Configure the quantization algorithm to run. -# * quantize the weights to 4 bit with GPTQ with a group size 128 -recipe = GPTQModifier(targets="Linear", scheme="W4A16", ignore=["lm_head"]) - -# Apply algorithms. -oneshot( - model=model, - dataset=ds, - recipe=recipe, - max_seq_length=MAX_SEQUENCE_LENGTH, - num_calibration_samples=NUM_CALIBRATION_SAMPLES, -) - -# Confirm generations of the quantized model look sane. -print("\n\n") -print("========== SAMPLE GENERATION ==============") -input_ids = tokenizer("Hello my name is", return_tensors="pt").input_ids.to("cuda") -output = model.generate(input_ids, max_new_tokens=100) -print(tokenizer.decode(output[0])) -print("==========================================\n\n") - -# Save to disk compressed. 
-SAVE_DIR = MODEL_ID.split("/")[1] + "-W4A16-G128" -model.save_pretrained(SAVE_DIR, save_compressed=True) -tokenizer.save_pretrained(SAVE_DIR) From 92e9000fb710e6e24b64126c5b260641530cbf32 Mon Sep 17 00:00:00 2001 From: Kyle Sayers Date: Mon, 27 Jan 2025 19:29:43 +0000 Subject: [PATCH 7/9] fix links Signed-off-by: Kyle Sayers --- examples/multimodal_vision/README.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/examples/multimodal_vision/README.md b/examples/multimodal_vision/README.md index 47eceb116..b238ecc97 100644 --- a/examples/multimodal_vision/README.md +++ b/examples/multimodal_vision/README.md @@ -57,7 +57,7 @@ If your model is not traceable for your desired dataset, first consider adding a For example, in this model graph, the internals of the MllamaVisionModel are not traced (we don't see the individual MllamaVisionEncoder layers, ect.). However, we can no longer target the modules within the MllamaVisionModel such as the MllamaVisionEncoder as sequential targets. If any modules within the MllamaVisionModel are being compressed, their hessians will all be allocated at the same time, increasing peak memory usage. ## Tracing Errors ## -Because the architectures of vision-language models is often times more complex than those of typical decoder-only text models, you may encounter `torch.fx.TraceError`s when attempting to quantize your model. For more information on `torch.fx.TraceError`s, why they occur, and how to resolve them, please see the [Model Tracing Guide](/src/llmcompressor/transformers/tracing/README.md). +Because the architectures of vision-language models is often times more complex than those of typical decoder-only text models, you may encounter `torch.fx.TraceError`s when attempting to quantize your model. For more information on `torch.fx.TraceError`s, why they occur, and how to resolve them, please see the [Model Tracing Guide](/src/llmcompressor/transformers/tracing/GUIDE.md). ## Adding Your Own Smoothquant Mappings ## For a guide on adding smoothquant mappings for your dataset, see the [SmoothQuant Guide](/src/llmcompressor/modifiers/smoothquant/README.md). From 4a2ffc31c48867d34ce81764c55b2bde86a1391e Mon Sep 17 00:00:00 2001 From: Kyle Sayers Date: Mon, 27 Jan 2025 19:51:43 +0000 Subject: [PATCH 8/9] remove out of context paragraph Signed-off-by: Kyle Sayers --- examples/multimodal_vision/README.md | 2 -- 1 file changed, 2 deletions(-) diff --git a/examples/multimodal_vision/README.md b/examples/multimodal_vision/README.md index b238ecc97..9b0b4f982 100644 --- a/examples/multimodal_vision/README.md +++ b/examples/multimodal_vision/README.md @@ -54,8 +54,6 @@ Choosing sequential targets with higher granularity (for example "Linear" instea ### Ignore ### If your model is not traceable for your desired dataset, first consider adding any problematic modules to the ignore list. Doing this prevents the model tracer from tracing the internals of those modules, thereby avoid the untraceable operations. -For example, in this model graph, the internals of the MllamaVisionModel are not traced (we don't see the individual MllamaVisionEncoder layers, ect.). However, we can no longer target the modules within the MllamaVisionModel such as the MllamaVisionEncoder as sequential targets. If any modules within the MllamaVisionModel are being compressed, their hessians will all be allocated at the same time, increasing peak memory usage. 
- ## Tracing Errors ## Because the architectures of vision-language models is often times more complex than those of typical decoder-only text models, you may encounter `torch.fx.TraceError`s when attempting to quantize your model. For more information on `torch.fx.TraceError`s, why they occur, and how to resolve them, please see the [Model Tracing Guide](/src/llmcompressor/transformers/tracing/GUIDE.md). From be52b4d9ccbb9c8bcb902b6999b9c9b1d7ad4dfd Mon Sep 17 00:00:00 2001 From: Kyle Sayers Date: Mon, 27 Jan 2025 15:57:33 -0500 Subject: [PATCH 9/9] Update examples/multimodal_vision/README.md Co-authored-by: Brian Dellabetta --- examples/multimodal_vision/README.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/examples/multimodal_vision/README.md b/examples/multimodal_vision/README.md index 9b0b4f982..69f31ffb0 100644 --- a/examples/multimodal_vision/README.md +++ b/examples/multimodal_vision/README.md @@ -19,7 +19,7 @@ The kitten appears to be comfortable and relaxed, possibly enjoying the warmth o ``` -This directory contains example scripts for quantizing a variety of vision-language models using the GPTQ quantization. Most examples do not demonstrate quantizing separate vision encoder parameters if they exist, as compressing these parameters offers little benefit with repsect to performance-accuracy tradeoff. +This directory contains example scripts for quantizing a variety of vision-language models using the GPTQ quantization. Most examples do not demonstrate quantizing separate vision encoder parameters if they exist, as compressing these parameters offers little benefit with respect to performance-accuracy tradeoff. ## Compressing Your Own Model ## To use your own multimodal modal, start with an existing example change the `model_id` to match your own model stub.
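As a closing illustration of the model-specific data collation mentioned in the README above, a custom collator might look like the following sketch. The batch-size-1 assumption and the special-cased `pixel_values` dtype are hypothetical details; adjust them to match what your processor actually emits.

```python3
import torch

# Sketch only: collate a single preprocessed sample into tensors, casting
# pixel_values separately as an example of a model-specific requirement.
def data_collator(batch):
    assert len(batch) == 1
    return {
        key: torch.tensor(value)
        if key != "pixel_values"
        else torch.tensor(value, dtype=torch.bfloat16)
        for key, value in batch[0].items()
    }
```

In the example scripts, a collator along these lines is typically passed to `oneshot` together with the calibration dataset so that multimodal inputs reach the model in the shape it expects.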