Replace tokenizer with processor #955
Conversation
👋 Hi! Thank you for contributing to llm-compressor. Please add the ready label when the PR is ready for review.
LGTM! Thanks for this.
I think we need to make it very clear:
- What a processor is vs a tokenizer
- If either/or can be provided and in what cases
""" | ||
Loads datasets for each flow based on data_args, stores a Dataset for each | ||
enabled flow in self.datasets | ||
|
||
:param tokenizer: tokenizer to use for dataset tokenization | ||
""" | ||
if self._data_args.dataset is None: | ||
self.tokenizer = self._model_args.tokenizer | ||
self.processor = self._model_args.processor |
Seems like we're keeping the tokenizer in the model_args as well? What if both are specified? Or only tokenizer?
See the newly added model args handling logic
```python
def initialize_processor_from_path(
    model_args: ModelArguments, model: PreTrainedModel, teacher: PreTrainedModel
) -> Processor:
    processor_src = model_args.processor
```
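For readers following along, the body of this helper might continue roughly as below. This is a hedged sketch, not the verbatim PR code: it assumes the processor source can be loaded with `AutoProcessor` and that a plain tokenizer is an acceptable fallback, since a tokenizer also satisfies the `Processor` union type discussed later in this thread.

```python
from transformers import AutoProcessor, AutoTokenizer


def load_processor_sketch(processor_src: str):
    """Hypothetical helper: prefer a full processor, fall back to a tokenizer."""
    try:
        # a full processor bundles the tokenizer with image/feature processors
        return AutoProcessor.from_pretrained(processor_src)
    except (OSError, ValueError):
        # text-only repos may only ship a tokenizer, which still counts as a Processor
        return AutoTokenizer.from_pretrained(processor_src)
```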
same, what if a tokenizer is provided?
See the newly added model args handling logic
@dsikka The current strategy is to treat all possible tokenizers as a subset of all possible processors, as type-defed here:

```python
Processor = Union[
    PreTrainedTokenizer, BaseImageProcessor, FeatureExtractionMixin, ProcessorMixin
]
```

We should continue to support the `tokenizer` argument, which is silently reassigned to `processor`:

```python
# silently assign tokenizer to processor
if model_args.tokenizer:
    if model_args.processor:
        raise ValueError("Cannot use both a tokenizer and processor")
    model_args.processor = model_args.tokenizer
    model_args.tokenizer = None
```
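A quick illustration of the resulting user-facing behavior (a sketch only: the model and dataset names are placeholders, and it assumes the `oneshot` entrypoint forwards the `tokenizer`/`processor` keywords into `ModelArguments` as handled above):

```python
from llmcompressor.transformers import oneshot

# Text-only flow: passing a tokenizer still works; it is silently
# reassigned to `processor` by the handling logic above.
oneshot(
    model="stub/text-model",             # placeholder model id
    dataset="ultrachat_200k",
    tokenizer="stub/text-model",
    recipe="recipe.yaml",
)

# Multimodal flow: pass a processor instead of a tokenizer.
oneshot(
    model="stub/vision-language-model",  # placeholder model id
    dataset="flickr30k",
    processor="stub/vision-language-model",
    recipe="recipe.yaml",
)

# Passing both raises: "Cannot use both a tokenizer and processor"
```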
I think this is fine. My two comments about clarity were specific to being clear towards users, either in the model_args or through the text_generation.py script.
I think this should be clear enough messaging without being annoying/verbose.
Oh sorry, missed the help text.
* remove sparseml utilities
* use in model_load
* remove use of RECIPE FILE NAME
* rename to RECIPE_FILE_NAME, avoid circular import
* remove qa ignore
* replace tokenizer with processor
* defer data collator changes

---------

Signed-off-by: Kyle Sayers <kylesayrs@gmail.com>
Co-authored-by: Dipika Sikka <dipikasikka1@gmail.com>
## Purpose ##
* Enable oneshot quantization of vision-language models

![VLM Banner](https://github.com/user-attachments/assets/0d748714-b524-44f4-b850-a721f35d5543)

[Llama_3 2-Vision Graphviz](https://github.com/user-attachments/assets/6b371ccc-f9f6-4bf2-b4cd-24ed75a3cad0)

## Related Issues ##
* Fixes #91
* Fixes #961
* Fixes #990

## Prerequisites ##
* neuralmagic/compressed-tensors#193
* #917
* #943
* #955
* #950
* #998
* #1014

## Changes ##

### VLM Support ###
* Add multimodal examples in `examples/multimodal_vision`
* Modify `custom_offload_device_map` to support models which are not `XForCausalLM`
* Add custom data collators for VLM models in `src/llmcompressor/transformers/utils/data_collator.py`

### GPTQModifier ###
* Implement hooks-based compression in `GPTQModifier`
  * This replaces layer-compressor, which made many assumptions about model architecture
  * This also enables finer-grained sequential compression such as [true_sequential](https://huggingface.co/docs/transformers/main_classes/quantization#transformers.GPTQConfig.true_sequential)
  * Functions previously implemented in `gptq_wrapper.py` are now implemented in `gptq_quantize.py`
* Implement `offload_hessians` parameter in `GPTQModifier`
* Implement data-pipelines-based calibration in `GPTQModifier`
  * First, an attempt will be made to trace the model and run the `sequential` pipeline
  * If that fails, assumptions will be made about the model architecture and an attempt will be made to run the `layer_sequential` pipeline
    * This ensures backwards compatibility with any previously supported models
  * If that fails, then the basic pipeline will be used, which is guaranteed to run but may require using `offload_hessians`
* Change hessian instability from a `ValueError` to a `_LinAlgError` so it can be ignored by the gptq pipeline fallback mechanism
* Add support for conv2d as indicated by [AutoGPTQ](https://github.com/AutoGPTQ/AutoGPTQ/blob/6689349625de973b9ee3016c28c11f32acf7f02c/auto_gptq/quantization/gptq.py#L45-L54)

### Data Pipelines ###
* Implement the basic skeletons of data pipelines, which are subject to change when data pipelines are pulled out of modifiers
* Basic Pipeline
  * Performs standard forward passes through the model with the provided dataloader
  * Used as a fallback, as well as in the future for basic calibration passes
* Layer Sequential Pipeline
  * Refactor of `LayerCompressor` as a straightforward data pipeline
  * Uses `IntermediatesCache` to handle activation offloading
* Sequential Pipeline
  * Utilizes graph tracing implemented by `torch.fx` to trace the graph in order to determine where sequential targets (layers) exist in the graph and what their inputs and outputs are
  * Implements a BFS algorithm to assign nodes to partitions
    * An ideal implementation consolidates partition indices to assign each node to the latest possible partition, delaying execution. The current implementation addresses the most common case (node.op == get_attr)
  * Each partition (`Subgraph`) is compiled as an executable python function with the proper inputs and outputs
  * Uses `IntermediatesCache` to handle activation offloading
* Implement `IntermediatesCache`, which automagically handles the offloading and onloading of activations from batches
  * This class is capable of offloading many non-standard activation types such as `Tuple`s and dataclasses such as `BaseModelOutputWithPast`
  * For convenience, the class also handles masking padding
  * The class is tested in `tests/llmcompressor/pipelines/test_cache.py`

### Tracing ###
* In order to support sequential quantization of the large variety of different multimodal model architectures, some model definitions have to be altered to support tracing
* If the calibration dataset is text only, most LLMs and VLMs are traceable without additional work. Multimodal calibration datasets are more likely to require additional work to make traceable
* For many VLMs (but not all), the vision tower is not traceable without significant work. However, this only affects sequential error propagation and (minimal?) increased memory usage, which leaves the door open for future support for quantizing modules in the vision tower
* Add traceable model definitions for llava, mistral, mllama, and glm
* All copyright licenses allow for alteration and redistribution; the line `# vllm-project: no copyright` was added in similar style to [text_generation.py](https://github.com/vllm-project/llm-compressor/blob/main/src/llmcompressor/transformers/finetune/text_generation.py#L18)

## Future Work / Follow-ups ##
* #1027
* #1032
* #1039
* #1030
* Create better data collators capable of handling larger batch sizes in order to support VLM fine tuning
* Better support prompt masking for multimodal processors in order to support VLM fine tuning

## Winogrande Evaluations ##

Model | Dataset | Scheme | Runtime | Winogrande
-- | -- | -- | -- | --
Llama-3-8B | ultrachat | W4A16 | 43m, 2xA4000 | 0.7545
Llama-3-70B | ultrachat | W4A16 | 303m, 1xH100 | 0.8216
Mixtral-8x7B | ultrachat | W4A16 | 317m, 1xA100 | 0.8200
openbmb/MiniCPM3-4B | ultrachat | W4A16 | 63m, 1xA100 | 0.6701
Qwen2-VL-2B-Instruct | ultrachat | W8A8 | 12m, 2xA4000 | 0.6188
Qwen2-VL-2B-Instruct | flickr | W8A8 | 24m, 2xA4000 | 0.6093
Llama-3.2-11B-Vision-Instruct | flickr | W8A8 | 75m, 1xA100 | 0.7837
Pixtral-12B-2409 | flickr | W8A8 | 52m, 1xA100 | 0.7924
llava-1.5-7b-hf | flickr | W8A8 | 15m, 1xH100 | 0.7214
Phi-3-vision-128k-instruct | flickr | W4A16 | 51m, 1xA100 | 0.7151

`lm_eval --model vllm --model_args pretrained="path/to/model",dtype=auto,max_model_len=4096,tensor_parallel_size=1,gpu_memory_utilization=0.8,enforce_eager=True,add_bos_token=True --tasks winogrande --num_fewshot 5 --batch_size 32`

`lm_eval --model vllm --model_args pretrained="path/to/model",dtype=bfloat16,max_model_len=4096,tensor_parallel_size=1,gpu_memory_utilization=0.8,enforce_eager=True,add_bos_token=True,max_num_seqs=1 --tasks winogrande --num_fewshot 5 --batch_size 1`

## MMMU Evaluations ##
Credit to @shubhra

Model | Dataset | Scheme | MMMU
-- | -- | -- | --
Llama-3.2-11B-Vision | N/A | Dense | 0.4144
Llama-3.2-11B-Vision | N/A | FP8-dynamic | 0.4300
Llama-3.2-11B-Vision | flickr | W4A16 | 0.4377
Llama-3.2-11B-Vision | flickr | W4A16-group | 0.4211

Model | Dataset | Scheme | MMMU
-- | -- | -- | --
Llama-3.2-90B-Vision | N/A | Dense | 0.5388
Llama-3.2-90B-Vision | N/A | FP8-dynamic | 0.5278
Llama-3.2-90B-Vision | flickr | W4A16 | 0.5111
Llama-3.2-90B-Vision | flickr | W4A16-group | 0.5477

Model | Dataset | Scheme | MMMU
-- | -- | -- | --
Pixtral-12B-2409 | N/A | Dense | 0.5022
Pixtral-12B-2409 | N/A | FP8-dynamic | 0.5322
Pixtral-12B-2409 | flickr | W4A16 | 0.4500
Pixtral-12B-2409 | flickr | W4A16-group | 0.4689

## Testing ##
* [Nightly](https://github.com/neuralmagic/llm-compressor-testing/actions/runs/12640439996)

---------

Signed-off-by: Kyle Sayers <kylesayrs@gmail.com>
Co-authored-by: Dipika Sikka <dipikasikka1@gmail.com>
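For orientation, below is a condensed sketch of the kind of multimodal oneshot run these changes enable. The model id, dataset name, recipe values, and ignore pattern are illustrative, not the exact contents of `examples/multimodal_vision`:

```python
from transformers import AutoProcessor, MllamaForConditionalGeneration

from llmcompressor.modifiers.quantization import GPTQModifier
from llmcompressor.transformers import oneshot

MODEL_ID = "meta-llama/Llama-3.2-11B-Vision-Instruct"  # illustrative model id

model = MllamaForConditionalGeneration.from_pretrained(MODEL_ID, torch_dtype="auto")
processor = AutoProcessor.from_pretrained(MODEL_ID)

# W4A16 GPTQ on the language model only; the vision tower is left untouched,
# matching the tracing notes above
recipe = GPTQModifier(
    targets="Linear",
    scheme="W4A16",
    ignore=["lm_head", "re:.*vision.*"],  # illustrative ignore pattern
)

oneshot(
    model=model,
    processor=processor,  # processor pathway introduced by #955
    dataset="flickr30k",
    recipe=recipe,
    max_seq_length=2048,
    num_calibration_samples=512,
    # the multimodal examples also pass a model-specific data collator from
    # src/llmcompressor/transformers/utils/data_collator.py
)
```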
## Purpose ##
* Generalize the `tokenizer` argument to a `processor`, since tokenizers are a subset of processors (prerequisite for vision-language model support)

## Prerequisites ##

## Postrequisites ##

## Changes ##
* Add a `processor` pathway argument, to which `tokenizer` is internally reassigned
* Add the `Processor` type in `src/llmcompressor/typing.py`
* Replace `tokenizer` with `processor` in `src/llmcompressor/transformers/finetune/data/base.py`, `src/llmcompressor/transformers/finetune/data/ultrachat_200k.py`, and `src/llmcompressor/transformers/finetune/session_mixin.py`
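For reference, a sketch of the type alias added in `src/llmcompressor/typing.py`, matching the definition quoted earlier in the review thread (the exact import locations are assumptions):

```python
# src/llmcompressor/typing.py (sketch)
from typing import Union

from transformers import (
    BaseImageProcessor,
    FeatureExtractionMixin,
    PreTrainedTokenizer,
    ProcessorMixin,
)

# any object capable of preprocessing calibration data; tokenizers are one subset
Processor = Union[
    PreTrainedTokenizer, BaseImageProcessor, FeatureExtractionMixin, ProcessorMixin
]
```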
## Testing ##