Replace tokenizer with processor #955
Conversation
👋 Hi! Thank you for contributing to llm-compressor. Please add the ready label when the PR is ready for review.
LGTM! Thanks for this.
I think we need to make it very clear:
- What a processor is vs a tokenizer
- If either/or can be provided and in what cases
""" | ||
Loads datasets for each flow based on data_args, stores a Dataset for each | ||
enabled flow in self.datasets | ||
|
||
:param tokenizer: tokenizer to use for dataset tokenization | ||
""" | ||
if self._data_args.dataset is None: | ||
self.tokenizer = self._model_args.tokenizer | ||
self.processor = self._model_args.processor |
Seems like we're keeping the tokenizer in the model_args as well? What if both are specified? Or only tokenizer?
See the newly added model args handling logic
```python
def initialize_processor_from_path(
    model_args: ModelArguments, model: PreTrainedModel, teacher: PreTrainedModel
) -> Processor:
    processor_src = model_args.processor
```
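For readers following along, the body of this helper might continue roughly as below. This is a hedged sketch, not the verbatim PR code: it assumes the processor source can be loaded with `AutoProcessor` and that a plain tokenizer is an acceptable fallback, since a tokenizer also satisfies the `Processor` union type discussed later in this thread.

```python
from transformers import AutoProcessor, AutoTokenizer


def load_processor_sketch(processor_src: str):
    """Hypothetical helper: prefer a full processor, fall back to a tokenizer."""
    try:
        # a full processor bundles the tokenizer with image/feature processors
        return AutoProcessor.from_pretrained(processor_src)
    except (OSError, ValueError):
        # text-only repos may only ship a tokenizer, which still counts as a Processor
        return AutoTokenizer.from_pretrained(processor_src)
```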
same, what if a tokenizer is provided?
See the newly added model args handling logic
@dsikka The current strategy is to treat all possible tokenizers as a subset of all possible processors, as type-defed here:

```python
Processor = Union[
    PreTrainedTokenizer, BaseImageProcessor, FeatureExtractionMixin, ProcessorMixin
]
```

We should continue to support the `tokenizer` argument, which is silently reassigned to `processor`:

```python
# silently assign tokenizer to processor
if model_args.tokenizer:
    if model_args.processor:
        raise ValueError("Cannot use both a tokenizer and processor")
    model_args.processor = model_args.tokenizer
    model_args.tokenizer = None
```
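A quick illustration of the resulting user-facing behavior (a sketch only: the model and dataset names are placeholders, and it assumes the `oneshot` entrypoint forwards the `tokenizer`/`processor` keywords into `ModelArguments` as handled above):

```python
from llmcompressor.transformers import oneshot

# Text-only flow: passing a tokenizer still works; it is silently
# reassigned to `processor` by the handling logic above.
oneshot(
    model="stub/text-model",             # placeholder model id
    dataset="ultrachat_200k",
    tokenizer="stub/text-model",
    recipe="recipe.yaml",
)

# Multimodal flow: pass a processor instead of a tokenizer.
oneshot(
    model="stub/vision-language-model",  # placeholder model id
    dataset="flickr30k",
    processor="stub/vision-language-model",
    recipe="recipe.yaml",
)

# Passing both raises: "Cannot use both a tokenizer and processor"
```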
I think this is fine. My two comments about clarity were specific to being clear towards users, either in the model_args or through the text_generation.py script.
I think this should be clear enough messaging without being annoying/verbose.
Oh sorry, missed the help text.
* remove sparseml utilities
* use in model_load
* remove use of RECIPE FILE NAME
* rename to RECIPE_FILE_NAME, avoid circular import
* remove qa ignore
* replace tokenizer with processor
* defer data collator changes

---------

Signed-off-by: Kyle Sayers <kylesayrs@gmail.com>
Co-authored-by: Dipika Sikka <dipikasikka1@gmail.com>
## Purpose ##
* Enable oneshot quantization of vision-language models

![VLM Banner](https://github.com/user-attachments/assets/0d748714-b524-44f4-b850-a721f35d5543)

[Llama_3 2-Vision Graphviz](https://github.com/user-attachments/assets/6b371ccc-f9f6-4bf2-b4cd-24ed75a3cad0)

## Related Issues ##
* Fixes #91
* Fixes #961
* Fixes #990

## Prerequisites ##
* neuralmagic/compressed-tensors#193
* #917
* #943
* #955
* #950
* #998
* #1014

## Changes ##

### VLM Support ###
* Add multimodal examples in `examples/multimodal_vision`
* Modify `custom_offload_device_map` to support models which are not `XForCausalLM`
* Add custom data collators for VLM models in `src/llmcompressor/transformers/utils/data_collator.py`

### GPTQModifier ###
* Implement hooks-based compression in `GPTQModifier`
  * This replaces layer-compressor, which made many assumptions about model architecture
  * This also enables finer-grained sequential compression such as [true_sequential](https://huggingface.co/docs/transformers/main_classes/quantization#transformers.GPTQConfig.true_sequential)
  * Functions previously implemented in `gptq_wrapper.py` are now implemented in `gptq_quantize.py`
* Implement `offload_hessians` parameter in `GPTQModifier`
* Implement data-pipelines-based calibration in `GPTQModifier`
  * First, an attempt will be made to trace the model and run the `sequential` pipeline
  * If that fails, assumptions will be made about the model architecture and an attempt will be made to run the `layer_sequential` pipeline
    * This ensures backwards compatibility with any previously supported models
  * If that fails, then the basic pipeline will be used, which is guaranteed to run but may require using `offload_hessians`
* Change hessian instability from a `ValueError` to a `_LinAlgError` so it can be ignored by the gptq pipeline fallback mechanism
* Add support for conv2d as indicated by [AutoGPTQ](https://github.com/AutoGPTQ/AutoGPTQ/blob/6689349625de973b9ee3016c28c11f32acf7f02c/auto_gptq/quantization/gptq.py#L45-L54)

### Data Pipelines ###
* Implement the basic skeletons of data pipelines, which are subject to change when data pipelines are pulled out of modifiers
* Basic Pipeline
  * Performs standard forward passes through the model with the provided dataloader
  * Used as a fallback, as well as in the future for basic calibration passes
* Layer Sequential Pipeline
  * Refactor of `LayerCompressor` as a straightforward data pipeline
  * Uses `IntermediatesCache` to handle activation offloading
* Sequential Pipeline
  * Utilizes graph tracing implemented by `torch.fx` to trace the graph in order to determine where sequential targets (layers) exist in the graph and what their inputs and outputs are
  * Implements a BFS algorithm to assign nodes to partitions
    * An ideal implementation consolidates partition indices to assign each node to the latest possible partition, delaying execution. The current implementation addresses the most common case (node.op == get_attr)
  * Each partition (`Subgraph`) is compiled as an executable python function with the proper inputs and outputs
  * Uses `IntermediatesCache` to handle activation offloading
* Implement `IntermediatesCache`, which automagically handles the offloading and onloading of activations from batches
  * This class is capable of offloading many non-standard activation types such as `Tuple`s and dataclasses such as `BaseModelOutputWithPast`
  * For convenience, the class also handles masking padding
  * The class is tested in `tests/llmcompressor/pipelines/test_cache.py`

### Tracing ###
* In order to support sequential quantization of the large variety of different multimodal model architectures, some model definitions have to be altered to support tracing
* If the calibration dataset is text only, most LLMs and VLMs are traceable without additional work. Multimodal calibration datasets are more likely to require additional work to make traceable
* For many VLMs (but not all), the vision tower is not traceable without significant work. However, this only affects sequential error propagation and (minimal?) increased memory usage, which leaves the door open for future support for quantizing modules in the vision tower
* Add traceable model definitions for llava, mistral, mllama, and glm
* All copyright licenses allow for alteration and redistribution; the line `# vllm-project: no copyright` was added in similar style to [text_generation.py](https://github.com/vllm-project/llm-compressor/blob/main/src/llmcompressor/transformers/finetune/text_generation.py#L18)

## Future Work / Follow-ups ##
* #1027
* #1032
* #1039
* #1030
* Create better data collators capable of handling larger batch sizes in order to support VLM fine tuning
* Better support prompt masking for multimodal processors in order to support VLM fine tuning

## Winogrande Evaluations ##

Model | Dataset | Scheme | Runtime | Winogrande
-- | -- | -- | -- | --
Llama-3-8B | ultrachat | W4A16 | 43m, 2xA4000 | 0.7545
Llama-3-70B | ultrachat | W4A16 | 303m, 1xH100 | 0.8216
Mixtral-8x7B | ultrachat | W4A16 | 317m, 1xA100 | 0.8200
openbmb/MiniCPM3-4B | ultrachat | W4A16 | 63m, 1xA100 | 0.6701
Qwen2-VL-2B-Instruct | ultrachat | W8A8 | 12m, 2xA4000 | 0.6188
Qwen2-VL-2B-Instruct | flickr | W8A8 | 24m, 2xA4000 | 0.6093
Llama-3.2-11B-Vision-Instruct | flickr | W8A8 | 75m, 1xA100 | 0.7837
Pixtral-12B-2409 | flickr | W8A8 | 52m, 1xA100 | 0.7924
llava-1.5-7b-hf | flickr | W8A8 | 15m, 1xH100 | 0.7214
Phi-3-vision-128k-instruct | flickr | W4A16 | 51m, 1xA100 | 0.7151

`lm_eval --model vllm --model_args pretrained="path/to/model",dtype=auto,max_model_len=4096,tensor_parallel_size=1,gpu_memory_utilization=0.8,enforce_eager=True,add_bos_token=True --tasks winogrande --num_fewshot 5 --batch_size 32`

`lm_eval --model vllm --model_args pretrained="path/to/model",dtype=bfloat16,max_model_len=4096,tensor_parallel_size=1,gpu_memory_utilization=0.8,enforce_eager=True,add_bos_token=True,max_num_seqs=1 --tasks winogrande --num_fewshot 5 --batch_size 1`

## MMMU Evaluations ##
Credit to @shubhra

Model | Dataset | Scheme | MMMU
-- | -- | -- | --
Llama-3.2-11B-Vision | N/A | Dense | 0.4144
Llama-3.2-11B-Vision | N/A | FP8-dynamic | 0.4300
Llama-3.2-11B-Vision | flickr | W4A16 | 0.4377
Llama-3.2-11B-Vision | flickr | W4A16-group | 0.4211

Model | Dataset | Scheme | MMMU
-- | -- | -- | --
Llama-3.2-90B-Vision | N/A | Dense | 0.5388
Llama-3.2-90B-Vision | N/A | FP8-dynamic | 0.5278
Llama-3.2-90B-Vision | flickr | W4A16 | 0.5111
Llama-3.2-90B-Vision | flickr | W4A16-group | 0.5477

Model | Dataset | Scheme | MMMU
-- | -- | -- | --
Pixtral-12B-2409 | N/A | Dense | 0.5022
Pixtral-12B-2409 | N/A | FP8-dynamic | 0.5322
Pixtral-12B-2409 | flickr | W4A16 | 0.4500
Pixtral-12B-2409 | flickr | W4A16-group | 0.4689

## Testing ##
* [Nightly](https://github.com/neuralmagic/llm-compressor-testing/actions/runs/12640439996)

---------

Signed-off-by: Kyle Sayers <kylesayrs@gmail.com>
Co-authored-by: Dipika Sikka <dipikasikka1@gmail.com>
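For orientation, below is a condensed sketch of the kind of multimodal oneshot run these changes enable. The model id, dataset name, recipe values, and ignore pattern are illustrative, not the exact contents of `examples/multimodal_vision`:

```python
from transformers import AutoProcessor, MllamaForConditionalGeneration

from llmcompressor.modifiers.quantization import GPTQModifier
from llmcompressor.transformers import oneshot

MODEL_ID = "meta-llama/Llama-3.2-11B-Vision-Instruct"  # illustrative model id

model = MllamaForConditionalGeneration.from_pretrained(MODEL_ID, torch_dtype="auto")
processor = AutoProcessor.from_pretrained(MODEL_ID)

# W4A16 GPTQ on the language model only; the vision tower is left untouched,
# matching the tracing notes above
recipe = GPTQModifier(
    targets="Linear",
    scheme="W4A16",
    ignore=["lm_head", "re:.*vision.*"],  # illustrative ignore pattern
)

oneshot(
    model=model,
    processor=processor,  # processor pathway introduced by #955
    dataset="flickr30k",
    recipe=recipe,
    max_seq_length=2048,
    num_calibration_samples=512,
    # the multimodal examples also pass a model-specific data collator from
    # src/llmcompressor/transformers/utils/data_collator.py
)
```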
## Purpose ##
* Generalize the `tokenizer` argument to a `processor`, since tokenizers are a subset of processors (prerequisite for vision-language model support)

## Prerequisites ##

## Postrequisites ##

## Changes ##
* Add a `processor` pathway argument, to which `tokenizer` is internally reassigned
* Add the `Processor` type in `src/llmcompressor/typing.py`
* Replace `tokenizer` with `processor` in `src/llmcompressor/transformers/finetune/data/base.py`, `src/llmcompressor/transformers/finetune/data/ultrachat_200k.py`, and `src/llmcompressor/transformers/finetune/session_mixin.py`
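For reference, a sketch of the type alias added in `src/llmcompressor/typing.py`, matching the definition quoted earlier in the review thread (the exact import locations are assumptions):

```python
# src/llmcompressor/typing.py (sketch)
from typing import Union

from transformers import (
    BaseImageProcessor,
    FeatureExtractionMixin,
    PreTrainedTokenizer,
    ProcessorMixin,
)

# any object capable of preprocessing calibration data; tokenizers are one subset
Processor = Union[
    PreTrainedTokenizer, BaseImageProcessor, FeatureExtractionMixin, ProcessorMixin
]
```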
## Testing ##