
Accelerate Utilities #193

Merged: 32 commits merged into main on Dec 20, 2024

Conversation

kylesayrs
Contributor

@kylesayrs kylesayrs commented Oct 21, 2024

Purpose

  • Implement offloading utility functions which greatly simplify/clarify offloading-related code in llm-compressor
  • Explicitly initialize quantization parameters as offloaded if the module is offloaded

Prerequisites

Changes

Changes not covered by prerequisites:

  • Implement getattr_chain utility function (also used by llm-compressor); a usage sketch follows this list
  • Implement deprecated utility decorator for future deprecations
  • Implement register_offload_parameter and delete_offload_parameter for easier initialization and removal of parameters related to quantization
  • Initialize newly created quantization parameters on CPU if the module is offloaded
    • Improves performance and removes the dependency on get_execution_device
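
For illustration, here is a minimal sketch of the idea behind getattr_chain; the actual helper in src/compressed_tensors/utils may differ in signature and behavior.

```python
from functools import reduce

import torch


def getattr_chain(obj, attr_path: str, *default):
    """Walk a dotted attribute path such as 'quantization_scheme.weights.strategy'.

    Sketch of the idea only; the real helper's signature may differ.
    """
    try:
        return reduce(getattr, attr_path.split("."), obj)
    except AttributeError:
        if default:
            return default[0]
        raise


# Example: read a (possibly missing) nested quantization attribute with a fallback
linear = torch.nn.Linear(4, 4)
strategy = getattr_chain(linear, "quantization_scheme.weights.strategy", None)
print(strategy)  # None, since this module has no quantization_scheme attached
```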

Deprecation Strategy

These functions should be deprecated, each for its own reason. The strategies below will be implemented in follow-up PRs.

Function | Deprecation Reason | Deprecation Strategy
-- | -- | --
is_module_offloaded | Use official has_offloaded_params | Redirect to has_offloaded_params & deprecation warning
get_execution_device | Not useful as a general util | Remove uses from LC & deprecation warning
get_offloaded_device | Folded into update_offload_parameter | Replace uses in LC with update_offload_parameter & deprecation warning
update_prefix_dict | Folded into update_offload_parameter | Replace uses in LC with update_offload_parameter & deprecation warning
update_parameter_data | Use update_offload_data for better args ordering (open to keeping this one around) | Remove uses from LC and CT & deprecation warning
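
As a rough sketch of the redirect-plus-warning pattern described above (not the actual deprecated helper added in this PR, whose signature may differ; has_offloaded_params below is a stand-in for the official accelerate helper):

```python
import functools
import warnings


def deprecated(message: str = ""):
    """Illustrative deprecation decorator; the real helper may differ."""
    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            warnings.warn(
                f"{func.__name__} is deprecated. {message}",
                DeprecationWarning,
                stacklevel=2,
            )
            return func(*args, **kwargs)
        return wrapper
    return decorator


def has_offloaded_params(module) -> bool:
    # Stand-in for the official helper, used only for this example
    return hasattr(module, "_hf_hook") and getattr(module._hf_hook, "offload", False)


@deprecated("Use has_offloaded_params instead.")
def is_module_offloaded(module) -> bool:
    # Old name kept as a thin redirect so existing callers keep working
    return has_offloaded_params(module)
```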

Upstream Strategy

Upstreaming functions to accelerate is a low priority, but comes with the benefit of more reviews and more official support

Function | Upstream Version
-- | --
register_offload_parameter | N/A
update_offload_data | N/A
delete_offload_parameter | N/A
has_offloaded_params | 1.1.0
align_module_device | 1.1.0

@kylesayrs kylesayrs marked this pull request as ready for review November 19, 2024 02:46
@kylesayrs kylesayrs self-assigned this Nov 19, 2024
@kylesayrs kylesayrs changed the title [WIP] Accelerate Utilities Accelerate Utilities Nov 19, 2024
@kylesayrs kylesayrs requested a review from horheynm November 28, 2024 17:04
Member

@rahul-tuli rahul-tuli left a comment

LGTM! with a few nits, good work on this!

Resolved review threads on src/compressed_tensors/utils/helpers.py and src/compressed_tensors/utils/offload.py
rahul-tuli previously approved these changes Dec 6, 2024
Contributor

@dsikka dsikka left a comment

What would be the replacement for get_execution_device?

@kylesayrs
Contributor Author

@dsikka The function update_parameter_data takes new_param_data as an input and uses it to update the parameter's data. Previously, this function would simply overwrite the old data with new_param_data. Now, in order to reduce complexity and increase performance, update_parameter_data requires that the parameter being updated and the new parameter data have the same shape.

This requirement causes an error in mock_per_token_calibration (tests/test_quantization/test_configs/test_strategies.py), which revealed to me that the shape used to initialize the per_token strategy and the shape computed by calculate_qparams differ. I consider this ambiguity to be a bug which was causing the test to fail.
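
To make the new contract concrete, here is a small illustrative sketch of the shape check (not the library implementation):

```python
import torch


def update_parameter_data(module: torch.nn.Module, new_data: torch.Tensor, name: str):
    # Copy new data into the existing parameter in place rather than replacing it,
    # which requires the shapes to already match.
    param = getattr(module, name)
    if param.shape != new_data.shape:
        raise ValueError(
            f"Expected shape {tuple(param.shape)} for '{name}', got {tuple(new_data.shape)}"
        )
    param.data.copy_(new_data.to(param.data.device))


linear = torch.nn.Linear(8, 8)
update_parameter_data(linear, torch.zeros(8, 8), "weight")    # ok
# update_parameter_data(linear, torch.zeros(4, 8), "weight")  # raises ValueError
```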

Contributor

@dsikka dsikka left a comment

This looks good overall.

Do you mind adding a simple lifecycle docstring that shows the steps of offloaded modules/parameters, to make it slightly easier to follow how the parameters are updated?

I also think we should kick off W4A16/W8A8 oneshot workflows, similar to what we did here: https://app.asana.com/0/1207078450218847/1208568399648361/f, to make sure they run to completion. I think the past issues we've seen have been with g_idx and activation quantization parameters.

@dsikka
Contributor

dsikka commented Dec 9, 2024

> What would be the replacement for get_execution_device?

I think I understand from your PR why this can be removed.

@kylesayrs
Contributor Author

@dsikka w.r.t. get_execution_device

  1. The function isn't guaranteed to be performant for all device maps, for example, half-offloaded models
  2. The function has very few uses, so it may be worth removing

For these reasons it's a deprecation candidate (although we'll need it for the immediate future), but future work can determine whether we want to keep or update it.
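
For example, a hypothetical half-offloaded device map like the one below has no single execution device for the whole model:

```python
# Hypothetical accelerate-style device map for a partially offloaded model:
# some decoder layers run on GPU 0 while others live on CPU until onloaded,
# so asking for "the" execution device of the model is ambiguous.
device_map = {
    "model.embed_tokens": 0,
    "model.layers.0": 0,
    "model.layers.1": 0,
    "model.layers.2": "cpu",
    "model.layers.3": "cpu",
    "lm_head": 0,
}
```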

horheynm previously approved these changes Dec 19, 2024
@dsikka dsikka merged commit 85b473e into main Dec 20, 2024
1 check passed
@dsikka dsikka deleted the kylesayrs/upstream-candidates branch December 20, 2024 16:29
dsikka added a commit to vllm-project/llm-compressor that referenced this pull request Jan 8, 2025
## Purpose ##
* Enable oneshot quantization of vision-language models

![VLM Banner](https://github.com/user-attachments/assets/0d748714-b524-44f4-b850-a721f35d5543)
[Llama_3.2-Vision Graphviz](https://github.com/user-attachments/assets/6b371ccc-f9f6-4bf2-b4cd-24ed75a3cad0)

## Related Issues ##
* Fixes #91
* Fixes #961
* Fixes #990

## Prerequisites ##
* neuralmagic/compressed-tensors#193
* #917
* #943
  * #955
    * #950
* #998
* #1014

## Changes ##
### VLM Support ###
* Add multimodal examples in `examples/multimodal_vision`
* Modify `custom_offload_device_map` to support models which are not `XForCausalLM`
* Add custom data collators for VLM models in `src/llmcompressor/transformers/utils/data_collator.py`
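
As a rough illustration of what such a collator might look like (hypothetical, for a processor that has already tokenized and preprocessed each sample; the actual collators in `data_collator.py` may differ):

```python
import torch


def vlm_data_collator(batch):
    # Hypothetical collator for vision-language calibration data: the processor has
    # already produced input_ids, attention_mask, pixel_values, etc. as lists, so
    # each field is simply converted to a tensor.
    assert len(batch) == 1, "calibration is assumed to run with batch size 1"
    return {key: torch.tensor(value) for key, value in batch[0].items()}
```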

### GPTQModifier ###
* Implement hooks-based compression in `GPTQModifier`
  * This replaces the layer-compressor, which made many assumptions about model architecture
  * This also enables finer-grained sequential compression such as [true_sequential](https://huggingface.co/docs/transformers/main_classes/quantization#transformers.GPTQConfig.true_sequential)
  * Functions previously implemented in `gptq_wrapper.py` are now implemented in `gptq_quantize.py`
* Implement `offload_hessians` parameter in `GPTQModifier`
* Implement data-pipelines-based calibration in `GPTQModifier` (see the fallback sketch after this list)
  * First, an attempt is made to trace the model and run the `sequential` pipeline
  * If that fails, assumptions are made about the model architecture and an attempt is made to run the `layer_sequential` pipeline
    * This ensures backwards compatibility with any previously supported models
  * If that also fails, the basic pipeline is used, which is guaranteed to run but may require using `offload_hessians`
* Change hessian instability from a `ValueError` to a `_LinAlgError` so it can be ignored by the gptq pipeline fallback mechanism
* Add support for conv2d as indicated by [AutoGPTQ](https://github.com/AutoGPTQ/AutoGPTQ/blob/6689349625de973b9ee3016c28c11f32acf7f02c/auto_gptq/quantization/gptq.py#L45-L54)
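
A rough sketch of the fallback order described above, using illustrative function names rather than the actual llm-compressor API:

```python
import torch


def run_sequential_pipeline(model, dataloader):
    # Stand-in: pretend tracing failed so the example exercises the fallback path
    raise RuntimeError("model is not traceable")


def run_layer_sequential_pipeline(model, dataloader):
    # Stand-in for the layer-by-layer pipeline, which assumes a list of decoder layers
    for batch in dataloader:
        model(batch)


def run_basic_pipeline(model, dataloader):
    # Guaranteed-to-run fallback: plain forward passes over the dataloader
    for batch in dataloader:
        model(batch)


def run_calibration(model, dataloader):
    # Fallback order: sequential (traced) -> layer_sequential -> basic
    try:
        run_sequential_pipeline(model, dataloader)
    except Exception:
        try:
            run_layer_sequential_pipeline(model, dataloader)
        except Exception:
            run_basic_pipeline(model, dataloader)


model = torch.nn.Linear(4, 4)
dataloader = [torch.randn(2, 4) for _ in range(3)]
run_calibration(model, dataloader)
```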

### Data Pipelines ###
* Implement the basic skeletons of data pipelines, which are subject to change when data pipelines are pulled out of modifiers
* Basic Pipeline
  * Performs standard forward passes through the model with the provided dataloader
  * Used as a fallback, as well as in the future for basic calibration passes
* Layer Sequential Pipeline
  * Refactor of `LayerCompressor` as a straightforward data pipeline
  * Uses `IntermediatesCache` to handle activation offloading
* Sequential Pipeline
  * Utilizes graph tracing implemented by `torch.fx` to determine where sequential targets (layers) exist in the graph and what their inputs and outputs are
  * Implements a BFS algorithm to assign nodes to partitions
    * An ideal implementation consolidates partition indices to assign each node to the latest possible partition, delaying execution. The current implementation addresses the most common case (node.op == get_attr)
  * Each partition (`Subgraph`) is compiled as an executable python function with the proper inputs and outputs
  * Uses `IntermediatesCache` to handle activation offloading
* Implement `IntermediatesCache`, which automagically handles the offloading and onloading of activations from batches (a rough offloading sketch follows this list)
  * This class is capable of offloading many non-standard activation types such as `Tuple`s and dataclasses such as `BaseModelOutputWithPast`
  * For convenience, the class also handles masking padding
  * The class is tested in `tests/llmcompressor/pipelines/test_cache.py`
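
A minimal sketch of the offloading idea behind `IntermediatesCache` (illustrative only; the real class has a different API and also handles padding masks):

```python
from dataclasses import dataclass, fields, is_dataclass

import torch


def move_nested(value, device):
    # Recursively move tensors inside tuples and dataclasses, mirroring the kind of
    # non-standard activation structures (e.g. BaseModelOutputWithPast) mentioned above
    if isinstance(value, torch.Tensor):
        return value.to(device)
    if isinstance(value, tuple):
        return tuple(move_nested(v, device) for v in value)
    if is_dataclass(value) and not isinstance(value, type):
        for f in fields(value):
            setattr(value, f.name, move_nested(getattr(value, f.name), device))
        return value
    return value


@dataclass
class DummyOutput:
    hidden_states: torch.Tensor
    past: tuple


out = DummyOutput(hidden_states=torch.randn(2, 8), past=(torch.randn(2, 8),))
out = move_nested(out, "cpu")  # offload; onloading would move back to the execution device
```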

### Tracing ###
* In order to support sequential quantization of the large variety of multimodal model architectures, some model definitions have to be altered to support tracing (a generic illustration follows this list)
* If the calibration dataset is text only, most LLMs and VLMs are traceable without additional work. Multimodal calibration datasets are more likely to require additional work to make the model traceable
* For many VLMs (but not all), the vision tower is not traceable without significant work. However, this only affects sequential error propagation and (minimal?) increased memory usage, which leaves the door open for future support for quantizing modules in the vision tower
* Add traceable model definitions for llava, mistral, mllama, and glm
* All copyright licenses allow for alteration and redistribution; the line `# vllm-project: no copyright` was added in a similar style to [text_generation.py](https://github.com/vllm-project/llm-compressor/blob/main/src/llmcompressor/transformers/finetune/text_generation.py#L18)
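
As a generic illustration of the kind of change involved (not taken from the actual traceable definitions in this PR), data-dependent helpers can be excluded from symbolic tracing with `torch.fx.wrap`:

```python
import torch
import torch.fx


@torch.fx.wrap
def select_image_features(features: torch.Tensor, lengths: torch.Tensor) -> torch.Tensor:
    # Data-dependent slicing like this breaks symbolic tracing, so the helper is
    # wrapped and recorded as a single leaf call in the traced graph instead
    return features[:, : int(lengths.max())]


class TinyVisionText(torch.nn.Module):
    def __init__(self):
        super().__init__()
        self.proj = torch.nn.Linear(16, 16)

    def forward(self, features, lengths):
        return self.proj(select_image_features(features, lengths))


graph_module = torch.fx.symbolic_trace(TinyVisionText())
print(graph_module.graph)
```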

## Future Work / Follow-ups ##
* #1027
* #1032
* #1039
* #1030
* Create better data collators capable of handling larger batch sizes in order to support VLM fine tuning
* Better support prompt masking for multimodal processors in order to support VLM fine tuning

## Winogrande Evaluations ##

Model | Dataset | Scheme | Runtime | Winogrande |
-- | -- | -- | -- | --
Llama-3-8B | ultrachat | W4A16 | 43m, 2xA4000 | 0.7545 
Llama-3-70B | ultrachat | W4A16 | 303m, 1xH100 | 0.8216 
Mixtral-8x7B | ultrachat | W4A16 | 317m, 1xA100 | 0.8200 
openbmb/MiniCPM3-4B | ultrachat | W4A16 | 63m, 1xA100 | 0.6701 
Qwen2-VL-2B-Instruct | ultrachat | W8A8 | 12m, 2xA4000 | 0.6188 
Qwen2-VL-2B-Instruct | flickr | W8A8 | 24m, 2xA4000 | 0.6093 
Llama-3.2-11B-Vision-Instruct | flickr | W8A8 | 75m, 1xA100 | 0.7837 
Pixtral-12B-2409 | flickr | W8A8 | 52m, 1xA100 | 0.7924 
llava-1.5-7b-hf | flickr | W8A8 | 15m, 1xH100 | 0.7214 
Phi-3-vision-128k-instruct | flickr | W4A16 | 51m, 1xA100 | 0.7151 

```
lm_eval --model vllm --model_args pretrained="path/to/model",dtype=auto,max_model_len=4096,tensor_parallel_size=1,gpu_memory_utilization=0.8,enforce_eager=True,add_bos_token=True --tasks winogrande --num_fewshot 5 --batch_size 32

lm_eval --model vllm --model_args pretrained="path/to/model",dtype=bfloat16,max_model_len=4096,tensor_parallel_size=1,gpu_memory_utilization=0.8,enforce_eager=True,add_bos_token=True,max_num_seqs=1 --tasks winogrande --num_fewshot 5 --batch_size 1
```

## MMMU Evaluations ##
Credit to @shubhra 

Model | Dataset | Scheme | MMMU
-- | -- | -- | --
Llama-3.2-11B-Vision | N/A | Dense | 0.4144
Llama-3.2-11B-Vision | N/A | FP8-dynamic | 0.4300
Llama-3.2-11B-Vision | flickr | W4A16 | 0.4377
Llama-3.2-11B-Vision | flickr | W4A16-group | 0.4211

Model | Dataset | Scheme | MMMU
-- | -- | -- | --
Llama-3.2-90B-Vision | N/A | Dense | 0.5388
Llama-3.2-90B-Vision | N/A | FP8-dynamic | 0.5278
Llama-3.2-90B-Vision | flickr | W4A16 | 0.5111
Llama-3.2-90B-Vision | flickr | W4A16-group | 0.5477

Model | Dataset | Scheme | MMMU
-- | -- | -- | --
Pixtral-12B-2409 | N/A | Dense | 0.5022
Pixtral-12B-2409 | N/A | FP8-dynamic | 0.5322
Pixtral-12B-2409 | flickr | W4A16 | 0.4500
Pixtral-12B-2409 | flickr | W4A16-group | 0.4689

## Testing ##
* [Nightly](https://github.com/neuralmagic/llm-compressor-testing/actions/runs/12640439996)

---------

Signed-off-by: Kyle Sayers <kylesayrs@gmail.com>
Co-authored-by: Dipika Sikka <dipikasikka1@gmail.com>