
Accelerate Utilities #193

Merged: 32 commits merged into main on Dec 20, 2024

Conversation

kylesayrs
Contributor

@kylesayrs kylesayrs commented Oct 21, 2024

Purpose

  • Implement offloading utility functions which greatly simplify/clarify offloading-related code in llm-compressor
  • Explicitly initialize quantization parameters as offloaded if the module is offloaded

Prerequisites

Changes

Changes not covered by prerequisites:

  • Implement getattr_chain utility function (also used by llm-compressor); a usage sketch follows this list
  • Implement deprecated utility decorator for future deprecations
  • Implement register_offload_parameter and delete_offload_parameter for easier initialization and removal of parameters related to quantization
  • Initialize newly created quantization parameters on CPU if the module is offloaded
    • Improves performance and removes the dependency on get_execution_device
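
For illustration, here is a minimal sketch of the idea behind getattr_chain; the actual helper in src/compressed_tensors/utils may differ in signature and behavior.

```python
from functools import reduce

import torch


def getattr_chain(obj, attr_path: str, *default):
    """Walk a dotted attribute path such as 'quantization_scheme.weights.strategy'.

    Sketch of the idea only; the real helper's signature may differ.
    """
    try:
        return reduce(getattr, attr_path.split("."), obj)
    except AttributeError:
        if default:
            return default[0]
        raise


# Example: read a (possibly missing) nested quantization attribute with a fallback
linear = torch.nn.Linear(4, 4)
strategy = getattr_chain(linear, "quantization_scheme.weights.strategy", None)
print(strategy)  # None, since this module has no quantization_scheme attached
```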

Deprecation Strategy

These functions should be deprecated, each for its own reason. The strategies below will be implemented in follow-up PRs.

Function | Deprecation Reason | Deprecation Strategy
-- | -- | --
is_module_offloaded | Use official has_offloaded_params | Redirect to has_offloaded_params & deprecation warning
get_execution_device | Not useful as a general util | Remove uses from LC & deprecation warning
get_offloaded_device | Folded into update_offload_parameter | Replace uses in LC with update_offload_parameter & deprecation warning
update_prefix_dict | Folded into update_offload_parameter | Replace uses in LC with update_offload_parameter & deprecation warning
update_parameter_data | Use update_offload_data for better args ordering (open to keeping this one around) | Remove uses from LC and CT & deprecation warning
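
As a rough sketch of the redirect-plus-warning pattern described above (not the actual deprecated helper added in this PR, whose signature may differ; has_offloaded_params below is a stand-in for the official accelerate helper):

```python
import functools
import warnings


def deprecated(message: str = ""):
    """Illustrative deprecation decorator; the real helper may differ."""
    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            warnings.warn(
                f"{func.__name__} is deprecated. {message}",
                DeprecationWarning,
                stacklevel=2,
            )
            return func(*args, **kwargs)
        return wrapper
    return decorator


def has_offloaded_params(module) -> bool:
    # Stand-in for the official helper, used only for this example
    return hasattr(module, "_hf_hook") and getattr(module._hf_hook, "offload", False)


@deprecated("Use has_offloaded_params instead.")
def is_module_offloaded(module) -> bool:
    # Old name kept as a thin redirect so existing callers keep working
    return has_offloaded_params(module)
```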

Upstream Strategy

Upstreaming functions to accelerate is a low priority, but comes with the benefit of more reviews and more official support

Function | Upstream Version
-- | --
register_offload_parameter | N/A
update_offload_data | N/A
delete_offload_parameter | N/A
has_offloaded_params | 1.1.0
align_module_device | 1.1.0

@kylesayrs kylesayrs marked this pull request as ready for review November 19, 2024 02:46
@kylesayrs kylesayrs self-assigned this Nov 19, 2024
@kylesayrs kylesayrs changed the title [WIP] Accelerate Utilities Accelerate Utilities Nov 19, 2024
@kylesayrs kylesayrs requested a review from horheynm November 28, 2024 17:04
Member

@rahul-tuli rahul-tuli left a comment

LGTM! with a few nits, good work on this!

Resolved review threads on src/compressed_tensors/utils/helpers.py and src/compressed_tensors/utils/offload.py
rahul-tuli previously approved these changes Dec 6, 2024
Contributor

@dsikka dsikka left a comment

What would be the replacement for get_execution_device?

@kylesayrs
Contributor Author

@dsikka The function update_parameter_data takes new_param_data as an input and uses it to update the parameter's data. Previously, this function would simply overwrite the old data with new_param_data. Now, in order to reduce complexity and increase performance, update_parameter_data requires that the parameter being updated and the new parameter data have the same shape.

This requirement causes an error in mock_per_token_calibration (tests/test_quantization/test_configs/test_strategies.py), which revealed to me that the shape used to initialize the per_token strategy and the shape computed by calculate_qparams differ. I consider this ambiguity to be a bug which was causing the test to fail.
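
To make the new contract concrete, here is a small illustrative sketch of the shape check (not the library implementation):

```python
import torch


def update_parameter_data(module: torch.nn.Module, new_data: torch.Tensor, name: str):
    # Copy new data into the existing parameter in place rather than replacing it,
    # which requires the shapes to already match.
    param = getattr(module, name)
    if param.shape != new_data.shape:
        raise ValueError(
            f"Expected shape {tuple(param.shape)} for '{name}', got {tuple(new_data.shape)}"
        )
    param.data.copy_(new_data.to(param.data.device))


linear = torch.nn.Linear(8, 8)
update_parameter_data(linear, torch.zeros(8, 8), "weight")    # ok
# update_parameter_data(linear, torch.zeros(4, 8), "weight")  # raises ValueError
```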

Contributor

@dsikka dsikka left a comment

This looks good overall.

Do you mind adding a simple lifecycle docstring that shows the steps of offloaded modules/parameters, to make it slightly easier to follow how the parameters are updated?

I also think we should kick off W4A16/W8A8 oneshot workflows, similar to what we did here: https://app.asana.com/0/1207078450218847/1208568399648361/f, to make sure they run to completion. I think the past issues we've seen have been with g_idx and activation quantization parameters.

@dsikka
Contributor

dsikka commented Dec 9, 2024

> What would be the replacement for get_execution_device?

I think I understand from your PR why this can be removed.

@kylesayrs
Contributor Author

@dsikka w.r.t. get_execution_device

  1. The function isn't guaranteed to be performant for all device maps, for example, half-offloaded models
  2. The function has very few uses, so it may be worth removing

For these reasons it's a deprecation candidate (although we'll need it for the immediate future), but future work can determine whether we want to keep or update it.
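
For example, a hypothetical half-offloaded device map like the one below has no single execution device for the whole model:

```python
# Hypothetical accelerate-style device map for a partially offloaded model:
# some decoder layers run on GPU 0 while others live on CPU until onloaded,
# so asking for "the" execution device of the model is ambiguous.
device_map = {
    "model.embed_tokens": 0,
    "model.layers.0": 0,
    "model.layers.1": 0,
    "model.layers.2": "cpu",
    "model.layers.3": "cpu",
    "lm_head": 0,
}
```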

horheynm previously approved these changes Dec 19, 2024
@dsikka dsikka merged commit 85b473e into main Dec 20, 2024
1 check passed
@dsikka dsikka deleted the kylesayrs/upstream-candidates branch December 20, 2024 16:29
dsikka added a commit to vllm-project/llm-compressor that referenced this pull request Jan 8, 2025
## Purpose ##
* Enable oneshot quantization of vision-language models

![VLM Banner](https://github.com/user-attachments/assets/0d748714-b524-44f4-b850-a721f35d5543)
[Llama_3.2-Vision Graphviz](https://github.com/user-attachments/assets/6b371ccc-f9f6-4bf2-b4cd-24ed75a3cad0)

## Related Issues ##
* Fixes #91
* Fixes #961
* Fixes #990

## Prerequisites ##
* neuralmagic/compressed-tensors#193
* #917
* #943
  * #955
    * #950
* #998
* #1014

## Changes ##
### VLM Support ###
* Add multimodal examples in `examples/multimodal_vision`
* Modify `custom_offload_device_map` to support models which are not `XForCausalLM`
* Add custom data collators for VLM models in `src/llmcompressor/transformers/utils/data_collator.py`
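
As a rough illustration of what such a collator might look like (hypothetical, for a processor that has already tokenized and preprocessed each sample; the actual collators in `data_collator.py` may differ):

```python
import torch


def vlm_data_collator(batch):
    # Hypothetical collator for vision-language calibration data: the processor has
    # already produced input_ids, attention_mask, pixel_values, etc. as lists, so
    # each field is simply converted to a tensor.
    assert len(batch) == 1, "calibration is assumed to run with batch size 1"
    return {key: torch.tensor(value) for key, value in batch[0].items()}
```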

### GPTQModifier ###
* Implement hooks-based compression in `GPTQModifier`
  * This replaces the layer-compressor, which made many assumptions about model architecture
  * This also enables finer-grained sequential compression such as [true_sequential](https://huggingface.co/docs/transformers/main_classes/quantization#transformers.GPTQConfig.true_sequential)
  * Functions previously implemented in `gptq_wrapper.py` are now implemented in `gptq_quantize.py`
* Implement `offload_hessians` parameter in `GPTQModifier`
* Implement data-pipelines-based calibration in `GPTQModifier` (see the fallback sketch after this list)
  * First, an attempt is made to trace the model and run the `sequential` pipeline
  * If that fails, assumptions are made about the model architecture and an attempt is made to run the `layer_sequential` pipeline
    * This ensures backwards compatibility with any previously supported models
  * If that also fails, the basic pipeline is used, which is guaranteed to run but may require using `offload_hessians`
* Change hessian instability from a `ValueError` to a `_LinAlgError` so it can be ignored by the gptq pipeline fallback mechanism
* Add support for conv2d as indicated by [AutoGPTQ](https://github.com/AutoGPTQ/AutoGPTQ/blob/6689349625de973b9ee3016c28c11f32acf7f02c/auto_gptq/quantization/gptq.py#L45-L54)
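
A rough sketch of the fallback order described above, using illustrative function names rather than the actual llm-compressor API:

```python
import torch


def run_sequential_pipeline(model, dataloader):
    # Stand-in: pretend tracing failed so the example exercises the fallback path
    raise RuntimeError("model is not traceable")


def run_layer_sequential_pipeline(model, dataloader):
    # Stand-in for the layer-by-layer pipeline, which assumes a list of decoder layers
    for batch in dataloader:
        model(batch)


def run_basic_pipeline(model, dataloader):
    # Guaranteed-to-run fallback: plain forward passes over the dataloader
    for batch in dataloader:
        model(batch)


def run_calibration(model, dataloader):
    # Fallback order: sequential (traced) -> layer_sequential -> basic
    try:
        run_sequential_pipeline(model, dataloader)
    except Exception:
        try:
            run_layer_sequential_pipeline(model, dataloader)
        except Exception:
            run_basic_pipeline(model, dataloader)


model = torch.nn.Linear(4, 4)
dataloader = [torch.randn(2, 4) for _ in range(3)]
run_calibration(model, dataloader)
```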

### Data Pipelines ###
* Implement the basic skeletons of data pipelines, which are subject to change when data pipelines are pulled out of modifiers
* Basic Pipeline
  * Performs standard forward passes through the model with the provided dataloader
  * Used as a fallback, as well as in the future for basic calibration passes
* Layer Sequential Pipeline
  * Refactor of `LayerCompressor` as a straightforward data pipeline
  * Uses `IntermediatesCache` to handle activation offloading
* Sequential Pipeline
  * Utilizes graph tracing implemented by `torch.fx` to determine where sequential targets (layers) exist in the graph and what their inputs and outputs are
  * Implements a BFS algorithm to assign nodes to partitions
    * An ideal implementation consolidates partition indices to assign each node to the latest possible partition, delaying execution. The current implementation addresses the most common case (node.op == get_attr)
  * Each partition (`Subgraph`) is compiled as an executable python function with the proper inputs and outputs
  * Uses `IntermediatesCache` to handle activation offloading
* Implement `IntermediatesCache`, which automagically handles the offloading and onloading of activations from batches (a rough offloading sketch follows this list)
  * This class is capable of offloading many non-standard activation types such as `Tuple`s and dataclasses such as `BaseModelOutputWithPast`
  * For convenience, the class also handles masking padding
  * The class is tested in `tests/llmcompressor/pipelines/test_cache.py`
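
A minimal sketch of the offloading idea behind `IntermediatesCache` (illustrative only; the real class has a different API and also handles padding masks):

```python
from dataclasses import dataclass, fields, is_dataclass

import torch


def move_nested(value, device):
    # Recursively move tensors inside tuples and dataclasses, mirroring the kind of
    # non-standard activation structures (e.g. BaseModelOutputWithPast) mentioned above
    if isinstance(value, torch.Tensor):
        return value.to(device)
    if isinstance(value, tuple):
        return tuple(move_nested(v, device) for v in value)
    if is_dataclass(value) and not isinstance(value, type):
        for f in fields(value):
            setattr(value, f.name, move_nested(getattr(value, f.name), device))
        return value
    return value


@dataclass
class DummyOutput:
    hidden_states: torch.Tensor
    past: tuple


out = DummyOutput(hidden_states=torch.randn(2, 8), past=(torch.randn(2, 8),))
out = move_nested(out, "cpu")  # offload; onloading would move back to the execution device
```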

### Tracing ###
* In order to support sequential quantization of the large variety of multimodal model architectures, some model definitions have to be altered to support tracing (a generic illustration follows this list)
* If the calibration dataset is text only, most LLMs and VLMs are traceable without additional work. Multimodal calibration datasets are more likely to require additional work to make the model traceable
* For many VLMs (but not all), the vision tower is not traceable without significant work. However, this only affects sequential error propagation and (minimal?) increased memory usage, which leaves the door open for future support for quantizing modules in the vision tower
* Add traceable model definitions for llava, mistral, mllama, and glm
* All copyright licenses allow for alteration and redistribution; the line `# vllm-project: no copyright` was added in a similar style to [text_generation.py](https://github.com/vllm-project/llm-compressor/blob/main/src/llmcompressor/transformers/finetune/text_generation.py#L18)
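
As a generic illustration of the kind of change involved (not taken from the actual traceable definitions in this PR), data-dependent helpers can be excluded from symbolic tracing with `torch.fx.wrap`:

```python
import torch
import torch.fx


@torch.fx.wrap
def select_image_features(features: torch.Tensor, lengths: torch.Tensor) -> torch.Tensor:
    # Data-dependent slicing like this breaks symbolic tracing, so the helper is
    # wrapped and recorded as a single leaf call in the traced graph instead
    return features[:, : int(lengths.max())]


class TinyVisionText(torch.nn.Module):
    def __init__(self):
        super().__init__()
        self.proj = torch.nn.Linear(16, 16)

    def forward(self, features, lengths):
        return self.proj(select_image_features(features, lengths))


graph_module = torch.fx.symbolic_trace(TinyVisionText())
print(graph_module.graph)
```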

## Future Work / Follow-ups ##
* #1027
* #1032
* #1039
* #1030
* Create better data collators capable of handling larger batch sizes in order to support VLM fine tuning
* Better support prompt masking for multimodal processors in order to support VLM fine tuning

## Winogrande Evaluations ##

Model | Dataset | Scheme | Runtime | Winogrande |
-- | -- | -- | -- | --
Llama-3-8B | ultrachat | W4A16 | 43m, 2xA4000 | 0.7545 
Llama-3-70B | ultrachat | W4A16 | 303m, 1xH100 | 0.8216 
Mixtral-8x7B | ultrachat | W4A16 | 317m, 1xA100 | 0.8200 
openbmb/MiniCPM3-4B | ultrachat | W4A16 | 63m, 1xA100 | 0.6701 
Qwen2-VL-2B-Instruct | ultrachat | W8A8 | 12m, 2xA4000 | 0.6188 
Qwen2-VL-2B-Instruct | flickr | W8A8 | 24m, 2xA4000 | 0.6093 
Llama-3.2-11B-Vision-Instruct | flickr | W8A8 | 75m, 1xA100 | 0.7837 
Pixtral-12B-2409 | flickr | W8A8 | 52m, 1xA100 | 0.7924 
llava-1.5-7b-hf | flickr | W8A8 | 15m, 1xH100 | 0.7214 
Phi-3-vision-128k-instruct | flickr | W4A16 | 51m, 1xA100 | 0.7151 

```
lm_eval --model vllm --model_args pretrained="path/to/model",dtype=auto,max_model_len=4096,tensor_parallel_size=1,gpu_memory_utilization=0.8,enforce_eager=True,add_bos_token=True --tasks winogrande --num_fewshot 5 --batch_size 32

lm_eval --model vllm --model_args pretrained="path/to/model",dtype=bfloat16,max_model_len=4096,tensor_parallel_size=1,gpu_memory_utilization=0.8,enforce_eager=True,add_bos_token=True,max_num_seqs=1 --tasks winogrande --num_fewshot 5 --batch_size 1
```

## MMMU Evaluations ##
Credit to @shubhra 

Model | Dataset | Scheme | MMMU
-- | -- | -- | --
Llama-3.2-11B-Vision | N/A | Dense | 0.4144
Llama-3.2-11B-Vision | N/A | FP8-dynamic | 0.4300
Llama-3.2-11B-Vision | flickr | W4A16 | 0.4377
Llama-3.2-11B-Vision | flickr | W4A16-group | 0.4211

Model | Dataset | Scheme | MMMU
-- | -- | -- | --
Llama-3.2-90B-Vision | N/A | Dense | 0.5388
Llama-3.2-90B-Vision | N/A | FP8-dynamic | 0.5278
Llama-3.2-90B-Vision | flickr | W4A16 | 0.5111
Llama-3.2-90B-Vision | flickr | W4A16-group | 0.5477

Model | Dataset | Scheme | MMMU
-- | -- | -- | --
Pixtral-12B-2409 | N/A | Dense | 0.5022
Pixtral-12B-2409 | N/A | FP8-dynamic | 0.5322
Pixtral-12B-2409 | flickr | W4A16 | 0.4500
Pixtral-12B-2409 | flickr | W4A16-group | 0.4689

## Testing ##
* [Nightly](https://github.com/neuralmagic/llm-compressor-testing/actions/runs/12640439996)

---------

Signed-off-by: Kyle Sayers <kylesayrs@gmail.com>
Co-authored-by: Dipika Sikka <dipikasikka1@gmail.com>