Refactoring of ImageProcessorFast (huggingface#35069)
* add init and base image processing functions

* add add_fast_image_processor to transformers-cli

* add working fast image processor clip

* add fast image processor to doc, working tests

* remove "to be implemented" SigLip

* fix unprotected import

* fix unprotected vision import

* update ViTImageProcessorFast

* increase threshold for slow/fast equivalence

* add fast img blip

* add fast class in tests with cli

* improve cli

* add fast image processor convnext

* add LlavaPatchingMixin and fast image processor for llava_next and llava_onevision

* add device kwarg to ImagesKwargs for fast processing on cuda

* cleanup

* fix unprotected import

* group images by sizes and add batch processing

* Add batch equivalence tests, skip when center_crop is used

* cleanup

* update init and cli

* fix-copies

* refactor convnext, cleanup base

* fix

* remove patching mixins, add piped torchvision transforms for ViT

* fix unbatched processing

* fix f strings

* protect imports

* change llava onevision to class transforms (test)

* fix convnext

* improve formatting (following Pavel review)

* fix handling device arg

* improve cli

* fix

* fix inits

* Add distinction between preprocess and _preprocess, and support for arbitrary kwargs through valid_extra_kwargs

* uniformize qwen2_vl fast

* fix docstrings

* add add_fast_image_processor llava

* remove min_pixels max_pixels from accepted size

* nit

* nit

* refactor fast image processors docstrings

* cleanup and remove fast class transforms

* update add fast image processor transformers cli

* cleanup docstring

* uniformize pixtral fast and make _process_image explicit

* fix prepare image structure llava next/onevision

* Use typed kwargs instead of explicit args

* nit fix import Unpack

* clearly separate pops and gets in base preprocess. Use explicit typed kwargs

* make qwen2_vl preprocess arguments hashable
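One of the changes above groups images by size so that same-shaped images can be batched through the torchvision transforms in a single call. A minimal sketch of that idea, with hypothetical helper names rather than the actual transformers implementation:

```python
from collections import defaultdict

def group_images_by_shape(images):
    """Map each distinct shape to the indices of the images that have it,
    so every group can be stacked and transformed as one batch."""
    grouped = defaultdict(list)
    for idx, image in enumerate(images):
        grouped[tuple(image["shape"])].append(idx)
    return dict(grouped)

def process_in_groups(images, batch_transform):
    """Run `batch_transform` once per shape group, then scatter the
    outputs back into the original input order."""
    results = [None] * len(images)
    for indices in group_images_by_shape(images).values():
        outputs = batch_transform([images[i] for i in indices])
        for idx, out in zip(indices, outputs):
            results[idx] = out
    return results
```

Here an "image" is any mapping with a hashable `"shape"` key; the real processors key on tensor shapes and stack each group into one tensor before transforming.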
yonigozlan authored Feb 4, 2025
1 parent 8d73a38 commit fa56dcc
Showing 66 changed files with 4,072 additions and 2,269 deletions.
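Several of the commits above concern the split between `preprocess` and `_preprocess` and the handling of extra kwargs (`valid_extra_kwargs`, the `device` kwarg, separating pops from gets). A simplified sketch of that control flow, using illustrative names rather than the real base-class code:

```python
class FastImageProcessorSketch:
    """Sketch of the kwarg handling described in the commit message above;
    hypothetical names, not the actual transformers base class."""

    # Extra kwargs preprocess() is allowed to forward (cf. valid_extra_kwargs).
    valid_extra_kwargs = ("device",)

    def preprocess(self, images, **kwargs):
        # "Pops": options consumed at this level are removed from kwargs
        # so nothing is handled twice further down.
        do_resize = kwargs.pop("do_resize", True)
        size = kwargs.pop("size", (224, 224))
        # Whitelisted extras are forwarded to _preprocess; anything else errors.
        extra = {k: kwargs.pop(k) for k in list(kwargs) if k in self.valid_extra_kwargs}
        if kwargs:
            raise ValueError(f"Unrecognized kwargs: {sorted(kwargs)}")
        return self._preprocess(images, do_resize=do_resize, size=size, **extra)

    def _preprocess(self, images, do_resize, size, device=None):
        # Stand-in for the real per-batch tensor work (resize/normalize on `device`).
        return {"n": len(images), "do_resize": do_resize, "size": size, "device": device}
```

The point of the split is that model-specific subclasses override `_preprocess` with the actual tensor operations while the base `preprocess` owns argument validation.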
5 changes: 5 additions & 0 deletions docs/source/en/model_doc/blip.md
@@ -61,6 +61,11 @@ The original code can be found [here](https://github.com/salesforce/BLIP).
[[autodoc]] BlipImageProcessor
- preprocess

## BlipImageProcessorFast

[[autodoc]] BlipImageProcessorFast
- preprocess

<frameworkcontent>
<pt>

5 changes: 5 additions & 0 deletions docs/source/en/model_doc/clip.md
@@ -251,6 +251,11 @@ The resource should ideally demonstrate something new instead of duplicating an
[[autodoc]] CLIPImageProcessor
- preprocess

## CLIPImageProcessorFast

[[autodoc]] CLIPImageProcessorFast
- preprocess

## CLIPFeatureExtractor

[[autodoc]] CLIPFeatureExtractor
5 changes: 5 additions & 0 deletions docs/source/en/model_doc/convnext.md
@@ -64,6 +64,11 @@ If you're interested in submitting a resource to be included here, please feel f
[[autodoc]] ConvNextImageProcessor
- preprocess

## ConvNextImageProcessorFast

[[autodoc]] ConvNextImageProcessorFast
- preprocess

<frameworkcontent>
<pt>

5 changes: 5 additions & 0 deletions docs/source/en/model_doc/deit.md
@@ -125,6 +125,11 @@ If you're interested in submitting a resource to be included here, please feel f
[[autodoc]] DeiTImageProcessor
- preprocess

## DeiTImageProcessorFast

[[autodoc]] DeiTImageProcessorFast
- preprocess

<frameworkcontent>
<pt>

5 changes: 5 additions & 0 deletions docs/source/en/model_doc/llava.md
@@ -195,6 +195,11 @@ A list of official Hugging Face and community (indicated by 🌎) resources to h
[[autodoc]] LlavaImageProcessor
- preprocess

## LlavaImageProcessorFast

[[autodoc]] LlavaImageProcessorFast
- preprocess

## LlavaProcessor

[[autodoc]] LlavaProcessor
5 changes: 5 additions & 0 deletions docs/source/en/model_doc/llava_next.md
@@ -288,6 +288,11 @@ model = AutoModelForImageTextToText.from_pretrained(
[[autodoc]] LlavaNextImageProcessor
- preprocess

## LlavaNextImageProcessorFast

[[autodoc]] LlavaNextImageProcessorFast
- preprocess

## LlavaNextProcessor

[[autodoc]] LlavaNextProcessor
13 changes: 9 additions & 4 deletions docs/source/en/model_doc/llava_onevision.md
@@ -100,8 +100,8 @@ import torch
from PIL import Image
import requests

processor = AutoProcessor.from_pretrained("llava-hf/llava-onevision-qwen2-7b-ov-hf")
model = LlavaOnevisionForConditionalGeneration.from_pretrained("llava-hf/llava-onevision-qwen2-7b-ov-hf", torch_dtype=torch.float16, low_cpu_mem_usage=True)
model.to("cuda:0")

# prepare image and text prompt, using the appropriate prompt template
@@ -298,8 +298,8 @@ First make sure to install flash-attn. Refer to the [original repository of Flas
from transformers import LlavaOnevisionForConditionalGeneration

model = LlavaOnevisionForConditionalGeneration.from_pretrained(
model_id,
torch_dtype=torch.float16,
low_cpu_mem_usage=True,
use_flash_attention_2=True
).to(0)
@@ -318,6 +318,11 @@

[[autodoc]] LlavaOnevisionImageProcessor

## LlavaOnevisionImageProcessorFast

[[autodoc]] LlavaOnevisionImageProcessorFast
- preprocess

## LlavaOnevisionVideoProcessor

[[autodoc]] LlavaOnevisionVideoProcessor
5 changes: 5 additions & 0 deletions docs/source/en/model_doc/siglip.md
@@ -214,6 +214,11 @@ Below is an expected speedup diagram that compares inference time between the na
[[autodoc]] SiglipImageProcessor
- preprocess

## SiglipImageProcessorFast

[[autodoc]] SiglipImageProcessorFast
- preprocess

## SiglipProcessor

[[autodoc]] SiglipProcessor
5 changes: 5 additions & 0 deletions docs/source/ja/model_doc/blip.md
@@ -61,6 +61,11 @@ BLIP は、次のようなさまざまなマルチモーダル タスクを実
[[autodoc]] BlipImageProcessor
- preprocess

## BlipImageProcessorFast

[[autodoc]] BlipImageProcessorFast
- preprocess

<frameworkcontent>
<pt>

5 changes: 5 additions & 0 deletions docs/source/ja/model_doc/clip.md
@@ -133,6 +133,11 @@ CLIP を使い始めるのに役立つ公式 Hugging Face およびコミュニ
[[autodoc]] CLIPImageProcessor
- preprocess

## CLIPImageProcessorFast

[[autodoc]] CLIPImageProcessorFast
- preprocess

## CLIPFeatureExtractor

[[autodoc]] CLIPFeatureExtractor
5 changes: 5 additions & 0 deletions docs/source/ja/model_doc/convnext.md
@@ -64,6 +64,11 @@ ConvNeXT の使用を開始するのに役立つ公式 Hugging Face およびコ
[[autodoc]] ConvNextImageProcessor
- preprocess

## ConvNextImageProcessorFast

[[autodoc]] ConvNextImageProcessorFast
- preprocess

<frameworkcontent>
<pt>

5 changes: 5 additions & 0 deletions docs/source/ja/model_doc/deit.md
@@ -98,6 +98,11 @@ DeiT を始めるのに役立つ公式 Hugging Face およびコミュニティ
[[autodoc]] DeiTImageProcessor
- preprocess

## DeiTImageProcessorFast

[[autodoc]] DeiTImageProcessorFast
- preprocess

<frameworkcontent>
<pt>

5 changes: 1 addition & 4 deletions examples/modular-transformers/modeling_new_task_model.py
@@ -452,10 +452,7 @@ def prepare_inputs_for_generation(
return model_inputs

def resize_token_embeddings(
-    self,
-    new_num_tokens: Optional[int] = None,
-    pad_to_multiple_of=None,
-    mean_resizing=True
+    self, new_num_tokens: Optional[int] = None, pad_to_multiple_of=None, mean_resizing=True
) -> nn.Embedding:
model_embeds = self.language_model.resize_token_embeddings(new_num_tokens, pad_to_multiple_of, mean_resizing)

5 changes: 1 addition & 4 deletions examples/modular-transformers/modular_new_task_model.py
@@ -70,10 +70,7 @@ def forward(
return (embeddings,) + vlm_outputs

def resize_token_embeddings(
-    self,
-    new_num_tokens: Optional[int] = None,
-    pad_to_multiple_of=None,
-    mean_resizing=True
+    self, new_num_tokens: Optional[int] = None, pad_to_multiple_of=None, mean_resizing=True
) -> nn.Embedding:
model_embeds = self.language_model.resize_token_embeddings(new_num_tokens, pad_to_multiple_of, mean_resizing)

16 changes: 16 additions & 0 deletions src/transformers/__init__.py
@@ -1308,11 +1308,19 @@
]
else:
_import_structure["image_processing_utils_fast"] = ["BaseImageProcessorFast"]
_import_structure["models.blip"].append("BlipImageProcessorFast")
_import_structure["models.clip"].append("CLIPImageProcessorFast")
_import_structure["models.convnext"].append("ConvNextImageProcessorFast")
_import_structure["models.deformable_detr"].append("DeformableDetrImageProcessorFast")
_import_structure["models.deit"].append("DeiTImageProcessorFast")
_import_structure["models.detr"].append("DetrImageProcessorFast")
_import_structure["models.llava"].append("LlavaImageProcessorFast")
_import_structure["models.llava_next"].append("LlavaNextImageProcessorFast")
_import_structure["models.llava_onevision"].append("LlavaOnevisionImageProcessorFast")
_import_structure["models.pixtral"].append("PixtralImageProcessorFast")
_import_structure["models.qwen2_vl"].append("Qwen2VLImageProcessorFast")
_import_structure["models.rt_detr"].append("RTDetrImageProcessorFast")
_import_structure["models.siglip"].append("SiglipImageProcessorFast")
_import_structure["models.vit"].append("ViTImageProcessorFast")

try:
@@ -6442,11 +6450,19 @@
from .utils.dummy_torchvision_objects import *
else:
from .image_processing_utils_fast import BaseImageProcessorFast
from .models.blip import BlipImageProcessorFast
from .models.clip import CLIPImageProcessorFast
from .models.convnext import ConvNextImageProcessorFast
from .models.deformable_detr import DeformableDetrImageProcessorFast
from .models.deit import DeiTImageProcessorFast
from .models.detr import DetrImageProcessorFast
from .models.llava import LlavaImageProcessorFast
from .models.llava_next import LlavaNextImageProcessorFast
from .models.llava_onevision import LlavaOnevisionImageProcessorFast
from .models.pixtral import PixtralImageProcessorFast
from .models.qwen2_vl import Qwen2VLImageProcessorFast
from .models.rt_detr import RTDetrImageProcessorFast
from .models.siglip import SiglipImageProcessorFast
from .models.vit import ViTImageProcessorFast

try:
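The `__init__.py` hunks above register each new fast processor in `_import_structure`, transformers' lazy-import table, which maps submodules to the public names they export so a symbol is only imported on first access. A rough, self-contained sketch of the lookup side of that table (the real resolution is done by transformers' `_LazyModule`, not this function):

```python
# Sketch of the lazy-import table extended by this commit; illustrative only.
_import_structure = {
    "models.clip": ["CLIPImageProcessorFast"],
    "models.vit": ["ViTImageProcessorFast"],
}

# Invert the table: public symbol -> defining submodule.
_symbol_to_module = {
    symbol: module
    for module, symbols in _import_structure.items()
    for symbol in symbols
}

def resolve(symbol):
    """Return the dotted submodule a public symbol would be imported from."""
    try:
        return _symbol_to_module[symbol]
    except KeyError:
        raise AttributeError(symbol) from None
```

Keeping the table in sync in both the `_import_structure` branch and the eager-import `else` branch is why the commit touches two mirrored blocks of `__init__.py`.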
