
[CVPR 2025] FLAIR: VLM with Fine-grained Language-informed Image Representations


Authors: Rui Xiao, Sanghwan Kim, Mariana-Iuliana Georgescu, Zeynep Akata, Stephan Alaniz

News

  • [2025-03-02] ⭐️ Training code & scripts released.
  • [2025-02-26] 🎉 Our paper was accepted to CVPR 2025.
  • [2025-01-20] ⭐️ Inference code & models released.

Abstract

CLIP has shown impressive results in aligning images and texts at scale. However, its ability to capture detailed visual features remains limited because CLIP matches images and texts at a global level. To address this issue, we propose FLAIR, Fine-grained Language-informed Image Representations, an approach that utilizes long and detailed image descriptions to learn localized image embeddings. By sampling diverse sub-captions that describe fine-grained details about an image, we train our vision-language model to produce not only global embeddings but also text-specific image representations. Our model introduces text-conditioned attention pooling on top of local image tokens to produce fine-grained image representations that excel at retrieving detailed image content. We achieve state-of-the-art performance both on existing multimodal retrieval benchmarks and on our newly introduced fine-grained retrieval task, which evaluates vision-language models' ability to retrieve partial image content. Furthermore, our experiments demonstrate the effectiveness of FLAIR, trained on 30M image-text pairs, in capturing fine-grained visual information, including zero-shot semantic segmentation, outperforming models trained on billions of pairs.

Methodology

Pre-trained Models

We release the pre-trained FLAIR models on Hugging Face. The pre-trained models, their corresponding pre-training datasets, and R@1 retrieval results on COCO and Flickr are listed below; for the full results please see the paper. FLAIR shares a similar architecture with the ViT-B-16 model in OpenCLIP and therefore has a similar number of parameters (150M vs. 149M); the extra 1M parameters come from the text-conditioned attention pooling layer in FLAIR.

| Checkpoint | Pre-training Dataset | COCO T2I | COCO I2T | Flickr T2I | Flickr I2T |
| --- | --- | --- | --- | --- | --- |
| flair-cc3m-recap | CC3M-recap | 37.7 | 51.6 | 65.7 | 78.7 |
| flair-cc12m-recap | CC12M-recap | 47.8 | 64.1 | 75.4 | 90.8 |
| flair-yfcc15m-recap | YFCC15M-recap | 51.2 | 67.3 | 79.2 | 93.3 |
| flair-merged30m | Merged30M | 53.3 | 68.0 | 81.1 | 94.7 |

You don't need to manually download the pre-trained weights to run inference; they are downloaded automatically when you specify the huggingface-model-name in src/inference.sh (more details in the 'Inference with FLAIR' section). However, if you would like to store the pre-trained weights somewhere other than the default path, you can download them manually and set the --pretrained path/to/pretrained_weights flag in src/inference.sh instead (as OpenCLIP originally does).
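
For example, a single checkpoint can be fetched with the huggingface_hub CLI and then passed to the inference script via --pretrained. This is only a sketch; the chosen checkpoint and the target directory below are placeholders, and it assumes huggingface_hub is installed in your environment:

# download one checkpoint into ./weights (pip install -U huggingface_hub if needed)
huggingface-cli download xiaorui638/flair flair-merged30m.pt --local-dir ./weights
# then point src/inference.sh at it with: --pretrained ./weights/flair-merged30m.pt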

Dependencies

The following short tutorial helps you set up a simple Python virtual environment to run our code. Since our main dependency, OpenCLIP, is still updated frequently, you can always check their repo for a detailed tutorial on creating an environment best suited to your system. A conda environment with the same Python and PyTorch versions also works.

1. Create a Virtual Environment

First, navigate to the project’s root directory flair/ and create a virtual environment using Python 3.12:

cd flair/
python3.12 -m venv flair_env

2. Activate and Navigate to src/

Activate the virtual environment and change into src/:

source flair_env/bin/activate
cd src/

3. Install Dependencies

Installing the dependencies mainly involves open_clip_torch and open_clip_torch[training]:

pip install --upgrade pip
pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu124
pip install -r requirements.txt

The code is tested with Python 3.12, PyTorch 2.5.1, and CUDA 12.4. Since OpenCLIP is quite dependency-friendly, other up-to-date versions should also work.
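
As a quick sanity check (a minimal, optional snippet, not part of the official setup), you can verify that the installation worked and that PyTorch sees your GPU:

# prints the installed PyTorch version and whether CUDA is available;
# the import also confirms that open_clip_torch was installed correctly
python3 -c "import torch, open_clip; print(torch.__version__, torch.cuda.is_available())"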

Usage

A minimal usage example of FLAIR is provided in src/minimal_example.py, where we show that FLAIR has two ways of generating logits:

  1. model.get_logits(): First queries the local image tokens with the global text token using attention pooling, then computes the logits. This is the primary way FLAIR computes logits.
  2. model.get_logits_as_clip(): Directly computes the similarity between the global-level image and text features, without attention pooling.

Run the example by:

source flair_env/bin/activate
python3 src/minimal_example.py

Inference Datasets Preparation

Check EVAL_DATASETS.md to prepare all the inference datasets. For clarity, we provide an example datasets folder with annotation files in datasets/. However, the datasets do not all have to be stored in the same directory; you can specify their locations freely by changing the arguments in src/inference.sh.

Inference with FLAIR

To reproduce the retrieval results in the FLAIR paper, we provide an example inference bash script: src/inference.sh. Below are detailed explanations of important flags:

  • --huggingface-repo-name: Name of the Huggingface repo where the pre-trained models are stored. Should be fixed as 'xiaorui638/flair'.
  • --huggingface-model-name: Name of the pretrained models. Options include:
    • flair-cc3m-recap.pt
    • flair-cc12m-recap.pt
    • flair-yfcc15m-recap.pt
    • flair-merged30m.pt
  • --inference-with-flair: Enable this flag when using the FLAIR model.
  • --precision: Fixed as amp in our paper.
  • --workers: Adjustable according to your system.

Retrieval Tasks

Enable the following flags in src/inference.sh for different retrieval tasks:

  1. Standard Retrieval
    • --coco-data-root-dir: Root directory of the COCO dataset.
    • --flickr-data-root-dir: Root directory of the Flickr30k dataset.
    • --retrieval-coco: Activate the COCO retrieval task.
    • --retrieval-flickr: Activate the Flickr retrieval task.
  2. Fine-grained Retrieval
    • --iiw-retrieval-dir: Root directory of the Image-in-Words dataset.
    • --docci-retrieval-dir: Root directory of the DOCCI dataset.
    • --retrieval-iiw: Activate the Image-in-Words retrieval task.
    • --retrieval-docci: Activate the DOCCI retrieval task.
  3. Long Retrieval
    • --dci-retrieval-dir: Root directory of the DCI dataset.
    • --urban-1k-retrieval-dir: Root directory of the Urban-1K dataset.
    • --sharegpt4v-retrieval-dir: Root directory of the ShareGPT4V dataset.
    • --retrieval-dci: Activate the DCI retrieval task.
    • --retrieval-urban-1k: Activate the Urban1K retrieval task.
    • --retrieval-sharegpt4v-1k: Activate the ShareGPT4V-1K retrieval task.
    • --retrieval-sharegpt4v-10k: Activate the ShareGPT4V-10K retrieval task.
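
For example, to reproduce the standard COCO and Flickr30k retrieval numbers, you would enable --retrieval-coco and --retrieval-flickr together with the corresponding data-root flags inside src/inference.sh and then launch the script. The launch pattern below is an assumption (analogous to how src/train_example.sh is run), and the dataset paths are placeholders:

# inside src/inference.sh, add (placeholder paths; see EVAL_DATASETS.md):
#   --retrieval-coco   --coco-data-root-dir   /path/to/coco
#   --retrieval-flickr --flickr-data-root-dir /path/to/flickr30k
# then launch the script:
source flair_env/bin/activate
bash src/inference.sh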

Training FLAIR

For the results in the main paper, FLAIR uses DreamLIP's recaptioned CC3M-recap, CC12M-recap, YFCC15M-recap, and their combination (Merged-30M). To verify that FLAIR handles various data distributions, we also trained it on the original CC3M and on PixelProse for the appendix of the paper. Note that FLAIR requires all pre-training datasets to be processed into the WebDataset format to achieve higher I/O efficiency for large-scale training. Below, we take CC3M-recap as an example to demonstrate how to prepare the pre-training data; the preparation of the other datasets is similar.

Prepare Pre-training Data

  1. Download DreamLIP's annotations for CC3M-recap: wget https://huggingface.co/datasets/qidouxiong619/dreamlip_long_captions/resolve/main/cc3m_3long_3short_1raw_captions_url.csv
  2. Scrape the images from the URL links using img2dataset; an example command is sketched below.
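
The exact scraping command depends on the CSV layout; the example below is a hedged sketch. The --url_col and --caption_col values are assumptions (inspect the CSV header and adjust, or use --save_additional_columns to carry the extra caption columns into the shards), and the output folder, image size, and process count are placeholders:

# install the scraper if needed
pip install img2dataset
# download the images and pack them into WebDataset shards (placeholder paths and values)
img2dataset \
  --url_list cc3m_3long_3short_1raw_captions_url.csv \
  --input_format csv \
  --url_col url \
  --caption_col caption \
  --output_format webdataset \
  --output_folder /datasets/cc3m_recap \
  --image_size 256 \
  --processes_count 16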

Single-node training script

You can find the example single-node training script src/train_example.sh in this repo to test whether training runs. Important flags:

  • --train-data: Root directory where the training data (shards) is stored.
  • --train-num-samples: In the example file we set this to 2823019 because that is the total number of image-text pairs we obtained for CC3M-recap. Adjust it to your own data.

The single-node training script src/train_example.sh has been tested to run without problems. We recommend running your job on a single node first before starting multi-node training:

source flair_env/bin/activate
bash src/train_example.sh

Multi-node training script (Slurm)

In practice, FLAIR is trained on 8 NVIDIA A100 (40GB) GPUs for CC3M and on 32 NVIDIA A100 (40GB) GPUs for all larger datasets; we ran all experiments using Slurm. In src/, we provide example Slurm training scripts for each dataset: train_cc3m_slurm.sh, train_cc12m_slurm.sh, train_yfcc15m_slurm.sh, and train_merged_30m_slurm.sh.

These training scripts contain all the hyperparameters needed to reproduce the training, but you may need to modify them to run on your cluster. Set --train-data to the directory storing the dataset shards and --train-num-samples to the actual number of valid samples in that dataset. When training on the Merged-30M dataset, note that --train-data should combine the dataset paths of cc3m-recap, cc12m-recap, and yfcc15m-recap, separated by ::, for example:

--train-data '/datasets/yfcc15m_recap/yfcc15m-train-{0000..3636}.tar::/datasets/cc12m_recap/cc12m-train-{0000..2175}.tar::/datasets/cc3m_recap/cc3m-train-{0001..0575}.tar'

After configuring the Slurm scripts, you can launch the experiment (taking CC3M-recap as an example) with:

source flair_env/bin/activate
sbatch src/train_cc3m_slurm.sh

Acknowledgements

We thank OpenCLIP for providing the amazing codebase. We also acknowledge DreamLIP and PixelProse for providing various pre-training datasets with captions from MLLMs, and we are grateful to LoTLIP for providing the detailed scheme for the long image-text retrieval task.

Citations

If you find our work useful, please star this repo and cite:

@article{xiao2024flair,
  title={FLAIR: VLM with Fine-grained Language-informed Image Representations},
  author={Xiao, Rui and Kim, Sanghwan and Georgescu, Mariana-Iuliana and Akata, Zeynep and Alaniz, Stephan},
  journal={arXiv preprint arXiv:2412.03561},
  year={2024}
}
