Finetuning DINOv2 with LoRA for Image Segmentation

This repository explores finetuning the DINOv2 (Oquab et al., 2024) encoder weights using Low-Rank Adaptation (LoRA; Hu et al., 2021) and a simple 1x1 convolution decoder. LoRA makes it easier to finetune to new tasks without adjusting the original encoder weights: the pre-trained weights stay frozen and only a small set of low-rank weights added to each encoder block is trained. The DINOv2 encoder weights are learned with self-supervised learning and capture the natural image domain well. For example, applying PCA to the encoder outputs gives a coarse segmentation of the objects in the image, with semantically similar objects shown in the same color.
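
As a rough illustration of the low-rank idea from Hu et al. (2021), the sketch below wraps a frozen weight matrix with a trainable update of rank r. It is a NumPy sketch and not this repository's implementation; the class name `LoRALinear`, the `alpha` scaling, the initialization, and the 768x768 weight are assumptions made for the example.

```python
import numpy as np


class LoRALinear:
    """Frozen linear map W plus a trainable low-rank update B @ A (hypothetical helper)."""

    def __init__(self, weight: np.ndarray, r: int = 3, alpha: float = 1.0, seed: int = 0):
        rng = np.random.default_rng(seed)
        out_dim, in_dim = weight.shape
        self.weight = weight                           # frozen pre-trained weights
        self.A = rng.normal(0.0, 0.01, (r, in_dim))    # trainable, small random init
        self.B = np.zeros((out_dim, r))                # trainable, zero init: update starts at 0
        self.scale = alpha / r

    def __call__(self, x: np.ndarray) -> np.ndarray:
        # x: (batch, in_dim) -> (batch, out_dim)
        return x @ self.weight.T + self.scale * (x @ self.A.T) @ self.B.T

    def trainable_params(self) -> int:
        return self.A.size + self.B.size


W = np.random.default_rng(1).normal(size=(768, 768))  # placeholder for a pre-trained weight
layer = LoRALinear(W, r=3)
x = np.ones((2, 768))
out = layer(x)
```

With r=3 and a 768x768 weight, the update adds 4,608 trainable parameters next to 589,824 frozen ones. Because `B` starts at zero, the wrapped layer initially reproduces the frozen layer exactly.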

Check out the Explanation.ipynb notebook for a more detailed walkthrough of the code and ideas behind it.

Additionally, I tested FeatUp, a more recent method that uses PCA and upsamples the embeddings in the feature space, producing higher-resolution output. See the Embedding_visualization.ipynb notebook.
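
The PCA visualization of encoder outputs can be sketched as follows. The patch features here are random placeholders standing in for real encoder outputs, and the function name `pca_rgb`, the 22x22 grid and the 384-dimensional embeddings are assumptions for illustration, not code from the notebooks.

```python
import numpy as np


def pca_rgb(patch_features: np.ndarray, grid_hw: tuple[int, int]) -> np.ndarray:
    """Project (num_patches, dim) features onto their top 3 principal components
    and rescale each component to [0, 1] so the result can be shown as an RGB image."""
    h, w = grid_hw
    centered = patch_features - patch_features.mean(axis=0, keepdims=True)
    # Rows of vt are the principal directions, ordered by explained variance.
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    projected = centered @ vt[:3].T                  # (num_patches, 3)
    lo = projected.min(axis=0, keepdims=True)
    hi = projected.max(axis=0, keepdims=True)
    rgb = (projected - lo) / (hi - lo + 1e-8)
    return rgb.reshape(h, w, 3)


# Placeholder features: a 22x22 patch grid with 384-dimensional embeddings.
features = np.random.default_rng(0).normal(size=(22 * 22, 384))
image = pca_rgb(features, (22, 22))
```

With real encoder features, patches belonging to semantically similar objects tend to land close together in the first components and therefore receive similar colors.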

output.mp4

Setup

Install the packages using the requirements.txt file.

# using conda
conda create --name dino python=3.11
conda activate dino
# Install the package for dino_finetune imports,
pip install -e .

FeatUp is an extra dependency, needed only if you want to investigate the encoder features at higher resolution. I recreated methods to process videos and images in the notebook Embedding_visualization.ipynb. To run the notebook yourself you need to install FeatUp, and because it uses a custom CUDA kernel you need to make sure all the CUDA environment variables are configured properly.

# For CUDA_HOME/nvcc, make sure you install the cudatoolkit-dev tools
conda install -c conda-forge cudatoolkit-dev -y
# Now you should be able to run, 
nvcc -V
# So you can set the CUDA_HOME path
export CUDA_HOME=$CONDA_PREFIX
# For the LD_LIBRARY_PATH install cudnn
conda install -c conda-forge cudnn
# And set the variable
export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:$CONDA_PREFIX/lib

The section below explains the main flags of main.py used to finetune on different datasets.

Usage

An example to run finetuning on the VOC dataset with LoRA and an FPN decoder.

python main.py --exp_name base_voc --dataset voc --size base --use_lora --img_dim 308 308 --epochs 50 --use_fpn
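
For the `--img_dim 308 308` setting in the command above, here is a small sketch of how an image maps to patch tokens and how a 1x1 convolution decoder acts on them. It assumes a patch size of 14 (taken from the ViT-L/14 model name in the results). The embedding size, class count, random weights and nearest-neighbour upsampling are placeholders, not the repository's decoder.

```python
import numpy as np

patch_size = 14            # assumed from the ViT-L/14 naming
img_h, img_w = 308, 308    # values passed to --img_dim
assert img_h % patch_size == 0 and img_w % patch_size == 0
grid_h, grid_w = img_h // patch_size, img_w // patch_size   # 22 x 22 patches

embed_dim, n_classes = 64, 21   # placeholder sizes for illustration
rng = np.random.default_rng(0)
tokens = rng.normal(size=(grid_h * grid_w, embed_dim))      # stand-in for encoder patch tokens

# A 1x1 convolution is a linear map applied independently at every spatial position.
conv_weight = rng.normal(size=(n_classes, embed_dim))
conv_bias = np.zeros(n_classes)
logits = tokens @ conv_weight.T + conv_bias                 # (484, n_classes)
logits = logits.reshape(grid_h, grid_w, n_classes)

# Upsample patch-level logits back to pixel resolution (nearest neighbour here).
upsampled = logits.repeat(patch_size, axis=0).repeat(patch_size, axis=1)
prediction = upsampled.argmax(axis=-1)                      # (308, 308) class map
```

Choosing image dimensions that are multiples of the patch size keeps the patch grid whole, which is why 308 (22 x 14) is a convenient value under this assumption.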

Flags
Some explanation of the more useful flags for running experiments.

  • --exp_name (str): The name of the experiment. This is used to identify the experiment and save results accordingly.
  • --debug (flag): A boolean flag to indicate whether to debug the main.py training code.
  • --dataset (str): The name of the dataset to use, either voc or ade20k.
  • --size (str): The size of the DINOv2 backbone: small, base, large, or giant.
  • --r (int): The LoRA rank (r), which determines the number of trainable parameters. Usually a small value such as 3-9.
  • --use_lora (flag): A boolean flag indicating whether to use Low-Rank Adaptation (LoRA). If this flag is present, LoRA is used.
  • --use_fpn (flag): A boolean flag to indicate whether to use the FPN decoder.
  • --lora_weights (str): Path to the file location to load the LoRA weights and decoder head from.
  • --img_dim (tuple of int): The dimensions of the input images (height width). This should be specified as two integers. Example: 308 308.
  • --epochs (int): The number of training epochs. This determines how many times the model will pass through the entire training dataset. Example: 50.

There are additional training parameters not listed here, such as the learning rate and batch size.
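
The flags listed above could be declared with `argparse` roughly as follows. This is a sketch based only on the flag descriptions in this README: the default values and `choices` are assumptions, and the actual main.py may define them differently.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    # All defaults below are illustrative assumptions, not values taken from main.py.
    parser = argparse.ArgumentParser(description="Finetune DINOv2 for segmentation (sketch)")
    parser.add_argument("--exp_name", type=str, default="lora")
    parser.add_argument("--debug", action="store_true")
    parser.add_argument("--dataset", type=str, default="voc", choices=["voc", "ade20k"])
    parser.add_argument("--size", type=str, default="base",
                        choices=["small", "base", "large", "giant"])
    parser.add_argument("--r", type=int, default=3)
    parser.add_argument("--use_lora", action="store_true")
    parser.add_argument("--use_fpn", action="store_true")
    parser.add_argument("--lora_weights", type=str, default=None)
    parser.add_argument("--img_dim", type=int, nargs=2, default=[308, 308])
    parser.add_argument("--epochs", type=int, default=50)
    return parser


# Parse the example command from the Usage section.
args = build_parser().parse_args(
    "--exp_name base_voc --dataset voc --size base --use_lora "
    "--img_dim 308 308 --epochs 50 --use_fpn".split()
)
```

Flags declared with `action="store_true"` are off unless present, which matches the description of `--use_lora`, `--use_fpn` and `--debug` as boolean flags.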

Results

Pascal VOC
I achieve a validation mean IoU of approximately 85.2% using LoRA and a 1x1 convolution decoder. When applying ImageNet-C corruptions (Hendrycks & Dietterich, 2019) to test robustness on Pascal VOC, the validation mean IoU drops to 72.2% at corruption severity level 5 (the maximum). The qualitative performance of this network is illustrated in the figure below. A drop of 13.0 percentage points at the maximum severity, together with the qualitative results, suggests that the finetuned weights remain reasonably robust to image corruption.
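
For reference, mean IoU can be computed from a confusion matrix as in the sketch below. The helper name `mean_iou` is hypothetical, and the sketch does not handle an ignore label, so it is not necessarily how the numbers in this README were produced.

```python
import numpy as np


def mean_iou(pred: np.ndarray, target: np.ndarray, n_classes: int) -> float:
    """Mean IoU over classes that appear in the prediction or the target."""
    # Confusion matrix: rows are ground-truth classes, columns are predicted classes.
    idx = target.ravel() * n_classes + pred.ravel()
    conf = np.bincount(idx, minlength=n_classes ** 2).reshape(n_classes, n_classes)
    intersection = np.diag(conf)
    union = conf.sum(axis=0) + conf.sum(axis=1) - intersection
    present = union > 0
    return float((intersection[present] / union[present]).mean())


# Tiny 2x3 label maps with three classes.
target = np.array([[0, 0, 1], [1, 2, 2]])
pred = np.array([[0, 1, 1], [1, 2, 0]])
score = mean_iou(pred, target, n_classes=3)  # per-class IoU: 1/3, 2/3, 1/2 -> mean 0.5
```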

You can load the pre-trained weights with the --lora_weights flag or with the load_parameters function call. "With registers" means that extra global context tokens are learned; see Darcet et al. (2024). All models are finetuned for 100 epochs.

| Finetuned components | Model | # of params | With registers | Pascal VOC validation mIoU | Pascal VOC-C level 5 validation mIoU | Directory |
|---|---|---|---|---|---|---|
| 1x1 Conv decoder | ViT-L/14 distilled | 300 M | | 70.9% | 66.6% | output/base_voc_no_lora.pt |
| LoRA + 1x1 Conv decoder | ViT-L/14 distilled | 300 M | | 85.2% | 72.2% | output/base_voc.pt |
| LoRA + FPN decoder | ViT-L/14 distilled | 300 M | | 74.1% | 65.6% | output/base_voc_fpn.pt |

ADE20k
I achieve a validation mean IoU of approximately 62.2% using LoRA and a 1x1 convolution decoder. On ADE20k-C at corruption severity level 5, the validation mean IoU drops to 55.8%. The qualitative performance of this network is illustrated in the figure below.

| Finetuned components | Model | # of params | With registers | ADE20k validation mIoU | ADE20k-C level 5 validation mIoU | Directory |
|---|---|---|---|---|---|---|
| 1x1 Conv decoder | ViT-L/14 distilled | 300 M | | 57.2% | 54.4% | output/base_ade20k_no_lora.pt |
| LoRA + 1x1 Conv decoder | ViT-L/14 distilled | 300 M | | 62.2% | 55.8% | output/base_ade20k_lora.pt |
| LoRA + FPN decoder | ViT-L/14 distilled | 300 M | | 62.0% | 54.7% | output/base_ade20k_fpn.pt |

Citing

If you reference or use the codebase in your research, please cite:

@article{2024dinov2_lora_seg,
      title={Finetuning DINOv2 with LoRA for Image Segmentation},
      author={Rob van Gastel},
      year={2024}
    }

References

Oquab, M., Darcet, T., Moutakanni, T., Vo, H., Szafraniec, M., Khalidov, V., Fernandez, P., Haziza, D., Massa, F., El-Nouby, A., Assran, M., Ballas, N., Galuba, W., Howes, R., Huang, P.-Y., Li, S.-W., Misra, I., Rabbat, M., Sharma, V., … Bojanowski, P. (2024). DINOv2: Learning Robust Visual Features without Supervision (arXiv:2304.07193). arXiv. http://arxiv.org/abs/2304.07193

Darcet, T., Oquab, M., Mairal, J., & Bojanowski, P. (2024). Vision Transformers Need Registers (arXiv:2309.16588). arXiv. https://doi.org/10.48550/arXiv.2309.16588

Hu, E. J., Shen, Y., Wallis, P., Allen-Zhu, Z., Li, Y., Wang, S., Wang, L., & Chen, W. (2021). LoRA: Low-Rank Adaptation of Large Language Models (arXiv:2106.09685). arXiv. http://arxiv.org/abs/2106.09685

Hendrycks, D., & Dietterich, T. G. (2019). Benchmarking Neural Network Robustness to Common Corruptions and Surface Variations (arXiv:1807.01697). arXiv. https://doi.org/10.48550/arXiv.1807.01697