Self-Supervised Incremental Learning of Object Representations from Arbitrary Image Sets - WACV 2025 - Official Repository
George Leotescu*, Alin-Ionut Popa*, Diana Grigore, Daniel Voinea, Pietro Perona
This is the official PyTorch implementation of MILE.
This repository introduces an innovative framework for creating comprehensive visual concept embeddings from multiple images using self-supervised learning (SSL). Key features include:
- a structured multi-image embedding via sequential latent space modeling
- an incremental object representation update mechanism
- an SSL training framework adapted for multi-image inputs.
The approach introduces a new formulation of object retrieval based on image sets and lifts the geometric consistency constraints that rigid objects impose in multi-image setups. Validated on the ABO and iNaturalist datasets, the framework outperforms competitive self- and weakly-supervised methods on object retrieval tasks. It uses ViT-family backbones but is compatible with any image encoder. The multi-image formulation accommodates varied visual distributions of the same object without viewpoint-consistency constraints, making it more flexible than traditional multi-view setups, and it can therefore learn from a broader and more diverse training set, avoiding the limitations of contrastive learning approaches that require numerous images per object.
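The incremental update mechanism can be pictured as folding per-image backbone features into a single object embedding one image at a time. The sketch below is illustrative only: the `ObjectEmbeddingUpdater` class, its tanh-gated update rule (loosely motivated by the `--gate_tanh` training flag), and the feature dimension are assumptions for exposition, not the actual MILE module.

```python
import torch
import torch.nn as nn

class ObjectEmbeddingUpdater(nn.Module):
    """Illustrative sketch only (not the MILE module): incrementally fuse
    per-image features into one object embedding with a gated update."""

    def __init__(self, dim: int = 1024):
        super().__init__()
        self.gate = nn.Linear(2 * dim, dim)    # how much of the state to rewrite
        self.update = nn.Linear(2 * dim, dim)  # proposed new content

    def forward(self, image_feats: torch.Tensor) -> torch.Tensor:
        # image_feats: (num_images, dim), e.g. CLS tokens from a ViT backbone.
        state = image_feats[0]
        for feat in image_feats[1:]:
            joint = torch.cat([state, feat], dim=-1)
            g = torch.sigmoid(self.gate(joint))         # update gate in [0, 1]
            candidate = torch.tanh(self.update(joint))  # gated candidate state
            state = (1 - g) * state + g * candidate     # incremental update
        return state  # single embedding representing the whole image set

# Usage: fuse four features of the same object into one embedding.
feats = torch.randn(4, 1024)
print(ObjectEmbeddingUpdater(dim=1024)(feats).shape)  # torch.Size([1024])
```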
To install the dependencies:
```bash
conda create -n MILE python=3.9
conda activate MILE
pip install -r requirements.txt
```
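Before downloading data, a quick sanity check that the environment is usable (assuming `requirements.txt` installs PyTorch, which the training and inference scripts require):

```python
# Quick environment check: PyTorch imports and (optionally) sees a GPU.
import torch

print("torch", torch.__version__, "| CUDA available:", torch.cuda.is_available())
```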
To download and preprocess the ABO data:
```bash
cd ./data_processing/abo
bash download_raw_data.sh
python abo_preprocess.py
```
This generates the data splits used in the original CVPR22 Multi-View Retrieval experiments.
The default output location is:
`./data_preprocesing/abo/abo_raw_data/inference_by_class`
To download and preprocess the iNaturalist data:
```bash
cd ./data_processing/inat
bash download_raw_data.sh
python inat_preprocess.py
```
The default output location is:
`./data_preprocesing/inat/inat_raw_data/inference_by_class`
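Both preprocessing steps produce an `inference_by_class` folder. A quick way to sanity-check the output, assuming one subdirectory per class holding that class's images (the exact layout is determined by the preprocessing scripts), is:

```python
# Count classes and files in the preprocessed output (layout assumed:
# one subdirectory per class under inference_by_class).
from pathlib import Path

for root in ["./data_preprocesing/abo/abo_raw_data/inference_by_class",
             "./data_preprocesing/inat/inat_raw_data/inference_by_class"]:
    classes = [d for d in Path(root).iterdir() if d.is_dir()]
    n_files = sum(len(list(d.glob("*"))) for d in classes)
    print(f"{root}: {len(classes)} classes, {n_files} files")
```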
To train MILE, run:

```bash
# OVERRIDE CKPT_ROOT and DATA_PATH with local paths
DATA_PATH=/large_shared/cosine-asin-image-similarity/train_wo_per_asin_case_data_dedupe_gt4_unique_vitl
python train_MILE.py --data_path $DATA_PATH --saveckp_freq 1 --arch dinov2 --model_name dinov2_vitl14_reg --view multi-view --out_dim 65536 --norm_last_layer False --warmup_teacher_temp 0.04 --teacher_temp 0.07 --warmup_teacher_temp_epochs 0 --lr 6.25e-07 --momentum_teacher 0.996 --patch_size 14 --epochs 20 --batch_size_per_gpu 3 --gate_tanh on --peft on --output_dir ./test/tmp/
```
To perform inference with MILE on the ABO dataset, run:
```bash
CKPT_ROOT=./test/ckpts/
DATA_PATH=./data_preprocesing/abo/abo_raw_data/inference_by_class
python inference_MILE.py --local_run --ckpt_root $CKPT_ROOT --dataset_name Abo_retrieval_test --test_source test.query --target_source test.target --patch_size 14 --view multi-view --output_type latent --k_max 10 --data_path $DATA_PATH --peft on --arch dinov2 --samples_per_class 4 --model_name blrp-dinov2-vitl14-reg-lora0-fw-5e-06-mt09996-bs3-800 --ckpt_name checkpoint0390.pth
```
To perform inference with MILE on the iNaturalist dataset, run:
```bash
CKPT_ROOT=./test/ckpts/
DATA_PATH=./data_preprocesing/inat/inat_raw_data/inference_by_class
python inference_MILE.py --local_run --ckpt_root $CKPT_ROOT --dataset_name iNaturalist_100 --test_source query --target_source target --patch_size 16 --view multi-view --output_type latent --k_max 10 --data_path $DATA_PATH --subset_classes 100 --arch maws --peft on --samples_per_class 4 --base_model_name vit_h14_maws --model_name blrp-vit-h14-maws-nat-lora0-fw-5e-06-mt09996-bs2-8000 --ckpt_name checkpoint2475.pth
```
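The AR@10 numbers produced by the inference script correspond to a recall-at-k style retrieval metric: each query embedding is matched against the gallery by cosine similarity and counted as a hit if a correct item appears among the top k. A minimal sketch with placeholder tensors (the exact metric is implemented inside inference_MILE.py):

```python
# Illustrative recall@k over L2-normalized embeddings. The tensors below are
# placeholders standing in for the embeddings produced by the inference script.
import torch
import torch.nn.functional as F

def recall_at_k(query_emb, query_labels, target_emb, target_labels, k=10):
    q = F.normalize(query_emb, dim=-1)
    t = F.normalize(target_emb, dim=-1)
    sims = q @ t.T                                   # cosine similarities
    topk = sims.topk(k, dim=-1).indices              # best k gallery items per query
    hits = (target_labels[topk] == query_labels[:, None]).any(dim=-1)
    return hits.float().mean().item()

# Example call with random data, just to show the expected shapes.
q, t = torch.randn(5, 1024), torch.randn(50, 1024)
ql, tl = torch.randint(0, 10, (5,)), torch.randint(0, 10, (50,))
print(recall_at_k(q, ql, t, tl, k=10))
```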
To validate your setup, run the following test:
```bash
cd ./test
bash test.sh
```
The expected results are:
- ABO: 85.90
- iNat: 95.05
You can download our trained models from our HuggingFace repository.
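For scripted downloads, `huggingface_hub` can fetch individual checkpoint files. The repository id below is a placeholder to be replaced with the actual id from the model page (the filename matches the checkpoint used in the ABO inference command above):

```python
# Placeholder example: download a checkpoint with huggingface_hub.
# Replace <org>/<mile-model-repo> with the actual repository id.
from huggingface_hub import hf_hub_download

ckpt_path = hf_hub_download(
    repo_id="<org>/<mile-model-repo>",   # placeholder repository id
    filename="checkpoint0390.pth",       # checkpoint name used in the ABO command
)
print("Checkpoint saved to:", ckpt_path)
```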
Our models achieve the following performance on the iNaturalist and ABO retrieval benchmarks:
| Setting | Model | Architecture | AR@10 |
|---|---|---|---|
| Single-Image | DINO [1] | ViT-S | 67.2 |
| | EvaClip [2] | ViT-B | 80.85 |
| | EvaClip [2] | ViT-L | 84.71 |
| | DINOv2 [3] | ViT-S | 77.21 |
| | DINOv2 [3] | ViT-B | 82.79 |
| | DINOv2 [3] | ViT-L | 84.81 |
| | DINO-REG [4] | ViT-L | 84.15 |
| | ft-MAWS [5] | ViT-B | 83.90 |
| | ft-MAWS [5] | ViT-L | 85.27 |
| | ft-MAWS [5] | ViT-H | 86.40 |
| Multi-Image | MILE w. DINOv2 | ViT-S | 77.77 |
| | MILE w. DINOv2 | ViT-B | 86.57 |
| | MILE w. DINOv2 | ViT-L | 90.96 |
| | MILE w. DINO-REG | ViT-L | 93.2 |
| | MILE w. MAWS | ViT-B | 85.85 |
| | MILE w. MAWS | ViT-L | 90.71 |
| | MILE w. MAWS | ViT-H | 94.02 |
iNaturalist Category Retrieval Results. MILE w. MAWS (ViT-Huge) outperforms all single-image baselines significantly. The ft prefix indicates MAWS baselines pre-fine-tuned on iNaturalist. Our method excels across all metrics, demonstrating robustness to structural variations and background shifts in matched object classes.
| Setting | Model | Architecture | AR@10 |
|---|---|---|---|
| Single-Image | DINO [1] | ViT-S | 47.38 |
| | DINO [1] | ViT-S | 57.9 |
| | EvaClip [2] | ViT-B | 48.13 |
| | EvaClip [2] | ViT-L | 55.43 |
| | DINOv2 [3] | ViT-S | 57.40 |
| | DINOv2 [3] | ViT-B | 61.5 |
| | DINOv2 [3] | ViT-L | 66.93 |
| | ft-DINOv2 [3] | ViT-L | 80.49 |
| | DINO-REG [4] | ViT-L | 70.81 |
| | ft-DINO-REG [4] | ViT-L | 84.72 |
| | MAWS [5] | ViT-B | 47.81 |
| | MAWS [5] | ViT-L | 48.54 |
| | MAWS [5] | ViT-H | 47.04 |
| Multi-Image | MILE w. DINO | ViT-S | 63.25 |
| | MILE w. DINOv2 | ViT-S | 65.8 |
| | MILE w. DINOv2 | ViT-B | 76.61 |
| | MILE w. DINOv2 | ViT-L | 79.66 |
| | MILE w. DINO-REG | ViT-L | 86.33 |
| | MILE w. MAWS | ViT-B | 72.51 |
| | MILE w. MAWS | ViT-L | 82.51 |
| | MILE w. MAWS | ViT-H | 81.65 |
ABO Object Retrieval Results. The prefix ft denotes the fine-tuned version of the single-image model on ABO; we fine-tuned only the best-performing pretrained backbones. All approaches are self-supervised except EvaClip, which uses weak supervision in the form of image-text pairs. Our approach achieves superior performance on the AR@10 metric, demonstrating robustness to setups where the structure of the matched object changes between the query and gallery visual content.
See CONTRIBUTING for more information.
This project is released under the CC-BY-NC 4.0 license. See LICENSE for more details.
This repository is based on the DINO repository. All modified DINO-related code has been consolidated within the `dino` directory, accompanied by the necessary licensing details and attributions.
```bibtex
@inproceedings{leotescu2024mile,
  title={Self-Supervised Incremental Learning of Object Representations from Arbitrary Image Sets},
  author={Leotescu, George and Popa, Alin-Ionut and Grigore, Diana and Voinea, Daniel and Perona, Pietro},
  booktitle={Proceedings of WACV},
  year={2025}
}
```