Self-Supervised Incremental Learning of Object Representations from Arbitrary Image Sets - WACV 2025 - Official Repository
George Leotescu*, Alin-Ionut Popa*, Diana Grigore, Daniel Voinea, Pietro Perona
This is the official PyTorch implementation of MILE.
This repository introduces an innovative framework for creating comprehensive visual concept embeddings from multiple images using self-supervised learning (SSL). Key features include:
- a structured multi-image embedding via sequential latent space modeling
- an incremental object representation update mechanism
- an SSL training framework adapted for multi-image inputs.
The approach introduces a new formulation of object retrieval based on image sets and lifts the geometric consistency constraints that rigid objects impose in multi-image setups. Validated on the ABO and iNaturalist datasets, the framework outperforms competitive self- and weakly-supervised methods on object retrieval tasks. It uses ViT-family backbones but is compatible with any image encoder. The multi-image formulation accommodates varied visual distributions of the same object without viewpoint-consistency constraints, making it more flexible than traditional multi-view setups, and it can therefore learn from a broader and more diverse training set, avoiding the limitations of contrastive learning approaches that require numerous images per object.
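The incremental update mechanism can be pictured as folding per-image backbone features into a single object embedding one image at a time. The sketch below is illustrative only: the `ObjectEmbeddingUpdater` class, its tanh-gated update rule (loosely motivated by the `--gate_tanh` training flag), and the feature dimension are assumptions for exposition, not the actual MILE module.

```python
import torch
import torch.nn as nn

class ObjectEmbeddingUpdater(nn.Module):
    """Illustrative sketch only (not the MILE module): incrementally fuse
    per-image features into one object embedding with a gated update."""

    def __init__(self, dim: int = 1024):
        super().__init__()
        self.gate = nn.Linear(2 * dim, dim)    # how much of the state to rewrite
        self.update = nn.Linear(2 * dim, dim)  # proposed new content

    def forward(self, image_feats: torch.Tensor) -> torch.Tensor:
        # image_feats: (num_images, dim), e.g. CLS tokens from a ViT backbone.
        state = image_feats[0]
        for feat in image_feats[1:]:
            joint = torch.cat([state, feat], dim=-1)
            g = torch.sigmoid(self.gate(joint))         # update gate in [0, 1]
            candidate = torch.tanh(self.update(joint))  # gated candidate state
            state = (1 - g) * state + g * candidate     # incremental update
        return state  # single embedding representing the whole image set

# Usage: fuse four features of the same object into one embedding.
feats = torch.randn(4, 1024)
print(ObjectEmbeddingUpdater(dim=1024)(feats).shape)  # torch.Size([1024])
```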
To install the dependencies:
```bash
conda create -n MILE python=3.9
conda activate MILE
pip install -r requirements.txt
```
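Before downloading data, a quick sanity check that the environment is usable (assuming `requirements.txt` installs PyTorch, which the training and inference scripts require):

```python
# Quick environment check: PyTorch imports and (optionally) sees a GPU.
import torch

print("torch", torch.__version__, "| CUDA available:", torch.cuda.is_available())
```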
To download and preprocess the ABO data:
```bash
cd ./data_processing/abo
bash download_raw_data.sh
python abo_preprocess.py
```
This generates the data splits used in the original CVPR22 Multi-View Retrieval experiments.
The default output location is:
`./data_preprocesing/abo/abo_raw_data/inference_by_class`
To download and preprocess the iNaturalist data:
```bash
cd ./data_processing/inat
bash download_raw_data.sh
python inat_preprocess.py
```
The default output location is:
`./data_preprocesing/inat/inat_raw_data/inference_by_class`
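Both preprocessing steps produce an `inference_by_class` folder. A quick way to sanity-check the output, assuming one subdirectory per class holding that class's images (the exact layout is determined by the preprocessing scripts), is:

```python
# Count classes and files in the preprocessed output (layout assumed:
# one subdirectory per class under inference_by_class).
from pathlib import Path

for root in ["./data_preprocesing/abo/abo_raw_data/inference_by_class",
             "./data_preprocesing/inat/inat_raw_data/inference_by_class"]:
    classes = [d for d in Path(root).iterdir() if d.is_dir()]
    n_files = sum(len(list(d.glob("*"))) for d in classes)
    print(f"{root}: {len(classes)} classes, {n_files} files")
```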
To train MILE, run:

```bash
# OVERRIDE CKPT_ROOT and DATA_PATH with local paths
DATA_PATH=/large_shared/cosine-asin-image-similarity/train_wo_per_asin_case_data_dedupe_gt4_unique_vitl
python train_MILE.py --data_path $DATA_PATH --saveckp_freq 1 --arch dinov2 --model_name dinov2_vitl14_reg --view multi-view --out_dim 65536 --norm_last_layer False --warmup_teacher_temp 0.04 --teacher_temp 0.07 --warmup_teacher_temp_epochs 0 --lr 6.25e-07 --momentum_teacher 0.996 --patch_size 14 --epochs 20 --batch_size_per_gpu 3 --gate_tanh on --peft on --output_dir ./test/tmp/
```
To perform inference with MILE on the ABO dataset, run:
```bash
CKPT_ROOT=./test/ckpts/
DATA_PATH=./data_preprocesing/abo/abo_raw_data/inference_by_class
python inference_MILE.py --local_run --ckpt_root $CKPT_ROOT --dataset_name Abo_retrieval_test --test_source test.query --target_source test.target --patch_size 14 --view multi-view --output_type latent --k_max 10 --data_path $DATA_PATH --peft on --arch dinov2 --samples_per_class 4 --model_name blrp-dinov2-vitl14-reg-lora0-fw-5e-06-mt09996-bs3-800 --ckpt_name checkpoint0390.pth
```
To perform inference with MILE on the iNaturalist dataset, run:
```bash
CKPT_ROOT=./test/ckpts/
DATA_PATH=./data_preprocesing/inat/inat_raw_data/inference_by_class
python inference_MILE.py --local_run --ckpt_root $CKPT_ROOT --dataset_name iNaturalist_100 --test_source query --target_source target --patch_size 16 --view multi-view --output_type latent --k_max 10 --data_path $DATA_PATH --subset_classes 100 --arch maws --peft on --samples_per_class 4 --base_model_name vit_h14_maws --model_name blrp-vit-h14-maws-nat-lora0-fw-5e-06-mt09996-bs2-8000 --ckpt_name checkpoint2475.pth
```
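The AR@10 numbers produced by the inference script correspond to a recall-at-k style retrieval metric: each query embedding is matched against the gallery by cosine similarity and counted as a hit if a correct item appears among the top k. A minimal sketch with placeholder tensors (the exact metric is implemented inside inference_MILE.py):

```python
# Illustrative recall@k over L2-normalized embeddings. The tensors below are
# placeholders standing in for the embeddings produced by the inference script.
import torch
import torch.nn.functional as F

def recall_at_k(query_emb, query_labels, target_emb, target_labels, k=10):
    q = F.normalize(query_emb, dim=-1)
    t = F.normalize(target_emb, dim=-1)
    sims = q @ t.T                                   # cosine similarities
    topk = sims.topk(k, dim=-1).indices              # best k gallery items per query
    hits = (target_labels[topk] == query_labels[:, None]).any(dim=-1)
    return hits.float().mean().item()

# Example call with random data, just to show the expected shapes.
q, t = torch.randn(5, 1024), torch.randn(50, 1024)
ql, tl = torch.randint(0, 10, (5,)), torch.randint(0, 10, (50,))
print(recall_at_k(q, ql, t, tl, k=10))
```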
To validate your setup, run the following test:
```bash
cd ./test
bash test.sh
```
The expected results are:
- ABO: 85.90
- iNat: 95.05
You can download our trained models from our HuggingFace repository.
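For scripted downloads, `huggingface_hub` can fetch individual checkpoint files. The repository id below is a placeholder to be replaced with the actual id from the model page (the filename matches the checkpoint used in the ABO inference command above):

```python
# Placeholder example: download a checkpoint with huggingface_hub.
# Replace <org>/<mile-model-repo> with the actual repository id.
from huggingface_hub import hf_hub_download

ckpt_path = hf_hub_download(
    repo_id="<org>/<mile-model-repo>",   # placeholder repository id
    filename="checkpoint0390.pth",       # checkpoint name used in the ABO command
)
print("Checkpoint saved to:", ckpt_path)
```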
Our models achieve the following performance on the iNaturalist and ABO retrieval benchmarks:
| Setting | Model | Architecture | AR@10 |
|---|---|---|---|
| Single-Image | DINO [1] | ViT-S | 67.2 |
| | EvaClip [2] | ViT-B | 80.85 |
| | EvaClip [2] | ViT-L | 84.71 |
| | DINOv2 [3] | ViT-S | 77.21 |
| | DINOv2 [3] | ViT-B | 82.79 |
| | DINOv2 [3] | ViT-L | 84.81 |
| | DINO-REG [4] | ViT-L | 84.15 |
| | ft-MAWS [5] | ViT-B | 83.90 |
| | ft-MAWS [5] | ViT-L | 85.27 |
| | ft-MAWS [5] | ViT-H | 86.40 |
| Multi-Image | MILE w. DINOv2 | ViT-S | 77.77 |
| | MILE w. DINOv2 | ViT-B | 86.57 |
| | MILE w. DINOv2 | ViT-L | 90.96 |
| | MILE w. DINO-REG | ViT-L | 93.2 |
| | MILE w. MAWS | ViT-B | 85.85 |
| | MILE w. MAWS | ViT-L | 90.71 |
| | MILE w. MAWS | ViT-H | 94.02 |
iNaturalist Category Retrieval Results. MILE w. MAWS (ViT-Huge) outperforms all single-image baselines significantly. The ft prefix indicates MAWS baselines pre-fine-tuned on iNaturalist. Our method excels across all metrics, demonstrating robustness to structural variations and background shifts in matched object classes.
| Setting | Model | Architecture | AR@10 |
|---|---|---|---|
| Single-Image | DINO [1] | ViT-S | 47.38 |
| | DINO [1] | ViT-S | 57.9 |
| | EvaClip [2] | ViT-B | 48.13 |
| | EvaClip [2] | ViT-L | 55.43 |
| | DINOv2 [3] | ViT-S | 57.40 |
| | DINOv2 [3] | ViT-B | 61.5 |
| | DINOv2 [3] | ViT-L | 66.93 |
| | ft-DINOv2 [3] | ViT-L | 80.49 |
| | DINO-REG [4] | ViT-L | 70.81 |
| | ft-DINO-REG [4] | ViT-L | 84.72 |
| | MAWS [5] | ViT-B | 47.81 |
| | MAWS [5] | ViT-L | 48.54 |
| | MAWS [5] | ViT-H | 47.04 |
| Multi-Image | MILE w. DINO | ViT-S | 63.25 |
| | MILE w. DINOv2 | ViT-S | 65.8 |
| | MILE w. DINOv2 | ViT-B | 76.61 |
| | MILE w. DINOv2 | ViT-L | 79.66 |
| | MILE w. DINO-REG | ViT-L | 86.33 |
| | MILE w. MAWS | ViT-B | 72.51 |
| | MILE w. MAWS | ViT-L | 82.51 |
| | MILE w. MAWS | ViT-H | 81.65 |
ABO Object Retrieval Results. The prefix ft denotes the fine-tuned version of the single-image model on ABO; we fine-tuned only the best-performing pretrained backbones. All approaches are self-supervised except EvaClip, which uses weak supervision in the form of image-text pairs. Our approach achieves superior performance on the AR@10 metric, demonstrating robustness to setups where the structure of the matched object changes between the query and gallery visual content.
See CONTRIBUTING for more information.
This project is released under the CC-BY-NC 4.0 license. See LICENSE for more details.
This repository is based on the DINO repository. All modified DINO-related code has been consolidated within the `dino` directory, accompanied by the necessary licensing details and attributions.
```bibtex
@inproceedings{leotescu2024mile,
  title={Self-Supervised Incremental Learning of Object Representations from Arbitrary Image Sets},
  author={Leotescu, George and Popa, Alin-Ionut and Grigore, Diana and Voinea, Daniel and Perona, Pietro},
  booktitle={Proceedings of WACV},
  year={2025}
}
```