1 KAUST, 2 University of Oxford
We explore the impact of training with more diverse datasets, characterized by the number of unique samples, on the performance of self-supervised learning (SSL) under a fixed computational budget. Our findings consistently demonstrate that increasing pretraining data diversity enhances SSL performance, albeit only when the distribution distance to the downstream data is minimal. Notably, even with exceptionally large pretraining data diversity, achieved for example through web crawling or diffusion-generated data, the inherent distribution shift remains a challenge. Our experiments are comprehensive, covering seven SSL methods on large-scale datasets such as ImageNet and YFCC100M and amounting to over 200 GPU days.
Follow the steps below to set up the environment, prepare the dataset, and run the training pipeline:
- Create the Conda Environment
  Create a Conda environment named `ssl_diversity` with Python 3.10:
      conda create -n ssl_diversity python=3.10
      conda activate ssl_diversity
- Install Required Packages
  Install the required Python packages specified in the `requirements.txt` file:
      pip install -r requirements.txt
- Install NVIDIA DALI (Optional)
  If you plan to use NVIDIA DALI for augmentations, install it using the following command:
      pip install nvidia-dali-cuda110
- Prepare the Dataset
  Run the `create_csv.py` script to generate a CSV file listing the image paths:
  - Open the script and update the variables as needed:
        root_directory = "some_images"  # Replace with the root directory containing your images
        output_file = "image_paths.csv"  # Specify the desired name of the output CSV file
  - Execute the script:
        python create_csv.py
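For readers who want to see what the dataset-preparation step produces, here is a minimal sketch of a script that walks a directory and writes one image path per row. It is not the repository's `create_csv.py`; the helper names `collect_image_paths` and `write_paths_csv`, the extension list, and the single `path` header are assumptions chosen to match the `path_column: "path"` setting used in the configuration.

```python
import csv
from pathlib import Path

# Assumed set of image extensions; the real script may accept a different set.
IMAGE_EXTENSIONS = {".jpg", ".jpeg", ".png", ".bmp", ".webp"}


def collect_image_paths(root_directory):
    """Return the sorted paths of all image files found under root_directory."""
    root = Path(root_directory)
    return sorted(
        str(p)
        for p in root.rglob("*")
        if p.is_file() and p.suffix.lower() in IMAGE_EXTENSIONS
    )


def write_paths_csv(paths, output_file, path_column="path"):
    """Write one image path per row under a single header column."""
    with open(output_file, "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow([path_column])  # header row (assumed column name)
        writer.writerows([p] for p in paths)


# Example usage (hypothetical values, mirroring the variables above):
# paths = collect_image_paths("some_images")
# write_paths_csv(paths, "image_paths.csv")
```

Sorting the paths keeps the CSV stable across runs, which helps when comparing experiments that use different subsets of the same file list.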
- Update Your YAML Configuration File
  Configure the dataset section in your YAML file as follows:
      # Dataset configuration
      data:
        dataset: "custom"  # Using custom dataset type for CSV
        train_path: "/home/hammh0a/new/solo-learn/image_paths.csv"  # Path to the generated CSV file
        format: "csv"  # Specify CSV format
        num_workers: 8
        no_labels: True
        fraction: 1.0  # Adjust between 0.0-1.0 for partial dataset use
        root_dir: "./"  # Root directory for relative image paths
        path_column: "path"  # Name of the column containing image paths in CSV
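Before launching a long run, it can help to confirm that the CSV and the `root_dir` / `path_column` settings agree. The sketch below is an optional sanity check, not part of the repository; the function name `find_missing_images` is hypothetical, and it assumes that relative CSV entries are resolved against `root_dir` while absolute entries are used as-is.

```python
import csv
from pathlib import Path


def find_missing_images(csv_path, root_dir="./", path_column="path"):
    """Return the CSV entries that do not resolve to an existing file.

    Assumption: relative paths are joined onto root_dir; absolute paths
    are left unchanged (this is how pathlib's "/" operator behaves).
    """
    missing = []
    with open(csv_path, newline="") as f:
        reader = csv.DictReader(f)
        if path_column not in (reader.fieldnames or []):
            raise KeyError(f"column {path_column!r} not found in {csv_path}")
        for row in reader:
            candidate = Path(root_dir) / row[path_column]
            if not candidate.is_file():
                missing.append(row[path_column])
    return missing


# Example usage (hypothetical values taken from the configuration above):
# missing = find_missing_images("image_paths.csv", root_dir="./", path_column="path")
# print(f"{len(missing)} entries do not resolve to a file")
```

An empty result suggests every listed image is reachable under the assumed resolution rule; a `KeyError` indicates that `path_column` does not match the CSV header.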
- Control Training Data Fraction
  Set the `fraction` parameter in the YAML file to control the percentage of data used during training (e.g., `1.0` for the full dataset, `0.5` for 50%).
- Run the Training Script
  Execute the training process by running the `runner.sh` script. Ensure the correct YAML file is specified in the script:
      bash runner.sh
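To make the effect of the `fraction` setting concrete, the following sketch shows one way a fraction could be turned into a reproducible subset of the image list. This is an illustration under stated assumptions, not the training code's actual sampling logic: the function name `subsample`, the fixed seed, and the rounding rule are all hypothetical.

```python
import random


def subsample(paths, fraction, seed=0):
    """Return a reproducible random subset holding about `fraction` of paths."""
    if not 0.0 < fraction <= 1.0:
        raise ValueError("fraction must be in (0.0, 1.0]")
    paths = list(paths)
    if fraction == 1.0:
        return paths  # full dataset, order preserved
    k = round(len(paths) * fraction)  # assumed rounding rule
    rng = random.Random(seed)  # local generator, so global state is untouched
    return rng.sample(paths, k)


# Example usage: keep half of a hypothetical list of ten image paths.
# half = subsample([f"img_{i}.jpg" for i in range(10)], fraction=0.5)
```

Using a seeded local generator means the same fraction always selects the same images, so runs at different fractions remain comparable.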
If you find this work useful in your research, please consider citing:
@misc{hammoud2024pretraining,
title={On Pretraining Data Diversity for Self-Supervised Learning},
author={Hasan Abed Al Kader Hammoud and Tuhin Das and Fabio Pizzati and Philip Torr and Adel Bibi and Bernard Ghanem},
year={2024},
eprint={2403.13808},
archivePrefix={arXiv},
primaryClass={cs.CV}
}