Repository for experimenting with different inter-epoch shuffling methods in data parallel training of neural networks.
This repository explores techniques for shuffling data between epochs when training neural networks with data parallelism. The primary goals of this research project are:
- Reducing Communication Overhead: Identify shuffling methods that require less communication between nodes/ranks in distributed training setups.
- Sequential Data Access: Develop strategies that allow for sequential data access during training.
- Maintaining Quality Metrics: Preserve or improve the quality of trained models while optimizing data shuffling, keeping results on par with the state of the art (SOTA).
This repository is relevant for you if:
- You're interested in speeding up the training of neural networks by reducing shuffling overhead.
- You're exploring Distributed Data Parallel (DDP) training on a Slurm cluster.
- You're curious about innovative techniques for managing data in deep learning projects.
1. Clone the repository
2. Install the dependencies from requirements.txt (e.g. pip install -r requirements.txt)
3. Adjust the system-config.yaml file to your needs
4. Add your own run-config in the run-configs folder

Slurm cluster:

5. Add your own Slurm batch script in the slurm folder
6. Submit the Slurm job

Local (fixed to 4 processes, CPU execution):

5. torchrun --nproc_per_node=4 main.py --config_path=run-configs/<your-run-config>.yaml
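For the local run, torchrun spawns the four processes and sets the usual DDP environment variables (RANK, WORLD_SIZE, LOCAL_RANK, MASTER_ADDR, MASTER_PORT). The sketch below is not the repository's actual main.py; it only illustrates, under those assumptions, the minimal shape of an entry point that such a command launches, using the gloo backend for CPU execution.

```python
import argparse

import torch.distributed as dist

# Illustrative sketch only -- the real main.py in this repository may differ.
# torchrun provides RANK/WORLD_SIZE/MASTER_ADDR/MASTER_PORT via the environment,
# so init_process_group() can rely on the default env:// initialization.
def main():
    parser = argparse.ArgumentParser()
    parser.add_argument("--config_path", type=str, required=True)
    args = parser.parse_args()

    dist.init_process_group(backend="gloo")  # gloo backend for the CPU-only local run
    rank, world_size = dist.get_rank(), dist.get_world_size()
    print(f"[rank {rank}/{world_size}] running with config {args.config_path}")

    # ... build dataset, sampler, model and run the training loop here ...

    dist.destroy_process_group()


if __name__ == "__main__":
    main()
```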
The repository is organized as follows:
```
.
├── README.md
├── data
├── notebooks
├── run-configs
├── slurm
├── src
│   ├── main.py
│   ├── data
│   │   ├── data.py
│   │   ├── datasets.py
│   │   ├── partition.py
│   │   └── sorted_dataset.py
│   ├── models
│   │   └── models.py
│   ├── training
│   │   ├── train.py
│   │   ├── custom_sampler.py
│   │   └── stratified_sampler.py
│   ├── util
│   │   ├── cases.py
│   │   └── helper.py
│   └── visualization
├── test
├── wandb
└── system-config.yaml
```
The system-config.yaml file contains all configuration elements that are system-specific and run-independent: the DDP port, the system type (server | local), and the full specification of available datasets. Therefore, only this file needs to be changed when adding, removing, or updating datasets.
The file is structured as follows:
```yaml
system: (server | local)
ddp:
  port: <port>
datasets:
  <dataset_name>:
    path: <path_to_dataset>
    load-function:
      module: <module_name>
      type: (generic | built-in)
      name: <function_name>
    transforms:
      train:
        - name: <transform_name>
          kwargs:
            <param_name>: <param_value>
        ...
      test:
        ...
```
The load-function specifies how a dataset is loaded (e.g. torchvision.datasets.cifar.CIFAR10 or torchvision.datasets.ImageFolder). The type determines which arguments are passed to that function: train=(True | False) is set only for built-in datasets. The transforms are applied to the dataset in the given order.
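As an illustration of how such an entry could be consumed, the hypothetical helper below reads system-config.yaml and resolves the load-function. The function name and the train/test subdirectory convention for generic loaders are assumptions for the sketch, not necessarily how the repository's data code implements it.

```python
import importlib

import yaml


def load_split(dataset_name, train, transform=None,
               config_path="system-config.yaml"):
    """Hypothetical helper: resolve and call the configured load-function."""
    with open(config_path) as f:
        cfg = yaml.safe_load(f)

    ds_cfg = cfg["datasets"][dataset_name]
    lf = ds_cfg["load-function"]
    module = importlib.import_module(lf["module"])  # e.g. torchvision.datasets.cifar
    load_fn = getattr(module, lf["name"])           # e.g. CIFAR10 or ImageFolder

    if lf["type"] == "built-in":
        # Built-in datasets such as CIFAR10 accept a train flag directly.
        return load_fn(root=ds_cfg["path"], train=train, transform=transform)

    # For generic loaders such as ImageFolder, a per-split directory is one
    # plausible convention (an assumption here, not prescribed by the config).
    split_dir = "train" if train else "test"
    return load_fn(root=f"{ds_cfg['path']}/{split_dir}", transform=transform)
```

The transforms listed under train/test would be composed (e.g. with torchvision.transforms.Compose) and passed in as the transform argument.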
- When using torch.hub.load(), the repository archive might get downloaded but not unzipped, which results in a FileNotFoundError for the hubconf.py file. To fix this, manually unzip the downloaded archive and make sure the extracted folder has the correct name (e.g. pytorch_vision_v0.10.0).
- To efficiently sort the dataset and calculate the label frequencies, the script requires the dataset object to have a targets attribute containing the labels of the dataset (as integers) in the same order as the samples; see the sketch below.
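For example, torchvision's CIFAR10 already exposes such a targets list. The snippet below is only a minimal sketch of the kind of computation this requirement enables (the function name is illustrative, not one of the repository's actual helpers):

```python
from collections import Counter

import numpy as np


def label_stats(dataset):
    # Requires dataset.targets: integer labels aligned with the samples.
    targets = np.asarray(dataset.targets)
    frequencies = Counter(targets.tolist())            # label -> number of samples
    sort_order = np.argsort(targets, kind="stable")    # indices that sort samples by label
    return frequencies, sort_order
```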