feat: update project structure and trainer path
haoxiangsnr committed Dec 28, 2023
1 parent 00c7aef commit e11a5f7
Showing 12 changed files with 504 additions and 519 deletions.
4 changes: 3 additions & 1 deletion audiozen/common_trainer.py
@@ -13,7 +13,7 @@
from accelerate.logging import get_logger
from torch.utils.data import DataLoader
from torchinfo import summary
from tqdm import tqdm
from tqdm.auto import tqdm

from audiozen.acoustics.audio_feature import istft, stft
from audiozen.debug_utils import DebugUnderflowOverflow
@@ -384,6 +384,8 @@ def train(self, train_dataloader: DataLoader, validation_dataloaders):
bar_format="{l_bar}{r_bar}",
colour="green",
disable=not self.accelerator.is_local_main_process,
position=0,
leave=True,
)

for batch_idx, batch in enumerate(dataloader_bar):
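For context, the hunk above switches to `tqdm.auto` (which picks a notebook- or console-appropriate progress bar) and pins the bar with `position=0` and `leave=True`, while `disable` silences it on non-main processes. Below is a minimal, self-contained sketch of the same pattern; the loop body and the main-process flag are placeholders, not the project's actual objects.

```python
# Illustrative sketch only, not the project's code: demonstrates the
# tqdm.auto + position/leave/disable combination used in the hunk above.
from tqdm.auto import tqdm  # picks a notebook- or console-appropriate bar

is_main_process = True  # stands in for accelerator.is_local_main_process

dataloader_bar = tqdm(
    range(100),                   # stands in for the training dataloader
    bar_format="{l_bar}{r_bar}",  # hide the graphical bar, keep the stats
    colour="green",
    disable=not is_main_process,  # only the local main process prints
    position=0,                   # pin the bar to the first line
    leave=True,                   # keep the final bar after the loop ends
)

for batch_idx, batch in enumerate(dataloader_bar):
    pass  # training step would go here
```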
28 changes: 17 additions & 11 deletions docs/source/concepts/experiment_arguments.md
@@ -1,6 +1,6 @@
# Experiment arguments

AudioZEN uses TOML configuration files (`*.toml`) to configure and manage experiments.
Spiking-FullSubNet uses TOML configuration files (`*.toml`) to configure and manage experiments.
Each experiment is configured by a `*.toml` file, which contains the experiment meta information, trainer, loss function, learning rate scheduler, optimizer, model, dataset, and acoustic features. The basename of the `*.toml` file is used as the experiment ID.
You can track configuration changes using version control and reproduce experiments by using the same configuration file. For more information on TOML syntax, visit the [TOML website](https://toml.io/en/).
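As an illustration of that mapping, a configuration file can be loaded and its basename turned into the experiment ID roughly as follows. This is a hedged sketch: the file name `baseline_m.toml` and the use of `tomllib` are assumptions for illustration, not the project's actual loading code.

```python
# Hedged sketch: how a *.toml experiment file maps to an experiment ID.
import tomllib  # Python 3.11+; the project may use a different TOML reader
from pathlib import Path

config_path = Path("baseline_m.toml")  # hypothetical configuration file
with config_path.open("rb") as f:
    config = tomllib.load(f)

experiment_id = config_path.stem  # e.g. "baseline_m"
print(experiment_id, list(config.keys()))  # e.g. sections like "trainer", "model", ...
```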

@@ -92,16 +92,22 @@ clip_grad_norm_value = 5
In this example, AudioZEN will load a custom `Trainer` class from `trainer.py` on the Python search path and initialize it with the arguments in the `[trainer.args]` section. There are multiple ways to specify the `path` argument; see the next section for more details.
In AudioZEN, the `Trainer` class must be a subclass of `audiozen.trainer.base_trainer.BaseTrainer`. It supports at least the following arguments (a brief illustrative sketch follows the table):

| Item | Default | Description |
| ---------------------- | ------- | ---------------------------------------------------------------------------------------------------- |
| `max_epochs` | `9999` | The maximum number of epochs to train. |
| `clip_grad_norm_value` | `-1` | The maximum norm of the gradients used for clipping. "-1" means no clipping. |
| `save_max_score` | `true` | Whether to find the best model by the maximum score. |
| `save_ckpt_interval` | `1` | The interval of saving checkpoints. |
| `patience` | `10` | The number of epochs with no improvement after which the training will be stopped. |
| `plot_norm` | `true` | Whether to plot the norm of the gradients. |
| `validation_interval` | `1` | The interval of validation. |
| `max_num_checkpoints` | `10` | The maximum number of checkpoints to keep. Saving too many checkpoints causes disk space to run out. |
| Item | Default | Description |
| ----------------------------- | --------------------------------- | ---------------------------------------------------------------------------------------------------- |
| `debug`                       | `false`                           | Whether to enable debug mode. If enabled, the trainer records when NaN or Inf values occur.           |
| `max_steps` | `999999999` | The maximum number of steps to train. |
| `max_epochs` | `9999` | The maximum number of epochs to train. If `max_steps` is set, `max_epochs` will be ignored. |
| `max_grad_norm` | `-1` | The maximum norm of the gradients used for clipping. "-1" means no clipping. |
| `save_max_score` | `true` | Whether to find the best model by the maximum score. |
| `save_ckpt_interval` | `1` | The interval of saving checkpoints. |
| `max_patience` | `10` | The number of epochs with no improvement after which the training will be stopped. |
| `plot_norm` | `true` | Whether to plot the norm of the gradients. |
| `validation_interval` | `1` | The interval of validation. |
| `max_num_checkpoints` | `10` | The maximum number of checkpoints to keep. Saving too many checkpoints causes disk space to run out. |
| `scheduler_name` | `"constant_schedule_with_warmup"` | The name of the scheduler. |
| `warmup_steps` | `0` | The number of warmup steps. |
| `warmup_ratio` | `0.0` | The ratio of warmup steps. If `warmup_steps` is set, `warmup_ratio` will be ignored. |
| `gradient_accumulation_steps` | `1` | The number of gradient accumulation steps. It is used to simulate a larger batch size. |
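For orientation, a custom trainer consuming these arguments might look roughly like the sketch below. It is an illustrative assumption: the import path follows the wording above, and the constructor and method names are not verified against the actual code.

```python
# Hedged sketch of a custom Trainer; the base-class path and signature are
# assumptions based on this page, not verified project code.
from audiozen.trainer.base_trainer import BaseTrainer


class Trainer(BaseTrainer):
    def __init__(self, *args, **kwargs):
        # Keyword arguments from [trainer.args] (e.g., max_steps, max_grad_norm,
        # gradient_accumulation_steps) arrive here and are forwarded to the base class.
        super().__init__(*args, **kwargs)

    def training_step(self, batch, batch_idx):
        # Compute and return the loss for one batch (placeholder body).
        ...
```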

#### Loading a module by `path` argument

37 changes: 4 additions & 33 deletions docs/source/getting_started/installation.md
@@ -1,11 +1,8 @@
# Getting Started

## Prerequisites
# Installation

Spiking-FullSubNet is built on top of PyTorch and provides standard audio signal processing and deep learning tools.
To install the PyTorch binaries, we recommend [Anaconda](https://www.anaconda.com/products/individual) or [Miniconda](https://docs.conda.io/en/latest/miniconda.html) as a Python distribution.

## Installation

1. First, create a Conda virtual environment with Python. In our project, `python=3.10` is tested.
```shell
@@ -38,8 +35,10 @@ To install the PyTorch binaries, we recommend [Anaconda](https://www.anaconda.co
pip install -r requirements.txt
```

4. Install the Spiking-FullSubNet package in editable mode (a.k.a. development mode). Installing in editable mode lets us import the `spiking_fullsubnet` package anywhere in the code, e.g., in the `recipes` and `tools` folders, and modify its source code directly. Any changes to the package are reflected immediately in your Conda environment.
4. We integrated all the audio signal processing tools into a package named `audiozen`, which we now install in editable mode (a.k.a. development mode). Installing in editable mode lets us import the `audiozen` package anywhere in the code, e.g., in the `recipes` and `tools` folders, and modify its source code directly. Any changes to the package are reflected immediately in your Conda environment.
```shell
cd audiozen
pip install --editable . # or for short: pip install -e .
```
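As an optional sanity check (not part of the official steps), you can confirm that the editable install resolves back to the source tree rather than `site-packages`:

```python
# Hedged check: after "pip install -e .", the package should resolve to the
# cloned repository rather than a copied install.
import audiozen

print(audiozen.__file__)  # expected to point inside the repository's audiozen/ directory
```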

@@ -49,32 +48,4 @@ Ok, all installations have done. You may speed up the installation by the follow
- [Speed up your Conda installs with Mamba](https://pythonspeed.com/articles/faster-conda-install/)
- Use the [THU Anaconda mirror site](https://mirrors.tuna.tsinghua.edu.cn/help/anaconda/) to speed up the Conda installation.
- Use the [THU PyPi mirror site](https://mirrors.tuna.tsinghua.edu.cn/help/pypi/) to speed up the PyPI installation.
```

## Running an experiment

In Spiking-FullSubNet, we adopt a `recipes/<dataset>/<model>` directory structure. For example, let us enter the directory `recipes/intel_ndns/`. The corresponding dataset is the Intel Neuromorphic DNS Challenge dataset. Please refer to [Intel Neuromorphic DNS Challenge Datasets](https://github.com/IntelLabs/IntelNeuromorphicDNSChallenge#dataset) for preparing the dataset.

To run an experiment for a model, we first go to a model directory. For example, we can enter the directory `recipes/intel_ndns/spiking_fullsubnet/` to run an experiment with the `spiking_fullsubnet` model.

```shell
cd recipes/intel_ndns/spiking_fullsubnet/
```

In this `<model>` directory, we have an entry file `run.py`, dataloaders, and some model directories. We use HuggingFace Accelerate to start an experiment. Don't worry if you are not familiar with Accelerate; it helps you run a parallel experiment easily. Please refer to [HuggingFace Accelerate](https://huggingface.co/docs/accelerate/) for more details.
First, we need to configure the GPU usage. Accelerate provides a CLI tool that unifies all launchers, so you only have to remember one command. To use it, run a quick configuration setup first on your machine and answer the questions:
```shell
accelerate config
```
```{note}
If you don't want to use the CLI tool, you may use explicit arguments to specify the GPU usage (https://huggingface.co/docs/accelerate/package_reference/cli#accelerate-env). For example: `accelerate launch --multi_gpu --num_processes=6 --gpu_ids 0,1,2,3,4,5 --main_process_port 46524 --main_process_ip 127.0.0.1 run.py -C config.toml`
```

Then, we can use the following command to train the `spiking_fullsubnet` model using configurations in `baseline_m_cumulative_laplace_norm.toml`:

```shell
accelerate launch run.py -C baseline_m_cumulative_laplace_norm.toml -M train
```
114 changes: 29 additions & 85 deletions docs/source/getting_started/running_an_experiment.md
@@ -1,28 +1,35 @@
# Running an experiment

As mentioned in the previous section, AudioZEN adopts a `recipes/<dataset>/<model>` directory structure.
To run an experiment for a model, we first go to a dataset directory, which will include an entry file `run.py` and some dataloaders dedicated to this dataset. For example, let us enter the directory `recipes/intel_ndns/`. The corresponding dataset is the Intel Neuromorphic DNS Challenge dataset.
In Spiking-FullSubNet, we adopt a `recipes/<dataset>/<model>` directory structure. For example, let us enter the directory `recipes/intel_ndns/`. The corresponding dataset is the Intel Neuromorphic DNS Challenge dataset. Please refer to [Intel Neuromorphic DNS Challenge Datasets](https://github.com/IntelLabs/IntelNeuromorphicDNSChallenge#dataset) for preparing the dataset.

To run an experiment for a model, we first go to a model directory. For example, we can enter the directory `recipes/intel_ndns/spiking_fullsubnet/` to run an experiment with the `spiking_fullsubnet` model.

```shell
cd recipes/intel_ndns/
cd recipes/intel_ndns/spiking_fullsubnet/
```

## Entry file `run.py`
In this `<model>` directory, we have an entry file `run.py`, dataloaders, and some model directories. We use HuggingFace Accelerate to start an experiment. Don't worry if you are not familiar with Accelerate; it helps you run a parallel experiment easily. Please refer to [HuggingFace Accelerate](https://huggingface.co/docs/accelerate/) for more details.

In each `<dataset>` directory, we have an entry file `run.py`, dataloaders, and some model directories. We call the `run.py` script to run an experiment.
For example, we can use the following command to train the `sdnn_delays` model using configurations in `baseline.toml`:
First, we need to configure the GPU usage. Accelerate provides a CLI tool that unifies all launchers, so you only have to remember one command. To use it, run a quick configuration setup first on your machine and answer the questions:

```shell
torchrun
--standalone
--nnodes=1
--nproc_per_node=1
run.py
-C sdnn_delays/baseline.toml
-M train
accelerate config
```

Then, we can use the following command to train the `spiking_fullsubnet` model using configurations in `baseline_m.toml`:

```shell
accelerate launch run.py -C baseline_m.toml -M train
```

```{note}
Alternatively, if you don't want to use the CLI tool, you may use explicit arguments to specify the GPU usage (https://huggingface.co/docs/accelerate/package_reference/cli#accelerate-env). For example: `accelerate launch --multi_gpu --num_processes=6 --gpu_ids 0,1,2,3,4,5 --main_process_port 46524 --main_process_ip 127.0.0.1 run.py -C baseline_m.toml`
```

Here, we use `torchrun` to start the experiment.
`torchrun` isn't magic. It is a superset of `torch.distributed.launch`, provided officially by PyTorch, which helps us start multi-GPU training conveniently. It's just a Python `console_entrypoint` added for convenience (check [torchrun versus python -m torch.distributed.run](https://pytorch.org/docs/stable/elastic/run.html)). Check [Torchrun (Elastic Training)](https://pytorch.org/docs/stable/elastic/run.html) for more details.

## Entry file `run.py`

In this `<model>` directory, we have an entry file `run.py`, dataloaders, and some model directories. We use HuggingFace Accelerate to start an experiment. Please refer to [HuggingFace Accelerate](https://huggingface.co/docs/accelerate/) for more details.

`run.py` supports the following parameters:

@@ -33,86 +40,23 @@ Here, we use `torchrun` to start the experiment.
| `-R` / `--resume` | Resume the experiment from the latest checkpoint. | `False` |
| `--ckpt_path` | The checkpoint path for test. It can be `best`, `latest`, or a path to a checkpoint file. | `latest` |

See more details in `recipes/intel_ndns/run.py` and `recipes/intel_ndns/sdnn_delays/baseline.toml`.

## Single-machine multi-GPU training
See more details in `recipes/intel_ndns/spiking_fullsubnet/run.py`.
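For orientation, the flags in the table above could be declared with `argparse` roughly as sketched below. This is an assumption about the shape of `run.py`, not its actual contents.

```python
# Hedged sketch of a CLI matching the parameter table; run.py's real
# argument definitions may differ.
import argparse

parser = argparse.ArgumentParser(description="Entry point of an experiment")
parser.add_argument("-C", "--configuration", required=True, help="Path to a *.toml experiment configuration")
parser.add_argument("-M", "--mode", nargs="+", default=["train"], help="One or more of: train, validate, test")
parser.add_argument("-R", "--resume", action="store_true", help="Resume from the latest checkpoint")
parser.add_argument("--ckpt_path", default="latest", help="best, latest, or a path to a checkpoint file")

args = parser.parse_args()
print(args)
```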

In most cases, we want to start an experiment on a single machine with multiple GPUs. Here, we show some examples of how to do this.

First, let us use `baseline.toml` to train `sdnn_delays` with two GPUs on a single machine:

```shell
torchrun
--standalone
--nnodes=1
--nproc_per_node=2
run.py
--configuration sdnn_delays/baseline.toml
--mode train
```

`--nnodes=1` means that we will start the experiment on a single machine. `--nproc_per_node=2` means that we will use two GPUs on the single machine.

:::{attention}
The model `sdnn_delays` is based on the Lava-dl package, which does not actually support multi-GPU training. Here, we just use it as an example to show how to start an experiment on a single machine with multiple GPUs using `torchrun`.
:::

If an experiment has been suspended, we can resume training (using `-R` or `--resume`) from the last checkpoint:

```shell
torchrun
--standalone
--nnodes=1
--nproc_per_node=2
run.py
-C sdnn_delays/baseline.toml
accelerate launch run.py
-C baseline_m_cumulative_laplace_norm.toml
-M train
-R
```

In the case of running multiple experiments on a single machine, the first experiment occupies the default `DistributedDataParallel` (DDP) listening port `29500`, so we need to make sure that each instance (job) is set up on a different port to avoid port conflicts. Alternatively, you may use `rdzv_endpoint=localhost:0`, which selects a random unused port:

```shell
torchrun
--rdzv_backend=c10d
--rdzv_endpoint=localhost:0
--nnodes=1
--nproc_per_node=2
run.py
-C sdnn_delays/baseline.toml
-M train
```

Using "best" epoch to test the model performance on the test dataset:

```shell
torchrun
--standalone
--nnodes=1
--nproc_per_node=2
run.py
-C sdnn_delays/baseline.toml
-M test
--ckpt_path best
```

First, train the model on the training dataset; then test its performance on the test dataset:

```shell
torchrun
--standalone
--nnodes=1
--nproc_per_node=2
run.py
-C sdnn_delays/baseline.toml
-M train test
accelerate launch run.py
-C baseline_m_cumulative_laplace_norm.toml
-M test
--ckpt_path best
```

:::{attention}
Before using `torchrun`, don't forget to use the environment variable `CUDA_VISIBLE_DEVICES` to control the GPU usage. For example, the following command will use the first and second GPUs:

```shell
export CUDA_VISIBLE_DEVICES=0,1
```
:::
```
56 changes: 0 additions & 56 deletions docs/source/getting_started/running_an_experiment_accelerate.md

This file was deleted.
