feat: update project structure and trainer path
haoxiangsnr committed Dec 28, 2023
1 parent 00c7aef commit e11a5f7
Showing 12 changed files with 504 additions and 519 deletions.
4 changes: 3 additions & 1 deletion audiozen/common_trainer.py
@@ -13,7 +13,7 @@
from accelerate.logging import get_logger
from torch.utils.data import DataLoader
from torchinfo import summary
from tqdm import tqdm
from tqdm.auto import tqdm

from audiozen.acoustics.audio_feature import istft, stft
from audiozen.debug_utils import DebugUnderflowOverflow
@@ -384,6 +384,8 @@ def train(self, train_dataloader: DataLoader, validation_dataloaders):
bar_format="{l_bar}{r_bar}",
colour="green",
disable=not self.accelerator.is_local_main_process,
position=0,
leave=True,
)

for batch_idx, batch in enumerate(dataloader_bar):
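For context, the hunk above switches to `tqdm.auto` (which picks a notebook- or console-appropriate progress bar) and pins the bar with `position=0` and `leave=True`, while `disable` silences it on non-main processes. Below is a minimal, self-contained sketch of the same pattern; the loop body and the main-process flag are placeholders, not the project's actual objects.

```python
# Illustrative sketch only, not the project's code: demonstrates the
# tqdm.auto + position/leave/disable combination used in the hunk above.
from tqdm.auto import tqdm  # picks a notebook- or console-appropriate bar

is_main_process = True  # stands in for accelerator.is_local_main_process

dataloader_bar = tqdm(
    range(100),                   # stands in for the training dataloader
    bar_format="{l_bar}{r_bar}",  # hide the graphical bar, keep the stats
    colour="green",
    disable=not is_main_process,  # only the local main process prints
    position=0,                   # pin the bar to the first line
    leave=True,                   # keep the final bar after the loop ends
)

for batch_idx, batch in enumerate(dataloader_bar):
    pass  # training step would go here
```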
28 changes: 17 additions & 11 deletions docs/source/concepts/experiment_arguments.md
@@ -1,6 +1,6 @@
# Experiment arguments

AudioZEN uses TOML configuration files (`*.toml`) to configure and manage experiments.
Spiking-FullSubNet uses TOML configuration files (`*.toml`) to configure and manage experiments.
Each experiment is configured by a `*.toml` file, which contains the experiment meta information, trainer, loss function, learning rate scheduler, optimizer, model, dataset, and acoustic features. The basename of the `*.toml` file is used as the experiment ID.
You can track configuration changes using version control and reproduce experiments by using the same configuration file. For more information on TOML syntax, visit the [TOML website](https://toml.io/en/).
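As an illustration of that mapping, a configuration file can be loaded and its basename turned into the experiment ID roughly as follows. This is a hedged sketch: the file name `baseline_m.toml` and the use of `tomllib` are assumptions for illustration, not the project's actual loading code.

```python
# Hedged sketch: how a *.toml experiment file maps to an experiment ID.
import tomllib  # Python 3.11+; the project may use a different TOML reader
from pathlib import Path

config_path = Path("baseline_m.toml")  # hypothetical configuration file
with config_path.open("rb") as f:
    config = tomllib.load(f)

experiment_id = config_path.stem  # e.g. "baseline_m"
print(experiment_id, list(config.keys()))  # e.g. sections like "trainer", "model", ...
```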

@@ -92,16 +92,22 @@ clip_grad_norm_value = 5
In this example, AudioZEN will load a custom `Trainer` class from `trainer.py` on the Python search path and initialize it with the arguments in the `[trainer.args]` section. There are multiple ways to specify the `path` argument; see the next section for more details.
In AudioZEN, the `Trainer` class must be a subclass of `audiozen.trainer.base_trainer.BaseTrainer`. It supports at least the following arguments (a brief illustrative sketch follows the table):

| Item | Default | Description |
| ---------------------- | ------- | ---------------------------------------------------------------------------------------------------- |
| `max_epochs` | `9999` | The maximum number of epochs to train. |
| `clip_grad_norm_value` | `-1` | The maximum norm of the gradients used for clipping. "-1" means no clipping. |
| `save_max_score` | `true` | Whether to find the best model by the maximum score. |
| `save_ckpt_interval` | `1` | The interval of saving checkpoints. |
| `patience` | `10` | The number of epochs with no improvement after which the training will be stopped. |
| `plot_norm` | `true` | Whether to plot the norm of the gradients. |
| `validation_interval` | `1` | The interval of validation. |
| `max_num_checkpoints` | `10` | The maximum number of checkpoints to keep. Saving too many checkpoints causes disk space to run out. |
| Item | Default | Description |
| ----------------------------- | --------------------------------- | ---------------------------------------------------------------------------------------------------- |
| `debug`                       | `false`                           | Whether to enable debug mode. If enabled, the trainer records when NaN or Inf values occur.           |
| `max_steps` | `999999999` | The maximum number of steps to train. |
| `max_epochs` | `9999` | The maximum number of epochs to train. If `max_steps` is set, `max_epochs` will be ignored. |
| `max_grad_norm` | `-1` | The maximum norm of the gradients used for clipping. "-1" means no clipping. |
| `save_max_score` | `true` | Whether to find the best model by the maximum score. |
| `save_ckpt_interval` | `1` | The interval of saving checkpoints. |
| `max_patience` | `10` | The number of epochs with no improvement after which the training will be stopped. |
| `plot_norm` | `true` | Whether to plot the norm of the gradients. |
| `validation_interval` | `1` | The interval of validation. |
| `max_num_checkpoints` | `10` | The maximum number of checkpoints to keep. Saving too many checkpoints causes disk space to run out. |
| `scheduler_name` | `"constant_schedule_with_warmup"` | The name of the scheduler. |
| `warmup_steps` | `0` | The number of warmup steps. |
| `warmup_ratio` | `0.0` | The ratio of warmup steps. If `warmup_steps` is set, `warmup_ratio` will be ignored. |
| `gradient_accumulation_steps` | `1` | The number of gradient accumulation steps. It is used to simulate a larger batch size. |
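For orientation, a custom trainer consuming these arguments might look roughly like the sketch below. It is an illustrative assumption: the import path follows the wording above, and the constructor and method names are not verified against the actual code.

```python
# Hedged sketch of a custom Trainer; the base-class path and signature are
# assumptions based on this page, not verified project code.
from audiozen.trainer.base_trainer import BaseTrainer


class Trainer(BaseTrainer):
    def __init__(self, *args, **kwargs):
        # Keyword arguments from [trainer.args] (e.g., max_steps, max_grad_norm,
        # gradient_accumulation_steps) arrive here and are forwarded to the base class.
        super().__init__(*args, **kwargs)

    def training_step(self, batch, batch_idx):
        # Compute and return the loss for one batch (placeholder body).
        ...
```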

#### Loading a module by `path` argument

37 changes: 4 additions & 33 deletions docs/source/getting_started/installation.md
@@ -1,11 +1,8 @@
# Getting Started

## Prerequisites
# Installation

Spiking-FullSubNet is built on top of PyTorch and provides standard audio signal processing and deep learning tools.
To install the PyTorch binaries, we recommend [Anaconda](https://www.anaconda.com/products/individual) or [Miniconda](https://docs.conda.io/en/latest/miniconda.html) as a Python distribution.

## Installation

1. First, create a Conda virtual environment with Python. In our project, `python=3.10` is tested.
```shell
@@ -38,8 +35,10 @@ To install the PyTorch binaries, we recommend [Anaconda](https://www.anaconda.co
pip install -r requirements.txt
```

4. Install the Spiking-FullSubNet package in editable mode (a.k.a. development mode). Installing in editable mode lets us import the `spiking_fullsubnet` package anywhere in the code, e.g., in the `recipes` and `tools` folders, and modify its source code directly. Any changes to the package are reflected immediately in your Conda environment.
4. We integrated all the audio signal processing tools into a package named `audiozen`, which we now install in editable mode (a.k.a. development mode). Installing in editable mode lets us import the `audiozen` package anywhere in the code, e.g., in the `recipes` and `tools` folders, and modify its source code directly. Any changes to the package are reflected immediately in your Conda environment.
```shell
cd audiozen
pip install --editable . # or for short: pip install -e .
```
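As an optional sanity check (not part of the official steps), you can confirm that the editable install resolves back to the source tree rather than `site-packages`:

```python
# Hedged check: after "pip install -e .", the package should resolve to the
# cloned repository rather than a copied install.
import audiozen

print(audiozen.__file__)  # expected to point inside the repository's audiozen/ directory
```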

@@ -49,32 +48,4 @@ Ok, all installations have done. You may speed up the installation by the follow
- [Speed up your Conda installs with Mamba](https://pythonspeed.com/articles/faster-conda-install/)
- Use the [THU Anaconda mirror site](https://mirrors.tuna.tsinghua.edu.cn/help/anaconda/) to speed up the Conda installation.
- Use the [THU PyPi mirror site](https://mirrors.tuna.tsinghua.edu.cn/help/pypi/) to speed up the PyPI installation.
```

## Running an experiment

In Spiking-FullSubNet, we adopt a `recipes/<dataset>/<model>` directory structure. For example, let us enter the directory `recipes/intel_ndns/`. The corresponding dataset is the Intel Neuromorphic DNS Challenge dataset. Please refer to [Intel Neuromorphic DNS Challenge Datasets](https://github.com/IntelLabs/IntelNeuromorphicDNSChallenge#dataset) for preparing the dataset.

To run an experiment for a model, we first go to a model directory. For example, we can enter the directory `recipes/intel_ndns/spiking_fullsubnet/` to run an experiment with the `spiking_fullsubnet` model.

```shell
cd recipes/intel_ndns/spiking_fullsubnet/
```

In this `<model>` directory, we have an entry file `run.py`, dataloaders, and some model directories. We use HuggingFace Accelerate to start an experiment. Don't worry if you are not familiar with Accelerate; it helps you run a parallel experiment easily. Please refer to [HuggingFace Accelerate](https://huggingface.co/docs/accelerate/) for more details.
First, we need to configure the GPU usage. Accelerate provides a CLI tool that unifies all launchers, so you only have to remember one command. To use it, run a quick configuration setup first on your machine and answer the questions:
```shell
accelerate config
```
```{note}
If you don't want to use the CLI tool, you may use explicit arguments to specify the GPU usage (https://huggingface.co/docs/accelerate/package_reference/cli#accelerate-env). For example: `accelerate launch --multi_gpu --num_processes=6 --gpu_ids 0,1,2,3,4,5 --main_process_port 46524 --main_process_ip 127.0.0.1 run.py -C config.toml`
```

Then, we can use the following command to train the `spiking_fullsubnet` model using configurations in `baseline_m_cumulative_laplace_norm.toml`:

```shell
accelerate launch run.py -C baseline_m_cumulative_laplace_norm.toml -M train
```
114 changes: 29 additions & 85 deletions docs/source/getting_started/running_an_experiment.md
@@ -1,28 +1,35 @@
# Running an experiment

As mentioned in the previous section, AudioZEN adopts a `recipes/<dataset>/<model>` directory structure.
To run an experiment for a model, we first go to a dataset directory, which will include an entry file `run.py` and some dataloaders dedicated to this dataset. For example, let us enter the directory `recipes/intel_ndns/`. The corresponding dataset is the Intel Neuromorphic DNS Challenge dataset.
In Spiking-FullSubNet, we adopt a `recipes/<dataset>/<model>` directory structure. For example, let us enter the directory `recipes/intel_ndns/`. The corresponding dataset is the Intel Neuromorphic DNS Challenge dataset. Please refer to [Intel Neuromorphic DNS Challenge Datasets](https://github.com/IntelLabs/IntelNeuromorphicDNSChallenge#dataset) for preparing the dataset.

To run an experiment for a model, we first go to a model directory. For example, we can enter the directory `recipes/intel_ndns/spiking_fullsubnet/` to run an experiment with the `spiking_fullsubnet` model.

```shell
cd recipes/intel_ndns/
cd recipes/intel_ndns/spiking_fullsubnet/
```

## Entry file `run.py`
In this `<model>` directory, we have an entry file `run.py`, dataloaders, and some model directories. We use HuggingFace Accelerate to start an experiment. Don't worry if you are not familiar with Accelerate; it helps you run a parallel experiment easily. Please refer to [HuggingFace Accelerate](https://huggingface.co/docs/accelerate/) for more details.

In each `<dataset>` directory, we have an entry file `run.py`, dataloaders, and some model directories. We call the `run.py` script to run an experiment.
For example, we can use the following command to train the `sdnn_delays` model using configurations in `baseline.toml`:
First, we need to configure the GPU usage. Accelerate provides a CLI tool that unifies all launchers, so you only have to remember one command. To use it, run a quick configuration setup first on your machine and answer the questions:

```shell
torchrun
--standalone
--nnodes=1
--nproc_per_node=1
run.py
-C sdnn_delays/baseline.toml
-M train
accelerate config
```

Then, we can use the following command to train the `spiking_fullsubnet` model using configurations in `baseline_m.toml`:

```shell
accelerate launch run.py -C baseline_m.toml -M train
```

```{note}
Alternatively, if you don't want to use the CLI tool, you may use explicit arguments to specify the GPU usage (https://huggingface.co/docs/accelerate/package_reference/cli#accelerate-env). For example: `accelerate launch --multi_gpu --num_processes=6 --gpu_ids 0,1,2,3,4,5 --main_process_port 46524 --main_process_ip 127.0.0.1 run.py -C baseline_m.toml`
```

Here, we use `torchrun` to start the experiment.
`torchrun` isn't magic. It is a superset of `torch.distributed.launch`, provided officially by PyTorch, which helps us start multi-GPU training conveniently. It's just a Python `console_entrypoint` added for convenience (check [torchrun versus python -m torch.distributed.run](https://pytorch.org/docs/stable/elastic/run.html)). Check [Torchrun (Elastic Training)](https://pytorch.org/docs/stable/elastic/run.html) for more details.

## Entry file `run.py`

In this `<model>` directory, we have an entry file `run.py`, dataloaders, and some model directories. We use HuggingFace Accelerate to start an experiment. Please refer to [HuggingFace Accelerate](https://huggingface.co/docs/accelerate/) for more details.

`run.py` supports the following parameters:

@@ -33,86 +40,23 @@ Here, we use `torchrun` to start the experiment.
| `-R` / `--resume` | Resume the experiment from the latest checkpoint. | `False` |
| `--ckpt_path` | The checkpoint path for test. It can be `best`, `latest`, or a path to a checkpoint file. | `latest` |

See more details in `recipes/intel_ndns/run.py` and `recipes/intel_ndns/sdnn_delays/baseline.toml`.

## Single-machine multi-GPU training
See more details in `recipes/intel_ndns/spiking_fullsubnet/run.py`.
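For orientation, the flags in the table above could be declared with `argparse` roughly as sketched below. This is an assumption about the shape of `run.py`, not its actual contents.

```python
# Hedged sketch of a CLI matching the parameter table; run.py's real
# argument definitions may differ.
import argparse

parser = argparse.ArgumentParser(description="Entry point of an experiment")
parser.add_argument("-C", "--configuration", required=True, help="Path to a *.toml experiment configuration")
parser.add_argument("-M", "--mode", nargs="+", default=["train"], help="One or more of: train, validate, test")
parser.add_argument("-R", "--resume", action="store_true", help="Resume from the latest checkpoint")
parser.add_argument("--ckpt_path", default="latest", help="best, latest, or a path to a checkpoint file")

args = parser.parse_args()
print(args)
```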

In most cases, we want to start an experiment on a single machine with multiple GPUs. Here, we show some examples of how to do this.

First, let us use `baseline.toml` to train `sdnn_delays` with two GPUs on a single machine:

```shell
torchrun
--standalone
--nnodes=1
--nproc_per_node=2
run.py
--configuration sdnn_delays/baseline.toml
--mode train
```

`--nnodes=1` means that we will start the experiment on a single machine. `--nproc_per_node=2` means that we will use two GPUs on the single machine.

:::{attention}
The model `sdnn_delays` is based on the Lava-dl package, which does not actually support multi-GPU training. Here, we just use it as an example to show how to start an experiment on a single machine with multiple GPUs using `torchrun`.
:::

If an experiment has been suspended, we can resume training (using `-R` or `--resume`) from the last checkpoint:

```shell
torchrun
--standalone
--nnodes=1
--nproc_per_node=2
run.py
-C sdnn_delays/baseline.toml
accelerate launch run.py
-C baseline_m_cumulative_laplace_norm.toml
-M train
-R
```

In the case of running multiple experiments on a single machine, the first experiment occupies the default `DistributedDataParallel` (DDP) listening port `29500`, so we need to make sure that each instance (job) is set up on a different port to avoid port conflicts. Alternatively, you may use `rdzv_endpoint=localhost:0`, which selects a random unused port:

```shell
torchrun
--rdzv_backend=c10d
--rdzv_endpoint=localhost:0
--nnodes=1
--nproc_per_node=2
run.py
-C sdnn_delays/baseline.toml
-M train
```

Using "best" epoch to test the model performance on the test dataset:

```shell
torchrun
--standalone
--nnodes=1
--nproc_per_node=2
run.py
-C sdnn_delays/baseline.toml
-M test
--ckpt_path best
```

First, train the model on the training dataset; then test its performance on the test dataset:

```shell
torchrun
--standalone
--nnodes=1
--nproc_per_node=2
run.py
-C sdnn_delays/baseline.toml
-M train test
accelerate launch run.py
-C baseline_m_cumulative_laplace_norm.toml
-M test
--ckpt_path best
```

:::{attention}
Before using `torchrun`, don't forget to use the environment variable `CUDA_VISIBLE_DEVICES` to control the GPU usage. For example, the following command will use the first and second GPUs:

```shell
export CUDA_VISIBLE_DEVICES=0,1
```
:::
```
56 changes: 0 additions & 56 deletions docs/source/getting_started/running_an_experiment_accelerate.md

This file was deleted.
