Scripts for training large-scale monolingual speech foundation models with 158K hours of Finnish speech
Model |
---|
wav2vec 2.0 Base Pre-trained |
wav2vec 2.0 Base Fine-tuned |
wav2vec 2.0 Large Pre-trained |
wav2vec 2.0 Large Fine-tuned |
wav2vec 2.0 X-Large Pre-trained |
wav2vec 2.0 X-Large Fine-tuned |
More details on the models are available in the paper (TBA). The models are also available on the Hugging Face Hub.
Developing a foundation model from scratch requires not only vast amounts of unlabeled speech data but also substantial computational resources. Moreover, extensive hyperparameter search is often not feasible for large-scale models. Therefore, we are glad to share our pre-training logs on Weights & Biases (W&B) to provide more insights for other researchers developing their own speech foundation models.
The raw, unlabeled TV and radio data are organized into 1-hour files, each stored at the path `channel_name/year/month/day/channel_name_start_time-end_time.ts`:
.
└── raw_tv_and_radio_data/
    ├── radio_channel_1/
    │   ├── 2009/
    │   │   ├── 01/
    │   │   │   ├── 01/
    │   │   │   │   ├── radio_channel_1_0000-0100.ts
    │   │   │   │   ├── radio_channel_1_0100-0200.ts
    │   │   │   │   └── ...
    │   │   │   ├── 02/
    │   │   │   │   └── ...
    │   │   │   └── ...
    │   │   ├── 02/
    │   │   │   └── .../
    │   │   │       └── ...
    │   │   └── ...
    │   └── 2010/
    │       └── .../
    │           └── .../
    │               └── ...
    ├── tv_channel_2/
    │   └── ...
    └── ...
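The path convention above can be parsed programmatically, for example to group files by channel and year before archiving. The following is a minimal sketch, not part of the repository's scripts: the `Recording` type and `parse_recording_path` function are hypothetical names, and the four-digit start and end times are assumed to be `HHMM`, as the example file names suggest.

```python
import re
from typing import NamedTuple


class Recording(NamedTuple):
    channel: str
    year: int
    month: int
    day: int
    start: str  # assumed to be "HHMM"
    end: str    # assumed to be "HHMM"


# channel_name/year/month/day/channel_name_start_time-end_time.ts
# The backreference (?P=channel) enforces that the file name repeats
# the channel directory name.
_PATTERN = re.compile(
    r"(?:^|/)(?P<channel>[^/]+)/(?P<year>\d{4})/(?P<month>\d{2})/(?P<day>\d{2})/"
    r"(?P=channel)_(?P<start>\d{4})-(?P<end>\d{4})\.ts$"
)


def parse_recording_path(path: str) -> Recording:
    """Extract channel, date, and time span from a raw recording path."""
    match = _PATTERN.search(path)
    if match is None:
        raise ValueError(f"Path does not follow the expected layout: {path}")
    return Recording(
        channel=match["channel"],
        year=int(match["year"]),
        month=int(match["month"]),
        day=int(match["day"]),
        start=match["start"],
        end=match["end"],
    )
```

For example, `parse_recording_path("raw_tv_and_radio_data/radio_channel_1/2009/01/01/radio_channel_1_0000-0100.ts")` yields the channel `radio_channel_1` and the date 2009-01-01.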
- Convert the files to 16 kHz mono FLAC audio by running `scripts/data_preprocessing/convert_to_flac.sh`. The script preserves the original folder structure.
- Run voice activity detection (VAD) to split the data into shorter utterances and reduce non-speech events such as music, noise, and silence, then pack the utterances into uncompressed (.tar) tarballs, with one archive per year per radio station or TV channel. The script `scripts/data_preprocessing/segment_with_vad_and_tar.sh` does this for one year of data from `radio_station_1`. The script also stores a Python dictionary `out_file_to_nframes_dict` with the number of frames for each audio segment, which is needed later to create the Fairseq manifest of the data.

  Note: Fairseq does not support compressed archives.

  Note: Millions of small files degrade the performance of any filesystem. As a result, quotas on Lustre filesystems are typically limited to several million files. To avoid running out of quota, run VAD-based segmentation on a small part of the raw data (one day, month, or year), put the resulting short audio files into a .tar archive, and remove them immediately afterward. You can also consider storing the preprocessed audio files in the `/tmp` folder, which usually does not count toward the quota.
- Prepare the Fairseq manifest of the data. `scripts/data_preprocessing/prepare_fairseq_manifest.sh` creates a .tsv file with all `radio_station_1` audio samples stored in the corresponding .tar archives. To hold out a validation subset of `valid_size_hours` hours, run `scripts/data_preprocessing/prepare_fairseq_manifest_valid_tsv.sh` afterward.
- Binarize the Fairseq manifest by running `scripts/data_preprocessing/binarize_manifest.sh`. This step is recommended for large datasets to avoid running out of RAM during pre-training.
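To illustrate how the per-segment frame counts feed into the manifest and the validation hold-out, here is a minimal sketch. It is not the logic of the repository's scripts: `split_manifest` and `format_manifest` are hypothetical helpers, the random selection of validation utterances is an assumption, and frame counts are assumed to be audio samples at 16 kHz. The manifest layout (root directory on the first line, then one `path<TAB>nframes` line per utterance) follows the usual Fairseq wav2vec convention; the exact path format for audio stored inside .tar archives is treated as opaque here.

```python
import random

SAMPLE_RATE = 16_000  # Hz, matching the 16 kHz FLAC conversion step


def split_manifest(entries, valid_size_hours, seed=0):
    """Split (path, nframes) pairs into train and validation lists.

    Utterances are shuffled with a fixed seed and moved to the validation
    list until it holds at least `valid_size_hours` hours of audio.
    """
    target_frames = valid_size_hours * 3600 * SAMPLE_RATE
    shuffled = list(entries)
    random.Random(seed).shuffle(shuffled)

    train, valid, valid_frames = [], [], 0
    for path, nframes in shuffled:
        if valid_frames < target_frames:
            valid.append((path, nframes))
            valid_frames += nframes
        else:
            train.append((path, nframes))
    return train, valid


def format_manifest(root, entries):
    """Render a manifest: root on line 1, then path<TAB>nframes per line."""
    lines = [root] + [f"{path}\t{nframes}" for path, nframes in entries]
    return "\n".join(lines) + "\n"
```

With a dictionary such as `out_file_to_nframes_dict`, the entries would be `out_file_to_nframes_dict.items()`, and the two returned lists would be written to separate training and validation .tsv files.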
The scripts shared in this repository are adapted to the AMD hardware of the LUMI supercomputer. To train a wav2vec 2.0 Base model, run
sbatch scripts/pretraining/pretrain_wav2vec2_base.sh
Note: you can simulate 512 GPUs by using k GPUs and adding the command-line parameters `distributed_training.distributed_world_size=k +optimization.update_freq='[x]'` (before `--config-dir`), where x = 512/k.
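The simulation relies on gradient accumulation: with `update_freq` set to x, gradients from x batches are accumulated before each optimizer step, so k GPUs with x accumulation steps match the effective batch size of k · x GPUs. The sketch below computes the two overrides for a given number of available GPUs. The helper name `simulated_gpu_overrides` is hypothetical, and the sketch assumes that k divides the target GPU count evenly.

```python
from typing import List


def simulated_gpu_overrides(available_gpus: int, target_gpus: int = 512) -> List[str]:
    """Return the Fairseq overrides that emulate `target_gpus` on fewer GPUs."""
    if available_gpus <= 0 or target_gpus % available_gpus != 0:
        raise ValueError(
            f"{available_gpus} GPUs cannot evenly simulate {target_gpus} GPUs"
        )
    # x = target / k: number of batches accumulated per optimizer step
    update_freq = target_gpus // available_gpus
    return [
        f"distributed_training.distributed_world_size={available_gpus}",
        f"+optimization.update_freq='[{update_freq}]'",
    ]


if __name__ == "__main__":
    # Example: emulate 512 GPUs on 64 GPUs (x = 8)
    print(" ".join(simulated_gpu_overrides(64)))
```

The same helper covers other targets by passing a different `target_gpus` value.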
To fine-tune a wav2vec 2.0 Base model using Fairseq, run
sbatch scripts/finetuning/full-scale-asr/finetune_wav2vec2_base.sh
- When pre-training on the LUMI supercomputer using Fairseq, it is crucial to set `export MIOPEN_FIND_MODE=2`. MIOpen is AMD's deep-learning primitives library for GPUs (the counterpart of NVIDIA's cuDNN). Setting the Find Mode to `2`, or `FAST`, is crucial for optimal pre-training speed; otherwise, pre-training is 10-20x slower. More details on MIOpen Find modes are available here.
- You can simulate 128 GPUs by using k GPUs and adding the command-line parameters `distributed_training.distributed_world_size=k +optimization.update_freq='[x]'` (before `--config-dir`), where x = 128/k.
- For more LUMI-specific details on training with AMD GPUs, see here, here, and here.
To fine-tune a wav2vec 2.0 Base model using Hugging Face Transformers, run
sbatch scripts/finetuning/low-resource-asr/finetune_wav2vec2_base.sh