StyleTTS 2: Towards Human-Level Text-to-Speech through Style Diffusion and Adversarial Training with Large Speech Language Models
Original README at https://github.com/yl4579/StyleTTS2, checkpoints at https://huggingface.co/malaysia-ai/StyleTTS2-MS
In Utils folder, there are three pre-trained models:
- ASR folder: Forked original yl4579/AuxiliaryASR at malaysia-ai/AuxiliaryASR-Phonemizer to use the ms phonemizer, and trained on the mesolitica/tts-combine-annotated dataset.
- JDC folder: No modification done, use original yl4579/PitchExtractor.
- PLBERT folder: Forked original PL-BERT at malaysia-ai/PL-BERT-MS to use a custom word tokenizer, and pretrained on Malay Wikipedia and local news. See prune-checkpoint.ipynb for how we pruned the checkpoint.
We cleaned the dataset using https://github.com/mesolitica/malaya-speech/blob/master/malay_vits/preprocessing.py and pushed to https://huggingface.co/datasets/mesolitica/TTS/tree/main/cleaned, notebook at prepare-styletts2.ipynb
First stage training:
accelerate launch train_first.py --config_path ./Configs/config_multispeakers.yml
Second stage training (the DDP version does not work, so the current version uses DP; see #7 if you want to help):
python train_second.py --config_path ./Configs/config.yml
You can run both consecutively and it will train both the first and second stages. The model will be saved in the format "epoch_1st_%05d.pth" and "epoch_2nd_%05d.pth". Checkpoints and Tensorboard logs will be saved at log_dir.
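As a small illustration of the checkpoint naming above, the two patterns are ordinary printf-style format strings, so the epoch number is zero-padded to five digits. The helper name `checkpoint_name` and the stage keys below are hypothetical and are not part of the training scripts; this is only a sketch of how the filenames come out.

```python
# Hypothetical helper (not part of the repository): builds a checkpoint
# filename from the two patterns quoted in the text above.
PATTERNS = {
    "first": "epoch_1st_%05d.pth",
    "second": "epoch_2nd_%05d.pth",
}


def checkpoint_name(stage: str, epoch: int) -> str:
    """Return the checkpoint filename for a stage ("first" or "second")."""
    return PATTERNS[stage] % epoch


if __name__ == "__main__":
    print(checkpoint_name("first", 5))     # epoch_1st_00005.pth
    print(checkpoint_name("second", 120))  # epoch_2nd_00120.pth
```

Because of the zero padding, sorting the filenames alphabetically also sorts them by epoch, which is convenient when picking the latest checkpoint from log_dir.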
The data list format needs to be filename.wav|transcription|speaker; see val_list.txt as an example. The speaker labels are needed for multi-speaker models because we need to sample reference audio for style diffusion model training.
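To make the data list format concrete, here is a minimal sketch of a validator for lines of the form filename.wav|transcription|speaker. The names `Entry` and `parse_line` are hypothetical and do not come from the training scripts, and the sample line is made up for illustration. It assumes the transcription itself contains no `|` character.

```python
from typing import NamedTuple


class Entry(NamedTuple):
    """One row of a data list (hypothetical structure for illustration)."""
    wav_path: str
    text: str
    speaker: str


def parse_line(line: str) -> Entry:
    """Split one data list line into its three '|'-separated fields."""
    parts = line.rstrip("\n").split("|")
    if len(parts) != 3:
        raise ValueError(
            f"expected 3 '|'-separated fields, got {len(parts)}: {line!r}"
        )
    return Entry(*parts)


if __name__ == "__main__":
    # Made-up example line, not taken from val_list.txt.
    entry = parse_line("audio/0001.wav|selamat pagi semua|0\n")
    print(entry.wav_path, entry.speaker)
```

Running a check like this over the whole list before training may catch missing speaker labels early, rather than partway through an epoch.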
In config.yml, there are a few important configurations to take care of:
- OOD_data: The path for out-of-distribution texts for SLM adversarial training. The format should be text|anything.
- min_length: Minimum length of OOD texts for training. This is to make sure the synthesized speech has a minimum length.
- max_len: Maximum length of audio for training. The unit is frame. Since the default hop size is 300, one frame is approximately 300 / 24000 (0.0125) second. Lower this if you encounter out-of-memory issues.
- multispeaker: Set to true if you want to train a multispeaker model. This is needed because the architecture of the denoiser is different for single and multispeaker models.
- batch_percentage: This is to make sure during SLM adversarial training there are no out-of-memory (OOM) issues. If you encounter OOM problems, please set a lower number for this.
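The frame arithmetic for max_len can be sketched as below, using the hop size of 300 and the 24000 Hz sample rate stated above. The function name `frames_to_seconds` and the example value of 400 frames are assumptions chosen for illustration, not values read from config.yml.

```python
HOP_SIZE = 300       # samples per frame, as stated in the text
SAMPLE_RATE = 24000  # samples per second, as stated in the text


def frames_to_seconds(n_frames: int,
                      hop_size: int = HOP_SIZE,
                      sample_rate: int = SAMPLE_RATE) -> float:
    """Convert a length in frames to seconds of audio."""
    return n_frames * hop_size / sample_rate


if __name__ == "__main__":
    print(frames_to_seconds(1))    # 0.0125
    print(frames_to_seconds(400))  # 5.0 (illustrative max_len value)
```

Halving max_len halves the longest audio segment seen per sample, which is why lowering it is one of the suggested remedies for out-of-memory errors.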
- Loss becomes NaN: If it is the first stage, please make sure you do not use mixed precision, as it can cause the loss to become NaN for some particular datasets when the batch size is not set properly (it needs to be more than 16 to work well). For the second stage, please also experiment with different batch sizes, with higher batch sizes being more likely to cause NaN loss values. We recommend the batch size to be 16. You can refer to issues #10 and #11 for more details.
- Out of memory: Please use a lower batch_size or max_len. You may refer to issue #10 for more information.
- Non-English dataset: You can train on any language you want, but you will need to use a pre-trained PL-BERT model for that language. We have a pre-trained multilingual PL-BERT that supports 14 languages. You may refer to yl4579/StyleTTS#10 and #70 for some examples to train on Chinese datasets.
The script is modified from train_second.py, which uses DP, as DDP does not work for train_second.py. Please see the bold section above if you are willing to help with this problem.
python train_finetune.py --config_path ./Configs/config_ft.yml
Please make sure you have the LibriTTS checkpoint downloaded and unzipped under the folder. The default configuration config_ft.yml finetunes on LJSpeech with 1 hour of speech data (around 1k samples) for 50 epochs. This took about 4 hours to finish on four NVIDIA A100 GPUs. The quality is slightly worse (similar to NaturalSpeech on LJSpeech) than the LJSpeech model trained from scratch with 24 hours of speech data, which took around 2.5 days to finish on four A100 GPUs. The samples can be found at #65 (comment).
If you are using a single GPU (because the script doesn't work with DDP) and want to speed up training and save VRAM, you can run the following (thanks to @korakoe for making the script at #100):
accelerate launch --mixed_precision=fp16 --num_processes=1 train_finetune_accelerate.py --config_path ./Configs/config_ft.yml
@Kreevoz has made detailed notes on common issues in finetuning, with suggestions in maximizing audio quality: #81. Some of these also apply to training from scratch. @IIEleven11 has also made a guideline for fine-tuning: #128.
- Out of memory after joint_epoch: This is likely because your GPU RAM is not big enough for the SLM adversarial training run. You may skip that, but the quality could be worse. Setting joint_epoch to a larger number than epochs will skip the SLM adversarial training.
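The rule above can be expressed as a small check. This is only a sketch that mirrors the sentence in the text: the function names are hypothetical, and it makes no assumption about how the training loop counts epochs, so it reports the adversarial stage as skipped only when joint_epoch is strictly larger than epochs.

```python
def slm_adversarial_skipped(joint_epoch: int, epochs: int) -> bool:
    """True when joint_epoch is larger than epochs, the condition the
    text gives for skipping SLM adversarial training."""
    return joint_epoch > epochs


def joint_epoch_to_skip(epochs: int) -> int:
    """Smallest joint_epoch value that satisfies the rule above."""
    return epochs + 1


if __name__ == "__main__":
    epochs = 50  # the fine-tuning epoch count mentioned earlier
    print(slm_adversarial_skipped(joint_epoch=30, epochs=epochs))  # False
    print(joint_epoch_to_skip(epochs))                             # 51
```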
Please refer to Inference_LJSpeech.ipynb (single-speaker) and Inference_LibriTTS.ipynb (multi-speaker) for details. For LibriTTS, you will also need to download reference_audio.zip and unzip it under the demo folder before running the demo.
- The pretrained StyleTTS 2 on LJSpeech corpus in 24 kHz can be downloaded at https://huggingface.co/yl4579/StyleTTS2-LJSpeech/tree/main.
- The pretrained StyleTTS 2 model on LibriTTS can be downloaded at https://huggingface.co/yl4579/StyleTTS2-LibriTTS/tree/main.
You can import StyleTTS 2 and run it in your own code. However, the inference depends on a GPL-licensed package, so it is not included directly in this repository. A GPL-licensed fork has an importable script, as well as an experimental streaming API, etc. A fully MIT-licensed package that uses gruut (albeit lower quality due to mismatch between phonemizer and gruut) is also available.
Before using these pre-trained models, you agree to inform the listeners that the speech samples are synthesized by the pre-trained models, unless you have the permission to use the voice you synthesize. That is, you agree to only use voices whose speakers grant the permission to have their voice cloned, either directly or by license before making synthesized voices public, or you have to publicly announce that these voices are synthesized if you do not have the permission to use these voices.
- High-pitched background noise: This is caused by numerical float differences in older GPUs. For more details, please refer to issue #13. Basically, you will need to use more modern GPUs or do inference on CPUs.
- Pre-trained model license: You only need to abide by the above rules if you use the pre-trained models and the voices are NOT in the training set, i.e., your reference speakers are not from any open access dataset. For more details of rules to use the pre-trained models, please see #37.
- archinetai/audio-diffusion-pytorch
- jik876/hifi-gan
- rishikksh20/iSTFTNet-pytorch
- nii-yamagishilab/project-NN-Pytorch-scripts/project/01-nsf
Code: MIT License
Pre-Trained Models: Before using these pre-trained models, you agree to inform the listeners that the speech samples are synthesized by the pre-trained models, unless you have the permission to use the voice you synthesize. That is, you agree to only use voices whose speakers grant the permission to have their voice cloned, either directly or by license before making synthesized voices public, or you have to publicly announce that these voices are synthesized if you do not have the permission to use these voices.