PyTorch re-implementation of MelGAN: Generative Adversarial Networks for Conditional Waveform Synthesis.
-
Voice synthesis: converting text input into audio output
-
Audio modeling
Stage 1. Models the intermediate representation given text as input
Stage 2. Transforms the intermediate representation back to audio (i.e., vocoder)
-
Representation
- Typically chosen to be easier to model than raw audio while preserving enough information to allow faithful inversion back to audio
-
Mel-spectrogram (example data from the LJ Speech dataset [LJ001-0001])
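As a concrete example, below is a minimal sketch of computing a log-mel-spectrogram with torchaudio. The file path and the parameter values (1024-point FFT, 256-sample hop, 80 mel bands) are illustrative assumptions, not settings fixed by this repo.

```python
import torch
import torchaudio

# Load the example clip (path is an assumption; adjust to your dataset root).
waveform, sample_rate = torchaudio.load("LJSpeech-1.1/wavs/LJ001-0001.wav")

# STFT/mel parameters below are illustrative assumptions.
mel_transform = torchaudio.transforms.MelSpectrogram(
    sample_rate=sample_rate,  # 22,050 Hz for LJ Speech
    n_fft=1024,
    hop_length=256,
    n_mels=80,
)
mel = mel_transform(waveform)                    # (channels, n_mels, frames)
log_mel = torch.log(torch.clamp(mel, min=1e-5))  # log-compress for stability
```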
-
LJ Speech dataset:
It consists of 13,100 short audio clips of a single speaker reading passages from 7 non-fiction books.
Clips vary in length from 1 to 10 seconds and have a total length of approximately 24 hours.
-
Generator
- The generator is a fully convolutional feed-forward network with mel-spectrogram as input and raw waveform as output.
- A stack of transposed convolutional layers is used to upsample the input sequence, and each transposed convolutional layer is followed by a stack of residual blocks with dilated convolutions.
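A minimal sketch of such a generator in PyTorch follows. The channel widths, the upsampling factors (8, 8, 2, 2, i.e., 256x total, matching a 256-sample mel hop), and the dilation schedule are assumptions in the spirit of the MelGAN paper, not necessarily the exact hyperparameters of this implementation.

```python
import torch.nn as nn

class ResidualBlock(nn.Module):
    """Residual block around a dilated 3-tap convolution."""
    def __init__(self, channels, dilation):
        super().__init__()
        self.block = nn.Sequential(
            nn.LeakyReLU(0.2),
            nn.Conv1d(channels, channels, kernel_size=3,
                      dilation=dilation, padding=dilation),
            nn.LeakyReLU(0.2),
            nn.Conv1d(channels, channels, kernel_size=1),
        )

    def forward(self, x):
        return x + self.block(x)

class Generator(nn.Module):
    """Mel-spectrogram (B, n_mels, T) -> raw waveform (B, 1, 256 * T)."""
    def __init__(self, n_mels=80, base_channels=512):
        super().__init__()
        layers = [nn.Conv1d(n_mels, base_channels, kernel_size=7, padding=3)]
        ch = base_channels
        for rate in (8, 8, 2, 2):  # total upsampling factor: 8*8*2*2 = 256
            layers += [
                nn.LeakyReLU(0.2),
                # Transposed convolution upsamples the sequence by `rate`.
                nn.ConvTranspose1d(ch, ch // 2, kernel_size=2 * rate,
                                   stride=rate, padding=rate // 2),
            ]
            ch //= 2
            # Each upsampling layer is followed by dilated residual blocks.
            layers += [ResidualBlock(ch, dilation=3 ** i) for i in range(3)]
        layers += [
            nn.LeakyReLU(0.2),
            nn.Conv1d(ch, 1, kernel_size=7, padding=3),
            nn.Tanh(),  # waveform output in [-1, 1]
        ]
        self.net = nn.Sequential(*layers)

    def forward(self, mel):
        return self.net(mel)
```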
-
Discriminator
- A multi-scale architecture with three discriminators is adopted; the discriminators have an identical network structure but operate on different audio scales.
- This structure has the inductive bias that each discriminator learns features for a different frequency range of the audio.
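A minimal sketch of the multi-scale discriminator follows; the layer widths, kernel sizes, and use of grouped convolutions are assumptions loosely following the MelGAN paper. Strided average pooling supplies the 2x- and 4x-downsampled audio scales.

```python
import torch.nn as nn

class ScaleDiscriminator(nn.Module):
    """One discriminator; all three scales share this structure."""
    def __init__(self):
        super().__init__()
        self.layers = nn.Sequential(
            nn.Conv1d(1, 16, kernel_size=15, padding=7),
            nn.LeakyReLU(0.2),
            # Grouped, strided convolutions keep the parameter count down
            # while widening the receptive field.
            nn.Conv1d(16, 64, kernel_size=41, stride=4, padding=20, groups=4),
            nn.LeakyReLU(0.2),
            nn.Conv1d(64, 256, kernel_size=41, stride=4, padding=20, groups=16),
            nn.LeakyReLU(0.2),
            nn.Conv1d(256, 1, kernel_size=3, padding=1),  # real/fake score map
        )

    def forward(self, x):
        return self.layers(x)

class MultiScaleDiscriminator(nn.Module):
    """Three identical discriminators on raw, 2x-, and 4x-downsampled audio."""
    def __init__(self):
        super().__init__()
        self.discriminators = nn.ModuleList(
            [ScaleDiscriminator() for _ in range(3)]
        )
        self.pool = nn.AvgPool1d(kernel_size=4, stride=2, padding=1)

    def forward(self, audio):  # audio: (B, 1, T)
        scores = []
        for disc in self.discriminators:
            scores.append(disc(audio))
            audio = self.pool(audio)  # halve the effective sampling rate
        return scores
```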