diff --git a/README.md b/README.md
index 49ce7fd..65f81ce 100644
--- a/README.md
+++ b/README.md
@@ -17,7 +17,7 @@ The implementation is based on [Tensorflow](https://tensorflow.org/).
- [Prerequisites](#toc-prerequisites)
- [Installation](#toc-installation)
- [Dataset Preparation](#toc-preparation)
- 1. [Signal Statistics](#toc-preparation-signal-stats)
+ 1. [Dataset Definition File](#toc-preparation-definition)
2. [Feature Pre-Calculation](#toc-preparation-feature-pre-calc)
- [Training](#toc-training)
- [Evaluation](#toc-evaluation)
@@ -83,88 +83,96 @@ export PYTHONPATH=$PYTHONPATH:$PWD/tacotron:$PWD
## Dataset Preparation
-Datasets are loaded using dataset loaders.
-Currently each dataset requires a custom dataset loader to be written.
-Depending on how the loader does its job the datasets can be stored in nearly any form and file-format.
-If you want to use a custom dataset you have to write a custom loading helper.
-However, a few custom loaders for datasets exist already.
+Each dataset is defined by a dataset definition file (`dataset.json`).
+Additionally, each dataset has to provide separate `train.csv` and `eval.csv` listing files that
+define the data used for training and evaluation.
-See: [datasets/](datasets/)
+A dataset might look like this:
```bash
-datasets/
-├── blizzard_nancy.py
-├── cmu_slt.py
-├── lj_speech.py
-...
+some-dataset/
+├── dataset.json
+├── eval.csv
+├── train.csv
+└── wavs
```
Please take a look at [LJSPEECH.md](LJSPEECH.md) for a full step by step tutorial on how to train
a model on the LJ Speech v1.1 dataset.
-### Signal Statistics
-
-In order to create a model the exact character vocabulary and certain signal boundaries have to
-be calculated for normalization.
-The model uses linear scale as well as Mel scale spectrograms for synthesis.
-All spectrograms are scaled linearly to fit the range `(0.0, 1.0)` using global minimum and
-maximum dB values calculated on the training corpus.
-
-First we have to configure the dataset in [tacotron/params/dataset.py](tacotron/params/dataset.py).
-Enter the path to the dataset `dataset_folder` and set the `dataset_loader` variable to te
-loader required for your dataset.
+### Dataset Definition File
+
+In order for a model to work with a dataset, the file paths, the character vocabulary and certain
+signal statistics for normalization have to be known.
+Each dataset stores this information in the definition file `dataset.json`.
+
+Let's take a look at an example definition file for the LJ Speech dataset.
+```json
+{
+ "dataset_folder": "/datasets/LJSpeech-1.1",
+ "audio_folder": "wavs",
+ "train_listing": "train.csv",
+ "eval_listing": "eval.csv",
+ "vocabulary": {
+ "pad": 0,
+ "eos": 1,
+ "p": 2,
+ .
+ .
+ .
+ "!": 37,
+ "?": 38
+ },
+ "normalization": {
+ "mel_mag_ref_db": 6.02,
+ "mel_mag_max_db": 99.89,
+ "linear_ref_db": 35.66,
+ "linear_mag_max_db": 100
+ }
+}
+```
-Then calculate the vocabulary and the signal boundaries using:
+The `dataset_folder` field defines the path to the dataset base folder.
+All other paths are relative to that folder.
+For example, `audio_folder` gives the relative path to the audio file folder, while `train_listing`
+and `eval_listing` give the paths to the training and evaluation listing files.
+Apart from the paths, the definition file also defines `vocabulary` and `normalization`.
+The vocabulary is an enumerated lookup dictionary defining all characters the model should be
+trained on.
+The only exceptions are the two virtual `pad` and `eos` symbols, which are used for padding and
+for marking the end of a sequence.
+Finally, `normalization` is a lookup dictionary containing the normalization parameters for
+feature calculation.
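+
+As a small illustration, the snippet below sketches how the vocabulary could be used to turn a
+sentence into token ids; it relies only on the example `dataset.json` shown above and is not part
+of the repository code.
+```python
+import json
+
+# Load the definition file from the example above (the path is just an example).
+with open('/datasets/LJSpeech-1.1/dataset.json') as json_file:
+    definition = json.load(json_file)
+
+vocabulary = definition['vocabulary']
+
+# Map each character of a sentence to its integer id and append the virtual eos token.
+sentence = 'printing'
+tokens = [vocabulary[char] for char in sentence] + [vocabulary['eos']]
+print(tokens)
+```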
+
+Before creating the definition file, one has to generate the `train.csv` and `eval.csv`
+files for the dataset to be used.
+Each line of these files uses the delimiter `|` and has the following format:
```bash
-python tacotron/dataset_statistics.py
-
-Dataset: /my-dataset-path/LJSpeech-1.1
-Loading dataset ...
-Dataset vocabulary:
-vocabulary_dict={
- 'pad': 0,
- 'eos': 1,
- 'p': 2,
- 'r': 3,
- 'i': 4,
- 'n': 5,
- 't': 6,
- 'g': 7,
- ' ': 8,
- 'h': 9,
- 'e': 10,
- 'o': 11,
- 'l': 12,
- 'y': 13,
- 's': 14,
- 'w': 15,
- 'c': 16,
- 'a': 17,
- 'd': 18,
- 'f': 19,
- 'm': 20,
- 'x': 21,
- 'b': 22,
- 'v': 23,
- 'u': 24,
- 'k': 25,
- 'j': 26,
- 'z': 27,
- 'q': 28,
-},
-vocabulary_size=29
-
-
-Collecting decibel statistics for 13100 files ...
-mel_mag_ref_db = 6.026512479977281
-mel_mag_max_db = -99.89414986824931
-linear_ref_db = 35.65918850818663
-linear_mag_max_db = -100.0
+<audio file>.wav|<sentence>
+```
+Take a look at [datasets/preparation/ljspeech.py](datasets/preparation/ljspeech.py) to see how
+`train.csv` and `eval.csv` can be generated.
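+
+As a rough sketch, listing files in this format can also be written with the standard library
+alone; the audio file names and sentences below are placeholders, not the actual LJ Speech
+preparation code.
+```python
+import csv
+import random
+
+# Placeholder (audio file, sentence) pairs; a real script would read them from the
+# dataset's own metadata.
+rows = [
+    ('LJ001-0001.wav', 'some example sentence'),
+    ('LJ001-0002.wav', 'another example sentence'),
+]
+
+# Shuffle and split the rows into a training and an evaluation portion.
+random.shuffle(rows)
+split = int(len(rows) * 0.9)
+
+# Write pipe-delimited listing files.
+for path, subset in (('train.csv', rows[:split]), ('eval.csv', rows[split:])):
+    with open(path, 'w', newline='') as listing_file:
+        writer = csv.writer(listing_file, delimiter='|')
+        writer.writerows(subset)
+```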
+
+Once both `train.csv` and `eval.csv` exist, the dataset definition file can be generated as in
+the following example:
+```python
+from datasets.dataset import Dataset
+
+dataset = Dataset('/tmp/LJSpeech-1.1/dataset.json')
+dataset.set_dataset_folder('/tmp/LJSpeech-1.1/')
+dataset.set_audio_folder('wavs')
+dataset.set_train_listing_file('train.csv')
+dataset.set_eval_listing_file('eval.csv')
+dataset.load_listings(stale=True)
+dataset.generate_vocabulary()
+
+# Calculates the signal statistics over the entire dataset (may take a while).
+dataset.generate_normalization(n_threads=4)
+dataset.save()
```
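+
+Once the definition file has been written it can be loaded again, for example for a quick sanity
+check (the path is the same example path used above):
+```python
+from datasets.dataset import Dataset
+
+# Load the previously generated definition together with its listing files.
+dataset = Dataset('/tmp/LJSpeech-1.1/dataset.json')
+dataset.load()
+dataset.load_listings()
+```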
-Now complement `vocabulary_dict` and `vocabulary_size` in [tacotron/params/dataset.py](tacotron/params/dataset.py) and transfer the decibel boundaries (`mel_mag_ref_db`,
-`mel_mag_max_db`, `linear_ref_db`, `linear_mag_max_db`) to your loader.
-Each loader derived from `DatasetHelper` has to define these variables in order to be able to
-normalize the audio files.
+Finally, the path to the dataset definition file the model should load has to be configured in
+[tacotron/params/dataset.py](tacotron/params/dataset.py) by setting the `dataset_file` parameter.
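+
+The entry might then look roughly like this (only a sketch; the surrounding structure of the
+params file is not shown here and may differ):
+```python
+# Inside tacotron/params/dataset.py (sketch).
+dataset_file = '/datasets/LJSpeech-1.1/dataset.json'
+```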
### Feature Pre-Calculation
@@ -173,9 +181,11 @@ pre-calculate features and store them on disk.
To pre-calculate features run:
```bash
-python tacotron/dataset_precalc_features.py
+python tacotron/calculate_features.py
```
+The features are calculated using the general model parameters set in
+[tacotron/params/model.py](tacotron/params/model.py).
The pre-computed features are stored as `.npz` files next to the actual audio files.
Note that independent from pre-calculation, features can also be cached in RAM to accelerate throughput.
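+For reference, a pre-calculated feature file can be inspected with NumPy; the file name below is
+only an example and the array keys depend on the implementation:
+```python
+import numpy as np
+
+# Open a pre-calculated feature archive stored next to its audio file.
+features = np.load('/datasets/LJSpeech-1.1/wavs/LJ001-0001.npz')
+
+# List the arrays contained in the archive together with their shapes.
+for name in features.files:
+    print(name, features[name].shape)
+```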
@@ -372,7 +382,8 @@ Just open a pull request with your proposed changes.
* The model architecture is implemented in [tacotron/model.py](tacotron/model.py).
* The model architecture parameters are defined in [tacotron/params/model.py](tacotron/params/model.py).
* The train code is defined [tacotron/train.py](tacotron/train.py).
-* Take a look at [datasets/blizzard_nancy.py](datasets/blizzard_nancy.py) to see how a dataset loading helper has to implemented.
+* Take a look at [datasets/dataset.py](datasets/dataset.py) to see how dataset loading is
+  implemented.
### Todo
See [Issues](https://github.com/yweweler/single-speaker-tts/issues)
diff --git a/datasets/dataset.py b/datasets/dataset.py
index a43e97d..4e5fbf5 100644
--- a/datasets/dataset.py
+++ b/datasets/dataset.py
@@ -315,9 +315,18 @@ def save(self):
with open(self.__dataset_file, 'w') as json_file:
json.dump(self.__definition, json_file, indent=2)
- def load_listings(self):
+ def load_listings(self, stale=False):
"""
Load and parse the train and eval listing files from disk.
+
+ Arguments:
+ stale (bool):
+                Flag indicating whether the dataset is loaded in an incomplete state.
+                When set to True, the loader does not assume that a vocabulary is available and
+                only parses the raw rows from the listing file.
+                If set to False, the loader assumes that the dataset definition is complete and
+                augments the parsed rows with additional data.
+ Default is False.
"""
# Get the full train listing file path.
train_listing_file = os.path.join(
@@ -326,7 +335,7 @@ def load_listings(self):
)
# Load and parse the file.
- parsed_train_rows = self.__load_listing_file(train_listing_file)
+ parsed_train_rows = self.__load_listing_file(train_listing_file, stale)
self.__train_listing = parsed_train_rows
print('Loaded {} train rows'.format(len(parsed_train_rows)))
@@ -337,18 +346,26 @@ def load_listings(self):
)
# Load and parse the file.
- parsed_eval_rows = self.__load_listing_file(eval_listing_file)
+ parsed_eval_rows = self.__load_listing_file(eval_listing_file, stale)
self.__eval_listing = parsed_eval_rows
print('Loaded {} eval rows'.format(len(parsed_eval_rows)))
- def __load_listing_file(self, _listing_file):
+ def __load_listing_file(self, _listing_file, stale=False):
"""
- Load and parse a specific listing frile from disk.
+ Load and parse a specific listing file from disk.
Arguments:
_listing_file (str):
path to the listing file to be loaded.
+ stale (bool):
+                Flag indicating whether the dataset is loaded in an incomplete state.
+                When set to True, the loader does not assume that a vocabulary is available and
+                only parses the raw rows from the listing file.
+                If set to False, the loader assumes that the dataset definition is complete and
+                augments the parsed rows with additional data.
+ Default is False.
+
Returns (:obj:`list` of :obj:`dict`):
List of parsed listing rows as dictionaries.
The keys of each dictionary are: ['audio_path', 'sentence', 'tokenized_sentence',
@@ -362,18 +379,20 @@ def __load_listing_file(self, _listing_file):
for row in csv_rows:
parsed_row = self.__parse_listing_row(row)
- # Tokenize the sentence.
- sentence = parsed_row['sentence']
- tokenized_sentence = self.sentence2tokens(sentence)
- tokenized_sentence.append(self.get_eos_token())
+ if stale is False:
+ # Tokenize the sentence.
+ sentence = parsed_row['sentence']
+ tokenized_sentence = self.sentence2tokens(sentence)
+ tokenized_sentence.append(self.get_eos_token())
- # Get the length of the tokenized sentence.
- tokenized_sentence_length = len(tokenized_sentence)
+ # Get the length of the tokenized sentence.
+ tokenized_sentence_length = len(tokenized_sentence)
+
+ parsed_row.update({
+ 'tokenized_sentence': np.array(tokenized_sentence, dtype=np.int32),
+ 'tokenized_sentence_length': tokenized_sentence_length
+ })
- parsed_row.update({
- 'tokenized_sentence': np.array(tokenized_sentence, dtype=np.int32),
- 'tokenized_sentence_length': tokenized_sentence_length
- })
parsed_rows.append(parsed_row)
return parsed_rows
@@ -427,17 +446,17 @@ def generate_normalization(self, n_threads=1):
"linear_mag_max_db": stats[0]
}
-# if __name__ == '__main__':
-# dataset = Dataset('/tmp/LJSpeech-1.1/dataset.json')
-# dataset.load()
-# dataset.load_listings()
-#
-# # dataset.set_dataset_folder('/tmp/LJSpeech-1.1/')
-# # dataset.set_audio_folder('wavs')
-# # dataset.set_train_listing_file('train.csv')
-# # dataset.set_eval_listing_file('eval.csv')
-# # TODO: Fix cyclic dependency between `load_listings` and `generate_vocabulary`.
-# # dataset.load_listings()
-# # dataset.generate_vocabulary()
-# # dataset.generate_normalization()
-# # dataset.save()
+
+if __name__ == '__main__':
+ dataset = Dataset('/tmp/LJSpeech-1.1/dataset.json')
+    # To load an existing dataset definition instead of building one, use:
+    # dataset.load()
+    # dataset.load_listings()
+
+ dataset.set_dataset_folder('/tmp/LJSpeech-1.1/')
+ dataset.set_audio_folder('wavs')
+ dataset.set_train_listing_file('train.csv')
+ dataset.set_eval_listing_file('eval.csv')
+ dataset.load_listings(stale=True)
+ dataset.generate_vocabulary()
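+    # Calculate the signal statistics over the entire dataset (may take a while).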
+ dataset.generate_normalization()
+ dataset.save()