Training Polish Language - would changing the tokenizer help? #51
-
Hi,
I thought that maybe the problem is with the tokenizer, but it could be something else as well.
-
As I tell everyone, the most likely problem is a lack of CLVP fine-tuning. You can find neonbjb's configs for that here, but you would have to adapt them for your own purposes (a bit non-trivial).

You can't exactly switch to a different tokenizer on the fly; it would immediately destroy the model's capabilities. A talented developer could possibly add an additional intermediary layer to convert the new tokens to the old tokens (a rough sketch follows below), but that would be harder than doing the above.

It is also likely that the other models (diffusion upscaler, vocoder, voicefixer) all play a minor role in the degraded capabilities, but I would only touch those after you have dealt with everything above.
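To make the intermediary-layer idea concrete, here is a purely hypothetical sketch of its simplest variant: a static lookup that re-encodes each new-tokenizer token's surface string with the original tokenizer. All paths and names are illustrative, it glosses over normalization and whitespace handling, and note that characters absent from the old vocab (e.g. Polish diacritics) would still collapse to [UNK], which is part of why this route is hard.

```python
# Hypothetical sketch of a token-conversion shim: build a static lookup from
# new-tokenizer IDs to old-tokenizer ID sequences by re-encoding each new
# token's surface string with the original tokenizer. Paths are illustrative.
from tokenizers import Tokenizer

old_tok = Tokenizer.from_file("tortoise/data/tokenizer.json")  # stock English tokenizer
new_tok = Tokenizer.from_file("tokenizer_pl.json")             # hypothetical Polish tokenizer

# Map every new token ID to the old-token ID sequence that spells the same string.
new_to_old = {
    new_id: old_tok.encode(piece).ids
    for piece, new_id in new_tok.get_vocab().items()
}

def convert(new_ids):
    """Flatten a sequence of new-tokenizer IDs into old-tokenizer IDs."""
    old_ids = []
    for i in new_ids:
        old_ids.extend(new_to_old[i])
    return old_ids
```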
-
I tested it, and after changing the tokenizer it is possible to obtain correct pronunciation in a given language, but the tokenizer needs to be changed during both training and synthesis.

It is necessary to generate your own tokenizer.json in your language, preferably using an ebook that contains all the letters. DLAS provides a script for generating the tokenizer, but it does not work properly: DL-Art-School/codes/data/audio/voice_tokenizer.py (a sketch of an alternative follows below).

For training, you just need to add the following line to EXAMPLE_gpt.yml, under both train and val:

tokenizer_vocab: path/to/tokenizer_json

In the case of tortoise-tts-fast, you need to change the path in the file tortoise/utils/tokenizer.py on line 180 to your own tokenizer.json, and on line 190 change english_cleaners to basic_cleaners. This works for the Polish language and correctly pronounces all words.
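For reference, here is a minimal sketch of how such a tokenizer.json could be generated with the Hugging Face tokenizers library (which the DLAS script itself builds on), as an alternative to the broken script. The file paths, vocab size, and special tokens are assumptions modeled on tortoise's stock English tokenizer, not a verified recipe.

```python
# Minimal sketch: train a BPE tokenizer on a Polish ebook and save it as a
# tokenizer.json usable in place of the stock one. "book_pl.txt" and
# "tokenizer_pl.json" are hypothetical paths; the special tokens and small
# vocab size mirror tortoise's stock tokenizer (which replaces ' ' with
# [SPACE] before encoding).
from tokenizers import Tokenizer
from tokenizers.models import BPE
from tokenizers.pre_tokenizers import Whitespace
from tokenizers.trainers import BpeTrainer

tokenizer = Tokenizer(BPE(unk_token="[UNK]"))
tokenizer.pre_tokenizer = Whitespace()

trainer = BpeTrainer(
    vocab_size=255,  # tortoise's stock vocab size; keep it if you reuse model dims
    special_tokens=["[STOP]", "[UNK]", "[SPACE]"],
)
tokenizer.train(files=["book_pl.txt"], trainer=trainer)
tokenizer.save("tokenizer_pl.json")

# Quick sanity check: Polish diacritics should not map to [UNK].
print(tokenizer.encode("zażółć gęślą jaźń").tokens)
```

A quick check like the last line is worth running on a pangram in your language, since any letter missing from the training text will fall back to [UNK] and be mispronounced.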
-
@LorenzoBrugioni @Brugio96