Training Polish Language - would changing the tokenizer help? #51
-
Hi,
I thought that maybe the problem is with the tokenizer, but it could be something else as well.
-
As I tell everyone, the most likely problem is a lack of CLVP fine-tuning. You can find neonbjb's configs for that here, but you would have to adapt them for your own purposes (a bit non-trivial).

You can't exactly switch to a different tokenizer on the fly; it would immediately destroy the model's capabilities. A talented developer could possibly add an additional intermediary layer to convert the new tokens to the old tokens (a rough sketch follows below), but that would be harder than doing the above.

It is also likely that the other models (diffusion upscaler, vocoder, voicefixer) all play a minor role in the degraded capabilities, but I would only touch those after you have dealt with everything above.
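To make the intermediary-layer idea concrete, here is a purely hypothetical sketch of its simplest variant: a static lookup that re-encodes each new-tokenizer token's surface string with the original tokenizer. All paths and names are illustrative, it glosses over normalization and whitespace handling, and note that characters absent from the old vocab (e.g. Polish diacritics) would still collapse to [UNK], which is part of why this route is hard.

```python
# Hypothetical sketch of a token-conversion shim: build a static lookup from
# new-tokenizer IDs to old-tokenizer ID sequences by re-encoding each new
# token's surface string with the original tokenizer. Paths are illustrative.
from tokenizers import Tokenizer

old_tok = Tokenizer.from_file("tortoise/data/tokenizer.json")  # stock English tokenizer
new_tok = Tokenizer.from_file("tokenizer_pl.json")             # hypothetical Polish tokenizer

# Map every new token ID to the old-token ID sequence that spells the same string.
new_to_old = {
    new_id: old_tok.encode(piece).ids
    for piece, new_id in new_tok.get_vocab().items()
}

def convert(new_ids):
    """Flatten a sequence of new-tokenizer IDs into old-tokenizer IDs."""
    old_ids = []
    for i in new_ids:
        old_ids.extend(new_to_old[i])
    return old_ids
```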
-
I tested it, and after changing the tokenizer it is possible to obtain correct pronunciation in a given language, but the tokenizer needs to be changed during both training and synthesis.

It is necessary to generate your own tokenizer.json in your language, preferably using an ebook that contains all the letters. DLAS provides a script for generating the tokenizer, but it does not work properly: DL-Art-School/codes/data/audio/voice_tokenizer.py (a sketch of an alternative follows below).

For training, you just need to add the following line to EXAMPLE_gpt.yml, under both train and val:

tokenizer_vocab: path/to/tokenizer_json

In the case of tortoise-tts-fast, you need to change the path in the file tortoise/utils/tokenizer.py on line 180 to your own tokenizer.json, and on line 190 change english_cleaners to basic_cleaners. This works for the Polish language and correctly pronounces all words.
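For reference, here is a minimal sketch of how such a tokenizer.json could be generated with the Hugging Face tokenizers library (which the DLAS script itself builds on), as an alternative to the broken script. The file paths, vocab size, and special tokens are assumptions modeled on tortoise's stock English tokenizer, not a verified recipe.

```python
# Minimal sketch: train a BPE tokenizer on a Polish ebook and save it as a
# tokenizer.json usable in place of the stock one. "book_pl.txt" and
# "tokenizer_pl.json" are hypothetical paths; the special tokens and small
# vocab size mirror tortoise's stock tokenizer (which replaces ' ' with
# [SPACE] before encoding).
from tokenizers import Tokenizer
from tokenizers.models import BPE
from tokenizers.pre_tokenizers import Whitespace
from tokenizers.trainers import BpeTrainer

tokenizer = Tokenizer(BPE(unk_token="[UNK]"))
tokenizer.pre_tokenizer = Whitespace()

trainer = BpeTrainer(
    vocab_size=255,  # tortoise's stock vocab size; keep it if you reuse model dims
    special_tokens=["[STOP]", "[UNK]", "[SPACE]"],
)
tokenizer.train(files=["book_pl.txt"], trainer=trainer)
tokenizer.save("tokenizer_pl.json")

# Quick sanity check: Polish diacritics should not map to [UNK].
print(tokenizer.encode("zażółć gęślą jaźń").tokens)
```

A quick check like the last line is worth running on a pangram in your language, since any letter missing from the training text will fall back to [UNK] and be mispronounced.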
-
@LorenzoBrugioni @Brugio96