Improve phonemize for multi-language support #108

Open
HDANILO opened this issue Feb 16, 2025 · 11 comments
Assignees
Labels
feature Further information is requested

Comments

@HDANILO

HDANILO commented Feb 16, 2025

Describe the feature

    def phonemize(self, text, lang="en-us", norm=True) -> str:
        """
        lang can be 'en-us' or 'en-gb'
        """
        if norm:
            text = Tokenizer.normalize_text(text)

        phonemes = phonemizer.phonemize(
            text, lang, preserve_punctuation=True, with_stress=True
        )

        # https://en.wiktionary.org/wiki/kokoro#English
        phonemes = phonemes.replace("kəkˈoːɹoʊ", "kˈoʊkəɹoʊ").replace(
            "kəkˈɔːɹəʊ", "kˈəʊkəɹəʊ"
        )
        phonemes = (
            phonemes.replace("ʲ", "j")
            .replace("r", "ɹ")
            .replace("x", "k")
            .replace("ɬ", "l")
        )
        phonemes = re.sub(r"(?<=[a-zɹː])(?=hˈʌndɹɪd)", " ", phonemes)
        phonemes = re.sub(r' z(?=[;:,.!?¡¿—…"«»“” ]|$)', "z", phonemes)
        if lang == "en-us":
            phonemes = re.sub(r"(?<=nˈaɪn)ti(?!ː)", "di", phonemes)
        phonemes = "".join(filter(lambda p: p in VOCAB, phonemes))
        return phonemes.strip()

phonemizer.phonemize should already handle phoneme alterations for each language. By injecting phoneme replacements on top of it, you're binding kokoro-onnx to English, which is a bad design choice.

I ran a simple test on my computer and got Brazilian Portuguese generation to sound almost perfect just by removing all these replacements.
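
To illustrate the point, here is a minimal sketch of gating the English-only fixes by language instead of applying them to every input. `ENGLISH_RULES`, `POST_RULES` and `apply_post_rules` are hypothetical names, not part of kokoro-onnx; the two regex rules are copied from the snippet above.

```python
import re

# Hypothetical rule tables; not part of kokoro-onnx.
# Rules shared by every English variant (copied from the snippet above).
ENGLISH_RULES = [(r"(?<=[a-zɹː])(?=hˈʌndɹɪd)", " ")]

# Extra rules for one specific language code.
POST_RULES = {
    "en-us": [(r"(?<=nˈaɪn)ti(?!ː)", "di")],
}


def apply_post_rules(phonemes: str, lang: str) -> str:
    """Apply only the phoneme fixes registered for `lang`."""
    rules = []
    if lang.startswith("en"):
        rules += ENGLISH_RULES
    rules += POST_RULES.get(lang, [])
    for pattern, repl in rules:
        phonemes = re.sub(pattern, repl, phonemes)
    return phonemes
```

With a table like this, a language such as "pt-br" that has no entry passes through unchanged, while "en-us" still gets its fixes.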

@HDANILO HDANILO added the feature Further information is requested label Feb 16, 2025
@thewh1teagle
Owner

We should remove all the replaces; I didn't notice them.
You can create a PR, or I'll update it in a few days.

@HDANILO
Author

HDANILO commented Feb 16, 2025

Please have a look at the proposed design here:

#109

The idea of having specific pre-processing per language is good, and it definitely worked well for English. I think it's a good idea to keep it around, while also giving other languages the same possibility.

For instance, "R$ 10,10" (in Portuguese "dez reais e dez centavos", i.e. ten reais and ten centavos) is read as "R dolar thousand and ten" by the current version of Tokenizer. After the split it is read as "R Dolar ten ten". Ideally, once a PortugueseTokenizer is implemented, we would hear something like "dez reais e dez centavos".
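
As a rough idea of what such Portuguese pre-processing could look like, here is a minimal sketch that rewrites "R$" amounts into words before phonemization. `expand_brl` is a hypothetical helper, not something in kokoro-onnx or PR #109, and it assumes the digits themselves are left for the phonemizer to read.

```python
import re

# Hypothetical helper, not part of kokoro-onnx.
# Matches amounts such as "R$ 10,10" or "R$10.000,52".
_BRL = re.compile(r"R\$\s?(\d+(?:\.\d{3})*)(?:,(\d{2}))?")


def expand_brl(text: str) -> str:
    """Rewrite Brazilian real amounts as '<n> reais e <m> centavos'."""

    def repl(match: re.Match) -> str:
        # "." is the thousands separator in pt-br, so drop it.
        reais = int(match.group(1).replace(".", ""))
        centavos = int(match.group(2)) if match.group(2) else 0
        parts = []
        if reais:
            parts.append(f"{reais} {'real' if reais == 1 else 'reais'}")
        if centavos:
            parts.append(f"{centavos} {'centavo' if centavos == 1 else 'centavos'}")
        return " e ".join(parts) or "0 reais"

    return _BRL.sub(repl, text)
```

For example, `expand_brl("R$ 10,10")` gives "10 reais e 10 centavos", which a pt-br phonemizer should be able to read naturally.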

If you wish to have that merged, let me know the next steps.

@thewh1teagle
Owner

I didn't understand what makes the tokenizer pronounce it well (besides the PR).

@thewh1teagle
Owner

Also, one feature or bug per PR, small and focused.
I meant only removing the replace calls.

@thewh1teagle
Owner

Did you try the misaki example?
It should pronounce it well.

@HDANILO
Author

HDANILO commented Feb 16, 2025

Did you try the misaki example?

I haven't seen the misaki example. Would you please link it for me?

I meant only removing the replace calls.

Removing only the replace calls doesn't do the job, because normalize_text does a lot of pre-processing so that formats like "$10,000.52" are pronounced correctly. For example, it strips the "," from "$10,000.52", making it "$10000.52". Other languages format numbers completely differently: in Portuguese the equivalent of "$10,000.52" is "R$10.000,52". So the change is more fundamental, and if we take all the replacements out of normalize_text, the quality of English won't be as good.
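
The separator mismatch can be shown with a small sketch. `SEPARATORS` and `normalize_number` are hypothetical names used only for illustration; they are not taken from kokoro-onnx.

```python
# Hypothetical table: (thousands separator, decimal separator) per language.
SEPARATORS = {
    "en-us": (",", "."),
    "en-gb": (",", "."),
    "pt-br": (".", ","),
}


def normalize_number(number: str, lang: str) -> str:
    """Drop the thousands separator and use '.' as the decimal mark."""
    thousands, decimal = SEPARATORS[lang]
    return number.replace(thousands, "").replace(decimal, ".")


# Each format is handled correctly with its own rules...
print(normalize_number("10,000.52", "en-us"))  # 10000.52
print(normalize_number("10.000,52", "pt-br"))  # 10000.52
# ...but English rules applied to a Portuguese number give a wrong value.
print(normalize_number("10.000,52", "en-us"))  # 10.00052
```

The last call is the failure mode described above: stripping "," is only correct when "," is the thousands separator.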

In my PR, Tokenizer is the version where most, if not all, replacements are removed, and EnglishTokenizer is the version that keeps the replacements relevant to English. This way we guarantee there's room for specialization. The trade-off is that we had to introduce a Factory to create the right Tokenizer version.
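
A minimal sketch of that layout, under stated assumptions: the class names follow the description above, but the method bodies and the `create_tokenizer` function are hypothetical and may not match what PR #109 actually contains.

```python
import re


class Tokenizer:
    """Language-neutral base: only collapses whitespace."""

    def normalize_text(self, text: str) -> str:
        return re.sub(r"\s+", " ", text).strip()


class EnglishTokenizer(Tokenizer):
    """Keeps the English-specific replacements on top of the base."""

    def normalize_text(self, text: str) -> str:
        text = super().normalize_text(text)
        # English-only rule: "," is a thousands separator,
        # so "$10,000.52" becomes "$10000.52".
        return re.sub(r"(?<=\d),(?=\d{3})", "", text)


def create_tokenizer(lang: str) -> Tokenizer:
    """Hypothetical factory: pick the tokenizer for a language code."""
    if lang.startswith("en"):
        return EnglishTokenizer()
    return Tokenizer()
```

Here `create_tokenizer("pt-br")` falls back to the neutral base, so "R$10.000,52" is left untouched until a dedicated PortugueseTokenizer exists.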

@HDANILO
Author

HDANILO commented Feb 16, 2025

The other option I see is to remove all language-specific replacements and pre-processing and delegate them to a library that already does this, but I don't know of one. Removing them carelessly now will definitely degrade English quality.

@thewh1teagle
Owner

https://github.com/thewh1teagle/kokoro-onnx/blob/main/examples/language.py

Try with misaki
I don't know if it supports your language; the details are there.

@HDANILO
Author

HDANILO commented Feb 16, 2025

I don't understand phonemes, so it's hard for me to judge, but I've been using the PR I sent to generate Brazilian Portuguese narration for some storytelling TikTok videos, and the result has been great, better than the other alternatives I tried. It could be better still if we implemented a "PortugueseTokenizer" that pre-processes some of the text into a more readable format, as is already done for English.

But I guess that's a discussion for another feature.

@HDANILO
Author

HDANILO commented Feb 16, 2025

Try with misaki

OK, I spent some time looking into misaki. It indeed doesn't support pt-br, but espeak does quite well. I modified language.py to output good-sounding Portuguese audio:

"""
Note: on Linux you need to run this as well: apt-get install portaudio19-dev

1. Prepare virtual environment
    uv venv --seed -p 3.11
    source .venv/bin/activate

2. Install packages
    pip install kokoro-onnx sounddevice 'misaki[en]'

3. Download models
    wget https://github.com/thewh1teagle/kokoro-onnx/releases/download/model-files-v1.0/kokoro-v1.0.onnx
    wget https://github.com/thewh1teagle/kokoro-onnx/releases/download/model-files-v1.0/voices-v1.0.bin

4. Run
    python examples/language.py

Please read carefully https://huggingface.co/hexgrad/Kokoro-82M/blob/main/VOICES.md
To use other languages install misaki with the specific language. Example: pip install misaki[ko] (Korean). And change the import. Example: from misaki.ko import KOG2P
"""

import ctypes

import espeakng_loader
import phonemizer
import sounddevice as sd
from phonemizer.backend.espeak.wrapper import EspeakWrapper

from kokoro_onnx import Kokoro, log

# Check that the espeak-ng library can be loaded
try:
    ctypes.cdll.LoadLibrary(espeakng_loader.get_library_path())
except Exception as e:
    log.error(f"Failed to load espeak shared library: {e}")

EspeakWrapper.set_data_path(espeakng_loader.get_data_path())
EspeakWrapper.set_library(espeakng_loader.get_library_path())

# Kokoro
kokoro = Kokoro("kokoro-v1.0.onnx", "voices-v1.0.bin")

# Phonemize
text = "Kokoro é uma biblioteca de conversão de texto em fala."
phonemes = phonemizer.phonemize(
    text, language="pt-br", with_stress=True, backend="espeak"
)

# Create
samples, sample_rate = kokoro.create(
    phonemes, voice="pm_alex", is_phonemes=True, lang="pt-br"
)

# Play
print("Playing audio...")
sd.play(samples, sample_rate)
sd.wait()

@HDANILO
Author

HDANILO commented Feb 16, 2025

Should Misaki be the one dealing with language-specific text pre-processing? Perhaps that's the part I've been missing.
