Yohane takes a song and its lyrics, extracts the vocals, splits the lyrics into syllables, and computes a forced alignment to generate karaoke timing in an Aegisub subtitle file (.ass).
Open the notebook in Google Colab to use the GPU resources it offers:
The full pipeline completes in less than a minute in that environment.
Requirements:
git clone https://github.com/Japan7/yohane.git
cd yohane/
poetry install --only main --extras torch
poetry run yohane
- Yohane's syllable splitting is optimized for Japanese lyrics
- The Torchaudio ffmpeg backend is not available on Windows: convert your song file to .wav beforehand with
ffmpeg -i <src> <out>.wav
- Long syllables at the end of lines are often truncated
- Forced alignment can't deal with overlapping vocals
- The result is not fully accurate: you should still check and edit it!
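
The .wav conversion mentioned in the list above can also be scripted when several songs need converting. This is only a minimal sketch, assuming `ffmpeg` is installed and on the PATH; the helper names `build_ffmpeg_cmd` and `convert_to_wav` are hypothetical and not part of Yohane.

```python
import subprocess
from pathlib import Path


def build_ffmpeg_cmd(src: str) -> list[str]:
    """Build the `ffmpeg -i <src> <out>.wav` command for a song file.

    Hypothetical helper: the output path is the source path with its
    extension replaced by .wav.
    """
    out = Path(src).with_suffix(".wav")
    return ["ffmpeg", "-i", str(src), str(out)]


def convert_to_wav(src: str) -> Path:
    """Run ffmpeg and return the path of the converted file."""
    cmd = build_ffmpeg_cmd(src)
    # check=True raises CalledProcessError if ffmpeg exits with an error
    subprocess.run(cmd, check=True)
    return Path(cmd[-1])
```

Calling `convert_to_wav("song.mp3")` would write `song.wav` next to the source file; the source should not already be a .wav file, since input and output paths would then be identical.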
- Get the song and its lyrics
- Use the yohane notebook or the CLI locally to generate the karaoke file
In Aegisub:
- Load the .ass and the video
- Replace the Default style with your own
- Normalization during processing lowercases the lines and removes special characters: use the original lines, kept as comments, to fix the timed lines
- Subtitle > Select Lines… > check Comments and Set selection > OK, then delete the selected lines
- Listen to each line and fix its End time
- Iterate over each line in karaoke mode and merge/fix syllable timings
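
To illustrate what merging syllables in karaoke mode does to a line, here is a small sketch. It assumes the timed lines use plain `\k` tags, whose durations are expressed in centiseconds, and no other override tags; the function names are hypothetical and this is not Yohane's or Aegisub's own code.

```python
import re

# Matches one karaoke syllable: a {\k<duration>} tag followed by its text.
# Assumption: plain \k tags only (no \kf, \ko or other override tags).
K_TAG = re.compile(r"\{\\k(\d+)\}([^{]*)")


def parse_syllables(text: str) -> list[tuple[int, str]]:
    """Return (duration in centiseconds, syllable text) pairs."""
    return [(int(d), s) for d, s in K_TAG.findall(text)]


def merge_syllables(syls: list[tuple[int, str]], index: int) -> list[tuple[int, str]]:
    """Merge syllable `index` with the next one, summing their durations."""
    (d1, s1), (d2, s2) = syls[index], syls[index + 1]
    return syls[:index] + [(d1 + d2, s1 + s2)] + syls[index + 2:]


def render(syls: list[tuple[int, str]]) -> str:
    """Rebuild the line text from (duration, syllable) pairs."""
    return "".join(f"{{\\k{d}}}{s}" for d, s in syls)


syls = parse_syllables(r"{\k20}yo{\k35}ha{\k50}ne")
print(render(merge_syllables(syls, 1)))  # {\k20}yo{\k85}hane
```

Merging keeps the total duration of the line unchanged, which is why merged syllables stay in sync with the rest of the line.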