GitHub - NicolasBFR/Diff-HierVC: Official Pytorch Implementation of "Diff-HierVC: Diffusion-based Hierarchical Voice Conversion with Robust Pitch Generation and Masked Prior for Zero-shot Speaker Adaptation"

Diff-HierVC: Diffusion-based Hierarchical Voice Conversion with Robust Pitch Generation and Masked Prior for Zero-shot Speaker Adaptation

The official PyTorch implementation of Diff-HierVC (Interspeech 2023, Oral)

Ha-Yeong Choi, Sang-Hoon Lee, Seong-Whan Lee

Overall architecture

Although voice conversion (VC) systems have shown a remarkable ability to transfer voice style, existing methods still have an inaccurate pitch and low speaker adaptation quality. To address these challenges, we introduce Diff-HierVC, a hierarchical VC system based on two diffusion models. We first introduce DiffPitch, which can effectively generate $F_0$ with the target voice style. Subsequently, the generated $F_0$ is fed to DiffVoice to convert the speech with a target voice style. Furthermore, using the source-filter encoder, we disentangle the speech and use the converted Mel-spectrogram as a data-driven prior in DiffVoice to improve the voice style transfer capacity. Finally, by using the masked prior in diffusion models, our model can improve the speaker adaptation quality. Experimental results verify the superiority of our model in pitch generation and voice style transfer performance, and our model also achieves a CER of 0.83% and EER of 3.29% in zero-shot VC scenarios.

🎧 Audio Demo

https://diff-hiervc.github.io/audio_demo/

📑 Pre-trained Model

Our model checkpoints can be downloaded here.

model_diffhier.pth
voc_ckpt.pth

🔨 Usage

Clone this repo and install the Python requirements

git clone https://github.com/hayeong0/Diff-HierVC.git
cd Diff-HierVC
pip install -r requirements.txt

Download the pre-trained model checkpoints from the drive link above and place them at the locations marked ✅ below.

.
├── ckpt
│   ├── config.json
│   └── model_diffhier.pth ✅
├── inference.py
├── infer.sh
├── model
├── module
├── requirements.txt
├── utils
└── vocoder
    ├── hifigan.py
    ├── modules.py
    └── voc_ckpt.pth ✅
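
Before running inference, it can help to confirm that the files from the tree above are where the scripts expect them. The following is a minimal Python sketch and is not part of the repository: the helper name `missing_checkpoints` is hypothetical, and the relative paths are copied from the listing above.

```python
from pathlib import Path

# Relative paths copied from the directory listing above.
EXPECTED_FILES = (
    "ckpt/config.json",
    "ckpt/model_diffhier.pth",
    "vocoder/voc_ckpt.pth",
)


def missing_checkpoints(repo_root="."):
    """Return the expected files that do not exist under repo_root.

    Hypothetical helper for illustration; the repository does not ship it.
    """
    root = Path(repo_root)
    return [rel for rel in EXPECTED_FILES if not (root / rel).is_file()]


if __name__ == "__main__":
    missing = missing_checkpoints(".")
    if missing:
        print("Missing files:", ", ".join(missing))
    else:
        print("All expected checkpoint files are in place.")
```

Run it from the repository root; an empty result means both checkpoints and the config file were found.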

Run infer.sh

diffpitch_ts sets the number of diffusion time steps for the pitch generator (DiffPitch), and diffvoice_ts sets the number of diffusion time steps for the Mel generator (DiffVoice).

Empirically, if diffpitch_ts is too small, noise remains; if it is too large, the result becomes excessively diverse.

Tune both values for your dataset.

bash infer.sh

python3 inference.py \
    --src_path './sample/src_p241_004.wav' \
    --trg_path './sample/tar_p239_022.wav' \
    --ckpt_model './ckpt/model_diffhier.pth' \
    --ckpt_voc './vocoder/voc_ckpt.pth' \
    --output_dir './converted' \
    --diffpitch_ts 30 \
    --diffvoice_ts 6
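
Because suitable step counts depend on the dataset, one way to compare settings is to repeat the command above with several `--diffpitch_ts` values and write each run to its own output directory. Below is a minimal Python sketch under stated assumptions: the helper names `build_command`, `sweep_commands`, and `run_sweep`, the candidate values, and the per-value directory naming are illustrative and not part of the repository. The flags and default paths are copied from the command above.

```python
import os
import subprocess
import sys


def build_command(diffpitch_ts, diffvoice_ts=6,
                  src_path="./sample/src_p241_004.wav",
                  trg_path="./sample/tar_p239_022.wav",
                  output_dir="./converted"):
    """Build the inference.py argument list from the flags shown above."""
    return [
        sys.executable, "inference.py",
        "--src_path", src_path,
        "--trg_path", trg_path,
        "--ckpt_model", "./ckpt/model_diffhier.pth",
        "--ckpt_voc", "./vocoder/voc_ckpt.pth",
        "--output_dir", output_dir,
        "--diffpitch_ts", str(diffpitch_ts),
        "--diffvoice_ts", str(diffvoice_ts),
    ]


def sweep_commands(values, base_dir="./converted"):
    """Pair each diffpitch_ts value with its own output directory and command."""
    pairs = []
    for ts in values:
        # Directory naming is an assumption made for this sketch.
        out_dir = os.path.join(base_dir, f"diffpitch_ts_{ts}")
        pairs.append((out_dir, build_command(ts, output_dir=out_dir)))
    return pairs


def run_sweep(values, base_dir="./converted"):
    """Run inference once per value; must be called from the repository root."""
    for out_dir, cmd in sweep_commands(values, base_dir):
        os.makedirs(out_dir, exist_ok=True)
        subprocess.run(cmd, check=True)
```

For example, `run_sweep([10, 30, 50])` would produce three sets of converted audio to compare by ear. The values 10 and 50 are arbitrary candidates chosen for this sketch; only 30 appears in the command above.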

🎧 Test it on your own dataset and share your interesting results! :)

🎓 Citation

@inproceedings{choi23d_interspeech,
  author={Ha-Yeong Choi and Sang-Hoon Lee and Seong-Whan Lee},
  title={{Diff-HierVC: Diffusion-based Hierarchical Voice Conversion with Robust Pitch Generation and Masked Prior for Zero-shot Speaker Adaptation}},
  year=2023,
  booktitle={Proc. INTERSPEECH 2023},
  pages={2283--2287},
  doi={10.21437/Interspeech.2023-817}
}

💎 Acknowledgements

Our code is based on DiffVC and HiFiGAN.

License

This work is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License.
