This is a curated list of awesome Speech Bandwidth Extension tutorials, papers, libraries, datasets, tools, scripts and results. The purpose of this repo is to organize the world’s resources for speech bandwidth extension, and make them universally accessible and useful.
To add items to this page, simply send a pull request. (contributing guide)
- IRM-based-Speech-Enhancement-using-LSTM [Code]
- nn-irm [Code]
- Speech Enhancement Using a Two-Stage Network for an Efficient Boosting Strategy [Code][PDF]
- SETK: Speech Enhancement Tools integrated with Kaldi [Code]
- sednn:deep_learning_for_speech_enhancement_keras_python [Code]
- Speech_Enhancement_DNN_NMF [Code]
- Deep-Learning-for-Speech-Enhancement [Code]
- gcc-nmf:Real-time GCC-NMF Blind Speech Separation and Enhancement [Code]
- TensorFlow-speech-enhancement-Chinese [Code]
- DNN-Speech-enhancement-demo-tool [Code]
- CNN-for-single-channel-speech-enhancement [Code]
- rnn-speech-denoising [Code]
- DNN-SpeechEnhancement [Code]
- segan_pytorch [Code]
- PHASEN [Code]
- TCNSE [Code]
- pb_chime5:Speech enhancement system for the CHiME-5 dinner party scenario [Code]
- Supervised online diarization with sample mean loss for multi-domain data, 2019
- Discriminative Neural Clustering for Speaker Diarisation, 2019
- End-to-End Neural Speaker Diarization with Permutation-Free Objectives, 2019
- End-to-End Neural Speaker Diarization with Self-attention, 2019
- Fully Supervised Speaker Diarization, 2018
- Joint Speech Recognition and Speaker Diarization via Sequence Transduction, 2019
- Says who? Deep learning models for joint speech recognition, segmentation and diarization, 2018
- Diarization is Hard: Some Experiences and Lessons Learned for the JHU Team in the Inaugural DIHARD Challenge, 2018
- ODESSA at Albayzin Speaker Diarization Challenge 2018, 2018
- Joint Discriminative Embedding Learning, Speech Activity and Overlap Detection for the DIHARD Challenge, 2018
- Overlap-aware diarization: resegmentation using neural end-to-end overlapped speech detection
- Speaker diarization using latent space clustering in generative adversarial network
- A study of semi-supervised speaker diarization system using gan mixture model
- Learning deep representations by multilayer bootstrap networks for speaker diarization
- Enhancements for Audio-only Diarization Systems
- LSTM based Similarity Measurement with Spectral Clustering for Speaker Diarization
- Meeting Transcription Using Virtual Microphone Arrays
- Speaker diarisation using 2D self-attentive combination of embeddings
- Neural speech turn segmentation and affinity propagation for speaker diarization
- Multimodal Speaker Segmentation and Diarization using Lexical and Acoustic Cues via Sequence to Sequence Neural Networks
- Joint Speaker Diarization and Recognition Using Convolutional and Recurrent Neural Networks
- Speaker Diarization with LSTM
- Speaker diarization using deep neural network embeddings
- Speaker diarization using convolutional neural network for statistics accumulation refinement
- pyannote.metrics: a toolkit for reproducible evaluation, diagnostic, and error analysis of speaker diarization systems
- Speaker Change Detection in Broadcast TV using Bidirectional Long Short-Term Memory Networks
- Speaker Diarization using Deep Recurrent Convolutional Neural Networks for Speaker Embeddings
- A study of the cosine distance-based mean shift for telephone speech diarization
- Speaker diarization with PLDA i-vector scoring and unsupervised calibration
- Artificial neural network features for speaker diarization
- PLDA-based Clustering for Speaker Diarization of Broadcast Streams
- Speaker diarization of meetings based on speaker role n-gram models
- An overview of automatic speaker diarization systems
- A spectral clustering approach to speaker diarization
- AKtools:the open software toolbox for signal acquisition, processing, and inspection in acoustics [SVN Code](username: aktools; password: ak)
- MatlabToolbox [Code]
- athena-signal [[Code]](https://github.com/athena-team/athena-signal)
- python_speech_features [Code]
- speechFeatures: basic features for speech processing and sound source localization [Code]
- sap-voicebox [Code]
- Calculate-SNR-SDR [Code]
- RIR-Generator [Code]
- Python library for Room Impulse Response (RIR) simulation with GPU acceleration [Code]
- ROOMSIM:binaural image source simulation [Code]
- binaural-image-source-model [Code]
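
Several of the tools above deal with measuring signal quality. As a minimal sketch of what an SNR calculation involves (this is not the code from Calculate-SNR-SDR; the function name `snr_db` and the list-based signal representation are assumptions made for illustration), the ratio of clean-signal energy to residual-noise energy can be expressed in decibels like this:

```python
import math


def snr_db(clean, noisy):
    """Return the signal-to-noise ratio in dB.

    `clean` and `noisy` are equal-length sequences of samples; the noise
    is taken to be the sample-wise difference noisy - clean.
    """
    if len(clean) != len(noisy):
        raise ValueError("clean and noisy must have the same length")
    signal_energy = sum(c * c for c in clean)
    noise_energy = sum((n - c) ** 2 for c, n in zip(clean, noisy))
    if signal_energy == 0:
        raise ValueError("clean signal has zero energy")
    if noise_energy == 0:
        return math.inf  # identical signals: no noise at all
    return 10.0 * math.log10(signal_energy / noise_energy)


clean = [1.0, -1.0, 1.0, -1.0]
noisy = [1.1, -0.9, 1.1, -0.9]
print(round(snr_db(clean, noisy), 3))
```

In practice the samples would come from audio files, and SDR variants additionally project the estimate onto the reference before measuring the residual.
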
Link | Language | Description |
---|---|---|
SIDEKIT for diarization (s4d) | Python | An open source package extension of SIDEKIT for Speaker diarization. |
pyAudioAnalysis | Python | Python Audio Analysis Library: Feature Extraction, Classification, Segmentation and Applications. |
AaltoASR | Python & Perl | Speaker diarization scripts, based on AaltoASR. |
LIUM SpkDiarization | Java | LIUM_SpkDiarization is software dedicated to speaker diarization (i.e. speaker segmentation and clustering). It is written in Java and includes the most recent developments in the domain (as of 2013). |
kaldi-asr | Bash | Example scripts for speaker diarization on a portion of CALLHOME used in the 2000 NIST speaker recognition evaluation. |
Alize LIA_SpkSeg | C++ | ALIZÉ is an open-source platform for speaker recognition. LIA_SpkSeg is a set of tools for speaker diarization. |
pyannote-audio | Python | Neural building blocks for speaker diarization: speech activity detection, speaker change detection, speaker embedding. |
pyBK | Python | Speaker diarization using binary key speaker modelling. Computationally light solution that does not require external training data. |
Speaker-Diarization | Python | Speaker diarization using uis-rnn and GhostVLAD. An easier way to support open-set speakers. |
EEND | Python & Bash & Perl | End-to-End Neural Diarization. |
VBDiarization | Python | Speaker diarization based on Kaldi x-vectors using pretrained model trained in Kaldi (kaldi-asr/kaldi) and converted to ONNX format (onnx/onnx) running in ONNXRuntime (Microsoft/onnxruntime). |
RE-VERB | Python & JavaScript | RE: VERB is a speaker diarization system that lets the user send or record audio of a conversation and receive timestamps of who spoke when. |
Link | Language | Description |
---|---|---|
pyannote-metrics | Python | A toolkit for reproducible evaluation, diagnostic, and error analysis of speaker diarization systems. |
SimpleDER | Python | A lightweight library to compute Diarization Error Rate (DER). |
NIST md-eval | Perl | (1) modified md-eval.pl from Mary Tai Knox; (2) md-eval-v21.pl from jitendra; (3) md-eval-22.pl from nryant |
dscore | Python & Perl | Diarization scoring tools. |
Sequence Match Accuracy | Python | Computes the matching accuracy between two label sequences using the Hungarian algorithm. |
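
The tools in this table all score a hypothesis against a reference after finding the best speaker mapping. A minimal frame-level sketch of Diarization Error Rate is given below, assuming DER is defined as (missed speech + false alarm + speaker confusion) divided by total reference speech. It is not the implementation of any tool listed here: the name `frame_der` and the per-frame label format are assumptions, and the optimal mapping is found by brute-force search over permutations rather than the Hungarian algorithm, which is only practical for a handful of speakers.

```python
from itertools import permutations


def frame_der(ref, hyp):
    """Frame-level DER for two equal-length label sequences.

    Each element is a speaker label (a string) or None for non-speech.
    Overlapping speech is not modelled in this sketch.
    """
    if len(ref) != len(hyp):
        raise ValueError("ref and hyp must have the same number of frames")
    ref_speech = sum(r is not None for r in ref)
    if ref_speech == 0:
        raise ValueError("reference contains no speech")

    miss = sum(r is not None and h is None for r, h in zip(ref, hyp))
    false_alarm = sum(r is None and h is not None for r, h in zip(ref, hyp))

    # Frames where both sides contain speech: these can be correct or confused.
    both = [(r, h) for r, h in zip(ref, hyp) if r is not None and h is not None]
    ref_speakers = sorted({r for r in ref if r is not None})
    hyp_speakers = sorted({h for h in hyp if h is not None})

    # Each hypothesis speaker maps to a distinct reference speaker or to None.
    targets = ref_speakers + [None] * len(hyp_speakers)
    best_correct = 0
    for perm in permutations(targets, len(hyp_speakers)):
        mapping = dict(zip(hyp_speakers, perm))
        correct = sum(mapping[h] == r for r, h in both)
        best_correct = max(best_correct, correct)

    confusion = len(both) - best_correct
    return (miss + false_alarm + confusion) / ref_speech


ref = ["A", "A", "B", "B", None, None]
hyp = ["x", "x", "y", "x", None, "y"]
print(frame_der(ref, hyp))
```

Real scoring tools work on timed segments rather than frames and usually apply a forgiveness collar around reference boundaries, so their numbers will differ from this simplified version.
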
Link | Language | Description |
---|---|---|
uis-rnn | Python & PyTorch | Google's Unbounded Interleaved-State Recurrent Neural Network (UIS-RNN) algorithm, for Fully Supervised Speaker Diarization. This clustering algorithm is supervised. |
uis-rnn-sml | Python & PyTorch | A variant of UIS-RNN, for the paper Supervised Online Diarization with Sample Mean Loss for Multi-Domain Data. |
DNC | Python & ESPnet | Transformer-based Discriminative Neural Clustering (DNC) for Speaker Diarisation. Like UIS-RNN, it is supervised. |
SpectralCluster | Python | Spectral clustering with affinity matrix refinement operations. |
sklearn.cluster | Python | scikit-learn clustering algorithms. |
PLDA | Python | Probabilistic Linear Discriminant Analysis & classification, written in Python. |
PLDA | C++ | Open-source implementation of simplified PLDA (Probabilistic Linear Discriminant Analysis). |
Link | Method | Language | Description |
---|---|---|---|
resemble-ai/Resemblyzer | d-vector | Python & PyTorch | PyTorch implementation of generalized end-to-end loss for speaker verification, which can be used for voice cloning and diarization. |
Speaker_Verification | d-vector | Python & TensorFlow | TensorFlow implementation of generalized end-to-end loss for speaker verification. |
PyTorch_Speaker_Verification | d-vector | Python & PyTorch | PyTorch implementation of "Generalized End-to-End Loss for Speaker Verification" by Wan, Li et al. With UIS-RNN integration. |
Real-Time Voice Cloning | d-vector | Python & PyTorch | Implementation of "Transfer Learning from Speaker Verification to Multispeaker Text-To-Speech Synthesis" (SV2TTS) with a vocoder that works in real-time. |
deep-speaker | d-vector | Python & Keras | Third-party implementation of the Baidu paper Deep Speaker: an End-to-End Neural Speaker Embedding System. |
x-vector-kaldi-tf | x-vector | Python & TensorFlow & Perl | TensorFlow implementation of x-vector topology on top of Kaldi recipe. |
kaldi-ivector | i-vector | C++ & Perl | Extension to Kaldi implementing the standard i-vector hyperparameter estimation and i-vector extraction procedure. |
voxceleb-ivector | i-vector | Perl | Voxceleb1 i-vector based speaker recognition system. |
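
Embeddings such as the d-vectors, x-vectors and i-vectors above are fixed-length vectors, and a common way to compare two of them is cosine similarity. The sketch below illustrates that comparison only. The function names and the threshold value are hypothetical and are not taken from any of the listed projects; a real system would tune the threshold on held-out data or use a back end such as PLDA.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length embedding vectors."""
    if len(a) != len(b):
        raise ValueError("embeddings must have the same dimension")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("embeddings must be non-zero")
    return dot / (norm_a * norm_b)


def same_speaker(a, b, threshold=0.75):
    """Hypothetical decision rule: accept if similarity reaches the threshold."""
    return cosine_similarity(a, b) >= threshold


enrolled = [0.2, 0.9, 0.1]   # toy 3-dimensional "embeddings"
test_utt = [0.25, 0.85, 0.05]
print(same_speaker(enrolled, test_utt))
```
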
Link | Language | Description |
---|---|---|
change_detection | Python & Keras | Code for Speaker Change Detection in Broadcast TV using Bidirectional Long Short-Term Memory Networks. |
Link | Language | Description |
---|---|---|
LibROSA | Python | Python library for audio and music analysis. https://librosa.github.io/ |
python_speech_features | Python | This library provides common speech features for ASR including MFCCs and filterbank energies. https://python-speech-features.readthedocs.io/en/latest/ |
pyAudioAnalysis | Python | Python Audio Analysis Library: Feature Extraction, Classification, Segmentation and Applications. |
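
MFCCs and filterbank energies, mentioned above, are both built on a mel-spaced filterbank. As a rough sketch of that one step (using the widely used `2595 * log10(1 + f / 700)` mel formula; the function names here are illustrative and are not the API of any library in this table), the filter edge frequencies can be derived like this:

```python
import math


def hz_to_mel(hz):
    """Convert a frequency in Hz to the mel scale."""
    return 2595.0 * math.log10(1.0 + hz / 700.0)


def mel_to_hz(mel):
    """Inverse of hz_to_mel."""
    return 700.0 * (10.0 ** (mel / 2595.0) - 1.0)


def mel_center_frequencies(num_filters, low_hz, high_hz):
    """Return num_filters + 2 frequencies in Hz, equally spaced in mel.

    The first and last points are the band edges; the points in between
    are the centers of the triangular filters.
    """
    low_mel, high_mel = hz_to_mel(low_hz), hz_to_mel(high_hz)
    step = (high_mel - low_mel) / (num_filters + 1)
    return [mel_to_hz(low_mel + i * step) for i in range(num_filters + 2)]


points = mel_center_frequencies(26, 0.0, 8000.0)
print(len(points), round(points[1], 1), round(points[-2], 1))
```

The spacing is dense at low frequencies and sparse at high frequencies, which is the property that makes mel features a reasonable match for speech.
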
Link | Language | Description |
---|---|---|
pyroomacoustics | Python | Pyroomacoustics is a package for audio signal processing for indoor applications. It was developed as a fast prototyping platform for beamforming algorithms in indoor scenarios. https://pyroomacoustics.readthedocs.io |
gpuRIR | Python | Python library for Room Impulse Response (RIR) simulation with GPU acceleration |
rir_simulator_python | Python | Room impulse response simulator using python |
Link | Language | Description |
---|---|---|
VB Diarization | Python | VB Diarization with Eigenvoice and HMM Priors. |
Audio | Diarization ground truth | Language | Pricing | Additional information |
---|---|---|---|---|
2000 NIST Speaker Recognition Evaluation | Disk-6 (Switchboard), Disk-8 (CALLHOME) | Multiple | $2400.00 | Evaluation Plan |
2003 NIST Rich Transcription Evaluation Data | Included with the audio | en, ar, zh | $2000.00 | telephone speech, broadcast news |
CALLHOME American English Speech | CALLHOME American English Transcripts | en | $1500.00 + $1000.00 | CH109 whitelist |
The ICSI Meeting Corpus | Included with the audio | en | Free | License |
The AMI Meeting Corpus | Included with the audio (requires processing) | Multiple | Free | License |
Fisher English Training Speech Part 1 Speech | Fisher English Training Speech Part 1 Transcripts | en | $7000.00 + $1000.00 | |
Fisher English Training Part 2, Speech | Fisher English Training Part 2, Transcripts | en | $7000.00 + $1000.00 | |
Name | Utterances | Speakers | Language | Pricing | Additional information |
---|---|---|---|---|---|
TIMIT | 6K+ | 630 | en | $250.00 | Published in 1993, the TIMIT corpus of read speech is one of the earliest speaker recognition datasets. |
VCTK | 43K+ | 109 | en | Free | Most sentences were selected from a newspaper, plus the Rainbow Passage and an elicitation paragraph intended to identify the speaker's accent. |
LibriSpeech | 292K | 2K+ | en | Free | Large-scale (1000 hours) corpus of read English speech. |
LibriVox | 180K | 9K+ | Multiple | Free | Free public domain audiobooks. LibriSpeech is a processed subset of LibriVox. Each original unsegmented utterance could be very long. |
VoxCeleb 1&2 | 1M+ | 7K | Multiple | Free | VoxCeleb is an audio-visual dataset consisting of short clips of human speech, extracted from interview videos uploaded to YouTube. |
The Spoken Wikipedia Corpora | 5K | 879 | en, de, nl | Free | Volunteer readers reading Wikipedia articles. |
CN-Celeb | 130K+ | 1K | zh | Free | A Free Chinese Speaker Recognition Corpus Released by CSLT@Tsinghua University. |
BookTubeSpeech | 8K | 8K | en | Free | Audio samples extracted from BookTube videos - videos where people share their opinions on books - from YouTube. The dataset can be downloaded using BookTubeSpeech-download. |
DeepMine | 540K | 1850 | fa, en | Unknown | A speech database in Persian and English designed to build and evaluate speaker verification, as well as Persian ASR systems. |
Name | Utterances | Pricing | Additional information |
---|---|---|---|
AudioSet | 2M | Free | A large-scale dataset of manually annotated audio events. |
MUSAN | N/A | Free | MUSAN is a corpus of music, speech, and noise recordings. |
- High-Accuracy Neural-Network Models for Speech Enhancement - 2017
- DNN-Based Online Speech Enhancement Using Multitask Learning and Suppression Rule Estimation - 2015
- Microphone array signal processing: beyond the beamformer - 2011
- CCF speech seminar 2020
- Literature Review For Speaker Change Detection by Halil Erdoğan
- Speaker Diarization: Separation of Multiple Speakers in an Audio File by Jaspreet Singh
- Speaker Diarization with Kaldi by Yoav Ramon
- Google's Diarization System: Speaker Diarization with LSTM by Google
- Fully Supervised Speaker Diarization: Say Goodbye to clustering by Google
- Speaker Diarization: Optimal Clustering and Learning Speaker Embeddings by Microsoft Research
- Robust Speaker Diarization for Meetings: the ICSI system by Microsoft Research
Company | Product |
---|---|
Google | Cloud Speech-to-Text API |
Amazon | Amazon Transcribe |
IBM | Watson Speech To Text API |
DeepAffects | Speaker Diarization API |