Speech To Text, Nepali, CNN, ResNet, BiLSTM, CTC
This repo is part of a research project on designing an automatic speech recognition (ASR) model for the Nepali language using ML techniques. It builds on Manish Dhakal's repository with further learnings and implementation. All thanks to his efforts.
- You are free to use this research as a reference and make modifications to continue your own research in Nepali ASR.
- trainer.py is currently set up to run on the sampled data. To replicate the results, replace the dataset directory with the original OpenSLR dataset.
- Remove the (audio, text) pairs whose transcriptions contain Devanagari numerals such as १४२३ or ५९२, because they degrade the model's performance.
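The numeral filter described above could be sketched as follows. This is a minimal illustration, not the repository's own code: the function name `drop_numeric_pairs` and the sample file names are hypothetical, and it assumes the dataset is available as a list of (audio path, transcription) tuples.

```python
import re

# Devanagari digits ० through ९ occupy the Unicode range U+0966..U+096F.
DEVANAGARI_DIGIT = re.compile(r"[\u0966-\u096F]")


def drop_numeric_pairs(pairs):
    """Keep only (audio_path, text) pairs whose text has no Devanagari digit.

    Hypothetical helper: the real dataset loader may store pairs differently.
    """
    return [(audio, text) for audio, text in pairs
            if not DEVANAGARI_DIGIT.search(text)]


# Made-up sample data, not taken from the OpenSLR dataset.
samples = [
    ("a.flac", "नमस्ते संसार"),
    ("b.flac", "सन् १४२३ मा"),
    ("c.flac", "५९२ जना"),
]
clean = drop_numeric_pairs(samples)
```

Only the first pair survives here, since the other two transcriptions contain numerals.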
- Remove the (audio, text) pairs that include Devanagari numeric transcriptions
- Data cleaning (clipping silent gaps from both ends)
- MFCC feature extraction from audio data
- Design Neural Network (optimal: CNN + ResNet + BiLSTM) model
- Calculate CTC loss for applying gradient (training)
- Decode the texts by using beam search decoding (inference)
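Two of the steps above, silence clipping and decoding, can be illustrated in plain Python. This is a sketch under stated assumptions rather than the repository's implementation: the amplitude threshold, the function names, and the three-symbol alphabet are all hypothetical, and the decoder shown is the simpler greedy (best-path) CTC decode, not the beam search the pipeline uses.

```python
def trim_silence(samples, threshold=0.01):
    """Clip leading and trailing samples whose amplitude is at or below threshold.

    The threshold value is an assumption chosen for illustration only.
    """
    loud = [i for i, s in enumerate(samples) if abs(s) > threshold]
    if not loud:
        return []  # the whole clip is silent
    return samples[loud[0]:loud[-1] + 1]


def ctc_greedy_decode(frame_probs, alphabet, blank=0):
    """Greedy CTC decode: argmax per frame, merge repeats, then drop blanks.

    This is a stand-in for beam search; it keeps only the single best
    label at each frame.
    """
    best = [max(range(len(p)), key=p.__getitem__) for p in frame_probs]
    chars = []
    previous = None
    for idx in best:
        # A label is emitted only when it differs from the previous frame
        # and is not the blank symbol.
        if idx != previous and idx != blank:
            chars.append(alphabet[idx])
        previous = idx
    return "".join(chars)


# Hypothetical inputs: index 0 is the CTC blank.
alphabet = ["", "क", "ख"]
frames = [
    [0.1, 0.8, 0.1],    # क
    [0.1, 0.8, 0.1],    # क (repeat, merged)
    [0.9, 0.05, 0.05],  # blank separates the two क
    [0.1, 0.8, 0.1],    # क
    [0.1, 0.1, 0.8],    # ख
]
decoded = ctc_greedy_decode(frames, alphabet)
trimmed = trim_silence([0.0, 0.001, 0.5, -0.3, 0.0, 0.2, 0.0])
```

The blank frame is what lets the same character appear twice in a row in the output; without it, the repeated frames would merge into a single character.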
- Initialize the virtual environment by installing packages from requirements.txt.
- Run the training pipeline and evaluate the author's model. The evaluation can also be run on your own (audio, text) pairs.
- Create an API_KEY file and paste your OpenAI API key into it.
- streamlit run "path/file.py"
python -m streamlit run webapp.py
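How the web app reads the key is not shown here. As a rough sketch, assuming the key is stored as plain text in a file named API_KEY in the working directory, it could be loaded like this. The helper name `load_api_key` is hypothetical.

```python
from pathlib import Path


def load_api_key(path="API_KEY"):
    """Read an API key from a plain-text file and strip surrounding whitespace.

    Assumption: the file holds only the key. The actual webapp.py may
    read it differently.
    """
    key = Path(path).read_text(encoding="utf-8").strip()
    if not key:
        raise ValueError(f"{path} is empty")
    return key
```

Keeping the key in a separate file, and out of version control, avoids committing it with the source.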
To train and evaluate the model, use the commands below.
python trainer.py # For running the training pipeline
python eval.py # For testing and evaluating the model already trained by the author