
Visual Speech Recognition for Multiple Languages

Authors

Pingchuan Ma, Stavros Petridis, Maja Pantic.

Introduction

This is the repository of Visual Speech Recognition for Multiple Languages, which is the successor of End-to-End Audio-Visual Speech Recognition with Conformers. The repository is mainly based on ESPnet. We provide state-of-the-art algorithms for end-to-end visual speech recognition in the wild.

Major features
  • Modular Design

    The repository is composed of face tracking, pre-processing, and acoustic/visual encoder backbones.

  • Support of Benchmarks for Speech Recognition

    Our models provide state-of-the-art performance on multiple speech recognition benchmarks.

  • Support of Extraction of Representations or Mouth Regions of Interest

    Our models directly support extraction of speech representations or mouth regions of interest (ROIs).

  • Support of Recognition of Your Own Videos

    We provide support for performing visual speech recognition for your own videos.

Demo

English -> Mandarin -> Spanish
French -> Portuguese -> Italian

Installation

How to Install the Environment

  1. Clone the repository into a directory. We refer to that directory as ${lipreading_root}.
git clone https://github.com/mpc001/Visual_Speech_Recognition_for_Multiple_Languages
  2. Install PyTorch (>=1.8.0).

  3. Install other packages.

pip install -r requirements.txt
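
For reference, the whole sequence might look as follows. The PyTorch command here is only an assumption; pick the build matching your platform and CUDA version (see pytorch.org for the recommended command).

git clone https://github.com/mpc001/Visual_Speech_Recognition_for_Multiple_Languages
cd Visual_Speech_Recognition_for_Multiple_Languages   # this directory is ${lipreading_root}
pip install torch torchvision                         # assumption: adapt to your CUDA setup
pip install -r requirements.txt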

How to Prepare Models and Landmarks

  • Model. Download a model from Model Zoo.

    • For models trained on the CMU-MOSEAS dataset, which contains multiple languages, please unzip them into ${lipreading_root}/models/${dataset}/${language_code} (e.g. ${lipreading_root}/models/CMUMOSEAS/pt).
    • For models trained on a dataset with one language, please unzip them into ${lipreading_root}/models/${dataset}.
  • Language Model. The performance can be improved in most cases by incorporating an external language model. Please download a language model from Model Zoo.

    • For a language model trained for the CMU-MOSEAS dataset, please unzip it into ${lipreading_root}/language_models/${dataset}/${language_code}.
    • For a language model trained on a dataset with one language, please unzip it into ${lipreading_root}/language_models/${dataset}.
  • Tracker [optional]. If you intend to test your own videos, additional packages for face detection and face alignment need to be pre-installed; these are provided in the tools folder.

  • Landmarks [optional]. If you want to evaluate on benchmarks, there is no need to install the tracker. Please download the pre-computed landmarks from Model Zoo and unzip them into ${lipreading_root}/landmarks/${dataset}.
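
After unzipping, the working directory is expected to look roughly like the sketch below; ${dataset} and ${language_code} are placeholders (e.g. CMUMOSEAS and pt), and single-language datasets omit the ${language_code} level.

${lipreading_root}/
    configs/                                        # configuration (.ini) files
    labels/${dataset}/${language_code}/             # labels (.ref) files
    models/${dataset}/${language_code}/             # downloaded models
    language_models/${dataset}/${language_code}/    # downloaded language models
    landmarks/${dataset}/${dataset}_landmarks/      # pre-computed landmarks (benchmark evaluation only)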

Recognition

Generic Options

  • We refer to the path of a configuration file (.ini) as <CONFIG-FILENAME-PATH>. We put configuration files in ${lipreading_root}/configs by default.

  • We refer to the path of a labels file (.ref) as <LABELS-FILENAME-PATH>.

    • For the CMU-MOSEAS dataset and Multilingual TEDx dataset, which include multiple languages, we put labels files (.ref) in ${lipreading_root}/labels/${dataset}/${language_code}.
    • For datasets with one language, we put label files in ${lipreading_root}/labels/${dataset}.
  • We refer to the original dataset directory as <DATA-DIRECTORY-PATH>, and to the path name of a single original video as <DATA-FILENAME-PATH>.

  • We refer to the landmarks directory as <LANDMARKS-DIRECTORY-PATH>. We assume the default directory is ${lipreading_root}/landmarks/${dataset}/${dataset}_landmarks.

  • We use CPU for inference by default. If you want to speed up the decoding process, please consider

    • adding a command-line argument about the GPU option (e.g. --gpu-idx <GPU_ID>). <GPU_ID> is the ID of your selected GPU, which is a 0-based integer.
    • setting beam_size in the configuration file (.ini) <CONFIG-FILENAME-PATH> to a small value (e.g. 5) if you run out of GPU memory.
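
For example, decoding a single video on the first GPU combines the flag above with the single-video command from the next section (placeholders as defined above):

python main.py --config-filename <CONFIG-FILENAME-PATH> \
               --data-filename <DATA-FILENAME-PATH> \
               --gpu-idx 0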

How to Test

  • We assume the original videos from the desired dataset have been downloaded to the dataset directory <DATA-DIRECTORY-PATH> and the landmarks have been unzipped to the landmark directory ${lipreading_root}/landmarks/${dataset}.

  • The frame rate (fps) of your video should match the input v_fps in the configuration file (a quick way to check this is sketched after the commands below).

  • To evaluate the performance on the desired dataset:
python main.py --config-filename <CONFIG-FILENAME-PATH> \
               --labels-filename <LABELS-FILENAME-PATH> \
               --data-dir <DATA-DIRECTORY-PATH> \
               --landmarks-dir <LANDMARKS-DIRECTORY-PATH>
  • To lip read from a single video file:
python main.py --config-filename <CONFIG-FILENAME-PATH> \
               --data-filename <DATA-FILENAME-PATH>
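
As a quick way to verify the frame-rate requirement mentioned above, here is a minimal sketch using OpenCV (assuming opencv-python is installed; a tool such as ffprobe works equally well, and the file name is hypothetical):

# check that the video frame rate matches v_fps in the configuration file
import cv2

video_path = "example.mp4"            # hypothetical <DATA-FILENAME-PATH>
cap = cv2.VideoCapture(video_path)
fps = cap.get(cv2.CAP_PROP_FPS)       # frame rate reported by the container
cap.release()
print(f"video fps: {fps}")            # compare against v_fps in <CONFIG-FILENAME-PATH>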

How to Extract Mouth ROIs

  • Mouth ROIs can be extracted by setting <FEATS-POSITION> to mouth. The mouth ROIs will be saved to <OUTPUT-DIRECTORY-PATH> or <OUTPUT-FILENAME-PATH> with the .avi file extension.

  • The ${lipreading_root}/outputs folder can be used to save the mouth ROIs.

  • To extract mouth ROIs from the desired dataset:
python main.py --labels-filename <LABELS-FILENAME-PATH> \
               --data-dir <DATA-DIRECTORY-PATH> \
               --landmarks-dir <LANDMARKS-DIRECTORY-PATH> \
               --dst-dir <OUTPUT-DIRECTORY-PATH> \
               --feats-position <FEATS-POSITION>
  • To extract mouth ROIs from a single video file:
python main.py --data-filename <DATA-FILENAME-PATH> \
               --dst-filename <OUTPUT-FILENAME-PATH> \
               --feats-position <FEATS-POSITION>
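
The saved ROI files are ordinary videos, so they can be read back for inspection. A minimal sketch with OpenCV (assuming it is installed; the file name is hypothetical):

import cv2

roi_path = "outputs/roi.avi"          # hypothetical <OUTPUT-FILENAME-PATH>
cap = cv2.VideoCapture(roi_path)
frames = []
while True:
    ok, frame = cap.read()            # each frame is an HxWx3 BGR array
    if not ok:
        break
    frames.append(frame)
cap.release()
print(f"read {len(frames)} mouth ROI frames")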

How to Extract Speech Representations

  • Speech representations can be extracted from the top of ResNet-18 (512-D) or Conformer (256-D) by setting <FEATS-POSITION> to resnet or conformer, respectively. The representations will be saved to <OUTPUT-DIRECTORY-PATH> or <OUTPUT-FILENAME-PATH> with the .npz file extension.

  • The ${lipreading_root}/outputs folder can be used to save the speech representations.

  • To extract speech representations from the desired dataset:
python main.py --config-filename <CONFIG-FILENAME-PATH> \
               --labels-filename <LABELS-FILENAME-PATH> \
               --data-dir <DATA-DIRECTORY-PATH> \
               --landmarks-dir <LANDMARKS-DIRECTORY-PATH> \
               --dst-dir <OUTPUT-DIRECTORY-PATH> \
               --feats-position <FEATS-POSITION>
  • To extract speech representations from a single video file:
python main.py --config-filename <CONFIG-FILENAME-PATH> \
               --data-filename <DATA-FILENAME-PATH> \
               --dst-filename <OUTPUT-FILENAME-PATH> \
               --feats-position <FEATS-POSITION>
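
The .npz outputs are standard NumPy archives. The key names inside the archive are not specified here, so the sketch below simply iterates over whatever keys are present (the file name is hypothetical):

import numpy as np

feats_path = "outputs/feats.npz"      # hypothetical <OUTPUT-FILENAME-PATH>
with np.load(feats_path) as data:
    for key in data.files:
        arr = data[key]
        # ResNet-18 features are 512-D and Conformer features are 256-D per frame
        print(key, arr.shape, arr.dtype)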

Model Zoo

Overview

We support a number of datasets for speech recognition.

Evaluation

We provide landmarks, language models, and models for each dataset. Please see the models page for details.

Citation

If you find this code useful in your research, please consider citing the following papers:

@article{ma2022visual,
  title={{Visual Speech Recognition for Multiple Languages in the Wild}},
  author={Ma, Pingchuan and Petridis, Stavros and Pantic, Maja},
  journal={{Nature Machine Intelligence}},
  volume={4},
  pages={930--939},
  year={2022},
  url={https://doi.org/10.1038/s42256-022-00550-z},
  doi={10.1038/s42256-022-00550-z}
}

License

Note that the code may only be used for comparative or benchmarking purposes. The code is supplied under a license that permits non-commercial use only.

Contact

Pingchuan Ma: pingchuan.ma16[at]imperial.ac.uk
