
speech2text

Realtime offline (self-hosted) speech recognition tool based on the OpenAI Whisper model.

Turns this (en_chunk.wav):

[spectrogram image]

into this:

[preview animation]


Goals

Why I made it:

  • to get a better grasp of modern speech-to-text tools: their capabilities and limitations
  • to reuse this project as a sub-module in my other projects

Features

  • speech-to-text transcription that works fully offline
    • though an internet connection is still required for the first launch, to download the models
  • very responsive output: the words appear on the screen with minimal delay
  • efficient usage of system resources

Flaws

  • cold start takes quite some time (20-80 sec, even with an SSD)
  • consumes a lot of RAM
    • occupies 4-8 GB with the default preset (Whisper models: tiny.en + small.en)
  • CUDA support is strongly advised:
    • runs 5-10 times slower without the CUDA Toolkit
    • incapable of running in real-time mode without CUDA
  • may require a lot of fine-tuning at first, to adjust to your microphone
  • deployment may be tricky

Algorithm description

Audio data is stored and transformed between three different formats (see the conversion sketch after this list):

  1. WAVE PCM encoding, as implemented by the wave module in the Python standard library
  2. Pydub AudioSegment — convenient for sound processing
  3. NumPy array of floats — the Whisper input format
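
A minimal sketch of these conversions, assuming 16-bit mono PCM at 16000 Hz; the parameter values and the pcm_bytes placeholder are illustrative, not the project's actual settings:

    import numpy as np
    from pydub import AudioSegment

    # Format 1: raw WAVE PCM frames, e.g. read with the stdlib wave module
    # (empty placeholder here).
    pcm_bytes: bytes = b""

    # Format 2: wrap the raw bytes into a pydub AudioSegment
    # for convenient sound processing.
    seg = AudioSegment(
        data=pcm_bytes,
        sample_width=2,   # 16-bit samples
        frame_rate=16000,
        channels=1,
    )

    # Format 3: a NumPy array of floats in [-1.0, 1.0],
    # which Whisper accepts as input.
    samples = np.array(seg.get_array_of_samples()).astype(np.float32) / 32768.0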

The workflow can be roughly described as follows:

  1. We attempt to split the audio stream into sentences, based on the pauses in speech. Each such sentence is stored in a separate block.
  2. The last block changes constantly, because we iteratively append small chunks of sound (0.5-1.5 sec) to it while the user speaks. It is referred to as the ongoing block.
  3. All the previous blocks will never change again. They are referred to as finalized blocks.
  4. We apply a slow, high-quality transcription to each finalized block, exactly once.
  5. We apply a fast, low-quality transcription to the ongoing block on every iteration.

This approach is a tradeoff between quality and speed. The user immediately gets a low-quality transcription of the phrase they are currently saying, which is then refined once the phrase is finished.
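
As a sketch of this two-tier scheme, assuming the default preset (tiny.en for the fast pass, small.en for the slow one); the function boundaries here are illustrative, not the project's actual API:

    import whisper

    fast_model = whisper.load_model("tiny.en")   # low quality, low latency
    slow_model = whisper.load_model("small.en")  # high quality, run once per block

    def transcribe_ongoing(audio):
        # Re-run on every iteration while the user is still speaking.
        return fast_model.transcribe(audio)["text"]

    def transcribe_finalized(audio):
        # Run exactly once, when a pause closes the sentence.
        return slow_model.transcribe(audio)["text"]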

A more detailed description

flowchart TB
    A([Sound source]) -->|sound| B(Listener)
    B -->|data chunks| C(Accumulation)
    C -->|binary data| D{{Preparations}}
    D -->|NumPy f16 array| G(Transcribing)
    G --> C
    G -->|text| H([Text accumulated])

where:

  1. Listener provides a stream of fixed-size, PCM-encoded binary chunks (see the sketch after this list).
  2. Accumulation appends new binary chunks to the currently ongoing audio block.
  3. Preparations transforms the binary data into a NumPy array and applies a variety of measures to increase the quality of the transcription.
  4. Transcribing applies the fast, low-quality transcription to the ongoing audio block, applies the slow, high-quality transcription to the blocks to be finalized (if any), and sends the ongoing block back to Accumulation.
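
For illustration, here is what a Listener might look like on top of PyAudio; whether the project actually uses PyAudio is an assumption, and the chunk size is an illustrative value:

    import pyaudio

    CHUNK_FRAMES = 8000  # 0.5 sec at 16000 Hz (illustrative value)

    pa = pyaudio.PyAudio()
    stream = pa.open(
        format=pyaudio.paInt16,  # 16-bit PCM
        channels=1,
        rate=16000,
        input=True,
        frames_per_buffer=CHUNK_FRAMES,
    )

    def listen():
        # Yields a stream of fixed-size, PCM-encoded binary chunks.
        while True:
            yield stream.read(CHUNK_FRAMES)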

More about the Preparations stage

flowchart LR
    C([in])
    C -->|binary| D(Adjustment)
    D -->|seg| E(Split on silence)
    E -->|seg| F(Refinement)
    F -->|NumPy array| G([out])

  1. Adjustment
    • transforms binary (WAVE PCM encoding) to seg (Pydub AudioSegment)
    • may apply volume normalization
  2. Split on silence: splits the audio block on pauses (see the sketch after this list)
    • all except the last block are marked to be finalized
  3. Refinement
    • may apply human voice frequency amplification
    • may speed up the audio
    • applies normalization
    • converts to mono, 16000 Hz
    • converts to NumPy array of floats
    • may apply noise suppression
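
A rough sketch of the Split on silence and Refinement stages with pydub and noisereduce; the thresholds and the exact order of steps here are illustrative:

    import numpy as np
    from pydub import AudioSegment, effects, silence
    import noisereduce

    def split_and_refine(seg: AudioSegment) -> list[np.ndarray]:
        # Split on silence: every piece except the last one may be finalized.
        pieces = silence.split_on_silence(
            seg,
            min_silence_len=500,  # ms of silence that ends a sentence (illustrative)
            silence_thresh=-40,   # dBFS threshold for "silence" (illustrative)
        )
        refined = []
        for piece in pieces:
            piece = effects.normalize(piece)     # volume normalization
            piece = piece.set_channels(1)        # convert to mono
            piece = piece.set_frame_rate(16000)  # resample to 16000 Hz
            floats = np.array(piece.get_array_of_samples()).astype(np.float32) / 32768.0
            floats = noisereduce.reduce_noise(y=floats, sr=16000)  # noise suppression
            refined.append(floats)
        return refined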

How to launch

Software requirements

Setting up developer environment

If you are on Windows with CUDA v11.4, just run make init (or poetry install).

Otherwise, you may need to change the torch dependency in the [tool.poetry.dependencies] section of the pyproject.toml file.

First run

Before launching the app, you might want to tinker with the workflow_debugging.ipynb notebook.

To launch the app, run: make run (or make, or poetry run python -m speech2text). It will transcribe the en_chunk.wav file. The result should resemble the animation at the top of this README.

To transcribe your mic input instead, replace demo_console_realtime(IN_FILE_PATH) with demo_console_realtime() in __main__.py.
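
For reference, the relevant call (IN_FILE_PATH is the project's own constant pointing at the demo file):

    # in speech2text/__main__.py
    demo_console_realtime(IN_FILE_PATH)  # transcribe the demo file en_chunk.wav
    # demo_console_realtime()            # transcribe live microphone input instead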

You might also want to have a look at config.yaml to fine-tune the settings. If you run into performance troubles:

  • replace model_name: small.en with model_name: tiny.en
  • clear the noisereduce sub-section (replace its contents with a bare noisereduce: line)
  • make sure CUDA is available:
    from torch import cuda
    
    assert cuda.is_available()

Licensing

MIT License
