Corpus Converter Scripts

Here you will find scripts that help you convert corpus files (lists of sentences) into the right format for language model training. For example:

Expand numbers: 22 -> twenty-two
Normalize alphabet (depends on language): âáà -> aaa
Expand symbols: % -> percent
Replace time: 9.30 PM -> nine thirty pm
Replace date: 4/28/2023 -> twenty-eighth of april two thousand twenty-three
Make everything lower-case
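
As a rough illustration of the kind of transformation listed above, here is a minimal Python sketch. It is not the code from optimize_sentences.py: the names normalize_sentence, number_to_words, strip_accents and SYMBOLS are hypothetical, it assumes English output, it only expands whole numbers from 0 to 99, and it leaves times and dates untouched.

```python
import re
import unicodedata

# Hypothetical lookup tables for this sketch (English only).
ONES = ["zero", "one", "two", "three", "four", "five", "six", "seven",
        "eight", "nine", "ten", "eleven", "twelve", "thirteen", "fourteen",
        "fifteen", "sixteen", "seventeen", "eighteen", "nineteen"]
TENS = ["", "", "twenty", "thirty", "forty", "fifty", "sixty", "seventy",
        "eighty", "ninety"]
SYMBOLS = {"%": "percent"}


def number_to_words(n):
    """Spell out an integer in the range 0..99."""
    if n < 20:
        return ONES[n]
    tens, ones = divmod(n, 10)
    return TENS[tens] if ones == 0 else f"{TENS[tens]}-{ONES[ones]}"


def strip_accents(text):
    """Decompose accented characters and drop the combining marks."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def normalize_sentence(sentence):
    """Apply accent stripping, symbol and number expansion, and lower-casing."""
    text = strip_accents(sentence)
    # Expand symbols, padded with spaces so they become separate words.
    for symbol, word in SYMBOLS.items():
        text = text.replace(symbol, f" {word} ")
    # Expand standalone one- or two-digit numbers; longer numbers are left as-is.
    text = re.sub(r"\b\d{1,2}\b",
                  lambda m: number_to_words(int(m.group())), text)
    # Lower-case everything and collapse repeated whitespace.
    return " ".join(text.lower().split())


if __name__ == "__main__":
    print(normalize_sentence("Âáà costs 22%"))
```

The steps run in a fixed order (accents, symbols, numbers, case) so that symbol expansion cannot glue a word onto a neighbouring number. A real converter would also need language-specific rules for times, dates and larger numbers.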

Quick-Start

Run python3 optimize_sentences.py -h to see the available options.