Here you will find a script that helps you convert corpus files (lists of sentences) to the right format for language model training. For example, it can:
- Expand numbers:
22 -> twenty-two
- Normalize alphabet (depends on language):
âáà -> aaa
- Expand symbols:
% -> percent
- Replace time:
9.30 PM -> nine thirty pm
- Replace date:
4/28/2023 -> twenty-eighth of april two thousand twenty-three
- Make everything lower-case

To see the script's usage help, run:
python3 optimize_sentences.py -h
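
To make the transformations listed above more concrete, here is a minimal Python sketch of four of them: alphabet normalization, symbol expansion, number expansion and lower-casing. It is not the code of `optimize_sentences.py`. The function names (`strip_accents`, `number_to_words`, `normalize_sentence`) and the `SYMBOLS` table are hypothetical and chosen only for illustration. The sketch assumes English, expands only one- and two-digit integers, and leaves out time and date handling.

```python
import re
import unicodedata

# Hypothetical lookup tables for this sketch (English only).
ONES = ["zero", "one", "two", "three", "four", "five", "six", "seven",
        "eight", "nine", "ten", "eleven", "twelve", "thirteen", "fourteen",
        "fifteen", "sixteen", "seventeen", "eighteen", "nineteen"]
TENS = ["", "", "twenty", "thirty", "forty", "fifty",
        "sixty", "seventy", "eighty", "ninety"]
SYMBOLS = {"%": "percent"}


def strip_accents(text):
    """Decompose characters (NFD) and drop the combining accent marks."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def number_to_words(n):
    """Spell out an integer in the range 0..99."""
    if n < 20:
        return ONES[n]
    tens, ones = divmod(n, 10)
    return TENS[tens] if ones == 0 else f"{TENS[tens]}-{ONES[ones]}"


def normalize_sentence(sentence):
    """Apply accent stripping, symbol and number expansion, lower-casing."""
    text = strip_accents(sentence)
    for symbol, word in SYMBOLS.items():
        text = text.replace(symbol, f" {word} ")
    # Expand only runs of one or two digits; longer numbers stay unchanged.
    text = re.sub(r"(?<!\d)\d{1,2}(?!\d)",
                  lambda m: number_to_words(int(m.group())), text)
    # Lower-case and collapse the extra spaces added around symbols.
    return " ".join(text.lower().split())


print(normalize_sentence("Café prices rose 22% today"))
# cafe prices rose twenty-two percent today
```

Stripping accents this way suits languages where the accented letters are not distinct phonemes in the target alphabet. As the list above notes, the right normalization depends on the language.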