Corpus Converter Scripts

Here you will find scripts that help you convert corpus files (lists of sentences) to the right format for language model training. For example:

  • Expand numbers: 22 -> twenty-two
  • Normalize alphabet (depends on language): âáà -> aaa
  • Expand symbols: % -> percent
  • Replace time: 9.30 PM -> nine thirty pm
  • Replace date: 4/28/2023 -> twenty-eighth of april two thousand twenty-three
  • Make everything lower-case
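
To make a few of the steps above concrete, here is a minimal Python sketch of accent stripping, symbol expansion, number expansion and lower-casing. It is not the code in optimize_sentences.py: the function names, the `SYMBOLS` table and the 0-99 number range are assumptions made for illustration, and time and date handling is left out.

```python
import re
import unicodedata

# Hypothetical lookup tables for this sketch (English only).
ONES = ["zero", "one", "two", "three", "four", "five", "six", "seven",
        "eight", "nine", "ten", "eleven", "twelve", "thirteen", "fourteen",
        "fifteen", "sixteen", "seventeen", "eighteen", "nineteen"]
TENS = ["", "", "twenty", "thirty", "forty", "fifty", "sixty", "seventy",
        "eighty", "ninety"]
SYMBOLS = {"%": "percent"}


def number_to_words(n: int) -> str:
    """Spell out an integer in the range 0-99."""
    if n < 20:
        return ONES[n]
    tens, ones = divmod(n, 10)
    return TENS[tens] if ones == 0 else f"{TENS[tens]}-{ONES[ones]}"


def strip_accents(text: str) -> str:
    """Decompose characters and drop the combining marks (âáà -> aaa)."""
    decomposed = unicodedata.normalize("NFKD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def normalize_sentence(sentence: str) -> str:
    """Apply accent stripping, symbol and number expansion, then lower-case."""
    text = strip_accents(sentence)
    for symbol, word in SYMBOLS.items():
        text = text.replace(symbol, f" {word} ")
    # Only stand-alone one- or two-digit numbers are expanded in this sketch.
    text = re.sub(r"\b\d{1,2}\b",
                  lambda m: number_to_words(int(m.group())), text)
    # Lower-case and collapse the whitespace introduced above.
    return " ".join(text.lower().split())


if __name__ == "__main__":
    print(normalize_sentence("Âáà costs 22 %"))
```

Larger numbers, times such as 9.30 PM and dates such as 4/28/2023 need their own rules and would have to run before the generic number step; this sketch does not handle them.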

Quick-Start

  • Run python3 optimize_sentences.py -h to see the available options