initial release 0.1
repodiac committed Aug 5, 2020
1 parent 99fe488 commit ff1f964
Showing 4 changed files with 1,175 additions and 1 deletion.
46 changes: 45 additions & 1 deletion README.md
@@ -1,2 +1,46 @@
# german_transliterate
Python module to clean and transliterate (i.e. normalize) German text including abbreviations, numbers, timestamps etc. Useful for TTS or ASR pre-/post-processing tasks.

**german_transliterate** is a Python module to clean and transliterate (i.e. normalize) German text including abbreviations, numbers, timestamps etc. It can be used to clean messy text (e.g. map peculiar Unicode encodings to ASCII) or replace common abbreviations in text in combination with various text mining tasks.
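
To illustrate what "mapping peculiar Unicode encodings to ASCII" can mean in practice, here is a minimal sketch using only the Python standard library. It is not taken from `german_transliterate`; the helper name `clean_unicode` and the punctuation table are assumptions made for illustration. It folds compatibility characters (ligatures, full-width letters) with NFKC normalization and replaces a few typographic punctuation marks, while leaving German umlauts intact.

```python
import unicodedata

# Hypothetical mapping for illustration only; not the module's actual table.
PUNCTUATION_MAP = str.maketrans({
    "\u201e": '"',  # German opening quote (low double quote)
    "\u201c": '"',  # left double quotation mark
    "\u201d": '"',  # right double quotation mark
    "\u2013": "-",  # en dash
})


def clean_unicode(text: str) -> str:
    """Fold compatibility characters and typographic punctuation.

    NFKC keeps composed letters such as umlauts but rewrites
    compatibility forms (e.g. the 'fi' ligature) to plain letters.
    """
    folded = unicodedata.normalize("NFKC", text)
    return folded.translate(PUNCTUATION_MAP)


if __name__ == "__main__":
    print(clean_unicode("\u201e\ufb01nden\u201c \u2013 f\u00fcr"))
```

The module itself may handle such cases differently; the sketch only shows the general kind of cleanup described above.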

However, it is particularly useful for Text-To-Speech (TTS) preprocessing (both in training and inference) and has features to support phonemic encoding of the results (e.g. with [espeak-ng](https://en.wikipedia.org/wiki/ESpeak#eSpeak_NG)) as the next step in the processing pipeline.

It has been successfully applied to preprocessing with [Mozilla TTS](https://github.com/mozilla/TTS) in combination with `espeak-ng` phonemes as input data for both the training and inference pipelines.

## License

Licensed under the Apache 2.0 license. If you find this software useful and use it elsewhere, a pointer to this work or the GitHub repo is highly welcome!

## Version History

* `release 0.1` - initial release of the software; there are still a lot of `ToDo`s and some more experimental features (see documentation), and exception handling could be improved

# Installation/Setup

It currently has only one external dependency, [num2words](https://pypi.org/project/num2words/). At the moment, all dependencies are listed in `requirements.txt` and also included in `setup.py`.

Installation is currently done via **local** package installation:

* go to the directory where you cloned this repo (via `git clone https://github.com/repodiac/german_transliterate`)
* type `pip install -e .`

It should install into your current Python environment like any other `pip` package (if needed, create a virtual environment with `virtualenv` first).
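
After installing, one quick way to confirm that the package is visible to the current environment is to ask the import system for it. The following is a small sketch using only the standard library; the helper name `is_installed` is an assumption for illustration and is not part of this module.

```python
import importlib.util


def is_installed(package_name: str) -> bool:
    """Return True if the import system can locate a top-level package."""
    return importlib.util.find_spec(package_name) is not None


if __name__ == "__main__":
    # Prints True once `pip install -e .` has succeeded in this environment.
    print("german_transliterate importable:", is_installed("german_transliterate"))
```

If this prints `False`, the most likely cause is that `pip` installed into a different environment than the one running the script.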

# Documentation

Example usage:

```
from german_transliterate.german_transliterate import GermanTransliterate
text = 'Um 13:15h kaufte Hr. Meier 1.000 Luftballons für 250€.'
print('ORIGINAL:', text)
# use these settings for PHONEMIC ENCODINGS, leave all parameters empty otherwise
print('TRANSLITERATION:', GermanTransliterate(replace={';': ',', ':': ' '}, sep_abbreviation=' -- ').transliterate(text))
```
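
The README does not spell out how the `replace` argument is applied. As a rough mental model only, and assuming it behaves like a plain character-to-string substitution map, the effect of `{';': ',', ':': ' '}` can be sketched with the standard library as below. The helper `apply_replacements` is hypothetical and is not the module's implementation.

```python
from typing import Dict


def apply_replacements(text: str, replace: Dict[str, str]) -> str:
    """Substitute each single-character key with its mapped string.

    Assumption: keys are single characters, as in the README example.
    """
    table = str.maketrans(replace)
    return text.translate(table)


if __name__ == "__main__":
    # Same mapping as in the example usage above.
    print(apply_replacements("Um 13:15h; dann", {";": ",", ":": " "}))
```

A mapping like this removes characters such as `:` and `;` that a downstream phonemizer might otherwise treat as punctuation or pauses.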

The parameters used for the config parameter `transliterate_ops` are as follows:
* ... to be completed

# Issues and Comments

Please open issues for bugs or feature requests. You can also reach out to me via GitHub.