Yet another text augmentation Python package.
- Usage
- Appendix
import augtxt
import numpy as np
Check the demo notebook for a usage example.
The function `augtxt.augmenters.wordtypo` randomly applies different augmentations to one word.
The result is a simulated distribution of possible word augmentations, e.g. how possible typographical errors are distributed for a specific original word.
The procedure does not guarantee that the original word will be augmented.
Check the demo notebook for a usage example.
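To illustrate what a simulated distribution means here, the following is a minimal sketch in plain Python that does not use augtxt. The three typo helpers, `simulate_typos`, and the `p_typo` parameter are hypothetical names chosen for illustration; the actual behaviour of `augtxt.augmenters.wordtypo` may differ.

```python
import random
from collections import Counter


def swap_first_two(word):
    # Hypothetical typo: swap the first two characters.
    return word[1] + word[0] + word[2:] if len(word) > 1 else word


def double_last(word):
    # Hypothetical typo: type the last character twice.
    return word + word[-1] if word else word


def drop_last(word):
    # Hypothetical typo: leave out the last character.
    return word[:-1]


def simulate_typos(word, typo_fns, p_typo=0.5, n_samples=1000, seed=0):
    """Count how often each variant of `word` occurs over many random draws."""
    rng = random.Random(seed)
    counts = Counter()
    for _ in range(n_samples):
        variant = word
        # With probability 1 - p_typo the word stays unchanged, which mirrors
        # the fact that augmentation of the original word is not guaranteed.
        if rng.random() < p_typo:
            variant = rng.choice(typo_fns)(variant)
        counts[variant] += 1
    return counts


dist = simulate_typos("Kinder", [swap_first_two, double_last, drop_last])
for variant, count in dist.most_common():
    print(variant, count / sum(dist.values()))
```

The relative frequencies printed at the end approximate the distribution of typo variants for the word "Kinder" under these assumed settings.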
The function `augtxt.augmenters.senttypo` randomly applies different augmentations to
a) at least one word in a sentence, or
b) not more than a certain percentage of words in a sentence.
The procedure guarantees that the sentence is augmented.
The function also allows excluding specific strings from augmentation (e.g. `exclude=("[MASK]", "[UNK]")`). However, these strings cannot include the special characters `.,;:!?` (incl. whitespace).
Check the demo notebook for a usage example.
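The selection logic described for sentences (at least one word, at most a given share of words, excluded strings skipped) could look roughly like the sketch below. It is written in plain Python and is not the augtxt implementation; `pick_words_to_augment`, `max_pct`, and the whitespace tokenization are assumptions made for illustration.

```python
import random


def pick_words_to_augment(sentence, max_pct=None, exclude=(), seed=None):
    """Return the indices of the whitespace-separated tokens to augment."""
    rng = random.Random(seed)
    tokens = sentence.split()
    # Excluded strings are skipped; this simple comparison only matches
    # tokens that equal an excluded string exactly.
    candidates = [i for i, tok in enumerate(tokens) if tok not in exclude]
    if not candidates:
        return []
    if max_pct is None:
        # Mode a): at least one word, up to all candidate words.
        upper = len(candidates)
    else:
        # Mode b): at most max_pct of the candidate words, but never fewer
        # than one, so that the sentence is always augmented.
        upper = min(len(candidates), max(1, int(max_pct * len(candidates))))
    n_pick = rng.randint(1, upper)
    return sorted(rng.sample(candidates, n_pick))


idx = pick_words_to_augment(
    "Die Lehrerin [MASK] einen Roman",
    max_pct=0.5, exclude=("[MASK]",), seed=1)
print(idx)
```

Each selected index would then be passed to a word-level typo function, while the excluded token at index 2 is never touched.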
The `augtxt.typo` module is about augmenting characters to mimic human errors while using a keyboard device.
A user mixes up two consecutive characters.
- Swap 1st and 2nd characters: `augtxt.typo.swap_consecutive("Kinder", loc=0)` (Result: `iKnder`)
- Swap 1st and 2nd characters, and enforce letter cases: `augtxt.typo.swap_consecutive("Kinder", loc=0, keep_case=True)` (Result: `Iknder`)
- Swap a random pair of `i`-th and `i+1`-th characters, with positions near the end of the word being more likely: `np.random.seed(seed=123); augtxt.typo.swap_consecutive("Kinder", loc='end')`
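The swap behaviour listed above can be sketched in plain Python as follows. This is an illustrative reimplementation, not the code of `augtxt.typo.swap_consecutive`; the function names are hypothetical, and the linearly increasing weights used for sampling a position near the end of the word are an assumption.

```python
import random


def swap_consecutive_sketch(word, loc=0, keep_case=False):
    """Swap the characters at positions loc and loc + 1."""
    if loc < 0 or loc + 1 >= len(word):
        return word  # nothing to swap at this position
    chars = list(word)
    first, second = chars[loc], chars[loc + 1]
    if keep_case:
        # The letters trade places, but the upper/lower case pattern
        # stays attached to the positions.
        new_first = second.upper() if first.isupper() else second.lower()
        new_second = first.upper() if second.isupper() else first.lower()
    else:
        new_first, new_second = second, first
    chars[loc], chars[loc + 1] = new_first, new_second
    return "".join(chars)


def sample_loc_near_end(word, rng):
    """Sample a swap position; later positions get larger weights (assumed)."""
    positions = list(range(len(word) - 1))
    weights = [pos + 1 for pos in positions]
    return rng.choices(positions, weights=weights, k=1)[0]


print(swap_consecutive_sketch("Kinder", loc=0))                  # iKnder
print(swap_consecutive_sketch("Kinder", loc=0, keep_case=True))  # Iknder
rng = random.Random(123)
print(swap_consecutive_sketch("Kinder", loc=sample_loc_near_end("Kinder", rng)))
```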
User presses a key twice accidentally.
- Make 5th letter a double letter: `augtxt.typo.pressed_twice("Eltern", loc=4)` (Result: `Elterrn`)
The user does not press the key firmly enough (Lisbach, 2011, p.72), the key is broken, or the finger motion fails.
- Drop the 3rd letter: `augtxt.typo.drop_char("Straße", loc=2)` (Result: `Staße`)
A letter is left out, but the following letter is typed twice.
It's a combination of `augtxt.typo.pressed_twice` and `augtxt.typo.drop_char`.
from augtxt.typo import drop_n_next_twice
augm = drop_n_next_twice("Tante", loc=2)
# Tatte
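The three character-level operations described above are simple string manipulations. The sketch below reproduces the documented results in plain Python; the `_sketch` functions are hypothetical stand-ins and not the augtxt implementations, which may handle edge cases differently.

```python
def pressed_twice_sketch(word, loc):
    """Repeat the character at position loc."""
    if not 0 <= loc < len(word):
        return word
    return word[:loc + 1] + word[loc] + word[loc + 1:]


def drop_char_sketch(word, loc):
    """Leave out the character at position loc."""
    if not 0 <= loc < len(word):
        return word
    return word[:loc] + word[loc + 1:]


def drop_n_next_twice_sketch(word, loc):
    """Drop the character at loc and type the following character twice."""
    if not 0 <= loc < len(word) - 1:
        return word  # there is no following character to double
    # After the drop, the following character sits at position loc.
    return pressed_twice_sketch(drop_char_sketch(word, loc), loc)


print(pressed_twice_sketch("Eltern", loc=4))     # Elterrn
print(drop_char_sketch("Straße", loc=2))         # Staße
print(drop_n_next_twice_sketch("Tante", loc=2))  # Tatte
```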
Usually `SHIFT` is used to type a capital letter, and `ALT` or `ALT+SHIFT` for less common characters.
A typo might occur because these special keys are or are not pressed in combination with a normal key.
The function `augtxt.typo.pressed_shiftalt` simulates such errors randomly.
from augtxt.typo import pressed_shiftalt
augm = pressed_shiftalt("Onkel", loc=2)
# OnKel, On˚el, Onel
The `keymap` can differ depending on the language and the keyboard layout.
from augtxt.typo import pressed_shiftalt
import augtxt.keyboard_layouts as kbl
augm = pressed_shiftalt("Onkel", loc=2, keymap=kbl.macbook_us)
# OnKel, On˚el, Onel
Further, transition probabilities in case of a typo can be specified:
from augtxt.typo import pressed_shiftalt
import augtxt.keyboard_layouts as kbl
keyboard_transprob = {
    "keys": [.0, .75, .2, .05],
    "shift": [.9, 0, .05, .05],
    "alt": [.9, .05, .0, .05],
    "shift+alt": [.3, .35, .35, .0]
}
augm = pressed_shiftalt("Onkel", loc=2, keymap=kbl.macbook_us, trans=keyboard_transprob)
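One way to read such a transition table is as a row-stochastic matrix: each row belongs to the key state that was intended, and each entry is the probability of the state that is pressed instead. The sketch below samples from the table under that reading. It is an assumption that the columns follow the same order as the dictionary keys; `STATES` and `sample_pressed_state` are hypothetical names and not part of augtxt.

```python
import random

# Assumed column order: the same order as the dictionary keys.
STATES = ["keys", "shift", "alt", "shift+alt"]

keyboard_transprob = {
    "keys": [.0, .75, .2, .05],
    "shift": [.9, 0, .05, .05],
    "alt": [.9, .05, .0, .05],
    "shift+alt": [.3, .35, .35, .0],
}


def sample_pressed_state(intended, trans, rng):
    """Sample the state that is pressed by mistake for an intended state."""
    return rng.choices(STATES, weights=trans[intended], k=1)[0]


rng = random.Random(0)
draws = [sample_pressed_state("keys", keyboard_transprob, rng)
         for _ in range(1000)]
# The diagonal entries are zero, so a typo always changes the state.
print({state: draws.count(state) for state in STATES})
```

Under this reading, a plain key press that goes wrong ends up as a `SHIFT` combination in about three out of four cases.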
- Lisbach, B., 2011. Linguistisches Identity Matching. Vieweg+Teubner, Wiesbaden. https://doi.org/10.1007/978-3-8348-9791-6
The PUNCT (`.?!;:`) and COMMA (`,`) tokens carry syntactic information.
A use case:
import augtxt.punct
text = ("Die Lehrerin [MASK] einen Roman. "
"Die Schülerin [MASK] ein Aufsatz, der sehr [MASK] war.")
augmented = augtxt.punct.remove_syntaxinfo(text)
# 'Die Lehrerin [MASK] einen Roman Die Schülerin [MASK] ein Aufsatz der sehr [MASK] war'
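The effect shown above can be approximated with a single regular expression. This is a minimal sketch that assumes only the PUNCT and COMMA characters named above need to be removed; `remove_syntaxinfo_sketch` is a hypothetical name and not the augtxt implementation.

```python
import re


def remove_syntaxinfo_sketch(text):
    """Strip the PUNCT (.?!;:) and COMMA (,) characters from a text."""
    stripped = re.sub(r"[.?!;:,]", "", text)
    # Collapse whitespace that may be left over after the removal.
    return " ".join(stripped.split())


text = ("Die Lehrerin [MASK] einen Roman. "
        "Die Schülerin [MASK] ein Aufsatz, der sehr [MASK] war.")
print(remove_syntaxinfo_sketch(text))
```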
The function `augtxt.punct.merge_words` randomly removes whitespace or hyphens between words and transforms the second word to lower case.
import augtxt.punct
import numpy as np
text = "Die Bindestrich-Wörter sind da."
np.random.seed(seed=23)
augmented = augtxt.punct.merge_words(text, num_aug=1)
assert augmented == 'Die Bindestrich-Wörter sindda.'
np.random.seed(seed=1)
augmented = augtxt.punct.merge_words(text, num_aug=1)
assert augmented == 'Die Bindestrichwörter sind da.'
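The merging step can be pictured as: find every space or hyphen that sits between two word characters, pick some of them at random, remove them, and lower-case the character that follows. The sketch below implements this idea in plain Python. The helper names are hypothetical, and it uses the standard `random` module, so a given seed will not reproduce the augtxt results shown above.

```python
import random
import re


def merge_candidates(text):
    """Positions of spaces or hyphens located between two word characters."""
    return [m.start() for m in re.finditer(r"(?<=\w)[ -](?=\w)", text)]


def merge_at(text, pos):
    """Remove the separator at pos and lower-case the next character."""
    return text[:pos] + text[pos + 1].lower() + text[pos + 2:]


def merge_words_sketch(text, num_aug=1, seed=None):
    rng = random.Random(seed)
    positions = merge_candidates(text)
    chosen = rng.sample(positions, min(num_aug, len(positions)))
    # Work from right to left so that earlier positions stay valid.
    for pos in sorted(chosen, reverse=True):
        text = merge_at(text, pos)
    return text


text = "Die Bindestrich-Wörter sind da."
print(merge_at(text, text.index("-")))       # Die Bindestrichwörter sind da.
print(merge_at(text, text.rindex(" ")))      # Die Bindestrich-Wörter sindda.
print(merge_words_sketch(text, num_aug=1, seed=0))
```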
The `augtxt.order` module simulates errors at the word token level.
import augtxt.order
import numpy as np
np.random.seed(seed=42)
text = "Tausche die Wörter, lasse sie weg, oder [MASK] was."
print(augtxt.order.swap_consecutive(text, exclude=["[MASK]"], num_aug=1))
# die Tausche Wörter, lasse sie weg, oder [MASK] was.
np.random.seed(seed=42)
text = "Tausche die Wörter, lasse sie weg, oder [MASK] was."
print(augtxt.order.write_twice(text, exclude=["[MASK]"], num_aug=1))
# Tausche die die Wörter, lasse sie weg, oder [MASK] was.
np.random.seed(seed=42)
text = "Tausche die Wörter, lasse sie weg, oder [MASK] was."
print(augtxt.order.drop_word(text, exclude=["[MASK]"], num_aug=1))
# Tausche Wörter, lasse sie weg, oder [MASK] was.
np.random.seed(seed=42)
text = "Tausche die Wörter, lasse sie weg, oder [MASK] was."
print(augtxt.order.drop_n_next_twice(text, exclude=["[MASK]"], num_aug=1))
# die die Wörter, lasse sie weg, oder [MASK] was.
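The four word-level operations above are list manipulations on the token sequence. The sketch below shows them in plain Python with an explicit position instead of a random one, so the outputs can be compared directly with the examples. All function names are hypothetical, the whitespace tokenization is an assumption, and the handling of `exclude` (a position is skipped if it or its neighbour is an excluded token) is a guess at the intended behaviour.

```python
import random


def swap_word_pair(tokens, i):
    out = list(tokens)
    out[i], out[i + 1] = out[i + 1], out[i]
    return out


def write_word_twice(tokens, i):
    return tokens[:i + 1] + [tokens[i]] + tokens[i + 1:]


def drop_word_at(tokens, i):
    return tokens[:i] + tokens[i + 1:]


def drop_word_and_double_next(tokens, i):
    return tokens[:i] + [tokens[i + 1], tokens[i + 1]] + tokens[i + 2:]


def pick_position(tokens, exclude, rng, needs_next=False):
    """Pick a random token index that does not touch an excluded token."""
    last = len(tokens) - 1 if needs_next else len(tokens)
    allowed = [
        i for i in range(last)
        if tokens[i] not in exclude
        and (not needs_next or tokens[i + 1] not in exclude)
    ]
    return rng.choice(allowed) if allowed else None


text = "Tausche die Wörter, lasse sie weg, oder [MASK] was."
tokens = text.split()
print(" ".join(swap_word_pair(tokens, 0)))
print(" ".join(write_word_twice(tokens, 1)))
print(" ".join(drop_word_at(tokens, 1)))
print(" ".join(drop_word_and_double_next(tokens, 0)))

pos = pick_position(tokens, ["[MASK]"], random.Random(42), needs_next=True)
print(" ".join(swap_word_pair(tokens, pos)))
```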
Deprecation Notice: `augtxt.wordsubs` will be deleted in 0.6.0 and replaced.
Synonym replacement in particular is not trivial in the German language.
Please check https://github.com/ulf1/flexion for further information.
The `augtxt.wordsubs` module is about replacing specific strings, e.g. words, morphemes, named entities, abbreviations, etc.
It is recommended to filter `vocab` further. For example, PoS tag the sequences and only augment VERB and NOUN tokens.
import itertools
import augtxt.wordsubs
import numpy as np
original_seqs = [["Das", "ist", "ein", "Satz", "."], ["Dies", "ist", "ein", "anderer", "Satz", "."]]
vocab = set([s.lower() for s in itertools.chain(*original_seqs) if len(s) > 1])
synonyms = {
    'anderer': ['verschiedener', 'einiger', 'vieler', 'diverser', 'sonstiger',
                'etlicher', 'einzelner', 'bestimmter', 'ähnlicher'],
    'satz': ['sätze', 'anfangssatz', 'schlussatz', 'eingangssatz', 'einleitungssatzes',
             'einleitungsssatz', 'einleitungssatz', 'behauptungssatz', 'beispielsatz',
             'schlusssatz', 'anfangssatzes', 'einzelsatz', '#einleitungssatz',
             'minimalsatz', 'inhaltssatz', 'aufforderungssatz', 'ausgangssatz'],
    '.': [',', '🎅'],
    'das': ['welches', 'solches'],
    'ein': ['weiteres'],
    'dies': ['was', 'umstand', 'dass']
}
np.random.seed(42)
augmented_seqs = augtxt.wordsubs.synonym_replacement(
    original_seqs, synonyms, num_aug=10, keep_case=True)
# check results for 1st sentence
for s in augmented_seqs[0]:
    print(s)
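The recommendation to restrict augmentation to VERB and NOUN tokens can be applied before calling the replacement function, by filtering the synonym dictionary. The sketch below assumes that a PoS tag per vocabulary entry is already available; the `pos_tags` mapping is hand-written for illustration (in practice it would come from a PoS tagger), and `filter_synonyms_by_pos` is a hypothetical helper, not part of augtxt.

```python
def filter_synonyms_by_pos(synonyms, pos_tags, keep=("VERB", "NOUN")):
    """Keep only the entries whose PoS tag is in `keep`."""
    return {
        word: substitutes
        for word, substitutes in synonyms.items()
        if pos_tags.get(word) in keep
    }


synonyms = {
    "satz": ["beispielsatz", "einzelsatz"],
    "das": ["welches", "solches"],
    "ein": ["weiteres"],
    "anderer": ["verschiedener", "sonstiger"],
}
# Illustrative, hand-assigned tags; not the output of a real tagger.
pos_tags = {"satz": "NOUN", "das": "PRON", "ein": "DET", "anderer": "ADJ"}

filtered = filter_synonyms_by_pos(synonyms, pos_tags)
print(filtered)
```

The filtered dictionary could then be passed on in place of the full `synonyms` dictionary.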
The `augtxt` git repo is available as a PyPI package:
pip install "augtxt>=0.5.0"
pip install git+ssh://git@github.com/ulf1/augtxt.git
Install a virtual environment
python3 -m venv .venv
source .venv/bin/activate
pip install --upgrade pip
pip install -r requirements.txt
pip install -r requirements-dev.txt
pip install -r requirements-demo.txt
(If your git repo is stored in a folder with whitespaces, then don't use the subfolder `.venv`. Use an absolute path without whitespaces.)
Python commands
- Check syntax:
flake8 --ignore=F401 --exclude=$(grep -v '^#' .gitignore | xargs | sed -e 's/ /,/g')
- Run Unit Tests:
pytest
Publish
pandoc README.md --from markdown --to rst -s -o README.rst
python setup.py sdist
twine upload -r pypi dist/*
Clean up
find . -type f -name "*.pyc" | xargs rm
find . -type d -name "__pycache__" | xargs rm -r
rm -r .pytest_cache
rm -r .venv
Please open an issue for support.
Please contribute using GitHub Flow. Create a branch, add commits, and open a pull request.