finnlem is a neural-network-based lemmatizer for the Finnish language.
A trained model maps Finnish word forms to their base forms with reasonable accuracy. These are examples of the model output:
[ORIGINAL] --> [BASE FORM]
Kiinalaisessa --> kiinalainen
osinkotulojen --> osinko#tulo
Rajoittavalla --> rajoittaa
multimediaopetusmateriaalia --> multi#media#opetus#materiaali
ei-rasistisella --> ei-rasistinen
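In the examples above, the '#' character appears to mark the boundaries between the parts of a compound word. Assuming that reading is correct, a predicted lemma could be post-processed along these lines. The helper names split_compound and surface_form are hypothetical and not part of this repository.

```python
def split_compound(lemma, boundary="#"):
    """Split a predicted lemma into its compound parts (assumes '#' marks boundaries)."""
    return [part for part in lemma.split(boundary) if part]


def surface_form(lemma, boundary="#"):
    """Remove the boundary markers to get the lemma as a single written word."""
    return lemma.replace(boundary, "")


# Lemmas copied from the example output above
for lemma in ["kiinalainen", "osinko#tulo", "multi#media#opetus#materiaali"]:
    print(lemma, "->", split_compound(lemma), surface_form(lemma))
```

Hyphens are left untouched, so a lemma such as ei-rasistinen stays in one piece.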
The model is a TensorFlow implementation of a sequence-to-sequence (Seq2Seq) recurrent neural network. This repository contains the code and data needed for training the model and making predictions with it. The datasets contain over 2M samples in total.
- Easy-to-use Python wrapper for sequence-to-sequence modeling
- Automatic session handling, model checkpointing and logging
- Support for tensorboard
- Sequence-to-sequence model features: Bahdanau and Luong attention, residual connections, dropout, beam search decoding, ...
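For readers unfamiliar with beam search decoding, the idea can be illustrated without TensorFlow. The following is a generic toy sketch, not the decoder used in this repository: the function beam_search, the step_fn callback and the probability table are all made up for the illustration.

```python
import math


def beam_search(step_fn, start, end, beam_width=3, max_len=10):
    """Return (tokens, log_prob) of the best sequence found by beam search.

    step_fn(tokens) must return a dict mapping each possible next token
    to its log-probability given the tokens decoded so far.
    """
    beams = [([start], 0.0)]  # unfinished hypotheses: (tokens, cumulative log-prob)
    finished = []
    for _ in range(max_len):
        # Expand every unfinished hypothesis by one token
        candidates = []
        for tokens, score in beams:
            for token, logp in step_fn(tokens).items():
                candidates.append((tokens + [token], score + logp))
        if not candidates:
            break
        # Keep only the beam_width best expansions
        candidates.sort(key=lambda c: c[1], reverse=True)
        beams = []
        for tokens, score in candidates[:beam_width]:
            if tokens[-1] == end:
                finished.append((tokens, score))
            else:
                beams.append((tokens, score))
        if not beams:
            break
    finished.extend(beams)  # fall back to unfinished hypotheses if max_len was reached
    return max(finished, key=lambda c: c[1])


# Made-up next-token probabilities that depend only on the previous token
TABLE = {
    "<s>": {"a": math.log(0.6), "b": math.log(0.4)},
    "a": {"x": math.log(0.4), "y": math.log(0.3), "</s>": math.log(0.3)},
    "b": {"</s>": math.log(0.9), "a": math.log(0.1)},
    "x": {"</s>": 0.0},
    "y": {"</s>": 0.0},
}


def toy_step(tokens):
    return TABLE[tokens[-1]]


print(beam_search(toy_step, "<s>", "</s>", beam_width=1))  # greedy decoding
print(beam_search(toy_step, "<s>", "</s>", beam_width=2))
```

With a beam width of 1 the search is greedy and commits to the locally best first token. A wider beam keeps the alternative alive and ends up with a sequence of higher total probability.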
You should have the latest versions (as of 7/2017) of:
- keras
- nltk
- numpy
- pandas
- tensorflow (1.3.0 or greater, with CUDA 8.0 and cuDNN 6.0 or greater)
- unidecode
- sacremoses (see issue regarding this)
After this, clone this repository to your local machine.
Update 10.9.2020: You could also first clone the repository and then run pip install -r requirements.txt at its root. This installs the latest versions of the required packages automatically, but note that the very latest versions of some packages may no longer be compatible with the source code provided here. Feel free to make a pull request with pinned package versions if you manage to run the source code successfully :)
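Because version mismatches are a likely source of trouble, it can help to record which versions are actually installed before filing an issue or pinning them. A minimal sketch, assuming Python 3.8 or newer and assuming that the distribution names match the package list above:

```python
from importlib import metadata

# Names taken from the requirements list above; assumed to match the PyPI distribution names
REQUIRED = ["keras", "nltk", "numpy", "pandas", "tensorflow", "unidecode", "sacremoses"]


def installed_versions(packages):
    """Map each package name to its installed version, or None if it is not installed."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


for name, version in installed_versions(REQUIRED).items():
    print(f"{name}: {version or 'not installed'}")
```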
Three steps are required in order to get from zero to making predictions with a trained model:
- Dictionary training: A dictionary is created from training documents, which are processed the same way as the Seq2Seq model inputs later on. The dictionary handles the vocabulary/integer mappings required by Seq2Seq.
- Model training: The Seq2Seq model is trained in batches with training documents that contain a source and a target.
- Model decoding: Unseen source documents are fed into the Seq2Seq model, which predicts the targets.
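To make the vocabulary/integer mapping in the first step concrete, here is a toy character-level vocabulary. It is only an illustration of the idea: the class CharVocab, its special tokens and its methods are assumptions made for this sketch and do not reproduce the repository's Dictionary class.

```python
class CharVocab:
    """Toy character vocabulary; NOT the repository's Dictionary class."""

    PAD, UNK = "<PAD>", "<UNK>"  # assumed special tokens

    def __init__(self):
        self.token_to_id = {self.PAD: 0, self.UNK: 1}
        self.id_to_token = [self.PAD, self.UNK]

    def fit(self, docs):
        """Assign the next free integer id to every unseen character."""
        for doc in docs:
            for ch in doc:
                if ch not in self.token_to_id:
                    self.token_to_id[ch] = len(self.id_to_token)
                    self.id_to_token.append(ch)
        return self

    def encode(self, doc):
        """Map a document to integer ids; unknown characters map to UNK."""
        unk = self.token_to_id[self.UNK]
        return [self.token_to_id.get(ch, unk) for ch in doc]

    def decode(self, ids):
        """Map integer ids back to a string."""
        return "".join(self.id_to_token[i] for i in ids)


vocab = CharVocab().fit(["koira", "koiran"])
ids = vocab.encode("koira")
print(ids)                # [2, 3, 4, 5, 6]
print(vocab.decode(ids))  # koira
```

The model only ever sees the integer ids, which is why the dictionary has to be fitted before training and reused unchanged at decoding time.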
The following is a simple example of using some of the features in the Python API. More detailed descriptions of the functions and parameters are available in the source code documentation.
from dictionary import Dictionary
# Documents to fit in dictionary
docs = ['abcdefghijklmnopqrstuvwxyz','åäö','@?*#-']
# Create a new Dictionary object
d = Dictionary()
# Fit characters of each document
d.fit(docs)
# Save for later usage
d.save('./data/dictionaries/lemmatizer.dict')
from model_wrappers import Seq2Seq
# Create a new model
model = Seq2Seq(model_dir='./data/models/lemmatizer',
                dict_path='./data/dictionaries/lemmatizer.dict')
# Create some documents to train on
source_docs = ['koira','koiran','koiraa','koirana','koiraksi','koirassa']*128
target_docs = ['koira','koira','koira','koira','koira','koira']*128
# Train 100 batches, save checkpoint every 25th batch
for i in range(100):
    loss, global_step = model.train(source_docs, target_docs, save_every_n_batch=25)
    print('Global step %d loss: %f' % (global_step, loss))
# Make predictions on documents with the trained model
test_docs = ['koiraa','koirana','koiraksi']
pred_docs = model.decode(test_docs)
print(pred_docs) # --> [['koira'],['koira'],['koira']]
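The decoded output above is a list with one inner list per source document. Assuming that the first entry of each inner list is the preferred prediction, an exact-match accuracy against known targets could be computed roughly as follows. The function exact_match_accuracy is a hypothetical helper, and the sample predictions are written by hand rather than produced by the model.

```python
def exact_match_accuracy(pred_docs, target_docs):
    """Fraction of documents whose first prediction equals the target exactly."""
    if len(pred_docs) != len(target_docs):
        raise ValueError("pred_docs and target_docs must have the same length")
    if not target_docs:
        return 0.0
    hits = sum(
        1
        for preds, target in zip(pred_docs, target_docs)
        if preds and preds[0] == target
    )
    return hits / len(target_docs)


# Hand-written sample in the same shape as the decode output above
pred_docs = [['koira'], ['koira'], ['koiras']]
target_docs = ['koira', 'koira', 'koira']
print(exact_match_accuracy(pred_docs, target_docs))
```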
Command line (See list of available commands here)
The following demonstrates how to use the command line for training and for predicting from files.
python -m dict_train \
    --dict-save-path ./data/dictionaries/lemmatizer.dict \
    --dict-train-path ./data/dictionaries/lemmatizer.vocab
The dictionary train path file(s) should contain one document per line (example).
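A file in this one-document-per-line format can be produced with a few lines of Python. This is a sketch under assumptions: the helper write_documents is hypothetical, UTF-8 encoding is assumed, and the file is written to a temporary directory so that nothing in the repository is overwritten.

```python
import tempfile
from pathlib import Path


def write_documents(path, docs):
    """Write one document per line (UTF-8), skipping empty documents."""
    path = Path(path)
    path.parent.mkdir(parents=True, exist_ok=True)
    with path.open("w", encoding="utf-8") as f:
        for doc in docs:
            doc = doc.strip()
            if doc:
                f.write(doc + "\n")
    return path


docs = ["abcdefghijklmnopqrstuvwxyz", "åäö", "@?*#-"]
with tempfile.TemporaryDirectory() as tmp_dir:
    vocab_path = write_documents(Path(tmp_dir) / "lemmatizer.vocab", docs)
    print(vocab_path.read_text(encoding="utf-8"))
```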
python -m model_train \
    --model-dir ./data/models/lemmatizer \
    --dict-path ./data/dictionaries/lemmatizer.dict \
    --train-data-path ./data/datasets/lemmatizer_train.csv
The model train and validation data path file(s) should contain one source and target document per line, separated by a comma (example).
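Source and target pairs can be written in this comma-separated layout with the standard csv module. The helper write_pairs is hypothetical and the file goes to a temporary directory. Note that csv.writer quotes any field that itself contains a comma, and whether the training script parses quoted fields is not something this sketch can confirm.

```python
import csv
import tempfile
from pathlib import Path


def write_pairs(path, pairs):
    """Write (source, target) pairs as one comma-separated row per line."""
    with open(path, "w", encoding="utf-8", newline="") as f:
        csv.writer(f, lineterminator="\n").writerows(pairs)


# Pairs taken from the examples earlier in this document
pairs = [("koiran", "koira"), ("koirassa", "koira"), ("osinkotulojen", "osinko#tulo")]
with tempfile.TemporaryDirectory() as tmp_dir:
    train_path = Path(tmp_dir) / "lemmatizer_train.csv"
    write_pairs(train_path, pairs)
    print(train_path.read_text(encoding="utf-8"))
```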
python -m model_decode \
    --model-dir ./data/models/lemmatizer \
    --test-data-path ./data/datasets/lemmatizer_test.csv \
    --decoded-data-path ./data/decoded/lemmatizer_decoded.csv
The model test data path file(s) should contain either:
- one source document per line, or
- one source and target document per line, separated by a comma (example)
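Both layouts listed above can be handled by one small reader that treats the target as optional. This is an illustrative sketch, not the parsing code used by model_decode, and the function name read_test_rows is hypothetical.

```python
import csv
import io


def read_test_rows(lines):
    """Parse rows that hold either 'source' or 'source,target'.

    Returns (sources, targets); a target is None when the row has no second column.
    """
    sources, targets = [], []
    for row in csv.reader(lines):
        if not row:  # skip blank lines
            continue
        sources.append(row[0])
        targets.append(row[1] if len(row) > 1 else None)
    return sources, targets


sample = io.StringIO("koiraa,koira\nkoirana\n")
sources, targets = read_test_rows(sample)
print(sources)  # ['koiraa', 'koirana']
print(targets)  # ['koira', None]
```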
To use tensorboard, run the command
python -m tensorflow.tensorboard --logdir=model_dir
where model_dir is the Seq2Seq model checkpoint folder.
The model was originally created for summarizing Finnish news, using news contents as the sources and news titles as the targets. This proved to be quite a difficult task due to the rich morphology of the Finnish language and a lack of computational resources. My first approach to tackling the morphology was to use the base form of each word, which is what the model in this package produces by default. However, converting every word to its base form with this model ended up being too slow to serve as input for the second model in real time.
In the end, I decided to try the Finnish SnowballStemmer from nltk in order to get the "base words", and started training the model with a 100k vocabulary. After 36 hours of training with the loss decreasing very slowly, I decided to stop and keep this package as a character-level lemmatizer. However, model_wrappers.py contains a global variable DOC_HANDLER_FUNC, which makes it easy to change the preprocessing method from characters to words by setting DOC_HANDLER_FUNC='WORD'. Try changing the variable, and/or write your own preprocessing function doc_to_tokens, if you'd like to experiment with the word-level model.
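The difference between character-level and word-level preprocessing can be sketched as below. The signature shown here, including the mode parameter and the 'CHAR' value, is an assumption made for the illustration; check model_wrappers.py for the actual doc_to_tokens interface before replacing it.

```python
def doc_to_tokens(doc, mode="CHAR"):
    """Split a document into tokens (hypothetical signature, see the note above).

    'CHAR' yields one token per character, 'WORD' one token per
    whitespace-separated word.
    """
    if mode == "CHAR":
        return list(doc)
    if mode == "WORD":
        return doc.split()
    raise ValueError("mode must be 'CHAR' or 'WORD'")


print(doc_to_tokens("koiran talo", mode="CHAR"))
print(doc_to_tokens("koiran talo", mode="WORD"))
```

Character tokens keep the vocabulary tiny, while word tokens shorten the sequences at the cost of a much larger vocabulary, which matches the slow training described above.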
- JayParks/tf-seq2seq: Example sequence-to-sequence implementation in tensorflow
- Omorfi: Finnish open source morphology tool
- FinnTreeBank: Source for datasets
- Finnish Dependency Parser: Source for datasets
Jesse Myrberg (jesse.myrberg@gmail.com)