Skip to content

Latest commit

 

History

History
162 lines (99 loc) · 11.6 KB

README.md

File metadata and controls

162 lines (99 loc) · 11.6 KB

Training Tutorials for the Stanza Python NLP Library

This repo provides step-by-step tutorials for training models with Stanza - the official Python NLP library by the Stanford NLP Group. All neural processors in Stanza, including the tokenzier, the multi-word token (MWT) expander, the POS/morphological features tagger, the lemmatizer, the dependency parser, and the named entity tagger, can be trained with your own data.

This repo is meant to complement our training documentation, by providing runnable scripts coupled with toy data that makes it much easier for users to get started with model training. To train models with your own data, you should be able to simply replace the provided toy data with your own data in the same format, and start using them with Stanza right after training.

Warning: This repo is fully tested on Linux. Due to syntax differences between macOS and Linux (e.g., the declare -A in the scripts/treebank_to_shorthand.sh is not supported by macOS), you need to rewrite some files to run on macOS. The command lines given here will not work on Windows.

This repo is designed for and tested on stanza 1.4.0. Earlier versions will not fully work with these commands.

Windows

To reiterate, this is only tested on Linux. In order to run on Windows, there is a source scripts/config.sh line in the initial setup below. Theoretically, if you manually set those variables in the shell, or if you add those variables to the environment using the control panel, the rest of the scripts might work.

Environment Setup

Run the following commands at the command line.

Stanza only supports python3. You can install all dependencies needed by training Stanza models with:

pip install -r requirements.txt

Next, set up the folders and scripts needed for training with:

git clone git@github.com:stanfordnlp/stanza-train.git
cd stanza-train

git clone git@github.com:stanfordnlp/stanza.git
cp config/config.sh stanza/scripts/config.sh
cp config/xpos_vocab_factory.py stanza/stanza/models/pos/xpos_vocab_factory.py
cd stanza
source scripts/config.sh

The config.sh script is used to set environment variables (e.g., data path, word vector path, etc.) needed by training and testing Stanza models.

The xpos_vocab_factory.py script is used to build XPOS vocabulary file for our provided UD_English-TEST toy treebank. Compared with the original file in the downloaded Stanza repo, we only add the shorthand name of the toy treebank (en_test) to the script, so that it can be recognized during training. If you want to use another dataset other than UD_English-TEST after running this tutorial, you can add the shorthand of your treebank in the same way. In case you're curious, here's how we built this file.

Training and Evaluating Processors

Here we provide instructions for training each processor currently supported by Stanza, using the toy data in this repo as example datasets. Model performance will be printed during training. As our provided toy data only contain several sentences for demonstration purpose, you should be able to get 100% accuracy at the end of training.

tokenize

The tokenize processor segments the text into tokens and sentences. All downstream processors which generate annotations at the token or sentence level depends on the output from this processor.

Training the tokenize processor currently requires the Universal Dependencies treebank data in both plain text and the conllu format, as you can find in our provided toy examples here. To train the tokenize processor with this toy data, run the following command:

python3 -m stanza.utils.datasets.prepare_tokenizer_treebank UD_English-TEST
python3 -m stanza.utils.training.run_tokenizer UD_English-TEST --step 500

Note that since this toy data is very small in scale, we are restricting the training with a very small step parameter. To train on your own data, you can either set a larger step parameter, or use the default parameter value.

mwt

The Universal Dependencies grammar defines syntatic relations between syntactic words, which, for many languages (e.g., French), are different from raw tokens as segmented from the text. For these languages, the mwt processor expands the multi-word tokens (MWT) recognized by the tokenize processor into multiple syntactic words, paving the ways for downstream annotations.

Note: The mwt processor is not needed and cannot be trained for languages that do not have multi-word tokens (MWT), such as English or Chinese.

Like the tokenize processor, training the mwt processor requires UD data, in the format like our provided toy examples here. You can run the following command to train the mwt processor:

python3 -m stanza.utils.datasets.prepare_mwt_treebank UD_English-TEST
python3 -m stanza.utils.training.run_mwt UD_English-TEST --num_epoch 2

Note: Running the above command with the toy data will yield a message saying that zero training data can be found for MWT training. This is normal since MWT is not needed for English. The training should work when you replace the provided data with data in languages that support MWT (e.g., German, French, etc.).

lemma

The lemma processor predicts lemmas for all words in an input sentence. Training the lemma processor requires data files in the conllu format. With the toy examples, you can train the lemma processor with the following command:

python3 -m stanza.utils.datasets.prepare_lemma_treebank UD_English-TEST
python3 -m stanza.utils.training.run_lemma UD_English-TEST --num_epoch 2

pos

The pos processor annotates words with three types of syntactic information simultaneously: the Universal POS (UPOS) tags, and treebank-specific POS (XPOS) tags, and universal morphological features (UFeats).

Training the pos processor usually requires UD data in the conllu format and pretrained word vectors. For demo purpose, we provide an example word vector file here. With the toy data and word vector file, you can train the pos processor with:

python3 -m stanza.utils.datasets.prepare_pos_treebank UD_English-TEST
python3 -m stanza.utils.training.run_pos UD_English-TEST --max_steps 500

depparse

The depparse processor implements a dependency parser that predicts syntactic relations between words in a sentence. Training the depparse processor requires data files in the conllu format, and a pretrained word vector file. With the toy data and word vector file, you can train the depparse processor with:

python3 -m stanza.utils.datasets.prepare_depparse_treebank UD_English-TEST
python3 -m stanza.utils.training.run_depparse UD_English-TEST --max_steps 500

Note that the gold parameter here tells the scripts to use the "gold" human-annotated POS tags in the training of the parser.

ner

The ner processor recognizes named entities in the input text. Training the ner processor requires column training data in either BIO or BIOES format. See this wikipedia page for an introduction of the formats. We provide toy examples here in the BIO format. For better performance a pretrained word vector file is also recommended. With the toy data and word vector file, you can train the ner processor with:

python3 -m stanza.utils.training.run_ner en_sample --max_steps 500 --word_emb_dim 5

Note that for demo purpose we are restricting the word vector dimension to be 5 with the word_emb_dim parameter. You should change it to match the dimension of your own word vectors.

Improving NER Performance with Contextualized Character Language Models

The performance of the ner processor can be significantly improved by using contextualized string embeddings (i.e., a character-level language model), as was shown in this COLING 2018 paper. To enable this in your NER model, you'll need to first train two character-level language models for your language (named as charlm module in Stanza), and then use these trained charlm models in your NER training.

charlm

Training charlm requires a large amount of raw text, such as text from news articles or wikipedia pages, in plain text files. We provide toy data for training charlm here. With the toy data, you can run the following command to train two charlm models, one in the forward direction of the text and another in the backward direction, respectively:

python3 -m stanza.utils.training.run_charlm en_TEST --forward   --epochs 2 --cutoff 0 --batch_size 2
python3 -m stanza.utils.training.run_charlm en_TEST --backward  --epochs 2 --cutoff 0 --batch_size 2

Running these commands will result in two model files in the saved_models/charlm directory, with the prefix en_test.

Note: For details on why two models are needed and how they are used in the NER tagger, please refer to this COLING 2018 paper.

Training contextualized ner models with pretrained charlm

Training contextualized ner models requires BIO-format data, pretrained word vectors, and the pretrained charlm models obtained in the last step. You can run the following command to train the ner processor:

python3 -m stanza.utils.training.run_ner en_sample --max_steps 500 --word_emb_dim 5 --charlm test

Note that the charlm here instructs the training script to look for the character language model files with the prefix of en_test.

Initializing Processors with Trained Models

Initializing a processor with your own trained model only requires the path for the model file. Here we provide an example to initialize the tokenize processor with a model file saved at saved_models/tokenize/en_test_tokenizer.pt:

>>> import stanza
>>> nlp = stanza.Pipeline(lang='en', processors='tokenize', tokenize_model_path='saved_models/tokenize/en_test_tokenizer.pt')

Contributing Your Models to the Model Zoo

After training your own models, we welcome you to contribute your models so that it can be used by the community. To do this, you can start by creating a GitHub issue. Please help us understand your model by clearly describing your dataset, model performance, your contact information, and why you think your model would benefit the whole community. We will integrate your models into our official repository once we are able to verify its quality and usability.