This repository contains a set of scripts to build a ready-to-use Juman++ model for Jumandic.
- Unix environment (on Windows use WSL or MSYS2/MinGW64)
- Juman++ build environment
- Python 3.6+
- Ruby
- Perl
- Configured SSH authorization for GitHub (several repositories are cloned via SSH)
- 32 GB of RAM
- Original texts from Mainichi Shinbun (year 1995) for the Kyoto Corpus (see the page for more information). Otherwise, the Juman++ model will be trained only on the Leads corpus and will have poor quality.
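A quick way to see whether the command-line tools from the list above are installed is sketched below. The helper name `missing_tools` and the exact command names (`python3`, `ruby`, `perl`, `git`, `ssh`, `make`) are assumptions made for this sketch and are not part of the repository. It does not check the Python version, the available RAM, or the SSH authorization itself.

```shell
# Print every command from the argument list that is not found on PATH.
# missing_tools is a hypothetical helper for this sketch.
missing_tools() {
    for tool in "$@"; do
        command -v "$tool" >/dev/null 2>&1 || printf '%s\n' "$tool"
    done
}

# Assumed command names for the requirements listed above.
missing=$(missing_tools python3 ruby perl git ssh make)
if [ -n "$missing" ]; then
    echo "Missing tools:" $missing
else
    echo "All required tools were found."
fi
```

An empty result only means the commands exist. Whether they are recent enough still has to be verified separately.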
Run the configuration script: `python3 configure.py`.
It will prompt for the location of Mainichi Shinbun texts.
After that, run `make nornn` to train a model without the RNN component.
Running `make rnn` produces the model with the RNN component.
The models will be inside the `bld/model` folder.
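The build steps above can be chained in a small wrapper, roughly as follows. This is only a sketch: the functions `run` and `build_model` and the `DRY_RUN` switch are hypothetical helpers introduced here, not something the repository provides. Only `python3 configure.py`, `make nornn`, and `make rnn` come from the instructions above.

```shell
# Run a command, or only print it when DRY_RUN=1.
# DRY_RUN is a switch of this sketch, not of the repository.
run() {
    if [ "${DRY_RUN:-0}" = "1" ]; then
        printf '%s\n' "$*"
    else
        "$@"
    fi
}

# $1 is the make target: nornn (no RNN component) or rnn (with RNN component).
build_model() {
    run python3 configure.py &&
    run make "$1" &&
    echo "Models are expected in bld/model"
}

# Preview the commands for the non-RNN model without executing them.
(DRY_RUN=1; build_model nornn)
```

Without `DRY_RUN=1`, `build_model rnn` would run the real commands from the repository root, and `configure.py` would still prompt for the location of the Mainichi Shinbun texts.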
It is possible to add your words to the model. To do it:
- Perform the configuration as described above: `python3 configure.py`.
- Fetch the repositories: `make repo`.
- Go into the `bld/repos/jumandic` folder, it is a local clone of the JumanDIC repository.
- Create a new file with the `.dic` extension in the `userdic` folder of the `bld/repos/jumandic` folder.
- Put your words into that file, in JUMAN dictionary format (refer to other files for example).
- Execute `make clean-dic` if you have already built a Juman++ model.
- Build your model as shown above.
If the built model does not contain your words, ensure that the binary dictionary was rebuilt after adding new words.
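If the words are prepared in a separate file first, copying it into place could look like the sketch below. The function `add_user_dic` and the file name `mywords.dic` are hypothetical and used only for illustration. The target path `bld/repos/jumandic/userdic` is taken from the steps above and is assumed to exist after `make repo`.

```shell
# Copy a prepared dictionary file into the userdic folder of the JumanDIC clone.
# $1: path to your .dic file
# $2: JumanDIC clone (defaults to bld/repos/jumandic)
add_user_dic() {
    src=$1
    repo=${2:-bld/repos/jumandic}
    case "$src" in
        *.dic) ;;
        *) echo "error: $src must have the .dic extension" >&2; return 1 ;;
    esac
    [ -f "$src" ] || { echo "error: $src not found" >&2; return 1; }
    [ -d "$repo/userdic" ] || {
        echo "error: $repo/userdic not found, run 'make repo' first" >&2
        return 1
    }
    cp "$src" "$repo/userdic/" &&
    echo "Copied $src; run 'make clean-dic' and rebuild the model."
}

# Example call (mywords.dic is a placeholder name):
# add_user_dic mywords.dic
```

After the copy, `make clean-dic` followed by `make nornn` or `make rnn` should force the binary dictionary to be rebuilt with the new words.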