Skip to content

ku-nlp/jumanpp-jumandic

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

10 Commits
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

Descritption

This repository contains a set of scripts to build a ready-to-use Juman++ model for Jumandic.

Prerequrements

  • Unix environment (on Windows use WSL or MSYS2/MinGW64)
  • Juman++ build environment
  • Python 3.6+
  • Ruby
  • Perl
  • Configured ssh authorization for github (we will clone several repositories via ssh)
  • 32 GB of RAM

Recommended

  • Original texts from Mainichi Shinbun (year 1995) for Kyoto Corpus (see the page for more information). Othewise, Juman++ model will be trained only on Leads corpus and will have poor quality.

How to Use

Run the configuration script: python3 configure.py. It will prompt for the location of Mainichi Shinbun texts.

After that run make nornn for training a model without RNN component. make rnn produces the model with RNN component. The models will be inside the bld/model folder.

Adding your words to the model

It is possible to add your words to the model. To do it:

  1. Perform the configuration as described above: python3 configure.py
  2. Fetch the repositories make repo.
  3. Go into bld/repos/jumandic folder, it is a local clone of JumanDIC repository.
  4. Create a new file with the .dic extension in the userdic folder of the bld/repos/jumandic folder.
  5. Put your words into that file, in JUMAN dictionary format (refer to other files for example).
  6. Execute make clean-dic if you have already built a Juman++ model.
  7. Build your model as shown above.

If the built model does not contain your words, ensure that the binary dictionary was rebuilt after adding new words.