SudachiPy

SudachiPy is a Python version of Sudachi, a Japanese morphological analyzer.

This is not a pure Python implementation, but bindings for Sudachi.rs.

Binary wheels

We provide binary builds for macOS (10.14+), Windows, and Linux, for the x86_64 architecture only. The 32-bit x86 architecture is not supported and is not tested. macOS source builds seem to work on ARM-based (AArch64) Macs, but this architecture is also not tested, and building requires installing the Rust toolchain and Cargo.

More information here.

TL;DR

$ pip install sudachipy sudachidict_core

$ echo "高輪ゲートウェイ駅" | sudachipy
高輪ゲートウェイ駅	名詞,固有名詞,一般,*,*,*	高輪ゲートウェイ駅
EOS

$ echo "高輪ゲートウェイ駅" | sudachipy -m A
高輪	名詞,固有名詞,地名,一般,*,*	高輪
ゲートウェイ	名詞,普通名詞,一般,*,*,*	ゲートウェー
駅	名詞,普通名詞,一般,*,*,*	駅
EOS

$ echo "空缶空罐空きカン" | sudachipy -a
空缶	名詞,普通名詞,一般,*,*,*	空き缶	空缶	アキカン	0
空罐	名詞,普通名詞,一般,*,*,*	空き缶	空罐	アキカン	0
空きカン	名詞,普通名詞,一般,*,*,*	空き缶	空きカン	アキカン	0
EOS

from sudachipy import Dictionary, SplitMode

tokenizer = Dictionary().create()

morphemes = tokenizer.tokenize("国会議事堂前駅")
print(morphemes[0].surface())  # '国会議事堂前駅'
print(morphemes[0].reading_form())  # 'コッカイギジドウマエエキ'
print(morphemes[0].part_of_speech())  # ['名詞', '固有名詞', '一般', '*', '*', '*']

morphemes = tokenizer.tokenize("国会議事堂前駅", SplitMode.A)
print([m.surface() for m in morphemes])  # ['国会', '議事', '堂', '前', '駅']

Setup

You need SudachiPy and a dictionary.

Step 1. Install SudachiPy

pip install sudachipy

Step 2. Get a Dictionary

You can get a dictionary as a Python package. It may take a while to download the dictionary file (around 70MB for the core edition).

pip install sudachidict_core

Alternatively, you can choose another dictionary edition. See the Dictionary Edition section for details.

Usage: As a command

There is a CLI command sudachipy.

$ echo "外国人参政権" | sudachipy
外国人参政権	名詞,普通名詞,一般,*,*,*	外国人参政権
EOS
$ echo "外国人参政権" | sudachipy -m A
外国	名詞,普通名詞,一般,*,*,*	外国
人	接尾辞,名詞的,一般,*,*,*	人
参政	名詞,普通名詞,一般,*,*,*	参政
権	接尾辞,名詞的,一般,*,*,*	権
EOS

$ sudachipy tokenize -h
usage: sudachipy tokenize [-h] [-r file] [-m {A,B,C}] [-o file] [-s string]
                          [-a] [-d] [-v]
                          [file [file ...]]

Tokenize Text

positional arguments:
  file           text written in utf-8

optional arguments:
  -h, --help     show this help message and exit
  -r file        the setting file in JSON format
  -m {A,B,C}     the mode of splitting
  -o file        the output file
  -s string      sudachidict type
  -a             print all of the fields
  -d             print the debug information
  -v, --version  print sudachipy version

Note: The Debug option (-d) is disabled in version 0.6.*

Output

Columns are tab separated.

Surface
Part-of-Speech Tags (comma separated)
Normalized Form

When you add the -a option, it additionally outputs

Dictionary Form
Reading Form
Dictionary ID
- 0 for the system dictionary
- 1 and above for the user dictionaries
- -1 if a word is Out-of-Vocabulary (not in the dictionary)
Synonym group IDs
(OOV) if a word is Out-of-Vocabulary (not in the dictionary)

$ echo "外国人参政権" | sudachipy -a
外国人参政権	名詞,普通名詞,一般,*,*,*	外国人参政権	外国人参政権	ガイコクジンサンセイケン	0	[]
EOS

$ echo "阿quei" | sudachipy -a
阿	名詞,普通名詞,一般,*,*,*	阿	阿		-1	[]	(OOV)
quei	名詞,普通名詞,一般,*,*,*	quei	quei		-1	[]	(OOV)
EOS
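If you post-process the CLI output in a script, the columns described above can be split on tabs. The following is a minimal sketch, not part of SudachiPy: the function name `parse_sudachipy_line` and the record keys are hypothetical names chosen for illustration, and it assumes the column order listed in this section.

```python
from typing import Any, Dict, Optional

# Column names in the order listed in this section.
# These keys are illustrative, not identifiers from SudachiPy itself.
BASE_COLUMNS = ["surface", "pos", "normalized_form"]
EXTRA_COLUMNS = ["dictionary_form", "reading_form", "dictionary_id", "synonym_group_ids"]


def parse_sudachipy_line(line: str) -> Optional[Dict[str, Any]]:
    """Parse one tab-separated output line; return None for the EOS marker."""
    line = line.rstrip("\n")
    if line == "EOS":
        return None
    fields = line.split("\t")
    # zip() stops at the shorter input, so this handles output with or without -a.
    record: Dict[str, Any] = dict(zip(BASE_COLUMNS + EXTRA_COLUMNS, fields))
    # Part-of-speech tags are comma separated inside their column.
    record["pos"] = record["pos"].split(",")
    if "dictionary_id" in record:
        record["dictionary_id"] = int(record["dictionary_id"])
    # With -a, out-of-vocabulary words carry a trailing "(OOV)" column.
    record["is_oov"] = fields[-1] == "(OOV)"
    return record


sample = "阿\t名詞,普通名詞,一般,*,*,*\t阿\t阿\t\t-1\t[]\t(OOV)"
parsed = parse_sudachipy_line(sample)
print(parsed["surface"], parsed["dictionary_id"], parsed["is_oov"])
```

The sketch keeps the synonym group IDs as the raw string printed by the CLI; convert them further only if your script needs them.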

Usage: As a Python package

API

See API reference page.

Example

from sudachipy import Dictionary, SplitMode

tokenizer_obj = Dictionary().create()

# Multi-granular Tokenization

# SplitMode.C is the default mode
[m.surface() for m in tokenizer_obj.tokenize("国家公務員", SplitMode.C)]
# => ['国家公務員']

[m.surface() for m in tokenizer_obj.tokenize("国家公務員", SplitMode.B)]
# => ['国家', '公務員']

[m.surface() for m in tokenizer_obj.tokenize("国家公務員", SplitMode.A)]
# => ['国家', '公務', '員']

# Morpheme information

m = tokenizer_obj.tokenize("食べ")[0]

m.surface() # => '食べ'
m.dictionary_form() # => '食べる'
m.reading_form() # => 'タベ'
m.part_of_speech() # => ['動詞', '一般', '*', '*', '下一段-バ行', '連用形-一般']

# Normalization

mode = SplitMode.C
tokenizer_obj.tokenize("附属", mode)[0].normalized_form()
# => '付属'
tokenizer_obj.tokenize("SUMMER", mode)[0].normalized_form()
# => 'サマー'
tokenizer_obj.tokenize("シュミレーション", mode)[0].normalized_form()
# => 'シミュレーション'

(With 20210802 core dictionary. The results may change when you use other versions)

Dictionary Edition

There are three editions of Sudachi Dictionary: small, core, and full. See WorksApplications/SudachiDict for details.

SudachiPy uses sudachidict_core by default.

Dictionaries can be installed as Python packages sudachidict_small, sudachidict_core, and sudachidict_full.

SudachiDict-small · PyPI
SudachiDict-core · PyPI
SudachiDict-full · PyPI

The dictionary files are not included in the package itself; they are downloaded upon installation.

Dictionary option: command line

You can specify the dictionary with the tokenize option -s.

$ pip install sudachidict_small
$ echo "外国人参政権" | sudachipy -s small

$ pip install sudachidict_full
$ echo "外国人参政権" | sudachipy -s full

Dictionary option: Python package

You can specify the dictionary with the config or dict argument of Dictionary().

class Dictionary(config=None, resource_dir=None, dict=None)

config
- You can specify the file path to the setting file with config (see the Dictionary in The Setting File section for details).
- If the dictionary file is specified in the setting file as systemDict, SudachiPy will use that dictionary.
dict
- You can also specify the dictionary type with dict.
- The available arguments are small, core, full, or a path to the dictionary file.
- If different dictionaries are specified with config and dict, the dictionary specified by dict overrides the one defined in the config.

from sudachipy import Dictionary

# default: sudachidict_core
tokenizer_obj = Dictionary().create()

# The dictionary given by the `systemDict` key in the config file (/path/to/sudachi.json) will be used
tokenizer_obj = Dictionary(config="/path/to/sudachi.json").create()

# The dictionary specified by `dict` will be used.
tokenizer_obj = Dictionary(dict="core").create()  # sudachidict_core (same as default)
tokenizer_obj = Dictionary(dict="small").create()  # sudachidict_small
tokenizer_obj = Dictionary(dict="full").create()  # sudachidict_full

# The dictionary specified by `dict` overrides those defined in the config.
# In the following code, `sudachidict_full` will be used regardless of a dictionary defined in the config file.
tokenizer_obj = Dictionary(config="/path/to/sudachi.json", dict="full").create()
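The precedence described above (dict first, then systemDict from the config, then the default core edition) can be modeled in a few lines. This is only an illustrative sketch of the documented rule, not SudachiPy's actual implementation; the function `resolve_dictionary` and its parameter names are hypothetical.

```python
from typing import Optional


def resolve_dictionary(config_system_dict: Optional[str] = None,
                       dict_arg: Optional[str] = None) -> str:
    """Return which dictionary would be chosen under the documented precedence."""
    # 1. An explicit `dict` argument overrides everything else.
    if dict_arg is not None:
        return dict_arg
    # 2. Otherwise, `systemDict` from the setting file is used if present.
    if config_system_dict is not None:
        return config_system_dict
    # 3. Otherwise, fall back to the default edition.
    return "core"


print(resolve_dictionary())
print(resolve_dictionary(config_system_dict="path/to/system.dic"))
print(resolve_dictionary(config_system_dict="path/to/system.dic", dict_arg="full"))
```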

Dictionary in The Setting File

Alternatively, if the dictionary file is specified in the setting file, sudachi.json, SudachiPy will use that file.

{
    "systemDict" : "relative/path/from/resourceDir/to/system.dic",
    ...
}

The default setting file is sudachi.json. You can specify your sudachi.json with the -r option.

$ sudachipy -r path/to/sudachi.json
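If you generate the setting file from a script, the standard json module is enough. The sketch below writes a file containing only the systemDict key from this section; the helper name `write_setting_file`, the temporary directory, and the dictionary path are placeholders for illustration. Whether your setup needs additional keys in the file is not covered here.

```python
import json
import tempfile
from pathlib import Path


def write_setting_file(directory: Path, system_dict: str) -> Path:
    """Write a minimal sudachi.json that sets only the systemDict key."""
    setting_path = directory / "sudachi.json"
    settings = {"systemDict": system_dict}
    setting_path.write_text(
        json.dumps(settings, ensure_ascii=False, indent=4),
        encoding="utf-8",
    )
    return setting_path


# Placeholder path; replace it with the location of a real dictionary file.
work_dir = Path(tempfile.mkdtemp())
setting_file = write_setting_file(work_dir, "relative/path/to/system.dic")
print(setting_file.read_text(encoding="utf-8"))
```

The resulting path can then be passed to the -r option or to Dictionary(config=...).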

User Dictionary

To use a user dictionary, user.dic, place sudachi.json anywhere you like, and add a userDict value with the relative path from sudachi.json to your user.dic.

{
    "userDict" : ["relative/path/to/user.dic"],
    ...
}

Then specify your sudachi.json with the -r option.

$ sudachipy -r path/to/sudachi.json

You can build a user dictionary with the subcommand ubuild.

$ sudachipy ubuild -h
usage: sudachipy ubuild [-h] [-o file] [-d string] -s file file [file ...]

Build User Dictionary

positional arguments:
  file        source files with CSV format (one or more)

options:
  -h, --help  show this help message and exit
  -o file     output file (default: user.dic)
  -d string   description comment to be embedded on dictionary

required named arguments:
  -s file     system dictionary path

For the dictionary file format, please refer to this document (written in Japanese; an English version is not available yet).
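When building user dictionaries from a script, it can help to assemble the ubuild command line in one place. The following sketch only constructs the argument list according to the usage text above; the helper `ubuild_command` and the file names are hypothetical, and nothing is executed.

```python
import shlex
from typing import List, Optional, Sequence


def ubuild_command(system_dict: str,
                   sources: Sequence[str],
                   output: Optional[str] = None,
                   description: Optional[str] = None) -> List[str]:
    """Build the argument list for `sudachipy ubuild` from the usage text."""
    if not sources:
        raise ValueError("at least one CSV source file is required")
    argv = ["sudachipy", "ubuild", "-s", system_dict]  # -s is required
    if output is not None:
        argv += ["-o", output]  # the CLI defaults to user.dic when omitted
    if description is not None:
        argv += ["-d", description]
    argv += list(sources)
    return argv


# Placeholder file names for illustration.
command = ubuild_command("path/to/system.dic", ["user_words.csv"], output="user.dic")
print(shlex.join(command))
```

Once the paths point at real files, the list can be handed to subprocess.run(command, check=True).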

Customized System Dictionary

$ sudachipy build -h
usage: sudachipy build [-h] [-o file] [-d string] -m file file [file ...]

Build Sudachi Dictionary

positional arguments:
  file        source files with CSV format (one of more)

optional arguments:
  -h, --help  show this help message and exit
  -o file     output file (default: system.dic)
  -d string   description comment to be embedded on dictionary

required named arguments:
  -m file     connection matrix file with MeCab's matrix.def format

To use your customized system.dic, place sudachi.json anywhere you like, and overwrite the systemDict value with the relative path from sudachi.json to your system.dic.

{
    "systemDict" : "relative/path/to/system.dic",
    ...
}

Then specify your sudachi.json with the -r option.

$ sudachipy -r path/to/sudachi.json

For Developers

Build from source

Install sdist via pip

Install the Python modules setuptools and setuptools-rust.
Run ./build-sdist.sh in the python directory.
- A source distribution will be generated under the python/dist/ directory.
Install it via pip: pip install ./python/dist/SudachiPy-[version].tar.gz

Install develop build

Install the Python modules setuptools and setuptools-rust.
Run python3 -m pip install -e . to install sudachipy (editable install).
Now you can import the module with import sudachipy.

ref: setuptools-rust

Test

Run build_and_test.sh to run the tests.

Contact

Sudachi and SudachiPy are developed by WAP Tokushima Laboratory of AI and NLP.

Open an issue, or come to our Slack workspace for questions and discussion.

https://sudachi-dev.slack.com/ (Get invitation here)

Enjoy tokenization!
