๐ŸŒฟ Konoha: Simple wrapper of Japanese Tokenizers


Konoha is a Python library that provides an easy-to-use, unified interface to various Japanese tokenizers, letting you switch between tokenizers with small code changes and speed up your pre-processing.

Supported tokenizers

In addition to wrapping morphological analyzers such as MeCab, Sudachi, and Sentencepiece (see the Installation and Example sections below), konoha provides rule-based tokenizers (whitespace, character) and a rule-based sentence splitter.
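
As a minimal sketch (not taken from this README; the tokenizer names "Character" and "Whitespace" are assumed to be accepted by WordTokenizer in the same way as "MeCab" in the Example section), the rule-based tokenizers are used through the same interface:

from konoha import WordTokenizer

# Rule-based tokenizers share the WordTokenizer interface; the names below
# are assumptions based on the list above ("whitespace", "character").
print(WordTokenizer("Character").tokenize("自然言語処理"))
# => [自, 然, 言, 語, 処, 理]

print(WordTokenizer("Whitespace").tokenize("natural language processing"))
# => [natural, language, processing]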

Quick Start with Docker

Simply run the following on your computer:

docker run --rm -p 8000:8000 -t himkt/konoha  # from DockerHub

Or you can build the image on your machine:

git clone https://github.com/himkt/konoha  # download konoha
cd konoha && docker-compose up --build  # build and launch container

Tokenization is performed by posting a JSON object to localhost:8000/api/v1/tokenize. You can also batch-tokenize by passing texts: ["１つ目の入力", "２つ目の入力"] to localhost:8000/api/v1/batch_tokenize.

(API documentation is available at localhost:8000/redoc; you can view it in your web browser.)

Send a request using curl from your terminal. Note that the endpoint path changed in v4.6.4; please check the release note (https://github.com/himkt/konoha/releases/tag/v4.6.4).

$ curl localhost:8000/api/v1/tokenize -X POST -H "Content-Type: application/json" \
    -d '{"tokenizer": "mecab", "text": "ใ“ใ‚Œใฏใƒšใƒณใงใ™"}'

{
  "tokens": [
    [
      {
        "surface": "ใ“ใ‚Œ",
        "part_of_speech": "ๅ่ฉž"
      },
      {
        "surface": "ใฏ",
        "part_of_speech": "ๅŠฉ่ฉž"
      },
      {
        "surface": "ใƒšใƒณ",
        "part_of_speech": "ๅ่ฉž"
      },
      {
        "surface": "ใงใ™",
        "part_of_speech": "ๅŠฉๅ‹•่ฉž"
      }
    ]
  ]
}
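
For batch tokenization, the request follows the same pattern with a texts field. Below is a hedged client sketch using only the Python standard library; it assumes the container from the Quick Start is listening on localhost:8000 and that the batch endpoint accepts a "tokenizer" field like the single-text endpoint.

import json
import urllib.request

# Client sketch for the batch endpoint described above.
# Assumes the Docker container is running on localhost:8000 and that the
# "tokenizer" field is accepted, as it is for /api/v1/tokenize.
payload = json.dumps(
    {"tokenizer": "mecab", "texts": ["１つ目の入力", "２つ目の入力"]}
).encode("utf-8")

request = urllib.request.Request(
    "http://localhost:8000/api/v1/batch_tokenize",
    data=payload,
    headers={"Content-Type": "application/json"},
    method="POST",
)

with urllib.request.urlopen(request) as response:
    print(json.loads(response.read()))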

Installation

I recommend installing konoha with pip install 'konoha[all]'.

  • Install konoha with a specific tokenizer: pip install 'konoha[(tokenizer_name)]'.
  • Install konoha with a specific tokenizer and remote file support: pip install 'konoha[(tokenizer_name),remote]'

If you want to use a specific tokenizer, install konoha with the corresponding extra (e.g. konoha[mecab], konoha[sudachi], etc.) or install the tokenizer packages individually.

Example

Word level tokenization

from konoha import WordTokenizer

sentence = '่‡ช็„ถ่จ€่ชžๅ‡ฆ็†ใ‚’ๅ‹‰ๅผทใ—ใฆใ„ใพใ™'

tokenizer = WordTokenizer('MeCab')
print(tokenizer.tokenize(sentence))
# => [่‡ช็„ถ, ่จ€่ชž, ๅ‡ฆ็†, ใ‚’, ๅ‹‰ๅผท, ใ—, ใฆ, ใ„, ใพใ™]

tokenizer = WordTokenizer('Sentencepiece', model_path="data/model.spm")
print(tokenizer.tokenize(sentence))
# => [โ–, ่‡ช็„ถ, ่จ€่ชž, ๅ‡ฆ็†, ใ‚’, ๅ‹‰ๅผท, ใ—, ใฆใ„ใพใ™]

For more details, please see the example/ directory.

Remote files

Konoha supports dictionaries and models on cloud storage (currently Amazon S3). This requires installing konoha with the remote option; see Installation.

from konoha import WordTokenizer

sentence = '自然言語処理を勉強しています'

# download a user dictionary from S3
word_tokenizer = WordTokenizer("mecab", user_dictionary_path="s3://abc/xxx.dic")
print(word_tokenizer.tokenize(sentence))

# download system dictionary from S3
word_tokenizer = WordTokenizer("mecab", system_dictionary_path="s3://abc/yyy")
print(word_tokenizer.tokenize(sentence))

# download model file from S3
word_tokenizer = WordTokenizer("sentencepiece", model_path="s3://abc/zzz.model")
print(word_tokenizer.tokenize(sentence))

Sentence level tokenization

from konoha import SentenceTokenizer

sentence = "็งใฏ็Œซใ ใ€‚ๅๅ‰ใชใ‚“ใฆใ‚‚ใฎใฏใชใ„ใ€‚ใ ใŒ๏ผŒใ€Œใ‹ใ‚ใ„ใ„ใ€‚ใใ‚Œใงๅๅˆ†ใ ใ‚ใ†ใ€ใ€‚"

tokenizer = SentenceTokenizer()
print(tokenizer.tokenize(sentence))
# => ['็งใฏ็Œซใ ใ€‚', 'ๅๅ‰ใชใ‚“ใฆใ‚‚ใฎใฏใชใ„ใ€‚', 'ใ ใŒ๏ผŒใ€Œใ‹ใ‚ใ„ใ„ใ€‚ใใ‚Œใงๅๅˆ†ใ ใ‚ใ†ใ€ใ€‚']

You can change the symbols used for sentence splitting and bracket expressions.

  1. sentence splitter
sentence = "็งใฏ็Œซใ ใ€‚ๅๅ‰ใชใ‚“ใฆใ‚‚ใฎใฏใชใ„๏ผŽใ ใŒ๏ผŒใ€Œใ‹ใ‚ใ„ใ„ใ€‚ใใ‚Œใงๅๅˆ†ใ ใ‚ใ†ใ€ใ€‚"

tokenizer = SentenceTokenizer(period="๏ผŽ")
print(tokenizer.tokenize(sentence))
# => ['็งใฏ็Œซใ ใ€‚ๅๅ‰ใชใ‚“ใฆใ‚‚ใฎใฏใชใ„๏ผŽ', 'ใ ใŒ๏ผŒใ€Œใ‹ใ‚ใ„ใ„ใ€‚ใใ‚Œใงๅๅˆ†ใ ใ‚ใ†ใ€ใ€‚']
  2. bracket expression
sentence = "็งใฏ็Œซใ ใ€‚ๅๅ‰ใชใ‚“ใฆใ‚‚ใฎใฏใชใ„ใ€‚ใ ใŒ๏ผŒใ€Žใ‹ใ‚ใ„ใ„ใ€‚ใใ‚Œใงๅๅˆ†ใ ใ‚ใ†ใ€ใ€‚"

tokenizer = SentenceTokenizer(
    patterns=SentenceTokenizer.PATTERNS + [re.compile(r"ใ€Ž.*?ใ€")],
)
print(tokenizer.tokenize(sentence))
# => ['็งใฏ็Œซใ ใ€‚', 'ๅๅ‰ใชใ‚“ใฆใ‚‚ใฎใฏใชใ„ใ€‚', 'ใ ใŒ๏ผŒใ€Žใ‹ใ‚ใ„ใ„ใ€‚ใใ‚Œใงๅๅˆ†ใ ใ‚ใ†ใ€ใ€‚']

Test

python -m pytest

Article

Acknowledgement

The Sentencepiece model used in the tests was provided by @yoheikikuta. Thanks!