๐ŸŒฟ Konoha: Simple wrapper of Japanese Tokenizers


Konoha is a Python library that provides an easy-to-use, unified interface to various Japanese tokenizers, letting you switch between tokenizers with small code changes and speed up your pre-processing.

Supported tokenizers

In addition to wrapping morphological analyzers such as MeCab, Sudachi, and Sentencepiece (see the Installation and Example sections below), konoha provides rule-based tokenizers (whitespace, character) and a rule-based sentence splitter.
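
As a minimal sketch (not taken from this README; the tokenizer names "Character" and "Whitespace" are assumed to be accepted by WordTokenizer in the same way as "MeCab" in the Example section), the rule-based tokenizers are used through the same interface:

from konoha import WordTokenizer

# Rule-based tokenizers share the WordTokenizer interface; the names below
# are assumptions based on the list above ("whitespace", "character").
print(WordTokenizer("Character").tokenize("自然言語処理"))
# => [自, 然, 言, 語, 処, 理]

print(WordTokenizer("Whitespace").tokenize("natural language processing"))
# => [natural, language, processing]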

Quick Start with Docker

Simply run the following on your computer:

docker run --rm -p 8000:8000 -t himkt/konoha  # from DockerHub

Or you can build the image on your machine:

git clone https://github.com/himkt/konoha  # download konoha
cd konoha && docker-compose up --build  # build and launch container

Tokenization is performed by posting a JSON object to localhost:8000/api/v1/tokenize. You can also batch-tokenize by passing texts: ["１つ目の入力", "２つ目の入力"] to localhost:8000/api/v1/batch_tokenize.

(API documentation is available at localhost:8000/redoc; you can view it in your web browser.)

Send a request using curl from your terminal. Note that the endpoint path changed in v4.6.4; please check the release note (https://github.com/himkt/konoha/releases/tag/v4.6.4).

$ curl localhost:8000/api/v1/tokenize -X POST -H "Content-Type: application/json" \
    -d '{"tokenizer": "mecab", "text": "ใ“ใ‚Œใฏใƒšใƒณใงใ™"}'

{
  "tokens": [
    [
      {
        "surface": "ใ“ใ‚Œ",
        "part_of_speech": "ๅ่ฉž"
      },
      {
        "surface": "ใฏ",
        "part_of_speech": "ๅŠฉ่ฉž"
      },
      {
        "surface": "ใƒšใƒณ",
        "part_of_speech": "ๅ่ฉž"
      },
      {
        "surface": "ใงใ™",
        "part_of_speech": "ๅŠฉๅ‹•่ฉž"
      }
    ]
  ]
}
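
For batch tokenization, the request follows the same pattern with a texts field. Below is a hedged client sketch using only the Python standard library; it assumes the container from the Quick Start is listening on localhost:8000 and that the batch endpoint accepts a "tokenizer" field like the single-text endpoint.

import json
import urllib.request

# Client sketch for the batch endpoint described above.
# Assumes the Docker container is running on localhost:8000 and that the
# "tokenizer" field is accepted, as it is for /api/v1/tokenize.
payload = json.dumps(
    {"tokenizer": "mecab", "texts": ["１つ目の入力", "２つ目の入力"]}
).encode("utf-8")

request = urllib.request.Request(
    "http://localhost:8000/api/v1/batch_tokenize",
    data=payload,
    headers={"Content-Type": "application/json"},
    method="POST",
)

with urllib.request.urlopen(request) as response:
    print(json.loads(response.read()))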

Installation

I recommend installing konoha with pip install 'konoha[all]'.

  • Install konoha with a specific tokenizer: pip install 'konoha[(tokenizer_name)]'.
  • Install konoha with a specific tokenizer and remote file support: pip install 'konoha[(tokenizer_name),remote]'

If you want to use a specific tokenizer, install konoha with the corresponding extra (e.g. konoha[mecab], konoha[sudachi], etc.) or install the tokenizer packages individually.

Example

Word level tokenization

from konoha import WordTokenizer

sentence = '่‡ช็„ถ่จ€่ชžๅ‡ฆ็†ใ‚’ๅ‹‰ๅผทใ—ใฆใ„ใพใ™'

tokenizer = WordTokenizer('MeCab')
print(tokenizer.tokenize(sentence))
# => [่‡ช็„ถ, ่จ€่ชž, ๅ‡ฆ็†, ใ‚’, ๅ‹‰ๅผท, ใ—, ใฆ, ใ„, ใพใ™]

tokenizer = WordTokenizer('Sentencepiece', model_path="data/model.spm")
print(tokenizer.tokenize(sentence))
# => [โ–, ่‡ช็„ถ, ่จ€่ชž, ๅ‡ฆ็†, ใ‚’, ๅ‹‰ๅผท, ใ—, ใฆใ„ใพใ™]

For more details, please see the example/ directory.

Remote files

Konoha supports dictionaries and models on cloud storage (currently Amazon S3). This requires installing konoha with the remote option; see Installation.

from konoha import WordTokenizer

sentence = '自然言語処理を勉強しています'

# download a user dictionary from S3
word_tokenizer = WordTokenizer("mecab", user_dictionary_path="s3://abc/xxx.dic")
print(word_tokenizer.tokenize(sentence))

# download system dictionary from S3
word_tokenizer = WordTokenizer("mecab", system_dictionary_path="s3://abc/yyy")
print(word_tokenizer.tokenize(sentence))

# download model file from S3
word_tokenizer = WordTokenizer("sentencepiece", model_path="s3://abc/zzz.model")
print(word_tokenizer.tokenize(sentence))

Sentence level tokenization

from konoha import SentenceTokenizer

sentence = "็งใฏ็Œซใ ใ€‚ๅๅ‰ใชใ‚“ใฆใ‚‚ใฎใฏใชใ„ใ€‚ใ ใŒ๏ผŒใ€Œใ‹ใ‚ใ„ใ„ใ€‚ใใ‚Œใงๅๅˆ†ใ ใ‚ใ†ใ€ใ€‚"

tokenizer = SentenceTokenizer()
print(tokenizer.tokenize(sentence))
# => ['็งใฏ็Œซใ ใ€‚', 'ๅๅ‰ใชใ‚“ใฆใ‚‚ใฎใฏใชใ„ใ€‚', 'ใ ใŒ๏ผŒใ€Œใ‹ใ‚ใ„ใ„ใ€‚ใใ‚Œใงๅๅˆ†ใ ใ‚ใ†ใ€ใ€‚']

You can change the symbols used for sentence splitting and bracket expressions.

  1. sentence splitter
sentence = "็งใฏ็Œซใ ใ€‚ๅๅ‰ใชใ‚“ใฆใ‚‚ใฎใฏใชใ„๏ผŽใ ใŒ๏ผŒใ€Œใ‹ใ‚ใ„ใ„ใ€‚ใใ‚Œใงๅๅˆ†ใ ใ‚ใ†ใ€ใ€‚"

tokenizer = SentenceTokenizer(period="๏ผŽ")
print(tokenizer.tokenize(sentence))
# => ['็งใฏ็Œซใ ใ€‚ๅๅ‰ใชใ‚“ใฆใ‚‚ใฎใฏใชใ„๏ผŽ', 'ใ ใŒ๏ผŒใ€Œใ‹ใ‚ใ„ใ„ใ€‚ใใ‚Œใงๅๅˆ†ใ ใ‚ใ†ใ€ใ€‚']
  2. bracket expression
sentence = "็งใฏ็Œซใ ใ€‚ๅๅ‰ใชใ‚“ใฆใ‚‚ใฎใฏใชใ„ใ€‚ใ ใŒ๏ผŒใ€Žใ‹ใ‚ใ„ใ„ใ€‚ใใ‚Œใงๅๅˆ†ใ ใ‚ใ†ใ€ใ€‚"

tokenizer = SentenceTokenizer(
    patterns=SentenceTokenizer.PATTERNS + [re.compile(r"ใ€Ž.*?ใ€")],
)
print(tokenizer.tokenize(sentence))
# => ['็งใฏ็Œซใ ใ€‚', 'ๅๅ‰ใชใ‚“ใฆใ‚‚ใฎใฏใชใ„ใ€‚', 'ใ ใŒ๏ผŒใ€Žใ‹ใ‚ใ„ใ„ใ€‚ใใ‚Œใงๅๅˆ†ใ ใ‚ใ†ใ€ใ€‚']

Test

python -m pytest

Article

Acknowledgement

The Sentencepiece model used in the tests was provided by @yoheikikuta. Thanks!