kaiseki (解析) is a Japanese tokenizer and morphological analyzer using mecab-ipadic, inspired by this article.

kaiseki supports both morpheme tokenization and word tokenization (inflections included). It also provides additional information from the MeCab dictionary, such as part of speech and conjugation form.
```rust
use kaiseki::{Tokenizer, error::Error};

fn main() -> Result<(), Error> {
    let tokenizer = Tokenizer::new()?;

    let morphemes = tokenizer.tokenize("東京都に住んでいる");
    let morphemes: Vec<_> = morphemes.iter().map(|m| &m.text).collect();
    println!("{:?}", morphemes); // ["東京", "都", "に", "住ん", "で", "いる"]

    let words = tokenizer.tokenize_word("東京都に住んでいる");
    let words: Vec<_> = words.iter().map(|w| &w.text).collect();
    println!("{:?}", words); // ["東京", "都", "に", "住んでいる"]

    Ok(())
}
```
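To illustrate the difference between the two tokenization modes, here is a minimal, self-contained sketch of the idea behind word tokenization: contiguous inflection morphemes (auxiliaries, conjunctive particles) are merged into the preceding word. This is not kaiseki's actual implementation; the `Morpheme` struct and `is_inflection` flag are invented for the illustration.

```rust
// Conceptual sketch only: word tokenization can be seen as merging
// inflection morphemes into the word that precedes them.
#[derive(Debug)]
struct Morpheme {
    text: String,
    is_inflection: bool, // true for auxiliaries / conjunctive particles
}

fn merge_words(morphemes: &[Morpheme]) -> Vec<String> {
    let mut words: Vec<String> = Vec::new();
    for m in morphemes {
        if m.is_inflection {
            // Attach the inflection to the previous word, if any.
            if let Some(last) = words.last_mut() {
                last.push_str(&m.text);
                continue;
            }
        }
        words.push(m.text.clone());
    }
    words
}

fn main() {
    // Morpheme output for 東京都に住んでいる, with inflections flagged.
    let morphemes = vec![
        Morpheme { text: "東京".into(), is_inflection: false },
        Morpheme { text: "都".into(), is_inflection: false },
        Morpheme { text: "に".into(), is_inflection: false },
        Morpheme { text: "住ん".into(), is_inflection: false },
        Morpheme { text: "で".into(), is_inflection: true },
        Morpheme { text: "いる".into(), is_inflection: true },
    ];
    println!("{:?}", merge_words(&morphemes)); // ["東京", "都", "に", "住んでいる"]
}
```

In the real tokenizer, the decision of what counts as an inflection comes from the MeCab dictionary's part-of-speech data rather than a hand-set flag.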
```sh
cargo test
```
- The MeCab Project for providing the dictionary and data used for tokenizing.
- kotori and kuromoji-rs for some reference.
MIT License.