A Ruby gem / Rails helper for dealing with Japanese line-breaking logic. It is basically a port of mikan.js, which implements a regular expression based algorithm to segment text into semantic chunks. No machine learning needed 🤖
```ruby
# split this sentence
Zabon.split('この文を分割する')
# => ["この", "文を", "分割する"]
```
Configuration controls the tag that the results are wrapped in; it makes heavy use of Rails tag helpers. For example, put this in an initializer in your Rails app:
```ruby
Zabon.configure do |config|
  config.tag = :div # default: :span
  config.tag_options = { class: 'zabon_trara', style: 'font-size: 5em' } # default: { class: 'zabon', style: 'display: inline-block' }
  config.strip_tags = false # default: true
end
```
```ruby
# Add the following in an initializer to override Rails' `t` helper with Zabon's
# helper method, which applies Japanese line-breaking logic, wraps the results
# in HTML tags and joins them back together.
require 'zabon'

module ActionView
  module Helpers
    module TranslationHelper
      def t(key, **options)
        zabon_translate(key, **options)
      end
    end
  end
end
```
Just enough Japanese to understand the algorithm :)
The Japanese writing system uses four different components:
- Hiragana (ひらがな), a syllabary used for Japanese words not covered by kanji and mostly for grammatical inflections
- Katakana (カタカナ), a syllabary used for transcribing foreign-language words into Japanese, for emphasis, for onomatopoeia, for scientific terms, and often for Japanese company names
- Kanji (漢字), a set of Chinese characters directly incorporated into written Japanese, often with a Japanese pronunciation
- Romaji (ローマ字), the use of Latin script in the Japanese language
Joshi (助詞), Japanese particles written in hiragana, are suffixes or short words that follow a modified noun, verb, adjective, or sentence. They can indicate a wide range of grammatical meanings and functions:
- case markers
- parallel markers
- sentence ending particles
- interjectory particles
- adverbial particles
- binding particles
- conjunctive particles
- phrasal particles
Certain characters in Japanese should not come at the end of a line, certain characters should not come at the start of a line, and some characters should never be split across two lines. These rules are called Kinsoku Shori (禁則処理).
Simplified:
| Class | Can't begin a line | Can't finish a line |
|---|---|---|
| small kana | ぁぃぅぇぉっ... | |
| parentheses | )〉》】... | (〈《【... |
| quotations | 」』”... | 「『“... |
| punctuation | 、。・!?... | |
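As a rough illustration only (an assumed representation, not necessarily how Zabon or mikan.js store these rules), the simplified classes above could be expressed as character sets:

```ruby
# Hypothetical character sets for the simplified kinsoku classes above;
# Zabon's internal lists may differ.
KINSOKU_NO_LINE_START = %w[ぁ ぃ ぅ ぇ ぉ っ ) 〉 》 】 」 』 ” 、 。 ・ ! ?]
KINSOKU_NO_LINE_END   = %w[( 〈 《 【 「 『 “]

# A break between two characters is only allowed if the right one may begin a
# line and the left one may end a line.
def breakable_between?(left_char, right_char)
  !KINSOKU_NO_LINE_START.include?(right_char) &&
    !KINSOKU_NO_LINE_END.include?(left_char)
end
```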
Written Japanese uses no spaces and little punctuation to delimit words. Readers instead rely on grammatical cues (e.g. particles and verb endings), the relative frequency of character combinations, and semantic context to determine which words have been written. This is a non-trivial problem that is often solved by applying machine-learning algorithms. Without a careful approach, line breaks can occur at arbitrary positions, usually in the middle of a word. This is an issue with typography on the web and degrades readability.
I made a couple of assumptions when choosing the name:
- 🍊 The original algorithm name Mikan might be a transcription of 蜜柑, a Japanese citrus fruit (mandarin, satsuma)
- There already is a gem called mikan, and I didn't want to go for mikan_ruby or similar b/c of autoloading
- 🍇 My guess is the original author chose this name b/c he was searching for something simpler than Google's Budou (葡萄)
- 🔪 Both fruits have in common that they can easily be split apart into segments
- So I was searching for another fruit that can easily be split apart, and what splits apart better than a pomelo (文旦, ぶんたん)? Hence Zabon (derived from Portuguese: zamboa)
Who knows if that's how it was 🤷🏻♂️😂.
This algorithm does NOT find the most minimal segmentation of unbreakable text segments and will probably have problems if a text is written solely in one alphabet. It also does not support Furigana (yet). It does basic text segmentation and stitches the pieces back together into segments that can be made unbreakable. We achieve the unbreakability by wrapping each segment in a tag with certain CSS rules.
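As a rough illustration (a sketch only, not Zabon's actual implementation, which uses the Rails tag helpers configured above), making the segments unbreakable conceptually looks like this:

```ruby
# Sketch of the wrapping idea; Zabon itself builds the tags with the Rails tag
# helpers and the configured tag/tag_options shown above.
segments = Zabon.split('この文を分割する') # => ["この", "文を", "分割する"]
html = segments.map do |segment|
  %(<span class="zabon" style="display: inline-block">#{segment}</span>)
end.join
# Each span is an inline-block, so the browser may break a line between spans
# but never inside one.
```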
1. Split the text along the different alphabets used: split it into parts that are written in kanji, hiragana, katakana, or Latin script (incl. double-width characters). The assumption here is that parts written in the same script belong together (a rough sketch of this and the next step follows after this list).
2. Then split each element further by splitting off particles or sequences that might be used as particles. The original author of the algorithm identified the following list (でなければ, について, かしら, くらい, けれど, なのか, ばかり, ながら, ことよ, こそ, こと, さえ, しか, した, たり, だけ, だに, だの, つつ, ても, てよ, でも, とも, から, など, なり, ので, のに, ほど, まで, もの, やら, より, って, で, と, な, に, ね, の, も, は, ば, へ, や, わ, を, か, が, さ, し, ぞ, て). To me that looks about right, but maybe some are missing.
3. Split further along brackets and quotation marks ([, 〈, 《, 「, 『, 「, 【, 〔, 〚, 〖, 〘, ❮, ❬, ❪, ❨, (, <, {, ❲, ❰, {, ❴, plus the matching closing brackets and quotation marks).
4. Now we have a list of minimal segments and try to stitch them back together into a result set that fulfils the Japanese line-breaking rules. We look at tuples from left to right, considering the current segment and the previous segment.
5. If the current segment is an opening bracket or quotation mark, we look at the next segment; we have a definitive start of an unbreakable segment.
6. If the current segment is a closing bracket or quotation mark, we append it to the last entry of the result set and don't look back anymore; we've reached the end of a segment and start a new one with the next iteration.
7. If the previous segment is an opening bracket, we stitch it together with the current segment to form a new segment. In the next iteration we don't need to look at the previous segment anymore and continue.
8. If the current segment is a particle or a punctuation mark and we are not looking back (see step 7), we append the current segment to the last entry of the result set.
9. If the current segment is a particle or a punctuation mark, or if the previous segment is not a bracket, quotation mark, punctuation mark, or conjunctive particle (と, の, に) and the current segment is in hiragana, we append it to the last entry of the result set.
10. If none of the stitching conditions above match, we can safely add the current segment to the result set.
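Here is a hedged sketch of the first two splitting steps. The regexes and the tiny particle subset are assumptions based on the description above, not Zabon's actual source:

```ruby
# Rough sketch of steps 1 and 2: split into same-script runs, then split
# before particles. Punctuation and bracket handling are omitted here.
SCRIPT_RUNS = Regexp.union(
  /[\p{Hiragana}ー]+/,           # hiragana runs
  /[\p{Katakana}ー]+/,           # katakana runs
  /\p{Han}+/,                    # kanji runs
  /[A-Za-z0-9Ａ-Ｚａ-ｚ０-９]+/  # Latin, incl. double-width characters
)

# Only a handful of particles from the list above, for brevity.
PARTICLES = /(?=について|から|まで|を|の|に|は|が)/

runs = 'この文を分割する'.scan(SCRIPT_RUNS)
# => ["この", "文", "を", "分割", "する"]

minimal = runs.flat_map { |run| run.split(PARTICLES) }
# => ["こ", "の", "文", "を", "分割", "する"]

# The stitching rules above then merge particles and trailing hiragana back
# onto the previous segment, which yields ["この", "文を", "分割する"],
# the same result as the Zabon.split example at the top of this README.
```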
Budou is a Python library which uses word segmenters to analyze input sentences. It can concatenate words into meaningful chunks utilizing part-of-speech tagging and other syntactic information. Processed chunks are wrapped in a SPAN tag. Depending on the text segmentation algorithm used, it also has support for Chinese and Korean. Since this library is written in Python, it cannot simply be used from Ruby, PHP, or Node.js.
You can choose different segmenter backends depending on the needs of your environment. Currently, the segmenters below are supported.
- Google Cloud Natural Language API: external API calls, can be costly
- MeCab: Japanese POS tagger & morphological analyzer with lots of language bindings, e.g. also used in Google Japanese Input and Japanese Input on Mac OS X
- TinySegmenter: an extremely compact word-segmentation algorithm in JavaScript that produces MeCab-compatible word separation without depending on external APIs or dictionaries; it classifies the input directly
TinySegmenter is an extremely compact word-segmentation algorithm in JavaScript which produces MeCab-compatible word separation without depending on external APIs. It classifies the input by using entities like characters, N-grams, hiragana, and katakana (Japanese phonetic lettering systems / syllabaries) and their combinations as features to determine whether a character is preceded by a word boundary. A [Naive Bayes](https://towardsdatascience.com/naive-bayes-explained-9d2b96f4a9c0) model was trained using the RWCP corpus, and to make that model even more compact, boosting was used with L1-norm regularization. Basically, it compresses the model and gets rid of redundant features as much as possible.