Taibun

Taiwanese Hokkien Transliterator and Tokeniser

It provides methods to customise transliteration and to retrieve information about Taiwanese Hokkien pronunciation.
It also includes a word tokeniser for Taiwanese Hokkien.

Report Bug • PyPI

Table of Contents

Install
Usage
- Converter
  - System
  - Dialect
  - Format
  - Delimiter
  - Sandhi
  - Punctuation
  - Convert non-CJK
- Tokeniser
- Other Functions
Example
Data
Licence

Install

Taibun can be installed from PyPI:

$ pip install taibun

Usage

Converter

The Converter class transliterates Chinese characters into the chosen transliteration system, using the parameters specified by the developer. It works with both Traditional and Simplified characters.

# constructor
c = Converter(system, dialect, format, delimiter, sandhi, punctuation, convert_non_cjk)

# transliterate Chinese characters
c.get(input)

System

system String - system of transliteration.

Tailo (default) - Tâi-uân Lô-má-jī Phing-im Hong-àn
POJ - Pe̍h-ōe-jī
Zhuyin - Taiwanese Phonetic Symbols
TLPA - Taiwanese Language Phonetic Alphabet
Pingyim - Bbánlám Uē Pìngyīm Hōng'àn
Tongiong - Daī-ghî Tōng-iōng Pīng-im
IPA - International Phonetic Alphabet

text	Tailo	POJ	Zhuyin	TLPA	Pingyim	Tongiong	IPA
台灣	Tâi-uân	Tâi-oân	ㄉㄞˊ ㄨㄢˊ	Tai5 uan5	Dáiwán	Tāi-uǎn	Tai²⁵ uan²⁵

Dialect

dialect String - preferred pronunciation.

south (default) - Zhangzhou-leaning pronunciation
north - Quanzhou-leaning pronunciation

text	south	north
五月節	Gōo-gue̍h-tseh	Gōo-ge̍h-tsueh

Format

format String - format in which tones will be represented in the converted sentence.

mark (default) - uses diacritics to mark the tone of each syllable. Not available for TLPA.
number - adds a number representing the tone to the end of each syllable
strip - removes all tone marking

text	mark	number	strip
台灣	Tâi-uân	Tai5-uan5	Tai-uan
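To make the relationship between the mark and number formats concrete, here is a minimal sketch that rewrites Tailo tone diacritics as tone numbers. It is an illustration only, not Taibun's implementation: the helper names `syllable_to_number` and `mark_to_number` are hypothetical, only Tailo tones 1 to 5, 7 and 8 are handled, and an unmarked syllable is assumed to carry tone 4 when it ends in p, t, k or h and tone 1 otherwise.

```python
import re
import unicodedata

# Tailo tone diacritics (as combining characters) and their tone numbers.
TONE_MARKS = {
    "\u0301": "2",  # acute accent
    "\u0300": "3",  # grave accent
    "\u0302": "5",  # circumflex
    "\u0304": "7",  # macron
    "\u030d": "8",  # vertical line above
}


def syllable_to_number(syllable):
    """Hypothetical helper: move the tone of one syllable to a trailing digit."""
    tone = None
    letters = []
    for ch in unicodedata.normalize("NFD", syllable):
        if ch in TONE_MARKS:
            tone = TONE_MARKS[ch]
        else:
            letters.append(ch)
    base = "".join(letters)
    if tone is None:
        # Unmarked syllables: checked endings are tone 4, everything else tone 1.
        tone = "4" if base[-1:].lower() in ("p", "t", "k", "h") else "1"
    return base + tone


def mark_to_number(text):
    """Hypothetical helper: convert every syllable in a Tailo string."""
    decomposed = unicodedata.normalize("NFD", text)
    # After NFD, a Tailo syllable is a run of ASCII letters and combining marks.
    return re.sub(
        r"[A-Za-z\u0300-\u036f]+",
        lambda m: syllable_to_number(m.group(0)),
        decomposed,
    )


print(mark_to_number("T\u00e2i-u\u00e2n"))  # Tai5-uan5
```

Dropping the trailing digit from the result would give the strip format.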

Delimiter

delimiter String - sets the delimiter character that will be placed in between syllables of a word.

Default value depends on the chosen system:

'-' - for Tailo, POJ, Tongiong
'' - for Pingyim
' ' - for Zhuyin, TLPA, IPA

text	'-'	''	' '
台灣	Tâi-uân	Tâiuân	Tâi uân
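The system-dependent default can be pictured as a simple lookup. The sketch below is not Taibun's code; the names `DEFAULT_DELIMITER` and `join_syllables` are hypothetical, and the table only restates the defaults listed above.

```python
# Default delimiter per system, as listed in this section.
DEFAULT_DELIMITER = {
    "Tailo": "-",
    "POJ": "-",
    "Tongiong": "-",
    "Pingyim": "",
    "Zhuyin": " ",
    "TLPA": " ",
    "IPA": " ",
}


def join_syllables(syllables, system="Tailo", delimiter=None):
    """Hypothetical helper: join the syllables of one word.

    An explicit delimiter wins; otherwise the system default is used.
    """
    if delimiter is None:
        delimiter = DEFAULT_DELIMITER[system]
    return delimiter.join(syllables)


print(join_syllables(["Tâi", "uân"]))                  # Tâi-uân
print(join_syllables(["Tâi", "uân"], delimiter=""))    # Tâiuân
print(join_syllables(["Tai5", "uan5"], system="TLPA")) # Tai5 uan5
```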

Sandhi

sandhi String - applies the sandhi rules of Taiwanese Hokkien to syllables of a single word.

Since it's difficult to encode all sandhi rules, Taibun provides multiple modes for sandhi conversion to allow for customised sandhi handling.

none - doesn't perform any tone sandhi
auto - closest approximation to full correct tone sandhi of Taiwanese, with proper sandhi of pronouns, suffixes, and words with 仔
exc_last - changes tone for every syllable except for the last one
incl_last - changes tone for every syllable including the last one

Default value depends on the chosen system:

auto - for Tongiong
none - for Tailo, POJ, Zhuyin, TLPA, Pingyim, IPA

text	none	auto	exc_last	incl_last
這是台灣囡仔	Tse sī Tâi-uân gín-á	Tse sì Tāi-uān gin-á	Tsē sì Tāi-uān gin-á	Tsē sì Tāi-uān gin-a

Sandhi rules also change depending on the dialect chosen.

text	no sandhi	south	north
台灣	Tâi-uân	Tāi-uân	Tài-uân
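The exc_last and incl_last modes, and the dialect difference shown above, can be sketched on number-format syllables. This is a simplified illustration rather than Taibun's implementation: the function names are hypothetical, the tone table is the commonly described set of Taiwanese sandhi rules, and the auto mode (with its special handling of pronouns, suffixes and 仔) is not modelled.

```python
import re

# Sandhi for non-checked tones; tone 5 is handled separately per dialect.
BASE_RULES = {"1": "7", "2": "1", "3": "2", "7": "3"}
TONE_5 = {"south": "7", "north": "3"}


def sandhi_syllable(syllable, dialect="south"):
    """Hypothetical helper: apply sandhi to one syllable such as 'Tai5'."""
    match = re.fullmatch(r"(.*?)(\d)", syllable)
    if match is None:
        return syllable  # no tone number, leave untouched
    body, tone = match.groups()
    if tone in ("4", "8"):
        # Checked tones depend on the final consonant.
        if body[-1:].lower() == "h":
            new_tone = {"4": "2", "8": "3"}[tone]
        else:
            new_tone = {"4": "8", "8": "4"}[tone]
    elif tone == "5":
        new_tone = TONE_5[dialect]
    else:
        new_tone = BASE_RULES.get(tone, tone)
    return body + new_tone


def apply_sandhi(syllables, mode="exc_last", dialect="south"):
    """Hypothetical helper: apply sandhi to a sequence of syllables."""
    if mode == "none":
        return list(syllables)
    if mode == "incl_last":
        cutoff = len(syllables)
    elif mode == "exc_last":
        cutoff = len(syllables) - 1
    else:
        raise ValueError(f"unsupported mode: {mode}")
    return [
        sandhi_syllable(s, dialect) if i < cutoff else s
        for i, s in enumerate(syllables)
    ]


sentence = ["Tse1", "si7", "Tai5", "uan5", "gin2", "a2"]
print(apply_sandhi(sentence))
# ['Tse7', 'si3', 'Tai7', 'uan7', 'gin1', 'a2']
print(apply_sandhi(["Tai5", "uan5"], dialect="north"))
# ['Tai3', 'uan5']
```

The first result corresponds to the exc_last column of the table above (Tsē sì Tāi-uān gin-á) written with tone numbers.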

Punctuation

punctuation String - sets how punctuation and sentence capitalisation are handled.

format (default) - converts Chinese-style punctuation to Latin-style punctuation and capitalises words at the beginning of each sentence.
none - preserves Chinese-style punctuation and doesn't capitalise words at the beginning of new sentences.

text	format	none
這是臺南，簡稱「南」（白話字：Tâi-lâm；注音符號：ㄊㄞˊ ㄋㄢˊ，國語：Táinán）。	Tse sī Tâi-lâm, kán-tshing "lâm" (Pe̍h-uē-jī: Tâi-lâm; tsù-im hû-hō: ㄊㄞˊ ㄋㄢˊ, kok-gí: Táinán).	tse sī Tâi-lâm，kán-tshing「lâm」（Pe̍h-uē-jī：Tâi-lâm；tsù-im hû-hō：ㄊㄞˊ ㄋㄢˊ，kok-gí：Táinán）。
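As a rough picture of what the format mode does, the sketch below maps a few Chinese-style punctuation marks to Latin-style ones and capitalises the first word of each sentence. It is an assumption-laden illustration, not Taibun's code: `LATIN_PUNCTUATION` and `format_punctuation` are hypothetical names, and only four marks are covered.

```python
import re

# Only a handful of marks are covered here; a full table would be longer.
LATIN_PUNCTUATION = {"，": ", ", "。": ". ", "！": "! ", "？": "? "}


def format_punctuation(text):
    """Hypothetical helper: Latin-style punctuation plus sentence capitalisation."""
    for chinese, latin in LATIN_PUNCTUATION.items():
        text = text.replace(chinese, latin)
    text = text.strip()
    # Upper-case the first word character of the text and after . ! or ?
    return re.sub(
        r"(^|[.!?]\s+)(\w)",
        lambda m: m.group(1) + m.group(2).upper(),
        text,
    )


print(format_punctuation("thài-khong pîng-iú，lín-hó！"))
# Thài-khong pîng-iú, lín-hó!
```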

Convert non-CJK

convert_non_cjk Boolean - defines whether or not to convert non-Chinese words. Can be used to convert Tailo to another romanisation system.

True - convert non-Chinese character words
False (default) - convert only Chinese character words

text	False	True
我食pháng	ㆣㄨㄚˋ ㄐㄧㄚㆷ˙ pháng	ㆣㄨㄚˋ ㄐㄧㄚㆷ˙ ㄆㄤˋ

Tokeniser

The Tokeniser class performs NLTK wordpunct_tokenize-like tokenisation of a Taiwanese Hokkien sentence.

# constructor
t = Tokeniser()

# tokenise Taiwanese Hokkien sentence
t.tokenise(input)
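For comparison, NLTK's wordpunct_tokenize is built on the regular expression `\w+|[^\w\s]+`. The sketch below applies that pattern directly with the standard library (the function name `wordpunct_like` is hypothetical). Because Chinese text has no spaces between words, the pattern alone keeps each run of characters as a single token, which is why a dedicated tokeniser is useful for splitting such runs into words.

```python
import re


def wordpunct_like(text):
    """Hypothetical helper: split into runs of word characters and runs of punctuation."""
    return re.findall(r"\w+|[^\w\s]+", text)


print(wordpunct_like("太空朋友，恁好！"))
# ['太空朋友', '，', '恁好', '！']
```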

Other Functions

# Convert to Traditional
to_traditional(input)

# Convert to Simplified
to_simplified(input)

# Check if the string is fully composed of Chinese characters
is_cjk(input)
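A check of this kind can be approximated with Unicode code point ranges. The sketch below is not how Taibun necessarily implements is_cjk: the name `looks_cjk` is hypothetical, and it only tests the basic CJK Unified Ideographs block (U+4E00 to U+9FFF), ignoring the extension blocks.

```python
def looks_cjk(text):
    """Hypothetical helper: True if every character is a basic CJK ideograph."""
    return bool(text) and all("\u4e00" <= ch <= "\u9fff" for ch in text)


print(looks_cjk("我食麭"))     # True
print(looks_cjk("我食pháng"))  # False
```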

Example

# Converter
from taibun import Converter

## System
c = Converter() # Tailo system default
c.get('先生講，學生恬恬聽。')
>> Sian-sinn kóng, ha̍k-sing tiām-tiām thiann.

c = Converter(system='Zhuyin')
c.get('先生講，學生恬恬聽。')
>> ㄒㄧㄢ ㄒㆪ ㄍㆲˋ, ㄏㄚㆶ˙ ㄒㄧㄥ ㄉㄧㆰ˫ ㄉㄧㆰ˫ ㄊㄧㆩ.

## Dialect
c = Converter() # south dialect default
c.get("我欲用箸食魚")
>> Guá beh īng tī tsia̍h hî

c = Converter(dialect='north')
c.get("我欲用箸食魚")
>> Guá bueh īng tū tsia̍h hû

## Format
c = Converter() # for Tailo, mark by default
c.get("生日快樂")
>> Senn-ji̍t khuài-lo̍k

c = Converter(format='number')
c.get("生日快樂")
>> Senn1-jit8 khuai3-lok8

c = Converter(format='strip')
c.get("生日快樂")
>> Senn-jit khuai-lok

## Delimiter
c = Converter(delimiter='')
c.get("先生講，學生恬恬聽。")
>> Siansinn kóng, ha̍ksing tiāmtiām thiann.

c = Converter(system='Pingyim', delimiter='-')
c.get("先生講，學生恬恬聽。")
>> Siān-snī gǒng, hág-sīng diâm-diâm tinā.

## Sandhi
c = Converter() # for Tailo, sandhi none by default
c.get("這是台灣囡仔")
>> Tse sī Tâi-uân gín-á

c = Converter(sandhi='auto')
c.get("這是台灣囡仔")
>> Tse sì Tāi-uān gin-á

c = Converter(sandhi='exc_last')
c.get("這是台灣囡仔")
>> Tsē sì Tāi-uān gin-á

c = Converter(sandhi='incl_last')
c.get("這是台灣囡仔")
>> Tsē sì Tāi-uān gin-a

## Punctuation
c = Converter() # format punctuation default
c.get("太空朋友，恁好！恁食飽未？")
>> Thài-khong pîng-iú, lín-hó! Lín tsia̍h-pá buē?

c = Converter(punctuation='none')
c.get("太空朋友，恁好！恁食飽未？")
>> thài-khong pîng-iú，lín-hó！lín tsia̍h-pá buē？

## Convert non-CJK
c = Converter(system='Zhuyin') # convert_non_cjk is False by default
c.get("我食pháng")
>> ㆣㄨㄚˋ ㄐㄧㄚㆷ˙ pháng

c = Converter(system='Zhuyin', convert_non_cjk=True)
c.get("我食pháng")
>> ㆣㄨㄚˋ ㄐㄧㄚㆷ˙ ㄆㄤˋ


# Tokeniser
from taibun import Tokeniser

t = Tokeniser()
t.tokenise("太空朋友，恁好！恁食飽未？")
>> ['太空', '朋友', '，', '恁好', '！', '恁', '食飽', '未', '？']


# Other Functions
from taibun import to_traditional, to_simplified, is_cjk

to_traditional("我听无台湾话")
>> 我聽無台灣話

to_simplified("我聽無臺灣話")
>> 我听无台湾话

is_cjk('我食麭')
>> True

is_cjk('我食pháng')
>> False

Data

Acknowledgements

Samuel Jen (Github · LinkedIn) - Taiwanese and Mandarin translation

Licence

Because Taibun is MIT-licensed, developers can do essentially whatever they want with it, as long as they include the original copyright and licence notice in any copies of the source code. Note that the data used by the package is covered by a different licence.

The data is licensed under CC BY-SA 4.0.
