Chinese language support #260

Open
jghwwnq opened this issue May 18, 2023 · 26 comments
Labels
languages Dictionary or language related issues

Comments

@jghwwnq

jghwwnq commented May 18, 2023

Is there any plan to support Chinese input method?

@sspanak changed the title from "support" to "Chinese language support" on May 19, 2023
@sspanak added the "help wanted" (Extra attention is needed) and "languages" (Dictionary or language related issues) labels on May 19, 2023
@sspanak
Owner

sspanak commented May 25, 2023

I haven't considered this, but I suppose it is doable.

I know there are two ways of typing in Chinese: phonetic and "strokes" (I am not sure about the name of the latter). However, I don't know how to convert the words to digit sequences in either mode, so I need help from a native speaker. For example, in English, "food" is: "3663". I need to know how to convert "食物" to digits. Also, I need a good word list for the predictions.
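For readers who have not used T9 before, the word-to-digits conversion mentioned above can be sketched in a few lines of Python. This is only an illustration that assumes the standard phone keypad layout; `KEYPAD` and `word_to_digits` are made-up names, not TT9 code.

```python
# Standard phone keypad: each digit key covers a group of Latin letters.
KEYPAD = {
    "2": "abc", "3": "def", "4": "ghi", "5": "jkl",
    "6": "mno", "7": "pqrs", "8": "tuv", "9": "wxyz",
}

# Invert the layout into a letter -> digit lookup table.
LETTER_TO_DIGIT = {
    letter: digit for digit, letters in KEYPAD.items() for letter in letters
}


def word_to_digits(word: str) -> str:
    """Return the key sequence for a word written in unaccented Latin letters."""
    return "".join(LETTER_TO_DIGIT[ch] for ch in word.lower())


print(word_to_digits("food"))  # 3663
```

The open question in this comment is what the equivalent of `word` should be for "食物", since the characters themselves cannot be looked up in such a table.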

@Liquid-Aristocracy

Maybe this can help for pinyin: https://github.com/mozillazg/phrase-pinyin-data/blob/master/pinyin.txt

However predictive input could be hard when considering some cases...

@sspanak
Owner

sspanak commented Jun 15, 2023

So, considering these examples again:

食物: shí wù
食物中毒: shí wù zhòng dú

Pressing "744-98" to type "shi-wu" (without diacritics) must yield "食物", and "744-98-94664-38" would become "食物中毒". Am I reading this correctly, @Liquid-Aristocracy?
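A rough Python sketch of that conversion, assuming the tone marks are removed with Unicode decomposition and the remaining letters are mapped through the standard keypad. The function names are hypothetical, and treating "ü" as a plain "u" is an assumption of this sketch, not a decision made in this thread.

```python
import unicodedata

KEYPAD = {
    "2": "abc", "3": "def", "4": "ghi", "5": "jkl",
    "6": "mno", "7": "pqrs", "8": "tuv", "9": "wxyz",
}
LETTER_TO_DIGIT = {
    letter: digit for digit, letters in KEYPAD.items() for letter in letters
}


def strip_tone_marks(syllable: str) -> str:
    """Drop combining marks after NFD decomposition, e.g. "shí" -> "shi"."""
    decomposed = unicodedata.normalize("NFD", syllable)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def pinyin_to_digits(pinyin: str, separator: str = "-") -> str:
    """Convert space-separated Pinyin syllables to keypad digits."""
    syllables = [strip_tone_marks(s).lower() for s in pinyin.split()]
    return separator.join(
        "".join(LETTER_TO_DIGIT[ch] for ch in syllable) for syllable in syllables
    )


print(pinyin_to_digits("shí wù"))           # 744-98
print(pinyin_to_digits("shí wù zhòng dú"))  # 744-98-94664-38
```

Passing `separator=""` gives the unseparated sequence that would actually be typed.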

I also have some more questions:

  1. Is: "中国 (Pinyin)" a good name for the input method?
  2. Looking at the Java Supported Locales, I see "zh_CN" is "Chinese Simplified" and "Chinese in China", while "zh_TW" is "Chinese Traditional" and "Chinese Taiwan" at the same time. Which locale are we supposed to use in TT9?

@Liquid-Aristocracy

Liquid-Aristocracy commented Jun 16, 2023

Yes, separation is often not needed, so you can have 食物=shiwu=74498. There are two problems though:

  1. This list only has phrases; 食=shi=744 and 物=wu=98 are also needed. Just splitting these phrases might not be enough, because there are also characters that aren't in any words. There could be a way to find or generate them.
  2. There's a way to type only the initial part of each pinyin syllable, 食物=shw and 食物中毒=shwzhd, and it's widely used. Is supporting this doable?
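As an illustration of the initial-only style in point 2, here is a small Python sketch. It assumes the abbreviation of a syllable is its initial consonant, with "zh", "ch" and "sh" kept as two letters, and that a syllable starting with a vowel contributes its first letter; the function names are invented for this example.

```python
# Two-letter initials must be checked before falling back to one letter.
TWO_LETTER_INITIALS = ("zh", "ch", "sh")


def syllable_initial(syllable: str) -> str:
    """Return the initial of one toneless Pinyin syllable."""
    for initial in TWO_LETTER_INITIALS:
        if syllable.startswith(initial):
            return initial
    return syllable[:1]


def abbreviate(pinyin: str) -> str:
    """Join the initials of space-separated syllables."""
    return "".join(syllable_initial(s) for s in pinyin.split())


print(abbreviate("shi wu"))           # shw
print(abbreviate("shi wu zhong du"))  # shwzhd
```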

And your two questions:

  1. 中国 (pinyin) means China (pinyin). 中文 (pinyin) or 中文(拼音) could be better; it means Chinese (pinyin). If the method supports zh_CN only, you can use 简体中文(拼音), which means Simplified Chinese (pinyin). With Traditional Chinese you can use 正體中文 or 繁體中文.
  2. The list is in Simplified Chinese. Pinyin is mostly used with Simplified Chinese so it's fine. I referenced Gboard, zh_CN is supported with pinyin, and zh_TW is supported with phonetics (zhuyin) and also pinyin but not a priority.

@sspanak
Owner

sspanak commented Jun 16, 2023

This list only has phrases; 食=shi=744 and 物=wu=98 are also needed. Just splitting these phrases might not be enough, because there are also characters that aren't in any words. There could be a way to find or generate them.

Could we instead use the large_pinyin file? I was going to suggest that anyway, because the "small" file contains fewer than 50,000 words, which in my experience is not enough to write in any language. Could you please check if that file contains the characters you mentioned?

And yes, it shouldn't be a big problem to include both the words and the unique characters. Even if a character has multiple readings, it can be done with small changes to the code and the database.

There's a way to type only the initial part of each pinyin syllable, 食物=shw and 食物中毒=shwzhd, and it's widely used. Is supporting this doable?

Anything is possible. I only need a strict set of rules for conversion to Latin. However, there can be only a single set of rules, meaning "食物" must be either "shw" or "shiwu". Technically, it may be possible to include both, but the database will grow too big and there will be too much lag when typing... I am unsure if the experience will be good.

One thing that comes to mind is: if we decide to use only the initial part, how do we differentiate between words with the same beginning? For example, both "龙超" and "龙輴" would be "lch", because they are "long chao" and "long chun" respectively. Is it OK if they both appear as suggestions when the user types "lch"? Won't there be too many suggestions in some cases then?

Sorry for asking so many questions, but I have no idea how T9 for Chinese is supposed to work and I can't read the language, so it is a bit difficult to wrap my head around it.

@Liquid-Aristocracy

Liquid-Aristocracy commented Jun 16, 2023

Could you please check if that file contains the characters you mentioned?

I'll do it later.

Won't there be too many suggestions in some cases then?

That sounds right. If you want to implement it, it's probably better to have some frequency data and show only the most frequent matches. Typing could also be done with only the "full mode".
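A sketch of what "only the most frequent" could look like in Python. The two words come from the example above; the frequencies and the default `limit` are invented purely for illustration.

```python
from collections import defaultdict

# (word, abbreviation, frequency). The frequencies are made up for this example.
ENTRIES = [
    ("龙超", "lch", 12),
    ("龙輴", "lch", 3),
    ("食物", "shw", 950),
]


def build_index(entries):
    """Group words by abbreviation, most frequent first within each group."""
    index = defaultdict(list)
    for word, abbreviation, frequency in entries:
        index[abbreviation].append((frequency, word))
    for bucket in index.values():
        bucket.sort(key=lambda item: item[0], reverse=True)
    return index


def suggest(index, abbreviation, limit=20):
    """Return at most `limit` words for an abbreviation."""
    return [word for _, word in index.get(abbreviation, [])[:limit]]


index = build_index(ENTRIES)
print(suggest(index, "lch"))           # ['龙超', '龙輴']
print(suggest(index, "lch", limit=1))  # ['龙超']
```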

However, there can be only a single set of rules, meaning "食物" must be either "shw" or "shiwu".

There's already such a complication here though: some characters have multiple pronunciations and thus multiple conversions. 长 can be both chang and zhang. Is that workable?

@Liquid-Aristocracy

Could you please check if that file contains the characters you mentioned?

About this: yes, I got it wrong. large_pinyin is the full list. However, it still contains only phrases. Single characters' pinyin data is here: https://github.com/mozillazg/pinyin-data/blob/master/pinyin.txt

However, glancing over that list, I found that I can't recognize the majority of the words, let alone use them in typing. Maybe using Mandarin Frequency Lists from Wiktionary is better, though they are for Taiwan. I'll also check the frequency list provided by this website.

@sspanak
Owner

sspanak commented Jul 7, 2023

There's already such a complication here though: some characters have multiple pronunciations and thus multiple conversions. 长 can be both chang and zhang. Is that workable?

It should be possible, but some code changes will be necessary. The way I see how Chinese support could be added is:

  1. When the user starts typing, they actually type Pinyin that matches what they want to say. So, if it is "chang" or "zhang" this would yield exactly one result, because there would be only one "chang" or "zhang" in the database. So far, this is the same as typing in any other language, except that we don't search for similar words, we only want exact matches. This would probably require 1-2 lines of code.
  2. Then, the Chinese-specific step comes. We do a secondary lookup, in a second database table, where we have Pinyin-to-Chinese relations + frequencies. We find all the Chinese words or characters that match "chang" and sort them by frequency. It will take a bit more time to implement, but it's not going to be very difficult.

This will double the loading time, but with proper indexing, I think we can keep it under 50 ms in the worst case and below 20 ms typical.
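The two steps above can be sketched with plain Python dictionaries standing in for the two database tables. The table names, the function name, and the frequencies are all invented for this sketch; only the two-step shape of the lookup comes from the description.

```python
# Step 1 table: digit sequence -> Pinyin spellings stored for exactly that sequence.
PINYIN_BY_DIGITS = {
    "24264": ["chang"],
    "94264": ["zhang"],
}

# Step 2 table: Pinyin -> (character, frequency). Frequencies are invented.
HANZI_BY_PINYIN = {
    "chang": [("常", 40), ("长", 90)],
    "zhang": [("张", 80), ("长", 60)],
}


def lookup(digits: str) -> list[str]:
    """Exact-match the digits to Pinyin, then fetch characters by frequency."""
    suggestions = []
    for pinyin in PINYIN_BY_DIGITS.get(digits, []):
        candidates = sorted(
            HANZI_BY_PINYIN.get(pinyin, []), key=lambda c: c[1], reverse=True
        )
        suggestions.extend(hanzi for hanzi, _ in candidates)
    return suggestions


print(lookup("24264"))  # ['长', '常']
print(lookup("94264"))  # ['张', '长']
print(lookup("2426"))   # [] because only exact matches are fetched
```

Note that 长 shows up under both spellings, which is how a character with multiple readings would fit into this scheme.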

Now, the only thing is to find a proper word/character list and the Pinyin transcriptions. And this may turn out to be the hardest part. I really hope you or someone else could help, because I have no idea where to look for one.

Maybe using Mandarin Frequency Lists from Wiktionary is better

It contains only 10,000 words, which is really not enough at all. Even English, which has very simple grammar rules (no cases, no inflections and whatnot) has a wordlist of 130,000 words and as I understand, this is far from perfect.

@gleaner-m

Hello, if you need it, I have two Chinese dictionary files that can meet most Chinese input needs.

I am enclosing information that I hope will be of help to you.
Uploading assets.zip…

@sspanak
Owner

sspanak commented Aug 16, 2023

The link is broken.

@gleaner-m

assets.zip
dict

The link is broken.

@ghost

ghost commented Oct 11, 2023

When the user starts typing, they actually type Pinyin that matches what they want to say. So, if it is "chang" or "zhang" this would yield exactly one result, because there would be only one "chang" or "zhang" in the database.

I'm unsure what you mean in the second half of this. In this case "chang" and "zhang" could both refer to 长, but 张 is also pronounced "zhang", and 唱 is also pronounced "chang", so these should be kept separate as far as i can see.

Looking at the database format, it seems that perhaps one approach would be a two-column word list, with a third column for frequency information:

食物 shiwu [1]
多长 duochang [1]
生长 shengzhang [2]

In this system, the second column is what the user types and the first column is what gets inserted. This doesn't make the initial-only input any easier (i.e. shw=shiwu), unless those entries were all added separately to the file. That said, i think in practice this form of contraction seems to be used mostly on common phrases rather than rare words anyway.
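A minimal Python sketch of reading such a list, assuming whitespace-separated columns and a frequency written in square brackets exactly as in the three sample lines. Treating the frequency as optional with a default of 0 is an assumption of this sketch, and `parse_line` is a made-up name.

```python
import re

SAMPLE = """\
食物 shiwu [1]
多长 duochang [1]
生长 shengzhang [2]
"""

# inserted text, typed text, optional [frequency]
LINE = re.compile(r"^(\S+)\s+(\S+)(?:\s+\[(\d+)\])?$")


def parse_line(line):
    """Split one dictionary line into (inserted, typed, frequency)."""
    match = LINE.match(line.strip())
    if match is None:
        raise ValueError(f"unrecognized dictionary line: {line!r}")
    inserted, typed, frequency = match.groups()
    return inserted, typed, int(frequency) if frequency else 0


entries = [parse_line(line) for line in SAMPLE.splitlines() if line.strip()]
print(entries[0])  # ('食物', 'shiwu', 1)
print(entries[2])  # ('生长', 'shengzhang', 2)
```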

In terms of word lists, it might be best to combine several. Many of the ones i have been looking through also fail at phrases longer than words that nonetheless should be possible to input in a single go, for example 多长时间 duochangshijian. The wikipedia lists are a good start at the very least and provide simplified characters. https://cc-cedict.org/ is also a good project.

@ghost

ghost commented Nov 6, 2023

Found this project: http://crr.ugent.be/programs-data/subtitle-frequencies/subtlex-ch. It's a frequency list of Chinese from tv show subtitles and seems to have a lot of words.

Another thing i wonder about is adding Chinese names or other Chinese words to the dictionary. It seems that whitespace is currently used to detect a single word. But there are no gaps between words in Chinese, and it may become quite necessary to add Chinese names, which will not be in the dictionary at first but which i will want to appear there in the future.

@sspanak
Owner

sspanak commented Nov 13, 2023

I'm unsure what you mean in the second half of this. In this case "chang" and "zhang" could both refer to 长, but 张 is also pronounced "zhang", and 唱 is also pronounced "chang", so these should be kept separate as far as i can see.

Thanks for the clarification. It is definitely nice to know.

Looking at the database format, it seems that perhaps one approach would be a two-column word list, with a third column for frequency information

Forget about the current database. It is way too slow on low-end devices and it is unusable for Chinese. It is going away.

Another thing i wonder about is to add chinese names or other chinese words to the dictionary

Yes, personal names, city names, landmark objects, company names and whatnot are very important for good typing experience.

And last, but not least, I would like to thank everyone for the suggested word lists. However, I would like to remind everyone that I also need the Pinyin transcriptions. I cannot add Chinese if I don't know how to convert the Chinese characters to the Latin alphabet.

@ghost

ghost commented Nov 13, 2023

A cursory search led me to https://github.com/briankung/pinyin-tool but it is unclear to me just how naive this is. When i have some free time, i can run some tests and try converting some of the big files. It seems likely that any first list will have mistakes, but hopefully they will not be too common and over time can be patched out.

@ghost

ghost commented Dec 5, 2023

Here is a file i made from transcriptions of tv show and youtube videos. It is a tsv file with hanzi in the first column and pinyin with numbers in the second. I think it would be best for it not to be necessary to type numbers (in fact i can't think how that would nicely be done) but they were added by the converter so i have left them in case.
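If the numbers turn out not to be needed, they are easy to drop when loading the file. A hedged Python sketch, assuming tab-separated rows with tone digits 1 to 5 appended to each syllable; the two sample rows below are written by hand for illustration and are not copied from vocab.txt.

```python
import csv
import io
import re

# Hand-written sample rows in the described layout (hanzi, numbered pinyin).
SAMPLE_TSV = "食物\tshi2 wu4\n什么\tshen2 me5\n"


def typed_form(numbered_pinyin: str) -> str:
    """Remove tone digits and spaces, e.g. "shi2 wu4" -> "shiwu"."""
    return re.sub(r"[1-5\s]", "", numbered_pinyin).lower()


def read_vocab(stream):
    """Yield (hanzi, toneless pinyin) pairs from a TSV stream."""
    for row in csv.reader(stream, delimiter="\t"):
        if len(row) >= 2:
            yield row[0], typed_form(row[1])


print(list(read_vocab(io.StringIO(SAMPLE_TSV))))
# [('食物', 'shiwu'), ('什么', 'shenme')]
```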

vocab.txt

@ghost

ghost commented Dec 11, 2023

I have also found this Wiktionary page of characters sorted by pinyin. https://en.wiktionary.org/wiki/Appendix:Mandarin_Pinyin/Table_of_General_Standard_Chinese_Characters

It is the list of 8000 characters published by the government for common use. It also covers characters commonly used in names, which is helpful. If you want me to format this in a particular way i can try and do that.

@sspanak
Owner

sspanak commented Dec 26, 2023

I have also found this Wiktionary page of characters... It is the list of 8000 characters published by the government for common use

I can extract them from Wiktionary easily, that's fine.

But let me check whether I understand everything correctly. This list of 8000 characters is the core of the language and you could use them to type almost any word, even if the pronunciation of the entire word may differ. Basically, you could use them to type character by character, similar to ABC mode in English.

Correct... or not?

If the above is correct, it is possible to create a new type of input mode for Chinese (and other non-ABC languages). Using space will not be a problem. The default way of accepting a suggestion is the OK key, not 0-key, so you can use that instead.

But I am not sure how to present the suggestions in a practical way.

For example, typing "QI" would yield some 100 suggestions. And even if I do the numbers thing, to display only the QIs with, say, a rising tone, there would be 43 characters. Presenting so many characters to the user would still be overwhelming in my opinion. Not to mention, TT9 is currently constrained to 20 suggestions. I can, of course, raise the limit, but it is there for the sake of usability.

On the other hand, I've just tested the Chinese input method on my Qin F21 and it seems to do exactly what I described above. Typing "74" (QI) actually causes 100 or so suggestions to appear. Is this really the way to go? How did you guys type back in the day, when there were no smartphones?

Here is a file i made from transcriptions of tv show and youtube videos. It is a tsv file with hanzi in the first column and pinyin with numbers in the second

Thank you very much for this. I guess any extra characters could be added to the list of 8000 "official" ones. Other than that, I don't think we need entire words, for example: "默默无闻", because every separate character would be in the dictionary and you could just type: "mò", "mò", "wú", "wén" and get the entire word.

So, the last problem to solve is how to present a reasonable number of suggestions to the user.

@ghost

ghost commented Dec 26, 2023

One common way on older phones is bihua (stroke input). You will see on your Qin F21 that the 12345 keys actually have strokes on them, and you would input characters stroke by stroke. This method is implemented in stroke count method on f-droid, and perhaps would be possible to implement here.

you could use them to type almost any word, even if the pronunciation of the entire word may differ

I'm not sure what you mean here, but i think your understanding is correct. The phrase 你叫什么名字 you could type as ni jiao shen me ming zi, selecting one character each time. But this is actually four words, ni jiao shenme mingzi. You tend not to write one character at a time but in words or phrases, and the input method knows which words and phrases are more frequently used. Chinese has polysyllabic words; it is not the case that one character is one word. I usually only type character by character when typing a new name which has a rare character. Does that make sense?

That is why it is beneficial to have the words in the dictionary. I guess the alternative is not to store the words in the dictionary at first, but then to store them as the user types them character by character. Otherwise typing will be a very painful experience.

https://www.bilibili.com/video/BV1TT41117ZP here is a video where you can see someone typing, maybe this is a helpful demonstration. It is similar to typing alphabetic languages, but with an extra step.

It is difficult for me to explain the process, i hope you can understand, and if not let me know and i will try to explain it again. Thanks for the interest in this feature.

@ghost

ghost commented Dec 26, 2023

I forgot to say. When i type 743663 for shenme, the first result is what i expect. There is one other two-character phrase on the first page of results, and the rest are all single characters, which helps to demonstrate how most of the time typing phrases is easy. But i think you shouldn't enforce the numbers, as most pinyin input methods do not use numbers; it is a bit slow and unnecessary.

If the database can be fast enough in this app to show all the possibilities it should do so, as in general use the only time that one will need to scroll them is for names. With frequency weighting, everything else will probably be in the first page.
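To illustrate "everything else will probably be in the first page", here is a small Python sketch of frequency-ordered paging. The characters are all read "qi" (the "74" example mentioned earlier in this thread), but the frequencies and the page size are invented for the example.

```python
def paginate(candidates, page_size=10):
    """Sort (text, frequency) pairs by descending frequency and split into pages."""
    ordered = [
        text for text, _ in sorted(candidates, key=lambda c: c[1], reverse=True)
    ]
    return [ordered[i:i + page_size] for i in range(0, len(ordered), page_size)]


# A few characters read "qi"; the frequencies are made up.
candidates = [
    ("七", 50), ("起", 90), ("气", 80), ("其", 70),
    ("期", 40), ("奇", 20), ("骑", 10),
]

pages = paginate(candidates, page_size=3)
print(pages[0])    # ['起', '气', '其']
print(len(pages))  # 3
```

With real frequency data, rare characters such as those used only in names would end up on the later pages.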

@sspanak
Owner

sspanak commented Jan 4, 2024

OK, thanks for all suggestions and information. I think now I have enough knowledge to add Chinese.

Let's go with the phonetic method and display all possible suggestions, even if there are 100 of them. Performance shouldn't be an issue with the Objectbox database (already merged in master). I've quickly tested Gboard and it does the same, so this must be the way to go.

As for the strokes method, it requires knowledge about how characters are written, which I don't have. It will just take too long for me to learn, understand and implement it. I can't do this right now.

@sspanak
Owner

sspanak commented Jan 4, 2024

So here is the summary from user perspective:

  1. Add Simplified Chinese. It will be displayed as: "中文(拼音)" or "简体中文(拼音)". Not sure which one is better, but I'll use the first one, unless anyone suggests the second one.
  2. The locale will be: "zh_CN".
  3. The dictionary will contain all 8000 general purpose characters, the words extracted from the subtitles and possibly, the huge list of words from Mozilla. The Mandarin frequency lists could be used for adding word frequencies, if extracting them from Wiktionary is not overcomplicated.
  4. Chinese punctuation will be added.
  5. When typing, one will type phonetically, using the English layout. This will result in either the composing text appearing in Latin or an overlay appearing like in Gboard, whatever is more convenient. The suggestions will be displayed in the original language. Possibly, the Latin word could also be included.
  6. If text composing is still active, hitting backspace reverts the composing text to Latin and erases the letters as usual. Suggestions are also updated as usual.
  7. Accepting words works as usual: either OK to accept, or SPACE to accept and type a space.

Technical stuff:

  1. Language will have an extra property that shows whether it is syllable- or letter-based.
  2. There will be a third dictionary format: originalWord, latinTranscription, optionalFrequency
  3. There will be a new database table, "transcriptions", linking the Latin and the original word, also containing the frequency.
  4. Only the exact matches of the Latin word will be fetched from the database.
  5. There will be a new query for getting the words. It will get only the exact Latin matches, join them with the transcriptions and order by the transcriptions length, then frequency. The WordStore will know which query to use, based on the Language.
  6. Frequencies are only read and updated in the transcriptions table.
  7. There will be a new query for updating the frequencies. Again, the store will know which one to use, depending on the Language.
  8. Likely, there will be a need for a new ModePhonetic or ModeSyllable (or come up with a better name), which will take care of backspace and the composing overlay properly. The rest can probably be the same as ModePredictive, meaning it can be inherited.
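To make points 3 to 7 easier to picture, here is a rough Python sketch that uses an in-memory SQLite database as a stand-in. The project itself uses Objectbox, so this is not the planned implementation; the `words` table, the column names, the ordering by the length of the original word, and the sample frequencies are all assumptions of this sketch.

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.executescript("""
    CREATE TABLE words (
        id INTEGER PRIMARY KEY,
        sequence TEXT NOT NULL,   -- digits typed on the keypad
        latin TEXT NOT NULL       -- Latin transcription without tones
    );
    CREATE INDEX idx_words_sequence ON words (sequence);
    CREATE TABLE transcriptions (
        word_id INTEGER NOT NULL REFERENCES words (id),
        original TEXT NOT NULL,   -- word or character in the original script
        frequency INTEGER NOT NULL DEFAULT 0
    );
""")
db.executemany("INSERT INTO words VALUES (?, ?, ?)", [
    (1, "94264", "zhang"),
    (2, "24264", "chang"),
])
# Frequencies are invented for illustration.
db.executemany("INSERT INTO transcriptions VALUES (?, ?, ?)", [
    (1, "张", 80), (1, "长", 60),
    (2, "长", 90), (2, "唱", 40),
])

QUERY = """
    SELECT t.original
    FROM words AS w
    JOIN transcriptions AS t ON t.word_id = w.id
    WHERE w.sequence = ?   -- exact matches only, no prefix search
    ORDER BY LENGTH(t.original), t.frequency DESC
"""


def suggestions(sequence):
    """Fetch original-script suggestions for an exact digit sequence."""
    return [row[0] for row in db.execute(QUERY, (sequence,))]


def bump_frequency(latin, original):
    """Frequencies live only in the transcriptions table."""
    db.execute(
        """
        UPDATE transcriptions SET frequency = frequency + 1
        WHERE original = ?
          AND word_id IN (SELECT id FROM words WHERE latin = ?)
        """,
        (original, latin),
    )


print(suggestions("94264"))  # ['张', '长']
print(suggestions("24264"))  # ['长', '唱']
```

Calling `bump_frequency("zhang", "长")` often enough would eventually move 长 ahead of 张 for the "zhang" sequence without affecting its position under "chang".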

The plan above is not set in stone and it may change as needed during implementation.

@sspanak removed the "help wanted" (Extra attention is needed) label on Jan 4, 2024
@sspanak added this to the "Exotic language pack" milestone on Jan 4, 2024
@ghost

ghost commented Jan 4, 2024 via email

@public26612

expect!

@public26612

Really in need, always waiting!

@sspanak
Owner

sspanak commented Jul 9, 2024

I am willing to make it sooner or later. But please understand that this project is being run by a single developer, for free, for the benefit of humankind. As such, there are no guaranteed deadlines. If you want to speed things up a bit, consider making a donation.


5 participants