Chinese language support #260
Comments
I haven't considered this, but I suppose it is doable. I know there are two ways of typing in Chinese: phonetic and "strokes" (I am not sure about the name of the latter). However, I don't know how to convert the words to digit sequences in either mode, so I need help from a native speaker. For example, in English, "food" is: "3663". I need to know how to convert "食物" to digits. Also, I need a good word list for the predictions. |
Maybe this can help for pinyin: https://github.com/mozillazg/phrase-pinyin-data/blob/master/pinyin.txt However predictive input could be hard when considering some cases... |
So, considering these examples again:
Pressing "744-98" to type "shi-wu" without diacritics, must yield: "食物". And "744-98-94664-38" would become: "食物中毒". Am I reading this correctly, @Liquid-Aristocracy? I also have some more questions:
|
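The digit conversion discussed above can be sketched as follows. This is only an illustration, not code from the project: it assumes the standard 12-key letter layout and toneless pinyin typed without "ü", and the name `to_digits` is made up for this example.

```python
# Standard 12-key phone keypad letter groups.
KEYPAD = {
    "2": "abc", "3": "def", "4": "ghi", "5": "jkl",
    "6": "mno", "7": "pqrs", "8": "tuv", "9": "wxyz",
}
LETTER_TO_DIGIT = {ch: d for d, letters in KEYPAD.items() for ch in letters}


def to_digits(latin: str) -> str:
    """Convert toneless Latin text to a keypad digit sequence.

    Characters that have no key (hyphens, spaces, apostrophes) are skipped,
    so "shi-wu" and "shiwu" produce the same sequence.
    """
    return "".join(
        LETTER_TO_DIGIT[c] for c in latin.lower() if c in LETTER_TO_DIGIT
    )


print(to_digits("food"))           # 3663
print(to_digits("shi-wu"))         # 74498
print(to_digits("shiwu-zhongdu"))  # 744989466438
```

Because the separator is dropped, a word list only needs one toneless spelling per entry to derive its digit sequence.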
Yes, separation is often not needed, so you can have 食物=shiwu=74498. There are two problems though:
And your two questions:
|
Could we instead use the And yes, it shouldn't be a big problem to include both the words and the unique characters. Even if a character has multiple readings, it can be done with small changes to the code and the database.
Anything is possible. I only need a strict set of rules for conversion to Latin. However, there could be a single set of rules, meaning "食物" must be either "shw" or "shiwu". Technically, it may be possible to include both, but the database will grow too big and there will be too much lag when typing... I am unsure the experience would be good. One thing that comes to mind is, if we decide to use only the initial part, how do we differentiate between words with the same beginning? For example, both "龙超" and "龙輴" would be "lch", because they are "long chao" and "long chun" respectively. Is it OK if they both appear as suggestions when the user types "lch"? Won't there be too many suggestions in some cases then? Sorry for asking so many questions, but I have no idea how T9 for Chinese is supposed to work and I can't read the language, so it is a bit difficult to wrap my head around it. |
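To make the "initials only" collision concrete, here is a rough sketch. It assumes each word is already split into toneless pinyin syllables, and it treats "y" and "w" as initials for simplicity. The names `initial_of` and `abbreviate` are hypothetical.

```python
from collections import defaultdict

# Pinyin initials. Two-letter ones come first so that "zh", "ch" and "sh"
# are matched before "z", "c" and "s".
INITIALS = ("zh", "ch", "sh", "b", "p", "m", "f", "d", "t", "n", "l",
            "g", "k", "h", "j", "q", "x", "r", "z", "c", "s", "y", "w")


def initial_of(syllable: str) -> str:
    """Return the initial of one syllable, e.g. "shi" -> "sh"."""
    for ini in INITIALS:
        if syllable.startswith(ini):
            return ini
    return syllable[:1]  # zero-initial syllables such as "an" or "er"


def abbreviate(syllables) -> str:
    """Join the initials of all syllables, e.g. ["shi", "wu"] -> "shw"."""
    return "".join(initial_of(s) for s in syllables)


WORDS = {
    "食物": ["shi", "wu"],
    "龙超": ["long", "chao"],
    "龙輴": ["long", "chun"],
}

by_abbrev = defaultdict(list)
for word, syllables in WORDS.items():
    by_abbrev[abbreviate(syllables)].append(word)

print(by_abbrev["shw"])  # ['食物']
print(by_abbrev["lch"])  # ['龙超', '龙輴']
```

Both "龙超" and "龙輴" land under "lch", so some ranking (for example by frequency) would be needed to keep the suggestion list short.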
I'll do it later.
That sounds right. It's probably better to have some frequency data and show only the most frequent ones, if you want to implement it. Typing could be done with only the "full mode".
There's already such a complication here though: some characters have multiple pronunciations and thus multiple conversions. 长 can be both chang and zhang. Is that workable? |
About this: yes, I got it wrong. However glancing over that list, I found the majority of words aren't ones that I can recognize, let alone use in typing. Maybe using Mandarin Frequency Lists from Wiktionary is better, though they are for Taiwan. I'll also check the frequency list provided by this website. |
It should be possible, but some code changes will be necessary. The way I see how Chinese support could be added is:
This will double the loading time, but with proper indexing, I think we can keep it under 50 ms in the worst case and below 20 ms typical. Now, the only thing is to find a proper word/character list and the Pinyin transcriptions. And this may turn out to be the hardest part. I really hope you or someone else could help, because I have no idea where to look for one.
It contains only 10,000 words, which is really not enough at all. Even English, which has very simple grammar rules (no cases, no inflections and whatnot) has a wordlist of 130,000 words and as I understand, this is far from perfect. |
Hello, if you need it, I have two Chinese dictionary files that can meet most Chinese input needs. I am enclosing information that I hope will be of help to you. |
The link is broken. |
assets.zip
|
I'm unsure what you mean in the second half of this. In this case "chang" and "zhang" could both refer to 长, but 张 is also pronounced "zhang", and 唱 is also pronounced "chang", so these should be kept separate as far as i can see. Looking at the database format, it seems that perhaps one approach would be a two column word list, with three columns for frequency information:
In this system, the second column is what the user types, the first column is what gets inserted. This doesn't make the initial-only input any easier (i.e. shw=shiwu), unless those entries were all added separately to the file. That said, I think in practice this form of contraction is used mostly for common phrases rather than rare words anyway. In terms of word lists, it might be best to combine several. Many of the ones I have been looking through also fail at phrases longer than words that nonetheless should be possible to input in a single go, for example 多长时间 duochangshijian. The Wikipedia lists are a good start at the very least and provide simplified characters. https://cc-cedict.org/ is also a good project. |
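A minimal sketch of how such a list could be indexed, assuming one row per (inserted text, typed pinyin) pair. The frequency values below are made-up placeholders, and the single frequency field is a simplification of the three frequency columns mentioned above. `build_index` is a hypothetical helper, not part of the app.

```python
from collections import defaultdict

KEYS = {"2": "abc", "3": "def", "4": "ghi", "5": "jkl",
        "6": "mno", "7": "pqrs", "8": "tuv", "9": "wxyz"}
DIGIT = {c: d for d, cs in KEYS.items() for c in cs}

# (inserted text, typed pinyin, frequency).
# The frequencies are invented for illustration only.
ROWS = [
    ("长", "chang", 900),
    ("长", "zhang", 400),
    ("张", "zhang", 700),
    ("唱", "chang", 300),
    ("多长时间", "duochangshijian", 50),
]


def build_index(rows):
    """Map each digit sequence to its candidates, most frequent first."""
    index = defaultdict(list)
    for text, pinyin, freq in rows:
        digits = "".join(DIGIT[c] for c in pinyin)
        index[digits].append((freq, text))
    return {
        digits: [text for _, text in sorted(items, key=lambda p: -p[0])]
        for digits, items in index.items()
    }


index = build_index(ROWS)
print(index["24264"])  # ['长', '唱']  (chang)
print(index["94264"])  # ['张', '长']  (zhang)
```

A character with two readings simply appears in two rows, so 长 is reachable from both "chang" and "zhang", while 张 and 唱 stay under their own reading.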
Found this project: http://crr.ugent.be/programs-data/subtitle-frequencies/subtlex-ch It's a frequency list of Chinese from TV show subtitles and seems to have a lot of words. Another thing I wonder about is adding Chinese names or other Chinese words to the dictionary. It seems that whitespace is currently used to detect a single word. But there are no gaps between words in Chinese, and it may become quite necessary to add Chinese names, which will not be in the dictionary at first but which I will want to add to it later. |
Thanks for the clarification. It is definitely nice to know.
Forget about the current database. It is way too slow on low-end devices and it is unusable for Chinese. It is going away.
Yes, personal names, city names, landmarks, company names and whatnot are very important for a good typing experience. And last, but not least, I would like to thank everyone for the suggested word lists. However, I would like to remind everyone that I also need the Pinyin transcriptions. I cannot add Chinese if I don't know how to convert the Chinese characters to the Latin alphabet. |
A cursory search led me to https://github.com/briankung/pinyin-tool but it is unclear to me just how naive this is. When I have some free time, I can run some tests and try converting some of the big files. It seems likely that any first list will have mistakes, but hopefully they will not be too common and can be patched out over time. |
Here is a file i made from transcriptions of tv show and youtube videos. It is a tsv file with hanzi in the first column and pinyin with numbers in the second. I think it would be best for it not to be necessary to type numbers (in fact i can't think how that would nicely be done) but they were added by the converter so i have left them in case. |
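If the tone numbers are not wanted, they could be stripped while loading the file. The sketch below assumes a layout like `食物<TAB>shi2 wu4`, with syllables separated by spaces and tones written as digits 1 to 5. The attached file was not inspected, so the real layout may differ. `load_entries` and the sample data are illustrative only.

```python
import csv
import io
import re

# Stand-in for the attached TSV file. The real file may be laid out differently.
SAMPLE = "食物\tshi2 wu4\n什么\tshen2 me5\n"


def load_entries(tsv_file):
    """Yield (hanzi, toneless pinyin) pairs from a two-column TSV stream."""
    reader = csv.reader(tsv_file, delimiter="\t", quoting=csv.QUOTE_NONE)
    for row in reader:
        if len(row) < 2:
            continue  # skip blank or malformed lines
        hanzi, numbered = row[0], row[1]
        # Drop tone digits and the spaces between syllables.
        plain = re.sub(r"[1-5\s]", "", numbered).lower()
        yield hanzi, plain


entries = list(load_entries(io.StringIO(SAMPLE)))
print(entries)  # [('食物', 'shiwu'), ('什么', 'shenme')]
```

Keeping the numbered column in the source file and stripping it only at load time leaves the option of tone filtering open for later.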
I have also found this Wiktionary page of characters sorted by pinyin. https://en.wiktionary.org/wiki/Appendix:Mandarin_Pinyin/Table_of_General_Standard_Chinese_Characters It is the list of 8000 characters published by the government for common use. It also covers characters commonly used in names, which is helpful. If you want me to format this in a particular way I can try to do that. |
I can extract them from Wikipedia easily, that's fine. But, let me check if I get everything correctly. This list of 8000 characters is the core of the language and you could use them to type almost any word, even if the pronunciation of the entire word may differ. Basically, you could use them to type letter-by-letter, sort of like ABC mode in English. Correct... or not? If the above is correct, it is possible to create a new type of input mode for Chinese (and other non-ABC languages). Using space will not be a problem. The default way of accepting a suggestion is the OK key, not the 0-key, so you can use that instead. But I am not sure how to present the suggestions in a practical way. For example, typing "QI" would yield some 100 suggestions. And even if I do the numbers thing, to display only the QIs with, say, a rising tone, there would be 43 characters. Presenting so many characters to the user would still be overwhelming in my opinion. Not to mention, TT9 is currently constrained to 20 suggestions. I can, of course, raise the limit, but it is there for the sake of usability. On the other hand, I've just tested the Chinese input method on my Qin F21 and it seems to do exactly what I described above. Typing "74" (QI) actually causes 100 or so suggestions to appear. Is this really the way to go? How did you guys type back in the day, when there were no smartphones?
Thank you very much for this. I guess any extra characters could be added to the list of 8000 "official" ones. Other than that, I don't think we need entire words, for example: "默默无闻", because every separate character would be in the dictionary and you could just type: "mò", "mò", "wú", "wén" and get the entire word. So, the last problem to solve is how to present a reasonable number of suggestions to the user. |
One common way from older phones is bihua. You will see on your Qin F21 that the 1, 2, 3, 4 and 5 keys actually have strokes on them, and you would input characters stroke by stroke. This method is implemented in the stroke count method on F-Droid, and perhaps it would be possible to implement it here.
I'm not sure what you mean here, but I think your understanding is correct. The phrase 你叫什么名字 you could type as ni jiao shen me ming zi, selecting one character each time. But this is actually four words, ni jiao shenme mingzi. You tend not to write one character at a time but in words or phrases, and the input method knows which phrases are more frequently used or common. Chinese has polysyllabic words; it is not the case that one character is one word. I usually only type character by character when typing a new name which has a rare character. Does that make sense? That is why it is beneficial to have the words in the dictionary. I guess the alternative is not to store the words in the dictionary at first, but then, as the user types characters one after another, those combinations should be stored. Otherwise typing will be a very painful experience. https://www.bilibili.com/video/BV1TT41117ZP Here is a video where you can see someone typing; maybe this is a helpful demonstration. It is similar to typing alphabetic languages, but with an extra step. It is difficult for me to explain the process. I hope you can understand, and if not, let me know and I will try to explain it again. Thanks for the interest in this feature. |
I forgot to say. When I type 743663 for shenme, the first result is what I expect. There is one other two-character phrase on the first page of results; the rest are all single characters, which helps to demonstrate how typing phrases is easy most of the time. But I think you shouldn't enforce the numbers, as most pinyin input methods do not use numbers. It is a bit slow and unnecessary. If the database can be fast enough in this app to show all the possibilities, it should do so, as in general use the only time that one will need to scroll through them is for names. With frequency weighting, everything else will probably be on the first page. |
OK, thanks for all the suggestions and information. I think now I have enough knowledge to add Chinese. Let's go with the phonetic method and display all possible suggestions even if there are 100 of them. Performance shouldn't be an issue with the Objectbox database (already merged in As for the strokes method, it requires knowledge about how characters are written, which I don't have. It will just take too long for me to learn, understand and implement it. I can't do this right now. |
So here is the summary from user perspective:
Technical stuff:
The plan above is not set in stone and it may change as needed during implementation. |
This all seems to be a good plan. I am not sure from what you wrote whether it will support typing multiple characters at once?
You can let me know how the progress goes and I am happy to test at any point.
One thing:
Accepting words works as usual: either OK to accept, or SPACE to accept and type a space.
I am not sure how the default Duoqin keyboard does this, and maybe it is too much complexity, because this system sounds like it works fine. But the usual way on software keyboards is that space accepts the suggestion, OR types a space if no letters have been typed, because Chinese doesn't usually use spaces but they can be used to separate phrases.
Another thing:
On software pinyin keyboards, you can select the pinyin you are trying to type to filter the list shorter. For example 926 can be zao or yao, so you can remove lots of candidates like this. I don't know a good way to show this on the small keypad. Don't worry too much. But i'm letting you know.
Thanks for all the consideration.
|
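The pinyin filter described in the comment above could work roughly like this. It is a sketch under assumptions: the candidate list is a tiny hand-picked sample, and `group_by_reading` is a hypothetical helper, not something the app provides.

```python
KEYS = {"2": "abc", "3": "def", "4": "ghi", "5": "jkl",
        "6": "mno", "7": "pqrs", "8": "tuv", "9": "wxyz"}
DIGIT = {c: d for d, cs in KEYS.items() for c in cs}

# (character, toneless pinyin). A small sample, not a real dictionary.
CANDIDATES = [
    ("早", "zao"),
    ("要", "yao"),
    ("造", "zao"),
    ("药", "yao"),
    ("气", "qi"),
]


def to_digits(pinyin: str) -> str:
    return "".join(DIGIT[c] for c in pinyin)


def group_by_reading(digits: str, candidates):
    """Group the candidates matching a digit sequence by their pinyin reading."""
    groups = {}
    for text, pinyin in candidates:
        if to_digits(pinyin) == digits:
            groups.setdefault(pinyin, []).append(text)
    return groups


groups = group_by_reading("926", CANDIDATES)
print(list(groups))   # ['zao', 'yao']
print(groups["yao"])  # ['要', '药']
```

The keys of `groups` are the readings a user could pick from, and choosing one narrows the list to that reading's characters.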
Looking forward to it! |
Really in need, always waiting! |
I am willing to make it sooner or later. But please understand that this project is run by a single developer, for free, for the benefit of humankind. As such, there are no guaranteed deadlines. If you want to speed things up a bit, consider making a donation. |
Is there any plan to support Chinese input method?