add lexical model for Portuguese #303
Thank you for your pull request. You'll see a "build failed" message until the Keyman team has reviewed the pull request and manually initiated the build process. Every change committed to this branch will become part of this pull request. When you have finished submitting files and are ready for the Keyman team to review this pull request, please post a "Ready for review" comment.
Thanks for this contribution. Unfortunately, there are some issues with the folder structure and the files. In order to keep our automated build system working, we have a strict naming system for folders and files.
In that
The The format of the .tsv file that is required is: Currently the Let me know if you have any questions.
That's quite the wordlist file there! That said... first and foremost, this isn't the input format we currently expect, as there's a lot of extra data here. I'll address that first, then the line count.
An excerpt:
aba abar VERB Mood=Ind|Number=Sing|Person=3|Tense=Pres|VerbForm=Fin
ababada ababadar VERB Mood=Imp|Number=Sing|Person=2|VerbForm=Fin
ababada ababadar VERB Mood=Ind|Number=Sing|Person=3|Tense=Pres|VerbForm=Fin
ababadada ababadado ADJ Gender=Fem|Number=Sing|VerbForm=Part
ababadada ababadar VERB Gender=Fem|Number=Sing|VerbForm=Part
This looks like data tailored to grammatical and/or morphological modeling with some other engine out there. We currently have no use for part-of-speech data or the other grammatical features (gender, number, verb form). I'm not even sure whether column 1 or 2 is the actual "word" being "listed" - I get the feeling the other is the root form of the word or similar, but I'm not familiar enough with Portuguese to make that call. (The second column is probably the root form, as the first column varies for singular vs plural when the second doesn't.)
I get the feeling that sort of data could be useful if and when we're able to provide morphological and/or agglutinative models... but we are not currently able to provide such functionality, and it will be a while before we are.
Here's a link talking about the format we expect for our TSV files: https://help.keyman.com/developer/17.0/guides/lexical-models/tutorial/step-3
We're just looking for a simple "word" + "frequency" per row. Nothing about the difference between word root + the form in the list, nothing about why the variation is what it is or what it represents... just a raw word + frequency.
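As a rough illustration of that reduction, here is a minimal Python sketch. It assumes column 1 of the submitted file is the surface form to keep (per the comment above, that is a guess and not confirmed) and that fields are whitespace-separated. The submitted data carries no frequencies, so the counts here come from a separate tokenized corpus; the names `surface_forms`, `to_wordlist_rows`, and `corpus_tokens` are hypothetical and not part of any Keyman tooling.

```python
from collections import Counter


def surface_forms(rows):
    """Return the unique first-column entries, in order of first appearance.

    Assumption: column 1 is the word form to keep; the lemma, part of speech,
    and feature columns are simply discarded.
    """
    seen = {}
    for row in rows:
        fields = row.split(None, 1)  # split on any whitespace, first field only
        if fields:
            seen.setdefault(fields[0], None)
    return list(seen)


def to_wordlist_rows(words, corpus_tokens):
    """Build "word<TAB>count" rows from a tokenized corpus.

    Words that never occur in the corpus are dropped in this sketch.
    """
    wanted = set(words)
    counts = Counter(tok for tok in corpus_tokens if tok in wanted)
    return [f"{word}\t{counts[word]}" for word in words if counts[word] > 0]


if __name__ == "__main__":
    sample = [
        "aba abar VERB Mood=Ind|Number=Sing|Person=3|Tense=Pres|VerbForm=Fin",
        "ababada ababadar VERB Mood=Imp|Number=Sing|Person=2|VerbForm=Fin",
        "ababada ababadar VERB Mood=Ind|Number=Sing|Person=3|Tense=Pres|VerbForm=Fin",
    ]
    words = surface_forms(sample)
    for line in to_wordlist_rows(words, ["aba", "ababada", "aba"]):
        print(line)
```

Counting the duplicate rows in the original file would not give a usable frequency, since those repeats reflect different grammatical analyses of the same form and not how often the word is used.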
Now, about the line count.
A wordlist with this many entries will result in very sluggish performance at this time. We do have work in progress that will mitigate this quite significantly, but due to various constraints and a desire for something resembling timeliness for our release cycles, it won't be landing in our next release. You can track this upcoming feature here: keymanapp/keyman#12293. Some comparison details about the impact it will have can be seen here: keymanapp/keyman#12129 (comment).
Note that your TSV is on the same scale as the dotland.hy.armenian, 1.0.0 entry in the table found at that last link. This means you'll be hit hard by our current performance limitations, even if you get the data into the exact format we currently expect, should you keep every word entry currently in that list.
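One way to act on the size concern is to keep only the most frequent entries once the list is in word + frequency form. The sketch below is only an illustration: `trim_wordlist` and `max_entries` are hypothetical names, and nothing in this thread states what entry count would perform acceptably, so any cutoff would need to be found by testing.

```python
def trim_wordlist(entries, max_entries):
    """Keep the max_entries most frequent (word, count) pairs.

    Ties on count are broken alphabetically so the result is deterministic.
    """
    ranked = sorted(entries, key=lambda entry: (-entry[1], entry[0]))
    return ranked[:max_entries]


if __name__ == "__main__":
    # Made-up counts, for illustration only.
    entries = [("de", 50), ("aba", 2), ("que", 40)]
    for word, count in trim_wordlist(entries, 2):
        print(f"{word}\t{count}")
```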
The link that kicked all of that off: https://community.software.sil.org/t/text-suggestion-not-working-in-some-cases/6490 - note the user's issues when they used their model at the same scale as your current wordlist. We have fixed a lot of the side issues, but the performance aspect is still pending.