add lexical model for Portuguese #303

Open
wants to merge 1 commit into
base: master
from

Conversation

Robersongriz

No description provided.

@keyman-server

Thank you for your pull request. You'll see a "build failed" message until the Keyman team has reviewed the pull request and manually initiated the build process.

Every change committed to this branch will become part of this pull request. When you have finished submitting files and are ready for the Keyman team to review this pull request, please post a "Ready for review" comment.

@DavidLRowe
Collaborator

Thanks for this contribution. Unfortunately, there are some issues with the folder structure and the files.

In order to keep our automated build system working, we have a strict naming system for folders and files. portilexicon_ud.pt.portilexicon_ud is okay for a project name (though rather long, so you could replace the second 'portilexicon_ud' with something shorter, such as 'lex'). The first part of the name (up to the first '.') must be the name of the folder that contains the project folder; that folder in turn sits under the 'experimental' folder (or, eventually, the 'release' folder). So you would have:
experimental/portilexicon_ud/portilexicon_ud.pt.portilexicon_ud
as the project folder. In that folder you'll have four files and the source folder:

HISTORY.md
LICENSE.md
README.md
portilexicon_ud.pt.portilexicon_ud.kpj
source

In that experimental/portilexicon_ud/portilexicon_ud.pt.portilexicon_ud/source folder you'll have:

portilexicon_ud.pt.portilexicon_ud.model.kps
portilexicon_ud.pt.portilexicon_ud.model.ts
readme.htm
welcome.htm
wordlist.tsv
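
The layout above can be sketched as a small helper that lists the paths a project is expected to contain. This is only an illustration of the naming rule described in this comment, not part of the Keyman build system; the function name `expected_layout` and its `stage` parameter are hypothetical.

```python
from pathlib import PurePosixPath


def expected_layout(model_id, stage="experimental"):
    """List the relative paths described above for a model ID such as
    'portilexicon_ud.pt.portilexicon_ud' (hypothetical helper)."""
    # The text up to the first '.' names the folder that holds the project folder.
    first_part, dot, _rest = model_id.partition(".")
    if not first_part or not dot:
        raise ValueError(f"model ID must contain a '.': {model_id!r}")
    project = PurePosixPath(stage) / first_part / model_id
    top_level = ["HISTORY.md", "LICENSE.md", "README.md", f"{model_id}.kpj"]
    source = [
        f"{model_id}.model.kps",
        f"{model_id}.model.ts",
        "readme.htm",
        "welcome.htm",
        "wordlist.tsv",
    ]
    return [str(project / name) for name in top_level] + [
        str(project / "source" / name) for name in source
    ]


for path in expected_layout("portilexicon_ud.pt.portilexicon_ud"):
    print(path)
```

Comparing this list against the files actually present in the branch is one way to spot a misplaced or misnamed file before asking for a review.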

The wordlist.tsv file is very long. GitHub gives the length as 1,226,339 lines. I don't think a file this long is practical for this lexical model process. (@jahorton Can you comment?)

The format of the .tsv file that is required is:
word<TAB>count<TAB>comment
where the count and comment fields are optional. However, this file has:
word<TAB>word<TAB>part-of-speech...
so this will confuse the software that builds the lexical model.
The count field is an integer that gives an indication of how common the word is. (The larger the number the more likely it is to be offered as a suggestion.) If you omit the count field, then all words have a value of 1, and are equally likely.
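
To make the format concrete, here is a minimal sketch of how one row could be read, assuming a missing count means 1 as described above. It is not the code the Keyman compiler uses; `parse_wordlist_line` is a hypothetical name and "casa" is just a made-up sample entry.

```python
def parse_wordlist_line(line):
    """Split one 'word<TAB>count<TAB>comment' row; count and comment are optional."""
    fields = line.rstrip("\r\n").split("\t")
    word = fields[0]
    # A missing or empty count is treated as 1, so such words are equally likely.
    count = int(fields[1]) if len(fields) > 1 and fields[1].strip() else 1
    comment = fields[2] if len(fields) > 2 else ""
    return word, count, comment


print(parse_wordlist_line("casa\t120\tsample comment"))  # ('casa', 120, 'sample comment')
print(parse_wordlist_line("casa"))                       # ('casa', 1, '')
```

With a row from the submitted file, such as `aba<TAB>abar<TAB>VERB...`, this sketch raises a `ValueError` because the second field ("abar") is not an integer, which illustrates why the extra columns are a problem.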

Currently the welcome.htm file is not referenced in the package. You can use Keyman Developer, select the Packaging tab at the bottom, and the Details tab on the left to select the welcome.htm file.

Let me know if you have any questions.

Contributor

That's quite the wordlist file there! That said... first and foremost, this isn't the input format we currently expect, as there's a lot of extra data here. I'll address that first, then the line count.

An excerpt:

aba	abar	VERB	Mood=Ind|Number=Sing|Person=3|Tense=Pres|VerbForm=Fin
ababada	ababadar	VERB	Mood=Imp|Number=Sing|Person=2|VerbForm=Fin
ababada	ababadar	VERB	Mood=Ind|Number=Sing|Person=3|Tense=Pres|VerbForm=Fin
ababadada	ababadado	ADJ	Gender=Fem|Number=Sing|VerbForm=Part
ababadada	ababadar	VERB	Gender=Fem|Number=Sing|VerbForm=Part

This looks like data tailored to grammatical and/or morphological modeling with some other engine. We currently have no use for part-of-speech data or the other morphological features (gender, number, verb form). I'm not even sure whether column 1 or column 2 is the actual "word" being "listed". I get the feeling the other one is the root form of the word or similar, but I'm not familiar enough with Portuguese to make that call. (The root form is probably the second column, since the first column varies between singular and plural while the second doesn't.)

I get the feeling that sort of data could be useful if and when we're able to provide morphological and/or agglutinative models... but we are not currently able to provide such functionality, and it will be a while before we are.

Here's a link talking about the format we expect for our TSV files: https://help.keyman.com/developer/17.0/guides/lexical-models/tutorial/step-3

We're just looking for a simple "word" + "frequency" per row. Nothing about the difference between word root + the form in the list, nothing about why the variation is what it is or what it represents... just a raw word + frequency.
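
As a rough sketch of getting from the current file to a plain word list, the snippet below keeps one column and drops duplicate rows. It assumes column 1 holds the word to list, which is exactly the open question above, so `word_column` may need changing. The source data carries no frequencies, so this produces words only; real counts would have to come from somewhere else, such as a corpus. `surface_forms` is a hypothetical name.

```python
import csv
import io


def surface_forms(lexicon_rows, word_column=0):
    """Yield each distinct entry from the chosen column, in first-seen order."""
    seen = set()
    for row in lexicon_rows:
        if len(row) <= word_column:
            continue  # skip blank or short rows
        word = row[word_column].strip()
        if word and word not in seen:
            seen.add(word)
            yield word


# Three rows taken from the excerpt above.
sample = (
    "aba\tabar\tVERB\tMood=Ind|Number=Sing|Person=3|Tense=Pres|VerbForm=Fin\n"
    "ababada\tababadar\tVERB\tMood=Imp|Number=Sing|Person=2|VerbForm=Fin\n"
    "ababada\tababadar\tVERB\tMood=Ind|Number=Sing|Person=3|Tense=Pres|VerbForm=Fin\n"
)
rows = csv.reader(io.StringIO(sample), delimiter="\t", quoting=csv.QUOTE_NONE)
words = list(surface_forms(rows))
print("\n".join(words))
```

For the real file, the same generator could be fed from `csv.reader(open(path, encoding="utf-8", newline=""), delimiter="\t", quoting=csv.QUOTE_NONE)`, with each word written out on its own line.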

Contributor

Now, about the line count.

A wordlist with this many entries will result in very sluggish performance at this time. We do have work in progress that will mitigate this quite significantly, but due to various constraints and a desire for something resembling timeliness for our release cycles, it won't be landing in our next release. You can track this upcoming feature here: keymanapp/keyman#12293. Some comparison details about the impact it will have can be seen here: keymanapp/keyman#12129 (comment).

Note that your TSV is on the same scale as the dotland.hy.armenian, 1.0.0 entry in the table found at that last link. This means you'll be hit hard by our current performance limitations, even if you get the data into the exact format we currently expect, should you keep every word entry currently in that list.
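
If you do decide not to keep every entry, one possible approach is to keep only the most frequent words. The sketch below assumes the list already has meaningful counts (the submitted file does not) and that you pick the cutoff yourself; nothing in this thread prescribes a particular size, so `limit` and the function name `keep_most_frequent` are hypothetical, and the entries are made-up samples.

```python
import heapq


def keep_most_frequent(entries, limit):
    """Return the `limit` highest-count (word, count) pairs, most frequent first."""
    return heapq.nlargest(limit, entries, key=lambda entry: entry[1])


entries = [("a", 5), ("b", 50), ("c", 1), ("d", 20)]
print(keep_most_frequent(entries, 2))  # [('b', 50), ('d', 20)]
```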

The link that kicked all of that off: https://community.software.sil.org/t/text-suggestion-not-working-in-some-cases/6490 - note the user's issues when they used their model at the same scale as your current wordlist. We have fixed a lot of the side issues, but the performance aspect is still pending.
