add lexical model for Portuguese #303

Robersongriz · 2025-02-03T15:32:27Z

No description provided.

keyman-server · 2025-02-03T15:36:19Z

Thank you for your pull request. You'll see a "build failed" message until the Keyman team has reviewed the pull request and manually initiated the build process.

Every change committed to this branch will become part of this pull request. When you have finished submitting files and are ready for the Keyman team to review this pull request, please post a "Ready for review" comment.

DavidLRowe · 2025-02-04T04:38:34Z

Thanks for this contribution. Unfortunately, there are some issues with the folder structure and the files.

To keep our automated build system working, we have a strict naming system for folders and files. portilexicon_ud.pt.portilexicon_ud is okay for a project name (though rather long, so you could replace the second 'portilexicon_ud' with something shorter, such as 'lex'). The first part of the name (up to the first '.') must also be the name of the folder that contains the project folder, and that folder in turn sits under the 'experimental' folder (or eventually the 'release' folder). So you have:
experimental/portilexicon_ud/portilexicon_ud.pt.portilexicon_ud
as the project folder. In that folder you'll have four files and the source folder:

HISTORY.md
LICENSE.md
README.md
portilexicon_ud.pt.portilexicon_ud.kpj
source

In that experimental/portilexicon_ud/portilexicon_ud.pt.portilexicon_ud/source folder you'll have:

portilexicon_ud.pt.portilexicon_ud.model.kps
portilexicon_ud.pt.portilexicon_ud.model.ts
readme.htm
welcome.htm
wordlist.tsv
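
The layout described above can be sketched as a small helper that derives the expected paths from the project name. This is only an illustration: the function name expected_layout and its stage parameter are made up for this example and are not part of any Keyman tooling.

```python
from pathlib import PurePosixPath


def expected_layout(model_id: str, stage: str = "experimental") -> dict:
    """Return the paths the naming rules above imply for a project name.

    Hypothetical helper for illustration only; not a Keyman tool.
    """
    # The containing folder is the text before the first '.' in the name.
    author = model_id.split(".", 1)[0]
    project = PurePosixPath(stage) / author / model_id
    source = project / "source"
    project_files = ("HISTORY.md", "LICENSE.md", "README.md", f"{model_id}.kpj")
    source_files = (
        f"{model_id}.model.kps",
        f"{model_id}.model.ts",
        "readme.htm",
        "welcome.htm",
        "wordlist.tsv",
    )
    return {
        "project": [str(project / name) for name in project_files],
        "source": [str(source / name) for name in source_files],
    }


layout = expected_layout("portilexicon_ud.pt.portilexicon_ud")
for path in layout["project"] + layout["source"]:
    print(path)
```

Comparing the printed paths against the files in the branch is a quick way to spot a misplaced folder before asking for review.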

The wordlist.tsv file is very long. GitHub gives the length as 1,226,339 lines. I don't think a file this long is practical for this lexical model process. (@jahorton Can you comment?)

The format of the .tsv file that is required is:
word<TAB>count<TAB>comment
where the count and comment fields are optional. This file, however, has:
word<TAB>word<TAB>part-of-speech...
so this will confuse the software that builds the lexical model.
The count field is an integer that gives an indication of how common the word is. (The larger the number the more likely it is to be offered as a suggestion.) If you omit the count field, then all words have a value of 1, and are equally likely.
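
A rough way to check a line against that format is sketched below. The function name check_wordlist_line is made up for this example, and the rules are an assumption drawn only from the description above (non-empty word, count made of digits when present); the actual model compiler may accept or reject more than this.

```python
def check_wordlist_line(line: str) -> bool:
    """True if the line looks like word[<TAB>count[<TAB>comment]].

    Assumption: the count, when present and non-empty, is written in
    ASCII digits. This is a sketch, not the compiler's real parser.
    """
    fields = line.rstrip("\r\n").split("\t")
    if not fields[0].strip():
        return False  # the word itself is required
    if len(fields) >= 2 and fields[1] != "":
        return fields[1].isascii() and fields[1].isdigit()
    return True


print(check_wordlist_line("casa\t120\tcommon noun"))  # expected format
print(check_wordlist_line("aba\tabar\tVERB"))  # second column is not a count
```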

Currently the welcome.htm file is not referenced in the package. You can use Keyman Developer, select the Packaging tab at the bottom, and the Details tab on the left to select the welcome.htm file.

Let me know if you have any questions.

jahorton · 2025-02-06T01:13:48Z

experimental/pt/portilexicon_ud/source/wordlist.tsv

That's quite the wordlist file there! That said... first and foremost, this isn't the input format we currently expect, as there's a lot of extra data here. I'll address that first, then the line count.

An excerpt:

aba abar VERB Mood=Ind|Number=Sing|Person=3|Tense=Pres|VerbForm=Fin
ababada ababadar VERB Mood=Imp|Number=Sing|Person=2|VerbForm=Fin
ababada ababadar VERB Mood=Ind|Number=Sing|Person=3|Tense=Pres|VerbForm=Fin
ababadada ababadado ADJ Gender=Fem|Number=Sing|VerbForm=Part
ababadada ababadar VERB Gender=Fem|Number=Sing|VerbForm=Part

This looks like data tailored to grammatical and/or morphological modeling with some other engine out there. We currently have no use for part-of-speech data or the other grammatical features (gender, number, verb form). I'm not even sure whether column 1 or 2 is the actual "word" being "listed". I get the feeling the other column is the root form of the word or similar, but I'm not familiar enough with Portuguese to make that call. (The root form is probably the second column, as the first column varies for singular vs plural when the second doesn't.)

I get the feeling that sort of data could be useful if and when we're able to provide morphological and/or agglutinative models... but we are not currently able to provide such functionality, and it will be a while before we are.

Here's a link talking about the format we expect for our TSV files: https://help.keyman.com/developer/17.0/guides/lexical-models/tutorial/step-3

We're just looking for a simple "word" + "frequency" per row. Nothing about the difference between word root + the form in the list, nothing about why the variation is what it is or what it represents... just a raw word + frequency.
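
As a sketch of how the current file could be reduced to that shape, the snippet below keeps only the first column and drops repeats. It rests on two assumptions that are not confirmed in this thread: that the columns are tab-separated and that column 1 is the form a user would actually type. The function name surface_forms is made up for this example. The source data carries no frequencies, so no count is written; as noted earlier in the thread, every word then gets a value of 1.

```python
def surface_forms(rows):
    """Yield each distinct first-column entry once, in first-seen order.

    Assumes tab-separated rows with the typed word in column 1.
    """
    seen = set()
    for row in rows:
        word = row.split("\t", 1)[0].strip()
        if word and word not in seen:
            seen.add(word)
            yield word


# Rows from the excerpt above, with tabs assumed as separators.
excerpt = [
    "aba\tabar\tVERB\tMood=Ind|Number=Sing|Person=3|Tense=Pres|VerbForm=Fin",
    "ababada\tababadar\tVERB\tMood=Imp|Number=Sing|Person=2|VerbForm=Fin",
    "ababada\tababadar\tVERB\tMood=Ind|Number=Sing|Person=3|Tense=Pres|VerbForm=Fin",
    "ababadada\tababadado\tADJ\tGender=Fem|Number=Sing|VerbForm=Part",
    "ababadada\tababadar\tVERB\tGender=Fem|Number=Sing|VerbForm=Part",
]
print("\n".join(surface_forms(excerpt)))
```

On the five excerpt rows this leaves three entries, since the repeated forms differ only in their grammatical annotation.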

Now, about the line count.

A wordlist with this many entries will result in very sluggish performance at this time. We do have work in progress that will mitigate this quite significantly, but due to various constraints and a desire for something resembling timeliness for our release cycles, it won't be landing in our next release. You can track this upcoming feature here: keymanapp/keyman#12293. Some comparison details about the impact it will have can be seen here: keymanapp/keyman#12129 (comment).

Note that your TSV is on the same scale as the dotland.hy.armenian, 1.0.0 entry in the table found at that last link. This means you'll be hit hard by our current performance limitations, even if you get the data into the exact format we currently expect, should you keep every word entry currently in that list.

The link that kicked all of that off: https://community.software.sil.org/t/text-suggestion-not-working-in-some-cases/6490 - note the user's issues when they used their model at the same scale as your current wordlist. We have fixed a lot of the side issues, but the performance aspect is still pending.
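
One possible way to shrink a list, once real frequency counts are available, is to keep only the most frequent entries. The sketch below assumes (word, count) pairs taken from some corpus; the current file has no counts, the function name top_entries is made up for this example, and the limit is a placeholder, since no safe size is stated in this thread.

```python
def top_entries(entries, limit):
    """Keep the `limit` highest-count (word, count) pairs.

    Ties are broken alphabetically so the result is deterministic.
    """
    ranked = sorted(entries, key=lambda entry: (-entry[1], entry[0]))
    return ranked[:limit]


# Invented counts, purely to show the call; not real corpus data.
sample = [("de", 900), ("aba", 2), ("casa", 120)]
for word, count in top_entries(sample, limit=2):
    print(f"{word}\t{count}")
```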

add PortiLexicon-UD

1bbba6b

jahorton reviewed Feb 6, 2025
