Mismatch between special verb composition and its incorporation with the standard general make #2
Comments
How did you generate those lists (exact command)? Which fst tool did you use? Foma or Hfst? |
I ran As input for one I used a FOMA-transformed version of |
The corresponding HFST command is:
hfst-fst2strings -X obey-flags src/fst/verb_lexicon.hfst
For whatever reason it is unbelievably slow. @flammie do you know why?
Ok. |
I cannot see anything obvious. Without flag handling it seems quite fast, though the output is very long, and there seem to be a lot of flags in the automaton. Maybe the flag-minding print collects all paths first and only then filters and prints? I'll see whether debug printing turns up something... |
Given that the correct set of pairings is around 500 cases, and that the incorrect set seems to allow all possible inner inflectional prefix chunks (of which there are some 20 types), the size of the longer list might simply be 500 x 20 to 30, i.e. roughly 10k to 15k. If this is the correct diagnosis, the crucial question now is where and how in the general GT compilation it might happen that the |
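The estimate above can be checked with a few lines of arithmetic. This is only a back-of-the-envelope sketch: the figures 500 and 20 to 30 are the rough numbers from the comment, while 486 and 16321 are the pair counts reported in the opening post.

```python
# Rough check of the overgeneration hypothesis: if every correct pairing
# were combined with every inner inflectional prefix chunk type, the list
# would grow by a factor equal to the number of chunk types.
correct_pairs = 500          # approximate size of the correct list
prefix_types = (20, 30)      # rough range of prefix chunk types

low, high = (correct_pairs * n for n in prefix_types)
print(low, high)             # 10000 15000

# Ratio actually observed between the two reported lists.
ratio = round(16321 / 486, 1)
print(ratio)                 # 33.6
```

The observed ratio is a little above the assumed range, which fits the diagnosis in order of magnitude but does not by itself confirm it.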
I have now run three different tests, to see whether there are meaningful differences between them. The tests are as follows:
hfst-fst2strings -X obey-flags src/fst/verb_lexicon.hfst > verbtest.txt
hfst-fst2strings -X obey-flags src/fst/lexicon.hfst > lextest.txt
hfst-fst2strings -X obey-flags src/analyser-gt-norm.hfst > ananormtest.txt
That is, extract all pair strings from the
The output is in practice identical in all cases:
wc -l verbtest.txt
486 verbtest.txt
grep -v 'PUNCT' lextest.txt | grep -v 'CLB' | wc -l
486
grep -v 'PUNCT' ananormtest.txt | grep -v 'CLB' | wc -l
962
The larger number for the last case is caused by automatic initial uppercasing, essentially doubling the count. When divided by 2, the result is 481, and if we assume a handful of non-casing initial letters, then the numbers add up perfectly. And in any case it is clear that this is far from the 16k+ reported in the opening comment. That is, using pure HFST, I see no issue at all. I thus suspect that the error is related to the conversion from HFST to FOMA. |
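The filtering and the "divide by 2" reasoning above can also be done in one step by collapsing case variants before counting. The following is a minimal sketch, not part of the GT infrastructure: the function name `count_pairs`, the sample strings, and the assumption that uppercased variants differ from their lowercase counterparts only in case are all hypothetical.

```python
def count_pairs(lines, skip_tags=("PUNCT", "CLB"), fold_case=False):
    """Count distinct pair strings, skipping lines that contain any of skip_tags.

    With fold_case=True, lines that differ only in letter case are counted once,
    which approximates undoing automatic initial uppercasing.
    Unlike `wc -l`, exact duplicate lines are also counted only once.
    """
    seen = set()
    for line in lines:
        line = line.strip()
        if not line or any(tag in line for tag in skip_tags):
            continue
        seen.add(line.casefold() if fold_case else line)
    return len(seen)


# Made-up placeholder lines, not real output from any of the FSTs.
sample = [
    "abc+V+Ind:xabc",
    "abc+V+Ind:Xabc",   # uppercased variant of the line above
    "def+V+Ind:ydef",
    ".+PUNCT:.",
    ",+CLB:,",
]
print(count_pairs(sample))                  # 3
print(count_pairs(sample, fold_case=True))  # 2
```

To apply it to one of the dumps, something like `count_pairs(open("ananormtest.txt", encoding="utf-8"), fold_case=True)` should give a figure directly comparable to the 486 from the verb lexicon.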
Hmmm... When I use
So I'm wondering what file is used as the source for the hfst-to-foma conversion? |
A further note is that the normative FOMA generator seems to work appropriately as well, cf.
So the glitch seems to be in the conversion of the normative FOMA analyzer. Maybe that is the specific place we should look into in the GT compilation? |
When compiling verb morphology using verb_lexicon.xfscript (which accesses the various stem, affix and other LEXC files), we get a well-behaved FST with some 486 analysis-form pairings, all of which appear to be correct (1.pairs.txt). But when we run make in the standard GT compilation, we end up with substantially more, 16321, analysis-form pairings, most of which are gibberish (2.pairs.txt).
@snomos Where might this go wrong? It is as if the TAMA flag diacritics are no longer applied when verb_lexicon.hfst is combined with the rest.
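To illustrate what "flag diacritics no longer applied" would look like, here is a toy simulation in Python. It is a sketch under stated assumptions, not a description of the actual lexicon: only the feature name TAMA comes from this issue, while the values `a`, `b`, `c`, the `pfx`/`stem` strings, and the use of the P (set) and R (require) operators are hypothetical choices for the example.

```python
import re
from itertools import product

# Matches flag diacritics of the form @P.FEATURE.VALUE@ or @R.FEATURE.VALUE@.
FLAG = re.compile(r"@([PR])\.([^.@]+)\.([^.@]+)@")


def obeys_flags(path):
    """Return True if every R (require) flag matches the value last set by a P flag."""
    state = {}
    for op, feature, value in FLAG.findall(path):
        if op == "P":
            state[feature] = value
        elif op == "R" and state.get(feature) != value:
            return False
    return True


def strip_flags(path):
    """Remove flag diacritics, leaving only the visible symbols."""
    return FLAG.sub("", path)


# Hypothetical prefixes set a TAMA value; hypothetical stems require one.
prefixes = [f"@P.TAMA.{t}@pfx{t}-" for t in ("a", "b", "c")]
stems = [f"@R.TAMA.{t}@stem{t}" for t in ("a", "b", "c")]
paths = [p + s for p, s in product(prefixes, stems)]

with_flags = [strip_flags(p) for p in paths if obeys_flags(p)]
without_flags = [strip_flags(p) for p in paths]
print(len(with_flags), len(without_flags))  # 3 9
```

When the flags are obeyed, only the matching prefix-stem combinations survive; when they are ignored or lost, every prefix combines with every stem, which is the kind of multiplication seen between the two pair lists.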