New model for unsupported language (Albanian: sq) #1360
Comments
Random request: this is really hard to read - please check the formatting of the stack traces next time.
Try adding …
… a note on how to develop a new language's Pipeline. Related to #1360
Many thanks @AngledLuffa, now it works. And sorry about the awful formatting :(
Unfortunately, the new option and/or the dev branch doesn't seem to work. If I load models using the config dictionary, I get the following error - but the pos processor is actually loaded!
│ ├── sq_nel_nocharlm_parser_checkpoint.pt
I can see that you're loading the POS model first, before the depparse. Sanity check first - is the POS model labeling either upos or xpos? If somehow it was trained to only label the features, I could see it throwing this kind of error. Otherwise, it really looks from the code like this particular error should only happen if both upos and xpos are missing for a word.
If the POS model should be working, what happens if you run the pipeline without the depparse and print out the results? Are there any sentences for which the POS is actually missing? I wonder if that can happen if the POS model has blank tags in the dataset it's learning from.
Many thanks for the detailed answer! This is really strange: I have tried to load the pipeline as I do in the script and it worked correctly on a few sentences. I have also tried to pass the script a small txt file with some sentences and it worked too.
If it "missed" things and was incorrect, that's one thing, but I do very much wonder why it would label anything this way. Are you able to send the training data plus the data you are trying to test on, or maybe just the model and the test data? I'd really like to see it in action myself to debug this issue. Another possible debugging step would be to examine the output of just the tokenizer and the POS without any of the subsequent models and check for any words which are missing both xpos and upos.
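The debugging step suggested here - running only the tokenizer and POS tagger and checking for words with neither tag - could be sketched roughly as follows. `find_untagged` is an illustrative helper, not part of Stanza's API, and the stub `Word`/`Sentence` classes stand in for Stanza's document objects so the sketch runs without a trained model:

```python
# Sketch: list every word that is missing BOTH upos and xpos.
# Stub classes mimic Stanza's Sentence/Word attributes for illustration.
from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class Word:
    text: str
    upos: Optional[str] = None
    xpos: Optional[str] = None

@dataclass
class Sentence:
    words: List[Word] = field(default_factory=list)

def find_untagged(sentences):
    """Return (sentence_index, word_text) pairs for words with neither tag."""
    return [(i, w.text)
            for i, s in enumerate(sentences)
            for w in s.words
            if w.upos is None and w.xpos is None]

# With a real pipeline this check would run on something like
#   nlp = stanza.Pipeline("sq", processors="tokenize,pos", ...)
#   find_untagged(nlp("some text").sentences)
sents = [Sentence([Word("Unë", upos="PRON"), Word('"')])]
print(find_untagged(sents))  # [(0, '"')]
```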
Fascinating. I ran an experiment on English with DET/DT replaced with blanks. Apparently, giving the tagger empty tags for the POS tag results in it labeling words with None as tags. This must be what's happening to you - there are entries in your training data which don't have either UPOS or XPOS. Is this something you want to fix on your end? Maybe the tagger is supposed to ignore those items, or learn to tag them with a blank tag?
... to be more precise, it IS learning to tag words without tags with the blank tag.
The thing is, I have already used this data to train a model two or three times last November and it worked fine. I have just added a few sentences to teach the parser to recognize MWTs like Albanian ta = të + e.
It will successfully train a tagger even if there are empty tags. However, it has learned to recognize some words as having the empty tag, and that's the label the tagger gives those words. Did I express that clearly? I did the following experiment: in English, I changed all instances of DET/DT to blank tags. Now the tagger I trained labels those words with the blank tag. I think it might make more sense to either throw an error when training a tagger on a partially complete file, or possibly treat single blank tags as masked out. Learning to recognize the blank tag doesn't seem very useful... In the meantime, if you find and eliminate those blank tags from your dataset, I believe this error will go away.
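The cleanup step suggested here - finding the blank tags in the dataset - could be sketched as a small scan over the CoNLL-U training file. The column layout (UPOS in column 4, XPOS in column 5, `_` for absent) follows the CoNLL-U format; the sample data below is invented for illustration:

```python
# Sketch: report token lines in CoNLL-U data where both UPOS and XPOS are "_".

def blank_tag_lines(conllu_text):
    """Yield (line_number, form) for tokens with '_' in both tag columns."""
    for n, line in enumerate(conllu_text.splitlines(), start=1):
        if not line or line.startswith("#"):
            continue
        cols = line.split("\t")
        # Skip malformed lines and multi-word token ranges like "3-4",
        # which never carry POS tags in CoNLL-U
        if len(cols) < 5 or "-" in cols[0]:
            continue
        if cols[3] == "_" and cols[4] == "_":
            yield n, cols[1]

sample = (
    "# text = test\n"
    "1\tUnë\tunë\tPRON\t_\t_\t2\tnsubj\t_\t_\n"
    '2\t"\t"\t_\t_\t_\t0\troot\t_\t_\n'
)
print(list(blank_tag_lines(sample)))  # [(3, '"')]
```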
ok, I have successfully parsed a file with just the pos tagging. Indeed, there are some tokens without UPOS. Actually, just one i.e., the stupid " punctuation 🔝 |
Indeed. I just need to figure out what the right approach is. The two leading candidates in my mind are to stop the tagger from training if there are blank UPOS, so as to give the user a chance to go back and fix the issue, or to treat the blanks as unlabeled tokens in the tagger which don't get a label of any kind. The second one is more appealing to me ideologically, but the problem is that in a case similar to yours where maybe all the punctuation was unlabeled, then they would all get tagged with the most likely known tag at test time (perhaps NOUN, for example). If you have an alternate suggestion, happy to hear it.
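The first candidate behavior - refusing to train on partially labeled data - might look something like the following sketch. `check_tags` is a hypothetical helper, not Stanza's actual validation code:

```python
# Sketch: raise before training if any word carries neither a UPOS nor an
# XPOS tag, so the user can go back and fix the dataset.

def check_tags(tagged_words):
    """Raise ValueError if any word has neither a UPOS nor an XPOS tag.

    `tagged_words` is a list of (form, upos, xpos) tuples, where None,
    "" or "_" all mean the tag is absent.
    """
    blank = {None, "", "_"}
    bad = [form for form, upos, xpos in tagged_words
           if upos in blank and xpos in blank]
    if bad:
        raise ValueError(f"Training data has words with no POS tag: {bad}")

check_tags([("Unë", "PRON", "_")])   # passes silently
try:
    check_tags([('"', "_", "_")])
except ValueError as e:
    print(e)
```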
I have corrected the dataset, retrained the model and now the parser works fine. |
…ly labeled training data was causing problems when training a non-UD dataset #1360
This error message is now part of the 1.8.2 release. Is there anything else you need addressed? |
great! thank you, everything looks good! |
@rahonalab I'm wondering - there is only a very small Albanian UD dataset on universaldependencies.org, and I don't see any planned Albanian expansions. Can I ask what dataset you used for this? If there is any publicly available data (larger than the UD dataset), we could add this language as a standard language to Stanza.
Hello! I have used two datasets which we plan to release as UD treebanks soon. I'll keep you posted.
That would be excellent! Looking forward to it. |
Sorry for the double bug report.
Can you please tell me the right procedure to load a model for a language that is not currently supported, i.e., Albanian (sq)?
I have tried the following two things:
pipeline = stanza.Pipeline("sq", dir="DIR_TO_THE_MODEL", download_method=None)
It doesn't work:
2024-03-02 15:25:18 WARNING: Unsupported language: sq.
Traceback (most recent call last):
  File "/tools/ud-stanza-other.py", line 149, in <module>
    main()
  File "/tools/ud-stanza-other.py", line 105, in main
    nlp = stanza.Pipeline(**config, logging_level="DEBUG")
          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/site-packages/stanza/pipeline/core.py", line 268, in __init__
    logger.info(f'Loading these models for language: {lang} ({lang_name}):\n{load_table}')
UnboundLocalError: cannot access local variable 'lang_name' where it is not associated with a value
config = {
    # Language code for the language to build the Pipeline in
    'lang': 'sq',
    # Processor-specific arguments are set with keys "{processor_name}_{argument_name}"
    # You only need model paths if you have a specific model outside of stanza_resources
    'tokenize_model_path': '/corpus/models/stanza/sq/tokenize/sq_nel_tokenizer.pt',
    'pos_model_path': '/corpus/models/stanza/sq/pos/sq_nel_tagger.pt',
    'lemma_model_path': '/corpus/models/stanza/sq/lemma/sq_nel_lemmatizer.pt',
    'depparse_model_path': '/corpus/models/stanza/sq/depparse/sq_nel_parser.pt',
    'pos_pretrain_path': '/corpus/models/stanza/sq/pretrain/sq_fasttext.pretrain.pt',
    'depparse_pretrain_path': '/corpus/models/stanza/sq/pretrain/sq_fasttext.pretrain.pt',
}
But, again, it doesn't work:
2024-03-02 16:00:25 WARNING: Unsupported language: sq.
Traceback (most recent call last):
  File "/tools/ud-stanza-other.py", line 149, in <module>
    main()
  File "/tools/ud-stanza-other.py", line 105, in main
    nlp = stanza.Pipeline(**config, logging_level="DEBUG")
          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/site-packages/stanza/pipeline/core.py", line 268, in __init__
    logger.info(f'Loading these models for language: {lang} ({lang_name}):\n{load_table}')
UnboundLocalError: cannot access local variable 'lang_name' where it is not associated with a value
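This UnboundLocalError is the classic pattern of a local variable that is only assigned inside a conditional branch but read unconditionally afterwards. A minimal reproduction of the pattern (an illustrative sketch only, not Stanza's actual code):

```python
# Sketch: `lang_name` is only bound when the language code is recognized,
# so an unsupported code like "sq" hits the f-string with it unbound.
SUPPORTED = {"en": "English", "de": "German"}

def describe(lang):
    if lang in SUPPORTED:
        lang_name = SUPPORTED[lang]   # only bound on this branch
    # For an unsupported code, lang_name was never assigned:
    return f"Loading models for language: {lang} ({lang_name})"

print(describe("en"))
try:
    print(describe("sq"))
except UnboundLocalError as e:
    print("UnboundLocalError:", e)
```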
As a workaround, I have used the language code of a supported language, but it's not ideal, as it might load other models...
Thanks!
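The workaround mentioned above - borrowing a supported language code while pointing every processor at the Albanian model files - might look like the following sketch. The choice of "en" as the placeholder code and the paths are illustrative assumptions:

```python
# Illustrative workaround config: a supported placeholder language code,
# explicit model paths for the Albanian models, and download_method=None
# so nothing extra is fetched. Paths are examples only.
config = {
    'lang': 'en',  # placeholder code for an officially supported language
    'download_method': None,
    'tokenize_model_path': '/corpus/models/stanza/sq/tokenize/sq_nel_tokenizer.pt',
    'pos_model_path': '/corpus/models/stanza/sq/pos/sq_nel_tagger.pt',
}
# With Stanza installed:
#   nlp = stanza.Pipeline(**config)
```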