Lemmatizer in French not getting the right lemma for some Verbs. #7320
Comments
Hi, it does look like there might be a rule for My French is not that great, so I'm not sure whether this might cause problems for other verbs in some way, but you can add a rule to try it out like this:
import spacy
nlp = spacy.load("fr_core_news_sm")
nlp.get_pipe("lemmatizer").lookups.get_table("lemma_rules")["verb"] += [['e', 'er']]
assert [t.lemma_ for t in nlp("le chat monte les escaliers")] == ['le', 'chat', 'monter', 'le', 'escalier']
The lemmatizer depends on the POS annotation, so you still might see lemma errors that are caused by |
Hi. Thank you for the feedback. I ran some tests using the additional lemma rule that you suggested. Indeed it seems to solve the issue in my examples. I did also find a strange issue when running the tests, as if the lemma inference were cached between different sentences. So if I recognize the verb "monte" (which is incorrect) first, and then add the lemma_rule |
There is a lemmatizer cache that would cause this behavior. You can clear it (just by hand: |
oh. cool. Thanks ! Will stop looking at that second one then !
|
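To illustrate the caching behaviour discussed above, here is a minimal toy sketch in plain Python. It is not spaCy's implementation: the ToyLemmatizer class, its (word, pos) cache key, and the placeholder "ons" rule are all assumptions made up for illustration. It only shows why a rule added after a word has already been lemmatized has no effect until the cache is reset, which is what the test script later in this thread does with .cache = {}.

```python
class ToyLemmatizer:
    """Hypothetical stand-in for a rule lemmatizer that memoizes its results."""

    def __init__(self, rules):
        self.rules = rules  # list of [old_suffix, new_suffix] pairs
        self.cache = {}     # (word, pos) -> lemma

    def lemmatize(self, word, pos):
        key = (word, pos)
        if key in self.cache:
            # A cached answer wins, even if the rules changed in the meantime.
            return self.cache[key]
        lemma = word
        for old, new in self.rules:
            if word.endswith(old):
                lemma = word[: len(word) - len(old)] + new
                break
        self.cache[key] = lemma
        return lemma


toy = ToyLemmatizer(rules=[["ons", "er"]])  # placeholder rule, does not match "monte"
before = toy.lemmatize("monte", "verb")     # no rule matches, result is cached
toy.rules += [["e", "er"]]                  # add the new rule afterwards
stale = toy.lemmatize("monte", "verb")      # still served from the cache
toy.cache = {}                              # reset the cache by hand
fresh = toy.lemmatize("monte", "verb")      # the new rule is applied now
print(before, stale, fresh)
```

In this toy version the second call returns the old result because the lookup never reaches the rules; only the call after the reset does.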
Hi again.
import spacy
import fr_dep_news_trf

nlp = fr_dep_news_trf.load(exclude=["ner"])
nlp.get_pipe("lemmatizer").lookups.get_table("lemma_rules")["verb"] += [['e', 'er']]

def test_verb(sentence, correct_value):
    # Reset the lemmatizer cache so earlier results do not leak into this test
    nlp.get_pipe("lemmatizer").cache = {}
    doc = nlp(sentence)
    assert doc[2].pos_ == 'VERB'
    assert doc[2].lemma_ == correct_value
    print(f"OK - {correct_value}")

test_cases = [
    # [sentence, correct_value],
    ["le chat mange du pain", "manger"],
    ["le chat dormait dans son lit", "dormir"],
    ["le chat saute haut", "sauter"],
    ["la souris marche sur les cailloux", "marcher"],
    ["la souris ouvre les portes", "ouvrir"],
    ["la souris offre des cadeaux", "offrir"],
    ["le chat regarde le chien", "regarder"],
    ["le chat monte les escaliers", "monter"],
]
for t in test_cases:
    test_verb(*t)
|
Sure, if you'd like to open a PR, please go ahead! We mainly test the lookup lemmatizers in the tests in that repo because we don't want to have to download pretrained pipelines for the test suite. You can construct docs and test by hand, but it's a bit of a pain. We could potentially add more lemmatizer tests to the |
Having tried this solution ( nlp.get_pipe("lemmatizer").lookups.get_table("lemma_rules")["verb"] += [['e', 'er']] ), I can report that it does not produce (only) the desired outcome. While it does fix many lemmatization errors for conjugated -ER verb forms, it also introduces errors in lemmatization of infinitive forms. Infinitives like "résoudre", "prendre", or "réduire" are assigned lemmas of "résoudrer", "prendrer", and "réduirer", respectively. Perhaps there's a way to account for verbs already in the infinitive form? I am not sure exactly how the lemmatization rules work, unfortunately - sorry! - but there might also be potential problems with subjunctive forms ending in -e (qu'on prenne, que je vienne, ...) being given non-existent lemmas (prenne --> prenner) if the verb form doesn't factor in to the rule somehow. |
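The over-generation described above can be reproduced with a small sketch in plain Python. This is a simplified model of suffix-rewrite rules, not spaCy's actual lemmatizer code: the apply_suffix_rules helper is hypothetical, and it ignores everything the real pipeline does around the rules (POS tags, lookup tables, exceptions).

```python
def apply_suffix_rules(word, rules):
    """Return every candidate produced by [old_suffix, new_suffix] rules.

    Hypothetical helper: if no rule matches, the word itself is returned.
    """
    candidates = []
    for old, new in rules:
        if word.endswith(old):
            candidates.append(word[: len(word) - len(old)] + new)
    return candidates or [word]


rule = [["e", "er"]]
for form in ["monte", "prendre", "résoudre", "réduire", "prenne"]:
    print(form, "->", apply_suffix_rules(form, rule))
```

Because the rule only looks at the final "e", it cannot tell a conjugated first-group form ("monte") from an infinitive ("prendre") or a subjunctive ("prenne"), so all of them get "er" appended.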
Hi. Yes, I encountered the same issue a few days ago, which is why I have put off sending this PR to investigate further. The fix with the new lemma rule is really useful, but it does break more complex sentences like the one in the example below. Note that the verb is correctly recognized by the morph as being in the infinitive form (VerbForm=Inf). So skipping the lemmatizer rules for verbs in the infinitive could be a way to go. I could not find an easy way to do it (besides doing it manually outside of spaCy).
import spacy
import fr_dep_news_trf

nlp = fr_dep_news_trf.load(exclude=["ner"])
nlp.get_pipe("lemmatizer").lookups.get_table("lemma_rules")["verb"] += [['e', 'er']]
doc = nlp("Je souhaite descendre dans la cave")
print([[t.lemma_, t.pos_, t.morph] for t in doc])
# => [['je', 'PRON', Number=Sing|Person=1], ['souhaiter', 'VERB', Mood=Ind|Number=Sing|Person=1|Tense=Pres|VerbForm=Fin], ['descendrer', 'VERB', VerbForm=Inf], ['dans', 'ADP', ], ['le', 'DET', Definite=Def|Gender=Fem|Number=Sing|PronType=Art], ['cave', 'NOUN', Gender=Fem|Number=Sing]]
|
The rule-based lemmatizer does have a mechanism for checking for forms like infinitives that are already lemmas and don't need to be processed further. There's not currently a check for French, but you can see what it looks like for English here: spaCy/spacy/lang/en/lemmatizer.py Lines 5 to 40 in ed561cf
All you would need to do is add a similar |
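As a rough illustration of the idea, here is a minimal sketch in plain Python of a base-form check that skips the suffix rules for infinitives. Everything in it is an assumption for illustration: the is_base_form and lemmatize_verb helpers are hypothetical, and the morphological features are passed as a plain dict rather than taken from a spaCy token, so the real check would need to be adapted to the lemmatizer's own interface.

```python
def is_base_form(pos, morph):
    """Hypothetical check: True if the token is already an infinitive verb."""
    return pos.upper() in {"VERB", "AUX"} and morph.get("VerbForm") == "Inf"


def lemmatize_verb(word, pos, morph, rules):
    """Apply [old_suffix, new_suffix] rules unless the form is already a lemma."""
    if is_base_form(pos, morph):
        return word.lower()  # infinitives are returned unchanged
    for old, new in rules:
        if word.endswith(old):
            return word[: len(word) - len(old)] + new
    return word


rules = [["e", "er"]]
print(lemmatize_verb("descendre", "VERB", {"VerbForm": "Inf"}, rules))
print(lemmatize_verb("souhaite", "VERB", {"Mood": "Ind", "VerbForm": "Fin"}, rules))
```

A check like this only covers infinitives. Finite forms ending in -e that do not belong to the first group, such as the subjunctive "prenne" mentioned earlier in the thread, would still be rewritten by the rule.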
Hi. Here is an issue I'm getting using some French pipelines (fr_core_news_lg or fr_dep_news_trf).
As you can see it works in some cases but fetches the wrong lemma in some other cases.
So far I've only been able to reproduce the issue with some verbs that all are from the same group (called 'first group' - ending in "er"). But not all of them have the issue as you can see in example 2.
The verbs are detected properly, even with the right tense. But the lemma is missing the trailing "r" in a lot of cases.
A quick lookup against a verb dictionary could work around the issue, but I would rather help fix the root cause here :)
Thank you.
How to reproduce the behaviour
Info about spaCy