Lemmatizer in French not getting the right lemma for some Verbs. #7320

ioExpander · 2021-03-06T12:25:54Z

Hi. Here is an issue I'm getting using some French pipelines (fr_core_news_lg or fr_dep_news_trf).
As you can see it works in some cases but fetches the wrong lemma in some other cases.
So far I've only been able to reproduce the issue with verbs that all belong to the same group (the 'first group', ending in "er"). Not all of them are affected, though, as you can see in example 2.
The verbs are detected properly, even with the right tense. But the lemma is missing the trailing "r" in a lot of cases.

A quick lookup against a verb dictionary could work around the issue, but I would rather help fix the root cause here :)

Thank you.

How to reproduce the behaviour

import spacy
import fr_dep_news_trf

nlp = fr_dep_news_trf.load(exclude=["ner"])

# 1
doc = nlp("le chat dort dans son lit")
print(*[t.lemma_ for t in doc])  # Correct
# Output: le chat dormir dans son lit

# 2
doc = nlp("le chat mange des souris")
print(*[t.lemma_ for t in doc])  # Correct
# Output: le chat manger un souris

# 3
doc = nlp("le chat monte les escaliers")
print(*[t.lemma_ for t in doc])  # Incorrect
# Output: le chat monte le escalier
# Should be: le chat monter le escalier

# 4
doc = nlp("le chat saute haut")
print(*[t.lemma_ for t in doc])  # Incorrect
# Output: le chat saute haut
# Should be: le chat sauter haut

Info about spaCy

spaCy version: 3.0.3
Platform: Linux-4.19.112+-x86_64-with-Ubuntu-18.04-bionic
Python version: 3.7.10
Pipelines: fr_core_news_lg (3.0.0), fr_dep_news_trf (3.0.0)

adrianeboyd · 2021-03-08T15:42:36Z

Hi, it does look like there might be a rule for e -> er that's missing from the French lemmatizer rules:

https://github.com/explosion/spacy-lookups-data/blob/544a965501f06f55349e7402e80d6a49bc4cb3cd/spacy_lookups_data/data/fr_lemma_rules.json#L79-L125

My French is not that great, so I'm not sure whether this might cause problems for other verbs in some way, but you can add a rule to try it out like this:

import spacy

nlp = spacy.load("fr_core_news_sm")
nlp.get_pipe("lemmatizer").lookups.get_table("lemma_rules")["verb"] += [['e', 'er']]
assert [t.lemma_ for t in nlp("le chat monte les escaliers")] == ['le', 'chat', 'monter', 'le', 'escalier']

The lemmatizer depends on the POS annotation, so you still might see lemma errors that are caused by morphologizer errors rather than lemmatizer problems.
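To make the effect of the extra rule concrete, here is a minimal pure-Python sketch of suffix-rule rewriting. It assumes each rule is an [old_suffix, new_suffix] pair, as in the snippet above. The helper apply_suffix_rules and the toy rule list are hypothetical and far simpler than spaCy's actual lemmatizer, which also consults other lookup tables.

```python
def apply_suffix_rules(form, rules):
    """Return every candidate produced by rewriting a matching suffix.

    Hypothetical helper for illustration only; this is not spaCy's API.
    """
    candidates = []
    for old, new in rules:
        if form.endswith(old):
            candidates.append(form[: len(form) - len(old)] + new)
    return candidates


# Toy rule set, assumed for this example (not copied from fr_lemma_rules.json).
verb_rules = [["ons", "er"], ["ez", "er"]]

print(apply_suffix_rules("monte", verb_rules))  # no rule matches "monte" yet
verb_rules += [["e", "er"]]
print(apply_suffix_rules("monte", verb_rules))  # the added rule now applies
```

With no matching rule the form has no candidate lemma, which is consistent with "monte" coming back unchanged in the report above.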

ioExpander · 2021-03-10T11:46:19Z

Hi. Thank you for the feedback. I ran some tests using the additional lemma rule that you suggested. Indeed it seems to solve the issue in my examples.
I'm trying to figure out whether this rule can be generalized, or whether there are exceptions: French verb forms ending in -e whose infinitive does not end in -er. I'm also wondering why I did not hit the issue in example 2 (mange -> "manger" as a verb) even though the rule was missing.

I also found a strange issue when running the tests, as if the lemma inference were cached between sentences. If "monte" is first lemmatized (incorrectly) as "monte" and I then add the lemma rule [['e', 'er']], the next sentence still yields "monte" instead of "monter".
Will try to investigate more on this one too later today hopefully.

adrianeboyd · 2021-03-10T11:50:30Z

There is a lemmatizer cache that would cause this behavior. You can clear it (just by hand: nlp.get_pipe("lemmatizer").cache = {}) or save and reload the pipeline.
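The stale result can be reproduced with a toy memoizing lemmatizer. This is only a sketch of the general caching pattern: the ToyLemmatizer class and its (form, pos) cache key are assumptions made for illustration, not spaCy's internals.

```python
class ToyLemmatizer:
    """Hypothetical stand-in for a rule lemmatizer that memoizes results."""

    def __init__(self, rules):
        self.rules = rules  # list of [old_suffix, new_suffix] pairs
        self.cache = {}     # maps (form, pos) to a previously computed lemma

    def lemmatize(self, form, pos):
        key = (form, pos)
        if key in self.cache:
            return self.cache[key]  # the rules are not consulted again
        lemma = form
        for old, new in self.rules:
            if form.endswith(old):
                lemma = form[: len(form) - len(old)] + new
                break
        self.cache[key] = lemma
        return lemma


toy = ToyLemmatizer(rules=[])
before = toy.lemmatize("monte", "verb")  # computed and cached with the old rules
toy.rules += [["e", "er"]]
stale = toy.lemmatize("monte", "verb")   # still served from the cache
toy.cache = {}                           # manual reset, as suggested above
fresh = toy.lemmatize("monte", "verb")   # recomputed with the new rule
print(before, stale, fresh)
```

Any rule change made after a form has been seen stays invisible for that form until the cache is emptied.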

ioExpander · 2021-03-10T11:53:55Z

oh. cool. Thanks ! Will stop looking at that second one then !


ioExpander · 2021-03-11T14:29:08Z

Hi again.
I ran a few tests and could not find a single broken verb detection after adding the lemma rule. I'm not an expert, but I am a native French speaker, so here are a few examples. The additional rule fixes lemma inference for very basic verbs like "to jump" or "to climb", so I would add it to the code base. It is a single-line change, but I can open a PR if you want.

import spacy
import fr_dep_news_trf

nlp = fr_dep_news_trf.load(exclude=["ner"])
nlp.get_pipe("lemmatizer").lookups.get_table("lemma_rules")["verb"] += [['e', 'er']]

def test_verb(sentence, correct_value):
  nlp.get_pipe("lemmatizer").cache = {}
  doc = nlp(sentence)
  assert doc[2].pos_ == 'VERB'
  assert doc[2].lemma_ == correct_value
  print(f"OK - {correct_value}")

test_cases = [
    # [sentence, correct_value],
    ["le chat mange du pain", "manger"],
    ["le chat dormait dans son lit", "dormir"],
    ["le chat saute haut", "sauter"],
    ["la souris marche sur les cailloux", "marcher"],
    ["la souris ouvre les portes", "ouvrir"],
    ["la souris offre des cadeaux", "offrir"],
    ["le chat regarde le chien", "regarder"],
    ["le chat monte les escaliers", "monter"],
]

for sentence, correct_value in test_cases:
    test_verb(sentence, correct_value)

adrianeboyd · 2021-03-11T15:58:53Z

Sure, if you'd like to open a PR, please go ahead! We mainly test the lookup lemmatizers in the tests in that repo because we don't want to have to download pretrained pipelines for the test suite. You can construct docs and test by hand, but it's a bit of a pain. We could potentially add more lemmatizer tests to the spacy-models repo, which is what we use to test newly trained models before releasing.

e-nesse · 2021-04-12T22:33:56Z

Having tried this solution ( nlp.get_pipe("lemmatizer").lookups.get_table("lemma_rules")["verb"] += [['e', 'er']] ), I can report that it does not produce (only) the desired outcome. While it does fix many lemmatization errors for conjugated -ER verb forms, it also introduces errors in lemmatization of infinitive forms. Infinitives like "résoudre", "prendre", or "réduire" are assigned lemmas of "résoudrer", "prendrer", and "réduirer", respectively. Perhaps there's a way to account for verbs already in the infinitive form? I am not sure exactly how the lemmatization rules work, unfortunately - sorry! - but there might also be potential problems with subjunctive forms ending in -e (qu'on prenne, que je vienne, ...) being given non-existent lemmas (prenne --> prenner) if the verb form doesn't factor in to the rule somehow.
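The over-application described here can be checked without loading a pipeline. The sketch below applies only the e -> er rewrite to the forms mentioned in this comment. The naive_lemma helper is hypothetical, and the expected lemmas are ordinary French infinitives supplied by hand for comparison.

```python
E_TO_ER = ("e", "er")  # the rule under discussion


def naive_lemma(form, rule=E_TO_ER):
    """Apply a single suffix rule with no check on the verb form (toy helper)."""
    old, new = rule
    if form.endswith(old):
        return form[: len(form) - len(old)] + new
    return form


# (surface form, expected lemma) pairs taken from the examples in this thread.
cases = [
    ("monte", "monter"),
    ("résoudre", "résoudre"),
    ("prendre", "prendre"),
    ("réduire", "réduire"),
    ("prenne", "prendre"),
    ("vienne", "venir"),
]

wrong = [(form, naive_lemma(form)) for form, expected in cases
         if naive_lemma(form) != expected]
print(wrong)
```

Only "monte" survives the check. The infinitives and the subjunctive forms all end in -e, so a rule that looks at the suffix alone cannot tell them apart from first-group present-tense forms.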

ioExpander · 2021-04-13T08:09:15Z

Hi. Yes, I encountered the same issue a few days ago, which is why I have put off sending this PR to investigate further... The fix with the new lemma rule is really useful, but indeed it breaks more complex sentences like the one in the example below. Note that the verb is correctly recognized by the morphologizer as being in the infinitive form (VerbForm=Inf). So skipping the lemmatizer rules for verbs in the infinitive could be a way to go. I could not find an easy way to do it (besides doing it manually outside of spaCy).

import spacy
import fr_dep_news_trf

nlp = fr_dep_news_trf.load(exclude=["ner"])
nlp.get_pipe("lemmatizer").lookups.get_table("lemma_rules")["verb"] += [['e', 'er']]

doc = nlp("Je souhaite descendre dans la cave")
print([[t.lemma_, t.pos_, t.morph] for t in doc])

# => [['je', 'PRON', Number=Sing|Person=1], ['souhaiter', 'VERB', Mood=Ind|Number=Sing|Person=1|Tense=Pres|VerbForm=Fin], ['descendrer', 'VERB', VerbForm=Inf], ['dans', 'ADP', ], ['le', 'DET', Definite=Def|Gender=Fem|Number=Sing|PronType=Art], ['cave', 'NOUN', Gender=Fem|Number=Sing]]

adrianeboyd · 2021-04-13T08:21:47Z

The rule-based lemmatizer does have a mechanism for checking for forms like infinitives that are already lemmas and don't need to be processed further. There's not currently a check for French, but you can see what it looks like for English here:

spaCy/spacy/lang/en/lemmatizer.py

Lines 5 to 40 in ed561cf

    
class EnglishLemmatizer(Lemmatizer):
    """English lemmatizer. Only overrides is_base_form."""

    def is_base_form(self, token: Token) -> bool:
        """
        Check whether we're dealing with an uninflected paradigm, so we can
        avoid lemmatization entirely.

        univ_pos (unicode / int): The token's universal part-of-speech tag.
        morphology (dict): The token's morphological features following the
            Universal Dependencies scheme.
        """
        univ_pos = token.pos_.lower()
        morphology = token.morph.to_dict()
        if univ_pos == "noun" and morphology.get("Number") == "Sing":
            return True
        elif univ_pos == "verb" and morphology.get("VerbForm") == "Inf":
            return True
        # This maps 'VBP' to base form -- probably just need 'IS_BASE'
        # morphology
        elif univ_pos == "verb" and (
            morphology.get("VerbForm") == "Fin"
            and morphology.get("Tense") == "Pres"
            and morphology.get("Number") is None
        ):
            return True
        elif univ_pos == "adj" and morphology.get("Degree") == "Pos":
            return True
        elif morphology.get("VerbForm") == "Inf":
            return True
        elif morphology.get("VerbForm") == "None":
            return True
        elif morphology.get("Degree") == "Pos":
            return True
        else:
            return False

All you would need to do is add a similar is_base_form method to FrenchLemmatizer. It could work similarly and then as long as the tagger/morphologizer was correct (which is a bit of a caveat for most of the rules, of course), then you could skip infinitives with is_base_form and the new rule would only apply to finite verbs.
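As a rough sketch of that idea, the check below is written as a plain function so it can run without a pipeline. The names is_base_form_fr and lemmatize_verb are hypothetical. In a real FrenchLemmatizer.is_base_form the two inputs would come from token.pos_ and token.morph.to_dict(), as in the English code above. Treating AUX like VERB is an assumption of this example.

```python
def is_base_form_fr(pos, morphology):
    """Hypothetical French check: an infinitive verb is already its own lemma.

    pos: universal POS tag as a string, e.g. "VERB".
    morphology: dict of Universal Dependencies features.
    """
    # Assumption: auxiliaries are handled the same way as main verbs.
    return pos.lower() in ("verb", "aux") and morphology.get("VerbForm") == "Inf"


def lemmatize_verb(form, pos, morphology, rules):
    """Toy rule lemmatizer that skips the suffix rules for base forms."""
    if is_base_form_fr(pos, morphology):
        return form  # infinitive: return unchanged, no rule is applied
    for old, new in rules:
        if form.endswith(old):
            return form[: len(form) - len(old)] + new
    return form


rules = [["e", "er"]]
finite = {"Mood": "Ind", "Number": "Sing", "Person": "1",
          "Tense": "Pres", "VerbForm": "Fin"}

print(lemmatize_verb("descendre", "VERB", {"VerbForm": "Inf"}, rules))
print(lemmatize_verb("souhaite", "VERB", finite, rules))
```

In this toy model the gate protects infinitives only. A finite subjunctive such as "prenne" would still be rewritten by the e -> er rule, so that concern would need separate handling.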

svlandeg added the feat / lemmatizer, lang / fr, and perf / accuracy labels Mar 6, 2021

polm added the help wanted label Jul 27, 2021
