How to force lemmatizing as specific part of speech? #9479
iliakur started this conversation in Help: Best practices (1 comment, 4 replies)
-
You don't need to make a `Doc`, just use the existing token:

```python
lemmatizer = nlp.get_pipe("lemmatizer")

# Temporarily set the token's POS to VERB, look up the lemma, then restore it
orig_pos = doc[0].pos_
doc[0].pos_ = "VERB"
lemma = lemmatizer.rule_lemmatize(doc[0])[0]
doc[0].pos_ = orig_pos
# do stuff with lemma
```

It would be a bit cleaner and faster to extend the lemmatizer and override `rule_lemmatize`:

```python
from typing import List, Optional

from thinc.api import Model

from spacy.language import Language
from spacy.lang.en import English, EnglishLemmatizer
from spacy.tokens import Token


class GerundPlusLemmatizer(EnglishLemmatizer):
    def rule_lemmatize(self, token: Token) -> List[str]:
        # Lemmatize -ing nouns as if they were verbs
        if token.text.endswith("ing") and token.pos_ == "NOUN":
            orig_pos = token.pos_
            token.pos_ = "VERB"
            lemmas = super().rule_lemmatize(token)
            token.pos_ = orig_pos
        else:
            lemmas = super().rule_lemmatize(token)
        return lemmas


@English.factory(
    "gerund_plus_lemmatizer",
    assigns=["token.lemma"],
    default_config={"model": None, "mode": "rule", "overwrite": False},
    default_score_weights={"lemma_acc": 1.0},
)
def make_gerund_plus_lemmatizer(
    nlp: Language, model: Optional[Model], name: str, mode: str, overwrite: bool
):
    return GerundPlusLemmatizer(nlp.vocab, model, name, mode=mode, overwrite=overwrite)
```

This has the advantage of using the built-in lemmatizer cache and not having a component that references another component in the pipeline, which can make things brittle/complicated when saving/loading.

Remember to initialize the lemmatizer to load the tables: `nlp.add_pipe("gerund_plus_lemmatizer").initialize()`
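As a usage sketch (not from the original reply), assuming the `en_core_web_sm` pipeline is installed, the factory above has already been registered by running that snippet, and `spacy-lookups-data` is available so the rule tables can be loaded:

```python
import spacy

# Load a trained English pipeline without its stock lemmatizer, then add the
# custom component at the end so token.pos_ is already assigned when it runs.
nlp = spacy.load("en_core_web_sm", exclude=["lemmatizer"])
nlp.add_pipe("gerund_plus_lemmatizer").initialize()

doc = nlp("Dancing is my favorite hobby.")
print([(token.text, token.pos_, token.lemma_) for token in doc])
# If "Dancing" is tagged NOUN, the override lemmatizes it as a verb ("dance").
```

Excluding the stock `lemmatizer` keeps the custom component as the only one assigning `token.lemma`.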
-
For our application we need to be able to lemmatize English nominalized gerunds (e.g. "Dancing is my favorite hobby") as verbs.

Below is a snippet of how we're doing it now. I'm not including the `_is_gerund_noun` function because it's not relevant; assume it correctly identifies "dancing" from the example as a gerund. This approach seems hacky to me, especially the part where we create a new `Doc` out of just one token. Is there a better way?
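For illustration, a workaround of the kind described above, with `_is_gerund_noun` treated as an assumed helper, might look roughly like this (a sketch, not the original snippet):

```python
from spacy.tokens import Doc


def lemmatize_gerunds_as_verbs(nlp, doc):
    # Illustration only: build a throwaway one-token Doc for each gerund,
    # force its POS to VERB, and run the rule lemmatizer on that token.
    lemmatizer = nlp.get_pipe("lemmatizer")
    lemmas = []
    for token in doc:
        if _is_gerund_noun(token):  # assumed helper, not shown in the post
            one_token_doc = Doc(nlp.vocab, words=[token.text])
            one_token_doc[0].pos_ = "VERB"
            lemmas.append(lemmatizer.rule_lemmatize(one_token_doc[0])[0])
        else:
            lemmas.append(token.lemma_)
    return lemmas
```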