How to force lemmatizing as specific part of speech? #9479
iliakur started this conversation in Help: Best practices (1 comment, 4 replies)
-
You don't need to make a `Doc`, just use the existing token:

```python
lemmatizer = nlp.get_pipe("lemmatizer")

# Temporarily set the token's POS to VERB, look up the lemma, then restore it
orig_pos = doc[0].pos_
doc[0].pos_ = "VERB"
lemma = lemmatizer.rule_lemmatize(doc[0])[0]
doc[0].pos_ = orig_pos
# do stuff with lemma
```

It would be a bit cleaner and faster to extend the lemmatizer and override `rule_lemmatize`:

```python
from typing import List, Optional

from thinc.api import Model

from spacy.language import Language
from spacy.lang.en import English, EnglishLemmatizer
from spacy.tokens import Token


class GerundPlusLemmatizer(EnglishLemmatizer):
    def rule_lemmatize(self, token: Token) -> List[str]:
        # Lemmatize -ing nouns as if they were verbs
        if token.text.endswith("ing") and token.pos_ == "NOUN":
            orig_pos = token.pos_
            token.pos_ = "VERB"
            lemmas = super().rule_lemmatize(token)
            token.pos_ = orig_pos
        else:
            lemmas = super().rule_lemmatize(token)
        return lemmas


@English.factory(
    "gerund_plus_lemmatizer",
    assigns=["token.lemma"],
    default_config={"model": None, "mode": "rule", "overwrite": False},
    default_score_weights={"lemma_acc": 1.0},
)
def make_gerund_plus_lemmatizer(
    nlp: Language, model: Optional[Model], name: str, mode: str, overwrite: bool
):
    return GerundPlusLemmatizer(nlp.vocab, model, name, mode=mode, overwrite=overwrite)
```

This has the advantage of using the built-in lemmatizer cache and not having a component that references another component in the pipeline, which can make things brittle/complicated when saving/loading.

Remember to initialize the lemmatizer to load the tables: `nlp.add_pipe("gerund_plus_lemmatizer").initialize()`
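As a usage sketch (not from the original reply), assuming the `en_core_web_sm` pipeline is installed, the factory above has already been registered by running that snippet, and `spacy-lookups-data` is available so the rule tables can be loaded:

```python
import spacy

# Load a trained English pipeline without its stock lemmatizer, then add the
# custom component at the end so token.pos_ is already assigned when it runs.
nlp = spacy.load("en_core_web_sm", exclude=["lemmatizer"])
nlp.add_pipe("gerund_plus_lemmatizer").initialize()

doc = nlp("Dancing is my favorite hobby.")
print([(token.text, token.pos_, token.lemma_) for token in doc])
# If "Dancing" is tagged NOUN, the override lemmatizes it as a verb ("dance").
```

Excluding the stock `lemmatizer` keeps the custom component as the only one assigning `token.lemma`.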
-
For our application we need to be able to lemmatize English nominalized gerunds (e.g. "Dancing is my favorite hobby") as verbs.

Below is a snippet of how we're doing it now. I'm not including the `_is_gerund_noun` function because it's not relevant; assume it correctly identifies "dancing" from the example as a gerund. This approach seems hacky to me, especially the part where we create a new `Doc` out of just one token. Is there a better way?
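For illustration, a workaround of the kind described above, with `_is_gerund_noun` treated as an assumed helper, might look roughly like this (a sketch, not the original snippet):

```python
from spacy.tokens import Doc


def lemmatize_gerunds_as_verbs(nlp, doc):
    # Illustration only: build a throwaway one-token Doc for each gerund,
    # force its POS to VERB, and run the rule lemmatizer on that token.
    lemmatizer = nlp.get_pipe("lemmatizer")
    lemmas = []
    for token in doc:
        if _is_gerund_noun(token):  # assumed helper, not shown in the post
            one_token_doc = Doc(nlp.vocab, words=[token.text])
            one_token_doc[0].pos_ = "VERB"
            lemmas.append(lemmatizer.rule_lemmatize(one_token_doc[0])[0])
        else:
            lemmas.append(token.lemma_)
    return lemmas
```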