It's sometimes difficult to initialize pipeline components in code #7027

Open
honnibal opened this issue Feb 11, 2021 · 1 comment
Labels
enhancement Feature requests and improvements feat / pipeline Feature: Processing pipeline and components feat / ux Feature: User experience, error messages etc.

Comments

@honnibal
Member

honnibal commented Feb 11, 2021

The workflow for setting up a pipeline component in code sometimes feels a bit rough. This came up while I was investigating #6958.

Let's say we have some pipeline component that assumes its .initialize() method will be called before it's in a valid state, as the transformer does, but that doesn't necessarily need to be trained before it's functional. We have the following:

import spacy

nlp = spacy.blank("en")
transformer = nlp.add_pipe("transformer")

So now we need to call transformer.initialize(). How to do that?

  • Maybe I should use nlp.initialize()? That does work, but if I'm adding the component to a pipeline that already has other components, I'll have problems, because it will wipe their weights.
  • Maybe I should use nlp.resume_training()? That seemed like it ought to work, even though it's not the most obvious choice. It doesn't, though: it doesn't call .initialize() on the components, because it can't know which weights that would reset.
  • Okay so maybe I should call transformer.initialize(get_examples=lambda: [], nlp=nlp). However, this runs into an error in validate_get_examples, which complains the list is empty. The component does support an empty list though.
  • transformer.initialize(nlp=nlp)? This doesn't work, even though the docstring refers to it as an "optional get_examples callback".
  • Okay so what I need to do is construct at least one Example object, so that I can return it in get_examples. Kind of a hassle.
  • Alternatively I could be sneaky and do transformer.model.initialize(). This happens to work here, but it wouldn't if the component required any other initialization, so it's not a generalizable solution.

A quick improvement is to add an argument to validate_get_examples indicating whether the component can work with no examples. I'm not sure how to help components that do need some data though.
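
A rough sketch of what such an argument might look like, written as a standalone function rather than against spaCy's actual source. The function name, the allow_empty flag, and the error messages are all assumptions for illustration; the real validate_get_examples may check more than this.

```python
from typing import Any, Callable, Iterable, List


def validate_get_examples_sketch(
    get_examples: Callable[[], Iterable[Any]],
    method: str,
    allow_empty: bool = False,  # hypothetical flag, not part of spaCy's API
) -> List[Any]:
    """Simplified stand-in for a get_examples check with an opt-out for empty data."""
    # Reject anything that is not a callable returning examples.
    if get_examples is None or not callable(get_examples):
        raise TypeError(
            f"{method} expects a callable get_examples, got {type(get_examples)}"
        )
    examples = list(get_examples())
    # Only complain about an empty result if the component needs data.
    if not examples and not allow_empty:
        raise TypeError(f"{method} received no examples from get_examples")
    return examples
```

A component that can work without data would pass allow_empty=True from its own initialize(), while components that do need data would keep the stricter default.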

Maybe some components should check whether they're initialized, and do that on first usage if necessary? It does feel dirty, though.
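
To make the lazy option concrete, here is a toy sketch of a component that initializes itself on first use. LazyComponent is a hypothetical class, not a spaCy one, and its initialize() is only a placeholder for whatever setup a real component would do (loading weights, lookup tables and so on).

```python
class LazyComponent:
    """Toy component that initializes itself on first call (hypothetical)."""

    def __init__(self) -> None:
        self.initialized = False
        self.init_calls = 0  # counts how often initialize() actually ran

    def initialize(self) -> None:
        # Placeholder for real setup work such as loading weights or tables.
        self.init_calls += 1
        self.initialized = True

    def __call__(self, doc: str) -> str:
        # Initialize on first use if nobody did it explicitly.
        if not self.initialized:
            self.initialize()
        return doc


component = LazyComponent()
component("first text")
component("second text")
```

The cost is visible even in the toy version: every call pays for the check, and the first call hides potentially expensive setup, which is part of why it feels dirty.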

@honnibal honnibal changed the title It's kind of difficult to initialize pipeline components in code It's sometimes difficult to initialize pipeline components in code Feb 11, 2021
@honnibal honnibal added the usage General spaCy usage label Feb 11, 2021
@ines ines added enhancement Feature requests and improvements feat / pipeline Feature: Processing pipeline and components feat / ux Feature: User experience, error messages etc. and removed usage General spaCy usage labels Feb 11, 2021
@adrianeboyd
Contributor

The lemmatizer has the same issue with its lookup tables. It doesn't call validate_get_examples, though; it just ignores get_examples, so you can call nlp.get_pipe("lemmatizer").initialize(). The warning isn't helpful if you're substituting a lemmatizer in an existing pipeline, because it says to call nlp.initialize(), which is going to wipe out everything else.

Why is the transformer validating the examples if it's not using them?
