Task Abstraction #12114

coltonflowers1 · 2023-01-17T17:03:50Z

coltonflowers1
Jan 17, 2023

I am currently building my own entity linking system that has two overarching goals:

Allow our NLP engineers to implement their own solutions to the various tasks and subtasks needed: extracting entities/spans, linking spans/entities to knowledge base IDs, and doing both in an end-to-end fashion. I'd like to define each task by the document attributes it requires and the ones it assigns, much like in the Language.factory method, and have the NLP engineers implement concrete pipeline components that have the corresponding requires and assigns arguments.
Allow our data scientists to deploy the concrete implementations for the various tasks/subtasks provided by the NLP engineers without needing to know anything but a component's requires and assigns arguments.

To keep in line with the open-closed principle, I originally found myself creating a series of abstract base classes of models, each defined by its requires and assigns arguments.

I eventually realized that I was reimplementing a lot of functionality already present in spaCy, so I decided to do the entire project using spaCy's API. I thought that I could replace the abstract base classes with their respective pipeline components, but these two concepts are not really isomorphic, since the concrete pipeline components determine some (but not a lot) of the implementation/architecture detail. For example, the entity-recognizer pipeline component is inherently a transition-based parser, but one could imagine doing NER using something else. Interestingly, different concrete pipeline components can have the same input/output type combinations while each pipeline component class is dedicated to a different type of implementation, e.g. rule-based vs. statistical, as in entity_ruler vs. ner. I would like some sort of task abstraction that unites these two and tells users that they can replace a component of one task type with another component of the same task type.

I see three options:

Define custom component classes that allow essentially any architecture for a given task, so that I can associate each task with a single component class. This appears untenable right off the bat, since it would overcomplicate the config arguments: the possible init arguments would have to grow with every possible implementation.
Define a task by its scorer/evaluation method instead of its input/output types. I don't think this really captures the task abstraction, since one can easily exchange one evaluator for another for a given task like NER.
Define a parent component class for each task type. I'm not sure how you would even impose an intermediate abstraction between the Pipe/TrainablePipe component classes and their concrete component classes. Possibly use a mix-in? But then I'm not sure how to enforce that the mix-in only gets inherited along with either the Pipe or TrainablePipe class. Maybe I just need to do some more research about this.
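
One possible way to enforce the mix-in constraint from the third option is Python's `__init_subclass__` hook. The following is only a minimal sketch under stated assumptions: `Pipe` is a local stand-in for spaCy's `Pipe` class so that the snippet runs without spaCy, and `TaskMixin`, `EntityExtractionTask`, and `RegexEntityExtractor` are hypothetical names that do not exist in spaCy.

```python
class Pipe:
    """Stand-in for spaCy's Pipe base class (assumption, keeps the sketch standalone)."""


class TaskMixin:
    """Hypothetical mix-in carrying a task's requires/assigns contract."""

    task_name: str = ""
    requires: tuple = ()
    assigns: tuple = ()

    def __init_subclass__(cls, **kwargs):
        super().__init_subclass__(**kwargs)
        # Direct subclasses of TaskMixin are the task definitions themselves;
        # only their descendants (the concrete components) must be pipes.
        if TaskMixin in cls.__bases__:
            return
        if not issubclass(cls, Pipe):
            raise TypeError(f"{cls.__name__} must also inherit from Pipe")


class EntityExtractionTask(TaskMixin):
    task_name = "entity_extraction"
    assigns = ("doc.ents", "token.ent_iob")


class RegexEntityExtractor(EntityExtractionTask, Pipe):
    """Concrete component: accepted because it is also a Pipe."""


try:
    # Rejected at class-creation time: a task implementation that is not a Pipe.
    class NotAPipe(EntityExtractionTask):
        pass
except TypeError as err:
    rejected = str(err)
```

The check runs when the class is defined, so a misuse surfaces at import time rather than when a pipeline is assembled. With real spaCy classes the `issubclass` test would presumably target `Pipe`, which `TrainablePipe` extends.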

Are there any additional pros/cons with these approaches, or possibly an entirely different solution? Or am I better off not trying to do this at all?

coltonflowers1 · 2023-01-18T16:59:54Z

coltonflowers1
Jan 18, 2023
Author

Another idea: create a wrapper around the Language.factory decorator that accepts a ComponentTask object, extracts the assigns and requires args from it, and passes them along to the Language.factory method. Engineers are then expected to use the ComponentRegister.factory method to register their implementations of the various component tasks, and users of the registered concrete components are made aware of the task type associated with a component by the task-name prefix on the component factory name. One could also imagine creating a task registry so that a task class can be looked up via its name string.

The main downside I can see immediately is that I do not know how to incorporate spaCy's built-in components, since these are already registered.

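from dataclasses import dataclass
from typing import Any, Callable, Dict, List, Optional

from spacy.language import Language
from spacy.util import SimpleFrozenDict


@dataclass
class ComponentTask:
    """A task, identified by its name and the attributes it reads and writes."""

    name: str
    assigns: List[str]
    requires: List[str]


class ComponentRegister:
    @classmethod
    def factory(
        cls,
        name: str,
        task: ComponentTask,
        *,
        default_config: Dict[str, Any] = SimpleFrozenDict(),
        retokenizes: bool = False,
        default_score_weights: Dict[str, Optional[float]] = SimpleFrozenDict(),
        func: Optional[Callable] = None,
    ) -> Callable:
        # Prefix the factory name with the task name so that users can read
        # the task type off the component name. "_" is used as the separator
        # because Language.factory rejects names that contain ".".
        return Language.factory(
            f"{task.name}_{name}",
            default_config=default_config,
            assigns=task.assigns,
            requires=task.requires,
            retokenizes=retokenizes,
            default_score_weights=default_score_weights,
            func=func,
        )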
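
The task registry mentioned above could look roughly like the sketch below. It is an illustration rather than anything spaCy provides: `TaskRegistry` and its methods are hypothetical names, and `ComponentTask` is redefined here (frozen, with tuples) only so the snippet stands alone. Because the registry stores factory names only, built-in components such as `ner` and `entity_ruler` can be linked to a task without being registered a second time, which might sidestep the downside noted above.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass(frozen=True)
class ComponentTask:
    name: str
    assigns: Tuple[str, ...] = ()
    requires: Tuple[str, ...] = ()


class TaskRegistry:
    """Hypothetical registry: task name -> task, plus the factories linked to it."""

    def __init__(self) -> None:
        self._tasks: Dict[str, ComponentTask] = {}
        self._factories: Dict[str, List[str]] = {}

    def register_task(self, task: ComponentTask) -> ComponentTask:
        if task.name in self._tasks:
            raise ValueError(f"Task {task.name!r} is already registered")
        self._tasks[task.name] = task
        self._factories[task.name] = []
        return task

    def get_task(self, name: str) -> ComponentTask:
        # Look a task up by its name string.
        return self._tasks[name]

    def link_factory(self, task_name: str, factory_name: str) -> None:
        # Only the factory *name* is stored, so a component that is already
        # registered elsewhere can be linked without registering it again.
        self._factories[self.get_task(task_name).name].append(factory_name)

    def factories_for(self, task_name: str) -> List[str]:
        return list(self._factories[task_name])


registry = TaskRegistry()
registry.register_task(
    ComponentTask("entity_extraction", assigns=("doc.ents", "token.ent_iob"))
)
# "ner" and "entity_ruler" are the factory names of spaCy's built-in components.
registry.link_factory("entity_extraction", "ner")
registry.link_factory("entity_extraction", "entity_ruler")
```

Whether a built-in component's declared assigns and requires match the task exactly would still need to be checked when it is linked; this sketch does not do that.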

rmitsch · 2023-01-20T11:24:55Z

rmitsch
Jan 20, 2023
Maintainer

Hi @coltonflowers! These are some interesting thoughts. Got some questions to better understand your use case.

I am currently building my own entity linking system that has two overarching goals: ...
...
I eventually realized that I was reimplementing a lot of functionality already present in Spacy so I decided to do the entire project using Spacy's API.

I.e., all components your engineers are building and your data scientists will be using will be custom spaCy components, and your EL system will be assembled and run as a spaCy pipeline. Is that correct?

I would like some sort of task abstraction to unite these two and tell users that I can replace components of one task type with a component of the same task type.

From your description it seems to me that requires and assigns are sufficient for your task abstraction. Correct?

I would like some sort of task abstraction to unite these two and tell users that I can replace components of one task type with a component of the same task type.

"Users" is equivalent to data scientists for your use case?

coltonflowers1 Jan 20, 2023
Author

Thanks for the reply @rmitsch! To answer your questions,

I. e. all components your engineers are building and data scientists will be using will be custom spaCy components, and your EL system will be assembled and run as a spaCy pipeline. Is that correct?

Yes, we expect our engineers to build and register the components and for our data scientists to build spaCy pipelines using the registered components like they can currently do with all the built-in components.

From your description it seems to me that the requires and assigns is sufficient for your task abstraction. Correct?

Yes, I believe these are the only parts of spaCy's API I need to define this task abstraction. In addition, I would like to have short, salient names for the various combinations of these attributes (e.g. "entity recognition", "span linking", "span extraction") and be able to associate additional functionality with them as well. The main functionality I can see right now is mostly related to bookkeeping:

Getting all the concrete components registered with a task.
Getting/Setting a default concrete component for each task.
Helping specify how various components can be composed to create a spaCy pipeline that has a particular overarching task in mind, like end-to-end entity linking. Right now, there is an implicit association between the built-in entity recognizer and entity linker, in that you have to run the former before the latter. You should get an error when nlp.analyze_pipes() is run if you don't, but I'd like to make this more explicit to my data scientists and tell them how to combine these components via their task type, rather than having them dig into the requires and assigns of each pipeline component.

To this end, I've been toying with defining a spaCy pipeline in the abstract: rather than defining a pipeline in terms of concrete components, we define it as one of several sequences of component tasks, and then tell the data scientists that to produce a concrete pipeline they can replace each task with one of its corresponding concrete components. For example, I could define an end-to-end entity linking pipeline as a pipeline with two components: one that identifies the locations of the entities and another that associates each entity with a concept from a knowledge base. Alternatively, this pipeline could be defined as a single end-to-end entity linking component that does the extraction and linking jointly.

AbstractEndToEndELSpaCyPipeline = [
    ["entity_extraction", "entity_linking"],
    ["end_to_end_entity_linking"],
]

where the tasks associated with these names are defined as:

EntityExtractionTask(requires=[],assigns=["doc.ents", "token.ent_iob"])
EntityLinkingTask(requires=["doc.ents", "token.ent_iob"],assigns=["token.ent_kb_id"])
ETEEntityLinkingTask(requires=[],assigns=["doc.ents", "token.ent_iob", "token.ent_kb_id"])
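
To illustrate how such task definitions could make the ordering constraint explicit, here is a small sketch that walks a task sequence and reports any required attribute that no earlier task assigns. The names `Task`, `TASKS`, and `missing_requirements` are hypothetical, and the sketch assumes that the end-to-end task requires nothing and assigns all three attributes itself.

```python
from dataclasses import dataclass
from typing import Dict, List, Sequence, Tuple


@dataclass(frozen=True)
class Task:
    requires: Tuple[str, ...] = ()
    assigns: Tuple[str, ...] = ()


TASKS: Dict[str, Task] = {
    "entity_extraction": Task(assigns=("doc.ents", "token.ent_iob")),
    "entity_linking": Task(
        requires=("doc.ents", "token.ent_iob"),
        assigns=("token.ent_kb_id",),
    ),
    # Assumption: the joint task needs no upstream component.
    "end_to_end_entity_linking": Task(
        assigns=("doc.ents", "token.ent_iob", "token.ent_kb_id"),
    ),
}


def missing_requirements(
    schedule: Sequence[str], tasks: Dict[str, Task]
) -> Dict[str, List[str]]:
    """Map each task in the sequence to the attributes it requires but that
    no earlier task in the sequence assigns. Empty dict means consistent."""
    available = set()
    problems: Dict[str, List[str]] = {}
    for name in schedule:
        task = tasks[name]
        missing = [attr for attr in task.requires if attr not in available]
        if missing:
            problems[name] = missing
        available.update(task.assigns)
    return problems


ok = missing_requirements(["entity_extraction", "entity_linking"], TASKS)
bad = missing_requirements(["entity_linking"], TASKS)
```

Here `ok` is empty, while `bad` reports that `entity_linking` is missing `doc.ents` and `token.ent_iob`, which is the kind of message a data scientist could act on without reading the component metadata.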

I can then tell our data scientists that they just need to provide names for the concrete implementations of these components in order to have a pipeline that accomplishes the overall objective of that abstract pipeline, i.e. end-to-end entity linking. Alternatively, if I have defaults set for the various tasks, they can be used to create a concrete pipeline for any one of the task sequences.
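
As a rough sketch of that resolution step, the function below turns one task sequence into a list of concrete component names, preferring explicit choices and falling back to per-task defaults. `resolve`, `SCHEDULES`, and `DEFAULTS` are hypothetical names, and the default mapping to spaCy's built-in `ner` and `entity_linker` factories is only an example choice.

```python
from typing import Dict, List, Optional, Sequence

# The abstract pipeline: alternative task sequences for the same goal.
SCHEDULES: List[List[str]] = [
    ["entity_extraction", "entity_linking"],
    ["end_to_end_entity_linking"],
]

# Example defaults (assumption): one concrete factory name per task.
DEFAULTS: Dict[str, str] = {
    "entity_extraction": "ner",
    "entity_linking": "entity_linker",
}


def resolve(
    schedule: Sequence[str],
    defaults: Dict[str, str],
    choices: Optional[Dict[str, str]] = None,
) -> List[str]:
    """Replace each task name with a concrete component name."""
    choices = choices or {}
    resolved = []
    for task_name in schedule:
        # An explicit choice wins over the task's default.
        factory_name = choices.get(task_name, defaults.get(task_name))
        if factory_name is None:
            raise KeyError(f"No component chosen for task {task_name!r}")
        resolved.append(factory_name)
    return resolved


with_defaults = resolve(SCHEDULES[0], DEFAULTS)
with_choice = resolve(SCHEDULES[0], DEFAULTS, {"entity_extraction": "entity_ruler"})
```

The resulting names could then be passed one by one to `nlp.add_pipe`. Resolving `SCHEDULES[1]` with these defaults raises a `KeyError`, since no default is set for the end-to-end task.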

"Users" is equivalent with data scientists for your use case?

Yes, I should've just said data scientists, but I know that the term can be fairly overloaded. When I say data scientists, I mean users who can create pipelines from pre-built components, train these pipelines, and use them for inference, but who are not expected to know much about the concrete implementations of the pipeline components.

Let me know if I can clarify anything else.

rmitsch Jan 24, 2023
Maintainer

Thanks for elaborating on this! This is rather helpful in understanding your goal. I do see how the abstraction in terms of in- and outputs makes a lot of sense for your downstream users/data scientists.

So in my understanding, in the end you want to provide some kind of API for your downstream users that has a concept of

tasks, which are abstract representations of a spaCy component, with fixed inputs and outputs; and
pipeline schedules (like AbstractEndToEndELSpaCyPipeline; calling them this to differentiate them from proper spaCy pipelines), which include one or more examples of how to chain together tasks to achieve whatever goal the pipeline schedule represents, e.g. entity linking.

This API should allow your NLP engineers to

link spaCy components to their corresponding tasks; and
manipulate (i.e. create/modify/delete) pipeline schedules.

Do you feel like this is an accurate high-level representation of what you want to achieve?
