Custom component on filtered sentences of doc object #11946
-
It sounds like you should write a custom suggester for the spancat that only suggests the sentences you want to keep. To be clear, do you want spancat to classify whole sentences, or to classify spans only within certain sentences? Either way you can do it with a custom suggester, it'll just be slightly more complicated if it's the latter option.
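Here is a minimal sketch of what such a suggester could look like, assuming the latter option (spans only within certain sentences). The registry name `"keyword_filtered_ngram_suggester.v1"` and the `KEYWORDS` set are illustrative assumptions, not part of spaCy; the structure mirrors the built-in n-gram suggester but only generates candidates inside sentences that contain one of the keywords:

```python
from functools import partial
from typing import Iterable, List, Optional

from spacy.tokens import Doc
from spacy.util import registry
from thinc.api import Ops, get_current_ops
from thinc.types import Ragged

# Hypothetical pre-filter word list; replace with your own.
KEYWORDS = {"acquisition", "merger"}


@registry.misc("keyword_filtered_ngram_suggester.v1")
def build_keyword_filtered_ngram_suggester(sizes: List[int]):
    return partial(keyword_filtered_ngram_suggester, sizes=sizes)


def keyword_filtered_ngram_suggester(
    docs: Iterable[Doc], *, ops: Optional[Ops] = None, sizes: List[int]
) -> Ragged:
    if ops is None:
        ops = get_current_ops()
    spans = []    # one (start, end) token-index pair per candidate
    lengths = []  # number of candidates per doc
    for doc in docs:
        length = 0
        for sent in doc.sents:
            # Skip sentences that don't contain any keyword.
            if not any(token.lower_ in KEYWORDS for token in sent):
                continue
            # Enumerate n-grams only within the kept sentence.
            for size in sizes:
                for start in range(sent.start, sent.end - size + 1):
                    spans.append((start, start + size))
                    length += 1
        lengths.append(length)
    lengths_array = ops.asarray1i(lengths)
    if spans:
        data = ops.asarray2i(spans)
    else:
        data = ops.xp.zeros((0, 2), dtype="i")
    return Ragged(data, lengths_array)
```

Note that this relies on `doc.sents`, so a sentence-boundary component (parser or senter) has to run before spancat. Depending on your spaCy version you may also want to guarantee at least one candidate per doc (e.g. fall back to a single-token span) so the batch is never empty.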
-
So here is the actual problem: we have very large documents to process, and the number of candidates generated by the n-gram suggester is huge; a single doc can produce 550K candidates, so the model runs out of GPU memory (16 GB). That is why we want to reduce the number of candidates spancat generates, and why we are looking for a way to run spancat only on pre-filtered sentences from the doc.
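If a custom suggester like the sketch above were registered under the hypothetical name `"keyword_filtered_ngram_suggester.v1"`, it could replace the default n-gram suggester when the spancat component is configured, which directly cuts the number of candidates. A rough example (the `spans_key` value and sizes are assumptions):

```python
import spacy

nlp = spacy.blank("en")          # or your existing pipeline
nlp.add_pipe("sentencizer")      # the suggester needs sentence boundaries
nlp.add_pipe(
    "spancat",
    config={
        "spans_key": "sc",
        # Swap the default spacy.ngram_suggester.v1 for the custom one.
        "suggester": {
            "@misc": "keyword_filtered_ngram_suggester.v1",
            "sizes": [1, 2, 3],
        },
    },
)
```

The equivalent in a training config would be setting `@misc = "keyword_filtered_ngram_suggester.v1"` in the `[components.spancat.suggester]` block, so the reduced candidate set is also used during training.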
-
Hi,
I am trying to build a pipeline with a transformer, NER, spancat, and another classification model.
pipeline = ["transformer", "tagger", "parser", "attribute_ruler", "lemmatizer", "ner", "spancat", "custom_component"]
The spancat here is a custom-trained model.
I need to pre-filter sentences in the doc that contain any word from a list, and only those sentences should be processed by the spancat model, instead of the entire doc object. However, all the other components need the entire document.
For example, if a document has 20 sentences and only 8 of them match the pre-filter list, then the spancat component should annotate only those 8 sentences.
How can I achieve that?
One way I can think of is to create another custom component and load the spancat model in the `__init__` of a language factory. But I don't want to load the model for every single document it processes.
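On that last point: a component created through `@Language.factory` is constructed once when the pipeline is assembled, not once per document, so a wrapper that loads the spancat model in its `__init__` would not reload it per doc. A rough sketch of that idea, assuming a hypothetical factory name `"filtered_spancat"`, a keyword list, and a model path:

```python
from typing import List

import spacy
from spacy.language import Language
from spacy.tokens import Doc


@Language.factory("filtered_spancat", default_config={"keywords": []})
def create_filtered_spancat(nlp: Language, name: str, keywords: List[str]):
    return FilteredSpancat(keywords)


class FilteredSpancat:
    def __init__(self, keywords: List[str]):
        # Runs once, when the pipeline is built, not per document.
        self.keywords = set(keywords)
        self.spancat_nlp = spacy.load("path/to/spancat_model")  # hypothetical path

    def __call__(self, doc: Doc) -> Doc:
        # Run spancat only on sentences containing a keyword and map the
        # predicted spans back onto the original doc.
        spans = []
        for sent in doc.sents:
            if not any(t.lower_ in self.keywords for t in sent):
                continue
            sent_doc = self.spancat_nlp(sent.text)
            for span in sent_doc.spans.get("sc", []):
                start_char = sent.start_char + span.start_char
                end_char = sent.start_char + span.end_char
                mapped = doc.char_span(start_char, end_char, label=span.label_)
                if mapped is not None:
                    spans.append(mapped)
        doc.spans["filtered_sc"] = spans
        return doc
```

The drawback of this wrapper is that it runs a second, separate model per kept sentence and duplicates the transformer work, so the custom-suggester approach from the replies is likely the cheaper way to get the same filtering.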