Handle sentence boundaries from multiple components #4775
Comments
Suggestions from @DomHudson in #5050 (comment):
Related conversations:
My position on this is mostly "keep it as is". I'm open to debate on this, but I'll explain my position. I agree that an [...]

I still think the ternary values are the most practical mechanism for allowing components to coordinate on the sentence boundaries. I don't think it would really help to have something like a decision history, and that would be impractical for efficiency reasons anyway. There's ultimately no way for components to know what other components are expected to run before or after them. It's up to the pipeline author to construct a pipeline that behaves well as a whole. It's nice if components are configurable about how they set the sentence boundary values, but that's a question for the design of the individual components. And the pipeline author can always insert other processes that run over the [...]

I don't think any more complicated mechanism than ternary values would really help components coordinate. Let's say components got to set a single probability instead of a ternary. If you're writing a component and you receive some set probability, how should you interpret it? It will depend on how accurate you expect that model to be on your data, and how accurate you expect the component's own model to be. Only the person who puts together the pipeline is in a position to know how those values should be integrated, so it still can't happen automatically. Similarly, let's say you had a full history of which components had set the [...]

So my position is that components are able to set three values for the [...]

For ourselves as pipeline and component authors, I think the parser could be a bit more configurable. We could expose an option to never insert sentence boundaries, regardless of whether [...]
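To make the ternary idea concrete, here is a minimal sketch of one policy a pipeline author might choose: a later component only fills in positions that no earlier component has decided. It does not use spaCy's API. The plain list of `True` / `False` / `None` values is a stand-in for per-token sentence-start values, and `fill_unset_boundaries` is a hypothetical helper name, not something the library provides.

```python
from typing import List, Optional

# Stand-in for a per-token ternary sentence-start value (an assumption for
# illustration): True = boundary, False = explicitly not a boundary,
# None = no component has decided yet.
Ternary = Optional[bool]


def fill_unset_boundaries(current: List[Ternary], proposed: List[bool]) -> List[Ternary]:
    """Apply a component's proposals only where no earlier decision exists."""
    if len(current) != len(proposed):
        raise ValueError("expected one proposal per token")
    # Keep every earlier decision (True or False); fill only the None slots.
    return [new if cur is None else cur for cur, new in zip(current, proposed)]


# An earlier component marked token 0 as a start and token 2 as a non-start.
earlier = [True, None, False, None]
# A later component would like boundaries at tokens 2 and 3.
later = [False, False, True, True]

merged = fill_unset_boundaries(earlier, later)
print(merged)
```

Whether "earlier decisions win" is the right rule depends on how much the pipeline author trusts each component, which is the point made above: the values are easy to combine, but the combination policy has to be chosen by whoever assembles the pipeline.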
Feature description
Decide how to handle `is_sentenced` and sentence boundaries that may come from multiple components (Sentencizer, SentenceRecognizer, Parser).

Some ideas:
- an `is_sentenced` property more like `is_parsed` that can be set by components
- a component (`finalize_sentences`?) that can be inserted at the right point in the pipeline

Check that no spaCy components clobber sentence boundaries and that `is_sentenced` works consistently when sentence boundaries come from multiple sources. If a component after the parser changes sentence boundaries, make sure the required tree recalculations are done (a related issue: #4497).

Potentially add warnings when non-zero `sent_start` is changed by any component?

I think the default behavior could be that any pipeline component can add sentence boundaries but that components won't remove any sentence boundaries. The idea would be that the Sentencizer or SentenceRecognizer add punctuation-based boundaries (typically high precision, although the Sentencizer less so) and the Parser can add phrase-based boundaries (improving recall). I don't know if this works as cleanly as envisioned in practice, especially with the Sentencizer. Most likely people using the Sentencizer aren't using other components so it's less of an issue, but I could imagine SentenceRecognizer + Parser as a common combination.
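The "add but never remove" default, together with a warning when a component tries to remove a boundary, could look roughly like the sketch below. This is an illustration in plain Python, not spaCy code. The lists stand in for per-token sentence-start values (`True`, `False`, or `None` for undecided), and `merge_add_only` is a hypothetical name. It warns only on attempted removals, which is a narrower rule than warning on every change to a non-zero value.

```python
import warnings
from typing import List, Optional

# Stand-in for a per-token ternary sentence-start value (an assumption for
# illustration): True = boundary, False = not a boundary, None = undecided.
Ternary = Optional[bool]


def merge_add_only(current: List[Ternary], proposed: List[Ternary]) -> List[Ternary]:
    """Merge proposals so that boundaries can be added but never removed."""
    if len(current) != len(proposed):
        raise ValueError("expected one proposal per token")
    merged: List[Ternary] = []
    for i, (cur, new) in enumerate(zip(current, proposed)):
        if cur is True and new is False:
            # A later component tried to remove an existing boundary: keep it.
            warnings.warn(f"token {i}: ignoring attempt to remove a sentence boundary")
            merged.append(True)
        elif new is None:
            # The later component has no opinion: keep the earlier value.
            merged.append(cur)
        else:
            # Adding a boundary, or deciding a previously undecided position.
            merged.append(new)
    return merged


# Punctuation-based component: confident about tokens 0 and 3 only.
punctuation_based = [True, None, None, True, None]
# Parser-like component: adds a boundary at token 2, disagrees about token 3.
parser_like = [True, False, True, False, False]

merged = merge_add_only(punctuation_based, parser_like)
print(merged)
```

In this run the disagreement at token 3 is resolved in favor of the existing boundary and reported through a warning, while the new boundary at token 2 is accepted.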