Improving default .sents #11595
Replies: 1 comment
I would gently push back on the idea that you're going to find one model that's very good for every task. For some tasks newlines might always indicate a sentence boundary, but for other tasks they might only sometimes be one. For some tasks all sentences might start with capital letters and end with punctuation, but for most this wouldn't be the case. The general issues with expectations around the provided trained pipelines and statistical components are discussed in #3052.

In practice, you usually want a combination of rule-based components based on knowledge about your data (e.g. all newlines are sentence boundaries, or all tabs followed by lowercase tokens are not sentence boundaries) and statistical components like the parser or the senter. For a spaCy pipeline, my general recommendation would be to combine rule-based approaches (preprocessing, a sentencizer, or a custom component) with one of the statistical components.

The sentence boundaries in the trained pipelines come from the parser by default, which is able to use parse information to predict whether the current token looks like the end of a phrase. The original training corpora typically do not include any whitespace other than single spaces, so we also do some whitespace augmentation to try to make the models less sensitive to variation in whitespace. But our goal is to provide models that are generally useful rather than ones tuned for a specific task. For the trained pipelines, you can check the training data sources listed for each model.
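As a concrete illustration of the rule-based side, here is a minimal sketch of a custom component (the name `newline_boundaries` is hypothetical, and it assumes that in your data a newline should always start a new sentence):

```python
import spacy
from spacy.language import Language

@Language.component("newline_boundaries")  # hypothetical name, for illustration
def newline_boundaries(doc):
    # Mark the token after each newline as a sentence start, so the
    # parser (which runs later in the pipeline) keeps this boundary.
    for token in doc[:-1]:
        if "\n" in token.text:
            doc[token.i + 1].is_sent_start = True
    return doc

nlp = spacy.load("en_core_web_sm")
# Run before the parser so the preset boundaries constrain it.
nlp.add_pipe("newline_boundaries", before="parser")

doc = nlp("A header line\nThis is a sentence. This is another.")
for sent in doc.sents:
    print(repr(sent.text))
```

Setting `is_sent_start` before the parser runs means those boundaries are respected as-is, while the parser still predicts the remaining boundaries statistically.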
The English models are trained on OntoNotes, which contains a mixture of text types including newswire, transcribed broadcast news, transcribed telephone conversations, Biblical texts, and other genres that might be pretty different from your texts. You can try switching to a different trained pipeline to see whether its output is closer to what you need for your texts.
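If you'd rather get sentence boundaries from the SentenceRecognizer than from the parser, note that the trained pipelines ship with a `senter` component that is disabled by default. A sketch of enabling it (assuming spaCy v3):

```python
import spacy

# Load the pipeline without the parser and enable the statistical
# sentence recognizer ("senter"), which is included but disabled by default.
nlp = spacy.load("en_core_web_sm", exclude=["parser"])
nlp.enable_pipe("senter")

doc = nlp("Sentence one. Sentence two.")
print([sent.text for sent in doc.sents])
```

The `senter` is faster than the parser and trained on the same data, so it won't magically handle unusual whitespace better, but it's worth comparing on your texts.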
I use the spaCy .sents attribute and it's quite good, but it doesn't do so well with separate lines of text that aren't complete sentences: it can put unconnected lines of text into one segment.
Is there any best practice to make the SentenceRecognizer as effective as possible?
For example, in #9819 it was suggested to train your own model. But ideally there would already be a very good model for everybody to use.
I have been using the "sm" language models. Is "large" better?
Are there downloadable models on HuggingFace or somewhere, a SentenceRecognizer known for near-perfect performance?
Thank you