Improving default .sents #11595
Replies: 1 comment
I would gently push back on the idea that you're going to find one model that's very good for every task. For some tasks newlines might always indicate a sentence boundary, but for other tasks they might only sometimes be one. For some tasks all sentences might start with capital letters and end with punctuation, but for most this wouldn't be the case. The general issues with expectations around the provided trained pipelines and statistical components are discussed in #3052.

In practice, you usually want a combination of rule-based components based on knowledge about your data (e.g. all newlines are sentence boundaries, or all tabs followed by lowercase tokens are not sentence boundaries) and statistical components like the parser or the senter. For a spaCy pipeline, my general recommendation would be to combine rule-based approaches (preprocessing, a sentencizer, or a custom component) with one of the statistical components.

The sentence boundaries in the trained pipelines come from the parser by default, which is able to use parse information to predict whether the current token looks like the end of a phrase. The original training corpora typically do not include any whitespace other than single spaces, so we also do some whitespace augmentation to try to make the models less sensitive to variation in whitespace. But our goal is to provide models that are generally useful rather than ones tuned for a specific task. For the trained pipelines, you can check the training data sources listed for each model.
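As a concrete illustration of the rule-based side, here is a minimal sketch of a custom component (the name `newline_boundaries` is hypothetical, and it assumes that in your data a newline should always start a new sentence):

```python
import spacy
from spacy.language import Language

@Language.component("newline_boundaries")  # hypothetical name, for illustration
def newline_boundaries(doc):
    # Mark the token after each newline as a sentence start, so the
    # parser (which runs later in the pipeline) keeps this boundary.
    for token in doc[:-1]:
        if "\n" in token.text:
            doc[token.i + 1].is_sent_start = True
    return doc

nlp = spacy.load("en_core_web_sm")
# Run before the parser so the preset boundaries constrain it.
nlp.add_pipe("newline_boundaries", before="parser")

doc = nlp("A header line\nThis is a sentence. This is another.")
for sent in doc.sents:
    print(repr(sent.text))
```

Setting `is_sent_start` before the parser runs means those boundaries are respected as-is, while the parser still predicts the remaining boundaries statistically.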
The English models are trained on OntoNotes, which contains a mixture of text types including newswire, transcribed broadcast news, transcribed telephone conversations, Biblical texts, and other genres that might be pretty different from your texts. You can try switching to a different trained pipeline to see whether its output is closer to what you need for your texts.
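If you'd rather get sentence boundaries from the SentenceRecognizer than from the parser, note that the trained pipelines ship with a `senter` component that is disabled by default. A sketch of enabling it (assuming spaCy v3):

```python
import spacy

# Load the pipeline without the parser and enable the statistical
# sentence recognizer ("senter"), which is included but disabled by default.
nlp = spacy.load("en_core_web_sm", exclude=["parser"])
nlp.enable_pipe("senter")

doc = nlp("Sentence one. Sentence two.")
print([sent.text for sent in doc.sents])
```

The `senter` is faster than the parser and trained on the same data, so it won't magically handle unusual whitespace better, but it's worth comparing on your texts.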
I use the spaCy .sents attribute and it's quite good, but it doesn't do so well with separate lines of text that aren't complete sentences: it can put unconnected lines of text into one segment.
Is there any best practice to make the SentenceRecognizer as effective as possible?
For example, in #9819 it was suggested to train your own model. But ideally there would already be a very good model for everybody to use.
I have been using the "sm" language models. Is "large" better?
Are there downloadable models on HuggingFace or somewhere, a SentenceRecognizer known for near-perfect performance?
Thank you