Pretrained Pipeline Suitability Issues #11952

polm · 2022-12-09T04:13:06Z


You may have tried the spaCy pretrained pipelines and been unsatisfied with their accuracy on your text. One thing to keep in mind about machine learning models is that they learn from their training data, so if your input is very different from that data, the model will often struggle to perform well.

The pretrained English pipelines for spaCy are trained on a dataset known as OntoNotes. While OntoNotes contains many different types of text, it's mostly newspaper or blog articles that have fairly clean capitalization and punctuation and use complete sentences. The same is true of the training data for the pretrained pipelines in most other languages; you can check the exact details for the model you're using on the models page. If your input is messier, such as text or chat messages, OCR output, or similar, accuracy will suffer.

If you find that this is a problem for you, the best solution is to train a model on your own data. You can also look into data augmentation, such as removing the case information from properly cased input, to make your model less sensitive to orthographic variation.
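To make the case-removal idea concrete, here is a minimal sketch in plain Python. It is not spaCy's own implementation: the function name `lowercase_augment`, the `(text, spans)` example format, and the sample sentences are all assumptions made for illustration. It keeps every original example and adds a lower-cased copy for roughly a `level` fraction of them, reusing the character-offset annotations unchanged.

```python
import random


def lowercase_augment(examples, level=0.3, seed=0):
    """Yield every example, plus a lower-cased copy for about `level` of them.

    `examples` is an iterable of (text, spans) pairs, where spans are
    (start_char, end_char, label) tuples. This format is hypothetical.
    """
    rng = random.Random(seed)
    for text, spans in examples:
        yield text, spans
        lowered = text.lower()
        # Offsets are only reusable if lowercasing did not change the length
        # (a few Unicode characters do change length when lower-cased).
        if rng.random() < level and lowered != text and len(lowered) == len(text):
            yield lowered, spans


# Hypothetical training examples with character-offset entity spans.
train = [
    ("Apple opened a store in Berlin.", [(0, 5, "ORG"), (24, 30, "GPE")]),
    ("she moved to paris last year", [(13, 18, "GPE")]),
]

augmented = list(lowercase_augment(train, level=1.0))
for text, spans in augmented:
    print(text, [(text[start:end], label) for start, end, label in spans])
```

The originals are kept so that the model still sees properly cased text; only the extra copies lose their case information. If you train with spaCy v3, it also ships a built-in lower-casing augmenter, registered as `spacy.lower_case.v1`, that can be set in the training config, so a hand-written function like this is mainly useful for seeing what the augmentation does.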
