Pretrained Pipeline Suitability Issues #11952
Locked
polm
started this conversation in
Help: Best practices
You may have tried the spaCy pretrained models and been unsatisfied with their accuracy on your text. One thing to keep in mind about machine learning models is that they learn based on their training data, so if the input is very different from the training data it can be hard for the model to perform well.
The pretrained English pipelines for spaCy are trained on a dataset known as OntoNotes. While OntoNotes contains many different types of text, it's mostly newspaper or blog articles that have clean capitalization and punctuation and use complete sentences. The same is true of the training data for the pretrained pipelines in most other languages, and you can check the exact details for the model you're using on the models page. If your input is messier, like text messages, chat logs, or OCR output, then accuracy will suffer.
If you find that this is a problem for you, the best solution is always to train a model on your own data. You can also look into data augmentation, such as stripping the case information from properly cased input by lowercasing it, to make your model less sensitive to orthographic variation.
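To make the lowercasing idea concrete, here is a minimal sketch in plain Python. It assumes your training data is a list of `(text, annotations)` pairs with character offsets; the function name `lowercase_augment` and the `level` and `seed` parameters are made up for illustration and are not part of spaCy. spaCy also ships a built-in augmenter registered as `spacy.lower_casing.v1` that you can set in the training config, so check the docs before rolling your own.

```python
import random


def lowercase_augment(examples, level=0.3, seed=0):
    """Yield every example, plus a lowercased copy for roughly `level` of them.

    `examples` is an iterable of (text, annotations) pairs, where the
    annotations use character offsets, e.g. {"entities": [(0, 5, "ORG")]}.
    This is a hypothetical helper, not a spaCy API.
    """
    rng = random.Random(seed)
    for text, annotations in examples:
        # Always keep the original, properly cased example.
        yield text, annotations
        lowered = text.lower()
        # Character offsets only stay valid if lowercasing keeps the length
        # unchanged, which is not guaranteed for every Unicode character.
        if lowered != text and len(lowered) == len(text) and rng.random() < level:
            yield lowered, annotations


train_data = [
    ("Apple hired Tim Cook.", {"entities": [(0, 5, "ORG"), (12, 20, "PERSON")]}),
]
augmented = list(lowercase_augment(train_data, level=1.0))
```

Keeping the original examples alongside the lowercased copies means the model still sees properly cased text, so it should not lose accuracy on clean input while becoming more robust to missing capitalization.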