What should NER Training Data Look Like? #11954
polm started this conversation in Help: Best practices
To train an NER model, you need training data, which is example inputs with entities marked. This post gives a short overview of issues that often come up on the forum when preparing NER training data, along with pointers to related documentation. The model will learn from your examples, so those examples should be as much like your real input as possible.
The Golden Rule of Training Data
The golden rule of training data is that training data should look as much like your real inputs as possible. If your training data is clean newspaper articles, but your input is messy text from Twitter, the model won't be prepared to do the right thing. If you aren't getting the results you want, this is the first thing to check.
Avoid Lists of Entities Without Context
One common point of confusion is that a list of entities is not useful training data. For example, if we wanted to train a model to find country names, then just text like "Germany", "Japan", "Greenland" would not be useful - instead, we should have full sentences, like "I went to Mexico last year" or "Kiwis live in New Zealand".
With a list of entities, the only thing the model can do is memorize them and maybe learn some details about what country names look like. With full sentences, the model can learn to tell what a country name is from context: if you read "I flew to XYZ for vacation", you can tell that XYZ is a place even if you've never heard of it before. Examples like that are important so the model can learn to make judgments from context too.
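As a concrete illustration, here is a minimal sketch of contextful annotations in spaCy's character-offset format. The sentences, offsets, and the GPE label are made up for illustration, not taken from any real dataset.

```python
# Illustrative training examples: full sentences with character-offset
# entity annotations, rather than a bare list of country names.
TRAIN_DATA = [
    ("I went to Mexico last year.", {"entities": [(10, 16, "GPE")]}),
    ("Kiwis live in New Zealand.", {"entities": [(14, 25, "GPE")]}),
    # Context like "flew to ... for vacation" is what lets the model
    # recognize place names it has never seen before.
    ("I flew to Ruritania for vacation.", {"entities": [(10, 19, "GPE")]}),
]
```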
If all you have is a list of entities, you might want to use the rule-based PhraseMatcher instead. A PhraseMatcher is also the better choice if you can enumerate all the valid values for your entity, as in the sketch below.
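Here is a minimal sketch of that rule-based approach, assuming a small closed list of country names; the list and the COUNTRY label are placeholders.

```python
import spacy
from spacy.matcher import PhraseMatcher

nlp = spacy.blank("en")
matcher = PhraseMatcher(nlp.vocab)

# Hypothetical closed list of every valid value for the entity.
countries = ["Germany", "Japan", "New Zealand"]
matcher.add("COUNTRY", [nlp.make_doc(name) for name in countries])

doc = nlp("Kiwis live in New Zealand.")
for match_id, start, end in matcher(doc):
    print(doc[start:end].text, "->", nlp.vocab.strings[match_id])
```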
Avoid Ambiguity
Another important rule is that entities should not be ambiguous. A simple way to check if a label is ambiguous is to imagine you gave two different people instructions to mark entities - could they always mark them the exact same way? There are a couple of ways this can be a problem.
First, people might not agree about labels for things. For example, if you have labels like GOOD_MOVIE and BAD_MOVIE, people might have different opinions, or it might not always be clear from context which label is correct. It's better to have a general label everyone can agree on, like MOVIE, and use a different model (not necessarily NER) to do more detailed classification.
Second, people might not label the same spans of text. This is especially true for long spans. For example, if you want to label PROBLEM in a paragraph from a restaurant review, different people might select different parts of the text that communicate the same thing.
One person might annotate "didn't taste very good" as a PROBLEM, while another might annotate "it didn't taste very good" (including "it"). These are basically the same thing, but the model won't understand that and will have trouble learning to pick the correct span. The spancat component with a well-chosen span suggester can help work around this, though it can still be a challenging problem (see the sketch below).
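For reference, spancat reads its gold annotations from doc.spans rather than doc.ents, so overlapping candidate spans can coexist in the training data. The sentence and token indices below are made up to mirror the restaurant example, and "sc" is spancat's default spans key.

```python
import spacy
from spacy.tokens import Span

nlp = spacy.blank("en")
doc = nlp("The fish was cold and it didn't taste very good.")

# Unlike doc.ents, spans stored under doc.spans may overlap, so both
# plausible annotations of the complaint can be kept.
doc.spans["sc"] = [
    Span(doc, 6, 11, label="PROBLEM"),  # "didn't taste very good"
    Span(doc, 5, 11, label="PROBLEM"),  # "it didn't taste very good"
]
print([(span.text, span.label_) for span in doc.spans["sc"]])
```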
More Details
For an example of NER training data and how to convert it to .spacy format for training, see the training data docs. For a more thorough introduction to the training process, see the spaCy course, and for tips on preparing training data and troubleshooting NER models, see the NER flowchart.
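As a rough sketch of that conversion step, assuming character-offset annotations like the TRAIN_DATA example above (the file name and label are placeholders):

```python
import spacy
from spacy.tokens import DocBin

nlp = spacy.blank("en")
TRAIN_DATA = [
    ("I went to Mexico last year.", {"entities": [(10, 16, "GPE")]}),
]

db = DocBin()
for text, annotations in TRAIN_DATA:
    doc = nlp.make_doc(text)
    ents = [
        doc.char_span(start, end, label=label)
        for start, end, label in annotations["entities"]
    ]
    # Note: char_span returns None when offsets don't line up with token
    # boundaries; real code should handle that case.
    doc.ents = ents
    db.add(doc)

db.to_disk("./train.spacy")  # pass this path to `spacy train`
```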