What should NER Training Data Look Like? #11954
polm started this conversation in Help: Best practices
To train an NER model, you need training data, which is example inputs with entities marked. This post gives a short overview of issues that often come up on the forum when preparing NER training data, along with pointers to related documentation. The model will learn from your examples, so those examples should be as much like your real input as possible.
The Golden Rule of Training Data
The golden rule of training data is that training data should look as much like your real inputs as possible. If your training data is clean newspaper articles, but your input is messy text from Twitter, the model won't be prepared to do the right thing. If you aren't getting the results you want, this is the first thing to check.
Avoid Lists of Entities Without Context
One common point of confusion is that a list of entities is not useful training data. For example, if we wanted to train a model to find country names, then just text like "Germany", "Japan", "Greenland" would not be useful - instead, we should have full sentences, like "I went to Mexico last year" or "Kiwis live in New Zealand".
With a list of entities, the only thing the model can do is memorize them and maybe learn some details about what country names look like. With full sentences, the model can learn to tell what a country name is from context: if you read "I flew to XYZ for vacation", you can tell that XYZ is a place even if you've never heard of it before. Examples like that are important so the model can learn to make judgments from context too.
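As a concrete illustration, here is a minimal sketch of contextful annotations in spaCy's character-offset format. The sentences, offsets, and the GPE label are made up for illustration, not taken from any real dataset.

```python
# Illustrative training examples: full sentences with character-offset
# entity annotations, rather than a bare list of country names.
TRAIN_DATA = [
    ("I went to Mexico last year.", {"entities": [(10, 16, "GPE")]}),
    ("Kiwis live in New Zealand.", {"entities": [(14, 25, "GPE")]}),
    # Context like "flew to ... for vacation" is what lets the model
    # recognize place names it has never seen before.
    ("I flew to Ruritania for vacation.", {"entities": [(10, 19, "GPE")]}),
]
```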
If all you have is a list of entities, you might want to use the rule-based PhraseMatcher instead. A PhraseMatcher is also the better choice if you can enumerate all the valid values for your entity, as in the sketch below.
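Here is a minimal sketch of that rule-based approach, assuming a small closed list of country names; the list and the COUNTRY label are placeholders.

```python
import spacy
from spacy.matcher import PhraseMatcher

nlp = spacy.blank("en")
matcher = PhraseMatcher(nlp.vocab)

# Hypothetical closed list of every valid value for the entity.
countries = ["Germany", "Japan", "New Zealand"]
matcher.add("COUNTRY", [nlp.make_doc(name) for name in countries])

doc = nlp("Kiwis live in New Zealand.")
for match_id, start, end in matcher(doc):
    print(doc[start:end].text, "->", nlp.vocab.strings[match_id])
```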
Avoid Ambiguity
Another important rule is that entities should not be ambiguous. A simple way to check if a label is ambiguous is to imagine you gave two different people instructions to mark entities - could they always mark them the exact same way? There are a couple of ways this can be a problem.
First, people might not agree about labels for things. For example, if you have labels like GOOD_MOVIE and BAD_MOVIE, people might have different opinions, or it might not always be clear from context which label is correct. It's better to have a general label everyone can agree on, like MOVIE, and use a different model (not necessarily NER) to do more detailed classification.
Second, people might not label the same spans of text. This is especially true for long spans. For example, if you want to label PROBLEM in a paragraph from a restaurant review, different people might select different parts of the text that communicate the same thing.
One person might annotate "didn't taste very good" as a PROBLEM, while another might annotate "it didn't taste very good" (including "it"). These are basically the same thing, but the model won't understand that and will have trouble learning to pick the correct span. The spancat component with a well-chosen span suggester can help work around this, though it can still be a challenging problem (see the sketch below).
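For reference, spancat reads its gold annotations from doc.spans rather than doc.ents, so overlapping candidate spans can coexist in the training data. The sentence and token indices below are made up to mirror the restaurant example, and "sc" is spancat's default spans key.

```python
import spacy
from spacy.tokens import Span

nlp = spacy.blank("en")
doc = nlp("The fish was cold and it didn't taste very good.")

# Unlike doc.ents, spans stored under doc.spans may overlap, so both
# plausible annotations of the complaint can be kept.
doc.spans["sc"] = [
    Span(doc, 6, 11, label="PROBLEM"),  # "didn't taste very good"
    Span(doc, 5, 11, label="PROBLEM"),  # "it didn't taste very good"
]
print([(span.text, span.label_) for span in doc.spans["sc"]])
```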
More Details
For an example of NER training data and how to convert it to .spacy format for training, see the training data docs. For a more thorough introduction to the training process, see the spaCy course, and for tips on preparing training data and troubleshooting NER models, see the NER flowchart.
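As a rough sketch of that conversion step, assuming character-offset annotations like the TRAIN_DATA example above (the file name and label are placeholders):

```python
import spacy
from spacy.tokens import DocBin

nlp = spacy.blank("en")
TRAIN_DATA = [
    ("I went to Mexico last year.", {"entities": [(10, 16, "GPE")]}),
]

db = DocBin()
for text, annotations in TRAIN_DATA:
    doc = nlp.make_doc(text)
    ents = [
        doc.char_span(start, end, label=label)
        for start, end, label in annotations["entities"]
    ]
    # Note: char_span returns None when offsets don't line up with token
    # boundaries; real code should handle that case.
    doc.ents = ents
    db.add(doc)

db.to_disk("./train.spacy")  # pass this path to `spacy train`
```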