Best practice for training and using custom NER model out of spacy blank? #11507

steve-solun · 2022-09-15T10:48:09Z

I am training a blank spaCy "en" model with annotations of the following kind, each a single word with one label:

# End offsets are exclusive and must match the text shown: len("restart") == 7, len("channel") == 7
train = [
          ("hibernate",{"entities":[(0,9,"power_options")]}),
          ("restart",{"entities":[(0,7,"power_options")]}),
          ("func_name",{"entities":[(0,9,"build_func")]}),
          ("channel",{"entities":[(0,7,"open_channel")]}),
          # ... hundreds more one-word annotations in the same format
]

I am using the following config file:

[paths]
train = null
dev = null
vectors = null
init_tok2vec = null

[system]
gpu_allocator = null
seed = 0

[nlp]
lang = "en"
pipeline = ["tok2vec","ner"]
batch_size = 1000
disabled = []
before_creation = null
after_creation = null
after_pipeline_creation = null
tokenizer = {"@tokenizers":"spacy.Tokenizer.v1"}

[components]

[components.ner]
factory = "ner"
incorrect_spans_key = null
moves = null
scorer = {"@scorers":"spacy.ner_scorer.v1"}
update_with_oracle_cut_size = 100

[components.ner.model]
@architectures = "spacy.TransitionBasedParser.v2"
state_type = "ner"
extra_state_tokens = false
hidden_width = 64
maxout_pieces = 2
use_upper = true
nO = null

[components.ner.model.tok2vec]
@architectures = "spacy.Tok2VecListener.v1"
width = ${components.tok2vec.model.encode.width}
upstream = "*"

[components.tok2vec]
factory = "tok2vec"

[components.tok2vec.model]
@architectures = "spacy.Tok2Vec.v2"

[components.tok2vec.model.embed]
@architectures = "spacy.MultiHashEmbed.v2"
width = ${components.tok2vec.model.encode.width}
attrs = ["NORM","PREFIX","SUFFIX","SHAPE"]
rows = [5000,1000,2500,2500]
include_static_vectors = true

[components.tok2vec.model.encode]
@architectures = "spacy.MaxoutWindowEncoder.v2"
width = 256
depth = 8
window_size = 1
maxout_pieces = 3

[corpora]

[corpora.dev]
@readers = "spacy.Corpus.v1"
path = ${paths.dev}
max_length = 0
gold_preproc = false
limit = 0
augmenter = null

[corpora.train]
@readers = "spacy.Corpus.v1"
path = ${paths.train}
max_length = 0
gold_preproc = false
limit = 0
augmenter = null

[training]
dev_corpus = "corpora.dev"
train_corpus = "corpora.train"
seed = ${system.seed}
gpu_allocator = ${system.gpu_allocator}
dropout = 0.1
accumulate_gradient = 1
patience = 1600
max_epochs = 0
max_steps = 20000
eval_frequency = 200
frozen_components = []
annotating_components = []
before_to_disk = null

[training.batcher]
@batchers = "spacy.batch_by_words.v1"
discard_oversize = false
tolerance = 0.2
get_length = null

[training.batcher.size]
@schedules = "compounding.v1"
start = 100
stop = 1000
compound = 1.001
t = 0.0

[training.logger]
@loggers = "spacy.ConsoleLogger.v1"
progress_bar = false

[training.optimizer]
@optimizers = "Adam.v1"
beta1 = 0.9
beta2 = 0.999
L2_is_weight_decay = true
L2 = 0.01
grad_clip = 1.0
use_averages = false
eps = 0.00000001
learn_rate = 0.001

[training.score_weights]
ents_f = 1.0
ents_p = 0.0
ents_r = 0.0
ents_per_type = null

[pretraining]

[initialize]
vectors = ${paths.vectors}
init_tok2vec = ${paths.init_tok2vec}
vocab_data = null
lookups = null
before_init = null
after_init = null

[initialize.components]

[initialize.tokenizer]

But when I load the model and try to find the entities in a sentence:

import spacy

nlp = spacy.load(r"./output/model-best")  # load the best model
doc = nlp("restart pc")  # sample input text

I see that doc.ents contains the single span 'restart pc'. The model does not split the input into words and attach the labels it learned to each one.
What am I missing when building my own custom NER model?

darrkj · 2022-09-16T03:40:00Z

I think you may have a training data issue. The model is only ever given one-word examples in which the entire text is the entity. The data you are evaluating the model on is different: it contains more than one word. The model may simply be learning to label the whole input as the entity, since doing that is always correct during training.
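One way to check for this kind of problem is to measure how many training examples consist of nothing but the entity. The sketch below is only an illustration: `covers_whole_text` and `share_fully_covered` are hypothetical helpers (not part of spaCy), and the two-sentence list is made-up data for contrast.

```python
# Hypothetical diagnostic helpers; they use plain Python and no spaCy calls.

def covers_whole_text(text, annotations):
    """Return True if a single entity span covers the entire text."""
    entities = annotations["entities"]
    if len(entities) != 1:
        return False
    start, end, _label = entities[0]
    return start == 0 and end == len(text)


def share_fully_covered(examples):
    """Fraction of examples whose only entity spans the whole text."""
    if not examples:
        return 0.0
    covered = sum(covers_whole_text(text, ann) for text, ann in examples)
    return covered / len(examples)


# One-word examples in the format used in the question.
one_word_train = [
    ("hibernate", {"entities": [(0, 9, "power_options")]}),
    ("restart", {"entities": [(0, 7, "power_options")]}),
    ("func_name", {"entities": [(0, 9, "build_func")]}),
]

# Made-up sentence-level examples for comparison.
sentence_train = [
    ("restart pc", {"entities": [(0, 7, "power_options")]}),
    ("please hibernate the laptop", {"entities": [(7, 16, "power_options")]}),
]

print(share_fully_covered(one_word_train))  # 1.0
print(share_fully_covered(sentence_train))  # 0.0
```

A share close to 1.0 suggests the model never sees a token that should be left unlabelled, which matches the behaviour described in the question.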

polm · 2022-09-27T04:37:25Z

As mentioned, your training data is not usable. Training data should look like the input you expect at prediction time; for NER that normally means complete sentences. NER models learn not only from the labelled terms but also from the unlabelled text around them, and they need both in order to learn what to label and what to leave unlabelled.
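As a rough illustration of what sentence-level data could look like, the sketch below builds annotations in the same tuple format as the question, with labelled terms embedded in longer text and one sentence with no entities at all. The `annotate` helper, the `LABELS` mapping, and the example sentences are all assumptions made for this sketch. Real data would normally be annotated by hand rather than by string matching.

```python
import re

# Hypothetical term-to-label mapping, taken from the labels in the question.
LABELS = {
    "hibernate": "power_options",
    "restart": "power_options",
}


def annotate(sentence, labels=LABELS):
    """Return (sentence, {"entities": [...]}) with character offsets.

    Matches whole words only; a simplification that ignores inflected
    forms and ambiguous uses of a term.
    """
    entities = []
    for term, label in labels.items():
        pattern = r"\b" + re.escape(term) + r"\b"
        for match in re.finditer(pattern, sentence):
            entities.append((match.start(), match.end(), label))
    return sentence, {"entities": sorted(entities)}


# Made-up sentences: two with an entity plus context, one with no entity.
train = [
    annotate("restart pc"),
    annotate("please hibernate the laptop now"),
    annotate("what time is it"),
]

for text, annotations in train:
    print(text, annotations["entities"])
```

Each tuple can then be turned into a spaCy `Doc`, for example by calling `nlp.make_doc(text)` on a blank pipeline, creating spans with `doc.char_span(start, end, label=label)`, assigning them to `doc.ents`, and collecting the docs in a `DocBin` for `spacy train`. Sentences without entities are worth keeping, since they show the model what not to label.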
