What is the fastest way to update the NER model once we get new data #12278

adypy · 2023-02-15T05:45:46Z

I trained an NER model on a very large dataset, and after training I obtained a golden dataset for a few tags that is 100% correct. Now I want this information to flow into my model. I know I can retrain the model or correct the older data based on the golden data, but that will take a lot of time and retraining, which I can't afford right now.
Is there a shortcut, such as the entity ruler, for making use of the golden data during inference?

kadarakos · 2023-02-15T11:09:34Z

Hey,

In general, the mechanism you can use in spaCy to continue training is to source some or all of the components of the previously trained model. We have an example for this: https://github.com/explosion/projects/tree/v3/pipelines/ner_demo_update. We hope you'll find the usage documentation helpful; please let us know how we could improve it: https://spacy.io/usage/processing-pipelines#sourced-components.
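
As a rough illustration of sourcing a trained component and continuing training on a small gold set, here is a minimal sketch. It is not taken from the linked project. The in-memory `base_nlp` only stands in for the previously trained pipeline (in practice you would load yours with `spacy.load`), and the `GOLD` examples, the `ORG` label, and the number of passes are hypothetical.

```python
import random

import spacy
from spacy.training import Example

# Stand-in for the previously trained pipeline (assumption: in practice
# this would come from spacy.load on your own model directory).
base_nlp = spacy.blank("en")
base_nlp.add_pipe("ner").add_label("ORG")
base_nlp.initialize()

# New pipeline that sources the existing NER component instead of
# creating a fresh, untrained one.
nlp = spacy.blank("en")
nlp.add_pipe("ner", source=base_nlp)

# Hypothetical gold annotations: (text, character-offset entities).
GOLD = [
    ("Acme Corp hired Jane", {"entities": [(0, 9, "ORG")]}),
    ("Jane left Acme Corp", {"entities": [(10, 19, "ORG")]}),
]
examples = [Example.from_dict(nlp.make_doc(text), ann) for text, ann in GOLD]

# resume_training() returns an optimizer without re-initializing weights.
optimizer = nlp.resume_training()
losses = {}
for _ in range(5):
    random.shuffle(examples)
    nlp.update(examples, sgd=optimizer, losses=losses)

print(nlp.pipe_names, losses)
```

With a real project you would more likely express the same idea in the training config and run `spacy train`, as the linked demo does; the loop above is only meant to show the moving parts.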

adypy · 2023-02-15T12:51:38Z

Author

The dataset I trained on is noisy, so now that I have golden data I want to somehow feed it to the model. The approach suggested above is fine-tuning on the golden data, right?

kadarakos · 2023-02-15T12:54:43Z

Yes exactly!

adypy · 2023-02-15T12:55:46Z

Author

OK. Can you also confirm whether the entity ruler only does exact string matching, or does it work on top of some learning?

kadarakos · 2023-02-15T13:00:43Z

It's true that the EntityRuler does string matching, so it is not a learned component. However, the matches do not have to be exact. Since v3.5 it can also do fuzzy matching: https://spacy.io/api/entityruler.
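
To make the rule-based option concrete, here is a minimal sketch of an entity ruler with one exact phrase pattern and one fuzzy token pattern. The patterns, the `ORG` label, and the sample sentence are made up for illustration, and the `FUZZY` operator assumes spaCy 3.5 or later. A blank pipeline is used so the snippet runs without a trained model; in a trained pipeline you would probably add the ruler after `ner`.

```python
import spacy

nlp = spacy.blank("en")
# overwrite_ents=True lets rule matches replace overlapping entities that
# an earlier component (such as a trained "ner") has already set.
ruler = nlp.add_pipe("entity_ruler", config={"overwrite_ents": True})
ruler.add_patterns([
    # Phrase pattern: exact match on the tokenized string.
    {"label": "ORG", "pattern": "Acme Corp"},
    # Token pattern with fuzzy matching (assumes spaCy >= 3.5).
    {"label": "ORG", "pattern": [{"TEXT": {"FUZZY": "Microsoft"}}]},
])

doc = nlp("She moved from Microsft to Acme Corp")
ents = [(ent.text, ent.label_) for ent in doc.ents]
print(ents)
```

Patterns like these could be generated from the golden dataset, which applies the gold labels at inference time without any retraining, at the cost of only covering strings that resemble the ones you listed.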

2 replies

adypy Feb 15, 2023
Author

Understood, Thanks for the quick help!!

kadarakos Feb 15, 2023

No worries!

vardhan-shah-backend · 2023-02-17T10:10:59Z

Hey @adypy & @kadarakos, I have a similar use case: I want to fine-tune a pre-trained spaCy model for custom NER. I was wondering how much time fine-tuning would take. Is there any way to estimate it based on the number of hyperparameters in the model, the number of entities and examples, the GPU/CPU power, or other relevant factors? I just need a rough idea of the number of hours and the GPU/CPU power needed.

1 reply

kadarakos Feb 17, 2023

Without knowing more details, it is not possible for us to give you estimates. Could you open a new discussion that focuses on one of the questions and gives a bit more detail, so that we can give you better answers?

adypy · 2023-02-20T07:48:34Z

Author

@kadarakos Is there any way to access the token embeddings while training an NER model?

5 replies

kadarakos Feb 20, 2023

You can access the contextualized embeddings of each token in the Doc with doc.tensor. For example:

import spacy

nlp = spacy.load("en_core_web_lg")
doc = nlp("Let's inspect the embeddings!")
print(doc.tensor.shape)

This gives us 6 tokens:

["Let", "'s", "inspect", "the", "embeddings", "!"]

So in this particular case doc.tensor is a numpy.ndarray (or a cupy.ndarray if using a GPU) of shape (6, 96).

The token.vector is of the same type as doc.tensor, but it is only a 1D array; in the case of en_core_web_lg its shape is (300,).
This is the embedding that is learned for the token itself without taking context into account.

adypy Feb 21, 2023
Author

I did exactly the same, but the embeddings come out empty.
model: custom NER model

d = model(sample_text)
for t in d:
    print(t.vector)

output

[]
[]
[]
[]
[]
[]
[]
[]
[]
[]

kadarakos Feb 23, 2023

Seems like your model does not include pre-trained embeddings: https://spacy.io/usage/embeddings-transformers
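
A quick way to check this is to inspect the vectors table of the pipeline. The sketch below uses a blank English pipeline as a stand-in for a custom model trained without static vectors (an assumption; substitute your own loaded pipeline).

```python
import spacy

# Stand-in for a custom pipeline trained without static vectors.
nlp = spacy.blank("en")
doc = nlp("This is life")

# Shape of the static vectors table: (number of vectors, vector width).
table_shape = nlp.vocab.vectors.shape
# has_vector is False for every token when no vectors are loaded.
has_any = any(token.has_vector for token in doc)
# token.vector is then a zero-length array, which prints as [].
sizes = [token.vector.size for token in doc]
print(table_shape, has_any, sizes)
```

If the table is empty and no component writes to doc.tensor, token.vector has nothing to return, which matches the empty output above.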

adypy Feb 23, 2023
Author

No, we are training NER from scratch. In this case, will I be able to access the learned transformer embeddings?

kadarakos Feb 23, 2023

You can train from scratch and still include pre-trained embeddings. It's almost always a good idea to do so. You can see examples in https://github.com/explosion/projects of how to include pre-trained vectors and how to pre-train your own. It might be worth taking a look at: https://github.com/explosion/projects/tree/v3/pipelines/floret_fi_core_demo.

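The transformer-based pipelines do not include static vectors for tokens.
You can try:

import spacy

nlp_trf = spacy.load("en_core_web_trf")
nlp_lg = spacy.load("en_core_web_lg")
# en_core_web_trf ships no static vectors, so token.vector is empty
for token in nlp_trf("This is life"):
    print(token.vector)
# en_core_web_lg ships 300-dimensional static vectors
for token in nlp_lg("This is life"):
    print(token.vector)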
