How to combine NER with entity_ruler in an exclusive way? #11266
Replies: 2 comments 2 replies
-
There's no config option for this, but since pipeline components are just functions that run on a Doc, you can make a small custom component to wrap an EntityRuler like so:
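A minimal sketch of such a wrapper, assuming spaCy v3 (where `EntityRuler` can be imported from `spacy.pipeline`). The factory name `exclusive_entity_ruler`, the `BIC` label and the regex are illustrative assumptions, and "exclusive" is interpreted here per label: rule matches are only kept for labels that no earlier component has already assigned in the doc. A blank pipeline stands in for the trained one, so the NER result is simulated by setting `doc.ents` by hand.

```python
import spacy
from spacy.language import Language
from spacy.pipeline import EntityRuler
from spacy.tokens import Span


@Language.factory("exclusive_entity_ruler")  # hypothetical factory name
class ExclusiveEntityRuler:
    """Wraps an EntityRuler and keeps its matches only for labels
    that are not already present in doc.ents."""

    def __init__(self, nlp, name):
        # overwrite_ents=False: existing entities are never replaced
        self.ruler = EntityRuler(nlp, overwrite_ents=False)

    def add_patterns(self, patterns):
        self.ruler.add_patterns(patterns)

    def __call__(self, doc):
        # Remember what earlier components (e.g. the NER) have found
        before = {(e.start, e.end, e.label_) for e in doc.ents}
        labels_found = {label for _, _, label in before}
        doc = self.ruler(doc)
        # Drop rule matches whose label was already found earlier
        doc.ents = [
            e for e in doc.ents
            if (e.start, e.end, e.label_) in before
            or e.label_ not in labels_found
        ]
        return doc


# Assumption: a blank pipeline; in a trained pipeline you would add
# the component with nlp.add_pipe("exclusive_entity_ruler", after="ner")
nlp = spacy.blank("en")
exclusive_ruler = nlp.add_pipe("exclusive_entity_ruler")
exclusive_ruler.add_patterns([
    # Illustrative BIC pattern: 8 or 11 characters in a single token
    {"label": "BIC",
     "pattern": [{"TEXT": {"REGEX": r"^[A-Z]{6}[A-Z0-9]{2}([A-Z0-9]{3})?$"}}]},
])

# Simulate an NER that has already picked the first BIC
doc = nlp.make_doc("Pay to DEUTDEFF or COBADEFFXXX")
doc.ents = [Span(doc, 2, 3, label="BIC")]
doc = exclusive_ruler(doc)
print([(ent.text, ent.label_) for ent in doc.ents])
```

With the simulated NER result, only the first BIC should remain in `doc.ents`; if no BIC has been found before the component runs, all rule matches are kept. Note that this simple wrapper does not implement `to_disk`/`from_disk`, so the patterns would have to be added again after loading a saved pipeline.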
-
Hi @adrianeboyd, As explained, I am training NER on invoice documents. I have about 10,000 invoices for training, so I think this is a good training set, even if the invoices come from only about 200 different companies (overfitting problem). I train only a few very important entities inside my documents:
But what I have done so far is start from scratch with an empty model! I do not load the spaCy English language model. Now I think I'm making it unnecessarily difficult for the NER component, because spaCy has absolutely no hints from pre-trained entities like PERSON, LOC or DATE, which can of course usually be found in an invoice document. Also if I use a Matcher to detect IBAN/BIC with regex before the NER and name these matches something different like '
If you can confirm that this is the case, which language model would you recommend for invoice recognition in general?
-
Hi,
I have a complex situation with text from invoices, where the statistical model (NER) does not always find all entities (mostly when the internal structure of the given text changes substantially).
I found that I can fix this by adding an entity_ruler. For example, I add a regex pattern to detect IBAN/BIC.
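As an illustration of this kind of setup, here is a minimal sketch assuming spaCy v3 and a blank English pipeline in place of the trained one. The `BIC` label and the regex (8 or 11 characters in a single token) are assumptions for the example, not the exact pattern used here.

```python
import spacy

# Assumption: blank pipeline; with a trained model the ruler would be
# added after the NER, e.g. nlp.add_pipe("entity_ruler", after="ner")
nlp = spacy.blank("en")
ruler = nlp.add_pipe("entity_ruler")
ruler.add_patterns([
    # Token-level regex: 6 letters, 2 alphanumerics, optional 3-char branch code
    {"label": "BIC",
     "pattern": [{"TEXT": {"REGEX": r"^[A-Z]{6}[A-Z0-9]{2}([A-Z0-9]{3})?$"}}]},
])

doc = nlp("Pay to DEUTDEFF or COBADEFFXXX")
print([(ent.text, ent.label_) for ent in doc.ents])
```

Because the ruler matches every token that fits the regex, each BIC in the text ends up in `doc.ents`.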
With this combination all BIC numbers are, of course, now detected correctly, independently of the model quality.
But in many cases an invoice document may include several BIC numbers. After some training, the statistical model works perfectly on known invoice types, so it knows, for example, to take the first BIC from a list. This is great and is what I want to achieve.
But now, with my added entity_ruler, all the other BICs are always added to the result as well.
So my question is:
Is it possible to configure the pipes in an exclusive way, so that the entity_ruler is only applied if the NER has not already found a match?
Thanks for any tips
===
Ralph