Distributed inference for a spaCy custom NER model #11771
dave-espinosa started this conversation in Help: Best practices
Hello everyone. I trained a NER model, and I can get reasonably fast inference by applying all the hints explained in the speed FAQ. However, running inference on a fair amount of texts (<100K) is one thing, and running inference on hundreds of millions of texts (which is what I want to improve) is a whole different world. In the past, I have tried the following (chronological) approaches:
Using nlp.pipe and its argument n_process is the fastest way I have found to improve inference speed, but the computational resources are expensive (i.e., the number of available cores in your physical or virtual machine).
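In case it helps, this is roughly the kind of call I mean; a minimal sketch, where the model path, input file, batch_size and n_process values are placeholders to be tuned to the available cores:

```python
import spacy

# Placeholder path to the custom NER model.
nlp = spacy.load("path/to/my_custom_ner_model")

def iter_texts():
    # Stand-in for the real input source; in my case the texts come from
    # GCP storage, and there are hundreds of millions of them.
    with open("texts.txt", encoding="utf8") as f:
        for line in f:
            yield line.strip()

# Batched, multi-process inference as suggested in the speed FAQ.
for doc in nlp.pipe(iter_texts(), n_process=4, batch_size=1000):
    ents = [(ent.text, ent.label_) for ent in doc.ents]
    # ...write ents to the output sink of choice...
```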
The obvious next step is trying to improve the 4th approach into something more stable (maybe resorting to some product that I still don't know of? 🤔), but I thought it would be nice to hear from the community first. BTW, I am working with Google Cloud Platform related products, but any idea is worth trying at this point. I think I can summarize my queries as follows:
Any ideas?