spaCy inference speed for single text #11554
marzooq-unbxd started this conversation in Help: Best practices
-
The easiest ways to improve processing speed are either to batch requests (reducing per-call setup/teardown costs) or to do less work (using a smaller architecture). If your API doesn't allow batching requests externally, you can batch them internally. A simple way to do that would be to have a separate thread with a queue that handles batching and calling `nlp.pipe`.
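A minimal sketch of that internal-batching idea, assuming threaded Flask/gunicorn workers and an installed `en_core_web_sm` model; the `BatchedNLP` class and its `max_batch_size`/`max_wait_s` parameters are illustrative names, not part of spaCy's API:

```python
# Sketch: request handlers enqueue texts, a single worker thread drains the
# queue and runs nlp.pipe() over the collected batch.
import queue
import threading

import spacy


class BatchedNLP:
    # Hypothetical helper, not a spaCy API.
    def __init__(self, nlp, max_batch_size=32, max_wait_s=0.01):
        self.nlp = nlp
        self.max_batch_size = max_batch_size
        self.max_wait_s = max_wait_s
        self._queue = queue.Queue()
        self._worker = threading.Thread(target=self._run, daemon=True)
        self._worker.start()

    def parse(self, text):
        """Called from request handlers; blocks until the Doc is ready."""
        done = threading.Event()
        item = {"text": text, "doc": None, "done": done}
        self._queue.put(item)
        done.wait()
        return item["doc"]

    def _run(self):
        while True:
            # Block for the first item, then gather more for a short window
            # or until the batch is full.
            batch = [self._queue.get()]
            while len(batch) < self.max_batch_size:
                try:
                    batch.append(self._queue.get(timeout=self.max_wait_s))
                except queue.Empty:
                    break
            texts = [item["text"] for item in batch]
            # nlp.pipe preserves input order, so zip pairs each Doc with
            # the request that submitted it.
            for item, doc in zip(batch, self.nlp.pipe(texts)):
                item["doc"] = doc
                item["done"].set()


# Usage: share one instance across request-handler threads.
batched = BatchedNLP(spacy.load("en_core_web_sm"))
doc = batched.parse("Apple is looking at buying a U.K. startup.")
print([(ent.text, ent.label_) for ent in doc.ents])
```

The trade-off is a small added wait (`max_wait_s`) per request in exchange for amortizing the pipeline's per-batch overhead whenever several requests arrive close together.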
-
I was trying to deploy my spaCy NER model to a Flask web app behind a gunicorn layer.
I read about improving inference speed by moving from `nlp(text)` to `nlp.pipe([list_of_texts])`, but for `nlp.pipe` to reuse resources, shouldn't `list_of_texts` have a length greater than 1? My service returns NER entities for every single request it receives (a list of length 1), and I was seeing high latency just from increasing the requests per second. Changing `n_process` wouldn't help according to #10087, and changing `n_threads` is no longer an option (it doesn't get around the GIL?).
Is the only way to solve this to reduce the model architecture size?
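For context, the two call patterns being compared look roughly like this (an illustrative sketch; the model name and texts are placeholders):

```python
import spacy

nlp = spacy.load("en_core_web_sm")  # placeholder model

# Per-request call: one Doc at a time, pipeline overhead paid per text.
doc = nlp("Unbxd is based in Bangalore.")
print([(ent.text, ent.label_) for ent in doc.ents])

# Batched call: nlp.pipe() processes an iterable of texts together, which
# only pays off when the iterable actually contains many texts.
texts = ["Unbxd is based in Bangalore.", "Apple acquired a startup."]
for doc in nlp.pipe(texts, batch_size=64):
    print([(ent.text, ent.label_) for ent in doc.ents])
```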