Spacy + Dask? #5111
Related: #3552
Thanks for the detailed report! I wonder whether this may actually be solved by the very recent PR #5081 that was just merged onto master. Is there any chance you could try building the current master branch from source, and see whether that would fix your issue?
@svlandeg I think by running
Ah, yes, that should do it. You can double check in the
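For reference, one way to double-check which spaCy build is actually installed is to print the version from Python (a minimal sketch, not part of the original reply):

```python
import spacy

# Should report the development version if the build from the master source is active.
print(spacy.__version__)
```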
What is annoying is that I can't really reproduce this. I actually run into a
@svlandeg let me attach a Jupyter notebook where I can more reliably reproduce the error, early next week I think. Based on that, and by moving the model loading onto the worker, this pattern has been working for me so far:

```python
import spacy
from dask.distributed import get_worker

def process(doc):
    # Reuse (or lazily create) an nlp object cached on the Dask worker,
    # so the model is loaded once per worker rather than once per task.
    worker = get_worker()
    try:
        nlp = worker.nlp
    except AttributeError:
        nlp = spacy.load(model)  # `model` is the model name defined earlier
        worker.nlp = nlp
    return [str(x) for x in nlp(doc).ents]

ents = bag.map(process).compute()

# The same approach works if you want to process one partition at a time,
# which allows using `nlp.pipe` with some batch size and is even faster:
ents = bag.map_partitions(process_partition).compute()
```
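The `process_partition` helper isn't shown in the comment above; here is a minimal sketch of what it could look like under the same worker-caching approach (the model name and batch size are assumptions, not from the original thread):

```python
import spacy
from dask.distributed import get_worker

def process_partition(texts, model="en_core_web_sm", batch_size=64):
    # Assumed helper: reuse (or lazily create) an nlp object cached on the worker.
    worker = get_worker()
    try:
        nlp = worker.nlp
    except AttributeError:
        nlp = spacy.load(model)
        worker.nlp = nlp
    # Process the whole partition in batches with nlp.pipe for speed.
    return [[str(ent) for ent in doc.ents]
            for doc in nlp.pipe(texts, batch_size=batch_size)]
```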
Considering there's a workaround for now (though it might not be ideal), I suggest revisiting this issue once we're closer to releasing v3. Hopefully the refactored vectors will make this a non-issue then.
Just wanted to add that the workaround seems to have limitations, though I have not yet had time to fully investigate the root cause: usnistgov/cv-py#1
I think you'll have an easier time of things if you try to parallelise over larger units of work, e.g. a few dozen megabytes of compressed text per process. I would recommend chunking the text beforehand in a preprocess, and just passing the input and output paths to dask. You'd then have something like this as your remote task:

```python
def process_job(model_name, input_path, output_path):
    # `load_or_get_model`, `read_texts`, `convert_output` and `write_outputs`
    # are placeholder helpers for loading/caching the model and doing the I/O.
    nlp = load_or_get_model(model_name)
    outputs = []
    texts = read_texts(input_path)
    for doc in nlp.pipe(texts):
        outputs.append(convert_output(doc))
    write_outputs(outputs, output_path)
```

I advise against trying to handle the parallelism too transparently. Really fine-grained task distribution is a method of last resort, for when the scheduling is difficult enough that you can't really do it manually. When you have a simple loop, you're much better off cutting it up into chunks yourself, and only distributing units of work that take at least a few minutes to complete. It's much more efficient and far more reliable, because you can manage failure by just repeating the jobs whose output files aren't there.
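A minimal sketch of how those coarse-grained jobs could be submitted with `dask.distributed`, skipping chunks whose output file already exists so a failed run can simply be repeated (this builds on the `process_job` function above; the `chunk_paths` list and output naming are assumptions, not from the original comment):

```python
import os
from dask.distributed import Client

def run_jobs(process_job, model_name, chunk_paths, output_dir):
    # Connect to a Dask cluster (a default local cluster here, as an assumption).
    client = Client()
    futures = []
    for input_path in chunk_paths:
        output_path = os.path.join(output_dir, os.path.basename(input_path) + ".out")
        # Skip chunks that already have an output file, so failures can be
        # handled by simply re-running the whole thing.
        if os.path.exists(output_path):
            continue
        futures.append(client.submit(process_job, model_name, input_path, output_path))
    client.gather(futures)
```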
I'm unable to get a simple example using spaCy + dask.distributed up and running. In the context of a Jupyter notebook, I receive this error:
In the context of a script, it simply hangs.
How to reproduce the behavior
Here's my attempt at creating a reproducible script, although I get slightly different behavior in a notebook setting.
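(The script itself isn't included here; below is a hypothetical minimal sketch of the kind of naive setup described, with the model name, client setup and example texts as assumptions rather than the reporter's actual code.)

```python
import dask.bag as db
import spacy
from dask.distributed import Client

# Hypothetical reconstruction: load the model in the main process and let Dask
# serialize it into each task, which is where the hang/error tends to appear.
nlp = spacy.load("en_core_web_sm")
client = Client()

texts = ["This is a sentence about Apple.", "Another sentence about Google."]
bag = db.from_sequence(texts)
ents = bag.map(lambda text: [str(ent) for ent in nlp(text).ents]).compute()
print(ents)
```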
Info about spaCy