Memory usage of debug-data with a huge training set #4748
Comments
Thanks for the report!
You probably want to split these into multiple files. spaCy can also read from directories instead of single JSON files, so there's really no need to have a 2.3 GB file. This could easily cause other problems down the line. However, if debug-data is not memory-efficient and you can't use it with large data files, that's obviously bad. We could probably refactor the logic to process the data as a stream, make one single pass over each corpus, and compute all the stats that way. You can find the source here if you want to give it a try and see if it improves things for you: https://github.com/explosion/spaCy/blob/master/spacy/cli/debug_data.py
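Neither approach is spelled out in the thread, but both come down to never holding the whole corpus in memory at once. Below is a minimal sketch of the splitting idea, streaming the corpus line by line; it assumes the corpus is line-delimited JSON, and the file and directory names (train.jsonl, train_parts) are made up for illustration:

```python
from pathlib import Path

def split_jsonl(path, out_dir, docs_per_file=10_000):
    """Split one huge JSONL corpus into smaller files, streaming it
    line by line so peak memory stays flat regardless of corpus size."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    chunk, n_chunks = [], 0
    with Path(path).open(encoding="utf8") as f:
        for line in f:
            if line.strip():          # skip blank lines
                chunk.append(line)
            if len(chunk) >= docs_per_file:
                out_file = out_dir / f"part-{n_chunks:04d}.jsonl"
                out_file.write_text("".join(chunk), encoding="utf8")
                chunk, n_chunks = [], n_chunks + 1
    if chunk:                         # flush the final partial chunk
        out_file = out_dir / f"part-{n_chunks:04d}.jsonl"
        out_file.write_text("".join(chunk), encoding="utf8")

split_jsonl("train.jsonl", "train_parts")
```

The same pattern (a generator over lines plus running counters) is what a single-pass debug-data refactor would rely on: statistics are accumulated per document as it streams past, and no document is kept alive after its counters are updated.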
Hi Ines, thank you for your quick reply.
* A new DocTuple class is used to wrap a single document in the corpus while keeping track of its ID; this object mimics the behaviour of a two-element tuple, so it's still safe to destructure.
* More options have been added to continue the debug-data procedure even in case of errors.
* debug-data has been improved to analyze the corpora in a single pass in order to guarantee a low memory footprint (actually there is still one step where a second pass over the training data is required). This should help when debugging huge data sets.
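The PR's code isn't quoted in the thread, but the "mimics the behaviour of a two-element tuple" part can be illustrated by subclassing tuple, so any existing code that destructures training examples keeps working. This is a hypothetical sketch of the idea, not the actual DocTuple from the PR:

```python
class DocTuple(tuple):
    """Wrap a (text, annotations) pair while carrying a document ID
    that error messages can refer back to."""

    def __new__(cls, doc_id, pair):
        obj = super().__new__(cls, pair)  # build the two-element tuple itself
        obj.doc_id = doc_id               # extra metadata rides along
        return obj

# Destructuring still works exactly like a plain tuple:
example = DocTuple(42, ("Una frase di prova.", {"tags": ["DET", "NOUN", "ADP", "NOUN", "PUNCT"]}))
text, annots = example
assert example.doc_id == 42 and text.startswith("Una")
```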
Sorry for the late follow-up, but I just wanted to bump this issue, as I still think it's very relevant. The codebase has changed quite a bit since you created your PR, so I wanted to ask you @sfragis whether you have time to rebase your old PR against the new code.
Hi Sofie, I'd be happy to contribute but honestly I've no time at all.
Will do, thanks for letting me know!
Hi, I'm using spaCy 2.2.2 to train new tagger and parser models for the Italian language.
My training data set is quite big (about 2.3 GB for the training set and 580 MB for the dev set) and is saved in two JSONL files.
I'm experiencing unexpected memory usage when running the debug-data command: memory usage starts low and then grows until it consumes all of my 32 GB of RAM as well as the whole swap (about the same size). Before upgrading my RAM to 128 GB (which I suspect might be useless), I'm interested in your opinion about:
Info about spaCy