Memory usage of debug-data with a huge training set #4748

Open
sfragis opened this issue Dec 3, 2019 · 5 comments

Labels: enhancement, feat / cli, perf / memory

Comments


sfragis commented Dec 3, 2019

Hi, I'm using spaCy 2.2.2 to train new tagger and parser models for the Italian language.
My training data set is quite big (about 2.3 GB for the training set and 580 MB for the dev set) and is stored in two JSONL files.
I'm seeing unexpectedly high memory usage when running the debug-data command: it starts low and then grows until it consumes all 32 GB of my RAM as well as the entire swap (roughly the same size).
Before upgrading my RAM to 128 GB (which I suspect wouldn't help), I'd be interested in your opinion on:

  • hints on data set structure: for instance, the comments in issue Huge RAM consumption during NER training #4700 suggested reducing the average sentence length, but I have no idea what values would be optimal; is there a rule of thumb for sizing the data set properly?
  • possible optimizations to the source code to reduce the memory footprint (for instance, by improving the lazy loading of the data set, as sketched below); I'm willing to contribute to spaCy if anyone can kindly point me to the problematic parts (if any, of course)
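
To illustrate what I mean by lazy loading, here is a minimal sketch that reads the JSONL file as a stream with srsly (the serialization library from the spaCy ecosystem). The field names are hypothetical, not the actual training schema:

```python
import srsly  # serialization library used across the spaCy ecosystem

def iter_records(path):
    # srsly.read_jsonl returns a generator, so records are parsed one at a
    # time instead of the whole 2.3 GB file being held in memory at once
    yield from srsly.read_jsonl(path)

total_tokens = 0
for record in iter_records("train.jsonl"):
    # "tokens" is a hypothetical field name; the real schema depends on the
    # corpus format being used
    total_tokens += len(record.get("tokens", []))
print(total_tokens)
```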

Info about spaCy

  • spaCy version: 2.2.2
  • Platform: Linux-4.4.0-112-generic-x86_64-with-debian-stretch-sid
  • Python version: 3.7.4
ines added the enhancement, feat / cli and perf / memory labels on Dec 3, 2019

ines (Member) commented Dec 3, 2019

Thanks for the report!

> My training data set is quite big (about 2.3 GB for the train and 580 MB for the dev) and is saved in two JSONL files.

You probably want to split these into multiple files. spaCy can also read from directories instead of single JSON files, so there's really no need to have a 2.3 GB file. This could easily cause other problems down the line.
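
A minimal sketch of what that split could look like, assuming plain JSONL input (the file names and chunk size are arbitrary choices, not something spaCy requires):

```python
from pathlib import Path
import srsly

def split_jsonl(src, out_dir, chunk_size=10_000):
    """Split one large JSONL file into a directory of smaller JSONL files."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    chunk, n_files = [], 0
    for record in srsly.read_jsonl(src):  # lazy: one record at a time
        chunk.append(record)
        if len(chunk) >= chunk_size:
            srsly.write_jsonl(out_dir / f"train-{n_files:04d}.jsonl", chunk)
            chunk, n_files = [], n_files + 1
    if chunk:  # write whatever is left over
        srsly.write_jsonl(out_dir / f"train-{n_files:04d}.jsonl", chunk)

split_jsonl("train.jsonl", "corpus/train")
```

Only one chunk is ever held in memory at a time, so this also works on machines with much less RAM than the corpus size.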

About debug-data: Since the debug-data command is really mostly a debugging utility, we didn't particularly focus on optimising it for efficiency. For instance, I'm pretty sure we're just loading the whole corpus into memory (e.g. by calling list around it), and I think we're also making at least one additional pass over the data to compute the stats. That's typically okay, because you're usually just running the debugging manually a few times and even if you have to wait for a few minutes, that's not a big deal.

However, if it's not memory-efficient and you can't use it with large data files, that's obviously bad.

We could probably refactor the logic to only process the data as a stream, make one single pass over each corpus and compute all the stats that way. You can find the source here if you want to give it a try and see if it improves things for you: https://github.com/explosion/spaCy/blob/master/spacy/cli/debug_data.py
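
For reference, a rough sketch of that streaming idea (not the actual debug_data.py code; the counters and the (doc, annotations) structure are simplifications):

```python
from collections import Counter

def analyse_stream(examples):
    # `examples` is a generator; all stats are accumulated in a single pass
    # and the corpus is never materialized with list()
    stats = {"n_docs": 0, "n_tokens": 0, "tags": Counter()}
    for doc, annotations in examples:
        stats["n_docs"] += 1
        stats["n_tokens"] += len(annotations.get("words", []))
        stats["tags"].update(annotations.get("tags", []))
    return stats
```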

sfragis (Author) commented Dec 3, 2019

Hi Ines, thank you for your quick reply.
I managed to read the whole dataset from JSONL and save it into smaller MessagePack files.
The problem seems related to the invocation of GoldCorpus.train_docs, where the returned generator is turned into a list, as you mentioned.
I will try to make the rest of the code more stream-oriented and open a pull request if I succeed.
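
One general pattern I'm considering for the places that only need to check that the corpus is non-empty before iterating it (a sketch only, with the call and its arguments simplified):

```python
import itertools

def peek(stream):
    """Return (first_item, stream) without materializing the generator."""
    try:
        first = next(stream)
    except StopIteration:
        return None, iter(())
    # put the first item back in front of the remaining stream
    return first, itertools.chain([first], stream)

# Hypothetical usage inside debug-data (names taken from this issue, not verified):
#   first, train_docs = peek(corpus.train_docs(nlp))
#   if first is None:
#       raise ValueError("Training corpus is empty")
#   for doc_tuple in train_docs:  # single lazy pass over the data
#       ...
```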

sfragis pushed a commit to sfragis/spaCy that referenced this issue Dec 6, 2019
* A new DocTuple class is used to wrap a single document in the corpus
  while keeping track of its ID; this object mimics the behaviour of a
  two-element tuple, so it's still safe to destructure it (a sketch of the
  idea follows below).
* More options have been added to continue the debug-data procedure even
  in case of errors.
* debug-data has been improved to analyze the corpora in one single pass
  in order to guarantee a low memory footprint (there is actually still
  one step that requires a second pass over the training data).
  This should help with debugging huge data sets.
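
The DocTuple idea, roughly (a sketch of the concept, not the code from the commit):

```python
class DocTuple:
    """Wraps (doc, annotations) and keeps the document ID for error reports.

    It behaves like a two-element tuple, so existing code that does
    `doc, annotations = item` keeps working unchanged.
    """

    def __init__(self, doc_id, doc, annotations):
        self.doc_id = doc_id
        self.doc = doc
        self.annotations = annotations

    def __iter__(self):
        yield self.doc
        yield self.annotations

    def __getitem__(self, index):
        return (self.doc, self.annotations)[index]

    def __len__(self):
        return 2
```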

svlandeg (Member) commented

Sorry for the late follow-up, but I just wanted to bump this issue as I still think it's very relevant. Since you created that PR, the develop branch has been coming together nicely, but I think the same issues with debug-data are still present. For instance, we're still calling list(Corpus(train_path)(nlp)).

I wanted to ask you, @sfragis, whether you have time to rebase your old PR against the new develop branch? If not, I could try to pick up the ideas from your old PR and reapply them in a new one...
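
Roughly, the idea would be to keep that call lazy, along the lines of the sketch below (the import path and exact API are my assumptions; the Corpus class may live elsewhere on the current develop branch):

```python
from pathlib import Path
import spacy
from spacy.training import Corpus  # location in released v3; may differ on develop

nlp = spacy.blank("it")            # placeholder pipeline for the sketch
train_path = Path("corpus/train")  # directory of serialized training docs

corpus = Corpus(train_path)
examples = corpus(nlp)             # a generator of Example objects, not a list
n_docs = sum(1 for _ in examples)  # one lazy pass instead of list(Corpus(train_path)(nlp))
print(n_docs)
```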

sfragis (Author) commented Aug 24, 2020

Hi Sofie, I'd be happy to contribute, but honestly I have no time at all.
Feel free to pick code and ideas from my PR and adapt them to the develop branch.
Cheers

svlandeg (Member) commented

Will do, thanks for letting me know!
