MemoryError in multilabel textcat with 14k labels and bow architecture #5163

nsorros · 2020-03-17T15:47:59Z

How to reproduce the behaviour

I am training a textcat classifier on a dataset with 10k documents and 14k labels with non-exclusive classes. I am following a variation of the train_text example. The model trains fine with the simple_cnn architecture, but with bow it throws a MemoryError saying that it is unable to allocate 50GB of memory during the update step.

This happens even when I pass documents one by one. I have noticed that the memory use seems to depend on the number of labels: passing a subset of labels works, but as the number of labels increases the memory footprint becomes prohibitive.

I am not sure whether this is a bug or a feature but could you explain what is going on and whether there is a way around this? Can this be managed better by reducing the vocabulary somehow?

I am happy to include more information about the code and error I am getting if this is helpful.

Your Environment

spaCy version: 2.2.1
Platform: Linux-4.15.0-1060-aws-x86_64-with-Ubuntu-18.04-bionic
Python version: 3.6.9

adrianeboyd · 2020-03-17T17:03:53Z

spaCy hasn't really been designed or tested for use with this many labels. Our general recommendation would be to consider an alternative like Vowpal Wabbit when you hit spaCy's limitations, since it may well be better and faster.

In any case, it's interesting to hear that the simple_cnn model trains well, especially with such a small number of documents. In the experiments I've run, bow was typically better and faster than simple_cnn for datasets with a smaller number of labels, but I didn't profile the memory usage, so I'm not sure what might be going on. How long are your documents?

nsorros · 2020-03-18T09:10:23Z

Thanks for the quick response @adrianeboyd. I was sort of expecting this to be the case. The documents are abstracts of scientific publications; they have an average of 189 tokens with a standard deviation of 77.

I thought that this might be related to the number of parameters that need to be learned, which should be proportional to the number of tokens in the vocabulary multiplied by the number of labels. Note that the vocabulary is quite large. This is probably different in the CNN case, as the representation is dense due to the word vectors, so the number of parameters is smaller. Does this make sense? Is there any way to check how many parameters spaCy is trying to learn?
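
To make the scaling argument concrete, here is a rough back-of-the-envelope sketch in Python. It assumes a dense float32 weight matrix with one weight per (feature, label) pair. The feature count of 2**18 is a hypothetical value picked for illustration, not something confirmed in this thread for the bow model, and the function names are made up for this example.

```python
def linear_param_count(n_features: int, n_labels: int) -> int:
    """Weights in a dense linear layer: one per (feature, label) pair."""
    return n_features * n_labels


def memory_gib(n_params: int, bytes_per_param: int = 4) -> float:
    """Memory needed for n_params values, float32 (4 bytes) by default."""
    return n_params * bytes_per_param / 2**30


N_LABELS = 14_000    # label count from the dataset described above
N_FEATURES = 2**18   # ASSUMPTION: hypothetical feature-table size

params = linear_param_count(N_FEATURES, N_LABELS)
print(f"{params:,} parameters, {memory_gib(params):.2f} GiB for the weights alone")
```

Under these assumptions the weights alone already take more than 13 GiB, and any gradient or optimizer buffers of the same shape would add to that. The estimate is linear in the number of labels, which would match the observation that a subset of labels trains fine.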
