MemoryError in multilabel textcat with 14k labels and bow architecture #5163
spaCy hasn't really been designed or tested for use with this many labels. Our general recommendation would be to consider an alternative like Vowpal Wabbit when you hit spaCy's limitations, since it may well be better and faster. In any case, it's interesting to hear that the …
Thanks for the quick response @adrianeboyd. I was sort of expecting this to be the case. The documents are abstracts of scientific publications, with an average of 189 tokens and a standard deviation of 77. I thought this might be related to the number of parameters that need to be learned, which should be proportional to the number of distinct tokens (the vocabulary size) multiplied by the number of labels. Note that the vocabulary is quite large. This is probably different in the CNN case, since the representation is dense due to the word vectors, so the number of parameters is smaller. Does this make sense? Is there any way to check how many parameters spaCy is trying to learn?
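To make the "features multiplied by labels" reasoning above concrete, here is a rough back-of-the-envelope sketch. It does not inspect spaCy itself: the helper `dense_weight_bytes` is hypothetical, and the feature-table size of 2**18 is an illustrative assumption, not a value confirmed anywhere in this thread. Only the label count (14k) comes from the discussion.

```python
def dense_weight_bytes(n_features: int, n_labels: int, bytes_per_param: int = 4) -> int:
    """Bytes needed to store one dense (n_features x n_labels) weight table."""
    return n_features * n_labels * bytes_per_param


N_LABELS = 14_000     # from the thread
N_FEATURES = 2 ** 18  # ASSUMPTION: illustrative feature-table size, not taken from spaCy

n_params = N_FEATURES * N_LABELS
weights = dense_weight_bytes(N_FEATURES, N_LABELS)  # 4 bytes per float32 parameter

print(f"parameters: {n_params:,}")
print(f"weights:    {weights / 2**30:.2f} GiB")
```

Under these assumptions the weights alone come to roughly 13.7 GiB. Any extra copy of the same shape (for example a gradient buffer or optimizer state) would multiply that figure, which is one plausible way a linear model with this many labels could approach the reported allocation size.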
How to reproduce the behaviour
I am training a textcat classifier on a dataset with 10k documents and 14k labels with non-exclusive classes. I am following a variation of the train_text example. The model trains fine with the simple_cnn architecture but throws a MemoryError for bow, reporting that it is unable to allocate 50GB of memory during the update step. This happens even when I pass documents one by one. I have noticed that the memory required seems to depend on the number of labels: passing a subset of labels works, but as the number of labels increases the memory footprint becomes prohibitive.
I am not sure whether this is a bug or a feature but could you explain what is going on and whether there is a way around this? Can this be managed better by reducing the vocabulary somehow?
I am happy to include more information about the code and error I am getting if this is helpful.
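On the question of reducing the vocabulary, one generic preprocessing option is to drop tokens that appear in very few documents before training. The sketch below is plain Python and independent of spaCy; the helpers `frequent_vocab` and `prune`, the `<unk>` placeholder, and the toy documents are all hypothetical. Whether this lowers memory use depends on whether the model's weight table actually scales with the vocabulary, which this thread does not establish.

```python
from collections import Counter
from typing import Iterable, List, Set


def frequent_vocab(docs: Iterable[List[str]], min_doc_freq: int = 2) -> Set[str]:
    """Return the tokens that occur in at least `min_doc_freq` documents."""
    doc_freq: Counter = Counter()
    for tokens in docs:
        doc_freq.update(set(tokens))  # count each token once per document
    return {tok for tok, n in doc_freq.items() if n >= min_doc_freq}


def prune(tokens: List[str], vocab: Set[str], unk: str = "<unk>") -> List[str]:
    """Replace out-of-vocabulary tokens with a single placeholder token."""
    return [t if t in vocab else unk for t in tokens]


# Toy tokenized documents, for illustration only.
docs = [
    ["protein", "binding", "assay"],
    ["protein", "folding"],
    ["binding", "kinetics", "assay"],
]

vocab = frequent_vocab(docs, min_doc_freq=2)
pruned_docs = [prune(d, vocab) for d in docs]
print(sorted(vocab))
print(pruned_docs[1])
```

Raising `min_doc_freq` trades coverage of rare terms for a smaller feature set, so it is worth checking the effect on accuracy for rare labels before relying on it.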
Your Environment