Dealing with "MemoryError ('Error assigning xxxxxx bytes') #11962
Replies: 1 comment · 5 replies
-
This error makes it look like there's an integer overflow involved somewhere. Does it run without memory issues if you use an ngram suggester with short lengths? Can you share the code for the custom reducer?
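A quick way to try that in isolation, outside of training (just a sketch; the sizes and the sample text are placeholders, not taken from this thread):

```python
# Sketch: build spaCy's registered ngram suggester directly and run it on a doc
# as a quick check outside of training. The sizes are arbitrary placeholders.
import spacy
from spacy.util import registry

nlp = spacy.blank("en")
doc = nlp("This is a short test sentence for the suggester.")

# "spacy.ngram_suggester.v1" is the built-in suggester registered in registry.misc.
build_ngram_suggester = registry.misc.get("spacy.ngram_suggester.v1")
suggester = build_ngram_suggester(sizes=[1, 2, 3])

candidates = suggester([doc])   # thinc Ragged of (start, end) token offsets
print(candidates.dataXd.shape)  # (n_candidate_spans, 2)
print(candidates.lengths)       # number of candidates per doc
```

In the training config, the same suggester can be selected under `[components.spancat.suggester]` with `@misc = "spacy.ngram_suggester.v1"`.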
-
I am running the ngram model right now, so I will update. Here is the custom code for the reducer:

```python
from functools import partial
from pathlib import Path
from typing import Iterable, Callable
import spacy
from spacy.training import Example
from spacy.tokens import DocBin, Doc
from typing import List, Tuple, cast
from thinc.api import Model, with_getitem, chain, list2ragged, Logistic
from thinc.api import Maxout, Linear, concatenate, glorot_uniform_init, PyTorchLSTM
from thinc.api import reduce_mean, reduce_max, reduce_first, reduce_last
from thinc.types import Ragged, Floats2d
from spacy.util import registry
from spacy.tokens import Doc
from spacy.ml.extract_spans import extract_spans
@registry.layers("mean_max_reducer.v1.5")
def build_mean_max_reducer1(
    hidden_size: int, dropout: float = 0.0
) -> Model[Ragged, Floats2d]:
    """Reduce sequences by concatenating their mean and max pooled vectors,
    and then combine the concatenated vectors with a hidden layer.
    """
    return chain(
        concatenate(
            cast(Model[Ragged, Floats2d], reduce_last()),
            cast(Model[Ragged, Floats2d], reduce_first()),
            reduce_mean(),
            reduce_max(),
        ),
        Maxout(nO=hidden_size, normalize=True, dropout=dropout),
    )
```
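A minimal shape check for this reducer could look like the following (the sizes are arbitrary, and it assumes the `build_mean_max_reducer1` definition above is in scope):

```python
# Shape check for the reducer above: two "spans" of 4 and 6 token vectors of
# width 64 should come out as two vectors of width hidden_size.
import numpy
from thinc.types import Ragged

data = numpy.random.rand(10, 64).astype("float32")  # 10 token vectors, width 64
lengths = numpy.asarray([4, 6], dtype="int32")       # two spans: 4 + 6 = 10 rows
spans = Ragged(data, lengths)

reducer = build_mean_max_reducer1(hidden_size=128, dropout=0.2)
reducer.initialize(X=spans)             # infers the input widths from the sample
print(reducer.predict(spans).shape)     # (2, 128)
```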
I am not sure if this is related to the overall issue, but as per #11905, I tried the master branch from the repo in another env, resulting in the following: the transformer loss suddenly skyrockets, and I get a memory leak error.

```
=============================== train_spancat ===============================
ℹ Re-running 'train_spancat': spaCy minor version changed (3.3.0 in
project.lock, 3.5.0 current)
Running command: /Users/masakieguchi/opt/miniforge3/envs/spacy-exp3.5/bin/python -m spacy train configs/span_finder/RoBERTa_cx_max1_do0.2_sqbatch.cfg --output training/spancat/engagement_spl/RoBERTa_cx_max1_do0.2_sqbatch_span_finder/ --paths.train data/engagement_spl_train.spacy --paths.dev data/engagement_spl_dev.spacy --gpu-id -1 --vars.spans_key sc -c ./scripts/custom_functions.py
ℹ Saving to output directory:
training/spancat/engagement_spl/RoBERTa_cx_max1_do0.2_sqbatch_span_finder
ℹ Using CPU
ℹ To switch to GPU 0, use the option: --gpu-id 0
=========================== Initializing pipeline ===========================
/Users/masakieguchi/Dropbox/0_Projects/0_basenlp/1_spacy/spaCy/spacy/util.py:876: UserWarning: [W095] Model 'en_core_web_trf' (3.4.1) was trained with spaCy v3.4 and may not be 100% compatible with the current version (3.5.0). If you see errors or degraded performance, download a newer compatible model or retrain your custom model with the current spaCy version. For more details and available updates, run: python -m spacy validate
warnings.warn(warn_msg)
[2022-12-11 23:47:01,166] [INFO] Set up nlp object from config
[2022-12-11 23:47:01,172] [INFO] Pipeline: ['transformer', 'tagger', 'parser', 'ner', 'trainable_transformer', 'span_finder', 'spancat']
[2022-12-11 23:47:01,176] [INFO] Created vocabulary
[2022-12-11 23:47:01,177] [INFO] Finished initializing nlp object
Some weights of the model checkpoint at roberta-base were not used when initializing RobertaModel: ['lm_head.layer_norm.weight', 'lm_head.dense.bias', 'lm_head.dense.weight', 'lm_head.decoder.weight', 'lm_head.layer_norm.bias', 'lm_head.bias']
- This IS expected if you are initializing RobertaModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing RobertaModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
[2022-12-11 23:47:10,161] [INFO] Initialized pipeline components: ['trainable_transformer', 'span_finder', 'spancat']
✔ Initialized pipeline
============================= Training pipeline =============================
ℹ Pipeline: ['transformer', 'tagger', 'parser', 'ner',
'trainable_transformer', 'span_finder', 'spancat']
ℹ Frozen components: ['transformer', 'parser', 'tagger', 'ner']
ℹ Set annotations on update for: ['span_finder']
ℹ Initial learn rate: 0.0
E # LOSS TRAIN... LOSS SPAN_... LOSS SPANCAT SPAN_FINDE... SPAN_FINDE... SPAN_FINDE... SPANS_SC_F SPANS_SC_P SPANS_SC_R SCORE
--- ------ ------------- ------------- ------------ ------------- ------------- ------------- ---------- ---------- ---------- ------
0 0 2962483.36 48.88 28428.61 0.53 0.27 82.79 0.03 0.01 8.46 0.17
huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
To disable this warning, you can either:
- Avoid using `tokenizers` before the fork if possible
- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)
0 100 188985327.90 5296.43 1902432.71 0.53 0.27 99.81 0.00 0.00 0.00 0.20
0 200 13683.75 3514.56 20753.15 0.00 0.00 0.00 0.00 0.00 0.00 0.00
0 300 293.43 3428.55 14347.97 6.20 3.28 56.08 0.00 0.00 0.00 0.11
0 400 178.66 2294.31 8538.92 10.07 5.43 69.20 0.00 0.00 0.00 0.14
0 500 133836226712.80 2561.81 7867.32 18.38 10.71 64.64 27.83 78.41 16.92 0.39
0 600 148.54 2101.31 4386.72 16.33 9.25 69.77 41.32 78.46 28.04 0.49
0 700 129.05 1667.21 2570.57 20.41 11.88 72.15 45.42 68.65 33.94 0.52
0 800 40367669400.08 1996.37 3014.38 17.70 9.94 80.89 54.68 68.43 45.53 0.60
0 900 95623635103.78 1641.21 2325.80 18.33 10.34 80.42 57.53 66.54 50.67 0.62
0 1000 153.66 1732.75 2495.41 22.07 12.89 76.81 59.65 74.96 49.52 0.64
0 1100 324844494994.28 1493.34 2259.08 21.49 12.46 78.14 62.36 71.93 55.04 0.66
0 1200 1149607329984.28 1608.94 2440.73 20.32 11.61 81.37 66.10 69.38 63.12 0.69
1 1300 2676636778668.62 1345.51 2146.99 21.95 12.80 77.19 65.11 73.48 58.46 0.68
1 1400 6125195886747.96 1185.34 1915.89 20.30 11.61 80.61 65.22 63.35 67.21 0.68
1 1500 2453383545001.22 1276.02 1834.70 23.75 14.04 77.09 66.25 66.73 65.78 0.68
1 1600 135384356520.86 1347.67 1805.49 22.51 13.15 78.04 67.31 68.50 66.16 0.69
1 1700 10670272863392.86 1371.04 1741.47 22.50 13.15 77.95 67.72 68.34 67.11 0.70
2 1800 11429366935721.60 1418.81 1786.24 22.16 12.91 78.04 66.73 66.64 66.83 0.69
2 1900 31688885273744.43 1330.31 1622.04 22.04 12.80 79.18 69.01 68.09 69.96 0.71
2 2000 8201115268604.55 1496.64 1699.18 24.80 14.86 75.10 68.02 69.37 66.73 0.69
Epoch 3: 87%|██████████████████████████████████████████████████████████████████████████████████████████████████▎ | 87/100 [36:53<04:51, 22.43s/it]
/Users/masakieguchi/opt/miniforge3/envs/spacy-exp3.5/lib/python3.9/multiprocessing/resource_tracker.py:216: UserWarning: resource_tracker: There appear to be 1 leaked semaphore objects to clean up at shutdown
warnings.warn('resource_tracker: There appear to be %d '
```

I will keep updating this!
-
I noticed one other thing that might be part of the problem: can you try your original version without any of the frozen components? There are some known bugs related to freezing transformers (#11547), and you can always add these components later rather than trying to train with them already defined in the pipeline. Still keep the custom name/upstream like `trainable_transformer`.
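If it helps, adding the pretrained components back after training could look roughly like this (a sketch; the output path is a placeholder, and the component names are taken from the log above):

```python
# Sketch: train without the frozen components, then source them back from the
# pretrained package afterwards. The trained-model path below is a placeholder.
import spacy

trained = spacy.load("training/spancat/model-best")  # placeholder path
pretrained = spacy.load("en_core_web_trf")

for name in ("transformer", "tagger", "parser", "ner"):
    # Insert each sourced component ahead of the trainable transformer so the
    # final order matches the original pipeline.
    trained.add_pipe(name, source=pretrained, before="trainable_transformer")

print(trained.pipe_names)
```

The sourced components keep their trained weights; depending on the setup, spaCy may still warn about listener wiring for the sourced tagger/parser/ner, since they listen to the sourced `transformer`.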
-
Thank you Adriane! I have made some observations about the patterns of the problem. While I am figuring out the issue on my end, I would also like to share them here to start a conversation about possible fixes. One odd thing is that the configuration with 3.3.0 seemed to work with the small dataset I had, but once I added more annotations it started to throw the memory error. I then started looking for the right config to prevent that (reducing the nlp batch size and the training batch size; see the sketch at the end of this comment). Below I have summarized my observations so far. I wonder if I am making any mistakes in the setup described below.
Spacy 3.3.0
What seems to be working:
What does not work:
Spacy 3.4.3 (throws an error)
Spacy with master branch
What seems to be working:
What seems not to be working:
End-user observation
Completed training with jumps in Transformer loss
Finally, I would like to share another odd thing happening with the transformer loss. I am not sure whether this is really an issue, because the training technically finished. Here I trained the model on the larger dataset I mentioned, using the spacy master branch, and excluded the frozen components from the config. Training finished, but the transformer loss jumped and never came back down.
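A sketch of passing smaller batch settings as config overrides (the values are placeholders, and overriding `training.batcher.size` with a plain integer only applies if the batcher's size is not a schedule):

```python
# Sketch: run training programmatically with smaller batch-related settings.
# spacy.cli.train.train accepts the same config overrides as the CLI.
# Custom registered functions (e.g. the reducer above) must be imported first,
# just like passing -c scripts/custom_functions.py on the command line.
from spacy.cli.train import train

overrides = {
    "nlp.batch_size": 32,          # batch size used when running docs through nlp.pipe
    "training.batcher.size": 500,  # only valid if the batcher's size is a plain value
}
train(
    "configs/span_finder/RoBERTa_cx_max1_do0.2_sqbatch.cfg",
    output_path="training/spancat/debug",  # placeholder output path
    overrides=overrides,
)
```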
-
Could you possibly share your data (privately is fine) to make it easier for us to reproduce this on our end? Something is overflowing somewhere, but I don't immediately see where this might come from.
The bug with zero sequence lengths should be fixed in v3.4.4 (released yesterday) and the upcoming v3.3.2 (to be released today), so you shouldn't need to build from `master`.
Don't use pytorch v1.13.0 until numpy v1.24.0 is released, because there is a bug in numpy related to dlpack deleters. We recommend using v1.12.1 instead for now; see #11742.
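A quick way to confirm the relevant versions in the environment (just a sanity check):

```python
# Print the versions that matter for the issues mentioned above.
import numpy
import spacy
import torch

print("spaCy:", spacy.__version__)  # want >= 3.4.4, or >= 3.3.2 on the v3.3.x line
print("torch:", torch.__version__)  # 1.12.1 recommended until the numpy dlpack fix
print("numpy:", numpy.__version__)
```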
-
Sure! It is not public data yet, but I can share it with you privately. I think I figured out why, now that you mention "suggesting longer spans". Indeed, my annotations include very long spans, such as clause-length spans (a subordinate clause can be over 10 words, sometimes more than 30). And I think you are right that this is a real memory issue rather than a bug, if I think about it carefully. Based on your comment about long spans possibly causing the issue, I switched to the subtree_ngram suggester, and it seems to be working properly (in fact it has better coverage, because my annotations can be defined linguistically with subtrees and ngrams). I may work on a custom suggester that caters to my analytic needs (increasing precision while maintaining high recall). I confirmed that v3.3.4 worked properly with RoBERTa-base but still gives a zero score for RoBERTa-large; this is likely due to my learning rate...
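In case it is useful, a rough skeleton of such a custom suggester could look like this (the registered name and the length cap are hypothetical placeholders; the filtering rule is only an illustration and assumes the docs carry a dependency parse, e.g. from the frozen parser):

```python
# Skeleton for a custom spancat suggester: it receives a batch of Docs and
# returns a thinc Ragged of (start, end) token offsets, one block per doc.
# Here each token's syntactic subtree is proposed as a candidate span, capped
# at max_length tokens to keep the number of long candidates bounded.
from typing import Iterable, Optional

from spacy.tokens import Doc
from spacy.util import registry
from thinc.api import Ops, get_current_ops
from thinc.types import Ragged


@registry.misc("clause_subtree_suggester.v1")  # hypothetical name
def build_clause_subtree_suggester(max_length: int = 30):
    def suggest(docs: Iterable[Doc], *, ops: Optional[Ops] = None) -> Ragged:
        if ops is None:
            ops = get_current_ops()
        spans = []
        lengths = []
        for doc in docs:
            count = 0
            for tok in doc:
                start, end = tok.left_edge.i, tok.right_edge.i + 1
                if end - start <= max_length:
                    spans.append((start, end))
                    count += 1
            lengths.append(count)
        data = ops.xp.asarray(spans, dtype="int32").reshape((-1, 2))
        return Ragged(data, ops.xp.asarray(lengths, dtype="int32"))

    return suggest
```

The config would then point `[components.spancat.suggester]` at this registered function, loaded via the same custom-code mechanism as the reducer above.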
-
Hi all,
I have been training spancat models for a month now, and I see frequent memory errors during training. I can sometimes make it work by reducing the batch sizes, but the batch size keeps getting smaller, so I decided to ask whether there is any solution to this.
Description of the issue
Example error message
Example config
The following are two examples from my config:
I use a custom component where I set dropout for the Maxout layer for spancat.
Most recently, the following batch setting resulted in the same type of error.
Environment