Dealing with "MemoryError ('Error assigning xxxxxx bytes') #11962
Replies: 1 comment · 5 replies
-
This error makes it look like there's an integer overflow involved somewhere. Does it run without memory issues if you use an ngram suggester with short lengths? Can you share the code for the custom reducer?
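A quick way to try that in isolation, outside of training (just a sketch; the sizes and the sample text are placeholders, not taken from this thread):

```python
# Sketch: build spaCy's registered ngram suggester directly and run it on a doc
# as a quick check outside of training. The sizes are arbitrary placeholders.
import spacy
from spacy.util import registry

nlp = spacy.blank("en")
doc = nlp("This is a short test sentence for the suggester.")

# "spacy.ngram_suggester.v1" is the built-in suggester registered in registry.misc.
build_ngram_suggester = registry.misc.get("spacy.ngram_suggester.v1")
suggester = build_ngram_suggester(sizes=[1, 2, 3])

candidates = suggester([doc])   # thinc Ragged of (start, end) token offsets
print(candidates.dataXd.shape)  # (n_candidate_spans, 2)
print(candidates.lengths)       # number of candidates per doc
```

In the training config, the same suggester can be selected under `[components.spancat.suggester]` with `@misc = "spacy.ngram_suggester.v1"`.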
-
I am running the ngram model right now, so I will update. Here is the custom code for the reducer:

```python
from functools import partial
from pathlib import Path
from typing import Iterable, Callable
import spacy
from spacy.training import Example
from spacy.tokens import DocBin, Doc
from typing import List, Tuple, cast
from thinc.api import Model, with_getitem, chain, list2ragged, Logistic
from thinc.api import Maxout, Linear, concatenate, glorot_uniform_init, PyTorchLSTM
from thinc.api import reduce_mean, reduce_max, reduce_first, reduce_last
from thinc.types import Ragged, Floats2d
from spacy.util import registry
from spacy.tokens import Doc
from spacy.ml.extract_spans import extract_spans
@registry.layers("mean_max_reducer.v1.5")
def build_mean_max_reducer1(
    hidden_size: int, dropout: float = 0.0
) -> Model[Ragged, Floats2d]:
    """Reduce sequences by concatenating their mean and max pooled vectors,
    and then combine the concatenated vectors with a hidden layer.
    """
    return chain(
        concatenate(
            cast(Model[Ragged, Floats2d], reduce_last()),
            cast(Model[Ragged, Floats2d], reduce_first()),
            reduce_mean(),
            reduce_max(),
        ),
        Maxout(nO=hidden_size, normalize=True, dropout=dropout),
    )
```
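A minimal shape check for this reducer could look like the following (the sizes are arbitrary, and it assumes the `build_mean_max_reducer1` definition above is in scope):

```python
# Shape check for the reducer above: two "spans" of 4 and 6 token vectors of
# width 64 should come out as two vectors of width hidden_size.
import numpy
from thinc.types import Ragged

data = numpy.random.rand(10, 64).astype("float32")  # 10 token vectors, width 64
lengths = numpy.asarray([4, 6], dtype="int32")       # two spans: 4 + 6 = 10 rows
spans = Ragged(data, lengths)

reducer = build_mean_max_reducer1(hidden_size=128, dropout=0.2)
reducer.initialize(X=spans)             # infers the input widths from the sample
print(reducer.predict(spans).shape)     # (2, 128)
```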
I am not sure if this is related to the overall issue, but as per #11905, I tried the master branch from the repo in another env, resulting in the following: the transformer loss suddenly skyrockets, and I get a memory leak error.

```
=============================== train_spancat ===============================
ℹ Re-running 'train_spancat': spaCy minor version changed (3.3.0 in
project.lock, 3.5.0 current)
Running command: /Users/masakieguchi/opt/miniforge3/envs/spacy-exp3.5/bin/python -m spacy train configs/span_finder/RoBERTa_cx_max1_do0.2_sqbatch.cfg --output training/spancat/engagement_spl/RoBERTa_cx_max1_do0.2_sqbatch_span_finder/ --paths.train data/engagement_spl_train.spacy --paths.dev data/engagement_spl_dev.spacy --gpu-id -1 --vars.spans_key sc -c ./scripts/custom_functions.py
ℹ Saving to output directory:
training/spancat/engagement_spl/RoBERTa_cx_max1_do0.2_sqbatch_span_finder
ℹ Using CPU
ℹ To switch to GPU 0, use the option: --gpu-id 0
=========================== Initializing pipeline ===========================
/Users/masakieguchi/Dropbox/0_Projects/0_basenlp/1_spacy/spaCy/spacy/util.py:876: UserWarning: [W095] Model 'en_core_web_trf' (3.4.1) was trained with spaCy v3.4 and may not be 100% compatible with the current version (3.5.0). If you see errors or degraded performance, download a newer compatible model or retrain your custom model with the current spaCy version. For more details and available updates, run: python -m spacy validate
warnings.warn(warn_msg)
[2022-12-11 23:47:01,166] [INFO] Set up nlp object from config
[2022-12-11 23:47:01,172] [INFO] Pipeline: ['transformer', 'tagger', 'parser', 'ner', 'trainable_transformer', 'span_finder', 'spancat']
[2022-12-11 23:47:01,176] [INFO] Created vocabulary
[2022-12-11 23:47:01,177] [INFO] Finished initializing nlp object
Some weights of the model checkpoint at roberta-base were not used when initializing RobertaModel: ['lm_head.layer_norm.weight', 'lm_head.dense.bias', 'lm_head.dense.weight', 'lm_head.decoder.weight', 'lm_head.layer_norm.bias', 'lm_head.bias']
- This IS expected if you are initializing RobertaModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing RobertaModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
[2022-12-11 23:47:10,161] [INFO] Initialized pipeline components: ['trainable_transformer', 'span_finder', 'spancat']
✔ Initialized pipeline
============================= Training pipeline =============================
ℹ Pipeline: ['transformer', 'tagger', 'parser', 'ner',
'trainable_transformer', 'span_finder', 'spancat']
ℹ Frozen components: ['transformer', 'parser', 'tagger', 'ner']
ℹ Set annotations on update for: ['span_finder']
ℹ Initial learn rate: 0.0
E # LOSS TRAIN... LOSS SPAN_... LOSS SPANCAT SPAN_FINDE... SPAN_FINDE... SPAN_FINDE... SPANS_SC_F SPANS_SC_P SPANS_SC_R SCORE
--- ------ ------------- ------------- ------------ ------------- ------------- ------------- ---------- ---------- ---------- ------
0 0 2962483.36 48.88 28428.61 0.53 0.27 82.79 0.03 0.01 8.46 0.17
huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
To disable this warning, you can either:
- Avoid using `tokenizers` before the fork if possible
- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)
0 100 188985327.90 5296.43 1902432.71 0.53 0.27 99.81 0.00 0.00 0.00 0.20
0 200 13683.75 3514.56 20753.15 0.00 0.00 0.00 0.00 0.00 0.00 0.00
0 300 293.43 3428.55 14347.97 6.20 3.28 56.08 0.00 0.00 0.00 0.11
0 400 178.66 2294.31 8538.92 10.07 5.43 69.20 0.00 0.00 0.00 0.14
0 500 133836226712.80 2561.81 7867.32 18.38 10.71 64.64 27.83 78.41 16.92 0.39
0 600 148.54 2101.31 4386.72 16.33 9.25 69.77 41.32 78.46 28.04 0.49
0 700 129.05 1667.21 2570.57 20.41 11.88 72.15 45.42 68.65 33.94 0.52
0 800 40367669400.08 1996.37 3014.38 17.70 9.94 80.89 54.68 68.43 45.53 0.60
0 900 95623635103.78 1641.21 2325.80 18.33 10.34 80.42 57.53 66.54 50.67 0.62
0 1000 153.66 1732.75 2495.41 22.07 12.89 76.81 59.65 74.96 49.52 0.64
0 1100 324844494994.28 1493.34 2259.08 21.49 12.46 78.14 62.36 71.93 55.04 0.66
0 1200 1149607329984.28 1608.94 2440.73 20.32 11.61 81.37 66.10 69.38 63.12 0.69
1 1300 2676636778668.62 1345.51 2146.99 21.95 12.80 77.19 65.11 73.48 58.46 0.68
1 1400 6125195886747.96 1185.34 1915.89 20.30 11.61 80.61 65.22 63.35 67.21 0.68
1 1500 2453383545001.22 1276.02 1834.70 23.75 14.04 77.09 66.25 66.73 65.78 0.68
1 1600 135384356520.86 1347.67 1805.49 22.51 13.15 78.04 67.31 68.50 66.16 0.69
1 1700 10670272863392.86 1371.04 1741.47 22.50 13.15 77.95 67.72 68.34 67.11 0.70
2 1800 11429366935721.60 1418.81 1786.24 22.16 12.91 78.04 66.73 66.64 66.83 0.69
2 1900 31688885273744.43 1330.31 1622.04 22.04 12.80 79.18 69.01 68.09 69.96 0.71
2 2000 8201115268604.55 1496.64 1699.18 24.80 14.86 75.10 68.02 69.37 66.73 0.69
Epoch 3: 87%|██████████████████████████████████████████████████████████████████████████████████████████████████▎ | 87/100 [36:53<04:51, 22.43s/it]
/Users/masakieguchi/opt/miniforge3/envs/spacy-exp3.5/lib/python3.9/multiprocessing/resource_tracker.py:216: UserWarning: resource_tracker: There appear to be 1 leaked semaphore objects to clean up at shutdown
warnings.warn('resource_tracker: There appear to be %d '
```

I will keep updating this!
-
I noticed one other thing that might be part of the problem: can you try your original version without any of the frozen components? There are some known bugs related to freezing transformers (#11547), and you can always add these components later rather than trying to train with them already defined in the pipeline. Still keep the custom name/upstream like `trainable_transformer`.
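If it helps, adding the pretrained components back after training could look roughly like this (a sketch; the output path is a placeholder, and the component names are taken from the log above):

```python
# Sketch: train without the frozen components, then source them back from the
# pretrained package afterwards. The trained-model path below is a placeholder.
import spacy

trained = spacy.load("training/spancat/model-best")  # placeholder path
pretrained = spacy.load("en_core_web_trf")

for name in ("transformer", "tagger", "parser", "ner"):
    # Insert each sourced component ahead of the trainable transformer so the
    # final order matches the original pipeline.
    trained.add_pipe(name, source=pretrained, before="trainable_transformer")

print(trained.pipe_names)
```

The sourced components keep their trained weights; depending on the setup, spaCy may still warn about listener wiring for the sourced tagger/parser/ner, since they listen to the sourced `transformer`.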
-
Thank you Adriane! I have made some observations about the patterns of the problem. While I am figuring out the issue on my end, I would also like to share them here to start a conversation about possible fixes. One odd thing is that the configuration with 3.3.0 seemed to work with the small dataset I had, but once I added more annotations it started to throw the memory error. I then started looking for the right config to prevent that (reducing the nlp batch size and the training batch size; see the sketch at the end of this comment). Below I have summarized my observations so far. I wonder if I am making any mistakes in the setup described below.
Spacy 3.3.0
What seems to be working:
What does not work:
Spacy 3.4.3 (throws an error)
Spacy with master branch
What seems to be working:
What seems not to be working:
End-user observation
Completed training with jumps in Transformer loss
Finally, I would like to share another odd thing happening with the transformer loss. I am not sure whether this is really an issue, because the training technically finished. Here I trained the model on the larger dataset I mentioned, using the spacy master branch, and excluded the frozen components from the config. Training finished, but the transformer loss jumped and never came back down.
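A sketch of passing smaller batch settings as config overrides (the values are placeholders, and overriding `training.batcher.size` with a plain integer only applies if the batcher's size is not a schedule):

```python
# Sketch: run training programmatically with smaller batch-related settings.
# spacy.cli.train.train accepts the same config overrides as the CLI.
# Custom registered functions (e.g. the reducer above) must be imported first,
# just like passing -c scripts/custom_functions.py on the command line.
from spacy.cli.train import train

overrides = {
    "nlp.batch_size": 32,          # batch size used when running docs through nlp.pipe
    "training.batcher.size": 500,  # only valid if the batcher's size is a plain value
}
train(
    "configs/span_finder/RoBERTa_cx_max1_do0.2_sqbatch.cfg",
    output_path="training/spancat/debug",  # placeholder output path
    overrides=overrides,
)
```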
-
Could you possibly share your data (privately is fine) to make it easier for us to reproduce this on our end? Something is overflowing somewhere, but I don't immediately see where this might come from.
The bug with zero sequence lengths should be fixed in v3.4.4 (released yesterday) and the upcoming v3.3.2 (to be released today), so you shouldn't need to build from `master`.
Don't use pytorch v1.13.0 until numpy v1.24.0 is released, because there is a bug in numpy related to dlpack deleters. We recommend using v1.12.1 instead for now; see #11742.
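A quick way to confirm the relevant versions in the environment (just a sanity check):

```python
# Print the versions that matter for the issues mentioned above.
import numpy
import spacy
import torch

print("spaCy:", spacy.__version__)  # want >= 3.4.4, or >= 3.3.2 on the v3.3.x line
print("torch:", torch.__version__)  # 1.12.1 recommended until the numpy dlpack fix
print("numpy:", numpy.__version__)
```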
-
Sure! It is not public data yet, but I can share it with you privately. I think I figured out why, now that you mention "suggesting longer spans". Indeed, my annotations include very long spans, such as clause-length spans (a subordinate clause can be over 10 words, sometimes more than 30). And I think you are right that this is a real memory issue rather than a bug, if I think about it carefully. Based on your comment about long spans possibly causing the issue, I switched to the subtree_ngram suggester, and it seems to be working properly (in fact it has better coverage, because my annotations can be defined linguistically with subtrees and ngrams). I may work on a custom suggester that caters to my analytic needs (increasing precision while maintaining high recall). I confirmed that v3.3.4 worked properly with RoBERTa-base but still gives a zero score for RoBERTa-large; this is likely due to my learning rate...
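In case it is useful, a rough skeleton of such a custom suggester could look like this (the registered name and the length cap are hypothetical placeholders; the filtering rule is only an illustration and assumes the docs carry a dependency parse, e.g. from the frozen parser):

```python
# Skeleton for a custom spancat suggester: it receives a batch of Docs and
# returns a thinc Ragged of (start, end) token offsets, one block per doc.
# Here each token's syntactic subtree is proposed as a candidate span, capped
# at max_length tokens to keep the number of long candidates bounded.
from typing import Iterable, Optional

from spacy.tokens import Doc
from spacy.util import registry
from thinc.api import Ops, get_current_ops
from thinc.types import Ragged


@registry.misc("clause_subtree_suggester.v1")  # hypothetical name
def build_clause_subtree_suggester(max_length: int = 30):
    def suggest(docs: Iterable[Doc], *, ops: Optional[Ops] = None) -> Ragged:
        if ops is None:
            ops = get_current_ops()
        spans = []
        lengths = []
        for doc in docs:
            count = 0
            for tok in doc:
                start, end = tok.left_edge.i, tok.right_edge.i + 1
                if end - start <= max_length:
                    spans.append((start, end))
                    count += 1
            lengths.append(count)
        data = ops.xp.asarray(spans, dtype="int32").reshape((-1, 2))
        return Ragged(data, ops.xp.asarray(lengths, dtype="int32"))

    return suggest
```

The config would then point `[components.spancat.suggester]` at this registered function, loaded via the same custom-code mechanism as the reducer above.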
-
Hi all,
I have been training spancat models for a month now, and I see frequent memory errors during training. I can sometimes make it work by reducing the batch sizes, but the batch size keeps getting smaller, so I decided to ask whether there is any solution to this.
Description of the issue
Example error message
Example config
The following are two examples from my config:
I use a custom component where I set dropout for the Maxout layer for spancat.
Most recently, the following batch setting resulted in the same type of error.
Environment