
Models are not deterministic / reproducible on GPU #6490

Open

echatzikyriakidis opened this issue Dec 3, 2020 · 4 comments
Labels: bug (Bugs and behaviour differing from documentation) · feat / ner (Feature: Named Entity Recognizer) · gpu (Using spaCy on GPU) · reproducibility (Consistency, reproducibility, determinism, and randomness)

Comments

@echatzikyriakidis

echatzikyriakidis commented Dec 3, 2020

How to reproduce the behaviour

I cannot reproduce the same results when training a NER model on GPU in Google Colab.
When running the same code on CPU, results are reproducible.
However, once the GPU is enabled with prefer_gpu(), runs with the same seed no longer produce identical results.

```python
# Example code

import spacy
from spacy.util import fix_random_seed, minibatch, compounding
from sklearn.utils import shuffle
from tqdm import tqdm


def train_blank_ner_model(language_id, train_X, entity_types, epochs, random_state,
                          dropout, minibatch_size, losses_display_frequency_in_epochs):
    fix_random_seed(random_state)

    nlp = spacy.blank(language_id)

    assert len(nlp.pipe_names) == 0, f"Pipeline of blank model '{language_id}' is not empty."

    # spaCy v2 API: create the NER pipe and add it to the pipeline
    ner = nlp.create_pipe('ner')
    nlp.add_pipe(ner)

    for entity_type in entity_types:
        ner.add_label(entity_type)

    optimizer = nlp.begin_training()

    for epoch in tqdm(range(1, epochs + 1)):
        # Reshuffle the training data each epoch with a fixed seed
        train_X = shuffle(train_X, random_state=random_state)

        losses = {}

        batches = minibatch(train_X, size=compounding(*minibatch_size))

        for batch in tqdm(batches, leave=False):
            texts, annotations = zip(*batch)

            nlp.update(texts, annotations, sgd=optimizer, drop=dropout, losses=losses)

        if epoch % losses_display_frequency_in_epochs == 0:
            print(f"Epoch {epoch}, Loss: {losses['ner']}")

    print(f"Training completed with loss: {losses['ner']}")

    return nlp


print(f"GPU Initialization: {spacy.prefer_gpu()}")

nlp = train_blank_ner_model(language_id='de',
                            train_X=X_train,
                            entity_types=ner_entity_types,
                            epochs=3,
                            random_state=42,
                            dropout=0.4,
                            minibatch_size=(0.4, 0.4, 1.0),
                            losses_display_frequency_in_epochs=5)
```
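A quick way to see the issue (a hedged sketch, not part of the original report: `X_train`, `ner_entity_types`, and the example sentence are placeholders taken from the call above) is to train twice with identical arguments and compare predictions:

```python
# Hypothetical reproducibility check, reusing the function and data above.
kwargs = dict(language_id='de',
              train_X=X_train,
              entity_types=ner_entity_types,
              epochs=3,
              random_state=42,
              dropout=0.4,
              minibatch_size=(0.4, 0.4, 1.0),
              losses_display_frequency_in_epochs=5)

nlp_a = train_blank_ner_model(**kwargs)
nlp_b = train_blank_ner_model(**kwargs)

text = "Angela Merkel besuchte Berlin im Mai."  # any held-out German sentence
ents_a = [(ent.text, ent.label_) for ent in nlp_a(text).ents]
ents_b = [(ent.text, ent.label_) for ent in nlp_b(text).ents]

# With the same seed, this prints True on CPU but often False on GPU.
print("identical predictions:", ents_a == ents_b)
```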

Your Environment

  • Operating System: Google Colab
  • Python Version Used: Python 3.8
  • spaCy Version Used: spacy[cuda101]==2.3.4
  • Environment Information: Google Colab
@svlandeg svlandeg added feat / ner Feature: Named Entity Recognizer gpu Using spaCy on GPU labels Dec 3, 2020
@svlandeg
Member

svlandeg commented Dec 3, 2020

Thanks for the report! I just double-checked with the latest code from master and can confirm that there seems to be a reproducibility issue on GPU when training the NER model.

We'll look into this!

@svlandeg svlandeg added the bug Bugs and behaviour differing from documentation label Dec 3, 2020
@echatzikyriakidis
Author

echatzikyriakidis commented Dec 3, 2020

Thank you @svlandeg!

We can continue our experimentation phase even without determinism, since the losses from runs with different random seeds are more or less the same. No big fluctuations.

However, it would be great if a new release with the fix could come out soon.

Please note that the same thing happens when using a pre-trained model, e.g. en_core_web_lg.

@echatzikyriakidis
Author

Hi @svlandeg!

Is there any update on this?

@polm
Contributor

polm commented Aug 9, 2021

I managed to track down the source of this problem. In the backprop of HashEmbed we use cupyx.scatter_add, which is non-deterministic. So this affects anything that uses a tok2vec layer.

Unfortunately there is no simple substitution for this without consequences. We could unroll the addition to control the order of operations, but that would be too slow. This is also a known issue in PyTorch (which doesn't use cupy, but a similar implementation); because the actual change in values is small, it's not generally considered a problem there (see pytorch/pytorch#50469).
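For illustration, here is a minimal sketch of the underlying behaviour (assumes cupy and a CUDA GPU; the shapes, sizes, and seed are made up, not taken from spaCy's code). When many rows scatter into the same destination, the atomic float additions land in an unspecified order, and since float addition is not associative the low-order bits can differ between identical runs:

```python
import cupy as cp
import cupyx


def scatter_once(seed=0):
    rng = cp.random.RandomState(seed)
    grad = cp.zeros((4, 8), dtype=cp.float32)      # e.g. an embedding-table gradient
    rows = rng.randint(0, 4, size=100000)          # many collisions per destination row
    updates = rng.standard_normal((100000, 8)).astype(cp.float32)
    cupyx.scatter_add(grad, rows, updates)         # atomic adds, unordered
    return grad


a = scatter_once()
b = scatter_once()

# Same seed, same inputs -- but the result may not be bitwise identical across runs.
print("bitwise identical:", bool(cp.all(a == b)))
```

The differences are tiny per call, but they compound over many updates, which is why two GPU training runs with the same seed drift apart.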

That said, we think we can design a deterministic equivalent with a more acceptable speed penalty, and we'll be taking a look at it. In the meantime this is something to be aware of; this will be the main issue for it, so subscribe here if you'd like updates.

@polm polm changed the title fix_random_seed() does not work on GPU mode Models are not deterministic / reproducible on GPU Aug 9, 2021
@polm polm added the reproducibility Consistency, reproducibility, determinism, and randomness label Nov 22, 2022