
fix: nan failure during training #3159

Open · wants to merge 22 commits into main
Conversation

@ori-kron-wis (Collaborator) commented Jan 21, 2025

Using the SaveCheckpoint callback with on_exception, we can save the best model reached up to the point training crashed due to NaNs in the loss or gradients.
See an example (using Michal's data):

import scvi
from scvi.train._callbacks import SaveCheckpoint
from scvi.model import SCANVI
import pandas as pd
import numpy as np
import scanpy as sc
import torch
torch.set_float32_matmul_precision('high')

pd.set_option('display.max_rows', 50)
pd.set_option('display.max_columns', 50)
pd.set_option('display.width', 1000)
scvi.settings.seed = 0

early_stopping_kwargs = {
    'early_stopping': True,
    'early_stopping_monitor': 'elbo_validation', #'train_loss'
    'early_stopping_patience': 50,
    'early_stopping_mode': "min",
    'early_stopping_min_delta': 0.0,
    #'check_val_every_n_epoch': 1,
    #'check_finite': True,
}

ScviM = scvi.model.SCVI.load("/home/access/scvi_forScanVI4")

lvae = scvi.model.SCANVI.from_scvi_model(
    ScviM,
    unlabeled_category='unlabeled',
    labels_key="celltypes_steven2",
    linear_classifier=True,
)
lvae.train(batch_size=1024, n_samples_per_label=100, max_epochs=500, gradient_clip_val=0,
           **early_stopping_kwargs, detect_anomaly=False, enable_checkpointing=True,
           callbacks=[SaveCheckpoint(monitor="elbo_validation", load_best_on_end=True)])  # breaks at epoch 58

# We now want to load this model and continue training it
model = SCANVI.load("/home/access/.config/JetBrains/PyCharmCE2024.2/scratches/scvi_log/"
                    "2025-01-23_13-37-44_elbo_validation/"
                    "epoch=54-step=53295-elbo_validation=1255.7066650390625/", adata=ScviM.adata)
model.train(batch_size=2048, n_samples_per_label=50, max_epochs=500, gradient_clip_val=1,
            **early_stopping_kwargs, detect_anomaly=False, enable_checkpointing=True, plan_kwargs={"lr": 1e-2},
            callbacks=[SaveCheckpoint(monitor="elbo_validation", load_best_on_end=True)])

# running with detect_anomaly=True really slows down the whole thing
print("done")

We can then load the checkpoint and continue training it (with or without parameter tweaking).

@ori-kron-wis self-assigned this Jan 21, 2025
@ori-kron-wis added the on-merge: backport to 1.3.x label Jan 21, 2025
@ori-kron-wis added this to the scvi-tools 1.3 milestone Jan 21, 2025
@ori-kron-wis changed the title "Ori nan crash fix" to "fix: nan failure during training" Jan 21, 2025
codecov bot commented Jan 21, 2025

Codecov Report

Attention: Patch coverage is 35.71429% with 18 lines in your changes missing coverage. Please review.

Project coverage is 82.59%. Comparing base (2f1611c) to head (f977864).

Files with missing lines      Patch %   Lines
src/scvi/train/_callbacks.py  35.71%    18 Missing ⚠️

❗ There is a different number of reports uploaded between BASE (2f1611c) and HEAD (f977864).

HEAD has 53 fewer uploads than BASE:

Flag   BASE (2f1611c)   HEAD (f977864)
       56               3
Additional details and impacted files
@@            Coverage Diff             @@
##             main    #3159      +/-   ##
==========================================
- Coverage   89.43%   82.59%   -6.85%     
==========================================
  Files         185      185              
  Lines       16182    16210      +28     
==========================================
- Hits        14473    13389    -1084     
- Misses       1709     2821    +1112     
Files with missing lines      Coverage Δ
src/scvi/train/_callbacks.py  77.05% <35.71%> (-8.16%) ⬇️

... and 27 files with indirect coverage changes

@ori-kron-wis added the on-merge: backport to 1.2.x label and removed the on-merge: backport to 1.3.x label Jan 30, 2025

            )
        else:
            self.reason = (
                "\033[31m[Warning] Exception occurred during training (Nan or Inf gradients). "

Member: what's this \033[31m string?

Collaborator Author: it prints the reason in red on the screen.
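
For context, a minimal sketch of how ANSI escape codes color terminal output; the reset code and exact message here are illustrative, not taken from the PR's code:

# Minimal illustration of ANSI escape codes (not the PR's actual code):
# "\033[31m" switches terminal output to red; "\033[0m" resets the color.
RED = "\033[31m"
RESET = "\033[0m"
print(f"{RED}[Warning] Exception occurred during training (NaN or Inf gradients).{RESET}")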

pyro_param_store = best_state_dict.pop("pyro_param_store", None)
pl_module.module.load_state_dict(best_state_dict)
if pyro_param_store is not None:
    # For scArches shapes are changed and we don't want to overwrite

Member: what's this comment here?

Collaborator Author: you tell me :-). It came from the resolvi merge.

    # For scArches shapes are changed and we don't want to overwrite
    # these changed shapes.
    pyro.get_param_store().set_state(pyro_param_store)
print(self.reason)

Member: can we refine the printing here instead of two print statements?

Collaborator Author: we can do it in one line, of course.
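
For illustration, a hedged sketch of collapsing the two messages into a single print call; the variable names and the second message are assumptions, not taken from the diff:

# Sketch: combine the two messages into one print call.
# `reason` and `detail` are hypothetical stand-ins for the callback's two messages.
reason = "\033[31m[Warning] Exception occurred during training (NaN or Inf gradients)."
detail = "Loaded the best model state found so far.\033[0m"
print(f"{reason} {detail}")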

@@ -72,6 +72,8 @@ class Trainer(pl.Trainer):
         and in 'max' mode it will stop when the quantity monitored has stopped increasing.
     enable_progress_bar
         Whether to enable or disable the progress bar.
     gradient_clip_val

Member: does it do anything?

Collaborator Author: it's working, but does it help avoid the NaN-gradients exception? In my test case, no; it just changed the epoch at which training failed. Basically, it's common practice to use it when such a thing happens.

But we can still use it via the trainer kwargs rather than explicitly in the train function signature. I'll revert.

See the velovi model; it's part of it.
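
For illustration, a hedged sketch of passing the clipping value through the trainer kwargs instead, assuming extra keyword arguments to train() are forwarded to the underlying Lightning Trainer as the comment above suggests; the synthetic dataset is used only to keep the example self-contained:

import scvi

# Sketch: gradient clipping via trainer kwargs rather than a dedicated
# train() argument (assumes extra kwargs are forwarded to the Trainer).
adata = scvi.data.synthetic_iid()  # small synthetic dataset bundled with scvi-tools
scvi.model.SCVI.setup_anndata(adata)
model = scvi.model.SCVI(adata)
model.train(
    max_epochs=5,
    gradient_clip_val=1.0,  # forwarded to the underlying lightning Trainer
)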


model.train(
    max_epochs=5,
    callbacks=[SaveCheckpoint(monitor="elbo_validation", load_best_on_end=True)],

Member: does this give a NaN?

Collaborator Author (Feb 10, 2025): of course not; it's a placeholder for now. I was hoping we could come up with unit-test data that triggers the on_exception path in a pytest environment. If we do, we can use it here and check that we get what we expect.
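
For illustration, a hedged sketch of what such a test might look like once a reliable NaN trigger exists. The oversized learning rate is a hypothetical trigger that may not produce non-finite values deterministically, and whether the failure surfaces as an exception is itself an assumption, so the assertion is intentionally minimal:

import pytest
import scvi
from scvi.train._callbacks import SaveCheckpoint


def test_save_checkpoint_on_nan(tmp_path):
    # Hypothetical sketch: provoke non-finite loss/gradients and check that
    # training surfaces an exception while SaveCheckpoint is active.
    adata = scvi.data.synthetic_iid()
    scvi.model.SCVI.setup_anndata(adata)
    model = scvi.model.SCVI(adata)
    # An absurd learning rate is one (unreliable) way to push the loss to
    # NaN/Inf; dedicated test data that fails deterministically is still needed.
    with pytest.raises(Exception):
        model.train(
            max_epochs=50,
            plan_kwargs={"lr": 1e6},
            callbacks=[SaveCheckpoint(monitor="elbo_validation", load_best_on_end=True)],
        )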

Labels
on-merge: backport to 1.2.x
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants