Outlier detection: catch more outliers by not updating moving average with skipped updates #711

Draft: wants to merge 3 commits into master
Conversation

ademeure (Contributor)
This is an improvement to the znorm/zgrad update-skipping mechanisms (-sl and -sg): updates that are skipped as outliers no longer feed into the moving averages, so the statistics stay clean and more outliers get caught. Note that znorm will still be updated if zgrad is an outlier that causes the update to be skipped (and vice versa), but that shouldn't matter much.

Combined with "-sg 5", my intuition is that this should significantly improve stability (famous last words...). We should still try to root-cause the instability as much as possible after that, of course.
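For reference, here is a minimal sketch in C of the mechanism described above (the names `OutlierDetector`, `zscore`, and `check_and_update` are hypothetical, not llm.c's actual code): keep a rolling window of recent values, compute a z-score for the newest value against that window, and only push the value into the window when it is not an outlier, so outliers cannot inflate the mean/std and mask later outliers. The 64-step window size comes from the commit message; the threshold is whatever -sl/-sg supplies.

```c
#include <math.h>
#include <stdbool.h>

#define ZWIN 64  // shorter window of 64 steps, per the commit message

typedef struct {
    float buf[ZWIN];
    int count;   // how many valid entries are currently in buf
    int head;    // next slot to overwrite
} OutlierDetector;

// z-score of `value` against the current window (0 if there are too few samples).
static float zscore(const OutlierDetector *d, float value) {
    if (d->count < 2) return 0.0f;
    float mean = 0.0f, var = 0.0f;
    for (int i = 0; i < d->count; i++) mean += d->buf[i];
    mean /= d->count;
    for (int i = 0; i < d->count; i++) {
        float diff = d->buf[i] - mean;
        var += diff * diff;
    }
    float std = sqrtf(var / d->count);
    return std > 0.0f ? (value - mean) / std : 0.0f;
}

// Decide whether this step is an outlier; only non-outliers enter the window,
// which is the key change in this PR.
static bool check_and_update(OutlierDetector *d, float value, float threshold) {
    float z = zscore(d, value);
    bool is_outlier = fabsf(z) > threshold;
    if (!is_outlier) {
        d->buf[d->head] = value;
        d->head = (d->head + 1) % ZWIN;
        if (d->count < ZWIN) d->count++;
    }
    return is_outlier; // caller skips (or heavily damps) the update if true
}
```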

…skipped updates, and use a shorter window of 64 steps
ademeure marked this pull request as draft on July 25, 2024, 20:44
ademeure (Contributor, Author)

Updated the PR. Very promising based on a short local run; going to use this for a longer run and for FP8 runs now...

The most surprisingly effective change seems to be that whenever the grad z-norm is >2, beta2 is decreased from 0.95 to 0.9. As per various papers, including https://arxiv.org/pdf/2304.13013, a smaller beta2 seems to improve stability, at a slight cost to how fast the loss improves (and the final value it converges to). I suspect that by selectively reducing beta2 based on the grad z-norm, we can get some of the stability benefits of a lower beta2 without hurting performance as much, or at all.

When the grad z-norm is >8, it will "mostly skip" the update, i.e. the learning rate is 0.1x, weight decay is 0.2x, and beta1 is increased from 0.90 to 0.95 in order to prevent outliers from "lingering" in the 1st momentum too much (reducing the learning rate to 0 without increasing beta1 resulted in some lingering high grad z-norms, which went away after that change).
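A hedged sketch of the per-step hyperparameter adjustment described in this comment; the struct and function names are hypothetical rather than llm.c's actual API, but the thresholds and multipliers are the ones stated above (z-norm > 2 lowers beta2, z-norm > 8 "mostly skips"):

```c
typedef struct {
    float learning_rate;
    float weight_decay;
    float beta1;   // nominally 0.90
    float beta2;   // nominally 0.95
} AdamWParams;

// Return adjusted optimizer hyperparameters for this step given the grad z-norm.
static AdamWParams adjust_for_outliers(AdamWParams base, float zgrad) {
    AdamWParams p = base;
    if (zgrad > 2.0f) {
        p.beta2 = 0.9f;   // smaller beta2 improves stability on mildly unusual steps
    }
    if (zgrad > 8.0f) {
        // "mostly skip": heavily damp the update, and raise beta1 so the outlier
        // gradient gets less weight in the 1st momentum and doesn't linger there.
        p.learning_rate *= 0.1f;
        p.weight_decay  *= 0.2f;
        p.beta1 = 0.95f;
    }
    return p;
}
```

The caller would compute zgrad from the outlier detector above and pass the adjusted parameters into the usual AdamW step for that iteration only; the base values are restored on the next step.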
