Outlier detection: catch more outliers by not updating moving average with skipped updates #711

Draft: wants to merge 3 commits into master
Conversation

ademeure (Contributor)
This is an improvement to the znorm/zgrad update-skipping mechanisms (-sl and -sg): updates that are skipped as outliers no longer feed into the moving averages, so the statistics stay clean and more outliers get caught. Note that znorm will still be updated if zgrad is an outlier that causes the update to be skipped (and vice versa), but that shouldn't matter much.

Combined with "-sg 5", my intuition is that this should significantly improve stability (famous last words...). We should still try to root-cause the instability as much as possible after that, of course.
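For reference, here is a minimal sketch in C of the mechanism described above (the names `OutlierDetector`, `zscore`, and `check_and_update` are hypothetical, not llm.c's actual code): keep a rolling window of recent values, compute a z-score for the newest value against that window, and only push the value into the window when it is not an outlier, so outliers cannot inflate the mean/std and mask later outliers. The 64-step window size comes from the commit message; the threshold is whatever -sl/-sg supplies.

```c
#include <math.h>
#include <stdbool.h>

#define ZWIN 64  // shorter window of 64 steps, per the commit message

typedef struct {
    float buf[ZWIN];
    int count;   // how many valid entries are currently in buf
    int head;    // next slot to overwrite
} OutlierDetector;

// z-score of `value` against the current window (0 if there are too few samples).
static float zscore(const OutlierDetector *d, float value) {
    if (d->count < 2) return 0.0f;
    float mean = 0.0f, var = 0.0f;
    for (int i = 0; i < d->count; i++) mean += d->buf[i];
    mean /= d->count;
    for (int i = 0; i < d->count; i++) {
        float diff = d->buf[i] - mean;
        var += diff * diff;
    }
    float std = sqrtf(var / d->count);
    return std > 0.0f ? (value - mean) / std : 0.0f;
}

// Decide whether this step is an outlier; only non-outliers enter the window,
// which is the key change in this PR.
static bool check_and_update(OutlierDetector *d, float value, float threshold) {
    float z = zscore(d, value);
    bool is_outlier = fabsf(z) > threshold;
    if (!is_outlier) {
        d->buf[d->head] = value;
        d->head = (d->head + 1) % ZWIN;
        if (d->count < ZWIN) d->count++;
    }
    return is_outlier; // caller skips (or heavily damps) the update if true
}
```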

…skipped updates, and use a shorter window of 64 steps
ademeure marked this pull request as draft on July 25, 2024, 20:44
ademeure (Contributor, Author)

Updated the PR. Very promising based on a short local run; going to use this for a longer run and for FP8 runs now...

The most surprisingly effective change seems to be that whenever the grad z-norm is >2, beta2 is decreased from 0.95 to 0.9. As per various papers, including https://arxiv.org/pdf/2304.13013, a smaller beta2 seems to improve stability, at a slight cost to how fast the loss improves (and the final value it converges to). I suspect that by selectively reducing beta2 based on the grad z-norm, we can get some of the stability benefits of a lower beta2 without hurting performance as much, or at all.

When the grad z-norm is >8, it will "mostly skip" the update, i.e. the learning rate is 0.1x, weight decay is 0.2x, and beta1 is increased from 0.90 to 0.95 in order to prevent outliers from "lingering" in the 1st momentum too much (reducing the learning rate to 0 without increasing beta1 resulted in some lingering high grad z-norms, which went away after that change).
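A hedged sketch of the per-step hyperparameter adjustment described in this comment; the struct and function names are hypothetical rather than llm.c's actual API, but the thresholds and multipliers are the ones stated above (z-norm > 2 lowers beta2, z-norm > 8 "mostly skips"):

```c
typedef struct {
    float learning_rate;
    float weight_decay;
    float beta1;   // nominally 0.90
    float beta2;   // nominally 0.95
} AdamWParams;

// Return adjusted optimizer hyperparameters for this step given the grad z-norm.
static AdamWParams adjust_for_outliers(AdamWParams base, float zgrad) {
    AdamWParams p = base;
    if (zgrad > 2.0f) {
        p.beta2 = 0.9f;   // smaller beta2 improves stability on mildly unusual steps
    }
    if (zgrad > 8.0f) {
        // "mostly skip": heavily damp the update, and raise beta1 so the outlier
        // gradient gets less weight in the 1st momentum and doesn't linger there.
        p.learning_rate *= 0.1f;
        p.weight_decay  *= 0.2f;
        p.beta1 = 0.95f;
    }
    return p;
}
```

The caller would compute zgrad from the outlier detector above and pass the adjusted parameters into the usual AdamW step for that iteration only; the base values are restored on the next step.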
