safe_get_full_grad & safe_set_full_grad #7117
Comments
Yes. This API is called after reduction, so all DP ranks have the same gradient value: https://deepspeed.readthedocs.io/en/latest/zero3.html#debugging
These APIs should be called by all DP ranks to avoid hangs.
Yes, users have the freedom (and responsibility) to use these APIs correctly. Normally, gradient values are the same across DP ranks after reduction.
The type should match the gradient accumulation data type configured by the user: https://deepspeed.readthedocs.io/en/latest/zero3.html#deepspeed.utils.safe_set_full_grad
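For reference, the gradient accumulation data type can be set in the DeepSpeed config JSON. A minimal sketch of such a fragment, assuming the `data_types.grad_accum_dtype` option (the exact key and accepted values may vary by DeepSpeed version, so check your version's config docs):

```json
{
  "data_types": {
    "grad_accum_dtype": "fp32"
  }
}
```

With a setting like this, the value passed to safe_set_full_grad would need to match the fp32 accumulation type.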
@tjruwase Thank you for your explanation. I have an additional question regarding the safe_get_local_grad and safe_set_local_grad functions. Are these functions designed to return the gradient values specific to each process or rank (i.e., the gradients computed on that particular rank before reduction)? If not, could you please explain the differences between the local gradients and the aggregated gradients? Thank you!
@jinghanjia, please see the motivation for the local APIs here: #4681
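To make the local/full distinction concrete, here is a toy sketch in plain Python (not DeepSpeed code; the shard layout is a simplifying assumption). Under ZeRO stage 3, each DP rank owns only a shard of the flattened gradient: the "local" view corresponds to that rank's own shard, while the "full" view corresponds to gathering the shards from all ranks.

```python
# Toy model of ZeRO-3 gradient partitioning (illustration only, not the
# actual DeepSpeed implementation).

def partition(flat_grad, world_size):
    """Split a flattened gradient into one contiguous shard per rank."""
    base, rem = divmod(len(flat_grad), world_size)
    shards, start = [], 0
    for rank in range(world_size):
        size = base + (1 if rank < rem else 0)  # early ranks take the remainder
        shards.append(flat_grad[start:start + size])
        start += size
    return shards

def local_grad(shards, rank):
    # Analogous to the "local" view: only this rank's shard.
    return shards[rank]

def full_grad(shards):
    # Analogous to the "full" view: gather every rank's shard, so each
    # rank reconstructs the same complete gradient.
    return [g for shard in shards for g in shard]

grad = [0.1, 0.2, 0.3, 0.4, 0.5]   # reduced (post-allreduce) gradient
shards = partition(grad, world_size=2)
rank0_local = local_grad(shards, rank=0)   # rank 0's shard only
gathered = full_grad(shards)               # identical on every rank
```

The point of the sketch: the local view differs per rank by construction, while the full view is the same everywhere once reduction has completed.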
Original question:
deepspeed 0.15.3, ZeRO stage 3.
Does safe_get_full_grad return the same gradient values on each process/rank?
As for safe_set_full_grad, should it be called on all processes/ranks, or is calling it on just one enough? If it must be called on all of them, do users need to ensure the gradient values being set are the same on each process/rank?
Also, which float type should be used for safe_set_full_grad? Is there any way to check this?
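As the answer above notes, these APIs must be called by all DP ranks to avoid hangs. Here is a toy simulation in plain Python (not DeepSpeed code; the barrier stands in for the collective communication such APIs perform internally) showing why a rank that skips the call leaves the others blocked:

```python
import threading

# Toy illustration of a collective call: every rank must arrive at the
# barrier, which stands in for the synchronization inside a collective
# gradient API. A short timeout substitutes for what would be an
# indefinite hang in real distributed training.

WORLD_SIZE = 2

def collective_set_grad(barrier, timeout=0.2):
    """Stand-in for a collective gradient update: every rank must arrive."""
    try:
        barrier.wait(timeout=timeout)
        return "ok"
    except threading.BrokenBarrierError:
        return "hang"  # in real training, this rank would block forever

def simulate(participating_ranks):
    """Run one 'step' in which only participating_ranks make the call."""
    barrier = threading.Barrier(WORLD_SIZE)
    results = {}

    def run(rank):
        if rank in participating_ranks:
            results[rank] = collective_set_grad(barrier)
        else:
            results[rank] = "skipped"

    threads = [threading.Thread(target=run, args=(r,)) for r in range(WORLD_SIZE)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results

all_call = simulate({0, 1})  # every rank calls -> everyone proceeds
one_call = simulate({0})     # rank 1 skips -> rank 0 is stuck
```

When all ranks participate, every call completes; when rank 1 skips the call, rank 0 waits at the collective indefinitely, which is the hang the maintainers warn about.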