Add DeepSpeed ZeRO-3 MiCS support (Issue #20378) #20461
base: master
Conversation
Codecov Report

Attention: Patch coverage is

Additional details and impacted files

@@           Coverage Diff            @@
##           master   #20461    +/-  ##
=======================================
- Coverage      88%      88%     -0%
=======================================
  Files         267      267
  Lines       23382    23389      +7
=======================================
  Hits        20479    20479
- Misses       2903     2910      +7
This is great, thanks for the contribution @hehepig4.
Let's see how CI goes, and we can proceed.
In the meantime @hehepig4, would you be willing to add a mention of this in the relevant docs section? Also, ideally we could make the same change in Fabric as well:
Thanks! I will first work on docs, then Fabric.
Looks great, thanks for the updates. I see CI is going out of memory; I need to investigate further.
What does this PR do?
Since DeepSpeed 0.9.2, MiCS support has been available, which makes it possible to specify how parameters are partitioned across devices in ZeRO stage 3 (reference).
To activate MiCS, users should add 'mics_shard_size' to the DeepSpeed config and use 'deepspeed.zero.MiCS_Init' instead of 'deepspeed.zero.Init' when initializing models. Issues occur when:
The core change replaces zero.Init with MiCS_Init in DeepSpeedStrategy.model_sharded_context (line 519 of lightning/pytorch/strategies/deepspeed.py) when 'mics_shard_size' is detected in the DeepSpeed zero_optimization config.
Fixes & adds support for #20378
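For context, a DeepSpeed config enabling MiCS might look like the sketch below. Only the placement of 'mics_shard_size' inside zero_optimization comes from this PR's description; the shard size and the other keys and values are illustrative placeholders.

```python
# Illustrative DeepSpeed config enabling MiCS; values are placeholders.
ds_config = {
    "train_micro_batch_size_per_gpu": 1,
    "zero_optimization": {
        "stage": 3,            # ZeRO stage 3: full parameter sharding
        "mics_shard_size": 4,  # partition parameters within groups of 4 devices
    },
}
```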
Before submitting
PR review
Anyone in the community is welcome to review the PR.
Before you start reviewing, make sure you have read the review guidelines. In short, see the following bullet-list:
Reviewer checklist
📚 Documentation preview 📚: https://pytorch-lightning--20461.org.readthedocs.build/en/20461/