add option to skip initial sync in Manager #117
Labels
enhancement (New feature or request)
good first issue (Good for newcomers)
lighthouse (Lighthouse and quorum related)
rust
We currently always heal on step 0 to avoid synchronization issues. We want an option to support skipping this sync for users who set the PyTorch seed so all ranks are initialized with the same values.
This should match the name init_sync from pytorch/pytorch#142824.
Bonus would be to randomly initialize a value in Manager so we can detect whether or not ranks are seeded and throw an error if there's a mismatch on first quorum.
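One possible shape for the bonus check, as a rough sketch only: it assumes each replica draws a value from its RNG after seeding and sends it along with its first quorum request. The `InitReport` struct, its `init_check` field, and `verify_seeded` are hypothetical names for illustration and do not exist in torchft.

```rust
/// Hypothetical report a replica attaches to its first quorum request.
#[derive(Debug, Clone)]
struct InitReport {
    replica_id: String,
    /// Value drawn from the replica's RNG after seeding (assumed field).
    init_check: u64,
}

/// Compare every report against the first one.
/// Returns an error describing the first mismatch, or Ok(()) if all values agree.
fn verify_seeded(reports: &[InitReport]) -> Result<(), String> {
    let first = match reports.first() {
        Some(r) => r,
        // Nothing to compare in an empty quorum.
        None => return Ok(()),
    };
    for r in &reports[1..] {
        if r.init_check != first.init_check {
            return Err(format!(
                "init value mismatch: replica {} reported {}, replica {} reported {}",
                first.replica_id, first.init_check, r.replica_id, r.init_check
            ));
        }
    }
    Ok(())
}
```

If the values differ, the ranks were most likely not seeded identically, so skipping the initial sync would be unsafe and the first quorum could fail with this error.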
Relevant code (torchft/src/manager.rs, lines 403 to 410 in d427bef):
max_step == 0 && primary.replica_id != p.replica_id
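A minimal sketch of how the option could gate this condition. The `Member` struct is a simplified stand-in, the `behind` check is an assumption about the surrounding logic, and the way `init_sync` is passed in is hypothetical; only the step-0 condition itself comes from the snippet above.

```rust
/// Simplified stand-in for a quorum member; the real type has more fields.
#[derive(Debug, Clone)]
struct Member {
    replica_id: String,
    step: i64,
}

/// Decide whether replica `p` should heal from `primary`.
/// `init_sync` is the proposed option: when false, the step-0 sync is skipped.
fn should_heal(p: &Member, primary: &Member, max_step: i64, init_sync: bool) -> bool {
    // Assumed: a replica that is behind the max step always needs to recover.
    let behind = p.step != max_step;
    // Existing condition, now only applied when init_sync is enabled.
    let force_initial = init_sync && max_step == 0 && primary.replica_id != p.replica_id;
    behind || force_initial
}
```

With `init_sync` set to true this keeps the current behaviour (every non-primary replica heals on step 0); with false, replicas at step 0 are trusted to have been initialized identically.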