
PGTransport in-place transfers #118

Open
d4l3k opened this issue Feb 22, 2025 · 0 comments
Labels: checkpoint (related to checkpointing/recovery/healing), process_group (related to ProcessGroups and collectives), python

Comments

d4l3k (Member) commented on Feb 22, 2025

Currently PGTransport allocates new tensors and copies them to CPU. This is memory-inefficient and slow: we have to limit the number of tensors transferred at once, and every transfer incurs a GPU->CPU->GPU copy. A better solution would be to transfer directly into the existing GPU tensors in place.
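A minimal sketch of the difference (not torchft's actual API -- `recv_inplace`, `recv_staged`, and `inplace_copy` are hypothetical names, assuming a point-to-point API like `torch.distributed.recv`, which writes into the tensor it is given):

```python
import torch
import torch.distributed as dist


def inplace_copy(dst: torch.Tensor, src: torch.Tensor) -> torch.Tensor:
    """Write src into dst's existing storage; no new allocation is made,
    so dst.data_ptr() is unchanged after the copy."""
    dst.copy_(src)
    return dst


def recv_staged(shape, dtype, device, src_rank) -> torch.Tensor:
    """Roughly the current behavior, for contrast: allocate a CPU staging
    buffer, receive into it, then copy to the destination device --
    two extra allocations plus a CPU->GPU copy per tensor."""
    buf = torch.empty(shape, dtype=dtype)  # CPU staging tensor
    dist.recv(buf, src=src_rank)
    return buf.to(device)                  # allocates again on the GPU


def recv_inplace(param: torch.Tensor, src_rank: int) -> torch.Tensor:
    """Proposed in-place variant: dist.recv fills the tensor passed to it,
    so receiving straight into the live parameter tensor skips the
    staging buffer and the GPU->CPU->GPU round trip entirely."""
    dist.recv(param, src=src_rank)
    return param
```

The `recv_*` helpers require an initialized process group; `inplace_copy` illustrates the key property on its own.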

This requires matching the tensors between the local state_dict and the remote state_dict. That is a bit tricky to do in the general case of arbitrary Python objects, but it should be fine for dictionaries. I'm not sure whether PyTree traversal order is guaranteed, or whether we need some custom mapping logic.
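One way to sidestep the PyTree-ordering question for the dictionary case is to flatten both state_dicts into a canonical sorted order, so both ranks agree on which tensor is transferred when without exchanging the tensors themselves. A hedged sketch (`flatten_state_dict` is a hypothetical helper, not torchft code):

```python
def flatten_state_dict(d: dict, prefix: str = "") -> list:
    """Flatten a nested dict of tensors into a list of (path, leaf) pairs.

    Keys are visited in sorted order at every level, so two ranks holding
    structurally identical state_dicts produce the same flat ordering even
    if their dicts were built with different insertion orders."""
    items = []
    for k in sorted(d, key=str):
        path = f"{prefix}/{k}" if prefix else str(k)
        v = d[k]
        if isinstance(v, dict):
            items.extend(flatten_state_dict(v, path))
        else:
            items.append((path, v))
    return items
```

Each side would flatten its own state_dict, and the receiver could then issue in-place receives into the local leaves in the agreed order, pairing local and remote tensors by path.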

Relevant code: https://github.com/pytorch/torchft/blob/main/torchft/checkpointing/pg_transport.py

We should also add a benchmark to test this with PGNCCL.
