[GraphBolt][CUDA] Cooperative Minibatching initial exchange. #7795

mfbalin · 2024-09-11T19:32:10Z

Description

Towards implementing #7273.

The code is currently untested. I need to merge the changes incrementally so that each PR stays small.

The initial exchange is required because each process must sample only for the nodes assigned to it. The seeds that come out of ItemSampler, or that result from adding negative edges, carry no partitioning guarantee. The initial exchange therefore redistributes them so that each process samples only for the nodes it owns.
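To make the idea concrete, here is a minimal single-process sketch of such an exchange. It is not the CUDA code in this PR: the ownership rule (`node_id % world_size`) and the helper names `owner_of`, `bucket_by_owner`, and `initial_exchange` are assumptions made purely for illustration, and the all-to-all step is simulated with plain lists instead of a real collective.

```python
def owner_of(node_id, world_size):
    # Hypothetical ownership rule (an assumption, not taken from the PR):
    # a node belongs to the rank given by its id modulo the world size.
    return node_id % world_size


def bucket_by_owner(seeds, world_size):
    # Split one rank's seeds into one bucket per destination rank.
    buckets = [[] for _ in range(world_size)]
    for seed in seeds:
        buckets[owner_of(seed, world_size)].append(seed)
    return buckets


def initial_exchange(seeds_per_rank):
    # seeds_per_rank[r] holds the seeds rank r received from its sampler,
    # with no guarantee that rank r owns any of them.
    world_size = len(seeds_per_rank)
    send = [bucket_by_owner(seeds, world_size) for seeds in seeds_per_rank]
    # Simulated all-to-all: rank dst receives send[src][dst] from every src.
    received = []
    for dst in range(world_size):
        merged = []
        for src in range(world_size):
            merged.extend(send[src][dst])
        received.append(merged)
    return received


if __name__ == "__main__":
    # Two ranks whose seeds are not partitioned by owner.
    print(initial_exchange([[0, 1, 2, 3], [4, 5, 7]]))
```

After the exchange, every node held by a rank satisfies the ownership rule for that rank, which is the property the sampling stages rely on. A real implementation would also need to remember where each seed came from so that results can be sent back; that bookkeeping is omitted here.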

Follow-up work:

Add exchanges after each sampling stage.
Add a torch layer for the GNN forward and backward passes that uses Cooperative Minibatching and the stored tensors, including exchange information.
Add a multi-GPU example showcasing Cooperative Minibatching.
Optimize Cooperative Minibatching using partitioning. (More details to come later.)

Checklist

Please feel free to remove inapplicable items for your PR.

The PR title starts with [$CATEGORY] (such as [NN], [Model], [Doc], [Feature])
I've leveraged the tools to beautify the Python and C++ code.
The PR is complete and small; read the Google eng practice (CL equals PR) to understand more about small PRs. In DGL, we consider PRs with fewer than 200 lines of core code change to be small (examples, tests, and documentation can be exempted).
All changes have test coverage
Code is well-documented
To the best of my knowledge, examples are either not affected by this change, or have been fixed to be compatible with this change
Related issue is referred to in this PR
If the PR is for a new model/paper, I've updated the example index here.

Changes

dgl-bot · 2024-09-11T19:32:37Z

To trigger regression tests:

@dgl-bot run [instance-type] [which tests] [compare-with-branch];
For example: @dgl-bot run g4dn.4xlarge all dmlc/master or @dgl-bot run c5.9xlarge kernel,api dmlc/master

dgl-bot · 2024-09-11T19:34:36Z

Commit ID: 7ca6a7a

Build ID: 1

Status: ❌ CI test failed in Stage [Lint Check].

Report path: link

Full logs path: link

dgl-bot · 2024-09-11T20:02:53Z

Commit ID: c0af5c8

Build ID: 2

Status: ⚪️ CI test cancelled due to overrun.

Report path: link

Full logs path: link

dgl-bot · 2024-09-11T20:33:18Z

Commit ID: ba3bc97

Build ID: 3

Status: ✅ CI test succeeded.

Report path: link

Full logs path: link

dgl-bot · 2024-09-11T21:43:01Z

Commit ID: 9be2085

Build ID: 4

Status: ✅ CI test succeeded.

Report path: link

Full logs path: link

mfbalin added 2 commits September 11, 2024 19:24

[GraphBolt][CUDA] Cooperative Minibatching initial exchange.

0722311

linting

7ca6a7a

mfbalin added the expedited label Sep 11, 2024

mfbalin requested a review from frozenbugs September 11, 2024 19:32

mfbalin added 2 commits September 11, 2024 20:00

linting

c0af5c8

change variable name.

ba3bc97

refactor.

9be2085

mfbalin merged commit 53e70c5 into dmlc:master Sep 11, 2024
2 checks passed

mfbalin deleted the gb_cuda_cooperative_exchange branch September 11, 2024 21:46

lijialin03 pushed a commit to lijialin03/dgl that referenced this pull request Jan 6, 2025

[GraphBolt][CUDA] Cooperative Minibatching initial exchange. (dmlc#7795)

17292b6
