How to Parallelize a Transformer for Training | How To Scale Your Model #7
Replies: 17 comments 18 replies
-
Nit: "ZeRO-{1,2,3} are used to refer to sharding the weights, gradients and optimizer states in this way, respectively" |
-
How do Mixture of Experts (MoE) and GShard fit into this context? Thank you.
-
For pure data parallelism, does num_params = HBM per device / 10 need to take into account gradient storage space besides parameters and optimizer states?
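For reference, here is the usual per-parameter byte accounting under Adam as a hedged sketch (bf16 weights and gradients and fp32 optimizer states are assumptions about the convention, not something quoted from the chapter); whether the gradient bytes belong in the count is exactly the question:

```python
# Hypothetical per-parameter byte accounting for Adam; whether gradients are
# stored persistently (and in which precision) is what the question is about.
bytes_weights    = 2   # bf16 parameters
bytes_adam_state = 8   # fp32 first and second moments
bytes_grads      = 2   # bf16 gradients, if held in HBM at the same time

without_grads = bytes_weights + bytes_adam_state   # 10 bytes per parameter
with_grads    = without_grads + bytes_grads        # 12 bytes per parameter

hbm = 96e9                                         # TPU v5p HBM from the text
print(hbm / without_grads / 1e9)                   # ~9.6B params
print(hbm / with_grads / 1e9)                      # ~8.0B params
```

The ~9.6 figure is presumably where the chapter's "about 10B parameters" takeaway comes from.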
-
nit: It seems like you are splitting the difference when approximating the max model size on a TPU v5p; the takeaway value and the text should be consistent. "TPUv5p pod with 96GB of HBM and pure data parallelism this is about 10B parameters."
-
Shouldn't it be "to be compute-bound"?
-
Question 1 / Attention parameters: the count appears to be missing the number of layers.
-
I think the diagrams for the "Tensor Parallelism" and "Mixed FSDP and Tensor Parallelism" sections need to be switched. The diagram for "Tensor Parallelism" shows the local shape as [B // N, D // M], whereas the one for "Mixed FSDP and Tensor Parallelism" shows [B, D // M]; shouldn't it be the other way around?
-
Just a question about DCN vs. FSDP bandwidth requirements. The book mentions "Because DCN has lower bandwidth, it’s typically too slow to do much useful FSDP". Doing the math, it would appear that data parallelism and FSDP have the same amount of communication (one all-reduces the gradients, the other all-gathers the weights and reduce-scatters the gradients). Could you explain why FSDP is much more bandwidth intensive than DDP (and hence unsuitable for DCN)?
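As a rough back-of-the-envelope (a hedged sketch; P, b, and N below are hypothetical values, and an all-reduce is counted as a reduce-scatter plus an all-gather):

```python
# Approximate per-device communication volume per training step.
P, b, N = 10e9, 2, 64   # hypothetical: 10B params, bf16, 64-way data parallelism

ddp_bytes  = 2 * P * b * (N - 1) / N   # all-reduce(grads) = reduce-scatter + all-gather
fsdp_bytes = 3 * P * b * (N - 1) / N   # all-gather(W) fwd + all-gather(W) bwd + reduce-scatter(grads)

print(ddp_bytes / 1e9, fsdp_bytes / 1e9)   # comparable totals, ~39 vs ~59 GB
```

If the totals are this close, the distinction is presumably less about raw volume and more about when the bytes are needed (the per-layer weight gathers sit on each layer's critical path, while the single gradient all-reduce can overlap with the whole backward pass), but it would help if the chapter spelled this out.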
-
I do not think it is true that all three have the same communication cost. Across all the implementation variants I have personally seen, ZeRO-3 includes the extra parameter all-gather in the backward pass, while ZeRO-2 does not. It may be that in your case you often find the backward parameter all-gather can be fully overlapped, but I would still not say the communication cost is the same (since there are certainly many real cases where it cannot be overlapped).
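A minimal sketch of the overlap condition being referred to, with generic symbols rather than the chapter's notation (and ignoring (N-1)/N factors, latency, and topology details): the backward parameter all-gather for a layer is hidden only if its transfer time fits under that layer's backward matmul time.

```python
# Rough per-layer overlap check for ZeRO-3's backward weight all-gather.
def bwd_allgather_hidden(D, F, B, b, W, C):
    """D, F: layer dims; B: tokens per device; b: bytes per param;
    W: per-device network bandwidth (bytes/s); C: accelerator FLOP/s."""
    t_comm = D * F * b / W       # gather this layer's [D, F] weights
    t_math = 4 * B * D * F / C   # two matmuls in backward, 2 FLOPs per MAC
    return t_comm <= t_math
```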
-
For the data parallelism compute, I am trying to understand what each of the three 2's means:
the final 2 is a bit confusing to me; if you are counting both the forward and backward pass, shouldn't it be 3?
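For context, the standard per-matmul FLOP accounting goes roughly as follows (a hedged sketch; it may not match exactly how the chapter groups its factors of 2):

```python
# FLOPs for Y = X @ W with X: [B, D] and W: [D, F].
B, D, F = 1024, 8192, 32768     # hypothetical sizes

fwd   = 2 * B * D * F           # one matmul: a multiply and an add per MAC
bwd   = 2 * (2 * B * D * F)     # two matmuls (dX = dY @ W.T and dW = X.T @ dY)
total = fwd + bwd               # 6*B*D*F, i.e. 3x the forward cost
```

Relative to the forward pass alone, forward plus backward is indeed a factor of 3 (6·B·D·F in raw FLOPs); whether the chapter's final 2 is the backward-only multiplier or part of a combined 6 = 2·3 is presumably the source of the confusion.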
-
In the "FSDP shard" figure, shouldn't it be Wout instead of Win, since the logical shape is [D, F]?
-
I'm curious at what point we are able to overlap comms and compute for tensor parallelism. It doesn't seem possible at first glance, since the scattered activations from the previous layer need to be immediately re-gathered for the next layer's computation, with no compute in between. I feel like it is implied that we can overlap it somehow, since we still derive the conditions under which tensor parallelism becomes comms-bound, but based on the sequence of events it seems like it will be comms-bound no matter what.
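For what it's worth, one common way the overlap is achieved in practice is to chunk the gather along the contracting dimension so that the transfer of chunk i+1 can run concurrently with the matmul on chunk i (a conceptual numpy sketch, not the chapter's implementation; real versions issue asynchronous collectives):

```python
import numpy as np

def chunked_matmul(x_chunks, w):
    # x_chunks: list of [B, D_chunk] activation chunks "arriving" one at a time
    # w: [D, F] weights, consumed in matching row blocks of the contracting dim
    d_off, out = 0, None
    for x_c in x_chunks:
        w_c = w[d_off:d_off + x_c.shape[1]]   # rows of W for this chunk
        partial = x_c @ w_c                   # compute on chunk i while chunk i+1 transfers
        out = partial if out is None else out + partial
        d_off += x_c.shape[1]
    return out

B, D, F, n_chunks = 4, 8, 16, 4
x, w = np.random.randn(B, D), np.random.randn(D, F)
assert np.allclose(chunked_matmul(np.split(x, n_chunks, axis=1), w), x @ w)
```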
-
In the Pure Parallelism section's communication time, it is worth mentioning where the factor of 2 comes from: it accounts for the bidirectional communication in a ring, which cuts down the number of hops, as discussed in Section 3 (not the previous section, as mentioned in the text). Generally, I think this kind of brief repetition helps the reader stay focused on the main subject of discussion rather than scrambling to recall exactly where some factors come from.
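A quick sketch of that factor (hedged; symbols are generic rather than the chapter's):

```python
# All-gathering V bytes around a ring of N devices means each device forwards
# roughly V * (N - 1) / N bytes. A bidirectional ring sends half in each
# direction, which is where the factor of 2 shows up.
def ring_allgather_time(V, N, W_link, bidirectional=True):
    per_device_bytes = V * (N - 1) / N
    return per_device_bytes / ((2 if bidirectional else 1) * W_link)
```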
-
In the FSDP diagram above, "Note that the activations (left) are not sharded along the contracting dimension, which is what forces us to gather". It is important to remind the reader of the reasoning behind what necessitates the gather: the true activation function is non-linear, so it cannot be applied to sharded partial sums. Generally, I see "activation" mistakenly used in place of "pre-activation". The pre-activation can be sharded because it is a linear function; the activation cannot, because it is non-linear. But this is a problem of many publications that abuse the term, in my opinion.
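A two-line illustration of the non-linearity point: partial sums along the contracting dimension have to be reduced before the activation function is applied, because the activation does not distribute over the sum.

```python
import numpy as np

# Two shards' partial sums along the contracting dimension.
a, b = np.array([2.0, -3.0]), np.array([-1.0, 5.0])
relu = lambda v: np.maximum(v, 0.0)

print(relu(a + b))        # [1. 2.]  reduce first, then activate
print(relu(a) + relu(b))  # [2. 5.]  activating the partials gives a different result
```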
-
Equation (2) above is preceded by the text "Thus, for a single layer in the backward pass, we have", yet there is still an "L" in the FLOP/s equation (2).
-
From the FSDP section, please help me understand what you mean here:
While it is true that the contracting dimension is sharded for W_in, the contracting dimension
-
Training parallelism, discussed!