How to Parallelize a Transformer for Training | How To Scale Your Model #7
Replies: 17 comments 18 replies
-
Nit: "ZeRO-{1,2,3} are used to refer to sharding the weights, gradients and optimizer states in this way, respectively" |
-
How do Mixture of Experts (MoE) and GShard fit into this context? Thank you.
-
For pure data parallelism, does num_params = HBM per device / 10 need to take into account gradient storage space besides parameters and optimizer states?
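For reference, here is the usual per-parameter byte accounting under Adam as a hedged sketch (bf16 weights and gradients and fp32 optimizer states are assumptions about the convention, not something quoted from the chapter); whether the gradient bytes belong in the count is exactly the question:

```python
# Hypothetical per-parameter byte accounting for Adam; whether gradients are
# stored persistently (and in which precision) is what the question is about.
bytes_weights    = 2   # bf16 parameters
bytes_adam_state = 8   # fp32 first and second moments
bytes_grads      = 2   # bf16 gradients, if held in HBM at the same time

without_grads = bytes_weights + bytes_adam_state   # 10 bytes per parameter
with_grads    = without_grads + bytes_grads        # 12 bytes per parameter

hbm = 96e9                                         # TPU v5p HBM from the text
print(hbm / without_grads / 1e9)                   # ~9.6B params
print(hbm / with_grads / 1e9)                      # ~8.0B params
```

The ~9.6 figure is presumably where the chapter's "about 10B parameters" takeaway comes from.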
-
nit: It seems like you are splitting the difference when approximating the max model size on a TPU v5p; the takeaway value and the text should be consistent. "TPUv5p pod with 96GB of HBM and pure data parallelism this is about 10B parameters."
-
Shouldn't it be "to be compute-bound"?
-
Question 1 / Attention parameters: the count appears to be missing the number of layers.
-
I think the diagrams for the "Tensor Parallelism" and "Mixed FSDP and Tensor Parallelism" sections need to be switched. The diagram for "Tensor Parallelism" shows the local shape as [B // N, D // M], whereas the one for "Mixed FSDP and Tensor Parallelism" shows [B, D // M]; shouldn't it be the other way around?
-
Just a question about DCN vs. FSDP bandwidth requirements. The book mentions "Because DCN has lower bandwidth, it’s typically too slow to do much useful FSDP". Doing the math, it would appear that data parallelism and FSDP have the same amount of communication (one all-reduces the gradients, the other all-gathers the weights and reduce-scatters the gradients). Could you explain why FSDP is much more bandwidth intensive than DDP (and hence unsuitable for DCN)?
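As a rough back-of-the-envelope (a hedged sketch; P, b, and N below are hypothetical values, and an all-reduce is counted as a reduce-scatter plus an all-gather):

```python
# Approximate per-device communication volume per training step.
P, b, N = 10e9, 2, 64   # hypothetical: 10B params, bf16, 64-way data parallelism

ddp_bytes  = 2 * P * b * (N - 1) / N   # all-reduce(grads) = reduce-scatter + all-gather
fsdp_bytes = 3 * P * b * (N - 1) / N   # all-gather(W) fwd + all-gather(W) bwd + reduce-scatter(grads)

print(ddp_bytes / 1e9, fsdp_bytes / 1e9)   # comparable totals, ~39 vs ~59 GB
```

If the totals are this close, the distinction is presumably less about raw volume and more about when the bytes are needed (the per-layer weight gathers sit on each layer's critical path, while the single gradient all-reduce can overlap with the whole backward pass), but it would help if the chapter spelled this out.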
-
I do not think it is true that all three have the same communication cost. Across all the implementation variants I have personally seen, ZeRO-3 includes the extra parameter all-gather in the backward pass, while ZeRO-2 does not. It may be that in your case you often find the backward parameter all-gather can be fully overlapped, but I would still not say the communication cost is the same (since there are certainly many real cases where it cannot be overlapped).
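A minimal sketch of the overlap condition being referred to, with generic symbols rather than the chapter's notation (and ignoring (N-1)/N factors, latency, and topology details): the backward parameter all-gather for a layer is hidden only if its transfer time fits under that layer's backward matmul time.

```python
# Rough per-layer overlap check for ZeRO-3's backward weight all-gather.
def bwd_allgather_hidden(D, F, B, b, W, C):
    """D, F: layer dims; B: tokens per device; b: bytes per param;
    W: per-device network bandwidth (bytes/s); C: accelerator FLOP/s."""
    t_comm = D * F * b / W       # gather this layer's [D, F] weights
    t_math = 4 * B * D * F / C   # two matmuls in backward, 2 FLOPs per MAC
    return t_comm <= t_math
```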
-
For the data parallelism compute, I am trying to understand what each of the three 2's means:
the final 2 is a bit confusing to me; if you are counting both the forward and backward pass, shouldn't it be 3?
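For context, the standard per-matmul FLOP accounting goes roughly as follows (a hedged sketch; it may not match exactly how the chapter groups its factors of 2):

```python
# FLOPs for Y = X @ W with X: [B, D] and W: [D, F].
B, D, F = 1024, 8192, 32768     # hypothetical sizes

fwd   = 2 * B * D * F           # one matmul: a multiply and an add per MAC
bwd   = 2 * (2 * B * D * F)     # two matmuls (dX = dY @ W.T and dW = X.T @ dY)
total = fwd + bwd               # 6*B*D*F, i.e. 3x the forward cost
```

Relative to the forward pass alone, forward plus backward is indeed a factor of 3 (6·B·D·F in raw FLOPs); whether the chapter's final 2 is the backward-only multiplier or part of a combined 6 = 2·3 is presumably the source of the confusion.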
-
In the "FSDP shard" figure, shouldn't it be Wout instead of Win, since the logical shape is [D, F]?
-
I'm curious at what point we are able to overlap comms and compute for tensor parallelism. It doesn't seem possible at first glance, since the scattered activations from the previous layer need to be immediately re-gathered for the next layer's computation, with no compute in between. I feel like it is implied that we can overlap it somehow, since we still derive the conditions under which tensor parallelism becomes comms-bound, but based on the sequence of events it seems like it will be comms-bound no matter what.
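For what it's worth, one common way the overlap is achieved in practice is to chunk the gather along the contracting dimension so that the transfer of chunk i+1 can run concurrently with the matmul on chunk i (a conceptual numpy sketch, not the chapter's implementation; real versions issue asynchronous collectives):

```python
import numpy as np

def chunked_matmul(x_chunks, w):
    # x_chunks: list of [B, D_chunk] activation chunks "arriving" one at a time
    # w: [D, F] weights, consumed in matching row blocks of the contracting dim
    d_off, out = 0, None
    for x_c in x_chunks:
        w_c = w[d_off:d_off + x_c.shape[1]]   # rows of W for this chunk
        partial = x_c @ w_c                   # compute on chunk i while chunk i+1 transfers
        out = partial if out is None else out + partial
        d_off += x_c.shape[1]
    return out

B, D, F, n_chunks = 4, 8, 16, 4
x, w = np.random.randn(B, D), np.random.randn(D, F)
assert np.allclose(chunked_matmul(np.split(x, n_chunks, axis=1), w), x @ w)
```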
-
In the Pure Parallelism section's communication time, it is worth mentioning where the factor of 2 comes from: it accounts for the bidirectional communication in a ring, which cuts down the number of hops, as discussed in Section 3 (not the previous section, as mentioned in the text). Generally, I think this kind of brief repetition helps the reader stay focused on the main subject of discussion rather than scrambling to recall exactly where some factors come from.
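A quick sketch of that factor (hedged; symbols are generic rather than the chapter's):

```python
# All-gathering V bytes around a ring of N devices means each device forwards
# roughly V * (N - 1) / N bytes. A bidirectional ring sends half in each
# direction, which is where the factor of 2 shows up.
def ring_allgather_time(V, N, W_link, bidirectional=True):
    per_device_bytes = V * (N - 1) / N
    return per_device_bytes / ((2 if bidirectional else 1) * W_link)
```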
-
In the FSDP diagram above, "Note that the activations (left) are not sharded along the contracting dimension, which is what forces us to gather". It is important to remind the reader of the reasoning behind what necessitates the gather: the true activation function is non-linear, so it cannot be applied to sharded partial sums. Generally, I see "activation" mistakenly used in place of "pre-activation". The pre-activation can be sharded because it is a linear function; the activation cannot, because it is non-linear. But this is a problem of many publications that abuse the term, in my opinion.
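A two-line illustration of the non-linearity point: partial sums along the contracting dimension have to be reduced before the activation function is applied, because the activation does not distribute over the sum.

```python
import numpy as np

# Two shards' partial sums along the contracting dimension.
a, b = np.array([2.0, -3.0]), np.array([-1.0, 5.0])
relu = lambda v: np.maximum(v, 0.0)

print(relu(a + b))        # [1. 2.]  reduce first, then activate
print(relu(a) + relu(b))  # [2. 5.]  activating the partials gives a different result
```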
-
Equation (2) above is preceded by the text "Thus, for a single layer in the backward pass, we have", yet there is still an "L" in the FLOP/s equation (2).
-
From the FSDP section, please help me understand what you mean here:
While it is true that the contracting dimension is sharded for W_in, the contracting dimension
-
Training parallelism, discussed!