All About Rooflines | How To Scale Your Model #3
-
For the |
-
Nice work, thank you. But I am confused about why the roofline doesn't pass through the origin. Considering that |
-
When deriving the threshold of 240 (or ~500 on a GPU) for the batch size B, under the assumption B << D, does this threshold vary significantly between a consumer-grade GPU (say, an RTX 3090) and an enterprise-grade GPU (say, an H100)? |
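For reference, the threshold being asked about is the ratio of peak FLOPs/s to memory bandwidth, so it follows each chip's own ratio rather than its market tier. Below is a minimal sketch, assuming the bfloat16 matmul setting with B << D, where arithmetic intensity is roughly B. The function name `critical_batch_size` is hypothetical, and the two hardware figures are assumptions chosen because they reproduce the ~240 value; substitute datasheet numbers for any other chip.

```python
def critical_batch_size(peak_flops_per_s: float, mem_bandwidth_bytes_per_s: float) -> float:
    """Batch size above which a bf16 matmul is compute bound (assumes B << D)."""
    return peak_flops_per_s / mem_bandwidth_bytes_per_s


# Assumed figures that reproduce the ~240 threshold quoted in the question.
ASSUMED_PEAK_FLOPS = 1.97e14      # FLOPs/s (assumption)
ASSUMED_BANDWIDTH = 8.2e11        # bytes/s (assumption)

tpu_threshold = critical_batch_size(ASSUMED_PEAK_FLOPS, ASSUMED_BANDWIDTH)

# A hypothetical chip with the same compute but twice the bandwidth
# reaches the compute-bound regime at half the batch size.
faster_memory_threshold = critical_batch_size(ASSUMED_PEAK_FLOPS, 2 * ASSUMED_BANDWIDTH)

print(round(tpu_threshold), round(faster_memory_threshold))
```

Two chips with very different absolute speeds can land on similar thresholds if their compute-to-bandwidth ratios are similar, which is why only the ratio appears in the function.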
-
To help me remember, I took a personal note for Part 1, as below. I hope my understanding is correct.
High arithmetic intensity = compute bound
Low arithmetic intensity = bandwidth bound |
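The note above can be checked numerically. This is a minimal sketch, assuming a matmul of shape [B, D] x [D, F] with 2-byte (bfloat16) elements, where each input is read once and the output is written once. The helper names `matmul_intensity` and `regime` are hypothetical, and the critical intensity of 240 FLOPs/byte is an assumed hardware value, not a universal constant.

```python
def matmul_intensity(b: int, d: int, f: int, bytes_per_elem: int = 2) -> float:
    """FLOPs per byte for X[b, d] @ Y[d, f], counting one read per input and one write."""
    flops = 2 * b * d * f                                   # one multiply and one add per term
    bytes_moved = bytes_per_elem * (b * d + d * f + b * f)  # X, Y, and the output
    return flops / bytes_moved


def regime(intensity: float, critical_intensity: float) -> str:
    """Classify an operation against the hardware's critical intensity."""
    return "compute bound" if intensity > critical_intensity else "bandwidth bound"


CRITICAL = 240.0  # assumed hardware ratio of peak FLOPs/s to bytes/s

for batch in (64, 1024):
    intensity = matmul_intensity(batch, 8192, 8192)
    print(batch, round(intensity, 1), regime(intensity, CRITICAL))
```

With D = F = 8192, the small batch gives an intensity close to B itself, which matches the B << D approximation used in the thread.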
-
Unless I am mistaken, the reported FLOPs/s figure of "1.98e15 bfloat16" for the H100 corresponds to operations with sparsity. The corresponding FLOPs/s for dense operations should be half of the reported figure (see e.g. https://developer.nvidia.com/blog/nvidia-hopper-architecture-in-depth/). |
-
In the matrix multiplication section, you're using B both as the name of a matrix and as a shape parameter of the matrix A. You probably want to change one of them :). |
-
Is this a typo? |
-
In the roofline figure above, the boundary between the compute-bound region (green) and the bandwidth-bound region (pink) should start at the point where the accelerator FLOPs/s curve flattens, right? Why is it not drawn that way? Kindly explain. |
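As a point of comparison for the question above, here is a minimal sketch of the standard roofline model, in which attainable throughput is the smaller of the compute ceiling and intensity times bandwidth. It makes no claim about how the figure in the article was drawn. The function name `attainable_flops` is hypothetical and the hardware figures are assumed values.

```python
import math


def attainable_flops(intensity: float, peak_flops: float, bandwidth: float) -> float:
    """Standard roofline: throughput is capped by compute or by memory traffic."""
    return min(peak_flops, intensity * bandwidth)


PEAK = 1.97e14       # FLOPs/s (assumption)
BANDWIDTH = 8.2e11   # bytes/s (assumption)

# The knee is the intensity at which the sloped and flat segments meet.
knee = PEAK / BANDWIDTH

below = attainable_flops(knee / 2, PEAK, BANDWIDTH)   # on the sloped segment
above = attainable_flops(knee * 2, PEAK, BANDWIDTH)   # on the flat segment

print(math.isclose(below, PEAK / 2), above == PEAK)
```

In this model the curve flattens exactly at the knee, so the two regimes meet at an intensity of `PEAK / BANDWIDTH`.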
-
When reading the example of a matmul partitioned over two TPUs: X is split horizontally into [X0, X1], and [X0, X1] x [Y0, Y1]^T is simply X0 x Y0 + X1 x Y1. |
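The identity in this comment can be verified on a small case. This is a minimal sketch in plain Python, assuming X is split along its columns and Y along its rows (the contracting dimension). The helpers `matmul` and `add` and the example matrices are hypothetical; on real hardware each partial product would live on a different device and the final sum would be a cross-device reduction.

```python
def matmul(a, b):
    """Naive matrix product of two nested lists."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def add(a, b):
    """Elementwise sum of two equally shaped nested lists."""
    return [[x + y for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]


X = [[1, 2, 3, 4],
     [5, 6, 7, 8]]   # shape [2, 4]
Y = [[1, 0],
     [0, 1],
     [2, 0],
     [0, 2]]         # shape [4, 2]

# Split the contracting dimension in half: X by columns, Y by rows.
X0 = [row[:2] for row in X]
X1 = [row[2:] for row in X]
Y0, Y1 = Y[:2], Y[2:]

full = matmul(X, Y)
partial_sum = add(matmul(X0, Y0), matmul(X1, Y1))

print(full == partial_sum)
```

Each shard holds half of the contracting dimension, so neither partial product alone is the answer; only their sum is.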
-
Discussions about rooflines!