All About Rooflines | How To Scale Your Model #3

jacobaustin123 · 2025-02-03T02:21:33Z

jacobaustin123
Feb 3, 2025
Maintainer

Discussions about rooflines!

manishravula · 2025-02-05T00:51:12Z

manishravula
Feb 5, 2025 — with giscus

For T_math in the distributed GEMM case, if the FLOPs are (BDF/2) and the compute remains the same, where do we get the extra 2s in the numerator and the denominator? (If that represents the aggregate compute across two TPUs, then that should be made clear somewhere as well.)

1 reply

jacobaustin123 Feb 5, 2025 — with giscus
Maintainer Author

On a single TPU the compute would be 2BDF FLOPs (technically BDF multiplies and BF(D-1) adds). Split across two TPUs, each chip does half of this, so it’s 2BDF/2 FLOPs per chip, with 2BF/2 bytes transferred from each chip.
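To make the counting concrete, here is a minimal Python sketch. The function names, the `n_chips` parameter, and the example shapes are illustrative assumptions, not something from the original post. It shows that the exact count (BDF multiplies plus BF(D-1) adds) differs from the usual 2BDF approximation by only BF, and that splitting evenly across chips simply divides the per-chip FLOPs.

```python
def matmul_flops(B: int, D: int, F: int) -> tuple[int, int]:
    """Return (exact, approximate) FLOP counts for X[B, D] @ Y[D, F]."""
    multiplies = B * D * F      # one multiply per (b, d, f) triple
    adds = B * F * (D - 1)      # D - 1 additions per output element
    return multiplies + adds, 2 * B * D * F


def per_chip_t_math(B: int, D: int, F: int,
                    flops_per_s: float, n_chips: int = 2) -> float:
    """Compute time per chip, assuming the 2BDF FLOPs split evenly over n_chips."""
    return (2 * B * D * F / n_chips) / flops_per_s


exact, approx = matmul_flops(4, 8, 16)
print(exact, approx)  # 960 1024 (they differ by B * F = 64)
```

For large D the BF correction is negligible, which is why 2BDF is the figure normally quoted.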

xmfbit · 2025-02-05T08:40:21Z

xmfbit
Feb 5, 2025 — with giscus

Nice work, thank you. But I am confused about why the roofline doesn't pass through the origin, considering that Real FLOPs/s = min(Hardware FLOPs/s, BW * AI).

1 reply

fedelebron Feb 5, 2025 — with giscus
Collaborator

Roofline plots are traditionally done in log-log, which is why there's no "zero". We'll make an edit to clarify that, thanks!
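As a rough illustration of the log-log point, the sketch below evaluates the roofline formula from the question at a few arithmetic intensities. The hardware numbers `PEAK` and `BW` are made-up round values chosen only for illustration, and the function name is hypothetical.

```python
import math


def attainable_flops_per_s(intensity: float, peak: float, bandwidth: float) -> float:
    """Roofline: min(hardware FLOPs/s, bandwidth * arithmetic intensity)."""
    return min(peak, bandwidth * intensity)


PEAK = 1.0e15  # hypothetical peak FLOPs/s
BW = 1.0e12    # hypothetical memory bandwidth in bytes/s

# On linear axes the curve does pass through (0, 0).
assert attainable_flops_per_s(0.0, PEAK, BW) == 0.0

# On log axes, AI = 0 would sit at log10(0) = -infinity, so it is never drawn.
for exponent in range(-1, 5):
    ai = 10.0 ** exponent
    perf = attainable_flops_per_s(ai, PEAK, BW)
    print(f"log10(AI) = {exponent:>2}  log10(FLOPs/s) = {math.log10(perf):.1f}")
```

In log space the bandwidth-bound segment is a straight line of slope 1 that extends indefinitely to the left, and the compute-bound segment is flat.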

kishorepv · 2025-02-06T02:53:06Z

kishorepv
Feb 6, 2025 — with giscus

When deriving the 240 (or ~ 500 when using GPU) threshold for batch size B, under the assumption B << D, does this threshold vary significantly when using a consumer-grade GPU (say RTX 3090 etc.) versus enterprise-grade GPU (say H100) ?

1 reply

jacobaustin123 Feb 6, 2025 — with giscus
Maintainer Author

I'm not an expert on GPUs, but e.g. the RTX 3090 claims to support 268e12 FP16 FLOPs/s and to have a memory bandwidth of 936e9 bytes/s, which would give us a critical batch size of roughly 286 (source), so close to half that of the A100. Each generation will likely have a slightly different value depending on what workloads NVIDIA is trying to make efficient.
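The arithmetic here is just peak FLOPs/s divided by memory bandwidth. A small sketch, using the RTX 3090 figures quoted above (taken as given, not independently verified) and a hypothetical helper name:

```python
def critical_intensity(peak_flops_per_s: float, mem_bytes_per_s: float) -> float:
    """FLOPs per byte at which compute time equals memory time."""
    return peak_flops_per_s / mem_bytes_per_s


# Figures as quoted in the reply above.
rtx_3090 = critical_intensity(268e12, 936e9)
print(round(rtx_3090))  # 286
```

Under the B << D assumption from the question, this ratio is approximately the batch size at which the matmul switches from bandwidth-bound to compute-bound.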

meetrais · 2025-02-06T16:17:20Z

meetrais
Feb 6, 2025 — with giscus

As a reminder to myself, I took the notes below for Part 1. I hope my understanding is correct.

High Arithmetic Intensity = Compute Bound
If an operation has high arithmetic intensity, it performs many FLOPs per byte moved, so the compute units stay busy and little data transfer is needed. The operation is compute-bound.

Low Arithmetic Intensity = Bandwidth Bound
If an operation has low arithmetic intensity, the compute units finish quickly and then sit idle waiting for data. The operation is bandwidth-bound, because a higher GB/s would give better results.
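One way to check these notes against a concrete case is to compute the arithmetic intensity of a matmul and compare it with the hardware's FLOPs-per-byte ratio. This is a sketch under assumptions: it uses the standard count of 2BDF FLOPs and 2 bytes per element (bf16) for reading X[B, D] and Y[D, F] and writing the [B, F] output, and the peak and bandwidth values are made-up round numbers.

```python
def matmul_intensity(B: int, D: int, F: int, bytes_per_elem: int = 2) -> float:
    """FLOPs per byte for X[B, D] @ Y[D, F], counting two reads and one write."""
    flops = 2 * B * D * F
    bytes_moved = bytes_per_elem * (B * D + D * F + B * F)
    return flops / bytes_moved


def classify(intensity: float, peak_flops_per_s: float, mem_bytes_per_s: float) -> str:
    """Compare an operation's intensity with the hardware's critical intensity."""
    critical = peak_flops_per_s / mem_bytes_per_s
    return "compute-bound" if intensity >= critical else "bandwidth-bound"


PEAK, BW = 1.0e15, 1.0e12  # hypothetical hardware: critical intensity of 1000

small_batch = matmul_intensity(8, 4096, 4096)     # roughly B FLOPs per byte
large_batch = matmul_intensity(4096, 4096, 4096)  # 4096 / 3 FLOPs per byte
print(classify(small_batch, PEAK, BW))  # bandwidth-bound
print(classify(large_batch, PEAK, BW))  # compute-bound
```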

0 replies

sanagno · 2025-02-07T12:20:42Z

sanagno
Feb 7, 2025 — with giscus

Unless I am mistaken, the reported FLOPs/s number of "1.98e15 bfloat16" for the H100 corresponds to operations with sparsity. The corresponding FLOPs/s for dense operations should be half of the reported figure (see e.g. https://developer.nvidia.com/blog/nvidia-hopper-architecture-in-depth/).

1 reply

jacobaustin123 Feb 7, 2025
Maintainer Author

Thanks for noting this, updated.

damek · 2025-02-07T14:59:22Z

damek
Feb 7, 2025 — with giscus

In the matrix multiplication section you're using B for a matrix and a shape parameter for the matrix A. Probably want to change one of them :).

1 reply

jacobaustin123 Feb 7, 2025
Maintainer Author

Good call, fixed. Will update in a moment.

zhipengzhaocmu · 2025-02-07T19:29:54Z

zhipengzhaocmu
Feb 7, 2025 — with giscus

Is this a typo? 1e12 / 9.89e14 = 1.01us and 1e12 / 9.1e14 = 1.1ms. The first "us" should be "ms".

1 reply

jacobaustin123 Feb 7, 2025
Maintainer Author

Yes, good catch. Just updated this and it slipped through. Fixed now!
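For anyone double-checking the units, a small sketch like the following avoids the us/ms slip by picking the unit from the value itself. The helper name `format_duration` is hypothetical.

```python
def format_duration(seconds: float) -> str:
    """Format a duration using the largest unit that keeps the value >= 1."""
    for unit, scale in (("s", 1.0), ("ms", 1e-3), ("us", 1e-6), ("ns", 1e-9)):
        if seconds >= scale:
            return f"{seconds / scale:.2f} {unit}"
    return f"{seconds:.2e} s"


# time = FLOPs / (FLOPs per second)
print(format_duration(1e12 / 9.89e14))  # 1.01 ms
print(format_duration(1e12 / 9.1e14))   # 1.10 ms
```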

kirachy · 2025-02-08T08:23:27Z

kirachy
Feb 8, 2025 — with giscus

In the roofline figure above, the boundary between the compute-bound (green) and bandwidth-bound (pink) regions should start at the point where the accelerator FLOPs/s flattens, right? Why is it not drawn that way? Kindly explain.

1 reply

jacobaustin123 Feb 8, 2025
Maintainer Author

This is a mistake. The correct figure is something like this: [image: corrected roofline figure]

I'll update the website.

Shua1 · 2025-02-10T00:28:19Z

Shua1
Feb 10, 2025 — with giscus

When reading the example of the matmul partitioned over two TPUs,
I was confused about why we don't need to compute X1 x Y0 and X0 x Y1.
It turns out I just needed to refresh my linear algebra.

X is split along its columns (the D dimension) into
X0 = X[:, :D//2]
X1 = X[:, D//2:]
Y is split along its rows (the D dimension) into
Y0 = Y[:D//2, :]
Y1 = Y[D//2:, :]

[X0, X1] x [Y0; Y1] is simply X0 x Y0 + X1 x Y1.
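The identity can be checked numerically. Below is a small pure-Python sketch with made-up 2x4 and 4x3 integer matrices; the `matmul` and `add` helpers are hypothetical and written out only so that the example has no dependencies.

```python
def matmul(A, B):
    """Naive matrix product of two matrices stored as lists of rows."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]


def add(A, B):
    """Elementwise sum of two equally shaped matrices."""
    return [[a + b for a, b in zip(row_a, row_b)] for row_a, row_b in zip(A, B)]


D = 4
X = [[1, 2, 3, 4],
     [5, 6, 7, 8]]       # shape [B=2, D=4]
Y = [[1, 0, 2],
     [0, 1, 3],
     [4, 5, 6],
     [7, 8, 9]]          # shape [D=4, F=3]

X0 = [row[:D // 2] for row in X]  # X[:, :D//2]
X1 = [row[D // 2:] for row in X]  # X[:, D//2:]
Y0 = Y[:D // 2]                   # Y[:D//2, :]
Y1 = Y[D // 2:]                   # Y[D//2:, :]

# Each "chip" computes one partial product; summing them gives the full result.
partial_sum = add(matmul(X0, Y0), matmul(X1, Y1))
assert partial_sum == matmul(X, Y)
```

The cross terms never appear because each column block of X only ever multiplies the matching row block of Y inside the sum over D.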

0 replies

