[WIP] Significantly optimized performance #14071

Closed
wants to merge 2 commits

Conversation

@dmakoviichuk-tt (Contributor) commented Oct 21, 2024

Ticket

Link to Github Issue

Problem description

We found two very slow code paths that are called very often, for every op.

What's changed

  • Moved a very heavy consistency check in CoreRangeSet that is not needed in release builds (see the sketch below).
  • Refactored the logical_cores() calls.

This yields roughly a 100 ms (~10%) performance boost in NanoGPT training.
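
For illustration, here is a minimal sketch of one way to read the first change: the heavy check stays in debug builds and is compiled out of release builds. The constructor signature and member names are assumptions; the actual diff may look different.

```cpp
// Sketch only: keep the O(n^2) overlap validation in debug builds and compile
// it out of release builds.  The constructor signature is assumed here.
CoreRangeSet::CoreRangeSet(const std::set<CoreRange>& core_ranges) : ranges_(core_ranges) {
#ifndef NDEBUG
    // Pairwise consistency check: every pair of ranges must be disjoint.
    this->validate_no_overlap();
#endif
}
```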

Checklist

  • Post commit CI passes
  • Blackhole Post commit (if applicable)
  • Model regression CI testing passes (if applicable)
  • Device performance regression CI testing passes (if applicable)
  • New/Existing tests provide coverage for changes

https://github.com/tenstorrent/tt-metal/actions/runs/11450140201

@tt-aho (Contributor) commented Oct 21, 2024

@dmakoviichuk-tt why would the CoreRangeSet check not be needed in release? Users can create and pass CoreRangeSets into our APIs, so I think the no-overlap check is needed.

I am working on some related optimizations: we're switching the internal container to a vector so iteration should be faster, and reducing the iterations of these nested loops, since there are currently duplicated checks happening.

An example is here:

void CoreRangeSet::validate_no_overlap() {
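
For context, here is roughly the shape such a reduced-iteration check could take, assuming the container becomes a std::vector<CoreRange>; overlaps() stands in for whatever intersection test CoreRange provides, and the error macro is illustrative:

```cpp
// Sketch, not the code from #14074: compare each pair of ranges exactly once
// (j starts at i + 1), so the (i, j) and (j, i) checks are not duplicated.
void CoreRangeSet::validate_no_overlap() {
    for (size_t i = 0; i < this->ranges_.size(); ++i) {
        for (size_t j = i + 1; j < this->ranges_.size(); ++j) {
            // overlaps() is a stand-in for the real CoreRange intersection test.
            TT_FATAL(
                !overlaps(this->ranges_[i], this->ranges_[j]),
                "CoreRangeSet cannot contain overlapping CoreRanges");
        }
    }
}
```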

@rfurko-tt (Contributor) commented

@tt-aho just curious: what's the expected ranges_.size(), and the min/max values of the x, y coordinates?

@tt-aho (Contributor) commented Oct 22, 2024

I don't have any recent statistics, so this may be outdated, but given the way we distribute work to the device I'd expect it to generally be at most 3, and often only 1 or 2.

The min/max of x, y would be (0, 0) up to the size of the grid. The worst-case CoreRangeSet would basically be one where every core is specified separately, but that would be an issue of suboptimal user/setup code rather than of the validation, and should be fixed upstream.

I see that this PR also optimizes populate_dispatch_data, which only runs on the first iteration of a program. For this to have an impact on NanoGPT training, you must either not be running with the program cache, or the cache isn't being hit when enabled? We generally haven't looked at optimizing this path because we expect the program cache to be used in almost all cases, so it's normally a one-time cost on the first iteration.

@dmakoviichuk-tt (Contributor, Author) commented Oct 22, 2024

@tt-aho it is too slow an operation. I'll try to rewrite it, but currently this consistency check takes a lot of time.
It seems to be called quite often, even with good parameters.
A second idea is to add a validate parameter to the constructor, false by default.

On the cache issue: we consistently see some ops with caching problems, and we are filing tickets with the op owners in parallel. The current reality is that we hit cache issues pretty often; we were only able to turn the cache on a couple of days ago.
Anyway, that's why this is [WIP] :)

Also, the first run takes forever, so optimizing here is a good idea too.
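
Roughly, the second idea would look something like this; only the validate flag and its default come from the comment above, the rest of the signature is assumed:

```cpp
// Sketch of the opt-in validation idea: validation is skipped unless the
// caller explicitly asks for it.  Everything except the validate flag and its
// default value is an assumption.
CoreRangeSet::CoreRangeSet(const std::set<CoreRange>& core_ranges, bool validate /* = false */) :
    ranges_(core_ranges) {
    if (validate) {
        this->validate_no_overlap();  // heavy pairwise check, only on request
    }
}
```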

@tt-aho (Contributor) commented Oct 22, 2024

> @tt-aho it is too slow an operation. I'll try to rewrite it, but currently this consistency check takes a lot of time. It seems to be called quite often, even with good parameters. A second idea is to add a validate parameter to the constructor, false by default.
>
> On the cache issue: we consistently see some ops with caching problems, and we are filing tickets with the op owners in parallel. The current reality is that we hit cache issues pretty often; we were only able to turn the cache on a couple of days ago. Anyway, that's why this is [WIP] :)
>
> Also, the first run takes forever, so optimizing here is a good idea too.

Understandable. I optimized the consistency check in my current PR #14074. It would be good to see what the performance impact is there, as well as what sizes of CoreRangeSets you're getting in NanoGPT and whether they're within the typical range. We could potentially also disable the check as you've done and move to a lazy evaluation scheme where we only validate in the functions that actually need it, instead of on every creation.
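
For reference, a rough sketch of what such a lazy scheme could look like; the member and method names here are illustrative, not the real tt-metal API:

```cpp
#include <vector>

struct CoreRange;  // defined elsewhere in tt-metal

// Illustrative lazy-validation sketch: the overlap check runs the first time a
// caller asks for validated ranges, and the result is cached so the O(n^2)
// check happens at most once per CoreRangeSet instead of on every creation.
class CoreRangeSet {
public:
    const std::vector<CoreRange>& validated_ranges() const {
        if (!validated_) {
            validate_no_overlap();  // existing pairwise consistency check
            validated_ = true;
        }
        return ranges_;
    }

private:
    void validate_no_overlap() const;  // asserts that no two ranges overlap

    std::vector<CoreRange> ranges_;
    mutable bool validated_ = false;
};
```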
