[WIP] Significantly optimized performance #14071

Closed
wants to merge 2 commits

Conversation

@dmakoviichuk-tt (Contributor) commented Oct 21, 2024

Ticket

Link to Github Issue

Problem description

We found two very slow code paths that are called very often, for every op.

What's changed

  • Moved a very heavy consistency check in CoreRangeSet that is not needed in release builds (see the sketch below).
  • Refactored the logical_cores() calls.

This yields roughly a 100 ms (~10%) performance boost in NanoGPT training.
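
For illustration, here is a minimal sketch of one way to read the first change: the heavy check stays in debug builds and is compiled out of release builds. The constructor signature and member names are assumptions; the actual diff may look different.

```cpp
// Sketch only: keep the O(n^2) overlap validation in debug builds and compile
// it out of release builds.  The constructor signature is assumed here.
CoreRangeSet::CoreRangeSet(const std::set<CoreRange>& core_ranges) : ranges_(core_ranges) {
#ifndef NDEBUG
    // Pairwise consistency check: every pair of ranges must be disjoint.
    this->validate_no_overlap();
#endif
}
```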

Checklist

  • Post commit CI passes
  • Blackhole Post commit (if applicable)
  • Model regression CI testing passes (if applicable)
  • Device performance regression CI testing passes (if applicable)
  • New/Existing tests provide coverage for changes

https://github.com/tenstorrent/tt-metal/actions/runs/11450140201

@tt-aho (Contributor) commented Oct 21, 2024

@dmakoviichuk-tt why would the CoreRangeSet check not be needed in release? Users can create and pass CoreRangeSets into our APIs, so I think the no-overlap check is needed.

I am working on some related optimizations: we're switching the internal container to a vector so iteration should be faster, and reducing the iterations of these nested loops, since there are currently duplicated checks happening.

An example is here:

void CoreRangeSet::validate_no_overlap() {
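
For context, here is roughly the shape such a reduced-iteration check could take, assuming the container becomes a std::vector<CoreRange>; overlaps() stands in for whatever intersection test CoreRange provides, and the error macro is illustrative:

```cpp
// Sketch, not the code from #14074: compare each pair of ranges exactly once
// (j starts at i + 1), so the (i, j) and (j, i) checks are not duplicated.
void CoreRangeSet::validate_no_overlap() {
    for (size_t i = 0; i < this->ranges_.size(); ++i) {
        for (size_t j = i + 1; j < this->ranges_.size(); ++j) {
            // overlaps() is a stand-in for the real CoreRange intersection test.
            TT_FATAL(
                !overlaps(this->ranges_[i], this->ranges_[j]),
                "CoreRangeSet cannot contain overlapping CoreRanges");
        }
    }
}
```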

@rfurko-tt (Contributor) commented

@tt-aho just curious: what's the expected ranges_.size(), and the min/max values of the x, y coordinates?

@tt-aho (Contributor) commented Oct 22, 2024

I don't have any recent statistics, so this may be outdated, but given the way we distribute work to the device I'd expect it to generally be at most 3, and often only 1 or 2.

The min/max of x, y would be (0, 0) up to the size of the grid. The worst-case CoreRangeSet would basically be one where every core is specified separately, but that would be an issue of suboptimal user/setup code rather than of the validation, and should be fixed upstream.

I see that this PR also optimizes populate_dispatch_data, which only runs on the first iteration of a program. For this to have an impact on NanoGPT training, you must either not be running with the program cache, or the cache isn't being hit when enabled? We generally haven't looked at optimizing this path because we expect the program cache to be used in almost all cases, so it's normally a one-time cost on the first iteration.

@dmakoviichuk-tt (Contributor, Author) commented Oct 22, 2024

@tt-aho it is too slow an operation. I'll try to rewrite it, but currently this consistency check takes a lot of time.
It seems to be called quite often, even with good parameters.
A second idea is to add a validate parameter to the constructor, false by default.

On the cache issue: we consistently see some ops with caching problems, and we are filing tickets with the op owners in parallel. The current reality is that we hit cache issues pretty often; we were only able to turn the cache on a couple of days ago.
Anyway, that's why this is [WIP] :)

Also, the first run takes forever, so optimizing here is a good idea too.
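
Roughly, the second idea would look something like this; only the validate flag and its default come from the comment above, the rest of the signature is assumed:

```cpp
// Sketch of the opt-in validation idea: validation is skipped unless the
// caller explicitly asks for it.  Everything except the validate flag and its
// default value is an assumption.
CoreRangeSet::CoreRangeSet(const std::set<CoreRange>& core_ranges, bool validate /* = false */) :
    ranges_(core_ranges) {
    if (validate) {
        this->validate_no_overlap();  // heavy pairwise check, only on request
    }
}
```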

@tt-aho (Contributor) commented Oct 22, 2024

> @tt-aho it is too slow an operation. I'll try to rewrite it, but currently this consistency check takes a lot of time. It seems to be called quite often, even with good parameters. A second idea is to add a validate parameter to the constructor, false by default.
>
> On the cache issue: we consistently see some ops with caching problems, and we are filing tickets with the op owners in parallel. The current reality is that we hit cache issues pretty often; we were only able to turn the cache on a couple of days ago. Anyway, that's why this is [WIP] :)
>
> Also, the first run takes forever, so optimizing here is a good idea too.

Understandable. I optimized the consistency check in my current PR #14074. It would be good to see what the performance impact is there, as well as what sizes of CoreRangeSets you're getting in NanoGPT and whether they're within the typical range. We could potentially also disable the check as you've done and move to a lazy evaluation scheme where we only validate in the functions that actually need it, instead of on every creation.
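
For reference, a rough sketch of what such a lazy scheme could look like; the member and method names here are illustrative, not the real tt-metal API:

```cpp
#include <vector>

struct CoreRange;  // defined elsewhere in tt-metal

// Illustrative lazy-validation sketch: the overlap check runs the first time a
// caller asks for validated ranges, and the result is cached so the O(n^2)
// check happens at most once per CoreRangeSet instead of on every creation.
class CoreRangeSet {
public:
    const std::vector<CoreRange>& validated_ranges() const {
        if (!validated_) {
            validate_no_overlap();  // existing pairwise consistency check
            validated_ = true;
        }
        return ranges_;
    }

private:
    void validate_no_overlap() const;  // asserts that no two ranges overlap

    std::vector<CoreRange> ranges_;
    mutable bool validated_ = false;
};
```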
