-
Notifications
You must be signed in to change notification settings - Fork 110
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[WIP]Significanlty optimized performance #14071
Conversation
@dmakoviichuk-tt why would the CoreRangeSet check not be needed in release? Users can create and pass in CoreRangeSets to our apis, so I think the no overlap check is needed. I am working on some optimizations related to this, namely we're switching the internal container to a vector so iteration should be faster, as well as reducing the iterations of these nested loops as there are currently duplicated checks happening. Ex is here tt-metal/tt_metal/common/core_coord.cpp Line 214 in e75ecfe
|
@tt-aho just curious, what's the expected |
I don't have any recent statistics, so information may be outdated, but normally with the way we distribute work to device I'd expect it to generally be at most 3, but a lot of the time may only be 1 or 2. min max of x, y would be (0, 0) up to the size of the grid. Your worst case CoreRangeSet would basically be if every core is specified separately, but this would be an issue of the user/setup code being suboptimal and not an issue of validation and code should be fixed upstream. I see that this PR makes optimizations to |
@tt-aho it is too slow operation. I'll try to rewrite it but currently this consistency check takes a lot of time. For the cache issue: we consistently see some ops have issues with caching. And we are creating tickets to the op owners in parallel. But current reality is that we have issues with cache pretty often. Overall we were able to turn it on only couple days ago. Also a first run is something which takes forever so optimizing here is a good idea too. |
Understandable. I optimized the consistency check in my current PR #14074. Would be good to see what the performance impact is there as well as seeing what sizes of the CoreRangeSets you're getting in NanoGPT and if it's within the typical range. Potentially can also disable the check as you've done and move to a lazy evaluation scheme where we only validate in fns that actually need it instead on every creation. |
Ticket
Link to Github Issue
Problem description
We found 2 very slow places which called pretty often for every op.
What's changed
Got near ~100ms (~10% performance) boost in NanoGPT training.
Checklist
https://github.com/tenstorrent/tt-metal/actions/runs/11450140201