-
Notifications
You must be signed in to change notification settings - Fork 111
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
BH MMul Hang Parent Case #18064
Comments
- Removed skip mark so test can run on branch - Updated grid size to 13x10 - Scaled size of inputs to match grid size - Added prints before/after ttnn.synchronize_device call #0: hanging case: 13x10 grid, removed PACK(()), have matmul_block #0: single core test hang, with 1x1 subblock, no reader/writer, no pack_tile #0: Buildable, fix compute kernel arguments #0: Double kernel, proper size, doesn't hang on WH #0: Match the arguments from py and cpp #0: Combined cpp repro that introduces hang #0: Addded comments to the file * Addresses issue #18064 (more specifically #16439)
On https://github.com/tenstorrent/tt-metal/tree/ncvetkovic/bh_hang instruciton gathering is partially disabled, please see the commit message to disable it completely. This fixes the hang for single core hang #16439, maybe with ResNet50, #18065, but doesn't help with multiple cores, including SDPA #16673 / #18094. |
Most recent experiment is to disable instruction gathering and branch prediction. Some improvement, but some test cases still fail. Request @mywoodstock to test with Resnet and @ncvetkovicTT 's latest branch. |
@mywoodstock branch soon to come, after I do the rebase again. |
After I rebased SDPA started failing with PCC errors, but no hangs, so I gave up on it. Instead, I'll continue working on this branch ncvetkovic/bh_hang which is a bit apart from latest main. @mywoodstock please introduce the changes from the branch tt-llk/ncvetkovic/bh_hang to whichever tt-metal branch has the ResNet50 test and tell us if it hangs. |
@cmaryanTT @ttmtrajkovic @cglagovichTT @vmilicevicTT @mywoodstock @pgkeller Take a look at the following table for the overview of the workarounds that were attempted and their results. You can add suggestions and comments either somehow in the sheet or here/DM. https://docs.google.com/spreadsheets/d/1i0bfPhjZcUI1ce_CTE44a-VMnvquQjY-kRPAb5oMXow/edit?gid=0#gid=0 |
Note issue #18250 - there may be a new MMul hang cause if you're using a recent build. |
I had to rerun SDPA tests to confirm the behavior. Multi core variant doesn't hang when CSR writes are protected, single core variant behaves like single core |
All issues except #16439 original variant can be solved with the fix #10 from the table. In order to understand why disabling cache at the start of trisc.cc doesn't help with multi core SDPA, I am trying to introduce the hang on minimal core grid possible for simulation to be reasonably long. My current findings show that reducing the core grid below 16 removes the hang in I propose that workaround attempt no.10 becomes a workaround, or a solution for now, for all other issues. The issues are caused by a HW bug which requires that CSR writes be guarded by either |
@ncvetkovicTT @cglagovichTT - have you confirmed if the hang is the same in SDPA after the no 10 workaround? I agree about committing the workaround for now to allow other work to proceed. |
@cmaryanTT Fix no10 works on SDPA, fixes 11 - 16 don't and I am trying to understand why. |
@cmaryanTT @pgkeller @mywoodstock @nathan-TT @cglagovichTT Let me know if you guys need a separate branch for the fix. Essentially the workaround is to change
and to add the SFPI compiler flag |
- Guard CSRRS by disabling branch prediction - Add a compiler flag which makes the SFPI compiler not to use replay buffers
- Guard CSRRS by disabling branch prediction - Add a compiler flag which makes the SFPI compiler not to use replay buffers
Due to a HW bug in CSR writes, we need to guard them by first disabling branch prediction, and then re-enabling it. This is a workaround for tenstorrent/tt-metal#18064 (comment) and is proposed as a fix by the HW team.
- Guard CSRRS by disabling branch prediction - Add a compiler flag which makes the SFPI compiler not to use replay buffers
@ejouretTT @vmilicevicTT @cmaryanTT @mywoodstock @pgkeller @cglagovichTT PR #18359 should resolve all issues except #16439 multi core case, which will be addressed as soon as this PR is merged. Please rebase to the latest main once this PR is merged, I need to see why EventSync is failing first, any input is welcome. |
- Guard CSRRS by disabling branch prediction - Add a compiler flag which makes the SFPI compiler not to use replay buffers
### Ticket #18064 #18065 #17099 #16673 ### Problem description Disabling instruction cache doesn't happen in time for subsequent TTI instruction not to end up in the cache. In order to properly disable I$, we need to disable branch prediction first. Since reprogramming the REPLAY buffers needs to happen when the cache is disabled, SFPI compiler cannot rely on REPLAY buffers. These things introduce multiple matmul hangs. ### What's changed - Guard CSRRS by disabling branch prediction in BH ckernel.h - Add a compiler flag for BH which makes the SFPI compiler not to use replay buffers ### Checklist - [x] [All post commit](https://github.com/tenstorrent/tt-metal/actions/workflows/all-post-commit-workflows.yaml) CI passes - [26473](https://github.com/tenstorrent/tt-metal/actions/runs/13569121473/job/37929720035) - expected to fail - [x] [Blackhole Post commit](https://github.com/tenstorrent/tt-metal/actions/workflows/blackhole-post-commit.yaml) CI passes (if applicable) - [3768](https://github.com/tenstorrent/tt-metal/actions/runs/13550312504) - [x] [Model regression](https://github.com/tenstorrent/tt-metal/actions/workflows/perf-models.yaml) CI passes (if applicable) - [7715](https://github.com/tenstorrent/tt-metal/actions/runs/13550316991) - [x] [Device performance regression](https://github.com/tenstorrent/tt-metal/actions/workflows/perf-device-models.yaml) CI passes (if applicable) - [5094](https://github.com/tenstorrent/tt-metal/actions/runs/13567170826), expected to fail - [ ] **(For models and ops writers)** Full [new models tests](https://github.com/tenstorrent/tt-metal/actions/workflows/full-new-models-suite.yaml) CI passes (if applicable) - [ ] New/Existing tests provide coverage for changes
To clarify - this workaround should unblock everyone on all related issues from the description. I suggest to keep #16439 open until we understand why full-grid test fails. Also, keep #18094 open as a tracker until we realize the effect of clock gating + instruction cache on this test. We can close this issue. |
Using this to organize the different manifestations of the MMul hang in BH
MMul in isolation (main case): #16439
MMul in Conv2D, in Resnet: #18065
MMul in SDPA, in LLama: #16673 (working with semaphore LLK fix)
The text was updated successfully, but these errors were encountered: