BH MMul Hang in Resnet #18065
Comments
@mywoodstock, can you please help with creating a minimal repro for this test? Assigning the issue to you.
@vmilicevicTT I have tried to create smaller tests but haven't been able to repro the issue. It only happens in the full model run.
This hasn't reproduced in isolation. Next attempt will be for @mywoodstock to create a test that repeatedly loops over the problematic section of the model. I think there was some success with this sort of thing in #16439.
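A minimal sketch of what such a looping test could look like, in plain Python. Everything named here is an assumption for illustration: `run_section` is a hypothetical callable standing in for the problematic part of the model, and `first_hanging_iteration` is not part of tt-metal. The sketch only shows the loop-with-timeout idea; it does not reset the card or recover from a hang.

```python
import threading
from typing import Callable, Optional


def first_hanging_iteration(
    run_section: Callable[[int], None],  # hypothetical: runs the suspect model section once
    iterations: int,
    timeout_s: float,
) -> Optional[int]:
    """Call run_section(i) repeatedly.

    Returns the first iteration index that does not finish within
    timeout_s seconds, or None if every iteration returns in time.
    """
    for i in range(iterations):
        # A daemon thread lets the interpreter exit even if the call never returns.
        worker = threading.Thread(target=run_section, args=(i,), daemon=True)
        worker.start()
        worker.join(timeout_s)
        if worker.is_alive():
            # The call did not return in time: treat it as a hang and stop,
            # since later iterations would run against a suspect device state.
            return i
        # Note: exceptions raised inside run_section are not propagated
        # by this sketch; they end the thread and count as "returned".
    return None
```

Stopping at the first timeout is a deliberate choice in this sketch: continuing to iterate after a suspected hang would mix fresh failures with follow-on ones.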
Note issue #18250 - there may be a new MMul hang cause if you're using a recent build.
I tried disabling instruction gather by
It still hung. On top of that, I added the compiler option that makes SFPI not use the replay buffers by
But the hang persisted. Commenting out the disable_gathering and enable_gathering calls did not help either.
What if you do
Did not work
Cannot reproduce a hang on EDIT: I didn't comment out the workaround in
@amahmudTT Can you confirm that what you did is the following:
The reason I ask is that I get a pass in this case on
I did not perform the above step.
Okay, here is what I realized: once I encounter a hang, every subsequent run hangs regardless of which of the above steps I apply, unless I restart the card. After a restart, following the outlined steps does not produce a hang.
Issue #17099 is fixed only when both the SFPI compiler option is changed and the disable_gathering()-related changes are made; one alone does not suffice. So I tested this issue with both changes and, as in that issue, I see a similar assert-related failure.
Interesting, I get a clear pass in that case:
Maybe it's something card-related? Let me repeat the steps that I do just for clarity:
### Ticket
#18064 #18065 #17099 #16673

### Problem description
Disabling instruction cache doesn't happen in time for subsequent TTI instruction not to end up in the cache. In order to properly disable I$, we need to disable branch prediction first. Since reprogramming the REPLAY buffers needs to happen when the cache is disabled, SFPI compiler cannot rely on REPLAY buffers. These things introduce multiple matmul hangs.

### What's changed
- Guard CSRRS by disabling branch prediction in BH ckernel.h
- Add a compiler flag for BH which makes the SFPI compiler not to use replay buffers

### Checklist
- [x] [All post commit](https://github.com/tenstorrent/tt-metal/actions/workflows/all-post-commit-workflows.yaml) CI passes - [26473](https://github.com/tenstorrent/tt-metal/actions/runs/13569121473/job/37929720035) - expected to fail
- [x] [Blackhole Post commit](https://github.com/tenstorrent/tt-metal/actions/workflows/blackhole-post-commit.yaml) CI passes (if applicable) - [3768](https://github.com/tenstorrent/tt-metal/actions/runs/13550312504)
- [x] [Model regression](https://github.com/tenstorrent/tt-metal/actions/workflows/perf-models.yaml) CI passes (if applicable) - [7715](https://github.com/tenstorrent/tt-metal/actions/runs/13550316991)
- [x] [Device performance regression](https://github.com/tenstorrent/tt-metal/actions/workflows/perf-device-models.yaml) CI passes (if applicable) - [5094](https://github.com/tenstorrent/tt-metal/actions/runs/13567170826), expected to fail
- [ ] **(For models and ops writers)** Full [new models tests](https://github.com/tenstorrent/tt-metal/actions/workflows/full-new-models-suite.yaml) CI passes (if applicable)
- [ ] New/Existing tests provide coverage for changes
@cmaryanTT @mywoodstock I guess we can close this now?
Yes, let me confirm with the latest main. Will report back soon.
Thanks!
OK removing the matmul block workarounds from both matmul and conv kernels looks good! All three cases with batch=16, 20 and 32 work without hangs.
Originally posted by @mywoodstock in #16439
This instance is successful when the workaround is applied, but the workaround is not confirmed to solve the root cause.