Device profiler on ttnn resnet50 is hanging on BH #17099
As per the conversation with Paul, the first thing to check is to entirely disable the L1 data cache, which is set in risc_common.
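For context, a minimal sketch of what a CSR-based cache-disable toggle could look like on the Blackhole RISC-V cores; the CSR address, bit position, and function name below are placeholders, not the actual risc_common code:

```cpp
#include <cstdint>

// Hypothetical sketch only: set a data-cache-disable bit through a custom
// RISC-V CSR. The CSR number (0x7c0) and bit are placeholders and do not
// reflect the real Blackhole register map.
static inline void disable_l1_data_cache() {
    constexpr uint32_t kDCacheDisableBit = 1u << 0;  // placeholder bit
    asm volatile("csrrs zero, 0x7c0, %0" ::"r"(kDCacheDisableBit));
}
```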
Confirming that disabling the L1 cache fixes the issue. Hanging run: https://github.com/tenstorrent/tt-metal/actions/runs/12959399845 Passing run: https://github.com/tenstorrent/tt-metal/actions/runs/12966853376/job/36168151536#step:9:407
@mo-tenstorrent Trying to isolate this to a single core - still in progress. Reach out to @mywoodstock to create unit tests around it.
Previous workaround did not fix this; a new solution is needed. Root cause not yet identified - running it in isolation doesn't cause the hang. @mo-tenstorrent to re-evaluate and provide updates on the root cause.
The root cause is not fully identified yet. However, we can pinpoint a single op, a conv2d, in the model that triggers the hang.
A better workaround is applying the following patch:
Replicating the baseline from two weeks ago: on a clean non-profiler build, the hang can be reproduced with the following changes to the code:
Multiple counts of reg_read + L1 writes had to be tried for the hang to appear. One note here is that the reset requirements of the above hang are slightly different from those of the original hang. This difference in reset behaviour might be caused by profiler vs non-profiler builds: in profiler builds we ask for finish at different times, and we also do direct L1 reads through UMD.
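The actual diff is not reproduced above; as a rough, hypothetical illustration of the kind of instrumentation described (a swept number of reg_read + L1 write pairs inserted into the kernel entry, e.g. ncrisck.cc), with all addresses and counts being placeholders:

```cpp
#include <cstdint>

// Hypothetical repro instrumentation: read a register, write the value to L1,
// repeat N times. N is the knob that had to be swept to trigger the hang.
constexpr int kNumRegReads = 4;                            // placeholder count
volatile uint32_t* const kSomeReg =
    reinterpret_cast<volatile uint32_t*>(0xFFB20000);      // placeholder register address
volatile uint32_t* const kL1Scratch =
    reinterpret_cast<volatile uint32_t*>(0x00012000);      // placeholder L1 address

inline void reg_read_l1_write_stall() {
    for (int i = 0; i < kNumRegReads; i++) {
        uint32_t v = *kSomeReg;   // reg_read
        kL1Scratch[i] = v;        // L1 write
    }
}
```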
Here is the backtrace of the process after the hang:
It is stuck on finish. Rerunning the test without smi-reset gives the following trace when hung:
This looks like a UMD read_block hanging.
Removing the code changes in trisck.cc and reducing ncrisck.cc to a single reg_read + L1 write caused the hang to move to the warm-up run in layer 4 module 1, with the following trace:
Just doing reg_reads to local memory also causes a hang, but the reset behaviour changes. The following patch hangs on
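Again purely as a hypothetical sketch (the referenced patch is not shown above): back-to-back reads from core-local memory, with the address and count as placeholders:

```cpp
#include <cstdint>

// Hypothetical sketch: repeated reads from local memory only, no L1 writes.
volatile uint32_t* const kLocalMem =
    reinterpret_cast<volatile uint32_t*>(0xFFB00000);  // placeholder local-memory address

inline void local_mem_read_stall() {
    uint32_t sink = 0;
    for (int i = 0; i < 8; i++) {   // placeholder count
        sink += kLocalMem[i];       // read from local memory
    }
    asm volatile("" ::"r"(sink));   // keep the reads from being optimized out
}
```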
All data points above this comment were done on BH-30 Device 1.
On BH-30 Device 0, it is confirmed that 18 nop cycles can also induce a hang on the UMD read_block. This is on layer 4 module 3 conv 3. Diff:
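As an illustration of the stall described (not the referenced diff), 18 back-to-back nops can be emitted in place with an assembler repeat directive:

```cpp
// Hypothetical 18-cycle stall: emit 18 nop instructions in place.
inline void stall_18_nops() {
    asm volatile(
        ".rept 18\n\t"
        "nop\n\t"
        ".endr" ::: "memory");
}
```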
Stalling for 10 cycles instead of 18 causes a hang on finish instead of read_block. The finish hang then crashes on FW init at the next reboot.
Next run crash:
On BH-30 Device 0, just enabling watcher with only waypoints enabled causes the hang. The waypoints suggest the hang is in the LLK; need to further investigate whether this is the LLK itself having an issue or whether it has been misused by kernels and now eventually gets stuck. Added waypoint diff:
Watcher Sample:
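Neither the waypoint diff nor the watcher sample is reproduced above. As a generic illustration of how a watcher waypoint is added to kernel code (the header path and WAYPOINT macro are assumptions about the tt-metal debug API, and the four-character tags are arbitrary):

```cpp
#include "debug/waypoint.h"  // assumed tt-metal watcher header

void kernel_main() {
    WAYPOINT("LLKS");   // about to enter the suspected LLK call
    // ... suspected LLK / compute call ...
    WAYPOINT("LLKD");   // made it past the suspected call
}
```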
Hangs with only watcher enabled are non-deterministic. We can also get the same setup as above to hang as shown below; the host is hung on
The watcher log shows that worker cores are running kernels. Dispatch cores are stuck waiting for workers to finish. Most likely in this scenario, the workers are stuck in a loop that is not known to watcher.
Single-core problem: op problem or LLK problem. Next step is for @mywoodstock to take a look. Need to get this narrowed down to a reproducible case.
@ttmtrajkovic to assign someone from the LLK team to dive into this.
Tried the following workarounds from @nvelickovicTT and we still hang. Worker cores are stuck on the NOC command buffer becoming available. Watcher log:
I will be looking at this issue from the LLK side @ejouretTT
After going through the issue, my thoughts are: when we have faced these types of issues before and found the root cause (e.g. in SDPA, MATMUL), our procedure has been the same - find a small test case on a single core that fails deterministically and simulate it. If it is not small and not on a single core, it is impossible to simulate, since simulating just 2 matmul tiles twice would take around 6 minutes; simulating the whole model on multiple cores is going to be impossible. Is it possible to provide us with a repro on a single core and not on the full model?
Many attempts were made to isolate the issue into a small single-core test. Unfortunately, all were unfruitful. So far this hang has required this particular chain of ops to run for it to be reproduced. @mywoodstock do you have any other ideas on how we might be able to better isolate this?
@mo-tenstorrent The original branch in the issue has been deleted; could you give your latest branch and repro command?
Applied the compiler arg to 11e3906, mo/bh_model_test
Pushed the original branch back.
@amahmudTT To reduce test cases to further identify the root cause.
Needed to restart after the failure, as without a restart, recompiling and running hangs no matter what. The workaround of using
seems to postpone the hang but not fix it. The compiler change allows many more tests to pass (maybe all; I am confused by the ending timer message - it could just be a hanging timer).
This is the ending message; the timer keeps on going. Could it be just a dangling timer? Need to confirm.
Disabling the L1 cache but keeping the other changes (disable/enable_gathering() and the compiler option) still allows the test to run fully (with the timer still going on at the end).
@amahmudTT Thank you, could you please provide the branch and the command that you're running, either here or in the table, Row2ColumnG (in the note body)? Also, can you please confirm the results that I put in the table, ColumnG?
Updated.
Confirmed with Mo: the timer at the end was not a hang; it was the profiler that kept on writing through another thread. So the above modifications do remove the hangs.
Alright, thanks! I believe, however, that we're going to use another workaround, #10 from the table, which protects CSR writes and adds an SFPI compiler flag.
Let's make sure that workaround works with the compiler issue before closing, @nathan-TT.
Note the following in case you want to reproduce the hang and test our discovered solutions. On IRD machines, Weka is slow, so generating large device-profiler CSVs can slow things down dramatically. Point the generated folder to localdev with a symlink.
With the above done, please run the command below on profiler builds (i.e.
It will result in the same hang on that branch if our discovered solutions are not applied.
@mo-tenstorrent So wait, the fix from this PR doesn't help in your case?
### Ticket
#18064 #18065 #17099 #16673

### Problem description
Disabling the instruction cache doesn't happen in time for subsequent TTI instructions not to end up in the cache. In order to properly disable I$, we need to disable branch prediction first. Since reprogramming the REPLAY buffers needs to happen while the cache is disabled, the SFPI compiler cannot rely on REPLAY buffers. These things introduce multiple matmul hangs.

### What's changed
- Guard CSRRS by disabling branch prediction in BH ckernel.h
- Add a compiler flag for BH which makes the SFPI compiler not use replay buffers

### Checklist
- [x] [All post commit](https://github.com/tenstorrent/tt-metal/actions/workflows/all-post-commit-workflows.yaml) CI passes - [26473](https://github.com/tenstorrent/tt-metal/actions/runs/13569121473/job/37929720035) - expected to fail
- [x] [Blackhole Post commit](https://github.com/tenstorrent/tt-metal/actions/workflows/blackhole-post-commit.yaml) CI passes (if applicable) - [3768](https://github.com/tenstorrent/tt-metal/actions/runs/13550312504)
- [x] [Model regression](https://github.com/tenstorrent/tt-metal/actions/workflows/perf-models.yaml) CI passes (if applicable) - [7715](https://github.com/tenstorrent/tt-metal/actions/runs/13550316991)
- [x] [Device performance regression](https://github.com/tenstorrent/tt-metal/actions/workflows/perf-device-models.yaml) CI passes (if applicable) - [5094](https://github.com/tenstorrent/tt-metal/actions/runs/13567170826), expected to fail
- [ ] **(For models and ops writers)** Full [new models tests](https://github.com/tenstorrent/tt-metal/actions/workflows/full-new-models-suite.yaml) CI passes (if applicable)
- [ ] New/Existing tests provide coverage for changes
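As a minimal sketch of the ordering the problem description calls for - branch prediction off first, then the instruction cache disabled - with CSR numbers and bit masks as placeholders rather than the actual BH ckernel.h change:

```cpp
#include <cstdint>

// Hypothetical ordering sketch: if branch prediction stays on, a speculative
// fetch can still land a TTI instruction in the cache after the "disable"
// write, so branch prediction is turned off before I$ is disabled.
inline void disable_icache_with_bp_guard() {
    constexpr uint32_t kBranchPredDisable = 1u << 1;  // placeholder bit
    constexpr uint32_t kICacheDisable     = 1u << 0;  // placeholder bit
    asm volatile("csrrs zero, 0x7c0, %0" ::"r"(kBranchPredDisable));  // 1. stop speculation
    asm volatile("csrrs zero, 0x7c1, %0" ::"r"(kICacheDisable));      // 2. now disable I$
}
```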
No, it does - this is just a better reproduction step for quicker testing.
Oh alright, then I guess we can close this issue as well?
The plan was to hold off on that until Nathan has actually brought in the compile flag switch into main.
I have done that with this change. Or is there some other change that you guys had in mind which I'm missing here?
Oh I see, @pgkeller did you have anything else in mind?
I remember in the meeting it was mentioned that Nathan was probably going to introduce a change to the compiler (it probably would not need the flag for Blackhole) and wanted this issue to stay open so that they could test their change against it.
I think we can close this. Nathan's change will be for SFPU only, and we don't have a repro that shows this issue on SFPU.
Confirmed that on BH-30, on commit 681d3f7 on main, which has the fix, we don't see a hang when we profile resnet50. This is without using any env vars or mods to the code.
On Blackhole, we initially get a kernel compile error, but with the following patch, it runs:
But when trying to profile Resnet50, the run with the profiler hangs (it passes fine without the profiler).
Branch:
asarje/bh-rn50-20250123
Compile with profiler:
./build_metal.sh -p --debug
Run the model with profiler:
python -m tracy -p -r -v -m pytest "\"tests/ttnn/integration_tests/resnet/test_ttnn_functional_resnet50.py::test_resnet_50[pretrained_weight_false-batch_size=16-act_dtype=DataType.BFLOAT8_B-weight_dtype=DataType.BFLOAT8_B-math_fidelity=MathFidelity.LoFi-device_params={'l1_small_size': 24576}]\""
The run will hang.