BH MMul Hang in Resnet #18065

Closed
cmaryanTT opened this issue Feb 20, 2025 · 17 comments

@cmaryanTT

cmaryanTT commented Feb 20, 2025

@ncvetkovicTT With the matmul_block workaround removed and your workaround commit added, the model still hangs.

I am using this branch: asarje/rn50-bh-largebatch
Test: pytest "tests/ttnn/integration_tests/resnet/test_ttnn_functional_resnet50.py::test_resnet_50[pretrained_weight_false-batch_size=32-act_dtype=DataType.BFLOAT8_B-weight_dtype=DataType.BFLOAT8_B-math_fidelity=MathFidelity.LoFi-device_params={'l1_small_size': 24576}]"

This branch currently has the matmul workaround skipped in the matmul kernel, but not in the conv kernel. The conv kernel is:
ttnn/cpp/ttnn/operations/conv/conv2d/device/kernels/conv_bmm_tilize_col_major_out_blocks.cpp Line 271 (https://github.com/tenstorrent/tt-metal/blob/asarje/rn50-bh-largebatch/ttnn/cpp/ttnn/operations/conv/conv2d/device/kernels/conv_bmm_tilize_col_major_out_blocks.cpp#L271-L277)

Originally posted by @mywoodstock in #16439

This instance succeeds when the workaround is applied, but the workaround is not confirmed to address the root cause.
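For reference, the conv-kernel workaround referred to above is a short nop loop guarding the matmul block; a minimal illustrative sketch (the iteration count and the ARCH_BLACKHOLE guard are assumptions here, see the linked lines 271-277 for the actual code):

#ifdef ARCH_BLACKHOLE
    // Workaround sketch: burn a few cycles with no-ops before the matmul block
    // call; loop bound and placement are guesses, not the real kernel code.
    for (uint32_t i = 0; i < 10; i++) {
        asm volatile("nop");
    }
#endif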

@vmilicevicTT

@mywoodstock, can you please help create a minimal repro for this test? Assigning the issue to you.
@cmaryanTT, could you please prioritize this?

@mywoodstock
Contributor

@vmilicevicTT I have tried to create smaller tests but haven't been able to repro the issue. It only happens in the full model run.

@cmaryanTT added the "bug" (Something isn't working) label on Feb 20, 2025
@cmaryanTT
Author

This hasn't reproduced in isolation. The next attempt will be for @mywoodstock to create a test that repeatedly loops over the problematic section of the model. I think there was some success with this sort of thing in #16439.

@cmaryanTT
Author

Note issue #18250 - there may be a new cause of MMul hangs if you're using a recent build.

@amahmudTT
Contributor

amahmudTT commented Feb 25, 2025

I tried disabling instruction gathering with:

--- a/tt_metal/hw/firmware/src/trisc.cc
+++ b/tt_metal/hw/firmware/src/trisc.cc
@@ -76,6 +76,7 @@ constexpr bool cb_init_write = false;
 using namespace ckernel;

 int main(int argc, char *argv[]) {
+    disable_gathering();
     configure_l1_data_cache();

It still hung; then, on top of that, I added the compiler option to make SFPI not use the REPLAY buffers:

--- a/tt_metal/jit_build/build.cpp
+++ b/tt_metal/jit_build/build.cpp
@@ -327,7 +327,7 @@ JitBuildCompute::JitBuildCompute(const JitBuildEnv& env, const JitBuiltStateConf

     this->out_path_ = this->is_fw_ ? env_.out_firmware_root_ : env_.out_kernel_root_;

-    this->cflags_ = env_.cflags_ + "-O3 ";
+    this->cflags_ = env_.cflags_ + "-O3 " + " -fno-rvtt-sfpu-replay ";

But the hang persisted.

Commenting out the disable_gathering and enable_gathering calls did not help either.

@ncvetkovicTT
Contributor

> I tried disabling instruction gathering with: [...] But the hang persisted.

What if you call disable_gathering() in trisc.cc and then comment out the two calls to disable/enable gathering in load_replay_buf?
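Concretely, the second half of that suggestion means both definitions of load_replay_buf would no longer toggle gathering themselves; a sketch (the signature below is a placeholder, not the actual BH ckernel header; only the commented-out calls are the point):

#include <cstdint>

inline void load_replay_buf(uint32_t start_idx, uint32_t len) {
    // disable_gathering();   // commented out: gathering is already off, done once in trisc.cc main()
    // ... program `len` REPLAY buffer entries starting at `start_idx` ...
    // enable_gathering();    // commented out for the experiment
}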

@amahmudTT
Contributor

> What if you call disable_gathering() in trisc.cc and then comment out the two calls to disable/enable gathering in load_replay_buf?

That did not work either.

@ncvetkovicTT
Contributor

ncvetkovicTT commented Feb 25, 2025

I cannot reproduce a hang on yyzo-bh-15. I checked out the latest asarje/rn50-bh-largebatch, did a fresh build, and ran pytest "tests/ttnn/integration_tests/resnet/test_ttnn_functional_resnet50.py::test_resnet_50[pretrained_weight_false-batch_size=32-act_dtype=DataType.BFLOAT8_B-weight_dtype=DataType.BFLOAT8_B-math_fidelity=MathFidelity.LoFi-device_params={'l1_small_size': 24576}]", but the test passes quickly.

EDIT: I didn't comment out the workaround in ttnn/cpp/ttnn/operations/conv/conv2d/device/kernels/conv_bmm_tilize_col_major_out_blocks.cpp

@ncvetkovicTT
Contributor

ncvetkovicTT commented Feb 25, 2025

@amahmudTT Can you confirm that what you did is the following:

  1. Go to Abhinav's branch from above
  2. Comment out the workaround (nop loop) in ttnn/cpp/ttnn/operations/conv/conv2d/device/kernels/conv_bmm_tilize_col_major_out_blocks.cpp
  3. Put disable_gathering() in trisc.cc
  4. Comment out disable_gathering() and enable_gathering() inside both definitions of load_replay_buf
  5. Run pytest "tests/ttnn/integration_tests/resnet/test_ttnn_functional_resnet50.py::test_resnet_50[pretrained_weight_false-batch_size=32-act_dtype=DataType.BFLOAT8_B-weight_dtype=DataType.BFLOAT8_B-math_fidelity=MathFidelity.LoFi-device_params={'l1_small_size': 24576}]"
  6. Observe the hang?

The reason I ask is that I get a pass in this case on yyzo-bh-15, although I do get a hang for some other cases. See the table for details:
https://docs.google.com/spreadsheets/d/1i0bfPhjZcUI1ce_CTE44a-VMnvquQjY-kRPAb5oMXow/edit?gid=0#gid=0

@amahmudTT
Contributor

> 2. Comment out the workaround (nop loop) in ttnn/cpp/ttnn/operations/conv/conv2d/device/kernels/conv_bmm_tilize_col_major_out_blocks.cpp

I did not perform the above step.

@amahmudTT
Contributor

amahmudTT commented Feb 26, 2025

> @amahmudTT Can you confirm that what you did is the following: [steps 1-6 above]?

Okay, here is what I realized: once I encounter a hang, I keep encountering it regardless of which of the above steps I apply, as long as I do not restart the card. If I do restart, then following the outlined steps does not produce a hang.

@amahmudTT
Contributor

Issue #17099 gets fixed only when both the SFPI compiler option is changed and the disable_gathering() related changes are made; one alone does not suffice. I tested this issue with those changes and, like that issue, I see a similar assert-related failure:

tests/ttnn/integration_tests/resnet/test_ttnn_functional_resnet50.py::test_resnet_50[pretrained_weight_false-batch_size=32-act_dtype=DataType.BFLOAT8_B-weight_dtype=DataType.BFLOAT8_B-math_fidelity=MathFidelity.LoFi-device_params={'l1_small_size': 24576}] 2025-02-26 09:51:30.358 | DEBUG    | ttnn:manage_config:91 - Set ttnn.CONFIG.report_name to tests/ttnn/integration_tests/resnet/test_ttnn_functional_resnet50.py::test_resnet_50[pretrained_weight_false-batch_size=32-act_dtype=DataType.BFLOAT8_B-weight_dtype=DataType.BFLOAT8_B-math_fidelity=MathFidelity.LoFi-device_params={'l1_small_size': 24576}]: 2025-02-26 09:51:30 (UTC)
                 Device | INFO     | Opening user mode device driver
  Detecting chips (found 1)
2025-02-26 09:51:30.387 | INFO     | SiliconDriver   - Opened PCI device 0; KMD version: 1.29.0, IOMMU: disabled
2025-02-26 09:51:30.493 | INFO     | SiliconDriver   - Detected PCI devices: [0]
2025-02-26 09:51:30.493 | INFO     | SiliconDriver   - Using local chip ids: {0} and remote chip ids {}
2025-02-26 09:51:31.028 | INFO     | SiliconDriver   - Device: 0 Mapped iATU region 0 from 0x0 to 0x3fffffff to 0x200000000
                  Metal | INFO     | Initializing device 0. Program cache is NOT enabled
                 Device | INFO     | For Blackhole hardcode AICLK to 800 MHz due to lack of ARC message support
                  Metal | INFO     | AI CLK for device 0 is:   800 MHz
                 Always | FATAL    | Event Order Issue: expected to read back completion signal for event 4 but got 640258074!
libc++abi: terminating due to uncaught exception of type std::runtime_error: TT_ASSERT @ /proj_sw/user_dev/amahmud/tt-metal/tt_metal/impl/dispatch/hardware_command_queue.cpp:648: event_completed == read_descriptor.event_id
info:
Event Order Issue: expected to read back completion signal for event 4 but got 640258074!
backtrace:
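For readers unfamiliar with that assert: the completion queue hands back an event ID per read descriptor, and the check at hardware_command_queue.cpp:648 requires it to match the ID expected next. A self-contained illustration of that invariant (not the tt-metal implementation):

#include <cassert>
#include <cstdint>
#include <queue>

struct ReadDescriptor {
    uint32_t event_id;  // ID we expect the device to complete next
};

// Each completion read back from the device must carry the event ID of the
// oldest outstanding descriptor; a mismatch (e.g. 640258074 instead of 4)
// is exactly the "Event Order Issue" reported above.
void on_completion_read(std::queue<ReadDescriptor>& outstanding, uint32_t event_completed) {
    assert(!outstanding.empty());
    assert(event_completed == outstanding.front().event_id && "Event Order Issue");
    outstanding.pop();
}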

@ncvetkovicTT
Contributor

> I tested this issue with those changes and, like that issue, I see a similar assert failure: [log quoted above]

Interesting, I get a clear pass in that case:

13.75s call     tests/ttnn/integration_tests/resnet/test_ttnn_functional_resnet50.py::test_resnet_50[pretrained_weight_false-batch_size=32-act_dtype=DataType.BFLOAT8_B-weight_dtype=DataType.BFLOAT8_B-math_fidelity=MathFidelity.LoFi-device_params={'l1_small_size': 24576}]
1.13s setup    tests/ttnn/integration_tests/resnet/test_ttnn_functional_resnet50.py::test_resnet_50[pretrained_weight_false-batch_size=32-act_dtype=DataType.BFLOAT8_B-weight_dtype=DataType.BFLOAT8_B-math_fidelity=MathFidelity.LoFi-device_params={'l1_small_size': 24576}]
0.00s teardown tests/ttnn/integration_tests/resnet/test_ttnn_functional_resnet50.py::test_resnet_50[pretrained_weight_false-batch_size=32-act_dtype=DataType.BFLOAT8_B-weight_dtype=DataType.BFLOAT8_B-math_fidelity=MathFidelity.LoFi-device_params={'l1_small_size': 24576}]
============================================================================================================================================================ short test summary info =============================================================================================================================================================
PASSED tests/ttnn/integration_tests/resnet/test_ttnn_functional_resnet50.py::test_resnet_50[pretrained_weight_false-batch_size=32-act_dtype=DataType.BFLOAT8_B-weight_dtype=DataType.BFLOAT8_B-math_fidelity=MathFidelity.LoFi-device_params={'l1_small_size': 24576}]
========================================================================================================================================================= 1 passed, 1 warning in 15.30s ==========================================================================================================================================================
                 Device | INFO     | Closing user mode device drivers

Maybe it's something card-related? Let me repeat the steps that I do just for clarity:

  1. I comment out the nop loop in ttnn/cpp/ttnn/operations/conv/conv2d/device/kernels/conv_bmm_tilize_col_major_out_blocks.cpp, lines 274-276
  2. I add compiler flags to build.cpp so that line 330 is this->cflags_ = env_.cflags_ + "-O3 " + " -fno-rvtt-sfpu-replay ";
  3. Add disable_gathering(); at the start of trisc.cc main.
  4. Comment out disable_gathering(); and enable_gathering(); calls in load_replay_buf functions, although that should be irrelevant now since SFPI doesn't rely on them after compiling with the mentioned flags.
  5. Run pytest "tests/ttnn/integration_tests/resnet/test_ttnn_functional_resnet50.py::test_resnet_50[pretrained_weight_false-batch_size=32-act_dtype=DataType.BFLOAT8_B-weight_dtype=DataType.BFLOAT8_B-math_fidelity=MathFidelity.LoFi-device_params={'l1_small_size': 24576}]"
  6. Get a pass 🤔

ncvetkovicTT added a commit that referenced this issue Feb 27, 2025
### Ticket
#18064 #18065 #17099 #16673

### Problem description
Disabling the instruction cache does not take effect in time, so a subsequent TTI instruction can still end up in the cache. To properly disable the I$, branch prediction must be disabled first. In addition, since reprogramming the REPLAY buffers has to happen while the cache is disabled, the SFPI compiler cannot rely on REPLAY buffers. Together, these issues introduce multiple matmul hangs.

### What's changed
- Guard the CSRRS by disabling branch prediction in the BH ckernel.h (see the sketch below)
- Add a compiler flag for BH so that the SFPI compiler does not use REPLAY buffers
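For the first bullet, a rough sketch of the guarded sequence (the helper names, the CSR address, and the bit position below are placeholders, not the actual BH ckernel.h code):

#include <cstdint>

// Hypothetical helpers; in the real header these toggle dedicated CSRs.
inline void disable_branch_prediction() { /* CSR write elided in this sketch */ }
inline void enable_branch_prediction()  { /* CSR write elided in this sketch */ }

inline void disable_icache_guarded() {
    // Branch prediction must be off first, so no speculative fetch can pull
    // the next (TTI) instructions into the I$ while it is being disabled.
    disable_branch_prediction();
    uint32_t disable_bit = 1;                 // placeholder bit position
    asm volatile("csrrs zero, 0x7c0, %0"      // 0x7c0 is a placeholder CSR address
                 :: "r"(disable_bit));
    enable_branch_prediction();
}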

### Checklist
- [x] [All post commit](https://github.com/tenstorrent/tt-metal/actions/workflows/all-post-commit-workflows.yaml) CI passes - [26473](https://github.com/tenstorrent/tt-metal/actions/runs/13569121473/job/37929720035) - expected to fail
- [x] [Blackhole Post commit](https://github.com/tenstorrent/tt-metal/actions/workflows/blackhole-post-commit.yaml) CI passes (if applicable) - [3768](https://github.com/tenstorrent/tt-metal/actions/runs/13550312504)
- [x] [Model regression](https://github.com/tenstorrent/tt-metal/actions/workflows/perf-models.yaml) CI passes (if applicable) - [7715](https://github.com/tenstorrent/tt-metal/actions/runs/13550316991)
- [x] [Device performance regression](https://github.com/tenstorrent/tt-metal/actions/workflows/perf-device-models.yaml) CI passes (if applicable) - [5094](https://github.com/tenstorrent/tt-metal/actions/runs/13567170826), expected to fail
- [ ] **(For models and ops writers)** Full [new models tests](https://github.com/tenstorrent/tt-metal/actions/workflows/full-new-models-suite.yaml) CI passes (if applicable)
- [ ] New/Existing tests provide coverage for changes
@ncvetkovicTT
Contributor

@cmaryanTT @mywoodstock I guess we can close this now?

@mywoodstock
Contributor

> @cmaryanTT @mywoodstock I guess we can close this now?

Yes, let me confirm with the latest main. Will report back soon.

@ncvetkovicTT
Contributor

Thanks!

@mywoodstock
Contributor

mywoodstock commented Feb 27, 2025

OK, removing the matmul block workarounds from both the matmul and conv kernels looks good! All three cases with batch=16, 20, and 32 work without hangs.
Here is the latest PR with these updates: #17985
