BH MMul Hang in Resnet #18065

Closed
cmaryanTT opened this issue Feb 20, 2025 · 17 comments

@cmaryanTT

cmaryanTT commented Feb 20, 2025

@ncvetkovicTT With the matmul_block workaround removed and your workaround commit added, the model still hangs.

I am using this branch: asarje/rn50-bh-largebatch
Test: pytest "tests/ttnn/integration_tests/resnet/test_ttnn_functional_resnet50.py::test_resnet_50[pretrained_weight_false-batch_size=32-act_dtype=DataType.BFLOAT8_B-weight_dtype=DataType.BFLOAT8_B-math_fidelity=MathFidelity.LoFi-device_params={'l1_small_size': 24576}]"

This branch currently has the matmul workaround skipped in the matmul kernel, but not in the conv kernel. The conv kernel is:
ttnn/cpp/ttnn/operations/conv/conv2d/device/kernels/conv_bmm_tilize_col_major_out_blocks.cpp Line 271 (https://github.com/tenstorrent/tt-metal/blob/asarje/rn50-bh-largebatch/ttnn/cpp/ttnn/operations/conv/conv2d/device/kernels/conv_bmm_tilize_col_major_out_blocks.cpp#L271-L277)

Originally posted by @mywoodstock in #16439

This instance succeeds when the workaround is applied, but the workaround is not confirmed to address the root cause.
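For reference, the conv-kernel workaround referred to above is a short nop loop guarding the matmul block; a minimal illustrative sketch (the iteration count and the ARCH_BLACKHOLE guard are assumptions here, see the linked lines 271-277 for the actual code):

#ifdef ARCH_BLACKHOLE
    // Workaround sketch: burn a few cycles with no-ops before the matmul block
    // call; loop bound and placement are guesses, not the real kernel code.
    for (uint32_t i = 0; i < 10; i++) {
        asm volatile("nop");
    }
#endif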

@vmilicevicTT

@mywoodstock, can you please help create a minimal repro for this test? Assigning the issue to you.
@cmaryanTT, could you please prioritize this?

@mywoodstock
Contributor

@vmilicevicTT I have tried to create smaller tests but haven't been able to repro the issue. It only happens in the full model run.

@cmaryanTT added the "bug" (Something isn't working) label on Feb 20, 2025
@cmaryanTT
Author

This hasn't reproduced in isolation. The next attempt will be for @mywoodstock to create a test that repeatedly loops over the problematic section of the model. I think there was some success with this sort of thing in #16439.

@cmaryanTT
Author

Note issue #18250 - there may be a new cause of MMul hangs if you're using a recent build.

@amahmudTT
Contributor

amahmudTT commented Feb 25, 2025

I tried disabling instruction gathering with:

--- a/tt_metal/hw/firmware/src/trisc.cc
+++ b/tt_metal/hw/firmware/src/trisc.cc
@@ -76,6 +76,7 @@ constexpr bool cb_init_write = false;
 using namespace ckernel;

 int main(int argc, char *argv[]) {
+    disable_gathering();
     configure_l1_data_cache();

It still hung; then, on top of that, I added the compiler option to make SFPI not use the REPLAY buffers:

--- a/tt_metal/jit_build/build.cpp
+++ b/tt_metal/jit_build/build.cpp
@@ -327,7 +327,7 @@ JitBuildCompute::JitBuildCompute(const JitBuildEnv& env, const JitBuiltStateConf

     this->out_path_ = this->is_fw_ ? env_.out_firmware_root_ : env_.out_kernel_root_;

-    this->cflags_ = env_.cflags_ + "-O3 ";
+    this->cflags_ = env_.cflags_ + "-O3 " + " -fno-rvtt-sfpu-replay ";

But the hang persisted.

Commenting out the disable_gathering and enable_gathering calls did not help either.

@ncvetkovicTT
Contributor

> I tried disabling instruction gathering with: [...] But the hang persisted.

What if you call disable_gathering() in trisc.cc and then comment out the two calls to disable/enable gathering in load_replay_buf?
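Concretely, the second half of that suggestion means both definitions of load_replay_buf would no longer toggle gathering themselves; a sketch (the signature below is a placeholder, not the actual BH ckernel header; only the commented-out calls are the point):

#include <cstdint>

inline void load_replay_buf(uint32_t start_idx, uint32_t len) {
    // disable_gathering();   // commented out: gathering is already off, done once in trisc.cc main()
    // ... program `len` REPLAY buffer entries starting at `start_idx` ...
    // enable_gathering();    // commented out for the experiment
}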

@amahmudTT
Contributor

> What if you call disable_gathering() in trisc.cc and then comment out the two calls to disable/enable gathering in load_replay_buf?

That did not work either.

@ncvetkovicTT
Contributor

ncvetkovicTT commented Feb 25, 2025

I cannot reproduce a hang on yyzo-bh-15. I checked out the latest asarje/rn50-bh-largebatch, did a fresh build, and ran pytest "tests/ttnn/integration_tests/resnet/test_ttnn_functional_resnet50.py::test_resnet_50[pretrained_weight_false-batch_size=32-act_dtype=DataType.BFLOAT8_B-weight_dtype=DataType.BFLOAT8_B-math_fidelity=MathFidelity.LoFi-device_params={'l1_small_size': 24576}]", but the test passes quickly.

EDIT: I didn't comment out the workaround in ttnn/cpp/ttnn/operations/conv/conv2d/device/kernels/conv_bmm_tilize_col_major_out_blocks.cpp

@ncvetkovicTT
Contributor

ncvetkovicTT commented Feb 25, 2025

@amahmudTT Can you confirm that what you did is the following:

  1. Go to Abhinav's branch from above
  2. Comment out the workaround (nop loop) in ttnn/cpp/ttnn/operations/conv/conv2d/device/kernels/conv_bmm_tilize_col_major_out_blocks.cpp
  3. Put disable_gathering() in trisc.cc
  4. Comment out disable_gathering() and enable_gathering() inside both definitions of load_replay_buf
  5. Run pytest "tests/ttnn/integration_tests/resnet/test_ttnn_functional_resnet50.py::test_resnet_50[pretrained_weight_false-batch_size=32-act_dtype=DataType.BFLOAT8_B-weight_dtype=DataType.BFLOAT8_B-math_fidelity=MathFidelity.LoFi-device_params={'l1_small_size': 24576}]"
  6. Observe the hang?

The reason I ask is that I get a pass in this case on yyzo-bh-15, although I do get a hang for some other cases. See the table for details:
https://docs.google.com/spreadsheets/d/1i0bfPhjZcUI1ce_CTE44a-VMnvquQjY-kRPAb5oMXow/edit?gid=0#gid=0

@amahmudTT
Contributor

> 2. Comment out the workaround (nop loop) in ttnn/cpp/ttnn/operations/conv/conv2d/device/kernels/conv_bmm_tilize_col_major_out_blocks.cpp

I did not perform the above step.

@amahmudTT
Contributor

amahmudTT commented Feb 26, 2025

> @amahmudTT Can you confirm that what you did is the following: [steps 1-6 above]?

Okay, here is what I realized: once I encounter a hang, I keep encountering it regardless of which of the above steps I apply, as long as I do not restart the card. If I do restart, then following the outlined steps does not produce a hang.

@amahmudTT
Contributor

Issue #17099 gets fixed only when both the SFPI compiler option is changed and the disable_gathering() related changes are made; one alone does not suffice. I tested this issue with those changes and, like that issue, I see a similar assert-related failure:

tests/ttnn/integration_tests/resnet/test_ttnn_functional_resnet50.py::test_resnet_50[pretrained_weight_false-batch_size=32-act_dtype=DataType.BFLOAT8_B-weight_dtype=DataType.BFLOAT8_B-math_fidelity=MathFidelity.LoFi-device_params={'l1_small_size': 24576}] 2025-02-26 09:51:30.358 | DEBUG    | ttnn:manage_config:91 - Set ttnn.CONFIG.report_name to tests/ttnn/integration_tests/resnet/test_ttnn_functional_resnet50.py::test_resnet_50[pretrained_weight_false-batch_size=32-act_dtype=DataType.BFLOAT8_B-weight_dtype=DataType.BFLOAT8_B-math_fidelity=MathFidelity.LoFi-device_params={'l1_small_size': 24576}]: 2025-02-26 09:51:30 (UTC)
                 Device | INFO     | Opening user mode device driver
  Detecting chips (found 1)
2025-02-26 09:51:30.387 | INFO     | SiliconDriver   - Opened PCI device 0; KMD version: 1.29.0, IOMMU: disabled
2025-02-26 09:51:30.493 | INFO     | SiliconDriver   - Detected PCI devices: [0]
2025-02-26 09:51:30.493 | INFO     | SiliconDriver   - Using local chip ids: {0} and remote chip ids {}
2025-02-26 09:51:31.028 | INFO     | SiliconDriver   - Device: 0 Mapped iATU region 0 from 0x0 to 0x3fffffff to 0x200000000
                  Metal | INFO     | Initializing device 0. Program cache is NOT enabled
                 Device | INFO     | For Blackhole hardcode AICLK to 800 MHz due to lack of ARC message support
                  Metal | INFO     | AI CLK for device 0 is:   800 MHz
                 Always | FATAL    | Event Order Issue: expected to read back completion signal for event 4 but got 640258074!
libc++abi: terminating due to uncaught exception of type std::runtime_error: TT_ASSERT @ /proj_sw/user_dev/amahmud/tt-metal/tt_metal/impl/dispatch/hardware_command_queue.cpp:648: event_completed == read_descriptor.event_id
info:
Event Order Issue: expected to read back completion signal for event 4 but got 640258074!
backtrace:
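For readers unfamiliar with that assert: the completion queue hands back an event ID per read descriptor, and the check at hardware_command_queue.cpp:648 requires it to match the ID expected next. A self-contained illustration of that invariant (not the tt-metal implementation):

#include <cassert>
#include <cstdint>
#include <queue>

struct ReadDescriptor {
    uint32_t event_id;  // ID we expect the device to complete next
};

// Each completion read back from the device must carry the event ID of the
// oldest outstanding descriptor; a mismatch (e.g. 640258074 instead of 4)
// is exactly the "Event Order Issue" reported above.
void on_completion_read(std::queue<ReadDescriptor>& outstanding, uint32_t event_completed) {
    assert(!outstanding.empty());
    assert(event_completed == outstanding.front().event_id && "Event Order Issue");
    outstanding.pop();
}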

@ncvetkovicTT
Contributor

> I tested this issue with those changes and, like that issue, I see a similar assert failure: [log quoted above]

Interesting, I get a clear pass in that case:

13.75s call     tests/ttnn/integration_tests/resnet/test_ttnn_functional_resnet50.py::test_resnet_50[pretrained_weight_false-batch_size=32-act_dtype=DataType.BFLOAT8_B-weight_dtype=DataType.BFLOAT8_B-math_fidelity=MathFidelity.LoFi-device_params={'l1_small_size': 24576}]
1.13s setup    tests/ttnn/integration_tests/resnet/test_ttnn_functional_resnet50.py::test_resnet_50[pretrained_weight_false-batch_size=32-act_dtype=DataType.BFLOAT8_B-weight_dtype=DataType.BFLOAT8_B-math_fidelity=MathFidelity.LoFi-device_params={'l1_small_size': 24576}]
0.00s teardown tests/ttnn/integration_tests/resnet/test_ttnn_functional_resnet50.py::test_resnet_50[pretrained_weight_false-batch_size=32-act_dtype=DataType.BFLOAT8_B-weight_dtype=DataType.BFLOAT8_B-math_fidelity=MathFidelity.LoFi-device_params={'l1_small_size': 24576}]
============================================================================================================================================================ short test summary info =============================================================================================================================================================
PASSED tests/ttnn/integration_tests/resnet/test_ttnn_functional_resnet50.py::test_resnet_50[pretrained_weight_false-batch_size=32-act_dtype=DataType.BFLOAT8_B-weight_dtype=DataType.BFLOAT8_B-math_fidelity=MathFidelity.LoFi-device_params={'l1_small_size': 24576}]
========================================================================================================================================================= 1 passed, 1 warning in 15.30s ==========================================================================================================================================================
                 Device | INFO     | Closing user mode device drivers

Maybe it's something card-related? Let me repeat the steps that I do just for clarity:

  1. I comment out the nop loop in ttnn/cpp/ttnn/operations/conv/conv2d/device/kernels/conv_bmm_tilize_col_major_out_blocks.cpp, lines 274-276
  2. I add compiler flags to build.cpp so that line 330 is this->cflags_ = env_.cflags_ + "-O3 " + " -fno-rvtt-sfpu-replay ";
  3. Add disable_gathering(); at the start of trisc.cc main.
  4. Comment out disable_gathering(); and enable_gathering(); calls in load_replay_buf functions, although that should be irrelevant now since SFPI doesn't rely on them after compiling with the mentioned flags.
  5. Run pytest "tests/ttnn/integration_tests/resnet/test_ttnn_functional_resnet50.py::test_resnet_50[pretrained_weight_false-batch_size=32-act_dtype=DataType.BFLOAT8_B-weight_dtype=DataType.BFLOAT8_B-math_fidelity=MathFidelity.LoFi-device_params={'l1_small_size': 24576}]"
  6. Get a pass 🤔

ncvetkovicTT added a commit that referenced this issue Feb 27, 2025
### Ticket
#18064 #18065 #17099 #16673

### Problem description
Disabling the instruction cache does not take effect in time, so a subsequent TTI instruction can still end up in the cache. To properly disable the I$, branch prediction must be disabled first. In addition, since reprogramming the REPLAY buffers has to happen while the cache is disabled, the SFPI compiler cannot rely on REPLAY buffers. Together, these issues introduce multiple matmul hangs.

### What's changed
- Guard the CSRRS by disabling branch prediction in the BH ckernel.h (see the sketch below)
- Add a compiler flag for BH so that the SFPI compiler does not use REPLAY buffers
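For the first bullet, a rough sketch of the guarded sequence (the helper names, the CSR address, and the bit position below are placeholders, not the actual BH ckernel.h code):

#include <cstdint>

// Hypothetical helpers; in the real header these toggle dedicated CSRs.
inline void disable_branch_prediction() { /* CSR write elided in this sketch */ }
inline void enable_branch_prediction()  { /* CSR write elided in this sketch */ }

inline void disable_icache_guarded() {
    // Branch prediction must be off first, so no speculative fetch can pull
    // the next (TTI) instructions into the I$ while it is being disabled.
    disable_branch_prediction();
    uint32_t disable_bit = 1;                 // placeholder bit position
    asm volatile("csrrs zero, 0x7c0, %0"      // 0x7c0 is a placeholder CSR address
                 :: "r"(disable_bit));
    enable_branch_prediction();
}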

### Checklist
- [x] [All post commit](https://github.com/tenstorrent/tt-metal/actions/workflows/all-post-commit-workflows.yaml) CI passes - [26473](https://github.com/tenstorrent/tt-metal/actions/runs/13569121473/job/37929720035) - expected to fail
- [x] [Blackhole Post commit](https://github.com/tenstorrent/tt-metal/actions/workflows/blackhole-post-commit.yaml) CI passes (if applicable) - [3768](https://github.com/tenstorrent/tt-metal/actions/runs/13550312504)
- [x] [Model regression](https://github.com/tenstorrent/tt-metal/actions/workflows/perf-models.yaml) CI passes (if applicable) - [7715](https://github.com/tenstorrent/tt-metal/actions/runs/13550316991)
- [x] [Device performance regression](https://github.com/tenstorrent/tt-metal/actions/workflows/perf-device-models.yaml) CI passes (if applicable) - [5094](https://github.com/tenstorrent/tt-metal/actions/runs/13567170826), expected to fail
- [ ] **(For models and ops writers)** Full [new models tests](https://github.com/tenstorrent/tt-metal/actions/workflows/full-new-models-suite.yaml) CI passes (if applicable)
- [ ] New/Existing tests provide coverage for changes
@ncvetkovicTT
Contributor

@cmaryanTT @mywoodstock I guess we can close this now?

@mywoodstock
Contributor

> @cmaryanTT @mywoodstock I guess we can close this now?

Yes, let me confirm with the latest main. Will report back soon.

@ncvetkovicTT
Contributor

Thanks!

@mywoodstock
Contributor

mywoodstock commented Feb 27, 2025

OK, removing the matmul block workarounds from both the matmul and conv kernels looks good! All three cases with batch=16, 20, and 32 work without hangs.
Here is the latest PR with these updates: #17985
