BH MMul Hang Parent Case #18064

Open
cmaryanTT opened this issue Feb 20, 2025 · 13 comments

cmaryanTT commented Feb 20, 2025

Using this issue to organize the different manifestations of the MMul hang on BH:

MMul in isolation (main case): #16439
MMul in Conv2D, in Resnet: #18065
MMul in SDPA, in LLama: #16673 (working with semaphore LLK fix)

  • SDPA MMul LLK investigation subcase: #18094
cmaryanTT added the LLK, blackhole, op_cat: mm, and bug (Something isn't working) labels on Feb 20, 2025
ncvetkovicTT added a commit that referenced this issue Feb 20, 2025
* Addresses issue #18064 (more specifically #16439)
ncvetkovicTT pushed a commit that referenced this issue Feb 21, 2025
- Removed skip mark so test can run on branch
- Updated grid size to 13x10
- Scaled size of inputs to match grid size
- Added prints before/after ttnn.synchronize_device call

#0: hanging case: 13x10 grid, removed PACK(()), have matmul_block

#0: single core test hang, with 1x1 subblock, no reader/writer, no pack_tile

#0: Buildable, fix compute kernel arguments

#0: Double kernel, proper size, doesn't hang on WH

#0: Match the arguments from py and cpp

#0: Combined cpp repro that introduces hang

#0: Added comments to the file
* Addresses issue #18064 (more specifically #16439)
@ncvetkovicTT

On https://github.com/tenstorrent/tt-metal/tree/ncvetkovic/bh_hang, instruction gathering is partially disabled; please see the commit message for how to disable it completely. This fixes the single core hang #16439, and possibly ResNet50 (#18065), but it doesn't help with multiple cores, including SDPA (#16673 / #18094).

@cmaryanTT

The most recent experiment disables both instruction gathering and branch prediction. Some improvement, but some test cases still fail. Requesting @mywoodstock to test ResNet with @ncvetkovicTT's latest branch.

ncvetkovicTT commented Feb 24, 2025

@mywoodstock the branch is coming soon, after I redo the rebase.

@ncvetkovicTT

After I rebased, SDPA started failing with PCC errors but no hangs, so I gave up on that rebase. Instead, I'll continue working on the branch ncvetkovic/bh_hang, which is somewhat behind the latest main.

@mywoodstock please apply the changes from the branch tt-llk/ncvetkovic/bh_hang to whichever tt-metal branch has the ResNet50 test and tell us whether it hangs.

@ncvetkovicTT

@cmaryanTT @ttmtrajkovic @cglagovichTT @vmilicevicTT @mywoodstock @pgkeller Take a look at the following table for an overview of the workarounds that were attempted and their results. You can add suggestions and comments either in the sheet or here/via DM.

https://docs.google.com/spreadsheets/d/1i0bfPhjZcUI1ce_CTE44a-VMnvquQjY-kRPAb5oMXow/edit?gid=0#gid=0

@cmaryanTT

Note issue #18250: there may be a new cause of MMul hangs if you're using a recent build.

ncvetkovicTT commented Feb 25, 2025

I had to rerun the SDPA tests to confirm the behavior. The multi core variant doesn't hang when CSR writes are protected; the single core variant behaves like the single core test_benchmark.py and C++ repros. For now it seems that the multi core variant of test_benchmark.py hangs with all attempted workarounds. I cannot reproduce the ResNet50 hang on yyzo-bh-15.

ncvetkovicTT commented Feb 26, 2025

All issues except the original variant of #16439 can be solved with fix #10 from the table. To understand why disabling the cache at the start of trisc.cc doesn't help with multi core SDPA, I am trying to reproduce the hang on the smallest core grid possible, so that simulation stays tractable. My current findings show that reducing the core grid below 16 cores removes the hang in test_sdpa_decode_paged_attention.

I propose that workaround attempt no. 10 becomes the workaround, or the solution for now, for all the other issues. The issues are caused by a HW bug which requires that CSR writes be guarded by either nops, fences, or, better still, a branch prediction disable/enable pair.
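
For illustration, here is a minimal sketch of that guard pattern, assuming that CSR 0x7c0 is the relevant config register and that bit 1 controls branch prediction while bit 18 controls instruction gathering (both taken from the workaround code shared below, not verified here). This is only the shape of fix no. 10, not the actual tt-metal change:

#include <cstdint>

// Hedged sketch: guard a write to the (assumed) config CSR 0x7c0 by turning
// branch prediction off around it, with fences ordering the write against
// later fetches. RISC-V only; illustrative, not the ckernel.h implementation.
static inline void guarded_csr_set(uint32_t mask) {
    asm volatile(
        "li    t1, 0x2          \n"  // bit 1: branch prediction disable (assumed)
        "csrrs zero, 0x7c0, t1  \n"  // turn branch prediction off first
        "fence                  \n"
        "csrrs zero, 0x7c0, %0  \n"  // the CSR write we actually want to protect
        "li    t1, 0x2          \n"
        "csrrc zero, 0x7c0, t1  \n"  // turn branch prediction back on
        "fence                  \n"
        :
        : "r"(mask)
        : "t1");
}

Calling guarded_csr_set(0x1u << 18), for example, would set the gathering-disable bit with the guard in place; the actual workaround below folds this guard directly into disable_gathering().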

@cmaryanTT

@ncvetkovicTT @cglagovichTT - have you confirmed whether the hang in SDPA is the same after the no. 10 workaround?

I agree about committing the workaround for now to allow other work to proceed.

@ncvetkovicTT

@cmaryanTT Fix no. 10 works on SDPA; fixes 11-16 don't, and I am trying to understand why.

@ncvetkovicTT

@cmaryanTT @pgkeller @mywoodstock @nathan-TT @cglagovichTT Let me know if you need a separate branch for the fix. Essentially, the workaround is to change disable_gathering in ckernel.h to:

template <bool add_nops = true>
inline void disable_gathering() {
    // Disable gathering: set bit 18 of CSR 0x7c0, with the write guarded by a
    // branch prediction disable/enable pair (bit 1) and fences.
    asm(R"ASM(
        .option push
        li   t1, 0x2
        csrrs zero, 0x7c0, t1    # disable branch prediction (set bit 1)
        li   t1, 0x1
        slli t1, t1, 18
        fence
        csrrs zero, 0x7c0, t1    # disable gathering (set bit 18)
        li   t1, 0x2
        csrrc zero, 0x7c0, t1    # re-enable branch prediction (clear bit 1)
        fence
        .option pop
        )ASM"
        ::: "t1");

    // Gathering is done early in the pipeline, so we need to make sure
    // the CSR writes above get processed before the load-replay instructions.
    if constexpr (add_nops) {
        TTI_NOP;
        TTI_NOP;
        TTI_NOP;
    }
}

and to add the SFPI compiler flag -fno-rvtt-sfpu-replay.
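
For context, a rough sketch of how that flag might be gated to Blackhole builds; the helper and parameter names here are hypothetical and not taken from the tt-metal build system:

#include <string>

// Hypothetical helper: append the SFPI flag only for Blackhole kernel builds,
// so the compiler stops emitting sequences that rely on REPLAY buffers.
inline void append_blackhole_sfpi_flags(std::string& kernel_compile_flags, bool is_blackhole) {
    if (is_blackhole) {
        kernel_compile_flags += " -fno-rvtt-sfpu-replay";
    }
}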

ncvetkovicTT added a commit that referenced this issue Feb 26, 2025
- Guard CSRRS by disabling branch prediction
- Add a compiler flag which prevents the SFPI compiler from using replay buffers
ncvetkovicTT added a commit that referenced this issue Feb 27, 2025
ncvetkovicTT added a commit to tenstorrent/tt-llk that referenced this issue Feb 27, 2025
Due to a HW bug in CSR writes, we need to guard them by first disabling
branch prediction, and then re-enabling it. This is a workaround for
tenstorrent/tt-metal#18064 (comment)
and is proposed as a fix by the HW team.
ncvetkovicTT commented Feb 27, 2025

@ejouretTT @vmilicevicTT @cmaryanTT @mywoodstock @pgkeller @cglagovichTT PR #18359 should resolve all issues except the multi core case of #16439, which will be addressed as soon as this PR is merged. Please rebase onto the latest main once the PR is merged. I need to figure out why EventSync is failing first; any input is welcome.

ncvetkovicTT added a commit that referenced this issue Feb 27, 2025
### Ticket
#18064 #18065 #17099 #16673

### Problem description
Disabling the instruction cache doesn't take effect in time to keep
subsequent TTI instructions out of the cache. In order to properly disable I$,
we need to disable branch prediction first. Since reprogramming the
REPLAY buffers needs to happen while the cache is disabled, the SFPI compiler
cannot rely on REPLAY buffers. Together, these issues introduce multiple
matmul hangs.

### What's changed
- Guard CSRRS by disabling branch prediction in BH ckernel.h
- Add a compiler flag for BH which prevents the SFPI compiler from using
replay buffers

### Checklist
- [x] [All post
commit](https://github.com/tenstorrent/tt-metal/actions/workflows/all-post-commit-workflows.yaml)
CI passes -
[26473](https://github.com/tenstorrent/tt-metal/actions/runs/13569121473/job/37929720035)
- expected to fail
- [x] [Blackhole Post
commit](https://github.com/tenstorrent/tt-metal/actions/workflows/blackhole-post-commit.yaml)
CI passes (if applicable) -
[3768](https://github.com/tenstorrent/tt-metal/actions/runs/13550312504)
- [x] [Model
regression](https://github.com/tenstorrent/tt-metal/actions/workflows/perf-models.yaml)
CI passes (if applicable) -
[7715](https://github.com/tenstorrent/tt-metal/actions/runs/13550316991)
- [x] [Device performance
regression](https://github.com/tenstorrent/tt-metal/actions/workflows/perf-device-models.yaml)
CI passes (if applicable) -
[5094](https://github.com/tenstorrent/tt-metal/actions/runs/13567170826),
expected to fail
- [ ] **(For models and ops writers)** Full [new models
tests](https://github.com/tenstorrent/tt-metal/actions/workflows/full-new-models-suite.yaml)
CI passes (if applicable)
- [ ] New/Existing tests provide coverage for changes
@ncvetkovicTT

To clarify: this workaround should unblock everyone on all related issues from the description. I suggest keeping #16439 open until we understand why the full-grid test fails. Also, keep #18094 open as a tracker until we understand the effect of clock gating + instruction cache on this test.

We can close this issue.
