BH MMul Hang Parent Case #18064

Open
cmaryanTT opened this issue Feb 20, 2025 · 13 comments

cmaryanTT commented Feb 20, 2025

Using this issue to organize the different manifestations of the MMul hang on BH:

MMul in isolation (main case): #16439
MMul in Conv2D, in Resnet: #18065
MMul in SDPA, in LLama: #16673 (working with semaphore LLK fix)

  • SDPA MMul LLK investigation subcase: #18094
cmaryanTT added the LLK, blackhole, op_cat: mm, and bug (Something isn't working) labels on Feb 20, 2025
ncvetkovicTT added a commit that referenced this issue Feb 20, 2025
* Addresses issue #18064 (more specifically #16439)
ncvetkovicTT pushed a commit that referenced this issue Feb 21, 2025
- Removed skip mark so test can run on branch
- Updated grid size to 13x10
- Scaled size of inputs to match grid size
- Added prints before/after ttnn.synchronize_device call

#0: hanging case: 13x10 grid, removed PACK(()), have matmul_block

#0: single core test hang, with 1x1 subblock, no reader/writer, no pack_tile

#0: Buildable, fix compute kernel arguments

#0: Double kernel, proper size, doesn't hang on WH

#0: Match the arguments from py and cpp

#0: Combined cpp repro that introduces hang

#0: Added comments to the file
* Addresses issue #18064 (more specifically #16439)
@ncvetkovicTT

On https://github.com/tenstorrent/tt-metal/tree/ncvetkovic/bh_hang, instruction gathering is partially disabled; please see the commit message for how to disable it completely. This fixes the single core hang #16439, and possibly ResNet50 (#18065), but it doesn't help with multiple cores, including SDPA (#16673 / #18094).

@cmaryanTT

The most recent experiment disables both instruction gathering and branch prediction. Some improvement, but some test cases still fail. Requesting @mywoodstock to test ResNet with @ncvetkovicTT's latest branch.

ncvetkovicTT commented Feb 24, 2025

@mywoodstock the branch is coming soon, after I redo the rebase.

@ncvetkovicTT

After I rebased, SDPA started failing with PCC errors but no hangs, so I gave up on that rebase. Instead, I'll continue working on the branch ncvetkovic/bh_hang, which is somewhat behind the latest main.

@mywoodstock please apply the changes from the branch tt-llk/ncvetkovic/bh_hang to whichever tt-metal branch has the ResNet50 test and tell us whether it hangs.

@ncvetkovicTT

@cmaryanTT @ttmtrajkovic @cglagovichTT @vmilicevicTT @mywoodstock @pgkeller Take a look at the following table for an overview of the workarounds that were attempted and their results. You can add suggestions and comments either in the sheet or here/via DM.

https://docs.google.com/spreadsheets/d/1i0bfPhjZcUI1ce_CTE44a-VMnvquQjY-kRPAb5oMXow/edit?gid=0#gid=0

@cmaryanTT

Note issue #18250: there may be a new cause of MMul hangs if you're using a recent build.

ncvetkovicTT commented Feb 25, 2025

I had to rerun the SDPA tests to confirm the behavior. The multi core variant doesn't hang when CSR writes are protected; the single core variant behaves like the single core test_benchmark.py and C++ repros. For now it seems that the multi core variant of test_benchmark.py hangs with all attempted workarounds. I cannot reproduce the ResNet50 hang on yyzo-bh-15.

ncvetkovicTT commented Feb 26, 2025

All issues except the original variant of #16439 can be solved with fix #10 from the table. To understand why disabling the cache at the start of trisc.cc doesn't help with multi core SDPA, I am trying to reproduce the hang on the smallest core grid possible, so that simulation stays tractable. My current findings show that reducing the core grid below 16 cores removes the hang in test_sdpa_decode_paged_attention.

I propose that workaround attempt no. 10 becomes the workaround, or the solution for now, for all the other issues. The issues are caused by a HW bug which requires that CSR writes be guarded by either nops, fences, or, better still, a branch prediction disable/enable pair.
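
For illustration, here is a minimal sketch of that guard pattern, assuming that CSR 0x7c0 is the relevant config register and that bit 1 controls branch prediction while bit 18 controls instruction gathering (both taken from the workaround code shared below, not verified here). This is only the shape of fix no. 10, not the actual tt-metal change:

#include <cstdint>

// Hedged sketch: guard a write to the (assumed) config CSR 0x7c0 by turning
// branch prediction off around it, with fences ordering the write against
// later fetches. RISC-V only; illustrative, not the ckernel.h implementation.
static inline void guarded_csr_set(uint32_t mask) {
    asm volatile(
        "li    t1, 0x2          \n"  // bit 1: branch prediction disable (assumed)
        "csrrs zero, 0x7c0, t1  \n"  // turn branch prediction off first
        "fence                  \n"
        "csrrs zero, 0x7c0, %0  \n"  // the CSR write we actually want to protect
        "li    t1, 0x2          \n"
        "csrrc zero, 0x7c0, t1  \n"  // turn branch prediction back on
        "fence                  \n"
        :
        : "r"(mask)
        : "t1");
}

Calling guarded_csr_set(0x1u << 18), for example, would set the gathering-disable bit with the guard in place; the actual workaround below folds this guard directly into disable_gathering().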

@cmaryanTT

@ncvetkovicTT @cglagovichTT - have you confirmed whether the hang in SDPA is the same after the no. 10 workaround?

I agree about committing the workaround for now to allow other work to proceed.

@ncvetkovicTT

@cmaryanTT Fix no. 10 works on SDPA; fixes 11-16 don't, and I am trying to understand why.

@ncvetkovicTT

@cmaryanTT @pgkeller @mywoodstock @nathan-TT @cglagovichTT Let me know if you need a separate branch for the fix. Essentially, the workaround is to change disable_gathering in ckernel.h to:

template <bool add_nops = true>
inline void disable_gathering() {
    // Disable gathering: set bit 18 of CSR 0x7c0, with the write guarded by a
    // branch prediction disable/enable pair (bit 1) and fences.
    asm(R"ASM(
        .option push
        li   t1, 0x2
        csrrs zero, 0x7c0, t1    # disable branch prediction (set bit 1)
        li   t1, 0x1
        slli t1, t1, 18
        fence
        csrrs zero, 0x7c0, t1    # disable gathering (set bit 18)
        li   t1, 0x2
        csrrc zero, 0x7c0, t1    # re-enable branch prediction (clear bit 1)
        fence
        .option pop
        )ASM"
        ::: "t1");

    // Gathering is done early in the pipeline, so we need to make sure
    // the CSR writes above get processed before the load-replay instructions.
    if constexpr (add_nops) {
        TTI_NOP;
        TTI_NOP;
        TTI_NOP;
    }
}

and to add the SFPI compiler flag -fno-rvtt-sfpu-replay.
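
For context, a rough sketch of how that flag might be gated to Blackhole builds; the helper and parameter names here are hypothetical and not taken from the tt-metal build system:

#include <string>

// Hypothetical helper: append the SFPI flag only for Blackhole kernel builds,
// so the compiler stops emitting sequences that rely on REPLAY buffers.
inline void append_blackhole_sfpi_flags(std::string& kernel_compile_flags, bool is_blackhole) {
    if (is_blackhole) {
        kernel_compile_flags += " -fno-rvtt-sfpu-replay";
    }
}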

ncvetkovicTT added a commit that referenced this issue Feb 26, 2025
- Guard CSRRS by disabling branch prediction
- Add a compiler flag which prevents the SFPI compiler from using replay buffers
ncvetkovicTT added a commit that referenced this issue Feb 27, 2025
ncvetkovicTT added a commit to tenstorrent/tt-llk that referenced this issue Feb 27, 2025
Due to a HW bug in CSR writes, we need to guard them by first disabling
branch prediction, and then re-enabling it. This is a workaround for
tenstorrent/tt-metal#18064 (comment)
and is proposed as a fix by the HW team.
ncvetkovicTT commented Feb 27, 2025

@ejouretTT @vmilicevicTT @cmaryanTT @mywoodstock @pgkeller @cglagovichTT PR #18359 should resolve all issues except the multi core case of #16439, which will be addressed as soon as this PR is merged. Please rebase onto the latest main once the PR is merged. I need to figure out why EventSync is failing first; any input is welcome.

ncvetkovicTT added a commit that referenced this issue Feb 27, 2025
### Ticket
#18064 #18065 #17099 #16673

### Problem description
Disabling the instruction cache doesn't take effect in time to keep
subsequent TTI instructions out of the cache. In order to properly disable I$,
we need to disable branch prediction first. Since reprogramming the
REPLAY buffers needs to happen while the cache is disabled, the SFPI compiler
cannot rely on REPLAY buffers. Together, these issues introduce multiple
matmul hangs.

### What's changed
- Guard CSRRS by disabling branch prediction in BH ckernel.h
- Add a compiler flag for BH which prevents the SFPI compiler from using
replay buffers

### Checklist
- [x] [All post
commit](https://github.com/tenstorrent/tt-metal/actions/workflows/all-post-commit-workflows.yaml)
CI passes -
[26473](https://github.com/tenstorrent/tt-metal/actions/runs/13569121473/job/37929720035)
- expected to fail
- [x] [Blackhole Post
commit](https://github.com/tenstorrent/tt-metal/actions/workflows/blackhole-post-commit.yaml)
CI passes (if applicable) -
[3768](https://github.com/tenstorrent/tt-metal/actions/runs/13550312504)
- [x] [Model
regression](https://github.com/tenstorrent/tt-metal/actions/workflows/perf-models.yaml)
CI passes (if applicable) -
[7715](https://github.com/tenstorrent/tt-metal/actions/runs/13550316991)
- [x] [Device performance
regression](https://github.com/tenstorrent/tt-metal/actions/workflows/perf-device-models.yaml)
CI passes (if applicable) -
[5094](https://github.com/tenstorrent/tt-metal/actions/runs/13567170826),
expected to fail
- [ ] **(For models and ops writers)** Full [new models
tests](https://github.com/tenstorrent/tt-metal/actions/workflows/full-new-models-suite.yaml)
CI passes (if applicable)
- [ ] New/Existing tests provide coverage for changes
@ncvetkovicTT

To clarify: this workaround should unblock everyone on all related issues from the description. I suggest keeping #16439 open until we understand why the full-grid test fails. Also, keep #18094 open as a tracker until we understand the effect of clock gating + instruction cache on this test.

We can close this issue.
