Investigate LLK issues in SDPA instance of BH matmul hang #18094

LLK subcase for the SDPA variant of the BH matmul hang, tracked at a high level in #16673.
The development team is unblocked with the latest workaround, but leaving this issue open to track the root cause.
Note that there are other variants of this already being worked on (#18064). This case is separated for tracking, but may be an identical issue.
FYI @cglagovich

Comments
@cmaryanTT I started looking at the hang today (SDPA). In this issue description you mentioned the latest workaround; can you specify which workaround that is? Looking around, I have seen two: one that does a for loop of 10 NOPs, and another that puts a NOP inside a while loop that checks the value of a semaphore. |
@cglagovichTT could you specify the exact branch and command to reproduce the issue? I am seeing multiple branches and workarounds for now, and would like to start with the most up-to-date or relevant one. |
One workaround I tried was the 10 NOPs before the call. The second thing I tried was Nikola's BH LLK branch, which resolved the hangs for me: #16673 (comment). The second attempt was on a recent main from last week, so it is more up to date. |
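To help tell the two workarounds apart, here is a minimal sketch of their shapes. All names, signatures, and placement are illustrative assumptions, not the actual tt-metal/LLK code:

```cpp
#include <cstdint>

// Illustrative sketch only; not the actual tt-metal workaround code.

// Variant 1: a fixed burst of ten NOPs issued before the call that hangs.
inline void ten_nop_delay() {
    for (int i = 0; i < 10; i++) {
        asm volatile("nop");
    }
}

// Variant 2: spin until a semaphore reaches the expected value, issuing
// a NOP on each poll iteration. The pointer/expected-value interface is
// a hypothetical stand-in for however the kernel reads the semaphore.
inline void nop_while_semaphore_wait(volatile uint32_t* sem, uint32_t expected) {
    while (*sem != expected) {
        asm volatile("nop");
    }
}
```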
repro:
You can set the count to anything you like. If you find a specific test which hangs more frequently, you can comment out the others to isolate it; otherwise it'll take a while to run all tests for N loops. |
@cglagovichTT The workaround on https://github.com/tenstorrent/tt-metal/tree/ncvetkovic/bh_hang doesn't help with SDPA, at least not when run on multiple cores. Is there any way you could create a single- or double-core repro? I see that there's a limitation in the versim simulator, so I would need this smaller repro in order to analyze the traces/waveforms. |
Strange, I ran for dozens of iterations with the LLK commit https://github.com/tenstorrent/tt-llk-bh/tree/nvelickovic/%2316439_matmul_hang applied onto the tt-metal branch. Edit: is this workaround you mention something new? In that case, yes, I'll make a single-core repro. |
Here's a repro on branch
The hang is easiest to repro with watcher enabled and the following options disabled:
watcher log ends up looking like this:
I also tried this single-core test with the following LLK commit and I was not able to repro the hang: https://github.com/tenstorrent/tt-llk-bh/tree/nvelickovic/%2316439_matmul_hang |
@cmaryanTT @vmilicevicTT @cglagovichTT @ttmtrajkovic Every workaround that disables the TRISC cache at the start of the FW code and never enables it again (i.e. without the toggling in REPLAY reprogramming, sketched below) introduces the hang when I use the test from this discussion. The only way we could make this SDPA test pass is if we modify
@cglagovichTT Could you somehow make a single-core version of this test? |
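To make the cache-handling distinction above concrete, here is a rough sketch of the two strategies being compared. `disable_gathering()` and `enable_gathering()` are the ckernel.h functions named later in this thread; their bodies and the REPLAY helper are hypothetical stubs, not the real firmware code:

```cpp
// Rough sketch only; not the actual tt-metal firmware code.
void disable_gathering() { /* turn TRISC instruction gathering off */ }
void enable_gathering()  { /* turn TRISC instruction gathering back on */ }
void program_replay_buffer() { /* hypothetical stand-in for REPLAY setup */ }

// Strategy that reintroduces the hang on this SDPA test: disable the
// TRISC cache once at firmware start and never re-enable it.
void firmware_entry_disable_only() {
    disable_gathering();
    // ... firmware and kernels run with gathering permanently off
}

// The "toggling in REPLAY reprogramming": wrap each REPLAY
// reprogramming in a disable/enable pair.
void reprogram_replay_with_toggle() {
    disable_gathering();
    program_replay_buffer();
    enable_gathering();
}
```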
My comment above provides a single-core hanging test. Is there something different you need? |
I am so sorry, Colman, I got lost in all the parallel conversations and didn't see it. I will take a look at it in the morning. |
Note issue #18250: there may be a new matmul hang cause if you're using a recent build. |
@cglagovichTT I am not sure I am able to reproduce the hang; the watcher log looks like this:
As you can see, it gets out of the CWFW and ends up in the last state where the smsg is all DDDD, so I can assume the hang did not happen, right? The pattern repeats. The thing is, I had many, many dumps with the same pattern, for example:
So is it hanging, but in a different state than yours? |
I have updated the table to reflect the error I made by running multi-core tests instead of single-core ones. I added a column for single-core SDPA, and it behaves like single core. |
The multi-core version of this is failing consistently at a certain spot, so it might help us figure out the issue. Steps to reproduce:
Branch cglagovich/18094, commit 28a97df
Make changes in the third-party ckernel.h to comment out disable_gathering()/enable_gathering(); the failure will still be there. Running with the L1 cache disabled will skip some failures, but this one will still be there. |
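Note that this experiment differs from both the disable-once and toggling strategies sketched earlier: with both functions commented out, the gathering state is never changed at all. A sketch of the shape of that edit follows; the commented-out bodies are placeholders, since the real ckernel.h implementations differ:

```cpp
// Sketch of the ckernel.h experiment described above: both functions
// become no-ops, so the TRISC gathering state is never changed.
// The commented-out bodies are placeholders, not the real code.
inline void disable_gathering() {
    // original body commented out for the experiment
}

inline void enable_gathering() {
    // original body commented out for the experiment
}
```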
@amahmudTT Thanks, that's weird; I got a pass for that case, but I introduced the changes from his branch onto mine instead of using his branch directly. I will use his branch and update the table. |
@amahmudTT Again, I got lost. The table confirms what you get: we can never make this multi-core SDPA test pass when we disable the cache at the start of trisc.cc. That's why I am trying to reduce the number of cores used in that case, so that we can simulate it and see what's going on. |
I will try to look into the assembly for the kernels and see whether any csrrs instruction gets slipped in, and I will try to reproduce the error in isolation. Previous such attempts failed; it seems it needs some more runs before it hangs. |
When I do the following on
i.e. leave only one set of parameters; then all test runs with a number of cores greater than or equal to 16 in that parameter list hang. Whenever I reduce this (grid_size=(5, 3), (7, 2), etc.), the hang goes away by itself. I DPRINTed compile-time args from the kernel. @cglagovichTT How do I need to change the values of |
If you halve the number of cores, you should halve the b value. If b is already 1, halve the nkv value. If nkv is 1, halve the s value. |
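As a sketch, this halving rule reads like the following; the struct and function are illustrative only, with parameter names mirroring the pytest repro:

```cpp
#include <cstdint>

// Sketch of the scaling rule above, using the parameter names from the
// pytest repro (b, nkv, s). Each halving of the core count halves b
// first, then nkv once b reaches 1, then s once nkv reaches 1.
struct SdpaParams {
    uint32_t cores;
    uint32_t b;    // batch
    uint32_t nkv;  // number of KV heads
    uint32_t s;    // sequence length
};

SdpaParams halve_cores(SdpaParams p) {
    p.cores /= 2;
    if (p.b > 1) {
        p.b /= 2;
    } else if (p.nkv > 1) {
        p.nkv /= 2;
    } else {
        p.s /= 2;
    }
    return p;
}
```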
LMK if you want a test case with certain properties. I might be able to help if I can find a BH machine; looks like they're all busy right now. |
@cglagovichTT can you please post a verified-by-you-on-WH known-good set of reduced parameters, to help reduce the trial and error in the LLK team? |
@cglagovichTT Reducing the dimensions as you mentioned didn't help; the process either exits, fails, or doesn't hang. |
@ncvetkovicTT I've verified on WH that each of these configurations passes. These should proportionally scale the 32-core hang repro down for various core grids. I was not able to test on BH yet.

@skip_for_grayskull("Unsupported in GS since L1 runs OOM with most configs")
@pytest.mark.parametrize(
    "kv_dtype, q_dtype",
    [
        [ttnn.bfloat16, ttnn.bfloat16],
    ],
    ids=[
        "all_bfp16",
    ],
)
@pytest.mark.parametrize(
    "b, nh, nkv, s, d, grid_size, cur_pos_tensor",
    (
        [1, 32, 1, 16 * 1024, 128, (1, 1), True],
        [1, 32, 1, 64 * 1024, 128, (4, 1), True],
        [1, 32, 2, 64 * 1024, 128, (8, 1), True],
        [1, 32, 4, 64 * 1024, 128, (8, 2), True],
        [1, 32, 8, 64 * 1024, 128, (8, 4), True],
    ),
)
@pytest.mark.parametrize("block_size", (64, 128), ids=["paged_64", "paged_128"])
def test_sdpa_decode_paged_attention(
    device, b, nh, nkv, s, d, kv_dtype, grid_size, q_dtype, cur_pos_tensor, block_size, use_program_cache
):
    if s == 128 * 1024 and block_size != 64:
        # 128k sequence, block_size 64 tests the sizing of the page table CB
        pytest.skip("Skipping test for seq_len=128k with block_size!=64")
    ttnn.device.DisablePersistentKernelCache()
    run_test_sdpa_decode_paged_attention(
        device,
        b,
        nh,
        nkv,
        s,
        d,
        kv_dtype,
        grid_size,
        q_dtype,
        cur_pos_tensor,
        block_size=block_size,
        sharded_in=True,
        sharded_out=False,
    ) |
Okay, thanks. The smallest grid that I get a hang for is 8x1 in your proposal. |
Is it always the same core which hangs? |
Could you help me determine that? Do I just run watcher, or what? |
Run with watcher enabled; here are the watcher docs. Collect logs for the hang multiple times, then compare between them to see which cores were hung on each run. |
- In order to introduce a hang, remove the trisc.cc change and the corresponding ckernel.h change.
- This needs to be run on a board with clock gating off!
Seems like disabling clock gating on the board and disabling the instruction cache at the start of trisc.cc make the hang go away for all grid sizes. I would leave this issue open until we understand what exactly is going on and why CG_off helps. We are tracking this on a separate branch where we will potentially simulate some scenarios: https://github.com/tenstorrent/tt-metal/tree/ncvetkovic/18094-SDPA_mc_hang_repro @ejouretTT let's keep this issue open to track this side quest, although we will have a workaround for it once #18359 is merged. |
@cglagovichTT Could you confirm whether rebasing to main / fetching the workaround unblocks your model completely? I want us all to be aligned and not rely only on my test results. |