Investigate LLK issues in SDPA instance of BH matmul hang #18094

Open
cmaryanTT opened this issue Feb 20, 2025 · 29 comments
Labels: blackhole, bug, llama3, LLK

@cmaryanTT commented Feb 20, 2025

LLK subcase for the SDPA variant of the BH matmul hang, tracked at a high level in #16673.

The development team is unblocked with the latest workaround, but leaving this to track root cause.

Note that there are other variants of this already being worked on - #18064. This case is separated for tracking, but may be an identical issue.

FYI - @cglagovich

@amahmudTT (Contributor) commented Feb 21, 2025

@cmaryanTT I started looking at the hang today (SDPA). In the issue description you mentioned the latest workaround. Can you specify which workaround that is? Looking around, I have seen two: one that does a for loop of 10 NOPs, and another that puts a NOP inside a while loop that checks the value of a semaphore.

@amahmudTT (Contributor)

@cglagovichTT Could you specify the exact branch and command to reproduce the issue? I am seeing multiple branches and workarounds at the moment and would like to start with the most up-to-date or relevant one.

@cglagovichTT (Contributor)

One workaround I tried was the 10 NOPs before calling matmul_block: #16673 (comment)

The second thing I tried was Nikola's BH LLK branch, which resolved the hangs for me: #16673 (comment)

The second attempt was on a recent main from last week, so it is more up to date.
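
For reference, a minimal, illustrative sketch of what the 10-NOP workaround amounts to; the actual patch is the one in the linked #16673 comment, it may use a Tensix NOP macro rather than a plain RISC-V nop, and the helper name here is made up for illustration:

// Sketch only (assumed shape of the workaround, not the exact patch):
// a short run of NOPs executed immediately before the matmul_block call
// in the compute kernel.
inline void pre_matmul_nop_delay() {
    for (int i = 0; i < 10; i++) {
        asm volatile("nop");  // plain RISC-V NOP; the real patch may use a Tensix NOP instead
    }
}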

@cglagovichTT (Contributor)

Repro:

  • Branch cglagovich/18094 contains a commit which adds a few tests to the pytest file:
pip install pytest-repeat
pytest --count=20 -svv tests/tt_eager/python_api_testing/unit_testing/misc/test_scaled_dot_product_attention_decode.py -k "test_sdpa_decode_paged_attention"

You can set the count to anything you like. If you find a specific test which hangs more frequently, you can comment out the others to isolate it (or select just that one with -k, as sketched below); otherwise it'll take a while to run all the tests for N loops.
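
As an aside, a single parametrization can be selected with pytest's -k filter instead of commenting tests out; for example (using the paged_64 id that appears in the test's parametrize later in this thread — substitute whichever case hangs for you):

pytest --count=50 -svv tests/tt_eager/python_api_testing/unit_testing/misc/test_scaled_dot_product_attention_decode.py -k "test_sdpa_decode_paged_attention and paged_64"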

@ncvetkovicTT (Contributor)

@cglagovichTT The workaround on https://github.com/tenstorrent/tt-metal/tree/ncvetkovic/bh_hang doesn't help with SDPA, at least not when run on multiple cores. Is there any way you could create a single/double-core repro? I see that there's a limitation in the versim simulator, so I would need a smaller repro in order to analyze the traces/waveforms.

@cglagovichTT (Contributor) commented Feb 21, 2025

Strange, I ran for dozens of iterations with the LLK commit https://github.com/tenstorrent/tt-llk-bh/tree/nvelickovic/%2316439_matmul_hang applied on top of tt-metal branch cglagovich/18094 and I did not see hangs.
Shall I try to repro again?

Edit: Is the workaround you mention something new? In that case, yes, I'll make a single-core repro.

@cglagovichTT (Contributor)

Here's a repro on branch cglagovich/18094

TT_METAL_WATCHER=3 pytest --count=20 -svv tests/tt_eager/python_api_testing/unit_testing/misc/test_scaled_dot_product_attention_decode.py -k "single_core"

The hang is easiest to repro with watcher enabled and the following options disabled:

export TT_METAL_WATCHER_DISABLE_ASSERT=1
export TT_METAL_WATCHER_DISABLE_PAUSE=1
export TT_METAL_WATCHER_DISABLE_NOC_SANITIZE=1
export TT_METAL_WATCHER_DISABLE_STACK_USAGE=1

The watcher log ends up looking like this:

Device 0 worker core(x= 0,y= 0) virtual(x= 1,y= 2): CWFW,CRBW,   R,MWDD,   R  rmsg:D1G|BNT h_id:3142 smsg:GGGG k_ids:14|13|12

I also tried this single-core test with the following LLK commit and I was not able to repro the hang: https://github.com/tenstorrent/tt-llk-bh/tree/nvelickovic/%2316439_matmul_hang
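
For convenience, the same single-core repro collected into one copy-paste block (identical to the flags and command above; nothing new added):

export TT_METAL_WATCHER_DISABLE_ASSERT=1
export TT_METAL_WATCHER_DISABLE_PAUSE=1
export TT_METAL_WATCHER_DISABLE_NOC_SANITIZE=1
export TT_METAL_WATCHER_DISABLE_STACK_USAGE=1
TT_METAL_WATCHER=3 pytest --count=20 -svv tests/tt_eager/python_api_testing/unit_testing/misc/test_scaled_dot_product_attention_decode.py -k "single_core"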

@ncvetkovicTT (Contributor)

@cmaryanTT @vmilicevicTT @cglagovichTT @ttmtrajkovic Every workaround that disables the TRISC cache at the start of the FW code and never enables it again (i.e. without the toggling in REPLAY reprogramming) still results in the hang when I use the test from this discussion.

The only way we could make this SDPA test pass is to modify the disable_gathering function in ckernel.h so that it looks like one of the following:

  1. fence, CSRRS_bit_18 (disable cache), fence
  2. CSRRS_bit_1 (disable branch prediction), fence, CSRRS_bit_18 (disable cache), CSRRC_bit_1 (enable branch prediction), fence

@cglagovichTT Could you somehow make a single core version of this test?
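
For concreteness, a rough C++ sketch of the two variants above. The CSR address used here (0x7c0) is a placeholder, and the assumption that the branch-prediction bit (bit 1) and the cache/gathering bit (bit 18) live in the same CSR is for illustration only; the real encodings are whatever the existing disable_gathering() in ckernel.h already uses:

// Variant 1: fence, set the cache-disable bit, fence.
inline void disable_gathering_variant1() {
    asm volatile("fence");
    asm volatile("csrrs zero, 0x7c0, %0" ::"r"(1u << 18));  // set bit 18: disable cache (placeholder CSR)
    asm volatile("fence");
}

// Variant 2: disable branch prediction around the cache-disable write.
inline void disable_gathering_variant2() {
    asm volatile("csrrs zero, 0x7c0, %0" ::"r"(1u << 1));   // set bit 1: disable branch prediction
    asm volatile("fence");
    asm volatile("csrrs zero, 0x7c0, %0" ::"r"(1u << 18));  // set bit 18: disable cache
    asm volatile("csrrc zero, 0x7c0, %0" ::"r"(1u << 1));   // clear bit 1: re-enable branch prediction
    asm volatile("fence");
}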

@cglagovichTT (Contributor)

My comment above provides a single-core hanging test. Is there something different you need?

@ncvetkovicTT (Contributor)

I am so sorry Colman, I got lost in all the parallel conversations and didn't see it. I will take a look at it in the morning.

@cmaryanTT (Author)

Note issue #18250 - there may be a new MMul hang cause if you're using a recent build.

@amahmudTT (Contributor)

@cglagovichTT I am not sure I am able to reproduce the hang; the watcher log looks like this:

Dump #4 at 19.228s
Device 0 worker core(x= 0,y= 0) virtual(x= 1,y= 2): CWFW,   W,   R,   R,   R  rmsg:D1G|BNT h_id:2 smsg:DGGG k_ids:8|7|6
Device 0 worker core(x= 1,y= 0) virtual(x= 2,y= 2):   GW,   W,   W,   W,   W  rmsg:D0D|bnt h_id:0 smsg:DDDD k_ids:0|0|0
.....
Device 0 worker core(x=12,y= 9) virtual(x=15,y=11):   GW,   W,   W,   W,   W  rmsg:D0D|bnt h_id:0 smsg:DDDD k_ids:0|0|0
Device 0 worker core(x=13,y= 9) virtual(x=16,y=11):   GW,   W,   W,   W,   W  rmsg:H0D|bnt h_id:0 smsg:DDDD k_ids:0|0|0
k_id[0]: blank
k_id[1]: tt_metal/impl/dispatch/kernels/cq_prefetch.cpp
k_id[2]: tt_metal/impl/dispatch/kernels/cq_dispatch.cpp
k_id[3]: tt_metal/impl/dispatch/kernels/cq_dispatch_slave.cpp
k_id[6]: ttnn/cpp/ttnn/operations/transformer/sdpa_decode/device/kernels/compute/sdpa_flash_decode.cpp
k_id[7]: ttnn/cpp/ttnn/operations/transformer/sdpa_decode/device/kernels/dataflow/reader_decode_all.cpp
k_id[8]: ttnn/cpp/ttnn/operations/transformer/sdpa_decode/device/kernels/dataflow/writer_decode_all.cpp
Dump #4 completed at 19.289s
-----
Dump #5 at 22.289s
Device 0 worker core(x= 0,y= 0) virtual(x= 1,y= 2): CWFW,   W,   R,   R,   R  rmsg:D1G|BNT h_id:2 smsg:DGGG k_ids:8|7|6
Device 0 worker core(x= 1,y= 0) virtual(x= 2,y= 2):   GW,   W,   W,   W,   W  rmsg:D0D|bnt h_id:0 smsg:DDDD k_ids:0|0|0

As you can see, it gets out of CWFW and ends up in the last state where the smsg is all DDDD, so I can assume the hang did not happen, right? The pattern repeats. The thing is, I had many, many dumps with the same pattern, for example:

-----
Dump #7212 at 22082.600s
Device 0 worker core(x= 0,y= 0) virtual(x= 1,y= 2): CWFW,   W,   R,   R,   R  rmsg:D1G|BNT h_id:2 smsg:DGGG k_ids:8|7|6
Device 0 worker core(x= 1,y= 0) virtual(x= 2,y= 2):   GW,   W,   W,   W,   W  rmsg:D0D|bnt h_id:0 smsg:DDDD k_ids:0|0|0

So is it hanging, but in a different state than yours?

@ncvetkovicTT (Contributor)

I have updated the table to reflect the error I made by running multi-core tests instead of single-core. I added a column for single-core SDPA, and it behaves like the single-core test_benchmark.py and my C++ repro. For the multi-core case I confirmed the results that I had earlier and stated above: the test doesn't hang when we add fences around the CSR write or protect it by disabling branch prediction.

@amahmudTT (Contributor)

The multicore version of this is consistently failing at a certain spot, so it might help us figure out the issue. It fails here:

2025-02-26 12:34:19.237 | DEBUG    | tests.tt_eager.python_api_testing.unit_testing.misc.test_scaled_dot_product_attention_decode:run_test_sdpa_decode_paged_attention:731 - Testing causal with sequence length: 1065
2025-02-26 12:34:19.237 | INFO     | tests.tt_eager.python_api_testing.unit_testing.misc.test_scaled_dot_product_attention_decode:run_test_sdpa_decode_paged_attention:734 - Using chunk size: 512
2025-02-26 12:34:19.237 | INFO     | tests.tt_eager.python_api_testing.unit_testing.misc.test_scaled_dot_product_attention_decode:run_test_sdpa_decode_paged_attention:735 - Using padded layer length: 1536
2025-02-26 12:34:19.237 | INFO     | tests.tt_eager.python_api_testing.unit_testing.misc.test_scaled_dot_product_attention_decode:run_test_sdpa_decode_paged_attention:736 - Using padded num heads: 32

Steps to reproduce, on branch cglagovich/18094, commit 28a97df:

  1. Change the data type to float16_b to avoid PCC errors:
--- a/tests/tt_eager/python_api_testing/unit_testing/misc/test_scaled_dot_product_attention_decode.py
+++ b/tests/tt_eager/python_api_testing/unit_testing/misc/test_scaled_dot_product_attention_decode.py
@@ -836,14 +836,14 @@ def run_test_sdpa_decode_paged_attention(
     "kv_dtype, q_dtype",
     [
         # [ttnn.bfloat8_b, ttnn.bfloat8_b],
-        # [ttnn.bfloat16, ttnn.bfloat16],
-        [ttnn.bfloat8_b, ttnn.bfloat16],
+         [ttnn.bfloat16, ttnn.bfloat16],
+        #[ttnn.bfloat8_b, ttnn.bfloat16],
         # [ttnn.bfloat4_b, ttnn.bfloat16],
     ],
     ids=[
         # "all_bfp8",
-        # "all_bfp16",
-        "kv_bfp8_q_bf16",
+         "all_bfp16",
+        #"kv_bfp8_q_bf16",
         # "kv_bfp4",
     ],
 )
  2. Make changes to trisc.cc by adding disable_gathering().
  3. Make compiler changes:
--- a/tt_metal/jit_build/build.cpp
+++ b/tt_metal/jit_build/build.cpp
@@ -327,7 +327,7 @@ JitBuildCompute::JitBuildCompute(const JitBuildEnv& env, const JitBuiltStateConf

     this->out_path_ = this->is_fw_ ? env_.out_firmware_root_ : env_.out_kernel_root_;

-    this->cflags_ = env_.cflags_ + "-O3 ";
+    this->cflags_ = env_.cflags_ + "-O3 " + " -fno-rvtt-sfpu-replay  ";

  4. Make changes in the third-party ckernel.h to comment out disable_gathering()/enable_gathering().

Then the failure will still be there. Running with the L1 cache disabled skips some failures, but this one will still be there.

@ncvetkovicTT (Contributor)

@amahmudTT Thanks, that's weird; I got a pass for that case, but I applied the changes from his branch onto mine instead of using his branch directly. I will use his branch and update the table.

@ncvetkovicTT (Contributor)

@amahmudTT Again, I got lost. The table confirms what you are seeing. We can never make this multi-core SDPA test pass when we disable the cache at the start of trisc.cc. That's why I am trying to reduce the number of cores used in that case, so that we can simulate it and see what's going on.

@amahmudTT (Contributor)

I will look into the assembly for the kernels and see whether any csrrs instruction slips in. I will also try to reproduce the error in isolation; previous such attempts failed, and it seems it needs more runs before it hangs.

@ncvetkovicTT (Contributor) commented Feb 26, 2025

When I do the following on cglagovich/18094 / 28a97df:

--- a/tests/tt_eager/python_api_testing/unit_testing/misc/test_scaled_dot_product_attention_decode.py
+++ b/tests/tt_eager/python_api_testing/unit_testing/misc/test_scaled_dot_product_attention_decode.py
@@ -836,29 +836,29 @@ def run_test_sdpa_decode_paged_attention(
     "kv_dtype, q_dtype",
     [
         # [ttnn.bfloat8_b, ttnn.bfloat8_b],
-        # [ttnn.bfloat16, ttnn.bfloat16],
-        [ttnn.bfloat8_b, ttnn.bfloat16],
+        [ttnn.bfloat16, ttnn.bfloat16],
+        # [ttnn.bfloat8_b, ttnn.bfloat16],
         # [ttnn.bfloat4_b, ttnn.bfloat16],
     ],
     ids=[
         # "all_bfp8",
-        # "all_bfp16",
-        "kv_bfp8_q_bf16",
+        "all_bfp16",
+        # "kv_bfp8_q_bf16",
         # "kv_bfp4",
     ],
 )
 @pytest.mark.parametrize(
     "b, nh, nkv, s, d, grid_size, cur_pos_tensor",
     (
-        [1, 32, 8, 64 * 1024, 128, (8, 4), True],
-        [32, 32, 8, 2 * 1024, 128, (8, 4), True],
+        [1, 32, 8, 64 * 1024, 128, (5, 4), True],
+        # [32, 32, 8, 2 * 1024, 128, (8, 4), True],
         # [32, 32, 8, 64 * 1024, 128, (8, 8), True],
         # [32, 8, 1, 32768, 128, (8, 6), True],  # Llama2-70B
         # [4, 32, 8, 4096, 128, (8, 8), True],  # llama 3.1 8b
         # [4, 16, 4, 32768, 128, (8, 8), True],
         # [32, 32, 8, 4096, 128, (8, 8), True],  # llama 3.1 8b
-        [8, 16, 4, 4096, 128, (8, 2), True],  # llama 3.1 8b N300
-        [1, 8, 1, 128 * 1024, 128, (8, 4), True],  # llama 3.1 8b N300
+        # [8, 16, 4, 4096, 128, (8, 2), True],  # llama 3.1 8b N300
+        # [1, 8, 1, 128 * 1024, 128, (8, 4), True],  # llama 3.1 8b N300
         # [1, 8, 1, 32768, 128, (8, 1), True],  # Llama2-70B
         # [16, 8, 1, 32768, 128, (8, 6), False, False],  # Llama2-70B
         # [8, 8, 1, 32768, 128, (8, 6), True, False],  # Llama2-70B

i.e. leave only one set of parameters enabled. Then every test run whose core count in that parameter list is greater than or equal to 16 hangs. Whenever I reduce the core count below that (grid_size=(5, 3), (7, 2), etc.), the hang goes away.

I DPRINTed the compile-time args from the kernel ttnn/cpp/ttnn/operations/transformer/sdpa_decode/device/kernels/compute/sdpa_flash_decode.cpp, but they're the same no matter what the grid size is.

@cglagovichTT How should I change the values of the b, nh, nkv, s, d parameters so that when I reduce the grid size, everything else stays the same from the kernel/LLK perspective? I understand that the input tensor gets split amongst the cores, but the DPRINT from the kernel doesn't show that 🤔

@cglagovichTT (Contributor)

If you halve the number of cores, you should halve the b value. If b is already 1, halve the nkv value. If nkv is also 1, halve the s value.
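
In other words, the rule can be captured by a small helper (hypothetical, illustration only, not part of the test file). For example, scaling the 32-core config [1, 32, 8, 64 * 1024, 128, (8, 4)] down by two halvings gives nkv=2, which matches the (8, 1) entry in the config list further below:

# Hypothetical helper: halve b first, then nkv, then s, once per halving of the core count.
def scale_down(b, nh, nkv, s, d, halvings):
    for _ in range(halvings):
        if b > 1:
            b //= 2
        elif nkv > 1:
            nkv //= 2
        else:
            s //= 2
    return b, nh, nkv, s, d

# 32 cores -> 8 cores is two halvings:
# scale_down(1, 32, 8, 64 * 1024, 128, halvings=2) == (1, 32, 2, 64 * 1024, 128)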

@cglagovichTT (Contributor) commented Feb 26, 2025

LMK if you want a test case with certain properties; I might be able to help if I can find a BH machine, but it looks like they're all busy right now.

@cmaryanTT (Author) commented Feb 26, 2025 via email

@ncvetkovicTT (Contributor)

@cglagovichTT Reducing the dimensions as you mentioned didn't help; the process either exits, fails, or doesn't hang.

@cglagovichTT (Contributor)

@ncvetkovicTT I've verified on WH that each of these configurations passes. These should proportionally scale the 32-core hang repro down for various core grids. I was not able to test on BH yet.

@skip_for_grayskull("Unsupported in GS since L1 runs OOM with most configs")
@pytest.mark.parametrize(
    "kv_dtype, q_dtype",
    [
        [ttnn.bfloat16, ttnn.bfloat16],
    ],
    ids=[
        "all_bfp16",
    ],
)
@pytest.mark.parametrize(
    "b, nh, nkv, s, d, grid_size, cur_pos_tensor",
    (
        [1, 32, 1, 16 * 1024, 128, (1, 1), True],
        [1, 32, 1, 64 * 1024, 128, (4, 1), True],
        [1, 32, 2, 64 * 1024, 128, (8, 1), True],
        [1, 32, 4, 64 * 1024, 128, (8, 2), True],
        [1, 32, 8, 64 * 1024, 128, (8, 4), True],
    ),
)
@pytest.mark.parametrize("block_size", (64, 128), ids=["paged_64", "paged_128"])
def test_sdpa_decode_paged_attention(
    device, b, nh, nkv, s, d, kv_dtype, grid_size, q_dtype, cur_pos_tensor, block_size, use_program_cache
):
    if s == 128 * 1024 and block_size != 64:
        # 128k sequence, block_size 64 tests the sizing of the page table CB
        pytest.skip("Skipping test for seq_len=128k with block_size!=64")
    ttnn.device.DisablePersistentKernelCache()
    run_test_sdpa_decode_paged_attention(
        device,
        b,
        nh,
        nkv,
        s,
        d,
        kv_dtype,
        grid_size,
        q_dtype,
        cur_pos_tensor,
        block_size=block_size,
        sharded_in=True,
        sharded_out=False,
    )

@ncvetkovicTT (Contributor)

Okay, thanks. The smallest grid that I get a hang for is 8x1 from your proposal.

@cglagovichTT (Contributor)

Is it always the same core which hangs?

@ncvetkovicTT (Contributor)

Could you help me determine that? Do I just run the watcher, or is there more to it?

@cglagovichTT (Contributor)

Run with the watcher enabled; here are the watcher docs.

Collect logs for the hang multiple times, then compare them to see which cores were hung on each run.
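
A rough way to automate that collection, sketched under two assumptions: that the watcher writes its log to generated/watcher/watcher.log under TT_METAL_HOME, and that the all_bfp16 id from the configs above is the one being exercised (adjust the path, timeout, and -k filter to your setup):

for i in 1 2 3; do
    # timeout lets the loop continue past a hung run; tune the limit to your test
    TT_METAL_WATCHER=3 timeout 1800 pytest -svv tests/tt_eager/python_api_testing/unit_testing/misc/test_scaled_dot_product_attention_decode.py -k "test_sdpa_decode_paged_attention and all_bfp16" || true
    cp "$TT_METAL_HOME/generated/watcher/watcher.log" "watcher_run_${i}.log"
done
# Compare the saved logs to see whether the same worker cores are stuck on each run.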

ncvetkovicTT added a commit that referenced this issue Feb 27, 2025
- In order to introduce a hang, remove the trisc.cc change and the corresponding ckernel.h change
- Needs to be run on a board with clock gating off!
@ncvetkovicTT (Contributor)

It seems that disabling clock gating on the board and disabling the instruction cache at the start of trisc.cc make the hang go away for all grid sizes. I would leave this issue open until we understand what exactly is going on and why CG_off helps. We are tracking this on a separate branch, where we will potentially simulate some scenarios:

https://github.com/tenstorrent/tt-metal/tree/ncvetkovic/18094-SDPA_mc_hang_repro

@ejouretTT let's keep this issue open to track this side quest, although we will have a workaround for it once #18359 is merged.

@ncvetkovicTT (Contributor) commented Feb 28, 2025

@cglagovichTT Could you confirm whether rebasing onto main / fetching the workaround unblocks your model completely? I want us all to be aligned and not rely only on my test results.
