
Device profiler on ttnn resnet50 is hanging on BH #17099

Closed
mywoodstock opened this issue Jan 24, 2025 · 45 comments


On Blackhole, we initially get a kernel compile error, but with the following patch, it runs:

diff --git a/tt_metal/hw/inc/blackhole/dev_mem_map.h b/tt_metal/hw/inc/blackhole/dev_mem_map.h
index 075edd005c..b97e3c5601 100644
--- a/tt_metal/hw/inc/blackhole/dev_mem_map.h
+++ b/tt_metal/hw/inc/blackhole/dev_mem_map.h
@@ -48,7 +48,7 @@

 /////////////
 // Firmware/kernel code holes
-#define MEM_BRISC_FIRMWARE_SIZE (5 * 1024 + 128)
+#define MEM_BRISC_FIRMWARE_SIZE (5 * 1024 + 256)
 // TODO: perhaps put NCRISC FW in the scratch area and free 1.5K after init (GS/WH)
 #define MEM_NCRISC_FIRMWARE_SIZE 1536
 #define MEM_TRISC0_FIRMWARE_SIZE 1536

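For reference, the size bump above adds 128 bytes of headroom to the BRISC firmware code hole; a minimal sketch of the arithmetic (the constant names here are illustrative, only `MEM_BRISC_FIRMWARE_SIZE` exists in the codebase):

```cpp
#include <cstddef>

// Old vs. new BRISC firmware hole sizes from the patch above.
// (Names are illustrative; the real macro is MEM_BRISC_FIRMWARE_SIZE.)
constexpr std::size_t kOldBriscFwSize = 5 * 1024 + 128;  // 5248 bytes
constexpr std::size_t kNewBriscFwSize = 5 * 1024 + 256;  // 5376 bytes

static_assert(kNewBriscFwSize - kOldBriscFwSize == 128,
              "the patch adds 128 bytes of headroom");
```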
But when trying to profile ResNet50, the run hangs with the profiler enabled. (It passes fine without the profiler.)

Branch: asarje/bh-rn50-20250123
Compile with profiler: ./build_metal.sh -p --debug
Run the model with profiler: python -m tracy -p -r -v -m pytest "\"tests/ttnn/integration_tests/resnet/test_ttnn_functional_resnet50.py::test_resnet_50[pretrained_weight_false-batch_size=16-act_dtype=DataType.BFLOAT8_B-weight_dtype=DataType.BFLOAT8_B-math_fidelity=MathFidelity.LoFi-device_params={'l1_small_size': 24576}]\""

The run will hang.

mo-tenstorrent commented Jan 25, 2025

Per a conversation with Paul, the first thing to check is entirely disabling the L1 data cache, which is set in risc_common.

mo-tenstorrent commented Jan 25, 2025

Confirming that disabling L1 cache fixes the issue.

Hanging run: https://github.com/tenstorrent/tt-metal/actions/runs/12959399845

Passing run: https://github.com/tenstorrent/tt-metal/actions/runs/12966853376/job/36168151536#step:9:407

TT_METAL_DISABLE_L1_DATA_CACHE_RISCVS="BR,NC,TR" python -m tracy -p -r -v -m pytest "\"tests/ttnn/integration_tests/resnet/test_ttnn_functional_resnet50.py::test_resnet_50[pretrained_weight_false-batch_size=16-act_dtype=DataType.BFLOAT8_B-weight_dtype=DataType.BFLOAT8_B-math_fidelity=MathFidelity.LoFi-device_params={'l1_small_size': 24576}]\""
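TT_METAL_DISABLE_L1_DATA_CACHE_RISCVS takes a comma-separated list of core-name prefixes (BR, NC, TR above). As a hedged sketch of how such a value could be split into a set of names (`parse_riscv_list` is a hypothetical helper for illustration, not the actual tt-metal parser):

```cpp
#include <sstream>
#include <string>
#include <unordered_set>

// Hypothetical parser for a "BR,NC,TR"-style list. The names mirror the
// BRISC/NCRISC/TRISC cores, but this helper is illustrative only; the real
// handling lives inside tt-metal's runtime option parsing.
std::unordered_set<std::string> parse_riscv_list(const std::string& value) {
    std::unordered_set<std::string> out;
    std::stringstream ss(value);
    std::string token;
    while (std::getline(ss, token, ',')) {
        if (!token.empty()) {
            out.insert(token);  // keep each non-empty comma-separated entry
        }
    }
    return out;
}
```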

@mywoodstock mywoodstock added the P1 label Jan 28, 2025
@ejouretTT ejouretTT added P0 and removed P1 labels Feb 18, 2025
@ejouretTT

@mo-tenstorrent is trying to narrow the hang down to a single core; still in progress. Will reach out to @mywoodstock to create unit tests around it.
A workaround exists, but no long-term fix. This impacts performance evaluation.
@mo-tenstorrent to rename this issue.

@mo-tenstorrent mo-tenstorrent changed the title Tracy profiler on BH not working Device profiler on ttnn resnet50 is hanging on BH Feb 18, 2025
@ejouretTT

The previous workaround did not fix this; a new solution had to be found. Root cause partially identified: running it in isolation does not cause the hang. @mo-tenstorrent to re-evaluate and provide updates on the root cause.

mo-tenstorrent commented Feb 19, 2025

The root cause is not fully identified yet. However, we can pinpoint a single op, a conv2d, in the model that triggers the hang.
Returning right before that op in the model leads to no hang. Returning right after the op leads to a hang.

@mo-tenstorrent

Setting TT_METAL_DISABLE_L1_DATA_CACHE_RISCVS="BR,NC,TR" is not a workaround for this issue.

A better workaround is applying the following patch:

diff --git a/tt_metal/tools/profiler/kernel_profiler.hpp b/tt_metal/tools/profiler/kernel_profiler.hpp
index 55a95d76ff..f80e2f0b7b 100644
--- a/tt_metal/tools/profiler/kernel_profiler.hpp
+++ b/tt_metal/tools/profiler/kernel_profiler.hpp
@@ -136,6 +136,11 @@ inline __attribute__((always_inline)) bool bufferHasRoom() {
 }

 inline __attribute__((always_inline)) void mark_time_at_index_inlined(uint32_t index, uint32_t timer_id) {
+    constexpr int CYCLE_BURN_COUNT = 3;
+#pragma GCC unroll 65534
+    for (int j = 0; j < CYCLE_BURN_COUNT; j++) {
+        asm volatile ("nop");
+    }
     volatile tt_reg_ptr uint32_t* p_reg = reinterpret_cast<volatile tt_reg_ptr uint32_t*>(RISCV_DEBUG_REG_WALL_CLOCK_L);
     profiler_data_buffer[myRiscID][index] =
         0x80000000 | ((timer_id & 0x7FFFF) << 12) | (p_reg[WALL_CLOCK_HIGH_INDEX] & 0xFFF);

CYCLE_BURN_COUNT here is the smallest count that removes the hang. It may need adjusting on other commits, since timing disturbances can differ between commits.
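For context, the surrounding function packs a marker bit, a 19-bit timer id, and the low 12 bits of the wall-clock high word into a single 32-bit buffer entry. A standalone sketch of that encoding plus the nop burn loop from the patch (`pack_timer_word` and `unpack_timer_id` are illustrative helpers, not tt-metal APIs):

```cpp
#include <cstdint>

// Mirrors the packing in mark_time_at_index_inlined: bit 31 marks a valid
// entry, bits 30:12 hold the 19-bit timer id, and bits 11:0 hold the low
// 12 bits of the wall-clock high word.
inline uint32_t pack_timer_word(uint32_t timer_id, uint32_t wall_clock_high) {
    return 0x80000000u | ((timer_id & 0x7FFFFu) << 12) | (wall_clock_high & 0xFFFu);
}

inline uint32_t unpack_timer_id(uint32_t word) {
    return (word >> 12) & 0x7FFFFu;
}

// The workaround's delay: a fully unrolled run of nops executed before the
// wall-clock register read.
inline void burn_cycles() {
    constexpr int CYCLE_BURN_COUNT = 3;
#pragma GCC unroll 65534
    for (int j = 0; j < CYCLE_BURN_COUNT; j++) {
        asm volatile("nop");
    }
}
```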

mo-tenstorrent commented Feb 19, 2025

Replicating the baseline from two weeks ago:

On a clean non-profiler build the hang can be reproduced with the following changes to the code:

diff --git a/tt_metal/hw/firmware/src/brisck.cc b/tt_metal/hw/firmware/src/brisck.cc
index bac6b3c611..b3055bc698 100644
--- a/tt_metal/hw/firmware/src/brisck.cc
+++ b/tt_metal/hw/firmware/src/brisck.cc
@@ -21,6 +21,9 @@
 #include "remote_circular_buffer_api.h"
 #endif

+volatile tt_reg_ptr uint32_t* p_reg = reinterpret_cast<volatile tt_reg_ptr uint32_t*>(RISCV_DEBUG_REG_WALL_CLOCK_L);
+volatile tt_l1_ptr uint32_t* profiler_data_buffer =
+    reinterpret_cast<volatile tt_l1_ptr uint32_t*>(GET_MAILBOX_ADDRESS_DEV(profiler.buffer));
 void kernel_launch(uint32_t kernel_base_addr) {
 #if defined(DEBUG_NULL_KERNELS) && !defined(DISPATCH_KERNEL)
     wait_for_go_message();
@@ -43,7 +46,15 @@ void kernel_launch(uint32_t kernel_base_addr) {
     wait_for_go_message();
     {
         DeviceZoneScopedMainChildN("BRISC-KERNEL");
+        profiler_data_buffer[8] = p_reg[0];
+        profiler_data_buffer[9] = p_reg[1];
+        profiler_data_buffer[9] = p_reg[1];
+        profiler_data_buffer[9] = p_reg[1];
         kernel_main();
+        profiler_data_buffer[10] = p_reg[0];
+        profiler_data_buffer[11] = p_reg[1];
+        profiler_data_buffer[11] = p_reg[1];
+        profiler_data_buffer[11] = p_reg[1];
     }
 #endif
 }
diff --git a/tt_metal/hw/firmware/src/ncrisck.cc b/tt_metal/hw/firmware/src/ncrisck.cc
index d0809c0dbf..aa32fffb2d 100644
--- a/tt_metal/hw/firmware/src/ncrisck.cc
+++ b/tt_metal/hw/firmware/src/ncrisck.cc
@@ -29,6 +29,9 @@ uint32_t noc_nonposted_writes_acked[NUM_NOCS];
 uint32_t noc_nonposted_atomics_acked[NUM_NOCS];
 uint32_t noc_posted_writes_num_issued[NUM_NOCS];

+volatile tt_reg_ptr uint32_t* p_reg = reinterpret_cast<volatile tt_reg_ptr uint32_t*>(RISCV_DEBUG_REG_WALL_CLOCK_L);
+volatile tt_l1_ptr uint32_t* profiler_data_buffer =
+    reinterpret_cast<volatile tt_l1_ptr uint32_t*>(GET_MAILBOX_ADDRESS_DEV(profiler.buffer));
 void kernel_launch(uint32_t kernel_base_addr) {
 #if defined(DEBUG_NULL_KERNELS) && !defined(DISPATCH_KERNEL)
     wait_for_go_message();
@@ -51,7 +54,13 @@ void kernel_launch(uint32_t kernel_base_addr) {
 #endif
     wait_for_go_message();
     DeviceZoneScopedMainChildN("NCRISC-KERNEL");
+    profiler_data_buffer[4] = p_reg[0];
+    profiler_data_buffer[5] = p_reg[1];
+    profiler_data_buffer[5] = p_reg[1];
+    profiler_data_buffer[5] = p_reg[1];
     kernel_main();
+    profiler_data_buffer[6] = p_reg[0];
+    profiler_data_buffer[7] = p_reg[1];
     if constexpr (NOC_MODE == DM_DEDICATED_NOC) {
         WAYPOINT("NKFW");
         // Assert that no noc transactions are outstanding, to ensure that all reads and writes have landed and the NOC
diff --git a/tt_metal/hw/firmware/src/trisck.cc b/tt_metal/hw/firmware/src/trisck.cc
index 6b4fa96dc7..4dbfd829bc 100644
--- a/tt_metal/hw/firmware/src/trisck.cc
+++ b/tt_metal/hw/firmware/src/trisck.cc
@@ -37,10 +37,15 @@ volatile tt_reg_ptr uint * mailbox_base[4] = {
 };
 }

+volatile tt_reg_ptr uint32_t* p_reg = reinterpret_cast<volatile tt_reg_ptr uint32_t*>(RISCV_DEBUG_REG_WALL_CLOCK_L);
+volatile tt_l1_ptr uint32_t* profiler_data_buffer =
+    reinterpret_cast<volatile tt_l1_ptr uint32_t*>(GET_MAILBOX_ADDRESS_DEV(profiler.buffer));
 void kernel_launch(uint32_t kernel_base_addr) {
 #if defined(DEBUG_NULL_KERNELS) && !defined(DISPATCH_KERNEL)
     wait_for_go_message();
     DeviceZoneScopedMainChildN("TRISC-KERNEL");
+    profiler_data_buffer[0] = p_reg[0];
+    profiler_data_buffer[1] = p_reg[1];
 #ifdef KERNEL_RUN_TIME
     ckernel::wait(KERNEL_RUN_TIME);
 #endif
@@ -59,6 +64,16 @@ void kernel_launch(uint32_t kernel_base_addr) {
 #endif
     wait_for_go_message();
     DeviceZoneScopedMainChildN("TRISC-KERNEL");
+    profiler_data_buffer[0] = p_reg[0];
+    profiler_data_buffer[0] = p_reg[0];
+    profiler_data_buffer[0] = p_reg[0];
+    profiler_data_buffer[0] = p_reg[0];
+    // profiler_data_buffer[1] = p_reg[1];
     run_kernel();
+    // profiler_data_buffer[2] = p_reg[0];
+    profiler_data_buffer[3] = p_reg[1];
+    profiler_data_buffer[3] = p_reg[1];
+    profiler_data_buffer[3] = p_reg[1];
+    profiler_data_buffer[3] = p_reg[1];
 #endif
 }

Several combinations of register reads plus L1 writes had to be tried before the hang appeared.

One note: the reset requirements of the above hang are slightly different from the original hang.
Without a tt-smi reset, rerunning the same test passes the device initialization stage but hangs at a much earlier stage of the model: the warm-up run at layer 1, module 2, instead of the optimized run at layer 4, module 3, where it originally hung.
After a tt-smi reset, the hang goes back to layer 4, module 3.

This difference in reset behaviour might be caused by profiler vs. non-profiler builds: in profiler builds we ask for finish at different times, and we also do direct L1 reads through UMD.

@mo-tenstorrent

Here is the backtrace of the process after the hang:

#0  __futex_abstimed_wait_common64 (private=0, cancel=true, abstime=0x0, op=393, expected=0, futex_word=0x5973cae1dfc8) at ./nptl/futex-internal.c:57
#1  __futex_abstimed_wait_common (cancel=true, private=0, abstime=0x0, clockid=0, expected=0, futex_word=0x5973cae1dfc8) at ./nptl/futex-internal.c:87
#2  __GI___futex_abstimed_wait_cancelable64 (futex_word=futex_word@entry=0x5973cae1dfc8, expected=expected@entry=0, clockid=clockid@entry=0, abstime=abstime@entry=0x0, private=private@entry=0) at ./nptl/futex-internal.c:139
#3  0x000077fabb7dca41 in __pthread_cond_wait_common (abstime=0x0, clockid=0, mutex=0x5973cae1dfd0, cond=0x5973cae1dfa0) at ./nptl/pthread_cond_wait.c:503
#4  ___pthread_cond_wait (cond=0x5973cae1dfa0, mutex=0x5973cae1dfd0) at ./nptl/pthread_cond_wait.c:627
#5  0x000077fa9be4e09f in std::__1::condition_variable::wait(std::__1::unique_lock<std::__1::mutex>&) () from /lib/x86_64-linux-gnu/libc++.so.1
#6  0x000077fa5b5986eb in tt::tt_metal::HWCommandQueue::finish(tt::stl::Span<tt::tt_metal::SubDeviceId const, 18446744073709551615ul>) () from /proj_sw/mmemarian/tt-metal/build_Release/lib/libtt_metal.so
#7  0x000077fa5b597591 in tt::tt_metal::HWCommandQueue::enqueue_read_buffer(tt::tt_metal::v0::Buffer&, void*, tt::tt_metal::v0::BufferRegion const&, bool, tt::stl::Span<tt::tt_metal::SubDeviceId const, 18446744073709551615ul>) () from /proj_sw/mmemarian/tt-metal/build_Release/lib/libtt_metal.so
#8  0x000077fa5b593a5b in decltype(auto) std::__1::__variant_detail::__visitation::__base::__dispatcher<1ul>::__dispatch[abi:ue170006]<std::__1::__variant_detail::__visitation::__variant::__value_visitor<tt::tt_metal::v0::EnqueueReadSubBuffer(tt::tt_metal::CommandQueue&, std::__1::variant<std::__1::reference_wrapper<tt::tt_metal::v0::Buffer>, std::__1::shared_ptr<tt::tt_metal::v0::Buffer> >, void*, tt::tt_metal::v0::BufferRegion const&, bool)::$_0>&&, std::__1::__variant_detail::__base<(std::__1::__variant_detail::_Trait)1, std::__1::reference_wrapper<tt::tt_metal::v0::Buffer>, std::__1::shared_ptr<tt::tt_metal::v0::Buffer> >&>(std::__1::__variant_detail::__visitation::__variant::__value_visitor<tt::tt_metal::v0::EnqueueReadSubBuffer(tt::tt_metal::CommandQueue&, std::__1::variant<std::__1::reference_wrapper<tt::tt_metal::v0::Buffer>, std::__1::shared_ptr<tt::tt_metal::v0::Buffer> >, void*, tt::tt_metal::v0::BufferRegion const&, bool)::$_0>&&, std::__1::__variant_detail::__base<(std::__1::__variant_detail::_Trait)1, std::__1::reference_wrapper<tt::tt_metal::v0::Buffer>, std::__1::shared_ptr<tt::tt_metal::v0::Buffer> >&) () from /proj_sw/mmemarian/tt-metal/build_Release/lib/libtt_metal.so
#9  0x000077fa5b592be1 in tt::tt_metal::v0::EnqueueReadSubBuffer(tt::tt_metal::CommandQueue&, std::__1::variant<std::__1::reference_wrapper<tt::tt_metal::v0::Buffer>, std::__1::shared_ptr<tt::tt_metal::v0::Buffer> >, void*, tt::tt_metal::v0::BufferRegion const&, bool) () from /proj_sw/mmemarian/tt-metal/build_Release/lib/libtt_metal.so
#10 0x000077fa5b592a92 in tt::tt_metal::v0::EnqueueReadBuffer(tt::tt_metal::CommandQueue&, std::__1::variant<std::__1::reference_wrapper<tt::tt_metal::v0::Buffer>, std::__1::shared_ptr<tt::tt_metal::v0::Buffer> > const&, void*, bool) () from /proj_sw/mmemarian/tt-metal/build_Release/lib/libtt_metal.so
#11 0x000077fa5e2807b7 in tt::tt_metal::Tensor tt::tt_metal::tensor_impl::to_host_helper<bfloat16>(tt::tt_metal::Tensor const&, bool, unsigned char) () from /proj_sw/mmemarian/tt-metal/ttnn/ttnn/_ttnn.so
#12 0x000077fa5e2802c2 in tt::tt_metal::Tensor tt::tt_metal::tensor_impl::to_host<bfloat16>(tt::tt_metal::Tensor const&, bool, unsigned char) () from /proj_sw/mmemarian/tt-metal/ttnn/ttnn/_ttnn.so
#13 0x000077fa5e33b1ae in _ZN2tt8tt_metal11tensor_impl8dispatchIZNS1_15to_host_wrapperIJRNS0_6TensorERbRhEEEDaDpOT_EUlTyDpOT0_E_JS5_S6_S7_EEEDaNS0_8DataTypeEOT_SD_ () from /proj_sw/mmemarian/tt-metal/ttnn/ttnn/_ttnn.so
#14 0x000077fa5e33af7d in std::__1::__function::__func<tt::tt_metal::tensor_ops::tensor_cpu(tt::tt_metal::Tensor const&, bool, unsigned char)::$_0, std::__1::allocator<tt::tt_metal::tensor_ops::tensor_cpu(tt::tt_metal::Tensor const&, bool, unsigned char)::$_0>, void ()>::operator()() () from /proj_sw/mmemarian/tt-metal/ttnn/ttnn/_ttnn.so
#15 0x000077fa5e2a38c2 in tt::tt_metal::tensor_ops::tensor_cpu(tt::tt_metal::Tensor const&, bool, unsigned char) () from /proj_sw/mmemarian/tt-metal/ttnn/ttnn/_ttnn.so
#16 0x000077fa5e27e629 in tt::tt_metal::Tensor::cpu(bool, unsigned char) const () from /proj_sw/mmemarian/tt-metal/ttnn/ttnn/_ttnn.so
#17 0x000077fa5c309859 in ttnn::operations::core::from_device(tt::tt_metal::Tensor const&, bool, unsigned char) () from /proj_sw/mmemarian/tt-metal/ttnn/ttnn/_ttnn.so
#18 0x000077fa5e58eab0 in pybind11::cpp_function::initialize<tt::tt_metal::Tensor (*&)(tt::tt_metal::Tensor const&, bool, unsigned char), tt::tt_metal::Tensor, tt::tt_metal::Tensor const&, bool, unsigned char, pybind11::name, pybind11::scope, pybind11::sibling, pybind11::arg, pybind11::arg_v, pybind11::kw_only, pybind11::arg_v, char [848]>(tt::tt_metal::Tensor (*&)(tt::tt_metal::Tensor const&, bool, unsigned char), tt::tt_metal::Tensor (*)(tt::tt_metal::Tensor const&, bool, unsigned char), pybind11::name const&, pybind11::scope const&, pybind11::sibling const&, pybind11::arg const&, pybind11::arg_v const&, pybind11::kw_only const&, pybind11::arg_v const&, char const (&) [848])::{lambda(pybind11::detail::function_call&)#1}::operator()(pybind11::detail::function_call&) const () from /proj_sw/mmemarian/tt-metal/ttnn/ttnn/_ttnn.so
#19 0x000077fa5e58ea2e in pybind11::cpp_function::initialize<tt::tt_metal::Tensor (*&)(tt::tt_metal::Tensor const&, bool, unsigned char), tt::tt_metal::Tensor, tt::tt_metal::Tensor const&, bool, unsigned char, pybind11::name, pybind11::scope, pybind11::sibling, pybind11::arg, pybind11::arg_v, pybind11::kw_only, pybind11::arg_v, char [848]>(tt::tt_metal::Tensor (*&)(tt::tt_metal::Tensor const&, bool, unsigned char), tt::tt_metal::Tensor (*)(tt::tt_metal::Tensor const&, bool, unsigned char), pybind11::name const&, pybind11::scope const&, pybind11::sibling const&, pybind11::arg const&, pybind11::arg_v const&, pybind11::kw_only const&, pybind11::arg_v const&, char const (&) [848])::{lambda(pybind11::detail::function_call&)#1}::__invoke(pybind11::detail::function_call&) () from /proj_sw/mmemarian/tt-metal/ttnn/ttnn/_ttnn.so
#20 0x000077fa5c2ae4d9 in pybind11::cpp_function::dispatcher(_object*, _object*, _object*) () from /proj_sw/mmemarian/tt-metal/ttnn/ttnn/_ttnn.so

The process is stuck in HWCommandQueue::finish().

Rerunning the test without a tt-smi reset gives the following trace when hung:

(gdb) bt
#0  __memmove_avx_unaligned_erms () at ../sysdeps/x86_64/multiarch/memmove-vec-unaligned-erms.S:384
#1  0x00007ba3d1936fc0 in tt::umd::TTDevice::read_block(unsigned long, unsigned long, unsigned char*) () from /proj_sw/mmemarian/tt-metal/build_Release/lib/libdevice.so
#2  0x00007ba3d18a9e17 in tt::umd::Cluster::read_device_memory(void*, tt::umd::cxy_pair, unsigned long, unsigned int, std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> > const&) () from /proj_sw/mmemarian/tt-metal/build_Release/lib/libdevice.so
#3  0x00007ba3d1e63743 in tt::Cluster::read_core(void*, unsigned int, tt::umd::cxy_pair, unsigned long, bool) const () from /proj_sw/mmemarian/tt-metal/build_Release/lib/libtt_metal.so
#4  0x00007ba3d1d24ac6 in tt::tt_metal::SystemMemoryManager::fetch_queue_reserve_back(unsigned char) () from /proj_sw/mmemarian/tt-metal/build_Release/lib/libtt_metal.so
#5  0x00007ba3d1d251ac in void tt::tt_metal::buffer_dispatch::issue_buffer_dispatch_command_sequence<tt::tt_metal::buffer_dispatch::ShardedBufferWriteDispatchParams>(void const*, tt::tt_metal::v0::Buffer&, tt::tt_metal::buffer_dispatch::ShardedBufferWriteDispatchParams&, tt::stl::Span<tt::tt_metal::SubDeviceId const, 18446744073709551615ul>, CoreType) () from /proj_sw/mmemarian/tt-metal/build_Release/lib/libtt_metal.so
#6  0x00007ba3d1d22bac in tt::tt_metal::buffer_dispatch::write_sharded_buffer_to_core(void const*, unsigned int, tt::tt_metal::v0::Buffer&, tt::tt_metal::buffer_dispatch::ShardedBufferWriteDispatchParams&, tt::tt_metal::buffer_dispatch::BufferDispatchConstants const&, tt::stl::Span<tt::tt_metal::SubDeviceId const, 18446744073709551615ul>, tt::umd::xy_pair, CoreType) ()
   from /proj_sw/mmemarian/tt-metal/build_Release/lib/libtt_metal.so
#7  0x00007ba3d1d22f3f in tt::tt_metal::buffer_dispatch::write_to_device_buffer(void const*, tt::tt_metal::v0::Buffer&, tt::tt_metal::v0::BufferRegion const&, unsigned int, tt::stl::Span<unsigned int const, 18446744073709551615ul>, CoreType, tt::stl::Span<tt::tt_metal::SubDeviceId const, 18446744073709551615ul>) () from /proj_sw/mmemarian/tt-metal/build_Release/lib/libtt_metal.so
#8  0x00007ba3d1d97977 in tt::tt_metal::HWCommandQueue::enqueue_write_buffer(tt::tt_metal::v0::Buffer&, void const*, tt::tt_metal::v0::BufferRegion const&, bool, tt::stl::Span<tt::tt_metal::SubDeviceId const, 18446744073709551615ul>) () from /proj_sw/mmemarian/tt-metal/build_Release/lib/libtt_metal.so
#9  0x00007ba3d1d97786 in tt::tt_metal::HWCommandQueue::enqueue_write_buffer(std::__1::variant<std::__1::reference_wrapper<tt::tt_metal::v0::Buffer>, std::__1::shared_ptr<tt::tt_metal::v0::Buffer> > const&, std::__1::variant<std::__1::shared_ptr<std::__1::vector<unsigned char, std::__1::allocator<unsigned char> > > const, std::__1::shared_ptr<std::__1::vector<unsigned short, std::__1::allocator<unsigned short> > > const, std::__1::shared_ptr<std::__1::vector<int, std::__1::allocator<int> > > const, std::__1::shared_ptr<std::__1::vector<unsigned int, std::__1::allocator<unsigned int> > > const, std::__1::shared_ptr<std::__1::vector<float, std::__1::allocator<float> > > const, std::__1::shared_ptr<std::__1::vector<bfloat16, std::__1::allocator<bfloat16> > > const, void const*>, tt::tt_metal::v0::BufferRegion const&, bool, tt::stl::Span<tt::tt_metal::SubDeviceId const, 18446744073709551615ul>) () from /proj_sw/mmemarian/tt-metal/build_Release/lib/libtt_metal.so
#10 0x00007ba3d1d92d67 in tt::tt_metal::v0::EnqueueWriteSubBuffer(tt::tt_metal::CommandQueue&, std::__1::variant<std::__1::reference_wrapper<tt::tt_metal::v0::Buffer>, std::__1::shared_ptr<tt::tt_metal::v0::Buffer> > const&, std::__1::variant<std::__1::shared_ptr<std::__1::vector<unsigned char, std::__1::allocator<unsigned char> > > const, std::__1::shared_ptr<std::__1::vector<unsigned short, std::__1::allocator<unsigned short> > > const, std::__1::shared_ptr<std::__1::vector<int, std::__1::allocator<int> > > const, std::__1::shared_ptr<std::__1::vector<unsigned int, std::__1::allocator<unsigned int> > > const, std::__1::shared_ptr<std::__1::vector<float, std::__1::allocator<float> > > const, std::__1::shared_ptr<std::__1::vector<bfloat16, std::__1::allocator<bfloat16> > > const, void const*>, tt::tt_metal::v0::BufferRegion const&, bool) ()
   from /proj_sw/mmemarian/tt-metal/build_Release/lib/libtt_metal.so
#11 0x00007ba3d1d92992 in tt::tt_metal::v0::EnqueueWriteBuffer(tt::tt_metal::CommandQueue&, std::__1::variant<std::__1::reference_wrapper<tt::tt_metal::v0::Buffer>, std::__1::shared_ptr<tt::tt_metal::v0::Buffer> > const&, std::__1::variant<std::__1::shared_ptr<std::__1::vector<unsigned char, std::__1::allocator<unsigned char> > > const, std::__1::shared_ptr<std::__1::vector<unsigned short, std::__1::allocator<unsigned short> > > const, std::__1::shared_ptr<std::__1::vector<int, std::__1::allocator<int> > > const, std::__1::shared_ptr<std::__1::vector<unsigned int, std::__1::allocator<unsigned int> > > const, std::__1::shared_ptr<std::__1::vector<float, std::__1::allocator<float> > > const, std::__1::shared_ptr<std::__1::vector<bfloat16, std::__1::allocator<bfloat16> > > const, void const*>, bool) ()
   from /proj_sw/mmemarian/tt-metal/build_Release/lib/libtt_metal.so
#12 0x00007ba3d4b33cb6 in void tt::tt_metal::tensor_impl::write_data_to_device_buffer<unsigned short, tt::tt_metal::borrowed_buffer::Buffer>(tt::tt_metal::CommandQueue&, tt::tt_metal::borrowed_buffer::Buffer<unsigned short> const&, std::__1::shared_ptr<tt::tt_metal::v0::Buffer>) () from /proj_sw/mmemarian/tt-metal/ttnn/ttnn/_ttnn.so

This looks like a UMD read_block call is hung.

mo-tenstorrent commented Feb 19, 2025

Removing the code changes in trisck.cc and reducing ncrisck.cc to a single reg read + L1 write moved the hang to the warm-up run in layer 4, module 1, with the following changes:

+volatile tt_reg_ptr uint32_t* p_reg = reinterpret_cast<volatile tt_reg_ptr uint32_t*>(RISCV_DEBUG_REG_WALL_CLOCK_L);
+volatile tt_l1_ptr uint32_t* profiler_data_buffer =
+    reinterpret_cast<volatile tt_l1_ptr uint32_t*>(GET_MAILBOX_ADDRESS_DEV(profiler.buffer));
 void kernel_launch(uint32_t kernel_base_addr) {
 #if defined(DEBUG_NULL_KERNELS) && !defined(DISPATCH_KERNEL)
     wait_for_go_message();
@@ -43,7 +46,15 @@ void kernel_launch(uint32_t kernel_base_addr) {
     wait_for_go_message();
     {
         DeviceZoneScopedMainChildN("BRISC-KERNEL");
+        profiler_data_buffer[8] = p_reg[0];
+        profiler_data_buffer[9] = p_reg[1];
+        profiler_data_buffer[9] = p_reg[1];
+        profiler_data_buffer[9] = p_reg[1];
         kernel_main();
+        profiler_data_buffer[10] = p_reg[0];
+        profiler_data_buffer[11] = p_reg[1];
+        profiler_data_buffer[11] = p_reg[1];
+        profiler_data_buffer[11] = p_reg[1];
     }
 #endif
 }
diff --git a/tt_metal/hw/firmware/src/ncrisck.cc b/tt_metal/hw/firmware/src/ncrisck.cc
index d0809c0dbf..4909dbef13 100644
--- a/tt_metal/hw/firmware/src/ncrisck.cc
+++ b/tt_metal/hw/firmware/src/ncrisck.cc
@@ -29,6 +29,9 @@ uint32_t noc_nonposted_writes_acked[NUM_NOCS];
 uint32_t noc_nonposted_atomics_acked[NUM_NOCS];
 uint32_t noc_posted_writes_num_issued[NUM_NOCS];

+volatile tt_reg_ptr uint32_t* p_reg = reinterpret_cast<volatile tt_reg_ptr uint32_t*>(RISCV_DEBUG_REG_WALL_CLOCK_L);
+volatile tt_l1_ptr uint32_t* profiler_data_buffer =
+    reinterpret_cast<volatile tt_l1_ptr uint32_t*>(GET_MAILBOX_ADDRESS_DEV(profiler.buffer));
 void kernel_launch(uint32_t kernel_base_addr) {
 #if defined(DEBUG_NULL_KERNELS) && !defined(DISPATCH_KERNEL)
     wait_for_go_message();
@@ -51,7 +54,9 @@ void kernel_launch(uint32_t kernel_base_addr) {
 #endif
     wait_for_go_message();
     DeviceZoneScopedMainChildN("NCRISC-KERNEL");
+    profiler_data_buffer[4] = p_reg[0];
     kernel_main();
+    profiler_data_buffer[7] = p_reg[1];
     if constexpr (NOC_MODE == DM_DEDICATED_NOC) {
         WAYPOINT("NKFW");
         // Assert that no noc transactions are outstanding, to ensure that all reads and writes have landed and the NOC

read_block hangs here as well. This is after a fresh tt-smi reset.

#0  __memmove_avx_unaligned_erms () at ../sysdeps/x86_64/multiarch/memmove-vec-unaligned-erms.S:385
#1  0x00007b6f7f536fc0 in tt::umd::TTDevice::read_block(unsigned long, unsigned long, unsigned char*) () from /proj_sw/mmemarian/tt-metal/build_Release/lib/libdevice.so
#2  0x00007b6f7f4a9e17 in tt::umd::Cluster::read_device_memory(void*, tt::umd::cxy_pair, unsigned long, unsigned int, std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> > const&) () from /proj_sw/mmemarian/tt-metal/build_Release/lib/libdevice.so
#3  0x00007b6f7fa63743 in tt::Cluster::read_core(void*, unsigned int, tt::umd::cxy_pair, unsigned long, bool) const () from /proj_sw/mmemarian/tt-metal/build_Release/lib/libtt_metal.so
#4  0x00007b6f7f924ac6 in tt::tt_metal::SystemMemoryManager::fetch_queue_reserve_back(unsigned char) () from /proj_sw/mmemarian/tt-metal/build_Release/lib/libtt_metal.so
#5  0x00007b6f7f9251ac in void tt::tt_metal::buffer_dispatch::issue_buffer_dispatch_command_sequence<tt::tt_metal::buffer_dispatch::ShardedBufferWriteDispatchParams>(void const*, tt::tt_metal::v0::Buffer&, tt::tt_metal::buffer_dispatch::ShardedBufferWriteDispatchParams&, tt::stl::Span<tt::tt_metal::SubDeviceId const, 18446744073709551615ul>, CoreType) () from /proj_sw/mmemarian/tt-metal/build_Release/lib/libtt_metal.so
#6  0x00007b6f7f922bac in tt::tt_metal::buffer_dispatch::write_sharded_buffer_to_core(void const*, unsigned int, tt::tt_metal::v0::Buffer&, tt::tt_metal::buffer_dispatch::ShardedBufferWriteDispatchParams&, tt::tt_metal::buffer_dispatch::BufferDispatchConstants const&, tt::stl::Span<tt::tt_metal::SubDeviceId const, 18446744073709551615ul>, tt::umd::xy_pair, CoreType) ()
   from /proj_sw/mmemarian/tt-metal/build_Release/lib/libtt_metal.so
#7  0x00007b6f7f922f3f in tt::tt_metal::buffer_dispatch::write_to_device_buffer(void const*, tt::tt_metal::v0::Buffer&, tt::tt_metal::v0::BufferRegion const&, unsigned int, tt::stl::Span<unsigned int const, 18446744073709551615ul>, CoreType, tt::stl::Span<tt::tt_metal::SubDeviceId const, 18446744073709551615ul>) () from /proj_sw/mmemarian/tt-metal/build_Release/lib/libtt_metal.so
#8  0x00007b6f7f997977 in tt::tt_metal::HWCommandQueue::enqueue_write_buffer(tt::tt_metal::v0::Buffer&, void const*, tt::tt_metal::v0::BufferRegion const&, bool, tt::stl::Span<tt::tt_metal::SubDeviceId const, 18446744073709551615ul>) () from /proj_sw/mmemarian/tt-metal/build_Release/lib/libtt_metal.so
#9  0x00007b6f7f997786 in tt::tt_metal::HWCommandQueue::enqueue_write_buffer(std::__1::variant<std::__1::reference_wrapper<tt::tt_metal::v0::Buffer>, std::__1::shared_ptr<tt::tt_metal::v0::Buffer> > const&, std::__1::variant<std::__1::shared_ptr<std::__1::vector<unsigned char, std::__1::allocator<unsigned char> > > const, std::__1::shared_ptr<std::__1::vector<unsigned short, std::__1::allocator<unsigned short> > > const, std::__1::shared_ptr<std::__1::vector<int, std::__1::allocator<int> > > const, std::__1::shared_ptr<std::__1::vector<unsigned int, std::__1::allocator<unsigned int> > > const, std::__1::shared_ptr<std::__1::vector<float, std::__1::allocator<float> > > const, std::__1::shared_ptr<std::__1::vector<bfloat16, std::__1::allocator<bfloat16> > > const, void const*>, tt::tt_metal::v0::BufferRegion const&, bool, tt::stl::Span<tt::tt_metal::SubDeviceId const, 18446744073709551615ul>) () from /proj_sw/mmemarian/tt-metal/build_Release/lib/libtt_metal.so
#10 0x00007b6f7f992d67 in tt::tt_metal::v0::EnqueueWriteSubBuffer(tt::tt_metal::CommandQueue&, std::__1::variant<std::__1::reference_wrapper<tt::tt_metal::v0::Buffer>, std::__1::shared_ptr<tt::tt_metal::v0::Buffer> > const&, std::__1::variant<std::__1::shared_ptr<std::__1::vector<unsigned char, std::__1::allocator<unsigned char> > > const, std::__1::shared_ptr<std::__1::vector<unsigned short, std::__1::allocator<unsigned short> > > const, std::__1::shared_ptr<std::__1::vector<int, std::__1::allocator<int> > > const, std::__1::shared_ptr<std::__1::vector<unsigned int, std::__1::allocator<unsigned int> > > const, std::__1::shared_ptr<std::__1::vector<float, std::__1::allocator<float> > > const, std::__1::shared_ptr<std::__1::vector<bfloat16, std::__1::allocator<bfloat16> > > const, void const*>, tt::tt_metal::v0::BufferRegion const&, bool) ()
   from /proj_sw/mmemarian/tt-metal/build_Release/lib/libtt_metal.so
#11 0x00007b6f7f992992 in tt::tt_metal::v0::EnqueueWriteBuffer(tt::tt_metal::CommandQueue&, std::__1::variant<std::__1::reference_wrapper<tt::tt_metal::v0::Buffer>, std::__1::shared_ptr<tt::tt_metal::v0::Buffer> > const&, std::__1::variant<std::__1::shared_ptr<std::__1::vector<unsigned char, std::__1::allocator<unsigned char> > > const, std::__1::shared_ptr<std::__1::vector<unsigned short, std::__1::allocator<unsigned short> > > const, std::__1::shared_ptr<std::__1::vector<int, std::__1::allocator<int> > > const, std::__1::shared_ptr<std::__1::vector<unsigned int, std::__1::allocator<unsigned int> > > const, std::__1::shared_ptr<std::__1::vector<float, std::__1::allocator<float> > > const, std::__1::shared_ptr<std::__1::vector<bfloat16, std::__1::allocator<bfloat16> > > const, void const*>, bool) ()
   from /proj_sw/mmemarian/tt-metal/build_Release/lib/libtt_metal.so

@mo-tenstorrent

Just doing register reads into local memory also causes a hang, but the reset behaviour changes. With the following patch, the run hangs on read_block at layer 4, module 2 every time, with no tt-smi reset needed in between:

diff --git a/tt_metal/hw/firmware/src/brisck.cc b/tt_metal/hw/firmware/src/brisck.cc
index bac6b3c611..0a1b8550e1 100644
--- a/tt_metal/hw/firmware/src/brisck.cc
+++ b/tt_metal/hw/firmware/src/brisck.cc
@@ -21,6 +21,8 @@
 #include "remote_circular_buffer_api.h"
 #endif

+volatile tt_reg_ptr uint32_t* p_reg = reinterpret_cast<volatile tt_reg_ptr uint32_t*>(RISCV_DEBUG_REG_WALL_CLOCK_L);
+volatile uint32_t test;
 void kernel_launch(uint32_t kernel_base_addr) {
 #if defined(DEBUG_NULL_KERNELS) && !defined(DISPATCH_KERNEL)
     wait_for_go_message();
@@ -43,7 +45,15 @@ void kernel_launch(uint32_t kernel_base_addr) {
     wait_for_go_message();
     {
         DeviceZoneScopedMainChildN("BRISC-KERNEL");
+        test = p_reg[0];
+        test = p_reg[0];
+        test = p_reg[0];
+        test = p_reg[0];
         kernel_main();
+        test = p_reg[0];
+        test = p_reg[0];
+        test = p_reg[0];
+        test = p_reg[0];
     }
 #endif
 }
diff --git a/tt_metal/hw/firmware/src/ncrisck.cc b/tt_metal/hw/firmware/src/ncrisck.cc
index d0809c0dbf..57fcc5365d 100644
--- a/tt_metal/hw/firmware/src/ncrisck.cc
+++ b/tt_metal/hw/firmware/src/ncrisck.cc
@@ -29,6 +29,10 @@ uint32_t noc_nonposted_writes_acked[NUM_NOCS];
 uint32_t noc_nonposted_atomics_acked[NUM_NOCS];
 uint32_t noc_posted_writes_num_issued[NUM_NOCS];

+volatile tt_reg_ptr uint32_t* p_reg = reinterpret_cast<volatile tt_reg_ptr uint32_t*>(RISCV_DEBUG_REG_WALL_CLOCK_L);
+//volatile tt_l1_ptr uint32_t* profiler_data_buffer =
+    //reinterpret_cast<volatile tt_l1_ptr uint32_t*>(GET_MAILBOX_ADDRESS_DEV(profiler.buffer));
+volatile uint32_t test;
 void kernel_launch(uint32_t kernel_base_addr) {
 #if defined(DEBUG_NULL_KERNELS) && !defined(DISPATCH_KERNEL)
     wait_for_go_message();
@@ -51,7 +55,9 @@ void kernel_launch(uint32_t kernel_base_addr) {
 #endif
     wait_for_go_message();
     DeviceZoneScopedMainChildN("NCRISC-KERNEL");
+    test = p_reg[0];
     kernel_main();
+    test = p_reg[0];
     if constexpr (NOC_MODE == DM_DEDICATED_NOC) {
         WAYPOINT("NKFW");
         // Assert that no noc transactions are outstanding, to ensure that all reads and writes have landed and the NOC

@mo-tenstorrent
Contributor

mo-tenstorrent commented Feb 20, 2025

All data points above this comment were done on BH-30 Device 1.
On BH-30 Device 0 it is confirmed that a single reg read between wait_for_go and kernel_main causes the hang on the UMD read_block. This is on the warmup run, layer 4 module 2 conv 2.

diff --git a/tt_metal/hw/firmware/src/ncrisck.cc b/tt_metal/hw/firmware/src/ncrisck.cc
index d0809c0dbf..a6fd9a7255 100644
--- a/tt_metal/hw/firmware/src/ncrisck.cc
+++ b/tt_metal/hw/firmware/src/ncrisck.cc
@@ -29,6 +29,10 @@ uint32_t noc_nonposted_writes_acked[NUM_NOCS];
 uint32_t noc_nonposted_atomics_acked[NUM_NOCS];
 uint32_t noc_posted_writes_num_issued[NUM_NOCS];

+volatile tt_reg_ptr uint32_t* p_reg = reinterpret_cast<volatile tt_reg_ptr uint32_t*>(RISCV_DEBUG_REG_WALL_CLOCK_L);
+volatile uint32_t test;
 void kernel_launch(uint32_t kernel_base_addr) {
 #if defined(DEBUG_NULL_KERNELS) && !defined(DISPATCH_KERNEL)
     wait_for_go_message();
@@ -51,6 +55,7 @@ void kernel_launch(uint32_t kernel_base_addr) {
 #endif
     wait_for_go_message();
     DeviceZoneScopedMainChildN("NCRISC-KERNEL");
+    test = p_reg[0];
     kernel_main();
     if constexpr (NOC_MODE == DM_DEDICATED_NOC) {
         WAYPOINT("NKFW");

@mo-tenstorrent
Contributor

mo-tenstorrent commented Feb 20, 2025

On BH-30 Device 0 it is confirmed that 18 nop cycles can also induce a hang on umd read block. This is on layer 4 module 3 conv 3.

Diff:

diff --git a/tt_metal/hw/firmware/src/ncrisck.cc b/tt_metal/hw/firmware/src/ncrisck.cc
index d0809c0dbf..6a90bf10e5 100644
--- a/tt_metal/hw/firmware/src/ncrisck.cc
+++ b/tt_metal/hw/firmware/src/ncrisck.cc
@@ -29,6 +29,8 @@ uint32_t noc_nonposted_writes_acked[NUM_NOCS];
 uint32_t noc_nonposted_atomics_acked[NUM_NOCS];
 uint32_t noc_posted_writes_num_issued[NUM_NOCS];

 void kernel_launch(uint32_t kernel_base_addr) {
 #if defined(DEBUG_NULL_KERNELS) && !defined(DISPATCH_KERNEL)
     wait_for_go_message();
@@ -51,6 +53,11 @@ void kernel_launch(uint32_t kernel_base_addr) {
 #endif
     wait_for_go_message();
     DeviceZoneScopedMainChildN("NCRISC-KERNEL");
+    constexpr int CYCLE_BURN_COUNT = 18;
+#pragma GCC unroll 65534
+    for (int j = 0; j < CYCLE_BURN_COUNT; j++) {
+        asm volatile ("nop");
+    }
     kernel_main();
     if constexpr (NOC_MODE == DM_DEDICATED_NOC) {
         WAYPOINT("NKFW");

@mo-tenstorrent
Contributor

mo-tenstorrent commented Feb 20, 2025

Stalling 10 cycles instead of 18 causes a hang on finish instead of read_block. After the finish hang, the next run crashes on FW init.

#0  __futex_abstimed_wait_common64 (private=0, cancel=true, abstime=0x0, op=393, expected=0,
    futex_word=0x5c09b03cd5f8) at ./nptl/futex-internal.c:57
#1  __futex_abstimed_wait_common (cancel=true, private=0, abstime=0x0, clockid=0, expected=0,
    futex_word=0x5c09b03cd5f8) at ./nptl/futex-internal.c:87
#2  __GI___futex_abstimed_wait_cancelable64 (futex_word=futex_word@entry=0x5c09b03cd5f8,
    expected=expected@entry=0, clockid=clockid@entry=0, abstime=abstime@entry=0x0,
    private=private@entry=0) at ./nptl/futex-internal.c:139
#3  0x00007fe794fd4a41 in __pthread_cond_wait_common (abstime=0x0, clockid=0, mutex=0x5c09b03cd600,
    cond=0x5c09b03cd5d0) at ./nptl/pthread_cond_wait.c:503
#4  ___pthread_cond_wait (cond=0x5c09b03cd5d0, mutex=0x5c09b03cd600) at ./nptl/pthread_cond_wait.c:627
#5  0x00007fe77564e09f in std::__1::condition_variable::wait(std::__1::unique_lock<std::__1::mutex>&)
    () from /lib/x86_64-linux-gnu/libc++.so.1
#6  0x00007fe734d986eb in tt::tt_metal::HWCommandQueue::finish(tt::stl::Span<tt::tt_metal::SubDeviceId const, 18446744073709551615ul>) () from /proj_sw/mmemarian/tt-metal/build_Release/lib/libtt_metal.so

Next run crash:

(python_env) mmemarian@bh-30-special-mmemarian-for-reservation-14277:/proj_sw/mmemarian/tt-metal$ pytest "tests/ttnn/integration_tests/resnet/test_ttnn_functional_resnet50.py::test_resnet_50[pretrained_weight_false-batch_size=16-act_dtype=DataType.BFLOAT8_B-weight_dtype=DataType.BFLOAT8_B-math_fidelity=MathFidelity.LoFi-device_params={'l1_small_size': 24576}]"
2025-02-20 16:38:51.175 | DEBUG    | ttnn:<module>:83 - Initial ttnn.CONFIG:
Config{cache_path=/home/mmemarian/.cache/ttnn,model_cache_path=/home/mmemarian/.cache/ttnn/models,tmp_dir=/tmp/ttnn,enable_model_cache=false,enable_fast_runtime_mode=true,throw_exception_on_fallback=false,enable_logging=false,enable_graph_report=false,enable_detailed_buffer_report=false,enable_detailed_tensor_report=false,enable_comparison_mode=false,comparison_mode_should_raise_exception=false,comparison_mode_pcc=0.9999,root_report_path=generated/ttnn/reports,report_name=std::nullopt,std::nullopt}
2025-02-20 16:38:51.494 | DEBUG    | ttnn.decorators:operation_decorator:807 - Should ttnn.pearson_correlation_coefficient be migrated to C++?
2025-02-20 16:38:51.495 | DEBUG    | ttnn.decorators:operation_decorator:807 - Should ttnn.Conv1d be migrated to C++?
2025-02-20 16:38:51.496 | DEBUG    | ttnn.decorators:operation_decorator:807 - Should ttnn.conv2d be migrated to C++?
2025-02-20 16:38:51.496 | DEBUG    | ttnn.decorators:operation_decorator:807 - Should ttnn.unsqueeze_to_4D be migrated to C++?
2025-02-20 16:38:51.496 | DEBUG    | ttnn.decorators:operation_decorator:807 - Should ttnn.from_torch be migrated to C++?
2025-02-20 16:38:51.496 | DEBUG    | ttnn.decorators:operation_decorator:807 - Should ttnn.to_torch be migrated to C++?
2025-02-20 16:38:51.497 | DEBUG    | ttnn.decorators:operation_decorator:807 - Should ttnn.to_device be migrated to C++?
2025-02-20 16:38:51.497 | DEBUG    | ttnn.decorators:operation_decorator:807 - Should ttnn.from_device be migrated to C++?
2025-02-20 16:38:51.497 | DEBUG    | ttnn.decorators:operation_decorator:807 - Should ttnn.allocate_tensor_on_device be migrated to C++?
2025-02-20 16:38:51.497 | DEBUG    | ttnn.decorators:operation_decorator:807 - Should ttnn.copy_host_to_device_tensor be migrated to C++?
2025-02-20 16:38:51.497 | DEBUG    | ttnn.decorators:operation_decorator:807 - Should ttnn.deallocate be migrated to C++?
2025-02-20 16:38:51.497 | DEBUG    | ttnn.decorators:operation_decorator:807 - Should ttnn.reallocate be migrated to C++?
2025-02-20 16:38:51.497 | DEBUG    | ttnn.decorators:operation_decorator:807 - Should ttnn.load_tensor be migrated to C++?
2025-02-20 16:38:51.497 | DEBUG    | ttnn.decorators:operation_decorator:807 - Should ttnn.dump_tensor be migrated to C++?
2025-02-20 16:38:51.497 | DEBUG    | ttnn.decorators:operation_decorator:807 - Should ttnn.as_tensor be migrated to C++?
2025-02-20 16:38:51.510 | DEBUG    | ttnn.decorators:operation_decorator:807 - Should ttnn.conv_transpose2d be migrated to C++?
2025-02-20 16:38:51.513 | DEBUG    | ttnn.decorators:operation_decorator:807 - Should ttnn.conv2d be migrated to C++?
2025-02-20 16:38:51.513 | DEBUG    | ttnn.decorators:operation_decorator:807 - Should ttnn.Conv1d be migrated to C++?
================================================================================================================= test session starts =================================================================================================================
platform linux -- Python 3.10.12, pytest-7.2.2, pluggy-1.5.0 -- /proj_sw/mmemarian/tt-metal/python_env/bin/python3
cachedir: .pytest_cache
rootdir: /proj_sw/mmemarian/tt-metal, configfile: pytest.ini
plugins: xdist-3.6.1, split-0.8.2, dash-2.15.0, timeout-2.2.0, anyio-4.8.0
timeout: 300.0s
timeout method: signal
timeout func_only: False
collecting 1 item                                                                                                                                                                                                                                     2025-02-20 16:38:53.201 | WARNING  | tests.ttnn.conftest:pytest_collection_modifyitems:32 - Fast Runtime Mode is ON. Skipping tests tagged with @pytest.mark.requires_fast_runtime_mode_off
collected 1 item

tests/ttnn/integration_tests/resnet/test_ttnn_functional_resnet50.py::test_resnet_50[pretrained_weight_false-batch_size=16-act_dtype=DataType.BFLOAT8_B-weight_dtype=DataType.BFLOAT8_B-math_fidelity=MathFidelity.LoFi-device_params={'l1_small_size': 24576}] 2025-02-20 16:38:53.203 | DEBUG    | ttnn:manage_config:91 - Set ttnn.CONFIG.report_name to tests/ttnn/integration_tests/resnet/test_ttnn_functional_resnet50.py::test_resnet_50[pretrained_weight_false-batch_size=16-act_dtype=DataType.BFLOAT8_B-weight_dtype=DataType.BFLOAT8_B-math_fidelity=MathFidelity.LoFi-device_params={'l1_small_size': 24576}]: 2025-02-20 16:38:53 (UTC)
                 Device | INFO     | Opening user mode device driver
  Detecting chips (found 1)
2025-02-20 16:38:53.224 | INFO     | SiliconDriver   - Opened PCI device 0; KMD version: 1.29.0, IOMMU: disabled
2025-02-20 16:38:53.430 | INFO     | SiliconDriver   - Detected PCI devices: [0]
2025-02-20 16:38:53.430 | INFO     | SiliconDriver   - Using local chip ids: {0} and remote chip ids {}
2025-02-20 16:38:53.499 | INFO     | SiliconDriver   - Device: 0 Mapped iATU region 0 from 0x0 to 0x3fffffff to 0x180000000
                  Metal | INFO     | Initializing device 0. Program cache is NOT enabled
                 Device | INFO     | For Blackhole hardcode AICLK to 800 MHz due to lack of ARC message support
                  Metal | INFO     | AI CLK for device 0 is:   800 MHz
                 Always | FATAL    | Device 0: Timeout (10000 ms) waiting for physical cores to finish: (x=5,y=5), (x=13,y=5).
                 Always | FATAL    | Device 0 init: failed to initialize FW! Try resetting the board.
libc++abi: terminating due to uncaught exception of type std::runtime_error: TT_THROW @ /proj_sw/mmemarian/tt-metal/tt_metal/impl/device/device.cpp:831: tt::exception
info:
Device 0 init: failed to initialize FW! Try resetting the board.
backtrace:

@mo-tenstorrent
Contributor

On BH-30 Device 0, just enabling watcher with only waypoints turned on causes the hang.

Waypoints suggest that wait_for_next_context in _llk_unpack_tilize_ is where worker cores are stuck.

Need to further investigate whether this is the LLK itself having an issue, or whether it has been misused by kernels and now eventually gets stuck.

Added waypoint diff:

diff --git a/llk_lib/llk_unpack_tilize.h b/llk_lib/llk_unpack_tilize.h
index 7de5e16..9d7bcfd 100644
--- a/llk_lib/llk_unpack_tilize.h
+++ b/llk_lib/llk_unpack_tilize.h
@@ -109,8 +109,10 @@ inline void _llk_unpack_tilize_(const std::uint32_t base_address, const std::uin
     // Clear z/w start counters
     TTI_SETADCZW(0b001, 0, 0, 0, 0, 0b1111);

+    WAYPOINT("WFCW");
     // Wait for free context
     wait_for_next_context(2);
+    WAYPOINT("WFCD");

     // Get tile address
     if (0 == unp_cfg_context) {

Watcher Sample:

evice 0 worker core(x=12,y= 7) virtual(x=15,y= 9): CRBW,CRBW,WFCW,MWDD,   R  rmsg:D0G|BNT h_id:18 smsg:GGGG k_ids:49|50|51
Device 0 worker core(x=13,y= 7) virtual(x=16,y= 9):   GW,   W,   W,   W,   W  rmsg:H0D|bnt h_id:0 smsg:DDDD k_ids:0|0|0
Device 0 worker core(x= 0,y= 8) virtual(x= 1,y=10): CRBW,CRBW,WFCW,MWDD,   R  rmsg:D0G|BNT h_id:18 smsg:GGGG k_ids:49|50|51
Device 0 worker core(x= 1,y= 8) virtual(x= 2,y=10): CRBW,CRBW,WFCW,MWDD,   R  rmsg:D0G|BNT h_id:18 smsg:GGGG k_ids:49|50|51
Device 0 worker core(x= 2,y= 8) virtual(x= 3,y=10): CRBW,CRBW,WFCW,MWDD,   R  rmsg:D0G|BNT h_id:18 smsg:GGGG k_ids:49|50|51
Device 0 worker core(x= 3,y= 8) virtual(x= 4,y=10): CRBW,CRBW,WFCW,MWDD,   R  rmsg:D0G|BNT h_id:18 smsg:GGGG k_ids:49|50|51
Device 0 worker core(x= 4,y= 8) virtual(x= 5,y=10): CRBW,CRBW,WFCW,MWDD,   R  rmsg:D0G|BNT h_id:18 smsg:GGGG k_ids:49|50|51
Device 0 worker core(x= 5,y= 8) virtual(x= 6,y=10): CRBW,CRBW,WFCW,MWDD,   R  rmsg:D0G|BNT h_id:18 smsg:GGGG k_ids:49|50|51
Device 0 worker core(x= 6,y= 8) virtual(x= 7,y=10): CRBW,CRBW,WFCW,MWDD,   R  rmsg:D0G|BNT h_id:18 smsg:GGGG k_ids:49|50|51
Device 0 worker core(x= 7,y= 8) virtual(x=10,y=10): CRBW,CRBW,WFCW,MWDD,   R  rmsg:D0G|BNT h_id:18 smsg:GGGG k_ids:49|50|51

@mo-tenstorrent
Contributor

mo-tenstorrent commented Feb 21, 2025

Watcher-only-enabled hangs are non-deterministic. The same setup as above can also hang as below:

Host is hung on finish

#0  __futex_abstimed_wait_common64 (private=0, cancel=true, abstime=0x0, op=393, expected=0, futex_word=0x580d29d8f808)
    at ./nptl/futex-internal.c:57
#1  __futex_abstimed_wait_common (cancel=true, private=0, abstime=0x0, clockid=0, expected=0, futex_word=0x580d29d8f808)
    at ./nptl/futex-internal.c:87
#2  __GI___futex_abstimed_wait_cancelable64 (futex_word=futex_word@entry=0x580d29d8f808, expected=expected@entry=0, clockid=clockid@entry=0,
    abstime=abstime@entry=0x0, private=private@entry=0) at ./nptl/futex-internal.c:139
#3  0x00007203da8a3a41 in __pthread_cond_wait_common (abstime=0x0, clockid=0, mutex=0x580d29d8f810, cond=0x580d29d8f7e0)
    at ./nptl/pthread_cond_wait.c:503
#4  ___pthread_cond_wait (cond=0x580d29d8f7e0, mutex=0x580d29d8f810) at ./nptl/pthread_cond_wait.c:627
#5  0x00007203bae4e09f in std::__1::condition_variable::wait(std::__1::unique_lock<std::__1::mutex>&) ()
   from /lib/x86_64-linux-gnu/libc++.so.1
#6  0x000072037a5986eb in tt::tt_metal::HWCommandQueue::finish(tt::stl::Span<tt::tt_metal::SubDeviceId const, 18446744073709551615ul>) ()
   from /proj_sw/mmemarian/tt-metal/build_Release/lib/libtt_metal.so
#7  0x000072037a597591 in tt::tt_metal::HWCommandQueue::enqueue_read_buffer(tt::tt_metal::v0::Buffer&, void*, tt::tt_metal::v0::BufferRegion const&, bool, tt::stl::Span<tt::tt_metal::SubDeviceId const, 18446744073709551615ul>) ()
   from /proj_sw/mmemarian/tt-metal/build_Release/lib/libtt_metal.so
#8  0x000072037a593a5b in decltype(auto) std::__1::__variant_detail::__visitation::__base::__dispatcher<1ul>::__dispatch[abi:ue170006]<std::__1::__variant_detail::__visitation::__variant::__value_visitor<tt::tt_metal::v0::EnqueueReadSubBuffer(tt::tt_metal::CommandQueue&, std::__1::variant<std::__1::reference_wrapper<tt::tt_metal::v0::Buffer>, std::__1::shared_ptr<tt::tt_metal::v0::Buffer> >, void*, tt::tt_metal::v0::BufferRegion const&, bool)::$_0>&&, std::__1::__variant_detail::__base<(std::__1::__variant_detail::_Trait)1, std::__1::reference_wrapper<tt::tt_metal::v0::Buffer>, std::__1::shared_ptr<tt::tt_metal::v0::Buffer> >&>(std::__1::__variant_detail::__visitation::__variant::__value_visitor<tt::tt_metal::v0::EnqueueReadSubBuffer(tt::tt_metal::CommandQueue&, std::__1::variant<std::__1::reference_wrapper<tt::tt_metal::v0::Buffer>, std::__1::shared_ptr<tt::tt_metal::v0::Buffer> >, void*, tt::tt_metal::v0::BufferRegion const&, bool)::$_0>&&, std::__1::__variant_detail::__base<(std::__1::__variant_detail::_Trait)1, std::__1::reference_wrapper<tt::tt_metal::v0::Buffer>, std::__1::shared_ptr<tt::tt_metal::v0::Buffer> >&) () from /proj_sw/mmemarian/tt-metal/build_Release/lib/libtt_metal.so
#9  0x000072037a592be1 in tt::tt_metal::v0::EnqueueReadSubBuffer(tt::tt_metal::CommandQueue&, std::__1::variant<std::__1::reference_wrapper<tt::tt_metal::v0::Buffer>, std::__1::shared_ptr<tt::tt_metal::v0::Buffer> >, void*, tt::tt_metal::v0::BufferRegion const&, bool) ()
   from /proj_sw/mmemarian/tt-metal/build_Release/lib/libtt_metal.so
#10 0x000072037a592a92 in tt::tt_metal::v0::EnqueueReadBuffer(tt::tt_metal::CommandQueue&, std::__1::variant<std::__1::reference_wrapper<tt::tt_metal::v0::Buffer>, std::__1::shared_ptr<tt::tt_metal::v0::Buffer> > const&, void*, bool) ()
   from /proj_sw/mmemarian/tt-metal/build_Release/lib/libtt_metal.so
#11 0x000072037d2807b7 in tt::tt_metal::Tensor tt::tt_metal::tensor_impl::to_host_helper<bfloat16>(tt::tt_metal::Tensor const&, bool, unsigned char) () from /proj_sw/mmemarian/tt-metal/ttnn/ttnn/_ttnn.so
#12 0x000072037d2802c2 in tt::tt_metal::Tensor tt::tt_metal::tensor_impl::to_host<bfloat16>(tt::tt_metal::Tensor const&, bool, unsigned char)
    () from /proj_sw/mmemarian/tt-metal/ttnn/ttnn/_ttnn.so
#13 0x000072037d33b1ae in _ZN2tt8tt_metal11tensor_impl8dispatchIZNS1_15to_host_wrapperIJRNS0_6TensorERbRhEEEDaDpOT_EUlTyDpOT0_E_JS5_S6_S7_EEEDaNS0_8DataTypeEOT_SD_ () from /proj_sw/mmemarian/tt-metal/ttnn/ttnn/_ttnn.so
--Type <RET> for more, q to quit, c to continue without paging--

Watcher log shows that worker cores are running kernels. Dispatch cores are stuck waiting for workers to finish.

Most likely in this scenario, workers are stuck in a loop that is not known to watcher.

Dump #95 at 105.228s
Device 0 worker core(x= 0,y= 0) virtual(x= 1,y= 2):    R,   R,   R,   R,   R  rmsg:D0D|BNT h_id:277 smsg:GGGG k_ids:41|40|43
Device 0 worker core(x= 1,y= 0) virtual(x= 2,y= 2):    R,   R,   R,   R,   R  rmsg:D0D|BNT h_id:277 smsg:GGGG k_ids:42|40|43
Device 0 worker core(x= 2,y= 0) virtual(x= 3,y= 2):    R,   R,   R,   R,   R  rmsg:D0D|BNT h_id:277 smsg:GGGG k_ids:42|40|43
Device 0 worker core(x= 3,y= 0) virtual(x= 4,y= 2):    R,   R,   R,   R,   R  rmsg:D0D|BNT h_id:277 smsg:GGGG k_ids:42|40|43
Device 0 worker core(x= 4,y= 0) virtual(x= 5,y= 2):    R,   R,   R,   R,   R  rmsg:D0D|BNT h_id:277 smsg:GGGG k_ids:42|40|43
Device 0 worker core(x= 5,y= 0) virtual(x= 6,y= 2):    R,   R,   R,   R,   R  rmsg:D0D|BNT h_id:277 smsg:GGGG k_ids:42|40|43
Device 0 worker core(x= 6,y= 0) virtual(x= 7,y= 2):    R,   R,   R,   R,   R  rmsg:D0D|BNT h_id:277 smsg:GGGG k_ids:42|40|43
Device 0 worker core(x= 7,y= 0) virtual(x=10,y= 2):    R,   R,   R,   R,   R  rmsg:D0D|BNT h_id:277 smsg:GGGG k_ids:42|40|43
Device 0 worker core(x= 8,y= 0) virtual(x=11,y= 2):    R,   R,   R,   R,   R  rmsg:D0D|BNT h_id:277 smsg:GGGG k_ids:42|40|43
Device 0 worker core(x= 9,y= 0) virtual(x=12,y= 2):    R,   R,   R,   R,   R  rmsg:D0D|BNT h_id:277 smsg:GGGG k_ids:42|40|43
Device 0 worker core(x=10,y= 0) virtual(x=13,y= 2):    R,   R,   R,   R,   R  rmsg:D0D|BNT h_id:277 smsg:GGGG k_ids:42|40|43
Device 0 worker core(x=11,y= 0) virtual(x=14,y= 2):    R,   R,   R,   R,   R  rmsg:D0D|BNT h_id:277 smsg:GGGG k_ids:42|40|43
Device 0 worker core(x=12,y= 0) virtual(x=15,y= 2):    R,   R,   R,   R,   R  rmsg:D0D|BNT h_id:277 smsg:GGGG k_ids:42|40|43
Device 0 worker core(x=13,y= 0) virtual(x=16,y= 2):  NTW,DAPW,   W,   W,   W  rmsg:H1G|bNt h_id:0 smsg:GDDD k_ids:0|1|0
Device 0 worker core(x= 0,y= 1) virtual(x= 1,y= 3):    R,   R,   R,   R,   R  rmsg:D0D|BNT h_id:277 smsg:GGGG k_ids:42|40|43
Device 0 worker core(x= 1,y= 1) virtual(x= 2,y= 3):    R,   R,   R,   R,   R  rmsg:D0D|BNT h_id:277 smsg:GGGG k_ids:42|40|43
Device 0 worker core(x= 2,y= 1) virtual(x= 3,y= 3):    R,   R,   R,   R,   R  rmsg:D0D|BNT h_id:277 smsg:GGGG k_ids:42|40|43
Device 0 worker core(x= 3,y= 1) virtual(x= 4,y= 3):    R,   R,   R,   R,   R  rmsg:D0D|BNT h_id:277 smsg:GGGG k_ids:42|40|43
Device 0 worker core(x= 4,y= 1) virtual(x= 5,y= 3):    R,   R,   R,   R,   R  rmsg:D0D|BNT h_id:277 smsg:GGGG k_ids:42|40|43
Device 0 worker core(x= 5,y= 1) virtual(x= 6,y= 3):    R,   R,   R,   R,   R  rmsg:D0D|BNT h_id:277 smsg:GGGG k_ids:42|40|43
Device 0 worker core(x= 6,y= 1) virtual(x= 7,y= 3):    R,   R,   R,   R,   R  rmsg:D0D|BNT h_id:277 smsg:GGGG k_ids:42|40|43
Device 0 worker core(x= 7,y= 1) virtual(x=10,y= 3):    R,   R,   R,   R,   R  rmsg:D0D|BNT h_id:277 smsg:GGGG k_ids:42|40|43
Device 0 worker core(x= 8,y= 1) virtual(x=11,y= 3):    R,   R,   R,   R,   R  rmsg:D0D|BNT h_id:277 smsg:GGGG k_ids:42|40|43
Device 0 worker core(x= 9,y= 1) virtual(x=12,y= 3):    R,   R,   R,   R,   R  rmsg:D0D|BNT h_id:277 smsg:GGGG k_ids:42|40|43
Device 0 worker core(x=10,y= 1) virtual(x=13,y= 3):    R,   R,   R,   R,   R  rmsg:D0D|BNT h_id:277 smsg:GGGG k_ids:42|40|43
Device 0 worker core(x=11,y= 1) virtual(x=14,y= 3):    R,   R,   R,   R,   R  rmsg:D0D|BNT h_id:277 smsg:GGGG k_ids:42|40|43
Device 0 worker core(x=12,y= 1) virtual(x=15,y= 3):    R,   R,   R,   R,   R  rmsg:D0D|BNT h_id:277 smsg:GGGG k_ids:42|40|43
Device 0 worker core(x=13,y= 1) virtual(x=16,y= 3): DAPD, PWW,   W,   W,   W  rmsg:H1G|BNt h_id:0 smsg:GDDD k_ids:3|2|0
Device 0 worker core(x= 0,y= 2) virtual(x= 1,y= 4):    R,   R,   R,   R,   R  rmsg:D0D|BNT h_id:277 smsg:GGGG k_ids:42|40|43
Device 0 worker core(x= 1,y= 2) virtual(x= 2,y= 4):    R,   R,   R,   R,   R  rmsg:D0D|BNT h_id:277 smsg:GGGG k_ids:42|40|43
Device 0 worker core(x= 2,y= 2) virtual(x= 3,y= 4):    R,   R,   R,   R,   R  rmsg:D0D|BNT h_id:277 smsg:GGGG k_ids:42|40|43
Device 0 worker core(x= 3,y= 2) virtual(x= 4,y= 4):    R,   R,   R,   R,   R  rmsg:D0D|BNT h_id:277 smsg:GGGG k_ids:42|40|43
Device 0 worker core(x= 4,y= 2) virtual(x= 5,y= 4):    R,   R,   R,   R,   R  rmsg:D0D|BNT h_id:277 smsg:GGGG k_ids:42|40|43
Device 0 worker core(x= 5,y= 2) virtual(x= 6,y= 4):    R,   R,   R,   R,   R  rmsg:D0D|BNT h_id:277 smsg:GGGG k_ids:42|40|43
Device 0 worker core(x= 6,y= 2) virtual(x= 7,y= 4):    R,   R,   R,   R,   R  rmsg:D0D|BNT h_id:277 smsg:GGGG k_ids:42|40|43
Device 0 worker core(x= 7,y= 2) virtual(x=10,y= 4):    R,   R,   R,   R,   R  rmsg:D0D|BNT h_id:277 smsg:GGGG k_ids:42|40|43
Device 0 worker core(x= 8,y= 2) virtual(x=11,y= 4):    R,   R,   R,   R,   R  rmsg:D0D|BNT h_id:277 smsg:GGGG k_ids:42|40|43
Device 0 worker core(x= 9,y= 2) virtual(x=12,y= 4):    R,   R,   R,   R,   R  rmsg:D0D|BNT h_id:277 smsg:GGGG k_ids:42|40|43
Device 0 worker core(x=10,y= 2) virtual(x=13,y= 4):    R,   R,   R,   R,   R  rmsg:D0D|BNT h_id:277 smsg:GGGG k_ids:42|40|43
Device 0 worker core(x=11,y= 2) virtual(x=14,y= 4):    R,   R,   R,   R,   R  rmsg:D0D|BNT h_id:277 smsg:GGGG k_ids:42|40|43
Device 0 worker core(x=12,y= 2) virtual(x=15,y= 4): CWFW,   W,   R,MWDD,   R  rmsg:D0G|BNT h_id:276 smsg:DGGG k_ids:38|36|39
Device 0 worker core(x=13,y= 2) virtual(x=16,y= 4):   GW,   W,   W,   W,   W  rmsg:H0D|bnt h_id:0 smsg:DDDD k_ids:0|0|0
Device 0 worker core(x= 0,y= 3) virtual(x= 1,y= 5):    R,   R,   R,   R,   R  rmsg:D0D|BNT h_id:277 smsg:GGGG k_ids:42|40|43
Device 0 worker core(x= 1,y= 3) virtual(x= 2,y= 5):    R,   R,   R,   R,   R  rmsg:D0D|BNT h_id:277 smsg:GGGG k_ids:42|40|43
Device 0 worker core(x= 2,y= 3) virtual(x= 3,y= 5):    R,   R,   R,   R,   R  rmsg:D0D|BNT h_id:277 smsg:GGGG k_ids:42|40|43
Device 0 worker core(x= 3,y= 3) virtual(x= 4,y= 5):    R,   R,   R,   R,   R  rmsg:D0D|BNT h_id:277 smsg:GGGG k_ids:42|40|43
Device 0 worker core(x= 4,y= 3) virtual(x= 5,y= 5):    R,   R,   R,   R,   R  rmsg:D0D|BNT h_id:277 smsg:GGGG k_ids:42|40|43
Device 0 worker core(x= 5,y= 3) virtual(x= 6,y= 5):    R,   R,   R,   R,   R  rmsg:D0D|BNT h_id:277 smsg:GGGG k_ids:42|40|43
Device 0 worker core(x= 6,y= 3) virtual(x= 7,y= 5):    R,   R,   R,   R,   R  rmsg:D0D|BNT h_id:277 smsg:GGGG k_ids:42|40|43
Device 0 worker core(x= 7,y= 3) virtual(x=10,y= 5):    R,   R,   R,   R,   R  rmsg:D0D|BNT h_id:277 smsg:GGGG k_ids:42|40|43
Device 0 worker core(x= 8,y= 3) virtual(x=11,y= 5):    R,   R,   R,   R,   R  rmsg:D0D|BNT h_id:277 smsg:GGGG k_ids:42|40|43
Device 0 worker core(x= 9,y= 3) virtual(x=12,y= 5):    R,   R,   R,   R,   R  rmsg:D0D|BNT h_id:277 smsg:GGGG k_ids:42|40|43
Device 0 worker core(x=10,y= 3) virtual(x=13,y= 5):    R,   R,   R,   R,   R  rmsg:D0D|BNT h_id:277 smsg:GGGG k_ids:42|40|43
Device 0 worker core(x=11,y= 3) virtual(x=14,y= 5):    R,   R,   R,   R,   R  rmsg:D0D|BNT h_id:277 smsg:GGGG k_ids:42|40|43
Device 0 worker core(x=12,y= 3) virtual(x=15,y= 5):    R,   R,   R,   R,   R  rmsg:D0D|BNT h_id:277 smsg:GGGG k_ids:42|40|43
Device 0 worker core(x=13,y= 3) virtual(x=16,y= 5):   GW,   W,   W,   W,   W  rmsg:H0D|bnt h_id:0 smsg:DDDD k_ids:0|0|0
Device 0 worker core(x= 0,y= 4) virtual(x= 1,y= 6):    R,   R,   R,   R,   R  rmsg:D0D|BNT h_id:277 smsg:GGGG k_ids:42|40|43
Device 0 worker core(x= 1,y= 4) virtual(x= 2,y= 6):    R,   R,   R,   R,   R  rmsg:D0D|BNT h_id:277 smsg:GGGG k_ids:42|40|43
Device 0 worker core(x= 2,y= 4) virtual(x= 3,y= 6):    R,   R,   R,   R,   R  rmsg:D0D|BNT h_id:277 smsg:GGGG k_ids:42|40|43
Device 0 worker core(x= 3,y= 4) virtual(x= 4,y= 6):    R,   R,   R,   R,   R  rmsg:D0D|BNT h_id:277 smsg:GGGG k_ids:42|40|43
Device 0 worker core(x= 4,y= 4) virtual(x= 5,y= 6):    R,   R,   R,   R,   R  rmsg:D0D|BNT h_id:277 smsg:GGGG k_ids:42|40|43
Device 0 worker core(x= 5,y= 4) virtual(x= 6,y= 6):    R,   R,   R,   R,   R  rmsg:D0D|BNT h_id:277 smsg:GGGG k_ids:42|40|43
Device 0 worker core(x= 6,y= 4) virtual(x= 7,y= 6):    R,   R,   R,   R,   R  rmsg:D0D|BNT h_id:277 smsg:GGGG k_ids:42|40|43
Device 0 worker core(x= 7,y= 4) virtual(x=10,y= 6):    R,   R,   R,   R,   R  rmsg:D0D|BNT h_id:277 smsg:GGGG k_ids:42|40|43
Device 0 worker core(x= 8,y= 4) virtual(x=11,y= 6):    R,   R,   R,   R,   R  rmsg:D0D|BNT h_id:277 smsg:GGGG k_ids:42|40|43
Device 0 worker core(x= 9,y= 4) virtual(x=12,y= 6):    R,   R,   R,   R,   R  rmsg:D0D|BNT h_id:277 smsg:GGGG k_ids:42|40|43
Device 0 worker core(x=10,y= 4) virtual(x=13,y= 6):    R,   R,   R,   R,   R  rmsg:D0D|BNT h_id:277 smsg:GGGG k_ids:42|40|43
Device 0 worker core(x=11,y= 4) virtual(x=14,y= 6):    R,   R,   R,   R,   R  rmsg:D0D|BNT h_id:277 smsg:GGGG k_ids:42|40|43
Device 0 worker core(x=12,y= 4) virtual(x=15,y= 6):    R,   R,   R,   R,   R  rmsg:D0D|BNT h_id:277 smsg:GGGG k_ids:42|40|43
Device 0 worker core(x=13,y= 4) virtual(x=16,y= 6):   GW,   W,   W,   W,   W  rmsg:H0D|bnt h_id:0 smsg:DDDD k_ids:0|0|0
Device 0 worker core(x= 0,y= 5) virtual(x= 1,y= 7):    R,   R,   R,   R,   R  rmsg:D0D|BNT h_id:277 smsg:GGGG k_ids:42|40|43
Device 0 worker core(x= 1,y= 5) virtual(x= 2,y= 7):    R,   R,   R,   R,   R  rmsg:D0D|BNT h_id:277 smsg:GGGG k_ids:42|40|43
Device 0 worker core(x= 2,y= 5) virtual(x= 3,y= 7):    R,   R,   R,   R,   R  rmsg:D0D|BNT h_id:277 smsg:GGGG k_ids:42|40|43
Device 0 worker core(x= 3,y= 5) virtual(x= 4,y= 7):    R,   R,   R,   R,   R  rmsg:D0D|BNT h_id:277 smsg:GGGG k_ids:42|40|43
Device 0 worker core(x= 4,y= 5) virtual(x= 5,y= 7):    R,   R,   R,   R,   R  rmsg:D0D|BNT h_id:277 smsg:GGGG k_ids:42|40|43
Device 0 worker core(x= 5,y= 5) virtual(x= 6,y= 7):    R,   R,   R,   R,   R  rmsg:D0D|BNT h_id:277 smsg:GGGG k_ids:42|40|43
Device 0 worker core(x= 6,y= 5) virtual(x= 7,y= 7):    R,   R,   R,   R,   R  rmsg:D0D|BNT h_id:277 smsg:GGGG k_ids:42|40|43
Device 0 worker core(x= 7,y= 5) virtual(x=10,y= 7):    R,   R,   R,   R,   R  rmsg:D0D|BNT h_id:277 smsg:GGGG k_ids:42|40|43
Device 0 worker core(x= 8,y= 5) virtual(x=11,y= 7):    R,   R,   R,   R,   R  rmsg:D0D|BNT h_id:277 smsg:GGGG k_ids:42|40|43
Device 0 worker core(x= 9,y= 5) virtual(x=12,y= 7):    R,   R,   R,   R,   R  rmsg:D0D|BNT h_id:277 smsg:GGGG k_ids:42|40|43
Device 0 worker core(x=10,y= 5) virtual(x=13,y= 7):    R,   R,   R,   R,   R  rmsg:D0D|BNT h_id:277 smsg:GGGG k_ids:42|40|43
Device 0 worker core(x=11,y= 5) virtual(x=14,y= 7):    R,   R,   R,   R,   R  rmsg:D0D|BNT h_id:277 smsg:GGGG k_ids:42|40|43
Device 0 worker core(x=12,y= 5) virtual(x=15,y= 7):    R,   R,   R,   R,   R  rmsg:D0D|BNT h_id:277 smsg:GGGG k_ids:42|40|43
Device 0 worker core(x=13,y= 5) virtual(x=16,y= 7):   GW,   W,   W,   W,   W  rmsg:H0D|bnt h_id:0 smsg:DDDD k_ids:0|0|0
Device 0 worker core(x= 0,y= 6) virtual(x= 1,y= 8):    R,   R,   R,   R,   R  rmsg:D0D|BNT h_id:277 smsg:GGGG k_ids:42|40|43
Device 0 worker core(x= 1,y= 6) virtual(x= 2,y= 8):    R,   R,   R,   R,   R  rmsg:D0D|BNT h_id:277 smsg:GGGG k_ids:42|40|43
Device 0 worker core(x= 2,y= 6) virtual(x= 3,y= 8):    R,   R,   R,   R,   R  rmsg:D0D|BNT h_id:277 smsg:GGGG k_ids:42|40|43
Device 0 worker core(x= 3,y= 6) virtual(x= 4,y= 8):    R,   R,   R,   R,   R  rmsg:D0D|BNT h_id:277 smsg:GGGG k_ids:42|40|43
Device 0 worker core(x= 4,y= 6) virtual(x= 5,y= 8):    R,   R,   R,   R,   R  rmsg:D0D|BNT h_id:277 smsg:GGGG k_ids:42|40|43
Device 0 worker core(x= 5,y= 6) virtual(x= 6,y= 8):    R,   R,   R,   R,   R  rmsg:D0D|BNT h_id:277 smsg:GGGG k_ids:42|40|43
Device 0 worker core(x= 6,y= 6) virtual(x= 7,y= 8):    R,   R,   R,   R,   R  rmsg:D0D|BNT h_id:277 smsg:GGGG k_ids:42|40|43
Device 0 worker core(x= 7,y= 6) virtual(x=10,y= 8):    R,   R,   R,   R,   R  rmsg:D0D|BNT h_id:277 smsg:GGGG k_ids:42|40|43
Device 0 worker core(x= 8,y= 6) virtual(x=11,y= 8):    R,   R,   R,   R,   R  rmsg:D0D|BNT h_id:277 smsg:GGGG k_ids:42|40|43
Device 0 worker core(x= 9,y= 6) virtual(x=12,y= 8):    R,   R,   R,   R,   R  rmsg:D0D|BNT h_id:277 smsg:GGGG k_ids:42|40|43
Device 0 worker core(x=10,y= 6) virtual(x=13,y= 8):    R,   R,   R,   R,   R  rmsg:D0D|BNT h_id:277 smsg:GGGG k_ids:42|40|43
Device 0 worker core(x=11,y= 6) virtual(x=14,y= 8):    R,   R,   R,   R,   R  rmsg:D0D|BNT h_id:277 smsg:GGGG k_ids:42|40|43
Device 0 worker core(x=12,y= 6) virtual(x=15,y= 8):    R,   R,   R,   R,   R  rmsg:D0D|BNT h_id:277 smsg:GGGG k_ids:42|40|43
Device 0 worker core(x=13,y= 6) virtual(x=16,y= 8):   GW,   W,   W,   W,   W  rmsg:H0D|bnt h_id:0 smsg:DDDD k_ids:0|0|0
Device 0 worker core(x= 0,y= 7) virtual(x= 1,y= 9):    R,   R,   R,   R,   R  rmsg:D0D|BNT h_id:277 smsg:GGGG k_ids:42|40|43
Device 0 worker core(x= 1,y= 7) virtual(x= 2,y= 9):    R,   R,   R,   R,   R  rmsg:D0D|BNT h_id:277 smsg:GGGG k_ids:42|40|43
Device 0 worker core(x= 2,y= 7) virtual(x= 3,y= 9):    R,   R,   R,   R,   R  rmsg:D0D|BNT h_id:277 smsg:GGGG k_ids:42|40|43
Device 0 worker core(x= 3,y= 7) virtual(x= 4,y= 9):    R,   R,   R,   R,   R  rmsg:D0D|BNT h_id:277 smsg:GGGG k_ids:42|40|43
Device 0 worker core(x= 4,y= 7) virtual(x= 5,y= 9):    R,   R,   R,   R,   R  rmsg:D0D|BNT h_id:277 smsg:GGGG k_ids:42|40|43
Device 0 worker core(x= 5,y= 7) virtual(x= 6,y= 9):    R,   R,   R,   R,   R  rmsg:D0D|BNT h_id:277 smsg:GGGG k_ids:42|40|43
Device 0 worker core(x= 6,y= 7) virtual(x= 7,y= 9):    R,   R,   R,   R,   R  rmsg:D0D|BNT h_id:277 smsg:GGGG k_ids:42|40|43
Device 0 worker core(x= 7,y= 7) virtual(x=10,y= 9):    R,   R,   R,   R,   R  rmsg:D0D|BNT h_id:277 smsg:GGGG k_ids:42|40|43
Device 0 worker core(x= 8,y= 7) virtual(x=11,y= 9):    R,   R,   R,   R,   R  rmsg:D0D|BNT h_id:277 smsg:GGGG k_ids:42|40|43
Device 0 worker core(x= 9,y= 7) virtual(x=12,y= 9):    R,   R,   R,   R,   R  rmsg:D0D|BNT h_id:277 smsg:GGGG k_ids:42|40|43
Device 0 worker core(x=10,y= 7) virtual(x=13,y= 9):    R,   R,   R,   R,   R  rmsg:D0D|BNT h_id:277 smsg:GGGG k_ids:42|40|43
Device 0 worker core(x=11,y= 7) virtual(x=14,y= 9):    R,   R,   R,   R,   R  rmsg:D0D|BNT h_id:277 smsg:GGGG k_ids:42|40|43
Device 0 worker core(x=12,y= 7) virtual(x=15,y= 9):    R,   R,   R,   R,   R  rmsg:D0D|BNT h_id:277 smsg:GGGG k_ids:42|40|43
Device 0 worker core(x=13,y= 7) virtual(x=16,y= 9):   GW,   W,   W,   W,   W  rmsg:H0D|bnt h_id:0 smsg:DDDD k_ids:0|0|0
Device 0 worker core(x= 0,y= 8) virtual(x= 1,y=10):    R,   R,   R,   R,   R  rmsg:D0D|BNT h_id:277 smsg:GGGG k_ids:42|40|43
Device 0 worker core(x= 1,y= 8) virtual(x= 2,y=10):    R,   R,   R,   R,   R  rmsg:D0D|BNT h_id:277 smsg:GGGG k_ids:42|40|43
Device 0 worker core(x= 2,y= 8) virtual(x= 3,y=10):    R,   R,   R,   R,   R  rmsg:D0D|BNT h_id:277 smsg:GGGG k_ids:42|40|43
Device 0 worker core(x= 3,y= 8) virtual(x= 4,y=10):    R,   R,   R,   R,   R  rmsg:D0D|BNT h_id:277 smsg:GGGG k_ids:42|40|43
Device 0 worker core(x= 4,y= 8) virtual(x= 5,y=10):    R,   R,   R,   R,   R  rmsg:D0D|BNT h_id:277 smsg:GGGG k_ids:42|40|43
Device 0 worker core(x= 5,y= 8) virtual(x= 6,y=10):    R,   R,   R,   R,   R  rmsg:D0D|BNT h_id:277 smsg:GGGG k_ids:42|40|43
Device 0 worker core(x= 6,y= 8) virtual(x= 7,y=10):    R,   R,   R,   R,   R  rmsg:D0D|BNT h_id:277 smsg:GGGG k_ids:42|40|43
Device 0 worker core(x= 7,y= 8) virtual(x=10,y=10):    R,   R,   R,   R,   R  rmsg:D0D|BNT h_id:277 smsg:GGGG k_ids:42|40|43
Device 0 worker core(x= 8,y= 8) virtual(x=11,y=10):   GW,   W,   W,   W,   W  rmsg:D0D|bnt h_id:272 smsg:DDDD k_ids:0|0|0
Device 0 worker core(x= 9,y= 8) virtual(x=12,y=10):   GW,   W,   W,   W,   W  rmsg:D0D|bnt h_id:272 smsg:DDDD k_ids:0|0|0
Device 0 worker core(x=10,y= 8) virtual(x=13,y=10):   GW,   W,   W,   W,   W  rmsg:D0D|bnt h_id:272 smsg:DDDD k_ids:0|0|0
Device 0 worker core(x=11,y= 8) virtual(x=14,y=10):   GW,   W,   W,   W,   W  rmsg:D0D|bnt h_id:272 smsg:DDDD k_ids:0|0|0
Device 0 worker core(x=12,y= 8) virtual(x=15,y=10):   GW,   W,   W,   W,   W  rmsg:D0D|bnt h_id:272 smsg:DDDD k_ids:0|0|0
Device 0 worker core(x=13,y= 8) virtual(x=16,y=10):   GW,   W,   W,   W,   W  rmsg:H0D|bnt h_id:0 smsg:DDDD k_ids:0|0|0
Device 0 worker core(x= 0,y= 9) virtual(x= 1,y=11):   GW,   W,   W,   W,   W  rmsg:D0D|bnt h_id:272 smsg:DDDD k_ids:0|0|0
Device 0 worker core(x= 1,y= 9) virtual(x= 2,y=11):   GW,   W,   W,   W,   W  rmsg:D0D|bnt h_id:272 smsg:DDDD k_ids:0|0|0
Device 0 worker core(x= 2,y= 9) virtual(x= 3,y=11):   GW,   W,   W,   W,   W  rmsg:D0D|bnt h_id:272 smsg:DDDD k_ids:0|0|0
Device 0 worker core(x= 3,y= 9) virtual(x= 4,y=11):   GW,   W,   W,   W,   W  rmsg:D0D|bnt h_id:272 smsg:DDDD k_ids:0|0|0
Device 0 worker core(x= 4,y= 9) virtual(x= 5,y=11):   GW,   W,   W,   W,   W  rmsg:D0D|bnt h_id:272 smsg:DDDD k_ids:0|0|0
Device 0 worker core(x= 5,y= 9) virtual(x= 6,y=11):   GW,   W,   W,   W,   W  rmsg:D0D|bnt h_id:272 smsg:DDDD k_ids:0|0|0
Device 0 worker core(x= 6,y= 9) virtual(x= 7,y=11):   GW,   W,   W,   W,   W  rmsg:D0D|bnt h_id:272 smsg:DDDD k_ids:0|0|0
Device 0 worker core(x= 7,y= 9) virtual(x=10,y=11):   GW,   W,   W,   W,   W  rmsg:D0D|bnt h_id:272 smsg:DDDD k_ids:0|0|0
Device 0 worker core(x= 8,y= 9) virtual(x=11,y=11):   GW,   W,   W,   W,   W  rmsg:D0D|bnt h_id:272 smsg:DDDD k_ids:0|0|0
Device 0 worker core(x= 9,y= 9) virtual(x=12,y=11):   GW,   W,   W,   W,   W  rmsg:D0D|bnt h_id:272 smsg:DDDD k_ids:0|0|0
Device 0 worker core(x=10,y= 9) virtual(x=13,y=11):   GW,   W,   W,   W,   W  rmsg:D0D|bnt h_id:272 smsg:DDDD k_ids:0|0|0
Device 0 worker core(x=11,y= 9) virtual(x=14,y=11):   GW,   W,   W,   W,   W  rmsg:D0D|bnt h_id:248 smsg:DDDD k_ids:0|0|0
Device 0 worker core(x=12,y= 9) virtual(x=15,y=11):   GW,   W,   W,   W,   W  rmsg:D0D|bnt h_id:248 smsg:DDDD k_ids:0|0|0
Device 0 worker core(x=13,y= 9) virtual(x=16,y=11):   GW,   W,   W,   W,   W  rmsg:H0D|bnt h_id:0 smsg:DDDD k_ids:0|0|0
k_id[0]: blank
k_id[1]: tt_metal/impl/dispatch/kernels/cq_prefetch.cpp
k_id[2]: tt_metal/impl/dispatch/kernels/cq_dispatch.cpp
k_id[3]: tt_metal/impl/dispatch/kernels/cq_dispatch_slave.cpp
k_id[36]: ttnn/cpp/ttnn/operations/matmul/device/kernels/dataflow/reader_bmm_tile_layout_in0_sender_padding.cpp
k_id[38]: ttnn/cpp/ttnn/operations/matmul/device/kernels/dataflow/reader_bmm_tile_layout_in1_receiver_writer_padding.cpp
k_id[39]: ttnn/cpp/ttnn/operations/matmul/device/kernels/compute/bmm_large_block_zm_fused_bias_activation.cpp
k_id[40]: ttnn/cpp/ttnn/operations/matmul/device/kernels/dataflow/reader_bmm_tile_layout_in0_sender_padding.cpp
k_id[41]: ttnn/cpp/ttnn/operations/matmul/device/kernels/dataflow/reader_bmm_tile_layout_in1_sender_writer_padding.cpp
k_id[42]: ttnn/cpp/ttnn/operations/matmul/device/kernels/dataflow/reader_bmm_tile_layout_in1_receiver_writer_padding.cpp
k_id[43]: ttnn/cpp/ttnn/operations/matmul/device/kernels/compute/bmm_large_block_zm_fused_bias_activation.cpp
Dump #95 completed at 105.296s

@ejouretTT

Single core problem – Op problem or LLK problem.

Next step is for @mywoodstock to take a look. We need to narrow this down to a reproducible case.

@ejouretTT

@ttmtrajkovic to assign someone from LLK team to dive into this.

@mo-tenstorrent
Contributor

Tried the following workarounds from @nvelickovicTT and we still hang.

Worker cores are stuck waiting for a NOC command buffer to become available.

Watcher log

Device 0 worker core(x=13,y= 6) virtual(x=16,y= 8):   GW,   W,   W,   W,   W  rmsg:H0D|bnt h_id:0 smsg:DDDD k_ids:0|0|0
Device 0 worker core(x= 0,y= 7) virtual(x= 1,y= 9):  NSW,   W,   R,MWDD,   R  rmsg:D0G|BNT h_id:15 smsg:DGGG k_ids:42|40|43
Device 0 worker core(x= 1,y= 7) virtual(x= 2,y= 9):  NSW,   W,   R,MWDD,   R  rmsg:D0G|BNT h_id:15 smsg:DGGG k_ids:42|40|43
Device 0 worker core(x= 2,y= 7) virtual(x= 3,y= 9):  NSW,   W,   R,MWDD,   R  rmsg:D0G|BNT h_id:15 smsg:DGGG k_ids:42|40|43
Device 0 worker core(x= 3,y= 7) virtual(x= 4,y= 9):  NSW,   W,   R,MWDD,   R  rmsg:D0G|BNT h_id:15 smsg:DGGG k_ids:42|40|43
Device 0 worker core(x= 4,y= 7) virtual(x= 5,y= 9):  NSW,   W,   R,MWDD,   R  rmsg:D0G|BNT h_id:15 smsg:DGGG k_ids:42|40|43
Device 0 worker core(x= 5,y= 7) virtual(x= 6,y= 9):  NSW,   W,   R,MWDD,   R  rmsg:D0G|BNT h_id:15 smsg:DGGG k_ids:42|40|43
Device 0 worker core(x= 6,y= 7) virtual(x= 7,y= 9):  NSW,   W,   R,MWDD,   R  rmsg:D0G|BNT h_id:15 smsg:DGGG k_ids:42|40|43
Device 0 worker core(x= 7,y= 7) virtual(x=10,y= 9):  NSW,   W,   R,MWDD,   R  rmsg:D0G|BNT h_id:15 smsg:DGGG k_ids:42|40|43
Device 0 worker core(x= 8,y= 7) virtual(x=11,y= 9):  NSW,   W,   R,MWDD,   R  rmsg:D0G|BNT h_id:15 smsg:DGGG k_ids:42|40|43
Device 0 worker core(x= 9,y= 7) virtual(x=12,y= 9):  NSW,   W,   R,MWDD,   R  rmsg:D0G|BNT h_id:15 smsg:DGGG k_ids:42|40|43
Device 0 worker core(x=10,y= 7) virtual(x=13,y= 9):  NSW,   W,   R,MWDD,   R  rmsg:D0G|BNT h_id:15 smsg:DGGG k_ids:42|40|43
Device 0 worker core(x=11,y= 7) virtual(x=14,y= 9):  NSW,   W,   R,MWDD,   R  rmsg:D0G|BNT h_id:15 smsg:DGGG k_ids:42|40|43
Device 0 worker core(x=12,y= 7) virtual(x=15,y= 9):  NSW,   W,   R,MWDD,   R  rmsg:D0G|BNT h_id:15 smsg:DGGG k_ids:42|40|43

BH_hang_workaround_metal.patch
BH_hang_workaround_llk.patch

@amahmudTT amahmudTT self-assigned this Feb 21, 2025
@amahmudTT
Contributor

I will be looking at this issue from the LLK side @ejouretTT

@amahmudTT
Contributor

amahmudTT commented Feb 22, 2025

After going through the issue, my thoughts are: since we have faced these types of issues before and found the root cause (as in SDPA and MATMUL), our procedure has been the same: find a small test case on a single core that fails deterministically and simulate it. If it is not small and not on a single core, it is impossible to simulate, since simulating just 2 matmul tiles twice takes around 6 minutes; simulating the whole model on multiple cores would be impossible. Is it possible to provide us with a repro on a single core rather than on the full model?

@mo-tenstorrent
Contributor

We made many attempts to isolate the issue into a small single-core test. Unfortunately, all were unfruitful. So far, reproducing this hang has required running this particular sequence of ops.

@mywoodstock do you have any other idea on how we might be able to better isolate this?

@amahmudTT
Contributor

@mo-tenstorrent The original branch in the issue has been deleted; could you share your latest branch and repro command?
Thanks!

@amahmudTT
Contributor

amahmudTT commented Feb 25, 2025

Applied the compiler arg to 11e3906 on mo/bh_model_test.
It does not seem to remove the hang, but this change, along with disabling gathering, tends to move the hang to downsampling instead of conv2.

   1.43 Kbps /138.5% =   0.00 Mbps | Tx: 137 KB | 64.14 MB | 2:21.62025-02-25 12:01:45.549 | DEBUG    | models.demos.ttnn_resnet.tt.ttnn_functional_resnet50:run_downsample_if_req:168 - Running downsample
   1.43 Kbps /138.5% =   0.00 Mbps | Tx: 255.56 KB | 64.22 MB | 13:40.2

@mo-tenstorrent
Contributor

Pushed the original branch back.

@ejouretTT

@amahmudTT To reduce test cases to further identify root cause.

@amahmudTT
Contributor

amahmudTT commented Feb 26, 2025

Needed to restart after failure; without a restart, recompiling and running hangs no matter what.

The workaround of

  1. calling disable_gathering() in trisc.cc
  2. commenting out all disable_gathering() and enable_gathering() calls

seems to postpone the hang but not fix it.

The compiler change allows many more tests to pass (maybe all; I am confused by the ending timer message, which could just be a dangling timer).

 2025-02-26 06:05:13.495 | DEBUG    | ttnn:manage_config:90 - Set ttnn.CONFIG.report_name to tests/ttnn/integration_tests/resnet/test_ttnn_functional_resnet50.py::test_resnet_50[pretrained_weight_false-batch_size=16-act_dtype=DataType.BFLOAT8_B-weight_dtype=DataType.BFLOAT8_B-math_fidelity=MathFidelity.LoFi-device_params={'l1_small_size': 24576}]: 2025-02-26 06:05:13 (UTC)
                 Device | INFO     | Opening user mode device driver
  Detecting chips (found 1)
   7.80 Kbps /100.0% =   0.01 Mbps | Tx: 1664 bytes | 64 MB | 8.04 s2025-02-26 06:05:13.543 | INFO     | SiliconDriver   - Opened PCI device 0; KMD version: 1.29.0, IOMMU: disabled
2025-02-26 06:05:13.620 | INFO     | SiliconDriver   - Detected PCI devices: [0]
2025-02-26 06:05:13.620 | INFO     | SiliconDriver   - Using local chip ids: {0} and remote chip ids {}
   1.43 Kbps /138.5% =   0.00 Mbps | Tx: 1754 bytes | 64 MB | 8.65 s2025-02-26 06:05:14.136 | INFO     | SiliconDriver   - Device: 0 Mapped iATU region 0 from 0x0 to 0x3fffffff to 0x200000000
                  Metal | INFO     | Initializing device 0. Program cache is NOT enabled
                 Device | INFO     | For Blackhole hardcode AICLK to 800 MHz due to lack of ARC message support
                  Metal | INFO     | AI CLK for device 0 is:   800 MHz
   1.43 Kbps /138.5% =   0.00 Mbps | Tx: 6325 bytes | 64.01 MB | 21.25 s                 Always | FATAL    | Event Order Issue: expected to read back completion signal for event 4 but got 640258074!
libc++abi: terminating due to uncaught exception of type std::runtime_error: TT_ASSERT @ /proj_sw/user_dev/amahmud/tt-metal/tt_metal/impl/dispatch/hardware_command_queue.cpp:648: event_completed == read_descriptor.event_id
info:
Event Order Issue: expected to read back completion signal for event 4 but got 640258074!

This is the ending message. The timer keeps on going; it could just be a dangling timer. Need to confirm.
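For reference, the "Event Order Issue" assert quoted above enforces in-order completion events: each completion record read back from the device must carry the event id the host expects next. A minimal standalone sketch of that check (the function name here is illustrative, not the actual tt-metal API):

```cpp
#include <cstdint>
#include <stdexcept>
#include <string>

// Illustrative sketch of the in-order completion check behind the
// "Event Order Issue" assert in hardware_command_queue.cpp.
// check_event_order is a hypothetical name, not the tt-metal function.
void check_event_order(uint32_t expected_id, uint32_t read_back_id) {
    if (read_back_id != expected_id) {
        throw std::runtime_error(
            "Event Order Issue: expected to read back completion signal for event " +
            std::to_string(expected_id) + " but got " +
            std::to_string(read_back_id) + "!");
    }
}
```

A garbage value like the 640258074 seen above would trip this check, which is why a corrupted or out-of-order completion queue surfaces as this exact fatal.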

2025-02-26 06:21:05.951 | INFO     | models.demos.ttnn_resnet.tests.resnet50_test_infra:validate:339 - ResNet50 batch_size=16, act_dtype=DataType.BFLOAT8_B, weight_dtype=DataType.BFLOAT8_B, math_fidelity=MathFidelity.LoFi, PCC=0.9982199016078069
PASSED                  Metal | INFO     | Disabling and clearing program cache on device 0
                 Device | INFO     | For Blackhole hardcode AICLK to 800 MHz due to lack of ARC message support
   1.43 Kbps /138.5% =   0.00 Mbps | Tx: 865.35 KB | 64.87 MB | 10:06.6

@pgkeller

@amahmudTT
Contributor

Disabling the L1 cache while keeping the other changes (disable/enable_gathering() and the compiler option) still allows the test to run fully (with the timer still going at the end).

@ncvetkovicTT
Contributor

@amahmudTT Thank you! Could you please provide the branch and the command that you're running, either here or in the table, Row 2 Column G (in the note body)? Also, can you please confirm the results that I put in the table, Column G?
https://docs.google.com/spreadsheets/d/1i0bfPhjZcUI1ce_CTE44a-VMnvquQjY-kRPAb5oMXow/edit?gid=0#gid=0

@amahmudTT
Contributor

@amahmudTT Thank you, could you please provide the branch and the command that you're running either here or in the table, Row2ColumnG (in the note body)? Also, can you please confirm the results that I put in the table, ColumnG? https://docs.google.com/spreadsheets/d/1i0bfPhjZcUI1ce_CTE44a-VMnvquQjY-kRPAb5oMXow/edit?gid=0#gid=0

updated

@amahmudTT
Contributor

Confirmed with Mo: the timer at the end was not a hang; it was the profiler, which kept on writing through another thread. So the above modifications do remove the hangs.

@ncvetkovicTT
Contributor

Alright, thanks! I believe, however, that we're going to use another workaround, #10 from the table, which protects CSR writes and adds an SFPI compiler flag.

@ejouretTT

The workaround for #16439 seems to close out this issue and #18065. We still have an open issue affecting the multicore SDPA case. We need to root-cause further to guarantee the issue is in the Tensix and not the NOC.

@ejouretTT

Let's make sure that the workaround works with the compiler issue before closing, @nathan-TT.

@mo-tenstorrent
Contributor

mo-tenstorrent commented Feb 27, 2025

Note the following in case you want to reproduce the hang and test our discovered solutions.

On IRD machines, Weka is slow, so generating large device profiler CSVs can slow things down dramatically. Point the generated folder to localdev with a symlink:

sudo mkdir /localdev/{username}/profiler_generated
sudo chmod 777 /localdev/{username}/profiler_generated
cd $TT_METAL_HOME
rm -rf generated
ln -s /localdev/{username}/profiler_generated generated

With the above done, please run the following on profiler builds (i.e. build_metal.sh -p):

TT_METAL_DEVICE_PROFILER=1 pytest "tests/ttnn/integration_tests/resnet/test_ttnn_functional_resnet50.py::test_resnet_50[pretrained_weight_false-batch_size=16-act_dtype=DataType.BFLOAT8_B-weight_dtype=DataType.BFLOAT8_B-math_fidelity=MathFidelity.LoFi-device_params={'l1_small_size': 24576}]"

It will result in the same hang on that branch if our discovered solutions are not applied.

@ncvetkovicTT
Contributor

ncvetkovicTT commented Feb 27, 2025

@mo-tenstorrent So wait, the fix from this PR doesn't help in your case?

ncvetkovicTT added a commit that referenced this issue Feb 27, 2025
### Ticket
#18064 #18065 #17099 #16673

### Problem description
Disabling the instruction cache doesn't happen in time for a subsequent TTI
instruction not to end up in the cache. In order to properly disable I$,
we need to disable branch prediction first. Since reprogramming the
REPLAY buffers needs to happen while the cache is disabled, the SFPI compiler
cannot rely on REPLAY buffers. These things introduce multiple matmul
hangs.

### What's changed
- Guard CSRRS by disabling branch prediction in BH ckernel.h
- Add a compiler flag for BH which makes the SFPI compiler not use
REPLAY buffers
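The first change is, at its core, an ordering invariant: the branch-prediction disable must precede the CSRRS write that toggles the instruction cache, and the re-enable must follow it. A minimal host-side sketch of that invariant, with hypothetical function names standing in for the real ckernel.h routines (the real guard is a few RISC-V CSR instructions, not C++ calls):

```cpp
#include <string>
#include <vector>

// Sketch of the ordering invariant described above: on Blackhole, branch
// prediction must be off before the CSRRS write that toggles the I$, so a
// subsequent TTI instruction cannot land in the cache. All names here are
// illustrative stand-ins, not ckernel.h functions; each records into a
// trace so the ordering can be checked off-device.
std::vector<std::string> trace;

void disable_branch_prediction() { trace.push_back("bp_off"); }
void enable_branch_prediction()  { trace.push_back("bp_on"); }
void csrrs_toggle_icache()       { trace.push_back("csrrs"); }

void guarded_icache_toggle() {
    disable_branch_prediction();  // guard: must precede the CSR write
    csrrs_toggle_icache();        // the CSRRS being protected
    enable_branch_prediction();   // restore after the write
}
```

Calling `guarded_icache_toggle()` yields the trace `bp_off, csrrs, bp_on`; the hangs came from performing the middle step without the surrounding guard.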

### Checklist
- [x] [All post
commit](https://github.com/tenstorrent/tt-metal/actions/workflows/all-post-commit-workflows.yaml)
CI passes -
[26473](https://github.com/tenstorrent/tt-metal/actions/runs/13569121473/job/37929720035)
- expected to fail
- [x] [Blackhole Post
commit](https://github.com/tenstorrent/tt-metal/actions/workflows/blackhole-post-commit.yaml)
CI passes (if applicable) -
[3768](https://github.com/tenstorrent/tt-metal/actions/runs/13550312504)
- [x] [Model
regression](https://github.com/tenstorrent/tt-metal/actions/workflows/perf-models.yaml)
CI passes (if applicable) -
[7715](https://github.com/tenstorrent/tt-metal/actions/runs/13550316991)
- [x] [Device performance
regression](https://github.com/tenstorrent/tt-metal/actions/workflows/perf-device-models.yaml)
CI passes (if applicable) -
[5094](https://github.com/tenstorrent/tt-metal/actions/runs/13567170826),
expected to fail
- [ ] **(For models and ops writers)** Full [new models
tests](https://github.com/tenstorrent/tt-metal/actions/workflows/full-new-models-suite.yaml)
CI passes (if applicable)
- [ ] New/Existing tests provide coverage for changes
@mo-tenstorrent
Contributor

No, it does. This is just a better reproduction step for quicker testing.

@ncvetkovicTT
Contributor

Oh alright, then I guess we can close this issue as well?

@mo-tenstorrent
Contributor

The plan was to hold off on that until Nathan has actually brought the compile-flag switch into main.

@ncvetkovicTT
Contributor

I have done that with this change. Or is there some other change that you guys had in mind which I'm missing here?

@mo-tenstorrent
Contributor

Oh I see, @pgkeller did you have anything else in mind?

@amahmudTT
Contributor

amahmudTT commented Feb 27, 2025

I remember it was mentioned in the meeting that Nathan was probably going to introduce a change to the compiler (which would probably not need the flag for Blackhole) and wanted this issue kept open so that they could test their change against it.

@pgkeller
Contributor

I think we can close this. Nathan's change will be for SFPU only, and we don't have a repro that shows this issue on SFPU.

@mo-tenstorrent
Contributor

mo-tenstorrent commented Feb 27, 2025

Confirmed that on BH-30, on commit 681d3f7 on main (which has the fix), we don't see a hang when we profile ResNet50. This is without using any env vars or mods to the code.
