
Device profiler on ttnn resnet50 is hanging on BH #17099

Closed
mywoodstock opened this issue Jan 24, 2025 · 45 comments


On Blackhole, we initially get a kernel compile error, but with the following patch, it runs:

diff --git a/tt_metal/hw/inc/blackhole/dev_mem_map.h b/tt_metal/hw/inc/blackhole/dev_mem_map.h
index 075edd005c..b97e3c5601 100644
--- a/tt_metal/hw/inc/blackhole/dev_mem_map.h
+++ b/tt_metal/hw/inc/blackhole/dev_mem_map.h
@@ -48,7 +48,7 @@

 /////////////
 // Firmware/kernel code holes
-#define MEM_BRISC_FIRMWARE_SIZE (5 * 1024 + 128)
+#define MEM_BRISC_FIRMWARE_SIZE (5 * 1024 + 256)
 // TODO: perhaps put NCRISC FW in the scratch area and free 1.5K after init (GS/WH)
 #define MEM_NCRISC_FIRMWARE_SIZE 1536
 #define MEM_TRISC0_FIRMWARE_SIZE 1536

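For reference, the size bump above adds 128 bytes of headroom to the BRISC firmware code hole; a minimal sketch of the arithmetic (the constant names here are illustrative, only `MEM_BRISC_FIRMWARE_SIZE` exists in the codebase):

```cpp
#include <cstddef>

// Old vs. new BRISC firmware hole sizes from the patch above.
// (Names are illustrative; the real macro is MEM_BRISC_FIRMWARE_SIZE.)
constexpr std::size_t kOldBriscFwSize = 5 * 1024 + 128;  // 5248 bytes
constexpr std::size_t kNewBriscFwSize = 5 * 1024 + 256;  // 5376 bytes

static_assert(kNewBriscFwSize - kOldBriscFwSize == 128,
              "the patch adds 128 bytes of headroom");
```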
But when trying to profile ResNet50, the run hangs with the profiler enabled. (It passes fine without the profiler.)

Branch: asarje/bh-rn50-20250123
Compile with profiler: ./build_metal.sh -p --debug
Run the model with profiler: python -m tracy -p -r -v -m pytest "\"tests/ttnn/integration_tests/resnet/test_ttnn_functional_resnet50.py::test_resnet_50[pretrained_weight_false-batch_size=16-act_dtype=DataType.BFLOAT8_B-weight_dtype=DataType.BFLOAT8_B-math_fidelity=MathFidelity.LoFi-device_params={'l1_small_size': 24576}]\""

The run will hang.

mo-tenstorrent commented Jan 25, 2025

Per a conversation with Paul, the first thing to check is entirely disabling the L1 data cache, which is set in risc_common.

mo-tenstorrent commented Jan 25, 2025

Confirming that disabling L1 cache fixes the issue.

Hanging run: https://github.com/tenstorrent/tt-metal/actions/runs/12959399845

Passing run: https://github.com/tenstorrent/tt-metal/actions/runs/12966853376/job/36168151536#step:9:407

TT_METAL_DISABLE_L1_DATA_CACHE_RISCVS="BR,NC,TR" python -m tracy -p -r -v -m pytest "\"tests/ttnn/integration_tests/resnet/test_ttnn_functional_resnet50.py::test_resnet_50[pretrained_weight_false-batch_size=16-act_dtype=DataType.BFLOAT8_B-weight_dtype=DataType.BFLOAT8_B-math_fidelity=MathFidelity.LoFi-device_params={'l1_small_size': 24576}]\""
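TT_METAL_DISABLE_L1_DATA_CACHE_RISCVS takes a comma-separated list of core-name prefixes (BR, NC, TR above). As a hedged sketch of how such a value could be split into a set of names (`parse_riscv_list` is a hypothetical helper for illustration, not the actual tt-metal parser):

```cpp
#include <sstream>
#include <string>
#include <unordered_set>

// Hypothetical parser for a "BR,NC,TR"-style list. The names mirror the
// BRISC/NCRISC/TRISC cores, but this helper is illustrative only; the real
// handling lives inside tt-metal's runtime option parsing.
std::unordered_set<std::string> parse_riscv_list(const std::string& value) {
    std::unordered_set<std::string> out;
    std::stringstream ss(value);
    std::string token;
    while (std::getline(ss, token, ',')) {
        if (!token.empty()) {
            out.insert(token);  // keep each non-empty comma-separated entry
        }
    }
    return out;
}
```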

@mywoodstock mywoodstock added the P1 label Jan 28, 2025
@ejouretTT ejouretTT added P0 and removed P1 labels Feb 18, 2025
@ejouretTT

@mo-tenstorrent is trying to narrow the hang down to a single core; still in progress. Will reach out to @mywoodstock to create unit tests around it.
A workaround exists, but no long-term fix. This impacts performance evaluation.
@mo-tenstorrent to rename this issue.

@mo-tenstorrent mo-tenstorrent changed the title Tracy profiler on BH not working Device profiler on ttnn resnet50 is hanging on BH Feb 18, 2025
@ejouretTT

The previous workaround did not fix this; a new solution had to be found. Root cause partially identified: running it in isolation does not cause the hang. @mo-tenstorrent to re-evaluate and provide updates on the root cause.

mo-tenstorrent commented Feb 19, 2025

The root cause is not fully identified yet. However, we can pinpoint a single op, a conv2d, in the model that triggers the hang.
Returning right before that op in the model leads to no hang. Returning right after the op leads to a hang.

@mo-tenstorrent

Setting TT_METAL_DISABLE_L1_DATA_CACHE_RISCVS="BR,NC,TR" is not a workaround for this issue.

A better workaround is applying the following patch:

diff --git a/tt_metal/tools/profiler/kernel_profiler.hpp b/tt_metal/tools/profiler/kernel_profiler.hpp
index 55a95d76ff..f80e2f0b7b 100644
--- a/tt_metal/tools/profiler/kernel_profiler.hpp
+++ b/tt_metal/tools/profiler/kernel_profiler.hpp
@@ -136,6 +136,11 @@ inline __attribute__((always_inline)) bool bufferHasRoom() {
 }

 inline __attribute__((always_inline)) void mark_time_at_index_inlined(uint32_t index, uint32_t timer_id) {
+    constexpr int CYCLE_BURN_COUNT = 3;
+#pragma GCC unroll 65534
+    for (int j = 0; j < CYCLE_BURN_COUNT; j++) {
+        asm volatile ("nop");
+    }
     volatile tt_reg_ptr uint32_t* p_reg = reinterpret_cast<volatile tt_reg_ptr uint32_t*>(RISCV_DEBUG_REG_WALL_CLOCK_L);
     profiler_data_buffer[myRiscID][index] =
         0x80000000 | ((timer_id & 0x7FFFF) << 12) | (p_reg[WALL_CLOCK_HIGH_INDEX] & 0xFFF);

CYCLE_BURN_COUNT here is the smallest count that removes the hang. It may need adjusting on other commits, since timing disturbances can differ between commits.
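For context, the surrounding function packs a marker bit, a 19-bit timer id, and the low 12 bits of the wall-clock high word into a single 32-bit buffer entry. A standalone sketch of that encoding plus the nop burn loop from the patch (`pack_timer_word` and `unpack_timer_id` are illustrative helpers, not tt-metal APIs):

```cpp
#include <cstdint>

// Mirrors the packing in mark_time_at_index_inlined: bit 31 marks a valid
// entry, bits 30:12 hold the 19-bit timer id, and bits 11:0 hold the low
// 12 bits of the wall-clock high word.
inline uint32_t pack_timer_word(uint32_t timer_id, uint32_t wall_clock_high) {
    return 0x80000000u | ((timer_id & 0x7FFFFu) << 12) | (wall_clock_high & 0xFFFu);
}

inline uint32_t unpack_timer_id(uint32_t word) {
    return (word >> 12) & 0x7FFFFu;
}

// The workaround's delay: a fully unrolled run of nops executed before the
// wall-clock register read.
inline void burn_cycles() {
    constexpr int CYCLE_BURN_COUNT = 3;
#pragma GCC unroll 65534
    for (int j = 0; j < CYCLE_BURN_COUNT; j++) {
        asm volatile("nop");
    }
}
```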

mo-tenstorrent commented Feb 19, 2025

Replicating the baseline from two weeks ago:

On a clean non-profiler build the hang can be reproduced with the following changes to the code:

diff --git a/tt_metal/hw/firmware/src/brisck.cc b/tt_metal/hw/firmware/src/brisck.cc
index bac6b3c611..b3055bc698 100644
--- a/tt_metal/hw/firmware/src/brisck.cc
+++ b/tt_metal/hw/firmware/src/brisck.cc
@@ -21,6 +21,9 @@
 #include "remote_circular_buffer_api.h"
 #endif

+volatile tt_reg_ptr uint32_t* p_reg = reinterpret_cast<volatile tt_reg_ptr uint32_t*>(RISCV_DEBUG_REG_WALL_CLOCK_L);
+volatile tt_l1_ptr uint32_t* profiler_data_buffer =
+    reinterpret_cast<volatile tt_l1_ptr uint32_t*>(GET_MAILBOX_ADDRESS_DEV(profiler.buffer));
 void kernel_launch(uint32_t kernel_base_addr) {
 #if defined(DEBUG_NULL_KERNELS) && !defined(DISPATCH_KERNEL)
     wait_for_go_message();
@@ -43,7 +46,15 @@ void kernel_launch(uint32_t kernel_base_addr) {
     wait_for_go_message();
     {
         DeviceZoneScopedMainChildN("BRISC-KERNEL");
+        profiler_data_buffer[8] = p_reg[0];
+        profiler_data_buffer[9] = p_reg[1];
+        profiler_data_buffer[9] = p_reg[1];
+        profiler_data_buffer[9] = p_reg[1];
         kernel_main();
+        profiler_data_buffer[10] = p_reg[0];
+        profiler_data_buffer[11] = p_reg[1];
+        profiler_data_buffer[11] = p_reg[1];
+        profiler_data_buffer[11] = p_reg[1];
     }
 #endif
 }
diff --git a/tt_metal/hw/firmware/src/ncrisck.cc b/tt_metal/hw/firmware/src/ncrisck.cc
index d0809c0dbf..aa32fffb2d 100644
--- a/tt_metal/hw/firmware/src/ncrisck.cc
+++ b/tt_metal/hw/firmware/src/ncrisck.cc
@@ -29,6 +29,9 @@ uint32_t noc_nonposted_writes_acked[NUM_NOCS];
 uint32_t noc_nonposted_atomics_acked[NUM_NOCS];
 uint32_t noc_posted_writes_num_issued[NUM_NOCS];

+volatile tt_reg_ptr uint32_t* p_reg = reinterpret_cast<volatile tt_reg_ptr uint32_t*>(RISCV_DEBUG_REG_WALL_CLOCK_L);
+volatile tt_l1_ptr uint32_t* profiler_data_buffer =
+    reinterpret_cast<volatile tt_l1_ptr uint32_t*>(GET_MAILBOX_ADDRESS_DEV(profiler.buffer));
 void kernel_launch(uint32_t kernel_base_addr) {
 #if defined(DEBUG_NULL_KERNELS) && !defined(DISPATCH_KERNEL)
     wait_for_go_message();
@@ -51,7 +54,13 @@ void kernel_launch(uint32_t kernel_base_addr) {
 #endif
     wait_for_go_message();
     DeviceZoneScopedMainChildN("NCRISC-KERNEL");
+    profiler_data_buffer[4] = p_reg[0];
+    profiler_data_buffer[5] = p_reg[1];
+    profiler_data_buffer[5] = p_reg[1];
+    profiler_data_buffer[5] = p_reg[1];
     kernel_main();
+    profiler_data_buffer[6] = p_reg[0];
+    profiler_data_buffer[7] = p_reg[1];
     if constexpr (NOC_MODE == DM_DEDICATED_NOC) {
         WAYPOINT("NKFW");
         // Assert that no noc transactions are outstanding, to ensure that all reads and writes have landed and the NOC
diff --git a/tt_metal/hw/firmware/src/trisck.cc b/tt_metal/hw/firmware/src/trisck.cc
index 6b4fa96dc7..4dbfd829bc 100644
--- a/tt_metal/hw/firmware/src/trisck.cc
+++ b/tt_metal/hw/firmware/src/trisck.cc
@@ -37,10 +37,15 @@ volatile tt_reg_ptr uint * mailbox_base[4] = {
 };
 }

+volatile tt_reg_ptr uint32_t* p_reg = reinterpret_cast<volatile tt_reg_ptr uint32_t*>(RISCV_DEBUG_REG_WALL_CLOCK_L);
+volatile tt_l1_ptr uint32_t* profiler_data_buffer =
+    reinterpret_cast<volatile tt_l1_ptr uint32_t*>(GET_MAILBOX_ADDRESS_DEV(profiler.buffer));
 void kernel_launch(uint32_t kernel_base_addr) {
 #if defined(DEBUG_NULL_KERNELS) && !defined(DISPATCH_KERNEL)
     wait_for_go_message();
     DeviceZoneScopedMainChildN("TRISC-KERNEL");
+    profiler_data_buffer[0] = p_reg[0];
+    profiler_data_buffer[1] = p_reg[1];
 #ifdef KERNEL_RUN_TIME
     ckernel::wait(KERNEL_RUN_TIME);
 #endif
@@ -59,6 +64,16 @@ void kernel_launch(uint32_t kernel_base_addr) {
 #endif
     wait_for_go_message();
     DeviceZoneScopedMainChildN("TRISC-KERNEL");
+    profiler_data_buffer[0] = p_reg[0];
+    profiler_data_buffer[0] = p_reg[0];
+    profiler_data_buffer[0] = p_reg[0];
+    profiler_data_buffer[0] = p_reg[0];
+    // profiler_data_buffer[1] = p_reg[1];
     run_kernel();
+    // profiler_data_buffer[2] = p_reg[0];
+    profiler_data_buffer[3] = p_reg[1];
+    profiler_data_buffer[3] = p_reg[1];
+    profiler_data_buffer[3] = p_reg[1];
+    profiler_data_buffer[3] = p_reg[1];
 #endif
 }

Several combinations of register reads plus L1 writes had to be tried before the hang appeared.

One note: the reset requirements of the above hang are slightly different from the original hang.
Without a tt-smi reset, rerunning the same test passes the device initialization stage but hangs at a much earlier stage of the model: the warm-up run at layer 1, module 2, instead of the optimized run at layer 4, module 3, where it originally hung.
After a tt-smi reset, the hang goes back to layer 4, module 3.

This difference in reset behaviour might be caused by profiler vs. non-profiler builds: in profiler builds we ask for finish at different times, and we also do direct L1 reads through UMD.

@mo-tenstorrent

Here is the backtrace of the process after the hang:

#0  __futex_abstimed_wait_common64 (private=0, cancel=true, abstime=0x0, op=393, expected=0, futex_word=0x5973cae1dfc8) at ./nptl/futex-internal.c:57
#1  __futex_abstimed_wait_common (cancel=true, private=0, abstime=0x0, clockid=0, expected=0, futex_word=0x5973cae1dfc8) at ./nptl/futex-internal.c:87
#2  __GI___futex_abstimed_wait_cancelable64 (futex_word=futex_word@entry=0x5973cae1dfc8, expected=expected@entry=0, clockid=clockid@entry=0, abstime=abstime@entry=0x0, private=private@entry=0) at ./nptl/futex-internal.c:139
#3  0x000077fabb7dca41 in __pthread_cond_wait_common (abstime=0x0, clockid=0, mutex=0x5973cae1dfd0, cond=0x5973cae1dfa0) at ./nptl/pthread_cond_wait.c:503
#4  ___pthread_cond_wait (cond=0x5973cae1dfa0, mutex=0x5973cae1dfd0) at ./nptl/pthread_cond_wait.c:627
#5  0x000077fa9be4e09f in std::__1::condition_variable::wait(std::__1::unique_lock<std::__1::mutex>&) () from /lib/x86_64-linux-gnu/libc++.so.1
#6  0x000077fa5b5986eb in tt::tt_metal::HWCommandQueue::finish(tt::stl::Span<tt::tt_metal::SubDeviceId const, 18446744073709551615ul>) () from /proj_sw/mmemarian/tt-metal/build_Release/lib/libtt_metal.so
#7  0x000077fa5b597591 in tt::tt_metal::HWCommandQueue::enqueue_read_buffer(tt::tt_metal::v0::Buffer&, void*, tt::tt_metal::v0::BufferRegion const&, bool, tt::stl::Span<tt::tt_metal::SubDeviceId const, 18446744073709551615ul>) () from /proj_sw/mmemarian/tt-metal/build_Release/lib/libtt_metal.so
#8  0x000077fa5b593a5b in decltype(auto) std::__1::__variant_detail::__visitation::__base::__dispatcher<1ul>::__dispatch[abi:ue170006]<std::__1::__variant_detail::__visitation::__variant::__value_visitor<tt::tt_metal::v0::EnqueueReadSubBuffer(tt::tt_metal::CommandQueue&, std::__1::variant<std::__1::reference_wrapper<tt::tt_metal::v0::Buffer>, std::__1::shared_ptr<tt::tt_metal::v0::Buffer> >, void*, tt::tt_metal::v0::BufferRegion const&, bool)::$_0>&&, std::__1::__variant_detail::__base<(std::__1::__variant_detail::_Trait)1, std::__1::reference_wrapper<tt::tt_metal::v0::Buffer>, std::__1::shared_ptr<tt::tt_metal::v0::Buffer> >&>(std::__1::__variant_detail::__visitation::__variant::__value_visitor<tt::tt_metal::v0::EnqueueReadSubBuffer(tt::tt_metal::CommandQueue&, std::__1::variant<std::__1::reference_wrapper<tt::tt_metal::v0::Buffer>, std::__1::shared_ptr<tt::tt_metal::v0::Buffer> >, void*, tt::tt_metal::v0::BufferRegion const&, bool)::$_0>&&, std::__1::__variant_detail::__base<(std::__1::__variant_detail::_Trait)1, std::__1::reference_wrapper<tt::tt_metal::v0::Buffer>, std::__1::shared_ptr<tt::tt_metal::v0::Buffer> >&) () from /proj_sw/mmemarian/tt-metal/build_Release/lib/libtt_metal.so
#9  0x000077fa5b592be1 in tt::tt_metal::v0::EnqueueReadSubBuffer(tt::tt_metal::CommandQueue&, std::__1::variant<std::__1::reference_wrapper<tt::tt_metal::v0::Buffer>, std::__1::shared_ptr<tt::tt_metal::v0::Buffer> >, void*, tt::tt_metal::v0::BufferRegion const&, bool) () from /proj_sw/mmemarian/tt-metal/build_Release/lib/libtt_metal.so
#10 0x000077fa5b592a92 in tt::tt_metal::v0::EnqueueReadBuffer(tt::tt_metal::CommandQueue&, std::__1::variant<std::__1::reference_wrapper<tt::tt_metal::v0::Buffer>, std::__1::shared_ptr<tt::tt_metal::v0::Buffer> > const&, void*, bool) () from /proj_sw/mmemarian/tt-metal/build_Release/lib/libtt_metal.so
#11 0x000077fa5e2807b7 in tt::tt_metal::Tensor tt::tt_metal::tensor_impl::to_host_helper<bfloat16>(tt::tt_metal::Tensor const&, bool, unsigned char) () from /proj_sw/mmemarian/tt-metal/ttnn/ttnn/_ttnn.so
#12 0x000077fa5e2802c2 in tt::tt_metal::Tensor tt::tt_metal::tensor_impl::to_host<bfloat16>(tt::tt_metal::Tensor const&, bool, unsigned char) () from /proj_sw/mmemarian/tt-metal/ttnn/ttnn/_ttnn.so
#13 0x000077fa5e33b1ae in _ZN2tt8tt_metal11tensor_impl8dispatchIZNS1_15to_host_wrapperIJRNS0_6TensorERbRhEEEDaDpOT_EUlTyDpOT0_E_JS5_S6_S7_EEEDaNS0_8DataTypeEOT_SD_ () from /proj_sw/mmemarian/tt-metal/ttnn/ttnn/_ttnn.so
#14 0x000077fa5e33af7d in std::__1::__function::__func<tt::tt_metal::tensor_ops::tensor_cpu(tt::tt_metal::Tensor const&, bool, unsigned char)::$_0, std::__1::allocator<tt::tt_metal::tensor_ops::tensor_cpu(tt::tt_metal::Tensor const&, bool, unsigned char)::$_0>, void ()>::operator()() () from /proj_sw/mmemarian/tt-metal/ttnn/ttnn/_ttnn.so
#15 0x000077fa5e2a38c2 in tt::tt_metal::tensor_ops::tensor_cpu(tt::tt_metal::Tensor const&, bool, unsigned char) () from /proj_sw/mmemarian/tt-metal/ttnn/ttnn/_ttnn.so
#16 0x000077fa5e27e629 in tt::tt_metal::Tensor::cpu(bool, unsigned char) const () from /proj_sw/mmemarian/tt-metal/ttnn/ttnn/_ttnn.so
#17 0x000077fa5c309859 in ttnn::operations::core::from_device(tt::tt_metal::Tensor const&, bool, unsigned char) () from /proj_sw/mmemarian/tt-metal/ttnn/ttnn/_ttnn.so
#18 0x000077fa5e58eab0 in pybind11::cpp_function::initialize<tt::tt_metal::Tensor (*&)(tt::tt_metal::Tensor const&, bool, unsigned char), tt::tt_metal::Tensor, tt::tt_metal::Tensor const&, bool, unsigned char, pybind11::name, pybind11::scope, pybind11::sibling, pybind11::arg, pybind11::arg_v, pybind11::kw_only, pybind11::arg_v, char [848]>(tt::tt_metal::Tensor (*&)(tt::tt_metal::Tensor const&, bool, unsigned char), tt::tt_metal::Tensor (*)(tt::tt_metal::Tensor const&, bool, unsigned char), pybind11::name const&, pybind11::scope const&, pybind11::sibling const&, pybind11::arg const&, pybind11::arg_v const&, pybind11::kw_only const&, pybind11::arg_v const&, char const (&) [848])::{lambda(pybind11::detail::function_call&)#1}::operator()(pybind11::detail::function_call&) const () from /proj_sw/mmemarian/tt-metal/ttnn/ttnn/_ttnn.so
#19 0x000077fa5e58ea2e in pybind11::cpp_function::initialize<tt::tt_metal::Tensor (*&)(tt::tt_metal::Tensor const&, bool, unsigned char), tt::tt_metal::Tensor, tt::tt_metal::Tensor const&, bool, unsigned char, pybind11::name, pybind11::scope, pybind11::sibling, pybind11::arg, pybind11::arg_v, pybind11::kw_only, pybind11::arg_v, char [848]>(tt::tt_metal::Tensor (*&)(tt::tt_metal::Tensor const&, bool, unsigned char), tt::tt_metal::Tensor (*)(tt::tt_metal::Tensor const&, bool, unsigned char), pybind11::name const&, pybind11::scope const&, pybind11::sibling const&, pybind11::arg const&, pybind11::arg_v const&, pybind11::kw_only const&, pybind11::arg_v const&, char const (&) [848])::{lambda(pybind11::detail::function_call&)#1}::__invoke(pybind11::detail::function_call&) () from /proj_sw/mmemarian/tt-metal/ttnn/ttnn/_ttnn.so
#20 0x000077fa5c2ae4d9 in pybind11::cpp_function::dispatcher(_object*, _object*, _object*) () from /proj_sw/mmemarian/tt-metal/ttnn/ttnn/_ttnn.so

The process is stuck in HWCommandQueue::finish().

Rerunning the test without a tt-smi reset gives the following trace when hung:

(gdb) bt
#0  __memmove_avx_unaligned_erms () at ../sysdeps/x86_64/multiarch/memmove-vec-unaligned-erms.S:384
#1  0x00007ba3d1936fc0 in tt::umd::TTDevice::read_block(unsigned long, unsigned long, unsigned char*) () from /proj_sw/mmemarian/tt-metal/build_Release/lib/libdevice.so
#2  0x00007ba3d18a9e17 in tt::umd::Cluster::read_device_memory(void*, tt::umd::cxy_pair, unsigned long, unsigned int, std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> > const&) () from /proj_sw/mmemarian/tt-metal/build_Release/lib/libdevice.so
#3  0x00007ba3d1e63743 in tt::Cluster::read_core(void*, unsigned int, tt::umd::cxy_pair, unsigned long, bool) const () from /proj_sw/mmemarian/tt-metal/build_Release/lib/libtt_metal.so
#4  0x00007ba3d1d24ac6 in tt::tt_metal::SystemMemoryManager::fetch_queue_reserve_back(unsigned char) () from /proj_sw/mmemarian/tt-metal/build_Release/lib/libtt_metal.so
#5  0x00007ba3d1d251ac in void tt::tt_metal::buffer_dispatch::issue_buffer_dispatch_command_sequence<tt::tt_metal::buffer_dispatch::ShardedBufferWriteDispatchParams>(void const*, tt::tt_metal::v0::Buffer&, tt::tt_metal::buffer_dispatch::ShardedBufferWriteDispatchParams&, tt::stl::Span<tt::tt_metal::SubDeviceId const, 18446744073709551615ul>, CoreType) () from /proj_sw/mmemarian/tt-metal/build_Release/lib/libtt_metal.so
#6  0x00007ba3d1d22bac in tt::tt_metal::buffer_dispatch::write_sharded_buffer_to_core(void const*, unsigned int, tt::tt_metal::v0::Buffer&, tt::tt_metal::buffer_dispatch::ShardedBufferWriteDispatchParams&, tt::tt_metal::buffer_dispatch::BufferDispatchConstants const&, tt::stl::Span<tt::tt_metal::SubDeviceId const, 18446744073709551615ul>, tt::umd::xy_pair, CoreType) ()
   from /proj_sw/mmemarian/tt-metal/build_Release/lib/libtt_metal.so
#7  0x00007ba3d1d22f3f in tt::tt_metal::buffer_dispatch::write_to_device_buffer(void const*, tt::tt_metal::v0::Buffer&, tt::tt_metal::v0::BufferRegion const&, unsigned int, tt::stl::Span<unsigned int const, 18446744073709551615ul>, CoreType, tt::stl::Span<tt::tt_metal::SubDeviceId const, 18446744073709551615ul>) () from /proj_sw/mmemarian/tt-metal/build_Release/lib/libtt_metal.so
#8  0x00007ba3d1d97977 in tt::tt_metal::HWCommandQueue::enqueue_write_buffer(tt::tt_metal::v0::Buffer&, void const*, tt::tt_metal::v0::BufferRegion const&, bool, tt::stl::Span<tt::tt_metal::SubDeviceId const, 18446744073709551615ul>) () from /proj_sw/mmemarian/tt-metal/build_Release/lib/libtt_metal.so
#9  0x00007ba3d1d97786 in tt::tt_metal::HWCommandQueue::enqueue_write_buffer(std::__1::variant<std::__1::reference_wrapper<tt::tt_metal::v0::Buffer>, std::__1::shared_ptr<tt::tt_metal::v0::Buffer> > const&, std::__1::variant<std::__1::shared_ptr<std::__1::vector<unsigned char, std::__1::allocator<unsigned char> > > const, std::__1::shared_ptr<std::__1::vector<unsigned short, std::__1::allocator<unsigned short> > > const, std::__1::shared_ptr<std::__1::vector<int, std::__1::allocator<int> > > const, std::__1::shared_ptr<std::__1::vector<unsigned int, std::__1::allocator<unsigned int> > > const, std::__1::shared_ptr<std::__1::vector<float, std::__1::allocator<float> > > const, std::__1::shared_ptr<std::__1::vector<bfloat16, std::__1::allocator<bfloat16> > > const, void const*>, tt::tt_metal::v0::BufferRegion const&, bool, tt::stl::Span<tt::tt_metal::SubDeviceId const, 18446744073709551615ul>) () from /proj_sw/mmemarian/tt-metal/build_Release/lib/libtt_metal.so
#10 0x00007ba3d1d92d67 in tt::tt_metal::v0::EnqueueWriteSubBuffer(tt::tt_metal::CommandQueue&, std::__1::variant<std::__1::reference_wrapper<tt::tt_metal::v0::Buffer>, std::__1::shared_ptr<tt::tt_metal::v0::Buffer> > const&, std::__1::variant<std::__1::shared_ptr<std::__1::vector<unsigned char, std::__1::allocator<unsigned char> > > const, std::__1::shared_ptr<std::__1::vector<unsigned short, std::__1::allocator<unsigned short> > > const, std::__1::shared_ptr<std::__1::vector<int, std::__1::allocator<int> > > const, std::__1::shared_ptr<std::__1::vector<unsigned int, std::__1::allocator<unsigned int> > > const, std::__1::shared_ptr<std::__1::vector<float, std::__1::allocator<float> > > const, std::__1::shared_ptr<std::__1::vector<bfloat16, std::__1::allocator<bfloat16> > > const, void const*>, tt::tt_metal::v0::BufferRegion const&, bool) ()
   from /proj_sw/mmemarian/tt-metal/build_Release/lib/libtt_metal.so
#11 0x00007ba3d1d92992 in tt::tt_metal::v0::EnqueueWriteBuffer(tt::tt_metal::CommandQueue&, std::__1::variant<std::__1::reference_wrapper<tt::tt_metal::v0::Buffer>, std::__1::shared_ptr<tt::tt_metal::v0::Buffer> > const&, std::__1::variant<std::__1::shared_ptr<std::__1::vector<unsigned char, std::__1::allocator<unsigned char> > > const, std::__1::shared_ptr<std::__1::vector<unsigned short, std::__1::allocator<unsigned short> > > const, std::__1::shared_ptr<std::__1::vector<int, std::__1::allocator<int> > > const, std::__1::shared_ptr<std::__1::vector<unsigned int, std::__1::allocator<unsigned int> > > const, std::__1::shared_ptr<std::__1::vector<float, std::__1::allocator<float> > > const, std::__1::shared_ptr<std::__1::vector<bfloat16, std::__1::allocator<bfloat16> > > const, void const*>, bool) ()
   from /proj_sw/mmemarian/tt-metal/build_Release/lib/libtt_metal.so
#12 0x00007ba3d4b33cb6 in void tt::tt_metal::tensor_impl::write_data_to_device_buffer<unsigned short, tt::tt_metal::borrowed_buffer::Buffer>(tt::tt_metal::CommandQueue&, tt::tt_metal::borrowed_buffer::Buffer<unsigned short> const&, std::__1::shared_ptr<tt::tt_metal::v0::Buffer>) () from /proj_sw/mmemarian/tt-metal/ttnn/ttnn/_ttnn.so

This looks like a UMD read_block call is hung.

mo-tenstorrent commented Feb 19, 2025

Removing the code changes in trisck.cc and reducing ncrisck.cc to a single reg read + L1 write moved the hang to the warm-up run in layer 4, module 1, with the following changes:

+volatile tt_reg_ptr uint32_t* p_reg = reinterpret_cast<volatile tt_reg_ptr uint32_t*>(RISCV_DEBUG_REG_WALL_CLOCK_L);
+volatile tt_l1_ptr uint32_t* profiler_data_buffer =
+    reinterpret_cast<volatile tt_l1_ptr uint32_t*>(GET_MAILBOX_ADDRESS_DEV(profiler.buffer));
 void kernel_launch(uint32_t kernel_base_addr) {
 #if defined(DEBUG_NULL_KERNELS) && !defined(DISPATCH_KERNEL)
     wait_for_go_message();
@@ -43,7 +46,15 @@ void kernel_launch(uint32_t kernel_base_addr) {
     wait_for_go_message();
     {
         DeviceZoneScopedMainChildN("BRISC-KERNEL");
+        profiler_data_buffer[8] = p_reg[0];
+        profiler_data_buffer[9] = p_reg[1];
+        profiler_data_buffer[9] = p_reg[1];
+        profiler_data_buffer[9] = p_reg[1];
         kernel_main();
+        profiler_data_buffer[10] = p_reg[0];
+        profiler_data_buffer[11] = p_reg[1];
+        profiler_data_buffer[11] = p_reg[1];
+        profiler_data_buffer[11] = p_reg[1];
     }
 #endif
 }
diff --git a/tt_metal/hw/firmware/src/ncrisck.cc b/tt_metal/hw/firmware/src/ncrisck.cc
index d0809c0dbf..4909dbef13 100644
--- a/tt_metal/hw/firmware/src/ncrisck.cc
+++ b/tt_metal/hw/firmware/src/ncrisck.cc
@@ -29,6 +29,9 @@ uint32_t noc_nonposted_writes_acked[NUM_NOCS];
 uint32_t noc_nonposted_atomics_acked[NUM_NOCS];
 uint32_t noc_posted_writes_num_issued[NUM_NOCS];

+volatile tt_reg_ptr uint32_t* p_reg = reinterpret_cast<volatile tt_reg_ptr uint32_t*>(RISCV_DEBUG_REG_WALL_CLOCK_L);
+volatile tt_l1_ptr uint32_t* profiler_data_buffer =
+    reinterpret_cast<volatile tt_l1_ptr uint32_t*>(GET_MAILBOX_ADDRESS_DEV(profiler.buffer));
 void kernel_launch(uint32_t kernel_base_addr) {
 #if defined(DEBUG_NULL_KERNELS) && !defined(DISPATCH_KERNEL)
     wait_for_go_message();
@@ -51,7 +54,9 @@ void kernel_launch(uint32_t kernel_base_addr) {
 #endif
     wait_for_go_message();
     DeviceZoneScopedMainChildN("NCRISC-KERNEL");
+    profiler_data_buffer[4] = p_reg[0];
     kernel_main();
+    profiler_data_buffer[7] = p_reg[1];
     if constexpr (NOC_MODE == DM_DEDICATED_NOC) {
         WAYPOINT("NKFW");
         // Assert that no noc transactions are outstanding, to ensure that all reads and writes have landed and the NOC

read_block hangs here as well. This is after a fresh tt-smi reset.

#0  __memmove_avx_unaligned_erms () at ../sysdeps/x86_64/multiarch/memmove-vec-unaligned-erms.S:385
#1  0x00007b6f7f536fc0 in tt::umd::TTDevice::read_block(unsigned long, unsigned long, unsigned char*) () from /proj_sw/mmemarian/tt-metal/build_Release/lib/libdevice.so
#2  0x00007b6f7f4a9e17 in tt::umd::Cluster::read_device_memory(void*, tt::umd::cxy_pair, unsigned long, unsigned int, std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> > const&) () from /proj_sw/mmemarian/tt-metal/build_Release/lib/libdevice.so
#3  0x00007b6f7fa63743 in tt::Cluster::read_core(void*, unsigned int, tt::umd::cxy_pair, unsigned long, bool) const () from /proj_sw/mmemarian/tt-metal/build_Release/lib/libtt_metal.so
#4  0x00007b6f7f924ac6 in tt::tt_metal::SystemMemoryManager::fetch_queue_reserve_back(unsigned char) () from /proj_sw/mmemarian/tt-metal/build_Release/lib/libtt_metal.so
#5  0x00007b6f7f9251ac in void tt::tt_metal::buffer_dispatch::issue_buffer_dispatch_command_sequence<tt::tt_metal::buffer_dispatch::ShardedBufferWriteDispatchParams>(void const*, tt::tt_metal::v0::Buffer&, tt::tt_metal::buffer_dispatch::ShardedBufferWriteDispatchParams&, tt::stl::Span<tt::tt_metal::SubDeviceId const, 18446744073709551615ul>, CoreType) () from /proj_sw/mmemarian/tt-metal/build_Release/lib/libtt_metal.so
#6  0x00007b6f7f922bac in tt::tt_metal::buffer_dispatch::write_sharded_buffer_to_core(void const*, unsigned int, tt::tt_metal::v0::Buffer&, tt::tt_metal::buffer_dispatch::ShardedBufferWriteDispatchParams&, tt::tt_metal::buffer_dispatch::BufferDispatchConstants const&, tt::stl::Span<tt::tt_metal::SubDeviceId const, 18446744073709551615ul>, tt::umd::xy_pair, CoreType) ()
   from /proj_sw/mmemarian/tt-metal/build_Release/lib/libtt_metal.so
#7  0x00007b6f7f922f3f in tt::tt_metal::buffer_dispatch::write_to_device_buffer(void const*, tt::tt_metal::v0::Buffer&, tt::tt_metal::v0::BufferRegion const&, unsigned int, tt::stl::Span<unsigned int const, 18446744073709551615ul>, CoreType, tt::stl::Span<tt::tt_metal::SubDeviceId const, 18446744073709551615ul>) () from /proj_sw/mmemarian/tt-metal/build_Release/lib/libtt_metal.so
#8  0x00007b6f7f997977 in tt::tt_metal::HWCommandQueue::enqueue_write_buffer(tt::tt_metal::v0::Buffer&, void const*, tt::tt_metal::v0::BufferRegion const&, bool, tt::stl::Span<tt::tt_metal::SubDeviceId const, 18446744073709551615ul>) () from /proj_sw/mmemarian/tt-metal/build_Release/lib/libtt_metal.so
#9  0x00007b6f7f997786 in tt::tt_metal::HWCommandQueue::enqueue_write_buffer(std::__1::variant<std::__1::reference_wrapper<tt::tt_metal::v0::Buffer>, std::__1::shared_ptr<tt::tt_metal::v0::Buffer> > const&, std::__1::variant<std::__1::shared_ptr<std::__1::vector<unsigned char, std::__1::allocator<unsigned char> > > const, std::__1::shared_ptr<std::__1::vector<unsigned short, std::__1::allocator<unsigned short> > > const, std::__1::shared_ptr<std::__1::vector<int, std::__1::allocator<int> > > const, std::__1::shared_ptr<std::__1::vector<unsigned int, std::__1::allocator<unsigned int> > > const, std::__1::shared_ptr<std::__1::vector<float, std::__1::allocator<float> > > const, std::__1::shared_ptr<std::__1::vector<bfloat16, std::__1::allocator<bfloat16> > > const, void const*>, tt::tt_metal::v0::BufferRegion const&, bool, tt::stl::Span<tt::tt_metal::SubDeviceId const, 18446744073709551615ul>) () from /proj_sw/mmemarian/tt-metal/build_Release/lib/libtt_metal.so
#10 0x00007b6f7f992d67 in tt::tt_metal::v0::EnqueueWriteSubBuffer(tt::tt_metal::CommandQueue&, std::__1::variant<std::__1::reference_wrapper<tt::tt_metal::v0::Buffer>, std::__1::shared_ptr<tt::tt_metal::v0::Buffer> > const&, std::__1::variant<std::__1::shared_ptr<std::__1::vector<unsigned char, std::__1::allocator<unsigned char> > > const, std::__1::shared_ptr<std::__1::vector<unsigned short, std::__1::allocator<unsigned short> > > const, std::__1::shared_ptr<std::__1::vector<int, std::__1::allocator<int> > > const, std::__1::shared_ptr<std::__1::vector<unsigned int, std::__1::allocator<unsigned int> > > const, std::__1::shared_ptr<std::__1::vector<float, std::__1::allocator<float> > > const, std::__1::shared_ptr<std::__1::vector<bfloat16, std::__1::allocator<bfloat16> > > const, void const*>, tt::tt_metal::v0::BufferRegion const&, bool) ()
   from /proj_sw/mmemarian/tt-metal/build_Release/lib/libtt_metal.so
#11 0x00007b6f7f992992 in tt::tt_metal::v0::EnqueueWriteBuffer(tt::tt_metal::CommandQueue&, std::__1::variant<std::__1::reference_wrapper<tt::tt_metal::v0::Buffer>, std::__1::shared_ptr<tt::tt_metal::v0::Buffer> > const&, std::__1::variant<std::__1::shared_ptr<std::__1::vector<unsigned char, std::__1::allocator<unsigned char> > > const, std::__1::shared_ptr<std::__1::vector<unsigned short, std::__1::allocator<unsigned short> > > const, std::__1::shared_ptr<std::__1::vector<int, std::__1::allocator<int> > > const, std::__1::shared_ptr<std::__1::vector<unsigned int, std::__1::allocator<unsigned int> > > const, std::__1::shared_ptr<std::__1::vector<float, std::__1::allocator<float> > > const, std::__1::shared_ptr<std::__1::vector<bfloat16, std::__1::allocator<bfloat16> > > const, void const*>, bool) ()
   from /proj_sw/mmemarian/tt-metal/build_Release/lib/libtt_metal.so

@mo-tenstorrent

Just doing register reads into local memory also causes a hang, but the reset behaviour changes. With the following patch, the run hangs on read_block at layer 4, module 2 every time, with no tt-smi reset needed in between:

diff --git a/tt_metal/hw/firmware/src/brisck.cc b/tt_metal/hw/firmware/src/brisck.cc
index bac6b3c611..0a1b8550e1 100644
--- a/tt_metal/hw/firmware/src/brisck.cc
+++ b/tt_metal/hw/firmware/src/brisck.cc
@@ -21,6 +21,8 @@
 #include "remote_circular_buffer_api.h"
 #endif

+volatile tt_reg_ptr uint32_t* p_reg = reinterpret_cast<volatile tt_reg_ptr uint32_t*>(RISCV_DEBUG_REG_WALL_CLOCK_L);
+volatile uint32_t test;
 void kernel_launch(uint32_t kernel_base_addr) {
 #if defined(DEBUG_NULL_KERNELS) && !defined(DISPATCH_KERNEL)
     wait_for_go_message();
@@ -43,7 +45,15 @@ void kernel_launch(uint32_t kernel_base_addr) {
     wait_for_go_message();
     {
         DeviceZoneScopedMainChildN("BRISC-KERNEL");
+        test = p_reg[0];
+        test = p_reg[0];
+        test = p_reg[0];
+        test = p_reg[0];
         kernel_main();
+        test = p_reg[0];
+        test = p_reg[0];
+        test = p_reg[0];
+        test = p_reg[0];
     }
 #endif
 }
diff --git a/tt_metal/hw/firmware/src/ncrisck.cc b/tt_metal/hw/firmware/src/ncrisck.cc
index d0809c0dbf..57fcc5365d 100644
--- a/tt_metal/hw/firmware/src/ncrisck.cc
+++ b/tt_metal/hw/firmware/src/ncrisck.cc
@@ -29,6 +29,10 @@ uint32_t noc_nonposted_writes_acked[NUM_NOCS];
 uint32_t noc_nonposted_atomics_acked[NUM_NOCS];
 uint32_t noc_posted_writes_num_issued[NUM_NOCS];

+volatile tt_reg_ptr uint32_t* p_reg = reinterpret_cast<volatile tt_reg_ptr uint32_t*>(RISCV_DEBUG_REG_WALL_CLOCK_L);
+//volatile tt_l1_ptr uint32_t* profiler_data_buffer =
+    //reinterpret_cast<volatile tt_l1_ptr uint32_t*>(GET_MAILBOX_ADDRESS_DEV(profiler.buffer));
+volatile uint32_t test;
 void kernel_launch(uint32_t kernel_base_addr) {
 #if defined(DEBUG_NULL_KERNELS) && !defined(DISPATCH_KERNEL)
     wait_for_go_message();
@@ -51,7 +55,9 @@ void kernel_launch(uint32_t kernel_base_addr) {
 #endif
     wait_for_go_message();
     DeviceZoneScopedMainChildN("NCRISC-KERNEL");
+    test = p_reg[0];
     kernel_main();
+    test = p_reg[0];
     if constexpr (NOC_MODE == DM_DEDICATED_NOC) {
         WAYPOINT("NKFW");
         // Assert that no noc transactions are outstanding, to ensure that all reads and writes have landed and the NOC

@mo-tenstorrent
Contributor

mo-tenstorrent commented Feb 20, 2025

All data points above this comment were done on BH-30 Device 1.
On BH-30 Device 0 it is confirmed that a single reg read between wait_for_go and kernel_main causes the hang on the UMD read_block. This is on the warmup run, layer 4 module 2 conv 2.

diff --git a/tt_metal/hw/firmware/src/ncrisck.cc b/tt_metal/hw/firmware/src/ncrisck.cc
index d0809c0dbf..a6fd9a7255 100644
--- a/tt_metal/hw/firmware/src/ncrisck.cc
+++ b/tt_metal/hw/firmware/src/ncrisck.cc
@@ -29,6 +29,10 @@ uint32_t noc_nonposted_writes_acked[NUM_NOCS];
 uint32_t noc_nonposted_atomics_acked[NUM_NOCS];
 uint32_t noc_posted_writes_num_issued[NUM_NOCS];

+volatile tt_reg_ptr uint32_t* p_reg = reinterpret_cast<volatile tt_reg_ptr uint32_t*>(RISCV_DEBUG_REG_WALL_CLOCK_L);
+volatile uint32_t test;
 void kernel_launch(uint32_t kernel_base_addr) {
 #if defined(DEBUG_NULL_KERNELS) && !defined(DISPATCH_KERNEL)
     wait_for_go_message();
@@ -51,6 +55,7 @@ void kernel_launch(uint32_t kernel_base_addr) {
 #endif
     wait_for_go_message();
     DeviceZoneScopedMainChildN("NCRISC-KERNEL");
+    test = p_reg[0];
     kernel_main();
     if constexpr (NOC_MODE == DM_DEDICATED_NOC) {
         WAYPOINT("NKFW");

@mo-tenstorrent
Contributor

mo-tenstorrent commented Feb 20, 2025

On BH-30 Device 0 it is confirmed that 18 nop cycles can also induce a hang on umd read block. This is on layer 4 module 3 conv 3.

Diff:

diff --git a/tt_metal/hw/firmware/src/ncrisck.cc b/tt_metal/hw/firmware/src/ncrisck.cc
index d0809c0dbf..6a90bf10e5 100644
--- a/tt_metal/hw/firmware/src/ncrisck.cc
+++ b/tt_metal/hw/firmware/src/ncrisck.cc
@@ -29,6 +29,8 @@ uint32_t noc_nonposted_writes_acked[NUM_NOCS];
 uint32_t noc_nonposted_atomics_acked[NUM_NOCS];
 uint32_t noc_posted_writes_num_issued[NUM_NOCS];

 void kernel_launch(uint32_t kernel_base_addr) {
 #if defined(DEBUG_NULL_KERNELS) && !defined(DISPATCH_KERNEL)
     wait_for_go_message();
@@ -51,6 +53,11 @@ void kernel_launch(uint32_t kernel_base_addr) {
 #endif
     wait_for_go_message();
     DeviceZoneScopedMainChildN("NCRISC-KERNEL");
+    constexpr int CYCLE_BURN_COUNT = 18;
+#pragma GCC unroll 65534
+    for (int j = 0; j < CYCLE_BURN_COUNT; j++) {
+        asm volatile ("nop");
+    }
     kernel_main();
     if constexpr (NOC_MODE == DM_DEDICATED_NOC) {
         WAYPOINT("NKFW");

@mo-tenstorrent
Contributor

mo-tenstorrent commented Feb 20, 2025

Stalling 10 cycles instead of 18 causes a hang on finish instead of read_block. After the finish hang, the next run crashes on FW init.

#0  __futex_abstimed_wait_common64 (private=0, cancel=true, abstime=0x0, op=393, expected=0,
    futex_word=0x5c09b03cd5f8) at ./nptl/futex-internal.c:57
#1  __futex_abstimed_wait_common (cancel=true, private=0, abstime=0x0, clockid=0, expected=0,
    futex_word=0x5c09b03cd5f8) at ./nptl/futex-internal.c:87
#2  __GI___futex_abstimed_wait_cancelable64 (futex_word=futex_word@entry=0x5c09b03cd5f8,
    expected=expected@entry=0, clockid=clockid@entry=0, abstime=abstime@entry=0x0,
    private=private@entry=0) at ./nptl/futex-internal.c:139
#3  0x00007fe794fd4a41 in __pthread_cond_wait_common (abstime=0x0, clockid=0, mutex=0x5c09b03cd600,
    cond=0x5c09b03cd5d0) at ./nptl/pthread_cond_wait.c:503
#4  ___pthread_cond_wait (cond=0x5c09b03cd5d0, mutex=0x5c09b03cd600) at ./nptl/pthread_cond_wait.c:627
#5  0x00007fe77564e09f in std::__1::condition_variable::wait(std::__1::unique_lock<std::__1::mutex>&)
    () from /lib/x86_64-linux-gnu/libc++.so.1
#6  0x00007fe734d986eb in tt::tt_metal::HWCommandQueue::finish(tt::stl::Span<tt::tt_metal::SubDeviceId const, 18446744073709551615ul>) () from /proj_sw/mmemarian/tt-metal/build_Release/lib/libtt_metal.so

Next run crash:

(python_env) mmemarian@bh-30-special-mmemarian-for-reservation-14277:/proj_sw/mmemarian/tt-metal$ pytest "tests/ttnn/integration_tests/resnet/test_ttnn_functional_resnet50.py::test_resnet_50[pretrained_weight_false-batch_size=16-act_dtype=DataType.BFLOAT8_B-weight_dtype=DataType.BFLOAT8_B-math_fidelity=MathFidelity.LoFi-device_params={'l1_small_size': 24576}]"
2025-02-20 16:38:51.175 | DEBUG    | ttnn:<module>:83 - Initial ttnn.CONFIG:
Config{cache_path=/home/mmemarian/.cache/ttnn,model_cache_path=/home/mmemarian/.cache/ttnn/models,tmp_dir=/tmp/ttnn,enable_model_cache=false,enable_fast_runtime_mode=true,throw_exception_on_fallback=false,enable_logging=false,enable_graph_report=false,enable_detailed_buffer_report=false,enable_detailed_tensor_report=false,enable_comparison_mode=false,comparison_mode_should_raise_exception=false,comparison_mode_pcc=0.9999,root_report_path=generated/ttnn/reports,report_name=std::nullopt,std::nullopt}
2025-02-20 16:38:51.494 | DEBUG    | ttnn.decorators:operation_decorator:807 - Should ttnn.pearson_correlation_coefficient be migrated to C++?
2025-02-20 16:38:51.495 | DEBUG    | ttnn.decorators:operation_decorator:807 - Should ttnn.Conv1d be migrated to C++?
2025-02-20 16:38:51.496 | DEBUG    | ttnn.decorators:operation_decorator:807 - Should ttnn.conv2d be migrated to C++?
2025-02-20 16:38:51.496 | DEBUG    | ttnn.decorators:operation_decorator:807 - Should ttnn.unsqueeze_to_4D be migrated to C++?
2025-02-20 16:38:51.496 | DEBUG    | ttnn.decorators:operation_decorator:807 - Should ttnn.from_torch be migrated to C++?
2025-02-20 16:38:51.496 | DEBUG    | ttnn.decorators:operation_decorator:807 - Should ttnn.to_torch be migrated to C++?
2025-02-20 16:38:51.497 | DEBUG    | ttnn.decorators:operation_decorator:807 - Should ttnn.to_device be migrated to C++?
2025-02-20 16:38:51.497 | DEBUG    | ttnn.decorators:operation_decorator:807 - Should ttnn.from_device be migrated to C++?
2025-02-20 16:38:51.497 | DEBUG    | ttnn.decorators:operation_decorator:807 - Should ttnn.allocate_tensor_on_device be migrated to C++?
2025-02-20 16:38:51.497 | DEBUG    | ttnn.decorators:operation_decorator:807 - Should ttnn.copy_host_to_device_tensor be migrated to C++?
2025-02-20 16:38:51.497 | DEBUG    | ttnn.decorators:operation_decorator:807 - Should ttnn.deallocate be migrated to C++?
2025-02-20 16:38:51.497 | DEBUG    | ttnn.decorators:operation_decorator:807 - Should ttnn.reallocate be migrated to C++?
2025-02-20 16:38:51.497 | DEBUG    | ttnn.decorators:operation_decorator:807 - Should ttnn.load_tensor be migrated to C++?
2025-02-20 16:38:51.497 | DEBUG    | ttnn.decorators:operation_decorator:807 - Should ttnn.dump_tensor be migrated to C++?
2025-02-20 16:38:51.497 | DEBUG    | ttnn.decorators:operation_decorator:807 - Should ttnn.as_tensor be migrated to C++?
2025-02-20 16:38:51.510 | DEBUG    | ttnn.decorators:operation_decorator:807 - Should ttnn.conv_transpose2d be migrated to C++?
2025-02-20 16:38:51.513 | DEBUG    | ttnn.decorators:operation_decorator:807 - Should ttnn.conv2d be migrated to C++?
2025-02-20 16:38:51.513 | DEBUG    | ttnn.decorators:operation_decorator:807 - Should ttnn.Conv1d be migrated to C++?
================================================================================================================= test session starts =================================================================================================================
platform linux -- Python 3.10.12, pytest-7.2.2, pluggy-1.5.0 -- /proj_sw/mmemarian/tt-metal/python_env/bin/python3
cachedir: .pytest_cache
rootdir: /proj_sw/mmemarian/tt-metal, configfile: pytest.ini
plugins: xdist-3.6.1, split-0.8.2, dash-2.15.0, timeout-2.2.0, anyio-4.8.0
timeout: 300.0s
timeout method: signal
timeout func_only: False
collecting 1 item                                                                                                                                                                                                                                     2025-02-20 16:38:53.201 | WARNING  | tests.ttnn.conftest:pytest_collection_modifyitems:32 - Fast Runtime Mode is ON. Skipping tests tagged with @pytest.mark.requires_fast_runtime_mode_off
collected 1 item

tests/ttnn/integration_tests/resnet/test_ttnn_functional_resnet50.py::test_resnet_50[pretrained_weight_false-batch_size=16-act_dtype=DataType.BFLOAT8_B-weight_dtype=DataType.BFLOAT8_B-math_fidelity=MathFidelity.LoFi-device_params={'l1_small_size': 24576}] 2025-02-20 16:38:53.203 | DEBUG    | ttnn:manage_config:91 - Set ttnn.CONFIG.report_name to tests/ttnn/integration_tests/resnet/test_ttnn_functional_resnet50.py::test_resnet_50[pretrained_weight_false-batch_size=16-act_dtype=DataType.BFLOAT8_B-weight_dtype=DataType.BFLOAT8_B-math_fidelity=MathFidelity.LoFi-device_params={'l1_small_size': 24576}]: 2025-02-20 16:38:53 (UTC)
                 Device | INFO     | Opening user mode device driver
  Detecting chips (found 1)
2025-02-20 16:38:53.224 | INFO     | SiliconDriver   - Opened PCI device 0; KMD version: 1.29.0, IOMMU: disabled
2025-02-20 16:38:53.430 | INFO     | SiliconDriver   - Detected PCI devices: [0]
2025-02-20 16:38:53.430 | INFO     | SiliconDriver   - Using local chip ids: {0} and remote chip ids {}
2025-02-20 16:38:53.499 | INFO     | SiliconDriver   - Device: 0 Mapped iATU region 0 from 0x0 to 0x3fffffff to 0x180000000
                  Metal | INFO     | Initializing device 0. Program cache is NOT enabled
                 Device | INFO     | For Blackhole hardcode AICLK to 800 MHz due to lack of ARC message support
                  Metal | INFO     | AI CLK for device 0 is:   800 MHz
                 Always | FATAL    | Device 0: Timeout (10000 ms) waiting for physical cores to finish: (x=5,y=5), (x=13,y=5).
                 Always | FATAL    | Device 0 init: failed to initialize FW! Try resetting the board.
libc++abi: terminating due to uncaught exception of type std::runtime_error: TT_THROW @ /proj_sw/mmemarian/tt-metal/tt_metal/impl/device/device.cpp:831: tt::exception
info:
Device 0 init: failed to initialize FW! Try resetting the board.
backtrace:

@mo-tenstorrent
Contributor

On BH-30 Device 0, just enabling watcher with only waypoints turned on causes the hang.

Waypoints suggest that wait_for_next_context in _llk_unpack_tilize_ is where worker cores are stuck.

Need to further investigate whether this is the LLK itself having an issue, or whether it has been misused by kernels and now eventually gets stuck.

Added waypoint diff:

diff --git a/llk_lib/llk_unpack_tilize.h b/llk_lib/llk_unpack_tilize.h
index 7de5e16..9d7bcfd 100644
--- a/llk_lib/llk_unpack_tilize.h
+++ b/llk_lib/llk_unpack_tilize.h
@@ -109,8 +109,10 @@ inline void _llk_unpack_tilize_(const std::uint32_t base_address, const std::uin
     // Clear z/w start counters
     TTI_SETADCZW(0b001, 0, 0, 0, 0, 0b1111);

+    WAYPOINT("WFCW");
     // Wait for free context
     wait_for_next_context(2);
+    WAYPOINT("WFCD");

     // Get tile address
     if (0 == unp_cfg_context) {

Watcher Sample:

evice 0 worker core(x=12,y= 7) virtual(x=15,y= 9): CRBW,CRBW,WFCW,MWDD,   R  rmsg:D0G|BNT h_id:18 smsg:GGGG k_ids:49|50|51
Device 0 worker core(x=13,y= 7) virtual(x=16,y= 9):   GW,   W,   W,   W,   W  rmsg:H0D|bnt h_id:0 smsg:DDDD k_ids:0|0|0
Device 0 worker core(x= 0,y= 8) virtual(x= 1,y=10): CRBW,CRBW,WFCW,MWDD,   R  rmsg:D0G|BNT h_id:18 smsg:GGGG k_ids:49|50|51
Device 0 worker core(x= 1,y= 8) virtual(x= 2,y=10): CRBW,CRBW,WFCW,MWDD,   R  rmsg:D0G|BNT h_id:18 smsg:GGGG k_ids:49|50|51
Device 0 worker core(x= 2,y= 8) virtual(x= 3,y=10): CRBW,CRBW,WFCW,MWDD,   R  rmsg:D0G|BNT h_id:18 smsg:GGGG k_ids:49|50|51
Device 0 worker core(x= 3,y= 8) virtual(x= 4,y=10): CRBW,CRBW,WFCW,MWDD,   R  rmsg:D0G|BNT h_id:18 smsg:GGGG k_ids:49|50|51
Device 0 worker core(x= 4,y= 8) virtual(x= 5,y=10): CRBW,CRBW,WFCW,MWDD,   R  rmsg:D0G|BNT h_id:18 smsg:GGGG k_ids:49|50|51
Device 0 worker core(x= 5,y= 8) virtual(x= 6,y=10): CRBW,CRBW,WFCW,MWDD,   R  rmsg:D0G|BNT h_id:18 smsg:GGGG k_ids:49|50|51
Device 0 worker core(x= 6,y= 8) virtual(x= 7,y=10): CRBW,CRBW,WFCW,MWDD,   R  rmsg:D0G|BNT h_id:18 smsg:GGGG k_ids:49|50|51
Device 0 worker core(x= 7,y= 8) virtual(x=10,y=10): CRBW,CRBW,WFCW,MWDD,   R  rmsg:D0G|BNT h_id:18 smsg:GGGG k_ids:49|50|51

@mo-tenstorrent
Contributor

mo-tenstorrent commented Feb 21, 2025

Watcher-only-enabled hangs are non-deterministic. The same setup as above can also hang as below:

Host is hung on finish

#0  __futex_abstimed_wait_common64 (private=0, cancel=true, abstime=0x0, op=393, expected=0, futex_word=0x580d29d8f808)
    at ./nptl/futex-internal.c:57
#1  __futex_abstimed_wait_common (cancel=true, private=0, abstime=0x0, clockid=0, expected=0, futex_word=0x580d29d8f808)
    at ./nptl/futex-internal.c:87
#2  __GI___futex_abstimed_wait_cancelable64 (futex_word=futex_word@entry=0x580d29d8f808, expected=expected@entry=0, clockid=clockid@entry=0,
    abstime=abstime@entry=0x0, private=private@entry=0) at ./nptl/futex-internal.c:139
#3  0x00007203da8a3a41 in __pthread_cond_wait_common (abstime=0x0, clockid=0, mutex=0x580d29d8f810, cond=0x580d29d8f7e0)
    at ./nptl/pthread_cond_wait.c:503
#4  ___pthread_cond_wait (cond=0x580d29d8f7e0, mutex=0x580d29d8f810) at ./nptl/pthread_cond_wait.c:627
#5  0x00007203bae4e09f in std::__1::condition_variable::wait(std::__1::unique_lock<std::__1::mutex>&) ()
   from /lib/x86_64-linux-gnu/libc++.so.1
#6  0x000072037a5986eb in tt::tt_metal::HWCommandQueue::finish(tt::stl::Span<tt::tt_metal::SubDeviceId const, 18446744073709551615ul>) ()
   from /proj_sw/mmemarian/tt-metal/build_Release/lib/libtt_metal.so
#7  0x000072037a597591 in tt::tt_metal::HWCommandQueue::enqueue_read_buffer(tt::tt_metal::v0::Buffer&, void*, tt::tt_metal::v0::BufferRegion const&, bool, tt::stl::Span<tt::tt_metal::SubDeviceId const, 18446744073709551615ul>) ()
   from /proj_sw/mmemarian/tt-metal/build_Release/lib/libtt_metal.so
#8  0x000072037a593a5b in decltype(auto) std::__1::__variant_detail::__visitation::__base::__dispatcher<1ul>::__dispatch[abi:ue170006]<std::__1::__variant_detail::__visitation::__variant::__value_visitor<tt::tt_metal::v0::EnqueueReadSubBuffer(tt::tt_metal::CommandQueue&, std::__1::variant<std::__1::reference_wrapper<tt::tt_metal::v0::Buffer>, std::__1::shared_ptr<tt::tt_metal::v0::Buffer> >, void*, tt::tt_metal::v0::BufferRegion const&, bool)::$_0>&&, std::__1::__variant_detail::__base<(std::__1::__variant_detail::_Trait)1, std::__1::reference_wrapper<tt::tt_metal::v0::Buffer>, std::__1::shared_ptr<tt::tt_metal::v0::Buffer> >&>(std::__1::__variant_detail::__visitation::__variant::__value_visitor<tt::tt_metal::v0::EnqueueReadSubBuffer(tt::tt_metal::CommandQueue&, std::__1::variant<std::__1::reference_wrapper<tt::tt_metal::v0::Buffer>, std::__1::shared_ptr<tt::tt_metal::v0::Buffer> >, void*, tt::tt_metal::v0::BufferRegion const&, bool)::$_0>&&, std::__1::__variant_detail::__base<(std::__1::__variant_detail::_Trait)1, std::__1::reference_wrapper<tt::tt_metal::v0::Buffer>, std::__1::shared_ptr<tt::tt_metal::v0::Buffer> >&) () from /proj_sw/mmemarian/tt-metal/build_Release/lib/libtt_metal.so
#9  0x000072037a592be1 in tt::tt_metal::v0::EnqueueReadSubBuffer(tt::tt_metal::CommandQueue&, std::__1::variant<std::__1::reference_wrapper<tt::tt_metal::v0::Buffer>, std::__1::shared_ptr<tt::tt_metal::v0::Buffer> >, void*, tt::tt_metal::v0::BufferRegion const&, bool) ()
   from /proj_sw/mmemarian/tt-metal/build_Release/lib/libtt_metal.so
#10 0x000072037a592a92 in tt::tt_metal::v0::EnqueueReadBuffer(tt::tt_metal::CommandQueue&, std::__1::variant<std::__1::reference_wrapper<tt::tt_metal::v0::Buffer>, std::__1::shared_ptr<tt::tt_metal::v0::Buffer> > const&, void*, bool) ()
   from /proj_sw/mmemarian/tt-metal/build_Release/lib/libtt_metal.so
#11 0x000072037d2807b7 in tt::tt_metal::Tensor tt::tt_metal::tensor_impl::to_host_helper<bfloat16>(tt::tt_metal::Tensor const&, bool, unsigned char) () from /proj_sw/mmemarian/tt-metal/ttnn/ttnn/_ttnn.so
#12 0x000072037d2802c2 in tt::tt_metal::Tensor tt::tt_metal::tensor_impl::to_host<bfloat16>(tt::tt_metal::Tensor const&, bool, unsigned char)
    () from /proj_sw/mmemarian/tt-metal/ttnn/ttnn/_ttnn.so
#13 0x000072037d33b1ae in _ZN2tt8tt_metal11tensor_impl8dispatchIZNS1_15to_host_wrapperIJRNS0_6TensorERbRhEEEDaDpOT_EUlTyDpOT0_E_JS5_S6_S7_EEEDaNS0_8DataTypeEOT_SD_ () from /proj_sw/mmemarian/tt-metal/ttnn/ttnn/_ttnn.so
--Type <RET> for more, q to quit, c to continue without paging--

Watcher log shows that worker cores are running kernels. Dispatch cores are stuck waiting for workers to finish.

Most likely in this scenario, workers are stuck in a loop that is not known to watcher.

Dump #95 at 105.228s
Device 0 worker core(x= 0,y= 0) virtual(x= 1,y= 2):    R,   R,   R,   R,   R  rmsg:D0D|BNT h_id:277 smsg:GGGG k_ids:41|40|43
Device 0 worker core(x= 1,y= 0) virtual(x= 2,y= 2):    R,   R,   R,   R,   R  rmsg:D0D|BNT h_id:277 smsg:GGGG k_ids:42|40|43
Device 0 worker core(x= 2,y= 0) virtual(x= 3,y= 2):    R,   R,   R,   R,   R  rmsg:D0D|BNT h_id:277 smsg:GGGG k_ids:42|40|43
Device 0 worker core(x= 3,y= 0) virtual(x= 4,y= 2):    R,   R,   R,   R,   R  rmsg:D0D|BNT h_id:277 smsg:GGGG k_ids:42|40|43
Device 0 worker core(x= 4,y= 0) virtual(x= 5,y= 2):    R,   R,   R,   R,   R  rmsg:D0D|BNT h_id:277 smsg:GGGG k_ids:42|40|43
Device 0 worker core(x= 5,y= 0) virtual(x= 6,y= 2):    R,   R,   R,   R,   R  rmsg:D0D|BNT h_id:277 smsg:GGGG k_ids:42|40|43
Device 0 worker core(x= 6,y= 0) virtual(x= 7,y= 2):    R,   R,   R,   R,   R  rmsg:D0D|BNT h_id:277 smsg:GGGG k_ids:42|40|43
Device 0 worker core(x= 7,y= 0) virtual(x=10,y= 2):    R,   R,   R,   R,   R  rmsg:D0D|BNT h_id:277 smsg:GGGG k_ids:42|40|43
Device 0 worker core(x= 8,y= 0) virtual(x=11,y= 2):    R,   R,   R,   R,   R  rmsg:D0D|BNT h_id:277 smsg:GGGG k_ids:42|40|43
Device 0 worker core(x= 9,y= 0) virtual(x=12,y= 2):    R,   R,   R,   R,   R  rmsg:D0D|BNT h_id:277 smsg:GGGG k_ids:42|40|43
Device 0 worker core(x=10,y= 0) virtual(x=13,y= 2):    R,   R,   R,   R,   R  rmsg:D0D|BNT h_id:277 smsg:GGGG k_ids:42|40|43
Device 0 worker core(x=11,y= 0) virtual(x=14,y= 2):    R,   R,   R,   R,   R  rmsg:D0D|BNT h_id:277 smsg:GGGG k_ids:42|40|43
Device 0 worker core(x=12,y= 0) virtual(x=15,y= 2):    R,   R,   R,   R,   R  rmsg:D0D|BNT h_id:277 smsg:GGGG k_ids:42|40|43
Device 0 worker core(x=13,y= 0) virtual(x=16,y= 2):  NTW,DAPW,   W,   W,   W  rmsg:H1G|bNt h_id:0 smsg:GDDD k_ids:0|1|0
Device 0 worker core(x= 0,y= 1) virtual(x= 1,y= 3):    R,   R,   R,   R,   R  rmsg:D0D|BNT h_id:277 smsg:GGGG k_ids:42|40|43
Device 0 worker core(x= 1,y= 1) virtual(x= 2,y= 3):    R,   R,   R,   R,   R  rmsg:D0D|BNT h_id:277 smsg:GGGG k_ids:42|40|43
Device 0 worker core(x= 2,y= 1) virtual(x= 3,y= 3):    R,   R,   R,   R,   R  rmsg:D0D|BNT h_id:277 smsg:GGGG k_ids:42|40|43
Device 0 worker core(x= 3,y= 1) virtual(x= 4,y= 3):    R,   R,   R,   R,   R  rmsg:D0D|BNT h_id:277 smsg:GGGG k_ids:42|40|43
Device 0 worker core(x= 4,y= 1) virtual(x= 5,y= 3):    R,   R,   R,   R,   R  rmsg:D0D|BNT h_id:277 smsg:GGGG k_ids:42|40|43
Device 0 worker core(x= 5,y= 1) virtual(x= 6,y= 3):    R,   R,   R,   R,   R  rmsg:D0D|BNT h_id:277 smsg:GGGG k_ids:42|40|43
Device 0 worker core(x= 6,y= 1) virtual(x= 7,y= 3):    R,   R,   R,   R,   R  rmsg:D0D|BNT h_id:277 smsg:GGGG k_ids:42|40|43
Device 0 worker core(x= 7,y= 1) virtual(x=10,y= 3):    R,   R,   R,   R,   R  rmsg:D0D|BNT h_id:277 smsg:GGGG k_ids:42|40|43
Device 0 worker core(x= 8,y= 1) virtual(x=11,y= 3):    R,   R,   R,   R,   R  rmsg:D0D|BNT h_id:277 smsg:GGGG k_ids:42|40|43
Device 0 worker core(x= 9,y= 1) virtual(x=12,y= 3):    R,   R,   R,   R,   R  rmsg:D0D|BNT h_id:277 smsg:GGGG k_ids:42|40|43
Device 0 worker core(x=10,y= 1) virtual(x=13,y= 3):    R,   R,   R,   R,   R  rmsg:D0D|BNT h_id:277 smsg:GGGG k_ids:42|40|43
Device 0 worker core(x=11,y= 1) virtual(x=14,y= 3):    R,   R,   R,   R,   R  rmsg:D0D|BNT h_id:277 smsg:GGGG k_ids:42|40|43
Device 0 worker core(x=12,y= 1) virtual(x=15,y= 3):    R,   R,   R,   R,   R  rmsg:D0D|BNT h_id:277 smsg:GGGG k_ids:42|40|43
Device 0 worker core(x=13,y= 1) virtual(x=16,y= 3): DAPD, PWW,   W,   W,   W  rmsg:H1G|BNt h_id:0 smsg:GDDD k_ids:3|2|0
Device 0 worker core(x= 0,y= 2) virtual(x= 1,y= 4):    R,   R,   R,   R,   R  rmsg:D0D|BNT h_id:277 smsg:GGGG k_ids:42|40|43
Device 0 worker core(x= 1,y= 2) virtual(x= 2,y= 4):    R,   R,   R,   R,   R  rmsg:D0D|BNT h_id:277 smsg:GGGG k_ids:42|40|43
Device 0 worker core(x= 2,y= 2) virtual(x= 3,y= 4):    R,   R,   R,   R,   R  rmsg:D0D|BNT h_id:277 smsg:GGGG k_ids:42|40|43
Device 0 worker core(x= 3,y= 2) virtual(x= 4,y= 4):    R,   R,   R,   R,   R  rmsg:D0D|BNT h_id:277 smsg:GGGG k_ids:42|40|43
Device 0 worker core(x= 4,y= 2) virtual(x= 5,y= 4):    R,   R,   R,   R,   R  rmsg:D0D|BNT h_id:277 smsg:GGGG k_ids:42|40|43
Device 0 worker core(x= 5,y= 2) virtual(x= 6,y= 4):    R,   R,   R,   R,   R  rmsg:D0D|BNT h_id:277 smsg:GGGG k_ids:42|40|43
Device 0 worker core(x= 6,y= 2) virtual(x= 7,y= 4):    R,   R,   R,   R,   R  rmsg:D0D|BNT h_id:277 smsg:GGGG k_ids:42|40|43
Device 0 worker core(x= 7,y= 2) virtual(x=10,y= 4):    R,   R,   R,   R,   R  rmsg:D0D|BNT h_id:277 smsg:GGGG k_ids:42|40|43
Device 0 worker core(x= 8,y= 2) virtual(x=11,y= 4):    R,   R,   R,   R,   R  rmsg:D0D|BNT h_id:277 smsg:GGGG k_ids:42|40|43
Device 0 worker core(x= 9,y= 2) virtual(x=12,y= 4):    R,   R,   R,   R,   R  rmsg:D0D|BNT h_id:277 smsg:GGGG k_ids:42|40|43
Device 0 worker core(x=10,y= 2) virtual(x=13,y= 4):    R,   R,   R,   R,   R  rmsg:D0D|BNT h_id:277 smsg:GGGG k_ids:42|40|43
Device 0 worker core(x=11,y= 2) virtual(x=14,y= 4):    R,   R,   R,   R,   R  rmsg:D0D|BNT h_id:277 smsg:GGGG k_ids:42|40|43
Device 0 worker core(x=12,y= 2) virtual(x=15,y= 4): CWFW,   W,   R,MWDD,   R  rmsg:D0G|BNT h_id:276 smsg:DGGG k_ids:38|36|39
Device 0 worker core(x=13,y= 2) virtual(x=16,y= 4):   GW,   W,   W,   W,   W  rmsg:H0D|bnt h_id:0 smsg:DDDD k_ids:0|0|0
Device 0 worker core(x= 0,y= 3) virtual(x= 1,y= 5):    R,   R,   R,   R,   R  rmsg:D0D|BNT h_id:277 smsg:GGGG k_ids:42|40|43
Device 0 worker core(x= 1,y= 3) virtual(x= 2,y= 5):    R,   R,   R,   R,   R  rmsg:D0D|BNT h_id:277 smsg:GGGG k_ids:42|40|43
Device 0 worker core(x= 2,y= 3) virtual(x= 3,y= 5):    R,   R,   R,   R,   R  rmsg:D0D|BNT h_id:277 smsg:GGGG k_ids:42|40|43
Device 0 worker core(x= 3,y= 3) virtual(x= 4,y= 5):    R,   R,   R,   R,   R  rmsg:D0D|BNT h_id:277 smsg:GGGG k_ids:42|40|43
Device 0 worker core(x= 4,y= 3) virtual(x= 5,y= 5):    R,   R,   R,   R,   R  rmsg:D0D|BNT h_id:277 smsg:GGGG k_ids:42|40|43
Device 0 worker core(x= 5,y= 3) virtual(x= 6,y= 5):    R,   R,   R,   R,   R  rmsg:D0D|BNT h_id:277 smsg:GGGG k_ids:42|40|43
Device 0 worker core(x= 6,y= 3) virtual(x= 7,y= 5):    R,   R,   R,   R,   R  rmsg:D0D|BNT h_id:277 smsg:GGGG k_ids:42|40|43
Device 0 worker core(x= 7,y= 3) virtual(x=10,y= 5):    R,   R,   R,   R,   R  rmsg:D0D|BNT h_id:277 smsg:GGGG k_ids:42|40|43
Device 0 worker core(x= 8,y= 3) virtual(x=11,y= 5):    R,   R,   R,   R,   R  rmsg:D0D|BNT h_id:277 smsg:GGGG k_ids:42|40|43
Device 0 worker core(x= 9,y= 3) virtual(x=12,y= 5):    R,   R,   R,   R,   R  rmsg:D0D|BNT h_id:277 smsg:GGGG k_ids:42|40|43
Device 0 worker core(x=10,y= 3) virtual(x=13,y= 5):    R,   R,   R,   R,   R  rmsg:D0D|BNT h_id:277 smsg:GGGG k_ids:42|40|43
Device 0 worker core(x=11,y= 3) virtual(x=14,y= 5):    R,   R,   R,   R,   R  rmsg:D0D|BNT h_id:277 smsg:GGGG k_ids:42|40|43
Device 0 worker core(x=12,y= 3) virtual(x=15,y= 5):    R,   R,   R,   R,   R  rmsg:D0D|BNT h_id:277 smsg:GGGG k_ids:42|40|43
Device 0 worker core(x=13,y= 3) virtual(x=16,y= 5):   GW,   W,   W,   W,   W  rmsg:H0D|bnt h_id:0 smsg:DDDD k_ids:0|0|0
Device 0 worker core(x= 0,y= 4) virtual(x= 1,y= 6):    R,   R,   R,   R,   R  rmsg:D0D|BNT h_id:277 smsg:GGGG k_ids:42|40|43
Device 0 worker core(x= 1,y= 4) virtual(x= 2,y= 6):    R,   R,   R,   R,   R  rmsg:D0D|BNT h_id:277 smsg:GGGG k_ids:42|40|43
Device 0 worker core(x= 2,y= 4) virtual(x= 3,y= 6):    R,   R,   R,   R,   R  rmsg:D0D|BNT h_id:277 smsg:GGGG k_ids:42|40|43
Device 0 worker core(x= 3,y= 4) virtual(x= 4,y= 6):    R,   R,   R,   R,   R  rmsg:D0D|BNT h_id:277 smsg:GGGG k_ids:42|40|43
Device 0 worker core(x= 4,y= 4) virtual(x= 5,y= 6):    R,   R,   R,   R,   R  rmsg:D0D|BNT h_id:277 smsg:GGGG k_ids:42|40|43
Device 0 worker core(x= 5,y= 4) virtual(x= 6,y= 6):    R,   R,   R,   R,   R  rmsg:D0D|BNT h_id:277 smsg:GGGG k_ids:42|40|43
Device 0 worker core(x= 6,y= 4) virtual(x= 7,y= 6):    R,   R,   R,   R,   R  rmsg:D0D|BNT h_id:277 smsg:GGGG k_ids:42|40|43
Device 0 worker core(x= 7,y= 4) virtual(x=10,y= 6):    R,   R,   R,   R,   R  rmsg:D0D|BNT h_id:277 smsg:GGGG k_ids:42|40|43
Device 0 worker core(x= 8,y= 4) virtual(x=11,y= 6):    R,   R,   R,   R,   R  rmsg:D0D|BNT h_id:277 smsg:GGGG k_ids:42|40|43
Device 0 worker core(x= 9,y= 4) virtual(x=12,y= 6):    R,   R,   R,   R,   R  rmsg:D0D|BNT h_id:277 smsg:GGGG k_ids:42|40|43
Device 0 worker core(x=10,y= 4) virtual(x=13,y= 6):    R,   R,   R,   R,   R  rmsg:D0D|BNT h_id:277 smsg:GGGG k_ids:42|40|43
Device 0 worker core(x=11,y= 4) virtual(x=14,y= 6):    R,   R,   R,   R,   R  rmsg:D0D|BNT h_id:277 smsg:GGGG k_ids:42|40|43
Device 0 worker core(x=12,y= 4) virtual(x=15,y= 6):    R,   R,   R,   R,   R  rmsg:D0D|BNT h_id:277 smsg:GGGG k_ids:42|40|43
Device 0 worker core(x=13,y= 4) virtual(x=16,y= 6):   GW,   W,   W,   W,   W  rmsg:H0D|bnt h_id:0 smsg:DDDD k_ids:0|0|0
Device 0 worker core(x= 0,y= 5) virtual(x= 1,y= 7):    R,   R,   R,   R,   R  rmsg:D0D|BNT h_id:277 smsg:GGGG k_ids:42|40|43
Device 0 worker core(x= 1,y= 5) virtual(x= 2,y= 7):    R,   R,   R,   R,   R  rmsg:D0D|BNT h_id:277 smsg:GGGG k_ids:42|40|43
Device 0 worker core(x= 2,y= 5) virtual(x= 3,y= 7):    R,   R,   R,   R,   R  rmsg:D0D|BNT h_id:277 smsg:GGGG k_ids:42|40|43
Device 0 worker core(x= 3,y= 5) virtual(x= 4,y= 7):    R,   R,   R,   R,   R  rmsg:D0D|BNT h_id:277 smsg:GGGG k_ids:42|40|43
Device 0 worker core(x= 4,y= 5) virtual(x= 5,y= 7):    R,   R,   R,   R,   R  rmsg:D0D|BNT h_id:277 smsg:GGGG k_ids:42|40|43
Device 0 worker core(x= 5,y= 5) virtual(x= 6,y= 7):    R,   R,   R,   R,   R  rmsg:D0D|BNT h_id:277 smsg:GGGG k_ids:42|40|43
Device 0 worker core(x= 6,y= 5) virtual(x= 7,y= 7):    R,   R,   R,   R,   R  rmsg:D0D|BNT h_id:277 smsg:GGGG k_ids:42|40|43
Device 0 worker core(x= 7,y= 5) virtual(x=10,y= 7):    R,   R,   R,   R,   R  rmsg:D0D|BNT h_id:277 smsg:GGGG k_ids:42|40|43
Device 0 worker core(x= 8,y= 5) virtual(x=11,y= 7):    R,   R,   R,   R,   R  rmsg:D0D|BNT h_id:277 smsg:GGGG k_ids:42|40|43
Device 0 worker core(x= 9,y= 5) virtual(x=12,y= 7):    R,   R,   R,   R,   R  rmsg:D0D|BNT h_id:277 smsg:GGGG k_ids:42|40|43
Device 0 worker core(x=10,y= 5) virtual(x=13,y= 7):    R,   R,   R,   R,   R  rmsg:D0D|BNT h_id:277 smsg:GGGG k_ids:42|40|43
Device 0 worker core(x=11,y= 5) virtual(x=14,y= 7):    R,   R,   R,   R,   R  rmsg:D0D|BNT h_id:277 smsg:GGGG k_ids:42|40|43
Device 0 worker core(x=12,y= 5) virtual(x=15,y= 7):    R,   R,   R,   R,   R  rmsg:D0D|BNT h_id:277 smsg:GGGG k_ids:42|40|43
Device 0 worker core(x=13,y= 5) virtual(x=16,y= 7):   GW,   W,   W,   W,   W  rmsg:H0D|bnt h_id:0 smsg:DDDD k_ids:0|0|0
Device 0 worker core(x= 0,y= 6) virtual(x= 1,y= 8):    R,   R,   R,   R,   R  rmsg:D0D|BNT h_id:277 smsg:GGGG k_ids:42|40|43
Device 0 worker core(x= 1,y= 6) virtual(x= 2,y= 8):    R,   R,   R,   R,   R  rmsg:D0D|BNT h_id:277 smsg:GGGG k_ids:42|40|43
Device 0 worker core(x= 2,y= 6) virtual(x= 3,y= 8):    R,   R,   R,   R,   R  rmsg:D0D|BNT h_id:277 smsg:GGGG k_ids:42|40|43
Device 0 worker core(x= 3,y= 6) virtual(x= 4,y= 8):    R,   R,   R,   R,   R  rmsg:D0D|BNT h_id:277 smsg:GGGG k_ids:42|40|43
Device 0 worker core(x= 4,y= 6) virtual(x= 5,y= 8):    R,   R,   R,   R,   R  rmsg:D0D|BNT h_id:277 smsg:GGGG k_ids:42|40|43
Device 0 worker core(x= 5,y= 6) virtual(x= 6,y= 8):    R,   R,   R,   R,   R  rmsg:D0D|BNT h_id:277 smsg:GGGG k_ids:42|40|43
Device 0 worker core(x= 6,y= 6) virtual(x= 7,y= 8):    R,   R,   R,   R,   R  rmsg:D0D|BNT h_id:277 smsg:GGGG k_ids:42|40|43
Device 0 worker core(x= 7,y= 6) virtual(x=10,y= 8):    R,   R,   R,   R,   R  rmsg:D0D|BNT h_id:277 smsg:GGGG k_ids:42|40|43
Device 0 worker core(x= 8,y= 6) virtual(x=11,y= 8):    R,   R,   R,   R,   R  rmsg:D0D|BNT h_id:277 smsg:GGGG k_ids:42|40|43
Device 0 worker core(x= 9,y= 6) virtual(x=12,y= 8):    R,   R,   R,   R,   R  rmsg:D0D|BNT h_id:277 smsg:GGGG k_ids:42|40|43
Device 0 worker core(x=10,y= 6) virtual(x=13,y= 8):    R,   R,   R,   R,   R  rmsg:D0D|BNT h_id:277 smsg:GGGG k_ids:42|40|43
Device 0 worker core(x=11,y= 6) virtual(x=14,y= 8):    R,   R,   R,   R,   R  rmsg:D0D|BNT h_id:277 smsg:GGGG k_ids:42|40|43
Device 0 worker core(x=12,y= 6) virtual(x=15,y= 8):    R,   R,   R,   R,   R  rmsg:D0D|BNT h_id:277 smsg:GGGG k_ids:42|40|43
Device 0 worker core(x=13,y= 6) virtual(x=16,y= 8):   GW,   W,   W,   W,   W  rmsg:H0D|bnt h_id:0 smsg:DDDD k_ids:0|0|0
Device 0 worker core(x= 0,y= 7) virtual(x= 1,y= 9):    R,   R,   R,   R,   R  rmsg:D0D|BNT h_id:277 smsg:GGGG k_ids:42|40|43
Device 0 worker core(x= 1,y= 7) virtual(x= 2,y= 9):    R,   R,   R,   R,   R  rmsg:D0D|BNT h_id:277 smsg:GGGG k_ids:42|40|43
Device 0 worker core(x= 2,y= 7) virtual(x= 3,y= 9):    R,   R,   R,   R,   R  rmsg:D0D|BNT h_id:277 smsg:GGGG k_ids:42|40|43
Device 0 worker core(x= 3,y= 7) virtual(x= 4,y= 9):    R,   R,   R,   R,   R  rmsg:D0D|BNT h_id:277 smsg:GGGG k_ids:42|40|43
Device 0 worker core(x= 4,y= 7) virtual(x= 5,y= 9):    R,   R,   R,   R,   R  rmsg:D0D|BNT h_id:277 smsg:GGGG k_ids:42|40|43
Device 0 worker core(x= 5,y= 7) virtual(x= 6,y= 9):    R,   R,   R,   R,   R  rmsg:D0D|BNT h_id:277 smsg:GGGG k_ids:42|40|43
Device 0 worker core(x= 6,y= 7) virtual(x= 7,y= 9):    R,   R,   R,   R,   R  rmsg:D0D|BNT h_id:277 smsg:GGGG k_ids:42|40|43
Device 0 worker core(x= 7,y= 7) virtual(x=10,y= 9):    R,   R,   R,   R,   R  rmsg:D0D|BNT h_id:277 smsg:GGGG k_ids:42|40|43
Device 0 worker core(x= 8,y= 7) virtual(x=11,y= 9):    R,   R,   R,   R,   R  rmsg:D0D|BNT h_id:277 smsg:GGGG k_ids:42|40|43
Device 0 worker core(x= 9,y= 7) virtual(x=12,y= 9):    R,   R,   R,   R,   R  rmsg:D0D|BNT h_id:277 smsg:GGGG k_ids:42|40|43
Device 0 worker core(x=10,y= 7) virtual(x=13,y= 9):    R,   R,   R,   R,   R  rmsg:D0D|BNT h_id:277 smsg:GGGG k_ids:42|40|43
Device 0 worker core(x=11,y= 7) virtual(x=14,y= 9):    R,   R,   R,   R,   R  rmsg:D0D|BNT h_id:277 smsg:GGGG k_ids:42|40|43
Device 0 worker core(x=12,y= 7) virtual(x=15,y= 9):    R,   R,   R,   R,   R  rmsg:D0D|BNT h_id:277 smsg:GGGG k_ids:42|40|43
Device 0 worker core(x=13,y= 7) virtual(x=16,y= 9):   GW,   W,   W,   W,   W  rmsg:H0D|bnt h_id:0 smsg:DDDD k_ids:0|0|0
Device 0 worker core(x= 0,y= 8) virtual(x= 1,y=10):    R,   R,   R,   R,   R  rmsg:D0D|BNT h_id:277 smsg:GGGG k_ids:42|40|43
Device 0 worker core(x= 1,y= 8) virtual(x= 2,y=10):    R,   R,   R,   R,   R  rmsg:D0D|BNT h_id:277 smsg:GGGG k_ids:42|40|43
Device 0 worker core(x= 2,y= 8) virtual(x= 3,y=10):    R,   R,   R,   R,   R  rmsg:D0D|BNT h_id:277 smsg:GGGG k_ids:42|40|43
Device 0 worker core(x= 3,y= 8) virtual(x= 4,y=10):    R,   R,   R,   R,   R  rmsg:D0D|BNT h_id:277 smsg:GGGG k_ids:42|40|43
Device 0 worker core(x= 4,y= 8) virtual(x= 5,y=10):    R,   R,   R,   R,   R  rmsg:D0D|BNT h_id:277 smsg:GGGG k_ids:42|40|43
Device 0 worker core(x= 5,y= 8) virtual(x= 6,y=10):    R,   R,   R,   R,   R  rmsg:D0D|BNT h_id:277 smsg:GGGG k_ids:42|40|43
Device 0 worker core(x= 6,y= 8) virtual(x= 7,y=10):    R,   R,   R,   R,   R  rmsg:D0D|BNT h_id:277 smsg:GGGG k_ids:42|40|43
Device 0 worker core(x= 7,y= 8) virtual(x=10,y=10):    R,   R,   R,   R,   R  rmsg:D0D|BNT h_id:277 smsg:GGGG k_ids:42|40|43
Device 0 worker core(x= 8,y= 8) virtual(x=11,y=10):   GW,   W,   W,   W,   W  rmsg:D0D|bnt h_id:272 smsg:DDDD k_ids:0|0|0
Device 0 worker core(x= 9,y= 8) virtual(x=12,y=10):   GW,   W,   W,   W,   W  rmsg:D0D|bnt h_id:272 smsg:DDDD k_ids:0|0|0
Device 0 worker core(x=10,y= 8) virtual(x=13,y=10):   GW,   W,   W,   W,   W  rmsg:D0D|bnt h_id:272 smsg:DDDD k_ids:0|0|0
Device 0 worker core(x=11,y= 8) virtual(x=14,y=10):   GW,   W,   W,   W,   W  rmsg:D0D|bnt h_id:272 smsg:DDDD k_ids:0|0|0
Device 0 worker core(x=12,y= 8) virtual(x=15,y=10):   GW,   W,   W,   W,   W  rmsg:D0D|bnt h_id:272 smsg:DDDD k_ids:0|0|0
Device 0 worker core(x=13,y= 8) virtual(x=16,y=10):   GW,   W,   W,   W,   W  rmsg:H0D|bnt h_id:0 smsg:DDDD k_ids:0|0|0
Device 0 worker core(x= 0,y= 9) virtual(x= 1,y=11):   GW,   W,   W,   W,   W  rmsg:D0D|bnt h_id:272 smsg:DDDD k_ids:0|0|0
Device 0 worker core(x= 1,y= 9) virtual(x= 2,y=11):   GW,   W,   W,   W,   W  rmsg:D0D|bnt h_id:272 smsg:DDDD k_ids:0|0|0
Device 0 worker core(x= 2,y= 9) virtual(x= 3,y=11):   GW,   W,   W,   W,   W  rmsg:D0D|bnt h_id:272 smsg:DDDD k_ids:0|0|0
Device 0 worker core(x= 3,y= 9) virtual(x= 4,y=11):   GW,   W,   W,   W,   W  rmsg:D0D|bnt h_id:272 smsg:DDDD k_ids:0|0|0
Device 0 worker core(x= 4,y= 9) virtual(x= 5,y=11):   GW,   W,   W,   W,   W  rmsg:D0D|bnt h_id:272 smsg:DDDD k_ids:0|0|0
Device 0 worker core(x= 5,y= 9) virtual(x= 6,y=11):   GW,   W,   W,   W,   W  rmsg:D0D|bnt h_id:272 smsg:DDDD k_ids:0|0|0
Device 0 worker core(x= 6,y= 9) virtual(x= 7,y=11):   GW,   W,   W,   W,   W  rmsg:D0D|bnt h_id:272 smsg:DDDD k_ids:0|0|0
Device 0 worker core(x= 7,y= 9) virtual(x=10,y=11):   GW,   W,   W,   W,   W  rmsg:D0D|bnt h_id:272 smsg:DDDD k_ids:0|0|0
Device 0 worker core(x= 8,y= 9) virtual(x=11,y=11):   GW,   W,   W,   W,   W  rmsg:D0D|bnt h_id:272 smsg:DDDD k_ids:0|0|0
Device 0 worker core(x= 9,y= 9) virtual(x=12,y=11):   GW,   W,   W,   W,   W  rmsg:D0D|bnt h_id:272 smsg:DDDD k_ids:0|0|0
Device 0 worker core(x=10,y= 9) virtual(x=13,y=11):   GW,   W,   W,   W,   W  rmsg:D0D|bnt h_id:272 smsg:DDDD k_ids:0|0|0
Device 0 worker core(x=11,y= 9) virtual(x=14,y=11):   GW,   W,   W,   W,   W  rmsg:D0D|bnt h_id:248 smsg:DDDD k_ids:0|0|0
Device 0 worker core(x=12,y= 9) virtual(x=15,y=11):   GW,   W,   W,   W,   W  rmsg:D0D|bnt h_id:248 smsg:DDDD k_ids:0|0|0
Device 0 worker core(x=13,y= 9) virtual(x=16,y=11):   GW,   W,   W,   W,   W  rmsg:H0D|bnt h_id:0 smsg:DDDD k_ids:0|0|0
k_id[0]: blank
k_id[1]: tt_metal/impl/dispatch/kernels/cq_prefetch.cpp
k_id[2]: tt_metal/impl/dispatch/kernels/cq_dispatch.cpp
k_id[3]: tt_metal/impl/dispatch/kernels/cq_dispatch_slave.cpp
k_id[36]: ttnn/cpp/ttnn/operations/matmul/device/kernels/dataflow/reader_bmm_tile_layout_in0_sender_padding.cpp
k_id[38]: ttnn/cpp/ttnn/operations/matmul/device/kernels/dataflow/reader_bmm_tile_layout_in1_receiver_writer_padding.cpp
k_id[39]: ttnn/cpp/ttnn/operations/matmul/device/kernels/compute/bmm_large_block_zm_fused_bias_activation.cpp
k_id[40]: ttnn/cpp/ttnn/operations/matmul/device/kernels/dataflow/reader_bmm_tile_layout_in0_sender_padding.cpp
k_id[41]: ttnn/cpp/ttnn/operations/matmul/device/kernels/dataflow/reader_bmm_tile_layout_in1_sender_writer_padding.cpp
k_id[42]: ttnn/cpp/ttnn/operations/matmul/device/kernels/dataflow/reader_bmm_tile_layout_in1_receiver_writer_padding.cpp
k_id[43]: ttnn/cpp/ttnn/operations/matmul/device/kernels/compute/bmm_large_block_zm_fused_bias_activation.cpp
Dump #95 completed at 105.296s

@ejouretTT

Single core problem – Op problem or LLK problem.

Next step is for @mywoodstock to take a look. We need to narrow this down to a reproducible case.

@ejouretTT

@ttmtrajkovic to assign someone from LLK team to dive into this.

@mo-tenstorrent
Contributor

Tried the following workarounds from @nvelickovicTT and we still hang.

Worker cores are stuck waiting for a NOC command buffer to become available.

Watcher log

Device 0 worker core(x=13,y= 6) virtual(x=16,y= 8):   GW,   W,   W,   W,   W  rmsg:H0D|bnt h_id:0 smsg:DDDD k_ids:0|0|0
Device 0 worker core(x= 0,y= 7) virtual(x= 1,y= 9):  NSW,   W,   R,MWDD,   R  rmsg:D0G|BNT h_id:15 smsg:DGGG k_ids:42|40|43
Device 0 worker core(x= 1,y= 7) virtual(x= 2,y= 9):  NSW,   W,   R,MWDD,   R  rmsg:D0G|BNT h_id:15 smsg:DGGG k_ids:42|40|43
Device 0 worker core(x= 2,y= 7) virtual(x= 3,y= 9):  NSW,   W,   R,MWDD,   R  rmsg:D0G|BNT h_id:15 smsg:DGGG k_ids:42|40|43
Device 0 worker core(x= 3,y= 7) virtual(x= 4,y= 9):  NSW,   W,   R,MWDD,   R  rmsg:D0G|BNT h_id:15 smsg:DGGG k_ids:42|40|43
Device 0 worker core(x= 4,y= 7) virtual(x= 5,y= 9):  NSW,   W,   R,MWDD,   R  rmsg:D0G|BNT h_id:15 smsg:DGGG k_ids:42|40|43
Device 0 worker core(x= 5,y= 7) virtual(x= 6,y= 9):  NSW,   W,   R,MWDD,   R  rmsg:D0G|BNT h_id:15 smsg:DGGG k_ids:42|40|43
Device 0 worker core(x= 6,y= 7) virtual(x= 7,y= 9):  NSW,   W,   R,MWDD,   R  rmsg:D0G|BNT h_id:15 smsg:DGGG k_ids:42|40|43
Device 0 worker core(x= 7,y= 7) virtual(x=10,y= 9):  NSW,   W,   R,MWDD,   R  rmsg:D0G|BNT h_id:15 smsg:DGGG k_ids:42|40|43
Device 0 worker core(x= 8,y= 7) virtual(x=11,y= 9):  NSW,   W,   R,MWDD,   R  rmsg:D0G|BNT h_id:15 smsg:DGGG k_ids:42|40|43
Device 0 worker core(x= 9,y= 7) virtual(x=12,y= 9):  NSW,   W,   R,MWDD,   R  rmsg:D0G|BNT h_id:15 smsg:DGGG k_ids:42|40|43
Device 0 worker core(x=10,y= 7) virtual(x=13,y= 9):  NSW,   W,   R,MWDD,   R  rmsg:D0G|BNT h_id:15 smsg:DGGG k_ids:42|40|43
Device 0 worker core(x=11,y= 7) virtual(x=14,y= 9):  NSW,   W,   R,MWDD,   R  rmsg:D0G|BNT h_id:15 smsg:DGGG k_ids:42|40|43
Device 0 worker core(x=12,y= 7) virtual(x=15,y= 9):  NSW,   W,   R,MWDD,   R  rmsg:D0G|BNT h_id:15 smsg:DGGG k_ids:42|40|43

BH_hang_workaround_metal.patch
BH_hang_workaround_llk.patch

@amahmudTT amahmudTT self-assigned this Feb 21, 2025
@amahmudTT
Contributor

I will be looking at this issue from the LLK side @ejouretTT

@amahmudTT
Contributor

amahmudTT commented Feb 22, 2025

After going through the issue, my thoughts are: since we have faced these types of issues before and found the root cause (as in SDPA and MATMUL), our procedure has been the same: find a small test case on a single core that fails deterministically and simulate it. If it is not small and not on a single core, it is impossible to simulate, since simulating just 2 matmul tiles twice takes around 6 minutes; simulating the whole model on multiple cores would be impossible. Is it possible to provide us with a repro on a single core rather than on the full model?

@mo-tenstorrent
Contributor

We made many attempts to isolate the issue into a small single-core test. Unfortunately, all were unfruitful. So far, reproducing this hang has required running this particular sequence of ops.

@mywoodstock do you have any other idea on how we might be able to better isolate this?

@amahmudTT
Contributor

@mo-tenstorrent The original branch in the issue has been deleted; could you share your latest branch and repro command?
Thanks!

@amahmudTT
Contributor

amahmudTT commented Feb 25, 2025

Applied the compiler arg to 11e3906 on mo/bh_model_test.
It does not seem to remove the hang, but this change, along with disabling gathering, tends to move the hang to downsampling instead of conv2.

   1.43 Kbps /138.5% =   0.00 Mbps | Tx: 137 KB | 64.14 MB | 2:21.62025-02-25 12:01:45.549 | DEBUG    | models.demos.ttnn_resnet.tt.ttnn_functional_resnet50:run_downsample_if_req:168 - Running downsample
   1.43 Kbps /138.5% =   0.00 Mbps | Tx: 255.56 KB | 64.22 MB | 13:40.2

@mo-tenstorrent
Contributor

Pushed the original branch back.

@ejouretTT

@amahmudTT To reduce test cases to further identify root cause.

@amahmudTT
Contributor

amahmudTT commented Feb 26, 2025

Needed to restart after failure; without a restart, recompiling and running hangs no matter what.

The workaround of

  1. calling disable_gathering() in trisc.cc
  2. commenting out all disable_gathering() and enable_gathering() calls

seems to postpone the hang but not fix it.

The compiler change allows many more tests to pass (maybe all; I am confused by the ending timer message, which could just be a dangling timer).

 2025-02-26 06:05:13.495 | DEBUG    | ttnn:manage_config:90 - Set ttnn.CONFIG.report_name to tests/ttnn/integration_tests/resnet/test_ttnn_functional_resnet50.py::test_resnet_50[pretrained_weight_false-batch_size=16-act_dtype=DataType.BFLOAT8_B-weight_dtype=DataType.BFLOAT8_B-math_fidelity=MathFidelity.LoFi-device_params={'l1_small_size': 24576}]: 2025-02-26 06:05:13 (UTC)
                 Device | INFO     | Opening user mode device driver
  Detecting chips (found 1)
   7.80 Kbps /100.0% =   0.01 Mbps | Tx: 1664 bytes | 64 MB | 8.04 s2025-02-26 06:05:13.543 | INFO     | SiliconDriver   - Opened PCI device 0; KMD version: 1.29.0, IOMMU: disabled
2025-02-26 06:05:13.620 | INFO     | SiliconDriver   - Detected PCI devices: [0]
2025-02-26 06:05:13.620 | INFO     | SiliconDriver   - Using local chip ids: {0} and remote chip ids {}
   1.43 Kbps /138.5% =   0.00 Mbps | Tx: 1754 bytes | 64 MB | 8.65 s2025-02-26 06:05:14.136 | INFO     | SiliconDriver   - Device: 0 Mapped iATU region 0 from 0x0 to 0x3fffffff to 0x200000000
                  Metal | INFO     | Initializing device 0. Program cache is NOT enabled
                 Device | INFO     | For Blackhole hardcode AICLK to 800 MHz due to lack of ARC message support
                  Metal | INFO     | AI CLK for device 0 is:   800 MHz
   1.43 Kbps /138.5% =   0.00 Mbps | Tx: 6325 bytes | 64.01 MB | 21.25 s                 Always | FATAL    | Event Order Issue: expected to read back completion signal for event 4 but got 640258074!
libc++abi: terminating due to uncaught exception of type std::runtime_error: TT_ASSERT @ /proj_sw/user_dev/amahmud/tt-metal/tt_metal/impl/dispatch/hardware_command_queue.cpp:648: event_completed == read_descriptor.event_id
info:
Event Order Issue: expected to read back completion signal for event 4 but got 640258074!

This is the ending message. The timer keeps on going; it could just be a dangling timer. Need to confirm.
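For reference, the "Event Order Issue" assert quoted above enforces in-order completion events: each completion record read back from the device must carry the event id the host expects next. A minimal standalone sketch of that check (the function name here is illustrative, not the actual tt-metal API):

```cpp
#include <cstdint>
#include <stdexcept>
#include <string>

// Illustrative sketch of the in-order completion check behind the
// "Event Order Issue" assert in hardware_command_queue.cpp.
// check_event_order is a hypothetical name, not the tt-metal function.
void check_event_order(uint32_t expected_id, uint32_t read_back_id) {
    if (read_back_id != expected_id) {
        throw std::runtime_error(
            "Event Order Issue: expected to read back completion signal for event " +
            std::to_string(expected_id) + " but got " +
            std::to_string(read_back_id) + "!");
    }
}
```

A garbage value like the 640258074 seen above would trip this check, which is why a corrupted or out-of-order completion queue surfaces as this exact fatal.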

2025-02-26 06:21:05.951 | INFO     | models.demos.ttnn_resnet.tests.resnet50_test_infra:validate:339 - ResNet50 batch_size=16, act_dtype=DataType.BFLOAT8_B, weight_dtype=DataType.BFLOAT8_B, math_fidelity=MathFidelity.LoFi, PCC=0.9982199016078069
PASSED                  Metal | INFO     | Disabling and clearing program cache on device 0
                 Device | INFO     | For Blackhole hardcode AICLK to 800 MHz due to lack of ARC message support
   1.43 Kbps /138.5% =   0.00 Mbps | Tx: 865.35 KB | 64.87 MB | 10:06.6

@pgkeller

@amahmudTT
Contributor

Disabling the L1 cache while keeping the other changes (disable/enable_gathering() and the compiler option) still allows the test to run fully (with the timer still going at the end).

@ncvetkovicTT
Contributor

@amahmudTT Thank you! Could you please provide the branch and the command that you're running, either here or in the table, Row 2 Column G (in the note body)? Also, can you please confirm the results that I put in the table, Column G?
https://docs.google.com/spreadsheets/d/1i0bfPhjZcUI1ce_CTE44a-VMnvquQjY-kRPAb5oMXow/edit?gid=0#gid=0

@amahmudTT
Contributor

@amahmudTT Thank you, could you please provide the branch and the command that you're running either here or in the table, Row2ColumnG (in the note body)? Also, can you please confirm the results that I put in the table, ColumnG? https://docs.google.com/spreadsheets/d/1i0bfPhjZcUI1ce_CTE44a-VMnvquQjY-kRPAb5oMXow/edit?gid=0#gid=0

updated

@amahmudTT
Contributor

Confirmed with Mo: the timer at the end was not a hang; it was the profiler, which kept on writing through another thread. So the above modifications do remove the hangs.

@ncvetkovicTT
Contributor

Alright, thanks! I believe, however, that we're going to use another workaround, #10 from the table, which protects CSR writes and adds an SFPI compiler flag.

@ejouretTT

The workaround for #16439 seems to close out this issue and #18065. We still have an open issue affecting the multicore SDPA case. We need to root-cause further to guarantee the issue is in the Tensix and not the NOC.

@ejouretTT

Let's make sure that the workaround works with the compiler issue before closing, @nathan-TT.

@mo-tenstorrent
Contributor

mo-tenstorrent commented Feb 27, 2025

Note the following in case you want to reproduce the hang and test our discovered solutions.

On IRD machines, Weka is slow, so generating large device profiler CSVs can slow things down dramatically. Point the generated folder to localdev with a symlink:

sudo mkdir /localdev/{username}/profiler_generated
sudo chmod 777 /localdev/{username}/profiler_generated
cd $TT_METAL_HOME
rm -rf generated
ln -s /localdev/{username}/profiler_generated generated

With the above done, please run the following on profiler builds (i.e. build_metal.sh -p):

TT_METAL_DEVICE_PROFILER=1 pytest "tests/ttnn/integration_tests/resnet/test_ttnn_functional_resnet50.py::test_resnet_50[pretrained_weight_false-batch_size=16-act_dtype=DataType.BFLOAT8_B-weight_dtype=DataType.BFLOAT8_B-math_fidelity=MathFidelity.LoFi-device_params={'l1_small_size': 24576}]"

It will result in the same hang on that branch if our discovered solutions are not applied.

@ncvetkovicTT
Contributor

ncvetkovicTT commented Feb 27, 2025

@mo-tenstorrent So wait, the fix from this PR doesn't help in your case?

ncvetkovicTT added a commit that referenced this issue Feb 27, 2025
### Ticket
#18064 #18065 #17099 #16673

### Problem description
Disabling the instruction cache doesn't happen in time for a subsequent TTI
instruction not to end up in the cache. In order to properly disable I$,
we need to disable branch prediction first. Since reprogramming the
REPLAY buffers needs to happen while the cache is disabled, the SFPI compiler
cannot rely on REPLAY buffers. These things introduce multiple matmul
hangs.

### What's changed
- Guard CSRRS by disabling branch prediction in BH ckernel.h
- Add a compiler flag for BH which makes the SFPI compiler not use
REPLAY buffers
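The first change is, at its core, an ordering invariant: the branch-prediction disable must precede the CSRRS write that toggles the instruction cache, and the re-enable must follow it. A minimal host-side sketch of that invariant, with hypothetical function names standing in for the real ckernel.h routines (the real guard is a few RISC-V CSR instructions, not C++ calls):

```cpp
#include <string>
#include <vector>

// Sketch of the ordering invariant described above: on Blackhole, branch
// prediction must be off before the CSRRS write that toggles the I$, so a
// subsequent TTI instruction cannot land in the cache. All names here are
// illustrative stand-ins, not ckernel.h functions; each records into a
// trace so the ordering can be checked off-device.
std::vector<std::string> trace;

void disable_branch_prediction() { trace.push_back("bp_off"); }
void enable_branch_prediction()  { trace.push_back("bp_on"); }
void csrrs_toggle_icache()       { trace.push_back("csrrs"); }

void guarded_icache_toggle() {
    disable_branch_prediction();  // guard: must precede the CSR write
    csrrs_toggle_icache();        // the CSRRS being protected
    enable_branch_prediction();   // restore after the write
}
```

Calling `guarded_icache_toggle()` yields the trace `bp_off, csrrs, bp_on`; the hangs came from performing the middle step without the surrounding guard.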

### Checklist
- [x] [All post
commit](https://github.com/tenstorrent/tt-metal/actions/workflows/all-post-commit-workflows.yaml)
CI passes -
[26473](https://github.com/tenstorrent/tt-metal/actions/runs/13569121473/job/37929720035)
- expected to fail
- [x] [Blackhole Post
commit](https://github.com/tenstorrent/tt-metal/actions/workflows/blackhole-post-commit.yaml)
CI passes (if applicable) -
[3768](https://github.com/tenstorrent/tt-metal/actions/runs/13550312504)
- [x] [Model
regression](https://github.com/tenstorrent/tt-metal/actions/workflows/perf-models.yaml)
CI passes (if applicable) -
[7715](https://github.com/tenstorrent/tt-metal/actions/runs/13550316991)
- [x] [Device performance
regression](https://github.com/tenstorrent/tt-metal/actions/workflows/perf-device-models.yaml)
CI passes (if applicable) -
[5094](https://github.com/tenstorrent/tt-metal/actions/runs/13567170826),
expected to fail
- [ ] **(For models and ops writers)** Full [new models
tests](https://github.com/tenstorrent/tt-metal/actions/workflows/full-new-models-suite.yaml)
CI passes (if applicable)
- [ ] New/Existing tests provide coverage for changes
@mo-tenstorrent
Contributor

No, it does. This is just a better reproduction step for quicker testing.

@ncvetkovicTT
Contributor

Oh alright, then I guess we can close this issue as well?

@mo-tenstorrent
Contributor

The plan was to hold off on that until Nathan has actually brought the compile-flag switch into main.

@ncvetkovicTT
Contributor

I have done that with this change. Or is there some other change that you guys had in mind which I'm missing here?

@mo-tenstorrent
Contributor

Oh I see, @pgkeller did you have anything else in mind?

@amahmudTT
Contributor

amahmudTT commented Feb 27, 2025

I remember it was mentioned in the meeting that Nathan was probably going to introduce a change to the compiler (which would probably not need the flag for Blackhole) and wanted this issue kept open so that they could test their change against it.

@pgkeller
Contributor

I think we can close this. Nathan's change will be for SFPU only, and we don't have a repro that shows this issue on SFPU.

@mo-tenstorrent
Contributor

mo-tenstorrent commented Feb 27, 2025

Confirmed that on BH-30, on commit 681d3f7 on main (which has the fix), we don't see a hang when we profile ResNet50. This is without using any env vars or mods to the code.
