Releases · tenstorrent/tt-metal
v0.51.0-rc3
📦 Uncategorized
- Migrate Pad Device and All references
- PR: #9891
- #0: Multi-CQ support for R-Chip
- PR: #10002
- #10028: Remove skip and reduce test case for `moreh_groupnorm` test
- PR: #10029
- #10005: Change input tensor parameter to optional in moreh_sum_backward
- PR: #10007
- #10004: Revise bias tensor usage in moreh_linear_backward
- PR: #10006
- #9663: support moreh_nll_loss_unreduced
- PR: #9804
- #8865: Switch ported ops from tt_lib to ttnn for host dispatch time m…
- PR: #10009
- #0: Update README.md grammar for idiomatic description of TT-NN
- PR: #9827
- #9767: removed more no longer needed manually specified attributes for reflection
- PR: #10023
- Add distributed layernorm kernel documentation
- PR: #9982
- #10031: Fix -Werror=return-type error in composite_ops
- PR: #10036
- #9492: update matmul path in CODEOWNERS
- PR: #10022
- #9450: change silicon fixtures to session scope
- PR: #10019
- Uplift UMD to grab support for configuring static TLBs and Hugepage for BH
- PR: #9934
- #9441: add all typecasts to unit test
- PR: #10046
- #9801: Add cb alignment fix for blackhole that was missed in rebase
- PR: #10051
- #9973: Fix addrmod for reduce scalar, port over missing narrow tile c…
- PR: #10047
- #10052: Add metal pack untilize test
- PR: #10057
- Add ttnn matmul tests to TG unit tests
- PR: #9477
- Add `ssm_prefix_scan` test coverage for N=16
- PR: #10061
- Add PyBind to TTNN Slice (Formerly Referred to Unpad in TT Lib)
- PR: #10056
- #8450: Cleanup items pending from PR #9068
- PR: #10053
- #10030: fix moreh_nll_loss hang
- PR: #10040
- #7736: Remove unused reduce dim & type from reduce_init*
- PR: #10060
- #9871: Update backward files
- PR: #10037
- #9874: Move Unary Backward ops to TTNN
- PR: #9949
- Update op_perf_results
- PR: #10042
- #9962: Enable flags for profiler globals in jit build
- PR: #9964
- Added prefill mode for mamba modules
- PR: #10063
- Increase timeout for Mamba full model tests
- PR: #10064
- Support multiple user indices in paged_update_cache
- PR: #10050
- #10085: Make ttnn::Buffer deallocate execute without querying a potentially destroyed buffer instance
- PR: #10095
- Pack runtime arguments across brisc/ncrisc/trisc
- PR: #9781
- Llama Demo Refactor
- PR: #10018
- #5424: Delegated sfpu reciprocal calls to wh_b0 submodule functions
- PR: #10103
- #0: Move t3k demo tests to perf pipeline because it requires perf governor
- PR: #10106
- #5424: Delegated sfpu reciprocal calls to gs submodule functions
- PR: #10105
- Add trace and multi cq implementations/tests for WH Resnet
- PR: #10021
- #0: (MINOR) Update to v0.51.0
- PR: #10114
- #0: bump python3.8 venv versioning since apt repos updated
- PR: #10111
- #10099: fix semaphores init for packet mux/demux
- PR: #10134
- #10112: Drop hard pin for installation instructions for python3.8-venv in dependencies
- PR: #10113
- Revert "#5424: Delegated sfpu reciprocal calls to wh_b0 submodule functions"
- PR: #10135
- #0: Remove stray assert forcing single CQ on R-Chips
- PR: #10098
- #9490: Replace tt_dnn op's usage in C++ with TTNN
- PR: #9821
- #9874: Merge Next set of unary backward ops to TTNN
- PR: #10066
- #10073: Move unary backward ops to TTNN
- PR: #10065
- Unary backward op migration
- PR: #10078
- #10087: update tt-umd submodule
- PR: #10092
- #9959: Migrated pad to ttnn sweeps
- PR: #10067
- Adding distributed layernorm to llama prefill
- PR: #10054
- Add pytest xdist multiprocess to single-chip demo tests
- PR: #10162
- Revert "Revert "#5424: Delegated sfpu reciprocal calls to wh_b0 submodule functions""
- PR: #10171
- #10071 : Move second set of Unary Backward ops to TTNN
- PR: #10038
- #10083: added tt::stl::json::to_json and tt::stl::json::from_json
- PR: #10084
- #10086: Add logic for splitting cmds that exceed the subcmd limit into separate cmds for semaphores
- PR: #10151
- #5424: Delegated sqrt api call to thirdparty gs submodule sqrt call
- PR: #10183
- #5424: Delegated sfpu api call to sqrt for wh to submodule sqrt call
- PR: #10185
- #0: Fix galaxy eth dispatch init to only init the specified number of cqs (galaxy only supports single cq)
- PR: #10187
- Fix undefined memory bug in `ssm_prefix_scan`
- PR: #10149
- removed weight copies from DRAM to L1
- PR: #10189
- fix syntax issues with test dispatch workflow
- PR: #10182
- #9609: Reorganize libs into ttnn
- PR: #9870
- #10165: Fix build error with g++-12
- PR: #10167
- Adding support for dram sharded matmuls
- PR: #9878
- #10076: Migrate Unary bw ops and replace tt_eager ops with ttnn ops
- PR: #10140
- #10072: Move next set of Unary Backward ops to TTNN
- PR: #10080
- #9082: ping individual falcon member since slack user group is not wo…
- PR: #10193
- #8681: Add Floor, Trunc blocker ops
- PR: #9098
- #9419: use memcpy to avoid mem misalignment
- PR: #10154
- #10079: Move Unary Backward ops to TTNN
- PR: #10145
- Migrate unary ops to TTNN
- PR: #10152
- #9945: Skip SD for nightly FD, device perf tests, and single-card demos as it hangs on di/dt
- PR: #10179
- #10045: use struct for matmul parameter passing and update doc string
- PR: #10153
- #10045: remove use_1d_systolic_array from ttnn matmul
- PR: #10164
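  As a point of reference, here is a minimal sketch of what a call site looks like once the flag is gone, assuming the current ttnn Python API; the `ttnn.CoreGrid` choice and tensor setup are illustrative, not taken from the PR:
  ```python
  import torch
  import ttnn

  device = ttnn.open_device(device_id=0)

  a = ttnn.from_torch(torch.randn(32, 64), dtype=ttnn.bfloat16,
                      layout=ttnn.TILE_LAYOUT, device=device)
  b = ttnn.from_torch(torch.randn(64, 32), dtype=ttnn.bfloat16,
                      layout=ttnn.TILE_LAYOUT, device=device)

  # Before this change: ttnn.matmul(a, b, use_1d_systolic_array=True)
  # After: the flag is removed; either let ttnn pick a parallelization
  # strategy, or steer it explicitly (e.g. via a core grid).
  out = ttnn.matmul(a, b, core_grid=ttnn.CoreGrid(y=8, x=8))

  ttnn.close_device(device)
  ```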
- Ngrujic/profiling
- PR: #10150
- #9319: Upload benchmark data for t3k falcon 7b tests
- PR: #10159
- Aliu/build opt
- PR: #10096
- #10107: Fix hangs w/ launch_msg size >32bytes
- PR: #10157
- [CCL] Making buffer size dynamic to input slice
- PR: #10173
- #7617: remove failing experimental model test
- PR: #10205
- #7618: delete failing experimental model test
- PR: #10214
- #0: fix prefill CI for mamba
- PR: #10227
- Move Mamba tests to wh_b0_only_eth pipeline
- PR: #10206
- #9747: Implement ttnn::tilize in C++
- PR: #10188
- Aliu/prevent aho tanking
- PR: #10216
- #10045: fix up missed parameter change in mamba block model
- PR: #10225
- #9490: Added ttnn support for unary ops py file
- PR: #9883
- #10101: [Blackhole Bringup] Revert Zeroacc to legacy behaviour
- PR: #10217
- Update README.md
- PR: #10176
- #0: Fix imports after tt_lib change
- PR: #10235
- #10226: [Blackhole Bringup] Add new sfpu files
- PR: #10233
- Suppress g++-12 build errors with -Wno flags
- PR: #10204
- #0: Fix BH regression caused by unaligned L1_UNRESERVED_BASE
- PR: #10220
- #10077: Migrate Unary comparison backward ops to TTNN with Overloading
- PR: #10198
- #10175: Remove std::function and restructure ternary_bw
- PR: #10169
- Falcon40b attn mask optimization
- PR: #10089
- #10074: Move Unary backward ops to TTNN
- PR: #10196
- Replace all TT Lib Unpad with TTNN Slice
- PR: #10104
- #10082: Migrate unary bw ops to TTNN and remove std::function
- PR: #10239
- #9715: Use build artifacts for profiler tests
- PR: #10218
- #9021: adding resnet api into ci.
- PR: #10008
- Update README.md
- PR: #10247
- Move pad_on_host/unpad_on_host to host function in TTNN
- PR: #10178
- #9874: Move polygamma_bw to TTNN
- PR: #10146
- #5337: increase t3k frequent test timeout
- PR: #10202
- Update falcon40b readme
- PR: #10261
- #0: add layernorm rmsnorm pybind, move to ttnn
- PR: #10012
- #0: Re-enable read cache in llama_model_optimized.
- PR: #10208
- Update Mistral/Mixtral README files
- PR: #10259
- #0: Update LLama2/3 readme with demo details
- PR: #10263
- #0: resnet perf fix
- PR: #10273
- Update Mamba README.md
- PR: #10262
- OPT convs in RN50 to get better device perf
- PR: #10279
- Increase timeout for N300 WH-only model pipeline
- PR: #10287
- Prefill+Decode Demo Functional Implementation
- PR: #10281
- [Falcon7b] Add wormhole demo perf mode and output verification tests
- PR: #10269
- Update Falcon7/40b READMEs with details on model functionality and perf-mode
- PR: #10290
- bump python 3.8 venv package version
- PR: #10315
- Git bisect workflow on CI runners
- PR: #10316
- #9613: scaffolding for weekly scheduled t3k perplexity tests
- PR: #10142
- fix syntax issue with bisect script
- PR: #10328
- #10231: Clean up t3k runs-on tags to minimum
- PR: #10232
- #9490: Remove tt_eager unary ops and bindings
- PR: #10194
- only build for arch that a dispatched workflow is running for
- PR: #10318
v0.51.0-rc2
📦 Uncategorized
- Migrate Pad Device and All references
- PR: #9891
- #0: Multi-CQ support for R-Chip
- PR: #10002
- #10028: Remove skip and reduce test case for `moreh_groupnorm` test
- PR: #10029
- #10005: Change input tensor parameter to optional in moreh_sum_backward
- PR: #10007
- #10004: Revise bias tensor usage in moreh_linear_backward
- PR: #10006
- #9663: support moreh_nll_loss_unreduced
- PR: #9804
- #8865: Switch ported ops from tt_lib to ttnn for host dispatch time m…
- PR: #10009
- #0: Update README.md grammar for idiomatic description of TT-NN
- PR: #9827
- #9767: removed more no longer needed manually specified attributes for reflection
- PR: #10023
- Add distributed layernorm kernel documentation
- PR: #9982
- #10031: Fix -Werror=return-type error in composite_ops
- PR: #10036
- #9492: update matmul path in CODEOWNERS
- PR: #10022
- #9450: change silicon fixtures to session scope
- PR: #10019
- Uplift UMD to grab support for configuring static TLBs and Hugepage for BH
- PR: #9934
- #9441: add all typecasts to unit test
- PR: #10046
- #9801: Add cb alignment fix for blackhole that was missed in rebase
- PR: #10051
- #9973: Fix addrmod for reduce scalar, port over missing narrow tile c…
- PR: #10047
- #10052: Add metal pack untilize test
- PR: #10057
- Add ttnn matmul tests to TG unit tests
- PR: #9477
- Add `ssm_prefix_scan` test coverage for N=16
- PR: #10061
- Add PyBind to TTNN Slice (Formerly Referred to Unpad in TT Lib)
- PR: #10056
- #8450: Cleanup items pending from PR #9068
- PR: #10053
- #10030: fix moreh_nll_loss hang
- PR: #10040
- #7736: Remove unused reduce dim & type from reduce_init*
- PR: #10060
- #9871: Update backward files
- PR: #10037
- #9874: Move Unary Backward ops to TTNN
- PR: #9949
- Update op_perf_results
- PR: #10042
- #9962: Enable flags for profiler globals in jit build
- PR: #9964
- Added prefill mode for mamba modules
- PR: #10063
- Increase timeout for Mamba full model tests
- PR: #10064
- Support multiple user indices in paged_update_cache
- PR: #10050
- #10085: Make ttnn::Buffer deallocate execute without querying a potentially destroyed buffer instance
- PR: #10095
- Pack runtime arguments across brisc/ncrisc/trisc
- PR: #9781
- Llama Demo Refactor
- PR: #10018
- #5424: Delegated sfpu reciprocal calls to wh_b0 submodule functions
- PR: #10103
- #0: Move t3k demo tests to perf pipeline because it requires perf governor
- PR: #10106
- #5424: Delegated sfpu reciprocal calls to gs submodule functions
- PR: #10105
- Add trace and multi cq implementations/tests for WH Resnet
- PR: #10021
- #0: (MINOR) Update to v0.51.0
- PR: #10114
- #0: bump python3.8 venv versioning since apt repos updated
- PR: #10111
- #10099: fix semaphores init for packet mux/demux
- PR: #10134
- #10112: Drop hard pin for installation instructions for python3.8-venv in dependencies
- PR: #10113
- Revert "#5424: Delegated sfpu reciprocal calls to wh_b0 submodule functions"
- PR: #10135
- #0: Remove stray assert forcing single CQ on R-Chips
- PR: #10098
- #9490: Replace tt_dnn op's usage in C++ with TTNN
- PR: #9821
- #9874: Merge Next set of unary backward ops to TTNN
- PR: #10066
- #10073: Move unary backward ops to TTNN
- PR: #10065
- Unary backward op migration
- PR: #10078
- #10087: update tt-umd submodule
- PR: #10092
- #9959: Migrated pad to ttnn sweeps
- PR: #10067
- Adding distributed layernorm to llama prefill
- PR: #10054
- Add pytest xdist multiprocess to single-chip demo tests
- PR: #10162
- Revert "Revert "#5424: Delegated sfpu reciprocal calls to wh_b0 submodule functions""
- PR: #10171
- #10071 : Move second set of Unary Backward ops to TTNN
- PR: #10038
- #10083: added tt::stl::json::to_json and tt::stl::json::from_json
- PR: #10084
- #10086: Add logic for splitting cmds that exceed the subcmd limit into separate cmds for semaphores
- PR: #10151
- #5424: Delegated sqrt api call to thirdparty gs submodule sqrt call
- PR: #10183
- #5424: Delegated sfpu api call to sqrt for wh to submodule sqrt call
- PR: #10185
- #0: Fix galaxy eth dispatch init to only init the specified number of cqs (galaxy only supports single cq)
- PR: #10187
- Fix undefined memory bug in `ssm_prefix_scan`
- PR: #10149
- removed weight copies from DRAM to L1
- PR: #10189
- fix syntax issues with test dispatch workflow
- PR: #10182
- #9609: Reorganize libs into ttnn
- PR: #9870
- #10165: Fix build error with g++-12
- PR: #10167
- Adding support for dram sharded matmuls
- PR: #9878
- #10076: Migrate Unary bw ops and replace tt_eager ops with ttnn ops
- PR: #10140
- #10072: Move next set of Unary Backward ops to TTNN
- PR: #10080
- #9082: ping individual falcon member since slack user group is not wo…
- PR: #10193
- #8681: Add Floor, Trunc blocker ops
- PR: #9098
- #9419: use memcpy to avoid mem misalignment
- PR: #10154
- #10079: Move Unary Backward ops to TTNN
- PR: #10145
- Migrate unary ops to TTNN
- PR: #10152
- #9945: Skip SD for nightly FD, device perf tests, and single-card demos as it hangs on di/dt
- PR: #10179
- #10045: use struct for matmul parameter passing and update doc string
- PR: #10153
- #10045: remove use_1d_systolic_array from ttnn matmul
- PR: #10164
- Ngrujic/profiling
- PR: #10150
- #9319: Upload benchmark data for t3k falcon 7b tests
- PR: #10159
- Aliu/build opt
- PR: #10096
- #10107: Fix hangs w/ launch_msg size >32bytes
- PR: #10157
- [CCL] Making buffer size dynamic to input slice
- PR: #10173
- #7617: remove failing experimental model test
- PR: #10205
- #7618: delete failing experimental model test
- PR: #10214
- #0: fix prefill CI for mamba
- PR: #10227
- Move Mamba tests to wh_b0_only_eth pipeline
- PR: #10206
- #9747: Implement ttnn::tilize in C++
- PR: #10188
- Aliu/prevent aho tanking
- PR: #10216
- #10045: fix up missed parameter change in mamba block model
- PR: #10225
- #9490: Added ttnn support for unary ops py file
- PR: #9883
- #10101: [Blackhole Bringup] Revert Zeroacc to legacy behaviour
- PR: #10217
- Update README.md
- PR: #10176
- #0: Fix imports after tt_lib change
- PR: #10235
- #10226: [Blackhole Bringup] Add new sfpu files
- PR: #10233
- Suppress g++-12 build errors with -Wno flags
- PR: #10204
- #0: Fix BH regression caused by unaligned L1_UNRESERVED_BASE
- PR: #10220
- #10077: Migrate Unary comparison backward ops to TTNN with Overloading
- PR: #10198
- #10175: Remove std::function and restructure ternary_bw
- PR: #10169
- Falcon40b attn mask optimization
- PR: #10089
- #10074: Move Unary backward ops to TTNN
- PR: #10196
- Replace all TT Lib Unpad with TTNN Slice
- PR: #10104
- #10082: Migrate unary bw ops to TTNN and remove std::function
- PR: #10239
- #9715: Use build artifacts for profiler tests
- PR: #10218
- #9021: adding resnet api into ci.
- PR: #10008
- Update README.md
- PR: #10247
v0.51.0-rc1
📦 Uncategorized
- Migrate Pad Device and All references
- PR: #9891
- #0: Multi-CQ support for R-Chip
- PR: #10002
- #10028: Remove skip and reduce test case for `moreh_groupnorm` test
- PR: #10029
- #10005: Change input tensor parameter to optional in moreh_sum_backward
- PR: #10007
- #10004: Revise bias tensor usage in moreh_linear_backward
- PR: #10006
- #9663: support moreh_nll_loss_unreduced
- PR: #9804
- #8865: Switch ported ops from tt_lib to ttnn for host dispatch time m…
- PR: #10009
- #0: Update README.md grammar for idiomatic description of TT-NN
- PR: #9827
- #9767: removed more no longer needed manually specified attributes for reflection
- PR: #10023
- Add distributed layernorm kernel documentation
- PR: #9982
- #10031: Fix -Werror=return-type error in composite_ops
- PR: #10036
- #9492: update matmul path in CODEOWNERS
- PR: #10022
- #9450: change silicon fixtures to session scope
- PR: #10019
- Uplift UMD to grab support for configuring static TLBs and Hugepage for BH
- PR: #9934
- #9441: add all typecasts to unit test
- PR: #10046
- #9801: Add cb alignment fix for blackhole that was missed in rebase
- PR: #10051
- #9973: Fix addrmod for reduce scalar, port over missing narrow tile c…
- PR: #10047
- #10052: Add metal pack untilize test
- PR: #10057
- Add ttnn matmul tests to TG unit tests
- PR: #9477
- Add `ssm_prefix_scan` test coverage for N=16
- PR: #10061
- Add PyBind to TTNN Slice (Formerly Referred to Unpad in TT Lib)
- PR: #10056
- #8450: Cleanup items pending from PR #9068
- PR: #10053
- #10030: fix moreh_nll_loss hang
- PR: #10040
- #7736: Remove unused reduce dim & type from reduce_init*
- PR: #10060
- #9871: Update backward files
- PR: #10037
- #9874: Move Unary Backward ops to TTNN
- PR: #9949
- Update op_perf_results
- PR: #10042
- #9962: Enable flags for profiler globals in jit build
- PR: #9964
- Added prefill mode for mamba modules
- PR: #10063
- Increase timeout for Mamba full model tests
- PR: #10064
- Support multiple user indices in paged_update_cache
- PR: #10050
- #10085: Make ttnn::Buffer deallocate execute without querying a potentially destroyed buffer instance
- PR: #10095
- Pack runtime arguments across brisc/ncrisc/trisc
- PR: #9781
- Llama Demo Refactor
- PR: #10018
- #5424: Delegated sfpu reciprocal calls to wh_b0 submodule functions
- PR: #10103
- #0: Move t3k demo tests to perf pipeline because it requires perf governor
- PR: #10106
- #5424: Delegated sfpu reciprocal calls to gs submodule functions
- PR: #10105
- Add trace and multi cq implementations/tests for WH Resnet
- PR: #10021
- #0: (MINOR) Update to v0.51.0
- PR: #10114
v0.50.0
📦 Uncategorized
- Fix issue with Mamba SSM `A` weight preprocessing
- PR: #9443
- Make build key unique for mmio and remote devices with the same harvest mask
- PR: #9435
- #5337: Removed eth_dispatch yaml flag from mistral tests
- PR: #9421
- New workflow for custom test dispatch on CI runners
- PR: #9536
- #9312: Add single-header `boost-ext/reflect` library as dependency
- PR: #9328
- Opt LayerNorm/RMSNorm with 2D reduce
- PR: #9603
- Revert "#8630: support uint8 data type"
- PR: #9649
- #0: Fix codeowners for metal bert
- PR: #9635
- Revert "Revert "#8630: support uint8 data type""
- PR: #9651
- #9642: fix matmul2d in1 sharded with batch>1
- PR: #9655
- #0: add tile layout support for GN
- PR: #9645
- FD2 packed binary commands
- PR: #9572
- #9082: t3k demo with slack notifications for owners. split jobs
- PR: #9625
- Rtawfik/issue 9142
- PR: #9674
- #9688: Remove redundant left shift in DEBUG_SANITIZE_NOC_READ_TRANSACTION_FROM_STATE
- PR: #9689
- #9500: Update eth_interface include in tt_cluster to not be hardcoded for WH
- PR: #9501
- #9578: Add WITH_PYTHON_BINDINGS option to allow build w/o python
- PR: #9662
- #9587: Update CB and worker Go signals to respect max sub cmd limit introduced by dispatch packed write local copy change
- PR: #9670
- Add support for bfloat4 weights in Mamba
- PR: #8869
- Use in-place binary operations in Mamba block
- PR: #9726
- #5337: Relaxed Mistral expected compilation time in CI by 1 sec
- PR: #9731
- Mo/9406 profiler build flags
- PR: #9549
- Add support for single col/row/core output grid for matmul 2D
- PR: #9683
- #9725: Set release candidate releases on GitHub to pre-release, not draft, to enable downstream users
- PR: #9729
- add tagged docker image with releases
- PR: #9693
- Rtawfik/issue 9164
- PR: #9700
- #5562: resolve reduce scatter issues (nd hang and correctness)
- PR: #9423
- Create benchmarking tools for saving run/measurement data (with Falcon7b example) and model-demo utilities for verifying tokens/perf
- PR: #9659
- #0: Fix bug with var name in single-chip falcon7b demo tests
- PR: #9740
- #9735: fix issues with including reflect library
- PR: #9737
- #9527: Remove usage of bcast where multiply is used
- PR: #9717
- Mchiou/9082 slack notification owners
- PR: #9690
- #9681: set name attribute for ttnn operations when fast runtime m…
- PR: #9730
- #9553: Add prefix scan op for Mamba prefill
- PR: #9554
- #9628: Merge Binary backward ops from tt_eager to TTNN
- PR: #9570
- Namhyeong kim/support fp32 dest acc in moreh adam
- PR: #9135
- #0: Update t3k workflow timeouts (except freq pipeline)
- PR: #9772
- Temporary update Mixtral perf times to pass CI
- PR: #9673
- #9479: fix cpu core worker bug
- PR: #9739
- #4858: add typecast fp32 <-> int32
- PR: #9736
- #0: ViT demo fix
- PR: #9768
- #9389: Add support for integer type in sum operation
- PR: #9548
- Transfer llama2/3 from experimental to demo folder.
- PR: #9716
- #9657: add topk multicore to support larger dimension sizes
- PR: #9718
- #4858: add typecast bfp8_b
- PR: #9779
- #9082: t3k model perf split tests with slack notifications, disabled cnn
- PR: #9761
- #0: Add ttnn/cpp to packages to enable using ttnn kernels in tt_eager ops
- PR: #9784
- #9741: Set stricter pytest timeouts
- PR: #9742
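  The entry above doesn't list the new limits; as a hedged illustration, per-test caps of this kind are typically expressed with the `pytest-timeout` plugin's marker (the test name and value below are hypothetical, not taken from the PR):
  ```python
  import pytest

  @pytest.mark.timeout(600)  # hypothetical cap in seconds; actual values live in the touched tests
  def test_single_chip_demo():
      ...
  ```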
- #9492: Change models matmul usage to ttnn
- PR: #9727
- #9778: test prefetcher hanging with changes to test
- PR: #9795
- #9490: TTNN eltwise/unary migration
- PR: #9732
- Update timeout for falcon40b t3k demo test
- PR: #9777
- #0: Remove extra t3k falcon40b matrix test group
- PR: #9802
- #9044: Move dispatch core x y to be part of launch msg
- PR: #9743
- Modify rot mat each iteration to avoid allocating 10k tensors upfront
- PR: #9809
- Optimize bcast sharded op
- PR: #9822
- Start using `reflect` library
- PR: #9780
- #0: Properly delete source folders for wheel testing
- PR: #9829
- #9479: Update Mixtral perf estimates
- PR: #9803
- #0: Added github community issue workflow
- PR: #9833
- #8729: Pytest multiprocess reset infrastructure
- PR: #9677
- Enable switching between 1 and 2 cqs in the same process
- PR: #9832
- Fixed failing tests for SD Conv tests for WH using new conv
- PR: #9799
- #0: Switch org-membership check to an authenticated call
- PR: #9840
- #0: Decrease num loops in trace stress tests
- PR: #9724
- #9628: Support optional return tensor
- PR: #9769
- #0: Use CV to wait for cq_reader in production mode. Remove enqueue_record_event for NB calls
- PR: #9793
- #9628: Merge second set of binary backward op from tt_eager to TTNN
- PR: #9771
- #0: Bump bert compile time threshold since it's been intermittently failing on ci
- PR: #9844
- Mchiou/9792 t3k runner management
- PR: #9847
- #0: Bump up Bert inference time due to instability on ci
- PR: #9850
- #8865: For host dispatch time measuring, increase failing reference t…
- PR: #9438
- #9484: Add output_tensor queue_id to dependency ops
- PR: #9494
- Adding the new op: Flash Decode!
- PR: #9794
- #0: Add missing permissions to issue notification job
- PR: #9863
- #9275: Fix Falcon7b demo failing to run by default on a Grayskull e75
- PR: #9859
- #9801: Account for 64B BH PCIe alignment in cq cmd sizing
- PR: #9862
- #0: Make prefetcher early exit after fetching/reading exec_buf
- PR: #9856
- #8683: Add Unary bitwise AND, OR
- PR: #9437
- Ngrujic/profiling
- PR: #9875
- #9628: Merge third set of binary backward op from tt_eager to TTNN
- PR: #9846
- #4858: add typecast uint32
- PR: #9843
- Migrate Pad Host Code, Bindings, C++ Usages from TT Eager to TTNN
- PR: #9816
- Support longer sequence lengths in `ssm_prefix_scan`
- PR: #9776
- #9709: Add optional transpose_a and transpose_b to ttnn matmul and linear
- PR: #9836
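  A minimal sketch of the new flags, assuming the usual ttnn tensor setup (the shapes here are illustrative):
  ```python
  import torch
  import ttnn

  device = ttnn.open_device(device_id=0)

  a_t = ttnn.from_torch(torch.randn(64, 32), dtype=ttnn.bfloat16,
                        layout=ttnn.TILE_LAYOUT, device=device)  # A stored transposed: (k, m)
  b = ttnn.from_torch(torch.randn(64, 128), dtype=ttnn.bfloat16,
                      layout=ttnn.TILE_LAYOUT, device=device)    # (k, n)

  # transpose_a computes A.T @ B without materializing the transpose on host.
  out = ttnn.matmul(a_t, b, transpose_a=True)  # result shape: (32, 128)

  ttnn.close_device(device)
  ```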
- #0: Only run batch 12 bert for GS profiling and tighten some bert/resnet thresholds
- PR: #9851
- Asarje/resnet highres 20240624
- PR: #9660
- #9492: replace falcon specific matmul calls
- PR: #9810
- Extend ssm_eltwise_mul for num_users > 32
- PR: #9867
- Update documentation for adding new ttnn operation
- PR: #9841
- Extend ssm_1d_reduce for the batch>32
- PR: #9881
- #0: rn50 fix add api
- PR: #9890
- #9123: Add support for optional output tensors to run in the worker t…
- PR: #9894
- #9861: support check_tensor helper_function
- PR: #9869
- Fix syntax issues in custom test dispatch workflow
- PR: #9567
- Add Mixtral accuracy tests and cleanup its other tests (CI-friendly)
- PR: #9864
- #9876: Increase timeout on falcon7b perplexity tests.
- PR: #9880
- #9492: Remove bmm/resnet_matmul from models
- PR: #9896
- #9410: enable fp32 precision unpacking for interm. CBs
- PR: #9885
- #9903: Fix conditional statements and indexing of y values in CoreRange::diff
- PR: #9915
- #9860: fix test create device apis
- PR: #9919
- #0: delete unused code
- PR: #9921
- #9719: fixed l1 clear issue on nlp create qkv heads decode test case
- PR: #9924
- Fixing typo in llama demo readme
- PR: #9927
- #9892: Device only op report
- PR: #9914
- #8704: define consts for registers that hold x-y coordinates and amount to shift address to get x-y coord
- PR: #9897
- CODEOWNERS update
- PR: #9930
- Abhullar/bh misc fix
- PR: #9899
- Auto-register C++ ttnn operations in python
- PR: #9900
- #9788: Remove TopK from TTLib and replace all references with the TTNN api
- PR: #9884
- #0: add owners for resnet demo
- PR: #9937
- 7-way split of eager tests
- PR: #9950
- #9910: Improve Softplus kernel accuracy
- PR: #9893
- #9818: Add cache check to op info V2
- PR: #9826
- #0: update noc test bound
- PR: #9922
- Fix branching bug in softplus kernel
- PR: #9955
- propagate error upwards for tests in falcon 40b suite
- PR: #9957
- #0: Fix falcon40b softmax import failure
- PR: #9958
- #9755: move ttnn.concat to match the new file structure
- PR: #9923
- #9837: Assign workers after performing ref count cleanup in async mode
- PR: #9944
- #0: Make event_synchronize API safer
- PR: #9965
- #0: Update buffer asserts to account for trace buffers
- PR: #9918
- Clean up ttnn operation registration on python side
- PR: #9961
- #9164: [Blackhole bringup] Add fix for unpack untilize
- PR: #9967
- Aliu/no l1 clear
- PR: #9931
- Restructure ttnn::permute to match the new standard format
- PR: #9917
- #9815: Update host to pass packed write max unicast sub cmds to cq dispatch
- PR: #9868
- Distributed layernorm op
- PR: #9382
- #9831: re-enable test
- PR: #9976
- #8835: cleaned up ttnn operation registration on C++ side
- PR: #9975
- #9941: update dram/l1 to noc xy header to do the appropriate shift
- PR: #9948
- #9336: Refactoring moreh layernorm
- PR: #9636
- #9745: move unpad to slice ttnn cpp references
- PR: #9970
- #9980: Update falcon updated outputs
- PR: #9981
- Fix Main after Pad Merge
- PR: #9988
- Update eltwise bcast unary ops to use memory_config and fix PCC issue for interleaved output
- PR: #9939
- Update FD cmds to be PCIe aligned
- PR: #9929
- Fix N150 product name to nebula_x1 even if it's unharvested.
- PR: #9925
- #0: add a second codeowner for conv
- PR: #9990
- #0: Get tt-metal to compile with gcc-12
- PR: #9943
- #9492: Change to ttnn matmul in tests and tt_eager
- PR: #9928
- #9441: add typecast uint16->uint32
- PR: #9991
- Move ttnn::embedding to match new pybind structure and replace C++ ttlib embeddings usage with it
- PR: #9969
-...
- PR: #9969
v0.49.0
📦 Uncategorized
- #5044: Add optional output to addalpha
- PR: #8785
- #9059: Fix matmul for single core grid
- PR: #9341
- readme update
- PR: #9352
- #0: (MINOR) Update to v0.49.0
- PR: #9353
- #7586: Move common models for single-card nightly to ln model
- PR: #9351
- Update Mamba README
- PR: #9344
- TTLIB interval to sharded sweeps
- PR: #9003
- #0: Update dataflow api comments
- PR: #9343
- #9196: Merge new op: Fast reduce nc into main
- PR: #9330
- #0: New resnet50 test skipped on WH since its WIP
- PR: #9355
- #9329: Restructure ttnn::argmax
- PR: #9331
- #9323: Introduce template for new ttnn pull requests
- PR: #9324
- #0: skip release build on GH runners, we already test it via build a…
- PR: #9362
- Remove unused dependencies and fetch gtest via CPM
- PR: #9332
- #8764: Part 3 of docs and model demos changes
- PR: #9350
- Ngrujic/profiling
- PR: #8939
- [Mistral-7B] Add flags for weight paths
- PR: #9173
- Typecast int32->fp16b
- PR: #9317
- #9258: Remove ARCH_NAME and TT_METAL_ENV from wheel testing
- PR: #9354
- Implemented SD using new Conv API
- PR: #8786
- #9258: Re-add wheel into release assets
- PR: #9374
- #9361: Install Clang-17 and gdb 14.2
- PR: #9363
- #7525: Re-skip demo batch 7 metal_BERT_large_11 on WH because it still hangs ND
- PR: #9385
- #9206: add sfpu config reg init to llk sfpu inits
- PR: #9358
- #9059: Avoid a couple of fatals in matmul
- PR: #9387
- Add Galaxy support.
- PR: #9068
v0.48.0
📦 Uncategorized
- #7744: Add support for non-4D tensor in moreh_sum, moreh_sum_backward
- PR: #7745
- #5544: Add output tensors parameter to moreh_nll_loss op
- PR: #7194
- #5544: Add output tensors parameter to moreh_sgd op
- PR: #7193
- #5544: Fix package build error
- PR: #7818
- #5544: Add output tensors parameter to moreh_linear op
- PR: #7147
- #5544: Prevent eager unit test failures
- PR: #7835
- #7997: Support non-4D tensor in moreh_softmax
- PR: #7998
- #7816: Bump SD perf target
- PR: #8140
- #8098: Remove temp buffer copying when reading from hugepage to host buffer
- PR: #8138
- #0: Specify DEBUG_STATUS as a string literal instead of multiple chars
- PR: #7981
- #8212: Fix uneven shards for interleaved_to_sharded op
- PR: #8259
- #0: Refactor unpad tile to modify rt args in place and remove dynamic…
- PR: #8308
- #7838: Add support for non-4D tensor in moreh_linear OPs
- PR: #8388
- #0: Use split_work_for_tilize in both tilize and untilize
- PR: #8470
- #8131: resnet-50 fix for b20.
- PR: #8283
- Add support for multiple parameters in `EltwiseUnary`
- PR: #8398
- #7625: Enable multicore for tilize with padding by default
- PR: #8527
- Trace Support
- PR: #8572
- #0: Switch set runtime args assertion for if kernel was placed on core to TT_ASSERT
- PR: #8645
- #7179: enabling test case. The issue was not reproducible on 8.12 dri…
- PR: #8613
- #4625: Multicore runs for untilize with unpadding on interleaved tensors
- PR: #8622
- #0: Cache program cmds, convert cb configs from write linear to write packed
- PR: #8604
- #0: Make skip and xfail optional in defining sweep tests
- PR: #8687
- Shwetank tt/bcast op
- PR: #8058
- #8364: Disable implicit fallback for ttnn.pad
- PR: #8742
- #8513: Add slack notifications to several more pipelines
- PR: #8685
- #0: Update common RT args to use no stride flag for packed cmd.
- PR: #8696
- #0: Option to write compile_commands.json from CMake
- PR: #8761
- #8718: eltwise testing for bfloat8
- PR: #8753
- Add support for bfloat8 input tensors in Mamba SSM block custom kernels
- PR: #8733
- #8460: Enable Clang-17
- PR: #8516
- #0: Remove overhead in calling functions wrapped in tensor_impl_wrapper
- PR: #8840
- #0: Updating the perf threshold to incorporate the "Merge back uneven reshard" commit.
- PR: #8849
- #6365: Add ttnn host tests
- PR: #8210
- #6365: Revert "#6365: Add ttnn host tests (#8210)"
- PR: #8879
- #4382: fix GH reported vulnerabilities
- PR: #8876
- #0: bump C++ timeout limit to 45 minutes
- PR: #8882
- update unpad doc for slice generality
- PR: #8878
- Convert Falcon7b tt_lib ops and tensors to ttnn.experimental
- PR: #8870
- #6365: Fix ttnn host wheel tests
- PR: #8897
- Add git bisect script
- PR: #8894
- #0: Move falcon40b ci unit tests to different pipeline
- PR: #8891
- #8437: remove default matmul program config
- PR: #8772
- #0: Add myself to ttnn codeowners
- PR: #8905
- #0: Update README.md to include mention of TTNN_CONFIG_OVERRIDES
- PR: #8909
- #0: Fix typos and add TTNN_CONFIG_OVERRIDES parameter descriptions to readme
- PR: #8910
- #0: Add basic sanity checks during matmul program config creation
- PR: #8875
- #8907: Sweep tests for tilize/untilize
- PR: #8908
- #8902: Fixed program caching bug in nlp load slice op and added additional test cases for the op
- PR: #8913
- #8917: Add sweep test for the fold op
- PR: #8918
- #0: Properly support trivial single core case for 1D matmuls
- PR: #8915
- #6343: updated test_perf with test for bloom causal_lm
- PR: #8391
- #6343: Add functional_bloom test_demo
- PR: #8431
- Update README.md
- PR: #8927
- Enable optimised attention by default in falcon prefill.
- PR: #8892
- Replace FreeList shared_ptr with local_shared_ptr
- PR: #8798
- Add dummy_weights mode for mixtral tests
- PR: #8864
- Refactor operation calls: Replace operation::run() with operation::launch_op()
- PR: #8893
- Use HiFi2 to bump Falcon7b prefill PCC
- PR: #8719
- #8902: add input and attn_mask del
- PR: #8928
- #8930: Disable llama perf test
- PR: #8935
- #0: Add third codeowner to matmul path
- PR: #8934
- #0: Add create_venv.sh as environment option in installation instructions
- PR: #8898
- #7083: Composite conv fix for relu called after matmul
- PR: #8919
- #7525: Skip batch 7 metal BERT on WH B0 because it still hangs too often
- PR: #8938
- #8871: Add initial infra/support for dram sharding
- PR: #8901
- #8531: delete all makefiles
- PR: #8546
- #0: Delete dead code from work_split.hpp
- PR: #8950
- #8853: Uplift SFPI to latest w/ BH support
- PR: #8854
- #8725: Warn user if kernel cache is enabled
- PR: #8951
- #0: Minor test_prefetcher fixes
- PR: #8955
- #5389: Move ttnn.repeat to c++
- PR: #8911
- #8131: temp fix for PCC issue on W0.
- PR: #8948
- Optimize e2e perf Falcon40b modifying layernorm
- PR: #8969
- #0: Relax Falcon7b perf target
- PR: #8972
- #0: Resolve segfault in llama async mode
- PR: #8963
- Resnet Optimizations
- PR: #8933
- Create Falcon7b perplexity test and utility functions for text-gen datasets
- PR: #8960
- Revert "#8131: temp fix for PCC issue on W0."
- PR: #8984
- bmm dram sharded opt
- PR: #8947
- #8943: Clean up profiler python_env build flow
- PR: #8949
- #8904: Add slack notifications for T3000 unit-tests
- PR: #8906
- Add unet shallow functional, performance and demo test files
- PR: #8884
- #8932: Multi-Device Mixtral Argmax Support
- PR: #8990
- #8264: Worker thread optimizations:
- PR: #8778
- TTNN tests for bf8 with mk tiled scalar
- PR: #8485
- Ihamer/7468 inject noc delays
- PR: #8889
- Support changed csv row orderings in Mixtral's op_perf_results.py
- PR: #8999
- Correct merge issue in op_perf_results.py
- PR: #9001
- #0: Add kernel groups to test_pgm_dispatch
- PR: #8992
- #0: Add docs requirements to python env cache key because it can change the environment as well
- PR: #9010
- #0: Add helper function to create CBs
- PR: #8991
- #8973: Remove TT_METAL_ENV because we don't need it anymore
- PR: #8974
- #5773: Move SD model to demo folder
- PR: #8294
- #6938: Implement softplus as a single kernel
- PR: #8249
- Model team/rotary embeddings llama
- PR: #8812
- #8735: Fix hw/inc/blackhole files for compilation
- PR: #8880
- Improve Mixtral perf with ttlib
- PR: #8971
- Update README.md
- PR: #9014
- #3712: fix old version of GN test
- PR: #9017
- #0: Don't error on unused functions in compiler call
- PR: #9018
- Revert " #8904: Add slack notifications for T3000 unit-tests"
- PR: #9023
- Rtawfik/bh llk api
- PR: #8809
- #0: Added interactive demo
- PR: #9020
- Move Falcon7b before Mixtral in demo pipeline to workaround issue
- PR: #9034
- #8112: Add support for ND tensors to matmul
- PR: #9004
- #0: fix dram read benchmark
- PR: #9019
- Fix bug in utility_functions::Profiler
- PR: #9025
- Remove 1x1 matmul fallback on convolution and generalize convo…
- PR: #8886
- #5389: Remove ttnn.split
- PR: #9027
- #8767: decouple build folder name from build.cpp
- PR: #8780
- #8735: Update common flags for BH build after sfpi module update
- PR: #9024
- #8895: Fix ttnn.as_tensor(..) method for placing tensors on-device
- PR: #8964
- #8539: Add cq_id to run_operation function args
- PR: #9039
- #8632: Support fp32 dest acc en in moreh_sum and moreh_sum_backward
- PR: #8724
- #5044: Add optional output tensor and remove autoformat in eltwise binary ops
- PR: #8394
- #8895: Fix failing regression test in dump_tensor(...) API
- PR: #9040
- More Resnet Optimizations
- PR: #8993
- #4858: add typecast fp32 to uint32 op
- PR: #9033
- #8995: refactoring moreh arange
- PR: #8996
- #0: Add ccache option to build_metal.sh
- PR: #9015
- Update Mixtral perf figures
- PR: #9048
- #8349: Use BFP4_B for attention mask in falcon7b optimised prefill.
- PR: #9047
- #0: Add CODEOWNERS for build_metal.sh
- PR: #9053
- Rtawfik/add binary reuse metal
- PR: #8727
- Update watcher.rst - use double backticks
- PR: #9054
- Falcon40b tt_lib to ttnn.experimental
- PR: #9008
- #0: fix dram sharded program cache
- PR: #9031
- #7083: New halo fix for enabled program cache
- PR: #8987
- #9051: Enable Llama model perf test
- PR: #9052
- #8764: Single card WH demo tests
- PR: #9058
- #8764: Various docs fixes for WH release
- PR: #8975
- #0: Correct script locations for nightly single card
- PR: #9062
- #8764: Use new device_l1_small_size fixture for SD demo interactive test
- PR: #9063
- #9059: Update matmul test pcc
- PR: #9061
- #0: Ensure weka mount is active for demo tests, otherwise it won't run
- PR: #9069
- #0: remove reserve to avoid bad alloc
- PR: #9067
- #8764: Separate n150/n300 demo tests to not run BERT 11 on N150
- PR: #9073
- Remove unnecessary llk sfpu param files
- PR: #9065
- #9059: Add fallback for getting matmul program config
- PR: #9077
- Add grouped convolution support
- PR: #8341
- #8282: Support non-4d tensor and fp32_dest_acc_en for moreh nllloss backward
- PR: #8966
- #8976: moreh_getitem receive signed integer index tensors
- PR: #8978
- #9049: fix moreh_sgd callback and add callback test
- PR: #9050
- #0: Remove argmax multi-device test due to segfault
- PR: #9089
- #7724: Add prototype for autonomous streams for use in tunneller
- PR: #8207
- #9036: GS & BH --> Combine llk param files using variable args
- PR: #9078
- #0: optimize allgather for small tensor sizes
...
v0.46.0
📦 Uncategorized
- user-triggerable C++ post-commit suite
- PR: #6626
- #6406: add missing position_ids/attention_mask to bert demo
- PR: #6617
- #6282: Add AdamW
- PR: #6333
- #6315: Fix dprint tests for T3000
- PR: #6599
- FD2: prefetch stall, dispatch wait, linear read, delay and cleanup
- PR: #6620
- #6609: update wording in demo section of main README.md
- PR: #6639
- #6364: Autocomplete for pybinded types
- PR: #6440
- Asarje/ttnn rn50 b20
- PR: #6629
- FD2.0 Test - Fix l1 buffer not page-size aligned in after FD-on-eth changes to L1_UNRESERVED_BASE
- PR: #6646
- #6593: Add resharding to Llama2 model when possible.
- PR: #6595
- #6572: Fix `ttnn.repeat_interleave` example in documentation
- PR: #6574
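  For reference, the documented call mirrors `torch.repeat_interleave`; a minimal sketch (shapes illustrative):
  ```python
  import torch
  import ttnn

  device = ttnn.open_device(device_id=0)

  t = ttnn.from_torch(torch.randn(1, 1, 32, 32), dtype=ttnn.bfloat16,
                      layout=ttnn.TILE_LAYOUT, device=device)

  # Each slice along `dim` is repeated `repeats` times, as in torch.
  out = ttnn.repeat_interleave(t, repeats=2, dim=2)  # shape: (1, 1, 64, 32)

  ttnn.close_device(device)
  ```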
- #5780: Re-enable 100K enqueue program stress test on grayskull
- PR: #6648
- Enable basic width sharding support in all-gather
- PR: #6642
- Alex/metal/remove cb wait markers
- PR: #6628
- #6657: Use sysmem manager cq size instead of recomputing it each time…
- PR: #6658
- #0: (MINOR) Add Grayskull purchase link and update version to 0.46.0
- PR: #6667
- #5063: add TopK API to metal
- PR: #6563
- #5480: FD2.0 Test - Fix test_prefetcher for dram paged read test (-t 3) on whb0
- PR: #6663
- Fix logit low pcc
- PR: #6538
- Backward op - Fixed ldexp, hardsigmoid and asin
- PR: #6542
- #6598: Fix softplus
- PR: #6675
- Add support for BFP4_B tensor serialization
- PR: #6545
- Eltwise mul for different batch size
- PR: #6587
- #6575: Split docs into separate Metalium and nn docs
- PR: #6666
- #0: Add two separate links for documentation (tt-metalium/ttnn) on README
- PR: #6697
- #6361: Update ttnn repeat to use correct shapes when formatting output
- PR: #6526
- #0: Sayonaraaaaaaa
- PR: #6702
- FD2.0 Test fix test_prefetcher add_paged_dram_data_to_worker_data dropping start_page
- PR: #6703
- #5785: Watcher ringbuffer implementation
- PR: #6652
- Add FD 2.0 WriteHost Command
- PR: #6614
- #0: Put back frequent api tests because I'm an idiot
- PR: #6698
- Optimize All Gather Interleaved Worker send/receive
- PR: #6706
- #0: changing all `#include common/*` to `#include tt_metal/common/*`
- PR: #6669
- #6676: Fix issues related to unary lte and gte
- PR: #6685
- #5817: Fix lerp
- PR: #6630
- #6589: Fix for relu_bw
- PR: #6631
- #6633: Backward test update
- PR: #6679
- #0: Skip logit, logiteps test
- PR: #6714
- #0: Testing CI fix
- PR: #6708
- #5480: Update test_prefetcher to pass added hugepage args to dispatch kernel
- PR: #6717
- Fix l1 acc, add whb0 optimized conv tests
- PR: #6668
- Alignment fix for eth core kernels
- PR: #6696
- Add data parallel (multi-chip) for Falcon7b (prefill/decode) model and corresponding tests
- PR: #6656
- CQ_DISPATCH_CMD_WRITE_PAGED support in test_dispatcher and passing tests
- PR: #6641
- #6647: disable failing ci cpp tests and reenable cpp pipeline on CI
- PR: #6704
- Backward test updates
- PR: #6692
- Ngrujic/check bugs
- PR: #6688
- Add Llama matmul perf tests to main
- PR: #6690
- TTLIB: removing working tests from broken
- PR: #6718
- #6443: Update backward asin and addcdiv logic
- PR: #6715
- #0: Fix output cb size calculation in reshard op for bfp8b
- PR: #6739
- #0: use smart ptrs in allocator
- PR: #6719
- Jvasilje docs 0322
- PR: #6745
- DRAM based device profiler with Tracy support
- PR: #6460
- #6553: Fix ttnn.reshape(..) handling for bfloat16, TILE_LAYOUT
- PR: #6746
- Add Llama2 demo to tt-metal docs
- PR: #6682
- Mistral-7B WH demo
- PR: #6501
- Revert "#0: Put back frequent api tests because I'm an idiot"
- PR: #6755
- FP32 support
- PR: #6747
- #0: Add back frequent api tests to run.sh
- PR: #6756
- Bteng/watcher ci3
- PR: #6530
- Remove cpuprof
- PR: #6758
- logo update
- PR: #6762
- #6184: sharded row major silu support.
- PR: #6643
- #6443: Update div_bw and backward ops test file
- PR: #6742
- #6705: Relax forcing of keyword argument in ttnn.open_device
- PR: #6707
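  A short sketch of the call in question; the relaxation presumably allows passing the id positionally, which previously raised an error:
  ```python
  import ttnn

  # Keyword form, always accepted:
  device = ttnn.open_device(device_id=0)
  ttnn.close_device(device)

  # After the relaxation, a positional id is presumably accepted too:
  device = ttnn.open_device(0)
  ttnn.close_device(device)
  ```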
- Forward op tests
- PR: #6730
- #6691: Allow blocking of inner dim within a core for shaded in0 for 2d and 1d systolic matmuls
- PR: #6640
- #6662: Width Sharding support for eltwise OP
- PR: #6671
- Stable diffusion python API level perf improvements
- PR: #6681
- Add get_compute_kernel_config_args function
- PR: #6768
- #0: Add fd-2/main triggers for pull_request and push for post-commit
- PR: #6709
- #5480: FD2 refactor for pre/dis patch variants
- PR: #6655
- #6654: Add perf tests for ttnn ResNet50
- PR: #6673
- #5480: Fix fd gtest unit test test_write_host
- PR: #6778
- #0: Set myself as setup.py owner
- PR: #6779
- #6780: Add mistral7b to demos list in getting started
- PR: #6781
- #4003: re-added TTNN_ENABLE_LOGGING as runtime flag
- PR: #6750
- #0: Fix semaphore address gen bug
- PR: #6233
- #6769: Disable program caching for failing Llama tests.
- PR: #6770
- #5480: Fix zero sized write transaction request that could occur in write_linear_host
- PR: #6784
- #6077: Fix unet pcc issues
- PR: #6660
- Remove DstSync from llk api templates
- PR: #6753
- FP32 Support
- PR: #6785
- #6680: Reverting move op change
- PR: #6811
- #6443: Update asinh and softsign backward
- PR: #6773
- Backward tests with updated test modules
- PR: #6765
- Ngrujic/check bugs 1
- PR: #6734
- #6654: Moving init for self.compute_kernel_config
- PR: #6782
- #6805: reproduce the bug with sharded split_query_key_value_and_split_heads
- PR: #6806
- #6832: Account for tile-padding in softmax for mistral 7B
- PR: #6833
- Enable support for uint32 format to be consumed by SFPU (issue #4624)
- PR: #6796
- #4252: fix clang build error since std::log2 only constexpr in gcc
- PR: #6835
- #4003: log, debug and add pre- and post- hooks only for top-level ttnn ops
- PR: #6841
- #6823: Fix core count to not include dispatch cores in op reprot
- PR: #6831
- #6197: Align pages for interleaved <-> sharded.
- PR: #6828
- METALIUM_GUIDE
- PR: #6846
- Bteng/watcher post commit
- PR: #6760
- #6443: update backward test file for relational ops and concat op
- PR: #6817
- Revert "Bteng/watcher post commit"
- PR: #6866
- #6443: Update backward ops
- PR: #6826
- Backward test updates
- PR: #6822
- #0: Add the dim 0 support repeat backward
- PR: #5596
- Update hard related test ops
- PR: #6816
- #6757: Remove set_profiler_location
- PR: #6824
- #6443: Update backward ops erfinv elu hypot cos sin
- PR: #6827
- #6861: Enable Watcher/dprint tests on T3000 CI
- PR: #6869
- Update Mistral perf regression for CI, until issue is resolved
- PR: #6883
- Mamba/perf v1
- PR: #6744
- #0: remove data movement ops related to silu in SD
- PR: #6798
- #4003: added proper fallback for getitem of ttnn.Tensor. Slice the tensor only on the tile boundary but set the shape based on whatever the user provided
- PR: #6886
- #4003: added proper fallbacks for every op that falls back to torch
- PR: #6888
- #6731: add fix to LN width sharding
- PR: #6891
- #5797: add back sweep test for ln
- PR: #6893
- Integrate GroupNorm V2 to SD model
- PR: #6862
- METALIUM_GUIDE.md updates
- PR: #6863
- [Falcon7b] Fix bugs with inference throughput measurements in demo
- PR: #6884
- #0: shallow unet add perf_mode
- PR: #6904
- #6154: 2d matmul in0 height, in1 width sharding
- PR: #6821
- #5249: Various Falcon40b test and demo cleanup
- PR: #6764
- #0: fix incremental build
- PR: #6914
- #0: remove upsample spill to DRAM
- PR: #6905
- [Llama2 Prefill] Model Functionality completed
- PR: #6800
- Watcher alignment checking for PCIe/DRAM <-> L1
- PR: #6901
- #6920: fixed the error in whisper
- PR: #6921
- Update METALIUM_GUIDE.md
- PR: #6902
- #6644: save l1 buffers to data base
- PR: #6856
- Update usage.rst
- PR: #6929
- #6804: fix ttnn falcon7b demo regression + add to CI regressions
- PR: #6924
- #6285: Add backward support for floor round and div_no_nan
- PR: #6290
- [skip ci] Update INSTALLING.md
- PR: #6936
- #6873: Add more test combinations to tt_lib sweeps add, add_unary, su…
- PR: #6887
- Ngrujic/check bugs 3
- PR: #6951
- #6882: Updated Mistral-7b perf estimate
- PR: #6892
- #6850: Update install links in Sphinx docs to point directly to INSTALLING.md
- PR: #6953
- #6619: Fix per op profiler sum
- PR: #6955
- #6644: sync before calling print l1 buffers
- PR: #6958
- Barsic/ttlib ops check
- PR: #6772
- Barsic/ttlib params fix
- PR: #6944
- #6962: Move cd tt-metal earlier in the command list of INSTALLING.md
- PR: #6966
- #6819: Add support for CreateKernel absolute file paths
- PR: #6922
- #6356: Remove half-half grid logic for bmms
- PR: #6968
- #4003: added a flag to disable ttnn fallbacks. Don't throw an error w…
- PR: #6961
- #0: Correct FW versions, tt-smi versions, and add note about tt-topology
- PR: #6971
- #0: Capitalize tt to TT consistently for marketing
- PR: #6973
- #0: Add myself as CODEOWNER for INSTALLING.md
- PR: #6974
- #6644: ttnn visualizer
- PR: #6935
- #6847: Allow disabling individual watcher features
- PR: #6855
- #6889: Support printing/padding/tilizing multi-device tensors
- PR: #6976
- #4003: removed ttnn.print_l1_buffers and consolidated all ttnn flags into a CONFIG class
- PR: #6980
- #6217: tt_lib async mode support (single chip tensors supported)
- PR: #6700
- Reshard With Ranges
- PR: #6919
- #4003: updated buffer report to show...
v0.45.0
🚀 Features
- #6204: added support for num_users < 32 for update cache op.
- PR: #6213
- #6247 Llama2 Galaxy MLP implementation
- PR: #6265
📦 Uncategorized
- #4736: Add support for moreh_norm op
- PR: #4864
- Fix moreh_layernorm rstd
- PR: #5616
- #5508: Change test_moreh_layernorm.py for debugging
- PR: #5619
- #4686: add infra for sharing global struct among ops
- PR: #5456
- #5592: Fix pcc on Falcon 7b prefill by turning on l1 packer on MLP 4h-to-h matmul
- PR: #5686
- Fix layernorm beta data format reconfig
- PR: #5760
- Add linked support for in0 in1 mcast in matmul
- PR: #5759
- #4957: optimizing construct_2d_padded_tensor_list
- PR: #5614
- #4003: added ttnn.as_tensor and enabled support for caching torch tensor
- PR: #5809
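  A hedged sketch of the new entry point; the cache path below is hypothetical, and `cache_file_name` is the knob that lets the converted tensor be reused across runs:
  ```python
  import torch
  import ttnn

  device = ttnn.open_device(device_id=0)

  weight = ttnn.as_tensor(
      torch.randn(4096, 4096),
      dtype=ttnn.bfloat16,
      layout=ttnn.TILE_LAYOUT,
      device=device,
      cache_file_name="model_cache/wqkv",  # hypothetical path; written once, reused on later runs
  )

  ttnn.close_device(device)
  ```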
- Revert "#0: Fix for fail in asinh backward"
- PR: #5886
- #5829: Use moreh_common.hpp for data movement kernels across moreh OPs
- PR: #5833
- Barsic/ttnn ops
- PR: #5892
- #6030: Update resnet performance metrics
- PR: #6030
- #5876: pytest & c++ test logging cleanup
- PR: #5987
- #0: Use both 2x2 and 2x4 machines on every scheduled run
- PR: #6091
- Add single core matmul benchmark
- PR: #5997
- #6079: Update FORCE_INLINE to be nop when watcher is enabled
- PR: #6092
- #5980: Fix a hard-coded bounds check in dprint
- PR: #6028
- #5389: merged ttl and ttnn tensor classes into one
- PR: #6051
- Initial Performance Model
- PR: #6025
- fix ci
- PR: #6089
- TTNN RN50 :: on the road to match perf with TTLIB version
- PR: #6046
- #4438: Optimized single-core fold op
- PR: #5999
- #5589: Add repeat-interleave and addcmul sweeps
- PR: #6102
- #6055: Add square backward support
- PR: #6071
- #6057: Add backward support for lgamma
- PR: #6059
- #6056: Add backward support for frac and trunc
- PR: #6065
- #6066: Add support for backward log sigmoid
- PR: #6069
- #6002: Add backward support for binary maximum
- PR: #6003
- Ngrujic/improve conversion to bfloat8b in sweeps
- PR: #6068
- #5829: Use moreh_common.hpp for compute kernels across moreh OPs
- PR: #6122
- #0: Remove post-commit label from multi device pipeline because it's not actually post commit
- PR: #6142
- Add pack l1 acc to resnet conv
- PR: #6054
- #6144: Skip 512x512 cross attn 2d upblock for now in nightly because it hangs
- PR: #6145
- #6061: Add tanhshrink, threshold, Unary EQ backward ops support
- PR: #6137
- Width Sharded Concat for Unet
- PR: #5776
- #5184: uncommenting various moreh test case.
- PR: #6143
- Fix compute kernel config arg for resnet50
- PR: #6147
- Nsmith/untilize unit test
- PR: #6105
- Revert "Revert "#5389: merged ttl and tensor classes into one""
- PR: #6158
- #4438: Do not use the new fold op in Resnet tests
- PR: #6153
- Remove corerangeset that does not work on wormhole
- PR: #6156
- #6129: Expose kernel config attrs and use 4 dst tiles for fp32 configs
- PR: #6134
- #5391: Add device perf
- PR: #5875
- #0: Use multiplier for wormhole b0 mulsi3
- PR: #6160
- #4003: removed ttnn.Tensor autoclass from tensor.rst
- PR: #6170
- TTNN MultiDevice Support
- PR: #6131
- build artifacts
- PR: #6111
- #4947: Add noc alignment checks to watcher
- PR: #5998
- Add ttnn multi-chip unit test for checking device shards
- PR: #6179
- Nsmith/fix unet
- PR: #6141
- #6043: Random program stress test of command queues
- PR: #6044
- Logit and logiteps backward support
- PR: #6016
- Backward support for log2
- PR: #6064
- Add missing ttnn tests and disable broken tests until issues are fixed
- PR: #6186
- Fix Events feature for FD1.3 (out-of-order event ids, events feature missing) #6093
- PR: #6181
- #5873: make top-level post commit workflow re-useable
- PR: #6188
- #5589: add groupnorm for ttnn sweeps
- PR: #6167
- Ngrujic/ttnn sweeps 4
- PR: #6135
- Add ethernet datamover (EDM) - a foundational ethernet transfer engine
- PR: #5718
- #6116: Add backward support for softshrink
- PR: #6118
- #0: Add verbose make logs to artifact and make nicer name on metal
- PR: #6199
- #0: Only use 2x4 setup for multi-card WH CI as 2x2 does not provide us good feedback
- PR: #6202
- #4809 dprint tensix regs
- PR: #6072
- #4003: fixed bloom perf test
- PR: #6208
- #6187: Conv bugfix
- PR: #6205
- #0: concat RM support variable stick widths across inputs
- PR: #6207
- TTNN RN50 on WHB0
- PR: #6173
- #6084: Lower thresholds slightly after using proper configs for device resnet
- PR: #6214
- Fast dispatch 2.0 proof of concept
- PR: #6176
- #6218: add pytest for matmul 1d 2d
- PR: #6219
- #6177: use `is_tensor_storage_on_device` so it works for MultiDeviceStorage
- PR: #6178
- #6082: support workers + eth cores in one program
- PR: #6172
- #6215: Rename TensorToMeshMapper/MeshToTensorComposer
- PR: #6220
- #6164: Update test_noc_unicast_vs_multicast_to_single_core_latency to not use same cores for producer and consumer on WH
- PR: #6224
- #6117: Add backward support for softplus
- PR: #6128
- #6223: remove redundant call to context switch
- PR: #6225
- Integrate EDM with all-gather.
- PR: #6169
- #6136: Add backward support for unary LE and GE
- PR: #6138
- #5398: fix unicast binaries
- PR: #6231
- Barsic/ttnn ops 2
- PR: #6070
- #5380: Add wormhole_b0 model perf tests, only falcon7b in ttlib for now
- PR: #6216
- #5372: Updated README.md file for demo
- PR: #6060
- #4003: updated ttnn.concat to have a registered fallback
- PR: #6127
- Llama2 functional bringup
- PR: #6087
- #5589: Add working BFLOAT8_B sweeps to working folder
- PR: #6192
- FD2.0 rename HostQ->PrefetchQ, add multi-core capability, fix NOC coords
- PR: #6229
- #0: bugfix in ttnn resnet caught by nightly
- PR: #6251
- #0: fix tt_bisect build bug
- PR: #6256
- Watcher Asserts
- PR: #6175
- #6183: add unit test for sd matmul ops
- PR: #6246
- #6254: Make program cache per device:
- PR: #6255
- #5394: Add functional version of Mamba architecture
- PR: #5948
- #6257: Add temporary convenience script for 800MHz / new eth reset dependent CI
- PR: #6258
- #5661: Enable gtests for fast dispatch + R chip
- PR: #6110
- Alex/metal/bmm large block untilize out
- PR: #6201
- #5389: made tensor attributes public and use ttnn::Shape instead of tt::tt_metal::Shape for storing shape
- PR: #6261
- Revert "#6183: add unit test for sd matmul ops"
- PR: #6278
- #4003: print all of the L1 buffers using ttnn.print_l1_buffer_state
- PR: #6268
- #4003: print all of the L1 buffers using ttnn.print_l1_buffers
- PR: #6279
- #4438: Implement sharded multi-core fold op for Resnet50
- PR: #6275
- #6149: disabled the check for comparing generated report with GOLDEN_L1_BUFFER_REPORT because on pipelines it looks different than when running locally
- PR: #6280
- FD2.0 fixes+mcast support for write and packed_write
- PR: #6263
- Shwetank tt/config
- PR: #5843
- #0: Change order of device and use_program_cache fixture in remaining pytests
- PR: #6269
- Softplus with beta and threshold param
- PR: #6239
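  For reference, the semantics these parameters imply (matching `torch.nn.functional.softplus`), as a scalar sketch:
  ```python
  import math

  def softplus_reference(x: float, beta: float = 1.0, threshold: float = 20.0) -> float:
      # softplus(x) = (1 / beta) * log(1 + exp(beta * x)),
      # reverting to the identity when beta * x > threshold to avoid overflow.
      if beta * x > threshold:
          return x
      return (1.0 / beta) * math.log1p(math.exp(beta * x))
  ```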
- Build tests during artifact creation
- PR: #6286
- #6149: disabled test_print_l1_buffers_of_add_operation
- PR: #6299
- #4003: updated ttnn.to_torch to work with bfloat8_b tensors that are not multiple of tile size without tile padding
- PR: #6277
- #0: add to/from L1 reshard test
- PR: #6309
- #0: Add back deleted shape assertions for interleaved concat
- PR: #6307
- test errors flagged by watcher
- PR: #6320
- #0: fix incremental build
- PR: #6103
- Merge xuncai/llama-attention-galaxy to main: First version of llama-attention galaxy on emulated chips
- PR: #6297
- #6329: Fixing a bug causing mismatch on indices
- PR: #6330
- #6321: Test which sweeps read/write buffer and just checks that the e…
- PR: #6322
- Support moreh_getitem forward
- PR: #6227
- #6125: Update in0_block_w to be full shard width for sharded 2D systolic matmul
- PR: #6262
- #6107: Add softsign, sign, unary ceil backward support
- PR: #6191
- #6226: Add backward support for div
- PR: #6235
- #6234: Add backward support for rdiv
- PR: #6238
- #6236: Add backward support for fmod and remainder
- PR: #6240
- #4003: added positional embeddings to bert and updated ttnn_sharded_optimized_bert to run with batch size of 12
- PR: #6327
- Indexed Fill
- PR: #6328
- #5589: remove dtype in gen function sweep tests where needed
- PR: #6249
- #6347: Print built-in defines once only
- PR: #6351
- #0: Add Mo as code owner on profiler code
- PR: #6352
- #0: Simplify tt_lib.scripts package by adding a specific tt_eager/scripts directory and putting the production scripts in there, whereas development scripts will stay in /scripts
- PR: #6324
- #0: Fixture reorder changes reverted for falcon_7b perf test
- PR: #6318
- #5424: remove metal_ckernel_sfpu
- PR: #5665
- #0: Update remaining tt_lib.program_cache calls to use device APIs
- PR: #6357
- #6183: add unit test for sd matmul ops
- PR: #6323
- #6289: fix dispatcher page calculation
- PR: #6340
- #5924: Enable unet on wormhole_b0 changes
- PR: #6198
- #6325: skip test_multi_device.py for grayskull arch
- PR: #6332
- Alex/metal/pack untilize no repack
- PR: #6371
- #6144: Not hanging on GS or WH with or without Watcher
- PR: #6373
- Agrebenisan/swq hwq cardinality cleanup
- PR: #6369
- #6146: Add backward support for conj
- PR: #6272
- #0: bug fix UTWH div_up instead of div trunc for calculating CB sizes
- PR: #6367
- Fix To/From Sharded Bug
- PR: #6381
- #6206: Fix resharding page mapp...
v0.44.0
📦 Uncategorized
- Update CreateBuffer to return shared_ptr, and Enqueue R/W buffer to accept std::shared_ptr
- PR: #5154
- #4794: Implement DownBlock2D using ttnn for stable_diffusion model
- PR: #5091
- #4797: Implement BasicTransformerBlock sub-module using ttnn for stab…
- PR: #5081
- #0: write cluster config for FD mode, non tunneling cores as well
- PR: #5161
- Update bw test, change mulsi calls to use *
- PR: #5149
- #3003: updated tt-lib documentation
- PR: #5179
- #0: Update to v0.44.0
- PR: #5188
- #4003: added ability to trace ttnn operations using torchtrail library
- PR: #5135
- Support moreh logsoftmax
- PR: #4961
- #4614: gitmodules: Use https URLs for submodules
- PR: #5183
- #0: add reviewers to frequently touched ops docs file
- PR: #5190
- backward ops - hypot and atan2
- PR: #5045
- #4885: Move program device map to program
- PR: #5193
- #4858: Add support for float to int typecast
- PR: #5058
- Matmul_block on a smaller grid size
- PR: #5170
- Revert "#0: Add support for typecast float to int"
- PR: #5199
- Add dst ethernet router support and remote command processor to accept FD packets on remote chip
- PR: #5102
- Falcon40B TT Implementation
- PR: #5046
- #5198: Fix moreh softmax related bug
- PR: #5200
- #0: skip MOREH Softmax tests from main
- PR: #5202
- #3122: Use device grid size in falcon_attention to be generic...
- PR: #5207
- #0: Add assertions for interleaved tensors for ops that don't support sharding
- PR: #5195
- #5169: Add activation ops to ttnn
- PR: #5217
- #3003: add duration to the ttnn operation nodes when TTNN_ENABLE_LOGGING=1 is used to compile the code
- PR: #5201
- #5027: Optimize group attn matmul for Falcon40B decode
- PR: #5127
- #0: add documentation about managing documentation
- PR: #5227
- Adding docs for maxpool, avg pool and upsample
- PR: #5223
- Revert "#0: skip MOREH Softmax tests from d5811b7…
- PR: #5228
- #5165: Add hyperbolic ops to ttnn
- PR: #5166
- #4866: Add grayskull open source llk-library
- PR: #5136
- #5002: simplified preprocessing of CNNs using preprocess_model
- PR: #5181
- Create GroupNorm sharded in TTNN
- PR: #5221
- #5097: Support for dedicated completion queue thread
- PR: #5098
- upsample test calculate grid
- PR: #5238
- fix for sharded allocator when num banks == num cores
- PR: #5229
- MHA tutorial interactive notebook with diagrams
- PR: #5239
- #4003: Adding a profile tutorial
- PR: #5242
- #0: Added non-blocking read stress test
- PR: #5243
- Revert "MHA tutorial interactive notebook with diagrams"
- PR: #5245
- #0: Update all_gather to work for multi_link. Update falcon-40b to use 2 links for all gathers
- PR: #5214
- #5142: Remove slow dispatch mode from working sweeps
- PR: #5146
- #3003: fixed the input tensor documentation
- PR: #5255
- #0: Temp slower resnet VM run
- PR: #5256
- throw on fast dispatch for to_host_sharded as it's not supported
- PR: #5264
- #5253: Fix kv_past_len being passed in to rotary embedding for falcon models
- PR: #5254
- #5233: started adding ttnn_functional_resnet
- PR: #5240
- #3003: updated ttnn documentation to explain what features it has over tt_lib. Added standalone examples of basic usage of ttnn
- PR: #5265
- #0: Speedup incremental builds
- PR: #5251
- #0: Change setup.py to be git worktree friendly
- PR: #5234
- MHA tutorial interactive notebook with diagrams
- PR: #5277
- #3003: disable tutorial 6 from running as the unit test
- PR: #5278
- Agrebenisan/non blocking tensor reads
- PR: #5244
- #5275: CODEOWNERS: update to include files relevant for ttnn team
- PR: #5276
- Fix an intermittent launch message transfer error
- PR: #5152
- Revert "MHA tutorial interactive notebook with diagrams"
- PR: #5282
- #0: add parens in LLK doc
- PR: #5283
- #3003: only unit test tutorials that work on pipelines
- PR: #5291
- #5246: Add unary math ops to ttnn
- PR: #5259
- Vignesh/stable diffusion ttnn basic transformer block fix
- PR: #5211
- #4854: Implement attention and rms_norm sub-module using ttnn for mis…
- PR: #5175
- #4795: Add upblock2d to functional stable diffusion model
- PR: #5085
- #4796: Implement Transformer2DModel using ttnn for stable_diffusion m…
- PR: #5092
- #0: Adding llk wormhole_b0 submodule
- PR: #5262
- #4003: Adding pybind11 to ttnn
- PR: #5236
- #5296: Fix broken link to host_api.hpp in README.md
- PR: #5297
- #0: Fix bug with the way we were measuring bert inference time
- PR: #5312
- #0: Change local tt_lib._C module install from symlink to copy
- PR: #5292
- #5233: added ability to fold batch_norm2d into conv2d
- PR: #5317
- #5222: replace hex8_to_hex32.py with cpp to shave off some compile time (temporary fix)
- PR: #5220
- Enable tests for WHB0
- PR: #5307
- #5137: Cleanups for newer Linux distro / toolchains
- PR: #5162
- #5233: implemented support for converting all Resnet-18 modules using preprocess_model function
- PR: #5325
- #3003: fix model preprocessing bug
- PR: #5332
- #4799: Implement CrossAttnDownBlock2D sub-module using ttnn for stabl…
- PR: #5086
- #4800: Implement UNetMidBlock2DCrossAttn using ttnn for stable_diffus…
- PR: #5093
- #4798: Add ttnn cross attn upblock2d in functional stable diffusion m…
- PR: #5089
- #4801: Implement Unet 2D Condition model using ttnn for stable_diffus…
- PR: #5119
- #4965: Rename Conv2D to Conv2d and MaxPool2D to MaxPool2d to match torch
- PR: #5219
- #0: Remove departed team member from CODEOWNERS
- PR: #5340
- #0: add to codeowners
- PR: #5339
- #5314: Only stall on first scheduled read after commands with side effects
- PR: #5315
- #4965: fix bad rebase
- PR: #5342
- #0: Add more instructions for dispatching workflow actions and a note about skipping git hooks
- PR: #5345
- Update optimized Bert to support WH grid sizes, add sharding support for RMSNorm
- PR: #5308
- #4642: create gtest_smoke as a sanity test suite
- PR: #5112
- #5341: context switch if eth txq is full
- PR: #5347
- #5323: Convolutions of small size fail during parallelization calculations
- PR: #5324
- Npetrovic/transformer softmax
- PR: #5298
- Fix groupnorm for narrow channels
- PR: #5320
- #4862: added more tests for ttnn bloom. Updated optimized ttnn bert to match the structure of non-optimized ttnn bert
- PR: #5336
- #0: Add an envvar parser with value detection and default value setti…
- PR: #5367
- #4732: Clean up compute kernel apis
- PR: #5316
- #5318: Modify Falcon7B to use attn_matmul for wormhole
- PR: #5322
- #0: make logLocationsRecord a static function
- PR: #5351
- #5233: run convs with auto-format
- PR: #5364
- #5377: Avoid segfault by checking buffer !null before getting device
- PR: #5381
- Alex/metal/pack untilize b0
- PR: #5378
- #4487: Support block sharding in upsample
- PR: #5361
- #5359: update python package transformers + dependencies to include Falcon
- PR: #5360
- #3708: Add support for LN having gamma/beta in bfp8
- PR: #5376
- #4003: Skip sweep tests if not available
- PR: #5392
- #4003: use faster TMs in optimized ttnn whisper
- PR: #5384
- #4732: Clean up compute_kernel_api
- PR: #5375
- More optimizations for group_attn_matmul
- PR: #5385
- #5233: updated resnet18 to run residual connections
- PR: #5390
- #3003: added more meaningful errors to ttnn. Updated getitem to run on device in the cases when it can
- PR: #5403
- #5233: simplified the logic in tracer
- PR: #5370
- #3003: include ttl operations and necessary types under ttnn.ttl
- PR: #5405
- #0: Add note about no merge commits in main
- PR: #5349
- #0: Add timeout in profiler regression workflow
- PR: #5355
- codeowners update
- PR: #5407
- #5365: Add device argument to determine grid size based on target
- PR: #5366
- disable whisper until further investigation, see issue #5430
- PR: #5431
- #3003: fixed ttnn convs
- PR: #5432
- #3886: Fix build error for C++ tests in debug mode
- PR: #5434
- #4954: Support depth 32 in maxpool writer
- PR: #4956
- #0: Pass output cb to pack init functions
- PR: #5418
- #0: skipping DeviceLoadBlankKernels on remote devices
- PR: #5437
- #5359: transformers: update version and relax pcc asserts
- PR: #5421
- #3003: guidelines for adding new op
- PR: #5440
- Don't assume user has one entry in their $PYTHONPATH
- PR: #5250
- FP32 tensor support for matmul
- PR: #5414
- #3003: updated tutorial 001 to describe the tensor more comprehensively before showing the add
- PR: #5441
- Onboard additional metal code owners
- PR: #5445
- #5402: Add redesigned host-side sw command queue, it can be configured i…
- PR: #5382
- #3003: fixed docs
- PR: #5455
- Alex/metal/enable conv tests on b0
- PR: #5425
- #5356: git bisect script to find broken commits
- PR: #5348
- #0: Update data_format.cpp file
- PR: #5399
- Add skip to full grid matmul whb0
- PR: #5461
- #3003: simplified the logic in ttnn/operations/matmul.py. Added dataclasses instead of tuples for CoreGrid and ShardShape
- PR: #5450
- #5204: adding moreh's test suite; removing an absolute assertion
- PR: #5373
- Npetrovic/lt gt ne fix
- PR: #5304
- #0: Move device id attribute from tensor to DeviceStorage
- PR: #5467
- #3003: fixed scheduled pipeline
- PR: #5466
- Npetrovic/transformer concat sweeps ttnn
- PR: #5208
- #3003: added support for running ttnn.matmul using 1D_systolic_array. Also, added support for passing in the program config directly
- PR: #5468...
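A hedged sketch of the matmul call pattern in the last entry, assuming the later public `ttnn` surface where a core grid (or an explicit program config) can be passed straight to `ttnn.matmul`:

```python
import torch
import ttnn

device = ttnn.open_device(device_id=0)
a = ttnn.from_torch(torch.rand(32, 64, dtype=torch.bfloat16),
                    layout=ttnn.TILE_LAYOUT, device=device)
b = ttnn.from_torch(torch.rand(64, 96, dtype=torch.bfloat16),
                    layout=ttnn.TILE_LAYOUT, device=device)
# Requesting a 1xN core grid steers the op toward a 1D-systolic program.
out = ttnn.matmul(a, b, core_grid=ttnn.CoreGrid(y=1, x=8))
ttnn.close_device(device)
```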
v0.43.0
📦 Uncategorized
- #4668: Yolov5 GS Demo Benchmarking
- PR: #4776
- #0: uplift umd; pick up fix for n150 cluster
- PR: #4881
- #3178: Fix for wormhole b0 reduce w
- PR: #4882
- #4489: fixed bugs in the program caching of eltwise unary and eltwise binary. Updated bloom to use L1 memory config
- PR: #4842
- #4821: Add cumsum op to tt_dnn
- PR: #4824
- Dispatch/Bandwidth tests
- PR: #4783
- #4003: fixed test_eltwise_unary_op
- PR: #4901
- Argmax and Argmin Support
- PR: #4779
- #3212: softmax works after reduce fix of max, sum, etc. for WHB0
- PR: #4907
- #0: (MINOR) Update version to v0.43.0
- PR: #4910
- #4761: Add call to ttl repeat_interleave and also provide script for …
- PR: #4891
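For reference, the `torch` semantics that the `ttl` `repeat_interleave` call above corresponds to:

```python
import torch

x = torch.tensor([1, 2, 3])
print(torch.repeat_interleave(x, repeats=2))
# tensor([1, 1, 2, 2, 3, 3])
```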
- #4003: fixed the bug with printing the compile-time attributes
- PR: #4918
- Support moreh arange
- PR: #4921
- Remove skip_for_wormhole_b0 for test_moreh_softmax and test_moreh_softmin
- PR: #4924
- #4541: remove unpad start at 0 limitation
- PR: #4566
- Agrebenisan/restart cmd fix
- PR: #4922
- Support moreh SGD
- PR: #4929
- #0: Use fetch-depth: 0 instead of fetch-tags because otherwise git complains of commit SHA/tag conflict
- PR: #4934
- #0: Add code owners for primary operations api binding
- PR: #4936
- #4547: Add 2x2 window unit tests to ttnn maxpool
- PR: #4909
- #4003: restructure ttnn
- PR: #4902
- #4889: Change TileSlice printing to only print tile data
- PR: #4912
- #4836: Add support for blocking conv activation in 2d systolic conv v…
- PR: #4837
- #0: Update unicast cycles lower bound
- PR: #4937
- #4904: Add support for 1d width sharded LN
- PR: #4905
- #4941: Convert command header to struct for easier maintainability
- PR: #4942
- #4823: enable sum_0 operation fails with low PCC [Wormhole,Grayskull]
- PR: #4955
- Fix sharded buffers for one core in fast dispatch
- PR: #4944
- #4906: global reduce sum, mean, max, min operations added
- PR: #4908
- Revert "#4823: enable sum_0 operation fails with low PCC [Wormhole,GS]"
- PR: #4963
- #0: Change codeowners from specific op binding files/dirs to all tt_lib bindings
- PR: #4938
- #4003: split unary sweep into per op sweeps
- PR: #4952
- #4232: added support for converting from numpy arrays to ttnn tensors. Borrow data whenever possible when converting from numpy/torch
- PR: #4893
- Uplift AttnMatmul to support GroupAttnMatmul
- PR: #4913
- Add watcher-specific CI tests
- PR: #4919
- #4916: Add avg pool to ttnn
- PR: #4917
- #0: Add a lock on DPRINT server raise/wait structures
- PR: #4920
- #4967: added validation for input tensors
- PR: #4977
- #4971: update documentation with a new doc hierarchy
- PR: #4983
- #0: Leftover decorate_operation replacement for avg pool
- PR: #4987
- #4899: fix the permute to operate on the intended shape
- PR: #4951
- #4730: Add tt_lib.tensor.concat
- PR: #4990
- Aliu/enqueue eth
- PR: #4845
- #4003: Updating functional performance from changes in ttnn.permute w…
- PR: #4991
- #4984: Remove dead OP_INFO and graph interpreter
- PR: #4985
- #4878: initial commit to add Conv parameters to ttnn.preprocess_model_parameters
- PR: #4966
- Update Program Hashes for Ops using Mem config
- PR: #4953
- #4984: Remove unused dprint functionality
- PR: #5000
- Aliu/ci fix
- PR: #5001
- #4215: Add Argmax and Argmin Fallback
- PR: #4928
- #4999: added input tensor validation to add, sub and mul operations.
- PR: #5004
- Support for softmax row-major (RM) sharding and causal mask sharding
- PR: #5006
- #0: provide API for where() to support scalar True/False branches
- PR: #4988
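For reference, the analogous scalar-branch behavior in `torch` that the `where()` change above mirrors (the exact `tt_lib` call shape is not shown here, so the torch form stands in):

```python
import torch

cond = torch.tensor([True, False, True])
out = torch.where(cond, 1.0, 0.0)  # scalar True/False branches
print(out)  # tensor([1., 0., 1.])
```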
- #5003: Update expected compile and runtimes for perf regression on VM
- PR: #5008
- Revert "Update Program Hashes for Ops using Mem config"
- PR: #5021
- #4931: add apis to get ethernet by socket ids
- PR: #4932
- #4786: Add upsample_nearest2d functional stable diffusion
- PR: #4870
- #4986: deploy docs only to main and enable devs to run docs build on different pages
- PR: #5020
- Deploy ttnn sweeps results to docs
- PR: #5019
- #4958: Move all python api unit tests to frequent in order to reduce SD pipeline length
- PR: #4981
- #4999: Added input validation for ttnn.matmul and ttnn.linear. Add unit test for linear operation. Update input tensor validation in binary.py. Fix compute_output_shapes in bmm_op.cpp
- PR: #5010
- #4620: Fix+improve bw test
- PR: #5029
- #4852: Add unit tests for functional bloom
- PR: #5013
- #5032: scalar argument versions for relops
- PR: #5018
- #0: Add some README recommendations from MCW to clarify the issue about access to the internal workflows VM installation page
- PR: #5034
- #4790: Implement GEGLU using ttnn for stable_diffusion model
- PR: #4869
- #4999: Adding validation checks
- PR: #5011
- #4791: Implement Feedforward sub-module using ttnn for stable_diffusi…
- PR: #4868
- Npetrovic/bw ops sweeps
- PR: #5009
- #4999: update documentation of ttnn operations to include the validation schema
- PR: #5031
- #0: Remove model run from frequent_api_pipeline per @tt-rkim
- PR: #5043
- Minor dprint/watcher cleanup
- PR: #5030
- #4858: Add support for typecast
- PR: #4840
- #0: Disable dprint tests because they're flaky at the moment
- PR: #5026
- #4946: Add trig ops to ttnn
- PR: #5041
- Nshanker/convs split by 2
- PR: #5042
- #4946: Add inv trig ops to ttnn
- PR: #5038
- #4003: fixed circular dependency in decorators
- PR: #5052
- #5054: Removed asserts from conv op host code that are not required. …
- PR: #5055
- #4003: fixed circular dependencies in ttnn
- PR: #5061
- #4852: Fix CI pipeline by re-enabling functional bloom for causal LM
- PR: #5060
- GroupNorm sharded support
- PR: #4945
- #4972: is_sharded and memory_config are free from tensor
- PR: #4980
- #0: eltwise ops / activation operator tracking for GS and WHB0
- PR: #5074
- Aliu/fd tunneling pr
- PR: #4725
- #4642: Converted 14 old cpp tests to use gtest, with capabilities to switch between FD/SD when possible
- PR: #5050
- #4852: Add tests for functional ttnn bloom implementation.
- PR: #5078
- #4003: correctly convert all parameters of torch module to ttnn parameters
- PR: #5100
- #5082: Pow gradient calculation method is different from pytorch
- PR: #5106
- Argmax/Argmin support for channel, batch and all dim
- PR: #5040
- #4420: switch to shared_ptr
- PR: #5123
- #4420: return shared_future from taskflow async wrapper
- PR: #5121
- Minor DPrint fixes
- PR: #5108
- #0: Enable/disable clearing L1 from env var
- PR: #5107
- #4003: started moving ttnn operation to C++
- PR: #5111
- #4003: Add script to help with finding issues that we need approval for
- PR: #5129
- #5044: Adding support for optional output tensors
- PR: #5104
- #4003: Adding the open flag to show only open PRs
- PR: #5134
- #5048: Add CreateDevices and CloseDevices api to detail
- PR: #5118
- decouple ClearProgramCache from CommandQueue
- PR: #5124
- Conv fixes for padding input channels. Shallow conv fixes. Conv input/output autoformatting. Cleanup
- PR: #5109
- Asarje/mp unpack tilize fused
- PR: #5033
- Update CreateBuffer to return shared_ptr, and Enqueue R/W buffer to accept std::shared_ptr
- PR: #5125
- #5137: Cleanups for newer Linux distro / toolchains
- PR: #5114
- Revert "#5137: Cleanups for newer Linux distro / toolchains"
- PR: #5139
- Revert "Update CreateBuffer to return shared_ptr, and Enqueue R/W buffer to accept std::shared_ptr"
- PR: #5138
- #4793: Implement ResnetBlock2D using ttnn for stable_diffusion model
- PR: #5084
- #4788: Implement Downsample2D using ttnn for stable_diffusion model
- PR: #5090
- #4792: Implement CrossAttention sub-module using ttnn for stable_diff…
- PR: #4927
- #4747: Reduce amount of samples in bert sweeps
- PR: #5140
- #4789: Add upsample2d to functional_stable_diffusion model
- PR: #5080
- #0: Add fix for lamb optimizer
- PR: #5144
- #5057: Add relational ops support to TTNN
- PR: #5120
- skip eth test suite on GS
- PR: #5155
- #4003: updated ttnn.Tensor to be derived from ttl.tensor.Tensor
- PR: #5130
- Asarje/shwetank upsample
- PR: #5105
- #5082: power gradient is erroneous when exponent is in range (0-1)
- PR: #5158