Note
If you are installing from a release, please refer to the README, INSTALLATION instructions, and any other documentation packaged with that release, not the versions on the main branch; the latest main may differ from the previous release.
The changelog below lists the changes since the last release.
This release was generated by the CI workflow https://github.com/tenstorrent/tt-metal/actions/runs/13018933285
📦 Uncategorized
- Create an API for running and measuring the runtime of a ttnn op chain for use during forge compilation
- PR: #16921
- [TT-Train] Add bias=false in LinearLayer
- PR: #16707
- TT-Fabric Bringup Initial Check-in
- PR: #16343
- #0: Sanitize writes to mailbox on ethernet cores.
- PR: #16574
- Add Llama11B-N300 and Llama70B-TG (TP=32) to LLM table in README.md
- PR: #16724
- [skip ci] Update llms.md
- PR: #16737
- Update test_slice.py
- PR: #16734
- #16625: Refactor tracking of sub-device managers from Device to a new class
- PR: #16683
- Update code-analysis.yaml
- PR: #16738
- [skip ci] Update llms.md
- PR: #16745
- remove references to LFS
- PR: #16722
- Fixes for conversion to row major for 0D and 0-volume tensors
- PR: #16736
- #0: Disable BH tools test at workflow level
- PR: #16749
- Removing some usages of LegacyShape, improve Tensor::to_string
- PR: #16711
- [skip ci] Fix lint on a doc
- PR: #16751
- #0: API Unification for Device and MeshDevice
- PR: #16570
- Port ttnn::random and uniform from LegacyShape to SimpleShape
- PR: #16744
- #16379: make softmax call moreh_softmax if rank above 4
- PR: #16735
- #7126: remove skip for test_sd_matmul test
- PR: #16729
- #0: Make `device` an optional parameter in the tensor distribution API
- PR: #16746
- Added build-wheels to fast-dispatch-build-and-unit-tests-wrapper.yaml
- PR: #16638
- Adding CCL Async test cases to TG nightly and bug fix
- PR: #16700
- #11119: Move op_profiler.hpp under the ttnn folder
- PR: #11167
- #15979: Switch to google benchmark for pgm dispatch tests
- PR: #16547
- [tt-train] Add weight tying option for NanoGPT demo
- PR: #16768
- #0: Fix build of test_pgm_dispatch
- PR: #16773
- [tt-train] Update serialization of tensor for DDP
- PR: #16778
- #0: Fix failing TG regression tests
- PR: #16776
- [skip ci] Update llms.md
- PR: #16775
- Add tiled interleaved permute for when width dimension doesn't move (row-major tiled invariant)
- PR: #16671
- Add Fabric Router Config to Hal
- PR: #16761
- [skip ci] Update llms.md
- PR: #16791
- Reflect ARCH_NAME Changes in CI Workflows
- PR: #16706
- [skip ci] Update llms.md
- PR: #16792
- #0: Migrate pytensor to use `from_vector` Tensor creation APIs
- PR: #16767
- Afuller/metalium api reorg
- PR: #16578
- Ngrujic/sweep tests 3
- PR: #16316
- #0: Enable nlp create heads tests on BH
- PR: #16777
- Fix to_layout shard bug
- PR: #16754
- Fix broken link to host API
- PR: #16799
- Add noc flag to test stress noc mcast
- PR: #16772
- Set codeowners for transformer ttnn ops
- PR: #16803
- #15450: Remove default value for ocb argument in LLK compute API
- PR: #16376
- Linking tensor.reshape to ttnn.reshape
- PR: #16377
- #16646: Fix dangling reference in sharded tensor args
- PR: #16782
- #15450: Remove default values from circular buffer parameters in LLK compute APIs: Transpose and Reduce
- PR: #16427
- Add new python api to get architecture name
- PR: #16747
- Remove base.hpp
- PR: #16796
- [tt-train] Change weights initialization for GPT-2
- PR: #16815
- [skip ci] Update llms.md
- PR: #16828
- fuse residual add with layernorm
- PR: #16794
- [TT-Train] Add multidevice support to dropout
- PR: #16823
- #16171: Preload kernels before receiving go message
- PR: #16680
- #15450: Remove default values from circular buffer parameters in LLK compute APIs: Test Kernels
- PR: #16613
- #16366: Changed kernel config to HiFi4 for 32F matmul
- PR: #16743
- Add nightly APC run in debug mode
- PR: #16831
- [skip ci] Update llms.md
- PR: #16835
- [skip ci] Update llms.md
- PR: #16839
- Remove some ARCH_NAME ENV usage at runtime
- PR: #16825
- Move out tensor storage into a separate .hpp/.cpp
- PR: #16832
- #16460: Add more helpful error message when tt-topology needs to be run
- PR: #16783
- Make creation functions use SimpleShape, expose SimpleShape to Python
- PR: #16826
- #16242: Initial implementation of MeshBuffer
- PR: #16327
- Enable use-override check
- PR: #16842
- Privatize Taskflow
- PR: #16838
- Fix test_new_all_gather.py regressions caused by API unification between Device/MeshDevice
- PR: #16836
- Fix CB allocation warnings from ttnn.reshard
- PR: #16795
- Optimize upsample for bilinear mode
- PR: #16487
- Remove Shape usage from MultiDeviceStorage
- PR: #16841
- Remove redundant bank offset from destination address in `ttnn.reshard`
- PR: #16800
- Add option to raise error on failed local/global tensor comparison
- PR: #16585
- Padded Shards for Concat Support
- PR: #16765
- #0: Add support for tracing some sub-devices while others are still running programs
- PR: #16810
- #16769: Bring up all reduce async as a composite op and add llama-shape CCL test sweep
- PR: #16784
- #0: Lower Size to metalium as Shape2D
- PR: #16814
- #15976: Ensure reports insert all devices into the devices table
- PR: #16834
- Modify UNet Shallow to return output in CHW channel ordering
- PR: #16742
- #16758: Optimize usage and implementation of encode/decode tensor data
- PR: #16759
- Device to Device profiler sync
- PR: #16543
- Templating and Queue Size Adjustments for Packet Queue
- PR: #16732
- Refactor Superset model benchmarking tools to use Pydantic classes and save one json
- PR: #16790
- #16078: Fix back-to-back calls of ttnn.close_device()
- PR: #16840
- #16434: DPRINT to read buffer once
- PR: #16586
- Bring Taskflow from CPM
- PR: #16843
- This file seems to be kernel-only
- PR: #16853
- Minor SDPA optimizations
- PR: #16566
- #15450: Remove default values from circular buffer parameters in LLK compute APIs: Eltwise Unary
- PR: #16527
- Fix scaling issue with RT arguments in tilize/untilize with padding
- PR: #16690
- Make stress noc mcast test respect physical coordinates + allow option to skip mcaster
- PR: #16833
- Fix some shapes for Prefetcher + Matmul, Use Multi-device Global CB
- PR: #16764
- Do not build UMD tests
- PR: #16877
- Move risc_attribs back to hw/inc
- PR: #16867
- Re-enable UNet Shallow trace+2CQ test case
- PR: #16875
- Upgrade error message in control plane
- PR: #16863
- #15824 Workaround LLK issue in max_pool
- PR: #16849
- [skip ci] Fixed TG configuration description in documentation
- PR: #16884
- #0: Update pgm_dispatch_golden.json
- PR: #16818
- #0: fix stackoverflow in eth tun
- PR: #16889
- #0: Refactor enqueue_write_buffer
- PR: #16880
- #0: Add skip for mnist tests because I can't take this anymore
- PR: #16891
- #0: Remove SetLazyCommandQueueMode from Metal API
- PR: #16886
- #16868: Update profiler post proc asserts tripping due to kernel preload
- PR: #16872
- #16350: Update reciprocal docs
- PR: #16371
- [skip ci] Update INSTALLING.md
- PR: #16893
- Remove `sharded_to_interleaved` workaround in UNet Shallow
- PR: #16770
- Add CI job for running models in comparison mode
- PR: #16808
- pybind expose MeshDevice::reshape
- PR: #16798
- #0: Update sweeps README
- PR: #16902
- Workaround issue #16895, fix PCC checking for wormhole in Resnet50 demo
- PR: #16896
- #0: Refactor enqueue_read_buffer
- PR: #16908
- move device checking outside of invalidate code func
- PR: #16903
- Disable Unstable Transpose 2D Test
- PR: #16781
- New operation: Fill_Tile_Pad; op to fill tile padding with a specific value
- PR: #16785
- #0: Separate HWCommandQueue into its own header
- PR: #16885
- Update Mamba device performance targets
- PR: #16887
- Change how we set up the simulator
- PR: #16375
- Add missing include for types used
- PR: #16934
- Adding active erisc FW for BH + support for compiling this + updating BH eth_l1_address_map
- PR: #16916
- disable test_transpose_2D due to python-side segfault
- PR: #16933
- #16913: Add Model Updates to the Release assets
- PR: #16914
- Add Datagram Sockets to Fabric
- PR: #16830
- [Llama3] Send decode output logits to dram to reduce trace l1 usage and fix 8b-n150 memory crash
- PR: #16924
- Sharding support for binary_ng
- PR: #16789
- Fix mcast end core for stress noc mcast test
- PR: #16947
- #13901: MaxPool Wide Reductions with Non-8-Tile Multiples
- PR: #16544
- Make creation functions use SimpleShape, expose SimpleShape & TensorSpec to Python
- PR: #16865
- Feature/vecadd sharding
- PR: #16654
- Resolve the issue in ubenchmark pipeline
- PR: #16949
- #0: update test_vc_uni_tunnel bw requirement
- PR: #16953
- [tt-train] Fix broken build due to taskflow change
- PR: #16952
- #16415: fix moreh_adam
- PR: #16420
- #16469 Add sharding to vecadd example
- PR: #16959
- Revert "#16469 Add sharding to vecadd example"
- PR: #16961
- Revert "Feature/vecadd sharding"
- PR: #16962
- #13195: Squeezebert using Conv1d Width Sharded
- PR: #16881
- Cleanup of various issues
- PR: #16873
- Add sweeps with pre-allocated output for topk and argmax
- PR: #16898
- #16510: Eltwise sweep test for add and mul + silu - LLama
- PR: #16516
- Fixing variable name to build umd tests
- PR: #16967
- #15246: Add sweeps for acos_bw, acosh_bw, atan_bw fill_zero_bw, frac_bw, log_sigmoid_bw, rad2deg_bw, trunc_bw sharded
- PR: #16372
- #5424: Clean up Sfpu Sign kernel api
- PR: #16809
- #12662: pad generic reduce op input
- PR: #16925
- Add Datagram Sockets to Fabric
- PR: #16951
- Support padded inputs in SDPA
- PR: #16940
- Add CNN performance optimization tech report
- PR: #16931
- [tt-train] Add gradient norm clipping
- PR: #16771
- Increase concat heads test coverage
- PR: #16972
- #15450: Remove default values from circular buffer parameters in LLK compute APIs: Tile Move Copy
- PR: #16664
- Adding a toolchain file
- PR: #15581
- Remove deprecated Tensor constructor with Shape
- PR: #16955
- [skip ci] rm -rf clang-tidy-bot
- PR: #16990
- #0: Added custom tags for BH Post commit
- PR: #16984
- Minor fixes for CB initialization
- PR: #16978
- Remove mystery unused zlib dependency
- PR: #16983
- #0: fix small kernel Unet perf
- PR: #16987
- #0: [skip ci] Add t3k model perf, tg demos, and tgg model perf to package and release; remove unneeded single-card device perf
- PR: #16994
- #0: Fix blackhole scheduled post-commit
- PR: #16995
- Replace usage of get_shape() with get_logical_shape() in more places
- PR: #16739
- [TT-Train] Fix Taskflow test leakage
- PR: #17004
- Fix Blackhole Post Commit job labels
- PR: #17006
- Address some CMake warnings
- PR: #16993
- #15931: Re-enable SD nightly and demo tests
- PR: #16971
- Modify UNet Shallow to consume inputs in CHW-ordering
- PR: #16918
- Move tech report to correct TT-NN section
- PR: #17024
- Remove all usages of get_legacy_shape() from the codebase
- PR: #16998
- Fixing untilize for uint16
- PR: #17023
- Add output dtype to layernorm / rms norm
- PR: #16970
- #16938: Apply the same styling to the ttnn and tt-metalium docs as the main doc site
- PR: #16939
- #16948: Temporarily removing the blocking tests to unblock
- PR: #17034
- Add a kernel that performs permute on tiled inputs where the tile height and width can both be swapped around
- PR: #17009
- #16977: Use height sharding due to the shard shape
- PR: #17028
- Update Llama3 release versions in LLM table (Jan 23)
- PR: #17042
- Revert "Add a kernel that performs permute on tiled inputs where the tile height and width can both be swapped around"
- PR: #17046
- #15450: Remove default values from circular buffer parameters in LLK compute APIs: Bcast
- PR: #16388
- #0: update test_rw_buffer
- PR: #17045
- Minor tweaks for BH cache invalidate
- PR: #17047
- #16469 Add sharding to vecadd example
- PR: #17011
- Cleanup ControlPlane, use tt::Cluster
- PR: #17032
- Add end-to-end performance checks for all UNet Shallow configurations with trace+2CQ
- PR: #17036
- [tt-train] Fix data race in case async enabled
- PR: #16996
- #0: Remove templating from process_write_linear.
- PR: #16943
- Disable C++20 module scanning
- PR: #17055
- Add a kernel that performs permute on tiled inputs where the tile height and width can both be swapped around (fixed after revert)
- PR: #17048
- Remove ttnn::Shape from tt-train
- PR: #17053
- [skip ci] Upload full source code during release
- PR: #17014
- [skip ci] Update README.md
- PR: #17063
- Demote operator registration warning to debug message
- PR: #17065
- Restore clang-tidy scan to be incremental
- PR: #17073
- #17064: Update metal_Bert to use from_torch for converting weights, instead of old style conversions
- PR: #17066
- Afuller/fix clang tidy scan
- PR: #17075
- Add more unary sharded sweeps
- PR: #16311
- #0: Hoist SubDeviceManager/Lock-Step Allocator to MeshDevice
- PR: #16878
- Remove outdated TG llama tests from CI (old codebase)
- PR: #17038
- Migrate Binary Sfpu ops to binary_ng with activations
- PR: #16523
- Add support for unpadded shapes in Matmul1D w/ gather_in0
- PR: #16627
- #0: Fix incorrect assertion for page size for prefetch relay inline to dispatch_s
- PR: #17074
- #16758: Move mesh_composer call to after ttnn.from_device in ttnn.to_torch
- PR: #17054
- Single Docker Image Release
- PR: #17051
- Add validation for sharding to tensor layout and tensor spec
- PR: #16890
- Use dense packed CB indices for Matmul
- PR: #17081
- #0: Add WriteShard and ReadShard MeshBuffer APIs and resolve MeshBuffer dealloc issues
- PR: #16960
- #16979: Log when CB ranges aren't contiguous
- PR: #17050
- #16679: K min values support for TopK
- PR: #16917
- #16502: Add Unary with params support to BinaryNg
- PR: #17067
- #0: Make sub-device merge core ranges for generating mcast commands
- PR: #17087
- PR Gate workflow as the nucleus to build up a pre-merge sanity check
- PR: #17097
- Collection of small watcher fixes
- PR: #17061
- Pack dense cb index for attn matmul
- PR: #17082
- #0: Add native 2D sharding and replication functionality to MeshBuffer
- PR: #17086
- #16720 and #14898: Update output dims for argmax and move pad for generic reduce
- PR: #16989
- Fix worker <-> teardown by adding separate worker connection teardown semaphore
- PR: #17033
- #13609: Uplift dram and l1 allocators to use dram/l1 specific alignment
- PR: #13762
- #0: Fix dispatch core settings to use the actual remote device when accessing the core grid
- PR: #17109
- #16143: Inplace support for binary_ng ops with fused activations
- PR: #16449
- Remove some usages of ttnn::Shape from the codebase
- PR: #17062
- De-duplicate build-docker-artifact in workflows
- PR: #17112
- Revert some commits to fix single-card pipelines
- PR: #17121
- Move definitions to implementation for core.hpp
- PR: #17118