Note
If you are installing from a release, please refer to the README, INSTALLATION instructions, and any other documentation packaged with that release, not the versions on the main branch; the latest main may differ from the previous release.
The changelog below lists the changes since the last release.
This release was generated by the CI workflow https://github.com/tenstorrent/tt-metal/actions/runs/13018933285
📦 Uncategorized
- Create an API for running and measuring the runtime of a ttnn op chain for use during forge compilation
- PR: #16921
- [TT-Train] Add bias=false in LinearLayer
- PR: #16707
- TT-Fabric Bringup Initial Check-in
- PR: #16343
- #0: Sanitize writes to mailbox on ethernet cores.
- PR: #16574
- Add Llama11B-N300 and Llama70B-TG (TP=32) to LLM table in README.md
- PR: #16724
- [skip ci] Update llms.md
- PR: #16737
- Update test_slice.py
- PR: #16734
- #16625: Refactor tracking of sub-device managers from Device to a new class
- PR: #16683
- Update code-analysis.yaml
- PR: #16738
- [skip ci] Update llms.md
- PR: #16745
- remove references to LFS
- PR: #16722
- Fixes for conversion to row major for 0D and 0-volume tensors
- PR: #16736
- #0: Disable BH tools test at workflow level
- PR: #16749
- Removing some usages of LegacyShape, improve Tensor::to_string
- PR: #16711
- [skip ci] Fix lint on a doc
- PR: #16751
- #0: API Unification for Device and MeshDevice
- PR: #16570
- Port ttnn::random and uniform from LegacyShape to SimpleShape
- PR: #16744
- #16379: make softmax call moreh_softmax if rank above 4
- PR: #16735
- #7126: remove skip for test_sd_matmul test
- PR: #16729
- #0: Make `device` an optional parameter in the tensor distribution API
- PR: #16746
- Added build-wheels to fast-dispatch-build-and-unit-tests-wrapper.yaml
- PR: #16638
- Adding CCL Async test cases to TG nightly and bug fix
- PR: #16700
- #11119: Move op_profiler.hpp under the ttnn folder
- PR: #11167
- #15979: Switch to google benchmark for pgm dispatch tests
- PR: #16547
- [tt-train] Add weight tying option for NanoGPT demo
- PR: #16768
- #0: Fix build of test_pgm_dispatch
- PR: #16773
- [tt-train] Update serialization of tensor for DDP
- PR: #16778
- #0: Fix failing TG regression tests
- PR: #16776
- [skip ci] Update llms.md
- PR: #16775
- Add tiled interleaved permute for when width dimension doesn't move (row-major tiled invariant)
- PR: #16671
- Add Fabric Router Config to Hal
- PR: #16761
- [skip ci] Update llms.md
- PR: #16791
- Reflect ARCH_NAME Changes in CI Workflows
- PR: #16706
- [skip ci] Update llms.md
- PR: #16792
- #0: Migrate pytensor to use `from_vector` Tensor creation APIs
- PR: #16767
- Afuller/metalium api reorg
- PR: #16578
- Ngrujic/sweep tests 3
- PR: #16316
- #0: Enable nlp create heads tests on BH
- PR: #16777
- Fix to_layout shard bug
- PR: #16754
- Fix broken link to host API
- PR: #16799
- Add noc flag to test stress noc mcast
- PR: #16772
- Set codeowners for transformer ttnn ops
- PR: #16803
- #15450: Remove default value for ocb argument in LLK compute API
- PR: #16376
- Linking tensor.reshape to ttnn.reshape
- PR: #16377
- #16646: Fix dangling reference in sharded tensor args
- PR: #16782
- #15450: Remove default values from circular buffer parameters in LLK compute APIs: Transpose and Reduce
- PR: #16427
- Add new python api to get architecture name
- PR: #16747
- Remove base.hpp
- PR: #16796
- [tt-train] Change weights initialization for GPT-2
- PR: #16815
- [skip ci] Update llms.md
- PR: #16828
- fuse residual add with layernorm
- PR: #16794
- [TT-Train] Add multidevice support to dropout
- PR: #16823
- #16171: Preload kernels before receiving go message
- PR: #16680
- #15450: Remove default values from circular buffer parameters in LLK compute APIs: Test Kernels
- PR: #16613
- #16366: Changed kernel config to HiFi4 for 32F matmul
- PR: #16743
- Add nightly APC run in debug mode
- PR: #16831
- [skip ci] Update llms.md
- PR: #16835
- [skip ci] Update llms.md
- PR: #16839
- Remove some ARCH_NAME ENV usage at runtime
- PR: #16825
- Move out tensor storage into a separate .hpp/.cpp
- PR: #16832
- #16460: Add more helpful error message when tt-topology needs to be run
- PR: #16783
- Make creation functions use SimpleShape, expose SimpleShape to Python
- PR: #16826
- #16242: Initial implementation of MeshBuffer
- PR: #16327
- Enable use-override check
- PR: #16842
- Privatize Taskflow
- PR: #16838
- Fix test_new_all_gather.py regressions caused by API unification between Device/MeshDevice
- PR: #16836
- Fix CB allocation warnings from ttnn.reshard
- PR: #16795
- Optimize upsample for bilinear mode
- PR: #16487
- Remove Shape usage from MultiDeviceStorage
- PR: #16841
- Remove redundant bank offset from destination address in `ttnn.reshard`
- PR: #16800
- Add option to raise error on failed local/global tensor comparison
- PR: #16585
- Padded Shards for Concat Support
- PR: #16765
- #0: Add support for tracing some sub-devices while others are still running programs
- PR: #16810
- #16769: Bring up all reduce async as a composite op and add llama-shape CCL test sweep
- PR: #16784
- #0: Lower Size to metalium as Shape2D
- PR: #16814
- #15976: Ensure reports insert all devices into the devices table
- PR: #16834
- Modify UNet Shallow to return output in CHW channel ordering
- PR: #16742
- #16758: Optimize usage and implementation of encode/decode tensor data
- PR: #16759
- Device to Device profiler sync
- PR: #16543
- Templating and Queue Size Adjustments for Packet Queue
- PR: #16732
- Refactor Superset model benchmarking tools to use Pydantic classes and save one json
- PR: #16790
- #16078: Fix back-to-back calls of ttnn.close_device()
- PR: #16840
- #16434: DPRINT to read buffer once
- PR: #16586
- Bring Taskflow from CPM
- PR: #16843
- This file seems to be kernel-only
- PR: #16853
- Minor SDPA optimizations
- PR: #16566
- #15450: Remove default values from circular buffer parameters in LLK compute APIs: Eltwise Unary
- PR: #16527
- Fix scaling issue with RT arguments in tilize/untilize with padding
- PR: #16690
- Make stress noc mcast test respect physical coordinates + allow option to skip mcaster
- PR: #16833
- Fix some shapes for Prefetcher + Matmul, Use Multi-device Global CB
- PR: #16764
- Do not build UMD tests
- PR: #16877
- Move risc_attribs back to hw/inc
- PR: #16867
- Re-enable UNet Shallow trace+2CQ test case
- PR: #16875
- Upgrade error message in control plane
- PR: #16863
- #15824 Workaround LLK issue in max_pool
- PR: #16849
- [skip ci] Fixed TG configuration description in documentation
- PR: #16884
- #0: Update pgm_dispatch_golden.json
- PR: #16818
- #0: fix stackoverflow in eth tun
- PR: #16889
- #0: Refactor enqueue_write_buffer
- PR: #16880
- #0: Add skip for mnist tests because I can't take this anymore
- PR: #16891
- #0: Remove SetLazyCommandQueueMode from Metal API
- PR: #16886
- #16868: Update profiler post proc asserts tripping due to kernel preload
- PR: #16872
- #16350: Update reciprocal docs
- PR: #16371
- [skip ci] Update INSTALLING.md
- PR: #16893
- Remove `sharded_to_interleaved` workaround in UNet Shallow
- PR: #16770
- Add CI job for running models in comparison mode
- PR: #16808
- pybind expose MeshDevice::reshape
- PR: #16798
- #0: Update sweeps README
- PR: #16902
- Workaround issue #16895, fix PCC checking for wormhole in Resnet50 demo
- PR: #16896
- #0: Refactor enqueue_read_buffer
- PR: #16908
- move device checking outside of invalidate code func
- PR: #16903
- Disable Unstable Transpose 2D Test
- PR: #16781
- New operation: Fill_Tile_Pad; op to fill tile padding with a specific value
- PR: #16785
- #0: Separate HWCommandQueue into its own header
- PR: #16885
- Update Mamba device performance targets
- PR: #16887
- Change how we set up the simulator
- PR: #16375
- Add missing include for types used
- PR: #16934
- Adding active erisc FW for BH + support for compiling this + updating BH eth_l1_address_map
- PR: #16916
- disable test_transpose_2D due to python-side segfault
- PR: #16933
- #16913: Add Model Updates to the Release assets
- PR: #16914
- Add Datagram Sockets to Fabric
- PR: #16830
- [Llama3] Send decode output logits to dram to reduce trace l1 usage and fix 8b-n150 memory crash
- PR: #16924
- Sharding support for binary_ng
- PR: #16789
- Fix mcast end core for stress noc mcast test
- PR: #16947
- #13901: MaxPool Wide Reductions with Non-8-Tile Multiples
- PR: #16544
- Make creation functions use SimpleShape, expose SimpleShape & TensorSpec to Python
- PR: #16865
- Feature/vecadd sharding
- PR: #16654
- Resolve the issue in ubenchmark pipeline
- PR: #16949
- #0: update test_vc_uni_tunnel bw requirement
- PR: #16953
- [tt-train] Fix broken build due to taskflow change
- PR: #16952
- #16415: fix moreh_adam
- PR: #16420
- #16469 Add sharding to vecadd example
- PR: #16959
- Revert "#16469 Add sharding to vecadd example"
- PR: #16961
- Revert "Feature/vecadd sharding"
- PR: #16962
- #13195: Squeezebert using Conv1d Width Sharded
- PR: #16881
- Cleanup of various issues
- PR: #16873
- Add sweeps with pre-allocated output for topk and argmax
- PR: #16898
- #16510: Eltwise sweep test for add and mul + silu - LLama
- PR: #16516
- Fixing variable name to build umd tests
- PR: #16967
- #15246: Add sweeps for acos_bw, acosh_bw, atan_bw fill_zero_bw, frac_bw, log_sigmoid_bw, rad2deg_bw, trunc_bw sharded
- PR: #16372
- #5424: Clean up Sfpu Sign kernel api
- PR: #16809
- #12662: pad generic reduce op input
- PR: #16925
- Add Datagram Sockets to Fabric
- PR: #16951
- Support padded inputs in SDPA
- PR: #16940
- Add CNN performance optimization tech report
- PR: #16931
- [tt-train] Add gradient norm clipping
- PR: #16771
- Increase concat heads test coverage
- PR: #16972
- #15450: Remove default values from circular buffer parameters in LLK compute APIs: Tile Move Copy
- PR: #16664
- Adding a toolchain file
- PR: #15581
- Remove deprecated Tensor constructor with Shape
- PR: #16955
- [skip ci] rm -rf clang-tidy-bot
- PR: #16990
- #0: Added custom tags for BH Post commit
- PR: #16984
- Minor fixes for CB initialization
- PR: #16978
- Remove mystery unused zlib dependency
- PR: #16983
- #0: fix small kernel Unet perf
- PR: #16987
- #0: [skip ci] Add t3k model perf, tg demos, and tgg model perf to package and release; remove unneeded single-card device perf
- PR: #16994
- #0: Fix blackhole scheduled post-commit
- PR: #16995
- Replace usage of get_shape() with get_logical_shape() in more places
- PR: #16739
- [TT-Train] Fix Taskflow test leakage
- PR: #17004
- Fix Blackhole Post Commit job labels
- PR: #17006
- Address some CMake warnings
- PR: #16993
- #15931: Re-enable SD nightly and demo tests
- PR: #16971
- Modify UNet Shallow to consume inputs in CHW-ordering
- PR: #16918
- Move tech report to correct TT-NN section
- PR: #17024
- Remove all usages of get_legacy_shape() from the codebase
- PR: #16998
- Fixing untilize for uint16
- PR: #17023
- Add output dtype to layernorm / rms norm
- PR: #16970
- #16938: Apply the same styling to the ttnn and tt-metalium docs as the main doc site
- PR: #16939
- #16948: Temporarily removing the blocking tests to unblock
- PR: #17034
- Add a kernel that performs permute on tiled inputs where the tile height and width can both be swapped around
- PR: #17009
- #16977: Use height sharding due to the shard shape
- PR: #17028
- Update Llama3 release versions in LLM table (Jan 23)
- PR: #17042
- Revert "Add a kernel that performs permute on tiled inputs where the tile height and width can both be swapped around"
- PR: #17046
- #15450: Remove default values from circular buffer parameters in LLK compute APIs: Bcast
- PR: #16388
- #0: update test_rw_buffer
- PR: #17045
- Minor tweaks for BH cache invalidate
- PR: #17047
- #16469 Add sharding to vecadd example
- PR: #17011
- Cleanup ControlPlane, use tt::Cluster
- PR: #17032
- Add end-to-end performance checks for all UNet Shallow configurations with trace+2CQ
- PR: #17036
- [tt-train] Fix data race in case async enabled
- PR: #16996
- #0: Remove templating from process_write_linear.
- PR: #16943
- Disable C++20 module scanning
- PR: #17055
- Add a kernel that performs permute on tiled inputs where the tile height and width can both be swapped around (fixed after revert)
- PR: #17048
- Remove ttnn::Shape from tt-train
- PR: #17053
- [skip ci] Upload full source code during release
- PR: #17014
- [skip ci] Update README.md
- PR: #17063
- Demote operator registration warning to debug message
- PR: #17065
- Restore clang-tidy scan to be incremental
- PR: #17073
- #17064: Update metal_Bert to use from_torch for converting weights, instead of old style conversions
- PR: #17066
- Afuller/fix clang tidy scan
- PR: #17075
- Add more unary sharded sweeps
- PR: #16311
- #0: Hoist SubDeviceManager/Lock-Step Allocator to MeshDevice
- PR: #16878
- Remove outdated TG llama tests from CI (old codebase)
- PR: #17038
- Migrate Binary Sfpu ops to binary_ng with activations
- PR: #16523
- Add support for unpadded shapes in Matmul1D w/ gather_in0
- PR: #16627
- #0: Fix incorrect assertion for page size for prefetch relay inline to dispatch_s
- PR: #17074
- #16758: Move mesh_composer call to after ttnn.from_device in ttnn.to_torch
- PR: #17054
- Single Docker Image Release
- PR: #17051
- Add validation for sharding to tensor layout and tensor spec
- PR: #16890
- Use dense packed CB indices for Matmul
- PR: #17081
- #0: Add WriteShard and ReadShard MeshBuffer APIs and resolve MeshBuffer dealloc issues
- PR: #16960
- #16979: Log when CB ranges aren't contiguous
- PR: #17050
- #16679: K min values support for TopK
- PR: #16917
- #16502: Add Unary with params support to BinaryNg
- PR: #17067
- #0: Make sub-device merge core ranges for generating mcast commands
- PR: #17087
- PR Gate workflow as the nucleus to build up a pre-merge sanity check
- PR: #17097
- Collection of small watcher fixes
- PR: #17061
- Pack dense cb index for attn matmul
- PR: #17082
- #0: Add native 2D sharding and replication functionality to MeshBuffer
- PR: #17086
- #16720 and #14898: Update output dims for argmax and move pad for generic reduce
- PR: #16989
- Fix worker <-> teardown by adding separate worker connection teardown semaphore
- PR: #17033
- #13609: Uplift dram and l1 allocators to use dram/l1 specific alignment
- PR: #13762
- #0: Fix dispatch core settings to use the actual remote device when accessing the core grid
- PR: #17109
- #16143: Inplace support for binary_ng ops with fused activations
- PR: #16449
- Remove some usages of ttnn::Shape from the codebase
- PR: #17062
- De-duplicate build-docker-artifact in workflows
- PR: #17112
- Revert some commits to fix single-card pipelines
- PR: #17121
- Move definitions to implementation for core.hpp
- PR: #17118