Releases: tenstorrent/tt-metal
v0.56.0-rc3
Note
If you are installing from a release, please refer to the README, INSTALLATION instructions, and any other documentation packaged with the release, rather than the documentation on the main branch. There may be differences between the latest main and the previous release.
The changelog follows, showing the changes since the last release.
This release was generated by the CI workflow https://github.com/tenstorrent/tt-metal/actions/runs/13064647165
📦 Uncategorized
- #0: Add a way to specify custom dispatch topology
- PR: #17102
- Initialize work_executor_ and set WorkExecutorMode::SYNCHRONOUS in MeshDevice constructor
- PR: #17120
- [UMD] Switching to new coord API
- PR: #17003
- #0: Increase create heads test coverage for Llama shapes
- PR: #16980
- #16503: Optimize semaphore and CB writes
- PR: #16944
- Quiet down the CMake output from dependencies
- PR: #17008
- #16847: update to address the unaligned noc_async_copy from DRAM to L1
- PR: #17125
- remove ND failing big shape in transpose failures that is already tracked and disabled in test_transpose_2d
- PR: #17145
- #14898: pass in pad value to transpose in reduce
- PR: #17142
- rm -rf build.yaml
- PR: #17150
- #16982: Fixing program cache issues with reshape
- PR: #17140
- Delete convd_host_weights and update all tests using conv2d
- PR: #16264
- Update CODEOWNERS for the public API
- PR: #17149
- Aliu/bug fix
- PR: #17151
- #0: Move distributed headers into the public API directory
- PR: #17161
- [Llama3.2-11b-vision] Add support for text-only inference through generator api
- PR: #17105
- Remove references to ARCH_NAME in programming example
- PR: #17182
- Enable PR Gate
- PR: #17098
- #16806: Fixed watcher assert on reshape in debug mode
- PR: #17152
- #0: Fix doc links and make them point to the new location
- PR: #17181
- Remove ARCH_NAME references in prog_examples
- PR: #17185
- Prefer MOLD over LLD over LD
- PR: #17154
- Restore build-wrapper.yaml with updated method
- PR: #17197
- [TT-Train] Fix text generation
- PR: #17195
- LightMetal - Add Flatbuffers into cmake infra/build as cpm package (#17039)
- PR: #17157
- Format broken Kernel APIs Tables
- PR: #17000
- #0: Use MeshBuffer to store MeshWorkload kernel binaries
- PR: #17113
- Increase rms_norm and layernorm coverage for Llama shapes
- PR: #17180
- #17213: update fused and matmul trace sweep tests
- PR: #17214
- Add support for reading from / writing to partial buffer regions that are page size aligned for sharded buffers
- PR: #17089
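The partial-region support above comes with an alignment precondition: both the region offset and its size must be multiples of the buffer page size. Below is a minimal Python sketch of that check; the function name and error handling are illustrative, not the tt-metal API.

```python
# Hypothetical sketch of the page-alignment precondition for partial
# buffer reads/writes. Names here are illustrative, not the tt-metal API.

def validate_partial_region(offset_bytes: int, size_bytes: int, page_size: int) -> None:
    """Raise if a partial buffer region is not page-size aligned."""
    if offset_bytes % page_size != 0:
        raise ValueError(f"region offset {offset_bytes} is not a multiple of page size {page_size}")
    if size_bytes % page_size != 0:
        raise ValueError(f"region size {size_bytes} is not a multiple of page size {page_size}")

# Example: with 2048-byte pages, a region starting at page 3 spanning 4 pages is valid.
validate_partial_region(offset_bytes=3 * 2048, size_bytes=4 * 2048, page_size=2048)
```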
- #0: Fix clang-format for dataflow_api.h
- PR: #17234
- Kkabilar tt single card perf
- PR: #17231
- [FABRIC] ASYNC_WR_ATOMIC_INC
- PR: #17072
- #9945: Enable and fix SD device perf test
- PR: #17025
- Check context switch pointer for eth cores before resetting
- PR: #17212
- Pull llrt.hpp out of public interface
- PR: #17196
- Update perf and latest features for llm models (Jan 27)
- PR: #17188
- #0: (MINOR) Bump to generate RCs for v0.57.0
- PR: #17252
- Remove dead includes of host_api.hpp from ttnn
- PR: #17220
- Prevent UNet Shallow perf report entry from being overwritten
- PR: #17235
- Fix setup.py for Anaconda
- PR: #17111
- Do not run PR Gate on Draft PRs
- PR: #17272
- Add a timeout for docker image building
- PR: #17285
- LightMetal - New APIs LightMetalBeginCapture() and LightMetalEndCapture() and docs (#17039)
- PR: #17262
- #0: Update distributed tests build to account for arch
- PR: #17287
- #17227: Make dispatch core order match for single chip 2 CQ and multichip 2 CQ topologies
- PR: #17274
- #17215: Add explicit dealloc for mesh buffer
- PR: #17265
- #0: Add validation test for dispatched remote circular buffer config to device
- PR: #17233
- Remove get_completion_queue_reader_core() API from Device
- PR: #17263
- Add resharding to post all gather layernorm/ rms norm op
- PR: #17156
- #0: Fix ttnn shared libs build
- PR: #17127
- #0: Schedule runs for single card new models tests
- PR: #17141
- Implement JointAttention
- PR: #17079
- Revert "Add resharding to post all gather layernorm/ rms norm op (#17156)
- PR: #17304
- Update memory config when using view op with height sharded tensors
- PR: #17266
- #16812: Reordering cbs in reduce_init_delta
- PR: #16981
- #17083: Add support for watcher printing phys coords
- PR: #17244
- #16945: Add auto retries to post commit on branches
- PR: #16946
- Remove CommandQueue redirecting usages straight to HWCQ
- PR: #17219
- 1D support for tilize/reshape ops
- PR: #17238
- #16138: W-broadcasting for sharded tensors
- PR: #17101
- #0: Add PR Gate to data pipeline
- PR: #17325
- #15174: Re-enable mistral7b demo test after fw upgrade
- PR: #17305
- LightMetal - Add LoadTrace() API and move TraceDescriptor out of detail namespace (#17039)
- PR: #17313
- #15974: Create device tensors table in report database
- PR: #17293
- Privatize dprint_server.hpp
- PR: #17298
- Uplift Allocator to be its own class + migrate calls to Allocator APIs
- PR: #17268
- Bump CMake in the Docker image
- PR: #17273
- Add perf reporting for ccl async mode
- PR: #16658
- Fix debug checks for bank assignments when initializing the allocator
- PR: #17357
- Add perf report for reduce scatter async
- PR: #17223
- #0: expand halo documentation and fix images
- PR: #16802
- Fix shard and physical height mismatch in ttnn.convert_to_chw tests
- PR: #17258
- #17134: Remove unused components
- PR: #17301
- #0: Fix retry comparison which causes endless retries until pass
- PR: #17367
- Make runtime_args_ptr const ref to solve clang-tidy error due to 58f9654
- PR: #17364
- Use padded shape to construct output memory config in ttnn.view
- PR: #17366
- #17322 Remove transpose cpp unit tests
- PR: #17326
- #17215: Initial MeshBuffer integration with TTNN
- PR: #17259
- fix sub-height height sharded WH transpose by setting output memory config to width sharded
- PR: #17147
- #0: Only produce cicd data on workflow runs that are success/failure/cancelled (ignore skipped runs)
- PR: #17371
- #0: Use posted writes for profiler.
- PR: #17261
- #16149: Add DeviceCommandCalculator to calculate command size
- PR: #17260
- #15889: Fix handling of mantissa rounding to respect ties round to even
- PR: #16997
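For context on the rounding fix above, a minimal Python sketch of round-half-to-even on truncated mantissa bits follows. It illustrates the technique only; the actual fix in #16997 applies to packed hardware number formats.

```python
# Minimal sketch of ties-to-even rounding when dropping low mantissa bits
# (e.g. float32 -> bfloat16 keeps the top bits of the mantissa).

def round_mantissa_ties_to_even(mantissa: int, drop_bits: int) -> int:
    """Round an integer mantissa by dropping `drop_bits` low bits, ties to even."""
    if drop_bits == 0:
        return mantissa
    kept = mantissa >> drop_bits
    remainder = mantissa & ((1 << drop_bits) - 1)
    half = 1 << (drop_bits - 1)
    if remainder > half or (remainder == half and (kept & 1)):
        kept += 1  # round up; on an exact tie, only if the kept value is odd
    return kept

assert round_mantissa_ties_to_even(0b1010_1000, 4) == 0b1010  # tie, kept value even: unchanged
assert round_mantissa_ties_to_even(0b1011_1000, 4) == 0b1100  # tie, kept value odd: rounds up
```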
- #17312: Fix type error when saving report config to json file
- PR: #17350
- Implement ttnn.sampling op for top-k, top-p sampling
- PR: #17136
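As a reference for the sampling op above, the following NumPy sketch shows combined top-k / top-p (nucleus) sampling on host. It demonstrates the technique, not the ttnn.sampling signature, which may differ.

```python
# Reference sketch of top-k / top-p sampling; illustrative only.
import numpy as np

def sample_top_k_top_p(logits: np.ndarray, k: int, p: float, rng: np.random.Generator) -> int:
    # Keep the k highest logits and form a softmax over just that set.
    top_k_idx = np.argsort(logits)[-k:]
    probs = np.zeros_like(logits)
    exp = np.exp(logits[top_k_idx] - logits[top_k_idx].max())
    probs[top_k_idx] = exp / exp.sum()
    # Within the top-k set, keep the smallest prefix whose mass reaches p.
    order = np.argsort(probs)[::-1]
    cumulative = np.cumsum(probs[order])
    cutoff = np.searchsorted(cumulative, p) + 1
    keep = order[:cutoff]
    renorm = probs[keep] / probs[keep].sum()
    return int(rng.choice(keep, p=renorm))

rng = np.random.default_rng(0)
token = sample_top_k_top_p(np.array([2.0, 1.0, 0.5, -1.0]), k=3, p=0.9, rng=rng)
```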
- Test for a bad state before building
- PR: #17379
- #0: Fix incorrect assertion introduced in #17259
- PR: #17386
- #0: Add missing include for work_split.hpp
- PR: #17390
- Replace ttnn::Shape/LegacyShape with SimpleShape in Python
- PR: #17341
- #0: Correcting bad dim check in CCL tests
- PR: #17392
- [tt-train] Add scatter workaround while proper version is in development
- PR: #17384
- [skip-ci] Bump timeout
- PR: #17397
- #15414: Read annotation data to determine job-level failure signature and reason
- PR: #17308
- #0: TT-Mesh bug fix MeshCQ/MeshWorkload on device indexing
- PR: #17333
- #17374: Add concurrency group for _produce-data.yaml
- PR: #17402
- Revert "#17374: Add concurrency group for _produce-data.yaml"
- PR: #17404
- #0: Fix broken link for "Programming Mesh of Devices" tech report
- PR: #17400
- Revert "#15414: Read annotation data to determine job-level failure signature and reason"
- PR: #17409
- Debug strange error
- PR: #17381
v0.56.0-rc2
Note
If you are installing from a release, please refer to the README, INSTALLATION instructions, and any other documentation packaged with the release, rather than the documentation on the main branch. There may be differences between the latest main and the previous release.
The changelog follows, showing the changes since the last release.
This release was generated by the CI workflow https://github.com/tenstorrent/tt-metal/actions/runs/13043899749
📦 Uncategorized
- #0: Add a way to specify custom dispatch topology
- PR: #17102
- Initialize work_executor_ and set WorkExecutorMode::SYNCHRONOUS in MeshDevice constructor
- PR: #17120
- [UMD] Switching to new coord API
- PR: #17003
- #0: Increase create heads test coverage for Llama shapes
- PR: #16980
- #16503: Optimize semaphore and CB writes
- PR: #16944
- Quiet down the CMake output from dependencies
- PR: #17008
- #16847: update to address the unaligned noc_async_copy from DRAM to L1
- PR: #17125
- remove ND failing big shape in transpose failures that is already tracked and disabled in test_transpose_2d
- PR: #17145
- #14898: pass in pad value to transpose in reduce
- PR: #17142
- rm -rf build.yaml
- PR: #17150
- #16982: Fixing program cache issues with reshape
- PR: #17140
- Delete convd_host_weights and update all tests using conv2d
- PR: #16264
- Update CODEOWNERS for the public API
- PR: #17149
- Aliu/bug fix
- PR: #17151
- #0: Move distributed headers into the public API directory
- PR: #17161
- [Llama3.2-11b-vision] Add support for text-only inference through generator api
- PR: #17105
- Remove references to ARCH_NAME in programming example
- PR: #17182
- Enable PR Gate
- PR: #17098
- #16806: Fixed watcher assert on reshape in debug mode
- PR: #17152
- #0: Fix doc links and make them point to the new location
- PR: #17181
- Remove ARCH_NAME references in prog_examples
- PR: #17185
- Prefer MOLD over LLD over LD
- PR: #17154
- Restore build-wrapper.yaml with updated method
- PR: #17197
- [TT-Train] Fix text generation
- PR: #17195
- LightMetal - Add Flatbuffers into cmake infra/build as cpm package (#17039)
- PR: #17157
- Format broken Kernel APIs Tables
- PR: #17000
- #0: Use MeshBuffer to store MeshWorkload kernel binaries
- PR: #17113
- Increase rms_norm and layernorm coverage for Llama shapes
- PR: #17180
- #17213: update fused and matmul trace sweep tests
- PR: #17214
- Add support for reading from / writing to partial buffer regions that are page size aligned for sharded buffers
- PR: #17089
- #0: Fix clang-format for dataflow_api.h
- PR: #17234
- Kkabilar tt single card perf
- PR: #17231
- [FABRIC] ASYNC_WR_ATOMIC_INC
- PR: #17072
- #9945: Enable and fix SD device perf test
- PR: #17025
- Check context switch pointer for eth cores before resetting
- PR: #17212
- Pull llrt.hpp out of public interface
- PR: #17196
- Update perf and latest features for llm models (Jan 27)
- PR: #17188
- #0: (MINOR) Bump to generate RCs for v0.57.0
- PR: #17252
- Remove dead includes of host_api.hpp from ttnn
- PR: #17220
- Prevent UNet Shallow perf report entry from being overwritten
- PR: #17235
- Fix setup.py for Anaconda
- PR: #17111
- Do not run PR Gate on Draft PRs
- PR: #17272
- Add a timeout for docker image building
- PR: #17285
- LightMetal - New APIs LightMetalBeginCapture() and LightMetalEndCapture() and docs (#17039)
- PR: #17262
- #0: Update distributed tests build to account for arch
- PR: #17287
- #17227: Make dispatch core order match for single chip 2 CQ and multichip 2 CQ topologies
- PR: #17274
- #17215: Add explicit dealloc for mesh buffer
- PR: #17265
- #0: Add validation test for dispatched remote circular buffer config to device
- PR: #17233
- Remove get_completion_queue_reader_core() API from Device
- PR: #17263
- Add resharding to post all gather layernorm/ rms norm op
- PR: #17156
- #0: Fix ttnn shared libs build
- PR: #17127
- #0: Schedule runs for single card new models tests
- PR: #17141
- Implement JointAttention
- PR: #17079
- Revert "Add resharding to post all gather layernorm/ rms norm op (#17156)
- PR: #17304
- Update memory config when using view op with height sharded tensors
- PR: #17266
- #16812: Reordering cbs in reduce_init_delta
- PR: #16981
- #17083: Add support for watcher printing phys coords
- PR: #17244
- #16945: Add auto retries to post commit on branches
- PR: #16946
- Remove CommandQueue redirecting usages straight to HWCQ
- PR: #17219
- 1D support for tilize/reshape ops
- PR: #17238
- #16138: W-broadcasting for sharded tensors
- PR: #17101
- #0: Add PR Gate to data pipeline
- PR: #17325
- #15174: Re-enable mistral7b demo test after fw upgrade
- PR: #17305
- LightMetal - Add LoadTrace() API and move TraceDescriptor out of detail namespace (#17039)
- PR: #17313
- #15974: Create device tensors table in report database
- PR: #17293
- Privatize dprint_server.hpp
- PR: #17298
- Uplift Allocator to be its own class + migrate calls to Allocator APIs
- PR: #17268
- Bump CMake in the Docker image
- PR: #17273
v0.56.0-rc1
Note
If you are installing from a release, please refer to the README, INSTALLATION instructions, and any other documentation packaged with the release, rather than the documentation on the main branch. There may be differences between the latest main and the previous release.
The changelog follows, showing the changes since the last release.
This release was generated by the CI workflow https://github.com/tenstorrent/tt-metal/actions/runs/13022911064
📦 Uncategorized
- #0: Add a way to specify custom dispatch topology
- PR: #17102
- Initialize work_executor_ and set WorkExecutorMode::SYNCHRONOUS in MeshDevice constructor
- PR: #17120
- [UMD] Switching to new coord API
- PR: #17003
- #0: Increase create heads test coverage for Llama shapes
- PR: #16980
- #16503: Optimize semaphore and CB writes
- PR: #16944
- Quiet down the CMake output from dependencies
- PR: #17008
- #16847: update to address the unaligned noc_async_copy from DRAM to L1
- PR: #17125
- remove ND failing big shape in transpose failures that is already tracked and disabled in test_transpose_2d
- PR: #17145
- #14898: pass in pad value to transpose in reduce
- PR: #17142
- rm -rf build.yaml
- PR: #17150
- #16982: Fixing program cache issues with reshape
- PR: #17140
- Delete convd_host_weights and update all tests using conv2d
- PR: #16264
- Update CODEOWNERS for the public API
- PR: #17149
- Aliu/bug fix
- PR: #17151
- #0: Move distributed headers into the public API directory
- PR: #17161
- [Llama3.2-11b-vision] Add support for text-only inference through generator api
- PR: #17105
- Remove references to ARCH_NAME in programming example
- PR: #17182
- Enable PR Gate
- PR: #17098
- #16806: Fixed watcher assert on reshape in debug mode
- PR: #17152
- #0: Fix doc links and make them point to the new location
- PR: #17181
- Remove ARCH_NAME references in prog_examples
- PR: #17185
- Prefer MOLD over LLD over LD
- PR: #17154
- Restore build-wrapper.yaml with updated method
- PR: #17197
- [TT-Train] Fix text generation
- PR: #17195
- LightMetal - Add Flatbuffers into cmake infra/build as cpm package (#17039)
- PR: #17157
- Format broken Kernel APIs Tables
- PR: #17000
- #0: Use MeshBuffer to store MeshWorkload kernel binaries
- PR: #17113
- Increase rms_norm and layernorm coverage for Llama shapes
- PR: #17180
- #17213: update fused and matmul trace sweep tests
- PR: #17214
- Add support for reading from / writing to partial buffer regions that are page size aligned for sharded buffers
- PR: #17089
- #0: Fix clang-format for dataflow_api.h
- PR: #17234
- Kkabilar tt single card perf
- PR: #17231
- [FABRIC] ASYNC_WR_ATOMIC_INC
- PR: #17072
- #9945: Enable and fix SD device perf test
- PR: #17025
- Check context switch pointer for eth cores before resetting
- PR: #17212
- Pull llrt.hpp out of public interface
- PR: #17196
- Update perf and latest features for llm models (Jan 27)
- PR: #17188
- #0: (MINOR) Bump to generate RCs for v0.57.0
- PR: #17252
- Remove dead includes of host_api.hpp from ttnn
- PR: #17220
- Prevent UNet Shallow perf report entry from being overwritten
- PR: #17235
- Fix setup.py for Anaconda
- PR: #17111
- Do not run PR Gate on Draft PRs
- PR: #17272
v0.55.0
Note
If you are installing from a release, please refer to the README, INSTALLATION instructions, and any other documentation packaged with the release, rather than the documentation on the main branch. There may be differences between the latest main and the previous release.
The changelog follows, showing the changes since the last release.
This release was generated by the CI workflow https://github.com/tenstorrent/tt-metal/actions/runs/13018933285
📦 Uncategorized
- Create an API for running and measuring the runtime of a ttnn op chain for use during forge compilation
- PR: #16921
- [TT-Train] Add bias=false in LinearLayer
- PR: #16707
- TT-Fabric Bringup Initial Check-in
- PR: #16343
- #0: Sanitize writes to mailbox on ethernet cores.
- PR: #16574
- Add Llama11B-N300 and Llama70B-TG (TP=32) to LLM table in README.md
- PR: #16724
- [skip ci] Update llms.md
- PR: #16737
- Update test_slice.py
- PR: #16734
- #16625: Refactor tracking of sub-device managers from Device to a new class
- PR: #16683
- Update code-analysis.yaml
- PR: #16738
- [skip ci] Update llms.md
- PR: #16745
- remove references to LFS
- PR: #16722
- Fixes for conversion to row major for 0D and 0-volume tensors
- PR: #16736
- #0: Disable BH tools test at workflow level
- PR: #16749
- Removing some usages of LegacyShape, improve Tensor::to_string
- PR: #16711
- [skip ci] Fix lint on a doc
- PR: #16751
- #0: API Unification for Device and MeshDevice
- PR: #16570
- Port ttnn::random and uniform from LegacyShape to SimpleShape
- PR: #16744
- #16379: make softmax call moreh_softmax if rank above 4
- PR: #16735
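A hedged usage sketch of the change above: with this fix, ttnn.softmax on a tensor of rank greater than 4 is routed to moreh_softmax internally. The from_torch/layout arguments follow common ttnn usage and may vary between versions.

```python
# Hedged sketch: rank-5 softmax is expected to dispatch to moreh_softmax.
import torch
import ttnn

device = ttnn.open_device(device_id=0)
x = torch.randn(2, 2, 2, 32, 32)  # rank 5: previously unsupported by the default path
t = ttnn.from_torch(x, dtype=ttnn.bfloat16, layout=ttnn.TILE_LAYOUT, device=device)
out = ttnn.softmax(t, dim=-1)     # routed to moreh_softmax for rank > 4
result = ttnn.to_torch(out)
ttnn.close_device(device)
```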
- #7126: remove skip for test_sd_matmul test
- PR: #16729
- #0: Make device an optional parameter in the tensor distribution API
- PR: #16746
- Added build-wheels to fast-dispatch-build-and-unit-tests-wrapper.yaml
- PR: #16638
- Adding CCL Async test cases to TG nightly and bug fix
- PR: #16700
- #11119: Move op_profiler.hpp under the ttnn folder
- PR: #11167
- #15979: Switch to google benchmark for pgm dispatch tests
- PR: #16547
- [tt-train] Add weight tying option for NanoGPT demo
- PR: #16768
- #0: Fix build of test_pgm_dispatch
- PR: #16773
- [tt-train] Update serialization of tensor for DDP
- PR: #16778
- #0: Fix failing TG regression tests
- PR: #16776
- [skip ci] Update llms.md
- PR: #16775
- Add tiled interleaved permute for when width dimension doesn't move (row-major tiled invariant)
- PR: #16671
- Add Fabric Router Config to Hal
- PR: #16761
- [skip ci] Update llms.md
- PR: #16791
- Reflect ARCH_NAME Changes in CI Workflows
- PR: #16706
- [skip ci] Update llms.md
- PR: #16792
- #0: Migrate pytensor to use from_vector Tensor creation APIs
- PR: #16767
- Afuller/metalium api reorg
- PR: #16578
- Ngrujic/sweep tests 3
- PR: #16316
- #0: Enable nlp create heads tests on BH
- PR: #16777
- Fix to_layout shard bug
- PR: #16754
- Fix broken link to host API
- PR: #16799
- Add noc flag to test stress noc mcast
- PR: #16772
- Set codeowners for transformer ttnn ops
- PR: #16803
- #15450: Remove default value for ocb argument in LLK compute API
- PR: #16376
- Linking tensor.reshape to ttnn.reshape
- PR: #16377
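A brief sketch of the linkage above, assuming the usual ttnn host flow: the tensor's reshape method and the free function ttnn.reshape now resolve to the same operation.

```python
# Hedged sketch of the equivalence introduced by #16377; exact argument
# forms may vary by version.
import torch
import ttnn

device = ttnn.open_device(device_id=0)
t = ttnn.from_torch(torch.randn(32, 64), layout=ttnn.TILE_LAYOUT, device=device)
a = ttnn.reshape(t, (64, 32))  # free-function form
b = t.reshape((64, 32))        # method form, now linked to the same op
ttnn.close_device(device)
```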
- #16646: Fix dangling reference in sharded tensor args
- PR: #16782
- #15450: Remove default values from circular buffer parameters in LLK compute APIs: Transpose and Reduce
- PR: #16427
- Add new python api to get architecture name
- PR: #16747
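A hedged sketch of the API above; the entry point is assumed to be ttnn.get_arch_name(), which may not be the exact symbol added by #16747.

```python
# Assumed entry point for querying the architecture name
# (e.g. "wormhole_b0" or "blackhole"); consult the PR for the canonical name.
import ttnn

arch = ttnn.get_arch_name()  # assumption; the symbol may differ
print(f"Running on architecture: {arch}")
```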
- Remove base.hpp
- PR: #16796
- [tt-train] Change weights initialization for GPT-2
- PR: #16815
- [skip ci] Update llms.md
- PR: #16828
- fuse residual add with layernorm
- PR: #16794
- [TT-Train] Add multidevice support to dropout
- PR: #16823
- #16171: Preload kernels before receiving go message
- PR: #16680
- #15450: Remove default values from circular buffer parameters in LLK compute APIs: Test Kernels
- PR: #16613
- #16366: Changed kernel config to HiFi4 for 32F matmul
- PR: #16743
- Add nightly APC run in debug mode
- PR: #16831
- [skip ci] Update llms.md
- PR: #16835
- [skip ci] Update llms.md
- PR: #16839
- Remove some ARCH_NAME ENV usage at runtime
- PR: #16825
- Move out tensor storage into a separate .hpp/.cpp
- PR: #16832
- #16460: Add more helpful error message when tt-topology needs to be run
- PR: #16783
- Make creation functions use SimpleShape, expose SimpleShape to Python
- PR: #16826
- #16242: Initial implementation of MeshBuffer
- PR: #16327
- Enable use-override check
- PR: #16842
- Privatize Taskflow
- PR: #16838
- Fix test_new_all_gather.py regressions caused by API unification between Device/MeshDevice
- PR: #16836
- Fix CB allocation warnings from ttnn.reshard
- PR: #16795
- Optimize upsample for bilinear mode
- PR: #16487
- Remove Shape usage from MultiDeviceStorage
- PR: #16841
- Remove redundant bank offset from destination address in ttnn.reshard
- PR: #16800
- Add option to raise error on failed local/global tensor comparison
- PR: #16585
- Padded Shards for Concat Support
- PR: #16765
- #0: Add support for tracing some sub-devices while others are still running programs
- PR: #16810
- #16769: bring up all reduce async as a composite op and added llama shape ccl test sweep
- PR: #16784
- #0: Lower Size to metalium as Shape2D
- PR: #16814
- #15976: Ensure reports insert all devices into the devices table
- PR: #16834
- Modify UNet Shallow to return output in CHW channel ordering
- PR: #16742
- #16758: Optimize usage and implementation of encode/decode tensor data
- PR: #16759
- Device to Device profiler sync
- PR: #16543
- Templating and Queue Size Adjustments for Packet Queue
- PR: #16732
- Refactor Superset model benchmarking tools to use Pydantic classes and save one json
- PR: #16790
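To illustrate the refactor above, here is a minimal Pydantic pattern: benchmark records validated by a schema and serialized to a single JSON file. Class and field names are hypothetical, not the tool's actual schema.

```python
# Hypothetical schema illustrating the "Pydantic classes + one json" pattern.
from pydantic import BaseModel

class BenchmarkRecord(BaseModel):
    name: str
    batch_size: int
    tokens_per_second: float

class BenchmarkReport(BaseModel):
    run_id: str
    records: list[BenchmarkRecord]

# Collect all records for a run and save them as one JSON document.
report = BenchmarkReport(
    run_id="example-run",
    records=[BenchmarkRecord(name="demo", batch_size=32, tokens_per_second=123.4)],
)
with open("benchmark_report.json", "w") as f:
    f.write(report.model_dump_json(indent=2))
```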
- #16078: Fix back-to-back calls of ttnn.close_device()
- PR: #16840
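A sketch of the lifecycle pattern the fix above addresses: opening and closing a device repeatedly within one process should now succeed. Device APIs follow common ttnn usage and may vary by version.

```python
# Hedged sketch: back-to-back open/close cycles in a single process.
import ttnn

for _ in range(2):
    device = ttnn.open_device(device_id=0)
    # ... run work on the device ...
    ttnn.close_device(device)
```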
- #16434: DPRINT to read buffer once
- PR: #16586
- Bring Taskflow from CPM
- PR: #16843
- This file seems to be kernel-only
- PR: #16853
- Minor SDPA optimizations
- PR: #16566
- #15450: Remove default values from circular buffer parameters in LLK compute APIs: Eltwise Unary
- PR: #16527
- Fix scaling issue with RT arguments in tilize/untilize with padding
- PR: #16690
- Make stress noc mcast test respect physical coordinates + allow option to skip mcaster
- PR: #16833
- Fix some shapes for Prefetcher + Matmul, Use Multi-device Global CB
- PR: #16764
- Do not build UMD tests
- PR: #16877
- Move risc_attribs back to hw/inc
- PR: #16867
- Re-enable UNet Shallow trace+2CQ test case
- PR: #16875
- Upgrade error message in control plane
- PR: #16863
- #15824 Workaround LLK issue in max_pool
- PR: #16849
- [skip ci] Fixed TG configuration description in documentation
- PR: #16884
- #0: Update pgm_dispatch_golden.json
- PR: #16818
- #0: fix stackoverflow in eth tun
- PR: #16889
- #0: Refactor enqueue_write_buffer
- PR: #16880
- #0: Add skip for mnist tests because I can't take this anymore
- PR: #16891
- #0: Remove SetLazyCommandQueueMode from Metal API
- PR: #16886
- #16868: Update profiler post proc asserts tripping due to kernel preload
- PR: #16872
- #16350: Update reciprocal docs
- PR: #16371
- [skip ci] : Update INSTALLING.md
- PR: #16893
- Remove sharded_to_interleaved workaround in UNet Shallow
- PR: #16770
- Add CI job for running models in comparison mode
- PR: #16808
- pybind expose MeshDevice::reshape
- PR: #16798
- #0: Update sweeps README
- PR: #16902
- Workaround issue #16895, fix PCC checking for wormhole in Resnet50 demo
- PR: #16896
- #0: Refactor enqueue_read_buffer
- PR: #16908
- move device checking outside of invalidate code func
- PR: #16903
- Disable Unstable Transpose 2D Test
- PR: #16781
- New Operation: Fill_Tile_Pad ; Op to fill tile padding with a specific value
- PR: #16785
- #0: Separate HWCommandQueue in it's own header
- PR: #16885
- Update Mamba device performance targets
- PR: #16887
- Changing how we set up the simulator.
- PR: #16375
- Add missing include for types used
- PR: #16934
- Adding active erisc FW for BH + support for compiling this + updating BH eth_l1_address_map
- PR: #16916
- disable test_transpose_2D due to python-side segfault
- PR: #16933
- #16913: Add Model Updates to the Release assets
- PR: #16914
- Add Datagram Sockets to Fabric
- PR: #16830
- [Llama3] Send decode output logits to dram to reduce trace l1 usage and fix 8b-n150 memory crash
- PR: #16924
- Sharding support for binary_ng
- PR: #16789
- Fix mcast end core for stress noc mcast test
- PR: #16947
- #13901: MaxPool Wide Reductions with Non-8-Tile Multiples
- PR: #16544
- Make creation functions use SimpleShape, expose SimpleShape & TensorSpec to Python
- PR: #16865
- Feature/vecadd sharding
- PR: #16654
- Resolve the issue in ubenchmark pipeline
- PR: #16949
- #0: update test_vc_uni_tunnel bw requirement
- PR: #16953
- [tt-train] Fix broken build due to taskflow change
- PR: #16952
- #16415: fix moreh_adam
- PR: #16420
- #16469 Add sharding to vecadd example
- PR: #16959
- Revert "#16469 Add sharding to vecadd example"
- PR: #16961
- Revert "Feature/vecadd sharding"
- PR: #16962
- #13195: Squeezebert using Conv1d Width Sharded
- PR: #16881
- Cleanup of various issues
- PR: #16873
- Add sweeps with pre-allocated output for topk and argmax
- PR: #16898
- #16510: Eltwise sweep test for add and mul + silu - LLama
- PR: #16516
- Fixing variable name to build umd tests
- PR: #16967
- #15246: Add sweeps for acos...
v0.55.0-rc20
Note
If you are installing from a release, please refer to the README, INSTALLATION instructions, and any other documentation packaged with the release, rather than the documentation on the main branch. There may be differences between the latest main and the previous release.
The changelog follows, showing the changes since the last release.
This release was generated by the CI workflow https://github.com/tenstorrent/tt-metal/actions/runs/13001851215
📦 Uncategorized
- #11512: Add sweeps for eltwise sharded ops 3
- PR: #16307
- Add sweeps for unary, unary_sharded and binary_sharded versions of ops: fmod, remainder, maximum, minimum.
- PR: #15911
- Don't leak tt_cluster.hpp through kernel_types.hpp
- PR: #16691
- #6983: Re-enable skipped TT-NN unit test
- PR: #16642
- #15450: Remove default values from circular buffer parameters in LLK compute APIs
- PR: #16389
- update build flag on programming examples docs
- PR: #16635
- Fix for P100 board type
- PR: #16718
- Sever TT-Train's dependency on TT-Metalium's tests
- PR: #16685
- [TT-Train] Update generate of LLM
- PR: #16723
- [TT-Train] Add bias=false in LinearLayer
- PR: #16707
- TT-Fabric Bringup Initial Check-in
- PR: #16343
- #0: Sanitize writes to mailbox on ethernet cores.
- PR: #16574
- Add Llama11B-N300 and Llama70B-TG (TP=32) to LLM table in README.md
- PR: #16724
- [skip ci] Update llms.md
- PR: #16737
- Update test_slice.py
- PR: #16734
- #16625: Refactor tracking of sub-device managers from Device to a new class
- PR: #16683
- Update code-analysis.yaml
- PR: #16738
- [skip ci] Update llms.md
- PR: #16745
- remove references to LFS
- PR: #16722
- Fixes for conversion to row major for 0D and 0-volume tensors
- PR: #16736
- #0: Disable BH tools test at workflow level
- PR: #16749
- Removing some usages of LegacyShape, improve Tensor::to_string
- PR: #16711
- [skip ci] Fix lint on a doc
- PR: #16751
- #0: API Unification for Device and MeshDevice
- PR: #16570
- Port ttnn::random and uniform from LegacyShape to SimpleShape
- PR: #16744
- #16379: make softmax call moreh_softmax if rank above 4
- PR: #16735
- #7126: remove skip for test_sd_matmul test
- PR: #16729
- #0: Make device an optional parameter in the tensor distribution API
- PR: #16746
- Added build-wheels to fast-dispatch-build-and-unit-tests-wrapper.yaml
- PR: #16638
- Adding CCL Async test cases to TG nightly and bug fix
- PR: #16700
- #11119: Move op_profiler.hpp under the ttnn folder
- PR: #11167
- #15979: Switch to google benchmark for pgm dispatch tests
- PR: #16547
- [tt-train] Add weight tying option for NanoGPT demo
- PR: #16768
- #0: Fix build of test_pgm_dispatch
- PR: #16773
- [tt-train] Update serialization of tensor for DDP
- PR: #16778
- #0: Fix failing TG regression tests
- PR: #16776
- [skip ci] Update llms.md
- PR: #16775
- Add tiled interleaved permute for when width dimension doesn't move (row-major tiled invariant)
- PR: #16671
- Add Fabric Router Config to Hal
- PR: #16761
- [skip ci] Update llms.md
- PR: #16791
- Reflect ARCH_NAME Changes in CI Workflows
- PR: #16706
- [skip ci] Update llms.md
- PR: #16792
- #0: Migrate pytensor to use from_vector Tensor creation APIs
- PR: #16767
- Afuller/metalium api reorg
- PR: #16578
- Ngrujic/sweep tests 3
- PR: #16316
- #0: Enable nlp create heads tests on BH
- PR: #16777
- Fix to_layout shard bug
- PR: #16754
- Fix broken link to host API
- PR: #16799
- Add noc flag to test stress noc mcast
- PR: #16772
- Set codeowners for transformer ttnn ops
- PR: #16803
- #15450: Remove default value for ocb argument in LLK compute API
- PR: #16376
- Linking tensor.reshape to ttnn.reshape
- PR: #16377
- #16646: Fix dangling reference in sharded tensor args
- PR: #16782
- #15450: Remove default values from circular buffer parameters in LLK compute APIs: Transpose and Reduce
- PR: #16427
- Add new python api to get architecture name
- PR: #16747
- Remove base.hpp
- PR: #16796
- [tt-train] Change weights initialization for GPT-2
- PR: #16815
- [skip ci] Update llms.md
- PR: #16828
- fuse residual add with layernorm
- PR: #16794
- [TT-Train] Add multidevice support to dropout
- PR: #16823
- #16171: Preload kernels before receiving go message
- PR: #16680
- #15450: Remove default values from circular buffer parameters in LLK compute APIs: Test Kernels
- PR: #16613
- #16366: Changed kernel config to HiFi4 for 32F matmul
- PR: #16743
- Add nightly APC run in debug mode
- PR: #16831
- [skip ci] Update llms.md
- PR: #16835
- [skip ci] Update llms.md
- PR: #16839
- Remove some ARCH_NAME ENV usage at runtime
- PR: #16825
- Move out tensor storage into a separate .hpp/.cpp
- PR: #16832
- #16460: Add more helpful error message when tt-topology needs to be run
- PR: #16783
- Make creation functions use SimpleShape, expose SimpleShape to Python
- PR: #16826
- #16242: Initial implementation of MeshBuffer
- PR: #16327
- Enable use-override check
- PR: #16842
- Privatize Taskflow
- PR: #16838
- Fix test_new_all_gather.py regressions caused by API unification between Device/MeshDevice
- PR: #16836
- Fix CB allocation warnings from ttnn.reshard
- PR: #16795
- Optimize upsample for bilinear mode
- PR: #16487
- Remove Shape usage from MultiDeviceStorage
- PR: #16841
- Remove redundant bank offset from destination address in ttnn.reshard
- PR: #16800
- Add option to raise error on failed local/global tensor comparison
- PR: #16585
- Padded Shards for Concat Support
- PR: #16765
- #0: Add support for tracing some sub-devices while others are still running programs
- PR: #16810
- #16769: bring up all reduce async as a composite op and added llama shape ccl test sweep
- PR: #16784
- #0: Lower Size to metalium as Shape2D
- PR: #16814
- #15976: Ensure reports insert all devices into the devices table
- PR: #16834
- Modify UNet Shallow to return output in CHW channel ordering
- PR: #16742
- #16758: Optimize usage and implementation of encode/decode tensor data
- PR: #16759
- Device to Device profiler sync
- PR: #16543
- Templating and Queue Size Adjustments for Packet Queue
- PR: #16732
- Refactor Superset model benchmarking tools to use Pydantic classes and save one json
- PR: #16790
- #16078: Fix back-to-back calls of ttnn.close_device()
- PR: #16840
- #16434: DPRINT to read buffer once
- PR: #16586
- Bring Taskflow from CPM
- PR: #16843
- This file seems to be kernel-only
- PR: #16853
- Minor SDPA optimizations
- PR: #16566
- #15450: Remove default values from circular buffer parameters in LLK compute APIs: Eltwise Unary
- PR: #16527
- Fix scaling issue with RT arguments in tilize/untilize with padding
- PR: #16690
- Make stress noc mcast test respect physical coordinates + allow option to skip mcaster
- PR: #16833
- Fix some shapes for Prefetcher + Matmul, Use Multi-device Global CB
- PR: #16764
- Do not build UMD tests
- PR: #16877
- Move risc_attribs back to hw/inc
- PR: #16867
- Re-enable UNet Shallow trace+2CQ test case
- PR: #16875
- Upgrade error message in control plane
- PR: #16863
- #15824 Workaround LLK issue in max_pool
- PR: #16849
- [skip ci] Fixed TG configuration description in documentation
- PR: #16884
- #0: Update pgm_dispatch_golden.json
- PR: #16818
- #0: fix stackoverflow in eth tun
- PR: #16889
- #0: Refactor enqueue_write_buffer
- PR: #16880
- #0: Add skip for mnist tests because I can't take this anymore
- PR: #16891
- #0: Remove SetLazyCommandQueueMode from Metal API
- PR: #16886
- #16868: Update profiler post proc asserts tripping due to kernel preload
- PR: #16872
- #16350: Update reciprocal docs
- PR: #16371
- [skip ci] : Update INSTALLING.md
- PR: #16893
- Remove sharded_to_interleaved workaround in UNet Shallow
- PR: #16770
- Add CI job for running models in comparison mode
- PR: #16808
- pybind expose MeshDevice::reshape
- PR: #16798
- #0: Update sweeps README
- PR: #16902
- Workaround issue #16895, fix PCC checking for wormhole in Resnet50 demo
- PR: #16896
v0.55.0-rc19
Note
If you are installing from a release, please refer to the README, INSTALLATION instructions, and any other documentation packaged with the release, rather than the documentation on the main branch. There may be differences between the latest main and the previous release.
The changelog follows, showing the changes since the last release.
This release was generated by the CI workflow https://github.com/tenstorrent/tt-metal/actions/runs/12980411497
📦 Uncategorized
- Remove ARCH_NAME from host library code
- PR: #16616
- #16367: Added support to enable dram and l1 memory collection without saving to disk
- PR: #16368
- Update .clang-format-ignore
- PR: #16681
- Tweak BH csrrs init code
- PR: #16682
- #0: Clean up confusing refs to Greyskull from ttnn.copy error messages.
- PR: #16647
- Update perf and latest features for llm models (Jan 13)
- PR: #16677
- Update README.md
- PR: #16702
- #16657: Fix to_layout conversion into row major for 1D tensors
- PR: #16684
- Tilize with val padding results in L1 cache OOM
- PR: #16633
- #0: Fixes from commit ae61802
- PR: #16686
- #0: Skip build-docker-image during post-commit code-analysis since the docker image is already built in a previous job
- PR: #16703
- Generate test executables per architecture
- PR: #16594
- #16587: Update UMD submodule commit for P150 compatibility
- PR: #16709
- Replace some instances of Tensor::get_shape with get_logical_shape
- PR: #16655
- Update METALIUM_GUIDE.md
- PR: #16602
- #16621: Add barriers at end of cq_dispatch_slave.cpp on IERISC
- PR: #16666
- Finish porting OPs to compute_output_specs
- PR: #16695
- ScopedGraphCapture
- PR: #15774
- #15756 Pull in BH LLK fix for maxpool hang
- PR: #16663
- #15246: Add sweep tests for logical_and, logical_or, logical_xor
- PR: #16132
- #0: (MINOR) Bump to v0.55.0
- PR: #16714
- #11512: Add sweeps for eltwise sharded ops 3
- PR: #16307
- Add sweeps for unary, unary_sharded and binary_sharded versions of ops: fmod, remainder, maximum, minimum.
- PR: #15911
- Don't leak tt_cluster.hpp through kernel_types.hpp
- PR: #16691
- #6983: Re-enable skipped TT-NN unit test
- PR: #16642
- #15450: Remove default values from circular buffer parameters in LLK compute APIs
- PR: #16389
- update build flag on programming examples docs
- PR: #16635
- Fix for P100 board type
- PR: #16718
- Sever TT-Train's dependency on TT-Metalium's tests
- PR: #16685
- [TT-Train] Update generate of LLM
- PR: #16723
- [TT-Train] Add bias=false in LinearLayer
- PR: #16707
- TT-Fabric Bringup Initial Check-in
- PR: #16343
- #0: Sanitize writes to mailbox on ethernet cores.
- PR: #16574
- Add Llama11B-N300 and Llama70B-TG (TP=32) to LLM table in README.md
- PR: #16724
- [skip ci] Update llms.md
- PR: #16737
- Update test_slice.py
- PR: #16734
- #16625: Refactor tracking of sub-device managers from Device to a new class
- PR: #16683
- Update code-analysis.yaml
- PR: #16738
- [skip ci] Update llms.md
- PR: #16745
- remove references to LFS
- PR: #16722
- Fixes for conversion to row major for 0D and 0-volume tensors
- PR: #16736
- #0: Disable BH tools test at workflow level
- PR: #16749
- Removing some usages of LegacyShape, improve Tensor::to_string
- PR: #16711
- [skip ci] Fix lint on a doc
- PR: #16751
- #0: API Unification for Device and MeshDevice
- PR: #16570
- Port ttnn::random and uniform from LegacyShape to SimpleShape
- PR: #16744
- #16379: make softmax call moreh_softmax if rank above 4
- PR: #16735
- #7126: remove skip for test_sd_matmul test
- PR: #16729
- #0: Make device an optional parameter in the tensor distribution API
- PR: #16746
- Added build-wheels to fast-dispatch-build-and-unit-tests-wrapper.yaml
- PR: #16638
- Adding CCL Async test cases to TG nightly and bug fix
- PR: #16700
- #11119: Move op_profiler.hpp under the ttnn folder
- PR: #11167
- #15979: Switch to google benchmark for pgm dispatch tests
- PR: #16547
- [tt-train] Add weight tying option for NanoGPT demo
- PR: #16768
- #0: Fix build of test_pgm_dispatch
- PR: #16773
- [tt-train] Update serialization of tensor for DDP
- PR: #16778
- #0: Fix failing TG regression tests
- PR: #16776
- [skip ci] Update llms.md
- PR: #16775
- Add tiled interleaved permute for when width dimension doesn't move (row-major tiled invariant)
- PR: #16671
- Add Fabric Router Config to Hal
- PR: #16761
- [skip ci] Update llms.md
- PR: #16791
- Reflect ARCH_NAME Changes in CI Workflows
- PR: #16706
- [skip ci] Update llms.md
- PR: #16792
- #0: Migrate pytensor to use from_vector Tensor creation APIs
- PR: #16767
- Afuller/metalium api reorg
- PR: #16578
- Ngrujic/sweep tests 3
- PR: #16316
- #0: Enable nlp create heads tests on BH
- PR: #16777
- Fix to_layout shard bug
- PR: #16754
- Fix broken link to host API
- PR: #16799
- Add noc flag to test stress noc mcast
- PR: #16772
- Set codeowners for transformer ttnn ops
- PR: #16803
- #15450: Remove default value for ocb argument in LLK compute API
- PR: #16376
- Linking tensor.reshape to ttnn.reshape
- PR: #16377
- #16646: Fix dangling reference in sharded tensor args
- PR: #16782
- #15450: Remove default values from circular buffer parameters in LLK compute APIs: Transpose and Reduce
- PR: #16427
- Add new python api to get architecture name
- PR: #16747
- Remove base.hpp
- PR: #16796
- [tt-train] Change weights initialization for GPT-2
- PR: #16815
- [skip ci] Update llms.md
- PR: #16828
- fuse residual add with layernorm
- PR: #16794
- [TT-Train] Add multidevice support to dropout
- PR: #16823
- #16171: Preload kernels before receiving go message
- PR: #16680
- #15450: Remove default values from circular buffer parameters in LLK compute APIs: Test Kernels
- PR: #16613
- #16366: Changed kernel config to HiFi4 for 32F matmul
- PR: #16743
- Add nightly APC run in debug mode
- PR: #16831
- [skip ci] Update llms.md
- PR: #16835
- [skip ci] Update llms.md
- PR: #16839
- Remove some ARCH_NAME ENV usage at runtime
- PR: #16825
- Move out tensor storage into a separate .hpp/.cpp
- PR: #16832
- #16460: Add more helpful error message when tt-topology needs to be run
- PR: #16783
- Make creation functions use SimpleShape, expose SimpleShape to Python
- PR: #16826
- #16242: Initial implementation of MeshBuffer
- PR: #16327
- Enable use-override check
- PR: #16842
- Privatize Taskflow
- PR: #16838
- Fix test_new_all_gather.py regressions caused by API unification between Device/MeshDevice
- PR: #16836
- Fix CB allocation warnings from ttnn.reshard
- PR: #16795
- Optimize upsample for bilinear mode
- PR: #16487
- Remove Shape usage from MultiDeviceStorage
- PR: #16841
- Remove redundant bank offset from destination address in ttnn.reshard
- PR: #16800
- Add option to raise error on failed local/global tensor comparison
- PR: #16585
- Padded Shards for Concat Support
- PR: #16765
- #0: Add support for tracing some sub-devices while others are still running programs
- PR: #16810
- #16769: bring up all reduce async as a composite op and added llama shape ccl test sweep
- PR: #16784
- #0: Lower Size to metalium as Shape2D
- PR: #16814
- #15976: Ensure reports insert all devices into the devices table
- PR: #16834
- Modify UNet Shallow to return output in CHW channel ordering
- PR: #16742
- #16758: Optimize usage and implementation of encode/decode tensor data
- PR: #16759
- Device to Device profiler sync
- PR: #16543
- Templating and Queue Size Adjustments for Packet Queue
- PR: #16732
- Refactor Superset model benchmarking tools to use Pydantic classes and save one json
- PR: #16790
- #16078: Fix back-to-back calls of ttnn.close_device()
- PR: #16840
- #16434: DPRINT to read buffer once
- PR: #16586
- Bring Taskflow from CPM
- PR: #16843
- This file seems to be kernel-only
- PR: #16853
- Minor SDPA optimizations
- PR: #16566
- #15450: Remove default values from circular buffer parameters in LLK compute APIs: Eltwise Unary
- PR: #16527
- Fix scaling issue with RT arguments in tilize/untilize with padding
- PR: #16690
- Make stress noc mcast test respect physical coordinates + allow option to skip mcaster
- PR: #16833
- Fix some shapes for Prefetcher + Matmul, Use Multi-device Global CB
- PR: #16764
- Do not build UMD tests
- PR: #16877
- Move risc_attribs back to hw/inc
- PR: #16867
- Re-enable UNet Shallow trace+2CQ test case
- PR: #16875
- Upgrade error message in control plane
- PR: #16863
- #15824 Workaround LLK issue in max_pool
- PR: #16849
- [skip ci] Fixed TG configuration description in documentation
- PR: #16884
- #0: Update pgm_dispatch_golden.json
- PR: #16818
- #0: fix stackoverflow in eth tun
- PR: #16889
- #0: Refactor enqueue_write_buffer
- PR: #16880
- #0: Add skip for mnist tests because I can't take this anymore
- PR: #16891
- #0: Remove SetLazyCommandQueueMode from Metal API
- PR: #16886
- #16868: Update profiler post proc asserts tripping due to kernel preload
- PR: #16872
- #16350: Update reciprocal docs
- PR: #16371
- [skip ci] : Update INSTALLING.md
- PR: #16893
- Remove sharded_to_interleaved workaround in UNet Shallow
- PR: #16770
- Add CI job for running models in comparison mode
- PR: #16808
- pybind expose MeshDevice::reshape
- PR: #16798
- #0: Update sweeps README
- PR: #16902
- Workaround issue #16895, fix PCC checking for wormhole in Resnet50 demo
- PR: #16896
v0.55.0-rc18
Note
If you are installing from a release, please refer to the README, INSTALLATION instructions, and any other documentation packaged with the release, rather than the documentation on the main branch. There may be differences between the latest main and the previous release.
The changelog follows, showing the changes since the last release.
This release was generated by the CI workflow https://github.com/tenstorrent/tt-metal/actions/runs/12960537424
📦 Uncategorized
- Remove ARCH_NAME from host library code
- PR: #16616
- #12253: Implement Batch norm operation for inference mode
- PR: #16432
- #16443: Add a programming example of vecadd_multi_core and gtest
- PR: #16446
- Enable to/from torch tests for 0D/1D tensors
- PR: #16653
- Port all data movements ops to compute_output_specs
- PR: #16652
- #15246: Add sweep tests for addcdiv, addcmul, rdiv, rsub, ceil
- PR: #15998
- Fix build break
- PR: #16656
- Logical sharding for input tensor and halo output
- PR: #16517
- #16495: reduce grid for falcon7b mlp matmul
- PR: #16569
- Stress NOC mcast test
- PR: #16639
- [skip ci] Update subdevice doc
- PR: #16669
- Read from and write to partial buffer regions for interleaved buffers where offset and size of specified buffer region are divisible by buffer page size
- PR: #16102
- Fix resnet large on GS
- PR: #16665
- Fix Pre-allgather Layernorm bad PCC when use 1D reduction
- PR: #16622
- #16353: skip no volume tensors
- PR: #16619
- Create README.md
- PR: #16675
- Update README.md
- PR: #16676
- #16367: Added support to enable dram and l1 memory collection without saving to disk
- PR: #16368
- Update .clang-format-ignore
- PR: #16681
- Tweak BH csrrs init code
- PR: #16682
- #0: Clean up confusing refs to Greyskull from ttnn.copy error messages.
- PR: #16647
- Update perf and latest features for llm models (Jan 13)
- PR: #16677
- Update README.md
- PR: #16702
- #16657: Fix to_layout conversion into row major for 1D tensors
- PR: #16684
- Tilize with val padding results in L1 cache OOM
- PR: #16633
- #0: Fixes from commit ae61802
- PR: #16686
- #0: Skip build-docker-image during post-commit code-analysis since the docker image is already built in a previous job
- PR: #16703
- Generate test executables per architecture
- PR: #16594
- #16587: Update UMD submodule commit for P150 compatibility
- PR: #16709
- Replace some instances of Tensor::get_shape with get_logical_shape
- PR: #16655
- Update METALIUM_GUIDE.md
- PR: #16602
- #16621: Add barriers at end of cq_dispatch_slave.cpp on IERISC
- PR: #16666
- Finish porting OPs to compute_output_specs
- PR: #16695
- ScopedGraphCapture
- PR: #15774
- #15756 Pull in BH LLK fix for maxpool hang
- PR: #16663
- #15246: Add sweep tests for logical_and, logical_or, logical_xor
- PR: #16132
- #0: (MINOR) Bump to v0.55.0
- PR: #16714
- #11512: Add sweeps for eltwise sharded ops 3
- PR: #16307
- Add sweeps for unary, unary_sharded and binary_sharded versions of ops: fmod, remainder, maximum, minimum.
- PR: #15911
- Don't leak tt_cluster.hpp through kernel_types.hpp
- PR: #16691
- #6983: Re-enable skipped TT-NN unit test
- PR: #16642
- #15450: Remove default values from circular buffer parameters in LLK compute APIs
- PR: #16389
- update build flag on programming examples docs
- PR: #16635
- Fix for P100 board type
- PR: #16718
- Sever TT-Train's dependency on TT-Metalium's tests
- PR: #16685
- [TT-Train] Update generate of LLM
- PR: #16723
- [TT-Train] Add bias=false in LinearLayer
- PR: #16707
- TT-Fabric Bringup Initial Check-in
- PR: #16343
- #0: Sanitize writes to mailbox on ethernet cores.
- PR: #16574
- Add Llama11B-N300 and Llama70B-TG (TP=32) to LLM table in README.md
- PR: #16724
- [skip ci] Update llms.md
- PR: #16737
- Update test_slice.py
- PR: #16734
- #16625: Refactor tracking of sub-device managers from Device to a new class
- PR: #16683
- Update code-analysis.yaml
- PR: #16738
- [skip ci] Update llms.md
- PR: #16745
- remove references to LFS
- PR: #16722
- Fixes for conversion to row major for 0D and 0-volume tensors
- PR: #16736
- #0: Disable BH tools test at workflow level
- PR: #16749
- Removing some usages of LegacyShape, improve Tensor::to_string
- PR: #16711
- [skip ci] Fix lint on a doc
- PR: #16751
- #0: API Unification for Device and MeshDevice
- PR: #16570
- Port ttnn::random and uniform from LegacyShape to SimpleShape
- PR: #16744
- #16379: make softmax call moreh_softmax if rank above 4
- PR: #16735
- #7126: remove skip for test_sd_matmul test
- PR: #16729
- #0: Make device an optional parameter in the tensor distribution API
- PR: #16746
- Added build-wheels to fast-dispatch-build-and-unit-tests-wrapper.yaml
- PR: #16638
- Adding CCL Async test cases to TG nightly and bug fix
- PR: #16700
- #11119: Move op_profiler.hpp under the ttnn folder
- PR: #11167
- #15979: Switch to google benchmark for pgm dispatch tests
- PR: #16547
- [tt-train] Add weight tying option for NanoGPT demo
- PR: #16768
- #0: Fix build of test_pgm_dispatch
- PR: #16773
- [tt-train] Update serialization of tensor for DDP
- PR: #16778
- #0: Fix failing TG regression tests
- PR: #16776
- [skip ci] Update llms.md
- PR: #16775
- Add tiled interleaved permute for when width dimension doesn't move (row-major tiled invariant)
- PR: #16671
- Add Fabric Router Config to Hal
- PR: #16761
- [skip ci] Update llms.md
- PR: #16791
- Reflect ARCH_NAME Changes in CI Workflows
- PR: #16706
- [skip ci] Update llms.md
- PR: #16792
- #0: Migrate pytensor to use from_vector Tensor creation APIs
- PR: #16767
- Afuller/metalium api reorg
- PR: #16578
- Ngrujic/sweep tests 3
- PR: #16316
- #0: Enable nlp create heads tests on BH
- PR: #16777
- Fix to_layout shard bug
- PR: #16754
- Fix broken link to host API
- PR: #16799
- Add noc flag to test stress noc mcast
- PR: #16772
- Set codeowners for transformer ttnn ops
- PR: #16803
- #15450: Remove default value for ocb argument in LLK compute API
- PR: #16376
- Linking tensor.reshape to ttnn.reshape
- PR: #16377
- #16646: Fix dangling reference in sharded tensor args
- PR: #16782
- #15450: Remove default values from circular buffer parameters in LLK compute APIs: Transpose and Reduce
- PR: #16427
- Add new python api to get architecture name
- PR: #16747
- Remove base.hpp
- PR: #16796
- [tt-train] Change weights initialization for GPT-2
- PR: #16815
- [skip ci] Update llms.md
- PR: #16828
- fuse residual add with layernorm
- PR: #16794
- [TT-Train] Add multidevice support to dropout
- PR: #16823
- #16171: Preload kernels before receiving go message
- PR: #16680
- #15450: Remove default values from circular buffer parameters in LLK compute APIs: Test Kernels
- PR: #16613
- #16366: Changed kernel config to HiFi4 for 32F matmul
- PR: #16743
- Add nightly APC run in debug mode
- PR: #16831
- [skip ci] Update llms.md
- PR: #16835
- [skip ci] Update llms.md
- PR: #16839
- Remove some ARCH_NAME ENV usage at runtime
- PR: #16825
- Move out tensor storage into a separate .hpp/.cpp
- PR: #16832
- #16460: Add more helpful error message when tt-topology needs to be run
- PR: #16783
- Make creation functions use SimpleShape, expose SimpleShape to Python
- PR: #16826
- #16242: Initial implementation of MeshBuffer
- PR: #16327
- Enable use-override check
- PR: #16842
- Privatize Taskflow
- PR: #16838
- Fix test_new_all_gather.py regressions caused by API unification between Device/MeshDevice
- PR: #16836
- Fix CB allocation warnings from ttnn.reshard
- PR: #16795
- Optimize upsample for bilinear mode
- PR: #16487
- Remove Shape usage from MultiDeviceStorage
- PR: #16841
- Remove redundant bank offset from destination address in ttnn.reshard
- PR: #16800
- Add option to raise error on failed local/global tensor comparison
- PR: #16585
- Padded Shards for Concat Support
- PR: #16765
- #0: Add support for tracing some sub-devices while others are still running programs
- PR: #16810
- #16769: bring up all reduce async as a composite op and added llama shape ccl test sweep
- PR: #16784
- #0: Lower Size to metalium as Shape2D
- PR: #16814
- #15976: Ensure reports insert all devices into the devices table
- PR: #16834
- Modify UNet Shallow to return output in CHW channel ordering
- PR: #16742
- #16758: Optimize usage and implementation of encode/decode tensor data
- PR: #16759
- Device to Device profiler sync
- PR: #16543
- Templating and Queue Size Adjustments for Packet Queue
- PR: #16732
- Refactor Superset model benchmarking tools to use Pydantic classes and save one json
- PR: #16790
- #16078: Fix back-to-back calls of ttnn.close_device()
- PR: #16840
- #16434: DPRINT to read buffer once
- PR: #16586
- Bring Taskflow from CPM
- PR: #16843
- This file seems to be kernel-only
- PR: #16853
- Minor SDPA optimizations
- PR: #16566
- #15450: Remove default values from circular buffer parameters in LLK compute APIs: Eltwise Unary
- PR: #16527
- Fix scaling issue with RT arguments in tilize/untilize with padding
- PR: #16690
- Make stress noc mcast test respect physical coordinates + allow option to skip mcaster
- PR: #16833
- Fix some shapes for Prefetcher + Matmul, Use Multi-device Global CB
- PR: #16764
- Do not build UMD tests
- PR: #16877
- Move risc_attribs back to hw/inc
- PR: #16867
- Re-enable UNet Shallow trace+2CQ test case
- PR: #16875
- Upgrade error message in control plane
- PR: #16863
- #15824 Workaround LLK issue in max_pool
- PR: #16849
- [skip ci] Fixed TG configuration description in documentation
- PR: #16884
- #0: Update pgm_dispatch_golden.json
- PR: #16818
- #0: fix stackoverflow in eth tun
- PR: #16889
- #0: Refactor enqueue_write_buffer...
v0.55.0-rc15
Note
If you are installing from a release, please refer to the README, INSTALLATION instructions, and any other documentation packaged with the release, rather than the documentation on the main branch. There may be differences between the latest main and the previous release.
The changelog follows, showing the changes since the last release.
This release was generated by the CI workflow https://github.com/tenstorrent/tt-metal/actions/runs/12953022780
📦 Uncategorized
- Enable multi-core and fixing bfloat8 for untilize with unpadding
- PR: #16555
- Remove ARCH_NAME from host library code
- PR: #16616
- Support subcoregrids in concat_heads
- PR: #16223
- Build wheels in ttnn unit tests workflow because the tests need it and we forgot to put it in
- PR: #16605
- #16590: profiler trace detection fix
- PR: #16591
- #16503: Optimize CoreRangeSets for CBs and semaphores
- PR: #16549
- Revert "#16621: Add barriers at end of cq_dispatch_slave.cpp"
- PR: #16645
- Fix nightly stable diffusion tests
- PR: #16629
- #0: Used github team for conv files
- PR: #16563
- Sweeps: fixed abs, added acos and acosh sharded and non sharded
- PR: #16381
- fix reduce scatter multi-link support bug
- PR: #16636
- support i/p tensors of all dimensions/rank for prod operation
- PR: #16301
- Create Infrastructure to exactly calculate L1 Memory Usage for Conv2D #15088
- PR: #15455
- #12253: Implement Batch norm operation for inference mode
- PR: #16432
- Port all experimental ops to compute_output_specs
- PR: #16595
- #16443: Add a programming example of vecadd_multi_core and gtest
- PR: #16446
- Enable to/from torch tests for 0D/1D tensors
- PR: #16653
- Port all data movements ops to compute_output_specs
- PR: #16652
- #15246: Add sweep tests for addcdiv, addcmul, rdiv, rsub, ceil
- PR: #15998
- Fix build break
- PR: #16656
- Logical sharding for input tensor and halo output
- PR: #16517
- #16495: reduce grid for falcon7b mlp matmul
- PR: #16569
- Stress NOC mcast test
- PR: #16639
- [skip ci] Update subdevice doc
- PR: #16669
- Read from and write to partial buffer regions for interleaved buffers where offset and size of specified buffer region are divisible by buffer page size
- PR: #16102
- Fix resnet large on GS
- PR: #16665
- Fix Pre-allgather Layernorm bad PCC when use 1D reduction
- PR: #16622
- #16353: skip no volume tensors
- PR: #16619
- Create README.md
- PR: #16675
- Update README.md
- PR: #16676
- #16367: Added support to enable dram and l1 memory collection without saving to disk
- PR: #16368
- Update .clang-format-ignore
- PR: #16681
- Tweak BH csrrs init code
- PR: #16682
- #0: Clean up confusing refs to Greyskull from ttnn.copy error messages.
- PR: #16647
- Update perf and latest features for llm models (Jan 13)
- PR: #16677
- Update README.md
- PR: #16702
- #16657: Fix to_layout conversion into row major for 1D tensors
- PR: #16684
- Tilize with val padding results in L1 cache OOM
- PR: #16633
- #0: Fixes from commit ae61802
- PR: #16686
- #0: Skip build-docker-image during post-commit code-analysis since the docker image is already built in a previous job
- PR: #16703
- Generate test executables per architecture
- PR: #16594
- #16587: Update UMD submodule commit for P150 compatibility
- PR: #16709
- Replace some instances of Tensor::get_shape with get_logical_shape
- PR: #16655
- Update METALIUM_GUIDE.md
- PR: #16602
- #16621: Add barriers at end of cq_dispatch_slave.cpp on IERISC
- PR: #16666
- Finish porting OPs to compute_output_specs
- PR: #16695
- ScopedGraphCapture
- PR: #15774
- #15756 Pull in BH LLK fix for maxpool hang
- PR: #16663
- #15246: Add sweep tests for logical_and, logical_or, logical_xor
- PR: #16132
- #0: (MINOR) Bump to v0.55.0
- PR: #16714
- #11512: Add sweeps for eltwise sharded ops 3
- PR: #16307
- Add sweeps for unary, unary_sharded and binary_sharded versions of ops: fmod, remainder, maximum, minimum.
- PR: #15911
- Don't leak tt_cluster.hpp through kernel_types.hpp
- PR: #16691
- #6983: Re-enable skipped TT-NN unit test
- PR: #16642
- #15450: Remove default values from circular buffer parameters in LLK compute APIs
- PR: #16389
- update build flag on programming examples docs
- PR: #16635
- Fix for P100 board type
- PR: #16718
- Sever TT-Train's dependency on TT-Metalium's tests
- PR: #16685
- [TT-Train] Update generate of LLM
- PR: #16723
- [TT-Train] Add bias=false in LinearLayer
- PR: #16707
- TT-Fabric Bringup Initial Check-in
- PR: #16343
- #0: Sanitize writes to mailbox on ethernet cores.
- PR: #16574
- Add Llama11B-N300 and Llama70B-TG (TP=32) to LLM table in README.md
- PR: #16724
- [skip ci] Update llms.md
- PR: #16737
- Update test_slice.py
- PR: #16734
- #16625: Refactor tracking of sub-device managers from Device to a new class
- PR: #16683
- Update code-analysis.yaml
- PR: #16738
- [skip ci] Update llms.md
- PR: #16745
- remove references to LFS
- PR: #16722
- Fixes for conversion to row major for 0D and 0-volume tensors
- PR: #16736
- #0: Disable BH tools test at workflow level
- PR: #16749
- Removing some usages of LegacyShape, improve Tensor::to_string
- PR: #16711
- [skip ci] Fix lint on a doc
- PR: #16751
- #0: API Unification for Device and MeshDevice
- PR: #16570
- Port ttnn::random and uniform from LegacyShape to SimpleShape
- PR: #16744
- #16379: make softmax call moreh_softmax if rank above 4
- PR: #16735
- #7126: remove skip for test_sd_matmul test
- PR: #16729
- #0: Make device an optional parameter in the tensor distribution API
- PR: #16746
- Added build-wheels to fast-dispatch-build-and-unit-tests-wrapper.yaml
- PR: #16638
- Adding CCL Async test cases to TG nightly and bug fix
- PR: #16700
- #11119: Move op_profiler.hpp under the ttnn folder
- PR: #11167
- #15979: Switch to google benchmark for pgm dispatch tests
- PR: #16547
- [tt-train] Add weight tying option for NanoGPT demo
- PR: #16768
- #0: Fix build of test_pgm_dispatch
- PR: #16773
- [tt-train] Update serialization of tensor for DDP
- PR: #16778
- #0: Fix failing TG regression tests
- PR: #16776
- [skip ci] Update llms.md
- PR: #16775
- Add tiled interleaved permute for when width dimension doesn't move (row-major tiled invariant)
- PR: #16671
- Add Fabric Router Config to Hal
- PR: #16761
- [skip ci] Update llms.md
- PR: #16791
- Reflect ARCH_NAME Changes in CI Workflows
- PR: #16706
- [skip ci] Update llms.md
- PR: #16792
- #0: Migrate pytensor to use from_vector Tensor creation APIs
- PR: #16767
- Afuller/metalium api reorg
- PR: #16578
- Ngrujic/sweep tests 3
- PR: #16316
- #0: Enable nlp create heads tests on BH
- PR: #16777
- Fix to_layout shard bug
- PR: #16754
- Fix broken link to host API
- PR: #16799
- Add noc flag to test stress noc mcast
- PR: #16772
- Set codeowners for transformer ttnn ops
- PR: #16803
- #15450: Remove default value for ocb argument in LLK compute API
- PR: #16376
- Linking tensor.reshape to ttnn.reshape (see the sketch after this list)
- PR: #16377
- #16646: Fix dangling reference in sharded tensor args
- PR: #16782
- #15450: Remove default values from circular buffer parameters in LLK compute APIs: Transpose and Reduce
- PR: #16427
- Add new python api to get architecture name (see the sketch after this list)
- PR: #16747
- Remove base.hpp
- PR: #16796
- [tt-train] Change weights initialization for GPT-2
- PR: #16815
- [skip ci] Update llms.md
- PR: #16828
- fuse residual add with layernorm
- PR: #16794
- [TT-Train] Add multidevice support to dropout
- PR: #16823
- #16171: Preload kernels before receiving go message
- PR: #16680
- #15450: Remove default values from circular buffer parameters in LLK compute APIs: Test Kernels
- PR: #16613
- #16366: Changed kernel config to HiFi4 for 32F matmul
- PR: #16743
- Add nightly APC run in debug mode
- PR: #16831
- [skip ci] Update llms.md
- PR: #16835
- [skip ci] Update llms.md
- PR: #16839
- Remove some ARCH_NAME ENV usage at runtime
- PR: #16825
- Move out tensor storage into a separate .hpp/.cpp
- PR: #16832
- #16460: Add more helpful error message when tt-topology needs to be run
- PR: #16783
- Make creation functions use SimpleShape, expose SimpleShape to Python
- PR: #16826
- #16242: Initial implementation of MeshBuffer
- PR: #16327
- Enable use-override check
- PR: #16842
- Privatize Taskflow
- PR: #16838
- Fix test_new_all_gather.py regressions caused by API unification between Device/MeshDevice
- PR: #16836
- Fix CB allocation warnings from ttnn.reshard
- PR: #16795
- Optimize upsample for bilinear mode
- PR: #16487
- Remove Shape usage from MultiDeviceStorage
- PR: #16841
- Remove redundant bank offset from destination address in ttnn.reshard
- PR: #16800
- Add option to raise error on failed local/global tensor comparison
- PR: #16585
- Padded Shards for Concat Support
- PR: #16765
- #0: Add support for tracing some sub-devices while others are still running programs
- PR: #16810
- #16769: bring up all reduce async as a composite op and added llama shape ccl test sweep
- PR: #16784
- #0: Lower Size to metalium as Shape2D
- PR: #16814
- #15976: Ensure reports insert all devices into the devices table
- PR: #16834
- Modify UNet Shallow to return output in CHW channel ordering
- PR: #16742
- #16758: Optimize usage and implementation of encode/decode tensor data
- PR: #16759
- Device to Device profiler sync
- PR: #16543
- Templating and Queue Size Adjustments for Packet Queue
- PR: #16732
- Refactor Superset model benchmarking tools to use Pydantic classes and save one json
- PR: #16790
- #16078: Fix back-to-back calls of ttnn.close_device()
- PR: #16840
- #16434: DPRINT to read buffer once
- PR: #16586
- Bring Taskflow from CPM
- PR: #16843
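For the partial interleaved-buffer read/write entry above (PR #16102), the stated constraint is that both the region offset and the region size must be whole multiples of the buffer page size. A minimal sketch of that divisibility check; the 2 KiB page size is illustrative, not a requirement of the API:

```python
PAGE_SIZE = 2048  # illustrative; the real page size comes from the buffer config

def region_is_page_aligned(offset: int, size: int, page_size: int = PAGE_SIZE) -> bool:
    """A partial region of an interleaved buffer is addressable only when
    both its offset and its size are divisible by the page size."""
    return offset % page_size == 0 and size % page_size == 0

assert region_is_page_aligned(4096, 8192)        # pages 2..5: OK
assert not region_is_page_aligned(1000, 2048)    # offset falls mid-page
```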
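For the #16657 to_layout entry above: the fix concerns 1D tensors converting back to row-major layout. A hedged sketch of the round trip it repairs, assuming host-side conversion with default dtypes:

```python
import torch
import ttnn

host = torch.rand(32)                                      # 1D tensor
tiled = ttnn.from_torch(host, layout=ttnn.TILE_LAYOUT)     # tilized on host
row_major = ttnn.to_layout(tiled, ttnn.ROW_MAJOR_LAYOUT)   # previously broke for rank-1 inputs
assert torch.allclose(ttnn.to_torch(row_major), host)
```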
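For the #16379 softmax entry above: ttnn.softmax now falls back to moreh_softmax when the input rank exceeds 4 rather than failing. A sketch with a rank-5 input, assuming standard device plumbing:

```python
import torch
import ttnn

device = ttnn.open_device(device_id=0)
x = ttnn.from_torch(
    torch.rand(2, 2, 2, 32, 32),      # rank 5: above the previous limit of 4
    dtype=ttnn.bfloat16,
    layout=ttnn.TILE_LAYOUT,
    device=device,
)
y = ttnn.softmax(x, dim=-1)           # internally routed to moreh_softmax for rank > 4
ttnn.close_device(device)
```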
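For the "Linking tensor.reshape to ttnn.reshape" entry above: the Tensor method and the free function now share one implementation. A sketch of the equivalence (host tensor; shapes illustrative, and the exact accepted argument forms may vary):

```python
import torch
import ttnn

t = ttnn.from_torch(torch.rand(1, 1, 32, 32))
a = ttnn.reshape(t, (1, 1, 1024, 1))   # free-function form
b = t.reshape((1, 1, 1024, 1))         # method form, now linked to ttnn.reshape
assert a.shape == b.shape
```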
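For the "new python api to get architecture name" entry above (PR #16747): assuming the helper is exposed as ttnn.get_arch_name(), which is our reading of the PR rather than a documented signature:

```python
import ttnn

# Assumed entry point from PR #16747; expected to return a string such as
# "wormhole_b0" for the detected device architecture.
print(ttnn.get_arch_name())
```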
v0.55.0-rc14
Note
If you are installing from a release, please refer to the README, INSTALLATION instructions, and any other documentation packaged with the release, not on the main branch. There may be differences between the latest main and the previous release.
The changelog will now follow, showing the changes from last release.
This release was generated by the CI workflow https://github.com/tenstorrent/tt-metal/actions/runs/12941480379
📦 Uncategorized
- Remove halo from shard spec
- PR: #15900
- Enable multi-core and fixing bfloat8 for untilize with unpadding
- PR: #16555
- Composite binary sweeps: gcd and lcm
- PR: #16423
- Remove ARCH_NAME from host library code
- PR: #16616
- [tt-train] Add nanogpt ddp mode
- PR: #16614
- #16312: Fix full op to query physical shape for buffer volume
- PR: #16562
- #16366: Changed default kernel_config_val for 32bit matmul
- PR: #16567
- #16621: Add barriers at end of cq_dispatch_slave.cpp
- PR: #16624
- Build wheels in models unit tests workflow
- PR: #16615
- Mo/10234 eth dispatch profiling
- PR: #15609
- Support subcoregrids in concat_heads
- PR: #16223
- Build wheels in ttnn unit tests workflow because the tests need it and we forgot to put it in
- PR: #16605
- #16590: profiler trace detection fix
- PR: #16591
- #16503: Optimize CoreRangeSets for CBs and semaphores
- PR: #16549
- Revert "#16621: Add barriers at end of cq_dispatch_slave.cpp"
- PR: #16645
- Fix nightly stable diffusion tests
- PR: #16629
- #0: Used github team for conv files
- PR: #16563
- Sweeps: fixed abs, added acos and acosh sharded and non sharded
- PR: #16381
- fix reduce scatter multi-link support bug
- PR: #16636
- support input tensors of all dimensions/ranks for the prod operation
- PR: #16301
- Create Infrastructure to exactly calculate L1 Memory Usage for Conv2D #15088
- PR: #15455
- #12253: Implement Batch norm operation for inference mode (see the sketch after this list)
- PR: #16432
- Port all experimental ops to compute_output_specs
- PR: #16595
- #16443: Add a programming example of vecadd_multi_core and gtest
- PR: #16446
- Enable to/from torch tests for 0D/1D tensors (see the sketch after this list)
- PR: #16653
- Port all data movements ops to compute_output_specs
- PR: #16652
- #15246: Add sweep tests for addcdiv, addcmul, rdiv, rsub, ceil
- PR: #15998
- Fix build break
- PR: #16656
- Logical sharding for input tensor and halo output
- PR: #16517
- #16495: reduce grid for falcon7b mlp matmul
- PR: #16569
- Stress NOC mcast test
- PR: #16639
- [skip ci] Update subdevice doc
- PR: #16669
- Read from and write to partial buffer regions for interleaved buffers where the offset and size of the specified buffer region are divisible by the buffer page size
- PR: #16102
- Fix resnet large on GS
- PR: #16665
- Fix Pre-allgather Layernorm bad PCC when using 1D reduction
- PR: #16622
- #16353: skip no volume tensors
- PR: #16619
- Create README.md
- PR: #16675
- Update README.md
- PR: #16676
- #16367: Added support to enable dram and l1 memory collection without saving to disk
- PR: #16368
- Update .clang-format-ignore
- PR: #16681
- Tweak BH csrrs init code
- PR: #16682
- #0: Clean up confusing refs to Greyskull from ttnn.copy error messages.
- PR: #16647
- Update perf and latest features for llm models (Jan 13)
- PR: #16677
- Update README.md
- PR: #16702
- #16657: Fix to_layout conversion into row major for 1D tensors
- PR: #16684
- Tilize with val padding results in L1 cache OOM
- PR: #16633
- #0: Fixes from commit ae61802
- PR: #16686
- #0: Skip build-docker-image during post-commit code-analysis since the docker image is already built in a previous job
- PR: #16703
- Generate test executables per architecture
- PR: #16594
- #16587: Update UMD submodule commit for P150 compatibility
- PR: #16709
- Replace some instances of Tensor::get_shape with get_logical_shape
- PR: #16655
- Update METALIUM_GUIDE.md
- PR: #16602
- #16621: Add barriers at end of cq_dispatch_slave.cpp on IERISC
- PR: #16666
- Finish porting OPs to compute_output_specs
- PR: #16695
- ScopedGraphCapture
- PR: #15774
- #15756 Pull in BH LLK fix for maxpool hang
- PR: #16663
- #15246: Add sweep tests for logical_and, logical_or, logical_xor
- PR: #16132
- #0: (MINOR) Bump to v0.55.0
- PR: #16714
- #11512: Add sweeps for eltwise sharded ops 3
- PR: #16307
- Add sweeps for unary, unary_sharded and binary_sharded versions of ops: fmod, remainder, maximum, minimum.
- PR: #15911
- Don't leak tt_cluster.hpp through kernel_types.hpp
- PR: #16691
- #6983: Re-enable skipped TT-NN unit test
- PR: #16642
- #15450: Remove default values from circular buffer parameters in LLK compute APIs
- PR: #16389
- update build flag on programming examples docs
- PR: #16635
- Fix for P100 board type
- PR: #16718
- Sever TT-Train's dependency on TT-Metalium's tests
- PR: #16685
- [TT-Train] Update generate of LLM
- PR: #16723
- [TT-Train] Add bias=false in LinearLayer
- PR: #16707
- TT-Fabric Bringup Initial Check-in
- PR: #16343
- #0: Sanitize writes to mailbox on ethernet cores.
- PR: #16574
- Add Llama11B-N300 and Llama70B-TG (TP=32) to LLM table in README.md
- PR: #16724
- [skip ci] Update llms.md
- PR: #16737
- Update test_slice.py
- PR: #16734
- #16625: Refactor tracking of sub-device managers from Device to a new class
- PR: #16683
- Update code-analysis.yaml
- PR: #16738
- [skip ci] Update llms.md
- PR: #16745
- remove references to LFS
- PR: #16722
- Fixes for conversion to row major for 0D and 0-volume tensors
- PR: #16736
- #0: Disable BH tools test at workflow level
- PR: #16749
- Removing some usages of LegacyShape, improve Tensor::to_string
- PR: #16711
- [skip ci] Fix lint on a doc
- PR: #16751
- #0: API Unification for Device and MeshDevice
- PR: #16570
- Port ttnn::random and uniform from LegacyShape to SimpleShape
- PR: #16744
- #16379: make softmax call moreh_softmax if rank above 4
- PR: #16735
- #7126: remove skip for test_sd_matmul test
- PR: #16729
- #0: Make device an optional parameter in the tensor distribution API
- PR: #16746
- Added build-wheels to fast-dispatch-build-and-unit-tests-wrapper.yaml
- PR: #16638
- Adding CCL Async test cases to TG nightly and bug fix
- PR: #16700
- #11119: Move op_profiler.hpp under the ttnn folder
- PR: #11167
- #15979: Switch to google benchmark for pgm dispatch tests
- PR: #16547
- [tt-train] Add weight tying option for NanoGPT demo
- PR: #16768
- #0: Fix build of test_pgm_dispatch
- PR: #16773
- [tt-train] Update serialization of tensor for DDP
- PR: #16778
- #0: Fix failing TG regression tests
- PR: #16776
- [skip ci] Update llms.md
- PR: #16775
- Add tiled interleaved permute for when width dimension doesn't move (row-major tiled invariant)
- PR: #16671
- Add Fabric Router Config to Hal
- PR: #16761
- [skip ci] Update llms.md
- PR: #16791
- Reflect ARCH_NAME Changes in CI Workflows
- PR: #16706
- [skip ci] Update llms.md
- PR: #16792
- #0: Migrate pytensor to use from_vector Tensor creation APIs
- PR: #16767
- Afuller/metalium api reorg
- PR: #16578
- Ngrujic/sweep tests 3
- PR: #16316
- #0: Enable nlp create heads tests on BH
- PR: #16777
- Fix to_layout shard bug
- PR: #16754
- Fix broken link to host API
- PR: #16799
- Add noc flag to test stress noc mcast
- PR: #16772
- Set codeowners for transformer ttnn ops
- PR: #16803
- #15450: Remove default value for ocb argument in LLK compute API
- PR: #16376
- Linking tensor.reshape to ttnn.reshape
- PR: #16377
- #16646: Fix dangling reference in sharded tensor args
- PR: #16782
- #15450: Remove default values from circular buffer parameters in LLK compute APIs: Transpose and Reduce
- PR: #16427
- Add new python api to get architecture name
- PR: #16747
- Remove base.hpp
- PR: #16796
- [tt-train] Change weights initialization for GPT-2
- PR: #16815
- [skip ci] Update llms.md
- PR: #16828
- fuse residual add with layernorm
- PR: #16794
- [TT-Train] Add multidevice support to dropout
- PR: #16823
- #16171: Preload kernels before receiving go message
- PR: #16680
- #15450: Remove default values from circular buffer parameters in LLK compute APIs: Test Kernels
- PR: #16613
- #16366: Changed kernel config to HiFi4 for 32F matmul
- PR: #16743
- Add nightly APC run in debug mode
- PR: #16831
- [skip ci] Update llms.md
- PR: #16835
- [skip ci] Update llms.md
- PR: #16839
- Remove some ARCH_NAME ENV usage at runtime
- PR: #16825
- Move out tensor storage into a separate .hpp/.cpp
- PR: #16832
- #16460: Add more helpful error message when tt-topology needs to be run
- PR: #16783
- Make creation functions use SimpleShape, expose SimpleShape to Python
- PR: #16826
- #16242: Initial implementation of MeshBuffer
- PR: #16327
- Enable use-override check
- PR: #16842
- Privatize Taskflow
- PR: #16838
- Fix test_new_all_gather.py regressions caused by API unification between Device/MeshDevice
- PR: #16836
- Fix CB allocation warnings from ttnn.reshard
- PR: #16795
- Optimize upsample for bilinear mode
- PR: #16487
- Remove Shape usage from MultiDeviceStorage
- PR: #16841
- Remove redundant bank offset from destination address in ttnn.reshard
- PR: #16800
- Add option to raise error on failed local/global tensor comparison
- PR: #16585
- Padded Shards for Concat Support
- PR: #16765
- #0: Add support for tracing some sub-devices while others are still running programs
- PR: #16810
- #16769: bring up all reduce async as a composite op and added llama shape ccl test sweep
- PR: #16784
- #0: Lower Size to metalium as Shape2D
- PR: #16814
- #15976: Ensure reports insert all devices into the devices table
- PR: #16834
- Modify UNet Shallow to return output in CHW channel ordering
- PR: #16742
- #16758: Optimize usage and implementation of encode/decode tensor data
- PR: #16759
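For the #12253 batch-norm entry above: inference mode consumes precomputed running statistics instead of batch statistics. A sketch; the keyword names below (running_mean, running_var, eps) follow the usual batch-norm convention and are an assumption about the ttnn.batch_norm signature, not a confirmed API:

```python
import torch
import ttnn

device = ttnn.open_device(device_id=0)

def dev(t: torch.Tensor) -> "ttnn.Tensor":
    # Helper for this sketch: move a host tensor to the device in tile layout.
    return ttnn.from_torch(t, dtype=ttnn.bfloat16, layout=ttnn.TILE_LAYOUT, device=device)

x = dev(torch.rand(1, 4, 32, 32))     # NCHW activations
y = ttnn.batch_norm(                  # inference path: uses running stats, not batch stats
    x,
    running_mean=dev(torch.zeros(1, 4, 1, 1)),
    running_var=dev(torch.ones(1, 4, 1, 1)),
    eps=1e-5,
)
ttnn.close_device(device)
```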
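For the "Enable to/from torch tests for 0D/1D tensors" entry above, a round-trip sketch of the now-covered low-rank cases (host-side only, default layout assumed):

```python
import torch
import ttnn

for host in (torch.tensor(3.0), torch.rand(7)):   # 0D scalar, 1D vector
    back = ttnn.to_torch(ttnn.from_torch(host))
    assert back.shape == host.shape               # rank preserved through the round trip
    assert torch.allclose(back, host)
```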
v0.55.0-rc13
Note
If you are installing from a release, please refer to the README, INSTALLATION instructions, and any other documentation packaged with the release, not on the main branch. There may be differences between the latest main and the previous release.
The changelog will now follow, showing the changes from last release.
This release was generated by the CI workflow https://github.com/tenstorrent/tt-metal/actions/runs/12925964343
📦 Uncategorized
- fix multi-iter in reduce scatter and adopt runtime arg overrider infra
- PR: #16531
- [tt-train] Add linear regression ddp example
- PR: #16245
- Remove eth_l1_address_params.h from device.cpp
- PR: #16538
- Sharded sweeps: exp, exp2, expm1, erfc, erfinv, round, log
- PR: #16323
- Fix ttnn.concat golden function when groups > 1
- PR: #16556
- #16171: Assert that NCRISC NOC is idle at kernel end.
- PR: #16471
- Remove eth_l1_address_params.h from tt_cluster.cpp and watcher
- PR: #16568
- Remove dev_mem_map.h usage from watcher_device_reader.cpp
- PR: #16572
- #14616: Remove ARCH_* ifdefs from tt_cluster.cpp
- PR: #13354
- Add support for DRAM Prefetcher op
- PR: #16244
- Resolve reduce-scatter-async sharded tensor correctness bug & hang
- PR: #16548
- disable flaky t3k test
- PR: #16583
- Remove "noc_parameters.h" from device.cpp
- PR: #16582
- Remove restriction of input_nsticks_per_core % w == 0
- PR: #15205
- Add tt-forge sweep for conv2d.
- PR: #16178
- Remove noc header file inclusion from watcher_device_reader.cpp
- PR: #16589
- Fix ttnn.from_torch for 0D/1D tensors with tile layout
- PR: #16484
- Short list failing conv2d for forge sweeps
- PR: #16597
- Remove halo from shard spec
- PR: #15900
- Address issues of var & std
- PR: #16545
- #16492: Remove sub_device_ids apis from various read/write functions throughout the stack
- PR: #16565
- #6344: Update RoBERTa QA demo
- PR: #8896
- Remove noc_parameters.h inclusion from ttnn
- PR: #16593
- Resubmit #16339: parameterize dispatch_constants
- PR: #16478
- #11512: Refactor bitwise sweeps, add bitwise sharded sweeps, modify t…
- PR: #15704
- Update CODEOWNERS
- PR: #16604
- Enable multi-core and fixing bfloat8 for untilize with unpadding
- PR: #16555
- Set up targeting idle eth cores on BH - won't enable because of hang debug
- PR: #14817
- Reorganize Print Pages Infrastructure
- PR: #16463
- lower fabric erisc datamover eth context switching frequency when workload is running
- PR: #16610
- Composite binary sweeps: gcd and lcm
- PR: #16423
- Remove ARCH_NAME from host library code
- PR: #16616
- [tt-train] Add nanogpt ddp mode
- PR: #16614
- #16312: Fix full op to query physical shape for buffer volume
- PR: #16562
- #16366: Changed default kernel_config_val for 32bit matmul
- PR: #16567
- #16621: Add barriers at end of cq_dispatch_slave.cpp
- PR: #16624
- Build wheels in models unit tests workflow
- PR: #16615
- Mo/10234 eth dispatch profiling
- PR: #15609
- Support subcoregrids in concat_heads
- PR: #16223
- Build wheels in ttnn unit tests workflow because the tests need it and we forgot to put it in
- PR: #16605
- #16590: profiler trace detection fix
- PR: #16591
- #16503: Optimize CoreRangeSets for CBs and semaphores
- PR: #16549
- Revert "#16621: Add barriers at end of cq_dispatch_slave.cpp"
- PR: #16645
- Fix nightly stable diffusion tests
- PR: #16629
- #0: Used github team for conv files
- PR: #16563
- Sweeps: fixed abs, added acos and acosh sharded and non sharded
- PR: #16381
- fix reduce scatter multi-link support bug
- PR: #16636
- support input tensors of all dimensions/ranks for the prod operation
- PR: #16301
- Create Infrastructure to exactly calculate L1 Memory Usage for Conv2D #15088
- PR: #15455
- #12253: Implement Batch norm operation for inference mode
- PR: #16432
- Port all experimental ops to compute_output_specs
- PR: #16595
- #16443: Add a programming example of vecadd_multi_core and gtest
- PR: #16446
- Enable to/from torch tests for 0D/1D tensors
- PR: #16653
- Port all data movements ops to compute_output_specs
- PR: #16652
- #15246: Add sweep tests for addcdiv, addcmul, rdiv, rsub, ceil
- PR: #15998
- Fix build break
- PR: #16656
- Logical sharding for input tensor and halo output
- PR: #16517
- #16495: reduce grid for falcon7b mlp matmul
- PR: #16569
- Stress NOC mcast test
- PR: #16639
- [skip ci] Update subdevice doc
- PR: #16669
- Read from and write to partial buffer regions for interleaved buffers where the offset and size of the specified buffer region are divisible by the buffer page size
- PR: #16102
- Fix resnet large on GS
- PR: #16665
- Fix Pre-allgather Layernorm bad PCC when using 1D reduction
- PR: #16622
- #16353: skip no volume tensors
- PR: #16619
- Create README.md
- PR: #16675
- Update README.md
- PR: #16676
- #16367: Added support to enable dram and l1 memory collection without saving to disk
- PR: #16368
- Update .clang-format-ignore
- PR: #16681
- Tweak BH csrrs init code
- PR: #16682
- #0: Clean up confusing refs to Greyskull from ttnn.copy error messages.
- PR: #16647
- Update perf and latest features for llm models (Jan 13)
- PR: #16677
- Update README.md
- PR: #16702
- #16657: Fix to_layout conversion into row major for 1D tensors
- PR: #16684
- Tilize with val padding results in L1 cache OOM
- PR: #16633
- #0: Fixes from commit ae61802
- PR: #16686
- #0: Skip build-docker-image during post-commit code-analysis since the docker image is already built in a previous job
- PR: #16703
- Generate test executables per architecture
- PR: #16594
- #16587: Update UMD submodule commit for P150 compatibility
- PR: #16709
- Replace some instances of Tensor::get_shape with get_logical_shape
- PR: #16655
- Update METALIUM_GUIDE.md
- PR: #16602
- #16621: Add barriers at end of cq_dispatch_slave.cpp on IERISC
- PR: #16666
- Finish porting OPs to compute_output_specs
- PR: #16695
- ScopedGraphCapture
- PR: #15774
- #15756 Pull in BH LLK fix for maxpool hang
- PR: #16663
- #15246: Add sweep tests for logical_and, logical_or, logical_xor
- PR: #16132
- #0: (MINOR) Bump to v0.55.0
- PR: #16714
- #11512: Add sweeps for eltwise sharded ops 3
- PR: #16307
- Add sweeps for unary, unary_sharded and binary_sharded versions of ops: fmod, remainder, maximum, minimum.
- PR: #15911
- Don't leak tt_cluster.hpp through kernel_types.hpp
- PR: #16691
- #6983: Re-enable skipped TT-NN unit test
- PR: #16642
- #15450: Remove default values from circular buffer parameters in LLK compute APIs
- PR: #16389
- update build flag on programming examples docs
- PR: #16635
- Fix for P100 board type
- PR: #16718
- Sever TT-Train's dependency on TT-Metalium's tests
- PR: #16685
- [TT-Train] Update generate of LLM
- PR: #16723
- [TT-Train] Add bias=false in LinearLayer
- PR: #16707
- TT-Fabric Bringup Initial Check-in
- PR: #16343
- #0: Sanitize writes to mailbox on ethernet cores.
- PR: #16574
- Add Llama11B-N300 and Llama70B-TG (TP=32) to LLM table in README.md
- PR: #16724
- [skip ci] Update llms.md
- PR: #16737
- Update test_slice.py
- PR: #16734
- #16625: Refactor tracking of sub-device managers from Device to a new class
- PR: #16683
- Update code-analysis.yaml
- PR: #16738
- [skip ci] Update llms.md
- PR: #16745
- remove references to LFS
- PR: #16722
- Fixes for conversion to row major for 0D and 0-volume tensors
- PR: #16736
- #0: Disable BH tools test at workflow level
- PR: #16749
- Removing some usages of LegacyShape, improve Tensor::to_string
- PR: #16711
- [skip ci] Fix lint on a doc
- PR: #16751
- #0: API Unification for Device and MeshDevice
- PR: #16570
- Port ttnn::random and uniform from LegacyShape to SimpleShape
- PR: #16744
- #16379: make softmax call moreh_softmax if rank above 4
- PR: #16735
- #7126: remove skip for test_sd_matmul test
- PR: #16729
- #0: Make device an optional parameter in the tensor distribution API
- PR: #16746
- Added build-wheels to fast-dispatch-build-and-unit-tests-wrapper.yaml
- PR: #16638
- Adding CCL Async test cases to TG nightly and bug fix
- PR: #16700
- #11119: Move op_profiler.hpp under the ttnn folder
- PR: #11167
- #15979: Switch to google benchmark for pgm dispatch tests
- PR: #16547
- [tt-train] Add weight tying option for NanoGPT demo
- PR: #16768
- #0: Fix build of test_pgm_dispatch
- PR: #16773
- [tt-train] Update serialization of tensor for DDP
- PR: #16778
- #0: Fix failing TG regression tests
- PR: #16776
- [skip ci] Update llms.md
- PR: #16775
- Add tiled interleaved permute for when width dimension doesn't move (row-major tiled invariant)
- PR: #16671
- Add Fabric Router Config to Hal
- PR: #16761
- [skip ci] Update llms.md
- PR: #16791
- Reflect ARCH_NAME Changes in CI Workflows
- PR: #16706
- [skip ci] Update llms.md
- PR: #16792
- #0: Migrate pytensor to use from_vector Tensor creation APIs
- PR: #16767
- Afuller/metalium api reorg
- PR: #16578
- Ngrujic/sweep tests 3
- PR: #16316
- #0: Enable nlp create heads tests on BH
- PR: #16777
- Fix to_layout shard bug
- PR: #16754
- Fix broken link to host API
- PR: #16799
- Add noc flag to test stress noc mcast
- PR: #16772
- Set codeowners for transformer ttnn ops
- PR: #16803
- #15450: Remove default value for ocb argument in LLK compute API
- PR: #16376
- Linking tensor.reshape to ttnn.reshape
- PR: #16377
- #16646: Fix dangling reference in sharded tensor args
- PR: #16782
- #15450: Remove default values from circular buffer parameters in LLK compute APIs: Transpose and Reduce
- PR: #16427
- Add new python api to get architecture name
- PR: #16747
- Remove base.hpp
- PR: #16796
- [tt-train] Change weights initialization for GPT-2
- PR: #16815
- [skip ci] Update llms.md
- PR: #16828