
Releases: tenstorrent/tt-metal

v0.51.0-rc3

16 Jul 02:20
e1835e2
Pre-release

📦 Uncategorized

  • Migrate Pad Device and All references
  • #0: Multi-CQ support for R-Chip
  • #10028: Remove skip and reduce test case for moreh_groupnorm test
  • #10005: Change input tensor parameter to optional in moreh_sum_backward
  • #10004: Revise bias tensor usage in moreh_linear_backward
  • #9663: support moreh_nll_loss_unreduced
  • #8865: Switch ported ops from tt_lib to ttnn for host dispatch time m…
  • #0: Update README.md grammar for idiomatic description of TT-NN
  • #9767: removed more no longer needed manually specified attributes for reflection
  • Add distributed layernorm kernel documentation
  • #10031: Fix -Werror=return-type error in composite_ops
  • #9492: update matmul path in CODEOWNERS
  • #9450: change silicon fixtures to session scope
  • Uplift UMD to grab support for configuring static TLBs and Hugepage for BH
  • #9441: add all typecasts to unit test
  • #9801: Add cb alignment fix for blackhole that was missed in rebase
  • #9973: Fix addrmod for reduce scalar, port over missing narrow tile c…
  • #10052: Add metal pack untilize test
  • Add ttnn matmul tests to TG unit tests
  • Add ssm_prefix_scan test coverage for N=16
  • Add PyBind to TTNN Slice (Formerly Referred to as Unpad in TT Lib)
  • #8450: Cleanup items pending from PR #9068
  • #10030: fix moreh_nll_loss hang
  • #7736: Remove unused reduce dim & type from reduce_init*
  • #9871: Update backward files
  • #9874: Move Unary Backward ops to TTNN
  • Update op_perf_results
  • #9962: Enable flags for profiler globals in jit build
  • Added prefill mode for mamba modules
  • Increase timeout for Mamba full model tests
  • Support multiple user indices in paged_update_cache
  • #10085: Make ttnn::Buffer deallocate execute without querying a potentially destroyed buffer instance
  • Pack runtime arguments across brisc/ncrisc/trisc
  • Llama Demo Refactor
  • #5424: Delegated sfpu reciprocal calls to wh_b0 submodule functions
  • #0: Move t3k demo tests to perf pipeline because it requires perf governor
  • #5424: Delegated sfpu reciprocal calls to gs submodule functions
  • Add trace and multi cq implementations/tests for WH Resnet
  • #0: (MINOR) Update to v0.51.0
  • #0: bump python3.8 venv versioning since apt repos updated
  • #10099: fix semaphores init for packet mux/demux
  • #10112: Drop hard pin for installation instructions for python3.8-venv in dependencies
  • Revert "#5424: Delegated sfpu reciprocal calls to wh_b0 submodule functions"
  • #0: Remove stray assert forcing single CQ on R-Chips
  • #9490: Replace tt_dnn op's usage in C++ with TTNN
  • #9874: Merge Next set of unary backward ops to TTNN
  • #10073: Move unary backward ops to TTNN
  • Unary backward op migration
  • #10087: update tt-umd submodule
  • #9959: Migrated pad to ttnn sweeps
  • Adding distributed layernorm to llama prefill
  • Add pytest xdist multiprocess to single-chip demo tests
  • Revert "Revert "#5424: Delegated sfpu reciprocal calls to wh_b0 submodule functions""
  • #10071 : Move second set of Unary Backward ops to TTNN
  • #10083: added tt::stl::json::to_json and tt::stl::json::from_json
  • #10086: Add logic for splitting cmds that exceed the subcmd limit into separate cmds for semaphores
  • #5424: Delegated sqrt api call to thirdparty gs submodule sqrt call
  • #5424: Delegated sfpu api call to sqrt for wh to submodule sqrt call
  • #0: Fix galaxy eth dispatch init to only init the specified number of cqs (galaxy only supports single cq)
  • Fix undefined memory bug in ssm_prefix_scan
  • removed weight copies from DRAM to L1
  • fix syntax issues with test dispatch workflow
  • #9609: Reorganize libs into ttnn
  • #10165: Fix build error with g++-12
  • Adding support for dram sharded matmuls
  • #10076: Migrate Unary bw ops and replace tt_eager ops with ttnn ops
  • #10072: Move next set of Unary Backward ops to TTNN
  • #9082: ping individual falcon member since slack user group is not wo…
  • #8681: Add Floor, Trunc blocker ops
  • #9419: use memcpy to avoid mem misalignment
  • #10079: Move Unary Backward ops to TTNN
  • Migrate unary ops to TTNN
  • #9945: Skip SD for nightly FD, device perf tests, and single-card demos as it hangs on di/dt
  • #10045: use struct for matmul parameter passing and update doc string
  • #10045: remove use_1d_systolic_array from ttnn matmul
  • Ngrujic/profiling
  • #9319: Upload benchmark data for t3k falcon 7b tests
  • Aliu/build opt
  • #10107: Fix hangs w/ launch_msg size >32bytes
  • [CCL] Making buffer size dynamic to input slice
  • #7617: remove failing experimental model test
  • #7618: delete failing experimental model test
  • #0: fix prefill CI for mamba
  • Move Mamba tests to wh_b0_only_eth pipeline
  • #9747: Implement ttnn::tilize in C++
  • Aliu/prevent aho tanking
  • #10045: fix up missed parameter change in mamba block model
  • #9490: Added ttnn support for unary ops py file
  • #10101: [Blackhole Bringup] Revert Zeroacc to legacy behaviour
  • Update README.md
  • #0: Fix imports after tt_lib change
  • #10226: [Blackhole Bringup] Add new sfpu files
  • Suppress g++-12 build errors with -Wno flags
  • #0: Fix BH regression caused by unaligned L1_UNRESERVED_BASE
  • #10077: Migrate Unary comparison backward ops to TTNN with Overloading
  • #10175: Remove std::function and restructure ternary_bw
  • Falcon40b attn mask optimization
  • #10074: Move Unary backward ops to TTNN
  • Replace all TT Lib Unpad with TTNN Slice (see the sketch after this list)
  • #10082: Migrate unary bw ops to TTNN and remove std::function
  • #9715: Use build artifacts for profiler tests
  • #9021: adding resnet api into ci.
  • Update README.md
  • Move pad_on_host/unpad_on_host to host function in TTNN
  • #9874: Move polygamma_bw to TTNN
  • #5337: increase t3k frequent test timeout
  • Update falcon40b readme
  • #0: add layernorm rmsnorm pybind, move to ttnn
  • #0: Re-enable read cache in llama_model_optimized.
  • Update Mistral/Mixtral README files
  • #0: Update LLama2/3 readme with demo details
  • #0: resnet perf fix
  • Update Mamba README.md
  • OPT convs in RN50 to get better device perf
  • Increase timeout for N300 WH-only model pipeline
  • Prefill+Decode Demo Functional Implementation
  • [Falcon7b] Add wormhole demo perf mode and output verification tests
  • Update Falcon7/40b READMEs with details on model functionality and perf-mode
  • bump python 3.8 venv package version
  • Git bisect workflow on CI runners
  • #9613: scaffolding for weekly scheduled t3k perplexity tests
  • fix syntax issue with bisect script
  • #10231: Clean up t3k runs-on tags to minimum
  • #9490: Remove tt_eager unary ops and bindings
  • only build for arch that a dispatched workflow is running for
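
Several entries in this list migrate the old tt_lib unpad op to ttnn.slice (e.g. "Replace all TT Lib Unpad with TTNN Slice", flagged above). Below is a minimal sketch of the Python-side call; the shapes are illustrative, and the exact signature (in particular whether end coordinates are inclusive or exclusive) is an assumption to verify against the pybind added in this release.

```python
import torch
import ttnn

# Illustrative sketch only: slice a 1x1x64x64 tensor down to 1x1x32x32.
# Assumes ttnn.slice takes per-dim start (inclusive) and end (exclusive)
# coordinates; the old tt_lib unpad used inclusive ends, so verify against
# the pybind for this release.
device = ttnn.open_device(device_id=0)

x = ttnn.from_torch(
    torch.randn(1, 1, 64, 64),
    dtype=ttnn.bfloat16,
    layout=ttnn.TILE_LAYOUT,
    device=device,
)
y = ttnn.slice(x, (0, 0, 0, 0), (1, 1, 32, 32))  # assumed signature
print(ttnn.to_torch(y).shape)  # expected: torch.Size([1, 1, 32, 32])

ttnn.close_device(device)
```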

v0.51.0-rc2

15 Jul 02:19
Pre-release


v0.51.0-rc1

11 Jul 02:01
07aacde
Pre-release


v0.50.0

10 Jul 22:04
f7c10a2

📦 Uncategorized

  • Fix issue with Mamba SSM A weight preprocessing
  • Make build key unique for mmio and remote devices with same harvest mask
  • #5337: Removed eth_dispatch yaml flag from mistral tests
  • New workflow for custom test dispatch on CI runners
  • #9312: Add single-header boost-ext/reflect library as dependency
  • Opt LayerNorm/RMSNorm with 2D reduce
  • Revert "#8630: support uint8 data type"
  • #0: Fix codeowners for metal bert
  • Revert "Revert "#8630: support uint8 data type""
  • #9642: fix matmul2d in1 sharded with batch>1
  • #0: add tile layout support for GN
  • FD2 packed binary commands
  • #9082: t3k demo with slack notifications for owners. split jobs
  • Rtawfik/issue 9142
  • #9688: Remove redundant left shift in DEBUG_SANITIZE_NOC_READ_TRANSACTION_FROM_STATE
  • #9500: Update eth_interface include in tt_cluster to not be hardcoded for WH
  • #9578: Add WITH_PYTHON_BINDINGS option to allow build w/o python
  • #9587: Update CB and worker Go signals to respect max sub cmd limit introduced by dispatch packed write local copy change
  • Add support for bfloat4 weights in Mamba
  • Use in-place binary operations in Mamba block
  • #5337: Relaxed Mistral expected compilation time in CI by 1 sec
  • Mo/9406 profiler build flags
  • Add support for single col/row/core output grid for matmul 2D
  • #9725: Set release candidate releases on GitHub to pre-release, not draft, to enable downstream users
  • add tagged docker image with releases
  • Rtawfik/issue 9164
  • #5562: resolve reduce scatter issues (nd hang and correctness)
  • Create benchmarking tools for saving run/measurement data (with Falcon7b example) and model-demo utilities for verifying tokens/perf
  • #0: Fix bug with var name in single-chip falcon7b demo tests
  • #9735: fix issues with including reflect library
  • #9527: Remove usage of bcast where multiply is used
  • Mchiou/9082 slack notification owners
  • #9681: set name attribute for ttnn operations when fast runtime m…
  • #9553: Add prefix scan op for Mamba prefill
  • #9628: Merge Binary backward ops from tt_eager to TTNN
  • Namhyeong kim/support fp32 dest acc in moreh adam
  • #0: Update t3k workflow timeouts (except freq pipeline)
  • Temporary update Mixtral perf times to pass CI
  • #9479: fix cpu core worker bug
  • #4858: add typecast fp32 <-> int32
  • #0: ViT demo fix
  • #9389: Add support for integer type in sum operation
  • Transfer llama2/3 from experimental to demo folder.
  • #9657: add topk multicore to support larger dimension sizes
  • #4858: add typecast bfp8_b
  • #9082: t3k model perf split tests with slack notifications, disabled cnn
  • #0: Add ttnn/cpp to packages to enable using ttnn kernels in tt_eager ops
  • #9741: Set stricter pytest timeouts
  • #9492: Change models matmul usage to ttnn
  • #9778: test prefetcher hanging with changes to test
  • #9490: TTNN eltwise/unary migration
  • Update timeout for falcon40b t3k demo test
  • #0: Remove extra t3k falcon40b matrix test group
  • #9044: Move dispatch core x y to be part of launch msg
  • Modify rot mat each iteration to avoid allocating 10k tensors upfront
  • Optimize bcast sharded op
  • Start using reflect library
  • #0: Properly delete source folders for wheel testing
  • #9479: Update Mixtral perf estimates
  • #0: Added github community issue workflow
  • #8729: Pytest multiprocess reset infrastructure
  • Enable switching between 1 and 2 cqs in the same process
  • Fixed failing tests for SD Conv tests for WH using new conv
  • #0: Switch org-membership check to an authenticated call
  • #0: Decrease num loops in trace stress tests
  • #9628: Support optional return tensor
  • #0: Use CV to wait for cq_reader in production mode. Remove enqueue_record_event for NB calls
  • #9628: Merge second set of binary backward op from tt_eager to TTNN
  • #0: Bump bert compile time threshold since it's been intermittently failing on ci
  • Mchiou/9792 t3k runner management
  • #0: Bump up Bert inference time due to instability on ci
  • #8865: For host dispatch time measuring, increase failing reference t…
  • #9484: Add output_tensor queue_id to dependency ops
  • Adding the new op: Flash Decode!
  • #0: Add missing permissions to issue notification job
  • #9275: Fix Falcon7b demo failing to run by default on a Grayskull e75
  • #9801: Account for 64B BH PCIe alignment in cq cmd sizing
  • #0: Make prefetcher early exit after fetching/reading exec_buf
  • #8683: Add Unary bitwise AND, OR
  • Ngrujic/profiling
  • #9628: Merge third set of binary backward op from tt_eager to TTNN
  • #4858: add typecast uint32
  • Migrate Pad Host Code, Bindings, C++ Usages from TT Eager to TTNN
  • Support longer sequence lengths in ssm_prefix_scan
  • #9709: Add optional transpose_a and transpose_b to ttnn matmul and linear (see the sketch after this list)
  • #0: Only run batch 12 bert for GS profiling and tighten some bert/resnet thresholds
  • Asarje/resnet highres 20240624
  • #9492: replace falcon specific matmul calls
  • Extend ssm_eltwise_mul for num_users > 32
  • Update documentation for adding new ttnn operation
  • Extend ssm_1d_reduce for the batch>32
  • #0: rn50 fix add api
  • #9123: Add support for optional output tensors to run in the worker t…
  • #9861: support check_tensor helper_function
  • Fix syntax issues in custom test dispatch workflow
  • Add Mixtral accuracy tests and cleanup its other tests (CI-friendly)
  • #9876: Increase timeout on falcon7b perplexity tests.
  • #9492: Remove bmm/resnet_matmul from models
  • #9410: enable fp32 precision unpacking for interm. CBs
  • #9903: Fix conditional statements and indexing of y values in CoreRange::diff
  • #9860: fix test create device apis
  • #0: delete unused code
  • #9719: fixed l1 clear issue on nlp create qkv heads decode test case
  • Fixing typo in llama demo readme
  • #9892: Device only op report
  • #8704: define consts for registers that hold x-y coordinates and amount to shift address to get x-y coord
  • CODEOWNERS update
  • Abhullar/bh misc fix
  • Auto-register C++ ttnn operations in python
  • #9788: Remove TopK from TTLib and replace all references with the TTNN api
  • #0: add owners for resnet demo
  • 7-way split of eager tests
  • #9910: Improve Softplus kernel accuracy
  • #9818: Add cache check to op info V2
  • #0: update noc test bound
  • Fix branching bug in softplus kernel
  • propagate error upwards for tests in falcon 40b suite
  • #0: Fix falcon40b softmax import failure
  • #9755: move ttnn.concat to match the new file structure
  • #9837: Assign workers after performing ref count cleanup in async mode
  • #0: Make event_synchronize API safer
  • #0: Update buffer asserts to account for trace buffers
  • Clean up ttnn operation registration on python side
  • #9164: [Blackhole bringup] Add fix for unpack untilize
  • Aliu/no l1 clear
  • Restructure ttnn::permute to match the new standard format
  • #9815: Update host to pass packed write max unicast sub cmds to cq dispatch
  • Distributed layernorm op
  • #9831: re-enable test
  • #8835: cleaned up ttnn operation registration on C++ side
  • #9941: update dram/l1 to noc xy header to do the appropriate shift
  • #9336: Refactoring moreh layernorm
  • #9745: move unpad to slice ttnn cpp references
  • #9980: Update falcon updated outputs
  • Fix Main after Pad Merge
  • Update eltwise bcast unary ops to use memory_config and fix PCC issue for interleaved output
  • Update FD cmds to be PCIe aligned
  • Fix N150 product name to nebula_x1 even if it's unharvested.
  • #0: add a second codeowner for conv
  • #0: Get tt-metal to compile with gcc-12
  • #9492: Change to ttnn matmul in tests and tt_eager
  • #9441: add typecast uint16->uint32
  • Move ttnn::embedding to match new pybind structure and replace C++ ttlib embeddings usage with it
    ...
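
One entry in this list adds optional transpose_a and transpose_b arguments to ttnn.matmul and ttnn.linear (#9709, flagged above). A hedged sketch of how such flags are typically used follows; the keyword names come from the changelog entry, while the shapes and dtypes are purely illustrative.

```python
import torch
import ttnn

# Sketch of the optional transpose flags from #9709. The keyword names follow
# the changelog entry; treat the rest of the signature as an assumption.
device = ttnn.open_device(device_id=0)

a = ttnn.from_torch(torch.randn(1, 1, 64, 32), dtype=ttnn.bfloat16,
                    layout=ttnn.TILE_LAYOUT, device=device)
b = ttnn.from_torch(torch.randn(1, 1, 64, 32), dtype=ttnn.bfloat16,
                    layout=ttnn.TILE_LAYOUT, device=device)

# transpose_a=True computes a^T @ b: (32, 64) @ (64, 32) -> (32, 32)
out = ttnn.matmul(a, b, transpose_a=True, transpose_b=False)
print(ttnn.to_torch(out).shape)  # expected: torch.Size([1, 1, 32, 32])

ttnn.close_device(device)
```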

v0.49.0

12 Jun 14:05

📦 Uncategorized

  • #5044: Add optional output to addalpha
  • #9059: Fix matmul for single core grid
  • readme update
  • #0: (MINOR) Update to v0.49.0
  • #7586: Move common models for single-card nightly to ln model
  • Update Mamba README
  • TTLIB interval to sharded sweeps
  • #0: Update dataflow api comments
  • #9196: Merge new op: Fast reduce nc into main
  • #0: New resnet50 test skipped on WH since it's WIP
  • #9329: Restructure ttnn::argmax
  • #9323: Introduce template for new ttnn pull requests
  • #0: skip release build on GH runners, we already test it via build a…
  • Remove unused dependencies and fetch gtest via CPM
  • #8764: Part 3 of docs and model demos changes
  • Ngrujic/profiling
  • [Mistral-7B] Add flags for weight paths
  • Typecast int32->fp16b (see the sketch after this list)
  • #9258: Remove ARCH_NAME and TT_METAL_ENV from wheel testing
  • Implemented SD using new Conv API
  • #9258: Re-add wheel into release assets
  • #9361: Install Clang-17 and gdb 14.2
  • #7525: Re-skip demo batch 7 metal_BERT_large_11 on WH because it still hangs ND
  • #9206: add sfpu config reg init to llk sfpu inits
  • #9059: Avoid a couple of fatals in matmul
  • Add Galaxy support.
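
The "Typecast int32->fp16b" entry above adds an int32 to bfloat16 conversion. Below is a sketch under the assumption that the op is exposed as ttnn.typecast taking a target dtype; at this point in the history it may still have lived under tt_lib. Values beyond bfloat16's 8-bit mantissa lose precision, which is inherent to the conversion.

```python
import torch
import ttnn

# Sketch of an int32 -> bfloat16 ("fp16b") typecast. Assumes a ttnn.typecast
# entry point taking the target dtype; at this release the op may still have
# been exposed via tt_lib instead.
device = ttnn.open_device(device_id=0)

ints = ttnn.from_torch(
    torch.arange(32 * 32, dtype=torch.int32).reshape(1, 1, 32, 32),
    dtype=ttnn.int32,
    layout=ttnn.TILE_LAYOUT,
    device=device,
)
floats = ttnn.typecast(ints, ttnn.bfloat16)  # assumed signature
print(ttnn.to_torch(floats).dtype)  # expected: torch.bfloat16

ttnn.close_device(device)
```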

v0.48.0

10 Jun 18:09

📦 Uncategorized

  • #7744: Add support for non-4D tensor in moreh_sum, moreh_sum_backward
  • #5544: Add output tensors parameter to moreh_nll_loss op
  • #5544: Add output tensors parameter to moreh_sgd op
  • #5544: Fix package build error
  • #5544: Add output tensors parameter to moreh_linear op
  • #5544: Prevent eager unit test failures
  • #7997: Support non-4D tensor in moreh_softmax
  • #7816: Bump SD perf target
  • #8098: Remove temp buffer copying when reading from hugepage to host buffer
  • #0: Specify DEBUG_STATUS as a string literal instead of multiple chars
  • #8212: Fix uneven shards for interleaved_to_sharded op
  • #0: Refactor unpad tile to modify rt args in place and remove dynamic…
  • #7838: Add support for non-4D tensor in moreh_linear OPs
  • #0: Use split_work_for_tilize in both tilize and untilize
  • #8131: resnet-50 fix for b20.
  • Add support for multiple parameters in EltwiseUnary
  • #7625: Enable multicore for tilize with padding by default
  • Trace Support
  • #0: Switch set runtime args assertion for if kernel was placed on core to TT_ASSERT
  • #7179: enabling test case. The issue was not reproducible on 8.12 dri…
  • #4625: Multicore runs for untilize with unpadding on interleaved tensors
  • #0: Cache program cmds, convert cb configs from write linear to write packed
  • #0: Make skip and xfail optional in defining sweep tests
  • Shwetank tt/bcast op
  • #8364: Disable implicit fallback for ttnn.pad
  • #8513: Add slack notifications to several more pipelines
  • #0: Update common RT args to use no stride flag for packed cmd.
  • #0: Option to write compile_commands.json from CMake
  • #8718: eltwise testing for bfloat8
  • Add support for bfloat8 input tensors in Mamba SSM block custom kernels
  • #8460: Enable Clang-17
  • #0: Remove overhead in calling functions wrapped in tensor_impl_wrapper
  • #0: Updating the perf threshold to incorporate the "Merge back uneven reshard" commit.
  • #6365: Add ttnn host tests
  • #6365: Revert "#6365: Add ttnn host tests (#8210)"
  • #4382: fix GH reported vulnerabilities
  • #0: bump C++ timeout limit to 45 minutes
  • update unpad doc for slice generality
  • Convert Falcon7b tt_lib ops and tensors to ttnn.experimental
  • #6365: Fix ttnn host wheel tests
  • Add git bisect script
  • #0: Move falcon40b ci unit tests to different pipeline
  • #8437: remove default matmul program config
  • #0: Add myself to ttnn codeowners
  • #0: Update README.md to include mention of TTNN_CONFIG_OVERRIDES (see the sketch after this list)
  • #0: Fix typos and add TTNN_CONFIG_OVERRIDES parameter descriptions to readme
  • #0: Add basic sanity checks during matmul program config creation
  • #8907: Sweep tests for tilize/untilize
  • #8902: Fixed program caching bug in nlp load slice op and added additional test cases for the op
  • #8917: Add sweep test for the fold op
  • #0: Properly support trivial single core case for 1D matmuls
  • #6343: updated test_perf with test for bloom causal_lm
  • #6343: Add functional_bloom test_demo
  • Update README.md
  • Enable optimised attention by default in falcon prefill.
  • Replace FreeList shared_ptr with local_shared_ptr
  • Add dummy_weights mode for mixtral tests
  • Refactor operation calls: Replace operation::run() with operation::launch_op()
  • Use HiFi2 to bump Falcon7b prefill PCC
  • #8902: add input and attn_mask del
  • #8930: Disable llama perf test
  • #0: Add third codeowner to matmul path
  • #0: Add create_venv.sh as environment option in installation instructions
  • #7083: Composite conv fix for relu called after matmul
  • #7525: Skip batch 7 metal BERT on WH B0 because it still hangs too often
  • #8871: Add initial infra/support for dram sharding
  • #8531: delete all makefiles
  • #0: Delete dead code from work_split.hpp
  • #8853: Uplift SFPI to latest w/ BH support
  • #8725: Warn user if kernel cache is enabled
  • #0: Minor test_prefetcher fixes
  • #5389: Move ttnn.repeat to c++
  • #8131: temp fix for PCC issue on W0.
  • Optimize Falcon40b e2e perf by modifying layernorm
  • #0: Relax Falcon7b perf target
  • #0: Resolve segfault in llama async mode
  • Resnet Optimizations
  • Create Falcon7b perplexity test and utility functions for text-gen datasets
  • Revert "#8131: temp fix for PCC issue on W0."
  • bmm dram sharded opt
  • #8943: Clean up profiler python_env build flow
  • #8904: Add slack notifications for T3000 unit-tests
  • Add unet shallow functional, performance and demo test files
  • #8932: Multi-Device Mixtral Argmax Support
  • #8264: Worker thread optimizations:
  • TTNN tests for bf8 with mk tiled scalar
  • Ihamer/7468 inject noc delays
  • Support changed csv row orderings in Mixtral's op_perf_results.py
  • Correct merge issue in op_perf_results.py
  • #0: Add kernel groups to test_pgm_dispatch
  • #0: Add docs requirements to python env cache key because it can change the environment as well
  • #0: Add helper function to create CBs
  • #8973: Remove TT_METAL_ENV because we don't need it anymore
  • #5773: Move SD model to demo folder
  • #6938: Implement softplus as a single kernel
  • Model team/rotary embeddings llama
  • #8735: Fix hw/inc/blackhole files for compilation
  • Improve Mixtral perf with ttlib
  • Update README.md
  • #3712: fix old version of GN test
  • #0: Don't error on unused functions in compiler call
  • Revert " #8904: Add slack notifications for T3000 unit-tests"
  • Rtawfik/bh llk api
  • #0: Added interactive demo
  • Move Falcon7b before Mixtral in demo pipeline to workaround issue
  • #8112: Add support for ND tensors to matmul
  • #0: fix dram read benchmark
  • Fix bug in utility_functions::Profiler
  • Remove 1x1 matmul fallback on convolution and generalize convo…
  • #5389: Remove ttnn.split
  • #8767: decouple build folder name from build.cpp
  • #8735: Update common flags for BH build after sfpi module update
  • #8895: Fix ttnn.as_tensor(..) method for placing tensors on-device
  • #8539: Add cq_id to run_operation function args
  • #8632: Support fp32 dest acc en in moreh_sum and moreh_sum_backward
  • #5044: Add optional output tensor and remove autoformat in eltwise binary ops
  • #8895: Fix failing regression test in dump_tensor(...) API
  • More Resnet Optimizations
  • #4858: add typecast fp32 to uint32 op
  • #8995: refactoring moreh arange
  • #0: Add ccache option to build_metal.sh
  • Update Mixtral perf figures
  • #8349: Use BFP4_B for attention mask in falcon7b optimised prefill.
  • #0: Add CODEOWNERS for build_metal.sh
  • Rtawfik/add binary reuse metal
  • Update watcher.rst - use double backticks
  • Falcon40b tt_lib to ttnn.experimental
  • #0: fix dram sharded program cache
  • #7083: New halo fix for enabled program cache
  • #9051: Enable Llama model perf test
  • #8764: Single card WH demo tests
  • #8764: Various docs fixes for WH release
  • #0: Correct script locations for nightly single card
  • #8764: Use new device_l1_small_size fixture for SD demo interactive test
  • #9059: Update matmul test pcc
  • #0: Ensure weka mount is active for demo tests otherwise it won't run
  • #0: remove reserve to avoid bad alloc
  • #8764: Separate n150/n300 demo tests to not run BERT 11 on N150
  • Remove unnecessary llk sfpu param files
  • #9059: Add fallback for getting matmul program config
  • Add grouped convolution support
  • #8282: Support non-4d tensor and fp32_dest_acc_en for moreh nllloss backward
  • #8976: moreh_getitem receive signed integer index tensors
  • #9049: fix moreh_sgd callback and add callback test
  • #0: Remove argmax multi-device test due to segfault
  • #7724: Add prototype for autonomous streams for use in tunneller
  • #9036: GS & BH --> Combine llk param files using variable args
  • #0: optimize allgather for small tensor sizes
    ...
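
Two README entries above document the TTNN_CONFIG_OVERRIDES environment variable (flagged in the list). A hypothetical sketch of setting such an override follows; the JSON keys shown are illustrative assumptions, so check the README for the actual set supported by this version.

```python
import json
import os

# Hypothetical sketch of TTNN_CONFIG_OVERRIDES as documented in the README
# entries above: a JSON blob that overrides ttnn's configuration when ttnn
# is imported. Both keys below are illustrative assumptions.
os.environ["TTNN_CONFIG_OVERRIDES"] = json.dumps({
    "enable_fast_runtime_mode": False,  # assumed key
    "enable_logging": True,             # assumed key
})

import ttnn  # overrides are read at import time (assumed behavior)
```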

v0.46.0

05 Apr 13:57

📦 Uncategorized

  • user-triggerable C++ post-commit suite
  • #6406: add missing position_ids/attention_mask to bert demo
  • #6282: Add AdamW
  • #6315: Fix dprint tests for T3000
  • FD2: prefetch stall, dispatch wait, linear read, delay and cleanup
  • #6609: update wording in demo section of main README.md
  • #6364: Autocomplete for pybinded types
  • Asarje/ttnn rn50 b20
  • FD2.0 Test - Fix l1 buffer not page-size aligned after FD-on-eth changes to L1_UNRESERVED_BASE
  • #6593: Add resharding to Llama2 model when possible.
  • #6572: Fix ttnn.repeat_interleave example in documentation (see the sketch after this list)
  • #5780: Re-enable 100K enqueue program stress test on grayskull
  • Enable basic width sharding support in all-gather
  • Alex/metal/remove cb wait markers
  • #6657: Use sysmem manager cq size instead of recomputing it each time…
  • #0: (MINOR) Add Grayskull purchase link and update version to 0.46.0
  • #5063: add TopK API to metal
  • #5480: FD2.0 Test - Fix test_prefetcher for dram paged read test (-t 3) on whb0
  • Fix logit low pcc
  • Backward op - Fixed ldexp, hardsigmoid and asin
  • #6598: Fix softplus
  • Add support for BFP4_B tensor serialization
  • Eltwise mul for different batch size
  • #6575: Split docs into separate Metalium and nn docs
  • #0: Add two separate links for documentation (tt-metalium/ttnn) on README
  • #6361: Update ttnn repeat to use correct shapes when formatting output
  • #0: Sayonaraaaaaaa
  • FD2.0 Test fix test_prefetcher add_paged_dram_data_to_worker_data dropping start_page
  • #5785: Watcher ringbuffer implementation
  • Add FD 2.0 WriteHost Command
  • #0: Put back frequent api tests because I'm an idiot
  • Optimize All Gather Interleaved Worker send/receive
  • #0: changing all #include common/* to #include tt_metal/common/*
  • #6676: Fix issues related to unary lte and gte
  • #5817: Fix lerp
  • #6589: Fix for relu_bw
  • #6633: Backward test update
  • #0: Skip logit, logiteps test
  • #0: Testing CI fix
  • #5480: Update test_prefetcher to pass added hugepage args to dispatch kernel
  • Fix l1 acc, add whb0 optimized conv tests
  • Alignment fix for eth core kernels
  • Add data parallel (multi-chip) for Falcon7b (prefill/decode) model and corresponding tests
  • CQ_DISPATCH_CMD_WRITE_PAGED support in test_dispatcher and passing tests
  • #6647: disable failing ci cpp tests and reenable cpp pipeline on CI
  • Backward test updates
  • Ngrujic/check bugs
  • Add Llama matmul perf tests to main
  • TTLIB: removing working tests from broken
  • #6443: Update backward asin and addcdiv logic
  • #0: Fix output cb size calculation in reshard op for bfp8b
  • #0: use smart ptrs in allocator
  • Jvasilje docs 0322
  • DRAM based device profiler with Tracy support
  • #6553: Fix ttnn.reshape(..) handling for bfloat16, TILE_LAYOUT
  • PR: #6746
  • Add Llama2 demo to tt-metal docs
  • Mistral-7B WH demo
  • Revert "#0: Put back frequent api tests because I'm an idiot"
  • FP32 support
  • #0: Add back frequent api tests to run.sh
  • Bteng/watcher ci3
  • Remove cpuprof
  • logo update
  • #6184: sharded row major silu support.
  • #6443: Update div_bw and backward ops test file
  • #6705: Relax forcing of keyword argument in ttnn.open_device
  • Forward op tests
  • #6691: Allow blocking of inner dim within a core for sharded in0 for 2d and 1d systolic matmuls
  • #6662: Width Sharding support for eltwise OP
  • Stable diffusion python API level perf improvements
  • Add get_compute_kernel_config_args function
  • #0: Add fd-2/main triggers for pull_request and push for post-commit
  • #5480: FD2 refactor for pre/dis patch variants
  • #6654: Add perf tests for ttnn ResNet50
  • #5480: Fix fd gtest unit test test_write_host
  • #0: Set myself as setup.py owner
  • #6780: Add mistral7b to demos list in getting started
  • #4003: re-added TTNN_ENABLE_LOGGING as runtime flag
  • #0: Fix semaphore address gen bug
  • #6769: Disable program caching for failing Llama tests.
  • #5480: Fix zero sized write transaction request that could occur in write_linear_host
  • #6077: Fix unet pcc issues
  • Remove DstSync from llk api templates
  • FP32 Support
  • #6680: Reverting move op change
  • #6443: Update asinh and softsign backward
  • Backward tests with updated test modules
  • Ngrujic/check bugs 1
  • #6654: Moving init for self.compute_kernel_config
  • #6805: reproduce the bug with sharded split_query_key_value_and_split_heads
  • #6832: Account for tile-padding in softmax for mistral 7B
  • Enable support for uint32 format to be consumed by SFPU (issue #4624)
  • #4252: fix clang build error since std::log2 only constexpr in gcc
  • #4003: log, debug and add pre- and post- hooks only for top-level ttnn ops
  • #6823: Fix core count to not include dispatch cores in op report
  • #6197: Align pages for interleaved <-> sharded.
  • METALIUM_GUIDE
  • Bteng/watcher post commit
  • #6443: update backward test file for relational ops and concat op
  • Revert "Bteng/watcher post commit"
  • #6443: Update backward ops
  • Backward test updates
  • #0: Add the dim 0 support repeat backward
  • Update hard related test ops
  • #6757: Remove set_profiler_location
  • #6443: Update backward ops erfinv elu hypot cos sin
  • #6861: Enable Watcher/dprint tests on T3000 CI
  • Update Mistral perf regression for CI, until issue is resolved
  • Mamba/perf v1
  • #0: remove data movement ops related to silu in SD
  • #4003: added proper fallback for getitem of ttnn.Tensor. Slice the tensor only on the tile boundary but set the shape based on whatever user provided
  • #4003: added proper fallbacks for every op that falls back to torch
  • #6731: add fix to LN width sharding
  • #5797: add back sweep test for ln
  • Integrate GroupNorm V2 to SD model
  • METALIUM_GUIDE.md updates
  • [Falcon7b] Fix bugs with inference throughput measurements in demo
  • #0: shallow unet add perf_mode
  • #6154: 2d matmul in0 height, in1 width sharding
  • #5249: Various Falcon40b test and demo cleanup
  • #0: fix incremental build
  • #0: remove upsample spill to DRAM
  • [Llama2 Prefill] Model Functionality completed
  • Watcher alignment checking for PCIe/DRAM <-> L1
  • #6920: fixed the error in whisper
  • Update METALIUM_GUIDE.md
  • #6644: save l1 buffers to data base
  • Update usage.rst
  • #6804: fix ttnn falcon7b demo regression + add to CI regressions
  • #6285: Add backward support for floor round and div_no_nan
  • [skip ci] Update INSTALLING.md
  • #6873: Add more test combinations to tt_lib sweeps add, add_unary, su…
  • Ngrujic/check bugs 3
  • #6882: Updated Mistral-7b perf estimate
  • #6850: Update install links in Sphinx docs to point directly to INSTALLING.md
  • #6619: Fix per op profiler sum
  • #6644: sync before calling print l1 buffers
  • Barsic/ttlib ops check
  • Barsic/ttlib params fix
  • #6962: Move cd tt-metal earlier in the command list of INSTALLING.md
  • #6819: Add support for CreateKernel absolute file paths
  • #6356: Remove half-half grid logic for bmms
  • #4003: added a flag to disable ttnn fallbacks. Don't throw an error w…
  • #0: Correct FW versions, tt-smi versions, and add note about tt-topology
  • #0: Capitalize tt to TT consistently for marketing
  • #0: Add myself as CODEOWNER for INSTALLING.md
  • #6644: ttnn visualizer
  • #6847: Allow disabling individual watcher features
  • #6889: Support printing/padding/tilizing multi-device tensors
  • #4003: removed ttnn.print_l1_buffers and consolidated all ttnn flags into a CONFIG class
  • #6217: tt_lib async mode support (single chipp tensors supported)
  • Reshard With Ranges
  • #4003: updated buffer report to show...
    ...
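
One entry above fixes the ttnn.repeat_interleave documentation example (#6572, flagged in the list). A small sketch follows, assuming the semantics mirror torch.repeat_interleave with an explicit dim; the shapes are illustrative.

```python
import torch
import ttnn

# Sketch of ttnn.repeat_interleave (see #6572 above). Semantics are assumed
# to mirror torch.repeat_interleave with an explicit dim.
device = ttnn.open_device(device_id=0)

x = ttnn.from_torch(torch.randn(1, 1, 32, 32), dtype=ttnn.bfloat16,
                    layout=ttnn.TILE_LAYOUT, device=device)

# Repeat every row twice along dim 2: (1, 1, 32, 32) -> (1, 1, 64, 32)
y = ttnn.repeat_interleave(x, repeats=2, dim=2)
print(ttnn.to_torch(y).shape)  # expected: torch.Size([1, 1, 64, 32])

ttnn.close_device(device)
```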

v0.45.0

22 Mar 18:03

🚀 Features

  • #6204: added support for num_users < 32 for update cache op.
  • #6247 Llama2 Galaxy MLP implementation

📦 Uncategorized

  • #4736: Add support for moreh_norm op
  • Fix moreh_layernorm rstd
  • #5508: Change test_moreh_layernorm.py for debugging
  • #4686: add infra for sharing global struct among ops
  • #5592: Fix pcc on Falcon 7b prefill by turning on l1 packer on MLP 4h-to-h matmul
  • Fix layernorm beta data format reconfig
  • Add linked support for in0 in1 mcast in matmul
  • #4957: optimizing construct_2d_padded_tensor_list
  • #4003: added ttnn.as_tensor and enabled support for caching torch tensor
  • Revert "#0: Fix for fail in asinh backward"
  • #5829: Use moreh_common.hpp for data movement kernels across moreh OPs
  • Barsic/ttnn ops
  • #6030: Update resnet performance metrics
  • #5876: pytest & c++ test logging cleanup
  • #0: Use both 2x2 and 2x4 machines on every scheduled run
  • Add single core matmul benchmark
  • #6079: Update FORCE_INLINE to be nop when watcher is enabled
  • #5980: Fix a hard-coded bounds check in dprint
  • #5389: merged ttl and ttnn tensor classes into one
  • Initial Performance Model
  • fix ci
  • TTNN RN50 :: on the road to match perf with TTLIB version
  • #4438: Optimized single-core fold op
  • #5589: Add repeat-interleave and addcmul sweeps
  • #6055: Add square backward support
  • #6057: Add backward support for lgamma
  • #6056: Add backward support for frac and trunc
  • #6066: Add support for backward log sigmoid
  • #6002: Add backward support for binary maximum
  • Ngrujic/improve conversion to bfloat8b in sweeps
  • #5829: Use moreh_common.hpp for compute kernels across moreh OPs
  • #0: Remove post-commit label from multi device pipeline because it's not actually post commit
  • Add pack l1 acc to resnet conv
  • #6144: Skip 512x512 cross attn 2d upblock for now in nightly because it hangs
  • #6061: Add tanhshrink, threshold, Unary EQ backward ops support
  • Width Sharded Concat for Unet
  • #5184: uncommenting various moreh test cases.
  • Fix compute kernel config arg for resnet50
  • Nsmith/untilize unit test
  • Revert "Revert "#5389: merged ttl and tensor classes into one""
  • #4438: Do not use the new fold op in Resnet tests
  • Remove corerangeset that does not work on wormhole
  • #6129: Expose kernel config attrs and use 4 dst tiles for fp32 configs
  • #5391: Add device perf
  • #0: Use multiplier for wormhole b0 mulsi3
  • #4003: removed ttnn.Tensor autoclass from tensor.rst
  • TTNN MultiDevice Support
  • build artifacts
  • #4947: Add noc alignment checks to watcher
  • Add ttnn multi-chip unit test for checking device shards
  • Nsmith/fix unet
  • #6043: Random program stress test of command queues
  • Logit and logiteps backward support
  • Backward support for log2
  • Add missing ttnn tests and disable broken tests until issues are fixed
  • Fix Events feature for FD1.3 (out-of-order event ids, events feature missing) #6093
  • #5873: make top-level post commit workflow re-useable
  • #5589: add groupnorm for ttnn sweeps
  • Ngrujic/ttnn sweeps 4
  • Add ethernet datamover (EDM) - a foundational ethernet transfer engine
  • #6116: Add backward support for softshrink
  • #0: Add verbose make logs to artifact and make nicer name on metal
  • #0: Only use 2x4 setup for multi-card WH CI as 2x2 does not provide us good feedback
  • #4809 dprint tensix regs
  • #4003: fixed bloom perf test
  • #6187: Conv bugfix
  • #0: concat RM support variable stick widths across inputs
  • TTNN RN50 on WHB0
  • #6084: Lower thresholds slightly after using proper configs for device resnet
  • Fast dispatch 2.0 proof of concept
  • #6218: add pytest for matmul 1d 2d
  • #6177: use is_tensor_storage_on_device so it works for MultiDeviceStorage
  • #6082: support workers + eth cores in one program
  • #6215: Rename TensorToMeshMapper/MeshToTensorComposer
  • #6164: Update test_noc_unicast_vs_multicast_to_single_core_latency to not use same cores for producer and consumer on WH
  • #6117: Add backward support for softplus
  • #6223: remove redundant call to context switch
  • Integrate EDM with all-gather.
  • #6136: Add backward support for unary LE and GE
  • #5398: fix unicast binaries
  • Barsic/ttnn ops 2
  • #5380: Add wormhole_b0 model perf tests, only falcon7b in ttlib for now
  • #5372: Updated README.md file for demo
  • #4003: updated ttnn.concat to have a registered fallback
  • Llama2 functional bringup
  • #5589: Add working BFLOAT8_B sweeps to working folder
  • FD2.0 rename HostQ->PrefetchQ, add multi-core capability, fix NOC coords
  • #0: bugfix in ttnn resnet caught by nightly
  • #0: fix tt_bisect build bug
  • Watcher Asserts
  • #6183: add unit test for sd matmul ops
  • #6254: Make program cache per device:
  • #5394: Add functional version of Mamba architecture
  • #6257: Add temporary convenience script for 800MHz / new eth reset dependent CI
  • #5661: Enable gtests for fast dispatch + R chip
  • Alex/metal/bmm large block untilize out
  • #5389: made tensor attributes public and use ttnn::Shape instead of tt::tt_metal::Shape for storing shape
  • Revert "#6183: add unit test for sd matmul ops"
  • #4003: print all of the L1 buffers using ttnn.print_l1_buffer_state
  • #4003: print all of the L1 buffers using ttnn.print_l1_buffers
  • #4438: Implement sharded multi-core fold op for Resnet50
  • #6149: disabled the check for comparing generated report with GOLDEN_L1_BUFFER_REPORT because on pipelines it looks different than when running locally
  • FD2.0 fixes+mcast support for write and packed_write
  • Shwetank tt/config
  • #0: Change order of device and use_program_cache fixture in remaining pytests
  • Softplus with beta and threshold param (see the reference sketch after this list)
  • Build tests during artifact creation
  • #6149: disabled test_print_l1_buffers_of_add_operation
  • #4003: updated ttnn.to_torch to work with bfloat8_b tensors that are not multiple of tile size without tile padding
  • #0: add to/from L1 reshard test
  • #0: Add back deleted shape assertions for interleaved concat
  • test errors flagged by watcher
  • #0: fix incremental build
  • Merge xuncai/llama-attention-galaxy to main: First version of llama-attention galaxy on emulated chips
  • #6329: Fixing a bug causing mismatch on indices
  • #6321: Test which sweeps read/write buffer and just checks that the e…
  • Support moreh_getitem forward
  • #6125: Update in0_block_w to be full shard width for sharded 2D systolic matmul
  • #6107: Add softsign, sign, unary ceil backward support
  • #6226: Add backward support for div
  • #6234: Add backward support for rdiv
  • #6236: Add backward support for fmod and remainder
  • #4003: added positional embeddings to bert and updated ttnn_sharded_optimized_bert to run with batch size of 12
  • Indexed Fill
  • #5589: remove dtype in gen function sweep tests where needed
  • #6347: Print built-in defines once only
  • #0: Add Mo as code owner on profiler code
  • #0: Simplify tt_lib.scripts package by adding a specific tt_eager/scripts directory and putting the production scripts in there, whereas development scripts will stay in /scripts
  • #0: Fixture reorder changes reverted for falcon_7b perf test
  • #5424: remove metal_ckernel_sfpu
  • #0: Update remaining tt_lib.program_cache calls to use device APIs
  • #6183: add unit test for sd matmul ops
  • #6289: fix dispatcher page calculation
  • #5924: Enable unet on wormhole_b0 changes
  • #6325: skip test_multi_device.py for grayskull arch
  • Alex/metal/pack untilize no repack
  • #6144: Not hanging on GS or WH with or without Watcher
  • Agrebenisan/swq hwq cardinality cleanup
  • #6146: Add backward support for conj
  • #0: bug fix UTWH div_up instead of div trunc for calculating CB sizes
  • Fix To/From Sharded Bug
  • #6206: Fix resharding page mapp...
    ...
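
"Softplus with beta and threshold param" (flagged above) extends softplus beyond the plain log(1 + e^x) form. Below is a host-side reference of the usual convention (the one PyTorch uses), which the kernel presumably matches: once beta * x exceeds the threshold, the op returns x directly to avoid overflow.

```python
import torch

# Reference semantics for softplus with beta/threshold, following the PyTorch
# convention; whether the kernel change above matches it exactly is an
# assumption.
#   softplus(x) = (1 / beta) * log(1 + exp(beta * x))
# with a fallback to the identity once beta * x exceeds the threshold.
def softplus_reference(x: torch.Tensor, beta: float = 1.0,
                       threshold: float = 20.0) -> torch.Tensor:
    scaled = beta * x
    return torch.where(scaled > threshold, x,
                       torch.log1p(torch.exp(scaled)) / beta)

print(torch.allclose(
    softplus_reference(torch.linspace(-5, 5, 64)),
    torch.nn.functional.softplus(torch.linspace(-5, 5, 64)),
))  # expected: True
```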

v0.44.0

27 Feb 15:57

📦 Uncategorized

  • Update CreateBuffer to return shared_ptr, and Enqueue R/W buffer to accept std::shared_ptr
  • #4794: Implement DownBlock2D using ttnn for stable_diffusion model
  • #4797: Implement BasicTransformerBlock sub-module using ttnn for stab…
  • #0: write cluster config for FD mode, non tunneling cores as well
  • Update bw test, change mulsi calls to use *
  • #3003: updated tt-lib documentation
  • #0: Update to v0.44.0
  • #4003: added ability to trace ttnn operations using torchtrail library
  • Support moreh logsoftmax
  • #4614: gitmodules: Use https URLs for submodules
  • #0: add reviewers to frequently touched ops docs file
  • backward ops - hypot and atan2
  • #4885: Move program device map to program
  • #4858: Add support for float to int typecast
  • Matmul_block on a smaller grid size
  • Revert "#0: Add support for typecast float to int"
  • Add dst ethernet router support and remote command processor to accept FD packets on remote chip
  • Falcon40B TT Implementation
  • #5198: Fix moreh softmax related bug
  • #0: skip MOREH Softmax tests from main
  • #3122: Use device grid size in falcon_attention to be generic...
  • #0: Add assertions for interleaved tensors for ops that don't support sharding
  • #5169: Add activation ops to ttnn
  • #3003: add duration to the ttnn operation nodes when TTNN_ENABLE_LOGGING=1 is used to compile the code
  • #5027: Optimize group attn matmul for Falcon40B decode
  • #0: add documentation about managing documentation
  • Adding docs for maxpool, avg pool and upsample
  • Revert "#0: skip MOREH Softmax tests from d5811b7
  • #5165: Add hyperbolic ops to ttnn
  • #4866: Add grayskull open source llk-library
  • #5002: simplified preprocessing of CNNs using preprocess_model
  • Create GroupNorm sharded in TTNN
  • #5097: Support for dedicated completion queue thread
  • upsample test calculate grid
  • fix for sharded allocator when num banks == num cores
  • MHA tutorial interactive notebook with diagrams
  • #4003: Adding a profile tutorial
  • #0: Added non-blocking read stress test
  • Revert "MHA tutorial interactive notebook with diagrams"
  • #0: Update all_gather to work for multi_link. Update falcon-40b to use 2 links for all gathers
  • #5142: Remove slow dispatch mode from working sweeps
  • #3003: fixed the input tensor documentation
  • #0: Temp slower resnet VM run
  • throw on fast dispatch for to_host_sharded as it's not supported
  • #5253: Fix kv_past_len being passed in to rotary embedding for falcon models
  • #5233: started adding ttnn_functional_resnet
  • #3003: updated ttnn documentation to explain what features it has over tt_lib. Added standalone examples of basic usage of ttnn (see the sketch after this list)
  • #0: Speedup incremental builds
  • #0: Change setup.py to be git worktree friendly
  • MHA tutorial interactive notebook with diagrams
  • #3003: disable tutorial 6 from running as the unit test
  • Agrebenisan/non blocking tensor reads
  • #5275: CODEOWNERS: update to include files relevant for ttnn team
  • Fix an intermittent launch message transfer error
  • Revert "MHA tutorial interactive notebook with diagrams"
  • #0: add parens in LLK doc
  • #3003: only unit test tutorials that work on pipelines
  • #5246: Add unary math ops to ttnn
  • Vignesh/stable diffusion ttnn basic transformer block fix
  • #4854: Implement attention and rms_norm sub-module using ttnn for mis…
  • #4795: Add upblock2d to functional stable diffusion model
  • #4796: Implement Transformer2DModel using ttnn for stable_diffusion m…
  • #0: Adding llk wormhole_b0 submodule
  • #4003: Adding pybind11 to ttnn
  • #5296: Fix broken link to host_api.hpp in README.md
  • #0: Fix bug with the way we were measuring bert inference time
  • #0: Change local tt_lib._C module install from symlink to copy
  • #5233: added ability to fold batch_norm2d into conv2d
  • #5222: replace hex8_to_hex32.py with cpp to shave off some compile time - temporary fix
  • Enable tests for WHB0
  • #5137: Cleanups for newer Linux distro / toolchains
  • #5233: implemented support for converting all Resnet-18 modules using preprocess_model function
  • #3003: fix model preprocessing bug
  • #4799: Implement CrossAttnDownBlock2D sub-module using ttnn for stabl…
  • #4800: Implement UNetMidBlock2DCrossAttn using ttnn for stable_diffus…
  • #4798: Add ttnn cross attn upblock2d in functional stable diffusion m…
  • #4801: Implement Unet 2D Condition model using ttnn for stable_diffus…
  • #4965: Rename Conv2D to Conv2d and MaxPool2D to MaxPool2d to match torch
  • #0: Remove departed team member from CODEOWNERS
  • #0: add to codeowners
  • #5314: Only stall on first scheduled read after commands with side effects
  • #4965: fix bad rebase
  • #0: Add more instructions for dispatching workflow actions and a note about skipping git hooks
  • Update optimized Bert to support WH grid sizes, add sharding support for RMSNorm
  • #4642: create gtest_smoke as a sanity test suit
  • #5341: context switch if eth txq is full
  • #5323: Convolutions of small size fail during parallelization calculations
  • Npetrovic/transformer softmax
  • Fix groupnorm for narrow channels
  • #4862: added more test for ttnn bloom. Update optimized ttnn bert to match the structure of non-optimized ttnn bert
  • #0: Add an envvar parser with value detection and default value setti…
  • #4732: Clean up compute kernel apis
  • #5318: Modify Falcon7B to use attn_matmul for wormhole
  • #0: make logLocationsRecord a static function
  • #5233: run convs with auto-format
  • #5377: Avoid segfault by checking buffer !null before getting device
  • Alex/metal/pack untilize b0
  • #4487: Support block sharding in upsample
  • #5359: update python package transformers + dependencies to include Falcon
  • #3708: Add support for LN having gamma/beta in bfp8
  • #4003: Skip sweep tests if not available
  • #4003: use faster TMs in optimized ttnn whisper
  • #4732: Clean up compute_kernel_api
  • More optimizations for group_attn_matmul
  • #5233: updated resnet18 to run residual connections
  • #3003: added more meaningful errors to ttnn. Updated getitem to run on device in the cases when it can
  • #5233: simplified the logic in tracer
  • #3003: include ttl operations and necessary types under ttnn.ttl
  • #0: Add note about no merge commits in main
  • #0: Add timeout in profiler regression workflow
  • codeowners update
  • #5365: Add device argument to determine grid size based on target
  • disable whisper until further investigation, see issue #5430
  • #3003: fixed ttnn convs
  • #3886: Fix build error for C++ tests in debug mode
  • #4954: Support depth 32 in maxpool writer
  • #0: Pass output cb to pack init functions
  • #0: skipping DeviceLoadBlankKernels on remote devices
  • #5359: transformers: update version and relax pcc asserts
  • #3003: guidelines for adding new op
  • Don't assume user has one entry in their $PYTHONPATH
  • FP32 tensor support for matmul
  • #3003: updated tutorial 001 to describe the tensor more comprehensively before showing the add
  • Onboard additional metal code owners
  • #5402: Add redesigned host-side sw command queue, it can be configured i…
  • #3003: fixed docs
  • Alex/metal/enable conv tests on b0
  • #5356: git bisect script to find broken commits
  • #0: Update data_format.cpp file
  • Add skip to full grid matmul whb0
  • #3003: simplified the logic in ttnn/operations/matmul.py. Added dataclasses instead of tuples for CoreGrid and ShardShape
  • #5204: adding moreh's test suite. removing an absolute assertion.
  • Npetrovic/lt gt ne fix
  • #0: Move device id attribute from tensor to DeviceStorage
  • #3003: fixed scheduled pipeline
  • Npetrovic/transformer concat sweeps ttnn
  • #3003: added support for running ttnn.matmul using 1D_systolic_array. Also, added support for passsing in the program config directly
    ...
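
One entry above notes that standalone examples of basic ttnn usage were added to the documentation (flagged in the list). In that spirit, a minimal end-to-end sketch: move two torch tensors to the device, add them, and bring the result back. Device id 0 is an assumption, and tile layout expects the last two dims padded to multiples of 32.

```python
import torch
import ttnn

# Minimal end-to-end ttnn example of the kind the documentation entry above
# describes. Tile layout expects the last two dims padded to multiples of 32.
device = ttnn.open_device(device_id=0)

a = ttnn.from_torch(torch.randn(32, 32), dtype=ttnn.bfloat16,
                    layout=ttnn.TILE_LAYOUT, device=device)
b = ttnn.from_torch(torch.randn(32, 32), dtype=ttnn.bfloat16,
                    layout=ttnn.TILE_LAYOUT, device=device)

c = ttnn.add(a, b)          # elementwise add on device
result = ttnn.to_torch(c)   # back to a torch tensor on host

ttnn.close_device(device)
```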

v0.43.0

08 Feb 18:02

📦 Uncategorized

  • #4668: Yolov5 GS Demo Benchmarking
  • #0: uplift umd; pick up fix for n150 cluster
  • #3178: Fix for wormhole b0 reduce w
  • #4489: fixed bugs in the program caching of eltwise unary and eltwise binary. Updated bloom to use L1 memory config
  • #4821: Add cumsum op to tt_dnn
  • Dispatch/Bandwidth tests
  • #4003: fixed test_eltwise_unary_op
  • Argmax and Argmin Support
  • #3212: softmax works after reduce fix of max, sum, etc. for WHB0
  • #0: (MINOR) Update version to v0.43.0
  • #4761: Add call to ttl repeat_interleave and also provide script for …
  • #4003: fixed the bug with printing the compile-time attributes
  • Support moreh arange
  • Remove skip_for_wormhole_b0 for test_moreh_softmax and test_moreh_softmin
  • #4541: remove unpad start at 0 limitation
  • Agrebenisan/restart cmd fix
  • Support moreh SGD
  • #0: Use fetch-depth: 0 instead of fetch-tags because otherwise git complains of commit SHA/tag conflict
  • #0: Add code owners for primary operations api binding
  • #4547: Add 2x2 window unit tests to ttnn maxpool
  • #4003: restructure ttnn
  • #4889: Change TileSlice printing to only print tile data
  • #4836: Add support for blocking conv activation in 2d systolic conv v…
  • #0: Update unicast cycles lower bound
  • #4904: Add support for 1d width sharded LN
  • #4941: Convert command header to struct for easier maintainability
  • #4823: enable sum_0 operation fails with low PCC [Wormhole,Grayskull]
  • Fix sharded buffers for one core in fast dispatch
  • #4906: global reduce sum, mean, max, min operations added
  • Revert "#4823: enable sum_0 operation fails with low PCC [Wormhole,GS]
  • #0: Change codeowners from specific op binding files/dirs to all tt_lib bindings
  • #4003: split unary sweep into per op sweeps
  • #4232: added support for converting from numpy arrays to ttnn tensors. Borrow data whenever possible when converting from numpy/torch
  • Uplift AttnMatmul to support GroupAttnMatmul
  • Add watcher-specific CI tests
  • #4916: Add avg pool to ttnn
  • #0: Add a lock on DPRINT server raise/wait structures
  • #4967: added validation for input tensors
  • #4971: update documentation by a new doc hierarchy;
  • #0: Leftover decorate_operation replacement for avg pool
  • #4899: fix the permute to operate on the intended shape
  • #4730: Add tt_lib.tensor.concat
  • Aliu/enqueue eth
  • #4003: Updating functional performance from changes in ttnn.permute w…
  • #4984: Remove dead OP_INFO and graph interpreter
  • #4878: initial commit to add Conv parameters to ttnn.preprocess_model_parameters
  • Update Program Hashes for Ops using Mem config
  • #4984: Remove unused dprint functionality
  • Aliu/ci fix
  • #4215: Add Argmax and Argmin Fallback
  • #4999: added input tensor validation to add, sub and mul operations.
  • Support for softmax rm major sharding and causal mask sharding
  • #0: provide API for where() to support scalar True/False branches (see the sketch after this list)
  • #5003: Update expected compile and runtimes for perf regression on VM
  • Revert "Update Program Hashes for Ops using Mem config"
  • #4931: add apis to get ethernet by socket ids
  • #4786: Add upsample_nearest2d functional stable diffusion
  • #4986: deploy docs only to main and enable devs to run docs build on different pages
  • Deploy ttnn sweeps results to docs
  • #4958: Move all python api unit tests to frequent in order to reduce SD pipeline length
  • #4999: Added input validation for ttnn.matmul and ttnn.linear. Add unit test for linear operation. Update input tensor validation in binary.py. Fix compute_output_shapes in bmm_op.cpp
  • #4620: Fix+improve bw test
  • #4852: Add unit tests for functional bloom
  • #5032: scalar argument versions for relops
  • #0: Add some README recommendations from MCW to clarify issue about access to internal workflows VM installation page
  • #4790: Implement GEGLU using ttnn for stable_diffusion model
  • #4999: Adding validation checks
  • #4791: Implement Feedforward sub-module using ttnn for stable_diffusi…
  • Npetrovic/bw ops sweeps
  • #4999: update documentation of ttnn operations to include the validation schema
  • #0: Remove model run from frequent_api_pipeline per @tt-rkim
  • Minor dprint/watcher cleanup
  • #4858: Add support for typecast
  • #0: Disable dprint tests because they're flaky at the moment
  • #4946: Add trig ops to ttnn
  • Nshanker/convs split by 2
  • #4946: Add inv trig ops to ttnn
  • #4003: fixed circular dependency in decorators
  • #5054: Removed asserts from conv op host code that are not required. …
  • #4003: fixed circular dependencies in ttnn
  • #4852: Fix CI pipeline by re-enabling functional bloom for causal LM
  • GroupNorm Sharded support
  • #4972: is_sharded and memory_config is free from tensor
  • #0: eltwise ops/activate operator tracking for GS, and WHB0
  • Aliu/fd tunneling pr
  • #4642: Converted 14 old cpp tests to use gtest, with capabilities to switch between FD/SD when possible
  • #4852: Add tests for functional ttnn bloom implementation.
  • #4003: correctly convert all parameters of torch module to ttnn parameters
  • #5082: Pow gradient calculation method is different from pytorch
  • Argmax/Argmin support for channel, batch and all dim
  • #4420: switch to shared_ptr
  • #4420: return shared_future from taskflow async wrapper
  • Minor DPrint fixes
  • #0: Enable/disable clearing L1 from env var
  • #4003: started moving ttnn operation to C++
  • #4003: Add script to help with finding issues that we need approval for
  • #5044: Adding support for optional output tensors
  • #4003: Adding the open flag to show only open PRs
  • #5048: Add CreateDevices and CloseDevices api to detail
  • decouple ClearProgramCache from CommandQueue
  • Conv fixes for padding input channels. Shallow conv fixes. Conv input/output autoformatting. Cleanup
  • Asarje/mp unpack tilize fused
  • Update CreateBuffer to return shared_ptr, and Enqueue R/W buffer to accept std::shared_ptr
  • #5137: Cleanups for newer Linux distro / toolchains
  • Revert "#5137: Cleanups for newer Linux distro / toolchains"
  • Revert "Update CreateBuffer to return shared_ptr, and Enqueue R/W buffer to accept std::shared_ptr"
  • #4793: Implement ResnetBlock2D using ttnn for stable_diffusion model
  • #4788: Implement Downsample2D using ttnn for stable_diffusion model
  • #4792: Implement CrossAttention sub-module using ttnn for stable_diff…
  • #4747: Reduce amount of samples in bert sweeps
  • #4789: Add upsample2d to functional_stable_diffusion model
  • #0: Add fix for lamb optimizer
  • #5057: Add relational ops support to TTNN
  • skip eth test suite on GS
  • #4003: updated ttnn.Tensor to be derived form ttl.tensor.Tensor
  • Asarje/shwetank upsample
  • #5082: power gradient is erroneous when exponent is in range (0-1)
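
The where() entry above adds scalar True/False branches, so a full tensor no longer has to be materialized for each branch. Below is a sketch of the resulting call shape, shown with the ttnn entry point; the change itself landed in tt_lib at this release, so the exact module path and signature are assumptions.

```python
import torch
import ttnn

# Sketch of where() with scalar branches (see the entry above): scalars stand
# in for the True/False tensors. Shown via ttnn.where; the original change
# landed in tt_lib, so treat the entry point and signature as assumptions.
device = ttnn.open_device(device_id=0)

cond = ttnn.from_torch(
    (torch.randn(1, 1, 32, 32) > 0).to(torch.bfloat16),
    layout=ttnn.TILE_LAYOUT,
    device=device,
)
mask = ttnn.where(cond, 1.0, 0.0)  # 1.0 where cond is nonzero, else 0.0

ttnn.close_device(device)
```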