Releases: flashinfer-ai/flashinfer
v0.2.2.post1
What's Changed
- bump version to v0.2.2 by @yzh119 in #891
- perf: fix the performance of second stage of split-k by @yzh119 in #894
- fix: pin_memory use cpu as default device by @KnowingNothing in #895
- perf: tweak register amount for producer/consumer in MLA template by @yzh119 in #896
- perf: fix MLA split-k performance bug by @yzh119 in #898
- perf: use f16 as split-k partial output data type by @yzh119 in #900
- perf: tweak the pipeline design of mla kernel by @yzh119 in #901
Full Changelog: v0.2.2...v0.2.2.post1
v0.2.2
What's Changed
- fix cu121 torch2.6 by @zhyncs in #867
- unittest: add MLA test cases where kv_len is evenly divided by page_size. by @foreverlms in #861
- bugfix: fix the behavior of MLA kernel when kv-length is 0 by @yzh119 in #868
- Merge of previous PRs for typos in a single one. As per your request. by @didier-durand in #862
- add lightllm adoption by @zhyncs in #871
- fix geneate_dispatch_inc args from parser by @baowendin in #870
- [API] Fix top_k_top_p_sampling_from_logits param typo by @kasohrab in #875
- misc: Remove unused k_smem_offset_w update in MLA kernel by @muoshuosha in #878
- JIT compilation support for TVM by @MasterJH5574 in #880
- [Hotfix] Add flashinfer.jit.attention into packages by @zhouye in #881
- perf: FlashAttention-3 style MLA PageAttention by @yzh119 in #887
- [JIT] Fix MLA header in TVM binding by @MasterJH5574 in #889
- Fixing several typos in doc file kv_layout.rst by @didier-durand in #884
- unittest: add unittests for MLA + cudagraph by @yzh119 in #890
New Contributors
- @baowendin made their first contribution in #870
- @kasohrab made their first contribution in #875
- @zhouye made their first contribution in #881
Full Changelog: v0.2.1.post2...v0.2.2
v0.2.1.post2
What's Changed
- use 3 latest pytorch version by @youkaichao in #835
- docs: update installation by @zhyncs in #839
- Update README.md: fixing a typo for "hierical" by @didier-durand in #836
- Update page.rst: fixing 1 typo by @didier-durand in #841
- Update README.md: fixing 1 typo by @didier-durand in #842
- adds TensorRT-LLM to the list of projects adopting FlashInfer by @yzh119 in #843
- perf: MLA decode kernel implemented by CuTe targeted to SM80 by @tsu-bin in #844
- Update installation.rst: fixing 2 typos by @didier-durand in #840
- fix: Pass backend in BatchPrefillWith*KVCacheWrapper.plan() by @sfc-gh-yewang in #808
- bugfix: Fix inline RoPE in decode kernels by @MasterJH5574 in #847
- misc: Remove duplicate param set in MLA kernel by @MasterJH5574 in #850
- feat: adding `out` and `lse` parameters to `run` functions to allow user allocated output buffer by @yzh119 in #854
- Unique the symbol of maybe_q_rope_offset_v. by @foreverlms in #855
- typo: update `decode_maybe_q_rope_offset` by @MasterJH5574 in #856
- update ci by @zhyncs in #857
- fix some compiler pre-check. by @foreverlms in #859
- perf: dynamic split-k for MLA by @yzh119 in #863
- Revert "fix: Pass backend in BatchPrefillWith*KVCacheWrapper.plan() (β¦ by @zhyncs in #864
- chore: bump v0.2.1.post2 by @zhyncs in #865
- fix compile by @zhyncs in #866
New Contributors
- @didier-durand made their first contribution in #836
- @sfc-gh-yewang made their first contribution in #808
- @foreverlms made their first contribution in #855
Full Changelog: v0.2.1.post1...v0.2.1.post2
v0.2.1.post1
What's Changed
- doc: Fix the incorrect DeepSeek-V3 paper link by @muoshuosha in #826
- bugfix: fix the signature of `CutlassSegmentGEMMSM90` by @yzh119 in #827
- redo ci: cross python wheel by @youkaichao in #824
- bugfix: Another bugfix for torch.library by @yzh119 in #828
- misc: fix parameters name by @Chen-0210 in #817
- bugfix: update `clear_cache_dir` in JIT by @yzh119 in #829
- update release wheel by @zhyncs in #830
- chore: bump v0.2.1.post1 by @zhyncs in #831
- fix #824 by @zhyncs in #832
- fix release wheel by @zhyncs in #833
- set pip path by @zhyncs in #834
New Contributors
- @muoshuosha made their first contribution in #826
- @Chen-0210 made their first contribution in #817
Full Changelog: v0.2.1...v0.2.1.post1
v0.2.1
What's Changed
- misc: addressing the package renaming issues by @yzh119 in #770
- feat: support deepseek prefill attention shape by @yzh119 in #765
- refactor: change the structure of attention updater by @yzh119 in #772
- hotfix: follow up of #772 by @yzh119 in #773
- bugfix: Ensure Loop Termination by Enforcing IEEE-754 Compliance in Sampling Kernels by @yzh119 in #774
- bugfix: fix the JIT warmup arguments in unittests by @yzh119 in #775
- ci: change whl folder to flashinfer-python by @abcdabcd987 in #779
- perf: refactor fa2 prefill template by @yzh119 in #776
- feat: Separate QK/VO head dim dispatch for sm90 AOT by @abcdabcd987 in #778
- bugfix: fix batch prefill attention kernel unittests by @yzh119 in #781
- misc: remove head dimension 64 from AOT by @yzh119 in #782
- misc: allow head_dim=64 for sm90 AOT by @abcdabcd987 in #783
- bugfix: drop CTA_TILE_Q=32 by @abcdabcd987 in #785
- refactor: make `group_size` a part of params by @yzh119 in #786
- bugfix: MLA decode should multiply sm_scale by math::log2e by @tsu-bin in #787
- fix rope logic in mla decoding by @zhyncs in #793
- Fix arguments of `plan` for split QK/VO head dims by @abmfy in #795
- test: add unittest comparing deepseek prefill fa2 & 3 implementation by @yzh119 in #797
- bugfix: fix aot build not compatible with cmake command by @tsu-bin in #796
- Fix the type annotation of q_dtype and kv_dtype on ragged prefill by @nandor in #798
- feat: support f32 attention output in FA2 template by @yzh119 in #799
- feat: apply sm_scale at logits instead of q in FA2 template by @yzh119 in #801
- bugfix: mla decode failed under cuda graph mode, and update test case by @tsu-bin in #803
- perf: memory efficient deepseek mla fused page-attention kernel by @yzh119 in #804
- bugfix: mla page-attention kernel for different page sizes by @yzh119 in #810
- doc: add documentation to new MLA interface by @yzh119 in #811
- feat: unlocking MLA for A100 by @yzh119 in #812
- feat: cudagraph-compatible MLA API by @yzh119 in #813
- feat: unlock MLA attention for sm89 (L40/L40s/4090) by @yzh119 in #814
- misc: fix sphinx by @abcdabcd987 in #815
- bugfix: fix the behavior of mla plan function when provided with host tensors by @yzh119 in #816
- doc: improve mla related documentation by @yzh119 in #818
- release: bump version to v0.2.1 by @yzh119 in #819
- refactor: change to TORCH_LIBRARY by @youkaichao in #764
- Revert "refactor: change to TORCH_LIBRARY" by @yzh119 in #820
- bugfix: bugfix on sm89 MLA by @yzh119 in #821
- hotfix: bugfix on #812 by @yzh119 in #822
- refactor: change to TORCH_LIBRARY by @abmfy in #823
New Contributors
Full Changelog: v0.2.0.post2...v0.2.1
v0.2.0.post2
What's Changed
- ci: fix the update_whl_index script to recognize version number with "post" and add torch2.5 by @yzh119 in #694
- bugfix: casting int array to int32 for rope input arguments by @yzh119 in #697
- bugfix: only use sm90 group gemm when torch cuda >= 12.3 by @yzh119 in #699
- misc: remove release-please workflow by @yzh119 in #705
- Customizable SM90 prefill kernels. by @hyhieu in #704
- hotfix: revert torch.library register by @yzh119 in #709
- Improve compatibility with pytorch 2.5 by @zifeitong in #711
- misc: add bibtex reference by @yzh119 in #712
- sampling: simplify min-p sampling by @yzh119 in #713
- perf: fix the iteration bound of SWA in FA2 prefill template by @yzh119 in #714
- bugfix: fix min-p AOT compilation in #713 by @yzh119 in #717
- Triton implementation of `silu_and_mul` by @nandor in #716
- bugfix: FusedAddRMSNorm kernels might require more than 48KB shared memory when d is large. by @bobboli in #718
- bugfix: Choose sm90 kernels only for Hopper GPUs. by @bobboli in #719
- Finer-grained control over fp16/fp8 builds by @nandor in #722
- Align KV chunk size binary search with actual KV chunk splitting. by @timzsu in #728
- ci: rename python package name to `flashinfer-python` by @yzh119 in #729
- Add a note about int32/int64 datatypes to the `kv_layout` tutorial by @fergusfinn in #737
- fix return type of cuBLAS by @zhyncs in #749
- [Refactor] Unify JIT/Customization/AOT mode by @yzh119 in #748
- Move allocations out of torch ops by @nandor in #740
- [Lint] Fix some linting issues and provide automatic format check script by @LeiWang1999 in #743
- Filter out unsupported head dim for sm90 by @abcdabcd987 in #751
- bugfix: various AOT issues by @abcdabcd987 in #752
- [bugfix] Fix cpp tests/benchmarks by @yzh119 in #753
- fix pin memory device by @youkaichao in #755
- Add dev container for easier development by @ByronHsu in #680
- hotfix: bugfix to #756 by @yzh119 in #757
- Change `apply_rope_with_cos_sin_cache` to accept `cos_sin_cache` by @ByronHsu in #754
- fix: match statement not supported in Python 3.8 by @xslingcn in #759
- bugfix: use actual sm count for num_sm90_ctas by @LLLLKKKK in #762
- bugfix: Fix block-sparse attention API by @yzh119 in #767
- Version bump: v0.2.0.post2 by @yzh119 in #768
New Contributors
- @hyhieu made their first contribution in #704
- @zifeitong made their first contribution in #711
- @bobboli made their first contribution in #718
- @timzsu made their first contribution in #728
- @fergusfinn made their first contribution in #737
- @LeiWang1999 made their first contribution in #743
- @youkaichao made their first contribution in #755
- @LLLLKKKK made their first contribution in #762
Full Changelog: v0.2.0.post1...v0.2.0.post2
v0.2.0.post1
0.2.0.post1 (2024-12-22)
Bug Fixes
v0.2.0
0.2.0 (2024-12-17)
Features
- add `rotary_dim` argument to rope APIs for partial apply rope (#599) (eb9bc71)
- add a `use_softmax` field in variant class (#533) (d81af97)
- add an option `non_blocking` to plan function (#622) (560af6f)
- add gemma_rmsnorm and gemma_fused_add_rmsnorm (#477) (1a6b17e)
- add group size 3 to GQA decode dispatch (#558) (6227562)
- add JIT compilation support for FA3 templates (#672) (d4e8d79)
- allow the cascade kernels to be executed using varying sequence lengths (#627) (92ac440)
- CUDAGraph compatibility of multi-level cascade inference APIs (#586) (2332e8a)
- fix the maximal grid dimension in prefill planning with CUDA graphs (#639) (86ca89a)
- improve the precision of the FusedAddRMSNormKernel function (#587) (c7dc921)
- JIT compilation (#507) (3613a5b)
- modify group-gemm stage number (#497) (52dab1d)
- non-contiguous query with paged kv cache (#553) (89f2c4a)
- pass a dynamic token count to the cascade kernels (#635) (5fe9f7d)
- simplify prefill JIT compilation (#605) (fe4f898)
- specify gemm backend (#648) (0cc1a51)
- support cached cos/sin in rope APIs (#585) (83e541d)
- support huggingface transformer style rope interface (#568) (4f40420)
- support sm90 cutlass group gemm (#509) (794bdda)
- torch custom_op fix for rope (#569) (3e104bc)
- torch custom_op support: norm (#552) (f6e0010)
- torch.compile and custom_op support (#554) (9bf916f)
- warmup for jit kernel tests (#629) (8f5f349)
Bug Fixes
- AOT compiler flags on non-sm90 (#522) (0aa4726)
- batch decode kernel redundant store output to gmem (#505) (90e42a7)
- compatible with torch 2.2 (#478) (ac41d1b)
- #452 (b53a46f)
- remove redundant load (#495) (2de16b0)
- update bmm fp8 test (#487) (45eac04)
Performance Improvements
- accelerate JIT compilation speed (#618) (eaf73fd)
- Dense and sparse customizable flashattention-3 template (#667) (51236c9)
- fix prefill kernel performance degradation (step 1) (#602) (595cf60)
- fix the performance issue of `append_paged_kv_cache` (#588) (e15f7c9)
- improve parallelism in RoPE with pos_ids (#609) (ff05155)
- improve plan performance by using non-blocking memcpy (#547) (41ebe6d)
- reduce the read and write of shared memory in the FusedAddRMSNormKernel (#592) (2043ca2)
- reduce total_num_tiles_q by one (#644) (553ace5)
- remove unnecessary contiguous operation in block sparse attention (#561) (7a7ad46)
- speedup jit compilation of prefill attention kernels (#632) (a059586)
- use cuda-core implementation for io-bound block-sparse attention (#560) (3fbf028)
v0.1.6
0.1.6 (2024-08-27)
SM75 Support
Starting from 0.1.6, our pre-built wheels include experimental support for sm75 (Turing architecture GPUs such as Tesla T4, Quadro RTX 6000 and RTX 2080).
API Changes
`plan`/`run`
Since 0.1.6, the `begin_forward`/`forward`/`end_forward` APIs are replaced with the new `plan`/`run` APIs.
- `forward` is renamed to `run`, which is more precise and consistent with the naming convention of cutlass's python API.
- `begin_forward` is renamed to `plan`, which is consistent with the naming convention of the nvmath API.
- `end_forward` is deprecated and has no effect after this PR.
There are some slight differences between the old `forward` and the new `run` API:
- All extra arguments such as `causal` and `logits_soft_cap` are provided in the `plan` (previously `begin_forward`) API and cached until the next `plan` call; only the query and KV-Cache tensors need to be provided in the `run` API.
The old `begin_forward`/`forward`/`end_forward` APIs are still functional, but we will gradually deprecate them in future releases.
Check #466 for more details.
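For reference, here is a minimal sketch of the new workflow using the batch decode wrapper with a paged KV-cache. The shapes, metadata tensors, and argument order below are illustrative assumptions based on the documented interface, not a verbatim excerpt; consult the API documentation for the exact signatures of `plan` and `run`.

```python
import torch
import flashinfer  # assumes flashinfer >= 0.1.6

# Illustrative sizes; shapes follow the paged KV-cache "NHD" layout.
num_qo_heads, num_kv_heads, head_dim, page_size = 32, 8, 128, 16
batch_size, pages_per_seq = 4, 2
num_pages = batch_size * pages_per_seq

workspace = torch.empty(128 * 1024 * 1024, dtype=torch.uint8, device="cuda")
wrapper = flashinfer.BatchDecodeWithPagedKVCacheWrapper(workspace, "NHD")

# Paged KV-cache: [num_pages, 2 (K/V), page_size, num_kv_heads, head_dim].
kv_cache = torch.randn(num_pages, 2, page_size, num_kv_heads, head_dim,
                       dtype=torch.float16, device="cuda")
kv_indptr = torch.arange(batch_size + 1, dtype=torch.int32, device="cuda") * pages_per_seq
kv_indices = torch.arange(num_pages, dtype=torch.int32, device="cuda")
kv_last_page_len = torch.full((batch_size,), page_size, dtype=torch.int32, device="cuda")

# plan() (formerly begin_forward()) receives the metadata and any extra
# options; they are cached until the next plan() call.
wrapper.plan(kv_indptr, kv_indices, kv_last_page_len,
             num_qo_heads, num_kv_heads, head_dim, page_size)

# run() (formerly forward()) only needs the query and the paged KV-cache.
q = torch.randn(batch_size, num_qo_heads, head_dim, dtype=torch.float16, device="cuda")
out = wrapper.run(q, kv_cache)  # [batch_size, num_qo_heads, head_dim]
```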
MultiLevelCascadeAttentionWrapper
Since 0.1.6, we introduce a new `MultiLevelCascadeAttentionWrapper` API for cascade inference, which supports multi-level cascade inference where all levels' KV-Cache can be managed in a unified paged KV-Cache.
See the documentation and tutorial for API usage and layout explanation.
The old `BatchDecodeWithSharedPrefixPagedKVCacheWrapper` and `BatchPrefillWithSharedPrefixPagedKVCacheWrapper` will be deprecated in future releases.
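As a rough illustration, the sketch below sets up a two-level cascade (level 0: shared prefix, level 1: per-request unique suffixes) over a single unified paged KV-cache. The index layout and the `plan` argument order are assumptions for illustration only; see the cascade inference documentation for the authoritative usage.

```python
import torch
import flashinfer  # assumes flashinfer >= 0.1.6

num_qo_heads, num_kv_heads, head_dim, page_size = 32, 8, 128, 16
batch_size, shared_pages, unique_pages = 4, 8, 2  # unique_pages is per request
total_pages = shared_pages + batch_size * unique_pages
i32 = dict(dtype=torch.int32, device="cuda")

workspace = torch.empty(128 * 1024 * 1024, dtype=torch.uint8, device="cuda")
wrapper = flashinfer.MultiLevelCascadeAttentionWrapper(2, workspace, "NHD")

# All cascade levels index into one unified paged KV-cache.
kv_cache = torch.randn(total_pages, 2, page_size, num_kv_heads, head_dim,
                       dtype=torch.float16, device="cuda")

# Level 0: the shared prefix -- a single KV segment covering every query row.
qo_indptr_top = torch.tensor([0, batch_size], **i32)
kv_indptr_top = torch.tensor([0, shared_pages], **i32)
kv_indices_top = torch.arange(shared_pages, **i32)
kv_last_len_top = torch.tensor([page_size], **i32)

# Level 1: per-request unique suffixes (decode: one query token per request).
qo_indptr_bot = torch.arange(batch_size + 1, **i32)
kv_indptr_bot = torch.arange(batch_size + 1, **i32) * unique_pages
kv_indices_bot = torch.arange(shared_pages, total_pages, **i32)
kv_last_len_bot = torch.full((batch_size,), page_size, **i32)

# One set of indptr/indices per level, from the topmost (most shared) level down.
wrapper.plan(
    [qo_indptr_top, qo_indptr_bot],
    [kv_indptr_top, kv_indptr_bot],
    [kv_indices_top, kv_indices_bot],
    [kv_last_len_top, kv_last_len_bot],
    num_qo_heads, num_kv_heads, head_dim, page_size,
)

q = torch.randn(batch_size, num_qo_heads, head_dim, dtype=torch.float16, device="cuda")
out = wrapper.run(q, kv_cache)
```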
Features
- sm75 support (#448, #449)
- add `MultiLevelCascadeAttentionWrapper` API (#462) (1e37989)
- add accept num, emit num metric for ChainSpeculativeSampling (#450) (fa38b5e)
- support bmm fp8 (#469) (f1c0b68)
Refactor
- refactor: replace `begin_forward`/`forward`/`end_forward` with `plan`/`run` (#466)
Misc
Performance Improvements
- slight optimization on f16->f8 fragment layout swizzling (#453) (0d61871)
- slight optimization on fragment layout swizzle (#458) (7c397cb)
- use persistent kernel for merging attention states (#459) (be6bf5b)
Acknowledgement
We thank @LiuXiaoxuanPKU for enhancing the speculative sampling operator, @merrymercy for API change suggestions, and @zhyncs for integrating the fp8 BMM cuBLAS implementation.
v0.1.5
0.1.5 (2024-08-13)
Bugfix
- Fix PagedPrefill python api and some typos (#441) (3fff008)
- fix prefill kernels' lse result for empty kv-cache (#440) (6ac28f4)
Features
Performance Improvements
Acknowledgement
We thank the community for their contributions and feedback: @comaniac, @hnyls2002, @jianfei-wangg, @Yard1.