Releases: flashinfer-ai/flashinfer
v0.2.2.post1
What's Changed
- bump version to v0.2.2 by @yzh119 in #891
- perf: fix the performance of second stage of split-k by @yzh119 in #894
- fix: pin_memory use cpu as default device by @KnowingNothing in #895
- perf: tweak register amount for producer/consumer in MLA template by @yzh119 in #896
- perf: fix MLA split-k performance bug by @yzh119 in #898
- perf: use f16 as split-k partial output data type by @yzh119 in #900
- perf: tweak the pipeline design of mla kernel by @yzh119 in #901
Full Changelog: v0.2.2...v0.2.2.post1
v0.2.2
What's Changed
- fix cu121 torch2.6 by @zhyncs in #867
- unittest: add MLA test cases where kv_len is evenly divided by page_size. by @foreverlms in #861
- bugfix: fix the behavior of MLA kernel when kv-length is 0 by @yzh119 in #868
- Merge of previous PRs for typos in a single one. As per your request. by @didier-durand in #862
- add lightllm adoption by @zhyncs in #871
- fix geneate_dispatch_inc args from parser by @baowendin in #870
- [API] Fix top_k_top_p_sampling_from_logits param typo by @kasohrab in #875
- misc: Remove unused k_smem_offset_w update in MLA kernel by @muoshuosha in #878
- JIT compilation support for TVM by @MasterJH5574 in #880
- [Hotfix] Add flashinfer.jit.attention into packages by @zhouye in #881
- perf: FlashAttention-3 style MLA PageAttention by @yzh119 in #887
- [JIT] Fix MLA header in TVM binding by @MasterJH5574 in #889
- Fixing several typos in doc file kv_layout.rst by @didier-durand in #884
- unittest: add unittests for MLA + cudagraph by @yzh119 in #890
New Contributors
- @baowendin made their first contribution in #870
- @kasohrab made their first contribution in #875
- @zhouye made their first contribution in #881
Full Changelog: v0.2.1.post2...v0.2.2
v0.2.1.post2
What's Changed
- use 3 latest pytorch version by @youkaichao in #835
- docs: update installation by @zhyncs in #839
- Update README.md: fixing a typo for "hierical" by @didier-durand in #836
- Update page.rst: fixing 1 typo by @didier-durand in #841
- Update README.md: fixing 1 typo by @didier-durand in #842
- adds TensorRT-LLM to the list of projects adopting FlashInfer by @yzh119 in #843
- perf: MLA decode kernel implemented by CuTe targeted to SM80 by @tsu-bin in #844
- Update installation.rst: fixing 2 typos by @didier-durand in #840
- fix: Pass backend in BatchPrefillWith*KVCacheWrapper.plan() by @sfc-gh-yewang in #808
- bugfix: Fix inline RoPE in decode kernels by @MasterJH5574 in #847
- misc: Remove duplicate param set in MLA kernel by @MasterJH5574 in #850
- feat: adding `out` and `lse` parameters to `run` functions to allow user allocated output buffer by @yzh119 in #854
- Unique the symbol of maybe_q_rope_offset_v. by @foreverlms in #855
- typo: update `decode_maybe_q_rope_offset` by @MasterJH5574 in #856
- update ci by @zhyncs in #857
- fix some compiler pre-check. by @foreverlms in #859
- perf: dynamic split-k for MLA by @yzh119 in #863
- Revert "fix: Pass backend in BatchPrefillWith*KVCacheWrapper.plan() (β¦ by @zhyncs in #864
- chore: bump v0.2.1.post2 by @zhyncs in #865
- fix compile by @zhyncs in #866
New Contributors
- @didier-durand made their first contribution in #836
- @sfc-gh-yewang made their first contribution in #808
- @foreverlms made their first contribution in #855
Full Changelog: v0.2.1.post1...v0.2.1.post2
v0.2.1.post1
What's Changed
- doc: Fix the incorrect DeepSeek-V3 paper link by @muoshuosha in #826
- bugfix: fix the signature of `CutlassSegmentGEMMSM90` by @yzh119 in #827
- redo ci: cross python wheel by @youkaichao in #824
- bugfix: Another bugfix for torch.library by @yzh119 in #828
- misc: fix parameters name by @Chen-0210 in #817
- bugfix: update `clear_cache_dir` in JIT by @yzh119 in #829
- update release wheel by @zhyncs in #830
- chore: bump v0.2.1.post1 by @zhyncs in #831
- fix #824 by @zhyncs in #832
- fix release wheel by @zhyncs in #833
- set pip path by @zhyncs in #834
New Contributors
- @muoshuosha made their first contribution in #826
- @Chen-0210 made their first contribution in #817
Full Changelog: v0.2.1...v0.2.1.post1
v0.2.1
What's Changed
- misc: addressing the package renaming issues by @yzh119 in #770
- feat: support deepseek prefill attention shape by @yzh119 in #765
- refactor: change the structure of attention updater by @yzh119 in #772
- hotfix: follow up of #772 by @yzh119 in #773
- bugfix: Ensure Loop Termination by Enforcing IEEE-754 Compliance in Sampling Kernels by @yzh119 in #774
- bugfix: fix the JIT warmup arguments in unittests by @yzh119 in #775
- ci: change whl folder to flashinfer-python by @abcdabcd987 in #779
- perf: refactor fa2 prefill template by @yzh119 in #776
- feat: Separate QK/VO head dim dispatch for sm90 AOT by @abcdabcd987 in #778
- bugfix: fix batch prefill attention kernel unittests by @yzh119 in #781
- misc: remove head dimension 64 from AOT by @yzh119 in #782
- misc: allow head_dim=64 for sm90 AOT by @abcdabcd987 in #783
- bugfix: drop CTA_TILE_Q=32 by @abcdabcd987 in #785
- refactor: make `group_size` a part of params by @yzh119 in #786
- bugfix: MLA decode should multiply sm_scale by math::log2e by @tsu-bin in #787
- fix rope logic in mla decoding by @zhyncs in #793
- Fix arguments of `plan` for split QK/VO head dims by @abmfy in #795
- test: add unittest comparing deepseek prefill fa2 & 3 implementation by @yzh119 in #797
- bugfix: fix aot build not compatible with cmake command by @tsu-bin in #796
- Fix the type annotation of q_dtype and kv_dtype on ragged prefill by @nandor in #798
- feat: support f32 attention output in FA2 template by @yzh119 in #799
- feat: apply sm_scale at logits instead of q in FA2 template by @yzh119 in #801
- bugfix: mla decode failed under cuda graph mode, and update test case by @tsu-bin in #803
- perf: memory efficient deepseek mla fused page-attention kernel by @yzh119 in #804
- bugfix: mla page-attention kernel for different page sizes by @yzh119 in #810
- doc: add documentation to new MLA interface by @yzh119 in #811
- feat: unlocking MLA for A100 by @yzh119 in #812
- feat: cudagraph-compatible MLA API by @yzh119 in #813
- feat: unlock MLA attention for sm89 (L40/L40s/4090) by @yzh119 in #814
- misc: fix sphinx by @abcdabcd987 in #815
- bugfix: fix the behavior of mla plan function when provided with host tensors by @yzh119 in #816
- doc: improve mla related documentation by @yzh119 in #818
- release: bump version to v0.2.1 by @yzh119 in #819
- refactor: change to TORCH_LIBRARY by @youkaichao in #764
- Revert "refactor: change to TORCH_LIBRARY" by @yzh119 in #820
- bugfix: bugfix on sm89 MLA by @yzh119 in #821
- hotfix: bugfix on #812 by @yzh119 in #822
- refactor: change to TORCH_LIBRARY by @abmfy in #823
New Contributors
Full Changelog: v0.2.0.post2...v0.2.1
v0.2.0.post2
What's Changed
- ci: fix the update_whl_index script to recognize version number with "post" and add torch2.5 by @yzh119 in #694
- bugfix: casting int array to int32 for rope input arguments by @yzh119 in #697
- bugfix: only use sm90 group gemm when torch cuda >= 12.3 by @yzh119 in #699
- misc: remove release-please workflow by @yzh119 in #705
- Customizable SM90 prefill kernels. by @hyhieu in #704
- hotfix: revert torch.library register by @yzh119 in #709
- Improve compatibility with pytorch 2.5 by @zifeitong in #711
- misc: add bibtex reference by @yzh119 in #712
- sampling: simplify min-p sampling by @yzh119 in #713
- perf: fix the iteration bound of SWA in FA2 prefill template by @yzh119 in #714
- bugfix: fix min-p AOT compilation in #713 by @yzh119 in #717
- Triton implementation of `silu_and_mul` by @nandor in #716
- bugfix: FusedAddRMSNorm kernels might require more than 48KB shared memory when d is large. by @bobboli in #718
- bugfix: Choose sm90 kernels only for Hopper GPUs. by @bobboli in #719
- Finer-grained control over fp16/fp8 builds by @nandor in #722
- Align KV chunk size binary search with actual KV chunk splitting. by @timzsu in #728
- ci: rename python package name to `flashinfer-python` by @yzh119 in #729
- Add a note about int32/int64 datatypes to the `kv_layout` tutorial by @fergusfinn in #737
- fix return type of cuBLAS by @zhyncs in #749
- [Refactor] Unify JIT/Customization/AOT mode by @yzh119 in #748
- Move allocations out of torch ops by @nandor in #740
- [Lint] Fix some linting issues and provide automatic format check script by @LeiWang1999 in #743
- Filter out unsupported head dim for sm90 by @abcdabcd987 in #751
- bugfix: various AOT issues by @abcdabcd987 in #752
- [bugfix] Fix cpp tests/benchmarks by @yzh119 in #753
- fix pin memory device by @youkaichao in #755
- Add dev container for easier development by @ByronHsu in #680
- hotfix: bugfix to #756 by @yzh119 in #757
- Change `apply_rope_with_cos_sin_cache` to accept `cos_sin_cache` by @ByronHsu in #754
- fix: match statement not supported in Python 3.8 by @xslingcn in #759
- bugfix: use actual sm count for num_sm90_ctas by @LLLLKKKK in #762
- bugfix: Fix block-sparse attention API by @yzh119 in #767
- Version bump: v0.2.0.post2 by @yzh119 in #768
New Contributors
- @hyhieu made their first contribution in #704
- @zifeitong made their first contribution in #711
- @bobboli made their first contribution in #718
- @timzsu made their first contribution in #728
- @fergusfinn made their first contribution in #737
- @LeiWang1999 made their first contribution in #743
- @youkaichao made their first contribution in #755
- @LLLLKKKK made their first contribution in #762
Full Changelog: v0.2.0.post1...v0.2.0.post2
v0.2.0.post1
0.2.0.post1 (2024-12-22)
Bug Fixes
v0.2.0
0.2.0 (2024-12-17)
Features
- add `rotary_dim` argument to rope APIs for partial apply rope (#599) (eb9bc71)
- add a `use_softmax` field in variant class (#533) (d81af97)
- add an option `non_blocking` to plan function (#622) (560af6f)
- add gemma_rmsnorm and gemma_fused_add_rmsnorm (#477) (1a6b17e)
- add group size 3 to GQA decode dispatch (#558) (6227562)
- add JIT compilation support for FA3 templates (#672) (d4e8d79)
- allow the cascade kernels to be executed using varying sequence lengths (#627) (92ac440)
- CUDAGraph compatibility of multi-level cascade inference APIs (#586) (2332e8a)
- fix the maximal grid dimension in prefill planning with CUDA graphs (#639) (86ca89a)
- improve the precision of the FusedAddRMSNormKernel function (#587) (c7dc921)
- JIT compilation (#507) (3613a5b)
- modify group-gemm stage number (#497) (52dab1d)
- non-contiguous query with paged kv cache (#553) (89f2c4a)
- pass a dynamic token count to the cascade kernels (#635) (5fe9f7d)
- simplify prefill JIT compilation (#605) (fe4f898)
- specify gemm backend (#648) (0cc1a51)
- support cached cos/sin in rope APIs (#585) (83e541d)
- support huggingface transformer style rope interface (#568) (4f40420)
- support sm90 cutlass group gemm (#509) (794bdda)
- torch custom_op fix for rope (#569) (3e104bc)
- torch custom_op support: norm (#552) (f6e0010)
- torch.compile and custom_op support (#554) (9bf916f)
- warmup for jit kernel tests (#629) (8f5f349)
Bug Fixes
- AOT compiler flags on non-sm90 (#522) (0aa4726)
- batch decode kernel redundant store output to gmem (#505) (90e42a7)
- compatible with torch 2.2 (#478) (ac41d1b)
- #452 (b53a46f)
- remove redundant load (#495) (2de16b0)
- update bmm fp8 test (#487) (45eac04)
Performance Improvements
- accelerate JIT compilation speed (#618) (eaf73fd)
- Dense and sparse customizable flashattention-3 template (#667) (51236c9)
- fix prefill kernel performance degradation (step 1) (#602) (595cf60)
- fix the performance issue of `append_paged_kv_cache` (#588) (e15f7c9)
- improve parallelism in RoPE with pos_ids (#609) (ff05155)
- improve plan performance by using non-blocking memcpy (#547) (41ebe6d)
- reduce the read and write of shared memory in the FusedAddRMSNormKernel (#592) (2043ca2)
- reduce total_num_tiles_q by one (#644) (553ace5)
- remove unnecessary contiguous operation in block sparse attention (#561) (7a7ad46)
- speedup jit compilation of prefill attention kernels (#632) (a059586)
- use cuda-core implementation for io-bound block-sparse attention (#560) (3fbf028)
v0.1.6
0.1.6 (2024-08-27)
SM75 Support
Starting from 0.1.6, our pre-built wheels include experimental support for sm75 (Turing architecture GPUs such as Tesla T4, Quadro RTX 6000 and RTX 2080).
API Changes
`plan`/`run`
Since 0.1.6, the `begin_forward`/`forward`/`end_forward` APIs are replaced with the new `plan`/`run` APIs.
- `forward` is renamed to `run`, which is more precise and consistent with the naming convention of cutlass's python API.
- `begin_forward` is renamed to `plan`, which is consistent with the naming convention of the nvmath API.
- `end_forward` is deprecated and has no effect after this PR.
There are some slight differences between the old `forward` and the new `run` API:
- All extra arguments such as `causal` and `logits_soft_cap` are provided in the `plan` (previously `begin_forward`) API and cached until the next `plan` call; only the query and KV-Cache tensors need to be provided in the `run` API.
The old `begin_forward`/`forward`/`end_forward` APIs are still functional, but we will gradually deprecate them in future releases.
Check #466 for more details.
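For reference, here is a minimal sketch of the new workflow using the batch decode wrapper with a paged KV-cache. The shapes, metadata tensors, and argument order below are illustrative assumptions based on the documented interface, not a verbatim excerpt; consult the API documentation for the exact signatures of `plan` and `run`.

```python
import torch
import flashinfer  # assumes flashinfer >= 0.1.6

# Illustrative sizes; shapes follow the paged KV-cache "NHD" layout.
num_qo_heads, num_kv_heads, head_dim, page_size = 32, 8, 128, 16
batch_size, pages_per_seq = 4, 2
num_pages = batch_size * pages_per_seq

workspace = torch.empty(128 * 1024 * 1024, dtype=torch.uint8, device="cuda")
wrapper = flashinfer.BatchDecodeWithPagedKVCacheWrapper(workspace, "NHD")

# Paged KV-cache: [num_pages, 2 (K/V), page_size, num_kv_heads, head_dim].
kv_cache = torch.randn(num_pages, 2, page_size, num_kv_heads, head_dim,
                       dtype=torch.float16, device="cuda")
kv_indptr = torch.arange(batch_size + 1, dtype=torch.int32, device="cuda") * pages_per_seq
kv_indices = torch.arange(num_pages, dtype=torch.int32, device="cuda")
kv_last_page_len = torch.full((batch_size,), page_size, dtype=torch.int32, device="cuda")

# plan() (formerly begin_forward()) receives the metadata and any extra
# options; they are cached until the next plan() call.
wrapper.plan(kv_indptr, kv_indices, kv_last_page_len,
             num_qo_heads, num_kv_heads, head_dim, page_size)

# run() (formerly forward()) only needs the query and the paged KV-cache.
q = torch.randn(batch_size, num_qo_heads, head_dim, dtype=torch.float16, device="cuda")
out = wrapper.run(q, kv_cache)  # [batch_size, num_qo_heads, head_dim]
```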
MultiLevelCascadeAttentionWrapper
Since 0.1.6, we introduce a new `MultiLevelCascadeAttentionWrapper` API for cascade inference, which supports multi-level cascade inference where all levels' KV-Cache can be managed in a unified paged KV-Cache.
See the documentation and tutorial for API usage and layout explanation.
The old `BatchDecodeWithSharedPrefixPagedKVCacheWrapper` and `BatchPrefillWithSharedPrefixPagedKVCacheWrapper` will be deprecated in future releases.
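As a rough illustration, the sketch below sets up a two-level cascade (level 0: shared prefix, level 1: per-request unique suffixes) over a single unified paged KV-cache. The index layout and the `plan` argument order are assumptions for illustration only; see the cascade inference documentation for the authoritative usage.

```python
import torch
import flashinfer  # assumes flashinfer >= 0.1.6

num_qo_heads, num_kv_heads, head_dim, page_size = 32, 8, 128, 16
batch_size, shared_pages, unique_pages = 4, 8, 2  # unique_pages is per request
total_pages = shared_pages + batch_size * unique_pages
i32 = dict(dtype=torch.int32, device="cuda")

workspace = torch.empty(128 * 1024 * 1024, dtype=torch.uint8, device="cuda")
wrapper = flashinfer.MultiLevelCascadeAttentionWrapper(2, workspace, "NHD")

# All cascade levels index into one unified paged KV-cache.
kv_cache = torch.randn(total_pages, 2, page_size, num_kv_heads, head_dim,
                       dtype=torch.float16, device="cuda")

# Level 0: the shared prefix -- a single KV segment covering every query row.
qo_indptr_top = torch.tensor([0, batch_size], **i32)
kv_indptr_top = torch.tensor([0, shared_pages], **i32)
kv_indices_top = torch.arange(shared_pages, **i32)
kv_last_len_top = torch.tensor([page_size], **i32)

# Level 1: per-request unique suffixes (decode: one query token per request).
qo_indptr_bot = torch.arange(batch_size + 1, **i32)
kv_indptr_bot = torch.arange(batch_size + 1, **i32) * unique_pages
kv_indices_bot = torch.arange(shared_pages, total_pages, **i32)
kv_last_len_bot = torch.full((batch_size,), page_size, **i32)

# One set of indptr/indices per level, from the topmost (most shared) level down.
wrapper.plan(
    [qo_indptr_top, qo_indptr_bot],
    [kv_indptr_top, kv_indptr_bot],
    [kv_indices_top, kv_indices_bot],
    [kv_last_len_top, kv_last_len_bot],
    num_qo_heads, num_kv_heads, head_dim, page_size,
)

q = torch.randn(batch_size, num_qo_heads, head_dim, dtype=torch.float16, device="cuda")
out = wrapper.run(q, kv_cache)
```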
Features
- sm75 support (#448, #449)
- add `MultiLevelCascadeAttentionWrapper` API (#462) (1e37989)
- add accept num, emit num metric for ChainSpeculativeSampling (#450) (fa38b5e)
- support bmm fp8 (#469) (f1c0b68)
Refactor
- refactor: replace `begin_forward`/`forward`/`end_forward` with `plan`/`run` (#466)
Misc
Performance Improvements
- slight optimization on f16->f8 fragment layout swizzling (#453) (0d61871)
- slight optimization on fragment layout swizzle (#458) (7c397cb)
- use persistent kernel for merging attention states (#459) (be6bf5b)
Acknowledgement
We thank @LiuXiaoxuanPKU for enhancing the speculative sampling operator, @merrymercy for API change suggestions, and @zhyncs for integrating the fp8 BMM cuBLAS implementation.
v0.1.5
0.1.5 (2024-08-13)
Bugfix
- Fix PagedPrefill python api and some typos (#441) (3fff008)
- fix prefill kernels' lse result for empty kv-cache (#440) (6ac28f4)
Features
Performance Improvements
Acknowledgement
We thank the community for their contributions and feedback: @comaniac, @hnyls2002, @jianfei-wangg, @Yard1.