Decoding Attention is specially optimized for MHA, MQA, GQA, and MLA (multi-head, multi-query, grouped-query, and multi-head latent attention) using CUDA cores for the decoding stage of LLM inference.
Topics: gpu, cuda, inference, nvidia, mha, mla, multi-head-attention, gqa, mqa, llm, large-language-model, flash-attention, cuda-core, decoding-attention, flashinfer, flashmla
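For intuition, here is a minimal CPU reference sketch of what decode-stage attention computes: a single query token per head attends over the cached keys and values, with GQA/MQA handled by sharing each KV head across a group of query heads. This is not the library's CUDA implementation or API; the function name, signature, and buffer layout are illustrative assumptions.

```cpp
// Minimal CPU reference sketch of decode-stage attention (query length = 1).
// Covers MHA (num_q_heads == num_kv_heads) and GQA/MQA (num_q_heads > num_kv_heads).
// All names and layouts here are illustrative, not the library's API.
#include <algorithm>
#include <cmath>
#include <cstddef>
#include <vector>

void decode_attention(const std::vector<float>& q,        // [num_q_heads, head_dim]
                      const std::vector<float>& k_cache,  // [seq_len, num_kv_heads, head_dim]
                      const std::vector<float>& v_cache,  // [seq_len, num_kv_heads, head_dim]
                      std::vector<float>& out,            // [num_q_heads, head_dim]
                      std::size_t seq_len,
                      std::size_t num_q_heads,
                      std::size_t num_kv_heads,
                      std::size_t head_dim) {
    const float scale = 1.0f / std::sqrt(static_cast<float>(head_dim));
    const std::size_t group = num_q_heads / num_kv_heads;  // query heads per KV head

    for (std::size_t h = 0; h < num_q_heads; ++h) {
        const std::size_t kv_h = h / group;

        // 1. Scores: dot product of the single query with every cached key.
        std::vector<float> scores(seq_len);
        float max_score = -INFINITY;
        for (std::size_t t = 0; t < seq_len; ++t) {
            float dot = 0.0f;
            for (std::size_t d = 0; d < head_dim; ++d)
                dot += q[h * head_dim + d] *
                       k_cache[(t * num_kv_heads + kv_h) * head_dim + d];
            scores[t] = dot * scale;
            max_score = std::max(max_score, scores[t]);
        }

        // 2. Numerically stable softmax over the sequence dimension.
        float denom = 0.0f;
        for (std::size_t t = 0; t < seq_len; ++t) {
            scores[t] = std::exp(scores[t] - max_score);
            denom += scores[t];
        }

        // 3. Weighted sum of cached values.
        for (std::size_t d = 0; d < head_dim; ++d) {
            float acc = 0.0f;
            for (std::size_t t = 0; t < seq_len; ++t)
                acc += scores[t] *
                       v_cache[(t * num_kv_heads + kv_h) * head_dim + d];
            out[h * head_dim + d] = acc / denom;
        }
    }
}

int main() {
    const std::size_t seq_len = 8, num_q_heads = 8, num_kv_heads = 2, head_dim = 64;
    std::vector<float> q(num_q_heads * head_dim, 0.1f);
    std::vector<float> k(seq_len * num_kv_heads * head_dim, 0.1f);
    std::vector<float> v(seq_len * num_kv_heads * head_dim, 0.2f);
    std::vector<float> out(num_q_heads * head_dim, 0.0f);
    decode_attention(q, k, v, out, seq_len, num_q_heads, num_kv_heads, head_dim);
    return 0;
}
```

Because the decode step processes only one query token per request, the work is dominated by reading the KV cache rather than by large matrix multiplies, which is why a CUDA-core (rather than tensor-core) kernel can be the better fit here.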