For study purposes.

Implemented attention variants:
- Naive Attention
- Attention with KV cache
- Attention with non-contiguous memory
- Single Query Attention with non-contiguous KV cache (PagedAttention with block size 1)
- Multi Query Attention with non-contiguous KV cache (for Speculative Decoding)
- Rotary Embedding
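
As a rough illustration of the "Single Query Attention with non-contiguous KV cache" item, here is a minimal pure-Python sketch. It is not the code in this repository: the names `single_query_attention`, `paged_single_query_attention`, `key_pool`, `value_pool`, and `slot_table` are hypothetical, and it assumes a single head, block size 1, and a standard `1/sqrt(d)` scaling. The idea is that each token's key and value live in an arbitrary slot of a shared pool, and a per-sequence slot table records where each position is stored.

```python
import math


def softmax(scores):
    # Subtract the max for numerical stability before exponentiating.
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def single_query_attention(q, keys, values):
    # Scaled dot-product attention for one query vector over contiguous K/V lists.
    scale = 1.0 / math.sqrt(len(q))
    scores = [scale * sum(qi * ki for qi, ki in zip(q, k)) for k in keys]
    weights = softmax(scores)
    dim = len(values[0])
    return [sum(w * v[d] for w, v in zip(weights, values)) for d in range(dim)]


def paged_single_query_attention(q, key_pool, value_pool, slot_table):
    # slot_table[i] is the pool slot holding the key/value of token position i
    # (block size 1, so one slot per token). Slots need not be adjacent or ordered.
    keys = [key_pool[slot] for slot in slot_table]
    values = [value_pool[slot] for slot in slot_table]
    return single_query_attention(q, keys, values)


# Contiguous reference data: three cached tokens, head dimension 2.
q = [0.5, -0.5]
keys = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
values = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]

# Hypothetical pool of 6 slots; the sequence's tokens are scattered over slots 4, 1, 3.
slot_table = [4, 1, 3]
key_pool = [[0.0, 0.0] for _ in range(6)]
value_pool = [[0.0, 0.0] for _ in range(6)]
for position, slot in enumerate(slot_table):
    key_pool[slot] = keys[position]
    value_pool[slot] = values[position]

dense_out = single_query_attention(q, keys, values)
paged_out = paged_single_query_attention(q, key_pool, value_pool, slot_table)
print(dense_out)
print(paged_out)
```

Both calls should print the same vector, since the slot table only changes where the keys and values are stored, not which ones are attended to. A real kernel would gather from the pool inside the attention loop instead of materializing contiguous copies first.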