-
Using CUDA cores to do the GEMV will be better. We just released a new GEMV in 3.1. @NVJiangShao
-
In the Transformer decoder architecture, the query's sequence length is 1, while the key and value sequence length equals the current timestep; at each timestep the new K and V are concatenated onto the KV cache.
So in the FMHA, both matmuls are effectively GEMV:
(B, H, 1, K) matmul (B, H, M, K) — Q against the cached keys
(B, H, 1, M) matmul (B, H, M, K) — softmax scores against the cached values
The cutlass FMHA uses Tensor Cores, but since the query seq_len is 1, which is smaller than the mma instruction's "m" size (for example, only 1 of the 16 rows of an m16n8k16 tile carries real data), some compute resources are wasted.
If I switch the matmul to CUDA cores instead, can I get a further improvement? I'm not sure about this.
Actually, FasterTransformer's masked_multihead_attention kernel uses CUDA cores for this, but I tested it on an A100 and found the cutlass FMHA performs better than FT's (about +10%).
https://github.com/NVIDIA/FasterTransformer/blob/main/src/fastertransformer/kernels/decoder_masked_multihead_attention/decoder_masked_multihead_attention_template.hpp#L1105-L1939
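
For reference, here is a minimal sketch of what a CUDA-core GEMV for the score step (the (B, H, 1, K) matmul (B, H, M, K) part) could look like: one thread block per (batch, head), each warp computing dot products against a strided set of cached keys. The kernel, fp32 dtype, and sizes are illustrative assumptions only, not the FasterTransformer or cutlass implementation.

```cuda
// Sketch only: Q*K^T for one decode step done purely on CUDA cores.
// One block per (batch, head); each warp handles a strided subset of the M cached timesteps.
#include <cstdio>
#include <cuda_runtime.h>

constexpr int kHeadDim = 128;   // K: head dimension (assumed)
constexpr int kWarpSize = 32;

// scores[bh][m] = sum_k q[bh][k] * k_cache[bh][m][k]
// q: (B, H, 1, K)   k_cache: (B, H, M, K)   scores: (B, H, 1, M)
__global__ void gemv_qk_scores(const float* __restrict__ q,
                               const float* __restrict__ k_cache,
                               float* __restrict__ scores,
                               int M) {
    const int bh = blockIdx.x;                       // flattened (batch, head) index
    const float* q_vec  = q + (size_t)bh * kHeadDim;
    const float* k_base = k_cache + (size_t)bh * M * kHeadDim;
    float* out = scores + (size_t)bh * M;

    const int warp_id = threadIdx.x / kWarpSize;
    const int lane    = threadIdx.x % kWarpSize;
    const int warps_per_block = blockDim.x / kWarpSize;

    // Each warp walks over a strided set of cached timesteps m.
    for (int m = warp_id; m < M; m += warps_per_block) {
        const float* k_vec = k_base + (size_t)m * kHeadDim;
        float partial = 0.f;
        // Lanes stride over the head dimension: a plain CUDA-core dot product.
        for (int k = lane; k < kHeadDim; k += kWarpSize) {
            partial += q_vec[k] * k_vec[k];
        }
        // Warp-level tree reduction of the partial sums.
        for (int offset = kWarpSize / 2; offset > 0; offset /= 2) {
            partial += __shfl_down_sync(0xffffffff, partial, offset);
        }
        if (lane == 0) out[m] = partial;
    }
}

int main() {
    const int B = 2, H = 4, M = 256;                 // illustrative sizes
    const size_t q_elems = (size_t)B * H * kHeadDim;
    const size_t k_elems = (size_t)B * H * M * kHeadDim;
    const size_t s_elems = (size_t)B * H * M;

    float *q, *k_cache, *scores;
    cudaMallocManaged(&q, q_elems * sizeof(float));
    cudaMallocManaged(&k_cache, k_elems * sizeof(float));
    cudaMallocManaged(&scores, s_elems * sizeof(float));
    for (size_t i = 0; i < q_elems; ++i) q[i] = 0.01f * (i % 7);
    for (size_t i = 0; i < k_elems; ++i) k_cache[i] = 0.02f * (i % 5);

    gemv_qk_scores<<<B * H, 128>>>(q, k_cache, scores, M);
    cudaDeviceSynchronize();
    printf("scores[0] = %f\n", scores[0]);

    cudaFree(q); cudaFree(k_cache); cudaFree(scores);
    return 0;
}
```

Since the single query vector is reused across all M cached keys while each key row is read only once, a kernel like this is typically bound by K-cache bandwidth rather than compute, which is also why the Tensor Core path mainly wastes compute rather than memory traffic.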