-
Using CUDA cores to do the GEMV will be better. We just released a new GEMV in 3.1. @NVJiangShao
-
In the Transformer decoder architecture, the query's sequence length is 1, while the key and value sequence length equals the current timestep; at each timestep the new K and V are concatenated onto the KV cache.
So in the FMHA, both matmuls are effectively GEMV:
(B, H, 1, K) matmul (B, H, M, K) — Q against the cached keys
(B, H, 1, M) matmul (B, H, M, K) — softmax scores against the cached values
The cutlass FMHA uses Tensor Cores, but since the query seq_len is 1, which is smaller than the mma instruction's "m" size (for example, only 1 of the 16 rows of an m16n8k16 tile carries real data), some compute resources are wasted.
If I switch the matmul to CUDA cores instead, can I get a further improvement? I'm not sure about this.
Actually, FasterTransformer's masked_multihead_attention kernel uses CUDA cores for this, but I tested it on an A100 and found the cutlass FMHA performs better than FT's (about +10%).
https://github.com/NVIDIA/FasterTransformer/blob/main/src/fastertransformer/kernels/decoder_masked_multihead_attention/decoder_masked_multihead_attention_template.hpp#L1105-L1939
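
For reference, here is a minimal sketch of what a CUDA-core GEMV for the score step (the (B, H, 1, K) matmul (B, H, M, K) part) could look like: one thread block per (batch, head), each warp computing dot products against a strided set of cached keys. The kernel, fp32 dtype, and sizes are illustrative assumptions only, not the FasterTransformer or cutlass implementation.

```cuda
// Sketch only: Q*K^T for one decode step done purely on CUDA cores.
// One block per (batch, head); each warp handles a strided subset of the M cached timesteps.
#include <cstdio>
#include <cuda_runtime.h>

constexpr int kHeadDim = 128;   // K: head dimension (assumed)
constexpr int kWarpSize = 32;

// scores[bh][m] = sum_k q[bh][k] * k_cache[bh][m][k]
// q: (B, H, 1, K)   k_cache: (B, H, M, K)   scores: (B, H, 1, M)
__global__ void gemv_qk_scores(const float* __restrict__ q,
                               const float* __restrict__ k_cache,
                               float* __restrict__ scores,
                               int M) {
    const int bh = blockIdx.x;                       // flattened (batch, head) index
    const float* q_vec  = q + (size_t)bh * kHeadDim;
    const float* k_base = k_cache + (size_t)bh * M * kHeadDim;
    float* out = scores + (size_t)bh * M;

    const int warp_id = threadIdx.x / kWarpSize;
    const int lane    = threadIdx.x % kWarpSize;
    const int warps_per_block = blockDim.x / kWarpSize;

    // Each warp walks over a strided set of cached timesteps m.
    for (int m = warp_id; m < M; m += warps_per_block) {
        const float* k_vec = k_base + (size_t)m * kHeadDim;
        float partial = 0.f;
        // Lanes stride over the head dimension: a plain CUDA-core dot product.
        for (int k = lane; k < kHeadDim; k += kWarpSize) {
            partial += q_vec[k] * k_vec[k];
        }
        // Warp-level tree reduction of the partial sums.
        for (int offset = kWarpSize / 2; offset > 0; offset /= 2) {
            partial += __shfl_down_sync(0xffffffff, partial, offset);
        }
        if (lane == 0) out[m] = partial;
    }
}

int main() {
    const int B = 2, H = 4, M = 256;                 // illustrative sizes
    const size_t q_elems = (size_t)B * H * kHeadDim;
    const size_t k_elems = (size_t)B * H * M * kHeadDim;
    const size_t s_elems = (size_t)B * H * M;

    float *q, *k_cache, *scores;
    cudaMallocManaged(&q, q_elems * sizeof(float));
    cudaMallocManaged(&k_cache, k_elems * sizeof(float));
    cudaMallocManaged(&scores, s_elems * sizeof(float));
    for (size_t i = 0; i < q_elems; ++i) q[i] = 0.01f * (i % 7);
    for (size_t i = 0; i < k_elems; ++i) k_cache[i] = 0.02f * (i % 5);

    gemv_qk_scores<<<B * H, 128>>>(q, k_cache, scores, M);
    cudaDeviceSynchronize();
    printf("scores[0] = %f\n", scores[0]);

    cudaFree(q); cudaFree(k_cache); cudaFree(scores);
    return 0;
}
```

Since the single query vector is reused across all M cached keys while each key row is read only once, a kernel like this is typically bound by K-cache bandwidth rather than compute, which is also why the Tensor Core path mainly wastes compute rather than memory traffic.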