In my opinion, when loading data from global memory into shared memory (i.e. writing shared memory) with vectorized access, the transposition means that threads within a warp may write to the same column of shared memory.
For example, thread 0 reads `A[0][0]` to `A[0][3]` and thread 1 reads `A[0][4]` to `A[0][7]`, so thread 0 writes `As[0][0]` to `As[3][0]` and thread 1 writes `As[4][0]` to `As[7][0]`. For an `As` of size `BM(=128) * BK(=8)`, `As[0][0]` and `As[4][0]` are on the same bank (their offsets differ by a multiple of 128 floats, which is a multiple of the 32 banks), causing a bank conflict.
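To make the example above concrete, here is a minimal host-side sketch that counts how many threads of one warp land on the same bank when each thread stores one component of its `float4` into the transposed `As`. It is only a model under assumptions of mine, not code from the repository: 32 banks of 4-byte words, a warp of 32 threads, the thread-to-element mapping implied by the example (thread `tid` loads row `tid / 2`, columns `4 * (tid % 2)` to `4 * (tid % 2) + 3`), and a row stride of `BM` floats for `As`. The names `max_conflict_degree_As`, `kNumBanks` and `kWarpSize` are hypothetical.

```cpp
// Host-side model of shared-memory bank usage; no GPU is required.
// Assumptions (not taken from the repository): 32 banks of 4-byte words,
// 32 threads per warp, and the thread mapping described in the lead-in.
constexpr int kNumBanks = 32;
constexpr int kWarpSize = 32;
constexpr int BK = 8;  // tile depth, as in the example above

// Returns the largest number of threads in one warp that hit the same bank
// when every thread stores component `component` (0..3) of its float4 into
// a transposed As whose rows are `row_stride` floats apart.
constexpr int max_conflict_degree_As(int component, int row_stride) {
  int hits[kNumBanks] = {};
  const int threads_per_row = BK / 4;  // two float4 loads cover one row of A
  for (int tid = 0; tid < kWarpSize; ++tid) {
    const int inner_row = tid / threads_per_row;  // row of A read by this thread
    const int inner_col = tid % threads_per_row;  // which float4 within that row
    // Transposed store: A[inner_row][inner_col * 4 + component] goes to
    // As[inner_col * 4 + component][inner_row].
    const int offset = (inner_col * 4 + component) * row_stride + inner_row;
    ++hits[offset % kNumBanks];
  }
  int worst = 0;
  for (int bank = 0; bank < kNumBanks; ++bank) {
    if (hits[bank] > worst) worst = hits[bank];
  }
  return worst;
}
```

Under these assumptions, `max_conflict_degree_As(c, 128)` evaluates to 2 for every component `c`: threads `2t` and `2t + 1` write different addresses in the same bank, which matches the `As[0][0]` / `As[4][0]` case described above.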
So I think bank conflicts will only occur when writing `As`, not `Bs`. But in kernels v7 and v8, it seems like you try to optimize writing to `Bs`:
SGEMM_CUDA/src/kernels/8_kernel_bank_extra_col.cuh, lines 56 to 60 in 60cba6f
Did I understand something wrong?