feat(gpu): add circulant matrix for one vs many poly product #2030
base: main
Conversation
Force-pushed from 5a2a836 to 022856b
```rust
    decoded_value
}

fn glwe_dot_product_with_clear<Scalar: UnsignedTorus + CastFrom<usize>>(
```
This test checks the poly product without noise.
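A noiseless check like this one needs a ground truth to compare against. Below is a hedged sketch (not the PR's code) of a schoolbook reference for the product of two polynomials mod X^N + 1 over Z_{2^64}, the kind of reference result such a test could use:

```python
# Hedged sketch: schoolbook negacyclic polynomial product mod X^N + 1,
# with coefficients reduced mod 2^64 (plain Python ints, no numpy needed).
def negacyclic_polymul(a, b, q=2**64):
    n = len(a)
    res = [0] * n
    for i in range(n):
        for j in range(n):
            # X^(i+j) wraps around to -X^(i+j-n) modulo X^N + 1
            if i + j < n:
                res[i + j] = (res[i + j] + a[i] * b[j]) % q
            else:
                res[i + j - n] = (res[i + j - n] - a[i] * b[j]) % q
    return res
```

For example, multiplying by X rotates the coefficients and negates the one that wraps: `negacyclic_polymul([0, 1, 0, 0], [5, 6, 7, 8])` gives `[2**64 - 8, 5, 6, 7]`.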
Force-pushed from 022856b to f177e42
```cuda
// to any matrix dimension
template <typename Torus, typename TorusVec>
__global__ void tgemm(int M, int N, int K, const Torus *A, const Torus *B,
                      int stride_B, Torus *C, int stride_C) {
```
I added a `stride_C` parameter, since I use this function to write the output matrix into a bigger buffer (a GLWE list) that has a larger stride than the width of `C`, which is `N`.
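To illustrate why the extra stride matters, here is a hedged Python model of the kernel's addressing (the real `tgemm` is a tiled CUDA kernel; only the indexing is mirrored here):

```python
# Hedged model: C is M x N but lives inside a wider row-major buffer, so
# element (i, j) is written at i * stride_C + j rather than i * N + j.
def tgemm_ref(M, N, K, A, B, stride_B, C, stride_C):
    for i in range(M):
        for j in range(N):
            acc = 0
            for k in range(K):
                acc += A[i * K + k] * B[k * stride_B + j]
            C[i * stride_C + j] = acc

# Writing a 2x2 product into a buffer whose rows are 3 wide (stride_C = 3):
A = [1, 2, 3, 4]
B = [5, 6, 7, 8]
C = [0] * 6
tgemm_ref(2, 2, 2, A, B, 2, C, 3)
# C == [19, 22, 0, 43, 50, 0]: the last slot of each row is untouched padding
```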
```cuda
// values into their new positions. The elements above the diagonal
// are multiplied by -1
template <typename Torus>
__global__ void polynomial_make_circulant(Torus *result, const Torus *poly,
```
Here is the Python equivalent of the algorithm, with `bi = blockIdx.x`, `bj = blockIdx.y`, `ti = threadIdx.x`, `tj = threadIdx.y`:

```python
import numpy as np

def make_circulant_transpose_cuda(v1):
    N = v1.shape[0]
    result = np.zeros((N * N,), dtype=np.uint64)
    BS = 4  # CIRCULANT_BLOCKTILE
    for bi in range(0, N // BS):
        for bj in range(0, N // BS):
            # shared buffer: the 2*BS - 1 input values this block needs
            buf = np.zeros((2 * BS - 1,), dtype=np.uint64)
            block_start = bi * BS * N + bj * BS
            for ti in range(BS):
                for tj in range(BS):
                    tid = ti * BS + tj
                    if tid < 2 * BS - 1:
                        read_idx_start = (bj - bi) * BS + tid - BS + 1
                        if read_idx_start < 0:
                            read_idx_start = N + read_idx_start
                        buf[tid] = v1[read_idx_start]
            # sync threads
            for ti in range(BS):
                for tj in range(BS):
                    val = int(buf[tj - ti + BS - 1])
                    if bi * BS + ti > bj * BS + tj:
                        val = (1 << 64) - val if val else 0  # negate mod 2^64
                    result[block_start + ti * N + tj] = val
    return result.reshape((N, N))
```
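As a sanity check on what the kernel builds, the matrix has a simple closed form (the helper names below are mine, not the PR's): element (i, j) is `poly[(j - i) mod N]`, negated when i > j, so a row vector times this matrix reproduces the negacyclic product of the two polynomials.

```python
# Hedged cross-check of the transposed negacyclic circulant, mod 2^64.
def circulant_transpose(poly, q=2**64):
    n = len(poly)
    return [[poly[(j - i) % n] * (1 if i <= j else -1) % q for j in range(n)]
            for i in range(n)]

def row_times_matrix(v, m, q=2**64):
    n = len(v)
    return [sum(v[i] * m[i][j] for i in range(n)) % q for j in range(n)]

m = circulant_transpose([1, 2, 3, 4])
# Multiplying by the basis vector for X (i.e. [0, 1, 0, 0]) selects row 1:
# X * (1 + 2X + 3X^2 + 4X^3) = -4 + X + 2X^2 + 3X^3  mod X^4 + 1
```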
```cuda
int32_t tid = threadIdx.x * CIRCULANT_BLOCKTILE + threadIdx.y;

if (tid < 2 * CIRCULANT_BLOCKTILE - 1) {
```
Only the first two rows of threads in the block read data, since we only need `2 * block_tile - 1` values for a block of `block_tile x block_tile` threads.
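A small hedged illustration of which threads load: for `CIRCULANT_BLOCKTILE = 4`, a 4x4 thread block needs only 2*4 - 1 = 7 distinct input values, so only threads with `tid < 7` read (all of thread row 0 plus the first three threads of row 1).

```python
# Which (ti, tj) pairs pass the tid < 2*BS - 1 guard for BS = 4 (sketch).
BS = 4
readers = [(ti, tj) for ti in range(BS) for tj in range(BS)
           if ti * BS + tj < 2 * BS - 1]
# 7 loading threads: row 0 entirely, then (1, 0), (1, 1), (1, 2)
```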
Force-pushed from 7100709 to 4694773
Force-pushed from 4694773 to f99d91b
```rust
) {
    let mut rng = rand::thread_rng();

    let poly_size = 2 << rng.gen_range(8usize..12);
```
Sometimes check with `n_polys = poly_size`, sometimes check with an arbitrary number of polys.
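For reference, the sizing expression draws a power of two; a sketch mirroring the Rust `2 << rng.gen_range(8usize..12)` (note the exclusive upper bound):

```python
import random

# poly_size is 2 << k for k in {8, 9, 10, 11}, i.e. one of 512..4096.
poly_size = 2 << random.randrange(8, 12)
```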
f99d91b
to
cec3d0e
Compare
To support the encrypted GLWE × clear matrix product, all polys of the GLWE are multiplied with the clear matrix. For each poly of the GLWE, this PR builds a circulant matrix, which is multiplied with the clear matrix to obtain the polynomial product. Sample N of this product contains the product of the original clear vector (the one that was encrypted) with the clear matrix.
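The core trick can be sketched in a few lines. This is a hedged illustration only: the exact encoding and which coefficient holds the result are my assumptions, not taken from the PR. With the clear column embedded in reverse order, coefficient N-1 of the negacyclic product of the two polynomials equals the dot product of the two vectors, so a full matrix product reduces to polynomial products.

```python
# Hedged sketch: only the i + j == N-1 terms land on coefficient N-1 of
# a*b mod X^N + 1, all with sign +1, so that coefficient is a dot product.
def coeff_last_of_negacyclic_product(a, b, q=2**64):
    n = len(a)
    return sum(a[i] * b[n - 1 - i] for i in range(n)) % q

v = [3, 1, 4, 1]          # vector hidden inside the GLWE (shown in the clear)
w = [2, 7, 1, 8]          # one column of the clear matrix
p_w = list(reversed(w))   # clear polynomial: w_j placed at degree N-1-j
dot = coeff_last_of_negacyclic_product(v, p_w)
# dot == <v, w>
```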
Now, with `make test_core_crypto_gpu`, there are two new tests: