feat(gpu): add circulant matrix for one vs many poly product #2030

Open
wants to merge 1 commit into
base: main
from

@andrei-stoian-zama andrei-stoian-zama commented Feb 5, 2025

To support the product of an encrypted GLWE with a clear matrix, every polynomial of the GLWE is multiplied with the clear matrix. For each polynomial of the GLWE, this PR builds a circulant matrix and multiplies it with the clear matrix to obtain the polynomial product. Sample N of this product contains the matrix product of the original clear vector that was encrypted and the clear matrix.
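As a small illustration of why a single coefficient of the polynomial product can carry a dot product, here is a pure-Python sketch. It assumes arithmetic modulo X^N + 1 with 64-bit wrap-around and assumes the clear column is stored in reversed coefficient order; neither the storage order nor the exact coefficient index used by the PR is stated in the description, and the function names are hypothetical.

```python
MOD = 1 << 64  # assumed torus modulus (u64 arithmetic wraps around)


def negacyclic_product(a, b):
    """Schoolbook product of a and b modulo X^N + 1, coefficients mod 2^64."""
    n = len(a)
    out = [0] * n
    for i, ai in enumerate(a):
        for j, bj in enumerate(b):
            k = i + j
            if k < n:
                out[k] = (out[k] + ai * bj) % MOD
            else:
                # X^N = -1, so terms that wrap around change sign
                out[k - n] = (out[k - n] - ai * bj) % MOD
    return out


def dot_via_polynomials(vec, column):
    """Dot product read from the last coefficient of a negacyclic product."""
    # Assumption: the column is reversed before being treated as a polynomial
    product = negacyclic_product(vec, column[::-1])
    return product[-1]


vec = [1, 2, 3, 4]
column = [5, 6, 7, 8]
print(dot_via_polynomials(vec, column))  # 1*5 + 2*6 + 3*7 + 4*8 = 70
```

No wrapped term can reach the last coefficient, because the largest index sum is 2N - 2, so that coefficient is a plain dot product.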

Running make test_core_crypto_gpu now includes two new tests:

test core_crypto::gpu::algorithms::test::glwe_dot_product_with_clear::test_gpu_glwe_dot_product_with_clear_test_params_4_bits_native_u64 ... ok

test core_crypto::gpu::algorithms::test::glwe_dot_product_with_clear::test_gpu_poly_product_with_clear_test_params_4_bits_native_u64 ... ok

@cla-bot cla-bot bot added the cla-signed label Feb 5, 2025
@andrei-stoian-zama andrei-stoian-zama force-pushed the as/add_circulant_poly_product branch 3 times, most recently from 5a2a836 to 022856b Compare February 26, 2025 12:36
decoded_value
}

fn glwe_dot_product_with_clear<Scalar: UnsignedTorus + CastFrom<usize>>(

This test checks the polynomial product without noise.

@andrei-stoian-zama andrei-stoian-zama force-pushed the as/add_circulant_poly_product branch from 022856b to f177e42 Compare February 26, 2025 12:55
// to any matrix dimension
template <typename Torus, typename TorusVec>
__global__ void tgemm(int M, int N, int K, const Torus *A, const Torus *B,
int stride_B, Torus *C, int stride_C) {

I added a stride_C parameter because I use this function to write the output matrix into a larger buffer (a GLWE list) whose row stride is greater than N, the width of C.
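To make the role of the output stride concrete, here is a minimal CPU sketch in Python. It assumes row-major storage and a tightly packed A, which the excerpt of tgemm does not confirm; gemm_strided is a hypothetical stand-in, not the kernel itself.

```python
def gemm_strided(M, N, K, A, B, stride_B, C, stride_C):
    """Write the M x N product A @ B into C, whose rows are stride_C apart."""
    for i in range(M):
        for j in range(N):
            acc = 0
            for k in range(K):
                # Assumption: A is packed row-major, B rows are stride_B apart
                acc += A[i * K + k] * B[k * stride_B + j]
            # Only the first N entries of each output row are written
            C[i * stride_C + j] = acc % (1 << 64)


# A 2 x 2 result written into a buffer whose rows are 4 elements wide
A = [1, 2, 3, 4]  # 2 x 2, row-major
B = [5, 6, 7, 8]  # 2 x 2, row-major, stride_B == 2
C = [0] * 8       # 2 rows of width 4
gemm_strided(2, 2, 2, A, B, 2, C, 4)
print(C)  # [19, 22, 0, 0, 43, 50, 0, 0]
```

The untouched entries stand for the rest of the larger buffer, which the product must not overwrite.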

// values into their new positions. The elements above the diagonal
// are multiplied by -1
template <typename Torus>
__global__ void polynomial_make_circulant(Torus *result, const Torus *poly,
@andrei-stoian-zama andrei-stoian-zama Feb 26, 2025

Here is the Python equivalent of the algorithm, with bi = blockIdx.x, bj = blockIdx.y, ti = threadIdx.x, and tj = threadIdx.y:


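import numpy as np


def make_circulant_transpose_cuda(v1):
    N = v1.shape[0]
    result = np.zeros((N * N,), dtype=np.uint64)

    BS = 4  # block tile size; N must be a multiple of BS
    for bi in range(0, N // BS):
        for bj in range(0, N // BS):
            # Shared buffer: a BS x BS tile only needs 2 * BS - 1 input values
            buf = np.zeros((2 * BS - 1), dtype=np.uint64)

            block_start = bi * BS * N + bj * BS

            # Load phase: only threads with tid < 2 * BS - 1 read from v1
            for ti in range(BS):
                for tj in range(BS):
                    tid = ti * BS + tj
                    if tid < 2 * BS - 1:
                        read_idx_start = (bj - bi) * BS + tid - BS + 1
                        if read_idx_start < 0:
                            read_idx_start = N + read_idx_start
                        buf[tid] = v1[read_idx_start]

            # Sync threads

            # Write phase: every thread writes one element of the tile
            for ti in range(BS):
                for tj in range(BS):
                    value = int(buf[tj - ti + BS - 1])
                    if bi * BS + ti > bj * BS + tj:
                        # Negate modulo 2**64, matching unsigned wrap-around on
                        # the GPU (uint64 * -1 is not a valid NumPy operation)
                        value = (-value) % (1 << 64)
                    result[block_start + ti * N + tj] = value

    return result.reshape((N, N))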
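As a cross-check for the tiled loop above, each entry of its output also has a closed form: entry (r, c) is v1[(c - r) mod N], negated when r > c. The pure-Python sketch below builds the matrix from that formula and confirms that row r holds the coefficients of X^r * v1(X) modulo X^N + 1. The function names are hypothetical and the 2^64 modulus is an assumption based on the uint64 dtype.

```python
MOD = 1 << 64  # assumed: coefficients wrap modulo 2^64


def circulant_transpose_reference(v):
    """Entry (r, c) is v[(c - r) mod N], negated when r > c."""
    n = len(v)
    return [
        [(-v[(c - r) % n]) % MOD if r > c else v[(c - r) % n] for c in range(n)]
        for r in range(n)
    ]


def multiply_by_x_power(v, r):
    """Coefficients of X^r * v(X) modulo X^N + 1, for 0 <= r < N."""
    n = len(v)
    out = [0] * n
    for j, vj in enumerate(v):
        k = j + r
        # Coefficients pushed past degree N - 1 wrap around with a sign flip
        out[k % n] = vj if k < n else (-vj) % MOD
    return out


v = [1, 2, 3, 4]
rows = circulant_transpose_reference(v)
assert all(rows[r] == multiply_by_x_power(v, r) for r in range(len(v)))
print(rows[1])  # [18446744073709551612, 1, 2, 3]
```

Because row r is X^r * v1(X), multiplying a row vector by this matrix yields the negacyclic product of that vector with v1.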


int32_t tid = threadIdx.x * CIRCULANT_BLOCKTILE + threadIdx.y;

if (tid < 2 * CIRCULANT_BLOCKTILE - 1) {

Only the first two rows of threads in the block read data, since a block of block_tile x block_tile threads needs only 2*block_tile-1 input values.
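The guard can be checked by enumeration. This sketch lists the threads that pass `tid < 2 * tile - 1` and the distinct buffer slots read when writing a tile, using the index expression `tj - ti + BS - 1` from the Python reference above. The tile size of 4 is taken from that reference; the value of CIRCULANT_BLOCKTILE in the kernel may differ.

```python
def loading_threads(block_tile):
    """Threads (ti, tj) whose flat id passes the `tid < 2 * tile - 1` guard."""
    return [
        (ti, tj)
        for ti in range(block_tile)
        for tj in range(block_tile)
        if ti * block_tile + tj < 2 * block_tile - 1
    ]


def offsets_used_by_tile(block_tile):
    """Distinct buffer slots tj - ti + tile - 1 read when writing the tile."""
    return sorted(
        {
            tj - ti + block_tile - 1
            for ti in range(block_tile)
            for tj in range(block_tile)
        }
    )


tile = 4  # matches BS in the Python reference; the CUDA constant may differ
print(loading_threads(tile))
# [(0, 0), (0, 1), (0, 2), (0, 3), (1, 0), (1, 1), (1, 2)]
print(offsets_used_by_tile(tile))
# [0, 1, 2, 3, 4, 5, 6]
```

The loaders are all of row 0 plus all but the last thread of row 1, and together they fill exactly the 2 * tile - 1 slots the tile reads.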

@andrei-stoian-zama andrei-stoian-zama marked this pull request as ready for review February 26, 2025 13:01
@andrei-stoian-zama andrei-stoian-zama force-pushed the as/add_circulant_poly_product branch 2 times, most recently from 7100709 to 4694773 Compare February 26, 2025 13:22
@andrei-stoian-zama andrei-stoian-zama marked this pull request as draft February 27, 2025 10:15
@andrei-stoian-zama andrei-stoian-zama force-pushed the as/add_circulant_poly_product branch from 4694773 to f99d91b Compare February 27, 2025 14:14
) {
let mut rng = rand::thread_rng();

let poly_size = 2 << rng.gen_range(8usize..12);

The test sometimes runs with n_polys = poly_size and sometimes with an arbitrary number of polynomials.
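A rough Python sketch of this kind of shape sampling is shown below. Only the poly_size expression mirrors the Rust line above; the 50/50 split and the upper bound on n_polys are assumptions made for illustration, and sample_test_shape is a hypothetical helper, not code from the PR.

```python
import random


def sample_test_shape(rng):
    """Pick a polynomial size and a polynomial count for one test run."""
    # Mirrors `2 << rng.gen_range(8usize..12)`: exponent in [8, 12),
    # so poly_size is one of 512, 1024, 2048, 4096
    poly_size = 2 << rng.randrange(8, 12)
    # Assumption: a coin flip between the square case and an arbitrary count
    if rng.random() < 0.5:
        n_polys = poly_size
    else:
        # Assumption: any count from 1 up to twice the polynomial size
        n_polys = rng.randrange(1, 2 * poly_size + 1)
    return poly_size, n_polys


rng = random.Random(0)  # seeded so a failing shape can be reproduced
poly_size, n_polys = sample_test_shape(rng)
print(poly_size, n_polys)
```

Covering both the square case and other counts exercises the matrix product on non-square shapes as well.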

@andrei-stoian-zama andrei-stoian-zama marked this pull request as ready for review February 27, 2025 14:16
@andrei-stoian-zama andrei-stoian-zama force-pushed the as/add_circulant_poly_product branch from f99d91b to cec3d0e Compare March 3, 2025 09:02