CUDA/HIP header-only library for writing vectorized and low-precision (16-bit, 8-bit) GPU kernels
performance cpp gpu cuda kernel-tuner hip vectorization floating-point half-precision mixed-precision low-precision bfloat16 header-only-library reduced-precision
Updated Apr 11, 2025 - C++