Extremely fast CPU 1D convolutions. Faster than Intel IPP and Apple Accelerate on their respective platforms
Kernel size = 245
It's well know that convolution in the time domain is equivalent to multiplication in the frequency domain (circular convolution). With the Fast Fourier Transform, we can reduce the time complexity of a discrete convolution from O(n^2)
to O(n log(n))
, where n
is the larger of the two array sizes. The overlap-add method is a fast convolution method commonly use in FIR filtering, where the discrete signal is often much longer than the FIR filter kernel.
Check this repo to see how to use fftconv as a custom port through VCPKG.
fftconv::convolve_fftw
implements FFT convolution.fftconv::oaconvolve_fftw
implements FFT convolution using the overlap-add method, much faster when one sequence is much longer than the other (e.g. in FIR filtering).
All convolution functions support float
and double
and use a C++20 std::span
interface.
template <FloatOrDouble Real>
void oaconvolve_fftw(const std::span<const Real> arr,
const std::span<const Real> kernel, std::span<Real> res);
Python bindings are provided through Cython.
This benchmark is out of date. Check this repo for the up-to-date benchmarks.
The only dependency of fftconv
is fftw3. Since the float and double interface of fftw3
are used, link with -lfftw -lfftwf
.
Benchmark and test dependencies:
- fftw3
- armadillo (benchmarked against as a baseline)
- google-benchmark used for benchmarking.
- gperftools used for profiling.
Python
TODO The Python wrapper is currently out of date.
A Cython wrapper is provided. Dependencies:
Cython
for C++ bindingsnumpy
(benchmarked against)numba
(benchmarked against)scipy
(benchmarked against)matplotlib
(plot results)
python3 setup.py build_ext -i
python3 test.py # run the python test/benchmark
CPU: Intel i7 Comet Lake
C++.
The test_fftconv
binary gives an easy benchmark that runs every test case 5000 times. The bench_fftconv
uses google-benchmark
and gives much more reliable measures. Use ./script/run_bench
to run the benchmark and generate figures.
Output from bench_fftconv
(accurate bench) raw result saved in ./bench_result.json
. Plot generated from plot_bench.py
:
Output from test_fftconv
(simple bench)
% ./build/test_fftconv
=== test case (1664, 65) ===
All tests passed.
(5000 runs) convolve_fftw took 82ms
(5000 runs) oaconvolve_fftw took 36ms
(5000 runs) convolve_armadillo took 108ms
=== test case (2816, 65) ===
All tests passed.
(5000 runs) convolve_fftw took 111ms
(5000 runs) oaconvolve_fftw took 60ms
(5000 runs) convolve_armadillo took 174ms
=== test case (2304, 65) ===
All tests passed.
(5000 runs) convolve_fftw took 536ms
(5000 runs) oaconvolve_fftw took 52ms
(5000 runs) convolve_armadillo took 147ms
=== test case (4352, 65) ===
All tests passed.
(5000 runs) convolve_fftw took 335ms
(5000 runs) oaconvolve_fftw took 86ms
(5000 runs) convolve_armadillo took 276ms
Python.
% python3 test.py
=== test case (1664, 65) ===
Vectors are equal.
(5000 runs) convolve_fftw took 73ms
(5000 runs) oaconvolve_fftw took 38ms
(5000 runs) np.convolve took 140ms
(5000 runs) numba.njit(np.convolve) took 1409ms
(5000 runs) scipy.signal.convolve took 162ms
(5000 runs) scipy.signal.fftconvolve took 199ms
(5000 runs) scipy.signal.oaconvolve took 321ms
=== test case (2816, 65) ===
Vectors are equal.
(5000 runs) convolve_fftw took 96ms
(5000 runs) oaconvolve_fftw took 60ms
(5000 runs) np.convolve took 236ms
(5000 runs) numba.njit(np.convolve) took 2883ms
(5000 runs) scipy.signal.convolve took 256ms
(5000 runs) scipy.signal.fftconvolve took 256ms
(5000 runs) scipy.signal.oaconvolve took 362ms
=== test case (2304, 65) ===
Vectors are equal.
(5000 runs) convolve_fftw took 281ms
(5000 runs) oaconvolve_fftw took 53ms
(5000 runs) np.convolve took 194ms
(5000 runs) numba.njit(np.convolve) took 2215ms
(5000 runs) scipy.signal.convolve took 213ms
(5000 runs) scipy.signal.fftconvolve took 240ms
(5000 runs) scipy.signal.oaconvolve took 346ms
=== test case (4352, 65) ===
Vectors are equal.
(5000 runs) convolve_fftw took 326ms
(5000 runs) oaconvolve_fftw took 82ms
(5000 runs) np.convolve took 358ms
(5000 runs) numba.njit(np.convolve) took 3657ms
(5000 runs) scipy.signal.convolve took 378ms
(5000 runs) scipy.signal.fftconvolve took 365ms
(5000 runs) scipy.signal.oaconvolve took 395ms
The Python wrapper is almost as fast as the C++ code, as it has very little overhead.
TODO