RVV/R5V implementation and optimization #371
FFTs (including FFTW3's codelets) tend to require a lot of registers unless they are quite small. Anything that increases register pressure, including split mode and larger LMUL (which is effectively equivalent to unrolling for this use case), is likely to cause excessive spilling of registers, which will degrade performance. Also, some instructions do not like larger LMUL performance-wise.

Nonetheless, there's certainly room for improvement (it has only been tested on my K1), but I'd say the compilers should currently be the primary target. I'm attaching a subset of benchmarks on the K1, and as a remark: (a) gcc (blue/cyan) is better than LLVM (red/orange) overall.

There should be more consistent behavior across the software stack before doing much more tuning on the library itself. This is gcc-14 and clang-18 from a recent Debian Sid, so it's not an issue of outdated tools (though LLVM moves quickly). A quick glance at the ASM suggests that excessive spilling is already a problem in some of the slower variants. Under a similar test setup, the SVE implementation is much more stable in behavior across compilers.

doit.ALL.double.1d.drxx.p2.pdf

Edit: it was clang-18, not -16; the numbers were produced in mid-August '24, when 18 was still current.
Register pressure is a complicated issue for which we had a more-or-less "optimal" solution that worked at some point and may have been broken by more recent compilers.

The computation of the FFT is abstractly a directed acyclic graph (DAG) of additions and multiplications, and in principle you can compute the DAG in any order you want, as long as you compute the inputs to an operation before performing the operation. Some orders are worse than others, however. For example, you can compute the "butterfly" graph one level at a time from input to output, but this basically causes a spill for every operation unless the size is small. Another way is to compute sub-butterflies with R inputs if you know you have approximately R registers. The FFTW codelets are specifically generated so that, if you perform the operations in the order specified by the codelet, the number of spills is more or less minimal, no matter what R is. This magic property is called "cache obliviousness".

The problem now is that compilers assume that programmers don't know what they are doing, so the first thing the compiler does is ignore the order in the code and just re-schedule everything. The standard compiler schedule (breadth-first) is the worst possible one. In the past, the -fno-schedule-insns flag to gcc was enough to disable the reordering compiler pass. Make sure you are using it. If this hack no longer works, we'll need to think of something else.
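The effect of evaluation order on register pressure can be seen even on a toy DAG. The sketch below (an illustration I'm adding, not FFTW's actual scheduling; it uses a plain binary addition tree rather than a real butterfly graph) counts the peak number of simultaneously live values for a breadth-first schedule versus a depth-first one:

```python
# Proxy for register pressure: the maximum number of values that are
# simultaneously "live" (already computed but still needed) under a schedule.
def max_live(schedule, deps):
    last_use = {}
    for t, node in enumerate(schedule):
        for d in deps[node]:
            last_use[d] = t          # record when each value is last consumed
    live, peak = set(), 0
    for t, node in enumerate(schedule):
        live.add(node)
        peak = max(peak, len(live))
        for d in deps[node]:
            if last_use[d] == t:     # value dies at its last use
                live.discard(d)
    return peak

# Binary addition tree over 8 leaves (a stand-in for a DAG of adds/muls;
# names like x0 and s0_8 are illustrative only).  build_tree also records
# the natural depth-first (post-order) schedule.
def build_tree(lo, hi, deps, postorder):
    if hi - lo == 1:
        name = f"x{lo}"
        deps[name] = []
        postorder.append(name)
        return name
    mid = (lo + hi) // 2
    left = build_tree(lo, mid, deps, postorder)
    right = build_tree(mid, hi, deps, postorder)
    name = f"s{lo}_{hi}"
    deps[name] = [left, right]
    postorder.append(name)
    return name

deps, dfs_order = {}, []
build_tree(0, 8, deps, dfs_order)

# Breadth-first: all leaves first, then each tree level in turn
# (roughly what a default compiler scheduler tends to produce).
def level(name):
    if name.startswith("x"):
        return (1, int(name[1:]))
    lo, hi = map(int, name[1:].split("_"))
    return (hi - lo, lo)

bfs_order = sorted(deps, key=level)

print("breadth-first peak:", max_live(bfs_order, deps))  # 9 live values
print("depth-first peak:  ", max_live(dfs_order, deps))  # 5 live values
```

With only 8 leaves the breadth-first schedule already needs nearly twice as many live values as the depth-first one; on real codelet sizes that gap is what turns into spills when the compiler re-schedules.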
@matteo-frigo The issue is that on RISC-V V 1.0 there's no support for complex operations in "interleaved" mode (real/imaginary interleaved in even/odd lanes), unlike SVE, where instructions like FCMLA and FCADD are quite useful.
But complex operations should not really be needed, as long as you can reasonably shuffle even and odd elements. SSE and SSE2 have the same problem, and we solved it by saying that a complex codelet first does the multiplications by twiddle factors, where in the worst case we pay the full cost and move on; then it performs vector real FFTs on the [re, im, re, im] data, which act as a split FFT of half the size; and then we combine real + i*imag together at the end. The first and last steps are perhaps suboptimal, but the bulk of the computation doesn't do any shuffles or complex arithmetic. Then SSE3 came out, which has complex multiplication, and I concluded it wasn't worth the hassle; the old scheme was good enough.
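The twiddle-multiply step described above can be sketched in scalar Python (my own illustration, not FFTW code): deinterleave once up front, do four real multiplies per element with no complex instructions, and reinterleave once at the end.

```python
# Hedged sketch: multiply an interleaved [re, im, re, im, ...] vector by a
# list of complex twiddle factors using only real arithmetic on the split
# (deinterleaved) halves.  Function and variable names are illustrative.
def twiddle_multiply_split(v, tw):
    re = v[0::2]                       # one deinterleave ("shuffle") up front
    im = v[1::2]
    wr = [t.real for t in tw]
    wi = [t.imag for t in tw]
    # (re + i*im) * (wr + i*wi): four real multiplies, two adds, no complex ops
    out_re = [r * a - s * b for r, s, a, b in zip(re, im, wr, wi)]
    out_im = [r * b + s * a for r, s, a, b in zip(re, im, wr, wi)]
    out = [0.0] * len(v)               # one reinterleave at the end
    out[0::2] = out_re
    out[1::2] = out_im
    return out

# (1+2i)*i = -2+i and (3+4i)*2 = 6+8i, interleaved back as [-2, 1, 6, 8]
print(twiddle_multiply_split([1.0, 2.0, 3.0, 4.0], [1j, 2 + 0j]))
```

In the vectorized version, the deinterleave/reinterleave at the edges are the only shuffles paid for; everything in between is plain fused multiply-add material.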
That's the issue... in V 1.0 you don't get that, unlike the various local shuffles you get in NEON/SVE (zip/uzp/...). You can slide elements (one way). You can also do a full "vrgather", which is a very powerful operation, in particular with LMUL, as you can gather from up to 8 registers into up to 8 registers with LMUL=8... However, you first need to create the indexing vector, and the instruction is likely to always be very slow. On the Spacemit K1, "vrgather" starts at 4 cycles latency and then grows quadratically with LMUL, so 256 cycles latency (!!!) for LMUL=8. V 1.0 was specified to be "versatile" and to support many implementations, but the downside is that you don't even get the multiple-of-128-bits-segment guarantee you get in SVE (and the Intel/AMD stuff), and there are very few inter-lane data-manipulation operations, which is not very good for the way FFTW3 does things. Hence people proposing alternative solutions to the current "do it as all the other SIMD ports do it" solution.
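For readers unfamiliar with RVV, here is a minimal Python model of what vrgather.vv computes (an assumption-level sketch of the semantics, not real intrinsics): each result element is an arbitrary source element selected by an index vector, with out-of-range indices producing 0. The point is that even a simple even/odd deinterleave must be expressed as this fully general gather, paying for both the index-vector setup and the gather's latency.

```python
# Model of vrgather.vv semantics: result[i] = src[idx[i]],
# with out-of-range indices yielding 0 (per the RVV 1.0 spec).
def vrgather(src, idx):
    return [src[i] if 0 <= i < len(src) else 0 for i in idx]

# An even/odd deinterleave -- the zip/uzp-style shuffle that NEON/SVE
# provide directly -- done as a general gather in RVV:
v = [10, 11, 12, 13, 14, 15, 16, 17]
print(vrgather(v, [0, 2, 4, 6]))   # even lanes: [10, 12, 14, 16]
print(vrgather(v, [1, 3, 5, 7]))   # odd lanes:  [11, 13, 15, 17]
```

Because the index vector can address any lane in the (LMUL-wide) source group, hardware can't assume locality, which is why implementations like the K1 scale so badly with LMUL.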
In SVE they are available, so we might as well use them where it makes sense :-) They are used for VZMUL* -- not the most common of macros, but they do have an impact, as I saw when adding them to NEON on the same hardware (I never tried SVE without them).
Hi @rdolbeau
We’ve recently been using your RISC-V-optimized fftw3 implementation found at https://github.com/rdolbeau/fftw3/tree/riscv-v-clean
I’m interested in your assessment of the opportunity for further optimization of the vector implementation within simd-support/simd-r5v.h. Is there additional work that might make sense here, or do you believe you have achieved most of the gains? For example, I see that lmul=1 (m1) is used whereas I would have anticipated a higher lmul might be possible. Or perhaps that has been explored already and rejected.
Any thoughts appreciated.
Thanks,
Ken