
RVV/R5V implementation and optimization #371

Open
ken-unger opened this issue Oct 8, 2024 · 5 comments

@ken-unger

Hi @rdolbeau

We’ve recently been using your risc-v optimized fftw3 implementation found in https://github.com/rdolbeau/fftw3/tree/riscv-v-clean

I’m interested in your assessment of the opportunity for further optimization of the vector implementation within simd-support/simd-r5v.h. Is there additional work that might make sense here, or do you believe you have achieved most of the gains? For example, I see that lmul=1 (m1) is used whereas I would have anticipated a higher lmul might be possible. Or perhaps that has been explored already and rejected.

Any thoughts appreciated.

Thanks,
Ken

@rdolbeau
Contributor

rdolbeau commented Oct 8, 2024

FFTs (including FFTW3's codelets) tend to require a lot of registers unless they are quite small. Anything that increases register pressure, including split mode and larger LMUL (which is effectively equivalent to unrolling for this use case), is likely to cause excessive register spilling, which will degrade performance. Also, some instructions do not like larger LMUL performance-wise; vrgather in particular is likely to misbehave in many implementations (e.g. it takes time quadratic in LMUL on the SpacemiT K1).
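To make the LMUL-as-unrolling point concrete, here is a minimal sketch (hand-written, not code from the branch) of the same loop with the RVV C intrinsics at LMUL=1 and LMUL=2; the m2 version processes twice as many elements per instruction, but every vfloat64m2_t temporary occupies an aligned pair of the 32 architectural vector registers:

```c
/* Minimal sketch, not from simd-r5v.h: the same streaming add at LMUL=1
 * and LMUL=2, using the RVV C intrinsics (<riscv_vector.h>). */
#include <riscv_vector.h>
#include <stddef.h>

void add_m1(double *dst, const double *a, const double *b, size_t n) {
    for (size_t vl; n > 0; n -= vl, a += vl, b += vl, dst += vl) {
        vl = __riscv_vsetvl_e64m1(n);                    /* one register per value */
        vfloat64m1_t va = __riscv_vle64_v_f64m1(a, vl);
        vfloat64m1_t vb = __riscv_vle64_v_f64m1(b, vl);
        __riscv_vse64_v_f64m1(dst, __riscv_vfadd_vv_f64m1(va, vb, vl), vl);
    }
}

void add_m2(double *dst, const double *a, const double *b, size_t n) {
    for (size_t vl; n > 0; n -= vl, a += vl, b += vl, dst += vl) {
        vl = __riscv_vsetvl_e64m2(n);                    /* up to twice the elements,  */
        vfloat64m2_t va = __riscv_vle64_v_f64m2(a, vl);  /* but each temporary now     */
        vfloat64m2_t vb = __riscv_vle64_v_f64m2(b, vl);  /* consumes a register pair   */
        __riscv_vse64_v_f64m2(dst, __riscv_vfadd_vv_f64m2(va, vb, vl), vl);
    }
}
```

For a streaming add that trade-off is harmless; for a codelet with dozens of live temporaries, halving (or quartering) the usable register file is exactly what produces the extra spills.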

Nonetheless, there's certainly room for improvement (it has only been tested on my K1), but I'd say the compilers should currently be the primary target. I'm attaching a subset of benchmarks on the K1; a few remarks:

(a) gcc (blue/cyan) is better than LLVM (red/orange) overall
(b) gcc is better when using the 'interleaved' representation ('r5v') than when using the 'split' representation ('r5vsplit')
(c) llvm is better when using the 'split' representation than when using the 'interleaved' one

There should be more consistent behavior across the software stack before doing much more tuning on the library itself. This is gcc-14 and clang-18 from a recent Debian Sid, so it's not a matter of outdated tools (though LLVM moves quickly). A quick glance at the ASM suggests that excessive spilling is already an issue in some of the slower variants. Under a similar test setup, the SVE implementation is much more stable in behavior across compilers.

doit.ALL.double.1d.drxx.p2.pdf

Edit: it was clang-18, not -16; the numbers were produced in mid-August '24, when 18 was still current.

@matteo-frigo
Member

Register pressure is a complicated issue for which we had a more-or-less "optimal" solution that worked at some point and may have been broken by more recent compilers.

The computation of the FFT is abstractly a directed acyclic graph (DAG) of additions and multiplications, and in principle you can evaluate the DAG in any order you want, as long as you compute the inputs of an operation before performing the operation. Some orders are worse than others, however. For example, you can compute the "butterfly" graph one level at a time from input to output, but this basically causes a spill for every operation unless the size is small. Another way is to compute sub-butterflies with R inputs if you know you have approximately R registers.
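A toy illustration of the two orderings (hand-written, not codelet-generator output): both functions evaluate the same size-4 butterfly DAG, but the first keeps an entire level of intermediates live at once, while the second finishes one sub-butterfly before starting the next.

```c
/* Toy example: the same 2-level butterfly of adds/subs over 4 values,
 * written in two evaluation orders.  Both compute identical results;
 * they differ only in how many intermediates are live at once. */

/* Breadth-first: whole first level, then whole second level.
 * All four intermediates t0..t3 are live simultaneously. */
static void butterfly4_breadth_first(double *restrict y, const double *restrict x)
{
    double t0 = x[0] + x[2];
    double t1 = x[1] + x[3];
    double t2 = x[0] - x[2];
    double t3 = x[1] - x[3];
    y[0] = t0 + t1;
    y[1] = t0 - t1;
    y[2] = t2 + t3;
    y[3] = t2 - t3;
}

/* Sub-butterfly order: finish the (t0, t1) half before touching
 * (t2, t3), so at most two intermediates are live at any point.
 * At size 4 the difference is trivial; in a large codelet on a
 * machine with R registers it is the difference between almost no
 * spills and roughly one spill per operation. */
static void butterfly4_sub_butterflies(double *restrict y, const double *restrict x)
{
    double t0 = x[0] + x[2];
    double t1 = x[1] + x[3];
    y[0] = t0 + t1;
    y[1] = t0 - t1;
    double t2 = x[0] - x[2];
    double t3 = x[1] - x[3];
    y[2] = t2 + t3;
    y[3] = t2 - t3;
}
```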

The FFTW codelets are specifically generated so that if you perform the operations in the order specified by the codelet then the number of spills is more or less minimal, no matter what R is. This magic property is called "cache obliviousness".

The problem now is that compilers assume that programmers don't know what they are doing, and so the first thing the compiler does is ignore the order in the code and just re-schedule everything. The standard compiler schedule (breadth-first) is the worst possible one.

In the past, the -fno-schedule-insns flag to gcc was enough to disable the reordering compiler pass. Make sure you are using it. If this hack no longer works we'll need to think of something else.

@rdolbeau
Contributor

rdolbeau commented Feb 8, 2025

@matteo-frigo The issue is that in RISC-V V1.0 there's no support for complex operations on the "interleaved" layout (real/imaginary interleaved in even/odd lanes), unlike in SVE, where things like FCMLA and FCADD are quite useful.
So people have been suggesting playing to RISC-V V1.0's "strengths", such as the fairly rich set of loads and stores, and/or LMUL, which basically expands the size of the registers (LMUL=2 uses aligned pairs such as V[2,3], V[4,5], ...; LMUL=4 uses aligned groups of four such as V[4,5,6,7] or V[8,9,10,11]; etc.).
Unfortunately, the split-register approach, where the segmented loads/stores work well, virtually halves the number of vector registers, while LMUL virtually divides the number of vector registers by LMUL (and a bit more, since v0 is reserved for masking). So even with perfect register allocation, those approaches create extra spills unless the codelet is small enough.
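For concreteness, a hedged sketch of what the split representation costs in registers (hand-written; the branch's split variant uses the segmented loads/stores, but strided vlse64/vsse64 are shown here for clarity, and the register arithmetic is the same): every logical complex vector occupies two architectural registers before any arithmetic happens.

```c
/* Sketch only, not simd-r5v.h: multiply interleaved complex doubles by
 * twiddles using the "split" representation at LMUL=1.  Each complex
 * operand needs two vector registers (re + im), so the 32-register file
 * effectively behaves like a 16-register one. */
#include <riscv_vector.h>
#include <stddef.h>

void cmul_split(double *dst, const double *src, const double *tw, size_t n)
{
    for (size_t vl; n > 0; n -= vl, src += 2 * vl, tw += 2 * vl, dst += 2 * vl) {
        vl = __riscv_vsetvl_e64m1(n);
        ptrdiff_t s = 2 * sizeof(double);          /* byte stride over interleaved data */
        vfloat64m1_t xr = __riscv_vlse64_v_f64m1(src,     s, vl);   /* real parts */
        vfloat64m1_t xi = __riscv_vlse64_v_f64m1(src + 1, s, vl);   /* imaginary parts */
        vfloat64m1_t wr = __riscv_vlse64_v_f64m1(tw,      s, vl);
        vfloat64m1_t wi = __riscv_vlse64_v_f64m1(tw + 1,  s, vl);
        /* (xr + i*xi) * (wr + i*wi): no shuffles needed in split form */
        vfloat64m1_t re = __riscv_vfnmsac_vv_f64m1(__riscv_vfmul_vv_f64m1(xr, wr, vl), xi, wi, vl);
        vfloat64m1_t im = __riscv_vfmacc_vv_f64m1(__riscv_vfmul_vv_f64m1(xr, wi, vl), xi, wr, vl);
        __riscv_vsse64_v_f64m1(dst,     s, re, vl);
        __riscv_vsse64_v_f64m1(dst + 1, s, im, vl);
    }
}
```

Even this tiny kernel ties up a handful of vector registers; a full codelet in split form runs out of them correspondingly sooner.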

@matteo-frigo
Member

But complex operations should not really be needed, as long as you can reasonably shuffle even and odd elements.

SSE and SSE2 have the same problem, and we solved it by deciding that a complex codelet first does the multiplications by twiddle factors, where in the worst case we pay the full shuffle cost and move on; then it performs vector real FFTs on the [re, im, re, im] data, which act as a split FFT of half the size; and then we combine real + i*imag at the end. The first and last steps are perhaps suboptimal, but the bulk of the computation doesn't do any shuffles or complex arithmetic.
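For reference, the shuffle-heavy step on interleaved data looks roughly like this in plain SSE2 (a hand-written sketch in the spirit of the VZMUL-style macros, not the actual FFTW simd headers):

```c
/* Sketch of a shuffle-based complex multiply on interleaved SSE2 data;
 * one __m128d holds one complex double as [re, im].  This is the "pay
 * the full cost" step; the bulk of the FFT that follows needs no
 * shuffles at all. */
#include <emmintrin.h>

static inline __m128d cmul_interleaved(__m128d x, __m128d w)
{
    __m128d wr = _mm_unpacklo_pd(w, w);       /* [wr, wr] */
    __m128d wi = _mm_unpackhi_pd(w, w);       /* [wi, wi] */
    __m128d xs = _mm_shuffle_pd(x, x, 1);     /* [xi, xr] */
    __m128d t0 = _mm_mul_pd(x, wr);           /* [xr*wr, xi*wr] */
    __m128d t1 = _mm_mul_pd(xs, wi);          /* [xi*wi, xr*wi] */
    /* Negate the low lane of t1, then add: SSE3's addsubpd later did
     * this sub/add pattern in a single instruction. */
    __m128d sign = _mm_set_pd(0.0, -0.0);     /* [+0.0 (hi), -0.0 (lo)] */
    return _mm_add_pd(t0, _mm_xor_pd(t1, sign));
    /* = [xr*wr - xi*wi, xi*wr + xr*wi] */
}
```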

Then SSE3 came out, which has complex multiplication, and I concluded it wasn't worth the hassle; the old scheme was good enough.

@rdolbeau
Contributor

rdolbeau commented Feb 8, 2025

But complex operations should not really be needed, as long as you can reasonably shuffle even and odd elements.

That's the issue... in V1.0 you don't get that, unlike the various local shuffles you get in NEON/SVE (zip/uzp/...). You can slide elements (one way). You can also do a full "vrgather", which is a very powerful operation, in particular with LMUL, since you can gather from up to 8 registers into up to 8 registers with LMUL=8... However, you first need to create the indexing vector, and the instruction is likely to always be very slow. On the SpacemiT K1, "vrgather" starts at 4 cycles of latency and then grows quadratically with LMUL, so 256 cycles of latency (!!!) at LMUL=8.
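To make the cost concrete, here is roughly what an even-lane extraction looks like when vrgather is all you have (hand-written sketch, not the branch's code); the index vector has to be materialized with its own instructions before the gather even executes:

```c
/* Sketch: v holds 2*n interleaved doubles (re, im pairs); return the n
 * real parts in the low lanes.  The indices 0,2,4,... must be built
 * first (vid + shift), and the vrgather itself is the instruction whose
 * latency grows with LMUL on cores like the K1. */
#include <riscv_vector.h>

vfloat64m1_t even_lanes(vfloat64m1_t v, size_t n)
{
    vuint64m1_t idx = __riscv_vid_v_u64m1(n);          /* 0, 1, 2, 3, ... */
    idx = __riscv_vsll_vx_u64m1(idx, 1, n);            /* 0, 2, 4, 6, ... */
    return __riscv_vrgather_vv_f64m1(v, idx, n);       /* v[0], v[2], ... */
}
```

Two instructions just to build the indices, and then a gather whose K1 latency is already 4 cycles at LMUL=1 and 256 at LMUL=8.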

V1.0 was specified to be "versatile" and to support many implementations, but the downside is that you don't even get the multiple-of-128-bit-segment guarantee you get in SVE (and the Intel/AMD extensions), and you get very few inter-lane data-manipulation operations, which is not very good for the way FFTW3 does things. Hence people have been proposing alternatives to the current "do it the way all the other SIMD backends do it" approach.

But complex operations should not really be needed, as long as you can reasonably shuffle even and odd elements.

In SVE they are available, so we might as well use them where it makes sense :-) They are used for VZMUL* - not the most common of macros, but it does have an impact, as I saw when adding them to the NEON version on the same hardware (I never tried SVE without them).
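For comparison, a VZMUL-style complex multiply with SVE's FCMLA can be sketched like this (hand-written, not the actual simd-sve header): two fused instructions, no shuffles, no index vector.

```c
/* Sketch: complex multiply of interleaved [re, im, re, im, ...] vectors
 * with FCMLA.  The two rotations (0 and 90) together accumulate the
 * full complex product a*b into acc. */
#include <arm_sve.h>

svfloat64_t cmul_fcmla(svbool_t pg, svfloat64_t a, svfloat64_t b)
{
    svfloat64_t acc = svdup_n_f64(0.0);
    acc = svcmla_f64_x(pg, acc, a, b, 0);
    acc = svcmla_f64_x(pg, acc, a, b, 90);
    return acc;
}
```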
