Added X25519_NO_UNROLLING option #254
base: master
Conversation
src/monocypher.c
Outdated
```c
if (((i & 0x1) == 0) && ((j & 0x1) == 1)) {
    t[i] += f[j] * (i64)(g[i - j]) * 2;
} else {
    t[i] += f[j] * (i64)(g[i - j]);
}
```
I have the feeling here that we can be a bit smarter about the test, perhaps avoid it entirely. This would likely involve a tiny bit of unrolling, but we would likely get some performance back.
Also, if we're okay with wasting a bit of stack, we should be able to pre-compute the multiplications by 2 (and below, by 19 and 38). While the factor of 2 may not be worth it (it's a simple shift by 1, very fast everywhere), pre-computing the multiplication by 19 is likely to recover quite a bit of speed.
I need to experiment some with this.
I have been experimenting a bit with this loop, and the fastest (and also smallest) version I got was by changing the calculation to:
```c
FOR (i, 0, 10) {
    t[i] = 0;
    FOR (j, 0, 10) {
        if (j > i) {
            t[i] += f[j] * (i64)(g[10 - (j - i)] * 19);
            if (((i & 0x1) == 0) && ((j & 0x1) == 1)) {
                t[i] += f[j] * (i64)(g[10 - (j - i)] * 19);
            }
        } else {
            t[i] += f[j] * (i64)(g[i - j]);
            if (((i & 0x1) == 0) && ((j & 0x1) == 1)) {
                t[i] += f[j] * (i64)(g[i - j]);
            }
        }
    }
}
```
Interesting. I would have thought that having 2 inner loops (from 0 to i and from i to 10) would have been a bit faster. Good job there.
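For reference, here is roughly the two-inner-loop shape I had in mind. This is an untested sketch that reuses t, f, g, and the FOR macro exactly as in your snippet; splitting the inner loop makes the j > i test disappear, while the parity test for doubling remains:

```c
FOR (i, 0, 10) {
    t[i] = 0;
    FOR (j, 0, i + 1) {    /* j <= i: direct term, no wrap-around */
        t[i] += f[j] * (i64)(g[i - j]);
        if (((i & 0x1) == 0) && ((j & 0x1) == 1)) {
            t[i] += f[j] * (i64)(g[i - j]);    /* double: i even, j odd */
        }
    }
    FOR (j, i + 1, 10) {   /* j > i: wrapped term, carries the * 19 */
        t[i] += f[j] * (i64)(g[10 - (j - i)] * 19);
        if (((i & 0x1) == 0) && ((j & 0x1) == 1)) {
            t[i] += f[j] * (i64)(g[10 - (j - i)] * 19);
        }
    }
}
```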
Okay, I believe I have found something even better:
```c
lfe t;
fe  g19;
FOR (i, 1, 10) {    /* g19[0] is never read; starting at 1 keeps g19[10 - i] in bounds */
    g19[10 - i] = g[i] * 19;
}
FOR (i, 0, 10) {
    t[i] = 0;
    FOR (j, 0, 10) {
        i64 prod = f[j] * (i64)(j > i
                                ? g19[j - i]
                                : g[i - j]);
        prod <<= ~i & j & 0x1;   /* doubles exactly when i is even and j is odd */
        t[i] += prod;
    }
}
fe_carry(h, t);
```
The idea was hoisting out the multiplication by 19, which I thought cost quite a bit. The main problem doing that is that we're sacrificing stack size (it's probably 40 bytes bigger than your proposal). I also took the opportunity to simplify your test, and I found this neat arithmetic trick to achieve what you did. Anyway, my change here ultimately makes X25519 quite a bit faster on my laptop:
- 7% faster than your code if we compile under -O3 -march=native.
- 16% faster than your code if we compile under -Os -march=native.
Now one thing bothers me about this whole patch (whatever the version): buffers aren't wiped. See, all arrays that may contain secret data are wiped before they're freed. It's kind of a "best effort" thing, most notably because ordinary local variables, more likely to live in registers, aren't wiped at all. And in the case of X25519 field multiplications some of those local variables definitely spill on the stack, and they're not wiped either. Still, I wanted to note that those two little arrays we just added? They would be an exception to the "wipe all the buffers!" rule, and I don't like that.
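If we did want to wipe them, a minimal sketch would be to end the function the way Monocypher's other routines do — assuming the lfe/fe temporaries above, and the internal WIPE_BUFFER macro, which expands to crypto_wipe(buffer, sizeof(buffer)):

```c
static void fe_mul(fe h, const fe f, const fe g)
{
    lfe t;
    fe  g19;
    /* ... rolled multiplication as above ... */
    fe_carry(h, t);
    WIPE_BUFFER(t);   /* crypto_wipe() stores through a volatile pointer, */
    WIPE_BUFFER(g19); /* so these cannot be elided as dead writes         */
}
```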
While I know it's not entirely rational, I now feel compelled to look at this a bit more critically. Let's measure this all under various conditions. I'll be using GCC 11.3.0 on Ubuntu 22.04, on my x86-64 Intel Skylake laptop. Let's start with the sizes of monocypher.o, in bytes (each column compares itself to the previous column):

| | Unrolled | use fe_mul() | unroll (me) | unroll (you) |
|---|---|---|---|---|
| -Os -march=native | 53,720 | -1232 | -2160 | -32 |
| -O2 -march=native | 57,552 | -1384 | -2168 | -24 |
| -O3 -march=native | 86,064 | -1216 | -592 | +296 (!!) |
And now the speed, in number of exchanges per second:

| | Unrolled | use fe_mul() | unroll (me) | unroll (you) |
|---|---|---|---|---|
| -Os -march=native | 7050 | 6052 (-14%) | 2099 (-66%) | 1804 (-14%) |
| -O2 -march=native | 7402 | 6393 (-14%) | 1889 (-70%) | 1656 (-12%) |
| -O3 -march=native | 7641 | 6494 (-15%) | 5458 (-16%) | 5064 (-7%) |
First, the obvious: someone who needs your patch would never use -O3, because the binary is just much bigger. Still, I'm impressed by how well performance is holding up under -O3: the smallest binary (mine) manages to keep the total performance drop under 29%, much less than what I originally measured.
Another unsurprising finding is the effect of replacing fe_sq() by a single call to fe_mul(). This change alone saves between 1.2 and 1.3 kilobytes, and the performance drop-off is very stable.
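(Concretely, the change amounts to making the squaring routine a thin wrapper — a sketch, assuming Monocypher's usual fe_sq(h, f) signature:)

```c
static void fe_sq(fe h, const fe f)
{
    fe_mul(h, f, f);   /* square via the general multiplier: smaller code, less speed */
}
```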
More notable is the resilience of the unrolled code to compilation options: going from -O3 to -Os only costs us 8% speed (the use fe_mul() version is just as resilient, with a 7% drop in speed).
The real problems however start when we're running the fully rolled code under -O2 or -Os. The performance drops are massive here. The most generous comparison I can make (my version compared to the fe_mul()-only version) still shows that we're dividing speed by more than 3! All this to save less than 2.2KB on the binary. I know embedded devices are tight on memory, but the smallest ones tend to be tight on speed as well. Saving 2.2KB is good, but a 3x slowdown is no joke.
My tentative conclusions right now:

- Using fe_mul() instead of fe_sq() saves ~1.3KB and costs less than 15% of speed. Definitely worth it in some circumstances. Not sure I want to accept a patch just for this though, because it's so easy to do it yourself. What I can do is put a nice comment that explains how to do it and what to expect exactly. Given the change, and my observations, I believe the effects should be stable across platforms as well.
- Rolling everything like you did, however, even with my performance improvements, is likely going too far: saving 2.2KB is nice, but I think dividing performance by more than 3 will be too expensive in most cases. It's a "desperate measures for desperate cases" kind of deal, too niche in my opinion to accept the full patch right now.
One caveat in my analysis: I haven't analysed the impact of only inlining carry propagation. I expect a fairly stable performance impact, but I don't think it will save much more than 800 bytes of code: after we've simplified fe_sq(), the carry routine is used only 3 times, and the carry propagation logic is not that big.
Sorry for giving you this bummer, but I must say, on the other hand, reviewing this patch has been a pleasure. Even though I'm now likely to reject it, it was interesting and valuable research that I will keep with me going forward, for when I do get around to making that embedded edition for real.
@LoupVaillant, thank you for taking the time to study this.
On my Ryzen 7 4700U, I see a small reduction in speed with your proposal.
Regarding the size, I am focusing on the size of the X25519 part (as this is what is being changed). I am getting the following results (all compiled with -Os):

- nrf51422 (Cortex-M0): from 8070 bytes reduced to 2380 bytes,
- nrf52832 (Cortex-M4): from 5036 bytes reduced to 2444 bytes,
- stm32f103 (Cortex-M3): from 5048 bytes reduced to 2456 bytes.

This is a reduction of more than 50%.
Regarding the wiping of buffers, there should be no exception. However, since the t variable is created only for passing it to fe_carry, it is very easy to set it to zero when copying to h at the end of fe_carry.
@LoupVaillant, I will run some benchmarks on real silicon to get an idea of the performance.
I have taken some time to do some trials on real silicon (nucleo_f411re) and QEMU. The times given are for one X25519 exchange:

| | unrolled | rolled (my version, added t[] wipe) | ratio |
|---|---|---|---|
| nucleo_f411re (84 MHz) | 20 ms | 96 ms | x4.8 |
| cortex_m3 (QEMU 12 MHz) | 82 ms | 418 ms | x5.1 |
| cortex_m0 (QEMU 1 MHz) | 1216 ms | 2513 ms | x2.1 |
This leaves me with mixed feelings: on Cortex-M3/M4 there is a code reduction of about a factor of 2, but a speed decrease of a factor of 5. On Cortex-M0 there is a code reduction of a factor of 3.5 with a speed decrease of only a factor of 2.
Thanks for those trials, those are really informative!

If we were serious about the wipe, we would hoist the array out of the function and wipe it only once. It's kind of messy now that every single call to fe_mul() comes with an extra argument, but at least you only wipe once. Should help a bit.
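Something like this — a rough sketch, where the names and the exact signatures are hypothetical, not what the patch currently does:

```c
/* The caller owns the scratch buffer, and wipes it once at the end. */
static void fe_mul(fe h, const fe f, const fe g, lfe t)
{
    /* ... fill t[] as before ... */
    fe_carry(h, t);   /* no wipe here */
}

static void x25519_ladder(/* ...hypothetical... */)
{
    lfe scratch;
    /* ... every fe_mul(h, f, g, scratch) call reuses the same buffer ... */
    WIPE_BUFFER(scratch);   /* one single wipe once the ladder is done */
}
```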
Could you test this quickly without the wipe, so we get a feeling for the performance we might hope for?
The influence is negligible (4 ms on QEMU cortex_m3). The way t is cleared is by setting it to 0 in the loop that converts t to h; t was only created to allow passing it to fe_carry(h, t).
BTW, regarding the figures for Cortex-M3/M4: this might not be so bad after all, given the speed comparison by Zandberg et al., where Monocypher was 40 times faster than C25519.
Added X25519_NO_UNROLLING option to reduce binary size. This reduces speed but allows a user to trade in some performance for size. Removed wipe at end of fe_carry(h, t) and used WIPE_BUFFER instead. Improved the fe_mul loop.

Signed-off-by: Laczen JMS <laczenjms@gmail.com>
src/monocypher.c
Outdated
```c
FOR (i, 0, 10) {
    h[i] = (i32)t[i];
    t[i] = 0;
}
```
Hmm, actually this might be detected as a dead write, and skipped by the compiler. That won't happen if the compiler doesn't inline the function and doesn't notice that t is never used after carry propagation, but we may want to be more careful and use volatile like crypto_wipe() does.

This will disable some optimisation and be a little slower still, but I don't expect any big difference.
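A minimal sketch of the volatile variant, in the style of crypto_wipe()'s volatile pointer — hypothetical and untested:

```c
/* Hypothetical fe_carry tail: stores through a volatile pointer cannot
 * be proven dead, so the compiler is not allowed to drop the zeroing. */
volatile i64 *vt = t;
FOR (i, 0, 10) {
    h[i]  = (i32)vt[i];
    vt[i] = 0;
}
```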
src/monocypher.c
Outdated
```c
const int m = 1 + (~i & j & 0x1);
if (j > i) {
    t[i] += m * f[j] * (i64)(g[10 - (j - i)] * 19);
} else {
    t[i] += m * f[j] * (i64)(g[i - j]);
}
```
Your use of multiplication is surprising. If the compiler can see through it and take advantage of the fact that the only possible values for m are 1 and 2, it may do the clever thing and conditionally add the right-hand side once more. But there's a good chance, especially under -Os, that you end up performing a full 32-bit multiplication.

You probably want to use a shift instead. It requires a barrel shifter rather than a multiplier, but I expect this can still be faster:
```c
const int m = ~i & j & 0x1;
if (j > i) {
    t[i] += (f[j] << m) * (i64)(g[10 - (j - i)] * 19);
} else {
    t[i] += (f[j] << m) * (i64)(g[i - j]);
}
```
Not sure the generated code will actually change, and maybe most 32-bit multipliers are just as fast as the shifting, but in a low-transistor setting this may make a difference.
Oh, by the way, you can hoist the shift out of the test:
```c
const i32 f2 = f[j] << (~i & j & 0x1);
if (j > i) {
    t[i] += f2 * (i64)(g[10 - (j - i)] * 19);
} else {
    t[i] += f2 * (i64)(g[i - j]);
}
```
I expect most compilers spot this and optimise accordingly, but sometimes they miss things. This change should help them some.
Added X25519_NO_UNROLLING option to reduce binary size. This reduces speed but allows a user to trade in some performance for size.
Signed-off-by: Laczen JMS <laczenjms@gmail.com>