Added X25519_NO_UNROLLING option #254
base: master
Conversation
src/monocypher.c
Outdated
```c
if (((i & 0x1) == 0) && ((j & 0x1) == 1)) {
    t[i] += f[j] * (i64)(g[i - j]) * 2;
} else {
    t[i] += f[j] * (i64)(g[i - j]);
}
```
I have the feeling here that we can be a bit smarter about the test, perhaps avoid it entirely. This would likely involve a tiny bit of unrolling, but we would likely get some performance back.
Also, if we're okay with wasting a bit of stack, we should be able to pre-compute the multiplications by 2 (and below, by 19 and 38). While the factor of 2 may not be worth it (it's a simple shift by 1, very fast everywhere), pre-computing the multiplication by 19 is likely to recover quite a bit of speed.
I need to experiment some with this.
I have been experimenting a bit with this loop, and the fastest (and also smallest) version I got was by changing the calculation to:
```c
FOR (i, 0, 10) {
    t[i] = 0;
    FOR (j, 0, 10) {
        if (j > i) {
            t[i] += f[j] * (i64)(g[10 - (j - i)] * 19);
            if (((i & 0x1) == 0) && ((j & 0x1) == 1)) {
                t[i] += f[j] * (i64)(g[10 - (j - i)] * 19);
            }
        } else {
            t[i] += f[j] * (i64)(g[i - j]);
            if (((i & 0x1) == 0) && ((j & 0x1) == 1)) {
                t[i] += f[j] * (i64)(g[i - j]);
            }
        }
    }
}
```
Interesting. I would have thought that having 2 inner loops (from 0 to i and from i to 10) would have been a bit faster. Good job there.
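For reference, here is roughly the two-inner-loop shape I had in mind. This is an untested sketch that reuses t, f, g, and the FOR macro exactly as in your snippet; splitting the inner loop makes the j > i test disappear, while the parity test for doubling remains:

```c
FOR (i, 0, 10) {
    t[i] = 0;
    FOR (j, 0, i + 1) {    /* j <= i: direct term, no wrap-around */
        t[i] += f[j] * (i64)(g[i - j]);
        if (((i & 0x1) == 0) && ((j & 0x1) == 1)) {
            t[i] += f[j] * (i64)(g[i - j]);    /* double: i even, j odd */
        }
    }
    FOR (j, i + 1, 10) {   /* j > i: wrapped term, carries the * 19 */
        t[i] += f[j] * (i64)(g[10 - (j - i)] * 19);
        if (((i & 0x1) == 0) && ((j & 0x1) == 1)) {
            t[i] += f[j] * (i64)(g[10 - (j - i)] * 19);
        }
    }
}
```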
Okay, I believe I have found something even better:
```c
lfe t;
fe  g19;
FOR (i, 1, 10) {    /* g19[0] is never read; starting at 1 keeps g19[10 - i] in bounds */
    g19[10 - i] = g[i] * 19;
}
FOR (i, 0, 10) {
    t[i] = 0;
    FOR (j, 0, 10) {
        i64 prod = f[j] * (i64)(j > i
                                ? g19[j - i]
                                : g[i - j]);
        prod <<= ~i & j & 0x1;   /* doubles exactly when i is even and j is odd */
        t[i] += prod;
    }
}
fe_carry(h, t);
```
The idea was hoisting out the multiplication by 19, which I thought cost quite a bit. The main problem doing that is that we're sacrificing stack size (it's probably 40 bytes bigger than your proposal). I also took the opportunity to simplify your test, and I found this neat arithmetic trick to achieve what you did. Anyway, my change here ultimately makes X25519 quite a bit faster on my laptop:
- 7% faster than your code if we compile under -O3 -march=native.
- 16% faster than your code if we compile under -Os -march=native.
Now one thing bothers me about this whole patch (whatever the version): buffers aren't wiped. See, all arrays that may contain secret data are wiped before they're freed. It's kind of a "best effort" thing, most notably because ordinary local variables, more likely to live in registers, aren't wiped at all. And in the case of X25519 field multiplications some of those local variables definitely spill on the stack, and they're not wiped either. Still, I wanted to note that those two little arrays we just added? They would be an exception to the "wipe all the buffers!" rule, and I don't like that.
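If we did want to wipe them, a minimal sketch would be to end the function the way Monocypher's other routines do — assuming the lfe/fe temporaries above, and the internal WIPE_BUFFER macro, which expands to crypto_wipe(buffer, sizeof(buffer)):

```c
static void fe_mul(fe h, const fe f, const fe g)
{
    lfe t;
    fe  g19;
    /* ... rolled multiplication as above ... */
    fe_carry(h, t);
    WIPE_BUFFER(t);   /* crypto_wipe() stores through a volatile pointer, */
    WIPE_BUFFER(g19); /* so these cannot be elided as dead writes         */
}
```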
While I know it's not entirely rational, I now feel compelled to look at this a bit more critically. Let's measure this all under various conditions. I'll be using GCC 11.3.0 on Ubuntu 22.04, on my x86-64 Intel Skylake laptop. Let's start with the sizes of monocypher.o, in bytes (each column compares itself to the previous column):

| | Unrolled | use fe_mul() | unroll (me) | unroll (you) |
|---|---|---|---|---|
| -Os -march=native | 53,720 | -1232 | -2160 | -32 |
| -O2 -march=native | 57,552 | -1384 | -2168 | -24 |
| -O3 -march=native | 86,064 | -1216 | -592 | +296 (!!) |
And now the speed, in number of exchanges per second:

| | Unrolled | use fe_mul() | unroll (me) | unroll (you) |
|---|---|---|---|---|
| -Os -march=native | 7050 | 6052 (-14%) | 2099 (-66%) | 1804 (-14%) |
| -O2 -march=native | 7402 | 6393 (-14%) | 1889 (-70%) | 1656 (-12%) |
| -O3 -march=native | 7641 | 6494 (-15%) | 5458 (-16%) | 5064 (-7%) |
First, the obvious: someone who needs your patch would never use -O3, because the binary is just much bigger. Still, I'm impressed by how well performance is holding up under -O3: the smallest binary (mine) manages to keep the total performance drop under 29%, much less than what I originally measured.
Another unsurprising finding is the effect of replacing fe_sq() by a single call to fe_mul(). This change alone saves between 1.2 and 1.3 kilobytes, and the performance drop-off is very stable.
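(Concretely, the change amounts to making the squaring routine a thin wrapper — a sketch, assuming Monocypher's usual fe_sq(h, f) signature:)

```c
static void fe_sq(fe h, const fe f)
{
    fe_mul(h, f, f);   /* square via the general multiplier: smaller code, less speed */
}
```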
More notable is the resilience of the unrolled code to compilation options: going from -O3 to -Os only costs us 8% speed (the use fe_mul() version is just as resilient, with a 7% drop in speed).
The real problems however start when we're running the fully rolled code under -O2 or -Os. The performance drops are massive here. The most generous comparison I can make (my version compared to the fe_mul()-only version) still shows that we're dividing speed by more than 3! All this to save less than 2.2KB on the binary. I know embedded devices are tight on memory, but the smallest ones tend to be tight on speed as well. Saving 2.2KB is good, but a 3x slowdown is no joke.
My tentative conclusions right now:

- Using fe_mul() instead of fe_sq() saves ~1.3KB and costs less than 15% of speed. Definitely worth it in some circumstances. Not sure I want to accept a patch just for this though, because it's so easy to do it yourself. What I can do is put a nice comment that explains how to do it and what to expect exactly. Given the change, and my observations, I believe the effects should be stable across platforms as well.
- Rolling everything like you did, however, even with my performance improvements, is likely going too far: saving 2.2KB is nice, but I think dividing performance by more than 3 will be too expensive in most cases. It's a "desperate measures for desperate cases" kind of deal, too niche in my opinion to accept the full patch right now.
One caveat in my analysis: I haven't analysed the impact of only inlining carry propagation. I expect a fairly stable performance impact, but I don't think it will save much more than 800 bytes of code: after we've simplified fe_sq(), the carry routine is used only 3 times, and the carry propagation logic is not that big.
Sorry for giving you this bummer, but I must say, on the other hand, reviewing this patch has been a pleasure. Even though I'm now likely to reject it, it was interesting and valuable research that I will keep with me going forward, for when I do get around to making that embedded edition for real.
@LoupVaillant, thank you for taking the time to study this.
On my Ryzen 7 4700U, I see a small reduction in speed with your proposal.
Regarding the size, I am focusing on the size of the X25519 part (as this is what is being changed). I am getting the following results (all compiled with -Os):

- nrf51422 (Cortex-M0): from 8070 bytes reduced to 2380 bytes,
- nrf52832 (Cortex-M4): from 5036 bytes reduced to 2444 bytes,
- stm32f103 (Cortex-M3): from 5048 bytes reduced to 2456 bytes.

This is a reduction of more than 50%.
Regarding the wiping of buffers, there should be no exception. However, since the t variable is created only for passing it to fe_carry, it is very easy to set it to zero when copying to h at the end of fe_carry.
@LoupVaillant, I will run some benchmarks on real silicon to get an idea of the performance.
I have taken some time to do some trials on real silicon (nucleo_f411re) and QEMU. The times given are for one X25519 exchange:

| | unrolled | rolled (my version, added t[] wipe) | ratio |
|---|---|---|---|
| nucleo_f411re (84 MHz) | 20 ms | 96 ms | x4.8 |
| cortex_m3 (QEMU 12 MHz) | 82 ms | 418 ms | x5.1 |
| cortex_m0 (QEMU 1 MHz) | 1216 ms | 2513 ms | x2.1 |
This leaves me with mixed feelings: on Cortex-M3/M4 there is a code reduction of about a factor of 2, but a speed decrease of a factor of 5. On Cortex-M0 there is a code reduction of a factor of 3.5 with a speed decrease of only a factor of 2.
Thanks for those trials, those are really informative!

If we were serious about the wipe, we would hoist the array out of the function and wipe it only once. It's kind of messy now that every single call to fe_mul() comes with an extra argument, but at least you only wipe once. Should help a bit.
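Something like this — a rough sketch, where the names and the exact signatures are hypothetical, not what the patch currently does:

```c
/* The caller owns the scratch buffer, and wipes it once at the end. */
static void fe_mul(fe h, const fe f, const fe g, lfe t)
{
    /* ... fill t[] as before ... */
    fe_carry(h, t);   /* no wipe here */
}

static void x25519_ladder(/* ...hypothetical... */)
{
    lfe scratch;
    /* ... every fe_mul(h, f, g, scratch) call reuses the same buffer ... */
    WIPE_BUFFER(scratch);   /* one single wipe once the ladder is done */
}
```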
Could you test this quickly without the wipe, so we get a feeling for the performance we might hope for?
The influence is negligible (4 ms on QEMU cortex_m3). The way t is cleared is by setting it to 0 in the loop that converts t to h; t was only created to allow passing it to fe_carry(h, t).
BTW, regarding the figures for Cortex-M3/M4: this might not be so bad after all, given the speed comparison by Zandberg et al., where Monocypher was 40 times faster than C25519.
Added X25519_NO_UNROLLING option to reduce binary size. This reduces speed but allows a user to trade in some performance for size. Removed wipe at end of fe_carry(h, t) and used WIPE_BUFFER instead. Improved the fe_mul loop.

Signed-off-by: Laczen JMS <laczenjms@gmail.com>
src/monocypher.c
Outdated
```c
FOR (i, 0, 10) {
    h[i] = (i32)t[i];
    t[i] = 0;
}
```
Hmm, actually this might be detected as a dead write, and skipped by the compiler. That won't happen if the compiler doesn't inline the function and doesn't notice that t is never used after carry propagation, but we may want to be more careful and use volatile like crypto_wipe() does.

This will disable some optimisation and be a little slower still, but I don't expect any big difference.
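A minimal sketch of the volatile variant, in the style of crypto_wipe()'s volatile pointer — hypothetical and untested:

```c
/* Hypothetical fe_carry tail: stores through a volatile pointer cannot
 * be proven dead, so the compiler is not allowed to drop the zeroing. */
volatile i64 *vt = t;
FOR (i, 0, 10) {
    h[i]  = (i32)vt[i];
    vt[i] = 0;
}
```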
src/monocypher.c
Outdated
```c
const int m = 1 + (~i & j & 0x1);
if (j > i) {
    t[i] += m * f[j] * (i64)(g[10 - (j - i)] * 19);
} else {
    t[i] += m * f[j] * (i64)(g[i - j]);
}
```
Your use of multiplication is surprising. If the compiler can see through it and take advantage of the fact that the only possible values for m are 1 and 2, it may do the clever thing and conditionally add the right-hand side once more. But there's a good chance, especially under -Os, that you end up performing a full 32-bit multiplication.

You probably want to use a shift instead. It requires a barrel shifter rather than a multiplier, but I expect this can still be faster:
```c
const int m = ~i & j & 0x1;
if (j > i) {
    t[i] += (f[j] << m) * (i64)(g[10 - (j - i)] * 19);
} else {
    t[i] += (f[j] << m) * (i64)(g[i - j]);
}
```
Not sure the generated code will actually change, and maybe most 32-bit multipliers are just as fast as the shifting, but in a low-transistor setting this may make a difference.
Oh, by the way, you can hoist the shift out of the test:
```c
const i32 f2 = f[j] << (~i & j & 0x1);
if (j > i) {
    t[i] += f2 * (i64)(g[10 - (j - i)] * 19);
} else {
    t[i] += f2 * (i64)(g[i - j]);
}
```
I expect most compilers spot this and optimise accordingly, but sometimes they miss things. This change should help them some.
Added X25519_NO_UNROLLING option to reduce binary size. This reduces speed but allows a user to trade in some performance for size.
Signed-off-by: Laczen JMS <laczenjms@gmail.com>