What's changed
- Eliminate volatile writes in `ConcurrentLru` internal bookkeeping code for pure reads, improving concurrent read throughput by 175%.
- Vectorize the hot methods in `CmSketch` using Neon intrinsics for ARM CPUs. This results in slightly better `ConcurrentLfu` cache throughput measured on Apple M series and Azure Cobalt 100 CPUs.
- Unroll loops in the hot methods in `CmSketch`. This results in slightly better `ConcurrentLfu` throughput on CPUs without vector support (i.e. neither x86 AVX2 nor Arm Neon).
- On vectorized code paths (AVX2 and Neon), `CmSketch` allocates the internal buffer using the pinned object heap on .NET 6 or newer. Use of the `fixed` statement is removed, eliminating a very small overhead. Sketch block pointers are then aligned to 64 bytes, guaranteeing each block is always on the same CPU cache line. This provides a small speedup for the `ConcurrentLfu` maintenance thread by reducing CPU cache misses.
- Minor improvements to the AVX2 JITted code via `MethodImpl(MethodImplOptions.AggressiveInlining)` and removal of local variables, improving performance on .NET 8/9 with dynamic PGO.
Full changelog: v2.5.2...v2.5.3