What's changed
- Eliminate volatile writes in `ConcurrentLru` internal bookkeeping code for pure reads, improving concurrent read throughput by 175%.
- Vectorize the hot methods in `CmSketch` using Neon intrinsics for ARM CPUs. This results in slightly better `ConcurrentLfu` cache throughput measured on Apple M series and Azure Cobalt 100 CPUs.
- Unroll loops in the hot methods in `CmSketch`. This results in slightly better `ConcurrentLfu` throughput on CPUs without vector support (i.e. neither x86 AVX2 nor Arm Neon).
- On vectorized code paths (AVX2 and Neon), `CmSketch` allocates the internal buffer using the pinned object heap on .NET 6 or newer. Use of the `fixed` statement is removed, eliminating a very small overhead. Sketch block pointers are then aligned to 64 bytes, guaranteeing each block is always on the same CPU cache line. This provides a small speedup for the `ConcurrentLfu` maintenance thread by reducing CPU cache misses.
- Minor improvements to the AVX2 JITted code via `MethodImpl(MethodImplOptions.AggressiveInlining)` and removal of local variables, improving performance on .NET 8/9 with dynamic PGO.
Full changelog: v2.5.2...v2.5.3