
fix: handle race condition on cache retention #569

Open
jeqo wants to merge 3 commits into main from jeqo/fix-flakiness

Conversation


@jeqo jeqo commented Jul 4, 2024

Cache removal listener-related tests (DiskChunkCacheMetricsTest and MemorySegmentIndexesCacheTest) are flaky. Recent evidence:

To reproduce this locally, @RepeatedTest(10000) has been used.
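For illustration, a minimal sketch of that reproduction setup, assuming JUnit 5 (the test body itself is unchanged):

    // Hypothetical sketch: repeat the flaky test many times so the race surfaces.
    import org.junit.jupiter.api.RepeatedTest;

    class DiskChunkCacheMetricsTest {
        @RepeatedTest(10000) // JUnit 5 runs this test 10000 times
        void metrics() {
            // ... original test body ...
        }
    }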

The failure is caused by the timeout condition when waiting for a cache entry to be removed:

DiskChunkCacheMetricsTest > metrics() > repetition 279 of 1000 FAILED
    org.awaitility.core.ConditionTimeoutException: Condition with alias 'Deletion happens' didn't complete within 30 seconds because condition with lambda expression in io.aiven.kafka.tieredstorage.fetch.cache.DiskChunkCacheMetricsTest that uses javax.management.ObjectName was not fulfilled.
        at app//org.awaitility.core.ConditionAwaiter.await(ConditionAwaiter.java:167)
        at app//org.awaitility.core.CallableCondition.await(CallableCondition.java:78)
        at app//org.awaitility.core.CallableCondition.await(CallableCondition.java:26)
        at app//org.awaitility.core.ConditionFactory.until(ConditionFactory.java:1006)
        at app//org.awaitility.core.ConditionFactory.until(ConditionFactory.java:975)
        at app//io.aiven.kafka.tieredstorage.fetch.cache.DiskChunkCacheMetricsTest.metrics(DiskChunkCacheMetricsTest.java:125)

Waiting for the RemovalListener to be called right after inserting a couple of entries does not seem to be deterministic, so a retention.ms time boundary is needed to get the removal called within the time frame of the test (the default retention.ms is 10 min).
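As I understand Caffeine's behavior, expiration normally runs as part of cache maintenance piggybacked on other cache activity, which is why waiting for the listener right after a couple of writes is racy. A rough, illustrative sketch (not the project's actual wiring; key/value types and the listener body are placeholders) of how a time bound plus a scheduler makes expiration fire within a predictable window:

    import java.nio.file.Path;
    import java.time.Duration;
    import com.github.benmanes.caffeine.cache.Cache;
    import com.github.benmanes.caffeine.cache.Caffeine;
    import com.github.benmanes.caffeine.cache.RemovalCause;
    import com.github.benmanes.caffeine.cache.Scheduler;

    class RetentionSketch {
        // Illustrative: a short time bound plus a scheduler makes expiration
        // fire promptly instead of waiting for later cache activity to
        // trigger maintenance.
        final Cache<String, Path> cache = Caffeine.newBuilder()
            .expireAfterWrite(Duration.ofSeconds(10))  // test-friendly "retention.ms"
            .scheduler(Scheduler.systemScheduler())    // schedules pending expirations
            .removalListener((String key, Path path, RemovalCause cause) ->
                System.out.printf("removed %s (%s)%n", key, cause))
            .build();
    }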

As a separate finding while running this locally, I spotted a file-not-found exception after a few thousand runs:

[2024-07-05 12:26:01,100] INFO DiskChunkCacheConfig values: 
	path = /var/folders/f_/6tkk7f6x7377dzmwfsqkdtk00000gq/T/junit16372691694497514473
	prefetch.max.size = 0
	retention.ms = 600000
	size = 1024
 (io.aiven.kafka.tieredstorage.fetch.cache.DiskChunkCacheConfig:370)
[2024-07-05 12:43:47,886] ERROR Failed to delete cached file for key ChunkKey(segmentFileName=segment, chunkId=1) with path /var/folders/f_/6tkk7f6x7377dzmwfsqkdtk00000gq/T/junit11001183003053723613/cache/segment-1 from cache directory. The reason of the deletion is EXPIRED (io.aiven.kafka.tieredstorage.fetch.cache.DiskChunkCache:111)
java.nio.file.NoSuchFileException: /var/folders/f_/6tkk7f6x7377dzmwfsqkdtk00000gq/T/junit11001183003053723613/cache/segment-1
	at java.base/sun.nio.fs.UnixException.translateToIOException(UnixException.java:92)
	at java.base/sun.nio.fs.UnixException.rethrowAsIOException(UnixException.java:106)
	at java.base/sun.nio.fs.UnixException.rethrowAsIOException(UnixException.java:111)
	at java.base/sun.nio.fs.UnixFileAttributeViews$Basic.readAttributes(UnixFileAttributeViews.java:55)
	at java.base/sun.nio.fs.UnixFileSystemProvider.readAttributes(UnixFileSystemProvider.java:148)
	at java.base/java.nio.file.Files.readAttributes(Files.java:1851)
	at java.base/java.nio.file.Files.size(Files.java:2468)
	at io.aiven.kafka.tieredstorage.fetch.cache.DiskChunkCache.lambda$removalListener$0(DiskChunkCache.java:101)
	at com.github.benmanes.caffeine.cache.Async$AsyncEvictionListener.onRemoval(Async.java:117)
	at com.github.benmanes.caffeine.cache.Async$AsyncEvictionListener.onRemoval(Async.java:101)
	at com.github.benmanes.caffeine.cache.BoundedLocalCache.notifyEviction(BoundedLocalCache.java:442)
	at com.github.benmanes.caffeine.cache.BoundedLocalCache.lambda$evictEntry$2(BoundedLocalCache.java:1071)
	at java.base/java.util.concurrent.ConcurrentHashMap.computeIfPresent(ConcurrentHashMap.java:1828)
	at com.github.benmanes.caffeine.cache.BoundedLocalCache.evictEntry(BoundedLocalCache.java:1032)
	at com.github.benmanes.caffeine.cache.BoundedLocalCache.expireAfterAccessEntries(BoundedLocalCache.java:939)
	at com.github.benmanes.caffeine.cache.BoundedLocalCache.expireAfterAccessEntries(BoundedLocalCache.java:925)
	at com.github.benmanes.caffeine.cache.BoundedLocalCache.expireEntries(BoundedLocalCache.java:903)
	at com.github.benmanes.caffeine.cache.BoundedLocalCache.maintenance(BoundedLocalCache.java:1721)
	at com.github.benmanes.caffeine.cache.BoundedLocalCache.performCleanUp(BoundedLocalCache.java:1660)
	at com.github.benmanes.caffeine.cache.BoundedLocalCache$PerformCleanupTask.run(BoundedLocalCache.java:3886)
	at java.base/java.util.concurrent.CompletableFuture$AsyncRun.run(CompletableFuture.java:1804)
	at java.base/java.util.concurrent.CompletableFuture$AsyncRun.exec(CompletableFuture.java:1796)
	at java.base/java.util.concurrent.ForkJoinTask.doExec(ForkJoinTask.java:373)
	at java.base/java.util.concurrent.ForkJoinPool$WorkQueue.topLevelExec(ForkJoinPool.java:1182)
	at java.base/java.util.concurrent.ForkJoinPool.scan(ForkJoinPool.java:1655)
	at java.base/java.util.concurrent.ForkJoinPool.runWorker(ForkJoinPool.java:1622)
	at java.base/java.util.concurrent.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:165)

There seem to be multiple calls to this listener happening concurrently, causing this behavior (the first caller wins and the next one does not find the file), so additional handling has been added. At runtime this exception is swallowed by the listener execution anyway, so this is mostly about having better logging when it happens.

This seems to be expected looking at the Caffeine docs:

The RemovalListener states:

An instance may be called concurrently by multiple threads to process different entries.

Also

Implementations of this interface should avoid performing blocking calls or synchronizing on shared resources.
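A minimal sketch of the added handling (illustrative; the real change is in DiskChunkCache's removal listener): treat a missing file as a deletion that already happened and log it, instead of letting Files.size throw:

    // Sketch only: if a concurrent removal already deleted the file, log a
    // warning instead of throwing NoSuchFileException from Files.size/delete.
    if (Files.exists(path)) {
        final long fileSize = Files.size(path);
        Files.delete(path);
        metrics.chunkDeleted(fileSize);
        log.trace("Deleted cached file for key {} with path {}."
            + " The reason of the deletion is {}", key, path, cause);
    } else {
        log.warn("Cached file for key {} with path {} not found;"
            + " it may have been removed concurrently. Removal cause: {}", key, path, cause);
    }

Note the caveat raised later in the review: the file can still disappear between the Files.exists check and Files.delete, which motivates the try/catch variant sketched at the end of this thread.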

@jeqo jeqo force-pushed the jeqo/fix-flakiness branch 4 times, most recently from c306da3 to f716f6b on July 5, 2024 06:16
@jeqo jeqo changed the title from "chore: fix flakiness" to "fix: handle race condition on disk-based cache retention" on Jul 5, 2024
@jeqo jeqo marked this pull request as ready for review July 5, 2024 06:17
@jeqo jeqo requested a review from a team as a code owner July 5, 2024 06:17
@AnatolyPopov (Contributor)

Sorry, I'm failing to understand what kind of race condition you are talking about. Could you please clarify?

@jeqo (Contributor, Author) commented Jul 5, 2024

@AnatolyPopov Of course, sorry it wasn't explained properly. I have added more details to the description. This PR tries to fix at least one of the (now) known causes of the flaky test failures:
[image: screenshot of the flaky test failures]

@AnatolyPopov (Contributor)

I wonder why this can happen at all. It basically means that the listener runs multiple times for a specific (key, value) pair, if I understand correctly. Or is it a test-only thing, where the test itself cleans up the file?

@jeqo (Contributor, Author) commented Jul 15, 2024

@AnatolyPopov I have refactored the test to use time-based eviction and get more consistent results (before, it tested whether either value 1 or value 2 was deleted; now it tests whether 1, 2, or both are deleted).

I have separated out the exception handling for the missing file, as it's nice to have but doesn't fix the flakiness completely. The test refactoring is what attempts to fix the flakiness. These are two separate commits now. PTAL.

    @@ -89,7 +89,8 @@ void metrics() throws IOException, JMException, StorageBackendException {
            final DiskChunkCache diskChunkCache = new DiskChunkCache(chunkManager, time);
            diskChunkCache.configure(Map.of(
                "size", size1, // enough to put the first, but not both
    -           "path", baseCachePath.toString()
    +           "path", baseCachePath.toString(),
    +           "retention.ms", String.valueOf(Duration.ofSeconds(10).toMillis())
Contributor
I believe the idea of the test was different. It was intended to test that only one chunk is deleted because the cache size reached the limit, and because of Window TinyLfu it is not possible to say whether the first or the second chunk will be deleted. By specifying retention.ms, I believe we miss the case where the metrics are reported correctly for a single chunk, since there is a high chance that both will be deleted.
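For context, a sketch of the size-bounded setup this comment refers to (illustrative only; size1 stands in for the test's cache size variable). With only a weight bound, Caffeine's Window TinyLfu policy picks the eviction victim, so the test cannot assume which chunk goes:

    // Illustrative sketch: a weight-bounded Caffeine cache where the eviction
    // victim is chosen by the Window TinyLfu policy, hence the test's original
    // "either chunk 1 or chunk 2 is deleted" assertion.
    Cache<String, byte[]> cache = Caffeine.newBuilder()
        .maximumWeight(size1)                         // fits the first chunk, not both
        .weigher((String key, byte[] chunk) -> chunk.length)
        .build();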

Contributor Author
We have seen failing test cases (unfortunately too old to reference) that are not dependent on time, meaning that even increasing the timeout beyond 30 s would not help: the eviction had already failed (for some reason; it could be the file-not-found exception), and the metrics will never tick.
This test is not meant to check that the deletion itself works properly (I think that's out of the scope of this test), but to check that the deletion metric is reported.
So expanding the cache configuration to trigger deletion in a more consistent way seems like a fair trade-off to remove the flakiness (otherwise we wait for the flaky test to trigger and hope to get enough logs to troubleshoot; this hasn't worked so far, as CI logs only show "test failed" without including the logs themselves. Maybe that is something else to fix?)

metrics.chunkDeleted(fileSize);
log.trace("Deleted cached file for key {} with path {} from cache directory."
+ " The reason of the deletion is {}", key, path, cause);
if (Files.exists(path)) {
Contributor
I'm still failing to understand how it is possible that the path will be null or the file will not exist. I think that should not be the case, or otherwise something is quite wrong, IMO.

Contributor Author
I agree it's a weird case.

The RemovalListener states:

An instance may be called concurrently by multiple threads to process different entries.

This confirms that it could be a race condition, with separate threads calling the removal. I also find it weird that, if there is a winning thread, the metric is not increased.

Also

Implementations of this interface should avoid performing blocking calls or synchronizing on shared resources.

We already handle the case where the path/value is null by logging an error. This change can be seen as an extension of that, covering the scenario where the referenced file is not found.
We still log it, to help troubleshoot if/when this happens.

@jeqo (Contributor, Author) commented Aug 2, 2024

While testing the flakiness of the disk-based cache delete metrics, it was found that eviction may happen for different reasons and the path may already be deleted.
To avoid a runtime exception (which is swallowed by the listener execution anyway), this PR introduces some validation before checking the file size.
To reduce flakiness where size-based deletion is not executed consistently, a time-based eviction configuration is added so that either one or two deletions happen while the test is running, allowing the results to be validated.

Before, the test only checked whether either the first or the second value was deleted. Now it checks whether 1, 2, or both are deleted.

To validate the fix against flakiness, @RepeatedTest(100000) was used, and the test now passes fine.
@jeqo (Contributor, Author) commented Aug 8, 2024

Also, the same test, but for the memory-based cache, is failing on main: https://github.com/Aiven-Open/tiered-storage-for-apache-kafka/actions/runs/10283516010/job/28457507975

Managed to reproduce locally with @RepeatedTest(1000):

[2024-08-08 20:20:35,517] INFO CacheConfig values: 
	retention.ms = -1
	size = 18
 (io.aiven.kafka.tieredstorage.config.CacheConfig:370)

Condition with Lambda expression in io.aiven.kafka.tieredstorage.fetch.index.MemorySegmentIndexesCacheTest$CacheTests was not fulfilled within 30 seconds.
org.awaitility.core.ConditionTimeoutException: Condition with Lambda expression in io.aiven.kafka.tieredstorage.fetch.index.MemorySegmentIndexesCacheTest$CacheTests was not fulfilled within 30 seconds.
	at org.awaitility.core.ConditionAwaiter.await(ConditionAwaiter.java:167)
	at org.awaitility.core.CallableCondition.await(CallableCondition.java:78)
	at org.awaitility.core.CallableCondition.await(CallableCondition.java:26)
	at org.awaitility.core.ConditionFactory.until(ConditionFactory.java:1006)
	at org.awaitility.core.ConditionFactory.until(ConditionFactory.java:975)
	at io.aiven.kafka.tieredstorage.fetch.index.MemorySegmentIndexesCacheTest$CacheTests.sizeBasedEviction(MemorySegmentIndexesCacheTest.java:262)
	at java.base/java.lang.reflect.Method.invoke(Method.java:568)
	at java.base/java.util.stream.ForEachOps$ForEachOp$OfRef.accept(ForEachOps.java:183)
	at java.base/java.util.stream.ReferencePipeline$3$1.accept(ReferencePipeline.java:197)
	at java.base/java.util.stream.ReferencePipeline$2$1.accept(ReferencePipeline.java:179)
	at java.base/java.util.stream.ReferencePipeline$3$1.accept(ReferencePipeline.java:197)
	at java.base/java.util.stream.ForEachOps$ForEachOp$OfRef.accept(ForEachOps.java:183)
	at java.base/java.util.stream.ReferencePipeline$3$1.accept(ReferencePipeline.java:197)
	at java.base/java.util.stream.IntPipeline$1$1.accept(IntPipeline.java:180)
	at java.base/java.util.stream.Streams$RangeIntSpliterator.forEachRemaining(Streams.java:104)
	at java.base/java.util.Spliterator$OfInt.forEachRemaining(Spliterator.java:711)
	at java.base/java.util.stream.AbstractPipeline.copyInto(AbstractPipeline.java:509)
	at java.base/java.util.stream.AbstractPipeline.wrapAndCopyInto(AbstractPipeline.java:499)
	at java.base/java.util.stream.ForEachOps$ForEachOp.evaluateSequential(ForEachOps.java:150)
	at java.base/java.util.stream.ForEachOps$ForEachOp$OfRef.evaluateSequential(ForEachOps.java:173)
	at java.base/java.util.stream.AbstractPipeline.evaluate(AbstractPipeline.java:234)
	at java.base/java.util.stream.ReferencePipeline.forEach(ReferencePipeline.java:596)
	at java.base/java.util.stream.ReferencePipeline$7$1.accept(ReferencePipeline.java:276)
	at java.base/java.util.ArrayList$ArrayListSpliterator.forEachRemaining(ArrayList.java:1625)
	at java.base/java.util.stream.AbstractPipeline.copyInto(AbstractPipeline.java:509)
	at java.base/java.util.stream.AbstractPipeline.wrapAndCopyInto(AbstractPipeline.java:499)
	at java.base/java.util.stream.ForEachOps$ForEachOp.evaluateSequential(ForEachOps.java:150)
	at java.base/java.util.stream.ForEachOps$ForEachOp$OfRef.evaluateSequential(ForEachOps.java:173)
	at java.base/java.util.stream.AbstractPipeline.evaluate(AbstractPipeline.java:234)
	at java.base/java.util.stream.ReferencePipeline.forEach(ReferencePipeline.java:596)
	at java.base/java.util.ArrayList.forEach(ArrayList.java:1511)
	at java.base/java.util.ArrayList.forEach(ArrayList.java:1511)
	at java.base/java.util.ArrayList.forEach(ArrayList.java:1511)

Where the condition is:

            await()
                .atMost(Duration.ofSeconds(30))
                .pollDelay(Duration.ofSeconds(2))
                .pollInterval(Duration.ofMillis(10))
                .until(() -> !mockingDetails(removalListener).getInvocations().isEmpty());

Similar to the disk-based test, since deletion does not happen consistently (leading to flakiness), this commit adds time-based retention to force deletion and validate the metrics.
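A sketch of what that looks like for the memory-based cache test, assumed to mirror the disk-based diff above (the 10-second retention value is illustrative, not confirmed by the source):

    // Assumed sketch, mirroring the disk-based test change: a short retention
    // makes time-based eviction fire while the test is still running.
    cache.configure(Map.of(
        "size", "18",  // matches the CacheConfig in the failing run above
        "retention.ms", String.valueOf(Duration.ofSeconds(10).toMillis())
    ));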
@jeqo jeqo changed the title from "fix: handle race condition on disk-based cache retention" to "fix: handle race condition on cache retention" on Aug 12, 2024
Comment on lines +101 to +103
if (Files.exists(path)) {
final long fileSize = Files.size(path);
Files.delete(path);
Contributor
Considering the racy nature of things here, a file may be deleted between the Files.exists check and Files.delete. I think we should instead try to delete without a check and catch the "file not found" exception. "It is better to ask forgiveness than permission" :)
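A minimal sketch of that suggestion (illustrative, inside the same removal listener): attempt the deletion unconditionally and treat NoSuchFileException as "already removed":

    // Sketch of the "ask forgiveness" variant: no exists() pre-check; a file
    // deleted concurrently surfaces as NoSuchFileException and is only logged.
    try {
        final long fileSize = Files.size(path);   // may throw if already gone
        Files.delete(path);                       // may also throw NoSuchFileException
        metrics.chunkDeleted(fileSize);
    } catch (final NoSuchFileException e) {
        log.warn("Cached file {} already removed (removal cause: {})", path, cause);
    } catch (final IOException e) {
        log.error("Failed to delete cached file {}", path, e);
    }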
