
fix: handle race condition on cache retention #569

Open
jeqo wants to merge 3 commits into main from jeqo/fix-flakiness

Conversation


@jeqo jeqo commented Jul 4, 2024

Cache removal listener-related tests (DiskChunkCacheMetricsTest and MemorySegmentIndexesCacheTest) are flaky. Recent evidence:

To reproduce this locally, @RepeatedTest(10000) has been used.
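For illustration, a minimal sketch of that reproduction setup, assuming JUnit 5 (the test body itself is unchanged):

    // Hypothetical sketch: repeat the flaky test many times so the race surfaces.
    import org.junit.jupiter.api.RepeatedTest;

    class DiskChunkCacheMetricsTest {
        @RepeatedTest(10000) // JUnit 5 runs this test 10000 times
        void metrics() {
            // ... original test body ...
        }
    }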

The failure is caused by the timeout condition when waiting for a cache entry to be removed:

DiskChunkCacheMetricsTest > metrics() > repetition 279 of 1000 FAILED
    org.awaitility.core.ConditionTimeoutException: Condition with alias 'Deletion happens' didn't complete within 30 seconds because condition with lambda expression in io.aiven.kafka.tieredstorage.fetch.cache.DiskChunkCacheMetricsTest that uses javax.management.ObjectName was not fulfilled.
        at app//org.awaitility.core.ConditionAwaiter.await(ConditionAwaiter.java:167)
        at app//org.awaitility.core.CallableCondition.await(CallableCondition.java:78)
        at app//org.awaitility.core.CallableCondition.await(CallableCondition.java:26)
        at app//org.awaitility.core.ConditionFactory.until(ConditionFactory.java:1006)
        at app//org.awaitility.core.ConditionFactory.until(ConditionFactory.java:975)
        at app//io.aiven.kafka.tieredstorage.fetch.cache.DiskChunkCacheMetricsTest.metrics(DiskChunkCacheMetricsTest.java:125)

Waiting for the RemovalListener to be called right after inserting a couple of entries does not seem to be deterministic, so a retention.ms time boundary is needed to get the removal called within the time frame of the test (the default retention.ms is 10 min).
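As I understand Caffeine's behavior, expiration normally runs as part of cache maintenance piggybacked on other cache activity, which is why waiting for the listener right after a couple of writes is racy. A rough, illustrative sketch (not the project's actual wiring; key/value types and the listener body are placeholders) of how a time bound plus a scheduler makes expiration fire within a predictable window:

    import java.nio.file.Path;
    import java.time.Duration;
    import com.github.benmanes.caffeine.cache.Cache;
    import com.github.benmanes.caffeine.cache.Caffeine;
    import com.github.benmanes.caffeine.cache.RemovalCause;
    import com.github.benmanes.caffeine.cache.Scheduler;

    class RetentionSketch {
        // Illustrative: a short time bound plus a scheduler makes expiration
        // fire promptly instead of waiting for later cache activity to
        // trigger maintenance.
        final Cache<String, Path> cache = Caffeine.newBuilder()
            .expireAfterWrite(Duration.ofSeconds(10))  // test-friendly "retention.ms"
            .scheduler(Scheduler.systemScheduler())    // schedules pending expirations
            .removalListener((String key, Path path, RemovalCause cause) ->
                System.out.printf("removed %s (%s)%n", key, cause))
            .build();
    }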

As a separate finding while running this locally, I spotted a file-not-found exception after a few thousand runs:

[2024-07-05 12:26:01,100] INFO DiskChunkCacheConfig values: 
	path = /var/folders/f_/6tkk7f6x7377dzmwfsqkdtk00000gq/T/junit16372691694497514473
	prefetch.max.size = 0
	retention.ms = 600000
	size = 1024
 (io.aiven.kafka.tieredstorage.fetch.cache.DiskChunkCacheConfig:370)
[2024-07-05 12:43:47,886] ERROR Failed to delete cached file for key ChunkKey(segmentFileName=segment, chunkId=1) with path /var/folders/f_/6tkk7f6x7377dzmwfsqkdtk00000gq/T/junit11001183003053723613/cache/segment-1 from cache directory. The reason of the deletion is EXPIRED (io.aiven.kafka.tieredstorage.fetch.cache.DiskChunkCache:111)
java.nio.file.NoSuchFileException: /var/folders/f_/6tkk7f6x7377dzmwfsqkdtk00000gq/T/junit11001183003053723613/cache/segment-1
	at java.base/sun.nio.fs.UnixException.translateToIOException(UnixException.java:92)
	at java.base/sun.nio.fs.UnixException.rethrowAsIOException(UnixException.java:106)
	at java.base/sun.nio.fs.UnixException.rethrowAsIOException(UnixException.java:111)
	at java.base/sun.nio.fs.UnixFileAttributeViews$Basic.readAttributes(UnixFileAttributeViews.java:55)
	at java.base/sun.nio.fs.UnixFileSystemProvider.readAttributes(UnixFileSystemProvider.java:148)
	at java.base/java.nio.file.Files.readAttributes(Files.java:1851)
	at java.base/java.nio.file.Files.size(Files.java:2468)
	at io.aiven.kafka.tieredstorage.fetch.cache.DiskChunkCache.lambda$removalListener$0(DiskChunkCache.java:101)
	at com.github.benmanes.caffeine.cache.Async$AsyncEvictionListener.onRemoval(Async.java:117)
	at com.github.benmanes.caffeine.cache.Async$AsyncEvictionListener.onRemoval(Async.java:101)
	at com.github.benmanes.caffeine.cache.BoundedLocalCache.notifyEviction(BoundedLocalCache.java:442)
	at com.github.benmanes.caffeine.cache.BoundedLocalCache.lambda$evictEntry$2(BoundedLocalCache.java:1071)
	at java.base/java.util.concurrent.ConcurrentHashMap.computeIfPresent(ConcurrentHashMap.java:1828)
	at com.github.benmanes.caffeine.cache.BoundedLocalCache.evictEntry(BoundedLocalCache.java:1032)
	at com.github.benmanes.caffeine.cache.BoundedLocalCache.expireAfterAccessEntries(BoundedLocalCache.java:939)
	at com.github.benmanes.caffeine.cache.BoundedLocalCache.expireAfterAccessEntries(BoundedLocalCache.java:925)
	at com.github.benmanes.caffeine.cache.BoundedLocalCache.expireEntries(BoundedLocalCache.java:903)
	at com.github.benmanes.caffeine.cache.BoundedLocalCache.maintenance(BoundedLocalCache.java:1721)
	at com.github.benmanes.caffeine.cache.BoundedLocalCache.performCleanUp(BoundedLocalCache.java:1660)
	at com.github.benmanes.caffeine.cache.BoundedLocalCache$PerformCleanupTask.run(BoundedLocalCache.java:3886)
	at java.base/java.util.concurrent.CompletableFuture$AsyncRun.run(CompletableFuture.java:1804)
	at java.base/java.util.concurrent.CompletableFuture$AsyncRun.exec(CompletableFuture.java:1796)
	at java.base/java.util.concurrent.ForkJoinTask.doExec(ForkJoinTask.java:373)
	at java.base/java.util.concurrent.ForkJoinPool$WorkQueue.topLevelExec(ForkJoinPool.java:1182)
	at java.base/java.util.concurrent.ForkJoinPool.scan(ForkJoinPool.java:1655)
	at java.base/java.util.concurrent.ForkJoinPool.runWorker(ForkJoinPool.java:1622)
	at java.base/java.util.concurrent.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:165)

There seem to be multiple calls to this listener happening concurrently, causing this behavior (the first caller wins and the next one does not find the file), so additional handling has been added. At runtime this exception is swallowed by the listener execution anyway, so this is mostly about having better logging when it happens.

This seems to be expected looking at the Caffeine docs:

The RemovalListener states:

An instance may be called concurrently by multiple threads to process different entries.

Also

Implementations of this interface should avoid performing blocking calls or synchronizing on shared resources.
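A minimal sketch of the added handling (illustrative; the real change is in DiskChunkCache's removal listener): treat a missing file as a deletion that already happened and log it, instead of letting Files.size throw:

    // Sketch only: if a concurrent removal already deleted the file, log a
    // warning instead of throwing NoSuchFileException from Files.size/delete.
    if (Files.exists(path)) {
        final long fileSize = Files.size(path);
        Files.delete(path);
        metrics.chunkDeleted(fileSize);
        log.trace("Deleted cached file for key {} with path {}."
            + " The reason of the deletion is {}", key, path, cause);
    } else {
        log.warn("Cached file for key {} with path {} not found;"
            + " it may have been removed concurrently. Removal cause: {}", key, path, cause);
    }

Note the caveat raised later in the review: the file can still disappear between the Files.exists check and Files.delete, which motivates the try/catch variant sketched at the end of this thread.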

@jeqo jeqo force-pushed the jeqo/fix-flakiness branch 4 times, most recently from c306da3 to f716f6b on July 5, 2024 06:16
@jeqo jeqo changed the title from "chore: fix flakiness" to "fix: handle race condition on disk-based cache retention" on Jul 5, 2024
@jeqo jeqo marked this pull request as ready for review July 5, 2024 06:17
@jeqo jeqo requested a review from a team as a code owner July 5, 2024 06:17
@AnatolyPopov (Contributor)

Sorry, I'm failing to understand what kind of race condition you are talking about. Could you please clarify?

@jeqo (Contributor, Author) commented Jul 5, 2024

@AnatolyPopov Of course, sorry it wasn't explained properly. I have added more details to the description. This PR tries to fix at least one of the (now) known causes of the flaky test failures:
[image: screenshot of the flaky test failures]

@AnatolyPopov (Contributor)

I wonder why this can happen at all. It basically means that the listener runs multiple times for a specific (key, value) pair, if I understand correctly. Or is it a test-only thing, where the test itself cleans up the file?

@jeqo (Contributor, Author) commented Jul 15, 2024

@AnatolyPopov I have refactored the test to use time-based eviction and get more consistent results (before, it tested whether either value 1 or value 2 was deleted; now it tests whether 1, 2, or both are deleted).

I have separated out the exception handling for the missing file, as it's nice to have but doesn't fix the flakiness completely. The test refactoring is what attempts to fix the flakiness. These are two separate commits now. PTAL.

    @@ -89,7 +89,8 @@ void metrics() throws IOException, JMException, StorageBackendException {
            final DiskChunkCache diskChunkCache = new DiskChunkCache(chunkManager, time);
            diskChunkCache.configure(Map.of(
                "size", size1, // enough to put the first, but not both
    -           "path", baseCachePath.toString()
    +           "path", baseCachePath.toString(),
    +           "retention.ms", String.valueOf(Duration.ofSeconds(10).toMillis())
Contributor
I believe the idea of the test was different. It was intended to test that only one chunk is deleted because the cache size reached the limit, and because of Window TinyLfu it is not possible to say whether the first or the second chunk will be deleted. By specifying retention.ms, I believe we miss the case where the metrics are reported correctly for a single chunk, since there is a high chance that both will be deleted.
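For context, a sketch of the size-bounded setup this comment refers to (illustrative only; size1 stands in for the test's cache size variable). With only a weight bound, Caffeine's Window TinyLfu policy picks the eviction victim, so the test cannot assume which chunk goes:

    // Illustrative sketch: a weight-bounded Caffeine cache where the eviction
    // victim is chosen by the Window TinyLfu policy, hence the test's original
    // "either chunk 1 or chunk 2 is deleted" assertion.
    Cache<String, byte[]> cache = Caffeine.newBuilder()
        .maximumWeight(size1)                         // fits the first chunk, not both
        .weigher((String key, byte[] chunk) -> chunk.length)
        .build();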

Contributor Author
We have seen failing test cases (unfortunately too old to reference) that are not dependent on time, meaning that even increasing the timeout beyond 30 s would not help: the eviction had already failed (for some reason; it could be the file-not-found exception), and the metrics will never tick.
This test is not meant to check that the deletion itself works properly (I think that's out of the scope of this test), but to check that the deletion metric is reported.
So expanding the cache configuration to trigger deletion in a more consistent way seems like a fair trade-off to remove the flakiness (otherwise we wait for the flaky test to trigger and hope to get enough logs to troubleshoot; this hasn't worked so far, as CI logs only show "test failed" without including the logs themselves. Maybe that is something else to fix?)

metrics.chunkDeleted(fileSize);
log.trace("Deleted cached file for key {} with path {} from cache directory."
+ " The reason of the deletion is {}", key, path, cause);
if (Files.exists(path)) {
Contributor
I'm still failing to understand how it is possible that the path will be null or the file will not exist. I think that should not be the case, or otherwise something is quite wrong, IMO.

Contributor Author
I agree it's a weird case.

The RemovalListener states:

An instance may be called concurrently by multiple threads to process different entries.

This confirms that it could be a race condition, with separate threads calling the removal. I also find it weird that, if there is a winning thread, the metric is not increased.

Also

Implementations of this interface should avoid performing blocking calls or synchronizing on shared resources.

We already handle the case where the path/value is null by logging an error. This change can be seen as an extension of that, covering the scenario where the referenced file is not found.
We still log it, to help troubleshoot if/when this happens.

@jeqo (Contributor, Author) commented Aug 2, 2024

While testing the flakiness of the disk-based cache delete metrics, it was found that eviction may happen for different reasons and the path may already be deleted.
To avoid a runtime exception (which is swallowed by the listener execution anyway), this PR introduces some validation before checking the file size.
To reduce flakiness where size-based deletion is not executed consistently, a time-based eviction configuration is added so that either one or two deletions happen while the test is running, allowing the results to be validated.

Before, the test only checked whether either the first or the second value was deleted. Now it checks whether 1, 2, or both are deleted.

To validate the fix against flakiness, @RepeatedTest(100000) was used, and the test now passes fine.
@jeqo (Contributor, Author) commented Aug 8, 2024

Also, the same test, but for the memory-based cache, is failing on main: https://github.com/Aiven-Open/tiered-storage-for-apache-kafka/actions/runs/10283516010/job/28457507975

Managed to reproduce locally with @RepeatedTest(1000):

[2024-08-08 20:20:35,517] INFO CacheConfig values: 
	retention.ms = -1
	size = 18
 (io.aiven.kafka.tieredstorage.config.CacheConfig:370)

Condition with Lambda expression in io.aiven.kafka.tieredstorage.fetch.index.MemorySegmentIndexesCacheTest$CacheTests was not fulfilled within 30 seconds.
org.awaitility.core.ConditionTimeoutException: Condition with Lambda expression in io.aiven.kafka.tieredstorage.fetch.index.MemorySegmentIndexesCacheTest$CacheTests was not fulfilled within 30 seconds.
	at org.awaitility.core.ConditionAwaiter.await(ConditionAwaiter.java:167)
	at org.awaitility.core.CallableCondition.await(CallableCondition.java:78)
	at org.awaitility.core.CallableCondition.await(CallableCondition.java:26)
	at org.awaitility.core.ConditionFactory.until(ConditionFactory.java:1006)
	at org.awaitility.core.ConditionFactory.until(ConditionFactory.java:975)
	at io.aiven.kafka.tieredstorage.fetch.index.MemorySegmentIndexesCacheTest$CacheTests.sizeBasedEviction(MemorySegmentIndexesCacheTest.java:262)
	at java.base/java.lang.reflect.Method.invoke(Method.java:568)
	at java.base/java.util.stream.ForEachOps$ForEachOp$OfRef.accept(ForEachOps.java:183)
	at java.base/java.util.stream.ReferencePipeline$3$1.accept(ReferencePipeline.java:197)
	at java.base/java.util.stream.ReferencePipeline$2$1.accept(ReferencePipeline.java:179)
	at java.base/java.util.stream.ReferencePipeline$3$1.accept(ReferencePipeline.java:197)
	at java.base/java.util.stream.ForEachOps$ForEachOp$OfRef.accept(ForEachOps.java:183)
	at java.base/java.util.stream.ReferencePipeline$3$1.accept(ReferencePipeline.java:197)
	at java.base/java.util.stream.IntPipeline$1$1.accept(IntPipeline.java:180)
	at java.base/java.util.stream.Streams$RangeIntSpliterator.forEachRemaining(Streams.java:104)
	at java.base/java.util.Spliterator$OfInt.forEachRemaining(Spliterator.java:711)
	at java.base/java.util.stream.AbstractPipeline.copyInto(AbstractPipeline.java:509)
	at java.base/java.util.stream.AbstractPipeline.wrapAndCopyInto(AbstractPipeline.java:499)
	at java.base/java.util.stream.ForEachOps$ForEachOp.evaluateSequential(ForEachOps.java:150)
	at java.base/java.util.stream.ForEachOps$ForEachOp$OfRef.evaluateSequential(ForEachOps.java:173)
	at java.base/java.util.stream.AbstractPipeline.evaluate(AbstractPipeline.java:234)
	at java.base/java.util.stream.ReferencePipeline.forEach(ReferencePipeline.java:596)
	at java.base/java.util.stream.ReferencePipeline$7$1.accept(ReferencePipeline.java:276)
	at java.base/java.util.ArrayList$ArrayListSpliterator.forEachRemaining(ArrayList.java:1625)
	at java.base/java.util.stream.AbstractPipeline.copyInto(AbstractPipeline.java:509)
	at java.base/java.util.stream.AbstractPipeline.wrapAndCopyInto(AbstractPipeline.java:499)
	at java.base/java.util.stream.ForEachOps$ForEachOp.evaluateSequential(ForEachOps.java:150)
	at java.base/java.util.stream.ForEachOps$ForEachOp$OfRef.evaluateSequential(ForEachOps.java:173)
	at java.base/java.util.stream.AbstractPipeline.evaluate(AbstractPipeline.java:234)
	at java.base/java.util.stream.ReferencePipeline.forEach(ReferencePipeline.java:596)
	at java.base/java.util.ArrayList.forEach(ArrayList.java:1511)
	at java.base/java.util.ArrayList.forEach(ArrayList.java:1511)
	at java.base/java.util.ArrayList.forEach(ArrayList.java:1511)

Where the condition is:

            await()
                .atMost(Duration.ofSeconds(30))
                .pollDelay(Duration.ofSeconds(2))
                .pollInterval(Duration.ofMillis(10))
                .until(() -> !mockingDetails(removalListener).getInvocations().isEmpty());

Similar to the disk-based test, since deletion does not happen consistently (leading to flakiness), this commit adds time-based retention to force deletion and validate the metrics.
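A sketch of what that looks like for the memory-based cache test, assumed to mirror the disk-based diff above (the 10-second retention value is illustrative, not confirmed by the source):

    // Assumed sketch, mirroring the disk-based test change: a short retention
    // makes time-based eviction fire while the test is still running.
    cache.configure(Map.of(
        "size", "18",  // matches the CacheConfig in the failing run above
        "retention.ms", String.valueOf(Duration.ofSeconds(10).toMillis())
    ));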
@jeqo jeqo changed the title from "fix: handle race condition on disk-based cache retention" to "fix: handle race condition on cache retention" on Aug 12, 2024
Comment on lines +101 to +103
if (Files.exists(path)) {
final long fileSize = Files.size(path);
Files.delete(path);
Contributor
Considering the racy nature of things here, a file may be deleted between the Files.exists check and Files.delete. I think we should instead try to delete without a check and catch the "file not found" exception. "It is better to ask forgiveness than permission" :)
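A minimal sketch of that suggestion (illustrative, inside the same removal listener): attempt the deletion unconditionally and treat NoSuchFileException as "already removed":

    // Sketch of the "ask forgiveness" variant: no exists() pre-check; a file
    // deleted concurrently surfaces as NoSuchFileException and is only logged.
    try {
        final long fileSize = Files.size(path);   // may throw if already gone
        Files.delete(path);                       // may also throw NoSuchFileException
        metrics.chunkDeleted(fileSize);
    } catch (final NoSuchFileException e) {
        log.warn("Cached file {} already removed (removal cause: {})", path, cause);
    } catch (final IOException e) {
        log.error("Failed to delete cached file {}", path, e);
    }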
