
Add an end-to-end test for trim gap handling using snapshots #2463

Open · wants to merge 9 commits into base: feat/trim-log-report-lsn from feat/trim-gap-e2e-test
Conversation

@pcholakov pcholakov commented Dec 31, 2024

This test simulates a trim gap and verifies the behavior with and without a suitable snapshot present to enable fast-forward over the gap.

This is a follow-up to #2456 adding an e2e test for snapshot-based fast-forward over a log trim gap.

There are several to-dos here that require deeper changes - I'd like to do those as separate PRs to avoid delaying merging of trim-gap support itself. At a minimum this includes:

  • the create-snapshot admin API should return the min captured LSN of snapshots
  • the trim admin API should include the effective new trim point; currently BifrostAdmin can decide to no-op the request if the trim point is greater than the global tail it knows about, which makes it hard to test
  • [optional] we don't have a good way (that I'm aware of) to externally ask a specific partition processor to become leader; this would be useful for testing and potentially manual operations

Primary reviewer: @tillrohrmann

cc: @jackkleeman as an optional reviewer, since I modified some test cluster infra and a test you previously added - but feel free to ignore!


github-actions bot commented Dec 31, 2024

Test Results

  7 files  ±0    7 suites  ±0   4m 28s ⏱️ +7s
 47 tests ±0   46 ✅ ±0  1 💤 ±0  0 ❌ ±0 
182 runs  ±0  179 ✅ ±0  3 💤 ±0  0 ❌ ±0 

Results for commit 446bd56. ± Comparison against base commit b86ac06.


@pcholakov pcholakov force-pushed the feat/trim-gap-e2e-test branch from 12b10fb to 713e010 Compare December 31, 2024 16:18
@@ -749,6 +749,30 @@ impl StartedNode {
}
}

impl Drop for StartedNode {
fn drop(&mut self) {
Contributor Author:

I added this to avoid leaking restate-server processes from tests.
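A minimal sketch of that pattern (the names and the `sleep` stand-in for restate-server are hypothetical; the real `StartedNode` holds more state):

```rust
use std::process::{Child, Command};

// Hypothetical guard type: kills the spawned server process on drop so
// tests don't leak processes, even when they fail and unwind.
struct StartedNodeGuard {
    child: Child,
}

impl Drop for StartedNodeGuard {
    fn drop(&mut self) {
        // Errors are ignored: the process may already have exited.
        let _ = self.child.kill();
        let _ = self.child.wait();
    }
}

fn run_with_guard() -> u32 {
    // `sleep` stands in for a long-running restate-server process.
    let child = Command::new("sleep").arg("30").spawn().expect("spawn sleep");
    let pid = child.id();
    let _guard = StartedNodeGuard { child };
    pid // the guard drops here and the child is killed immediately
}
```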


mod common;

#[tokio::test]
Contributor Author:

I am using this rather than test(restate_core::test) because that macro just exits the process on panic, which prevents unwinding from running - and can lead to leaked spawned processes on test failure.

Contributor:

I think the reason we have this panic hook is to ensure that if a panic occurs within a spawned task, the tests will fail. Otherwise, the panic might just be swallowed by the task.

Contributor Author:

Ah! Of course; I recall the discussion now - switched to #[test_log::test(tokio::test)] as a middle ground to ensure that the Drop callback works.
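The underlying distinction: exiting the process on panic skips destructors, while a normal panic unwinds the stack and runs `Drop`. A small self-contained sketch (the guard is a stand-in for the node cleanup above):

```rust
use std::panic;
use std::sync::atomic::{AtomicBool, Ordering};

static CLEANED_UP: AtomicBool = AtomicBool::new(false);

// Stand-in for StartedNode's cleanup-on-drop behavior.
struct Guard;

impl Drop for Guard {
    fn drop(&mut self) {
        // In the real test this would kill the restate-server child process.
        CLEANED_UP.store(true, Ordering::SeqCst);
    }
}

fn run_test_body() {
    let _guard = Guard;
    // Unwinding runs the guard's Drop; std::process::exit would not.
    panic!("simulated test failure");
}
```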

Contributor Author:

The new trim gap fast-forward test covers the same paths as this one.

@@ -34,6 +34,7 @@ description = "Restate makes distributed applications easy!"
[workspace.dependencies]
# Own crates
codederror = { path = "crates/codederror" }
mock-service-endpoint = { path = "tools/mock-service-endpoint" }
Contributor Author:

With this PR, the mock service handler is now usable from other packages - this is handy for e2e testing.

@@ -89,7 +89,6 @@ pub struct TestEnv {
pub loglet: Arc<dyn Loglet>,
pub metadata_writer: MetadataWriter,
pub metadata_store_client: MetadataStoreClient,
pub cluster: StartedCluster,
Contributor Author:

I removed passing the cluster to the test routine as it is easy to accidentally drop it, and kill the cluster in the process. We can reintroduce it as a reference if it's needed in the future.

@pcholakov pcholakov marked this pull request as ready for review January 2, 2025 16:32
@pcholakov pcholakov force-pushed the feat/trim-gap-e2e-test branch 2 times, most recently from e7bd6c7 to 071338c Compare January 3, 2025 13:56
Base automatically changed from feat/trim-gap-handling to main January 3, 2025 14:31
@tillrohrmann left a comment:

Thanks for creating the end-to-end test for our snapshots @pcholakov. The changes look good to me.

The one aspect that makes me a bit uneasy is that it seems we cannot reliably guarantee that a trim has happened. If this is correct, then we might be adding a test which is unstable in our CI environment. Maybe, because of this, it's worth first adding the functionality to report back which LSN was trimmed, so that we can make the trim_log function reliable?

pid,
);
match nix::sys::signal::kill(
nix::unistd::Pid::from_raw(pid.try_into().unwrap()),
Contributor:

Is this try_into infallible or why is unwrap ok here?
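For context, `Child::id` returns a `u32` while nix's `Pid::from_raw` takes an `i32`; on Linux, kernel-assigned pids stay far below `i32::MAX` (the default `pid_max` is well under 2^31), so the conversion cannot fail in practice. A hedged sketch (hypothetical helper name) that documents the invariant instead of a bare `unwrap`:

```rust
// Hypothetical helper: convert a u32 child-process id into the raw i32
// that nix's Pid::from_raw expects. The conversion only fails for pids
// above i32::MAX, which the kernel never assigns.
fn to_raw_pid(pid: u32) -> i32 {
    i32::try_from(pid).expect("OS-assigned pids fit in i32")
}
```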



Comment on lines 47 to 54
tracing_subscriber::fmt()
.event_format(tracing_subscriber::fmt::format().compact())
.with_env_filter(
tracing_subscriber::EnvFilter::builder()
.with_default_directive(LevelFilter::INFO.into())
.from_env_lossy(),
)
.init();
Contributor:

If you use test_log::test(tokio::test), then you don't have to set these things up yourself.

Contributor Author:

TY!

Comment on lines +204 to +196
// todo(pavel): promote node 3 to be the leader for partition 0 and invoke the service again
// right now, all we are asserting is that the new node is applying newly appended log records
Contributor:

You could do this by manually changing the SchedulingPlan.

Contributor Author:

I didn't think it would be this easy... and it seems like it isn't. I added a step to manually set the SchedulingPlan but it only works intermittently - Scheduler::update_scheduling_plan nukes the changes as soon as it picks them up. I think this is important, so let's definitely do it - but maybe as a follow-up task to provide a leadership hint to the scheduler?

State::Alive(s) => s
.partitions
.values()
.any(|p| p.effective_mode.cmp(&1).is_eq()),
Contributor:

I think it is clearer if you compared against RunMode instead of the ordinal value which is harder to remember.

Contributor Author:

The magic of try_from! Thanks for the tip :-)
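A self-contained sketch of the suggestion (this `RunMode` is a stand-in mirroring the protobuf ordinals used in the snippet above; the real enum lives in restate's generated protobuf types):

```rust
// Stand-in for the generated protobuf enum; ordinal 1 = Leader here.
#[derive(Debug, PartialEq)]
enum RunMode {
    Follower = 0,
    Leader = 1,
}

impl TryFrom<i32> for RunMode {
    type Error = i32;
    fn try_from(v: i32) -> Result<Self, Self::Error> {
        match v {
            0 => Ok(RunMode::Follower),
            1 => Ok(RunMode::Leader),
            other => Err(other),
        }
    }
}

// Clearer than `p.effective_mode.cmp(&1).is_eq()`: the ordinal gets a name.
fn any_leader(effective_modes: &[i32]) -> bool {
    effective_modes
        .iter()
        .any(|&m| RunMode::try_from(m) == Ok(RunMode::Leader))
}
```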

Comment on lines 264 to 271
let mut i = 0;
loop {
client
.trim_log(TrimLogRequest {
log_id: 0,
trim_point,
})
.await?;
if i >= 2 {
break;
}
tokio::time::sleep(Duration::from_secs(1)).await;
i += 1;
}
@tillrohrmann (Jan 3, 2025):

How did you come up with the magic number of 3 attempts?

Contributor Author:

Empirically! I think Azmy suggested it may be related to the heartbeat interval and updating the global tail. Moot now; I've converted this to a retry until the desired effective trim is reached.
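The replaced fixed-count loop can be sketched as a generic retry-until-condition helper (synchronous and with hypothetical names here; the actual test polls the async trim RPC until the reported trim point matches):

```rust
use std::time::{Duration, Instant};

// Hypothetical helper: poll `condition` until it holds or `timeout` expires,
// instead of guessing a magic number of blind attempts.
fn retry_until(
    mut condition: impl FnMut() -> bool,
    timeout: Duration,
    interval: Duration,
) -> bool {
    let deadline = Instant::now() + timeout;
    loop {
        if condition() {
            return true;
        }
        if Instant::now() >= deadline {
            return false;
        }
        std::thread::sleep(interval);
    }
}
```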

Comment on lines 254 to 258
async fn trim_log(
client: &mut ClusterCtrlSvcClient<Channel>,
trim_point: u64,
) -> googletest::Result<()> {
// todo(pavel): this is flimsy, ensure we actually trim the log to a particular LSN
Contributor:

If this method does not do anything because the admin node didn't have the up-to-date log tail, then I think the remaining test will be stuck. This might be a problem for the stability of the test. Something to observe on our CI infra, where timings can be quite skewed.

Contributor Author:

Sorry, I created the wrong impression with the todo comment - I've rebased on #2468 which allows this to be deterministic :-)

Comment on lines +338 to +354
async fn grpc_connect(address: AdvertisedAddress) -> Result<Channel, tonic::transport::Error> {
match address {
AdvertisedAddress::Uds(uds_path) => {
// dummy endpoint required to specify an uds connector, it is not used anywhere
Endpoint::try_from("http://127.0.0.1")
.expect("/ should be a valid Uri")
.connect_with_connector(service_fn(move |_: Uri| {
let uds_path = uds_path.clone();
async move {
Ok::<_, io::Error>(TokioIo::new(UnixStream::connect(uds_path).await?))
}
})).await
}
AdvertisedAddress::Http(uri) => {
Channel::builder(uri)
.connect_timeout(Duration::from_secs(2))
.timeout(Duration::from_secs(2))
.http2_adaptive_window(true)
.connect()
.await
}
}
}
Contributor:

This looks quite similar to create_tonic_channel_from_advertised_address. Could this be reused?

@pcholakov (Jan 6, 2025):

I copied it nearly verbatim from restatectl's grpc_connect utility - which looks like it may have been the origin of create_tonic_channel_from_advertised_address, too. I've done this under its own PR here:

#2469

@pcholakov pcholakov force-pushed the feat/trim-gap-e2e-test branch from 071338c to 446bd56 Compare January 6, 2025 13:31
@pcholakov pcholakov changed the base branch from main to feat/trim-log-report-lsn January 6, 2025 13:33
@pcholakov pcholakov force-pushed the feat/trim-gap-e2e-test branch from 60dbc4e to 446bd56 Compare January 6, 2025 16:06
@pcholakov pcholakov requested review from tillrohrmann and removed request for jackkleeman January 6, 2025 16:10
@pcholakov (Contributor Author):

The one aspect that makes me a bit uneasy is that it seems we cannot reliably guarantee that a trim has happened. If this is correct, then we might be adding a test which is unstable in our CI environment. Maybe, because of this, it's worth first adding the functionality to report back which LSN was trimmed, so that we can make the trim_log function reliable?

Yes, definitely! I was already working on that - I realize my todo might have created the wrong impression :-) Here is the change, on which this PR is now rebased: #2468.

I wasn't able to get the leadership change to work reliably but I'm pretty keen to do that too. However, I believe that the test as it stands should be reasonably robust to merge and won't cause undue noise in CI.
