
Add an end-to-end test for trim gap handling using snapshots #2463

Open · wants to merge 9 commits into base: feat/trim-log-report-lsn from feat/trim-gap-e2e-test
Conversation

@pcholakov pcholakov commented Dec 31, 2024

This test simulates a trim gap and verifies the behavior with and without a suitable snapshot present to enable fast-forward over the gap.

This is a follow-up to #2456 adding an e2e test for snapshot-based fast-forward over a log trim gap.

There are several to-dos here that require deeper changes - I'd like to do those as separate PRs to avoid delaying merging of trim-gap support itself. At a minimum this includes:

  • the create-snapshot admin API should return the min captured LSN of snapshots
  • the trim admin API should include the effective new trim point; currently BifrostAdmin can decide to no-op the request if the trim point is greater than the global tail it knows about, which makes it hard to test
  • [optional] we don't have a good way (that I'm aware of) to externally ask a specific partition processor to become leader; this would be useful for testing and potentially manual operations

Primary reviewer: @tillrohrmann

cc: @jackkleeman as an optional reviewer, since I modified some test cluster infra and a test you previously added - but feel free to ignore!


github-actions bot commented Dec 31, 2024

Test Results

  7 files  ±0    7 suites  ±0   4m 28s ⏱️ +7s
 47 tests ±0   46 ✅ ±0  1 💤 ±0  0 ❌ ±0 
182 runs  ±0  179 ✅ ±0  3 💤 ±0  0 ❌ ±0 

Results for commit 446bd56. ± Comparison against base commit b86ac06.


@pcholakov pcholakov force-pushed the feat/trim-gap-e2e-test branch from 12b10fb to 713e010 Compare December 31, 2024 16:18
@@ -749,6 +749,30 @@ impl StartedNode {
}
}

impl Drop for StartedNode {
fn drop(&mut self) {
Contributor Author:

I added this to avoid leaking restate-server processes from tests.
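A minimal sketch of that pattern (the names and the `sleep` stand-in for restate-server are hypothetical; the real `StartedNode` holds more state):

```rust
use std::process::{Child, Command};

// Hypothetical guard type: kills the spawned server process on drop so
// tests don't leak processes, even when they fail and unwind.
struct StartedNodeGuard {
    child: Child,
}

impl Drop for StartedNodeGuard {
    fn drop(&mut self) {
        // Errors are ignored: the process may already have exited.
        let _ = self.child.kill();
        let _ = self.child.wait();
    }
}

fn run_with_guard() -> u32 {
    // `sleep` stands in for a long-running restate-server process.
    let child = Command::new("sleep").arg("30").spawn().expect("spawn sleep");
    let pid = child.id();
    let _guard = StartedNodeGuard { child };
    pid // the guard drops here and the child is killed immediately
}
```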


mod common;

#[tokio::test]
Contributor Author:

I am using this rather than test(restate_core::test) because that macro just exits the process on panic, which prevents unwinding from running - and can lead to leaked spawned processes on test failure.

Contributor:

I think the reason we have this panic hook is to ensure that if a panic occurs within a spawned task, the tests will fail. Otherwise, the panic might just be swallowed by the task.

Contributor Author:

Ah! Of course; I recall the discussion now - switched to #[test_log::test(tokio::test)] as a middle ground to ensure that the Drop callback works.
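The underlying distinction: exiting the process on panic skips destructors, while a normal panic unwinds the stack and runs `Drop`. A small self-contained sketch (the guard is a stand-in for the node cleanup above):

```rust
use std::panic;
use std::sync::atomic::{AtomicBool, Ordering};

static CLEANED_UP: AtomicBool = AtomicBool::new(false);

// Stand-in for StartedNode's cleanup-on-drop behavior.
struct Guard;

impl Drop for Guard {
    fn drop(&mut self) {
        // In the real test this would kill the restate-server child process.
        CLEANED_UP.store(true, Ordering::SeqCst);
    }
}

fn run_test_body() {
    let _guard = Guard;
    // Unwinding runs the guard's Drop; std::process::exit would not.
    panic!("simulated test failure");
}
```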

Contributor Author:

The new trim gap fast-forward test covers the same paths as this one.

@@ -34,6 +34,7 @@ description = "Restate makes distributed applications easy!"
[workspace.dependencies]
# Own crates
codederror = { path = "crates/codederror" }
mock-service-endpoint = { path = "tools/mock-service-endpoint" }
Contributor Author:

With this PR, the mock service handler is now usable from other packages - this is handy for e2e testing.

@@ -89,7 +89,6 @@ pub struct TestEnv {
pub loglet: Arc<dyn Loglet>,
pub metadata_writer: MetadataWriter,
pub metadata_store_client: MetadataStoreClient,
pub cluster: StartedCluster,
Contributor Author:

I removed passing the cluster to the test routine as it is easy to accidentally drop it, and kill the cluster in the process. We can reintroduce it as a reference if it's needed in the future.

@pcholakov pcholakov marked this pull request as ready for review January 2, 2025 16:32
@pcholakov pcholakov force-pushed the feat/trim-gap-e2e-test branch 2 times, most recently from e7bd6c7 to 071338c Compare January 3, 2025 13:56
Base automatically changed from feat/trim-gap-handling to main January 3, 2025 14:31
@tillrohrmann left a comment:

Thanks for creating the end-to-end test for our snapshots @pcholakov. The changes look good to me.

The one aspect that makes me a bit uneasy is that it seems we cannot reliably guarantee that a trim has happened. If this is correct, then we might be adding a test which is unstable in our CI environment. Maybe, because of this, it's worth first adding the functionality to report back which LSN was trimmed, so that we can make the trim_log function reliable?

pid,
);
match nix::sys::signal::kill(
nix::unistd::Pid::from_raw(pid.try_into().unwrap()),
Contributor:

Is this try_into infallible or why is unwrap ok here?
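For context, `Child::id` returns a `u32` while nix's `Pid::from_raw` takes an `i32`; on Linux, kernel-assigned pids stay far below `i32::MAX` (the default `pid_max` is well under 2^31), so the conversion cannot fail in practice. A hedged sketch (hypothetical helper name) that documents the invariant instead of a bare `unwrap`:

```rust
// Hypothetical helper: convert a u32 child-process id into the raw i32
// that nix's Pid::from_raw expects. The conversion only fails for pids
// above i32::MAX, which the kernel never assigns.
fn to_raw_pid(pid: u32) -> i32 {
    i32::try_from(pid).expect("OS-assigned pids fit in i32")
}
```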



Comment on lines 47 to 54
tracing_subscriber::fmt()
.event_format(tracing_subscriber::fmt::format().compact())
.with_env_filter(
tracing_subscriber::EnvFilter::builder()
.with_default_directive(LevelFilter::INFO.into())
.from_env_lossy(),
)
.init();
Contributor:

If you use test_log::test(tokio::test), then you don't have to set these things up yourself.

Contributor Author:

TY!

Comment on lines +204 to +196
// todo(pavel): promote node 3 to be the leader for partition 0 and invoke the service again
// right now, all we are asserting is that the new node is applying newly appended log records
Contributor:

You could do this by manually changing the SchedulingPlan.

Contributor Author:

I didn't think it would be this easy... and it seems like it isn't. I added a step to manually set the SchedulingPlan but it only works intermittently - Scheduler::update_scheduling_plan nukes the changes as soon as it picks them up. I think this is important, so let's definitely do it - but maybe as a follow-up task to provide a leadership hint to the scheduler?

State::Alive(s) => s
.partitions
.values()
.any(|p| p.effective_mode.cmp(&1).is_eq()),
Contributor:

I think it is clearer if you compared against RunMode instead of the ordinal value which is harder to remember.

Contributor Author:

The magic of try_from! Thanks for the tip :-)
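A self-contained sketch of the suggestion (this `RunMode` is a stand-in mirroring the protobuf ordinals used in the snippet above; the real enum lives in restate's generated protobuf types):

```rust
// Stand-in for the generated protobuf enum; ordinal 1 = Leader here.
#[derive(Debug, PartialEq)]
enum RunMode {
    Follower = 0,
    Leader = 1,
}

impl TryFrom<i32> for RunMode {
    type Error = i32;
    fn try_from(v: i32) -> Result<Self, Self::Error> {
        match v {
            0 => Ok(RunMode::Follower),
            1 => Ok(RunMode::Leader),
            other => Err(other),
        }
    }
}

// Clearer than `p.effective_mode.cmp(&1).is_eq()`: the ordinal gets a name.
fn any_leader(effective_modes: &[i32]) -> bool {
    effective_modes
        .iter()
        .any(|&m| RunMode::try_from(m) == Ok(RunMode::Leader))
}
```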

Comment on lines 264 to 271
let mut i = 0;
loop {
client
.trim_log(TrimLogRequest {
log_id: 0,
trim_point,
})
.await?;
if i >= 2 {
break;
}
tokio::time::sleep(Duration::from_secs(1)).await;
i += 1;
}
@tillrohrmann (Jan 3, 2025):

How did you come up with the magic number of 3 attempts?

Contributor Author:

Empirically! I think Azmy suggested it may be related to the heartbeat interval and updating the global tail. Moot now; I've converted this to a retry until the desired effective trim is reached.
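The replaced fixed-count loop can be sketched as a generic retry-until-condition helper (synchronous and with hypothetical names here; the actual test polls the async trim RPC until the reported trim point matches):

```rust
use std::time::{Duration, Instant};

// Hypothetical helper: poll `condition` until it holds or `timeout` expires,
// instead of guessing a magic number of blind attempts.
fn retry_until(
    mut condition: impl FnMut() -> bool,
    timeout: Duration,
    interval: Duration,
) -> bool {
    let deadline = Instant::now() + timeout;
    loop {
        if condition() {
            return true;
        }
        if Instant::now() >= deadline {
            return false;
        }
        std::thread::sleep(interval);
    }
}
```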

Comment on lines 254 to 258
async fn trim_log(
client: &mut ClusterCtrlSvcClient<Channel>,
trim_point: u64,
) -> googletest::Result<()> {
// todo(pavel): this is flimsy, ensure we actually trim the log to a particular LSN
Contributor:

If this method does not do anything because the admin node didn't have the up-to-date log tail, then I think the remaining test will be stuck. This might be a problem for the stability of the test. Something to observe on our CI infra, where timings can be quite skewed.

Contributor Author:

Sorry, I created the wrong impression with the todo comment - I've rebased on #2468 which allows this to be deterministic :-)

Comment on lines +338 to +354
async fn grpc_connect(address: AdvertisedAddress) -> Result<Channel, tonic::transport::Error> {
match address {
AdvertisedAddress::Uds(uds_path) => {
// dummy endpoint required to specify an uds connector, it is not used anywhere
Endpoint::try_from("http://127.0.0.1")
.expect("/ should be a valid Uri")
.connect_with_connector(service_fn(move |_: Uri| {
let uds_path = uds_path.clone();
async move {
Ok::<_, io::Error>(TokioIo::new(UnixStream::connect(uds_path).await?))
}
})).await
}
AdvertisedAddress::Http(uri) => {
Channel::builder(uri)
.connect_timeout(Duration::from_secs(2))
.timeout(Duration::from_secs(2))
.http2_adaptive_window(true)
.connect()
.await
}
}
}
Contributor:

This looks quite similar to create_tonic_channel_from_advertised_address. Could this be reused?

@pcholakov (Jan 6, 2025):

I copied it nearly verbatim from restatectl's grpc_connect utility - which looks like it may have been the origin of create_tonic_channel_from_advertised_address, too. I've done this under its own PR here:

#2469

@pcholakov pcholakov force-pushed the feat/trim-gap-e2e-test branch from 071338c to 446bd56 Compare January 6, 2025 13:31
@pcholakov pcholakov changed the base branch from main to feat/trim-log-report-lsn January 6, 2025 13:33
@pcholakov pcholakov force-pushed the feat/trim-gap-e2e-test branch from 60dbc4e to 446bd56 Compare January 6, 2025 16:06
@pcholakov pcholakov requested review from tillrohrmann and removed request for jackkleeman January 6, 2025 16:10
@pcholakov (Contributor Author):

The one aspect that makes me a bit uneasy is that it seems we cannot reliably guarantee that a trim has happened. If this is correct, then we might be adding a test which is unstable in our CI environment. Maybe, because of this, it's worth first adding the functionality to report back which LSN was trimmed, so that we can make the trim_log function reliable?

Yes, definitely! I was already working on that - I realize my todo might have created the wrong impression :-) Here is the change, on which this PR is now rebased: #2468.

I wasn't able to get the leadership change to work reliably but I'm pretty keen to do that too. However, I believe that the test as it stands should be reasonably robust to merge and won't cause undue noise in CI.
