-
Notifications
You must be signed in to change notification settings - Fork 44
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add an end-to-end test for trim gap handling using snapshots #2463
base: feat/trim-log-report-lsn
Are you sure you want to change the base?
Conversation
12b10fb
to
713e010
Compare
@@ -749,6 +749,30 @@ impl StartedNode { | |||
} | |||
} | |||
|
|||
impl Drop for StartedNode { | |||
fn drop(&mut self) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I added this to avoid leaking restate-server
processes from tests.
server/tests/trim_gap_handling.rs
Outdated
|
||
mod common; | ||
|
||
#[tokio::test] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I am using this rather than test(restate_core::test)
because that macro just exits the process on panic, which prevents unwinding from running - and can lead to leaked spawned processes on test failure.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think the reason we have this panic hook is to ensure that if a panic occurs within a spawned task, the tests will fail. Otherwise, the panic might just be swallowed by the task.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah! Of course; I recall the discussion now - switched to #[test_log::test(tokio::test)]
as a middle ground to ensure that the Drop callback works.
server/tests/snapshots.rs
Outdated
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The new trim gap fast-forward test covers the same paths as this one.
@@ -34,6 +34,7 @@ description = "Restate makes distributed applications easy!" | |||
[workspace.dependencies] | |||
# Own crates | |||
codederror = { path = "crates/codederror" } | |||
mock-service-endpoint = { path = "tools/mock-service-endpoint" } |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
With this PR, the mock service handler is now usable from other packages - this is handy for e2e testing.
@@ -89,7 +89,6 @@ pub struct TestEnv { | |||
pub loglet: Arc<dyn Loglet>, | |||
pub metadata_writer: MetadataWriter, | |||
pub metadata_store_client: MetadataStoreClient, | |||
pub cluster: StartedCluster, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I removed passing the cluster to the test routine as it is easy to accidentally drop it, and kill the cluster in the process. We can reintroduce it as a reference if it's needed in the future.
e7bd6c7
to
071338c
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for creating the end-to-end test for our snapshots @pcholakov. The changes look good to me.
The one aspect that makes me a bit uneasy is that it seems that we cannot reliably guarantee that a trim has happened. If this is correct, then we might add a test which is unstable in our CI environment. Maybe because of this, it's worth to first add the functionality to report back which lsn was trimmed so that we can make the trim_log
function reliable?
pid, | ||
); | ||
match nix::sys::signal::kill( | ||
nix::unistd::Pid::from_raw(pid.try_into().unwrap()), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is this try_into
infallible or why is unwrap
ok here?
server/tests/trim_gap_handling.rs
Outdated
|
||
mod common; | ||
|
||
#[tokio::test] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think the reason we have this panic hook is to ensure that if a panic occurs within a spawned task, the tests will fail. Otherwise, the panic might just be swallowed by the task.
server/tests/trim_gap_handling.rs
Outdated
tracing_subscriber::fmt() | ||
.event_format(tracing_subscriber::fmt::format().compact()) | ||
.with_env_filter( | ||
tracing_subscriber::EnvFilter::builder() | ||
.with_default_directive(LevelFilter::INFO.into()) | ||
.from_env_lossy(), | ||
) | ||
.init(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If you use test_log::test(tokio::test)
, then you don't have to set these things up yourself.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
TY!
// todo(pavel): promote node 3 to be the leader for partition 0 and invoke the service again | ||
// right now, all we are asserting is that the new node is applying newly appended log records |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You could do this by manually changing the SchedulingPlan
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I didn't think it would be this easy... and it seems like it isn't. I added a step to manually set the SchedulingPlan
but it only works intermittently - Scheduler::update_scheduling_plan
nukes the changes as soon as it picks them up. I think this is important, let's do definitely do it but maybe as a follow-up task to provide a leadership hint to the scheduler?
server/tests/trim_gap_handling.rs
Outdated
State::Alive(s) => s | ||
.partitions | ||
.values() | ||
.any(|p| p.effective_mode.cmp(&1).is_eq()), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think it is clearer if you compared against RunMode
instead of the ordinal value which is harder to remember.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The magic of try_from
! Thanks for the tip :-)
server/tests/trim_gap_handling.rs
Outdated
let mut i = 0; | ||
loop { | ||
client | ||
.trim_log(TrimLogRequest { | ||
log_id: 0, | ||
trim_point, | ||
}) | ||
.await?; | ||
if i >= 2 { | ||
break; | ||
} | ||
tokio::time::sleep(Duration::from_secs(1)).await; | ||
i += 1; | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
How did you come up with the magic number of 3 attempts?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Empirically! I think Azmy suggested it may be related to the heartbeat interval and updating the global tail. Moot now; I've converted this to a retry until the desired effective trim is reached.
server/tests/trim_gap_handling.rs
Outdated
async fn trim_log( | ||
client: &mut ClusterCtrlSvcClient<Channel>, | ||
trim_point: u64, | ||
) -> googletest::Result<()> { | ||
// todo(pavel): this is flimsy, ensure we actually trim the log to a particular LSN |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If this method does not do anything because the admin node didn't have the up to date log tail, then I the remaining test will be stuck. This might be a problem for the stability of the test. Something to observe on our CI infra where timings can be quite skewed.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sorry, I created the wrong impression with the todo comment - I've rebased on #2468 which allows this to be deterministic :-)
async fn grpc_connect(address: AdvertisedAddress) -> Result<Channel, tonic::transport::Error> { | ||
match address { | ||
AdvertisedAddress::Uds(uds_path) => { | ||
// dummy endpoint required to specify an uds connector, it is not used anywhere | ||
Endpoint::try_from("http://127.0.0.1") | ||
.expect("/ should be a valid Uri") | ||
.connect_with_connector(service_fn(move |_: Uri| { | ||
let uds_path = uds_path.clone(); | ||
async move { | ||
Ok::<_, io::Error>(TokioIo::new(UnixStream::connect(uds_path).await?)) | ||
} | ||
})).await | ||
} | ||
AdvertisedAddress::Http(uri) => { | ||
Channel::builder(uri) | ||
.connect_timeout(Duration::from_secs(2)) | ||
.timeout(Duration::from_secs(2)) | ||
.http2_adaptive_window(true) | ||
.connect() | ||
.await | ||
} | ||
} | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This looks quite similar to create_tonic_channel_from_advertised_address
. Could this be reused?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I copied it nearly verbatim from restatectl
's grpc_connect
utility - which looks like it may have been the origin of create_tonic_channel_from_advertised_address
, too. I've done this under its own PR here:
071338c
to
446bd56
Compare
60dbc4e
to
446bd56
Compare
Yes, definitely! I was already working on that - I realize my todo might have created the wrong impression :-) Here is the change, on which this PR is now rebased: #2468. I wasn't able to get the leadership change to work reliably but I'm pretty keen to do that too. However, I believe that the test as it stands should be reasonably robust to merge and won't cause undue noise in CI. |
This test simulates a trim gap and verifies the behavior with and without a suitable snapshot present to enable fast-forward over the gap.
This is a follow-up to #2456 adding an e2e test for snapshot-based fast-forward over a log trim gap.
There are several to-dos here that require deeper changes - I'd like to do those as separate PRs to avoid delaying merging of trim-gap support itself. At a minimum this includes:
BifrostAdmin
can decide to no-op the request if the trim point is greater than the global tail it knows about, which makes it hard to testPrimary reviewer: @tillrohrmann
cc: @jackkleeman as an optional reviewer since I modified some test cluster infra and a test you previously added but feel free to ignore!