Fixes for trim interval and persisted lsn watcher #2683
Thank you for fixing this!
// among partition leaders in case of coordinated cluster restarts
let jitter = rand::rng().random_range(Duration::ZERO..interval.mul_f32(0.1));
let start_at = time::Instant::now().add(interval.into()).add(jitter);
let effective_interval = with_jitter(interval, 0.1);
Thank you!!
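For readers unfamiliar with the pattern, the jitter change in the hunk above can be sketched roughly as follows. This is a std-only illustration and not the `with_jitter()` used in the PR: the helper name `with_jitter_sketch` and its explicit `unit_sample` parameter (a stand-in for a random draw in `[0, 1)`) are assumptions, chosen so the example stays deterministic.

```rust
use std::time::Duration;

/// Hypothetical helper: stretch `base` by up to `max_fraction` of itself.
/// `unit_sample` stands in for a random draw in [0, 1); real code would take
/// it from an RNG. This is not the `with_jitter()` from the PR.
fn with_jitter_sketch(base: Duration, max_fraction: f32, unit_sample: f32) -> Duration {
    // Clamp both inputs so `mul_f32` is never handed a negative factor.
    let fraction = max_fraction.max(0.0) * unit_sample.clamp(0.0, 1.0);
    base + base.mul_f32(fraction)
}

fn main() {
    let interval = Duration::from_secs(60);
    // Two leaders restarted together draw different samples, so they tick at
    // slightly different periods instead of in lockstep.
    let a = with_jitter_sketch(interval, 0.1, 0.25);
    let b = with_jitter_sketch(interval, 0.1, 0.80);
    println!("leader A: {:?}, leader B: {:?}", a, b);
}
```

Applying the jitter to the interval itself, rather than only to the first tick, keeps the leaders from drifting back into alignment over time.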
This also adds a little bit more stress to the tests to help them fail more often if there is an issue.
This fixes mishandling of deleted and unknown nodes in the config in f-majority checks. Integration tests were misconfigured: the nodeset was [N2..N4], where N4 didn't actually exist in config. In this case we should not accept an f-majority seal if only one node is sealed (replication=2).

Although this bug wouldn't impact us immediately, it's best to fix this condition, and I took it as an opportunity to update the semantics of the provisioning state to match the latest design direction. Documentation has also been updated to reflect the correct semantics.

Summary:
- Nodes observed in the node-set but not in nodes-config are treated as "provisioning" rather than "disabled".
- Nodes that are "deleted" in config (a tombstone exists) are treated as "disabled".
- Nodes in provisioning are fully authoritative, but are automatically excluded from new nodesets (already filtered by the candidacy filter in the nodeset selector).
- If provisioning nodes were added to the nodeset, they are treated as fully authoritative and are required to participate in f-majority.
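The classification rules in the summary above can be sketched as a small lookup. All type and function names here (`ConfigEntry`, `EffectiveState`, `effective_state`) are hypothetical and much simpler than the real nodes configuration; treating N3 as tombstoned is likewise only for illustration.

```rust
use std::collections::HashMap;

/// Hypothetical, simplified view of a node's entry in the nodes configuration.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum ConfigEntry {
    /// The node is present in the config.
    Present,
    /// The node was deleted and only a tombstone remains.
    Tombstone,
}

/// Hypothetical result of classifying a nodeset member for f-majority checks.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum EffectiveState {
    /// Use whatever state the config records for this node.
    AsConfigured,
    /// In the nodeset but unknown to the config: fully authoritative.
    Provisioning,
    /// Deleted from the config (tombstone exists).
    Disabled,
}

fn effective_state(config: &HashMap<u32, ConfigEntry>, node_id: u32) -> EffectiveState {
    match config.get(&node_id) {
        None => EffectiveState::Provisioning,
        Some(ConfigEntry::Tombstone) => EffectiveState::Disabled,
        Some(ConfigEntry::Present) => EffectiveState::AsConfigured,
    }
}

fn main() {
    let mut config = HashMap::new();
    config.insert(2u32, ConfigEntry::Present);
    config.insert(3u32, ConfigEntry::Tombstone);
    // Nodeset [N2..N4]; N4 does not exist in the config at all.
    for node in [2u32, 3, 4] {
        println!("N{} -> {:?}", node, effective_state(&config, node));
    }
}
```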
- The trim operation will wait for f-majority before reporting success, to increase the reliability of a subsequent get_trim_point.
- Adds protection against a dangerous scenario in which the loglet over-reports its trim point in the sealed loglet case. The loglet might have more records than the effective sealed tail, but it should never report a trim point beyond that (if this happens, the system will believe that the subsequent segment is missing records).
- Removes a superfluous check. The trim task already checks that the trim point is clamped to the known global tail.
- `restatectl replicated-loglet info` now prints a table with info from every node in the nodeset.
- `restatectl replicated-loglet digest` doesn't require `--from`/`--to` to function, and includes fixes for overblown memory usage if the supplied range is unnecessarily large.
- For both commands, lots of UI improvements. Some screenshots will be attached in comments.
This was overly/incorrectly protective. We'll see those errors often because we only need to seal an f-majority; subsequent tail repair operations on restarts should still succeed on this log-server. A future follow-up (already planned) is to lazily seal the loglet once we observe this operation on this log-server, but I didn't think that doing it in this time frame was necessary.
The main fixes:
- It was not actually possible to disable those two intervals: the configuration fails to parse the empty string that was suggested in the docs. This is now replaced by `0s` or `0ms` (unfortunately a bare `0` will still fail, but that can be fixed by a custom `serde_as`, which I don't have time for).
- Fixed how `OptionFuture` is being used. `select!` in a loop will spin indefinitely if the `OptionFuture` contains `None`, because it is always `Poll::Ready` with value `None`. This is a critical bug that we didn't hit because it wasn't possible to disable those intervals before.
- Added jitter with `with_jitter()`.
Thanks for fixing this problem @AhmedSoliman. LGTM. +1 for merging.
@@ -230,7 +230,7 @@ where
 _ = self.find_logs_tail_interval.tick() => {
     self.logs_controller.find_logs_tail();
 }
-_ = OptionFuture::from(self.log_trim_check_interval.as_mut().map(|interval| interval.tick())) => {
+Some(_) = OptionFuture::from(self.log_trim_check_interval.as_mut().map(|interval| interval.tick())) => {
Thanks for fixing this 🙏 I wasn't fully aware of the `OptionFuture` API back when I wrote it.
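To make the failure mode in this thread concrete, here is a std-only sketch. `OptFut` is a hypothetical stand-in that mirrors the documented behaviour of `futures::future::OptionFuture` (a `None` resolves immediately to `None`), and the two `*_branch_fires` helpers only model how a `select!` branch pattern is matched; neither is code from the PR.

```rust
use std::future::{pending, ready, Future, Pending};
use std::pin::Pin;
use std::sync::Arc;
use std::task::{Context, Poll, Wake, Waker};

/// Hypothetical stand-in for `OptionFuture`: `None` is immediately ready.
struct OptFut<F>(Option<F>);

impl<F: Future + Unpin> Future for OptFut<F> {
    type Output = Option<F::Output>;
    fn poll(self: Pin<&mut Self>, cx: &mut Context<'_>) -> Poll<Self::Output> {
        match self.get_mut().0.as_mut() {
            Some(inner) => Pin::new(inner).poll(cx).map(Some),
            None => Poll::Ready(None),
        }
    }
}

/// Waker that does nothing; enough to poll a future once by hand.
struct NoopWake;
impl Wake for NoopWake {
    fn wake(self: Arc<Self>) {}
}

fn poll_once<F: Future + Unpin>(mut fut: F) -> Poll<F::Output> {
    let waker = Waker::from(Arc::new(NoopWake));
    let mut cx = Context::from_waker(&waker);
    Pin::new(&mut fut).poll(&mut cx)
}

/// Models the old `_ = fut` branch: any ready value matches.
fn wildcard_branch_fires<T>(polled: &Poll<Option<T>>) -> bool {
    matches!(polled, Poll::Ready(_))
}

/// Models the fixed `Some(_) = fut` branch: only `Some` matches.
fn some_branch_fires<T>(polled: &Poll<Option<T>>) -> bool {
    matches!(polled, Poll::Ready(Some(_)))
}

fn main() {
    // A disabled interval is modelled as `None`.
    let disabled = poll_once(OptFut::<Pending<()>>(None));
    println!("disabled, `_ =` fires: {}", wildcard_branch_fires(&disabled)); // true
    println!("disabled, `Some(_) =` fires: {}", some_branch_fires(&disabled)); // false

    // An enabled interval whose tick has not elapsed stays pending.
    let waiting = poll_once(OptFut(Some(pending::<()>())));
    println!("waiting, `Some(_) =` fires: {}", some_branch_fires(&waiting)); // false

    // An enabled interval whose tick is due fires under both patterns.
    let due = poll_once(OptFut(Some(ready(()))));
    println!("due, `Some(_) =` fires: {}", some_branch_fires(&due)); // true
}
```

With the wildcard pattern the disabled branch is ready on every iteration, so the surrounding loop never parks. As far as I understand tokio's documented `select!` semantics, a branch whose pattern does not match is disabled for the rest of that `select!` call, which is why `Some(_)` lets the other branches be awaited normally.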
/// can be disabled by setting it to "".
#[serde(with = "serde_with::As::<Option<serde_with::DisplayFromStr>>")]
I think that I forgot to put `#[serde_as(as = "serde_with::NoneAsEmptyString")]` to support `""` to be translated into `None`. `"0s"` is probably better.
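The "zero means disabled" convention discussed here can be sketched with a tiny std-only parser. `parse_interval` is hypothetical and only understands `<integer>ms` and `<integer>s`; the real configuration presumably accepts a richer duration syntax, and whether the zero-to-`None` mapping happens at parse time or later is an assumption.

```rust
use std::time::Duration;

/// Hypothetical parser, for illustration only: accepts `<integer>ms` or
/// `<integer>s` and maps a zero duration to `None` ("disabled").
fn parse_interval(input: &str) -> Result<Option<Duration>, String> {
    let input = input.trim();
    // Check `ms` before `s`, since every `ms` value also ends in `s`.
    let (digits, millis_per_unit) = if let Some(d) = input.strip_suffix("ms") {
        (d, 1u64)
    } else if let Some(d) = input.strip_suffix('s') {
        (d, 1_000u64)
    } else {
        // Covers both the empty string and a bare `0`.
        return Err(format!("missing unit in {:?}", input));
    };
    let value: u64 = digits
        .trim()
        .parse()
        .map_err(|e| format!("invalid number in {:?}: {}", input, e))?;
    let millis = value
        .checked_mul(millis_per_unit)
        .ok_or_else(|| format!("{:?} is too large", input))?;
    Ok(if millis == 0 {
        None
    } else {
        Some(Duration::from_millis(millis))
    })
}

fn main() {
    for raw in ["0s", "0ms", "30s", "", "0"] {
        println!("{:?} -> {:?}", raw, parse_interval(raw));
    }
}
```

In this sketch `"0s"` and `"0ms"` disable the interval, while `""` and a bare `"0"` are rejected, matching the behaviour described in the PR.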
Stack created with Sapling. Best reviewed with ReviewStack.