[Bifrost] FindTail task in replicated loglet #2019
Conversation
Force-pushed from 8a4ec8c to 009af2a
Test Results: 5 files, 5 suites, 2m 41s ⏱️ Results for commit 59e75ad. ♻️ This comment has been updated with latest results.
Thank you @AhmedSoliman for this PR. It looks good to me; I only have really minor comments and a couple of questions.
// todos:
// 1- Do we have some nodes that are unsealed? let's run a seal task in the
// background, but we can safely return the result.
// iff local-tail is consistent max-local-tail.
Suggested change:
-// iff local-tail is consistent max-local-tail.
+// if local-tail is consistent max-local-tail.
@muhamadazmy iff means "if and only if" https://en.wikipedia.org/wiki/If_and_only_if
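As an aside, here is a minimal sketch of the background-seal idea the todo above describes, assuming tokio and a hypothetical seal_node helper (none of these names come from the PR):

```rust
use tokio::task;

type NodeId = u32; // hypothetical; the real code has richer node identifiers

async fn seal_node(_node: NodeId) -> Result<(), String> {
    // stand-in for sending a Seal request to a log-server
    Ok(())
}

/// Once an f-majority is sealed and the tail is safe to return, seal the
/// remaining unsealed nodes in the background instead of blocking on them.
fn seal_stragglers(unsealed: Vec<NodeId>) {
    task::spawn(async move {
        for node in unsealed {
            // errors are non-fatal here: the f-majority seal is already in place
            let _ = seal_node(node).await;
        }
    });
}
```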
    max_local_tail, current_known_global
);
} else if max_local_tail == current_known_global ||
// max_local_tail > known_global_tail
Suggested change:
-// max_local_tail > known_global_tail
+// max_local_tail == known_global_tail
The comment is correct. It says that if the max-local-tail is higher, then we require a write-quorum check.
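To make the two cases concrete, a self-contained sketch of the branching discussed here (variable names mirror the snippet, but the function itself is hypothetical):

```rust
/// Illustrative only; the PR's actual control flow is richer.
fn tail_state(max_local_tail: u64, current_known_global: u64) -> &'static str {
    debug_assert!(max_local_tail >= current_known_global);
    if max_local_tail == current_known_global {
        // nothing beyond the known global tail was observed
        "settled: safe to return the tail as-is"
    } else {
        // max_local_tail > current_known_global: the extra records may or may
        // not be committed, so a write-quorum check is required before the
        // global tail can advance to max_local_tail
        "needs write-quorum check"
    }
}
```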
        global_tail: max_local_tail,
    };
} else {
    // F-majority sealed, but tail needs repair in range
I don't get this part. This is basically when max_local_tail > current_known_global. Why don't we take this value as the new current_known_global if we have a write quorum?
We do, on line 264. What the else branch does is check the availability of enough nodes such that we can establish a write-quorum for (any) offset, but that doesn't mean that max-local-tail is replicated on a write-quorum.
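The distinction between those two write-quorum questions can be modeled in a few lines. A toy, self-contained model (the PR's NodeTailStatus and nodeset checker are richer than this):

```rust
enum TailStatus {
    Known { local_tail: u64, sealed: bool },
    Unknown,
}

/// Toy quorum check: does `pred` hold on at least `quorum` nodes?
fn check_write_quorum<F: Fn(&TailStatus) -> bool>(nodes: &[TailStatus], quorum: usize, pred: F) -> bool {
    nodes.iter().filter(|n| pred(n)).count() >= quorum
}

fn main() {
    let max_local_tail = 10;
    let nodes = [
        TailStatus::Known { local_tail: 10, sealed: false },
        TailStatus::Known { local_tail: 7, sealed: false },
        TailStatus::Unknown, // unresponsive node
    ];
    // 1) Is max_local_tail already replicated on a write-quorum? (commit check)
    let committed = check_write_quorum(&nodes, 2, |n| {
        matches!(n, TailStatus::Known { local_tail, sealed } if !*sealed && *local_tail >= max_local_tail)
    });
    // 2) Are merely enough nodes responsive to form a write-quorum? (repairability)
    let repairable = check_write_quorum(&nodes, 2, |n| matches!(n, TailStatus::Known { .. }));
    assert!(!committed && repairable); // repair is possible, but the tail is not yet committed
}
```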
if self.known_global_tail.latest_offset() >= max_local_tail {
    return FindTailResult::Open {
        global_tail: self.known_global_tail.latest_offset(),
    };
}
I assume this can happen because by the time the nodes replied with their info, the global tail had already advanced, correct?
Yes, or from any other external source.
if nodeset_checker.check_write_quorum(|attribute| match attribute {
    // We have all copies of the max-local-tail replicated. It's safe to
    // consider this offset as committed.
    NodeTailStatus::Known { local_tail, sealed } => {
        // not sealed, and its local tail reached the max we are looking for
        !sealed && (*local_tail >= max_local_tail)
    }
    _ => false,
}) {
    return FindTailResult::Open {
        global_tail: max_local_tail,
    };
}
I am wondering if there is a value in [known_global_tail..max_local_tail] that can satisfy the write quorum and hence is the real known_global_tail. It could be found by binary-searching this range until we hit the max value that can form a write quorum.
There can be, and that'll be one of the things the RepairTask will do (not binary-search, but walking backwards from max-tail)
if nodeset_checker.check_write_quorum(NodeTailStatus::is_known) {
    // We can repair.
    todo!("Tail repair is not implemented yet")
} else {
    // wait for more nodes
    break 'check_nodeset;
}
Question here about the potential repair process. Will this try to copy the missing records from max_local_tail until it forms a write quorum? Or will it find the max value in the range [known_global..max_local] that satisfies the write quorum?
Both. It'll first walk backwards from max-tail until it hits a record that's not fully replicated. Then it'll attempt to replicate those records (in the forward direction) until it reaches max-local-tail. It's critical that the repair proceeds in this order to ensure correctness and repeatability if the repair fails or gets interrupted.
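A rough, self-contained sketch of that two-phase direction; every helper below is a hypothetical stand-in, not the PR's API:

```rust
type Offset = u64; // stand-in for the real loglet offset type

fn is_replicated_on_write_quorum(_offset: Offset) -> bool {
    // stand-in: ask the nodeset whether this record sits on a write-quorum
    false
}

fn replicate_record(_offset: Offset) {
    // stand-in: re-replicate a single record until it reaches a write-quorum
}

fn repair_tail(known_global_tail: Offset, max_local_tail: Offset) {
    // Phase 1: walk backwards from max-tail to find the lowest offset that
    // still needs repair, i.e. where full replication stops holding.
    let mut repair_from = max_local_tail;
    while repair_from > known_global_tail && !is_replicated_on_write_quorum(repair_from - 1) {
        repair_from -= 1;
    }
    // Phase 2: re-replicate in the FORWARD direction up to max-local-tail,
    // so an interrupted repair can be re-run safely from the same point.
    for offset in repair_from..max_local_tail {
        replicate_record(offset);
    }
}
```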
        self.my_params.loglet_id, e
    ));
}
inflight_info_requests.abort_all();
Should this be moved before the seal_task.run() call?
It'd be confusing to have two places where tasks are aborted. In general, as I mentioned in the PR description, this code is very verbose (intentionally) and it'll be refactored at a later stage once the algorithm itself is tested and its correctness is verified. So I wouldn't worry so much about the cosmetics.
if !updated {
    // Nothing left to wait on. The tail didn't reach the expected
    // target and no more nodes are expected to send us responses.
    return FindTailResult::Error(format!(
        "Could not determine a safe tail offset for loglet_id={}, perhaps too many nodes down?",
        self.my_params.loglet_id));
}
I am wondering if this check could happen immediately after the tokio::select! block, for clarity?
I think this changes the intention. The check is intentionally done after we exhaust the other options above.
    }
}

struct WaitForTailOnNode {
The FindTailOnNode and WaitForTailOnNode look very similar; maybe they could become one generic task, since they mainly rely on the header in the incoming message anyway.
I'd consider this as part of the cosmetic changes that I'll do later.
Really impressive work @AhmedSoliman 🚀 The changes look good to me. +1 for merging :-)
@@ -139,6 +140,13 @@ impl<T: TransportConnect> SequencerAppender<T> {
    // since backoff can be None or run out of iterations, but the appender
    // should never give up, we fall back to a fixed backoff
    let delay = retry.next().unwrap_or(DEFAULT_BACKOFF_TIME);
    info!(
Could info be too verbose if there are waves of retries?
Yes, it will be, and I'll review all logging once we put things together; we have a lot of work left on logging/observability.
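For illustration, one common way to tame this with the tracing crate is to demote retry logs to DEBUG after the first few attempts (the attempt counter and threshold here are assumptions, not this PR's code):

```rust
use std::time::Duration;
use tracing::{debug, info};

/// Illustrative only: keep the first few retries visible at INFO, then
/// demote to DEBUG so waves of retries don't flood the log.
fn log_retry(attempt: u32, delay: Duration) {
    if attempt <= 3 {
        info!(attempt, ?delay, "sequencer appender retrying after backoff");
    } else {
        debug!(attempt, ?delay, "sequencer appender still retrying after backoff");
    }
}
```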
/// the previous seal process crashed). Additionally, we will start a TailRepair task to ensure consistent
/// state of the records between known_global_tail and the max(local_tail) observed from f-majority
/// of sealed log-servers.
For my own understanding: the TailRepair is needed for correctness because, if we find a log entry beyond the known global tail while some nodes from the nodeset are unavailable, it could be that this entry was committed, since together with the unavailable nodes there could be a write-quorum. However, if it wasn't committed and we used the offset of this entry as the global tail, we'd risk that after the unavailable nodes come back, this entry couldn't be read by an f-majority read. That's why we establish write-quorum for these records, to be on the safe side, right?
This, however, requires that for sealing a log we not only need to establish f-majority but also write-quorum, right?
That's correct.
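A small worked example of the quorum arithmetic behind this exchange (all numbers are assumed for illustration, not this log's configuration):

```rust
fn main() {
    let nodeset_size: usize = 5;
    let write_quorum: usize = 3; // replication factor

    // An f-majority is the smallest set of nodes guaranteed to intersect
    // every possible write-quorum: N - W + 1.
    let f_majority = nodeset_size - write_quorum + 1;
    assert_eq!(f_majority, 3);

    // With 2 of 5 nodes down, the 3 reachable nodes form an f-majority, yet
    // a record seen on only 2 of them is ambiguous: its third copy could be
    // on a down node (committed) or nowhere (never committed).
    // Re-replicating it to a full write-quorum resolves the ambiguity.
}
```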
Implementation of the find-tail process in case we don't have a leader sequencer.
Stack created with Sapling. Best reviewed with ReviewStack.