Add Bifrost trim gap handling support by fast-forwarding to the latest partition snapshot #2456
Conversation
Force-pushed from 8a82f9e to 140e7bf
crates/worker/src/partition/mod.rs (Outdated)
```rust
    unimplemented!("Handling trim gap is currently not supported")
};
anyhow::Ok((lsn, envelope?))
if entry.is_data_record() {
```
At the moment, `LogEntry.record` is not public, and neither are `bifrost::{MaybeRecord, TrimGap}` - would we prefer to make those public and use pattern-matching directly?
We can, but it doesn't sound like you need that. See my comments below.
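For context, direct pattern matching would have looked roughly like this sketch (assuming those types were made public; the `MaybeRecord` variant and field names here are guesses based on this thread, not the actual API):

```rust
// Hypothetical: assumes bifrost::{MaybeRecord, TrimGap} become public.
match entry.record {
    MaybeRecord::Data(record) => {
        // decode the payload and hand it to the processor
    }
    MaybeRecord::TrimGap(TrimGap { to_lsn }) => {
        // surface the gap so the processor can stop at the boundary
    }
    _ => {}
}
```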
```rust
let snapshot = match snapshot_repository {
    Some(repository) => {
        debug!("Looking for partition snapshot from which to bootstrap partition store");
        // todo(pavel): pass target LSN to repository
```
We can optimize this by not downloading a snapshot that's older than the target LSN; I'll tackle this as a separate follow-up PR.
```rust
    );
}

// We expect the processor startup attempt will fail, avoid spinning too fast.
```
This seems reasonable to me. I chose to rather delay and try to start again, just in case something has changed in the log - but at this point we're unlikely to get this processor going again by following the log. What's a good way to post a metric that we're spinning?
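One low-friction option (illustrative only: the metric name is made up, and this assumes the `metrics` facade crate already used elsewhere in the codebase):

```rust
use metrics::counter;

// Incremented on every failed startup attempt; a steadily climbing rate
// lets an operator alert on a partition processor stuck in a retry loop.
counter!("restate.partition.startup_retry_attempts").increment(1);
```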
crates/worker/src/partition/mod.rs (Outdated)
```rust
Ok(stopped) => {
    match stopped {
```
Tip: You can remove one level of nesting:

```rust
Ok(ProcessorStopReason::LogTrimGap { to_lsn }) => ...,
Ok(_) => warn!(...),
Err(err) => warn!(...),
```
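Spelled out fully, the flattened match might read like this (a sketch; the enclosing result value and log messages are assumed, not copied from the PR):

```rust
match run_result {
    Ok(ProcessorStopReason::LogTrimGap { to_lsn }) => {
        info!(%to_lsn, "Partition processor stopped at a log trim gap");
    }
    Ok(_) => warn!("Partition processor stopped unexpectedly"),
    Err(err) => warn!("Partition processor failed: {err}"),
}
```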
Much better, thank you! <3
crates/worker/src/partition/mod.rs (Outdated)

```rust
anyhow::Ok(Record::TrimGap(
    entry
        .trim_gap_to_sequence_number()
        .expect("trim gap has to-LSN"),
))
```
Would it make sense to stop the read stream at the first gap and return Err instead?
I understand this to mean the `map_ok` function translates a trim-gap into an `Err(TrimGap {..})` instead? Maybe! The way we use `anyhow::Result` pervasively makes this a deeper change than I wanted to tackle right away; but it also makes more sense to treat trim gaps as just another record in the stream, with errors reserved for actual failure conditions.

Zooming out a bit, modeling the Partition Processor's overall outcome as `Result<Canceled | StoppedAtTrimGap, ProcessingError>` seems accurate: the Ok / left path is an expected if rare reason to halt; the Err / right path is an exceptional failure condition.

If you have a few minutes, I'd love to hear more about how you'd solve this? I'm certain I am also missing some subtlety around properly consuming the log stream!
We discussed offline and agreed that it's best to represent this case as an error case.
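Roughly, the agreed shape (variant and field names here are illustrative, not necessarily what was merged):

```rust
use restate_types::logs::Lsn; // assumed import path

// Trim gaps surface as a dedicated error variant rather than an Ok value,
// so no call site can accidentally treat a gap as a processed record.
#[derive(Debug, thiserror::Error)]
pub enum ProcessorError {
    #[error("encountered a log trim gap up to {to_lsn}")]
    TrimGapEncountered { to_lsn: Lsn },
    #[error(transparent)]
    Other(#[from] anyhow::Error),
}
```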
```diff
 pub fn start_runtime<F, R>(
     self: &Arc<Self>,
     root_task_kind: TaskKind,
     runtime_name: &'static str,
     partition_id: Option<PartitionId>,
     root_future: impl FnOnce() -> F + Send + 'static,
-) -> Result<RuntimeTaskHandle<anyhow::Result<()>>, RuntimeError>
+) -> Result<RuntimeTaskHandle<R>, RuntimeError>
 where
-    F: Future<Output = anyhow::Result<()>> + 'static,
+    F: Future<Output = R> + 'static,
+    R: Send + 'static,
```
To me, it seems more that you want to have control over the error type, rather than make the runtime behave like an async task with a return value. In that case, your `PartitionProcessorStopReason` becomes the error type.
Maybe! We use `anyhow::Error` quite a bit in the PP now, so it would be difficult to disentangle the errors I care about from other failure conditions. That aside, I still like modeling this as an outcome of either a known stop reason, or some other failure condition. I am treating `PartitionProcessorStopReason` as a normal return since both canceling the PP, or encountering a trim gap, are expected over a long enough timeline.
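For illustration, the call site this signature change enables might look like the following (a sketch; the task kind variant and error type names are assumptions):

```rust
// The runtime's root future now returns the processor's own result type
// instead of being forced into anyhow::Result<()>.
let handle: RuntimeTaskHandle<Result<(), ProcessorError>> = task_center.start_runtime(
    TaskKind::PartitionProcessor, // assumed variant name
    "partition-processor",
    Some(partition_id),
    move || async move {
        // ... drive the processor; a trim gap surfaces as
        // Err(ProcessorError::TrimGapEncountered { .. })
        Ok(())
    },
)?;
```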
```rust
    .await
    && fast_forward_lsn.is_none()
{
    return Ok(partition_store_manager
```
Tip: remove `return`.
Without the return statement, I need to pull the rest of the method body into an `else` arm - and I specifically wanted to keep it this way; I find it easier to read without the extra nesting. Open to changing it back if you feel strongly about using the if expression as the returned value, of course :-)
```rust
tokio::time::sleep(Duration::from_millis(
    10_000 + rand::random::<u64>() % 10_000,
))
```
Would `RetryPolicy` and its internal jitter logic work for you here?
Definitely! I didn't want to plumb a retry count through just yet but maybe even without it, we can leverage the retry policy already.
I remembered why I didn't want to tackle this just yet - right now, the way to get consecutive retry decisions is to get an iterator. I want to introduce an alternative API to `RetryPolicy` which will make this more suitable for use cases like this one, but let me rather do that as a follow-up PR!
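For reference, the iterator-shaped usage available today would look something like this sketch (the `RetryPolicy::exponential` arguments and `iter()` accessor shown are assumptions about the current API, and `try_start_processor` is a stand-in for the startup attempt):

```rust
// Assumed arguments: initial interval, backoff factor, max attempts, max interval.
let retry_policy = RetryPolicy::exponential(
    Duration::from_secs(10),
    2.0,
    None,                          // no attempt cap: retry indefinitely
    Some(Duration::from_secs(60)), // cap the per-attempt delay
);

let mut delays = retry_policy.iter();
let store = loop {
    match try_start_processor().await {
        Ok(started) => break started,
        Err(_) => {
            // take the next jittered delay before retrying
            let delay = delays.next().expect("unbounded policy always yields");
            tokio::time::sleep(delay).await;
        }
    }
};
```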
```rust
warn!(
    partition_id = %partition_id,
    ?snapshot_path,
    "Failed to remove local snapshot directory, continuing with startup: {:?}",
```
Let's try and avoid using `Debug` values in log messages higher than `debug!`.
Very insightful rule of thumb to keep in mind, thank you!
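Concretely, with `tracing`'s field syntax, `%` captures a value via `Display` while `?` captures via `Debug`, so the rule of thumb plays out like this (illustrative messages, not from the PR):

```rust
// warn! and above: prefer Display-formatted values.
warn!(partition_id = %partition_id, "Failed to remove local snapshot directory: {err}");

// debug! and below: Debug formatting is fine for diagnostic detail.
debug!(?snapshot_path, "Snapshot directory state: {err:?}");
```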
One bit of offline feedback from @AhmedSoliman was to think about edge cases around partition leadership, when an active leader encounters a trim gap. Pushed an updated version which addresses the trim gap stop-reason as an explicit error, rather than an "ok" variant. This is because we really want to minimize the chances of misinterpreting a trim-gap record and accidentally consuming past it. One deviation from the previous behavior with this latest revision is that

Also addressed most of the remaining smaller comments, with the major exception of not using retry policy just yet - I plan to, but let's do it as a follow-up as I want to make some changes to make it easier to use here.
Force-pushed from b76f3f9 to 87a876d
crates/worker/src/partition/mod.rs (Outdated)
```rust
}
for record in command_buffer.drain(..) {
    match record {
        Record::Envelope(lsn, envelope) => {
```
The `Envelope` handling logic is unchanged; this block is just indented due to the match expression needed to handle gaps.
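In other words, the drain loop now dispatches on the record type, with the envelope arm carrying the pre-existing logic (a sketch, assuming the `Record` enum introduced in this PR):

```rust
for record in command_buffer.drain(..) {
    match record {
        Record::Envelope(lsn, envelope) => {
            // pre-existing apply logic, unchanged, one indent level deeper
        }
        Record::TrimGap(to_lsn) => {
            // stop processing and report the gap boundary to the manager
        }
    }
}
```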
This allows us to signal the PPM about log trim gaps that the PP may encounter at runtime, which require special handling.
Force-pushed from 0475afc to 2539aa3
Thanks a lot for creating this PR @pcholakov. The changes look really good to me. Well done 👏. The one question I had was how bulletproof the logic in `open_partition_store` is for deciding when to import a snapshot w/o dropping the partition cf, based on the passed-in `fast_forward_lsn` parameter. It seems possible that we don't fail with a trim gap error (e.g. the PP being stopped) and then restart later with a partition store that has an initialized cf and an existing snapshot.
crates/worker/src/partition/mod.rs (Outdated)

```diff
@@ -426,14 +464,14 @@ where
     lsn,
     envelope,
     &mut transaction,
-    &mut action_collector).await?;
+    &mut action_collector).await.map_err(ProcessorError::from)?;
```
`?` calls `ProcessorError::from` automatically, so `map_err` can be removed. I'll stop flagging this here, but there are a few other occurrences further down.
Thank you for flagging this! I distinctly recall compilation failures without these but I never investigated why - I think the return type might have been different at the time. All cleaned up now!
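As a refresher on why the explicit conversion is redundant: `?` desugars to a `From` conversion on the error path. A generic illustration (not this codebase's actual types):

```rust
#[derive(Debug)]
struct ProcessorError(String);

impl From<std::io::Error> for ProcessorError {
    fn from(err: std::io::Error) -> Self {
        ProcessorError(err.to_string())
    }
}

fn read_state() -> Result<String, ProcessorError> {
    // `?` applies From<std::io::Error> for ProcessorError automatically;
    // no .map_err(ProcessorError::from) is needed.
    Ok(std::fs::read_to_string("state.toml")?)
}
```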
```rust
Ok(import_snapshot(
    partition_id,
    key_range,
    snapshot,
    partition_store_manager,
    options,
)
.await?)
```
Why is it ok to import a snapshot w/o dropping the partition cf? Is it really safe that if `fast_forward_lsn == None`, then we guarantee that the corresponding partition cf does not exist?
This is safe, though it's very unclear in the local context why this is :-) At the very top of `open_partition_store`, we immediately go down a different path if we already have a store (i.e. a column family exists for the partition) and there's no fast-forward target LSN:
```rust
if partition_store_manager
    .has_partition_store(partition_id)
    .await
    && fast_forward_lsn.is_none()
{
    return Ok(partition_store_manager
        .open_partition_store(
            partition_id,
            key_range,
            OpenMode::OpenExisting,
            &options.storage.rocksdb,
        )
        .await?);
}
```
If we didn't go down that path, and `fast_forward_lsn` is unset, that means that the CF for `partition_id` does NOT exist.

I'll try to refactor the code to make this more obvious, or failing that, at least add a comment to highlight this. Suggestions for how to make this better are welcome!
PS. `import_snapshot` is paranoid and will refuse to import if a CF already exists, so a potential bug here will cause a PP startup failure but not corrupt a data store.
Thanks a lot for the feedback, @tillrohrmann! 🙏 Cleaned up per your suggestions.
I believe this case is handled by the early return here: restate/crates/worker/src/partition_processor_manager/spawn_processor_task.rs Lines 188 to 202 in 1baff85

If the PP previously stopped for some reason other than a trim gap, then it's just a clean startup with an existing store. In the case of either a fresh start, or a restart after some other error, we don't have a fast-forward LSN target and will take the early return. I tried to refactor the code to make it a bit more obvious and readable - plus added some comments. Let me know if this helps, and if you have a better idea! :-)
Thanks for the clarification of how the partition store is initialized from a snapshot. It makes sense to me now :-) Really nice work. LGTM. +1 for merging.
```rust
}
(Some(snapshot), None) => {
    // We only reach this point if there is no initialized store for the partition (early
    // return at start of method), we can import without first dropping the column family.
```
The comment in the parentheses seems to be no longer correct.
Ah! Will update before I merge 👍
Closes: #2247