
[SPARK-51768][SS][TESTS] Create Failure Injection Test for Streaming offset and commit log write failures #50559


Closed. siying wants to merge 3 commits.

Conversation

@siying (Contributor) commented Apr 10, 2025

What changes were proposed in this pull request?

Add a unit test to verify that a streaming query works as expected when writing to the commit or offset log fails, plus minor improvements to existing test code.

Why are the changes needed?

To improve test coverage.

Does this PR introduce any user-facing change?

No.

How was this patch tested?

Ran this existing test.

Was this patch authored or co-authored using generative AI tooling?

No.
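
For orientation, here is a minimal sketch (not the PR's actual code) of how such a failure-injection test can be shaped with Spark's StreamTest harness. The config key below is hypothetical, standing in for whatever failure-injection hook the suite installs:

```scala
import org.apache.spark.sql.execution.streaming.MemoryStream
import org.apache.spark.sql.streaming.StreamTest

class CommitLogFailureSketchSuite extends StreamTest {
  import testImplicits._

  test("query recovers after a commit log write failure") {
    val inputData = MemoryStream[Int]

    testStream(inputData.toDF())(
      // Hypothetical config key, standing in for the suite's real
      // failure-injection hook for checkpoint file writes.
      StartStream(additionalConfs =
        Map("spark.sql.streaming.test.failCommitLogWrite" -> "true")),
      AddData(inputData, 1, 2, 3),
      // The microbatch writes to the sink, then fails while writing
      // the commit log entry.
      ExpectFailure[java.io.IOException](),
      // Restart without injection: the batch is replayed and the query
      // must end up with exactly the data that was added.
      StartStream(),
      CheckAnswer(1, 2, 3)
    )
  }
}
```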

@siying (Contributor, Author) commented Apr 10, 2025

@HeartSaVioR this PR is very simple. Can you take a look?

@HeartSaVioR HeartSaVioR changed the title [SPARK-51768][SS] Create Failure Injection Test for Streaming offset and commit log write failures [SPARK-51768][SS][TESTS] Create Failure Injection Test for Streaming offset and commit log write failures Apr 14, 2025
Review thread on the following excerpt from the new test:

```scala
additionalConfs = additionalConfs),
AddData(inputData, 4),
if (failureConf.logType == "commits") {
  // If the failure is in the commit log, data is already committed. The batch will
```
@HeartSaVioR (Contributor) commented Apr 14, 2025

This is very hard to follow, because the behavior of MemoryStream heavily impacts the test. It's not easy to reason about when the commit on the source happens and how MemoryStream will behave. It'd be harder to follow than the original test logic.

I'd test with a file stream, where we only append files and Spark is expected to process all files "regardless of these failures". (This should be a contract.) Once we change the output mode to complete, we should see the same result in the latest batch, which has no further files to process.

Also, I'm not comfortable with the behavior "If the failure is in the commit log, data is already committed." Shouldn't we only commit the offset for batch N to the source when batch N IS committed? I suspect this is an indication of a data loss/correctness issue - I hope I'm misunderstanding something.
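
A rough sketch of that file-source shape (the paths, schema, and aggregation are assumptions for illustration; only the pattern comes from the comment above: append files, run with complete output mode, and expect the latest batch to reflect every file regardless of injected failures):

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().master("local[2]").getOrCreate()
import spark.implicits._

// Hypothetical input directory; files are appended here between
// (injected) failures, and the contract is that every file is
// eventually processed.
val input = spark.readStream
  .schema("value INT")
  .json("/tmp/failure-injection-input")

// With "complete" output mode the latest batch always carries the full
// aggregated result, so after recovery it must reflect all appended files.
val query = input.groupBy($"value").count()
  .writeStream
  .outputMode("complete")
  .format("memory")
  .queryName("file_failure_sketch")
  .option("checkpointLocation", "/tmp/failure-injection-ckpt") // hypothetical
  .start()
```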

@siying (Contributor, Author) commented

I'll try to clean up the test code to be clearer, but committing the source shouldn't be relevant here. We only clean up the offset for the previous batch before ending the current one, so it should have no effect.

@HeartSaVioR (Contributor) commented

Wait, is it due to the fact that we used Check"Last"Batch? Sigh, I missed this.

I'm OK with not using a file stream, but let's figure out how we can verify that (2, 1) was produced before this batch.

@HeartSaVioR (Contributor) commented Apr 14, 2025

If that's hard to achieve, I'm OK with a more direct code comment about the "sink" status - e.g., (3, 2) and (2, 1) were already emitted to the sink as batch 1 even with the failure on the commit log. The write against the sink for the retried batch 1 should have been ignored (memory sink), but we restart the query and the state in the sink has been reset, hence the write against the sink for batch 1 in the new query takes effect.
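
To illustrate the sink behavior being described, here is a conceptual sketch (not Spark's actual MemorySink code): a sink instance drops a replayed batchId it has already seen, but a brand-new sink instance has no such history, so the replayed batch lands again.

```scala
import scala.collection.mutable.ArrayBuffer

// Conceptual sketch of the dedup-by-batchId behavior described above;
// not Spark's actual MemorySink implementation.
class SketchMemorySink {
  private var latestBatchId: Option[Long] = None
  private val rows = ArrayBuffer.empty[Int]

  def addBatch(batchId: Long, data: Seq[Int]): Unit = {
    if (latestBatchId.exists(_ >= batchId)) {
      // A retried batch within the same sink instance is dropped as a
      // duplicate, which is why a plain retry is invisible.
    } else {
      rows ++= data
      latestBatchId = Some(batchId)
    }
  }
}

// Same sink instance: the retried batch 1 is ignored.
// New sink instance (fresh query run): batch 1 is written again,
// which is why the new run's output shows the replayed rows.
```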

@HeartSaVioR (Contributor) commented

It'd be a lot easier if we could describe the data included in batches 0 to N. This is very complicated, because I expect (3, 2) to be "reprocessed" in the "next query run", yet we don't see it anywhere, despite the fact that the sink has been reset and hence there shouldn't be dedup for batch 1.

It seems to be due to CheckLastBatch - if we use CheckAnswer, then it should contain both. If there is no easy way to verify the two batches separately, let's just use CheckAnswer to confirm both batches altogether, with a code comment describing that we reprocessed (3, 2) in batch 1 and processed the further input in batch 2.
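
To make the difference concrete, a hedged sketch in the StreamTest DSL (the values are invented to mirror the thread; the semantics are the harness's: CheckLastBatch asserts only the most recent batch's rows, while CheckAnswer asserts everything the sink has received):

```scala
import org.apache.spark.sql.execution.streaming.MemoryStream
import org.apache.spark.sql.streaming.{OutputMode, StreamTest}

class CheckActionsSketchSuite extends StreamTest {
  import testImplicits._

  test("CheckLastBatch sees one batch, CheckAnswer sees them all") {
    val inputData = MemoryStream[Int]
    val counts = inputData.toDF().groupBy("value").count()

    testStream(counts, OutputMode.Update)(
      AddData(inputData, 3, 3, 2),
      CheckLastBatch((3, 2L), (2, 1L)),       // batch 0's output only
      AddData(inputData, 4),
      CheckLastBatch((4, 1L)),                // batch 1's output only
      CheckAnswer((3, 2L), (2, 1L), (4, 1L))  // both batches combined
    )
  }
}
```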

@siying (Contributor, Author) commented

Sorry, I think I used CheckLastBatch by mistake. Using CheckNewAnswer will be much less confusing, and I'm changing to use that.
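
For reference, CheckNewAnswer asserts only the rows added to the sink since the previous check, which matches "verify the two batches separately" without the ambiguity of CheckLastBatch across restarts. A hedged sketch, continuing the inputData and counts from the previous sketch:

```scala
// CheckNewAnswer remembers what has already been checked, so each call
// asserts exactly the rows the sink received since the last check,
// even across a stop/restart.
testStream(counts, OutputMode.Update)(
  AddData(inputData, 3, 3, 2),
  CheckNewAnswer((3, 2L), (2, 1L)),   // batch 0's new rows
  StopStream,
  StartStream(),
  AddData(inputData, 4),
  CheckNewAnswer((4, 1L))             // only batch 1's new rows
)
```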

…ng/state/RocksDBCheckpointFailureInjectionSuite.scala
@HeartSaVioR (Contributor) left a comment

+1 pending CI

@HeartSaVioR (Contributor) commented

https://github.com/siying/spark/actions/runs/14479972406/job/40614700665

Only pyspark-connect is failing (with an OOME), which is not relevant to this change.

@HeartSaVioR (Contributor) commented

Thanks! Merging to master.
