Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix child WF ID generation #1803

Open
wants to merge 12 commits into
base: master
Choose a base branch
from
Open

Fix child WF ID generation #1803

wants to merge 12 commits into from

Conversation

gow
Copy link
Contributor

@gow gow commented Feb 5, 2025

What was changed

We now use workflowInfo.OriginalRunID in parent when it is generating a child workflow ID.

Why?

We need this change to enable resetting of workflows that have pending children. What this means is Go SDK should be able to successfully replay StartChildWorkflowExecutionInitiated event in a parent that was reset.
Without this change Go SDK will run into non-determinism error since the child ID it generates (based on new run ID) doesn't match with the ID available in the StartChildWorkflowExecutionInitiated event. Since OriginalRunID stays the same across resets, with this change the SDK can successfully replay StartChildWorkflowExecutionInitiated events.

Checklist

  1. Closes
    N/A

  2. How was this tested:
    Manually tested it. Added integration tests and replay tests.

  3. Any docs updates needed?
    N/A

@gow gow requested a review from a team as a code owner February 5, 2025 07:04
@cretz
Copy link
Member

cretz commented Feb 5, 2025

Questions here:

  1. Can we confirm what Java does here?
  2. IIUC Core-based languages make child workflow IDs with a seeded random that changes seed on reset midway through execution, meaning post-reset child workflow IDs would be different than they were on original run but pre-reset ones are the same. This was done very much by intention. Should we consider a similar approach in Go/Java that has a "current run ID" that can change midway through execution?
  3. Can we assess the impact of behavior changes like this on existing users with existing expectations? I am guessing there is no impact because you could not reset to before child workflows?
  4. Can you confirm OriginalRunID always matches RunID at this point of code today? Meaning since resets were never allowed over child workflows, it is impossible to have these two values different at child workflow creation time today?

@gow
Copy link
Contributor Author

gow commented Feb 5, 2025

Questions here:

  1. Can we confirm what Java does here?

Java seems to be generating UUIDs when an ID is not provided in ChildWorkflow options. Additionally when I replayed an existing StartChildWorkflowExecutionInitiated event (by resetting to a point after the child completion) it didn't run into non-determinism error. Go on the other hand ran into non-det error complaining about workflowID. ("Error":"[TMPRL1100] unknown command CommandType: ChildWorkflow, ID: 0174afb2-0213-4644-9214-8b977310cc9f_14, possible causes are nondeterministic workflow definition code or incompatible change in the workflow definition")

  1. IIUC Core-based languages make child workflow IDs with a seeded random that changes seed on reset midway through execution, meaning post-reset child workflow IDs would be different than they were on original run but pre-reset ones are the same. This was done very much by intention. Should we consider a similar approach in Go/Java that has a "current run ID" that can change midway through execution?

Absolutely yes. Java seems to be already doing this (at least in my testing). Looks like it changes the seed at reset-point?
If this is an easier change in Go-SDK, I'd rather do it. Otherwise, we still need this fix since Go-SDK is currently broken and the upcoming reset changes will increase the exposure to this bug (See below)

  1. Can we assess the impact of behavior changes like this on existing users with existing expectations? I am guessing there is no impact because you could not reset to before child workflows?

Yes. So far all server versions have prevented resetting of workflows when the reset point falls in between child Init & Completed events (i.e within the lifespan of child). So no potential user impact or backward compatibility issues there.
However the server has always allowed workflows to be reset when the reset-point chosen is outside the child init & completed events. In cases where reset-point is after child complete event, workers with Go SDK run into non-det error while other SDKs seems to simply reapply events (child Init, Start & complete) without any issues. In that regard this PR is more of a bug fix in Go-SDK. And now with the upcoming server changes to allow reset-points to be in between Init-Completed, it's important that Go-SDK successfully replay a StartChildWorkflowExecutionInitiated event.

  1. Can you confirm OriginalRunID always matches RunID at this point of code today? Meaning since resets were never allowed over child workflows, it is impossible to have these two values different at child workflow creation time today?

Yes it is possible to have these two values different at child creation time. When the reset-point is outside the child's lifespan (Init - Completed) then the server has always allowed resets. So in cases when the reset-point is before the child-init, the parent will try to create a child again and this time OriginalRunID is different from RunID.

@cretz
Copy link
Member

cretz commented Feb 5, 2025

@gow - thanks for this info! So reading all of that I am seeing two things:

  1. We need to fix Go to be at least a little like other SDKs. We have two options:
    1. Start offering a re-seeded random to users and we use it for child IDs in Go with an SDK flag to only apply to new workflows
    2. Make available (maybe only to us) a "current run ID" that is updated when reset that is used for child workflow IDs (no SDK flag needed for compatibility)
  2. We are not concerned with backwards compatibility of only changing Go since it never worked anyways after reset, so long as the non-reset logic remains working

For 1, I will start an internal discussion and invite you. Regardless, I am not sure the desired outcome is what is in this PR.

@gow
Copy link
Contributor Author

gow commented Feb 5, 2025

The desired outcome of this PR is for Go-SDK to be able to replay StartChildWorkflowExecutionInitiated events like other SDKs. Currently it fails. Using OriginalRunID will fix this issue and unblock reset feature development. But I understand it doesn't bring Go-SDK to be fully be at parity with other SDKs.

@gow
Copy link
Contributor Author

gow commented Feb 7, 2025

Putting this on hold. Will discuss internally and work on a better fix.

@gow gow marked this pull request as draft February 7, 2025 19:41
@gow gow force-pushed the cg/reset_child_workflow_id branch from c54c992 to ab1a0c3 Compare February 12, 2025 06:31
@gow gow changed the title Use OriginalRunID instead of RunID to generate child workflow IDs Fix seed child WF ID generation Feb 12, 2025
@gow gow changed the title Fix seed child WF ID generation Fix child WF ID generation Feb 12, 2025
@gow gow force-pushed the cg/reset_child_workflow_id branch from ab1a0c3 to f0eb17d Compare February 12, 2025 23:49
@gow gow marked this pull request as ready for review February 12, 2025 23:52
@gow
Copy link
Contributor Author

gow commented Feb 12, 2025

After an internal discussion, we decided to bring Go SDK on par with the rest of the SDKs. So instead of simply using OriginalRunID, we are using the FirstExecutionRunId as the initial seed and updating it with the currentRunID whenever we encounter a reset.

Copy link
Contributor

@yuandrew yuandrew left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

overall LGTM, deferring to @Quinn-With-Two-Ns or @cretz for approval. I vaguely remember there was a test they were hoping was added, but can't remember exactly what that test was.

internal/internal_event_handlers.go Outdated Show resolved Hide resolved
Copy link
Contributor Author

@gow gow left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I vaguely remember there was a test they were hoping was added, but can't remember exactly what that test was.

We need to add some integration tests here. I'm trying to figure out the structuring and setup of them in this codebase. I'll probably sync with you all offline to get some pointers.

internal/internal_event_handlers.go Outdated Show resolved Hide resolved
internal/internal_event_handlers.go Outdated Show resolved Hide resolved
@@ -492,8 +492,7 @@ OrderEvents:
break OrderEvents
}
case enumspb.EVENT_TYPE_WORKFLOW_TASK_SCHEDULED,
enumspb.EVENT_TYPE_WORKFLOW_TASK_TIMED_OUT,
enumspb.EVENT_TYPE_WORKFLOW_TASK_FAILED:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@Quinn-With-Two-Ns - can you think of any ramifications for now putting workflow task failed events in taskEvents.events where we hadn't before?

Copy link
Member

@cretz cretz left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, though now we get into the more fun challenge of adding tests, heh

@@ -493,6 +493,24 @@ func (s *replayTestSuite) TestPartialReplayNonCommandEvent() {
require.NoError(s.T(), err)
}

func (s *replayTestSuite) TestResetWorkflowWithChildSuccess() {
Copy link
Member

@cretz cretz Feb 13, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These don't seem to have resets in them. Here's my best guess on some tests needed:

  • Capture a JSON of a before-this-PR-change workflow that completed that has a reset after child workflow, and confirm it replays properly
  • Capture a JSON of a before-this-PR-change workflow that completed that has a before-this-PR-change reset before child workflow, and confirm it replays properly because it created a whole new child workflow
  • Make workflow with two child workflows, run it to completion, reset to in-between those two child workflows and let that reset one run to completion too. Confirm both replay successfully. Confirm that the original workflow's first child ID (i.e. the pre-reset-point one) is the same as the new workflow's first child ID by looking at history. Confirm the original workflow's second child ID (i.e. the post-reset-point one) is not the same as the new workflow's second child ID.

Those are the primary three I can think of, @Quinn-With-Two-Ns probably knows more. And you may want more to test certain behaviors such as the fact that the default parent close policy of terminate does what is expected.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry, I should have marked this as draft again. Still working on it. Just sent it to see the CI builds. Talked to @Quinn-With-Two-Ns and I'm adding the cases you mentioned except this one

Capture a JSON of a before-this-PR-change workflow that completed that has a reset after child workflow, and confirm it replays properly

Because before-this-PR-change is broken and doesn't replay properly. But reset before the child workflow was working. So adding that one.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think Chads point here was even though it doesn't replay properly with the current SDK, your change should cause it to replay properly

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agree. I've added both cases.
Adding integration tests now.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So you should be able to add 3 test since you can test it

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The third case he mentioned above is an integration tests isn't it?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It would be an integration test and a replay test. You have two scenarios:

  1. Workflow was started with an old SDK, workflow had a child workflow and was reset after the child workflow finished. Currently that workflow is stuck due to NDE. Your change here should fix that workflow

  2. Workflow started with a new SDK, workflow had a child workflow and was reset after the child workflow finished. Workflow should complete successfully

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

  1. is a replay test, 2 is an integration test

Copy link
Contributor Author

@gow gow Feb 14, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Got it. Done.
I added 1 integration test resetting a WF at 3 different reset points. The parent starts 2 children in sequence.

  • Reset # 1 - before both children. Both have new IDs after reset.
  • Reset # 2 - In between child-1 & child-2. Child-1 ID stays same. Child-2 ID changes.
  • Reset # 3 - After both children. Both have the same IDs as before reset.

I think there are lot more tests we need to write related to resets in general that are outside the scope of this PR. We should probably add these in followup PRs just to increase coverage.

  • Reset a running parent.
  • Reset a completed (base) parent but with another execution already running.
  • Reset a running/completed parent that handles a failed child.
  • Reset a running/completed parent where the reset points are in between Child init and child completed events (coming up in next server version)

@gow gow marked this pull request as draft February 13, 2025 21:43
@Quinn-With-Two-Ns
Copy link
Contributor

Also this TestNumPollersCounter test failure is not your fault. Seems to be very flaky after Go 1.24

@gow gow marked this pull request as ready for review February 14, 2025 02:01
@@ -2040,6 +2040,101 @@ func (ts *IntegrationTestSuite) TestResetWorkflowExecution() {
ts.Equal(originalResult, newResult)
}

// TestResetWorkflowExecutionWithChildren tests the behavior of child workflow ID generation when a workflow with children is reset.
// It repeatedly resets the workflow at different points in its execution and verifies that the child workflow IDs are generated correctly.
func (ts *IntegrationTestSuite) TestResetWorkflowExecutionWithChildren() {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a great test

Copy link
Member

@cretz cretz left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, but wait for @Quinn-With-Two-Ns to be able to look at test coverage and scenarios, he may be aware of more/better ones

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants