Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

refactor(clustering/sync): improve error handling #14196

Closed
wants to merge 1 commit into from

Conversation

StarlightIbuki
Copy link
Contributor

Summary

  1. refactor to abstract error handler;
  2. immediate retry when failure;
  3. variable renaming

Checklist

  • The Pull Request has tests
  • A changelog file has been created under changelog/unreleased/kong or skip-changelog label added on PR if changelog is unnecessary. README.md
  • There is a user-facing docs PR against https://github.com/Kong/docs.konghq.com - PUT DOCS PR HERE

Issue reference

KAG-6225

@github-actions github-actions bot added core/clustering cherry-pick kong-ee schedule this PR for cherry-picking to kong/kong-ee labels Jan 20, 2025
@StarlightIbuki StarlightIbuki changed the title chore(sync): improve error handling chore(clustering/sync): improve error handling Jan 20, 2025
@StarlightIbuki StarlightIbuki changed the title chore(clustering/sync): improve error handling refactor(clustering/sync): improve error handling Jan 20, 2025
ADD-SP
ADD-SP previously requested changes Jan 20, 2025
kong/clustering/services/sync/rpc.lua Show resolved Hide resolved
-- retry without releasing the mutex as the DP has nothing to do except syncing
if is_full_sync then
yield()
return do_sync()
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why do we run sync in a timer? could it cause more cost?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

By doing so we do not release the lock and fetch the lock again

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Potential infinite loop?

Copy link
Contributor

@ADD-SP ADD-SP Jan 20, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I understand in this state, there is nothing DP can do other than syncing, but it is a little bit risky to sync again and again without any backoff delay.

@chronolaw
Copy link
Contributor

This PR may conflict with another refactor PR? What is your opinion?

@StarlightIbuki StarlightIbuki force-pushed the refactor/error-hdl branch 2 times, most recently from 691f8a3 to 5abb6fd Compare January 20, 2025 08:49
@StarlightIbuki
Copy link
Contributor Author

This PR may conflict with another refactor PR? What is your opinion?

Resolved

local do_sync


local function error_handler(err, is_full_sync)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think that this error_handler() is a premature optimization, we could have a simple function like:

try_sync(is_full_sync)
  if is_full_sync then
    yield()
    return do_sync()
  end
end

And keep the oringinal error handling code.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This causes an unconditional yielding.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

may be

        if not res then
  if is_full_sync then
    return try_sync()
  end
          return nil, err
        end

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

for sync I don't think that we should hide this operation in an error handler function, we should show it clearly.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I will show you why this is designed like this in EE's PR

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Anyway this change is removed


if not deltas then
return nil, "sync get_delta error: deltas is null"
return error_handler("sync get_delta error: deltas is null", is_full_sync)
Copy link
Contributor

@chronolaw chronolaw Jan 20, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We need some comments about when and why we should run do_sync again.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could you elaborate?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Are you commenting at the place you meant to

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, I don't understand why here we should call do_sync again, what's the difference between other error handling.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same here

local do_sync


local function error_handler(err)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since we have removed some code in this function, I don't think it is necessary now (only two lines).,

@@ -177,7 +177,16 @@ local function is_rpc_ready()
end


local function do_sync()
local do_sync
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

And this forward declaration should be removed.

@StarlightIbuki
Copy link
Contributor Author

@chronolaw Let's close this PR then

@chronolaw chronolaw deleted the refactor/error-hdl branch January 21, 2025 06:20
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
cherry-pick kong-ee schedule this PR for cherry-picking to kong/kong-ee core/clustering size/M skip-changelog
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants