Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Improve patchHelper to allow atomically updating a set of fields #11401

Open
sbueringer opened this issue Nov 11, 2024 · 4 comments
Open

Improve patchHelper to allow atomically updating a set of fields #11401

sbueringer opened this issue Nov 11, 2024 · 4 comments
Labels
kind/feature Categorizes issue or PR as related to a new feature. needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. priority/important-soon Must be staffed and worked on either currently, or very soon, ideally in time for the next release.

Comments

@sbueringer
Copy link
Member

sbueringer commented Nov 11, 2024

Today patchHelper calculates patches by comparing two objects (let's call them original & desired). The patch that is send to the apiserver then only contains the delta.

This works fine as long as the original object is up-to-date. The problems start when the original object is outdated, e.g. because we read it from the controller-runtime cache and the object was stale. Please note that when patching status & spec the patch helper does not use optimistic locking.

An example:

original KCP object (stale)

status:
  replicas: 3
  readyReplicas: 3

desired KCP object

status:
  replicas: 4
  readyReplicas: 3

=> patchHelper will send a patch that sets .status.replicas = 4. readyReplicas is not included because patchHelper couldn't see a delta.

But the actual KCP object was:

status:
  replicas: 3
  readyReplicas: 2

So after patching we will end up with the following KCP object:

status:
  replicas: 4
  readyReplicas: 2

At no point did KCP calculate this combination of counter fields, but we still end up with this state in the apiserver because the patch calculation was based on a stale object. This can happen with various values for the counter fields and it gets problematic when other controllers are reading the KCP status and then making decisions based on it, e.g. they might think KCP has been already entirely upgraded if all replica fields are 3, but this might not actually be the case.

This might sound like a contrived example, but we actually had this problem and it's super hard to debug. The Cluster topology controller read KCP status, determined KCP was already entirely upgraded and then triggered worker upgrades, even though the KCP ugprade was still in progress.

One way to solve this would be to not read KCP from the cache and read it from the apiserver directly, but this is not great for scalability and puts more load on the kube-apiserver.

My proposal would be to add an option to patchHelper to ensure a certain set of fields is always part of the patch, if one of the fields changed. Based on the example above this option would be used to ensure replicas & readyReplicas are always patched together.

Somewhat related to ongoing work in #11105

@k8s-ci-robot k8s-ci-robot added needs-priority Indicates an issue lacks a `priority/foo` label and requires one. needs-kind Indicates a PR lacks a `kind/foo` label and requires one. needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. labels Nov 11, 2024
@k8s-ci-robot
Copy link
Contributor

This issue is currently awaiting triage.

If CAPI contributors determine this is a relevant issue, they will accept it by applying the triage/accepted label and provide further guidance.

The triage/accepted label can be added by org members by writing /triage accepted in a comment.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@sbueringer sbueringer changed the title Improve patchHelper to allow atomically updating specific fields Improve patchHelper to allow atomically updating a set of fields Nov 11, 2024
@sbueringer sbueringer added kind/feature Categorizes issue or PR as related to a new feature. priority/important-soon Must be staffed and worked on either currently, or very soon, ideally in time for the next release. labels Nov 11, 2024
@k8s-ci-robot k8s-ci-robot removed needs-kind Indicates a PR lacks a `kind/foo` label and requires one. needs-priority Indicates an issue lacks a `priority/foo` label and requires one. labels Nov 11, 2024
@sbueringer
Copy link
Member Author

sbueringer commented Nov 11, 2024

@vincepri @fabriziopandini @chrischdi Opinions?

@mboersma
Copy link
Contributor

One way to solve this would be to not read KCP from the cache and read it from the apiserver directly

Even this change would just minimize the chances of getting an outdated object, but not eliminate the problem, right? I think we would need to go with an optimistic locking approach, or your latter proposal of identifying which sets of change must be atomic.

to add an option to patchHelper to ensure a certain set of fields is always part of the patch, if one of the fields changed

Would this be an additional argument or arguments the caller needs to supply when calling patchHelper? Sort of an opt-in approach. Or would patchHelper have knowledge of correlated field requirements for CAPI objects?

@vincepri
Copy link
Member

My proposal would be to add an option to patchHelper to ensure a certain set of fields is always part of the patch, if one of the fields changed. Based on the example above this option would be used to ensure replicas & readyReplicas are always patched together.

In some ways though it feels like we might want to enable optimistic locking? That kind of guarantees what you mention above, although has other side effects.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
kind/feature Categorizes issue or PR as related to a new feature. needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. priority/important-soon Must be staffed and worked on either currently, or very soon, ideally in time for the next release.
Projects
None yet
Development

No branches or pull requests

4 participants