Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Consider cluster.x-k8s.io/remediate-machine annotated when selecting the machine to be remediate #11385

Open
fabriziopandini opened this issue Nov 6, 2024 · 11 comments · May be fixed by #11495
Assignees
Labels
area/machinehealthcheck Issues or PRs related to machinehealthchecks help wanted Denotes an issue that needs help from a contributor. Must meet "help wanted" guidelines. kind/feature Categorizes issue or PR as related to a new feature. priority/important-longterm Important over the long term, but may not be staffed and/or may need multiple releases to complete. triage/accepted Indicates an issue or PR is ready to be actively worked on.

Comments

@fabriziopandini
Copy link
Member

What would you like to be added (User Story)?

As an operator, I want CAPI to try to remediate machines annotated with cluster.x-k8s.io/remediate-machine before other machines (also to be remediated, but not hand picked by the operator)

Detailed Description

CAPI should give priority to machine annotated with cluster.x-k8s.io/remediate-machine when selecting the machine to be remediated (thus giving priority to machine picked by the user for remediation)

Anything else you would like to add?

Note, this should be implemented in few places

  • MHC (Not 100% sure, but worth to check)
  • KCP (Machines with the annotation should be picked first for remediation)
  • MS (Machines with the annotation should be sorted as first for remediation)

Label(s) to be applied

/kind feature
One or more /area label. See https://github.com/kubernetes-sigs/cluster-api/labels?q=area for the list of labels.

@k8s-ci-robot k8s-ci-robot added kind/feature Categorizes issue or PR as related to a new feature. needs-priority Indicates an issue lacks a `priority/foo` label and requires one. needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. labels Nov 6, 2024
@k8s-ci-robot
Copy link
Contributor

This issue is currently awaiting triage.

If CAPI contributors determine this is a relevant issue, they will accept it by applying the triage/accepted label and provide further guidance.

The triage/accepted label can be added by org members by writing /triage accepted in a comment.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@fabriziopandini fabriziopandini added area/machinehealthcheck Issues or PRs related to machinehealthchecks priority/important-longterm Important over the long term, but may not be staffed and/or may need multiple releases to complete. triage/accepted Indicates an issue or PR is ready to be actively worked on. and removed needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. needs-priority Indicates an issue lacks a `priority/foo` label and requires one. labels Nov 6, 2024
@sbueringer
Copy link
Member

/triage accepted

Also not sure about changes in the MHC controller, but we'll figure it out

@Karthik-K-N
Copy link
Contributor

Hi, Is this being worked on as part of any other PR or Can I contribute? Thanks

@sbueringer
Copy link
Member

Feel free to go ahead, thx!

@Karthik-K-N
Copy link
Contributor

Is there any reference available to ease my understanding.

@sbueringer
Copy link
Member

Not sure, @fabriziopandini ?

@fabriziopandini
Copy link
Member Author

Relevant code:

  • KCP
    // Select the machine to be remediated, which is the oldest machine to be remediated not yet provisioned (if any)
    // or the oldest machine to be remediated.
    //
    // NOTE: The current solution is considered acceptable for the most frequent use case (only one machine to be remediated),
    // however, in the future this could potentially be improved for the scenario where more than one machine to be remediated exists
    // by considering which machine has lower impact on etcd quorum.
    machineToBeRemediated := getMachineToBeRemediated(machinesToBeRemediated)
  • MS
    // Sort the machines from newest to oldest.
    // We are trying to remediate machines failing to come up first because
    // there is a chance that they are not hosting any workloads (minimize disruption).
    sort.SliceStable(machinesToRemediate, func(i, j int) bool {
    return machinesToRemediate[i].CreationTimestamp.After(machinesToRemediate[j].CreationTimestamp.Time)
    })

@fabriziopandini
Copy link
Member Author

/help

@k8s-ci-robot
Copy link
Contributor

@fabriziopandini:
This request has been marked as needing help from a contributor.

Guidelines

Please ensure that the issue body includes answers to the following questions:

  • Why are we solving this issue?
  • To address this issue, are there any code changes? If there are code changes, what needs to be done in the code and what places can the assignee treat as reference points?
  • Does this issue have zero to low barrier of entry?
  • How can the assignee reach out to you for help?

For more details on the requirements of such an issue, please see here and ensure that they are met.

If this request no longer meets these requirements, the label can be removed
by commenting with the /remove-help command.

In response to this:

/help

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@k8s-ci-robot k8s-ci-robot added the help wanted Denotes an issue that needs help from a contributor. Must meet "help wanted" guidelines. label Nov 27, 2024
@Karthik-K-N
Copy link
Contributor

Thanks for the reference, I will check.

@Karthik-K-N
Copy link
Contributor

Made an attempt address this issue via PR #11495 with changes only for KCP, Added relevant UT as well.

Please take a look and help in clarifying on my understanding and approach. Based on the early feedback will proceed with missing parts. Thanks

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
area/machinehealthcheck Issues or PRs related to machinehealthchecks help wanted Denotes an issue that needs help from a contributor. Must meet "help wanted" guidelines. kind/feature Categorizes issue or PR as related to a new feature. priority/important-longterm Important over the long term, but may not be staffed and/or may need multiple releases to complete. triage/accepted Indicates an issue or PR is ready to be actively worked on.
Projects
None yet
4 participants