nrt: cache: add support for dedicated informer #599
Conversation
Skipping CI for Draft Pull Request.
[APPROVALNOTIFIER] This PR is APPROVED. This pull-request has been approved by: ffromani. The full list of commands accepted by this bot can be found here. The pull request process is described here.
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing /approve in a comment.
@Huang-Wei @dankensk PTAL when you have time. Would adding a separate, optional, disabled-by-default informer be acceptable? (please check the related issue for context) I'm also investigating on the kubelet side if/how terminal pods should be reported.
/cc @swatisehgal @Tal-or
Force-pushed from 64e5752 to ed541c5.
✅ Deploy Preview for kubernetes-sigs-scheduler-plugins ready!
Force-pushed from ed541c5 to 303b02e.
@Huang-Wei @zwpaper since I mentioned this PR when reviewing #655, any design objections from you folks? Note this is still on hold as we're still checking whether the fix is better done in the kubelet or not. But would this PR raise concerns from the scheduling side?
Force-pushed from 303b02e to 0386051.
gentle ping :) no major objections I guess?
assuming no design blockers, @Tal-or @swatisehgal PTAL
Not yet, but we're heading in the direction that this is a bug on the kubelet side. It will need better docs (and agreement about the desired state) anyway; the discussion needs to be revamped.
That's a very good point. If everything goes well, the kubelet fix will land in 1.30. This change will ensure scheduler compatibility with older versions (any older supported version), which seems like a reasonable price to pay - assuming there are no other design concerns.
Force-pushed from 0386051 to b0e265f.
hi @ffromani sorry for the late reply, I was really busy in those days. I was not able to find time to look into the changes; it would be better for other reviewers to take it.
gentle ping @swatisehgal @Tal-or |
Thanks for your patience and your work on this!
I am happy to proceed with the current approach, as the introduction of the informer mode is a nice addition to the current implementation and gives us a way to resolve the inconsistency between the pod data obtained from the podresources API and the data visible to the scheduler.
But I do feel that, in the long term, resolving this on the kubelet side would be a more appropriate fix, as exposing resources of pods that are terminal does not seem appropriate to me. Irrespective of how we go about things from the kubelet point of view, I think we have a way forward to handle this for now as well as in the future, with some minor adjustments to the default mode if needed.
I left comments suggesting a minor addition to the tests and explaining why, IMO, we should reevaluate the current default value of the mode.
Thank you for working on this, LGTM (not adding the official label since I see there are other comments)
Thank you for the valuable feedback. You raise (and raised, now that I remember) a great point that makes me still torn about this approach. The problem is indeed the representation of completed pods. The core issue is: should a completed (either failed or succeeded) pod be accounted for? IOW, should resources accounted to completed pods be reused if the need arises? There are valid points for both approaches. What is more and more evident, anyway, is the fact that I'm still ruminating about the merit of the approach. This in itself is a strong indicator that we should NOT bake this workaround into the main codebase. Should the workaround be needed, there are ways to add it in 3rd-party codebases. At the same time, out of practicality, the issue we're trying to fix here is real and affects real-world deployments.
IMO the answer to your question should be NO (in the long term), but I understand that we are not in a perfect world and we can't achieve that immediately. Solving this in the scheduler obviously brings us to a better state than we are currently in. To be a bit more explicit, the way I see this working out in my head is as follows:
Just want to state that I am happy with the PR in its current form with some minor updates, but we can totally explore a more streamlined PR if you think that would be better.
Thanks for sharing your insights! I was contemplating the best course of action based on your feedback. All of this is bad UX at best, and a crippling blow to the plugin functionality in all the scenarios. The best course of action would be to agree on a standard in the kubernetes, or at least node, community, but this is unlikely to happen anytime soon - kubernetes/kubernetes#119423 got little traction to begin with, and in all fairness this is a nontrivial effort. This will leave us in a state in which the plugin functionality is severely diminished until the oldest supported version (current-2 if I'm not mistaken) is fixed, which will take at least months from now. For all of these reasons, and because reviewers and maintainers are at worst neutral, if not mildly positive out of practical considerations, about this change, I'm removing the hold and moving forward. Granted, we should, no, we must keep working toward a conclusive and clean fix. This change can't be the long term solution, even though "temporary" solutions tend to stick around forever. Let's steer the boat for once.
/unhold begrudgingly. It's a workaround. But we can't keep a crippled plugin until we have an N-2 fixed version. We are talking about at the very least kube 1.32.
Add an option to create and use a separate informer to get unfiltered pod listing from the apiserver. Due to a mismatched view with respect to the kubelet, the plugin needs access to pods in terminal phase which are not yet deleted, to make sure the reconciliation is done correctly. xref: kubernetes-sigs#598 xref: kubernetes/kubernetes#119423 Signed-off-by: Francesco Romani <fromani@redhat.com>
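For illustration, here is a minimal, hedged sketch of the idea in the commit message: a dedicated, unfiltered pod informer built with client-go so that terminal-but-not-deleted pods remain visible. This is not the PR's actual code; the function name startUnfilteredPodInformer and the log output are made up for this example, and only general client-go APIs are assumed.

```go
// Minimal sketch, not the PR's actual code: a dedicated, unfiltered pod informer so
// that pods in a terminal phase which are not yet deleted stay visible to the plugin.
package main

import (
	"context"
	"fmt"
	"time"

	corev1 "k8s.io/api/core/v1"
	"k8s.io/client-go/informers"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/rest"
	"k8s.io/client-go/tools/cache"
)

// startUnfilteredPodInformer builds its own informer factory instead of reusing the
// scheduler-provided pod view, which (per the motivation of this PR) does not surface
// terminal pods to the plugin.
func startUnfilteredPodInformer(ctx context.Context, client kubernetes.Interface) (cache.SharedIndexInformer, error) {
	factory := informers.NewSharedInformerFactory(client, 0)
	podInformer := factory.Core().V1().Pods().Informer()

	podInformer.AddEventHandler(cache.ResourceEventHandlerFuncs{
		UpdateFunc: func(_, newObj interface{}) {
			pod, ok := newObj.(*corev1.Pod)
			if !ok {
				return
			}
			// This is exactly the class of pods the default, filtered view hides:
			// finished running, but still present in the apiserver.
			if (pod.Status.Phase == corev1.PodSucceeded || pod.Status.Phase == corev1.PodFailed) &&
				pod.DeletionTimestamp == nil {
				fmt.Printf("terminal pod still present: %s/%s\n", pod.Namespace, pod.Name)
			}
		},
	})

	factory.Start(ctx.Done())
	if !cache.WaitForCacheSync(ctx.Done(), podInformer.HasSynced) {
		return nil, fmt.Errorf("pod informer cache failed to sync")
	}
	return podInformer, nil
}

func main() {
	config, err := rest.InClusterConfig()
	if err != nil {
		panic(err)
	}
	client := kubernetes.NewForConfigOrDie(config)

	ctx, cancel := context.WithTimeout(context.Background(), time.Minute)
	defer cancel()
	if _, err := startUnfilteredPodInformer(ctx, client); err != nil {
		panic(err)
	}
	<-ctx.Done()
}
```

In the actual plugin the informer would presumably feed the reconciliation cache rather than print, and would be gated behind the new, disabled-by-default option discussed above.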
Force-pushed from b0e265f to e466bc4.
Addressed all comments and fixed a subtle bug where the pods were not filtered according to the new option value. Now they are.
/hold
Looks like the bot was faster. We can revert if needed.
Ah! Didn't expect that! Next time I will be more careful. @Huang-Wei please let us know if you have concerns here.
No concerns from my side.
What type of PR is this?
/kind bug
/kind api-change
What this PR does / why we need it:
Add an optional dedicated informer for the NodeResourceTopology scheduler plugin.
For its reconciliation mechanism to converge, the plugin needs to list pods in terminal phase which are not deleted yet.
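To make that requirement concrete, here is a hedged sketch of the kind of check it implies; the helper name isTerminalButNotDeleted is hypothetical and not taken from the PR.

```go
// Hedged sketch; the helper name isTerminalButNotDeleted is hypothetical and not
// taken from this PR.
package example

import corev1 "k8s.io/api/core/v1"

// isTerminalButNotDeleted reports whether a pod has finished running (Succeeded or
// Failed) but is still present in the apiserver, i.e. the pods the reconciliation
// needs to see to stay consistent with the kubelet's view.
func isTerminalButNotDeleted(pod *corev1.Pod) bool {
	terminal := pod.Status.Phase == corev1.PodSucceeded || pod.Status.Phase == corev1.PodFailed
	return terminal && pod.DeletionTimestamp == nil
}
```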
Which issue(s) this PR fixes:
Fixes #598
Special notes for your reviewer:
Technically an API change, but it is backward compatible (optional fields) and done like many others we did in the past. Still labeled for transparency.
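As a generic illustration of why an optional field keeps such a change backward compatible (the type name CacheArgs, the field InformerMode, and the value "shared" below are hypothetical, not the identifiers introduced by this PR):

```go
// Generic illustration only: names here are hypothetical, not the ones from this PR.
package example

// CacheArgs stands in for the plugin's configuration arguments.
type CacheArgs struct {
	// InformerMode selects how the plugin watches pods. Because the field is
	// optional (a pointer with omitempty), configurations written before this
	// field existed still decode unchanged.
	InformerMode *string `json:"informerMode,omitempty"`
}

// setDefaults preserves the pre-change behavior whenever the new field is unset.
func setDefaults(args *CacheArgs) {
	if args.InformerMode == nil {
		mode := "shared" // hypothetical name for the previous, filtered behavior
		args.InformerMode = &mode
	}
}
```

Existing configurations that never set the field keep decoding and default to the previous behavior, which is what makes the addition backward compatible.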
Does this PR introduce a user-facing change?