coscheduling queue sort plugin starves pods #110
Comments
This sounds like a reasonable optimization. @denkensk @cwdsuzhou thoughts?
@mateuszlitwin @Huang-Wei
Ah true, I failed to notice that point.
@mateuszlitwin The failed PodGroup with an earlier timestamp will go through an internal backoff period, so the latter PodGroup is actually able to get scheduled, isn't it? If not, are you able to compose a simple test case to simulate this starvation?
+1 for this
It might be hard to design a simple test. The issue occurred multiple times in our production environment, where we had 100s of pods pending and 1000s of nodes to check. I observed that newer, recently created pods were not attempted by the scheduler at all (based on the lack of scheduling events and relevant logs), whereas older pods were attempted on a regular basis, at least once every sync, but could not be scheduled because of their scheduling constraints. The issue went away when I disabled the coscheduling queue sort. Maybe a test like this would reproduce the issue:
I am not familiar with all the details of how queuing works in the scheduler, but AFAIK certain events can put all pending pods back into the active queue, which could lead to the starvation I described, where old unschedulable pods always go to the front of the active queue and starve pods which have been waiting in the queue for a long time. Isn't the periodic flush/sync such an event, for example? The coscheduling queue sort is not compatible with the default sort. That is especially problematic because all scheduler profiles need to use the same queue sort plugin, so all profiles (e.g. the default profile) are in fact forced to use the co-scheduling sort if co-scheduling is enabled. Maybe with more customization for the queue sort plugin we could improve this?
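To make the ordering difference concrete, here is a minimal, self-contained Go sketch (not the plugin's actual code) comparing a sort keyed on the initial attempt timestamp with one keyed on the latest enqueue time. The names queuedPod, byInitialAttempt, and byEnqueueTime are made up for illustration:

```go
package main

import (
	"fmt"
	"sort"
	"time"
)

// queuedPod is a simplified stand-in for the scheduler's QueuedPodInfo.
type queuedPod struct {
	name           string
	initialAttempt time.Time // first time the pod was ever attempted; never refreshed
	lastEnqueue    time.Time // refreshed every time the pod re-enters the active queue
}

// byInitialAttempt mimics a coscheduling-style comparison keyed on the initial
// attempt (or pod-group creation) timestamp.
func byInitialAttempt(a, b queuedPod) bool { return a.initialAttempt.Before(b.initialAttempt) }

// byEnqueueTime mimics the default tie-breaker, keyed on the time the pod was
// (re-)added to the active queue.
func byEnqueueTime(a, b queuedPod) bool { return a.lastEnqueue.Before(b.lastEnqueue) }

func main() {
	now := time.Now()
	// An old pod that keeps failing and being re-queued, and a newly created pod.
	old := queuedPod{name: "old-unschedulable", initialAttempt: now.Add(-time.Hour), lastEnqueue: now}
	fresh := queuedPod{name: "newly-created", initialAttempt: now.Add(-time.Minute), lastEnqueue: now.Add(-time.Minute)}

	pods := []queuedPod{fresh, old}
	sort.Slice(pods, func(i, j int) bool { return byInitialAttempt(pods[i], pods[j]) })
	fmt.Println("initial-attempt ordering, head of queue:", pods[0].name) // old-unschedulable, every time

	sort.Slice(pods, func(i, j int) bool { return byEnqueueTime(pods[i], pods[j]) })
	fmt.Println("enqueue-time ordering, head of queue:   ", pods[0].name) // newly-created
}
```

Under the initial-attempt ordering the old unschedulable pod wins the head of the queue on every flush, which is the starvation described above; under the enqueue-time ordering it falls behind pods that arrived while it was backing off.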
OK, it sounds like a head-of-line blocking problem. Have you tried increasing the backoff and flush settings to mitigate the symptom? (I know it's just a mitigation :))
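For what it's worth, the scheduler-wide backoff knobs referred to above are plain KubeSchedulerConfiguration fields; a sketch with illustrative values (the API group version shown is the current one, older releases used v1beta flavors):

```yaml
apiVersion: kubescheduler.config.k8s.io/v1
kind: KubeSchedulerConfiguration
# Illustrative values: lengthen the per-pod backoff so a repeatedly failing pod
# returns to the active queue less often (defaults are 1s initial, 10s max).
podInitialBackoffSeconds: 5
podMaxBackoffSeconds: 60
```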
Totally understand the pain point here. The queue sort design of co-scheduling is that we want a group of Pods to be treated as a unit to achieve higher efficiency, which is essential in a highly utilized cluster. The vanilla default scheduler, in contrast, just schedules pod by pod, so every time a Pod gets re-queued it doesn't need to consider its "sibling" pods and its enqueue time can be renewed as a new item; co-scheduling cannot do that, which is the awkward part.
We have had some discussions upstream as well as in this repo. I'm not quite sure I have the bandwidth to drive this in the near future. It'd be much appreciated if anyone is interested in driving the design & implementation.
Actually, we have a similar feature request about exposing more funcs in frameWorkHandler to ensure the pods belongs sorting in
@Huang-Wei do you have some links to the previous discussions?
@mateuszlitwin Upstream is attempting (very likely I will drive this in 1.21) to provide some efficient queueing mechanics so that developers can control pod enqueuing behavior in a fine-grained manner. Here are some references:
Issues go stale after 90d of inactivity. If this issue is safe to close now please do so with /close. Send feedback to sig-contributor-experience at kubernetes/community. /lifecycle stale
/remove-lifecycle stale
/kind bug
I think it's still outstanding. I came across this when testing the v0.19.8 image. Here are the steps to reproduce:
Thanks @Huang-Wei
/assign @denkensk
Issues go stale after 90d of inactivity. If this issue is safe to close now please do so with /close. Send feedback to sig-contributor-experience at kubernetes/community. /lifecycle stale
Stale issues rot after 30d of inactivity. If this issue is safe to close now please do so with /close. Send feedback to sig-contributor-experience at kubernetes/community. /lifecycle rotten
The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs. This bot triages issues and PRs according to its standard lifecycle rules. Please send feedback to sig-contributor-experience at kubernetes/community. /close
@k8s-triage-robot: Closing this issue.
/reopen
@cwdsuzhou: Reopened this issue.
/remove-lifecycle rotten
The Kubernetes project currently lacks enough contributors to adequately respond to all issues. This bot triages un-triaged issues according to its standard lifecycle rules. Please send feedback to sig-contributor-experience at kubernetes/community. /lifecycle stale
/remove-lifecycle stale
TL;DR of the latest status of this issue: it's a fairness issue due to missing machinery to sort PodGroups. Similar to PodInfo, we need to refresh a PodGroupInfo's queuing time so that a previously failed PodGroup's sorting order can be adjusted.
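A hypothetical sketch of that refresh idea; none of these names exist in the plugin today, podGroupInfo and markGroupAttemptFailed are made up for illustration:

```go
package main

import (
	"fmt"
	"time"
)

// podGroupInfo is a simplified stand-in for the plugin's internal PodGroup bookkeeping.
type podGroupInfo struct {
	name      string
	timestamp time.Time // the value a queue-sort comparison would key on
	attempts  int
}

// markGroupAttemptFailed would be invoked when the group as a whole is rejected
// (e.g. Permit times out): bumping the timestamp pushes the group behind
// PodGroups that entered the queue since this group's last attempt.
func markGroupAttemptFailed(pg *podGroupInfo, now time.Time) {
	pg.attempts++
	pg.timestamp = now
}

func main() {
	pg := &podGroupInfo{name: "pg-a", timestamp: time.Now().Add(-time.Hour)}
	markGroupAttemptFailed(pg, time.Now())
	fmt.Printf("%s: attempts=%d, now sorts as of %s\n", pg.name, pg.attempts, pg.timestamp.Format(time.RFC3339))
}
```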
The Kubernetes project currently lacks enough contributors to adequately respond to all issues. This bot triages un-triaged issues according to its standard lifecycle rules. Please send feedback to sig-contributor-experience at kubernetes/community. /lifecycle stale
/remove-lifecycle stale
#559 can mitigate this, but in theory head-of-line (HOL) blocking can still happen. Moving it to the next release.
The Kubernetes project currently lacks enough contributors to adequately respond to all issues. This bot triages un-triaged issues according to its standard lifecycle rules. Please send feedback to sig-contributor-experience at kubernetes/community. /lifecycle stale
The Kubernetes project currently lacks enough active contributors to adequately respond to all issues. This bot triages un-triaged issues according to its standard lifecycle rules. Please send feedback to sig-contributor-experience at kubernetes/community. /lifecycle rotten
The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs. This bot triages issues according to its standard lifecycle rules. Please send feedback to sig-contributor-experience at kubernetes/community. /close not-planned
@k8s-triage-robot: Closing this issue, marking it as "Not Planned".
Hello @Huang-Wei, we observe the same issue if there are more than ~150 pods in the queue. This is our scheduler config:

scheduler-config.yaml: |
  apiVersion: kubescheduler.config.k8s.io/v1
  kind: KubeSchedulerConfiguration
  leaderElection:
    leaderElect: false
  profiles:
  # Compose all plugins in one profile
  - schedulerName: scheduler-plugins-scheduler
    plugins:
      multiPoint:
        enabled:
        - name: Coscheduling
        - name: CapacityScheduling
        - name: NodeResourceTopologyMatch
        - name: NodeResourcesAllocatable
        disabled:
        - name: PrioritySort
    pluginConfig:
    - args:
        podGroupBackoffSeconds: 20 # We increased this value

Do you have any recommendations for settings we could try to remedy the situation a bit, i.e. make the scheduler look at more pods in the queue? Thank you!
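For reference, a sketch of how that last pluginConfig entry is typically spelled out in this repo's examples (the name field appears to have been dropped from the paste above); permitWaitingTimeSeconds is included only to show the other Coscheduling knob, and both values are illustrative rather than recommendations:

```yaml
pluginConfig:
- name: Coscheduling
  args:
    # Illustrative values only.
    # podGroupBackoffSeconds delays re-queuing of a PodGroup whose pods were
    # rejected; permitWaitingTimeSeconds bounds how long pods wait at the
    # Permit stage for the rest of their group.
    permitWaitingTimeSeconds: 60
    podGroupBackoffSeconds: 20
```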
/reopen
@googs1025: Reopened this issue.
Currently the coscheduling plugin is using InitialAttemptTimestamp to compare pods of the same priority. If there are enough pods with an early InitialAttemptTimestamp which cannot be scheduled, then pods with a later InitialAttemptTimestamp will get starved: the scheduler will never attempt to schedule them, because it re-queues the "early" pods before the "later" pods get attempted. The normal scheduler uses the time when the pod was inserted into the queue, so this situation cannot occur.