coscheduling queue sort plugin starves pods #110
Comments
This sounds like a reasonable optimization. @denkensk @cwdsuzhou thoughts?
@mateuszlitwin @Huang-Wei
Ah true, I failed to notice that point.
@mateuszlitwin The failed PodGroup with an earlier timestamp will go through an internal backoff period, so the latter PodGroup is actually able to get scheduled, isn't it? If not, are you able to compose a simple test case to simulate this starvation?
+1 for this
It might be hard to design a simple test. The issue occurred multiple times in our production environment, where we had 100s of pods pending and 1000s of nodes to check. I observed that newer, recently created pods were not attempted by the scheduler at all (based on the lack of scheduling events and relevant logs), whereas older pods were attempted on a regular basis, at least once every sync, but could not be scheduled because of their scheduling constraints. The issue went away when I disabled the coscheduling queue sort. Maybe a test like this would reproduce the issue:
I am not familiar with all the details of how queuing works in the scheduler, but AFAIK certain events can put all pending pods back into the active queue, which could lead to the starvation I described, where old unschedulable pods always go to the front of the active queue and starve pods which have been waiting in the queue for a long time. Isn't the periodic flush/sync such an event, for example? The coscheduling queue sort is not compatible with the default sort. That is especially problematic because all scheduler profiles need to use the same queue sort plugin, so all profiles (e.g. the default profile) are in fact forced to use the co-scheduling sort if co-scheduling is enabled. Maybe with more customization for the queue sort plugin we could improve this?
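To make the ordering difference concrete, here is a minimal, self-contained Go sketch (not the plugin's actual code) comparing a sort keyed on the initial attempt timestamp with one keyed on the latest enqueue time. The names queuedPod, byInitialAttempt, and byEnqueueTime are made up for illustration:

```go
package main

import (
	"fmt"
	"sort"
	"time"
)

// queuedPod is a simplified stand-in for the scheduler's QueuedPodInfo.
type queuedPod struct {
	name           string
	initialAttempt time.Time // first time the pod was ever attempted; never refreshed
	lastEnqueue    time.Time // refreshed every time the pod re-enters the active queue
}

// byInitialAttempt mimics a coscheduling-style comparison keyed on the initial
// attempt (or pod-group creation) timestamp.
func byInitialAttempt(a, b queuedPod) bool { return a.initialAttempt.Before(b.initialAttempt) }

// byEnqueueTime mimics the default tie-breaker, keyed on the time the pod was
// (re-)added to the active queue.
func byEnqueueTime(a, b queuedPod) bool { return a.lastEnqueue.Before(b.lastEnqueue) }

func main() {
	now := time.Now()
	// An old pod that keeps failing and being re-queued, and a newly created pod.
	old := queuedPod{name: "old-unschedulable", initialAttempt: now.Add(-time.Hour), lastEnqueue: now}
	fresh := queuedPod{name: "newly-created", initialAttempt: now.Add(-time.Minute), lastEnqueue: now.Add(-time.Minute)}

	pods := []queuedPod{fresh, old}
	sort.Slice(pods, func(i, j int) bool { return byInitialAttempt(pods[i], pods[j]) })
	fmt.Println("initial-attempt ordering, head of queue:", pods[0].name) // old-unschedulable, every time

	sort.Slice(pods, func(i, j int) bool { return byEnqueueTime(pods[i], pods[j]) })
	fmt.Println("enqueue-time ordering, head of queue:   ", pods[0].name) // newly-created
}
```

Under the initial-attempt ordering the old unschedulable pod wins the head of the queue on every flush, which is the starvation described above; under the enqueue-time ordering it falls behind pods that arrived while it was backing off.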
OK, it sounds like a head-of-line blocking problem. Have you tried increasing the backoff and flush settings to mitigate the symptom? (I know it's just a mitigation :))
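For what it's worth, the scheduler-wide backoff knobs referred to above are plain KubeSchedulerConfiguration fields; a sketch with illustrative values (the API group version shown is the current one, older releases used v1beta flavors):

```yaml
apiVersion: kubescheduler.config.k8s.io/v1
kind: KubeSchedulerConfiguration
# Illustrative values: lengthen the per-pod backoff so a repeatedly failing pod
# returns to the active queue less often (defaults are 1s initial, 10s max).
podInitialBackoffSeconds: 5
podMaxBackoffSeconds: 60
```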
Totally understand the pain point here. The queue sort design of co-scheduling is that we want a group of Pods to be treated as a unit to achieve higher efficiency, which is essential in a highly utilized cluster. The vanilla default scheduler, in contrast, just schedules pod by pod, so every time a Pod gets re-queued it doesn't need to consider its "sibling" pods and its enqueue time can be renewed as a new item; co-scheduling cannot do that, which is the awkward part.
We have had some discussions upstream as well as in this repo. I'm not quite sure I have the bandwidth to drive this in the near future. It'd be much appreciated if anyone is interested in driving the design & implementation.
Actually, we have a similar feature request about exposing more funcs in frameWorkHandler to ensure the pods belongs sorting in
@Huang-Wei do you have some links to the previous discussions?
@mateuszlitwin Upstream is attempting (very likely I will drive this in 1.21) to provide some efficient queueing mechanics so that developers can control pod enqueuing behavior in a fine-grained manner. Here are some references:
Issues go stale after 90d of inactivity. If this issue is safe to close now please do so with /close. Send feedback to sig-contributor-experience at kubernetes/community. /lifecycle stale
/remove-lifecycle stale
/kind bug
I think it's still outstanding. I came across this when testing the v0.19.8 image. Here are the steps to reproduce:
Thanks @Huang-Wei
/assign @denkensk
Issues go stale after 90d of inactivity. If this issue is safe to close now please do so with /close. Send feedback to sig-contributor-experience at kubernetes/community. /lifecycle stale
Stale issues rot after 30d of inactivity. If this issue is safe to close now please do so with /close. Send feedback to sig-contributor-experience at kubernetes/community. /lifecycle rotten
The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs. This bot triages issues and PRs according to its standard lifecycle rules. Please send feedback to sig-contributor-experience at kubernetes/community. /close
@k8s-triage-robot: Closing this issue.
/reopen
@cwdsuzhou: Reopened this issue.
/remove-lifecycle rotten
The Kubernetes project currently lacks enough contributors to adequately respond to all issues. This bot triages un-triaged issues according to its standard lifecycle rules. Please send feedback to sig-contributor-experience at kubernetes/community. /lifecycle stale
/remove-lifecycle stale
TL;DR of the latest status of this issue: it's a fairness issue due to missing machinery to sort PodGroups. Similar to PodInfo, we need to refresh a PodGroupInfo's queuing time so that a previously failed PodGroup's sorting order can be adjusted.
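A hypothetical sketch of that refresh idea; none of these names exist in the plugin today, podGroupInfo and markGroupAttemptFailed are made up for illustration:

```go
package main

import (
	"fmt"
	"time"
)

// podGroupInfo is a simplified stand-in for the plugin's internal PodGroup bookkeeping.
type podGroupInfo struct {
	name      string
	timestamp time.Time // the value a queue-sort comparison would key on
	attempts  int
}

// markGroupAttemptFailed would be invoked when the group as a whole is rejected
// (e.g. Permit times out): bumping the timestamp pushes the group behind
// PodGroups that entered the queue since this group's last attempt.
func markGroupAttemptFailed(pg *podGroupInfo, now time.Time) {
	pg.attempts++
	pg.timestamp = now
}

func main() {
	pg := &podGroupInfo{name: "pg-a", timestamp: time.Now().Add(-time.Hour)}
	markGroupAttemptFailed(pg, time.Now())
	fmt.Printf("%s: attempts=%d, now sorts as of %s\n", pg.name, pg.attempts, pg.timestamp.Format(time.RFC3339))
}
```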
The Kubernetes project currently lacks enough contributors to adequately respond to all issues. This bot triages un-triaged issues according to its standard lifecycle rules. Please send feedback to sig-contributor-experience at kubernetes/community. /lifecycle stale
/remove-lifecycle stale
#559 can mitigate this, but in theory head-of-line (HOL) blocking can still happen. Moving it to the next release.
The Kubernetes project currently lacks enough contributors to adequately respond to all issues. This bot triages un-triaged issues according to its standard lifecycle rules. Please send feedback to sig-contributor-experience at kubernetes/community. /lifecycle stale
The Kubernetes project currently lacks enough active contributors to adequately respond to all issues. This bot triages un-triaged issues according to its standard lifecycle rules. Please send feedback to sig-contributor-experience at kubernetes/community. /lifecycle rotten
The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs. This bot triages issues according to its standard lifecycle rules. Please send feedback to sig-contributor-experience at kubernetes/community. /close not-planned
@k8s-triage-robot: Closing this issue, marking it as "Not Planned".
Hello @Huang-Wei, we observe the same issue if there are more than ~150 pods in the queue. This is our scheduler config:

scheduler-config.yaml: |
  apiVersion: kubescheduler.config.k8s.io/v1
  kind: KubeSchedulerConfiguration
  leaderElection:
    leaderElect: false
  profiles:
  # Compose all plugins in one profile
  - schedulerName: scheduler-plugins-scheduler
    plugins:
      multiPoint:
        enabled:
        - name: Coscheduling
        - name: CapacityScheduling
        - name: NodeResourceTopologyMatch
        - name: NodeResourcesAllocatable
        disabled:
        - name: PrioritySort
    pluginConfig:
    - args:
        podGroupBackoffSeconds: 20 # We increased this value

Do you have any recommendations for settings we could try to remedy the situation a bit, i.e. make the scheduler look at more pods in the queue? Thank you!
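For reference, a sketch of how that last pluginConfig entry is typically spelled out in this repo's examples (the name field appears to have been dropped from the paste above); permitWaitingTimeSeconds is included only to show the other Coscheduling knob, and both values are illustrative rather than recommendations:

```yaml
pluginConfig:
- name: Coscheduling
  args:
    # Illustrative values only.
    # podGroupBackoffSeconds delays re-queuing of a PodGroup whose pods were
    # rejected; permitWaitingTimeSeconds bounds how long pods wait at the
    # Permit stage for the rest of their group.
    permitWaitingTimeSeconds: 60
    podGroupBackoffSeconds: 20
```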
/reopen
@googs1025: Reopened this issue.
Currently the coscheduling plugin is using InitialAttemptTimestamp to compare pods of the same priority. If there are enough pods with an early InitialAttemptTimestamp which cannot be scheduled, then pods with a later InitialAttemptTimestamp will get starved: the scheduler will never attempt to schedule them, because it re-queues the "early" pods before the "later" pods get attempted. The normal scheduler uses the time when the pod was inserted into the queue, so this situation cannot occur.