v1.9.0
What's New
Support elastic queue capacity scheduling
Volcano currently uses the proportion plugin for queue management. Users can set the guarantee, capacity, and other fields of a queue to define its reserved resources and capacity limit. Resource sharing within the cluster is controlled by the weight value of each queue: cluster resources are divided among queues in proportion to their weights. This queue management method has the following problems:
- The resource capacity assigned to a queue is expressed only indirectly through its weight, which is not intuitive.
- All resource dimensions of a queue are divided using the same ratio, so the capacity cannot be set separately for each dimension.
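To make the two problems concrete, here is a minimal sketch of dividing cluster resources by weight. It is a simplification for illustration only and not the proportion plugin's actual algorithm, which also takes guarantees, capacity limits, and requests into account. The function name, queue names, and resource amounts are hypothetical.

```python
from typing import Dict


def divide_by_weight(
    cluster: Dict[str, float], weights: Dict[str, int]
) -> Dict[str, Dict[str, float]]:
    """Split every resource dimension among queues using the same weight ratio."""
    total_weight = sum(weights.values())
    return {
        queue: {res: amount * weight / total_weight for res, amount in cluster.items()}
        for queue, weight in weights.items()
    }


# Hypothetical cluster and queues, for illustration only.
cluster = {"cpu": 100.0, "nvidia.com/gpu": 8.0}
shares = divide_by_weight(cluster, {"train": 3, "infer": 1})
print(shares)
```

With weights 3 and 1, the "train" queue receives three quarters of both CPU and GPU. There is no way to give it, say, most of the GPUs but only a small part of the CPU, which is the limitation described in the second bullet.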
Based on the above considerations, Volcano implements a new elastic queue capacity management capability. It supports:
- Setting the capacity of each resource dimension directly for a queue, instead of setting a weight value.
- Elastic capacity scheduling based on deserved resources: a queue's resources can be shared with other queues and reclaimed when needed.
For example, in an AI large model training scenario, you can set different resource capacities for different GPU models in the queue, such as A100 and V100. When the cluster has idle resources, a queue can reuse the resources of other idle queues. When needed, the queue can reclaim the resources that the user set for it, that is, its deserved amount, which realizes elastic capacity scheduling.
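The sharing and reclaiming behaviour can be pictured with a small sketch. It assumes a deliberately simplified model: each queue has a deserved amount and an allocated amount per resource dimension, and a queue may reclaim up to its shortfall, limited by what other queues have borrowed beyond their own deserved amounts. This is not the capacity plugin's implementation. The function name, the queue names, and the resource keys "a100" and "v100" are hypothetical.

```python
from typing import Dict

Resources = Dict[str, float]


def reclaimable(name: str, queues: Dict[str, Dict[str, Resources]]) -> Resources:
    """Per dimension, how much `name` could reclaim from queues above their deserved share."""
    me = queues[name]
    result: Resources = {}
    for res, deserved in me["deserved"].items():
        # How far this queue is below its deserved amount.
        shortfall = max(deserved - me["allocated"].get(res, 0), 0)
        # How much other queues currently hold beyond their own deserved amount.
        borrowed = sum(
            max(q["allocated"].get(res, 0) - q["deserved"].get(res, 0), 0)
            for other, q in queues.items()
            if other != name
        )
        result[res] = min(shortfall, borrowed)
    return result


# Hypothetical state: "infer" has borrowed 3 idle A100 cards beyond its deserved 4.
queues = {
    "train": {"deserved": {"a100": 4, "v100": 2}, "allocated": {"a100": 1, "v100": 2}},
    "infer": {"deserved": {"a100": 4, "v100": 6}, "allocated": {"a100": 7, "v100": 3}},
}
print(reclaimable("train", queues))
```

In this state "train" can reclaim 3 A100 cards and no V100 cards, because it already holds its deserved V100 amount. "infer" can reclaim nothing, since no other queue is above its deserved share.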
To use this feature, set the deserved field of the queue to the amount of resources deserved in each dimension. In the scheduling configuration, you also need to turn on the capacity plugin and turn off the proportion plugin.
Please refer to Capacity Scheduling Design for more details.
Capacity scheduling example: How to use capacity plugin.
Related PR: (#3277, #121, #3283, @Monokaix)
Support affinity scheduling between queues and nodes
Queues are usually associated with departments within a company, and different departments usually need different heterogeneous resource types. For example, the large model training team needs to use NVIDIA’s Tesla GPU, and the recommendation team needs to use AMD’s GPU. When users submit jobs to a queue, the jobs need to be automatically scheduled to nodes of the corresponding resource type according to the attributes of the queue.
Volcano has implemented affinity scheduling capabilities for queues and nodes. Users only need to set the node labels that require affinity in the affinity field of the queue. Volcano will automatically schedule jobs submitted to that queue to the nodes associated with it. Users do not need to set the affinity of each job separately; they only need to set the affinity of the queue once.
This feature supports hard affinity, soft affinity, and anti-affinity scheduling at the same time. To use it, set a label with the key volcano.sh/nodegroup-name on the node, and then set the affinity field of the queue to specify the label values for hard affinity and soft affinity.
The scheduling plugin for this feature is called nodegroup. For a complete usage example, see: How to use nodegroup plugin.
For detailed design documentation, see The nodegroup design.
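As a rough illustration of how queue-level affinity could translate into node selection, the sketch below filters nodes by hard affinity and anti-affinity and ranks the rest by soft affinity. Only the label key volcano.sh/nodegroup-name comes from the text above. The QueueAffinity class, its field names, the score values, and the node and group names are hypothetical, and the nodegroup plugin's real semantics may differ.

```python
from dataclasses import dataclass, field
from typing import Dict, List

NODEGROUP_LABEL = "volcano.sh/nodegroup-name"


@dataclass
class QueueAffinity:
    """Hypothetical, simplified stand-in for a queue's affinity field."""
    required: List[str] = field(default_factory=list)       # hard affinity
    preferred: List[str] = field(default_factory=list)      # soft affinity
    anti_required: List[str] = field(default_factory=list)  # hard anti-affinity


def feasible(labels: Dict[str, str], aff: QueueAffinity) -> bool:
    """Hard rules: reject anti-affinity groups and anything outside the required groups."""
    group = labels.get(NODEGROUP_LABEL)
    if group in aff.anti_required:
        return False
    if aff.required and group not in aff.required:
        return False
    return True


def score(labels: Dict[str, str], aff: QueueAffinity) -> int:
    """Soft rule: nodes in a preferred group get a bonus (the value 100 is arbitrary)."""
    return 100 if labels.get(NODEGROUP_LABEL) in aff.preferred else 0


def rank_nodes(nodes: Dict[str, Dict[str, str]], aff: QueueAffinity) -> List[str]:
    """Return feasible node names, best score first."""
    ok = [name for name, labels in nodes.items() if feasible(labels, aff)]
    return sorted(ok, key=lambda name: score(nodes[name], aff), reverse=True)


# Hypothetical nodes and queue affinity.
nodes = {
    "n1": {NODEGROUP_LABEL: "gpu-a"},
    "n2": {NODEGROUP_LABEL: "gpu-b"},
    "n3": {NODEGROUP_LABEL: "cpu-only"},
    "n4": {},
}
aff = QueueAffinity(required=["gpu-a", "gpu-b"], preferred=["gpu-a"], anti_required=["cpu-only"])
print(rank_nodes(nodes, aff))
```

Every job submitted to the queue would share the same ranking, which is why no per-job affinity is needed.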
Related PR: (#3132, @qiankunli, @wuyueandrew)
GPU sharing feature supports node scoring scheduling
GPU Sharing is a GPU sharing and isolation solution introduced in Volcano v1.8, which provides GPU sharing and device memory control capabilities to enhance the GPU resource utilization in AI training and inference scenarios. v1.9 adds a new scoring strategy for GPU nodes on top of this feature, so that the optimal node can be selected during job assignment to further enhance resource utilization. Users can set different scoring strategies. Currently, the following two strategies are supported:
- Binpack: Provides a binpack algorithm at GPU card granularity. It prefers to fill up nodes whose GPU cards already have resources allocated, to avoid resource fragmentation and waste.
- Spread: Prioritizes the use of idle GPU cards over shared cards that have already been allocated resources.
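The difference between the two strategies can be shown with a small card-level sketch. It assumes each GPU card is described only by its total and used device memory, and it picks a single card for a request. This illustrates the idea only and is not Volcano's node scoring code. The Card class, the pick_card function, and the memory figures are hypothetical.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Card:
    total_mem: int     # device memory in MiB
    used_mem: int = 0  # memory already allocated to other tasks


def pick_card(cards: List[Card], request_mem: int, policy: str) -> Optional[int]:
    """Return the index of the chosen card, or None if no card can fit the request."""
    fitting = [i for i, c in enumerate(cards) if c.total_mem - c.used_mem >= request_mem]
    if not fitting:
        return None
    if policy == "binpack":
        # Prefer the fullest card that still fits, to keep other cards free.
        return max(fitting, key=lambda i: cards[i].used_mem)
    if policy == "spread":
        # Prefer the emptiest card, ideally an idle one.
        return min(fitting, key=lambda i: cards[i].used_mem)
    raise ValueError(f"unknown policy: {policy}")


# Hypothetical node with three 16000 MiB cards at different usage levels.
cards = [Card(16000, 0), Card(16000, 8000), Card(16000, 14000)]
print(pick_card(cards, 4000, "binpack"), pick_card(cards, 4000, "spread"))
```

For a 4000 MiB request, the third card cannot fit it. Binpack chooses the half-used second card, while spread chooses the idle first card.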
For detailed usage documentation, please refer to: How to use gpu sharing.
Related PR: (#3471, @archlitchi)
Volcano supports Kubernetes v1.29
Volcano releases follow the Kubernetes community release cadence and support every base version of Kubernetes. The latest supported version is v1.29, which has passed the full set of unit tests (UT) and E2E test cases to ensure functionality and reliability. If you would like to participate in adapting Volcano to new versions of Kubernetes, please refer to #3459 to contribute.
Related PR: (#3295, @guoqinwill)
Enhance scheduler metrics
Volcano uses client-go to talk to Kubernetes. Although the client can set a QPS limit to avoid being rate-limited, it is difficult to observe how much QPS the client actually uses. To make the request frequency observable in real time, Volcano has added new client-go metrics. Users can read these metrics to see the number of GET, POST, and other requests per second, obtain the actual QPS in use, and decide whether the client's QPS setting needs to be adjusted. The client-go metrics also include statistics on the client certificate rotation cycle, the response size per request, and more.
Users can use curl http://$volcano_scheduler_pod_ip:8080/metrics to get all the detailed metrics of volcano scheduler.
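One way to turn the scraped counters into a QPS figure is to take two scrapes and divide the counter increase by the interval. The sketch below does this for a counter assumed to be named rest_client_requests_total with a method label; that name follows the usual client-go metric naming, but it is an assumption, so check the actual output of the endpoint. The sample scrape text, host, and numbers are made up.

```python
import re
from collections import defaultdict
from typing import Dict

METRIC = "rest_client_requests_total"  # assumed metric name, verify against real output
_LINE = re.compile(r'^(\w+)\{([^}]*)\}\s+([0-9.eE+-]+)$')
_LABEL = re.compile(r'(\w+)="([^"]*)"')


def requests_by_method(text: str) -> Dict[str, float]:
    """Sum the request counter per HTTP method from Prometheus text-format output."""
    totals: Dict[str, float] = defaultdict(float)
    for line in text.splitlines():
        m = _LINE.match(line.strip())
        if not m or m.group(1) != METRIC:
            continue  # skips comments and other metrics
        labels = dict(_LABEL.findall(m.group(2)))
        totals[labels.get("method", "unknown")] += float(m.group(3))
    return dict(totals)


def qps(before: Dict[str, float], after: Dict[str, float], seconds: float) -> Dict[str, float]:
    """Average requests per second between two scrapes."""
    return {method: (after[method] - before.get(method, 0.0)) / seconds for method in after}


# Made-up scrapes taken 30 seconds apart.
scrape_t0 = '''# TYPE rest_client_requests_total counter
rest_client_requests_total{code="200",host="10.0.0.1:443",method="GET"} 100
rest_client_requests_total{code="201",host="10.0.0.1:443",method="POST"} 40
'''
scrape_t30 = '''# TYPE rest_client_requests_total counter
rest_client_requests_total{code="200",host="10.0.0.1:443",method="GET"} 370
rest_client_requests_total{code="404",host="10.0.0.1:443",method="GET"} 30
rest_client_requests_total{code="201",host="10.0.0.1:443",method="POST"} 70
'''
rates = qps(requests_by_method(scrape_t0), requests_by_method(scrape_t30), 30)
print(rates)
```

In practice a Prometheus server with a rate() query does the same calculation; the sketch only shows what the numbers mean.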
Related PR: (#3274, @Monokaix)
Add license compliance check
To strengthen open source license compliance governance in the Volcano community, and to avoid the introduction of infectious open source licenses and the risks they bring, the Volcano community has introduced an open source license compliance checking tool. An infectious license is one under which derivative works produced by modifying, using, or copying the licensed software must also be open sourced under the same license. If a third-party library introduced by a developer's PR is under an infectious open source license such as GPL or LGPL, the CI gate will block the PR. The developer needs to replace the library with one under a permissive license such as MIT, Apache 2.0, or BSD to pass the open source license compliance check.
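The kind of rule such a check enforces can be sketched as a deny-list over dependency licenses. This is only an illustration of the idea and not the tool or configuration the Volcano CI uses. The function name, the SPDX-style identifiers, and the module paths are hypothetical examples, and the deny-list covers only the two license families named above.

```python
from typing import Dict, List

# License families named in the text as infectious; a real policy would be more complete.
DENIED_PREFIXES = ("GPL", "LGPL")


def violations(dependencies: Dict[str, str]) -> List[str]:
    """Return the names of dependencies whose license identifier is on the deny-list."""
    return sorted(
        name
        for name, license_id in dependencies.items()
        if license_id.upper().startswith(DENIED_PREFIXES)
    )


# Hypothetical dependency list with SPDX-style license identifiers.
deps = {
    "example.com/fast-json": "MIT",
    "example.com/kube-helper": "Apache-2.0",
    "example.com/netlib": "BSD-3-Clause",
    "example.com/copyleft-lib": "GPL-3.0-only",
    "example.com/linked-lib": "LGPL-2.1-or-later",
}
print(violations(deps))
```

A CI job built on this idea would fail whenever the returned list is not empty.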
Related PR: (#3308, @Monokaix)
Improve scheduling stability
Volcano v1.9.0 includes further optimizations in preemption, retry on scheduling failure, memory leak avoidance, security, and more. The details include:
- Fix pods failing to be scheduled in extreme cases when a deployment is scaled up and down frequently, see PR for details: (#3376, @guoqinwill)
- Fix Pod preemption: see PR for details: (#3458, @LivingCcj)
- Optimize Pod scheduling failure retry mechanism: see PR for details: (#3435, @bibibox)
- Metrics optimization: (#3463, @Monokaix)
- Security enhancements: (#3449, @lekaf974)
Changes
- cherry-pick bugfixs (#3464 @Monokaix)
- fix nil pointer panic when evict (#3443 @bibibox)
- fix errTask channel memory leak (#3434 @bibibox)
- register nodegroup plugin to factory (#3402 @wuyueandrew)
- fix panic when the futureIdle resources are calculated to be negative (#3393 @Lily922)
- fix jobflow CRD metadata.annotations: Too long error (#3356 @guoqinwill)
- fix PodGroup being incorrectly deleted due to frequent creation and deletion of pods (#3376 @guoqinwill)
- fix rollback unthoroughly when allocate error (#3360 @bibibox)
- fix panic when the gpu is faulty (#3355 @guoqinwill)
- Support preempting BestEffort pods when the pods number of nodes reaches the upper limit (#3338 @Lily922)
- change private function 'unmarshalSchedulerConf' to public function 'UnmarshalSchedulerConf' (#3333 @Lily922)
- add pdb support feature gate (#3328 @bibibox)
- add mock cache for ut and performance test (#3269 @lowang-bh)
- support dump scheduler cache snapshot to json file (#3162 @lowang-bh)
- Volcano adapts to the k8s v1.29 (#3295 @guoqinwill)
- add lowang-bh as reviewer (#3239 @lowang-bh)
- Dockerfile: go mod cache (#3142 @7sunarni)
- Add license lint check (#3308 @Monokaix)
- [feat] Add rest client metrics (#3274 @Monokaix)
- enhancement: copy cluster total resource from ssn, instead of summing up them again (#3039 @lowang-bh)
- fix update-development-yaml will remove more webhooks than disabled one (#3109 @lowang-bh)
- Fix ineffective volcano scheduler deployment string filter (#3304 @Monokaix)
- Fix preempt: when skipping continue on preempting, the overused judgement didn't consider the current task's resource (#3230 @lowang-bh)
- add nodegroup plugin (#3132 @wuyueandrew)
- Capacity scheduling design doc (#3277 @Monokaix)
- Enhancements to Comments and Code Adjustments in pkg/scheduler (#3294 @daniel-hutao)
- Refactor Execute() of Allocate Method into Multiple Subfunctions for Enhanced Readability and Maintainability (#3292 @daniel-hutao)