v1.5.0-Beta
What's New
Support Task Dependency
In mainstream computing frameworks such as MPI and TensorFlow, pods take on different roles, for example master and worker, and the framework's design often requires either the master or the workers to start first. This feature lets users control the start order of tasks within a job. For details, see https://github.com/volcano-sh/volcano/blob/master/docs/design/task-launch-order-within-job.md. (#1920, #1833, @hwdef @shinytang6 @Thor-wl )
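To illustrate the idea of a start order derived from task dependencies, here is a minimal sketch in Python. It is not Volcano's implementation: the function `start_order`, the `depends_on` mapping, and the task names are hypothetical and chosen only for illustration. It uses the standard library's `graphlib` (Python 3.9+) to produce an order in which every task comes after the tasks it depends on.

```python
from graphlib import CycleError, TopologicalSorter


def start_order(depends_on):
    """Return task names so that each task follows the tasks it depends on.

    depends_on maps a task name to the names of tasks that must start first.
    Raises ValueError if the dependencies form a cycle.
    """
    try:
        return list(TopologicalSorter(depends_on).static_order())
    except CycleError as exc:
        # graphlib reports the detected cycle as the second element of args.
        raise ValueError(f"cyclic task dependency: {exc.args[1]}") from exc


# Hypothetical MPI-style job: the master starts only after the workers.
tasks = {"mpiworker": [], "mpimaster": ["mpiworker"]}
print(start_order(tasks))
```

A cyclic declaration (task A waits for B while B waits for A) has no valid start order, which is why the sketch rejects it instead of picking an arbitrary one.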
Support Reserve Resource for Queue
This feature reserves resources for specified queues so that urgent jobs always have guaranteed resources available, rather than waiting for resources to be released or being preempted. For details, see https://github.com/volcano-sh/volcano/blob/master/docs/design/queue-guarantee-resource-reservation-design.md (#1905, #1904, @qiankunli )
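The effect of a reservation can be pictured with a little arithmetic. The sketch below is a simplified model, not Volcano's scheduler logic: it assumes that resources guaranteed to one queue are simply unavailable to every other queue. The function `usable_capacity`, the queue names, and the resource figures are all hypothetical.

```python
def usable_capacity(total, guarantees, queue):
    """Return the resources `queue` may use under a simple reservation model.

    Assumption (not taken from Volcano's code): resources guaranteed to
    other queues are off limits, while a queue's own guarantee is usable.
    """
    reserved_for_others = {}
    for name, resources in guarantees.items():
        if name == queue:
            continue  # a queue's own reservation does not reduce its capacity
        for resource, amount in resources.items():
            reserved_for_others[resource] = (
                reserved_for_others.get(resource, 0) + amount
            )
    return {
        resource: max(amount - reserved_for_others.get(resource, 0), 0)
        for resource, amount in total.items()
    }


# Hypothetical cluster with one queue that holds a reservation.
total = {"cpu": 16, "memory_gib": 64}
guarantees = {"urgent": {"cpu": 4, "memory_gib": 16}}
print(usable_capacity(total, guarantees, "batch"))
print(usable_capacity(total, guarantees, "urgent"))
```

Under this model the "batch" queue can never consume the share set aside for "urgent", so an urgent job does not have to wait for batch workloads to finish.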
Support Specified Nodes for Volcano in Cluster
In some scenarios, such as clusters that run multiple schedulers, Volcano should be responsible for only part of the nodes in the cluster. This feature enables users to configure which nodes Volcano is responsible for. For details, see #1834 (#1821, @qiankunli )
Add TensorFlow Job Plugin
Volcano provides a unified job object that lets users run AI training workloads such as TensorFlow, PyTorch, MXNet, and MPI as Volcano Jobs with enhanced lifecycle management. However, configuring such jobs is somewhat complex for some users. This feature adds a TensorFlow plugin, built on the Volcano job plugin framework, which reduces the complexity of running TensorFlow with Volcano and makes it easier to use. For details, see https://github.com/volcano-sh/volcano/blob/master/docs/design/distributed-framework-plugins.md (#1874, @LuBingtan )
Other Notable Changes
- update CRD version to v1(#1919, @Thor-wl )
- update golang to v1.17(#1912, @Thor-wl )
- optimize: reuse predicate error on same task group(#1906, @justadogistaken )
- default sort task by index(#1898, @xiaoanyunfei )
- update PriorityClass from v1beta1 to v1 for go-client(#1897, @lc2705 )
- support label volcano.sh/task-priority(#1896, @qiankunli )
- enhance the security of TLS client authentication for webhook(#1895, @huone1 )
- add healthz and metric switch for deploy controller and scheduler(#1888, @huone1 )
- add elastic scheduler design doc(#1887, @qiankunli )
- add eventhandler framework proposal(#1886, @sivanzcw )
- add an argument csi-storage to control the storage capacity resource(#1875, @huone1 )
- add rbac for csinode(#1871, @Thor-wl )
- feat: add imagelocality priority to nodeOrder(#1868, @justadogistaken )
- optimize the CA parse Process(#1862, @huone1 )
- ignore the update event if pod is allocated in cache but not present in NodeName(#1857, @xing0821)
- improve taintTolerationScore and interPodAffinityScore throughput when failure occurs(#1856, @justadogistaken )
- switch the order of patch and existence check(#1852, @zzr93)
- support to set healthz address and metrics address(#1849, @huone1 )
- add nodeSelector design doc(#1834, @qiankunli )
- enhance the volcano topology framework(#1762, @huone1 )
- support preempt with priority plugin alone(#1757, @Thor-wl )
- support reserved node(#1821, @qiankunli )
- support multi-cluster scheduling in framework(#1521, @william-wang )
- cleanup scheduler cache informerFactory(#1831, @xiaoanyunfei )
- cleanup AddPriorityClass(#1828, @xiaoanyunfei )
- clean addNumaInfo(#1829, @xiaoanyunfei )
- clean up readAdmissionConf(#1823, @xiaoanyunfei )
- Remove default quota info in NewNamespaceCollection(#1817, @zen-xu )
- add multiple tasks support(#1820, @hwdef )
- refactor the cache to support batch bind api for better performance(#1796, @huone1 )
- optimize resource comparison functions for performance(#1769, @huone1 )
- optimize some logs in admission process(#1738, @huone1 )
- add setting MinResources to pg for normal pod(#1666, @huone1 )
- don't return err message when the pod isn't in the nodeinfo cache(#1478, @huone1 )
- update vendor for resource reservation(#1494, @huone1 )
- Proposal: Add Machine Learning Framework Plugins in Volcano(#1806, @LuBingtan )
- upgrade spf13/cobra version to 1.2.1(#1801, @marffin)
- refactor Volcano to support multi-scheduler, with each job and node assigned a corresponding scheduler based on hash(#1795, @william-wang )
- support multi-scheduler for k8s workload deployment, etc(#1792, @huone1 )
- use root context(#1715, @lowang-bh )
- add design docs for task-level advanced scheduling policy(#1630, @hwdef )
- Add livenessProbe and readinessProbe in Grafana Container(#1788, @dipanjank )
- enhance the admission conf check(#1799, @huone1 )
- Catch add pod out of sync error(#1783, @zhiyuone )
- add UT for elect action(#1780, @Thor-wl )
- add oidc import to enable vcctl to work with oidc cluster(#1793, @igormishsky)
- Add job conditions (status&lastTransitionTime)(#1764, @HecarimV )
- add ut coverage report for v1.4.0(#1766, @Thor-wl )
- refactor the Jobinfo functions to reduce redundant computing(#1745, @william-wang )
Bug Fixes
- fix scheduling process starting before resource synchronization is complete(#1916, @huone1 )
- fix: allocate ut for "two Jobs on one node"(#1913, @justadogistaken )
- fix the deep clone of JobInfo(#1883, @lc2705 )
- fix the security alert from Kubernetes(#1873, @Thor-wl )
- fix pod not being allocated despite sufficient resources(#1851, @aidaizyy )
- fix: avoid chan block within taintTolerationScore(#1848, @justadogistaken )
- fix: scheduler crash fatal error: concurrent map writes(#1847, @Jason-Liu-Dream)
- fix syntax error in function RemoveGPUIndexPatch(#1841, @Thor-wl )
- fix: All pods is existing when restart count exceed max retry(#1719, @LuBingtan )
- fix nil pointer access in function setNodeState(#1800, @huone1 )
- fix OOM that occurs if pod info is synced before node info(#1662, @huone1 )
- fix controller panic when creating a large number of pods(#1814, @huone1 )
- fix a problem about equivalence ecache feature (#1593, @huone1 )
- fix(scheduler) gang plugin task min available check(#1732, @king-jingxiang)
- fix: fix possible panic when 'SetNode' is called(#1685, @eggiter )
- fix bug that vcjob is not completed when maxRetry is 1(#1746, @Thor-wl )
- fix the security alerts(#1770, @Thor-wl )
- fix(scheduler): improve job/task clone func(#1729, @shinytang6 )
- fix broken grafana dashboard configuration.(#1773, @dipanjank )