Arktos Scalability 430 2021 Tracks
# Goals
1. Multiple resource partitions for pod scheduling (2+ for 430) - primary goal
- A tenant can have pods physically located in 2 different RPs after scheduling
- The scheduler in a tenant partition should be able to listen to multiple API servers, one belonging to each RP (see the sketch after this list)
- Performance test for 50K hollow nodes. (2 TP, 2~3 RP)
- Performance test runs with SSL enabled
- QPS optimization
1. DaemonSet handling in RP - non-primary goal
- Remove from TP
- Support DaemonSet in RP
- Load test
1. Dynamic add/delete TP/RP design - TBD
- For design purposes only, not for implementation - the aim is to avoid hardcoding in many places
- Quality bar only
- Dynamically discover new tenant partitions based on CRD objects in its resource manager
1. System partition pod handling - TBD
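
To make the multi-RP goal above concrete, here is a minimal, hypothetical sketch of a tenant-partition component building one kube-client per resource partition and list/watching nodes from all of them. The kubeconfig paths, resync period, and event handler are illustrative assumptions, not actual Arktos code.

```go
// Hypothetical sketch only: one client-go clientset per resource partition (RP),
// each with its own node informer. Paths and names are placeholders.
package main

import (
	"fmt"
	"time"

	v1 "k8s.io/api/core/v1"
	"k8s.io/client-go/informers"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/cache"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	// Assumed: one kubeconfig per RP API server that the scheduler should watch.
	rpKubeconfigs := []string{"/var/run/arktos/rp1.kubeconfig", "/var/run/arktos/rp2.kubeconfig"}

	stop := make(chan struct{})

	for _, path := range rpKubeconfigs {
		path := path // capture the loop variable for the handler closure
		cfg, err := clientcmd.BuildConfigFromFlags("", path)
		if err != nil {
			panic(err)
		}
		client := kubernetes.NewForConfigOrDie(cfg)

		// A dedicated informer factory per RP keeps list/watch traffic to each
		// resource manager on its own client.
		factory := informers.NewSharedInformerFactory(client, 30*time.Second)
		factory.Core().V1().Nodes().Informer().AddEventHandler(cache.ResourceEventHandlerFuncs{
			AddFunc: func(obj interface{}) {
				node := obj.(*v1.Node)
				fmt.Printf("node %s observed via %s\n", node.Name, path)
			},
		})
		factory.Start(stop)
	}

	select {} // a real scheduler would run its scheduling loop here instead of blocking
}
```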
# Non Goals
1. API gateway
# Current status (release 0.7 - 2021.2.6)
## Performance test status
1. 10Kx2 cluster: 1 resource partition supporting 20K hosts; 2 tenant partitions, each supporting 300K pods
- Density test passed
1. 10Kx1 cluster: 1 resource partition supporting 10K hosts; 1 tenant partition supporting 300K pods
- Density test passed
- Load test completed (with known failures)
1. Single cluster
- 8K cluster passed density test, load test completed (with known failures)
- 10K cluster density test completed with etcd "too many requests" errors; load test completed (with known failures)
## Design & Development status
1. Code change for 2TPx1RP mostly completed and merged into master (v0.7.0)
1. Enable SSL in performance test - WIP (Yunwen)
1. Use insecure mode in local cluster setting for POC (Agreed on 3/1/2021)
1. Kubelet
- [x] Use a dedicated kube-client to talk to the resource manager.
- [x] Use multiple kube-clients to connect to multiple tenant partitions.
- [x] Track the mapping between tenant ID and kube-clients (see the sketch at the end of this section).
- [ ] Use the right kube-client to do CRUD for all objects (To verify)
1. Controllers
- Node controllers (in resource partition)
- [x] Use a dedicated kube-client to talk to the resource manager.
- [x] Use multiple kube-clients to talk to multiple tenant partitions.
- Other controllers (in tenant partition)
- [x] If the controller lists/watches node objects, it needs to use multiple kube-clients to access multiple resource managers.
- DaemonSet controller (Service/PV/AttachDetach/Garbage)
- [ ] Move TTL/DaemonSet controllers to RP
- [x] Disable in TP, enable in RP
- Identify resources that belong to RP only
- Further perf and scalability improvements (TBD, currently a non-goal)
- [ ] Decide whether to cache all node objects in a single process or partition them.
1. Scheduler
- [x] Use a dedicated kube-client to talk to its tenant partition.
- [x] Use multiple kube-clients to connect to multiple resource managers, list/watching nodes from all resource managers.
- [?] Use the right kube-client to update node objects.
- Further perf and scalability improvements (TBD)
- [ ] Improve scheduling algorithm to reduce the possibility of scheduling conflicts.
- [ ] Improve scheduler sampling algorithm to reduce scheduling time.
1. API server - TBD
- No areas requiring change have been identified so far
1. Proxy
- Working on a design that will evaluate proxy vs. code change in each component (TBD)
1. Performance test tools
- Cluster loader
- [ ] How to talk to node in perf test (Hongwei)
- Kubemark
- [x] Support 2 TP scale out cluster set up, insecure (0.7)
- [ ] Support 2 TP scale out cluster set up, secure mode
- [ ] Support 2 TP, 2 RP scale out cluster set up, secure mode
- Kube-up
- [ ] Support for scale-out (currently only kubemark supports scale-out)
1. Performance test
- [ ] Single RP capacity test (>= 25K, preparing for 25Kx2 goal)
- [ ] QPS optimization (x2, x3, x4, etc. in density test)
- Regular density tests for 10K single cluster and 10Kx2; each will be done after the 500-node test
- [ ] 2TP (10K), 1RP (20K), 20K density test, secure mode
1. Dev tools
- [x] One box setup script for 2 TP, 1 RP (Peng, Ying)
- [x] One box setup script for 2 TP, 2 RP (Ying)
1. 1.18 Changes
- Complete golang 1.13.9 migration (Sonya)
- [x] https://github.com/CentaurusInfra/arktos/issues/923
- Metrics platform migration (YingH)
- [x] Migrated from metrics server to Prometheus
- [ ] Get correct API responsiveness data
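
As a rough illustration of the kubelet and controller client changes tracked above, the sketch below shows one possible shape of the tenant-ID-to-kube-client mapping: a dedicated client for the resource manager plus one client per tenant partition, with a lookup used to pick the right client for CRUD. The type and function names (tenantClients, clientFor) are hypothetical, not the actual Arktos implementation.

```go
// Hypothetical sketch of a tenant-ID -> kube-client mapping; not Arktos source code.
package tenantclients

import (
	"fmt"

	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
)

// tenantClients holds a dedicated client for the resource manager (resource
// partition) and one client per tenant partition, keyed by tenant ID.
type tenantClients struct {
	resourceManager kubernetes.Interface
	byTenant        map[string]kubernetes.Interface
}

// newTenantClients builds the clients from kubeconfig paths; the RP path and the
// tenant-ID -> kubeconfig map are assumed inputs for illustration.
func newTenantClients(rpKubeconfig string, tpKubeconfigs map[string]string) (*tenantClients, error) {
	rpCfg, err := clientcmd.BuildConfigFromFlags("", rpKubeconfig)
	if err != nil {
		return nil, err
	}
	tc := &tenantClients{
		resourceManager: kubernetes.NewForConfigOrDie(rpCfg),
		byTenant:        map[string]kubernetes.Interface{},
	}
	for tenant, path := range tpKubeconfigs {
		cfg, err := clientcmd.BuildConfigFromFlags("", path)
		if err != nil {
			return nil, err
		}
		tc.byTenant[tenant] = kubernetes.NewForConfigOrDie(cfg)
	}
	return tc, nil
}

// clientFor returns the kube-client for the tenant partition that owns tenantID,
// i.e. the client a kubelet would use for CRUD on that tenant's objects.
func (tc *tenantClients) clientFor(tenantID string) (kubernetes.Interface, error) {
	c, ok := tc.byTenant[tenantID]
	if !ok {
		return nil, fmt.Errorf("no tenant partition client for tenant %q", tenantID)
	}
	return c, nil
}
```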
## Current Work in Progress (5/21):
1. Back porting 1.18 scheduler and related changes to master
1. Back porting scheduler redo - done
1. Apply VM/CommonInfo/Action changes to back porting branch - WIP (Yunwen)
1. UT pass for code merge - currently all UTs passing on the merge branch
1. Integration test - WIP (Hongwei)
1. E2E test pass for code merge
1. Perf test
1. 50K redo (Currently 6s. Waiting for rerun in master after code merge)
1. 1TP/1RP limit test in SSL mode
1. 15K density test - done 4/6, QPS 20, pod start up latency p50 1.8s, p99 4.4s
1. 20K density test - TODO
1. Issue tracker
1. Kubelet failed to upload events due to authorization error - Yunwen [Issue 1046](https://github.com/CentaurusInfra/arktos/issues/1046)
1. KCM (deployment controller) on TP cluster failed to sync up deployment with its token - Yunwen master [Issue 1039](https://github.com/CentaurusInfra/arktos/issues/1039)
1. KCM on TP cluster didn't get nodes in RP cluster(s) - Yunwen master [PR 1040](https://github.com/CentaurusInfra/arktos/pull/1040) [Issue 1038](https://github.com/CentaurusInfra/arktos/issues/1038)
1. Failed to change ttl annotation for hollow-node - Yunwen [Issue 1054](https://github.com/CentaurusInfra/arktos/issues/1054)
1. TP2: Unable to authenticate the request due to an error: invalid bearer token - Yunwen [Issue 1055](https://github.com/CentaurusInfra/arktos/issues/1055)
## Completed Tasks
1. Multiple resource partition design - decided to continue with multiple-client-connection changes in all components for multiple RPs for now. Will re-design if issues are encountered with the current approach. (2/17)
1. Setup local cluster for multiple TPs, RPs (Done - 2/24)
1. Script/manual for 2TP&1RP cluster set up with 3 hosts - insecure mode (2/19) [PR 994](https://github.com/CentaurusInfra/arktos/pull/994)
1. Local dev environment: SSL enabled for the scheduler in TP to connect to RP directly (2/24) [PR 1003](https://github.com/CentaurusInfra/arktos/pull/1003)
1. Component code changes
1. TP components connect to RP directly (Done - 3/15)
1. Scheduler connects to RP directly via separate clients (2/23) [PR 991](https://github.com/CentaurusInfra/arktos/pull/991)
1. KCM connects to RP directly via a separate client (3/10) [PR 1015](https://github.com/CentaurusInfra/arktos/pull/1015)
1. Garbage collector supports multiple RPs (3/15) [PR 1025](https://github.com/CentaurusInfra/arktos/pull/1015)
1. RP components connect to TP directly (Done - 3/12)
1. Nodelifecycle controller connects to TP directly via separate clients (3/9) [PR 1011](https://github.com/CentaurusInfra/arktos/pull/1011)
1. Kubelet connects to TP directly via kubeconfig (3/12) [PR 1021](https://github.com/CentaurusInfra/arktos/pull/1021)
1. Disable/Enable controllers in TP/RP
1. Move TTL/DaemonSet controller from TP KCM to RP KCM (3/10) [PR 1015](https://github.com/CentaurusInfra/arktos/pull/1015)
1. Enable service account/token controller in RP KCM local (3/15) [PR 1028](https://github.com/CentaurusInfra/arktos/pull/1028)
1. Scheduler backporting
1. Support multiple RPs in kube-up (Done)
1. Script changes to bring up and cleanup multiple RPs (2/23)
1. Merge kube-up/kubemark code from master to POC (3/15) [PR 1024](https://github.com/CentaurusInfra/arktos/pull/1024)
1. Move DaemonSet/TTL controller etc. to RP KCM (3/16) [PR 1031](https://github.com/CentaurusInfra/arktos/pull/1031)
1. Multiple RPs work in kube-up/kubemark (3/22)
1. Enable SSL in performance test - master
1. Code change (3/12) [PR 1001](https://github.com/CentaurusInfra/arktos/pull/1001)
1. 1TP/1RP 500 nodes perf test (3/12)
1. 2TP/1RP 500 nodes (3/30)
1. 1TP/1RP 15K nodes (4/6)
1. Perf test code changes (Done)
1. Perf test changes needed for multiple RPs (3/18)
1. Disable DaemonSet test in load (3/25) [PR 1050](https://github.com/CentaurusInfra/arktos/pull/1050)
1. Performance test (WIP)
1. Test single RP limit
1. 1TP/1RP achieved 40K hollow nodes (3/3). RP CPU ~44%
1. 15K density test in SSL mode - passed on 4/6 (QPS 20)
1. Get more resources in GCP (80K US central 3/8)
1. 10K density test insecure mode - benchmark (3/18)
1. Multiple TPs/RPs density test
1. 2TP/2RP 2x500 passed (3/27)
1. 2TP/2RP 2x5K density test (3/30)
1. Scale up 500 density test (3/30)
1. 2TP/2RP 2x10K density test (3/31)
1. 2TP/2RP 2x10K density test with double RS QPS (4/1 - density test passed but with high saturation pod start up latency)
1. Scheduler back porting perf test
- 1TP1RP 10K high QPS test 4/23
- Scheduler throughput can almost reach 200
- 1TP1RP 10K RS QPS 200 4/23
- Similar pod start up latency. p50 4.9s, p90 12s, p99 16s. scheduler throughput p50 204, max 588
- Scheduler permission issue and logging - fixed 4/27
- 4TP/2RP total 30K nodes, RS QPS 100 4/27
- No failures except pod start up latency being too high (community benchmark: 5s at the 10K level)
- Pod start up latency maximum out of 4 TPs: p50 3.4s, p90 6.9s, p99 10.8s
- Scheduling throughput p50 100, max 224 ~ 294
- 5TP/5RP total 50K nodes, RS QPS 100 4/28
- No failures except pod start up latency being too high
- Pod start up latency maximum out of 5 TPs: p50 2.1s, p90 5.6s, p99 10~13s
- Scheduling throughput p50 101 ~ 103, max 177 ~ 193
1. QPS tuning (see the client QPS sketch at the end of this section)
1. Increase GC controller QPS (3/18)
1. 20->40 [PR 1034](https://github.com/CentaurusInfra/arktos/pull/1034) 10K density test 14 hours reduced to 9.6 hours
1. Increase replicaset controller QPS
1. 20->40 [PR 1034](https://github.com/CentaurusInfra/arktos/pull/1034)
1. High saturation pod start up latency, scheduler throughput 31 (4/1, 2TPx2RP 2X10K)
1. Check scheduler QPS distribution (4/5 - all QPS consumed by pod binding)
1. Check Vinay's high QPS log to find out how many schedulers were running and whether they were changing leader frequently. (4/5 - no leader changes)
1. Check global scheduler team optimization changes - 4/6
1. Confirm community 1.18 high-QPS 10K throughput - 4/13
1. Backport Arktos code to the 1.18 scheduler - 4/22
1. 1x10K scheduler back porting with QPS 200 - 4/23
1. Failed with pod start up latency tp50 4.9s, tp90 12s, tp99 16s; scheduling throughput tp50 204, tp90 311, tp99 558, max 588
1. Complete golang 1.13.9 migration (Done - 3/12)
1. Kube-openapi upgrade [Issue 923](https://github.com/CentaurusInfra/arktos/issues/923)
1. Add and verify import-alias (2/10) [PR 965](https://github.com/CentaurusInfra/arktos/pull/965)
1. Add hack/arktos_cherrypick.sh (2/19) [PR 990](https://github.com/CentaurusInfra/arktos/pull/990)
1. Promote admission webhook API to v1. Arktos only supports v1beta1 now (2/20) [PR 981](https://github.com/CentaurusInfra/arktos/pull/981)
1. Promote AdmissionReview to v1. Arktos only supports v1beta1 now (2/25) [PR 998](https://github.com/CentaurusInfra/arktos/pull/998)
1. Promote CRD to v1 - (3/3) [PR 1004](https://github.com/CentaurusInfra/arktos/pull/1004)
1. Bump kube-openapi to 20200410 version and SMD to V3 (3/12) [PR 1010](https://github.com/CentaurusInfra/arktos/pull/1010)
1. Regression fix
1. Failed to collect profiling of ETCD (3/11) [Issue 1008](https://github.com/CentaurusInfra/arktos/issues/1008) [PR 1009](https://github.com/CentaurusInfra/arktos/pull/1019)
1. Static pods being recycled on TP cluster [Issue 1006](https://github.com/CentaurusInfra/arktos/issues/1006) (Yunwen/Verifying)
1. ETCD object counts issue in 3/10 run (3/16) [PR 1027](https://github.com/CentaurusInfra/arktos/pull/1027) [Issue 1023](https://github.com/CentaurusInfra/arktos/issues/1023)
1. haproxy ssl check causes api server "TLS handshake error" (3/31) [PR 1060](https://github.com/CentaurusInfra/arktos/pull/1060) [Issue 1048](https://github.com/CentaurusInfra/arktos/issues/1048)
1. RP server failed to collect pprof files - (4/5) [PR 1058](https://github.com/CentaurusInfra/arktos/pull/1058) [Issue 1057](https://github.com/CentaurusInfra/arktos/issues/1057)
1. Change scheduler PVC binder code to support multiple RPs - (3/31) [PR 1063](https://github.com/CentaurusInfra/arktos/pull/1063) [Issue 1059](https://github.com/CentaurusInfra/arktos/issues/1059)
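
The QPS tuning entries above raised the GC and ReplicaSet controller QPS from 20 to 40; the actual change is in PR 1034. As a generic, hypothetical illustration only (not the Arktos change itself), client-side QPS and burst in client-go are set on rest.Config before a clientset is built:

```go
// Generic illustration of client-side QPS/burst tuning with client-go;
// the real Arktos change for the controllers lives in PR 1034.
package qpstuning

import (
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
)

// newTunedClient builds a clientset throttled at the given steady-state QPS,
// allowing short bursts above that rate (e.g. raising a controller from 20 to 40
// as in the tuning tracked above).
func newTunedClient(kubeconfig string, qps float32) (kubernetes.Interface, error) {
	cfg, err := clientcmd.BuildConfigFromFlags("", kubeconfig)
	if err != nil {
		return nil, err
	}
	cfg.QPS = qps            // requests per second allowed by the client-side rate limiter
	cfg.Burst = int(qps) * 2 // burst capacity above the steady rate
	return kubernetes.NewForConfig(cfg)
}
```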
## Tasks on hold
1. Metrics platform migration
1. Regression fix
1. 500-node load run finished with error: DaemonSets timeout [Issue 1007](https://github.com/CentaurusInfra/arktos/issues/1008)
1. System partition pod - how to handle when HA proxy is removed (TBD)
1. Density test should be OK
1. Check node authorizer in secure mode
1. Kubeup/Kubemark improvement
1. Start proxy at the end (Yunwen)
1. TP/RP start concurrently (Hongwei)
1. Benchmark on cluster throughput, pod start up latency for 1TP/1RP, and 50K cluster
1. Issues
1. GC controller queries its own master nodes' lease info and causes a 404 error in haproxy [Issue 1047](https://github.com/CentaurusInfra/arktos/issues/1047) - appears to be in master only. Fixed in POC. Parking the issue until the POC changes are ported back to master.
1. [Scale out POC] Pod reported as bound successfully by the scheduler but does not appear locally [Issue 1049](https://github.com/CentaurusInfra/arktos/issues/1049) - related to system tenant design. Post 430
1. [Scale out POC] secret not found in kubelet [Issue 1052](https://github.com/CentaurusInfra/arktos/issues/1052) - related to system tenant design. Post 430
1. Tenant zeta request was not redirected to TP2 master correctly [Issue 1056](https://github.com/CentaurusInfra/arktos/issues/1056) - current proxy limitation
## Issue solved in POC - pending in master
1. Static pods being recycled on TP cluster (fixed in POC) [PR 1044](https://github.com/CentaurusInfra/arktos/pull/1044) [Issue 1006](https://github.com/CentaurusInfra/arktos/issues/1006)
1. Controllers on TP should union the nodes from RP cluster and local cluster - fixed in POC [PR 1044](https://github.com/CentaurusInfra/arktos/pull/1044) [Issue 1042](https://github.com/CentaurusInfra/arktos/issues/1042)