
# Arktos Scalability 430 2021 Tracks

# Goals

1. Multiple resource partitions for pod scheduling (2+ for 430) - primary goal
   - A tenant can have pods physically located in 2 different RPs upon scheduling
   - The scheduler in a tenant partition should be able to listen to the API servers belonging to each RP (see the sketch after this list)
   - Performance test for 50K hollow nodes (2 TP, 2~3 RP)
   - Performance test runs with SSL enabled
   - QPS optimization
1. Daemon set handling in RP - non-primary goal
   - Remove from TP
   - Support daemon sets in RP
   - Load test
1. Dynamic add/delete TP/RP design - TBD
   - For design purposes only, not for implementation; the intent is to avoid hardcoding the partition topology in many places
   - Quality bar only
   - Dynamically discover new tenant partitions based on CRD objects in the resource manager
1. System partition pod handling - TBD
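The multi-RP scheduling goal above boils down to one component holding several kube-clients at once. Below is a minimal sketch, not actual Arktos code: it builds one client-go clientset per RP kubeconfig and list/watches nodes from each RP's API server. The kubeconfig paths and the print-only event handler are hypothetical.

```go
package main

import (
	"fmt"

	v1 "k8s.io/api/core/v1"
	"k8s.io/client-go/informers"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/cache"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	// One kubeconfig per resource partition (RP) API server; paths are illustrative.
	rpKubeconfigs := []string{"/etc/arktos/rp1.kubeconfig", "/etc/arktos/rp2.kubeconfig"}

	stopCh := make(chan struct{})

	for i, path := range rpKubeconfigs {
		cfg, err := clientcmd.BuildConfigFromFlags("", path)
		if err != nil {
			panic(err)
		}
		// A separate clientset and informer factory per RP, so a tenant
		// partition's scheduler sees nodes from every RP at the same time.
		client := kubernetes.NewForConfigOrDie(cfg)
		factory := informers.NewSharedInformerFactory(client, 0)

		rp := i // capture the loop variable for the closure below
		factory.Core().V1().Nodes().Informer().AddEventHandler(cache.ResourceEventHandlerFuncs{
			AddFunc: func(obj interface{}) {
				node := obj.(*v1.Node)
				fmt.Printf("observed node %s from RP %d\n", node.Name, rp)
			},
		})
		factory.Start(stopCh)
	}

	select {} // a real scheduler would wire this into its run loop instead
}
```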
# Non Goals

1. API gateway

# Current status (release 0.7 - 2021.2.6)

## Performance test status

1. 10Kx2 cluster: 1 resource partition supporting 20K hosts; 2 tenant partitions, each supporting 300K pods
   - Density test passed
1. 10Kx1 cluster: 1 resource partition supporting 10K hosts; 1 tenant partition supporting 300K pods
   - Density test passed
   - Load test completed (with known failures)
1. Single cluster
   - 8K cluster passed the density test; load test completed (with known failures)
   - 10K cluster density test completed with etcd "too many requests" errors; load test completed (with known failures)

## Design & Development status

1. Code change for 2TPx1RP mostly completed and merged into master (v0.7.0)
1. Enable SSL in performance test - WIP (Yunwen)
1. Use insecure mode in local cluster setting for POC (agreed on 3/1/2021)
1. Kubelet
   - [x] Use a dedicated kube-client to talk to the resource manager.
   - [x] Use multiple kube-clients to connect to multiple tenant partitions.
   - [x] Track the mapping between tenant ID and kube-clients (see the sketch at the end of this section).
   - [ ] Use the right kube-client to do CRUD for all objects (to verify)
1. Controllers
   - Node controllers (in resource partition)
     - [x] Use a dedicated kube-client to talk to the resource manager.
     - [x] Use multiple kube-clients to talk to multiple tenant partitions.
   - Other controllers (in tenant partition)
     - [x] If the controller list/watches node objects, it needs to use multiple kube-clients to access multiple resource managers.
   - DaemonSet controller (Service/PV/AttachDetach/Garbage)
     - [ ] Move TTL/DaemonSet controllers to RP
     - [x] Disable in TP, enable in RP
     - Identify resources that belong to RP only
   - Further perf and scalability improvements (TBD, currently a non-goal)
     - [ ] Partition the node cache, or avoid caching all node objects in a single process.
1. Scheduler
   - [x] Use a dedicated kube-client to talk to its tenant partition.
   - [x] Use multiple kube-clients to connect to multiple resource managers, list/watching nodes from all resource managers.
   - [?] Use the right kube-client to update node objects.
   - Further perf and scalability improvements (TBD)
     - [ ] Improve the scheduling algorithm to reduce the possibility of scheduling conflicts.
     - [ ] Improve the scheduler sampling algorithm to reduce scheduling time.
1. API server - TBD
   - Currently haven't identified areas that need to be changed
1. Proxy
   - Working on a design that will evaluate proxy vs. code changes in each component (TBD)
1. Performance test tools
   - Cluster loader
     - [ ] How to talk to nodes in perf tests (Hongwei)
   - Kubemark
     - [x] Support 2-TP scale-out cluster setup, insecure (0.7)
     - [ ] Support 2-TP scale-out cluster setup, secure mode
     - [ ] Support 2-TP, 2-RP scale-out cluster setup, secure mode
   - Kube-up
     - [ ] Support for scale out (currently only kubemark supports scale out)
1. Performance test
   - [ ] Single RP capacity test (>= 25K, preparing for the 25Kx2 goal)
   - [ ] QPS optimization (x2, x3, x4, etc. in density tests)
   - Regular density tests for the 10K single cluster and 10Kx2; each will be done after the 500-node test
   - [ ] 2TP (10K), 1RP (20K), 20K density test, secure mode
1. Dev tools
   - [x] One-box setup script for 2 TP, 1 RP (Peng, Ying)
   - [x] One-box setup script for 2 TP, 2 RP (Ying)
1. 1.18 Changes
   - Complete golang 1.13.9 migration (Sonya)
     - [x] https://github.com/CentaurusInfra/arktos/issues/923
   - Metrics platform migration (YingH)
     - [x] Migrated from metrics server to Prometheus
     - [ ] Get correct API responsiveness data
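The kubelet checklist above hinges on routing each CRUD call through the kube-client that owns the object. A minimal sketch of that tenant-ID-to-client tracking, with hypothetical type and method names rather than the actual Arktos kubelet structures:

```go
package clientmap

import (
	"fmt"

	"k8s.io/client-go/kubernetes"
)

// clientManager keeps one dedicated client for the resource partition (RP)
// and one client per tenant partition (TP), keyed by tenant ID.
type clientManager struct {
	rpClient  kubernetes.Interface
	tpClients map[string]kubernetes.Interface // tenant ID -> TP client
}

// clientForTenant picks the kube-client that owns a tenant's objects, so
// CRUD calls are sent to the right TP API server.
func (m *clientManager) clientForTenant(tenantID string) (kubernetes.Interface, error) {
	c, ok := m.tpClients[tenantID]
	if !ok {
		return nil, fmt.Errorf("no tenant partition client for tenant %q", tenantID)
	}
	return c, nil
}

// clientForNodes returns the dedicated client used for node (RP) objects.
func (m *clientManager) clientForNodes() kubernetes.Interface {
	return m.rpClient
}
```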
## Current Work in Progress (5/21)

1. Back porting the 1.18 scheduler and related changes to master
   1. Back porting scheduler redo - done
   1. Apply VM/CommonInfo/Action changes to the back porting branch - WIP (Yunwen)
   1. UT pass code merge - currently all UTs passing on the merging branch
   1. Integration test - WIP (Hongwei)
   1. E2E test pass code merge
1. Perf test
   1. 50K redo (currently 6s; waiting for a rerun in master after the code merge)
   1. 1TP/1RP limit test in SSL mode
   1. 15K density test - done 4/6, QPS 20, pod start-up latency p50 1.8s, p99 4.4s
   1. 20K density test - TODO
1. Issue tracker
   1. Kubelet failed to upload events due to an authorization error - Yunwen [Issue 1046](https://github.com/CentaurusInfra/arktos/issues/1046)
   1. KCM (deployment controller) on the TP cluster failed to sync up a deployment with its token - Yunwen, master [Issue 1039](https://github.com/CentaurusInfra/arktos/issues/1039)
   1. KCM on the TP cluster didn't get nodes in the RP cluster(s) - Yunwen, master [PR 1040](https://github.com/CentaurusInfra/arktos/pull/1040) [Issue 1038](https://github.com/CentaurusInfra/arktos/issues/1038)
   1. Failed to change the TTL annotation for hollow-node - Yunwen [Issue 1054](https://github.com/CentaurusInfra/arktos/issues/1054)
   1. TP2: unable to authenticate the request due to an error: invalid bearer token - Yunwen [Issue 1055](https://github.com/CentaurusInfra/arktos/issues/1055)

## Completed Tasks

1. Multiple resource partition design - decided to continue with multiple client connections in all components for multiple RPs for now; will redesign if the current approach runs into issues (2/17)
1. Set up local clusters for multiple TPs/RPs (done - 2/24)
   1. Script/manual for 2TP & 1RP cluster setup with 3 hosts, insecure mode (2/19) [PR 994](https://github.com/CentaurusInfra/arktos/pull/994)
   1. Local dev environment: SSL enabled for the scheduler in a TP connecting to an RP directly (2/24) [PR 1003](https://github.com/CentaurusInfra/arktos/pull/1003)
1. Component code changes
   1. TP components connect to RP directly (done - 3/15)
      1. Scheduler connects to RP directly via separate clients (2/23) [PR 991](https://github.com/CentaurusInfra/arktos/pull/991)
      1. KCM connects to RP directly via a separate client (3/10) [PR 1015](https://github.com/CentaurusInfra/arktos/pull/1015)
      1. Garbage collector supports multiple RPs (3/15) [PR 1025](https://github.com/CentaurusInfra/arktos/pull/1015)
   1. RP components connect to TP directly (done - 3/12)
      1. Nodelifecycle controller connects to TP directly via separate clients (3/9) [PR 1011](https://github.com/CentaurusInfra/arktos/pull/1011)
      1. Kubelet connects to TP directly via kubeconfig (3/12) [PR 1021](https://github.com/CentaurusInfra/arktos/pull/1021)
   1. Disable/enable controllers in TP/RP
      1. Move TTL/DaemonSet controllers from TP KCM to RP KCM (3/10) [PR 1015](https://github.com/CentaurusInfra/arktos/pull/1015)
      1. Enable the service account/token controller in RP KCM locally (3/15) [PR 1028](https://github.com/CentaurusInfra/arktos/pull/1028)
   1. Scheduler backporting
1. Support multiple RPs in kube-up (done)
   1. Script changes to bring up and clean up multiple RPs (2/23)
   1. Merge kube-up/kubemark code from master to POC (3/15) [PR 1024](https://github.com/CentaurusInfra/arktos/pull/1024)
   1. Move DaemonSet/TTL controllers etc. to RP KCM (3/16) [PR 1031](https://github.com/CentaurusInfra/arktos/pull/1031)
   1. Multiple RPs work in kube-up/kubemark (3/22)
1. Enable SSL in performance test - master
   1. Code change (3/12) [PR 1001](https://github.com/CentaurusInfra/arktos/pull/1001)
   1. 1TP/1RP 500-node perf test (3/12)
   1. 2TP/1RP 500 nodes (3/30)
   1. 1TP/1RP 15K nodes (4/6)
1. Perf test code changes (done)
   1. Perf test changes needed for multiple RPs (3/18)
   1. Disable the DaemonSet test in load (3/25) [PR 1050](https://github.com/CentaurusInfra/arktos/pull/1050)
1. Performance test (WIP)
   1. Test single-RP limit
      1. 1TP/1RP achieved 40K hollow nodes (3/3); RP CPU ~44%
      1. 15K density test in SSL mode - passed on 4/6 (QPS 20)
      1. Get more resources in GCP (80K, US central, 3/8)
      1. 10K density test in insecure mode - benchmark (3/18)
   1. Multiple-TPs/RPs density tests
      1. 2TP/2RP 2x500 passed (3/27)
      1. 2TP/2RP 2x5K density test (3/30)
      1. Scale-up 500 density test (3/30)
      1. 2TP/2RP 2x10K density test (3/31)
      1. 2TP/2RP 2x10K density test with double RS QPS (4/1 - density test passed but with high saturation pod start-up latency)
   1. Scheduler back porting perf test
      - 1TP1RP 10K high-QPS test, 4/23 - scheduler throughput can almost reach 200
      - 1TP1RP 10K, RS QPS 200, 4/23 - similar pod start-up latency: p50 4.9s, p90 12s, p99 16s; scheduler throughput p50 204, max 588
      - Scheduler permission issue and logging - fixed 4/27
      - 4TP/2RP, 30K nodes total, RS QPS 100, 4/27
        - No failure except pod start-up latency being too high (community benchmark is 5s at the 10K level)
        - Pod start-up latency, maximum across the 4 TPs: p50 3.4s, p90 6.9s, p99 10.8s
        - Scheduling throughput: p50 100, max 224~294
      - 5TP/5RP, 50K nodes total, RS QPS 100, 4/28
        - No failure except pod start-up latency being too high
        - Pod start-up latency, maximum across the 5 TPs: p50 2.1s, p90 5.6s, p99 10~13s
        - Scheduling throughput: p50 101~103, max 177~193
1. QPS tuning (see the client QPS sketch after this list)
   1. Increase GC controller QPS (3/18)
      - 20->40 [PR 1034](https://github.com/CentaurusInfra/arktos/pull/1034) - 10K density test reduced from 14 hours to 9.6 hours
   1. Increase replicaset controller QPS
      - 20->40 [PR 1034](https://github.com/CentaurusInfra/arktos/pull/1034)
   1. High saturation pod start-up latency, scheduler throughput 31 (4/1, 2TPx2RP 2x10K)
      1. Check scheduler QPS distribution (4/5 - all used for pod binding)
      1. Check Vinay's high-QPS log to find out how many schedulers were running and whether they were changing leader frequently (4/5 - no leader changes)
      1. Check the global scheduler team's optimization changes - 4/6
      1. Confirm community 1.18 high-QPS 10K throughput - 4/13
      1. Back port Arktos code to the 1.18 scheduler - 4/22
      1. 1x10K scheduler back porting with QPS 200 - 4/23
         - Failed with pod start-up latency p50 4.9s, p90 12s, p99 16s; scheduling throughput p50 204, p90 311, p99 558, max 588
1. Complete golang 1.13.9 migration (done - 3/12)
   1. Kube-openapi upgrade [Issue 923](https://github.com/CentaurusInfra/arktos/issues/923)
      1. Add and verify import-alias (2/10) [PR 965](https://github.com/CentaurusInfra/arktos/pull/965)
      1. Add hack/arktos_cherrypick.sh (2/19) [PR 990](https://github.com/CentaurusInfra/arktos/pull/990)
      1. Promote the admission webhook API to v1; Arktos only supports v1beta1 now (2/20) [PR 981](https://github.com/CentaurusInfra/arktos/pull/981)
      1. Promote AdmissionReview to v1; Arktos only supports v1beta1 now (2/25) [PR 998](https://github.com/CentaurusInfra/arktos/pull/998)
      1. Promote CRD to v1 (3/3) [PR 1004](https://github.com/CentaurusInfra/arktos/pull/1004)
      1. Bump kube-openapi to the 20200410 version and SMD to v3 (3/12) [PR 1010](https://github.com/CentaurusInfra/arktos/pull/1010)
1. Regression fixes
   1. Failed to collect profiling of ETCD (3/11) [Issue 1008](https://github.com/CentaurusInfra/arktos/issues/1008) [PR 1009](https://github.com/CentaurusInfra/arktos/pull/1019)
   1. Static pods being recycled on the TP cluster [Issue 1006](https://github.com/CentaurusInfra/arktos/issues/1006) (Yunwen/verifying)
   1. ETCD object-count issue in the 3/10 run (3/16) [PR 1027](https://github.com/CentaurusInfra/arktos/pull/1027) [Issue 1023](https://github.com/CentaurusInfra/arktos/issues/1023)
   1. haproxy SSL check causes API server "TLS handshake error" (3/31) [PR 1060](https://github.com/CentaurusInfra/arktos/pull/1060) [Issue 1048](https://github.com/CentaurusInfra/arktos/issues/1048)
   1. RP server failed to collect pprof files (4/5) [PR 1058](https://github.com/CentaurusInfra/arktos/pull/1058) [Issue 1057](https://github.com/CentaurusInfra/arktos/issues/1057)
   1. Change scheduler PVC binder code to support multiple RPs (3/31) [PR 1063](https://github.com/CentaurusInfra/arktos/pull/1063) [Issue 1059](https://github.com/CentaurusInfra/arktos/issues/1059)
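For context on the QPS tuning entries above: client-go clients throttle themselves via the `QPS` and `Burst` fields on `rest.Config`, so raising a controller's request rate is a small configuration change. The sketch below is illustrative only; the kubeconfig path is hypothetical, and the actual 20->40 change was made in [PR 1034](https://github.com/CentaurusInfra/arktos/pull/1034).

```go
package main

import (
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	// Illustrative kubeconfig path, not an Arktos default.
	cfg, err := clientcmd.BuildConfigFromFlags("", "/etc/kubernetes/controller.kubeconfig")
	if err != nil {
		panic(err)
	}

	// client-go throttles requests client-side: QPS is the sustained request
	// rate and Burst the short-term ceiling. Doubling QPS (20 -> 40) is the
	// change that cut the 10K density test from 14 to 9.6 hours above.
	cfg.QPS = 40
	cfg.Burst = 80

	client := kubernetes.NewForConfigOrDie(cfg)
	_ = client // hand the tuned client to the controller in question
}
```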
## Tasks on hold

1. Metrics platform migration
1. Regression fix
   1. 500-node load run finished with an error: DaemonSets timeout [Issue 1007](https://github.com/CentaurusInfra/arktos/issues/1008)
1. System partition pods - how to handle them when the HA proxy is removed (TBD)
   1. Density test should be OK
   1. Check the node authorizer in secure mode
1. Kubeup/Kubemark improvements
   1. Start the proxy at the end (Yunwen)
   1. Start TP/RP concurrently (Hongwei)
1. Benchmark cluster throughput and pod start-up latency for 1TP/1RP and the 50K cluster
1. Issues
   1. GC controller queries its own master node's lease info and causes a 404 error in haproxy [Issue 1047](https://github.com/CentaurusInfra/arktos/issues/1047) - appears to be in master only; fixed in POC. Parking this issue until the POC changes are ported back to master.
   1. [Scale out POC] Pod reported as bound by the scheduler but does not appear locally [Issue 1049](https://github.com/CentaurusInfra/arktos/issues/1049) - related to system tenant design; post-430
   1. [Scale out POC] Secret not found in kubelet [Issue 1052](https://github.com/CentaurusInfra/arktos/issues/1052) - related to system tenant design; post-430
   1. Tenant zeta requests were not redirected to the TP2 master correctly [Issue 1056](https://github.com/CentaurusInfra/arktos/issues/1056) - current proxy limitation
## Issues solved in POC - pending in master

1. Static pods being recycled on the TP cluster (fixed in POC) [PR 1044](https://github.com/CentaurusInfra/arktos/pull/1044) [Issue 1006](https://github.com/CentaurusInfra/arktos/issues/1006)
1. Controllers on a TP should union the nodes from the RP cluster(s) and the local cluster - fixed in POC (see the sketch below) [PR 1044](https://github.com/CentaurusInfra/arktos/pull/1044) [Issue 1042](https://github.com/CentaurusInfra/arktos/issues/1042)
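A minimal sketch of the node-union behavior Issue 1042 calls for, assuming a hypothetical `unionNodes` helper (the real fix is in PR 1044): a TP-side controller lists nodes from every RP client plus its local client and de-duplicates them by name.

```go
package nodeunion

import (
	"context"

	v1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
)

// unionNodes lists nodes from each client (the RP clients plus the local
// cluster's client) and de-duplicates them by node name, so a TP-side
// controller sees RP-hosted and locally registered nodes alike.
func unionNodes(ctx context.Context, clients []kubernetes.Interface) ([]v1.Node, error) {
	seen := map[string]bool{}
	var out []v1.Node
	for _, c := range clients {
		list, err := c.CoreV1().Nodes().List(ctx, metav1.ListOptions{})
		if err != nil {
			return nil, err
		}
		for _, n := range list.Items {
			if !seen[n.Name] {
				seen[n.Name] = true
				out = append(out, n)
			}
		}
	}
	return out, nil
}
```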