This repository has been archived by the owner on Jan 9, 2025. It is now read-only.

Tips and Tricks

Here are a few things that have been gotchas over the past few months. Most of them revolve around Docker-in-Docker. Nested virtualization is really hard and has a lot of design tradeoffs. Each layer adds overhead and more places where things can go wrong.

MTU size is important! This solution has several layers of virtualization, and that adds a lot of complexity. I used an MTU of 1450 for the runner pod and 1400 for any Docker container inside the runner, but your network architects may have better guidance for you. In theory, this shouldn't be as much of a problem as it is, since containers should have some level of Path MTU Discovery, but I've found that it generates all manner of random failures when discovery isn't working exactly as expected. It seems simpler to me to be explicit about sizing here, and doing so solved a lot of network gremlins for me.
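As a rough illustration of being explicit about sizing, here is a minimal Python sketch that checks that each nested layer's MTU fits inside the layer around it, then renders a `daemon.json` fragment using Docker's `mtu` daemon option (which applies to the default bridge network). The 1450 and 1400 values come from the text above; the layer names and helper functions are my own assumptions for illustration, not part of this project.

```python
import json

# MTU values taken from the tip above; layer names are illustrative only.
RUNNER_POD_MTU = 1450
INNER_DOCKER_MTU = 1400


def check_nested_mtus(layers):
    """Return a list of problems; an empty list means every inner layer fits.

    `layers` is a list of (name, mtu) pairs ordered outermost first.
    """
    problems = []
    for (outer_name, outer_mtu), (inner_name, inner_mtu) in zip(layers, layers[1:]):
        if inner_mtu > outer_mtu:
            problems.append(
                f"{inner_name} MTU {inner_mtu} exceeds {outer_name} MTU {outer_mtu}"
            )
    return problems


def docker_daemon_config(mtu):
    """Render a daemon.json fragment that sets the default bridge MTU."""
    return json.dumps({"mtu": mtu}, indent=2)


layers = [
    ("runner pod", RUNNER_POD_MTU),
    ("inner Docker container", INNER_DOCKER_MTU),
]
problems = check_nested_mtus(layers)
print(problems or "MTU layering looks consistent")
print(docker_daemon_config(INNER_DOCKER_MTU))
```

This only catches an inner MTU that is larger than the one outside it. How much headroom each layer actually needs depends on your overlay network, so treat the numbers as a starting point.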

Over-provision (promise more resources than you have available) at your own risk. This is a rabbit hole with no end, so read and understand the guidance offered by your base hypervisor platform (e.g., vSphere, Hyper-V, etc.) first. Then look at potentially over-provisioning within Kubernetes. The Kubernetes scheduler is only as smart as what it can see, and it can't see into the nested solution above it, if that makes sense. I settled on a ratio of about 2 max-usage pods to 1 worker node, and so far that seems acceptable. The quick little tasks that don't require a lot of compute can go anywhere, the heavy-usage pods can be moved around on the nodes by the scheduler to optimize longer-running jobs, and the worker nodes can be moved around via vMotion as needed to make the best use of our physical compute. When this goes wrong, it can cause all manner of cryptic error messages, so I've made it a habit to always check resource utilization first.
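To make the arithmetic concrete, here is a small Python sketch of the two numbers I'd look at first: how many max-usage pods a cluster can take at the 2-to-1 ratio mentioned above, and how far a layer is over-committed. The node count and CPU figures in the example are hypothetical, and both helper functions are assumptions for illustration rather than anything Kubernetes or a hypervisor provides.

```python
def max_usage_pod_budget(worker_nodes, pods_per_node=2):
    """How many max-usage pods fit at the given pods-per-node ratio.

    The default of 2 is the ratio from the tip above; tune it for your cluster.
    """
    if worker_nodes < 0 or pods_per_node < 0:
        raise ValueError("counts must be non-negative")
    return worker_nodes * pods_per_node


def overcommit_ratio(promised, physical):
    """Ratio of promised resources to what physically exists.

    Anything above 1.0 means the layer is over-provisioned.
    """
    if physical <= 0:
        raise ValueError("physical capacity must be positive")
    return promised / physical


# Hypothetical example: 4 worker nodes, and 48 vCPUs promised to VMs
# on a host with 32 physical cores.
print(max_usage_pod_budget(4))
print(overcommit_ratio(48, 32))
```

Run this once per layer (hypervisor, then Kubernetes), since an acceptable ratio at one layer can still add up to trouble when the layers are stacked.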