-
I am just starting to set up a k3s cluster on Proxmox, and I noticed that the "k3s server" VM regularly spikes to 100% CPU together with very high disk read IO.

The host system:
The VMs:
Storage for all VMs is a ZFS mirror pool of two NVMe drives. I've run …

The "k3s server" VM (the one that spikes) idles at around 5-10% CPU. Whenever some activity happens on the control plane (for example a node joins, or a Helm chart is installed), CPU spikes to 100% and takes several minutes to come back down. At the same time, Proxmox reports disk read IO in the GBs, which seems excessive when my SQLite db is ~75 MB.

This is what I see on the Proxmox dashboard. Note that the first spike was the first time the node joined; I then deleted it and joined it again, which caused the second, much shorter spike. I'm not sure why the second spike was shorter, but the first kind is much more common, and I've experienced it quite a bit.

And here's a snippet from the log, starting at the time the node joined. I've cut off much of what follows; the SQL statements get increasingly slow until it eventually settles down again.
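In case it's useful for anyone looking at this, here is a rough way to see which process the reads are attributed to while a spike is happening (just a sketch; it assumes iotop and sysstat are installed inside the VM):

```bash
# Only show processes/threads that are actually doing IO (needs root)
sudo iotop -o

# Per-process read/write rates in 1-second samples (from sysstat)
pidstat -d 1

# Device-level view, to compare against what the Proxmox graph shows
iostat -x 1
```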
The SQLite db (if I found the right one):
My …
Here is a list of my running pods. As you can see, the cluster is tiny; it doesn't have any significant workloads yet because I'm only just setting it up:
The only other thing I can think of is that I set up a very similar configuration a few weeks ago, but instead of k3s I used Talos, and I saw the exact same behaviour. However, it was much harder to debug (and I had some other issues), so I wanted to try out k3s, where I can SSH in and at least debug things a bit better. So whatever is happening here is very likely some sort of issue that isn't specific to k3s.

I would appreciate it if anyone has suggestions as to what could be going on here or how I can debug it. From everything I can test, performance should not be an issue with this deployment. I don't know why this immense disk IO is happening, but it seems to originate from within k3s and to be related to kube API server activity.

Edit: Just remembered that in other discussions I saw the note to check for …
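For what it's worth, here is a sketch of how one could check whether the reads really are attributed to the k3s server process itself (the k3s unit name is from a default install; the /proc/pressure files need a reasonably recent kernel with PSI enabled):

```bash
# PID of the k3s server process (default systemd unit name)
PID=$(systemctl show -p MainPID --value k3s)

# Cumulative bytes read/written by that process; sample it twice
# during a spike and compare the read_bytes values
cat /proc/"$PID"/io

# Kernel pressure-stall information for IO (PSI, kernel >= 4.20)
cat /proc/pressure/io
```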
-
I can't say I've seen this before. Are you sure that this is coming from the datastore, or are you perhaps deploying some very large images, or something else that requires a lot of disk IO? If the database files are on the same disk as your workload, then a heavy workload will compete with the datastore for throughput. The compact messages suggest that the datastore itself is performing fine as a baseline, so I suspect that this has more to do with how you've set up your environment and what you're running on it than anything else.
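If you want to double-check the datastore side, something along these lines should show the on-disk size and the compaction messages (a sketch; the paths and unit name assume a default k3s server install with the embedded SQLite datastore):

```bash
# Size of the embedded (kine) SQLite datastore
sudo du -sh /var/lib/rancher/k3s/server/db/
sudo ls -lh /var/lib/rancher/k3s/server/db/state.db*

# Recent compaction messages from the k3s service
sudo journalctl -u k3s | grep -i compact | tail -n 20
```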
-
I figured it out after a long time of debugging: turns out htop hides kernel threads by default, so I didn't see kswapd0, which (apparently) takes over 100% CPU if you don't have swap (as I understand it, running without swap is the recommendation for Kubernetes). This seems to be because it frantically tries to compress and save memory rather than OOM-killing things. Curiously, neither the QEMU guest agent (which reported memory sitting at ~60%) nor htop (which showed more like 80%) indicated that the VM was hitting its limit. After enabling htop's option to show kernel threads, I saw the process jump to the top of the list, and once I increased the RAM of the control plane node from 2 GB to 3 GB, the issue went away.

I assume the additional components I am running on top of k3s (e.g. Cilium) push it across the 2 GB limit (or possibly it doesn't even reach 2 GB and the kernel freaks out before that). Either way, it looks like the high CPU is a memory pressure symptom in these circumstances.
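For anyone hitting the same thing, these are roughly the checks that would have surfaced it earlier (a sketch; flags are from memory, the VMID 101 in the qm command is just a placeholder, and /proc/pressure needs a kernel with PSI enabled):

```bash
# kswapd0 is a kernel thread, so ask for it explicitly (sysstat)
pidstat -C kswapd 1

# Kernel memory-pressure stats and swap/memory activity
cat /proc/pressure/memory
vmstat 1

# In htop: press Shift+K (or F2 -> Display options -> untick
# "Hide kernel threads") to make kswapd0 visible

# On the Proxmox host: bump the VM's RAM (value in MB)
qm set 101 --memory 3072
```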