-
I am just starting to set up a k3s cluster on Proxmox, and I noticed that the "k3s server" VM regularly spikes to 100% CPU together with very high disk read IO.

The host system:
The VMs:
Storage for all VMs is a ZFS mirror pool of two NVMe drives. I've run …

The "k3s server" VM (the one that spikes) idles at around 5-10% CPU. Whenever some activity happens on the control plane (for example a node joins, or a Helm chart is installed), CPU spikes to 100% and takes several minutes to come back down. At the same time, Proxmox reports disk read IO in the GBs, which seems excessive when my SQLite db is ~75 MB.

This is what I see on the Proxmox dashboard. Note that the first spike was the first time the node joined; I then deleted it and joined it again, which caused the second, much shorter spike. I'm not sure why the second spike was shorter, but the first kind is much more common, and I've experienced it quite a bit.

And here's a snippet from the log, starting at the time the node joined. I've cut off much of what follows; the SQL statements get increasingly slow until it eventually settles down again.
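In case it's useful for anyone looking at this, here is a rough way to see which process the reads are attributed to while a spike is happening (just a sketch; it assumes iotop and sysstat are installed inside the VM):

```bash
# Only show processes/threads that are actually doing IO (needs root)
sudo iotop -o

# Per-process read/write rates in 1-second samples (from sysstat)
pidstat -d 1

# Device-level view, to compare against what the Proxmox graph shows
iostat -x 1
```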
The SQLite db (if I found the right one):
My …
Here is a list of my running pods. As you can see, the cluster is tiny; it doesn't have any significant workloads yet because I'm only just setting it up:
The only other thing I can think of is that I set up a very similar configuration a few weeks ago, but instead of k3s I used Talos, and I saw the exact same behaviour. However, it was much harder to debug (and I had some other issues), so I wanted to try out k3s, where I can SSH in and at least debug things a bit better. So whatever is happening here is very likely some sort of issue that isn't specific to k3s.

I would appreciate it if anyone has suggestions as to what could be going on here or how I can debug it. From everything I can test, performance should not be an issue with this deployment. I don't know why this immense disk IO is happening, but it seems to originate from within k3s and to be related to kube API server activity.

Edit: Just remembered that in other discussions I saw the note to check for …
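For what it's worth, here is a sketch of how one could check whether the reads really are attributed to the k3s server process itself (the k3s unit name is from a default install; the /proc/pressure files need a reasonably recent kernel with PSI enabled):

```bash
# PID of the k3s server process (default systemd unit name)
PID=$(systemctl show -p MainPID --value k3s)

# Cumulative bytes read/written by that process; sample it twice
# during a spike and compare the read_bytes values
cat /proc/"$PID"/io

# Kernel pressure-stall information for IO (PSI, kernel >= 4.20)
cat /proc/pressure/io
```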
-
I can't say I've seen this before. Are you sure that this is coming from the datastore, or are you perhaps deploying some very large images, or something else that requires a lot of disk IO? If the database files are on the same disk as your workload, then a heavy workload will compete with the datastore for throughput. The compact messages suggest that the datastore itself is performing fine as a baseline, so I suspect that this has more to do with how you've set up your environment and what you're running on it than anything else.
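If you want to double-check the datastore side, something along these lines should show the on-disk size and the compaction messages (a sketch; the paths and unit name assume a default k3s server install with the embedded SQLite datastore):

```bash
# Size of the embedded (kine) SQLite datastore
sudo du -sh /var/lib/rancher/k3s/server/db/
sudo ls -lh /var/lib/rancher/k3s/server/db/state.db*

# Recent compaction messages from the k3s service
sudo journalctl -u k3s | grep -i compact | tail -n 20
```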
-
I figured it out after a long time of debugging: turns out htop hides kernel threads by default, so I didn't see kswapd0, which (apparently) takes over 100% CPU if you don't have swap (as I understand it, running without swap is the recommendation for Kubernetes). This seems to be because it frantically tries to compress and save memory rather than OOM-killing things. Curiously, neither the QEMU guest agent (which reported memory sitting at ~60%) nor htop (which showed more like 80%) indicated that the VM was hitting its limit. After enabling htop's option to show kernel threads, I saw the process jump to the top of the list, and once I increased the RAM of the control plane node from 2 GB to 3 GB, the issue went away.

I assume the additional components I am running on top of k3s (e.g. Cilium) push it across the 2 GB limit (or possibly it doesn't even reach 2 GB and the kernel freaks out before that). Either way, it looks like the high CPU is a memory pressure symptom in these circumstances.
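For anyone hitting the same thing, these are roughly the checks that would have surfaced it earlier (a sketch; flags are from memory, the VMID 101 in the qm command is just a placeholder, and /proc/pressure needs a kernel with PSI enabled):

```bash
# kswapd0 is a kernel thread, so ask for it explicitly (sysstat)
pidstat -C kswapd 1

# Kernel memory-pressure stats and swap/memory activity
cat /proc/pressure/memory
vmstat 1

# In htop: press Shift+K (or F2 -> Display options -> untick
# "Hide kernel threads") to make kswapd0 visible

# On the Proxmox host: bump the VM's RAM (value in MB)
qm set 101 --memory 3072
```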