Work around the cgroup zero memory working set problem
We observed a problem where container_memory_working_set_bytes is reported as zero on nodes running cgroup v1. The same problem does not occur on nodes running cgroup v2. For example:

$ kubectl get --raw /api/v1/nodes/<cgroup_node>/proxy/metrics/resource
container_memory_working_set_bytes{container="logrotate",namespace="...",pod="dev-5cd4cc79d6-s9cll"} 0 1732247555792

$ kubectl get --raw /api/v1/nodes/<cgroup2_node>/proxy/metrics/resource
container_memory_working_set_bytes{container="logrotate",namespace="...",pod="dev-5cd4cc79d6-test"} 1.37216e+06 1732247626298

The metrics-server logs:
metrics-server-77786dd5c5-w4skb metrics-server I1121 22:02:47.705690       1 decode.go:196] "Failed getting complete container metric" containerName="logrotate" containerMetric={"StartTime":"2024-10-23T13:12:07.815984128Z","Timestamp":"2024-11-21T22:02:41.755Z","CumulativeCpuUsed":12016533431788,"MemoryUsage":0}
metrics-server-77786dd5c5-w4skb metrics-server I1121 22:02:47.706713       1 decode.go:104] "Failed getting complete Pod metric" pod=".../dev-5cd4cc79d6-s9cll"

On the cgroup v1 node:
$ kc exec -it dev-5cd4cc79d6-s9cll -c logrotate -- /bin/sh -c "cat /sys/fs/cgroup/memory/memory.usage_in_bytes; cat /sys/fs/cgroup/memory/memory.stat |grep -w total_inactive_file |cut -d' ' -f2"
212414464
214917120

On the cgroup v2 node:
$ kc exec -it dev-5cd4cc79d6-test -c logrotate -- /bin/sh -c "cat /sys/fs/cgroup/memory.current; cat /sys/fs/cgroup/memory.stat |grep -w inactive_file |cut -d' ' -f2"
212344832
210112512
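
The zero value follows from how the kubelet (via cAdvisor) derives the working set: roughly, memory usage minus the inactive file cache, floored at zero. On the cgroup v1 node above, total_inactive_file (214917120) exceeds usage_in_bytes (212414464), so the working set clamps to zero, while on the cgroup v2 node the difference stays positive. A minimal Go sketch of that derivation, using the numbers above (illustrative only, not code from this repository):

package main

import "fmt"

// workingSet mirrors the usual working-set derivation:
// memory usage minus inactive file cache, never below zero.
func workingSet(usage, inactiveFile uint64) uint64 {
	if inactiveFile >= usage {
		return 0
	}
	return usage - inactiveFile
}

func main() {
	// cgroup v1 node: inactive_file > usage, so the metric collapses to zero.
	fmt.Println(workingSet(212414464, 214917120)) // 0
	// cgroup v2 node: usage > inactive_file, so a non-zero working set is reported.
	fmt.Println(workingSet(212344832, 210112512)) // 2232320
}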

Under the current logic, if any one container encounters this problem, the metrics for the whole pod are dropped. This is overkill: the container with the zero memory working set usually accounts for only a small percentage of the pod's total resource usage, yet without PodMetrics downstream components such as the HPA cannot autoscale the Deployment/StatefulSet, etc., which has a much larger impact on the system.

This patch works around the cgroup zero memory working set problem by keeping the PodMetrics unless all containers in the pod encounter the problem at the same time.

Signed-off-by: Zhu, Yi <chuyee@gmail.com>
chuyee authored Nov 22, 2024
1 parent 9ebbad9 commit ab72bcb
pkg/scraper/client/resource/decode.go (3 additions, 1 deletion)
@@ -194,11 +194,13 @@ func checkContainerMetrics(podMetric storage.PodMetricsPoint) map[string]storage
 			// drop metrics when CumulativeCpuUsed or MemoryUsage is zero
 			if containerMetric.CumulativeCpuUsed == 0 || containerMetric.MemoryUsage == 0 {
 				klog.V(1).InfoS("Failed getting complete container metric", "containerName", containerName, "containerMetric", containerMetric)
-				return nil
 			} else {
 				podMetrics[containerName] = containerMetric
 			}
 		}
 	}
+	if len(podMetrics) == 0 {
+		return nil
+	}
 	return podMetrics
 }
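
For illustration, the effect of the change can be sketched with simplified, hypothetical types (ContainerPoint and checkContainers below are stand-ins, not the repository's actual types): a pod's metrics now survive as long as at least one of its containers reports complete data, and only a pod whose containers are all incomplete is dropped.

package main

import "fmt"

// ContainerPoint is a simplified stand-in for a scraped container sample.
type ContainerPoint struct {
	CumulativeCpuUsed uint64
	MemoryUsage       uint64
}

// checkContainers keeps every complete container sample and only drops the
// whole pod (returns nil) when no container reports complete data.
func checkContainers(containers map[string]ContainerPoint) map[string]ContainerPoint {
	kept := make(map[string]ContainerPoint, len(containers))
	for name, c := range containers {
		if c.CumulativeCpuUsed == 0 || c.MemoryUsage == 0 {
			continue // incomplete sample: skip this container only
		}
		kept[name] = c
	}
	if len(kept) == 0 {
		return nil // every container was incomplete: drop the pod metrics
	}
	return kept
}

func main() {
	pod := map[string]ContainerPoint{
		"app":       {CumulativeCpuUsed: 12016533431788, MemoryUsage: 1372160},
		"logrotate": {CumulativeCpuUsed: 12016533431788, MemoryUsage: 0}, // cgroup v1 zero working set
	}
	// Before the patch a single incomplete container dropped the whole pod;
	// now "app" is kept and PodMetrics remain available to the HPA.
	fmt.Println(checkContainers(pod))
}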
