Best practice prometheus monitoring #425
In order to remove target scrape errors I use this configuration:
Unfortunately, core parts of k3s are not monitored using this config. |
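(The configuration referenced above was not preserved in this thread. As a rough, hypothetical sketch only, values of this kind for the kube-prometheus-stack / prometheus-operator chart typically disable the scrape jobs for the components that are unreachable on k3s; the exact keys used by the commenter are an assumption.)
kubeControllerManager:
  enabled: false
kubeScheduler:
  enabled: false
kubeProxy:
  enabled: false
kubeEtcd:
  enabled: false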
It should be possible to monitor the API server, or at least give an option to change the advertise address. |
You can try my HelmChart CRD. It's not perfect; the kubelet for some reason does not report certain labels, but it solves most of your issues. |
I am also trying to get kube-prometheus to work on k3s (currently version 0.8.0). I am running my cluster on arm, which complicates it a bit: kube-state-metrics and the kube-rbac-proxy, for example, are not readily available for arm. I made some images myself, but luckily carlosedp has made the necessary arm images available. You can have a look at his github cluster_monitoring. The remaining problem is authorization for node-exporter and kube-state-metrics (and possibly more): it seems k3s uses another authentication API version, as user phillebaba has found. See issue carlosedp/cluster-monitoring#13 (comment). Can k3s developers or anyone else maybe shed some light or advise on this? |
I've added a workaround in my cluster-monitoring stack to remove kube-rbac-proxy from node_exporter and kube-state-metrics. Can you test out the k3s branch from https://github.com/carlosedp/cluster-monitoring/tree/k3s and report back if it worked? It's a matter of applying the manifests from the "manifests" dir. They are already generated from jsonnet. |
With k3s (k3d) and kube-state-metrics (kube-rbac-proxy), I have the same problem. If the intention of k3s is to remove alpha and non-default features, I think the kube-rbac-proxy should change to use the v1 authentication/authorization APIs.
|
The problem with changing to auth/v1 is that it would not be compatible with previous versions of k8s where the api was still beta. |
Fails first and eventually succeeds with the following additional changes after the failure:
It just disables the creation of CRDs after the first failed attempt. |
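(The actual changes were not preserved here. Purely as an illustration: older prometheus-operator chart versions exposed a value for skipping CRD creation, so the retry presumably set something along these lines; treat the key name as an assumption, since it varies between chart versions.)
prometheusOperator:
  createCustomResource: false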
Hi, I now do have node-exporter metrics, thanks, but cadvisor and the k3s kubelet still give authentication errors. Edit: I have changed prometheus-serviceMonitorKubelet.yaml to use https and include a tls config, and now I can collect metrics with the carlosedp set of manifests (so without the kube-rbac-proxy). |
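(For anyone hitting the same kubelet/cadvisor authentication errors, a sketch of what the relevant endpoints in prometheus-serviceMonitorKubelet.yaml can look like with https and tls enabled; the port name, namespace and labels below are assumptions, not taken from the original manifests.)
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: kubelet
  namespace: monitoring
spec:
  jobLabel: k8s-app
  selector:
    matchLabels:
      k8s-app: kubelet
  namespaceSelector:
    matchNames:
      - kube-system
  endpoints:
    # kubelet's own metrics, scraped over https with the service account token
    - port: https-metrics
      scheme: https
      bearerTokenFile: /var/run/secrets/kubernetes.io/serviceaccount/token
      tlsConfig:
        insecureSkipVerify: true
    # cadvisor metrics are served by the kubelet under /metrics/cadvisor
    - port: https-metrics
      scheme: https
      path: /metrics/cadvisor
      bearerTokenFile: /var/run/secrets/kubernetes.io/serviceaccount/token
      tlsConfig:
        insecureSkipVerify: true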
As I added to the readme on the repo with more details on carlosedp/cluster-monitoring#17, under K3s you need to use Docker as the runtime to have all cAdvisor metrics. |
Any update on this? It would be great to monitor with the Prometheus Operator Helm Chart.
kubeControllerManager:
  enabled: false
kubeScheduler:
  enabled: false
kubeProxy:
  enabled: false |
Yeah what is the latest on this? |
Is there an issue? I have a bog-standard prometheus install pointed at metrics-server and node-exporter. Literally copied the manifests over from an EKS cluster and didn't have to change anything. |
Hi @brandond, this issue just gave me the impression that Prometheus could be challenging to get up and running, so I was asking for an update on best practices. But if it's simply a matter of throwing a Prometheus Helm chart at k3s, I'll just jump into it. |
You have to make sure you have things like metrics-server, kube-state-metrics, node-exporter etc deployed, but that's not unique to k3s. Nor is the prometheus scraper configuration. None of these should require any configuration that wouldn't be necessary on any other k8s cluster. |
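(As a rough sketch of what that can look like on k3s: metrics-server already ships with k3s by default, and the two exporters can be deployed as HelmChart manifests dropped into /var/lib/rancher/k3s/server/manifests. The chart names come from the prometheus-community repo; the namespaces below are assumptions.)
apiVersion: helm.cattle.io/v1
kind: HelmChart
metadata:
  name: kube-state-metrics
  namespace: kube-system
spec:
  repo: https://prometheus-community.github.io/helm-charts
  chart: kube-state-metrics
  targetNamespace: monitoring
---
apiVersion: helm.cattle.io/v1
kind: HelmChart
metadata:
  name: node-exporter
  namespace: kube-system
spec:
  repo: https://prometheus-community.github.io/helm-charts
  chart: prometheus-node-exporter
  targetNamespace: monitoring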
Great stuff. Thank you Mr. @brandond |
Hi, I am new to k3s. I have a k3s installation set up and am trying to pull metrics from the cluster. My Prometheus is hosted outside the cluster. It would be a great help if someone could shed some light on how to set this up; I have literally spent hours trying to find a solution. Does the installation need to have metrics-server or kube-state-metrics running? |
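(A minimal sketch of what an external Prometheus scrape job could look like, assuming node-exporter is exposed on each node at port 9100; the target address and port are assumptions, and kube-state-metrics would need to be exposed via NodePort or Ingress in a similar way.)
scrape_configs:
  - job_name: "k3s-nodes"
    static_configs:
      - targets:
          - "192.168.1.38:9100"   # node-exporter on a k3s node (example address)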
kubeControllerManager:
  endpoints:
    - ip_of_your_master_node # e.g. 192.168.1.38
kubeScheduler:
  endpoints:
    - ip_of_your_master_node # e.g. 192.168.1.38
This fixed my problems. |
+1 for KubeProxy |
I tried getting the https://github.com/prometheus-community/helm-charts/tree/main/charts/kube-prometheus-stack chart to run on my K3s cluster of 3 RPi4's, but sadly some of the images aren't proper multi-arch images (they fail to run on arm). This is the HelmChart manifest I used:
apiVersion: helm.cattle.io/v1
kind: HelmChart
metadata:
  name: kube-prometheus-stack
  namespace: kube-system
spec:
  chart: kube-prometheus-stack
  repo: https://prometheus-community.github.io/helm-charts
  targetNamespace: monitoring
So, what would be the simplest (best practice) way to deploy a minimal installation of Prometheus and Grafana and point them at the cluster?
Some of the guides on the internet immediately start utilizing all sorts of templated helper repositories, but that doesn't serve as an easy-to-understand minimal baseline installation at all. A tutorial installation IMHO shouldn't rely on any custom repos, but rather use the conventional ones where possible. |
... After upgrading to k3s 1.19 (from 1.18), the prometheus target scraping for those two targets stopped working: `Get "http://10.2.0.30:10252/metrics": dial tcp 10.2.0.30:10252: connect: connection refused`. Looking at k3s-io/k3s#425 (comment) suggests the endpoint approach should work. Experimenting with removing the explicit endpoint callout to see if there is an improvement. Signed-off-by: Jeff Billimek <jeff@billimek.com>
It appears, for me at least, that after upgrading from k3s 1.18 to 1.19 the explicit endpoint approach stopped working. I suspect that there is now a firewall rule preventing connections to the endpoints on ports 10251 & 10252 from anywhere other than 127.0.0.1. edit: This commit seems to be the culprit: 4808c4e#diff-c68274534954d72488196ca23f12cfb3ebe65998d9e7c4a43d7ba9acc9532574 |
This should help people a bit :) prometheus-community/helm-charts#626
Also, I try to keep this repo up-to-date as a bit of a quick start: https://github.com/cablespaghetti/k3s-monitoring |
Are you able to monitor kube-controller-manager, kube-scheduler or kube-proxy? I've looked at your repo and saw your PR over at kube-prometheus-stack, but it seems like nothing works on k3s to have these monitored. @brandond I would love to hear how this worked for you; it doesn't seem like I'm doing anything wrong in my helm values. You can take a look at them over at:
As soon as I enable those metrics (with or without an endpoint), they will not be scraped and the target will appear as down in Prometheus. It would be great to get more eyes on this, as anyone rolling out k3s in a production env would be wise to have these metrics collected. Let me know if you need any more information. |
The people maintaining kube-prometheus-stack unfortunately didn't like the PR due to the level of tweaking required to get k3s working. As such I'm not sure it's possible with the main chart right now.
The way things work with k3s is that the api server endpoint gives you metrics from the controller manager and scheduler as well. So you'll probably have all the metrics, but the helm chart rules and dashboards don't expect them to be tagged with job=apiserver. I'm not sure how kube-proxy works off the top of my head, but it may well be the same.
The way I see it, there are two options: maintain a fork of the chart, or have an option in k3s to split out the metrics endpoints in a "more standard" way which is compatible with the chart as it stands. |
In a separate issue, the rancher monitoring now uses a forked version of PushProx to get many of the stats bound to localhost, from a single port. To see it in action without loading up all of rancher, try this manifest file (drop it in /var/lib/rancher/k3s/server/manifests). You'll get the operator ServiceMonitor, and 4 or 5 of the sets of stats from a single exporter.
To see what stats this enables, do a 'curl -s http://localhost:10249/metrics' |
@ThomasADavis I had to update your config:
---
apiVersion: helm.cattle.io/v1
kind: HelmChart
metadata:
  name: pushprox
  namespace: monitoring
spec:
  chart: https://charts.rancher.io/assets/rancher-pushprox/rancher-pushprox-0.1.201.tgz
  targetNamespace: monitoring
  valuesContent: |-
    metricsPort: 10249
    component: k3s-server
    serviceMonitor:
      enabled: true
    clients:
      port: 10013
      useLocalhost: true
      tolerations:
        - effect: "NoExecute"
          operator: "Exists"
        - effect: "NoSchedule"
          operator: "Exists"
However, while that does get these components monitored, they are not working out of the box with the default Prometheus rules or Grafana dashboards shipped with the chart. |
Wild idea: kube-prometheus-stack + prometheus relabelling of the k3s metrics to patch them into the standard k8s layout? |
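(A sketch of what that idea might look like, assuming, as described above, that the scheduler series arrive via the apiserver ServiceMonitor and only the job label needs patching for the stock rules and dashboards. The regex and label values here are illustrative assumptions, not a tested configuration.)
# under the apiserver ServiceMonitor's endpoint definition:
metricRelabelings:
  - sourceLabels: [__name__]
    regex: "scheduler_.*"
    targetLabel: job
    replacement: kube-scheduler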
I think it would help if there was an option to bind the controller manager and the scheduler to a non-loopback address, so that something like this would work:
kubeControllerManager:
  endpoints:
    - ip_of_your_master_node # e.g. 192.168.1.38
kubeScheduler:
  endpoints:
    - ip_of_your_master_node # e.g. 192.168.1.38
Is there such an option or is the bind address hardcoded? |
@RouNNdeL see this commit 4808c4e
It is hardcoded. |
Are there any chances we would see this implemented as an option? We would have to accept the security risks of enabling it, but I'd be fine with that. |
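(For what it's worth, a hypothetical sketch of how this could look if the bind addresses are overridable through k3s's component argument pass-through; flag support varies by k3s version, so treat every key below as an assumption to verify against your release.)
# /etc/rancher/k3s/config.yaml
kube-controller-manager-arg:
  - "bind-address=0.0.0.0"
kube-scheduler-arg:
  - "bind-address=0.0.0.0"
kube-proxy-arg:
  - "metrics-bind-address=0.0.0.0"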
I was able to get etcd monitored in kube-prometheus-stack with the following values:
kubeEtcd:
  enabled: true
  endpoints:
    - IP of k3s master 1
    - IP of k3s master 2
    - IP of k3s master 3
  service:
    enabled: true
    port: 2381
    targetPort: 2381
The default etcd dashboard shipped with the chart picks this up as well. |
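(One assumption baked into the values above is that etcd's metrics listener on port 2381 is reachable from Prometheus. On k3s with embedded etcd that typically means enabling the etcd metrics exposure flag on the server; a sketch, verify the flag against your k3s version.)
# /etc/rancher/k3s/config.yaml
etcd-expose-metrics: true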
I have got all components monitored again; I believe this issue can be closed! |
Thanks! |
Describe the bug
I would like to monitor a k3s system. Therefore I installed the prometheus operator helm chart. Out of the box, a lot of alerts are in the FIRING state.
A lot of rules which cover the apiserver and kubelet are not working. Should users just disable these rules, or are you going to provide your own default rules for a k3s setup?
To Reproduce
Install prometheus helm chart with default values
Expected behavior
Everything should look green, provided any k3s-specific instructions were followed.
Screenshots
KubeAPIDown (1 active)
KubeControllerManagerDown (1 active)
KubeDaemonSetRolloutStuck (1 active) kube-state-metrics
KubeSchedulerDown (1 active)
KubeletDown (1 active)
TargetDown (2 active) apiserver, kubelet