initial version for Talos metric scraping How To #10094
Open: Hr46ph wants to merge 1 commit into siderolabs:main from Hr46ph:how-to_monitoring
---
title: "How to configure Talos for metric scraping"
description: "This how-to explains how to configure Talos to allow scraping with Prometheus."
aliases:

---

This how-to was originally posted on GitHub in reply to an ongoing discussion, which you can find here:
[Link to GitHub Discussion](https://github.com/siderolabs/talos/discussions/7214#discussioncomment-11709688 "How to get etcd metrics")

For Prometheus to successfully scrape metrics from etcd, the controller manager and the scheduler, we need to make a few changes to the machine config and adjust the Helm values for `kube-prometheus-stack`.

This how-to assumes that you have a working deployment of the `kube-prometheus-stack` community project. If you use a different Prometheus deployment, you may need to adjust the steps to suit your setup.

Create a patch for the control plane nodes and save it as `etcd_metrics_patch.yaml`:

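```yaml
# JSON patch (RFC 6902). Note: `add` replaces any existing extraArgs map at these paths.
# Expose etcd metrics on all interfaces (HTTPS on the client port).
- op: add
  path: /cluster/etcd/extraArgs
  value:
    listen-metrics-urls: https://0.0.0.0:2379
# Let kube-controller-manager serve metrics on all interfaces.
- op: add
  path: /cluster/controllerManager/extraArgs
  value:
    bind-address: 0.0.0.0
# Let kube-scheduler serve metrics on all interfaces.
- op: add
  path: /cluster/scheduler/extraArgs
  value:
    bind-address: 0.0.0.0
```
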
Patch the control plane nodes with:
`talosctl patch mc -n 10.0.0.1 --patch @etcd_metrics_patch.yaml`

Repeat the command with the IP address of each control plane node.

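If you have several control plane nodes, a small loop saves retyping the command. This is a minimal sketch: the function name `patch_control_planes` is hypothetical, and it assumes `talosctl` is installed and your `talosconfig` can reach every node. Run it from the directory that contains `etcd_metrics_patch.yaml`, for example as `patch_control_planes 10.0.0.1 10.0.0.2 10.0.0.3`.

```sh
# Hypothetical helper (not part of talosctl): apply the metrics patch
# to every control plane node whose IP address is passed as an argument.
# Assumes talosctl is installed and configured for this cluster.
patch_control_planes() {
  local patch_file="etcd_metrics_patch.yaml"
  local node

  if [ "$#" -eq 0 ]; then
    echo "usage: patch_control_planes <node-ip>..." >&2
    return 1
  fi

  for node in "$@"; do
    echo "Patching ${node}"
    # Stop at the first failure so a partially patched cluster is noticed.
    talosctl patch mc -n "${node}" --patch "@${patch_file}" || return 1
  done
}
```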
For the Prometheus scrape job to read from `etcd` successfully, it needs certificates to authenticate. We can get those by running the following commands:

```sh
talosctl get etcdrootsecret -o yaml
```
Expected output:
```yaml
spec:
    etcdCA:
        LS0t....LS0K
```

```sh
talosctl get etcdsecret -o yaml
```
Expected output:
```yaml
spec:
    etcd:
        crt: LS0t....LS0K
        key: LS0t....=
```

The strings we need are the values of `etcdCA`, `etcd.crt` and `etcd.key`. They are already base64 encoded, and because the secret we will create also takes base64-encoded values, you can use them without decoding: just copy and paste them into the new secret. Create the secret in the namespace where Prometheus is running. The template below assumes the commonly used `monitoring` namespace; if yours is different, edit the `metadata` accordingly.

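Before pasting, you can double-check that a value really is base64-encoded PEM data: the prefix `LS0t` is simply base64 for `---`, the start of a `-----BEGIN` header. The helper below is a hypothetical sketch that decodes a value and checks for that header; it assumes a `base64` tool that understands `--decode`, as GNU coreutils and macOS provide.

```sh
# Hypothetical helper: succeed if the argument is base64-encoded PEM data
# (a certificate or a private key), fail otherwise.
is_base64_pem() {
  local decoded
  # Decoding fails on input that is not valid base64.
  decoded="$(printf '%s' "$1" | base64 --decode 2>/dev/null)" || return 1
  # PEM blocks start with a "-----BEGIN ..." header line.
  case "${decoded}" in
    -----BEGIN*) return 0 ;;
    *) return 1 ;;
  esac
}
```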
Create a new secret, save it as `etcd-secret.yaml`, and replace the placeholder values with your etcd CA, certificate and key:

```yaml
---
apiVersion: v1
kind: Secret
metadata:
  name: etcd-client-cert
  namespace: monitoring
type: Opaque
data:
  # Value of etcdCA from `talosctl get etcdrootsecret`
  etcd-ca.crt:
    LS0t....LS0K
  # Value of etcd.crt from `talosctl get etcdsecret`
  etcd-client.crt:
    LS0t....LS0K
  # Value of etcd.key from `talosctl get etcdsecret`
  etcd-client-key.key:
    LS0t....=
```
Apply it with `kubectl apply -f etcd-secret.yaml`.

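Alternatively, you can render the manifest from shell variables and pipe it straight to `kubectl` without writing it to a file first. In this sketch the function name `render_etcd_secret` and the variables `ETCD_CA`, `ETCD_CRT` and `ETCD_KEY` are hypothetical; set the variables to the three base64 strings returned by `talosctl get`, then run `render_etcd_secret "$ETCD_CA" "$ETCD_CRT" "$ETCD_KEY" monitoring | kubectl apply -f -`.

```sh
# Hypothetical helper: print the etcd-client-cert Secret manifest.
# Arguments: base64 CA, base64 certificate, base64 key, optional namespace.
# The values are already base64 encoded, so they go into `data` unchanged.
render_etcd_secret() {
  local ca="$1" crt="$2" key="$3" namespace="${4:-monitoring}"
  cat <<EOF
apiVersion: v1
kind: Secret
metadata:
  name: etcd-client-cert
  namespace: ${namespace}
type: Opaque
data:
  etcd-ca.crt: ${ca}
  etcd-client.crt: ${crt}
  etcd-client-key.key: ${key}
EOF
}
```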
Add or update the following sections in the Helm values file for your `kube-prometheus-stack` deployment (called `talos-custom-values.yaml` in this example). Replace the IP addresses with those of your control plane nodes, and read the comment on `kubeProxy` to decide whether to enable it:

```yaml
kubeControllerManager:
  endpoints:
    - 10.0.0.1
    - 10.0.0.2
    - 10.0.0.3

kubeEtcd:
  endpoints:
    - 10.0.0.1
    - 10.0.0.2
    - 10.0.0.3
  service:
    selector:
      component: etcd
    ## The chart's default etcd service port may differ from 2379, the port
    ## used in the patch above. If the etcd targets stay down, also set
    ## port: 2379 and targetPort: 2379 here.
  serviceMonitor:
    scheme: https
    insecureSkipVerify: false
    serverName: "localhost"
    caFile: "/etc/prometheus/secrets/etcd-client-cert/etcd-ca.crt"
    certFile: "/etc/prometheus/secrets/etcd-client-cert/etcd-client.crt"
    keyFile: "/etc/prometheus/secrets/etcd-client-cert/etcd-client-key.key"

## If you run a kube-proxy replacement (such as Cilium's kube-proxy replacement),
## set enabled: false or remove this section. It only scrapes the standard
## Kubernetes kube-proxy and will not work with a replacement.
kubeProxy:
  enabled: true
  endpoints:
    - 10.0.0.1
    - 10.0.0.2
    - 10.0.0.3

kubeScheduler:
  endpoints:
    - 10.0.0.1
    - 10.0.0.2
    - 10.0.0.3

prometheus:
  prometheusSpec:
    secrets:
      - etcd-client-cert
```

Note: the values above are not a complete configuration for `kube-prometheus-stack`. They only contain the additional settings needed for this particular scraping setup. Make sure you have a working deployment before applying them, or merge them into your values for a new deployment.

To apply these changes to an already running `kube-prometheus-stack` release, use a command similar to this:
```sh
helm upgrade kube-prometheus-stack prometheus-community/kube-prometheus-stack --namespace monitoring --reuse-values --values talos-custom-values.yaml
```

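After the upgrade, two quick checks can confirm that the pieces fit together. The `caFile`, `certFile` and `keyFile` paths above work because the Prometheus Operator mounts every secret listed under `prometheusSpec.secrets` at `/etc/prometheus/secrets/<secret-name>/`, and the `up` metric reports 1 for each target whose last scrape succeeded and 0 otherwise. This sketch assumes the release name `kube-prometheus-stack` from the command above, which would give a Prometheus pod named `prometheus-kube-prometheus-stack-prometheus-0` and a service named `kube-prometheus-stack-prometheus`; it also assumes the chart's default job names. Verify these with `kubectl -n monitoring get pods,svc` and adjust. The function names are hypothetical.

```sh
# Hypothetical post-upgrade checks. The names below are assumptions; adjust them.
NAMESPACE="${NAMESPACE:-monitoring}"
PROM_POD="${PROM_POD:-prometheus-kube-prometheus-stack-prometheus-0}"
PROM_URL="${PROM_URL:-http://localhost:9090}"

# 1. The etcd client secret should be mounted inside the Prometheus container.
check_secret_mounted() {
  kubectl -n "${NAMESPACE}" exec "${PROM_POD}" -c prometheus -- \
    ls /etc/prometheus/secrets/etcd-client-cert/
}

# 2. Query the `up` metric for the control plane jobs (1 = up, 0 = down).
#    Requires Prometheus to be reachable at PROM_URL, for example via:
#    kubectl -n monitoring port-forward svc/kube-prometheus-stack-prometheus 9090
check_control_plane_targets() {
  curl -sfG "${PROM_URL}/api/v1/query" \
    --data-urlencode 'query=up{job=~"kube-etcd|kube-controller-manager|kube-scheduler"}'
}
```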
I think it might be easier to listen for metrics on a separate non-HTTPS URL, e.g. `http://0.0.0.0:10003` (any port will do). That removes the need to extract the etcd certificates and put them into the cluster, which I consider a security risk on its own. Instead, the port can be blocked from the outside via https://www.talos.dev/v1.9/talos-guides/network/ingress-firewall/

Ah, that's interesting, and indeed an easier and better way! Thanks, I will give it a try and update the how-to.

I am sorry, and I really hate to say this, but I am preparing to move away from Talos, which means I cannot properly test and finish this.