Add the prometheus-longterm-metrics and thanos optional components #461

Open
wants to merge 12 commits into base: master
37 changes: 36 additions & 1 deletion CHANGES.md
Original file line number Diff line number Diff line change
@@ -15,7 +15,42 @@
[Unreleased](https://github.com/bird-house/birdhouse-deploy/tree/master) (latest)
------------------------------------------------------------------------------------------------------------------

-[//]: # (list changes here, using '-' for each new entry, remove this when items are added)
## Changes

- Add the `prometheus-longterm-metrics` and `thanos` optional components

The `prometheus-longterm-metrics` component collects longterm monitoring metrics from the original prometheus instance
(the one created by the ``components/monitoring`` component).

Longterm metrics are any prometheus rules that have the label ``group: longterm-metrics`` or, in other words, are
selectable using prometheus's ``'{group="longterm-metrics"}'`` query filter. To see which longterm metric rules are
added by default see the
``optional-components/prometheus-longterm-metrics/config/monitoring/prometheus.rules.template`` file.

If you do not want the default longterm-metric rules included, set the ``PROMETHEUS_LONGTERM_ENABLE_DEFAULT_RULES``
variable to anything other than ``True`` in your ``env.local`` file.

To configure this component:

* update the ``PROMETHEUS_LONGTERM_RETENTION_TIME`` variable to set how long the data will be kept by prometheus

Enabling the `prometheus-longterm-metrics` component creates the additional endpoint ``/prometheus-longterm-metrics``.

The `thanos` component enables better storage of longterm metrics collected by the
``optional-components/prometheus-longterm-metrics`` component. Data will be collected from the
``prometheus-longterm-metrics`` instance and stored in an S3 object store indefinitely.

When enabling this component, please change the default values for the ``THANOS_MINIO_ROOT_USER`` and ``THANOS_MINIO_ROOT_PASSWORD``
by updating the ``env.local`` file. These set the login credentials for the root user that runs the
[minio](https://min.io/) object store.

Enabling the `thanos` component creates the additional endpoints:

* ``/thanos-query``: a prometheus-like query interface to inspect the data stored by thanos
* ``/thanos-minio``: a minio web console to inspect the data stored by minio.
Comment on lines +46 to +47
Collaborator

Should those be configurable?

Collaborator Author

Which bits would be useful to configure? The endpoints, paths, images, other?

I agree that we could always add more configuration options, I'm just wondering which are a priority for you?

Collaborator

The endpoints would be the priority, to allow serving them from some other location, though still a low priority relative to the feature as a whole. Worst case, redirects can be defined in the nginx configuration, so don't block the PR just for this.


This also includes an update to the prometheus version from `v2.19.0` to the current latest `v2.52.0`. This is
required to support the interaction between prometheus and thanos.

[2.4.0](https://github.com/bird-house/birdhouse-deploy/tree/2.4.0) (2024-06-04)
------------------------------------------------------------------------------------------------------------------
2 changes: 1 addition & 1 deletion birdhouse/components/monitoring/default.env
@@ -8,7 +8,7 @@ export GRAFANA_VERSION="7.0.3"
export GRAFANA_DOCKER=grafana/grafana
export GRAFANA_IMAGE='${GRAFANA_DOCKER}:${GRAFANA_VERSION}'

-export PROMETHEUS_VERSION="v2.19.0"
+export PROMETHEUS_VERSION="v2.52.0"
export PROMETHEUS_DOCKER=prom/prometheus
export PROMETHEUS_IMAGE='${PROMETHEUS_DOCKER}:${PROMETHEUS_VERSION}'

8 changes: 8 additions & 0 deletions birdhouse/env.local.example
@@ -574,6 +574,14 @@ export THREDDS_ADDITIONAL_CATALOG=""
#export ALERTMANAGER_EXTRA_INHIBITION=""
#export ALERTMANAGER_EXTRA_RECEIVERS=""

# Below are for the prometheus-longterm-metrics optional component
#export PROMETHEUS_LONGTERM_RETENTION_TIME=1y

# Below are for the thanos optional component
# Change these from the default for added security
#export THANOS_MINIO_ROOT_USER="${__DEFAULT__THANOS_MINIO_ROOT_USER}"
#export THANOS_MINIO_ROOT_PASSWORD="${__DEFAULT__THANOS_MINIO_ROOT_PASSWORD}"

#############################################################################
# Emu optional vars
#############################################################################
37 changes: 37 additions & 0 deletions birdhouse/optional-components/README.rst
@@ -443,3 +443,40 @@ How to enable X-Robots-Tag Header in ``env.local`` (a copy from `env.local.examp

.. seealso::
See the `env.local.example`_ file for more details about this ``BIRDHOUSE_PROXY_ROOT_LOCATION`` behaviour.

Prometheus Long-term Metrics
----------------------------

This is a second prometheus instance that collects longterm monitoring metrics from the original prometheus instance
(the one created by the ``components/monitoring`` component).

Longterm metrics are any prometheus rules that have the label ``group: longterm-metrics`` or, in other words, are
selectable using prometheus's ``'{group="longterm-metrics"}'`` query filter. To see which longterm metric rules are
added by default see the ``optional-components/prometheus-longterm-metrics/config/monitoring/prometheus.rules.template``
file.

If you do not want the default longterm-metric rules included, set the ``PROMETHEUS_LONGTERM_ENABLE_DEFAULT_RULES``
variable to anything other than ``True`` in your ``env.local`` file. You may want to do this if you've created your
own set of rules in another component that you would like to use instead of the default ones.

To configure this component:

* update the ``PROMETHEUS_LONGTERM_RETENTION_TIME`` variable to set how long the data will be kept by prometheus

Enabling this component creates the additional endpoint ``/prometheus-longterm-metrics``.

Thanos
------

This enables better storage of longterm metrics collected by the ``optional-components/prometheus-longterm-metrics``
component. Data will be collected from the ``prometheus-longterm-metrics`` instance and stored in an S3 object store
indefinitely.
Collaborator

Indefinitely! Do we actually want this? Can we set an expiry after, say, 10 years?

Will Grafana be able to display data from Thanos going back 10 years? With this kind of extremely long-term stats, what is the UI to visualize it?

Collaborator Author

You can choose to change this if you wish. Thanos suggests keeping data indefinitely by default. If you do not need to keep data forever, I suggest just using the prometheus-longterm-metrics component without thanos and setting PROMETHEUS_LONGTERM_RETENTION_TIME to whatever you'd like.
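
As a rough illustration of the setup suggested in this reply, an `env.local` could enable the long-term Prometheus without thanos and raise its retention. This is only a sketch: the `BIRDHOUSE_EXTRA_CONF_DIRS` variable name and the `10y` value are assumptions that do not appear in this PR, so check `env.local.example` for how components are enabled in your deployment, and append to any existing component list rather than replacing it.

```shell
# env.local (sketch, values are illustrative)

# Assumption: components are enabled through BIRDHOUSE_EXTRA_CONF_DIRS.
# Only the monitoring and long-term components are listed here.
export BIRDHOUSE_EXTRA_CONF_DIRS="
  ./components/monitoring
  ./optional-components/prometheus-longterm-metrics
"

# Keep long-term metrics for ten years instead of the default 1y.
export PROMETHEUS_LONGTERM_RETENTION_TIME=10y
```

With this configuration the data lives only in the `prometheus_longterm_persistence` volume and is dropped once it is older than the retention time.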

Collaborator

Oh I see, is there a switch to disable Thanos?

Same question: if we do want to use Thanos, how do we visualize the data it stores? I assume that when Thanos is enabled, the retention duration on the Prometheus side will be very short to avoid doubling the storage. Without the data being stored in Prometheus, how do we visualize what is stored in Thanos?

Just a question. If another component is required, we can do it in a follow up Pr.

Collaborator

> Oh I see, is there a switch to disable Thanos?

To answer my own question: Thanos is actually a separate component, so it does not have to be enabled together with the Prometheus-long-term component? The Prometheus-long-term component can function independently of Thanos?

Collaborator Author

> Thanos is actually a separate component so it does not have to be enabled together with the Prometheus-long-term component

Yes that's right. prometheus-longterm-metrics collects and stores specific metrics that we want to keep for longer from prometheus. If you want to also enable thanos, then thanos will store those same metrics in a much more compact/efficient way so that you can store more data over a longer time period.

Collaborator

I'm happy with forever storage for long-term-metrics. The point is to keep an archive of key metrics. If those are daily or hourly, archiving a few dozen metrics won't be a problem.

Collaborator

Let's see how much space that will take in practice and adjust from there. Eventually we will need a way to visualize those older metrics. Otherwise, what's the point of keeping them forever if we cannot visualize them?


When enabling this component, please change the default values for the ``THANOS_MINIO_ROOT_USER`` and
``THANOS_MINIO_ROOT_PASSWORD`` by updating the ``env.local`` file. These set the login credentials for the root user
that runs the minio_ object store.

Enabling this component creates the additional endpoints:

* ``/thanos-query``: a prometheus-like query interface to inspect the data stored by thanos
* ``/thanos-minio``: a minio_ web console to inspect the data stored by minio_.

.. _minio: https://min.io/
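
A minimal `env.local` sketch for the credentials described in the Thanos section. The values are placeholders, and the length check reflects what are believed to be MinIO's minimums (3 characters for the root user, 8 for the root password); treat that as an assumption and confirm it against the MinIO version you deploy.

```shell
# env.local (sketch): placeholder credentials, replace before deploying
export THANOS_MINIO_ROOT_USER="thanos-admin"
export THANOS_MINIO_ROOT_PASSWORD="change-me-to-a-long-random-secret"

# Optional sanity check (assumed MinIO minimums: user >= 3, password >= 8 characters)
if [ "${#THANOS_MINIO_ROOT_USER}" -lt 3 ] || [ "${#THANOS_MINIO_ROOT_PASSWORD}" -lt 8 ]; then
    echo "THANOS_MINIO_ROOT_USER/THANOS_MINIO_ROOT_PASSWORD too short" >&2
fi
```

These are the same credentials used to log in to the ``/thanos-minio`` console.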
@@ -0,0 +1,3 @@
prometheus.yml
config/magpie/config.yml
config/proxy/conf.extra-service.d/monitoring.conf
@@ -0,0 +1,28 @@
providers:
  prometheus-longterm-metrics:
    # below URL is only used to fill in the required location in Magpie
    # actual auth validation is performed with Twitcher 'verify' endpoint without accessing this proxied URL
    url: http://proxy:80
    title: PrometheusLongtermMetrics
    public: true
    c4i: false
    type: api
    sync_type: api

permissions:
  - service: prometheus-longterm-metrics
    permission: read
    group: administrators
    action: create
  - service: prometheus-longterm-metrics
    permission: write
    group: administrators
    action: create
  - service: prometheus-longterm-metrics
    permission: read
    group: monitoring
    action: create
  - service: prometheus-longterm-metrics
    permission: write
    group: monitoring
    action: create
@@ -0,0 +1,7 @@
version: "3.4"

services:
  magpie:
    volumes:
      - ./optional-components/prometheus-longterm-metrics/config/magpie/config.yml:${MAGPIE_PERMISSIONS_CONFIG_PATH}/prometheus-longterm-metrics.yml:ro
      - ./optional-components/prometheus-longterm-metrics/config/magpie/config.yml:${MAGPIE_PROVIDERS_CONFIG_PATH}/prometheus-longterm-metrics.yml:ro
@@ -0,0 +1,6 @@
version: "3.4"

services:
  prometheus:
    volumes:
      - ./optional-components/prometheus-longterm-metrics/config/monitoring/${PROMETHEUS_LONGTERM_RULES_FILE}:/etc/prometheus/prometheus-longterm-metrics.rules:ro
@@ -0,0 +1,4 @@
# This file is intentionally left blank in order to allow a user to choose whether to enable the default rules that are
# set in the prometheus.rules file.
# By setting the PROMETHEUS_LONGTERM_ENABLE_DEFAULT_RULES environment variable to True, the rules in prometheus.rules
# will be added. By setting that value to anything else, this file will be added instead.
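
The selection described in this file's comment is driven by the `PROMETHEUS_LONGTERM_RULES_FILE` expression in the component's `default.env`. A minimal sketch of that logic, rewritten as a standalone function (`resolve_rules_file` is a hypothetical name used only for illustration), shows that the comparison is an exact, case-sensitive string match:

```shell
#!/bin/sh
# Hypothetical helper mirroring the selection logic in default.env:
# only the exact string "True" selects the default rules file.
resolve_rules_file() {
    if [ "$1" = "True" ]; then
        echo prometheus.rules
    else
        echo prometheus.null.rules
    fi
}

resolve_rules_file True    # prints prometheus.rules
resolve_rules_file False   # prints prometheus.null.rules
resolve_rules_file true    # prints prometheus.null.rules (lowercase does not match)
```

Values such as `true` or `1` therefore disable the default rules just as `False` does.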
@@ -0,0 +1,15 @@
groups:
  - name: longterm-metrics-hourly
    interval: 1h
    rules:
      # fraction of the time, over the last hour, that all CPUs were working
      # 1 means all CPUs were working all the time, 0 means they were all idle all the time
      - record: instance:cpu_load:avg_rate1h
Collaborator

So the 2nd Prometheus-longterm will scrape only these 2 new metrics, or will it also scrape all the existing ones?

Collaborator Author

By default, only these two. I suggest we add more rules to this default list in the future though.

If you have other metrics that you want it to scrape for long term storage, you just have to give that rule the label group: longterm-metrics.
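
To make the labelling step concrete, here is a sketch that writes one extra recording rule carrying the `group: longterm-metrics` label. The directory, file name, group name and record name are hypothetical, and `node_memory_MemAvailable_bytes` is assumed to be exported by the node exporter in your deployment. Mounting the file into the original prometheus instance is not shown.

```shell
#!/bin/sh
# Sketch: write a custom recording rule that the long-term instance can federate.
# The directory, file name, group name and record name below are hypothetical.
RULES_DIR="$(mktemp -d)"
RULES_FILE="${RULES_DIR}/my-longterm.rules"

cat > "${RULES_FILE}" <<'EOF'
groups:
  - name: my-longterm-metrics-hourly
    interval: 1h
    rules:
      # average available memory over the last hour, per instance
      - record: instance:memory_available_bytes:avg_over_time1h
        expr: avg by(instance) (avg_over_time(node_memory_MemAvailable_bytes[1h]))
        labels:
          group: longterm-metrics
EOF

# Show the label that makes the rule selectable with '{group="longterm-metrics"}'
grep -n 'group: longterm-metrics' "${RULES_FILE}"
```

Any rule without that label is ignored by the federation scrape, so existing short-term rules are unaffected.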

        expr: 1 - avg by(instance) (rate(node_cpu_seconds_total{mode="idle"}[1h]))
        labels:
          group: longterm-metrics
      # average rate, in bytes per second, at which data was sent or received over the network in the last hour
      - record: instance:network_bytes_transmitted:sum_rate1h
        expr: sum by(instance) (rate(node_network_transmit_bytes_total[1h]) + rate(node_network_receive_bytes_total[1h]))
        labels:
          group: longterm-metrics
@@ -0,0 +1,18 @@
    location /prometheus-longterm-metrics {
        auth_request /secure-prometheus-longterm-metrics-auth;
        auth_request_set $auth_status $upstream_status;
        proxy_pass http://prometheus-longterm-metrics:9090;
        proxy_set_header Host $host;
    }

    location = /secure-prometheus-longterm-metrics-auth {
        internal;
        proxy_pass https://${BIRDHOUSE_FQDN_PUBLIC}${TWITCHER_VERIFY_PATH}/prometheus-longterm-metrics$request_uri;
        proxy_pass_request_body off;
        proxy_set_header Host $host;
        proxy_set_header Content-Length "";
        proxy_set_header X-Original-URI $request_uri;
        proxy_set_header X-Forwarded-Proto $real_scheme;
        proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
        proxy_set_header X-Forwarded-Host $host:$server_port;
    }
@@ -0,0 +1,6 @@
version: "3.4"

services:
  proxy:
    volumes:
      - ./optional-components/prometheus-longterm-metrics/config/proxy/conf.extra-service.d:/etc/nginx/conf.extra-service.d/prometheus-longterm-metrics:ro
@@ -0,0 +1,23 @@
export PROMETHEUS_LONGTERM_RETENTION_TIME=1y
export PROMETHEUS_LONGTERM_ENABLE_DEFAULT_RULES=True
export PROMETHEUS_LONGTERM_SCRAPE_INTERVAL=1h

# These are the prometheus defaults
export PROMETHEUS_LONGTERM_TSDB_MIN_BLOCK_DURATION=2h
export PROMETHEUS_LONGTERM_TSDB_MAX_BLOCK_DURATION=1d12h

export PROMETHEUS_LONGTERM_RULES_FILE='$([ "${PROMETHEUS_LONGTERM_ENABLE_DEFAULT_RULES}" = "True" ] && echo prometheus.rules || echo prometheus.null.rules)'

OPTIONAL_VARS="
$OPTIONAL_VARS
\$PROMETHEUS_LONGTERM_SCRAPE_INTERVAL
"

export DELAYED_EVAL="
$DELAYED_EVAL
PROMETHEUS_LONGTERM_RULES_FILE
"

COMPONENT_DEPENDENCIES="
./components/monitoring
Collaborator

Oh this dependency will make the new Prometheus long term not able to run standalone.

Collaborator Author (@mishaschwartz, Jun 18, 2024)

No it cannot run standalone, it collects metrics from the services that are enabled in the monitoring component.

Collaborator

Say I already have a bunch of PAVICS servers running with monitoring enabled and I just want to point this new Prometheus to aggregate all the data?

Can be in a follow up PR.

Collaborator Author

The goal of this PR is to create a method for saving existing prometheus data over the long term for a single birdhouse/PAVICS deployment. If you want something that will collect prometheus data for multiple servers I would recommend creating a new repository to host that code.

Collaborator

Since the architecture here is pluggable, we do not need a separate repo that duplicates the work. To deploy only the 2nd Prometheus on a separate machine, as I see it we would only enable the proxy, the prometheus-longterm-metrics and, optionally, the thanos components on the new machine, and that's it.

But agree we can make this "standalone" support in a separate PR.

"
@@ -0,0 +1,32 @@
version: "3.4"

x-logging:
  &default-logging
  driver: "json-file"
  options:
    max-size: "50m"
    max-file: "10"

services:
  prometheus-longterm-metrics:
    image: ${PROMETHEUS_IMAGE}
    container_name: prometheus-longterm-metrics
    volumes:
      - ./optional-components/prometheus-longterm-metrics/prometheus.yml:/etc/prometheus/prometheus.yml:ro
      - prometheus_longterm_persistence:/prometheus:rw
    command:
      - --config.file=/etc/prometheus/prometheus.yml
      - --storage.tsdb.path=/prometheus
      - --web.console.libraries=/usr/share/prometheus/console_libraries
      - --web.console.templates=/usr/share/prometheus/consoles
      - --storage.tsdb.retention.time=${PROMETHEUS_LONGTERM_RETENTION_TIME}
      - --web.external-url=https://${BIRDHOUSE_FQDN_PUBLIC}/prometheus-longterm-metrics/
      - --storage.tsdb.min-block-duration=${PROMETHEUS_LONGTERM_TSDB_MIN_BLOCK_DURATION}
      - --storage.tsdb.max-block-duration=${PROMETHEUS_LONGTERM_TSDB_MAX_BLOCK_DURATION}
    restart: always
    logging: *default-logging

volumes:
  prometheus_longterm_persistence:
    external:
      name: prometheus_longterm_persistence
@@ -0,0 +1,3 @@
#!/bin/sh -x

docker volume create prometheus_longterm_persistence # metrics db
@@ -0,0 +1,18 @@
global:
  external_labels:
    instance_name: prometheus-longterm-metrics

scrape_configs:
  - job_name: 'federate'
    scrape_interval: ${PROMETHEUS_LONGTERM_SCRAPE_INTERVAL}

    honor_labels: true
    metrics_path: '/prometheus/federate'

    params:
      'match[]':
        - '{group="longterm-metrics"}'

    static_configs:
      - targets:
        - 'prometheus:9090'
Collaborator (@tlvu, Jun 18, 2024)

If you make this target list a template expansion variable, then we can easily add more hosts, overriding this default local Prometheus and pointing to other hosts, e.g.:

- existing_pavics_1:9090
- existing_pavics_2:9090
- ...

This assumes I open up the Prometheus port on existing_pavics_1 and existing_pavics_2 and manually handle the inclusion of prometheus.rules on existing_pavics_1 and existing_pavics_2.

Then you can remove the dependency on ./components/monitoring.

Just add a note in the README to enable ./components/monitoring together with this ./optional-components/prometheus-longterm-metrics component only if the admin wants both Prometheus to be on the same machine.

Collaborator Author

I think this is for a separate discussion and out of the scope of this PR.

We can discuss the best way to handle multiple targets later on but at the very least we need to consider the security implications of recommending users open up additional ports.

I really think that any code that handles monitoring multiple prometheus endpoints should go in a separate repo.
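
For anyone who wants to see what the single-target federation scrape in `prometheus.yml` requests, the sketch below builds the equivalent `curl` command. The host, port, path and matcher are taken from the `prometheus.yml` in this PR; `federate_cmd` is a hypothetical helper, and the command is only printed because the `prometheus` host name is assumed to resolve only from inside the deployment's docker network.

```shell
#!/bin/sh
# Values copied from the prometheus.yml added in this PR.
FEDERATE_URL="http://prometheus:9090/prometheus/federate"
MATCHER='{group="longterm-metrics"}'

# Hypothetical helper: print (not run) the curl command equivalent to the scrape.
# -G sends the url-encoded data as a query string on a GET request.
federate_cmd() {
    printf "curl -sG '%s' --data-urlencode 'match[]=%s'\n" "$FEDERATE_URL" "$MATCHER"
}

federate_cmd
```

Running the printed command from a container on the same network should return only the series that carry the `group="longterm-metrics"` label.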

2 changes: 2 additions & 0 deletions birdhouse/optional-components/thanos/.gitignore
@@ -0,0 +1,2 @@
config/magpie/config.yml
config/proxy/conf.extra-service.d/monitoring.conf
@@ -0,0 +1,28 @@
providers:
  thanos:
    # below URL is only used to fill in the required location in Magpie
    # actual auth validation is performed with Twitcher 'verify' endpoint without accessing this proxied URL
    url: http://proxy:80
    title: Thanos
    public: true
    c4i: false
    type: api
    sync_type: api

permissions:
  - service: thanos
    permission: read
    group: administrators
    action: create
  - service: thanos
    permission: write
    group: administrators
    action: create
  - service: thanos
    permission: read
    group: monitoring
    action: create
  - service: thanos
    permission: write
    group: monitoring
    action: create
@@ -0,0 +1,7 @@
version: "3.4"

services:
  magpie:
    volumes:
      - ./optional-components/thanos/config/magpie/config.yml:${MAGPIE_PERMISSIONS_CONFIG_PATH}/thanos.yml:ro
      - ./optional-components/thanos/config/magpie/config.yml:${MAGPIE_PROVIDERS_CONFIG_PATH}/thanos.yml:ro
@@ -0,0 +1,38 @@
    location /thanos-query {
        auth_request /secure-thanos-auth;
        auth_request_set $auth_status $upstream_status;
        proxy_pass http://thanos-query:19192;
        proxy_set_header Host $host;
    }

    location /thanos-minio/ {
        auth_request /secure-thanos-auth;
        auth_request_set $auth_status $upstream_status;

        rewrite ^/thanos-minio/(.*) /$1 break;
        proxy_pass http://thanos-minio:9001;

        proxy_http_version 1.1;
        proxy_set_header Upgrade $http_upgrade;
        proxy_set_header Connection "Upgrade";
        proxy_set_header Host $host;
        proxy_set_header X-Real-IP $remote_addr;
        proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
        proxy_set_header X-Forwarded-Proto $scheme;

        # This allows WebSocket connections
        proxy_set_header Upgrade $http_upgrade;
        proxy_set_header Connection "upgrade";
    }

    location = /secure-thanos-auth {
        internal;
        proxy_pass https://${BIRDHOUSE_FQDN_PUBLIC}${TWITCHER_VERIFY_PATH}/thanos$request_uri;
        proxy_pass_request_body off;
        proxy_set_header Host $host;
        proxy_set_header Content-Length "";
        proxy_set_header X-Original-URI $request_uri;
        proxy_set_header X-Forwarded-Proto $real_scheme;
        proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
        proxy_set_header X-Forwarded-Host $host:$server_port;
    }
@@ -0,0 +1,6 @@
version: "3.4"

services:
  proxy:
    volumes:
      - ./optional-components/thanos/config/proxy/conf.extra-service.d:/etc/nginx/conf.extra-service.d/thanos:ro