Changes downloader alert to fire after 72h, not 21h (#1035)
The downloader used to download databases about every 8h. We have reconfigured
the downloader to download only about every 24h. Additionally, there is no need
for this alert to be super sensitive. It is okay if annotations happen with
databases that are only a couple of days old. However, if the downloader can't
download all the files for more than 3d, then fire an alert.
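
The threshold arithmetic behind this change can be sketched as follows. This is only an illustration and not part of the commit: the constant and function names are hypothetical, the download intervals are the approximate figures from the message above, and the evaluation-interval granularity of Prometheus is ignored.

```python
# Illustrative sketch only; names below are hypothetical, not from the repo.

OLD_THRESHOLD_SECONDS = 21 * 60 * 60  # old expr: > (21 * 60 * 60)
NEW_THRESHOLD_SECONDS = 72 * 60 * 60  # new expr: > (72 * 60 * 60)
FOR_SECONDS = 60 * 60                 # the rule's `for: 1h` clause


def condition_true(now: float, last_success: float, threshold: int) -> bool:
    """Mirror of `time() - downloader_last_success_time_seconds > threshold`."""
    return now - last_success > threshold


def earliest_fire_time(last_success: float, threshold: int) -> float:
    """Earliest moment the alert can fire: the condition must first become
    true and then stay true for the whole `for` window."""
    return last_success + threshold + FOR_SECONDS


def full_intervals_covered(threshold_hours: int, interval_hours: int) -> int:
    """How many whole download intervals fit inside the threshold."""
    return threshold_hours // interval_hours


if __name__ == "__main__":
    # Old setup: roughly 8h interval with a 21h threshold.
    print(full_intervals_covered(21, 8))
    # New setup: roughly 24h interval with a 72h threshold.
    print(full_intervals_covered(72, 24))
    # Seconds after the last success before the new alert can fire.
    print(earliest_fire_time(0, NEW_THRESHOLD_SECONDS))
```

With the old 21h threshold and a download roughly every 24h, the condition would already be true before the next scheduled download, which is why the threshold had to move along with the schedule.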
nkinkade authored Mar 20, 2024
1 parent cb76903 commit 4d34d30
Showing 1 changed file with 6 additions and 9 deletions.
15 changes: 6 additions & 9 deletions config/federation/prometheus/alerts.yml
@@ -74,21 +74,18 @@ groups:
the switch itself, or with the transit provider.
dashboard: 'https://grafana.mlab-oti.measurementlab.net/d/SuqnZ6Hiz/?orgId=1&var-site_name={{$labels.site}}'

# DownloaderIsFailingToUpdate: The downloader hasn't successfully retrieved the files in
# at least 21 hours, meaning that at least the last two download attempts have failed.
# DownloaderIsFailingToUpdate: The downloader hasn't successfully retrieved
# all databases in more than 72 hours.
- alert: DownloaderIsFailingToUpdate
expr: time() - downloader_last_success_time_seconds > (21 * 60 * 60)
expr: time() - downloader_last_success_time_seconds > (72 * 60 * 60)
for: 1h
labels:
repo: dev-tracker
severity: ticket
cluster: prometheus-federation
annotations:
summary: Neither of the last two attempts to download the maxmind or
routeviews feeds were successful.
description: Check for errors with the downloader service on grafana with
the downloader_error_count metric, or check the stackdriver logs for
the downloader cluster.
summary: The downloader hasn't successfully retrieved all databases in more than 3d
description: https://github.com/m-lab/ops-tracker/wiki/Alerts-&-Troubleshooting#downloaderisfailingtoupdate
dashboard: https://grafana.mlab-oti.measurementlab.net/d/ZGuYht1mk/

# DownloaderNotRunning: The downloader cluster crashed and not running at all.
@@ -1365,4 +1362,4 @@ groups:
annotations:
summary: Daily storage costs are 50% over the average storage costs for the month.
description: https://github.com/m-lab/ops-tracker/wiki/Alerts-&-Troubleshooting#billing_dailystorageincrease
dashboard: https://grafana.mlab-oti.measurementlab.net/d/d8145875-e912-484e-b8f2-b77f63bd28a3/cloud-storage-usage?orgId=1
dashboard: https://grafana.mlab-oti.measurementlab.net/d/d8145875-e912-484e-b8f2-b77f63bd28a3/cloud-storage-usage?orgId=1
