Commit 4d34d30

Changes downloader alert to fire after 72h, not 21h (#1035)
The downloader used to download databases about every 8h. We have reconfigured the downloader to only download about every 24h. Additionally, there is no need for this alert to be super sensitive. It is okay if annotations happen with databases that are only a couple of days old. However, if the downloader can't download all the files for more than 3d, then fire an alert.
1 parent cb76903 commit 4d34d30

1 file changed

+6
-9
lines changed

config/federation/prometheus/alerts.yml

@@ -74,21 +74,18 @@ groups:
         the switch itself, or with the transit provider.
       dashboard: 'https://grafana.mlab-oti.measurementlab.net/d/SuqnZ6Hiz/?orgId=1&var-site_name={{$labels.site}}'
 
-# DownloaderIsFailingToUpdate: The downloader hasn't successfully retrieved the files in
-# at least 21 hours, meaning that at least the last two download attempts have failed.
+# DownloaderIsFailingToUpdate: The downloader hasn't successfully retrieved
+# all databases in more than 72 hours.
   - alert: DownloaderIsFailingToUpdate
-    expr: time() - downloader_last_success_time_seconds > (21 * 60 * 60)
+    expr: time() - downloader_last_success_time_seconds > (72 * 60 * 60)
     for: 1h
     labels:
       repo: dev-tracker
       severity: ticket
       cluster: prometheus-federation
     annotations:
-      summary: Neither of the last two attempts to download the maxmind or
-        routeviews feeds were successful.
-      description: Check for errors with the downloader service on grafana with
-        the downloader_error_count metric, or check the stackdriver logs for
-        the downloader cluster.
+      summary: The downloader hasn't successfully retrieved all databases in more than 3d
+      description: https://github.com/m-lab/ops-tracker/wiki/Alerts-&-Troubleshooting#downloaderisfailingtoupdate
       dashboard: https://grafana.mlab-oti.measurementlab.net/d/ZGuYht1mk/
 
 # DownloaderNotRunning: The downloader cluster crashed and not running at all.
@@ -1365,4 +1362,4 @@ groups:
     annotations:
       summary: Daily storage costs are 50% over the average storage costs for the month.
       description: https://github.com/m-lab/ops-tracker/wiki/Alerts-&-Troubleshooting#billing_dailystorageincrease
-      dashboard: https://grafana.mlab-oti.measurementlab.net/d/d8145875-e912-484e-b8f2-b77f63bd28a3/cloud-storage-usage?orgId=1
+      dashboard: https://grafana.mlab-oti.measurementlab.net/d/d8145875-e912-484e-b8f2-b77f63bd28a3/cloud-storage-usage?orgId=1
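
To make the threshold arithmetic in the DownloaderIsFailingToUpdate change concrete, here is a minimal Python sketch of the new condition. It is an illustration only, not how Prometheus evaluates rules: the function names are hypothetical, the rule evaluation interval is ignored, and it assumes no successful download occurs between the last success and the moment being checked.

```python
# Rough model of the alert in the diff. Names are hypothetical; this is a
# sketch of the arithmetic, not Prometheus's rule evaluator.

THRESHOLD_SECONDS = 72 * 60 * 60  # from the new expr: 259200 seconds (3 days)
FOR_SECONDS = 1 * 60 * 60         # from `for: 1h`


def condition_holds(now: float, last_success: float) -> bool:
    """Mirror of: time() - downloader_last_success_time_seconds > (72 * 60 * 60)."""
    return now - last_success > THRESHOLD_SECONDS


def would_fire(now: float, last_success: float) -> bool:
    """Approximate firing check: the condition has held for the `for` window.

    Assumes the condition became true right after
    last_success + THRESHOLD_SECONDS and has stayed true since.
    """
    return now - last_success > THRESHOLD_SECONDS + FOR_SECONDS


if __name__ == "__main__":
    hour = 60 * 60
    for age_hours in (24, 48, 72.5, 74):
        now = age_hours * hour
        print(age_hours, condition_holds(now, 0), would_fire(now, 0))
```

Under these assumptions, two consecutive missed daily downloads (an age of about 48h) leave the alert silent, and it fires only after roughly 73h without a success. With the previous 21h threshold and a download about every 24h, the condition could become true between otherwise healthy downloads, which is consistent with the commit's reason for relaxing it.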
