Releases: m-lab/prometheus-support
Blackbox_exporter probes now timeout at 9s instead of 5s
This is a small release that does one principal thing: it changes the timeout for all blackbox_exporter probes to 9s. Previously, many were at 5s, which is likely not enough for some of our less well-provisioned sites in far-flung places. Indeed, for some of those sites even 9s might not be enough, but nearly doubling the current timeout will be an improvement.
The one other small change is that the "Ops: Platform Overview" Grafana dashboard link was updated in alerts.yml.
Weekly release: 2018-02-06 to 2018-02-13
The release brings three new changes:
- Scraping of the Prom node_exporter instances running in the script_exporter and snmp_exporter GCE instances.
- New alerts for the script_exporter job and metrics.
- Imports the JSON for the Grafana dashboard "Ops: Switch Overview".
For the full list of changes, see the diff between this release and the last.
Weekly release: 2018-01-29 to 2018-02-06
- Adds a new JSON Grafana dashboard for the paris traceroute pipeline: Pipeline_PT.json
- Adds 4 new Prometheus recording rules for switch discard metrics.
- Re-merges the script-exporter scrape jobs such that all script-exporter targets will get scraped at 1m intervals again. To avoid end-to-end testing every minute, the ndt_e2e script caches its result and refreshes the cache only every 10 minutes, unless the service is down, in which case it retests on every probe (i.e., every minute).
- Adds an explicit version to the gcp-service-discovery Docker image (v.1.0)
See the change details by viewing the CS between the previous release and this one.
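The ndt_e2e caching behavior described in this release can be sketched roughly as follows. This is a minimal illustration, not the actual ndt_e2e script: the `CachedProbe` class, the `run_test` callable, and the injectable `clock` are all hypothetical names introduced here, and the only details taken from the release notes are the 10-minute refresh interval and the retest-on-every-probe behavior while the service is down.

```python
import time
from typing import Callable, Optional

CACHE_TTL_SECONDS = 10 * 60  # refresh a healthy result at most every 10 minutes


class CachedProbe:
    """Caches a probe result; retests on every call while the service is down.

    Hypothetical sketch: `run_test` stands in for the real end-to-end test
    and returns True when the service is healthy.
    """

    def __init__(
        self,
        run_test: Callable[[], bool],
        ttl: float = CACHE_TTL_SECONDS,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self._run_test = run_test
        self._ttl = ttl
        self._clock = clock
        self._last_ok: Optional[bool] = None  # None means "never tested"
        self._last_run = 0.0

    def probe(self) -> bool:
        """Called on every scrape (assumed to be once per minute)."""
        now = self._clock()
        expired = (now - self._last_run) >= self._ttl
        # Re-run the expensive test if there is no result yet, if the last
        # result was a failure, or if the cached success is older than the TTL.
        if self._last_ok is None or not self._last_ok or expired:
            self._last_ok = self._run_test()
            self._last_run = now
        return self._last_ok
```

With a 1m scrape interval, a healthy service is tested end to end only once per 10 minutes, while a failing service is retested on every scrape so that recovery is noticed quickly.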
Point release: sets ndt_e2e script-targets to scrape only every 10m
This release contains only one change from the last release. It splits the script-targets job into two jobs so that the job which probes the ndt_e2e script runs only once every 10 minutes, instead of every minute.
Weekly release: 2018-01-22 to 2018-01-29
Better memory usage for prometheus container in k8s
Update ClusterDown to prevent false-positive alerts during cluster node auto-upgrades
Support for loading dashboards-as-code via JSON-export.
First JSON dashboards:
- Alert_ParserDailyVolumeTooLow.json
- NDT_EarlyWarning.json
- NDT_GlobalTestRate.json
- Prometheus_SelfMonitoring.json
Weekly release: 2018-01-02 to 2018-01-22
This release includes:
- Add alerts to monitor nagios-oam.measurementlab.net
- Reduce k8s resource allocations slightly so prometheus continues to run after k8s upgrades.
- Improvement to the ParserDailyVolumeTooLow alert
Weekly release v1.10
This release improves two alerts:
- ScraperMostRecentArchivedFileTimeIsTooOld -- adding a longer threshold before firing to prevent false positives.
- SnmpExporterMissingMetrics -- reports when the snmp exporter is not collecting any metrics.
This release includes two fixes or updates:
- The blackbox exporter resource allocation is increased to work around a known issue in the Go GC with the latest build -- prometheus/blackbox_exporter#270
- The alertmanager github receiver URL is now set unconditionally, even though the receiver should not run in sandbox or staging, because the alertmanager configuration is stricter and rejects the empty string for the webhook URL.
Weekly release includes: upgrades, alert refinements, and new recording rules.
This release adds improvements to alerts, new recording rules, and improved usability of switch snmp metrics.
The ParserDailyVolumeTooLow alert should have improved accuracy.
The blackbox exporter is upgraded from v0.4 to v0.11, including corresponding configuration changes.
We have a large number of recording rules, used by the NDT Early Warning dashboards, that identify sites with less than 2x site headroom capacity. These rules have already helped prioritize 10g upgrades for specific sites in the US.
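As a rough illustration of the kind of check the early warning rules perform, the sketch below flags sites whose uplink capacity is less than twice their peak utilization. The release notes do not define "headroom", so this ratio is an assumption, as are the function name, the site names, and the 1 Gbps capacity used in the usage example. The real logic lives in Prometheus recording rules, not Python.

```python
from typing import Dict, List


def sites_below_headroom(
    peak_bps_by_site: Dict[str, float],
    capacity_bps: float,
    min_headroom: float = 2.0,
) -> List[str]:
    """Return sites whose capacity / peak utilization is below min_headroom.

    Assumption: headroom is uplink capacity divided by peak utilization.
    Sites with no recorded traffic are skipped rather than flagged.
    """
    flagged = []
    for site, peak_bps in sorted(peak_bps_by_site.items()):
        if peak_bps > 0 and capacity_bps / peak_bps < min_headroom:
            flagged.append(site)
    return flagged


# Hypothetical sites on an assumed 1 Gbps uplink.
example_peaks = {"abc01": 600e6, "xyz02": 300e6}
example_flagged = sites_below_headroom(example_peaks, capacity_bps=1e9)
```

Here `abc01` peaks at 600 Mbps, leaving about 1.67x headroom, so it is flagged. `xyz02` has about 3.33x headroom and is not.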
Add base recording rules for NDT early warning
This release adds recording rules for the ndt early warning dashboards.
This release updates the version of Alertmanager to hyperlink URLs in alert annotations.
This release updates the github receiver, which now leaves issues open after resolution, so they must be closed manually.
Prometheus 1.8 and Grafana 4.6.2 upgrades
This release of prometheus-support includes minor version upgrades to the Prometheus and Grafana servers, as well as an update of the bigquery-exporter to v0.3.
This release also includes multiple bigquery exporter query updates: ipv6 bias, ndt server metrics, and ndt test counts (which were running ad hoc in mlab-oti).
Resource changes:
- The blackbox exporter CPU allocation is now 1 CPU to resolve suspected overload.
Alert changes:
- ParserDailyVolumeTooLow added to track the pipeline daily volume.
- DownloaderDownOrMissing added to report if the downloader is not running at all. Coincides with the production deployment of downloader.
- ScraperMostRecentArchivedFileTimeIsTooOld will now fire only after 56 hours (instead of 36) to allow for rsyncd config updates. This is to reduce redundancy with the ParserDailyVolumeTooLow alert and reduce the frequency of this scraper alert, which is currently our most common one and has a reputation for auto-closing without anyone doing anything.
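The effect of raising the ScraperMostRecentArchivedFileTimeIsTooOld threshold can be sketched as a simple age check. This is an illustration only: the actual alert is a Prometheus rule, and the function name and timestamps below are hypothetical. It also assumes that "fire only after 56 hours" refers to the age of the most recently archived file.

```python
from datetime import datetime, timedelta, timezone

OLD_THRESHOLD = timedelta(hours=36)  # previous threshold, per the release notes
NEW_THRESHOLD = timedelta(hours=56)  # threshold introduced by this release


def archive_is_too_old(
    most_recent_archive: datetime,
    now: datetime,
    threshold: timedelta = NEW_THRESHOLD,
) -> bool:
    """Return True when the newest archived file is older than the threshold."""
    return now - most_recent_archive > threshold


# Hypothetical example: the newest archive is 40 hours old.
example_now = datetime(2017, 11, 10, 12, 0, tzinfo=timezone.utc)
example_archive = example_now - timedelta(hours=40)
fires_with_old = archive_is_too_old(example_archive, example_now, OLD_THRESHOLD)
fires_with_new = archive_is_too_old(example_archive, example_now)
```

A 40-hour-old archive would have triggered the alert under the 36-hour threshold but stays quiet under the 56-hour one.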