Releases: m-lab/prometheus-support

Blackbox_exporter probes now time out at 9s instead of 5s

12 Mar 20:54
705024b

This is a small release that does one principal thing: it changes the timeout for all blackbox_exporter probes to 9s. Previously, many were at 5s, which is likely not enough for some of our less well-provisioned sites in far-flung places. Indeed, for some of those even 9s might not be enough, but nearly doubling the current timeout will be an improvement.

The one other small change is that the "Ops: Platform Overview" Grafana dashboard link was updated in alerts.yml.

Weekly release: 2018-02-06 to 2018-02-13

13 Feb 18:25
fd908e2

This release brings three changes:

  • Scraping of the Prom node_exporter instances running in the script_exporter and snmp_exporter GCE instances.

  • New alerts for the script_exporter job and metrics.

  • Imports the JSON for the Grafana dashboard "Ops: Switch Overview".

For the full list of changes, see the diff between this release and the last.

Weekly release: 2018-01-29 to 2018-02-06

06 Feb 16:16
49f1671
  • Adds a new JSON Grafana dashboard for the paris traceroute pipeline: Pipeline_PT.json
  • Adds 4 new Prometheus recording rules for switch discard metrics.
  • Re-merges the script-exporter scrape jobs so that all script-exporter targets are scraped at 1m intervals again. To avoid end-to-end testing every minute, the ndt_e2e script caches its result and refreshes the cache only every 10 minutes, unless the service is down, in which case it retests on every probe (i.e., every minute).
  • Adds an explicit version to the gcp-service-discovery Docker image (v.1.0)

See the change details by viewing the CS between the previous release and this one.
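
The ndt_e2e caching behavior described in this release can be illustrated with a minimal Python sketch. This is not the actual ndt_e2e script: the class name `CachedProbe`, the `run_test` callable, and the injectable `clock` are hypothetical names introduced here for illustration. Only the 10-minute refresh and the retest-while-down rule come from the release note.

```python
import time
from typing import Callable, Optional

CACHE_TTL_SECONDS = 600  # 10 minutes, per the release note


class CachedProbe:
    """Caches a passing result; retests on every probe while the service is down.

    Hypothetical illustration only, not the real ndt_e2e implementation.
    """

    def __init__(
        self,
        run_test: Callable[[], bool],
        ttl: float = CACHE_TTL_SECONDS,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self._run_test = run_test  # the expensive end-to-end test
        self._ttl = ttl
        self._clock = clock
        self._last_ok: Optional[bool] = None  # None until the first test runs
        self._last_run = 0.0

    def probe(self) -> bool:
        """Called on every scrape (assumed to be once per minute)."""
        now = self._clock()
        fresh = self._last_ok is not None and (now - self._last_run) < self._ttl
        if fresh and self._last_ok:
            # Service was up at the last test and the cache has not expired.
            return True
        # Cache is empty, expired, or holds a failure: run the real test.
        self._last_ok = self._run_test()
        self._last_run = now
        return self._last_ok
```

With a one-minute scrape interval, a healthy service is tested about once every ten scrapes, while a failing service is retested on every scrape, so a recovery is reflected at the next probe rather than after the cache expires.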

Point release: sets ndt_e2e script-targets to scrape only every 10m

30 Jan 00:23
3535918

This release only contains one change from the last release. It splits the script-targets job into two jobs so that the job which probes the ndt_e2e script only runs once every 10 minutes, instead of every minute.

Weekly release: 2018-01-22 to 2018-01-29

29 Jan 20:44
7513631

  • Better memory usage for prometheus container in k8s
  • Update ClusterDown to prevent false-positive alerts during cluster node auto-upgrades
  • Support for loading dashboards-as-code via JSON-export.

First JSON dashboards:

  • Alert_ParserDailyVolumeTooLow.json
  • NDT_EarlyWarning.json
  • NDT_GlobalTestRate.json
  • Prometheus_SelfMonitoring.json

Weekly release: 2018-01-02 to 2018-01-22

22 Jan 17:52
952852b

This release includes:

  • Add alerts to monitor nagios-oam.measurementlab.net
  • Reduce k8s resource allocations slightly so prometheus continues to run after k8s upgrades.
  • Improvement to ParserDailyVolumeTooLow alert

Weekly release v1.10

02 Jan 22:27
f604143

This release improves two alerts:

  • ScraperMostRecentArchivedFileTimeIsTooOld -- adding a longer threshold before firing to prevent false positives.
  • SnmpExporterMissingMetrics -- reports when the snmp exporter is not collecting any metrics.

This release includes two fixes or updates:

  • The blackbox exporter resource allocation is increased to work around a known issue in the Go GC with the latest build -- prometheus/blackbox_exporter#270
  • The alertmanager github receiver URL is now set unconditionally, even though the receiver should not run in sandbox or staging, because the alertmanager configuration is stricter and rejects the empty string for the webhook URL.

Weekly release includes: upgrades, alert refinements, and new recording rules.

19 Dec 19:57
075806c

This release adds improvements to alerts, new recording rules, and improved usability of switch snmp metrics.

The ParserDailyVolumeTooLow alert should have improved accuracy.

The blackbox exporter is upgraded from v0.4 to v0.11, including corresponding configuration changes.

We now have a large number of recording rules, used by the NDT Early Warning dashboards, that identify sites with less than 2x site headroom capacity. These rules have already helped prioritize 10g upgrades for specific sites in the US.

Add base recording rules for NDT early warning

12 Dec 20:16
78744dc

This release adds recording rules for the ndt early warning dashboards.

This release updates the version of Alertmanager to hyperlink URLs in alert annotations.

This release updates the github receiver, which now leaves issues open after an alert resolves; they must be closed manually.

Prometheus 1.8 and Grafana 4.6.2 upgrades

04 Dec 17:56
f425e1a

This release of prometheus-support includes minor version upgrades to the Prometheus and Grafana servers, as well as an update of the bigquery-exporter to v0.3.

This release also includes multiple bigquery exporter query updates: ipv6 bias, ndt server metrics, and ndt test counts (which were running ad hoc in mlab-oti).

Resource changes:

  • The blackbox exporter CPU allocation is now 1 CPU, to resolve suspected overload.

Alert changes:

  • ParserDailyVolumeTooLow added to track the pipeline daily volume.
  • DownloaderDownOrMissing added to report if the downloader is not running at all. Coincides with the production deployment of downloader.
  • ScraperMostRecentArchivedFileTimeIsTooOld will now fire only after 56 hours (instead of 36) to allow for rsyncd config updates. This reduces redundancy with the ParserDailyVolumeTooLow alert and lowers the frequency of this scraper alert, which is currently our most common one and has a reputation for auto-closing without any further action.