Releases · m-lab/prometheus-support

12 Mar 20:54

nkinkade

705024b

Blackbox_exporter probes now time out at 9s instead of 5s

This is a small release that does one principal thing: it changes the timeout for all blackbox_exporter probes to 9s. Previously, many were at 5s, which is likely not enough for some of our less well-provisioned sites in far-flung places. Indeed, for some of those sites even 9s might not be enough, but nearly doubling the current timeout will be an improvement.

The one other small change is that the "Ops: Platform Overview" Grafana dashboard link was updated in alerts.yml.


13 Feb 18:25

nkinkade

production/1.15

fd908e2

Weekly release: 2018-02-06 to 2018-02-13

This release brings three changes:

Scraping of the Prom node_exporter instances running in the script_exporter and snmp_exporter GCE instances.
New alerts for the script_exporter job and metrics.
Imports the JSON for the Grafana dashboard "Ops: Switch Overview".

For the full list of changes, see the diff between this release and the last.


06 Feb 16:16

nkinkade

production/1.14

49f1671

Weekly release: 2018-01-29 to 2018-02-06

Adds a new JSON Grafana dashboard for the paris traceroute pipeline: Pipeline_PT.json
Adds 4 new Prometheus recording rules for switch discard metrics.
Re-merges the script-exporter scrape jobs such that all script-exporter targets will get scraped at 1m intervals again. For the ndt_e2e script, the mitigation to avoid end-to-end testing every minute is for the ndt_e2e script to cache the result and only refresh the cache every 10 minutes, unless the service is down, in which case it will retest every probe (i.e., every minute).
Adds an explicit version to the gcp-service-discovery Docker image (v.1.0)
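
The caching behaviour described for the ndt_e2e script can be illustrated with a minimal sketch. This is not the actual ndt_e2e implementation: the class name `CachedProbe`, the `run_test` callable, and the injectable `clock` are hypothetical names chosen for illustration. It only shows the idea of serving a cached healthy result for up to 10 minutes while retesting on every probe when the last result was a failure.

```python
import time
from typing import Callable, Optional

# Assumption: a healthy result is reused for 10 minutes, per the release note.
CACHE_TTL_SECONDS = 10 * 60


class CachedProbe:
    """Caches a healthy probe result; retests on every call while the service is down."""

    def __init__(
        self,
        run_test: Callable[[], bool],
        clock: Callable[[], float] = time.monotonic,
        ttl: float = CACHE_TTL_SECONDS,
    ) -> None:
        self._run_test = run_test      # hypothetical stand-in for the expensive end-to-end test
        self._clock = clock            # injectable so the logic can be tested without sleeping
        self._ttl = ttl
        self._last_ok: Optional[bool] = None
        self._last_time = 0.0

    def probe(self) -> bool:
        now = self._clock()
        fresh = self._last_ok is not None and (now - self._last_time) < self._ttl
        # Serve from cache only when the cached result is both fresh and healthy.
        if fresh and self._last_ok:
            return True
        # Otherwise (no result yet, stale cache, or service was down) run the real test.
        self._last_ok = self._run_test()
        self._last_time = now
        return self._last_ok
```

With a 1m scrape interval, a wrapper like this would run the real test roughly once every 10 minutes while the service is healthy, and once per scrape while it is down.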

See the change details by viewing the CS between the previous release and this one.


30 Jan 00:23

nkinkade

production/1.13

3535918

Point release: sets ndt_e2e script-targets to scrape only every 10m

This release only contains one change from the last release. It splits the script-targets job into two jobs so that the job which probes the ndt_e2e script only runs once every 10 minutes, instead of every minute.


29 Jan 20:44

stephen-soltesz

production/1.12

7513631

Weekly release: 2018-01-22 to 2018-01-29

Better memory usage for prometheus container in k8s
Update ClusterDown to prevent false-positive alerts during cluster node auto-upgrades
Support for loading dashboards-as-code via JSON-export.
First JSON dashboards:

Alert_ParserDailyVolumeTooLow.json
NDT_EarlyWarning.json
NDT_GlobalTestRate.json
Prometheus_SelfMonitoring.json


22 Jan 17:52

stephen-soltesz

production/1.11

952852b

Weekly release: 2018-01-02 to 2018-01-22

This release includes:

Add alerts to monitor nagios-oam.measurementlab.net
Reduce k8s resource allocations slightly so prometheus continues to run after k8s upgrades.
Improvement to ParserDailyVolumeTooLow alert


02 Jan 22:27

stephen-soltesz

production/1.10

f604143

Weekly release v1.10

This release improves two alerts:

ScraperMostRecentArchivedFileTimeIsTooOld -- adding a longer threshold before firing to prevent false positives.
SnmpExporterMissingMetrics -- reports when the snmp exporter is not collecting any metrics.

This release includes two fixes or updates:

The blackbox exporter resource allocation is increased to work around a known issue in the Go GC with the latest build -- prometheus/blackbox_exporter#270
The alertmanager github receiver URL is now set unconditionally, even though the receiver should not run in sandbox or staging, because the alertmanager configuration is stricter and rejects an empty string for the webhook URL.


19 Dec 19:57

stephen-soltesz

production/1.9

075806c

Weekly release includes: upgrades, alert refinements, and new recording rules.

This release adds improvements to alerts, new recording rules, and improved usability of switch snmp metrics.

The ParserDailyVolumeTooLow alert should have improved accuracy.

The blackbox exporter is upgraded from v0.4 to v0.11, including corresponding configuration changes.

We have a large number of recording rules, used by the NDT Early Warning dashboards, that identify sites with less than 2x site headroom capacity. These rules have already helped prioritize 10g upgrades for specific sites in the US.
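
As a rough illustration of the headroom check, the sketch below flags sites whose capacity is less than twice their peak usage. The definition of headroom used here (capacity divided by peak usage), the function name, the input dictionaries, and the site names are all assumptions for illustration; the actual recording rules are written in Prometheus and may compute this differently.

```python
from typing import Dict, List

# Assumption: "2x headroom" means capacity is at least twice the observed peak.
HEADROOM_FACTOR = 2.0


def sites_below_headroom(
    capacity_bps: Dict[str, float],
    peak_bps: Dict[str, float],
    factor: float = HEADROOM_FACTOR,
) -> List[str]:
    """Return the sites whose capacity is less than `factor` times their peak usage."""
    flagged = []
    for site, capacity in capacity_bps.items():
        peak = peak_bps.get(site, 0.0)
        # Sites with no recorded traffic cannot be short on headroom.
        if peak > 0 and capacity < factor * peak:
            flagged.append(site)
    return sorted(flagged)


# Hypothetical sites: a 1 Gbps uplink peaking at 600 Mbps, and a 10 Gbps uplink
# peaking at the same rate. Only the first has less than 2x headroom.
example = sites_below_headroom(
    {"site-a": 1e9, "site-b": 1e10},
    {"site-a": 6e8, "site-b": 6e8},
)
```

Under these assumptions, a site flagged this way would be a candidate for the kind of uplink upgrade mentioned above.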


12 Dec 20:16

stephen-soltesz

production/1.8

78744dc

Add base recording rules for NDT early warning

This release adds recording rules for the ndt early warning dashboards.

This release updates the version of Alertmanager to hyperlink URLs in alert annotations.

This release updates the github receiver, which now leaves issues open after an alert resolves, so they must be closed manually.


04 Dec 17:56

stephen-soltesz

production/1.7

f425e1a

Prometheus 1.8 and Grafana 4.6.2 upgrades

This release of prometheus-support includes minor version upgrades to the Prometheus and Grafana servers, as well as an update of the bigquery-exporter to v0.3.

This release also includes multiple bigquery exporter query updates: ipv6 bias, ndt server metrics, and ndt test counts (which was running in mlab-oti ad hoc).

Resource changes:

The blackbox exporter CPU allocation is now 1 CPU to resolve suspected overload.

Alert changes:

ParserDailyVolumeTooLow added to track the pipeline daily volume.
DownloaderDownOrMissing added to report if the downloader is not running at all. Coincides with the production deployment of downloader.
ScraperMostRecentArchivedFileTimeIsTooOld will now fire only after 56 hours (instead of 36) to allow for rsyncd config updates. This reduces redundancy with the ParserDailyVolumeTooLow alert and lowers the frequency of this scraper alert, which is currently our most common one and has a reputation for auto-closing without anyone having to do anything.
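
To make the threshold change concrete, here is a small sketch of the age comparison behind an alert like ScraperMostRecentArchivedFileTimeIsTooOld. The real alert is a Prometheus rule; the function name and arguments below are hypothetical, and only the 36 and 56 hour values come from the release note.

```python
from datetime import datetime, timedelta, timezone

OLD_THRESHOLD = timedelta(hours=36)  # previous value, per the release note
NEW_THRESHOLD = timedelta(hours=56)  # value introduced in this release


def archive_is_too_old(
    most_recent_archive: datetime,
    now: datetime,
    threshold: timedelta = NEW_THRESHOLD,
) -> bool:
    """Return True when the newest archived file is older than the threshold."""
    return now - most_recent_archive > threshold


# An archive that is 40 hours old would have fired under the old threshold
# but stays quiet under the new one.
now = datetime(2017, 12, 4, 18, 0, tzinfo=timezone.utc)
forty_hours_ago = now - timedelta(hours=40)
fired_before = archive_is_too_old(forty_hours_ago, now, OLD_THRESHOLD)
fires_now = archive_is_too_old(forty_hours_ago, now)
```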
