Merge pull request #148 from m-lab/sandbox-soltesz
Enable more reliable recording rules using irate
stephen-soltesz authored Dec 9, 2017
2 parents 0a0a15d + 2f6e1b0 commit 78744dc
Showing 2 changed files with 17 additions and 5 deletions.
21 changes: 17 additions & 4 deletions config/federation/prometheus/rules.yml
@@ -20,11 +20,13 @@
 # DO:
 # * Do use raw prometheus expressions on the right hand side of a new rule.
 # * "Recording rules should be of the general form level:metric:operations."
+# * Do use irate with a range that is 4x scrape_interval.
 #
 # DO NOT:
 # * Do not use recording rules on the right hand side of a new rule.
 # * Do not overwrite a metric name with itself.
 # * Do not use 'label_replace' to overwrite a metric name.
+# * Do not use rate with a range less than 4x the scrape_interval.
 
 
 # Precalculate the increase of ipv4 and ipv6 sidestream connections.
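Several of the DO / DO NOT guidelines above can be checked mechanically. The following is a minimal Python sketch and is not part of this repository. It assumes a 1m scrape_interval, which is the interval used in the comments of this file. It also assumes single-unit durations such as 4m. The helper name check_rule and the regular expressions are hypothetical and only approximate PromQL syntax. The recorded-series check relies on the Prometheus convention that colons in metric names are reserved for recording rules.

```python
import re
from typing import List

# Seconds per unit for single-unit Prometheus durations such as "30s", "4m", "1h".
# Simplification: compound durations like "1h30m" are not handled here.
UNIT_SECONDS = {"s": 1, "m": 60, "h": 3600, "d": 86400, "w": 604800, "y": 31536000}

# A range selector such as the "[4m]" in irate(metric[4m]).
RANGE_RE = re.compile(r"\[(\d+)([smhdwy])\]")

# level:metric:operations, e.g. "switch:ifHCOutOctets:bps2m".
NAME_RE = re.compile(r"^[A-Za-z_]\w*:[A-Za-z_]\w*:\w+$")

# Any identifier containing a colon; colons are reserved for recording rules.
# Simplification: colons inside quoted label values would also match.
RECORDED_RE = re.compile(r"[A-Za-z_]\w*:[\w:]*")


def check_rule(name: str, expr: str, scrape_interval_s: int = 60) -> List[str]:
    """Return guideline violations for one recording rule (hypothetical helper)."""
    problems = []
    if not NAME_RE.match(name):
        problems.append(f"{name}: name is not of the form level:metric:operations")
    if RECORDED_RE.search(expr):
        problems.append(f"{name}: right hand side uses a recorded series")
    for value, unit in RANGE_RE.findall(expr):
        if int(value) * UNIT_SECONDS[unit] < 4 * scrape_interval_s:
            problems.append(f"{name}: range [{value}{unit}] is shorter than 4x scrape_interval")
    return problems


rule = "machine:inotify_extension_create:rpm2m"
# Old expression: flags the [2m] range (120s < 4 x 60s).
print(check_rule(rule, '60.0 * sum without(ext) (rate(inotify_extension_create_total{ext=~".*_snaplog"}[2m]))'))
# New expression: no violations, prints [].
print(check_rule(rule, '60.0 * sum without(ext) (irate(inotify_extension_create_total{ext=~".*_snaplog"}[4m]))'))
```

Passing a different scrape_interval_s lets the same check follow the configured interval instead of the assumed 1m.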
@@ -42,23 +44,34 @@ lsb:sidestream_connection_count:increase2m =
 
 
 ## NDT Early Warning aggregation rules.
+#
+# Rules are evaluated every global.evaluation_interval seconds. When
+# scrape_interval equals the evaluation_interval, there are potential races for
+# short range operators, e.g. 2m when the eval and scrape intervals are 1m. At
+# evaluation time, not every timeseries will contain 2 points in a 2m window.
+#
+# If we want to calculate the rate over 2m and increase the likelihood that we
+# see at least two points we must use irate with a larger window, e.g. 4x the
+# scrape interval. In our case this is 4m. irate only uses the last two samples
+# to calculate an instantaneous rate.
 
 # Per-machine inotify creation rates, using only c2s_snaplog + s2c_snaplog files.
 # Units: requests per minute.
 machine:inotify_extension_create:rpm2m =
 # NOTE: using 'without' instead of 'by' preserves all other labels.
-60.0 * sum without(ext) (rate(inotify_extension_create_total{ext=~".*_snaplog"}[2m]))
+60.0 * sum without(ext) (irate(inotify_extension_create_total{ext=~".*_snaplog"}[4m]))
 
 # TODO: aggregate on per-machine interface aliases when available.
-# Per-switch "Out" (i.e. Download) bits per second.
+# Per-switch "Out" (i.e. Download) bits per second. We use irate to calculate
+# rates over the last two samples only.
 # Units: bits per second.
 switch:ifHCOutOctets:bps2m =
-8 * rate(ifHCOutOctets{ifAlias="uplink"}[2m])
+8 * irate(ifHCOutOctets{ifAlias="uplink"}[4m])
 
 # Per-machine maximum ratio of time spent performing I/O on all devices.
 # Units: none
 machine:node_disk_io_time_ms:max_ratio2m =
-max without(device) (rate(node_disk_io_time_ms{service="nodeexporter"}[2m])) / 1000
+max without(device) (irate(node_disk_io_time_ms{service="nodeexporter"}[4m])) / 1000
 
 # NDT vserver disk quota utilization, 12 hour estimate.
 # Units: KB
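The following minimal Python model shows the irate reasoning in rules.yml: irate uses only the last two samples inside the range window. This is a sketch, not Prometheus source code. It assumes samples are sorted oldest first. It treats the window as left-open (t_eval - range, t_eval], which is a simplifying assumption. The comments in rules.yml describe the race as a scrape/evaluation timing effect. Here a missing scrape at t=120 stands in for that effect. This is a hypothetical scenario, chosen to show how a 2m window can hold a single point while a 4m window still holds two.

```python
from typing import List, Optional, Tuple

# (timestamp in seconds, cumulative counter value); lists are sorted oldest first.
Sample = Tuple[float, float]


def in_window(samples: List[Sample], t_eval: float, range_s: float) -> List[Sample]:
    """Samples with timestamps in (t_eval - range_s, t_eval].

    Treating the window as left-open is a simplifying assumption of this sketch.
    """
    return [s for s in samples if t_eval - range_s < s[0] <= t_eval]


def irate(samples: List[Sample], t_eval: float, range_s: float) -> Optional[float]:
    """Per-second rate computed from only the last two samples in the window.

    Returns None when fewer than two samples are in range; Prometheus likewise
    produces no output for such a series.
    """
    pts = in_window(samples, t_eval, range_s)
    if len(pts) < 2:
        return None
    (t_prev, v_prev), (t_last, v_last) = pts[-2], pts[-1]
    # A decrease means the counter was reset, so the new value is the whole increase.
    delta = v_last if v_last < v_prev else v_last - v_prev
    return delta / (t_last - t_prev)


# Hypothetical counter scraped every 60s whose scrape at t=120 is missing.
samples = [(0.0, 100.0), (60.0, 160.0), (180.0, 280.0)]
print(irate(samples, t_eval=185.0, range_s=120.0))  # None: one sample in the 2m window
print(60.0 * irate(samples, t_eval=185.0, range_s=240.0))  # 60.0 requests per minute
```

Because irate looks only at the last two samples, widening the range from 2m to 4m does not smooth the result. It only makes it more likely that two samples are present. The same per-second value is multiplied by 60 in the rpm2m rule and by 8 (octets to bits) in the bps2m rule.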
1 change: 0 additions & 1 deletion k8s/prometheus-federation/deployments/prometheus.yml
@@ -55,7 +55,6 @@ spec:
 args: ["-config.file=/etc/prometheus/prometheus.yml",
 "-storage.local.path=/prometheus",
 "-storage.local.retention=2880h",
-"-log.level=debug",
 "-alertmanager.url=http://alertmanager-public-service.default.svc.cluster.local:9093",
 "-web.external-url=http://status.{{GCLOUD_PROJECT}}.measurementlab.net:9090",
 "-web.console.libraries=/usr/share/prometheus/console_libraries",