Merge pull request #1 from mkilchhofer/init
mkilchhofer authored Aug 28, 2024
2 parents f9a2d63 + 76e4055 commit c56fdf7
Showing 21 changed files with 2,684 additions and 3 deletions.
30 changes: 30 additions & 0 deletions .editorconfig
@@ -0,0 +1,30 @@
# EditorConfig is awesome: http://EditorConfig.org
# Uses editorconfig to maintain consistent coding styles

# top-most EditorConfig file
root = true

# Unix-style newlines with a newline ending every file
[*]
charset = utf-8
end_of_line = lf
indent_size = 2
indent_style = space
insert_final_newline = true
max_line_length = 120
trim_trailing_whitespace = true

[{go.mod,go.sum,*.go}]
indent_style = tab
indent_size = 4

[*.{tf,tfvars}]
indent_size = 2
indent_style = space

[*.md]
max_line_length = 0
trim_trailing_whitespace = false

[COMMIT_EDITMSG]
max_line_length = 0
36 changes: 36 additions & 0 deletions .github/workflows/terratest.yaml
@@ -0,0 +1,36 @@
name: Terratest
on: pull_request

permissions: {}

jobs:
  build:
    runs-on: ubuntu-latest

    steps:
      - name: Checkout
        uses: actions/checkout@v4

      - name: Setup Go
        uses: actions/setup-go@v5
        with:
          go-version: '1.22.x'

      - name: Install dependencies
        run: |
          pwd
          cd test
          go get .
      - name: Test with the Go CLI
        run: |
          pwd
          cd test
          go test -v
      - name: Check for updated README (terraform-docs)
        uses: terraform-docs/gh-actions@v1.2.0
        with:
          working-dir: .
          fail-on-diff: "true"
          config-file: ".terraform-docs.yml"
5 changes: 3 additions & 2 deletions .gitignore
@@ -4,14 +4,15 @@
# .tfstate files
*.tfstate
*.tfstate.*
.terraform.lock.hcl

# Crash log files
crash.log
crash.*.log

# Exclude all .tfvars files, which are likely to contain sensitive data, such as
# password, private keys, and other secrets. These should not be part of version
# control as they are data points which are potentially sensitive and subject
# to change depending on the environment.
*.tfvars
*.tfvars.json
14 changes: 14 additions & 0 deletions .terraform-docs.yml
@@ -0,0 +1,14 @@
formatter: "markdown"

output:
  file: "README.md"

settings:
  anchor: false
  indent: 3

sections:
  show:
    - providers
    - inputs
    - outputs
73 changes: 72 additions & 1 deletion README.md
@@ -1,2 +1,73 @@
# terraform-grafana-prometheus-alerts
Terraform module to convert Prometheus alert rules to Grafana alerts

Terraform module to convert [Prometheus Alerting rules] to [Grafana-managed alert rules]

## Motivation / Why use this module

Many apps (mostly from the CNCF ecosystem) ship with monitoring dashboards and alerts provided by the vendor or the
community. Dashboards are normally provided as a JSON file which can be loaded into Grafana. Alerts are mostly provided
as [Prometheus Alerting rules].

Some users already operate a Grafana instance or use a managed Grafana instance from a cloud provider (Grafana
Cloud, Amazon Managed Grafana, Azure Managed Grafana, etc.). Why not use this Grafana instance for alerting as
well?

The problem is that Grafana's unified alerting uses a different format for alert definitions, even though the concepts
of labels and annotations (which provide descriptions and runbook URLs) are almost identical.
This module allows you to reuse [Prometheus Alerting rules] and configure them inside Grafana.

## Example usage

```hcl
module "cert_manager_rules" {
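  source = "github.com/mkilchhofer/terraform-grafana-prometheus-alerts"
  # Despite its name, this input takes the file content, not a path:
  # the module passes the value straight to yamldecode().
  prometheus_alerts_file_path = file("/path/to/alerts/cert-manager.yaml")
  folder_uid = grafana_folder.test.uid
  datasource_uid = grafana_data_source.prometheus.uid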
}
```
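
The example above references a folder and a data source that are not defined in it. A minimal sketch of what they could look like is shown here, assuming the `grafana` provider is already configured. The folder title, data source name, and URL are placeholders for illustration, not values taken from this module.

```hcl
# Hypothetical folder that will hold the generated rule groups.
resource "grafana_folder" "test" {
  title = "Prometheus alerts" # placeholder title
}

# Hypothetical Prometheus data source queried by the alert expressions.
resource "grafana_data_source" "prometheus" {
  type = "prometheus"
  name = "prometheus"                         # placeholder name
  url  = "http://prometheus.example.com:9090" # placeholder URL
}
```

If the folder or data source already exists in your Grafana instance, you can pass its UID as a plain string instead.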

## Requirements

- Grafana 8.0+ (Unified alerting)

## Limitations

- Defining multiple alerts with the same name is not supported in Grafana

## Overriding definitions from the Prometheus Alerting rules file

TODO
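
Until this section is written, the following is only a sketch derived from the `overrides` input type listed in the Inputs table. It assumes that the map keys are alert names as they appear in the rules file (the module looks up `var.overrides[rule.value.alert]`). The label `team = "platform"` is a made-up example, and every field is set explicitly so that the sketch does not rely on how unset optional attributes are handled.

```hcl
module "cert_manager_rules" {
  source = "github.com/mkilchhofer/terraform-grafana-prometheus-alerts"
  prometheus_alerts_file_path = file("/path/to/alerts/cert-manager.yaml")
  folder_uid = grafana_folder.test.uid
  datasource_uid = grafana_data_source.prometheus.uid

  overrides = {
    # Key: the alert name from the Prometheus Alerting rules file.
    "CertManagerCertExpirySoon" = {
      alert_threshold = 0       # fire when the reduced query result is > 0
      exec_err_state  = "Error" # state when the query fails
      is_paused       = true    # create the rule, but do not evaluate it
      no_data_state   = "OK"    # state when the query returns no data
      labels = {
        team = "platform" # hypothetical extra label, merged over the rule's labels
      }
    }
  }
}
```

Labels given here are merged with the labels from the rules file, and on a key collision the override wins because it is the last argument to `merge()`.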

## TF module documentation

<!-- BEGIN_TF_DOCS -->
### Providers

| Name | Version |
|------|---------|
| grafana | ~> 3.2 |

### Inputs

| Name | Description | Type | Default | Required |
|------|-------------|------|---------|:--------:|
| datasource\_uid | The UID of the Grafana datasource being queried with the expressions inside the Alerting rule file | `string` | n/a | yes |
| default\_evaluation\_interval\_duration | How often the rule is evaluated by default (when not defined inside your Alerting rules file). | `string` | `"5m"` | no |
| disable\_provenance | Allow modifying the rule group from other sources than Terraform or the Grafana API. | `bool` | `false` | no |
| folder\_uid | The UID of the Grafana folder that the alerts belong to. | `string` | n/a | yes |
| org\_id | The Organization ID of the Grafana Alerting rule groups. (Only supported with basic auth, API keys are already org-scoped) | `string` | `null` | no |
| overrides | Overrides per Alert rule | <pre>map(object({<br> alert_threshold = optional(number)<br> exec_err_state = optional(string)<br> is_paused = optional(bool)<br> no_data_state = optional(string)<br> labels = optional(map(string))<br> }))</pre> | `{}` | no |
| prometheus\_alerts\_file\_path | Path to the Prometheus Alerting rules file | `string` | n/a | yes |

### Outputs

| Name | Description |
|------|-------------|
| alertsfile\_map | n/a |
| file\_as\_yaml | n/a |
<!-- END_TF_DOCS -->

[Grafana-managed alert rules]: https://grafana.com/docs/grafana/latest/alerting/fundamentals/alert-rules/#grafana-managed-alert-rules
[Prometheus Alerting rules]: https://prometheus.io/docs/prometheus/latest/configuration/alerting_rules/
134 changes: 134 additions & 0 deletions grafana_alert.tf
@@ -0,0 +1,134 @@
resource "grafana_rule_group" "this" {
  # for_each = local.file_as_yaml.groups
  for_each = local.alertsfile_map

  name = each.value.name
  folder_uid = var.folder_uid
  org_id = var.org_id

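  # Terraform has no function that converts a Go-style duration string (the format of
  # "interval" within an alert group) into seconds. As a workaround, timeadd() accepts
  # such durations: add the interval to the Unix epoch, then sum the seconds, minutes
  # and hours of the resulting timestamp.
  # Note: intervals of 24h or more wrap around, because the day component is ignored.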
  interval_seconds = (
    (parseint(formatdate("s", timeadd("1970-01-01T00:00:00Z", try(each.value.interval, var.default_evaluation_interval_duration))), 10) * 1) +
    (parseint(formatdate("m", timeadd("1970-01-01T00:00:00Z", try(each.value.interval, var.default_evaluation_interval_duration))), 10) * 60) +
    (parseint(formatdate("h", timeadd("1970-01-01T00:00:00Z", try(each.value.interval, var.default_evaluation_interval_duration))), 10) * 3600)
  )

  disable_provenance = var.disable_provenance

  dynamic "rule" {
    for_each = {for rule in each.value.rules: rule.alert => rule}

    content {
      name = rule.value.alert
      for = try(rule.value.for, null)
      condition = "ALERTCONDITION"

      annotations = {for k, v in rule.value.annotations : k => replace(v, "$value", "$values.QUERY_RESULT.Value")}
      labels = merge(rule.value.labels, try(var.overrides[rule.value.alert].labels, {}))

      exec_err_state = try(var.overrides[rule.value.alert].exec_err_state, null)
      is_paused = try(var.overrides[rule.value.alert].is_paused, null)
      no_data_state = try(var.overrides[rule.value.alert].no_data_state, null)

      data {
        ref_id = "QUERY"
        relative_time_range {
          from = 600
          to = 0
        }
        datasource_uid = var.datasource_uid
        model = jsonencode({
          editorMode = "code"
          expr = rule.value.expr
          intervalMs = 1000
          maxDataPoints = 43200
          refId = "QUERY"
        })
      }

      ## Reduce
      data {
        ref_id = "QUERY_RESULT"
        relative_time_range {
          from = 600
          to = 0
        }
        datasource_uid = "__expr__"
        model = jsonencode({
          "conditions" = [
            {
              "evaluator" = {
                "params" = [0]
                "type" = "gt"
              }
              "operator" = {
                "type" = "and"
              }
              "query" = {
                "params" = []
              }
              "reducer" = {
                "params" = []
                "type" = "avg"
              }
              "type" = "query"
            },
          ]
          "datasource" = {
            "name" = "Expression"
            "type" = "__expr__"
            "uid" = "__expr__"
          }
          "expression" = "QUERY"
          "intervalMs" = 1000
          "maxDataPoints" = 43200
          "reducer" = "last"
          "refId" = "QUERY_RESULT"
          "type" = "reduce"
        })
      }

      ## Threshold
      data {
        ref_id = "ALERTCONDITION"
        relative_time_range {
          from = 600
          to = 0
        }
        datasource_uid = "__expr__"
        model = jsonencode({
          "conditions" = [
            {
              "evaluator" = {
                "params" = [try(var.overrides[rule.value.alert].alert_threshold, 0)]
                "type" = "gt"
              }
              "operator" = {
                "type" = "and"
              }
              "query" = {
                "params" = ["QUERY_RESULT"]
              }
              "reducer" = {
                "params" = []
                "type" = "last"
              }
              "type" = "query"
            },
          ]
          "datasource" = {
            "type" = "__expr__"
            "uid" = "__expr__"
          }
          "expression" = "QUERY_RESULT"
          "hide" = false
          "intervalMs" = 1000
          "maxDataPoints" = 43200
          "refId" = "ALERTCONDITION"
          "type" = "threshold"
        })
      }
    }
  }
}
4 changes: 4 additions & 0 deletions locals.tf
@@ -0,0 +1,4 @@
locals {
  file_as_yaml = yamldecode(var.prometheus_alerts_file_path)
  alertsfile_map = {for group in local.file_as_yaml.groups: group.name => group}
}
7 changes: 7 additions & 0 deletions outputs.tf
@@ -0,0 +1,7 @@
output "file_as_yaml" {
  value = local.file_as_yaml
}

output "alertsfile_map" {
  value = local.alertsfile_map
}
61 changes: 61 additions & 0 deletions test/alerts-cert-manager.yaml
@@ -0,0 +1,61 @@
# Source: https://github.com/monitoring-mixins/website/blob/master/assets/cert-manager/alerts.yaml
groups:
- name: cert-manager
  rules:
  - alert: CertManagerAbsent
    annotations:
      description: New certificates will not be able to be minted, and existing ones
        can't be renewed until cert-manager is back.
      runbook_url: https://github.com/imusmanmalik/cert-manager-mixin/blob/main/RUNBOOK.md#certmanagerabsent
      summary: Cert Manager has disappeared from Prometheus service discovery.
    expr: absent(up{job="cert-manager"})
    for: 10m
    labels:
      severity: critical
- name: certificates
  rules:
  - alert: CertManagerCertExpirySoon
    annotations:
      dashboard_url: https://grafana.example.com/d/TvuRo2iMk/cert-manager
      description: The domain that this cert covers will be unavailable after {{ $value
        | humanizeDuration }}. Clients using endpoints that this cert protects will
        start to fail in {{ $value | humanizeDuration }}.
      runbook_url: https://github.com/imusmanmalik/cert-manager-mixin/blob/main/RUNBOOK.md#certmanagercertexpirysoon
      summary: The cert `{{ $labels.name }}` is {{ $value | humanizeDuration }} from
        expiry, it should have renewed over a week ago.
    expr: |
      avg by (exported_namespace, namespace, name) (
        certmanager_certificate_expiration_timestamp_seconds - time()
      ) < (21 * 24 * 3600) # 21 days in seconds
    for: 1h
    labels:
      severity: warning
  - alert: CertManagerCertNotReady
    annotations:
      dashboard_url: https://grafana.example.com/d/TvuRo2iMk/cert-manager
      description: This certificate has not been ready to serve traffic for at least
        10m. If the cert is being renewed or there is another valid cert, the ingress
        controller _may_ be able to serve that instead.
      runbook_url: https://github.com/imusmanmalik/cert-manager-mixin/blob/main/RUNBOOK.md#certmanagercertnotready
      summary: The cert `{{ $labels.name }}` is not ready to serve traffic.
    expr: |
      max by (name, exported_namespace, namespace, condition) (
        certmanager_certificate_ready_status{condition!="True"} == 1
      )
    for: 10m
    labels:
      severity: critical
  - alert: CertManagerHittingRateLimits
    annotations:
      dashboard_url: https://grafana.example.com/d/TvuRo2iMk/cert-manager
      description: Depending on the rate limit, cert-manager may be unable to generate
        certificates for up to a week.
      runbook_url: https://github.com/imusmanmalik/cert-manager-mixin/blob/main/RUNBOOK.md#certmanagerhittingratelimits
      summary: Cert manager hitting LetsEncrypt rate limits.
    expr: |
      sum by (host) (
        rate(certmanager_http_acme_client_request_count{status="429"}[5m])
      ) > 0
    for: 5m
    labels:
      severity: critical