Merge pull request #1 from mkilchhofer/init

mkilchhofer · Aug 28, 2024 · c56fdf7
2 parents f9a2d63 + 76e4055
commit c56fdf7

Showing 21 changed files with 2,684 additions and 3 deletions.
diff --git a/.editorconfig b/.editorconfig
@@ -0,0 +1,30 @@
+# EditorConfig is awesome: http://EditorConfig.org
+# Uses editorconfig to maintain consistent coding styles
+
+# top-most EditorConfig file
+root = true
+
+# Unix-style newlines with a newline ending every file
+[*]
+charset = utf-8
+end_of_line = lf
+indent_size = 2
+indent_style = space
+insert_final_newline = true
+max_line_length = 120
+trim_trailing_whitespace = true
+
+[{go.mod,go.sum,*.go}]
+indent_style = tab
+indent_size = 4
+
+[*.{tf,tfvars}]
+indent_size = 2
+indent_style = space
+
+[*.md]
+max_line_length = off
+trim_trailing_whitespace = false
+
+[COMMIT_EDITMSG]
+max_line_length = off
diff --git a/.github/workflows/terratest.yaml b/.github/workflows/terratest.yaml
@@ -0,0 +1,36 @@
+name: Terratest
+on: pull_request
+
+permissions: {}
+
+jobs:
+  build:
+    runs-on: ubuntu-latest
+
+    steps:
+      - name: Checkout
+        uses: actions/checkout@v4
+
+      - name: Setup Go
+        uses: actions/setup-go@v5
+        with:
+          go-version: '1.22.x'
+
+      - name: Install dependencies
+        run: |
+          pwd
+          cd test
+          go get .
+
+      - name: Test with the Go CLI
+        run: |
+          pwd
+          cd test
+          go test -v
+
+      - name: Check for updated README (terraform-docs)
+        uses: terraform-docs/gh-actions@v1.2.0
+        with:
+          working-dir: .
+          fail-on-diff: "true"
+          config-file: ".terraform-docs.yml"
diff --git a/.gitignore b/.gitignore
@@ -4,14 +4,15 @@
 # .tfstate files
 *.tfstate
 *.tfstate.*
+.terraform.lock.hcl
 
 # Crash log files
 crash.log
 crash.*.log
 
 # Exclude all .tfvars files, which are likely to contain sensitive data, such as
-# password, private keys, and other secrets. These should not be part of version 
-# control as they are data points which are potentially sensitive and subject 
+# password, private keys, and other secrets. These should not be part of version
+# control as they are data points which are potentially sensitive and subject
 # to change depending on the environment.
 *.tfvars
 *.tfvars.json

diff --git a/.terraform-docs.yml b/.terraform-docs.yml
@@ -0,0 +1,14 @@
+formatter: "markdown"
+
+output:
+  file: "README.md"
+
+settings:
+  anchor: false
+  indent: 3
+
+sections:
+  show:
+    - providers
+    - inputs
+    - outputs
diff --git a/README.md b/README.md
@@ -1,2 +1,73 @@
 # terraform-grafana-prometheus-alerts
-Terraform module to convert Prometheus alert rules to Grafana alerts
+
+Terraform module to convert [Prometheus Alerting rules] to [Grafana-managed alert rules].
+
+## Motivation / Why use this module
+
+Many applications (mostly from the CNCF ecosystem) ship with monitoring dashboards and alerts provided by the vendor
+or the community. Dashboards are normally provided as a JSON file which can be loaded into Grafana. Alerts are mostly
+provided as [Prometheus Alerting rules].
+
+Many users already operate a Grafana instance or use a managed one from a cloud provider (Grafana Cloud, Amazon
+Managed Grafana, Azure Managed Grafana, etc.). Why not use this Grafana instance for alerting
+as well?
+
+The problem is that Grafana's unified alerting uses a different format for alert definitions, although the concepts of
+labels and annotations (which provide descriptions and runbook URLs) are almost identical.
+This module allows you to reuse the [Prometheus Alerting rules] and configure them inside Grafana.
+
+## Example usage
+
+```hcl
+module "cert_manager_rules" {
+  source = "github.com/mkilchhofer/terraform-grafana-prometheus-alerts"
+
+  prometheus_alerts_file_path = file("/path/to/alerts/cert-manager.yaml") # pass the file content; the module calls yamldecode() on it
+  folder_uid                  = grafana_folder.test.uid
+  datasource_uid              = grafana_data_source.prometheus.uid
+}
+```
+
+## Requirements
+
+- Grafana 8.0+ (Unified alerting)
+
+## Limitations
+
+- Defining multiple alerts with the same name is not supported in Grafana
+
+## Overriding definitions from the Prometheus Alerting rules file
+
+TODO
+
+## TF module documentation
+
+<!-- BEGIN_TF_DOCS -->
+### Providers
+
+| Name | Version |
+|------|---------|
+| grafana | ~> 3.2 |
+
+### Inputs
+
+| Name | Description | Type | Default | Required |
+|------|-------------|------|---------|:--------:|
+| datasource\_uid | The UID of the Grafana datasource being queried with the expressions inside the Alerting rule file | `string` | n/a | yes |
+| default\_evaluation\_interval\_duration | How often the rule is evaluated by default (when not defined inside your Alerting rules file). | `string` | `"5m"` | no |
+| disable\_provenance | Allow modifying the rule group from other sources than Terraform or the Grafana API. | `bool` | `false` | no |
+| folder\_uid | The UID of the Grafana folder that the alerts belong to. | `string` | n/a | yes |
+| org\_id | The Organization ID of the Grafana Alerting rule groups. (Only supported with basic auth, API keys are already org-scoped) | `string` | `null` | no |
+| overrides | Overrides per Alert rule | <pre>map(object({<br>    alert_threshold = optional(number)<br>    exec_err_state  = optional(string)<br>    is_paused       = optional(bool)<br>    no_data_state   = optional(string)<br>    labels          = optional(map(string))<br>  }))</pre> | `{}` | no |
+| prometheus\_alerts\_file\_path | Path to the Prometheus Alerting rules file | `string` | n/a | yes |
+
+### Outputs
+
+| Name | Description |
+|------|-------------|
+| alertsfile\_map | n/a |
+| file\_as\_yaml | n/a |
+<!-- END_TF_DOCS -->
+
+[Grafana-managed alert rules]: https://grafana.com/docs/grafana/latest/alerting/fundamentals/alert-rules/#grafana-managed-alert-rules
+[Prometheus Alerting rules]: https://prometheus.io/docs/prometheus/latest/configuration/alerting_rules/
diff --git a/grafana_alert.tf b/grafana_alert.tf
@@ -0,0 +1,134 @@
+resource "grafana_rule_group" "this" {
+#   for_each = local.file_as_yaml.groups
+  for_each = local.alertsfile_map
+
+  name       = each.value.name
+  folder_uid = var.folder_uid
+  org_id     = var.org_id
+
+  # Terraform has no function that parses Golang's "duration" format (used for the interval of an alert group).
+  # timeadd() accepts it: add the interval to the Unix epoch and sum the seconds, minutes and hours (intervals < 24h).
+  interval_seconds = (
+    (parseint(formatdate("s", timeadd("1970-01-01T00:00:00Z", try(each.value.interval, var.default_evaluation_interval_duration))), 10) * 1) +
+    (parseint(formatdate("m", timeadd("1970-01-01T00:00:00Z", try(each.value.interval, var.default_evaluation_interval_duration))), 10) * 60) +
+    (parseint(formatdate("h", timeadd("1970-01-01T00:00:00Z", try(each.value.interval, var.default_evaluation_interval_duration))), 10) * 3600)
+  )
+
+  disable_provenance = var.disable_provenance
+
+  dynamic "rule" {
+    for_each = { for rule in each.value.rules : rule.alert => rule }
+
+    content {
+      name      = rule.value.alert
+      for       = try(rule.value.for, null)
+      condition = "ALERTCONDITION"
+
+      annotations = { for k, v in try(rule.value.annotations, {}) : k => replace(v, "$value", "$values.QUERY_RESULT.Value") }
+      labels      = merge(try(rule.value.labels, {}), try(var.overrides[rule.value.alert].labels, {}))
+
+      exec_err_state = try(var.overrides[rule.value.alert].exec_err_state, null)
+      is_paused      = try(var.overrides[rule.value.alert].is_paused, null)
+      no_data_state  = try(var.overrides[rule.value.alert].no_data_state, null)
+
+      data {
+        ref_id = "QUERY"
+        relative_time_range {
+          from = 600
+          to   = 0
+        }
+        datasource_uid = var.datasource_uid
+        model = jsonencode({
+          editorMode    = "code"
+          expr          = rule.value.expr
+          intervalMs    = 1000
+          maxDataPoints = 43200
+          refId         = "QUERY"
+        })
+      }
+
+      ## Reduce
+      data {
+        ref_id = "QUERY_RESULT"
+        relative_time_range {
+          from = 600
+          to   = 0
+        }
+        datasource_uid = "__expr__"
+        model          = jsonencode({
+          "conditions" = [
+            {
+              "evaluator" = {
+                "params" = [0]
+                "type"   = "gt"
+              }
+              "operator" = {
+                "type" = "and"
+              }
+              "query" = {
+                "params" = []
+              }
+              "reducer" = {
+                "params" = []
+                "type"   = "avg"
+              }
+              "type" = "query"
+            },
+          ]
+          "datasource" = {
+            "name" = "Expression"
+            "type" = "__expr__"
+            "uid"  = "__expr__"
+          }
+          "expression"    = "QUERY"
+          "intervalMs"    = 1000
+          "maxDataPoints" = 43200
+          "reducer"       = "last"
+          "refId"         = "QUERY_RESULT"
+          "type"          = "reduce"
+        })
+      }
+
+      ## Threshold
+      data {
+        ref_id = "ALERTCONDITION"
+        relative_time_range {
+          from = 600
+          to   = 0
+        }
+        datasource_uid = "__expr__"
+        model          = jsonencode({
+          "conditions" = [
+            {
+              "evaluator" = {
+                "params" = [try(var.overrides[rule.value.alert].alert_threshold, 0)]
+                "type"   = "gt"
+              }
+              "operator" = {
+                "type" = "and"
+              }
+              "query" = {
+                "params" = ["QUERY_RESULT"]
+              }
+              "reducer" = {
+                "params" = []
+                "type"   = "last"
+              }
+              "type" = "query"
+            },
+          ]
+          "datasource" = {
+            "type" = "__expr__"
+            "uid"  = "__expr__"
+          }
+          "expression"    = "QUERY_RESULT"
+          "hide"          = false
+          "intervalMs"    = 1000
+          "maxDataPoints" = 43200
+          "refId"         = "ALERTCONDITION"
+          "type"          = "threshold"
+        })
+      }
+    }
+  }
+}
diff --git a/locals.tf b/locals.tf
@@ -0,0 +1,4 @@
+locals {
+  file_as_yaml   = yamldecode(var.prometheus_alerts_file_path)
+  alertsfile_map = { for group in local.file_as_yaml.groups : group.name => group }
+}
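Review note on `locals.tf`: `alertsfile_map` is keyed by `group.name`, and the `rule` block in `grafana_alert.tf` is keyed by `rule.alert`. A Terraform `for` expression that builds a map fails when two items produce the same key, so a rules file with repeated names cannot be applied. A pre-flight check along the following lines could surface such names before `terraform plan`. This is only a sketch: `duplicateNames` is a hypothetical helper, and the input list is hard-coded here rather than parsed from YAML.

```go
package main

import (
	"fmt"
	"sort"
)

// duplicateNames returns every name that occurs more than once, sorted
// alphabetically. Hypothetical helper, not part of the module.
func duplicateNames(names []string) []string {
	counts := make(map[string]int)
	for _, n := range names {
		counts[n]++
	}
	var dups []string
	for n, c := range counts {
		if c > 1 {
			dups = append(dups, n)
		}
	}
	sort.Strings(dups)
	return dups
}

func main() {
	// Group names from test/alerts-cert-manager.yaml, plus one deliberately
	// repeated entry (an assumption for the demo) to trigger the check.
	groups := []string{"cert-manager", "certificates", "certificates"}
	fmt.Println(duplicateNames(groups))
}
```

The same helper can be fed the alert names of a single group to check the rule-level keys.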
diff --git a/outputs.tf b/outputs.tf
@@ -0,0 +1,7 @@
+output "file_as_yaml" {
+  value = local.file_as_yaml
+}
+
+output "alertsfile_map" {
+  value = local.alertsfile_map
+}
diff --git a/test/alerts-cert-manager.yaml b/test/alerts-cert-manager.yaml
@@ -0,0 +1,61 @@
+# Source: https://github.com/monitoring-mixins/website/blob/master/assets/cert-manager/alerts.yaml
+groups:
+- name: cert-manager
+  rules:
+  - alert: CertManagerAbsent
+    annotations:
+      description: New certificates will not be able to be minted, and existing ones
+        can't be renewed until cert-manager is back.
+      runbook_url: https://github.com/imusmanmalik/cert-manager-mixin/blob/main/RUNBOOK.md#certmanagerabsent
+      summary: Cert Manager has disappeared from Prometheus service discovery.
+    expr: absent(up{job="cert-manager"})
+    for: 10m
+    labels:
+      severity: critical
+- name: certificates
+  rules:
+  - alert: CertManagerCertExpirySoon
+    annotations:
+      dashboard_url: https://grafana.example.com/d/TvuRo2iMk/cert-manager
+      description: The domain that this cert covers will be unavailable after {{ $value
+        | humanizeDuration }}. Clients using endpoints that this cert protects will
+        start to fail in {{ $value | humanizeDuration }}.
+      runbook_url: https://github.com/imusmanmalik/cert-manager-mixin/blob/main/RUNBOOK.md#certmanagercertexpirysoon
+      summary: The cert `{{ $labels.name }}` is {{ $value | humanizeDuration }} from
+        expiry, it should have renewed over a week ago.
+    expr: |
+      avg by (exported_namespace, namespace, name) (
+        certmanager_certificate_expiration_timestamp_seconds - time()
+      ) < (21 * 24 * 3600) # 21 days in seconds
+    for: 1h
+    labels:
+      severity: warning
+  - alert: CertManagerCertNotReady
+    annotations:
+      dashboard_url: https://grafana.example.com/d/TvuRo2iMk/cert-manager
+      description: This certificate has not been ready to serve traffic for at least
+        10m. If the cert is being renewed or there is another valid cert, the ingress
+        controller _may_ be able to serve that instead.
+      runbook_url: https://github.com/imusmanmalik/cert-manager-mixin/blob/main/RUNBOOK.md#certmanagercertnotready
+      summary: The cert `{{ $labels.name }}` is not ready to serve traffic.
+    expr: |
+      max by (name, exported_namespace, namespace, condition) (
+        certmanager_certificate_ready_status{condition!="True"} == 1
+      )
+    for: 10m
+    labels:
+      severity: critical
+  - alert: CertManagerHittingRateLimits
+    annotations:
+      dashboard_url: https://grafana.example.com/d/TvuRo2iMk/cert-manager
+      description: Depending on the rate limit, cert-manager may be unable to generate
+        certificates for up to a week.
+      runbook_url: https://github.com/imusmanmalik/cert-manager-mixin/blob/main/RUNBOOK.md#certmanagerhittingratelimits
+      summary: Cert manager hitting LetsEncrypt rate limits.
+    expr: |
+      sum by (host) (
+        rate(certmanager_http_acme_client_request_count{status="429"}[5m])
+      ) > 0
+    for: 5m
+    labels:
+      severity: critical
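Review note on the fixture above: several annotations use the Prometheus template variable `$value`, which `grafana_alert.tf` rewrites to `$values.QUERY_RESULT.Value` with a literal `replace()`. The Go sketch below applies the same literal substitution to the `summary` annotation of `CertManagerCertExpirySoon`, to show what ends up in the Grafana rule. The function name `rewriteAnnotation` is an illustrative assumption.

```go
package main

import (
	"fmt"
	"strings"
)

// rewriteAnnotation mirrors the module's
// replace(v, "$value", "$values.QUERY_RESULT.Value"): a literal,
// replace-all substitution. Hypothetical helper, not part of the module.
func rewriteAnnotation(v string) string {
	return strings.ReplaceAll(v, "$value", "$values.QUERY_RESULT.Value")
}

func main() {
	// Summary annotation copied from test/alerts-cert-manager.yaml.
	summary := "The cert `{{ $labels.name }}` is {{ $value | humanizeDuration }} from expiry, it should have renewed over a week ago."
	fmt.Println(rewriteAnnotation(summary))
}
```

Only `$value` is touched; `$labels.name` and the `humanizeDuration` pipeline pass through unchanged. Because the match is a plain substring, a template that already contains `$values` would also be rewritten, which may be worth guarding against.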