diff --git a/monitoring/README.md b/monitoring/README.md index ee0c5e5301a..1e0eb95abad 100644 --- a/monitoring/README.md +++ b/monitoring/README.md @@ -12,6 +12,10 @@ This directory contains chaos interleaved grafana dashboards along with the util > Contains utilities required to setup monitoring infrastructure on a kubernetes cluster. +- [Tutorials](./tutorials) + + > Contains tutorials for users on monitoring target applications under chaos using various tools. + ## Setup the LitmusChaos Infrastructure - Install the litmus chaos operator and CRDs diff --git a/monitoring/tutorials/README.md b/monitoring/tutorials/README.md new file mode 100644 index 00000000000..e092e6fc1ad --- /dev/null +++ b/monitoring/tutorials/README.md @@ -0,0 +1,7 @@ +# Tutorials + +This directory contains tutorials for users on monitoring target applications under chaos using various tools. + +- [Otel-demo](./otel-demo) + + > Contains a tutorial on injecting chaos into target applications using LitmusChaos and observing the chaos with OpenTelemetry. diff --git a/monitoring/tutorials/otel-demo/README.md b/monitoring/tutorials/otel-demo/README.md new file mode 100644 index 00000000000..48c6d5b9e5e --- /dev/null +++ b/monitoring/tutorials/otel-demo/README.md @@ -0,0 +1,95 @@ +# Otel-demo tutorial + +This tutorial provides a step-by-step guide for injecting chaos into target applications using LitmusChaos and observing the chaos with OpenTelemetry. + +![otel_demo_tutorial_architecture](./screenshots/otel_demo_tutorial_architecture.png) + +### 0. Prerequisites +- Kubernetes 1.24+ +- 8 GB of free RAM +- Helm 3.9+ + +### 1. Install Litmus +1. Create the `litmus` namespace. + ```bash + kubectl create ns litmus + ``` +2. Add the Litmus Helm repository. + ```bash + helm repo add litmuschaos https://litmuschaos.github.io/litmus-helm/ + ``` +3. Install Litmus using Helm.
+ ```bash + helm install chaos litmuschaos/litmus \ + --namespace=litmus \ + --set portal.frontend.service.type=NodePort \ + --set mongodb.image.registry=ghcr.io/zcube \ + --set mongodb.image.repository=bitnami-compat/mongodb \ + --set mongodb.image.tag=6.0.5 + ``` +4. Verify the installation. + ```bash + kubectl get all -n litmus + ``` +5. Forward the Litmus frontend service port. + ```bash + kubectl port-forward svc/chaos-litmus-frontend-service 9091:9091 -n litmus + ``` + Access the Litmus frontend at [http://localhost:9091](http://localhost:9091) and log in with `admin` / `litmus`. + +### 2. Set Up Litmus Environment +1. Create a new environment. + - Environment Name: `local` + - Environment Type: `Production` +2. Configure a new chaos infrastructure. + - Name: `local` + - Chaos Components Installation: `Cluster-wide access` + - Installation Location (Namespace): `litmus` + - Service Account Name: `litmus` +3. Deploy the new chaos infrastructure. + ```bash + cd ~/Downloads + kubectl apply -f local-litmus-chaos-enable.yml + ``` + Wait until the status shows `CONNECTED`. + +### 3. Install Otel-demo microservices & Observability tools +1. Create the `otel-demo` namespace. + ```bash + kubectl create ns otel-demo + ``` +2. Add the OpenTelemetry Helm repository. + ```bash + helm repo add open-telemetry https://open-telemetry.github.io/opentelemetry-helm-charts + ``` +3. Install Otel-demo microservices and Observability tools using Helm. + ```bash + cd litmus/monitoring/tutorials/otel-demo + helm install my-otel-demo open-telemetry/opentelemetry-demo --namespace otel-demo --values custom_otel_demo_values.yml + ``` + This release includes the Otel-demo microservices, the OpenTelemetry Collector (configured to scrape chaos metrics), Prometheus, Jaeger, and Grafana. +4. Verify the installation. + ```bash + kubectl get all -n otel-demo + ``` +5. Forward the Otel-demo frontend proxy port. + ```bash + kubectl port-forward svc/my-otel-demo-frontendproxy 8080:8080 -n otel-demo + ``` +6. Access the following services.
+ - Web store: [http://localhost:8080/](http://localhost:8080/) + - Grafana: [http://localhost:8080/grafana/](http://localhost:8080/grafana/) + - Load Generator UI: [http://localhost:8080/loadgen/](http://localhost:8080/loadgen/) + - Jaeger UI: [http://localhost:8080/jaeger/ui/](http://localhost:8080/jaeger/ui/) + +### 4. Add Grafana Panel +Import the `chaos-exporter-dashboard.json` file into Grafana to visualize the results of chaos experiments. + +### 5. Observe chaos +Explore the following experiments to observe chaos on the Otel-demo microservices. + +- [Pod Network Latency](./cart-service) + > Performs a pod network latency experiment on the cart service. + +- [Pod Delete](./recommendation-service) + > Performs a pod delete experiment on the recommendation service. diff --git a/monitoring/tutorials/otel-demo/cart-service/README.md b/monitoring/tutorials/otel-demo/cart-service/README.md new file mode 100644 index 00000000000..335aa14e98d --- /dev/null +++ b/monitoring/tutorials/otel-demo/cart-service/README.md @@ -0,0 +1,26 @@ +# cart service pod network latency +## Description +- This experiment injects network latency into the cart service pod. +- The probe checks the 99th-percentile latency of cart service requests, in seconds, using Prometheus metrics. +## Steps +### 1. Probe Settings +- probe type: `Prometheus Probe` +- name: `cart-service-pod-network-latency-probe` +- timeout: 3s +- interval: 3s +- prometheus endpoint: `http://my-otel-demo-prometheus-server.otel-demo:9090` +- prometheus query: `histogram_quantile(0.99, sum(rate(duration_milliseconds_bucket{service_name=\"cartservice\"}[5m])) by (le))/1000` +- Data Comparison: + - Type: Float + - Criteria: `<` + - Value: `3.0` +### 2. Create Experiment +1. New Experiment +2. Complete Overview +3. Start off by uploading the YAML file (`cart-service-pod-network-latency.yml`) +### 3. Run Experiment +1. Click on the `Run` button +2. Check Experiment Status and Logs +3. Check the Resilience Score +4.
Check the Chaos Exporter metrics using Grafana and confirm that the experiment failed. ![cart_service_pod_network_latency_experiment_result_dashboard.png](../screenshots/cart_service_pod_network_latency_experiment_result_dashboard.png) +5. Check the cart service Spanmetrics using Grafana ![cartservice_spanmetrics.png](../screenshots/cartservice_spanmetrics.png) \ No newline at end of file diff --git a/monitoring/tutorials/otel-demo/cart-service/cart-service-pod-network-latency.yml b/monitoring/tutorials/otel-demo/cart-service/cart-service-pod-network-latency.yml new file mode 100644 index 00000000000..5aa913a3356 --- /dev/null +++ b/monitoring/tutorials/otel-demo/cart-service/cart-service-pod-network-latency.yml @@ -0,0 +1,315 @@ +kind: Workflow +apiVersion: argoproj.io/v1alpha1 +metadata: + name: cart-service-pod-network-latency + namespace: litmus + creationTimestamp: null + labels: + infra_id: 5b9be872-6396-4ad1-b64a-ed4b25edd516 + revision_id: bd738dca-14f0-4145-8f67-afb3d8c17991 + workflow_id: 1912f522-5197-4bd5-8854-732ccf1882bb + workflows.argoproj.io/controller-instanceid: 5b9be872-6396-4ad1-b64a-ed4b25edd516 +spec: + templates: + - name: test + inputs: {} + outputs: {} + metadata: {} + steps: + - - name: install-chaos-faults + template: install-chaos-faults + arguments: {} + - - name: pod-network-latency-pok + template: pod-network-latency-pok + arguments: {} + - - name: cleanup-chaos-resources + template: cleanup-chaos-resources + arguments: {} + - name: install-chaos-faults + inputs: + artifacts: + - name: pod-network-latency-pok + path: /tmp/pod-network-latency-pok.yaml + raw: + data: > + apiVersion: litmuschaos.io/v1alpha1 + + description: + message: | + Injects network latency on pods belonging to an app deployment + kind: ChaosExperiment + + metadata: + name: pod-network-latency + labels: + name: pod-network-latency + app.kubernetes.io/part-of: litmus + app.kubernetes.io/component: chaosexperiment + app.kubernetes.io/version: ci + spec: +
definition: + scope: Namespaced + permissions: + - apiGroups: + - "" + resources: + - pods + verbs: + - create + - delete + - get + - list + - patch + - update + - deletecollection + - apiGroups: + - "" + resources: + - events + verbs: + - create + - get + - list + - patch + - update + - apiGroups: + - "" + resources: + - configmaps + verbs: + - get + - list + - apiGroups: + - "" + resources: + - pods/log + verbs: + - get + - list + - watch + - apiGroups: + - "" + resources: + - pods/exec + verbs: + - get + - list + - create + - apiGroups: + - apps + resources: + - deployments + - statefulsets + - replicasets + - daemonsets + verbs: + - list + - get + - apiGroups: + - apps.openshift.io + resources: + - deploymentconfigs + verbs: + - list + - get + - apiGroups: + - "" + resources: + - replicationcontrollers + verbs: + - get + - list + - apiGroups: + - argoproj.io + resources: + - rollouts + verbs: + - list + - get + - apiGroups: + - batch + resources: + - jobs + verbs: + - create + - list + - get + - delete + - deletecollection + - apiGroups: + - litmuschaos.io + resources: + - chaosengines + - chaosexperiments + - chaosresults + verbs: + - create + - list + - get + - patch + - update + - delete + image: docker.io/litmuschaos/go-runner:latest + imagePullPolicy: Always + args: + - -c + - ./experiments -name pod-network-latency + command: + - /bin/bash + env: + - name: TARGET_CONTAINER + value: "" + - name: NETWORK_INTERFACE + value: eth0 + - name: LIB_IMAGE + value: docker.io/litmuschaos/go-runner:latest + - name: TC_IMAGE + value: gaiadocker/iproute2 + - name: NETWORK_LATENCY + value: "2000" + - name: TOTAL_CHAOS_DURATION + value: "60" + - name: RAMP_TIME + value: "" + - name: JITTER + value: "0" + - name: PODS_AFFECTED_PERC + value: "" + - name: TARGET_PODS + value: "" + - name: CONTAINER_RUNTIME + value: containerd + - name: DEFAULT_HEALTH_CHECK + value: "false" + - name: DESTINATION_IPS + value: "" + - name: DESTINATION_HOSTS + value: "" + - name: SOCKET_PATH + 
value: /run/containerd/containerd.sock + - name: NODE_LABEL + value: "" + - name: SEQUENCE + value: parallel + labels: + name: pod-network-latency + app.kubernetes.io/part-of: litmus + app.kubernetes.io/component: experiment-job + app.kubernetes.io/runtime-api-usage: "true" + app.kubernetes.io/version: ci + outputs: {} + metadata: {} + container: + name: "" + image: litmuschaos/k8s:2.11.0 + command: + - sh + - -c + args: + - kubectl apply -f /tmp/ -n {{workflow.parameters.adminModeNamespace}} + && sleep 30 + resources: {} + - name: cleanup-chaos-resources + inputs: {} + outputs: {} + metadata: {} + container: + name: "" + image: litmuschaos/k8s:2.11.0 + command: + - sh + - -c + args: + - kubectl delete chaosengine -l workflow_run_id={{workflow.uid}} -n + {{workflow.parameters.adminModeNamespace}} + resources: {} + - name: pod-network-latency-pok + inputs: + artifacts: + - name: pod-network-latency-pok + path: /tmp/chaosengine-pod-network-latency-pok.yaml + raw: + data: > + apiVersion: litmuschaos.io/v1alpha1 + + kind: ChaosEngine + + metadata: + namespace: "{{workflow.parameters.adminModeNamespace}}" + labels: + workflow_run_id: "{{ workflow.uid }}" + workflow_name: cart-service-pod-network-latency + annotations: + probeRef: '[{"name":"cart-service-pod-network-latency-probe","mode":"EOT"}]' + generateName: pod-network-latency-pok + spec: + engineState: active + appinfo: + appns: otel-demo + applabel: app.kubernetes.io/component=cartservice + appkind: deployment + chaosServiceAccount: litmus-admin + experiments: + - name: pod-network-latency + spec: + components: + env: + - name: TARGET_CONTAINER + value: "" + - name: NETWORK_INTERFACE + value: eth0 + - name: LIB_IMAGE + value: docker.io/litmuschaos/go-runner:latest + - name: TC_IMAGE + value: gaiadocker/iproute2 + - name: NETWORK_LATENCY + value: "2000" + - name: TOTAL_CHAOS_DURATION + value: "150" + - name: RAMP_TIME + value: "" + - name: JITTER + value: "0" + - name: PODS_AFFECTED_PERC + value: "" + - name: 
TARGET_PODS + value: "" + - name: CONTAINER_RUNTIME + value: containerd + - name: DEFAULT_HEALTH_CHECK + value: "false" + - name: DESTINATION_IPS + value: "" + - name: DESTINATION_HOSTS + value: "" + - name: SOCKET_PATH + value: /run/containerd/containerd.sock + - name: NODE_LABEL + value: "" + - name: SEQUENCE + value: parallel + outputs: {} + metadata: + labels: + weight: "10" + container: + name: "" + image: docker.io/litmuschaos/litmus-checker:2.11.0 + args: + - -file=/tmp/chaosengine-pod-network-latency-pok.yaml + - -saveName=/tmp/engine-name + resources: {} + entrypoint: test + arguments: + parameters: + - name: adminModeNamespace + value: litmus + serviceAccountName: argo-chaos + podGC: + strategy: OnWorkflowCompletion + securityContext: + runAsUser: 1000 + runAsNonRoot: true +status: + startedAt: null + finishedAt: null diff --git a/monitoring/tutorials/otel-demo/chaos-exporter-dashboard.json b/monitoring/tutorials/otel-demo/chaos-exporter-dashboard.json new file mode 100644 index 00000000000..0586abffe08 --- /dev/null +++ b/monitoring/tutorials/otel-demo/chaos-exporter-dashboard.json @@ -0,0 +1,647 @@ +{ + "annotations": { + "list": [ + { + "builtIn": 1, + "datasource": { + "type": "grafana", + "uid": "-- Grafana --" + }, + "enable": true, + "hide": true, + "iconColor": "rgba(0, 211, 255, 1)", + "name": "Annotations & Alerts", + "type": "dashboard" + } + ] + }, + "editable": true, + "fiscalYearStartMonth": 0, + "graphTooltip": 0, + "id": 5, + "links": [], + "panels": [ + { + "collapsed": false, + "gridPos": { + "h": 1, + "w": 24, + "x": 0, + "y": 0 + }, + "id": 8, + "panels": [], + "title": "Chaos Exporter Dashboard", + "type": "row" + }, + { + "datasource": { + "type": "prometheus", + "uid": "webstore-metrics" + }, + "fieldConfig": { + "defaults": { + "color": { + "mode": "palette-classic" + }, + "custom": { + "fillOpacity": 50, + "hideFrom": { + "legend": false, + "tooltip": false, + "viz": false + }, + "insertNulls": false, + "lineWidth": 0, + 
"spanNulls": false + }, + "mappings": [], + "thresholds": { + "mode": "absolute", + "steps": [ + { + "color": "transparent", + "value": null + } + ] + } + }, + "overrides": [] + }, + "gridPos": { + "h": 7, + "w": 20, + "x": 0, + "y": 1 + }, + "id": 1, + "options": { + "alignValue": "center", + "legend": { + "displayMode": "list", + "placement": "bottom", + "showLegend": false + }, + "mergeValues": true, + "rowHeight": 0.7, + "showValue": "auto", + "tooltip": { + "mode": "single", + "sort": "none" + } + }, + "pluginVersion": "10.4.1", + "targets": [ + { + "datasource": { + "type": "prometheus", + "uid": "webstore-metrics" + }, + "disableTextWrap": false, + "editorMode": "builder", + "expr": "litmuschaos_experiment_total_duration", + "format": "time_series", + "fullMetaSearch": false, + "includeNullMetadata": true, + "legendFormat": "{{chaosengine_name}}", + "range": true, + "refId": "A", + "useBackend": false + } + ], + "title": "Chaos Experiments Duration", + "type": "state-timeline" + }, + { + "datasource": { + "type": "prometheus", + "uid": "webstore-metrics" + }, + "fieldConfig": { + "defaults": { + "mappings": [], + "thresholds": { + "mode": "absolute", + "steps": [ + { + "color": "green", + "value": null + }, + { + "color": "red", + "value": 80 + } + ] + } + }, + "overrides": [] + }, + "format": "short", + "gridPos": { + "h": 6, + "w": 5, + "x": 0, + "y": 8 + }, + "id": 2, + "max": 100, + "min": 0, + "options": { + "minVizHeight": 75, + "minVizWidth": 75, + "orientation": "auto", + "reduceOptions": { + "calcs": [ + "lastNotNull" + ], + "fields": "", + "values": false + }, + "showThresholdLabels": false, + "showThresholdMarkers": true, + "sizing": "auto" + }, + "pluginVersion": "11.1.0", + "targets": [ + { + "datasource": { + "type": "prometheus", + "uid": "webstore-metrics" + }, + "disableTextWrap": false, + "editorMode": "builder", + "expr": "litmuschaos_cluster_scoped_experiments_installed_count", + "format": "time_series", + "fullMetaSearch": false, + 
"includeNullMetadata": true, + "legendFormat": "Total Experiments", + "range": true, + "refId": "A", + "useBackend": false + } + ], + "thresholds": "0,50,100", + "title": "Total Experiments", + "type": "gauge", + "valueMaps": [ + { + "text": "No Data", + "value": "null" + } + ], + "valueName": "current" + }, + { + "datasource": { + "type": "prometheus", + "uid": "webstore-metrics" + }, + "fieldConfig": { + "defaults": { + "color": { + "fixedColor": "dark-yellow", + "mode": "fixed" + }, + "mappings": [], + "thresholds": { + "mode": "absolute", + "steps": [ + { + "color": "green", + "value": null + }, + { + "color": "red", + "value": 80 + } + ] + } + }, + "overrides": [] + }, + "format": "short", + "gridPos": { + "h": 6, + "w": 5, + "x": 5, + "y": 8 + }, + "id": 5, + "max": 100, + "min": 0, + "options": { + "minVizHeight": 75, + "minVizWidth": 75, + "orientation": "auto", + "reduceOptions": { + "calcs": [ + "lastNotNull" + ], + "fields": "", + "values": false + }, + "showThresholdLabels": false, + "showThresholdMarkers": true, + "sizing": "auto" + }, + "pluginVersion": "11.1.0", + "targets": [ + { + "datasource": { + "type": "prometheus", + "uid": "webstore-metrics" + }, + "disableTextWrap": false, + "editorMode": "code", + "expr": "sum(litmuschaos_awaited_experiments)", + "format": "time_series", + "fullMetaSearch": false, + "includeNullMetadata": true, + "legendFormat": "Queued Experiments", + "range": true, + "refId": "A", + "useBackend": false + } + ], + "thresholds": "0,50,100", + "title": "Awaited Experiments", + "type": "gauge", + "valueMaps": [ + { + "text": "No Data", + "value": "null" + } + ], + "valueName": "current" + }, + { + "alert": { + "alertRuleTags": {}, + "conditions": [ + { + "evaluator": { + "params": [ + 0.99 + ], + "type": "gt" + }, + "operator": { + "type": "and" + }, + "query": { + "params": [ + "A", + "5s", + "now" + ] + }, + "reducer": { + "params": [], + "type": "max" + }, + "type": "query" + } + ], + "executionErrorState": "alerting", + 
"for": "1s", + "frequency": "1s", + "handler": 1, + "message": "Chaos Probe Failed !!!\n\n
\n

Chaos Details:-
\n

\n

\n

App Details:-
\n

\n

", + "name": "Chaos Experiment Probe Failure Alerts alert", + "noDataState": "no_data", + "notifications": [] + }, + "datasource": { + "type": "prometheus", + "uid": "webstore-metrics" + }, + "fieldConfig": { + "defaults": { + "color": { + "mode": "palette-classic" + }, + "custom": { + "axisBorderShow": false, + "axisCenteredZero": false, + "axisColorMode": "text", + "axisLabel": "Probes failed", + "axisPlacement": "auto", + "barAlignment": 0, + "drawStyle": "line", + "fillOpacity": 50, + "gradientMode": "opacity", + "hideFrom": { + "legend": false, + "tooltip": false, + "viz": false + }, + "insertNulls": false, + "lineInterpolation": "stepAfter", + "lineWidth": 1, + "pointSize": 5, + "scaleDistribution": { + "log": 2, + "type": "log" + }, + "showPoints": "never", + "spanNulls": false, + "stacking": { + "group": "A", + "mode": "none" + }, + "thresholdsStyle": { + "mode": "line+area" + } + }, + "mappings": [], + "max": 1, + "min": 0, + "thresholds": { + "mode": "absolute", + "steps": [ + { + "color": "transparent", + "value": null + }, + { + "color": "red", + "value": 0.99 + } + ] + }, + "unit": "none" + }, + "overrides": [ + { + "matcher": { + "id": "byRegexp", + "options": "/.*Fail/" + }, + "properties": [ + { + "id": "color", + "value": { + "fixedColor": "#E02F44", + "mode": "fixed" + } + } + ] + }, + { + "matcher": { + "id": "byValue", + "options": { + "op": "gte", + "reducer": "allIsZero", + "value": 0 + } + }, + "properties": [ + { + "id": "custom.hideFrom", + "value": { + "legend": true, + "tooltip": true, + "viz": false + } + } + ] + }, + { + "matcher": { + "id": "byValue", + "options": { + "op": "gte", + "reducer": "allIsNull", + "value": 0 + } + }, + "properties": [ + { + "id": "custom.hideFrom", + "value": { + "legend": true, + "tooltip": true, + "viz": false + } + } + ] + } + ] + }, + "gridPos": { + "h": 12, + "w": 10, + "x": 10, + "y": 8 + }, + "id": 9, + "options": { + "legend": { + "calcs": [], + "displayMode": "table", + "placement": "bottom", + 
"showLegend": true + }, + "tooltip": { + "mode": "multi", + "sort": "none" + } + }, + "pluginVersion": "7.5.5", + "targets": [ + { + "datasource": { + "uid": "DS_PROMETHEUS" + }, + "editorMode": "code", + "exemplar": true, + "expr": "litmuschaos_experiment_verdict{probe_success_percentage!=\"100.000000\"}", + "format": "time_series", + "hide": false, + "instant": false, + "interval": "1s", + "legendFormat": "{{app_label}} - {{chaosresult_name}} - {{probe_success_percentage}}", + "refId": "A" + } + ], + "title": "Chaos Experiment Probe Failure Alerts", + "transparent": true, + "type": "timeseries" + }, + { + "datasource": { + "type": "prometheus", + "uid": "webstore-metrics" + }, + "fieldConfig": { + "defaults": { + "mappings": [], + "thresholds": { + "mode": "absolute", + "steps": [ + { + "color": "green", + "value": null + }, + { + "color": "red", + "value": 80 + } + ] + } + }, + "overrides": [] + }, + "format": "short", + "gridPos": { + "h": 6, + "w": 5, + "x": 0, + "y": 14 + }, + "id": 3, + "max": 100, + "min": 0, + "options": { + "minVizHeight": 75, + "minVizWidth": 75, + "orientation": "auto", + "reduceOptions": { + "calcs": [ + "lastNotNull" + ], + "fields": "", + "values": false + }, + "showThresholdLabels": false, + "showThresholdMarkers": true, + "sizing": "auto" + }, + "pluginVersion": "11.1.0", + "targets": [ + { + "datasource": { + "type": "prometheus", + "uid": "webstore-metrics" + }, + "disableTextWrap": false, + "editorMode": "code", + "expr": "sum(litmuschaos_passed_experiments)", + "format": "time_series", + "fullMetaSearch": false, + "includeNullMetadata": true, + "legendFormat": "Passed Experiments", + "range": true, + "refId": "A", + "useBackend": false + } + ], + "thresholds": "0,50,100", + "title": "Passed Experiments", + "type": "gauge", + "valueMaps": [ + { + "text": "No Data", + "value": "null" + } + ], + "valueName": "current" + }, + { + "datasource": { + "type": "prometheus", + "uid": "webstore-metrics" + }, + "fieldConfig": { + 
"defaults": { + "color": { + "fixedColor": "dark-red", + "mode": "fixed" + }, + "mappings": [], + "thresholds": { + "mode": "absolute", + "steps": [ + { + "color": "green", + "value": null + } + ] + } + }, + "overrides": [] + }, + "format": "short", + "gridPos": { + "h": 6, + "w": 5, + "x": 5, + "y": 14 + }, + "id": 4, + "max": 100, + "min": 0, + "options": { + "minVizHeight": 75, + "minVizWidth": 75, + "orientation": "auto", + "reduceOptions": { + "calcs": [ + "lastNotNull" + ], + "fields": "", + "values": false + }, + "showThresholdLabels": false, + "showThresholdMarkers": true, + "sizing": "auto" + }, + "pluginVersion": "11.1.0", + "targets": [ + { + "datasource": { + "type": "prometheus", + "uid": "webstore-metrics" + }, + "disableTextWrap": false, + "editorMode": "code", + "expr": "sum(litmuschaos_failed_experiments)", + "format": "time_series", + "fullMetaSearch": false, + "includeNullMetadata": true, + "legendFormat": "Failed Experiments", + "range": true, + "refId": "A", + "useBackend": false + } + ], + "thresholds": "0,50,100", + "title": "Failed Experiments", + "type": "gauge", + "valueMaps": [ + { + "text": "No Data", + "value": "null" + } + ], + "valueName": "current" + } + ], + "refresh": "5m", + "schemaVersion": 39, + "tags": [], + "templating": { + "list": [] + }, + "time": { + "from": "now-6h", + "to": "now" + }, + "timepicker": {}, + "timezone": "browser", + "title": "Chaos Experiments Dashboard", + "uid": "chaos-experiments-dashboard", + "version": 3, + "weekStart": "" +} \ No newline at end of file diff --git a/monitoring/tutorials/otel-demo/custom_otel_demo_values.yml b/monitoring/tutorials/otel-demo/custom_otel_demo_values.yml new file mode 100644 index 00000000000..fc3c25d5152 --- /dev/null +++ b/monitoring/tutorials/otel-demo/custom_otel_demo_values.yml @@ -0,0 +1,79 @@ +opentelemetry-collector: + config: + receivers: + otlp: + protocols: + http: + # Since this collector needs to receive data from the web, enable cors for all origins + # 
`allowed_origins` can be refined for your deployment domain + cors: + allowed_origins: + - "http://*" + - "https://*" + prometheus: + config: + scrape_configs: + - job_name: 'chaos-exporter' + static_configs: + - targets: [ 'chaos-exporter.litmus.svc.cluster.local:8080' ] + relabel_configs: + - target_label: instance + replacement: 'chaos-exporter-service' + httpcheck/frontendproxy: + targets: + - endpoint: 'http://{{ include "otel-demo.name" . }}-frontendproxy:8080' + redis: + endpoint: "valkey-cart:6379" + collection_interval: 10s + + exporters: + ## Create an exporter to Jaeger using the standard `otlp` export format + otlp: + endpoint: '{{ include "otel-demo.name" . }}-jaeger-collector:4317' + tls: + insecure: true + # Create an exporter to Prometheus (metrics) + otlphttp/prometheus: + endpoint: 'http://{{ include "otel-demo.name" . }}-prometheus-server:9090/api/v1/otlp' + tls: + insecure: true + opensearch: + logs_index: otel + http: + endpoint: "http://otel-demo-opensearch:9200" + tls: + insecure: true + + processors: + # This processor is used to help limit high cardinality on next.js span names + # When this PR is merged (and released) we can remove this transform processor + # https://github.com/vercel/next.js/pull/64852 + transform: + error_mode: ignore + trace_statements: + - context: span + statements: + # could be removed when https://github.com/vercel/next.js/pull/64852 is fixed upstream + - replace_pattern(name, "\\?.*", "") + - replace_match(name, "GET /api/products/*", "GET /api/products/{productId}") + resource: + attributes: + - key: service.instance.id + from_attribute: k8s.pod.uid + action: insert + + connectors: + spanmetrics: {} + + service: + pipelines: + traces: + processors: [memory_limiter, resource, transform, batch] + exporters: [otlp, debug, spanmetrics] + metrics: + receivers: [httpcheck/frontendproxy, redis, otlp, spanmetrics, prometheus] + processors: [memory_limiter, resource, batch] + exporters: [otlphttp/prometheus, debug] + 
logs: + processors: [memory_limiter, resource, batch] + exporters: [opensearch, debug] \ No newline at end of file diff --git a/monitoring/tutorials/otel-demo/recommendation-service/README.md b/monitoring/tutorials/otel-demo/recommendation-service/README.md new file mode 100644 index 00000000000..532ca079ede --- /dev/null +++ b/monitoring/tutorials/otel-demo/recommendation-service/README.md @@ -0,0 +1,27 @@ +# recommendation service pod delete +## Description +- This experiment injects pod delete chaos into the recommendation service pod. +- The probe checks the Prometheus metrics for the error rate of the ListRecommendations span. + - The ListRecommendations span is reported under the frontend service, even though the call itself is handled by the recommendation service. +## Steps +### 1. Probe Settings +- probe type: `Prometheus Probe` +- name: `recommendation-service-pod-delete-probe` +- timeout: 3s +- interval: 3s +- prometheus endpoint: `http://my-otel-demo-prometheus-server.otel-demo:9090` +- prometheus query: `sum(rate(calls_total{status_code=\"STATUS_CODE_ERROR\", span_name=\"grpc.oteldemo.RecommendationService/ListRecommendations\"}[5m]))` +- Data Comparison: + - Type: Float + - Criteria: `<` + - Value: `0.05` +### 2. Create Experiment +1. New Experiment +2. Complete Overview +3. Start off by uploading the YAML file (`recommendation-service-pod-delete.yml`) +### 3. Run Experiment +1. Click on the `Run` button +2. Check Experiment Status and Logs +3. Check the Resilience Score +4. Check the Chaos Exporter metrics using Grafana and confirm that the experiment passed. ![recommendation_service_pod_delete_experiment_result_dashboard.png](../screenshots/recommendation_service_pod_delete_experiment_result_dashboard.png) +5.
Check Error Rate in frontend service Spanmetrics using Grafana ![frontend_spanmetrics.png](../screenshots/frontend_spanmetrics.png) \ No newline at end of file diff --git a/monitoring/tutorials/otel-demo/recommendation-service/recommendation-service-pod-delete.yml b/monitoring/tutorials/otel-demo/recommendation-service/recommendation-service-pod-delete.yml new file mode 100644 index 00000000000..8d4c38d3a2a --- /dev/null +++ b/monitoring/tutorials/otel-demo/recommendation-service/recommendation-service-pod-delete.yml @@ -0,0 +1,286 @@ +kind: Workflow +apiVersion: argoproj.io/v1alpha1 +metadata: + name: recommendation-service-pod-delete + namespace: litmus + creationTimestamp: null + labels: + infra_id: 5b9be872-6396-4ad1-b64a-ed4b25edd516 + revision_id: fb9618ec-40fa-4a4d-a8b3-a3451da85d06 + workflow_id: cf6dead4-944d-4c86-ba82-b5576ec0ceaf + workflows.argoproj.io/controller-instanceid: 5b9be872-6396-4ad1-b64a-ed4b25edd516 +spec: + templates: + - name: recommendationservice-pod-delete + inputs: {} + outputs: {} + metadata: {} + steps: + - - name: install-chaos-faults + template: install-chaos-faults + arguments: {} + - - name: pod-delete-zkg + template: pod-delete-zkg + arguments: {} + - - name: cleanup-chaos-resources + template: cleanup-chaos-resources + arguments: {} + - name: install-chaos-faults + inputs: + artifacts: + - name: pod-delete-zkg + path: /tmp/pod-delete-zkg.yaml + raw: + data: > + apiVersion: litmuschaos.io/v1alpha1 + + description: + message: | + Deletes a pod belonging to a deployment/statefulset/daemonset + kind: ChaosExperiment + + metadata: + name: pod-delete + labels: + name: pod-delete + app.kubernetes.io/part-of: litmus + app.kubernetes.io/component: chaosexperiment + app.kubernetes.io/version: ci + spec: + definition: + scope: Namespaced + permissions: + - apiGroups: + - "" + resources: + - pods + verbs: + - create + - delete + - get + - list + - patch + - update + - deletecollection + - apiGroups: + - "" + resources: + - events + verbs: 
+ - create + - get + - list + - patch + - update + - apiGroups: + - "" + resources: + - configmaps + verbs: + - get + - list + - apiGroups: + - "" + resources: + - pods/log + verbs: + - get + - list + - watch + - apiGroups: + - "" + resources: + - pods/exec + verbs: + - get + - list + - create + - apiGroups: + - apps + resources: + - deployments + - statefulsets + - replicasets + - daemonsets + verbs: + - list + - get + - apiGroups: + - apps.openshift.io + resources: + - deploymentconfigs + verbs: + - list + - get + - apiGroups: + - "" + resources: + - replicationcontrollers + verbs: + - get + - list + - apiGroups: + - argoproj.io + resources: + - rollouts + verbs: + - list + - get + - apiGroups: + - batch + resources: + - jobs + verbs: + - create + - list + - get + - delete + - deletecollection + - apiGroups: + - litmuschaos.io + resources: + - chaosengines + - chaosexperiments + - chaosresults + verbs: + - create + - list + - get + - patch + - update + - delete + image: docker.io/litmuschaos/go-runner:latest + imagePullPolicy: Always + args: + - -c + - ./experiments -name pod-delete + command: + - /bin/bash + env: + - name: TOTAL_CHAOS_DURATION + value: "15" + - name: RAMP_TIME + value: "" + - name: FORCE + value: "true" + - name: CHAOS_INTERVAL + value: "5" + - name: PODS_AFFECTED_PERC + value: "" + - name: TARGET_CONTAINER + value: "" + - name: TARGET_PODS + value: "" + - name: DEFAULT_HEALTH_CHECK + value: "false" + - name: NODE_LABEL + value: "" + - name: SEQUENCE + value: parallel + labels: + name: pod-delete + app.kubernetes.io/part-of: litmus + app.kubernetes.io/component: experiment-job + app.kubernetes.io/version: ci + outputs: {} + metadata: {} + container: + name: "" + image: litmuschaos/k8s:2.11.0 + command: + - sh + - -c + args: + - kubectl apply -f /tmp/ -n {{workflow.parameters.adminModeNamespace}} + && sleep 30 + resources: {} + - name: cleanup-chaos-resources + inputs: {} + outputs: {} + metadata: {} + container: + name: "" + image: 
litmuschaos/k8s:2.11.0 + command: + - sh + - -c + args: + - kubectl delete chaosengine -l workflow_run_id={{workflow.uid}} -n + {{workflow.parameters.adminModeNamespace}} + resources: {} + - name: pod-delete-zkg + inputs: + artifacts: + - name: pod-delete-zkg + path: /tmp/chaosengine-pod-delete-zkg.yaml + raw: + data: > + apiVersion: litmuschaos.io/v1alpha1 + + kind: ChaosEngine + + metadata: + namespace: "{{workflow.parameters.adminModeNamespace}}" + labels: + workflow_run_id: "{{ workflow.uid }}" + workflow_name: recommendation-service-pod-delete + annotations: + probeRef: '[{"name":"recommendation-service-pod-delete-probe","mode":"EOT"}]' + generateName: pod-delete-zkg + spec: + appinfo: + appns: otel-demo + applabel: app.kubernetes.io/component=recommendationservice + appkind: deployment + engineState: active + chaosServiceAccount: litmus-admin + experiments: + - name: pod-delete + spec: + components: + env: + - name: TOTAL_CHAOS_DURATION + value: "120" + - name: RAMP_TIME + value: "" + - name: FORCE + value: "true" + - name: CHAOS_INTERVAL + value: "5" + - name: PODS_AFFECTED_PERC + value: "" + - name: TARGET_CONTAINER + value: "" + - name: TARGET_PODS + value: "" + - name: DEFAULT_HEALTH_CHECK + value: "false" + - name: NODE_LABEL + value: "" + - name: SEQUENCE + value: parallel + outputs: {} + metadata: + labels: + weight: "10" + container: + name: "" + image: docker.io/litmuschaos/litmus-checker:2.11.0 + args: + - -file=/tmp/chaosengine-pod-delete-zkg.yaml + - -saveName=/tmp/engine-name + resources: {} + entrypoint: recommendationservice-pod-delete + arguments: + parameters: + - name: adminModeNamespace + value: litmus + serviceAccountName: argo-chaos + podGC: + strategy: OnWorkflowCompletion + securityContext: + runAsUser: 1000 + runAsNonRoot: true +status: + startedAt: null + finishedAt: null diff --git a/monitoring/tutorials/otel-demo/screenshots/cart_service_pod_network_latency_experiment_result_dashboard.png 
b/monitoring/tutorials/otel-demo/screenshots/cart_service_pod_network_latency_experiment_result_dashboard.png new file mode 100644 index 00000000000..54b6a93421e Binary files /dev/null and b/monitoring/tutorials/otel-demo/screenshots/cart_service_pod_network_latency_experiment_result_dashboard.png differ diff --git a/monitoring/tutorials/otel-demo/screenshots/cartservice_spanmetrics.png b/monitoring/tutorials/otel-demo/screenshots/cartservice_spanmetrics.png new file mode 100644 index 00000000000..5e1e2b97f9f Binary files /dev/null and b/monitoring/tutorials/otel-demo/screenshots/cartservice_spanmetrics.png differ diff --git a/monitoring/tutorials/otel-demo/screenshots/frontend_spanmetrics.png b/monitoring/tutorials/otel-demo/screenshots/frontend_spanmetrics.png new file mode 100644 index 00000000000..5044ad3333b Binary files /dev/null and b/monitoring/tutorials/otel-demo/screenshots/frontend_spanmetrics.png differ diff --git a/monitoring/tutorials/otel-demo/screenshots/otel_demo_tutorial_architecture.png b/monitoring/tutorials/otel-demo/screenshots/otel_demo_tutorial_architecture.png new file mode 100644 index 00000000000..919cf82e7cd Binary files /dev/null and b/monitoring/tutorials/otel-demo/screenshots/otel_demo_tutorial_architecture.png differ diff --git a/monitoring/tutorials/otel-demo/screenshots/recommendation_service_pod_delete_experiment_result_dashboard.png b/monitoring/tutorials/otel-demo/screenshots/recommendation_service_pod_delete_experiment_result_dashboard.png new file mode 100644 index 00000000000..10ae21b1be2 Binary files /dev/null and b/monitoring/tutorials/otel-demo/screenshots/recommendation_service_pod_delete_experiment_result_dashboard.png differ
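
The two Prometheus probes described in the tutorial READMEs each run a PromQL query and compare the result against a threshold (`< 3.0` for the cart service latency, `< 0.05` for the ListRecommendations error rate). As a rough illustration of that check, and not a copy of how Litmus implements its probe, the Python sketch below builds an instant-query URL for the Prometheus HTTP API (`/api/v1/query`), extracts the first sample from a response body, and applies the comparison. The function names, the criteria table, and the canned response body are assumptions made for this example; no network call is performed.

```python
import json
import operator
import urllib.parse

# Hypothetical mapping from probe comparison criteria to Python operators.
CRITERIA = {
    "<": operator.lt,
    "<=": operator.le,
    ">": operator.gt,
    ">=": operator.ge,
    "==": operator.eq,
}


def build_query_url(endpoint: str, promql: str) -> str:
    """Return the instant-query URL for a Prometheus endpoint and PromQL string."""
    query_string = urllib.parse.urlencode({"query": promql})
    return f"{endpoint.rstrip('/')}/api/v1/query?{query_string}"


def first_sample_value(response_body: str) -> float:
    """Extract the first sample value from an instant-query JSON response."""
    payload = json.loads(response_body)
    if payload.get("status") != "success":
        raise ValueError(f"query failed: {payload.get('error', 'unknown error')}")
    result = payload["data"]["result"]
    if not result:
        raise ValueError("query returned no samples")
    # Each vector sample is {"metric": {...}, "value": [timestamp, "value-as-string"]}.
    return float(result[0]["value"][1])


def probe_passes(value: float, criteria: str, threshold: float) -> bool:
    """Apply the comparison configured under 'Data Comparison'."""
    return CRITERIA[criteria](value, threshold)


# Canned response standing in for a real Prometheus reply (assumed sample data).
canned_body = json.dumps({
    "status": "success",
    "data": {
        "resultType": "vector",
        "result": [{"metric": {}, "value": [1700000000, "2.5"]}],
    },
})

url = build_query_url(
    "http://my-otel-demo-prometheus-server.otel-demo:9090",
    'histogram_quantile(0.99, sum(rate(duration_milliseconds_bucket'
    '{service_name="cartservice"}[5m])) by (le))/1000',
)
value = first_sample_value(canned_body)
print(url)
print(f"value={value} passed={probe_passes(value, '<', 3.0)}")
```

To try it against a live cluster, the URL could be fetched with any HTTP client once the Prometheus service is reachable (for example through a port-forward) and the response body passed to `first_sample_value`.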