Merge #1681: Support GPU monitoring
* Enable GPU monitoring

* Update README

* Fix changelog

* Mount cgroups
gjulianm authored Feb 19, 2025
1 parent c7c5991 commit 8c6cbd4
Showing 9 changed files with 41 additions and 8 deletions.
4 changes: 4 additions & 0 deletions charts/datadog/CHANGELOG.md
@@ -1,5 +1,9 @@
# Datadog changelog

+## 3.91.0
+
+* Add support for GPU monitoring
+
## 3.90.5

* Update `fips.image.tag` to `1.1.7` updating openSSL version to 3.0.16
2 changes: 1 addition & 1 deletion charts/datadog/Chart.yaml
@@ -1,7 +1,7 @@
---
apiVersion: v1
name: datadog
-version: 3.90.5
+version: 3.91.0
appVersion: "7"
description: Datadog Agent
keywords:
5 changes: 4 additions & 1 deletion charts/datadog/README.md
@@ -1,6 +1,6 @@
# Datadog

-![Version: 3.90.5](https://img.shields.io/badge/Version-3.90.5-informational?style=flat-square) ![AppVersion: 7](https://img.shields.io/badge/AppVersion-7-informational?style=flat-square)
+![Version: 3.91.0](https://img.shields.io/badge/Version-3.91.0-informational?style=flat-square) ![AppVersion: 7](https://img.shields.io/badge/AppVersion-7-informational?style=flat-square)

[Datadog](https://www.datadoghq.com/) is a hosted infrastructure monitoring platform. This chart adds the Datadog Agent to all nodes in your cluster via a DaemonSet. It also optionally depends on the [kube-state-metrics chart](https://github.com/prometheus-community/helm-charts/tree/main/charts/kube-state-metrics). For more information about monitoring Kubernetes with Datadog, please refer to the [Datadog documentation website](https://docs.datadoghq.com/agent/basic_agent_usage/kubernetes/).

@@ -749,6 +749,9 @@ helm install <RELEASE_NAME> \
| datadog.envFrom | list | `[]` | Set environment variables for all Agents directly from configMaps and/or secrets |
| datadog.excludePauseContainer | bool | `true` | Exclude pause containers from Agent Autodiscovery. |
| datadog.expvarPort | int | `6000` | Specify the port to expose pprof and expvar to not interfere with the agent metrics port from the cluster-agent, which defaults to 5000 |
+| datadog.gpuMonitoring.configureCgroupPerms | bool | `false` | Configure cgroup permissions for GPU monitoring |
+| datadog.gpuMonitoring.enabled | bool | `false` | Enable GPU monitoring |
+| datadog.gpuMonitoring.runtimeClassName | string | `"nvidia"` | Runtime class name for the agent pods to get access to NVIDIA resources |
| datadog.helmCheck.collectEvents | bool | `false` | Set this to true to enable event collection in the Helm Check (Requires Agent 7.36.0+ and Cluster Agent 1.20.0+) This requires datadog.HelmCheck.enabled to be set to true |
| datadog.helmCheck.enabled | bool | `false` | Set this to true to enable the Helm check (Requires Agent 7.35.0+ and Cluster Agent 1.19.0+) This requires clusterAgent.enabled to be set to true |
| datadog.helmCheck.valuesAsTags | object | `{}` | Collects Helm values from a release and uses them as tags (Requires Agent and Cluster Agent 7.40.0+). This requires datadog.HelmCheck.enabled to be set to true |
12 changes: 8 additions & 4 deletions charts/datadog/templates/_container-system-probe.yaml
@@ -21,7 +21,7 @@
{{- include "containers-common-env" . | nindent 4 }}
- name: DD_LOG_LEVEL
value: {{ .Values.agents.containers.systemProbe.logLevel | default .Values.datadog.logLevel | quote }}
-{{- if .Values.datadog.serviceMonitoring.enabled }}
+{{- if or .Values.datadog.serviceMonitoring.enabled .Values.datadog.gpuMonitoring.enabled }}
- name: HOST_ROOT
value: "/host/root"
{{- end }}
@@ -70,14 +70,14 @@
mountPath: /host/proc
mountPropagation: {{ .Values.datadog.hostVolumeMountPropagation }}
readOnly: true
-{{- if or .Values.datadog.serviceMonitoring.enabled .Values.datadog.networkMonitoring.enabled .Values.datadog.discovery.enabled }}
+{{- if or .Values.datadog.serviceMonitoring.enabled .Values.datadog.networkMonitoring.enabled .Values.datadog.discovery.enabled .Values.datadog.gpuMonitoring.enabled }}
- name: cgroups
mountPath: /host/sys/fs/cgroup
mountPropagation: {{ .Values.datadog.hostVolumeMountPropagation }}
readOnly: true
{{- end }}
{{- include "linux-container-host-release-volumemounts" . | nindent 4 }}
-{{- if (eq (include "should-add-host-path-for-os-release-paths" .) "true") }}
+{{- if (eq (include "should-add-host-path-for-os-release-paths" .) "true") }}
{{- if ne .Values.datadog.osReleasePath "/etc/redhat-release" }}
- name: etc-redhat-release
mountPath: /host/etc/redhat-release
@@ -94,12 +94,16 @@
readOnly: true
{{- end }}
{{- end }}
-{{- if .Values.datadog.serviceMonitoring.enabled }}
+{{- if or .Values.datadog.serviceMonitoring.enabled .Values.datadog.gpuMonitoring.enabled }}
- name: hostroot
mountPath: /host/root
mountPropagation: {{ .Values.datadog.hostVolumeMountPropagation }}
readOnly: true
{{- end }}
+{{- if .Values.datadog.gpuMonitoring.enabled }}
+- name: gpu-devices
+mountPath: /var/run/nvidia-container-devices/all
+{{- end }}
{{- if and (eq (include "runtime-compilation-enabled" .) "true") .Values.datadog.systemProbe.enableDefaultKernelHeadersPaths }}
- name: modules
mountPath: /lib/modules
7 changes: 6 additions & 1 deletion charts/datadog/templates/_daemonset-volumes-linux.yaml
@@ -148,7 +148,7 @@
path: /etc/passwd
name: passwd
{{- end }}
-{{- if or (and (eq (include "should-enable-system-probe" .) "true") .Values.datadog.serviceMonitoring.enabled) (and (eq (include "should-enable-security-agent" .) "true") .Values.datadog.securityAgent.compliance.enabled) }}
+{{- if or (and (eq (include "should-enable-system-probe" .) "true") (or .Values.datadog.serviceMonitoring.enabled .Values.datadog.gpuMonitoring.enabled)) (and (eq (include "should-enable-security-agent" .) "true") .Values.datadog.securityAgent.compliance.enabled) }}
- hostPath:
path: /
name: hostroot
@@ -219,4 +219,9 @@
secretName: datadog-kubelet-cert
name: kubelet-cert-volume
{{- end }}
+{{- if .Values.datadog.gpuMonitoring.enabled }}
+- name: gpu-devices
+hostPath:
+path: /dev/null
+{{- end }}
{{- end -}}
2 changes: 1 addition & 1 deletion charts/datadog/templates/_helpers.tpl
@@ -329,7 +329,7 @@ Return a remote image path based on `.Values` (passed as root) and `.` (any `.im
Return true if a system-probe feature is enabled.
*/}}
{{- define "system-probe-feature" -}}
-{{- if or .Values.datadog.securityAgent.runtime.enabled .Values.datadog.securityAgent.runtime.fimEnabled .Values.datadog.networkMonitoring.enabled .Values.datadog.systemProbe.enableTCPQueueLength .Values.datadog.systemProbe.enableOOMKill .Values.datadog.serviceMonitoring.enabled .Values.datadog.discovery.enabled -}}
+{{- if or .Values.datadog.securityAgent.runtime.enabled .Values.datadog.securityAgent.runtime.fimEnabled .Values.datadog.networkMonitoring.enabled .Values.datadog.systemProbe.enableTCPQueueLength .Values.datadog.systemProbe.enableOOMKill .Values.datadog.serviceMonitoring.enabled .Values.datadog.discovery.enabled .Values.datadog.gpuMonitoring.enabled -}}
true
{{- else -}}
false
3 changes: 3 additions & 0 deletions charts/datadog/templates/daemonset.yaml
@@ -114,6 +114,9 @@ spec:
{{- if or .Values.agents.priorityClassCreate .Values.agents.priorityClassName }}
priorityClassName: {{ .Values.agents.priorityClassName | default (include "datadog.fullname" . ) }}
{{- end }}
+{{- if .Values.datadog.gpuMonitoring.enabled }}
+runtimeClassName: {{ .Values.datadog.gpuMonitoring.runtimeClassName }}
+{{- end }}
containers:
{{- include "container-agent" . | nindent 6 }}
{{- if eq (include "should-enable-trace-agent" .) "true" }}
3 changes: 3 additions & 0 deletions charts/datadog/templates/system-probe-configmap.yaml
@@ -47,6 +47,9 @@ data:
discovery:
enabled: {{ $.Values.datadog.discovery.enabled }}
{{- end }}
+gpu_monitoring:
+enabled: {{ $.Values.datadog.gpuMonitoring.enabled }}
+configure_cgroup_perms: {{ $.Values.datadog.gpuMonitoring.configureCgroupPerms }}
runtime_security_config:
enabled: {{ $.Values.datadog.securityAgent.runtime.enabled }}
fim_enabled: {{ $.Values.datadog.securityAgent.runtime.fimEnabled }}
11 changes: 11 additions & 0 deletions charts/datadog/values.yaml
@@ -835,6 +835,17 @@ datadog:
# datadog.discovery.enabled -- (bool) Enable Service Discovery
enabled: # false

+gpuMonitoring:
+# datadog.gpuMonitoring.enabled -- Enable GPU monitoring
+enabled: false
+
+# datadog.gpuMonitoring.configureCgroupPerms -- Configure cgroup permissions for GPU monitoring
+configureCgroupPerms: false
+
+# datadog.gpuMonitoring.runtimeClassName -- Runtime class name for the agent pods to get access to NVIDIA resources
+runtimeClassName: "nvidia"
+
+
# Software Bill of Materials configuration
sbom:
containerImage:
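
For illustration only, the three values added in this commit could be set from a user-supplied override file along the following lines. This is a minimal sketch, not part of the commit: the file name `gpu-values.yaml` is hypothetical, and it assumes the cluster already has a RuntimeClass whose name matches `runtimeClassName`.

```yaml
# gpu-values.yaml (hypothetical file name, not part of this commit)
datadog:
  gpuMonitoring:
    # Sets gpu_monitoring.enabled in the system-probe config and activates
    # the gpu-devices volume, hostroot/cgroups mounts, and runtimeClassName
    # that this commit adds to the templates.
    enabled: true
    # Chart default; rendered as gpu_monitoring.configure_cgroup_perms.
    configureCgroupPerms: false
    # Assumption: a RuntimeClass with this name exists in the cluster.
    runtimeClassName: "nvidia"
```

Such a file would typically be passed to Helm with `-f gpu-values.yaml` on the `helm install <RELEASE_NAME>` command shown in the chart README.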