Support GPU monitoring #1681

Merged 7 commits on Feb 19, 2025
4 changes: 4 additions & 0 deletions charts/datadog/CHANGELOG.md
@@ -1,5 +1,9 @@
# Datadog changelog

## 3.91.0

* Add support for GPU monitoring

## 3.90.5

* Update `fips.image.tag` to `1.1.7` updating openSSL version to 3.0.16
2 changes: 1 addition & 1 deletion charts/datadog/Chart.yaml
@@ -1,7 +1,7 @@
---
apiVersion: v1
name: datadog
version: 3.90.5
version: 3.91.0
appVersion: "7"
description: Datadog Agent
keywords:
5 changes: 4 additions & 1 deletion charts/datadog/README.md
Original file line number Diff line number Diff line change
@@ -1,6 +1,6 @@
# Datadog

![Version: 3.90.5](https://img.shields.io/badge/Version-3.90.5-informational?style=flat-square) ![AppVersion: 7](https://img.shields.io/badge/AppVersion-7-informational?style=flat-square)
![Version: 3.91.0](https://img.shields.io/badge/Version-3.91.0-informational?style=flat-square) ![AppVersion: 7](https://img.shields.io/badge/AppVersion-7-informational?style=flat-square)

[Datadog](https://www.datadoghq.com/) is a hosted infrastructure monitoring platform. This chart adds the Datadog Agent to all nodes in your cluster via a DaemonSet. It also optionally depends on the [kube-state-metrics chart](https://github.com/prometheus-community/helm-charts/tree/main/charts/kube-state-metrics). For more information about monitoring Kubernetes with Datadog, please refer to the [Datadog documentation website](https://docs.datadoghq.com/agent/basic_agent_usage/kubernetes/).

@@ -749,6 +749,9 @@ helm install <RELEASE_NAME> \
| datadog.envFrom | list | `[]` | Set environment variables for all Agents directly from configMaps and/or secrets |
| datadog.excludePauseContainer | bool | `true` | Exclude pause containers from Agent Autodiscovery. |
| datadog.expvarPort | int | `6000` | Specify the port to expose pprof and expvar to not interfere with the agent metrics port from the cluster-agent, which defaults to 5000 |
| datadog.gpuMonitoring.configureCgroupPerms | bool | `false` | Configure cgroup permissions for GPU monitoring |
| datadog.gpuMonitoring.enabled | bool | `false` | Enable GPU monitoring |
| datadog.gpuMonitoring.runtimeClassName | string | `"nvidia"` | Runtime class name for the agent pods to get access to NVIDIA resources |
| datadog.helmCheck.collectEvents | bool | `false` | Set this to true to enable event collection in the Helm Check (Requires Agent 7.36.0+ and Cluster Agent 1.20.0+) This requires datadog.HelmCheck.enabled to be set to true |
| datadog.helmCheck.enabled | bool | `false` | Set this to true to enable the Helm check (Requires Agent 7.35.0+ and Cluster Agent 1.19.0+) This requires clusterAgent.enabled to be set to true |
| datadog.helmCheck.valuesAsTags | object | `{}` | Collects Helm values from a release and uses them as tags (Requires Agent and Cluster Agent 7.40.0+). This requires datadog.HelmCheck.enabled to be set to true |
12 changes: 8 additions & 4 deletions charts/datadog/templates/_container-system-probe.yaml
@@ -21,7 +21,7 @@
{{- include "containers-common-env" . | nindent 4 }}
- name: DD_LOG_LEVEL
value: {{ .Values.agents.containers.systemProbe.logLevel | default .Values.datadog.logLevel | quote }}
{{- if .Values.datadog.serviceMonitoring.enabled }}
{{- if or .Values.datadog.serviceMonitoring.enabled .Values.datadog.gpuMonitoring.enabled }}
- name: HOST_ROOT
value: "/host/root"
{{- end }}
@@ -70,14 +70,14 @@
mountPath: /host/proc
mountPropagation: {{ .Values.datadog.hostVolumeMountPropagation }}
readOnly: true
{{- if or .Values.datadog.serviceMonitoring.enabled .Values.datadog.networkMonitoring.enabled .Values.datadog.discovery.enabled }}
{{- if or .Values.datadog.serviceMonitoring.enabled .Values.datadog.networkMonitoring.enabled .Values.datadog.discovery.enabled .Values.datadog.gpuMonitoring.enabled }}
- name: cgroups
mountPath: /host/sys/fs/cgroup
mountPropagation: {{ .Values.datadog.hostVolumeMountPropagation }}
readOnly: true
{{- end }}
{{- include "linux-container-host-release-volumemounts" . | nindent 4 }}
{{- if (eq (include "should-add-host-path-for-os-release-paths" .) "true") }}
{{- if (eq (include "should-add-host-path-for-os-release-paths" .) "true") }}
{{- if ne .Values.datadog.osReleasePath "/etc/redhat-release" }}
- name: etc-redhat-release
mountPath: /host/etc/redhat-release
@@ -94,12 +94,16 @@
readOnly: true
{{- end }}
{{- end }}
{{- if .Values.datadog.serviceMonitoring.enabled }}
{{- if or .Values.datadog.serviceMonitoring.enabled .Values.datadog.gpuMonitoring.enabled }}
- name: hostroot
mountPath: /host/root
mountPropagation: {{ .Values.datadog.hostVolumeMountPropagation }}
readOnly: true
{{- end }}
{{- if .Values.datadog.gpuMonitoring.enabled }}
- name: gpu-devices
mountPath: /var/run/nvidia-container-devices/all
Review comment on lines +104 to +105:

Reviewer: Should we also mount the cgroups in this case?

Contributor Author: It's not needed, as we're already mounting the host root (just above), and `readOnly` doesn't apply in this case because we can still write to it.

Reviewer: I would prefer to have an explicit reference, as other products do.

{{- end }}
{{- if and (eq (include "runtime-compilation-enabled" .) "true") .Values.datadog.systemProbe.enableDefaultKernelHeadersPaths }}
- name: modules
mountPath: /lib/modules
7 changes: 6 additions & 1 deletion charts/datadog/templates/_daemonset-volumes-linux.yaml
@@ -148,7 +148,7 @@
path: /etc/passwd
name: passwd
{{- end }}
{{- if or (and (eq (include "should-enable-system-probe" .) "true") .Values.datadog.serviceMonitoring.enabled) (and (eq (include "should-enable-security-agent" .) "true") .Values.datadog.securityAgent.compliance.enabled) }}
{{- if or (and (eq (include "should-enable-system-probe" .) "true") (or .Values.datadog.serviceMonitoring.enabled .Values.datadog.gpuMonitoring.enabled)) (and (eq (include "should-enable-security-agent" .) "true") .Values.datadog.securityAgent.compliance.enabled) }}
- hostPath:
path: /
name: hostroot
@@ -219,4 +219,9 @@
secretName: datadog-kubelet-cert
name: kubelet-cert-volume
{{- end }}
{{- if .Values.datadog.gpuMonitoring.enabled }}
- name: gpu-devices
hostPath:
path: /dev/null
{{- end }}
{{- end -}}
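
The `gpu-devices` volume can look odd in isolation, since it mounts the host's `/dev/null`. The sketch below shows how the volume and its mount in the system-probe container would likely render together when `datadog.gpuMonitoring.enabled` is `true`. The explanation in the comments is an assumption not stated in this PR: it presumes the NVIDIA Container Toolkit is configured to accept the visible-device list as volume mounts, in which case the mount path names the devices to expose and `all` requests every GPU.

```yaml
# Sketch only: rendered fragments with datadog.gpuMonitoring.enabled=true.
# Pod-level volume (from _daemonset-volumes-linux.yaml):
volumes:
  - name: gpu-devices
    hostPath:
      path: /dev/null   # placeholder source; the mount path is what matters
# Mount in the system-probe container (from _container-system-probe.yaml):
volumeMounts:
  - name: gpu-devices
    # Assumption: the NVIDIA runtime reads the last path component ("all")
    # as the device selector instead of the NVIDIA_VISIBLE_DEVICES variable.
    mountPath: /var/run/nvidia-container-devices/all
```

If the runtime on the node is not configured this way, the mount is presumably harmless but would not grant GPU access on its own.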
2 changes: 1 addition & 1 deletion charts/datadog/templates/_helpers.tpl
@@ -329,7 +329,7 @@ Return a remote image path based on `.Values` (passed as root) and `.` (any `.im
Return true if a system-probe feature is enabled.
*/}}
{{- define "system-probe-feature" -}}
{{- if or .Values.datadog.securityAgent.runtime.enabled .Values.datadog.securityAgent.runtime.fimEnabled .Values.datadog.networkMonitoring.enabled .Values.datadog.systemProbe.enableTCPQueueLength .Values.datadog.systemProbe.enableOOMKill .Values.datadog.serviceMonitoring.enabled .Values.datadog.discovery.enabled -}}
{{- if or .Values.datadog.securityAgent.runtime.enabled .Values.datadog.securityAgent.runtime.fimEnabled .Values.datadog.networkMonitoring.enabled .Values.datadog.systemProbe.enableTCPQueueLength .Values.datadog.systemProbe.enableOOMKill .Values.datadog.serviceMonitoring.enabled .Values.datadog.discovery.enabled .Values.datadog.gpuMonitoring.enabled -}}
true
{{- else -}}
false
3 changes: 3 additions & 0 deletions charts/datadog/templates/daemonset.yaml
@@ -114,6 +114,9 @@ spec:
{{- if or .Values.agents.priorityClassCreate .Values.agents.priorityClassName }}
priorityClassName: {{ .Values.agents.priorityClassName | default (include "datadog.fullname" . ) }}
{{- end }}
{{- if .Values.datadog.gpuMonitoring.enabled }}
runtimeClassName: {{ .Values.datadog.gpuMonitoring.runtimeClassName }}
{{- end }}
containers:
{{- include "container-agent" . | nindent 6 }}
{{- if eq (include "should-enable-trace-agent" .) "true" }}
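
The diff shown here sets `runtimeClassName` on the agent pods but does not add a `RuntimeClass` resource, so one with a matching name presumably has to exist in the cluster already. A minimal sketch of such a resource follows; the `handler` value is an assumption and must match whatever handler name the node's container runtime is configured with (GPU tooling such as the NVIDIA GPU Operator commonly creates an equivalent object).

```yaml
# Sketch only: a RuntimeClass matching the chart default
# datadog.gpuMonitoring.runtimeClassName: "nvidia".
apiVersion: node.k8s.io/v1
kind: RuntimeClass
metadata:
  name: nvidia      # must equal datadog.gpuMonitoring.runtimeClassName
handler: nvidia     # assumption: handler name configured in containerd/CRI-O
```

If the named RuntimeClass does not exist, Kubernetes rejects pods that reference it, so it is worth checking before enabling the feature.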
3 changes: 3 additions & 0 deletions charts/datadog/templates/system-probe-configmap.yaml
@@ -47,6 +47,9 @@ data:
discovery:
enabled: {{ $.Values.datadog.discovery.enabled }}
{{- end }}
gpu_monitoring:
enabled: {{ $.Values.datadog.gpuMonitoring.enabled }}
configure_cgroup_perms: {{ $.Values.datadog.gpuMonitoring.configureCgroupPerms }}
runtime_security_config:
enabled: {{ $.Values.datadog.securityAgent.runtime.enabled }}
fim_enabled: {{ $.Values.datadog.securityAgent.runtime.fimEnabled }}
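
For reference, the new template lines should render into the system-probe configuration roughly as follows. This is a sketch assuming `datadog.gpuMonitoring.enabled` is set to `true` and `configureCgroupPerms` is left at its default; judging from the hunk above, the block sits outside the `discovery` conditional, so it appears to be emitted (with `enabled: false`) even when the feature is off.

```yaml
# Sketch only: expected rendering of the added block in system-probe.yaml
# with datadog.gpuMonitoring.enabled=true.
gpu_monitoring:
  enabled: true
  configure_cgroup_perms: false   # chart default for configureCgroupPerms
```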
11 changes: 11 additions & 0 deletions charts/datadog/values.yaml
@@ -835,6 +835,17 @@ datadog:
# datadog.discovery.enabled -- (bool) Enable Service Discovery
enabled: # false

gpuMonitoring:
# datadog.gpuMonitoring.enabled -- Enable GPU monitoring
enabled: false

# datadog.gpuMonitoring.configureCgroupPerms -- Configure cgroup permissions for GPU monitoring
configureCgroupPerms: false

# datadog.gpuMonitoring.runtimeClassName -- Runtime class name for the agent pods to get access to NVIDIA resources
runtimeClassName: "nvidia"


# Software Bill of Materials configuration
sbom:
containerImage:
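
To try the feature, a values override along these lines should be enough. This is a minimal sketch: the three keys come from this PR, while the file name `gpu-values.yaml` is hypothetical.

```yaml
# gpu-values.yaml (hypothetical file name)
datadog:
  gpuMonitoring:
    # Turns on the system-probe GPU module and the extra mounts added in this PR.
    enabled: true
    # Left at the chart default; set to true only if cgroup permissions must be configured.
    configureCgroupPerms: false
    # Must name a RuntimeClass that exists in the cluster.
    runtimeClassName: "nvidia"
```

It could then be passed to Helm with `-f gpu-values.yaml` on the `helm install <RELEASE_NAME>` or `helm upgrade` command. Because `runtimeClassName` is set at the pod level in `daemonset.yaml`, it applies to every container in the agent pod, not only system-probe.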