Support GPU monitoring #1681

Merged: 7 commits into main, Feb 19, 2025

Conversation

@gjulianm (Contributor) commented on Jan 31, 2025

What this PR does / why we need it:

Jira Ticket

This PR adds support for enabling GPU monitoring in system-probe.
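In the simplest case, where every node in the cluster has GPUs, the feature can be enabled directly in the Helm values. Below is a minimal sketch: only datadog.gpuMonitoring.enabled comes from this PR; the apiKey line is ordinary chart configuration shown for completeness.

# values.yaml (minimal sketch: enable GPU monitoring on every node)
datadog:
  apiKey: <DATADOG_API_KEY>  # standard chart setting, not part of this PR
  gpuMonitoring:
    enabled: true            # value added by this PR (see the mixed-cluster example below)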


Special notes for your reviewer:

This PR does not add automatic support for mixed-node clusters (where some nodes have GPUs and others don't). However, this can be achieved without issues by using the agents.affinity value together with the existing documentation for joining an existing Cluster Agent. Assuming we already have a values.yaml file for a regular, non-GPU deployment, the steps to enable GPU monitoring only on GPU nodes are the following:

  1. In agents.affinity, add a node selector that stops the non-GPU agent from running on GPU nodes:
# Base values.yaml (for non-GPU nodes)
agents:
  affinity:
    nodeAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        nodeSelectorTerms:
        - matchExpressions:
          - key: nvidia.com/gpu.present
            operator: NotIn
            values:
              - "true"

Here we chose the nvidia.com/gpu.present label, as it's automatically added to GPU nodes by the NVIDIA GPU operator; however, any other appropriate label may be chosen (a quick way to check which nodes carry it is shown after these steps).

  2. Create another file (e.g., values-gpu.yaml) that will be applied on top of the previous one. In this file we enable GPU monitoring, configure the Cluster Agent to join the existing cluster as per the instructions, and include the affinity for the GPU nodes:
# GPU-specific values-gpu.yaml (for GPU nodes)
datadog:
  kubeStateMetricsEnabled: false # Disabled as we're joining an existing cluster agent
  gpuMonitoring:
    enabled: true

agents:
  affinity:
    nodeAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        nodeSelectorTerms:
        - matchExpressions:
          - key: nvidia.com/gpu.present
            operator: In
            values:
              - "true"

existingClusterAgent:
  join: true

# Disable the datadogMetrics CRD deployment since it has already been deployed with the other chart release.
datadog-crds:
  crds:
    datadogMetrics: false

  3. Deploy the datadog chart twice: first with the values.yaml file as modified in step 1, and then a second time (with a different release name) adding the values-gpu.yaml file as defined in step 2:
$ helm install -f values.yaml datadog datadog
$ helm install -f values.yaml -f values-gpu.yaml datadog-gpu datadog 

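To double-check which nodes each release will schedule on, you can list nodes by the label used in the affinity rules. This is just a verification sketch, assuming the nvidia.com/gpu.present label from the NVIDIA GPU operator mentioned above:

# Nodes the GPU release (datadog-gpu) should end up on
$ kubectl get nodes -l nvidia.com/gpu.present=true

# Inspect all node labels if a different label is used
$ kubectl get nodes --show-labels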
Checklist


  • Chart Version bumped
  • Documentation has been updated with helm-docs (run: .github/helm-docs.sh)
  • CHANGELOG.md has been updated
  • Variables are documented in the README.md
  • For Datadog Operator chart or value changes update the test baselines (run: make update-test-baselines)

@gjulianm gjulianm self-assigned this Jan 31, 2025
@github-actions github-actions bot added the chart/datadog This issue or pull request is related to the datadog chart label Jan 31, 2025
@gjulianm gjulianm force-pushed the guillermo.julian/enable-gpu-monitoring branch from a36f748 to 4b4c8ad on February 7, 2025 12:45
@gjulianm gjulianm marked this pull request as ready for review February 7, 2025 13:45
@gjulianm gjulianm requested review from a team as code owners February 7, 2025 13:45

@val06 reviewed

Comment on lines +104 to +105
- name: gpu-devices
mountPath: /var/run/nvidia-container-devices/all

@val06:
should we also mount the cgroups in this case?
(here)

@gjulianm (Contributor, Author):
It's not needed, as we're already mounting the host root (just above), and readOnly doesn't apply in this case since we can still write to it.

@val06:
I would prefer to have an explicit reference, as other products do.
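(For illustration only, an explicit cgroup mount would look roughly like the sketch below; the volume name and paths are assumptions and are not taken from this PR's templates.)

volumeMounts:
  - name: cgroups                  # hypothetical volume name, for illustration
    mountPath: /host/sys/fs/cgroup
    readOnly: true
volumes:
  - name: cgroups
    hostPath:
      path: /sys/fs/cgroup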

@gjulianm gjulianm mentioned this pull request Feb 10, 2025

@gjulianm (Contributor, Author) commented:
/merge

dd-devflow bot commented Feb 19, 2025

View all feedbacks in Devflow UI.
2025-02-19 10:31:25 UTC ℹ️ Start processing command /merge


2025-02-19 10:31:28 UTC ℹ️ MergeQueue: pull request added to the queue

The median merge time in main is 41m.


2025-02-19 11:07:16 UTC ℹ️ MergeQueue: This merge request was merged

@dd-mergequeue dd-mergequeue bot merged commit 8c6cbd4 into main Feb 19, 2025
28 checks passed
@dd-mergequeue dd-mergequeue bot deleted the guillermo.julian/enable-gpu-monitoring branch February 19, 2025 11:07
Labels
chart/datadog (This issue or pull request is related to the datadog chart), mergequeue-status: done
3 participants