Support GPU monitoring #1681

Merged: 7 commits into main, Feb 19, 2025

Conversation

@gjulianm (Contributor) commented on Jan 31, 2025

What this PR does / why we need it:

Jira Ticket

This PR adds support for enabling GPU monitoring in system-probe.
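In the simplest case, where every node in the cluster has GPUs, the feature can be enabled directly in the Helm values. Below is a minimal sketch: only datadog.gpuMonitoring.enabled comes from this PR; the apiKey line is ordinary chart configuration shown for completeness.

# values.yaml (minimal sketch: enable GPU monitoring on every node)
datadog:
  apiKey: <DATADOG_API_KEY>  # standard chart setting, not part of this PR
  gpuMonitoring:
    enabled: true            # value added by this PR (see the mixed-cluster example below)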


Special notes for your reviewer:

This PR does not add automatic support for mixed-node clusters (where some nodes have GPUs and others don't). However, this can be achieved without issues by using the agents.affinity value together with the existing documentation for joining an existing Cluster Agent. Assuming we already have a values.yaml file for a regular, non-GPU deployment, the steps to enable GPU monitoring only on GPU nodes are the following:

  1. In agents.affinity, add a node selector that stops the non-GPU agent from running on GPU nodes:
# Base values.yaml (for non-GPU nodes)
agents:
  affinity:
    nodeAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        nodeSelectorTerms:
        - matchExpressions:
          - key: nvidia.com/gpu.present
            operator: NotIn
            values:
              - "true"

Here we chose the nvidia.com/gpu.present label, as it's automatically added to GPU nodes by the NVIDIA GPU operator; however, any other appropriate label may be chosen (a quick way to check which nodes carry it is shown after these steps).

  2. Create another file (e.g., values-gpu.yaml) that will be applied on top of the previous one. In this file we enable GPU monitoring, configure the Cluster Agent to join the existing cluster as per the instructions, and include the affinity for the GPU nodes:
# GPU-specific values-gpu.yaml (for GPU nodes)
datadog:
  kubeStateMetricsEnabled: false # Disabled as we're joining an existing cluster agent
  gpuMonitoring:
    enabled: true

agents:
  affinity:
    nodeAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        nodeSelectorTerms:
        - matchExpressions:
          - key: nvidia.com/gpu.present
            operator: In
            values:
              - "true"

existingClusterAgent:
  join: true

# Disable the datadogMetrics CRD deployment since it has already been deployed with the other chart release.
datadog-crds:
  crds:
    datadogMetrics: false

  3. Deploy the datadog chart twice: first with the values.yaml file as modified in step 1, and then a second time (with a different release name) adding the values-gpu.yaml file as defined in step 2:
$ helm install -f values.yaml datadog datadog
$ helm install -f values.yaml -f values-gpu.yaml datadog-gpu datadog 

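To double-check which nodes each release will schedule on, you can list nodes by the label used in the affinity rules. This is just a verification sketch, assuming the nvidia.com/gpu.present label from the NVIDIA GPU operator mentioned above:

# Nodes the GPU release (datadog-gpu) should end up on
$ kubectl get nodes -l nvidia.com/gpu.present=true

# Inspect all node labels if a different label is used
$ kubectl get nodes --show-labels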
Checklist


  • Chart Version bumped
  • Documentation has been updated with helm-docs (run: .github/helm-docs.sh)
  • CHANGELOG.md has been updated
  • Variables are documented in the README.md
  • For Datadog Operator chart or value changes update the test baselines (run: make update-test-baselines)

@gjulianm gjulianm self-assigned this Jan 31, 2025
@github-actions github-actions bot added the chart/datadog This issue or pull request is related to the datadog chart label Jan 31, 2025
@gjulianm gjulianm force-pushed the guillermo.julian/enable-gpu-monitoring branch from a36f748 to 4b4c8ad on February 7, 2025 12:45
@gjulianm gjulianm marked this pull request as ready for review February 7, 2025 13:45
@gjulianm gjulianm requested review from a team as code owners February 7, 2025 13:45

@val06 reviewed

Comment on lines +104 to +105
- name: gpu-devices
mountPath: /var/run/nvidia-container-devices/all

@val06:
should we also mount the cgroups in this case?
(here)

@gjulianm (Contributor, Author):
It's not needed, as we're already mounting the host root (just above), and readOnly doesn't apply in this case since we can still write to it.

@val06:
I would prefer to have an explicit reference, as other products do.
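(For illustration only, an explicit cgroup mount would look roughly like the sketch below; the volume name and paths are assumptions and are not taken from this PR's templates.)

volumeMounts:
  - name: cgroups                  # hypothetical volume name, for illustration
    mountPath: /host/sys/fs/cgroup
    readOnly: true
volumes:
  - name: cgroups
    hostPath:
      path: /sys/fs/cgroup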

@gjulianm gjulianm mentioned this pull request Feb 10, 2025

@gjulianm (Contributor, Author) commented:
/merge

dd-devflow bot commented Feb 19, 2025

View all feedbacks in Devflow UI.
2025-02-19 10:31:25 UTC ℹ️ Start processing command /merge


2025-02-19 10:31:28 UTC ℹ️ MergeQueue: pull request added to the queue

The median merge time in main is 41m.


2025-02-19 11:07:16 UTC ℹ️ MergeQueue: This merge request was merged

@dd-mergequeue dd-mergequeue bot merged commit 8c6cbd4 into main Feb 19, 2025
28 checks passed
@dd-mergequeue dd-mergequeue bot deleted the guillermo.julian/enable-gpu-monitoring branch February 19, 2025 11:07
Labels
chart/datadog (This issue or pull request is related to the datadog chart), mergequeue-status: done
3 participants