Support GPU monitoring #1681
Conversation
Force-pushed from a36f748 to 4b4c8ad
- name: gpu-devices
  mountPath: /var/run/nvidia-container-devices/all
Should we also mount the cgroups in this case?
It's not needed, as we're already mounting the host root (just above), and `readOnly` doesn't apply in this case: we can still write to it.
I would prefer to have an explicit reference, as other products do.
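For reference, a minimal sketch of what such an explicit cgroup mount could look like; the volume name and paths here are illustrative, not the PR's final change:

```yaml
# Illustrative only: mount the host cgroup hierarchy explicitly,
# rather than relying on the host root mount above.
volumeMounts:
  - name: cgroups
    mountPath: /host/sys/fs/cgroup
    readOnly: true
volumes:
  - name: cgroups
    hostPath:
      path: /sys/fs/cgroup
```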
/merge
What this PR does / why we need it:
Jira Ticket
This PR adds support for enabling GPU monitoring in system-probe.
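As a minimal sketch of the single-cluster case (assuming the chart exposes the toggle as `datadog.gpuMonitoring.enabled`; the value name is an assumption here, not a confirmed key):

```yaml
# Assumed value name; enables GPU monitoring in system-probe.
datadog:
  gpuMonitoring:
    enabled: true
```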
Which issue this PR fixes
(optional, in `fixes #<issue number>(, fixes #<issue_number>, ...)` format, will close that issue when PR gets merged)

Special notes for your reviewer:
This PR does not have automatic support for mixed node clusters (where some nodes have GPUs and others don't). However, using the `affinity` value and the existing documentation on joining an existing Cluster Agent, this can be done without issues. Assuming we already have a `values.yaml` file for a regular, non-GPU deployment, the steps to enable GPU monitoring only on GPU nodes are the following:

1. In `agents.affinity`, add a node selector that stops the non-GPU agent from running on GPU nodes (see the sketch below). Here we chose the `nvidia.com/gpu.present` label as it's automatically added to GPU nodes by the NVIDIA GPU operator; however, any other appropriate label may be chosen.
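A minimal sketch of what that selector could look like (the structure under `agents.affinity` and the choice of `DoesNotExist` as the operator are assumptions, not the PR's verbatim example):

```yaml
# values.yaml — keep the regular agent off GPU nodes.
# nvidia.com/gpu.present is set on GPU nodes by the NVIDIA GPU operator.
agents:
  affinity:
    nodeAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        nodeSelectorTerms:
          - matchExpressions:
              - key: nvidia.com/gpu.present
                operator: DoesNotExist
```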
2. Create a second values file, `values-gpu.yaml`, that will be applied on top of the previous one. In this file we enable GPU monitoring, configure the Cluster Agent to join the existing cluster as per the instructions, and include the affinity for the GPU nodes (a sketch follows below).
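A hedged sketch of such a file; the `gpuMonitoring` key and the `existingClusterAgent` join settings are assumptions based on the chart's documented options, and the service/secret names are placeholders:

```yaml
# values-gpu.yaml — applied on top of values.yaml for the GPU-only release.
datadog:
  gpuMonitoring:
    enabled: true

# Join the Cluster Agent from the first release instead of deploying a new one
# (names below are placeholders; follow the chart's join documentation).
existingClusterAgent:
  join: true
  serviceName: datadog-cluster-agent
  tokenSecretName: datadog-cluster-agent

# Mirror image of step 1: only schedule this agent on GPU nodes.
agents:
  affinity:
    nodeAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        nodeSelectorTerms:
          - matchExpressions:
              - key: nvidia.com/gpu.present
                operator: Exists
```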
3. Deploy the chart twice: once with the `values.yaml` file as modified in step 1, and then a second time (with a different name) adding the `values-gpu.yaml` file as defined in step 2 (see the command sketch below).
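For illustration, assuming the chart is installed from the `datadog/datadog` repository and the release names are free choices (later `-f` files override earlier ones):

```sh
# First release: regular agent everywhere except GPU nodes.
helm install datadog datadog/datadog -f values.yaml

# Second release: GPU-enabled agent, restricted to GPU nodes.
helm install datadog-gpu datadog/datadog -f values.yaml -f values-gpu.yaml
```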
Checklist
[Place an '[x]' (no spaces) in all applicable fields. Please remove unrelated fields.]
- [ ] Documentation has been updated with helm-docs (run: `.github/helm-docs.sh`)
- [ ] `CHANGELOG.md` has been updated
- [ ] Variables are documented in the `README.md`
- [ ] For Datadog Operator chart or value changes, update the test baselines (run: `make update-test-baselines`)