feat: add nvidia MIG #258

Merged

@piyush-jena piyush-jena commented Nov 13, 2024

Issue number:

Related:

Description of changes:
Adds the nvidia-migmanager service and binary, which configure NVIDIA MIG (Multi-Instance GPU) on the instance.

Testing done:

  1. Instance joined the cluster
NAME                                           STATUS   ROLES    AGE   VERSION
ip-XXXX.us-west-2.compute.internal   Ready    <none>   15h   v1.29.5-eks-1109419
  1. Model Default:
bash-5.1# apiclient get settings.kubelet-device-plugin
{
  "settings": {
    "kubelet-device-plugins": {
      "nvidia": {
        "device-id-strategy": "index",
        "device-list-strategy": "volume-mounts",
        "device-partitioning-strategy": "none",
        "device-sharing-strategy": "none",
        "pass-device-specs": true
      }
    }
  }
}
  1. Model Updates:
bash-5.1# apiclient set settings.kubelet-device-plugins.nvidia.device-partitioning-strategy="mig"
bash-5.1# apiclient set settings.kubelet-device-plugins.nvidia.mig.profile."a100-40gb"="1g.5gb"
bash-5.1# apiclient get settings.kubelet-device-plugin
{
  "settings": {
    "kubelet-device-plugins": {
      "nvidia": {
        "device-id-strategy": "index",
        "device-list-strategy": "volume-mounts",
        "device-partitioning-strategy": "mig",
        "device-sharing-strategy": "none",
        "mig": {
          "profile": {
            "a100-40gb": "1g.5gb"
          }
        },
        "pass-device-specs": true
      }
    }
  }
}

After rebooting the instance, kubectl describe node shows 56 GPUs.

  1. Bounds check:
bash-5.1# apiclient apply <<EOF
> [settings.kubelet-device-plugins.nvidia.mig.profile]
> "hello"="1g.5gb"
> EOF
Failed to apply settings: Failed to PATCH settings from '-' to '/settings?tx=apiclient-apply-7NsnlaurtHEacSYL': Status 400 when PATCHing /settings?tx=apiclient-apply-7NsnlaurtHEacSYL: Json deserialize error: Unable to deserialize into NvidiaGPUModel: NVIDIA GPU Model must match '^([a-z])(\d+)\.(\d+)gb$', given: hello at line 1 column 62
bash-5.1# apiclient apply <<EOF
> [settings.kubelet-device-plugins.nvidia.mig.profile]
> "a100.40gb"="2"
> EOF
bash-5.1# apiclient apply <<EOF
> [settings.kubelet-device-plugins.nvidia.mig.profile]
> "a100.40gb"="5"
> EOF
Failed to apply settings: Failed to PATCH settings from '-' to '/settings?tx=apiclient-apply-GzUHB0axGlWNPzGw': Status 400 when PATCHing /settings?tx=apiclient-apply-GzUHB0axGlWNPzGw: Json deserialize error: Unable to deserialize into MIGProfile: MIG Profile must match '^[0-9]g\.\d+gb$', given: 5 at line 1 column 71
  1. Files generated:

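The two patterns quoted in the error messages above can be mirrored with a small standard-library check. This is only a sketch under stated assumptions: the function names are hypothetical, the real validators are not shown in this PR description, and the hand-rolled matching treats `\d` as ASCII digits only. Note that the log above appears to accept "2" as a profile value, so the actual MIGProfile type evidently allows more than the single pattern printed in the error; the sketch covers only that printed pattern.

```rust
/// True when `s` is a non-empty run of ASCII digits (assumption: `\d` is
/// treated as ASCII-only here).
fn is_ascii_digits(s: &str) -> bool {
    !s.is_empty() && s.bytes().all(|b| b.is_ascii_digit())
}

/// Hypothetical helper mirroring `^([a-z])(\d+)\.(\d+)gb$` from the
/// NvidiaGPUModel error message.
fn is_valid_gpu_model(s: &str) -> bool {
    let Some(body) = s.strip_suffix("gb") else {
        return false;
    };
    let mut chars = body.chars();
    // Exactly one lowercase ASCII letter must come first.
    match chars.next() {
        Some(c) if c.is_ascii_lowercase() => {}
        _ => return false,
    }
    // The remainder must be "<digits>.<digits>".
    let Some((model, memory)) = chars.as_str().split_once('.') else {
        return false;
    };
    is_ascii_digits(model) && is_ascii_digits(memory)
}

/// Hypothetical helper mirroring `^[0-9]g\.\d+gb$` from the MIGProfile
/// error message.
fn is_valid_mig_profile(s: &str) -> bool {
    let Some(body) = s.strip_suffix("gb") else {
        return false;
    };
    let Some((slices, memory)) = body.split_once("g.") else {
        return false;
    };
    // A single digit before "g.", then one or more digits of memory.
    slices.len() == 1 && is_ascii_digits(slices) && is_ascii_digits(memory)
}
```

With these helpers, is_valid_gpu_model("a100.40gb") and is_valid_mig_profile("1g.5gb") hold, while "hello" and "5" are rejected, matching the two failures in the log.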
Terms of contribution:

By submitting this pull request, I agree that this contribution is dual-licensed under the terms of both the Apache License, version 2.0, and the MIT license.

@piyush-jena piyush-jena changed the title from "feat: add nvidia mig" to "feat: add nvidia MIG" Nov 13, 2024
@piyush-jena piyush-jena marked this pull request as draft November 13, 2024 20:26
@piyush-jena piyush-jena force-pushed the nvidia-mig-feature branch 3 times, most recently from 6b025dc to d72bc52 November 18, 2024 14:14
@piyush-jena piyush-jena force-pushed the nvidia-mig-feature branch 6 times, most recently from 4cb9254 to 385d4fe January 24, 2025 23:54
@piyush-jena piyush-jena requested a review from bcressey January 24, 2025 23:56

// The GPU in the current instance is not one of the known GPUs, so attempt
// only the profiles that do not belong to one of the known GPUs.
for (gpu, mig_profile) in &mig_settings.profile {
    if !known_gpus.contains(gpu) {
Contributor

Isn't this always true? You could make it another ensure! or omit the check, to help de-indent the code.

Contributor Author

mig_settings might contain some known and some unsupported (by us) GPUs (that support MIG). If the current instance has one of the "currently unsupported" GPUs, we wouldn't want to call nvidia-smi commands in cases we know aren't valid.
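One reading of the exchange above, sketched in Rust: when the instance's GPU is not among the known models, entries keyed by a known model target a different GPU and are skipped, so that only the remaining profiles are candidates for nvidia-smi. The function name, its signature, and the use of a BTreeMap are assumptions made for illustration; the merged code may be structured differently.

```rust
use std::collections::BTreeMap;

/// Hypothetical helper: given the configured `profile` map (GPU model ->
/// MIG profile) and the list of known GPU models, return the entries worth
/// attempting when the instance's GPU is NOT one of the known models.
fn profiles_for_unknown_gpu<'a>(
    profiles: &'a BTreeMap<String, String>,
    known_gpus: &[&str],
) -> Vec<(&'a str, &'a str)> {
    profiles
        .iter()
        // Entries for known models belong to a different GPU; skip them
        // rather than issuing commands that are known to be invalid.
        .filter(|(gpu, _)| !known_gpus.iter().any(|known| *known == gpu.as_str()))
        .map(|(gpu, profile)| (gpu.as_str(), profile.as_str()))
        .collect()
}
```

For example, with a map containing "a100.40gb" and an illustrative second key such as "h100.80gb", and "a100.40gb" as the only known model, only the second entry is returned. Filtering up front keeps the decision in one place and leaves the loop body free of nested checks.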

@piyush-jena piyush-jena force-pushed the nvidia-mig-feature branch 3 times, most recently from c2f0243 to e44778b February 4, 2025 10:07
@piyush-jena piyush-jena force-pushed the nvidia-mig-feature branch 2 times, most recently from 7edc279 to eafcab1 February 4, 2025 20:01
@piyush-jena
Contributor Author

The force push addresses all of the comments above.

@piyush-jena piyush-jena force-pushed the nvidia-mig-feature branch 2 times, most recently from a811e5b to 164e650 February 4, 2025 23:00
@piyush-jena
Contributor Author

Dropped the hack commit, since we removed the dependency on settings-sdk structs to simplify the code.

@piyush-jena piyush-jena merged commit 16100a8 into bottlerocket-os:develop Feb 6, 2025
2 checks passed
@piyush-jena piyush-jena deleted the nvidia-mig-feature branch February 11, 2025 22:29
3 participants