OpenSourcing multi-tenant GHA solution for pytorch/* CI #6275

Open
wants to merge 2 commits into
base: main

jeanschmidt (Contributor) commented Feb 10, 2025

Toolset to manage and run GHA runners for pytorch/* repositories

Imported the toolset to manage runners for pytorch/* repositories in the tools/multi-tenant folder.

It includes:

  • tools/multi-tenant/playbooks/setup-host.yml - Ansible playbook to set up hosts as GHA runners;
  • tools/multi-tenant/services/ghad-manager/ghad-manager.py - Main daemon, running as root, responsible for making sure the GHA daemon is authenticated and running for all users;
  • tools/multi-tenant/scripts/reset_oldest_ebs.py - AWS-specific tool that helps reset EBS safely, draining instances that are running jobs, for maintenance or cattle spa runs;
  • tools/multi-tenant/images/multi-tenant-gpu/Dockerfile - Main Docker image where CI runs.

Design

Each GPU is assigned to a Linux user. Each user gets a CPU and memory quota, enforced with cgroups and based on the available resources and the number of GPUs. Each user then runs Docker in userspace. A main daemon running as root handles authentication to the GH API using a private key. For each user, it spawns a Docker container inside that user's constrained Docker environment; the container runs the GHA daemon, and the user's Docker is made available inside it by mounting the Docker daemon socket into the running container.
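As a rough illustration of the quota idea described above, here is a minimal sketch, not the code in this PR. It assumes an even split across GPUs, a fixed reservation for the host, and cgroup v2 file names; the user naming scheme (`ci-gpu<N>`), the reservation defaults, and both function names are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class UserQuota:
    user: str          # hypothetical per-GPU Linux user name
    gpu_index: int
    cpu_cores: float   # CPU share expressed in cores
    memory_bytes: int


def plan_quotas(total_cpus, total_memory_bytes, num_gpus,
                reserved_cpus=2.0, reserved_memory_bytes=8 * 1024**3):
    """Split the non-reserved CPU and memory evenly, one share per GPU user.

    The even split and the host reservation are assumptions made for this
    sketch; the PR description only says quotas depend on available
    resources and the number of GPUs.
    """
    if num_gpus < 1:
        raise ValueError("num_gpus must be at least 1")
    usable_cpus = total_cpus - reserved_cpus
    usable_memory = total_memory_bytes - reserved_memory_bytes
    if usable_cpus <= 0 or usable_memory <= 0:
        raise ValueError("reservation leaves nothing to share")
    return [
        UserQuota(
            user=f"ci-gpu{i}",
            gpu_index=i,
            cpu_cores=usable_cpus / num_gpus,
            memory_bytes=usable_memory // num_gpus,
        )
        for i in range(num_gpus)
    ]


def cgroup_v2_settings(quota, period_us=100_000):
    """Render a quota as cgroup v2 values (assumes cgroup v2 is in use).

    cpu.max takes "<quota_us> <period_us>"; memory.max takes a byte count.
    """
    return {
        "cpu.max": f"{int(quota.cpu_cores * period_us)} {period_us}",
        "memory.max": str(quota.memory_bytes),
    }
```

For example, `plan_quotas(96, 768 * 1024**3, 8)` would give each of eight users 11.75 cores and 95 GiB under these assumptions.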

Apart from a few limitations on hardware access (intentional, for security), the user running the CI is not aware that the CI is running in a Docker container, and the CI can run new Docker images with the same hardware access as the main Docker image.
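The socket mount is what lets CI jobs start sibling containers. The sketch below shows one way the launch command could be assembled, under several assumptions that are not confirmed by this PR: the user's rootless Docker socket is at `/run/user/<uid>/docker.sock`, the GPU is selected with `docker run --gpus device=<N>`, and the image tag is a placeholder. The function name `runner_container_argv` is hypothetical.

```python
def runner_container_argv(uid, gpu_index, image, runtime_dir_root="/run/user"):
    """Build a `docker run` argv for one GPU user's runner container.

    Assumptions for this sketch: rootless Docker exposes the user's daemon
    socket under <runtime_dir_root>/<uid>/docker.sock, and the same socket is
    bind-mounted at /var/run/docker.sock so that jobs inside the container
    talk to that user's (quota-constrained) daemon.
    """
    user_socket = f"{runtime_dir_root}/{uid}/docker.sock"
    return [
        "docker",
        "--host", f"unix://{user_socket}",   # talk to the user's own daemon
        "run", "--rm",
        "--gpus", f"device={gpu_index}",     # expose only this user's GPU
        "--volume", f"{user_socket}:/var/run/docker.sock",
        image,
    ]


# Placeholder uid and image tag, for illustration only.
argv = runner_container_argv(1001, 0, "multi-tenant-gpu:latest")
print(" ".join(argv))
```

Because containers started from inside the runner go through the same user-level daemon, they inherit that user's limits rather than the host's.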

Because each job gets a fresh environment, there is little concern about trash building up on the main instance.


@facebook-github-bot facebook-github-bot added the CLA Signed This label is managed by the Facebook bot. Authors need to sign the CLA before a PR can be reviewed. label Feb 10, 2025

@github-advanced-security github-advanced-security bot left a comment

lintrunner found more than 20 potential problems in the proposed changes. Check the Files changed tab for more details.

2 participants