OpenSourcing multi-tenant GHA solution for pytorch/* CI #6275

jeanschmidt · 2025-02-10T19:00:44Z

Toolset to manage and run GHA runners for pytorch/* repositories

Imports the toolset for managing runners for pytorch/* repositories into the tools/multi-tenant folder.

It includes:

tools/multi-tenant/playbooks/setup-host.yml - Ansible playbook to set up hosts as GHA runners;
tools/multi-tenant/services/ghad-manager/ghad-manager.py - Main daemon, running as root, responsible for making sure the GHA daemon is authenticated and running for all users;
tools/multi-tenant/scripts/reset_oldest_ebs.py - AWS-specific tool to safely reset EBS volumes, safely draining instances that have jobs, for maintenance or cattle spa runs;
tools/multi-tenant/images/multi-tenant-gpu/Dockerfile - Main Docker image where CI runs.
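The "reset the oldest EBS volume, but drain busy instances first" idea can be illustrated with a small sketch. This is not the logic of reset_oldest_ebs.py; the `Host` record, its fields, and `plan_reset` are hypothetical names chosen for illustration, and the real script presumably talks to AWS rather than working on in-memory data.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass(frozen=True)
class Host:
    # Hypothetical record; field names are assumptions, not taken from the PR.
    instance_id: str
    volume_created: datetime  # creation time of the instance's EBS volume
    running_jobs: int         # number of CI jobs currently on the instance


def plan_reset(hosts, max_resets=1):
    """Pick the hosts with the oldest volumes and split them into
    (safe to reset now, must be drained first)."""
    oldest = sorted(hosts, key=lambda h: h.volume_created)[:max_resets]
    reset_now = [h.instance_id for h in oldest if h.running_jobs == 0]
    drain_first = [h.instance_id for h in oldest if h.running_jobs > 0]
    return reset_now, drain_first


hosts = [
    Host("i-aaa", datetime(2025, 1, 5), running_jobs=2),
    Host("i-bbb", datetime(2025, 1, 1), running_jobs=0),
    Host("i-ccc", datetime(2025, 2, 1), running_jobs=0),
]
reset_now, drain_first = plan_reset(hosts, max_resets=2)
```

Limiting the number of hosts touched per run (`max_resets`, an assumed parameter) is one way to keep capacity available while maintenance is in progress.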

Design

Each GPU is assigned to a Linux user. Each user gets a CPU and memory quota, enforced with cgroups and sized from the available resources and the number of GPUs. Each user then runs Docker in userspace. A main daemon running as root handles authentication to the GH API using a private key. For each user, it spawns a Docker container, inside that user's constrained Docker environment, that runs the GHA daemon. The user's Docker is made available inside that container by mounting the Docker daemon socket into it.
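A minimal sketch of how per-user quotas could be derived from host resources and GPU count. It assumes an even split, a small reserve for the host itself, and cgroup v2 file formats (`cpu.max` as "<quota_us> <period_us>", `memory.max` in bytes). The user naming scheme, the reserve sizes, and the function name are hypothetical; the PR does not state how the playbook computes these values.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class UserQuota:
    user: str        # hypothetical user name, one user per GPU
    gpu_index: int
    cpu_max: str     # value for the cgroup v2 `cpu.max` file
    memory_max: int  # value for the cgroup v2 `memory.max` file, in bytes


def compute_user_quotas(total_cpus, total_mem_bytes, num_gpus,
                        reserved_cpus=2, reserved_mem_bytes=8 * 1024**3,
                        period_us=100_000):
    """Split the host's CPUs and memory evenly across one user per GPU,
    after setting aside an assumed reserve for the host and root daemon."""
    if num_gpus <= 0:
        raise ValueError("num_gpus must be positive")
    usable_cpus = total_cpus - reserved_cpus
    usable_mem = total_mem_bytes - reserved_mem_bytes
    if usable_cpus < num_gpus or usable_mem <= 0:
        raise ValueError("not enough resources for the requested GPU count")

    cpus_per_user = usable_cpus // num_gpus
    mem_per_user = usable_mem // num_gpus
    quota_us = cpus_per_user * period_us  # N full CPUs per scheduling period

    return [
        UserQuota(
            user=f"ghrunner{i}",
            gpu_index=i,
            cpu_max=f"{quota_us} {period_us}",
            memory_max=mem_per_user,
        )
        for i in range(num_gpus)
    ]


# Example: a host with 96 CPUs, 768 GiB of RAM and 8 GPUs.
quotas = compute_user_quotas(96, 768 * 1024**3, 8)
```

Rounding down with integer division means a few CPUs may stay unassigned; that slack simply adds to the host reserve.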

Apart from a few limitations on hardware access (which are intentional and exist for security), the user running the CI is not aware that the CI is running in a Docker container, and the CI can spawn new Docker containers with the same hardware access as the main Docker image.
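The container launch described in the design can be sketched as a command builder. This is an illustration under stated assumptions, not what ghad-manager.py does: it assumes rootless Docker with its socket at `/run/user/<uid>/docker.sock`, GPU selection through `docker run --gpus device=<index>`, and a hypothetical image tag `multi-tenant-gpu:latest`.

```python
import shlex


def build_runner_command(uid, gpu_index, image="multi-tenant-gpu:latest"):
    """Build the `docker run` argv that starts a runner container on one
    user's Docker daemon and exposes that same daemon inside the container."""
    # Assumed location of the user's rootless Docker socket.
    user_socket = f"/run/user/{uid}/docker.sock"
    return [
        "docker", "--host", f"unix://{user_socket}",
        "run", "--rm",
        "--gpus", f"device={gpu_index}",
        # Mounting the socket lets jobs start sibling containers that
        # inherit the same per-user limits.
        "--volume", f"{user_socket}:/var/run/docker.sock",
        image,
    ]


cmd = build_runner_command(uid=1001, gpu_index=0)
print(shlex.join(cmd))
```

Because the mounted socket belongs to the user's own daemon, any container a job starts stays inside that user's quota and GPU assignment rather than gaining access to the whole host.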

Because each job gets a fresh environment, there is little concern about trash building up on the main instance.

vercel · 2025-02-10T19:00:53Z

The latest updates on your projects. Learn more about Vercel for Git ↗︎

1 Skipped Deployment

Name	Status	Preview	Updated (UTC)
torchci	⬜️ Ignored (Inspect)	Visit Preview	Feb 10, 2025 7:12pm

github-advanced-security

lintrunner found more than 20 potential problems in the proposed changes. Check the Files changed tab for more details.

OpenSourcing multi-tenant GHA solution for pytorch/* CI

f10e053

facebook-github-bot added the CLA Signed label Feb 10, 2025

github-advanced-security bot found potential problems Feb 10, 2025

20250210201219

53fa13f
