OpenSourcing multi-tenant GHA solution for pytorch/* CI #6275
Toolset to manage and run GHA runners for pytorch/* repositories
Imported the toolset to manage runners for pytorch/* repositories in the tools/multi-tenant folder. It includes:

- tools/multi-tenant/playbooks/setup-host.yml - Ansible playbook to set up hosts as GHA runners;
- tools/multi-tenant/services/ghad-manager/ghad-manager.py - Main daemon, running as root, responsible for making sure the GHA daemon is authenticated and running for all users;
- tools/multi-tenant/scripts/reset_oldest_ebs.py - AWS-specific tool to help safely reset EBS volumes, draining instances with running jobs before maintenance or cattle spa runs;
- tools/multi-tenant/images/multi-tenant-gpu/Dockerfile - Main Docker image where CI runs.

Design
Each GPU is assigned to a Linux unix user. Each user is assigned CPU and memory quotas, enforced via cgroups, based on the available resources and the number of GPUs. Each user then runs Docker in userspace (rootless mode). A main daemon running as root handles authentication to the GitHub API using a private key. For each user it spawns, inside that user's constrained Docker environment, a container that runs the GHA daemon; the user's Docker is made available inside that container by mounting the Docker daemon socket into it.
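As a rough illustration (not the actual ghad-manager.py implementation), the sketch below shows how a root daemon could start a per-user GHA runner container through that user's rootless Docker daemon and mount the user's Docker socket into it. The user/uid arguments, socket path convention, image tag, and RUNNER_TOKEN variable are assumptions for the example, not the real interface.

```python
import subprocess

def launch_runner(user: str, uid: int, registration_token: str) -> None:
    """Hypothetical sketch: start a GHA runner container for one tenant user."""
    # Conventional path of a user's rootless dockerd socket (assumption).
    rootless_socket = f"/run/user/{uid}/docker.sock"
    subprocess.run(
        [
            # Talk to the user's own (cgroup-constrained, rootless) Docker daemon.
            "docker", "--host", f"unix://{rootless_socket}",
            "run", "--detach",
            "--name", f"gha-runner-{user}",
            # Mount the user's Docker socket at the default location inside the
            # runner container, so CI jobs can spawn sibling containers.
            "-v", f"{rootless_socket}:/var/run/docker.sock",
            # Illustrative way of handing the runner its registration token.
            "-e", f"RUNNER_TOKEN={registration_token}",
            "multi-tenant-gpu:latest",  # hypothetical image tag
        ],
        check=True,
    )
```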
Apart from a few intentional limitations on hardware access (in place for security), the user running the CI is not aware that the CI is running inside a Docker container, and the CI can spawn new Docker containers with the same hardware access as the main container.
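For instance, because the user's Docker socket is mounted at the default path inside the runner container, a job step can create sibling containers with an ordinary docker invocation; the image and command below are purely illustrative.

```python
import subprocess

# Hypothetical job step: /var/run/docker.sock inside the runner container
# points at the user's rootless Docker daemon, so a plain `docker run`
# creates a sibling container on the same host with the same access.
subprocess.run(
    ["docker", "run", "--rm", "ubuntu:22.04", "echo", "hello from a sibling container"],
    check=True,
)
```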
As each job gets a fresh environment, there is very little concern about leftover state building up on the main instance.