Allow requesting spread meta GPUs for fractional multi-GPU training #1

Open
elgalu opened this issue Jun 20, 2022 · 2 comments

@elgalu

elgalu commented Jun 20, 2022

Feature: be able to request, for example on an 8-GPU server, 70% of each GPU for a single training job, e.g. by adding a new multigpu limit:

        resources:
          limits:
            cnvrg.io/metagpu: 70
            cnvrg.io/multigpu: 8

That way multi-GPU training is possible while still leaving 30% of each of the 8 GPUs (30 × 8 = 240 meta GPUs) free for other workloads.
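A minimal sketch of the intended semantics, assuming each GPU is divided into 100 metagpu units as in the limits above; the `pack`/`spread` names and the "current vs. proposed" framing are illustrative, not taken from the actual metagpu device plugin:

        GPUS = 8
        UNITS_PER_GPU = 100  # 1 physical GPU = 100 metagpus

        def pack(total_units):
            """Assumed current behaviour: fill devices one by one until the request is met."""
            alloc = [0] * GPUS
            for i in range(GPUS):
                take = min(UNITS_PER_GPU, total_units)
                alloc[i] = take
                total_units -= take
            return alloc

        def spread(units_per_gpu, num_gpus):
            """Proposed behaviour: the same slice on every requested device."""
            return [units_per_gpu] * num_gpus + [0] * (GPUS - num_gpus)

        print(pack(560))      # [100, 100, 100, 100, 100, 60, 0, 0] -> 240 units free, but on only 3 devices
        print(spread(70, 8))  # [70, 70, 70, 70, 70, 70, 70, 70]    -> 240 units free, 30 on every GPU

Both placements consume 560 units in total; they differ only in how the free capacity is distributed across devices.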

@Dimss
Collaborator

Dimss commented Jul 12, 2022

@elgalu thank you for the FR!
It looks like a useful feature; however, I'd like some clarification, if that's OK.
Can you please explain why having 70% of each of the 8 GPUs is better for a training job than, for example, having 5.6 full GPUs?

For example, these two requests are equal in terms of metagpu allocation units:

        resources:
          limits:
            cnvrg.io/metagpu: 70
            cnvrg.io/multigpu: 8

and

        resources:
          limits:
            cnvrg.io/metagpu: 560 # each GPU equals 100 metagpus, so 70 * 8 = 560 => 5.6 GPUs

So why would the training job prefer to have 70% of each of the 8 GPUs, which is 560 units in total, rather than the same 560 units spanning 5 full GPUs and 60% of a sixth?

Dimss self-assigned this Jul 12, 2022
Dimss added the enhancement (New feature or request) label Jul 12, 2022
@elgalu
Author

elgalu commented Jul 12, 2022

Hi, thanks! Because the multi-GPU training job doesn't need 100% of each GPU's memory. It can parallelize well without occupying, for example, the entire 80 GB of an A100 on each device. It can't, however, simply use 100% of the 80 GB on 5 devices instead; I think this comes down to how multi-GPU training works in the underlying frameworks like PyTorch, which drive one worker process per device.
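For instance, a data-parallel PyTorch job launches one worker per visible GPU, and each worker can be capped at a fraction of its device's memory. A minimal sketch (illustrative only, not from cnvrg.io or the metagpu plugin; the 0.7 fraction just mirrors the hypothetical cnvrg.io/metagpu: 70 per-GPU slice):

        import os
        import torch
        import torch.distributed as dist
        from torch.nn.parallel import DistributedDataParallel as DDP

        def main():
            # torchrun starts one process per GPU and sets LOCAL_RANK for each.
            local_rank = int(os.environ["LOCAL_RANK"])
            dist.init_process_group(backend="nccl")
            torch.cuda.set_device(local_rank)

            # Cap this worker at ~70% of its device's memory: the job still needs
            # all 8 devices visible, but never fills any single one of them.
            torch.cuda.set_per_process_memory_fraction(0.7, device=local_rank)

            model = torch.nn.Linear(1024, 1024).cuda(local_rank)
            ddp_model = DDP(model, device_ids=[local_rank])
            # ... regular training loop over ddp_model ...

        if __name__ == "__main__":
            main()

Launched with something like torchrun --nproc_per_node=8 train.py, this uses a slice of all 8 GPUs at once, which is why 70% of each of 8 devices is not interchangeable with 5.6 whole devices for this kind of job.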
