Allow requesting spread meta GPUs for fractional multi-GPU training #1

Open
elgalu opened this issue Jun 20, 2022 · 2 comments

@elgalu

elgalu commented Jun 20, 2022

Feature: be able to request, for example on an 8-GPU server, 70% of each GPU for a single training job, e.g. by adding a new multigpu limit:

        resources:
          limits:
            cnvrg.io/metagpu: 70
            cnvrg.io/multigpu: 8

That way multi-GPU training is possible while still leaving 30% of each of the 8 GPUs (30 × 8 = 240 meta GPUs) free for other workloads.
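A minimal sketch of the intended semantics, assuming each GPU is divided into 100 metagpu units as in the limits above; the `pack`/`spread` names and the "current vs. proposed" framing are illustrative, not taken from the actual metagpu device plugin:

        GPUS = 8
        UNITS_PER_GPU = 100  # 1 physical GPU = 100 metagpus

        def pack(total_units):
            """Assumed current behaviour: fill devices one by one until the request is met."""
            alloc = [0] * GPUS
            for i in range(GPUS):
                take = min(UNITS_PER_GPU, total_units)
                alloc[i] = take
                total_units -= take
            return alloc

        def spread(units_per_gpu, num_gpus):
            """Proposed behaviour: the same slice on every requested device."""
            return [units_per_gpu] * num_gpus + [0] * (GPUS - num_gpus)

        print(pack(560))      # [100, 100, 100, 100, 100, 60, 0, 0] -> 240 units free, but on only 3 devices
        print(spread(70, 8))  # [70, 70, 70, 70, 70, 70, 70, 70]    -> 240 units free, 30 on every GPU

Both placements consume 560 units in total; they differ only in how the free capacity is distributed across devices.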

@Dimss
Collaborator

Dimss commented Jul 12, 2022

@elgalu thank you for the FR!
It looks like a useful feature; however, I'd like some clarification, if that's OK.
Can you please explain why having 70% of each of the 8 GPUs is better for a training job than, for example, having 5.6 full GPUs?

For example, these two requests are equal in terms of metagpu allocation units:

        resources:
          limits:
            cnvrg.io/metagpu: 70
            cnvrg.io/multigpu: 8

and

        resources:
          limits:
            cnvrg.io/metagpu: 560 # each GPU equals 100 metagpus, so 70 * 8 = 560 => 5.6 GPUs

So why would the training job prefer to have 70% of each of the 8 GPUs, which is 560 units in total, rather than the same 560 units spanning 5 full GPUs and 60% of a sixth?

Dimss self-assigned this Jul 12, 2022
Dimss added the enhancement (New feature or request) label Jul 12, 2022
@elgalu
Author

elgalu commented Jul 12, 2022

Hi, thanks! Because the multi-GPU training job doesn't need 100% of each GPU's memory. It can parallelize well without occupying, for example, the entire 80 GB of an A100 on each device. It can't, however, simply use 100% of the 80 GB on 5 devices instead; I think this comes down to how multi-GPU training works in the underlying frameworks like PyTorch, which drive one worker process per device.
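For instance, a data-parallel PyTorch job launches one worker per visible GPU, and each worker can be capped at a fraction of its device's memory. A minimal sketch (illustrative only, not from cnvrg.io or the metagpu plugin; the 0.7 fraction just mirrors the hypothetical cnvrg.io/metagpu: 70 per-GPU slice):

        import os
        import torch
        import torch.distributed as dist
        from torch.nn.parallel import DistributedDataParallel as DDP

        def main():
            # torchrun starts one process per GPU and sets LOCAL_RANK for each.
            local_rank = int(os.environ["LOCAL_RANK"])
            dist.init_process_group(backend="nccl")
            torch.cuda.set_device(local_rank)

            # Cap this worker at ~70% of its device's memory: the job still needs
            # all 8 devices visible, but never fills any single one of them.
            torch.cuda.set_per_process_memory_fraction(0.7, device=local_rank)

            model = torch.nn.Linear(1024, 1024).cuda(local_rank)
            ddp_model = DDP(model, device_ids=[local_rank])
            # ... regular training loop over ddp_model ...

        if __name__ == "__main__":
            main()

Launched with something like torchrun --nproc_per_node=8 train.py, this uses a slice of all 8 GPUs at once, which is why 70% of each of 8 devices is not interchangeable with 5.6 whole devices for this kind of job.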
