@elgalu thank you for the FR!
It looks like a useful feature, however I'd like some clarification here, if that's ok.
Can you please explain why having 70% of each of 8 GPUs is better for a training job than, for example, having 5.6 full GPUs?
For example, these are equal in terms of metagpu allocation units:
resources:
  limits:
    cnvrg.io/metagpu: 560  # each GPU equals 100 metagpus, so 70 * 8 = 560 => 5.6 GPUs
So why would the training job prefer to have 70% of each of the 8 GPUs, which is 560 units in total, rather than the same 560 units spanning 5 full GPUs and 60% of a sixth?
Hi, thanks! Because the multi-GPU training job doesn't need 100% of each GPU's memory. It can parallelize well without occupying, for example, the entire 80GB of A100 memory on each device. It can't, however, occupy 100% of the 80GB on only 5 devices; I think this comes down to how multi-GPU training works in the underlying frameworks like PyTorch.
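To illustrate the point (this is not from the original request): in a typical PyTorch DistributedDataParallel setup, each rank is pinned to exactly one GPU, so a job spanning 8 GPUs can cap every rank at roughly 70% of its device's memory, but it cannot instead pack its eight ranks onto 5.6 devices. A minimal sketch, assuming 8 visible GPUs and a torchrun launch (which sets LOCAL_RANK and the rendezvous env vars); the model, batch sizes, and the 0.7 fraction are placeholder values:

import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

def main():
    # torchrun sets LOCAL_RANK, RANK, WORLD_SIZE, MASTER_ADDR, MASTER_PORT
    local_rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(local_rank)
    dist.init_process_group(backend="nccl")

    # Each rank only ever touches its own GPU, so capping it at ~70% of that
    # device's memory is enough; the 0.7 here mirrors a 70-metagpu slice.
    torch.cuda.set_per_process_memory_fraction(0.7, device=local_rank)

    # Placeholder model and data, just to show the shape of the loop.
    model = DDP(torch.nn.Linear(1024, 1024).cuda(local_rank), device_ids=[local_rank])
    optimizer = torch.optim.SGD(model.parameters(), lr=0.01)

    for _ in range(10):
        x = torch.randn(32, 1024, device=local_rank)
        loss = model(x).sum()
        loss.backward()
        optimizer.step()
        optimizer.zero_grad()

    dist.destroy_process_group()

if __name__ == "__main__":
    main()

Launched with torchrun --nproc_per_node=8 train.py, this spawns one process per GPU, and each process stays within its 70% memory slice of its own device.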
Feature: to be able to request, for example, on an 8-GPU server, 70% of each of them for a single training job, e.g. by adding a new multigpu limit. That way multi-GPU training is possible while leaving 30% x 8 metagpus free.
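One possible shape for such a spec, purely as an illustration: the cnvrg.io/multigpu key below is hypothetical and does not exist in the plugin today; it only shows how a per-device spread could be expressed alongside the existing cnvrg.io/metagpu limit.

resources:
  limits:
    cnvrg.io/metagpu: 560   # total budget: 70 metagpus on each of 8 GPUs
    cnvrg.io/multigpu: 8    # hypothetical key: spread the budget evenly across 8 physical GPUs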