Failure in setting up machine.json for multiple GPU training #666
Could you please provide the complete
Sure.
"numb_node": 1,
"numb_node": 1,
}

Here is the sub file for the first model:

    cd 000
    export CUDA_VISIBLE_DEVICES=0
    cd /home/jiangshuai/ML/TensorMol/TensorMol/structures_sampl/dpgen/test_dpgen_dpmd_multiple_GPUs/iter.000000/00.train
    wait
    cd 000
    export CUDA_VISIBLE_DEVICES=1
    cd /home/jiangshuai/ML/TensorMol/TensorMol/structures_sampl/dpgen/test_dpgen_dpmd_multiple_GPUs/iter.000000/00.train
    wait
    touch 29f2dc03-9281-4da1-b6a3-89a96c37b3e6_tag_finished

The other sub files are basically the same except for the directory and file names. For instance, here is the sub file for the second model:

    cd 001
    export CUDA_VISIBLE_DEVICES=0
    cd /home/jiangshuai/ML/TensorMol/TensorMol/structures_sampl/dpgen/test_dpgen_dpmd_multiple_GPUs/iter.000000/00.train
    wait
    cd 001
    export CUDA_VISIBLE_DEVICES=1
    cd /home/jiangshuai/ML/TensorMol/TensorMol/structures_sampl/dpgen/test_dpgen_dpmd_multiple_GPUs/iter.000000/00.train
    wait
    touch 6d8ab71b-0b71-4f3a-8879-cfe313a1ea3b_tag_finished
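For comparison, the general technique of pinning concurrent processes to different GPUs through CUDA_VISIBLE_DEVICES can be sketched in Python. This is only an illustration under stated assumptions, not how dpgen generates or runs its sub files: the helper name `launch_on_gpus` is hypothetical, and the probe command simply prints the variable so the sketch runs without a GPU.

```python
import os
import subprocess
import sys


def launch_on_gpus(commands, n_devices):
    """Start every command at once, each with its own CUDA_VISIBLE_DEVICES.

    Hypothetical helper for illustration; device indices are assigned
    round-robin (task i gets device i % n_devices).
    """
    procs = []
    for i, cmd in enumerate(commands):
        # Copy the parent environment and override only the GPU selection.
        env = dict(os.environ, CUDA_VISIBLE_DEVICES=str(i % n_devices))
        procs.append(
            subprocess.Popen(cmd, env=env, stdout=subprocess.PIPE, text=True)
        )
    # All processes are already running; now wait for each and collect stdout.
    return [p.communicate()[0].strip() for p in procs]


if __name__ == "__main__":
    # Stand-in for a training command: report which device index it was given.
    probe = [sys.executable, "-c",
             "import os; print(os.environ['CUDA_VISIBLE_DEVICES'])"]
    print(launch_on_gpus([probe] * 4, n_devices=4))
```

The point of the sketch is that each child process receives its own value of the variable at launch time, and the wait happens only after all of them have been started.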
Hi all,
I installed dpgen v0.10.3 and dpmd 2.1.0 and have more than four 3090 GPUs available on one node. I would like to use four or more GPUs to train four models simultaneously, but it does not work: all four Python processes run on a single GPU instead of being spread across four GPUs.
Here are the parts of machine.json related to GPU settings:
"train_resources": {
    "numb_node": 1,
    "numb_gpu": 4,
    "task_per_node": 12,
    "manual_cuda_devices": 4,
    "manual_cuda_multiplicity": 1,
    "cuda_multi_task": true,
    "group_size": 4
},
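The placement these settings are meant to produce can be written down as a small sketch. It assumes one plausible reading of the keys, namely that `manual_cuda_devices` is the number of GPUs to cycle through and `manual_cuda_multiplicity` is the number of concurrent tasks per GPU. The function `assign_gpus` is hypothetical and is not dpgen's dispatcher code.

```python
def assign_gpus(n_tasks, n_devices, multiplicity=1):
    """Round-robin plan: which GPU each task gets and in which wave it runs.

    Assumption (not taken from dpgen): a "wave" is a batch of
    n_devices * multiplicity tasks that run concurrently before waiting.
    """
    if n_devices <= 0 or multiplicity <= 0:
        raise ValueError("n_devices and multiplicity must be positive")
    wave_size = n_devices * multiplicity
    plan = []
    for task in range(n_tasks):
        plan.append({
            "task": task,
            "gpu": task % n_devices,     # cycle through device indices
            "wave": task // wave_size,   # tasks in the same wave overlap
        })
    return plan


if __name__ == "__main__":
    # Four models, four GPUs, one task per GPU, as in the settings above.
    for entry in assign_gpus(n_tasks=4, n_devices=4, multiplicity=1):
        print(entry)
```

Under this reading, four tasks with four devices and multiplicity 1 would land on GPUs 0 to 3 in a single wave, which is the behavior the question expects but does not observe.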
Here are the messages about GPU running:
Thanks!