Replies: 1 comment
- It looks to me like you are running multiple CUDA processes on a single GPU at the same time. In that case it may be necessary to set up NVIDIA MPS; otherwise the processes will compete for the device.
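A minimal sketch of enabling MPS on a single-GPU node (the pipe/log directory paths and the device index 0 are assumptions; adjust them for your system):

```sh
# Start the NVIDIA MPS control daemon so that concurrent CUDA processes
# (e.g. several DeePMD training jobs) share the GPU cooperatively.
export CUDA_VISIBLE_DEVICES=0                      # the single GPU
export CUDA_MPS_PIPE_DIRECTORY=/tmp/nvidia-mps     # assumed scratch path
export CUDA_MPS_LOG_DIRECTORY=/tmp/nvidia-mps-log  # assumed scratch path
nvidia-cuda-mps-control -d                         # -d: run as a daemon

# To shut MPS down later:
echo quit | nvidia-cuda-mps-control
```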
- My setup is 2 nodes, each with 64 cores (128 cores in total; hyperthreading is disabled) and an RTX 4060 graphics card. For the first iteration I trained the potential in the 01.train stage using one RTX 4060 and the 64 cores of a single node, with an initial data set of only about 90 frames, and it takes two or three days. Why is the efficiency so low? Is my hardware too weak, or have I set the parameters incorrectly? Please leave a comment. Thank you.
Part of the machine.json file is set as follows (see the attachment for all parameters):
```json
"api_version": "1.0",
"deepmd_version": "2.2.8",
"train": [
  {
    "machine": {
      "batch_type": "Shell",
      "context_type": "local",
      "local_root": "./",
      "machine_type": "shell",
      "remote_root": "/home/combustion/Documents/dpgen_gpu/1.16/2.27/new_work"
    },
    "number_node": 1,
    "cpu_per_node": 64,
    "gpu_per_node": 1,
    "group_size": 4,
    "source_list": ["~/.bashrc;conda activate deepmd;export OMP_NUM_THREADS=64;export TF_INTRA_OP_PARALLELISM_THREADS=64;export TF_INTER_OP_PARALLELISM_THREADS=4"]
```
Part of the parameter settings of param.json are as follows (all parameters can be found in the attachment):
"type_map": ["Al","N","H","Cl","O"],
"Mass_map" : [26.9815, 14.0067, 1.0000, 35.4530, 15.9994].
"init_data_prefix": "./cp2k_data",
"init_data_sys": ["aimd/training_data"],
"init_multi_systems":true,
"init_batch_size":["auto"],
"sys_configs": [["./POSCAR"] ],
"sys_batch_size" : ["auto"],
"_comment": " that's all ",
"numb_models": 4,
Attachments: machine.txt, param.txt