Replies: 1 comment
- It looks to me like you are running multiple CUDA processes on a single GPU at the same time. In that case it may be necessary to set up NVIDIA MPS; otherwise the processes will compete for the device.
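A minimal sketch of enabling MPS on a single-GPU node (the pipe/log directory paths and the device index 0 are assumptions; adjust them for your system):

```sh
# Start the NVIDIA MPS control daemon so that concurrent CUDA processes
# (e.g. several DeePMD training jobs) share the GPU cooperatively.
export CUDA_VISIBLE_DEVICES=0                      # the single GPU
export CUDA_MPS_PIPE_DIRECTORY=/tmp/nvidia-mps     # assumed scratch path
export CUDA_MPS_LOG_DIRECTORY=/tmp/nvidia-mps-log  # assumed scratch path
nvidia-cuda-mps-control -d                         # -d: run as a daemon

# To shut MPS down later:
echo quit | nvidia-cuda-mps-control
```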
- My setup is 2 nodes, each with 64 cores (128 cores in total; hyperthreading is disabled) and an RTX 4060 graphics card. For the first iteration I trained the potential in the 01.train stage using one RTX 4060 and the 64 cores of a single node, with an initial data set of only about 90 frames, and it takes two or three days. Why is the efficiency so low? Is my hardware too weak, or have I set the parameters incorrectly? Please leave a comment. Thank you.
Part of the machine.json file is set as follows (see the attachment for all parameters):
```json
"api_version": "1.0",
"deepmd_version": "2.2.8",
"train": [
  {
    "machine": {
      "batch_type": "Shell",
      "context_type": "local",
      "local_root": "./",
      "machine_type": "shell",
      "remote_root": "/home/combustion/Documents/dpgen_gpu/1.16/2.27/new_work"
    },
    "number_node": 1,
    "cpu_per_node": 64,
    "gpu_per_node": 1,
    "group_size": 4,
    "source_list": ["~/.bashrc;conda activate deepmd;export OMP_NUM_THREADS=64;export TF_INTRA_OP_PARALLELISM_THREADS=64;export TF_INTER_OP_PARALLELISM_THREADS=4"]
```
Part of the parameter settings of param.json are as follows (all parameters can be found in the attachment):
"type_map": ["Al","N","H","Cl","O"],
"Mass_map" : [26.9815, 14.0067, 1.0000, 35.4530, 15.9994].
"init_data_prefix": "./cp2k_data",
"init_data_sys": ["aimd/training_data"],
"init_multi_systems":true,
"init_batch_size":["auto"],
"sys_configs": [["./POSCAR"] ],
"sys_batch_size" : ["auto"],
"_comment": " that's all ",
"numb_models": 4,
Attachments: machine.txt, param.txt