Can we shuffle dataset first and then split the dataset into training set and validation set? #649

theAfish · 2022-02-22T05:28:47Z

Currently, I am using dpgen to explore systems with different structures. After using "dp collect xxx", we can obtain separate data sets for each structure, such as:

init.000
init.001
init.002
...
sys.000
sys.001
sys.002
...

So I would like to ask: is there any way to split each sys.* data set into a validation set and a training set? Or, how can I obtain a good validation set alongside the training set with dpgen?
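For concreteness, the shuffle-then-split operation being asked about could be sketched as below. This is a minimal illustration on frame indices only, using the Python standard library; the function name `split_frame_indices`, the 10% validation fraction, and the frame count of 200 are assumptions made for the example and are not part of dpgen or DeePMD-kit.

```python
import random


def split_frame_indices(n_frames, val_fraction=0.1, seed=0):
    """Shuffle frame indices 0..n_frames-1 and split them into (train, val).

    Hypothetical helper for illustration; not a dpgen or DeePMD-kit function.
    """
    if n_frames < 2:
        raise ValueError("need at least 2 frames to split")
    if not 0.0 < val_fraction < 1.0:
        raise ValueError("val_fraction must be strictly between 0 and 1")

    indices = list(range(n_frames))
    # A seeded generator makes the shuffle reproducible between runs.
    random.Random(seed).shuffle(indices)

    # Keep at least one frame on each side of the split.
    n_val = min(n_frames - 1, max(1, round(n_frames * val_fraction)))
    val = sorted(indices[:n_val])
    train = sorted(indices[n_val:])
    return train, val


# Example: a hypothetical system with 200 frames.
train_idx, val_idx = split_frame_indices(200, val_fraction=0.1, seed=42)
print(len(train_idx), len(val_idx))
```

The two index lists are disjoint and together cover every frame, so each frame ends up in exactly one of the two sets.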

Answered by wanghan-iapcm

wanghan-iapcm · 2022-02-23T01:41:22Z

Maintainer

All data in the dataset generated by dpgen are critical: every selected configuration was chosen because the model's accuracy on it is estimated to be poor, and it is added to the training dataset to improve the quality of the model. In other words, removing any data from the dataset may reduce the accuracy of the model. It is therefore recommended to generate an independent validation set rather than splitting the dataset generated by dpgen.

3 replies

theAfish Feb 23, 2022
Author

Thank you very much!

LZH-1996 Sep 22, 2022

I have a question: does this mean we should set up the validation dataset ourselves? If no validation parameter is set in param.json, does that mean there is no testing or validation during DeePMD-kit training?

theAfish Sep 22, 2022
Author

Yes, you need to prepare the validation sets yourself. Several AIMD trajectories could work well.
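As a rough sketch of what "preparing the validation sets yourself" could look like on the configuration side, the snippet below assembles a training section that lists training and validation systems separately. The key names (`training_data`, `validation_data`, `systems`, `batch_size`, `numb_btch`) follow the DeePMD-kit v2 input format and should be checked against the installed version. The directory names `aimd_val.000` and `aimd_val.001` and the helper `build_training_section` are hypothetical. Where such a block belongs in a dpgen param.json, and whether dpgen passes it through unchanged, is not stated in this thread.

```python
import json


def build_training_section(train_dirs, val_dirs):
    """Return a dict with separate training and validation system lists.

    Hypothetical helper; it only assembles a config fragment.
    """
    overlap = set(train_dirs) & set(val_dirs)
    if overlap:
        # The validation set is meant to be independent of the training data.
        raise ValueError(f"systems used for both roles: {sorted(overlap)}")
    return {
        "training_data": {
            "systems": list(train_dirs),
            "batch_size": "auto",
        },
        "validation_data": {
            "systems": list(val_dirs),
            "batch_size": 1,
            "numb_btch": 3,  # number of validation batches per evaluation
        },
    }


# Training data collected from dpgen; validation data from separate AIMD runs.
training_systems = ["init.000", "init.001", "sys.000", "sys.001"]
validation_systems = ["aimd_val.000", "aimd_val.001"]  # hypothetical paths

section = build_training_section(training_systems, validation_systems)
print(json.dumps({"training": section}, indent=2))
```

Keeping the two lists separate, and rejecting any overlap, mirrors the advice above: the dpgen data stays entirely in the training set, and validation comes from independently generated trajectories.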
