Can we shuffle the dataset first and then split it into a training set and a validation set? #649
Currently, I am using dpgen to explore systems with different structures. After using "dp collect xxx", we can obtain separate data sets for each structure, such as:
So there's one thing I would like to ask: is there any way to split each sys.* file into a validation set and a training set? Or how could I obtain a good validation set alongside the training set with dpgen?
Replies: 1 comment 3 replies
All data in the dataset generated by dpgen are critical, because every selected configuration was estimated to be poorly predicted by the model and was added to the training dataset to improve the model's quality. In other words, removing any data from the dataset may reduce the accuracy of the model. Therefore, it is recommended to generate an independent validation set rather than splitting the dataset generated by dpgen.
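For readers who still want to see what the shuffle-then-split from the question looks like, here is a minimal sketch in plain NumPy. It only computes disjoint frame indices. The helper name `shuffle_split_indices`, the `valid_fraction` parameter, and the seed are hypothetical choices for illustration and are not part of dpgen or DeePMD-kit. Given the advice above, it is probably best applied to an independently generated dataset rather than to the dpgen output itself.

```python
import numpy as np


def shuffle_split_indices(n_frames, valid_fraction=0.1, seed=0):
    """Hypothetical helper: shuffle frame indices, then split them.

    Returns (train_idx, valid_idx), two disjoint index arrays that
    together cover every frame exactly once.
    """
    if n_frames < 2:
        raise ValueError("need at least 2 frames to split")
    if not 0.0 < valid_fraction < 1.0:
        raise ValueError("valid_fraction must be strictly between 0 and 1")

    rng = np.random.default_rng(seed)   # seeded for reproducibility
    perm = rng.permutation(n_frames)    # shuffled copy of 0..n_frames-1

    # Keep at least one frame on each side of the split.
    n_valid = int(round(n_frames * valid_fraction))
    n_valid = min(max(n_valid, 1), n_frames - 1)

    valid_idx = perm[:n_valid]
    train_idx = perm[n_valid:]
    return train_idx, valid_idx


# Example: 100 frames, 20% held out for validation.
train_idx, valid_idx = shuffle_split_indices(100, valid_fraction=0.2, seed=42)
```

The resulting index arrays can then be used to slice per-frame arrays (coordinates, energies, forces, and so on) with the same indices, so that each frame's labels stay aligned.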