There is always an error when executing "dpgen run param.json machine.json", I don't know if there is any error in the settings file "machine.json" #811
Unanswered
yuanzuiqaq asked this question in Q&A
Replies: 1 comment · 5 replies
-
please provide log files under |
-
First of all, thank you very much for taking the time to read my question. Whenever I execute "dpgen run param.json machine.json" I get errors that confuse me.
I use PBS to submit tasks at a remote supercomputer center, and we do not have GPUs available, so my "machine.json" file and the errors are as follows.
My error:
(deepmd) [nano006@ln02 run]$ dpgen run param.json machine.json
DeepModeling
Version: 0.10.6
Date: Jul-13-2022
Path: /public/nano006/.local/lib/python3.9/site-packages/dpgen
Dependency
pymatgen unknown version or path
monty 2022.4.26 /public/nano006/miniconda3/envs/deepmd/lib/python3.9/site-packages/monty
ase 3.22.1 /public/nano006/.local/lib/python3.9/site-packages/ase
paramiko 2.11.0 /public/nano006/.local/lib/python3.9/site-packages/paramiko
custodian 2022.5.26 /public/nano006/.local/lib/python3.9/site-packages/custodian
Reference
Please cite:
Yuzhi Zhang, Haidi Wang, Weijie Chen, Jinzhe Zeng, Linfeng Zhang, Han Wang, and Weinan E,
DP-GEN: A concurrent learning platform for the generation of reliable deep learning
based potential energy models, Computer Physics Communications, 2020, 107206.
Description
INFO:dpgen:-------------------------iter.000000 task 01--------------------------
2022-07-19 13:38:24,137 - INFO : info:check_all_finished: False
2022-07-19 13:38:24,138 - INFO : remote path: /public/nano006/cxh/deepmd/dpgen_example/run_path/872e78444e69466ca3d6fccef767c7bf6e00649d
2022-07-19 13:38:30,915 - INFO : job: 7ced0e2aa93d65c2cdbe53b2335ef0d78e226d61 submit; job_id is 425938.mu01
2022-07-19 13:38:35,350 - INFO : job: 7ced0e2aa93d65c2cdbe53b2335ef0d78e226d61 425938.mu01 terminated;fail_cout is 1; resubmitting job
2022-07-19 13:38:38,729 - INFO : job:7ced0e2aa93d65c2cdbe53b2335ef0d78e226d61 re-submit after terminated; new job_id is 425939.mu01
2022-07-19 13:38:42,170 - INFO : job:7ced0e2aa93d65c2cdbe53b2335ef0d78e226d61 job_id:425939.mu01 after re-submitting; the state now is <JobStatus.terminated: 4>
2022-07-19 13:38:42,170 - INFO : job: 7ced0e2aa93d65c2cdbe53b2335ef0d78e226d61 425939.mu01 terminated;fail_cout is 2; resubmitting job
2022-07-19 13:38:45,600 - INFO : job:7ced0e2aa93d65c2cdbe53b2335ef0d78e226d61 re-submit after terminated; new job_id is 425940.mu01
2022-07-19 13:38:49,237 - INFO : job:7ced0e2aa93d65c2cdbe53b2335ef0d78e226d61 job_id:425940.mu01 after re-submitting; the state now is <JobStatus.terminated: 4>
2022-07-19 13:38:49,237 - INFO : job: 7ced0e2aa93d65c2cdbe53b2335ef0d78e226d61 425940.mu01 terminated;fail_cout is 3; resubmitting job
Traceback (most recent call last):
File "/public/nano006/.local/lib/python3.9/site-packages/dpdispatcher/submission.py", line 241, in handle_unexpected_submission_state
job.handle_unexpected_job_state()
File "/public/nano006/.local/lib/python3.9/site-packages/dpdispatcher/submission.py", line 612, in handle_unexpected_job_state
self.handle_unexpected_job_state()
File "/public/nano006/.local/lib/python3.9/site-packages/dpdispatcher/submission.py", line 612, in handle_unexpected_job_state
self.handle_unexpected_job_state()
File "/public/nano006/.local/lib/python3.9/site-packages/dpdispatcher/submission.py", line 605, in handle_unexpected_job_state
raise RuntimeError(f"job:{self.job_hash} {self.job_id} failed {self.fail_count} times.job_detail:{self}")
RuntimeError: job:7ced0e2aa93d65c2cdbe53b2335ef0d78e226d61 425940.mu01 failed 3 times.job_detail:{'7ced0e2aa93d65c2cdbe53b2335ef0d78e226d61': {'job_task_list': [{'command': "/bin/sh -c '{ if [ ! -f model.ckpt.index ]; then dp train input.json; else dp train input.json --restart model.ckpt; fi }'&&dp freeze", 'task_work_path': '002', 'forward_files': ['input.json'], 'backward_files': ['frozen_model.pb', 'lcurve.out', 'train.log', 'model.ckpt.meta', 'model.ckpt.index', 'model.ckpt.data-00000-of-00001', 'checkpoint'], 'outlog': 'train.log', 'errlog': 'train.log'}, {'command': "/bin/sh -c '{ if [ ! -f model.ckpt.index ]; then dp train input.json; else dp train input.json --restart model.ckpt; fi }'&&dp freeze", 'task_work_path': '001', 'forward_files': ['input.json'], 'backward_files': ['frozen_model.pb', 'lcurve.out', 'train.log', 'model.ckpt.meta', 'model.ckpt.index', 'model.ckpt.data-00000-of-00001', 'checkpoint'], 'outlog': 'train.log', 'errlog': 'train.log'}, {'command': "/bin/sh -c '{ if [ ! -f model.ckpt.index ]; then dp train input.json; else dp train input.json --restart model.ckpt; fi }'&&dp freeze", 'task_work_path': '003', 'forward_files': ['input.json'], 'backward_files': ['frozen_model.pb', 'lcurve.out', 'train.log', 'model.ckpt.meta', 'model.ckpt.index', 'model.ckpt.data-00000-of-00001', 'checkpoint'], 'outlog': 'train.log', 'errlog': 'train.log'}, {'command': "/bin/sh -c '{ if [ ! -f model.ckpt.index ]; then dp train input.json; else dp train input.json --restart model.ckpt; fi }'&&dp freeze", 'task_work_path': '000', 'forward_files': ['input.json'], 'backward_files': ['frozen_model.pb', 'lcurve.out', 'train.log', 'model.ckpt.meta', 'model.ckpt.index', 'model.ckpt.data-00000-of-00001', 'checkpoint'], 'outlog': 'train.log', 'errlog': 'train.log'}], 'resources': {'number_node': 1, 'cpu_per_node': 32, 'gpu_per_node': 0, 'queue_name': 'thr', 'group_size': 4, 'custom_flags': [], 'strategy': {'if_cuda_multi_devices': False, 'ratio_unfinished': 0.0}, 'para_deg': 1, 'module_purge': False, 'module_unload_list': [], 'module_list': [], 'source_list': ['~/miniconda3/envs/deepmd/bin/dp'], 'envs': {}, 'wait_time': 0, 'kwargs': {}}, 'job_state': <JobStatus.terminated: 4>, 'job_id': '425940.mu01', 'fail_count': 3}}
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
File "/public/nano006/.local/bin/dpgen", line 8, in
sys.exit(main())
File "/public/nano006/.local/lib/python3.9/site-packages/dpgen/main.py", line 185, in main
args.func(args)
File "/public/nano006/.local/lib/python3.9/site-packages/dpgen/generator/run.py", line 3642, in gen_run
run_iter (args.PARAM, args.MACHINE)
File "/public/nano006/.local/lib/python3.9/site-packages/dpgen/generator/run.py", line 3607, in run_iter
run_train (ii, jdata, mdata)
File "/public/nano006/.local/lib/python3.9/site-packages/dpgen/generator/run.py", line 610, in run_train
submission.run_submission()
File "/public/nano006/.local/lib/python3.9/site-packages/dpdispatcher/submission.py", line 185, in run_submission
self.handle_unexpected_submission_state()
File "/public/nano006/.local/lib/python3.9/site-packages/dpdispatcher/submission.py", line 244, in handle_unexpected_submission_state
raise RuntimeError(
RuntimeError: Meet errors will handle unexpected submission state.
Debug information: remote_root==/public/nano006/cxh/deepmd/dpgen_example/run_path/872e78444e69466ca3d6fccef767c7bf6e00649d.
Debug information: submission_hash==872e78444e69466ca3d6fccef767c7bf6e00649d.
Please check the dirs and scripts in remote_root. The job information mentioned above may help.
my "machine.json":
{
"api_version": "1.0",
"deepmd_version": "2.1.3",
"train" :[
{
}
Thank you for taking time out of your busy schedule to answer my question!!!
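The machine.json pasted above is cut off right after the opening of the "train" section, so the relevant settings cannot be seen directly. Based on the resources dictionary recorded in the traceback (queue "thr", 1 node, 32 CPUs per node, 0 GPUs, group_size 4), a minimal CPU-only PBS training section might look roughly like the sketch below. This is only an illustrative reconstruction, not the poster's actual file: the "remote_root" path and the "source_list" entry are placeholders, "LocalContext" assumes dpgen runs directly on the cluster login node, and a complete machine.json also needs analogous "model_devi" and "fp" sections, which are not shown.
{
  "api_version": "1.0",
  "deepmd_version": "2.1.3",
  "train": [
    {
      "command": "dp",
      "machine": {
        "batch_type": "PBS",
        "context_type": "LocalContext",
        "local_root": "./",
        "remote_root": "/path/to/a/writable/work/directory"
      },
      "resources": {
        "number_node": 1,
        "cpu_per_node": 32,
        "gpu_per_node": 0,
        "queue_name": "thr",
        "group_size": 4,
        "source_list": ["/path/to/a/script/that/activates/the/deepmd/env"]
      }
    }
  ]
}
One detail worth checking against the traceback: the resources recorded there show "source_list": ["~/miniconda3/envs/deepmd/bin/dp"]. dpdispatcher sources each source_list entry as a shell script in the generated submission script, so this field normally points at something like an environment activation script rather than the dp executable itself. The train.log files in the remote task directories (listed as the jobs' outlog/errlog) should show whether the dp command was actually found when the jobs terminated.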