Hello, have you solved this problem yet? I am running into the same issue.
2023-09-26 11:57:18,486 - INFO : Find old submission; recover submission from json file;submission.submission_hash:c40cbdcc7d969ad3ae8930771348a761b6d46f47; machine.context.remote_root:/home/jiang/work/dpgen_example/run/nnwork/c40cbdcc7d969ad3ae8930771348a761b6d46f47; submission.work_base:iter.000000/00.train;
2023-09-26 11:57:18,536 - INFO : info:check_all_finished: False
2023-09-26 11:57:18,539 - INFO : job: 35225006374c02dda988d09b9556589231a34548 6329 terminated;fail_cout is 10; resubmitting job
2023-09-26 11:57:18,547 - INFO : job:35225006374c02dda988d09b9556589231a34548 re-submit after terminated; new job_id is 13634
2023-09-26 11:57:18,794 - INFO : job:35225006374c02dda988d09b9556589231a34548 job_id:13634 after re-submitting; the state now is <JobStatus.terminated: 4>
2023-09-26 11:57:18,794 - INFO : job: 35225006374c02dda988d09b9556589231a34548 13634 terminated;fail_cout is 11; resubmitting job
2023-09-26 11:57:18,799 - INFO : job:35225006374c02dda988d09b9556589231a34548 re-submit after terminated; new job_id is 13656
2023-09-26 11:57:19,046 - INFO : job:35225006374c02dda988d09b9556589231a34548 job_id:13656 after re-submitting; the state now is <JobStatus.terminated: 4>
2023-09-26 11:57:19,047 - INFO : job: 35225006374c02dda988d09b9556589231a34548 13656 terminated;fail_cout is 12; resubmitting job
Traceback (most recent call last):
File "/home/jiang/.local/lib/python3.11/site-packages/dpdispatcher/submission.py", line 352, in handle_unexpected_submission_state
job.handle_unexpected_job_state()
File "/home/jiang/.local/lib/python3.11/site-packages/dpdispatcher/submission.py", line 861, in handle_unexpected_job_state
self.handle_unexpected_job_state()
File "/home/jiang/.local/lib/python3.11/site-packages/dpdispatcher/submission.py", line 861, in handle_unexpected_job_state
self.handle_unexpected_job_state()
File "/home/jiang/.local/lib/python3.11/site-packages/dpdispatcher/submission.py", line 846, in handle_unexpected_job_state
raise RuntimeError(
RuntimeError: job:35225006374c02dda988d09b9556589231a34548 13656 failed 12 times.job_detail:{'35225006374c02dda988d09b9556589231a34548': {'job_task_list': [{'command': "/bin/sh -c '{ if [ ! -f model.ckpt.index ]; then dp train input.json; else dp train input.json --restart model.ckpt; fi }'&&dp freeze", 'task_work_path': '002', 'forward_files': ['input.json'], 'backward_files': ['frozen_model.pb', 'lcurve.out', 'train.log', 'model.ckpt.meta', 'model.ckpt.index', 'model.ckpt.data-00000-of-00001', 'checkpoint'], 'outlog': 'train.log', 'errlog': 'train.log'}, {'command': "/bin/sh -c '{ if [ ! -f model.ckpt.index ]; then dp train input.json; else dp train input.json --restart model.ckpt; fi }'&&dp freeze", 'task_work_path': '001', 'forward_files': ['input.json'], 'backward_files': ['frozen_model.pb', 'lcurve.out', 'train.log', 'model.ckpt.meta', 'model.ckpt.index', 'model.ckpt.data-00000-of-00001', 'checkpoint'], 'outlog': 'train.log', 'errlog': 'train.log'}, {'command': "/bin/sh -c '{ if [ ! -f model.ckpt.index ]; then dp train input.json; else dp train input.json --restart model.ckpt; fi }'&&dp freeze", 'task_work_path': '003', 'forward_files': ['input.json'], 'backward_files': ['frozen_model.pb', 'lcurve.out', 'train.log', 'model.ckpt.meta', 'model.ckpt.index', 'model.ckpt.data-00000-of-00001', 'checkpoint'], 'outlog': 'train.log', 'errlog': 'train.log'}, {'command': "/bin/sh -c '{ if [ ! 
-f model.ckpt.index ]; then dp train input.json; else dp train input.json --restart model.ckpt; fi }'&&dp freeze", 'task_work_path': '000', 'forward_files': ['input.json'], 'backward_files': ['frozen_model.pb', 'lcurve.out', 'train.log', 'model.ckpt.meta', 'model.ckpt.index', 'model.ckpt.data-00000-of-00001', 'checkpoint'], 'outlog': 'train.log', 'errlog': 'train.log'}], 'resources': {'number_node': 1, 'cpu_per_node': 4, 'gpu_per_node': 0, 'queue_name': '', 'group_size': 4, 'custom_flags': [], 'strategy': {'if_cuda_multi_devices': False, 'ratio_unfinished': 0.0}, 'para_deg': 1, 'module_purge': False, 'module_unload_list': [], 'module_list': [], 'source_list': [], 'envs': {}, 'prepend_script': [], 'append_script': [], 'wait_time': 0, 'kwargs': {}}, 'job_state': <JobStatus.terminated: 4>, 'job_id': 13656, 'fail_count': 12}}
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
File "/home/jiang/.local/bin/dpgen", line 8, in <module>
sys.exit(main())
^^^^^^
File "/home/jiang/.local/lib/python3.11/site-packages/dpgen/main.py", line 233, in main
args.func(args)
File "/home/jiang/.local/lib/python3.11/site-packages/dpgen/generator/run.py", line 5109, in gen_run
run_iter(args.PARAM, args.MACHINE)
File "/home/jiang/.local/lib/python3.11/site-packages/dpgen/generator/run.py", line 4440, in run_iter
run_train(ii, jdata, mdata)
File "/home/jiang/.local/lib/python3.11/site-packages/dpgen/generator/run.py", line 776, in run_train
submission.run_submission()
File "/home/jiang/.local/lib/python3.11/site-packages/dpdispatcher/submission.py", line 229, in run_submission
self.handle_unexpected_submission_state()
File "/home/jiang/.local/lib/python3.11/site-packages/dpdispatcher/submission.py", line 355, in handle_unexpected_submission_state
raise RuntimeError(
RuntimeError: Meet errors will handle unexpected submission state.
Debug information: remote_root==/home/jiang/work/dpgen_example/run/nnwork/c40cbdcc7d969ad3ae8930771348a761b6d46f47.
Debug information: submission_hash==c40cbdcc7d969ad3ae8930771348a761b6d46f47.
Please check the dirs and scripts in remote_root. The job information mentioned above may help.
Using the information above, I found the train.log file, which shows 'dp: nov vocab file specified', but I don't know how to solve this problem. Thanks!
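For anyone following the traceback's advice to "check the dirs and scripts in remote_root", here is a minimal sketch for collecting the last lines of every task's train.log in one pass. It assumes the task directories (000 to 003 in the job detail above) sit directly under remote_root; the helper name `tail_task_logs` and its parameters are made up for illustration and are not part of dpgen or dpdispatcher.

```python
from pathlib import Path


def tail_task_logs(remote_root, log_name="train.log", n_lines=20):
    """Return {task_dir_name: last n_lines of its log} for each task directory.

    Assumption: task directories are direct children of remote_root.
    Directories without the log file are skipped.
    """
    tails = {}
    for task_dir in sorted(Path(remote_root).iterdir()):
        log_file = task_dir / log_name
        if task_dir.is_dir() and log_file.is_file():
            # errors="replace" keeps going even if the log has odd bytes
            lines = log_file.read_text(errors="replace").splitlines()
            tails[task_dir.name] = lines[-n_lines:]
    return tails


if __name__ == "__main__":
    # remote_root as printed in the "Debug information" lines of the traceback
    root = "/home/jiang/work/dpgen_example/run/nnwork/c40cbdcc7d969ad3ae8930771348a761b6d46f47"
    if Path(root).is_dir():
        for task, lines in tail_task_logs(root).items():
            print(f"== {task}/train.log ==")
            print("\n".join(lines))
```

Since both outlog and errlog point to train.log in the job detail, the tail of that file should contain whatever message made the job terminate.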