In addition, I looked at the remote machine where the 'train' step of dpgen 'run' is performed. The file train.log contains a lot of information that confuses me.
I found an error message at line 91 of train.log:
When I looked through some posts for help, I found one saying that this TensorFlow error may be related to batch_size. But I don't know what this "batch_size" refers to, and I am not sure whether this problem is what terminated my job. I hope someone can give me some help. Thanks!
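Since the message at line 91 is easier to interpret together with the lines around it, here is a minimal sketch for printing a window of a log file around a given line number. The helper name `lines_around` and the `context` parameter are made up for this illustration; they are not part of dpgen or DeePMD-kit.

```python
from itertools import islice


def lines_around(path, line_no, context=5):
    """Return (line_number, text) pairs for `context` lines on each side of line_no."""
    start = max(line_no - context, 1)
    stop = line_no + context
    with open(path, encoding="utf-8", errors="replace") as handle:
        # islice uses 0-based positions, so line N sits at index N - 1.
        window = islice(handle, start - 1, stop)
        return [(start + i, text.rstrip("\n")) for i, text in enumerate(window)]


if __name__ == "__main__":
    import os

    # "train.log" is the file name from the post; adjust the path as needed.
    if os.path.exists("train.log"):
        for number, text in lines_around("train.log", 91):
            print(f"{number:5d}: {text}")
```

Pasting that window into the thread, rather than a single line, usually gives others enough context to tell which operation failed.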
Please provide a reproducible example. You can submit the issue to https://github.com/deepmodeling/deepmd-kit/issues and provide more information.
I met a similar problem, but in the model_devi step instead:
RuntimeError: job:d67e559b8039daf73b04141a1b94a52b6e7cf3cd 7033 failed 3 times.
MPI_ABORT was invoked on rank 0 in communicator MPI_COMM_WORLD
NOTE: invoking MPI_ABORT causes Open MPI to kill all MPI processes.
When I run the "run" part of dpgen, I perform 'train' on a remote GPU machine over SSH, and 'model_devi' and 'fp' on a local Slurm platform. But I get an error at the first step, 'train'.
I want to train 4 models, but the output says that the training jobs of all four models were terminated. One of them was even terminated 3 times.
I don't know why this error happens.
The error report follows:
Description
2024-01-30 21:17:32,861 - INFO : info:check_all_finished: False
2024-01-30 21:17:32,863 - INFO : remote path: /home/biglinn/deepmdjob/temp/ca378746f3011422e96ceae2219a8ac994ad81c8
2024-01-30 21:17:34,983 - INFO : job: 25d44bdc0fdbba7a8f78bbb09d2b7d9f64899dba submit; job_id is 61164
2024-01-30 21:17:36,896 - INFO : job: 67886ba977f4351e6d7b8467aa846efcedca46d8 submit; job_id is 61782
2024-01-30 21:17:38,827 - INFO : job: 8bcc3500ec04fde48a87f38df4b31c38b7888581 submit; job_id is 62401
2024-01-30 21:17:40,666 - INFO : job: 702d4e7aea5c1aaf52f3f976bc1700d826e94418 submit; job_id is 63030
2024-01-30 21:19:15,423 - INFO : job: 702d4e7aea5c1aaf52f3f976bc1700d826e94418 63030 terminated; fail_cout is 1; resubmitting job
2024-01-30 21:19:17,303 - INFO : job:702d4e7aea5c1aaf52f3f976bc1700d826e94418 re-submit after terminated; new job_id is 66621
2024-01-30 21:19:18,043 - INFO : job:702d4e7aea5c1aaf52f3f976bc1700d826e94418 job_id:66621 after re-submitting; the state now is <JobStatus.running: 3>
2024-01-30 21:23:38,577 - INFO : job: 25d44bdc0fdbba7a8f78bbb09d2b7d9f64899dba 61164 terminated; fail_cout is 1; resubmitting job
2024-01-30 21:23:40,385 - INFO : job:25d44bdc0fdbba7a8f78bbb09d2b7d9f64899dba re-submit after terminated; new job_id is 69149
2024-01-30 21:23:41,141 - INFO : job:25d44bdc0fdbba7a8f78bbb09d2b7d9f64899dba job_id:69149 after re-submitting; the state now is <JobStatus.running: 3>
2024-01-30 21:24:13,476 - INFO : job: 67886ba977f4351e6d7b8467aa846efcedca46d8 61782 terminated; fail_cout is 1; resubmitting job
2024-01-30 21:24:15,250 - INFO : job:67886ba977f4351e6d7b8467aa846efcedca46d8 re-submit after terminated; new job_id is 70568
2024-01-30 21:24:15,986 - INFO : job:67886ba977f4351e6d7b8467aa846efcedca46d8 job_id:70568 after re-submitting; the state now is <JobStatus.running: 3>
2024-01-30 21:30:07,291 - INFO : job: 25d44bdc0fdbba7a8f78bbb09d2b7d9f64899dba 69149 terminated; fail_cout is 2; resubmitting job
2024-01-30 21:30:09,173 - INFO : job:25d44bdc0fdbba7a8f78bbb09d2b7d9f64899dba re-submit after terminated; new job_id is 73800
2024-01-30 21:30:10,014 - INFO : job:25d44bdc0fdbba7a8f78bbb09d2b7d9f64899dba job_id:73800 after re-submitting; the state now is <JobStatus.running: 3>
2024-01-30 21:41:59,361 - INFO : job: 8bcc3500ec04fde48a87f38df4b31c38b7888581 62401 terminated; fail_cout is 1; resubmitting job
2024-01-30 21:41:54,880 - INFO : job:8bcc3500ec04fde48a87f38df4b31c38b7888581 re-submit after terminated; new job_id is 79737
2024-01-30 21:41:55,702 - INFO : job:8bcc3500ec04fde48a87f38df4b31c38b7888581 job_id:79737 after re-submitting; the state now is <JobStatus.running: 3>
2024-01-30 21:46:21,991 - INFO : job: 25d44bdc0fdbba7a8f78bbb09d2b7d9f64899dba 73800 terminated; fail_cout is 3; resubmitting job
Traceback (most recent call last):
File "/public/home/biglinn/miniconda3/envs/dmd/lib/python3.11/site-packages/dpdispatcher/submission.py", line 358, in handle_unexpected_submission_state
job.handle_unexpected_job_state()
File "/public/home/biglinn/miniconda3/envs/dmd/lib/python3.11/site-packages/dpdispatcher/submission.py", line 862, in handle_unexpected_job_state
raise RuntimeError(err_msg)
RuntimeError: job:25d44bdc0fdbba7a8f78bbb09d2b7d9f64899dba 73800 failed 3 times.
Possible remote error message: ==> /home/biglinn/deepmdjob/temp/ca378746f3011422e96ceae2219a8ac994ad81c8/002/train.log <==
lib/python3.11/site-packages/deepmd/descriptor/se_a.py", line 751, in _pass_filter
File "/home/biglinn/.deepmd-kit/lib/python3.11/site-packages/tensorflow/python/util/traceback_utils.py", line 150, in error_handler
File "/home/biglinn/.deepmd-kit/lib/python3.11/site-packages/tensorflow/python/util/dispatch.py", line 1260, in op_dispatch_handler
File "/home/biglinn/.deepmd-kit/lib/python3.11/site-packages/tensorflow/python/ops/array_ops.py", line 1223, in slice
File "/home/biglinn/.deepmd-kit/lib/python3.11/site-packages/tensorflow/python/ops/gen_array_ops.py", line 9825, in _slice
File "/home/biglinn/.deepmd-kit/lib/python3.11/site-packages/tensorflow/python/framework/op_def_library.py", line 796, in _apply_op_helper
File "/home/biglinn/.deepmd-kit/lib/python3.11/site-packages/tensorflow/python/framework/ops.py", line 2652, in _create_op_internal
File "/home/biglinn/.deepmd-kit/lib/python3.11/site-packages/tensorflow/python/framework/ops.py", line 1160, in from_node_def
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
File "/public/home/biglinn/miniconda3/envs/dmd/bin/dpgen", line 10, in <module>
sys.exit(main())
^^^^^^
File "/public/home/biglinn/miniconda3/envs/dmd/lib/python3.11/site-packages/dpgen/main.py", line 255, in main
args.func(args)
File "/public/home/biglinn/miniconda3/envs/dmd/lib/python3.11/site-packages/dpgen/generator/run.py", line 5411, in gen_run
run_iter(args.PARAM, args.MACHINE)
File "/public/home/biglinn/miniconda3/envs/dmd/lib/python3.11/site-packages/dpgen/generator/run.py", line 4742, in run_iter
run_train(ii, jdata, mdata)
File "/public/home/biglinn/miniconda3/envs/dmd/lib/python3.11/site-packages/dpgen/generator/run.py", line 864, in run_train
submission.run_submission()
File "/public/home/biglinn/miniconda3/envs/dmd/lib/python3.11/site-packages/dpdispatcher/submission.py", line 261, in run_submission
self.handle_unexpected_submission_state()
File "/public/home/biglinn/miniconda3/envs/dmd/lib/python3.11/site-packages/dpdispatcher/submission.py", line 362, in handle_unexpected_submission_state
raise RuntimeError(
RuntimeError: Meet errors will handle unexpected submission state.
Debug information: remote_root==/home/biglinn/deepmdjob/temp/ca378746f3011422e96ceae2219a8ac994ad81c8.
Debug information: submission_hash==ca378746f3011422e96ceae2219a8ac994ad81c8.
Please check error messages above and in remote_root. The submission information is saved in /public/home/biglinn/.dpdispatcher/submission/ca378746f3011422e96ceae2219a8ac994ad81c8.json.
For furthur actions, run the following command with proper flags: dpdisp submission ca378746f3011422e96ceae2219a8ac994ad81c8
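To see at a glance which training job ran out of retries, the "terminated" lines in the report above can be tallied per job hash. The following is a minimal sketch that only parses the text format shown in this log (including its "fail_cout" spelling); the function name `summarize_failures` is made up for this illustration and is not a dpdispatcher API.

```python
import re

# Matches lines such as:
#   "... - INFO : job: <hash> <job_id> terminated; fail_cout is 1; resubmitting job"
# "fail_cout" is spelled exactly as it appears in the log above.
TERMINATED = re.compile(
    r"job:\s*(?P<hash>[0-9a-f]+)\s+(?P<job_id>\d+)\s+terminated;"
    r"\s+fail_cout is (?P<count>\d+)"
)


def summarize_failures(log_text):
    """Return {job_hash: highest fail count seen} for a dispatcher log."""
    failures = {}
    for line in log_text.splitlines():
        match = TERMINATED.search(line)
        if match:
            job = match.group("hash")
            count = int(match.group("count"))
            failures[job] = max(failures.get(job, 0), count)
    return failures


# Three lines copied from the report above.
sample = """\
2024-01-30 21:19:15,423 - INFO : job: 702d4e7aea5c1aaf52f3f976bc1700d826e94418 63030 terminated; fail_cout is 1; resubmitting job
2024-01-30 21:23:38,577 - INFO : job: 25d44bdc0fdbba7a8f78bbb09d2b7d9f64899dba 61164 terminated; fail_cout is 1; resubmitting job
2024-01-30 21:46:21,991 - INFO : job: 25d44bdc0fdbba7a8f78bbb09d2b7d9f64899dba 73800 terminated; fail_cout is 3; resubmitting job
"""

for job, count in summarize_failures(sample).items():
    print(job, count)
```

In this report the job that reaches a count of 3 is the one named in the RuntimeError, and its train.log (task directory 002 under the remote path) is the file to inspect first.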