An error occurred when running dpgen #1175
maoxinxina asked this question in Q&A
Replies: 2 comments
-
Did you find a solution? I'm having the same problem.
-
This error is reported by Slurm: no node matches the configuration you requested. Ask your cluster administrator which partitions and resources are available.
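Slurm usually returns "Requested node configuration is not available" when no node in the chosen partition can satisfy the combination of resources in the request (partition, node count, GPUs, memory). As a rough sketch, the request can be compared against what `sinfo` reports. The sample text below is hypothetical, not output from the cluster in the question, and the helper names `count_gpus`, `parse_sinfo`, and `matching_partitions` are made up for illustration.

```python
import re

# Hypothetical output of: sinfo -h -o "%P|%D|%c|%m|%G"
# (%P partition, %D node count, %c CPUs per node, %m memory per node in MB, %G GRES)
SAMPLE_SINFO = """\
cpu*|20|64|256000|(null)
gpu|4|32|128000|gpu:4
"""


def count_gpus(gres):
    """Sum the GPU counts in a GRES string such as 'gpu:4' or 'gpu:a100:4(S:0-1)'."""
    return sum(int(n) for n in re.findall(r"gpu(?::[^:,(]+)?:(\d+)", gres))


def parse_sinfo(text):
    """Turn pipe-separated sinfo lines into a list of dicts."""
    rows = []
    for line in text.strip().splitlines():
        partition, nodes, cpus, mem_mb, gres = line.strip().split("|")
        rows.append({
            "partition": partition.rstrip("*"),  # '*' marks the default partition
            "nodes": int(nodes),
            "cpus": int(cpus.rstrip("+")),       # '+' means "at least this many"
            "mem_mb": int(mem_mb.rstrip("+")),
            "gpus": count_gpus(gres),
        })
    return rows


def matching_partitions(rows, partition, nodes, gpus_per_node, mem_mb=0):
    """Return the sinfo rows that could satisfy the request."""
    return [
        r for r in rows
        if r["partition"] == partition
        and r["nodes"] >= nodes
        and r["gpus"] >= gpus_per_node
        and r["mem_mb"] >= mem_mb
    ]


if __name__ == "__main__":
    rows = parse_sinfo(SAMPLE_SINFO)
    # Values taken from the "train" resources in machine.json:
    # queue_name "gpu", number_node 1, gpu_per_node 1.
    print(matching_partitions(rows, "gpu", nodes=1, gpus_per_node=1))
```

On a real cluster, replace `SAMPLE_SINFO` with the text printed by the `sinfo` command shown in the comment. An empty result for the values in machine.json suggests the partition name or the GPU count does not match what the cluster offers. Note that `%D` also counts nodes that are down or drained, so a match here does not guarantee the job will start.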
-
I am running dpgen on a Slurm queue, and an error occurs.
Description
2023-04-04 15:56:02,277 - INFO : info:check_all_finished: False
Traceback (most recent call last):
File "/HOME/scz0aai/run/deepmd-kit/lib/python3.10/site-packages/dpdispatcher/submission.py", line 285, in handle_unexpected_submission_state
job.handle_unexpected_job_state()
File "/HOME/scz0aai/run/deepmd-kit/lib/python3.10/site-packages/dpdispatcher/submission.py", line 751, in handle_unexpected_job_state
self.submit_job()
File "/HOME/scz0aai/run/deepmd-kit/lib/python3.10/site-packages/dpdispatcher/submission.py", line 798, in submit_job
job_id = self.machine.do_submit(self)
File "/HOME/scz0aai/run/deepmd-kit/lib/python3.10/site-packages/dpdispatcher/utils.py", line 179, in wrapper
return func(*args, **kwargs)
File "/HOME/scz0aai/run/deepmd-kit/lib/python3.10/site-packages/dpdispatcher/slurm.py", line 84, in do_submit
raise RuntimeError(
RuntimeError: status command squeue fails to execute
error message:sbatch: error: Batch job submission failed: Requested node configuration is not available
return code 1
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
File "/HOME/scz0aai/run/deepmd-kit/bin/dpgen", line 8, in <module>
sys.exit(main())
File "/HOME/scz0aai/run/deepmd-kit/lib/python3.10/site-packages/dpgen/main.py", line 233, in main
args.func(args)
File "/HOME/scz0aai/run/deepmd-kit/lib/python3.10/site-packages/dpgen/generator/run.py", line 5109, in gen_run
run_iter(args.PARAM, args.MACHINE)
File "/HOME/scz0aai/run/deepmd-kit/lib/python3.10/site-packages/dpgen/generator/run.py", line 4440, in run_iter
run_train(ii, jdata, mdata)
File "/HOME/scz0aai/run/deepmd-kit/lib/python3.10/site-packages/dpgen/generator/run.py", line 776, in run_train
submission.run_submission()
File "/HOME/scz0aai/run/deepmd-kit/lib/python3.10/site-packages/dpdispatcher/submission.py", line 222, in run_submission
self.handle_unexpected_submission_state()
File "/HOME/scz0aai/run/deepmd-kit/lib/python3.10/site-packages/dpdispatcher/submission.py", line 288, in handle_unexpected_submission_state
raise RuntimeError(
RuntimeError: Meet errors will handle unexpected submission state.
Debug information: remote_root==/HOME/scz0aai/run/maoxin/dpgen_test/tmp2023/rererun/work/1f2a3a2a757b38d4b506119950b64ccf1c5c9d04.
Debug information: submission_hash==1f2a3a2a757b38d4b506119950b64ccf1c5c9d04.
Please check the dirs and scripts in remote_root. The job information mentioned above may help.
The machine.json is set as:
{
  "api_version": "1.0",
  "deepmd_version": "2.0.1",
  "train": [
    {
      "command": "dp",
      "machine": {
        "batch_type": "Slurm",
        "context_type": "local",
        "local_root": "./",
        "remote_root": "/HOME/scz0aai/run/maoxin/dpgen_test/tmp2023/rererun/work"
      },
      "resources": {
        "number_node": 1,
        "_cpu_per_node": 4,
        "gpu_per_node": 1,
        "group_size": 1,
        "queue_name": "gpu",
        "_custom_flags": ["#SBATCH --mem=20G"],
        "source_list": ["/HOME/scz0aai/run/deepmd-kit"],
        "module_list": ["cuda/11.6"]
      }
    }
  ],
  "model_devi": [
    {
      "command": "lmp",
      "machine": {
        "batch_type": "Slurm",
        "context_type": "local",
        "local_root": "./",
        "remote_root": "/HOME/scz0aai/run/maoxin/dpgen_test/tmp2023/rererun/work"
      },
      "resources": {
        "number_node": 1,
        "_cpu_per_node": 4,
        "gpu_per_node": 1,
        "group_size": 10,
        "queue_name": "gpu",
        "_custom_flags": ["#SBATCH --mem=20G"],
        "exlued_list": [],
        "source_list": ["source activate /HOME/scz0aai/run/deepmd-kit; module load cuda/11.6"],
        "module_list": []
      }
    }
  ],
  "fp": [
    {
      "command": "mpirun -np 4 vasp_std",
      "machine": {
        "batch_type": "Slurm",
        "context_type": "local",
        "local_root": "./",
        "remote_root": "/HOME/scz0aai/run/maoxin/dpgen_test/tmp2023/rererun/work"
      },
      "resources": {
        "number_node": 1,
        "cpu_per_node": 4,
        "gpu_per_node": 1,
        "_group_size": 125,
        "source_list": ["module load intel/parallelstudio/2017.1.5; export PATH=/HOME/scz0aai/run/vasp.5.4.4/bin:$PATH"]
      }
    }
  ]
}
How can I resolve this issue? Thanks a lot.