An error when the job is submitted to a remote cluster #757
phyoung123 asked this question in Q&A (unanswered)
problem fixed
Hello, I have completed the first two steps (training and exploration) on a remote GPU cluster, and I want to send the fp task to a different remote CPU cluster, but I encountered an error.
Error message:
Traceback (most recent call last):
File "/data/home/scv6293/.conda/envs/deepmd/bin/dpgen", line 8, in <module>
sys.exit(main())
File "/data/home/scv6293/.conda/envs/deepmd/lib/python3.9/site-packages/dpgen/main.py", line 175, in main
args.func(args)
File "/data/home/scv6293/.conda/envs/deepmd/lib/python3.9/site-packages/dpgen/generator/run.py", line 3236, in gen_run
run_iter (args.PARAM, args.MACHINE)
File "/data/home/scv6293/.conda/envs/deepmd/lib/python3.9/site-packages/dpgen/generator/run.py", line 3222, in run_iter
run_fp (ii, jdata, mdata)
File "/data/home/scv6293/.conda/envs/deepmd/lib/python3.9/site-packages/dpgen/generator/run.py", line 2669, in run_fp
run_fp_inner(iter_index, jdata, mdata, forward_files, backward_files, _vasp_check_fin,
File "/data/home/scv6293/.conda/envs/deepmd/lib/python3.9/site-packages/dpgen/generator/run.py", line 2637, in run_fp_inner
submission = make_submission(
File "/data/home/scv6293/.conda/envs/deepmd/lib/python3.9/site-packages/dpgen/dispatcher/Dispatcher.py", line 358, in make_submission
machine = Machine.load_from_dict(abs_mdata_machine)
File "/data/home/scv6293/.conda/envs/deepmd/lib/python3.9/site-packages/dpdispatcher/machine.py", line 128, in load_from_dict
context = BaseContext.load_from_dict(machine_dict)
File "/data/home/scv6293/.conda/envs/deepmd/lib/python3.9/site-packages/dpdispatcher/base_context.py", line 34, in load_from_dict
context = context_class.load_from_dict(context_dict)
File "/data/home/scv6293/.conda/envs/deepmd/lib/python3.9/site-packages/dpdispatcher/ssh_context.py", line 253, in load_from_dict
ssh_context = cls(
File "/data/home/scv6293/.conda/envs/deepmd/lib/python3.9/site-packages/dpdispatcher/ssh_context.py", line 227, in __init__
self.ssh_session = SSHSession(**remote_profile)
File "/data/home/scv6293/.conda/envs/deepmd/lib/python3.9/site-packages/dpdispatcher/ssh_context.py", line 38, in __init__
self._setup_ssh()
File "/data/home/scv6293/.conda/envs/deepmd/lib/python3.9/site-packages/dpdispatcher/ssh_context.py", line 111, in _setup_ssh
self.ssh.connect(hostname=self.hostname, port=self.port,
File "/data/home/scv6293/.conda/envs/deepmd/lib/python3.9/site-packages/paramiko/client.py", line 349, in connect
retry_on_signal(lambda: sock.connect(addr))
File "/data/home/scv6293/.conda/envs/deepmd/lib/python3.9/site-packages/paramiko/util.py", line 279, in retry_on_signal
return function()
File "/data/home/scv6293/.conda/envs/deepmd/lib/python3.9/site-packages/paramiko/client.py", line 349, in <lambda>
retry_on_signal(lambda: sock.connect(addr))
socket.timeout: timed out
And the fp section of my machine.json:
"fp":[
  {
    "machine":{
      "batch_type":"Slurm",
      "context_type":"SSHContext",
      "local_root":"./",
      "remote_profile":{
        "hostname":"ssh.cn-zhongwei-1.paracloud.com",
        "port":22,
        "password":"111111111",
        "username":"scfa0089@NC-E"
      },
      "remote_root":"/public1/home/scfa0089/lzg/xufy/z/run/work"
    },
    "resources":{
      "cpu_per_node":48,
      "_node_cpu":24,
      "number_node":1,
      "gpu_per_node":0,
      "queue_name":"v5_192",
      "_exclude_list":[],
      "_with_mpi":false,
      "group_size":100,
      "_source_list":["/public1/soft/other/module.sh"],
      "module_list":["mpi/intel/19.3.0"],
      "_partition":"large",
      "_comment":"that's all"
    },
    "command":"mpirun -np 48 vasp_gam"
  }
]
}
However, I can log in to this host manually with: ssh scfa0089@NC-E@ssh.cn-zhongwei-1.paracloud.com
How can I solve this problem and dispatch the fp task to a different cluster? Thanks in advance.
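The traceback ends in socket.timeout inside sock.connect, so the failure occurs while opening the TCP connection, before any username or password is sent. One way to narrow it down is to test plain TCP reachability from the same machine that launches dpgen (here the GPU cluster), since that machine may sit behind different network rules than the one used for the manual ssh login. The following is a minimal sketch under that assumption; the helper name can_reach is hypothetical and not part of dpgen, dpdispatcher, or paramiko.

```python
import socket


def can_reach(host: str, port: int = 22, timeout: float = 10.0) -> bool:
    """Return True if a TCP connection to host:port opens within `timeout` seconds."""
    try:
        # create_connection resolves the host name and performs the TCP handshake only;
        # no SSH authentication is attempted.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError as exc:
        # OSError covers timeouts, refused connections, and DNS lookup failures.
        print(f"cannot reach {host}:{port}: {exc!r}")
        return False
```

Calling can_reach("ssh.cn-zhongwei-1.paracloud.com", 22) on the node where dpgen runs and getting False would suggest that the node cannot open outbound connections to that host, which would point to a network or firewall restriction rather than to the machine.json settings.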