An error when the job is submitted to a remote cluster #757

phyoung123 · 2022-06-15T13:00:02Z

Hello, I have completed the first two steps, training and exploration, on a remote GPU cluster. I want to run the fp tasks on a different remote CPU cluster, but I encountered an error.
Error message:

Traceback (most recent call last):
  File "/data/home/scv6293/.conda/envs/deepmd/bin/dpgen", line 8, in <module>
    sys.exit(main())
  File "/data/home/scv6293/.conda/envs/deepmd/lib/python3.9/site-packages/dpgen/main.py", line 175, in main
    args.func(args)
  File "/data/home/scv6293/.conda/envs/deepmd/lib/python3.9/site-packages/dpgen/generator/run.py", line 3236, in gen_run
    run_iter (args.PARAM, args.MACHINE)
  File "/data/home/scv6293/.conda/envs/deepmd/lib/python3.9/site-packages/dpgen/generator/run.py", line 3222, in run_iter
    run_fp (ii, jdata, mdata)
  File "/data/home/scv6293/.conda/envs/deepmd/lib/python3.9/site-packages/dpgen/generator/run.py", line 2669, in run_fp
    run_fp_inner(iter_index, jdata, mdata, forward_files, backward_files, _vasp_check_fin,
  File "/data/home/scv6293/.conda/envs/deepmd/lib/python3.9/site-packages/dpgen/generator/run.py", line 2637, in run_fp_inner
    submission = make_submission(
  File "/data/home/scv6293/.conda/envs/deepmd/lib/python3.9/site-packages/dpgen/dispatcher/Dispatcher.py", line 358, in make_submission
    machine = Machine.load_from_dict(abs_mdata_machine)
  File "/data/home/scv6293/.conda/envs/deepmd/lib/python3.9/site-packages/dpdispatcher/machine.py", line 128, in load_from_dict
    context = BaseContext.load_from_dict(machine_dict)
  File "/data/home/scv6293/.conda/envs/deepmd/lib/python3.9/site-packages/dpdispatcher/base_context.py", line 34, in load_from_dict
    context = context_class.load_from_dict(context_dict)
  File "/data/home/scv6293/.conda/envs/deepmd/lib/python3.9/site-packages/dpdispatcher/ssh_context.py", line 253, in load_from_dict
    ssh_context = cls(
  File "/data/home/scv6293/.conda/envs/deepmd/lib/python3.9/site-packages/dpdispatcher/ssh_context.py", line 227, in __init__
    self.ssh_session = SSHSession(**remote_profile)
  File "/data/home/scv6293/.conda/envs/deepmd/lib/python3.9/site-packages/dpdispatcher/ssh_context.py", line 38, in __init__
    self._setup_ssh()
  File "/data/home/scv6293/.conda/envs/deepmd/lib/python3.9/site-packages/dpdispatcher/ssh_context.py", line 111, in _setup_ssh
    self.ssh.connect(hostname=self.hostname, port=self.port,
  File "/data/home/scv6293/.conda/envs/deepmd/lib/python3.9/site-packages/paramiko/client.py", line 349, in connect
    retry_on_signal(lambda: sock.connect(addr))
  File "/data/home/scv6293/.conda/envs/deepmd/lib/python3.9/site-packages/paramiko/util.py", line 279, in retry_on_signal
    return function()
  File "/data/home/scv6293/.conda/envs/deepmd/lib/python3.9/site-packages/paramiko/client.py", line 349, in <lambda>
    retry_on_signal(lambda: sock.connect(addr))
socket.timeout: timed out

and the machine.json

"fp":[
    {
        "machine":{
            "batch_type":"Slurm",
            "context_type":"SSHContext",
            "local_root":"./",
            "remote_profile":{
                "hostname":"ssh.cn-zhongwei-1.paracloud.com",
                "port":22,
                "password":"111111111",
                "username":"scfa0089@NC-E"
            },
            "remote_root":"/public1/home/scfa0089/lzg/xufy/z/run/work"
        },
        "resources":{
            "cpu_per_node":48,
            "_node_cpu":24,
            "number_node":1,
            "gpu_per_node":0,
            "queue_name":"v5_192",
            "_exclude_list":[],
            "_with_mpi":false,
            "group_size":100,
            "_source_list":["/public1/soft/other/module.sh"],
            "module_list":["mpi/intel/19.3.0"],
            "_partition":"large",
            "_comment":"that's all"
        },
        "command":"mpirun -np 48 vasp_gam"
    }
]
}

However, I can ssh to this host with: ssh scfa0089@NC-E@ssh.cn-zhongwei-1.paracloud.com
How can I solve this problem and dispatch the tasks to a different cluster? Thanks in advance.

AnguseZhang · 2022-06-15T18:34:51Z

Maintainer

DP-GEN supports dispatching tasks to different clusters, so using several machines is not a problem in itself.
The traceback suggests a problem with the connection between the machine where DP-GEN's main process runs and the compute cluster. How do you authenticate: by password or by key? If you can ssh to the fp cluster without a password after running ssh-copy-id (which copies your public key to the remote cluster), you can delete the password from machine.json and try again.
You could also contact the administrator of paracloud.com.
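
To make "delete the password and try again" concrete, here is a minimal Python sketch that rewrites one stage entry of machine.json so it no longer carries a password. The helper name use_key_auth is hypothetical, and the optional "key_filename" key is an assumption about what dpdispatcher's remote_profile accepts; please check it against the documentation of your installed dpdispatcher version before relying on it.

```python
import copy
import json
from typing import Optional


def use_key_auth(stage_entry: dict, key_filename: Optional[str] = None) -> dict:
    """Return a copy of one machine.json stage entry without a stored password.

    Hypothetical helper for illustration; it is not part of DP-GEN or dpdispatcher.
    """
    entry = copy.deepcopy(stage_entry)  # leave the caller's dict untouched
    profile = entry["machine"]["remote_profile"]
    profile.pop("password", None)  # rely on key-based login instead
    if key_filename is not None:
        # Assumption: remote_profile accepts "key_filename"; verify for your version.
        profile["key_filename"] = key_filename
    return entry


if __name__ == "__main__":
    # Values copied from the machine.json posted above (password is a placeholder).
    fp_entry = {
        "machine": {
            "batch_type": "Slurm",
            "context_type": "SSHContext",
            "local_root": "./",
            "remote_profile": {
                "hostname": "ssh.cn-zhongwei-1.paracloud.com",
                "port": 22,
                "password": "111111111",
                "username": "scfa0089@NC-E",
            },
            "remote_root": "/public1/home/scfa0089/lzg/xufy/z/run/work",
        }
    }
    print(json.dumps(use_key_auth(fp_entry), indent=4))
```

This only helps if a plain ssh from the machine running dpgen to the fp cluster already works without a password prompt.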

1 reply

phyoung123 Jun 16, 2022
Author

Hi~YuZhi, thank you very much, and many thanks for this excellent program. I still have a question about the connection between these two remote servers, one GPU and one CPU. They belong to different service providers. Do both machines have to be connected to the external network? At the moment only the GPU server can reach the external network, which means I can ssh from the GPU server to the CPU server but not the other way around.

I authenticate this connection by password. I also tried ssh-copy-id as you suggested and as described in #438, but a password is still required.

In addition, I tried resubmitting the DP-GEN task from our research group's server, which can reach the external network, so that it connects to the remote GPU server for the training step. There I get a different error (details below).

error message:

Traceback (most recent call last):
  File "/apps/users/caep-lzg/package/anaconda/envs/deepmd/bin/dpgen", line 8, in <module>
    sys.exit(main())
  File "/apps/users/caep-lzg/package/anaconda/envs/deepmd/lib/python3.9/site-packages/dpgen/main.py", line 175, in main
    args.func(args)
  File "/apps/users/caep-lzg/package/anaconda/envs/deepmd/lib/python3.9/site-packages/dpgen/generator/run.py", line 2714, in gen_run
    run_iter (args.PARAM, args.MACHINE)
  File "/apps/users/caep-lzg/package/anaconda/envs/deepmd/lib/python3.9/site-packages/dpgen/generator/run.py", line 2677, in run_iter
    run_train (ii, jdata, mdata)
  File "/apps/users/caep-lzg/package/anaconda/envs/deepmd/lib/python3.9/site-packages/dpgen/generator/run.py", line 580, in run_train
    submission = make_submission(
  File "/apps/users/caep-lzg/package/anaconda/envs/deepmd/lib/python3.9/site-packages/dpgen/dispatcher/Dispatcher.py", line 349, in make_submission
    machine = Machine.load_from_dict(mdata_machine)
  File "/apps/users/caep-lzg/package/anaconda/envs/deepmd/lib/python3.9/site-packages/dpdispatcher/machine.py", line 123, in load_from_dict
    context = BaseContext.load_from_dict(machine_dict)
  File "/apps/users/caep-lzg/package/anaconda/envs/deepmd/lib/python3.9/site-packages/dpdispatcher/base_context.py", line 34, in load_from_dict
    context = context_class.load_from_dict(context_dict)
  File "/apps/users/caep-lzg/package/anaconda/envs/deepmd/lib/python3.9/site-packages/dpdispatcher/ssh_context.py", line 253, in load_from_dict
    ssh_context = cls(
  File "/apps/users/caep-lzg/package/anaconda/envs/deepmd/lib/python3.9/site-packages/dpdispatcher/ssh_context.py", line 227, in __init__
    self.ssh_session = SSHSession(**remote_profile)
  File "/apps/users/caep-lzg/package/anaconda/envs/deepmd/lib/python3.9/site-packages/dpdispatcher/ssh_context.py", line 38, in __init__
    self._setup_ssh()
  File "/apps/users/caep-lzg/package/anaconda/envs/deepmd/lib/python3.9/site-packages/dpdispatcher/ssh_context.py", line 111, in _setup_ssh
    self.ssh.connect(hostname=self.hostname, port=self.port,
  File "/apps/users/caep-lzg/package/anaconda/envs/deepmd/lib/python3.9/site-packages/paramiko/client.py", line 435, in connect
    self._auth(
  File "/apps/users/caep-lzg/package/anaconda/envs/deepmd/lib/python3.9/site-packages/paramiko/client.py", line 766, in _auth
    raise saved_exception
  File "/apps/users/caep-lzg/package/anaconda/envs/deepmd/lib/python3.9/site-packages/paramiko/client.py", line 753, in _auth
    self._transport.auth_password(username, password)
  File "/apps/users/caep-lzg/package/anaconda/envs/deepmd/lib/python3.9/site-packages/paramiko/transport.py", line 1563, in auth_password
    return self.auth_handler.wait_for_response(my_event)
  File "/apps/users/caep-lzg/package/anaconda/envs/deepmd/lib/python3.9/site-packages/paramiko/auth_handler.py", line 244, in wait_for_response
    raise e
paramiko.ssh_exception.AuthenticationException: Authentication failed.
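
The two tracebacks in this thread fail at different stages. socket.timeout is raised inside sock.connect, before any SSH handshake, so the host was not reachable from the machine running dpgen. AuthenticationException is raised inside auth_password, so the TCP connection succeeded and the credentials were rejected. A minimal sketch for checking the first stage with only the Python standard library follows; the function name can_reach and the 10 second timeout are illustrative choices, not anything DP-GEN or dpdispatcher provides.

```python
import socket


def can_reach(hostname: str, port: int = 22, timeout: float = 10.0) -> bool:
    """Return True if a plain TCP connection to hostname:port can be opened.

    This tests network reachability only; it does not attempt an SSH login.
    """
    try:
        with socket.create_connection((hostname, port), timeout=timeout):
            return True
    except OSError:
        # Covers timeouts, refused connections and DNS lookup failures.
        return False


if __name__ == "__main__":
    # Run this on the machine where the dpgen main process runs.
    host = "ssh.cn-zhongwei-1.paracloud.com"  # hostname from the machine.json above
    print(host, "reachable on port 22:", can_reach(host, 22))
```

If this reports False on the machine that runs dpgen but a manual ssh works from another machine, the problem is likely in the network path (firewall, VPN, or outbound access) rather than in machine.json.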

phyoung123 · 2022-06-17T02:25:49Z

Author

Problem fixed.

2 replies

Saberve Feb 6, 2023

Hello, I have the same problem as you. How did you solve this problem? Can you share it? Thank you very much!

KangBaoBBS May 10, 2024

Hello, I have the same problem. I also get a "timed out" error:
  File "/data/home/scv8739/.local/lib/python3.9/site-packages/dpdispatcher/contexts/ssh_context.py", line 144, in _setup_ssh
    sock.connect((self.hostname, self.port))
socket.timeout: timed out
How did you solve this problem? Can you share your solution?
