Commit
add multiple transfer option headings
calizilla committed Feb 7, 2025
1 parent fdcb636 commit 574665b
Showing 1 changed file with 82 additions and 12 deletions.
94 changes: 82 additions & 12 deletions notebooks/05_data_transfer.qmd
@@ -7,18 +7,51 @@ output:
from: markdown+emoji
---

The data transfer queue on Gadi is called ‘copyq’. You can easily use this queue to transfer data between Gadi and RDS (or other locations) by first setting up ssh keys for password-less transfers between Gadi and Artemis/RDS.
## Research data store (RDS)

For transfer of large files, the use of ‘resumable’ rsync is recommended. As the USyd RDS servers only allow sftp connections, this method is not possible to run on Gadi’s copyq. Instead, the transfer can be initiated using Artemis ‘dtq’ and using Gadi’s ‘data mover’ node: `gadi-dm.nci.org.au`.
The RDS is NOT being decommissioned along with Artemis HPC. Any RDS projects you currently have will persist on RDS. It is your responsibility to back up any data on Artemis filesystems (`/home`, `/scratch`, `/project`) that you wish to keep prior to the decommission date of August 29, 2025.

#### **Resources**
- [Instructions for SSH key set up](https://medium.com/@prateek.malhotra004/streamlining-secure-remote-access-a-guide-to-passwordless-ssh-connections-between-linux-servers-8c26bb008af9)
In this section, we will focus on how to transfer data between Gadi HPC and RDS.

## **Set up SSH keys**
## Data transfer options

### Globus - COMING SOON

In the coming months, [Globus](https://sydneyuni.atlassian.net/wiki/spaces/RC/pages/3492052996/Globus+Data+Transfer) will be available for simplified and efficient data transfer. We will provide training and materials on this once available. In the meantime, the below options are available, and examples for each method will be provided in the subsequent sections.

### Transfer using sftp from Gadi copyq

The data transfer queue on Gadi is called `copyq`. It is comparable to `dtq`, the data transfer queue on Artemis. Data transfer methods that you used to put data onto Artemis, for example from the web via `wget` or from another server, should be easily portable to Gadi's `copyq`.

Please note that the compute nodes on Gadi **do not have** internet access like the Artemis compute nodes do, so all required data must be downloaded before submitting a compute job that requires the data.

Due to stringent security settings around Artemis and RDS, commands like `rsync` or `scp` CANNOT be initiated from NCI Gadi login nodes or `copyq`. To initiate the transfer from Gadi, `sftp` must be used. You can `rsync` or `scp` from Artemis login nodes or `dtq` to Gadi; however, this option will cease when Artemis is decommissioned.

### Transfer using rsync from Artemis dtq - PRIOR TO DECOMMISSION ONLY

For transfer of large files, the use of resumable `rsync` is recommended (see script below). As the USyd RDS servers only allow sftp connections, this method cannot be run on Gadi’s `copyq`. Instead, the transfer can be initiated using Artemis `dtq` and using Gadi’s data mover node: `gadi-dm.nci.org.au`. After decommission, Globus will provide fast and reliable large data transfers.
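
The idea behind a resumable transfer can be sketched as follows. This is a minimal sketch under stated assumptions, not the script referred to above: the function name `resumable_rsync`, the retry limit, the 30 second pause and all placeholder paths are hypothetical. It relies on two standard `rsync` options: `-P` keeps partially transferred files, and `--append-verify` continues a shorter destination file and then verifies the whole file by checksum.

```bash
#!/bin/bash
# Sketch only: retry an rsync transfer until it completes or a limit is hit.
# The retry limit and pause length are assumptions, adjust to suit.
max_attempts=5

resumable_rsync() {
    local src="$1" dst="$2" attempt=1
    # -r recursive, -t keep modification times, -l keep symlinks,
    # -P keep partial files and show progress,
    # --append-verify resume partial files and verify them by checksum
    until rsync -rtlP --append-verify "${src}" "${dst}"; do
        if (( attempt >= max_attempts )); then
            echo "Giving up after ${attempt} attempts" >&2
            return 1
        fi
        attempt=$(( attempt + 1 ))
        echo "Transfer interrupted, retrying (attempt ${attempt})" >&2
        sleep 30
    done
}

# Example call (placeholders, run from Artemis dtq):
# resumable_rsync "/rds/PRJ-<rds-project>/<file>" \
#     "<nci-user-id>@gadi-dm.nci.org.au:/scratch/<nci-project-code>/"
```

Because the example call is commented out, nothing is transferred until you uncomment it and fill in the placeholders. Note that `--append-verify` assumes any shorter file already at the destination is an earlier partial copy of the same source file.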

### Transfer using RDS mapped network drive and data transfer client

For smaller files or datasets, you can map your RDS project as a network drive and transfer the data to Gadi via an intermediate data transfer client GUI such as `filezilla`. While simple to use, these are not recommended for large data transfers, as the local computer becomes a bottleneck. Faster speeds will be obtained on campus, but still may be prohibitively slow for larger datasets.

### Transfer from ssh connection to RDS

You can also initiate the transfer on the command line by connecting via `ssh` to either `research-data-ext.sydney.edu.au` (off-campus) or `research-data-int.sydney.edu.au` (on campus or USyd VPN), then using typical transfer commands such as `rsync` or `scp`. Since the connection will be terminated if your computer sleeps, the terminal crashes, the network drops out, etc., this method is not particularly robust for large transfers. Using something like `tmux` or `screen` can help; however, note that the login server has finite capacity, and transfer speeds can be seriously impacted during times of high load.


## Set up SSH keys for passwordless data transfer

If you are transferring data directly, for example with `scp` on the command line or via a transfer client on your local computer, entering a password to initiate the transfer is straightforward. If, however, you want to transfer via a job submitted to either `copyq` or `dtq`, you will need to set up SSH keys first.

SSH key pairs are used for secure communication between two systems. The pair consists of a **private** key and a **public** key. The **private** key should remain private and only be known by the user. It is stored securely on the user's computer. The **public** key can be shared with any system the user wants to connect to. It is added to the remote system's authorized keys. When a connection is attempted, the remote system uses the public key to check that the user's system holds the matching private key, so no password needs to be typed.

We will set up SSH keys to allow us to move data between USyd's HPC and RDS and Gadi. **You only need to do this once**.
There are many general guides for this online, for example [this one](https://medium.com/@prateek.malhotra004/streamlining-secure-remote-access-a-guide-to-passwordless-ssh-connections-between-linux-servers-8c26bb008af9). For step-by-step instructions on how to set up keys between Gadi and RDS, expand the drop down below.

::: {.callout-note collapse="true"}
### Click to expand

Follow the below steps carefully to set up SSH keys between RDS and Gadi. Note, **you only need to do this once**.

1. Log into Gadi with your chosen method, e.g:

@@ -110,11 +143,24 @@ sftp <your-unikey>@research-data-ext.sydney.edu.au

This time, you shouldn't be prompted for a password. You can proceed to transfer data between Gadi and USyd's Artemis/RDS system now on the `copyq`.

## **Customise the transfer script**
:::

Whenever you need to copy large files between RDS and Gadi, you should use the script below. This script can be submitted to the `copyq` on Gadi. A copy of it has been provided in your group's `/g/data/<project>/scripts` directory. An example of a script has also been provided here.

Make a copy of this file to your `/scratch` workspace on Gadi and edit it to suit your needs.
## Transfer using sftp from Gadi copyq

The scripts below use `sftp` to transfer data between RDS and Gadi on the Gadi `copyq`.

Copies of these scripts have been placed in `/scratch/qc03/data-transfer-scripts/`.

Make a copy of these scripts to your `/scratch/<nci-project-code>` or `/home/<nci-user-id>` workspace on Gadi and edit to suit your needs.


TBC

- separate below content into individual scripts, e.g. "file_from_rds_to_gadi.pbs", "folder_from_rds_to_gadi.pbs" etc.
- make the demo scripts folder (read and write user and group for folder) and change perms for files so read only
- update below content and put example code for each script below under each relevant subheading
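
As a rough illustration of what a single-file `copyq` script could look like, here is a minimal sketch. It is not one of the scripts in `/scratch/qc03/data-transfer-scripts/`: the job name, resource requests, the `DRY_RUN` switch and every value in angle brackets are assumptions or placeholders to replace. It assumes SSH keys are already set up, since a batch job cannot prompt for a password.

```bash
#!/bin/bash
# Sketch only: copy one file from RDS to Gadi on the copyq.
# Resource requests below are assumptions, adjust to your transfer.
#PBS -P <nci-project-code>
#PBS -N rds_to_gadi
#PBS -q copyq
#PBS -l ncpus=1
#PBS -l mem=4GB
#PBS -l walltime=02:00:00
#PBS -l storage=scratch/<nci-project-code>
#PBS -l wd

# Placeholders: edit all of these before use
remote_user="<your-unikey>"
remote_host="research-data-ext.sydney.edu.au"
remote_path="/rds/PRJ-<rds-project>/<directory>"
remote_file="<file-name>"
dest_path="/scratch/<nci-project-code>/<workspace>"

# Build the command as an array so it can be inspected before it is run
cmd=(sftp "${remote_user}@${remote_host}:${remote_path}/${remote_file}" "${dest_path}")
echo "Transfer command: ${cmd[*]}"

# Hypothetical safety switch: the sketch only prints the command by default.
# Submit with `qsub -v DRY_RUN=0 <script>` to perform the transfer.
if [[ "${DRY_RUN:-1}" == "0" ]]; then
    "${cmd[@]}"
fi
```

Printing the command first makes it easy to check the paths in the job log before committing to a long transfer.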


```bash
cp /g/data/<project>/scripts/transfer.pbs /scratch/<project>/<workspace>
@@ -290,7 +336,7 @@ sftp -r ${remote_user}@${remote_host}:${remote_path}/${remote_dir} ${dest_path}
#sftp ${remote_user}@${remote_host}:${remote_path} <<< $"put -r ${local_dir}"
```

## **Run the transfer script**
**Run the transfer script**

Once you have customised the script, you can submit it to the `copyq` on Gadi. Run the script from the directory where you saved it:

@@ -310,8 +356,32 @@ Once it says R (running), you can confirm it is going to where you want on RDS/A
ls <path>
```

## **Confirm the transfer**
**Confirm the transfer**

To confirm the transfer was successful, you'll need to check your joblogs. These are located in the same directory as your script and are named `transfer.o<jobid>`. Check for **Exit status: 0**. If you see this, the transfer was successful.

However, this doesn't guarantee the integrity of the files. You should check the files themselves to ensure they are intact. You can do this using MD5 checksums. See this [SIH tidbits blogpost](https://sydney-informatics-hub.github.io/tidbits/safely-downloading-your-sequence-data-to-rds.html) about how to use these. You'll need to create MD5 checksums for the original files if they don't already exist and compare them after transfer.
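
The checksum workflow can be sketched as below. This is an illustration under stated assumptions rather than the method from the linked post: it works in a throwaway temporary directory, the file name `sample.fastq` is made up, and a local `cp` stands in for the real `sftp` or `rsync` transfer. It uses `md5sum`, which is available on Linux systems such as Gadi.

```bash
#!/bin/bash
# Sketch only: create MD5 checksums before a transfer and verify them after.
# A temporary directory and `cp` stand in for RDS, Gadi and the real transfer.
workdir="$(mktemp -d)"
mkdir -p "${workdir}/source" "${workdir}/dest"
printf 'example sequence data\n' > "${workdir}/source/sample.fastq"

# 1. Before the transfer: record a checksum next to the original file
( cd "${workdir}/source" && md5sum sample.fastq > checksums.md5 )

# 2. Transfer the data together with the checksum file
cp "${workdir}/source/sample.fastq" "${workdir}/source/checksums.md5" "${workdir}/dest/"

# 3. After the transfer: recompute and compare at the destination
( cd "${workdir}/dest" && md5sum -c checksums.md5 )
```

`md5sum -c` reports each listed file as OK or FAILED and exits with a non-zero status if any file does not match, so it can also be used as a check inside a transfer script.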

## Transfer using rsync from Artemis dtq - PRIOR TO DECOMMISSION ONLY

TBC

See my favourite script at https://sydney-informatics-hub.github.io/training.gadi.intro/05-Data-transfer/index.html

- provide example for single file transfer (as shown at the above link) - also put in the qc03 demo transfer scripts folder
- provide example for recursive folder transfer (with comment about inclusion or exclusion of trailing slash on source path)
- provide example for job array based transfer
- provide example for to gadi and from gadi by reversing host/source/dest

## Transfer using RDS mapped network drive and data transfer client

TBC

also include link to the confluence 'how to map rds network drive' page

## Transfer from ssh connection to RDS

TBC

include ext and int
include use of screen or tmux
