Access NERSC Clusters

There are two clusters: (i) Cori (older) and (ii) Perlmutter (newer). Before establishing an SSH connection, run ./sshproxy.sh to obtain short-lived keys for passwordless login.

# create ssh keys
./sshproxy.sh -u aakram

This command creates the SSH keys nersc, nersc.pub, and nersc-cert.pub in the ~/.ssh directory, valid for 24 hours. Once the keys are generated, one can log in to a given cluster as follows:

(a) - Login to Cori

# Login to Cori
ssh -i ~/.ssh/nersc aakram@cori.nersc.gov

(b) - Login to Perlmutter

# Login to Perlmutter
ssh -i ~/.ssh/nersc aakram@perlmutter-p1.nersc.gov
ssh -i ~/.ssh/nersc aakram@saul-p1.nersc.gov

(c) - Data Transfer to NERSC

Use special data nodes for data transfers to NERSC.

# use data node: dtn01
# upload: local -> NERSC
scp -i ~/.ssh/nersc train_40k.tar.gz aakram@dtn01.nersc.gov:/global/u2/a/aakram/
# download: NERSC -> local (-r is needed when copying a directory)
scp -r -i ~/.ssh/nersc aakram@dtn01.nersc.gov:/global/u2/a/aakram/ .
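
For larger transfers, rsync over the same data transfer node can resume interrupted copies. A minimal sketch, reusing the key and paths from above (the flags are only a suggestion, not part of the original workflow):

# resumable transfer via dtn01
rsync -avP -e "ssh -i ~/.ssh/nersc" \
    train_40k.tar.gz aakram@dtn01.nersc.gov:/global/u2/a/aakram/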

Submit Jobs on Cori/Perlmutter

For interactive runs, first use tmux to create a session, then attach/detach it as needed. When logging in to Cori or Perlmutter, one lands on a random login node, so note that node's name and ssh back to the same node later in order to re-attach to the tmux session (see the sketch below).
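
A minimal tmux workflow might look like the following; the session name train and the node name loginNN are only examples:

# start a named session on the login node and note the node's name
hostname                      # e.g. loginNN, needed to find the session again
tmux new -s train             # "train" is just an example session name

# detach with Ctrl-b d, log out, and later ssh back to the cluster;
# if you land on a different login node, hop to the original one first
ssh loginNN                   # hypothetical node name noted above
tmux attach -t train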

1. Interactive Jobs

There are two ways to allocate resources interactively: (i) salloc, or (ii) srun --pty bash -l. When using srun, we spawn a new bash session via --pty /bin/bash -l.

• CPU Resources

# activate conda env
conda activate exatrkx-cori
export EXATRKX_DATA=$SCRATCH        # resolves to $PSCRATCH on Perlmutter

# allocate cpu resources (cori)
salloc -N 1 -q regular -C haswell -A m3443 -t 04:00:00  # OR
srun -N 1 -q regular -C haswell -A m3443 -t 04:00:00 --pty /bin/bash -l

# allocate cpu resources (perlmutter)
salloc -N 1 -q regular -C cpu -A m3443 -t 04:00:00      # OR
srun -N 1 -q regular -C cpu -A m3443 -t 04:00:00 --pty /bin/bash -l

# run the pipeline
traintrack configs/pipeline_fulltrain.yaml

• GPU Resources

# activate conda env
conda activate exatrkx-cori
export EXATRKX_DATA=$SCRATCH        # resolves to $PSCRATCH on Perlmutter

# allocate gpu resources (cori)
module load cgpu
salloc -C gpu -N 1 -G 1 -c 32 -t 4:00:00 -A m3443  # OR
srun -C gpu -N 1 -G 1 -c 32 -t 4:00:00 -A m3443 --pty /bin/bash -l

# allocate gpu resources (perlmutter)
salloc -C gpu -N 1 -G 1 -c 32 -t 4:00:00 -A m3443_g  # OR
srun -C gpu -N 1 -G 1 -c 32 -t 4:00:00 -A m3443_g --pty /bin/bash -l

# run the pipeline (verify the allocation first; see the check after this list)
traintrack configs/pipeline_fulltrain.yaml

• Exiting

# exit the interactive session
exit

# unload "cgpu" module (cori)
module unload cgpu

# deactivate conda env
conda deactivate
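
Once an interactive GPU allocation is granted, it is worth verifying it before launching the pipeline. A short sketch using standard Slurm environment variables and nvidia-smi (not part of the original scripts):

# confirm the allocation before running traintrack
hostname                 # name of the allocated compute node
echo $SLURM_JOB_ID       # Slurm job ID of the interactive allocation
nvidia-smi               # should list the requested GPU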

2. Non-interactive Jobs

For sbatch jobs, two scripts are available: submit_cori.sh and submit_perlm.sh. For Cori, do the following:

# load environment
conda activate exatrkx-cori
export EXATRKX_DATA=$CSCRATCH

# load gpu settings (cori)
module load cgpu

# submit job
sbatch submit_cori.sh
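
After submission, standard Slurm commands can be used to follow or cancel the job; a short sketch (the job ID 12345678 is a placeholder):

# list your queued/running jobs
squeue -u aakram

# inspect or cancel a specific job
scontrol show job 12345678
scancel 12345678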

Alternatively, just run the submit.jobs script, which sets everything up and submits the job in one go. The submit.jobs script looks like the following:

#!/bin/bash
export SLURM_SUBMIT_DIR=$HOME"/ctd2022"
export SLURM_WORKING_DIR=$HOME"/ctd2022/logs"
mkdir -p $SLURM_WORKING_DIR;

eval "$(conda shell.bash hook)"
conda activate exatrkx-cori
export EXATRKX_DATA=$CSCRATCH
module load cgpu

sbatch $SLURM_SUBMIT_DIR/submit_cori.sh

The same logic applies to the Perlmutter cluster (see the sketch below).
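
A Perlmutter counterpart of submit.jobs is not shown here; the following is a sketch under the assumptions that the directory layout is the same, $PSCRATCH replaces $CSCRATCH, no cgpu module is needed, and submit_perlm.sh is the batch script:

#!/bin/bash
export SLURM_SUBMIT_DIR=$HOME"/ctd2022"
export SLURM_WORKING_DIR=$HOME"/ctd2022/logs"
mkdir -p $SLURM_WORKING_DIR;

eval "$(conda shell.bash hook)"
conda activate exatrkx-cori
export EXATRKX_DATA=$PSCRATCH

sbatch $SLURM_SUBMIT_DIR/submit_perlm.sh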

TrainTrack in Batch Mode

One can run TrainTrack in batch mode.

traintrack --slurm configs/pipeline_fulltrain.yaml

It uses configs/batch_cpu_default and configs/batch_gpu_default for CPU- and GPU-based jobs on the cluster. TrainTrack uses Simple Slurm to set up a batch job. For a successful launch of GPU jobs on Perlmutter, one needs to fix a few settings:

job_name: train_gpu
constraint: gpu
nodes: 1
gpus: 1
time: "4:00:00"
qos: regular
output: logs/%x-%j.out
account: m3443

This is equivalent to the following sbatch script:

#!/bin/bash

# 1 Node, 1 Task, 1 GPU

#SBATCH -A m3443
#SBATCH -J ctd
#SBATCH -C gpu
#SBATCH -q regular                 # special
#SBATCH -t 4:00:00
#SBATCH -n 1
#SBATCH --ntasks-per-node=1
#SBATCH -c 128
#SBATCH --gpus-per-task=1
#SBATCH --signal=SIGUSR1@90

# *** I/O ***
#SBATCH -D .
#SBATCH -o logs/%x-%j.out
#SBATCH -e logs/%x-%j.err
#SBATCH --mail-type=ALL
#SBATCH --mail-user=a.akram@gsi.de

export SLURM_CPU_BIND="cores"
srun traintrack configs/pipeline_fulltrain.yaml