
Using the Discovery Cluster

Software Modules

Software on the Discovery cluster comes in modules. If you would like to use pre-built code, rather than compile your own software, you will need to use the “module” command. The basic idea is that, by loading the appropriate modules, you will be able to call executables from the command line with minimal effort (e.g., the commands will be in your path, and the necessary libraries will be loaded).

To load a module, please execute:

module load <module name>

For example, to use the python 2.7 module, you can execute:

module load python/2.7.15

and python will be available from the command prompt. To see all modules available, type:

module avail 

while the command:

module list

will list the modules that you have loaded.

Use module whatis <module name> to see information about a specific module, including additional prerequisites.

You can also install programs directly to your home directory on a gateway server (e.g., you can install your own python or spark distribution if you want). These home directories are shared with the compute servers, so any software or data you store there will be immediately available to an interactive job.
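For instance, assuming you have loaded a python module that provides pip, a package can be installed into your home directory with the --user flag (the package name here is only an illustration):

pip install --user requests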

If there are some programs that you use often, it makes sense to pre-load the corresponding modules every time you log in to the cluster. You can make this happen by adding these lines to your .bashrc file; an example can be found here.
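For instance, a minimal sketch of such lines, assuming you regularly use the Python module from the example above:

# in ~/.bashrc: load frequently used modules at every login
module load python/2.7.15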

Running Jobs

You should never run jobs on the login nodes (login.discovery.neu.edu). All jobs you run must be executed on one of Discovery's compute nodes. To use these machines, you need to first reserve them.

Machine reservation and job submission to compute nodes are managed by the Simple Linux Utility for Resource Management (a.k.a. SLURM). In short, jobs you submit to the cluster can (and must) reserve available resources (in particular, CPU cores) from specific compute nodes. These resources are taken from cluster partitions, namely, groups of machines with similar hardware.

Submitting Jobs To SLURM

There are two main ways of submitting jobs to the Discovery cluster: interactive mode and batch mode. Each of these is described in detail on its own page. In interactive mode, you can reserve a compute node, connect to it, and run your jobs interactively by typing the appropriate commands. Batch mode allows you to submit and execute scripts non-interactively, without logging in to machines first. The typical use of interactive mode is to run small experiments and to troubleshoot your code; if you intend to run many jobs, especially in parallel, you should submit batch jobs instead.
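As a quick illustration (the partition name here is an assumption and may differ on your cluster), the two modes look roughly like this:

# interactive: reserve a node and open a shell on it
srun --partition=short --pty /bin/bash

# batch: submit a script for later, non-interactive execution
sbatch myjob.sh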

All jobs you submit should read from and write to a /scratch directory, not your home directory. For that reason, you should create a scratch directory and launch your jobs from there.
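For example, a minimal sketch, assuming a per-user layout under /scratch (the exact path convention on Discovery may differ):

# create a scratch directory and submit jobs from there
mkdir -p /scratch/$USER/myproject
cd /scratch/$USER/myproject
sbatch myjob.sh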

Useful SLURM Commands

  • sinfo: shows the partitions in the cluster and their usage. This is particularly useful for finding compute nodes that are idle and can therefore be used to run your jobs.
  • sbatch: runs a SLURM script in batch mode.
  • scancel: kills running jobs.
  • squeue: shows all jobs running in the cluster.
  • sacct: reports accounting information by individual job and job step.
  • sstat: reports accounting information about currently running jobs and job steps (more detailed than sacct).
  • sreport: reports resource usage by cluster, partition, user, account, etc.
  • srun: launches an interactive session, if you would prefer to work interactively on a node.
  • sview: shows a graphical view of cluster usage, reservations, etc. (requires X11 forwarding).

Tip: Use squeue -u $USER (where $USER is your user id) to see only your own jobs. You can use this to find which compute nodes you currently have reserved.

Tip: To cancel jobs in a certain range, use scancel {jobID1..jobID2}. For example, scancel {1001..1005} will cancel all of your jobs with IDs between 1001 and 1005.

Tip: Useful short names (i.e., aliases) for several of the above SLURM commands are provided in the default .bashrc file.
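The exact aliases provided may differ; they are typically simple shortcuts along the following lines (these particular definitions are only an illustration):

alias sq='squeue -u $USER'
alias sc='scancel'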

Type, e.g., man scancel, man squeue, etc., to see more options and usage guidelines. An overview of SLURM can be found here. A detailed description can also be found on the Discovery cluster website. Many computing centers use SLURM as their resource manager and maintain websites with information on how to use it; though tailored to their own clusters, these can also be useful (see a list in the resources section). A two-page SLURM cheatsheet can be found here.

SLURM FAQ

See also this FAQ for an extended version.

Q: What is the difference between srun and sbatch?

A: The documentation says

srun is used to submit a job for execution in real time

while

sbatch is used to submit a job script for later execution.

They both accept practically the same set of parameters. The main difference is that srun is interactive and blocking (you get the result in your terminal and you cannot write other commands until it is finished), while sbatch is batch processing and non-blocking (results are written to a file and you can submit other commands right away).
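A small illustration of the difference (the --wrap flag lets sbatch submit a single command without a script file):

# blocks until the command finishes; output appears in your terminal
srun hostname

# returns immediately with a job ID; output goes to a file such as slurm-<jobid>.out
sbatch --wrap="hostname"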

If you use srun in the background with the & sign, you remove the 'blocking' feature of srun, which becomes interactive but non-blocking. It is still interactive, though, meaning that the output will clutter your terminal and the srun processes remain linked to your terminal. If you disconnect, you will lose control over them, or they might be killed (depending, basically, on whether they use stdout or not). They will also be killed if the machine to which you connect to submit jobs is rebooted.

If you use sbatch, you submit your job and it is handled by SLURM; you can disconnect, kill your terminal, etc. with no consequence. Your job is no longer linked to a running process. (source)

Q: What are some things that I can do with one that I cannot do with the other, and why?

A: A feature that is available to sbatch and not to srun is job arrays. As srun can be used within an sbatch script, there is nothing that you cannot do with sbatch. (source)
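For example, a minimal job-array sketch (the script contents and range are only an illustration); each array task sees its own index in SLURM_ARRAY_TASK_ID:

#!/bin/bash
#SBATCH --array=1-10
# each task in the array processes its own input file, indexed by SLURM_ARRAY_TASK_ID
srun ./process_input input_${SLURM_ARRAY_TASK_ID}.txt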

Q: How do parameters like -N, --ntasks, -p, etc., differ for srun vs. sbatch?

A: All the parameters such as --ntasks, --nodes, --cpus-per-task, --ntasks-per-node have the same meaning in both commands. That is true for nearly all parameters, with the notable exception of --exclusive. (source)

Q: How do sbatch and srun interact with each other, and what is the "canonical" use case for each?

A: You typically use sbatch to submit a job and srun in the submission script to create job steps, as SLURM calls them. srun is used to launch the processes. If your program is a parallel MPI program, srun takes care of creating all the MPI processes. If not, srun will run your program as many times as specified by the --ntasks option. There are many use cases depending on whether your program is parallel or not, has a long running time or not, is composed of a single executable or not, etc. Unless otherwise specified, srun inherits by default the pertinent options of the sbatch command it runs under. (source)
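A minimal sketch of this pattern (the program names are hypothetical): sbatch reserves the resources, and each srun line inside the script launches a job step on them:

#!/bin/bash
#SBATCH --ntasks=4
# step 1: run a preprocessing step as a single task
srun --ntasks=1 ./preprocess
# step 2: run the main program across all 4 reserved tasks
srun ./solver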

Q: Would I ever use srun by itself?

A: Other than for small tests, no. A common use is srun --pty bash to get a shell on a compute node. (source)
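For instance, requesting an interactive shell with an explicit time limit might look like this (the partition name and time limit are assumptions):

srun --partition=short --time=01:00:00 --pty /bin/bash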

Additional Resources

SLURM main website, including tutorials, documentation, and FAQ. See also: