Using Discovery
Software on the Discovery cluster comes in modules. If you would like to use pre-built code, rather than compile your own software, you will need to use the "module" command. The basic idea is that, by loading the appropriate modules, you will be able to call executables from the command line with minimal effort (e.g., the commands will be in your path, and libraries will be loaded).
To load a module, please execute:
module load <module name>
For example, to use the python 2.7 module, you can execute:
module load python/2.7.15
and python will be available from the command prompt. To see all modules available, type:
module avail
while the command:
module list
will list the modules that you have loaded.
Use module whatis <module name>
to see information about a specific module, including additional prerequisites.
You can also install programs directly to your home directory on a gateway server (e.g., you can install your own python or spark distribution if you want). These home directories are shared with the compute servers, and any software or data you have stored there will be immediately available to an interactive job.
If there are some programs that you use often, it makes sense to pre-load the corresponding modules every time you log in to the cluster. You can make this happen by adding these lines to your .bashrc file; an example can be found here.
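As a sketch, the lines added to .bashrc might look like the following; the module names and versions here are assumptions, so check module avail for what your cluster actually provides:

```shell
# Appended to ~/.bashrc: preload frequently used modules at every login.
# Module names/versions below are hypothetical examples; run `module avail`
# to see what is installed on your cluster.
module load python/2.7.15
module load gcc/7.3.0
```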
You should never run jobs on the login nodes (login.discovery.neu.edu). All jobs you run must be executed on one of Discovery's compute nodes. To use these machines, you need to first reserve them.
Machine reservation and job submission to compute nodes is managed by the Simple Linux Utility for Resource Management (a.k.a. SLURM). In short, jobs you submit to the cluster can (and must) reserve available resources, in particular CPU cores, from specific compute nodes. These resources are taken from cluster partitions, namely, groups of machines with similar hardware.
There are two main ways to submit jobs to the Discovery cluster:
- in interactive mode, and
- in batch mode.
Each of these is described in detail in the links above. In interactive mode, you can reserve a compute node, connect to it, and run your jobs interactively by typing the appropriate commands. Batch mode allows you to submit and execute scripts non-interactively, without logging in to machines first. The typical use of interactive mode is to run small experiments and to troubleshoot your code; if you intend to run many jobs, especially in parallel, you should be submitting batch jobs instead.
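For batch mode, a minimal submission script might look like the sketch below; the job name, partition, and time limit are assumptions, so check sinfo for the partitions actually available to you:

```shell
#!/bin/bash
# Minimal SLURM batch script sketch. The #SBATCH lines are directives read by
# sbatch; to the shell they are ordinary comments, so this also runs locally.
#SBATCH --job-name=demo          # hypothetical job name, shown in squeue
#SBATCH --partition=short        # hypothetical partition name; check `sinfo`
#SBATCH --nodes=1                # one machine
#SBATCH --ntasks=1               # one task (process)
#SBATCH --time=00:05:00          # wall-clock limit (HH:MM:SS)

MSG="Running on $(hostname)"
echo "$MSG"
```

You would submit this with sbatch, and the output is written to a file (by default named after the job ID) rather than to your terminal.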
All jobs you submit should read from and write to a /scratch directory, not your home directory. For that reason, you should create a scratch directory and launch your jobs from it.
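A minimal sketch of this setup follows; "myjob" is a hypothetical directory name, and the mktemp fallback exists only so the sketch also runs on machines without a /scratch filesystem:

```shell
# Sketch: create a per-user scratch directory and launch jobs from there.
# "myjob" is a hypothetical job directory name.
SCRATCH_DIR="/scratch/${USER:-$(id -un)}/myjob"
# Fall back to a temporary directory when /scratch does not exist (off-cluster).
mkdir -p "$SCRATCH_DIR" 2>/dev/null || SCRATCH_DIR="$(mktemp -d)"
cd "$SCRATCH_DIR"
echo "launching jobs from: $SCRATCH_DIR"
```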
- sinfo: shows the partitions in the cluster and their usage. This is particularly useful for finding compute nodes that are idle and can therefore be used to run your jobs.
- sbatch: runs a SLURM script in batch mode.
- scancel: kills running jobs.
- squeue: shows all jobs running in the cluster.
- sacct: reports accounting information by individual job and job step.
- sstat: reports accounting information about currently running jobs and job steps (more detailed than sacct).
- sreport: reports resource usage by cluster, partition, user, account, etc.
- srun: launches an interactive session on a compute node, if you would prefer to work interactively.
- sview: shows a graphical view of cluster usage, reservations, etc. (requires X11 forwarding).
Tip: use squeue -u $USER, where $USER is your user id, to see only your own jobs. You can use this to find which compute nodes you have currently reserved.
Tip: to cancel jobs in a certain range, use scancel {jobID1..jobID2}. For example, scancel {1001..1005} will cancel all your jobs with IDs between 1001 and 1005.
Tip: useful short names (i.e., aliases) for several of the above SLURM commands are provided in the default .bashrc file.
Type, e.g., man scancel, man squeue, etc., to see more options and usage guidelines. An overview of SLURM can be found here. A detailed description can also be found on the Discovery cluster website. Many computing centers use SLURM as a manager and also maintain websites with information on how to use it; though tailored to their own clusters, these can also be useful (see a list in the resources section). A two-page SLURM cheatsheet can be found here.
See also this FAQ for an extended version.
Q: What is the difference between srun and sbatch?
A: The documentation says srun is used to submit a job for execution in real time, while sbatch is used to submit a job script for later execution. They both accept practically the same set of parameters. The main difference is that srun is interactive and blocking (you get the result in your terminal and you cannot write other commands until it is finished), while sbatch is batch processing and non-blocking (results are written to a file and you can submit other commands right away).
If you use srun in the background with the & sign, you remove the "blocking" feature of srun, which becomes interactive but non-blocking. It is still interactive, though, meaning that the output will clutter your terminal, and the srun processes remain linked to your terminal. If you disconnect, you will lose control over them, or they might be killed (depending on whether they use stdout or not, basically). And they will be killed if the machine from which you submit jobs is rebooted.
If you use sbatch, you submit your job and it is handled by SLURM; you can disconnect, kill your terminal, etc., with no consequence. Your job is no longer linked to a running process. (source)
Q: Is there anything one command can do that the other cannot?
A: A feature that is available to sbatch and not to srun is job arrays. As srun can be used within an sbatch script, there is nothing that you cannot do with sbatch. (source)
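As a sketch of the job-array feature mentioned above (the array range and the echoed "work" are placeholders, not a real workload):

```shell
#!/bin/bash
# Job-array sketch: sbatch launches one independent task per index in --array.
#SBATCH --array=0-3              # four tasks, indices 0..3
#SBATCH --ntasks=1               # each array task runs one process

# SLURM sets SLURM_ARRAY_TASK_ID for each task; default to 0 so this
# sketch also runs off-cluster, where the variable is unset.
CHUNK="${SLURM_ARRAY_TASK_ID:-0}"
echo "processing chunk $CHUNK"
```

Each array task sees a different SLURM_ARRAY_TASK_ID, which is typically used to select a different input file or parameter setting.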
Q: Do the parameters mean the same thing in both commands?
A: All the parameters such as --ntasks, --nodes, --cpus-per-task, --ntasks-per-node have the same meaning in both commands. That is true for nearly all parameters, with the notable exception of --exclusive. (source)
Q: How are sbatch and srun typically used together?
A: You typically use sbatch to submit a job and srun in the submission script to create job steps, as SLURM calls them. srun is used to launch the processes. If your program is a parallel MPI program, srun takes care of creating all the MPI processes. If not, srun will run your program as many times as specified by the --ntasks option. There are many use cases depending on whether your program is parallel or not, has a long running time or not, is composed of a single executable or not, etc. Unless otherwise specified, srun inherits by default the pertinent options of the sbatch job under which it runs. (source)
Q: Should I use srun to run jobs directly?
A: Other than for small tests, no. A common use is srun --pty bash to get a shell on a compute node. (source)
See also: the SLURM main website, including tutorials, documentation, and FAQ.
Back to main page.