batch mode
To submit multiple non-interactive jobs simultaneously, it is best to use sbatch. To do so, you first need to create an sbatch script, and then execute it using sbatch. An sbatch script is a file of the following form:
#!/bin/bash
#SBATCH (... sbatch options)
module load (... load modules that you wish to use when running the script)
srun (...run main job in the script)
Lines in this script preceded by #SBATCH set options that will apply every time the script is run. Examples include:
#SBATCH --job-name=my_nice_job (sets the job name)
#SBATCH --exclusive (reserves a machine for exclusive use)
#SBATCH --nodes=5 (reserves 5 machines)
#SBATCH --ntasks-per-node=2 (sets 2 tasks for each machine)
#SBATCH --cpus-per-task=1 (sets 1 core for each task)
#SBATCH -w compute-0-033 (reserves a specific machine)
#SBATCH --mem=100Gb (reserves 100 GB memory)
#SBATCH --partition=my_partition (requests that the job is executed in partition my_partition)
#SBATCH --time=4:00:00 (reserves machines/cores for 4 hours)
#SBATCH --output=my_nice_job.%j.out (sets the standard output to be stored in file my_nice_job.%j.out, where %j is the job id)
#SBATCH --error=my_nice_job.%j.err (sets the standard error to be stored in file my_nice_job.%j.err, where %j is the job id)
#SBATCH --exclude=c0100 (excludes node c0100)
#SBATCH --gres=gpu:1 (reserves 1 GPU per machine)
#SBATCH --constraint="E5-2690v3@2.60GHz" (only consider machines that have the Intel E5-2690v3 chip)
#SBATCH --nodelist=c0[100-200] (only consider reserving the machines c0100 through c0200)
More options can be found by typing man sbatch.
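As a concrete sketch combining several of the options above: the partition ser-par-10g-3, the module python-2.7.5, and the program my_program.py are taken from the examples on this page and should be adapted to your own setup.
#!/bin/bash
#SBATCH --job-name=my_nice_job
#SBATCH --partition=ser-par-10g-3
#SBATCH -n 1
#SBATCH --cpus-per-task=1
#SBATCH --time=4:00:00
#SBATCH --output=my_nice_job.%j.out
#SBATCH --error=my_nice_job.%j.err
module load python-2.7.5
srun python my_program.py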
Finally, the job (or jobs, if there are several to be executed) is listed at the end of the script, preceded by srun. For example, to run a Python program called my_program.py, you would add:
srun python my_program.py
Once you have created your script, you can run it by calling
sbatch myscript
This submits your script to the default partition (or to the partition specified in the script through an #SBATCH -p setting). Any parameter that can be set inside the script can also be set by passing it directly to sbatch. For example:
sbatch -p ser-par-10g-3 -w compute-0-097 -n 1 myscript
executes the script on compute-0-097, reserving only one core, while
sbatch -p ser-par-10g-3 -w compute-0-097 --exclusive myscript
reserves the node for exclusive use (no other job can run at the same time on compute-0-097).
Tip: The examples above illustrate that options specified in an sbatch script can also be passed on the command line when calling sbatch. If you expect that all executions of the script will use the same parameters (e.g., partition, name, number of cores, etc.), write these directly in the script. If you expect them to vary from one execution to the next, leave them to be determined on the command line when you run sbatch. In any case, options passed to sbatch from the command line override options inside the sbatch script.
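For example, if myscript contains the line #SBATCH --time=4:00:00, the submission below (a sketch) overrides it with a two-hour limit for this run only, without editing the script:
sbatch --time=2:00:00 myscript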
To monitor whether your job was submitted successfully, you can run:
squeue -u $USER
where $USER is your user id. This will show information on your job, including its job id, whether it is pending (PD) or running (R), and the machine it is running on. You can terminate your job by typing:
scancel jobid
where jobid is the job's id (as reported by squeue).
To release a job that has been requeued in a held state, run:
scontrol release <jobid>
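Before releasing a job, it often helps to check why it is pending or held. A quick way to do so (a sketch, using the job id 27773 as a placeholder) is:
squeue -u $USER -t PD
scontrol show job 27773 | grep -i reason
The first command lists your pending jobs together with their reason codes, and the second shows the detailed reason recorded for a specific job.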
Tip: On most partitions, jobs submitted have a time limit of 24 hours.
Suppose that you have written a Python program called my_program.py that reads a text file (say myfile) provided on the command line, removes all spaces, and prints the result to the standard output. Normally (e.g., in interactive mode) you would execute this as:
python my_program.py myfile
and the result would be printed right below.
You have 100 files in a directory called input/ and would like all of them to be processed by this program in parallel. To do so, you can create the following script, called, e.g., my_script:
#!/bin/bash
#SBATCH --job-name=my_script
#SBATCH -n 1
#SBATCH --output=my_script.%j.out
#SBATCH --error=my_script.%j.err
module load python-2.7.5
srun python my_program.py $1
The $1
above refers to the first command line argument passed to the script. Then, calling:
sbatch -p partition my_script myfile
will execute my_script over myfile on a machine in partition partition. This will occupy exactly one core (due to the -n 1 option), and any output printed to either the standard output or the standard error will be directed to the files specified in the #SBATCH comments. In particular, if the job id assigned to this submission is 27773, the text without spaces will be stored in my_script.27773.out.
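Once the job finishes, you can inspect the result with, e.g.:
cat my_script.27773.out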
Processing all files in the directory input/ in parallel can be done with a bash for loop as follows:
for file in `ls input`; do sbatch -p partition my_script $file; done
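A slightly more defensive variant of this loop (a sketch) quotes each filename and passes its full path, which assumes my_script, and hence my_program.py, accepts a path rather than a bare filename:
for file in input/*; do sbatch -p partition my_script "$file"; done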
Suppose you want to calculate the function value f(alpha,beta,gamma) for alpha ranging from 0 to 10, beta being either "0.001", "0.004", or "0.007", and gamma ranging from 0 to 10. You can create a bash script that performs all these computations in parallel. The main bash script, called main.bash, looks like this:
#!/bin/bash
for alpha in `seq 0 1 10`
do
    for beta in "0.001" "0.004" "0.007"
    do
        for gamma in `seq 0 1 10`
        do
            work=/gss_gpfs_scratch/username/file/
            cd $work
            sbatch execute.bash $alpha $beta $gamma
        done
    done
done
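Note that this loop submits 11 × 3 × 11 = 363 jobs in total (alpha and gamma each take the 11 values 0, 1, ..., 10, and beta takes 3 values), so keep the per-partition job limits described below in mind.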
main.bash calls execute.bash, which is the following sbatch script:
#!/bin/bash
#set a job name
#SBATCH --job-name=run1
#a file for job output, you can check job progress
#SBATCH --output=run1_%j.out
# a file for errors from the job
#SBATCH --error=run1_%j.err
#time you think you need; the default is one day
#format is hh:mm:ss
#SBATCH --time=24:00:00
#number of tasks you are requesting
#SBATCH -n 1
#partition to use
#SBATCH --partition=ser-par-10g-4
module load python-2.7.5
python main.py $1 $2 $3
According to the Discovery cluster usage policy, the total number of jobs per user at any one time on a public partition (ser-par-10g-4, etc.) is 100; you should not exceed this limit. There is no job limit on private faculty partitions. On public nodes, the execution time limit is 24 hours; on private faculty nodes there is no time limit.
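For example, to count how many jobs you currently have queued or running before submitting more, you can use the sketch below (the -h flag suppresses the squeue header line):
squeue -u $USER -h | wc -l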
An example Python file main.py is:
import sys

def f(a, b, c):
    return a + b + c

if __name__ == "__main__":
    alpha = float(sys.argv[1])
    beta = float(sys.argv[2])
    gamma = float(sys.argv[3])
    print alpha, '+', beta, '+', gamma, '=', f(alpha, beta, gamma)
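For a quick sanity check outside the batch system, you can run it directly (a sketch, assuming a Python 2 interpreter such as the python-2.7.5 module):
python main.py 1 0.004 2
which should print something like 1.0 + 0.004 + 2.0 = 3.004.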
You can submit these batch jobs by running:
./main.bash
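If this fails with a permission error, make the script executable first, and then monitor the submitted jobs as before (a sketch):
chmod +x main.bash
squeue -u $USER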
To learn more about general bash scripts you can have a look at this tutorial. More on sbatch
can be found here.
Back to the Discovery Cluster page
Back to main page.