
Discovery Cluster Partitions

General Purpose Partitions

In the Discovery environment, collections of compute nodes are organized into “partitions”. Always submit jobs to a specific partition using sbatch or srun. Information on Discovery Cluster partitions can be found here. Current usage and node availability can be displayed by running sinfo; the NODELIST column of its output lists the names of the corresponding machines. The names of the public partitions are:

- debug
- express
- short
- long
- gpu
- multigpu
- large
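For example, a quick way to check current availability in a given partition (short is used here purely as an example) is:

sinfo --partition=short

For per-node details (state, CPU count, memory) in the same partition, sinfo also accepts the standard --Node and --long flags:

sinfo --partition=short --Node --long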

A summary of a few commonly used partitions is below:

| Partition Name | Old Name |
| --- | --- |
| debug, express, short, long, large | general, infiniband, ser-par-10g-2, ser-par-10g-3, ser-par-10g-4, ht-10g, largemem-10g, interactive-10g |

Timing, memory, and core limits for commonly used partitions are as follows:

| Partition Name | Requires Approval? | Time Limit (default/max) | Core Limit | RAM Limit | Running Jobs (default/max) | Submitted Jobs (default/max) |
| --- | --- | --- | --- | --- | --- | --- |
| debug | No | 20min/20min | 128 | 256GB | 10/25 | 25/100 |
| express | No | 30min/1h | 2048 | 25TB | 50/250 | 250/1000 |
| short | No | 4h/24h | 1024 | 25TB | 50/500 | 100/1000 |
| long | Yes | 1day/5days | 1024 | 25TB | 25/250 | 50/500 |
| large | Yes | 6h/6h | N/A | N/A | 100/100 | 100/1000 |
| gpu | No | 4h/8h | N/A | N/A | 25/250 | 50/1000 |
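For reference, a minimal batch script that stays within the short partition's limits might look roughly as follows (the job name, time, memory values, and executable are illustrative placeholders, not site-mandated values):

#!/bin/bash
#SBATCH --partition=short
#SBATCH --time=04:00:00              # short allows 4h by default, 24h maximum
#SBATCH --nodes=1
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=16
#SBATCH --mem=64G                    # memory per node
#SBATCH --job-name=example_job
#SBATCH --output=example_job.%j.out
srun ./my_program                    # hypothetical executable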

These partitions contain several different types of machines, which can be targeted with the appropriate --constraint flag (see Using Constraints):

| Partition Name | CPU | Frequency | Core Number | Memory | Available Constraint |
| --- | --- | --- | --- | --- | --- |
| debug, express, short, long, large | Dual Intel Xeon E5-2650 | 2.00GHz | 16 | 128GB | E5-2650@2.00GHz |
| | Dual Intel Xeon E5-2680 v2 | 2.80GHz | 20 | 128GB | E5-2680v2@2.80GHz |
| | Dual Intel Xeon E5-2690 v3 | 2.60GHz | 24 | 128GB | E5-2690v3@2.60GHz |

Constraints of the gpu partition are as follows:

| Partition Name | CPU + GPU | Frequency | Core Number | Memory | Available Constraint |
| --- | --- | --- | --- | --- | --- |
| gpu | Dual Intel Xeon E5-2650 + one K20m NVIDIA GPU (23 nodes) | 2.00GHz | 16 | 128GB | E5-2650@2.00GHz |
| | Dual Intel Xeon E5-2690v3 + one K40m NVIDIA GPU (16 nodes) | 2.60GHz | 24 | 128GB | E5-2690v3@2.60GHz |
| | Dual Intel Xeon E5-2680v4 + 8 K80 NVIDIA GPUs (8 nodes) | 2.40GHz | 28 | 256GB | E5-2680v4@2.40GHz |
| | Dual Intel Xeon E5-2680v4 + 4 P100 NVIDIA GPUs (12 nodes) | 2.40GHz | 28 | 256GB | E5-2680v4@2.40GHz |
| | Intel Gold 6132 + 4 V100-SXM2 NVIDIA GPUs (24 nodes) | 2.60GHz | N/A | N/A | N/A |
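As a sketch, a job targeting the K40m nodes could combine the gpu partition, the corresponding CPU constraint from the table above, and a standard Slurm GPU request (whether typed gres names such as gpu:k40m are defined on Discovery is not confirmed here, so only a GPU count is requested):

#SBATCH --partition=gpu
#SBATCH --constraint=E5-2690v3@2.60GHz   # the K40m nodes, per the table above
#SBATCH --gres=gpu:1                     # request one GPU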

Approval Access Partitions

In order to ensure fair distribution of resources, access to the long, large, and multigpu partitions is restricted to researchers who have demonstrated their need to use these partitions.

To request access to the long partition, open a general Research Computing ServiceNow ticket detailing why you need access to the long partition. IT will be in contact with you and will require that you meet with a member of our staff for a consultation. Be prepared to share your code and an example of previous jobs that you’ve attempted to run. Note that if your code is easily checkpointed, you are not a good candidate for using the long partition.

To request access to the multigpu partition, download and complete all parts of the multigpu access form located here. Attach this form to a general Research Computing ServiceNow ticket. Your request will be reviewed by two faculty members, and you will be notified of your application’s acceptance or rejection through the ServiceNow ticket that you submitted.

To request access to the large partition, download and complete all parts of the large partition access form located here. Attach this form to a general Research Computing ServiceNow ticket. Your request will be reviewed by a faculty member, and you will be notified of your application’s acceptance or rejection through the ServiceNow ticket that you submitted.

Dedicated Access Partitions

Several partitions are owned by ECE faculty and are restricted: they may be used only after obtaining explicit access from the respective owners. Information on these partitions can be found below:

| Partition Name | Machines | Name Range | Old Name | CPUs per Machine | RAM | CPU | Constraint |
| --- | --- | --- | --- | --- | --- | --- | --- |
| ioannidis | 8 | c[3096-3103] | ioannidis1 | 40 | 128GB | Intel Xeon CPU E5-2680v2 2.8GHz | E5-2680v2@2.80GHz |
| | 8 | c[3120-3127] | ioannidis2 | 56 | 500GB | Intel Xeon CPU E5-2680v4 2.4GHz | E5-2680v4@2.40GHz |
| danabrooks | 1 | c[4021] | danabrooks | 96 | 256GB | Intel Xeon CPU E7-4830v3 2.8GHz | E7-4830v3@2.8GHz |

Using Constraints

Because a partition may contain several different CPU configurations, additional arguments must be passed to target specific machines, under either sbatch (i.e., in batch mode) or srun (i.e., in interactive mode).

For example, to submit a job to a Dual Intel Xeon E5-2650 machine in the short partition, invoke sbatch with the following arguments:

#SBATCH --partition=short
#SBATCH --constraint=E5-2650@2.00GHz
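The equivalent interactive request via srun would look roughly as follows (a sketch; the shell path is an assumption):

srun --partition=short --constraint=E5-2650@2.00GHz --pty /bin/bash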

As another example, if you want to use nodes in the old ioannidis1 partition, you should add the following constraints:

#SBATCH --partition=ioannidis
#SBATCH --constraint=E5-2680v2@2.80GHz

Appropriate constraints are listed in the tables above. Note that an alternative way of reaching specific nodes is by name. For example, a job can be submitted to the specific node c3096 (formerly part of ioannidis1), presuming c3096 is idle, via:

#SBATCH --partition=ioannidis
#SBATCH -w c3096

This is useful if you are trying to ensure that your experiments run on the same machine.

Tip: see also the --nodelist option for submitting jobs to a specific set of machines (an example follows below).
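For instance, a sketch of requesting a specific set of nodes by name (the node range below comes from the ioannidis table above and is purely illustrative):

#SBATCH --partition=ioannidis
#SBATCH --nodelist=c[3096-3098]   # long form of -w; accepts comma-separated lists and bracket ranges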

Retrieving a Node's Hardware Configuration

To get the CPU, memory, etc., configuration of all nodes with a single command, type

grep Feature /shared/centos7/etc/slurm/nodes.conf 

from any node, including the gateway.

Alternatively, log in to a node in interactive mode and type:

lscpu

This will show you information about that node specifically.
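If you only need the recorded specification of a single node, the standard Slurm command scontrol can also be used from any node; for example (c3096 is just an example node taken from the tables above):

scontrol show node c3096

This prints, among other fields, the node's CPU count, real memory, and its feature list (the constraints listed above).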