How to request GPU and CPU resources#
To request GPU or CPU nodes you must first understand how the cluster is configured. For this reason, we recommend starting with the cluster description on the home page.

- Basically, there are two architectures: icelake (GPU nodes) and sandy (CPU nodes).
- Resources within the cluster are requested based on these architectures.

There are mainly two ways to specify the resources to use when submitting a job: constraints and GRES.
Slurm Constraints at TeideHPC#
Nodes can have features assigned to them by the TeideHPC administrators. Users can specify which of these features are required by their job using the --constraint option. Only nodes having features matching the job constraints will be used to satisfy the request.
For a more detailed explanation of what constraints are, see the official Slurm documentation.
To find out the list of constraints at TeideHPC, look at the AVAIL_FEATURES column in the output of the following command.
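One possible invocation (the exact format string is an assumption; adjust the column widths as needed):

```bash
# %N = node list, %c = CPUs, %m = memory, %f = available features, %G = GRES
sinfo -o "%42N %5c %7m %36f %30G"
```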
```
NODELIST                                  CPUS  MEMORY  AVAIL_FEATURES                       GRES
node18109-1                               64    257214  ilk,gpu,a100                         gpu:a100:8
node2204-[3-4]                            20    31906   ivy                                  (null)
node17109-1,node17110-1,node18110-1,node  64    257214  ilk,viz,t4                           gpu:t4:1
node0303-2,node0304-[1-4],node1301-[1-4]  16    30000+  sandy                                (null)
node17102-1                               64    257214  ilk,gpu,a100,3g.20gb,2g.10gb,1g.5gb  gpu:3g.20gb:1(S:0),gpu:2g.10gb
node17101-1,node17103-1,node17104-1,node  64    257214  ilk,gpu,a100                         gpu:a100:4(S:0-1)
```
| Constraint | Type | Definition |
|---|---|---|
| ilk | node architecture | Selects an icelake-type node (these have GPUs) |
| sandy | node architecture | Selects a compute node with sandy-bridge architecture |
| ivy | node architecture | Selects a compute node with ivy-bridge architecture |
| viz | display node | Selects a display-specialized GPU node (nodes with NVIDIA T4) |
| gpu | GPU node | Selects a GPU node (nodes with NVIDIA A100) |
| a100 | GPU model | Directly selects a GPU node with NVIDIA A100 |
| t4 | GPU model | Directly selects a GPU node with NVIDIA T4 |
Constraint usage examples#
The following interactive-session commands have batch equivalents: you only need to add the directive `#SBATCH --constraint=<constraint>` to your scripts. A sketch of the interactive commands follows the list.
- For a compute node with sandy architecture (16 cores | 32/64 GB).
- For a compute node with icelake architecture (64 cores | 256 GB).
- For a GPU node (icelake) with one GPU (NVIDIA A100).
- For a display-specialized GPU node (NVIDIA Tesla T4).
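A minimal sketch of the four requests above, assuming interactive sessions are launched with `srun --pty bash` (partition and account options omitted; note that a constraint only selects the node type, while the GPU itself is requested with `--gres`, covered in the next section):

```bash
# Compute node with sandy architecture (16 cores | 32/64 GB)
srun --constraint=sandy --pty bash

# Compute node with icelake architecture (64 cores | 256 GB)
srun --constraint=ilk --pty bash

# GPU node (icelake) with one NVIDIA A100
srun --constraint=a100 --gres=gpu:a100:1 --pty bash

# Display-specialized GPU node (NVIDIA Tesla T4)
srun --constraint=viz --gres=gpu:t4:1 --pty bash
```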
Generic Resource (GRES)#
In Slurm, GRES stands for Generic RESource: a feature that allows you to specify and manage various types of generic resources, such as GPUs (Graphics Processing Units), within a computing cluster.
Slurm's GRES functionality enables efficient allocation, scheduling, and tracking of these resources for jobs submitted to the cluster. It helps ensure that the requested resources are available and properly utilized by the jobs that require them.
To use GRES effectively, you need to understand how your cluster is configured and how the available GRES types and quantities are defined, and then specify your GRES requirements when submitting jobs.
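As a minimal sketch of the batch form (the job name, task count, and the `nvidia-smi` check are illustrative, not TeideHPC requirements):

```bash
#!/bin/bash
#SBATCH --job-name=gpu-test    # illustrative job name
#SBATCH --ntasks=1
#SBATCH --constraint=a100      # target the A100 nodes
#SBATCH --gres=gpu:a100:1      # request one full A100

# Print the GPU(s) actually allocated to this job
nvidia-smi
```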
To find out which GRES are defined at TeideHPC, look at the GRES column in the output of the following command.
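This is the same (assumed) `sinfo` invocation used in the constraints section:

```bash
sinfo -o "%42N %5c %7m %36f %30G"
```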
```
NODELIST                                  CPUS  MEMORY  AVAIL_FEATURES                       GRES
node18109-1                               64    257214  ilk,gpu,a100                         gpu:a100:8
node2204-[3-4]                            20    31906   ivy                                  (null)
node17109-1,node17110-1,node18110-1,node  64    257214  ilk,viz,t4                           gpu:t4:1
node0303-2,node0304-[1-4],node1301-[1-4]  16    30000+  sandy                                (null)
node17102-1                               64    257214  ilk,gpu,a100,3g.20gb,2g.10gb,1g.5gb  gpu:3g.20gb:1(S:0),gpu:2g.10gb
node17101-1,node17103-1,node17104-1,node  64    257214  ilk,gpu,a100                         gpu:a100:4(S:0-1)
```
Alternatively, the same information can be viewed by listing the nodes and filtering the output:
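One way to do this (the grep filter is an assumption; `scontrol show nodes` prints many more fields per node):

```bash
scontrol show nodes | grep -E "NodeName|CfgTRES"
```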
```
...
NodeName=node1315-4 Arch=x86_64 CoresPerSocket=8
NodeName=node2204-3 Arch=x86_64 CoresPerSocket=10
...
NodeName=node17101-1 Arch=x86_64 CoresPerSocket=32
CfgTRES=cpu=64,mem=257214M,billing=64,gres/gpu=4,gres/gpu:a100=4
NodeName=node17102-1 Arch=x86_64 CoresPerSocket=32
CfgTRES=cpu=64,mem=257214M,billing=64,gres/gpu=7,gres/gpu:1g.5gb=2,gres/gpu:2g.10gb=1,gres/gpu:3g.20gb=1,gres/gpu:a100=3
NodeName=node17103-1 Arch=x86_64 CoresPerSocket=32
CfgTRES=cpu=64,mem=257214M,billing=64,gres/gpu=4,gres/gpu:a100=4
NodeName=node17104-1 Arch=x86_64 CoresPerSocket=32
CfgTRES=cpu=64,mem=257214M,billing=64,gres/gpu=4,gres/gpu:a100=4
...
```
As you can see, each node in the cluster may have a different definition. For example, this node has 4 NVIDIA A100 GPUs:
```
NodeName=node17104-1 Arch=x86_64 CoresPerSocket=32
CfgTRES=cpu=64,mem=257214M,billing=64,gres/gpu=4,gres/gpu:a100=4
```
and this node has 3 full NVIDIA A100 GPUs plus 1 A100 that has been partitioned:
```
NodeName=node17102-1 Arch=x86_64 CoresPerSocket=32
CfgTRES=cpu=64,mem=257214M,billing=64,gres/gpu=7,gres/gpu:1g.5gb=2,gres/gpu:2g.10gb=1,gres/gpu:3g.20gb=1,gres/gpu:a100=3
```
To understand the meaning of `gres/gpu:1g.5gb=2,gres/gpu:2g.10gb=1,gres/gpu:3g.20gb=1,gres/gpu:a100=3`, you need to know what MIG (NVIDIA Multi-Instance GPU) is.
GRES usage examples#
- For a GPU node (icelake) with one GPU (NVIDIA A100).
- For a MIG partition of a GPU (NVIDIA A100).
- For a display-specialized GPU (NVIDIA Tesla T4).
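A minimal sketch of the three requests above, again assuming interactive sessions are launched with `srun --pty bash`; in batch scripts the equivalent directive is `#SBATCH --gres=<gres>`:

```bash
# GPU node (icelake) with one full NVIDIA A100
srun --gres=gpu:a100:1 --pty bash

# One MIG partition of an A100 (profiles seen above: 1g.5gb, 2g.10gb, 3g.20gb)
srun --gres=gpu:1g.5gb:1 --pty bash

# Display-specialized GPU (NVIDIA Tesla T4)
srun --gres=gpu:t4:1 --pty bash
```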