GPUs at TeideHPC#
GPU is an acronym for Graphics Processing Unit. The GPU is the heart of a graphics card, just as the CPU is the heart of a PC, and in a sense its brain as well: it carries out the complex calculations that allow some programs to run much faster than they would on a CPU.
Among the main uses of GPUs are the following:
- Video editing
- 3D graphics rendering
- Machine learning
- Scientific applications
- etc.
The TeideHPC cluster offers two different GPU models for use in your jobs. We recommend taking a look at the cluster description to get an overview of the system.
GPU models available#
These are the GPUs currently available at TeideHPC:
| GPU model | # of nodes | # of GPUs/node | Slurm type specifier | CPU cores/node | CPU memory/node | Compute Capability (*) | GPU memory/GPU |
|---|---|---|---|---|---|---|---|
| Nvidia A100 | 16 | 4 | a100 | 64 | 256 GB | 80 | 40 GB |
| Nvidia A100 | 1 | 8 | a100 | 64 | 512 GB | 80 | 40 GB |
| Nvidia Tesla T4 | 4 | 1 | t4 | 32 | 256 GB | 75 | 16 GB |
(*) Compute Capability is a technical term created by NVIDIA as a compact way to describe what hardware functions are available on some models of GPU and not on others. It is not a measure of performance and is relevant only if you are compiling your own GPU programs. See the page on CUDA programming for more.
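If you do compile your own CUDA code, the compute capability translates directly into an `nvcc` architecture flag. A minimal sketch (source and output file names are placeholders):

```bash
# Target the A100 (compute capability 80 -> sm_80)
nvcc -arch=sm_80 -o my_app my_app.cu

# Target the Tesla T4 (compute capability 75 -> sm_75)
nvcc -arch=sm_75 -o my_app my_app.cu
```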
What is MIG? (NVIDIA Multi-Instance GPU)#
Multi-Instance GPU (MIG) is a technology from NVIDIA that allows a GPU to be divided into up to seven fully isolated instances, each with its own high-bandwidth memory, cache, and processing cores.
What is the motivation to use MIG?#
Without MIG, different jobs running on the same GPU, like different AI inference requests, compete for the same resources. A job consuming larger memory bandwidth deprives others of it, causing multiple jobs to miss their latency targets.
With MIG, jobs run concurrently on different instances, each with dedicated resources for compute, memory, and memory bandwidth usage, resulting in predictable performance with QoS and maximum GPU utilization.
In short:
- A100 NVIDIA GPUs are currently among the most powerful GPUs money can buy.
- Their memory ranges from 40 GB to 80 GB per card.
- Not many applications can take advantage of the full power of these cards.
- Unfortunately, once Slurm allocates a GPU to a job, no other job can make use of that GPU: jobs are GPU-exclusive. There are other approaches such as NVIDIA MPS, but Slurm can only use one GPU with MPS per node.
- MIG allows us to physically partition the GPUs so that more than one job can make use of a single GPU (see the sketch after this list).
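As an illustrative sketch of the difference, a job script can request either a full A100 or only a MIG slice of one. The gres names below match the node listings later in this page; the job name and task count are placeholders:

```bash
#!/bin/bash
#SBATCH --job-name=gpu-test   # placeholder name
#SBATCH --ntasks=1
#SBATCH --gres=gpu:a100:1     # one full A100: the whole card is yours

# Alternatively, request a single MIG slice so the rest of the
# physical GPU stays available to other jobs:
##SBATCH --gres=gpu:1g.5gb:1

nvidia-smi   # shows the GPU (or MIG instance) allocated to the job
```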
GPU partitions for an Nvidia A100 (40GB)#
A GPU can be partitioned into MIG instances of different sizes. For example, on a 40 GB NVIDIA A100, an administrator could create two instances with 20 GB of memory each, three instances with 10 GB, seven instances with 5 GB, or a combination of sizes.
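For reference, this is roughly how an administrator would create such a layout with `nvidia-smi` (a sketch only; it requires root on the node and is not something users run):

```bash
# Enable MIG mode on GPU 0 (may require a GPU reset)
nvidia-smi -i 0 -mig 1

# Create GPU instances by profile name, plus their default compute
# instances (-C). Valid profiles can be listed with: nvidia-smi mig -lgip
nvidia-smi mig -i 0 -cgi 3g.20gb,2g.10gb,1g.5gb,1g.5gb -C
```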
Not every partitioning is possible: NVIDIA defines a fixed set of valid profiles for each card (for the 40 GB A100: 1g.5gb, 2g.10gb, 3g.20gb, 4g.20gb and 7g.40gb). As an example of a mixed layout, one of the A100 cards in node17102-1 below is split into one 3g.20gb instance, one 2g.10gb instance and two 1g.5gb instances.
To get the list of features and resources of each node and each MIG partition, you can use the `sinfo` command.
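A `sinfo` format string along these lines produces the columns shown below (a sketch; the exact flags may differ):

```bash
# %N = NODELIST, %c = CPUS, %m = MEMORY (MB), %f = AVAIL_FEATURES, %G = GRES
sinfo -o "%N %c %m %f %G"
```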
```
NODELIST CPUS MEMORY AVAIL_FEATURES GRES
node18109-1 64 257214 ilk,gpu,a100 gpu:a100:8
node2204-[3-4] 20 31906 ivy (null)
node17109-1,node17110-1,node18110-1,node 64 257214 ilk,viz,t4 gpu:t4:1
node0303-2,node0304-[1-4],node1301-[1-4] 16 30000+ sandy (null)
node17101-1,node17103-1,node17104-1,node 64 257214 ilk,gpu,a100 gpu:a100:4(S:0-1)
node17102-1 64 257214 ilk,gpu,a100,3g.20gb,2g.10gb,1g.5gb gpu:3g.20gb:1(S:0),gpu:2g.10gb
```
- For a specific node with MIG partitions:
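The listing below is `scontrol` output; for example:

```bash
scontrol show node node17102-1
```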
```
NodeName=node17102-1 Arch=x86_64 CoresPerSocket=32
CPUAlloc=8 CPUEfctv=64 CPUTot=64 CPULoad=0.00
AvailableFeatures=ilk,gpu,a100,3g.20gb,2g.10gb,1g.5gb
ActiveFeatures=ilk,gpu,a100,3g.20gb,2g.10gb,1g.5gb
Gres=gpu:3g.20gb:1(S:0),gpu:2g.10gb:1(S:0),gpu:a100:3(S:0-1),gpu:1g.5gb:2(S:0)
NodeAddr=node17102-1 NodeHostName=node17102-1 Version=22.05.8
OS=Linux 4.18.0-372.9.1.el8.x86_64 #1 SMP Tue May 10 14:48:47 UTC 2022
RealMemory=257214 AllocMem=0 FreeMem=142408 Sockets=2 Boards=1
State=MIXED ThreadsPerCore=1 TmpDisk=0 Weight=1 Owner=N/A MCS_label=N/A
Partitions=main,batch,express,long
BootTime=2023-07-24T08:38:22 SlurmdStartTime=2023-07-24T09:20:41
LastBusyTime=2023-08-03T09:07:43
CfgTRES=cpu=64,mem=257214M,billing=64,gres/gpu=7,gres/gpu:1g.5gb=2,gres/gpu:2g.10gb=1,gres/gpu:3g.20gb=1,gres/gpu:a100=3
AllocTRES=cpu=8,gres/gpu=2,gres/gpu:a100=2
CapWatts=n/a
CurrentWatts=0 AveWatts=0
ExtSensorsJoules=n/s ExtSensorsWatts=0 ExtSensorsTemp=n/s
```
- For a node without MIG partitions:
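Likewise, for example:

```bash
scontrol show node node17101-1
```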
```
NodeName=node17101-1 Arch=x86_64 CoresPerSocket=32
CPUAlloc=8 CPUEfctv=64 CPUTot=64 CPULoad=0.00
AvailableFeatures=ilk,gpu,a100
ActiveFeatures=ilk,gpu,a100
Gres=gpu:a100:4(S:0-1)
NodeAddr=node17101-1 NodeHostName=node17101-1 Version=22.05.8
OS=Linux 4.18.0-372.9.1.el8.x86_64 #1 SMP Tue May 10 14:48:47 UTC 2022
RealMemory=257214 AllocMem=0 FreeMem=119559 Sockets=2 Boards=1
State=MIXED ThreadsPerCore=1 TmpDisk=0 Weight=1 Owner=N/A MCS_label=N/A
Partitions=main,batch,express,long
BootTime=2023-07-17T13:29:12 SlurmdStartTime=2023-07-21T12:42:08
LastBusyTime=2023-08-03T11:06:24
CfgTRES=cpu=64,mem=257214M,billing=64,gres/gpu=4,gres/gpu:a100=4
AllocTRES=cpu=8,gres/gpu=4,gres/gpu:a100=4
CapWatts=n/a
CurrentWatts=0 AveWatts=0
ExtSensorsJoules=n/s ExtSensorsWatts=0 ExtSensorsTemp=n/s
```
MIG Device Names#
By default, a MIG device consists of a single “GPU Instance” (GI) and a single “Compute Instance” (CI). The table below highlights the naming convention used to refer to a MIG device: by its GPU Instance's compute slice count and its total memory in GB (rather than just its memory slice count).
When only a single CI is created (one that consumes the entire compute capacity of the GI), the CI size is implied in the device name. The table below shows the profile names on the A100-SXM4-40GB; these are the device names when using a single CI.
| Memory | 20gb | 10gb | 5gb |
|---|---|---|---|
| GPU Instance | 3g | 2g | 1g |
| Compute Instance | 3c | 2c | 1c |
| MIG Device | 3g.20gb | 2g.10gb | 1g.5gb |
| GPCs | GPC GPC GPC | GPC GPC | GPC |
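On a GPU node, these device names appear in the output of `nvidia-smi`; for instance, from within a job that has a GPU or MIG instance allocated:

```bash
# List the GPUs (and any MIG devices) visible to the job
nvidia-smi -L
```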
In the next section you can see how to get a GPU node to run your jobs with Slurm. We also recommend visiting the request GPU and compute page.