What is SLURM#
Slurm is an open-source, highly scalable cluster management and job scheduling system for large clusters.
Slurm has three key functions.
- First, it allocates exclusive and/or non-exclusive access to resources (compute nodes) to users for some duration of time so they can perform work.
- Second, it provides a framework for starting, executing, and monitoring work (normally a parallel job) on the set of allocated nodes.
- Finally, it arbitrates contention for resources by managing a queue of pending work.
The cluster currently runs Slurm version 19.
The most important concepts within Slurm are:
- Nodes.
- Partitions.
- Jobs.
- Tasks (a task represents a process within a job).
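As a rough illustration of how these four concepts show up in practice, the following is a minimal sketch of a job script. The job name, node count, task count, and time limit are illustrative assumptions, not site recommendations; `express` is one of the partitions described later on this page.

```shell
#!/bin/bash
# Job: the unit of work submitted to the scheduler (name is illustrative)
#SBATCH --job-name=hello
# Partition: the queue the job waits in
#SBATCH --partition=express
# Nodes: number of compute nodes requested
#SBATCH --nodes=1
# Tasks: number of processes the job may launch
#SBATCH --ntasks=4
# Time limit in hours:minutes:seconds (illustrative value)
#SBATCH --time=00:10:00

# Slurm sets these variables inside a job; the fallback values after ":-"
# only apply if the script is run outside Slurm.
summary="job=${SLURM_JOB_ID:-none} partition=${SLURM_JOB_PARTITION:-none} nodes=${SLURM_JOB_NUM_NODES:-0} tasks=${SLURM_NTASKS:-0}"
echo "$summary"

# Launch one copy of hostname per task, but only where srun is available.
if command -v srun >/dev/null 2>&1; then
    srun hostname
fi
```

Saved as, say, `hello.sh` (a hypothetical file name), this would be submitted with `sbatch hello.sh`.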
Login nodes#
Users interact with Slurm from the login nodes: this is where they launch and monitor their jobs, and where they access their data and the results of their runs.
Info
Remember that the login nodes are shared by all users, so running software on them is forbidden. Use the Slurm scheduler to run your work instead.
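One way a script could protect itself against being run directly on a login node is to check for the `SLURM_JOB_ID` variable, which Slurm sets inside a job. This is only a sketch; the function name `inside_slurm_job` is a hypothetical helper, not part of Slurm.

```shell
#!/bin/bash
# Succeeds when the script runs inside a Slurm job, fails otherwise.
# SLURM_JOB_ID is set by Slurm for processes that belong to a job.
inside_slurm_job() {
    [ -n "${SLURM_JOB_ID:-}" ]
}

if inside_slurm_job; then
    echo "Running inside Slurm job ${SLURM_JOB_ID}"
else
    echo "Not inside a Slurm job: submit this work with sbatch or srun instead"
fi
```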
Partitions#
Partitions can be thought of as job queues, each with its own set of constraints such as job size limit, job time limit, and the users permitted to use it. Priority-ordered jobs are allocated nodes within a partition until the resources (nodes, processors, memory, etc.) within that partition are exhausted.
The following partitions are defined in our cluster; their only restrictions are the time limit and the users who can use them:
Partition | Maximum time | Users |
---|---|---|
express | 3 hours | all users |
batch | 24 hours | all users |
long | 72 hours | authorized users |
fatnodes | -- | upon request |
If no partition is specified, the default partition is `batch`.
If your job needs longer than the maximum time of the partition where you launched it, contact us at support@hpc.iter.es to request an extension of the time limit.
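The time limits in the table can be turned into a small helper that suggests a partition for a given run time. This is a sketch under a few assumptions: `pick_partition` is a hypothetical function, it only accepts whole hours, it ignores `fatnodes` (which has no listed limit), and `job.sh` is a placeholder script name.

```shell
#!/bin/bash
# Print the shortest partition whose time limit covers the requested
# whole number of hours, using the limits listed in the table.
pick_partition() {
    local hours="$1"
    if [ "$hours" -le 3 ]; then
        echo "express"
    elif [ "$hours" -le 24 ]; then
        echo "batch"
    elif [ "$hours" -le 72 ]; then
        # long is restricted to authorized users
        echo "long"
    else
        echo "no listed partition allows ${hours} hours; contact support" >&2
        return 1
    fi
}

hours=10
partition=$(pick_partition "$hours")
# job.sh is a hypothetical job script; the command is printed, not executed
echo "sbatch --partition=${partition} --time=${hours}:00:00 job.sh"
```

For a 10-hour run this prints `sbatch --partition=batch --time=10:00:00 job.sh`.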
Most common commands in SLURM#
The most common Slurm commands are listed below. To get more information about a command and its options, see its manual page (`man <command>`) or its `--help` output.
- `sbatch <script file>`: launch a script
- `squeue`: check the status of the job queues
- `scancel <job_id list>`: cancel a job
- `scontrol show job <job_id>`: get information about a job
- `sinfo`: view the status of the system queues
- `salloc <options>`: start an interactive session (get a node to use)
- `srun <application>`: submit a job to run or start the job steps in real time
- `sacct`: check the accounting of the account itself
- `sstat`: get information about the resources used by a running job
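Several of the commands above are typically used together. The sketch below submits a script and then inspects the resulting job. `submit_and_inspect` and `job.sh` are hypothetical names; the `--parsable` option of `sbatch` is used so that only the job ID is printed.

```shell
#!/bin/bash
# Submit a job script, then show its queue entry and its details.
# The job ID is stored in the global variable last_job_id.
submit_and_inspect() {
    local job_script="$1"
    # --parsable makes sbatch print only the job ID (optionally "id;cluster")
    last_job_id=$(sbatch --parsable "$job_script") || return 1
    # Drop the cluster name if one was appended
    last_job_id=${last_job_id%%;*}
    squeue -j "$last_job_id"
    scontrol show job "$last_job_id"
}

if command -v sbatch >/dev/null 2>&1; then
    # job.sh is a hypothetical job script in the current directory
    submit_and_inspect job.sh
    # To cancel the job:            scancel "$last_job_id"
    # To see accounting afterwards: sacct -j "$last_job_id"
else
    echo "sbatch not found: run this from a login node"
fi
```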
You can find a more detailed summary in our section Useful commands in Slurm or on the Slurm website.