How to Run Jobs on Slurm#

Launch an Execution#

There are three primary methods for launching jobs in Slurm: interactive sessions (salloc), real-time execution (srun), and script-based job submission (sbatch). Each method requires specifying the resources the job needs.

Interactive Session Execution#

With the salloc command, Slurm allocates a compute node where the user can work interactively and run any software needed. This is a very useful option for testing new software or new data without having to submit jobs to the queue, where they risk failing as soon as they start.

Requesting a single core:

salloc -N 1

Warning!

The default parameters assigned by Slurm are:

#SBATCH --nodes=1
#SBATCH --ntasks=1
#SBATCH --ntasks-per-node=1
#SBATCH --cpus-per-task=1
#SBATCH --mem=2GB

Requesting multiple cores:

salloc -N 1 --cpus-per-task=16

In this command, we are requesting 16 cores on a single node, which amounts to the entire node if the node has 16 cores. This type of request is particularly useful for tasks that require a large amount of memory and CPU resources, allowing the application to use the node's cores without sharing them with other jobs. This configuration maximizes performance for compute- or memory-intensive workloads.
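
Once the allocation is granted, a quick way to confirm what was actually assigned is to check the environment variables Slurm exports inside the session (a minimal sketch; exactly which variables are set can depend on the Slurm version and the options used):

echo $SLURM_JOB_ID           # ID of the allocation
echo $SLURM_CPUS_PER_TASK    # should print 16 for the request above
nproc                        # CPUs visible to the shell (may be limited by cgroups)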

Requesting a node from a particular partition, assigning a job name and a duration:

salloc -N 1 -p <partition> -J <job_name> -t <HH:MM:SS> 

Info

After securing resources, you might see a message like:

salloc: Granted job allocation 968978

And when the job begins:

srun: Job 968978 step creation temporarily disabled, retrying
srun: Step created for job 968978
xxxxxx@node0101-1.hpc.iter.es's password:
Welcome to node0101-1. Deployment version 0.5-hpc. Deployed Wed May 13 19:58:32 WEST 2022.

Note

Once we have a node via salloc, we can access that node via SSH from another terminal, in order to have several sessions open on the same node and work on multiple things at the same time.
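
For example, from a login-node terminal you can look up the node assigned to the allocation and connect to it (a sketch; the node name is whatever Slurm assigned to your job):

squeue -u $USER      # the NODELIST column shows the assigned node, e.g. node0101-1
ssh <node_name>      # open an additional session on that node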

Exiting the Session:

[xxxxxx@node0101-1 ~]$ exit

Once we exit the node from the terminal where we ran salloc, Slurm will release the node and the job will be marked as completed. Naturally, once we exit, all the extra sessions we opened via SSH will be closed automatically.

Running a Job in Real Time#

Use the srun command to submit a job directly to the queue; it runs as soon as resources are available:

srun -p <partition> -J <job_name> -t <days-HH:MM:SS> <application>

This command submits the job directly to the specified partition, with a job name and a time limit; the application starts as soon as the requested resources are granted.

More available options (a combined example follows the table):

Option                        Description
-p <partition>                partition on which the job will run
-N <nodes>                    number of nodes
-n <num_tasks>                number of tasks
--ntasks-per-node=<number>    tasks per node (consider together with -N)
-J <job_name>                 job name
-t <days-HH:MM:SS>            expected run time
-d <type:job_id[:job_id]>     job dependency type and job dependency ID(s) (optional)
-o </path/to/file.out>        file for stdout (standard output stream)
-e </path/to/file.err>        file for stderr (standard error stream)
-D <directory>                working directory for execution
--mail-user=<email>           email for Slurm notifications
--mail-type=<events>          list of events that trigger notifications
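
For instance, combining several of these options into a single invocation (a sketch; the partition, job name, and application are placeholders, and %j in the file names expands to the job ID):

srun -p batch -N 1 -n 4 -J test_run -t 0-01:00:00 -o test_%j.out -e test_%j.err <application>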

Executing a Job in Slurm via a Script#

The sbatch command sends a job to the queue to be executed by one or more nodes, depending on the resources that have been specified.

sbatch [-p <partition>] [-J <job_name>] [-t <days-HH:MM:SS>] my_script.sh

The most basic structure for a script is as follows.

#!/bin/bash

#SBATCH -J <job_name>
#SBATCH -p <partition>
#SBATCH -N <nodes>
#SBATCH --ntasks=<number>
#SBATCH --cpus-per-task=<number>
#SBATCH --constraint=<node_architecture>  # sandy, ilk (icelake)... architecture
#SBATCH -t <days-HH:MM:SS>
#SBATCH -o <file.out>
#SBATCH -D .
#SBATCH --mail-user=<email_account>
#SBATCH --mail-type=BEGIN,END,FAIL,TIME_LIMIT_50,TIME_LIMIT_80,TIME_LIMIT_90
##########################################################


module purge
module load <modules>

srun <application>
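
After saving the script, a typical workflow is to submit it and then monitor or cancel the job (a sketch; my_script.sh and the job ID are placeholders):

sbatch my_script.sh     # prints "Submitted batch job <job_id>"
squeue -u $USER         # check the state of your pending/running jobs
scancel <job_id>        # cancel the job if needed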

Difference Between CPU Parameters#

--cpus-per-task: Specifies the number of CPUs (cores) that will be allocated to each task. This is used when a task requires more than one CPU to run, such as in applications that can utilize multithreading or parallelism at the thread level.

--ntasks: Defines the total number of tasks that will be executed in the job. Each task is a separate instance of the program you are running. This is primarily used for task-level parallelization, such as in programs that use MPI (Message Passing Interface).

--ntasks-per-node: Determines how many tasks will be executed on each node. This is useful when you want to control how tasks are distributed among the assigned nodes. It is often used in conjunction with --ntasks to ensure a specific distribution of tasks per node.
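
Inside a running job, these values can be checked through the environment variables Slurm exports (a sketch; a variable is only set if the corresponding option was specified):

echo $SLURM_NTASKS            # total number of tasks (--ntasks)
echo $SLURM_NTASKS_PER_NODE   # tasks per node (--ntasks-per-node)
echo $SLURM_CPUS_PER_TASK     # CPUs per task (--cpus-per-task)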

Example#

Objective#

You want to run a total of 8 tasks of your application, where each task will use 4 cores. You want to distribute these tasks across 2 nodes.

Slurm Script#

This script configures Slurm to run a program that benefits from both task-level and thread-level parallelization:

#!/bin/bash

#SBATCH -J example_parallelism
#SBATCH -p batch             # Partition batch
#SBATCH -N 2                 # Requesting 2 nodes
#SBATCH --ntasks=8           # Total of 8 tasks to execute
#SBATCH --ntasks-per-node=4  # Distribute 4 tasks per node
#SBATCH --cpus-per-task=4    # Each task will use 4 CPUs (cores)
#SBATCH --time=01:00:00      # Time limit of one hour
#SBATCH -o result_%j.out     # Standard output
#SBATCH -e errors_%j.err     # Standard errors

module load my_module        # Load necessary modules
srun my_application          # Execute the application

Script Explanation#

--ntasks=8: This parameter specifies that the job will consist of 8 independent tasks. In the context of MPI, you might think of this as launching 8 distinct processes.

--ntasks-per-node=4: Indicates that each node assigned to the job will run 4 of these tasks. Since you have requested 2 nodes and want to execute 8 tasks in total, each node will handle 4 tasks.

--cpus-per-task=4: Specifies that each task should utilize 4 cores. This is useful for tasks that can run threads concurrently, taking advantage of thread-level parallelization within each task.
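
For a hybrid MPI/OpenMP application, a common pattern (a sketch; my_hybrid_app is a placeholder, and whether srun inherits --cpus-per-task automatically depends on the Slurm version) is to map the per-task CPU count onto the number of OpenMP threads:

export OMP_NUM_THREADS=$SLURM_CPUS_PER_TASK                 # 4 OpenMP threads per task
srun --cpus-per-task=$SLURM_CPUS_PER_TASK my_hybrid_app     # 8 MPI tasks, 4 threads each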

Expected Result#

With this configuration, Slurm will distribute the job across the 2 requested nodes, placing 4 tasks on each one, and each task will use 4 CPU cores on its node. This allows efficient use of the available hardware, maximizing the performance of a program that benefits from both task-level and thread-level parallelization.
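
To verify that the tasks were distributed as expected, one simple check (a sketch) is to make each task print the node it runs on:

srun hostname | sort | uniq -c    # should report 4 tasks on each of the 2 nodes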