To submit a batch job, the user must write a batch submit file. SLURM accepts batch files that respect the following basic syntax:

#!/bin/bash
#
#SBATCH <OPTIONS>
(...)
srun <INSTRUCTION>
The #SBATCH directive tells SLURM how to configure the environment for the job at hand, while srun is used to actually execute specific commands. It is worth getting acquainted with some of the most useful and commonly used options that can be leveraged in a batch submit file.
In the following, the structure of batch jobs is discussed, along with a few basic examples.
In order to run instructions, a job has to be scheduled on SLURM: the basic srun command allows the user to execute very simple commands, such as one-liners, on compute nodes.
The srun command can also be enriched with several useful options:
srun -N5 /bin/hostname
will run the hostname command on 5 nodes;
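srun options can also be combined. The following sketch assumes a hypothetical partition named debug, which is not necessarily present on the cluster at hand:

```shell
# Run hostname as 8 tasks: 4 tasks on each of 2 nodes
# of the (hypothetical) "debug" partition.
srun -N2 --ntasks-per-node=4 --partition=debug /bin/hostname
```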
If the user needs to specify the run conditions in detail, or to run more complex jobs, a batch submit file must be written.
A SLURM batch job can be configured quite extensively, so shedding some light on the most common sbatch options may help us configure jobs properly.
For a more complete list of command options, see the SLURM documentation. [30]
#SBATCH --partition=<NAME>
#SBATCH --job-name=<NAME>
#SBATCH --output=<FILE>
#SBATCH --nodelist=<NODES>
For example, if a partition contains the nodes node[1-5] and we specify --nodelist=node[1-2], our job will only use these two nodes.
#SBATCH --nodes=<INT>
#SBATCH --ntasks=<INT>
#SBATCH --ntasks-per-node=<INT>
#SBATCH --time=<TIME>
#SBATCH --mem=<INT>
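Putting the options above together, a minimal batch submit file might look as follows; the job name, resource amounts, and time limit are illustrative assumptions:

```shell
#!/bin/bash
#
#SBATCH --partition=<NAME>        # partition to submit to
#SBATCH --job-name=test_job       # illustrative job name
#SBATCH --output=test_job.out     # file collecting the job output
#SBATCH --nodes=2                 # number of nodes
#SBATCH --ntasks=8                # total number of tasks
#SBATCH --ntasks-per-node=4      # tasks per node
#SBATCH --time=00:10:00           # walltime limit (HH:MM:SS)
#SBATCH --mem=2048                # memory per node, in MB

srun /bin/hostname
```

Such a file is then submitted with sbatch <filename>.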
In the following we present some advanced SBATCH options. These allow the user to set up constraints and to use specific computing hardware, such as GPUs.
#SBATCH --constraint=<...>
For example, --constraint=IB (force the use of InfiniBand nodes) or --constraint=<CPUTYPE> (force the use of CPUs of type CPUTYPE).
#SBATCH --gres=<GRES>:<#GRES>
For example, --gres=gpu:<INT>, where <INT> is the number of GPUs we want to use.
#SBATCH --mem-per-cpu=<INT>
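As a sketch of these advanced options, assuming the cluster exposes GPUs as a generic resource named gpu and tags InfiniBand nodes with the IB feature:

```shell
#!/bin/bash
#
#SBATCH --job-name=gpu_test       # illustrative job name
#SBATCH --constraint=IB           # force the use of InfiniBand nodes
#SBATCH --gres=gpu:2              # request 2 GPUs
#SBATCH --mem-per-cpu=4096        # memory per allocated CPU, in MB

srun ./my_gpu_program             # hypothetical GPU executable
```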
In the following, a few usage examples are given in order to practice with these concepts.
A user can retrieve information regarding active job queues with the squeue command, which gives a synthetic overview of the SLURM job queue status.
bash-4.2$ squeue
 JOBID PARTITION     NAME     USER ST   TIME  NODES NODELIST(REASON)
  8501 slurmHPC_  test_sl apascoli  R   0:01      1 hpc-200-06-05
The table shows 8 columns: JOBID, PARTITION, NAME, USER, ST, TIME, NODES and NODELIST(REASON).
Among the information printed with the squeue command, the user can find the job id as well as the running time and status.
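squeue also accepts filters to narrow down the overview; for example, using its standard -u and -t flags:

```shell
# List only the running jobs of a given user
squeue -u <USERNAME> -t RUNNING
```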
In case of a running job, the command:

sstat -j <jobID>

will give detailed information on its status, where jobID is the identifier returned by SLURM once the job has been submitted.
Then, to see specific information about a single job, use:
bash-4.2$ sstat --format=JobID,AveCPU -j 8501
       JobID     AveCPU
------------ ----------
8501.0       213503982+
The --format option allows the user to customise the output based on the desired features. For instance, the example above shows the job ID (JobID) and the average CPU time used by the job (AveCPU).
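More fields can be requested at once; for instance, AveRSS and MaxRSS (standard sstat fields reporting the average and maximum resident memory of the job's tasks) can be added to inspect memory usage:

```shell
sstat --format=JobID,AveCPU,AveRSS,MaxRSS -j <jobID>
```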
Many more features are listed in the SLURM manual. [31]
Once submitted, a job can be killed by the command:
scancel <jobID>
It is recommended to give scancel an array of job IDs to kill multiple jobs at once.
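For example, passing several job IDs (the values here are illustrative) cancels them all in a single invocation:

```shell
scancel 8501 8502 8503
```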
Please DO NOT USE scancel in for loops or shell scripts, as it can result in a degradation of performance.
More features are listed in the SLURM manual. [33]