...
The table shows 4 columns: NODELIST, NODES, PARTITION and STATE.
- NODELIST shows the node names. A node can appear more than once, since it can belong to more than one partition.
- NODES indicates the number of machines available.
- PARTITION, which in Slurm is a synonym of "queue", indicates the partition to which the node belongs. A partition name ending with an asterisk marks the default partition, which is used when the job does not specify one.
- STATE indicates whether the node is running no jobs ("idle"), is in drain state ("drain"), or is running some jobs ("allocated").
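As a minimal sketch of how this table can be used in a script, the snippet below filters sinfo-style output for idle nodes. It assumes the four-column layout described above; the helper name `idle_nodes` and the sample rows (node names, partition `debug`) are made up for illustration.

```shell
# List the names of nodes whose STATE column is exactly "idle".
# Assumes the column order NODELIST NODES PARTITION STATE.
idle_nodes() {
    awk 'NR > 1 && $4 == "idle" { print $1 }'
}

# Hypothetical sample; on a real cluster, pipe sinfo output with the
# same column layout into idle_nodes instead.
sample='NODELIST NODES PARTITION STATE
node1 1 debug* allocated
node2 1 debug* idle
node3 1 debug* drain'

printf '%s\n' "$sample" | idle_nodes
```

With the sample rows, only `node2` is printed.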
...
A Slurm batch job can be configured quite extensively, so an overview of the most common sbatch options helps in configuring jobs properly.
For a more complete list of command options, see the Slurm documentation. [30]
#SBATCH --partition=<NAME>
This instruction specifies the queue (partition, in Slurm jargon) on which the batch job must land. The option is case-sensitive.
#SBATCH --job-name=<NAME>
This instruction assigns a “human-friendly” name to the job, making it easier for the user to recognize it among other jobs.
#SBATCH --output=<FILE>
This instruction redirects the job's output (standard output and, by default, standard error) to a file, which is created dynamically at run time.
#SBATCH --nodelist=<NODES>
This instruction forces Slurm to use a specific subset of nodes for the batch job. For example, if we have a cluster of five nodes, node[1-5], and we specify --nodelist=node[1-2], our job will only use these two nodes.
#SBATCH --nodes=<INT>
This instruction tells Slurm to run the job on <INT> nodes belonging to the partition.
N.B. Slurm chooses the <INT> nodes by evaluating the current load, so the choice is not random. If specific nodes must be used, the correct option is the aforementioned --nodelist.
#SBATCH --ntasks=<INT>
This option tells Slurm that the job runs <INT> tasks. By default each task gets one CPU, so this normally corresponds to <INT> CPUs. The tasks are distributed over the allocated nodes to balance the computational burden.
#SBATCH --ntasks-per-node=<INT>
This option is quite different from the former one: here Slurm runs <INT> tasks on each node. If you chose 2 nodes for your computation and write --ntasks-per-node=4, the job uses 4 CPUs on the first node and 4 CPUs on the second one (8 in total, assuming one CPU per task).
#SBATCH --time=<TIME>
This option sets an upper time limit for the job. When this limit is exceeded, the job is automatically terminated.
#SBATCH --mem=<INT>
This option sets the memory required on each compute node allocated to the job (in megabytes unless a unit suffix is given). It must be consistent with the node hardware capabilities in order to avoid failures.
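Putting several of these options together, a submit file might look like the sketch below. The partition name `my_partition`, the job name, the resource values and the `srun hostname` payload are placeholders chosen for illustration, not values taken from a real cluster. The snippet writes the script to `test_slurm.sh` so that it can be inspected before submission.

```shell
# Write a hypothetical submit file; every value below is a placeholder.
cat > test_slurm.sh <<'EOF'
#!/bin/bash
# Queue to submit to (case-sensitive)
#SBATCH --partition=my_partition
# Name shown in the job queue
#SBATCH --job-name=test_slurm
# File receiving the job output
#SBATCH --output=tasks1.txt
# Number of nodes and total number of tasks
#SBATCH --nodes=1
#SBATCH --ntasks=4
# Time limit (hours:minutes:seconds)
#SBATCH --time=00:10:00
# Memory per node, in megabytes
#SBATCH --mem=1000

# Each task prints the name of the node it runs on.
srun hostname
EOF

# Show the directives that sbatch will read from the file.
grep '^#SBATCH' test_slurm.sh
```

Because the directives are ordinary shell comments, the file remains a valid bash script; only sbatch interprets the `#SBATCH` lines.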
...
A user can retrieve a synthetic overview of the Slurm job queue status with the squeue command.
Among the information printed by squeue, the user can find the job ID as well as the running time and status.
...
To execute this script, the command to be issued is
sbatch <executable.sh>
...
Code Block |
---|
-bash-4.2$ sbatch test_slurm.sh
Submitted batch job 23 |
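When a script needs the job ID for later commands, it can be extracted from the confirmation message. The sketch below assumes the message has the form "Submitted batch job" followed by the ID, as in the example above; `parse_job_id` is a hypothetical helper name.

```shell
# Extract the job ID (last field) from sbatch's confirmation message.
parse_job_id() {
    printf '%s\n' "$1" | awk '/^Submitted batch job/ { print $NF }'
}

parse_job_id "Submitted batch job 23"
```

On a cluster this could be used as `jobid=$(parse_job_id "$(sbatch test_slurm.sh)")`. sbatch also offers a `--parsable` option, which prints only the job ID (followed by the cluster name, if any).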
...
To retrieve information about the submitted jobs, use the squeue command:
Code Block |
---|
bash-4.2$ squeue
JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
8501 slurmHPC_ test_sl apascoli R 0:01 1 hpc-200-06-05 |
The table shows 8 columns: JOBID, PARTITION, NAME, USER, ST, TIME, NODES and NODELIST(REASON).
- JOBID shows the ID of each submitted job.
- PARTITION, which in Slurm is a synonym of "queue", indicates the partition to which the job was submitted.
- NAME corresponds to the job name assigned in the submit file; if none was assigned, it matches the name of the submit file.
- USER indicates the user who submitted the job.
- ST indicates whether the job is running ("R") or pending ("PD").
- TIME shows how long the job has been running, in the format days-hours:minutes:seconds.
- NODES indicates the number of machines running the job.
- NODELIST(REASON) indicates where the job is running or the reason it is still pending.
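The column layout described above makes the output easy to query from a script. The following sketch looks up the ST value of a given job; it assumes the default eight-column layout and that no field contains spaces, and `job_state` is a hypothetical helper name.

```shell
# Print the ST column for the given job ID from squeue-style output on stdin.
job_state() {
    awk -v id="$1" 'NR > 1 && $1 == id { print $5 }'
}

# Sample copied from the squeue output shown above.
sample='JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
8501 slurmHPC_ test_sl apascoli R 0:01 1 hpc-200-06-05'

printf '%s\n' "$sample" | job_state 8501
```

For the sample this prints `R`. On a cluster the same helper could be fed directly, for example `squeue | job_state 8501`.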
Then, to see information about a single job use:
Code Block |
---|
bash-4.2$ sstat --format=JobID,AveCPU -j 8501
JobID AveCPU
------------ ----------
8501.0 213503982+ |
The "--format" option customises the output by selecting the fields to display.
The example above shows:
- JobID
- AveCPU: average (system + user) CPU time of all tasks in the job.
Many more fields are listed in the Slurm manual. [31]
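Several fields can be requested at once by separating them with commas. The sketch below adds MaxRSS and NTasks to the two fields used above; the wrapper name `job_stats` is hypothetical, and the fallback branch exists only so that the snippet can be tried on a machine without Slurm.

```shell
# Query a few sstat fields for a job. When sstat is not installed,
# print the command that would be run instead of failing.
job_stats() {
    fields="JobID,AveCPU,MaxRSS,NTasks"
    if command -v sstat >/dev/null 2>&1; then
        sstat --format="$fields" -j "$1"
    else
        echo "sstat --format=$fields -j $1"
    fi
}

job_stats 8501
```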
The job output is written to the file given by the option --output=tasks1.txt and should look something like this (hostnames and formatting may change):
Code Block |
---|
-bash-4.2$ cat tasks1.txt
hpc-200-06-17
hpc-200-06-17
hpc-200-06-17
hpc-200-06-17
hpc-200-06-17
[...] |
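A quick way to check how the tasks were spread over the nodes is to count the lines per hostname. The sketch below assumes one hostname per line, as in the output above; `tasks_per_host` is a hypothetical helper name and the sample repeats only the five visible lines.

```shell
# Count how many output lines each host produced (blank lines are skipped).
tasks_per_host() {
    awk 'NF { count[$1]++ } END { for (h in count) print h, count[h] }'
}

# Sample mirroring the visible part of the output shown above.
sample='hpc-200-06-17
hpc-200-06-17
hpc-200-06-17
hpc-200-06-17
hpc-200-06-17'

printf '%s\n' "$sample" | tasks_per_host
```

For the sample this prints `hpc-200-06-17 5`. On the real file the call would be `tasks_per_host < tasks1.txt`.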
...