...

In the following, a few details on the account request and first access are given.

Account Request

If you already have an HPC account, you can skip this paragraph. Otherwise, a few preliminary steps are required.

Download and fill in the access authorization form. If you need help, please contact us at hpc-support@lists.cnaf.infn.it.
If you don't have an INFN association, please attach a scan of a personal identification document (passport or ID card, for example).
Send the form via e-mail to sysop@cnaf.infn.it and to user-support@lists.cnaf.infn.it.
Once the account creation is completed, you will receive a confirmation at your e-mail address. On that occasion, you will also receive the credentials you need to access the cluster.

Access

First of all, it is necessary to SSH into bastion.cnaf.infn.it with your own credentials.
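
For example, with <username> as a placeholder for the username received together with your credentials, the login step looks like this:

Code Block
# connect to the bastion host with your personal credentials
ssh <username>@bastion.cnaf.infn.it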

...

For support, contact us (hpc-support@lists.cnaf.infn.it).

SLURM architecture

The Slurm workload manager relies on the following scheme:

...

Unless requested otherwise at submit time, jobs will be submitted to the _int partition. Users can freely choose which partition to use by properly configuring the batch submit file (see below).
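
For instance, to target one of the partitions reported by sinfo (see the output below), the partition can be set either in the submit file or directly at submit time; job.sh is a placeholder for your own submit file:

Code Block
# inside the batch submit file
#SBATCH --partition=slurmHPC_short

# or, equivalently, on the command line at submit time
sbatch --partition=slurmHPC_short job.sh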

Check the cluster status with SLURM

You can check the cluster status using the sinfo -N command, which prints a summary table on the standard output.

...

Code Block
-bash-4.2$ sinfo -N
NODELIST       NODES      PARTITION STATE
hpc-200-06-05      1  slurmHPC_int* idle
hpc-200-06-05      1 slurmHPC_short idle
hpc-200-06-05      1   slurmHPC_inf idle
hpc-200-06-05      1      slurm_GPU idle
hpc-200-06-06      1 slurmHPC_short idle
hpc-200-06-06      1  slurmHPC_int* idle
hpc-200-06-06      1   slurmHPC_inf idle
[...]

The structure of a basic batch job

To work in batch mode, the user should write a batch submit file. Slurm accepts batch files that respect the following basic syntax:

...

In the following, the structure of batch jobs will be discussed alongside a few basic examples.
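
As a minimal sketch of that structure (names and values below are placeholders, not site defaults), a submit file is a shell script in which #SBATCH comment lines carry the Slurm directives and the actual work is launched with srun:

Code Block
#!/bin/bash
#SBATCH --job-name=<NAME>        # human-friendly job name
#SBATCH --partition=<PARTITION>  # queue the job is submitted to
#SBATCH --output=<FILE>          # file collecting the job output
#SBATCH --time=<TIME>            # wall-time limit

srun <command>                   # the instructions to execute

The file is then handed to the scheduler with the sbatch command.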

Submit basic instructions on Slurm with srun

In order to run instructions, a job has to be scheduled on Slurm: the basic srun command allows the execution of very simple commands, such as one-liners, on the compute nodes.
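
For example, using one of the partitions listed by sinfo:

Code Block
# print the short hostname of the compute node that runs the command
srun --partition=slurmHPC_int hostname -s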

...

If the user needs to specify the run conditions in detail or to run more complex jobs, they must write a batch submit file.

#SBATCH options

A Slurm batch job can be configured quite extensively, so shedding some light on the most common sbatch options may help us configure jobs properly.

  • #SBATCH --partition=<NAME>
    This instruction allows the user to specify on which queue (partition in Slurm jargon) the batch job must land. The option is case-sensitive.

  • #SBATCH --job-name=<NAME>
    This instruction assigns a “human-friendly” name to the job, making it easier for the user to recognize among other jobs.

  • #SBATCH --output=<FILE>
    This instruction redirects any output to a file (dynamically created at run time).

  • #SBATCH --nodelist=<NODES>
    This instruction forces Slurm to use a specific subset of nodes for the batch job. For example, if we have a cluster of five nodes, node[1-5], and we specify --nodelist=node[1-2], our job will only use these two nodes.

  • #SBATCH --nodes=<INT>   
    This instruction tells Slurm to run over <INT> nodes belonging to the partition.
    N.B. Slurm chooses the best <INT> nodes by evaluating the current payload, so the choice is not random. If specific nodes are needed, the correct option is the aforementioned --nodelist.

  • #SBATCH --ntasks=<INT>
    This command tells Slurm to use <INT> CPUs to perform the job. The CPU load gets distributed to optimize the efficiency and the computational burden on the nodes.

  • #SBATCH --ntasks-per-node=<INT>
    This command is quite different from the former one: in this case, Slurm enforces the use of <INT> CPUs per node. Suppose you chose 2 nodes for your computation; by writing --ntasks-per-node=4, you will force the job to use 4 CPUs on the first node as well as 4 CPUs on the second one.

  • #SBATCH --time=<TIME>
    This command sets an upper limit on the job's running time. When this limit is exceeded, the job is automatically terminated.

  • #SBATCH --mem=<INT>
    This option sets an upper limit on the memory the job can use on each compute node. It must be consistent with the node hardware capabilities in order to avoid failures.
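
Putting a few of these options together, a job header could look like the sketch below (all names and values are purely illustrative; complete submit files are shown in the Examples section):

Code Block
#!/bin/bash
#SBATCH --partition=slurmHPC_short  # queue the job lands on
#SBATCH --job-name=my_test          # human-friendly name
#SBATCH --output=my_test.txt        # output file created at run time
#SBATCH --nodes=2                   # let Slurm pick 2 nodes of the partition
#SBATCH --ntasks-per-node=4         # 4 CPUs on each of the 2 nodes
#SBATCH --time=10:00                # wall-time limit
#SBATCH --mem=1000                  # per-node memory limit (MB)

srun my_program                     # placeholder for the actual executable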

Advanced batch job configuration

In the following, we present some advanced SBATCH options. These will allow the user to set up constraints and use specific hardware peripherals, such as GPUs.
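
As an illustrative sketch only (the GRES and feature names below are assumptions, since the actual values depend on the cluster configuration), a GPU job typically combines the slurm_GPU partition with a generic-resource request:

Code Block
#SBATCH --partition=slurm_GPU   # GPU partition listed by sinfo
#SBATCH --gres=gpu:1            # request one GPU (GRES name is an assumption)
#SBATCH --constraint=<FEATURE>  # optional constraint on node features (placeholder)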

...

In the following, a few usage examples are given in order to practice with these concepts.

Retrieve job information

A user can retrieve information about the active job queues with the squeue command, which gives a concise overview of the Slurm job queue status.
Among the information printed by squeue, the user can find the job ID as well as the running time and the status of the job.
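
For example, to list only your own jobs:

Code Block
# restrict the squeue output to the jobs of the current user
squeue -u $USER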

...

where jobid is the ID assigned by Slurm when you submit the job.

Examples

Below, some examples of submit files are provided to help the user get comfortable with Slurm.
See them as a controlled playground to test some of the features of Slurm.

Simple batch submit

Code Block
#!/bin/bash
#
#SBATCH --job-name=tasks1              # human-friendly job name
#SBATCH --output=tasks1.txt            # collect the output in this file
#SBATCH --nodelist=hpc-200-06-[17-18]  # restrict the job to these two nodes
#SBATCH --ntasks-per-node=8            # request 8 tasks on each node
#SBATCH --time=5:00                    # wall-time limit
#SBATCH --mem-per-cpu=100              # memory per CPU in MB

srun hostname -s                       # each task prints the short hostname

...

As we can see, the execution involved only 8 CPUs and the payload was organized to minimize the burden on the nodes.

Simple MPI submit

Code Block
#!/bin/bash
#
#SBATCH --job-name=test_mpi_picalc     # human-friendly job name
#SBATCH --output=res_picalc.txt        # collect the output in this file
#SBATCH --nodelist=...                 # or use --nodes=...
#SBATCH --ntasks=8                     # total number of MPI tasks
#SBATCH --time=5:00                    # wall-time limit
#SBATCH --mem-per-cpu=1000             # memory per CPU in MB

srun picalc.mpi                        # launch the MPI executable

...

Here the --bcast option copies the executable to every node at the specified destination path. In this case, we decided to copy the executable into the home folder, keeping the original name.
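
As a sketch of what such an invocation could look like (the exact command used on the cluster may differ; --bcast is a standard srun option), assuming the executable is picalc.mpi as in the submit file above:

Code Block
# copy picalc.mpi to the home folder of every allocated node, then run it
srun --bcast=$HOME/picalc.mpi picalc.mpi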

Additional information

A complete overview of the hardware specs per node can be found at http://wiki.infn.it/strutture/cnaf/clusterhpc/home.

...