Monitoring queue and jobs in LSF

With the bqueues command you can check the status of the queues. The output is a formatted list of queues like this:

QUEUE_NAME  PRIO  STATUS       MAX  JL/U  JL/P  JL/H  NJOBS  PEND  RUN  SUSP
ops         200   Open:Active  -    -     -     -     43     0     43   0
dteam       200   Open:Active  -    -     -     -     0      0     0    0
cms         100   Open:Active  -    -     -     -     18     18    0    0
…

The fields have the following meaning:

QUEUE_NAME: The name of the queue.
PRIO: The priority of the queue. The larger the value, the higher the priority.
STATUS: The current status of the queue. The possible values are:
- Open. The queue is able to accept jobs.
- Closed. The queue is not able to accept jobs.
- Active. Jobs in the queue may be started.
- Inactive. Jobs in the queue cannot be started for the time being.
MAX: The maximum number of job slots that can be used by the jobs from the queue.
NJOBS: The total number of tasks for jobs in the queue. This includes tasks in pending, running, and suspended jobs.
PEND: The total number of tasks for all pending jobs in the queue.
RUN: The total number of tasks for all running jobs in the queue.
SUSP: The total number of tasks for all suspended jobs in the queue.
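
The short bqueues table described above can also be read by a script. The following is a minimal Python sketch, not an official LSF interface: the function names parse_bqueues and run_bqueues and the added keys accepts_jobs and dispatches_jobs are hypothetical, and the parser assumes the whitespace-separated column layout shown in the example output.

```python
import subprocess

# Sample text copied from the example output above.
SAMPLE = """\
QUEUE_NAME PRIO STATUS MAX JL/U JL/P JL/H NJOBS PEND RUN SUSP
ops 200 Open:Active - - - - 43 0 43 0
dteam 200 Open:Active - - - - 0 0 0 0
cms 100 Open:Active - - - - 18 18 0 0
"""


def parse_bqueues(text):
    """Parse the short bqueues table into a list of dictionaries."""
    lines = [line for line in text.splitlines() if line.strip()]
    header = lines[0].split()
    queues = []
    for line in lines[1:]:
        values = line.split()
        if len(values) != len(header):
            continue  # skip lines that do not match the header layout
        row = dict(zip(header, values))
        # Counters are plain integers in the short format.
        for key in ("PRIO", "NJOBS", "PEND", "RUN", "SUSP"):
            row[key] = int(row[key])
        # STATUS combines two flags, for example "Open:Active".
        accept_flag, _, dispatch_flag = row["STATUS"].partition(":")
        row["accepts_jobs"] = accept_flag == "Open"
        row["dispatches_jobs"] = dispatch_flag == "Active"
        queues.append(row)
    return queues


def run_bqueues():
    """Run bqueues on a host where LSF is installed and parse its output."""
    result = subprocess.run(["bqueues"], capture_output=True, text=True, check=True)
    return parse_bqueues(result.stdout)


if __name__ == "__main__":
    for queue in parse_bqueues(SAMPLE):
        print(queue["QUEUE_NAME"], "pending:", queue["PEND"], "running:", queue["RUN"])
```

Limits shown as "-" (no limit set) are kept as strings, so a caller has to decide how to treat them.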

Example:

bqueues -l argo

-l: displays queue information in a long multiline format.
argo: queue name
In the output, you can find:
- CPULIMIT: maximum CPU time for the job
- RUNLIMIT: maximum wall-clock time for the job
- MEMLIMIT: maximum RAM available to the job
- RUN: number of running jobs
- PEND: number of pending jobs

To monitor job status, use the bjobs command. The output is a formatted list of the jobs matching the filters given on the command line. By default, all jobs owned by the current user are shown. Jobs can be filtered by owner (-u option; -u all removes the owner filter), by status (-d for finished jobs, -r for running jobs), by queue (-q option) and by job ID (simply append it to the end of the command). The output looks like this:

JOBID      USER   STAT QUEUE  FROMHOST EXECHOST     JOBNAME   SUBMITTIME
134243571 opssgm0 RUN  ops   ce04-lcg wn-206-08-2 *365516924 Sep 22 01:23
134317499 opssgm0 RUN  ops   ce01-lcg wn-206-08-2 *595144446 Sep 22 18:19
134363787 pilops0 RUN  ops   ce06-lcg wn-201-07-2 *697026255 Sep 23 00:00
134418246 opssgm0 RUN  ops   ce04-lcg wn-206-03-1 *523626970 Sep 23 06:33
134587141 opssgm0 RUN  ops   ce06-lcg wn-200-01-0 *147508850 Sep 23 17:14

The fields are, in order: job ID, owner, status, queue, submission host, execution host, job name and submission time. With the -l option more information is shown, including the CPU time used by completed jobs and, where applicable, the reason a job failed. Some examples:

bjobs -u all -q cms -d
bjobs -l 15728943
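
The filters described above can be assembled programmatically. The following Python sketch builds a bjobs argument list from the options mentioned in this section (-u, -q, -d, -r, -l and a trailing job ID) and splits one line of the short output into its eight fields. The function names bjobs_command and parse_bjobs_line are hypothetical, and the parser assumes that every column is filled in, as in the sample above; a line with an empty execution host would not be split correctly.

```python
# Column names as they appear in the sample bjobs output above.
BJOBS_FIELDS = ["JOBID", "USER", "STAT", "QUEUE", "FROMHOST",
                "EXECHOST", "JOBNAME", "SUBMITTIME"]

# One line copied from the sample bjobs output above.
SAMPLE_LINE = "134243571 opssgm0 RUN  ops   ce04-lcg wn-206-08-2 *365516924 Sep 22 01:23"


def bjobs_command(user=None, queue=None, state=None, job_id=None, long_format=False):
    """Build the argument list for a bjobs call from optional filters."""
    cmd = ["bjobs"]
    if long_format:
        cmd.append("-l")            # long multiline format
    if user is not None:
        cmd += ["-u", user]         # "all" removes the owner filter
    if queue is not None:
        cmd += ["-q", queue]
    if state == "finished":
        cmd.append("-d")
    elif state == "running":
        cmd.append("-r")
    elif state is not None:
        raise ValueError("state must be 'finished', 'running' or None")
    if job_id is not None:
        cmd.append(str(job_id))     # the job ID goes at the end of the command
    return cmd


def parse_bjobs_line(line):
    """Split one job line into a dict; the submission time keeps its spaces."""
    values = line.split(None, len(BJOBS_FIELDS) - 1)
    if len(values) != len(BJOBS_FIELDS):
        raise ValueError("unexpected bjobs line: " + line)
    return dict(zip(BJOBS_FIELDS, values))


if __name__ == "__main__":
    print(" ".join(bjobs_command(user="all", queue="cms", state="finished")))
    print(parse_bjobs_line(SAMPLE_LINE))
```

The list returned by bjobs_command can be passed directly to subprocess.run on a host where LSF is installed.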

Job status can also be monitored at http://tier1.cnaf.infn.it/monitor/index.html.
Disk status can be checked at http://www.cnaf.infn.it/~vladimir/gpfs.

Monitoring with Grafana

A monitoring service, using Grafana [22], can be found here: https://mon-tier1.cr.cnaf.infn.it.

On this page it's possible to select which type of resource to monitor and then look for the desired experiment.

The same actions are available from the "Home" button, which opens a menu.

NB: If a script fails to collect data, a zero value is recorded; this does not mean that the resource is unavailable.

In the top right corner you can change the time interval displayed and, after selecting one of the "per queue" views, choose which queues to show.
