
HTCondor is a job scheduler. You give HTCondor a file containing commands that tell it how to run jobs. HTCondor locates a machine within its pool that can run each job, packages up the job, and ships it off to that execute machine. The jobs run, and their output is returned to the machine that submitted them.

For HTCondor to run a job, it must be given details such as the names and location of the executable and all needed input files. These details are specified in the submit description file.

A simple example of an executable is a sleep job that waits for 40 seconds and then exits:

#!/bin/bash
# file name: sleep.sh

TIMETOWAIT="40"
echo "sleeping for $TIMETOWAIT seconds"
/bin/sleep $TIMETOWAIT

To submit this sample job, we specify all the details, such as the name and location of the executable and any needed input files, in a submit description file where each line has the form

name = value

like this:

# Unix submit description file
# sleep.sub -- simple sleep job

executable              = sleep.sh
log                     = sleep.log
output                  = outfile.txt
error                   = errors.txt
should_transfer_files   = Yes
when_to_transfer_output = ON_EXIT
queue
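For convenience, both files above can be generated directly from the shell; a minimal sketch that reproduces them verbatim:

```shell
#!/bin/bash
# Recreate the example job script and submit description file shown above.
cat > sleep.sh <<'EOF'
#!/bin/bash
# file name: sleep.sh

TIMETOWAIT="40"
echo "sleeping for $TIMETOWAIT seconds"
/bin/sleep $TIMETOWAIT
EOF
chmod +x sleep.sh   # the executable must have execute permission

cat > sleep.sub <<'EOF'
# Unix submit description file
# sleep.sub -- simple sleep job

executable              = sleep.sh
log                     = sleep.log
output                  = outfile.txt
error                   = errors.txt
should_transfer_files   = Yes
when_to_transfer_output = ON_EXIT
queue
EOF
```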

Submit local jobs

To submit jobs locally, i.e. from a CNAF user interface (UI), use the condor_submit command. The following options are required:

    • -name sn-02.cr.cnaf.infn.it: to correctly address the job to the submit node;
    • -spool: to transfer the input files and keep a local copy of the output files;
    • the submit description file (a .sub file containing the relevant information for the batch system), passed as an argument.

For example:

-bash-4.2$ condor_submit -name sn-02.cr.cnaf.infn.it -spool sleep.sub
Submitting job(s).
1 job(s) submitted to cluster 6798880.

where 6798880 is the cluster id.
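When wrapping submissions in a script, the cluster id can be parsed out of the condor_submit confirmation message; a minimal sketch, reusing the sample output above instead of a live submission:

```shell
#!/bin/bash
# In a real session this would be captured from the actual command:
#   submit_out=$(condor_submit -name sn-02.cr.cnaf.infn.it -spool sleep.sub)
# Here we reuse the sample output shown above.
submit_out='Submitting job(s).
1 job(s) submitted to cluster 6798880.'

# Pull the numeric cluster id out of the confirmation line.
cluster_id=$(printf '%s\n' "$submit_out" | sed -n 's/.*submitted to cluster \([0-9][0-9]*\)\./\1/p')
echo "cluster id: $cluster_id"
```

The extracted id can then be passed to condor_q, condor_transfer_data, or condor_rm as shown in the rest of this page.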

To see all jobs launched by a user locally on a submit node, use

condor_q -name sn-02.cr.cnaf.infn.it <user>

For example:

-bash-4.2$ condor_q -name sn-02.cr.cnaf.infn.it arendina

-- Schedd: sn-02.cr.cnaf.infn.it : <131.154.192.58:9618?... @ 07/29/20 11:11:57
OWNER BATCH_NAME SUBMITTED DONE RUN IDLE TOTAL JOB_IDS
arendina ID: 6785585 7/28 15:35 _ _ _ 1 6785585.0
arendina ID: 6785603 7/28 15:46 _ _ _ 1 6785603.0
arendina ID: 6798880 7/29 10:19 _ _ _ 1 6798880.0
Total for query: 3 jobs; 3 completed, 0 removed, 0 idle, 0 running, 0 held, 0 suspended
Total for all users: 47266 jobs; 35648 completed, 3 removed, 6946 idle, 4643 running, 26 held, 0 suspended

To get the list of held jobs along with the hold reason, add the option

-held

To see information about a single job, use

condor_q -name sn-02.cr.cnaf.infn.it <cluster id>

To investigate why a job ended up in the 'Held' state:

condor_q -name sn-02.cr.cnaf.infn.it <cluster id> -af HoldReason

and to get more detailed information, use the option

-better-analyze

For example:

-bash-4.2$ condor_q -better-analyze -name sn-02.cr.cnaf.infn.it 6805590

-- Schedd: sn-02.cr.cnaf.infn.it : <131.154.192.58:9618?...
The Requirements expression for job 6805590.000 is

(TARGET.Arch == "X86_64") && (TARGET.OpSys == "LINUX") && (TARGET.Disk >= RequestDisk) &&
(TARGET.Memory >= RequestMemory) && (TARGET.HasFileTransfer)

Job 6805590.000 defines the following attributes:

DiskUsage = 20
ImageSize = 275
MemoryUsage = ((ResidentSetSize + 1023) / 1024)
RequestDisk = DiskUsage
RequestMemory = ifthenelse(MemoryUsage =!= undefined,MemoryUsage,(ImageSize + 1023) / 1024)
ResidentSetSize = 0

The Requirements expression for job 6805590.000 reduces to these conditions:

Slots
Step Matched Condition
----- -------- ---------
[0] 24584 TARGET.Arch == "X86_64"
[1] 24584 TARGET.OpSys == "LINUX"
[3] 24584 TARGET.Disk >= RequestDisk
[5] 24584 TARGET.Memory >= RequestMemory
[7] 24584 TARGET.HasFileTransfer


6805590.000: Job is completed.

Last successful match: Wed Jul 29 16:37:03 2020


6805590.000: Run analysis summary ignoring user priority. Of 829 machines,
0 are rejected by your job's requirements
122 reject your job because of their own requirements
0 match and are already running your jobs
0 match but are serving other users
707 are able to run your job


It is possible to format the output of condor_q with the -af (autoformat) option:

-af <attr> ...   print only the listed job attributes
-af:j            prefix each line with the job id
-af:h            print a heading with the attribute names
-af:th           tab-delimited output with headings, i.e. a readable table

With -spool, the job outputs are not copied back automatically. To retrieve them, the user should launch:

condor_transfer_data -name sn-02.cr.cnaf.infn.it <cluster id>

with the cluster id returned by the condor_submit command at submission time:

-bash-4.2$ condor_transfer_data -name sn-02.cr.cnaf.infn.it 6806037
Fetching data files...
-bash-4.2$ ls
ce_testp308.sub errors.txt outfile.txt sleep.log sleep.sh sleep.sub test.sub

Finally, to remove a job, use the command condor_rm:

-bash-4.2$ condor_rm -name sn-02.cr.cnaf.infn.it 6806037
All jobs in cluster 6806037 have been marked for removal

Also, to fix the submit node you want to use once per session, you can export the _condor_SCHEDD_HOST environment variable:

export _condor_SCHEDD_HOST=sn-02.cr.cnaf.infn.it

and the commands to submit and check jobs become simpler:

-bash-4.2$ export _condor_SCHEDD_HOST=sn-02.cr.cnaf.infn.it
-bash-4.2$ condor_submit -spool sleep.sub
Submitting job(s).
1 job(s) submitted to cluster 8178760.
-bash-4.2$ condor_q 8178760

-- Schedd: sn-02.cr.cnaf.infn.it : <131.154.192.58:9618?... @ 09/02/20 10:25:30
OWNER BATCH_NAME SUBMITTED DONE RUN IDLE TOTAL JOB_IDS
arendina ID: 8178760 9/2 10:25 _ _ 1 1 8178760.0

Total for query: 1 jobs; 0 completed, 0 removed, 1 idle, 0 running, 0 held, 0 suspended
Total for all users: 32798 jobs; 13677 completed, 2 removed, 13991 idle, 5115 running, 13 held, 0 suspended
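The same mechanism works for other HTCondor configuration variables as well: HTCondor tools read environment variables beginning with the `_condor_` (or `_CONDOR_`) prefix as configuration overrides. A minimal sketch:

```shell
#!/bin/bash
# Pin the default schedd so that -name can be omitted from
# condor_submit / condor_q / condor_rm in this shell session.
export _condor_SCHEDD_HOST=sn-02.cr.cnaf.infn.it

# The variable is inherited by every condor_* command started from this shell.
echo "default schedd: ${_condor_SCHEDD_HOST}"
```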

Submit grid jobs

First, create the proxy:

voms-proxy-init --voms <vo name>

then you can submit the job with the following commands:

export _condor_SEC_CLIENT_AUTHENTICATION_METHODS=GSI
condor_submit -pool ce02-htc.cr.cnaf.infn.it:9619 -remote ce02-htc.cr.cnaf.infn.it -spool sleep.sub

For example:

-bash-4.2$ export _condor_SEC_CLIENT_AUTHENTICATION_METHODS=GSI
-bash-4.2$ condor_submit -pool ce02-htc.cr.cnaf.infn.it:9619 -remote ce02-htc.cr.cnaf.infn.it -spool sleep.sub
Submitting job(s).
1 job(s) submitted to cluster 2015349.

where "sleep.sub" is the submit file:

# Unix submit description file
# sleep.sub -- simple sleep job

use_x509userproxy = true
# needed for all the operation where a certificate is required

+owner = undefined

delegate_job_GSI_credentials_lifetime = 0
# this has to be included if the proxy will last more than 24h, otherwise it will be reduced to 24h automatically

executable              = sleep.sh
log                     = sleep.log
output                  = outfile.txt
error                   = errors.txt
should_transfer_files   = Yes
when_to_transfer_output = ON_EXIT
queue

which differs from the sleep.sub file submitted locally by the command

+owner = undefined

that allows the computing element to identify the user through the VOMS proxy.

Note that the submit description file of a grid job is thus slightly different from one that is submitted locally.

To check the status of a single job, use

condor_q -pool ce02-htc.cr.cnaf.infn.it:9619 -name ce02-htc.cr.cnaf.infn.it <cluster id>

So, for the previous example we have:

-bash-4.2$ condor_q -pool ce02-htc.cr.cnaf.infn.it:9619 -name ce02-htc.cr.cnaf.infn.it 2015349

-- Schedd: ce02-htc.cr.cnaf.infn.it : <131.154.192.41:9619?... @ 07/29/20 17:02:21
OWNER BATCH_NAME SUBMITTED DONE RUN IDLE TOTAL JOB_IDS
virgo008 ID: 2015349 7/29 16:59 _ _ 1 1 2015349.0

Total for query: 1 jobs; 0 completed, 0 removed, 1 idle, 0 running, 0 held, 0 suspended
Total for all users: 31430 jobs; 8881 completed, 2 removed, 676 idle, 1705 running, 20166 held, 0 suspended

The user is mapped through the VOMS proxy to the user name virgo008, which appears as the owner of the job. Then, to get the list of all jobs submitted by a user, just replace <cluster id> with <owner>:

-bash-4.2$ condor_q -pool ce02-htc.cr.cnaf.infn.it:9619 -name ce02-htc.cr.cnaf.infn.it virgo008

-- Schedd: ce02-htc.cr.cnaf.infn.it : <131.154.192.41:9619?... @ 07/29/20 17:09:42
OWNER BATCH_NAME SUBMITTED DONE RUN IDLE TOTAL JOB_IDS
virgo008 ID: 2014655 7/29 11:30 _ _ _ 1 2014655.0
virgo008 ID: 2014778 7/29 12:24 _ _ _ 1 2014778.0
virgo008 ID: 2014792 7/29 12:40 _ _ _ 1 2014792.0
virgo008 ID: 2015159 7/29 15:11 _ _ _ 1 2015159.0
virgo008 ID: 2015161 7/29 15:12 _ _ _ 1 2015161.0
virgo008 ID: 2015184 7/29 15:24 _ _ _ 1 2015184.0
virgo008 ID: 2015201 7/29 15:33 _ _ _ 1 2015201.0
virgo008 ID: 2015207 7/29 15:39 _ _ _ 1 2015207.0
virgo008 ID: 2015217 7/29 15:43 _ _ _ 1 2015217.0
virgo008 ID: 2015224 7/29 15:47 _ _ _ 1 2015224.0
virgo008 ID: 2015349 7/29 16:59 _ _ _ 1 2015349.0

Total for query: 11 jobs; 11 completed, 0 removed, 0 idle, 0 running, 0 held, 0 suspended
Total for all users: 31429 jobs; 8898 completed, 3 removed, 591 idle, 1737 running, 20200 held, 0 suspended

As in the local case, to get the job outputs the user should launch:

condor_transfer_data -pool ce02-htc.cr.cnaf.infn.it:9619 -name ce02-htc.cr.cnaf.infn.it <cluster id>

with the cluster id returned by the condor_submit command at submission time:

-bash-4.2$ condor_transfer_data -pool ce02-htc.cr.cnaf.infn.it:9619 -name ce02-htc.cr.cnaf.infn.it 2015217
Fetching data files...
-bash-4.2$ ls
ce_testp308.sub errors.txt outfile.txt sleep.log sleep.sh sleep.sub test.sub

And to remove a job submitted via grid:

-bash-4.2$ condor_rm -pool ce02-htc.cr.cnaf.infn.it:9619 -name ce02-htc.cr.cnaf.infn.it 2015217
All jobs in cluster 2015217 have been marked for removal

Experiment share usage

If a user wants to know the usage of an entire experiment group, in particular the number of jobs submitted by each user of the experiment, the command is:

condor_q -all -name sn-02 -cons 'AcctGroup == "<exp-name>"' -af Owner jobstatus | sort | uniq -c

The output will look like this:

-bash-4.2$ condor_q -all -name sn-02 -cons 'AcctGroup == "pulp-fiction"' -af Owner jobstatus | sort | uniq -c
1 MWallace 1
3 VVega 4
20 Wolf 4
572 JulesW 1
1606 Ringo 4
1 Butch 2
5 Jody 4

The first column shows the number of submitted jobs, the second the user who submitted them, and the third the job status (1=idle, 2=running, 3=removed, 4=completed, 5=held, 6=transferring output, 7=suspended).
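These numeric codes are the standard values of the HTCondor JobStatus attribute; a small helper to decode them in scripts:

```shell
#!/bin/bash
# Translate a numeric HTCondor JobStatus code into its name.
jobstatus_name() {
    case "$1" in
        1) echo "Idle" ;;
        2) echo "Running" ;;
        3) echo "Removed" ;;
        4) echo "Completed" ;;
        5) echo "Held" ;;
        6) echo "Transferring Output" ;;
        7) echo "Suspended" ;;
        *) echo "Unknown" ;;
    esac
}

jobstatus_name 4   # -> Completed
```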
