HTCondor jobs

HTCondor is a job scheduler. You give HTCondor a file containing commands that instruct it on how to manage jobs. Within the pool of machines, HTCondor locates the best machine that can run each job, packages up the job and ships it off to this execute machine. The jobs run, and output is returned to the machine that submitted the jobs.

Submission to the cluster with environment modules

To use HTCondor we implemented a solution based on environment modules. The traditional interaction methods, i.e. specifying all command line options, remain valid, yet less handy and more verbose. The HTC modules will set all environment variables needed to correctly submit to the HTCondor cluster.
Once logged into any Tier 1 user interface, this utility will be available. You can list all the available modules using:

Showing available modules

ashtimmermanus@ui-tier1 ~$ module avail
---------------------------- /opt/exp_software/opssw/modules/modulefiles ----------------------------
htc/auth  htc/ce  htc/local  use.own  

Key:
modulepath  default-version

These htc/* modules have different roles.

All modules in the htc family provide on-line help via the "module help <module name>" command, e.g.:

Display module`s help information

budda@ui-tier1:~
 $ module help htc
-------------------------------------------------------------------
Module Specific Help for /opt/exp_software/opssw/modules/modulefiles/htc/local:

Defines environment variables and aliases to ease the interaction with the INFN-T1 HTCondor local job submission system
-------------------------------------------------------------------

Local Submission

To submit local jobs or query the local schedd, sn01-htc that is the HTCondor cluster access point, use the htc/local module. This is the default module loaded when loading the "htc" family. So, for using this HTC module you can switch to the htc module and using command condor_q for showing the local jobs queue:

Swithing module

ashtimmermanus@ui-tier1 ~$ module switch htc # default is htc/local
ashtimmermanus@ui-tier1 ~$ condor_q


-- Schedd: sn01-htc.cr.cnaf.infn.it : <131.154.192.242:9618?... @ 06/21/24 14:41:41
OWNER BATCH_NAME      SUBMITTED   DONE   RUN    IDLE   HOLD  TOTAL JOB_IDS

Total for query: 0 jobs; 0 completed, 0 removed, 0 idle, 0 running, 0 held, 0 suspended 
Total for ashtimmermanus: 0 jobs; 0 completed, 0 removed, 0 idle, 0 running, 0 held, 0 suspended 
Total for all users: 47942 jobs; 12542 completed, 0 removed, 25401 idle, 9999 running, 0 held, 0 suspended

A simple example of an executable is the following sleep job that waits for the specified amount of time and then exits:

Example file to submitting a job

apascolinit1@ui-tier1 ~
$ cat sleep.sh
#!/bin/env bash
sleep $1

To submit this sample job, we need to specify all the details such as the names and location of the executable and all needed input files creating a submit description file where each line has the form:

Submission and control of job status

apascolinit1@ui-tier1 ~
$ cat submit.sub
# submit description file
# submit.sub -- simple sleep job

batch_name              = Local-Sleep
executable              = sleep.sh
arguments               = 3600
log                     = $(batch_name).log.$(Process)
output                  = $(batch_name).out.$(Process)
error                   = $(batch_name).err.$(Process)
should_transfer_files   = Yes
when_to_transfer_output = ON_EXIT

queue

apascolinit1@ui-tier1 ~
$ module switch htc

apascolinit1@ui-tier1 ~
$ condor_submit submit.sub
Submitting job(s).
1 job(s) submitted to cluster 15.

apascolinit1@ui-tier1 ~
$ condor_q


-- Schedd: sn01-htc.cr.cnaf.infn.it : <131.154.192.242:9618?... @ 03/18/24 17:15:44
OWNER        BATCH_NAME     SUBMITTED   DONE   RUN    IDLE  TOTAL JOB_IDS
apascolinit1 Local-Sleep   3/18 17:15      _      1      _      1 15.0

Total for query: 1 jobs; 0 completed, 0 removed, 0 idle, 1 running, 0 held, 0 suspended
Total for apascolinit1: 1 jobs; 0 completed, 0 removed, 0 idle, 1 running, 0 held, 0 suspended
Total for all users: 1 jobs; 0 completed, 0 removed, 0 idle, 1 running, 0 held, 0 suspended

GRID submission with environment modules

The htc/ce module eases the usage of the condor_q and condor_submit commands setting up all the needed variables to contact our Grid compute entry points.

Switching module for GRID submission

ashtimmermanus@ui-tier1 ~$ module switch htc/ce
Don't forget to "export BEARER_TOKEN=$(oidc-token <client-name>)"!

Switching from htc/local{ver=23} to htc/ce{auth=SCITOKENS:num=2}
  Loading requirement: htc/auth{auth=SCITOKENS}

The module accepts two parameters that can be specified on the module switch command line, as shown below (bold=default value).

parameter	values	description
num	1,2,3,4,5,6	connects to ce{num}-htc
auth	VOMS,SCITOKENS	calls the htc/auth module with the selected authentication method

For explicitly using Token authentication on ce03-htc use the following module switch command:

Switching module with using token authentication

ashtimmermanus@ui-tier1 ~$ module switch htc/ce auth=SCITOKENS num=3
Don't forget to "export BEARER_TOKEN=$(oidc-token <client-name>)"!

Switching from htc/ce{auth=SCITOKENS:num=2} to htc/ce{auth=SCITOKENS:num=3}
  Unloading useless requirement: htc/auth{auth=SCITOKENS}
  Loading requirement: htc/auth{auth=SCITOKENS}

The following command starts a usable OIDC agent and makes it available for the current shell. If oidc-agent-service has already started an agent for you, this agent will be reused and made available.

Make oidc-agent available in the current terminal

ashtimmermanus@ui-tier1 ~$ eval `oidc-agent-service use`
36535

To see the list of locally configured OIDC clients, use the following command oidc-add -l:

List existing configuration

ashtimmermanus@ui-tier1 ~$ oidc-add -l
The following account configurations are usable: 
aksieniia_token
testhtc

If you haven't registered an OIDC client yet, follow the next steps:

To register a new oidc client use the following command (it is needed to do just the first time to create a new one):

Registered a new client using the oidc-gen

ashtimmermanus@ui-tier1 ~$ oidc-gen -w device
Enter short name for the account to configure: htc_23
[1] https://iam-t1-computing.cloud.cnaf.infn.it/
[2] https://iam-test.indigo-datacloud.eu/
[3] https://iam.deep-hybrid-datacloud.eu/
[4] https://iam.extreme-datacloud.eu/
[5] https://iam-demo.cloud.cnaf.infn.it/
[6] https://b2access.eudat.eu:8443/oauth2
[7] https://b2access-integration.fz-juelich.de/oauth2
[8] https://login-dev.helmholtz.de/oauth2
[9] https://login.helmholtz.de/oauth2
[10] https://services.humanbrainproject.eu/oidc/
[11] https://accounts.google.com
[12] https://aai-dev.egi.eu/auth/realms/egi
[13] https://aai-demo.egi.eu/auth/realms/egi
[14] https://aai.egi.eu/auth/realms/egi
[15] https://login.elixir-czech.org/oidc/
[16] https://oidc.scc.kit.edu/auth/realms/kit
[17] https://wlcg.cloud.cnaf.infn.it/
Issuer [https://iam-t1-computing.cloud.cnaf.infn.it/]: 
The following scopes are supported: openid profile email address phone offline_access eduperson_scoped_affiliation eduperson_entitlement eduperson_assurance entitlements wlcg.groups
Scopes or 'max' (space separated) [openid profile offline_access]: profile wlcg.groups wlcg compute.create compute.modify compute.read compute.cancel
Registering Client ...
Generating account configuration ...
accepted

...
...
Enter encryption password for account configuration 'htc_23': 
Confirm encryption Password: 
Everything setup correctly!

The -w device instructs oidc-agent to use the device code flow for the authentication, which is the recommended way with INDIGO-IAM. oidc-agent will display a list of different providers that can be used for registration:

[1] https://wlcg.cloud.cnaf.infn.it/
[2] https://iam-test.indigo-datacloud.eu/
...
[20] https://oidc.scc.kit.edu/auth/realms/kit/

Select one of the registered providers, or type a custom issuer (for IAM, the last character of the issuer string is always a /, e.g. https://wlcg.cloud.cnaf.infn.it/).

Then oidc-agent asks for the scopes, typing max (without quotes) allows to get all the allowed scopes, but this is discouraged. Instead, specify the minimum required scopes for the task the client is registered for. In the case of job submission to HTCondor-CE, these scopes are:

compute.create
compute.modify
compute.read
compute.cancel

oidc-agent will register a new client and store the client credentials encrypted on the user interface.

IAM asks you authorization for the client to operate on your behalf by requesting to autenticate with your browser and enter a code provided on the terminal to the indicated web address.

Finally, a password to encrypt the client information on the machine is prompted to the user twice to be set. It can be any password of choice and it must be remembered whenever the client has to be loaded with the oidc-add <client_name> command.

After a client is registered, it can be used. In our example the test_client client is used to obtain access tokens. One does not need to run oidc-gen again unless to update or create a new client configuration.

2. You can then load an account in the agent again with the oidc-add [the client name] command, as follows:

Load an account

ashtimmermanus@ui-tier1 ~$ oidc-add htc_23
Enter decryption password for account config 'htc_23':
success

Once you’ve loaded the account, you can use oidc-token to get tokens for that account.

Access tokens are valid typically for 60 minutes.

The scopes determine the actions that can be performed with the token.

3. To submit a job, you must save the token in a file with the path as the scitokens_file command in the submit file. Tokens can be obtained using the oidc-token command, as follows:

Getting tokens with oidc-token

ashtimmermanus@ui-tier1 ~]$ MASK=$(umask); umask 0077 ; oidc-token htc_23 > ${HOME}/token; umask $MASK

It limits the permissions to the token file to be readable and writeable only by the file owner, for obvious security reasons.

4. For submitting a test job, you can use the example file as follows:

Showing submit-description file

ashtimmermanus@ui-tier1 ~$ cat submit_token.sub
# submit description file
# submit.sub -- simple sleep job
 
scitokens_file          = $ENV(HOME)/htc_test/grid-token/token
+owner                  = undefined
 
batch_name              = Grid-Token-Sleep
executable              = sleep.sh
arguments               = 3600
log                     = $(batch_name).log.$(Process)
output                  = $(batch_name).out.$(Process)
error                   = $(batch_name).err.$(Process)
should_transfer_files   = Yes
when_to_transfer_output = ON_EXIT
 
queue

Where scitokens_file = $ENV(HOME)/token is the path to the file containing the scitoken to authenticate to the CE.

5. Switch to htc/ce module for GRID submission with the module switch command:

Switch to GRID submission with token

ashtimmermanus@ui-tier1 ~$ module switch htc/ce auth=SCITOKENS num=1
Don't forget to "export BEARER_TOKEN=$(oidc-token <client-name>)"!

Switching from htc/ce{auth=SCITOKENS:num=1} to htc/ce{auth=SCITOKENS:num=1}
  Loading requirement: htc/auth{auth=SCITOKENS}

6. You should begin by submitting a simple test job with command condor_submit:

Submitting jobs to Condor

ashtimmermanus@ui-tier1 ~$ export BEARER_TOKEN=$(oidc-token htc_23)
ashtimmermanus@ui-tier1 ~$ condor_submit submit_token.sub
Submitting job(s).
1 job(s) submitted to cluster 52465.

7. You can see the information about the job with the condor_q command, as follows:

Showing jobs queue

ashtimmermanus@ui-tier1 ~$ condor_q


-- Schedd: ce01-htc.cr.cnaf.infn.it : <131.154.193.64:9619?... @ 06/21/24 16:39:44
OWNER          BATCH_NAME          SUBMITTED   DONE   RUN    IDLE  TOTAL JOB_IDS
ashtimmermanus Grid-Token-Sleep   6/21 16:38      _      _      1      1 52465.0

Total for query: 1 jobs; 0 completed, 0 removed, 1 idle, 0 running, 0 held, 0 suspended 
Total for ashtimmermanus: 1 jobs; 0 completed, 0 removed, 1 idle, 0 running, 0 held, 0 suspended 
Total for all users: 4239 jobs; 1297 completed, 0 removed, 228 idle, 2709 running, 5 held, 0 suspended

SSL submission

The SSL Submission substitutes VOMS proxy, although these two processes are almost identical.

CAVEAT

To be able to submit jobs using the SSL authentication, your x509 User Proxy FQAN must be mapped in the CE configuration before job submission.
You will need to send to the support team, via the user-support@lists.cnaf.infn.it mailing list, the output of the voms-proxy-info --all --chain command corresponding to a valid VOMS proxy:

budda@ui-tier1:~
 $ voms-proxy-info --all --chain
=== Proxy Chain Information ===
X.509 v3 certificate
Subject: CN=1569994718,CN=Carmelo Pellegrino cpellegr@infn.it,O=Istituto Nazionale di Fisica Nucleare,C=IT,DC=tcs,DC=terena,DC=org
Issuer: CN=Carmelo Pellegrino cpellegr@infn.it,O=Istituto Nazionale di Fisica Nucleare,C=IT,DC=tcs,DC=terena,DC=org
Valid from: Tue Apr 09 16:18:41 CEST 2024
Valid to: Wed Apr 10 04:18:41 CEST 2024
CA: false
Signature alg: SHA384WITHRSA
Public key type: RSA 2048bit
Allowed usage: digitalSignature keyEncipherment
Serial number: 1569994718
VOMS extensions: yes.

X.509 v3 certificate
Subject: CN=Carmelo Pellegrino cpellegr@infn.it,O=Istituto Nazionale di Fisica Nucleare,C=IT,DC=tcs,DC=terena,DC=org
Issuer: CN=GEANT TCS Authentication RSA CA 4B,O=GEANT Vereniging,C=NL
Valid from: Mon Oct 16 12:57:40 CEST 2023
Valid to: Thu Nov 14 11:57:40 CET 2024
Subject alternative names:
  email: carmelo.pellegrino@cnaf.infn.it
CA: false
Signature alg: SHA384WITHRSA
Public key type: RSA 8192bit
Allowed usage: digitalSignature keyEncipherment
Allowed extended usage: clientAuth emailProtection
Serial number: 73237961961532056736463686571865333148

=== Proxy Information ===
subject   : /DC=org/DC=terena/DC=tcs/C=IT/O=Istituto Nazionale di Fisica Nucleare/CN=Carmelo Pellegrino cpellegr@infn.it/CN=1569994718
issuer    : /DC=org/DC=terena/DC=tcs/C=IT/O=Istituto Nazionale di Fisica Nucleare/CN=Carmelo Pellegrino cpellegr@infn.it
identity  : /DC=org/DC=terena/DC=tcs/C=IT/O=Istituto Nazionale di Fisica Nucleare/CN=Carmelo Pellegrino cpellegr@infn.it
type      : RFC3820 compliant impersonation proxy
strength  : 2048
path      : /tmp/x509up_u23069
timeleft  : 00:00:00
key usage : Digital Signature, Key Encipherment
=== VO km3net.org extension information ===
VO        : km3net.org
subject   : /DC=org/DC=terena/DC=tcs/C=IT/O=Istituto Nazionale di Fisica Nucleare/CN=Carmelo Pellegrino cpellegr@infn.it
issuer    : /DC=org/DC=terena/DC=tcs/C=IT/ST=Napoli/O=Universita degli Studi di Napoli FEDERICO II/CN=voms02.scope.unina.it
attribute : /km3net.org/Role=NULL/Capability=NULL
timeleft  : 00:00:00
uri       : voms02.scope.unina.it:15005

Get a proxy with voms-proxy-init:

apascolinit1@ui-tier1 ~
$ voms-proxy-init --voms cms
Enter GRID pass phrase for this identity:
Contacting voms2.cern.ch:15002 [/DC=ch/DC=cern/OU=computers/CN=voms2.cern.ch] "cms"...
Remote VOMS server contacted succesfully.


Created proxy in /tmp/x509up_u23077.

Your proxy is valid until Tue Mar 19 22:39:41 CET 2024

Submit a job to the CE

Submit file

apascolinit1@ui-tier1 ~
$ cat submit_ssl.sub
# Unix submit description file
# subimt.sub -- simple sleep job

use_x509userproxy       = true
+owner                  = undefined

batch_name              = Grid-SSL-Sleep
executable              = sleep.sh
arguments               = 3600
log                     = $(batch_name).log.$(Process)
output                  = $(batch_name).out.$(Process)
error                   = $(batch_name).err.$(Process)
should_transfer_files   = Yes
when_to_transfer_output = ON_EXIT

queue

Submit a job with SSL

apascolinit1@ui-tier1 ~
$ module switch htc/ce auth=VOMS num=1
Don't forget to voms-proxy-init!

apascolinit1@ui-tier1 ~
$ condor_submit submit_ssl.sub
Submitting job(s).
1 job(s) submitted to cluster 36.

apascolinit1@ui-tier1 ~
$ condor_q


-- Schedd: ce01-htc.cr.cnaf.infn.it : <131.154.193.64:9619?... @ 03/19/24 10:45:18
OWNER      BATCH_NAME        SUBMITTED   DONE   RUN    IDLE  TOTAL JOB_IDS
apascolini Grid-SSL-Sleep   3/19 10:44      _      1      _      1 36.0

Total for query: 1 jobs; 0 completed, 0 removed, 0 idle, 1 running, 0 held, 0 suspended
Total for apascolini: 1 jobs; 0 completed, 0 removed, 0 idle, 1 running, 0 held, 0 suspended
Total for all users: 2 jobs; 1 completed, 0 removed, 0 idle, 1 running, 0 held, 0 suspended

Local jobs submission without environment modules

To submit jobs locally, i.e. from CNAF UI, use the following command:

condor_submit. The following options are required:
- -name sn01-htc.cr.cnaf.infn.it: to correctly address the job to the submit node;
- -spool: to transfer the input files and keep a local copy of the output files;
  - if -spool is not specified, than the user should use -remote sn01-htc.cr.cnaf.infn.it instead of -name sn01-htc.cr.cnaf.infn.it
- the submit description file (a .sub file containing the relevant information for the batch system), to be indicated as argument.

IMPORTANT NOTE

The -spool option is mandatory when the submit folder is a home directory. Home directories are NOT present on worker nodes. The -spool condor_submit option activates the spool mechanism that copies input and output files back and forth from/to the submit folder and the worker node scratch directory.

For example:

To see all jobs

ashtimmermanus@ui-tier1 ~$ condor_submit -name sn01-htc.cr.cnaf.infn.it -spool sleep.sub
Submitting job(s).
1 job(s) submitted to cluster 342094.

where 342094 is the cluster id.

To see all jobs launched by a user locally on a submit node, use the following commands.:

To see all jobs

condor_q -name sn01-htc.cr.cnaf.infn.it <user>

For example:

ashtimmermanus@ui-tier1 ~$ condor_q -name sn01-htc.cr.cnaf.infn.it ashtimmermanus


-- Schedd: sn01-htc.cr.cnaf.infn.it : <131.154.192.242:9618?... @ 06/25/24 14:52:09
OWNER BATCH_NAME      SUBMITTED   DONE   RUN    IDLE   HOLD  TOTAL JOB_IDS

Total for query: 0 jobs; 0 completed, 0 removed, 0 idle, 0 running, 0 held, 0 suspended 
Total for all users: 46789 jobs; 12431 completed, 1 removed, 23897 idle, 9992 running, 468 held, 0 suspended

To get the list of held jobs and the held reason add the option -held.

Whereas, to see information about a single job using the following commands.:

condor_q -name sn01-htc.cr.cnaf.infn.it <cluster id>

For example:

ashtimmermanus@ui-tier1 ~$ condor_q -name sn01-htc.cr.cnaf.infn.it 64667

-- Schedd: sn01-htc.cr.cnaf.infn.it : <131.154.192.242:9618?... @ 06/25/24 15:58:32
OWNER BATCH_NAME      SUBMITTED   DONE   RUN    IDLE   HOLD  TOTAL JOB_IDS

Total for query: 0 jobs; 0 completed, 0 removed, 0 idle, 0 running, 0 held, 0 suspended 
Total for all users: 50343 jobs; 12871 completed, 1 removed, 26961 idle, 10000 running, 510 held, 0 suspended

To investigate why a job ends up in a 'Held' state, use the following command:

condor_q -name sn01-htc.cr.cnaf.infn.it <cluster id> -af HoldReason

Finally, to get more detailed information on why the job takes a long time to enter in run state, use the -better-analyze option. For example:

ashtimmermanus@ui-tier1 ~$ condor_q -better-analyze -name sn01-htc.cr.cnaf.infn.it 342094


-- Schedd: sn01-htc.cr.cnaf.infn.it : <131.154.192.242:9618?...
The Requirements expression for job 342094.000 is

    ( !StringListMember(split(AcctGroup ?: "none",".")[0],t1_OverPledgeGroups ?: "",":")) &&
    (TARGET.t1_allow_sam isnt true) &&
    (( !StringListMember("gpfs_data",t1_GPFS_CHECK ?: "",":")) &&
      ((NumJobStarts == 0) && ((TARGET.Arch == "X86_64") &&
          (TARGET.OpSys == "LINUX") && (TARGET.Disk >= RequestDisk) &&
          (TARGET.Memory >= RequestMemory) && (TARGET.HasFileTransfer))))

Job 342094.000 defines the following attributes:

    AcctGroup = "dteam"
    DiskUsage = 1
    ImageSize = 1
    NumJobStarts = 0
    RequestDisk = DiskUsage
    RequestMemory = ifthenelse(MemoryUsage =!= undefined,MemoryUsage,(ImageSize + 1023) / 1024)

The Requirements expression for job 342094.000 reduces to these conditions:

         Slots
Step    Matched  Condition
-----  --------  ---------
[0]           0  StringListMember(split(AcctGroup ?: "none",".")[0],t1_OverPledgeGroups ?: "",":")
[1]       34451  TARGET.t1_allow_sam isnt true
[3]           0  StringListMember("gpfs_data",t1_GPFS_CHECK ?: "",":")
[5]       34387  TARGET.Arch == "X86_64"
[10]      26383  TARGET.Memory >= RequestMemory
[11]      26328  [5] && [10]
[16]      26323  [1] && [11]


342094.000:  Run analysis summary ignoring user priority.  Of 926 machines,
      8 are rejected by your job's requirements
     63 reject your job because of their own requirements
      0 match and are already running your jobs
      0 match but are serving other users
    855 are able to run your job

It is possible to format the output of condor_q with the option -af <ClassAd expression> :

-af list specific attributes
-af:j shows the attribute names
-af:th formats output in a nice table

The job outputs cannot be copied automatically. The user should launch the condor_transfer_data command:

condor_transfer_data -name sn01-htc.cr.cnaf.infn.it <cluster id>

with the cluster id returned by condor_submit command at submission using the following commands.:

ashtimmermanus@ui-tier1 ~$ condor_transfer_data -name sn01-htc.cr.cnaf.infn.it 342094
Fetching data files...

At the end, to remove a job use the command condor_rm:

ashtimmermanus@ui-tier1 ~$ condor_rm -name sn01-htc.cr.cnaf.infn.it 342094
All jobs in cluster 342094 have been marked for removal

Also, to fix the submit node you want to submit the job you can launch using the following command:

export _condor_SCHEDD_HOST=sn01-htc.cr.cnaf.infn.it

and the commands to submit and check the job become easier:

ashtimmermanus@ui-tier1 ~$ export _condor_SCHEDD_HOST=sn01-htc.cr.cnaf.infn.it
ashtimmermanus@ui-tier1 ~$ condor_submit -spool sleep.sub
Submitting job(s).
1 job(s) submitted to cluster 342097.
[ashtimmermanus@ui-tier1 ~]$ condor_q 342097


-- Schedd: sn01-htc.cr.cnaf.infn.it : <131.154.192.242:9618?... @ 06/26/24 14:22:16
OWNER          BATCH_NAME    SUBMITTED   DONE   RUN    IDLE  TOTAL JOB_IDS
ashtimmermanus ID: 342097   6/26 14:21      _      _      1      1 342097.0

Total for query: 1 jobs; 0 completed, 0 removed, 1 idle, 0 running, 0 held, 0 suspended 
Total for all users: 42786 jobs; 12809 completed, 4 removed, 19100 idle, 9999 running, 874 held, 0 suspended

Submit grid jobs without environment modules

A token can be obtained from the command-line using oidc-agent. The oidc agent has to be started using the following command:

 eval `oidc-agent-service use`

This starts the agent and sets the required environment variables.

Job submission:

ashtimmerman@ui-tier1 ~$ export _condor_SEC_CLIENT_AUTHENTICATION_METHODS=SCITOKENS
ashtimmerman@ui-tier1 ~$ condor_submit -pool ce07-htc.cr.cnaf.infn.it:9619 -remote ce07-htc.cr.cnaf.infn.it -spool token_sleep.sub
Submitting job(s).
1 job(s) submitted to cluster 4037450.

ashtimmerman@ui-tier1 ~$ condor_q -pool ce07-htc.cr.cnaf.infn.it:9619 -name ce07-htc.cr.cnaf.infn.it 4037450
 
 -- Schedd: ce07-htc.cr.cnaf.infn.it : <131.154.192.106:25329?... @ 12/07/23 17:56:19
OWNER          BATCH_NAME     SUBMITTED   DONE   RUN    IDLE  TOTAL JOB_IDS
ashtimmermanus ID: 4037450  12/7  14:40      _      _      _      1 4037450.0

Total for query: 1 jobs; 1 completed, 0 removed, 0 idle, 0 running, 0 held, 0 suspended
Total for all users: 8752 jobs; 4843 completed, 0 removed, 591 idle, 3317 running, 1 held, 0 suspended

With VOMS proxies

First, create the proxy using the following command:

voms-proxy-init --voms <vo name>

then you can submit the job with the following commands:

export _condor_SEC_CLIENT_AUTHENTICATION_METHODS=SSL
condor_submit -pool ce02-htc.cr.cnaf.infn.it:9619 -remote ce02-htc.cr.cnaf.infn.it -spool sleep.sub

For example:

-bash-4.2$ export _condor_SEC_CLIENT_AUTHENTICATION_METHODS=SSL
-bash-4.2$ condor_submit -pool ce02-htc.cr.cnaf.infn.it:9619 -remote ce02-htc.cr.cnaf.infn.it -spool sleep.sub
Submitting job(s).
1 job(s) submitted to cluster 2015349.

where "sleep.sub" is the following submit file:

# submit description file
# sleep.sub -- simple sleep job

use_x509userproxy = true
# needed for all the operation where a certificate is required

+owner = undefined

delegate_job_GSI_credentials_lifetime = 0
# this has to be included if the proxy will last more than 24h, otherwise it will be reduced to 24h automatically

executable              = sleep.sh
log                     = sleep.log
output                  = outfile.txt
error                   = errors.txt
should_transfer_files   = Yes
when_to_transfer_output = ON_EXIT
queue

that differs from the sleep.sub file that has been submitted locally by the command name:

+owner = undefined

that allows the computing element to identify the user through the voms-proxy.

Note that the submit description file of a grid job is basically different from one that has to be submitted locally.

To check the job status of a single job use

condor_q -pool ce02-htc.cr.cnaf.infn.it:9619 -name ce02-htc.cr.cnaf.infn.it <cluster id>

So, for the previous example we have:

-bash-4.2$ condor_q -pool ce02-htc.cr.cnaf.infn.it:9619 -name ce02-htc.cr.cnaf.infn.it 2015349

-- Schedd: ce02-htc.cr.cnaf.infn.it : <131.154.192.41:9619?... @ 07/29/20 17:02:21
OWNER BATCH_NAME SUBMITTED DONE RUN IDLE TOTAL JOB_IDS
virgo008 ID: 2015349 7/29 16:59 _ _ 1 1 2015349.0

Total for query: 1 jobs; 0 completed, 0 removed, 1 idle, 0 running, 0 held, 0 suspended
Total for all users: 31430 jobs; 8881 completed, 2 removed, 676 idle, 1705 running, 20166 held, 0 suspended

The user is mapped through the voms-proxy in the user name virgo008 as owner of the job. Then, to get the list of the jobs submitted by a user just change <cluster id> with <owner>:

-bash-4.2$ condor_q -pool ce02-htc.cr.cnaf.infn.it:9619 -name ce02-htc.cr.cnaf.infn.it virgo008

-- Schedd: ce02-htc.cr.cnaf.infn.it : <131.154.192.41:9619?... @ 07/29/20 17:09:42
OWNER BATCH_NAME SUBMITTED DONE RUN IDLE TOTAL JOB_IDS
virgo008 ID: 2014655 7/29 11:30 _ _ _ 1 2014655.0
virgo008 ID: 2014778 7/29 12:24 _ _ _ 1 2014778.0
virgo008 ID: 2014792 7/29 12:40 _ _ _ 1 2014792.0
virgo008 ID: 2015159 7/29 15:11 _ _ _ 1 2015159.0
virgo008 ID: 2015161 7/29 15:12 _ _ _ 1 2015161.0
virgo008 ID: 2015184 7/29 15:24 _ _ _ 1 2015184.0
virgo008 ID: 2015201 7/29 15:33 _ _ _ 1 2015201.0
virgo008 ID: 2015207 7/29 15:39 _ _ _ 1 2015207.0
virgo008 ID: 2015217 7/29 15:43 _ _ _ 1 2015217.0
virgo008 ID: 2015224 7/29 15:47 _ _ _ 1 2015224.0
virgo008 ID: 2015349 7/29 16:59 _ _ _ 1 2015349.0

Total for query: 11 jobs; 11 completed, 0 removed, 0 idle, 0 running, 0 held, 0 suspended
Total for all users: 31429 jobs; 8898 completed, 3 removed, 591 idle, 1737 running, 20200 held, 0 suspended

As in the local case, to get the job outputs the user should launch using the following commands.:

condor_transfer_data -pool ce02-htc.cr.cnaf.infn.it:9619 -name ce02-htc.cr.cnaf.infn.it <cluster id>

with the cluster id returned by condor_submit command at submission time:

-bash-4.2$ condor_transfer_data -pool ce02-htc.cr.cnaf.infn.it:9619 -name ce02-htc.cr.cnaf.infn.it 2015217
Fetching data files...
-bash-4.2$ ls
ce_testp308.sub errors.txt outfile.txt sleep.log sleep.sh sleep.sub test.sub

And to remove a job submitted via grid using the following commands.:

-bash-4.2$ condor_rm -pool ce02-htc.cr.cnaf.infn.it:9619 -name ce02-htc.cr.cnaf.infn.it 2015217
All jobs in cluster 2015217 have been marked for removal

Long lived proxy inside a job

One problem which may occurr while running a grid job, is represented by the default short-life of the VOMS proxy used to submit the job. Indeed, the job will be aborted if it does not finish before the expiration time of the proxy. The easiest solution to this problem would be to use very long-lived proxies, but at the expense of an increased security risk. Furthermore, the duration of a VOMS proxy is limited by the VOMS server and cannot be made arbitrarily long.

To overcome this limitation, a proxy credential repository system is used, which allows the user to create and store a long-term proxy in a dedicated server (a "MyProxy" server). At Tier-1 this MyProxy store is myproxy.cnaf.infn.it.

For instance, using the following command, it is possible to retrieve a grid proxy with a lifetime of 168 hours and store it into a the Myproxy server with credentials valid for 720 hours:

dlattanzio@ui-tier1 ~$ myproxy-init --proxy_lifetime 168 --cred_lifetime 720 --voms vo.padme.org --pshost myproxy.cnaf.infn.it --dn_as_username --credname proxyCred --local_proxy
Enter GRID pass phrase for this identity:
Contacting voms2.cnaf.infn.it:15020 [/DC=org/DC=terena/DC=tcs/C=IT/ST=Roma/O=Istituto Nazionale di Fisica Nucleare/OU=CNAF/CN=voms2.cnaf.infn.it] "vo.padme.org"...
Remote VOMS server contacted succesfully.

voms2.cnaf.infn.it:15020: The validity of this VOMS AC in your proxy is shortened to 86400 seconds!

Created proxy in /tmp/myproxy-proxy.10164.21287.

Your proxy is valid until Sat Dec 24 18:34:42 CET 2022
Enter MyProxy pass phrase:
Verifying - Enter MyProxy pass phrase:
Your identity: /DC=org/DC=terena/DC=tcs/C=IT/O=Istituto Nazionale di Fisica Nucleare/CN=Daniele Lattanzio dlattanzio@infn.it
Creating proxy ............................................ Done
Proxy Verify OK
Your proxy is valid until: Thu Dec  1 18:34:50 2022
A proxy valid for 720 hours (30.0 days) for user /DC=org/DC=terena/DC=tcs/C=IT/O=Istituto Nazionale di Fisica Nucleare/CN=Daniele Lattanzio dlattanzio@infn.it now exists on myproxy.cnaf.infn.it.

As seen above, the user will be requested first to insert the GRID certificate password, and then a "MyProxy pass phrase" for future proxy retrievals.

The same password has to be indicated also in the submission file, which will be similiar to the one in the example below:

# submit description file
# sleep2.sub -- simple sleep job

use_x509userproxy = true
# needed for all the operation where a certificate is required

+owner = undefined

MyProxyHost = myproxy.cnaf.infn.it:7512
MyProxyPassword = ***
MyProxyCredentialName = proxyCred
MyProxyRefreshThreshold = 3300
MyProxyNewProxyLifetime = 1440

delegate_job_GSI_credentials_lifetime = 0
# this has to be included if the proxy will last more than 24h, otherwise it will be reduced to 24h automatically

executable              = sleep.sh
log                     = sleep.log
output                  = outfile.txt
error                   = errors.txt
should_transfer_files   = Yes
when_to_transfer_output = ON_EXIT
queue

where MyProxyRefreshThreshold and MyProxyNewProxyLifetime represent respectively the time (in second) before the expiration of a proxy that the proxy should be refreshed and the new lifetime (in minutes) of the proxy after it is refreshed.

Experiment share usage

If a user wants to know the usage of an entire experiment goup, in particular to see the number of jobs submitted by each experiment user, the command is:

condor_q -all -name sn01-htc -cons 'AcctGroup == "<exp-name>"' -af Owner jobstatus | sort | uniq -c

The output will look like this:

-bash-4.2$ condor_q -all -name sn01-htc -cons 'AcctGroup == "pulp-fiction"' -af Owner '{"Idle", "Running", "Removed", "Completed", "Held"}[jobstatus-1]' | sort | uniq -c
1 MWallace Idle
161 VVega Completed
605 Wolf Running
572 JulesW Idle
5884 Ringo Running
33 Butch Running
5 Jody Held

In the first column there is the number of submitted jobs, in the second there is the user who has submitted them and in the third there is the job' status.

Page tree

HTCondor jobs

Submission to the cluster with environment modules

Local Submission

GRID submission with environment modules

SSL submission

Local jobs submission without environment modules

Submit grid jobs without environment modules

With VOMS proxies

Long lived proxy inside a job

Experiment share usage