HTCondor is a job scheduler: you give HTCondor a submit description file containing commands that instruct it on how to manage jobs. Within the pool of machines, HTCondor locates the best machine to run each job, packages the job up and ships it off to that execute machine. The jobs run, and their output is returned to the machine that submitted them.

Submission to the cluster with environment modules 

To ease the use of HTCondor, we provide a solution based on environment modules. The traditional interaction methods, i.e. specifying all options on the command line, remain valid, but are more verbose and less handy. The HTC modules set all the environment variables needed to correctly submit to the HTCondor cluster.
Once logged into any Tier 1 user interface, this utility will be available. You can list all the available modules using:

Showing available modules
ashtimmermanus@ui-tier1 ~$ module avail
---------------------------- /opt/exp_software/opssw/modules/modulefiles ----------------------------
htc/auth  htc/ce  htc/local  use.own  

Key:
modulepath  default-version  

These htc/* modules have different roles.

All modules in the htc family provide online help via the "module help <module name>" command, e.g.:

Display a module's help information
budda@ui-tier1:~
 $ module help htc
-------------------------------------------------------------------
Module Specific Help for /opt/exp_software/opssw/modules/modulefiles/htc/local:

Defines environment variables and aliases to ease the interaction with the INFN-T1 HTCondor local job submission system
-------------------------------------------------------------------

Local job submission

To submit local jobs or query the local schedd (sn01-htc, the HTCondor cluster access point), use the htc/local module. This is the default module loaded when loading the "htc" family, so you can simply switch to the htc module and use the condor_q command to show the local job queue:

Switching module
ashtimmermanus@ui-tier1 ~$ module switch htc # default is htc/local
ashtimmermanus@ui-tier1 ~$ condor_q


-- Schedd: sn01-htc.cr.cnaf.infn.it : <131.154.192.242:9618?... @ 06/21/24 14:41:41
OWNER BATCH_NAME      SUBMITTED   DONE   RUN    IDLE   HOLD  TOTAL JOB_IDS

Total for query: 0 jobs; 0 completed, 0 removed, 0 idle, 0 running, 0 held, 0 suspended 
Total for ashtimmermanus: 0 jobs; 0 completed, 0 removed, 0 idle, 0 running, 0 held, 0 suspended 
Total for all users: 47942 jobs; 12542 completed, 0 removed, 25401 idle, 9999 running, 0 held, 0 suspended


A simple example of an executable is the following sleep job that waits for the specified amount of time and then exits:

Example executable for a job
apascolinit1@ui-tier1 ~
$ cat sleep.sh
#!/usr/bin/env bash
sleep $1


To submit this sample job, we need to specify all the details, such as the name and location of the executable and of all the needed input files, in a submit description file, where each line has the form command = value:

Submission and control of job status
apascolinit1@ui-tier1 ~
$ cat submit.sub
# submit description file
# submit.sub -- simple sleep job

batch_name              = Local-Sleep
executable              = sleep.sh
arguments               = 3600
log                     = $(batch_name).log.$(Process)
output                  = $(batch_name).out.$(Process)
error                   = $(batch_name).err.$(Process)
should_transfer_files   = Yes
when_to_transfer_output = ON_EXIT

queue

apascolinit1@ui-tier1 ~
$ module switch htc

apascolinit1@ui-tier1 ~
$ condor_submit submit.sub
Submitting job(s).
1 job(s) submitted to cluster 15.

apascolinit1@ui-tier1 ~
$ condor_q


-- Schedd: sn01-htc.cr.cnaf.infn.it : <131.154.192.242:9618?... @ 03/18/24 17:15:44
OWNER        BATCH_NAME     SUBMITTED   DONE   RUN    IDLE  TOTAL JOB_IDS
apascolinit1 Local-Sleep   3/18 17:15      _      1      _      1 15.0

Total for query: 1 jobs; 0 completed, 0 removed, 0 idle, 1 running, 0 held, 0 suspended
Total for apascolinit1: 1 jobs; 0 completed, 0 removed, 0 idle, 1 running, 0 held, 0 suspended
Total for all users: 1 jobs; 0 completed, 0 removed, 0 idle, 1 running, 0 held, 0 suspended

If your account belongs to more than one Unix group (excluding the "catchall" cnafusers group), loading the htc module will prompt you with a TUI selector to choose the specific group to submit jobs to. To avoid the interactive TUI, you can set the CONDOR_SHARE environment variable before loading the module. Refer to the output of the following command to know the values appropriate for your account:

Show the Unix groups available to your account
id -nG | tr ' ' '\n' | grep -v ^cnafusers$


For example:

Show the Unix groups available to your account
[cpellegr@ui-tier1] $ id -nG | tr ' ' '\n' | grep -v ^cnafusers$
tier1
cms
hpc
darkside
borexino
km3
cupid
newchim
asfin
herd
luna
gminus
muone
virgo
user-support
cta
darksidesgm

And then:

Fix a Unix group
export CONDOR_SHARE=darkside
module load htc

This always selects the "darkside" group and skips the interactive selection. To make the choice permanent, you can put the variable definition in your ~/.bashrc file.
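For instance, a minimal, idempotent way to persist the setting (using the example group "darkside" from above; substitute one of your own groups):

```shell
# Append the CONDOR_SHARE definition to ~/.bashrc only if it is not already there.
grep -qx 'export CONDOR_SHARE=darkside' ~/.bashrc 2>/dev/null \
  || echo 'export CONDOR_SHARE=darkside' >> ~/.bashrc
```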

Grid submission with environment modules 

The htc/ce module eases the usage of the condor_q and condor_submit commands by setting up all the variables needed to contact our Grid compute entry points.

Switching module for GRID submission
ashtimmermanus@ui-tier1 ~$ module switch htc/ce
Don't forget to "export BEARER_TOKEN=$(oidc-token <client-name>)"!

Switching from htc/local{ver=23} to htc/ce{auth=SCITOKENS:num=2}
  Loading requirement: htc/auth{auth=SCITOKENS}


The module accepts two parameters that can be specified on the module switch command line, as shown below (when omitted, the defaults are those used in the example above: num=2 and auth=SCITOKENS).


parameter  values           description
num        1,2,3,4,5,6      connects to ce{num}-htc
auth       VOMS, SCITOKENS  calls the htc/auth module with the selected authentication method


To explicitly use token authentication on ce03-htc, use the following module switch command:

Switching module using token authentication
ashtimmermanus@ui-tier1 ~$ module switch htc/ce auth=SCITOKENS num=3
Don't forget to "export BEARER_TOKEN=$(oidc-token <client-name>)"!

Switching from htc/ce{auth=SCITOKENS:num=2} to htc/ce{auth=SCITOKENS:num=3}
  Unloading useless requirement: htc/auth{auth=SCITOKENS}
  Loading requirement: htc/auth{auth=SCITOKENS}


The following command starts a usable OIDC agent and makes it available for the current shell. If oidc-agent-service has already started an agent for you, this agent will be reused and made available. 

Make oidc-agent available in the current terminal
ashtimmermanus@ui-tier1 ~$ eval `oidc-agent-service use`
36535


To see the list of locally configured OIDC clients, use the oidc-add -l command:

List existing configuration
ashtimmermanus@ui-tier1 ~$ oidc-add -l
The following account configurations are usable: 
aksieniia_token
testhtc


If you haven't registered an OIDC client yet, follow the next steps:

  1. Register a new OIDC client with the following command (this is needed only the first time, to create the client):
Registering a new client using oidc-gen
ashtimmermanus@ui-tier1 ~$ oidc-gen -w device
Enter short name for the account to configure: htc_23
[1] https://iam-t1-computing.cloud.cnaf.infn.it/
[2] https://iam-test.indigo-datacloud.eu/
[3] https://iam.deep-hybrid-datacloud.eu/
[4] https://iam.extreme-datacloud.eu/
[5] https://iam-demo.cloud.cnaf.infn.it/
[6] https://b2access.eudat.eu:8443/oauth2
[7] https://b2access-integration.fz-juelich.de/oauth2
[8] https://login-dev.helmholtz.de/oauth2
[9] https://login.helmholtz.de/oauth2
[10] https://services.humanbrainproject.eu/oidc/
[11] https://accounts.google.com
[12] https://aai-dev.egi.eu/auth/realms/egi
[13] https://aai-demo.egi.eu/auth/realms/egi
[14] https://aai.egi.eu/auth/realms/egi
[15] https://login.elixir-czech.org/oidc/
[16] https://oidc.scc.kit.edu/auth/realms/kit
[17] https://wlcg.cloud.cnaf.infn.it/
Issuer [https://iam-t1-computing.cloud.cnaf.infn.it/]: 
The following scopes are supported: openid profile email address phone offline_access eduperson_scoped_affiliation eduperson_entitlement eduperson_assurance entitlements wlcg.groups
Scopes or 'max' (space separated) [openid profile offline_access]: profile wlcg.groups wlcg compute.create compute.modify compute.read compute.cancel
Registering Client ...
Generating account configuration ...
accepted

...
...
Enter encryption password for account configuration 'htc_23': 
Confirm encryption Password: 
Everything setup correctly!


The -w device option instructs oidc-agent to use the device code flow for authentication, which is the recommended way with INDIGO-IAM. oidc-agent will display a list of providers that can be used for registration:

[1] https://wlcg.cloud.cnaf.infn.it/
[2] https://iam-test.indigo-datacloud.eu/
...
[20] https://oidc.scc.kit.edu/auth/realms/kit/

Select one of the registered providers, or type a custom issuer (for IAM, the last character of the issuer string is always a /, e.g. https://wlcg.cloud.cnaf.infn.it/).

Then oidc-agent asks for the scopes: typing max (without quotes) requests all the allowed scopes, but this is discouraged. Instead, specify the minimum scopes required for the task the client is registered for. In the case of job submission to HTCondor-CE, these scopes are:

  • compute.create
  • compute.modify
  • compute.read
  • compute.cancel

 oidc-agent will register a new client and store the client credentials encrypted on the user interface.

IAM then asks you to authorize the client to operate on your behalf: authenticate with your browser at the indicated web address and enter the code shown on the terminal.

Finally, you are prompted twice to set a password that encrypts the client information on the machine. It can be any password of your choice, but it must be remembered: it is required whenever the client has to be loaded with the oidc-add <client_name> command.

After a client is registered, it can be used. In our example the htc_23 client is used to obtain access tokens. There is no need to run oidc-gen again unless you want to update the configuration or create a new client.


2. You can then load an account into the agent with the oidc-add <client name> command, as follows:

Load an account
ashtimmermanus@ui-tier1 ~$ oidc-add htc_23
Enter decryption password for account config 'htc_23':
success

Once you’ve loaded the account, you can use oidc-token to get tokens for that account.

Access tokens are typically valid for 60 minutes. The scopes determine the actions that can be performed with the token.
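To check which scopes and expiry a token actually carries, its JWT payload can be decoded locally. The sketch below builds a dummy token instead of calling oidc-token htc_23, so it is illustrative only; the base64 options assume GNU coreutils:

```shell
# Build a dummy JWT (in real use: token=$(oidc-token htc_23)).
payload='{"scope":"compute.read compute.create","exp":1718983181}'
token="hdr.$(printf '%s' "$payload" | base64 -w0 | tr '+/' '-_' | tr -d '=').sig"

# Decode the middle (payload) segment: undo URL-safe characters, restore padding.
seg=$(printf '%s' "$token" | cut -d. -f2 | tr '_-' '/+')
while [ $(( ${#seg} % 4 )) -ne 0 ]; do seg="${seg}="; done
printf '%s\n' "$seg" | base64 -d   # prints the payload JSON with scope and exp
```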


3. To submit a job, you must save the token in a file and reference its path with the scitokens_file command in the submit description file. Tokens can be obtained using the oidc-token command, as follows:

Getting tokens with oidc-token
ashtimmermanus@ui-tier1 ~$ MASK=$(umask); umask 0077 ; oidc-token htc_23 > ${HOME}/token; umask $MASK

The temporary umask 0077 restricts the token file permissions so that it is readable and writable only by the file owner, for obvious security reasons.
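A self-contained way to verify the effect of this pattern, using a dummy file in place of the real token:

```shell
# Create a file under umask 0077 and check its permissions
# (in real use, replace `echo dummy` with `oidc-token <client-name>`).
tokfile=$(mktemp -u)
MASK=$(umask); umask 0077
echo "dummy" > "$tokfile"
umask "$MASK"
stat -c '%a' "$tokfile"   # → 600 (read/write for the owner only)
```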


4. For submitting a test job, you can use the example file as follows:

Showing submit-description file
ashtimmermanus@ui-tier1 ~$ cat submit_token.sub
# submit description file
# submit.sub -- simple sleep job
 
scitokens_file          = $ENV(HOME)/htc_test/grid-token/token
+owner                  = undefined
 
batch_name              = Grid-Token-Sleep
executable              = sleep.sh
arguments               = 3600
log                     = $(batch_name).log.$(Process)
output                  = $(batch_name).out.$(Process)
error                   = $(batch_name).err.$(Process)
should_transfer_files   = Yes
when_to_transfer_output = ON_EXIT
 
queue

Here scitokens_file is the path of the file containing the token used to authenticate to the CE; make sure it matches the path where you saved the token in the previous step.


5. Switch to htc/ce module for grid submission with the module switch command:

Switch to GRID submission with token
ashtimmermanus@ui-tier1 ~$ module switch htc/ce auth=SCITOKENS num=1
Don't forget to "export BEARER_TOKEN=$(oidc-token <client-name>)"!

Switching from htc/ce{auth=SCITOKENS:num=1} to htc/ce{auth=SCITOKENS:num=1}
  Loading requirement: htc/auth{auth=SCITOKENS}


6. Submit a simple test job with the condor_submit command:

Submitting jobs to Condor
ashtimmermanus@ui-tier1 ~$ export BEARER_TOKEN=$(oidc-token htc_23)
ashtimmermanus@ui-tier1 ~$ condor_submit submit_token.sub
Submitting job(s).
1 job(s) submitted to cluster 52465.


7. You can see the information about the job with the condor_q command, as follows:

Showing jobs queue
ashtimmermanus@ui-tier1 ~$ condor_q


-- Schedd: ce01-htc.cr.cnaf.infn.it : <131.154.193.64:9619?... @ 06/21/24 16:39:44
OWNER          BATCH_NAME          SUBMITTED   DONE   RUN    IDLE  TOTAL JOB_IDS
ashtimmermanus Grid-Token-Sleep   6/21 16:38      _      _      1      1 52465.0

Total for query: 1 jobs; 0 completed, 0 removed, 1 idle, 0 running, 0 held, 0 suspended 
Total for ashtimmermanus: 1 jobs; 0 completed, 0 removed, 1 idle, 0 running, 0 held, 0 suspended 
Total for all users: 4239 jobs; 1297 completed, 0 removed, 228 idle, 2709 running, 5 held, 0 suspended

SSL submission

SSL submission supersedes VOMS-proxy submission, although the two processes are almost identical from the user's point of view.

CAVEAT

To be able to submit jobs using the SSL authentication, your x509 User Proxy FQAN must be mapped in the CE configuration before job submission.
You will need to send to the support team, via the user-support@lists.cnaf.infn.it mailing list, the output of the voms-proxy-info --all --chain  command corresponding to a valid VOMS proxy:

budda@ui-tier1:~
 $ voms-proxy-info --all --chain
=== Proxy Chain Information ===
X.509 v3 certificate
Subject: CN=1569994718,CN=Carmelo Pellegrino cpellegr@infn.it,O=Istituto Nazionale di Fisica Nucleare,C=IT,DC=tcs,DC=terena,DC=org
Issuer: CN=Carmelo Pellegrino cpellegr@infn.it,O=Istituto Nazionale di Fisica Nucleare,C=IT,DC=tcs,DC=terena,DC=org
Valid from: Tue Apr 09 16:18:41 CEST 2024
Valid to: Wed Apr 10 04:18:41 CEST 2024
CA: false
Signature alg: SHA384WITHRSA
Public key type: RSA 2048bit
Allowed usage: digitalSignature keyEncipherment
Serial number: 1569994718
VOMS extensions: yes.

X.509 v3 certificate
Subject: CN=Carmelo Pellegrino cpellegr@infn.it,O=Istituto Nazionale di Fisica Nucleare,C=IT,DC=tcs,DC=terena,DC=org
Issuer: CN=GEANT TCS Authentication RSA CA 4B,O=GEANT Vereniging,C=NL
Valid from: Mon Oct 16 12:57:40 CEST 2023
Valid to: Thu Nov 14 11:57:40 CET 2024
Subject alternative names:
  email: carmelo.pellegrino@cnaf.infn.it
CA: false
Signature alg: SHA384WITHRSA
Public key type: RSA 8192bit
Allowed usage: digitalSignature keyEncipherment
Allowed extended usage: clientAuth emailProtection
Serial number: 73237961961532056736463686571865333148

=== Proxy Information ===
subject   : /DC=org/DC=terena/DC=tcs/C=IT/O=Istituto Nazionale di Fisica Nucleare/CN=Carmelo Pellegrino cpellegr@infn.it/CN=1569994718
issuer    : /DC=org/DC=terena/DC=tcs/C=IT/O=Istituto Nazionale di Fisica Nucleare/CN=Carmelo Pellegrino cpellegr@infn.it
identity  : /DC=org/DC=terena/DC=tcs/C=IT/O=Istituto Nazionale di Fisica Nucleare/CN=Carmelo Pellegrino cpellegr@infn.it
type      : RFC3820 compliant impersonation proxy
strength  : 2048
path      : /tmp/x509up_u23069
timeleft  : 00:00:00
key usage : Digital Signature, Key Encipherment
=== VO km3net.org extension information ===
VO        : km3net.org
subject   : /DC=org/DC=terena/DC=tcs/C=IT/O=Istituto Nazionale di Fisica Nucleare/CN=Carmelo Pellegrino cpellegr@infn.it
issuer    : /DC=org/DC=terena/DC=tcs/C=IT/ST=Napoli/O=Universita degli Studi di Napoli FEDERICO II/CN=voms02.scope.unina.it
attribute : /km3net.org/Role=NULL/Capability=NULL
timeleft  : 00:00:00
uri       : voms02.scope.unina.it:15005



  1. Get a proxy with voms-proxy-init:
    apascolinit1@ui-tier1 ~
    $ voms-proxy-init --voms cms
    Enter GRID pass phrase for this identity:
    Contacting voms2.cern.ch:15002 [/DC=ch/DC=cern/OU=computers/CN=voms2.cern.ch] "cms"...
    Remote VOMS server contacted succesfully.
    
    
    Created proxy in /tmp/x509up_u23077.
    
    Your proxy is valid until Tue Mar 19 22:39:41 CET 2024
  2. Submit a job to the CE
    Submit file
    apascolinit1@ui-tier1 ~
    $ cat submit_ssl.sub
    # Unix submit description file
    # submit_ssl.sub -- simple sleep job
    
    use_x509userproxy       = true
    +owner                  = undefined
    
    batch_name              = Grid-SSL-Sleep
    executable              = sleep.sh
    arguments               = 3600
    log                     = $(batch_name).log.$(Process)
    output                  = $(batch_name).out.$(Process)
    error                   = $(batch_name).err.$(Process)
    should_transfer_files   = Yes
    when_to_transfer_output = ON_EXIT
    
    queue

    Submit a job with SSL
    apascolinit1@ui-tier1 ~
    $ module switch htc/ce auth=VOMS num=1
    Don't forget to voms-proxy-init!
    
    apascolinit1@ui-tier1 ~
    $ condor_submit submit_ssl.sub
    Submitting job(s).
    1 job(s) submitted to cluster 36.
    
    apascolinit1@ui-tier1 ~
    $ condor_q
    
    
    -- Schedd: ce01-htc.cr.cnaf.infn.it : <131.154.193.64:9619?... @ 03/19/24 10:45:18
    OWNER      BATCH_NAME        SUBMITTED   DONE   RUN    IDLE  TOTAL JOB_IDS
    apascolini Grid-SSL-Sleep   3/19 10:44      _      1      _      1 36.0
    
    Total for query: 1 jobs; 0 completed, 0 removed, 0 idle, 1 running, 0 held, 0 suspended
    Total for apascolini: 1 jobs; 0 completed, 0 removed, 0 idle, 1 running, 0 held, 0 suspended
    Total for all users: 2 jobs; 1 completed, 0 removed, 0 idle, 1 running, 0 held, 0 suspended


Local job submission without environment modules

To submit jobs locally, i.e. from CNAF UI, use the following command:

  • condor_submit. The following options are required:
    • -name sn01-htc.cr.cnaf.infn.it: to correctly address the job to the submit node;
    • -spool: to transfer the input files and keep a local copy of the output files;
      • if -spool is not specified, then the user should use -remote sn01-htc.cr.cnaf.infn.it instead of -name sn01-htc.cr.cnaf.infn.it
    • primary_unix_group=<condor-share-name> (alternatively, put it as a line in the submit file): specifies the primary Unix group you want your job to be executed with. Refer to the list of groups returned by the id -nG shell command, excluding the cnafusers group, to know the exact names of the groups your account belongs to. The specified group corresponds to an HTCondor share and is accounted to the corresponding scientific community.
    • the submit description file (a .sub file containing the relevant information for the batch system), to be indicated as argument.

IMPORTANT NOTE

The -spool option is mandatory when the submit folder is a home directory. Home directories are NOT present on worker nodes. The -spool condor_submit option activates the spool mechanism that copies input and output files back and forth from/to the submit folder and the worker node scratch directory.


For example:

Submitting a job
ashtimmermanus@ui-tier1 ~$ condor_submit -name sn01-htc.cr.cnaf.infn.it primary_unix_group=dteam -spool sleep.sub
Submitting job(s).
1 job(s) submitted to cluster 342094.

where 342094 is the cluster id.
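Since the later commands (condor_q, condor_transfer_data, condor_rm) all take this cluster id, it can be convenient to capture it in a shell variable. The sketch below parses a sample submit message instead of calling condor_submit, so it runs anywhere; the sed pattern is an assumption based on the message format shown above:

```shell
# In real use: submit_out=$(condor_submit -name sn01-htc.cr.cnaf.infn.it -spool sleep.sub)
submit_out='1 job(s) submitted to cluster 342094.'
cluster_id=$(printf '%s\n' "$submit_out" | sed -n 's/.*cluster \([0-9][0-9]*\)\./\1/p')
echo "$cluster_id"   # → 342094
```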

To see all jobs launched by a user locally on a submit node, use the following command:

To see all jobs
condor_q -name sn01-htc.cr.cnaf.infn.it <user>

For example:

ashtimmermanus@ui-tier1 ~$ condor_q -name sn01-htc.cr.cnaf.infn.it ashtimmermanus


-- Schedd: sn01-htc.cr.cnaf.infn.it : <131.154.192.242:9618?... @ 06/25/24 14:52:09
OWNER BATCH_NAME      SUBMITTED   DONE   RUN    IDLE   HOLD  TOTAL JOB_IDS

Total for query: 0 jobs; 0 completed, 0 removed, 0 idle, 0 running, 0 held, 0 suspended 
Total for all users: 46789 jobs; 12431 completed, 1 removed, 23897 idle, 9992 running, 468 held, 0 suspended

To get the list of held jobs and the held reason add the option -held.


To see information about a single job, use the following command:

condor_q -name sn01-htc.cr.cnaf.infn.it <cluster id>

For example:

ashtimmermanus@ui-tier1 ~$ condor_q -name sn01-htc.cr.cnaf.infn.it 64667

-- Schedd: sn01-htc.cr.cnaf.infn.it : <131.154.192.242:9618?... @ 06/25/24 15:58:32
OWNER BATCH_NAME      SUBMITTED   DONE   RUN    IDLE   HOLD  TOTAL JOB_IDS

Total for query: 0 jobs; 0 completed, 0 removed, 0 idle, 0 running, 0 held, 0 suspended 
Total for all users: 50343 jobs; 12871 completed, 1 removed, 26961 idle, 10000 running, 510 held, 0 suspended


To investigate why a job ends up in a 'Held' state, use the following command:

condor_q -name sn01-htc.cr.cnaf.infn.it <cluster id> -af HoldReason


Finally, to get more detailed information on why a job takes a long time to enter the run state, use the -better-analyze option. For example:

ashtimmermanus@ui-tier1 ~$ condor_q -better-analyze -name sn01-htc.cr.cnaf.infn.it 342094


-- Schedd: sn01-htc.cr.cnaf.infn.it : <131.154.192.242:9618?...
The Requirements expression for job 342094.000 is

    ( !StringListMember(split(AcctGroup ?: "none",".")[0],t1_OverPledgeGroups ?: "",":")) &&
    (TARGET.t1_allow_sam isnt true) &&
    (( !StringListMember("gpfs_data",t1_GPFS_CHECK ?: "",":")) &&
      ((NumJobStarts == 0) && ((TARGET.Arch == "X86_64") &&
          (TARGET.OpSys == "LINUX") && (TARGET.Disk >= RequestDisk) &&
          (TARGET.Memory >= RequestMemory) && (TARGET.HasFileTransfer))))

Job 342094.000 defines the following attributes:

    AcctGroup = "dteam"
    DiskUsage = 1
    ImageSize = 1
    NumJobStarts = 0
    RequestDisk = DiskUsage
    RequestMemory = ifthenelse(MemoryUsage =!= undefined,MemoryUsage,(ImageSize + 1023) / 1024)

The Requirements expression for job 342094.000 reduces to these conditions:

         Slots
Step    Matched  Condition
-----  --------  ---------
[0]           0  StringListMember(split(AcctGroup ?: "none",".")[0],t1_OverPledgeGroups ?: "",":")
[1]       34451  TARGET.t1_allow_sam isnt true
[3]           0  StringListMember("gpfs_data",t1_GPFS_CHECK ?: "",":")
[5]       34387  TARGET.Arch == "X86_64"
[10]      26383  TARGET.Memory >= RequestMemory
[11]      26328  [5] && [10]
[16]      26323  [1] && [11]


342094.000:  Run analysis summary ignoring user priority.  Of 926 machines,
      8 are rejected by your job's requirements
     63 reject your job because of their own requirements
      0 match and are already running your jobs
      0 match but are serving other users
    855 are able to run your job


It is possible to format the output of condor_q with the option -af <ClassAd expression>:

-af     lists specific attributes
-af:j   shows the attribute names
-af:th  formats output in a nice table

Job outputs are not copied back automatically: the user must launch the condor_transfer_data command:

condor_transfer_data -name sn01-htc.cr.cnaf.infn.it <cluster id>

with the cluster id returned by the condor_submit command at submission time. For example:

ashtimmermanus@ui-tier1 ~$ condor_transfer_data -name sn01-htc.cr.cnaf.infn.it 342094
Fetching data files...

Finally, to remove a job use the condor_rm command:

ashtimmermanus@ui-tier1 ~$ condor_rm -name sn01-htc.cr.cnaf.infn.it 342094
All jobs in cluster 342094 have been marked for removal

Also, you can fix the submit node once for the whole session by exporting the following environment variable:

export _condor_SCHEDD_HOST=sn01-htc.cr.cnaf.infn.it

and the commands to submit and check the job become easier:

ashtimmermanus@ui-tier1 ~$ export _condor_SCHEDD_HOST=sn01-htc.cr.cnaf.infn.it
ashtimmermanus@ui-tier1 ~$ condor_submit -spool sleep.sub
Submitting job(s).
1 job(s) submitted to cluster 342097.
[ashtimmermanus@ui-tier1 ~]$ condor_q 342097


-- Schedd: sn01-htc.cr.cnaf.infn.it : <131.154.192.242:9618?... @ 06/26/24 14:22:16
OWNER          BATCH_NAME    SUBMITTED   DONE   RUN    IDLE  TOTAL JOB_IDS
ashtimmermanus ID: 342097   6/26 14:21      _      _      1      1 342097.0

Total for query: 1 jobs; 0 completed, 0 removed, 1 idle, 0 running, 0 held, 0 suspended 
Total for all users: 42786 jobs; 12809 completed, 4 removed, 19100 idle, 9999 running, 874 held, 0 suspended

Submit grid jobs without environment modules 

A token can be obtained from the command line using oidc-agent. The agent has to be started with the following command:

 eval `oidc-agent-service use`

This starts the agent and sets the required environment variables.

Job submission:

ashtimmerman@ui-tier1 ~$ export _condor_SEC_CLIENT_AUTHENTICATION_METHODS=SCITOKENS
ashtimmerman@ui-tier1 ~$ condor_submit -pool ce03-htc.cr.cnaf.infn.it:9619 -remote ce03-htc.cr.cnaf.infn.it -spool token_sleep.sub
Submitting job(s).
1 job(s) submitted to cluster 4037450.

ashtimmerman@ui-tier1 ~$ condor_q -pool ce03-htc.cr.cnaf.infn.it:9619 -name ce03-htc.cr.cnaf.infn.it 4037450
 
 -- Schedd: ce03-htc.cr.cnaf.infn.it : <131.154.192.106:25329?... @ 12/07/23 17:56:19
OWNER          BATCH_NAME     SUBMITTED   DONE   RUN    IDLE  TOTAL JOB_IDS
ashtimmermanus ID: 4037450  12/7  14:40      _      _      _      1 4037450.0

Total for query: 1 jobs; 1 completed, 0 removed, 0 idle, 0 running, 0 held, 0 suspended
Total for all users: 8752 jobs; 4843 completed, 0 removed, 591 idle, 3317 running, 1 held, 0 suspended
 

With VOMS proxies

First, create the proxy using the following command:

voms-proxy-init --voms <vo name>

then you can submit the job with the following commands:

export _condor_SEC_CLIENT_AUTHENTICATION_METHODS=SSL
condor_submit -pool ce02-htc.cr.cnaf.infn.it:9619 -remote ce02-htc.cr.cnaf.infn.it -spool sleep.sub

For example:

-bash-4.2$ export _condor_SEC_CLIENT_AUTHENTICATION_METHODS=SSL
-bash-4.2$ condor_submit -pool ce02-htc.cr.cnaf.infn.it:9619 -remote ce02-htc.cr.cnaf.infn.it -spool sleep.sub
Submitting job(s).
1 job(s) submitted to cluster 2015349.

where "sleep.sub" is the following submit file:

# submit description file
# sleep.sub -- simple sleep job

use_x509userproxy = true
# needed for all the operation where a certificate is required

+owner = undefined

delegate_job_GSI_credentials_lifetime = 0
# this has to be included if the proxy will last more than 24h, otherwise it will be reduced to 24h automatically

executable              = sleep.sh
log                     = sleep.log
output                  = outfile.txt
error                   = errors.txt
should_transfer_files   = Yes
when_to_transfer_output = ON_EXIT
queue

This file differs from the sleep.sub file submitted locally mainly by the command:

+owner = undefined

which allows the computing element to identify the user through the VOMS proxy.

Note that the submit description file of a grid job differs in several details from one that has to be submitted locally.

To check the job status of a single job use

condor_q -pool ce02-htc.cr.cnaf.infn.it:9619 -name ce02-htc.cr.cnaf.infn.it <cluster id>

So, for the previous example we have:

-bash-4.2$ condor_q -pool ce02-htc.cr.cnaf.infn.it:9619 -name ce02-htc.cr.cnaf.infn.it 2015349

-- Schedd: ce02-htc.cr.cnaf.infn.it : <131.154.192.41:9619?... @ 07/29/20 17:02:21
OWNER BATCH_NAME SUBMITTED DONE RUN IDLE TOTAL JOB_IDS
virgo008 ID: 2015349 7/29 16:59 _ _ 1 1 2015349.0

Total for query: 1 jobs; 0 completed, 0 removed, 1 idle, 0 running, 0 held, 0 suspended
Total for all users: 31430 jobs; 8881 completed, 2 removed, 676 idle, 1705 running, 20166 held, 0 suspended

The user is mapped through the VOMS proxy to the pool account virgo008, which appears as the owner of the job. To get the list of the jobs submitted by a user, just replace <cluster id> with <owner>:

-bash-4.2$ condor_q -pool ce02-htc.cr.cnaf.infn.it:9619 -name ce02-htc.cr.cnaf.infn.it virgo008

-- Schedd: ce02-htc.cr.cnaf.infn.it : <131.154.192.41:9619?... @ 07/29/20 17:09:42
OWNER BATCH_NAME SUBMITTED DONE RUN IDLE TOTAL JOB_IDS
virgo008 ID: 2014655 7/29 11:30 _ _ _ 1 2014655.0
virgo008 ID: 2014778 7/29 12:24 _ _ _ 1 2014778.0
virgo008 ID: 2014792 7/29 12:40 _ _ _ 1 2014792.0
virgo008 ID: 2015159 7/29 15:11 _ _ _ 1 2015159.0
virgo008 ID: 2015161 7/29 15:12 _ _ _ 1 2015161.0
virgo008 ID: 2015184 7/29 15:24 _ _ _ 1 2015184.0
virgo008 ID: 2015201 7/29 15:33 _ _ _ 1 2015201.0
virgo008 ID: 2015207 7/29 15:39 _ _ _ 1 2015207.0
virgo008 ID: 2015217 7/29 15:43 _ _ _ 1 2015217.0
virgo008 ID: 2015224 7/29 15:47 _ _ _ 1 2015224.0
virgo008 ID: 2015349 7/29 16:59 _ _ _ 1 2015349.0

Total for query: 11 jobs; 11 completed, 0 removed, 0 idle, 0 running, 0 held, 0 suspended
Total for all users: 31429 jobs; 8898 completed, 3 removed, 591 idle, 1737 running, 20200 held, 0 suspended

As in the local case, to retrieve the job outputs the user should launch:

condor_transfer_data -pool ce02-htc.cr.cnaf.infn.it:9619 -name ce02-htc.cr.cnaf.infn.it <cluster id>

with the cluster id returned by the condor_submit command at submission time:

-bash-4.2$ condor_transfer_data -pool ce02-htc.cr.cnaf.infn.it:9619 -name ce02-htc.cr.cnaf.infn.it 2015217
Fetching data files...
-bash-4.2$ ls
ce_testp308.sub errors.txt outfile.txt sleep.log sleep.sh sleep.sub test.sub

Finally, to remove a job submitted via Grid:

-bash-4.2$ condor_rm -pool ce02-htc.cr.cnaf.infn.it:9619 -name ce02-htc.cr.cnaf.infn.it 2015217
All jobs in cluster 2015217 have been marked for removal

Renewal of JWTs from local long-lived HTCondor jobs

With the adoption of JWTs - based on the openid-connect and OAuth2 protocols - in place of X509-based VOMS proxies for accessing grid storage, a compelling problem arises: how to automatically renew these JWT access tokens for long-lived batch jobs, i.e. those lasting longer than one hour, that require continued access to storage via grid protocols.

To address this issue for local HTCondor jobs, the Access Point has been equipped with the CredMon and CredMonOauth2 daemons. These components allow users to delegate a refresh token, which HTCondor can then use to supply valid access tokens to running jobs belonging to that authorizing user.

To enable the automatic JWT renewal two additional commands need to be added to the submit file:

submit commands to enable JWT renewal
use_oauth_services = list of credential providers (e.g. IAM instances), separated by comma
<name>_oauth_permissions = list of scopes to enable for JWTs emitted by the <name> provider, separated by comma
<name>_oauth_resource = audience for the JWTs emitted by the <name> provider, rarely used

The first command defines the comma-separated list of JWT issuers to be used. The string to set depends on how the CNAF T1 admins have configured HTCondor; please find your token issuer in the table below:

Token issuer                                  Credential provider name
https://iam-t1-computing.cloud.cnaf.infn.it   t1

If your token issuer is not present, please send an email to the user-support mailing list (user-support@lists.cnaf.infn.it) to request for inclusion.

Inside the job script running in the worker node you should be able to obtain a valid token by issuing the following command:

export BEARER_TOKEN=$(jq -r .access_token $_CONDOR_CREDS/<name>.use)


The file in $_CONDOR_CREDS/<name>.use will automatically be updated by HTCondor. Be sure to re-assign the BEARER_TOKEN environment variable right before any access to the grid storage, e.g. before any gfal-* command.
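As a self-contained illustration of this pattern, the snippet below simulates $_CONDOR_CREDS with a temporary directory and a hand-written t1.use file (inside a real job, HTCondor keeps that file updated with a fresh access token); it assumes jq is available, as in the command above:

```shell
# Simulate the credential directory that HTCondor provides to the job.
_CONDOR_CREDS=$(mktemp -d)
echo '{"access_token":"eyJhbGciOi.example.token"}' > "$_CONDOR_CREDS/t1.use"

# Re-read the (refreshed) file right before each storage access.
export BEARER_TOKEN=$(jq -r .access_token "$_CONDOR_CREDS/t1.use")
echo "$BEARER_TOKEN"   # → eyJhbGciOi.example.token
# then e.g.: gfal-copy ...
```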

On job submission, the condor_submit command will not actually submit any job until the scheduler is delegated to refresh the access tokens issued by the specified providers. In case no credentials have been delegated, the condor_submit command shows a web link on the terminal. Following the link in a web browser, a login page is shown and at the end of the login process the delegation is complete and the browser page can be closed. Subsequent executions of the condor_submit command will submit jobs that are enabled with continuously refreshed access tokens.

The delegated credentials will be kept by HTCondor until 60 minutes after the last job submitted with token renewal for those credentials is complete or deleted.

To see the currently stored credentials, use the condor_store_cred command:

query the delegated credentials
condor_store_cred query-oauth


You can at any time revoke the delegation by issuing the following command:

delete the delegated credentials
condor_store_cred delete-oauth -s <name>

Example

Here an example of a job using tokens emitted by iam-t1-computing is shown.

Consider the following basic submit file:

test-access-webdav.sub
executable            = test-access-webdav.sh
arguments             = 10m
output                = test-access-webdav-$(ClusterId).$(ProcId).out
error                 = test-access-webdav-$(ClusterId).$(ProcId).err
log                   = test-access-webdav-$(ClusterId).$(ProcId).log
primary_unix_group    = user-support
use_oauth_services    = t1
t1_oauth_permissions  = profile,iam,group,offline_access
should_transfer_files = yes
queue 1

and script:

test-access-webdav.sh
#!/bin/bash

# just simulate a job of unpredictable duration
sleep ${RANDOM}m

# copy the output of the job to a grid storage
export BEARER_TOKEN=$(jq -r .access_token $_CONDOR_CREDS/t1.use)
gfal-copy results davs://xfer-test.cr.cnaf.infn.it:8443/user-support/my-super-important-results


Upon the first condor_submit, a web link is shown on the terminal.

Terminal session
 $ condor_submit test-access-webdav.sub
Submitting job(s)
Hello, cpellegr.
Please visit: https://sn01-htc.cr.cnaf.infn.it/key/0B150b161b8e59fe5052REDACTED575d48ka84b4e2c7ab04610bd0a1b6932e995

By following it, the CredMon login page is loaded. Click on the "Login" button.

The IAM login process is not reported here. After login, the scopes approval page is shown.

Upon authorization, the process is completed.

The next condor_submit actually submits a job.

Terminal session
 $ condor_submit test-access-webdav.sub
Submitting job(s).
1 job(s) submitted to cluster 17642117.

Experiment share usage

If a user wants to know the usage of an entire experiment group, in particular the number of jobs submitted by each user of the experiment, the command is:

condor_q -all -name sn01-htc -cons 'AcctGroup == "<exp-name>"' -af Owner jobstatus | sort | uniq -c

The output will look like this:

-bash-4.2$ condor_q -all -name sn01-htc -cons 'AcctGroup == "pulp-fiction"' -af Owner '{"Idle", "Running", "Removed", "Completed", "Held"}[jobstatus-1]' | sort | uniq -c
1 MWallace Idle
161 VVega Completed
605 Wolf Running
572 JulesW Idle
5884 Ringo Running
33 Butch Running
5 Jody Held

The first column shows the number of submitted jobs, the second the user who submitted them, and the third the job status.




