HTCondor is a job scheduler. You give HTCondor a file containing commands that instruct it on how to manage jobs. Within the pool of machines, HTCondor locates the best machine that can run each job, packages up the job and ships it off to this execute machine. The jobs run, and output is returned to the machine that submitted the jobs.
Submission to the cluster with environment modules
To ease the use of HTCondor we implemented a solution based on environment modules. The traditional interaction methods, i.e. specifying all command-line options, remain valid, though less handy and more verbose. The HTC modules set all the environment variables needed to correctly submit to the HTCondor cluster.
Once logged into any Tier 1 user interface, this utility will be available. You can list all the available modules using:
ashtimmermanus@ui-tier1 ~$ module avail
---------------------------- /opt/exp_software/opssw/modules/modulefiles ----------------------------
htc/auth  htc/ce  htc/local  use.own

Key:
modulepath  default-version
These htc/* modules have different roles.
All modules in the htc family provide on-line help via the "module help <module name>" command, e.g.:
budda@ui-tier1:~ $ module help htc
-------------------------------------------------------------------
Module Specific Help for /opt/exp_software/opssw/modules/modulefiles/htc/local:

Defines environment variables and aliases to ease the interaction with
the INFN-T1 HTCondor local job submission system
-------------------------------------------------------------------
Local Submission
To submit local jobs or query the local schedd (sn01-htc, the HTCondor cluster access point), use the htc/local module. This is the default module loaded when loading the "htc" family. So, to use this HTC module you can simply switch to htc and run condor_q to show the local job queue:
ashtimmermanus@ui-tier1 ~$ module switch htc   # default is htc/local
ashtimmermanus@ui-tier1 ~$ condor_q

-- Schedd: sn01-htc.cr.cnaf.infn.it : <131.154.192.242:9618?... @ 06/21/24 14:41:41
OWNER BATCH_NAME SUBMITTED DONE RUN IDLE HOLD TOTAL JOB_IDS

Total for query: 0 jobs; 0 completed, 0 removed, 0 idle, 0 running, 0 held, 0 suspended
Total for ashtimmermanus: 0 jobs; 0 completed, 0 removed, 0 idle, 0 running, 0 held, 0 suspended
Total for all users: 47942 jobs; 12542 completed, 0 removed, 25401 idle, 9999 running, 0 held, 0 suspended
A simple example of an executable is the following sleep job that waits for the specified amount of time and then exits:
apascolinit1@ui-tier1 ~ $ cat sleep.sh
#!/bin/env bash
sleep $1
To submit this sample job, we need to specify all the details, such as the name and location of the executable and of all needed input files, in a submit description file, where each line has the form command = value:
apascolinit1@ui-tier1 ~ $ cat submit.sub
# submit description file
# submit.sub -- simple sleep job
batch_name              = Local-Sleep
executable              = sleep.sh
arguments               = 3600
log                     = $(batch_name).log.$(Process)
output                  = $(batch_name).out.$(Process)
error                   = $(batch_name).err.$(Process)
should_transfer_files   = Yes
when_to_transfer_output = ON_EXIT
queue

apascolinit1@ui-tier1 ~ $ module switch htc
apascolinit1@ui-tier1 ~ $ condor_submit submit.sub
Submitting job(s).
1 job(s) submitted to cluster 15.

apascolinit1@ui-tier1 ~ $ condor_q

-- Schedd: sn01-htc.cr.cnaf.infn.it : <131.154.192.242:9618?... @ 03/18/24 17:15:44
OWNER        BATCH_NAME   SUBMITTED   DONE  RUN  IDLE  TOTAL  JOB_IDS
apascolinit1 Local-Sleep  3/18 17:15    _    1    _      1    15.0

Total for query: 1 jobs; 0 completed, 0 removed, 0 idle, 1 running, 0 held, 0 suspended
Total for apascolinit1: 1 jobs; 0 completed, 0 removed, 0 idle, 1 running, 0 held, 0 suspended
Total for all users: 1 jobs; 0 completed, 0 removed, 0 idle, 1 running, 0 held, 0 suspended
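For reference, the same submit description can launch several identical jobs in one cluster by giving a count to the queue command; $(Process) then expands to 0, 1, 2, ... so each job writes its own log, output and error files. A minimal sketch reusing the sleep.sh executable above (the file name submit_many.sub is just an example):

# submit_many.sub -- sketch: submit 5 sleep jobs in a single cluster
batch_name              = Local-Sleep-Array
executable              = sleep.sh
arguments               = 60
log                     = $(batch_name).log.$(Process)
output                  = $(batch_name).out.$(Process)
error                   = $(batch_name).err.$(Process)
should_transfer_files   = Yes
when_to_transfer_output = ON_EXIT
queue 5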
GRID submission with environment modules
The htc/ce module eases the use of the condor_q and condor_submit commands by setting up all the variables needed to contact our Grid Compute Entrypoints.
ashtimmermanus@ui-tier1 ~$ module switch htc/ce Don't forget to "export BEARER_TOKEN=$(oidc-token <client-name>)"! Switching from htc/local{ver=23} to htc/ce{auth=SCITOKENS:num=2} Loading requirement: htc/auth{auth=SCITOKENS}
The module accepts two parameters that can be specified on the module switch command line, as shown below (bold = default value).
parameter | values | description |
---|---|---|
num | 1, **2**, 3, 4, 5, 6 | connects to ce{num}-htc |
auth | VOMS, **SCITOKENS** | calls the htc/auth module with the selected authentication method |
To explicitly use token authentication on ce03-htc, use the following module switch command:
ashtimmermanus@ui-tier1 ~$ module switch htc/ce auth=SCITOKENS num=3
Don't forget to "export BEARER_TOKEN=$(oidc-token <client-name>)"!

Switching from htc/ce{auth=SCITOKENS:num=2} to htc/ce{auth=SCITOKENS:num=3}

Unloading useless requirement: htc/auth{auth=SCITOKENS}
Loading requirement: htc/auth{auth=SCITOKENS}
The following command starts a usable OIDC agent and makes it available for the current shell. If oidc-agent-service has already started an agent for you, this agent will be reused and made available.
ashtimmermanus@ui-tier1 ~$ eval `oidc-agent-service use`
36535
To see the list of locally configured OIDC clients, use the oidc-add -l command:
ashtimmermanus@ui-tier1 ~$ oidc-add -l
The following account configurations are usable: aksieniia_token testhtc
If you haven't registered an OIDC client yet, follow the next steps:
1. To register a new OIDC client, use the following command (this needs to be done only the first time, to create the client):
ashtimmermanus@ui-tier1 ~$ oidc-gen -w device
Enter short name for the account to configure: htc_23
[1] https://iam-t1-computing.cloud.cnaf.infn.it/
[2] https://iam-test.indigo-datacloud.eu/
[3] https://iam.deep-hybrid-datacloud.eu/
[4] https://iam.extreme-datacloud.eu/
[5] https://iam-demo.cloud.cnaf.infn.it/
[6] https://b2access.eudat.eu:8443/oauth2
[7] https://b2access-integration.fz-juelich.de/oauth2
[8] https://login-dev.helmholtz.de/oauth2
[9] https://login.helmholtz.de/oauth2
[10] https://services.humanbrainproject.eu/oidc/
[11] https://accounts.google.com
[12] https://aai-dev.egi.eu/auth/realms/egi
[13] https://aai-demo.egi.eu/auth/realms/egi
[14] https://aai.egi.eu/auth/realms/egi
[15] https://login.elixir-czech.org/oidc/
[16] https://oidc.scc.kit.edu/auth/realms/kit
[17] https://wlcg.cloud.cnaf.infn.it/
Issuer [https://iam-t1-computing.cloud.cnaf.infn.it/]:
The following scopes are supported: openid profile email address phone offline_access eduperson_scoped_affiliation eduperson_entitlement eduperson_assurance entitlements wlcg.groups
Scopes or 'max' (space separated) [openid profile offline_access]: profile wlcg.groups wlcg compute.create compute.modify compute.read compute.cancel
Registering Client ...
Generating account configuration ...
accepted
...
...
Enter encryption password for account configuration 'htc_23':
Confirm encryption Password:
Everything setup correctly!
The -w device option instructs oidc-agent to use the device code flow for authentication, which is the recommended way with INDIGO-IAM. oidc-agent will display a list of providers that can be used for registration:
[1] https://wlcg.cloud.cnaf.infn.it/
[2] https://iam-test.indigo-datacloud.eu/
...
[20] https://oidc.scc.kit.edu/auth/realms/kit/
Select one of the registered providers, or type a custom issuer (for IAM, the last character of the issuer string is always a /, e.g. https://wlcg.cloud.cnaf.infn.it/).
Then oidc-agent asks for the scopes. Typing max (without quotes) requests all the allowed scopes, but this is discouraged: specify instead the minimum scopes required for the task the client is registered for. For job submission to HTCondor-CE, these scopes are (they can also be passed directly on the oidc-gen command line, as sketched after this walkthrough):
- compute.create
- compute.modify
- compute.read
- compute.cancel
oidc-agent will register a new client and store the client credentials, encrypted, on the user interface. IAM then asks you to authorize the client to operate on your behalf: authenticate with your browser at the indicated web address and enter the code shown on the terminal. Finally, you are prompted twice to set a password that encrypts the client configuration on the machine. It can be any password of your choice, but it must be remembered: it is required whenever the client is loaded with the oidc-add <client_name> command.
After a client is registered, it can be used to obtain access tokens; in our example this is the htc_23 client. You do not need to run oidc-gen again unless you want to update the configuration or create a new client.
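If you prefer to pass the issuer and the scopes non-interactively, oidc-gen also accepts them as command-line options; the option names used below (--iss and --scope) are an assumption to be checked against oidc-gen --help on your installation:

oidc-gen htc_23 -w device --iss https://iam-t1-computing.cloud.cnaf.infn.it/ --scope "openid profile wlcg.groups compute.create compute.modify compute.read compute.cancel"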
2. You can then load an account in the agent with the oidc-add <client name> command, as follows:
ashtimmermanus@ui-tier1 ~$ oidc-add htc_23
Enter decryption password for account config 'htc_23':
success
Once you have loaded the account, you can use oidc-token to get access tokens for it. Access tokens are typically valid for 60 minutes, and their scopes determine the actions that can be performed with them.
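To verify the lifetime and scopes of a token you obtained, you can decode its payload (IAM/WLCG access tokens are JWTs). A minimal sketch, assuming the htc_23 client registered above and python3 available on the user interface:

oidc-token htc_23 | python3 -c '
import sys, json, base64, datetime
payload = sys.stdin.read().strip().split(".")[1]   # the JWT payload is the 2nd dot-separated field
payload += "=" * (-len(payload) % 4)               # restore base64url padding
claims = json.loads(base64.urlsafe_b64decode(payload))
print("scope  :", claims.get("scope"))
print("expires:", datetime.datetime.fromtimestamp(claims["exp"]))
'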
3. To submit a job, you must save the token in a file and point the scitokens_file command of the submit file to its path. Tokens can be obtained with the oidc-token command, as follows:
ashtimmermanus@ui-tier1 ~$ MASK=$(umask); umask 0077; oidc-token htc_23 > ${HOME}/token; umask $MASK
The umask setting limits the permissions on the token file so that it is readable and writable only by its owner, for obvious security reasons.
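A quick check that the resulting file mode is as intended:

ls -l ${HOME}/token    # expected mode: -rw------- (owner read/write only)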
4. To submit a test job, you can use the following example file:
ashtimmermanus@ui-tier1 ~$ cat submit_token.sub
# submit description file
# submit_token.sub -- simple sleep job
scitokens_file          = $ENV(HOME)/token
+owner                  = undefined
batch_name              = Grid-Token-Sleep
executable              = sleep.sh
arguments               = 3600
log                     = $(batch_name).log.$(Process)
output                  = $(batch_name).out.$(Process)
error                   = $(batch_name).err.$(Process)
should_transfer_files   = Yes
when_to_transfer_output = ON_EXIT
queue
where scitokens_file = $ENV(HOME)/token is the path to the file containing the SciToken used to authenticate to the CE.
5. Switch to the htc/ce module for Grid submission with the module switch command:
ashtimmermanus@ui-tier1 ~$ module switch htc/ce auth=SCITOKENS num=1
Don't forget to "export BEARER_TOKEN=$(oidc-token <client-name>)"!

Switching from htc/ce{auth=SCITOKENS:num=1} to htc/ce{auth=SCITOKENS:num=1}

Loading requirement: htc/auth{auth=SCITOKENS}
6. Submit a simple test job with the condor_submit command:
ashtimmermanus@ui-tier1 ~$ export BEARER_TOKEN=$(oidc-token htc_23)
ashtimmermanus@ui-tier1 ~$ condor_submit submit_token.sub
Submitting job(s).
1 job(s) submitted to cluster 52465.
7. You can see information about the job with the condor_q command, as follows:
ashtimmermanus@ui-tier1 ~$ condor_q

-- Schedd: ce01-htc.cr.cnaf.infn.it : <131.154.193.64:9619?... @ 06/21/24 16:39:44
OWNER          BATCH_NAME        SUBMITTED   DONE  RUN  IDLE  TOTAL  JOB_IDS
ashtimmermanus Grid-Token-Sleep  6/21 16:38    _    _    1      1    52465.0

Total for query: 1 jobs; 0 completed, 0 removed, 1 idle, 0 running, 0 held, 0 suspended
Total for ashtimmermanus: 1 jobs; 0 completed, 0 removed, 1 idle, 0 running, 0 held, 0 suspended
Total for all users: 4239 jobs; 1297 completed, 0 removed, 228 idle, 2709 running, 5 held, 0 suspended
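When the job completes, the spooled output files can be fetched and the job removed with the usual client commands. Assuming the htc/ce module is still loaded (so that the CE schedd selected above is used by all client commands), a sketch with the cluster id from the example would be:

condor_transfer_data 52465
condor_rm 52465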
SSL submission
SSL submission replaces the VOMS proxy (GSI) submission, although the two processes are almost identical from the user's point of view.
CAVEAT
To be able to submit jobs using the SSL authentication, your x509 User Proxy FQAN must be mapped in the CE configuration before job submission.
You will need to send to the support team, via the user-support@lists.cnaf.infn.it mailing list, the output of the voms-proxy-info --all --chain command corresponding to a valid VOMS proxy:
budda@ui-tier1:~ $ voms-proxy-info --all --chain
=== Proxy Chain Information ===
X.509 v3 certificate
Subject: CN=1569994718,CN=Carmelo Pellegrino cpellegr@infn.it,O=Istituto Nazionale di Fisica Nucleare,C=IT,DC=tcs,DC=terena,DC=org
Issuer: CN=Carmelo Pellegrino cpellegr@infn.it,O=Istituto Nazionale di Fisica Nucleare,C=IT,DC=tcs,DC=terena,DC=org
Valid from: Tue Apr 09 16:18:41 CEST 2024
Valid to: Wed Apr 10 04:18:41 CEST 2024
CA: false
Signature alg: SHA384WITHRSA
Public key type: RSA 2048bit
Allowed usage: digitalSignature keyEncipherment
Serial number: 1569994718
VOMS extensions: yes.

X.509 v3 certificate
Subject: CN=Carmelo Pellegrino cpellegr@infn.it,O=Istituto Nazionale di Fisica Nucleare,C=IT,DC=tcs,DC=terena,DC=org
Issuer: CN=GEANT TCS Authentication RSA CA 4B,O=GEANT Vereniging,C=NL
Valid from: Mon Oct 16 12:57:40 CEST 2023
Valid to: Thu Nov 14 11:57:40 CET 2024
Subject alternative names:
  email: carmelo.pellegrino@cnaf.infn.it
CA: false
Signature alg: SHA384WITHRSA
Public key type: RSA 8192bit
Allowed usage: digitalSignature keyEncipherment
Allowed extended usage: clientAuth emailProtection
Serial number: 73237961961532056736463686571865333148

=== Proxy Information ===
subject   : /DC=org/DC=terena/DC=tcs/C=IT/O=Istituto Nazionale di Fisica Nucleare/CN=Carmelo Pellegrino cpellegr@infn.it/CN=1569994718
issuer    : /DC=org/DC=terena/DC=tcs/C=IT/O=Istituto Nazionale di Fisica Nucleare/CN=Carmelo Pellegrino cpellegr@infn.it
identity  : /DC=org/DC=terena/DC=tcs/C=IT/O=Istituto Nazionale di Fisica Nucleare/CN=Carmelo Pellegrino cpellegr@infn.it
type      : RFC3820 compliant impersonation proxy
strength  : 2048
path      : /tmp/x509up_u23069
timeleft  : 00:00:00
key usage : Digital Signature, Key Encipherment
=== VO km3net.org extension information ===
VO        : km3net.org
subject   : /DC=org/DC=terena/DC=tcs/C=IT/O=Istituto Nazionale di Fisica Nucleare/CN=Carmelo Pellegrino cpellegr@infn.it
issuer    : /DC=org/DC=terena/DC=tcs/C=IT/ST=Napoli/O=Universita degli Studi di Napoli FEDERICO II/CN=voms02.scope.unina.it
attribute : /km3net.org/Role=NULL/Capability=NULL
timeleft  : 00:00:00
uri       : voms02.scope.unina.it:15005
- Get a proxy with voms-proxy-init:

  apascolinit1@ui-tier1 ~ $ voms-proxy-init --voms cms
  Enter GRID pass phrase for this identity:
  Contacting voms2.cern.ch:15002 [/DC=ch/DC=cern/OU=computers/CN=voms2.cern.ch] "cms"...
  Remote VOMS server contacted succesfully.

  Created proxy in /tmp/x509up_u23077.

  Your proxy is valid until Tue Mar 19 22:39:41 CET 2024
- Submit a job to the CE. Submit file:

  apascolinit1@ui-tier1 ~ $ cat submit_ssl.sub
  # Unix submit description file
  # submit_ssl.sub -- simple sleep job
  use_x509userproxy       = true
  +owner                  = undefined
  batch_name              = Grid-SSL-Sleep
  executable              = sleep.sh
  arguments               = 3600
  log                     = $(batch_name).log.$(Process)
  output                  = $(batch_name).out.$(Process)
  error                   = $(batch_name).err.$(Process)
  should_transfer_files   = Yes
  when_to_transfer_output = ON_EXIT
  queue
- Submit the job with SSL authentication:

  apascolinit1@ui-tier1 ~ $ module switch htc/ce auth=VOMS num=1
  Don't forget to voms-proxy-init!

  apascolinit1@ui-tier1 ~ $ condor_submit submit_ssl.sub
  Submitting job(s).
  1 job(s) submitted to cluster 36.

  apascolinit1@ui-tier1 ~ $ condor_q

  -- Schedd: ce01-htc.cr.cnaf.infn.it : <131.154.193.64:9619?... @ 03/19/24 10:45:18
  OWNER      BATCH_NAME      SUBMITTED   DONE  RUN  IDLE  TOTAL  JOB_IDS
  apascolini Grid-SSL-Sleep  3/19 10:44    _    1    _      1    36.0

  Total for query: 1 jobs; 0 completed, 0 removed, 0 idle, 1 running, 0 held, 0 suspended
  Total for apascolini: 1 jobs; 0 completed, 0 removed, 0 idle, 1 running, 0 held, 0 suspended
  Total for all users: 2 jobs; 1 completed, 0 removed, 0 idle, 1 running, 0 held, 0 suspended
Local job submission without environment modules
To submit jobs locally, i.e. from a CNAF UI, use the condor_submit command. The following options are required:
- -name sn01-htc.cr.cnaf.infn.it: to correctly address the job to the submit node;
- -spool: to transfer the input files and keep a local copy of the output files;
  - if -spool is not specified, the user should use -remote sn01-htc.cr.cnaf.infn.it instead of -name sn01-htc.cr.cnaf.infn.it;
- the submit description file (a .sub file containing the relevant information for the batch system), given as an argument.
IMPORTANT NOTE
The -spool option is mandatory when the submit folder is a home directory. Home directories are NOT present on worker nodes. The -spool option of condor_submit activates the spool mechanism, which copies input and output files back and forth between the submit folder and the worker node scratch directory.
For example:
ashtimmermanus@ui-tier1 ~$ condor_submit -name sn01-htc.cr.cnaf.infn.it -spool sleep.sub
Submitting job(s).
1 job(s) submitted to cluster 342094.
where 342094 is the cluster id.
To see all the jobs launched by a user on the submit node, use the following command:
condor_q -name sn01-htc.cr.cnaf.infn.it <user>
For example:
ashtimmermanus@ui-tier1 ~$ condor_q -name sn01-htc.cr.cnaf.infn.it ashtimmermanus

-- Schedd: sn01-htc.cr.cnaf.infn.it : <131.154.192.242:9618?... @ 06/25/24 14:52:09
OWNER BATCH_NAME SUBMITTED DONE RUN IDLE HOLD TOTAL JOB_IDS

Total for query: 0 jobs; 0 completed, 0 removed, 0 idle, 0 running, 0 held, 0 suspended
Total for all users: 46789 jobs; 12431 completed, 1 removed, 23897 idle, 9992 running, 468 held, 0 suspended
To get the list of held jobs and the hold reason, add the -held option.
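For example, a minimal sketch against the same schedd:

condor_q -name sn01-htc.cr.cnaf.infn.it <user> -held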
To see information about a single job, use the following command:
condor_q -name sn01-htc.cr.cnaf.infn.it <cluster id>
For example:
ashtimmermanus@ui-tier1 ~$ condor_q -name sn01-htc.cr.cnaf.infn.it 64667

-- Schedd: sn01-htc.cr.cnaf.infn.it : <131.154.192.242:9618?... @ 06/25/24 15:58:32
OWNER BATCH_NAME SUBMITTED DONE RUN IDLE HOLD TOTAL JOB_IDS

Total for query: 0 jobs; 0 completed, 0 removed, 0 idle, 0 running, 0 held, 0 suspended
Total for all users: 50343 jobs; 12871 completed, 1 removed, 26961 idle, 10000 running, 510 held, 0 suspended
To investigate why a job ends up in a 'Held' state, use the following command:
condor_q -name sn01-htc.cr.cnaf.infn.it <cluster id> -af HoldReason
Finally, to get more detailed information on why a job takes a long time to enter the run state, use the -better-analyze option. For example:
ashtimmermanus@ui-tier1 ~$ condor_q -better-analyze -name sn01-htc.cr.cnaf.infn.it 342094

-- Schedd: sn01-htc.cr.cnaf.infn.it : <131.154.192.242:9618?...
The Requirements expression for job 342094.000 is

    ( !StringListMember(split(AcctGroup ?: "none",".")[0],t1_OverPledgeGroups ?: "",":")) &&
    (TARGET.t1_allow_sam isnt true) &&
    (( !StringListMember("gpfs_data",t1_GPFS_CHECK ?: "",":")) &&
      ((NumJobStarts == 0) &&
        ((TARGET.Arch == "X86_64") && (TARGET.OpSys == "LINUX") && (TARGET.Disk >= RequestDisk) &&
          (TARGET.Memory >= RequestMemory) && (TARGET.HasFileTransfer))))

Job 342094.000 defines the following attributes:

    AcctGroup = "dteam"
    DiskUsage = 1
    ImageSize = 1
    NumJobStarts = 0
    RequestDisk = DiskUsage
    RequestMemory = ifthenelse(MemoryUsage =!= undefined,MemoryUsage,(ImageSize + 1023) / 1024)

The Requirements expression for job 342094.000 reduces to these conditions:

         Slots
    Step    Matched  Condition
    -----  --------  ---------
    [0]           0  StringListMember(split(AcctGroup ?: "none",".")[0],t1_OverPledgeGroups ?: "",":")
    [1]       34451  TARGET.t1_allow_sam isnt true
    [3]           0  StringListMember("gpfs_data",t1_GPFS_CHECK ?: "",":")
    [5]       34387  TARGET.Arch == "X86_64"
    [10]      26383  TARGET.Memory >= RequestMemory
    [11]      26328  [5] && [10]
    [16]      26323  [1] && [11]

342094.000: Run analysis summary ignoring user priority. Of 926 machines,
      8 are rejected by your job's requirements
     63 reject your job because of their own requirements
      0 match and are already running your jobs
      0 match but are serving other users
    855 are able to run your job
It is possible to format the output of condor_q with the -af <ClassAd expression> option:
- -af lists the specified attributes
- -af:j additionally prints the job ID as the first column
- -af:th formats the output as a tab-separated table with headings
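For instance, a sketch combining these options to print a few standard job ClassAd attributes as a table with headings:

condor_q -name sn01-htc.cr.cnaf.infn.it <user> -af:th ClusterId Owner JobStatus RequestCpus RequestMemory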
The job outputs are not copied back automatically: the user should launch the condor_transfer_data command:
condor_transfer_data -name sn01-htc.cr.cnaf.infn.it <cluster id>
with the cluster id returned by the condor_submit command at submission time. For example:
ashtimmermanus@ui-tier1 ~$ condor_transfer_data -name sn01-htc.cr.cnaf.infn.it 342094
Fetching data files...
Finally, to remove a job use the condor_rm command:
ashtimmermanus@ui-tier1 ~$ condor_rm -name sn01-htc.cr.cnaf.infn.it 342094
All jobs in cluster 342094 have been marked for removal
Also, to fix once and for all the submit node you want to submit to, you can export the following environment variable:
export _condor_SCHEDD_HOST=sn01-htc.cr.cnaf.infn.it
so that the commands to submit and check the job become simpler:
ashtimmermanus@ui-tier1 ~$ export _condor_SCHEDD_HOST=sn01-htc.cr.cnaf.infn.it
ashtimmermanus@ui-tier1 ~$ condor_submit -spool sleep.sub
Submitting job(s).
1 job(s) submitted to cluster 342097.

ashtimmermanus@ui-tier1 ~$ condor_q 342097

-- Schedd: sn01-htc.cr.cnaf.infn.it : <131.154.192.242:9618?... @ 06/26/24 14:22:16
OWNER          BATCH_NAME  SUBMITTED   DONE  RUN  IDLE  TOTAL  JOB_IDS
ashtimmermanus ID: 342097  6/26 14:21    _    _    1      1    342097.0

Total for query: 1 jobs; 0 completed, 0 removed, 1 idle, 0 running, 0 held, 0 suspended
Total for all users: 42786 jobs; 12809 completed, 4 removed, 19100 idle, 9999 running, 874 held, 0 suspended
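With _condor_SCHEDD_HOST exported, the other client commands described above should likewise work without the -name option. A sketch, using the cluster id from the example:

condor_transfer_data 342097
condor_q 342097 -af HoldReason
condor_rm 342097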
Submit grid jobs without environment modules
A token can be obtained from the command line using oidc-agent. The agent has to be started with the following command:
eval `oidc-agent-service use`
This starts the agent and sets the required environment variables.
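Before submitting with token authentication, load the registered client and make a fresh token available, both in the file referenced by the scitokens_file command of the submit description and in the BEARER_TOKEN variable. A sketch, assuming the htc_23 client and the token path used in the previous sections:

oidc-add htc_23                            # load the client in the agent
MASK=$(umask); umask 0077; oidc-token htc_23 > ${HOME}/token; umask $MASK
export BEARER_TOKEN=$(oidc-token htc_23)   # used by the SCITOKENS authentication method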
Job submission:
ashtimmerman@ui-tier1 ~$ export _condor_SEC_CLIENT_AUTHENTICATION_METHODS=SCITOKENS
ashtimmerman@ui-tier1 ~$ condor_submit -pool ce07-htc.cr.cnaf.infn.it:9619 -remote ce07-htc.cr.cnaf.infn.it -spool token_sleep.sub
Submitting job(s).
1 job(s) submitted to cluster 4037450.

ashtimmerman@ui-tier1 ~$ condor_q -pool ce07-htc.cr.cnaf.infn.it:9619 -name ce07-htc.cr.cnaf.infn.it 4037450

-- Schedd: ce07-htc.cr.cnaf.infn.it : <131.154.192.106:25329?... @ 12/07/23 17:56:19
OWNER          BATCH_NAME   SUBMITTED   DONE  RUN  IDLE  TOTAL  JOB_IDS
ashtimmermanus ID: 4037450  12/7 14:40    _    _    _      1    4037450.0

Total for query: 1 jobs; 1 completed, 0 removed, 0 idle, 0 running, 0 held, 0 suspended
Total for all users: 8752 jobs; 4843 completed, 0 removed, 591 idle, 3317 running, 1 held, 0 suspended
With VOMS proxies
First, create the proxy using the following command:
voms-proxy-init --voms <vo name>
then you can submit the job with the following commands:
export _condor_SEC_CLIENT_AUTHENTICATION_METHODS=SSL
condor_submit -pool ce02-htc.cr.cnaf.infn.it:9619 -remote ce02-htc.cr.cnaf.infn.it -spool sleep.sub
For example:
-bash-4.2$ export _condor_SEC_CLIENT_AUTHENTICATION_METHODS=SSL
-bash-4.2$ condor_submit -pool ce02-htc.cr.cnaf.infn.it:9619 -remote ce02-htc.cr.cnaf.infn.it -spool sleep.sub
Submitting job(s).
1 job(s) submitted to cluster 2015349.
where "sleep.sub
" is the following submit file:
# submit description file
# sleep.sub -- simple sleep job
use_x509userproxy = true                    # needed for all the operation where a certificate is required
+owner = undefined
delegate_job_GSI_credentials_lifetime = 0   # this has to be included if the proxy will last more than 24h, otherwise it will be reduced to 24h automatically
executable              = sleep.sh
log                     = sleep.log
output                  = outfile.txt
error                   = errors.txt
should_transfer_files   = Yes
when_to_transfer_output = ON_EXIT
queue
This file differs from the sleep.sub submitted locally mainly in the line
+owner = undefined
which allows the Compute Entrypoint to identify the user through the VOMS proxy. Note that the submit description file of a grid job is, in general, different from one that has to be submitted locally.
To check the status of a single job, use:
condor_q -pool ce02-htc.cr.cnaf.infn.it:9619 -name ce02-htc.cr.cnaf.infn.it <cluster id>
So, for the previous example we have:
-bash-4.2$ condor_q -pool ce02-htc.cr.cnaf.infn.it:9619 -name ce02-htc.cr.cnaf.infn.it 2015349

-- Schedd: ce02-htc.cr.cnaf.infn.it : <131.154.192.41:9619?... @ 07/29/20 17:02:21
OWNER    BATCH_NAME   SUBMITTED   DONE  RUN  IDLE  TOTAL  JOB_IDS
virgo008 ID: 2015349  7/29 16:59    _    _    1      1    2015349.0

Total for query: 1 jobs; 0 completed, 0 removed, 1 idle, 0 running, 0 held, 0 suspended
Total for all users: 31430 jobs; 8881 completed, 2 removed, 676 idle, 1705 running, 20166 held, 0 suspended
The user is mapped through the VOMS proxy to the user name virgo008, which appears as the owner of the job. To get the list of the jobs submitted by a user, just replace <cluster id> with <owner>:
-bash-4.2$ condor_q -pool ce02-htc.cr.cnaf.infn.it:9619 -name ce02-htc.cr.cnaf.infn.it virgo008

-- Schedd: ce02-htc.cr.cnaf.infn.it : <131.154.192.41:9619?... @ 07/29/20 17:09:42
OWNER    BATCH_NAME   SUBMITTED   DONE  RUN  IDLE  TOTAL  JOB_IDS
virgo008 ID: 2014655  7/29 11:30    _    _    _      1    2014655.0
virgo008 ID: 2014778  7/29 12:24    _    _    _      1    2014778.0
virgo008 ID: 2014792  7/29 12:40    _    _    _      1    2014792.0
virgo008 ID: 2015159  7/29 15:11    _    _    _      1    2015159.0
virgo008 ID: 2015161  7/29 15:12    _    _    _      1    2015161.0
virgo008 ID: 2015184  7/29 15:24    _    _    _      1    2015184.0
virgo008 ID: 2015201  7/29 15:33    _    _    _      1    2015201.0
virgo008 ID: 2015207  7/29 15:39    _    _    _      1    2015207.0
virgo008 ID: 2015217  7/29 15:43    _    _    _      1    2015217.0
virgo008 ID: 2015224  7/29 15:47    _    _    _      1    2015224.0
virgo008 ID: 2015349  7/29 16:59    _    _    _      1    2015349.0

Total for query: 11 jobs; 11 completed, 0 removed, 0 idle, 0 running, 0 held, 0 suspended
Total for all users: 31429 jobs; 8898 completed, 3 removed, 591 idle, 1737 running, 20200 held, 0 suspended
As in the local case, to get the job outputs the user should launch:
condor_transfer_data -pool ce02-htc.cr.cnaf.infn.it:9619 -name ce02-htc.cr.cnaf.infn.it <cluster id>
with the cluster id returned by the condor_submit command at submission time:
-bash-4.2$ condor_transfer_data -pool ce02-htc.cr.cnaf.infn.it:9619 -name ce02-htc.cr.cnaf.infn.it 2015217
Fetching data files...

-bash-4.2$ ls
ce_testp308.sub  errors.txt  outfile.txt  sleep.log  sleep.sh  sleep.sub  test.sub
And to remove a job submitted via grid:
-bash-4.2$ condor_rm -pool ce02-htc.cr.cnaf.infn.it:9619 -name ce02-htc.cr.cnaf.infn.it 2015217
All jobs in cluster 2015217 have been marked for removal
Long-lived proxy inside a job
One problem that may occur while running a grid job is the short default lifetime of the VOMS proxy used to submit it: the job will be aborted if it does not finish before the proxy expires. The easiest solution would be to use very long-lived proxies, but at the expense of an increased security risk. Furthermore, the duration of a VOMS proxy is limited by the VOMS server and cannot be made arbitrarily long.
To overcome this limitation, a proxy credential repository system is used, which allows the user to create and store a long-term proxy in a dedicated server (a "MyProxy" server). At Tier-1 this MyProxy store is myproxy.cnaf.infn.it.
For instance, the following command retrieves a grid proxy with a lifetime of 168 hours and stores it in the MyProxy server with credentials valid for 720 hours:
dlattanzio@ui-tier1 ~$ myproxy-init --proxy_lifetime 168 --cred_lifetime 720 --voms vo.padme.org --pshost myproxy.cnaf.infn.it --dn_as_username --credname proxyCred --local_proxy
Enter GRID pass phrase for this identity:
Contacting voms2.cnaf.infn.it:15020 [/DC=org/DC=terena/DC=tcs/C=IT/ST=Roma/O=Istituto Nazionale di Fisica Nucleare/OU=CNAF/CN=voms2.cnaf.infn.it] "vo.padme.org"...
Remote VOMS server contacted succesfully.

voms2.cnaf.infn.it:15020: The validity of this VOMS AC in your proxy is shortened to 86400 seconds!

Created proxy in /tmp/myproxy-proxy.10164.21287.

Your proxy is valid until Sat Dec 24 18:34:42 CET 2022
Enter MyProxy pass phrase:
Verifying - Enter MyProxy pass phrase:
Your identity: /DC=org/DC=terena/DC=tcs/C=IT/O=Istituto Nazionale di Fisica Nucleare/CN=Daniele Lattanzio dlattanzio@infn.it
Creating proxy ............................................ Done
Proxy Verify OK
Your proxy is valid until: Thu Dec  1 18:34:50 2022
A proxy valid for 720 hours (30.0 days) for user /DC=org/DC=terena/DC=tcs/C=IT/O=Istituto Nazionale di Fisica Nucleare/CN=Daniele Lattanzio dlattanzio@infn.it now exists on myproxy.cnaf.infn.it.
As seen above, the user will be requested first to insert the GRID certificate password, and then a "MyProxy pass phrase" for future proxy retrievals.
The same password must also be indicated in the submission file, which will be similar to the one in the example below:
# submit description file
# sleep2.sub -- simple sleep job
use_x509userproxy = true                    # needed for all the operation where a certificate is required
+owner = undefined
MyProxyHost = myproxy.cnaf.infn.it:7512
MyProxyPassword = ***
MyProxyCredentialName = proxyCred
MyProxyRefreshThreshold = 3300
MyProxyNewProxyLifetime = 1440
delegate_job_GSI_credentials_lifetime = 0   # this has to be included if the proxy will last more than 24h, otherwise it will be reduced to 24h automatically
executable              = sleep.sh
log                     = sleep.log
output                  = outfile.txt
error                   = errors.txt
should_transfer_files   = Yes
when_to_transfer_output = ON_EXIT
queue
where MyProxyRefreshThreshold is the time (in seconds) before proxy expiration at which the proxy should be refreshed, and MyProxyNewProxyLifetime is the new lifetime (in minutes) of the proxy after it is refreshed. With the values above, the proxy is refreshed 3300 s (55 minutes) before it expires, and each refreshed proxy is valid for 1440 minutes (24 hours).
Experiment share usage
If a user wants to know the usage of an entire experiment group, and in particular to see the number of jobs submitted by each user of the experiment, the command is:
condor_q -all -name sn01-htc -cons 'AcctGroup == "<exp-name>"' -af Owner jobstatus | sort | uniq -c
The output will look like this:
-bash-4.2$ condor_q -all -name sn01-htc -cons 'AcctGroup == "pulp-fiction"' -af Owner '{"Idle", "Running", "Removed", "Completed", "Held"}[jobstatus-1]' | sort | uniq -c
      1 MWallace Idle
    161 VVega Completed
    605 Wolf Running
    572 JulesW Idle
   5884 Ringo Running
     33 Butch Running
      5 Jody Held
The first column shows the number of submitted jobs, the second the user who submitted them, and the third the job status.
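Similarly, to count only the running jobs of each user in the group, the constraint can be extended with the numeric job status (2 = Running). A sketch:

condor_q -all -name sn01-htc -cons 'AcctGroup == "<exp-name>" && JobStatus == 2' -af Owner | sort | uniq -c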