Job submission can be direct, using the HTCondor batch system from the UIs, or via Grid middleware from a UI that can be located anywhere.
In the latter case authentication and authorization are by default managed through Virtual Organizations (VO) [12]. To become a member of a VO, the user needs a personal certificate and has to enroll on a VOMS server.
Some examples of VOs hosted on the CNAF VOMS servers (all replicated at another site, normally Padova or Naples) are:
- https://voms.cnaf.infn.it:8443/vomses
- https://voms2.cnaf.infn.it:8443/vomses
- https://vomsmania.cnaf.infn.it:8443/vomses
Information about the hosted VOs can be found at these endpoints.
Infrastructure
CNAF Tier-1 replaced LSF with HTCondor in the first half of 2020. The infrastructure consists of:
- sn-02.cr.cnaf.infn.it: a submit node for submission from a local UI.
- ce02-htc.cr.cnaf.infn.it, ce03-htc.cr.cnaf.infn.it, ce04-htc.cr.cnaf.infn.it, ce05-htc.cr.cnaf.infn.it, ce06-htc.cr.cnaf.infn.it, ce07-htc.cr.cnaf.infn.it: 6 computing elements for grid submission.
- htc-1.cr.cnaf.infn.it, htc-2.cr.cnaf.infn.it, htc-3.cr.cnaf.infn.it: three central manager nodes in high availability.
- 830 worker nodes.
On a UI the user can submit a job to the submit node, which then deals with the routing to the central manager, responsible for dispatching the jobs to the worker nodes.
CEs in production:
- grid submission endpoints:
ce02-htc.cr.cnaf.infn.it:9619, ce03-htc.cr.cnaf.infn.it:9619, ce04-htc.cr.cnaf.infn.it:9619, ce05-htc.cr.cnaf.infn.it:9619, ce06-htc.cr.cnaf.infn.it:9619, ce07-htc.cr.cnaf.infn.it:9619.
- local submission (from CNAF UI) endpoint:
sn-02.cr.cnaf.infn.it.
HTCondor docs and guides are available here [6], [7].
HTCondor jobs
HTCondor is a job scheduler. You give HTCondor a file containing commands that tell it how to run jobs. HTCondor locates a machine that can run each job within the pool of machines, packages up the job and ships it off to this execute machine. The jobs run, and output is returned to the machine that submitted the jobs.
For HTCondor to run a job, it must be given details such as the names and location of the executable and all needed input files. These details are specified in the submit description file.
A simple example of an executable is a sleep job that waits for 40 seconds and then exits:
#!/bin/bash
# file name: sleep.sh
TIMETOWAIT="40"
echo "sleeping for $TIMETOWAIT seconds"
/bin/sleep $TIMETOWAIT
To submit this sample job, we need to specify all the details, such as the names and locations of the executable and all needed input files, in a submit description file where each line has the form
command_name = value
like this:
# Unix submit description file
# sleep.sub -- simple sleep job
executable = sleep.sh
log = sleep.log
output = outfile.txt
error = errors.txt
should_transfer_files = Yes
when_to_transfer_output = ON_EXIT
queue
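To submit several instances of the same executable at once, the submit description file can use the $(Process) macro, which takes a different value for each queued job. A minimal sketch (the file name sleep_many.sub is illustrative):

```
# sleep_many.sub -- queue 5 identical jobs; $(Process) runs from 0 to 4
executable = sleep.sh
log        = sleep.log
output     = outfile.$(Process).txt
error      = errors.$(Process).txt
should_transfer_files   = Yes
when_to_transfer_output = ON_EXIT
queue 5
```

Each job then writes its own output and error files, avoiding clashes between instances.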
Submit local jobs
To submit jobs locally, i.e. from a CNAF UI, use the condor_submit command. The following options are required:
- -name sn-02.cr.cnaf.infn.it: to correctly address the job to the submit node;
- -spool: to transfer the input files and keep a local copy of the output files. If -spool is not requested, the user should use -remote sn-02.cr.cnaf.infn.it instead of -name sn-02.cr.cnaf.infn.it;
- the submit description file (a .sub file containing the relevant information for the batch system), to be indicated as argument.
For example:
-bash-4.2$ condor_submit -name sn-02.cr.cnaf.infn.it -spool sleep.sub
Submitting job(s).
1 job(s) submitted to cluster 6798880.
where 6798880 is the cluster id.
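Scripts that submit jobs often need this cluster id later, e.g. for condor_q or condor_transfer_data. It can be parsed from the condor_submit output; a minimal sketch (the helper name and the sample message are illustrative, mimicking the output shown above):

```shell
#!/bin/bash
# Hypothetical helper: extract the cluster id from a condor_submit message
# of the form "1 job(s) submitted to cluster 6798880."
parse_cluster_id() {
  sed -n 's/.*submitted to cluster \([0-9]*\)\..*/\1/p'
}

# Example with the message printed by the submission above:
echo "1 job(s) submitted to cluster 6798880." | parse_cluster_id   # prints 6798880
```

The id can then be stored in a variable, e.g. CLUSTER=$(condor_submit ... | parse_cluster_id), and reused in the follow-up commands.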
To see all jobs launched by a user locally on a submit node use
condor_q -name sn-02.cr.cnaf.infn.it <user>
For example:
-bash-4.2$ condor_q -name sn-02.cr.cnaf.infn.it arendina
-- Schedd: sn-02.cr.cnaf.infn.it : <131.154.192.58:9618?... @ 07/29/20 11:11:57
OWNER    BATCH_NAME   SUBMITTED   DONE   RUN   IDLE  TOTAL JOB_IDS
arendina ID: 6785585  7/28 15:35     _     _      _      1 6785585.0
arendina ID: 6785603  7/28 15:46     _     _      _      1 6785603.0
arendina ID: 6798880  7/29 10:19     _     _      _      1 6798880.0
Total for query: 3 jobs; 3 completed, 0 removed, 0 idle, 0 running, 0 held, 0 suspended
Total for all users: 47266 jobs; 35648 completed, 3 removed, 6946 idle, 4643 running, 26 held, 0 suspended
To get the list of held jobs and the hold reason, add the option -held.
To see information about a single job, use:
condor_q -name sn-02.cr.cnaf.infn.it <cluster id>
To investigate why a job ended up in the 'Held' state:
condor_q -name sn-02.cr.cnaf.infn.it <cluster id> -af HoldReason
and to get more detailed information use the option -better-analyze.
For example:
-bash-4.2$ condor_q -better-analyze -name sn-02.cr.cnaf.infn.it 6805590

-- Schedd: sn-02.cr.cnaf.infn.it : <131.154.192.58:9618?...
The Requirements expression for job 6805590.000 is

    (TARGET.Arch == "X86_64") && (TARGET.OpSys == "LINUX") &&
    (TARGET.Disk >= RequestDisk) && (TARGET.Memory >= RequestMemory) &&
    (TARGET.HasFileTransfer)

Job 6805590.000 defines the following attributes:

    DiskUsage = 20
    ImageSize = 275
    MemoryUsage = ((ResidentSetSize + 1023) / 1024)
    RequestDisk = DiskUsage
    RequestMemory = ifthenelse(MemoryUsage =!= undefined,MemoryUsage,(ImageSize + 1023) / 1024)
    ResidentSetSize = 0

The Requirements expression for job 6805590.000 reduces to these conditions:

         Slots
Step    Matched  Condition
-----  --------  ---------
[0]       24584  TARGET.Arch == "X86_64"
[1]       24584  TARGET.OpSys == "LINUX"
[3]       24584  TARGET.Disk >= RequestDisk
[5]       24584  TARGET.Memory >= RequestMemory
[7]       24584  TARGET.HasFileTransfer

6805590.000:  Job is completed.

Last successful match: Wed Jul 29 16:37:03 2020

6805590.000:  Run analysis summary ignoring user priority.  Of 829 machines,
      0 are rejected by your job's requirements
    122 reject your job because of their own requirements
      0 match and are already running your jobs
      0 match but are serving other users
    707 are able to run your job
It is possible to format the output of condor_q with the option -af:
- -af lists specific attributes
- -af:j shows the attribute names
- -af:th formats a nice table
The job outputs are not copied back automatically. The user should launch:
condor_transfer_data -name sn-02.cr.cnaf.infn.it <cluster id>
with the cluster id returned by condor_submit
command at submission:
-bash-4.2$ condor_transfer_data -name sn-02.cr.cnaf.infn.it 6806037
Fetching data files...
-bash-4.2$ ls
ce_testp308.sub  errors.txt  outfile.txt  sleep.log  sleep.sh  sleep.sub  test.sub
At the end, to remove a job use the command condor_rm:
-bash-4.2$ condor_rm -name sn-02.cr.cnaf.infn.it 6806037
All jobs in cluster 6806037 have been marked for removal
Also, to pin the submit node you want to submit jobs to, you can launch the command
export _condor_SCHEDD_HOST=sn-02.cr.cnaf.infn.it
and the commands to submit and check the job become easier:
-bash-4.2$ export _condor_SCHEDD_HOST=sn-02.cr.cnaf.infn.it
-bash-4.2$ condor_submit -spool sleep.sub
Submitting job(s).
1 job(s) submitted to cluster 8178760.
-bash-4.2$ condor_q 8178760
-- Schedd: sn-02.cr.cnaf.infn.it : <131.154.192.58:9618?... @ 09/02/20 10:25:30
OWNER    BATCH_NAME   SUBMITTED   DONE   RUN   IDLE  TOTAL JOB_IDS
arendina ID: 8178760  9/2  10:25     _     _      1      1 8178760.0
Total for query: 1 jobs; 0 completed, 0 removed, 1 idle, 0 running, 0 held, 0 suspended
Total for all users: 32798 jobs; 13677 completed, 2 removed, 13991 idle, 5115 running, 13 held, 0 suspended
Submit grid jobs
First, create the proxy:
voms-proxy-init --voms <vo name>
then you can submit the job with the following commands:
export _condor_SEC_CLIENT_AUTHENTICATION_METHODS=GSI
condor_submit -pool ce02-htc.cr.cnaf.infn.it:9619 -remote ce02-htc.cr.cnaf.infn.it -spool sleep.sub
For example:
-bash-4.2$ export _condor_SEC_CLIENT_AUTHENTICATION_METHODS=GSI
-bash-4.2$ condor_submit -pool ce02-htc.cr.cnaf.infn.it:9619 -remote ce02-htc.cr.cnaf.infn.it -spool sleep.sub
Submitting job(s).
1 job(s) submitted to cluster 2015349.
where "sleep.sub" is the submit file:
# Unix submit description file
# sleep.sub -- simple sleep job
# use_x509userproxy is needed for all the operations where a certificate is required
use_x509userproxy = true
+owner = undefined
# delegate_job_GSI_credentials_lifetime has to be included if the proxy will last
# more than 24h, otherwise it will be reduced to 24h automatically
delegate_job_GSI_credentials_lifetime = 0
executable = sleep.sh
log = sleep.log
output = outfile.txt
error = errors.txt
should_transfer_files = Yes
when_to_transfer_output = ON_EXIT
queue
This file differs from the sleep.sub file submitted locally mainly by the command:
+owner = undefined
which allows the computing element to identify the user through the voms proxy.
Note that the submit description file of a grid job is therefore structurally different from one that has to be submitted locally.
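Before submitting, it can also be convenient to check that the proxy created with voms-proxy-init is still valid; voms-proxy-info --timeleft prints the remaining lifetime in seconds. A minimal sketch of such a check follows; the helper name and the one-hour threshold are illustrative choices, not part of any tool:

```shell
#!/bin/bash
# Hypothetical helper: decide whether a proxy should be renewed, given its
# remaining lifetime in seconds, e.g. timeleft=$(voms-proxy-info --timeleft).
needs_renewal() {
  local timeleft=$1
  local threshold=${2:-3600}   # renew when less than one hour is left (arbitrary)
  [ "$timeleft" -lt "$threshold" ]
}

needs_renewal 120  && echo "renew the proxy with voms-proxy-init"
needs_renewal 7200 || echo "proxy still valid"
```

A job script can call such a check first and abort early with a clear message instead of failing later with an authentication error.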
To check the job status of a single job use
condor_q -pool ce02-htc.cr.cnaf.infn.it:9619 -name ce02-htc.cr.cnaf.infn.it <cluster id>
So, for the previous example we have:
-bash-4.2$ condor_q -pool ce02-htc.cr.cnaf.infn.it:9619 -name ce02-htc.cr.cnaf.infn.it 2015349
-- Schedd: ce02-htc.cr.cnaf.infn.it : <131.154.192.41:9619?... @ 07/29/20 17:02:21
OWNER    BATCH_NAME   SUBMITTED   DONE   RUN   IDLE  TOTAL JOB_IDS
virgo008 ID: 2015349  7/29 16:59     _     _      1      1 2015349.0
Total for query: 1 jobs; 0 completed, 0 removed, 1 idle, 0 running, 0 held, 0 suspended
Total for all users: 31430 jobs; 8881 completed, 2 removed, 676 idle, 1705 running, 20166 held, 0 suspended
The user is mapped through the voms proxy to the user name virgo008 as owner of the job. Then, to get the list of all the submitted jobs of a user, just replace <cluster id> with <owner>:
-bash-4.2$ condor_q -pool ce02-htc.cr.cnaf.infn.it:9619 -name ce02-htc.cr.cnaf.infn.it virgo008
-- Schedd: ce02-htc.cr.cnaf.infn.it : <131.154.192.41:9619?... @ 07/29/20 17:09:42
OWNER    BATCH_NAME   SUBMITTED   DONE   RUN   IDLE  TOTAL JOB_IDS
virgo008 ID: 2014655  7/29 11:30     _     _      _      1 2014655.0
virgo008 ID: 2014778  7/29 12:24     _     _      _      1 2014778.0
virgo008 ID: 2014792  7/29 12:40     _     _      _      1 2014792.0
virgo008 ID: 2015159  7/29 15:11     _     _      _      1 2015159.0
virgo008 ID: 2015161  7/29 15:12     _     _      _      1 2015161.0
virgo008 ID: 2015184  7/29 15:24     _     _      _      1 2015184.0
virgo008 ID: 2015201  7/29 15:33     _     _      _      1 2015201.0
virgo008 ID: 2015207  7/29 15:39     _     _      _      1 2015207.0
virgo008 ID: 2015217  7/29 15:43     _     _      _      1 2015217.0
virgo008 ID: 2015224  7/29 15:47     _     _      _      1 2015224.0
virgo008 ID: 2015349  7/29 16:59     _     _      _      1 2015349.0
Total for query: 11 jobs; 11 completed, 0 removed, 0 idle, 0 running, 0 held, 0 suspended
Total for all users: 31429 jobs; 8898 completed, 3 removed, 591 idle, 1737 running, 20200 held, 0 suspended
As in the local case, to get the job outputs the user should launch:
condor_transfer_data -pool ce02-htc.cr.cnaf.infn.it:9619 -name ce02-htc.cr.cnaf.infn.it <cluster id>
with the cluster id returned by condor_submit
command at submission:
-bash-4.2$ condor_transfer_data -pool ce02-htc.cr.cnaf.infn.it:9619 -name ce02-htc.cr.cnaf.infn.it 2015217
Fetching data files...
-bash-4.2$ ls
ce_testp308.sub  errors.txt  outfile.txt  sleep.log  sleep.sh  sleep.sub  test.sub
And to remove a job submitted via grid:
-bash-4.2$ condor_rm -pool ce02-htc.cr.cnaf.infn.it:9619 -name ce02-htc.cr.cnaf.infn.it 2015217
All jobs in cluster 2015217 have been marked for removal
Experiment share usage
If a user wants to know the usage of an entire experiment group, in particular the number of jobs submitted by each user of the experiment, the command is:
condor_q -all -name sn-02 -cons 'AcctGroup == "<exp-name>"' -af Owner jobstatus | sort | uniq -c
The output will look like this:
-bash-4.2$ condor_q -all -name sn-01 -cons 'AcctGroup == "pulp-fiction"' -af Owner jobstatus | sort | uniq -c
      1 MWallace 1
      3 VVega 4
     20 Wolf 4
    572 JulesW 1
   1606 Ringo 4
      1 Butch 2
      5 Jody 4
In the first column there is the number of submitted jobs, in the second there is the user who has submitted them and in the third there is the jobs' status (1=pending, 2=running, 3=removed, 4=completed, 5=held, 6=submission_err).
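The numeric status in the third column can be translated into a human-readable label with a small filter appended to the pipeline above. A minimal sketch follows; the helper name is illustrative and the sample lines mimic the table above:

```shell
#!/bin/bash
# Hypothetical filter: annotate "count owner jobstatus" lines with a label,
# using the status code mapping described in the text above.
annotate_status() {
  awk 'BEGIN {
         s[1]="pending"; s[2]="running"; s[3]="removed";
         s[4]="completed"; s[5]="held"; s[6]="submission_err"
       }
       { print $1, $2, s[$3] }'
}

printf '572 JulesW 1\n20 Wolf 4\n' | annotate_status
# 572 JulesW pending
# 20 Wolf completed
```

In practice one would pipe the condor_q output through it, e.g. ... | sort | uniq -c | annotate_status.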
Examples
All the options and submit description commands of the condor_submit
command are available in the Command Reference Manual [26]. For a short guide on the submit description file and its commands, see Appendix A.
Some helpful examples follow below.
CPUs, GPUs and RAM requests
Generally, it can be useful for a job to specify the number of CPUs or the amount of required RAM with the options:
request_cpus = <number of CPUs>
request_memory = <RAM amount in MB>
in the command lines of the job submit file. For example, this can be the script of a submit description file with specific requests of CPUs and RAM:
-bash-4.2$ cat sleep.sub
# Unix submit description file
# sleep.sub -- simple sleep job
request_cpus = 2
request_memory = 1000
executable = sleep.sh
log = sleep.log
output = outfile.txt
error = errors.txt
should_transfer_files = Yes
when_to_transfer_output = ON_EXIT
queue
On the other hand, if your job has to use GPUs for running, you have to insert the right requirement:
+WantGPU = true
request_GPUs = 1
requirements = (TARGET.CUDACapability >= 1.2) && (TARGET.CUDADeviceName =?= "Tesla K40m") && $(requirements:True)
Jobs with ROOT-program as executable
First of all, you have to setup ROOT before using it. Your collaboration may have installed a ROOT distribution in /opt/exp_software (which is a location shared between the user-interface and the worker nodes).
In this case you should find one or more ROOT installation directories there:
[fornaricta@ui-tier1]$ ls /opt/exp_software/cta/local_software/root/
5.34.26  5.34.36  5.34.38  root  root-6.10.08  root-6.16.00  root_build_5.34.38
so you can choose your preferred version with a submit file like the following one:
[fornaricta@ui-tier1]$ cat test.sub
universe = vanilla
executable = test.sh
arguments = 5.34.26
output = job.out
error = job.err
log = job.log
WhenToTransferOutput = ON_EXIT
ShouldTransferFiles = YES
queue 1
where the executable file has this content:
[fornaricta@ui-tier1]$ cat test.sh
#!/bin/bash
source /storage/gpfs_data/ctalocal/fornaricta/root_config.sh $1
/opt/exp_software/cta/local_software/root/$1/bin/root -b -q
and the configuration script (located on a gpfs path, shared between the user-interface and the worker nodes) is:
[fornaricta@ui-tier1]$ cat /storage/gpfs_data/ctalocal/fornaricta/root_config.sh
#!/bin/bash
export LD_LIBRARY_PATH=/opt/exp_software/cta/local_software/root/$1/lib/root:$LD_LIBRARY_PATH
Submitting:
[fornaricta@ui-tier1]$ condor_submit -spool -name sn-02.cr.cnaf.infn.it test.sub
Submitting job(s).
1 job(s) submitted to cluster 5824045.
[fornaricta@ui-tier1]$ condor_q -name sn-02.cr.cnaf.infn.it 5824045.0
-- Schedd: sn-02.cr.cnaf.infn.it : <131.154.192.58:9618?... @ 07/08/20 18:54:17
OWNER      BATCH_NAME   SUBMITTED   DONE   RUN   IDLE  TOTAL JOB_IDS
fornaricta ID: 5824045  7/8  18:54     _     _      1      1 5824045.0
Total for query: 1 jobs; 0 completed, 0 removed, 1 idle, 0 running, 0 held, 0 suspended
Total for all users: 45425 jobs; 25623 completed, 3 removed, 13181 idle, 5297 running, 1321 held, 0 suspended
[fornaricta@ui-tier1]$ condor_q -name sn-02.cr.cnaf.infn.it 5824045.0
-- Schedd: sn-01.cr.cnaf.infn.it : <131.154.192.58:9618?... @ 07/08/20 18:55:03
OWNER      BATCH_NAME   SUBMITTED   DONE   RUN   IDLE  TOTAL JOB_IDS
fornaricta ID: 5824045  7/8  18:54     _     1      _      1 5824045.0
Total for query: 1 jobs; 0 completed, 0 removed, 0 idle, 1 running, 0 held, 0 suspended
Total for all users: 45474 jobs; 25644 completed, 3 removed, 13222 idle, 5286 running, 1319 held, 0 suspended
[fornaricta@ui-tier1]$ condor_q -name sn-02.cr.cnaf.infn.it 5824045.0
-- Schedd: sn-01.cr.cnaf.infn.it : <131.154.192.58:9618?... @ 07/08/20 18:55:04
OWNER      BATCH_NAME   SUBMITTED   DONE   RUN   IDLE  TOTAL JOB_IDS
fornaricta ID: 5824045  7/8  18:54     _     _      _      1 5824045.0
Total for query: 1 jobs; 1 completed, 0 removed, 0 idle, 0 running, 0 held, 0 suspended
Total for all users: 45476 jobs; 25646 completed, 3 removed, 13223 idle, 5284 running, 1320 held, 0 suspended
[fornaricta@ui-tier1]$ condor_transfer_data -name sn-02.cr.cnaf.infn.it 5824045.0
Fetching data files...
[fornaricta@ui-tier1]$ cat job.err
[fornaricta@ui-tier1]$ cat job.out
  *******************************************
  *                                         *
  *        W E L C O M E  to  R O O T       *
  *                                         *
  *   Version   5.34/26  20 February 2015   *
  *                                         *
  *  You are welcome to visit our Web site  *
  *          http://root.cern.ch            *
  *                                         *
  *******************************************

ROOT 5.34/26 (v5-34-26@v5-34-26, Jun 16 2015, 18:41:55 on linuxx8664gcc)

CINT/ROOT C/C++ Interpreter version 5.18.00, July 2, 2010
Type ? for help. Commands must be C++ statements.
Enclose multiple statements between { }
If no ROOT installation is available in /opt/exp_software, you can source one of the multiple distributions available from CVMFS:
[fornarivirgo@ui01-virgo root_test]$ ls /cvmfs/sft.cern.ch/lcg/releases/ROOT/
5.34.24-64287  6.06.06-71859  6.10.00-8b404  6.12.04-4473c  6.12.06-76fef  6.14.00-66c89
6.14.04-2a3e5  6.14.04-dedca  6.16.00-23725  6.16.00-5be98  6.16.00-b4729  6.18.00-d0330
...
For instance:
[fornarivirgo@ui01-virgo root_test]$ cat test.sub
universe = vanilla
Executable = test.sh
ShouldTransferFiles = YES
WhenToTransferOutput = ON_EXIT
Log = log.log
Output = log.out
Error = log.err
queue 1
[fornarivirgo@ui01-virgo root_test]$ cat test.sh
#!/bin/bash
. /cvmfs/sft.cern.ch/lcg/views/setupViews.sh LCG_96python3 x86_64-centos7-gcc8-opt
root -b -q
[fornarivirgo@ui01-virgo root_test]$ condor_submit -spool -name sn-02.cr.cnaf.infn.it test.sub
Submitting job(s).
1 job(s) submitted to cluster 8445482.
[fornarivirgo@ui01-virgo root_test]$ condor_q -name sn-02.cr.cnaf.infn.it 8445482
-- Schedd: sn-02.cr.cnaf.infn.it : <131.154.192.58:9618?... @ 09/10/20 17:21:42
OWNER        BATCH_NAME   SUBMITTED   DONE   RUN   IDLE  TOTAL JOB_IDS
fornarivirgo ID: 8445482  9/10 17:21     _     _      1      1 8445482.0
Total for query: 1 jobs; 0 completed, 0 removed, 1 idle, 0 running, 0 held, 0 suspended
Total for all users: 39956 jobs; 21240 completed, 2 removed, 13371 idle, 4218 running, 1125 held, 0 suspended
[fornarivirgo@ui01-virgo root_test]$ condor_q -name sn-02.cr.cnaf.infn.it 8445482
-- Schedd: sn-02.cr.cnaf.infn.it : <131.154.192.58:9618?... @ 09/10/20 17:23:58
OWNER        BATCH_NAME   SUBMITTED   DONE   RUN   IDLE  TOTAL JOB_IDS
fornarivirgo ID: 8445482  9/10 17:21     _     _      _      1 8445482.0
Total for query: 1 jobs; 1 completed, 0 removed, 0 idle, 0 running, 0 held, 0 suspended
Total for all users: 39959 jobs; 21288 completed, 2 removed, 13341 idle, 4202 running, 1126 held, 0 suspended
[fornarivirgo@ui01-virgo root_test]$ condor_transfer_data -name sn-02.cr.cnaf.infn.it 8445482
Fetching data files...
[fornarivirgo@ui01-virgo root_test]$ cat log.err
[fornarivirgo@ui01-virgo root_test]$ cat log.out
   ------------------------------------------------------------
  | Welcome to ROOT 6.18/00                  https://root.cern |
  |                               (c) 1995-2019, The ROOT Team |
  | Built for linuxx8664gcc on Jun 25 2019, 09:22:23           |
  | From tags/v6-18-00@v6-18-00                                |
  | Try '.help', '.demo', '.license', '.credits', '.quit'/'.q' |
   ------------------------------------------------------------
Singularity in batch jobs
All the CNAF computing servers support containerization through Singularity [24]. Singularity is a containerization tool for running software in a reproducible way on various platforms. With the advent of new operating systems, programming languages, and libraries, containers offer a sustainable solution for running old software. The software is shipped in a so-called image, i.e. a file or folder containing a minimal operating system, the application to run and all its dependencies.
This section of the Tier-1 User Guide is intended to drive the user towards a transition from a native application workflow to one based on containers.
Singularity supports several image formats:
- Singularity .img files
- Singularity images in registry via the shub:// protocol
- Docker images in registry via the docker:// protocol
- a tar archive, optionally bzipped or gzipped
- a folder
Obtain images
Official images for official software may have already been prepared by the software group within your experiment and be available through a shared filesystem (such as CVMFS), SingularityHub or other supported repository. Please check with your software manager about the support to Singularity.
Create a new image using a recipe (expert users)
A Singularity recipe is a text file that contains all the instructions and configurations needed to build an image. It is organized in a header and a number of optional sections. The header is composed of a list of configuration settings in the form "keyword: value", describing the system to use as a base for the new image. Sections are identified by the % sign followed by a keyword. For a detailed list of all possible sections refer to [25].
An example of recipe file follows:
## Assume the file is called Recipe.txt and the final image is named name.img
# Headers
BootStrap: docker
From: cern/slc6-base

## Help section
%help
    This text will be available by running singularity help name.img

## Metadata to store (available through "singularity inspect name.img")
%labels
    AUTHOR Carmelo Pellegrino
    VERSION 1.0

## Environment variables setup
%environment
    export LD_LIBRARY_PATH=/usr/local/lib/:$LD_LIBRARY_PATH

## Files to be copied in image from host system
%files
    # source (in host system) [destination (in container, default "/")]
    ./code/ /

## Commands to execute in container during image building, after bootstrap
%post
    yum -y update && yum -y upgrade && \
    yum -y install --setopt=tsflags=nodocs \
        devtoolset-6 compat-gcc-34-g77 \
        cernlib-g77-devel cernlib-g77-utils

## Commands to be executed in default container instantiation
%runscript
    exec /bin/bash
Build it with:
sudo singularity build name.img Recipe.txt
NOTE: since building a new image requires root access on the host system, it will not be possible on CNAF computing resources.
Run software
Singularity allows running image-shipped software by simply prepending singularity exec image.img to the command line. The current working directory (CWD) is maintained and non-system files are available, so that, for example, the output of the ls command run in a container is not very different from the same command run natively.
Suppose the g77 compiler version 3.4 is contained in the name.img image. Example commands and their output follow:
$ cat hw.f
      program hello_world
      implicit none
      write ( *, * ) 'Hello, world!'
      stop
      end
$ singularity exec name.img g77 --version   # i.e.: "g77 --version" runs in a container
GNU Fortran (GCC) 3.4.6 20060404 (Red Hat 3.4.6-19.el6)
Copyright (C) 2006 Free Software Foundation, Inc.

GNU Fortran comes with NO WARRANTY, to the extent permitted by law.
You may redistribute copies of GNU Fortran
under the terms of the GNU General Public License.
For more information about these matters, see the file named COPYING
or type the command `info -f g77 Copying'.

$ singularity exec name.img g77 hw.f -o hw
$ singularity exec name.img ./hw
 Hello, world!
If a particular external folder is not available in the container, for example exp_software or cvmfs, it can be explicitly bound by means of the -B orig:dest
command line option. Multiple folders can be bound at the same time:
$ singularity exec -B /cvmfs/:/cvmfs/ -B /opt/exp_software/:/opt/exp_software/ name.img ./hw
How to run Geant4 using Singularity
Geant4 needs old versions of the "liblzma" and "libpcre" libraries shipped with Scientific Linux 6; for this reason an SL6 environment (provided via a Singularity container) is required to compile and execute the software.
Recompile Geant4 inside a container instantiated from a Singularity image with SL6 (usually provided by CNAF User Support), so that the libraries used are actually those that Geant4 requires. To do this, just start a Singularity container with the following command:
$ singularity exec /opt/exp_software/cupid/UI_SL6.img /bin/bash
Singularity>
then in the folder where the Geant sources are present, they can be recompiled.
If it is necessary to access a host operating system path, simply use the "-B" (bind path) option, specifying the host path and the mountpoint in the container separated by a colon. For example, to access /opt/exp_software/cupid, change the previous command to:
$ singularity exec -B /opt/exp_software:/opt/exp_software /opt/exp_software/cupid/UI_SL6.img /bin/bash
Singularity>
Once Geant has been recompiled, to submit a job the executable must be written so as to launch the Geant-related commands within the Singularity container.
To run, for example, the "cuspide" application, a line like the following must be inserted in the script:
singularity exec --no-home -B /opt/exp_software:/opt/exp_software /opt/exp_software/cupid/UI_SL6.img /opt/exp_software/cupid/geant4/bin/10.04.p02/cuspide
where the "--no-home" option prevents Singularity from mounting the user's home: the job does not have the permissions to do so and would fail otherwise.
Jupyter notebook in interactive batch jobs
At Tier-1 it is now possible to use Jupyter notebooks served by JupyterHub. The service is reachable via browser at the following page: https://jupyterhub-t1.cr.cnaf.infn.it/
Once you get there, you will be asked to log in using your bastion account credentials. The account must belong to an experiment with pledged CPU resources on the batch system.
Moreover, right after the login it is also possible to customize the jupyter environment following the instructions at the User environment customization paragraph.
When you login, the Hub service submits a local HTCondor job which is named jupyter-<username>
. You can check its status from your user interface as a local job submitted on the sn-02, with the following command:
-bash-4.2$ condor_q -name sn-02
-- Schedd: sn-02.cr.cnaf.infn.it : <131.154.192.42:9618?... @ 01/13/22 17:50:38
OWNER           BATCH_NAME              SUBMITTED   DONE   RUN   IDLE  TOTAL JOB_IDS
dlattanzioauger jupyter-dlattanzioauger 1/13 17:47     _     1      _      1 1035919.0
Total for query: 1 jobs; 0 completed, 0 removed, 0 idle, 1 running, 0 held, 0 suspended
Total for dlattanzioauger: 1 jobs; 0 completed, 0 removed, 0 idle, 1 running, 0 held, 0 suspended
Total for all users: 25632 jobs; 12551 completed, 0 removed, 10796 idle, 2174 running, 111 held, 0 suspended
As long as the notebook is running, you will also see your job in RUN status. The service runs on a farm worker node.
Here is the Home page you will see once the job is running:
From this page, it is possible to browse through the different tools, for instance you can open a new Python notebook or a bash shell:
Inside the "README.md" file in the left column, you will find some useful information and a link to this user guide.
It is important to know that the work done here and all the data produced (e.g. if you run Python code from a notebook on this web page) are treated as a local job, so they follow the same rules valid for all local jobs. This means that the jobs have a limited lifetime, use the pledged CPU resources and their status can be checked with the usual condor_q command.
The output files produced by a job must be copied somewhere else, otherwise, if they are saved only locally on the worker node, they will be lost once the job is done. POSIX access to /cvmfs, /storage and /opt/exp_software is guaranteed.
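In practice this means the notebook (or any job script) should copy its products to one of these shared areas before it ends. A minimal sketch follows; the helper name and the destination path are illustrative, real paths depend on the experiment's storage area:

```shell
#!/bin/bash
# Sketch: copy job products to a shared area before the job ends;
# files left only on the worker node are lost when the job finishes.
# The destination is a placeholder such as /storage/gpfs_data/<exp>/...
save_outputs() {
  local dest=$1; shift          # first argument: destination directory
  mkdir -p "$dest" && cp "$@" "$dest"/
}

# Usage at the end of a job script (illustrative path):
#   save_outputs /storage/gpfs_data/<exp>/results outfile.txt sleep.log
```

Copying explicitly at the end of the job also makes it obvious in the logs which products were preserved.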
When you log out from the service or close the browser, the notebook keeps running, so after a new login it will still be there, provided the job lifetime has not expired. On the other hand, you can also decide to stop the notebook; in this case a new job is automatically submitted at the next login. There are two ways to stop the notebook:
1) From the upper-left corner, select the menu File > Hub Control Panel and then click on the "Stop My Server" red button.
2) Remove the job by using the usual HTCondor command for local jobs. For instance, run from the UI:
$ condor_rm -name sn-02 -cons 'JobBatchName=="jupyter-dlattanzioauger"'
All jobs matching constraint (JobBatchName=="jupyter-dlattanzioauger") have been marked for removal
Check that the job has been successfully removed:
-bash-4.2$ condor_q -name sn-02
-- Schedd: sn-02.cr.cnaf.infn.it : <131.154.192.42:9618?... @ 01/13/22 18:44:25
OWNER BATCH_NAME SUBMITTED DONE RUN IDLE HOLD TOTAL JOB_IDS
Total for query: 0 jobs; 0 completed, 0 removed, 0 idle, 0 running, 0 held, 0 suspended
Total for dlattanzioauger: 0 jobs; 0 completed, 0 removed, 0 idle, 0 running, 0 held, 0 suspended
Total for all users: 31480 jobs; 13169 completed, 0 removed, 15852 idle, 2352 running, 107 held, 0 suspended
User environment customization
If a user needs to customize their environment, they have to enter the content of a Python configuration file in the dedicated window on the login web page.
For example, in order to load the CVMFS path the user should enter the right instructions as in the picture below.
Indeed, in this case the LD_LIBRARY_PATH is properly loaded inside the Jupyter notebook.
N.B. This procedure does not work if the Jupyter notebook is already running.
Conda environment creation
During the Jupyter notebook execution, it is possible to create a Conda environment and install different packages, for example another Python version. For instance, you can issue the following commands:
%conda create -n test -y python=3.6
%conda run -n test pip install ipykernel
%conda run -n test python -m ipykernel install --user --name python3.6 --display-name PythonCustom_3.6
N.B. If you previously customized the Jupyter environment, before launching the above commands you need to reset the proper environment variable:
import os
os.environ['LD_LIBRARY_PATH'] = ''
At this point, you will see a new shortcut to a new notebook which contains the new PythonCustom_3.6 conda environment.
For example, inside this new notebook, you can launch the ROOT software:
import sys
sys.path.append('/cvmfs/sft.cern.ch/lcg/views/LCG_96python3/x86_64-centos7-gcc8-opt/lib/python3.6/site-packages/')
sys.path.append('/cvmfs/sft.cern.ch/lcg/views/LCG_96python3/x86_64-centos7-gcc8-opt/lib')
import ROOT
Welcome to JupyROOT 6.18/00
Software installation in a Conda environment
After having created a proper Conda environment, the user has to activate it in order to use it from the CLI:
Singularity> conda activate test
N.B. In order to avoid the following error:
Singularity> conda activate test
/usr/bin/python3: symbol lookup error: /usr/bin/python3: undefined symbol: _Py_LegacyLocaleDetected
Singularity>
The user needs to unset the LD_LIBRARY_PATH environment variable:
Singularity> unset LD_LIBRARY_PATH
Singularity> conda activate test
(test) Singularity>
At this point, it is possible for example to install a software like R:
(test) Singularity> conda install R
and then it can be executed:
(test) Singularity> R

R version 3.6.1 (2019-07-05) -- "Action of the Toes"
Copyright (C) 2019 The R Foundation for Statistical Computing
Platform: x86_64-conda_cos6-linux-gnu (64-bit)

R is free software and comes with ABSOLUTELY NO WARRANTY.
You are welcome to redistribute it under certain conditions.
Type 'license()' or 'licence()' for distribution details.

R is a collaborative project with many contributors.
Type 'contributors()' for more information and
'citation()' on how to cite R or R packages in publications.

Type 'demo()' for some demos, 'help()' for on-line help, or
'help.start()' for an HTML browser interface to help.
Type 'q()' to quit R.

>
After the R installation, the user can install the native R kernel for Jupyter:
> install.packages('IRkernel')
During the installation, the user will be prompted with a list of mirrors in different countries. For a quicker installation, the nearest can be chosen.
At this point, running the following command, a new R shortcut will be available on the main Jupyter dashboard:
IRkernel::installspec()
DAG Jobs
If a user has to manage multiple jobs with dependencies forming a directed acyclic graph (DAG), the submission can be organized with a DAG input file.
In the graph below, the vertices, generally tagged with a letter, represent the jobs and the edges represent the dependencies.
In this case the A job is the parent of the B¹, B² and B³ jobs, namely the children.
This implies that the B* jobs start only once the A job has finished.
This automatic handling of successive submissions can be very useful.
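The fan-out graph described above translates directly into a DAG input file; a minimal sketch (the file names fanout.dag, a.sub and b*.sub are illustrative):

```
# fanout.dag -- A runs first; B1, B2 and B3 start in parallel once A is done
JOB A  a.sub
JOB B1 b1.sub
JOB B2 b2.sub
JOB B3 b3.sub
PARENT A CHILD B1 B2 B3
```

A PARENT line may list several children (and several parents), which is how arbitrary acyclic dependency graphs are built up.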
Example
To properly perform a DAG job, one needs to write a DAG input file with the specific tags of the jobs and their dependencies, for instance:
-bash-4.2$ cat simple.dag
JOB A sleep.sub
JOB B snore.sub
PARENT A CHILD B
In this case the sleep.sub
job is called A, whereas the snore.sub
job is B.
Moreover, A is the parent of B, so the B job starts only once the A job has finished.
In order to submit the DAG job properly, the DAG input file has to be in a folder reachable from the schedd. At Tier-1, this can in general be a directory on the gpfs_data file system. After this, to submit the DAG job it is enough to issue the following commands:
-bash-4.2$ export _condor_SCHEDD_HOST=sn-02.cr.cnaf.infn.it
-bash-4.2$ condor_submit_dag simple.dag

Renaming rescue DAGs newer than number 0
-----------------------------------------------------------------------
File for submitting this DAG to HTCondor  : simple.dag.condor.sub
Log of DAGMan debugging messages          : simple.dag.dagman.out
Log of HTCondor library output            : simple.dag.lib.out
Log of HTCondor library error messages    : simple.dag.lib.err
Log of the life of condor_dagman itself   : simple.dag.dagman.log

Submitting job(s).
1 job(s) submitted to cluster 1271844.
-----------------------------------------------------------------------
The submission produces the log files shown in the output.
Then, to check the job status a user can launch the condor_q
command
-bash-4.2$ condor_q
-- Schedd: sn-02.cr.cnaf.infn.it : <131.154.192.42:9618?... @ 01/26/22 17:47:19
OWNER        BATCH_NAME          SUBMITTED   DONE   RUN   IDLE  TOTAL JOB_IDS
arendinajuno simple.dag+1271844  1/26 16:51     _     1      _      1 1271846.0
Total for query: 1 jobs; 0 completed, 0 removed, 0 idle, 1 running, 0 held, 0 suspended
Total for arendinajuno: 1 jobs; 0 completed, 0 removed, 0 idle, 1 running, 0 held, 0 suspended
Total for all users: 26968 jobs; 18806 completed, 1 removed, 4984 idle, 3132 running, 45 held, 0 suspended
and after the A job is done, the child job B is queued:
-- Schedd: sn-02.cr.cnaf.infn.it : <131.154.192.42:9618?... @ 01/26/22 17:48:56
OWNER        BATCH_NAME          SUBMITTED   DONE   RUN   IDLE  TOTAL JOB_IDS
arendinajuno simple.dag+1271947  1/26 17:46     _     1      _      1 1271949.0
Total for query: 1 jobs; 0 completed, 0 removed, 0 idle, 1 running, 0 held, 0 suspended
Total for arendinajuno: 1 jobs; 0 completed, 0 removed, 0 idle, 1 running, 0 held, 0 suspended
Total for all users: 28900 jobs; 18810 completed, 1 removed, 6840 idle, 3204 running, 45 held, 0 suspended