General information
HTCondor is a powerful yet complex batch system: on this page you will find basic information on how to submit and manage jobs, while the official HTCondor documentation provides a fully fledged user manual with all the details.
All the commands in this guide must be executed on the frontend node, the one you land on via ssh (htc-fe.rmlab.infn.it).
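For example (replace <username> with your own account name):
$ ssh <username>@htc-fe.rmlab.infn.it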
Submit a job
$ condor_submit example.sub
The "condor_submit" command submits a job to the central manager, which takes care of distributing it to the available worker nodes according to the job's specification.
The .sub file is called "job descriptor": in it you write all the information related to the job you want to submit (universe, number of jobs, parameters, etc.).
This is a sample job descriptor that uses the Vanilla Universe (see "universe")
* Executable = foo.sh
* Universe = vanilla
Error = bar/err.$(ClusterId).$(ProcId)
Output = bar/out.$(ClusterId).$(ProcId)
* Log = bar/log.$(ClusterId).$(ProcId)
arguments = arg1 arg2 $(ProcId)
request_cpus = 1
* request_memory = 1024
request_disk = 10240
should_transfer_files = no
* Queue 1
* = required fields
You can also submit many similar jobs with a single queue command.
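For example, replacing the last line of the descriptor above with
Queue 10
submits ten instances of the same job: $(ProcId) runs from 0 to 9, so each instance can pick its own input or parameters.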
For reference, have a look at the job submission manual or at the science job example.
Parameters:
Executable = the script or binary to be executed (see the sketch after this list)
Universe = choose between vanilla/docker/parallel
Error, Output, Log = where to write the log files; these folders need to exist before submitting the job
Arguments = array of strings passed to the job itself
request_[...] = job resource requirements
Queue = the actual submission request; here you can specify the number of jobs to be executed
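A minimal sketch of what an executable like foo.sh could look like (the script body is purely illustrative):
---------------------------------------------------------
#!/bin/bash
# foo.sh - receives arg1, arg2 and the ProcId as positional arguments
echo "job $3 running on $(hostname) with arguments: $1 $2"
---------------------------------------------------------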
Universe
A universe represents an execution environment for the job. HTCondor supports several universes:
* vanilla
grid
java
scheduler
local
* parallel
vm
container
* docker
* = Supported by rm2lab htc cluster
Docker Universe
To run jobs using a Docker image you can write a job descriptor as follows:
---------------------------------------------------------
universe = docker
docker_image = ubuntu:20.04
executable = /bin/cp
arguments = input.txt output.txt
# the job copies the transferred input.txt to output.txt;
# output.txt is then transferred back to the submit directory on exit
transfer_input_files = input.txt
transfer_output_files = output.txt
should_transfer_files = YES
when_to_transfer_output = ON_EXIT
output = out.$(ClusterId).$(ProcId)
error = err.$(ClusterId).$(ProcId)
log = log.$(ClusterId).$(ProcId)
request_cpus = 1
request_memory = 256M
request_disk = 1024M
queue 1
---------------------------------------------------------
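To submit it (assuming the descriptor is saved as docker.sub and that input.txt exists in the submit directory):
$ condor_submit docker.sub
The ubuntu:20.04 image is pulled automatically on the worker node that executes the job.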
Further details about Docker universe applications can be found in the official manual.
Parallel Universe
The parallel universe allows you to submit parallel jobs using MPI: the scheduler waits for all the needed resources to be available before starting the job.
To use OpenMPI you need to use a wrapper script instead of your executable: the actual program, compiled with OpenMPI, is passed as the first argument and transferred using the "transfer_input_files" directive.
universe = parallel
executable = openmpiscript.sh
arguments = mpitest
should_transfer_files = yes
transfer_input_files = mpitest
when_to_transfer_output = on_exit_or_evict
output = logs/out.$(NODE)
error = logs/err.$(NODE)
log = logs/log
machine_count = 15
request_cpus = 10
queue
The openmpiscript.sh wrapper is located in the /scripts folder of the frontend node; it needs to be copied into the folder from which you submit the job.
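For example (assuming the descriptor above is saved as mpi.sub):
$ cp /scripts/openmpiscript.sh .
$ condor_submit mpi.sub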
The following directives indicate how many parallel processes need to be run and how many CPUs are needed for each process; with the values below, HTCondor claims 15 slots with 10 CPUs each (150 cores in total) before the job starts.
machine_count = 15
request_cpus = 10
Detailed info about parallel applications can be found here
Jobs with GPU
To execute a job that needs a GPU, specify "request_GPUs = 1" in the job submission file, as in the sketch below.
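A minimal sketch of a GPU job descriptor (gpu_job.sh is a placeholder for your own executable):
---------------------------------------------------------
universe = vanilla
executable = gpu_job.sh
request_GPUs = 1
request_cpus = 1
request_memory = 2048
output = out.$(ClusterId).$(ProcId)
error = err.$(ClusterId).$(ProcId)
log = log.$(ClusterId).$(ProcId)
queue 1
---------------------------------------------------------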
Manage a job
$ condor_rm clusterID.procID
Removes a specific job
$ condor_hold clusterID.procID
Puts a job on hold: it will not be executed until released.
$ condor_release clusterID.procID
Releases a previously held job.
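For example, to hold and later release job 1234.0 (a hypothetical ClusterId.ProcId taken from condor_q):
$ condor_hold 1234.0
$ condor_release 1234.0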
Queue check
$ condor_q <parameters>
The condor_q parameters allow you to check the status of a job or of the whole queue (see the example after the parameter list).
Parameters:
-nobatch Displays one job per line
-global Shows all the active queues in the cluster
-hold Shows information about jobs on hold (you can also use the -analyze or -better-analyze parameters)
-run Shows info on running jobs, to be used with -nobatch
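For example, to list every running job on its own line:
$ condor_q -nobatch -run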