General information

HTCondor is a powerful yet complex batch system: in this page you'll find basic information on how to submit and manage jobs, but for a fully fledged user manual you can find all the details in the official HTCondor documentation

All the commands you will find in this guide need to be executed on the frontend node, the one you land on via SSH (htc-fe.rmlab.infn.it)

Submit a job

$ condor_submit example.sub

The "condor_submit" command allows you to submit a job to the central manager, which will take care of distributing it to the available worker nodes, depending on the job's requirements.

The .sub file, called the "job descriptor", is where you write all the information related to the job you want to submit (universe, number of jobs, parameters, etc.)

  
This is a sample job descriptor that uses the Vanilla Universe (see "Universe"):


*       Executable      = foo.sh
*       Universe        = vanilla
        Error           = bar/err.$(ClusterId).$(ProcId)
        Output          = bar/out.$(ClusterId).$(ProcId)
*       Log             = bar/log.$(ClusterId).$(ProcId)
        arguments       = arg1 arg2 $(ProcId)
        request_cpus    = 1
*       request_memory  = 1024
        request_disk    = 10240
        should_transfer_files = no
*       Queue 1

*  = required fields

You can also submit many similar jobs with a single queue command.
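For instance, a Queue command with a count submits that many copies of the same job; each copy gets a distinct $(ProcId), so the Output/Error/Log patterns above produce separate files per job. A minimal sketch (the count 5 is just an example):

```
# Submit 5 identical jobs: ClusterId is shared, ProcId runs from 0 to 4,
# so files named out.$(ClusterId).$(ProcId) are distinct for each job
Queue 5
```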

For reference you can have a look at the job submission manual or at the science job example

Parameters:

Executable = the program or script to be executed
Universe = choose between vanilla/docker/parallel
Error, Output, Log = where to write the log files. These folders need to exist before submitting the job
Arguments = array of strings to be passed to the job itself
request_[...] = job resource requirements
Queue = the submission request; here you can specify the number of jobs to be executed
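Since the Error/Output/Log folders must exist before submission, a typical session creates them first (the folder name bar/ matches the sample descriptor above):

```shell
# Create the folder used by Error/Output/Log in the sample descriptor;
# HTCondor does not create missing folders, and -p makes this idempotent
mkdir -p bar
```

After that, run "condor_submit example.sub" as usual.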

Universe


A universe represents an execution environment for the job. HTCondor supports several universes:

* vanilla
grid
java
scheduler
local
* parallel
vm
container
* docker


*  = supported by the rm2lab HTC cluster
 

Docker Universe

To run jobs using a Docker image you can write a job descriptor as follows:

---------------------------------------------------------
universe                = docker
docker_image            = ubuntu:20.04
executable              = /bin/cat
arguments               = /etc/hosts
transfer_input_files    = input.txt
transfer_output_files   = output.txt
should_transfer_files   = YES
when_to_transfer_output = ON_EXIT
output                  = out.$(ClusterId).$(ProcId)
error                   = err.$(ClusterId).$(ProcId)
log                     = log.$(ClusterId).$(ProcId)
request_cpus            = 1
request_memory          = 256M
request_disk            = 1024M
queue 1
---------------------------------------------------------

Further details about Docker universe applications can be found in the official manual


Parallel Universe

The parallel universe allows you to submit parallel jobs using MPI: the scheduler will wait for all the needed resources to be available before starting the job.

To use OpenMPI you need to submit a wrapper script instead of your executable: the actual executable, compiled with OpenMPI, is passed as the first argument and transferred using the "transfer_input_files" directive.

universe = parallel
executable = openmpiscript.sh
arguments = mpitest
should_transfer_files = yes
transfer_input_files = mpitest
when_to_transfer_output = on_exit_or_evict
output = logs/out.$(NODE)
error  = logs/err.$(NODE)
log    = logs/log
machine_count = 15
request_cpus = 10
queue

The openmpiscript.sh wrapper is located in the /scripts folder of the frontend node; it needs to be copied into the folder from which you submit the job.
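In practice the session looks like this (the /scripts path is the one mentioned above for this cluster's frontend; the descriptor name mpi_example.sub is illustrative):

```
$ cp /scripts/openmpiscript.sh .
$ condor_submit mpi_example.sub
```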

The following directives indicate how many parallel processes need to be run and how many CPUs are needed for each process.

machine_count = 15
request_cpus = 10

Detailed info about parallel applications can be found here
 

Jobs with GPU

To execute a job that needs a GPU you can specify "request_GPUs = 1" in the job submission file.
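A minimal sketch of such a descriptor, assuming a vanilla-universe job (the executable name gpu_job.sh is illustrative):

```
universe        = vanilla
executable      = gpu_job.sh          # illustrative name
request_GPUs    = 1
request_cpus    = 1
request_memory  = 1024
output          = out.$(ClusterId).$(ProcId)
error           = err.$(ClusterId).$(ProcId)
log             = log.$(ClusterId).$(ProcId)
queue 1
```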

Manage a job


$ condor_rm clusterID.procID

Removes a specific job.


$ condor_hold clusterID.procID

Puts a job on hold; it will not be executed until released.


$ condor_release clusterID.procID

Releases a previously held job.
 

Queue check


$ condor_q <parameters>

The condor_q command allows you to check the status of a job or of the queue.


Parameters:

-nobatch          Displays one job per line
-global           Shows all the active queues in the cluster
-hold             Shows information about held jobs (you can also use the -analyze or -better-analyze options)
-run              Shows info on running jobs; to be used with -nobatch
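Typical invocations (the job id 123.0 is illustrative):

```
$ condor_q -nobatch -run
$ condor_q -better-analyze 123.0
```

The first lists your running jobs one per line; the second explains why a given job is idle or held, i.e. which requirements fail to match the available slots.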


