

HEP, Cavendish Laboratory

Configuring & Enabling Condor in LHC Computing Grid

Condor is a specialized workload management system for compute-intensive jobs, which can effectively manage a variety of clusters of dedicated compute nodes. Today there are grid schedulers, resource managers, and workload management systems available that can either provide the functionality of a traditional batch queuing system (e.g. Torque/PBS) or harness cycles from idle desktop workstations. Condor addresses both of these areas with a single tool. In a Grid-style computing environment, Condor's "flocking" technology allows multiple Condor installations to work together and opens up a wide range of options for resource sharing.
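
To make the desktop-scavenging scenario concrete, the sketch below shows the kind of start/suspend policy a non-dedicated workstation could carry in its local Condor configuration. The thresholds are illustrative assumptions, not the values used in our setup.

    # Illustrative condor_config.local for a non-dedicated desktop
    # (thresholds are example values, not taken from our configuration).
    # Start a job only after 15 minutes of keyboard idleness on a
    # lightly loaded machine.
    START    = (KeyboardIdle > 15 * 60) && (LoadAvg < 0.3)
    # Suspend the job as soon as the owner returns.
    SUSPEND  = (KeyboardIdle < 60)
    CONTINUE = (KeyboardIdle > 15 * 60) && (LoadAvg < 0.3)
    # Evict the job if it has been suspended for more than ten minutes.
    PREEMPT  = (Activity == "Suspended") && ((CurrentTime - EnteredCurrentActivity) > 600)
    KILL     = FALSE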

[Figure: Condor working model — a Central Manager (master, collector, negotiator), a Submit Node (master, schedd), a Regular Node (master, schedd, startd) and several Execute Nodes (master, startd); the legend distinguishes spawned processes from ClassAd communication pathways.]

Condor Working Model

Like other full-featured batch systems, Condor provides the traditional job queuing mechanism, scheduling policy, and priority scheme, along with resource classification. In a nutshell, a job is submitted to Condor via a machine running a scheduler (schedd). The scheduler communicates with the collector process on the Central Manager (CM). The negotiator on the CM performs a matchmaking service and sends the job to an available machine on the network, which then runs it. Machines that can run jobs (Execute Nodes) also report to the collector, via a startd process. A shadow process on the Submit Node keeps communicating with the running job, so Condor can detect when the job stops executing (e.g. if the job or the machine crashes). If checkpointing is not in use, such jobs can be restarted by Condor if requested and allowed.
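
This division of labour is set, in essence, by which daemons the master starts on each machine. A minimal sketch of the per-role daemon lists follows; the central-manager host name is a placeholder, and a "regular node" that both submits and executes jobs simply runs all three services.

    # Sketch of per-role daemon lists (condor_config.local on each host);
    # the CONDOR_HOST value is a placeholder.
    CONDOR_HOST = cm.hep.example.ac.uk

    # Central Manager: matchmaking services.
    DAEMON_LIST = MASTER, COLLECTOR, NEGOTIATOR

    # Submit Node: holds the job queue; the schedd spawns a shadow
    # for every running job.
    DAEMON_LIST = MASTER, SCHEDD

    # Execute Node: advertises its resources and runs the jobs.
    DAEMON_LIST = MASTER, STARTD

    # Regular Node (submits and executes): all of the above.
    DAEMON_LIST = MASTER, SCHEDD, STARTD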

Although Condor, as a batch system, is officially supported by gLite/EGEE, various parts of the middleware are still limited to PBS/Torque in terms of transparent integration. We have extended the support so that the middleware works seamlessly with Condor and can interact with local/university compute clusters. We provide details of the configuration, implementation, and testing of Condor for LCG in a mixed environment, where a common cluster is used for different types of jobs. The system is presented as an extension to the default LCG/gLite configuration that gives both LCG and local jobs transparent access to the common resource. Using Condor and Chirp/Parrot, we have extended the possibilities for using university clusters for LCG/gLite jobs in an entirely unprivileged way.
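
As an illustration of the unprivileged data access, the submit description below wraps the payload in Parrot (from the cctools suite, the same package that provides the chirp_server command) so that it can read files exported by a Chirp server through /chirp/... paths. The host name, file names and paths are hypothetical, and tool names may differ between cctools releases; this is a sketch rather than our production setup.

    # Sketch of a vanilla-universe job that reaches a Chirp server via
    # Parrot; host name, paths and file names are hypothetical.
    universe                = vanilla
    # parrot_run is the cctools launcher; here it is assumed to sit in
    # the submit directory and is transferred along with the job.
    executable              = parrot_run
    arguments               = ./analysis.sh /chirp/data.hep.example.ac.uk/export/input.dat
    transfer_input_files    = analysis.sh
    should_transfer_files   = YES
    when_to_transfer_output = ON_EXIT
    output                  = job.out
    error                   = job.err
    log                     = job.log
    queue

On the storage side, the data host would export a directory with something like "chirp_server -r /export"; no root privileges are needed on either end.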

[Figure: CamGrid project model — the HEP cluster of execute nodes receives LCG/gLite submissions from grid users via the HEP CM + gLite submit node, HEP submissions from local users via the local (HEP) submit node, and CamGrid submissions via central submit nodes; the surrounding pools are maintained by individual groups/departments.]

Condor provides a High Throughput Computing (HTC) environment. In addition to the typical usage scenario, Condor can also effectively manage non-dedicated resources by taking advantage of spare cycles when those resources are idle. The ClassAd mechanism in Condor provides an extremely flexible and expressive framework for matching resource requests (jobs) with resource offers (machines). This is why we have chosen Condor as the primary batch system for our LCG farm. The same cluster is also part of the CamGrid project.

The Condor central manager is configured to act as a submit node only for LCG/gLite submission; an additional submit node serves our local users and CamGrid submission. CamGrid is made up of a number of Condor pools belonging to individual departments, which share their resources using Condor's flocking mechanism. This federated approach means that there is no single point of failure, and the grid does not depend on any individual pool to continue working. Grid jobs run only on the HEP cluster, but CamGrid jobs, submitted through the central submit hosts or the local submit host, run across the entire CamGrid infrastructure.
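
This split can be expressed almost entirely through Condor's flocking knobs; the fragment below is a sketch with placeholder host names rather than the real CamGrid lists.

    # On the local (HEP/CamGrid) submit node: jobs that find no match in
    # the HEP pool may flock to other CamGrid central managers.
    # Host names are placeholders.
    FLOCK_TO   = cm.dept-a.cam.example, cm.dept-b.cam.example

    # On a pool that accepts flocked jobs: name the foreign submit hosts
    # and let them advertise to the collector and claim startds.
    FLOCK_FROM = submit.hep.example.ac.uk
    HOSTALLOW_WRITE = $(HOSTALLOW_WRITE), $(FLOCK_FROM)

    # The gLite submit node on the HEP central manager defines no
    # FLOCK_TO at all, so LCG/gLite jobs never leave the HEP cluster.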

Santanu Das. 3rd EGEE User Forum, 11-14 February 2008, Clermont-Ferrand, France. [email protected]