Condor-G: A Case in Distributed Job Delegation

Jaime Frey
Computer Sciences Department
University of Wisconsin-Madison
[email protected]
http://www.cs.wisc.edu/condor

Job Delegation

› Transfer of responsibility to schedule and execute a job

› Multiple delegations can form a chain
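
A chain of delegations can be pictured as an ordered list of the systems that have taken responsibility for a job. The sketch below is only an illustration: the `Job` class and its `delegate` method are hypothetical names, and the hop names are illustrative labels, not a real configuration.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Job:
    """A job plus the ordered list of systems it has been delegated to."""
    job_id: str
    path: List[str] = field(default_factory=list)  # delegation chain, oldest first

    def delegate(self, system: str) -> None:
        # Handing the job to `system` transfers responsibility for
        # scheduling and executing it.
        self.path.append(system)

    @property
    def responsible(self) -> str:
        # The most recent delegate is the one currently responsible.
        # (Raises IndexError if the job has not been handed to anyone yet.)
        return self.path[-1]


job = Job("job-1")
for hop in ["Condor-G", "Globus GRAM", "Batch System Front-end", "Execute Machine"]:
    job.delegate(hop)

print(" -> ".join(job.path))
```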

Job Delegation in Condor-G Today

[Diagram: delegation chain through Condor-G, Globus GRAM, Batch System Front-end, and Execute Machine]

Expanding the Model

› What can we do with new forms of job delegation?

› Some ideas
    Mirroring
    Load-balancing
    Glide-in schedd
    Multi-hop grid scheduling

Mirroring

› What it does
    Jobs mirrored on two Condor-Gs
    If primary Condor-G crashes, secondary one starts running jobs
    On recovery, primary Condor-G gets job status from secondary one

› Removes Condor-G submit point as single point of failure
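
The failover behaviour described above can be sketched as a small in-memory model. This is a toy illustration under stated assumptions, not how Condor-G implements mirroring: `SubmitPoint`, `MirroredPair`, and the status strings are all hypothetical.

```python
from dataclasses import dataclass, field
from typing import Dict


@dataclass
class SubmitPoint:
    """One Condor-G instance, reduced to a liveness flag and a job table."""
    name: str
    alive: bool = True
    jobs: Dict[str, str] = field(default_factory=dict)  # job id -> status


class MirroredPair:
    def __init__(self, primary: SubmitPoint, secondary: SubmitPoint) -> None:
        self.primary = primary
        self.secondary = secondary

    def submit(self, job_id: str) -> None:
        # Every job is recorded on both instances.
        for point in (self.primary, self.secondary):
            point.jobs[job_id] = "idle"

    def active(self) -> SubmitPoint:
        # The secondary only takes over while the primary is down.
        return self.primary if self.primary.alive else self.secondary

    def start(self, job_id: str) -> None:
        self.active().jobs[job_id] = "running"

    def recover_primary(self) -> None:
        # On recovery the primary pulls job status from the secondary.
        self.primary.alive = True
        self.primary.jobs.update(self.secondary.jobs)


pair = MirroredPair(SubmitPoint("condor-g-1"), SubmitPoint("condor-g-2"))
pair.submit("job-1")
pair.primary.alive = False      # simulate a crash of the primary
pair.start("job-1")             # the secondary starts running the job
pair.recover_primary()
print(pair.primary.jobs)
```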

Mirroring Example

[Diagram: Condor-G 1 and Condor-G 2, with a Matchmaker and an Execute Machine]

Load-Balancing

› What it does
    Front-end Condor-G distributes all jobs among several back-end Condor-Gs
    Front-end Condor-G keeps updated job status

› Improves scalability

› Maintains single submit point for users
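
One way to picture the front-end is as a dispatcher that places each job on a back-end and keeps a status table for the user. The slides do not say which distribution policy is used; the sketch below assumes a least-loaded policy, and the `FrontEnd` class, back-end names, and status strings are hypothetical.

```python
from typing import Dict, List

FINISHED = ("completed", "removed")


class FrontEnd:
    """Single submit point that spreads jobs over several back-ends."""

    def __init__(self, backends: List[str]) -> None:
        self.load: Dict[str, int] = {b: 0 for b in backends}  # unfinished jobs per back-end
        self.placement: Dict[str, str] = {}                   # job id -> back-end
        self.status: Dict[str, str] = {}                      # job id -> last known status

    def submit(self, job_id: str) -> str:
        # Assumed policy: pick the back-end with the fewest unfinished jobs
        # (ties go to the back-end listed first).
        backend = min(self.load, key=self.load.get)
        self.load[backend] += 1
        self.placement[job_id] = backend
        self.status[job_id] = "idle"
        return backend

    def update(self, job_id: str, status: str) -> None:
        # Back-ends report status changes so users can query the front-end only.
        if status in FINISHED and self.status[job_id] not in FINISHED:
            self.load[self.placement[job_id]] -= 1
        self.status[job_id] = status


front = FrontEnd(["back-end-1", "back-end-2", "back-end-3"])
placed = [front.submit(f"job-{i}") for i in range(4)]
front.update("job-1", "completed")
print(placed, front.load)
```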

Load-Balancing Example

[Diagram: Condor-G Front-end with Condor-G Back-end 1, Back-end 2, and Back-end 3]

Glide-In Schedd

› What it does
    Drop a Condor-G onto the front-end machine of a cluster
    Delegate jobs to the cluster through the glide-in schedd

› Apply cluster-specific policies to jobs

Glide-In Schedd Example

[Diagram: Condor-G, Glide-In Schedd, and Batch System]

Multi-Hop Grid Scheduling

› Match a job to a Virtual Organization (VO), then to a resource within that VO

› Easier to schedule jobs across multiple VOs and grids
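
The two matching steps can be sketched as two small functions, one per hop. Everything here is an assumption made for illustration: the `VOS` catalogue, the `group` and `cpus` job attributes, and the first-fit matching rule are hypothetical and do not come from any real resource broker.

```python
from typing import Any, Dict, Optional, Tuple

# Hypothetical catalogue: each VO lists the groups it accepts and the free
# CPUs on each of its resources.
VOS: Dict[str, Dict[str, Any]] = {
    "vo-a": {"accepts": {"physics"}, "resources": {"cluster-1": 0, "cluster-2": 8}},
    "vo-b": {"accepts": {"biology", "physics"}, "resources": {"cluster-3": 16}},
}


def match_vo(job: Dict[str, Any], vos: Dict[str, Dict[str, Any]]) -> Optional[str]:
    # First hop: choose a VO using VO-level information only.
    for name, vo in vos.items():
        if job["group"] in vo["accepts"]:
            return name
    return None


def match_resource(job: Dict[str, Any], vo: Dict[str, Any]) -> Optional[str]:
    # Second hop: the VO picks one of its own resources.
    for name, free_cpus in vo["resources"].items():
        if free_cpus >= job["cpus"]:
            return name
    return None


def multi_hop_match(job: Dict[str, Any],
                    vos: Dict[str, Dict[str, Any]]) -> Optional[Tuple[str, str]]:
    vo_name = match_vo(job, vos)
    if vo_name is None:
        return None
    resource = match_resource(job, vos[vo_name])
    if resource is None:
        return None
    return vo_name, resource


print(multi_hop_match({"group": "physics", "cpus": 4}, VOS))
```

Because the first hop sees no resource-level detail, a job can land in a VO that cannot currently run it even when another VO could; a fuller design would need some way to retry or push that information upward.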

Multi-Hop Grid Scheduling Example

[Diagram: Experiment Condor-G, Experiment Resource Broker, VO Condor-G, VO Resource Broker, Globus GRAM, and Batch Scheduler]

Endless Possibilities

› These new models can be combined with each other or with other new models

› Resulting system can be arbitrarily sophisticated

Job Delegation Challenges

› New complexity introduces new issues and exacerbates existing ones

› A few…
    Transparency
    Representation
    Scheduling Control
    Active Job Control
    Revocation
    Error Handling and Debugging

Transparency

› Full information about job should be available to user
    Information from full delegation path
    No manual tracing across multiple machines

› Users need to know what’s happening with their jobs

Representation

› Job state is a vector
› How best to show this to user
    Summary
        • Current delegation endpoint
        • Job state at endpoint
    Full information available if desired
        • Series of nested ClassAds?
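
To make the two views concrete, the sketch below treats the job state vector as a list of (system, state) pairs, derives the summary from its last entry, and builds the full view as nested dictionaries standing in for nested ClassAds. The attribute names (`Endpoint`, `System`, `State`, `Delegated`) and the state strings are hypothetical, not actual ClassAd attributes.

```python
from typing import Any, Dict, List, Tuple

# One (system, state) entry per hop in the delegation path, origin first.
JobStateVector = List[Tuple[str, str]]


def summarize(vector: JobStateVector) -> Dict[str, str]:
    # Summary view: only the current endpoint and the job state there.
    endpoint, state = vector[-1]
    return {"Endpoint": endpoint, "State": state}


def nest(vector: JobStateVector) -> Dict[str, Any]:
    # Full view: each hop's record contains the record of the hop it
    # delegated to, built from the endpoint outward.
    ad: Dict[str, Any] = {}
    for system, state in reversed(vector):
        inner = ad
        ad = {"System": system, "State": state}
        if inner:
            ad["Delegated"] = inner
    return ad


vector = [
    ("Condor-G", "delegated"),
    ("Globus GRAM", "delegated"),
    ("Batch System", "running"),
]
print(summarize(vector))
print(nest(vector))
```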

Scheduling Control

› Avoid loops in delegation path

› Give user control of scheduling
    Allow limiting of delegation path length?
    Allow user to specify part or all of delegation path
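
These three controls can be expressed as a single check made before each delegation. The sketch is one possible reading of the slide, not a described implementation: `can_delegate`, `max_hops`, and `pinned` are hypothetical names, and the path is assumed to include the original submit point.

```python
from typing import List, Optional, Sequence


def can_delegate(path: List[str],
                 next_hop: str,
                 max_hops: Optional[int] = None,
                 pinned: Sequence[str] = ()) -> bool:
    """Decide whether a job whose path so far is `path` may go to `next_hop`."""
    # Avoid loops: never hand the job to a system already on its path.
    if next_hop in path:
        return False
    # Optional user limit on the total length of the delegation path.
    if max_hops is not None and len(path) >= max_hops:
        return False
    # Optional user-specified prefix of the path: while inside the pinned
    # part, the next hop must be exactly the one the user named.
    position = len(path)
    if position < len(pinned) and pinned[position] != next_hop:
        return False
    return True


print(can_delegate(["A", "B"], "A"))              # loop, rejected
print(can_delegate(["A", "B"], "C", max_hops=2))  # path already at its limit
```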

Active Job Control

› User may request certain actions
    hold, suspend, vacate, checkpoint
› Actions cannot be completed synchronously for user
    Must forward along delegation path
    User checks completion later

Active Job Control (cont)

› Endpoint systems may not support actions
    If possible, execute them at furthest point that does support them

› Allow user to apply action in middle of delegation path
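
A rough sketch of this behaviour: the user's request starts out pending, is forwarded along the delegation path, and is applied at the furthest system that supports the action; the user inspects the request later. `ActionRequest`, `process`, the capability table, and the status strings are all hypothetical.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional, Set


@dataclass
class ActionRequest:
    """What the user gets back immediately; it is filled in later."""
    action: str
    status: str = "pending"
    applied_at: Optional[str] = None


def furthest_supporting(path: List[str],
                        supports: Dict[str, Set[str]],
                        action: str) -> Optional[str]:
    # Walk back from the endpoint to find the last hop that can do the action.
    for system in reversed(path):
        if action in supports.get(system, set()):
            return system
    return None


def process(request: ActionRequest,
            path: List[str],
            supports: Dict[str, Set[str]]) -> None:
    # Runs some time after the request was made, once it has been forwarded.
    target = furthest_supporting(path, supports, request.action)
    if target is None:
        request.status = "unsupported"
    else:
        request.applied_at = target
        request.status = "done"


path = ["Condor-G", "Glide-In Schedd", "Batch System"]
supports = {
    "Condor-G": {"hold", "suspend", "vacate", "checkpoint"},
    "Glide-In Schedd": {"hold", "suspend"},
    "Batch System": {"hold"},
}

request = ActionRequest("suspend")   # user sees "pending" right away
process(request, path, supports)     # forwarding happens later
print(request)
```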

Revocation

› Leases
    Lease must be renewed periodically for delegation to remain valid
    Allows revocation during long-term failures

› What are good values for lease lifetime and update interval?
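
The lease mechanism can be sketched with explicit timestamps. The `Lease` class is hypothetical, and the 600-second lifetime and 300-second renewal interval are arbitrary placeholders, not recommended values; choosing them is exactly the open question above.

```python
from dataclasses import dataclass


@dataclass
class Lease:
    """A delegation stays valid only while its lease has not expired."""
    lifetime: float     # seconds granted by each renewal
    expires_at: float   # absolute time at which the lease lapses

    @classmethod
    def grant(cls, now: float, lifetime: float) -> "Lease":
        return cls(lifetime=lifetime, expires_at=now + lifetime)

    def valid(self, now: float) -> bool:
        return now < self.expires_at

    def renew(self, now: float) -> bool:
        # An expired lease cannot be renewed: the delegation is revoked and
        # the delegate should stop working on the job.
        if not self.valid(now):
            return False
        self.expires_at = now + self.lifetime
        return True


lease = Lease.grant(now=0, lifetime=600)
print(lease.renew(now=300))    # delegator still reachable, lease extended
print(lease.valid(now=900))    # no further renewals arrived, lease has lapsed
```

If the delegator disappears for longer than the lifetime, renewals stop and the delegation lapses without any explicit revocation message.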

Error Handling and Debugging

› Many more places for things to go horribly wrong

› Need clear, simple error semantics

› Logs, logs, logs
    Have them everywhere

Current Status

› Done
    Mirroring

› In Progress
    Condor-G -> Condor-G delegation
        • User must specify hops
    Glide-in schedd
        • Set up by hand

Thank You!

› Questions?