Condor Project, Computer Sciences Department, University of Wisconsin-Madison
condor-admin@cs.wisc.edu  http://www.cs.wisc.edu/condor
Eager, Lazy, and Just-in-Time Planning Edinburgh Workshop Oct 2003




Planning vs. Scheduling

› Can you control the resources? Yes? Scheduling. No? Planning.

› Planning is a ‘client’ operation.


The question of When

› Lots of open questions about planning.

› An important consideration: When the planning occurs.

[Figure: a time axis marking three points at which planning can occur: Eager, Lazy, and Just-in-Time.]


Eager Example

› First Pass of EDG Resource Broker

[Figure: diagram with boxes labeled RB, DAGMan, Condor-G, Globus, Fabric, and Site Scheduler.]


Eager Condor-G Submit File

universe = globus
globussite = beak.cs.wisc.edu/jobmanager-lsf
executable = find_particle
arguments = …
output = …
log = …


EDG Resource Broker Gets Lazy…

› Addition of DAGMan callouts.

› DAGMan is given a command (script) to run immediately before submission of a job to Condor-G (different from a PRE script on a node).

› The helper command is passed a copy of the job submit file when DAGMan is about to submit that node in the graph.

› This allows changes to be made to the submit file (e.g. changing the globussite attribute) at the last minute.
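
A minimal sketch of what such a helper command might do, written in Python. The slides do not show the callout interface, so everything about how the helper is invoked is an assumption here: the submit file path is simply a function argument, `pick_site` is a hypothetical stand-in for whatever the broker uses to choose a site, and the replacement site name is made up.

```python
import re
import tempfile
from pathlib import Path


def pick_site(candidates):
    """Hypothetical last-minute choice; a real broker would consult current site state."""
    return candidates[0]


def rewrite_globussite(submit_path, new_site):
    """Replace the value of the globussite line in a submit file, in place."""
    path = Path(submit_path)
    text = path.read_text()
    # Match one whole 'globussite = ...' line, ignoring case and spacing.
    pattern = re.compile(r"^([ \t]*globussite[ \t]*=[ \t]*).*$",
                         re.IGNORECASE | re.MULTILINE)
    new_text, count = pattern.subn(lambda m: m.group(1) + new_site, text)
    if count == 0:
        # No binding was present in the file, so append one.
        new_text = text.rstrip("\n") + f"\nglobussite = {new_site}\n"
    path.write_text(new_text)
    return new_text


def demo():
    original = (
        "universe = globus\n"
        "globussite = beak.cs.wisc.edu/jobmanager-lsf\n"
        "executable = find_particle\n"
    )
    with tempfile.TemporaryDirectory() as tmp:
        submit = Path(tmp) / "job.sub"
        submit.write_text(original)
        # The site name below is invented for the example.
        site = pick_site(["example-site.invalid/jobmanager-pbs"])
        return rewrite_globussite(submit, site)
```

The point of the sketch is only the timing: the binding written by the eager planner is still in the file, but it is overwritten just before submission, which is what makes the approach lazy.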


Eager Example

› First Pass of EDG Resource Broker

[Figure: the same diagram (RB, DAGMan, Condor-G, Globus, Fabric, Site Scheduler) with a "callout" label added.]


Moving Condor-G to Just-In-Time

› Delay the binding of the task (job) to the resource until the resource is ready.

› Need to know when the resource is ready.

› One way: the unimplemented Globus 1.1 “queue wait time” estimate.
  Not really just-in-time, because of lies, lies, lies…

› Another way… Condor-G Glidein Mechanism.


How It Works

[Figure: animated diagram built up over slides 9 to 15, with two regions labeled Condor-G and Globus Resource. Labels appear in this order:]

  Slide 9: Schedd, LSF, Collector, 600 Condor jobs
  Slide 10: adds GlideIn jobs
  Slide 11: adds GridManager
  Slide 12: adds JobManager
  Slides 13 and 14: add Startd
  Slide 15: adds User Job
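
The build-up in the How It Works slides suggests a pull model: glidein jobs go through the site's batch system, and a user job is bound only once a Startd has actually started and reported in. The toy simulation below is a sketch of that idea under stated assumptions, not Condor code. The `Pool` class, its method names, and the site and slot names are all invented for illustration.

```python
from collections import deque
from dataclasses import dataclass, field


@dataclass
class Pool:
    """Toy stand-in for the Condor-G side: a job queue plus the slots that reported in."""
    user_jobs: deque = field(default_factory=deque)
    ready_slots: list = field(default_factory=list)   # startds that have reported in
    placements: list = field(default_factory=list)    # (job, slot) pairs, in binding order

    def advertise(self, slot_name):
        """A glidein's startd reports in once the site has actually started it."""
        self.ready_slots.append(slot_name)
        self.match()

    def match(self):
        """Bind waiting jobs to ready slots; nothing is bound before a slot exists."""
        while self.user_jobs and self.ready_slots:
            job = self.user_jobs.popleft()
            slot = self.ready_slots.pop(0)
            self.placements.append((job, slot))


def simulate():
    pool = Pool(user_jobs=deque(f"job{i}" for i in range(3)))
    pool.match()                       # no slots yet, so no job is bound
    bound_before = list(pool.placements)
    # Glideins were sent to two hypothetical sites. The batch queue at site-b
    # happens to start its glidein first, so site-b receives the first job.
    for slot in ["site-b/slot1", "site-a/slot1"]:
        pool.advertise(slot)
    return bound_before, pool.placements, list(pool.user_jobs)
```

No queue wait time estimate is needed in this model: the arrival of the slot is itself the signal that the resource is ready, and the third job simply stays queued until another slot appears.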


A Just-in-time Submit

executable = find_particle
requirements = TARGET.Arch == "Intel/Linux" || TARGET.Arch == "Sparc/Solaris"
# job describes the "power"
rank = MFlops * 10000 + Memory
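
To make the roles of requirements and rank concrete, here is a small Python sketch of the job's side of the decision: filter machines by the requirements expression, then prefer the highest rank. It is an illustration only. Real matchmaking also takes the machine's own requirements and preferences into account, and the machine descriptions below are made up.

```python
def requirements(machine):
    # Mirrors: TARGET.Arch == "Intel/Linux" || TARGET.Arch == "Sparc/Solaris"
    return machine["Arch"] in ("Intel/Linux", "Sparc/Solaris")


def rank(machine):
    # Mirrors: rank = MFlops * 10000 + Memory
    return machine["MFlops"] * 10000 + machine["Memory"]


def best_match(machines):
    """Return the eligible machine with the highest rank, or None if none qualifies."""
    eligible = [m for m in machines if requirements(m)]
    return max(eligible, key=rank, default=None)


# Made-up machine descriptions, for illustration only.
MACHINES = [
    {"Name": "a", "Arch": "Intel/Linux", "MFlops": 40, "Memory": 512},
    {"Name": "b", "Arch": "Sparc/Solaris", "MFlops": 40, "Memory": 2048},
    {"Name": "c", "Arch": "Alpha/OSF1", "MFlops": 90, "Memory": 4096},
]
```

With these values, machine c is the most powerful but fails the requirements, and a and b tie on MFlops, so memory decides in favor of b. Nothing in the submit file names a site, which is what allows the binding to be delayed.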


Another Just-in-time Submit

executable = find_particle
requirements = TARGET.Arch == "Intel/Linux" || TARGET.Arch == "Sparc/Solaris"
rank = sam_data_overlap(MY.dataset, TARGET.sam_site_name) + (TARGET.Mflops / 100000)
+dataset = search_space_id_0133313
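
The slide does not define sam_data_overlap, so the following Python sketch assumes one plausible meaning: the number of the dataset's files that are already present at a site. Under that assumption the small Mflops term only breaks ties between machines with equal overlap. The file names, site names, and machine entries are invented for the example.

```python
# Hypothetical catalogue: which files are already cached at each site.
SITE_FILES = {
    "site-a": {"f1", "f2", "f3"},
    "site-b": {"f1"},
}
# Hypothetical contents of the dataset named in the submit file.
DATASET_FILES = {"search_space_id_0133313": {"f1", "f2", "f3", "f4"}}


def sam_data_overlap(dataset, site_name):
    """Assumed meaning: count of the dataset's files already present at the site."""
    wanted = DATASET_FILES.get(dataset, set())
    return len(wanted & SITE_FILES.get(site_name, set()))


def rank(job, machine):
    # Mirrors: sam_data_overlap(MY.dataset, TARGET.sam_site_name) + (TARGET.Mflops / 100000)
    overlap = sam_data_overlap(job["dataset"], machine["sam_site_name"])
    return overlap + machine["Mflops"] / 100000


JOB = {"dataset": "search_space_id_0133313"}
MACHINES = [
    {"Name": "fast-but-far", "sam_site_name": "site-b", "Mflops": 900},
    {"Name": "slow-but-near", "sam_site_name": "site-a", "Mflops": 100},
    {"Name": "fast-and-near", "sam_site_name": "site-a", "Mflops": 300},
]


def ordered(job, machines):
    """Machine names from most to least preferred."""
    by_rank = sorted(machines, key=lambda m: rank(job, m), reverse=True)
    return [m["Name"] for m in by_rank]
```

In this example both machines near the data outrank the fastest machine, and speed matters only between the two that share a site.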


Lots of Tradeoffs…

› Just-in-Time
  Pro: Dynamic. Resources can come and go. Can take advantage of changing circumstances.
  Con: Coordinating multiple resources is harder.

› Eager
  Pro: Easier to coordinate multiple resources.
  Con: Hard to scale… how to know about all the resources in advance?
  Con: Plan falls apart if assumptions change.
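
One of these tradeoffs, the eager plan falling apart when assumptions change, can be shown with a toy comparison in Python. This is a sketch under simplifying assumptions: jobs are independent, placement is round-robin, and the site names are invented. It does not model the coordination cost that counts against just-in-time planning.

```python
def eager_plan(jobs, sites_known_at_planning):
    """Bind every job to a site up front, round-robin over the sites known then."""
    n = len(sites_known_at_planning)
    return {job: sites_known_at_planning[i % n] for i, job in enumerate(jobs)}


def run_eager(plan, sites_up_at_runtime):
    """Jobs whose pre-assigned site has gone away fail; the plan is not revised."""
    done = [j for j, s in plan.items() if s in sites_up_at_runtime]
    failed = [j for j, s in plan.items() if s not in sites_up_at_runtime]
    return done, failed


def run_just_in_time(jobs, sites_up_at_runtime):
    """Bind each job only at dispatch time, to a site that is up right then."""
    if not sites_up_at_runtime:
        return [], list(jobs)
    sites = sorted(sites_up_at_runtime)
    placed = {job: sites[i % len(sites)] for i, job in enumerate(jobs)}
    return list(placed), []


JOBS = ["j0", "j1", "j2", "j3"]
PLANNED_SITES = ["site-a", "site-b"]     # what the planner believed
UP_AT_RUNTIME = {"site-b", "site-c"}     # site-a left, site-c appeared
```

Here the eager plan loses the two jobs it had assigned to site-a and never uses site-c, while the just-in-time version places all four jobs on whatever is available when they are dispatched.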


Some observations

› A complete separation of task from resource is difficult.
  Lots and lots of structured data required.
  But this separation is required in order to achieve Just-In-Time planning.

› Grid protocols that do not separate task from resource cannot realistically live on the grid.
  Virtualization can help.


Plan for failure

› Much effort on how to create a plan.

› How about a plan for when things fail?


Job Failure Policy Expressions

› Condor/Condor-G augmented so users can supply job failure policy expressions in the submit file.

› Can be used to describe a successful run, or what to do in the face of failure.

on_exit_remove = <expression>
on_exit_hold = <expression>
periodic_remove = <expression>
periodic_hold = <expression>


Job Failure Policy Examples

› Do not remove from queue (i.e. reschedule) if exits with a signal:
  on_exit_remove = ExitBySignal == False

› Place on hold if exits with nonzero status or ran for less than an hour:
  on_exit_hold = ((ExitBySignal == False) && (ExitCode != 0)) || ((ServerStartTime - JobStartDate) < 3600)

› Place on hold if job has spent more than 50% of its time suspended:
  periodic_hold = CumulativeSuspensionTime > (RemoteWallClockTime / 2.0)
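
The three examples can be read as predicates over the job's attributes. The Python sketch below restates them that way and adds a small dispatcher. It follows the prose descriptions, so the nonzero status check uses ExitCode, which is an assumption about the intended attribute. The order in which hold and remove are checked, the name `after_exit`, and the sample attribute values are also assumptions made for illustration.

```python
def on_exit_remove(job):
    # Mirrors: on_exit_remove = ExitBySignal == False
    return job["ExitBySignal"] is False


def on_exit_hold(job):
    # Follows the prose: nonzero exit status, or a run shorter than an hour.
    # Using ExitCode as the status attribute is an assumption of this sketch.
    nonzero = job["ExitBySignal"] is False and job["ExitCode"] != 0
    too_short = (job["ServerStartTime"] - job["JobStartDate"]) < 3600
    return nonzero or too_short


def periodic_hold(job):
    # Mirrors: periodic_hold = CumulativeSuspensionTime > (RemoteWallClockTime / 2.0)
    return job["CumulativeSuspensionTime"] > job["RemoteWallClockTime"] / 2.0


def after_exit(job):
    """Assumed order: hold is checked first, then remove; otherwise the job is requeued."""
    if on_exit_hold(job):
        return "hold"
    if on_exit_remove(job):
        return "remove"
    return "requeue"


# Sample attribute values, invented for the example (times in seconds).
LONG_RUN = {"ServerStartTime": 10000, "JobStartDate": 2000}
CLEAN = {"ExitBySignal": False, "ExitCode": 0, **LONG_RUN}
FAILED = {"ExitBySignal": False, "ExitCode": 1, **LONG_RUN}
KILLED = {"ExitBySignal": True, "ExitSignal": 11, **LONG_RUN}
TOO_QUICK = {"ExitBySignal": False, "ExitCode": 0,
             "ServerStartTime": 2060, "JobStartDate": 2000}
```

With these samples, a clean exit leaves the queue, a nonzero exit or a suspiciously short run is held for inspection, and a job killed by a signal is requeued, which matches the intent stated in the bullets.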


Thank you!

http://www.cs.wisc.edu/condor

condor-admin@cs.wisc.edu