Condor DAGMan: Introduction & Update

Peter Couvares
Computer Sciences Department
University of Wisconsin-Madison
[email protected]
http://www.cs.wisc.edu/condor


DAGMan

› Directed Acyclic Graph Manager

› DAGMan allows you to specify the dependencies between your Condor jobs, so it can manage them automatically for you.

› (e.g., “Don’t run job B until job A has completed successfully.”)


Why is This Important?

› Most real science involves complex sequences of tasks – on many resources at many sites. E.g., move data, compute, check, move back, etc.

› … and many types of jobs working together: Condor, Grid (Condor-G), MPI, shell scripts, etc.

› Failures are a certainty, so recoverability of the sequence – not just the jobs – is crucial.


What is a DAG?

› A DAG is the data structure used by DAGMan to represent these dependencies.

› Each job is a “node” in the DAG.

› Each node can have any number of “parent” or “children” nodes – as long as there are no loops!

[Diagram: a diamond-shaped DAG. Job A is the parent of Job B and Job C, which are both parents of Job D.]
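The "no loops" rule can be checked mechanically. The following is a minimal Python sketch, not DAGMan's own code: it assumes the DAG is held as a dictionary mapping each node to its children, and the function name `is_acyclic` is hypothetical. It uses Kahn's algorithm, repeatedly removing nodes that have no remaining parents; if every node can be removed, there is no loop.

```python
from collections import deque


def is_acyclic(children):
    """Return True if the graph (node -> list of children) contains no loops."""
    # Collect every node, including ones that appear only as children.
    nodes = set(children)
    for kids in children.values():
        nodes.update(kids)

    # Count how many parents each node has.
    parent_count = {node: 0 for node in nodes}
    for kids in children.values():
        for kid in kids:
            parent_count[kid] += 1

    # Start from nodes with no parents and peel the graph layer by layer.
    ready = deque(node for node in nodes if parent_count[node] == 0)
    removed = 0
    while ready:
        node = ready.popleft()
        removed += 1
        for kid in children.get(node, []):
            parent_count[kid] -= 1
            if parent_count[kid] == 0:
                ready.append(kid)

    # Nodes caught in a loop never reach a parent count of zero.
    return removed == len(nodes)


# The diamond from the slide, plus a two-node loop for contrast.
diamond = {"A": ["B", "C"], "B": ["D"], "C": ["D"], "D": []}
looped = {"A": ["B"], "B": ["A"]}
```

Here `is_acyclic(diamond)` is true and `is_acyclic(looped)` is false, since neither A nor B in the second graph can ever run first.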


Defining a DAG

› A DAG is defined by a .dag file, listing each of its nodes and their dependencies:

    # diamond.dag
    Job A a.sub
    Job B b.sub
    Job C c.sub
    Job D d.sub
    Parent A Child B C
    Parent B C Child D

› Each node will run the Condor or Grid job specified by its accompanying Condor submit file.

[Diagram: the diamond DAG defined by diamond.dag: A, then B and C, then D.]
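To make the file format concrete, here is a small Python sketch that reads the two kinds of lines shown in diamond.dag (`Job` and `Parent ... Child ...`). It is an illustration only: `parse_dag` is a hypothetical helper, it handles just these two keywords plus comments, and it is not how DAGMan itself parses the file.

```python
def parse_dag(text):
    """Parse Job and Parent/Child lines; return (jobs, parents).

    jobs maps node name -> submit file; parents maps node name -> set of parents.
    """
    jobs = {}
    parents = {}
    for raw_line in text.splitlines():
        line = raw_line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        words = line.split()
        keyword = words[0].lower()
        if keyword == "job":
            name, submit_file = words[1], words[2]
            jobs[name] = submit_file
            parents.setdefault(name, set())
        elif keyword == "parent":
            # Everything between "Parent" and "Child" is a parent;
            # everything after "Child" is a child.
            split_at = [w.lower() for w in words].index("child")
            for child in words[split_at + 1:]:
                parents.setdefault(child, set()).update(words[1:split_at])
    return jobs, parents


DIAMOND_DAG = """\
# diamond.dag
Job A a.sub
Job B b.sub
Job C c.sub
Job D d.sub
Parent A Child B C
Parent B C Child D
"""

jobs, parents = parse_dag(DIAMOND_DAG)
```

After parsing, `parents["D"]` holds both B and C, which is exactly the information needed to decide when D may be submitted.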


Submitting a DAG

› To start your DAG, just run condor_submit_dag with your .dag file, and Condor will start a personal DAGMan daemon to begin running your jobs:

% condor_submit_dag diamond.dag

› condor_submit_dag submits a Scheduler Universe job to run DAGMan under Condor… so DAGMan itself will be robust in case of failure, machine reboots, etc.


Running a DAG

› DAGMan acts as a “meta-scheduler”, managing the submission of your jobs to Condor based on the DAG dependencies.

[Diagram: DAGMan reads the .dag file and places job A in the Condor job queue; B, C, and D have not yet been submitted.]


Running a DAG (cont’d)

› DAGMan holds & submits jobs to the Condor queue at the appropriate times.

[Diagram: with A complete, DAGMan has submitted B and C to the Condor job queue; D is still waiting.]
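The "appropriate time" for a node is when all of its parents have finished. A simplified Python sketch of that rule follows. It assumes every job succeeds and that all jobs in one batch finish before the next batch is considered, which is a simplification of the diagrams rather than DAGMan's actual event loop; `submission_waves` is a hypothetical name.

```python
def submission_waves(parents):
    """Group nodes into waves; a node joins a wave once all its parents are done."""
    done = set()
    waves = []
    while len(done) < len(parents):
        # A node is ready when it has not run yet and every parent has finished.
        ready = sorted(
            node for node, node_parents in parents.items()
            if node not in done and node_parents <= done
        )
        if not ready:
            raise ValueError("no runnable node left: the graph contains a loop")
        waves.append(ready)
        done.update(ready)
    return waves


# node -> set of parents, for the diamond DAG
diamond_parents = {"A": set(), "B": {"A"}, "C": {"A"}, "D": {"B", "C"}}
waves = submission_waves(diamond_parents)
```

For the diamond this yields three waves: A alone, then B and C together, then D.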


Running a DAG (cont’d)

› In case of a job failure, DAGMan continues until it can no longer make progress, and then creates a “rescue” file with the current state of the DAG.

[Diagram: job C has failed (marked X), so D cannot run; DAGMan writes a rescue file.]


Recovering a DAG

› Once the failed job is ready to be re-run, the rescue file can be used to restore the prior state of the DAG.

[Diagram: DAGMan reads the rescue file and resubmits job C to the Condor job queue.]
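The failure and recovery behaviour on the last two slides can be sketched in a few lines of Python. This is a toy model under stated assumptions: a plain set of completed node names stands in for the rescue file (the real file format is not reproduced here), job outcomes are supplied by the caller, and `run_dag` is a hypothetical function.

```python
def run_dag(parents, fails, done=frozenset()):
    """Run every node whose parents are done until no more progress is possible.

    parents: node -> set of parent nodes
    fails:   names of nodes whose job fails in this run
    done:    nodes already completed (the stand-in for a rescue file)
    Returns (done, failed) as two sets.
    """
    done = set(done)
    failed = set()
    progress = True
    while progress:
        progress = False
        for node in sorted(parents):
            if node in done or node in failed:
                continue
            if parents[node] <= done:
                # The node is runnable; record whether its job succeeded.
                (failed if node in fails else done).add(node)
                progress = True
    return done, failed


diamond_parents = {"A": set(), "B": {"A"}, "C": {"A"}, "D": {"B", "C"}}

# First run: C fails. B still completes, but D can never start.
rescue_state, failed = run_dag(diamond_parents, fails={"C"})

# Second run: start from the saved state; only C and D are run.
final_state, failed_again = run_dag(diamond_parents, fails=set(), done=rescue_state)
```

The first run stops with A and B done and C failed; the second run picks up from that state and finishes C and D without repeating A or B.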


Finishing a DAG

› Once the DAG is complete, the DAGMan job itself is finished, and exits.

[Diagram: DAGMan and the Condor job queue, with nodes A, B, C, and D all complete.]


Additional DAGMan Features

› DAGMan provides other knobs handy for job management:
  • Nodes can have PRE & POST scripts
  • Job submission can be “throttled”
  • NEW: failed nodes can be automatically retried a configurable number of times


PRE & POST Scripts

› PRE and POST scripts execute locally on the submit host: PRE before the node’s job is submitted, POST after the job completes.

› Example:

    # diamond.dag
    Script PRE A prepare-A.sh
    Job A a.sub
    Job B b.sub
    Job C c.sub
    Job D d.sub
    Script POST D double-check.sh
    Parent A Child B C
    Parent B C Child D

› PRE/POST scripts are part of the node.

[Diagram: the diamond DAG, with the PRE script attached to Job A and the POST script attached to Job D.]
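Because the scripts are part of the node, a node can be thought of as up to three steps run in order. The Python sketch below illustrates that idea under a deliberately simple assumption: the node stops at the first step that returns a non-zero status, and succeeds only if every step that ran returned 0. DAGMan's exact handling of a POST script after a failed job may differ; `run_node` and the callables are hypothetical stand-ins for real scripts and jobs.

```python
def run_node(job, pre=None, post=None):
    """Run PRE, the job, then POST; return (succeeded, steps_run).

    Each step is a callable returning an exit status (0 means success).
    """
    steps_run = []
    for label, step in (("PRE", pre), ("JOB", job), ("POST", post)):
        if step is None:
            continue  # this node has no such script
        steps_run.append(label)
        if step() != 0:
            return False, steps_run  # stop at the first failing step
    return True, steps_run


# Node A from the slide: a PRE script and a job, both succeeding.
ok_result = run_node(job=lambda: 0, pre=lambda: 0)

# A node whose PRE script fails: the job is never submitted.
bad_result = run_node(job=lambda: 0, pre=lambda: 1, post=lambda: 0)
```

In the second call only the PRE step runs, so dependent nodes see the whole node as failed even though the job itself was never attempted.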


DAG “Throttling”

› You can tell DAGMan to limit the maximum number of jobs it submits at any one time:
  • condor_submit_dag -maxjobs N
  • Useful for managing resource limitations (e.g., licenses)

› You can also limit the number of simultaneous PRE or POST scripts.
  • Added after Vladimir Litvin’s 7000-node DAG started 7000 PRE scripts on his machine!
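The effect of a limit such as `-maxjobs N` can be pictured with a short Python sketch. It assumes, for simplicity, that each batch of submitted jobs finishes before the next batch starts; the real scheduler refills the queue as individual jobs complete. `throttled_rounds` is a hypothetical name.

```python
def throttled_rounds(ready_nodes, max_jobs):
    """Split ready nodes into rounds of at most max_jobs nodes each."""
    if max_jobs < 1:
        raise ValueError("max_jobs must be at least 1")
    waiting = list(ready_nodes)
    rounds = []
    while waiting:
        rounds.append(waiting[:max_jobs])   # submit up to the limit
        waiting = waiting[max_jobs:]        # the rest keep waiting
    return rounds


# Seven independent nodes, at most three in the queue at once.
nodes = [f"node{i}" for i in range(7)]
rounds = throttled_rounds(nodes, max_jobs=3)
```

With seven ready nodes and a limit of three, the nodes go out in rounds of three, three, and one, so no more than three ever compete for a scarce resource such as a license.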


Node RETRY

› RETRY tells DAGMan to re-run a failed node, up to a given number of times…

› Example:

    # diamond.dag
    Job A a.sub
    Job B b.sub
    RETRY B 5
    Job C c.sub
    RETRY C 5
    Job D d.sub
    Parent A Child B C
    Parent B C Child D

[Diagram: the diamond DAG, in which nodes B and C carry RETRY settings.]
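A retry count amounts to a bounded loop around the node. The Python sketch below assumes that `RETRY B 5` means up to five re-runs after the first attempt (six attempts in total); that reading is an assumption of this example, and `run_with_retry` is a hypothetical helper rather than DAGMan code.

```python
def run_with_retry(job, retries):
    """Run job once, then up to `retries` more times while it keeps failing.

    job is a callable returning an exit status (0 means success).
    Returns (succeeded, attempts_made).
    """
    attempts = 0
    while True:
        attempts += 1
        status = job()
        if status == 0 or attempts > retries:
            return status == 0, attempts


# A flaky job that fails twice and then succeeds.
outcomes = iter([1, 1, 0])
flaky_result = run_with_retry(lambda: next(outcomes), retries=5)

# A job that never succeeds uses the first attempt plus all five retries.
hopeless_result = run_with_retry(lambda: 1, retries=5)
```

The flaky job succeeds on its third attempt and the loop stops there; the hopeless one gives up after six attempts and the node is reported as failed.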


DAGMan Progress

› Testing… lots of testing.
  • 10,000+ node DAGs run smoothly
  • Developed automated DAG testing tools to generate random DAGs and test for correct execution (Ning Lin & Will McDonald)
  • Lots of bugs fixed


DAGMan Progress (cont’d)

› New features
  • Improved logging (timestamps, etc.)
  • More efficient recovery
  • Node RETRY capability
  • DAG info in condor_q (with -dag flag)
  • Robust in more failure cases
  • Recursive DAGs for conditional execution

› DAGMan for Windows (Ray Pingree)


DAGMan Success

› DAGMan is becoming part of the common framework for running on the grid.
  • Particle Physics Data Grid (PPDG)
  • Grid Physics Network (GriPhyN)
  • Many Super Computing 2001 demos
  • more…


DAGMan in the GriPhyN Architecture

[Diagram by Ian Foster (Argonne). Components shown: Application, Planner, Executor, Catalog Services, Info Services, Policy/Security, Monitoring, Repl. Mgmt., Reliable Transfer Service, Compute Resource, Storage Resource. Technology labels shown: DAG (twice); DAGMAN, Kangaroo; GRAM; GridFTP, GRAM, SRM; GSI, CAS; MDS (twice); MCAT, GriPhyN catalogs; GDMP; Globus.]


DAGMan in PPDG Tools

diagram by Jim Amundson (Fermilab)


What’s Next?

› More flexible control of node execution
  • Currently implicit: “all my parents returned 0”.
  • Why not, “all parents returned 0 AND ran for more than two hours” or “parent A returned 0 and parent B returned 42”?

› 1st step: represent DAG nodes internally as ClassAds
  • Allows DAGMan to decide when to run nodes based on arbitrary requirements
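The three rules quoted above can be written as interchangeable predicates over the parents' results. The Python sketch below is only an illustration of that idea: plain dictionaries and functions stand in for ClassAds, no ClassAd syntax is implied, and the field names `exit_code` and `runtime_hours` are invented for the example.

```python
def all_parents_ok(results):
    """The current implicit rule: every parent returned 0."""
    return all(r["exit_code"] == 0 for r in results.values())


def all_parents_ok_and_long(results):
    """Every parent returned 0 AND ran for more than two hours."""
    return all(
        r["exit_code"] == 0 and r["runtime_hours"] > 2
        for r in results.values()
    )


def a_ok_and_b_returned_42(results):
    """Parent A returned 0 and parent B returned 42."""
    return results["A"]["exit_code"] == 0 and results["B"]["exit_code"] == 42


# Hypothetical results for two parents of some child node.
parent_results = {
    "A": {"exit_code": 0, "runtime_hours": 3.5},
    "B": {"exit_code": 42, "runtime_hours": 0.1},
}

decisions = {
    rule.__name__: rule(parent_results)
    for rule in (all_parents_ok, all_parents_ok_and_long, a_ok_and_b_returned_42)
}
```

With these results the child would be blocked under the first two rules but allowed to run under the third, which is the kind of per-node choice the slide asks for.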


What’s Next? (cont’d)

› Extend DAGMan to utilize DaP Scheduler (DaP?) to intelligently schedule data transfers along with Condor and Condor-G jobs.

[Diagram: DAGMan connected to Condor-G, Condor, and the DaP Scheduler.]


Thank You!

› Interested in seeing more?

  Come to the DAGMan BoF
  • Wednesday 9am - noon
  • Room 3393, Computer Sciences (1210 W. Dayton St.)

  Email us:
  • [email protected]

  Try it!
  • http://www.cs.wisc.edu/condor