Part 8: DAGMan

• A: Grid Workflow Management

• B: DAGMan

• C: Laboratory: DAGMan

A: Grid Workflow Management

Job Dependencies

• In many applications, some jobs are dependent on other jobs

• E.g., job A must finish before job B starts

• Often because job B uses output from job A

• We call a set of interdependent jobs a workflow

• Condor-G can run jobs in any order

• We need a workflow manager

Two Motivating Examples

The Sloan Digital Sky Survey

The Montage Project

Sloan Digital Sky Survey

• Map one-quarter of the entire sky

• Determine the positions and absolute brightness of more than 100 million celestial objects.

• Measure the distance to a million of the nearest galaxies, and to 100,000 quasars.

http://www.sdss.org

Workflow to Find Galaxy Clusters

[Figure: workflow DAG for finding galaxy clusters. Node labels: tsObj, field, brg, core, cluster, catalog. Stage labels: fieldPrep, maxBrg, maxBcg, bcgCoal, getCatalog.]

Montage

• Create a large mosaic image from many smaller images

• Used for astronomy data

• Correct optical distortions and intensity differences

http://montage.ipac.caltech.edu

Montage Workflow

[Figure: Montage workflow DAG with 1202 nodes. Legend: data stage-in nodes, Montage compute nodes, data stage-out nodes, inter-pool transfer nodes.]

B: DAGMan

DAGMan

• Directed Acyclic Graph Manager

• Workflow manager for Condor-G

• DAGMan allows you to specify the dependencies between your Condor jobs, so it can manage them automatically for you.

• By default, Condor may run your jobs in any order, or everything simultaneously, so we need DAGMan to enforce an ordering when necessary.

• (e.g., “Don’t run job B until job A has completed successfully.”)

What is a DAG?

• A DAG is the data structure used by DAGMan to represent these dependencies.

• Each job is a “node” in the DAG.

• Each node can have any number of “parent” or “children” nodes – as long as there are no loops!

[Figure: diamond-shaped DAG. Job A is at the top, Job B and Job C are in the middle, and Job D is at the bottom.]
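The "no loops" rule can be checked mechanically. The following is a minimal sketch, not DAGMan's own code: it stores each node's parents in a dictionary and uses Python's standard `graphlib` module (Python 3.9 or later) to reject graphs that contain a loop. The names `diamond`, `is_valid_dag`, and `run_order` are made up for this illustration.

```python
from graphlib import CycleError, TopologicalSorter

# Each node maps to the set of its parents (the jobs it must wait for).
# This mirrors the four-job diamond: A, then B and C, then D.
diamond = {
    "A": set(),
    "B": {"A"},
    "C": {"A"},
    "D": {"B", "C"},
}


def is_valid_dag(parents):
    """Return True if the dependency graph contains no loops."""
    try:
        TopologicalSorter(parents).prepare()  # raises CycleError on a loop
    except CycleError:
        return False
    return True


def run_order(parents):
    """Return one order in which every node comes after all of its parents."""
    return list(TopologicalSorter(parents).static_order())


if __name__ == "__main__":
    print(is_valid_dag(diamond))
    print(run_order(diamond))
```

Making A depend on D (for example `dict(diamond, A={"D"})`) creates a loop, and `is_valid_dag` then returns `False`.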

Defining a DAG

• A DAG is defined by a .dag file, listing each of its nodes and their dependencies:

    # diamond.dag
    Job A a.sub
    Job B b.sub
    Job C c.sub
    Job D d.sub
    Parent A Child B C
    Parent B C Child D

• Each node will run the Condor job specified by its accompanying Condor submit file

[Figure: the diamond DAG defined by this file, with Job A at the top, Job B and Job C in the middle, and Job D at the bottom.]
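To make the file format concrete, here is a small sketch that reads the Job and Parent/Child lines of diamond.dag into Python dictionaries. It is an illustration only: it handles just these two line types and is not DAGMan's real parser. The function name `parse_dag` and the variable `DIAMOND_DAG` are assumptions made for the example.

```python
DIAMOND_DAG = """\
# diamond.dag
Job A a.sub
Job B b.sub
Job C c.sub
Job D d.sub
Parent A Child B C
Parent B C Child D
"""


def parse_dag(text):
    """Return (submit_files, parents) for a minimal subset of .dag syntax."""
    submit_files = {}  # node name -> Condor submit file
    parents = {}       # node name -> set of parent node names
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        words = line.split()
        keyword = words[0].upper()
        if keyword == "JOB":
            name, submit_file = words[1], words[2]
            submit_files[name] = submit_file
            parents.setdefault(name, set())
        elif keyword == "PARENT":
            # Everything before CHILD is a parent, everything after is a child.
            split_at = [w.upper() for w in words].index("CHILD")
            for child in words[split_at + 1:]:
                parents.setdefault(child, set()).update(words[1:split_at])
        # Any other line type is ignored in this sketch.
    return submit_files, parents


if __name__ == "__main__":
    files, deps = parse_dag(DIAMOND_DAG)
    print(files)
    print(deps)
```

For the diamond file, node D ends up with parents B and C, and node A has none.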

Submitting a DAG

• To start your DAG, just run condor_submit_dag with your .dag file, and Condor will start a personal DAGMan daemon to begin running your jobs:

    % condor_submit_dag diamond.dag

• condor_submit_dag submits a job with DAGMan as the executable.

• This job happens to run on the submitting machine, not any other computer.

• Thus the DAGMan daemon itself runs as a Condor job, so you don’t have to baby-sit it.

Running a DAG

• DAGMan acts as a “meta-scheduler”, managing the submission of your jobs to Condor based on the DAG dependencies.

[Figure: DAGMan reads the .dag file and places job A in the Condor job queue.]

Running a DAG (cont’d)

• DAGMan holds & submits jobs to the Condor queue at the appropriate times.

[Figure: with job A complete, jobs B and C are placed in the Condor job queue.]
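The "hold, then submit" behaviour can be modelled in a few lines. The sketch below is a simplified model, not DAGMan's implementation: it releases jobs in whole waves and assumes every job succeeds, whereas the real DAGMan submits each node as soon as its own parents have finished. The function name `submission_waves` is hypothetical.

```python
def submission_waves(parents):
    """Group nodes into waves; a node is released once all its parents are done."""
    done = set()
    waves = []
    while len(done) < len(parents):
        # A node is ready when it has not run yet and every parent is done.
        ready = sorted(
            node for node, node_parents in parents.items()
            if node not in done and node_parents <= done
        )
        if not ready:
            raise ValueError("dependency loop: no node is ready")
        waves.append(ready)
        done.update(ready)  # pretend every job in this wave succeeds
    return waves


if __name__ == "__main__":
    diamond = {"A": set(), "B": {"A"}, "C": {"A"}, "D": {"B", "C"}}
    print(submission_waves(diamond))  # → [['A'], ['B', 'C'], ['D']]
```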

Running a DAG (cont’d)

• In case of a job failure, DAGMan continues until it can no longer make progress, and then creates a “rescue” file with the current state of the DAG.

[Figure: a failed job is marked X, job D has not run, and DAGMan writes a rescue file.]
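The failure and recovery cycle can be sketched as follows, assuming job C of the diamond DAG is the one that fails. This is a toy model: the "rescue state" here is just a Python set of completed node names, not the format of a real rescue file, and the function names `run_until_stuck` and `resume` are invented for the example.

```python
def run_until_stuck(parents, failing):
    """Run every node that can run; return (done, failed, never_ran)."""
    done, failed = set(), set()
    progress = True
    while progress:
        progress = False
        for node in sorted(parents):
            if node in done or node in failed:
                continue
            if parents[node] <= done:  # all parents completed successfully
                if node in failing:
                    failed.add(node)
                else:
                    done.add(node)
                progress = True
    never_ran = set(parents) - done - failed
    return done, failed, never_ran


def resume(parents, done_before, failing=frozenset()):
    """Restart from saved state: nodes that already completed are not re-run."""
    remaining = {
        node: node_parents - done_before
        for node, node_parents in parents.items()
        if node not in done_before
    }
    return run_until_stuck(remaining, failing)


if __name__ == "__main__":
    diamond = {"A": set(), "B": {"A"}, "C": {"A"}, "D": {"B", "C"}}
    done, failed, never_ran = run_until_stuck(diamond, failing={"C"})
    print(done, failed, never_ran)  # A and B done, C failed, D never ran
    print(resume(diamond, done))    # only C and D run the second time
```

The second call runs only C and D, which is the sense in which the DAG continues "as if the failure never happened".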

Recovering a DAG

• Once the failed job is ready to be re-run, the rescue file can be used to restore the prior state of the DAG.

[Figure: DAGMan reads the rescue file and places job C in the Condor job queue again.]

Recovering a DAG (cont’d)

• Once that job completes, DAGMan will continue the DAG as if the failure never happened.

[Figure: with job C complete, job D is placed in the Condor job queue.]

Finishing a DAG

• Once the DAG is complete, the DAGMan job itself is finished, and exits.

[Figure: the completed DAG with jobs A, B, C, and D.]

Additional DAGMan Features

• Provides other handy features for job management…

• Nodes can have PRE & POST scripts

• Failed nodes can be automatically retried a configurable number of times

• Job submission can be “throttled”
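Retries and throttling can each be sketched in a few lines. This is a simplified model under stated assumptions: a "job" is a Python callable that returns an exit status (0 for success), a retry count of N is taken to mean up to N extra attempts after the first failure, and throttling is modelled as fixed-size batches rather than a running cap on queued jobs. The function names are hypothetical.

```python
def run_with_retries(job, retries):
    """Call job() until it returns 0, allowing up to `retries` extra attempts.

    Returns (final_status, attempts_made).
    """
    attempts = 0
    while True:
        attempts += 1
        status = job()
        if status == 0 or attempts > retries:
            return status, attempts


def throttled_batches(ready_nodes, max_submitted):
    """Split ready nodes into batches of at most `max_submitted` nodes."""
    return [
        ready_nodes[i:i + max_submitted]
        for i in range(0, len(ready_nodes), max_submitted)
    ]


if __name__ == "__main__":
    outcomes = iter([1, 1, 0])  # fails twice, then succeeds
    print(run_with_retries(lambda: next(outcomes), retries=3))  # → (0, 3)
    print(throttled_batches(["a", "b", "c", "d", "e"], 2))
```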

Another sample DAGMan submit file

    # Filename: diamond.dag
    Job A A.condor
    Job B B.condor
    Job C C.condor
    Job D D.condor
    Script PRE A top_pre.csh
    Script PRE B mid_pre.perl $JOB
    Script POST B mid_post.perl $JOB $RETURN
    Script PRE C mid_pre.perl $JOB
    Script POST C mid_post.perl $JOB $RETURN
    Script PRE D bot_pre.csh
    PARENT A CHILD B C
    PARENT B C CHILD D
    Retry C 3

[Figure: the diamond DAG, with Job A at the top, Job B and Job C in the middle, and Job D at the bottom.]
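One way to picture how PRE and POST scripts wrap a node is the sketch below. It is a simplified model, not DAGMan's code: scripts and jobs are plain Python callables returning an exit status, the PRE script receives the node name (as `$JOB` does in the file), and the POST script receives the node name and the job's exit status (as `$JOB $RETURN` does). It assumes that a failed PRE script prevents the job from running and that, when a POST script exists, its status decides whether the node succeeded. The name `run_node` is made up for the example.

```python
def run_node(name, job, pre=None, post=None):
    """Run one DAG node; return True if the node counts as successful."""
    if pre is not None and pre(name) != 0:
        return False  # PRE script failed, so the job is not run
    job_status = job()
    if post is not None:
        # With a POST script, its exit status decides the node's outcome.
        return post(name, job_status) == 0
    return job_status == 0


if __name__ == "__main__":
    log = []

    def mid_pre(node):
        log.append(f"pre {node}")
        return 0

    def mid_post(node, status):
        log.append(f"post {node} {status}")
        return status  # pass the job's status through unchanged

    print(run_node("B", job=lambda: 0, pre=mid_pre, post=mid_post))
    print(log)
```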

Lab 8: DAGMan

• In this lab, you’ll:

• Run a simple DAGMan job

• Run a more complex DAGMan job

• Recover a failed DAGMan job

Credits

• NSF disclaimer

• Portions of this presentation were adapted from the following sources:

• Jaime Frey, UW-Madison