Peter F. Couvares Computer Sciences Department University of Wisconsin-Madison [email protected] http://www.cs.wisc.edu/condor Condor DAGMan: Managing Job Dependencies with Condor

Page 1: Condor DAGMan: Managing Job Dependencies with Condor

Peter F. Couvares
Computer Sciences Department
University of Wisconsin-Madison

[email protected]
http://www.cs.wisc.edu/condor

Condor DAGMan: Managing Job Dependencies with Condor

Page 2: Condor DAGMan: Managing Job Dependencies with Condor

Condor DAGMan www.cs.wisc.edu/condor

Condor DAGMan

› What is DAGMan?

› What is it good for?

› How does it work?

› What’s next?

Page 3: Condor DAGMan: Managing Job Dependencies with Condor


DAGMan

› Directed Acyclic Graph Manager

› DAGMan allows you to specify the dependencies between your Condor jobs, so it can manage them automatically for you.

› (e.g., “Don’t run job B until job A has completed successfully.”)

Page 4: Condor DAGMan: Managing Job Dependencies with Condor

Typical Scenarios

› Jobs whose output needs to be summarized or post-processed once they complete.

› Jobs that need data to be generated or pre-processed before they can use it.

› Jobs which require data to be staged to/from remote repositories before they start or after they finish.

Page 5: Condor DAGMan: Managing Job Dependencies with Condor


What is a DAG?

› A DAG is the data structure used by DAGMan to represent these dependencies.

› Each job is a “node” in the DAG.

› Each node can have any number of “parents” or “children” (or neither) – as long as there are no loops!

[Figure: a diamond-shaped DAG. Job A is the parent of Jobs B and C, which are both parents of Job D.]
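The "no loops" rule can be checked mechanically. The following is a minimal sketch, not DAGMan's own code: it assumes a hypothetical in-memory form of the diamond DAG (a dictionary mapping each node to its children) and uses Kahn's algorithm to produce a valid run order, or to report a loop.

```python
from collections import deque

# Hypothetical in-memory form of the diamond DAG: node -> list of children.
children = {
    "A": ["B", "C"],
    "B": ["D"],
    "C": ["D"],
    "D": [],
}

def topological_order(children):
    """Return the nodes in a valid run order, or raise ValueError on a loop."""
    # Count how many parents each node has.
    parent_count = {node: 0 for node in children}
    for kids in children.values():
        for kid in kids:
            parent_count[kid] += 1

    # Nodes with no parents can run first.
    ready = deque(sorted(n for n, c in parent_count.items() if c == 0))
    order = []
    while ready:
        node = ready.popleft()
        order.append(node)
        # Completing a node removes one pending parent from each child.
        for kid in children[node]:
            parent_count[kid] -= 1
            if parent_count[kid] == 0:
                ready.append(kid)

    # Any node left unvisited sits on a loop.
    if len(order) != len(children):
        raise ValueError("graph contains a loop, so it is not a DAG")
    return order

print(topological_order(children))  # ['A', 'B', 'C', 'D']
```

A graph such as {"A": ["B"], "B": ["A"]} has no node without a parent, so the function raises ValueError instead of returning an order.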

Page 6: Condor DAGMan: Managing Job Dependencies with Condor


An Example DAG

› Jobs whose output needs to be summarized or post-processed once they complete:

[Figure: Jobs A, B, and C are all parents of Job D.]

Page 7: Condor DAGMan: Managing Job Dependencies with Condor


Another Example DAG

› Jobs that need data to be generated or pre-processed before they can use it:

[Figure: Job A is the parent of Jobs B, C, and D.]

Page 8: Condor DAGMan: Managing Job Dependencies with Condor


Defining a DAG

› A DAG is defined by a .dag file, listing all its nodes and any dependencies:

    # diamond.dag
    Job A a.sub
    Job B b.sub
    Job C c.sub
    Job D d.sub
    Parent A Child B C
    Parent B C Child D

[Figure: the diamond DAG described by diamond.dag. Job A is the parent of Jobs B and C, which are both parents of Job D.]
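To make the file format concrete, here is a rough sketch of reading the Job and Parent/Child lines shown above into Python data. It is an illustration only, not DAGMan's parser: the function name parse_dag is hypothetical, keywords are matched case-insensitively because the slides use both "Parent" and "PARENT", and any other line type is simply skipped.

```python
def parse_dag(text):
    """Parse Job and Parent/Child lines into (jobs, edges)."""
    jobs = {}    # node name -> submit file
    edges = []   # (parent, child) pairs
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        words = line.split()
        keyword = words[0].upper()
        if keyword == "JOB":
            jobs[words[1]] = words[2]
        elif keyword == "PARENT":
            # Everything before "Child" is a parent, everything after a child.
            split = [w.upper() for w in words].index("CHILD")
            parents = words[1:split]
            kids = words[split + 1:]
            edges.extend((p, k) for p in parents for k in kids)
        # Other line types are ignored in this sketch.
    return jobs, edges

DIAMOND = """\
# diamond.dag
Job A a.sub
Job B b.sub
Job C c.sub
Job D d.sub
Parent A Child B C
Parent B C Child D
"""

jobs, edges = parse_dag(DIAMOND)
print(jobs)
print(edges)  # [('A', 'B'), ('A', 'C'), ('B', 'D'), ('C', 'D')]
```

Note that "Parent B C Child D" expands to two edges, one from each parent to the child.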

Page 9: Condor DAGMan: Managing Job Dependencies with Condor


Defining a DAG (cont’d)

› Each node in the DAG will run a Condor job, specified by a Condor submit file:

    # diamond.dag
    Job A a.sub
    Job B b.sub
    Job C c.sub
    Job D d.sub
    Parent A Child B C
    Parent B C Child D

[Figure: the diamond DAG. Job A is the parent of Jobs B and C, which are both parents of Job D.]

Page 10: Condor DAGMan: Managing Job Dependencies with Condor


Submitting a DAG

› To start your DAG, just run condor_submit_dag with your .dag file, and Condor will start a personal DAGMan daemon & begin running your jobs:

% condor_submit_dag diamond.dag

› The DAGMan daemon itself runs as a Condor job, so you don’t have to baby-sit it.

Page 11: Condor DAGMan: Managing Job Dependencies with Condor


Running a DAG

› DAGMan acts as a “meta-scheduler”, managing the submission of your jobs to Condor based on the DAG dependencies.

[Figure: DAGMan, the .dag file, and the Condor job queue. Job A is in the queue; Jobs B, C, and D have not yet been submitted.]

Page 12: Condor DAGMan: Managing Job Dependencies with Condor


Running a DAG (cont’d)

› DAGMan holds & submits jobs to the Condor queue at the appropriate times.

[Figure: DAGMan and the Condor job queue. Jobs B and C are now in the queue; Job D has not yet been submitted.]
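The "appropriate times" rule can be sketched as follows: a node is submitted only once all of its parents have completed. This is a simplified model rather than DAGMan's implementation. The parents dictionary and the ready_nodes helper are hypothetical, and every job is assumed to succeed as soon as it is submitted.

```python
# Hypothetical diamond DAG: node -> set of parents.
parents = {"A": set(), "B": {"A"}, "C": {"A"}, "D": {"B", "C"}}

def ready_nodes(parents, done, submitted):
    """Nodes whose parents have all completed and that are not yet submitted."""
    return sorted(
        node for node, needs in parents.items()
        if node not in submitted and needs <= done
    )

done, submitted, waves = set(), set(), []
while len(done) < len(parents):
    batch = ready_nodes(parents, done, submitted)
    if not batch:
        break                 # nothing can run; avoids looping forever
    submitted.update(batch)   # "submit" the batch to the queue
    waves.append(batch)
    done.update(batch)        # assumption: every submitted job succeeds

print(waves)  # [['A'], ['B', 'C'], ['D']]
```

The three waves match the slides: A goes to the queue first, then B and C together, then D.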

Page 13: Condor DAGMan: Managing Job Dependencies with Condor


Running a DAG (cont’d)

› In case of a job failure, DAGMan continues until it can no longer make progress, and then creates a “rescue” file with the current state of the DAG.

[Figure: the DAG with Job C marked as failed (X), the Condor job queue, and the Rescue file written by DAGMan.]
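The failure behaviour can be modelled in a few lines. This sketch is an assumption-laden illustration: the run_dag function is hypothetical, and the "rescue" state is modelled as nothing more than the set of completed nodes, since the slides do not describe the actual Rescue file format.

```python
# Hypothetical diamond DAG: node -> set of parents.
parents = {"A": set(), "B": {"A"}, "C": {"A"}, "D": {"B", "C"}}

def run_dag(parents, failing, done=()):
    """Run nodes until no progress is possible; return (done, failed, blocked)."""
    done, failed = set(done), set()
    progress = True
    while progress:
        progress = False
        for node in sorted(parents):
            if node in done or node in failed:
                continue
            if parents[node] <= done:
                # Pretend to run the job; nodes listed in `failing` fail.
                (failed if node in failing else done).add(node)
                progress = True
    blocked = set(parents) - done - failed
    return done, failed, blocked

# First attempt: job C fails, so D can never start.
done, failed, blocked = run_dag(parents, failing={"C"})
print(sorted(done), sorted(failed), sorted(blocked))  # ['A', 'B'] ['C'] ['D']

# Second attempt: resume from the saved state once C has been fixed.
done2, failed2, blocked2 = run_dag(parents, failing=set(), done=done)
print(sorted(done2))  # ['A', 'B', 'C', 'D']
```

On the second attempt A and B are not re-run, because they are already in the saved set of completed nodes.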

Page 14: Condor DAGMan: Managing Job Dependencies with Condor


Recovering a DAG

› Once the failed job is ready to be re-run, the Rescue file can be used to restore the prior state of the DAG.

[Figure: DAGMan restores the DAG from the Rescue file, and Job C is back in the Condor job queue.]

Page 15: Condor DAGMan: Managing Job Dependencies with Condor


Recovering a DAG (cont’d)

› Once that job completes, DAGMan will continue the DAG as if the failure never happened.

[Figure: DAGMan and the Condor job queue. Job D is now in the queue.]

Page 16: Condor DAGMan: Managing Job Dependencies with Condor


Finishing a DAG

› Once the DAG is complete, the DAGMan job itself is finished, and exits.

[Figure: the finished DAG (Jobs A, B, C, and D) and the Condor job queue.]

Page 17: Condor DAGMan: Managing Job Dependencies with Condor


Additional Features

› DAGMan provides some other handy features for job management:

  • nodes can have PRE & POST scripts

  • job submission can be “throttled”

Page 18: Condor DAGMan: Managing Job Dependencies with Condor


PRE & POST Scripts

› Each node can have a PRE or POST script, executed as part of the node:

    # diamond.dag
    Job A a.sub
    Job B b.sub
    Job C c.sub
    Job D d.sub
    PARENT A CHILD B C
    PARENT B C CHILD D
    Script PRE B stage-in.sh
    Script POST B stage-out.sh

[Figure: the diamond DAG, with Job B's node containing its PRE script, the job itself, and its POST script.]

Page 19: Condor DAGMan: Managing Job Dependencies with Condor


PRE & POST Scripts (cont’d)

› Useful for staging a job’s data from remote repositories, and/or putting it back afterwards.

› Ex:

  • PRE: Globus FTP the data from afar

  • Run the job

  • POST: Globus FTP the data back

Page 20: Condor DAGMan: Managing Job Dependencies with Condor


Submit Throttling

› DAGMan can limit the maximum number of jobs it will submit to Condor at once:

% condor_submit_dag -maxjobs N

› Useful for managing resource limitations (e.g., storage).

  • Ex: 1000 jobs, each of which requires 1 GB of disk space, and you have 100 GB of disk.
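The arithmetic behind the disk example can be spelled out. In this sketch the helper name max_concurrent_jobs and the file name diamond.dag are illustrative choices. It assumes disk space is the only constraint and that each job needs its full allotment for as long as it is in the queue.

```python
def max_concurrent_jobs(total_disk_gb, disk_per_job_gb):
    """Largest number of jobs whose disk needs fit at the same time."""
    return total_disk_gb // disk_per_job_gb

# 100 GB of disk and 1 GB per job, as in the example above.
limit = max_concurrent_jobs(total_disk_gb=100, disk_per_job_gb=1)

# Command line in the form shown on the slide; the .dag file name is illustrative.
command = ["condor_submit_dag", "-maxjobs", str(limit), "diamond.dag"]
print(" ".join(command))  # condor_submit_dag -maxjobs 100 diamond.dag
```

With the limit set to 100, the 1000 jobs pass through the queue at most 100 at a time, so the jobs in the queue never need more than the 100 GB available.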

Page 21: Condor DAGMan: Managing Job Dependencies with Condor

Summary

› DAGMan:

  • manages dependencies, holding & running jobs only at the appropriate times

  • monitors job progress

  • is fault-tolerant

  • is recoverable in case of job failure

  • provides some additional features to Condor

  • currently DAGMan itself can only run on Unix, but its jobs can run anywhere

Page 22: Condor DAGMan: Managing Job Dependencies with Condor


Future Work

› More sophisticated management of remote data transfer & staging to maximize CPU throughput.

  • Keep the pipeline full! I.e., intelligently manage disk & network to always have remote data ready where a CPU becomes available.

  • Possible integration with Kangaroo, etc.

› Better integration with Condor tools

  • condor_q, etc. displaying DAG information

Page 23: Condor DAGMan: Managing Job Dependencies with Condor


Conclusion

› Interested in seeing more?

  Come to the DAGMan demo:

  • Wednesday 9am - noon

  • Room 3393, Computer Sciences (1210 W. Dayton St.)

  Email me:

  • <[email protected]>

  Try it:

  • http://www.cs.wisc.edu/condor