Condor DAGMan: Managing Job Dependencies with Condor
Peter F. Couvares
Computer Sciences Department
University of Wisconsin-Madison
[email protected]
http://www.cs.wisc.edu/condor
Condor DAGMan www.cs.wisc.edu/condor
Condor DAGMan
› What is DAGMan?
› What is it good for?
› How does it work?
› What’s next?
DAGMan
› Directed Acyclic Graph Manager
› DAGMan allows you to specify the dependencies between your Condor jobs, so it can manage them automatically for you.
› (e.g., “Don’t run job ‘B’ until job ‘A’ has completed successfully.”)
Typical Scenarios
› Jobs whose output needs to be summarized or post-processed once they complete.
› Jobs that need data to be generated or pre-processed before they can use it.
› Jobs which require data to be staged to/from remote repositories before they start or after they finish.
What is a DAG?
› A DAG is the data structure used by DAGMan to represent these dependencies.
› Each job is a “node” in the DAG.
› Each node can have any number of “parents” or “children” (or neither) – as long as there are no loops!
[Figure: a diamond-shaped DAG with Job A at the top, Job B and Job C in the middle, and Job D at the bottom]
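As a rough illustration of the data structure, a DAG can be held as a mapping from each node to the set of its parents. This is a minimal sketch, not DAGMan's internal representation; it assumes a diamond in which A is the parent of B and C, and B and C are both parents of D. The names `parents`, `is_acyclic`, and `children_of` are hypothetical, and the loop check relies on Python's standard `graphlib` module (Python 3.9+).

```python
from graphlib import CycleError, TopologicalSorter

# Each key is a node; each value is the set of that node's parents.
# Assumed edges: A -> B, A -> C, B -> D, C -> D.
parents = {
    "A": set(),
    "B": {"A"},
    "C": {"A"},
    "D": {"B", "C"},
}


def is_acyclic(parent_map):
    """Return True if the dependency graph contains no loops."""
    try:
        # prepare() raises CycleError when the graph has a cycle.
        TopologicalSorter(parent_map).prepare()
        return True
    except CycleError:
        return False


def children_of(parent_map, node):
    """Derive a node's children from the parent sets."""
    return sorted(child for child, ps in parent_map.items() if node in ps)


print(is_acyclic(parents))        # the diamond has no loops
print(children_of(parents, "A"))  # nodes that list A as a parent

# Making A depend on D closes a loop (A -> B -> D -> A), so this is not a DAG.
looped = dict(parents, A={"D"})
print(is_acyclic(looped))
```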
An Example DAG
› Jobs whose output needs to be summarized or post-processed once they complete:
[Figure: Job A, Job B, and Job C in the top row, with Job D below them]
Another Example DAG
› Jobs that need data to be generated or pre-processed before they can use it:
[Figure: Job A at the top, with Job B, Job C, and Job D in the row below]
Defining a DAG
› A DAG is defined by a .dag file, listing all its nodes and any dependencies:
# diamond.dag
Job A a.sub
Job B b.sub
Job C c.sub
Job D d.sub
Parent A Child B C
Parent B C Child D
[Figure: the diamond DAG, with Job A at the top, Job B and Job C in the middle, and Job D at the bottom]
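To make the file format concrete, here is a minimal sketch of reading the two kinds of lines shown above into Python dictionaries. It is not DAGMan's own parser: `parse_dag` is a hypothetical helper that handles only `Job` and `Parent ... Child ...` lines and comments, and it lower-cases keywords because the slides spell them both as `Parent` and `PARENT`.

```python
DIAMOND = """\
# diamond.dag
Job A a.sub
Job B b.sub
Job C c.sub
Job D d.sub
Parent A Child B C
Parent B C Child D
"""


def parse_dag(text):
    """Parse Job and Parent/Child lines into (jobs, parents) dictionaries."""
    jobs = {}     # node name -> submit file
    parents = {}  # node name -> set of parent node names
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        words = line.split()
        keyword = words[0].lower()
        if keyword == "job":
            name, submit_file = words[1], words[2]
            jobs[name] = submit_file
            parents.setdefault(name, set())
        elif keyword == "parent":
            # Everything between "Parent" and "Child" is a parent;
            # everything after "Child" is a child.
            split_at = [w.lower() for w in words].index("child")
            for child in words[split_at + 1:]:
                parents.setdefault(child, set()).update(words[1:split_at])
    return jobs, parents


jobs, parents = parse_dag(DIAMOND)
print(jobs)
print(parents)
```

Storing parents rather than children makes the later question "may this node run yet?" a simple subset test against the set of completed nodes.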
Defining a DAG (cont’d)
› Each node in the DAG will run a Condor job, specified by a Condor submit file:
# diamond.dag
Job A a.sub
Job B b.sub
Job C c.sub
Job D d.sub
Parent A Child B C
Parent B C Child D
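Because a .dag file is plain text, it can also be generated by a script. The sketch below writes the fan-in shape from the earlier example, where A, B, and C all feed D. `fan_in_dag` is a hypothetical helper, and it assumes each node's submit file is named after the node in lowercase (a.sub for A), following the naming used in diamond.dag.

```python
def fan_in_dag(worker_names, summary_name):
    """Return .dag text in which every worker node is a parent of one summary node."""
    lines = [f"Job {name} {name.lower()}.sub" for name in worker_names]
    lines.append(f"Job {summary_name} {summary_name.lower()}.sub")
    # A single Parent/Child line may list several parents.
    lines.append(f"Parent {' '.join(worker_names)} Child {summary_name}")
    return "\n".join(lines) + "\n"


dag_text = fan_in_dag(["A", "B", "C"], "D")
print(dag_text, end="")
```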
Submitting a DAG
› To start your DAG, just run condor_submit_dag with your .dag file, and Condor will start a personal DAGMan daemon & begin running your jobs:
% condor_submit_dag diamond.dag
› The DAGMan daemon itself runs as a Condor job, so you don’t have to baby-sit it.
Running a DAG
› DAGMan acts as a “meta-scheduler”, managing the submission of your jobs to Condor based on the DAG dependencies.
[Figure: DAGMan reads the .dag file describing nodes A, B, C, and D; job A is in the Condor job queue]
Running a DAG (cont’d)
› DAGMan holds & submits jobs to the Condor queue at the appropriate times.
[Figure: DAGMan with nodes A, B, C, and D; jobs B and C are in the Condor job queue]
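The "appropriate times" can be pictured as a loop that submits a node only once all of its parents have completed. This is a simplified sketch under stated assumptions, not DAGMan's scheduler: `ready_nodes` is a hypothetical helper, submission is simulated by adding names to a set, and every submitted job is assumed to succeed.

```python
# The diamond DAG: each node mapped to the set of its parents.
parents = {
    "A": set(),
    "B": {"A"},
    "C": {"A"},
    "D": {"B", "C"},
}


def ready_nodes(parent_map, done, submitted):
    """Nodes not yet submitted whose parents have all completed."""
    return sorted(
        node
        for node, ps in parent_map.items()
        if node not in done and node not in submitted and ps <= done
    )


done, submitted = set(), set()
order = []  # one list of node names per round of submissions
while len(done) < len(parents):
    batch = ready_nodes(parents, done, submitted)
    if not batch:
        break  # nothing can run; avoids spinning forever on a bad graph
    submitted.update(batch)  # stand-in for handing the jobs to the Condor queue
    order.append(batch)
    done.update(batch)       # assumption: every submitted job completes successfully

print(order)
```

Job D is held back until both B and C are in `done`, which matches the queue contents pictured on the slides.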
Running a DAG (cont’d)
› In case of a job failure, DAGMan continues until it can no longer make progress, and then creates a “rescue” file with the current state of the DAG.
[Figure: DAGMan with nodes A, B, and D; node C has failed (marked X), and DAGMan writes a Rescue File]
Recovering a DAG
› Once the failed job is ready to be re-run, the Rescue file can be used to restore the prior state of the DAG.
[Figure: DAGMan restores nodes A, B, C, and D from the Rescue File; job C is back in the Condor job queue]
Recovering a DAG (cont’d)
› Once that job completes, DAGMan will continue the DAG as if the failure never happened.
[Figure: DAGMan with nodes A, B, C, and D; job D is in the Condor job queue]
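The failure and recovery cycle can be sketched as two runs over the same graph. The slides do not show the rescue file format, so this example simply assumes the saved state is a list of completed node names; `run_dag` and `rescue_state` are hypothetical, and job failures are simulated by naming the nodes that should fail.

```python
# The diamond DAG: each node mapped to the set of its parents.
PARENTS = {
    "A": set(),
    "B": {"A"},
    "C": {"A"},
    "D": {"B", "C"},
}


def run_dag(parent_map, already_done, failing):
    """Run until no further progress is possible.

    Returns (done, failed, ran), where ran lists the nodes that
    completed successfully during this run, in order.
    """
    done = set(already_done)
    failed = set()
    ran = []
    while True:
        ready = sorted(
            node
            for node, ps in parent_map.items()
            if node not in done and node not in failed and ps <= done
        )
        if not ready:
            return done, failed, ran
        for node in ready:
            if node in failing:
                failed.add(node)  # simulated job failure
            else:
                done.add(node)
                ran.append(node)


# First run: job C fails. B still runs, but D can never start.
first_done, first_failed, first_ran = run_dag(PARENTS, set(), failing={"C"})
rescue_state = sorted(first_done)  # assumed stand-in for the rescue file

# Second run: restore the saved state; only the unfinished nodes run.
second_done, second_failed, second_ran = run_dag(
    PARENTS, set(rescue_state), failing=set()
)

print(rescue_state)
print(second_ran)
```

In the second run A and B are not repeated, which is the sense in which the DAG continues as if the failure never happened.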
Finishing a DAG
› Once the DAG is complete, the DAGMan job itself is finished, and exits.
[Figure: DAGMan with nodes A, B, C, and D all complete; the Condor job queue is empty]
Additional Features
› DAGMan provides some other handy features for job management:
• nodes can have PRE & POST scripts
• job submission can be “throttled”
PRE & POST Scripts
› Each node can have a PRE or POST script, executed as part of the node:
# diamond.dag
Job A a.sub
Job B b.sub
Job C c.sub
Job D d.sub
PARENT A CHILD B C
PARENT B C CHILD D
Script PRE B stage-in.sh
Script POST B stage-out.sh
[Figure: the diamond DAG, with node B expanded into its PRE script, Job B, and its POST script]
PRE & POST Scripts (cont’d)
› Useful for staging a job’s data from remote repositories, and/or putting it back afterwards.
› Ex:
• PRE: Globus FTP the data from afar
• Run the job
• POST: Globus FTP the data back
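To show how the Script lines attach to a node, the sketch below reads the Job and Script lines from the example file and lists the steps for each node in order: PRE script, then the job, then POST script. `node_plans` is a hypothetical helper, not part of DAGMan, and it models only the ordering; how DAGMan reacts when one of the steps fails is not covered by the slides and is left out.

```python
DAG_TEXT = """\
# diamond.dag
Job A a.sub
Job B b.sub
Job C c.sub
Job D d.sub
PARENT A CHILD B C
PARENT B C CHILD D
Script PRE B stage-in.sh
Script POST B stage-out.sh
"""


def node_plans(text):
    """Map each node to its ordered steps: optional PRE, the job, optional POST."""
    jobs = {}     # node name -> submit file
    scripts = {}  # (node name, "PRE" or "POST") -> script command
    for raw in text.splitlines():
        words = raw.split()
        if not words or words[0].startswith("#"):
            continue  # skip blank lines and comments
        keyword = words[0].lower()
        if keyword == "job":
            jobs[words[1]] = words[2]
        elif keyword == "script":
            kind, node = words[1].upper(), words[2]
            scripts[(node, kind)] = " ".join(words[3:])
    plans = {}
    for node, submit_file in jobs.items():
        steps = []
        pre = scripts.get((node, "PRE"))
        post = scripts.get((node, "POST"))
        if pre is not None:
            steps.append(f"PRE {pre}")
        steps.append(f"JOB {submit_file}")
        if post is not None:
            steps.append(f"POST {post}")
        plans[node] = steps
    return plans


plans = node_plans(DAG_TEXT)
print(plans["A"])
print(plans["B"])
```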
Submit Throttling
› DAGMan can limit the maximum number of jobs it will submit to Condor at once:
% condor_submit_dag -maxjobs N
› Useful for managing resource limitations (e.g., storage). Ex: 1000 jobs, each of which requires 1 GB of disk space, and you have 100 GB of disk.
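The arithmetic behind that example can be worked through in a few lines. This sketch assumes a job's disk space is needed only while the job is submitted, so the throttle is the number of jobs that fit on disk at once; the file name `big.dag` is hypothetical.

```python
total_jobs = 1000
disk_per_job_gb = 1
disk_available_gb = 100

# At most this many jobs fit on disk at the same time.
max_jobs = disk_available_gb // disk_per_job_gb

# Command line that applies the throttle (big.dag is a made-up file name).
command = ["condor_submit_dag", "-maxjobs", str(max_jobs), "big.dag"]

print(max_jobs)
print(" ".join(command))
```

With these numbers the limit is 100, so the 1000 jobs pass through the queue at most 100 at a time instead of all at once.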
Summary
› DAGMan:
• manages dependencies, holding & running jobs only at the appropriate times
• monitors job progress
• is fault-tolerant
• is recoverable in case of job failure
• provides some additional features to Condor
• currently DAGMan itself can only run on Unix, but its jobs can run anywhere
Future Work
› More sophisticated management of remote data transfer & staging to maximize CPU throughput.
• Keep the pipeline full! I.e., intelligently manage disk & network to always have remote data ready where a CPU becomes available.
• Possible integration with Kangaroo, etc.
› Better integration with Condor tools
• condor_q, etc. displaying DAG information
Conclusion
› Interested in seeing more?
Come to the DAGMan demo:
• Wednesday 9am - noon
• Room 3393, Computer Sciences (1210 W. Dayton St.)
Email me:
• <[email protected]>
Try it:
• http://www.cs.wisc.edu/condor