Slide 1
Condor Project
Computer Sciences Department
University of Wisconsin-Madison
condor-admin@cs.wisc.edu
http://www.cs.wisc.edu/condor
Condor: A Project and a System
Scientific Data Intensive Computing Workshop '04
Microsoft Research, May 2004

Slide 2
Outline
  • What is the Condor Project?
  • What is the Condor HTC Software?
  • Recipe for using desktops for science
  • Data!

Slide 3
The Condor Project (Established '85)
Distributed High Throughput Computing research performed by a team of ~35 faculty, full-time staff and students.

Slide 4
The Condor Project (Established '85)
Distributed High Throughput Computing research performed by a team of ~35 faculty, full-time staff and students who:
  • face software engineering challenges in a heterogeneous distributed environment,
  • are involved in national and international grid collaborations,
  • actively interact with academic and commercial users,
  • maintain and support large distributed production environments, and
  • educate and train students.
Funding: US Govt. (DoD, DoE, NASA, NSF, NIH), AT&T, IBM, INTEL, Microsoft, UW-Madison, ...

Slide 5
A Multifaceted Project
  • Harnessing the power of clusters, opportunistic and/or dedicated (Condor)
  • Job management services for Grid applications (Condor-G, Stork)
  • Fabric management services for Grid resources (Condor, GlideIns, NeST)
  • Distributed I/O technology (Parrot, Kangaroo, NeST)
  • Job-flow management (DAGMan, Condor, Hawk)
  • Distributed monitoring and management (HawkEye)
  • Technology for Distributed Systems (ClassAd, MW)
  • Packaging and Integration (NMI, VDT)

Slide 6
Outline
  • What is the Condor Project?
  • What is the Condor HTC Software?
  • Recipe for using desktops for science
  • Data!

Slide 7
What is Condor?
Condor converts collections of distributively owned workstations and dedicated clusters into a distributed, fault-tolerant high-throughput computing (HTC) facility.
  • Distributed Ownership: the decrease in the cost-performance ratio caused
      • a huge increase in an organization's aggregate computing capacity
      • a much smaller increase in the capacity accessible by a single person
  • HTC: large amounts of processing capacity sustained over very long time periods

Slide 8
Condor can manage a large number of jobs
  • Managing a large number of jobs
      • You specify the jobs in a file and submit them to Condor, which runs them all and keeps you notified of their progress
      • Mechanisms to help you manage huge numbers of jobs (1000s), the data, etc.
      • Condor can handle workflow / inter-job dependencies (DAGMan)
      • Condor users can set job priorities
      • Condor administrators can set user priorities

Slide 9
Condor can manage Dedicated Resources
  • Dedicated Resources
      • Compute Clusters
  • Manage
      • Node monitoring, scheduling
      • Job launch, monitor & cleanup

Slide 10
... and Condor can manage non-dedicated resources
  • Non-dedicated resources examples:
      • Desktop workstations in offices
      • Workstations in student labs
  • Non-dedicated resources are often idle: ~70% of the time!
  • Condor can effectively harness the otherwise wasted compute cycles from non-dedicated resources

Slide 11
Some HTC Challenges
Condor does whatever it takes to run your jobs, even if some machines
  • Crash (or are disconnected)
  • Run out of disk space
  • Don't have your software installed
  • Are frequently needed by others
  • Are far away & managed by someone else

Slide 12
The Condor System
  • Unix and Win2k/XP
  • Operational since 1986
  • Just at UW: more than 1800 CPUs in 10 pools on our campus
  • Software available free on the web
      • Open license
  • Adopted by the real world (Galileo, Maxtor, Micron, Oracle, Tigr, Xerox, NASA, Texas Instruments, ...)

Slide 13
Downloads and Deployments

Slide 14

Slide 15
Outline
  • What is the Condor Project?
  • What is the Condor HTC Software?
  • Recipe for using desktops for science
  • Data!

Slide 16
Recipe Tip: Useful Distributed Ownership mechanisms in Condor
  • Checkpoint / Migration
      • Checkpoint == picture of process state
      • Enables preempt/resume scheduling and migration, ensures forward progress
  • Remote System Calls
      • Redirect I/O and other system calls back to the submit machine.
  • Matchmaking with ClassAds

Slide 17
ClassAds
  • Set of bindings of Attribute Names to Expressions
  • Self-describing (no separate schema)
  • Combine query and data
  • Arbitrarily composed and nested
  • Bilateral
  • Resource owners are generous if it doesn't cost them anything!

Slide 18
Examples
    [
      Type = "Job";
      Owner = "raman";
      Cmd = "run_sim";
      Args = "-Q 17 3200";
      Cwd = "/u/raman";
      Memory = 31;
      Qdate = 886799469;
      ...
      Rank = other.Kflops...
      Requirements = other.Type =...
    ]
    [
      Type = "Machine";
      Name = "xxy.cs....";
      Arch = "iX86";
      OpSys = "Solaris";
      Mips = 104;
      Kflops = 21893;
      State = "Unclaimed";
      LoadAvg = 0.042969;
      ...
      Rank = ...;
      Requirements = ...;
    ]

Slide 19
Attribute Expressions
  • Constants: 104, 0.042969, "iX86"
  • References: attr, self.attr, other.attr, expr.attr
  • Operators: +, *, >>, =, &&, ...
  • Functions: strcat, substr, floor, member, ...
  • Lists: { expr, expr, ... }
  • ClassAds: [ name=expr; name=expr; ... ]

Slide 20
Examples
Descriptive attributes
    Type = "Job";
    Owner = "raman";
    Arch = "iX86";
    OpSys = "Solaris";
    Memory = 64;      // megabytes
    Disk = 323496;    // k bytes

Slide 21
Examples
Current state
    Daytime = 36017;       // secs past midnight
    KeyboardIdle = 1432;   // seconds
    State = "Unclaimed";
    LoadAvg = 0.042969;

Slide 22
Examples
Parameters
    ResearchGrp = { "raman", "miron", "solomon", "jbasney" };
    Friends = { "tannenba", "wright" };
    Untrusted = { "rival", "riffraff" };
    WantCheckpoint = 1;

Slide 23
Examples
Derived data
    Rank =   // machine's rank for job
        10 * member(other.Owner, ResearchGrp) + member(other.Owner, Friends);
    Rank =   // job's rank for machine
        Kflops/1E3 + other.Memory/32;

Slide 24
Examples
Job constraint
    Requirements =
        other.Type == "Machine"
        && Arch == "iX86"
        && OpSys == "Solaris"
        && Disk > 10000
        && other.Memory >= self.Memory;

Slide 25
Examples
Machine constraint
    Requirements =
        ! member(other.Owner, Untrusted)
        && Rank >= 10 ? true
        : Rank > 0 ? (LoadAvg 15*60)
        : DayTime 18*60*60;

Slide 26
Matching Algorithm
To match two ads A and B:
  • Set up the environment such that in A
      • self evaluates to A
      • other evaluates to B
      • other attributes are searched for first in A and then in B
    and vice versa (with A and B interchanged)
  • Check if A.Requirements and B.Requirements both evaluate to true
  • A.Rank and B.Rank are used for preferences

Slide 27
Three-valued Logic
  • These expressions are all UNDEFINED if other has no "Memory" attribute:
        other.Memory > 32
        other.Memory == 32
        other.Memory != 32
        !(other.Memory == 32)
  • This expression is TRUE if either attribute exists and satisfies the given condition:
        other.Mips >= 10 || other.Kflops >= 1000

Slide 28
Recipe Tip: Build from Bottom up!
  • Start with a service for a single user, on a single machine.
  • Personal Condor
      • Condor on your own workstation, no local system/root access required, no system administrator intervention needed

Slide 29
[Diagram: your workstation running a personal Condor with 600 Condor jobs]

Slide 30
Personal Condor?!
What's the benefit of a Condor Pool with just one user and one machine?

Slide 31
Your Personal Condor will...
  • keep an eye on your jobs and will keep you posted on their progress
  • implement your policy on the execution order of the jobs
  • keep a log of your job activities
  • add fault tolerance to your jobs
  • implement your policy on when the jobs can run on your workstation

Slide 32
Expand from your desktop
  • Build a Condor pool inside your organization
      • Install Condor on multiple machines, pointing them to your initial machine as the manager.
  • Utilize Condor resources at remote organizations (build a grid)
      • Takes advantage of your Condor-using friends
      • Get permission to access their resources
      • Then configure your Condor pool to "flock" to these pools
      • Accounting system is flocking aware

Slide 33
[Diagram: your workstation running a personal Condor with 600 Condor jobs, connected to a Condor Pool and a Friendly Condor Pool]

Slide 34
Condor-G
  • What about resources at remote organizations that are NOT managed via Condor? (perhaps they are managed via PBS, SGE, LSF, ...)
  • Condor-G: Job task-broker for Grid Middleware.
      • Submit jobs to resources managed via grid middleware such as Globus (GT2 & GT3), Nordugrid, Unicore, or Oracle (or Condor)
      • Oracle: run PL/SQL programs on Oracle just like a normal job, via transactions, put in DAGs, etc.

Slide 35
Condor GlideIn
  • Problems
      • What if the grid middleware or remote scheduler doesn't provide services I want?
      • What about end-to-end semantic guarantees?
  • Solution
      • Submit the Condor daemons to remote schedulers inst
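
Illustration for Slides 26 and 27: the matching algorithm and the three-valued logic can be sketched in a few lines of Python. This is only a rough model under stated assumptions, not Condor's implementation or its ClassAd library. Ads are modelled as plain dicts, expressions as Python callables, and every helper name (UNDEFINED, Env, and3, or3, not3, member, matches) is hypothetical. The attribute values come from the example slides; the machine's Requirements is a simplified stand-in for the constraint on Slide 25, and Rank is left out.

```python
UNDEFINED = object()  # result of referencing a missing attribute (hypothetical sentinel)


def _strict(op):
    """Wrap a binary operator so that an UNDEFINED operand yields UNDEFINED."""
    def wrapped(a, b):
        if a is UNDEFINED or b is UNDEFINED:
            return UNDEFINED
        return op(a, b)
    return wrapped


eq = _strict(lambda a, b: a == b)
gt = _strict(lambda a, b: a > b)
ge = _strict(lambda a, b: a >= b)
member = _strict(lambda item, items: item in items)


def not3(a):
    """Three-valued NOT: UNDEFINED stays UNDEFINED."""
    return UNDEFINED if a is UNDEFINED else not a


def and3(*terms):
    """Three-valued AND: False dominates, then UNDEFINED, else True."""
    if any(t is False for t in terms):
        return False
    if any(t is UNDEFINED for t in terms):
        return UNDEFINED
    return True


def or3(*terms):
    """Three-valued OR: True dominates, then UNDEFINED, else False."""
    if any(t is True for t in terms):
        return True
    if any(t is UNDEFINED for t in terms):
        return UNDEFINED
    return False


class Env:
    """Evaluation environment for one ad ("self") matched against another ("other")."""

    def __init__(self, self_ad, other_ad):
        self.self_ad, self.other_ad = self_ad, other_ad

    def _eval(self, ad, peer, name):
        if name not in ad:
            return UNDEFINED
        value = ad[name]
        # Expressions are callables taking an Env; constants are plain values.
        return value(Env(ad, peer)) if callable(value) else value

    def me(self, name):      # models self.attr
        return self._eval(self.self_ad, self.other_ad, name)

    def other(self, name):   # models other.attr
        return self._eval(self.other_ad, self.self_ad, name)

    def attr(self, name):    # bare attr: searched first in self, then in other
        return self.me(name) if name in self.self_ad else self.other(name)


def matches(a, b):
    """Two ads match only if both Requirements evaluate to exactly True."""
    return (Env(a, b).me("Requirements") is True
            and Env(b, a).me("Requirements") is True)


job = {
    "Type": "Job", "Owner": "raman", "Memory": 31,
    "Requirements": lambda e: and3(
        eq(e.other("Type"), "Machine"),
        eq(e.attr("Arch"), "iX86"),
        eq(e.attr("OpSys"), "Solaris"),
        gt(e.attr("Disk"), 10000),
        ge(e.other("Memory"), e.me("Memory")),
    ),
}

machine = {
    "Type": "Machine", "Arch": "iX86", "OpSys": "Solaris",
    "Memory": 64, "Disk": 323496, "Untrusted": ["rival", "riffraff"],
    # Simplified stand-in for the machine constraint, not the slide's full expression.
    "Requirements": lambda e: not3(member(e.other("Owner"), e.me("Untrusted"))),
}
```

With these definitions, matches(job, machine) is true. If the "Memory" attribute is removed from the machine ad, the job's Requirements evaluates to UNDEFINED rather than true, so the pair no longer matches. An expression such as or3(ge(e.other("Mips"), 10), ge(e.other("Kflops"), 1000)) is still true when only Mips is present and large enough.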
