Author
magee
Jaime Frey
Computer Sciences Department
University of Wisconsin-Madison
[email protected]
http://www.cs.wisc.edu/condor
OGF 19
Condor Software Forum
Routing Jobs to the Grid
www.cs.wisc.edu/condor
What’s a Job Router?
Specialized scheduler operating on schedd’s jobs.
[Diagram: the Schedd and its job queue (Job 1, Job 2, Job 3, Job 4, Job 5, …, Job 4*), with the Job Router, a.k.a. Schedd On The Side, next to it.]
Adapted Quill Technology
› Using Quill library to mirror job queue in memory
o Efficient - just “tails” the log
o Independent - mirror without clogging schedd command queue
› Modifying the job queue is another matter - must interact with schedd
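The “tailing” approach can be sketched in a few lines of Python. This is only an illustration of the idea, not the Quill library or Condor’s real job queue log: the one-event-per-line format (`SET <job> <attribute> <value>`, `DEL <job>`) and the names `QueueMirror`, `apply_line` and `poll` are assumptions invented for the example.

```python
class QueueMirror:
    """In-memory mirror of a job queue, kept current by reading an append-only log.

    The log format is hypothetical (not Condor's actual job queue log):
        SET <job_id> <attribute> <value>
        DEL <job_id>
    """

    def __init__(self, log):
        self.log = log    # file-like object opened for reading
        self.jobs = {}    # job_id -> {attribute: value}

    def apply_line(self, line):
        parts = line.strip().split(None, 3)
        if len(parts) == 4 and parts[0] == "SET":
            _, job_id, attr, value = parts
            self.jobs.setdefault(job_id, {})[attr] = value
        elif len(parts) == 2 and parts[0] == "DEL":
            self.jobs.pop(parts[1], None)

    def poll(self):
        """Apply every complete line appended since the last poll.

        Returns the number of lines applied. A trailing partial line is
        left in place and picked up by a later poll.
        """
        applied = 0
        while True:
            pos = self.log.tell()
            line = self.log.readline()
            if not line.endswith("\n"):
                self.log.seek(pos)   # nothing new, or a half-written line
                break
            self.apply_line(line)
            applied += 1
        return applied
```

The reader only ever reads the log, so it never sends a command to the writer. That is the sense in which the mirror is independent; changing a job still has to go through the schedd.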
Usage Case
Routing: Vanilla -> Grid
Condor Farm Story
[Diagram: Application, condor_submit, the Schedd with its job queue, and the Startd resources; RandomSeed jobs fill the queue and the resources.]
• Now that this is working, how can I use my collaborator’s resources too?
Option #1: Merge Farms
› Combine machines with collaborator into one Condor resource pool.
o Everything works just like it did before.
o Excellent option for small to medium clusters.
o Requires bidirectional connectivity to all startds, or equivalent via GCB.
o Requires some administrative coordination (e.g. upgrades, negotiator policy, security, etc.)
Option #1b: submit to multiple pools
› condor_submit -remote …
› Works
› Ok for small scale
› Have to manually partition jobs
Option #2: Flocking Together
[Diagram: the Schedd with RandomSeed jobs running on both the Local Startds and the Remote Startds.]
• full featured (std universe etc)
• automatic matchmaking
• easy to configure

• requires bidirectional connectivity
• both sites must run Condor
Option #3: Grid Universe
[Diagram: the Schedd with RandomSeed jobs on its Startds (vanilla) and, through Gatekeeper X, at site X.]
• easier to live with private networks
• may use non-Condor resources

• restricted Condor feature set (e.g. no std universe over grid)
• must pre-allocate jobs between vanilla and grid universe
Option #4: Routing Jobs
[Diagram: the Schedd with RandomSeed jobs on the Local Startds (vanilla); the Schedd On The Side sends other RandomSeed jobs through Gatekeepers X, Y and Z to site X, site Y and site Z.]
• dynamic allocation of jobs between vanilla and grid universes.
• not every job is appropriate for transformation into a grid job.
Example Routing Table
[ GridResource = "gt2 gatekeeper.site1/jobmanager-pbs";
  MaxJobs = 500;
  MaxIdle = 50;
  set_GlobusRSL = "(…)" ]
[ GridResource = "condor schedd.site2 collector.site2";
  MaxJobs = 700;
  MaxIdle = 100;
  Requirements = other.ImageSize < 500 ]
…
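One plausible reading of this table: each entry is a candidate route, limited by `MaxJobs` (jobs routed there at once) and `MaxIdle` (routed jobs still waiting), with an optional `Requirements` test on the job. The Python sketch below illustrates that reading only. It is not the Job Router’s implementation; the `Route` class, the `pick_route` function and the first-match policy are assumptions made for the example.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class Route:
    grid_resource: str
    max_jobs: int     # assumed meaning: cap on jobs routed here at once
    max_idle: int     # assumed meaning: cap on routed jobs still idle
    requirements: Callable[[dict], bool] = lambda job: True
    routed: int = 0
    idle: int = 0

    def has_room(self) -> bool:
        return self.routed < self.max_jobs and self.idle < self.max_idle


def pick_route(job: dict, routes: list) -> Optional[Route]:
    """Return the first route with room whose requirements accept the job."""
    for route in routes:
        if route.has_room() and route.requirements(job):
            route.routed += 1
            route.idle += 1   # a freshly routed job starts out idle
            return route
    return None   # no route fits: the job simply stays in the local queue


# The two entries from the slide, with the elided GlobusRSL left out.
routes = [
    Route("gt2 gatekeeper.site1/jobmanager-pbs", max_jobs=500, max_idle=50),
    Route("condor schedd.site2 collector.site2", max_jobs=700, max_idle=100,
          requirements=lambda job: job["ImageSize"] < 500),
]
```

Because a job with no matching route is left alone, the split between vanilla and grid work follows from the limits rather than being fixed at submit time.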
What About I/O?
› Jobs must be sandboxable (i.e. specifying input/output via transfer-files mechanism).
› Routing of standard universe is not supported.
› Must have enough storage space at site for input/output files!
What Types of Grids?
› Routing table may contain any combination of grid types supported by Condor’s grid universe.
› Example: Condor-C
[Diagram: the Schedd and its Schedd On The Side, with RandomSeed jobs sent to Schedd X at site X.]
• for two Condor sites, schedd-to-schedd submission requires no additional software
• however, still not as trivial to use as flocking
Source Routing
› Routing the old-fashioned way:
universe = Grid
GridResource = condor site1 …
remote_universe = Grid
remote_GridResource = condor site2 …
remote_remote_universe = Grid
remote_remote_GridResource = pbs
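The pattern in this submit description is mechanical: every further hop repeats the same two attributes with one more `remote_` prefix. A short Python sketch makes that explicit. The helper name `source_route` is made up for illustration, and the “…” placeholders are carried over from the slide rather than filled in.

```python
def source_route(hops):
    """Build submit-description lines for a fixed chain of grid hops.

    hops is a list of GridResource strings, nearest hop first. Hop n
    (counting from zero) gets n copies of the "remote_" prefix, as in
    the slide's example.
    """
    lines = []
    for depth, grid_resource in enumerate(hops):
        prefix = "remote_" * depth
        lines.append(f"{prefix}universe = Grid")
        lines.append(f"{prefix}GridResource = {grid_resource}")
    return lines


print("\n".join(source_route(["condor site1 …", "condor site2 …", "pbs"])))
```

The whole path is fixed by the submitter before the job leaves home, which is what makes this “source” routing.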
Routing At the Site
[Diagram: Gatekeeper X, a Schedd with its Schedd On The Side, and Schedds X2 and X3.]
• navigate internal firewalls
• provide custom routes for special users
• improve scalability
• However, keep in mind I/O requirements etc.
Multicast in Future?
› Currently: route one job to one site
› Multicast: route one job to many sites
› Thin out all but first to germinate
› … or all but first to yield fruit.
Future Glidein Factory
[Diagram: at the home site, the Schedd and its Schedd On The Side; glidein jobs go through Gatekeeper X to site X, where Startds run the RandomSeed jobs.]
• true late binding of jobs to resources
• may run on top of non-Condor sites
• supports full feature-set of Condor (e.g. standard universe)

• requires GCB for private networks
Gliding in the Factory
[Diagram: the Schedd and its Schedd On The Side, a glidein factory, and site X, linked by schedd-to-schedd and schedd-to-gatekeeper submission; RandomSeed jobs run at the site.]
• hierarchical strategy for scalability and reliability
• better match for private networks

• may require some additional horsepower from gatekeeper machine, perhaps a dedicated element for “edge services”.
Pluggable Router
› Beyond simple ClassAd transforms
› Plugins would fire when a job matches an entry in the routing table
› Don’t yet understand semantics
› There is work to do!
Thanks
Interested? Let us know.
We are currently using job routing for specific users at UW.
Future development will focus on more use-cases.
Jaime Frey