Jaime Frey, Todd Tannenbaum Computer Sciences Department University of Wisconsin-Madison {jfrey|tannenba}@cs.wisc.edu http://www.cs.wisc.edu/condor OGF 19 Condor Software Forum Condor-G

OGF 19 Condor Software Forum Condor-G


Page 1: OGF 19 Condor Software Forum Condor-G

Jaime Frey, Todd Tannenbaum
Computer Sciences Department
University of Wisconsin-Madison
{jfrey|tannenba}@cs.wisc.edu
http://www.cs.wisc.edu/condor

OGF 19
Condor Software Forum

Condor-G

Page 2: OGF 19 Condor Software Forum Condor-G

www.cs.wisc.edu/condor

What Is It?

› Condor-G is a specialization of Condor. It is also known as the “grid universe”.

› Condor-G speaks many different job management protocols.

› Condor-G benefits from all the wonderful Condor features, like a real job queue.

Page 3: OGF 19 Condor Software Forum Condor-G


Grid Fault-Tolerance

› Condor-G does whatever it takes to run your jobs, even if …
  • Your local machine crashes
  • The grid service is temporarily unavailable
  • The network goes down

Page 4: OGF 19 Condor Software Forum Condor-G


Remote Resource Access: Globus

[Diagram: in Organization A, “globusrun myjob …” uses the Globus GRAM Protocol to reach the Globus JobManager in Organization B, which starts the job with fork().]

Page 5: OGF 19 Condor Software Forum Condor-G


[Diagram, repeated: in Organization A, “globusrun myjob …” uses the Globus GRAM Protocol to reach the Globus JobManager in Organization B, which starts the job with fork().]

Page 6: OGF 19 Condor Software Forum Condor-G


Globus + Condor

[Diagram: in Organization A, “globusrun myjob …” uses the Globus GRAM Protocol to reach the Globus JobManager in Organization B, which submits the job to Condor to run in the Condor pool.]

Page 7: OGF 19 Condor Software Forum Condor-G


Globus + Condor

[Diagram, repeated: in Organization A, “globusrun …” uses the Globus GRAM Protocol to reach the Globus JobManager in Organization B, which submits the job to Condor to run in the Condor pool.]

Page 8: OGF 19 Condor Software Forum Condor-G


Condor-G + Globus + Condor

[Diagram: in Organization A, Condor-G holds a queue of jobs (myjob1, myjob2, myjob3, myjob4, myjob5, …) and uses the Globus GRAM Protocol to reach the Globus JobManager in Organization B, which submits them to Condor to run in the Condor pool.]

Page 9: OGF 19 Condor Software Forum Condor-G


Condor-G Fault-Tolerance: Lost Contact with Remote Jobmanager

[Flowchart]

› Can we contact gatekeeper?
  • Yes – jobmanager crashed
  • No – retry until we can talk to gatekeeper again…

› Can we reconnect to jobmanager?
  • Yes – network was down
  • No – machine crashed or job completed

› Restart jobmanager

› Has job completed?
  • Yes – update queue
  • No – is job still running?
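The flowchart on this slide can be read as a small decision procedure. The sketch below is one possible reading, not Condor-G's actual code: the names `Action`, `next_action`, and `after_restart` are hypothetical, and the two booleans stand in for real probes of the gatekeeper and the jobmanager.

```python
from enum import Enum


class Action(Enum):
    # Labels paraphrase the outcomes shown on the slide.
    RETRY_GATEKEEPER = "retry until we can talk to gatekeeper again"
    KEEP_MONITORING = "network was down; jobmanager is still there"
    RESTART_JOBMANAGER = "restart jobmanager, then ask whether the job completed"


def next_action(gatekeeper_reachable: bool, jobmanager_reachable: bool) -> Action:
    """Decide what to do after contact with the remote jobmanager is lost."""
    if not gatekeeper_reachable:
        # Nothing can be learned until the gatekeeper answers again.
        return Action.RETRY_GATEKEEPER
    if jobmanager_reachable:
        # Reconnecting worked, so only the network had failed.
        return Action.KEEP_MONITORING
    # Jobmanager crashed, the machine crashed, or the job completed.
    return Action.RESTART_JOBMANAGER


def after_restart(job_completed: bool) -> str:
    """Follow-up once a new jobmanager has been started."""
    if job_completed:
        return "update queue"
    return "check whether job is still running"
```

For example, `next_action(True, False)` yields `Action.RESTART_JOBMANAGER`, which corresponds to the "jobmanager crashed" branch.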

Page 10: OGF 19 Condor Software Forum Condor-G


Just to be fair…

› The gatekeeper doesn’t have to submit to a Condor pool. It could be PBS, LSF, Sun Grid Engine…

› Condor-G will work fine whatever the remote batch system is.

Page 11: OGF 19 Condor Software Forum Condor-G


Other Condor-G Features

› Other Grid Protocols
  • Works with WS-GRAM, NorduGrid, Unicore

› Credential Management
  • Pull refreshed credentials from MyProxy
  • Push refreshed credentials to remote systems

› Job Scheduling
  • Use Matchmaking to select resources for jobs

› GlideIn
  • Allows late binding of resources and job checkpoint/migration

Page 12: OGF 19 Condor Software Forum Condor-G


Condor-G

[Diagram: Condor-G takes a job description (job ClassAd) and can submit it to GT2 [.1|2|4] (HTTPS), GT4 (WSRF), Condor, PBS/LSF, NorduGrid, or Unicore.]

Page 13: OGF 19 Condor Software Forum Condor-G


Pre-WS GRAM

› Submit file
  grid_resource = gt2 foo.edu/jobmanager-pbs
  globus_rsl = (queue=long)(condor_submit=(universe java))

Page 14: OGF 19 Condor Software Forum Condor-G


OGSA GRAM

› Submit file
  grid_resource = gt3 http://foo.edu/ogsa/services/base/gram/PBSManagedJobFactoryService
  globus_rsl = (queue=long)(condor_submit=(universe java))

› Museum mode

Page 15: OGF 19 Condor Software Forum Condor-G


WS GRAM

› Submit file
  grid_resource = gt4 foo.edu PBS
  globus_xml = <queue>long</queue>

Page 16: OGF 19 Condor Software Forum Condor-G


NorduGrid

› Submit file
  grid_resource = nordugrid foo.edu
  nordugrid_rsl = (queue=long)

Page 17: OGF 19 Condor Software Forum Condor-G


Unicore

› Submit file
  grid_resource = unicore usite.org vsite
  keystore_file = keystore
  keystore_passphrase_file = keystore.pw
  keystore_alias = my cert

Page 18: OGF 19 Condor Software Forum Condor-G


Condor

› Submit file
  grid_resource = condor schedd.foo.edu cm.foo.edu
  remote_universe = java

Page 19: OGF 19 Condor Software Forum Condor-G


PBS

› Submit file
  grid_resource = pbs

Page 20: OGF 19 Condor Software Forum Condor-G


LSF

› Submit file
  grid_resource = lsf
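Across the preceding slides, only the grid_resource line and a few protocol-specific commands change from one back end to the next. As an illustrative sketch, the helper below assembles a submit description from those pieces. The function name `submit_description` is hypothetical and not part of Condor; the command names and values come from the slides.

```python
def submit_description(executable, grid_type, *resource_fields, **commands):
    """Build the text of a grid universe submit description file.

    grid_type and resource_fields form the grid_resource line, e.g.
    ("gt2", "foo.edu/jobmanager-pbs") or ("gt4", "foo.edu", "PBS").
    commands holds extra protocol-specific lines such as globus_rsl.
    """
    lines = [
        f"executable = {executable}",
        "universe = grid",
        "grid_resource = " + " ".join((grid_type, *resource_fields)),
    ]
    # Keyword arguments keep their call order, so the file is predictable.
    lines += [f"{name} = {value}" for name, value in commands.items()]
    lines.append("queue")
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    print(submit_description("foo", "gt2", "foo.edu/jobmanager-pbs",
                             globus_rsl="(queue=long)"))
    print(submit_description("foo", "condor", "schedd.foo.edu", "cm.foo.edu",
                             remote_universe="java"))
    print(submit_description("foo", "pbs"))
```

The resulting text could be written to a file and passed to condor_submit. Whether a given grid type is available depends on the local Condor installation.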

Page 21: OGF 19 Condor Software Forum Condor-G


Grid Universe Fault-Tolerance: Credential Management

› Authentication in many grid protocols is done with limited-lifetime X509 proxies

› Proxy may expire before jobs finish executing

› Condor can put jobs on hold and email user to refresh proxy

› Condor can automatically retrieve new proxies from MyProxy

› When the proxy is refreshed, Condor forwards it to the jobs
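The choices on this slide amount to a small policy for proxies that are close to expiring. The sketch below is an assumption-laden illustration, not Condor's implementation: the function name, the use of seconds, and the use of one threshold for both the MyProxy and the hold path are all hypothetical.

```python
def proxy_action(seconds_left: int, refresh_threshold: int, has_myproxy: bool) -> str:
    """Pick what to do about an X509 proxy with seconds_left of lifetime.

    refresh_threshold is how close to expiry (in seconds, by assumption)
    the proxy may get before something must be done.
    """
    if seconds_left > refresh_threshold:
        return "nothing to do"
    if has_myproxy:
        # Automatic path: pull a fresh proxy, then push it to the jobs.
        return "fetch new proxy from MyProxy and forward it to the jobs"
    # Manual path: the user has to refresh the proxy.
    return "put jobs on hold and email user"
```

With a threshold of 14400 seconds and no MyProxy server configured, a proxy with 3600 seconds left would take the hold-and-email path.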

Page 22: OGF 19 Condor Software Forum Condor-G


MyProxy

› Submit file
  MyProxyHost = foo.edu:12345
  MyProxyServerDN = /DC=org/DC=doegrids…
  MyProxyCredentialName = proxy_file
  MyProxyRefreshThreshold = 240 #mins
  MyProxyNewProxyLifetime = 12 #hrs
  MyProxyPassword = password

› Or give password on command line
  condor_submit -p password submit.desc

Page 23: OGF 19 Condor Software Forum Condor-G


Condor-G Matchmaking

› Use Condor-G matchmaking with grid universe jobs

› Allows Condor-G to dynamically assign computing jobs to grid sites

› An example of lazy planning

Page 24: OGF 19 Condor Software Forum Condor-G


Condor-G Matchmaking, cont.

› Normally a grid universe job must specify the site in the submit description file via the “grid_resource” attribute like so:

  Executable = foo
  Universe = grid
  Grid_Resource = gt2 beak.cs.wisc.edu/jobmanager-pbs
  queue

Page 25: OGF 19 Condor Software Forum Condor-G


Condor-G Matchmaking, cont.

› With matchmaking, grid universe jobs can use requirements and rank:

  Executable = foo
  Universe = grid
  Grid_Resource = $$(ResourceName)
  Requirements = arch == LINUX
  Rank = NumberOfNodes * random()
  Queue

› The $$(x) syntax inserts information from the target ClassAd when a match is made.
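To make the $$(x) substitution concrete, here is a minimal Python sketch of the idea. It is not Condor's parser: the `expand` helper is hypothetical, the target ClassAd is modeled as a plain dict, and lookups are case-sensitive for simplicity.

```python
import re

# Matches $$(AttributeName) and captures the attribute name.
_DOLLAR_DOLLAR = re.compile(r"\$\$\(([A-Za-z_][A-Za-z0-9_]*)\)")


def expand(value: str, target_ad: dict) -> str:
    """Replace each $$(Attr) in value with Attr's value from target_ad.

    Raises KeyError if the matched ad lacks a referenced attribute.
    """
    def lookup(match):
        return str(target_ad[match.group(1)])

    return _DOLLAR_DOLLAR.sub(lookup, value)


if __name__ == "__main__":
    matched_ad = {"ResourceName": "gt4 foo.edu PBS", "NumberOfNodes": 300}
    print("Grid_Resource = " + expand("$$(ResourceName)", matched_ad))
```

The point of the design is that the job names no site at submit time. The site string is filled in only after a resource ad has been matched.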

Page 26: OGF 19 Condor Software Forum Condor-G


Condor-G Matchmaking, cont.

› Where do these target ClassAds representing Globus gatekeepers come from? Several options:
  • Simple script on gatekeeper publishes an ad via condor_advertise command-line utility (method used by D0 JIM, USCMS)
  • Program to query Globus MDS and convert information into ClassAd (method used by EDG)
  • Run HawkEye with appropriate plugins on the gatekeeper

› For explanation of Condor-G matchmaking setup for USCMS, see http://www.cs.wisc.edu/condor/USCMS_matchmaking.html

Page 27: OGF 19 Condor Software Forum Condor-G


Condor-G Matchmaking: Creating the Resource Ad

› Machine Ad
  MyType = "Machine"
  TargetType = "Job"
  Name = "foo.edu"
  Machine = "foo.edu"
  ResourceName = "gt4 foo.edu PBS"
  UpdateSequenceNumber = 4
  Requirements = TARGET.JobUniverse == 9 && CurMatches < 10
  CurMatches = 0
  NumberOfNodes = 300
  Rank = 0.0
  CurrentRank = 0.0
  WantAdRevaluate = True

Page 28: OGF 19 Condor Software Forum Condor-G


Condor-G Matchmaking: Creating the Resource Ad

› Advertising a resource
  condor_advertise UPDATE_STARTD_AD ad-file

› Call periodically

› Use unix time for UpdateSequenceNumber
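The steps on these two slides (write the ad, stamp it with the current unix time, call condor_advertise periodically) can be sketched as follows. This is an illustrative script, not a tool shipped with Condor: `Expr`, `resource_ad`, and `advertise_command` are hypothetical names, and the attribute values are the ones from the Machine Ad slide.

```python
import time


class Expr(str):
    """A raw ClassAd expression, written to the ad file without quotes."""


def _format(value):
    if isinstance(value, Expr):
        return str(value)
    if isinstance(value, bool):
        return "True" if value else "False"
    if isinstance(value, str):
        return '"' + value + '"'
    return repr(value)  # ints and floats


def resource_ad(name, resource_name, nodes, now=None):
    """Return ad-file text with the unix time as UpdateSequenceNumber."""
    seq = int(time.time() if now is None else now)
    attrs = {
        "MyType": "Machine",
        "TargetType": "Job",
        "Name": name,
        "Machine": name,
        "ResourceName": resource_name,
        "UpdateSequenceNumber": seq,
        "Requirements": Expr("TARGET.JobUniverse == 9 && CurMatches < 10"),
        "CurMatches": 0,
        "NumberOfNodes": nodes,
        "Rank": 0.0,
        "CurrentRank": 0.0,
        "WantAdRevaluate": True,
    }
    return "\n".join(f"{k} = {_format(v)}" for k, v in attrs.items()) + "\n"


def advertise_command(ad_file):
    """Argument list for the condor_advertise call shown on the slide."""
    return ["condor_advertise", "UPDATE_STARTD_AD", ad_file]


if __name__ == "__main__":
    print(resource_ad("foo.edu", "gt4 foo.edu PBS", 300), end="")
    print(" ".join(advertise_command("ad-file")))
```

A cron job or a simple loop could write the ad text to a file and run the command at a fixed interval. The unix timestamp gives each update a larger sequence number than the one before.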

Page 29: OGF 19 Condor Software Forum Condor-G


But Wait, There’s More…

› What if you want to run standard universe jobs on grid resources?
  • For matchmaking and dynamic scheduling of jobs
  • For job checkpointing and migration
  • For remote system calls

› What if you don’t want to send a job to a site until the moment the job will start running (late binding)?

Page 30: OGF 19 Condor Software Forum Condor-G


One Solution: Condor-G GlideIn

› You can use the Grid Universe to run Condor daemons on grid resources

› When the resources run these GlideIn jobs, they will temporarily join your Condor Pool

› You can then submit Standard, Vanilla, PVM, or MPI Universe jobs and they will be matched and run on the grid resources

Page 31: OGF 19 Condor Software Forum Condor-G


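[Diagram: your workstation runs a personal Condor with 600 Condor jobs. Glide-in jobs are sent to a friendly Condor pool and, via the Globus Grid, to PBS, LSF, and Condor sites; the resources they claim make up the Condor pool.]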

Page 32: OGF 19 Condor Software Forum Condor-G


GlideIn Concerns

› What if a grid resource kills my GlideIn job?
  • That resource will disappear from your pool and your jobs will be rescheduled on other machines
  • Standard universe jobs will resume from their last checkpoint like usual

› What if all my jobs are completed before a GlideIn job runs?
  • If a GlideIn Condor daemon is not matched with a job in 10 minutes, it terminates, freeing the resource
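The idle-timeout rule in the last bullet can be written as a one-line check. The sketch below is only an illustration of that rule: the names are hypothetical, and how the real GlideIn daemons track their last match is not described on the slide.

```python
GLIDEIN_IDLE_LIMIT = 10 * 60  # seconds; the slide's "10 minutes"


def glidein_should_exit(now: float, last_match_time: float,
                        limit: float = GLIDEIN_IDLE_LIMIT) -> bool:
    """True once the daemon has gone `limit` seconds without a match."""
    return now - last_match_time >= limit
```

A daemon started at time 0 that has never been matched would keep the resource at 599 seconds and release it at 600.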

Page 33: OGF 19 Condor Software Forum Condor-G


Condor

[Diagram: condor_submit hands the job to the schedd (job caretaker); the matchmaker pairs it with a startd (runs job).]

Page 34: OGF 19 Condor Software Forum Condor-G


Condor-G

[Diagram: condor_submit → schedd (job caretaker) → gridmanager → gahp → Globus gatekeeper → PBS or LSF]

Page 35: OGF 19 Condor Software Forum Condor-G


Condor-C

[Diagram: condor_submit → schedd (job caretaker) → gridmanager → condor-gahp → remote schedd, which works with its own matchmaker and startd]

Page 36: OGF 19 Condor Software Forum Condor-G


Condor-C to non-Condor

[Diagram: condor_submit → schedd (job caretaker) → gridmanager → condor-gahp → remote schedd → gridmanager → pbs/lsf-gahp → PBS or LSF]

Page 37: OGF 19 Condor Software Forum Condor-G


Gliding in Condor-C

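[Diagram, two steps starting from condor_submit and the schedd (job caretaker):
  1. Glide-in: gridmanager → gahp → Globus gatekeeper, which starts a remote schedd
  2. Submit jobs: gridmanager → condor-gahp → that schedd → gridmanager → pbs/lsf-gahp → PBS or LSF]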

Page 38: OGF 19 Condor Software Forum Condor-G


Matchmaking with Condor-C

› In all of these examples, Condor-C went to a specific remote schedd

› This is not required: you can do matchmaking

Page 39: OGF 19 Condor Software Forum Condor-G


Matchmaking with Condor-C

[Diagram: condor_submit → schedd (job caretaker) → gridmanager → condor-gahp; the matchmaker chooses among several remote schedds (schedd, schedd, …) and the job is submitted to the chosen one]

Page 40: OGF 19 Condor Software Forum Condor-G
