Jaime Frey, Todd Tannenbaum
Computer Sciences Department
University of Wisconsin-Madison
{jfrey|tannenba}@cs.wisc.edu
http://www.cs.wisc.edu/condor
OGF 19 Condor Software Forum
Condor-G
www.cs.wisc.edu/condor
What Is It?
› Condor-G is a specialization of Condor. It is also known as the “grid universe”.
› Condor-G speaks many different job management protocols.
› Condor-G benefits from all the wonderful Condor features, like a real job queue.
Grid Fault-Tolerance
› Condor-G does whatever it takes to run your jobs, even if …
  - Your local machine crashes
  - The grid service is temporarily unavailable
  - The network goes down
Remote Resource Access: Globus
[Diagram: “globusrun myjob …” in Organization A speaks the Globus GRAM protocol to a Globus JobManager in Organization B, which starts the job with fork().]
Globus + Condor
[Diagram: “globusrun myjob …” in Organization A speaks the Globus GRAM protocol to a Globus JobManager in Organization B, which submits the job to a Condor pool.]
Condor-G + Globus + Condor
[Diagram: Condor-G in Organization A holds myjob1, myjob2, myjob3, myjob4, myjob5, … and submits them over the Globus GRAM protocol to a Globus JobManager in Organization B, which submits them to a Condor pool.]
Condor-G Fault-Tolerance: Lost Contact with Remote Jobmanager
› Can we contact gatekeeper?
  - Yes – jobmanager crashed
  - No – retry until we can talk to gatekeeper again…
› Can we reconnect to jobmanager?
  - Yes – network was down
  - No – machine crashed or job completed
› Restart jobmanager
› Has job completed?
  - Yes – update queue
  - No – is job still running?
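The recovery questions on this slide can be read as a small decision procedure. The sketch below is one possible reading, not Condor-G's actual code: the order of the checks, the three boolean probe arguments, and the `Action` names are all assumptions made for illustration.

```python
from enum import Enum


class Action(Enum):
    RETRY_GATEKEEPER = "retry until we can talk to the gatekeeper again"
    RESUME = "network was down; keep using the existing jobmanager"
    UPDATE_QUEUE = "job completed; update the local queue"
    KEEP_MONITORING = "jobmanager restarted; job is still running"


def recover(can_contact_gatekeeper, can_reconnect_jobmanager, job_completed):
    """Pick a recovery step after losing contact with the remote jobmanager.

    All three arguments are hypothetical probe results. The ordering of the
    checks is an assumption; the slide does not state it explicitly.
    """
    if not can_contact_gatekeeper:
        # No: nothing else can be learned until the gatekeeper answers.
        return Action.RETRY_GATEKEEPER
    if can_reconnect_jobmanager:
        # Yes: the jobmanager is alive, so only the network was down.
        return Action.RESUME
    # The jobmanager is gone (crashed, or exited because the job finished):
    # restart it, then ask whether the job has completed.
    if job_completed:
        return Action.UPDATE_QUEUE
    return Action.KEEP_MONITORING


print(recover(can_contact_gatekeeper=True,
              can_reconnect_jobmanager=False,
              job_completed=True).value)
```

In every branch the job stays in the local queue, which is what lets the failure be handled without user intervention.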
Just to be fair…
› The gatekeeper doesn’t have to submit to a Condor pool. It could be PBS, LSF, Sun Grid Engine…
› Condor-G will work fine whatever the remote batch system is.
Other Condor-G Features
› Other Grid Protocols
  - Works with WS-GRAM, NorduGrid, Unicore
› Credential Management
  - Pull refreshed credentials from MyProxy
  - Push refreshed credentials to remote systems
› Job Scheduling
  - Use Matchmaking to select resources for jobs
› GlideIn
  - Allows late binding of resources and job checkpoint/migration
Condor-G
[Diagram: Condor-G takes a job description (job ClassAd) and reaches GT2 [.1|2|4], GT4, Condor, PBS/LSF, NorduGrid, and Unicore; HTTPS and WSRF label the transports.]
Pre-WS GRAM
› Submit file
    grid_resource = gt2 foo.edu/jobmanager-pbs
    globus_rsl = (queue=long)(condor_submit=(universe java))
OGSA GRAM
› Submit file
    grid_resource = gt3 http://foo.edu/ogsa/services/base/gram/PBSManagedJobFactoryService
    globus_rsl = (queue=long)(condor_submit=(universe java))
› Museum mode
WS GRAM
› Submit file
    grid_resource = gt4 foo.edu PBS
    globus_xml = <queue>long</queue>
NorduGrid
› Submit file
    grid_resource = nordugrid foo.edu
    nordugrid_rsl = (queue=long)
Unicore
› Submit file
    grid_resource = unicore usite.org vsite
    keystore_file = keystore
    keystore_passphrase_file = keystore.pw
    keystore_alias = my cert
Condor
› Submit file
    grid_resource = condor schedd.foo.edu cm.foo.edu
    remote_universe = java
PBS
› Submit file
    grid_resource = pbs
LSF
› Submit file
    grid_resource = lsf
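Across the preceding slides, every grid_resource value has the same shape: a type keyword followed by zero or more site arguments. As a recap, the sketch below rebuilds the values shown on those slides; the `grid_resource` helper and the `EXAMPLES` table are illustrative names, not part of Condor.

```python
def grid_resource(kind, *args):
    """Build a grid_resource value of the form '<kind> <arg> <arg> ...'."""
    return " ".join((kind,) + args)


# One entry per slide; hosts and arguments are the ones used in the slides.
EXAMPLES = {
    "Pre-WS GRAM": grid_resource("gt2", "foo.edu/jobmanager-pbs"),
    "WS GRAM": grid_resource("gt4", "foo.edu", "PBS"),
    "NorduGrid": grid_resource("nordugrid", "foo.edu"),
    "Unicore": grid_resource("unicore", "usite.org", "vsite"),
    "Condor": grid_resource("condor", "schedd.foo.edu", "cm.foo.edu"),
    "PBS": grid_resource("pbs"),
    "LSF": grid_resource("lsf"),
}

for name, value in EXAMPLES.items():
    print(f"{name:12} grid_resource = {value}")
```

Only the first word changes the protocol Condor-G speaks; the rest of the submit file stays an ordinary Condor job description.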
Grid Universe Fault-Tolerance: Credential Management
› Authentication in many grid protocols is done with limited-lifetime X509 proxies
› Proxy may expire before jobs finish executing
› Condor can put jobs on hold and email user to refresh proxy
› Condor can automatically retrieve new proxies from MyProxy
› When the proxy is refreshed, Condor forwards it to the jobs
MyProxy
› Submit file
    MyProxyHost = foo.edu:12345
    MyProxyServerDN = /DC=org/DC=doegrids…
    MyProxyCredentialName = proxy_file
    MyProxyRefreshThreshold = 240 #mins
    MyProxyNewProxyLifetime = 12 #hrs
    MyProxyPassword = password
› Or give password on command line
    condor_submit -p password submit.desc
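The credential slides describe two responses to an aging proxy: refresh it from MyProxy, or hold the jobs and email the user. The sketch below illustrates that choice using the threshold and lifetime values from the slide, with the units taken from the slide's comments. Using a single threshold for both responses is a simplifying assumption, and `proxy_action` is a hypothetical helper, not a Condor function.

```python
REFRESH_THRESHOLD_MIN = 240  # MyProxyRefreshThreshold, "#mins" on the slide
NEW_PROXY_LIFETIME_HR = 12   # MyProxyNewProxyLifetime, "#hrs" on the slide


def proxy_action(minutes_left, have_myproxy):
    """Return what to do with a proxy that has `minutes_left` of lifetime.

    Assumption: the same threshold triggers either a MyProxy refresh or,
    when no MyProxy server is configured, the hold-and-email fallback.
    """
    if minutes_left > REFRESH_THRESHOLD_MIN:
        return "keep"
    if have_myproxy:
        # Fetch a proxy valid for NEW_PROXY_LIFETIME_HR hours, then forward
        # the refreshed proxy to the running jobs.
        return "refresh"
    return "hold_and_email"


print(proxy_action(600, have_myproxy=True))
print(proxy_action(100, have_myproxy=True))
print(proxy_action(100, have_myproxy=False))
```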
Condor-G Matchmaking
› Use Condor-G matchmaking with grid universe jobs
› Allows Condor-G to dynamically assign computing jobs to grid sites
› An example of lazy planning
Condor-G Matchmaking, cont.
› Normally a grid universe job must specify the site in the submit description file via the “grid_resource” attribute like so:
    Executable = foo
    Universe = grid
    Grid_Resource = gt2 beak.cs.wisc.edu/jobmanager-pbs
    queue
Condor-G Matchmaking, cont.
› With matchmaking, grid universe jobs can use requirements and rank:
    Executable = foo
    Universe = grid
    Grid_Resource = $$(ResourceName)
    Requirements = arch == LINUX
    Rank = NumberOfNodes * random()
    Queue
› The $$(x) syntax inserts information from the target ClassAd when a match is made.
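To make the requirements, rank, and $$(x) steps concrete, here is a minimal sketch of the job side of a match. It is not the Condor matchmaker: real matchmaking also evaluates the resource's own Requirements, and ClassAds are expressions rather than Python dicts. The `Arch` attribute and the `bar.example` and `baz.example` hosts are hypothetical.

```python
import random
import re

# Hypothetical resource ads; only ResourceName and NumberOfNodes appear in the slides.
resource_ads = [
    {"ResourceName": "gt4 foo.edu PBS", "Arch": "LINUX", "NumberOfNodes": 300},
    {"ResourceName": "gt2 bar.example/jobmanager-lsf", "Arch": "LINUX", "NumberOfNodes": 50},
    {"ResourceName": "gt2 baz.example/jobmanager-pbs", "Arch": "SOLARIS", "NumberOfNodes": 900},
]

job = {"Executable": "foo", "Universe": "grid", "Grid_Resource": "$$(ResourceName)"}


def requirements(ad):
    """Job-side requirement from the slide: the resource must be LINUX."""
    return ad.get("Arch") == "LINUX"


def rank(ad, rng):
    """Job-side rank from the slide: NumberOfNodes * random()."""
    return ad["NumberOfNodes"] * rng.random()


def match(job, ads, rng):
    """Pick the best-ranked acceptable ad and expand $$(attr) from it."""
    candidates = [ad for ad in ads if requirements(ad)]
    if not candidates:
        return None  # the job stays idle until a suitable ad appears
    best = max(candidates, key=lambda ad: rank(ad, rng))
    expand = lambda m: str(best[m.group(1)])
    return {k: re.sub(r"\$\$\((\w+)\)", expand, v) for k, v in job.items()}


print(match(job, resource_ads, random.Random()))
```

Multiplying by random() spreads jobs across sites in proportion to their size instead of sending everything to the largest one.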
Condor-G Matchmaking, cont.
› Where do these target ClassAds representing Globus gatekeepers come from? Several options:
  - Simple script on gatekeeper publishes an ad via condor_advertise command-line utility (method used by D0 JIM, USCMS)
  - Program to query Globus MDS and convert information into ClassAd (method used by EDG)
  - Run HawkEye with appropriate plugins on the gatekeeper
› For explanation of Condor-G matchmaking setup for USCMS, see http://www.cs.wisc.edu/condor/USCMS_matchmaking.html
Condor-G Matchmaking: Creating the Resource Ad
› Machine Ad
    MyType = "Machine"
    TargetType = "Job"
    Name = "foo.edu"
    Machine = "foo.edu"
    ResourceName = "gt4 foo.edu PBS"
    UpdateSequenceNumber = 4
    Requirements = TARGET.JobUniverse == 9 && CurMatches < 10
    CurMatches = 0
    NumberOfNodes = 300
    Rank = 0.0
    CurrentRank = 0.0
    WantAdRevaluate = True
Condor-G Matchmaking: Creating the Resource Ad
› Advertising a resource
    condor_advertise UPDATE_STARTD_AD ad-file
› Call periodically
› Use unix time for UpdateSequenceNumber
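A publishing script along these lines could generate the ad with the current unix time as UpdateSequenceNumber and hand it to condor_advertise. This is a sketch under assumptions: `render_ad` and `advertise` are hypothetical helpers, the attribute values are the ones from the machine ad slide, and `advertise` needs condor_advertise on the PATH, so it is defined here but not called.

```python
import subprocess
import time


def render_ad(name, resource_name, number_of_nodes, now=None):
    """Return the machine ad text, using unix time as UpdateSequenceNumber."""
    seq = int(time.time() if now is None else now)
    lines = [
        'MyType = "Machine"',
        'TargetType = "Job"',
        f'Name = "{name}"',
        f'Machine = "{name}"',
        f'ResourceName = "{resource_name}"',
        f"UpdateSequenceNumber = {seq}",
        "Requirements = TARGET.JobUniverse == 9 && CurMatches < 10",
        "CurMatches = 0",
        f"NumberOfNodes = {number_of_nodes}",
        "Rank = 0.0",
        "CurrentRank = 0.0",
        "WantAdRevaluate = True",
    ]
    return "\n".join(lines) + "\n"


def advertise(ad_text, path="ad-file"):
    """Write the ad to `path` and publish it with the command from the slide."""
    with open(path, "w") as f:
        f.write(ad_text)
    subprocess.run(["condor_advertise", "UPDATE_STARTD_AD", path], check=True)


print(render_ad("foo.edu", "gt4 foo.edu PBS", 300))
```

Running such a script from cron or a simple loop covers the "call periodically" step, and each run gets a larger sequence number than the last.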
But Wait, There’s More…
› What if you want to run standard universe jobs on grid resources?
  - For matchmaking and dynamic scheduling of jobs
  - For job checkpointing and migration
  - For remote system calls
› What if you don’t want to send a job to a site until the moment the job will start running (late binding)?
One Solution: Condor-G GlideIn
› You can use the Grid Universe to run Condor daemons on grid resources
› When the resources run these GlideIn jobs, they will temporarily join your Condor Pool
› You can then submit Standard, Vanilla, PVM, or MPI Universe jobs and they will be matched and run on the grid resources
[Diagram: your workstation runs a personal Condor holding 600 Condor jobs. Glide-in jobs sent through the Globus Grid to PBS, LSF, and Condor sites, together with a friendly Condor pool, form the Condor pool in which the jobs run.]
GlideIn Concerns
› What if a grid resource kills my GlideIn job?
  - That resource will disappear from your pool and your jobs will be rescheduled on other machines
  - Standard universe jobs will resume from their last checkpoint like usual
› What if all my jobs are completed before a GlideIn job runs?
  - If a GlideIn Condor daemon is not matched with a job in 10 minutes, it terminates, freeing the resource
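The 10-minute rule amounts to an idle timer on the glided-in daemon. A minimal sketch, assuming the timer restarts whenever a job is matched (the slide does not say how it is reset) and using a hypothetical `should_terminate` helper with times in seconds:

```python
IDLE_LIMIT_SECONDS = 10 * 60  # "10 minutes" from the slide


def should_terminate(started_at, last_matched_at, now):
    """True when the daemon has gone unmatched for the idle limit.

    `last_matched_at` is None if no job has ever been matched; resetting
    the timer on each match is an assumption.
    """
    idle_since = started_at if last_matched_at is None else last_matched_at
    return now - idle_since >= IDLE_LIMIT_SECONDS


print(should_terminate(started_at=0, last_matched_at=None, now=600))
print(should_terminate(started_at=0, last_matched_at=500, now=900))
```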
Condor
[Diagram: condor_submit hands the job to the schedd (job caretaker); the matchmaker pairs it with a startd (runs job).]
Condor-G
[Diagram: condor_submit -> schedd (job caretaker) -> gridmanager -> gahp -> Globus gatekeeper -> PBS or LSF.]
Condor-C
[Diagram: condor_submit -> schedd (job caretaker) -> gridmanager -> condor-gahp -> schedd; that schedd works with a matchmaker and a startd.]
Condor-C to non-Condor
[Diagram: condor_submit -> schedd (job caretaker) -> gridmanager -> condor-gahp -> schedd -> gridmanager -> pbs/lsf-gahp -> PBS or LSF.]
Gliding in Condor-C
[Diagram: condor_submit -> schedd (job caretaker). 1. Glide-in: gridmanager -> gahp -> Globus gatekeeper, which starts a remote schedd. 2. Submit jobs: gridmanager -> condor-gahp -> that schedd -> gridmanager -> pbs/lsf-gahp -> PBS or LSF.]
Matchmaking with Condor-C
› In all of these examples, Condor-C went to a specific remote schedd
› This is not required: you can do matchmaking
Matchmaking with Condor-C
[Diagram: condor_submit -> schedd (job caretaker); a matchmaker selects one of several remote schedds (schedd, schedd, …), and gridmanager -> condor-gahp submits the job to it.]