Upload
hangoc
View
215
Download
1
Embed Size (px)
Citation preview
Squidnode
Worker 1
Worker 2
Worker 3
Worker 4
Most workdone by Condor
glideinWMS
A pilotbased Grid Workload Management Systempresented by Igor Sfiligoi, Fermilab, Batavia, IL
Grid worker nodeSchedd
Submit node
Collector
Pool Central Manager Node
Negotiator
Collector/CCB Collector/CCB
condor_submit
frontend
glideinWMS VO Frontend
condor_status
condor_status
condor_q condor_advertise
CollectorglideinWMS
Collector Node
Factory
glideinWMS Glidein Factory
condor_status
condor_advertise
Factory entry
Factory entry
condor_status
condor_advertise
Schedd
condor_submit
condor_q
Scheddcondor_submit
condor_q
GAHP
CondorGridmanager
CE/Gatekeeper node
Globus
Web server
Startd
glexec
glidein
Batch queue
Batch starter
ShadowStarter
User job
Legend:
Condor process
glideinWMS process
System software
Grid software
glideinWMS domain
Grid domain(OSG, EGEE and NorduGrid tested)
Download it from http://www.uscms.org/SoftwareComputing/Grid/WMS/glideinWMS/Initially sponsored by USCMS and developed at Fermilab.Work sponsored by U.S. DoE under contract No. DE-AC02-07CH11359
glideinWMS architecture
Computing resourcepool
Submit job
Get result
Users want simple accessto compute resources Site C
Site F
Site E
Site BSite A
Site DPilot Pilot
PilotPilot
Pilot Pilot
PilotPilotPilot
Computing resourcepool
Pilot factory
Pilot factory
Submit job
Get result
CE
CE
CE
CE
CE
CEA pilot WMS
creates avirtual private pool
overGrid resources
Why use a pilotbased WMS?
VO Frontend
Glidein GlideinStartd
Schedd
Glidein factory
Collector
Negotiator
Dashboard
Globus Globus
User job
Schematic view A detailed view
For more details see http://www.uscms.org/SoftwareComputing/Grid/WMS/glideinWMS/doc.html
Worker 1
Worker 2
Worker 3
Worker 4
The flow is fully asynchronous
Glidein startup logic
Worker Node
glidein
load list of files
execute scripts
Errors?
Startd
Yes
No
load files
Downloaded scriptsdo all the real work
Glidein factoryWeb Server
Web Cache
Modular andconfigurable
Like a dedicatedCondor pool
from the user point of view!
Monitoring
Standard Condor tools
Both for the users and glideinWMS administrators
Pseudointeractive monitoring(to peek at Grid jobs in real time)
Web monitoring
Glidein factory only (being improved)
User jobs neversee bad nodes
Supports:● ls● cat/tail● ps● top● ...
Web cache
This batch slot would notbe able to run a user job
Ready to pull auser job for execution
(1)(2)
(3)(4)
(6)
(5)
(7) (8)
(9)
(10)
(11)
(12)
(13)(15)
(14)
(17)(18)
(19)
(8b – as needed)
(1)
(1)
(1)
(2)
(2)
(3)
(3)
(4 - as needed)
(1)(2a)
(2a)
(2b)
(2b) (3)
(3)
Two independentloops for glideinWMS
(16)
condor_advertise
Current users of glideinWMSpresented by Igor Sfiligoi, Fermilab, Batavia, IL
CMS Processing CMS User Analysis
CDF's Combined User Analysisand Centralized Activities
MINOS User Analysis
Scalability tests
Fully automated, process driven
VO Frontend
Glidein GlideinStartd
Schedd
Glidein factory
Collector
Negotiator
Dashboard
Globus Globus
CMS job
Worker 1
Worker 2
Worker 3
Worker 4
ProdAgent
Schedd
ProdAgent
Has been running for over a year now
Using both OSG and EGEE CMS Tier-1's
Portal basedUsers don't see the glideinWMS
VO Frontend
Glidein GlideinStartd
Glidein factory
Collector
Negotiator
Dashboard
Globus Globus
User job
Worker 1
Worker 2
Worker 3
Worker 4
Schedd
CRAB Server
CRAB client
Still a prototypeUsing both OSG and EGEE CMS Tier-2's
Portal basedUsers and automated tasks don't see the glideinWMS
Used glidein-based solution for several years.Currently moving to glideinWMS.
Using both CDF-owned and opportunistic OSG resources.
VO Frontend
Glidein GlideinStartd
Schedd
Glidein factory
Collector
Negotiator
Globus Globus
User job
Worker 1
Worker 2
Worker 3
Worker 4
Users submit to a central scheddAll processes share a single machine
Has been running for a year now
Using both MINOS-owned and opportunistic OSG resources at Fermilab.
Download it from http://www.uscms.org/SoftwareComputing/Grid/WMS/glideinWMS/glideinWMS initially sponsored by USCMS and developed at Fermilab.Work sponsored by U.S. DoE under contract No. DE-AC02-07CH11359
Condor scalability tested
Collector scalability● 11k on a single collector, on LAN● 22k using a tree of 1+70 collectors
over WAN (between Europe and USA)Schedd scalability● Limited to:
● ~22k running jobs on a single schedd● 200k idle jobs on a single schedd
glideinWMS scalability tested
VO Frontend scalability● 200k idle and 22k running user jobs ● 20 schedds
Glidein factory scalability● Limited to:
● ~50 Grid sites basic installation ● ~150 Grid sites with fine tuning
● 5 VO frontends● 22k running glideins
Prototype thatscales higher
available
VO Frontend
Glidein GlideinStartd
Glidein factory
Collector
Negotiator
Dashboard
Globus Globus
CDF job
Worker 1
Worker 2
Worker 3
Worker 4
Schedd
CAF
CAF client
CAF client
MCgen
Calib
Ntupling
DataProd
Ran out ofports