2
Squid node Most work done by Condor glideinWMS - A pilot-based Grid Workload Management System presented by Igor Sfiligoi, Fermilab, Batavia, IL Grid worker node Schedd Submit node Collector Pool Central Manager Node Negotiator Collector/CCB Collector/CCB condor_submit frontend glideinWMS VO Frontend condor_status condor_status condor_q condor_advertise Collector glideinWMS Collector Node Factory glideinWMS Glidein Factory condor_status condor_advertise Factory entry Factory entry condor_status condor_advertise Schedd condor_submit condor_q Schedd condor_submit condor_q GAHP Condor Gridmanager CE/Gatekeeper node Globus Web server Startd glexec glidein Batch queue Batch starter Shadow Starter User job Legend: Condor process glideinWMS process System software Grid software glideinWMS domain Grid domain (OSG, EGEE and NorduGrid tested) Download it from http://www.uscms.org/SoftwareComputing/Grid/WMS/glideinWMS/ Initially sponsored by USCMS and developed at Fermilab. Work sponsored by U.S. DoE under contract No. DE-AC02-07CH11359 glideinWMS architecture Computing resource pool Submit job Get result Users want simple access to compute resources Site C Site F Site E Site B Site A Site D Pilot Pilot Pilot Pilot Pilot Pilot Pilot Pilot Pilot Computing resource pool Pilot factory Pilot factory Submit job Get result CE CE CE CE CE CE A pilot WMS creates a virtual private pool over Grid resources Why use a pilot-based WMS? VO Frontend Glidein Glidein Startd Schedd Glidein factory Collector Negotiator Dashboard Globus Globus User job Schematic view A detailed view For more details see http://www.uscms.org/SoftwareComputing/Grid/WMS/glideinWMS/doc.html Worker 1 Worker 2 Worker 3 Worker 4 The flow is fully asynchronous Glidein startup logic Worker Node glidein load list of files execute scripts Errors? Startd Yes No load files Downloaded scripts do all the real work Glidein factory Web Server Web Cache Modular and configurable Like a dedicated Condor pool from the user point of view! Monitoring Standard Condor tools Both for the users and glideinWMS administrators Pseudo-interactive monitoring (to peek at Grid jobs in real time) Web monitoring Glidein factory only (being improved) User jobs never see bad nodes Supports: ls cat/tail ps top ... Web cache This batch slot would not be able to run a user job Ready to pull a user job for execution (1) (2) (3) (4) (6) (5) (7) (8) (9) (10) (11) (12) (13) (15) (14) (17) (18) (19) (8b – as needed) (1) (1) (1) (2) (2) (3) (3) (4 - as needed) (1) (2a) (2a) (2b) (2b) (3) (3) Two independent loops for glideinWMS (16) condor_advertise

glideinWMS architecture Why use a pilotbased WMS? Glidein

  • Upload
    hangoc

  • View
    215

  • Download
    1

Embed Size (px)

Citation preview

Page 1: glideinWMS architecture Why use a pilotbased WMS? Glidein

Squidnode

Worker 1

Worker 2

Worker 3

Worker 4

Most workdone by Condor

glideinWMS­

A pilot­based Grid Workload Management Systempresented by Igor Sfiligoi, Fermilab, Batavia, IL

Grid worker nodeSchedd

Submit node

Collector

Pool Central Manager Node

Negotiator

Collector/CCB Collector/CCB

condor_submit

frontend

glideinWMS VO Frontend

condor_status

condor_status

condor_q condor_advertise

CollectorglideinWMS

Collector Node

Factory

glideinWMS Glidein Factory

condor_status

condor_advertise

Factory entry

Factory entry

condor_status

condor_advertise

Schedd

condor_submit

condor_q

Scheddcondor_submit

condor_q

GAHP

CondorGridmanager

CE/Gatekeeper node

Globus

Web server

Startd

glexec

glidein

Batch queue

Batch starter

ShadowStarter

User job

Legend:

Condor process

glideinWMS process

System software

Grid software

glideinWMS domain

Grid domain(OSG, EGEE and NorduGrid tested)

Download it from http://www.uscms.org/SoftwareComputing/Grid/WMS/glideinWMS/Initially sponsored by USCMS and developed at Fermilab.Work sponsored by U.S. DoE under contract No. DE-AC02-07CH11359

glideinWMS architecture

Computing resourcepool

Submit job

Get result

Users want simple accessto compute resources Site C

Site F

Site E

Site BSite A

Site DPilot Pilot

PilotPilot

Pilot Pilot

PilotPilotPilot

Computing resourcepool

Pilot factory

Pilot factory

Submit job

Get result

CE

CE

CE

CE

CE

CEA pilot WMS

creates avirtual private pool

overGrid resources

Why use a pilot­based WMS?

VO Frontend

Glidein GlideinStartd

Schedd

Glidein factory

Collector

Negotiator

Dashboard

Globus Globus

User job

Schematic view A detailed view

For more details see http://www.uscms.org/SoftwareComputing/Grid/WMS/glideinWMS/doc.html

Worker 1

Worker 2

Worker 3

Worker 4

The flow is fully asynchronous

Glidein startup logic

Worker Node

glidein

load list of files

execute scripts

Errors?

Startd

Yes

No

load files

Downloaded scriptsdo all the real work

Glidein factoryWeb Server

Web Cache

Modular andconfigurable

Like a dedicatedCondor pool

from the user point of view!

Monitoring

Standard Condor tools

Both for the users and glideinWMS administrators

Pseudo­interactive monitoring(to peek at Grid jobs in real time)

Web monitoring

Glidein factory only (being improved)

User jobs neversee bad nodes

Supports:● ls● cat/tail● ps● top● ...

Web cache

This batch slot would notbe able to run a user job

Ready to pull auser job for execution

(1)(2)

(3)(4)

(6)

(5)

(7) (8)

(9)

(10)

(11)

(12)

(13)(15)

(14)

(17)(18)

(19)

(8b – as needed)

(1)

(1)

(1)

(2)

(2)

(3)

(3)

(4 - as needed)

(1)(2a)

(2a)

(2b)

(2b) (3)

(3)

Two independentloops for glideinWMS

(16)

condor_advertise

Page 2: glideinWMS architecture Why use a pilotbased WMS? Glidein

Current users of glideinWMSpresented by Igor Sfiligoi, Fermilab, Batavia, IL

CMS Processing CMS User Analysis

CDF's Combined User Analysisand Centralized Activities

MINOS User Analysis

Scalability tests

Fully automated, process driven

VO Frontend

Glidein GlideinStartd

Schedd

Glidein factory

Collector

Negotiator

Dashboard

Globus Globus

CMS job

Worker 1

Worker 2

Worker 3

Worker 4

ProdAgent

Schedd

ProdAgent

Has been running for over a year now

Using both OSG and EGEE CMS Tier-1's

Portal basedUsers don't see the glideinWMS

VO Frontend

Glidein GlideinStartd

Glidein factory

Collector

Negotiator

Dashboard

Globus Globus

User job

Worker 1

Worker 2

Worker 3

Worker 4

Schedd

CRAB Server

CRAB client

Still a prototypeUsing both OSG and EGEE CMS Tier-2's

Portal basedUsers and automated tasks don't see the glideinWMS

Used glidein-based solution for several years.Currently moving to glideinWMS.

Using both CDF-owned and opportunistic OSG resources.

VO Frontend

Glidein GlideinStartd

Schedd

Glidein factory

Collector

Negotiator

Globus Globus

User job

Worker 1

Worker 2

Worker 3

Worker 4

Users submit to a central scheddAll processes share a single machine

Has been running for a year now

Using both MINOS-owned and opportunistic OSG resources at Fermilab.

Download it from http://www.uscms.org/SoftwareComputing/Grid/WMS/glideinWMS/glideinWMS initially sponsored by USCMS and developed at Fermilab.Work sponsored by U.S. DoE under contract No. DE-AC02-07CH11359

Condor scalability tested

Collector scalability● 11k on a single collector, on LAN● 22k using a tree of 1+70 collectors

over WAN (between Europe and USA)Schedd scalability● Limited to:

● ~22k running jobs on a single schedd● 200k idle jobs on a single schedd

glideinWMS scalability tested

VO Frontend scalability● 200k idle and 22k running user jobs ● 20 schedds

Glidein factory scalability● Limited to:

● ~50 Grid sites basic installation ● ~150 Grid sites with fine tuning

● 5 VO frontends● 22k running glideins

Prototype thatscales higher

available

VO Frontend

Glidein GlideinStartd

Glidein factory

Collector

Negotiator

Dashboard

Globus Globus

CDF job

Worker 1

Worker 2

Worker 3

Worker 4

Schedd

CAF

CAF client

CAF client

MCgen

Calib

Ntupling

DataProd

Ran out ofports