20
Eileen Berman

Eileen Berman. Condor in the Fermilab Grid FacilitiesApril 30, 2008 Fermi National Accelerator Laboratory is a high energy physics laboratory outside

Embed Size (px)

DESCRIPTION

Condor in the Fermilab Grid FacilitiesApril 30, 2008 The Fermilab Grid Facility supports many different activities in support of these scientists. The scale of these activities places demands that often require close collaborations.  Petabytes of data to store and manage  Billions of events to analyze  Tens of millions of jobs to run per year Collaboration with the Condor team over the years, has enabled Fermilab to deliver the quality software and services that this large scale science demands. 3

Citation preview

Page 1: Eileen Berman. Condor in the Fermilab Grid FacilitiesApril 30, 2008  Fermi National Accelerator Laboratory is a high energy physics laboratory outside

Eileen Berman

Page 2: Eileen Berman. Condor in the Fermilab Grid FacilitiesApril 30, 2008  Fermi National Accelerator Laboratory is a high energy physics laboratory outside

Condor in the Fermilab Grid Facilities

April 30, 2008

Fermi National Accelerator Laboratory is a high energy physics laboratory outside of Chicago

Many different experiments, comprised of 1000’s of users working for many years, rely on Fermilab to provide the core services and software necessary to enable the research that leads to scientific discoveries

The Fermilab Grid Facilities are strategic in the execution of this mission 2

Page 3: Eileen Berman. Condor in the Fermilab Grid FacilitiesApril 30, 2008  Fermi National Accelerator Laboratory is a high energy physics laboratory outside

Condor in the Fermilab Grid Facilities

April 30, 2008

The Fermilab Grid Facility supports many different activities in support of these scientists.

The scale of these activities places demands that often require close collaborations.

Petabytes of data to store and manage Billions of events to analyze Tens of millions of jobs to run per yearCollaboration with the Condor team over the

years, has enabled Fermilab to deliver the quality software and services that this large scale science demands.

3

Page 4: Eileen Berman. Condor in the Fermilab Grid FacilitiesApril 30, 2008  Fermi National Accelerator Laboratory is a high energy physics laboratory outside

Condor in the Fermilab Grid Facilities

April 30, 2008

Fermilab Campus Grid (FermiGrid)DZero Data Handling and Job

Submission System (SAMGrid)Resource Selection Service (ReSS)Glidein Workload Management

System (WMS)Grid/Site Security Boundary (OSE)End To End Security

4

Page 5: Eileen Berman. Condor in the Fermilab Grid FacilitiesApril 30, 2008  Fermi National Accelerator Laboratory is a high energy physics laboratory outside

Condor in the Fermilab Grid Facilities

April 30, 2008 5

The Grid

StorageResources

StorageResources

ComputingResources Computin

gResources

ComputingResources

ComputingResources

StorageResourcesSite

D

Site C

Site B

Site A

Page 6: Eileen Berman. Condor in the Fermilab Grid FacilitiesApril 30, 2008  Fermi National Accelerator Laboratory is a high energy physics laboratory outside

Condor in the Fermilab Grid Facilities

April 30, 2008

SAZServer

GUMSServer

VOMSServer

FERMIGRID SE(dcache SRM)

Gratia

BlueArc

VOMSServer

SAZServer

GUMSServer

Step 6- Grid job is forwarded

to target cluster

VOMRSServer

6

Step 2 - user iss

ues voms-proxy-init

user receives voms sig

ned credentials

Step 3 – user submits their grid job viaglobus-job-run, globus-job-submit, or condor-g

Step 5 – Gateway requests

GUMS

Mapping based on VO & Role

Step 4 – Gateway checks against

Site Authorization Service

Periodic

Synchronization

Exterior

Interior

Periodic

Synchronization

Step

1 - u

ser

regist

ers w

ith V

O

ReSS

CMSWC2

CMSWC1

CDFOSG2

D0CAB1

GPGrid

D0CAB2

CMSWC3

CDFOSG1

Publish campus

classads

CDFOSG3/4

Condor

Site Wide

Gateway

clusters send ClassAdsto the site wide gateway

Page 7: Eileen Berman. Condor in the Fermilab Grid FacilitiesApril 30, 2008  Fermi National Accelerator Laboratory is a high energy physics laboratory outside

Condor in the Fermilab Grid Facilities

April 30, 2008

As data volumes increase, demands for increased data processing power has grown beyond the ability to support in a diverse disconnected environment.

FermiGrid is a strategic cross campus grid supporting the experiments at Fermilab 11,786 cores amongst the clusters

• 7586 managed by Condor Access to several Petabytes (1015) of storage O(1000) users Provides a common mechanism for supporting the

users while minimizing the expense of service delivery.

7

Page 8: Eileen Berman. Condor in the Fermilab Grid FacilitiesApril 30, 2008  Fermi National Accelerator Laboratory is a high energy physics laboratory outside

Condor in the Fermilab Grid Facilities

April 30, 2008

Condor is the underlying batch system on most of the clustersThe Site-Wide Gateway is a natural pairing of Condor-G and a hierarchical grid deploymentGateway uses jobmanager-cemon

Based on jobmanager-condor Matches jobs against cluster information via a local condor

matchmaking service (ReSS) Forwards to matched sub-cluster Enhances scalability Fault tolerance under development

Additional Gateways will provide job access to Teragrid resourcesSignificant scalability work done by Condor team to support usage patterns

8

Page 9: Eileen Berman. Condor in the Fermilab Grid FacilitiesApril 30, 2008  Fermi National Accelerator Laboratory is a high energy physics laboratory outside

Condor in the Fermilab Grid Facilities

April 30, 2008 9

Page 10: Eileen Berman. Condor in the Fermilab Grid FacilitiesApril 30, 2008  Fermi National Accelerator Laboratory is a high energy physics laboratory outside

Condor in the Fermilab Grid Facilities

April 30, 2008

SamGrid is the data handling and job submission system supporting the Dzero collaboration (500 physicists) data processing and simulated event generation needs 100’s of Terabytes of data needing processing Billions of events with data and metadata to manage and

deliver Needed a widely distributed system to support the data

delivery and data processing needs of the experiment, located across the globe

Early adopter of Condor matchmaking service with Condor-G to select clusters instead of machines. Had to learn how to describe clusters

Uses Condor matchmaking capabilities to throttle the flow of jobs from the schedulers to the sites (stable, configurable) Uses cluster classad requirements

Policies can be implemented based on job status values (e.g. running, held…) 10

Page 11: Eileen Berman. Condor in the Fermilab Grid FacilitiesApril 30, 2008  Fermi National Accelerator Laboratory is a high energy physics laboratory outside

Condor in the Fermilab Grid Facilities

April 30, 2008

Condor team implemented multi-tiered job submission mechanism necessary in order to take advantage of the diverse clusters on the Grid.

11

Condor Schedd

Grid

Page 12: Eileen Berman. Condor in the Fermilab Grid FacilitiesApril 30, 2008  Fermi National Accelerator Laboratory is a high energy physics laboratory outside

Condor in the Fermilab Grid Facilities

April 30, 2008 12

SAZServer

GUMSServer

VOMSServer

FERMIGRID SE(dcache SRM)

Gratia

BlueArcCMSWC2

VOMSServer

SAZServer

GUMSServer

Step 2 - user iss

ues voms-proxy-init

user receives voms sig

ned credentials

Step 3 – user submits their grid job viaglobus-job-run, globus-job-submit, or condor-g

Step 5 – Gateway requests

GUMS

Mapping based on VO &

RoleStep 4 – Gateway checks against

Site Authorization Service clusters send ClassAds

via CEMonto the site wide gateway

Step 6- Grid job is forwarded

to target cluster

Periodic

Synchronization

Site Wide

Gateway

Exterior

InteriorCMSWC1

VOMRSServer

Periodic

Synchronization

Step

1 - u

ser

regist

ers w

ith V

O

CDFOSG2

D0CAB1 GPGridD0

CAB2CMSWC3

CDFOSG1

ReSS Publish campus

classads

CDFOSG3/4

Page 13: Eileen Berman. Condor in the Fermilab Grid FacilitiesApril 30, 2008  Fermi National Accelerator Laboratory is a high energy physics laboratory outside

Condor in the Fermilab Grid Facilities

April 30, 2008 13

CondorMatch Maker

InfoGatherer

classads

classads classads classads

CondorScheduler

jobWhat Gate?Gate 3

job

CEMon

CE

Gate1

job-managersjob-managersjob-managersjobs info

CLUSTER

GIP

CEMon

CE

Gate2

job-managersjob-managersjob-managersjobs info

CLUSTER

GIP

CEMon

CE

Gate3

job-managersjob-managersjob-managersjobs info

CLUSTER

GIP

ReSS

• Users express cluster requirements in the JDL• Jobs sit in the site batch system queue until execution

Grid / Site Interface

VO /

Grid

In

terf

ace

Gri

dSi

teU

sers

Queryclassads

Query

Page 14: Eileen Berman. Condor in the Fermilab Grid FacilitiesApril 30, 2008  Fermi National Accelerator Laboratory is a high energy physics laboratory outside

Condor in the Fermilab Grid Facilities

April 30, 2008

Needed the ability to match diverse job needs with available resources across an entire grid (OSG) without requiring users to know the details

Resource Selection Service using same concept as SamGrid, generalized for the Open Science Grid (OSG) Condor-G Matchmaking

ReSS is the production OSG resource selection service – Grid Wide

FermiGrid uses ReSS internally for matching jobs to the campus clusters.

For more information see – Mats Rynge’s talk “OSGMM and ReSS – Matchmaking on OSG” 14

Page 15: Eileen Berman. Condor in the Fermilab Grid FacilitiesApril 30, 2008  Fermi National Accelerator Laboratory is a high energy physics laboratory outside

Condor in the Fermilab Grid Facilities

April 30, 20084/30/2008 15

WMS Legend:

CollectorGlidein factory

VO frontend

Legend:

Central managerExecute daemon

Submit nodeand user(s)

Gatekeeper

Worker node

Courtesy: I. Sfiligoi (condor week 07)

Page 16: Eileen Berman. Condor in the Fermilab Grid FacilitiesApril 30, 2008  Fermi National Accelerator Laboratory is a high energy physics laboratory outside

Condor in the Fermilab Grid Facilities

April 30, 2008

Provides glideins in a more transparent manner insulating the user from the diverse nature of the grid

Supports vanilla and standard universe jobs Works across firewalls

Required GCB be made production quality to support this

Support for calling glExec built in to Condor directly Allows separation of execution of the Glidein pilot

and the user job Scaling upgrades were done to support the

O(10,000) jobs running simultaneously Creates a virtual condor pool so distributed

grid resources look like dedicated local resources 16

Page 17: Eileen Berman. Condor in the Fermilab Grid FacilitiesApril 30, 2008  Fermi National Accelerator Laboratory is a high energy physics laboratory outside

Condor in the Fermilab Grid Facilities

April 30, 2008

Work is being done at Fermilab at the boundary of the Condor-G world and Condor, determining the security considerations on this boundary.

Defines how site installations can work securely when interfacing with a grid environment.

17

Page 18: Eileen Berman. Condor in the Fermilab Grid FacilitiesApril 30, 2008  Fermi National Accelerator Laboratory is a high energy physics laboratory outside

Condor in the Fermilab Grid Facilities

April 30, 2008 18

Acronyms:UCrd - User credentialsJCrd - Job specific user credentialsJPld - Job payload

WN

WMS 1 WMS n

WMS 2

...

JCrdJP

ld

Datastore

JPld

JCrd

JPldJCrd

JPldJCrd

JCrd

JCrd

JCrd

JCrd

JCrd

JCrd

JPld sign.Ucrd

JCrd

The Grid

StorageResources

StorageResources

ComputingResources Computin

gResources

ComputingResources

ComputingResources

StorageResourcesSite

D

Site C

Site B

Site A

Page 19: Eileen Berman. Condor in the Fermilab Grid FacilitiesApril 30, 2008  Fermi National Accelerator Laboratory is a high energy physics laboratory outside

Condor in the Fermilab Grid Facilities

April 30, 2008

With the implementation of Glidein WMS’s, a job may make many stops on its way to a worker node How do you insure the job integrity?

Can determine (e.g. on a worker node), if the job has been tampered with

Implements signed classads to help improve the integrity of the user job as it travels in the Grid

Collaboratory investigation between Fermilab and the Condor team.

See Ian’s talk (Thursday) ‘End-to-end security in Condor’

19

Page 20: Eileen Berman. Condor in the Fermilab Grid FacilitiesApril 30, 2008  Fermi National Accelerator Laboratory is a high energy physics laboratory outside

Condor in the Fermilab Grid Facilities

April 30, 2008

Fermilab provides a secure, production grid environment offering grid software and services that enables scientists to produce scientific results on the scale required by today’s petascale experiments.

The productive relationship existing between Fermilab and the Condor team has contributed to this success.

Condor is part of the Grid Facility at Fermilab, serving the production needs of many experiments and that of the wider OSG.

20