
EGEE-II INFSO-RI-031688 EGEE and gLite are registered trademarks

The EGEE Production Grid

Ian Bird

EGEE Operations Manager

HEPiX

Jefferson Lab, 12th October 2006

Enabling Grids for E-sciencE

Outline

• Some history
– What led up to where we are now?
– The EGEE project
• What is the EGEE grid infrastructure today?
– What has been achieved?
– How is it used?
– How does it compare and relate to other production grids?
• Outlook

Some history … LHC EGEE Grid

• 1999 – MONARC Project
– Early discussions on how to organise distributed computing for LHC
• 2000 – growing interest in grid technology
– HEP community was the driver in launching the DataGrid project
• 2001-2004 – EU DataGrid project
– middleware & testbed for an operational grid
• 2002-2005 – LHC Computing Grid (LCG)
– deploying the results of DataGrid to provide a production facility for LHC experiments
• 2004-2006 – EU EGEE project phase 1
– starts from the LCG grid
– shared production infrastructure
– expanding to other communities and sciences
• 2006-2008 – EU EGEE-II
– Building on phase 1
– Expanding applications and communities …
• … and in the future
– Worldwide grid infrastructure??
– Interoperating and co-operating infrastructures?


The EGEE project

• EGEE - €32 M
– 1 April 2004 – 31 March 2006
– 71 partners in 27 countries, federated in regional Grids
• EGEE-II - €35 M
– 1 April 2006 – 31 March 2008
– 91 partners in 32 countries, 13 Federations
• Objectives
– Large-scale, production-quality infrastructure for e-Science
– Attracting new resources and users from industry as well as science
– Improving and maintaining “gLite” Grid middleware

The EGEE Infrastructure

Infrastructure:
• Physical test-beds & services
• Support organisations & procedures
• Policy groups

Test-beds & Services
– Certification testbeds (SA3)
– Pre-production service
– Production service

Support Structures
– Operations Coordination Centre
– Regional Operations Centres
– Global Grid User Support
– EGEE Network Operations Centre (SA2)
– Operational Security Coordination Team

Security & Policy Groups
– Operations Advisory Group (+NA4)
– Joint Security Policy Group
– EuGridPMA (& IGTF)
– Grid Security Vulnerability Group

Certification & release preparation

• The goal is to produce a middleware distribution that can be deployed widely
– Not the same as middleware releases from development projects
– More like a Linux distribution: bringing together many pieces from several sources
• Extensive certification test-bed:
– Close to 100 machines involved, CERN + partners
• Emulate the main deployment environments
• Certification testing:
– Installation and configuration
– Component (service) functionality
– System testing (trying to emulate real workloads and stress testing)
– Beginning to use virtualization to simplify the testing environment
• Deployment into the pre-production system
– Final step of certification: validation by real sites
– Validation by applications, which also allows apps to be prepared for new versions

Pre-production service

• Pre-production service is now ~20 sites
• Provides access to some 500 CPUs
– Some sites allow access to their full production batch systems for scale tests
• Sites install and test different configurations and sets of services
– Try to get a good feeling for the quality of the release or updates before general release to production
– Feedback to: certification, integration, developers, etc.
• P-PS is now used in the way it was intended
– For some time it was acting as a second certification test-bed for the gLite-1.x branch
– Some services may be demonstrated in this environment before going to production (or they may need more work)

Production service

Size of the infrastructure today:
• 196 sites in 42 countries
• ~32 000 CPUs
• ~3 PB disk, + tape MSS

[Figure: number of sites and number of CPUs; CPU axis from 0 to 35 000]

Usage of the infrastructure

[Figure: EGEE workload in jobs/month, Jan-05 to Aug-06, by VO: alice, atlas, biomed, cms, compchem, dteam, egeode, egrid, esr, fusion, geant4, lhcb, magic, ops, planck, other VOs]

[Figure: Normalized CPU time in k.SI2k.hours, Jan-05 to Aug-06, for the same VOs]

>50k jobs/day

~7000 CPU-months/month

Non-LHC VOs

[Figure: EGEE workload in jobs/month for non-LHC VOs: biomed, compchem, egeode, egrid, esr, fusion, geant4, magic, ops, planck, other VOs]

[Figure: Normalized CPU time in k.SI2k.hours for non-LHC VOs: biomed, compchem, dteam, egeode, egrid, esr, fusion, geant4, magic, ops, planck, other VOs]

Workloads of the “other VOs” are becoming significant: approaching 8-10k jobs per day and 1000 CPU-months/month

• One year ago this was the overall scale of work for all VOs

Use of the infrastructure

20k jobs running simultaneously

CPU Usage

[Figures: CPU usage and Virtual Organizations, Jan. ’06 to Sep. ’06]

Use for massive data transfer

Large LHC experiments now transferring ~ 1PB/month each
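As a rough sanity check, the monthly volume quoted above can be turned into an average sustained rate. The sketch below assumes decimal units (1 PB = 10^15 bytes) and a 30-day month; neither assumption is stated on the slide, and the function name is purely illustrative.

```python
# Convert a monthly transfer volume into an average sustained rate.
# Assumptions (not from the slide): decimal petabytes, 30-day month.
PB = 10**15                          # bytes in one decimal petabyte
SECONDS_PER_MONTH = 30 * 24 * 3600   # 2 592 000 seconds


def average_rate_mb_per_s(petabytes_per_month: float) -> float:
    """Average rate in decimal MB/s for a given monthly volume."""
    return petabytes_per_month * PB / SECONDS_PER_MONTH / 10**6


if __name__ == "__main__":
    print(f"{average_rate_mb_per_s(1.0):.0f} MB/s per experiment")
```

Under these assumptions, ~1 PB/month corresponds to a little under 400 MB/s sustained around the clock for each experiment; real transfers are bursty, so peak rates would need to be higher.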

Applications on EGEE

• More than 25 applications from an increasing number of domains
– Astrophysics
– Computational Chemistry
– Earth Sciences
– Financial Simulation
– Fusion
– Geophysics
– High Energy Physics
– Life Sciences
– Multimedia
– Material Sciences
– …
• Application types:
– Simulation
– Bulk Processing
– Responsive Apps.
– Workflow
– Parallel Jobs
– Legacy Applications

Simulation

• Examples
– LHC Monte Carlo simulation
– Fusion
– WISDOM: malaria/avian flu
• Characteristics
– Jobs are CPU-intensive
– Large number of independent jobs
– Run by few (expert) users
– Small input; large output
• Needs
– Batch-system services
– Minimal data management for storage of results

[Images: ATLAS, ITER]

Drug Discovery

• WISDOM focuses on in silico drug discovery for neglected and emerging diseases.
• Malaria, Summer 2005
– 46 million ligands docked
– 1 million selected
– 1 TB data produced; 80 CPU-years used in 6 weeks
• Avian Flu, Spring 2006
– H5N1 neuraminidase
– Impact of selected point mutations on eff. of existing drugs
– Identification of new potential drugs acting on mutated N1
• Fall 2006
– Extension to other neglected diseases
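To put the malaria figure in perspective, 80 CPU-years delivered in 6 weeks implies a certain average number of CPUs busy at once. The sketch below is a back-of-the-envelope calculation only; it assumes a 365-day year and perfectly even usage across the 6 weeks, and the function name is made up for illustration.

```python
# Average number of CPUs kept busy to deliver a given amount of CPU time
# within a fixed wall-clock window.
# Assumptions (not from the slide): 365-day year, uniform usage.
DAYS_PER_YEAR = 365


def average_concurrent_cpus(cpu_years: float, weeks: float) -> float:
    """CPU-days consumed divided by wall-clock days available."""
    cpu_days = cpu_years * DAYS_PER_YEAR
    wall_clock_days = weeks * 7
    return cpu_days / wall_clock_days


if __name__ == "__main__":
    print(f"{average_concurrent_cpus(80, 6):.0f} CPUs on average")
```

Under these assumptions the run kept roughly 700 CPUs busy continuously for the six weeks.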

Bulk Processing

• Examples
– HEP processing of raw data, analysis
– Earth observation data processing
• Characteristics
– Widely-distributed input data
– Significant amount of input and output data
• Needs
– Job management tools (workload management)
– Meta-data services
– More sophisticated data management

Responsive Apps. (I)

• Examples
– Prototyping new applications
– Monitoring grid operations
– Direct interactivity
• Characteristics
– Small amounts of input and output data
– Not CPU-intensive
– Short response time (few minutes)
• Needs
– Configuration which allows “immediate” execution (QoS)
– Services must treat jobs with minimum latency

Responsive Apps. (II)

• Grid as a backend infrastructure:
– gPTM3D: interactive analysis of medical images
– GPS@: bioinformatics via web portal
– GATE: radiotherapy planning
– DILIGENT: digital libraries
– Volcano sonification
• Characteristics
– Rapid response: a human waiting for the result!
– Many small but CPU-intensive tasks
– User is not aware of “grid”!
• Needs
– Interfacing (data & computing) with non-grid application or portal
– User and rights management between front-end and grid

Workflow

• Examples
– “Bronze Standard”: image registration
– Flood prediction
• Characteristics
– Use of grid and non-grid services
– Complex set of algorithms for the analysis
– Complex dependencies between individual tasks
• Needs
– Tools for managing the workflow itself
– Standard interfaces for services (i.e. web services)
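The "complex dependencies between individual tasks" mentioned above are commonly modelled as a directed acyclic graph, from which a valid execution order can be derived. The following is a minimal sketch using Python's standard `graphlib` module (Python 3.9+); it is not the workflow tooling used on EGEE, and the task names are hypothetical.

```python
from graphlib import TopologicalSorter


def execution_order(dependencies: dict) -> list:
    """Return tasks in an order that respects their dependencies.

    `dependencies` maps each task to the set of tasks that must finish
    before it can start. Raises graphlib.CycleError on circular input.
    """
    return list(TopologicalSorter(dependencies).static_order())


if __name__ == "__main__":
    # Hypothetical image-analysis pipeline; names are illustrative only.
    pipeline = {
        "register": {"fetch"},
        "compare": {"register"},
        "report": {"compare"},
    }
    print(execution_order(pipeline))
```

A real workflow manager would additionally dispatch independent tasks in parallel and handle retries; this sketch only shows the ordering step.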

Parallel Jobs

• Examples
– Climate modeling
– Earthquake analysis
– Computational chemistry
• Characteristics
– Many interdependent, communicating tasks
– Many CPUs needed simultaneously
– Use of MPI libraries
• Needs
– Configuration of resources for flexible use of MPI
– Pre-installation of optimized MPI libraries

Legacy Applications

• Examples
– Commercial or closed source binaries
– Geocluster: geophysical analysis software
– FlexX: molecular docking software
– Matlab, Mathematica, …
• Characteristics
– Licenses: control access to software on the grid
– No recompilation, so no direct use of grid APIs!
• Needs
– License server and grid deployment model
– Transparent access to data on the grid

Grid management: structure

• Operations Coordination Centre (OCC)
– management, oversight of all operational and support activities
• Regional Operations Centres (ROC)
– providing the core of the support infrastructure, each supporting a number of resource centres within its region
– Grid Operator on Duty
• Resource centres
– providing resources (computing, storage, network, etc.)
• Grid User Support (GGUS)
– At FZK, coordination and management of user support, single point of contact for users

Grid Monitoring

• Goal:
– Proactively monitor operational state & performance of the grid
– Trigger corrective actions at sites, ROCs, service managers
• Many tools used:
– Distributed responsibility for tools maintenance and operation
– Operator portal, Info sys monitor, SFT/SAM, job monitors, etc.
• Site Functional Tests (SFT) / Site Availability Monitor (SAM)
– Framework to sample/test services at sites and publish results
– Can include ad-hoc tests (e.g. VO-specific) in the framework or externally
– Allows dynamic look-up by VO of sites that are currently OK for them
– SAM: extends the concept to measure service availability
– Web service access to the data
– Intend to use this to generate trouble tickets and alarms
• Primary tools of the operator on duty are
– Information system monitoring and SFT/SAM
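The two SFT/SAM ideas above, per-VO look-up of sites that are currently OK and a measure of availability, can be illustrated with a small sketch. This is not the SFT/SAM data model or API; the record layout, function names, site names and test names below are all assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ProbeResult:
    """One published test sample (hypothetical record layout)."""
    site: str
    vo: str        # VO on whose behalf the test ran
    test: str      # illustrative test name, e.g. "job-submit"
    passed: bool


def sites_ok_for_vo(results, vo):
    """Sites where every recorded test for this VO passed."""
    status = {}
    for r in results:
        if r.vo == vo:
            status[r.site] = status.get(r.site, True) and r.passed
    return sorted(site for site, ok in status.items() if ok)


def availability(results, site):
    """Fraction of recorded samples at a site that passed (None if none)."""
    samples = [r.passed for r in results if r.site == site]
    return sum(samples) / len(samples) if samples else None


if __name__ == "__main__":
    results = [
        ProbeResult("site-a", "atlas", "job-submit", True),
        ProbeResult("site-a", "atlas", "replica-copy", True),
        ProbeResult("site-b", "atlas", "job-submit", False),
        ProbeResult("site-b", "biomed", "job-submit", True),
    ]
    print(sites_ok_for_vo(results, "atlas"))
    print(availability(results, "site-b"))
```

Treating a site as OK only when all of a VO's tests pass is one possible policy; a production framework would more likely distinguish critical from non-critical tests and use the latest sample per test.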

Site metrics - availability

Support - GGUS

The EGEE Network Operations Centre

• Creating a “Network Support unit” in the EGEE operational model
• Tasks:
– Receive tickets from NRENs, and forward to GGUS if impact on grid
– Receive tickets from GGUS if a network issue
– Troubleshoot & follow up with sites or NRENs

[Figure: ticket flow between Users, GGUS, Support Units, ENOC, NRENs, GÉANT2 and the EGEE Network]
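The ENOC tasks listed on this slide amount to a simple routing rule for tickets. The sketch below expresses that rule in code as an illustration only: the `Ticket` fields and the outcome strings are assumptions, and the two fallback outcomes (for tickets that do not affect the grid or are not network issues) are not described on the slide.

```python
from dataclasses import dataclass


@dataclass
class Ticket:
    source: str          # "NREN" or "GGUS"
    network_issue: bool  # is the problem network-related?
    affects_grid: bool   # does it impact grid operations?


def route(ticket: Ticket) -> str:
    """Return the next step for a ticket, following the ENOC tasks."""
    if ticket.source == "NREN":
        # Tickets from NRENs go on to GGUS only if the grid is affected.
        return "forward to GGUS" if ticket.affects_grid else "no grid action"
    if ticket.source == "GGUS":
        # GGUS hands network issues to the ENOC for troubleshooting.
        if ticket.network_issue:
            return "ENOC troubleshoots with site or NREN"
        return "stays with other support units"
    raise ValueError(f"unknown ticket source: {ticket.source!r}")


if __name__ == "__main__":
    print(route(Ticket("NREN", network_issue=True, affects_grid=True)))
    print(route(Ticket("GGUS", network_issue=True, affects_grid=True)))
```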

Interoperation

• Interoperability and interoperation (or co-operation)
• EGEE has interoperability activities with: (enabling the middlewares to work together)
– Open Science Grid (U.S.) – quite far advanced
– NorduGrid (ARC) – task in EGEE-II, 4 workshops and ongoing activity
– UNICORE – task in EGEE-II
– NAREGI (Japan) – 1 workshop, continued activity
– GIN (OGF) – active in several areas
• EGEE has interoperation activities with: (enabling the infrastructures to co-operate)
– Open Science Grid – actually in use
– Anticipated with NorduGrid (NDGF) for WLCG

Interoperating information systems

[Figure: interoperating information systems of EGEE, OSG, NAREGI, TeraGrid, PRAGMA and NorduGrid]

Related infrastructure projects

[Figure labels: DEISA, TeraGrid]

Coordination in SA1 for:
• EELA, BalticGrid, EUMedGrid, EUChinaGrid, SEE-GRID

Interoperation with:
• OSG, NAREGI

SA3:
• DEISA, ARC, NAREGI

Sustainability: Beyond EGEE-II

• Need to prepare for permanent Grid infrastructure
– Maintain Europe’s leading position in global science Grids
– Ensure a reliable and adaptive support for all sciences
– Independent of short project funding cycles
– Modelled on success of GÉANT
– Infrastructure managed in collaboration with national grid initiatives

Summary of status

• Today we have an operating production infrastructure
– Probably the largest in the world, supporting many science domains
– Relied upon by several as their primary source of computing
• We have a managed operations process addressing most areas
– Constantly evolving
• Inter/Co-operation is a fact and is becoming more important very quickly
– Several applications need to work across grids, and they need support for that
• A large fraction of the value of the operations activity is in the intangibles: processes, structures, expertise, etc.
• We recognise that there are many outstanding problems with the current state of things: reliability and robustness are the focus for the next year