
Page 1: Grid Computing  and Grid Site Infrastructure

Grid Computing and Grid Site Infrastructure

David Groep, NIKHEF/PDP and UvA

Page 2: Grid Computing  and Grid Site Infrastructure

2006-03-06 Grid Computing and Grid Site Infrastructure, UvA SNB 2

Outline

• The challenges: LHC, Remote sensing, medicine, …

• Distributed Computing and the Grid
  – Grid model and community formation, security
  – Grid software: computing, storage, resource brokers, information system

• Grid Clusters & Storage: the Amsterdam e-Science Facility
  – Computing: installing and operating large clusters
  – Data management: getting throughput to disk and tape

• Monitoring and site functional tests

• Does it work?

Page 3: Grid Computing  and Grid Site Infrastructure

2006-03-06 Grid Computing and Grid Site Infrastructure, UvA SNB 3

The LHC Computing Challenge

Physics @ CERN
• LHC particle accelerator
• operational in 2007
• 4 experiments
• ~10 Petabyte per year (= 10 000 000 GByte)
• 150 countries
• > 10 000 users
• lifetime ~20 years

Trigger and data acquisition chain: 40 MHz (40 TB/sec) -> level 1 (special hardware) -> 75 kHz (75 GB/sec) -> level 2 (embedded) -> 5 kHz (5 GB/sec) -> level 3 (PCs) -> 100 Hz (100 MB/sec) -> data recording & offline analysis

http://www.cern.ch/

Page 4: Grid Computing  and Grid Site Infrastructure

2006-03-06 Grid Computing and Grid Site Infrastructure, UvA SNB 4

LHC Physics Data Processing

Source: ATLAS introduction movie, NIKHEF and CERN, 2005

Page 5: Grid Computing  and Grid Site Infrastructure

2006-03-06 Grid Computing and Grid Site Infrastructure, UvA SNB 5

Tiered data distribution model

A 10 PB data distribution issue:
• 15 major centres in the world
• ~200 institutions
• ~10 000 people

And, for 1 experiment, a processing issue:
• 1 ‘event’ takes ~90 s to process
• there are 100 events/s
• need: 9000 CPUs (today)
• but also: reprocessing, simulation, &c: 2-5x needed in total (see the sketch below)
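The CPU figure above follows directly from the event rate and the per-event processing time; a minimal sketch of that arithmetic (the 2-5x factor for reprocessing and simulation is the rule of thumb quoted above):

# Back-of-the-envelope CPU sizing for one experiment, using the figures
# quoted on this slide (illustrative numbers, not an official estimate).
event_rate_hz = 100        # events arriving per second
seconds_per_event = 90     # processing time of one event on one CPU

cpus_for_prompt_processing = event_rate_hz * seconds_per_event
print(cpus_for_prompt_processing)             # 9000 CPUs (today)

# reprocessing, simulation, etc. add a factor of roughly 2-5x in total
for factor in (2, 5):
    print(factor, "x ->", factor * cpus_for_prompt_processing, "CPUs")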

Page 6: Grid Computing  and Grid Site Infrastructure

2006-03-06 Grid Computing and Grid Site Infrastructure, UvA SNB 6

T0-T1 Data Flow to Amsterdam

Assumptions: Amsterdam takes 13% of the data, 24 hours/day, no inefficiencies, year 2008.

Flow from the Tier-0 disk buffer into the Amsterdam disk and tape storage:
• RAW: 1.6 GB/file, 0.026 Hz, 2246 files/day, 41.6 MB/s, 3.4 TB/day
• ESD1: 500 MB/file, 0.052 Hz, 4500 files/day, 26 MB/s, 2.2 TB/day
• AODm1: 500 MB/file, 0.04 Hz, 3456 files/day, 20 MB/s, 1.6 TB/day

Total T0-T1 bandwidth (RAW + ESD1 + AODm1): 10 000 files/day, 88 MByte/s, 700 Mbit/s
Total to tape (RAW): 41.6 MB/s, 2-3 drives, 18 tapes/day
Total to disk (ESD1 + AODm1): 46 MB/s, 4 TB/day, 8000 files/day

That is 7.2 TByte/day to every Tier-1. (See the consistency check below.)

ATLAS data flows (draft). Source: Kors Bos, NIKHEF
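The per-dataset rates above are simply file size times file frequency; a small Python sketch that reproduces them as a consistency check (figures taken from the table above; small differences in the TB/day numbers come from rounding and decimal versus binary units):

# Reproduce the T0-T1 rates for Amsterdam from file size and file frequency.
datasets = {
    # name: (file size in bytes, files per second)
    "RAW":   (1.6e9, 0.026),
    "ESD1":  (500e6, 0.052),
    "AODm1": (500e6, 0.040),
}

total_mb_s = 0.0
for name, (size_bytes, rate_hz) in datasets.items():
    mb_per_s = size_bytes * rate_hz / 1e6
    files_per_day = rate_hz * 86400
    tb_per_day = size_bytes * rate_hz * 86400 / 1e12
    total_mb_s += mb_per_s
    print(f"{name}: {files_per_day:.0f} files/day, "
          f"{mb_per_s:.1f} MB/s, {tb_per_day:.1f} TB/day")

print(f"Total: {total_mb_s:.0f} MByte/s, about {total_mb_s * 8:.0f} Mbit/s")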

Page 7: Grid Computing  and Grid Site Infrastructure

2006-03-06 Grid Computing and Grid Site Infrastructure, UvA SNB 7

Other similar applications: Earth Obs

Source: Wim Som de Cerff, KNMI, De Bilt

Page 8: Grid Computing  and Grid Site Infrastructure

2006-03-06 Grid Computing and Grid Site Infrastructure, UvA SNB 8

WISDOM Malaria drug discovery

Anti-malaria drug discovery:
– ligand docking against malaria parasite target proteins, in silico
– 60 000 jobs, taking over 100 CPU years in total
– using 3 000 CPUs, completed in less than 2 months

Resources used:
– 47 sites
– 15 countries
– 3 000 CPUs
– 12 TByte of disk

Page 9: Grid Computing  and Grid Site Infrastructure

2006-03-06 Grid Computing and Grid Site Infrastructure, UvA SNB 9

Infrastructure to cope with this

Source: EGEE SA1, Grid Operations Centre, RAL, UK, dated March 2006

Page 10: Grid Computing  and Grid Site Infrastructure

The Grid and Community Building

The Grid Concept

Protocols and standards

Authentication and Authorization

Page 11: Grid Computing  and Grid Site Infrastructure

2006-03-06 Grid Computing and Grid Site Infrastructure, UvA SNB 11

Three essential ingredients for Grid

‘Access computing like the electrical power grid’

A grid combines resources that
– are not managed by a single organization,
– use a common, open protocol … that is general purpose,
– provide additional qualities of service, i.e., are usable as a collective and transparent resource.

Based on: Ian Foster, GridToday, November 2003

Page 12: Grid Computing  and Grid Site Infrastructure

2006-03-06 Grid Computing and Grid Site Infrastructure, UvA SNB 12

Grid AuthN/AuthZ and VOs

• Access to shared services
  – cross-domain authentication, authorization, accounting, billing
  – common protocols for collective services

• Support multi-user collaborations in “Virtual Organisations”:
  – a set of individuals or organisations,
  – not under single hierarchical control,
  – temporarily joining forces to solve a particular problem at hand,
  – bringing to the collaboration a subset of their resources,
  – sharing those under their own conditions
  – the user’s home organization may or may not know about their activities

• Need to enable ‘easy’ single sign-on
  – a user is typically involved in many different VOs

• Leave the resource owner always in control

Page 13: Grid Computing  and Grid Site Infrastructure

2006-03-06 Grid Computing and Grid Site Infrastructure, UvA SNB 13

Virtual vs. Organic structure

(Diagram: Organizations A and B each have their own people (faculty, staff, students) and resources (compute servers C1, C2 and C3, file server F1 with disks A and B). Virtual Community C cuts across both organizations: Person A acts as Principal Investigator, Person B as Administrator, Persons D and E as Researchers, and the community is given access to compute server C1' and to disk A of file server F1. Graphic: GGF OGSA Working Group)

Page 14: Grid Computing  and Grid Site Infrastructure

2006-03-06 Grid Computing and Grid Site Infrastructure, UvA SNB 14

More characteristic VO structure

Page 15: Grid Computing  and Grid Site Infrastructure

2006-03-06 Grid Computing and Grid Site Infrastructure, UvA SNB 15

Trust in Grid Security

‘Security’: Authentication and Authorization

• There is no a priori trust relationship between members or member organisations!
  – VO lifetime can vary from hours to decades
  – VO not necessarily persistent (both long- and short-lived)
  – people and resources are members of many VOs

• … but a relationship is required
  – as a basis for authorising access
  – for traceability and liability, incident handling, and accounting

Page 16: Grid Computing  and Grid Site Infrastructure

2006-03-06 Grid Computing and Grid Site Infrastructure, UvA SNB 16

Authentication vs. Authorization

• Single authentication token (“passport”)
  – issued by a party trusted by all,
  – recognised by many resource providers, users, and VOs
  – satisfies the traceability and persistency requirements
  – in itself does not grant any access, but provides a unique binding between an identifier and the subject

• Per-VO authorisations (“visa”)
  – granted to a person/service via a virtual organisation
  – based on the ‘passport’ name
  – embedded in the single-sign-on token (proxy)
  – acknowledged by the resource owners
  – providers can obtain lists of authorised users per VO, but can still ban individual users (see the sketch below)
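A hypothetical sketch of the resulting authorization decision at a resource: the VO vouches for its members (the “visa”), the certificate DN is the “passport”, and the site keeps its own ban list so the resource owner stays in control. All names below are made up for illustration and are not part of any real middleware:

# Hypothetical site-side authorization check in the "passport + visa" model.
vo_members = {          # lists of authorised member DNs published per VO (the "visa")
    "atlas": {"/O=example/O=users/CN=Jane Doe",
              "/O=example/O=users/CN=John Smith"},
}
site_banned = {"/O=example/O=users/CN=John Smith"}     # site-local ban list

def authorize(user_dn, vo):
    # grant access only if the VO vouches for the DN and the site has not banned it
    return user_dn in vo_members.get(vo, set()) and user_dn not in site_banned

print(authorize("/O=example/O=users/CN=Jane Doe", "atlas"))    # True
print(authorize("/O=example/O=users/CN=John Smith", "atlas"))  # False: banned by the site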

Page 17: Grid Computing  and Grid Site Infrastructure

2006-03-06 Grid Computing and Grid Site Infrastructure, UvA SNB 17

Federated PKI for authentication

• A federation of many independent CAs
  – common minimum requirements
  – trust domain as required by users and relying parties
  – well-defined and peer-reviewed acceptance process

• User has a single identity
  – from a local CA close by
  – works across VOs, with single sign-on via ‘proxies’
  – certificate itself also usable outside the grid

(Diagram: CAs 1…n, bound by a common charter, guidelines and acceptance process, trusted by relying parties 1…n.)

International Grid Trust Federation and EUGridPMA, see http://www.eugridpma.org/

Page 18: Grid Computing  and Grid Site Infrastructure

2006-03-06 Grid Computing and Grid Site Infrastructure, UvA SNB 18

Authorization Attributes (VOMS)

• VO-managed attributes embedded in the proxy
• attributes signed by the VO
• proxy signed by the user
• user certificate signed by the CA

(Diagram: the user authenticates to the VOMS server, which queries its authorization database and returns a signed VOMS pseudo-cert; this pseudo-cert is embedded in the proxy, e.g. for /C=IT/O=INFN/L=CNAF/CN=Pinco Palla/CN=proxy.)

Page 19: Grid Computing  and Grid Site Infrastructure

Grid Services

Logical Elements in a Grid

Information System

Page 20: Grid Computing  and Grid Site Infrastructure

2006-03-06 Grid Computing and Grid Site Infrastructure, UvA SNB 20

Services in a Grid

• Computing Element: “front-end service for a (set of) computer(s)”
  – cluster computing: typically Linux with IP interconnect
  – capability computing: typically shared-memory supercomputers
  – a ‘head node’ batches or forwards requests to the cluster

• Storage Element: “front-end service for disk or tape”
  – disk and tape based
  – varying retention time, QoS, uniformity of performance
  – ‘ought’ to be ACL aware: mapping of grid authorization to POSIX ACLs

• File Catalogue …

• Information System …
  – directory-based for static information
  – monitoring and bookkeeping for real-time information

• Resource Broker …
  – matching user job requirements to offers in the information system
  – WMS allows disconnected operation of the user interface

Page 21: Grid Computing  and Grid Site Infrastructure

2006-03-06 Grid Computing and Grid Site Infrastructure, UvA SNB 21

Typical Grid Topology

Page 22: Grid Computing  and Grid Site Infrastructure

2006-03-06 Grid Computing and Grid Site Infrastructure, UvA SNB 22

Job Description Language

An example of the JDL that a user might send to the Resource Broker:

Executable = "catfiles.sh";
StdOutput = "catted.out";
StdError = "std.err";
Arguments = "EssentialJobData.txt LogicalJobs.jdl /etc/motd";

InputSandbox = {"/home/davidg/tmp/jobs/LogicalJobs.jdl", "/home/davidg/tmp/jobs/catfiles.sh"};
OutputSandbox = {"catted.out", "std.err"};

InputData = "LF:EssentialJobData.txt";
ReplicaCatalog = "ldap://rls.edg.org/lc=WPSIX,dc=cnrs,dc=fr";
DataAccessProtocol = "gsiftp";

RetryCount = 2;

Page 23: Grid Computing  and Grid Site Infrastructure

2006-03-06 Grid Computing and Grid Site Infrastructure, UvA SNB 23

How do you see the Grid?

The broker matches the user’s request with the sites:
• ‘information supermarket’ matchmaking (using Condor Matchmaking)
• uses the information published by the site (see the toy example below)

The Grid information system is ‘the only information a user ever gets about a site’:
• so it should be reliable, consistent and complete
• a standard schema (GLUE) describes sites, queues and storage (complex schema semantics)
• currently presented as an LDAP directory

LDAP Browser Jarek Gawor: www.mcs.anl.gov/~gawor/ldap
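A toy sketch of the matchmaking step, assuming a handful of sites publishing a few GLUE-style attributes; the real broker uses Condor Matchmaking over the full published schema, this only illustrates the filter-and-rank idea (all site names are invented):

# Toy resource broker: keep the sites that publish the required runtime
# environment, then rank the survivors on the estimated response time.
sites = [
    {"ce": "ce.site-a.example.org", "rte": {"LCG-2_6_0"},
     "estimated_response_time": 12},
    {"ce": "ce.site-b.example.org", "rte": {"LCG-2_6_0", "VO-atlas-release-10.0.4"},
     "estimated_response_time": 519},
    {"ce": "ce.site-c.example.org", "rte": {"LCG-2_6_0", "VO-atlas-release-10.0.4"},
     "estimated_response_time": 60},
]

required_rte = {"VO-atlas-release-10.0.4"}

candidates = [s for s in sites if required_rte <= s["rte"]]
best = min(candidates, key=lambda s: s["estimated_response_time"])
print(best["ce"])   # ce.site-c.example.org: matches the requirement and responds fastest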

Page 24: Grid Computing  and Grid Site Infrastructure

2006-03-06 Grid Computing and Grid Site Infrastructure, UvA SNB 24

Glue Attributes Set by the Site

• Site information– SiteSysAdminContact: mailto: [email protected]– SiteSecurityContact: mailto: [email protected]

• Cluster info, e.g. GlueSubClusterUniqueID=gridgate.cs.tcd.ie:
  HostApplicationSoftwareRunTimeEnvironment: LCG-2_6_0
  HostApplicationSoftwareRunTimeEnvironment: VO-atlas-release-10.0.4
  HostBenchmarkSI00: 1300
  GlueHostNetworkAdapterInboundIP: FALSE
  GlueHostNetworkAdapterOutboundIP: TRUE
  GlueHostOperatingSystemName: RHEL
  GlueHostOperatingSystemRelease: 3.5
  GlueHostOperatingSystemVersion: 3
  GlueCEStateEstimatedResponseTime: 519
  GlueCEStateRunningJobs: 175
  GlueCEStateTotalJobs: 248

• Storage: similar info (paths, max number of files, quota, retention, …)
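Because the information system is published as an LDAP directory, it can be inspected with any LDAP client; a sketch using the third-party ldap3 Python library against a top-level information index (the hostname and the exact base DN are assumptions; 2170 is the customary BDII port):

# Query a grid information system over LDAP for computing element attributes.
from ldap3 import Server, Connection, ALL

server = Server("bdii.example.org", port=2170, get_info=ALL)   # placeholder host
conn = Connection(server, auto_bind=True)                      # anonymous bind

conn.search(
    search_base="o=grid",                    # assumed top of the GLUE tree
    search_filter="(objectClass=GlueCE)",
    attributes=["GlueCEUniqueID", "GlueCEStateRunningJobs",
                "GlueCEStateEstimatedResponseTime"],
)
for entry in conn.entries:
    print(entry.GlueCEUniqueID, entry.GlueCEStateRunningJobs,
          entry.GlueCEStateEstimatedResponseTime)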

Page 25: Grid Computing  and Grid Site Infrastructure

2006-03-06 Grid Computing and Grid Site Infrastructure, UvA SNB 25

Information system and brokering issues

• The size of the information system scales with #sites and #details
  – already 12 MByte of LDIF
  – matching a job takes ~15 sec

• Scheduling policies are infinitely complex
  – no static schema can likely express this information

• Much information (still) needs to be set up manually … the next slides show the situation as of Feb 3, 2006

The info system is the single most important grid service.

• The current broker tries to make an optimal decision … instead of a ‘reasonable’ one

Page 26: Grid Computing  and Grid Site Infrastructure

2006-03-06 Grid Computing and Grid Site Infrastructure, UvA SNB 26

Example: GlueServiceAccessControlRule

For your viewing pleasure: GlueServiceAccessControlRule
261 distinct values seen for GlueServiceAccessControlRule

(one of) least frequently occurring value(s): 1 instance(s) of
  GlueServiceAccessControlRule: /C=BE/O=BEGRID/OU=VUB/OU=IIHE/CN=Stijn De Weirdt

(one of) most frequently occurring value(s): 310 instance(s) of
  GlueServiceAccessControlRule: dteam

(one of) shortest value(s) seen:
  GlueServiceAccessControlRule: d0

(one of) longest value(s) seen:
  GlueServiceAccessControlRule: anaconda-ks.cfg configure-firewall install.log install.log.syslog j2sdk-1_4_2_08-linux-i586.rpm lcg-yaim-latest.rpm myproxy-addons myproxy-addons.051021 site-info.def site-info.def.050922 site-info.def.050928 site-info.def.051021 yumit-client-2.0.2-1.noarch.rpm

Page 27: Grid Computing  and Grid Site Infrastructure

2006-03-06 Grid Computing and Grid Site Infrastructure, UvA SNB 27

Example: GlueSEControlProtocolType

For your viewing pleasure: GlueSEControlProtocolType

freq  value
   1  GlueSEControlProtocolType: srm
   1  GlueSEControlProtocolType: srm_v1
   1  GlueSEControlProtocolType: srmv1
   3  GlueSEControlProtocolType: SRM
   7  GlueSEControlProtocolType: classic

… which means that of ~410 Storage Elements, only 13 publish interaction info. Ouch!

Page 28: Grid Computing  and Grid Site Infrastructure

2006-03-06 Grid Computing and Grid Site Infrastructure, UvA SNB 28

Example: GlueHostOperatingSystemRelease

Today's attribute: GlueHostOperatingSystemRelease
    1  GlueHostOperatingSystemRelease: 3.02
    1  GlueHostOperatingSystemRelease: 3.03
    1  GlueHostOperatingSystemRelease: 3.2
    1  GlueHostOperatingSystemRelease: 3.5
    1  GlueHostOperatingSystemRelease: 303
    1  GlueHostOperatingSystemRelease: 304
    1  GlueHostOperatingSystemRelease: 3_0_4
    1  GlueHostOperatingSystemRelease: SL
    1  GlueHostOperatingSystemRelease: Sarge
    1  GlueHostOperatingSystemRelease: sl3
    2  GlueHostOperatingSystemRelease: 3.0
    2  GlueHostOperatingSystemRelease: 305
    4  GlueHostOperatingSystemRelease: 3.05
    4  GlueHostOperatingSystemRelease: SLC3
    5  GlueHostOperatingSystemRelease: 3.04
    5  GlueHostOperatingSystemRelease: SL3
   18  GlueHostOperatingSystemRelease: 3.0.3
   19  GlueHostOperatingSystemRelease: 7.3
   24  GlueHostOperatingSystemRelease: 3
   37  GlueHostOperatingSystemRelease: 3.0.5
   47  GlueHostOperatingSystemRelease: 3.0.4
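Frequency counts like these are easy to produce from an LDIF dump of the information system, which is exactly how such inconsistencies show up; a minimal sketch (the dump file name is an assumption, and line-wrapped LDIF attributes are ignored for simplicity):

# Count the distinct values of one attribute in an LDIF dump of the info system.
from collections import Counter

attribute = "GlueHostOperatingSystemRelease"
counts = Counter()

with open("bdii-dump.ldif") as ldif:          # assumed dump, one attribute per line
    for line in ldif:
        if line.startswith(attribute + ":"):
            counts[line.split(":", 1)[1].strip()] += 1

for value, freq in sorted(counts.items(), key=lambda kv: kv[1]):
    print(f"{freq:5d}  {attribute}: {value}")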

Page 29: Grid Computing  and Grid Site Infrastructure

2006-03-06 Grid Computing and Grid Site Infrastructure, UvA SNB 29

Example: GlueSAPolicyMaxNumFiles

136 separate Glue attributes seen.

For your viewing pleasure: GlueSAPolicyMaxNumFiles

freq  value
    6  GlueSAPolicyMaxNumFiles: 99999999999999
   26  GlueSAPolicyMaxNumFiles: 999999
   52  GlueSAPolicyMaxNumFiles: 0
   78  GlueSAPolicyMaxNumFiles: 00
 1381  GlueSAPolicyMaxNumFiles: 10

And GlueServiceStatusInfo:

freq  value
    2  GlueServiceStatusInfo: No Known Problems.
   55  GlueServiceStatusInfo: No problems
  206  GlueServiceStatusInfo: No Problems

Page 30: Grid Computing  and Grid Site Infrastructure

2006-03-06 Grid Computing and Grid Site Infrastructure, UvA SNB 30

LCG’s Most Popular Resource Centre

Page 31: Grid Computing  and Grid Site Infrastructure

2006-03-06 Grid Computing and Grid Site Infrastructure, UvA SNB 31

Example: SiteLatitude

Today's attribute: GlueSiteLatitude
    1  GlueSiteLatitude: 1.376059
    1  GlueSiteLatitude: 33.063924198120645
    1  GlueSiteLatitude: 37.0
    1  GlueSiteLatitude: 38.739925290125484
    1  GlueSiteLatitude: 39.21
    …
    1  GlueSiteLatitude: 45.4567
    1  GlueSiteLatitude: 55.9214118
    1  GlueSiteLatitude: 56.44
    1  GlueSiteLatitude: 59.56
    1  GlueSiteLatitude: 67
    1  GlueSiteLatitude: GlueSiteWeb: http://rsgrid3.its.uiowa.edu
    2  GlueSiteLatitude: 40.8527
    2  GlueSiteLatitude: 48.7
    2  GlueSiteLatitude: 49.16
    2  GlueSiteLatitude: 50
    3  GlueSiteLatitude: 41.7827
    3  GlueSiteLatitude: 46.12
    8  GlueSiteLatitude: 0.0

Page 32: Grid Computing  and Grid Site Infrastructure

The Amsterdam e-Science Facilities

Building and Running a Grid Resource Centre

Compute Clusters

Storage and Disk Pools

Page 33: Grid Computing  and Grid Site Infrastructure

2006-03-06 Grid Computing and Grid Site Infrastructure, UvA SNB 33

Grid Resources in Amsterdam

• 2x 1.2 PByte in 2 robots

• 36+512 CPUs IA32

• disk caches 10 + 50 TByte

• multiple 10 Gbit/s links

240 CPUs IA32

7 TByte disk cache

10 + 1 Gbit link SURFnet

2 Gbit/s to SARA

only resources with either GridFTP or Grid job management

BIG GRID Approved January 2006!

Investment of € 29M in next 4 years

For: LCG, LOFAR, Life Sciences,

Medical, DANS, Philips Research, …

See http://www.biggrid.nl/

Page 34: Grid Computing  and Grid Site Infrastructure

Computing

Cluster topology

Connectivity

System services setup

Page 35: Grid Computing  and Grid Site Infrastructure

2006-03-06 Grid Computing and Grid Site Infrastructure, UvA SNB 35

Grid Site Logical Layout

Page 36: Grid Computing  and Grid Site Infrastructure

2006-03-06 Grid Computing and Grid Site Infrastructure, UvA SNB 36

Batch Systems and Schedulers

• Batch system keeps a list of nodes and jobs
• Scheduler matches jobs to nodes based on policies (see the toy example below)
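In essence the scheduler repeatedly walks the queue and places each job on a node that satisfies its requirements; a toy sketch of that loop (real schedulers such as MAUI add priorities, fair-share, reservations and backfill):

# Toy scheduler pass: match queued jobs to nodes with enough free CPU slots.
nodes = {"node01": 2, "node02": 4}                    # free CPU slots per node
queue = [("job-1", 1), ("job-2", 4), ("job-3", 2)]    # (job id, CPUs requested)

for job, need in queue:
    for node, free in nodes.items():
        if free >= need:
            nodes[node] = free - need
            print(f"{job}: scheduled on {node}")
            break
    else:
        print(f"{job}: stays queued (no node with {need} free CPUs)")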

Page 37: Grid Computing  and Grid Site Infrastructure

2006-03-06 Grid Computing and Grid Site Infrastructure, UvA SNB 37

NDPF Logical Composition

Page 38: Grid Computing  and Grid Site Infrastructure

2006-03-06 Grid Computing and Grid Site Infrastructure, UvA SNB 38

NDPF Network Topology

Page 39: Grid Computing  and Grid Site Infrastructure

Quattor and ELFms

System Installation

Page 40: Grid Computing  and Grid Site Infrastructure

2006-03-06 Grid Computing and Grid Site Infrastructure, UvA SNB 40

Installation

Installing and managing a large cluster requires a system that

• scales to O(10 000) nodes, with
  1. a wide variety of configurations (‘service nodes’)
  2. and also many instances of identical systems (‘worker nodes’)
• lets you validate new configurations before it’s too late
• can rapidly recover from node failures by commissioning a new box (i.e. in minutes)

Popular systems include
• Quattor (from ELFms)
• xCAT
• OSCAR
• SystemImager & cfEngine

Page 41: Grid Computing  and Grid Site Infrastructure

2006-03-06 Grid Computing and Grid Site Infrastructure, UvA SNB 41

Quattor

ELFms stands for ‘Extremely Large Fabric management system’. Subsystems:
• quattor: configuration, installation and management of nodes
• LEMON: system / service monitoring
• LEAF: hardware / state management

• ELFms manages and controls most of the nodes in the CERN CC
  – ~2100 nodes out of ~2400
  – multiple functionalities and cluster sizes (batch nodes, disk servers, tape servers, DB, web, …)
  – heterogeneous hardware (CPU, memory, HD size, …)
  – supported OS: Linux (RH7, RHES2.1, RHES3) and Solaris (9)

Source: German Cancio, CERN IT, see http://www.quattor.org/

Developed within the EU DataGrid Project (http://www.edg.org/), development and maintenance now coordinated by CERN/IT

Page 42: Grid Computing  and Grid Site Infrastructure

2006-03-06 Grid Computing and Grid Site Infrastructure, UvA SNB 42

System Installation

• The central configuration drives:
  – the Automated Installation Infrastructure (AII): PXE and RedHat KickStart or Solaris JumpStart
  – Software Package Management (SPMA): transactional management based on RPMs or PKGs
  – Node Configuration (NCM): autonomous agents running service configuration components for (re-)configuration

• CDB: the ‘desired’ state of the node
  – two-tiered configuration language (PAN and LLD XML)
  – self-validating, complete language (“swapspace = 2*physmem”)
  – template inheritance and composition (“tbn20 = base system + CE_software + pkg_add(emacs)”; see the sketch below)
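The point of a declarative, self-validating configuration language is that values are derived and composed rather than copied by hand; a rough Python analogue of the “swapspace = 2*physmem” and template-composition ideas (this is not Pan syntax, only the concept):

# Conceptual analogue of CDB template composition and derived values.
base_system = {"os": "RHEL3", "swap_factor": 2}
ce_software = {"packages": ["globus", "lcg-CE"]}

def compose(*templates, **overrides):
    # merge templates left to right; later templates and explicit overrides win
    node = {}
    for template in templates:
        node.update(template)
    node.update(overrides)
    return node

# "tbn20 = base system + CE_software + pkg_add(emacs)"
tbn20 = compose(base_system, ce_software, physmem_mb=2048)
tbn20["packages"] = tbn20["packages"] + ["emacs"]                   # pkg_add(emacs)
tbn20["swapspace_mb"] = tbn20["swap_factor"] * tbn20["physmem_mb"]  # swapspace = 2*physmem

print(tbn20["swapspace_mb"])      # 4096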

Page 43: Grid Computing  and Grid Site Infrastructure

2006-03-06 Grid Computing and Grid Site Infrastructure, UvA SNB 43

Configuration Database

(Architecture diagram: configuration templates are compiled by pan into the CDB; operators access the CDB through a GUI, scripts or a CLI over SOAP; each node fetches its compiled XML profile over HTTP into a local cache managed by the CCM, which feeds the node management agents; an RDBMS copy is accessible via SQL for LEAF, LEMON and others. Source: German Cancio, CERN IT)

Page 44: Grid Computing  and Grid Site Infrastructure

2006-03-06 Grid Computing and Grid Site Infrastructure, UvA SNB 44

(The same Configuration Database diagram, now with an example of the CERN CC template hierarchy held in the CDB:)

CERN CC: name_srv1: 137.138.16.5, time_srv1: ip-time-1
  lxbatch cluster: cluster/name: lxbatch, master: lxmaster01, pkg_add(lsf5.1)
  lxplus cluster: cluster/name: lxplus, pkg_add(lsf5.1)
    lxplus001: eth0/ip: 137.138.4.246, pkg_add(lsf6_beta)
    lxplus020: eth0/ip: 137.138.4.225
    lxplus029
  disk_srv cluster

Source: German Cancio, CERN IT

(Pages 45-48 step through the same Configuration Database diagram, highlighting in turn the pan compiler, the GUI/scripts/CLI access path over SOAP, the node-side CCM cache feeding the node management agents, and the RDBMS/SQL path used by LEAF, LEMON and others. Source: German Cancio, CERN IT)

Page 49: Grid Computing  and Grid Site Infrastructure

2006-03-06 Grid Computing and Grid Site Infrastructure, UvA SNB 49

Managing (cluster) nodes

(Diagram: node installation and management. An install server provides DHCP, PXE and the base OS over NFS/HTTP to the vendor system installer (RH73, RHES, Fedora, …); software servers (SWRep) serve packages (RPM, PKG) over NFS/HTTP/FTP; on each managed node the Install Manager performs the (re)install, the Node Configuration Manager (NCM) configures system services (AFS, LSF, SSH, accounting, …) and the Software Package Manager (SPMA) installs kernel, system and application software from its local package cache, all steered by the CDB through the CCM. Source: German Cancio, CERN IT)

Page 50: Grid Computing  and Grid Site Infrastructure

2006-03-06 Grid Computing and Grid Site Infrastructure, UvA SNB 50

Node Management Agents

• NCM (Node Configuration Manager): a framework in which service-specific plug-ins called Components make the necessary system changes to bring the node to its CDB desired state
  – regenerate local config files (e.g. /etc/ssh/sshd_config), restart/reload services (SysV scripts)
  – a large number of components is available (system and Grid services)

• SPMA (Software Package Management Agent) and SWRep: manage all or a subset of the packages on the nodes
  – full control on production nodes; on development nodes: non-intrusive, configurable management of system and security updates
  – a package manager, not only an upgrader (roll-back and transactions)

• Portability: generic framework with plug-ins for NCM and SPMA
  – available for RHL (RH7, RHES3) and Solaris 9

• Scalability to O(10K) nodes
  – automated replication for redundant / load-balanced CDB/SWRep servers
  – uses scalable protocols, e.g. HTTP, and replication/proxy/caching technology

Source: German Cancio, CERN IT

Page 51: Grid Computing  and Grid Site Infrastructure

Back in Amsterdam

User and directory management

Logging and auditing

Monitoring and fault detection

Page 52: Grid Computing  and Grid Site Infrastructure

2006-03-06 Grid Computing and Grid Site Infrastructure, UvA SNB 52

User directory and automount maps

A large number of alternatives exists (nsswitch.conf / pam.d):
• files-based (/etc/passwd, /etc/auto.home, …)
• YP/NIS, NIS+
• database (MySQL/Oracle)
• LDAP

We went with LDAP:
• information is in a central location (like NIS)
• can scale by adding slave servers (like NIS)
• is secured by running LDAP over TLS (unlike NIS)
• can be managed by external programs (also unlike NIS)

(in due course we will do real-time grid credential mapping to uids)

But you will need nscd, or a large number of slave servers.

Page 53: Grid Computing  and Grid Site Infrastructure

2006-03-06 Grid Computing and Grid Site Infrastructure, UvA SNB 53

Logging and Auditing

Auditing and logging• syslog (also for grid gatekeeper, gsiftp, credential mapping)• process accounting (psacct)

For the paranoid – use tools included for CAPP/EAL3+: LAuS• system call auditing• highly detailed:

useful both for debugging and incident response• default auditing is critical: system will halt on audit errors

If your worker nodes are on private IP space• need to preserve a log of the NAT box as well

Page 54: Grid Computing  and Grid Site Infrastructure

2006-03-06 Grid Computing and Grid Site Infrastructure, UvA SNB 54

Grid Cluster Logging

Grid statistics and accounting
• rrdtool views of the batch system load per VO
  – combine qstat and pbsnodes output via a script, cron and RRD (see the sketch below)

• cricket network traffic grapher

• extract PBS accounting data into a dedicated database
  – grid users get a ‘generic’ uid from a dynamic pool, so we need to link this uid in the database to the grid DN and VO

• from the accounting db, upload anonymized records to APEL
  – APEL is the grid accounting system for VOs and funding agencies
  – the accounting db is also useful to charge costs to projects locally
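A hedged sketch of the cron-driven script mentioned above: count running jobs per VO from Torque's qstat output and push the counts into a round-robin database with the python rrdtool binding. The qstat column layout, the pool-account naming and the RRD data-source layout are all site-specific assumptions:

# Collect per-VO running-job counts from the batch system and update an RRD.
import subprocess
import rrdtool                      # python binding for rrdtool

VO_PREFIXES = {"atlas": "atlas", "lhcb": "lhcb", "dteam": "dteam"}   # pool account prefix per VO

def running_jobs_per_vo():
    counts = dict.fromkeys(VO_PREFIXES, 0)
    out = subprocess.run(["qstat"], capture_output=True, text=True).stdout
    for line in out.splitlines():
        fields = line.split()
        # assumed default qstat layout: job id, name, user, time, state, queue
        if len(fields) >= 6 and fields[4] == "R":
            for vo, prefix in VO_PREFIXES.items():
                if fields[2].startswith(prefix):
                    counts[vo] += 1
    return counts

counts = running_jobs_per_vo()
# the RRD is assumed to have one data source per VO, in this fixed order
rrdtool.update("vo-occupancy.rrd",
               "N:" + ":".join(str(counts[vo]) for vo in ("atlas", "lhcb", "dteam")))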

Page 55: Grid Computing  and Grid Site Infrastructure

2006-03-06 Grid Computing and Grid Site Infrastructure, UvA SNB 55

NDPF Occupancy

Usage of the NIKHEF NDPF Compute farm

Average occupancy in 2005: ~ 78%

each colour represents a grid VO, black line is #CPUs available

Page 56: Grid Computing  and Grid Site Infrastructure

2006-03-06 Grid Computing and Grid Site Infrastructure, UvA SNB 56

But at times, in more detail

Auditing incident: a disk with less than 15% free space makes the syscall-audit system panic; new processes cannot write audit entries, which is fatal, so they wait, and wait, and … a head node has the most activity and fails first!

An unresponsive node causes the scheduler (MAUI) to wait for 15 minutes, then give up and start scheduling again, hitting the rotten node again, and …

The PBS server keeps trying desperately to contact a dead node whose CPU has turned into Norit, and is unable to serve any more requests.

Page 57: Grid Computing  and Grid Site Infrastructure

2006-03-06 Grid Computing and Grid Site Infrastructure, UvA SNB 57

Black Holes

A mis-configured worker node accepts jobs that all die within seconds. Before long, the entire job population will be sucked into this black hole…

Page 58: Grid Computing  and Grid Site Infrastructure

2006-03-06 Grid Computing and Grid Site Infrastructure, UvA SNB 58

Clusters: what did we see?

• the Grid (and your cluster) are error amplifiers
  – “black holes” may eat your jobs piecemeal
  – dangerous “default” values can spoil the day (“GlueERT: 0”)

• Monitor! (and allow for (some) failures, and design for rapid recovery)

• Users don’t have a clue about your system beforehand (that’s the downside of those ‘autonomous organizations’)

• If you want users to have a clue, you must publish your clues correctly (the information system is all they can see)

• Grid middleware may effectively do a DoS on your system
  – doing qstat for every job every minute, to feed the logging & bookkeeping …

• Power consumption is the greatest single limitation on CPU density

• And finally: keep your machine room tidy, and label everything … or your colleague will not be able to find that #$%^$*! machine in the middle of the night…

Page 59: Grid Computing  and Grid Site Infrastructure

Data Storage

Typical data flows

Matching storage to computing

Disk pools and their interfaces

Page 60: Grid Computing  and Grid Site Infrastructure

2006-03-06 Grid Computing and Grid Site Infrastructure, UvA SNB 60

Atlas Tier-1 data flows

(Diagram: ATLAS Tier-1 data flows for real data storage, reprocessing and distribution, connecting the Tier-0, the Tier-1 disk buffer and CPU farm, tape, disk storage, the other Tier-1s and the Tier-2s. The flows shown include:)

• RAW: 1.6 GB/file, 0.02 Hz, 1.7K files/day, 32 MB/s, 2.7 TB/day (from Tier-0 into the disk buffer and on to tape)
• ESD2: 0.5 GB/file, 0.02 Hz, 1.7K files/day, 10 MB/s, 0.8 TB/day
• AOD2: 10 MB/file, 0.2 Hz, 17K files/day, 2 MB/s, 0.16 TB/day
• AODm2: 500 MB/file, 0.004 Hz, 0.34K files/day, 2 MB/s, 0.16 TB/day
• RAW + ESD2 + AODm2 combined through the CPU farm: 0.044 Hz, 3.74K files/day, 44 MB/s, 3.66 TB/day
• AODm2 exchanged with the other Tier-1s: 500 MB/file, 0.036 Hz, 3.1K files/day, 18 MB/s, 1.44 TB/day
• ESD1: 0.5 GB/file, 0.02 Hz, 1.7K files/day, 10 MB/s, 0.8 TB/day
• AODm1: 500 MB/file, 0.04 Hz, 3.4K files/day, 20 MB/s, 1.6 TB/day
• AODm2 to the Tier-2s: 500 MB/file, 0.04 Hz, 3.4K files/day, 20 MB/s, 1.6 TB/day

Plus the simulation & analysis data flows.

ATLAS data flows (draft). Source: Kors Bos, NIKHEF

Page 61: Grid Computing  and Grid Site Infrastructure

2006-03-06 Grid Computing and Grid Site Infrastructure, UvA SNB 61

SC3 storage network (SARA)

Disk-to-disk 583 MByte/s, i.e. 4.6 Gbit/s, over the world

Graphic: Mark van de Sanden, SARA

Page 62: Grid Computing  and Grid Site Infrastructure

2006-03-06 Grid Computing and Grid Site Infrastructure, UvA SNB 62

Tier-1 Architecture SARA (storage)

Graphic: Mark van de Sanden, SARA

Page 63: Grid Computing  and Grid Site Infrastructure

2006-03-06 Grid Computing and Grid Site Infrastructure, UvA SNB 63

Matching Storage to Computing

Doing the math

• Simple job:
  – read a 1 MByte piece of a file (typically 1 “event”)
  – calculate on it for 30 seconds
  – do this for 2000 events per file (i.e. 2 GByte files)
  – on 1000 files (1 day of running) this takes 700 days
  – need a total of 2 TByte, i.e. 4 IDE disks of 500 GB

• On the Grid: spread out over 1000 CPUs
  – all jobs start at the same time, each retrieving a 2 GByte input
  – the machine holding this 2 TByte disk is on a 100 Mbit/s link
  – effective 10 MByte/s throughput
  – thus 10 kByte/s per machine
  – it takes 55 hours before the file transfers finish!
  – and after that, only 17 hours of calculation

(see the worked numbers below)
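The same arithmetic in a few lines, to make the bottleneck explicit (figures from the bullets above):

# Worked example: 1000 jobs each pulling a 2 GByte input file from one
# server with ~10 MByte/s effective throughput, then computing for 2000 x 30 s.
n_jobs = 1000
file_size_bytes = 2e9                    # 2 GByte input per job
effective_throughput = 10e6              # ~10 MByte/s out of the 100 Mbit/s server link

per_job_rate = effective_throughput / n_jobs          # ~10 kByte/s per job
transfer_hours = file_size_bytes / per_job_rate / 3600
compute_hours = 2000 * 30 / 3600                      # 2000 events of 30 s each

print(f"transfer: {transfer_hours:.1f} h, compute: {compute_hours:.1f} h")
# about 55 hours of transfer for only 17 hours of computation:
# the network, not the CPUs, is the bottleneck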

Page 64: Grid Computing  and Grid Site Infrastructure

2006-03-06 Grid Computing and Grid Site Infrastructure, UvA SNB 64

Storage

Just for ATLAS, one of the experiments:

• RAW & ESD data flow ~4 TByte/day (1.4 PB/y) to tape
  – expected to be a permanent “museum” copy
  – largely scheduled access (intelligent staging possible), read & write
  – disk buffers before the tape store can be smallish (~10%)

• ‘Chaotic’ access by real users: ~2-4 TByte/day throughput
  – lifetime of the data is finite but long (typically 2+ years)
  – access needed from the worker nodes, i.e., from O(1 000) CPUs
  – random “skimming” access pattern
  – need for disk server farms of typically 500 TByte – 1 PByte

• Management of disk resources
  – split the ‘file system view’ (file metadata) from the object store
  – dCache & dcap, DPNS & DPM, GPFS & ObjectStore, …

Page 65: Grid Computing  and Grid Site Infrastructure

2006-03-06 Grid Computing and Grid Site Infrastructure, UvA SNB 65

DPM Manageability

Example: the Disk Pool Manager (DPM)

• Elements in DPM
  – SRM: scheduling of requests via a standard interface
  – DPM server: management and translation of a filename to a transfer URL (disk node)
  – DPNS name service: ACLs, directory layout, ownership, size
  – object servers: effectuate the actual (GridFTP) transfers

• No central configuration files
  – disk nodes request to add themselves to the DPM
  – all state is kept in a DB (easy to restore after a crash)

• Easy to remove disks and partitions
  – allows simple reconfiguration of the disk pools
  – files are stored as whole chunks (safer than striping over O(100) servers!)
  – the administrator can temporarily remove file systems if a disk has crashed
  – the DPM automatically marks a file system as “unavailable” when it is not contactable

Page 66: Grid Computing  and Grid Site Infrastructure

2006-03-06 Grid Computing and Grid Site Infrastructure, UvA SNB 66

Example: SRM put processing (1)

Participants: client, SRM daemon, DPM daemon, DPNS daemon, DPM database, and the data servers running GridFTP daemons.

1a. the client sends an SRM Put to the SRM daemon
1b. the request is put into the request database
1c. the SRM request id is returned to the client

Page 67: Grid Computing  and Grid Site Infrastructure

2006-03-06 Grid Computing and Grid Site Infrastructure, UvA SNB 67

Example: SRM put processing (2)

2a. the DPM daemon gets the request from the database
2b. checks permissions and adds the file to the name server (DPNS)
2c. picks the best data server to put the data onto
2d. adds the TURL to the request database and marks the request ‘Ready’
2e. adds the file to the replica table and sets its status to ‘Pending’

Page 68: Grid Computing  and Grid Site Infrastructure

2006-03-06 Grid Computing and Grid Site Infrastructure, UvA SNB 68

Example: SRM put processing (3)

3a. the client calls SRM getRequestStatus
3b. the SRM daemon gets the TURL from the request database
3c. the TURL is returned to the client

Page 69: Grid Computing  and Grid Site Infrastructure

2006-03-06 Grid Computing and Grid Site Infrastructure, UvA SNB 69

Example: SRM put processing (4)

4a. the client issues SRM(v1) set ‘Running’
4b. the status of the request is updated in the database

Page 70: Grid Computing  and Grid Site Infrastructure

2006-03-06 Grid Computing and Grid Site Infrastructure, UvA SNB 70

Example: SRM put processing (5)

5. the client puts the file to the selected data server via GridFTP

Page 71: Grid Computing  and Grid Site Infrastructure

2006-03-06 Grid Computing and Grid Site Infrastructure, UvA SNB 71

Example: SRM put processing (6)

6a. the client issues SRM(v1) set Done
6b. the SRM daemon notifies the DPM daemon of the ‘Done’
6c. the file size is obtained
6d. the replica metadata (size/status/pintime) is updated
6e. the status of the request is updated in the database
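Seen from the client, the put sequence boils down to: submit the request, poll until the SRM returns a transfer URL, move the file with GridFTP, and mark the request done. A toy, self-contained simulation of that sequence (the class below is an illustrative stand-in, not a real SRM or DPM implementation):

# Toy simulation of the SRM(v1) put sequence shown on the previous slides.
import itertools

class ToySRM:
    # stands in for the SRM daemon, DPM daemon and request database together
    _ids = itertools.count(1)

    def __init__(self):
        self.requests = {}

    def put(self, surl):                          # steps 1a-1c: queue request, return id
        rid = next(self._ids)
        self.requests[rid] = {"surl": surl, "status": "Pending", "turl": None}
        # steps 2a-2e (normally asynchronous): pick a data server, register the replica
        self.requests[rid]["turl"] = "gsiftp://data01.example.org" + surl
        self.requests[rid]["status"] = "Ready"
        return rid

    def get_request_status(self, rid):            # steps 3a-3c: return status and TURL
        request = self.requests[rid]
        return request["status"], request["turl"]

    def set_status(self, rid, status):            # steps 4a-4b and 6a-6e
        self.requests[rid]["status"] = status

srm = ToySRM()
rid = srm.put("/atlas/raw/file001")
status, turl = srm.get_request_status(rid)
srm.set_status(rid, "Running")
print("transferring via GridFTP to", turl)        # step 5: done by the GridFTP client
srm.set_status(rid, "Done")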

Page 72: Grid Computing  and Grid Site Infrastructure

2006-03-06 Grid Computing and Grid Site Infrastructure, UvA SNB 72

Other disk storage solutions

• Cluster file systems: PVFS2, Lustre, GFS
  – built for parallel I/O performance
  – clusters are not expected to grow (or shrink)
  – can obtain better throughput on a single file within the cluster
  – node failure can be quite catastrophic

• Some (e.g. GFS) are limited by block device / file system limits
  – a 32-bit Linux 2.4 kernel: 2 TByte
  – ext2/3, 4 kByte blocks: max file size 2 TB, max fs size 16 TB
  – ReiserFS 3.6: max file size 1 EB, max fs size 16 TB ???!
  – XFS: max file size 8 EB, max fs size 8 EB

Page 73: Grid Computing  and Grid Site Infrastructure

Monitoring at the global scale

Monitoring

Site Functional Tests

Real-time throbbing job monitor

Page 74: Grid Computing  and Grid Site Infrastructure

2006-03-06 Grid Computing and Grid Site Infrastructure, UvA SNB 74

Success Rate

What’s the chance the whole grid is working correctly?

If a single site has 98.5% reliability (i.e. is down 5 days/year):
• with 200 sites, this gives you only about a 4% chance that the whole grid is working correctly (see the calculation below)
• and the 98.5% is quite optimistic to begin with …

So:
• build the grid, both middleware and user jobs, for failure
• monitor sites with both system and functional tests
• exclude sites with a current malfunction dynamically
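The ‘whole grid up’ probability is just the per-site reliability raised to the number of sites; a one-line calculation makes the point (98.5% is the slide's optimistic assumption):

# Probability that every one of N independent sites is up at the same time.
site_reliability = 0.985        # roughly 5 days of downtime per year
n_sites = 200

p_all_up = site_reliability ** n_sites
print(f"{p_all_up:.1%}")        # only a few percent, so plan for partial failure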

Page 75: Grid Computing  and Grid Site Infrastructure

2006-03-06 Grid Computing and Grid Site Infrastructure, UvA SNB 75

Monitoring Tools

1. GIIS Monitor
2. GIIS Monitor graphs
3. Site Functional Tests
4. GOC Data Base
5. Scheduled Downtimes
6. Live Job Monitor
7. GridIce – VO view
8. GridIce – fabric view
9. Certificate Lifetime Monitor

Source: Ian Bird, SA1 Operations Status, EGEE-4 Conference, Pisa, November 2005

Page 76: Grid Computing  and Grid Site Infrastructure

2006-03-06 Grid Computing and Grid Site Infrastructure, UvA SNB 76

Google Grid Map

Page 77: Grid Computing  and Grid Site Infrastructure

2006-03-06 Grid Computing and Grid Site Infrastructure, UvA SNB 77

Freedom of Choice

• Tool for VOs to make a site selection based on a set of standard tests

Page 78: Grid Computing  and Grid Site Infrastructure

2006-03-06 Grid Computing and Grid Site Infrastructure, UvA SNB 78

Success Rate: WISDOM

Average success rate for jobs: 70-80% (single submit)

(Chart: “Success rate (August)”, showing the daily number of jobs (0 to 5000) broken down into registered, success (final status), aborted (final status) and cancelled (final status), together with the success rate on a 0 to 1 scale, defined as success / (registered - cancelled).)

Source: N. Jacq, LPC and IN2P3/CNRS “Biomedical DC Preliminary Report WISDOM Application, 5 sept 2005

Page 79: Grid Computing  and Grid Site Infrastructure

2006-03-06 Grid Computing and Grid Site Infrastructure, UvA SNB 79

Failure reasons vary

Biomed data challenge abort reason distribution (10/07/2005 - 27/08/2005):
• 63%  mismatching resources
• 28%  wrong configuration
•  4%  network/connection failures
•  4%  proxy problems
•  1%  JDL problems

Underlying causes noted on the chart: a failing middleware component, a wrong request in the job JDL.

(A companion chart shows the abort reason distribution for all VOs, 01/2005 - 06/2005.)

Source: N. Jacq, LPC and IN2P3/CNRS “Biomedical DC Preliminary Report WISDOM Application, 5 sept 2005

Page 80: Grid Computing  and Grid Site Infrastructure

2006-03-06 Grid Computing and Grid Site Infrastructure, UvA SNB 80

Is the Grid middleware current?

• Common causes of failure
  – specified an impossible combination of resources
  – wrong middleware version at the site
  – not enough space in the proper place ($TMPDIR)
  – environment configuration ($VO_vo_SW_DIR, $LFC_HOST, …)

(Chart: number of sites running each middleware release, LCG-2_4_0 / LCG-2_3_1 / LCG-2_3_0, per week from 12/02/2005 to 25/06/2005.)

Page 81: Grid Computing  and Grid Site Infrastructure

Going From here

Does it work

How can we make it better

Page 82: Grid Computing  and Grid Site Infrastructure

2006-03-06 Grid Computing and Grid Site Infrastructure, UvA SNB 82

Going from here

Many nice things to do:
• Most of LCG provides a single OS (RHEL3), but users may need SLES, Debian, Gentoo, … or specific libraries. Virtualisation (Xen, VMware)?

• Scheduling user jobs
  – both the VO and the site want to set part of the priorities …

• Auditing and user tracing in this highly dynamic system
  – can we know for sure who is running what where? Or whether a user is DDoS-ing the White House right now?
  – out of 221 sites, we know for certain there is a compromise!

Page 83: Grid Computing  and Grid Site Infrastructure

2006-03-06 Grid Computing and Grid Site Infrastructure, UvA SNB 83

More things to do …

• Sparse file access: access data efficiently over the wide area

• Can we do something useful with the large disks in all the worker nodes? (our 240 CPUs share ~8 TByte of unused disk space!)

• There are new grid software releases every month, and the configuration comes from different sources … how can we combine and validate all these configurations quickly and easily?

Page 84: Grid Computing  and Grid Site Infrastructure

2006-03-06 Grid Computing and Grid Site Infrastructure, UvA SNB 84

Job submission live monitor

Source: Gidon Moont, Imperial College, London, HEP and e-Science Centre

Page 85: Grid Computing  and Grid Site Infrastructure

2006-03-06 Grid Computing and Grid Site Infrastructure, UvA SNB 85

Outlook

Towards a global persistent grid infrastructure
• Interoperability and persistency that are project-independent
  – Europe: EGEE-2, ‘European Grid Organisation’
  – US: Open Science Grid
  – Asia-Pacific: APGrid & PRAGMA, NAREGI, APAC, K*Grid, …
  – GIN aim: cross-submission and file access by end 2006

• Extension to industry
  – first: industrial engineering, financial scenario simulations

• New ‘middleware’
  – we are just starting to learn how it should work

• Extend more into sharing of structured data

Page 86: Grid Computing  and Grid Site Infrastructure

2006-03-06 Grid Computing and Grid Site Infrastructure, UvA SNB 86

A Bright Future!

Imagine that you could plug your computer into the wall and have direct access to huge computing resources immediately, just as you plug in a lamp to get instant light. …

Far from being science-fiction, this is the idea the [Grid] is about to make into reality.

The EU DataGrid project brochure, 2001