
What Is High Throughput Distributed Computing


Page 1: What Is High Throughput Distributed Computing


What is High Throughput Distributed Computing?

CERN Computing Summer School 2001

Santander

Les Robertson, CERN - IT Division

[email protected]

Page 2: What Is High Throughput Distributed Computing


Outline

High Performance Computing (HPC) and High Throughput Computing (HTC)

Parallel processing – so difficult with HPC applications, so easy with HTC

Some models of distributed computing

HEP applications
Offline computing for LHC
Extending HTC to the Grid

Page 3: What Is High Throughput Distributed Computing


“Speeding Up” the Calculation?

Use the fastest processor available

-- but this gives only a small factor over modest (PC) processors

Use many processors, performing bits of the problem in parallel

-- and since quite fast processors are inexpensive, we can think of using very many processors in parallel

Page 4: What Is High Throughput Distributed Computing


High Performance – or – High Throughput?

The key questions are granularity & degree of parallelism – have you got one big problem, or a bunch of little ones?

To what extent can the “problem” be decomposed into sort-of-independent parts (grains) that can all be processed in parallel?

Granularity
fine-grained parallelism – the independent bits are small, need to exchange information, and synchronise often
coarse-grained – the problem can be decomposed into large chunks that can be processed independently

Practical limits on the degree of parallelism – how many grains can be processed in parallel?
degree of parallelism v. grain size
grain size limited by the efficiency of the system at synchronising grains

Page 5: What Is High Throughput Distributed Computing


High Performance – v. – High Throughput?

fine-grained problems need a high performance system that enables rapid synchronisation between the bits that can be processed in parallel, and runs the bits that are difficult to parallelise as fast as possible

coarse-grained problems can use a high throughput system that maximises the number of parts processed per minute

High Throughput Systems use a large number of inexpensive processors, inexpensively interconnected

while High Performance Systems use a smaller number of more expensive processors, expensively interconnected

Page 6: What Is High Throughput Distributed Computing


High Performance – v. – High Throughput?

There is nothing fundamental here – it is just a question of financial trade-offs like:

how much more expensive is a “fast” computer than a bunch of slower ones?
how much is it worth to get the answer more quickly?
how much investment is necessary to improve the degree of parallelisation of the algorithm?

But the target is moving – since the cost chasm first opened between fast and slower computers 12-15 years ago, an enormous effort has gone into finding parallelism in “big” problems

Inexorably decreasing computer costs and de-regulation of the wide area network infrastructure have opened the door to ever larger computing facilities – clusters, fabrics, (inter)national grids – demanding ever-greater degrees of parallelism

Page 7: What Is High Throughput Distributed Computing


High Performance Computing

Page 8: What Is High Throughput Distributed Computing


A quick look at HPC problems

Classical high-performance applications – numerical simulations of complex systems, such as:
weather, climate, combustion
mechanical devices and structures, crash simulation
electronic circuits, manufacturing processes, chemical reactions

Image processing applications, like:
medical scans
military sensors
earth observation, satellite reconnaissance
seismic prospecting

Page 9: What Is High Throughput Distributed Computing


Approaches to parallelism

Domain decomposition

Functional decomposition

graphics from Designing and Building Parallel Programs (Online), by Ian Foster - http://www-unix.mcs.anl.gov/dbpp/

Page 10: What Is High Throughput Distributed Computing


Of course – it’s not that simple

graphic from Designing and Building Parallel Programs (Online), by Ian Foster - http://www-unix.mcs.anl.gov/dbpp/

Page 11: What Is High Throughput Distributed Computing


The design process

Data or functional decomposition – building an abstract task model

Building a model for communication between tasks – interaction patterns

Agglomeration – to fit the abstract model to the constraints of the target hardware
interconnection topology
speed, latency, overhead of communications

Mapping the tasks to the processors
load balancing
task scheduling

graphic from Designing and Building Parallel Programs (Online), by Ian Foster - http://www-unix.mcs.anl.gov/dbpp/

Page 12: What Is High Throughput Distributed Computing


Large scale parallelism – the need for standards

“Supercomputer” market is in trouble – diminishing number of suppliers; questionable future

Increasingly risky to design for specific tightly coupled architectures like - SGI (Cray, Origin), NEC, Hitachi

Require a standard for communication between partitions/tasks that works also on loosely coupled systems (“massively parallel processors” – MPP – IBM SP, Compaq)

Paradigm is message passing rather than shared memory – tasks rather than threads

Parallel Virtual Machine – PVM
MPI – Message Passing Interface

Page 13: What Is High Throughput Distributed Computing


MPI – Message Passing Interface

industrial standard – http://www.mpi-forum.org
source code portability
widely available; efficient implementations
SPMD (Single Program Multiple Data) model
Point-to-point communication (send/receive/wait; blocking/non-blocking)
Collective operations (broadcast; scatter/gather; reduce)
Process groups, topologies

comprehensive and rich functionality
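
As a concrete sketch of the SPMD model (an illustration, not taken from the foils): every rank runs the same program, computes a partial sum over its own slice of the indices, and a single collective MPI_Reduce combines the results on rank 0.

```c
/* Minimal MPI sketch: SPMD partial sums combined with a collective.
   Compile: mpicc sum.c -o sum    Run: mpirun -np 4 ./sum */
#include <stdio.h>
#include <mpi.h>

int main(int argc, char **argv)
{
    int rank, size;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);   /* who am I?       */
    MPI_Comm_size(MPI_COMM_WORLD, &size);   /* how many of us? */

    /* each rank works on its own grains of the problem */
    long local = 0;
    for (long i = rank; i < 1000000; i += size)
        local += i;

    /* collective operation: reduce all partial sums onto rank 0 */
    long total = 0;
    MPI_Reduce(&local, &total, 1, MPI_LONG, MPI_SUM, 0, MPI_COMM_WORLD);

    if (rank == 0)
        printf("total = %ld\n", total);

    MPI_Finalize();
    return 0;
}
```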

Page 14: What Is High Throughput Distributed Computing


MPI – Collective operations

IBM Redbook - http://www.redbooks.ibm.com/redbooks/SG245380.html

Defining high-level data functions allows highly efficient implementations, e.g. minimising data copies

Page 15: What Is High Throughput Distributed Computing


The limits of parallelism - Amdahl’s Law

If we have N processors, with s the time spent in a serial processor on the serial parts of the code, and p the time spent in a serial processor on the parts that could be executed in parallel:

    Speedup = (s + p) / (s + p/N)

taking s as the fraction of the time spent in the sequential part of the program (s + p = 1):

    Speedup = 1 / (s + (1 - s)/N)  →  1/s as N → ∞

Amdahl, G.M., Validity of single-processor approach to achieving large scale computing capability, Proc. AFIPS Conf., Reston, VA, 1967, pp. 483-485
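
To make the limit concrete, a small sketch that evaluates the formula above – the numbers follow directly from it; for example, a 5% sequential part caps the speedup at 20, and 100 processors already deliver only about 16.8.

```c
/* Amdahl's law: speedup = 1 / (s + (1 - s)/N), s = serial fraction. */
#include <stdio.h>

static double amdahl(double s, double n)
{
    return 1.0 / (s + (1.0 - s) / n);
}

int main(void)
{
    double fractions[] = {0.01, 0.05, 0.10};
    for (int i = 0; i < 3; i++)
        printf("s = %2.0f%%   N=100: %5.1f   N->inf: %4.0f\n",
               100 * fractions[i],
               amdahl(fractions[i], 100),   /* e.g. s=5%: ~16.8 */
               1.0 / fractions[i]);         /* the hard ceiling */
    return 0;
}
```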

Page 16: What Is High Throughput Distributed Computing


Amdahl’s Law - maximum speedup

[chart: maximum speedup vs. sequential part as a percentage of total time – the speedup falls from ~100 at a 1% sequential fraction to ~10 at 10%]

Page 17: What Is High Throughput Distributed Computing


Load Balancing - real life is (much) worse

Often have to use barrier synchronisation between each step, and different cells require different amounts of computation

Real time sequential part: s = Σi si

Real time parallelisable part on a sequential processor: p = Σk Σj pkj

Real time parallelised: T = s + Σk maxj(pkj) >> s + p/N

[diagram: execution timeline – sequential segments s1 … sN alternate with parallel steps; in step k the grains pk1 … pkM run in parallel on M processors, each step ending at a barrier]

Page 18: What Is High Throughput Distributed Computing


Gustafson’s Interpretation

The problem size scales with the number of processors

With a lot more processors (computing capacity) available you can and will do much more work in less time

The complexity of the application rises to fill the capacity available

But the sequential part remains approximately constant

Gustafson, J.L., Re-evaluating Amdahl’s Law, CACM 31(5), 1988, pp. 532-533
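
For reference, Gustafson's scaled speedup can be written as follows (a standard formulation, not on the original foil; s is the serial fraction measured on the parallel system):

```latex
% The parallel part scales with N while the sequential part stays fixed,
% so the work completed in constant time grows as
S_{scaled}(N) = s + (1 - s)\,N = N - (N - 1)\,s
```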

Page 19: What Is High Throughput Distributed Computing


Amdahl’s Law - maximum speedup with Gustafson’s appetite

[chart: maximum speedup vs. sequential part as % of total time (0.0% – 3.0%), with speedups rising to ~2,500 as the sequential fraction shrinks]

potential 1,000 X speedup with 0.1% sequential code

Page 20: What Is High Throughput Distributed Computing


The importance of the network

Communication Overhead adds to the inherent sequential part of the program to limit the Amdahl speedup

Latency – the round-trip time (RTT) to communicate between two processors

communications overhead: c = latency + data_transfer_time

    Speedup = (s + p) / (s + c + p/N)

For fine-grained parallel programs the problem is latency, not bandwidth

[diagram: execution timeline – communications overhead inserted between the sequential and parallelisable parts of each step]

Page 21: What Is High Throughput Distributed Computing


Latency

Comparison – Efficient MPI implementation on Linux cluster (source: Real World Computing Partnership, Tsukuba Research Center)

Network                         Bandwidth (MByte/sec)   RTT Latency (microseconds)
Myrinet                         146                     20
Gigabit Ethernet (SysKonnect)   73                      61
Fast Ethernet (EEPRO100)        11                      100

Page 22: What Is High Throughput Distributed Computing


High Throughput Computing

Page 23: What Is High Throughput Distributed Computing


High Throughput Computing - HTC

Roughly speaking –
HPC deals with one large problem
HTC is appropriate when the problem can be decomposed into many (very many) smaller problems that are essentially independent

Examples:
build a profile of all MasterCard customers who purchased an airline ticket and rented a car in August
analyse the purchase patterns of Walmart customers in the LA area last month
generate 10^6 CMS events
Web surfing, Web searching
database queries

HPC – problems that are hard to parallelise – single processor performance is important

HTC – problems that are easy to parallelise – can be adapted to very large numbers of processors

Page 24: What Is High Throughput Distributed Computing


HTC - HPC

High Performance
Granularity largely defined by the algorithm and limitations in the hardware
Load balancing difficult
Hard to schedule different workloads
Reliability is all-important –
if one part fails, the calculation stops (maybe even aborts!)
check-pointing essential – all the processes must be restarted from the same synchronisation point
hard to dynamically re-configure for a smaller number of processors

High Throughput
Granularity can be selected to fit the environment
Load balancing easy
Mixing workloads is easy
Sustained throughput is the key goal –
the order in which the individual tasks execute is (usually) not important
if some equipment goes down, the work can be re-run later
easy to dynamically re-schedule the workload to different configurations

Page 25: What Is High Throughput Distributed Computing


Distributed Computing

Page 26: What Is High Throughput Distributed Computing


Distributed Computing

Local distributed systems
clusters
parallel computers (IBM SP)

Geographically distributed systems
Computational Grids

HPC – as we have seen, needs low latency AND good communication bandwidth

HTC distributed systems – the bandwidth is important, the latency is less significant
if latency is poor, more processes can be run in parallel to cover the waiting time

Page 27: What Is High Throughput Distributed Computing


Shared Data

If the granularity is coarse enough, the different parts of the problem can be synchronised simply by sharing data

Example – event reconstruction
all of the events to be reconstructed are stored in a large data store
processes (jobs) read successive raw events, generating processed event records, until there are no raw events left
the result is the concatenation of the processed events (and folding together some histogram data)
synchronisation overhead can be minimised by partitioning the input and output data
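
A sketch of this pattern in C (the toy record layouts and the body of reconstruct() are stand-ins for a real experiment framework): each job consumes the raw events of its own input partition until none are left, so the jobs synchronise only through the shared data.

```c
/* Sketch of shared-data event reconstruction. Record layouts and the
   reconstruct() body are toy stand-ins for a real framework. */
#include <stdio.h>

typedef struct { int id; double adc[16]; } RawEvent;  /* toy raw record     */
typedef struct { int id; double energy;  } RecEvent;  /* toy summary record */

/* the CPU-heavy part - here just a toy sum over ADC channels */
static RecEvent reconstruct(const RawEvent *raw)
{
    RecEvent rec = { raw->id, 0.0 };
    for (int i = 0; i < 16; i++)
        rec.energy += raw->adc[i];
    return rec;
}

int main(int argc, char **argv)
{
    if (argc != 3) {
        fprintf(stderr, "usage: %s raw.in rec.out\n", argv[0]);
        return 1;
    }
    FILE *in  = fopen(argv[1], "rb");     /* this job's input partition  */
    FILE *out = fopen(argv[2], "wb");     /* this job's output partition */
    if (!in || !out) return 1;

    /* read successive raw events until there are none left */
    RawEvent raw;
    while (fread(&raw, sizeof raw, 1, in) == 1) {
        RecEvent rec = reconstruct(&raw);
        fwrite(&rec, sizeof rec, 1, out); /* results concatenated later */
    }
    fclose(in);
    fclose(out);
    return 0;
}
```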

Page 28: What Is High Throughput Distributed Computing


Data Sharing - Files

Global file namespace – maps a universal name to a network node and a local name
Remote data access
Caching strategies – local or intermediate caching, replication, migration
Access control, authentication issues
Locking issues

NFS, AFS, Web folders

Highly scalable for read-only data

Page 29: What Is High Throughput Distributed Computing


Data Sharing – Databases, Objects

File sharing is probably the simplest paradigm for building distributed systems

Database and object sharing look the same

But –
files are universal, fundamental system concepts – standard interfaces, functionality
databases are not yet fundamental, built-in – and there are only a few standards
objects even less so – still at the application level – so it is harder to implement efficient and universal caching, remote access, etc.

Page 30: What Is High Throughput Distributed Computing


Client-server

Examples – Web browsing, online banking, order entry, ……..

The functionality is divided between the two parts – for example:
exploit locality of data (e.g. perform searches and transformations on the node where the data resides)
exploit different hardware capabilities (e.g. central supercomputer, graphics workstation)
security concerns – restrict sensitive data to defined geographical locations (e.g. account queries)
reliability concerns (e.g. perform database updates on highly reliable servers)

Usually the server implements pre-defined, standardised functions

[diagram: client sends a request to the server; the server returns a response]

Page 31: What Is High Throughput Distributed Computing


3-Tier client-server

[diagram: many clients connect to intermediate servers, which in turn connect to a central database server]

data extracts replicated on intermediate servers
changes batched for asynchronous treatment by the database server

Enables –
scaling up client query capacity
isolation of the main database

Page 32: What Is High Throughput Distributed Computing


Peer-to-Peer - P2P

Peer-to-Peer – decentralisation of function and control
Taking advantage of the computational resources at the edge of the network
The functions are shared between the distributed parts – without central control
Programs cooperate without being designed as a single application

So P2P is just a democratic form of parallel programming –
SETI
the parallel HPC problems we have looked at, using MPI

All the buzz of P2P is because new interfaces promise to bring this to the commercial world, allowing different communities and businesses to collaborate through the internet –
XML, SOAP, .NET, JXTA

Page 33: What Is High Throughput Distributed Computing


Simple Object Access Protocol - SOAP

SOAP – a simple, lightweight mechanism for exchanging objects between peers in a distributed environment, using XML carried over HTTP

SOAP consists of three parts:
the SOAP envelope – what is in a message, who should deal with it, and whether it is optional or mandatory
the SOAP encoding rules – a serialisation definition for exchanging instances of application-defined datatypes
the SOAP Remote Procedure Call representation

Page 34: What Is High Throughput Distributed Computing


Microsoft’s .NET

.NET is a framework, or environment for building, deploying and running Web services and other internet applications

Common Language Runtime – C++, C#, Visual Basic and JScript
Framework classes
Aiming at a standard – but Windows only

Page 35: What Is High Throughput Distributed Computing


JXTA

Interoperability – locating JXTA peers, communication
Platform, language and network independence
Implementable on anything – phone, VCR, PDA, PC
A set of protocols
Security model
Peer discovery
Peer groups
XML encoding

http://www.jxta.org/project/www/docs/TechOverview.pdf

Page 36: What Is High Throughput Distributed Computing


End of Part 1

Tomorrow:

HEP applications
Offline computing for LHC
Extending HTC to the Grid

Page 37: What Is High Throughput Distributed Computing


HEP Applications

Page 38: What Is High Throughput Distributed Computing

[diagram: Data Handling and Computation for Physics Analysis – the detector produces raw data, which passes through the event filter (selection & reconstruction) to event summary data; event reprocessing and event simulation feed the same chain; batch physics analysis extracts analysis objects (by physics topic) from the processed data, which feed interactive physics analysis]

Page 39: What Is High Throughput Distributed Computing


HEP Computing Characteristics

Large numbers of independent events – trivial parallelism – “job” granularity
Modest floating point requirement – SPECint performance
Large data sets – smallish records, mostly read-only
Modest I/O rates – a few MB/sec per fast processor

Simulation – CPU-intensive; mostly static input data; very low output data rate

Reconstruction – very modest I/O; easy to partition input data; easy to collect output data

Page 40: What Is High Throughput Distributed Computing


Analysis

ESD analysis
modest I/O rates
read-only ESD
BUT –
very large input database
chaotic workload – unpredictable, no limit to the requirements

AOD analysis
potentially very high I/O rates
but modest database

Page 41: What Is High Throughput Distributed Computing


HEP Computing Characteristics

Large numbers of independent events – trivial parallelism – “job” granularity
Large data sets – smallish records, mostly read-only
Modest I/O rates – a few MB/sec per fast processor
Modest floating point requirement – SPECint performance
Chaotic workload – research environment; unpredictable, no limit to the requirements
Very large aggregate requirements – computation, data
Scaling up is not just big – it is also complex
… and once you exceed the capabilities of a single geographical installation ………?

Page 42: What Is High Throughput Distributed Computing


Task Farming

Page 43: What Is High Throughput Distributed Computing


Task Farming

Decompose the data into large independent chunks
Assign one task (or job) to each chunk
Put all the tasks in a queue for a scheduler, which manages a large “farm” of processors, each of which has access to all of the data
The scheduler runs one or more jobs on each processor
When a job finishes, the next job in the queue is started
Until all the jobs have been run
Collect the output files
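
A sketch of such a scheduler in MPI (an assumption of this example – the foil itself does not prescribe a technology; plain integer chunk identifiers stand in for references to the shared data).

```c
/* Sketch of a task farm: one master queues chunks, size-1 workers
   process them. Compile: mpicc farm.c -o farm   Run: mpirun -np 8 ./farm */
#include <stdio.h>
#include <mpi.h>

#define NCHUNKS  100
#define TAG_WORK 1
#define TAG_DONE 2

static int process_chunk(int chunk)        /* stand-in for the real job */
{
    return chunk * chunk;
}

int main(int argc, char **argv)
{
    int rank, size;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    if (rank == 0) {                       /* the scheduler ("master") */
        int next = 0, active = 0, result;
        MPI_Status st;
        /* seed every worker with a first chunk, or release it at once */
        for (int w = 1; w < size; w++) {
            if (next < NCHUNKS) {
                MPI_Send(&next, 1, MPI_INT, w, TAG_WORK, MPI_COMM_WORLD);
                next++; active++;
            } else {
                MPI_Send(&next, 0, MPI_INT, w, TAG_DONE, MPI_COMM_WORLD);
            }
        }
        /* when a job finishes, start the next job in the queue */
        while (active > 0) {
            MPI_Recv(&result, 1, MPI_INT, MPI_ANY_SOURCE, TAG_WORK,
                     MPI_COMM_WORLD, &st);
            active--;                      /* collect/record the result */
            if (next < NCHUNKS) {
                MPI_Send(&next, 1, MPI_INT, st.MPI_SOURCE, TAG_WORK,
                         MPI_COMM_WORLD);
                next++; active++;
            } else {
                MPI_Send(&next, 0, MPI_INT, st.MPI_SOURCE, TAG_DONE,
                         MPI_COMM_WORLD);
            }
        }
    } else {                               /* a worker */
        int chunk, result;
        MPI_Status st;
        for (;;) {
            MPI_Recv(&chunk, 1, MPI_INT, 0, MPI_ANY_TAG, MPI_COMM_WORLD, &st);
            if (st.MPI_TAG == TAG_DONE) break;
            result = process_chunk(chunk);
            MPI_Send(&result, 1, MPI_INT, 0, TAG_WORK, MPI_COMM_WORLD);
        }
    }
    MPI_Finalize();
    return 0;
}
```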

Page 44: What Is High Throughput Distributed Computing


Task Farming

Task farming is good for a very large problem which has:
selectable granularity
largely independent tasks
loosely shared data

HEP –
simulation
reconstruction
and much of the analysis

Page 45: What Is High Throughput Distributed Computing


The SHIFT Software Model (1990)

[diagram: application servers, disk servers, stage (migration) servers, tape servers and queue servers connected by an IP network]

From the application’s viewpoint this is simply file sharing – all data available to all processes

standard APIs – disk I/O; mass storage; job scheduler; can be implemented over an IP network
mass storage model – tape data cached on disk (stager)
physical implementation – transparent to the application/user
scalable, heterogeneous
flexible evolution – scalable capacity; multiple platforms; seamless integration of new technologies

Page 46: What Is High Throughput Distributed Computing


Current Implementation of SHIFT

[diagram: application servers (racks of dual-CPU Linux PCs) and a data cache (Linux PC controllers with IDE disks) connected by 100BaseT and Gigabit Ethernet to the WAN and to mass storage – Linux PC controllers, STK Powderhorn robots, and STK 9840, STK 9940 and IBM 3590 drives]

Page 47: What Is High Throughput Distributed Computing


Fermilab Reconstruction Farms

1991 – farms of RISC workstations introduced for reconstruction, replacing special purpose processors (emulators, ACP)
Ethernet network
Integrated with tape systems
cps – job scheduler, event manager

Page 48: What Is High Throughput Distributed Computing


Condor – a hunter of unused cycles

The hunter of idle workstations (1986)

ClassAd Matchmaking
users advertise their requirements
systems advertise their capabilities & constraints

Directed Acyclic Graph Manager – DAGMan
define dependencies between jobs

Checkpoint – reschedule – restart
if the owner of the workstation returns, or if there is some failure

Share data through files
global shared files
Condor file system calls

Flocking
interconnecting pools of Condor workstations

http://www.cs.wisc.edu/condor/

Page 49: What Is High Throughput Distributed Computing


Layout of the Condor Pool

[diagram: the Central Manager runs master, collector, negotiator, schedd and startd daemons; each Desktop runs master, schedd and startd; each Cluster Node runs master and startd; arrows mark the ClassAd communication pathways and spawned processes linking the schedds and startds to the central collector/negotiator]

http://www.cs.wisc.edu/condor

Page 50: What Is High Throughput Distributed Computing


How Flocking Works

Add a line to your condor_config: FLOCK_HOSTS = Pool-Foo, Pool-Bar

[diagram: the submit machine’s schedd first contacts the collector/negotiator of its home pool’s Central Manager (CONDOR_HOST), then flocks to the Central Managers of Pool-Foo and Pool-Bar]

http://www.cs.wisc.edu/condor

Page 51: What Is High Throughput Distributed Computing


Friendly Condor Pool

[diagram: 600 Condor jobs flow from the Home Condor Pool to a Friendly Condor Pool]

http://www.cs.wisc.edu/condor

Page 52: What Is High Throughput Distributed Computing


Finer grained HTC

Page 53: What Is High Throughput Distributed Computing


The food chain in reverse –
-- the PC has consumed the market for larger computers, destroying the species
-- there is no choice but to harness the PCs

Page 54: What Is High Throughput Distributed Computing


Berkeley - Networks of Workstations (1994)

Single system view
shared resources
virtual machine
single address space

Global Layer Unix – GLUnix
Serverless Network File Service – xFS

Research project

A Case for Networks of Workstations: NOW, IEEE Micro, Feb, 1995, Thomas E. Anderson, David E. Culler, David A. Patterson

http://now.cs.berkeley.edu

Page 55: What Is High Throughput Distributed Computing


Beowulf

NASA Goddard (Thomas Sterling, Donald Becker) – 1994
16 Intel PCs – Ethernet – Linux
Caltech/JPL, Los Alamos
parallel applications from the supercomputing community

Oak Ridge – 1996 – the Stone SouperComputer
problem – generate an eco-region map of the US on a 1 km grid
a 64-way PC cluster proposal was rejected – so they re-cycled rejected desktop systems

The experience, the emphasis on do-it-yourself, the packaging of some of the tools, and probably the name stimulated wide-spread adoption of clusters in the supercomputing world

Page 56: What Is High Throughput Distributed Computing


Parallel ROOT Facility - Proof

ROOT – object oriented analysis tool

Queries are performed in parallel on an arbitrary number of processors

Load balancing: slaves receive work from the master process in “packets”; the packet size is adapted to the current load, the number of slaves, etc.
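
The packet-size adaptation can be sketched like this (illustrative only – not PROOF's actual algorithm): the master targets a roughly constant wall-clock time per packet, so faster slaves are handed more events per request.

```c
/* Illustrative packet sizing for a master/slave query (a sketch, not
   PROOF's real code): aim for a constant time-per-packet, so each
   slave's packet size tracks its measured event rate. */
static long next_packet_size(long events_done, double seconds_worked,
                             double target_seconds, long events_left)
{
    double rate = (seconds_worked > 0.0) ? events_done / seconds_worked
                                         : 1000.0;      /* initial guess */
    long n = (long)(rate * target_seconds);
    if (n < 1) n = 1;                     /* always make some progress   */
    if (n > events_left) n = events_left; /* never overshoot the dataset */
    return n;
}
```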

Page 57: What Is High Throughput Distributed Computing


LHC Computing

Page 58: What Is High Throughput Distributed Computing


CERN's Users in the World

Europe: 267 institutes, 4603 users
Elsewhere: 208 institutes, 1632 users

Page 59: What Is High Throughput Distributed Computing


The Large Hadron Collider Project

4 detectors – CMS, ATLAS, LHCb, ALICE

Storage –
raw recording rate 0.1 – 1 GBytes/sec
accumulating at 5-8 PetaBytes/year
10 PetaBytes of disk

Processing –
200,000 of today’s fastest PCs

Page 60: What Is High Throughput Distributed Computing


source: CERN/LHCC/2001-004 - Report of the LHC Computing Review - 20 February 2001

Summary of Computing Capacity Required for all LHC Experiments in 2007 (ATLAS with 270 Hz trigger)

                       --------- CERN ---------    Regional    Grand
                       Tier 0    Tier 1    Total   Centres     Total
Processing (K SI95)    1,727     832       2,559   4,974       7,533
Disk (PB)              1.2       1.2       2.4     8.7         11.1
Magnetic tape (PB)     16.3      1.2       17.6    20.3        37.9

Worldwide distributed computing system
small fraction of the analysis at CERN
ESD analysis – using 12-20 large regional centres
how to use the resources efficiently?
establishing and maintaining a uniform physics environment
Data exchange – with tens of smaller regional centres, universities, labs

Page 61: What Is High Throughput Distributed Computing


Planned capacity evolution at CERN, 1998 – 2010

[charts: Estimated DISK Capacity at CERN (TeraBytes, rising towards ~7,000 TB by 2010), Estimated Mass Storage at CERN (PetaBytes, rising towards ~140 PB) and Estimated CPU Capacity at CERN (K SI95, rising towards ~6,000), each split between LHC and other experiments; the CPU chart also shows a Moore’s law line]

Page 62: What Is High Throughput Distributed Computing


Are Grids a solution?

The Grid – Ian Foster, Carl Kesselman – The Globus Project

“Dependable, consistent, pervasive access to [high-end] resources”

• Dependable:

• provides performance and functionality guarantees

• Consistent:

• uniform interfaces to a wide variety of resources

• Pervasive:

• ability to “plug in” from anywhere

Page 63: What Is High Throughput Distributed Computing


The Grid

The GRID – ubiquitous access to computation, in the sense that the WEB provides ubiquitous access to information

Page 64: What Is High Throughput Distributed Computing


Globus Architecture – www.globus.org

[diagram: a layered architecture –
Applications
High-level Services and Tools (DUROC, globusrun, MPI, Nimrod/G, MPI-IO, CC++, GlobusView, Testbed Status) – uniform application program interface to grid resources
Core Services, the middleware (Metacomputing Directory Service, GRAM, Globus Security Interface, Heartbeat Monitor, Nexus, Gloperf, GASS) – grid infrastructure primitives
Local Services (LSF, Condor, MPI, NQE, Easy, TCP, UDP, Solaris, Irix, AIX) – mapped to local implementations, architectures, policies]

Page 65: What Is High Throughput Distributed Computing


The nodes of the Grid are managed by different people, so they have different access and usage policies – and may have different architectures

The geographical distribution means that there cannot be a central status – status information and resource availability is “published” (remember Condor Classified Ads)

Grid schedulers can only have an approximate view of resources

The Grid Middleware tries to present this as a coherent virtual computing centre

Page 66: What Is High Throughput Distributed Computing


Core Services

Security
Information Service
Resource Management – Grid scheduler, standard resource allocation
Remote Data Access – global namespace, caching, replication
Performance and Status Monitoring
Fault detection
Error Recovery Management

Page 67: What Is High Throughput Distributed Computing


The Promise of Grid Technology

What does the Grid do for you? You submit your work, and the Grid:
finds convenient places for it to be run
optimises use of the widely dispersed resources
organises efficient access to your data – caching, migration, replication
deals with authentication to the different sites that you will be using
interfaces to local site resource allocation mechanisms, policies
runs your jobs
monitors progress
recovers from problems
.. and .. tells you when your work is complete

Page 68: What Is High Throughput Distributed Computing


[diagram: LHC Computing Model (2001 – evolving) – the LHC Computing Centre as a multi-tier grid: the Tier 0 centre at CERN serves the experiments (CMS, ATLAS, LHCb); Tier 1 regional centres in Germany, USA, UK, France, Italy, at CERN, …; Tier 2 centres at labs and universities (Lab a, Uni a, …); Tier 3 physics department resources and desktops; physics groups and regional groups connect at every level – the opportunity of Grid technology]