
Page 1: A  Distributed  Tier-1

A Distributed Tier-1

An example based on the

Nordic Scientific Computing Infrastructure

GDB meeting – NIKHEF/SARA 13th October 2004
John Renner Hansen – Niels Bohr Institute

With contributions from Oxana Smirnova, Peter Villemoes and Brian Vinter

Page 2: A  Distributed  Tier-1

Basis for a distributed Tier-1 structure

• External connectivity

• Internal connectivity

• Computer and Storage capacity

• Maintenance and operation

• Long term stability

Page 3: A  Distributed  Tier-1

NORDUnet network in 2003

[Network diagram: external links include GÉANT 10G (Oct ’03), NETNOD 3.5G, General Internet 5G and 2.5G, RUNNet 155M, NASK 12M, and a 622M link.]

Page 4: A  Distributed  Tier-1

NorthernLight

[Map: lightpaths connecting Helsinki, Oslo, Stockholm and Copenhagen with NetherLight in Amsterdam; links established Aug 2003 and Dec 2003.]

2.5G links connected to “ONS boxes” giving 2 GE channels between endpoints

Page 5: A  Distributed  Tier-1

NORDUnet was represented by Peter Villemoes at the NREN-Tier1 Meeting, Paris, Roissy Hilton, 12:00-17:00, 22 July 2004.

Page 6: A  Distributed  Tier-1

Denmark / Forskningsnet

• Upgrade from 622 Mbit/s to a 2.5 Gbit/s ring structure is finished:
  – Copenhagen-Odense-Århus-Aalborg and back via Göteborg to Copenhagen
• Network research
  – setting up a Danish national IPv6 activity
  – dark fibre through the country
    • experimental equipment for 10GE channels

Page 7: A  Distributed  Tier-1

Finland / FUNET

• Upgraded to 2.5G already in 2002

• Upgrading backbone routers to 10G capability

Page 8: A  Distributed  Tier-1

Norway / UNINETT

• Network upgraded to 2.5G between major universities

• UNINETT is expanding to more services and organisations

Page 9: A  Distributed  Tier-1

Sweden / SUNET

• 10G resilient nationwide network since Nov 2002

– all 32 universities have 2.5G access

• Active participation in SweGrid, the Swedish Grid Initiative

Page 10: A  Distributed  Tier-1

Nordic Tier-1          2005    2006    2007    2008 (offered)   % of 2008 total
CPU (kSI2K)             300     300     700        1400               8%
Disk (Tbytes)            60      70     200         700               8%
Tape (Pbytes)          0.06    0.07     0.2         0.7              12%
Tape (Mbytes/sec)         -       -       -    required: 580
WAN (Mbits/sec)        1000    2000    5000       10000

Computer and Storage capacity at a Nordic Tier-1
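A quick way to read this table is to back out the implied total 2008 Tier-1 requirement from the offered capacity and its quoted share. The Python sketch below does that arithmetic; the derived totals and the MB/s-to-Gbit/s conversion are back-of-the-envelope figures, not numbers from the slides.

```python
# Back-of-the-envelope sketch: implied total 2008 requirement, derived from the
# "offered" capacity and its quoted share of the total (figures from the table).
offered_2008 = {
    "CPU (kSI2K)":   (1400, 0.08),   # (offered, share of total requirement)
    "Disk (Tbytes)": (700, 0.08),
    "Tape (Pbytes)": (0.7, 0.12),
}

for resource, (offered, share) in offered_2008.items():
    print(f"{resource}: offered {offered}, implied total ~{offered / share:,.0f}")

# The required tape rate of 580 Mbytes/sec corresponds to roughly 4.6 Gbit/s,
# well within the 10000 Mbit/s WAN planned for 2008 (derived, not quoted).
print(f"580 MB/s ~ {580 * 8 / 1000:.1f} Gbit/s")
```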

Pages 11-14: A  Distributed  Tier-1

Who

• Denmark
  Danish Center for Grid Computing - DCGC
  Danish Center for Scientific Computing - DCSC
• Finland - CSC
• Norway - NorGrid
• Sweden - SweGrid

Page 15: A  Distributed  Tier-1

Denmark

• Two collaborating Grid projects
• Danish Centre for Scientific Computing Grid
  – DCSC-Grid spans the four DCSC sites and thus unifies the resources (PC clusters, IBM Regatta, SGI Enterprise, …) within DCSC
• Danish Centre for Grid Computing
  – is the national Grid project
  – DCSC-Grid is a partner in DCGC

Page 16: A  Distributed  Tier-1

Finland

• Remains centred on CSC
  – CSC participates in NDGF and NGC
  – A Finnish Grid will probably be created
  – This Grid will focus more on accessing CSC resources with local machines

Page 17: A  Distributed  Tier-1

Norway

• NOTUR Emerging Technologies on Grid Computing is the main mover
  – Oslo-Bergen “mini-Grid” in place
  – Trondheim and Tromsø should be joining
• Currently under reorganisation

Page 18: A  Distributed  Tier-1

Sweden

• SweGrid is a very ambitious project
  – 6 clusters have been created for SweGrid, each equipped with 100 PCs and a large disk system
  – A large support and education organisation is integrated in the plans

Page 19: A  Distributed  Tier-1

Nordic Data Grid Facility

[Organisation chart: the research councils of DK, SF, S and N (NOS-N) fund the Nordic Data Grid Facility.]

Nordic Project:
1. Create the basis for a common Nordic Data Grid Facility
2. Coordinate Nordic Grid activities

Steering Group: 3 members per country (1 research council civil servant, 2 scientists)
Core Group: Project Director and 4 post-docs

Page 20: A  Distributed  Tier-1

Services provided by the Tier-1 Regional Centres

• acceptance of raw and processed data from the Tier-0 centre, keeping up with data acquisition;

• recording and maintenance of raw and processed data on permanent mass storage;

• provision of managed disk storage providing permanent and temporary data storage for files and databases;

• operation of a data-intensive analysis facility;

• provision of other services according to agreed experiment requirements;

• provision of high capacity network services for data exchange with the Tier-0 centre, as part of an overall plan agreed between the experiments, Tier-1 and Tier-0 centres;

• provision of network services for data exchange with Tier-1 and selected Tier-2 centres, as part of an overall plan agreed between the experiments, Tier-1 and Tier-2 centres;

• administration of databases required by experiments at Tier-1 centres;

Page 21: A  Distributed  Tier-1
Page 22: A  Distributed  Tier-1

ARC-connected resources for DC2

 #  Site                          Country       ~ # CPUs   ~ % Dedicated
 1  atlas.hpc.unimelb.edu.au      Australia          28         30%
 2  genghis.hpc.unimelb.edu.au    Australia          90         20%
 3  charm.hpc.unimelb.edu.au      Australia          20        100%
 4  lheppc10.unibe.ch             Switzerland        12        100%
 5  lxsrv9.lrz-muenchen.de        Germany           234          5%
 6  atlas.fzk.de                  Germany           884          5%
 7  morpheus.dcgc.dk              Denmark            18        100%
 8  lscf.nbi.dk                   Denmark            32         50%
 9  benedict.aau.dk               Denmark            46         90%
10  fe10.dcsc.sdu.dk              Denmark           644          1%
11  grid.uio.no                   Norway             40        100%
12  fire.ii.uib.no                Norway             58         50%
13  grid.fi.uib.no                Norway              4        100%
14  hypatia.uio.no                Norway            100         60%
15  sigrid.lunarc.lu.se           Sweden            100         30%
16  sg-access.pdc.kth.se          Sweden            100         30%
17  hagrid.it.uu.se               Sweden            100         30%
18  bluesmoke.nsc.liu.se          Sweden            100         30%
19  ingrid.hpc2n.umu.se           Sweden            100         30%
20  farm.hep.lu.se                Sweden             60         60%
21  hive.unicc.chalmers.se        Sweden            100         30%
22  brenta.ijs.si                 Slovenia           50        100%

Totals at peak:
• 7 countries
• 22 sites
• ~3000 CPUs
  – dedicated ~700
• 7 Storage Services (in RLS)
  – a few more storage facilities
  – ~12 TB
• ~1 FTE (1-3 persons) in charge of production
  – At most 2 executor instances simultaneously
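The "Totals at peak" figures can be checked directly against the per-site numbers above. A minimal Python sketch follows; the data are copied from the table, and since the dedication percentages are only approximate, the weighted sum lands near (somewhat below) the quoted ~700 dedicated CPUs.

```python
# Minimal sketch: recompute the summary figures from the table above.
# (CPU count, approx. % dedicated) per site, copied from the table.
sites = {
    "atlas.hpc.unimelb.edu.au": (28, 30),  "genghis.hpc.unimelb.edu.au": (90, 20),
    "charm.hpc.unimelb.edu.au": (20, 100), "lheppc10.unibe.ch": (12, 100),
    "lxsrv9.lrz-muenchen.de": (234, 5),    "atlas.fzk.de": (884, 5),
    "morpheus.dcgc.dk": (18, 100),         "lscf.nbi.dk": (32, 50),
    "benedict.aau.dk": (46, 90),           "fe10.dcsc.sdu.dk": (644, 1),
    "grid.uio.no": (40, 100),              "fire.ii.uib.no": (58, 50),
    "grid.fi.uib.no": (4, 100),            "hypatia.uio.no": (100, 60),
    "sigrid.lunarc.lu.se": (100, 30),      "sg-access.pdc.kth.se": (100, 30),
    "hagrid.it.uu.se": (100, 30),          "bluesmoke.nsc.liu.se": (100, 30),
    "ingrid.hpc2n.umu.se": (100, 30),      "farm.hep.lu.se": (60, 60),
    "hive.unicc.chalmers.se": (100, 30),   "brenta.ijs.si": (50, 100),
}

total = sum(cpus for cpus, _ in sites.values())
dedicated = sum(cpus * pct / 100 for cpus, pct in sites.values())
countries = {host.rsplit(".", 1)[-1] for host in sites}   # country code from domain suffix

print(f"{len(sites)} sites in {len(countries)} countries")              # 22 sites, 7 countries
print(f"~{total} CPUs in total, ~{dedicated:.0f} effectively dedicated")
```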

Page 23: A  Distributed  Tier-1

ARC performance in ATLAS DC2

[Bar chart: good vs. failed jobs per site (0-6000 jobs) for the ARC-connected sites listed above.]

Total # of successful jobs: 42202 (as of September 25, 2004)

Failure rate before ATLAS ProdSys manipulations: 20%
• ~1/3 of failed jobs did not waste resources

Failure rate after: 35%
Possible reasons:
• Dulcinea failing to add DQ attributes in RLS
• DQ renaming
• Windmill re-submitting good jobs
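Reading the failure rates as fractions of all job attempts (an interpretation, not stated explicitly on the slide), the quoted success count fixes the implied number of attempts. A short sketch of that arithmetic; the totals it prints are derived, not quoted:

```python
# Back-of-the-envelope sketch: job attempts implied by the quoted failure rates,
# assuming each rate is a fraction of all attempts (an interpretation).
successful = 42202   # as of September 25, 2004

for label, failure_rate in [("before ProdSys manipulations", 0.20),
                            ("after ProdSys manipulations", 0.35)]:
    attempts = successful / (1 - failure_rate)
    print(f"{label}: ~{attempts:,.0f} attempts, ~{attempts - successful:,.0f} failed")
```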

Page 24: A  Distributed  Tier-1

Failure analysis

• Dominant problem: hardware accidents