
Performance of the NorduGrid ARC and the Dulcinea Executor in ATLAS Data Challenge 2

Oxana Smirnova (Lund University/CERN), for the NorduGrid collaboration
CHEP 2004, Interlaken, 2004-09-29


NorduGrid

NorduGrid is a research collaboration established by universities in Denmark, Estonia, Finland, Norway and Sweden
– Focuses on providing production-quality Grid middleware for academic researchers
– Triggered by the needs of the LHC experiments
Close cooperation with other Grid projects:
– EU DataGrid (2001-2003)
– SWEGRID
– NDGF
– LCG
– EGEE
Assistance in Grid deployment outside the Nordic area


Advanced Resource Connector (ARC)

ARC is the Grid middleware developed by NorduGrid
– Based on Globus libraries and API
– Original architectural solutions, services and implementations
– Supports one of the largest functional Grid-like systems
  • 10 countries, 40+ sites, ~4000 CPUs, ~30 TB storage
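For illustration, a minimal sketch of submitting a job to an ARC site from Python; the xRSL attributes, file names and the exact ngsub invocation are assumptions, not a verbatim recipe from this work.

```python
# Illustrative sketch: the xRSL attributes, file names and the exact ngsub
# command line are assumptions; consult the ARC client documentation.
import subprocess

# A minimal xRSL (extended Resource Specification Language) job description.
xrsl = (
    '&(executable="run_sim.sh")'
    '(inputFiles=("run_sim.sh" ""))'     # upload the script from the client machine
    '(outputFiles=("results.root" ""))'  # keep the output on the site for retrieval
    '(jobName="dc2-example")'
    '(cpuTime="1440")'                   # requested CPU time in minutes
)

# Submit with the ARC user interface; ngsub prints a job ID that can later be
# used with ngstat (status) and ngget (output retrieval).
result = subprocess.run(["ngsub", xrsl], capture_output=True, text=True, check=True)
print(result.stdout.strip())
```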


ATLAS Data Challenge 2

Mass simulation of future data taking
– Event generation, detector simulation
– Test of Tier-0 operation: “raw data” processing, distribution of output to regional centers
– Distributed analysis
Duration: Summer-Fall 2004
New ATLAS software
An automated production system
Resources come via the available Grid systems
– Grid3 (USA), LCG, NorduGrid + other ARC-enabled sites
Test of the ATLAS Computing Model


ATLAS Production System

A thin application-specific layer on top of the Grid and legacy systems
– “Don Quijote” is the data management system, interfacing to the Grid data indexing services (RLS)
– The Production Database (ProdDB) holds job definitions and status records
– “Windmill”, the supervisor, mediates between the ProdDB and the executors
– Executors use Grid-specific APIs to schedule and manipulate the jobs (sketched after this list)
  • “Capone”: Grid3
  • “Dulcinea”: ARC
  • “Lexor”: LCG2
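The sketch below shows, in Python, the rough shape of this split between a supervisor and a Grid-specific executor; the class and method names are hypothetical stand-ins, not the actual Windmill or Dulcinea interfaces.

```python
# Hypothetical sketch of the supervisor/executor split; the class and method
# names are stand-ins, not the actual Windmill or Dulcinea interfaces.
from dataclasses import dataclass, field


@dataclass
class JobRecord:
    """A job definition as it would come from the Production Database."""
    job_id: str
    transformation: str                      # ATLAS transformation to run
    input_files: list[str] = field(default_factory=list)
    status: str = "defined"                  # defined -> submitted -> done/failed
    grid_id: str = ""                        # Grid-level job identifier, once submitted


class GridExecutor:
    """Base class for the Grid-specific executors (Capone, Dulcinea, Lexor)."""

    def submit(self, job: JobRecord) -> str:
        """Translate the job into a Grid-specific description and submit it."""
        raise NotImplementedError

    def status(self, grid_id: str) -> str:
        """Return the Grid-level status of a previously submitted job."""
        raise NotImplementedError


def supervisor_pass(executor: GridExecutor, jobs: list[JobRecord]) -> None:
    """One supervisor cycle: hand newly defined jobs to the executor."""
    for job in jobs:
        if job.status == "defined":
            job.grid_id = executor.submit(job)   # executor talks to its Grid flavour
            job.status = "submitted"
```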


Dulcinea implementation

Implemented in C++, compiled as a shared library
– The shared library is imported into Python
Wraps each ATLAS job in a tailored script that:
– Creates a POOL file catalog for the input files
– Untars the ATLAS transformations tarball
– Calls the transformation requested by Windmill
– Creates an XML file with metadata (Don Quijote attributes) for the output results
Calls the ARC User Interface API and the Globus RLS API
– File transfer is handled entirely by the ARC gatekeeper
– No internal tracking of jobs; relies on the ARC Information System
– Can avoid problematic sites using a “blacklist”
Fetches the XML file for each job and adds the attributes to the RLS catalogue (see the sketch below)
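As a rough illustration of the wrapping and registration steps listed above, the following sketch builds such a tailored script and adds output attributes to a catalogue object; the command names, file names and the RLS call are assumptions, not Dulcinea's actual code.

```python
# Rough illustration only: the command names, file names and the RLS call are
# assumptions, not Dulcinea's actual implementation.


def make_wrapper(transformation: str, args: list[str], input_files: list[str]) -> str:
    """Build the tailored shell script that runs one ATLAS transformation."""
    lines = ["#!/bin/sh"]
    # 1. Create a POOL file catalogue for the input files (placeholder tool name)
    lines += [f"registerToPoolCatalog {f}" for f in input_files]
    # 2. Unpack the ATLAS transformations tarball (staged in by the ARC gatekeeper)
    lines.append("tar xzf transformations.tar.gz")
    # 3. Call the transformation requested by the supervisor
    lines.append(f"./{transformation} {' '.join(args)}")
    # 4. Write the Don Quijote attributes of the outputs to an XML file (placeholder)
    lines.append("./writeMetadata.sh > metadata.xml")
    return "\n".join(lines) + "\n"


def register_outputs(rls, metadata: dict[str, dict[str, str]]) -> None:
    """Add the per-file Don Quijote attributes to the RLS catalogue."""
    for lfn, attributes in metadata.items():
        for name, value in attributes.items():
            rls.add_attribute(lfn, name, value)   # stand-in for the Globus RLS API
```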


Dulcinea performance

At most 2 Dulcinea executor instances ran at any time
Up to 5000 jobs handled by each executor without major problems
– can run unattended for several days
A few serious problems:
– very long supervisor start-up times (recovering accumulated jobs)
– transfer of large XML messages between the supervisor and the executor can render the system unresponsive for long periods of time

[Diagram: Dulcinea executor + supervisor instances in ATLAS DC2 production]


ARC-connected resources for DC2

 #  Site                        Country      ~# CPUs   ~% Dedicated
 1  atlas.hpc.unimelb.edu.au    Australia         28        30%
 2  genghis.hpc.unimelb.edu.au  Australia         90        20%
 3  charm.hpc.unimelb.edu.au    Australia         20       100%
 4  lheppc10.unibe.ch           Switzerland       12       100%
 5  lxsrv9.lrz-muenchen.de      Germany          234         5%
 6  atlas.fzk.de                Germany          884         5%
 7  morpheus.dcgc.dk            Denmark           18       100%
 8  lscf.nbi.dk                 Denmark           32        50%
 9  benedict.aau.dk             Denmark           46        90%
10  fe10.dcsc.sdu.dk            Denmark          644         1%
11  grid.uio.no                 Norway            40       100%
12  fire.ii.uib.no              Norway            58        50%
13  grid.fi.uib.no              Norway             4       100%
14  hypatia.uio.no              Norway           100        60%
15  sigrid.lunarc.lu.se         Sweden           100        30%
16  sg-access.pdc.kth.se        Sweden           100        30%
17  hagrid.it.uu.se             Sweden           100        30%
18  bluesmoke.nsc.liu.se        Sweden           100        30%
19  ingrid.hpc2n.umu.se         Sweden           100        30%
20  farm.hep.lu.se              Sweden            60        60%
21  hive.unicc.chalmers.se      Sweden           100        30%
22  brenta.ijs.si               Slovenia          50       100%

Totals at peak: 7 countries, 22 sites, ~3000 CPUs (~700 dedicated)
7 Storage Services (in RLS), plus a few more storage facilities, ~12 TB
~1 FTE (1-3 persons) in charge of production
– At most 2 executor instances running simultaneously


ARC performance in ATLAS DC2

[Bar chart: good vs. failed jobs per site (0-6000 jobs) for the 22 ARC-connected sites]

Total number of successful jobs: 42,202 (as of September 25, 2004)
Failure rate before ATLAS ProdSys manipulations: 20%
 • ~1/3 of the failed jobs did not waste resources
Failure rate after ProdSys manipulations: 35%
Possible reasons for the higher rate:
 • Dulcinea failing to add Don Quijote attributes in RLS
 • Don Quijote renaming
 • Windmill re-submitting good jobs
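For a sense of scale, the snippet below turns these percentages into approximate job counts, under the assumption (not stated here) that each quoted rate is the failed fraction of all job attempts.

```python
# Back-of-the-envelope only: assumes each quoted failure rate is the failed
# fraction of all job attempts, which the slide does not spell out.
successful = 42_202

for rate in (0.20, 0.35):
    attempts = successful / (1 - rate)    # total attempts implied by the rate
    failed = attempts - successful
    print(f"{rate:.0%} failure rate -> ~{attempts:,.0f} attempts, ~{failed:,.0f} failed")
```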


Failure analysis

Dominant problem: hardware failures at the sites


Summary

ARC middleware and the Dulcinea executor provided stable services for ATLAS DC2
– 20+ sites from Norway to Australia operated as a single resource
– These sites contributed ~30% of the total ATLAS DC2 production
  • Despite offering the fewest ATLAS-dedicated resources
  • Originally committed to provide only 20%
Performed extremely well compared to other Grid systems
– Negligible number of middleware-related problems
  • Apart from the initial instability of the Globus RLS, a common problem
– Needed an order of magnitude less human effort than both Grid3 and LCG
– Produced the same amount of data with far fewer resources, owing to higher resource-usage efficiency
Dulcinea and ARC helped to prove the validity of the ATLAS Production System concept
Problems still to solve:
– Safeguards against site-specific hardware failures
– Improvement of the ATLAS Production System