Performance of the NorduGrid ARC and the Dulcinea Executor in ATLAS Data Challenge 2
Oxana Smirnova (Lund University/CERN) for the NorduGrid collaboration
CHEP 2004, Interlaken, 2004-09-29
2004-09-29 www.nordugrid.org 2

NorduGrid
NorduGrid is a research collaboration established by universities in Denmark, Estonia, Finland, Norway and Sweden
– Focuses on providing production-quality Grid middleware for academic researchers
– Triggered by the needs of LHC experiments
Close cooperation with other Grid projects:
– EU DataGrid (2001-2003)
– SWEGRID
– NDGF
– LCG
– EGEE
Assistance in Grid deployment outside the Nordic area

Advanced Resource Connector (ARC)
ARC is the Grid middleware developed by the NorduGrid
– Based on Globus libraries and API
– Original architectural solutions, services and implementations
– Supports one of the largest functional Grid-like systems
• 10 countries, 40+ sites, ~4000 CPUs, ~30 TB storage

ATLAS Data Challenge 2
Mass simulation of future data taking
– Event generation, detector simulation
– Test of Tier0 operation: “raw data” processing, distribution of output to regional centers
– Distributed analysis
Duration: Summer-Fall 2004
New ATLAS software
An automated production system
Resources come via available Grid systems
– Grid3 (USA), LCG, NorduGrid + other ARC-enabled sites
Test of the ATLAS Computing Model

ATLAS Production System
Thin application-specific layer on top of the Grid and legacy systems
– “Don Quijote” is a data management system, interfacing to Grid data indexing services (RLS)
– Production Database (ProdDB) holds job definitions and status records
– “Windmill” is the supervisor; it mediates between the ProdDB and the executors
– Executors use Grid-specific API to schedule and manipulate the jobs
• “Capone”: Grid3
• “Dulcinea”: ARC
• “Lexor”: LCG2
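
The supervisor/executor split described above can be illustrated with a minimal Python sketch. It is not the Windmill or executor code: the `Executor` class, the `supervise` function, the record fields and the job handles are all hypothetical, and a plain dictionary stands in for the Production Database. Only the executor names and their target Grids come from the slide.

```python
from dataclasses import dataclass, field


@dataclass
class Executor:
    """Stand-in for a Grid-specific executor; it only records submissions."""
    name: str
    grid: str
    submitted: list = field(default_factory=list)

    def submit(self, job_id):
        self.submitted.append(job_id)
        return f"{self.name}/{job_id}"  # hypothetical Grid job handle


# Executor names and target Grids as listed on the slide.
EXECUTORS = {e.grid: e for e in (
    Executor("Capone", "Grid3"),
    Executor("Dulcinea", "ARC"),
    Executor("Lexor", "LCG2"),
)}


def supervise(prod_db):
    """Pass every 'defined' job to the executor for its Grid and
    write the new status back into the (stand-in) production database."""
    for job_id, record in prod_db.items():
        if record["status"] != "defined":
            continue  # already submitted or finished
        handle = EXECUTORS[record["grid"]].submit(job_id)
        record.update(status="submitted", handle=handle)
    return prod_db


# Assumed record layout: one dict per job, keyed by a job identifier.
prod_db = {
    "job-1": {"grid": "ARC", "status": "defined"},
    "job-2": {"grid": "LCG2", "status": "defined"},
    "job-3": {"grid": "ARC", "status": "done"},
}
supervise(prod_db)
```

In this sketch the supervisor never talks to a Grid directly; it only chooses an executor by Grid name, which is the point of keeping the application-specific layer thin.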

Dulcinea implementation
Implemented in C++, compiled as a shared library
– Shared library imported into Python
Wraps ATLAS jobs into a tailored script that:
– Creates POOL file catalog for the input files
– Untars the ATLAS transformations tarball
– Calls the transformation requested by the Windmill
– Creates an XML file with metadata (Don Quijote attributes) for the output results
Calls ARC User Interface API and Globus RLS API
– File transfer is handled entirely by the ARC gatekeeper
– No internal tracking of jobs, relies on the ARC Information System
– Can avoid problematic sites using a “blacklist”
Fetches the XML file for each job and adds the attributes to the RLS catalogue
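
The metadata round trip and the site blacklist can be sketched in a few lines of Python. This is an illustration under stated assumptions, not Dulcinea's code: the XML element names, the attribute keys, the file name and the helper functions are invented for the example, and a dictionary stands in for the RLS catalogue. Only Python's standard `xml.etree.ElementTree` module is used.

```python
import xml.etree.ElementTree as ET


def write_metadata(lfn, attributes):
    """Serialise output-file attributes to XML (element names are made up)."""
    root = ET.Element("outputfile", lfn=lfn)
    for key, value in attributes.items():
        ET.SubElement(root, "attribute", name=key).text = str(value)
    return ET.tostring(root, encoding="unicode")


def read_metadata(xml_text):
    """Parse the XML back into (lfn, attributes) on the executor side."""
    root = ET.fromstring(xml_text)
    attrs = {a.get("name"): a.text for a in root.findall("attribute")}
    return root.get("lfn"), attrs


def usable_sites(sites, blacklist):
    """Drop blacklisted sites, keeping the original order."""
    return [s for s in sites if s not in blacklist]


# Job side: the wrapper script would leave behind something like this.
xml_text = write_metadata("dc2.example.pool.root", {"events": 50, "size": 1234})

# Executor side: fetch the XML and add its attributes to the catalogue
# (a plain dict here, standing in for the RLS catalogue).
catalogue = {}
lfn, attrs = read_metadata(xml_text)
catalogue[lfn] = attrs

# Site selection: skip a site that has been causing trouble.
sites = usable_sites(
    ["grid.uio.no", "farm.hep.lu.se", "brenta.ijs.si"],
    blacklist={"farm.hep.lu.se"},
)
```

Attribute values come back as strings after parsing, so a real catalogue client would need to convert them to whatever types the catalogue expects.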

Dulcinea performance
At most 2 Dulcinea executor instances ran at any one time
Up to 5000 jobs handled by each such executor without major problems
– can run unattended for several days
Few serious problems:
– very long startup times of the supervisor (recovering accumulated jobs)
– transfer of large XML messages between the supervisor and the executor can render the system unresponsive for long periods of time
Dulcinea executor+supervisor instances in ATLAS DC2 production

ARC-connected resources for DC2
 #  Site                         Country       ~# CPUs  ~% Dedicated
 1  atlas.hpc.unimelb.edu.au     Australia          28   30%
 2  genghis.hpc.unimelb.edu.au   Australia          90   20%
 3  charm.hpc.unimelb.edu.au     Australia          20  100%
 4  lheppc10.unibe.ch            Switzerland        12  100%
 5  lxsrv9.lrz-muenchen.de       Germany           234    5%
 6  atlas.fzk.de                 Germany           884    5%
 7  morpheus.dcgc.dk             Denmark            18  100%
 8  lscf.nbi.dk                  Denmark            32   50%
 9  benedict.aau.dk              Denmark            46   90%
10  fe10.dcsc.sdu.dk             Denmark           644    1%
11  grid.uio.no                  Norway             40  100%
12  fire.ii.uib.no               Norway             58   50%
13  grid.fi.uib.no               Norway              4  100%
14  hypatia.uio.no               Norway            100   60%
15  sigrid.lunarc.lu.se          Sweden            100   30%
16  sg-access.pdc.kth.se         Sweden            100   30%
17  hagrid.it.uu.se              Sweden            100   30%
18  bluesmoke.nsc.liu.se         Sweden            100   30%
19  ingrid.hpc2n.umu.se          Sweden            100   30%
20  farm.hep.lu.se               Sweden             60   60%
21  hive.unicc.chalmers.se       Sweden            100   30%
22  brenta.ijs.si                Slovenia           50  100%
Totals at peak:
7 countries
22 sites
~3000 CPUs
– dedicated ~700
7 Storage Services (in RLS)
– few more storage facilities
– ~12TB
~1FTE (1-3 persons) in charge of production
– At most 2 executor instances simultaneously
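
As a rough cross-check of the totals, the per-site figures can be summed in a few lines of Python. The per-site numbers are approximate and the quoted totals are peak values, so only rough agreement should be expected: the rows sum to about 2900 CPUs, of which about 600 are dedicated, against the quoted ~3000 and ~700. Counting countries by top-level domain is an assumption of this sketch.

```python
# (site, approximate CPUs, approximate percent dedicated), copied from the table.
SITES = [
    ("atlas.hpc.unimelb.edu.au", 28, 30),
    ("genghis.hpc.unimelb.edu.au", 90, 20),
    ("charm.hpc.unimelb.edu.au", 20, 100),
    ("lheppc10.unibe.ch", 12, 100),
    ("lxsrv9.lrz-muenchen.de", 234, 5),
    ("atlas.fzk.de", 884, 5),
    ("morpheus.dcgc.dk", 18, 100),
    ("lscf.nbi.dk", 32, 50),
    ("benedict.aau.dk", 46, 90),
    ("fe10.dcsc.sdu.dk", 644, 1),
    ("grid.uio.no", 40, 100),
    ("fire.ii.uib.no", 58, 50),
    ("grid.fi.uib.no", 4, 100),
    ("hypatia.uio.no", 100, 60),
    ("sigrid.lunarc.lu.se", 100, 30),
    ("sg-access.pdc.kth.se", 100, 30),
    ("hagrid.it.uu.se", 100, 30),
    ("bluesmoke.nsc.liu.se", 100, 30),
    ("ingrid.hpc2n.umu.se", 100, 30),
    ("farm.hep.lu.se", 60, 60),
    ("hive.unicc.chalmers.se", 100, 30),
    ("brenta.ijs.si", 50, 100),
]

# Sum of all CPUs, and of the CPUs weighted by the dedicated fraction.
total_cpus = sum(cpus for _, cpus, _ in SITES)
dedicated_cpus = sum(cpus * pct / 100 for _, cpus, pct in SITES)

# Assumption: one country per top-level domain (.au, .ch, .de, ...).
countries = {site.rsplit(".", 1)[-1] for site, _, _ in SITES}

print(f"{len(SITES)} sites in {len(countries)} countries")
print(f"{total_cpus} CPUs, about {dedicated_cpus:.0f} dedicated")
```

Two large shared clusters (atlas.fzk.de and fe10.dcsc.sdu.dk) account for more than half of the CPU count but only a small part of the dedicated capacity, which is why the dedicated total is so much lower than the overall one.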

ARC performance in ATLAS DC2
[Figure: bar chart of good and failed jobs per site, vertical axis 0 to 6000 jobs. Sites in axis order: bluesmoke.nsc.liu.se, grid.uio.no, hypatia.uio.no, atlas.hpc.unimelb.edu.au, sg-access.pdc.kth.se, benedict.aau.dk, lxsrv9.lrz-muenchen.de, brenta.ijs.si, farm.hep.lu.se, lheppc10.unibe.ch, sigrid.lunarc.lu.se, hagrid.it.uu.se, fire.ii.uib.no, fe10.dcsc.sdu.dk, ingrid.hpc2n.umu.se, atlas.fzk.de, morpheus.dcgc.dk, genghis.hpc.unimelb.edu.au, charm.hpc.unimelb.edu.au, hive.unicc.chalmers.se, lscf.nbi.dk, grid.fi.uib.no]
Total # of successful jobs: 42202 (as of September 25, 2004)
Failure rate before ATLAS ProdSys manipulations: 20%
• ~1/3 of failed jobs did not waste resources
Failure rate after: 35%
Possible reasons:
• Dulcinea failing to add DQ attributes in RLS
• DQ renaming
• Windmill re-submitting good jobs

Failure analysis
Dominant problem: hardware accidents

Summary
ARC middleware and the Dulcinea executor provided stable services for ATLAS DC2
– 20+ sites from Norway to Australia operate as a single resource
– These sites contributed ~30% to the total ATLAS DC2 production
• Despite offering the least ATLAS-dedicated resources
• Originally committed to provide only 20%
Performed extremely well compared to other Grid systems
– Negligible amount of middleware-related problems
• Save the initial instability of the Globus RLS, a common problem
– Needed an order of magnitude less human effort compared to both Grid3 and LCG
– Produced the same amount of data with far fewer resources due to higher resource usage efficiency
Dulcinea & ARC helped to prove the validity of the ATLAS Production System concept
Problems still to solve:
– Safeguard against site-specific hardware failures
– Improvement of the ATLAS Production System