Status of the use of LCG
Stefano Bagnasco, INFN Torino
ALICE Offline Week
CERN June 1, 2004
Being bold: from AliEn to a Meta-Grid
● LCG-2 core sites: CERN, CNAF, FZK, NIKHEF, RAL, Taiwan, Cyfronet, IC, Cambridge (more than 1000 CPUs)
● GRID.IT sites: LNL.INFN, PD.INFN and several smaller ones (about 400 CPUs, not including CNAF)
● The pull model is well suited for implementing higher-level submission systems, since it does not require knowledge of the periphery, which may be very complex:
“A Grid is a system that […] coordinates resources that are not subject to centralized control […] using standard, open, general-purpose protocols and interfaces […] to deliver nontrivial qualities of service.”
I. Foster, “What is the Grid? A Three Point Checklist”, Grid Today (2002)
From AliEn to a Meta-Grid – cont’d
Design strategy:
● Use AliEn as a general front-end
   Owned and shared resources are exploited transparently
● Minimize points of contact between the systems
   No need to reimplement services etc.
   No special services required to run on remote CE/WNs
● Make full use of provided services: Data Catalogues, scheduling, monitoring…
   Let the Grids do their jobs (they should know how)
● Use high-level tools and APIs to access Grid resources
   Developers put a lot of abstraction effort into hiding the complexity and shielding the user from implementation changes
Implementation
● Manage LCG (and Grid.it) resources through a “gateway”: an AliEn client (CE+SE) sitting on top of an LCG User Interface
   The whole of LCG computing is seen as a single, large AliEn CE associated with a single, large SE
● Job management interface
   JDL translation (incl. InputData statements)
   JobID bookkeeping
   Command proxying (Submit, Cancel, Status query…)
● Data management interface
   Put() and Get() implementation
   The AliEn PFN is the LCG GUID (plus the SE host, to allow for optimisation)
   AliEn knows nothing about LCG replicas (but RLS/ROS does!)
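The JDL-translation step can be sketched as follows; the dict fields and translation rules here are illustrative assumptions, not the real AliEn interface.

```python
# A minimal sketch of the JDL-translation step: an AliEn-side job
# description rewritten as LCG JDL text.  Field names and rules are
# illustrative assumptions, not the production code.

def translate_to_lcg_jdl(alien_job):
    """Build LCG JDL text from a simplified AliEn job description."""
    lines = [
        'Executable = "%s";' % alien_job["executable"],
        'Arguments = "%s";' % " ".join(alien_job.get("arguments", [])),
    ]
    # Each InputData entry implies a remote catalogue query at
    # submission time, which is what can slow submission down.
    input_data = alien_job.get("input_data", [])
    if input_data:
        items = ", ".join('"lfn:%s"' % lfn for lfn in input_data)
        lines.append("InputData = {%s};" % items)
    return "\n".join(lines)

jdl = translate_to_lcg_jdl({
    "executable": "aliroot",
    "arguments": ["-b", "-q", "grunPPR.C"],
    "input_data": ["/alice/sim/2004/ConfigPPR.C"],
})
```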
Interfacing AliEn and LCG
[Diagram: at the Interface Site, an AliEn CE and AliEn SE sit on top of an LCG UI. The AliEn Server submits jobs through them to the LCG RB, which dispatches to an EDG CE and its WNs at an LCG site with an LCG SE; status reports flow back the same way. Data registration: the AliEn Data Catalogue records PFN = LFN, and the LFN/PFN pair is registered in the LCG Replica Catalogue.]
Implementation details

● Job management interface
   JDL translation (incl. InputData statements)
   • Remote catalogue query for each job – may slow down submission significantly
   JobID bookkeeping
   • AliEn job states do not map onto LCG ones
   • Job states from the two systems are often out of sync (LCG bookkeeping is slower)
   Command proxying (Submit, Cancel, Status query…)
   • Have to take care of LCG L&B delays
● Data management interface
   Put() and Get() implementation
   The AliEn PFN is the LCG GUID (plus the SE host, to allow for optimisation)
   • lcg://tbed0115.cern.ch/8f6064e7-b580-11d8-ba2a-f8adc5715cd0
   AliEn knows nothing about LCG replicas (but RLS/ROS does!)
   • Again, let the RB do its job!
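The PFN convention (LCG GUID plus SE host, in an lcg:// URL) can be parsed with a small helper; this is a sketch, not the production implementation.

```python
# Parse the AliEn PFN convention (LCG GUID plus the SE host) shown in
# the slides, e.g. lcg://tbed0115.cern.ch/8f6064e7-....
# Minimal illustrative helper, not the production code.
from urllib.parse import urlparse

def split_alien_pfn(pfn):
    """Return (se_host, guid) from an lcg:// style PFN."""
    parsed = urlparse(pfn)
    if parsed.scheme != "lcg":
        raise ValueError("not an lcg:// PFN: %r" % pfn)
    return parsed.netloc, parsed.path.lstrip("/")

se_host, guid = split_alien_pfn(
    "lcg://tbed0115.cern.ch/8f6064e7-b580-11d8-ba2a-f8adc5715cd0")
```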
Cheating: two grids, same resources!
[Diagram: a user submits jobs to the AliEn Server, which routes them either to a native AliEn CE/SE or, through the AliEn CE on an LCG UI and the LCG RB, to the site’s LCG CE/SE; both paths end on the same pool of worker nodes.]
● “Double access” for selected sites (CNAF and CT.INFN)
Software installation
● Both AliEn and AliRoot are installed via “plain” LCG jobs: do some checks, download tarballs, uncompress, build the environment script and publish the relevant tags
   A single command gets the list of available sites, sends the jobs everywhere and waits for completion. A full update on LCG-2 + GRID.IT (16 sites) takes ~30’ (if queues are not full…)
   Manual intervention is still needed at a few sites (e.g. CERN/LSF)
   Ready for integration into the AliEn automatic installation system
● Experiment software shared-area misconfiguration caused most of the trouble in the beginning
[Diagram: from the LCG UI, installAlice.sh/installAlice.jdl and installAliEn.sh/installAliEn.jdl jobs are sent to NIKHEF, Taiwan, RAL, CNAF, TO.INFN, …]
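The single-command fan-out described above can be sketched as a small driver. The `submit` and `poll` callables are hypothetical stand-ins for the real LCG UI commands, used here only so the sketch is self-contained.

```python
# Sketch of the installation fan-out: submit one install job per site,
# then poll until every job reaches a terminal state.  submit/poll are
# assumed stand-ins for the real LCG UI commands, not the actual tool.

def install_everywhere(sites, submit, poll):
    """Submit one install job per site, then wait for completion."""
    job_ids = {site: submit(site) for site in sites}
    final = {}
    while len(final) < len(job_ids):
        for site, jid in job_ids.items():
            status = poll(jid)
            if site not in final and status in ("Done", "Aborted"):
                final[site] = status
    return final

# Toy stand-ins so the sketch runs end to end:
def fake_submit(site):
    return "job-" + site

def fake_poll(job_id):
    return "Done"

result = install_everywhere(["CNAF", "RAL", "Taiwan"], fake_submit, fake_poll)
```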
Monitoring

● AliEn job monitoring
   Through the interface “keyhole”
   State misalignment
● GridICE (LCG native monitoring)
   Only CE/SE load, integrated info
   No job monitoring (but coming soon)
● MonALISA sensors on the server
   Checking storage capacity etc.
   Looking directly at the LCG Info Providers via LDAP
● MonALISA sensors on the interface
   Job distribution
   Not much automatism (but then, they do the job)
The debugging loop
[Diagram: the AliEn status column (Queued → Started → Running → Saving → Done, plus Assigned) alongside the LCG status column (Ready → Waiting → Scheduled → Running → Done / Aborted / Error_*).
Log sources: logs from the LCG OutputSandbox; logs from the AliEn Catalogue; (some) logs from the LCG L&B.
Couldn’t save and ran out of WallClockTime? No logs at all!]
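Since the two state vocabularies differ, the interface has to report LCG progress in AliEn terms. A coarse lookup like the one below is one way to do it; the exact correspondence shown is an assumption for illustration, not the production mapping.

```python
# Coarse mapping from LCG job states to approximate AliEn ones.
# The correspondence is an assumed illustration, not the real table.

LCG_TO_ALIEN = {
    "Ready":     "Queued",
    "Waiting":   "Queued",
    "Scheduled": "Queued",
    "Running":   "Running",
    "Done":      "Done",
    "Aborted":   "Error",
}

def alien_state(lcg_state):
    """Map an LCG job state to an approximate AliEn one."""
    if lcg_state.startswith("Error"):
        return "Error"          # covers the whole Error_* family
    # Unknown or transient LCG states fall back to "Queued".
    return LCG_TO_ALIEN.get(lcg_state, "Queued")
```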
PDC2004 - Status

● Statistics after round 1 (ended April 4): job distribution
   Alice::CERN::LCG is the interface to LCG-2
   Alice::Torino::LCG is the interface to GRID.IT
AliEn vs. AliEn+LCG

● LCG-2 jobs seen through AliEn MonALISA monitoring
   Ramp-up slope shows no major performance degradation
[Plot: running jobs over time, broken down into an AliEn native site, LCG-2 and GRID.IT]
Grid.it starting up
Larger sites filled first
Phase II: Reconstruction

● Will need local storage for all sites
   So will need to use native LCG storage: SRM in front of CASTOR, but the ugly “Classic SE” in front of disk pools; SRM-to-dCache is not working yet
● Interface system available, installed everywhere and tested on the EIS testbed
   SRM everywhere would simplify things a lot
   Use of GUIDs for files also simplified things
   Eagerly waiting to see if large-scale production works…
● Many sites have ludicrously small storage
   LCG management is informed…
Phase II
[aliendb3.cern.ch:3307] /proc/296915/ > ls -l
-rwxr-xr-x sbagnasc z2  40190 Jun 16 19:06 ConfigPPR.C
-rwxr-xr-x sbagnasc z2   9416 Jun 16 19:10 EMCAL.Hits.root
-rwxr-xr-x sbagnasc z2   8538 Jun 16 19:11 FMD.Hits.root
-rwxr-xr-x sbagnasc z2 568379 Jun 16 19:12 galice.root
-rwxr-xr-x sbagnasc z2    405 Jun 16 19:06 grunPPR.C
-rwxr-xr-x sbagnasc z2  13703 Jun 16 19:12 ITS.Hits.root
-rwxr-xr-x sbagnasc z2  19652 Jun 16 19:13 Kinematics.root
-rwxr-xr-x sbagnasc z2   9552 Jun 16 19:14 MUON.Hits.root
-rwxr-xr-x sbagnasc z2   7195 Jun 16 19:14 PHOS.Hits.root
-rwxr-xr-x sbagnasc z2   6822 Jun 16 19:15 PMD.Hits.root
-rwxr-xr-x sbagnasc z2   1431 Jun 16 19:20 resources
-rwxr-xr-x sbagnasc z2   7222 Jun 16 19:15 RICH.Hits.root
-rwxr-xr-x sbagnasc z2  10608 Jun 16 19:16 START.Hits.root
-rwxr-xr-x sbagnasc z2   2105 Jun 16 19:20 stderr
-rwxr-xr-x sbagnasc z2 277336 Jun 16 19:20 stdout
-rwxr-xr-x sbagnasc z2   9799 Jun 16 19:17 TOF.Hits.root
-rwxr-xr-x sbagnasc z2  35321 Jun 16 19:17 TPC.Hits.root
-rwxr-xr-x sbagnasc z2 100688 Jun 16 19:18 TrackRefs.root
-rwxr-xr-x sbagnasc z2  32475 Jun 16 19:19 TRD.Hits.root
-rwxr-xr-x sbagnasc z2  10992 Jun 16 19:19 VZERO.Hits.root
-rwxr-xr-x sbagnasc z2   8069 Jun 16 19:20 ZDC.Hits.root
[aliendb3.cern.ch:3307] /proc/296915/ > whereis -o EMCAL.Hits.root
And the file is in Alice::CERN::LCGtest
lcg://tbed0115.cern.ch/139fff74-bfb8-11d8-91c1-be96c4d6b0ee
● The full system was tested on the EIS testbed
   With a somewhat simpler JDL than for production
   Awkward to do tests on the real production infrastructure
Conclusions & lessons learned - 1

● In the first half of Phase I, LCG (+ Grid.it) was able to provide ~50% of the computing cycles
   The fraction was reduced afterwards, but we should be able to get it back
● LCG Workload Management proved itself reasonably robust
   We never lost a job because of RB failures
● Remote site configuration is still the major source of problems on the LCG side, and things are unlikely to get better with the use of LCG storage…
   Software management tools are still rudimentary
   Large sites often have tighter security restrictions and other idiosyncrasies
   Investigating and fixing problems is hard and time-consuming
Conclusions & lessons learned - 2

● The most difficult part of the management is monitoring LCG through a “keyhole”
   Only integrated information is available natively: MonALISA for AliEn, GridICE for LCG
   Some safety mechanisms are too coarse for this approach (queue blocking)
● Anti-blackhole mechanisms are being studied in Grid.it
   Based on run-time statistics, the CPUTime/WallClockTime ratio…
● For short jobs, submission time (and thus the interface system performance) can limit the number of jobs
   But the system is trivially scalable
   InputData translation may slow things down significantly
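The CPUTime/WallClockTime idea can be sketched as a simple check; the thresholds below are arbitrary illustrative choices, not Grid.it’s actual tuning.

```python
# Sketch of an anti-black-hole check based on the CPUTime/WallClockTime
# ratio: a node that "runs" many jobs while burning almost no CPU time
# is suspicious.  Thresholds are assumed values for illustration.

def looks_like_black_hole(cpu_seconds, wall_seconds,
                          min_wall=600.0, threshold=0.1):
    """Flag a node whose CPU/wall-clock ratio is anomalously low."""
    if wall_seconds < min_wall:
        return False  # not enough running time to judge
    return cpu_seconds / wall_seconds < threshold
```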