BaBarGrid
GridPP10 Meeting
CERN, June 3rd 2004
Roger Barlow, Manchester University

1: Simulation
2: Data Distribution: The SRB
3: Distributed Analysis
1: Grid-based simulation (Fergus Wilson + Co.)
• Using existing UK farms (80 CPUs)
• A dedicated process at RAL merges the output and sends it to SLAC
• Using VDT Globus rather than LCG (see the job-submission sketch below). Why?
– Installation difficulty and reliability/stability problems with LCG.
– VDT Globus is a subset of LCG: running on an LCG system is perfectly possible (in principle).
– US groups talk of using GRID3. VDT Globus is also a subset of GRID3, but GRID3 and LCG differ; it would be a mistake to rely on LCG-specific features.
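As a rough illustration of this approach, the sketch below shows how per-run simulation jobs might be pushed to a farm through the standard VDT Globus client tools (globus-job-submit / globus-job-status). The gatekeeper contact, wrapper script and run numbers are invented for the example; this is not the production configuration.

    # Minimal sketch: submitting simulation jobs through VDT Globus GRAM.
    # The gatekeeper contact and wrapper script path are hypothetical.
    import subprocess

    GATEKEEPER = "farm.example.ac.uk/jobmanager-pbs"  # hypothetical contact string

    def submit(run_number):
        """Submit one simulation run; return the GRAM job contact URL."""
        result = subprocess.run(
            ["globus-job-submit", GATEKEEPER,
             "/opt/babar/bin/run-simprod.sh",   # hypothetical wrapper script
             str(run_number)],
            capture_output=True, text=True, check=True)
        return result.stdout.strip()

    def status(contact):
        """Query GRAM for the job state (PENDING, ACTIVE, DONE, FAILED)."""
        result = subprocess.run(["globus-job-status", contact],
                                capture_output=True, text=True, check=True)
        return result.stdout.strip()

    if __name__ == "__main__":
        jobs = [submit(run) for run in range(1001, 1006)]  # five illustrative runs
        for contact in jobs:
            print(contact, status(contact))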
Current situation
5 million events in official production since 7th March. Best week (so far!): 1.6 million events.
Now producing at RHUL and Bristol; Manchester and Liverpool follow in ~2 weeks, then QMUL and Brunel. Four farms will produce 3-4 million events a week.
Sites are cooperative (they need to install the BaBar Conditions Database, which uses Objectivity).
The major problem has been firewalls: they interact in complicated ways with all the communication channels and ports, and identifying the source of a failure has been hard (a port probe is sketched below).
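Much of the firewall pain comes from Globus needing known inbound ports: the GRAM gatekeeper listens on TCP 2119, GridFTP on 2811, and callbacks use the ephemeral window a site advertises in GLOBUS_TCP_PORT_RANGE. A simple connectivity probe along these lines can help localise which hop is blocking traffic; the host name and the 40000-40050 callback range are placeholders for a site's actual settings.

    # Quick probe: can we reach the Globus service ports on a farm gatekeeper?
    # 2119 (GRAM) and 2811 (GridFTP) are the standard Globus Toolkit ports;
    # 40000-40050 stands in for a site's GLOBUS_TCP_PORT_RANGE window.
    import socket

    HOST = "farm.example.ac.uk"   # hypothetical gatekeeper host
    PORTS = [2119, 2811] + list(range(40000, 40051))

    def reachable(host, port, timeout=3.0):
        try:
            with socket.create_connection((host, port), timeout=timeout):
                return True
        except OSError:
            return False

    for port in PORTS:
        state = "open" if reachable(HOST, port) else "BLOCKED"
        print(f"{HOST}:{port} {state}")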
What the others are doing
• The Italians and Germans are going the full-blown LCG route
• Objectivity database served through networked AMS servers (need one server per ~30 processes; see the arithmetic below)
• Otherwise they assume the BaBar environment is available at the remote hosts
Our approaches will converge one day
• Meanwhile, they will try sending jobs to RAL and we will try sending jobs to Ferrara.
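The one-server-per-~30-processes ratio sets the AMS server count a site must budget for. A back-of-envelope sketch, with illustrative farm sizes rather than real site inventories:

    # Back-of-envelope: AMS servers needed at one server per ~30 processes.
    # Farm sizes below are illustrative, not a real site inventory.
    import math

    PROCESSES_PER_SERVER = 30  # ratio quoted on the slide

    for name, processes in [("80-CPU UK farm", 80),
                            ("300-process Tier-A share", 300)]:
        servers = math.ceil(processes / PROCESSES_PER_SERVER)
        print(f"{name}: {processes} processes -> {servers} AMS servers")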
Future
• Keep production running.
• Test an LCG interface (RAL? Ferrara? Manchester Tier 2?) when we have the manpower. This will give more functionality and stability in the long term.
• Smooth and streamline the process.
SLAC/BaBar
Richard P. Mount, SLAC
May 20, 2004
2: Data Distribution and the SRB
SLAC-BaBar Computing Fabric
[Diagram: client machines connect through a Cisco IP network to disk servers, which connect through a second Cisco IP network to tape servers.]
• Clients: 1500 dual-CPU Linux and 900 single-CPU Sun/Solaris machines.
• Disk servers: 120 dual/quad-CPU Sun/Solaris machines with 400 TB of Sun FibreChannel RAID arrays, running the Objectivity/DB object database plus HEP-specific ROOT software (Xrootd); a client-access sketch follows below.
• Tape servers: 25 dual-CPU Sun/Solaris machines with 40 STK 9940B drives, 6 STK 9840A drives and 6 STK Powderhorn silos holding over 1 PB of data, running HPSS plus SLAC enhancements to the Objectivity and ROOT server code.
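The Xrootd layer lets a client see the whole disk fabric through a single root:// namespace. A minimal client-side sketch, assuming a PyROOT installation with Xrootd support; the server host, file path and tree name are invented for illustration:

    # Minimal sketch of client access through Xrootd with PyROOT.
    # The server host, file path and tree name are hypothetical.
    import ROOT

    f = ROOT.TFile.Open("root://xrootd.example.edu//store/babar/run1001/events.root")
    if f and not f.IsZombie():
        tree = f.Get("events")          # hypothetical tree name
        print("entries:", tree.GetEntries())
        f.Close()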
BaBar Tier-A Centers
A component of the Fall 2000 BaBar Computing Model
• Offer resources at the disposal of BaBar;
• Each provides tens of percent of the total BaBar computing/analysis need;
– 50% of the BaBar computing investment was in Europe in 2002 and 2003
• CCIN2P3, Lyon, France: in operation for 3+ years;
• RAL, UK: in operation for 2+ years;
• INFN-Padova, Italy: in operation for 2 years;
• GridKA, Karlsruhe, Germany: in operation for 1 year.
SLAC-PPDG Grid Team
• Richard Mount (10%): PI
• Bob Cowles (10%): strategy and security
• Adil Hasan (50%): BaBar data management
• Andy Hanushevsky (20%): Xrootd, security, …
• Matteo Melani (80%): new hire
• Wilko Kroeger (100%): SRB data distribution
• Booker Bense (80%): Grid software installation
• Post-doc (50%): BaBar-OSG
Network/Grid Traffic
SLAC-BaBar-OSG
• BaBar-US has been:
– Very successful in deploying Grid data distribution (SRB US-Europe; see the sketch below)
– Far behind BaBar-Europe in deploying Grid job execution (theirs is in production for simulation)
• SLAC-BaBar-OSG plan:
– Focus on achieving massive simulation production in the US within 12 months
– Make 1000 SLAC processors part of OSG
– Run BaBar simulation on SLAC and non-SLAC OSG resources
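For context, SRB data distribution is driven from the SDSC Scommand client tools (Sinit, Sput, Sls, Sexit). A minimal transfer-script sketch; the collection path and file name are invented for illustration, and a real session reads the user's SRB configuration files:

    # Minimal sketch of SRB data distribution using the SDSC Scommand
    # clients. Collection and file names are hypothetical.
    import subprocess

    def srb(*args):
        subprocess.run(list(args), check=True)

    srb("Sinit")                                # start an SRB session (reads user config)
    srb("Sput", "run1001-events.root",          # push a local file...
        "/home/babar.slac/simprod/run1001")     # ...into a hypothetical SRB collection
    srb("Sls", "/home/babar.slac/simprod")      # list the collection to confirm arrival
    srb("Sexit")                                # close the session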
3: Distributed Analysis
At GridPP9:
Good news: a basic grid job submission system (Alibaba/Gsub) is deployed and working with the GANGA portal.
Bad news: low take-up, because of
• Uninterested users
• Poor reliability
Since then…
Janusz:
• Improve portal
• Develop web-based version
Alessandra:
• Move to Tier 2 system manager post
James:
• Starts June 14th
• Attended GridPP10 meeting
Mike:
• Give talk at IoP parallel session
• Write abstract (accepted) for All Hands meeting
• Write thesis
Roger:
• Submit Proforma 3
• Complete quarterly progress report
• Revise Proforma 3
• Advertise and recruit replacement post
• Negotiate on revised Proforma 3
• Write abstract (pending) for CHEP
• Submit JeSRP-1
• Write contribution for J Phys G Grid article
Future two-point plan (1)
• James to review, revise and relaunch the job submission system
• Work with the UK Grid/SP team (short term) and the Italian/German LCG system (long term)
• Improve reliability through a core team of users on a development system
Future two-point plan (2)
RAL CPUs are very heavily loaded by BaBar; slow turnround means stressed users.
Make significant CPU resources available to BaBar users only through the Grid:
• Some of the new Tier 1/A resources
• All of the Tier 2 (Manchester) resources
And watch Grid certificate take-up grow! (A proxy-check sketch follows below.)
Drive Grid usage through incentives
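For a new user, the certificate barrier is small once spelled out: with a certificate installed, a short-lived proxy is all that job submission needs. A minimal check-then-create sketch using the standard Globus proxy tools; the final submission command is a hypothetical stand-in for whatever Gsub/GANGA invocation a site actually uses:

    # Minimal sketch: ensure a valid Grid proxy exists before submitting.
    # grid-proxy-info / grid-proxy-init are the standard Globus tools;
    # the final "gsub" call is a hypothetical placeholder.
    import subprocess

    def proxy_seconds_left():
        """Return remaining proxy lifetime in seconds, or 0 if none."""
        result = subprocess.run(["grid-proxy-info", "-timeleft"],
                                capture_output=True, text=True)
        try:
            return int(result.stdout.strip())
        except ValueError:
            return 0

    if proxy_seconds_left() < 3600:                      # under an hour left?
        subprocess.run(["grid-proxy-init"], check=True)  # prompts for key passphrase

    subprocess.run(["gsub", "my-analysis-job.cfg"], check=True)  # hypothetical call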
Final Word
Our problems today will be your problems tomorrow