DØ Computing Model &
Monte Carlo & Data Reprocessing
Gavin Davies
Imperial College London
DOSAR Workshop, Sao Paulo, September 2005
Outline
- Operational status
  - Globally we continue to do well, a view shared by the recent Run II Computing Review
- DØ computing model
  - Ongoing, long-established plan
- Production computing
  - Monte Carlo
  - Reprocessing of Run II data: 10⁹ events reprocessed on the grid, the largest HEP grid effort
- Looking forward
- Conclusions
Snapshot of Current Status
- Reconstruction is keeping up with data taking
- Data handling is performing well
- Production computing is off-site and grid-based; it continues to grow and work well
  - Over 75 million Monte Carlo events produced in the last year
  - Run IIa data set being reprocessed on the grid: 10⁹ events
- Analysis CPU power has been expanded
- Globally we are doing well, a view shared by the recent Run II Computing Review
Computing Model
- Started with distributed computing, evolving to automated use of common tools/solutions on the grid (SAM-Grid) for all tasks
  - Scalable
  - Not alone: a joint effort with others at FNAL and elsewhere, LHC, ...
- 1997, original plan: all Monte Carlo to be produced off-site; SAM to be used for all data handling, providing a 'data-grid'
- Now: Monte Carlo and data reprocessing with SAM-Grid
- Next: other production tasks, e.g. fixing, and then user analysis
- Use concept of Regional Centres
  - DOSAR one of the pioneers
  - Builds local expertise
Reconstruction Release
- Periodically update the version of the reconstruction code
  - As new / more refined algorithms are developed
  - As understanding of the detector improves
- Frequency of releases decreases with time
  - One major release in the last year: p17, the basis for current Monte Carlo (MC) and data reprocessing
- Benefits of p17
  - Reco speed-up
  - Full calorimeter calibration
  - Fuller description of detector material
  - Use of zero-bias overlay for MC
- More details: http://cdinternal.fnal.gov/RUNIIRev/runIIMP05.asp
Data Handling - SAM
- SAM continues to perform well, providing a data-grid
  - 50 SAM sites worldwide
  - Over 2.5 PB (50 billion events) consumed in the last year
  - Up to 300 TB moved per month
- Larger SAM cache solved tape access issues
- Continued success of SAM shifters
  - Often remote collaborators
  - Form the first line of defense
- SAMTV monitors SAM & SAM stations
- http://d0db-prd.fnal.gov/sm_local/SamAtAGlance/
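The data-grid idea above, datasets defined by metadata rather than by explicit file lists, can be pictured with a toy catalogue. The dictionary layout and the `dataset` helper below are invented for illustration; they are not the real SAM interface, which uses a SQL-like dimension query language.

```python
# Toy metadata catalogue: a "dataset" is a declarative query over file
# metadata, not a hand-maintained list of file names.
files = [
    {"name": "reco_001.evpack", "data_tier": "tmb", "release": "p17"},
    {"name": "reco_002.evpack", "data_tier": "tmb", "release": "p14"},
    {"name": "raw_007.evpack",  "data_tier": "raw", "release": None},
]

def dataset(catalogue, **criteria):
    """Select file names whose metadata match all given key/value pairs."""
    return [f["name"] for f in catalogue
            if all(f.get(k) == v for k, v in criteria.items())]

print(dataset(files, data_tier="tmb", release="p17"))  # ['reco_001.evpack']
```

The point is the same as in SAM: because the selection is a query, the same dataset definition can be evaluated at any site that sees the catalogue.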
SAMGrid
- More than 10 DØ execution sites
- http://samgrid.fnal.gov:8080/
- http://samgrid.fnal.gov:8080/list_of_schedulers.php
- http://samgrid.fnal.gov:8080/list_of_resources.php
- SAM: data handling
- JIM: job submission & monitoring
- SAM + JIM → SAM-Grid
Remote Production Activities – Monte Carlo - I
- Over 75M events produced in the last year, at more than 10 sites
  - More than double last year's production
- Vast majority on shared sites; DOSAR a major part of this
- SAM-Grid introduced in spring 2004, becoming the default
  - Based on the request system and jobmanager-mc_runjob
  - MC software package retrieved via SAM in the same way everywhere, including the central farm
  - Average production efficiency ~90%
  - Average inefficiency due to grid infrastructure ~1-5%
  - http://www-d0.fnal.gov/computing/grid/deployment-issues.html
- Continued move to common tools
  - DOSAR sites continue the move to SAMGrid from McFarm
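As a sanity check on the figures above (over 75M events in a year at ~90% average production efficiency), a back-of-the-envelope sketch of the implied submission rate. The helper name is illustrative, not part of any DØ tool.

```python
# Rough arithmetic behind the Monte Carlo production numbers quoted
# above: 75M events/year at ~90% efficiency.

def required_submission_rate(events_per_year: float, efficiency: float) -> float:
    """Events that must be submitted per day to net the yearly target."""
    produced_per_day = events_per_year / 365.0
    return produced_per_day / efficiency

rate = required_submission_rate(75e6, 0.90)
print(f"~{rate / 1e3:.0f}k events submitted per day")  # roughly 228k/day
```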
Remote Production Activities – Monte Carlo - II
- Beyond just 'shared' resources
  - More than 17M events produced 'directly' on LCG via submission from Nikhef
  - A good example of a remote site driving the 'development'
- Similar momentum building on/for OSG
  - Two good site examples within p17 reprocessing
Remote Production Activities – Reprocessing - I
- After significant improvements to the reconstruction, old data are reprocessed
- p14: winter 2003/04
  - 500M events, 100M remotely, from DST
  - Based around mc_runjob
  - Distributed computing rather than grid
- p17: end of March to ~October
  - x10 larger, i.e. 1000M (10⁹) events, 250 TB
  - Basically all remote
  - From raw data, i.e. use of DB proxy servers
  - SAM-Grid as default (using mc_runjob)
  - 3200 1-GHz PIIIs for 6 months
  - Massive activity: the largest grid activity in HEP
- http://www-d0.fnal.gov/computing/reprocessing/p17/
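The p17 scale quoted above can be cross-checked with simple arithmetic: 10⁹ events on ~3200 1-GHz PIII equivalents over ~6 months implies roughly 50 CPU-seconds per event and several million events per day. A minimal sketch, with all inputs taken from the slide:

```python
# Back-of-the-envelope check of the p17 reprocessing scale.
events = 1e9
cpus = 3200
seconds = 6 * 30 * 24 * 3600.0   # ~6 months of wall time

cpu_seconds = cpus * seconds
sec_per_event = cpu_seconds / events
events_per_day = events / (seconds / 86400.0)

print(f"~{sec_per_event:.0f} CPU-seconds per event")   # ~50 s/event
print(f"~{events_per_day / 1e6:.1f}M events per day")  # ~5.6M/day
```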
Reprocessing - II
- Two stages: "Production" then "Merging" (shown as a workflow diagram)
- A grid job spawns many batch jobs
Reprocessing - III
- SAMGrid provides
  - A common environment & operation scripts at each site
  - Effective book-keeping
- SAM avoids data duplication and defines recovery jobs
- JIM's XML-DB used to ease bug tracing
- Tough deploying a product still under evolution, with limited manpower, to new sites (we are a running experiment)
- Very significant improvements in JIM (scalability) during this period
- Certification of sites: need to check
  - SAMGrid vs usual production
  - Remote sites vs central site
  - Merged vs unmerged files
  - e.g. FNAL vs SPRACE
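The recovery-job idea above reduces to set arithmetic: SAM records which files of a dataset were successfully consumed, so a recovery job covers the remainder. A minimal sketch; the function and file names are invented, not actual SAM API calls.

```python
# Recovery jobs as a set difference between what was requested and
# what has been successfully processed.

def recovery_files(requested: set[str], processed: set[str]) -> set[str]:
    """Files that still need (re)processing."""
    return requested - processed

requested = {"raw_001.evpack", "raw_002.evpack", "raw_003.evpack"}
processed = {"raw_001.evpack", "raw_003.evpack"}

print(sorted(recovery_files(requested, processed)))  # ['raw_002.evpack']
```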
Reprocessing - IV: Monitoring (illustration)
- http://samgrid.fnal.gov:8080/cgi-bin/plot_efficiency.cgi
- Plots of overall efficiency, speed, or by site
  - Speed: production speed in M events / day
  - Efficiency: number of batch jobs completing successfully
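The per-site monitoring quantities described above can be sketched in a few lines. The job-record layout is invented for illustration; the real plots are driven by JIM's monitoring database.

```python
# Aggregate per-site job efficiency and event throughput from a list
# of (site, succeeded, events_processed) records.
from collections import defaultdict

jobs = [
    ("FNAL",   True,  250_000),
    ("SPRACE", True,  200_000),
    ("SPRACE", False,       0),
    ("FNAL",   True,  300_000),
]

stats = defaultdict(lambda: {"done": 0, "total": 0, "events": 0})
for site, ok, events in jobs:
    s = stats[site]
    s["total"] += 1
    s["done"] += ok
    s["events"] += events

for site, s in sorted(stats.items()):
    eff = s["done"] / s["total"]
    print(f"{site}: efficiency {eff:.0%}, {s['events'] / 1e6:.2f}M events")
```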
Status – into the "end-game"
- Data sets all allocated, moving to 'cleaning-up'
  - ~855M events done
- Must now push on the Monte Carlo
SAM-Grid Interoperability
- Need access to greater resources as data sets grow
- Ongoing programme on LCG and OSG interoperability
- Step 1 (co-existence): use shared resources with a SAM-Grid head-node
  - Widely done for both reprocessing and MC
  - OSG co-existence shown for data reprocessing
- Step 2: SAMGrid-LCG interface
  - SAM does data handling & JIM the job submission; basically a forwarding mechanism
  - Prototype established at IN2P3/Wuppertal
  - Extending to production level
- OSG activity increasing, building on the LCG experience
- Team work between core developers / sites
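The forwarding mechanism of Step 2 can be pictured as a translation layer: a SAM-Grid job description is rewritten into LCG's Job Description Language before submission. A toy sketch only; the SAM-Grid-side field names are invented, while the JDL attributes shown are standard LCG ones.

```python
# Translate a (hypothetical) SAM-Grid job description into LCG-style
# JDL text, the essence of a job-forwarding layer.

def to_jdl(job: dict) -> str:
    lines = [
        f'Executable = "{job["executable"]}";',
        f'Arguments = "{job["arguments"]}";',
        'StdOutput = "stdout.log";',
        'StdError = "stderr.log";',
    ]
    return "\n".join(lines)

job = {"executable": "d0_reco.sh", "arguments": "--release p17"}
print(to_jdl(job))
```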
Looking Forward
- Increased data sets require increased resources for MC, reprocessing, etc.
- The route to these is increased use of the grid and common tools
- Have an ongoing joint programme, but work to do...
  - Continue development of SAM-Grid
    - Automated production job submission by shifters
    - Deployment team
  - Bring in new sites in a manpower-efficient manner
    - The 'benefit' of a new site goes well beyond a 'cpu' count; we appreciate / value this
  - Full interoperability
    - Ability to access all shared resources efficiently
- Additional resources for the above recommended by the Taskforce
Conclusions
- The computing model continues to be successful
  - Based around grid-like computing, using common tools
  - A key part of this is the production computing: MC and reprocessing
- Significant advances this year
  - Continued migration to common tools
  - Progress on interoperability, both LCG and OSG
    - Two reprocessing sites operating under OSG
  - p17 reprocessing: a tremendous success
    - Strongly praised by the Review Committee
  - DOSAR a major part of this; its more 'general' contribution also strongly acknowledged. Thank you
- Let's all keep up the good work
Back-up
Terms
- Tevatron
  - Approximately equivalent challenge to the LHC in "today's" money
  - Running experiments
- SAM (Sequential Access to Metadata)
  - Well-developed metadata and distributed data replication system
  - Originally developed by DØ & FNAL-CD
- JIM (Job Information and Monitoring)
  - Handles job submission and monitoring (all but data handling)
  - SAM + JIM → SAM-Grid, a computational grid
- Tools
  - Runjob: handles job workflow management
  - dØtools: user interface for job submission
  - dØrte: specification of runtime needs
Reminder of Data Flow
- Data acquisition (raw data in evpack format)
  - Currently limited to a 50 Hz Level-3 accept rate
  - Requested increase to 100 Hz, as planned for Run IIb (see later)
- Reconstruction (tmb/DST in evpack format)
  - Additional information in tmb → tmb++ (DST format stopped)
  - Sufficient for 'complex' corrections, including track fitting
- Fixing (tmb in evpack format)
  - Improvements / corrections coming after the cut of a production release
  - Centrally performed
- Skimming (tmb in evpack format)
  - Centralised event streaming based on reconstructed physics objects
  - Selection procedures regularly improved
- Analysis (output: ROOT histograms)
  - Common root-based Analysis Format (CAF) introduced in the last year
  - tmb format remains
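The skimming step above, centralised event streaming based on reconstructed physics objects, can be sketched as a set of filters. The event layout, skim names, and thresholds here are invented for illustration; they are not DØ's actual skim definitions.

```python
# Route an event into zero or more skim streams based on its
# reconstructed physics objects.

def skim_streams(event: dict) -> list[str]:
    """Return the names of all skims this event passes."""
    streams = []
    if any(pt > 15.0 for pt in event.get("muon_pt", [])):
        streams.append("MUON")
    if any(pt > 20.0 for pt in event.get("em_pt", [])):
        streams.append("EM")
    if len(event.get("jet_pt", [])) >= 2:
        streams.append("DIJET")
    return streams

event = {"muon_pt": [22.4], "jet_pt": [45.1, 30.7]}
print(skim_streams(event))  # ['MUON', 'DIJET']
```

An event can land in several streams at once, which is why skims overlap and why the selection procedures can be improved independently of reconstruction.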
Remote Production Activities – Monte Carlo
The Good and Bad of the Grid
- The only viable way to go...
  - Increase in resources (CPU and potentially manpower), though still limited
  - Work with, not against, the LHC
- BUT
  - Need to conform to standards, and so depend on others...
  - Long-term solutions must be favoured over short-term idiosyncratic convenience, or we won't be able to maintain adequate resources
  - Must maintain a production-level service (papers), while increasing functionality
  - Must be as transparent as possible to the non-expert