LHCb DataModel
Nick Brook (University of Bristol), Glenn Patrick (Rutherford Lab)
• Motivation
• DataModel Options
• Future plans
WHY NOW ?
- DataGrid Architecture design process already started (ATF)
- LHCb input to direction & need of WP8 (HEP applications)
Grid architecture must meet LHCb requirements
- how do we deal with the data ?
- do we want “events” or “objects” or … (user/LHCb only superficially interested in files)
Timescales:
• feedback from DataGrid work packages to ATF already begun
• by end of June WP8 needs to define its long-term use requirements for the EU
Philosophy
- take a conservative viewpoint
• we want to perform analysis from day 1 !
• a “simplistic” model i.e. less reliant on the Grid
• building on our Grid tools is easier to unravel than built-in sophistication if the Grid fails to meet expectations
[Figure: hype cycle, Hype vs. Time: Trigger, Peak of Inflated Expectations, Trough of Disillusionment, Slope of Enlightenment, Plateau of Productivity]
Architecture of LHCb Computing Model
- based on a distributed multi-Tier regional centre
Processing of real data at CERN (production centre)
National centre simulation production centre
Dataflow Model
[Figure: dataflow diagram. Box labels: Physics Generator, Generator Data, Detector Simulation, Monte Carlo, RawMC Data, RawMC Tags, Detector, DAQ system, L2/L3 Trigger, Raw Data, Raw Tags, Calibration System, Calibration Data, Reconstruction, Event Summary Data (ESD), Event Tags]
Forget MC needs at our peril !!!
MC samples will be our greatest strain on resources ...
… not only CPU - storage, bandwidth ...
DataFlow Model
[Figure: dataflow diagram. Box labels: ESD, Reconstruction Tags, First Pass Analysis, Analysis Object Data (AOD), Physics Tags, Physics Analysis, Private Data, Analysis Workstation, Physics results, Generator Data (MC evts only), Raw Data]
How do we access the data?
How often do we need to access ESD or RAW data ?
What info. is available from AOD ?
Need to address questions to both MC and real data
Nominal Event Sizes
RAW per event 150 kB
ESD per event 100 kB
AOD per event 20 kB
TAG per event 1 kB
AOD factor of 5 smaller than ESD
- if access to ESD is needed for analysis, it has large consequences for the amount of data that is moved around
- re-visit how we view the data, perhaps better to break down an event into constituent components
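The consequence of the size ratio can be made concrete with a little arithmetic. The sketch below uses only the nominal per-event sizes quoted on the slide; the sample of one million selected events is an illustrative assumption, not an LHCb number, and decimal units (1 GB = 10^6 kB) are assumed.

```python
# Nominal per-event sizes from the slide, in kB.
NOMINAL_SIZE_KB = {"RAW": 150, "ESD": 100, "AOD": 20, "TAG": 1}


def sample_size_gb(n_events: int, tier: str) -> float:
    """Volume of n_events held at the given data tier, in GB (1 GB = 1e6 kB)."""
    return n_events * NOMINAL_SIZE_KB[tier] / 1e6


def size_ratio(tier_a: str, tier_b: str) -> float:
    """How many times larger one event is at tier_a than at tier_b."""
    return NOMINAL_SIZE_KB[tier_a] / NOMINAL_SIZE_KB[tier_b]


if __name__ == "__main__":
    n_selected = 1_000_000  # assumed analysis sample, for illustration only
    for tier in ("AOD", "ESD", "RAW"):
        print(f"{tier}: {sample_size_gb(n_selected, tier):.0f} GB to move")
    print(f"ESD/AOD size ratio: {size_ratio('ESD', 'AOD'):.0f}")
```

Any analysis that has to fall back from AOD to ESD moves five times the volume for the same events, which is why the content of the AOD matters so much.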
AOD Information - 20kB/evt
To minimise bandwidth requirements it is important that all the info. needed for analysis is available on the AOD.
How do we access non-AOD info. ?
Above example:
• 3 π’s and K
• intermediate decay particles
• B-meson
• event information (e.g. tot event energy, posn of primary vertex…)
• object info. e.g. quality of particle ID, error/quality of track measurement...
• no/limited info. on non B hadron candidates
} “objects”
Real Data
[Figure: CERN and the Tier 1 centres]
Data needs to get from CERN to regional centres
CERN (as production centre) should trigger the distribution of this data to Tier 1’s
AOD: 20TB/year/pass
ESD: 100TB/year/pass
Repeat passes over data and distribution - centralised control (as opposed to user triggered over Grid)
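The yearly volumes and the nominal event sizes together imply an event count, and repeat passes multiply the volume to be distributed. The sketch below works this through; the number of passes is a free parameter chosen here for illustration, and decimal units (1 TB = 10^9 kB) are assumed.

```python
KB_PER_TB = 1e9  # decimal units: 1 TB = 1e9 kB

# Figures quoted on the slides.
AOD_KB_PER_EVENT = 20
ESD_KB_PER_EVENT = 100
AOD_TB_PER_YEAR_PER_PASS = 20
ESD_TB_PER_YEAR_PER_PASS = 100


def implied_events_per_year(tb_per_pass: float, kb_per_event: float) -> float:
    """Number of events per year implied by a yearly volume and an event size."""
    return tb_per_pass * KB_PER_TB / kb_per_event


def distributed_tb_per_year(n_passes: int, include_esd: bool) -> float:
    """Volume shipped to one Tier 1 per year for a given number of passes."""
    per_pass = AOD_TB_PER_YEAR_PER_PASS
    if include_esd:
        per_pass += ESD_TB_PER_YEAR_PER_PASS
    return n_passes * per_pass


if __name__ == "__main__":
    print(implied_events_per_year(AOD_TB_PER_YEAR_PER_PASS, AOD_KB_PER_EVENT))
    n_passes = 3  # assumed number of passes, for illustration only
    print(distributed_tb_per_year(n_passes, include_esd=False), "TB (AOD only)")
    print(distributed_tb_per_year(n_passes, include_esd=True), "TB (AOD + ESD)")
```

Both the AOD and ESD figures correspond to the same implied number of events per year, so the two volumes are mutually consistent with the nominal event sizes.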
Real Data
What data goes to where ?
Working assumption: AOD (+TAG) will be distributed
Options:
All ESD & AOD data < 1PB of data - distributed by production centre
in addition to physics TAG in database (access problems ?)
streaming event header info.
- partial distribution based on this info.
match to national physics requirements ?
Real Data
- only Tier 0-2 will have Grid enabled data
- analysis performed, to first order, over national computing resources i.e. Tier 1-2 (+ 3-4 ?)
- Grid advertised resources needed to make decision on job-to-data or data-to-job
- analysis deals with “events” as the smallest working unit
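The job-to-data versus data-to-job decision can be sketched as a simple cost comparison. Everything below is a hypothetical illustration: the `SiteInfo` fields stand in for whatever the Grid would actually advertise, and the cost model (queue wait plus transfer time) is an assumption, not a DataGrid or LHCb algorithm.

```python
from dataclasses import dataclass


@dataclass
class SiteInfo:
    """Resources a site might advertise through the Grid (hypothetical fields)."""
    name: str
    queue_wait_s: float   # estimated wait before a job starts
    has_events: bool      # whether the requested events are held at this site


def transfer_time_s(dataset_gb: float, bandwidth_mb_per_s: float) -> float:
    """Time to copy the dataset between sites (1 GB = 1000 MB)."""
    return dataset_gb * 1000.0 / bandwidth_mb_per_s


def choose_strategy(local: SiteInfo, remote: SiteInfo,
                    dataset_gb: float, bandwidth_mb_per_s: float) -> str:
    """Pick the option with the smaller estimated time until the job can run."""
    if local.has_events:
        return "run-locally"
    data_to_job = transfer_time_s(dataset_gb, bandwidth_mb_per_s) + local.queue_wait_s
    job_to_data = remote.queue_wait_s
    return "job-to-data" if job_to_data <= data_to_job else "data-to-job"


if __name__ == "__main__":
    local = SiteInfo("local-tier1", queue_wait_s=60, has_events=False)
    remote = SiteInfo("remote-tier1", queue_wait_s=600, has_events=True)
    print(choose_strategy(local, remote, dataset_gb=20.0, bandwidth_mb_per_s=10.0))
```

The point of the sketch is that the decision cannot be made without the advertised information: queue state, data location and network capacity all enter the comparison.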
[Figure: CERN and the Tier 1 centres]
Monte Carlo Data
- complications: MC production centres scattered throughout LHCb
How do we distribute - access MC data ?
How do we manage MC needs ?
Monte Carlo Production
• distributed throughout collaboration
• priority & production co-ordinated centrally (as opposed to user triggered)
• production throughout Tier structure - down to Tier 4 ?
• (Local) Tier 1 centres will store MC production (including raw MC data)
• availability of MC samples in distributed Tier 1 ??
Monte Carlo Production
• changes in recons. code &/or AOD production ?
  - re-run over MC samples
• again “re-creation” of ESD/AOD should be centrally initiated, rather than triggered by user need
  - need to define obsolete MC samples
• “re-creation” performed at original national Tier 1
  - need to advertise new dataset
Monte Carlo Analysis
- because of distributed nature of MC production not clear if we want to distribute AODs as with real data (current baseline soln)
Alternatives:
(a) AOD stay at “local” generation Tier 1 & analysis jobs are moved to data until some access threshold is reached which triggers transfer of AOD to user’s local Tier 1
(b) user-triggered transfer
(c) Tier 2 - Tier 1 caching, i.e. not moving MC data from Tier 2 unless requested
(d) Automatically ship AOD & ESD’s to all Tier 1’s
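Alternative (a) amounts to a small piece of bookkeeping: count remote accesses per dataset and user site, and replicate once a threshold is crossed. The sketch below is a minimal illustration under stated assumptions; the class name, the site names and the threshold value are all hypothetical, and adding a site to the replica set stands in for the actual AOD transfer.

```python
from collections import Counter


class ThresholdReplicator:
    """Move jobs to the data until an access threshold triggers a transfer."""

    def __init__(self, threshold: int):
        self.threshold = threshold
        self.access_counts = Counter()  # (dataset, user Tier 1) -> remote job count
        self.replicas = {}              # dataset -> set of Tier 1 names holding the AOD

    def register(self, dataset: str, generation_tier1: str) -> None:
        """Record a new MC dataset, held only at the Tier 1 that generated it."""
        self.replicas[dataset] = {generation_tier1}

    def submit(self, dataset: str, user_tier1: str) -> str:
        """Return the Tier 1 at which this analysis job runs."""
        holders = self.replicas[dataset]
        if user_tier1 in holders:
            return user_tier1  # AOD already local: run at the user's Tier 1
        self.access_counts[(dataset, user_tier1)] += 1
        if self.access_counts[(dataset, user_tier1)] >= self.threshold:
            holders.add(user_tier1)  # stands in for the AOD transfer
        # This job still goes to a site that already held the data.
        return sorted(holders - {user_tier1})[0]


if __name__ == "__main__":
    replicator = ThresholdReplicator(threshold=3)
    replicator.register("mc-sample", "tier1-A")
    for _ in range(4):
        print(replicator.submit("mc-sample", "tier1-B"))
```

Alternatives (b) and (d) are the two limits of the same scheme: an effectively infinite threshold with a manual override, and a threshold of zero.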
Version
- the availability of s/w (version compatibility)
- which datasets exist, and which s/w version created them
Clean “versioning” of s/w (including LHCb s/w) over the distributed environment is crucial
Grid will need to advertise not only available h/w resources and LHCb data BUT also the available s/w resources.
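A minimal sketch of what advertising s/w alongside data could look like, assuming each dataset records the s/w version that created it and each site advertises the versions it has installed. The record layout, the names and the exact-match compatibility rule are simplifying assumptions for illustration only.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Dataset:
    """Catalogue entry for a dataset (hypothetical layout)."""
    name: str
    data_tier: str    # e.g. "ESD" or "AOD"
    sw_version: str   # version of the s/w that created the dataset


def sites_able_to_read(dataset: Dataset, advertised_sw: dict) -> list:
    """Sites whose advertised s/w includes the version that made the dataset.

    advertised_sw maps a site name to the set of s/w versions installed there.
    Compatibility is taken to mean an exact version match (an assumption).
    """
    return sorted(site for site, versions in advertised_sw.items()
                  if dataset.sw_version in versions)


def obsolete(datasets: list, current_version: str) -> list:
    """Datasets not produced with the current s/w version."""
    return [d for d in datasets if d.sw_version != current_version]


if __name__ == "__main__":
    catalogue = [Dataset("sample-1", "AOD", "v1"), Dataset("sample-2", "AOD", "v2")]
    advertised = {"tier1-A": {"v1", "v2"}, "tier1-B": {"v1"}}
    print(sites_able_to_read(catalogue[1], advertised))
    print([d.name for d in obsolete(catalogue, current_version="v2")])
```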
Requirements
GRID
• data replication - fast, reliable, common tools
• stats. on data access (incl. geographical parameters)
• uniform working env.
• info. services - incl. s/w availability
• meta-data services/cataloguing
• active & archive storage - migration/movement of data, restaging
LHCb
• clean versioning system - data & s/w
• priority management - e.g. recreating MC AODs
• centralised control of MC
• Defn of AOD & ESD data - i/p from physics group
• policy on obtaining “missing info.” - “tough luck” or automatic generation
What Next ?
- examine the options and if poss. decide on a baseline
(aim: LHCb Grid meeting, Bologna, June 13th-15th)
- initial working system: realistic, pragmatic, flexible
- identify crucial elements and develop a backbone analysis Grid (BAG) now to run in parallel to the Monte Carlo Production Grid
• enable LHCb analysis framework & services to be developed
• initial testbed to explore how Tier 2 & Tier 3 centres fit into Grid topology
• initial working model - to attract effort and funding
• physics group effort needed for datamodel & analysis system design
BAG Requirements
• standardised replication tools - originally based on existing tools
• basic meta-data catalogues - based around simple LDAP implementation
• standard LHCb analysis environment - installation toolkit, interfaces to mass storage systems, ...
• interface between LHCb Gaudi analysis framework & Grid
GAUDI meets the Grid
Gaudi Services: Application Manager, Job Options Service, Detector Description, EventData Service, Histogram Service, Message Service, Particle Property Service, GaudiLab Service
Grid Services: Information Services, Scheduling, Security, Monitoring, Data Management, Service Discovery, Database Service?
Logical DataStores: Event, Detector, Histogram, Ntuple
[Figure labels: Meta Data; Data; Standard Interfaces & Protocols]
Most Grid services are producers or consumers of meta-data
Summary
Data Model depends on how we perform analysis. Can we really do analysis on AOD only? Major implications.
Analysis of MC data is a vital element of the problem due to distributed nature & scale of production.
Realistic “physics” analysis performed: (a) at CERN (b) at an external institute
Need to begin work/studies on interfacing Gaudi to Grid
Limitations on networking will have to feed into our approach to analysis