
LHCb DataModel

Nick Brook, Glenn Patrick

University of Bristol, Rutherford Lab

• Motivation

• DataModel Options

• Future plans

WHY NOW?

- DataGrid Architecture design process already started (ATF)

- LHCb input to direction & need of WP8 (HEP applications)

Grid architecture must meet LHCb requirements

- how do we deal with the data?

- do we want “events” or “objects” or … (users/LHCb are only superficially interested in files)

Timescales:

• feedback from DataGrid work packages to ATF already begun

• end of June: WP8 needs to define its long-term use requirements for the EU

Philosophy

- take a conservative viewpoint

• we want to perform analysis from day 1 !

• a “simplistic” model i.e. less reliant on the Grid

• building on our Grid tools is easier to unravel than built-in sophistication if the Grid fails to meet expectations

[Figure: hype cycle showing Hype against Time, with stages Trigger, Peak of Inflated Expectations, Trough of Disillusionment, Slope of Enlightenment, Plateau of Productivity]

Architecture of LHCb Computing Model

- based on a distributed multi-Tier regional centre model

Processing of real data at CERN (production centre)

National centre simulation production centre

Dataflow Model

[Figure: dataflow diagram with elements Physics Generator, Generator Data, Detector Simulation, Monte Carlo, Detector, DAQ system, L2/L3 Trigger, Raw Data, Raw Tags, RawMC Data, RawMC Tags, Calibration System, Calibration Data, Reconstruction, Event Summary Data (ESD), Event Tags]

Forget MC needs at our peril!!!

MC samples will be our greatest strain on resources: not only CPU but also storage, bandwidth ...

DataFlow Model

[Figure: dataflow diagram (continued) with elements ESD, Reconstruction Tags, Analysis Object Data (AOD), Physics Tags, First Pass Analysis, Physics Analysis, Private Data, Analysis Workstation, Physics results, Generator Data (MC evts only), Raw Data]

How do we access the data?

How often do we need to access ESD or RAW data ?

What info. is available from AOD ?

Need to address questions to both MC and real data

Nominal Event Sizes

RAW per event: 150 kB

ESD per event: 100 kB

AOD per event: 20 kB

TAG per event: 1 kB

AOD factor of 5 smaller than ESD

- if access to ESD is needed for analysis, it has large consequences on the amount of data that is moved around

- re-visit how we view the data; perhaps better to break down an event into its constituent components

AOD Information - 20 kB/evt

To minimise bandwidth requirements it is important that all the information needed for analysis is available on the AOD.

How do we access non-AOD info.?

Above example:

• 3 π’s and K

• intermediate decay particles

• B-meson

• event information (e.g. tot event energy, posn of primary vertex…)

• object info. e.g. quality of particle ID, error/quality of track measurement...

• no/limited info. on non B hadron candidates

} “objects”

Real Data

[Figure: CERN and five Tier 1 centres]

Data needs to get from CERN to regional centres

CERN (as production centre) should trigger the distribution of this data to Tier 1’s

AOD: 20TB/year/pass

ESD: 100TB/year/pass
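The yearly volumes follow from the nominal per-event sizes. A minimal sketch of the arithmetic, assuming 10^9 events per year: that figure is not stated in the slides but is inferred here from 20 TB / 20 kB, and the function name is illustrative only.

```python
# Nominal per-event sizes quoted in the slides, in kB.
EVENT_SIZE_KB = {"RAW": 150, "ESD": 100, "AOD": 20, "TAG": 1}

# ASSUMPTION: 1e9 events per year, inferred from 20 TB / 20 kB.
# This number does not appear in the slides themselves.
EVENTS_PER_YEAR = 10**9

KB_PER_TB = 10**9  # decimal units: 1 TB = 1e9 kB


def yearly_volume_tb(data_type: str, events: int = EVENTS_PER_YEAR) -> float:
    """Return the volume in TB of one pass over `events` events."""
    return EVENT_SIZE_KB[data_type] * events / KB_PER_TB


if __name__ == "__main__":
    for name in EVENT_SIZE_KB:
        print(f"{name}: {yearly_volume_tb(name):.0f} TB/year/pass")
```

Under the same assumption, the factor of 5 between ESD and AOD event sizes carries straight through to the volume that has to be shipped to each Tier 1 per pass.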

Repeat passes over data and distribution - centralised control (as opposed to user triggered over Grid)

Real Data

What data goes where?

Working assumption: AOD (+TAG) will be distributed

Options:

All ESD & AOD data: < 1 PB of data - distributed by production centre

In addition to physics TAG in database (access problems?):

streaming, event header info.

- partial distribution based on this info.

match to national physics requirements?

Real Data

- only Tier 0-2 will have Grid enabled data

- analysis performed, to first order, over national computing resources i.e. Tier 1-2 (+ 3-4 ?)

- Grid advertised resources needed to make decision on job-to-data or data-to-job

- analysis deals with “events” as the smallest working unit
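The job-to-data or data-to-job decision above could be driven by a simple rule over the advertised resources. The sketch below is only an illustration: the `Site` attributes, the one-hour transfer limit and the rule itself are assumptions, not part of the LHCb model.

```python
from dataclasses import dataclass


@dataclass
class Site:
    name: str
    free_cpus: int  # idle CPU slots advertised by the site (hypothetical attribute)


def place_job(local: Site, data_site: Site, dataset_gb: float,
              link_mb_per_s: float, max_transfer_s: float = 3600.0) -> str:
    """Decide where to run a job whose input dataset lives at `data_site`.

    Returns "job-to-data" (run where the data is) or "data-to-job"
    (copy the data to the user's local site and run there).
    """
    # Prefer running next to the data when CPU is free there.
    if data_site.free_cpus > 0:
        return "job-to-data"
    # Otherwise move the data only if the copy is quick enough
    # and the local site can actually run the job.
    transfer_s = dataset_gb * 1000.0 / link_mb_per_s
    if local.free_cpus > 0 and transfer_s <= max_transfer_s:
        return "data-to-job"
    # Fall back to queueing at the site that holds the data.
    return "job-to-data"


if __name__ == "__main__":
    local = Site("local Tier 1", free_cpus=10)
    busy = Site("remote Tier 1", free_cpus=0)
    print(place_job(local, busy, dataset_gb=20.0, link_mb_per_s=10.0))
```

A real broker would also weigh queue lengths, storage space and network load; the point here is only that both options need the same advertised information.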

[Figure: CERN and five Tier 1 centres]

Monte Carlo Data

- complications: MC production centres scattered throughout LHCb

How do we distribute and access MC data?

How do we manage MC needs?

Monte Carlo Production

• distributed throughout collaboration

• priority & production co-ordinated centrally (as opposed to user triggered)

• production throughout Tier structure - down to Tier 4 ?

• (Local) Tier 1 centres will store MC production (including raw MC data)

• availability of MC samples in distributed Tier 1 centres ??

Monte Carlo Production

• changes in reconstruction code and/or AOD production?

re-run over MC samples

• again, “re-creation” of ESD/AOD should be centrally initiated, rather than triggered by user need

need to define obsolete MC samples

• “re-creation” performed at original national Tier 1

need to advertise new dataset

Monte Carlo Analysis

- because of the distributed nature of MC production, it is not clear if we want to distribute AODs as with real data (current baseline solution)

Alternatives:

(a) AODs stay at the “local” generation Tier 1 & analysis jobs are moved to the data until some access threshold is reached, which triggers transfer of the AOD to the user’s local Tier 1

(b) user-triggered transfer

(c) Tier 2-Tier 1 caching, i.e. not moving MC data from Tier 2 unless requested

(d) automatically ship AODs & ESDs to all Tier 1s
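Alternative (a) can be sketched as a counter per requesting Tier 1. This is a minimal illustration: the class name, the site labels and the threshold of 3 are hypothetical, and adding a site to `replicas` stands in for a real AOD transfer.

```python
from collections import Counter


class AccessThresholdReplicator:
    """Keep an MC AOD at its generation Tier 1 until another Tier 1 uses it often."""

    def __init__(self, home_tier1: str, threshold: int) -> None:
        self.home_tier1 = home_tier1
        self.threshold = threshold
        self.access_counts: Counter = Counter()
        self.replicas = {home_tier1}

    def record_access(self, user_tier1: str) -> str:
        """Register one analysis job and return the Tier 1 where it should run."""
        if user_tier1 in self.replicas:
            return user_tier1  # data already local: run the job at home
        self.access_counts[user_tier1] += 1
        if self.access_counts[user_tier1] >= self.threshold:
            # Threshold reached: trigger the transfer (modelled as a set update).
            self.replicas.add(user_tier1)
        # This job is still moved to the data; later ones will run locally.
        return self.home_tier1


if __name__ == "__main__":
    aod = AccessThresholdReplicator(home_tier1="Tier1-A", threshold=3)
    print([aod.record_access("Tier1-B") for _ in range(4)])
```

The access counts collected here are the same per-site statistics the Grid would need to provide for any threshold-based policy.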

Version

- the availability of s/w (version compatibility)

- what datasets exist, created with which s/w version

Clean “versioning” of s/w (including LHCb s/w) over the distributed environment is crucial

Grid will need to advertise not only available h/w resources and LHCb data BUT also the available s/w resources.
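One way to picture the versioning requirement: a job can only go to a site that advertises both the dataset and the s/w version it was created with. The sketch below assumes hypothetical record types (`Dataset`, `SiteAdvert`) and made-up version strings; it does not describe an actual Grid information service.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Dataset:
    name: str
    created_with: str  # s/w version that produced the dataset


@dataclass
class SiteAdvert:
    """What a site advertises: installed s/w versions and available datasets."""
    name: str
    software_versions: set = field(default_factory=set)
    datasets: list = field(default_factory=list)


def sites_able_to_run(adverts, dataset_name, required_version):
    """Return names of sites advertising the dataset and the matching s/w version."""
    matches = []
    for site in adverts:
        has_software = required_version in site.software_versions
        has_data = any(
            d.name == dataset_name and d.created_with == required_version
            for d in site.datasets
        )
        if has_software and has_data:
            matches.append(site.name)
    return matches


if __name__ == "__main__":
    adverts = [
        SiteAdvert("Tier1-A", {"v1", "v2"}, [Dataset("mc-sample", "v2")]),
        SiteAdvert("Tier1-B", {"v1"}, [Dataset("mc-sample", "v2")]),
    ]
    print(sites_able_to_run(adverts, "mc-sample", "v2"))
```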

Requirements

GRID

• data replication - fast, reliable, common tools

• stats. on data access (incl. geographical parameters)

• uniform working env.

• info. services - incl. s/w availability

• meta-data services/cataloguing

• active & archive storage - migration/movement of data, restaging

LHCb

• clean versioning system - data & s/w

• priority management - e.g. recreating MC AODs

• centralised control of MC

• definition of AOD & ESD data - input from physics group

• policy on obtaining “missing info.” - “tough luck” or automatic generation

What Next?

- examine the options and, if possible, decide on a baseline

(aim: LHCb Grid meeting, Bologna, June 13th-15th)

- initial working system: realistic, pragmatic, flexible

- identify crucial elements and develop a backbone analysis Grid (BAG) now to run in parallel to the Monte Carlo Production Grid

• enable LHCb analysis framework & services to be developed

• initial testbed to explore how Tier 2 & Tier 3 centres fit into Grid topology

• initial working model - to attract effort and funding

• physics group effort needed for datamodel & analysis system design

BAG Requirements

• standardised replication tools - initially based on existing tools

• basic meta-data catalogues - based around a simple LDAP implementation

• standard LHCb analysis environment - installation toolkit, interfaces to mass storage systems, ...

• interface between LHCb Gaudi analysis framework & Grid
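The role of a basic meta-data catalogue can be illustrated with an in-memory stand-in. This sketch uses a plain Python dictionary rather than LDAP, and every attribute name and value in it (`datatype`, `sw_version`, `site`) is hypothetical.

```python
class MetaDataCatalogue:
    """Toy catalogue mapping logical dataset names to descriptive attributes."""

    def __init__(self) -> None:
        self._entries = {}

    def register(self, logical_name: str, **attributes) -> None:
        """Add or overwrite the attributes stored for a logical dataset name."""
        self._entries[logical_name] = dict(attributes)

    def search(self, **criteria) -> list:
        """Return sorted logical names whose attributes match all criteria."""
        return sorted(
            name
            for name, attrs in self._entries.items()
            if all(attrs.get(key) == value for key, value in criteria.items())
        )


if __name__ == "__main__":
    cat = MetaDataCatalogue()
    cat.register("mc/sample-1/aod", datatype="AOD", sw_version="v2", site="Tier1-A")
    cat.register("mc/sample-1/esd", datatype="ESD", sw_version="v2", site="Tier1-A")
    print(cat.search(datatype="AOD"))
```

Equality matching on attributes is roughly what a simple directory search filter provides, which is why a directory service is a plausible first implementation.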

GAUDI meets the Grid

Gaudi Services: Application Manager, Job Options Service, Detector Description, EventData Service, Histogram Service, Message Service, Particle Property Service, GaudiLab Service

Grid Services: Information Services, Scheduling, Security, Monitoring, Data Management, Service Discovery, Database Service?

Logical DataStores: Event, Detector, Histogram, Ntuple

Standard Interfaces & Protocols (Meta Data, Data)

Most Grid services are producers or consumers of meta-data

Summary

Data Model depends on how we perform analysis. Can we really do analysis on AOD only? This has major implications.

Analysis of MC data is a vital element of the problem due to the distributed nature & scale of production.

Realistic “physics” analysis performed: (a) at CERN, (b) at an external institute

Need to begin work/studies on interfacing Gaudi to Grid

Limitations on networking will have to feed into our approach to analysis