Seminar "Using modern information technologies to solve modern problems in particle physics" at the Yandex Moscow office, 3 July 2012. Marco Cattaneo, CERN
Event Data Processing in LHCb
Marco Cattaneo CERN – LHCb
On behalf of the LHCb Computing Group
LHC interactions
❍ LHC: two proton beams of ~1380 bunches, circulating at 11 kHz
❍ 15 MHz crossing rate (30 MHz in 2015)
❍ Average 1.5 interactions per crossing in LHCb
❍ Each crossing contains a potentially interesting "event"
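As a quick cross-check of these numbers, a short Python sketch; the LHC revolution frequency of ~11.245 kHz is an assumption consistent with the "11 kHz" quoted above:

# Back-of-the-envelope check of the quoted rates
bunches = 1380                       # colliding bunches per beam
revolution_hz = 11.245e3             # assumed LHC revolution frequency (~11 kHz)
crossing_rate = bunches * revolution_hz
interaction_rate = 1.5 * crossing_rate           # average 1.5 interactions per crossing
print(f"crossing rate ~ {crossing_rate/1e6:.1f} MHz")      # ~15.5 MHz, i.e. the 15 MHz above
print(f"interaction rate ~ {interaction_rate/1e6:.1f} MHz")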
Typical event in LHCb
Raw event size ~60 kB
Data reduction in real time (trigger)
pp collisions: 15 MHz
❍ Level-0
❏ custom hardware
❏ partial detector information
❏ output: 1 MHz
❍ HLT
❏ CPU farm -> software trigger (~20k jobs in parallel)
❏ full detector information
❏ reconstruction in real time
❏ ~200 independent "lines"
❏ output: ~5 kHz
❍ Offline Storage
❏ 300 MB/s
❏ 1.3 PB in 2012
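The storage figures follow from the trigger output rate and the raw event size above (a sketch; the implied live time is derived from the quoted numbers, not stated on the slide):

# Throughput to offline storage and the live time implied by the 2012 volume
hlt_output_hz = 5e3                          # ~5 kHz of accepted events
raw_event_kb = 60                            # ~60 kB raw event size
throughput_mb_s = hlt_output_hz * raw_event_kb / 1e3
live_days = 1.3e9 / throughput_mb_s / 86400  # 1.3 PB = 1.3e9 MB
print(f"~{throughput_mb_s:.0f} MB/s, ~{live_days:.0f} days of data taking for 1.3 PB")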
Event Reconstruction
❍ On full event sample (~2 billion events expected in 2012):
❏ Pattern recognition to measure particle trajectories
✰ Identify vertices, measure momentum
❏ Particle ID to measure particle types
❍ ~2 sec/event, ~5k concurrent jobs (run on grid)
❍ Reconstructed data stored in "FULL DST"
❏ DST event size ~100 kB -> 2 PB in 2012 (includes RAW)
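As a rough check of the reconstruction load (a sketch; assumes the ~5k grid slots are fully and continuously occupied):

# Wall-clock time for one full reconstruction pass over the 2012 sample
events = 2e9                 # ~2 billion events
cpu_s_per_event = 2          # ~2 s/event
concurrent_jobs = 5e3        # ~5k concurrent grid jobs
wall_days = events * cpu_s_per_event / concurrent_jobs / 86400
print(f"~{wall_days:.0f} days for one pass over the full sample")   # roughly 9 days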
Data reduction offline (stripping)
❍ Any given physics analysis selects 0.1-1.0% of events
❏ Inefficient to allow individual physicists to run a selection job over FULL DST
❍ Group all selections in a single selection pass, executed by central production team
❏ Runs over FULL DST
❏ Executes ~800 independent stripping "lines"
✰ ~0.5 sec/event in total
✰ Writes out only events selected by one or more of these lines
❏ Output events are grouped into ~15 streams
❏ Each stream selects 1-10% of events
❍ Overall data volume reduction: x50 – x500 depending on stream
❏ Few TB (<50) per stream, replicated at several places
❏ Accessible to physicists for data analysis
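The quoted reduction factors are consistent with the per-stream volumes (quick check):

# Per-stream volume implied by the x50-x500 reduction over the 2 PB of FULL DST
full_dst_tb = 2000                          # 2 PB of FULL DST expected in 2012
for reduction in (50, 500):
    print(f"x{reduction}: ~{full_dst_tb / reduction:.0f} TB per stream")
# -> ~40 TB and ~4 TB, i.e. the "few TB (<50) per stream" quoted above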
Simulation
LHCb Monte Carlo simulation software:
– Simulation of the physics event
– Detailed detector and material description (GEANT)
– Pattern recognition, trigger simulation and offline event selection
– Implements detector inefficiencies, noise hits, effects of multiple collisions
Resources for simulation
❍ Simulation jobs require 1-2 minutes per event
❏ Several billion events per year required
❏ Runs on ~20k CPUs worldwide
Yandex contribution ~25%
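Order-of-magnitude check of the simulation load (a sketch; 2 billion events stands in for "several billion", and the worldwide farm is assumed fully occupied):

# CPU needed for the yearly simulation campaign
events = 2e9                          # assumed value for "several billion" events per year
minutes_per_event = 1.5               # 1-2 minutes per event
cpus = 20e3                           # ~20k CPUs worldwide
wall_days = events * minutes_per_event * 60 / cpus / 86400
print(f"~{wall_days:.0f} days on {int(cpus)} CPUs")   # a large fraction of the year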
Physics Applications Software Organization
[Layer diagram: applications (Reconstruction, Simulation, Analysis, High level triggers) built on Frameworks and Toolkits, which in turn sit on Foundation Libraries]
❍ One framework for basic services + various specialized frameworks: detector description, visualization, persistency, interactivity, simulation, etc.
❍ A series of widely used basic libraries: Boost, GSL, Root etc.
❍ Applications built on top of frameworks and implementing the required algorithms.
Gaudi Framework (Object Diagram)
[Object diagram: the Application Manager steers Algorithms; Algorithms read and write the Transient Event Store, Transient Detector Store and Transient Histogram Store; the Event Data Service, Detector Data Service and Histogram Service fill these stores via Persistency Services and Converters that access Data Files; an Event Selector chooses the input events; common components include the Message Service, JobOptions Service, Particle Property Service and other services]
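To make the pattern in the diagram concrete, here is a minimal illustrative sketch in Python of algorithms communicating through a transient event store under an application manager. This is not the Gaudi API (Gaudi algorithms are C++ components configured through job options); all class and method names below are invented for illustration only.

class TransientEventStore(dict):
    """Per-event key/value store, cleared between events."""

class Algorithm:
    def initialize(self):                       # called once before the event loop
        pass
    def execute(self, evt):                     # called once per event
        raise NotImplementedError
    def finalize(self):                         # called once after the event loop
        pass

class CountTracks(Algorithm):
    def execute(self, evt):
        evt["nTracks"] = len(evt.get("tracks", []))
        print("nTracks =", evt["nTracks"])

class ApplicationManager:
    def __init__(self, algorithms):
        self.algorithms = algorithms
    def run(self, events):
        for alg in self.algorithms:
            alg.initialize()
        for raw in events:
            evt = TransientEventStore(raw)      # the "event data service" fills the store
            for alg in self.algorithms:
                alg.execute(evt)                # algorithms communicate only via the store
        for alg in self.algorithms:
            alg.finalize()

ApplicationManager([CountTracks()]).run([{"tracks": [1, 2, 3]}])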
The LHCb Computing Model
LHCb Workload Management System: DIRAC
❍ DIRAC forms an overlay network
❏ A way for grid interoperability for a given Community
❏ Needs a specific Agent Director per resource type
❍ From the user perspective all the resources are seen as a single large "batch system"
[Diagram: User Community -> DIRAC WMS -> Grid A (WLCG), Grid B (NDG)]
❍ Jobs are submitted to the DIRAC Central Task Queue
❍ VRC policies are applied here by prioritizing jobs in the Queue
❍ Pilot Jobs are submitted by specific Directors to various Grids or computer clusters (see the sketch below)
❏ Allows aggregating various types of computing resources transparently for the users
❍ The Pilot Job gets the most appropriate user job
❍ Jobs run in a verified environment with high efficiency
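A minimal sketch of the pilot-job idea described above, in Python; the queue, priorities and matching logic here are invented for illustration and are not DIRAC code.

import queue

# Central task queue: user jobs wait here, ordered by priority (policy applied centrally)
task_queue = queue.PriorityQueue()
task_queue.put((1, "alice: stripping job, needs slc5"))
task_queue.put((0, "bob: reconstruction job, needs slc5"))   # smaller number = higher priority

def pilot(worker_environment_ok):
    """A pilot job: starts on a worker node (grid site, cluster, ...), verifies the
    local environment, then pulls the most appropriate waiting user job."""
    if not worker_environment_ok:
        return None                       # environment check failed: pull nothing
    try:
        _, job = task_queue.get_nowait()  # late binding: the job is chosen only now
    except queue.Empty:
        return None
    return job

print(pilot(worker_environment_ok=True))  # bob's job is matched first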
DIRAC
❍ Live DIRAC Display
Data replication
❍ Active data placement
❍ Split disk (for analysis) and archive
❏ No disk replica for RAW and FULL DST (read few times)
[Dataflow diagram: RAW -> Brunel (reconstruction) -> FULL DST -> DaVinci (stripping and streaming) -> DST streams -> Merge -> merged DSTs; replication per step: CERN + 1 Tier1 for RAW and FULL DST, 1 Tier1 (scratch) for unmerged DSTs, CERN + 3 Tier1s for merged DSTs]
Datasets
❍ Granularity at the file level
❏ Data Management operations (replicate, remove replica, delete file)
❏ Workload Management: input/output files of jobs
❍ LHCbDirac perspective
❏ DMS and WMS use Logical File Names (LFNs) to reference files
❏ LFN namespace refers to the origin of the file (see the sketch below)
✰ Constructed by the jobs (uses production and job number)
✰ Hierarchical namespace for convenience
✰ Used to define file class (tape-sets) for RAW, FULL.DST, DST
✰ GUID used for internal navigation between files (Gaudi)
❍ User perspective
❏ File is part of a dataset (consistent for physics analysis)
❏ Dataset: specific conditions of data, processing version and processing level
✰ Files in a dataset should be exclusive and consistent in quality and content
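A small sketch of what "constructed by the jobs from production and job number" can look like; the field meanings are inferred from the example LFN shown later on the event-indexing slide, and the function below is an illustration, not an official definition.

def make_lfn(data_type, year_config, file_type, production, job, step=1):
    # Hierarchical namespace built from production and job number (inferred layout)
    prod = f"{production:08d}"
    subdir = f"{job // 10000:04d}"      # assumed grouping of job numbers into sub-directories
    return (f"/lhcb/{data_type}/{year_config}/{file_type}/"
            f"{prod}/{subdir}/{prod}_{job:08d}_{step}.{file_type.lower()}")

print(make_lfn("LHCb", "Collision11", "DIMUON.DST", production=13016, job=37))
# -> /lhcb/LHCb/Collision11/DIMUON.DST/00013016/0000/00013016_00000037_1.dimuon.dst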
Replica Catalog (1)
❍ Logical namespace
❏ Reflects somewhat the origin of the file (run number for RAW, production number for output files of jobs)
❏ File type also explicit in the directory tree
❍ Storage Elements
❏ Essential component in the DIRAC DMS
❏ Logical SEs: several DIRAC SEs can physically use the same hardware SE (same instance, same SRM space)
❏ Described in the DIRAC configuration
✰ Protocol, endpoint, port, SAPath, Web Service URL
✰ Allows autonomous construction of the SURL (see the sketch below)
✰ SURL = srm:<endPoint>:<port><WSUrl><SAPath><LFN>
❏ SRM spaces at Tier1s
✰ Used to have as many SRM spaces as DIRAC SEs, now only 3
✰ LHCb-Tape (T1D0): custodial storage
✰ LHCb-Disk (T0D1): fast disk access
✰ LHCb-User (T0D1): fast disk access for user data
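A small sketch of the SURL construction rule quoted above; the configuration values are made-up placeholders, not real LHCb endpoints.

# Hypothetical DIRAC SE description with the fields listed above (placeholder values)
se_config = {
    "Protocol": "srm",
    "Endpoint": "srm.tier1.example.org",      # not a real endpoint
    "Port": 8443,
    "WSUrl": "/srm/managerv2?SFN=",
    "SAPath": "/lhcb",
}

def surl(lfn, cfg=se_config):
    # SURL = srm:<endPoint>:<port><WSUrl><SAPath><LFN>, the rule quoted on this slide
    # (the "//" after the protocol is written out here as in actual SRM URLs)
    return (f"{cfg['Protocol']}://{cfg['Endpoint']}:{cfg['Port']}"
            f"{cfg['WSUrl']}{cfg['SAPath']}{lfn}")

print(surl("/lhcb/LHCb/Collision11/DIMUON.DST/00013016/0000/00013016_00000037_1.dimuon.dst"))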
Replica Catalog (2)
❍ Currently using the LFC
❏ Master write service at CERN
❏ Replication using Oracle Streams to Tier1s
❏ Read-only instances at CERN and Tier1s
✰ Mostly for redundancy, no need for scaling
❍ LFC information:
❏ Metadata of the file
❏ Replicas
✰ Use "host name" field for the DIRAC SE name
✰ Store SURL of creation for convenience (not used)
❄ Allows lcg-util commands to work
❏ Quality flag
✰ One-character comment used to temporarily mark a replica as unavailable
❍ Testing scalability of the DIRAC File Catalog
❏ Built-in storage usage capabilities (per directory)
Bookkeeping Catalog (1)
❍ User selection criteria
❏ Origin of the data (real or MC, year of reference)
✰ LHCb/Collision12
❏ Conditions for data taking or simulation (energy, magnetic field, detector configuration…)
✰ Beam4000GeV-VeloClosed-MagDown
❏ Processing Pass is the level of processing (reconstruction, stripping…) including compatibility version
✰ Reco13/Stripping19
❏ Event Type is mostly useful for simulation, single value for real data
✰ 8-digit numeric code (12345678, 90000000)
❏ File Type defines which type of output files the user wants to get for a given processing pass (e.g. which stream)
✰ RAW, SDST, BHADRON.DST (for a streamed file)
❍ Bookkeeping search
❏ Using a path (assembled as in the sketch below)
✰ /<origin>/<conditions>/<processing pass>/<event type>/<file type>
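Putting the example values together into the search path (a straightforward assembly of the fields above):

# Assemble a bookkeeping query path from the selection criteria on this slide
origin          = "LHCb/Collision12"
conditions      = "Beam4000GeV-VeloClosed-MagDown"
processing_pass = "Reco13/Stripping19"
event_type      = "90000000"                  # single value for real data
file_type       = "BHADRON.DST"
print(f"/{origin}/{conditions}/{processing_pass}/{event_type}/{file_type}")
# -> /LHCb/Collision12/Beam4000GeV-VeloClosed-MagDown/Reco13/Stripping19/90000000/BHADRON.DST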
Bookkeeping Catalog (2)
❍ Much more than a dataset catalog!
❍ Full provenance of files and jobs
❏ Files are input of processing steps ("jobs") that produce files
❏ All files ever created are recorded, each processing step as well
✰ Full information on the "job" (location, CPU, wall clock time…)
❍ BK relational database (toy sketch below)
❏ Two main tables: "files" and "jobs"
❏ Jobs belong to a "production"
❏ "Productions" belong to a "processing pass", with a given "origin" and "condition"
❏ Highly optimized search for files, as well as summaries
❍ Quality flags
❏ Files are immutable, but can have a mutable quality flag
❏ Files have a flag indicating whether they have a replica or not
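A toy sketch of the provenance model described above, using SQLite from Python; the column names and layout are invented for illustration and are not the actual BK schema.

import sqlite3

# Toy version of the "files"/"jobs" provenance model; not the real BK schema
db = sqlite3.connect(":memory:")
db.executescript("""
CREATE TABLE jobs  (job_id INTEGER PRIMARY KEY, production INTEGER, site TEXT);
CREATE TABLE files (file_id INTEGER PRIMARY KEY, lfn TEXT,
                    produced_by INTEGER REFERENCES jobs(job_id),
                    quality TEXT DEFAULT 'OK', has_replica INTEGER DEFAULT 1);
-- which files each job read: files are inputs of jobs, which produce new files
CREATE TABLE job_inputs (job_id INTEGER REFERENCES jobs(job_id),
                         file_id INTEGER REFERENCES files(file_id));
""")
db.execute("INSERT INTO jobs VALUES (1, 13016, 'CERN')")            # a reconstruction job
db.execute("INSERT INTO files VALUES (1, '/lhcb/raw-file', NULL, 'OK', 1)")
db.execute("INSERT INTO files VALUES (2, '/lhcb/full-dst', 1, 'OK', 1)")
db.execute("INSERT INTO job_inputs VALUES (1, 1)")

# Provenance query: which files were read by the job that produced file 2?
print(db.execute("""
    SELECT f_in.lfn FROM files f_out
    JOIN job_inputs ji ON ji.job_id = f_out.produced_by
    JOIN files f_in    ON f_in.file_id = ji.file_id
    WHERE f_out.file_id = 2""").fetchall())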
Bookkeeping browsing
❍ Allows saving datasets
❏ Filter, selection
❏ Plain list of files
❏ Gaudi configuration file
❍ Can return only files with a replica at a given location
Event indexing
❍ Book-keeping has no information about individual events
❍ But it can be beneficial to select events based on global criteria:
❏ Number of tracks, clusters etc.
❏ Trigger or Stripping lines fired
❏ …
❍ Prototype implemented by Andrey Ustyuzhanin
[Screenshot: event index "Advanced search" result for event 104248:539058326]
❏ Event time: Oct. 27, 2011, 9:11 p.m.
❏ Files:
/lhcb/LHCb/Collision11/DIMUON.DST/00013016/0000/00013016_00000037_1.dimuon.dst
/lhcb/LHCb/Collision11/EW.DST/00013017/0000/00013017_00000033_1.ew.dst
❏ Application: Brunel v41r1
❏ Database tags: DDDB head-20110914, DQFLAGS tt-20110126, LHCBCOND head-20111111, ONLINE HEAD
❏ Stripping lines fired, including DY2MuMuLine2, Hlt1FullDSTDiMuonDiMuonHighMassLine, WMuLine, Z02MuMuLine, Z02MuMuNoPIDsLine, Z02TauTauLine, plus the Dimuon and EW stream flags
❏ Global Event Activity counters: nBackTracks, nDownstreamTracks, nITClusters, nLongTracks, nMuonCoordsS0-S4, nMuonTracks, nOTClusters, nPV, nRich1Hits, nRich2Hits, nSPDhits, nTTClusters, nTTracks, nTracks, nUpstreamTracks, nVeloClusters, nVeloTracks
Search is supported by
Staging: using files from tape
❍ If jobs use files that are not online (on disk)
❏ Before submitting the job, stage the file from tape and pin it in the cache
❍ Stager agent
❏ Also performs cache management
❏ Throttles staging requests depending on the cache size and the amount of pinned data
❏ Requires fine tuning (pinning and cache size)
✰ Caching architecture is highly site dependent
✰ No publication of cache sizes (except Castor and StoRM)
❍ Jobs using staged files (see the sketch below)
❏ First check that the file is still staged
✰ If not, reschedule the job
❏ Copy the file locally to the WN whenever possible
✰ Space is released faster
✰ More reliable access for very long jobs (reconstruction) or jobs using many files (merging)
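A schematic sketch of the per-job logic described above; all helper names are hypothetical stand-ins, not real DIRAC interfaces.

# Toy outline of the "jobs using staged files" behaviour
def is_staged(lfn):       return True                        # would query the SE/SRM in reality
def reschedule(job):      print("rescheduling", job)         # back to the central queue
def copy_to_worker(lfn):  return "/tmp/" + lfn.rsplit("/", 1)[-1]

def prepare_input(job, lfn):
    if not is_staged(lfn):        # file may have been evicted from the disk cache meanwhile
        reschedule(job)           # reschedule rather than wait for a tape recall on the WN
        return None
    try:
        return copy_to_worker(lfn)   # local copy: cache space released faster, and long jobs
    except OSError:                  # no longer depend on the remote storage element
        return lfn                   # fall back to remote access through the SE

print(prepare_input("job-42", "/lhcb/LHCb/Collision12/BHADRON.DST/some-file.dst"))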
Data Management Outlook
❍ Improvements on staging
❏ Improve the tuning of cache settings
✰ Depends on how caches are used by sites
❏ Pinning/unpinning
✰ Difficult if files are used by more than one job
❍ Popularity (see the sketch below)
❏ Record dataset usage
✰ Reported by jobs: number of files used in a given dataset
✰ Account number of files used per dataset per day/week
❏ Assess dataset popularity
✰ Relate usage to dataset size
❏ Take decisions on the number of online replicas
✰ Taking into account available space
✰ Taking into account expected need in the coming week
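An illustrative sketch of the popularity idea; the metric, dataset name and replica thresholds below are invented for illustration and are not the LHCb policy.

from collections import defaultdict

# Usage reports from jobs: number of files of a dataset used on a given day
usage = defaultdict(int)
usage[("Collision12/Stripping19/BHADRON.DST", "2012-07-01")] = 12000
usage[("Collision12/Stripping19/BHADRON.DST", "2012-07-02")] = 9000

def popularity(dataset, n_files_in_dataset, days):
    used = sum(usage[(dataset, d)] for d in days)
    return used / (n_files_in_dataset * len(days))      # accesses per file per day

pop = popularity("Collision12/Stripping19/BHADRON.DST", 5000, ["2012-07-01", "2012-07-02"])
replicas = 4 if pop > 1.0 else 2 if pop > 0.1 else 1     # hypothetical thresholds
print(f"popularity = {pop:.2f} -> keep {replicas} disk replica(s)")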
Longer term challenges
❍ Computing model evolution
❏ Networking performance allows evolution from the static "Tier" model of data access to a much more dynamic model
❏ Current processing model of 1 sequential job per file per CPU breaks down in the multi-core era due to memory limitations
✰ parallelisation of event processing
❏ Whole-node scheduling, virtualisation, clouds
❍ LHCb upgrade in 2018
❏ From the computing point of view:
✰ x40 readout rate into HLT farm
✰ x4 event rate to storage (20 kHz)
✰ x1.5 event size (100 kB/RAW event)
❏ x10 increase in data rates implies (see the arithmetic sketch below)
✰ scaling of DIRAC WMS and DMS
✰ scaling of data selection catalogs
✰ new ideas for data mining?
❍ Data preservation and open access
❏ Preserve the ability to analyse old data many years in the future
❏ Make data available for analysis outside LHCb + general public
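Putting the upgrade factors together for the storage path (a sketch; the combined factor from rate and event size alone comes out somewhat below the x10 the slide quotes, which is an order-of-magnitude figure):

# Data rate to storage: 2012 running vs. the upgrade figures above
rate_2012_khz, size_2012_kb = 5, 60        # ~5 kHz, ~60 kB RAW event
rate_upg_khz, size_upg_kb = 20, 100        # 20 kHz, 100 kB RAW event
ratio = (rate_upg_khz * size_upg_kb) / (rate_2012_khz * size_2012_kb)
print(f"~{rate_upg_khz * size_upg_kb / 1000:.0f} GB/s to storage, "
      f"~{ratio:.0f}x the 2012 rate")      # ~2 GB/s, roughly an order of magnitude more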