ALICE Plenary | March 24, 2014 | Pierre Vande Vyvre 1
O2 Project: Status Report
Pierre VANDE VYVRE
O2 Project Requirements

Detector   Input to Online System   Peak Output to Local       Avg. Output to Computing
           (GByte/s)                Data Storage (GByte/s)     Center (GByte/s)
TPC        1000                     50.0                       8.0
TRD        81.5                     10.0                       1.6
ITS        40                       10.0                       1.6
Others     25                       12.5                       2.0
Total      1146.5                   82.5                       13.2
- Handle > 1 TByte/s detector input
- Produce (timely) physics results
- Online reconstruction to reduce the data volume
- Minimize "risk" for physics results
- Common hardware and software system developed by the DAQ, HLT and Offline teams
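The requirement bullets are driven by the arithmetic of the table above; a quick sanity check of the implied online compression factors (plain Python over the quoted figures):

```python
# Per-detector rates from the requirements table (GByte/s):
# (input to online system, peak output to storage, avg output to computing center)
rates = {
    "TPC":    (1000.0, 50.0, 8.0),
    "TRD":    (81.5, 10.0, 1.6),
    "ITS":    (40.0, 10.0, 1.6),
    "Others": (25.0, 12.5, 2.0),
}

total_in = sum(r[0] for r in rates.values())    # 1146.5 GB/s -> the ">1 TByte/s" input
total_peak = sum(r[1] for r in rates.values())  # 82.5 GB/s to local data storage
total_avg = sum(r[2] for r in rates.values())   # 13.2 GB/s to the computing center

print(f"overall peak compression: {total_in / total_peak:.1f}x")   # ~13.9x
print(f"TPC reduction: {rates['TPC'][0] / rates['TPC'][1]:.0f}x")  # 20x
```

The 20x TPC figure matches the "factor ~20 for TPC" quoted in the CWG6 calibration section.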
O2 Project
• PLs: P. Buncic, T. Kollegger, P. Vande Vyvre
Computing Working Group (CWG) Chair
1. Architecture S. Chapeland
2. Tools & Procedures A. Telesca
3. Dataflow T. Breitner
4. Data Model A. Gheata
5. Computing Platforms M. Kretz
6. Calibration C. Zampolli
7. Reconstruction R. Shahoyan
8. Physics Simulation A. Morsch
9. QA, DQM, Visualization B. von Haller
10. Control, Configuration, Monitoring V. Chibante
11. Software Lifecycle A. Grigoras
12. Hardware H. Engel
13. Software framework P. Hristov
Project Organization
[Diagram: the O2 CWGs contribute to the O2 Technical Design Report]
Hardware Architecture
[Diagram: detector read-out links (~2500 DDL3s in total, 10 Gb/s each) from the trigger detectors (with L0/L1 signals) and from TPC, TRD, ITS, EMC, TOF, PHO, Muon and FTP feed ~250 First Level Processors (FLPs); the FLPs forward data at 2 x 10 or 40 Gb/s over the farm network to ~1250 Event Processing Nodes (EPNs), which write to data storage via the storage network.]
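Dividing the totals by the node counts gives rough per-node figures (illustrative averages only, assuming a uniform spread over the nodes; the real load differs per detector):

```python
ddls, flps, epns = 2500, 250, 1250      # counts from the architecture slide
total_input_gbs = 1146.5                # detector input (requirements slide)
peak_output_gbs = 82.5                  # peak output to local data storage

links_per_flp = ddls / flps             # average DDL3 links per FLP
input_per_flp = total_input_gbs / flps  # average ingest per FLP
output_per_epn = peak_output_gbs / epns # average storage write rate per EPN

print(f"{links_per_flp:.0f} links/FLP")             # 10
print(f"{input_per_flp:.2f} GB/s per FLP")          # ~4.59
print(f"{output_per_epn * 1000:.0f} MB/s per EPN")  # ~66
```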
CWG1: Architecture – Status & plans
• Working on the TDR
• Architecture is converging
– Now includes asynchronous processing and offloading to other sites
– Discussing requirements and interfaces to the DCS system and its values
• Some requirements need refining to scale the system (in particular the data storage)
– Run 3 operating mode (e.g. running time per year)
– Physics simulation needs
– Will be addressed with Andrea Dainese (physics requirements)
CWG1: Architecture – O2 architecture and data flow
• Now includes the intermediate data storage
• The second global processing step may not run on the EPNs
CWG1: Architecture – Asynchronous and iterative processing (and/or offloading to other sites)
CWG2 – Tools, guidelines and procedures: Status report – achievements
• Activities started in March 2013
• Evaluation procedure completed and approved
• Proposed tools for the organization of the working groups
• Report and presentation templates created
• Tools and policies identified and assigned to CWG2 or other CWGs
• Tools evaluations:
– Issue tracking systems → JIRA proposed and accepted
– Version control systems → Git proposed and accepted
– Website creation tools → Drupal proposed and accepted
• C++ coding conventions:
– Naming and formatting → circulated and accepted
– Coding guidelines → circulated and under discussion
CWG2 – Tools, guidelines and procedures: Status report – ongoing activities and future plans
• Ongoing activities
– Tools evaluations:
• Code and API documentation (an update of the coding conventions for comments will then be needed)
• Future plans
– Policies:
• Licensing (copyright and distribution of ALICE O2 software)
– Tool to help follow the coding conventions, in collaboration with CWG11
CWG3: Dataflow – Data-flow simulation setup
• Current focus on the FLP–EPN data flow
• Implemented with OMNET++, using full TCP/IP simulation
• Heavy computing needs (weeks for some of the simulations)
– Downscaling applied for some simulations:
• Reduce network bandwidth and buffer sizes and check
• Simulate a slice of the system
• Simulation scenarios:
– Different topologies (central switch; spine-leaf)
– Different network bandwidths (10 Mb/s – 40 Gb/s)
– Different levels of detail (many-to-one, many-to-many, few nodes up to full scale)
– Different data distribution schemes (single vs. multiple time frames, level of parallelism)
CWG3: Dataflow – Simulation output
• Many parameters/metrics under investigation
– TCP/IP parameters (link sharing, congestion, router buffers, latency and throughput)
– FLP and EPN buffer requirements
• Preliminary results look promising
– Network traffic under control with available technology (e.g. 40 Gb/s)
– FLP/EPN buffer requirements reasonable (i.e. affordable)
• Issues:
– Many iterations (parameter variations) required
– Long simulation time (hours to days depending on setup/level of detail)
[Plots: simulated traffic at 40 Gbps with 250x1250 nodes and at 40 Mbps with 250x288 nodes – S. Chapeland, C. Delort]
CWG3: Dataflow – System dimensioning
• FLP–EPN network (two-layer switch design, non-blocking configuration)
• Ad-hoc programs to dimension and optimize the network (I. Legrand)
[Plots: number of switches and number of ports for the core switches vs. number of ports for the edge switches, and maximum number of connected nodes for a two-layer system, for 24-, 32-, 36- and 48-port switches]
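The non-blocking two-layer limit can be cross-checked with standard leaf-spine arithmetic (a sketch independent of the ad-hoc programs above): each p-port edge switch dedicates p/2 ports to nodes and p/2 to uplinks, and at most p edge switches fit under the core layer, so the system tops out at p²/2 nodes.

```python
def max_nodes_two_layer(ports):
    """Maximum nodes in a non-blocking two-layer network of identical
    switches: ports // 2 node-facing ports per edge switch, and up to
    `ports` edge switches under the core layer -> ports^2 / 2 nodes."""
    return ports * (ports // 2)

for p in (24, 32, 36, 48):
    print(f"{p}-port switches: up to {max_nodes_two_layer(p)} nodes")
# 24 -> 288, 32 -> 512, 36 -> 648, 48 -> 1152
```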
CWG3: Dataflow – Next steps
• Realistic data size and time distributions
• Dynamic changes in the network (e.g. H/W failure)
• Different technologies (e.g. Infiniband)
• Different data distribution algorithms (e.g. pull, traffic shaping)
• Investigate buffer usage in more detail
• Optimize network throughput, minimize overall cost
• Lab verification:
– Small-scale setup to verify simulation results
– HLT development cluster available for prototyping/tests
– ~70 nodes with ~1000 cores, 1 Gb/s Ethernet, Infiniband QDR
CWG4 – Data model: Time frame based data model
• The group proposes a time-frame-based data model to:
– Formalize the access to data types produced by both detector FEE and data processing stages by prepending a generic Multiple Data Header (MDH)
– Provide strict memory management while minimizing the need for copying data for processing purposes (data service instead of "copy around")
– Use efficient data layouts allowing fast navigation among data types and sources and usage of the data by vectorized algorithms
• Ongoing investigation and prototyping of efficient AOD formats
– Flat vs. hierarchical object structures and the impact on processing speed and data compression
– Investigation of I/O and compression and of the output of synchronous reconstruction, to be discussed with CWG7 (reconstruction)
• Future work: integration, simulation and benchmark
– Realistic raw time frame simulation (CWG8) + time frame aggregation (CWG4) + FLP-to-EPN flow (CWG3) + concurrency model and platforms (CWG5), down to EPN reconstruction
CWG4 – Data model: The new generic data block
• All data blocks produced by either FEE cards or arbitrary processing tasks on the FLPs (e.g. cluster finding) are to be described as generic MDB blocks. An MDH is foreseen to point to several correlated "events" arriving asynchronously on different links of the same FLP.
• Processing of MDB blocks is transparent to the node type (FLP, EPN)
• EPNs will process MDB blocks but are not required to produce MDBs in turn; they produce the persistent event format instead.
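As a purely illustrative sketch of the idea (the actual MDH layout is not specified on these slides; every field name and width below is an assumption), a generic header prepended to each data block could look like:

```python
import struct

# Hypothetical sketch of a generic "Multiple Data Header" (MDH) prepended
# to every data block (MDB). Field names and widths are illustrative
# assumptions -- the slides only say that a generic header describes blocks
# from both FEE and processing stages.
MDH_FORMAT = "<IIQQ"  # block size, origin/type id, heartbeat id, link mask

def make_mdb(payload: bytes, origin: int, heartbeat: int, links: int) -> bytes:
    """Prepend an MDH to a payload, producing a self-describing MDB block."""
    header = struct.pack(MDH_FORMAT, len(payload), origin, heartbeat, links)
    return header + payload

def read_mdh(block: bytes):
    """Decode the header fields without touching the payload."""
    return struct.unpack_from(MDH_FORMAT, block)

blk = make_mdb(b"\x00" * 64, origin=0x545043, heartbeat=42, links=0b1011)
size, origin, hb, links = read_mdh(blk)
print(size, hex(origin), hb, bin(links))  # 64 0x545043 42 0b1011
```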
CWG4 – Data model: The time frame data
• Time frames start and end with O2 "heartbeat" MDHs (events) and embed all data blocks collected by a given FLP. The corresponding frames will have to be aggregated on an EPN node in a folder-like structure that is easy to browse by reconstruction algorithms. The fast (synchronous) persistent reconstruction format will have to achieve the required overall compression.
• Note that the HBE summary may be attached to the "end HBE" to allow for asynchronous dispatching of blocks before the frame is fully aggregated by the FLP
CWG6: Calibration
• Minimum requirement for online calibration: safe data reduction
– Factor ~20 for TPC, relying on (first-level) calibration and (standalone or global) reconstruction
• Calibration processes should deal with the new data flow
– FLPs: see all events, for part of a detector
– EPNs: see only time frames, for all detectors
• Identification (detector by detector) of the procedures to be run on the FLPs and/or EPNs is ongoing
– Calibration input
– CPU requirements
– Memory requirements
– Detector interdependencies
– Statistics (including handling of the time frame data format)
CWG6: Calibration
• Calibrations produced asynchronously (i.e. by a process external to data taking) could be needed, especially where high statistics are required
– To be used at analysis level
• Evaluation of different scenarios: online (output needed as the data come), quasi-online (data processed with some delay with respect to data taking, but output available by the beginning of the next fill) and offline (only data reduction performed online)
– Includes evaluating the possibility of reserving dedicated machines for calibrations for which fast feedback is needed
• Equivalent of the OCDB to be defined (together with CWG3)
– Time-dependent calibrations (following time frames)
– Synchronization
CWG7: Reconstruction
• Different scenarios of the reconstruction flow (depending on the speed of the different processes):
– minimal – to ensure data storage
– maximal – physics-analysis-grade reconstruction
At the moment only rough estimates of the timing of some components are possible
• ITS
– Finalizing code for the detailed implementation of the upgrade geometry; global tracking adapted
– Work on the implementation of the ITS standalone tracker
– Preliminary scheme for clusterization and cluster data compression, to be finalized once the pixel chip architecture is defined
• TPC
– Schematic understanding of the reconstruction and calibration process
• TRD
– Interdependencies with the TPC are understood; need to verify the reliability of online tracklets
The status of calibration and reconstruction is summarized in the CWG6/7 joint "Conceptual Design Note", to be converted into a chapter of the TDR
Calibration/reconstruction flow
[Diagram: four-step calibration/reconstruction flow. Step 1 (all FLPs): raw data → clusterization and calibration. Step 2 (one EPN): TPC track finding; ITS standalone track finding/fitting and vertexing; MFT and MUON standalone track finding/fitting; TRD-seeded track finding and matching with TPC; compressed data storage. Step 3: final TPC calibration (constrained by ITS, TRD); TPC-ITS matching; matching to TOF, HMPID and calorimeters; MUON/MFT matching. Step 4: final ITS-TPC matching with outward refitting; global track inward fitting; V0 and cascade finding; event building (vertex, track, trigger association); AOD storage. Calibration inputs: MC reference TPC map adjusted for the current luminosity; average TPC map rescaled with the FIT multiplicity; PID calibrations; DCS data.]
The exact partitioning of some components between real-time, quasi-online and offline processing depends on the (as yet unknown) CPU performance of each component
Closer look at ITS-TPC-TRD – Most critical problem: TPC SCD calibration
• Current understanding of the TPC-TRD dependency:
– TRD T0 calibration is enough for TRD track finding with optimal position resolution; this calibration is performed on the FLPs using the position of the pulse start
– TPC standalone tracking with a rescaled "average" SCD map correction is sufficient for seeded track finding in the TRD
– Constraints from TRD and ITS tracks matched to the TPC are enough for the SCD fluctuation calibration (at ~200 Hz rate)
– TRD Vdrift and ExB calibration (used for PID only) is done using the finally refitted TPC (+ITS) tracks
[Diagram: raw data → TRD T0 calibration (all FLPs); TPC vdrift + track finding; ITS standalone track finding/fitting and vertexing; TRD track finding with online tracklets and seeding from TPC; final TPC calibration of SCD fluctuations (constrained by ITS, TRD) and TPC-ITS matching; TRD vdrift and ExB calibration]
CWG8: Physics Simulation
• Geant4 v10 physics validation
– Central production; validation by physics observables
– Tests with multi-threading
• Short term (within a few months)
– Performance tests with Geant4 VMC 3.00 + ALICE geometry
• Long term (next 1-2 years)
– MT tests with AliRoot; requires migration of the AliRoot VMC application
• Fast simulation framework
– Full and parameterized simulation; first prototype in autumn 2014
CWG8: Geant4 VMC 3.00
• First Geant4 VMC version providing support for the Geant4 multi-threading mode
– Beta version (3.00.b01) released on 14 March 2014
– By I. Hrivnacova, IPNO (CNRS/IN2P3, Univ. Paris-Sud), with the participation of A. Gheata, CERN (migration of G4ROOT)
• Single source code for both sequential and multi-threading modes
– VMC applications which were not migrated to MT can be built and run with the same Geant4 VMC as migrated applications
• MT mode is activated automatically when Geant4 VMC is built against the Geant4 MT libraries
• All (5) VMC examples were migrated to MT and can be run in this mode both with Geant4 native and ROOT navigation
CWG8: Geant4 VMC 3.00 (2)
• A new set of classes for ROOT I/O management, which takes care of locking critical operations (registering ROOT objects to trees etc.), is introduced in the new mtroot package: http://root.cern.ch/drupal/content/mtroot
• The instructions for migrating VMC applications to MT are available from the VMC Web site: http://root.cern.ch/drupal/content/multi-threaded-processing
• Besides MT, VMC application main programs have been added, together with CMake configuration files, which allow running VMC without dynamic loading of libraries
– This allowed evaluating the performance penalty due to dynamic loading of shared libraries in the VMC tests
– The penalty of dynamic vs. static loading of shared libraries was ~12% in sequential and ~22% in multi-threading mode
CWG9 – QA, DQM & Visualization: Status & plans
• Run 2
– Event display review and refactoring
• Status: ongoing
• Responsibility being transferred to the Warsaw group
• Meetings and demo of the new architecture + collaboration with HLT on a new communication protocol
• PhD student from Warsaw to join in April
– Proposal for the online reconstruction and calibration
• Status: started, on hold
• Preliminary architecture
CWG9 – QA, DQM & Visualization: Status and plans
• Run 3
– System requirements and system functionalities document
• Status: done
– Detector needs survey
• Status: ongoing, almost finished
– Definition of the future architecture and design
• Status: ongoing
– Prototypes and feasibility tests
• Status: to be done in the near future
– Writing of the Technical Design Report
• Status: ongoing
CWG10 – Control, Configuration and Monitoring
• Activities started in April 2013
• Software Requirements Specifications– https://twiki.cern.ch/twiki/pub/ALICE/Cwg10/CWG10SoftwareRequirementsSpecifications.pdf
• Requirements (Number of processes)– https://twiki.cern.ch/twiki/pub/ALICE/Cwg10/NumberOfProcessesEstimate.pdf
• Ongoing activities– Writing TDR content: chapter 4
• First draft almost finished
• Future plans– Continue writing TDR (Chapters 5 and 6)– Prototypes for key performance requirements
• Number of control commands, monitoring data volume, configuration distribution
CWG10 – Control, Configuration and Monitoring: Roles hierarchy
[Diagram: roles hierarchy]
CWG5: Computing Platforms – The conversion factors
• Speedup of CPU multithreading:
– A task takes n1 seconds on 1 core and n2 seconds on x cores. The speedup is n1/n2 for x cores; the factors are n1/n2 and x/1.
– With hyperthreading: n2' seconds with x' threads on x cores (x' >= 2x). This will not scale linearly, but is needed to compare against the full CPU performance.
• The factors are n1/n2' and x/1 (be careful: not x'/1, since we still use only x cores)
• Speedup of GPU vs. CPU:
– Should take into account the full CPU power (i.e. all cores, hyperthreading)
– A task on the GPU might also need CPU resources; assume this occupies y CPU cores
– The task takes n3 seconds on the GPU; the speedup is n2'/n3 and the factors are n2'/n3 and y/x (again x, not x')
• How many CPU cores does the GPU save:
– Compare to y CPU cores, since the GPU needs that many CPU resources
– The speedup is n1/n3; the GPU saves n1/n3 - y CPU cores. The factors are n1/n3, y/1, and n1/n3 - y
• Benchmarks: track finder, track fit, DGEMM (matrix multiplication – synthetic)
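Applied to the numbers on the benchmark slide that follows, this bookkeeping reproduces the quoted factors (a small script, using the Westmere and GTX580 entries as examples):

```python
def cpu_factors(n1, n2p, x):
    """Speedup bookkeeping from the slide: n1 = 1-thread time, n2p = time
    with hyperthreading (x' threads on x physical cores). The factor pair
    is (n1/n2p, x): compare against x cores, not x' threads."""
    return n1 / n2p, x

def gpu_factors(n1, n2p, n3, y):
    """GPU vs full CPU: speedup n2p/n3 at a 'cost' of y CPU cores; the GPU
    saves n1/n3 - y single-core equivalents."""
    return n2p / n3, n1 / n3 - y

# Westmere 6-core: 4735 ms on 1 thread, 506 ms with 12 threads on 6 cores
s, x = cpu_factors(4735, 506, 6)
print(f"{s:.2f} / {x}")  # 9.36 / 6, matching the slide

# GTX580 vs dual Sandy Bridge: 4526 ms 1-thread, 320 ms full CPU,
# 174 ms on the GPU using y = 3 CPU cores
g, saved = gpu_factors(4526, 320, 174, 3)
print(f"{g:.2f}, saves ~{saved:.0f} cores")  # 1.84, saves ~23 cores
```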
CWG5: Computing Platforms – Track finder

Westmere 6-core, 3.6 GHz                           Factors
  1 thread                            4735 ms
  6 threads                            853 ms      5.55 / 6
  12 threads (x = 6, x' = 12)          506 ms      9.36 / 6

Nehalem 4-core, 3.6 GHz (smaller event than the others)
  1 thread                            3921 ms
  4 threads                           1039 ms      3.77 / 4
  12 threads (x = 4, x' = 12)          816 ms      4.80 / 4

Dual Sandy Bridge, 2 x 8-core, 2 GHz
  1 thread                            4526 ms
  16 threads                           403 ms      11.1 / 16
  36 threads (x = 16, x' = 36)         320 ms      14.1 / 16

Dual AMD Magny-Cours, 2 x 12-core, 2.1 GHz
  36 threads (x = 24, x' = 36)         495 ms

3 CPU cores + GPU – all compared to the Sandy Bridge system
                                             Factor vs x' (full CPU)   Factor vs 1 (1 CPU core)
  GTX580                            174 ms   1.8 / 0.19                26 / 3 / 23
  GTX780                            151 ms   2.11 / 0.19               30 / 3 / 27
  Titan                             143 ms   2.38 / 0.19               32 / 3 / 29
  S9000                             160 ms   2.0 / 0.19                28 / 3 / 25
  S10000 (dual GPU, 6 CPU cores)     85 ms   3.79 / 0.38               54 / 6 / 48
CWG12 – Computing Hardware: FLP I/O bandwidth (H. Engel)
• I/O bus performance
• PCIe Gen2 x8
– Using the C-RORC as data generator
– ASUS ESC4000: > 3 GB/s per slot
– With TPC tracking code running on the GPUs: total I/O of 17 GB/s
• PCIe Gen3 x8
– Xilinx Virtex-7 XC7VX330T as data generator
– Supermicro X9SRE-F: ~5-6 GB/s per slot
• The current generation of I/O buses could be used for the upgrade
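For context, the measured per-slot figures can be compared with the theoretical per-direction PCIe bandwidth (standard bus parameters, not taken from the slides):

```python
# Theoretical per-direction PCIe x8 bandwidth vs. the measured figures above.
# Gen2: 5 GT/s per lane, 8b/10b encoding  -> 0.5 GB/s per lane.
# Gen3: 8 GT/s per lane, 128b/130b encoding -> ~0.985 GB/s per lane.
def pcie_x8_gbs(gen):
    rate_gt, efficiency = {2: (5.0, 8 / 10), 3: (8.0, 128 / 130)}[gen]
    return 8 * rate_gt * efficiency / 8  # lanes * GT/s * encoding / 8 bits/byte

print(f"Gen2 x8: {pcie_x8_gbs(2):.1f} GB/s theoretical, >3 GB/s measured")
print(f"Gen3 x8: {pcie_x8_gbs(3):.1f} GB/s theoretical, ~5-6 GB/s measured")
```

The measured throughputs sit plausibly below the theoretical maxima once protocol overhead is accounted for, supporting the conclusion that the current bus generation suffices.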
CWG13 – Software Framework Development
• CWG13 is just starting
• Design and development of a new, modern framework targeting Run 3
• Should work in both the offline and online environments
– Has to comply with the O2 requirements and architecture
• Based on new technologies
– ROOT 6.x, C++11
• Optimized for I/O
– New data model
• Capable of utilizing hardware accelerators
– FPGA, GPU, MIC…
• Support for concurrency in a heterogeneous and distributed environment
• Will be based on ALFA, a common software foundation developed jointly by ALICE & GSI/FAIR
[Diagram: the ALFA common software foundations underpin the O2 software framework, FairRoot, PandaRoot and CbmRoot]
CWG13 – Software Framework Development: ALICE + FAIR = ALFA
• Expected benefits
– Development cost optimization
– Better coverage and testing of the code
– Documentation, training and examples
– For ALICE: work already performed by the FairRoot team on features (e.g. the continuous read-out) which are part of the ongoing FairRoot development
– For the FAIR experiments: ALFA could be tested with real data and existing detectors before the start of the FAIR facility
• The proposed architecture will rely on a data-flow based model
O2 TDR Editorial Committee
• Members
– Latchezar Betev
– Predrag Buncic
– Sylvain Chapeland
– Frank Cliff
– Peter Hristov
– Thorsten Kollegger
– Ken Read
– Jochen Thaeder
– Barth von Haller
– Pierre Vande Vyvre
• Physics requirements chapter: Andrea Dainese
• General structure, ToC and tools defined
• Next meetings
– 1 April
– 5-7 May: TDR working days
• All the WGs are working on their respective sections of the TDR
O2 Project – Institutes
• Institutes
– FIAS, Frankfurt, Germany
– IIT, Mumbai, India
– Jammu University, Jammu, India
– IPNO, Orsay, France
– IRI, Frankfurt, Germany
– Rudjer Bošković Institute, Zagreb, Croatia
– SUP, Sao Paulo, Brazil
– University of Technology, Warsaw, Poland
– Wigner Institute, Budapest, Hungary
– CERN, Geneva, Switzerland
• Looking for more groups and people
– Need people with computing skills and from the detector groups
• Active interest from
– Creighton University, Omaha, US
– KISTI, Daejeon, Korea
– KTO Karatay University, Turkey
– Lawrence Berkeley National Lab., US
– LIPI, Bandung, Indonesia
– Oak Ridge National Laboratory, US
– University of Houston, US
– University of Texas, US
– Wayne State University, US
– King Mongkut's University of Technology Thonburi, Bangkok, Thailand
• Active interest from– Creighton University, Omaha, US– KISTI, Daejeon, Korea– KTO Karatay University, Turkey– Lawrence Berkeley National Lab., US– LIPI, Bandung, Indonesia– Oak Ridge National Laboratory, US– University of Houston, US– University of Texas, US– Wayne State University, US– King Mongkut's University of Technology Thonburi, Bangkok, Thailand