DAQ Andrea Petrucci 6 May 2008 – CMS-UCSD meeting OUTLINE Introduction SCX Setup Run Control Current Status of the Tests Summary

DAQ

Andrea Petrucci

6 May 2008 – CMS-UCSD meeting

OUTLINE

• Introduction

• SCX Setup

• Run Control

• Current Status of the Tests

• Summary

Page 2: DAQ Andrea Petrucci 6 May 2008 – CMS-UCSD meeting OUTLINE Introduction SCX Setup Run Control Current Status of the Tests Summary

Introduction

• Started commissioning the Readout Builder at its full size

• Many people working together to get this done

• For the first time we have almost two full DAQ Slices to test

• For now, tests are limited to two slices of ~640 PCs (rows A, B, E and F)

• We are still gaining experience with the installation and maintenance of a cluster of O(1000) PCs

• For the XDAQ software and Run Control, too, this is the first time we work with ~1000 PCs communicating with each other

6-May-2007 Andrea Petrucci - UC San Diego 2

SCX layout

RU: 320 PCs
• Row A with 2 rails (ru-c2a[1-4]-[1-20])
• Row B with 2 rails (ru-c2b[1-4]-[1-20])
• Row E with 2 rails (ru-c2e[1-4]-[1-20])
• Row F with 4 rails (ru-c2f[1-4]-[1-20])

BU-FU: 320 PCs
• Row A with 2 rails (ru-c2a[5-8]-[1-20])
• Row B with 2 rails (ru-c2b[5-8]-[1-20])
• Row E with 2 rails (ru-c2e[5-8]-[1-20])
• Row F with 2 rails (ru-c2f[5-8]-[1-20])

Rows A and B are connected to one Force10 switch, and rows E and F to the other.

[Figure: layout of rows A, B, E and F and the 2 Force10 switches]
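The host-name patterns on this slide can be expanded mechanically to check the PC counts. The sketch below is an illustration only: it assumes that each bracketed group such as `[1-4]` or `[1-20]` is an inclusive integer range, and the function name `expand` is hypothetical.

```python
import itertools
import re


def expand(pattern):
    """Expand bracketed integer ranges, e.g. 'ru-c2a[1-4]-[1-20]', into host names."""
    # re.split with two capture groups yields: literal, lo, hi, literal, lo, hi, literal
    parts = re.split(r"\[(\d+)-(\d+)\]", pattern)
    literals = parts[0::3]
    ranges = [range(int(lo), int(hi) + 1)
              for lo, hi in zip(parts[1::3], parts[2::3])]
    hosts = []
    for combo in itertools.product(*ranges):
        name = literals[0]
        for value, literal in zip(combo, literals[1:]):
            name += str(value) + literal
        hosts.append(name)
    return hosts


# RU hosts of rows A, B, E and F, using the patterns listed on the slide.
ru_hosts = [host for row in "abef"
            for host in expand(f"ru-c2{row}[1-4]-[1-20]")]
```

Under this reading each row contributes 4 x 20 = 80 host names, so the four rows give the 320 RU PCs quoted on the slide.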

SCX Setup

• Used DummyRUs and ~200 FRLs (Tracker) for testing

• Different types of trapezoidal configurations:
– 1 slice with 4 rails (68 DummyRUs x 224 BUs)
– 1 slice with 4 rails (200 FRLs x 68 RUs x 224 BUs x 672 FUs), close to the final slice
– 4 slices with 2 rails (per slice: 32 RUs x 47 BUs x 147 FUs)
– 8 slices with 2 rails (per slice: 32 RUs x 47 BUs x 147 FUs)

• Many different activities are going on in parallel:
– System and software installation/update
– System monitoring optimization
– …

• Testing the first slice
– The XDAQ installation is XDAQ build 6
– Monitoring system (slp, sentinel, …) is enabled

• During the last months the system was down many times, and it takes some time to set it up again.

RU Builder Slices

DAQ Software Installation

• All DAQ software installation is managed by a central Quattor server. Quattor is a system administration toolkit providing a powerful, portable and modular tool suite for the automated installation, configuration and management of clusters and farms running Linux.

• Quattor makes it possible to re-install a PC in a few minutes.

• There are different Quattor templates for each type of PC:
– RU and BUFU PCs
– Run Control PCs
– FRL and FMM PCs
– Etc.

• All DAQ software developers have put a lot of effort into "Quattorizing" their software (RPM).

DAQ Configurator

A DAQ Configuration contains
• One XML configuration file per XDAQ executive
– Including Myrinet FED-Builder configuration
– Including O(100000) I2O connections
– Up to several hundred MB of XML
• Control structure
– Hierarchy of function managers
– Executives and Applications to be controlled

Central DAQ System
• Currently O(1000) hosts
– ~10% controlling custom hardware
• O(10000) XDAQ applications

[Figure: Central DAQ system. Labels: 2×10^7 electronics channels, 40 MHz, 100 Hz]

DAQ Configurator Data Flow

[Figure: DAQ Configurator data flow. Labels: HWCfg Database (EQSet, FBSet, DPSet); Hardware Configuration API; Software Template API; Software Template DB; RS3; RS API; CMS DAQ Configurator; SWTemplate GUI; Configurator GUI; Configurator API; JAVA Fillers]

Numbered steps in the figure:
1. Create FEDBuilderSets & DAQPartitionSets
2. Manage/create Software Templates
3. Select DAQPartition (Hardware Structure) & Software Template
4. Fill DB
5. Load configuration and configure the system

Run Control and Monitor System

RCMS is integrated in the general CMS DAQ system, providing control and monitoring of the two other components:

• the DAQ components, which manage the main data flow. They include the Front End Drivers (FED), the Readout Units (RU), the Builder Units (BU), the Filter Units (FU), and the trigger and data flow control system.

• the "Detector Control System" (DCS), which manages the slow controls of the whole experiment.

The XML data format and the W3C standard SOAP protocol have been adopted as the main means of communication.

XDAQ is a C++ framework for distributed data acquisition systems. It implements:
– configuration (parameterization)
– communication over multiple network technologies concurrently
– high-level provision of system services (memory management, tasks, ...)
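To make the XML/SOAP communication mentioned above concrete, here is a minimal sketch of how a control command could be wrapped in a SOAP 1.1 envelope. Only the envelope namespace is standard. The command namespace `urn:example:daq-command`, the function `build_command` and the `runType` parameter are hypothetical and do not reflect the actual RCMS or XDAQ message schema.

```python
import xml.etree.ElementTree as ET

SOAP_ENV = "http://schemas.xmlsoap.org/soap/envelope/"  # SOAP 1.1 envelope namespace
CMD_NS = "urn:example:daq-command"  # hypothetical namespace, for illustration only


def build_command(name, parameters=None):
    """Return a SOAP 1.1 envelope (as bytes) carrying a single command element."""
    envelope = ET.Element(f"{{{SOAP_ENV}}}Envelope")
    ET.SubElement(envelope, f"{{{SOAP_ENV}}}Header")
    body = ET.SubElement(envelope, f"{{{SOAP_ENV}}}Body")
    command = ET.SubElement(body, f"{{{CMD_NS}}}{name}")
    # Each parameter becomes a child element of the command element.
    for key, value in (parameters or {}).items():
        ET.SubElement(command, f"{{{CMD_NS}}}{key}").text = str(value)
    return ET.tostring(envelope, encoding="utf-8")


# Hypothetical "Configure" command with one illustrative parameter.
message = build_command("Configure", {"runType": "test"})
```

The resulting bytes could be sent as the body of an HTTP POST. The receiver would parse the first child of the SOAP Body to find out which command was requested.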

RCMS Services

– SECURITY SERVICE
• login and user account management;

– RESOURCE SERVICE (RS)
• information about DAQ resources and partitions;

– INFORMATION AND MONITOR SERVICE (IMS)
• collects messages and monitor data; distributes them to the subscribers;

– JOB CONTROL
• starts, monitors and stops the software elements of RCMS, including the DAQ components.

Logging System

• Collects log information from log4j-compliant applications (i.e. on-line processes).

• Sends log information directly to a display system (Chainsaw).

• Stores log information in a database and visualizes it (LogDBViewer).

[Figure: Log Collector architecture. Labels: RCMS applications and XDAQ applications; Log Collector; Publish/Subscriber System; Storage System; Relational DB (Oracle, MySQL); access via JDBC; access via TCP]

Function Managers Control Structure

• The user interacts through a Web Browser connected to the Level 0 FM.

• The Level 0 FM is the entry point to the Run Control System.

• Level 1 FMs interface to the Level 0 FM and have to implement a standard set of inputs and states.

• Level 2 FMs are sub-system-specific custom implementations.

• Resources are on-line system components.

[Figure: function manager control tree. Labels: Web Browser (GUI); Level 0 FM; Level 1 FM; Level 2 FM; TOP; LTC, CSC, DAQ, RPC, DT, TRK, ECAL, HCAL; FB, RB, FF; Resources: FEC, FED]
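The hierarchical control structure can be illustrated with a small tree of function managers that pass a command down to their children. This is a simplified sketch, not the RCMS implementation. The class name, the reduced command set and the state names in `TRANSITIONS` are hypothetical, chosen only to show the idea of a standard set of inputs and states.

```python
class FunctionManager:
    """Hypothetical function manager node: a state plus child managers."""

    # Hypothetical transition table: command -> (required state, resulting state).
    TRANSITIONS = {
        "Initialize": ("Created", "Initialized"),
        "Configure": ("Initialized", "Configured"),
        "Start": ("Configured", "Running"),
        "Stop": ("Running", "Configured"),
    }

    def __init__(self, name, children=()):
        self.name = name
        self.children = list(children)
        self.state = "Created"

    def execute(self, command):
        required, target = self.TRANSITIONS[command]
        if self.state != required:
            raise RuntimeError(f"{self.name}: cannot {command} from {self.state}")
        # Children go first, so a parent reports the new state only after
        # its whole subtree has reached it.
        for child in self.children:
            child.execute(command)
        self.state = target


# Small tree using node labels from the slide: TOP -> DAQ -> FB, RB, FF; TOP -> TRK.
daq = FunctionManager("DAQ", [FunctionManager(n) for n in ("FB", "RB", "FF")])
top = FunctionManager("TOP", [daq, FunctionManager("TRK")])
for cmd in ("Initialize", "Configure", "Start"):
    top.execute(cmd)
```

Because every node accepts the same inputs, the top-level manager does not need to know how a sub-system implements a transition internally.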

Run Control GUIs

1) RCMS GUI

2) Function Manager Level Zero GUI

3) FED and TTS GUI

Tests & Measurements DAQ System

GOALS
• Understand the problems of running a big DAQ system:
– Reliability, scalability and the monitoring system.
• Measurements:
– Determine whether the performance of the system is acceptable.

TESTED CONFIGURATIONS
• Different configurations have been tested:
A. 68 dummy RUs x 224 BUs, 4 rails from the RUs and 2 rails to the BUs.
B. 68 dummy RUs x 224 BUs x 672 FUs, 4 rails from the RUs and 2 rails to the BUs.
C. 8 slices with GTPe and ~200 FRLs, per slice: 32 RUs x 47 BUs x 147 FUs (CMSSW locally).
D. 4 slices with GTPe and ~100 FRLs, per slice: 32 RUs x 47 BUs x 147 FUs (CMSSW NFS).
• Test B should perform almost the same as the final slice configuration (72 RUs x 288 BUs x 864 FUs).

For these tests I created a stand-alone Java application. It controls the Level Zero FM with the following commands:

Create, Initialize, Connect, Configure, Get Ready, Start, Stop, Destroy
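The measurement loop described on this slide can be sketched as follows. This is a Python illustration, not the stand-alone Java application used for the tests. `send_command` is a hypothetical callable that blocks until the command completes and raises on failure, and continuing with the next command after a failure is an assumption. AVEDEV is taken here to be the mean absolute deviation from the average.

```python
import statistics
import time

COMMANDS = ["Create", "Initialize", "Connect", "Configure",
            "Get Ready", "Start", "Stop", "Destroy"]


def measure(send_command, iterations, clock=time.perf_counter):
    """Time every command over several iterations and count failures."""
    times = {c: [] for c in COMMANDS}
    failed = {c: 0 for c in COMMANDS}
    for _ in range(iterations):
        for command in COMMANDS:
            start = clock()
            try:
                send_command(command)  # hypothetical: raises on failure
            except Exception:
                failed[command] += 1
                continue  # assumption: carry on with the next command
            times[command].append(clock() - start)
    return times, failed


def summarize(samples):
    """MAX, MIN, AVERAGE and AVEDEV (mean absolute deviation) of one column."""
    mean = statistics.fmean(samples)
    avedev = statistics.fmean(abs(x - mean) for x in samples)
    return {"MAX": max(samples), "MIN": min(samples),
            "AVERAGE": mean, "AVEDEV": avedev}
```

Calling `measure` with the real control call and then `summarize(times[c])` for each command would produce one column of a results table, with `failed[c]` as its N. FAILED row.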

Test A: Only EVB

Time per command (in seconds):

           Create  Initialize  Connect  Configure  Get Ready   Start    Stop  Destroy
MAX         2.546      43.771   15.465      3.368      1.321  27.506   4.024    9.317
MIN         0.614      19.949    6.852      1.777      0.797  15.982   1.463    5.779
AVERAGE     1.419      26.361    8.153      1.978      0.991  17.168   2.108    6.842
AVEDEV      0.419       2.781    0.960      0.156      0.098   0.837   0.375    0.526
N. FAILED       0           0        0          0          0       0       0        0

• Setup parameters:
– Dummy events are created in the BUs in generation mode.
– 1 slice with 1x1 FED Builders; events are dropped at the BUs.
– 68 dummy RUs x 224 BUs, 4 rails from the RUs and 2 rails to the BUs.
– Used rows E and F (~320 PCs).
– Controlled 293 XDAQ executives and 585 XDAQ applications (ATCPs, EVM, RUs and BUs).
– XDAQ Monitor Application enabled.
– 50 iterations of the measurement loop (Create, Initialize, Connect, Configure, Get Ready, Start, Stop and Destroy).

• Results:
– RU throughput at 16 and 32 kByte fragment size: ~480 MB/s.

Test B: EVB & Filter Farms

• Setup parameters:
– Dummy events are created in the BUs in generation mode.
– 1 slice with 1x1 FED Builders; events are dropped at the FUs.
– 68 dummy RUs x 224 BUs, 4 rails from the RUs and 2 rails to the BUs.
– 3 FUs per BU and 1 Storage Manager.
– Used rows E and F (~320 PCs).
– Controlled 965 XDAQ executives and 1539 XDAQ applications (ATCPs, EVM, RUs, BUs, FUResourceBrokers and FUEventProcessors).
– All libraries were loaded from local disk.
– XDAQ Monitor Application enabled.
– 100 iterations of the measurement loop (Create, Initialize, Connect, Configure, Get Ready, Start and Destroy).

• Results:
– Could not reach the running state because Filter Farm applications crashed.

Time per command (in seconds):

           Create  Initialize  Connect  Configure  Get Ready  Start  Destroy
MAX        48.658     132.875   17.107     17.089      2.557  Error   31.220
MIN         2.404      55.062    7.330     13.157      0.881  Error   21.949
AVERAGE     4.487      62.195   11.399     14.020      1.072  Error   24.675
AVEDEV      2.011       5.947    2.436      0.758      0.072  Error    1.452
N. FAILED       0           1        0          0          0     99        0

Test C: all system with 8 Slices

Time per command (in seconds):

           Create  Initialize  Connect  Configure  Get Ready   Start    Stop  Destroy
MAX         1.440      91.498   11.795     62.797      1.502  37.201  90.906   40.907
MIN         0.403      64.609    7.669     38.936      1.155  28.617  41.798   21.154
AVERAGE     0.475      71.340    8.589     42.028      1.235  31.547  46.517   25.165
AVEDEV      0.070       3.360    1.041      1.591      0.061   1.397   4.329   11.703
N. FAILED       0           0        0          3          0      10      30        0

• Setup parameters:
– Events are generated in ~200 FRLs, using the GTPe.
– 8 slices with 8x8 FED Builders; events are sent to the Storage Manager.
– 2 rails from the RUs and the BUs.
– Per slice: 32 RUs x 47 BUs x 147 FUs.
– Used rows A, B, E and F (~640 PCs) for Event Builder and Filter Farm.
– Controlled 1976 XDAQ executives and 3202 XDAQ applications (ATCPs, FRLs, EVM, RUs, BUs, FUResourceBrokers, FUEventProcessors and Storage Managers).
– XDAQ Monitor Application enabled and all libraries were loaded from local disk.
– 83 iterations of the measurement loop (Create, Initialize, Connect, Configure, Get Ready, Start, Stop and Destroy).

• Results:
– 240 MB/s throughput all the way to the Storage Manager disk (event size 480k).

Test D: all System 4 Slices

Time per command (in seconds):

           Create  Initialize  Connect  Configure  Get Ready   Start    Stop  Destroy
MAX         5.668     174.592   11.203     33.958      3.176  28.234  41.345   61.813
MIN         0.923      86.504    7.935     30.027      1.198  25.113  37.398   38.825
AVERAGE     1.463     105.880    9.596     31.030      1.415  25.928  38.423   43.572
AVEDEV      0.384      18.896    0.566      0.765      0.151   0.620   0.641    3.425
N. FAILED       0           0        0          0          0       3      28        0

• Setup parameters:
– Events are generated in ~100 FRLs, using the GTPe.
– 4 slices with 4x4 FED Builders; events are sent to the Storage Manager.
– 2 rails from the RUs and the BUs.
– Per slice: 32 RUs x 47 BUs x 147 FUs.
– Used rows E and F (~320 PCs) for Event Builder and Filter Farm.
– Controlled 988 XDAQ executives and 1601 XDAQ applications (ATCPs, FRLs, EVM, RUs, BUs, FUResourceBrokers, FUEventProcessors and Storage Managers).
– XDAQ Monitor Application enabled and Filter Farm libraries were loaded from NFS.
– 100 iterations of the measurement loop (Create, Initialize, Connect, Configure, Get Ready, Start, Stop and Destroy).

• Results:
– The system becomes slower and less reliable when libraries are loaded from NFS.

Tests summary

Average time per command (in seconds):

                               Total  Create  Initialize  Connect  Configure  Get Ready   Start
A  Only EVB (~320 PCs)        55.079   1.419      26.361    8.153      1.978      0.991  17.168
B  EVB+FF (~320 PCs)               -   4.487      62.195   11.399     14.020      1.072       -
C  8 slices (~640 PCs)       154.739   0.475      71.340    8.589     42.028      1.235  31.547
D  4 slices NFS (~320 PCs)   175.312   1.463     105.880    9.596     31.030      1.415  25.928

• Performance:
– Configuration B (close to final slice):
  – Reasonable time to initialize, connect and configure.
– Configuration C:
  – The system scales well.
– Configuration D:
  – The system loses performance if it loads libraries from NFS disk (~2 times slower).

Problems during the tests

• Problems observed during the tests:

– In ~15% of the attempts the system failed to initialize. The XDAQ executive could not start because the HTTP address was already in use. The ATCP application had the same problem.
  FIXED: Setting the XDAQ HTTP port outside the UNIX ephemeral port range was enough to solve the problem.

– The system could not reach the running state because of a segmentation fault in the communication between the BU and the FUResourceBroker.
  FIXED: A bug was found and it is fixed in CMSSW version 2.0.4.

– The system gets stuck in the configuring state ~5% of the time. This is reproducible only with a big system (8 slices and all rows A, B, E and F).
  Work in progress: the problem seems to be in the RunControl Framework.

– The system fails to start (~5% of the time) and to stop (~40% of the time).
  Work in progress: DAQ function managers need to be improved.

– The XDAQ monitor system has a latency of 2 to 3 minutes.
  Work in progress: XDAQ developers are working to improve it.
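The ephemeral-port fix listed above can be checked with a few lines of code. The operating system hands out ephemeral ports to outgoing connections, so a fixed server port inside that range may already be taken when the server tries to bind. The sketch below assumes a Linux host, where the range is exposed in `/proc/sys/net/ipv4/ip_local_port_range`. The function names and the port numbers in the example are hypothetical and are not the values used in the tests.

```python
def read_ephemeral_range(path="/proc/sys/net/ipv4/ip_local_port_range"):
    """Return (low, high) of the ephemeral port range from a Linux proc file."""
    with open(path) as handle:
        low, high = handle.read().split()
    return int(low), int(high)


def is_outside_ephemeral(port, port_range):
    """True if a fixed server port cannot collide with an ephemeral port."""
    low, high = port_range
    return not (low <= port <= high)


# Example with an explicitly given range (illustrative values only).
example_range = (32768, 60999)
safe = is_outside_ephemeral(9999, example_range)     # below the range
unsafe = is_outside_ephemeral(40000, example_range)  # inside the range
```

On a real node one would call `read_ephemeral_range()` and verify every configured HTTP port against the result before starting the executives.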

ATCP application

• Reasonable time to connect all the sockets (max. 15 sec. for 1 slice)

• Solved the problem of the “address already in use” when starting the listening socket.

• Created a new HyperDAQ interface:

• Added “Standard configuration” parameters.

• Added “debug” page.

• Integrated into the XDAQ monitor system.

Summary

• RU Builder commissioning
– First time an RU Builder configuration almost the same as the final slice was used
  • It seems to work fine at 20 kHz per slice and a maximum throughput on the RUs of ~480 MB/s
– FUs and monitor system applications are included
– Reasonable time to initialize and start the system
– Some things are not yet understood (e.g. failures to start and stop)

• Main worries are system instabilities
– Cooling and its monitoring
– Power cuts
– Quattor installation
– System configuration
– Difficulties issuing commands on many PCs at the same time