Download pdf - Real-time monitoring of operational data in the Belle II experiment … · 2020. 10. 22. · for real-time monitoring of the Belle II DAQ system • Working smoothly, making good

Real-time monitoring of operational data

in the Belle II experiment

Takuto KUNIGO on behalf of the Belle II collaboration 22nd IEEE Real Time Conference

Poster session-B

1

SuperKEKB2

4 GeV e+ 7 GeV e- collider

World record achieved in June 2020 L = 2.4 x 1034 cm-2s-1

Goal: 50 ab-1 (= Belle x 50)

Belle II detector3

Inner

OuterMuon spectrometer

Calorimeter

Tracking

Vertex detector

PID

e- (7 GeV)

e+ (4 GeV)

Search for new physics beyond SM via high precision measurement with high statistics samples of B/D/tau decays

Belle II DAQ system4

Many components Need to monitor each component carefully

Monitoring system5

Log messagesInput data

Database

Users

Applications

Integrated database ‘Elasticsearch’

PC usagesSuperKEKB status

CR shifters Sub-systemexperts

Trigger and DAQ condition …

Web-interface ‘Kibana’ visualisations (e.g. efficiency)

Alerting framework

Alerts for ‘ABCD’

check ‘XYWZ’

Offline analyses

Collaboratorsfor example

publication-quality plotsvia ROOT

System overview

Control roomshifters

1. Visualisations on Kibana6Main links

Header for DAQ efficiency

1. Run time breakdown - Total 2. Run-time fraction - Total 3. Deadtime fraction - Total

1. Run time breakdown

2. Run-time fraction

3. Deadtime fraction

Main page | HLT | COPPER | Network | Data transfer | Event size | Deadtime | Run registryDAQ efficiency | Log summary | Daily summary | Error summary

DefinitionConfluence page: DAQ efficiency

1. This is a draft, not finalised yet2. If you have any comments, suggestions, please

post your comments on the confluence page orsend me an e-mail Takuto KUNIGO

Frac

tiona

l rat

io [%

]

0%

20%

40%

60%

80%

100%

_all

All docs

! [1] Belle II ph…! [2] Accelera…! [3] Accelera…! [4] Beam inj…! [5] Accelera…

Frac

tiona

l rat

io [%

]

0%

20%

40%

60%

80%

100%

_all

All docs

! (a) HV ramp…! (b) SALS! (c) HV trip o…! (d) Belle II tr…! Running

Dead

time

[%]

0

1

2

3

4

5

6

_all

All docs

! PXD! SVD! CDC! TOP! ARICH! ECL! KLM! TRG! ttlost! COPPER! FIFO full! PAUSE! Pipeline! APV VETO! Injection VE…

Dura

tion

[sec

]

0

5,000

10,000

15,000

20,000

25,000

30,000

35,000

40,000

45,000

2020-03-22 00:00 2020-03-29 00:00 2020-04-05 00:00 2020-04-12 00:00 2020-04-19 00:00 2020-04-26 00:00 2020-05-03 00:00 2020-05-10 00:00 2020-05-17 00:00 2020-05-24 00:00 2020-05-31 00:00date per 12 hours

! [1] Belle II ph…! [2] Accelera…! [3] Accelera…! [4] Beam inj…! [5] Accelera…

Frac

tiona

l rat

io [%

]

0%

10%

20%

30%

40%

50%

60%

70%

80%

90%

100%

2020-03-22 00:00 2020-03-29 00:00 2020-04-05 00:00 2020-04-12 00:00 2020-04-19 00:00 2020-04-26 00:00 2020-05-03 00:00 2020-05-10 00:00 2020-05-17 00:00 2020-05-24 00:00 2020-05-31 00:00date per 12 hours

! (a) HV ramp…! (b) SALS! (c) HV trip o…! (d) Belle II tr…! Running

Dead

time

[%]

0

5

10

15

20

2020-03-29 00:00 2020-04-05 00:00 2020-04-12 00:00 2020-04-19 00:00 2020-04-26 00:00 2020-05-03 00:00 2020-05-10 00:00 2020-05-17 00:00 2020-05-24 00:00 2020-05-31 00:00date per 12 hours

! PXD! SVD! CDC! TOP! ARICH! ECL! KLM! TRG! ttlost! COPPER! FIFO full! PAUSE! Pipeline! APV VETO! Injection VE…

Exit full screen "

Efficiencies integrated over the time-range

Efficiencies binned in date

for physics data-taking

Belle-IIrunning

① Machine-time breakdown ② Belle-II data-taking breakdown ③ Deadtime breakdown

① Machine-time breakdown

② Belle-II data-taking breakdown

③ Deadtime breakdown

per 12 hours

per 12 hours

per 12 hours

Example: DAQ efficiency

Goals - to categorise the in-efficiency

sources - to choose time-range freely

and… Many monitoring plots(Event size, Network-traffic, errors etc.)

Efficiency definitions 1.Luminosity-based 2.Kibana

2: Alerting system7

Elastalert • is a third-party tool allows us to implement alerting function• Alert destinations: RocketChat, e-mail, SNS message, etc…

Example: Automatic advice for the control room shifters via the RocketChat (chat tool)

Elasticsearch (database) Elastalert

Check If any rules are satisfied

3: Offline analyses8

0

20

40

60

80

100

120

140

160

0 2 4 6 8 10 12 14 16 18 20Recorded luminosity [10^33 cm^-2 s^-1]

100

150

200

250

300

350

400

450

500

550

600

HLT

flow

[MB

/s]

• Basic plot to evaluatethe current HLT performance

• Various metrics are stored in Elasticsearch➡We try to evaluate the beam-

background-induced contribution

Elasticsearch (database) ROOT files

0

50

100

150

200

250

300

0 2 4 6 8 10 12 14 16 18 20]-1 s-2 cm33Recorded luminosity [10

100

150

200

250

300

350

400

450

500

550

600

HLT

flow

[MB/

s]

Extrapolate to (for example)the designed luminosity

Plan: Root cause analysis9

Err

or/

Fata

l m

ess

age

timestamp

Many log messages ➡ Visualisations ➡ Quick diagnosis

Visualisation of error “propagation”

CR shifter noticed

Rootcause

Powerful tool to find the root cause of problems

Example (This is not real Belle II log-messages)

Plan: connected with recovery actions10

Full restart (“SALS”~3mins) STOP, ABORT, LOAD, START

ERROR Recovery time Resume

Recovery actions by experts

ContactCR => expert

ERROR Resume

STO

P

STAR

T

Recovery time

Recovery by the CR shifter

• The alert system automatically detect many problems• The next step is to connect the error-diagnosis with

the appropriate recovery actions ➡ If it is implemented in a GUI, CR shifter can take

the recovery action and then we can reduce our recovery time

Time

Schematic view11

Elastalert

Elasticsearch

Check

Currently activated alerts

CR shifters

EPICS PVs

PXD • None

• SVDNone

・・・

• COPPERDOWNRecovery

e.g. restart.sh

① Click

② Resume data-taking

• ALARM:PXD:ABCD = 0 • ALARM:SVD:ABCD = 0 • … • ALARM:DAQ:COPPERDOWN:1234 = 1

Alert messages on RocketChat

PCs RunControl processes Trigger and DAQ condition

If ping2COPPER1234 > 100 ms

Shifter PC

Summary12

Current status • We have started using the Elastic Stack

for real-time monitoring of the Belle II DAQ system• Working smoothly, making good contributions to reduce downtime• Three major applications:

1.Online visualisations on Kibana (e.g. data-taking efficiency)2.Alerting system3.Offline analyses

Future plan • Root cause analysis

- It is worth to try machine learning on Elastic stack• To connect the detection results with quick recovery actions

- Implementation is on-going