Real-time monitoring of operational data in the Belle II experiment Takuto KUNIGO on behalf of the Belle II collaboration 22nd IEEE Real Time Conference Poster session-B 1



SuperKEKB

4 GeV e+ / 7 GeV e- collider

World record achieved in June 2020: L = 2.4 x 10^34 cm^-2 s^-1

Goal: 50 ab^-1 (= Belle x 50)

Belle II detector

[Figure: Belle II detector, with an inner-to-outer axis and labelled sub-systems: vertex detector, tracking, PID, calorimeter, muon spectrometer. Beams: e- (7 GeV) and e+ (4 GeV).]

Search for new physics beyond the Standard Model via high-precision measurements with high-statistics samples of B/D/tau decays

Belle II DAQ system

Many components: each one needs to be monitored carefully

Monitoring system

System overview

Input data
• Log messages
• PC usage
• SuperKEKB status
• Trigger and DAQ condition
• …

Database
• Integrated database ‘Elasticsearch’

Applications
• Web-interface ‘Kibana’ visualisations (e.g. efficiency)
• Alerting framework (e.g. alerts for ‘ABCD’: check ‘XYWZ’)
• Offline analyses (for example, publication-quality plots via ROOT)

Users
• CR (control room) shifters
• Sub-system experts
• Collaborators
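As a rough illustration of how input data such as log messages could reach the integrated database, the sketch below formats records as a newline-delimited payload in the shape expected by the Elasticsearch bulk API. The index name `daq-logs` and all field names are assumptions made for this example and are not taken from the Belle II setup.

```python
import json

def to_bulk_payload(index_name, records):
    """Format records as NDJSON for the Elasticsearch bulk API.

    Each record becomes two lines: an action line naming the target
    index, followed by the document itself.
    """
    lines = []
    for record in records:
        lines.append(json.dumps({"index": {"_index": index_name}}))
        lines.append(json.dumps(record))
    # The bulk API requires the body to end with a newline.
    return "\n".join(lines) + "\n"

# Hypothetical log records; field names are illustrative only.
records = [
    {"@timestamp": "2020-05-01T12:00:00Z", "level": "ERROR",
     "component": "COPPER", "message": "example message"},
    {"@timestamp": "2020-05-01T12:00:05Z", "level": "INFO",
     "component": "HLT", "message": "example message"},
]
payload = to_bulk_payload("daq-logs", records)
```

The resulting string would be sent as the body of a bulk request; the transport itself is left out here so that the sketch stays independent of any client library.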

1. Visualisations on Kibana

Example: DAQ efficiency

Goals
- to categorise the inefficiency sources
- to choose the time range freely

Efficiency definitions: 1. Luminosity-based 2. Kibana

[Figure: Kibana dashboard "DAQ efficiency" for physics data-taking, 2020-03-22 to 2020-05-31. Top row: efficiencies integrated over the time range. Bottom row: efficiencies binned in date (per 12 hours). Axes: fractional ratio [%], duration [sec], deadtime [%].
① Machine-time breakdown (dashboard title: Run time breakdown). Categories: [1] Belle II ph…, [2] Accelera…, [3] Accelera…, [4] Beam inj…, [5] Accelera…
② Belle-II data-taking breakdown (dashboard title: Run-time fraction). Categories: (a) HV ramp…, (b) SALS, (c) HV trip o…, (d) Belle II tr…, Running
③ Deadtime breakdown (dashboard title: Deadtime fraction). Categories: PXD, SVD, CDC, TOP, ARICH, ECL, KLM, TRG, ttlost, COPPER, FIFO full, PAUSE, Pipeline, APV VETO, Injection VE…]

Dashboard links: Main page | HLT | COPPER | Network | Data transfer | Event size | Deadtime | Run registry | DAQ efficiency | Log summary | Daily summary | Error summary

Dashboard header: the definition is on the Confluence page "DAQ efficiency". It is a draft, not finalised yet; comments and suggestions can be posted on the Confluence page or e-mailed to Takuto KUNIGO.

and many more monitoring plots (event size, network traffic, errors, etc.)
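The fractional ratios in such breakdown plots amount to dividing the time spent in each category by the total for the chosen time range. A minimal sketch of that arithmetic, using made-up durations (not Belle II data) and partly hypothetical category names:

```python
def fractional_breakdown(durations_sec):
    """Return each category's share of the total duration, in percent."""
    total = sum(durations_sec.values())
    if total <= 0:
        raise ValueError("total duration must be positive")
    return {name: 100.0 * value / total
            for name, value in durations_sec.items()}

# Illustrative numbers only: seconds per category in one 12-hour bin.
# "other_1" and "other_2" are placeholder category names.
durations = {"Running": 32400, "SALS": 2160, "other_1": 4320, "other_2": 4320}
fractions = fractional_breakdown(durations)
```

Because the function takes whatever durations fall into the selected range, the same calculation serves both the integrated view and each 12-hour bin.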

2: Alerting system

Elastalert
• is a third-party tool that allows us to implement an alerting function
• Alert destinations: RocketChat, e-mail, SNS message, etc.

Example: Automatic advice for the control room shifters via RocketChat (chat tool)

Elasticsearch (database) ➡ Elastalert: check if any rules are satisfied
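The "check if any rules are satisfied" step can be pictured with the plain-Python stand-in below. It does not use Elastalert's own interface (Elastalert is configured through rule files), and the rule name, field name, threshold, and advice text are all hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

@dataclass
class Rule:
    name: str
    condition: Callable[[Dict], bool]  # True when the document should alert
    advice: str                        # message sent to the shifters

def check_rules(documents: List[Dict], rules: List[Rule]) -> List[str]:
    """Return one alert message per (rule, document) match."""
    alerts = []
    for rule in rules:
        for doc in documents:
            if rule.condition(doc):
                alerts.append(f"[{rule.name}] {rule.advice}")
    return alerts

# Hypothetical rule and documents; not actual Belle II alert definitions.
rules = [
    Rule(name="high_deadtime",
         condition=lambda d: d.get("deadtime_percent", 0.0) > 5.0,
         advice="Deadtime above 5%: please check the DAQ status."),
]
documents = [{"deadtime_percent": 1.2}, {"deadtime_percent": 7.5}]
alerts = check_rules(documents, rules)
```

In a real deployment the returned messages would be forwarded to a destination such as RocketChat or e-mail; here they are simply collected in a list.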

3: Offline analyses

Elasticsearch (database) ➡ ROOT files

• Basic plot to evaluate the current HLT performance
• Various metrics are stored in Elasticsearch ➡ We try to evaluate the beam-background-induced contribution

[Figure: two panels of HLT flow [MB/s] versus recorded luminosity [10^33 cm^-2 s^-1]]

Extrapolate to (for example) the design luminosity
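One simple way to picture such an extrapolation is a straight-line fit of HLT flow against luminosity, evaluated at a higher luminosity. The slides do not state the functional form, so the linear model is an assumption of this sketch, and the data points and target value are synthetic.

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    if n < 2 or n != len(ys):
        raise ValueError("need at least two matching (x, y) points")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    if sxx == 0:
        raise ValueError("all x values are identical")
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x

# Synthetic points for illustration only (not measured Belle II values):
# luminosity in 10^33 cm^-2 s^-1, HLT flow in MB/s.
luminosity = [4.0, 8.0, 12.0, 16.0]
hlt_flow = [200.0, 300.0, 400.0, 500.0]
slope, intercept = fit_line(luminosity, hlt_flow)

target_luminosity = 100.0  # hypothetical target, same units as above
predicted_flow = slope * target_luminosity + intercept
```

Under the linear assumption, the intercept is the part of the flow that does not scale with luminosity. Whether that term can be attributed to beam background would need a dedicated study.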

Plan: Root cause analysis

Many log messages ➡ Visualisations ➡ Quick diagnosis

Visualisation of error “propagation”

[Figure: Error/Fatal messages versus timestamp, marking the root cause and the moment the CR shifter noticed. Example only: these are not real Belle II log messages.]

Powerful tool to find the root cause of problems
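As a toy version of this idea, the sketch below looks back from the moment a problem was noticed and picks the earliest Error/Fatal message in a time window as a root-cause candidate. Taking the earliest message is a simplifying heuristic assumed here, not the method described in the slides, and the messages are made up.

```python
from datetime import datetime, timedelta

def earliest_error_before(messages, noticed_at, window):
    """Return the earliest ERROR/FATAL message within `window` before `noticed_at`."""
    start = noticed_at - window
    candidates = [
        m for m in messages
        if m["level"] in ("ERROR", "FATAL") and start <= m["timestamp"] <= noticed_at
    ]
    return min(candidates, key=lambda m: m["timestamp"]) if candidates else None

# Made-up messages for illustration; these are not real Belle II logs.
messages = [
    {"timestamp": datetime(2020, 5, 1, 12, 0, 0), "level": "INFO", "source": "node_a"},
    {"timestamp": datetime(2020, 5, 1, 12, 0, 10), "level": "ERROR", "source": "node_b"},
    {"timestamp": datetime(2020, 5, 1, 12, 0, 12), "level": "FATAL", "source": "node_c"},
    {"timestamp": datetime(2020, 5, 1, 12, 0, 15), "level": "ERROR", "source": "node_d"},
]
noticed_at = datetime(2020, 5, 1, 12, 0, 20)
candidate = earliest_error_before(messages, noticed_at, timedelta(minutes=1))
```

A visualisation of the same messages by source and timestamp would show the later errors following the first one, which is the "propagation" pattern the plot is meant to expose.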

Plan: connected with recovery actions

[Figure: timelines from ERROR to Resume, comparing the recovery time. Labels: full restart (“SALS”, ~3 mins: STOP, ABORT, LOAD, START); contact (CR => expert) and recovery actions by experts; recovery by the CR shifter (STOP, START).]

• The alert system automatically detects many problems
• The next step is to connect the error diagnosis with the appropriate recovery actions ➡ If it is implemented in a GUI, the CR shifter can take the recovery action and we can reduce our recovery time

Schematic view

Input to Elasticsearch: PCs, RunControl processes, Trigger and DAQ condition

Elasticsearch ➡ Elastalert: check (e.g. if ping2COPPER1234 > 100 ms)

Elastalert output
• Alert messages on RocketChat
• EPICS PVs
  - ALARM:PXD:ABCD = 0
  - ALARM:SVD:ABCD = 0
  - …
  - ALARM:DAQ:COPPERDOWN:1234 = 1

Shifter PC: currently activated alerts
• PXD: None
• SVD: None
・・・
• COPPERDOWN: Recovery (e.g. restart.sh)

CR shifters: ① Click ② Resume data-taking
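The chain from a check such as "ping > 100 ms" to an alarm flag and a recovery action offered to the shifter can be sketched as follows. A plain dictionary stands in for the EPICS PVs (no EPICS library is used), and the latency value and the mapping from PV to script are assumptions for illustration.

```python
def alarm_value(ping_ms, threshold_ms=100.0):
    """Return 1 if the ping latency exceeds the threshold, else 0."""
    return 1 if ping_ms > threshold_ms else 0

def active_recoveries(alarm_pvs, recovery_actions):
    """List (PV name, recovery command) for every alarm PV set to 1."""
    return [
        (pv, recovery_actions[pv])
        for pv, value in sorted(alarm_pvs.items())
        if value == 1 and pv in recovery_actions
    ]

# PV names follow the pattern on the slide; the latency value and the
# PV-to-script mapping are assumptions for illustration.
alarm_pvs = {
    "ALARM:PXD:ABCD": 0,
    "ALARM:SVD:ABCD": 0,
    "ALARM:DAQ:COPPERDOWN:1234": alarm_value(ping_ms=250.0),
}
recovery_actions = {"ALARM:DAQ:COPPERDOWN:1234": "restart.sh"}
to_offer = active_recoveries(alarm_pvs, recovery_actions)
```

A GUI on the shifter PC could render `to_offer` as a list of clickable recovery buttons. The sketch deliberately returns the command name rather than executing it, leaving the decision to the shifter.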

Summary

Current status
• We have started using the Elastic Stack for real-time monitoring of the Belle II DAQ system
• Working smoothly, making good contributions to reducing downtime
• Three major applications:
  1. Online visualisations on Kibana (e.g. data-taking efficiency)
  2. Alerting system
  3. Offline analyses

Future plan
• Root cause analysis
  - It is worth trying machine learning on the Elastic Stack
• To connect the detection results with quick recovery actions
  - Implementation is ongoing