Real-time monitoring of operational data
in the Belle II experiment
Takuto KUNIGO on behalf of the Belle II collaboration 22nd IEEE Real Time Conference
Poster session-B
1
SuperKEKB2
4 GeV e+ 7 GeV e- collider
World record achieved in June 2020 L = 2.4 x 1034 cm-2s-1
Goal: 50 ab-1 (= Belle x 50)
Belle II detector3
Inner
OuterMuon spectrometer
Calorimeter
Tracking
Vertex detector
PID
e- (7 GeV)
e+ (4 GeV)
Search for new physics beyond SM via high precision measurement with high statistics samples of B/D/tau decays
Belle II DAQ system4
Many components Need to monitor each component carefully
Monitoring system5
Log messagesInput data
Database
Users
Applications
Integrated database ‘Elasticsearch’
PC usagesSuperKEKB status
CR shifters Sub-systemexperts
Trigger and DAQ condition …
Web-interface ‘Kibana’ visualisations (e.g. efficiency)
Alerting framework
Alerts for ‘ABCD’
check ‘XYWZ’
Offline analyses
Collaboratorsfor example
publication-quality plotsvia ROOT
System overview
Control roomshifters
1. Visualisations on Kibana6Main links
Header for DAQ efficiency
1. Run time breakdown - Total 2. Run-time fraction - Total 3. Deadtime fraction - Total
1. Run time breakdown
2. Run-time fraction
3. Deadtime fraction
Main page | HLT | COPPER | Network | Data transfer | Event size | Deadtime | Run registryDAQ efficiency | Log summary | Daily summary | Error summary
DefinitionConfluence page: DAQ efficiency
1. This is a draft, not finalised yet2. If you have any comments, suggestions, please
post your comments on the confluence page orsend me an e-mail Takuto KUNIGO
Frac
tiona
l rat
io [%
]
0%
20%
40%
60%
80%
100%
_all
All docs
! [1] Belle II ph…! [2] Accelera…! [3] Accelera…! [4] Beam inj…! [5] Accelera…
Frac
tiona
l rat
io [%
]
0%
20%
40%
60%
80%
100%
_all
All docs
! (a) HV ramp…! (b) SALS! (c) HV trip o…! (d) Belle II tr…! Running
Dead
time
[%]
0
1
2
3
4
5
6
_all
All docs
! PXD! SVD! CDC! TOP! ARICH! ECL! KLM! TRG! ttlost! COPPER! FIFO full! PAUSE! Pipeline! APV VETO! Injection VE…
Dura
tion
[sec
]
0
5,000
10,000
15,000
20,000
25,000
30,000
35,000
40,000
45,000
2020-03-22 00:00 2020-03-29 00:00 2020-04-05 00:00 2020-04-12 00:00 2020-04-19 00:00 2020-04-26 00:00 2020-05-03 00:00 2020-05-10 00:00 2020-05-17 00:00 2020-05-24 00:00 2020-05-31 00:00date per 12 hours
! [1] Belle II ph…! [2] Accelera…! [3] Accelera…! [4] Beam inj…! [5] Accelera…
Frac
tiona
l rat
io [%
]
0%
10%
20%
30%
40%
50%
60%
70%
80%
90%
100%
2020-03-22 00:00 2020-03-29 00:00 2020-04-05 00:00 2020-04-12 00:00 2020-04-19 00:00 2020-04-26 00:00 2020-05-03 00:00 2020-05-10 00:00 2020-05-17 00:00 2020-05-24 00:00 2020-05-31 00:00date per 12 hours
! (a) HV ramp…! (b) SALS! (c) HV trip o…! (d) Belle II tr…! Running
Dead
time
[%]
0
5
10
15
20
2020-03-29 00:00 2020-04-05 00:00 2020-04-12 00:00 2020-04-19 00:00 2020-04-26 00:00 2020-05-03 00:00 2020-05-10 00:00 2020-05-17 00:00 2020-05-24 00:00 2020-05-31 00:00date per 12 hours
! PXD! SVD! CDC! TOP! ARICH! ECL! KLM! TRG! ttlost! COPPER! FIFO full! PAUSE! Pipeline! APV VETO! Injection VE…
Exit full screen "
Efficiencies integrated over the time-range
Efficiencies binned in date
for physics data-taking
Belle-IIrunning
① Machine-time breakdown ② Belle-II data-taking breakdown ③ Deadtime breakdown
① Machine-time breakdown
② Belle-II data-taking breakdown
③ Deadtime breakdown
per 12 hours
per 12 hours
per 12 hours
Example: DAQ efficiency
Goals - to categorise the in-efficiency
sources - to choose time-range freely
and… Many monitoring plots(Event size, Network-traffic, errors etc.)
Efficiency definitions 1.Luminosity-based 2.Kibana
2: Alerting system7
Elastalert • is a third-party tool allows us to implement alerting function• Alert destinations: RocketChat, e-mail, SNS message, etc…
Example: Automatic advice for the control room shifters via the RocketChat (chat tool)
Elasticsearch (database) Elastalert
Check If any rules are satisfied
3: Offline analyses8
0
20
40
60
80
100
120
140
160
0 2 4 6 8 10 12 14 16 18 20Recorded luminosity [10^33 cm^-2 s^-1]
100
150
200
250
300
350
400
450
500
550
600
HLT
flow
[MB
/s]
• Basic plot to evaluatethe current HLT performance
• Various metrics are stored in Elasticsearch➡We try to evaluate the beam-
background-induced contribution
Elasticsearch (database) ROOT files
0
50
100
150
200
250
300
0 2 4 6 8 10 12 14 16 18 20]-1 s-2 cm33Recorded luminosity [10
100
150
200
250
300
350
400
450
500
550
600
HLT
flow
[MB/
s]
Extrapolate to (for example)the designed luminosity
Plan: Root cause analysis9
Err
or/
Fata
l m
ess
age
timestamp
Many log messages ➡ Visualisations ➡ Quick diagnosis
Visualisation of error “propagation”
CR shifter noticed
Rootcause
Powerful tool to find the root cause of problems
Example (This is not real Belle II log-messages)
Plan: connected with recovery actions10
Full restart (“SALS”~3mins) STOP, ABORT, LOAD, START
ERROR Recovery time Resume
Recovery actions by experts
ContactCR => expert
ERROR Resume
STO
P
STAR
T
Recovery time
Recovery by the CR shifter
• The alert system automatically detect many problems• The next step is to connect the error-diagnosis with
the appropriate recovery actions ➡ If it is implemented in a GUI, CR shifter can take
the recovery action and then we can reduce our recovery time
Time
Schematic view11
Elastalert
Elasticsearch
Check
Currently activated alerts
CR shifters
EPICS PVs
PXD • None
• SVDNone
・・・
• COPPERDOWNRecovery
e.g. restart.sh
① Click
② Resume data-taking
• ALARM:PXD:ABCD = 0 • ALARM:SVD:ABCD = 0 • … • ALARM:DAQ:COPPERDOWN:1234 = 1
Alert messages on RocketChat
PCs RunControl processes Trigger and DAQ condition
If ping2COPPER1234 > 100 ms
Shifter PC
Summary12
Current status • We have started using the Elastic Stack
for real-time monitoring of the Belle II DAQ system• Working smoothly, making good contributions to reduce downtime• Three major applications:
1.Online visualisations on Kibana (e.g. data-taking efficiency)2.Alerting system3.Offline analyses
Future plan • Root cause analysis
- It is worth to try machine learning on Elastic stack• To connect the detection results with quick recovery actions
- Implementation is on-going