Ranking the Importance of Alerts for Problem Determination in Large Computer Systems
Guofei Jiang, Haifeng Chen, Kenji Yoshihira, Akhilesh Saxena
NEC Laboratories America, Princeton
Outline
• Introduction
  – Motivation & Goal
• System Invariants
  – Invariants extraction
  – Value propagation
• Collaborative peer review mechanism
  – Rules & Fault model
  – Ranking alerts
• Experiment results
• Conclusion

ICAC 2009 : 6/16/2009
Motivation
• Large & complex systems are deployed by integrating many heterogeneous components:
  – servers, routers, storage & software from multiple vendors
  – hidden dependencies among components
• Log/performance data from components:
  – Operators set many rules to check the data and trigger alerts, e.g., CPU% @Web > 70%.
  – Rule setting is independent & isolated, based on each operator's own system knowledge.
Goal
• Which alerts should we analyze first?
  – Get more consensus from other rules.
  – Blend system management knowledge from multiple operators.
• We introduce a "peer-review" mechanism to rank the importance of alerts.
• Operators can then prioritize the problem determination process.
• Example rules and their fully automated ranking:
  – CPU% @Web > 70% → Alert 1
  – DiskUsg @Web > 150 → Alert 2
  – CPU% @DB > 60% → Alert 3
  – Network @AP > 35k → Alert 4
  Ranking: Alert 3 > Alert 1 > Alert 2 > Alert 4
Alerts Ranking Process
Offline:
1. Extract invariants from the monitoring data of the large system to build the invariants model [ICAC 2006][TDSC 2006][TKDE 2007][DSN 2006].
2. Operators (with domain knowledge) define alert rules, e.g., CPU% @Web > 70% (Alert 1), DiskUsg @Web > 150 (Alert 2), CPU% @DB > 60% (Alert 3), Network @AP > 35k (Alert 4).
3. Sort the alert rules by combining the invariants model with the domain information.
Online (at the time alerts are received):
4. Rank the real alerts as they arrive.
System Invariants
Flow intensity: the intensity with which internal monitoring data reacts to the volume of user requests.
[Figure: user requests flow into the target system, and measurements m1, m2, …, mn are collected over time; does any constant relationship hold among them?]
• User requests flow through the system continuously, and much of the internal monitoring data reacts to the volume of user requests accordingly.
• We search for relationships among these internal measurements collected at various points.
• If the modeled relationships continue to hold over time, they can be regarded as invariants of the system.
Invariant Examples
• We check implicit relationships, not the real values of flow intensities, which are always changing. Many relationships, however, are constant: x and y change over time, but the equation y = f(x) stays the same.
• Load balancer: the input flow splits into three output flows, giving the invariant I1 = O1 + O2 + O3.
• Database server: the packet volume V1 and the SQL query number N1 are linked by the invariant V1 = f(N1).
Automated Invariants Search
1. Monitor the target system and collect observation data over an initial window [t0, t1].
2. For any two measurements i, j, learn a model f_ij from a template in the model library; the learned models are the invariant candidates.
3. Sequential validation: with each new window of observation data [t_k, t_{k+1}], test whether every f_ij still holds. Drop the variants that fail (NO), keep the rest (YES), and update a confidence score p_k for each surviving candidate.
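The validation loop above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the window representation, the candidate-model interface, and the fitness cutoff of 50 are all assumptions for this example.

```python
import numpy as np

def fitness(y_true, y_pred):
    """Normalized fit score: 100 means a perfect fit, lower means worse."""
    num = np.linalg.norm(y_true - y_pred)
    den = np.linalg.norm(y_true - y_true.mean())
    return (1.0 - num / den) * 100.0

def sequential_validation(candidates, windows, threshold=50.0):
    """Drop candidate invariants f_ij whose fitness falls below `threshold`
    on any new data window; return the survivors together with a confidence
    score (here, the fraction of windows each candidate fit well)."""
    passed = {pair: 0 for pair in candidates}
    survivors = dict(candidates)
    for window in windows:
        for (i, j), model in list(survivors.items()):
            x, y = window[i], window[j]
            if fitness(y, model(x)) < threshold:
                del survivors[(i, j)]      # drop the variant
            else:
                passed[(i, j)] += 1
    confidence = {p: passed[p] / len(windows) for p in survivors}
    return survivors, confidence
```

A candidate that tracks a real relationship (e.g., y = 2x) survives every window, while a spurious pairing is discarded as soon as a window contradicts it.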
One example in the model library
• We use an AutoRegressive model with eXogenous inputs (ARX) to learn the relationship between two flow intensity measurements:

  y(t) + a_1 y(t-1) + … + a_n y(t-n) = b_0 x(t-k) + … + b_m x(t-k-m) + c

• Define

  θ = [a_1, a_2, …, a_n, b_0, b_1, …, b_m, c]^T,
  φ(t) = [-y(t-1), …, -y(t-n), x(t-k), …, x(t-k-m), 1]^T,

  so that y(t) = φ(t)^T θ.

• Given a sequence of N real observations, we learn the model with the least squares method by minimizing the error:

  θ̂ = [ Σ_{t=1}^{N} φ(t) φ(t)^T ]^{-1} · Σ_{t=1}^{N} φ(t) y(t)

• A fitness function evaluates how well the learned model fits the real data:

  F(θ) = [ 1 - sqrt( Σ_{t=1}^{N} |y(t) - ŷ(t)|^2 / Σ_{t=1}^{N} |y(t) - ȳ|^2 ) ] × 100

  where ŷ(t) is the model's estimate and ȳ is the mean of the observed y(t).
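As a concrete illustration of this learning step, the sketch below builds the regressor φ(t) and solves the least-squares problem with NumPy. The function `fit_arx` and its defaults are assumptions for this example, not the authors' code.

```python
import numpy as np

def fit_arx(y, x, n=1, m=0, k=0):
    """Fit the ARX model
        y(t) + a1*y(t-1) + ... + an*y(t-n) = b0*x(t-k) + ... + bm*x(t-k-m) + c
    by least squares; returns theta = [a1..an, b0..bm, c]."""
    start = max(n, k + m)
    Phi, target = [], []
    for t in range(start, len(y)):
        row = [-y[t - i] for i in range(1, n + 1)]    # autoregressive part
        row += [x[t - k - i] for i in range(m + 1)]   # exogenous part
        row.append(1.0)                               # constant term c
        Phi.append(row)
        target.append(y[t])
    theta, *_ = np.linalg.lstsq(np.array(Phi), np.array(target), rcond=None)
    return theta
```

For noiseless data generated by y(t) = 0.5 y(t-1) + 2 x(t) + 1, which in ARX form has a1 = -0.5, b0 = 2, c = 1, the least-squares solution recovers those coefficients exactly.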
Value Propagation with Invariants
• Extract invariants along a chain of measurements: y = f(x), z = g(y), u = h(x), v = s(u).
• With the ARX model y(t) + a_1 y(t-1) + … + a_n y(t-n) = b_0 x(t) + … + b_m x(t-m) + c, set the input to a constant, x(t) = x̄. The output then converges to

  ȳ = ( Σ_{i=0}^{m} b_i x̄ + c ) / ( 1 + Σ_{j=1}^{n} a_j )

• Values propagate over multiple hops by composition: z = g(f(x)), v = s(h(x)).
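The converged value gives a direct way to push a constant input, such as a rule threshold, through an invariant, and multi-hop propagation is just composition. This helper is an illustrative sketch under the converged-value formula above, not the paper's code.

```python
def propagate(a, b, c, x_bar):
    """Converged output of the ARX invariant
        y(t) + sum_j a[j]*y(t-j) = sum_i b[i]*x(t-i) + c
    when the input is held constant at x_bar."""
    return (sum(b) * x_bar + c) / (1.0 + sum(a))

# Multi-hop propagation composes the per-invariant mappings:
y_bar = propagate(a=[-0.5], b=[2.0], c=1.0, x_bar=70.0)   # x -> y; y_bar == 282.0
z_bar = propagate(a=[0.0], b=[0.1], c=0.0, x_bar=y_bar)   # y -> z
```

The first call solves ȳ = 0.5 ȳ + 2·70 + 1, giving 282, and the second maps that value one hop further.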
Rules and Fault Model
Rule: if (x > T_x) then generate_alert_1
  Predicate: (x > T_x)    Action: generate_alert_1

[Figure: fault model for each rule, plotting the probability of fault occurrence against x. The ideal model is a step function jumping from 0 to 1 at the threshold T_x; a realistic model rises smoothly around T_x, so the rule produces both false positives and false negatives.]
Probability of Reporting a True Positive Alert
• Importance of an alert: Prob(true | x), the Probability of Reporting a True Positive (PRTP) for an alert generated by value x.
• Even a very small false positive rate leads to a large number of false positive reports.
  – Example: one measurement checked every minute with a 0.1% FP rate yields 60 × 24 × 365 × 0.1% ≈ 526 FP reports per year. Now imagine thousands of monitored measurements!
  – Example: in a real operation support system, 80% of reports are false positives.
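The false-positive arithmetic above can be checked directly:

```python
# One measurement checked once per minute with a 0.1% false-positive rate:
checks_per_year = 60 * 24 * 365          # 525,600 checks per year
fp_per_year = checks_per_year * 0.001    # ~526 false-positive reports
# Thousands of monitored measurements scale this count linearly.
```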
Local Context Mapping to Global Context
• The four rules (CPU% @Web > 70% → Alert 1, DiskUsg @Web > 150 → Alert 2, CPU% @DB > 60% → Alert 3, Network @AP > 35k → Alert 4) are defined on different components (Web, AP, DB) and thus have different local semantics.
• Invariants map all of them into one global context, here CPU% @Web:
  CPU% @Web = fa(Network @AP)
  CPU% @Web = fb(CPU% @DB)
  CPU% @Web = fc(DiskUsg @Web)
• Mapping each rule's threshold through these invariants onto the fault model (PRTP curve) of CPU% @Web orders them as:
  Prob(true | X_CPU@DB) > Prob(true | X_T) > Prob(true | X_DiskUsg@Web) > Prob(true | X_Network@AP)
• Resulting rank: Alert 3 > Alert 1 > Alert 2 > Alert 4.
Local Context Mapping to Global Context
• Mapping the same thresholds into a different global context, the fault model of Network @AP, yields:
  Prob(true | X_CPU@DB) > Prob(true | X_CPU@Web) > Prob(true | X_DiskUsg@Web) > Prob(true | X_T)
• The resulting rank is unchanged: Alert 3 > Alert 1 > Alert 2 > Alert 4. The alert ranking does not depend on the choice of global context.
Ranking Alerts (Case I)
Case I: we receive ONLY alerts, no monitoring data from the components.
• Offline, all alert rules are sorted using the system invariants network and the operators' knowledge & configuration, e.g.: Alert 6, Alert 2, Alert 3, Alert 7, Alert 5, Alert 9, Alert 1, Alert 8, Alert 4.
• Online, when a subset of alerts is generated (e.g., Alerts 2, 3, 7, 5, 1), rank them by their rules' positions in the sorted list: 1. Alert 2, 2. Alert 3, 3. Alert 7, 4. Alert 5, 5. Alert 1.
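Case I then reduces to a lookup into the offline-sorted rule list. A minimal sketch, with function and variable names assumed for illustration:

```python
def rank_alerts_case1(sorted_rules, received_alerts):
    """Rank the received alerts by their rules' positions in the
    offline-sorted rule list (most important first)."""
    position = {rule: i for i, rule in enumerate(sorted_rules)}
    return sorted(received_alerts, key=position.__getitem__)

# Rules sorted offline: 6, 2, 3, 7, 5, 9, 1, 8, 4; five alerts arrive online:
ranking = rank_alerts_case1([6, 2, 3, 7, 5, 9, 1, 8, 4], [1, 5, 7, 3, 2])
# ranking == [2, 3, 7, 5, 1]
```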
Ranking Alerts (Case II)
Case II: we receive both alerts and monitoring data from the components.
• For each alert, map the observed value into a global fault model and count the Number of Threshold Violations (NTV): the number of mapped thresholds that the observed value exceeds.
• Example: in the fault model of CPU% @Web, the observed value X(CPU% @Web) exceeds three of the mapped thresholds (NTV = 3), while in the fault model of Network @AP the observed value X(Network @AP) exceeds two (NTV = 2).
• Therefore the alert from CPU% @Web is more important than the one from Network @AP.
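Counting threshold violations is a one-liner once the thresholds are mapped into a common context. The mapped values below are taken from the experiment slides later in the deck (the context of measurement m1); the helper name is assumed for illustration.

```python
def ntv(observed, mapped_thresholds):
    """Number of Threshold Violations of an observed value in a global context."""
    return sum(observed > t for t in mapped_thresholds)

# Thresholds mapped into m1's context, and one observed value of m1:
m1_context = [70, 63.6, 70.2, 70.5, 77.0, 59.8]
print(ntv(73.6, m1_context))   # 5
```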
Experimental system
[Figure: experimental testbed with components A, B, C, and D.]
Flow intensities:
• I_ejb(t): the number of EJBs created at time t.
• I_jvm(t): the JVM processing time at time t.
• I_sql(t): the number of SQL queries at time t.
Invariant examples:
  I_ejb(t) = 0.07 I_ejb(t-1) + 0.57 I_jvm(t)
  I_sql(t) = 0.34 I_sql(t-1) + 1.41 I_ejb(t) - 0.2 I_ejb(t-1)
Extracted Invariants Network
[Figure: the extracted invariants network connecting measurements m1 through m6.]
Thresholds of Measurements
Rule_i: if (x_{m_i} > T_{m_i}) then generate_alert_{m_i}
• Local thresholds: T_m1 = 70, T_m2 = 30000, T_m3 = 80, T_m4 = 30000, T_m5 = 70, T_m6 = 20000.
• Example: mapped into the context of m1, the remaining thresholds become 63.6, 70.2, 70.5, 77.0, and 59.8.
Thresholds of Measurements
Each local threshold T_mi is mapped through the invariants network into every measurement's context (one column per context); the diagonal entry is the measurement's own local threshold.

         m1      m2      m3      m4      m5      m6
  T_m1   70      32726   71.4    29540   57.4    23208
  T_m2   63.6    30000   78.0    29646   62.8    23291
  T_m3   70.2    33006   80      32613   63.7    25688
  T_m4   70.5    33212   86.4    30000   54.1    21200
  T_m5   77.0    36316   81.0    25469   70      23509
  T_m6   59.8    28207   66.9    27018   63.0    20000
Ranking Alerts with NTVs (1)
Using the mapped thresholds above, one set of observed values gives:
  Observed value: m1 = 73.6, m2 = 34319, m3 = 81.6, m4 = 30621, m5 = 71.4, m6 = 22620
  NTVs: m1 = 5, m2 = 5, m3 = 5, m4 = 5, m5 = 6, m6 = 2
The alert on m5 (NTV = 6) ranks highest; the alert on m6 (NTV = 2) ranks lowest.
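The NTV counts can be recomputed from the mapped-threshold table. Each list below holds the six thresholds mapped into that measurement's context (the exact row-to-threshold assignment does not affect the counts); the observed values are those from this slide.

```python
# Mapped thresholds per global context, columns m1..m6 of the table:
contexts = {
    "m1": [70, 63.6, 70.2, 70.5, 77.0, 59.8],
    "m2": [32726, 30000, 33006, 33212, 36316, 28207],
    "m3": [71.4, 78.0, 80, 86.4, 81.0, 66.9],
    "m4": [29540, 29646, 32613, 30000, 25469, 27018],
    "m5": [57.4, 62.8, 63.7, 54.1, 70, 63.0],
    "m6": [23208, 23291, 25688, 21200, 23509, 20000],
}
observed = {"m1": 73.6, "m2": 34319, "m3": 81.6,
            "m4": 30621, "m5": 71.4, "m6": 22620}
ntvs = {m: sum(v > t for t in contexts[m]) for m, v in observed.items()}
# ntvs == {'m1': 5, 'm2': 5, 'm3': 5, 'm4': 5, 'm5': 6, 'm6': 2}
```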
Ranking Alerts with NTVs (2)
  Observed value: m1 = 73.5, m2 = 31478, m3 = 54.6, m4 = 22712, m5 = 46.1, m6 = 18564
  NTVs: m1 = 5, m2 = 2; the remaining measurements violate no thresholds and raise no alerts
The alert on m1 therefore ranks above the alert on m2.
These alerts were generated by injecting a problem (an SCP copy) into the Web server.
Conclusion
• We introduce a peer-review mechanism to rank alerts from heterogeneous components:
  – by mapping the local thresholds of various rules into their equivalent values in a global context,
  – based on a system invariants network model.
• The ranking supports operators' consultation and helps them prioritize problem determination.
Thank You!
• Questions?