
Mining System Logs to Learn Error Predictors

A Case Study of a Telemetry System

Barbara Russo L.E.S.E.R.

Faculty of Computer Science, Free University of Bozen-Bolzano, Italy [email protected]

Universität Stuttgart - June 9th, 2015

A collaboration between

Free University of Bozen-Bolzano, Italy

and

University of Alberta, Canada

Barbara Russo, Giancarlo Succi, Witold Pedrycz (2015). Mining system logs to learn error predictors: a case study of a telemetry system. Empirical Software Engineering, 20(4), 879-927


System events

•  Events describe the behaviour within and across subsystems or components – i.e., how the system changes over time

•  Logs track events


The value of logs

•  Log events carry information on

–  the software application that generated the event and its state,

–  the task and the user whose interaction with the system triggered the event, and

–  the time-stamp at which the event is generated.


Logs can be cryptic


Errors

•  Some behaviours are desirable and some are not

•  Undesirable behaviours are referred to as system errors

–  crashes that immediately stop the system and are easily identifiable

–  deviations from the expected output that let the system keep running and become apparent only at the completion of system tasks


Meaning of errors

•  Events in error state (errors) act as alerts

–  Are they manifestations of system failures?

–  Do they originate from a series of preceding events?

–  Must immediate action be taken?

–  Are they indications of an underlying problem?


Goal

•  Analysing the behaviour of a (composite) system by mining logs of events and predicting future system misbehaviour

•  Composite: many applications or subsystems


Method

•  Solve a classification problem with SVM

•  Build a sequence abstraction by mining logs

•  Integrate several statistical techniques to control for data brittleness and to ensure the accuracy of model selection and validation

•  Discuss the classification problem at different degrees of defectiveness


Sequences

•  A single event may not suffice to predict system failures

•  An event sequence is a set of events, ordered by timestamp, that occur within a given time window

•  A sequence abstraction is a representation of the identified sequences in a formal, machine-readable format


Research question

•  Is the amount and type of information carried by a sequence enough to predict errors?


Isolating sequences


[Figure: sequences isolated from the log – different lengths, different event types]
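A minimal sketch of how sequences could be isolated from a log, assuming each parsed record is a dict with a 'timestamp' field and that fixed, non-overlapping time windows are used (the study's exact windowing may differ):

from collections import defaultdict
from datetime import timedelta

def isolate_sequences(records, window=timedelta(minutes=5)):
    # Group time-ordered log records into sequences: all events that fall
    # into the same fixed, non-overlapping time window form one sequence.
    if not records:
        return []
    records = sorted(records, key=lambda r: r["timestamp"])
    start = records[0]["timestamp"]
    buckets = defaultdict(list)
    for rec in records:
        idx = int((rec["timestamp"] - start) / window)  # index of the window
        buckets[idx].append(rec)
    return [buckets[i] for i in sorted(buckets)]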

Abstracting sequences


[Figure: isolated sequences s7, s30, s2, s14, s10 are abstracted into multiplicity vectors µ1 … µn – same length, same event types]

Example – sequence type

•  sv1=[0,1,0,1]

•  sv2=[2,1,1,0]


Sequence type

•  µi – number of events of type i in a sequence

•  sv = [µ1, …, µn] – vector of event multiplicities

•  ρ(sv) – total number of errors over all sequences mapping into sv
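A minimal sketch of this abstraction step, under the same assumptions as above (records carry a 'type' and a boolean 'error' field; names are illustrative):

from collections import defaultdict

def sequence_vector(sequence, event_types):
    # sv = [mu_1, ..., mu_n]: mu_i counts the events of type i in the sequence.
    return tuple(sum(1 for rec in sequence if rec["type"] == t) for t in event_types)

def error_counts(sequences, event_types):
    # rho(sv): total number of error events over all sequences mapping into sv.
    rho = defaultdict(int)
    for seq in sequences:
        sv = sequence_vector(seq, event_types)
        rho[sv] += sum(1 for rec in seq if rec["error"])
    return rho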


Features to feed SVM

•  v = [sv, µ(sv), ν(sv)] – feature

–  µ(sv) = number of sequences mapping into sv

–  ν(sv) = average number of users in the sequences mapping into sv

•  v is a faulty feature if at least one event in one of its sequences is in error state
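A sketch of how the features could be assembled, continuing the previous sketches; ν(sv) is computed here as the mean number of distinct users per sequence, which is an assumption about the paper's exact definition:

from collections import defaultdict

def build_features(sequences, event_types):
    # For each distinct sv collect mu(sv), nu(sv) and rho(sv), and flag the
    # feature as faulty when at least one mapped sequence contains an error.
    mu, user_counts, rho = defaultdict(int), defaultdict(list), defaultdict(int)
    for seq in sequences:
        sv = sequence_vector(seq, event_types)        # from the previous sketch
        mu[sv] += 1
        user_counts[sv].append(len({rec["user"] for rec in seq}))
        rho[sv] += sum(1 for rec in seq if rec["error"])
    features = []
    for sv in mu:
        nu = sum(user_counts[sv]) / mu[sv]
        features.append({"v": list(sv) + [mu[sv], nu],
                         "rho": rho[sv],
                         "faulty": rho[sv] > 0})
    return features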


Sequence vector semantics

•  Patterns of system behaviour

–  If µ(sv) > 1 and ρ(sv) > 0, such sequences denote a recurring reliability problem

•  Distributed teams

–  If ν(sv) > 1, comparing features with ρ(sv) > 0 against those with ρ(sv) = 0 tells whether errors originate from multiple users working on the same tasks


Example - features

•  v1 = [0,1,0,1; 1,1], sv1 = [0,1,0,1]

–  µ(sv1) = 1, ν(sv1) = 1, ρ(sv1) = 0

•  v2 = [2,1,1,0; 1,2], sv2 = [2,1,1,0]

–  µ(sv2) = 1, ν(sv2) = 2, ρ(sv2) = 2


The classification problem

[Diagram: features drawn from data sets with different ex-ante distributions of faulty (G1) and non-faulty (G2) instances are fed to the classifier; the ex-post classification differs with the classifier's threshold]

Classification

•  False positive – a feature v that is predicted faulty but does not contain errors, ρ(sv) = 0

•  True positive – a feature v that is predicted faulty and contains errors, ρ(sv) > 0

•  False negative – a feature v that is predicted non-faulty but contains errors, ρ(sv) > 0

•  True negative – a feature v that is predicted non-faulty and does not contain errors, ρ(sv) = 0
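The accuracy measures on the next slide can be derived from these counts. A minimal sketch, assuming the 'bal' value quoted later is the usual balance measure combining TPr and FPr:

import math

def accuracy_measures(tp, fp, fn, tn):
    # Standard measures derived from the confusion matrix.
    tpr = tp / (tp + fn) if tp + fn else 0.0        # true positive rate (recall)
    fpr = fp / (fp + tn) if fp + tn else 0.0        # false positive rate
    precision = tp / (tp + fp) if tp + fp else 0.0
    # Balance: normalised distance from the ideal ROC point (FPr = 0, TPr = 1).
    balance = 1 - math.sqrt(fpr ** 2 + (1 - tpr) ** 2) / math.sqrt(2)
    return {"TPr": tpr, "FPr": fpr, "precision": precision, "balance": balance}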


Measures of accuracy


Build classifiers on historical data


[Diagram: historical data are split into a training set and a test set to build the classifier]

1.  Training set: to tune the classifier's parameters

2.  Test set: to compute the classifier's fitting performance
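A minimal sketch of this step with scikit-learn; the parameter grid, split size, and scoring choice are illustrative, not the study's actual settings:

from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.svm import SVC

def fit_classifier(X, y):
    # 1. Tune the classifier's parameters on the training set,
    # 2. then compute its fitting performance on the held-out test set.
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.3, stratify=y, random_state=0)
    grid = GridSearchCV(SVC(kernel="rbf"),
                        {"C": [1, 10, 100], "gamma": ["scale", 0.1, 0.01]},
                        scoring="recall", cv=5)
    grid.fit(X_train, y_train)
    fitting_performance = grid.score(X_test, y_test)
    return grid.best_estimator_, fitting_performance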

Compare prediction performance

[Diagram: Classifier1, Classifier2, …, Classifiern are compared on a common validation set]
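A sketch of the comparison, ranking already-fitted classifiers by the balance measure from the earlier sketch (labels assumed to be 0 = non-faulty, 1 = faulty):

def compare_classifiers(classifiers, X_val, y_val):
    # Rank fitted classifiers by their prediction performance on the validation set.
    ranking = []
    for name, clf in classifiers.items():
        pred = clf.predict(X_val)
        tp = sum(p == 1 and t == 1 for p, t in zip(pred, y_val))
        fp = sum(p == 1 and t == 0 for p, t in zip(pred, y_val))
        fn = sum(p == 0 and t == 1 for p, t in zip(pred, y_val))
        tn = sum(p == 0 and t == 0 for p, t in zip(pred, y_val))
        ranking.append((name, accuracy_measures(tp, fp, fn, tn)["balance"]))
    return sorted(ranking, key=lambda item: item[1], reverse=True)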

Validating sequence abstraction

•  Did we put too much information in our features?

–  Information Gain selects the features that contribute most to the information of a given classification category

–  Classification category: sequences with a given number of error events
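A minimal sketch of such a filter with scikit-learn, using mutual information as a stand-in for Information Gain (an approximation; the paper's exact IG computation may differ). k = 7 mirrors the case-study slide later on:

from sklearn.feature_selection import SelectKBest, mutual_info_classif

def reduce_features(X, y, k=7):
    # Keep the k feature columns carrying the most information about the class label.
    selector = SelectKBest(mutual_info_classif, k=k)
    X_reduced = selector.fit_transform(X, y)
    return X_reduced, selector.get_support(indices=True)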


Control the effect of the dataset nature

•  Does set balancing increase the quality of prediction?

–  If the classification categories are not equally represented in the datasets, classifiers may have low precision even though the true positive rate is high and the false positive rate is low.

–  Such imbalanced data sets are very frequent in software engineering data
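A sketch of one way to balance the sets by undersampling the majority class (oversampling the minority class would be an equally valid choice; the study's "k-splitting" manipulation is not necessarily this one):

import numpy as np
from sklearn.utils import resample

def balance_by_undersampling(X, y, random_state=0):
    # Equalise the two categories by sampling each class down to the minority size.
    X, y = np.asarray(X), np.asarray(y)
    n_min = min(np.sum(y == c) for c in np.unique(y))
    parts_X, parts_y = [], []
    for c in np.unique(y):
        Xc, yc = X[y == c], y[y == c]
        if len(yc) > n_min:
            Xc, yc = resample(Xc, yc, n_samples=int(n_min), replace=False,
                              random_state=random_state)
        parts_X.append(Xc)
        parts_y.append(yc)
    return np.concatenate(parts_X), np.concatenate(parts_y)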


Parametric classification

•  The problem varies depending on how many errors we allow in the system

•  c – cut-off value, i.e., the minimum number of errors a sequence vector must contain for the feature to be classified as faulty

•  Categories:

–  G1(c) = {v = [sv, µ(sv), ν(sv)] | ρ(sv) ≥ c}

–  G2(c) = {v = [sv, µ(sv), ν(sv)] | ρ(sv) < c}
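A one-line sketch of the parametric labelling, given the ρ(sv) value of each feature:

def label_with_cutoff(feature_rhos, c=1):
    # G1(c): features whose sequence vector has rho(sv) >= c (faulty); G2(c): the rest.
    return [1 if r >= c else 0 for r in feature_rhos]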


The case study


Business Questions

•  In our case study:

–  Can we use Support Vector Machines to build suitable predictors?

–  Is there any Support Vector Machine that performs best for all system applications?

–  Is there any machine that performs best across the different levels of reliability required of the system?


Descriptive analysis across apps


54 datasets, 25 of which contain some faulty features

Across system applications


[Plot: percentage of faulty features per application; applications ordered by the size of their feature set]

Effects of Information Gain


Splitting data

•  Three approaches to control for artificial assumptions

–  Varying the size of the split ("t-splitting")

–  Reducing features with IG and varying the split size ("t-splitting reduced")

–  Balancing the sets ("k-splitting"), i.e., manipulating the sets so that the numbers of instances in the two categories are balanced


Types of SVM

•  Different kernels –  Multilayer perceptron

–  Linear

–  Radial Basis Function
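A sketch of the three kernels with scikit-learn; scikit-learn has no dedicated multilayer-perceptron kernel, so the sigmoid kernel (sometimes called the MLP kernel) is used as a stand-in here, which is an assumption:

from sklearn.svm import SVC

def make_classifiers():
    # One SVM per kernel family considered in the study.
    return {
        "MP (sigmoid)": SVC(kernel="sigmoid"),
        "Linear": SVC(kernel="linear"),
        "RBF": SVC(kernel="rbf"),
    }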


Fitting performance across applications


Number of applications for which a classifier outperforms the others (by MR) in quality of fit

Prediction


[Plots: prediction performance without filtering vs. filtered with IG]

•  Models with high fitting performance (bal > 0.73)

•  Prediction performance averaged across t-splittings and models

Findings

•  Prediction is better with IG filtering; MP is best across applications, but it is not the only good choice (should applications be clustered?)

•  Artificial balancing does not help to identify a single classifier, but it increases convergence in the classifiers that are not reduced with IG


Findings (better than the literature)

•  Best performance for an individual application (MP, c=3):

–  1% false positive rate, 94% true positive rate, and 95% precision

•  Best performance across applications, averaged over models, for c=2:

–  9% false positive rate, 78% true positive rate, and 95% precision


What predictions can tell managers

•  An application that manages the software tools of cars

–  Pervasive in the telemetry system

•  106 distinct features over 10 different event types; 18% with multiple sequences and 89% with more than one user

•  c = 1

•  IG reduces the feature dimensions from 12 to 7, still including µ and ν


Confusion matrix: prediction - MP


Prediction - assumptions

•  Behaviour stays the same over the next three months

•  1000 features

•  Category balance is that of the (fitting) test set: 39%

–  390 faulty features and 610 non-faulty features


In numbers

•  We have 390 faulty features, 610 non-faulty features, and 450 features predicted faulty

•  Predicted faulty features that contain no error: 67 = 11% × 610

•  Faulty features we fail to predict: 70 = 18% × 390
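The same arithmetic as a small worked example, with values taken from the assumptions above and the MP confusion matrix:

# 1000 features, 39% faulty; MP rates: FPr = 11%, FNr = 18% (TPr = 82%)
faulty, non_faulty = 390, 610
fpr, fnr = 0.11, 0.18

wasted_inspections = round(fpr * non_faulty)   # 67 features predicted faulty with no error
missed_faulty = round(fnr * faulty)            # 70 faulty features predicted non-faulty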


          Pred pos   Pred neg   Total
Pos          82%        18%      100%
Neg          11%        89%      100%
Total        45%        54%      100%

Cost of prediction

•  Inspection cost: wasted time ≥ 67 × average cost to fix one error

–  There might be more than one error per sequence on average

•  Cost of undiscovered errors: defect slippage ≥ 70

–  A measure of system unreliability

–  Cost of repairing errors at late stages (inaccuracy, higher cost due to pressure, errors that can no longer be fixed)


Prediction

[ROC plot: true positive rate vs. false positive rate for the MP, RBF, and L classifiers, with the equal-chance diagonal. Higher false positive rates imply higher inspection costs; lower true positive rates imply a higher cost to fix undiscovered errors. The best prediction models lie in the upper-left corner; the selected MP model achieves FPr = 11%, TPr = 82%.]

Recommendations

•  Select models that accurately fit historical data before using them for prediction

–  The best models for quality of fit are not always the best predictors for all splitting sizes of a feature set

•  Reduce information redundancy


Recommendations

•  Report fitting accuracy

•  Use parametric classification

–  The parameter is the number of errors a sequence must contain in order to be classified as defective/faulty

•  Study prediction at different cut-off values, splitting sizes, and balances, so that the prediction problem is solved independently of the level of reliability required of the system and of the nature of the data


Thank you


With artificial balance

•  It does not help to identify a single classifier

•  It helps to increase convergence in those classifiers that are not reduced with IG


With IG filter


Best classifiers across different t-splittings; classifiers with bal < 0.73 are not reported