
Page 1: Evaluation of Learning Models

Evaluation of Learning Models

Evgueni Smirnov

Page 2: Evaluation of Learning Models

Overview

• Motivation
• Metrics for Classifier’s Evaluation
• Methods for Classifier’s Evaluation
• Comparing the Performance of two Classifiers
• Costs in Classification

– Cost-Sensitive Classification and Learning

– Lift Charts

– ROC Curves

Page 3: Evaluation of Learning Models

Motivation

• It is important to evaluate the classifier’s generalization performance in order to:
– Determine whether to employ the classifier;

(For example: when learning the effectiveness of medical treatments from limited-size data, it is important to estimate the accuracy of the classifiers.)

– Optimize the classifier.
(For example: when post-pruning decision trees, we must evaluate the accuracy of the decision tree at each pruning step.)

Page 4: Evaluation of Learning Models

Model’s Evaluation in the KDD Process

[Diagram: Data →(Selection)→ Target data →(Preprocessing & cleaning)→ Processed data →(Transformation & feature selection)→ Transformed data →(Data Mining)→ Patterns →(Interpretation / Evaluation)→ Knowledge.]

Page 5: Evaluation of Learning Models

How to evaluate the Classifier’s Generalization Performance?

• Assume that we test a classifier on some test set and at the end we derive the following confusion matrix:

                        Predicted class
                        Pos     Neg
  Actual class   Pos    TP      FN     |  P
                 Neg    FP      TN     |  N

Page 6: Evaluation of Learning Models

Metrics for Classifier’s Evaluation

                        Predicted class
                        Pos     Neg
  Actual class   Pos    TP      FN     |  P
                 Neg    FP      TN     |  N

• Accuracy = (TP+TN)/(P+N)
• Error = (FP+FN)/(P+N)
• Precision = TP/(TP+FP)
• Recall / TP rate = TP/P
• FP rate = FP/N
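These metrics follow directly from the four counts in the confusion matrix. Below is a minimal Python sketch with hypothetical counts (not from the slides):

```python
# Minimal sketch: compute the evaluation metrics from hypothetical confusion-matrix counts.
TP, FN = 70, 30   # actual positives:  P = TP + FN = 100
FP, TN = 20, 80   # actual negatives:  N = FP + TN = 100

P, N = TP + FN, FP + TN

accuracy  = (TP + TN) / (P + N)   # 0.75
error     = (FP + FN) / (P + N)   # 0.25
precision = TP / (TP + FP)        # ~0.778
recall    = TP / P                # TP rate, 0.70
fp_rate   = FP / N                # 0.20

print(f"accuracy={accuracy:.3f} error={error:.3f} "
      f"precision={precision:.3f} recall={recall:.3f} fp_rate={fp_rate:.3f}")
```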

Page 7: Evaluation of Learning Models

How to Estimate the Metrics?

• We can use:
– Training data;
– Independent test data;
– Hold-out method;
– k-fold cross-validation method;
– Leave-one-out method;
– Bootstrap method;
– And many more…

Page 8: Evaluation of Learning Models

Estimation with Training Data

• The accuracy/error estimates on the training data are not good indicators of performance on future data.

– Q: Why?
– A: Because new data will probably not be exactly the same as the training data!

• The accuracy/error estimates on the training data measure the degree of the classifier’s overfitting.

[Diagram: the classifier is built on the training set and evaluated on the same training set.]

Page 9: Evaluation of Learning Models

Estimation with Independent Test Data

• Estimation with independent test data is used when we have plenty of data and there is a natural way of forming training and test data.

• For example: Quinlan in 1987 reported experiments in a medical domain for which the classifiers were trained on data from 1985 and tested on data from 1986.

[Diagram: the classifier is built on the training set and evaluated on a separate test set.]

Page 10: Evaluation of Learning Models

Hold-out Method

• The hold-out method splits the data into training data and test data (usually 2/3 for training, 1/3 for testing). Then we build a classifier using the training data and test it using the test data.

• The hold-out method is usually used when we have thousands of instances, including several hundred instances from each class.

[Diagram: the data is split into a training set, used to build the classifier, and a test set, used to evaluate it.]
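A minimal sketch of hold-out evaluation, assuming scikit-learn is available; the dataset, the 2/3–1/3 split and the decision-tree learner are illustrative choices, not part of the slides:

```python
# Hold-out evaluation sketch: 2/3 training, 1/3 test (assumes scikit-learn).
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

X, y = load_breast_cancer(return_X_y=True)

# A stratified split keeps the class proportions similar in both subsets.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=1/3, stratify=y, random_state=0)

clf = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)
print("hold-out accuracy:", accuracy_score(y_test, clf.predict(X_test)))
```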

Page 11: Evaluation of Learning Models

Classification: Train, Validation, Test Split

[Diagram: data with known results is split into a training set, a validation set, and a final test set. The classifier builder learns a model from the training set; the model is evaluated and tuned on the validation set; the final classifier is evaluated once on the final test set.]

The test data can’t be used for parameter tuning!

Page 12: Evaluation of Learning Models

Making the Most of the Data

• Once evaluation is complete, all the data can be used to build the final classifier.

• Generally, the larger the training data the better the classifier (but returns diminish).

• The larger the test data the more accurate the error estimate.

Page 13: Evaluation of Learning Models

Stratification

• The holdout method reserves a certain amount for testing and uses the remainder for training.
– Usually: one third for testing, the rest for training.

• For “unbalanced” datasets, samples might not be representative.
– Few or no instances of some classes.

• Stratified sample: advanced version of balancing the data.
– Make sure that each class is represented with approximately equal proportions in both subsets.

Page 14: Evaluation of Learning Models

Repeated Holdout Method

• Holdout estimate can be made more reliable by repeating the process with different subsamples.
– In each iteration, a certain proportion is randomly selected for training (possibly with stratification).
– The error rates on the different iterations are averaged to yield an overall error rate.

• This is called the repeated holdout method.

Page 15: Evaluation of Learning Models

Repeated Holdout Method, 2

• Still not optimum: the different test sets overlap, but we would like all instances from the data to be tested at least once.

• Can we prevent overlapping?

witten & eibe

Page 16: Evaluation of Learning Models

k-Fold Cross-Validation

• k-fold cross-validation avoids overlapping test sets:
– First step: data is split into k subsets of equal size;
– Second step: each subset in turn is used for testing and the remainder for training.

• The subsets are stratified before the cross-validation.

• The estimates are averaged to yield an overall estimate.

[Diagram: with k = 3, the data is split into three folds; each fold in turn serves as the test set while the remaining folds form the training set (train–train–test, train–test–train, test–train–train).]

Page 17: Evaluation of Learning Models

More on Cross-Validation

• Standard method for evaluation: stratified 10-fold cross-validation.

• Why 10? Extensive experiments have shown that this is the best choice to get an accurate estimate.

• Stratification reduces the estimate’s variance.

• Even better: repeated stratified cross-validation:

– E.g. ten-fold cross-validation is repeated ten times and results are averaged (reduces the variance).
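A short sketch of repeated stratified 10-fold cross-validation, assuming scikit-learn; the dataset and learner are placeholders:

```python
# Repeated stratified 10-fold cross-validation sketch (assumes scikit-learn).
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import RepeatedStratifiedKFold, cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)

cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=10, random_state=0)
scores = cross_val_score(DecisionTreeClassifier(random_state=0), X, y, cv=cv)

# Averaging over the 100 fold estimates reduces the variance of the estimate.
print(f"accuracy: {scores.mean():.3f} +/- {scores.std():.3f}")
```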

Page 18: Evaluation of Learning Models

Leave-One-Out Cross-Validation

• Leave-One-Out is a particular form of cross-validation:
– Set number of folds to number of training instances;
– I.e., for n training instances, build classifier n times.

• Makes best use of the data.
• Involves no random sub-sampling.
• Very computationally expensive.
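A minimal leave-one-out sketch, again assuming scikit-learn; it builds the classifier once per instance, which is what makes the method expensive:

```python
# Leave-one-out cross-validation sketch (assumes scikit-learn).
from sklearn.datasets import load_iris
from sklearn.model_selection import LeaveOneOut, cross_val_score
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)

# One fold per instance: the classifier is built n times.
scores = cross_val_score(KNeighborsClassifier(n_neighbors=3), X, y, cv=LeaveOneOut())
print("leave-one-out accuracy:", scores.mean())
```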

Page 19: Evaluation of Learning Models

Leave-One-Out Cross-Validation and Stratification

• A disadvantage of Leave-One-Out-CV is that stratification is not possible:
– It guarantees a non-stratified sample because there is only one instance in the test set!

• Extreme example – random dataset split equally into two classes:
– Best inducer predicts majority class;
– 50% accuracy on fresh data;
– Leave-One-Out-CV estimate is 100% error!

Page 20: Evaluation of Learning Models

Bootstrap Method

• Cross-validation uses sampling without replacement:
– The same instance, once selected, can not be selected again for a particular training/test set.

• The bootstrap uses sampling with replacement to form the training set:
– Sample a dataset of n instances n times with replacement to form a new dataset of n instances;
– Use this data as the training set;
– Use the instances from the original dataset that don’t occur in the new training set for testing.

Page 21: Evaluation of Learning Models

Bootstrap Method

• The bootstrap method is also called the 0.632 bootstrap:
– A particular instance has a probability of 1 − 1/n of not being picked;
– Thus its probability of ending up in the test data is:

  $\left(1 - \tfrac{1}{n}\right)^{n} \approx e^{-1} \approx 0.368$

– This means the training data will contain approximately 63.2% of the instances and the test data will contain approximately 36.8% of the instances.

Page 22: Evaluation of Learning Models

Estimating Error with the Bootstrap Method

• The error estimate on the test data will be very pessimistic because the classifier is trained on just ~63% of the instances.

• Therefore, combine it with the training error:

  $err = 0.632 \cdot e_{\text{test instances}} + 0.368 \cdot e_{\text{training instances}}$

• The training error gets less weight than the error on the test data.

• Repeat the process several times with different replacement samples; average the results.
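A sketch of the 0.632 bootstrap estimate, assuming NumPy and scikit-learn; the number of bootstrap repetitions and the learner are illustrative choices:

```python
# 0.632 bootstrap error estimate, sketched for one classifier (assumes NumPy + scikit-learn).
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
rng = np.random.default_rng(0)
n = len(y)

errors = []
for _ in range(50):                               # repeat with different replacement samples
    boot = rng.integers(0, n, size=n)             # sample n instances with replacement
    oob = np.setdiff1d(np.arange(n), boot)        # out-of-bag instances form the test set

    clf = DecisionTreeClassifier(random_state=0).fit(X[boot], y[boot])
    e_test = 1 - clf.score(X[oob], y[oob])        # pessimistic: trained on ~63% of instances
    e_train = 1 - clf.score(X[boot], y[boot])     # optimistic: training error

    errors.append(0.632 * e_test + 0.368 * e_train)

print("0.632 bootstrap error estimate:", np.mean(errors))
```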

Page 23: Evaluation of Learning Models

Confidence Intervals for Accuracy

• Assume that the estimated accuracy accS(h) of classifier h is 75%.

• How close is the estimated accuracy accS(h) to the true accuracy accD(h) ?

Page 24: Evaluation of Learning Models

Confidence Intervals for Accuracy

• Classification of an instance is a Bernoulli trial.
– A Bernoulli trial has 2 outcomes: correct and wrong;
– The random variable X, the number of correct outcomes of N Bernoulli trials, has a Binomial distribution b(x; N, accD);
– Example: If we have a classifier with true accuracy accD equal to 50%, then to classify 30 randomly chosen instances we receive the following Binomial distribution:

  [Figure: binomial distribution b(x; 30, 0.5) of the number X of correctly classified instances.]

Page 25: Evaluation of Learning Models

Confidence Intervals for Accuracy

The main question:

Given the number N of test instances and the number x of correct classifications, or equivalently, the empirical accuracy accS = x/N, can we predict the true accuracy accD of the classifier?

Page 26: Evaluation of Learning Models

Confidence Intervals for Accuracy

• The binomial distribution of X has mean equal to N accD and variance N accD(1- accD).

• It can be shown that the empirical accuracy accS = X/N also follows a binomial distribution (scaled by 1/N) with mean equal to accD and variance accD(1 − accD)/N.


Page 27: Evaluation of Learning Models

Confidence Intervals for Accuracy

• For large test sets (N > 30),

a binomial distribution is

approximated by a normal

one with mean accD and

variance accD(1- accD)/N.

• Thus,

• Confidence Interval for accD:

1)/)1(

( 2/12/ ZNaccacc

accaccZP

DD

DS

Area = 1 -

Z/2 Z1- /2

)(2

4422

2/

222/

22/

ZN

accNaccNZZaccN SSS

Page 28: Evaluation of Learning Models

Confidence Intervals for Accuracy

• Confidence interval for accD:

  $acc_D \in \left[\frac{2\,N\,acc_S + Z_{\alpha/2}^2 \pm Z_{\alpha/2}\sqrt{Z_{\alpha/2}^2 + 4\,N\,acc_S - 4\,N\,acc_S^2}}{2\,(N + Z_{\alpha/2}^2)}\right]$

• The confidence intervals shrink when we decrease the confidence level:

• The confidence intervals become tighter when the number N of test instances increases. The table below shows the 95% confidence intervals for a classifier with empirical accuracy 80% as the number of test instances grows.

  1 − α:   0.99   0.98   0.95   0.90   0.80   0.70   0.50
  Zα/2:    2.58   2.33   1.96   1.65   1.28   1.04   0.67

  N:                     20            50            100           500           1000          5000
  Confidence interval:   [0.58, 0.92]  [0.67, 0.89]  [0.71, 0.87]  [0.76, 0.83]  [0.77, 0.82]  [0.78, 0.81]
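A small sketch that evaluates the interval formula above for the table’s setting (empirical accuracy 80%, 95% confidence), assuming SciPy for the normal quantile; the printed intervals should match the table up to rounding:

```python
# Normal-approximation confidence interval for the true accuracy (assumes SciPy).
from math import sqrt
from scipy.stats import norm

def accuracy_interval(acc_s, n, confidence=0.95):
    z = norm.ppf(1 - (1 - confidence) / 2)        # e.g. 1.96 for 95%
    center = 2 * n * acc_s + z**2
    spread = z * sqrt(z**2 + 4 * n * acc_s - 4 * n * acc_s**2)
    denom = 2 * (n + z**2)
    return (center - spread) / denom, (center + spread) / denom

for n in (20, 50, 100, 500, 1000, 5000):
    lo, hi = accuracy_interval(0.8, n)
    print(f"N={n:5d}  [{lo:.2f}, {hi:.2f}]")
```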

Page 29: Evaluation of Learning Models

Estimating Confidence Intervals of the Difference of Generalization Performances of two Classifier Models

• Assume that we have two classifiers, M1 and M2, and we would like to know which one is better for a classification problem.

• We test the classifiers on n test data sets D1, D2, …, Dn, and we receive error rate estimates e11, e12, …, e1n for classifier M1 and error rate estimates e21, e22, …, e2n for classifier M2.

• Using these rate estimates we can compute the mean error rate e1 for classifier M1 and the mean error rate e2 for classifier M2.

• These mean error rates are just estimates of error on the true population of future data cases. What if the difference between the two error rates is just attributed to chance?

Page 30: Evaluation of Learning Models

Estimating Confidence Intervals of the Difference of Generalization Performances of two Classifier Models

• We note that the error rate estimates e11, e12, …, e1n for classifier M1 and the error rate estimates e21, e22, …, e2n for classifier M2 are paired. Thus, we consider the differences d1, d2, …, dn where dj = e1j − e2j.

• The differences d1, d2, …, dn are instantiations of n random variables D1, D2, …, Dn with mean µD and standard deviation σD.

• We need to establish confidence intervals for µD in order to decide whether the difference in the generalization performance of the classifiers M1 and M2 is statistically significant or not.

Page 31: Evaluation of Learning Models

Estimating Confidence Intervals of the Difference of Generalization Performances of two Classifier Models

• Since the standard deviation σD is unknown, we approximate it using the sample standard deviation sd:

  $s_d^2 = \frac{1}{n}\sum_{i=1}^{n}\left[(e_{1i} - e_{2i}) - (\bar{e}_1 - \bar{e}_2)\right]^2$

• Since we approximate the true standard deviation σD, we introduce the T statistic:

  $T = \frac{\bar{D} - \mu_D}{s_d/\sqrt{n}}$

Page 32: Evaluation of Learning Models

Estimating Confidence Intervals of the Difference of Generalization Performances of two Classifier Models

• The T statistic is governed by the t-distribution with n − 1 degrees of freedom.

  [Figure: t-distribution density; the area between $t_{\alpha/2}$ and $t_{1-\alpha/2}$ equals $1 - \alpha$.]

Page 33: Evaluation of Learning Models

Estimating Confidence Intervals of the Difference of Generalization Performances of two Classifier Models

• If d̄ and sd are the mean and standard deviation of the normally distributed differences of n random pairs of errors, a (1 − α)100% confidence interval for µD = µ1 − µ2 is:

  $\bar{d} - t_{\alpha/2}\frac{s_d}{\sqrt{n}} < \mu_D < \bar{d} + t_{\alpha/2}\frac{s_d}{\sqrt{n}}$

  where tα/2 is the t-value with v = n − 1 degrees of freedom, leaving an area of α/2 to the right.

• Thus, if the interval contains 0.0, we cannot reject at significance level α the hypothesis that the difference is 0.0.
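A sketch of this paired comparison, assuming NumPy and SciPy; the per-dataset error rates are made-up illustrative numbers:

```python
# Paired comparison of two classifiers over n test sets (assumes NumPy + SciPy).
import numpy as np
from scipy.stats import t

e1 = np.array([0.14, 0.12, 0.15, 0.13, 0.16, 0.14, 0.12, 0.15, 0.13, 0.14])  # errors of M1 (illustrative)
e2 = np.array([0.12, 0.11, 0.14, 0.12, 0.13, 0.13, 0.11, 0.14, 0.12, 0.12])  # errors of M2 (illustrative)

d = e1 - e2                                  # paired differences
n = len(d)
d_bar = d.mean()
s_d = d.std(ddof=1)                          # sample standard deviation (some texts divide by n instead of n-1)

alpha = 0.05
t_crit = t.ppf(1 - alpha / 2, df=n - 1)      # t-value with n-1 degrees of freedom
half_width = t_crit * s_d / np.sqrt(n)

low, high = d_bar - half_width, d_bar + half_width
print(f"95% CI for the mean difference: [{low:.4f}, {high:.4f}]")
print("significant difference" if not (low <= 0 <= high) else "difference may be due to chance")
```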

Page 34: Evaluation of Learning Models

Metric Evaluation Summary:

• Use test sets and the hold-out method for “large” data;

• Use the cross-validation method for “middle-sized” data;

• Use the leave-one-out and bootstrap methods for small data;

• Don’t use test data for parameter tuning - use separate validation data.

Page 35: Evaluation of Learning Models

Counting the Costs

• In practice, different types of classification errors often incur different costs.

• Examples:
– Terrorist profiling
  • “Not a terrorist” correct 99.99% of the time
– Loan decisions
– Fault diagnosis
– Promotional mailing

Page 36: Evaluation of Learning Models

Cost Matrices

                        Hypothesized class
                        Pos        Neg
  True class   Pos      TP Cost    FN Cost
               Neg      FP Cost    TN Cost

Usually, TP Cost and TN Cost are set equal to 0!

Page 37: Evaluation of Learning Models

Cost-Sensitive Classification

• If the classifier outputs a probability for each class, it can be adjusted to minimize the expected costs of the predictions.
• The expected cost is computed as the dot product of the vector of class probabilities and the appropriate column in the cost matrix.

                        Hypothesized class
                        Pos        Neg
  True class   Pos      TP Cost    FN Cost
               Neg      FP Cost    TN Cost

Page 38: Evaluation of Learning Models

Cost-Sensitive Classification

• Assume that the classifier returns for an instance the probabilities ppos = 0.6 and pneg = 0.4. Then, the expected cost if the instance is classified as positive is 0.6 · 0 + 0.4 · 10 = 4. The expected cost if the instance is classified as negative is 0.6 · 5 + 0.4 · 0 = 3. To minimize the costs, the instance is classified as negative.

                        Hypothesized class
                        Pos     Neg
  True class   Pos      0       5
               Neg      10      0
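A minimal sketch of the expected-cost rule with the cost matrix above, assuming NumPy; the probabilities are the ones from the example:

```python
# Expected-cost classification sketch for the example above (assumes NumPy).
import numpy as np

# Rows: true class (pos, neg); columns: hypothesized class (pos, neg).
cost = np.array([[0, 5],
                 [10, 0]])

p = np.array([0.6, 0.4])            # classifier's probabilities for (pos, neg)

expected_cost = p @ cost            # dot product with each column: [4.0, 3.0]
prediction = ["pos", "neg"][int(np.argmin(expected_cost))]

print("expected costs:", expected_cost)   # classifying as 'pos' costs 4, as 'neg' costs 3
print("minimum-cost prediction:", prediction)
```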

Page 39: Evaluation of Learning Models

Cost Sensitive Learning

• Simple methods for cost-sensitive learning:
– Resampling of instances according to costs;
– Weighting of instances according to costs.

In Weka, cost-sensitive classification and learning can be applied to any classifier using the meta-scheme CostSensitiveClassifier.


Page 40: Evaluation of Learning Models

Lift Charts

• In practice, decisions are usually made by comparing possible scenarios, taking into account different costs.

• Example:
– Promotional mailout to 1,000,000 households. If we mail to all households, we get a 0.1% response rate (1,000 respondents).
– A data mining tool identifies (a) a subset of 100,000 households with a 0.4% response rate (400 respondents); or (b) a subset of 400,000 households with a 0.2% response rate (800 respondents).
– Depending on the costs, we can make the final decision using lift charts!
– A lift chart allows a visual comparison.

Page 41: Evaluation of Learning Models

Generating a Lift Chart

• Instances are sorted according to their predicted probability of being a true positive:

  Rank   Predicted probability   Actual class
  1      0.95                    Yes
  2      0.93                    Yes
  3      0.93                    No
  4      0.88                    Yes
  …      …                       …

• In a lift chart, the x axis is the sample size and the y axis is the number of true positives.
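A short sketch that produces the lift-chart points from ranked predictions, assuming NumPy; the probabilities and labels are illustrative, and plotting is left out:

```python
# Lift chart points: sort by predicted probability, accumulate true positives (assumes NumPy).
import numpy as np

p_pos = np.array([0.88, 0.95, 0.93, 0.93, 0.51, 0.20])   # predicted probabilities (illustrative)
actual = np.array([1, 1, 1, 0, 1, 0])                     # 1 = actual "Yes", 0 = "No"

order = np.argsort(-p_pos)                  # rank instances by decreasing probability
cum_tp = np.cumsum(actual[order])           # y axis: number of true positives
sample_size = np.arange(1, len(p_pos) + 1)  # x axis: sample size

for x, y in zip(sample_size, cum_tp):
    print(f"sample size {x}: {y} true positives")
```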

Page 42: Evaluation of Learning Models

Hypothetical Lift Chart

Page 43: Evaluation of Learning Models

ROC Curves and Analysis

Classifier 1 (TPr = 0.4, FPr = 0.3):
                 Predicted
                 pos    neg
  True   pos     40     60
         neg     30     70

Classifier 2 (TPr = 0.7, FPr = 0.5):
                 Predicted
                 pos    neg
  True   pos     70     30
         neg     50     50

Classifier 3 (TPr = 0.6, FPr = 0.2):
                 Predicted
                 pos    neg
  True   pos     60     40
         neg     20     80

Page 44: Evaluation of Learning Models

ROC Space

[Figure: the ROC space, showing the ideal classifier, the always-negative classifier, the always-positive classifier, and the diagonal chance line.]

Page 45: Evaluation of Learning Models

Dominance in the ROC Space

Classifier A dominates classifier B if and only if TPrA > TPrB and FPrA < FPrB.

Page 46: Evaluation of Learning Models

ROC Convex Hull (ROCCH)

• ROCCH is determined by the dominant classifiers; • Classifiers on ROCCH achieve the best accuracy; • Classifiers below ROCCH are always sub-optimal.

Page 47: Evaluation of Learning Models

Convex Hull

• Any performance on a line segment connecting two ROC points can be achieved by randomly choosing between them;
• The classifiers on ROCCH can be combined to form a hybrid.

Page 48: Evaluation of Learning Models

Iso-Accuracy Lines

• An iso-accuracy line connects ROC points with the same accuracy A:
– (P·TPr + N·(1 − FPr)) / (P + N) = A;
– TPr = (A·(P + N) − N)/P + (N/P)·FPr.

• Iso-accuracy lines have slope N/P.
• Higher iso-accuracy lines are better.

Page 49: Evaluation of Learning Models

Iso-Accuracy Lines

• For uniform class distribution, C4.5 is optimal and achieves about 82% accuracy.

Page 50: Evaluation of Learning Models

Iso-Accuracy Lines

• With four times as many positives as negatives, SVM is optimal and achieves about 84% accuracy.

Page 51: Evaluation of Learning Models

Iso-Accuracy Lines

• With four times as many negatives as positives, CN2 is optimal and achieves about 86% accuracy.

Page 52: Evaluation of Learning Models

Iso-Accuracy Lines

• With less than 9% positives, AlwaysNeg is optimal.
• With less than 11% negatives, AlwaysPos is optimal.

Page 53: Evaluation of Learning Models

How to Construct ROC Curve for one Classifier

• Sort the instances according to their Ppos.

• Move a threshold on the sorted instances.

• For each threshold define a classifier with confusion matrix.

• Plot the TPr and FPr rates of the classifiers.

Ppos True Class

0.99 pos

0.98 pos

0.7 neg

0.6 pos

0.43 neg

Example confusion matrix for the threshold placed after the top three instances (Ppos ≥ 0.7):

                 Predicted
                 pos    neg
  True   pos     2      1
         neg     1      1
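A sketch of this threshold sweep for the five instances above, assuming NumPy; each threshold position yields one (FPr, TPr) point:

```python
# ROC points by moving a threshold over instances sorted by Ppos (assumes NumPy).
import numpy as np

p_pos = np.array([0.99, 0.98, 0.7, 0.6, 0.43])   # already sorted, as in the table above
true = np.array([1, 1, 0, 1, 0])                  # 1 = pos, 0 = neg

P, N = true.sum(), len(true) - true.sum()

points = [(0.0, 0.0)]                             # threshold above every score
for k in range(1, len(true) + 1):                 # classify the top-k instances as positive
    tp = true[:k].sum()
    fp = k - tp
    points.append((fp / N, tp / P))               # (FPr, TPr)

print(points)   # the top-3 threshold gives the confusion matrix above: TPr = 2/3, FPr = 1/2
```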

Page 54: Evaluation of Learning Models

ROC for one Classifier

Good separation between the classes, convex curve.

Page 55: Evaluation of Learning Models

ROC for one Classifier

Reasonable separation between the classes, mostly convex.

Page 56: Evaluation of Learning Models

ROC for one Classifier

Fairly poor separation between the classes, mostly convex.

Page 57: Evaluation of Learning Models

ROC for one Classifier

Poor separation between the classes, large and small concavities.

Page 58: Evaluation of Learning Models

ROC for one Classifier

Random performance.

Page 59: Evaluation of Learning Models

The AUC Metric

• The area under ROC curve (AUC) assesses the ranking in terms of separation of the classes.

• AUC estimates the probability that a randomly chosen positive instance will be ranked before a randomly chosen negative instance.
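A minimal sketch that computes AUC for the five-instance example above, assuming scikit-learn:

```python
# AUC for the five-instance example above (assumes scikit-learn).
from sklearn.metrics import roc_auc_score

p_pos = [0.99, 0.98, 0.7, 0.6, 0.43]
true = [1, 1, 0, 1, 0]

# Probability that a randomly chosen positive is ranked above a randomly chosen negative:
# 5 of the 6 positive-negative pairs are ranked correctly, so AUC = 5/6 ≈ 0.83.
print("AUC:", roc_auc_score(true, p_pos))
```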

Page 60: Evaluation of Learning Models

Note

• To generate ROC curves or lift charts we need to use some of the evaluation methods considered in this lecture.

• ROC curves and Lift charts can be used for internal optimization of classifiers.

Page 61: Evaluation of Learning Models

Summary

• In this lecture we have considered:
– Metrics for Classifier’s Evaluation
– Methods for Classifier’s Evaluation
– Comparing Data Mining Schemes
– Costs in Data Mining

• Cost-Sensitive Classification and Learning

• Lift Charts

• ROC Curves