
Evaluating learning algorithms

A. Cornuéjols
AgroParisTech

(based in part on Sebastian Thrun's CMU class and on the tutorial of Padraic Cunningham at ECML-09)


Questions

Since induction is fallible, it is necessary to be able to assess its reliability.

• Typical questions:
  – What is the true performance of my (learned) classification rule?
  – Is my learning algorithm better than this other one?


Outline

1.  Measuring the error rate

2.  Confusion matrices and various performance criteria

3.  The ROC curve

Estimating the true error rate


Evaluating classification rules

Large data sample

Very small data sample

Unlimited sample


Various sets of data

The whole available data set

Learning set / Validation set / Test set


Asymptotic behaviour (ideal case)

• Useful for very large data sets.


Over-fitting (over-learning)

[Figure: error vs. training time t; the test-set error starts rising while the training-set error keeps decreasing; training should stop at that point, beyond it lies over-fitting]


Over-fitting (NNs)

[Figure: error curves for … training examples]

Why use a test set?

• The control parameters of the learning algorithm
  – e.g.: the number of hidden layers, the number of neurons, ...
  – are tuned in order to reduce the error on the validation set
• In order to have an estimate of the error that is not optimistically biased, one must measure it on an independent data set: the test set (see the sketch below)
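As an illustration, a minimal sketch of such a three-way split in Python. The toy data and the 60/20/20 proportions are arbitrary assumptions, not from the slides; scikit-learn's train_test_split does the bookkeeping.

from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

# Toy data standing in for the real sample (hypothetical).
X, y = make_classification(n_samples=1000, random_state=0)

# Carve out the test set first (20%), then split the remainder
# into a learning set (60%) and a validation set (20%).
X_rest, X_test, y_rest, y_test = train_test_split(X, y, test_size=0.20, random_state=0)
X_learn, X_val, y_learn, y_val = train_test_split(X_rest, y_rest, test_size=0.25, random_state=0)

# Tune the control parameters on (X_val, y_val);
# use (X_test, y_test) only once, for the final error estimate.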


Evaluating classification rules

A lot of data / Few data


Evaluating the error rate

• True error (real risk):

  $e_D(f) = \int_D \delta\big(y, f(x)\big)\, p(x, y)\, dx\, dy$

• Test error (empirical risk):

  $\hat{e}_T(f) = \frac{1}{m} \sum_{(x,y) \in T} \delta\big(y, f(x)\big)$

where D is the true distribution, T the test data, m the number of test examples, and $\delta(y, f(x)) = 1$ if $f(x) \neq y$, 0 otherwise.
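In code, the empirical risk is a one-liner; a minimal sketch, where f stands for any learned classifier and the test set below is a toy example:

def test_error(f, test_set):
    # Empirical risk: fraction of test examples (x, y) with f(x) != y.
    return sum(1 for x, y in test_set if f(x) != y) / len(test_set)

# Toy illustration: a classifier that always predicts class 1,
# evaluated on a hypothetical test set T of (x, y) pairs.
T = [(0.2, 1), (0.7, 0), (0.5, 1), (0.9, 0)]
print(test_error(lambda x: 1, T))  # 0.5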


Example:

• The learned hypothesis incorrectly classifies 12 out of 40 examples in the test set T.
• Q: What is the true error rate?
• A: ???


Confidence intervals

• We want to estimate error_D(h).
• We estimate it using error_T(h): the number of errors, m · error_T(h), follows a binomial law, so error_T(h) has
  – mean: error_D(h)
  – standard deviation: $\sqrt{\mathrm{error}_D(h)\,(1 - \mathrm{error}_D(h))/m}$
• Confidence intervals are then estimated by approximating this binomial law with the normal law of the same mean and standard deviation.


Confidence intervals

[Figure: the normal law]


Confidence intervals

With probability N%, the true error error_D lies in the interval:

  $\hat{e}_T \pm z_N \sqrt{\hat{e}_T (1 - \hat{e}_T)/m}$

  N%  :  50%   68%   80%   90%   95%   98%   99%
  z_N :  0.67  1.0   1.28  1.64  1.96  2.33  2.58


Confidence intervals (cf. Mitchell 97)

If
  – T contains m examples, independently sampled
  – m ≥ 30
Then
  – with probability 95%, the true error e_D lies within:

  $\hat{e}_T \pm 1.96 \sqrt{\hat{e}_T (1 - \hat{e}_T)/m}$


Example:

• The learned hypothesis incorrectly classifies 12 out of 40 test examples in T.
• Q: What will be the true error on unseen examples?
• A: With 95% confidence, the true error will lie within

  $\hat{e}_T \pm 1.96 \sqrt{\hat{e}_T (1 - \hat{e}_T)/m} \in [0.16;\, 0.44]$

since $\hat{e}_T = 12/40 = 0.3$, $m = 40$, and $1.96 \sqrt{\hat{e}_T (1 - \hat{e}_T)/m} \approx 0.14$.
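The same computation in a few lines of Python; a minimal sketch of the normal-approximation interval above (the function name is ours):

from math import sqrt

def error_confidence_interval(errors, m, z=1.96):
    # Normal-approximation interval for the true error,
    # given `errors` misclassifications out of m test examples.
    e_hat = errors / m
    half_width = z * sqrt(e_hat * (1 - e_hat) / m)
    return e_hat - half_width, e_hat + half_width

print(error_confidence_interval(12, 40))  # approximately (0.16, 0.44)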


95% confidence intervals


Performance curves

[Figure: training error and test error curves with 95% confidence intervals]


Evaluating learned hypotheses

A lot of data / Few data


Various sets

[Diagram: the Data are split into a learning set and a test set; learn, then test → error]


Small data sets: a dilemma



Cross validation (k-fold)

• k-way split of the Data into k folds
• For each fold i (i = 1, ..., k): learn on the other k − 1 folds (yellow), test on fold i (pink) → error_i
• error = (1/k) Σ_i error_i
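A minimal k-fold sketch in Python; the decision tree and the synthetic data are arbitrary stand-ins, and scikit-learn's KFold handles the k-way split:

import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import KFold
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=200, random_state=0)  # toy data
errors = []
for learn_idx, test_idx in KFold(n_splits=8, shuffle=True, random_state=0).split(X):
    clf = DecisionTreeClassifier().fit(X[learn_idx], y[learn_idx])   # learn on "yellow"
    errors.append(np.mean(clf.predict(X[test_idx]) != y[test_idx]))  # test on "pink"
print(np.mean(errors))  # error = (1/k) * sum_i error_i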


The “leave-one-out” procedure

• Low bias
• High variance
• Tends to under-estimate the error if the data are not fully i.i.d. [Guyon & Elisseeff, JMLR, 2003]


The Bootstrap estimate

• Draw a bootstrap sample of the Data; learn on it (yellow), test on the remaining, out-of-bag, examples (pink) → error
• Repeat and compute the mean


Problem

• The calculation of the confidence interval assumes the independence of the estimates.
• But our estimates are not independent!

[Diagram: the mean of the risks on the k test samples, i.e. the mean risk over the whole data set, is used as the estimate of the true risk of the final hypothesis h]


Types of performance criteria

Confusion matrices and various performance criteria


Confusion matrix

                 Actual
                 +     -
Predicted  +    TP    FP
           -    FN    TN

(TP = true positives, FP = false positives, FN = false negatives, TN = true negatives)


Confusion matrix

14% of the butterflies are recognized as fishes


Types of performance criteria


Types of performance measures


Performance measures

                 Actual
                 +     -
Predicted  +    TP    FP
           -    FN    TN

• Sensitivity = TP / (TP + FN)
• Specificity = TN / (TN + FP)
• Recall    = TP / (TP + FN)
• Precision = TP / (TP + FP)


Performance measures

• FN-rate = FN / (TP + FN)
• FP-rate = FP / (FP + TN)
• F-measure = (2 × recall × precision) / (recall + precision) = 2 TP / (2 TP + FP + FN)
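All of these measures derive from the four confusion-matrix counts; a minimal sketch (the counts in the example call are hypothetical):

def classification_measures(tp, fp, fn, tn):
    # The usual measures, computed from the confusion-matrix counts.
    return {
        "sensitivity (recall)": tp / (tp + fn),
        "specificity": tn / (tn + fp),
        "precision": tp / (tp + fp),
        "FN-rate": fn / (tp + fn),
        "FP-rate": fp / (fp + tn),
        "F-measure": 2 * tp / (2 * tp + fp + fn),
    }

print(classification_measures(tp=30, fp=10, fn=5, tn=55))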


Performance measures


[Table: worked example with classes "good" and "bad"; e.g. Precision(good) = (# correctly predicted "good") / (# predicted "good")]

The ROC curve



Types of errors


The ROC curve

[Figure: distribution of the class probability for class '+' and class '-' as a function of the decision criterion (threshold)]

ROC = Receiver Operating Characteristic


The ROC curve

[Figure: a decision threshold splits each class distribution into true positives, false negatives, false positives, and true negatives; e.g. (50%, 50%) for one threshold vs. (90%, 10%) for another]



The ROC curve

[Figure: ROC curve plotting the proportion of true positives against the proportion of false positives; the chance line (relevance = 0.5) vs. a ROC curve with relevance = 0.90]

The ROC curve

[Figure: the same ROC curve with two decision thresholds marked: a "lenient" threshold (more true positives, but more false positives) and a "strict" threshold, each corresponding to one point on the curve]


The ROC curve

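A sketch of how such a curve is traced: sweep the decision threshold over the classifier's scores and record, at each threshold, the proportions of true and false positives. The scores and labels below are hypothetical:

import numpy as np

def roc_points(scores, labels):
    # One (FP-rate, TP-rate) point per decision threshold.
    pts = []
    for threshold in sorted(set(scores), reverse=True):
        predicted_pos = scores >= threshold
        tp_rate = np.sum(predicted_pos & (labels == 1)) / np.sum(labels == 1)
        fp_rate = np.sum(predicted_pos & (labels == 0)) / np.sum(labels == 0)
        pts.append((fp_rate, tp_rate))
    return pts

scores = np.array([0.9, 0.8, 0.7, 0.6, 0.55, 0.4, 0.3, 0.2])
labels = np.array([1, 1, 0, 1, 1, 0, 0, 0])
print(roc_points(scores, labels))  # lowering the threshold moves up the curve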


Comparison of learning algorithms

• Comparison on a single data set: [Dietterich, 1998] recommends using
  – 5 × 2 cross-validation with a paired t-test, or
  – the McNemar test on a validation set (see the sketch after this list)
• Comparison on multiple (different) data sets: [Demsar, 2006] recommends using
  – the Wilcoxon signed-ranks test
  – the Friedman test
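For instance, the McNemar test needs only the counts of validation examples on which the two classifiers disagree; a minimal sketch using the chi-square approximation described by [Dietterich, 1998] (the disagreement counts below are hypothetical):

from scipy.stats import chi2

def mcnemar(n01, n10):
    # n01: # examples misclassified by classifier A only; n10: by B only.
    stat = (abs(n01 - n10) - 1) ** 2 / (n01 + n10)  # continuity-corrected
    return stat, chi2.sf(stat, df=1)                # statistic, p-value

stat, p = mcnemar(n01=25, n10=10)
print(stat, p)  # reject "equal error rates" at the 5% level if p < 0.05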


Summary

• Pay attention to your cost function:
  – what matters for the performance measure?
• Finite amount of data:
  – compute the confidence intervals
• Scarce data:
  – pay attention to the split between training data and test data; use cross-validation
• Do not forget the validation set
• Evaluation is very important
  – Be critical
  – Convince yourself!


Specific problems

• The distribution of the classes is very unbalanced (e.g. 1% or 1‰ for one of the two classes)
• "Gray zone" (uncertain labels)
• Multi-valued functions


Other evaluation criteria

• Intelligibility of the learned decision function
  – e.g. SVMs or boosting score poorly here
• Performance in generalization
  – often not correlated with the previous criterion
• Various costs:
  – data preparation
  – computational cost
  – cost of the ML expertise
  – cost of the domain expertise


References

• Dietterich, T. G. (1998). Approximate statistical tests for comparing supervised classification learning algorithms. Neural Computation, 10(7), 1895–1924.
• Demšar, J. (2006). Statistical comparisons of classifiers over multiple data sets. Journal of Machine Learning Research, 7, 1–30.
• Japkowicz, N. & Shah, M. (2011). Evaluating Learning Algorithms: A Classification Perspective. Cambridge University Press. (An interesting book)


The Weka ML toolkit

• http://www.cs.waikato.ac.nz/ml/weka/
