
Evaluating learning algorithms

A. Cornuéjols
AgroParisTech

(based in part on Sebastian Thrun's CMU class and on the tutorial of Padraic Cunningham at ECML-09)


Questions

Since induction is fallible, it is necessary to be able to assess its reliability.

• Typical questions:
  – What is the true performance of my (learned) classification rule?
  – Is my learning algorithm better than this other one?


Outline

1.  Measuring the error rate

2.  Confusion matrices and various performance criteria

3.  The ROC curve

Estimating the true error rate


Evaluating classification rules

Large data sample

Very small data sample

Unlimited sample


Various sets of data

The whole available data set

Learning set / Validation set / Test set


Asymptotic behaviour (ideal case)

• Useful for very large data sets.


Over-fitting (over-learning)

[Figure: error vs. training time t; the test-set error starts rising while the training-set error keeps decreasing; training should stop at that point, beyond it lies over-fitting]


Over-fitting (NNs)

[Figure: error curves for … training examples]

Why use a test set?

• The control parameters of the learning algorithm
  – e.g.: the number of hidden layers, the number of neurons, ...
  – are tuned in order to reduce the error on the validation set
• In order to have an estimate of the error that is not optimistically biased, one must measure it on an independent data set: the test set (see the sketch below)
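As an illustration, a minimal sketch of such a three-way split in Python. The toy data and the 60/20/20 proportions are arbitrary assumptions, not from the slides; scikit-learn's train_test_split does the bookkeeping.

from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

# Toy data standing in for the real sample (hypothetical).
X, y = make_classification(n_samples=1000, random_state=0)

# Carve out the test set first (20%), then split the remainder
# into a learning set (60%) and a validation set (20%).
X_rest, X_test, y_rest, y_test = train_test_split(X, y, test_size=0.20, random_state=0)
X_learn, X_val, y_learn, y_val = train_test_split(X_rest, y_rest, test_size=0.25, random_state=0)

# Tune the control parameters on (X_val, y_val);
# use (X_test, y_test) only once, for the final error estimate.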


Evaluating classification rules

A lot of data / Few data


Evaluating the error rate

• True error (real risk):

  $e_D(f) = \int_D \delta\big(y, f(x)\big)\, p(x, y)\, dx\, dy$

• Test error (empirical risk):

  $\hat{e}_T(f) = \frac{1}{m} \sum_{(x,y) \in T} \delta\big(y, f(x)\big)$

where D is the true distribution, T the test data, m the number of test examples, and $\delta(y, f(x)) = 1$ if $f(x) \neq y$, 0 otherwise.
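In code, the empirical risk is a one-liner; a minimal sketch, where f stands for any learned classifier and the test set below is a toy example:

def test_error(f, test_set):
    # Empirical risk: fraction of test examples (x, y) with f(x) != y.
    return sum(1 for x, y in test_set if f(x) != y) / len(test_set)

# Toy illustration: a classifier that always predicts class 1,
# evaluated on a hypothetical test set T of (x, y) pairs.
T = [(0.2, 1), (0.7, 0), (0.5, 1), (0.9, 0)]
print(test_error(lambda x: 1, T))  # 0.5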


Example:

• The learned hypothesis incorrectly classifies 12 out of 40 examples in the test set T.
• Q: What is the true error rate?
• A: ???


Confidence intervals

• We want to estimate error_D(h).
• We estimate it using error_T(h): the number of errors, m · error_T(h), follows a binomial law, so error_T(h) has
  – mean: error_D(h)
  – standard deviation: $\sqrt{\mathrm{error}_D(h)\,(1 - \mathrm{error}_D(h))/m}$
• Confidence intervals are then estimated by approximating this binomial law with the normal law of the same mean and standard deviation.


Confidence intervals

[Figure: the normal law]


Confidence intervals

With probability N%, the true error error_D lies in the interval:

  $\hat{e}_T \pm z_N \sqrt{\hat{e}_T (1 - \hat{e}_T)/m}$

  N%  :  50%   68%   80%   90%   95%   98%   99%
  z_N :  0.67  1.0   1.28  1.64  1.96  2.33  2.58


Confidence intervals (cf. Mitchell 97)

If
  – T contains m examples, independently sampled
  – m ≥ 30
Then
  – with probability 95%, the true error e_D lies within:

  $\hat{e}_T \pm 1.96 \sqrt{\hat{e}_T (1 - \hat{e}_T)/m}$


Example:

• The learned hypothesis incorrectly classifies 12 out of 40 test examples in T.
• Q: What will be the true error on unseen examples?
• A: With 95% confidence, the true error will lie within

  $\hat{e}_T \pm 1.96 \sqrt{\hat{e}_T (1 - \hat{e}_T)/m} \in [0.16;\, 0.44]$

since $\hat{e}_T = 12/40 = 0.3$, $m = 40$, and $1.96 \sqrt{\hat{e}_T (1 - \hat{e}_T)/m} \approx 0.14$.
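The same computation in a few lines of Python; a minimal sketch of the normal-approximation interval above (the function name is ours):

from math import sqrt

def error_confidence_interval(errors, m, z=1.96):
    # Normal-approximation interval for the true error,
    # given `errors` misclassifications out of m test examples.
    e_hat = errors / m
    half_width = z * sqrt(e_hat * (1 - e_hat) / m)
    return e_hat - half_width, e_hat + half_width

print(error_confidence_interval(12, 40))  # approximately (0.16, 0.44)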


95% confidence intervals


Performance curves

[Figure: training error and test error curves with 95% confidence intervals]


Evaluating learned hypotheses

A lot of data / Few data


Various sets

[Diagram: the Data are split into a learning set and a test set; learn, then test → error]


Small data sets: a dilemma



Cross validation (k-fold)

• k-way split of the Data into k folds
• For each fold i (i = 1, ..., k): learn on the other k − 1 folds (yellow), test on fold i (pink) → error_i
• error = (1/k) Σ_i error_i
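A minimal k-fold sketch in Python; the decision tree and the synthetic data are arbitrary stand-ins, and scikit-learn's KFold handles the k-way split:

import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import KFold
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=200, random_state=0)  # toy data
errors = []
for learn_idx, test_idx in KFold(n_splits=8, shuffle=True, random_state=0).split(X):
    clf = DecisionTreeClassifier().fit(X[learn_idx], y[learn_idx])   # learn on "yellow"
    errors.append(np.mean(clf.predict(X[test_idx]) != y[test_idx]))  # test on "pink"
print(np.mean(errors))  # error = (1/k) * sum_i error_i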


The “leave-one-out” procedure

• Low bias
• High variance
• Tends to under-estimate the error if the data are not fully i.i.d. [Guyon & Elisseeff, JMLR, 2003]


The Bootstrap estimate

• Draw a bootstrap sample of the Data; learn on it (yellow), test on the remaining, out-of-bag, examples (pink) → error
• Repeat and compute the mean


Problem

• The calculation of the confidence interval assumes the independence of the estimates.
• But our estimates are not independent!

[Diagram: the mean of the risks on the k test samples, i.e. the mean risk over the whole data set, is used as the estimate of the true risk of the final hypothesis h]


Types of performance criteria

Confusion matrices and various performance criteria


Confusion matrix

                 Actual
                 +     -
Predicted  +    TP    FP
           -    FN    TN

(TP = true positives, FP = false positives, FN = false negatives, TN = true negatives)


Confusion matrix

14% of the butterflies are recognized as fishes


Types of performance criteria


Types of performance measures


Performance measures

                 Actual
                 +     -
Predicted  +    TP    FP
           -    FN    TN

• Sensitivity = TP / (TP + FN)
• Specificity = TN / (TN + FP)
• Recall    = TP / (TP + FN)
• Precision = TP / (TP + FP)


Performance measures

• FN-rate = FN / (TP + FN)
• FP-rate = FP / (FP + TN)
• F-measure = (2 × recall × precision) / (recall + precision) = 2 TP / (2 TP + FP + FN)
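All of these measures derive from the four confusion-matrix counts; a minimal sketch (the counts in the example call are hypothetical):

def classification_measures(tp, fp, fn, tn):
    # The usual measures, computed from the confusion-matrix counts.
    return {
        "sensitivity (recall)": tp / (tp + fn),
        "specificity": tn / (tn + fp),
        "precision": tp / (tp + fp),
        "FN-rate": fn / (tp + fn),
        "FP-rate": fp / (fp + tn),
        "F-measure": 2 * tp / (2 * tp + fp + fn),
    }

print(classification_measures(tp=30, fp=10, fn=5, tn=55))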


Performance measures


[Table: worked example with classes "good" and "bad"; e.g. Precision(good) = (# correctly predicted "good") / (# predicted "good")]

The ROC curve



Types of errors


The ROC curve

[Figure: distribution of the class probability for class '+' and class '-' as a function of the decision criterion (threshold)]

ROC = Receiver Operating Characteristic


The ROC curve

[Figure: a decision threshold splits each class distribution into true positives, false negatives, false positives, and true negatives; e.g. (50%, 50%) for one threshold vs. (90%, 10%) for another]



The ROC curve

[Figure: ROC curve plotting the proportion of true positives against the proportion of false positives; the chance line (relevance = 0.5) vs. a ROC curve with relevance = 0.90]

The ROC curve

[Figure: the same ROC curve with two decision thresholds marked: a "lenient" threshold (more true positives, but more false positives) and a "strict" threshold, each corresponding to one point on the curve]


The ROC curve

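A sketch of how such a curve is traced: sweep the decision threshold over the classifier's scores and record, at each threshold, the proportions of true and false positives. The scores and labels below are hypothetical:

import numpy as np

def roc_points(scores, labels):
    # One (FP-rate, TP-rate) point per decision threshold.
    pts = []
    for threshold in sorted(set(scores), reverse=True):
        predicted_pos = scores >= threshold
        tp_rate = np.sum(predicted_pos & (labels == 1)) / np.sum(labels == 1)
        fp_rate = np.sum(predicted_pos & (labels == 0)) / np.sum(labels == 0)
        pts.append((fp_rate, tp_rate))
    return pts

scores = np.array([0.9, 0.8, 0.7, 0.6, 0.55, 0.4, 0.3, 0.2])
labels = np.array([1, 1, 0, 1, 1, 0, 0, 0])
print(roc_points(scores, labels))  # lowering the threshold moves up the curve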


Comparison of learning algorithms

• Comparison on a single data set: [Dietterich, 1998] recommends using
  – 5 × 2 cross-validation with a paired t-test, or
  – the McNemar test on a validation set (see the sketch after this list)
• Comparison on multiple (different) data sets: [Demsar, 2006] recommends using
  – the Wilcoxon signed-ranks test
  – the Friedman test
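For instance, the McNemar test needs only the counts of validation examples on which the two classifiers disagree; a minimal sketch using the chi-square approximation described by [Dietterich, 1998] (the disagreement counts below are hypothetical):

from scipy.stats import chi2

def mcnemar(n01, n10):
    # n01: # examples misclassified by classifier A only; n10: by B only.
    stat = (abs(n01 - n10) - 1) ** 2 / (n01 + n10)  # continuity-corrected
    return stat, chi2.sf(stat, df=1)                # statistic, p-value

stat, p = mcnemar(n01=25, n10=10)
print(stat, p)  # reject "equal error rates" at the 5% level if p < 0.05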


Summary

• Pay attention to your cost function:
  – what matters for the performance measure?
• Finite amount of data:
  – compute the confidence intervals
• Scarce data:
  – pay attention to the split between training data and test data; use cross-validation
• Do not forget the validation set
• Evaluation is very important
  – Be critical
  – Convince yourself!


Specific problems

• The distribution of the classes is very unbalanced (e.g. 1% or 1‰ for one of the two classes)
• "Gray zone" (uncertain labels)
• Multi-valued functions


Other evaluation criteria

• Intelligibility of the learned decision function
  – e.g. SVMs or boosting score poorly here
• Performance in generalization
  – often not correlated with the previous criterion
• Various costs:
  – data preparation
  – computational cost
  – cost of the ML expertise
  – cost of the domain expertise


References

• Dietterich, T. G. (1998). Approximate statistical tests for comparing supervised classification learning algorithms. Neural Computation, 10(7), 1895–1924.
• Demšar, J. (2006). Statistical comparisons of classifiers over multiple data sets. Journal of Machine Learning Research, 7, 1–30.
• Japkowicz, N. & Shah, M. (2011). Evaluating Learning Algorithms: A Classification Perspective. Cambridge University Press. (An interesting book)


The Weka ML toolkit

• http://www.cs.waikato.ac.nz/ml/weka/
