Evaluating Machine Learning Models I
Cèsar Ferri Ramírez, Universitat Politècnica de València

L2. Evaluating Machine Learning Algorithms I



Outline

q  Machine Learning Tasks

q  Classification
v  Imbalanced problems, probabilistic classifiers, rankers

q  Regression

q  Unsupervised Learning

q  Lessons learned

Machine Learning Tasks

q  Supervised learning: the algorithm is presented with example inputs and their desired outputs, and the goal is to learn a general rule that maps inputs to outputs.
o  Classification: the output is categorical.
o  Regression: the output is numerical.

q  Unsupervised learning: no labels are given to the learning algorithm, leaving it on its own to find structure in its input.
o  Clustering, association rules…

q  Reinforcement learning: a computer program interacts with a dynamic environment in which it must achieve a certain goal, e.g., driving a vehicle or playing video games.

Evaluation of Classifiers

q  Classification: the problem of identifying to which of a set of categories a new observation belongs, on the basis of a training set of data containing observations whose category membership is known.
o  Spam filters
o  Face identification
o  Diagnosis of patients


q  Given a set S of n instances, we define classification error:

$error_S(h) = \frac{1}{n}\sum_{x \in S} \delta(f(x), h(x))$

where δ(a, b) = 0 if a = b, and 1 otherwise.

Predicted class (h(x))   Actual class (f(x))   Error
Buy                      Buy                   No
No Buy                   Buy                   Yes
Buy                      No Buy                Yes
Buy                      Buy                   No
No Buy                   No Buy                No
No Buy                   Buy                   Yes
No Buy                   No Buy                No
Buy                      Buy                   No
Buy                      Buy                   No
No Buy                   No Buy                No

Error = Mistakes / Total = 3/10 = 0.3
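As a minimal illustration (plain Python, not part of the original slides), the error above can be recomputed directly from the predictions:

# Classification error of the "Buy / No Buy" example above.
actual    = ["Buy", "Buy", "No Buy", "Buy", "No Buy",
             "Buy", "No Buy", "Buy", "Buy", "No Buy"]
predicted = ["Buy", "No Buy", "Buy", "Buy", "No Buy",
             "No Buy", "No Buy", "Buy", "Buy", "No Buy"]

# delta(f(x), h(x)) is 1 whenever the actual and predicted classes differ.
mistakes = sum(1 for f, h in zip(actual, predicted) if f != h)
error = mistakes / len(actual)
print(error)  # 0.3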

Split the data

q  Common solution:

o  Split between training and test data

[Figure: the data are split into a training set and a test set; the learning algorithms build models from the training data, the models are evaluated on the test data (e.g., with the error measure above), and the best model is selected.]

What if there is not much data available?

GOLDEN RULE: Never use the same example for training the model and evaluating it!!
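A minimal holdout-split sketch (Python, assuming scikit-learn is available; the data and the decision-tree learner are just placeholders, not part of the slides):

# Split the data, train only on the training part, evaluate only on the test part.
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

X = np.random.rand(100, 4)              # toy feature matrix
y = np.random.randint(0, 2, size=100)   # toy binary labels

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3,
                                                    random_state=0)
model = DecisionTreeClassifier().fit(X_train, y_train)
print(accuracy_score(y_test, model.predict(X_test)))   # estimated on unseen data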

Making the most of the data

q  Too much training data: poor evaluation

q  Too much test data: poor training

q  Can we have more training data and more test data without breaking the golden rule?
o  Repeat the experiment!
ü  Bootstrap: we draw n samples with replacement for training and test on the instances left out.
ü  Cross-validation: the data is split into n folds of equal size; each fold is used once as the test set while the remaining folds are used for training (see the sketch below).
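A minimal cross-validation sketch (Python, assuming scikit-learn; the data and learner are placeholders, not part of the slides): every instance is used for testing exactly once, and never for training and testing at the same time.

import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X = np.random.rand(100, 4)
y = np.random.randint(0, 2, size=100)

scores = cross_val_score(DecisionTreeClassifier(), X, y, cv=10)  # 10 folds
print(scores.mean())   # average accuracy over the 10 test folds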

Overfitting?

q  What dataset do we use to estimate all the previous metrics?
o  If we use all the data both to train the models and to evaluate them, we get over-optimistic estimates and over-fitted models:
ü  Over-fitting
o  If we try to compensate by generalising the model (e.g., pruning a tree), we may get the opposite effect:
ü  Under-fitting
o  How can we find a trade-off?

Confusion Matrix

q  Confusion (contingency) matrix:
o  We can observe how the errors are distributed.

                   Actual
               Buy     No Buy
Pred.  Buy       4        1
       No Buy    2        3


q  For two classes:

                       actual
                    +         -
  predicted  +     TP        FP
             -     FN        TN
                TP+FN     FP+TN

(TP = true positives, FP = false positives, FN = false negatives, TN = true negatives)

Accuracy = $\frac{TP + TN}{N}$
Error = 1 − Accuracy = $\frac{FP + FN}{N}$


TP Rate, Sensitivity = $\frac{TP}{TP + FN}$
TN Rate, Specificity = $\frac{TN}{FP + TN}$


TP Rate, Sensitivity, Recall = $\frac{TP}{TP + FN}$
Positive predictive value (PPV), Precision = $\frac{TP}{TP + FP}$
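A minimal sketch (plain Python, not part of the slides) deriving these two-class measures from the Buy / No Buy example, with Buy taken as the positive class:

actual    = [1, 1, 0, 1, 0, 1, 0, 1, 1, 0]   # 1 = Buy, 0 = No Buy
predicted = [1, 0, 1, 1, 0, 0, 0, 1, 1, 0]

TP = sum(1 for a, p in zip(actual, predicted) if a == 1 and p == 1)   # 4
TN = sum(1 for a, p in zip(actual, predicted) if a == 0 and p == 0)   # 3
FP = sum(1 for a, p in zip(actual, predicted) if a == 0 and p == 1)   # 1
FN = sum(1 for a, p in zip(actual, predicted) if a == 1 and p == 0)   # 2

accuracy    = (TP + TN) / len(actual)   # 0.7
sensitivity = TP / (TP + FN)            # recall, TP rate: 4/6
specificity = TN / (FP + TN)            # TN rate: 3/4
precision   = TP / (TP + FP)            # positive predictive value: 4/5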

Information Retrieval

q  Common measures in IR: Precision and Recall

q  They are defined in terms of a set of retrieved documents and a set of relevant documents.
o  Precision: fraction of the retrieved documents that are relevant to the query.
o  Recall: fraction of all relevant documents that are returned by the search.

q  Both measures are usually combined into one (their harmonic mean):

F-measure = $\frac{2 \cdot Precision \cdot Recall}{Precision + Recall} = \frac{2TP}{2TP + FP + FN}$
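A tiny self-contained sketch (plain Python, not part of the slides), using the precision and recall obtained for the Buy / No Buy example above:

precision, recall = 0.8, 2 / 3
f_measure = 2 * precision * recall / (precision + recall)   # harmonic mean
print(round(f_measure, 3))   # 0.727, i.e. 2*4 / (2*4 + 1 + 2)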

Multiclass problems

q  Confusion (contingency) tables can be multiclass

q  Measures based on 2-class matrices are computed:
o  1 vs all (average N partial measures)
o  1 vs 1 (average N*(N-1)/2 partial measures)
ü  Weighted average?

                     actual
               low   medium   high
pred.  low      20        0     13
       medium    5       15      4
       high      4        7     60

Imbalanced Datasets

q  In some cases we can find important differences in the proportions of the classes.
o  A naïve classifier that always predicts the majority class (ignoring the minority classes) obtains good performance.
ü  In a binary problem (+/-) with 1% of negative instances, the model “always positive” gets an accuracy of 99%.

q  Macro-accuracy: Average of accuracy per class

o  The naïve classifier gets a macro-accuracy=0.5

$macroacc(h) = \frac{1}{m}\left(\frac{hits_{class\,1}}{total_{class\,1}} + \frac{hits_{class\,2}}{total_{class\,2}} + \dots + \frac{hits_{class\,m}}{total_{class\,m}}\right)$
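A minimal sketch (plain Python, not part of the slides) of the “always positive” classifier on a 99%/1% imbalanced problem:

actual    = ["+"] * 99 + ["-"] * 1
predicted = ["+"] * 100                   # naive majority-class model

accuracy = sum(a == p for a, p in zip(actual, predicted)) / len(actual)

per_class_acc = []
for c in set(actual):                     # accuracy restricted to each class
    idx = [i for i, a in enumerate(actual) if a == c]
    per_class_acc.append(sum(actual[i] == predicted[i] for i in idx) / len(idx))
macro_accuracy = sum(per_class_acc) / len(per_class_acc)

print(accuracy)         # 0.99
print(macro_accuracy)   # 0.5  (1.0 for "+", 0.0 for "-")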

Soft classifiers

q  Crisp and soft classifiers:
o  A “hard” or “crisp” classifier predicts a class from a set of possible classes.
o  A “soft” or “scoring” classifier (probabilistic) also predicts a class, but accompanies each prediction with an estimate of its reliability (confidence).
ü  Most learning methods can be adapted to generate this confidence value.


q  A special kind of soft classifier is a class probability estimator.
o  Instead of predicting “a”, “b” or “c”, it gives a probability estimate for each of them, i.e., “pa”, “pb” and “pc”.

ü  Example:

v  Classifier 1: pa = 0.2, pb = 0.5 and pc = 0.3.

v  Classifier 2: pa = 0.3, pb = 0.4 and pc = 0.3.

o  Both predict “b”, but classifier 1 is more confident.

Evaluating probabilistic classifiers

q  Probabilistic classifiers: classifiers that are able to predict a probability distribution over the set of classes.
o  They provide classifications with a degree of certainty, which is useful for:
ü  Combining classifiers
ü  Cost-sensitive contexts

q  Mean Squared Error (Brier Score)

q  Log Loss

$MSE = \frac{1}{n}\sum_{i \in S}\sum_{j \in C} \left(f(i,j) - p(i,j)\right)^2$

$LogLoss = -\frac{1}{n}\sum_{i \in S}\sum_{j \in C} f(i,j) \cdot \log\, p(i,j)$

where f(i,j) = 1 if instance i is of class j and 0 otherwise, and p(i,j) is the probability the classifier assigns to instance i belonging to class j.
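A minimal sketch (plain Python, not part of the slides) computing both measures for the two classifiers of the earlier example, under the assumption that the true class is “b” (a single instance, so n = 1):

import math

true_class = "b"
classifiers = {"classifier 1": {"a": 0.2, "b": 0.5, "c": 0.3},
               "classifier 2": {"a": 0.3, "b": 0.4, "c": 0.3}}

for name, probs in classifiers.items():
    f = {c: 1.0 if c == true_class else 0.0 for c in probs}    # one-hot truth
    brier = sum((f[c] - probs[c]) ** 2 for c in probs)
    logloss = -sum(f[c] * math.log(probs[c]) for c in probs)
    print(name, round(brier, 2), round(logloss, 2))
# classifier 1: Brier 0.38, log loss 0.69; classifier 2: Brier 0.54, log loss 0.92
# The more confident (and correct) classifier 1 gets the better scores.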


q  MSE or Brier score can be decomposed into two factors: BS = CAL + REF
ü  Calibration (CAL): measures the quality of the classifier scores with respect to the class membership probabilities.
ü  Refinement (REF): an aggregation of resolution and uncertainty; it is related to the area under the ROC curve.

q  Calibration methods: they try to transform classifier scores into class membership probabilities.
ü  Platt scaling, isotonic regression, PAVcal…
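A minimal calibration sketch (Python, assuming scikit-learn; the data and base classifier are placeholders, and isotonic regression is used as one of the methods listed above):

import numpy as np
from sklearn.naive_bayes import GaussianNB
from sklearn.calibration import CalibratedClassifierCV

X = np.random.rand(200, 4)
y = np.random.randint(0, 2, size=200)

raw = GaussianNB().fit(X, y)                 # uncalibrated probability estimates
cal = CalibratedClassifierCV(GaussianNB(), method="isotonic", cv=3).fit(X, y)

print(raw.predict_proba(X[:3]))   # original probability estimates
print(cal.predict_proba(X[:3]))   # calibrated probability estimates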


q  Brier Curves for analysing classifier performance

[Figure: Brier curves of a classifier before calibration (non-calibrated Brier curve) and after PAV calibration (PAV-calibrated Brier curve).]

Soft classifiers

q  “Rankers”:
o  Whenever we have a probability estimator for a two-class problem:
ü  if pa = x, then pb = 1 − x.
o  Let's call one class 0 (neg) and the other class 1 (pos).
o  A ranker is a soft classifier that gives a value (score) monotonically related to the probability of class 1.

ü  Examples:

v  Probability of a customer buying a product.

v  Probability of a message being spam....

Evaluation of Rankers

q  We can rank instances according to their estimated probability.
o  CRM: you are interested in the top % of potential customers.

q  Measures for ranking:
o  AUC: Area Under the ROC Curve
o  Distances between the perfect ranking and the estimated ranking
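A minimal sketch (plain Python, with made-up scores and labels, not part of the slides): AUC computed as the probability that a randomly chosen positive instance is ranked above a randomly chosen negative one (ties count 1/2):

labels = [1, 1, 0, 1, 0, 0]                 # 1 = positive, 0 = negative
scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.2]     # ranker output

pos = [s for s, y in zip(scores, labels) if y == 1]
neg = [s for s, y in zip(scores, labels) if y == 0]
pairs = [(p, n) for p in pos for n in neg]
auc = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p, n in pairs) / len(pairs)
print(auc)   # 8 of 9 pairs correctly ordered -> AUC ≈ 0.889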

Regression

q  Regression: in this case the variable to be predicted is a continuous value.
o  Predicting the daily value of stocks in the NASDAQ
o  Forecasting the number of docks available in a Valenbisi station in the next hour
o  Predicting the number of beers sold next month by a retail company

Evaluation of Regressors

q  Given a set S of n instances:

o  Mean Absolute Error: $MAE_S(h) = \frac{1}{n}\sum_{x \in S} \left|f(x) - h(x)\right|$

o  Mean Squared Error: $MSE_S(h) = \frac{1}{n}\sum_{x \in S} \left(f(x) - h(x)\right)^2$

o  Root Mean Squared Error: $RMSE_S(h) = \sqrt{\frac{1}{n}\sum_{x \in S} \left(f(x) - h(x)\right)^2}$

ü  MSE is more sensitive to extreme values.
ü  RMSE and MAE are in the same order of magnitude as the actual values.


q  Example:

Predicted value (h(x))   Actual value (f(x))   Error   Error²
100 mill. €              102 mill. €               2        4
102 mill. €              110 mill. €               8       64
105 mill. €               95 mill. €              10      100
 95 mill. €               75 mill. €              20      400
101 mill. €              103 mill. €               2        4
105 mill. €              110 mill. €               5       25
105 mill. €               98 mill. €               7       49
 40 mill. €               32 mill. €               8       64
220 mill. €              215 mill. €               5       25
100 mill. €              103 mill. €               3        9

MSE = 744/10 = 74.4

MAE = 70/10 = 7

RMSE = sqrt(744/10) ≈ 8.63
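A minimal sketch (plain Python, not part of the slides) recomputing the three measures for the example above (values in millions of euros):

import math

predicted = [100, 102, 105, 95, 101, 105, 105, 40, 220, 100]
actual    = [102, 110,  95, 75, 103, 110,  98, 32, 215, 103]

n = len(actual)
mae  = sum(abs(f - h) for f, h in zip(actual, predicted)) / n
mse  = sum((f - h) ** 2 for f, h in zip(actual, predicted)) / n
rmse = math.sqrt(mse)
print(mae, mse, round(rmse, 2))   # 7.0 74.4 8.63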


q  Sometimes relative error values are more appropriate:
o  10% for an error of 50 when predicting 500

q  How much does the scheme improve on simply predicting the average?

o  Relative Mean Squared Error: $RSE_S(h) = \frac{\sum_{x \in S}\left(f(x) - h(x)\right)^2}{\sum_{x \in S}\left(f(x) - \bar{f}_S\right)^2}$

o  Relative Mean Absolute Error: $RAE_S(h) = \frac{\sum_{x \in S}\left|f(x) - h(x)\right|}{\sum_{x \in S}\left|f(x) - \bar{f}_S\right|}$

where $\bar{f}_S$ is the mean of the actual values f(x) over S.
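A minimal sketch (plain Python, not part of the slides): relative errors of the regression example above, compared against always predicting the mean of the actual values:

predicted = [100, 102, 105, 95, 101, 105, 105, 40, 220, 100]
actual    = [102, 110,  95, 75, 103, 110,  98, 32, 215, 103]

mean_f = sum(actual) / len(actual)
rse = (sum((f - h) ** 2 for f, h in zip(actual, predicted)) /
       sum((f - mean_f) ** 2 for f in actual))
rae = (sum(abs(f - h) for f, h in zip(actual, predicted)) /
       sum(abs(f - mean_f) for f in actual))
print(round(rse, 3), round(rae, 3))   # roughly 0.040 and 0.287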


q  A related measure is R² (coefficient of determination):
o  A number that indicates how well the data fit a statistical model.

ü  Sum of squares of residuals: $SSres_S(h) = \sum_{x \in S} \left(f(x) - h(x)\right)^2$

ü  Total sum of squares: $SStot_S(h) = \sum_{x \in S} \left(f(x) - \bar{f}_S\right)^2$

ü  Coefficient of determination: $R^2_S(h) = 1 - \frac{SSres_S(h)}{SStot_S(h)}$
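A minimal sketch (plain Python, not part of the slides): R² on the same regression example, which by this definition equals 1 − RSE:

predicted = [100, 102, 105, 95, 101, 105, 105, 40, 220, 100]
actual    = [102, 110,  95, 75, 103, 110,  98, 32, 215, 103]

mean_f = sum(actual) / len(actual)
ss_res = sum((f - h) ** 2 for f, h in zip(actual, predicted))
ss_tot = sum((f - mean_f) ** 2 for f in actual)
r2 = 1 - ss_res / ss_tot
print(round(r2, 2))   # roughly 0.96: the model explains most of the variance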

Unsupervised Learning

q  Association Rules: task of discovering interesting relations between variables in databases.

q  Common metrics:
o  Support: estimates the popularity of a rule
o  Confidence: estimates the reliability of a rule

q  Rules are ordered according to measures that combine both values.

q  No train/test partition is needed.
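A minimal sketch (plain Python, not part of the slides), with a made-up rule {bread} → {butter} over toy transactions:

transactions = [{"bread", "butter"}, {"bread"}, {"bread", "butter", "milk"},
                {"milk"}, {"bread", "milk"}]

antecedent, consequent = {"bread"}, {"butter"}
n_ante = sum(1 for t in transactions if antecedent <= t)
n_both = sum(1 for t in transactions if (antecedent | consequent) <= t)

support = n_both / len(transactions)   # popularity of the rule: 2/5 = 0.4
confidence = n_both / n_ante           # reliability of the rule: 2/4 = 0.5
print(support, confidence)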


q  Clustering: the task of grouping a set of objects in such a way that objects in the same group (cluster) are more similar to each other than to those in other clusters.
o  A task that is difficult to evaluate.

q  Some evaluation measures are based on distance:
o  Distance between the borders of clusters
o  Distance between the centres (centroids) of clusters
o  Radius and density of clusters

Lessons learned

q  Model evaluation is a fundamental phase in the knowledge discovery process.

q  In classification, depending on the aspect and context we want to analyse, we need to use the appropriate metric.

q  There are several (and sometimes equivalent) measures for regression models.

q  Supervised models are easier to evaluate since we have an estimate of the ground truth.

To know more

q  Witten, Ian H., and Eibe Frank. Data Mining: Practical machine learning tools and techniques. Morgan Kaufmann, 2005.

q  Flach, P.A. (2012) “Machine Learning: The Art and Science of Algorithms that Make Sense of Data”, Cambridge University Press.

q  Hand, D.J. (1997) “Construction and Assessment of Classification Rules”, Wiley.

q  César Ferri, José Hernández-Orallo, R. Modroiu (2009): An experimental comparison of performance measures for classification. Pattern Recognition Letters 30(1): 27-38

q  Nathalie Japkowicz, Mohak Shah (2011) “Evaluating Learning Algorithms: A Classification Perspective”, Cambridge University Press.

q  Antonio Bella, Cèsar Ferri Ramirez, José Hernández-Orallo, M. José Ramírez-Quintana: On the effect of calibration in classifier combination. Appl. Intell. 38(4): 566-585 (2013)

q  José Hernández-Orallo, Peter A. Flach, Cèsar Ferri Ramirez: Brier Curves: a New Cost-Based Visualisation of Classifier Performance. ICML 2011: 585-592
