

Non-Traditional Metrics

Evaluation measures from the medical diagnostic community
Constructing new evaluation measures that combine metric and statistical information


Part I
Borrowing new performance evaluation measures from the medical diagnostic community

(Marina Sokolova, Nathalie Japkowicz and Stan Szpakowicz)


The need to borrow new performance measures: an example

It has come to our attention that the performance measures commonly used in Machine Learning are not very good at assessing performance on problems in which the two classes are equally important.

• Accuracy considers both classes, but it does not distinguish between them.
• Other measures, such as Precision/Recall, F-Score and ROC Analysis, focus on only one class, without concerning themselves with performance on the other class.


Learning problems in which the classes are equally important

Examples of recent Machine Learning domains that require equal focus on both classes and a distinction between false positive and false negative rates are:
• opinion/sentiment identification
• classification of negotiations

An example of a traditional problem that requires equal focus on both classes and a distinction between false positive and false negative rates is:
• Medical Diagnostic Tests

What measures have researchers in the Medical Diagnostic Test Community used that we can borrow?


Performance measures in use in the Medical Diagnostic Community

Common performance measures in use in the Medical Diagnostic Community are:
• Sensitivity/Specificity (also in use in Machine Learning)
• Likelihood ratios
• Youden’s Index
• Discriminant Power

[Biggerstaff, 2000; Blakeley & Oddone, 1995]


Sensitivity/Specificity

The sensitivity of a diagnostic test is:
• P[+|D], i.e., the probability of obtaining a positive test result in the diseased population.

The specificity of a diagnostic test is:
• P[-|Ď], i.e., the probability of obtaining a negative test result in the disease-free population.

Sensitivity and specificity are not that useful on their own, however, since what one is really interested in, both in the medical testing community and in Machine Learning, is P[D|+] (PVP: the Predictive Value of a Positive) and P[Ď|-] (PVN: the Predictive Value of a Negative). We can apply Bayes’ Theorem to derive the PVP and PVN.
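As a minimal sketch (ours, not the deck’s), both quantities can be read directly off a confusion matrix; the counts a, b, c, d follow the matrix notation used in Part II, and the example counts are hypothetical:

```python
def sensitivity(a, b):
    """P[+|D]: fraction of diseased (positive) cases that test positive."""
    return a / (a + b)

def specificity(c, d):
    """P[-|Ď]: fraction of disease-free (negative) cases that test negative."""
    return d / (c + d)

# Hypothetical counts: 80 true pos., 20 false neg., 30 false pos., 870 true neg.
print(sensitivity(80, 20))   # 0.8
print(specificity(30, 870))  # ~0.966
```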


Deriving the PVPs and PVNs

The problem with deriving the PVP and PVN of a test is that, in order to derive them, we need to know P[D], the pre-test probability of the disease. This cannot be obtained directly.

As usual, however, we can set ourselves in the context of the comparison of two tests (with P[D] being the same in both cases).

Doing so, and using Bayes’ Theorem:
P[D|+] = (P[+|D] P[D]) / (P[+|D] P[D] + P[+|Ď] P[Ď])

we can get the following relationships (see Biggerstaff, 2000):
• P[D|+Y] > P[D|+X] ↔ ρ+Y > ρ+X
• P[Ď|-Y] > P[Ď|-X] ↔ ρ-Y < ρ-X

where X and Y are two diagnostic tests, and +X and -X stand for confirming the presence and the absence of the disease, respectively (and similarly for +Y and -Y). ρ+ and ρ- are the likelihood ratios defined on the next slide.
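To see why the PVP comparison reduces to a likelihood-ratio comparison, divide the numerator and denominator of Bayes’ Theorem by P[+|Ď] (a step we spell out for clarity; it is implicit in Biggerstaff, 2000):

```latex
\[
P[D \mid +]
  = \frac{P[+ \mid D]\, P[D]}{P[+ \mid D]\, P[D] + P[+ \mid \bar{D}]\, P[\bar{D}]}
  = \frac{\rho_{+}\, P[D]}{\rho_{+}\, P[D] + (1 - P[D])}
\]
```

For a fixed P[D], this expression is monotone increasing in ρ+, so P[D|+Y] > P[D|+X] holds exactly when ρ+Y > ρ+X.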


Likelihood Ratios

ρ+ and ρ- are actually easy to derive.

The likelihood ratio of a positive test is:
• ρ+ = P[+|D] / P[+|Ď], i.e., the ratio of the true positive rate to the false positive rate.

The likelihood ratio of a negative test is:
• ρ- = P[-|D] / P[-|Ď], i.e., the ratio of the false negative rate to the true negative rate.

Note: we want to maximize ρ+ and minimize ρ-.

This means that, even though we cannot calculate the PVP and PVN directly, we can get the information we need to compare two tests through the likelihood ratios.
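As a sketch (our code, not the deck’s), both ratios follow directly from sensitivity and specificity; the test values below are the SVM and Naïve Bayes entries from the comparison table a few slides ahead:

```python
def pos_likelihood_ratio(sens, spec):
    """rho+ = P[+|D] / P[+|Ď] = sensitivity / (1 - specificity)."""
    return sens / (1.0 - spec)

def neg_likelihood_ratio(sens, spec):
    """rho- = P[-|D] / P[-|Ď] = (1 - sensitivity) / specificity."""
    return (1.0 - sens) / spec

print(pos_likelihood_ratio(0.868, 0.654))  # ~2.51 (SVM)
print(neg_likelihood_ratio(0.868, 0.654))  # ~0.20 (SVM)
print(pos_likelihood_ratio(0.775, 0.759))  # ~3.22 (Naive Bayes)
print(neg_likelihood_ratio(0.775, 0.759))  # ~0.30 (Naive Bayes)
```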


Youden’s Index and Discriminant Power

Youden’s Index measures the avoidance of failure of an algorithm, while Discriminant Power evaluates how well an algorithm distinguishes between positive and negative examples.

Youden’s Index:
γ = sensitivity – (1 – specificity) = P[+|D] – (1 – P[-|Ď])

Discriminant Power:
DP = (√3/π)(log X + log Y),
where X = sensitivity/(1 – sensitivity) and Y = specificity/(1 – specificity)
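A small sketch of both formulas (ours, not the deck’s); with the natural logarithm, it reproduces the Youden and DP values reported for SVM and Naïve Bayes in the table on the next slide:

```python
import math

def youden_index(sens, spec):
    """gamma = sensitivity - (1 - specificity)."""
    return sens - (1.0 - spec)

def discriminant_power(sens, spec):
    """DP = sqrt(3)/pi * (log X + log Y), with X = sens/(1 - sens) and
    Y = spec/(1 - spec); the natural log is assumed here, since it
    matches the values reported in the table."""
    x = sens / (1.0 - sens)
    y = spec / (1.0 - spec)
    return math.sqrt(3.0) / math.pi * (math.log(x) + math.log(y))

print(youden_index(0.868, 0.654))        # ~0.522 (SVM)
print(discriminant_power(0.868, 0.654))  # ~1.39  (SVM)
print(youden_index(0.775, 0.759))        # ~0.534 (Naive Bayes)
print(discriminant_power(0.775, 0.759))  # ~1.31  (Naive Bayes)
```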


Comparison of the various measures on the outcome of e-negotiation

Measure            | SVM  | N. Bayes
Accuracy           | 77.4 | 76.8
F-Score            | 81.2 | 78.9
Sensitivity        | 86.8 | 77.5
Specificity        | 65.4 | 75.9
AUC                | 76.1 | 76.7
Youden             | .522 | .534
Pos. Likelihood    | 2.51 | 3.22
Neg. Likelihood    | .2   | .3
Discriminant Power | 1.39 | 1.31

Note: a DP value below 3 is considered insignificant; both classifiers fall in that range.


What does this all mean?
Traditional ML Measures

Classifier | Overall effectiveness (Accuracy) | Predictive power (Precision) | Effectiveness on a class, a-posteriori (sensitivity/specificity)
SVM        | Superior                         | Superior on pos examples     | Superior on pos examples
NB         | Inferior                         | Superior on neg examples     | Superior on neg examples


What does this all mean?
New measures that are more appropriate for problems where both classes are equally important

Classifier | Avoidance of failure (Youden) | Effectiveness on a class, a-priori (Likelihood Ratios) | Discrimination of classes (Discriminant Power)
SVM        | Inferior                      | Superior on neg examples                               | Limited
NB         | Superior                      | Superior on pos examples                               | Limited


Part I: Discussion

The variety of results obtained with the different measures suggests two conclusions:

1. It is very important for practitioners of Machine Learning to understand their domain deeply, to understand what it is, exactly, that they want to evaluate, and to reach that goal using appropriate measures (existing or new ones).

2. Since some of the results are very close to each other, it is important to establish reliable confidence tests to find out whether or not these results are significant.


Part II
Constructing new evaluation measures

(William Elamzeh, Nathalie Japkowicz and Stan Matwin)


Motivation for our new evaluation method

ROC Analysis alone and its associated AUC measure do not assess the performance of classifiers adequately, since they omit any information regarding the confidence of these estimates.

Though the identification of the significant portion of ROC curves is an important step towards generating a more useful assessment, this analysis remains biased in favour of the large class in the case of severe imbalances.

We would like to combine the information provided by the ROC curve with information regarding how balanced the classifier is with regard to the misclassification of positive and negative examples.


ROC’s bias in the case of severe class imbalances

The ROC curve for the positive class plots the true positive rate a/(a+b) against the false positive rate c/(c+d).

When the number of positive examples is significantly lower than the number of negative examples, i.e., a+b << c+d, then as we change the class probability threshold, a/(a+b) climbs faster than c/(c+d).

ROC thus gives the majority class (-) an unfair advantage. Ideally, a classifier should classify both classes proportionally.

Confusion matrix:

        | Pred + | Pred - | Total
Class + | a      | b      | a+b
Class - | c      | d      | c+d
Total   | a+c    | b+d    | n
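A hypothetical numeric illustration of this bias (our example, not the deck’s): with 20 positives and 980 negatives, each additional true positive moves the curve up almost fifty times faster than each additional false positive moves it right.

```python
# Hypothetical severely imbalanced test set: 20 positives, 980 negatives.
pos_total, neg_total = 20, 980  # a+b and c+d in the matrix above

# Lowering the threshold enough to gain one more true positive raises the
# TPR by 1/(a+b); gaining one more false positive raises the FPR by only
# 1/(c+d), so the curve climbs far faster than it moves right.
print(1 / pos_total)  # 0.05   (5 percentage points of TPR per positive)
print(1 / neg_total)  # ~0.001 (0.1 percentage points of FPR per negative)
```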


Correcting for ROC’s bias in the case of severe class imbalances

Though we keep ROC as a performance evaluation measure, since rate information is useful, we propose to favour classifiers that make a similar number of errors in both classes, for confidence estimation.

More specifically, as in the Tango test, we favour classifiers that have a lower difference in classification errors between the two classes, (b-c)/n.

This quantity (b-c)/n is interesting not just for confidence estimation, but also as an evaluation measure in its own right.

(Confusion matrix as on the previous slide.)


Proposed Evaluation Method for Severely Imbalanced Data Sets

Our method consists of five steps:
1. Generate a ROC curve R for a classifier K applied to data D.
2. Apply Tango’s confidence test in order to identify the confident segments of R.
3. Compute CAUC, the area under the confident ROC segment.
4. Compute AveD, the average normalized difference (b-c)/n over all points in the confident ROC segment.
5. Plot CAUC against AveD.

An effective classifier shows low AveD and high CAUC; a sketch of steps 3-5 follows.
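A minimal sketch of steps 3-5, assuming the confident ROC points have already been selected by Tango’s test (not implemented here); the point format, the helper name, and the use of the absolute difference |b-c|/n are our assumptions:

```python
def cauc_and_aved(confident_points):
    """confident_points: list of (fpr, tpr, b, c, n) tuples, one per
    threshold kept by the confidence test, where b and c are the
    off-diagonal confusion-matrix counts at that threshold."""
    pts = sorted(confident_points)  # order by false positive rate
    # Step 3: trapezoid rule over the confident segment of the ROC curve.
    cauc = sum((x1 - x0) * (y0 + y1) / 2.0
               for (x0, y0, *_), (x1, y1, *_) in zip(pts, pts[1:]))
    # Step 4: average normalized difference; the absolute value is assumed
    # here so that errors on the two classes do not cancel out.
    aved = sum(abs(b - c) / n for _, _, b, c, n in pts) / len(pts)
    return cauc, aved
```

Step 5 then plots each classifier as a point (AveD, CAUC); effective classifiers land in the top left (high CAUC, low AveD), which is how the results two slides ahead are read.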


Experiments and Expected Results

We considered 6 imbalanced domains from UCI. The most imbalanced one contained only 1.4% of its examples in the small class, while the least imbalanced one had as many as 26%.

We ran 4 classifiers: Decision Stumps, Decision Trees, Random Forests and Naïve Bayes.

We expected the following results:
• weak performance from the Decision Stumps
• stronger performance from the Decision Trees
• even stronger performance from the Random Forests
• reasonably good performance from Naïve Bayes, though with no idea of how it would compare to the tree family of learners (Stumps, Trees and Forests)


Results using our new method: our expectations are met

• Decision Stumps perform the worst, followed by Decision Trees and then Random Forests (in most cases).

• Surprise 1: Decision Trees outperform Random Forests on the two most balanced data sets.

• Surprise 2: Naïve Bayes consistently outperforms Random Forests.

Note: classifiers in the top left corner of the CAUC-vs-AveD plot outperform those in the bottom right corner.


AUC Results

Our more informed results contradict the AUC results, which claim that:
• Decision Stumps are sometimes as good as or superior to Decision Trees (!)
• Random Forests outperform all other systems in all but one case.


Part II: Discussion

In order to better understand the performance of classifiers on various domains, it can be useful to consider several aspects of this evaluation simultaneously.

In order to do so, it might be useful to create specific measures adapted to the purpose of the evaluation.

In our case, above, our evaluation measure allowed us to study the tradeoff between the classification difference and the area under the confident segment of the ROC curve, thus producing more reliable results.