Model Accuracy Measures. Eduardo Eyras, Computational Genomics, Pompeu Fabra University - ICREA, Barcelona, Spain. Master in Bioinformatics UPF 2017-2018



Page 1:

Model Accuracy Measures

Eduardo Eyras Computational Genomics

Pompeu Fabra University - ICREA Barcelona, Spain

Master in Bioinformatics UPF 2017-2018

Page 2:

Variables: what we can measure (attributes)
Hypotheses: what we want to predict (class values/labels)
Examples: training set (labeled data)

[Diagram: training set → model training → model → predict on new cases]

Page 3:

Variables: what we can measure (attributes)
Hypotheses: what we want to predict (class values/labels)
Examples: training set (labeled data)

[Diagram: training set → model training → model → predict on new cases]

Prediction: does this example "belong" to this model?
Classification: what is the most probable label?

Page 4:

Is my method good enough? (for the specific problem) How does my method compare to other methods?

Testing the accuracy of a model

Page 5:

Testing the accuracy of a model

We need a systematic way to evaluate and compare multiple methods. Methods are heterogeneous in their purposes, e.g.:

1)  Ability to classify instances accurately

2)  Predicting/scoring the class labels

3)  Methods may predict numerical or nominal values (score, class label, yes/no, posterior probability, etc.)

Thus we need a methodology that is applicable to all of them

Page 6:

Accuracy: the expected performance of the model on future (new) data

Training and Testing

It is wrong to estimate the accuracy on the same dataset used to build (train) the model. This estimation would be overly optimistic (overfitting): the model won't necessarily adapt well to new, different instances.

Page 7:

Separate known cases into a training set and a test set

Labeled cases

Cases for training

model

Cases for testing

Training step

Evaluation step

Training and Testing

On the cases for testing we predict and compare the predictions with the known labels. How to do the splitting? A common choice is 2/3 for training and 1/3 for testing. This approach is suitable when the entire dataset is large.

Page 8:

Training and Testing

How to select the data for training and testing:

1)  Stratification: the size of each of the "prediction classes" should be similar in each subset, training and testing (balanced subsets).

2)  Homogeneity: the data sets should have similar properties to give a reliable test, e.g. GC content, peptide lengths, species represented.

These conditions ensure that the different properties and prediction classes are well represented (e.g. would you test a model of human transmembrane domains with yeast proteins? think of GC content).

Provided that the sets are balanced and homogeneous, the accuracy on the test set will be a good estimate of future performance.
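As an illustration of such a stratified split, here is a minimal sketch assuming scikit-learn is available; the feature matrix X and labels y are toy placeholders, not data from the lecture:

```python
# A stratified 2/3 training / 1/3 testing split with scikit-learn.
# X and y are hypothetical toy data, not data from the lecture.
import numpy as np
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.normal(size=(300, 5))            # 300 instances, 5 attributes
y = np.array([0] * 200 + [1] * 100)      # imbalanced class labels (2:1)

# stratify=y keeps the class proportions similar in both subsets
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=1/3, stratify=y, random_state=0)

print(np.bincount(y_train), np.bincount(y_test))   # roughly 2:1 in both subsets
```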

Page 9:

Training set

Test set

Accuracy1

Data set

Build a predictive model

Training and Testing N-fold cross validation

1/N parts of the data

(N-1)/N parts of the data

…where “accuracy” is used generically: any measure of prediction performance

Page 10:

Training set

Test set

Accuracy1 Accuracy2

Data set

Build a predictive model

Training and Testing N-fold cross validation

…where “accuracy” is used generically: any measure of prediction performance

Page 11:

Training set

Test set

Average accuracy

Accuracy1 Accuracy2 Accuracy3

Data set

Accuracyn

Training and Testing N-fold cross validation

The average accuracy reflects the performance of the model on the entire dataset. Important: the subsets must be representative of the original data (stratification and homogeneity). The standard is to do 10-fold cross-validation.
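A minimal sketch of stratified 10-fold cross-validation, assuming scikit-learn; the classifier and the toy data are illustrative placeholders:

```python
# Stratified 10-fold cross-validation: train on 9/10 of the data, test on the
# remaining 1/10, and average the accuracies over the folds.
import numpy as np
from sklearn.model_selection import StratifiedKFold
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(1)
X = rng.normal(size=(200, 4))
y = (X[:, 0] + rng.normal(scale=0.5, size=200) > 0).astype(int)

accuracies = []
skf = StratifiedKFold(n_splits=10, shuffle=True, random_state=1)
for train_idx, test_idx in skf.split(X, y):
    model = LogisticRegression().fit(X[train_idx], y[train_idx])   # training step
    accuracies.append(model.score(X[test_idx], y[test_idx]))       # evaluation step
print(f"average accuracy over 10 folds: {np.mean(accuracies):.3f}")
```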

Page 12:

Leave-one-out Training and Testing

It is like n-fold cross-validation, but where n is the size of the set (number of instances), that is: "train on all but 1, test on that one".

Advantages:
1)  The greatest possible amount of data is used for training (n − 1 instances)
2)  It is deterministic: no random sampling of subsets is involved

Disadvantages:
1)  Computationally more expensive
2)  It cannot be stratified

E.g. imagine you have the same number of examples for 2 classes. A random classifier predicting the majority class is expected to have an error rate of 50%, but with leave-one-out the majority class of the training set is always the opposite of the held-out example's class, which produces a 100% error rate.
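A small sketch of this leave-one-out caveat, assuming scikit-learn; it shows that a majority-class ("null") predictor gets a 100% leave-one-out error on a perfectly balanced two-class set:

```python
# Leave-one-out with a balanced two-class set: the training majority class is
# always the opposite of the held-out label, so a "predict the majority class"
# rule is wrong every time (100% error), even though its expected error is 50%.
import numpy as np
from sklearn.model_selection import LeaveOneOut
from sklearn.dummy import DummyClassifier

X = np.arange(20).reshape(-1, 1)    # 20 instances (the features do not matter here)
y = np.array([0, 1] * 10)           # perfectly balanced classes

errors = 0
for train_idx, test_idx in LeaveOneOut().split(X):
    clf = DummyClassifier(strategy="most_frequent").fit(X[train_idx], y[train_idx])
    errors += int(clf.predict(X[test_idx])[0] != y[test_idx][0])
print(f"leave-one-out error rate: {errors / len(y):.0%}")   # prints 100%
```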

Page 13:

Accuracy measures

Page 14:

Accuracy measure

Example: the model of transmembrane helices. We have two models: (1) the loop model M_loop, given by the observed amino-acid (AA) frequencies in loops, p; (2) the helix model M_helix, given by the observed AA frequencies in helices, q.

Given a peptide s=x1…xN we can predict whether it is part of a helix or a loop using the log-likelihood test (assuming uniform priors and positional independence)

S = log [ L(s|M_helix) / L(s|M_loop) ] = log [ ∏_{i=1}^{N} q_{x_i} / ∏_{i=1}^{N} p_{x_i} ]

As a default, we can classify with the rule:
  • if S > 0 then s is part of a helix
  • if S ≤ 0 then s is a loop
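A minimal sketch of this scoring rule in Python; the amino-acid frequency tables p and q below are made-up numbers for illustration, not the trained models from the lecture:

```python
# Log-likelihood classification of a peptide, assuming positional independence
# and uniform priors. The frequencies p (loops) and q (helices) are hypothetical.
import math

p = {"L": 0.05, "K": 0.10, "I": 0.04, "F": 0.03}   # M_loop: AA frequencies in loops
q = {"L": 0.12, "K": 0.02, "I": 0.10, "F": 0.08}   # M_helix: AA frequencies in helices

def score(peptide):
    """S = log( prod_i q_{x_i} / prod_i p_{x_i} ) = sum_i log(q_{x_i} / p_{x_i})."""
    return sum(math.log(q[x] / p[x]) for x in peptide)

s = "LLIF"
S = score(s)
print(f"S = {S:.2f} ->", "helix" if S > 0 else "loop")   # S > 0: helix, S <= 0: loop
```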

Page 15:

Helix Loop

Accuracy measure

A test set: a set of “labelled” (annotated) proteins that we do not use for training

S = log [ L(s|M_helix) / L(s|M_loop) ] = log [ ∏_{i=1}^{N} q_{x_i} / ∏_{i=1}^{N} p_{x_i} ]

Training set

Example: The model of transmembrane helices

Page 16:

Real False

Accuracy measure

Page 17:

Real False

Accuracy measure

Our model divides the test set according to our predictions of Real and False:

The red area contains the predictions (helix) made by our model

Our predictions

Page 18:

TP (True positives): elements predicted as real that are real

Real False

TP

Accuracy measure

Page 19:

TP (True positives): elements predicted as real that are real TN (True Negatives): elements predicted as false that are false

Real False

TP

TN

Accuracy measure

Page 20:

TP (True positives): elements predicted as real that are real TN (True Negatives): elements predicted as false that are false FP (False Positives): elements predicted as real that are false

Real False

TP FP

TN

Accuracy measure

Page 21:

TP (True positives): elements predicted as real that are real TN (True Negatives): elements predicted as false that are false FP (False Positives): elements predicted as real that are false FN (False Negatives): elements predicted as false that are real

Real False

TP FP

FN TN

Accuracy measure

Page 22:

Accuracy measure

True Positive Rate (Sensitivity): proportion of true elements that is correctly predicted (a.k.a. hit rate, recall)

Sn = TPR = TP / (TP + FN)

False Positive Rate (FPR): proportion of negative cases that are mislabelled (a.k.a. fall-out)

FPR = FP / (FP + TN)

Specificity: proportion of the negatives that are correctly predicted

Sp = 1 − FPR = TN / (FP + TN)

Sn and Sp take values between 0 and 1. A perfect classification would have Sn = 1 and Sp = 1.

Real False

TP FP

FN TN

Page 23:

Accuracy measure

Positive Predictive Value (PPV), sometimes called Precision: the fraction of our predictions that are correct

PPV = TP / (TP + FP)

False Discovery Rate (FDR): the fraction of our predictions that are wrong

FDR = FP / (FP + TP)

PPV → 1 means most of our predictions are correct; FDR → 0 means that very few of our predictions are wrong.

Real False

TP FP

FN TN
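The measures on the last two slides can be computed directly from the four counts; a small sketch (the counts in the example call are arbitrary toy values):

```python
# Basic confusion-matrix measures as a function of the four counts.
def confusion_metrics(tp, fp, fn, tn):
    return {
        "Sn (TPR)": tp / (tp + fn),   # sensitivity / recall / hit rate
        "FPR":      fp / (fp + tn),   # fall-out
        "Sp":       tn / (fp + tn),   # specificity = 1 - FPR
        "PPV":      tp / (tp + fp),   # precision
        "FDR":      fp / (fp + tp),   # false discovery rate = 1 - PPV
    }

print(confusion_metrics(tp=40, fp=10, fn=5, tn=45))   # arbitrary toy counts
```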

Page 24:

Accuracy measure. The issue of True Negatives

Sometimes we cannot find a True Negative set (e.g. think of genomic features such as genes or regulatory regions: it is very hard to find real negative cases for some biological features).

We can still use the TPR, PPV and FDR:

TPR = TP / (TP + FN)        PPV = TP / (TP + FP)        FDR = FP / (FP + TP)

[Figure: the Real elements and our predictions split into TP, FP and FN only; no TN set.]

Page 25:

Accuracy measure

Overall success rate: the number of correct classifications divided by the total number of classifications (sometimes called "accuracy"):

Overall Success Rate = (TP + TN) / (TP + TN + FN + FP)

The error rate is 1 minus the overall success rate:

Error Rate = 1 − (TP + TN) / (TP + TN + FN + FP)

A value of 1 for the Success rate means that the model identifies all the positive and negative cases correctly

Page 26:

Correlation coefficient (a.k.a. Matthews Correlation Coefficient (MCC))

This measure scores correct predictions positively and incorrect ones negatively, and takes values between -1 and 1. The more correct the method, the closer CC is to 1; a very bad method will have a CC closer to -1, and a random predictor gives CC ≈ 0.

CC = ( TP·TN − FP·FN ) / sqrt( (TP + FN)(TN + FP)(TP + FP)(TN + FN) )

Accuracy measure
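A short sketch computing the overall success rate, the error rate and the correlation coefficient (MCC) from the same four counts (arbitrary toy values in the example call):

```python
# Overall success rate, error rate and Matthews correlation coefficient.
import math

def success_error_mcc(tp, fp, fn, tn):
    total = tp + tn + fp + fn
    success = (tp + tn) / total
    denom = math.sqrt((tp + fn) * (tn + fp) * (tp + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return success, 1 - success, mcc

print(success_error_mcc(tp=40, fp=10, fn=5, tn=45))
```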

Page 27:

Accuracy measure

This can also be represented by a confusion matrix for a 2-class prediction:

…shed in the same order, except in unusual circumstances such as estrus. In a modern dairy operation it's important to know when a cow is ready: animals are fertilized by artificial insemination and missing a cycle will delay calving unnecessarily, causing complications down the line. In early experiments, machine learning methods stubbornly predicted that each cow was never in estrus. Like humans, cows have a menstrual cycle of approximately 30 days, so this "null" rule is correct about 97% of the time, an impressive degree of accuracy in any agricultural domain! What was wanted, of course, were rules that predicted the "in estrus" situation more accurately than the "not in estrus" one: the costs of the two kinds of error were different. Evaluation by classification accuracy tacitly assumes equal error costs.

Other examples in which errors cost different amounts include loan decisions: the cost of lending to a defaulter is far greater than the lost-business cost of refusing a loan to a nondefaulter. And oil-slick detection: the cost of failing to detect an environment-threatening real slick is far greater than the cost of a false alarm. And load forecasting: the cost of gearing up electricity generators for a storm that doesn't hit is far less than the cost of being caught completely unprepared. And diagnosis: the cost of misidentifying problems with a machine that turns out to be free of faults is less than the cost of overlooking problems with one that is about to fail. And promotional mailing: the cost of sending junk mail to a household that doesn't respond is far less than the lost-business cost of not sending it to a household that would have responded. Why, these are all the examples of Chapter 1! In truth, you'd be hard pressed to find an application in which the costs of different kinds of error were the same.

In the two-class case with classes yes and no, lend or not lend, mark a suspicious patch as an oil slick or not, and so on, a single prediction has the four different possible outcomes shown in Table 5.3. The true positives (TP) and true negatives (TN) are correct classifications. A false positive (FP) occurs when the outcome is incorrectly predicted as yes (or positive) when it is actually no (negative). A false negative (FN) occurs when the outcome is incorrectly predicted as negative when it is actually positive.

Table 5.3 Different outcomes of a two-class prediction.

                      Predicted class
                      yes                no
  Actual   yes        true positive      false negative
  class    no         false positive     true negative

Yes No

TP FP

FN TN

Page 28:

Accuracy measure For multiclass predictions:

The true positive rate is TP divided by the total number of positives, which is TP + FN; the false positive rate is FP divided by the total number of negatives, FP + TN. The overall success rate is the number of correct classifications divided by the total number of classifications, (TP + TN) / (TP + TN + FP + FN). Finally, the error rate is one minus this.

In a multiclass prediction, the result on a test set is often displayed as a two-dimensional confusion matrix with a row and column for each class. Each matrix element shows the number of test examples for which the actual class is the row and the predicted class is the column. Good results correspond to large numbers down the main diagonal and small, ideally zero, off-diagonal elements. Table 5.4(a) shows a numeric example with three classes. In this case the test set has 200 instances (the sum of the nine numbers in the matrix), and 88 + 40 + 12 = 140 of them are predicted correctly, so the success rate is 70%.

But is this a fair measure of overall success? How many agreements would you expect by chance? This predictor predicts a total of 120 a's, 60 b's, and 20 c's; what if you had a random predictor that predicted the same total numbers of the three classes? The answer is shown in Table 5.4(b). Its first row divides the 100 a's in the test set into these overall proportions, and the second and third rows do the same thing for the other two classes. Of course, the row and column totals for this matrix are the same as before: the number of instances hasn't changed, and we have ensured that the random predictor predicts the same number of a's, b's, and c's as the actual predictor.

This random predictor gets 60 + 18 + 4 = 82 instances correct. A measure called the Kappa statistic takes this expected figure into account by deducting it from the predictor's successes and expressing the result as a proportion of the total for a perfect predictor, to yield 140 − 82 = 58 extra successes out of a possible 200 − 82 = 118.

Table 5.4 Different outcomes of a three-class prediction: (a) actual and (b) expected.

  (a)            Predicted class              (b)            Predicted class
                 a     b     c    Total                      a     b     c    Total
  Actual   a     88    10    2    100         Actual   a     60    30    10   100
  class    b     14    40    6    60          class    b     36    18    6    60
           c     18    10    12   40                   c     24    12    4    40
  Total          120   60    20               Total          120   60    20

Good results correspond to large numbers on the diagonal and small numbers off the diagonal. In the example we have 200 instances (100 + 60 + 40) and 140 of them are predicted correctly, thus the success rate is 70%. Question: is this a good measure? How many agreements do we expect by chance?

Page 29:

Accuracy measure For multiclass predictions:


We build the matrix of expected values (Table 5.4(b)) by using the same totals as before and sharing out the total of each class. Totals in each actual (Real) class: a = 100, b = 60, c = 40.

Page 30:

Accuracy measure For multiclass predictions:


We build the matrix of expected values (Table 5.4(b)) by using the same totals as before and sharing out the total of each class. Totals in each actual (Real) class: a = 100, b = 60, c = 40. We split each of them into the three groups using the proportions of the predicted classes: a = 120, b = 60, c = 20, i.e. a = 60%, b = 30%, c = 10%.


Page 32:

Accuracy measure For multiclass predictions:


To estimate the relative agreement between observed and expected values we can use the kappa statistic:

κ = (P(A) − P(E)) / (1 − P(E)) = (n(A) − n(E)) / (N − n(E)) = (140 − 82) / (200 − 82) ≈ 0.49

where P(A) is the probability of agreement and P(E) is the probability of agreement by chance. The maximum possible value is κ = 1, and for a random predictor κ = 0.
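A sketch reproducing this computation for the Table 5.4 example, so that n(A) = 140, n(E) = 82 and κ ≈ 0.49 can be checked:

```python
# Kappa statistic for the 3-class example of Table 5.4: the expected matrix is
# built from the actual class totals and the predicted class proportions.
import numpy as np

observed = np.array([[88, 10, 2],     # rows: actual class a, b, c
                     [14, 40, 6],     # columns: predicted class a, b, c
                     [18, 10, 12]])

N = observed.sum()                              # 200 instances
row_totals = observed.sum(axis=1)               # actual totals: 100, 60, 40
col_fractions = observed.sum(axis=0) / N        # predicted proportions: 0.6, 0.3, 0.1
expected = np.outer(row_totals, col_fractions)  # chance-agreement matrix (Table 5.4b)

n_A = np.trace(observed)                        # 140 correct predictions
n_E = np.trace(expected)                        # 82 expected correct by chance
kappa = (n_A - n_E) / (N - n_E)
print(int(n_A), n_E, round(kappa, 2))           # 140 82.0 0.49
```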

Page 33:

Accuracy measure

What is a good accuracy? Every measure shows a different perspective on the performance of the model; in general we will use two or more complementary measures to evaluate a model. E.g. a method that finds almost all elements will have an Sn close to 1, but this can be achieved by a method with very low Sp; conversely, a method with Sp close to 1 may have very low Sn. In general, one would like a method that balances Sn and Sp (or equivalent measures).

Page 34:

Accuracy measure

What is a good accuracy? Which accuracy measure we want to maximize often depends on the question. Do you want to find all the true cases? (You want higher sensitivity.) Or do you want to find only correct cases? (You want higher specificity.) Question: would "predicting novel genes" require high Sp, or perhaps high Sn?

Page 35:

Choosing a prediction threshold

Page 36:

Accuracy measure

Although we have a single model, in fact we have a family of predictions, defined by one or more parameters, e.g. the threshold λ in the log-likelihood test:

S = log [ L(s|M_helix) / L(s|M_loop) ] > λ

[Figure: the Real and False elements of the test set, split at several possible cut-off values λ.]

Page 37:

Accuracy measure

S = log [ L(s|M_helix) / L(s|M_loop) ] > λ

[Figure: for each choice of the cut-off λ, the Real and False elements split into TP, FP, TN and FN.]

Page 38:

Receiver Operating Characteristic (ROC) curve

[Figure: each cut-off λ yields one (FPR, TPR) point.]

A ROC curve is a graphical plot of TPR (Sn) vs. FPR built for the same prediction model by varying one or more of the model parameters. It is quite common for binary classifiers. For instance, it can be plotted for several values of the discrimination threshold, but other parameters of the model can be used.

Page 39:

Receiver Operating Characteristic (ROC) curve

[Figure: distributions of the scores in negative and positive cases, with a sliding threshold criterion. Threshold A gives a low TPR and a low FPR, threshold C a high TPR and a high FPR, and B is intermediate. For threshold B, the area beyond the threshold contains our positive predictions; the negatives on the other side are True Negatives and the positives on the other side are False Negatives.]

TPR = TP / (TP + FN)        FPR = FP / (FP + TN)

Page 40:

Receiver Operating Characteristic (ROC) curve

[Figure: ROC plot of TPR vs. FPR, both from 0 to 1, obtained by sliding the threshold criterion over the distributions of the scores in negative and positive cases. The diagonal corresponds to random classification; the model classification curve lies above it.]

TPR = TP / (TP + FN)        FPR = FP / (FP + TN)

Page 41:

Receiver Operating Characteristic (ROC) curve

TPR = TP / (TP + FN)        FPR = FP / (FP + TN)

[Figure: ROC curve of the model classification vs. the random-classification diagonal.]

Each dot on the line corresponds to a choice of parameters (usually a single parameter). The information that is not visible in this graph is the threshold used at each point of the graph. The x = y line corresponds to random classification, i.e. choosing positive or negative at every threshold with 50% chance.

Page 42:

Receiver Operating Characteristic (ROC) curve

Example: Consider the ranking of scores:

S: 10, 7, 4, 2, 1, -0.4, -2, -5, -9

S = log [ L(s|M_helix) / L(s|M_loop) ]

Page 43:

Receiver Operating Characteristic (ROC) curve

Example: Consider the ranking of scores:

S = log [ L(s|M_helix) / L(s|M_loop) ]

The test set is labeled:

  S      Known label
  10     R
  7      R
  4      R
  2      F
  1      R
  -0.4   R
  -2     F
  -5     F
  -9     F

Page 44:

Receiver Operating Characteristic (ROC) curve

Example: Consider the ranking of scores:


Let’s choose a cut-off (a λ):

λ = 3 is the cut-off for prediction, i.e. above this value we predict "R"

Page 45:

Receiver Operating Characteristic (ROC) curve

Example: Consider the ranking of scores:


Calculate TP, FP,… for this λ

  λ     TP   FP   TN   FN   TPR   FPR
  3     3    0    4    2    3/5   0

TPR = TP / (TP + FN)        FPR = FP / (FP + TN)

Page 46:

Receiver Operating Characteristic (ROC) curve

Example: Consider the ranking of scores:


Repeat for other λ’s

Note: I’m using arbitrary intermediate values for cut-off

  λ     TP   FP   TN   FN   TPR   FPR
  3     3    0    4    2    3/5   0
  0     4    1    3    1    4/5   1/4

Page 47:

Receiver Operating Characteristic (ROC) curve

Example: Consider the ranking of scores:


Repeat for other λ’s

  λ     TP   FP   TN   FN   TPR   FPR
  3     3    0    4    2    3/5   0
  0     4    1    3    1    4/5   1/4
  -7    5    3    1    0    1     3/4

Note: I’m using arbitrary intermediate values for cut-off

Page 48:

Receiver Operating Characteristic (ROC) curve

Example: Consider the ranking of scores: S = log [ L(s|M_helix) / L(s|M_loop) ]

Exercise: complete the table

  λ     TP   FP   TN   FN   TPR   FPR
  3     3    0    4    2    3/5   0
  0     4    1    3    1    4/5   1/4
  -7    5    3    1    0    1     3/4

You should see that for smaller cut-offs the TPR (sensitivity) increases, but the FPR increases as well (i.e. the specificity drops), whereas for high cut-offs the TPR decreases but the FPR is low (the specificity is high).

The variability of the accuracy as a function of the parameters and/or cut-offs is generally described with a ROC curve.
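A small sketch that reproduces the TPR/FPR table above from the ranked scores and their known labels, sweeping the same arbitrary cut-offs:

```python
# Rebuild the TPR/FPR table from the ranked scores and their known labels.
scores = [10, 7, 4, 2, 1, -0.4, -2, -5, -9]
labels = ["R", "R", "R", "F", "R", "R", "F", "F", "F"]

def roc_point(cutoff):
    tp = sum(s > cutoff and l == "R" for s, l in zip(scores, labels))
    fp = sum(s > cutoff and l == "F" for s, l in zip(scores, labels))
    fn = sum(s <= cutoff and l == "R" for s, l in zip(scores, labels))
    tn = sum(s <= cutoff and l == "F" for s, l in zip(scores, labels))
    return tp, fp, tn, fn, tp / (tp + fn), fp / (fp + tn)

for lam in (3, 0, -7):
    print(lam, roc_point(lam))   # matches the rows of the table above
```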

Page 49:

Receiver Operating Characteristic (ROC) curve

Comparing multiple methods: each line corresponds to a different method. Better models are further from the x = y line (random classification).

Method 1 Method 2 Method 3

ROC curves

Random classification

(see e.g. Corvelo et al. PLOS Comp. Biology 2010)

Page 50:

Receiver Operating Characteristic (ROC) curve

ROC curves

Random classification

Example: If you wish to discover at least 60% of the true elements (TPR = 0.6), the graph says that Method 1 has a lower FPR than Methods 2 and 3, so we may want to choose Method 1. We would then decide to make predictions with Method 1 and choose parameters that produce FPR = 0.2 at TPR = 0.6. But is this the best choice?

Method 1 Method 2 Method 3

Page 51:

Receiver Operating Characteristic (ROC) curve

Method 1 Method 2 Method 3

ROC curves. Note that the more distant the points are from the diagonal (the line TPR = FPR), the better the classification. An optimal choice for a dot on the curve is the one at maximum distance from the TPR = FPR line; there are standard methods to calculate this point (see the sketch below). But again: this is optimal for the balance of TPR and FPR, and it might not be the most appropriate choice for the problem at hand, e.g. predicting novel genes.

Optimal configuration
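One standard criterion is to take the point that maximizes TPR − FPR (Youden's J statistic), which is equivalent to maximizing the distance from the diagonal; a small sketch using the sampled points from the earlier example (this particular criterion is an assumption, the slide does not name one):

```python
# Pick the cut-off that maximizes Youden's J = TPR - FPR, one common choice
# for the ROC point farthest from the diagonal.
points = [(3, 3/5, 0.0), (0, 4/5, 1/4), (-7, 1.0, 3/4)]   # (lambda, TPR, FPR)
best = max(points, key=lambda t: t[1] - t[2])
print(f"best cut-off: {best[0]}  (J = {best[1] - best[2]:.2f})")
```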

Page 52:

Method 1 Method 2 Method 3

ROC curves

Receiver Operating Characteristic (ROC) curve

A summary measure for the best model is the Area Under the Curve (AUC). The best model will in general have the highest AUC; the maximum value is AUC = 1, and the closer the AUC is to one, the better the model. There are also standard methods to estimate the AUC from the sampled points of the ROC curve.

Models
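One common way to estimate the AUC from sampled ROC points is the trapezoidal rule; a sketch using the (FPR, TPR) points computed in the earlier example plus the trivial endpoints (an illustration, not necessarily the method behind the barplot on the slide):

```python
# Trapezoidal estimate of the AUC from sampled (FPR, TPR) points.
import numpy as np

fpr = np.array([0.0, 0.0, 0.25, 0.75, 1.0])
tpr = np.array([0.0, 0.6, 0.80, 1.00, 1.0])
print(round(np.trapz(tpr, fpr), 3))   # area under the sampled ROC curve (0.875)
```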

Page 53:

Method 1 Method 2 Method 3

ROC curves

Receiver Operating Characteristic (ROC) curve

Question: Why do you think there are error bars in the AUC barplot and in the ROC curves?

Models

Page 54:

Precision recall curves

ROC curves are useful to compare predictive models. However, they still do not provide a complete picture of the accuracy of a model. If we predict many TPs at the cost of producing many false predictions (FP is large), the FPR might not look so bad if our testing set contains many Negatives, such that TN >> FP. So we may have a situation where the TPR is high and the FPR is low, but where for the actual counts FP >> TP. That is, the TPR is not affected by FP, and the FPR can be low even if FP is high (as long as TN >> FP).

FPR = FP / (FP + TN) → 0 as TN becomes large

Page 55:

Precision recall curves. For instance, consider a method to classify documents. Let's suppose that the first method selects 100 documents, of which 40 are correct, and imagine that our test set is composed of 100 True instances and 10,000 Negative instances.

TPR_1 = TP / (TP + FN) = 40 / 100 = 0.4        FPR_1 = FP / (FP + TN) = 60 / 10000 = 0.006

Page 56:


Now consider a second method that selects 680 documents with 80 correct, and imagine that our test set is now composed of 100 True instances and 100,000 Negative instances.

TPR_2 = TP / (TP + FN) = 80 / 100 = 0.8        FPR_2 = FP / (FP + TN) = 600 / 100000 = 0.006

Which method is better?

Page 57:

Precision recall curves

The second method may seem better, because it retrieves more relevant documents, but the proportion of its predictions that are correct (precision, or PPV) is smaller:

Precision_1 = 40 / 100 = 0.40        Precision_2 = 80 / 680 ≈ 0.12

Thus, one must also take into account the "relative cost" of the predictions, i.e. the FN and FP values that must be assumed to achieve a high TPR.

PPV = TP / (TP + FP)   (Note: you can also use FDR = 1 − PPV)

One can make TN arbitrarily large to make FPR → 0, so other accuracy measures are needed to get a more complete picture.
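A sketch reproducing the two document-classification examples, showing that the FPR barely distinguishes the methods while the precision (PPV) does:

```python
# The two document-classification methods: the FPR is essentially identical
# because TN is huge, but the precision (PPV) separates them clearly.
def rates(tp, fp, fn, tn):
    return {"TPR": tp / (tp + fn), "FPR": fp / (fp + tn), "PPV": tp / (tp + fp)}

# Method 1: selects 100 documents, 40 correct; 100 true and 10,000 negative instances
print(rates(tp=40, fp=60, fn=60, tn=9940))
# Method 2: selects 680 documents, 80 correct; 100 true and 100,000 negative instances
print(rates(tp=80, fp=600, fn=20, tn=99400))
```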

Page 58:

Precision recall curves

precision = PPV = TP / (TP + FP)        recall = TPR = TP / (TP + FN)

Precision = proportion of the predictions that are correct

Recall = proportion of the true instances that are correctly recovered

(see e.g. Plass et al. RNA 2012)


Page 60:

Precision recall curves

Model 1: has a greater AUC, but low precision (a high cost of false positives).

Model 2: achieves a lower AUC than Model 1, but still quite good; the precision is highly improved.

Page 61:

References

Data Mining: Practical Machine Learning Tools and Techniques. Ian H. Witten, Eibe Frank, Mark A. Hall. Morgan Kaufmann. ISBN 978-0-12-374856-0. http://www.cs.waikato.ac.nz/ml/weka/book.html

Methods for Computational Gene Prediction. W. H. Majoros. Cambridge University Press, 2007.