Chapter 10
Performance Metrics
Introduction
• Sometimes, measuring how well a system is performing is relatively straightforward: we calculate a “percent correct”
• Other performance metrics are not often discussed in the literature
• Issues such as selection of data are reviewed
• Choices and uses of specific performance metrics are discussed
Issues to be Addressed
1. Selection of “gold standards”
2. Specification of training sets (sizes, # iterations, etc.)
3. Selection of test sets
4. Role of decision threshold levels
Computational Intelligence Performance Metrics

Percent correct
Average sum-squared error
Normalized error
Evolutionary algorithm effectiveness measures
Mann-Whitney U Test
Receiver operating characteristic curves
Recall, precision, sensitivity, specificity, etc.
Confusion matrices, cost matrices
Chi-square test
Selecting the “Gold Standard”
Issues:
1. Selecting the classification
   * do experts agree?
   * involve end users
2. Selecting representative pattern sets
   * agreed to by experts
   * distributed over classes appropriately (some near decision hypersurfaces)
3. Selecting person or process to designate gold standard
   * involve end users
   * the customer is always right
Specifying Training Sets
• Use different sets of patterns for training and testing
• Rotate patterns through training and testing, if possible
• Use “leave-n-out” method
• Select pattern distribution appropriate for paradigm
  > equal number in each class for back-propagation
  > according to probability distribution for SOFM
Percent Correct
• Most commonly used metric

• Can be misleading:
  > Predicting 90% of the stocks that will exceed the Dow average and 60% of those that won't, where .5 of the patterns fall in each category, results in 75% correct
  > Predicting 85% of those that exceed and 55% of those that won't, where .7 are in the first category and .3 in the second, results in 76% correct
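The two misleading cases above can be checked with a short sketch (function name is illustrative): overall percent correct is just each class's accuracy weighted by that class's fraction of the patterns.

```python
def percent_correct(class_accuracies, class_fractions):
    # Weighted percent correct: sum of (per-class accuracy x class fraction)
    return sum(a * f for a, f in zip(class_accuracies, class_fractions))

percent_correct([0.90, 0.60], [0.5, 0.5])  # 0.75 -- first case above
percent_correct([0.85, 0.55], [0.7, 0.3])  # 0.76 -- second case above
```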
Calculating Average Sum-Squared Error
Total error: E_TOTAL = 0.5 Σ_{k,j} (b_{kj} − z_{kj})²

where b_{kj} is the target value and z_{kj} the network output for output PE j on pattern k

Average error: E_AVG = E_TOTAL / (no. of patterns)

Note: Inclusion of .5 factor not universal; number of output PEs not always taken into account

Dividing by no. of output PEs desirable to get results that can be compared
This metric is used with other CI paradigms
State your method when publishing results
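A minimal sketch of these calculations (function names are illustrative); `half` toggles the optional 0.5 factor and `per_pe` the optional division by the number of output PEs, since neither convention is universal.

```python
def total_sse(targets, outputs, half=True):
    # targets/outputs: one list of output-PE values per pattern
    e = sum((b - z) ** 2
            for bs, zs in zip(targets, outputs)   # patterns k
            for b, z in zip(bs, zs))              # output PEs j
    return 0.5 * e if half else e

def avg_sse(targets, outputs, half=True, per_pe=False):
    # Average over patterns; optionally also over output PEs for comparability
    e = total_sse(targets, outputs, half) / len(targets)
    return e / len(targets[0]) if per_pe else e
```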
Selection of Threshold Values
•Value is often set to 0.5, usually without scientific basis
•Another value may give better performance
Example: 10 patterns, threshold = .5, 5 should be on and 5 off (1 output PE)

If on is always .6 and off always .4: avg. SSE = .16, 100% correct
If on is always .9, and off is .1 for 3 patterns and .6 for 2: avg. SSE = .08, but only 80% correct

  .05   (.9 for 5 on patterns)
  .03   (.1 for 3 off patterns)
  .72   (.6 for 2 off patterns)
  .80 / 10 = .08 avg. SSE

Perhaps calculate SSE only for errors, and use threshold as desired value
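The example's numbers (no 0.5 factor, as on the slide) can be reproduced directly; the helper names are illustrative.

```python
targets = [1] * 5 + [0] * 5                  # 5 on, 5 off, 1 output PE
case_a = [0.6] * 5 + [0.4] * 5               # on always .6, off always .4
case_b = [0.9] * 5 + [0.1] * 3 + [0.6] * 2   # on .9; off .1 for 3, .6 for 2

def avg_sse(b, y):
    # Average sum-squared error without the 0.5 factor
    return sum((bi - yi) ** 2 for bi, yi in zip(b, y)) / len(b)

def pct_correct(b, y, threshold=0.5):
    # A pattern is correct when the thresholded output matches the target
    return sum((yi >= threshold) == bool(bi) for bi, yi in zip(b, y)) / len(b)

avg_sse(targets, case_a)   # 0.16, yet pct_correct gives 1.0
avg_sse(targets, case_b)   # 0.08, yet pct_correct gives only 0.8
```

The lower-SSE configuration is the one that classifies worse, which is why the threshold and the error metric must be considered together.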
Absolute Error
• More intuitive than sum-squared error
• Mean absolute error:

E_MA = (1/(m·q)) Σ_{k=1..m} Σ_{j=1..q} |b_{kj} − y_{kj}|

where m is the number of patterns and q the number of output PEs

• Above formulation is for a neural net; the metric is also useful for other paradigms using optimization, such as fuzzy cognitive maps

• Max. Abs. Error = MAX_{k,j} |b_{kj} − y_{kj}|
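Both absolute-error metrics can be sketched as follows (function names are illustrative):

```python
def mean_abs_error(targets, outputs):
    # E_MA: mean of |b - y| over all m patterns and q output PEs
    m, q = len(targets), len(targets[0])
    return sum(abs(b - y) for bs, ys in zip(targets, outputs)
                          for b, y in zip(bs, ys)) / (m * q)

def max_abs_error(targets, outputs):
    # Largest single deviation over all patterns and output PEs
    return max(abs(b - y) for bs, ys in zip(targets, outputs)
                          for b, y in zip(bs, ys))
```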
Removing Target Variance Effects
• Variance: avg. of squared deviations from the mean

• Standard SSE is corrupted by target variance:

σ_j² = (1/m) Σ_{k=1..m} (b_{kj} − b̄_j)²

where b̄_j is the mean target value for output PE j
• Pineda developed normalized error E_NORM using E_MEAN, which is constant for a given pattern set
Calculating Normalized Error

First, calculate total error (previous slide) and mean error E_MEAN:

E_MEAN = 0.5 Σ_{k,j} (b_{kj} − b̄_j)²

Now, normalized error E_NORM = E_TOTAL / E_MEAN

Watch out for PEs that don't change value (mean error = 0); perhaps add .000001 to the mean error to be safe
This metric reflects output variance due to error rather than error due to neural network architecture.
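The calculation can be sketched as below (function name illustrative); `eps` is the small safety constant suggested above for PEs whose targets never change.

```python
def normalized_error(targets, outputs, eps=1e-6):
    m, q = len(targets), len(targets[0])
    # Per-output-PE mean target value (b-bar_j)
    means = [sum(p[j] for p in targets) / m for j in range(q)]
    e_total = 0.5 * sum((b - z) ** 2 for bs, zs in zip(targets, outputs)
                                     for b, z in zip(bs, zs))
    e_mean = 0.5 * sum((p[j] - means[j]) ** 2
                       for p in targets for j in range(q))
    return e_total / (e_mean + eps)   # eps guards against e_mean == 0
```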
Evolutionary Algorithm Effectiveness Metrics (DeJong 1975)

Offline performance - measure of convergence:

p_offline(s, G) = (1/G) Σ_{g=1..G} f*_s(g)

Online performance:

p_online(s, G) = (1/G) Σ_{g=1..G} f_avg,s(g)

where G is the latest generation, f*_s(g) is the best fitness for system s in generation g, and f_avg,s(g) is the average fitness in generation g.
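A minimal sketch under the slides' formulation, which averages a per-generation statistic over generations 1..G (DeJong's original online performance averages over all function evaluations; function names are illustrative):

```python
def offline_performance(best_per_gen):
    # p_offline: average of the best fitness f*_s(g) in each generation
    return sum(best_per_gen) / len(best_per_gen)

def online_performance(avg_per_gen):
    # p_online: average of the per-generation average fitness f_avg,s(g)
    return sum(avg_per_gen) / len(avg_per_gen)
```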
Mann-Whitney U Test
• Also known as Mann-Whitney-Wilcoxon test or Wilcoxon rank sum test
• Used for analyzing performance of evolutionary algorithms
• Evaluates similarity of medians of two data samples
• Uses ordinal data (continuous or at least rank-ordered: can tell which of two values is greater)
Two Samples: n1 and n2
• Sample sizes need not be the same
• Can often get significant results (.05 level or better) with fewer than 10 samples in each group
• We focus on calculating U for relatively small sample sizes
Analyze Best Fitness of Runs
• Assume minimization problem, best value = 0.0
• Two configurations, A and B (baseline), that differ in some setting (maybe different mutation rates)
• We obtain the following best fitness values from 5 runs of A and 4 runs of B:
A: .079, .062, .073, .047, .085 (n1 = 5)
B: .102, .069, .055, .049 (n2 = 4)
• There are two ways to calculate U
  – Quick and direct
  – Formula (PC statistical packages)
Quick and Direct Method

Arrange measurements in fitness order:
.047 .049 .055 .062 .069 .073 .079 .085 .102
A B B A B A A A B
Count number of As that are better than each B:
U1 = 1 +1 + 2 + 5 = 9
Count number of Bs that are better than each A:
U2 = 0 + 2 + 3 + 3 + 3 = 11
Now, U = min[U1, U2] = 9

Note: U2 = n1·n2 − U1
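The direct count can be sketched as follows (function name illustrative): for every (A, B) pair, score 1 if the A value is better (smaller, for this minimization problem) and 0.5 for a tie.

```python
def mann_whitney_u(a, b):
    # U1 = number of (A, B) pairs where A beats B (ties count 0.5)
    u1 = sum(1.0 if x < y else (0.5 if x == y else 0.0)
             for x in a for y in b)
    u2 = len(a) * len(b) - u1        # U2 = n1*n2 - U1
    return min(u1, u2)

a = [.079, .062, .073, .047, .085]   # configuration A (n1 = 5)
b = [.102, .069, .055, .049]         # configuration B (n2 = 4)
mann_whitney_u(a, b)                 # 9.0, matching the count above
```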
Formula Method
Calculate R, the sum of ranks, for each n.
R1 = 1 + 4 + 6 + 7 + 8 = 26
R2 = 2 + 3 + 5 + 9 = 19
Now, U is the smaller of:

U1 = n1·n2 + n1(n1 + 1)/2 − R1 = 20 + 15 − 26 = 9

U2 = n1·n2 + n2(n2 + 1)/2 − R2 = 20 + 10 − 19 = 11
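The formula method can be sketched as below (function name illustrative; assumes no tied values, as in the example data):

```python
def u_from_rank_sums(a, b):
    # Rank the pooled measurements, sum ranks per sample, then apply
    # U_i = n1*n2 + n_i*(n_i + 1)/2 - R_i and take the smaller U.
    pooled = sorted(a + b)
    rank = {v: i + 1 for i, v in enumerate(pooled)}
    n1, n2 = len(a), len(b)
    r1 = sum(rank[v] for v in a)   # 26 for the data above
    r2 = sum(rank[v] for v in b)   # 19
    u1 = n1 * n2 + n1 * (n1 + 1) / 2 - r1
    u2 = n1 * n2 + n2 * (n2 + 1) / 2 - r2
    return min(u1, u2)
```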
Is Null Hypothesis Rejected?
If U is less than or equal to the value in the table, the null hypothesis is rejected at the .05 level. 9 > 2, so it is NOT rejected. Thus, we cannot say one configuration results in significantly higher fitness than the other.
Now Test Configuration C
We obtain the following, ignoring specific fitness values since rank is what matters:
C C C C B B C B B
No. of Bs better than each C is 0 + 0 + 0 + 0 + 2 = 2 = U
Now, null hypothesis rejected at .05 level
C is statistically better than B
Note: This test can be used for other systems using a variety of fitness measures, such as percent correct.
Receiver Operating Characteristic Curves
* Originated in 1940s in communications systems and psychology
* Now being used in diagnostic systems and expert systems
* ROC curves are not sensitive to the probability distribution of training or test set patterns, or to decision bias
* Good for one class at a time (one output PE)
* Most common metric is the area under the curve
Contingency Matrix

                          System Diagnosis
                          Positive           Negative
Gold        Positive      TP (true pos.)     FN (false neg.)
Standard
Diagnosis   Negative      FP (false pos.)    TN (true neg.)

Recall is TP/(TP + FN)
Precision is TP/(TP + FP)
Contingency Matrix
* Reflects the four possibilities from one PE or output class

* ROC curve makes use of two ratios of these numbers
  True pos. ratio = TP/(TP + FN) = sensitivity
  False pos. ratio = FP/(FP + TN) = 1 − specificity

* Major diagonal of the curve represents the situation where no discrimination exists

Note: Specificity = TN/(FP + TN)
Plotting the ROC Curve
•Plot for various values of thresholds, or,
•Plot for various output values of the PE
•Probably need about 10 values to get resolution
•Calculate area under curve using trapezoidal rule
Note: Calculate each value in contingency matrix for each threshold or output value.
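The plotting procedure can be sketched as follows (function names illustrative): compute one (false pos. ratio, true pos. ratio) point per threshold from the contingency counts, then apply the trapezoidal rule.

```python
def roc_point(scores, labels, threshold):
    # Contingency counts for one threshold; labels are 1 (positive) or 0
    tp = sum(1 for s, l in zip(scores, labels) if s >= threshold and l)
    fp = sum(1 for s, l in zip(scores, labels) if s >= threshold and not l)
    fn = sum(1 for s, l in zip(scores, labels) if s < threshold and l)
    tn = sum(1 for s, l in zip(scores, labels) if s < threshold and not l)
    return fp / (fp + tn), tp / (tp + fn)   # (false pos. ratio, true pos. ratio)

def auc_trapezoid(points):
    # Trapezoidal rule over the ROC points, anchored at (0,0) and (1,1)
    pts = sorted(set(points) | {(0.0, 0.0), (1.0, 1.0)})
    return sum((x2 - x1) * (y1 + y2) / 2
               for (x1, y1), (x2, y2) in zip(pts, pts[1:]))
```

A perfect system yields a point at (0, 1) for some threshold and an area of 1.0; chance-level performance gives points on the major diagonal and an area of 0.5.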
ROC Curve Example
ROC Curve Interpretation
Along the dotted line, no discrimination exists. The system can achieve this performance solely by chance.
A perfect system has a true positive ratio of one, and a false positive ratio of zero, for some threshold.
ROC Cautions
* Two ROC curves with the same area can intersect - one is better on false positives, the other on false negatives

* Use a sufficient number of cases

* Might want to investigate behavior near other output PE values
Recall and Precision
Recall: The number of positive diagnoses correctly made by the system divided by the total number of positive diagnoses made by the gold standard (true positive ratio)
Precision: The number of positive diagnoses correctly made by the system divided by the total number of positive diagnoses made by the system
Sensitivity and Specificity
Sensitivity = TP/(TP + FN) - Likelihood event is detected given that it is present

Specificity = TN/(TN + FP) - Likelihood absence of event is detected given that it is absent

Pos. predictive value = TP/(TP + FP) - Likelihood that detection of event is associated with event

False alarm rate = FP/(FP + TN) - Likelihood of false signal detection given absence of event
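The four ratios can be computed from the contingency counts in one place (function and key names are illustrative):

```python
def diagnostic_ratios(tp, fn, fp, tn):
    return {
        "sensitivity": tp / (tp + fn),            # event detected when present
        "specificity": tn / (tn + fp),            # absence detected when absent
        "pos_predictive_value": tp / (tp + fp),   # detection really is the event
        "false_alarm_rate": fp / (fp + tn),       # false detection when absent
    }
```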
Criteria for Correctness
•If outputs are mutually exclusive, “winning” PE is PE with largest activation value
•If outputs are not mutually exclusive, then a fixed threshold criterion (e.g. 0.5) can be used.
Confusion Matrices
•Useful when system has multiple output classes
•For n classes, n by n matrix constructed
•Rows reflect the “gold standard”
•Columns reflect system classifications
•Entry represents a count (a frequency of occurrence)
•Diagonal values represent correct instances
•Off-diagonal values represent row misclassified as column
Using Confusion Matrices
• Initially work row-by-row to calculate “class confusion” values by dividing each entry by the total count in its row (each row sums to 1)
• Now have “class confusion” matrix
• Calculate “average percent correct” by summing diagonal values and dividing by number of classes n. (This isn’t true percent correct unless all classes have same prior probability.)
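These two steps can be sketched as below (function names illustrative):

```python
def class_confusion(counts):
    # counts[i][j]: patterns of gold-standard class i classified as class j;
    # dividing each entry by its row total makes each row sum to 1
    return [[c / sum(row) for c in row] for row in counts]

def average_percent_correct(conf):
    # Mean of diagonal entries; equals true percent correct only when all
    # classes have the same prior probability
    return sum(conf[i][i] for i in range(len(conf))) / len(conf)
```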
To Calculate Cost Matrix

• Must know prior probabilities of each class
• Multiply each element by prior probability for class (row)
• Now each value in matrix is probability of occurrence; all sum to one. This is the confusion matrix.
• Multiply each element in the matrix by its cost (diagonal costs often are zero, but not always). This is the cost matrix.

• Cost ratios can be used; subjective measures cannot

• Use cost matrix to fine-tune system (threshold values, membership functions, etc.)
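The steps above can be sketched as follows (function names illustrative): scale each class-confusion row by its prior to get joint probabilities, then weight each cell by its cost; the sum of the cost matrix is the expected cost per application of the system.

```python
def cost_matrix(class_conf, priors, costs):
    # Step 1: row-scale by priors -> joint-probability confusion matrix
    joint = [[p * c for c in row] for p, row in zip(priors, class_conf)]
    # Step 2: multiply elementwise by the per-cell cost
    return [[j * c for j, c in zip(jr, cr)] for jr, cr in zip(joint, costs)]

def expected_cost(cm):
    # Sum of all cells: average cost of one application of the system
    return sum(sum(row) for row in cm)
```

With the medical example's class confusion matrix, priors (.60/.30/.10), and costs (taking A misdiagnosed as C to cost $5,010, consistent with the final cost matrix there), `expected_cost` reproduces the $1,026.18 figure.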
Minimizing Cost
An evolutionary algorithm can be used to find a lowest-cost system; the total cost computed from the cost matrix then serves as the fitness function.
Sometimes, it is sufficient to just minimize the sum of off-diagonal numbers.
Example: Medical Diagnostic System
Three possible diagnoses: A, B, and C; 50 cases of each diagnosis for training and also for testing.
Prior probabilities are .60, .30, and .10, respectively.
Test results:

                            CI System Diagnoses
                            A     B     C
Gold Standard   A           40    8     2
Diagnoses       B           6     42    2
                C           1     1     48

Class confusion matrix:

                            CI System Diagnoses
                            A       B       C
Gold Standard   A           0.80    0.16    0.04
Diagnoses       B           0.12    0.84    0.04
                C           0.02    0.02    0.96

Final confusion matrix (includes prior probabilities):

                            CI System Diagnoses
                            A       B       C
Gold Standard   A           .480    .096    .024
Diagnoses       B           .036    .252    .012
                C           .002    .002    .096
Sum of main diagonal gives system accuracy of 82.8 percent.
Costs of correct diagnoses: $10, $100, and $5,000.

Misdiagnosis costs:
  A as B: $100 + $10
  A as C: $5,000 + $10
  B as A: $10 + $100
  B as C: $5,000 + $100 (plus angry patient)
  C as A or B: $80,000 + ($10 or $100 for A or B) (plus lawsuits)
Final cost matrix for this system configuration:

                            CI System Diagnoses
                            A         B         C
Gold Standard   A           4.80      10.56     120.24
Diagnoses       B           3.96      25.20     61.20
                C           160.02    160.20    480.00
Average application of system thus costs $1,026.18.
Chi-Square Test

What you can use if you don't know what the results are supposed to be

* Can be useful in modeling, simulation, or pattern generation, such as music composition

* Chi-square test examines how often each category (class) occurs versus how often it's expected to occur

χ² = Σ_{i=1..n} (O_i − E_i)² / E_i

where O_i is the observed frequency, E_i the expected frequency, and n the number of categories

This test assumes normally distributed data. Threshold values play only an indirect role.
Chi-square test case
Four PE’s (four output categories), so there are 3 degrees of freedom.
•In test case of 50 patterns, expected frequency distribution is: 5, 10, 15, 20
•Test 1 results: 4, 10, 16, 20. Chi-square = 0.267
•Test 2 results: 2, 15, 9, 26. Chi-square = 8.50
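Both chi-square values can be reproduced directly from the formula (function name illustrative):

```python
def chi_square(observed, expected):
    # Sum of (O_i - E_i)^2 / E_i over the n categories
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))

expected = [5, 10, 15, 20]
chi_square([4, 10, 16, 20], expected)   # 0.267 (test 1)
chi_square([2, 15, 9, 26], expected)    # 8.50  (test 2)
```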
Chi-Square Example Results
For the first test set, the hypothesis of no difference between expected and obtained frequencies (the null hypothesis) is sustained at the .95 level. This means that it is over 95% probable that the differences are due solely to chance.

For the second test set, the null hypothesis is rejected at the .05 level. This means that the probability is less than 5% that the differences are due solely to chance.
Using Excel™ to Calculate Chi-square
To find the probability that the null hypothesis is sustained or rejected, use =CHIDIST(X2, df), where X2 is the chi-square value and df is the degrees of freedom.
Example: CHIDIST(0.267,3) yields an answer of 0.966, so the null hypothesis is sustained at the .966 level.
Generate chi-square values (as in a table) with =CHIINV(p, df)
Example: CHIINV(.95, 3) yields 0.352, which is the value in a table.
Chi-Square Summary
• Chi-square measures whole system performance at once
• Watch out for:
  Output combinations not expected (frequencies of 0)
  Systems with a large number of degrees of freedom (most tables are limited to 30-40 or so)
• You are, of course, looking for small chi-square values!