Chapter 10
Performance Metrics
Introduction
• Sometimes, measuring how well a system is performing is relatively straightforward: we calculate a “percent correct”
• Other performance metrics are not often discussed in the literature
• Issues such as selection of data are reviewed
• Choices and uses of specific performance metrics are discussed
Issues to be Addressed
1. Selection of “gold standards”
2. Specification of training sets (sizes, # iterations, etc.)
3. Selection of test sets
4. Role of decision threshold levels
Computational Intelligence Performance Metrics

Percent correct
Average sum-squared error
Normalized error
Evolutionary algorithm effectiveness measures
Mann-Whitney U Test
Receiver operating characteristic curves
Recall, precision, sensitivity, specificity, etc.
Confusion matrices, cost matrices
Chi-square test
Selecting the “Gold Standard”
Issues:
1. Selecting the classification
   * do experts agree?
   * involve end users
2. Selecting representative pattern sets
   * agreed to by experts
   * distributed over classes appropriately (some near decision hypersurfaces)
3. Selecting person or process to designate gold standard
   * involve end users
   * the customer is always right
Specifying Training Sets
• Use different sets of patterns for training and testing
• Rotate patterns through training and testing, if possible
• Use “leave-n-out” method
• Select pattern distribution appropriate for paradigm
  > equal number in each class for back-propagation
  > according to probability distribution for SOFM
Percent Correct
• Most commonly used metric

• Can be misleading:
  > Predicting 90% of the stocks that will exceed the Dow average and 60% of those that won't, where .5 of the patterns fall in each category, results in 75% correct
  > Predicting 85% of those that exceed and 55% of those that won't, where .7 are in the first category and .3 in the second, results in 76% correct
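The two misleading cases above can be checked with a short sketch (function name is illustrative): overall percent correct is just each class's accuracy weighted by that class's fraction of the patterns.

```python
def percent_correct(class_accuracies, class_fractions):
    # Weighted percent correct: sum of (per-class accuracy x class fraction)
    return sum(a * f for a, f in zip(class_accuracies, class_fractions))

percent_correct([0.90, 0.60], [0.5, 0.5])  # 0.75 -- first case above
percent_correct([0.85, 0.55], [0.7, 0.3])  # 0.76 -- second case above
```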
Calculating Average Sum-Squared Error
Total error: E_TOTAL = 0.5 Σ_{k,j} (b_{kj} − z_{kj})²

where b_{kj} is the target value and z_{kj} the network output for output PE j on pattern k

Average error: E_AVG = E_TOTAL / (no. of patterns)

Note: Inclusion of .5 factor not universal; number of output PEs not always taken into account

Dividing by no. of output PEs desirable to get results that can be compared
This metric is used with other CI paradigms
State your method when publishing results
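A minimal sketch of these calculations (function names are illustrative); `half` toggles the optional 0.5 factor and `per_pe` the optional division by the number of output PEs, since neither convention is universal.

```python
def total_sse(targets, outputs, half=True):
    # targets/outputs: one list of output-PE values per pattern
    e = sum((b - z) ** 2
            for bs, zs in zip(targets, outputs)   # patterns k
            for b, z in zip(bs, zs))              # output PEs j
    return 0.5 * e if half else e

def avg_sse(targets, outputs, half=True, per_pe=False):
    # Average over patterns; optionally also over output PEs for comparability
    e = total_sse(targets, outputs, half) / len(targets)
    return e / len(targets[0]) if per_pe else e
```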
Selection of Threshold Values
•Value is often set to 0.5, usually without scientific basis
•Another value may give better performance
Example: 10 patterns, threshold = .5, 5 should be on and 5 off (1 output PE)

If on is always .6 and off always .4: avg. SSE = .16, 100% correct
If on is always .9, and off is .1 for 3 patterns and .6 for 2: avg. SSE = .08, but only 80% correct

  .05   (.9 for 5 on patterns)
  .03   (.1 for 3 off patterns)
  .72   (.6 for 2 off patterns)
  .80 / 10 = .08 avg. SSE

Perhaps calculate SSE only for errors, and use threshold as desired value
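The example's numbers (no 0.5 factor, as on the slide) can be reproduced directly; the helper names are illustrative.

```python
targets = [1] * 5 + [0] * 5                  # 5 on, 5 off, 1 output PE
case_a = [0.6] * 5 + [0.4] * 5               # on always .6, off always .4
case_b = [0.9] * 5 + [0.1] * 3 + [0.6] * 2   # on .9; off .1 for 3, .6 for 2

def avg_sse(b, y):
    # Average sum-squared error without the 0.5 factor
    return sum((bi - yi) ** 2 for bi, yi in zip(b, y)) / len(b)

def pct_correct(b, y, threshold=0.5):
    # A pattern is correct when the thresholded output matches the target
    return sum((yi >= threshold) == bool(bi) for bi, yi in zip(b, y)) / len(b)

avg_sse(targets, case_a)   # 0.16, yet pct_correct gives 1.0
avg_sse(targets, case_b)   # 0.08, yet pct_correct gives only 0.8
```

The lower-SSE configuration is the one that classifies worse, which is why the threshold and the error metric must be considered together.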
Absolute Error
• More intuitive than sum-squared error
• Mean absolute error:

E_MA = (1/(m·q)) Σ_{k=1..m} Σ_{j=1..q} |b_{kj} − y_{kj}|

where m is the number of patterns and q the number of output PEs

• Above formulation is for a neural net; the metric is also useful for other paradigms using optimization, such as fuzzy cognitive maps

• Max. Abs. Error = MAX_{k,j} |b_{kj} − y_{kj}|
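Both absolute-error metrics can be sketched as follows (function names are illustrative):

```python
def mean_abs_error(targets, outputs):
    # E_MA: mean of |b - y| over all m patterns and q output PEs
    m, q = len(targets), len(targets[0])
    return sum(abs(b - y) for bs, ys in zip(targets, outputs)
                          for b, y in zip(bs, ys)) / (m * q)

def max_abs_error(targets, outputs):
    # Largest single deviation over all patterns and output PEs
    return max(abs(b - y) for bs, ys in zip(targets, outputs)
                          for b, y in zip(bs, ys))
```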
Removing Target Variance Effects
• Variance: avg. of squared deviations from the mean

• Standard SSE is corrupted by target variance:

σ_j² = (1/m) Σ_{k=1..m} (b_{kj} − b̄_j)²

where b̄_j is the mean target value for output PE j
• Pineda developed normalized error E_NORM using E_MEAN, which is constant for a given pattern set
Calculating Normalized Error

First, calculate total error (previous slide) and mean error E_MEAN:

E_MEAN = 0.5 Σ_{k,j} (b_{kj} − b̄_j)²

Now, normalized error E_NORM = E_TOTAL / E_MEAN

Watch out for PEs that don't change value (mean error = 0); perhaps add .000001 to the mean error to be safe
This metric reflects output variance due to error rather than error due to neural network architecture.
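The calculation can be sketched as below (function name illustrative); `eps` is the small safety constant suggested above for PEs whose targets never change.

```python
def normalized_error(targets, outputs, eps=1e-6):
    m, q = len(targets), len(targets[0])
    # Per-output-PE mean target value (b-bar_j)
    means = [sum(p[j] for p in targets) / m for j in range(q)]
    e_total = 0.5 * sum((b - z) ** 2 for bs, zs in zip(targets, outputs)
                                     for b, z in zip(bs, zs))
    e_mean = 0.5 * sum((p[j] - means[j]) ** 2
                       for p in targets for j in range(q))
    return e_total / (e_mean + eps)   # eps guards against e_mean == 0
```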
Evolutionary Algorithm Effectiveness Metrics (DeJong 1975)

Offline performance - measure of convergence:

p_offline(s, G) = (1/G) Σ_{g=1..G} f*_s(g)

Online performance:

p_online(s, G) = (1/G) Σ_{g=1..G} f_avg,s(g)

where G is the latest generation, f*_s(g) is the best fitness for system s in generation g, and f_avg,s(g) is the average fitness in generation g.
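A minimal sketch under the slides' formulation, which averages a per-generation statistic over generations 1..G (DeJong's original online performance averages over all function evaluations; function names are illustrative):

```python
def offline_performance(best_per_gen):
    # p_offline: average of the best fitness f*_s(g) in each generation
    return sum(best_per_gen) / len(best_per_gen)

def online_performance(avg_per_gen):
    # p_online: average of the per-generation average fitness f_avg,s(g)
    return sum(avg_per_gen) / len(avg_per_gen)
```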
Mann-Whitney U Test
• Also known as Mann-Whitney-Wilcoxon test or Wilcoxon rank sum test
• Used for analyzing performance of evolutionary algorithms
• Evaluates similarity of medians of two data samples
• Uses ordinal data (continuous or at least rank-ordered: can tell which of two values is greater)
Two Samples: n1 and n2
• Sample sizes need not be the same
• Can often get significant results (.05 level or better) with fewer than 10 samples in each group
• We focus on calculating U for relatively small sample sizes
Analyze Best Fitness of Runs
• Assume minimization problem, best value = 0.0
• Two configurations, A and B (baseline), that differ in some setting (maybe different mutation rates)
• We obtain the following best fitness values from 5 runs of A and 4 runs of B:
A: .079, .062, .073, .047, .085 (n1 = 5)
B: .102, .069, .055, .049 (n2 = 4)
• There are two ways to calculate U
  – Quick and direct
  – Formula (PC statistical packages)
Quick and Direct Method

Arrange measurements in fitness order:
.047 .049 .055 .062 .069 .073 .079 .085 .102
A B B A B A A A B
Count number of As that are better than each B:
U1 = 1 +1 + 2 + 5 = 9
Count number of Bs that are better than each A:
U2 = 0 + 2 + 3 + 3 + 3 = 11
Now, U = min[U1, U2] = 9

Note: U2 = n1·n2 − U1
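The direct count can be sketched as follows (function name illustrative): for every (A, B) pair, score 1 if the A value is better (smaller, for this minimization problem) and 0.5 for a tie.

```python
def mann_whitney_u(a, b):
    # U1 = number of (A, B) pairs where A beats B (ties count 0.5)
    u1 = sum(1.0 if x < y else (0.5 if x == y else 0.0)
             for x in a for y in b)
    u2 = len(a) * len(b) - u1        # U2 = n1*n2 - U1
    return min(u1, u2)

a = [.079, .062, .073, .047, .085]   # configuration A (n1 = 5)
b = [.102, .069, .055, .049]         # configuration B (n2 = 4)
mann_whitney_u(a, b)                 # 9.0, matching the count above
```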
Formula Method
Calculate R, the sum of ranks, for each n.
R1 = 1 + 4 + 6 + 7 + 8 = 26
R2 = 2 + 3 + 5 + 9 = 19
Now, U is the smaller of:

U1 = n1·n2 + n1(n1 + 1)/2 − R1 = 20 + 15 − 26 = 9

U2 = n1·n2 + n2(n2 + 1)/2 − R2 = 20 + 10 − 19 = 11
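The formula method can be sketched as below (function name illustrative; assumes no tied values, as in the example data):

```python
def u_from_rank_sums(a, b):
    # Rank the pooled measurements, sum ranks per sample, then apply
    # U_i = n1*n2 + n_i*(n_i + 1)/2 - R_i and take the smaller U.
    pooled = sorted(a + b)
    rank = {v: i + 1 for i, v in enumerate(pooled)}
    n1, n2 = len(a), len(b)
    r1 = sum(rank[v] for v in a)   # 26 for the data above
    r2 = sum(rank[v] for v in b)   # 19
    u1 = n1 * n2 + n1 * (n1 + 1) / 2 - r1
    u2 = n1 * n2 + n2 * (n2 + 1) / 2 - r2
    return min(u1, u2)
```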
Is Null Hypothesis Rejected?
If U is less than or equal to the value in the table, the null hypothesis is rejected at the .05 level. 9 > 2, so it is NOT rejected. Thus, we cannot say one configuration results in significantly higher fitness than the other.
Now Test Configuration C
We obtain the following, ignoring specific fitness values since rank is what matters:
C C C C B B C B B
No. of Bs better than each C is 0 + 0 + 0 + 0 + 2 = 2 = U
Now, null hypothesis rejected at .05 level
C is statistically better than B
Note: This test can be used for other systems using a variety of fitness measures, such as percent correct.
Receiver Operating Characteristic Curves
* Originated in 1940s in communications systems and psychology
* Now being used in diagnostic systems and expert systems
* ROC curves are not sensitive to the probability distribution of training or test set patterns, or to decision bias
* Good for one class at a time (one output PE)
* Most common metric is the area under the curve
Contingency Matrix

                          System Diagnosis
                          Positive           Negative
Gold        Positive      TP (true pos.)     FN (false neg.)
Standard
Diagnosis   Negative      FP (false pos.)    TN (true neg.)

Recall is TP/(TP + FN)
Precision is TP/(TP + FP)
Contingency Matrix
* Reflects the four possibilities from one PE or output class

* ROC curve makes use of two ratios of these numbers
  True pos. ratio = TP/(TP + FN) = sensitivity
  False pos. ratio = FP/(FP + TN) = 1 − specificity

* Major diagonal of the curve represents the situation where no discrimination exists

Note: Specificity = TN/(FP + TN)
Plotting the ROC Curve
•Plot for various values of thresholds, or,
•Plot for various output values of the PE
•Probably need about 10 values to get resolution
•Calculate area under curve using trapezoidal rule
Note: Calculate each value in contingency matrix for each threshold or output value.
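The plotting procedure can be sketched as follows (function names illustrative): compute one (false pos. ratio, true pos. ratio) point per threshold from the contingency counts, then apply the trapezoidal rule.

```python
def roc_point(scores, labels, threshold):
    # Contingency counts for one threshold; labels are 1 (positive) or 0
    tp = sum(1 for s, l in zip(scores, labels) if s >= threshold and l)
    fp = sum(1 for s, l in zip(scores, labels) if s >= threshold and not l)
    fn = sum(1 for s, l in zip(scores, labels) if s < threshold and l)
    tn = sum(1 for s, l in zip(scores, labels) if s < threshold and not l)
    return fp / (fp + tn), tp / (tp + fn)   # (false pos. ratio, true pos. ratio)

def auc_trapezoid(points):
    # Trapezoidal rule over the ROC points, anchored at (0,0) and (1,1)
    pts = sorted(set(points) | {(0.0, 0.0), (1.0, 1.0)})
    return sum((x2 - x1) * (y1 + y2) / 2
               for (x1, y1), (x2, y2) in zip(pts, pts[1:]))
```

A perfect system yields a point at (0, 1) for some threshold and an area of 1.0; chance-level performance gives points on the major diagonal and an area of 0.5.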
ROC Curve Example
ROC Curve Interpretation
Along the dotted line, no discrimination exists. The system can achieve this performance solely by chance.
A perfect system has a true positive ratio of one, and a false positive ratio of zero, for some threshold.
ROC Cautions
* Two ROC curves with the same area can intersect - one is better on false positives, the other on false negatives

* Use a sufficient number of cases

* Might want to investigate behavior near other output PE values
Recall and Precision
Recall: The number of positive diagnoses correctly made by the system divided by the total number of positive diagnoses made by the gold standard (true positive ratio)
Precision: The number of positive diagnoses correctly made by the system divided by the total number of positive diagnoses made by the system
Sensitivity and Specificity
Sensitivity = TP/(TP + FN) - Likelihood event is detected given that it is present

Specificity = TN/(TN + FP) - Likelihood absence of event is detected given that it is absent

Pos. predictive value = TP/(TP + FP) - Likelihood that detection of event is associated with event

False alarm rate = FP/(FP + TN) - Likelihood of false signal detection given absence of event
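The four ratios can be computed from the contingency counts in one place (function and key names are illustrative):

```python
def diagnostic_ratios(tp, fn, fp, tn):
    return {
        "sensitivity": tp / (tp + fn),            # event detected when present
        "specificity": tn / (tn + fp),            # absence detected when absent
        "pos_predictive_value": tp / (tp + fp),   # detection really is the event
        "false_alarm_rate": fp / (fp + tn),       # false detection when absent
    }
```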
Criteria for Correctness
•If outputs are mutually exclusive, “winning” PE is PE with largest activation value
•If outputs are not mutually exclusive, then a fixed threshold criterion (e.g. 0.5) can be used.
Confusion Matrices
•Useful when system has multiple output classes
•For n classes, n by n matrix constructed
•Rows reflect the “gold standard”
•Columns reflect system classifications
•Entry represents a count (a frequency of occurrence)
•Diagonal values represent correct instances
•Off-diagonal values represent row misclassified as column
Using Confusion Matrices
• Initially work row-by-row to calculate “class confusion” values by dividing each entry by the total count in its row (each row sums to 1)
• Now have “class confusion” matrix
• Calculate “average percent correct” by summing diagonal values and dividing by number of classes n. (This isn’t true percent correct unless all classes have same prior probability.)
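These two steps can be sketched as below (function names illustrative):

```python
def class_confusion(counts):
    # counts[i][j]: patterns of gold-standard class i classified as class j;
    # dividing each entry by its row total makes each row sum to 1
    return [[c / sum(row) for c in row] for row in counts]

def average_percent_correct(conf):
    # Mean of diagonal entries; equals true percent correct only when all
    # classes have the same prior probability
    return sum(conf[i][i] for i in range(len(conf))) / len(conf)
```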
To Calculate Cost Matrix

• Must know prior probabilities of each class
• Multiply each element by prior probability for class (row)
• Now each value in matrix is probability of occurrence; all sum to one. This is the confusion matrix.
• Multiply each element in the matrix by its cost (diagonal costs often are zero, but not always). This is the cost matrix.

• Cost ratios can be used; subjective measures cannot

• Use cost matrix to fine-tune system (threshold values, membership functions, etc.)
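The steps above can be sketched as follows (function names illustrative): scale each class-confusion row by its prior to get joint probabilities, then weight each cell by its cost; the sum of the cost matrix is the expected cost per application of the system.

```python
def cost_matrix(class_conf, priors, costs):
    # Step 1: row-scale by priors -> joint-probability confusion matrix
    joint = [[p * c for c in row] for p, row in zip(priors, class_conf)]
    # Step 2: multiply elementwise by the per-cell cost
    return [[j * c for j, c in zip(jr, cr)] for jr, cr in zip(joint, costs)]

def expected_cost(cm):
    # Sum of all cells: average cost of one application of the system
    return sum(sum(row) for row in cm)
```

With the medical example's class confusion matrix, priors (.60/.30/.10), and costs (taking A misdiagnosed as C to cost $5,010, consistent with the final cost matrix there), `expected_cost` reproduces the $1,026.18 figure.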
Minimizing Cost
An evolutionary algorithm can be used to find a lowest-cost system; the total cost computed from the cost matrix then serves as the fitness function.
Sometimes, it is sufficient to just minimize the sum of off-diagonal numbers.
Example: Medical Diagnostic System
Three possible diagnoses: A, B, and C; 50 cases of each diagnosis for training and also for testing.
Prior probabilities are .60, .30, and .10, respectively.
Test results:

                            CI System Diagnoses
                            A     B     C
Gold Standard   A           40    8     2
Diagnoses       B           6     42    2
                C           1     1     48

Class confusion matrix:

                            CI System Diagnoses
                            A       B       C
Gold Standard   A           0.80    0.16    0.04
Diagnoses       B           0.12    0.84    0.04
                C           0.02    0.02    0.96

Final confusion matrix (includes prior probabilities):

                            CI System Diagnoses
                            A       B       C
Gold Standard   A           .480    .096    .024
Diagnoses       B           .036    .252    .012
                C           .002    .002    .096
Sum of main diagonal gives system accuracy of 82.8 percent.
Costs of correct diagnoses: $10, $100, and $5,000.

Misdiagnosis costs:
  A as B: $100 + $10
  A as C: $5,000 + $10
  B as A: $10 + $100
  B as C: $5,000 + $100 (plus angry patient)
  C as A or B: $80,000 + ($10 or $100 for A or B) (plus lawsuits)
Final cost matrix for this system configuration:

                            CI System Diagnoses
                            A         B         C
Gold Standard   A           4.80      10.56     120.24
Diagnoses       B           3.96      25.20     61.20
                C           160.02    160.20    480.00
Average application of system thus costs $1,026.18.
Chi-Square Test

What you can use if you don't know what the results are supposed to be

* Can be useful in modeling, simulation, or pattern generation, such as music composition

* Chi-square test examines how often each category (class) occurs versus how often it's expected to occur

χ² = Σ_{i=1..n} (O_i − E_i)² / E_i

where O_i is the observed frequency, E_i the expected frequency, and n the number of categories

This test assumes normally distributed data. Threshold values play only an indirect role.
Chi-square test case
Four PE’s (four output categories), so there are 3 degrees of freedom.
•In test case of 50 patterns, expected frequency distribution is: 5, 10, 15, 20
•Test 1 results: 4, 10, 16, 20. Chi-square = 0.267
•Test 2 results: 2, 15, 9, 26. Chi-square = 8.50
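Both chi-square values can be reproduced directly from the formula (function name illustrative):

```python
def chi_square(observed, expected):
    # Sum of (O_i - E_i)^2 / E_i over the n categories
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))

expected = [5, 10, 15, 20]
chi_square([4, 10, 16, 20], expected)   # 0.267 (test 1)
chi_square([2, 15, 9, 26], expected)    # 8.50  (test 2)
```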
Chi-Square Example Results
For the first test set, the hypothesis of no difference between expected and obtained frequencies (the null hypothesis) is sustained at the .95 level. This means that it is over 95% probable that the differences are due solely to chance.

For the second test set, the null hypothesis is rejected at the .05 level. This means that the probability is less than 5% that the differences are due solely to chance.
Using Excel™ to Calculate Chi-square
To find the probability that the null hypothesis is sustained or rejected, use =CHIDIST(X2, df), where X2 is the chi-square value and df is the degrees of freedom.
Example: CHIDIST(0.267,3) yields an answer of 0.966, so the null hypothesis is sustained at the .966 level.
Generate chi-square values (as in a table) with =CHIINV(p, df)
Example: CHIINV(.95, 3) yields 0.352, which is the value in a table.
Chi-Square Summary
• Chi-square measures whole system performance at once
• Watch out for:
  Output combinations not expected (frequencies of 0)
  Systems with a large number of degrees of freedom (most tables are limited to 30-40 or so)
• You are, of course, looking for small chi-square values!