Statistics in Retail Finance Chapter 3: Assessing performance
Overview >
In this chapter we consider ways to measure performance of a default
model. This allows us to compare models.
Topics include:
Error types
Receiver Operating Characteristic (ROC) curve
Area under the ROC curve (AUC)
Probability calibration
Cost-based measure and optimal cut-off score
Testing and forecasting
Introduction >
When a model is built we need to determine how well it is performing.
How we do this depends on how we intend to use the model.
There are broadly three types of performance measure that we consider:
1. How good is the model at classifying borrowers?
2. How good is the model at estimating probabilities?
3. How well does the model estimate the profit/loss of an individual borrower?
A model is tested on a validation data set for which the model provides
predictions or estimates and for which we also have the observed outcomes.
A performance measure is a function which compares the estimates against
observations.
Classification errors >
Remember that decision making using default models is based on a cut-off
score $c$. Suppose we have a loan application with score $s$ and outcome $y$
(where $y = 1$ is a bad outcome, the positive class, and $y = 0$ is a good outcome);
then here are the outcomes for each decision.

                  y = 0 (good)                        y = 1 (bad)
Reject (s ≤ c)    Rejected but it has good outcome.   Rejected and it has bad outcome.
                  False positive error.
Accept (s > c)    Accepted and it has good outcome.   Accepted but it has bad outcome.
                                                      False negative error.

This table shows the two types of errors that can occur.
Note that these two types of errors have different costs:
A false positive represents the loss of potential interest income.
A false negative represents possibly the entire value of the loan or
credit.
The false negative is a much higher cost.
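The decision table can be sketched in a few lines of Python. This is only an illustration: the function name `classify`, the example applications, and the convention that a score at or below the cut-off is rejected are assumptions made here, not part of the course material.

```python
def classify(score, outcome, cutoff):
    """Label one application; outcome 1 = bad (positive), 0 = good (negative)."""
    rejected = score <= cutoff  # assumed convention: reject at or below the cut-off
    if rejected:
        return "false positive" if outcome == 0 else "true positive"
    return "false negative" if outcome == 1 else "true negative"

# Hypothetical applications as (score, outcome) pairs
applications = [(12, 1), (18, 0), (40, 0), (55, 1)]
labels = [classify(s, y, cutoff=20) for s, y in applications]
print(labels)
```

With a cut-off of 20, the second application is a lost good customer (false positive) and the fourth is an accepted defaulter (false negative), which is the more costly of the two errors.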
Definitions >
Error rates within this matrix can be expressed using the following
cumulative distribution functions (CDFs), where $S$ denotes the score and $Y$ the outcome:
$F_0(c) = P(S \le c \mid Y = 0)$ as the CDF of scores that are rejected amongst
those that are negative (false positive rate);
$F_1(c) = P(S \le c \mid Y = 1)$ as the CDF of scores that are rejected amongst
those that are positive (true positive rate).
Therefore,
$1 - F_0(c)$ is the complementary CDF of scores that are accepted amongst
those that are negative;
$1 - F_1(c)$ is the complementary CDF of scores that are accepted amongst
those that are positive (false negative rate).
Let $p_0 = P(Y = 0)$. Therefore, $P(Y = 1) = 1 - p_0$.
Given a validation data set, these CDFs can be computed as empirical CDFs.
Example 3.1
Consider 16 applicants with different scores (not log-odds) and outcomes.
Score          8 10 21 22 25 30 35 42 45 46 51 52 59 64 70 78
Actual outcome 1  0  1  0  1  1  0  0  0  0  0  0  1  0  0  0
Empirical distributions $F_0$ and $F_1$ for the 16 example borrowers.
Notice that $F_0$ and $F_1$ diverge.
[Figure: empirical CDFs $F_0$ and $F_1$ (y-axis) against cut-off score (x-axis).]
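As a minimal sketch, the empirical CDFs for the 16 applicants of Example 3.1 can be computed directly from the data. The helper name `empirical_cdf` and the choice of cut-offs printed are illustrative assumptions; the CDF is taken as the fraction of scores at or below the cut-off within each outcome group.

```python
scores   = [8, 10, 21, 22, 25, 30, 35, 42, 45, 46, 51, 52, 59, 64, 70, 78]
outcomes = [1, 0, 1, 0, 1, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0]

def empirical_cdf(scores, outcomes, label, cutoff):
    """Fraction of applicants with the given outcome whose score is <= cutoff."""
    group = [s for s, y in zip(scores, outcomes) if y == label]
    return sum(s <= cutoff for s in group) / len(group)

# Evaluate both empirical CDFs at a few example cut-offs
for c in (20, 30, 50):
    f0 = empirical_cdf(scores, outcomes, 0, c)
    f1 = empirical_cdf(scores, outcomes, 1, c)
    print(f"c={c}: F0={f0:.3f}  F1={f1:.3f}")
```

At a cut-off of 30, four of the five bad accounts but only two of the eleven good accounts fall at or below the cut-off, which is the divergence between the two curves.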
ROC curve >
A widely used tool to assess class discrimination accuracy is the Receiver
Operating Characteristic (ROC) curve.
Developed originally for signal detection theory, hence the name.
Has the merit of being independent of any specific cut-off score or class distribution.
Plots $F_0(c)$ on the x-axis against $F_1(c)$ on the y-axis:
that is, false positive rate against true positive rate.
Typically, a ROC curve looks like this:
The blue line is the ROC curve.
The red line is a reference line (it represents an uninformative model).
[Figure: a typical ROC curve, with $F_0$ (false positive rate) on the x-axis and $F_1$ (true positive rate) on the y-axis.]
Characteristics of the ROC curve >
True positive rate ($F_1(c)$) is also known as sensitivity.
True negative rate ($1 - F_0(c)$) is also known as specificity.
The ROC curve shows the trade-off between true positive rate and true negative rate. In general, as one is increased, so the other decreases.
Must pass through point (0,0) since this is the extreme case when the cut-off
is so low that no scores are less (eg all applications are accepted).
Must pass through point (1,1) since this is the extreme case when the cut-off is so high that all scores are less (eg all applications are rejected).
The best model has a ROC curve that passes through (0,1) since this is the case when there are no errors of either type
(ie $F_0(c) = 0$ and $1 - F_1(c) = 0$).
A model that has no discriminatory power is such that $F_0(c) = F_1(c)$ for all
$c$. This is represented by a straight line from (0,0) to (1,1): the red line
in the example above.
The slope on the ROC curve is $f_1(c) / f_0(c)$, where $f_0$ and $f_1$ are the
density functions of $F_0$ and $F_1$.
Proof: $f_k(c) = dF_k(c)/dc$ for $k \in \{0, 1\}$, therefore
$$\frac{dF_1}{dF_0} = \frac{dF_1(c)/dc}{dF_0(c)/dc} = \frac{f_1(c)}{f_0(c)}.$$
Example 3.2
Again, consider the 16 applicants from example 3.1, with different scores
and outcomes.
Score 8 10 21 22 25 30 35 42 45 46 51 52 59 64 70 78
Outcome 1 0 1 0 1 1 0 0 0 0 0 0 1 0 0 0
The ROC curve is given as follows.
[Figure: ROC curve for the 16 applicants, with $F_0$ (false positive rate) on the x-axis and $F_1$ (true positive rate) on the y-axis.]
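The points of this ROC curve can be reproduced with a short sketch. The function name `roc_points` is an assumption for illustration; it evaluates the empirical $F_0$ and $F_1$ at each distinct score, treating scores at or below the cut-off as rejected, and starts the curve at (0, 0).

```python
scores   = [8, 10, 21, 22, 25, 30, 35, 42, 45, 46, 51, 52, 59, 64, 70, 78]
outcomes = [1, 0, 1, 0, 1, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0]

def roc_points(scores, outcomes):
    """Return [(F0, F1), ...] for cut-offs at each distinct score, starting at (0, 0)."""
    n_neg = outcomes.count(0)
    n_pos = outcomes.count(1)
    points = [(0.0, 0.0)]
    for c in sorted(set(scores)):
        # Empirical CDFs: share of negatives / positives with score <= c
        f0 = sum(1 for s, y in zip(scores, outcomes) if y == 0 and s <= c) / n_neg
        f1 = sum(1 for s, y in zip(scores, outcomes) if y == 1 and s <= c) / n_pos
        points.append((f0, f1))
    return points

points = roc_points(scores, outcomes)
for f0, f1 in points:
    print(f"{f0:.3f}  {f1:.3f}")
```

Each bad account moves the curve up by 1/5 and each good account moves it right by 1/11, so the curve is a staircase from (0, 0) to (1, 1).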
Exercise 3.1
Consider the following 16 applicants with different scores and outcomes.
Score 5 10 12 20 26 28 32 42 45 52 55 60 62 75 82 99
Outcome 0 1 0 1 0 0 1 0 0 0 0 0 1 0 0 0
1. Draw a graph showing empirical distributions $F_0$ and $F_1$.
2. Plot the ROC curve based on these results.
Comparing models using the ROC curve >
Consider two models A and B that produce two different ROC curves on the
same validation data set.
ROC curve for model A is blue and model B is green.
[Figure: ROC curves for models A and B, with false positive rate on the x-axis and true positive rate on the y-axis.]
Model B outperforms model A over the whole range of the curve, since its
curve is always higher, so B seems to be the better model.
However, not all comparisons between ROC curves are so clear-cut.
Consider:
ROC curve for model A is blue and model B is green.
[Figure: ROC curves for models A and B, with false positive rate on the x-axis and true positive rate on the y-axis.]
Model A is good for low false positive rates, whereas model B is good for
high false positive rates. Therefore, it is difficult to determine a “best”
model.
For credit scoring, it is low cut-offs (eg rejecting few applications) that are
usually considered, so it is the lower left of the ROC curve which is usually
most important. However, where do we draw the line for such
comparisons?
The ROC curve is useful to view behaviour of a model across different cut-
off scores.
However, it does not give a single measure of discrimination, which is what
we really want for model comparison.
Area under the ROC curve >
A popular measure of discrimination is the area under the ROC curve (AUC)
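given by
$$\mathrm{AUC} = \int_{-\infty}^{\infty} F_1(c)\, f_0(c)\, dc,$$
where $f_0$ is the density function of $F_0$.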
In particular,
AUC=0.5 corresponds to a model with no classification power.
AUC=1 corresponds to a model with maximal classification power.
Models can be directly compared using their AUC. If model A has a higher AUC than model B then it is considered the better
model in terms of discriminatory power.
Proof that AUC=0.5 for a model with no classification power:
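If a model has no classification power, then $F_0(c) = F_1(c)$ for all $c$.
Therefore,
$$\mathrm{AUC} = \int_{-\infty}^{\infty} F_0(c)\, f_0(c)\, dc = \left[ \tfrac{1}{2} \left( F_0(c) \right)^2 \right]_{-\infty}^{\infty} = \tfrac{1}{2} [1 - 0] = \tfrac{1}{2}.$$
Proof that AUC=1 for a model with maximal classification power:
If a model has maximal classification power, then there exists a cut-off
score $c^*$ such that $F_0(c^*) = 0$ and $F_1(c^*) = 1$ (ie no errors).
Since $F_0$ and $F_1$ are both CDFs,
for all $c \le c^*$: $F_0(c) = 0$, hence $f_0(c) = 0$,
for all $c \ge c^*$: $F_1(c) = 1$.
Therefore,
$$\mathrm{AUC} = \int_{-\infty}^{\infty} F_1(c)\, f_0(c)\, dc = \int_{c^*}^{\infty} F_1(c)\, f_0(c)\, dc = \int_{c^*}^{\infty} f_0(c)\, dc = \left[ F_0(c) \right]_{c^*}^{\infty} = 1 - 0 = 1.$$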
Estimate of AUC >
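Suppose we have a validation data set with $n$ observations and $m$ distinct
score values indexed in rank order:
$$s_1 < s_2 < \dots < s_m,$$
with empirical estimates $\hat{F}_0$ and $\hat{F}_1$ for $F_0$ and $F_1$ respectively.
Since $\mathrm{AUC} = \int F_1(c)\, f_0(c)\, dc$, we estimate AUC as
$$\widehat{\mathrm{AUC}} = \sum_{j=1}^{m} \tfrac{1}{2} \left( \hat{F}_1(s_j) + \hat{F}_1(s_{j-1}) \right) \left[ \hat{F}_0(s_j) - \hat{F}_0(s_{j-1}) \right]$$
and using $\hat{F}_0(s_0) = \hat{F}_1(s_0) = 0$.
This uses the trapezoid rule to estimate the area of
segments of the ROC curve where multiple
observations exist with the same score but different
outcome.
[Figure: a trapezoidal segment of the ROC curve.]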
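The trapezoid estimate can be sketched as follows, applied to the 16 applicants of Example 3.1. The function name `auc_trapezoid` is an illustrative assumption; the sum runs over distinct scores in rank order with both empirical CDFs starting at zero.

```python
scores   = [8, 10, 21, 22, 25, 30, 35, 42, 45, 46, 51, 52, 59, 64, 70, 78]
outcomes = [1, 0, 1, 0, 1, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0]

def auc_trapezoid(scores, outcomes):
    """Trapezoid-rule AUC estimate from the empirical CDFs F0 and F1."""
    n_neg = outcomes.count(0)
    n_pos = outcomes.count(1)
    auc = 0.0
    prev_f0 = prev_f1 = 0.0  # F0(s_0) = F1(s_0) = 0
    for c in sorted(set(scores)):
        f0 = sum(1 for s, y in zip(scores, outcomes) if y == 0 and s <= c) / n_neg
        f1 = sum(1 for s, y in zip(scores, outcomes) if y == 1 and s <= c) / n_pos
        # Area of one segment: mean height times width
        auc += 0.5 * (f1 + prev_f1) * (f0 - prev_f0)
        prev_f0, prev_f1 = f0, f1
    return auc

auc = auc_trapezoid(scores, outcomes)
print(round(auc, 4))
```

Since there are no tied scores in this data set, the estimate equals the proportion of (bad, good) pairs in which the bad account has the lower score, which is 42 out of 55 pairs here.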
Exercise 3.2
Consider the following table of empirical CDFs $F_0$ and $F_1$ for two scorecards A
and B.
        A               B
F0      F1      F0      F1
0.1     0.35    0.2     0.4
0.25    0.6     0.4     0.7
0.5     0.8     0.7     0.95
0.75    0.95    0.9     1
1. Draw ROC curves for both scorecards.
2. Interpret the relative difference in performance for each scorecard
given their ROC curves.
3. Which scorecard is best in terms of AUC?
Other classification performance measures >
The ROC curve and AUC are common measures of classification performance
in credit scoring (and other application areas).
However, some alternatives also exist:
Gini coefficient = 2 × AUC − 1;
Information gain;
Kolmogorov-Smirnov statistic;
Cumulative accuracy profile (CAP) and accuracy ratio (equal to the Gini coefficient).
We will not cover them in any detail in this course.
Probability calibration >
The classification performance measures only give us a measure of how
well the models discriminate between classes.
However, quite often we are interested in the probability estimate of an
event (eg PD).
To determine the accuracy of the probability estimates we compare
against the observed frequency of the event within groups of
observations.
It is natural to group them by risk grades.
Remember that the risk grade is specified by a function of score,
$r(s) \in \{1, \dots, G\}$, where $G$ is the number of risk grades (see Chapter 1, slide 24).
Then, given a validation data set $[(\mathbf{x}_1, y_1), (\mathbf{x}_2, y_2), \dots, (\mathbf{x}_n, y_n)]$, the sample
mean estimated probability of outcome $y = 1$ for risk grade $k$ is
$$\hat{p}_k = \frac{1}{n_k} \sum_{i=1}^{n} \hat{P}(Y = 1 \mid \mathbf{x}_i)\, I\left( r(s(\mathbf{x}_i)) = k \right)$$
where $n_k$ is the number of scores in grade $k$:
$$n_k = \left| \{ i \in \{1, \dots, n\} : r(s(\mathbf{x}_i)) = k \} \right|$$
and $I(\cdot)$ is the indicator function.
The default model gives us both $s(\mathbf{x})$ and $\hat{P}(Y = 1 \mid \mathbf{x})$.
The observed frequencies are given by
$$p_k = \frac{1}{n_k} \sum_{i=1}^{n} y_i\, I\left( r(s(\mathbf{x}_i)) = k \right).$$
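A minimal sketch of the per-grade calculation follows. The function name `calibration_by_grade` and the eight validation records are hypothetical, invented only to show the arithmetic; the grades are assumed to have been assigned already from the scores.

```python
def calibration_by_grade(grades, est_probs, outcomes):
    """Return {grade: (n_k, mean estimated probability, observed frequency)}."""
    table = {}
    for k in sorted(set(grades)):
        idx = [i for i, g in enumerate(grades) if g == k]  # accounts in grade k
        n_k = len(idx)
        mean_est = sum(est_probs[i] for i in idx) / n_k
        observed = sum(outcomes[i] for i in idx) / n_k
        table[k] = (n_k, mean_est, observed)
    return table

# Hypothetical validation data: risk grade, estimated PD, observed outcome (1 = default)
grades    = [1, 1, 1, 1, 2, 2, 2, 2]
est_probs = [0.05, 0.10, 0.10, 0.15, 0.40, 0.50, 0.50, 0.60]
outcomes  = [0, 0, 0, 1, 0, 1, 1, 0]

table = calibration_by_grade(grades, est_probs, outcomes)
for k, (n_k, mean_est, observed) in table.items():
    print(f"grade {k}: n={n_k}  estimated={mean_est:.2f}  observed={observed:.2f}")
```

In this made-up data, grade 2 is well calibrated (0.50 estimated against 0.50 observed), whereas grade 1 underestimates the default rate (0.10 against 0.25).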
Probability calibration graph >
We can compare expected probabilities with observed probabilities
graphically on a probability calibration graph:
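The mean estimated probability $\hat{p}_k$ is plotted on the x-axis against the observed frequency $p_k$ on the y-axis for each risk grade
$k \in \{1, \dots, G\}$.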
The scorecard is well-calibrated if estimated probabilities are similar to
observed frequencies. On the probability calibration graph, this means
points sit close to the diagonal from (0,0) to (1,1).
Example 3.3
Interpret each of the five models M1 to M5 shown in the following
probability calibration graphs.
[Figure: two probability calibration graphs, each with estimated probability on the x-axis and observed probability on the y-axis. The first panel shows models M1, M2 and M3; the second shows models M1, M4 and M5.]
Solution
M1. Perfect calibration of estimated probability with observation.
M2. Consistently overestimates probability.
M3. Consistently underestimates probability.
M4. Generally fine for estimating mid-range probabilities (around 0.5) but
underestimates extreme probabilities (ie estimates are too conservative).
M5. Produces too many extreme probability estimates.
Example 3.4
This graph shows probability calibration graphs comparing two models over
three risk grades. What do they mean?
[Figure: probability calibration graph, with estimated probability on the x-axis and observed probability on the y-axis, showing Model (1), Model (2) and the best possible calibration line.]
Solution
Model (1) underestimates probabilities of default, whilst model (2) gives
much better estimates.
But is the calibration statistically significant?
Hosmer-Lemeshow Test >
It is useful to have a test of probability calibration across all risk grades.
We can use the Hosmer-Lemeshow Test to do this.
The Hosmer-Lemeshow Test is a form of Chi-square test. The null
hypothesis is that the observed probabilities are not different from the
estimated probabilities and the alternative hypothesis is that there is a
difference.
Null hypothesis: $p_k = \hat{p}_k$ for all $k \in \{1, \dots, G\}$, where $p_k$ and $\hat{p}_k$ are the observed and estimated probabilities for risk grade $k$.
Alternative hypothesis: $p_k \ne \hat{p}_k$ for any $k \in \{1, \dots, G\}$.
Chi-square Test >
Recall that the Chi-square test is based on observed and expected
frequencies, $O_j$ and $E_j$ respectively, falling into $m$ groups, $j \in \{1, \dots, m\}$.
It tests the null hypothesis against the alternative hypothesis:
Null hypothesis: $O_j = E_j$ for all $j \in \{1, \dots, m\}$.
Alternative hypothesis: $O_j \ne E_j$ for any $j \in \{1, \dots, m\}$.
The chi-square statistic is calculated as
$$\chi^2 = \sum_{j=1}^{m} \frac{(O_j - E_j)^2}{E_j}.$$
And, under the null hypothesis, this follows a chi-square distribution with
$m - d$ degrees of freedom, where $d$ is the reduction in degrees of freedom
within the groups.
The chi-square test is used to test probability calibration by testing within
each risk grade how many expected goods were really good ($y = 0$) and how
many expected bads were really bad ($y = 1$). As such we consider two sets
of expectations/observations: the goods and the bads.
The following frequencies are used for each risk grade $k$, where $n_k$ is the
number of accounts in the grade and $p_k$ and $\hat{p}_k$ are the observed and
estimated probabilities of a bad outcome:
Number of expected goods = $n_k (1 - \hat{p}_k)$
Number of observed goods = $n_k (1 - p_k)$
Number of expected bads = $n_k \hat{p}_k$
Number of observed bads = $n_k p_k$
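Then the chi-square statistic is:
$$\chi^2 = \sum_{k=1}^{G} \frac{\left( n_k (1 - p_k) - n_k (1 - \hat{p}_k) \right)^2}{n_k (1 - \hat{p}_k)} + \sum_{k=1}^{G} \frac{\left( n_k p_k - n_k \hat{p}_k \right)^2}{n_k \hat{p}_k}$$
$$= \sum_{k=1}^{G} \left[ n_k (p_k - \hat{p}_k)^2 \left( \frac{1}{1 - \hat{p}_k} + \frac{1}{\hat{p}_k} \right) \right]$$
$$= \sum_{k=1}^{G} \frac{n_k (p_k - \hat{p}_k)^2}{\hat{p}_k (1 - \hat{p}_k)}.$$
Since the number of groups is $m = 2G$ (goods and bads within each of the $G$ risk
grades), the number of degrees of freedom is $2G - d$, where $d$ is the reduction
in degrees of freedom; but what is $d$?
The series of good/bad observations are highly dependent, so $d$ will be
large. Simulation studies have shown that an optimal value is given by
Degrees of freedom = $G - 2$.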
Example 3.5
Use the grading system and probabilities from Example 16.2 to conduct a
Hosmer-Lemeshow Test for each model.
                                Model (1)                  Model (2)
Grade  n_k  Observed PD   Est. PD   Chi-square term   Est. PD   Chi-square term
A      6    1/6           0.0085    17.8              0.132     0.06
B      5    1/5           0.0434    3.0               0.307     0.27
C      5    3/5           0.2083    4.7               0.568     0.02
Sum                                 25.4                        0.35
P-value *                           <0.001                      0.55
* Chi-square tests are at 1 degree of freedom.
These results suggest that model (1) does not calibrate observations well (null hypothesis is rejected at the 1% significance level).
However, the null hypothesis is not rejected for model (2). Hence, the observations are typical given the estimated probabilities.
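The figures in this table can be checked with a short sketch using only the Python standard library. The function names are illustrative assumptions. The p-value helper is valid only for 1 degree of freedom, where the chi-square upper tail reduces to the complementary error function; other degrees of freedom would need a general chi-square distribution function.

```python
import math

def hosmer_lemeshow(n, observed, estimated):
    """Sum over grades of n_k (p_k - phat_k)^2 / (phat_k (1 - phat_k))."""
    return sum(n_k * (p - q) ** 2 / (q * (1 - q))
               for n_k, p, q in zip(n, observed, estimated))

def chi2_sf_1df(x):
    """Upper-tail probability of a chi-square variable with exactly 1 degree of freedom."""
    return math.erfc(math.sqrt(x / 2))

n        = [6, 5, 5]                  # accounts in grades A, B, C
observed = [1 / 6, 1 / 5, 3 / 5]      # observed PD per grade
model_1  = [0.0085, 0.0434, 0.2083]   # estimated PD, model (1)
model_2  = [0.132, 0.307, 0.568]      # estimated PD, model (2)

stat_1 = hosmer_lemeshow(n, observed, model_1)
stat_2 = hosmer_lemeshow(n, observed, model_2)
p_1 = chi2_sf_1df(stat_1)   # G - 2 = 1 degree of freedom for three grades
p_2 = chi2_sf_1df(stat_2)
print(f"model (1): statistic={stat_1:.1f}  p-value={p_1:.2g}")
print(f"model (2): statistic={stat_2:.2f}  p-value={p_2:.2f}")
```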
Exercise 3.3
A portfolio of 900 score cards is graded A to D. The following table shows
observed PD ($p_k$), along with estimated PD ($\hat{p}_k$) for two models.
Grade  n_k  p_k   Model (1) $\hat{p}_k$  Model (2) $\hat{p}_k$
A      200  0.3   0.3                    0.32
B      300  0.1   0.2                    0.12
C      300  0.05  0.1                    0.07
D      100  0.03  0.05                   0.02
1. Draw a probability calibration graph of this data.
2. Which model is apparently better calibrated, and why?
3. Perform a Hosmer-Lemeshow Test to determine whether either model PD
estimates are consistent with the observed PD.
Further information can be found about the Hosmer-Lemeshow test,
especially in relation to logistic regression, in
Dobson AJ and Barnett AG (2008). An Introduction to Generalized Linear
Models (CRC press), pp.135-137
(available in the library).
Cost-based measures >
Ultimately the bank is interested in the profit that can be derived from borrowers and avoiding any losses.
So when assessing a credit risk model, we would like to use a measure of profit/loss and choose the model that maximizes expected profit.
For application scorecards we can provide a table of expected profit/loss
for accept/reject decision and outcome.
                   Outcome
Lender decision    Positive    Negative
Reject             0           0
Accept             $-l$        $g$
Here $g > 0$ is the gain when an accepted borrower does not default and $l > 0$ is the loss when an accepted borrower defaults.
The number in each cell refers to the expected profit from each outcome:
Clearly, if the application is rejected then there is no profit (or loss).
If the application is accepted then there is a gain $g$ if the borrower does
not default and a loss $l$ if they do default.
We suppose these amounts can be treated as a constant (use an
expected value) across all cases.
The number of cases when an account is accepted and the outcome is $y$ is
$n P(Y = y) (1 - F_y(c))$ where $n$ is the number of expected cases.
Hence, expected profit is computed as
$$n p_0 (1 - F_0(c))\, g - n (1 - F_1(c)) (1 - p_0)\, l$$
$$= n g p_0 - n g (1 - p_0) \left[ F_0(c) \left( \frac{p_0}{1 - p_0} \right) + (1 - F_1(c)) R \right]$$
where $R = l / g$.
Typically, we expect $R > 1$.
The terms $n g (1 - p_0)$ and $n g p_0$ are fixed, relative to the model.
Therefore, we only need consider the term in square brackets which
represents the relative cost measure (per account), with $R$ the ratio of loss to gain:
$$L(c) = F_0(c) \left( \frac{p_0}{1 - p_0} \right) + (1 - F_1(c)) R.$$
It is a cost because it is a negative term on profit.
Notice that this is very similar to the simple error rate except for the
relative weight between the two types of error.
Optimal cut-off score >
The cost measure allows us to optimize the cut-off score by cost.
Remembering that the lender may have a threshold for the proportion
of applications that need to be accepted (volume), the optimal cut-off score
is given by
$$c^* = \arg\min_c L(c)$$
such that $P(S > c) \ge \alpha$, where $\alpha$ is the required acceptance proportion.
As $R$ increases, so the cut-off increases (reflecting the increased cost of
default risk). As $R$ decreases, so the cut-off decreases.
Note that there is not necessarily a unique optimal cut-off score (there could be several). In general, the relationship between the cut-off score and cost measure is not monotonic.
Example 3.6
Again, consider the 16 applicants from Example 3.1.
Score 8 10 21 22 25 30 35 42 45 46 51 52 59 64 70 78
Outcome 1 0 1 0 1 1 0 0 0 0 0 0 1 0 0 0
If we set $R$=5, then the relative mean cost for each value of $c$ is shown in
this graph.
We see that the expected cost is
minimized at a cut-off score of 30 (the
red spot).
[Figure: relative mean cost (y-axis) against cut-off score (x-axis).]
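The search for the cost-minimising cut-off can be sketched as below. This assumes the relative cost takes the form $L(c) = F_0(c)\, p_0/(1-p_0) + (1 - F_1(c)) R$ with $p_0$ the proportion of good outcomes, and that scores at or below the cut-off are rejected; the function name `relative_cost` is illustrative. Multiplying the cost by any positive constant would change its scale but not the minimising cut-off.

```python
scores   = [8, 10, 21, 22, 25, 30, 35, 42, 45, 46, 51, 52, 59, 64, 70, 78]
outcomes = [1, 0, 1, 0, 1, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0]

def relative_cost(scores, outcomes, cutoff, R):
    """Assumed cost form: F0(c) * p0 / (1 - p0) + (1 - F1(c)) * R."""
    n_neg = outcomes.count(0)
    n_pos = outcomes.count(1)
    p0 = n_neg / len(outcomes)  # proportion of good (negative) outcomes
    f0 = sum(1 for s, y in zip(scores, outcomes) if y == 0 and s <= cutoff) / n_neg
    f1 = sum(1 for s, y in zip(scores, outcomes) if y == 1 and s <= cutoff) / n_pos
    return f0 * p0 / (1 - p0) + (1 - f1) * R

# Evaluate the cost at each observed score and pick the cheapest cut-off
costs = {c: relative_cost(scores, outcomes, c, R=5) for c in sorted(set(scores))}
best = min(costs, key=costs.get)
print(best)
```

The search is a brute-force scan over observed scores rather than a derivative-based method, which suits an empirical cost curve that is a step function and need not be monotonic.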
Analytic solution >
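If $F_0$ and $F_1$ are differentiable, then an analytic solution is possible.
A minimum for $L(c)$ is found when
$$\frac{dL(c)}{dc} = 0$$
$$\Rightarrow f_0(c) \left( \frac{p_0}{1 - p_0} \right) - f_1(c) R = 0$$
$$\Rightarrow \frac{f_1(c) (1 - p_0)}{f_0(c)\, p_0} = \frac{1}{R}$$
$$\Rightarrow \frac{P(Y = 1 \mid S = c)}{P(Y = 0 \mid S = c)} = \frac{1}{R}$$
by Bayes theorem (assuming $f(c) > 0$ for the overall score density $f$)
$$\Rightarrow P(Y = 1 \mid S = c) = \frac{1}{1 + R}.$$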
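Since $s = s(\mathbf{x})$ for a score link-function and vector of predictor variables $\mathbf{x}$,
$$P(Y = 1 \mid \mathbf{x}) = \frac{1}{1 + R}$$
for all $\mathbf{x}$ such that $s(\mathbf{x}) = c$.
For a general link function $h$, $s(\mathbf{x}) = h(P(Y = 1 \mid \mathbf{x}))$
so $c = h\left( \frac{1}{1 + R} \right)$ is a general solution.
In particular, for the log-odds score (the log-odds of a good outcome), $h(p) = \log\left( (1 - p) / p \right)$ (see Chapter 2),
therefore
$$c = \log [R].$$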
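As a small worked illustration of the analytic result, the sketch below computes the default probability at the optimal cut-off, $1/(1+R)$, and the corresponding log-odds cut-off. It assumes the log-odds score is defined as the log of the odds of a good outcome, $\log(P(Y=0 \mid \mathbf{x}) / P(Y=1 \mid \mathbf{x}))$, so that higher scores are lower risk; the function names are hypothetical.

```python
import math

def pd_threshold(R):
    """Probability of default at the cost-minimising cut-off: 1 / (1 + R)."""
    return 1.0 / (1.0 + R)

def log_odds_cutoff(R):
    """Cut-off for a score assumed to be log(P(Y=0|x) / P(Y=1|x))."""
    p = pd_threshold(R)
    return math.log((1.0 - p) / p)

R = 5  # loss on a default is five times the gain on a non-default
print(pd_threshold(R))
print(log_odds_cutoff(R), math.log(R))
```

With $R = 5$, applications are accepted only when the estimated probability of default is below 1/6, and the log-odds cut-off coincides with $\log 5$.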
Test data sets and overfitting >
All performance measures make use of a validation data set.
The validation data set could be any data set of observations with the
right predictor variables and outcomes.
However, in order to avoid overfitting it is best to use observations
that are independent of the data used to train the model. Such a data
set of observations is a hold-out test set.
Overfitting is the phenomenon whereby the estimate of performance
of a model is upwardly biased when measured on data that
contains training observations.
If we are particularly interested in using our model for forecasting, as
we typically are in retail finance, then the test observations should also
be selected from a time period after the observations in the training data
set.
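One simple way to build such a hold-out test set for forecasting is to split the accounts by date, as in the sketch below. The record layout, the field name `opened`, the split date and the function name `out_of_time_split` are all hypothetical, chosen only to illustrate the idea.

```python
from datetime import date

def out_of_time_split(records, split_date):
    """Train on accounts opened before split_date; test on those opened on or after it."""
    train = [r for r in records if r["opened"] < split_date]
    test = [r for r in records if r["opened"] >= split_date]
    return train, test

# Hypothetical accounts with opening date, score and outcome
records = [
    {"opened": date(2010, 3, 1), "score": 42, "outcome": 0},
    {"opened": date(2010, 9, 15), "score": 21, "outcome": 1},
    {"opened": date(2011, 2, 10), "score": 59, "outcome": 1},
    {"opened": date(2011, 6, 30), "score": 70, "outcome": 0},
]

train, test = out_of_time_split(records, date(2011, 1, 1))
print(len(train), len(test))
```

Because every test account is later than every training account, performance measured on the test set is closer to what would be seen when the model is used to forecast future applications.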
Review of Chapter 3 >
In this chapter we considered ways to measure performance of a default
model. Topics included:
Error types
Receiver Operating Characteristic (ROC) curve
Area under the ROC curve (AUC)
Probability calibration
Cost-based measure and optimal cut-off score
Testing and forecasting