
Statistics in Retail Finance

Chapter 3: Assessing performance



Overview >

In this chapter we consider ways to measure performance of a default

model. This allows us to compare models.

Topics include:

Error types

Receiver Operating Characteristic (ROC) curve

Area under the ROC curve (AUC)

Probability calibration

Cost-based measure and optimal cut-off score

Testing and forecasting

Introduction >

When a model is built we need to determine how well it is performing.

How we do this depends on how we intend to use the model.

There are broadly three types of performance measure that we consider:

1. How good is the model at classifying borrowers?

2. How good is the model at estimating probabilities?

3. How well does the model estimate the profit/loss of an individual borrower?

A model is tested on a validation data set for which the model provides

predictions or estimates and for which we also have the observed outcomes.

A performance measure is a function which compares the estimates against

observations.



Classification errors >

Remember that decision making using default models is based on a cut-off score $c$. Suppose we have a loan application with score $s$ and outcome $y \in \{0, 1\}$, where $y = 1$ denotes a bad outcome (default); then here are the outcomes for each decision.

$s \le c$, $y = 0$: Rejected but it has good outcome. False positive error.

$s \le c$, $y = 1$: Rejected and it has bad outcome.

$s > c$, $y = 0$: Accepted and it has good outcome.

$s > c$, $y = 1$: Accepted but it has bad outcome. False negative error.

This table shows the two types of errors that can occur.

Note that these two types of errors have different costs:

A false positive represents the loss of potential interest income.

A false negative represents possibly the entire value of the loan or

credit.

A false negative therefore typically has a much higher cost than a false positive.

Definitions >

Error rates within this matrix can be expressed using the following cumulative distribution functions (CDFs). Define

$F_0(c) = P(S \le c \mid Y = 0)$ as the CDF of scores that are rejected amongst those that are negative (false positive rate);

$F_1(c) = P(S \le c \mid Y = 1)$ as the CDF of scores that are rejected amongst those that are positive (true positive rate).

Therefore,

$1 - F_0(c)$ is the complementary CDF of scores that are accepted amongst those that are negative (true negative rate);

$1 - F_1(c)$ is the complementary CDF of scores that are accepted amongst those that are positive (false negative rate).

Let $p_0 = P(Y = 0)$. Therefore, $P(Y = 1) = 1 - p_0$.



Given a validation data set, these CDFs can be computed as empirical CDFs.

Example 3.1

Consider 16 applicants with different scores (not log-odds) and outcomes.

Score:     8  10  21  22  25  30  35  42  45  46  51  52  59  64  70  78
Outcome:   1   0   1   0   1   1   0   0   0   0   0   0   1   0   0   0

[Figure: empirical distributions $F_0$ and $F_1$ for the 16 example borrowers. x-axis: cut-off score; y-axis: empirical CDF.]

Notice that $F_0$ and $F_1$ diverge.


ROC curve >

A widely used tool to assess class discrimination accuracy is the Receiver

Operating Characteristic (ROC) curve.

It was developed originally for signal detection theory, hence the name.

It has the merit of being independent of any specific cut-off score or class distribution.

It plots $F_0(c)$ on the x-axis against $F_1(c)$ on the y-axis: that is, false positive rate against true positive rate.

Typically, a ROC curve looks like this:

[Figure: a typical ROC curve. x-axis: $F_0$ (false positive rate); y-axis: $F_1$ (true positive rate).]

The blue line is the ROC curve.

The red line is a reference line (it represents an uninformative model).

Characteristics of the ROC curve >

True positive rate ($F_1(c)$) is also known as sensitivity.

True negative rate ($1 - F_0(c)$) is also known as specificity.

The ROC curve shows the trade-off between true positive rate and true negative rate. In general, as one is increased, so the other decreases.

Must pass through point (0,0) since this is the extreme case when the cut-off is so low that no scores are less (eg all applications are accepted).

Must pass through point (1,1) since this is the extreme case when the cut-off is so high that all scores are less (eg all applications are rejected).

The best model has a ROC curve that passes through (0,1) since this is the case when there are no errors of either type (ie $F_0(c) = 0$ and $1 - F_1(c) = 0$).

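A model that has no discriminatory power is such that $F_0(c) = F_1(c)$ for all $c$. This is represented by a straight line from (0,0) to (1,1): the red line in the example above.

The slope on the ROC curve is $f_1(c) / f_0(c)$, where $f_0 = F_0'$ and $f_1 = F_1'$ are the corresponding density functions.

Proof: the curve is traced out by the points $(F_0(c), F_1(c))$, therefore

$$\frac{dF_1}{dF_0} = \frac{dF_1(c)/dc}{dF_0(c)/dc} = \frac{f_1(c)}{f_0(c)}.$$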

Example 3.2

Again, consider the 16 applicants from Example 3.1, with different scores and outcomes.

Score:     8  10  21  22  25  30  35  42  45  46  51  52  59  64  70  78
Outcome:   1   0   1   0   1   1   0   0   0   0   0   0   1   0   0   0

The ROC curve is given as follows.

[Figure: ROC curve for the 16 applicants. x-axis: $F_0$ (false positive rate); y-axis: $F_1$ (true positive rate).]



Exercise 3.1

Consider the following 16 applicants with different scores and outcomes.

Score:     5  10  12  20  26  28  32  42  45  52  55  60  62  75  82  99
Outcome:   0   1   0   1   0   0   1   0   0   0   0   0   1   0   0   0

1. Draw a graph showing empirical distributions $F_0$ and $F_1$.

2. Plot the ROC curve based on these results.

Comparing models using the ROC curve >

Consider two models A and B that produce two different ROC curves on the

same validation data set.

ROC curve for model A is blue and model B is green.

[Figure: ROC curves for models A and B. x-axis: false positive rate; y-axis: true positive rate.]

Model B outperforms model A over the whole range of the curve, since its

curve is always higher, so B seems to be the better model.

However, not all comparisons between ROC curves are so clear-cut.

Consider:

ROC curve for model A is blue and model B is green.

[Figure: ROC curves for models A and B, which cross. x-axis: false positive rate; y-axis: true positive rate.]

Model A is good for low false positive rates, whereas model B is good for

high false positive rates. Therefore, it is difficult to determine a “best”

model.

For credit scoring, it is low cut-offs (eg rejecting few applications) that are

usually considered, so it is the lower left of the ROC curve which is usually

most important. However, where do we draw the line for such

comparisons?

The ROC curve is useful to view behaviour of a model across different cut-

off scores.

However, it does not give a single measure of discrimination, which is what

we really want for model comparison.



Area under the ROC curve >

A popular measure of discrimination is the area under the ROC curve (AUC) given by

$$\mathrm{AUC} = \int_{-\infty}^{\infty} F_1(c)\, dF_0(c).$$

In particular,

AUC=0.5 corresponds to a model with no classification power.

AUC=1 corresponds to a model with maximal classification power.

Models can be directly compared using their AUC. If model A has a higher AUC than model B then it is considered the better

model in terms of discriminatory power.


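Proof that AUC=0.5 for a model with no classification power:

If a model has no classification power, then $F_0(c) = F_1(c)$ for all $c$.

Therefore,

$$\mathrm{AUC} = \int_{-\infty}^{\infty} F_0(c)\, dF_0(c) = \left[ \tfrac{1}{2} (F_0(c))^2 \right]_{-\infty}^{\infty} = \tfrac{1}{2} [1 - 0] = \tfrac{1}{2}.$$

Proof that AUC=1 for a model with maximal classification power:

If a model has maximal classification power, then there exists a cut-off score $c^*$ such that $F_0(c^*) = 0$ and $F_1(c^*) = 1$ (ie no errors).

Since $F_0$ and $F_1$ are both CDFs,

for all $c \le c^*$: $F_0(c) = 0$, hence $dF_0(c) = 0$,

for all $c \ge c^*$: $F_1(c) = 1$.

Therefore,

$$\mathrm{AUC} = \int_{-\infty}^{\infty} F_1(c)\, dF_0(c) = \int_{c^*}^{\infty} F_1(c)\, dF_0(c) = \int_{c^*}^{\infty} dF_0(c) = \left[ F_0(c) \right]_{c^*}^{\infty} = 1 - 0 = 1.$$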

Estimate of AUC >

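Suppose we have a validation data set with $n$ observations and $m$ instances of scores indexed in rank order:

$$s_{(1)} < s_{(2)} < \dots < s_{(m)}$$

with empirical estimates $\hat{F}_0$ and $\hat{F}_1$ for $F_0$ and $F_1$ respectively.

Since $\mathrm{AUC} = \int F_1(c)\, dF_0(c)$, we estimate AUC as

$$\widehat{\mathrm{AUC}} = \sum_{j=1}^{m} \tfrac{1}{2} \left( \hat{F}_1(s_{(j)}) + \hat{F}_1(s_{(j-1)}) \right) \left[ \hat{F}_0(s_{(j)}) - \hat{F}_0(s_{(j-1)}) \right]$$

and using $\hat{F}_0(s_{(0)}) = \hat{F}_1(s_{(0)}) = 0$.

This uses the trapezoid rule to estimate the area of segments of the ROC curve where multiple observations exist with the same score but different outcome.

[Figure: a trapezoidal segment of the ROC curve, with sides labelled a, b and c.]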



Exercise 3.2

Consider the following table of empirical CDFs $F_0$ and $F_1$ for two scorecards A and B.

        A                B
  F0      F1       F0      F1
  0.1     0.35     0.2     0.4
  0.25    0.6      0.4     0.7
  0.5     0.8      0.7     0.95
  0.75    0.95     0.9     1

1. Draw ROC curves for both scorecards.

2. Interpret the relative difference in performance for each scorecard given their ROC curves.

3. Which scorecard is best in terms of AUC?



Other classification performance measures >

The ROC curve and AUC are common measures of classification performance

in credit scoring (and other application areas).

However, some alternatives also exist:

Gini coefficient = $2 \times \mathrm{AUC} - 1$;

Information Gain;

Kolmogorov-Smirnov statistic;

Cumulative accuracy profile (CAP) and Accuracy ratio (= Gini).

We will not cover them in any detail in this course.



Probability calibration >

The classification performance measures only give us a measure of how

well the models discriminate between classes.

However, quite often we are interested in the probability estimate of an

event (eg PD).

To determine the accuracy of the probability estimates we compare

against the observed frequency of the event within groups of

observations.

It is natural to group them by risk grades.

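Remember that the risk grade is specified by a function of score, $g(s) \in \{1, \dots, G\}$, where $G$ is the number of risk grades (see Chapter 1, slide 24).

Then, given a validation data set $[(x_1, y_1), (x_2, y_2), \dots, (x_n, y_n)]$, the sample mean estimated probability of outcome $y = 1$ for risk grade $k$ is

$$\hat{p}_k = \frac{1}{n_k} \sum_{i=1}^{n} \hat{P}(Y = 1 \mid x_i)\, I(g(s(x_i)) = k)$$

where $n_k$ is the number of scores in grade $k$:

$$n_k = \left| \{ i \in \{1, \dots, n\} : g(s(x_i)) = k \} \right|$$

and $I(\cdot)$ is the indicator function.

The default model gives us both $s(x)$ and $\hat{P}(Y = 1 \mid x)$.

The observed frequencies are given by

$$o_k = \frac{1}{n_k} \sum_{i=1}^{n} y_i\, I(g(s(x_i)) = k).$$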
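The grouping by risk grade can be sketched as follows. Everything in this Python example is hypothetical and for illustration only: the eight validation records, the three-grade function `grade` and the helper name `calibration_by_grade` are assumptions, not taken from the course. The sketch also assumes every grade contains at least one observation.

```python
# Hypothetical validation data: (score, estimated probability of default, observed outcome y).
data = [
    (10, 0.60, 1), (15, 0.50, 0), (20, 0.40, 1),
    (40, 0.20, 0), (45, 0.20, 1), (50, 0.10, 0),
    (70, 0.05, 0), (80, 0.05, 0),
]


def grade(score):
    """Hypothetical grading function g: maps a score to grade 1, 2 or 3."""
    if score < 30:
        return 1
    if score < 60:
        return 2
    return 3


def calibration_by_grade(data, grade, n_grades):
    """Per grade: (grade, count, mean estimated probability, observed frequency)."""
    rows = []
    for k in range(1, n_grades + 1):
        members = [(p, y) for s, p, y in data if grade(s) == k]
        n_k = len(members)  # assumed non-zero for every grade
        mean_estimate = sum(p for p, _ in members) / n_k
        observed = sum(y for _, y in members) / n_k
        rows.append((k, n_k, mean_estimate, observed))
    return rows


rows = calibration_by_grade(data, grade, 3)
for k, n_k, mean_estimate, observed in rows:
    print(f"grade {k}: n = {n_k}, estimated = {mean_estimate:.3f}, observed = {observed:.3f}")
```

Each row pairs a mean estimated probability with an observed frequency, which is the comparison a calibration check needs.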



Probability calibration graph >

We can compare expected probabilities with observed probabilities

graphically on a probability calibration graph:

the mean estimated probability $\hat{p}_k$ is plotted on the x-axis against the observed frequency $o_k$ on the y-axis for each grade $k \in \{1, \dots, G\}$.

The scorecard is well-calibrated if estimated probabilities are similar to

observed frequencies. On the probability calibration graph, this means

points sit close to the diagonal from (0,0) to (1,1).



Example 3.3

Interpret each of the five models M1 to M5 shown in the following probability calibration graphs.

[Figure: two probability calibration graphs, the first showing models M1, M2 and M3 and the second showing models M1, M4 and M5. x-axis: estimated probability; y-axis: observed probability.]

Solution

M1. Perfect calibration of estimated probability with observation.

M2. Consistently overestimates probability.

M3. Consistently underestimates probability.

M4. Generally fine for estimating mid-range probabilities (around 0.5) but underestimates extreme probabilities (ie estimates are too conservative).

M5. Produces too many extreme probability estimates.

Example 3.4

This graph shows probability calibration graphs comparing two models over three risk grades. What do they mean?

[Figure: probability calibration graph showing Model (1), Model (2) and the best possible calibration line. y-axis: observed probability.]

Solution

Model (1) underestimates probabilities of default, whilst model (2) gives

much better estimates.

But is the calibration statistically significant?



Hosmer-Lemeshow Test >

It is useful to have a test of probability calibration across all risk grades.

We can use the Hosmer-Lemeshow Test to do this.

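The Hosmer-Lemeshow Test is a form of Chi-square test. The null hypothesis is that the observed probabilities are not different from the estimated probabilities and the alternative hypothesis is that there is a difference.

Writing $o_k$ for the observed frequency and $\hat{p}_k$ for the mean estimated probability in risk grade $k$:

Null hypothesis: $o_k = \hat{p}_k$ for all $k \in \{1, \dots, G\}$.

Alternative hypothesis: $o_k \ne \hat{p}_k$ for some $k \in \{1, \dots, G\}$.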

Chi-square Test >

Recall that the Chi-square test is based on observed and expected frequencies, $O_k$ and $E_k$ respectively, falling into $K$ groups, $k \in \{1, \dots, K\}$.

It tests the null hypothesis against the alternative hypothesis:

Null hypothesis: $O_k = E_k$ for all $k \in \{1, \dots, K\}$.

Alternative hypothesis: $O_k \ne E_k$ for some $k \in \{1, \dots, K\}$.

The chi-square statistic is calculated as

$$\chi^2 = \sum_{k=1}^{K} \frac{(O_k - E_k)^2}{E_k}.$$

And, under the null hypothesis, this follows a chi-square distribution with $K - m$ degrees of freedom, where $m$ is the reduction in degrees of freedom within the groups.

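The chi-square test is used to test probability calibration by testing within each risk grade how many expected goods were really good ($y = 0$) and how many expected bads were really bad ($y = 1$). As such we consider two sets of expectations/observations: the goods and the bads.

The following frequencies are used for each risk grade $k$, where $n_k$ is the number of observations in the grade, $\hat{p}_k$ is the mean estimated probability of a bad outcome and $o_k$ is the observed frequency of bad outcomes:

Number of expected goods = $n_k (1 - \hat{p}_k)$

Number of observed goods = $n_k (1 - o_k)$

Number of expected bads = $n_k \hat{p}_k$

Number of observed bads = $n_k o_k$

Then the chi-square statistic is:

$$\chi^2 = \sum_{k=1}^{G} \frac{\left( n_k (1 - o_k) - n_k (1 - \hat{p}_k) \right)^2}{n_k (1 - \hat{p}_k)} + \sum_{k=1}^{G} \frac{(n_k o_k - n_k \hat{p}_k)^2}{n_k \hat{p}_k}$$

$$= \sum_{k=1}^{G} n_k (o_k - \hat{p}_k)^2 \left[ \frac{1}{1 - \hat{p}_k} + \frac{1}{\hat{p}_k} \right] = \sum_{k=1}^{G} \frac{n_k (o_k - \hat{p}_k)^2}{\hat{p}_k (1 - \hat{p}_k)}.$$

Since $K = 2G$, the number of degrees of freedom is $2G - m$, but what is $m$?

The series of good/bad observations are highly dependent, so $m$ will be large. Simulation studies have shown that an optimal value is given by

Degrees of freedom = $G - 2$.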

Example 3.5

Use the grading system and probabilities from Example 16.2 to conduct a Hosmer-Lemeshow Test for each model.

                                Model (1)                   Model (2)
Grade   Number   Observed   Estimated   Chi-square    Estimated   Chi-square
                 frequency  probability term          probability term
A       6        1/6        0.0085      17.8          0.132       0.06
B       5        1/5        0.0434      3.0           0.307       0.27
C       5        3/5        0.2083      4.7           0.568       0.02
Sum                                     25.4                      0.35
P-value *                               <0.001                    0.55

* Chi-square tests are at 1 degree of freedom.

These results suggest that model (1) does not calibrate observations well (null hypothesis is rejected at the 1% significance level).

However, the null hypothesis is not rejected for model (2). Hence, the observations are typical given the estimated probabilities.
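The figures in this example can be checked with a short Python sketch. It assumes the statistic takes the form of the sum over grades of $n_k (o_k - \hat{p}_k)^2 / (\hat{p}_k (1 - \hat{p}_k))$ and that the test uses 1 degree of freedom, as stated in the footnote. The p-value helper only covers the 1-degree-of-freedom case, using the identity that a chi-square variable with 1 degree of freedom is a squared standard normal. The function names are hypothetical.

```python
import math


def hl_statistic(n, observed, estimated):
    """Hosmer-Lemeshow statistic summed over risk grades."""
    return sum(
        n_k * (o_k - p_k) ** 2 / (p_k * (1 - p_k))
        for n_k, o_k, p_k in zip(n, observed, estimated)
    )


def chi2_sf_1df(x):
    """Upper-tail probability of a chi-square variable with 1 degree of freedom."""
    return math.erfc(math.sqrt(x / 2))


# Grades A, B, C from Example 3.5.
n = [6, 5, 5]
observed = [1 / 6, 1 / 5, 3 / 5]
model1 = [0.0085, 0.0434, 0.2083]
model2 = [0.132, 0.307, 0.568]

stat1 = hl_statistic(n, observed, model1)
stat2 = hl_statistic(n, observed, model2)
p1 = chi2_sf_1df(stat1)
p2 = chi2_sf_1df(stat2)
print(f"Model (1): statistic = {stat1:.1f}, p-value = {p1:.2g}")
print(f"Model (2): statistic = {stat2:.2f}, p-value = {p2:.2f}")
```

With more than three grades the degrees of freedom exceed 1, and a general chi-square tail function (for example from a statistics library) would be needed instead of `chi2_sf_1df`.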



Exercise 3.3

A portfolio of 900 score cards is graded A to D. The following table shows observed PD ($o_k$), along with estimated PD ($\hat{p}_k$) for two models.

Grade   Number   Observed PD   Estimated PD   Estimated PD
                               Model (1)      Model (2)
A       200      0.3           0.3            0.32
B       300      0.1           0.2            0.12
C       300      0.05          0.1            0.07
D       100      0.03          0.05           0.02

1. Draw a probability calibration graph of this data.

2. Which model is apparently better calibrated, and why?

3. Perform a Hosmer-Lemeshow Test to determine whether either model PD

estimates are consistent with the observed PD.



Further information can be found about the Hosmer-Lemeshow test,

especially in relation to logistic regression, in

Dobson AJ and Barnett AG (2008). An Introduction to Generalized Linear

Models (CRC press), pp.135-137

(available in the library).



Cost-based measures >

Ultimately the bank is interested in the profit that can be derived from borrowers and avoiding any losses.

So when assessing a credit risk model, we would like to use a measure of profit/loss and choose the model that maximizes expected profit.

For application scorecards we can provide a table of expected profit/loss

for accept/reject decision and outcome.

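                          Outcome
                          Positive    Negative
Lender decision  Reject   0           0
                 Accept   $-l$        $g$

Here $g$ is the expected gain when an accepted borrower does not default and $l$ is the expected loss when an accepted borrower defaults.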

The number in each cell refers to the expected profit from each outcome:

Clearly, if the application is rejected then there is no profit (or loss).

If the application is accepted then there is a gain $g$ if the borrower does not default and a loss $l$ if they do default.

We suppose these amounts can be treated as a constant (use an expected value) across all cases.

The number of cases when an account is accepted but the outcome is $y$ is $n P(Y = y)(1 - F_y(c))$, where $n$ is the number of expected cases.

Hence, with $p_0 = P(Y = 0)$, expected profit is computed as

$$n p_0 (1 - F_0(c))\, g - n (1 - F_1(c))(1 - p_0)\, l = n g \left[ p_0 - p_0 F_0(c) - r (1 - p_0)(1 - F_1(c)) \right]$$

where $r = l / g$.

Typically, we expect $r > 1$.

The terms $n g$ and $p_0$ are fixed, relative to the model.

Therefore, we only need consider the remaining terms in square brackets, which represent the relative cost measure (per account):

$$M(c) = p_0 F_0(c) + r (1 - p_0)(1 - F_1(c)).$$

It is a cost because it is a negative term on profit.

Notice that this is very similar to the simple error rate except for the relative weight between the two types of error.

Optimal cut-off score >

The cost measure $M(c)$ allows us to optimize the cut-off score by cost.

Remembering that the lender may have a threshold $a$ for the proportion of applications that need to be accepted (volume), the optimal cut-off score is given by

$$c^* = \arg\min_c M(c)$$

such that $P(S > c) \ge a$.

As the loss-to-gain ratio $r$ increases, so the cut-off increases (reflecting the increased cost of default risk). As $r$ decreases, so the cut-off decreases.

Note that there is not necessarily a unique optimal cut-off score (there could be several). In general, the relationship between the cut-off score and cost measure is not monotonic.

Example 3.6

Again, consider the 16 applicants from Example 3.1.

Score:     8  10  21  22  25  30  35  42  45  46  51  52  59  64  70  78
Outcome:   1   0   1   0   1   1   0   0   0   0   0   0   1   0   0   0

If we set $r = 5$, then the relative mean cost $M(c)$ for each value of the cut-off $c$ is shown in this graph.

[Figure: relative mean cost against cut-off score.]

We see that the expected cost is minimized at a cut-off score of 30 (the red spot).



Analytic solution >

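If $F_0$ and $F_1$ are differentiable, with densities $f_0$ and $f_1$, then an analytic solution is possible.

A minimum for the relative cost measure $M(c) = p_0 F_0(c) + r (1 - p_0)(1 - F_1(c))$ is found when

$$\frac{dM(c)}{dc} = p_0 f_0(c) - r (1 - p_0) f_1(c) = 0$$

$$\Rightarrow \frac{f_0(c)\, p_0}{f_1(c)\, (1 - p_0)} = r$$

$$\Rightarrow \frac{P(Y = 0 \mid S = c)}{P(Y = 1 \mid S = c)} = r$$

by Bayes theorem (assuming the density of $S$ at $c$ is positive)

$$\Rightarrow P(Y = 0 \mid S = c) = \frac{r}{1 + r}.$$

Since $s = s(x)$ for a score link-function and vector of predictor variables $x$,

$$P(Y = 0 \mid x) = \frac{r}{1 + r}$$

for all $x$ such that $s(x) = c$.

For a general link function $h$, $s(x) = h(P(Y = 0 \mid x))$, so $c = h\left( \frac{r}{1 + r} \right)$ is a general solution.

In particular, for the log-odds score, $h(P(Y = 0 \mid x)) = \log \frac{P(Y = 0 \mid x)}{1 - P(Y = 0 \mid x)}$ (see Chapter 2), therefore

$$c = \log \left[ \frac{r / (1 + r)}{1 / (1 + r)} \right] = \log r.$$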
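A small numerical check of the log-odds case can be written in Python. It assumes the score is the log-odds of a good outcome, so that the optimal cut-off is the logarithm of the loss-to-gain ratio, and it uses an illustrative ratio of 5; the function names are hypothetical.

```python
import math


def log_odds_cutoff(r):
    """Cut-off on a log-odds-of-good score for loss-to-gain ratio r (assumed form: log r)."""
    return math.log(r)


def prob_good(score):
    """Invert the log-odds link: probability of a good outcome at a given score."""
    return 1.0 / (1.0 + math.exp(-score))


r = 5.0  # illustrative ratio of loss on default to gain on repayment
c = log_odds_cutoff(r)
p_good = prob_good(c)

# At the cut-off the expected gain and expected loss should balance (gain normalised to 1).
expected_profit = p_good * 1.0 - (1.0 - p_good) * r
print(f"cut-off = {c:.4f}, P(good) at cut-off = {p_good:.4f}, expected profit = {expected_profit:.2e}")
```

At the cut-off the probability of a good outcome is 5/6, which is exactly the point where accepting the applicant breaks even.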



Test data sets and overfitting >

All performance measures make use of a validation data set.

The validation data set could be any data set of observations with the

right predictor variables and outcomes.

However, in order to avoid overfitting it is best to use observations that are independent of the data used to train the model. Such a data set of observations is a hold-out test set.

Overfitting is the phenomenon whereby the estimate of performance of a model is upwardly biased when measured on data that contains training observations.

If we are particularly interested in using our model for forecasting, as we typically are in retail finance, then the test observations should also be selected from a time period after the observations in the training data set.
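One simple way to build such an out-of-time test set is to split on a date. The Python sketch below is purely illustrative: the five accounts, the split date and the function name `out_of_time_split` are all hypothetical.

```python
from datetime import date

# Hypothetical accounts: (application date, score, outcome).
accounts = [
    (date(2010, 3, 1), 35, 0),
    (date(2010, 9, 15), 22, 1),
    (date(2011, 2, 10), 51, 0),
    (date(2011, 8, 5), 30, 1),
    (date(2012, 1, 20), 64, 0),
]


def out_of_time_split(accounts, split_date):
    """Train on accounts opened before split_date; test on those opened on or after it."""
    train = [a for a in accounts if a[0] < split_date]
    test = [a for a in accounts if a[0] >= split_date]
    return train, test


train, test = out_of_time_split(accounts, date(2011, 6, 1))
print(f"{len(train)} training accounts, {len(test)} test accounts")
```

Because every test account is later than every training account, performance measured on the test set mimics how the model would be used for forecasting.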



Review of Chapter 3 >

In this chapter we considered ways to measure performance of a default

model. Topics included:

Error types

Receiver Operating Characteristic (ROC) curve

Area under the ROC curve (AUC)

Probability calibration

Cost-based measure and optimal cut-off score

Testing and forecasting