Credit Scoring Model Validation - UvA/FNWI (Science ... · PDF fileCredit Scoring Model Validation Master Thesis June 29, 2008 Xuezhen Wu Referent: Prof. Peter Spreij ... of credit

Faculty of ScienceKorteweg-de Vries Institute for Mathematics

Credit Scoring Model Validation

Master Thesis

June 29, 2008

Xuezhen Wu

Referent: Prof. Peter SpreijKorteweg-de Vries Institute for Mathematics

Abstract

Under the framework of Basel II, banks are allowed to use their own internalrating-based (IRB) approaches for key drivers of credit risk as primary inputsto the capital calculation. In addition, regulatory validations of the internalrating/scoring system is required. Assessing the discriminatory power and exam-ining the calibration of a credit scoring systems are two different important tasksof validation. This paper discusses several commonly used statistical approachesfor measuring the discriminatory power and calibration and shows that such ap-proaches should be interpreted with caution. When the objective of the validationof a credit scoring model is to confirm that the developed scoring model is stillvalid for the current applicant population, one should first check whether theportfolio structure changed over time or not. Because in some cases, significantshifts of the portfolio structure might happen.

Acknowledgement

This thesis is the result of my four months internship at Credit Risk Management,Risk Analytics and Instruments, Deutsche Bank, Frankfurt am Main, and is alsothe final part of my master study Stochastics and Financial Mathematics at theUniversiteit van Amsterdam. It is a pleasure to thank the many people who madethis thesis possible.

First of all, I would like to thank Thomas Werner for hiring me as an internat Deutsche Bank, giving the opportunity to do this practical research, super-visoring me and providing resources. I also want to express my gratitude toMichael Luxenburger, Martin Hillebrand for their advice and supervision dur-ing my internship. Thanks to other colleagues at DB who had given me help,useful information and suggestions. This experience provided me with deeper in-sights into credit risk management and broadened my perspective on the bankingindustry.

Meanwhile, I would like to thank my supervisor Dr. Peter Spreij, Korteweg-deVries Institute for Mathematics, Universiteit van Amsterdam for his guidanceand support during my thesis and study at UvA. Thanks also to the coordinatorof the master programme, Stochastics and Financial Mathematics at UvA, Dr.Bert van Es, who always gave me help, advice and encouragement.

I would like to thank many people who have taught during my master study inHolland. I acquired a lot of useful knowledge and skills during this study. It hasbeen a wonderful experience and also a turning point in my life.

Last but not least, I want to thank my parents and friends for their constantencouragement and love. To them I dedicate this thesis.

Contents

1 Introduction 1

2 Credit Risk Management & Basel II 3

2.1 Credit Risk . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3

2.2 Basel II . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3

2.3 The Internal Ratings Based Approach . . . . . . . . . . . . . . . . 6

2.4 Validation under the Basel II . . . . . . . . . . . . . . . . . . . . 7

3 Scoring Model Framework 9

3.1 Fundamentals . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9

3.2 Model Development . . . . . . . . . . . . . . . . . . . . . . . . . . 10

3.2.1 Data Source . . . . . . . . . . . . . . . . . . . . . . . . . 10

3.2.2 Data Specification . . . . . . . . . . . . . . . . . . . . . . 11

3.2.3 Missing Values and Outliers . . . . . . . . . . . . . . . . . 12

3.2.4 Statistical Model Selection . . . . . . . . . . . . . . . . . . 12

3.2.5 Initial Selection of Variables . . . . . . . . . . . . . . . . . 13

3.2.6 Final Model Production . . . . . . . . . . . . . . . . . . . 14

3.3 Validation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14

i

CONTENTS ii

4 Methodology Background 16

4.1 Statistical Model Selection for Scoring Systems . . . . . . . . . . . 16

4.2 Bootstrapping and Confidence Intervals . . . . . . . . . . . . . . 18

4.3 The ROC Approach . . . . . . . . . . . . . . . . . . . . . . . . . 21

4.3.1 Receiver Operating Characteristics . . . . . . . . . . . . . 21

4.3.2 Analysis of Areas Under ROC Curves . . . . . . . . . . . 23

4.4 Statistical Tests for Rating System Calibration . . . . . . . . . . . 26

4.4.1 Hosmer-Lemeshow Test . . . . . . . . . . . . . . . . . . . . 26

4.4.2 Spiegelhalter Test . . . . . . . . . . . . . . . . . . . . . . . 28

5 Empirical Analysis and Results 30

5.1 Model Description . . . . . . . . . . . . . . . . . . . . . . . . . . 30

5.2 The Validation Data Set . . . . . . . . . . . . . . . . . . . . . . . 31

5.3 Variables Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . 32

5.4 Testing for Discriminatory Power . . . . . . . . . . . . . . . . . . 33

5.5 Statistical Tests for PD Validation . . . . . . . . . . . . . . . . . 35

6 Discussion & Conclusion 36

A Ratio Analysis 37

B Results of Bootstrapping 43

C Transformation of the mean of input ratios 46

List of Tables

2.1 Basel II IRB Foundation and Advanced Approach . . . . . . . . . 6

4.1 Decision results . . . . . . . . . . . . . . . . . . . . . . . . . . . . 21

5.1 Number of observations and defaults per year . . . . . . . . . . . 31

5.2 Actual GDP of Germany 2003-2005 . . . . . . . . . . . . . . . . . 33

5.3 Confidence Intervals for AUROC . . . . . . . . . . . . . . . . . . 35

5.4 Hosmer-Lemeshow-Chisquare Test for PD Validation . . . . . . . 35

5.5 Spiegelhalter Test for PD Validation . . . . . . . . . . . . . . . . 35

B.1 99% percentile confidence interval for the mean . . . . . . . . . . 43

C.1 Table of the log transformation of the mean of input ratios . . . . 46

iii

List of Figures

2.1 Three Pillars of Basel II . . . . . . . . . . . . . . . . . . . . . . . 5

2.2 Three approaches to manage credit risk given in Basel II . . . . . 5

3.1 credit scoring model development . . . . . . . . . . . . . . . . . . 10

3.2 Data Split . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15

4.1 Rating Score distributions for defaulters and non-defaulters . . . . 21

4.2 Receiver operating characteristics curves . . . . . . . . . . . . . . 23

5.1 Score-PD . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 32

5.2 Score-KDE . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 32

5.3 ROC Curves for the year 2003 2004 and 2005 . . . . . . . . . . . 34

A.1 Histogram and Boxplot of ratio P02101 . . . . . . . . . . . . . . . 37







iv

LIST OF FIGURES v














Chapter 1

Introduction

Recent instabilities in the financial sector lead to the development of Basel II-The new international capital standard for credit institutions. Under Basel II,banks are permitted to use internal-ratings based (IRB) approaches to determinethe risk weights relevant for calculating the capital charge according to their owncredit scoring model. Consequently, banks are obliged to validate their internalprocesses for differentiating risk as well as for quantifying that risk. Therefore,Validation represents a major challenge for both banks and supervisors.

This paper focuses on the validation of the credit scoring system under Basel II,and is motivated by a desire to explain the methodologies that could be used tovalidate a credit scoring model and discuss the practical implementation prob-lems. We will, first, introduce the Basel II and the IRB approach, and thenexplain the main procedure of developing and validating a scoring model underthe framework of Basel II.

The main principle of a credit scoring system is assigning to each borrower a scorein order to separate the bad borrowers (defaulters) and the good borrowers (non-defaulters). For example, a borrower with high estimated default probability isgiven a high score. So, a scoring system can be seen as a classification tool in thesense of providing indications of the borrower’s possible status in the future. Thisprocedure is commonly called discrimination. Thus, the discriminatory powerof a scoring system denotes the model’s ability to discriminate defaulters fromnon-defaulters. Assessing the discriminatory power is one of the major tasks invalidating a credit scoring model.

In practice, scoring systems also form the basis for pricing credits and calculatingrisk premiums and capital charges. Each score value is associated with an esti-

1

Chapter 1. Introduction 2

mated probability of default. Under the internal rating based (IRB) approach,the capital requirements are determined by bank’s internal estimation of the riskparameters (including the default probability, Loss Given Default and Exposureat Default). This concerns the calibration of the scoring system which is anotherimportant task in scoring model validation.

In chapter 2, the current regulation for credit risk - Basel II will be introduced.It follows by a general overview on the development and validation frameworkof credit rating models, in chapter 3. Nowadays, a lot of emphasis has beengiven to the validation of the internal rating system. Here the validation includesboth the assessment of model discriminatory power and calibration. While thediscriminatory power of a scoring model depends on the difference of the scoredistribution of the defaulters and non-defaulters, the calibration of a scoringsystem depends on the difference of the estimated PD and the observed defaultrates. In Chapter 4, we will discuss the commonly used statistical methods formeasuring the discriminatory power and calibration of scoring systems. Finally,in Chapter 5, we will demonstrate the application of all concepts and give theempirical results.

All the statistical results are obtained with the aid of software packages SAS andMS Excel.

Chapter 2

Credit Risk Management &Basel II

2.1 Credit Risk

In terms of risk the business of financial institutions can be described as manage-ment of credit risk, market risk, operational risk, legal and regulatory risk, andliquidity risk. Credit risk, the risk of loss due to uncertainty about an obligor’sability to meet its obligations in accordance with agreed terms, has always con-stituted by far the biggest risk for banks worldwide. Thus, banks should have akeen awareness of credit risk and the idea that credit risk exposures need to bemore actively and effectively managed.

2.2 Basel II

The current regulations for risks are the result of a recommendation of the BaselCommittee on Banking Supervision - the New Basel Capital Accord (Basel II,2004) which is an amendment of Basel I Capital Accord (published in 1988).The main objectives of Basel I were to promote the soundness and stability ofthe banking system and adopt a standard approach across banks in differentcountries. Although it was initially intended to be only for the internationalactive banks in the G-10 countries, it was finally adopted by over 120 countriesand recognized as a global standard. However, the shortcomings of the Basel Ibecame increasingly obvious over time. Principal among them is that regulatorycapital ratios were increasingly becoming less meaningful as measures of true

3

Chapter 2. Credit Risk Management & Basel II 4

capital adequacy, particularly for large complex financial institutions. Moreover,simplicity of Basel I encouraged rapid developments of various types of productsthat overcomes regulatory capital.

Following the publication of successive rounds of proposals between 1999 and2003, the Basel Committee members agreed in mid-2004 on the Basel II CapitalAccord. The main objectives of this revised capital adequacy framework are:Integrating an effective approach to supervision, incentives for banks to improvetheir risk management and measurement, and risk-based capital requirement clos-ing gap between regulatory and economic capital charges. (Nout Wellink, 2006[9])

The Basel II goals shall be achieved through three mutually supporting pillars(Figure 2.1):

• Pillar 1 defines the rules for calculating the minimum capital requirementsfor credit, operational and market risks. The minimum capital requirementsare composed of three fundamental elements: a definition of regulatory cap-ital, risk weighted assets and the minimum ratio of capital to risk weightedassets.

• Pillar 2 provides guidance on the supervisory review process which enablessupervisors to take early actions in order to prevent capital from fallingbelow the minimum requirements for supporting the risk characteristics ofa particular bank and requires supervisors to take rapid remedial action ifcapital is not maintained or restored.

• Pillar 3 recognises that market discipline has the potential to reinforce min-imum capital standards (Pillar 1) and the supervisory review process (Pillar2), and so promote safety and soundness in banks and financial systems.

Basel II brings significant changes, especially in the calculation of capital require-ments. Banks may choose between a standardized approach that is similar to theone already used under Basel I and a new, more demanding, internal rating basedapproach (IRB) which comes in two versions: the foundation IRB approach andthe advanced IRB approach (Figure 2.2):

• The standardized approach is based on external ratings provided by ratingagencies such as Moodys and Standard & Poor, albeit more differentiatedunder Basel I.

• The foundation IRB approach which allow banks to use their internal ratingsystem to calculate the probability of default (PD), but other credit riskinputs such as loss given default (LGD) are obtained from external sources(Figure 2.3).


Su

per

viso

ry

Rev

iew

Mar

ket

Dis

cip

line

Min

imu

m

Cap

ital

Req

uir

emen

ts

Stability - Safety - Soundness

Operational Risk• Basic Indicator Approach• Standardised Approach • Advanced Measurement Approach

Credit Risk• Standardised• IRB Foundation • IRB Advanced

‘Quantitative’ ‘Qualitative’ ‘Market forces’

Figure 2.1: Three Pillars of Basel II

Credit risk

Risk assessment

Risk calculation

External ratings Internal ratings

Standardized approach

FoundationIRB

AdvancedIRB

Increasing Risk Sensitivity

Increasing Complexity

Increasing Qualitative Standards

Decreasing Capital Requirement

Figure 2.2: Three approaches to manage credit risk given in Basel II


• The IRB advanced approach where the banks’ own estimates of input dataare used exclusively (probability of default, loss given default, the exposureat default (EAD), and effective maturities (M)).

Table 2.1: Basel II IRB Foundation and Advanced Approach

Data input Foundation IRB Advanced IRBPD Bank internal estimates Bank internal estimates

LGD Supervisory values Bank internal estimatesEAD Supervisory values Bank internal estimatesM Supervisory values, or bank

internal estimatesBank internal estimates

2.3 The Internal Ratings Based Approach

The IRB approach is regarded as the prerequisite for advanced credit risk manage-ment, and it is necessary for each financial institution to develop its own internalrating system. Once a bank adopts an IRB approach for part of its holdings, itis expected to extend it across the entire banking group. A phased rollout is,however, accepted in principle and has to be agreed upon by supervisors [10].

As mentioned in the previous section, for a given exposure, by using the IRBapproach, the derivation of risk weighted assets which depends on estimates ofthe PD, LGD, EAD and, in some cases, effective maturity (M):

• Probability of default (PD) per rating grade, which gives the average per-centage of obligors that default in this rating grade in the course of oneyear.

• exposure at default (EAD), which gives an estimate of the maount outstand-ing (drawn amounts plus likely future drawdowns of yet undrawn lines) incase the borrower defatuls.

• loss given default (LGD), which gives the percentage of exposure the bankmight lose in case the borrower defaults. These losses are usually shown aspercentage of EAD, and depend, amongst others, on the type and amountof collateral as well as the type of borrower and the expected proceeds fromthe work-out of the assets..


Formulas for calculating risk weighted assets are derived from Merton’s (1974)single asset model to credit portfolios and Vasicek (2002) work [12]:

Correlation (R) = 0.12× 1− exp(−50× PD)

1− exp(−50)

+0.24×[1− 1− exp(−50× PD)

1− exp(−50)

]

Maturity adjustment (b) = (0.08451− 0.05898× ln(PD))2

Capital requirement (K) = LGD×N

[N−1(PD)√

1− R+

√R

1− RN−1(0.999)

]

×1 + (M − 2.5)× b(PD)

1− 1.5× b(PD)

Risk-weighted assets (RWA) = 12.5× EAD×K

Regulatory capital for credit risk = 8%× RWA

where:

- N(.) is the cumulative distribution function (c.d.f.).

- N−1(.) is the inverse cumulative distribution function.

- M is the effective (remaining) maturity.

2.4 Validation under the Basel II

Since implementation of the Basel II Framework continues to move forwardaround the globe, the debate on the IRB approach has changed its accent. Moreemphasis is now given to the validation of the internal rating system used to gen-erate estimates of inputs. When considering the appropriateness of any ratingsystem as the basis for determining capital, there will always be a need to ensureobjectivity, accuracy, stability, and an appropriate level of conservatism. Theterm validation includes a range of processes and activities that assess whetherratings adequately differentiate risk, and whether estimates of risk components(such as PD, LGD, or EAD) appropriately characterize the relevant aspects ofrisks [13].


The Basel Committee on Banking Supervision stated that several principles un-derlying the concept of validation should be considered [14]:

• Validation is fundamentally about assessing the predictive ability of a bank’srisk estimates and the use of ratings in credit processes.

• The bank has primary responsibility for validation.

• Validation is an iterative process.

• There is no single validation method.

• Validation should encompass both quantitative and qualitative elements.

• Validation processes and outcomes should be subject to independent review.

Based on the availability of high-quality data and quantitative rating models,there are two mutually supporting ways to validate bank internal rating systems[2].

1. Result-based validation (also known as backtesting): ex post analysis ofthe rating system’s quantification of credit risk. The probability of default(PD) per rating class or the expected loss (EL) must be compared with therealized default rates or losses.

2. Process-based validation: analysis of the rating system’s interfaces withother processes in the bank and how the rating system is integrated intothe bank’s overall management structure.

Chapter 3

Scoring Model Framework

3.1 Fundamentals

For the banking industry, the scoring methodology has played an important role indeveloping internal rating systems. There are several reasons that can explain thewidespread use of scoring models: First, since credit scoring model are establishedupon statistical models and not on opinions, it offers an objective way to measureand manage risk. Second, the statistical model used to produce credit scorescan be validated. Data have to be carefully used to check the accuracy andperformance of the predictions. Third, the statistical models used by credit scorescan be improved over time as additional data are collected.

Prior to default, there is no sure way of identifying whether a firm would indeeddefault. The purpose of credit risk rating, therefore, is to categorize customersinto various classes, each of which is homogenous in terms of its probability ofdefault (PD). While the score alone does not represent a default probability,model scores are mapped to an empirical probability of default by using financialhistorical data and statistical techniques. Normally, high scores correlate to lowprobability of default.

Broadly speaking, there are three internal components to the credit scoring model:

• There are inputs, which are obtained from the dataset of applicant compa-nies or borrowers.

• There are parameters, which are used to weight the inputs and to controlthe logic of the model.

• There is a well defined statistical algorithm to combine the inputs and theparameters to create a score.

9

Chapter 3. Scoring Model Framework 10

Note that the preparation of inputs, the estimation of parameters and the se-lection of an appropriate statistical model all are based upon statistical validprocedures.

3.2 Model Development

Collect DataExplore dataData cleaning

Characteristicsanalysis

Model produceValidation

Dealing withmissing values

and outliers

Model style selection

Deciding the needfor adjustments

Figure 3.1: credit scoring model development

3.2.1 Data Source

Definition of Defaults The first step in a credit scoring model developmentproject is defining the default event. Traditionally, credit rating or scoring mod-els were developed upon the bankruptcy criteria. However, banks also meet withlosses before the event of bankruptcy. Therefore, in the Basel II Capital Accord,the Basel Committee on Banking Supervision gave a reference definition of thedefault event and announced that banks should use this regulatory reference defi-nition to estimate their internal rating-based models. According to this proposeddefinition, a default is considered to have occurred with regard to a particularobligor when either or both of the two following events have taken place [11].

• The bank considers that the obligor is unlikely to pay its credit obligationsto the banking group in full.

• The obligor is past due more than 90 days on any material credit obligationto the banking group. Overdrafts will be considered as being past due oncethe customer has breached an advised limit or has been advised of a limitsmaller than the current outstanding.


Input Characteristics The following step is the pre-selection of input char-acteristics to be included in the sample. Comparing with importing a snapshotof the entire database, pre-selecting input characteristics makes the model de-velopment process more efficient and increases the developer’s knowledge of in-ternal data. In general, the selected input characteristics should describe themost important credit risk factors, i.e. leverage, asset utilization, liquidity, scale,profitability and operational performance.

3.2.2 Data Specification

Time Horizon Time horizon refers to the period over which the default prob-ability is estimated. The choice of the time horizon is a key decision to buildup a credit scoring model. Depending on the objective for which the credit riskmodel was developed (estimating the short-term or medium-long-term defaultprobability), the time horizon varies. For most banks, it is common to select oneyear as a modelling horizon, as on the one hand one year is long enough to allowbanks to take actions to mitigate credit risk, and on the other hand new obligorinformation and default data may be revealed within one year.

However, a longer time horizon could also be of interest, especially when deci-sions about the allocation of new loans have to be made, but usually here dataunavailability may occur.

Data Splitting Since model building and validation both require samples andthe statistical assessment of the performance of a predicting scoring model canbe highly sensitive to the data set, the given data set at hand should be largeenough to be randomly split into two data sets: one for development and theother for validation. To avoid embedding of unwanted data dependency, sometype of out-of-sample, out-of-time and out-of-universe1 test should be used in thevalidation process. Normally, 60% to 80% of the total sample is used to estimatethe model; the remaining 20% to 40% sample is set aside to validate the model.

Data Exploring Before initiating the actual modeling, it is very useful tocalculate simple statistics for each characteristics, such as mean/median, standard

1Out-of-sample refers to observations that are not used to build a model. Out-of-time refersto observations that are not contemporary with those observations used to build a model. Out-of-universe refers to observations whose distribution differs from those observations used tobuild a model, i.e. the validation data set contains some obligors that are not included in thedata set for building the model.


deviation, and range of values. The interpretation of data also should be checked,for example, to ensure that ’0’ represents zero and not missing values, and toconfirm that any special values such as 999 are documented. This step verifies thatall aspects of the data are understood and offers great insight into the business.

3.2.3 Missing Values and Outliers

Most financial industry data contain missing values or outliers that must prop-erly be managed. Several methods with respect to dealing with missing valuesare available, such as removing all data with missing values or excluding char-acteristics or records that have significant missing values from the model (e.g.,more than 50% is missing), but this would result in too many data being lost.Another straightforward way is substituting the missing values with correspond-ing mean or median values over all observations for the respective time period.While these three methods assume that no further information can be gatheredfrom analyzing the missing data, this is not necessarily true - missing values areusually not random. Missing values may be part of a trend, may be linked toother characteristics, or may indicate bad performance. Therefore, missing valuesshould be analyzed first, and if they are found to be random and performance-neutral, they may be excluded or imputed using statistical techniques; otherwise,if missing values are found to be correlated to the performance of the portfolio,it is preferable to include missing values in the analysis.

Outliers are values that are located far from others for a certain characteristic.They may negatively affect regression results. While the easiest way is removingall extreme data that fall outside of the normal range-for example, at a distanceof more than two or three times the standard deviation, using this solution itis very easy to erroneously remove defaulting companies, considered as outliers.Another technique, called ’winsorisation’, involves setting outliers to a specifiedpercentile of the data.

3.2.4 Statistical Model Selection

There are several statistical methods for building and estimating scoring mod-els, including linear regression models, logit models, probit models, and neuralnetworks. We introduce them in Section 4.1. The most popular one is the logitmodel which assumes that the probability of default is logistically distributed. Inpractice, using the logit model one can directly generate the scorecard, and thecost and speed in model implementation are lower and faster.


3.2.5 Initial Selection of Variables

Univariate Analysis In this first step of initial variable selection, the predic-tive power of each variable will be assessed individually and the weak or illogicalratios will be screened out. This is also known as univariate screening. For thispurpose, the distribution of values of variables is examined and transformationsare performed.

Transformation Financial ratios are expected to have monotonic relationshipswith the observed default rates. Because a non-monotonic variable has an am-biguous relationship to default, it is difficult to draw conclusions on the cred-itworthiness of a company from non-monotonic variables. The monotonicity ofeach continuous variable can be easily tested by dividing the ratios into groupsthat all contain the same number of observations, and plotting them against therespective observed default rates.

In some cases, it is possible that certain ratios produce a non-monotonous func-tion with respect to default rate. For example, the indicator sales growth hasa complex relationship with the default probability. Generally, it is better for afirm to grow than to shrink, however, companies that grow too quickly often findthemselves unable to meet the management challenges presented by such growth.Moreover, this quick growth is unlikely to be financed out of profits, resulting ina possible build up of debt and the associated risks. Hence, in order to includethis kind of ratios into the scoring model, appropriate transformations have tobe performed. This can be done as follows: the value range of the variable aredivided into small disjunct intervals, and the empirical default rate for each smallinterval is determined by the given sample. Then the original values of the ratiosare transformed to the probability of default according to this smoothed relation-ship. Since the value of ratios can be larger than one, the probability of defaultthat we obtained before need to be transformed to a log odds.

Correlation The strongest variables are grouped. However, the correlationamong input ratios should be tested as well. Because, if some highly correlatedindicators are included in the model, the estimated coefficients will be significantlyand systematically biased. Thus, the correlation between all pre-selected variablesshould be calculated to identify subgroups of highly correlated indicators. Foreach correlation subgroup, one can then eliminate some variables and chooseone or more variables, which can represent all the information contained in othercharacteristics, based on both statistical and business/operational considerations.


3.2.6 Final Model Production

However, there is still a large number of potential characteristics available tobe included in the model which tend to ’overfit’ the model. To find out the bestpossible model, one effective way is using a stepwise method which involves includ-ing or excluding characteristics from the model at each step based on statisticalcriteria until the best combination is reached. Forward selection and backwardselection are the two main version of the stepwise method:

(i) Forward selection: Starting with the constant-only model and adding char-acteristics one at a time in order of their predictive power until some cutofflevel is reached (e.g., until no remaining characteristics have p-value lessthan 0.05 or univariate Chi Square above a determined level [15]).

(ii) Backward selection: The opposite of forward selection. Starting with allcharacteristics and deleting one at time in order that they are worst bytheir significant level, until all the remaining characteristics have a p-valuebelow 0.1, for example.

3.3 Validation

Finally, the derived credit risk scoring model has to be validated. The aim ofvalidation is to confirm that the model developed is applicable and to ensurethat the model has not been overfitted. As mentioned before, 60% to 80% ofthe total sample is used to estimate the model; validation is performed on theremaining 20% to 40% of the total sample. In this validation process, it mightinclude comparing the distributions of scored good customers and bad customersacross the development sample and the validation sample or comparing develop-ment statistics for the two samples. Moreover, the model’s quality can also beevaluated by the Receiver Operating Characteristics Curves and other methodswhich will be described in the next Chapter. Significant different results betweenthe development and validation sample by applying any of the preceding methodswill require further analysis. Some people also consider this validation as a partof model development process.

Frequently, scoring models will be validated on a growth period. This validationexercise is similar to the one mentioned above, but with different purposes. Whilethe objective previously is to confirm the robustness and goodness of fit of thescoring model, the objective of the later validation is to confirm that the model isstill valid over time. In some cases, where the development sample is two or threeyears old, significant changes of portfolio structure might have occurred and need


Data for model building

Data for validation (1)

From year T+1 to T+2From year T to T+1

Data for validation (2)

During model building After model building

Figure 3.2: Data Split

to be identified. Several performance test methods should be applied, based onboth classification and prediction ability.

Chapter 4

Methodology Background

4.1 Statistical Model Selection for Scoring Sys-

tems

Assume that there are n data points to be separated, for instance, n borrowers ofa bank. To assess the default risk of a credit applicant, a bank usually identifiesseveral indicator (characteristics) X1, X2, ..., Xp such as total assets, liabilitiesor earning before interest and taxes (EBIT). So, there are p variables availablefor each data point. Let x = (x1, x2, ..., xp) be the vector of observations of therandom variables X = (X1, X2, ..., Xp). We then select a number C of classes inwhich these data will be classified. In the case of a credit scoring model, therecan be two classes (C = 2): default and non-default. Without loss of generality,we assume the default variable y ∈ {0, 1}, with the convention that a non-defaultapplicant is labeled 0, and a default applicant is labeled 1.

Linear Regression The linear regression model establishes a linear relation-ship between the characteristics of borrowers and the default variables:

y = β1X1 + β2X2 + . . . + βpXp + u (4.1)

where u is the random error and u is independent of X.

In a vector expression:y = β′X + u (4.2)

The standard procedure is to estimate (4.1) with the ordinary least squares (OLS)

16

Chapter 4. Methodology Background 17

estimators of β which are denoted in the vector form b:

S = E(Y |X = x) = b′x (4.3)

This means that the score represents the expected value of the performance vari-able conditional on observing realization of characteristics. Although y is a binaryvariable, the score S is usually continuous, moreover, the prediction can take val-ues larger than 1 or smaller than 0. While the score S can be used to segregatebetween ’good’ borrowers and ’bad’ borrowers, the output of the model cannotbe interpreted as a probability of default.

Logit and Probit Let the conditional probability that the applicant or bor-rower is default be denoted by π(x):

π(x) = P (Y = 1|X = x) = F (β′x). (4.4)

Here F (.) denotes an unknown distribution function of indicators x. If we assumethe distribution function F (.) to be a standard normal distribution function, weare faced with the probit model:

F (β′x) =1√2π

∫ β′x

−∞e−t2

2 dt (4.5)

If the logistic distribution is selected, it leads to a logit model:

F (β′x) =eβ0+β1x1+β2x2+...+βpxp

1 + eβ0+β1x1+β2x2+...+βpxp(4.6)

Note that (4.6) implies a linear relationship between the log odds g(x) 1 and theinput characteristics:

g(x) = ln

(π(x)

1− π(x)

)= β0 + β1x1 + β2x2 + . . . + βpxp (4.7)

The results generated by both models can be interpreted directly as default prob-abilities and in both cases the estimation of the parameters is performed by max-imum likelihood. The differences in the results of these two kinds of models areoften negligible, as both distribution functions have a similar form except thatthe logistic distribution has thicker tails than the normal distribution. However,the logit model is easier to handle. First of all, the calculation of logit modelsis simpler than that of probit models. Second, the coefficients of the logit modelcan be more easily interpreted (as shown in equation (4.7)).

1In some literature g(x) is also called the logit transformation [4].


Neural Networks In recent years, neural networks have been discussed ex-tensively as an alternative to the models discussed above. A neural network is amathematical representation inspired by the way the human brain processes in-formation. The basic principle for this type of model is a series of algorithms thatallow for some learning through experience to recognize the relationship betweenborrower characteristics and the probability of default and to determine whichcharacteristics are most important in predicting default. The advantage to thismethod is that, unlike statistical models, it does not require any distributionalassumptions and it can be constantly adjusting to new information, such as thechanges of routine business and economic cycles.

Although some argue that neural networks are able to outperform probit or logitregressions in achieving higher prediction accuracy ratios, there are some findingsthat show the differences in performance between neural networks and standardstatistical models are either non-existing or marginal. Since the logit model allowsto easily check whether the empirical dependence between potential inputs andprobability of default is economically meaningful, many credit scoring model de-velopers still prefer to use the logit model. Therefore, the remainder of this thesiswill mainly focus on the development and validation process for logit models.

4.2 Bootstrapping and Confidence Intervals

Parametric and Nonparametric There are two situations to distinguish: theparametric and nonparametric. When there is a particular mathematical model,with adjustable constants or parameters that fully determine the probability den-sity function, such a model is called parametric and statistical methods based onthis model are parametric methods. When no such mathematical model is used,the statistical analysis is nonparametric [5]. We do not cover the parametricbootstrap here, and concentrate instead on the nonparametric bootstrap.

What is Bootstrapping? Bootstrapping is a simulation method introducedby Efron in 1979. With the bootstrap method, the original sample is treatedas the population and a Monte Carlo-style procedure is conducted on it. Thisis done by randomly resampling with replacement R times with each resamplehaving the same size as the original sample. Random selection means that if anelement of the original sample is selected, it will not be removed from the originalsample before selecting the next element. In other words, an individual elementfrom the original sample can be included repeatedly within a bootstrap samplewhile other elements may not be included at all.


The more bootstrap replications we use, the more reliable the result will be. Ingeneral, using at least 100 replications is recommended.

Confidence Interval Suppose that random variables X1, X2, . . . , Xn are inde-pendent and identically distributed according to a distribution F . The expecta-tion and variance of the observations Xi, i = 1, . . . , n, are given as below:

µ =

∫xdF (x) and σ2 =

∫(x− µ)2dF (x).

Consider the problem of constructing a two sided confidence interval for theexpectation µ, based on the sample mean

Xn =1

n(X1 + X2 + · · ·+ Xn).

Let σ2 denotes the estimate of the sample variance

σ2 =1

n− 1

n∑i=1

(Xi − X)2.

By the Central Limit Theorem we have

√n

(Xn − µ

σ

)D−→ N,

whereD−→ stands for convergence in distribution and N stands for the standard

normal random variable. Let uα2

denote the value for which P (N ≤ uα2) = 1− α

2

then

limn→∞

P

(−uα

2≤ √

n

(Xn − µ

σ

)≤ uα

2

)= 1− α,

which is equivalent to

limn→∞

P(Xn − n−

12 uα

2σ ≤ µ ≤ Xn + n−

12 uα

2σ)

= 1− α.

It follows that a two sided confidence interval for µ with confidence level 1−α is

I(α) = (Xn − n−12 uα

2σ, Xn + n−

12 uα

2σ) [1].

Bootstrap Confidence Interval We continue with the general idea of thebootstrap method for constructing the confidence interval. The empirical distri-bution function Fn, the estimator of F , is defined by

Fn(x) =1

n{number of Xi : Xi ≤ x, 1 ≤ i ≤ n}.


Let X∗1 , X

∗2 , . . . , X

∗n denote a bootstrap sample drawn from X1, X2, . . . , Xn. The

basic idea of the bootstrap method is to approximate the distribution of√

n(Xn−µ(F )) by the conditional distribution of

√n(X∗

n− µ(Fn)) =√

n(X∗n− Xn), given

µ(Fn) = Xn. Hence we approximate

Gn(x) = P(√

n(Xn − µ(F )) ≤ x)

by

G∗n(x) = P

(√n(X∗

n − µ(Fn)) ≤ x)

.

The random function G∗n is called the bootstrap approximation of Gn. Before

calculating the bootstrap confidence interval, let us first give the definition ofinverse function G−1 by

G−1(y) = inf{x : G(x) ≥ y}, 0 < y < 1.

Since the following argument holds under sufficient conditions on F

1− α = P(G−1

n (α

2) <

√n(Xn − µ) < G−1

n (1− α

2))

' P(G∗−1

n (α

2) <

√n(Xn − µ) < G∗−1

n (1− α

2))

,

a bootstrap confidence interval for µ can be denoted by

(Xn −

√nG∗−1(1− α

2), Xn +

√nG∗−1(1− α

2))

[1].

The Percentile Interval The simplest method for calculating bootstrap confi-dence interval is using the percentile interval. If we want to obtain the confidenceinterval with level 1− α, we can simply select the bootstrap estimates which lieon the α

2th percentile and (1− α

2)th percentile. For example, if we want to obtain

the confidence interval of the sample mean µ, we could run the bootstrapping Rtimes and calculated R bootstrap estimates of the mean µ. We could rank themfrom low to high and take the (R + 1)(1− α

2)th value as the lower limit and the

(R + 1)α2th value as the upper limit ((R + 1)α

2and (R + 1)(1 − α

2) should be

integers). In general, we should expect this to be more accurate, provided R waslarge enough. In practice, if confidence levels 0.95 or 0.99 are required, then it isadvisable to have R = 999 or more.


4.3 The ROC Approach

4.3.1 Receiver Operating Characteristics

One common method to represent the discriminatory power of a scoring modelis the receiver operating characteristic (ROC) curve. The construction of anROC curve is illustrated in Figure 4.1 which shows the possible distribution ofthe rating scores for default and non-default counterparties. For a perfect ratingmodel the distributions of defaulters and non-defaulters should be distinguished,but in the real world, perfect discrimination in general is not possible, then bothdistributions will overlap as shown in Figure 4.1. V is a cut-off value whichprovides a simple decision rule to divide counterparties into potential defaultersand non-defaulters. Then four scenarios can occur which are summarized in Table4.1.

DefaultersV

Fre

quen

cy

Non-defaulters

Rating Score

Figure 4.1: Rating Score distributions for defaulters and non-defaulters

Table 4.1: Decision results

ObservedDefaults Non-defaults

ScoresAbove V

True positive prediction False positive prediction(Sensitivity) (1-Specificity)

Below VFalse negative prediction True negative prediction

(1-Sensitivity) (Specificity)


If a counterparty with a rating score larger than V defaults or a non-defaulter hasa rating score lower than V , then the prediction of the rating system is correct.Otherwise, the rating system makes wrong prediction. The proportion of correctlypredicted defaulters is called sensitivity and the proportion of correctly predictednon-defaulters is called specificity2. For a given cut-off value V , a rating systemshould have a high sensitivity and specificity.

In statistics, the false positive prediction (1 − specificity) is also called type Ierror, which defined as the error of rejecting a null hypothesis that should havebeen accepted. The false negative prediction (1 − sensitivity) is called type IIerror, which means the error of accepting a null hypothesis that should have beenrejected.

Traditionally, a Receiver Operating Characteristic (ROC) curve shows the falsepositive prediction rate 1 − specificity on the X axis and the true positive ratesensitivity on the Y axis, as illustrated in Figure 4.2.

We denote the set of defaulters by D, the set of non-defaulters with ND and theset of total counterparties with T . Let SD denotes the distribution of the scoresof the defaulting debtors; and SND be the score distribution of non-defaulters.For any cut-off value v, we have

FD(v) = P (SD ≥ v)

FND(v) = P (SND ≥ v)

Then FD(v) is the sensitivity of the rating system derived based on the cutpointv and FND(v) is the corresponding 1 − specificity. As v varies over the possiblerating scores, the ROC curve can be plotted.

From Figure 4.2, it can be seen that the ROC curves start in the point (0, 0) andends in the point (1, 1). If the cut-off value is above the maximum rating score,then the sensitivity becomes 0 and specificity becomes 1 and the ROC curvepasses through (0, 0). Oppositely, if the cut-off value is below the minimumrating score, then the sensitivity is 1 and the specificity is 0 and the ROC curvemonotonically increase to the point (1, 1). For a perfect model, defaulters andnon-defaulters are separated perfectly. So if the cut-off value is in the score rangeof defaulters, then the sensitivity is less than 1 but but specificity is 1; if thecut-off value is in the score range of non-defaulters, then the sensitivity is 1 butthe specificity is less than 1; and if the cut-off value is above the maximum ratingscore of defaulters, but below the minimum rating score of non-defaulters, thenboth sensitivity and specificity are 1. Therefore, the corresponding ROC curve is

2In some literatures, “sensitivity” is called “hit rate” and “1-specificity” is called “falsealarm”[8]


a vertical line going from (0,0) to (0,1) and then a vertical line linking (0,1) to(1,1). For a random model, the rating system contains no discriminative power,so the correct prediction rate should equal to the wrong prediction rate and thecorresponding ROC curve is a diagonal. Therefore, to be informative, the entirecurve should lie above the 45◦ line.

Scoring model

Random model

Perfect model

0

1

11-Secificity

Sens

itivi

ty

AUROC

Figure 4.2: Receiver operating characteristics curves

4.3.2 Analysis of Areas Under ROC Curves

The area under the ROC curve (AUROC) provides a measure of the model’s dis-criminatory power. For a random model without discriminatory power, AUROCis 0.5; and for a perfect model, AUROC is 1. In practice, it is between 0.5 and 1for any reasonable rating model. A model with greater discriminatory power hasa larger AUROC. Following the notion in previous section, AUROC is defined as

AUROC =

∫ 1

0

FD(v) dFND(v)

In this section, we will discuss the statistical properties of AUROC. Since AUROCcan be interpreted in terms of a probability, we could compute the confidenceintervals for AUROC based on the work of DeLong (1988) [7].


Probability Interpretation of AUROC Consider the following experiment.There are two counterparties: one of them is randomly picked from the defaultersand the other is randomly picked from the non-defaulters. Someone has to decidewhich of the counterparties is the defaulter. Suppose that defaulters are thecounterparties with the higher rating scores, but if both counterparties have thesame rating score, then the decision-maker would toss a coin. If we follow thenotation as last section, then the probability that the decision-maker makes acorrect choice is equal to P (X > Y ) + 1

2P (X = Y ), where X and Y are the

score for defaulters and non-defaulters, respectively. We will show that AUROCis exactly equal to this probability.

Assume the score of a rating system can range from 0 to r. Divide the entire scorerange into k intervals by cut-off value ri, where i = 1, 2, . . . , k, ri < rj,∀i < j,r1 > 0 and rk = r. If the score of a counterpart falls into the interval (ri−1, ri],then the counterpart is assigned a rating i. Let SD and SND represent the ratingdistribution of defaulters and non-defaulters respectively.

Define

F iD = P (SD ≥ ri), i = 1, . . . , k

F iND = P (SND ≥ ri), i = 1, . . . , k

and denote the empirical distribution function as

F iD = P (SD ≤ ri), i = 1, . . . , k

F iND = P (SND ≤ ri), i = 1, . . . , k

Then the area under an empirical ROC curve can be calculated as

AUROC =k∑

i=1

1

2

(P (SD ≥ ri) + P (SD ≥ ri−1)

)·(P (SND ≥ ri−1)− P (SND ≥ ri)

)

=k∑

i=1

(P (SD ≥ ri) +

1

2· P (SD = ri−1)

)· P (SND = ri−1)

=k∑

i=1

P (SD ≥ ri) · P (SND = ri−1) +1

2· P (SD = ri−1) · P (SND = ri−1)

= P (SD > SND) +1

2P (SD = SND)

Calculation of Confidence Intervals for AUROC Next, we will introducethe DeLong’s method of calculating the confidence interval for AUROC, which ismore efficient than the commonly applied bootstrapping.


Suppose X and Y are the score for defaulters and non-defaulters, respectively.It has been shown that the area under an empirical ROC curve is equal to theWilcoxon-Mann-Whitney two sample U-statistic applied to the sample Xi and Yj

[?]. The Wilcoxon-Mann-Whitney U-statistic estimates the probability U thatthe score of a randomly selected defaulter will be less than or equal to the scoreof a randomly selected non-defaulter. It can be calculated as the average over akernel, ψ, as

U =1

mn

m∑i=1

n∑j=1

ψ(Xi, Yj),

where

ψ(X, Y ) =

1 Y < X12

Y = X0 Y > X

Observe that U is an unbiased estimator of P (Y < X) + 12P (Y = X), i.e.

AUROC = E(U) = P (Y < X) +1

2P (Y = X).

For continuous distributions, one has P (Y = X) = 0.

Define

ξ10 = E[ψ(Xi, Yj) ψ(Xi, Yl)]− U2, j 6= l;

ξ01 = E[ψ(Xi, Yj) ψ(Xl, Yj)]− U2, i 6= l;

ξ11 = E[ψ(Xi, Yj) ψ(Xi, Yj)]− U2.

Then the variance of the Wilcoxon-Mann-Whitney statistic is given as

σ2U

=(n− 1) ξ10 + (m− 1) ξ01

mn+

ξ11

mn.[7]

Let us introduce a quantity PXXY , which is the probability that the scores of tworandomly selected defaulters will both be greater than or less than the score of arandomly selected non-defaulter, minus the complementary probability that thescore of the randomly selected non-defaulter will be between the two scores of therandomly selected defaulters. PY Y X is given by a similar definition. They can beexpressed by the following formulas:

PXXY = P (X1, X2 > Y ) + P (X1, X2 < Y ) (4.8)

−P (X1 < Y < X2)− P (X2 < Y < X1)

PY Y X = P (Y1, Y2 > X) + P (Y1, Y2 < X) (4.9)

−P (Y1 < X < Y2)− P (Y2 < X < Y1).


Then the unbiased estimator of the variance σ2U

is given in terms of PXXY andPY Y X

σ2U =

1

4mn[P (X 6= Y ) + (m− 1)PXXY (4.10)

+(n− 1)PY Y X − 4(m + n− 1)(U − 1

2)2]

where P (X 6= Y ), PXXY and PY Y X are the estimators for the probability P (X 6=Y ), PXXY and PY Y X , respectively.

It is known that (AUROC − U)/σU is asymptotically normally distributed withmean zero and standard deviation one, as m → ∞, and n → ∞. Therefore, fora given confidence level 1− α, the confidence interval can be computed as [8]

[AUROC − σU Φ−1 (1− α/2) , AUROC + σU Φ−1 (1− α/2)

].

4.4 Statistical Tests for Rating System Calibra-

tion

The problem with calibration of rating systems or score variables is comparingthe realized default frequency with the estimates of the conditional default prob-ability given the score and analyzing the difference between the observed defaultfrequency and the estimated probability of default. There are several statisticalmethods for validating the probability of default, such as the Binomial test, theSpiegelhalter test and the Hosmer-Lemeshow Chi-square test. While the bino-mial test can only be applied to one single rating grade over a single time period,Spiegelhalter test and Hosmer-Lemeshow (χ2) test provide more advanced meth-ods that can be used to test the adequacy of the default probability predictionover a single time period for several rating grades. We only focus on the Hosmer-Lemeshow (χ2) test and the Spiegelhalter test in this paper.

4.4.1 Hosmer-Lemeshow Test

The Hosmer-Lemeshow test (2000) statistic originated in the field of categoricalregression and is often referred to as a goodness-of-fit test.

Considering a credit scoring system with N borrowers that are classified into Ldifferent rating class according to their credit scores. Let Ni denotes the number


of customers that are classified to rating class i ∈ {1, . . . , L}. Then, we have

N =L∑

i=1

Ni.

Furthermore, let di denote the number of defaulted customers within rating classi ∈ {1, . . . , L}. Then, for the customers within rating class i, the actual defaultrate is pi = di/Ni. Finally, assume that customers within rating class i are assigneda probability of default PDi.

Assume:

(A.1) The predicted default probabilities PDi and the realized default rates pi

are identically distributed.

(A.2) All the default events within each different rating class as well as betweenall rating classes are independent.

Under the hypothesisH0 : pi = PDi, ∀iH1 : pi 6= PDi, ∀i

Define the statistic

CL =L∑

i=1

(Ni · PDi − di)2

Ni · PDi · (1− PDi)(4.11)

Under the assumptions (A.1) and (A.2), when Ni → ∞ simultaneously for alli ∈ {1, . . . , L}, by the central limit theorem, the distribution of CL will convergein distribution towards a χ2-distribution with L− 2 degrees of freedom [4].

The p-value of a χ2-test is a measure to assess the adequacy of the estimateddefault probabilities: the closer the p-value is to zero, the worse the estimationis. However, if the estimated default probabilities are very small, the rate ofconvergence to the χ2-distribution may be very low as well. Moreover, p-valuesprovide a possible way to directly comparing forecasts with different numbers ofrating categories.

We should notice that since the Hosmer-Lemeshow test is based on the assump-tion of independence and a normal approximation, the test is likely to underesti-mated the true type I error. [16]


4.4.2 Spiegelhalter Test

Normally the predicted default probability of each borrower is individually cal-culated. Since the Hosmer-Lemeshow Chi-square test requires averaging the pre-dicted PDs of customers that have been classified within the same rating class,some bias might arise in the calculation. One could avoids this problem by usingthe Spiegelhalter test [13].

Consider N customers are in the credit scoring system. To each customer isassigned a score si and estimated default probability θi where i ∈ 1, . . . , N .

The Spiegelhalter test is based on the Mean Square Error(MSE) which is alsoknown as Bier Score in the contest of validation

MSE =1

N

N∑i=1

(yi − θi)2 (4.12)

where yi, i ∈ {1, . . . , N} is the indicator variable for the default events,

yi =

{1 if customer i is a defaulter

0 if customer i is a non-defaulter(4.13)

From equation (4.5), it can be seen that defaulters are assigned high predicteddefault probability and non-defaulters are assigned low predicted PD, the MSEgets small. Therefore, in general, a smaller value of MSE indicates a better ratingsystem and the higher the MSE, the worse is the performance of the rating system.

The Spiegelhalter test is an approach to assess the difference between the observedMSE and its expected value. Assume all the default events within each differentrating class as well as between all rating classes are independent. Let θi, i ∈{1, . . . , N} denote the observed PD. Consider the Hypotheses:

H0 : θi = θi, ∀iH1 : θi 6= θ, for some i

It can be shown that under H0 we have

E[MSE] =1

N

N∑i=1

θi · (1− θi) (4.14)

V ar[MSE] =1

N2

N∑i=1

(1− 2θi)2 · θ · (1− θi) (4.15)


Under the null hypothesis and the assumption of independence given the scores,it can be shown that the statistic

Z =MSE− E[MSE]√

V ar[MSE]=

1N

∑Ni=1(yi − θi)

2 − 1N

∑Ni=1 θi · (1− θi)√

1N2

∑2i=1(1− 2θi)2 · θ · (1− θi)

(4.16)

asymptotically follows a standard normal distribution by the central limit theo-rem. Performing the two-sided Gauss test, the critical values can easily be derivedas the α/2 and 1− α/2-quantile of the standard normal distribution [13].

Chapter 5

Empirical Analysis and Results

5.1 Model Description

The credit scoring model that we will validate is used to estimate the defaultprobability over 12 months’ time horizon for Deutsche Bank’s German customers.

There are 20 financial variables in the score function. To restrain the distortionscaused by outliers, extreme values without default indication power were trimmedat levels corresponding to the right and left end points of the values range. Missingvalues were replaced by the probability of default or its score equivalent. More-over, all variables were normalized using a probability of default transformation.To do so, the first step is, for each variable Pxx, x = 01, 02, ..., 20, partitioning thewhole value intervals into n subintervals: [(Pxx, i−1), (Pxx, i)], i = 1, . . . , n. TheTaylor approximation was applied to each subinterval and the constant, linear,quadratic and cubic terms (const(Pxx, i), lin(Pxx, i), qua(Pxx, i), cub(Pxx, i))were calculated. So the PD transformation of each variable is calculated as fol-lows:

PD(Pxx, i) = const(Pxx, i) (5.1)

+lin(Pxx, i) ∗ (Pxx− (Pxx, i− 1))

+qua(Pxx, i) ∗ (Pxx− (Pxx, i− 1))2

+cub(Pxx, i) ∗ (Pxx− (Pxx, i− 1))3

30

Chapter 5. Empirical Analysis and Results 31

logodd(Pxx) = ln

(PD(Pxx)

1− PD(Pxx)

)(5.2)

The score function is a linear combination of the PD transformed ratios logodd(Pxx),the turnover size class probability proxies and the Boolean variables of the indus-try segmentation

SCORE = constant +∑

[bool(industry)· const(Pxx, industry) (5.3)

+ prob(turnover size)· const(Pxxxx, turnover size)]· logodd(Pxx).

Based on the resulting score value, customers’ probability of default is calculatedusing the following formula:

PD =exp(a + SCORE ∗ b)

1 + exp(a + SCORE ∗ b). (5.4)

where a, b are constants.

5.2 The Validation Data Set

While the scoring model was developed on the historical data from 1996 to 2001,the original validation sample consisted of 30021 firm-year observations of balancesheets from 2003 to 2005 and default information spanning from 2003 to 2006.After excluding the part of customers that defaulted before the date of scorescreation date, there are 29250 observations left. Table B shows the number ofobserved customers per year and divides the sample into two categories: defaultersand non-defaulters.

Table 5.1: Number of observations and defaults per year

Year All Default Non-default2003 10805 169 106362004 9978 159 98192005 8467 91 8376Total 29250 328 28831

In the validation sample, missing values and outliers are treated in the same wayas in model developing.


Figure 5.1: Score-PD

Following the formulas in previous section, it is clear that the customer with ahigh score has high predicted default probability, as shown in Figure 5.1. Figure5.2 shows the distribution of scores of the validation sample.

Figure 5.2: Score-KDE

5.3 Variables Analysis

Before validating the scoring model, we need to identify the changes of portfo-lio structure over time, so a univariate analysis for each input variable in thescoring model is required. First, for each variable, the histograms and boxplotsfor the sample of three years have been plotted. (See Appendix A) In AppendixA, although for each input variable we see that the shape of the histogram ofdifferent years are almost the same, the boxplots shows that the mean, median,and quantiles of each input variable in different years are different.


To closely look at the changes of the input variables over time, we use bootstrapto calculate the 99% percentile confidence interval for the mean. The results areshown in Appendix B. It can be seen that except ratio P02101 and P02111, themean of other ratios significantly changed over time. For example, if we focus onthe ratio P02102, it could be found that the mean of the year 2003 is not locatedin the confidence intervals of the year 2004 and the year 2005. Then we couldsay that the value of 18 out of 20 input variables changed significantly over time.

However, if we transfer all the input variables’ mean into their correspondinglogodds according to equation (5.2), it can be found that the logodds of variablesthat changed significantly were decreasing over time (See Appendix C). Sincethe logodds of the input variables have positive relationship with the probabilityof default, it means that 18 out of 20 input variables shifted significantly in theway of decreasing the PD between the time period of 2003-2005.

In the meantime, we should notice that the German economy had become betterand kept rising since 2003. The following table (Table 5.2) shows the actualGerman GDP and the growth rate in 2003, 2004 and 2005. It means that thechange tendency of the input variables is in accordance with the development ofGerman economy. Therefore, we could say that the structure of the portfolio didnot change significantly over time.

This result is the precondition of the further analysis. Because if the portfoliostructure changed over time, it means the scoring model is compared on differentportfolios. Then the validation results would have a great deviation and are notreliable.

Table 5.2: Actual GDP of Germany 2003-2005

2003 2004 2005GDP at market prices (euro billion) 2,162 2,207 2,241GDP (US$ billion) 2,444 2,744 2,790Real GDP growth (%) -0.2 1.3 0.9

5.4 Testing for Discriminatory Power

Testing the discriminatory power is one of the main tasks of the validation of acredit scoring model. The discriminant analysis aims to assess the model’s abilityof separating good customers from bad customers. As we mentioned in Section4.3, the Receiver Operating Characteristic is the most widely used statistical tool


for the assessment of the discriminatory power. The following figure (Figure 5.3)gives the ROC curves for the year 2003, 2004 and 2005. We could see that inthese three years, the value of the area under the ROC curves (AUROC) are allpretty high. It means that the model performed well from 2003 to 2005.

Figure 5.3: ROC Curves for the year 2003 2004 and 2005

However, did the discriminatory power of the scoring model change significantlyover time? In the recent years, the DeLong test (1988) is commonly used incomparing ROC curves, but one should notice that the DeLong test is designedfor comparing different rating or scoring models on the same data set. Althoughit is not applicable to a comparison of the AUROC on the same portfolio indifferent time periods, it is still very helpful to calculate the confidence intervalsof the AUROC and we could obtain useful results about the change tendency ofthe discriminatory power by comparing the confidence intervals.

Table 5.3 shows the value of the AUROC and the confidence intervals in 2003,2004 and 2005 which are obtained by the method introduced in section 4.3.2.We could find that the confidence intervals of AUROC in these three years arequite close. It means that the discriminatory power of the scoring model did notchange significantly and the scoring model keeps possessing high discriminatorypower.


Table 5.3: Confidence Intervals for AUROC

95% Confidence IntervalYear AUROC Estimated Variance Lower Limit Upper Limit2003 0.8197 0.0002236 0.7904 0.84902004 0.8225 0.0002237 0.7932 0.85182005 0.8063 0.0004276 0.7658 0.8469

5.5 Statistical Tests for PD Validation

In practice, the observed default rates will be different from the predicted defaultprobability. The key question is whether the difference between the predicted PDand the observed default rate is acceptable. As discussed in Section 4.4, severalstatistical tests are available for calibration of a rating system. Here, I only usedthe Hosmer-Lemeshow Chi-square test and Spiegelhalter test, because both ofthem are tests that could make a comparison of forecasts with different numbersof rating categories simultaneously. The results are given in the Table 5.4 andTable 5.5.

Table 5.4: Hosmer-Lemeshow-Chisquare Test for PD Validation

Year Statistic Value p-value2003 16.44002 0.28722004 34.40864 0.00182005 9.12808 0.8227

Table 5.5: Spiegelhalter Test for PD Validation

Year Statistic Value p-value2003 1.30508 0.19192004 2.77888 0.00552005 -0.16138 0.8718

The result of Hosmer-Lemeshow Chi-square test and Spiegelhalter test are verysimilar. According to the p-values, the estimated PD are quite close to the ob-served default rates for the year 2003 and 2005, but for the year 2004 the estima-tion is not well enough (p-value < 0.05). Because we only have few defaulters inour portfolio, we could say that the result is still acceptable and the model stillworks well.

Chapter 6

Discussion & Conclusion

When the objective of the validation of a credit scoring model is to confirm thatthe developed scoring model is still valid for the current applicant population,one should first check whether the portfolio structure changed over time or not.In some cases where the development sample is two or three years old, significantshifts of the portfolio structure might have occurred. It means that the scoringmodel is compared on different portfolios, then the validation result would becomeless meaningful.

With regard to assessing the discriminatory power of a credit scoring model,the Receiver Operating Characteristic Curve and the area under the curve arethe most commonly used tools. Although the DeLong test is not applicable tocomparing the AUROC on the same portfolio in different time periods, we stillcould use it to calculate the confidence interval for the value of the AUROC.

With regard to the calibration of the rating or scoring system, Hosmer-Lemeshowand Spiegelhalter test are available and easy to use because it can test the ad-equacy of the PD estimates with different number of rating classes at the sametime. However, their appropriateness strongly depends on the independence as-sumption for the default events. Furthermore, one should notice that due to theinsufficiency of the number of defaulters, the statistical validation is not alwaysreliable.

36

Appendix A

Ratio Analysis

The following histograms and boxplots display the distribution of each variablein three years. In the histogram, while the red lines are the fitted normal curves,the blue curves are the estimated kernel density. In Boxplots, the vertical lines(or whiskers) are drawn from the box to the most extreme point within 1.5 in-terquartile ranges. (An interquartile range is the distance between the 25th andthe 75th sample percentiles.) Any value more extreme than this is marked witha red square symbol.

Figure A.1: Histogram and Boxplot of ratio P02101

37

Appendix A. Ratio Analysis 38
























Appendix B

Results of Bootstrapping

Table B.1: 99% percentile confidence interval for themean

P02101 Mean 99% confidence interval2003 3.939 [3.621, 4.257]2004 4.234 [3.919, 4.543]2005 4.170 [3.828, 4.489]





P02106 Mean 99% confidence interval2003 149.532 [146.766, 152.217]

43

Appendix B. Results of Bootstrapping 44

2004 154.397 [151.561, 157.114]2005 158.139 [155.199, 161.265]










Appendix B. Results of Bootstrapping 45






Appendix C

Transformation of the mean ofinput ratios

Table C.1: Table of the log transformation of the meanof input ratios

P02101 Mean logodd(mean)2003 3.939 -4.6952004 4.234 -4.7052005 4.170 -4.703





46

Appendix C. Transformation of the mean of input ratios 47










P02115 Mean logodd(mean)2003 0.523 -4.875

Appendix C. Transformation of the mean of input ratios 48

2004 0.503 -4.9082005 0.493 -4.925






Bibliography

[1] Bert van Es and Hein Putter. Lecture Notes: The Bootstrap. 2005.

[2] Stefan Blochwitz and Stefan Hohl. XI. Validation of Banks Internal RatingSystems - A Supervisory Perspective. In The Basel II Risk Parameters, pages243–262. Springer, 2006.

[3] Swiss Federal Banking Commission. Basel II Implementation in Switzerland:Summary of the explanatory report of the Swiss Federal Banking Commis-sion, September 2005.

[4] Stanley Lemeshow David W. Hosmer. Applied Logistic Regression. JohnWiley & Sons, Inc., second edition edition, 2000.

[5] A. C. Davison and D. V. Hinkley. Bootstrap Methods and Their Application.Cambridge University Press, 1997.

[6] Arnaud de Servigny and Olivier Renault. Measuring and Managing CreditRisk. The McGraw Hill companies, Inc., 2004.

[7] Elizabeth R. DeLong, David M. Delong, and Daniel L. Clarke-Pearson. Op-erating Characteristics Curves: A Nonparametric Approach. Bopmetrics,1988.

[8] Bernd Engelmann. XII. Measures of a Rating’s Discriminative Power - Ap-plications and Liminations. In The Basel II Risk Parameters, pages 263–287.Springer, 2006.

[9] Nout Wellink, President of the Netherlands Bank and Chairman of the BaselCommittee on Banking Supervision. Goals and Strategies of the Basel Com-mittee. at the High Level Meeting on the Implementation of Basel II in Asiaand Other Regional Supervisory Priorities, Hong Kong, 11 December 2006.

[10] Basel Committee on Banking Supervision(BCBS). The New Basel CapitalAccord. Consultative document, Bank for International Settlement, Basel,Switzerland, 2003.

49

BIBLIOGRAPHY 50

[11] Basel Committee on Banking Supervision(BCBS). International Conver-gence of Capital Measurement and Capital Standards. A revised framework,Bank for International Settlement, Basel, Switzerland, 2004.

[12] Basel Committee on Banking Supervision(BCBS). An explanatory note onthe basel ii irb risk weight functions. Technical report, Bank for InternationalSettlement, Basel, Switzerland, 2005.

[13] Basel Committee on Banking Supervision(BCBS). Studies on the validationof internal rating systems. Working paper no. 14, Bank for InternationalSettlement, Basel, Switzerland, 2005.

[14] Basel Committee on Banking Supervision(BCBS). Update on work of the Ac-cord Implementation Group related to validation under the Basel II Frame-work. Consultative document, Bank for International Settlement, Basel,Switzerland, 2005. http://www.bis.org/publ/bcbs n14.htm.

[15] Naeem Siddiqi. Credit Risk Scorcards: Developing and Implementing Intel-ligent Credit Scoring. John Wiley & Sons, Inc., 2006.

[16] Stefan Blochwitz, Marcus R. W. Martin, and Carsten S. Wehn. XIII. Statis-tical Approaches to PD Validation. In The Basel II Risk Parameters, pages289–305. Springer, 2006.

Documents

Credit Scoring Model Validation - UvA/FNWI (Science ... · PDF fileCredit Scoring Model Validation Master Thesis June 29, 2008 Xuezhen Wu Referent: Prof. Peter Spreij ... of credit