Predictive Clustering for Credit Scoring


PREDICTIVE CLUSTERING FOR CREDIT RISK ANALYSIS

Jay B. Simha

Abiba Systems, Bangalore, India


Credit risk modeling is a well-researched area in both the statistical and AI communities. Several models cited in the research use a model built on the whole data set. In this study, a hybrid predictive model framework based on fuzzy clustering and statistical/machine learning classifiers is proposed for credit risk analysis. This hybrid approach enables building rules/functions for different groups of borrowers separately. In the first stage, customers are segmented into clusters that are characterized by similar features; in the second step, classifiers are built for each group to obtain scoring rules/functions that provide a risk level for each customer. Multiple classifiers are evaluated on each segment, and the best classifier for each segment is selected for final scoring. The main advantage of integrating the two techniques is that the resulting models may predict the risk connected with granting credit to each client better than either method used separately. The results are compared with those of a classifier built on the whole data set, according to classification performance and the business objective. The results support the hypothesis that a hybrid model-based framework provides better results than a global model.

Key words: Hybrid models, Fuzzy C-means, Classifiers, Credit Risk

1. INTRODUCTION

One of the key decisions financial institutions have to make is whether or not to grant a loan to a customer. This decision basically boils down to a binary classification problem which aims at distinguishing good payers from bad payers. Numerous methods have been proposed in the literature to develop credit scoring models. These models include traditional statistical methods (e.g. logistic regression [2]), nonparametric statistical models (e.g. k-nearest neighbor [5] and classification trees [8]), clustering [3], fuzzy logic [7] and neural network models [1,9]. Most of these studies primarily focus on developing classification models with high predictive accuracy. However, all these approaches build a global model. It can be argued that the potential savings from predicting risks in certain segments can outweigh overall classification accuracy across all the segments.

Zakrzewska [10] developed a model based on clustering and decision trees. Since that concept used a single classifier (decision tree) for scoring, it may not be applicable across different data sets. In addition, a soft clustering method like fuzzy clustering is superior to hard k-means clustering as it provides better cluster quality. Hence, in this research these two concepts, i.e. the use of soft clustering to identify the segments and the use of the best classifier for each segment, have been investigated, with the hypothesis that the resulting classifier system will provide better control over scoring.

In this paper we present a framework using fuzzy clustering and different classifiers for building credit scoring models using local patterns.

2. SYSTEM ARCHITECTURE

The proposed system, which is expected to support the evaluation of credit risks by building classifiers, is composed of three main modules.

Figure 1. System Architecture

The first module is a segmentation module, where the data set is split into clusters with homogeneous behavior. We use the fuzzy C-means algorithm for clustering, as discussed in the previous section. The second module is the classifier learning module, which builds a model for each classifier on each of the clusters obtained by the previous module. In the third module, the best classifier for each segment is selected based on the configured criteria. In this research we have selected two criteria for evaluation, namely classification accuracy and true positive rate.
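
The selection step of the third module can be sketched as follows. This is a minimal illustration, assuming that each classifier has already been scored on each segment; the function names, segment names and scores are hypothetical and are not results from this study.

```python
def accuracy(tp, tn, fp, fn):
    """Share of all cases that were classified correctly."""
    return (tp + tn) / (tp + tn + fp + fn)


def true_positive_rate(tp, fn):
    """Share of the actual positives that were detected."""
    return tp / (tp + fn)


def best_per_segment(scores):
    """Map each segment to the classifier with the highest score.

    scores has the form {segment: {classifier_name: score}}.
    """
    return {seg: max(by_clf, key=by_clf.get) for seg, by_clf in scores.items()}


# Hypothetical scores, for illustration only.
example_scores = {
    "segment_1": {"naive_bayes": 0.81, "decision_tree": 0.86},
    "segment_2": {"naive_bayes": 0.90, "decision_tree": 0.84},
}
print(best_per_segment(example_scores))
```

The same selection function can be run once per criterion, for example once with accuracies and once with true positive rates.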

3. FUZZY C-MEANS CLUSTERING

Fuzzy C-means clustering (FCM) is a clustering technique that differs from hard k-means, which employs hard partitioning. FCM employs fuzzy partitioning, such that a data point can belong to all groups with different membership grades between 0 and 1.

FCM is an iterative algorithm. The aim of FCM is to find cluster centers (centroids) that minimize a dissimilarity function. A brief summary of the considerations and major steps is given below.

The algorithm first posits a given number [...] Euclidean distance based "center" of each cluster will be calculated from all the customers' attribute vectors, weighted by their membership degrees in the cluster. The weighting will also be recomputed based on the membership values. The algorithm stops when the pseudo-partition memberships collectively stop changing by a determined amount on successive iterations. The mathematical treatment of the algorithm can be found in [4]. The algorithm used in the research is given in Fig. 2.


Fig. 2. Fuzzy clustering algorithm
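
The iteration described above can be sketched in plain Python. This is a minimal illustration of the standard fuzzy C-means updates, not necessarily the exact procedure of Fig. 2: the fuzzifier `m = 2`, the random initial memberships, the tolerance and the sample points are all assumptions made for the example.

```python
import math
import random


def fuzzy_c_means(points, c, m=2.0, tol=1e-5, max_iter=100, seed=0):
    """Return (centers, u), where u[k][i] is the membership of point k in cluster i."""
    rng = random.Random(seed)
    n, dim = len(points), len(points[0])
    # Assumed initialisation: random memberships, each row normalised to sum to 1.
    u = []
    for _ in range(n):
        row = [rng.random() + 1e-9 for _ in range(c)]
        total = sum(row)
        u.append([v / total for v in row])
    centers = []
    for _ in range(max_iter):
        # Centers: mean of all points weighted by membership ** m.
        centers = []
        for i in range(c):
            w = [u[k][i] ** m for k in range(n)]
            total = sum(w)
            centers.append([sum(w[k] * points[k][d] for k in range(n)) / total
                            for d in range(dim)])
        # Memberships: recomputed from the Euclidean distances to the centers.
        new_u = []
        for x in points:
            dist = [max(math.dist(x, ctr), 1e-12) for ctr in centers]
            new_u.append([
                1.0 / sum((dist[i] / dist[j]) ** (2.0 / (m - 1.0)) for j in range(c))
                for i in range(c)
            ])
        # Stop when no membership changes by more than tol.
        change = max(abs(new_u[k][i] - u[k][i]) for k in range(n) for i in range(c))
        u = new_u
        if change < tol:
            break
    return centers, u


# Hypothetical two-dimensional customer vectors forming two groups.
points = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (10.0, 10.0), (10.0, 11.0), (11.0, 10.0)]
centers, memberships = fuzzy_c_means(points, c=2)
hard_assignment = [row.index(max(row)) for row in memberships]
print(hard_assignment)
```

Taking the cluster with the highest membership for each point, as in `hard_assignment`, is one simple way to turn the fuzzy partition into segments for classifier training.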

4. CLASSIFIERS

A classifier is a statistical/machine learning function which maps the independent attributes to the dependent attribute with some confidence. There are different types of classifiers [6]. In this work, five classifiers are used, namely naïve Bayes, logistic regression, decision trees, artificial neural networks and support vector machines. A brief overview of these techniques is given below:

4.1 Naïve Bayes classifier

The probability model for a classifier is a conditional model over a dependent class variable C with a small number of outcomes or classes, conditional on several feature variables F1 through Fn. This conditional model can be rewritten using Bayes' theorem as

\[ p(C \mid F_1, \dots, F_n) = \frac{p(C)\, p(F_1, \dots, F_n \mid C)}{p(F_1, \dots, F_n)} \]

However, the above equation allows for interdependence among the features. When this model is relaxed with the assumption of independence, the conditional distribution over the class variable C can be expressed as:

\[ p(C \mid F_1, \dots, F_n) = \frac{1}{Z}\, p(C) \prod_{i=1}^{n} p(F_i \mid C) \]

where Z is a scaling factor dependent only on F1, F2, ..., Fn, i.e., a constant if the values of the feature variables are known.

Fig. 3. Naïve Bayesian classifier

Models of this form are much more manageable, since they factor into a so-called class prior p(C) and independent probability distributions p(Fi|C). This is the naïve Bayes classifier, which has shown surprising performance on real-life data sets.
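
The factorisation into a class prior and per-feature distributions can be illustrated with a small counting-based sketch for categorical features. The add-one (Laplace) smoothing, the function names and the toy records below are assumptions made for the example and do not come from the study.

```python
from collections import Counter, defaultdict


def train_naive_bayes(rows, labels):
    """Estimate p(C) and the counts behind p(F_i | C) from categorical data."""
    n = len(labels)
    class_counts = Counter(labels)
    prior = {c: cnt / n for c, cnt in class_counts.items()}
    # feature_counts[(i, c)][v] = number of class-c rows whose feature i equals v
    feature_counts = defaultdict(Counter)
    values = defaultdict(set)
    for row, c in zip(rows, labels):
        for i, v in enumerate(row):
            feature_counts[(i, c)][v] += 1
            values[i].add(v)
    return prior, feature_counts, class_counts, values


def posterior(model, row):
    """Return p(C | F_1..F_n) for every class, normalised by the factor Z."""
    prior, feature_counts, class_counts, values = model
    scores = {}
    for c, p_c in prior.items():
        score = p_c
        for i, v in enumerate(row):
            # Add-one smoothing (an assumption) avoids zero probabilities.
            score *= (feature_counts[(i, c)][v] + 1) / (class_counts[c] + len(values[i]))
        scores[c] = score
    z = sum(scores.values())
    return {c: s / z for c, s in scores.items()}


# Hypothetical applicants: (housing, income band) with a good/bad label.
rows = [("own", "high"), ("own", "low"), ("rent", "low"), ("rent", "low")]
labels = ["good", "good", "bad", "bad"]
model = train_naive_bayes(rows, labels)
print(posterior(model, ("own", "high")))
```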

4.2 Logistic Regression

Logistic regression is the most widely used classifier in credit risk modeling. Logistic regression can predict the probability (P) that an example X belongs to one of two predefined classes. Suppose example X = (x1, x2, x3, ..., xn); as in linear regression, logistic regression gives each xi a coefficient wi which measures the contribution of xi to variations in P. First, a logistic transformation of P is defined as

\[ \operatorname{logit}(P) = \ln \frac{P}{1 - P} \]

where P can only range from 0 to 1, while logit(P) ranges from −∞ to +∞. Logit(P) is then matched by a linear function of the feature variables:

\[ \operatorname{logit}(P) = w_0 + w_1 x_1 + w_2 x_2 + \dots + w_n x_n \]
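
The relationship between the logit and the predicted probability can be checked numerically. The sketch below assumes the coefficients are already known; the weights and feature values are hypothetical.

```python
import math


def logit(p):
    """Logistic transformation: maps a probability in (0, 1) to the real line."""
    return math.log(p / (1.0 - p))


def predict_probability(weights, bias, x):
    """Invert the logit: P = 1 / (1 + exp(-(w0 + w1*x1 + ... + wn*xn)))."""
    z = bias + sum(w * xi for w, xi in zip(weights, x))
    return 1.0 / (1.0 + math.exp(-z))


# Hypothetical coefficients and applicant features.
p = predict_probability([0.4, -0.2], 0.1, [1.0, 3.0])
print(p, logit(p))
```

Applying `logit` to the predicted probability recovers the value of the linear function, which is the sense in which the logit is "matched" by a linear model.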

4.3 Decision trees

Decision tree learning is a common method used in data mining. The goal is to create a model that predicts the value of a target variable based on several input variables. Each interior node corresponds to one of the input variables, with edges to children for each of the possible values of that input variable. Each leaf represents a value of the target variable given the values of the input variables represented by the path from the root to the leaf.

A tree can be "learned" by splitting the source set into subsets based on an attribute value test. Splitting can be based on different criteria. Two of the most widely used measures are information gain and Gini index.

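Information gain:

\[ IG(S, A) = H(S) - \sum_{v \in \mathrm{values}(A)} \frac{|S_v|}{|S|}\, H(S_v), \qquad H(S) = -\sum_{k} p_k \log_2 p_k \]

Gini index:

\[ \mathrm{Gini}(S) = 1 - \sum_{k} p_k^2 \]

where S is the set of examples at a node, S_v is the subset of S for which attribute A takes the value v, and p_k is the proportion of examples in S that belong to class k.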



Fig. 4. Decision tree classifier

This process is repeated on each derived subset in a recursive manner called recursive partitioning. The recursion is completed when the subset at a node all has the same value of the target variable, or when splitting no longer adds value to the predictions.
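
The two splitting measures named in this section can be computed with a few lines of Python. This sketch uses the standard textbook definitions (entropy-based information gain and Gini impurity) for a categorical attribute; the attribute values and labels are hypothetical.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (base 2) of a list of class labels."""
    n = len(labels)
    return -sum((k / n) * math.log2(k / n) for k in Counter(labels).values())


def gini(labels):
    """Gini impurity of a list of class labels."""
    n = len(labels)
    return 1.0 - sum((k / n) ** 2 for k in Counter(labels).values())


def information_gain(values, labels):
    """Entropy reduction obtained by splitting the labels on a categorical attribute."""
    n = len(labels)
    subsets = {}
    for v, y in zip(values, labels):
        subsets.setdefault(v, []).append(y)
    remainder = sum(len(s) / n * entropy(s) for s in subsets.values())
    return entropy(labels) - remainder


# Hypothetical attribute that separates the two classes perfectly.
housing = ["own", "own", "rent", "rent"]
outcome = ["good", "good", "bad", "bad"]
print(entropy(outcome), gini(outcome), information_gain(housing, outcome))
```

A recursive tree learner would evaluate such a measure for every candidate attribute at a node and split on the best one.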

4.4 Artificial Neural Networks

An Artificial Neural Network (ANN) is an information processing paradigm that is inspired by the way biological nervous systems, such as the brain, process information. The key element of this paradigm is the novel structure of the information processing system. It is composed of a large number of highly interconnected processing elements (neurons) working in unison to solve specific problems. Learning in neural networks is accomplished by adjusting the connection weights iteratively until convergence.

Fig. 5. Artificial Neural Networks

Each of the feed-forward connections is computed using the activation function. Typically, feedback of the delta computations is used to minimize the errors during learning. In credit risk modeling, neural networks are second only to logistic regression in use.
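
The idea of a forward pass through an activation function followed by delta-based weight corrections can be illustrated with a single neuron. The sketch below assumes a sigmoid activation and the delta rule on squared error; the learning rate, epoch count and the toy OR data are illustrative choices, and the network and activation function used in the study may differ.

```python
import math


def sigmoid(z):
    """Assumed activation function: squashes any real value into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


def forward(w, b, x):
    """Feed-forward output of one neuron for input vector x."""
    return sigmoid(b + sum(wi * xi for wi, xi in zip(w, x)))


def train_neuron(samples, targets, lr=0.5, epochs=2000):
    """Adjust the weights iteratively with the delta rule."""
    w = [0.0] * len(samples[0])
    b = 0.0
    for _ in range(epochs):
        for x, t in zip(samples, targets):
            out = forward(w, b, x)
            delta = (t - out) * out * (1.0 - out)  # error times sigmoid derivative
            w = [wi + lr * delta * xi for wi, xi in zip(w, x)]
            b += lr * delta
    return w, b


# Toy, linearly separable data (logical OR), for illustration only.
samples = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (1.0, 1.0)]
targets = [0.0, 1.0, 1.0, 1.0]
w, b = train_neuron(samples, targets)
print([round(forward(w, b, x)) for x in samples])
```

A multi-layer network propagates the same kind of delta backwards through its hidden layers.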

4.5 Support Vector Machines (SVM)

A Support Vector Machine is a supervised learner for classification. An SVM views the input data as two sets of vectors in an n-dimensional space and constructs a separating hyperplane in that space, one which maximizes the margin between the two data sets.

Fig. 6. Support vector machines

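In order to calculate the margin, two parallel hyperplanes are constructed, one on each side of the separating hyperplane, which are "pushed up against" the two data sets. Intuitively, a good separation is achieved by the hyperplane that has the largest distance to the neighboring data points of both classes, since in general the larger the margin, the lower the generalization error of the classifier. In formal terms an SVM can be written (in its dual form) as:

Maximize (in \alpha_i)

\[ \sum_{i=1}^{n} \alpha_i - \frac{1}{2} \sum_{i,j} \alpha_i \alpha_j c_i c_j \mathbf{x}_i^{T} \mathbf{x}_j \]

subject to (for any i = 1, ..., n)

\[ \alpha_i \ge 0 \]

and

\[ \sum_{i=1}^{n} \alpha_i c_i = 0 \]

where the x_i are the training vectors, c_i ∈ {−1, +1} are their class labels and the α_i are the Lagrange multipliers.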

It has been found that suort vector machines work well withcredit risk modeling.
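
The geometric picture of the margin can be made concrete for a linear SVM in its primal form. The sketch assumes a hyperplane written as w·x + b = 0 with the two parallel margin hyperplanes at w·x + b = ±1, so that the margin width is 2/||w||; the symbols `w` and `b` and the numeric values are assumptions for the example, and no training is performed here.

```python
import math


def decision_value(w, b, x):
    """Signed score of x relative to the hyperplane w.x + b = 0."""
    return sum(wi * xi for wi, xi in zip(w, x)) + b


def classify(w, b, x):
    """Assign class +1 or -1 according to the side of the hyperplane."""
    return 1 if decision_value(w, b, x) >= 0 else -1


def margin_width(w):
    """Distance between the hyperplanes w.x + b = +1 and w.x + b = -1."""
    return 2.0 / math.sqrt(sum(wi * wi for wi in w))


# Hypothetical hyperplane parameters.
w, b = [3.0, 4.0], -10.0
print(classify(w, b, (3.0, 3.0)), classify(w, b, (1.0, 1.0)), margin_width(w))
```

Maximizing the margin therefore corresponds to minimizing the norm of `w`, which is what the dual problem above solves in terms of the multipliers.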

5. EXPERIMENTAL RESULTS AND DISCUSSIONS

Experiments were done on a real-life credit risk data set collected for an Indian bank. The experiments consist of evaluating and comparing the quality of the results obtained by the best classifier for each segment against a similar classifier developed using the whole data. When learning the classifier on the whole data set, ten-fold cross-validation is adopted to test the model. Since the segment sizes are small, a leave-one-out approach is adopted for validation of the classification models built on the segments.
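
The two validation schemes differ only in how the data are partitioned. The sketch below generates index splits for k-fold cross-validation and treats leave-one-out as the special case where the number of folds equals the number of samples; the function names and sample counts are hypothetical, and shuffling or stratification (which a real experiment may need) is omitted.

```python
def k_fold_indices(n, k):
    """Yield (train_indices, test_indices) for k folds over n samples."""
    folds = [list(range(i, n, k)) for i in range(k)]
    for i, test in enumerate(folds):
        train = [j for f_idx, fold in enumerate(folds) if f_idx != i for j in fold]
        yield train, test


def leave_one_out_indices(n):
    """Leave-one-out is k-fold with one sample per fold."""
    return k_fold_indices(n, n)


# Ten folds over a hypothetical data set of 20 records.
for train, test in k_fold_indices(20, 10):
    print(test)
```

With leave-one-out, every record of a small segment is used for testing exactly once, at the cost of training one model per record.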

Table 1 shows the classification accuracy of the different classification algorithms. It can be seen that all the algorithms perform well on the validation set. One of them (decision tree) has built-in feature selection; another (logistic regression) is used with forward selection. The other two classifiers were built using the full data set and all the attributes. Since a similar approach is used in learning the classifiers on segmented data, further pruning was not carried out on the algorithms.

Table 2 shows the true positive rates obtained with the different classification algorithms. It can be observed that all the classifiers perform similarly when all the data is used for modeling. This indicates that the classification boundaries learned by each of the classifiers are optimal for the given data. Further data transformation and tuning of the classifier learning parameters may improve the classification accuracy. However, our intention was to compare the performance of classifiers on segments with the same parameter settings. It can be seen that no classifier is superior in all the segments on all of the performance measures. This has motivated us to develop our approach of selecting the best classifier for each segment. It is clear from the tables that the best classifier for each segment provides superior performance.

Table 1. Classification accuracy

Table 2. True positive rates

6. CONCLUSION

In this paper, a framework for connecting unsupervised (fuzzy clustering) and supervised (classification algorithms) techniques for credit risk evaluation is investigated. The presented technique allows for building different classifiers for different groups of customers, each providing the best results for its segment. In the proposed approach, each credit applicant is assigned to the most similar group of clients from the training data set, and credit risk is evaluated by applying the classifier proper for this group.
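
Scoring a new applicant under this scheme can be sketched as a two-step lookup. The example assumes that "most similar group" means the cluster whose center is nearest in Euclidean distance and that each segment's selected classifier is available as a callable; the centers, classifiers and applicant vector are hypothetical placeholders.

```python
import math


def nearest_cluster(centers, x):
    """Index of the cluster center closest to x (Euclidean distance)."""
    return min(range(len(centers)), key=lambda i: math.dist(centers[i], x))


def score_applicant(centers, classifiers, x):
    """Route the applicant to its segment and apply that segment's classifier."""
    return classifiers[nearest_cluster(centers, x)](x)


# Hypothetical cluster centers and per-segment classifiers.
centers = [(0.0, 0.0), (10.0, 10.0)]
classifiers = [lambda x: "low risk", lambda x: "high risk"]
print(score_applicant(centers, classifiers, (1.0, 2.0)))
```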

Results obtained on the real credit risk data set showed higher precision and greater simplicity for the models obtained for each cluster than for the model developed with the whole data set.

Future research will focus on further investigations on using Self-Organizing Maps and Expectation Maximization clustering for segmentation, with multiple classification techniques for supervised learning and additional performance measures like the area under the ROC curve.

REFERENCES

[1] B. Baesens, R. Setiono, Ch. Mues, J. Vanthienen. Using Neural Network Rule Extraction and Decision Tables for Credit-Risk Evaluation. Management Science, 49(3), 2003, 312-329.
[2] M. Bensic, N. Sarlija, M. Zekic-Susac. Modelling Small-Business Credit Scoring by Using Logistic Regression, Neural Networks and Decision Trees. Intelligent Systems in Accounting, Finance and Management, 13, 2005, 133-150.
[3] G. Chi, J. Hao, Ch. Xiu, Z. Zhu. Cluster Analysis for Weight of Credit Risk Evaluation Index. Systems Engineering-Theory Methodology, Applications, 10(1), 2001, 64-67.
[4] J.C. Dunn. A Fuzzy Relative of the ISODATA Process and Its Use in Detecting Compact Well-Separated Clusters. Journal of Cybernetics, 3, 1973, 32-57.
[5] W.E. Henley, D.E. Hand. Construction of a k-nearest neighbor credit-scoring system. IMA Journal of Management Mathematics, 8, 1997, 305-321.
[6] I.H. Witten, E. Frank. Data Mining: Practical machine learning tools and techniques, 2nd Edition. Morgan Kaufmann, San Francisco, 2005.
[7] Y.-Z. Luo, S.-L. Pang, S.-S. Qiu. Fuzzy Cluster in Credit Scoring. Proceedings of the Second International Conference on Machine Learning and Cybernetics, Xi'an, 2-5 November 2003, 2731-2736.
[8] S.S. Satchidananda, Jay B. Simha. Comparing decision trees with logistic regression for credit risk analysis. SAS APAUGC 2006, Mumbai.
[9] D. West. Neural network credit scoring models. Computers & Operations Research, 27, 2000, 1131-1152.
[10] D. Zakrzewska. On integrating unsupervised and supervised classification for credit risk evaluation. Information Technology and Control, 2007, Vol. 36, No. 1A.

