Predictive Clustering for Credit Scoring

  • 8/12/2019 Predictive Clustering for Credit Scoring

    PREDICTIVE CLUSTERING FOR CREDIT RISK ANALYSIS

    Jay B.Simha

    Abiba Systems, Bangalore, India

    [email protected]

    Credit risk modeling is a well-researched area in both the statistical and AI communities. Several models cited in the research literature are built using the whole data set. In this study, a hybrid predictive model framework based on fuzzy clustering and statistical/machine learning classifiers is proposed for credit risk analysis. This hybrid approach enables building rules/functions for different groups of borrowers separately. In the first stage, customers are segmented into clusters characterized by similar features; in the second step, classifiers are built for each group to obtain scoring rules/functions that may provide a risk level for each customer. Multiple classifiers are evaluated on each segment, and the best classifier for each segment is selected for final scoring. The main advantage of integrating the two techniques is that the resulting models may predict the risk connected with granting credit to each client better than either method used separately. The results are compared with those of a classifier built on the whole data set, according to classification performance and the business objective. The results support the hypothesis that a hybrid model based framework provides better results than a global model.

    Key words: Hybrid models, Fuzzy C-means, Classifiers, Credit Risk

    1. INTRODUCTION

    One of the key decisions financial institutions have to make is whether or not to grant a loan to a customer. This decision basically boils down to a binary classification problem which aims at distinguishing good payers from bad payers. Numerous methods have been proposed in the literature to develop credit scoring models. These models include traditional statistical methods (e.g. logistic regression [2]), nonparametric statistical models (e.g. k-nearest neighbor [5] and classification trees [8]), clustering [3], fuzzy logic [7] and neural network models [1,9]. Most of these studies focus primarily on developing classification models with high predictive accuracy. However, all these approaches build a global model. It can be argued that the potential savings from predicting risks in certain segments can outweigh the overall classification accuracy across all the segments.

    Zakrzewska [10] developed a model based on clustering and decision trees. Since that concept used one classifier (decision tree) for scoring, it may not be applicable across different data sets. In addition, a soft clustering method like fuzzy clustering is superior to hard k-means clustering, as it provides better cluster quality. Hence, this research investigates these two concepts, i.e. the use of soft clustering to identify the segments and the use of the best classifier for each segment, with the hypothesis that the resulting classifier system will provide better control over scoring.

    In this paper we present a framework using fuzzy clustering and different classifiers for building credit scoring models using local patterns.

    2. SYSTEM ARCHITECTURE

    The proposed system, which is expected to support the evaluation of credit risks by building classifiers, is composed of three main modules.

    Figure 1. System Architecture

    The first module is a segmentation module, where the data set is split into clusters with homogeneous behavior. We use the fuzzy C-means algorithm for clustering, as discussed in the previous section. The second module is the classifier learning module, which builds a model for each classifier on each of the clusters obtained by the previous module. In the third module, the best classifier for each segment is selected based on the configured criteria. In this research we have selected two criteria for evaluation, namely classification accuracy and true positive rate.
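The selection step of the third module can be sketched as follows. This is a minimal illustration under stated assumptions, not the system's actual code: the function name `select_best_classifiers`, the nested-dictionary layout and all scores are hypothetical, made up for the example.

```python
def select_best_classifiers(results, criterion):
    """Pick, for every segment, the classifier with the highest score.

    results:   {segment: {classifier_name: {metric_name: value}}}
    criterion: metric used for selection, e.g. "accuracy" or "tpr"
    """
    best = {}
    for segment, per_classifier in results.items():
        # max() keeps the first classifier encountered in case of a tie.
        best[segment] = max(
            per_classifier,
            key=lambda name: per_classifier[name][criterion],
        )
    return best


# Hypothetical scores, for illustration only (not results from the study).
results = {
    "cluster_1": {
        "naive_bayes": {"accuracy": 0.80, "tpr": 0.70},
        "decision_tree": {"accuracy": 0.78, "tpr": 0.75},
    },
    "cluster_2": {
        "naive_bayes": {"accuracy": 0.72, "tpr": 0.66},
        "decision_tree": {"accuracy": 0.81, "tpr": 0.64},
    },
}

by_accuracy = select_best_classifiers(results, "accuracy")
by_tpr = select_best_classifiers(results, "tpr")
print(by_accuracy)
print(by_tpr)
```

With these made-up numbers the two criteria pick different classifiers for the same segment, which is why the criterion is kept configurable.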

    3. FUZZY C-MEANS CLUSTERING

    Fuzzy C-means clustering (FCM) is a clustering technique that differs from hard k-means, which employs hard partitioning. FCM employs fuzzy partitioning, such that a data point can belong to all groups with different membership grades between 0 and 1.

    FCM is an iterative algorithm. Its aim is to find cluster centers (centroids) that minimize a dissimilarity function. A brief summary of the considerations and major steps is given below.

    The algorithm first posits a given number [...] Euclidean distance based "center" of each cluster will be calculated from all the customers' attribute vectors, weighted by their membership degrees in the cluster. The weighting will also be recomputed based on the membership values. The algorithm stops when the pseudo-partition memberships collectively stop changing by a determined amount on successive iterations. The mathematical treatment of the algorithm can be found in [4]. The algorithm used in the research is given in Fig 2.

    Fig 2. Fuzzy clustering algorithm
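The iteration described in this section can be sketched in a few lines of plain Python. This is the standard fuzzy C-means update, offered as an illustration rather than the exact algorithm of Fig 2; the fuzzifier `m = 2`, the random initialisation, the tolerance and the toy points are all assumptions.

```python
import math
import random


def fuzzy_c_means(points, n_clusters, m=2.0, tol=1e-5, max_iter=100, seed=0):
    """Return (centers, memberships) for a list of equal-length tuples."""
    rng = random.Random(seed)
    dim = len(points[0])
    # Random initial memberships, normalised so each point's grades sum to 1.
    u = []
    for _ in points:
        row = [rng.random() + 1e-9 for _ in range(n_clusters)]
        total = sum(row)
        u.append([v / total for v in row])

    for _ in range(max_iter):
        # Centers: mean of all points weighted by membership ** m.
        centers = []
        for c in range(n_clusters):
            weights = [u[j][c] ** m for j in range(len(points))]
            wsum = sum(weights)
            centers.append(tuple(
                sum(w * p[d] for w, p in zip(weights, points)) / wsum
                for d in range(dim)
            ))

        # Memberships: recomputed from Euclidean distances to the centers.
        new_u = []
        for p in points:
            dists = [math.dist(p, c) for c in centers]
            if min(dists) == 0.0:
                # Point coincides with a center: full membership there.
                row = [1.0 if d == 0.0 else 0.0 for d in dists]
            else:
                row = [d ** (-2.0 / (m - 1.0)) for d in dists]
            total = sum(row)
            new_u.append([v / total for v in row])

        # Stop when no membership grade changed by more than tol.
        change = max(abs(a - b) for ra, rb in zip(u, new_u)
                     for a, b in zip(ra, rb))
        u = new_u
        if change < tol:
            break
    return centers, u


# Toy data: two well-separated groups of "customers" with two attributes.
points = [(1.0, 1.0), (1.2, 0.8), (0.9, 1.1),
          (5.0, 5.0), (5.2, 4.9), (4.8, 5.1)]
centers, memberships = fuzzy_c_means(points, n_clusters=2)
labels = [row.index(max(row)) for row in memberships]
print(labels)
```

Each row of `memberships` holds the grades of one point across all clusters; taking the largest grade, as `labels` does, turns the soft partition into a hard segment assignment.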

    4. CLASSIFIERS

    A classifier is a statistical/machine learning function which maps the independent attributes to the dependent attribute with some confidence. There are different types of classifiers [6]. In this work, five classifiers are used, namely naive Bayes, logistic regression, decision trees, artificial neural networks and support vector machines. A brief overview of these techniques is given below:

    4.1 Naive Bayes classifier

    The probability model for a classifier is a conditional model over a dependent class variable $C$ with a small number of outcomes or classes, conditional on several feature variables $F_1$ through $F_n$. This conditional model can be extended using Bayes' theorem as

    $$p(C \mid F_1, \ldots, F_n) = \frac{p(C)\, p(F_1, \ldots, F_n \mid C)}{p(F_1, \ldots, F_n)}$$

    However, the above equation assumes interdependence. When this model is relaxed with the assumption of independence, the conditional distribution over the class variable $C$ can be expressed as:

    $$p(C \mid F_1, \ldots, F_n) = \frac{1}{Z}\, p(C) \prod_{i=1}^{n} p(F_i \mid C)$$

    where $Z$ is a scaling factor dependent only on $F_1, F_2, \ldots, F_n$, i.e., a constant if the values of the feature variables are known.

    Fig 3. Naive Bayesian classifier

    Models of this form are much more manageable, since they factor into a so-called class prior p(C) and independent probability distributions p(Fi|C). This is the naive Bayes classifier, which has shown surprisingly good performance on real-life data sets.
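The factorisation into a class prior and per-feature distributions translates almost directly into code. The sketch below multiplies p(C) by each p(Fi|C) and normalises by the scaling factor Z. The function name, the two features and every probability value are hypothetical, chosen only to make the arithmetic easy to follow.

```python
def naive_bayes_posterior(prior, likelihood, observed):
    """Posterior p(C | F1..Fn) = (1/Z) * p(C) * prod_i p(Fi | C).

    prior:      {class: p(C)}
    likelihood: {class: [{feature_value: p(Fi = value | C)}, ...]}
    observed:   list with the observed value of each feature
    """
    unnormalised = {}
    for c, p_c in prior.items():
        score = p_c
        for i, value in enumerate(observed):
            score *= likelihood[c][i][value]
        unnormalised[c] = score
    z = sum(unnormalised.values())  # scaling factor Z
    return {c: s / z for c, s in unnormalised.items()}


# Hypothetical probabilities for two features: income band, prior default.
prior = {"good": 0.7, "bad": 0.3}
likelihood = {
    "good": [{"low": 0.2, "high": 0.8}, {"yes": 0.1, "no": 0.9}],
    "bad": [{"low": 0.6, "high": 0.4}, {"yes": 0.5, "no": 0.5}],
}
post = naive_bayes_posterior(prior, likelihood, ["low", "yes"])
print(post)
```

For this applicant the unnormalised scores are 0.7 x 0.2 x 0.1 = 0.014 for "good" and 0.3 x 0.6 x 0.5 = 0.09 for "bad", so the posterior favours "bad" even though the prior favours "good".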

    4.2 Logistic Regression

    Logistic regression is the most widely used classifier in credit risk modeling. Logistic regression can predict the probability $P$ that an example $X$ belongs to one of two predefined classes. Suppose example $X = (x_1, x_2, x_3, \ldots, x_n)$. As in linear regression, logistic regression gives each $x_i$ a coefficient $w_i$ which measures the contribution of $x_i$ to variations in $P$. First, a logistic transformation of $P$ is defined as

    $$\mathrm{logit}(P) = \ln \frac{P}{1 - P}$$

    where $P$ can only range from 0 to 1, while $\mathrm{logit}(P)$ ranges from $-\infty$ to $\infty$. $\mathrm{logit}(P)$ is then matched by a linear function of the feature variables (with intercept $w_0$):

    $$\mathrm{logit}(P) = w_0 + w_1 x_1 + w_2 x_2 + \cdots + w_n x_n$$
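A short sketch makes the relation between the logit and the predicted probability concrete: inverting the logit of a linear score gives the probability, and applying the logit again recovers the score. The coefficient values, the intercept `bias` and the feature vector are hypothetical.

```python
import math


def logit(p):
    """Logistic transformation: log-odds of a probability 0 < p < 1."""
    return math.log(p / (1.0 - p))


def predict_probability(weights, bias, x):
    """Invert the logit: P = 1 / (1 + exp(-(w0 + sum_i w_i * x_i)))."""
    z = bias + sum(w * xi for w, xi in zip(weights, x))
    return 1.0 / (1.0 + math.exp(-z))


# Hypothetical coefficients and applicant features, for illustration only.
weights = [0.8, -1.5]
bias = 0.2
x = [1.0, 0.5]

p = predict_probability(weights, bias, x)
print(p, logit(p))
```

Here the linear score is 0.2 + 0.8 - 0.75 = 0.25, so `logit(p)` returns 0.25 up to rounding, while `p` itself stays strictly between 0 and 1.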

    4.3 Decision trees

    Decision tree learning is a common method used in data mining. The goal is to create a model that predicts the value of a target variable based on several input variables. Each interior node corresponds to one of the input variables. There are edges to children for each of the possible values of that input variable. Each leaf represents a value of the target variable given the values of the input variables represented by the path from the root to the leaf.

    A tree can be "learned" by splitting the source set into subsets based on an attribute value test. Splitting can be based on different criteria. Two of the most widely used measures are information gain and the Gini index.

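    Information gain:

    $$\mathrm{Gain}(S, A) = H(S) - \sum_{v} \frac{|S_v|}{|S|}\, H(S_v), \qquad H(S) = -\sum_{j} p_j \log_2 p_j$$

    Gini index:

    $$\mathrm{Gini}(S) = 1 - \sum_{j} p_j^2$$

    Here $S$ is the set of examples at a node, $S_v$ is the subset of $S$ for which attribute $A$ takes the value $v$, and $p_j$ is the proportion of examples in $S$ that belong to class $j$.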

    Fig 4. Decision tree classifier

    This process is repeated on each derived subset in a recursive manner called recursive partitioning. The recursion is completed when all examples in the subset at a node have the same value of the target variable, or when splitting no longer adds value to the predictions.
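The two splitting measures can be computed directly from class counts. The sketch below uses the standard textbook definitions (entropy-based information gain and Gini impurity), which may differ in notation from the forms used in the study; the attribute name `owns_home` and the four toy records are hypothetical.

```python
import math
from collections import Counter


def entropy(labels):
    """H(S) = -sum_j p_j * log2(p_j) over the class proportions p_j."""
    n = len(labels)
    return -sum((k / n) * math.log2(k / n) for k in Counter(labels).values())


def gini(labels):
    """Gini(S) = 1 - sum_j p_j ** 2."""
    n = len(labels)
    return 1.0 - sum((k / n) ** 2 for k in Counter(labels).values())


def information_gain(rows, labels, attribute):
    """Entropy reduction obtained by splitting on rows[i][attribute]."""
    subsets = {}
    for row, label in zip(rows, labels):
        subsets.setdefault(row[attribute], []).append(label)
    remainder = sum(len(s) / len(labels) * entropy(s)
                    for s in subsets.values())
    return entropy(labels) - remainder


# Toy records: the attribute separates the two classes perfectly.
rows = [{"owns_home": "yes"}, {"owns_home": "yes"},
        {"owns_home": "no"}, {"owns_home": "no"}]
labels = ["good", "good", "bad", "bad"]

print(entropy(labels), gini(labels),
      information_gain(rows, labels, "owns_home"))
```

A perfectly balanced two-class node has entropy 1 and Gini index 0.5; because the toy attribute yields pure subsets, the split removes all of that entropy, which is also the stopping condition described above.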

    4.4 Artificial Neural Networks

    An Artificial Neural Network (ANN) is an information processing paradigm that is inspired by the way biological nervous systems, such as the brain, process information. The key element of this paradigm is the novel structure of the information processing system. It is composed of a large number of highly interconnected processing elements (neurons) working in unison to solve specific problems. Learning in neural networks is accomplished by adjusting the connection weights iteratively until convergence.

    Fig 5. Artificial Neural Networks

    Each of the feed-forward connections is computed using the activation function. Typically, feedback of the delta computations is used to minimize the errors during learning. In credit risk modeling, neural networks are second only to logistic regression in usage.
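To make the forward computation and the delta feedback concrete, here is a minimal sketch of a single neuron trained with the delta rule. It assumes a logistic sigmoid activation and a squared-error objective, neither of which is specified in the text; the learning rate, epoch count and the toy OR data are likewise arbitrary choices.

```python
import math


def sigmoid(z):
    """Logistic activation (an assumed choice of activation function)."""
    return 1.0 / (1.0 + math.exp(-z))


def train_neuron(samples, lr=0.5, epochs=5000):
    """Train one sigmoid neuron with the delta rule on (inputs, target) pairs."""
    n_inputs = len(samples[0][0])
    weights = [0.0] * n_inputs
    bias = 0.0
    for _ in range(epochs):
        for inputs, target in samples:
            # Feed-forward pass: weighted sum followed by the activation.
            out = sigmoid(bias + sum(w * x for w, x in zip(weights, inputs)))
            # Delta: output error times the derivative of the sigmoid.
            delta = (target - out) * out * (1.0 - out)
            # Feedback: adjust each connection weight along its input.
            weights = [w + lr * delta * x for w, x in zip(weights, inputs)]
            bias += lr * delta
    return weights, bias


# Toy data: logical OR, which a single neuron can represent.
samples = [((0, 0), 0), ((0, 1), 1), ((1, 0), 1), ((1, 1), 1)]
weights, bias = train_neuron(samples)
predictions = [
    round(sigmoid(bias + sum(w * x for w, x in zip(weights, inputs))))
    for inputs, _ in samples
]
print(predictions)
```

A multi-layer network repeats the same two phases per layer, propagating the deltas backwards from the output; the single neuron is used here only to keep the example short.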

    4.5 Support Vector Machines (SVM)

    A Support Vector Machine is a supervised learner for classification. An SVM views the input data as two sets of vectors in an n-dimensional space and constructs a separating hyperplane in that space, one which maximizes the margin between the two data sets.

    Fig 6. Support vector machines

    In order to calculate the margin, two parallel hyperplanes are constructed, one on each side of the separating hyperplane, which are "pushed up against" the two data sets. Intuitively, a good separation is achieved by the hyperplane that has the largest distance to the neighboring data points of both classes, since in general the larger the margin, the lower the generalization error of the classifier. In formal terms an SVM can be written as (in its dual form):

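    Maximize (in $\alpha_i$)

    $$\tilde{L}(\alpha) = \sum_{i=1}^{m} \alpha_i - \frac{1}{2} \sum_{i,j} \alpha_i \alpha_j y_i y_j\, x_i^{T} x_j$$

    subject to (for any $i = 1, \ldots, m$)

    $$\alpha_i \ge 0$$

    and

    $$\sum_{i=1}^{m} \alpha_i y_i = 0$$

    where $x_i$ are the $m$ training vectors, $y_i \in \{-1, +1\}$ are their class labels and $\alpha_i$ are the dual variables.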

    It has been found that support vector machines work well in credit risk modeling.
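The link between the dual variables, the separating hyperplane and the margin can be checked on a problem small enough to solve by hand. The sketch assumes the standard hard-margin dual (maximize sum_i a_i - 1/2 sum_ij a_i a_j y_i y_j <x_i, x_j> subject to a_i >= 0 and sum_i a_i y_i = 0); the two training points and the hand-derived multipliers are illustrative only, and no optimizer is involved.

```python
def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def dual_objective(alphas, xs, ys):
    """sum_i a_i - 1/2 * sum_ij a_i a_j y_i y_j <x_i, x_j>."""
    n = len(xs)
    quad = sum(alphas[i] * alphas[j] * ys[i] * ys[j] * dot(xs[i], xs[j])
               for i in range(n) for j in range(n))
    return sum(alphas) - 0.5 * quad


def weight_vector(alphas, xs, ys):
    """w = sum_i a_i y_i x_i, recovered from the dual variables."""
    dim = len(xs[0])
    return [sum(a * y * x[d] for a, y, x in zip(alphas, ys, xs))
            for d in range(dim)]


# Toy problem: one point per class, mirrored through the origin.
xs = [(1.0, 1.0), (-1.0, -1.0)]
ys = [1, -1]
alphas = [0.25, 0.25]  # optimal multipliers for this toy case, derived by hand

# Both dual constraints hold for these multipliers.
feasible = (all(a >= 0 for a in alphas)
            and abs(sum(a * y for a, y in zip(alphas, ys))) < 1e-12)

w = weight_vector(alphas, xs, ys)
objective = dual_objective(alphas, xs, ys)
margin = 2.0 / dot(w, w) ** 0.5  # distance between the two parallel hyperplanes
print(feasible, w, objective, margin)
```

The margin comes out as 2 * sqrt(2), exactly the distance between the two points, so the two parallel hyperplanes are indeed "pushed up against" the data.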

    5. EXPERIMENTAL RESULTS AND DISCUSSIONS

    Experiments were done on a real-life credit risk data set collected for an Indian bank. The experiments consist of evaluating and comparing the quality of results obtained by the best classifier for each segment against a similar classifier developed using the whole data. When the classifier is learned on the whole data set, ten-fold cross-validation is adopted to test the model. Since the segment sizes are small, a leave-one-out approach is adopted for validating the classification models on the segments.
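The two validation schemes differ only in the number of folds, as the sketch below shows: leave-one-out is k-fold cross-validation with as many folds as samples. The helper names are hypothetical, and the folds are taken in index order without shuffling, which is a simplification; the sample counts are arbitrary.

```python
def k_fold_splits(n_samples, k):
    """Yield (train_indices, test_indices) for k roughly equal folds."""
    indices = list(range(n_samples))
    # The first (n_samples % k) folds receive one extra sample.
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0)
                  for i in range(k)]
    start = 0
    for size in fold_sizes:
        test = indices[start:start + size]
        train = indices[:start] + indices[start + size:]
        yield train, test
        start += size


def leave_one_out_splits(n_samples):
    """Leave-one-out: every sample is the test set exactly once."""
    return k_fold_splits(n_samples, n_samples)


ten_fold = list(k_fold_splits(25, 10))
loo = list(leave_one_out_splits(4))
print([len(test) for _, test in ten_fold])
print(loo)
```

Leave-one-out trains on all but one record each time, which is why it suits small segments where holding out a tenth of the data would leave too few test cases per fold.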

    Table 1 shows the classification accuracy of the different classification algorithms. It can be seen that all the algorithms perform well on the validation set. One of them (decision tree) has built-in feature selection; another (logistic regression) is used with forward selection. Two other classifiers were built using the full data set and all the attributes. Since a similar approach is used in learning the classifiers on segmented data, further pruning was not carried out on the algorithms.

    Table 2 shows the true positive rates of the different classification algorithms. It can be observed that all the classifiers perform similarly when all the data is used for modeling. This indicates that the classification boundaries learned by each of the classifiers are optimal for the given data. Further data transformation or tuning of the classifier learning parameters may improve the classification accuracy. However, our intention was to compare the performance of the classifiers on segments with the same parameter settings. It can be seen that no single classifier is superior in all the segments on all of the performance measures. This motivated our approach of selecting the best classifier for each segment. It is clear from the tables that the best classifier for each segment provides superior performance.
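The two performance measures reported in the tables can be computed from predicted and actual labels as sketched below. Which class counts as "positive" is not stated in the text; treating "bad" payers as the positive class is an assumption made for this example, and the labels are made up.

```python
def accuracy(y_true, y_pred):
    """Share of examples whose predicted label matches the actual label."""
    correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return correct / len(y_true)


def true_positive_rate(y_true, y_pred, positive):
    """Share of actual positives that the classifier also predicts positive."""
    true_pos = sum(1 for t, p in zip(y_true, y_pred)
                   if t == positive and p == positive)
    actual_pos = sum(1 for t in y_true if t == positive)
    return true_pos / actual_pos


# Made-up labels for five applicants, for illustration only.
y_true = ["bad", "bad", "good", "good", "good"]
y_pred = ["bad", "good", "good", "good", "bad"]

acc = accuracy(y_true, y_pred)
tpr = true_positive_rate(y_true, y_pred, positive="bad")  # assumed positive class
print(acc, tpr)
```

In this toy case the accuracy is 3/5 while only 1 of the 2 bad payers is caught, which illustrates why the two measures can rank classifiers differently on the same segment.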

    Table 1. Classification accuracy

    Table 2. True positive rates

    6. CONCLUSION

    In this paper, a framework connecting unsupervised (fuzzy clustering) and supervised (classification algorithms) techniques for credit risk evaluation is investigated. The presented technique allows different classifiers to be built for different groups of customers, each providing the best results for its segment. In the proposed approach, each credit applicant is assigned to the most similar group of clients from the training data set, and credit risk is evaluated by applying the classifier selected for this group.
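Scoring a new applicant therefore takes two steps: find the most similar segment, then apply that segment's classifier. The sketch below interprets "most similar" as the nearest cluster center in Euclidean distance, which is an assumption; the centers, the two rule-like classifiers and the applicant are hypothetical.

```python
import math


def score_applicant(applicant, centers, classifiers):
    """Assign the applicant to the nearest center, then apply that segment's classifier."""
    distances = [math.dist(applicant, c) for c in centers]
    segment = distances.index(min(distances))
    return segment, classifiers[segment](applicant)


# Hypothetical cluster centers and per-segment scoring rules.
centers = [(1.0, 1.0), (5.0, 5.0)]
classifiers = [
    lambda x: "good" if x[0] > 0.5 else "bad",  # rule for segment 0
    lambda x: "good" if x[1] > 6.0 else "bad",  # rule for segment 1
]

segment, risk = score_applicant((4.5, 5.5), centers, classifiers)
print(segment, risk)
```

The applicant lies closest to the second center, so only the second segment's rule is consulted; the same applicant would have been scored "good" by the first segment's rule.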

    Results obtained on the real credit risk data set showed higher precision and greater simplicity for the models obtained for each cluster than for the model developed with the whole data set.

    Future research will focus on further investigating the use of Self-Organizing Maps and Expectation Maximization clustering for segmentation, with multiple classification techniques for supervised learning and additional performance measures such as the area under the ROC curve.

    REFERENCES

    [1] B. Baesens, R. Setiono, Ch. Mues, J. Vanthienen. Using Neural Network Rule Extraction and Decision Tables for Credit-Risk Evaluation. Management Science, 49(3), 2003, 312-329.

    [2] M. Bensic, N. Sarlija, M. Zekic-Susac. Modelling Small-Business Credit Scoring by Using Logistic Regression, Neural Networks and Decision Trees. Intelligent Systems in Accounting, Finance and Management, 13, 2005, 133-150.

    [3] G. Chi, J. Hao, Ch. Xiu, Z. Zhu. Cluster Analysis for Weight of Credit Risk Evaluation Index. Systems Engineering-Theory Methodology, Applications, 10(1), 2001, 64-67.

    [4] J.C. Dunn. A Fuzzy Relative of the ISODATA Process and Its Use in Detecting Compact Well-Separated Clusters. Journal of Cybernetics, 3, 1973, 32-57.

    [5] W.E. Henley, D.E. Hand. Construction of a k-nearest neighbor credit-scoring system. IMA Journal of Management Mathematics, 8, 1997, 305-321.

    [6] I.H. Witten, E. Frank. Data Mining: Practical Machine Learning Tools and Techniques. 2nd Edition, Morgan Kaufmann, San Francisco, 2005.

    [7] R.-Z. Luo, S.-L. Pang, S.-S. Qiu. Fuzzy Cluster in Credit Scoring. Proceedings of the Second International Conference on Machine Learning and Cybernetics, Xi'an, 2-5 November 2003, 2731-2736.

    [8] S.S. Satchidananda, Jay B. Simha. Comparing decision trees with logistic regression for credit risk analysis. SAS APAUGC 2006, Mumbai.

    [9] D. West. Neural network credit scoring models. Computers & Operations Research, 27, 2000, 1131-1152.

    [10] D. Zakrzewska. On integrating unsupervised and supervised classification for credit risk evaluation. Information Technology and Control, 2007, Vol. 36, No. 1A.
