
Binary Classification Modeling

Using Logistic Regression to Build Credit Scores

Britney Cook and Reuben Hilliard

Supervised by Jennifer Lewis Priestley, Ph.D.

Kennesaw State University

Submitted April 24, 2015

to fulfill the requirements for STAT4330


Executive Summary

The initial objective of this project was to build a binary classification model to predict whether a potential customer would default on a credit line. As the analysis progressed, different objectives arose where a choice had to be made to either optimize the model in the most mathematically appropriate manner or justify a simpler alternative. In most cases, the more optimal route was taken in order to increase profit. Ultimately, the analysis ended with a model comprised of variables in their raw, ordinal, odds, and log odds forms. This model proved to be complex and difficult to explain. However, the final model, which contained only 10 variables, was found to profit $113,956.31 per 1,000 customers. Furthermore, when a cluster analysis was performed using 5 significant transformed variables, the profitability per 1,000 customers almost doubled to $222,078.40. Interestingly enough, the variables used to create these highly profitable clusters all came from the log odds unsupervised transformation, the most complex and difficult form to explain. Though this made interpreting the variables very difficult, the objective to maximize profitability had been met.

After building the final model, the next objective was to further optimize profitability by setting a cut-off point for the probability of defaulting. A classification table and a profitability function were analyzed to determine the cut-off value that would result in the most optimal profitability. Looking at the profitability table output, it was clear that there was room for improvement in the areas of both the Type I Error, which resulted in a loss of $42,400,468.50, and the Type II Error, which resulted in an opportunity cost of $26,909,750. Overall the model profited $85,810,581.50.


Introduction

This research paper describes the process and results of developing a binary classification model, using Logistic Regression, to generate Credit Risk Scores. These scores are then used to maximize a profitability function.

The data for this project came from a Sub-Prime lender. Three datasets were provided:

• CPR. 1,462,955 observations and 338 variables. Each observation represents a unique customer. This file contains all of the potential predictors of credit performance. The variables have differing levels of completeness.    

• PERF. 17,244,104 observations and 18 variables. This file contains the post-hoc performance data for each customer, including the response variable for modeling – DELQID.

• TRAN. 8,536,608 observations and 5 variables. This file contains information on the transaction patterns of each customer.

Each file contains a consistent “MATCHKEY” variable which was used to merge the datasets.

The process for the project included:

• Data Discovery: assignment of the dependent variable
• Data Cleansing and Imputation
• Variable Preparation: discretization and transformations; odds, correlation, and plots; multicollinearity assessment using regression and VIF
• Modeling: sampling, model development, and model evaluation

Each of these processes will be discussed in turn.


Data Discovery

Before any analysis could take place, the two datasets, CPR and PERF, needed to be merged. For the merge, an identifier of the individual customer, labeled MATCHKEY, was used. Some of the options for merging the data included left join, right join, outer join, and inner join.

Supposing that CPR was on the left and PERF was on the right, an inner join would be most appropriate for the following reasons:

• A left join would result in some MATCHKEYs having no post-hoc performance data, including our dependent variable, DELQID, or delinquent ID number, along with CRELIM, or credit limit after getting approved for a credit line.

• A right join would result in some MATCHKEYs having no potential predictor data, which is what will help determine credit performance.

• An outer join would result in no MATCHKEYs having both post-hoc performance and potential predictor data.

Using an inner join would result in MATCHKEYs that contained both post-hoc performance and potential predictor data, two essential pieces of information necessary in developing a binary classification model that optimizes profitability.
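The merge can be sketched in SAS as follows; this is a minimal illustration assuming the two source datasets are simply named CPR and PERF (the exact dataset and library names used in the project are not shown in this paper).

proc sort data=cpr;  by matchkey; run;
proc sort data=perf; by matchkey; run;

data merged;
   merge cpr(in=in_cpr) perf(in=in_perf);
   by matchkey;
   if in_cpr and in_perf;   /* inner join: keep only MATCHKEYs present in both files */
run;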

After merging the data, it was soon discovered that the same MATCHKEY sometimes had multiple DELQIDs as shown in Table 1 below.

Table 1: MATCHKEY/DELQID Problem Discovered (first 10 observations)

Obs   MATCHKEY   CRELIM   DELQID
1     1333324    800      0
2     1333324    800      0
3     1333324    800      0
4     1333324    800      1
5     1333324    800      1
6     1333324    800      1
7     1333324    800      2
8     1333324    800      3
9     1333324    800      4
10    1333324    800      5



Because DELQID is the dependent variable, to continue the analysis, a single DELQID needed to be assigned per MATCHKEY. Descriptions for the DELQID values follow.  

• A DELQID of 0 indicated that either the individual had a new credit file or it was too soon to tell if they would be a good customer or not.

• A DELQID of 1 indicated that the individual was in good standing.
• A DELQID of 2 indicated that the individual was one cycle late.
• A DELQID of 3 indicated that the individual was two cycles late.

The variable continued to follow this trend.

Options for deciding which DELQID to use included taking the best DELQID, worst DELQID, median DELQID, mean DELQID or most recent DELQID. The most conservative approach was to go with the worst DELQID. While this did increase the risk of making a Type II Error, or not lending to a customer that would have paid back the money, it did decrease the risk of making a Type I Error, or lending to a customer that would not have paid back the money.

For the procedure, the data was sorted by MATCHKEY and then by ascending DELQID. The last DELQID, or worst DELQID, for each MATCHKEY was then kept and all others were discarded. Each MATCHKEY now had a single DELQID value assigned as shown in Table 2 below.
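A minimal SAS sketch of this step, assuming the merged dataset from the previous step is named merged:

proc sort data=merged;
   by matchkey delqid;          /* ascending DELQID within each MATCHKEY */
run;

data worst_delqid;
   set merged;
   by matchkey;
   if last.matchkey;            /* keep only the last (worst) DELQID per customer */
run;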

Up until this point, only the two datasets, CPR and PERF, had been merged, and each MATCHKEY had been assigned a single DELQID. Nothing had yet been done with regard to missing values, which is why observation 6 in Table 2 below contains missing values. The resulting dataset, after merging CPR and PERF and assigning a single DELQID to each MATCHKEY, had 1,255,429 observations and 357 variables.

Table 2: MATCHKEY/DELQID Problem Resolved (first 10 observations)

Obs   MATCHKEY   CRELIM   DELQID
1     1333324    800      6
2     1333329    1500     1
3     1333334    2000     0
4     1333410    3000     6
5     1333414    4400     0
6     1333433    -        -
7     1333437    1390     1
8     1333443    2250     1
9     1333463    10000    0
10    1333538    3000     2


Given that this would be a binary classification model using logistic regression, DELQID needed to be reconfigured into a binary variable labeled "GOODBAD". This new dependent binary variable had a value of either "0" or "1", where "0" was defined as a DELQID value of 0-2, which was considered good, and "1" was defined as a DELQID value of 3 or greater, which was considered bad. In other words, if a customer had a new credit file, was in good standing, or was only one cycle late, their identifier, MATCHKEY, received a GOODBAD value of "0". If the customer was two or more cycles late, their MATCHKEY received a GOODBAD value of "1". The result of this reconfiguration is shown in Table 3 below.

Prior to the creation of this table, observations where DELQID was missing were deleted, which explains why there are no missing observations in Table 3 below. In Table 4 below, the descriptive statistics for the new response variable, GOODBAD, are listed. From the table it can be seen that the majority of the response, or 82.43%, had a GOODBAD value of 0, while the other 17.57% had a GOODBAD value of 1.
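A minimal sketch of the recoding, assuming the dataset with one worst DELQID per MATCHKEY is named worst_delqid:

data scored;
   set worst_delqid;
   if delqid = . then delete;          /* drop records with a missing DELQID */
   if delqid <= 2 then goodbad = 0;    /* good: new file, current, or one cycle late */
   else goodbad = 1;                   /* bad: two or more cycles late */
run;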

Next on the list was the matter of coded values. For example, RMS variables that were two digits long contained values ranging from 0-99. However, only the values 0-92 were valid numerical values, where 92 represented all numerical values of 92 or greater. The values 93-99 were coded, meaning that they stood for a particular status or were defined as something non-numerical.

Table 3: DELQID After Being Reconfigured into a Binary Variable (first 10 observations)

Obs   MATCHKEY   CRELIM   DELQID   GOODBAD
1     1333324    800      6        1
2     1333329    1500     1        0
3     1333334    2000     0        0
4     1333410    3000     6        1
5     1333414    4400     0        0
6     1333437    1390     1        0
7     1333443    2250     1        0
8     1333463    10000    0        0
9     1333538    3000     2        0
10    133572     5500     0        0

Table 4: Descriptive Statistics for GOODBAD

GOODBAD   Frequency   Percent   Cumulative Frequency   Cumulative Percent
0         1034829     82.43     1034829                82.43
1         220600      17.57     1255429                100


This was a problem because the software, SAS, would read all values, including the coded ones, as quantitative, which in many cases led to very misleading statistics (e.g., mean, median). In Figure 1 below, when the variable AGE was left untouched, SAS gave what looked to be an approximately normal distribution, aside from the stack of outliers to the right of the graph.

It seemed odd that there were so few customers in their 80s and 90s but so many that were said to be 100 years old. This observation was common among variables and in many cases was more extreme (i.e. AFR39, or the number of auto finance trades 60+ days past due, where well over half of the variable was coded). Deleting any observation where there was a coded value was not an option because then there would be no observations left for analysis. For this reason, the coded values needed to be imputed, or replaced with an actual numerical value that made sense.

Possibilities for imputation included the following.

• Stratified Imputation
• Regressed Imputation
• Mean-Based Imputation
• Median-Based Imputation

The best options would have been either stratified or regressed imputation, but with 300+ variables to impute this would have been nearly impossible. The remaining possibilities were a mean-based or a median-based imputation. Because most of the variables, excluding coded values, had skewed distributions, a median-based imputation was the most appropriate approach. In Figure 2 below, AGE is shown after imputing the coded values with the median.

For this particular case, the imputation normalized the data to a certain extent by bringing the mean AGE closer to the median AGE, while getting rid of coded values without having to delete any data.

Figure 1: Histogram of AGE prior to Imputation

Figure 2: Histogram of AGE after Imputation


This can be seen in Table 5 below. One thing to note, aside from the mean getting closer to the median and the standard deviation getting smaller post-imputation, is the maximum, which went from 99 to 91. This was another result of imputing the coded values with the median.

To avoid having to do this manually for each of the 300+ variables, a macro was used to run through and impute values as needed for each variable, outputting a final dataset containing only actual numerical values. The variables DELQID, MATCHKEY, GOODBAD, and CRELIM were excluded from the macro because, unlike the potential predictors, they served as identifiers, and had they been imputed they would have lost all meaning. The two options adjusted with each run of the macro were PCTREM, or percent removed, and MSTD, or max standard deviation. PCTREM specified the threshold at which a variable, after being imputed with the median, was either kept or discarded. For example, a PCTREM value of 50% indicated that if more than 50% of the variable was coded, it needed to be removed from the dataset. MSTD specified the number of standard deviations a value could be from the mean before being imputed. Table 6 below shows the results from the 13 macro executions, along with the specifications for the two options PCTREM and MSTD.

Table 5: Descriptive Statistics for AGE Before and After Imputation

         N         N Miss   Mean         Std Dev      Minimum   Median   Maximum
Before   1255429   0        48.0520237   15.6329305   17        47       99
After    1255429   0        47.6310528   14.9511702   17        47       91

Table 6: Macro Execution Results

Percent Removed   Max Standard Deviation   Variables Remaining
50%               4                        254
45%               4                        225
40%               4                        180
40%               3                        180
40%               2                        180
40%               1                        180
39%               4                        175
38%               4                        157
37%               4                        144
36%               4                        144
35%               4                        144
30%               4                        144
25%               4                        144


There did not appear to be any difference in the variables remaining whether the macro was executed with an MSTD value of 1 or an MSTD value of 4, probably because there was not a significant difference in the extreme values being imputed between the two settings. However, as can be seen in Table 7 below, there was a difference in the descriptive statistics across MSTD values.

Although a MSTD value of 3 would have resulted in less spread or variation within variables, a MSTD value of 4 was used so that the potential impact of outliers was not masked.

For PCTREM, there was a clear break between 38% and 37%, below which decreasing the percentage no longer made a difference in the variables remaining. This break can be viewed in Table 6 above. Between these two percentages, there was a difference of 14 variables, listed in Table 8 below.

Table 8: Variables Lost Using 37% vs. 38% PCTREM


Variable   Description
DCCR39     NUMBER OF DEPT STORE TRADES/CURENTLY 60+ DAYS
DCCR49     NUMBER OF DEPT STORE TRADES/CURENTLY 90+ DAYS
DCCRATE7   NUMBER OF DEPT STORE ACCTS/CURRENTLY BAD DEBT
DCLAAGE    AGE OF DEPT STORE/LAST ACTIVITY
DCN90P24   NUMBER OF 90+,BAD DEBT/DEPT STORE IN 24 MONTHS
DCR29      NUMBER OF DEPT STOR TRADES/EVER 30 DAYS OR WORSE
DCR39      NUMBER OF DEPT STORE TRADES/EVER 60 DAYS OR WORSE
DCR49      NUMBER OF DEPT STORE TRADES/EVER 90 DAYS OR WORSE
DCR7924    NUMBER OF DEPT STORE ACCTS/BAD DEBT PAST 24 MO
DCR29P24   NUMBER OF DEPT STORE TRADES/RATD 2-9 RPTD IN 24 MO
DCR39P24   NUMBER OF DEPT STORE TRADES/RATD 3-9 RPTD IN 24 MO
DCRATE79   NUMBER OF DEPT STORE ACCTS/EVER BAD DEBT
DCTRADES   NUMBER OF DEPT STORE ACCTS

Table 7: Descriptive Statistics at 40% Imputed with 3 & 4 Standard Deviations


                  3 Standard Deviations                 4 Standard Deviations
Variable    Mean       Std Dev    Min   Max       Mean       Std Dev    Min   Max
BRBAL       5257.05    5774.08    0     3064      5545.46    6411.36    0     38782
CRDPTH      143.37     88.95      0     442       147.26     95.7       0     540
LOCINQS     1.62       1.85       0     9         1.72       2.08       0     12
TRADES      18.46      9.77       1     49        18.67      10.11      1     59
TSBAL       16258.08   15427.16   0     74632     16978.59   16807.17   0     93603


Following the review of this table, noting that the 14 variables were all related to department store cards, it was decided that these variables might be useful in assessing credit risk. Therefore the PCTREM value was set at 38% to avoid the loss of these potential predictors.

For the final imputation, if a variable had values that surpassed 4 standard deviations from the mean, those values were imputed with the median. If more than 38% of the values for a variable were coded, the variable was dropped. The main reason for this was that imputing much more than 38% of a variable's values with the median reduces its variance significantly, leaving little to work with and making it hard to draw meaningful conclusions from the data.
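A simplified sketch of the per-variable logic follows, shown here for AGE only and assuming its coded values are 93-99; the actual macro looped over the 300+ candidate predictors and also dropped any variable whose coded share exceeded the PCTREM threshold. Dataset and macro variable names are illustrative.

%let mstd = 4;                          /* max standard deviations before a value is imputed */

proc means data=merged noprint;
   var age;
   where age < 93;                      /* statistics computed on non-coded values only */
   output out=stats mean=mu std=sd median=med;
run;

data _null_;
   set stats;
   call symputx('mu',  mu);
   call symputx('sd',  sd);
   call symputx('med', med);
run;

data imputed;
   set merged;
   /* replace coded values and extreme outliers with the median */
   if age >= 93 or abs(age - &mu) > &mstd * &sd then age = &med;
run;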

Figures 3 and 4 below show the distribution of the variable DCTRADES, or the number of department store accounts, before and after the macro. The problem with DCTRADES is similar to that of the variable AGE, but on a much grander scale where about 38% of the data was coded.

After imputing all of the coded values for each of the potential predictors, the 15 PERF variables (including MATCHKEY and CRELIM), AGE and BEACON were removed from the dataset so that a Variable Cluster Analysis could be performed. The reasons for the removal of these variables were as follows:

• PERF variables were post-hoc performance data, meaning that the variables had no value unless somebody had already established a line of credit.

• AGE is considered discriminatory and could not be used in the model.
• BEACON, the description of which was unknown because it was not listed in the RMS Variables Spreadsheet, contained all missing values in the dataset.

Figure 3: Distribution of DCTRADES Before Macro


Figure 4: Distribution of DCTRADES After Macro


The analysis was initially performed using the SAS PROC VARCLUS procedure at the maximum of 140 clusters on the remaining 140 variables. As part of the diagnostic process, both the dendrogram and the output in Figure 5 below indicated that 80 clusters would be a reasonable number to perform the analysis with. This value fell where the rate of change in explained variation flattened, between 60 and 100 clusters, and it explained roughly 90% of the proportion of variation. By default, if the option "Maxclusters=" is omitted from the VARCLUS procedure, SAS will optimize the number of clusters created. For this particular case, 31 was the optimal number of clusters, determined by the second eigenvalue being less than one for each group of variables (Liau, Tan and Khoo, 2011)¹. For the purposes of this course, however, a suitable minimum number of variables was needed to continue into the next phase of the modeling process.
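A minimal sketch of this step, assuming the cleaned predictor dataset is named predictors:

proc varclus data=predictors maxclusters=80 short;
   var _numeric_;      /* cluster all remaining numeric candidate predictors */
run;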

Following the output of the 80 clusters, the variable in each cluster with the lowest 1-R2 Ratio was selected. The lower the 1-R2 Ratio, the better that variable represents the information in its cluster. Table 9 below shows the output for the first 8 clusters, from each of which the variable with the lowest 1-R2 Ratio was selected.

¹ Liau, A., Tan, T., & Khoo, A. (2011). Scale Measurement: Comparing Factor Analysis and Variable Clustering.

Figure 5: Distribution Curve of Variable Cluster Analysis


Additionally, this table lists the R-squared with Own Cluster and the R-squared with Next Closest. R-squared with Own Cluster is the amount of variance within the variable's own cluster that is explained by the variable. R-squared with Next Closest is the amount of variance in the next closest cluster that is explained by the variable. The ratio is computed as

1-R2 Ratio = (1 - R-squared with Own Cluster) / (1 - R-squared with Next Closest)

so for a 1-R2 Ratio to be low, a variable needed to have a high R-squared with Own Cluster value and a low R-squared with Next Closest value. In other words, the best representative variable for a cluster needs to explain the majority of the variance for its own cluster and little to no variance for the next closest cluster.

Next was the matter of multicollinearity. Multicollinearity was a problem because, if more than one variable explaining the same information was represented in the model, the signs of the beta coefficients could have been reversed.

Table 9: Variable Cluster Analysis at 80 Cluster Max (first 8 clusters)

Cluster   Variable   R-Squared with Own Cluster   R-Squared with Next Closest   1-R2 Ratio   Variable Label
1         BRN90P24   0.9285                       0.6214                        0.1889       brn90p24
          BRR39P24   0.9285                       0.6236                        0.1900       brr39p24
2         BRCRATE1   0.9533                       0.7265                        0.1707       brcrate1
          BROPENEX   0.9523                       0.6143                        0.1237       bropenex
          BRRATE1    0.8853                       0.7536                        0.4654       brrate1
          BRTRADES   0.9521                       0.6139                        0.1241       brtrades
3         CRATE79    0.9522                       0.6085                        0.1222       crate79
          TRATE79    0.9523                       0.6097                        0.1221       trate79
          TRCR39     0.9552                       0.7232                        0.1619       trcr39
          TRCR49     0.9747                       0.7040                        0.0855       trcr49
          TRR49      0.9167                       0.8119                        0.4427       trr49
4         TOPEN12    0.8658                       0.4535                        0.2456       topen12
          TOPEN24    0.8658                       0.3365                        0.2023       topen24
5         DCCR49     0.9383                       0.6677                        0.1857       dccr49
          DCCRATE7   0.9841                       0.6427                        0.0445       dccrate7
          DCRATE79   0.9843                       0.6436                        0.0442       dcrate79
6         TOPENB50   0.9516                       0.6277                        0.1301       topenb50
          TOPENB75   0.9516                       0.6268                        0.1298       topenb75
7         BRR324     0.8515                       0.5044                        0.2996       brr324
          BRRATE3    0.8515                       0.4324                        0.2616       brrate3
8         BRR4524    0.8852                       0.5106                        0.2345       brr4524
          BRRATE45   0.8852                       0.4901                        0.2251       brrate45


This could have been very dangerous, causing the predicted response to be wrong and leading toward a decision that might be very costly in the long run. To reduce as much of this error as possible, Variance Inflation Factors (VIF) were analyzed to help identify redundant variables. A VIF of 10 has been a common threshold in practice and is suggested by some, for example Chatterjee & Price (1991)², to be large enough to indicate a potential problem. For this reason it was decided to keep variables with a VIF of 10 or less. To find these variables, a PROC REG statement with the /VIF option was run. The output was then exported into Excel, the variables were sorted by VIF in ascending order, and any found with a VIF greater than 10 were discarded. These variables were then matched to the 80 clustered variables with the lowest 1-R2 Ratio. During this step, 15 variables did not match because of either or both of the following reasons.

• The variable had a VIF larger than 10.
• The variable did not have the lowest 1-R2 Ratio value in its respective cluster.
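The VIF screen itself can be sketched as follows; the dataset name and the handful of predictors listed are illustrative only (the actual run included all remaining candidate predictors):

proc reg data=candidates;
   model goodbad = avgmos brminb dctrades dcwcrate colls / vif;   /* VIF reported for each predictor */
run;
quit;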

Table 10 below displays a sample output of the variables, including their VIF and cluster in which they belonged.

Something interesting to note about the 65 variables that remained in the dataset was that they included DCTRADES, as shown in Table 8 above. Earlier, in the data cleansing and imputation stage, there was the option of allowing the macro to keep variables that were either 38% or 37% imputed. At the time, the decision was made to continue with 38%, which retained the 14 variables listed in Table 8 that were thought to possibly be useful in the model. DCTRADES was one of those variables and is a perfect example of the cascading effect that certain decisions have throughout the process.

² Chatterjee, S., & Price, B. (1991). Regression analysis by example (2nd ed.). New York: John Wiley.

Table 10: Sample Variance Inflation Factor Output

Variable   VIF       Cluster Found
BRMINB     1.54646   47
DCWCRATE   1.5761    31
DCTRADES   1.59662   39
BRMINH     1.6268    23
COLLS      1.66598   25


Variable Preparation

After all the variables had been cleaned and the matter of multicollinearity was addressed, the distribution for each variable needed to be reviewed. Most of the variables did not have normal distributions and needed to be transformed in order to find the optimal mathematical form in which the variable could predict the response, GOODBAD. There were two different ways this was done.

• Discretization 1, which was user-defined and exercised equal-widths logic.
• Discretization 2, which was SAS-defined and exercised equal-frequencies logic.

For each of these approaches, three different monotonic transformations took place.

• Ordinal
• Odds = p / (1 - p), where p is the average probability of default (GOODBAD = 1) within a rank
• Log odds = log( p / (1 - p) ), the natural log of the odds

The following is the process that was used to transform each of the 65 potential predictors that remained in the dataset, with a few exceptions that will be mentioned towards the end. For this demonstration, the variable AVGMOS, or the number of months the account had been open, will be used. Note that the process starts with Discretization 2, as it was the more involved of the two.

Before any transformation took place, the descriptive statistics for the variable were looked over to make sure that the variable was in fact clean. Below in Table 11, are the descriptive statistics for the variable, AVGMOS. Some things to look for were missing values, coded values and the difference between the mean and median. For the variable, AVGMOS, there were no missing values, no coded values (values 193-199), and the mean and median were relatively close to one another. After checking all of these attributes, the variable, AVGMOS, could undergo transformation.

Discretization 2: SAS-Defined

For the ordinal transformation, the data first needed to be sorted in ascending order. It was then divided into a specified number of groups, each with an equal frequency of observations; for this analysis the default of 10 groups was used. In other words, each of the 10 groups, or ranks, should have contained 10% of the total observations.

Table 11: Descriptive Statistics for AVGMOS


N         N Miss   Mean     Std Dev   Minimum   Median   Maximum
1255429   0        58.263   29.883    0         56       182


SAS, however, was not always able to capture exactly 10% of the data in each rank, as shown in Table 12 below.

While each rank for the variable, AVGMOS, was close to 10%, they did vary slightly. For some variables, the percentage of data in each rank varied a lot. The main reason for this was because of ties or when more than 10% of a variable was imputed, giving more than 10% of the data the same value. SAS, not being able to break those values up, therefore had to compensate and distribute the remaining data the best it could over the remaining ranks.
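A minimal sketch of this equal-frequency ranking step, assuming the cleaned dataset is named imputed; the rank variable name is chosen here for illustration:

proc rank data=imputed groups=10 out=ranked_avgmos;
   var avgmos;
   ranks rank_avgmos;      /* ranks 0-9, each holding roughly 10% of the observations */
run;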

Below, in Table 13, is the summary of the ranks defined above. This table was important with regard to the interpretation of the variable, in this case AVGMOS, and its relationship to the dependent variable, GOODBAD. The following were particular statistics of note.

Table 12: SAS-Defined Ranks for AVGMOS

RANK   Frequency   Percent   Cumulative Frequency   Cumulative Percent
0      118172      9.41      118172                 9.41
1      129966      10.35     248138                 19.77
2      122021      9.72      370159                 29.48
3      131796      10.5      501955                 39.98
4      111748      8.9       613703                 48.88
5      135386      10.78     749089                 59.67
6      123489      9.84      872578                 69.5
7      129508      10.32     1002086                79.82
8      126079      10.04     1128165                89.86
9      127264      10.14     1255429                100


Table 13: Summary of SAS-Defined Ranks for AVGMOS

RANK   avg_indp   avg_dep   std_indp   std_dep   min_indp   min_dep   max_indp   max_dep
0      14         0.21029   4          0.40751   0          0         20         1
1      26         0.1892    3          0.39167   21         0         30         1
2      35         0.19507   3          0.39626   31         0         39         1
3      44         0.19612   3          0.39706   40         0         48         1
4      52         0.18726   2          0.39012   49         0         55         1
5      59         0.17673   2          0.38144   56         0         63         1
6      67         0.17325   2          0.37847   64         0         71         1
7      76         0.16318   3          0.36953   72         0         81         1
8      88         0.14845   4          0.35555   82         0         96         1
9      117        0.12109   18         0.32624   97         0         182        1


• avg_indp, in this case, was the average number of months an account had been open for that rank.

• avg_dep was the average probability of default for that rank.

From rank 0 it can be seen that customers who had had an account open between 0 and 20 months had, on average, a 21.03% chance of defaulting. Following avg_indp and avg_dep down the ranks, there appeared to be a trend.

• As the average number of months an account had been open for increased, the probability of default decreased.

This was a good example of a relationship a potential predictor needed to have, with regards to the response variable, in order to successfully predict the probability of default.

After the initial ranks had been defined, again using equal-frequency logic, a SAS macro was implemented to remove any non-meaningful differences in the variable. Because the only differences of interest were sequential differences, consecutive t-tests were performed between each rank and the rank following it. If there was a significant difference between the two ranks, they remained two separate groups. If, however, there was not a significant difference, the two ranks were combined into one, which then became the rank to be tested against the following rank. Table 14 below is the summary of the newly defined ranks after all the t-tests had been performed and each distinguishing sequential difference had been noted. It can be seen that, because there was not a statistical difference between ranks 2 and 3, rank 2 was collapsed onto rank 3.
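As an illustration, the comparison of one adjacent pair of ranks can be sketched as follows; the actual macro repeated this test for every adjacent pair and collapsed any pair whose difference was not significant (dataset and variable names as assumed above):

proc ttest data=ranked_avgmos;
   where rank_avgmos in (2, 3);    /* compare two adjacent ranks */
   class rank_avgmos;
   var goodbad;                    /* tests the difference in default rate between the ranks */
run;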

In Figure 6 below, a small dip occurs from rank 0 to rank 3. From rank 3 to rank 9 however, there was a consistent slope with a consistent direction, the first of two objectives to be achieved through the ordinal transformation. The other was spread, or the difference in probability of default from the lowest rank to the highest. Looking at the same figure, a spread of 9% can be found between rank 0 and rank 9.

Table 14: Summary of SAS-Defined Ranks for AVGMOS After Macro


RANK   avg_indp   avg_dep   std_indp   std_dep   min_indp   min_dep   max_indp   max_dep   pvalue
0      14         0.21029   3          0.40751   0          0         20         1         0
1      26         0.1892    3          0.39167   21         0         30         1         0.000186
3      40         0.19562   2          0.39668   31         0         48         1         0
4      52         0.18726   2          0.39012   49         0         55         1         0
5      59         0.17673   2          0.38144   56         0         63         1         0.020006
6      67         0.17325   3          0.37847   64         0         71         1         0
7      76         0.16318   4          0.36953   72         0         81         1         0
8      88         0.14845   18         0.35555   82         0         96         1         0
9      117        0.12109   18         0.32624   97         0         182        1         0


Overall, the variable AVGMOS looked as if it might be a reasonable predictor for the response, GOODBAD. One additional process was performed before finalizing the ranks and assigning the ordinal codes. Looking again at Figure 6, it can be seen that, because of rank 1, there was not a consistent trend across the graph, which would have been ideal. Because the spread between ranks 1 and 3 was not more than 1%, they were collapsed into one, as shown in Figure 7 below.

Figure 6: Plot of SAS-Defined Ranks for AVGMOS After Macro


Figure 7: Plot of SAS-Defined Ranks for AVGMOS After Collapse


After the ranks had been finalized, they were then assigned an ordinal code. This needed to be done for two reasons.

1. If an ordinal variable had a first rank equal to 0, then even a meaningful beta coefficient would contribute nothing for that rank (the product would be 0), and the variable would effectively drop out. Hence, the first ordinal code could not be 0.

2. Because this was an ordinal variable, the ranks had to be consecutive or the transformation would not be valid.

The assignment of ordinal codes for the variable, ORDEQAVGMOS, is shown below, satisfying these two requirements.

• If rank = 0 then ORDEQAVGMOS = 1
• If rank = 3 then ORDEQAVGMOS = 2
• If rank = 4 then ORDEQAVGMOS = 3
• If rank = 5 then ORDEQAVGMOS = 4
• If rank = 6 then ORDEQAVGMOS = 5
• If rank = 7 then ORDEQAVGMOS = 6
• If rank = 8 then ORDEQAVGMOS = 7
• If rank = 9 then ORDEQAVGMOS = 8
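In SAS, a minimal sketch of this recoding (using the dataset and rank names assumed earlier, and folding the collapsed ranks 1-3 into a single code) might look like this:

data ordinal_eq;
   set ranked_avgmos;
   if rank_avgmos = 0 then ordeqavgmos = 1;
   else if rank_avgmos in (1, 2, 3) then ordeqavgmos = 2;   /* ranks 1-3 were collapsed */
   else ordeqavgmos = rank_avgmos - 2;                      /* ranks 4-9 become codes 3-8 */
run;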

It was by these ordinal codes that the ordinal version of the variable, in this case ORDEQAVGMOS, was defined. The results of the ordinal transformation can be found in Table 15 below.

The next transformation of interest was an odds transformation. This odds transformation was performed on the newly created ordinal variable, in this case ORDEQAVGMOS, and was defined as follows.

Table 15: Ordinal Transformation for AVGMOS (SAS-Defined)

ORDEQAVGMOS   Frequency   Percent   Cumulative Frequency   Cumulative Percent
1             118172      9.41      118172                 9.41
2             383783      30.57     501955                 39.98
3             111748      8.9       613703                 48.88
4             135386      10.78     749089                 59.67
5             123489      9.84      872578                 69.5
6             129508      10.32     1002086                79.82
7             126079      10.04     1128165                89.86
8             127264      10.14     1255429                100


 

ODSEQAVGMOS = avg_dep / (1 - avg_dep), where avg_dep is the average probability of default (GOODBAD = 1) for each level of ORDEQAVGMOS.

The odds for the variable ORDEQAVGMOS can be found in Table 16 below.

The last transformation of interest was a log odds transformation. This was an attempt to linearize the odds relationship with the response, and can be defined as follows.

 

LODSEQAVGMOS = log(ODSEQAVGMOS)

The log of the odds for the variable ORDEQAVGMOS can be found in Table 17 below.
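A minimal SAS sketch of both transformations, continuing with the illustrative dataset and variable names assumed earlier:

proc means data=ordinal_eq noprint nway;
   class ordeqavgmos;
   var goodbad;
   output out=rank_rates mean=avg_dep;    /* default rate within each ordinal level */
run;

proc sort data=ordinal_eq;
   by ordeqavgmos;
run;

data transformed;
   merge ordinal_eq rank_rates(keep=ordeqavgmos avg_dep);
   by ordeqavgmos;
   odseqavgmos  = avg_dep / (1 - avg_dep);   /* odds of default for the level */
   lodseqavgmos = log(odseqavgmos);          /* natural log of the odds       */
run;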

These same three transformations (ordinal, odds, and log odds) were then performed using Discretization 1, described next.

Table 16: Odds Transformation for ORDEQAVGMOS (SAS-Defined)


ODSEQAVGMOS   Frequency   Percent   Cumulative Frequency   Cumulative Percent
0.137779049   127264      10.14     127264                 10.14
0.174335426   126079      10.04     253343                 20.18
0.194998847   129508      10.32     382851                 30.5
0.209561776   123489      9.84      506340                 40.33
0.214670866   135386      10.78     641726                 51.12
0.23040673    111748      8.9       753474                 60.02
0.243189366   383783      30.57     1137257                90.59
0.266282334   118172      9.41      1255429                100

Table 17: Log Odds Transformation for ORDEQAVGMOS (SAS-Defined)


LODSEQAVGMOS   Frequency   Percent   Cumulative Frequency   Cumulative Percent
-1.982103969   127264      10.14     127264                 10.14
-1.7467741     126079      10.04     253343                 20.18
-1.634761635   129508      10.32     382851                 30.5
-1.562736708   123489      9.84      506340                 40.33
-1.538649282   135386      10.78     641726                 51.12
-1.467909142   111748      8.9       753474                 60.02
-1.413914857   383783      30.57     1137257                90.59
-1.323198126   118172      9.41      1255429                100


Discretization 1: User-Defined

For the ordinal transformation, a histogram of the variable was observed to determine two aspects of interest.

1. The distribution, and how the values of that variable were spread over the x-axis
2. The range of the x-axis

The above information was then used to determine the number of ranks that would be defined, as well as the width of each rank. Below in Figure 8, is a histogram of the variable, AVGMOS. It can be seen that the variable had a fairly strong right-skewed distribution and a range from 0-182 months. With regards to the number of ranks that would be defined, a number, small enough so that the information was not too cumbersome, but large enough that no important information would be missed, needed to be considered. For the variable, AVGMOS, 10 ranks seemed to have met the criteria.

Next was the matter of deciding upon the widths of each rank. While equal widths would have been ideal, they are not quite appropriate for variables where the majority of the data favors one side. If all the widths were set equal to each other for a variable such as AVGMOS, some ranks could end up containing around 25% of the data and others less than 5%. It is for this reason that, in the case of AVGMOS, the lower ranks, or the ranks on the left side where the majority of the data was, were given smaller widths, and the higher ranks, or the ranks on the right side, were given larger widths.

Figure 8: Histogram of AVGMOS


The defined rank widths for AVGMOS resulted in the frequencies found in Table 18 below, where the descriptive statistics for each rank are also displayed.
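A minimal sketch of this user-defined binning, with cut points mirroring the rank boundaries shown in Table 18 (dataset and variable names as assumed earlier):

data ranked_user;
   set imputed;
   if      avgmos <= 14  then ordavgmos = 1;
   else if avgmos <= 29  then ordavgmos = 2;
   else if avgmos <= 44  then ordavgmos = 3;
   else if avgmos <= 59  then ordavgmos = 4;
   else if avgmos <= 74  then ordavgmos = 5;
   else if avgmos <= 89  then ordavgmos = 6;
   else if avgmos <= 104 then ordavgmos = 7;
   else if avgmos <= 119 then ordavgmos = 8;
   else if avgmos <= 139 then ordavgmos = 9;
   else                       ordavgmos = 10;
run;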

In Table 19 below, the descriptive statistics for the response variable, GOODBAD, by rank, can be found.

Using the information from Table 18 and Table 19, some inferences concerning the relationship between the variable, AVGMOS, and the response, GOODBAD, can be made.

Looking at Tables 18 and 19, it can be seen that customers who had had an account open between 90 and 104 months had, on average, a 13.87% chance of defaulting.

Table 18: Descriptive Statistics of the User-Defined Ranks for AVGMOS


ORDAVGMOS   N Obs    N        Mean         Std Dev      Minimum   Maximum
1           54331    54331    10.2065303   2.8664601    0         14
2           180627   180627   22.4557071   4.2157593    15        29
3           206636   206636   37.127456    4.3294262    30        44
4           241613   241613   52.1963802   4.2963834    45        59
5           232273   232273   66.7771717   4.3066704    60        74
6           163718   163718   81.368982    4.2693602    75        89
7           89023    89023    96.1588241   4.2635659    90        104
8           44035    44035    111.107664   4.2797616    105       119
9           26794    26794    128.073412   5.6539986    120       139
10          16379    16379    154.744063   11.4205595   140       182

Table 19: Descriptive Statistics of GOODBAD by User-Defined Ranks for AVGMOS


ORDAVGMOS   N Obs    N        Mean        Std Dev     Minimum   Maximum
1           54331    54331    0.2234268   0.4165458   0         1
2           180627   180627   0.1922968   0.3941061   0         1
3           206636   206636   0.1966501   0.3974665   0         1
4           241613   241613   0.1854991   0.3887027   0         1
5           232273   232273   0.1729474   0.3782026   0         1
6           163718   163718   0.1571055   0.3639013   0         1
7           89023    89023    0.1387057   0.3456411   0         1
8           44035    44035    0.1211763   0.3263358   0         1
9           26794    26794    0.1118907   0.3152378   0         1
10          16379    16379    0.1037304   0.3049198   0         1


A similar trend to the one observed in the SAS-defined ranks, with relation to the response variable GOODBAD, can be stated as follows.

• As the average number of months an account had been open for increased, the probability of default decreased.

The above trend can be better visualized by the following figure, Figure 9.

Similar to what was seen in the plot of SAS-defined ranks after the macro, there is a dip in the lower ranks prior to the consistent trend seen afterward. Again, because the spread between the ranks in question, in this case ranks 2 and 3, was no more than 1%, rank 2 was collapsed onto rank 3, resulting in the output of Figure 10 below, where there was now a consistent trend across all of the ranks.

Figure 9: Plot of User-Defined Ranks for AVGMOS


The ranks defined above then became the ordinal codes for the variable AVGMOS; it was by these ordinal codes that the ordinal version of the variable, ORDAVGMOS, was defined. The results of the ordinal transformation can be found in Table 20 below.

Next was the odds transformation. Again, this was performed on the newly created ordinal variable, ORDAVGMOS, and was defined as follows.

ODSAVGMOS = avg_dep / (1 - avg_dep), where avg_dep is the average probability of default within each ordinal rank.

Lastly was the log odds transformation. Again, this was an attempt to linearize the odds relationship with the response, and was defined as follows.

LODSAVGMOS = log(ODSAVGMOS)

Table 20: Ordinal Transformation for AVGMOS (User-Defined)


ORDEQAVGMOS   Frequency   Percent   Cumulative Frequency   Cumulative Percent
1             54331       4.33      54331                  4.33
2             387263      30.85     441594                 35.17
3             241613      19.25     683207                 54.42
4             232273      18.5      915480                 72.92
5             163718      13.04     1079198                85.96
6             89023       7.09      1168221                93.05
7             44035       3.51      1212256                96.56
8             26794       2.13      1239050                98.7
9             16379       1.3       1255429                100

Figure 10: Plot of User-Defined Ranks for AVGMOS After Collapse


The odds and log odds for the variable, ORDAVGMOS, can be found in Table 21 below.

While this was the process applied to the majority of the 65 potential predictors, there were a few exceptions.

Variables such as ORRATE3, shown in Figure 11 below, were inherently binary in nature, so an ordinal transformation did not make much sense. Even when an attempt was made to order a binary variable in an equal-frequency fashion, the PROC RANK procedure would fail, eliminating the Discretization 2 transformations altogether. Such variables could therefore only undergo an odds and log odds transformation on the raw variable in Discretization 1.

Variables such as OT3PTOT, shown in Figure 12 below, were converted into binary form because, again, an ordinal transformation did not make much sense. In the case of OT3PTOT, 70% of the data took on the same value, 0. Because of this, SAS could only break the data into 3 ranks: the first, which captured 70% of the data, and the following two, which captured the remaining 30%. This could not follow the equal-frequency logic intended for the Discretization 2 transformations. The Discretization 1 ordinal transformation had the same issue, in that no matter what widths the ranks were given, the first rank would always capture at least 70% of the data. The decision was therefore made to convert the variable into binary form, where an observation could take on a value of either "0" or ">0". The new binary variable would then undergo an odds and log odds transformation on the raw variable in Discretization 1. A sketch of the equal-frequency binning and the binary recode appears below.
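A minimal SAS sketch of the equal-frequency binning attempted for Discretization 2 and the binary recode used for variables like OT3PTOT; the dataset name and the new variable name BINOT3PTOT are illustrative assumptions:

   proc rank data=transformed out=ranked groups=9;
      var avgmos;
      ranks ordeqavgmos;   /* equal-frequency ranks; groups= produces 0-8, add 1 to match the 1-9 coding */
   run;

   data transformed;
      set transformed;
      binot3ptot = (ot3ptot > 0);   /* 1 if the raw value exceeds 0, else 0 */
   run;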

Table 21: Odds and Log odds Transformations for ORDAVGMOS (User-Defined)


ORDAVGMOS   _FREQ_   avg_ind   avg_dep   ODSAVGMOS   LODSAVGMOS
1            54331      10     0.22343    0.28771     -1.24581
2           387263      30     0.19462    0.24165     -1.42027
3           241613      52     0.1855     0.22775     -1.47953
4           232273      67     0.17295    0.20911     -1.56488
5           163718      81     0.15711    0.18639     -1.67992
6            89023      96     0.13871    0.16104     -1.82608
7            44035     111     0.12118    0.13788     -1.98134
8            26794     128     0.11189    0.12599     -2.07157
9            16379     155     0.10373    0.11574     -2.15645


There were a few cases in which a variable displayed two distinctly different relationships, as with BRMINB, whose distribution is seen in Figure 13 below. After defining the widths of the ranks and plotting the newly defined variable, as shown in Figure 14, the two relationships were clear: from rank 1 to rank 3 there is a positive relationship with an approximate 17% spread, and from rank 3 to rank 5 there is a negative relationship with an approximate 9% spread. Not only were the two relationships distinctly different, they were also strong. Variables that displayed this kind of pattern were noted because, if further analysis showed them to be significant in predicting the response, GOODBAD, they would likely be split into two separate variables.

Figure 11: Histogram of ORRATE3


Figure 12: Histogram of OT3PTOT


Finally, certain variables were simply inconsistent in nature, as with BRNEW, shown below in Figure 15. Following in Figure 16, a plot of the SAS-defined ranks is displayed, where it can be seen that neither the trend nor the spread was consistent. This was a fairly strong indicator that the variable would perform poorly in predicting the response, GOODBAD, and that it would likely not be used in the final model.

After all variables had been transformed, the dataset had a new total of 451 variables.

Figure 13: Histogram of BRMINB


Figure 14: Plot of User-Defined Ranks for BRMINB


Figure 15: Histogram of BRNEW


Figure 16: Plot of SAS-Defined Ranks for BRNEW


Modeling

Before the modeling process began, the data was split in two, resulting in a training dataset, which was used to build the model, and a validation dataset, which was used to score it. One reason for this was to help generalize the model by making sure it was not over-fitted, or contorted to accommodate the influential observations specific to the data from which the model was built. This was done by scoring the trained model on the validation dataset and checking whether the results were approximately the same, implying that the model was stable. The training dataset was created by pulling a simple random sample from the master dataset, which now comprised 451 variables and 1,255,429 observations. Because the proportion of GOODBAD needed to be the same in both datasets, the data was first sorted on GOODBAD, and a seed was assigned so that the sample could be recreated if need be. As for how much data would comprise the sample, given that neither dataset should be significantly larger than the other, a 40/60 split was used, where 40% of the data became the training dataset and the remaining 60% became the validation dataset. This resulted in a training dataset of 502,416 observations and a validation dataset of 753,013 observations. A sketch of one way to draw such a split in SAS follows.
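One way to reproduce this kind of split in SAS is sketched below; the dataset name MASTER, the seed value, and the output names are illustrative assumptions rather than the exact code used here:

   proc sort data=master; by goodbad; run;

   proc surveyselect data=master out=split samprate=0.40 seed=20150424 outall;
      strata goodbad;                 /* keeps the GOODBAD proportion equal in both pieces */
   run;

   data train valid;
      set split;
      if selected then output train;  /* 40% training sample      */
      else output valid;              /* remaining 60% validation */
   run;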

It was from this training dataset that the logistic model was built using PROC LOGISTIC. A backward selection was run on all 451 variables, each iteration removing variables that were not significant in predicting the probability of a 1 (the probability that the potential customer would default), until only significant variables remained. This dropped the total number of variables to 177 and resulted in the ROC curve shown in Figure 17 below, with a c-statistic of 0.880, where c = (% concordant + ½ · % tied) / 100. A sketch of this selection step appears below.
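A minimal sketch of the kind of PROC LOGISTIC call involved, assuming the training dataset is named TRAIN and the 451 candidate predictors are held in a macro variable; the macro variable name and the stay significance level are illustrative assumptions:

   ods graphics on;
   proc logistic data=train plots(only)=roc;
      model goodbad(event='1') = &candidate_vars
            / selection=backward slstay=0.05;   /* drop insignificant variables each step */
   run;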

Figure 17: ROC Curve for the Full Model


The percent concordant shown in Table 22 below is the percent of pairs (combinations of two observations, one with GOODBAD=1 and one with GOODBAD=0) in which the predicted probability of a 1 is lower for the observation that is truly a 0 than for the observation that is truly a 1. Percent discordant is the percent of pairs in which the predicted probability of a 1 is higher for the true 0 than for the true 1. The higher the percent concordant the better; given a c-statistic of 0.880, roughly 88% of pairs were ranked correctly by this model.

To address multicollinearity among multiple significant transformations of the same underlying variable, only the most significant transformation, the one with the largest Chi-Square value, was retained, and the duplicates were dropped. For example, LODSEQBRHIC was retained, while BRHIX and ODSBRHIC were dropped. PROC LOGISTIC was then run two more times: once after removing all of the variable replications, which output 60 significant variables that were then sorted by Chi-Square value, and again on the 20 variables with the largest Chi-Square values from the previous run. From these 20 variables, the 10 with the largest Chi-Square values, shown below in Table 23, became the predictors of the final model. A sketch of fitting and scoring the reduced model appears below.
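A sketch of fitting the reduced model and scoring the validation data, assuming the 10 retained predictors are held in a macro variable; the names are placeholders rather than the exact variable list:

   proc logistic data=train;
      model goodbad(event='1') = &final_10_vars;
      score data=valid out=scored;   /* adds P_1, the predicted probability of default */
   run;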

Using the estimates in Table 23, the model can be written in the usual logistic form, log(p / (1 - p)) = b0 + b1X1 + ... + b10X10, where p is the predicted probability of default and the coefficients are the maximum likelihood estimates shown below.

Table 23: Maximum Likelihood Estimates for the Final Model


Table 22: Concordance for the Full Model


Interpretation of BRPCTSAT using Table 23:

For a 1 unit increase in BRPCTSAT, holding all other variables in the model constant, the odds of defaulting decrease by about 89.6%, since (exp(-2.2634) - 1) · 100% ≈ -89.6%.

Interpretation of BRPCTSAT using Table 24 below or the odds ratio estimates:

For a 1 unit increase in BRPCTSAT, the odds of a potential customer defaulting are 0.104 times as large.

Table 24: Odds Ratio Estimates and Wald Confidence Intervals for the Final Model


Model Evaluation

Below in Figure 18, the ROC curve for the final model is displayed, with a c-statistic of 0.8401. This indicates that the model is still strong, considering that the initial logistic procedure run on 451 variables produced a c-statistic only 0.04 higher.

In Table 25 below, it can be seen that the percent concordant was 84, meaning that 84% of pairs were ranked correctly and 16% were not. Considering that only 10 variables were used versus the initial 451, simplifying the model by 441 variables in exchange for roughly a 4% increase in error seemed like a fair trade.

The next thing investigated was the profitability of the model and how it could be optimized. One way to do this was by finding a cut-off point for the probability of default that would maximize profit, using the classification table in Table 26 below.

Figure 18: ROC Curve for the Final Model


Table 25: Concordance for the Final Model


Items to be noted in the classification table (Table 26) when trying to find an optimal cut-off point were as follows:

• High specificity – These were the people that were predicted to be good customers and were actually good. Each incident resulted in a profit increase of $250. This was the best-case scenario.

• High sensitivity – These were the people that were predicted to be bad customers and were actually bad. Each incident did not have a direct impact on the profit, but was considered as a potential loss that was prevented.

• Low false negatives (Type I Error) – These were the people that were predicted to be good customers and were actually bad. Each incident resulted in a loss of half the credit line. This was the worst-case scenario.

• Low false positives (Type II Error) – These were the people that were predicted to be bad customers and were actually good. Again, each incident did not have a direct impact on the profit but was considered as a lost opportunity.

It appeared that the optimal cut-off would fall somewhere between 0.2 and 0.3, as these probability levels both had high specificity, relatively high sensitivity, and low false negatives. The percentage of false positives was neither low nor high; this, however, was not of much concern, as it did not directly affect the profitability of the model.

This value was further investigated in Figure 19 below, where there is an apparent peak in profitability per 1,000 customers. The dollar values for the profitability curve were calculated in SAS, using the profitability criterion mentioned before, for cut-offs from 0.1 to 0.9. As expected from the analysis of the classification table, the peak occurred between 0.2 and 0.3. The profitability between these two values was dissected further into cut-off points of 0.21 through 0.29, as shown in Table 27, where a cut-off of 0.22 proved to be the most optimal, maximizing profitability at $113,956.31 per 1,000 customers. A sketch of the profit calculation at this cut-off appears below.
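A minimal sketch of how profit per 1,000 customers could be computed at the 0.22 cut-off, assuming the scored validation data carries P_1 (the predicted probability of default from the earlier sketch) and a credit-line variable here called CREDLN; both names and the exact loss rule are stated assumptions following the criterion described above:

   data eval;
      set scored;
      pred_bad = (p_1 >= 0.22);                                       /* predicted to default          */
      if pred_bad = 0 and goodbad = 0 then profit = 250;              /* approved and good             */
      else if pred_bad = 0 and goodbad = 1 then profit = -0.5*credln; /* approved but defaults         */
      else profit = 0;                                                /* declined: no direct profit    */
   run;

   proc sql;
      select sum(profit)                   as total_profit,
             1000 * sum(profit) / count(*) as profit_per_1000
      from eval;
   quit;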

Table 26: Classification Table for the Final Model


Following this investigation, a profitability table utilizing the optimal cut-off for the probability of default was generated, shown in Table 28 below.

The results of the profitability table (Table 28) are as follows:

• ERROR1 – 5.64% of people were predicted to be good customers and were actually bad, resulting in a loss of $998,708.48 per 1,000 customers or a total loss of $42,400,168.50.

• ERROR2 – 14.29% of people were predicted to be bad customers and were actually good customers.

Figure 19: Profitability Curve for the Final Model


Table 27: Profit per 1,000 for Percent Cut-Off


Table 28: Profitability Table for the Final Model


• VALID1 – 11.96% of people were predicted to be bad customers and were actually bad customers.

• VALID2 – 68.11% of people were predicted to be good customers and were actually good, resulting in a profit of $250,000 per 1,000 customers, or a total profit of $85,810,581.50.

Another approach to optimizing profitability is to find a cut-off point using the Kolmogorov-Smirnov (KS) test. After sorting the data by probability of default in ascending order, 10 deciles were created. Below in Table 29 the KS values are found, each equal to the difference between the cumulative percentage of goods and the cumulative percentage of bads. The decile at which the spread between the two cumulative percentages is largest gives the largest KS value, which marks the optimal cut-off point.

Looking at the output in Table 29, the largest KS value occurred at 40%. While ideally the cut-off points from the profitability function and the KS test would agree, the cut-off point from the KS test was almost double the one found using the classification table and profitability function. The KS values were plotted in Figure 20 below, where the KS value of 50.88 is shown to have produced the largest spread. A sketch of the decile-level KS calculation follows.
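A sketch of one way to compute the decile-level KS values from the scored data; the dataset and variable names (SCORED, P_1) are assumptions carried over from the earlier sketches:

   proc rank data=scored out=ranked groups=10;
      var p_1;                 /* predicted probability of default  */
      ranks decile;            /* 0 = lowest risk, 9 = highest risk */
   run;

   proc sql noprint;
      create table ks as
      select decile,
             sum(goodbad = 1) as bads,
             sum(goodbad = 0) as goods
      from ranked
      group by decile
      order by decile;

      select sum(bads), sum(goods) into :tot_bads, :tot_goods
      from ks;
   quit;

   data ks;
      set ks;
      cum_bads  + bads;        /* running totals across deciles */
      cum_goods + goods;
      ks_value = abs(100*cum_goods/&tot_goods - 100*cum_bads/&tot_bads);
   run;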

Table 29: Kolmogorov-Smirnov Test for the Final Model


Figure 20: Kolmogorov-Smirnov Curve for the Final Model


The lift ratio for each of the deciles was then computed by dividing the cumulative percent of goods by the cumulative decile percentage (e.g., the cumulative percent of goods for the first decile divided by 0.1, the second by 0.2, and so on). Shown below in Figure 21 are the lift values for each decile. The lift chart shows how much more likely defaults are to be identified using the final model versus no model; in this case, the model was 1.2 times more likely to identify a default than a random model or no model at all. A short continuation of the earlier decile sketch showing this calculation appears below.
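Continuing the decile sketch above, the lift definition used in this paper (cumulative percent of goods over the cumulative decile percentage) could be added to the hypothetical KS table as follows; as before, the table and variable names are assumptions:

   data ks;
      set ks;
      depth = (decile + 1) / 10;                 /* cumulative decile percentage: 0.1, 0.2, ... */
      lift  = (cum_goods / &tot_goods) / depth;  /* lift per the definition used in the paper   */
   run;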

Figure 22 indicates that the model provided an 8.13% improvement over a random model with a 50% probability of default.

Figure 21: Lift Chart for the Kolmogorov-Smirnov Test


Figure 22: Gains Chart for the Kolmogorov-Smirnov Test


Customer Segmentation Analysis

For the customer segmentation analysis, it was decided to cluster six different times: once for each of the three transformations, in both their supervised and unsupervised forms, to see if any of them resulted in a significantly higher profit per 1,000 customers than the model itself. The clusters for the most profitable transformation would also be investigated to see whether there were any notable differences in spending habits between high-profit and low-profit clusters.

In creating the groups of variables for the cluster analysis, the top 5 most significant variables for each transformation were selected. PROC CLUSTER was run on each set of variables, generating three criteria that helped determine the optimal number of clusters: the Cubic Clustering Criterion (CCC), the Pseudo-F statistic, and the Pseudo T-Squared statistic. The optimal number was found where there was a peak in the CCC, a peak in the Pseudo-F, and a dip in the Pseudo T-Squared. As can be seen in Figure 23 below, for the log of odds unsupervised transformation these events occurred at 5 clusters.

PROC FASTCLUS was then used to cluster the data, and the number of clusters specified was confirmed to be appropriate by examining the cluster summary table and the distances between cluster centroids across all of the clusters. In Table 30 below, it can be seen that the distances between centroids are all about the same. Had one been significantly smaller than the rest, collapsing it into the nearest cluster would have been considered. A sketch of these two steps appears below.
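A minimal sketch of the two clustering steps, assuming the five log of odds (unsupervised) variables are held in a macro variable; the macro variable, dataset, and output names are illustrative assumptions:

   proc cluster data=train method=ward ccc pseudo print=15;
      var &lods_vars;          /* prints CCC, Pseudo-F, and Pseudo T-Squared by number of clusters */
   run;

   proc fastclus data=train maxclusters=5 out=clustered;
      var &lods_vars;          /* OUT= gains CLUSTER and DISTANCE; the cluster summary shows centroid distances */
   run;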

Figure 23: Cluster Criteria Analysis for the Log of Odds (Unsupervised)


These clusters can be better visualized using the cluster plot in Figure 24 below.

After defining the clusters, a profitability table was generated so that the profits could be compared to the final model for predicting GOODBAD. The results for the log of odds unsupervised transformation are shown in Table 31 below.    


Recall that the final model without clustering profited $113,956.31 per 1,000 customers. Looking at Table 31, it can be seen that by clustering on the 5 most significant variables from the log of odds unsupervised transformation, the profit per 1,000 customers almost doubles.

Table 30: Cluster Summary for Log of Odds (Unsupervised)


Figure 24: Cluster Plot for Log of Odds (Unsupervised)


Table 31: Cluster Profitability for Log of Odds (Unsupervised)


Limitations and Weaknesses

One of the biggest limitations of this particular model was the difficulty of interpreting the variables that comprised it. Because the most mathematically optimal approach was always taken, without weighing tradeoffs that might have produced a slightly less significant but simpler model, it comes as no surprise that the final model was comprised of raw, ordinal, odds, and log of odds transformations. Furthermore, all of the transformed variables that were retained were unsupervised, making them even harder to explain. Again, the goal for this model was solely to maximize profit, and it would suit a client with that sole goal in mind. However, it is likely not suited for anyone who would like, or needs, to be able to interpret the model in a relatively simple context.

An example of a weakness in this model was the percentage of Type I Errors, or instances of predicting that a person would be a good customer when they were actually bad. Though these were only 5.64% of the data, each incorrect prediction was still a loss of half of the credit line. In this case the total loss was $42,400,168.50, or $998,708.48 per 1,000 customers.

Another big weakness was the percentage of Type II Errors, or missed opportunity. A total of 14.29%, or 107,639 people in this dataset, were predicted to be bad customers when they were actually good customers. The missed opportunity cost to the company can be calculated as 107,639 × $250 = $26,909,750.


Conclusions

While building the most mathematically optimal model may seem like the best approach, since it always maximizes profit, other approaches and trade-offs should always be considered. This depends heavily on what the employer or client is looking for, and since there was no client involved here, the decisions made throughout the modeling process were based solely on what was thought to be best. Looking at the outcome of the final model, which was built mainly with a mathematically optimizing mindset, it can be concluded that this exact approach probably would not work for someone who desired a model that included only variables that were easy to interpret.

In addition, the Type I and Type II Errors need to be recognized. The combined loss and opportunity cost added up to about $70,000,000. This was a very large and unexpected total that was not noted until the final analysis. For future analyses it seems appropriate not only to examine outputs such as the classification table and the profitability function for the most profitable cut-off point, but also to review the consequences of utilizing that particular cut-off point. After calculating the combined loss and opportunity cost and finding it to be excessively large, the best choice might be a more conservative cut-off point. It is always important to discuss these considerations with the client or employer.