
Loan predicting web service


Page 1: Loan predicting web service

© Copyright 2011 EMC Corporation. All rights reserved. EMC Restricted Confidential

Israel Chavez

Ngadhnjim Halilaj

Anusha Kodali

Marcos Quezada

Jyoti Shrestha

Sarat Tadi

April 28, 2016

EMC Education Services: Data Science & Big Data Analytics

Page 2: Loan predicting web service


Project Goals

• Create a model that will allow FPC to provide a loan predicting service to its customers.
• Identify the attributes the model needs to give a better prediction.
• Test the Marketing Department's threshold suggestions.
• Advise FPC on the suggestions it could offer its customers to increase their chances of getting a loan.

Page 3: Loan predicting web service


Situation

• FPC wants to expand the set of services offered to its customers by creating an online site for loan advice.
• Provide a fast and reliable planning platform for customers to manage their personal finances.
• Attract potential customers who want to know their eligibility for loans, thus increasing FPC business.

Page 4: Loan predicting web service


Executive Summary

Logistic regression and decision tree models are both moderately effective at predicting the loan outcome:

• Logistic Regression
  – Precision: 0.786
  – Recall: 0.984
• Decision Tree
  – Precision: 0.784
  – Recall: 0.984

Page 5: Loan predicting web service


Approach - Discovery

• Used the 2010 housing loan database compiled under the Home Mortgage Disclosure Act (HMDA).
• Filtered data based on:
  – Owner-occupied
  – 1-4 Family
  – Action Type (loan originated, application approved but not accepted, application denied, application withdrawn)
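The filtering step above could be sketched roughly as follows. This is only an illustration in plain Python: the field names (`occupancy`, `property_type`, `action`) and the label strings are hypothetical stand-ins, since the slides do not show how the HMDA extract was actually coded.

```python
# Hypothetical field names and labels; the real HMDA extract uses its own coding.
KEEP_ACTIONS = {
    "Loan originated",
    "Application approved but not accepted",
    "Application denied",
    "Application withdrawn",
}


def keep_record(rec):
    """Return True if a record passes the three filters listed on the slide."""
    return (
        rec["occupancy"] == "Owner-occupied"
        and rec["property_type"] == "1-4 Family"
        and rec["action"] in KEEP_ACTIONS
    )


def filter_records(records):
    """Keep only owner-occupied, 1-4 family records with a retained action type."""
    return [rec for rec in records if keep_record(rec)]
```

Keeping the predicate separate from the loop makes it easy to adjust a single criterion without touching the rest.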

Page 6: Loan predicting web service


Data Preparation

• Data Conditioning:
  – Categorical variables were converted to factors and incomplete records were removed to create the data set.
  – Releveled variables to set reference levels for logistic regression.
  – Tested correlation among numeric variables with a correlation matrix.
  – Reduced the data set to “Originated” and “Denied” loans.
• Data Visualization:
  – Reviewed the data to check distributions and noise.
  – Two sources of noise:
    ▪ Home Improvement Loans
    ▪ Loan amounts > $400K
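A minimal sketch of the conditioning steps, assuming records are dictionaries with a hypothetical `action` field. The helper names `condition` and `pearson` are illustrative and not taken from the project's scripts; `pearson` computes one entry of a correlation matrix for two numeric columns.

```python
from math import sqrt


def condition(records):
    """Drop incomplete records, keep originated/denied loans, add a binary target."""
    cleaned = []
    for rec in records:
        # Treat None or empty strings as missing values (an assumption).
        if any(value is None or value == "" for value in rec.values()):
            continue
        if rec["action"] not in ("Loan originated", "Application denied"):
            continue
        row = dict(rec)
        row["approved"] = rec["action"] == "Loan originated"
        cleaned.append(row)
    return cleaned


def pearson(xs, ys):
    """Pearson correlation between two equally long numeric columns."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (spread_x * spread_y)
```

Pairs of numeric variables with a correlation close to 1 or -1 carry largely redundant information, which is why such a check is useful before fitting a regression.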

Page 7: Loan predicting web service


Approach - Model Planning

• Model Selection:
  – Two methods:
    ▪ Logistic Regression
    ▪ Classification Tree
• Regression:
  – The 0.5 and 0.75 thresholds suggested by the Marketing Department were used.

Page 8: Loan predicting web service


Approach - Model Planning

• Variable Selection:
  – Created a Small Set to test three possibilities:
    ▪ Absence of personal data
    ▪ Absence of county data
    ▪ Absence of personal and county data
• Developed two Full models:
  – Model 1: included everything that the example script suggested.
  – Model 2: included only the variables that we chose to build the model with.
• Pseudo-R² was used to assess the goodness of fit of the models.
• ROC & AUC were used to check the performance of our model.
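The slides do not say which pseudo-R² was used. As one common choice, McFadden's pseudo-R² compares the log-likelihood of the fitted model with that of an intercept-only model. The sketch below assumes that definition, with `y` as 0/1 outcomes and `p` as predicted probabilities strictly between 0 and 1.

```python
from math import log


def log_likelihood(y, p):
    """Bernoulli log-likelihood of 0/1 outcomes y under predicted probabilities p."""
    return sum(log(pi) if yi else log(1.0 - pi) for yi, pi in zip(y, p))


def mcfadden_r2(y, p):
    """McFadden's pseudo-R²: 1 - LL(model) / LL(intercept-only model)."""
    base_rate = sum(y) / len(y)  # the intercept-only model predicts this for everyone
    ll_null = log_likelihood(y, [base_rate] * len(y))
    return 1.0 - log_likelihood(y, p) / ll_null
```

A value of 0 means the model does no better than always predicting the overall approval rate; values closer to 1 indicate a better fit.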

Page 9: Loan predicting web service


Approach - Model Building

• Created a Holdout set with 25% of the data to test the models.
• Logistic Regression:
  – Categorized the holdout predictions into three bins:
    ▪ Low threshold (<50%)
    ▪ Medium threshold (50-74%)
    ▪ High threshold (>=75%)
• To further test the Regression model, we experimented with a binary classification: Loan Rejected / Loan Approved
  – First prediction: threshold 0.5.
  – Second prediction: threshold 0.7.
• Decision Tree:
  – Used binary classification: Loan Rejected / Loan Approved
• A confusion matrix was developed to compare both methods.
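The binning and thresholding described above might look like the following sketch. It assumes the "Medium" bin covers predicted probabilities from 0.50 up to, but not including, 0.75; the function names are illustrative only.

```python
def bin_probability(p):
    """Map a predicted approval probability to the Low / Medium / High bin."""
    if p >= 0.75:
        return "High"
    if p >= 0.50:
        return "Medium"
    return "Low"


def classify(p, threshold=0.5):
    """Binary decision: True means Loan Approved, False means Loan Rejected."""
    return p >= threshold
```

Raising the threshold (for example from 0.5 to 0.7) makes the classifier more conservative: fewer applications are predicted as approved, which typically trades recall for precision.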

Page 10: Loan predicting web service


Approach - Model Results and Accuracy

• The Logistic Regression model with threshold 0.5 has predictive power at least as good as that of the Decision Tree model.

Logistic Regression (threshold = 0.5)

                Predicted FALSE   Predicted TRUE
Actual FALSE              2,452           23,657
Actual TRUE               1,385           87,383

Decision Tree model

                Predicted FALSE   Predicted TRUE
Actual FALSE              2,082           24,027
Actual TRUE               1,349           87,419

Model comparison

             Logistic Regression model   Decision Tree model
Accuracy                         0.780                 0.779
Precision                        0.786                 0.784
Recall                           0.984                 0.984
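For reference, accuracy, precision and recall can be derived from the four confusion matrix counts as sketched below, using the counts reported on this slide. Recomputing from the counts gives values within a few thousandths of the reported figures; small differences in the last digit are possible, presumably due to rounding.

```python
def metrics(tn, fp, fn, tp):
    """Accuracy, precision and recall from confusion matrix counts."""
    total = tn + fp + fn + tp
    return {
        "accuracy": (tp + tn) / total,   # share of all predictions that are correct
        "precision": tp / (tp + fp),     # share of predicted approvals that were approved
        "recall": tp / (tp + fn),        # share of actual approvals that were predicted
    }


# Counts taken from the two confusion matrices on this slide.
logistic = metrics(tn=2452, fp=23657, fn=1385, tp=87383)
tree = metrics(tn=2082, fp=24027, fn=1349, tp=87419)
```

The high recall and much lower precision reflect that both models predict "approved" for most applications, so few approvals are missed but many denials are predicted as approvals.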

Page 11: Loan predicting web service


Logistic Regression Prediction

Page 12: Loan predicting web service


Decision Tree Visualization

A Decision Tree model provides a useful benchmark against which to compare the predictive power of the Logistic Regression model.

Page 13: Loan predicting web service


Model Description

• Overview of basic methodology: predict the likelihood of a person getting a loan from FPC.
• Model: Logistic Regression and Decision Tree.
• Dependent variable: “Approved” (whether or not the loan application was approved).
• Scope:
  – 662,997 total observations for year 2010, extracted from the housing loan database assembled by federal agencies pursuant to the Home Mortgage Disclosure Act (HMDA).
• After thorough cleaning, the data set used for modeling had 550,336 observations.
• Sampling:
  – Small set: 10% of the data.
  – Holdout set: 25% of the data.
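The sampling could be sketched as below. The slides do not say how the subsets were drawn, so this assumes simple random sampling with a fixed seed, a holdout set disjoint from the training data, and a small set drawn from the full cleaned data. The function name `split` and its parameters are hypothetical.

```python
import random


def split(records, small_frac=0.10, holdout_frac=0.25, seed=0):
    """Return (small, training, holdout) subsets of the cleaned records."""
    rng = random.Random(seed)  # fixed seed so the split is reproducible
    shuffled = list(records)
    rng.shuffle(shuffled)
    n_holdout = round(len(shuffled) * holdout_frac)
    holdout = shuffled[:n_holdout]
    training = shuffled[n_holdout:]
    # Assumption: the small set is sampled from all cleaned records.
    small = rng.sample(shuffled, round(len(shuffled) * small_frac))
    return small, training, holdout
```

If the small set is also used to fit models that are later scored on the holdout set, it would be safer to sample it from `training` only, so that no holdout record influences the fit.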

Page 14: Loan predicting web service


Data distribution visualization

Visualizing each variable's distribution, and checking how close it is to normal, helps in judging how good a predictor it is.

Page 15: Loan predicting web service


Data distribution visualization

Removing unwanted noise from the data increases the predictive power of the model.

Page 16: Loan predicting web service


ROC/AUC

The ROC curves of the reduced models lie just inside the full model curve; essentially they are the same model.

• Full model: AUC 0.70
• Personal data removed: AUC 0.69
• Personal data and county removed: AUC 0.68
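As a reminder of what these numbers measure, AUC equals the probability that a randomly chosen approved loan receives a higher score than a randomly chosen denied loan, with ties counted as one half. The pairwise sketch below illustrates that definition; it is quadratic in the number of records, so it suits small samples rather than the full data set, and it is not the routine used in the project.

```python
def auc(y, scores):
    """Area under the ROC curve via pairwise comparison of positive and negative scores."""
    positives = [s for label, s in zip(y, scores) if label]
    negatives = [s for label, s in zip(y, scores) if not label]
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in positives
        for n in negatives
    )
    return wins / (len(positives) * len(negatives))
```

On this scale 0.5 corresponds to random guessing and 1.0 to perfect ranking, so values around 0.70 indicate moderate discrimination.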

Page 17: Loan predicting web service


Recommendations

• The data available for analysis supports moderately accurate predictions.
• Logistic Regression and the Classification Tree yield similar results.
• Logistic Regression should be used, considering the web app's response time requirement.
• The model provides an estimate, not an assurance, that a specific customer will or will not get a loan.
• Removing sensitive personal information has almost no effect on model performance.
• Removing county information has almost no effect on model performance.
• A high income increases the chances of getting a loan.
• A higher percentage of minority population in the customer's census tract reduces the chances of getting a loan (we do not recommend showing this finding on the website).
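One reason a logistic regression suits a web app with tight response times is that scoring a single applicant reduces to a weighted sum followed by the logistic function. The sketch below shows that computation; the feature names and coefficient values in the usage note are purely illustrative and are not the fitted model's coefficients.

```python
from math import exp


def predict_probability(features, coefficients, intercept):
    """Logistic regression score: sigmoid of intercept plus weighted feature sum."""
    z = intercept + sum(coefficients[name] * value for name, value in features.items())
    return 1.0 / (1.0 + exp(-z))
```

For example, `predict_probability({"income": 1.2}, {"income": 0.8}, -0.3)` returns a probability between 0 and 1 that the service could present as an estimate, consistent with the caveat that it is not an assurance.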
