Building In-Database Predictive Scoring Model: Check Fraud ... › ... · •Fraudsters take...

Preview:

Citation preview

Copyright © Business Data Miners. 1

Building In-Database Predictive Scoring Model:

Check Fraud Detection Case Study

Jay Zhou, Ph.D.Business Data Miners, LLC

978-726-3182

jzhou@businessdataminers.com

Web Site: www.businessdataminers.com

LinkedIn: www.linkedin.com/in/jiangzhou

Copyright © Business Data Miners. 1

Copyright © Business Data Miners. 2Copyright © Business Data Miners.

Building In-Database Predictive Scoring Model: Check Fraud Detection Case Study

• Four Components

–Business: Check Fraud

–Scoring Model: Logistic Regression

–Modeling Tool: Oracle Data Mining

–Data Storage/Management: Oracle Database

2

Fraud

ChurnLTV

TreeLogistic

reg

ANN

SAS

SPSS Statsoft

Oracle

DB2

SQL Server

Business

DM/PA Tools

Data Storage/Management

Credit

Clustering

ODM

Gradient Boosting

SVM

R

MY SQL

SAS File

R File

Marketing

Regression

Weka

Data Mining/Predictive Analytics Methods

Copyright © 2009 Business Data Miners.

Splus

3

Fraud

ChurnLTV

Tree Logistic reg

ANN

SAS

SPSS Statsoft

Oracle

DB2

SQL Server

Business Problems

DM/PA Tools

Data Storage/Management

Credit

Clustering

ODM

Gradient Boosting

SVM

R

MY SQL

SAS File

R File

Marketing

Regression

Weka

Data Mining/Predictive Analytics Methods

Copyright © 2009 Business Data Miners.

Splus

4

Copyright © Business Data Miners.

Fraud

ChurnLTV

Tree Logistic reg

Oracle 11g

Business Problems

DM/PA Tools/Database

Credit

Clustering

ODM

SVM

Marketing

Regression

Data Mining/Predictive Analytics Methods

Copyright © 2009 Business Data Miners.

Anomaly

5

Copyright © Business Data Miners. 6

About Check Deposit

• When a check is deposited, two banks are involved:

– Depositor’s bank and payer’s bank

– Usually the depositor’s bank makes the funds available for the depositor to withdraw before the payer’s bank actually transfers the funds to the depositor’s bank.

Copyright © Business Data Miners. 7

About Check Deposit Fraud

• Fraudsters take advantage of this “gap”. They deposit fraudulent checks to inflate their account balances and then withdraw funds quickly. The depositor’s bank suffers losses when fraudulent checks returned unpaid by the payer’s bank.

Copyright © Business Data Miners. 8

About Rule Based Deposit Fraud Detection Systems

• Advantages

– Understandable by analysts.

– Make use of analysts’ experience and intuition

• Examples• Rule 1: # of Check Deposits In Last 7 Days More than 35

• Rule 2: Total Amount of Check Deposits In Last 5 Days Higher Than $15,000

Copyright © Business Data Miners. 9

Statistical Predictive Model

• Data driven.

• Built based on a large number of historical cases.

• Advantages

–Handle many variables simultaneously and normally detect more frauds with lower false positives. This is illustrated in the next slide.

Copyright © Business Data Miners. 10

Number of Transactions in Last 7 Days

Do

llar A

mo

un

t of Tra

nsa

ction

s in La

st 7 D

ays

Predictive Model that uses two variables simultaneously

Rule 1. High Dollar Amount

Rule 2. High Frequency

Lowering Threshold Increases False Positive

Copyright © Business Data Miners. 11

The Goal of the Project

• Build a statistical model to predict returned items (RDI).

• The work is experimental and the final model has not been implemented operationally.

• The work was done when Dr. Jay Zhou worked for Citizens Bank as a VP. For business questions, please contact Josh Ablett at josh.ablett@citizensbank.com.

Copyright © Business Data Miners. 12

Major Steps

1. Data Gathering

2. Data Validation

3. Data Preparation

4. Feature Variable Calculation

5. Predictive Model Building/Testing

6. Predictive Model Deployment

Copyright © Business Data Miners. 13

Data Used– Account Data

• Account open date• Account open branch• Account type

– Check Deposit (54 millions)• Deposit Date• Deposit Amount• Paying Bank Number• Receiving Bank Number• Check Serial Number

– Third Party “Known Fraud Account” Hits• A service provider has a repository of known fraudulent checks. A hit is

generated when a check’s paying bank number is found in the known fraudulent check list.

Copyright © Business Data Miners.

Data Challenges

• Security: account numbers, SSN, etc.

• Large Volume: 54 million check items

• Refresh Data Daily

• Various Data Sources

– Text Files

– SQL Server

– Oracle DB

14

Copyright © Business Data Miners. 15

RDBMS Solution

• Oracle DB

• Access is controlled: security

• External table: query or load text file

• heterogeneous services: load SQL Server table

• Materialized view : refresh data based on predefined schedules

• Table Partition, Index: handle large data

Copyright © Business Data Miners. 16

2. Data Validation

• SQL functions

–Check the number of null values

–Minimum, Maximum, Median values

–Number of distinctive values

–Number of records by date

- Histogram: using width_bucket()‏

- And More

Copyright © Business Data Miners. 17

2. Data Validation (continued)‏

• Metadata table

• Store variable statistics.

• Easier to handle hundreds of variables.

• We can query it to get the variables with desired characteristics.

• Unique keys:

• Variables with names containing the word “credit” and percentage of missing values less than 15%

Copyright © Business Data Miners. 18

3. Data Preparation

• Merge transaction data with performance data

– Most of the time, there are no common keys to uniquely link transactions and performance data.

• Unmatched cases

• Duplicates

– Iterative SQL joining

• Try many iterations of matching methods, from “stronger” ones to “weaker” ones. Successfully matched cases are removed from the “pools”.

Copyright © Business Data Miners. 19

3. Data Preparation (continued)

• Sampling– Randomly Sample the data based on depositing account

numbers.

• Creating Training and Testing Sets

– Randomly split data into training and testing sets. There is no account number overlap between two sets.

Copyright © Business Data Miners. 20

4. Feature Variable Calculation

• Based on the original variables existing in the data, we are able to create more salient variables that capture interesting transaction patterns.

Copyright © Business Data Miners. 21

4. Feature Variable Calculation (Continued)‏

• Risk Group Variables

– We calculate historical fraud rate for each account type. Then we group account types into a small number of risk groups based on their historical fraud rates. The group numbers are used as one of input variables to the model.

– Use SQL “group by” to calculate fraud ratio.

Copyright © Business Data Miners. 22

4. Feature Variable Calculation (Continued)‏

• Transaction Time Series Variable• Life Time Maximum Check Amount From Paying Account

– max(AMOUNT) over (partition by ABA_ROUTING_NUMBER ,PAYING_ACCOUNT order by TRANSACTION_DATE rows between unbounded preceding and current row) life_time_max_amount

• Number of Checks Deposited in Last Two Days

– count(1) over (partition by ACCOUNT_NUMBER order by TRANSACTION_DATE range between 1 preceding and current row) num_2d

Copyright © Business Data Miners. 23

Calculate Feature Variables(Continued)

• It is helpful to create as many feature variables as possible.

• Then we can use statistical methods to select a small number of variables that are most relevant to the our target variable, i.e., returned item.

Copyright © Business Data Miners. 24

Data Preparation/Feature Variable Calculation Considerations ‏

• Use views or materialized views when possible.

– Challenge caused by separation of data and processing logic

– With a view or materialized view, the logic that is used to produce the view can be retrieved using the following query:

• select text from user_views where view_name='DATA_VIEW';

Copyright © Business Data Miners. 25

5.Predictive Model Building/Testing

• We used logistic regression models in Oracle 11g ODM.

• Variables used in the model including:– Account Type Risk Group

– Days Since Account Opened– Check Serial Number Out of Sequence– Third Party “Negative File” Hit– Number of Checks Deposited In Past Two Days

– Account Open Branch Risk

– Item Amount vs. Maximum Check Amount from the Paying Bank Account

Copyright © Business Data Miners. 26

5.Predictive Model Building/Testing (Continued)‏

• The final model is stored as an Oracle mining object.

• Prediction_Probability function: remove the traditional bottleneck of model deployment.– select a.id, prediction_probability(model_object,'1'

using *) score from table_data a;

Copyright © Business Data Miners. 27

6. Model Deployment

• All of the steps are implemented as materialized views.

• Score calculation for new data is simply refreshing a series of materialized views.

Copyright © Business Data Miners. 28

7. Reporting

• The checks are rank ordered based on the risk scores. Analysts can view the report using a web browser.

Copyright © Business Data Miners. 29

Model Score Performance

–If we maintain the same fraud prevention staff level, it is forecasted that the total fraud loss prevented will be up 20%.

Copyright © Business Data Miners. 30

Drawbacks of In-Database Modeling

• Data miners are not self-sufficient• Need support from database

administrator

Copyright © Business Data Miners. 31

Summary

• Predictive scoring model – Significantly improve the check deposit fraud

detection performance.• From data miner perspective

– Powerful integration of SQL and advanced modeling functions

• Single environment– Almost all the processes executed within a

database environment– Increased security, productivity, manageability and

scalability.

Copyright © Business Data Miners.

Questions?

Recommended