

Building In-Database Predictive Scoring Model:

Check Fraud Detection Case Study

Jay Zhou, Ph.D.

Business Data Miners, LLC

978-726-3182

[email protected]

Web Site: www.businessdataminers.com

LinkedIn: www.linkedin.com/in/jiangzhou

Copyright © Business Data Miners.


Building In-Database Predictive Scoring Model: Check Fraud Detection Case Study

• Four Components

–Business: Check Fraud

–Scoring Model: Logistic Regression

–Modeling Tool: Oracle Data Mining

–Data Storage/Management: Oracle Database


[Diagram: four dimensions and the options available in each]

• Business: Fraud, Churn, LTV, Credit, Marketing

• Data Mining/Predictive Analytics Methods: Tree, Logistic reg, ANN, Clustering, Gradient Boosting, SVM, Regression

• DM/PA Tools: SAS, SPSS, Statsoft, ODM, R, Weka, Splus

• Data Storage/Management: Oracle, DB2, SQL Server, MY SQL, SAS File, R File

Copyright © 2009 Business Data Miners.


[Diagram: the same dimensions with the options used in this case study]

• Business Problems: Fraud, Churn, LTV, Credit, Marketing

• Data Mining/Predictive Analytics Methods: Tree, Logistic reg, Clustering, SVM, Regression, Anomaly

• DM/PA Tools/Database: Oracle 11g, ODM


About Check Deposit

• When a check is deposited, two banks are involved:

– Depositor’s bank and payer’s bank

– Usually the depositor’s bank makes the funds available for the depositor to withdraw before the payer’s bank actually transfers the funds to the depositor’s bank.


About Check Deposit Fraud

• Fraudsters take advantage of this “gap”. They deposit fraudulent checks to inflate their account balances and then withdraw funds quickly. The depositor’s bank suffers losses when the fraudulent checks are returned unpaid by the payer’s bank.


About Rule Based Deposit Fraud Detection Systems

• Advantages

– Understandable by analysts.

– Make use of analysts’ experience and intuition

• Examples

– Rule 1: Number of Check Deposits In Last 7 Days More Than 35

– Rule 2: Total Amount of Check Deposits In Last 5 Days Higher Than $15,000
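
The two example rules can be expressed as one aggregate query. This is only a sketch: the table check_deposits(account_number, transaction_date, amount) is hypothetical, and it assumes "last N days" means the N calendar days ending today, which the slides do not specify.

```sql
-- Hypothetical table: check_deposits(account_number, transaction_date, amount)
-- Assumption: "last N days" = the N calendar days ending today.
select account_number,
       count(*) as num_deposits_7d,
       sum(case when transaction_date > trunc(sysdate) - 5
                then amount else 0 end) as total_amount_5d
from   check_deposits
where  transaction_date > trunc(sysdate) - 7
group  by account_number
having count(*) > 35                                   -- Rule 1
    or sum(case when transaction_date > trunc(sysdate) - 5
                then amount else 0 end) > 15000;       -- Rule 2
```

Each rule looks at one variable with one fixed threshold, which is the limitation the statistical model is meant to address.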


Statistical Predictive Model

• Data driven.

• Built based on a large number of historical cases.

• Advantages

–Handles many variables simultaneously and normally detects more fraud with fewer false positives. This is illustrated in the next slide.


[Figure. X-axis: Number of Transactions in Last 7 Days. Y-axis: Dollar Amount of Transactions in Last 7 Days. Labels: “Rule 1. High Dollar Amount”; “Rule 2. High Frequency”; “Predictive Model that uses two variables simultaneously”; “Lowering Threshold Increases False Positive”.]


The Goal of the Project

• Build a statistical model to predict returned items (RDI).

• The work is experimental and the final model has not been implemented operationally.

• The work was done when Dr. Jay Zhou worked for Citizens Bank as a VP. For business questions, please contact Josh Ablett at [email protected].


Major Steps

1. Data Gathering

2. Data Validation

3. Data Preparation

4. Feature Variable Calculation

5. Predictive Model Building/Testing

6. Predictive Model Deployment


Data Used

– Account Data

• Account open date

• Account open branch

• Account type

– Check Deposit (54 million)

• Deposit Date

• Deposit Amount

• Paying Bank Number

• Receiving Bank Number

• Check Serial Number

– Third Party “Known Fraud Account” Hits

• A service provider has a repository of known fraudulent checks. A hit is generated when a check’s paying bank number is found in the known fraudulent check list.


Data Challenges

• Security: account numbers, SSN, etc.

• Large Volume: 54 million check items

• Refresh Data Daily

• Various Data Sources

– Text Files

– SQL Server

– Oracle DB


RDBMS Solution

• Oracle DB

• Access is controlled: security

• External table: query or load text file

• Heterogeneous Services: load SQL Server table

• Materialized view: refresh data based on predefined schedules

• Table Partition, Index: handle large data
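
As an illustration of the external table and materialized view items above, the following sketch exposes a text file as a table and copies it into a scheduled materialized view. The directory object data_dir, the file name, the column layout, and the 2 a.m. schedule are all assumptions made for the example, not details from the project.

```sql
-- Hypothetical: data_dir is an existing Oracle directory object and
-- check_deposits.csv is a comma-delimited file in it.
create table check_deposit_ext (
  account_number   varchar2(20),
  transaction_date date,
  amount           number(12,2)
)
organization external (
  type oracle_loader
  default directory data_dir
  access parameters (
    records delimited by newline
    fields terminated by ','
    missing field values are null
    (
      account_number,
      transaction_date char(10) date_format date mask "YYYY-MM-DD",
      amount
    )
  )
  location ('check_deposits.csv')
)
reject limit unlimited;

-- Copy the file contents into the database and re-read them once a day.
create materialized view check_deposit_mv
  build immediate
  refresh complete
  start with sysdate
  next trunc(sysdate) + 1 + 2/24
as
select account_number, transaction_date, amount
from   check_deposit_ext;
```

A SQL Server table reached through Heterogeneous Services would typically be read through a database link in the same way, with the link name appended to the table name in the defining query.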


2. Data Validation

• SQL functions

–Check the number of null values

–Minimum, Maximum, Median values

–Number of distinct values

–Number of records by date

–Histogram: using width_bucket()

–And more
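
The checks listed above map onto standard Oracle SQL functions. A minimal sketch, assuming a hypothetical table check_deposits(account_number, transaction_date, amount) and an arbitrary histogram range of 0 to 10,000 in 20 buckets:

```sql
-- Hypothetical table: check_deposits(account_number, transaction_date, amount)

-- Null counts, min/max/median, and distinct values in one pass
select count(*)                       as num_rows,
       count(*) - count(amount)       as num_null_amount,
       min(amount)                    as min_amount,
       max(amount)                    as max_amount,
       median(amount)                 as median_amount,
       count(distinct account_number) as num_accounts
from   check_deposits;

-- Number of records by date
select trunc(transaction_date) as deposit_day,
       count(*)                as num_records
from   check_deposits
group  by trunc(transaction_date)
order  by deposit_day;

-- Equal-width histogram; the range and bucket count are illustrative
select width_bucket(amount, 0, 10000, 20) as bucket,
       count(*)                           as num_records
from   check_deposits
group  by width_bucket(amount, 0, 10000, 20)
order  by bucket;
```

width_bucket() returns 0 for values below the lower bound and 21 here for values at or above the upper bound, so out-of-range amounts show up as their own buckets.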


2. Data Validation (continued)

• Metadata table

– Stores variable statistics.

– Makes it easier to handle hundreds of variables.

– We can query it to get the variables with desired characteristics, for example:

• Unique keys

• Variables with names containing the word “credit” and percentage of missing values less than 15%
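
A metadata table of this kind might look like the following. The table layout and column names are hypothetical; the slides do not show the actual structure.

```sql
-- Hypothetical metadata table: one row per variable, filled from the
-- validation queries.
create table variable_metadata (
  variable_name varchar2(30) primary key,
  num_rows      number,
  num_missing   number,
  num_distinct  number,
  min_value     number,
  max_value     number
);

-- Variables whose names contain "credit" and with less than 15% missing
select variable_name,
       round(100 * num_missing / num_rows, 1) as pct_missing
from   variable_metadata
where  lower(variable_name) like '%credit%'
and    num_rows > 0
and    num_missing / num_rows < 0.15;

-- Candidate unique keys: no missing values, every value distinct
select variable_name
from   variable_metadata
where  num_missing = 0
and    num_distinct = num_rows;
```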


3. Data Preparation

• Merge transaction data with performance data

– Most of the time, there are no common keys to uniquely link transactions and performance data.

• Unmatched cases

• Duplicates

– Iterative SQL joining

• Try many iterations of matching methods, from “stronger” ones to “weaker” ones. Successfully matched cases are removed from the “pools”.
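
One way to sketch the iterative joining is shown below. The tables deposit_items and returned_items, their columns, and the two matching criteria are hypothetical; the actual matching methods are not given in the slides. Each pass keeps only one-to-one matches, and the second pass looks only at items still left in the pools.

```sql
-- Hypothetical pools:
--   deposit_items(deposit_id, account_number, amount, serial_number)
--   returned_items(return_id, account_number, amount, serial_number)
create table matched_pairs (
  deposit_id number not null,
  return_id  number not null,
  match_pass number
);

-- Pass 1 ("stronger"): account, amount and serial number all agree
insert into matched_pairs
select deposit_id, return_id, 1
from  (select d.deposit_id, r.return_id,
              count(*) over (partition by d.deposit_id) as n_ret,
              count(*) over (partition by r.return_id)  as n_dep
       from   deposit_items d
       join   returned_items r
         on   r.account_number = d.account_number
        and   r.amount         = d.amount
        and   r.serial_number  = d.serial_number)
where  n_ret = 1 and n_dep = 1;   -- drop ambiguous (duplicate) matches

-- Pass 2 ("weaker"): account and amount only, on the remaining pools
insert into matched_pairs
select deposit_id, return_id, 2
from  (select d.deposit_id, r.return_id,
              count(*) over (partition by d.deposit_id) as n_ret,
              count(*) over (partition by r.return_id)  as n_dep
       from   deposit_items d
       join   returned_items r
         on   r.account_number = d.account_number
        and   r.amount         = d.amount
       where  d.deposit_id not in (select deposit_id from matched_pairs)
       and    r.return_id  not in (select return_id  from matched_pairs))
where  n_ret = 1 and n_dep = 1;
```

The match_pass column records which method produced each pair, so weaker matches can be reviewed or excluded later.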


3. Data Preparation (continued)

• Sampling

– Randomly sample the data based on depositing account numbers.

• Creating Training and Testing Sets

– Randomly split data into training and testing sets. There is no account number overlap between the two sets.
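
Sampling and splitting by account can be sketched as follows. Note that this example swaps in ora_hash(), a deterministic hash, as a repeatable stand-in for random selection; the slides do not say how the random draw was made. The table names, the 10% sample rate, and the 70/30 split are assumptions.

```sql
-- Hypothetical source table: check_deposits(account_number, ...)

-- Sample about 10% of accounts (hash buckets 0..99, seed 1)
create table sampled_accounts as
select account_number
from  (select distinct account_number from check_deposits)
where  ora_hash(account_number, 99, 1) < 10;

-- Assign each sampled account to exactly one set (seed 2)
create table account_split as
select account_number,
       case when ora_hash(account_number, 99, 2) < 70
            then 'TRAIN' else 'TEST' end as data_set
from   sampled_accounts;

-- All checks of an account follow the account into its set
create table train_deposits as
select d.*
from   check_deposits d
join   account_split s on s.account_number = d.account_number
where  s.data_set = 'TRAIN';

create table test_deposits as
select d.*
from   check_deposits d
join   account_split s on s.account_number = d.account_number
where  s.data_set = 'TEST';
```

Because the assignment is made per account rather than per check, no account number can appear in both sets.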


4. Feature Variable Calculation

• Based on the original variables existing in the data, we are able to create more salient variables that capture interesting transaction patterns.


4. Feature Variable Calculation (Continued)

• Risk Group Variables

– We calculate the historical fraud rate for each account type. Then we group account types into a small number of risk groups based on their historical fraud rates. The group numbers are used as one of the input variables to the model.

– Use SQL “group by” to calculate the fraud ratio.
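
A possible form of the risk group calculation is sketched below. The table train_deposits(account_type, is_returned), the 0/1 coding of is_returned, and the use of ntile() with five groups are assumptions; the slides only state that "group by" is used for the fraud ratio.

```sql
-- Hypothetical table: train_deposits(account_type, is_returned)
-- Assumption: is_returned is numeric, 1 = returned item, 0 = paid.
select account_type,
       num_items,
       fraud_rate,
       ntile(5) over (order by fraud_rate) as risk_group   -- 1 = lowest risk
from  (select account_type,
              count(*)         as num_items,
              avg(is_returned) as fraud_rate
       from   train_deposits
       group  by account_type);
```

Computing the rates on the training set only keeps the test set free of information derived from its own outcomes.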


4. Feature Variable Calculation (Continued)

• Transaction Time Series Variable

• Life Time Maximum Check Amount From Paying Account

– max(AMOUNT) over (partition by ABA_ROUTING_NUMBER, PAYING_ACCOUNT order by TRANSACTION_DATE rows between unbounded preceding and current row) life_time_max_amount

• Number of Checks Deposited in Last Two Days

– count(1) over (partition by ACCOUNT_NUMBER order by TRANSACTION_DATE range between 1 preceding and current row) num_2d
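
Placed in a complete query, the two window expressions above look like this. The table name check_deposits is hypothetical; the column names are the ones used on the slide.

```sql
-- Hypothetical table name; columns as on the slide.
select d.*,
       -- running maximum per paying account, up to and including this check
       max(AMOUNT) over (partition by ABA_ROUTING_NUMBER, PAYING_ACCOUNT
                         order by TRANSACTION_DATE
                         rows between unbounded preceding and current row)
         as life_time_max_amount,
       -- checks deposited to the account from one day earlier through now
       count(1) over (partition by ACCOUNT_NUMBER
                      order by TRANSACTION_DATE
                      range between 1 preceding and current row)
         as num_2d
from   check_deposits d;
```

With a DATE ordering column, "range between 1 preceding" means one day back. If TRANSACTION_DATE carries a time of day, the window is the preceding 24 hours rather than two calendar days.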


Calculate Feature Variables (Continued)

• It is helpful to create as many feature variables as possible.

• Then we can use statistical methods to select a small number of variables that are most relevant to our target variable, i.e., returned item.


Data Preparation/Feature Variable Calculation Considerations

• Use views or materialized views when possible.

– Challenge caused by separation of data and processing logic

– With a view or materialized view, the logic that is used to produce the view can be retrieved using the following query:

• select text from user_views where view_name='DATA_VIEW';
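
For example, a preparation step defined as a view keeps its logic in the data dictionary. The view below and the materialized view name DATA_MVIEW are hypothetical.

```sql
-- Hypothetical view: daily deposit totals per account
create or replace view data_view as
select account_number,
       trunc(transaction_date) as deposit_day,
       sum(amount)             as total_amount
from   check_deposits
group  by account_number, trunc(transaction_date);

-- Retrieve the defining query of the view
select text from user_views where view_name = 'DATA_VIEW';

-- For a materialized view, the defining query is in USER_MVIEWS
select query from user_mviews where mview_name = 'DATA_MVIEW';
```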


5. Predictive Model Building/Testing

• We used logistic regression models in Oracle 11g ODM.

• Variables used in the model include:

– Account Type Risk Group

– Days Since Account Opened

– Check Serial Number Out of Sequence

– Third Party “Negative File” Hit

– Number of Checks Deposited In Past Two Days

– Account Open Branch Risk

– Item Amount vs. Maximum Check Amount from the Paying Bank Account
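
In Oracle 11g, a logistic regression model can be built through the DBMS_DATA_MINING PL/SQL package by requesting a classification model with the generalized linear model algorithm. The sketch below shows that route; whether the project used this API or the Oracle Data Miner interface is not stated in the slides, and the model name, table names, and column names are hypothetical.

```sql
-- Settings table that selects the algorithm
create table glm_settings (
  setting_name  varchar2(30),
  setting_value varchar2(4000)
);

begin
  -- GLM with a classification function yields logistic regression
  insert into glm_settings (setting_name, setting_value)
  values (dbms_data_mining.algo_name,
          dbms_data_mining.algo_generalized_linear_model);

  -- Hypothetical names: TRAIN_DATA holds one row per check with the
  -- feature variables, a unique ID, and the target IS_RETURNED.
  dbms_data_mining.create_model(
    model_name          => 'CHECK_FRAUD_GLM',
    mining_function     => dbms_data_mining.classification,
    data_table_name     => 'TRAIN_DATA',
    case_id_column_name => 'ID',
    target_column_name  => 'IS_RETURNED',
    settings_table_name => 'GLM_SETTINGS');
  commit;
end;
/
```

All columns of the training table other than the case id and the target are treated as model inputs, so the table should contain only the selected feature variables.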


5. Predictive Model Building/Testing (Continued)

• The final model is stored as an Oracle mining object.

• Prediction_Probability function: removes the traditional bottleneck of model deployment.

– select a.id, prediction_probability(model_object, '1' using *) score from table_data a;


6. Model Deployment

• All of the steps are implemented as materialized views.

• Score calculation for new data is simply refreshing a series of materialized views.
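
A deployment along these lines could end in a scoring materialized view that sits on top of the feature materialized view. This is a sketch only: the names check_features_mv, check_scores_mv, and check_fraud_glm are hypothetical, and it assumes the scoring query can serve as a materialized view query, as the slide suggests.

```sql
-- Hypothetical: check_features_mv holds one row per check with its id and
-- feature variables; check_fraud_glm is the stored mining model.
-- The class label '1' follows the scoring query on the earlier slide and
-- must match the data type of the model's target.
create materialized view check_scores_mv
  refresh complete on demand
as
select f.id,
       prediction_probability(check_fraud_glm, '1' using *) as score
from   check_features_mv f;

-- Scoring new data: refresh the chain in dependency order
begin
  dbms_mview.refresh('CHECK_FEATURES_MV', 'C');
  dbms_mview.refresh('CHECK_SCORES_MV', 'C');
end;
/
```

The earlier steps (loading, matching, feature calculation) would be refreshed the same way, upstream views first.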


7. Reporting

• The checks are rank ordered based on the risk scores. Analysts can view the report using a web browser.


Model Score Performance

–If we maintain the same fraud prevention staffing level, the total fraud loss prevented is forecast to be up 20%.


Drawbacks of In-Database Modeling

• Data miners are not self-sufficient

– Need support from the database administrator


Summary

• Predictive scoring model

– Significantly improves check deposit fraud detection performance.

• From data miner perspective

– Powerful integration of SQL and advanced modeling functions

• Single environment

– Almost all the processes are executed within a database environment

– Increased security, productivity, manageability and scalability.


Questions?