Building In-Database Predictive Scoring Model: Check Fraud ... › ... · •Fraudsters take...

Building In-Database Predictive Scoring Model:

Check Fraud Detection Case Study

Jay Zhou, Ph.D.Business Data Miners, LLC

978-726-3182

jzhou@businessdataminers.com

Web Site: www.businessdataminers.com

LinkedIn: www.linkedin.com/in/jiangzhou

Building In-Database Predictive Scoring Model: Check Fraud Detection Case Study

• Four Components

–Business: Check Fraud

–Scoring Model: Logistic Regression

–Modeling Tool: Oracle Data Mining

–Data Storage/Management: Oracle Database

ChurnLTV

TreeLogistic

SPSS Statsoft

Oracle

SQL Server

Business

DM/PA Tools

Data Storage/Management

Credit

Clustering

Gradient Boosting

MY SQL

SAS File

R File

Marketing

Regression

Data Mining/Predictive Analytics Methods

ChurnLTV

Tree Logistic reg

SPSS Statsoft

Oracle

SQL Server

Business Problems

DM/PA Tools

Data Storage/Management

Credit

Clustering

Gradient Boosting

MY SQL

SAS File

R File

Marketing

Regression

ChurnLTV

Tree Logistic reg

Oracle 11g

Business Problems

DM/PA Tools/Database

Credit

Clustering

Marketing

Regression

Anomaly

About Check Deposit

• When a check is deposited, two banks are involved:

– Depositor’s bank and payer’s bank

– Usually the depositor’s bank makes the funds available for the depositor to withdraw before the payer’s bank actually transfers the funds to the depositor’s bank.

About Check Deposit Fraud

• Fraudsters take advantage of this “gap”. They deposit fraudulent checks to inflate their account balances and then withdraw funds quickly. The depositor’s bank suffers losses when fraudulent checks returned unpaid by the payer’s bank.

About Rule Based Deposit Fraud Detection Systems

• Advantages

– Understandable by analysts.

– Make use of analysts’ experience and intuition

• Examples• Rule 1: # of Check Deposits In Last 7 Days More than 35

• Rule 2: Total Amount of Check Deposits In Last 5 Days Higher Than $15,000

Statistical Predictive Model

• Data driven.

• Built based on a large number of historical cases.

• Advantages

–Handle many variables simultaneously and normally detect more frauds with lower false positives. This is illustrated in the next slide.

Number of Transactions in Last 7 Days

llar A

t of Tra

s in La

st 7 D

Predictive Model that uses two variables simultaneously

Rule 1. High Dollar Amount

Rule 2. High Frequency

Lowering Threshold Increases False Positive

The Goal of the Project

• Build a statistical model to predict returned items (RDI).

• The work is experimental and the final model has not been implemented operationally.

• The work was done when Dr. Jay Zhou worked for Citizens Bank as a VP. For business questions, please contact Josh Ablett at josh.ablett@citizensbank.com.

Major Steps

1. Data Gathering

2. Data Validation

3. Data Preparation

4. Feature Variable Calculation

5. Predictive Model Building/Testing

6. Predictive Model Deployment

Data Used– Account Data

• Account open date• Account open branch• Account type

– Check Deposit (54 millions)• Deposit Date• Deposit Amount• Paying Bank Number• Receiving Bank Number• Check Serial Number

– Third Party “Known Fraud Account” Hits• A service provider has a repository of known fraudulent checks. A hit is

generated when a check’s paying bank number is found in the known fraudulent check list.

Data Challenges

• Security: account numbers, SSN, etc.

• Large Volume: 54 million check items

• Refresh Data Daily

• Various Data Sources

– Text Files

– SQL Server

– Oracle DB

RDBMS Solution

• Oracle DB

• Access is controlled: security

• External table: query or load text file

• heterogeneous services: load SQL Server table

• Materialized view : refresh data based on predefined schedules

• Table Partition, Index: handle large data

2. Data Validation

• SQL functions

–Check the number of null values

–Minimum, Maximum, Median values

–Number of distinctive values

–Number of records by date

- Histogram: using width_bucket()‏

- And More

2. Data Validation (continued)‏

• Metadata table

• Store variable statistics.

• Easier to handle hundreds of variables.

• We can query it to get the variables with desired characteristics.

• Unique keys:

• Variables with names containing the word “credit” and percentage of missing values less than 15%

3. Data Preparation

• Merge transaction data with performance data

– Most of the time, there are no common keys to uniquely link transactions and performance data.

• Unmatched cases

• Duplicates

– Iterative SQL joining

• Try many iterations of matching methods, from “stronger” ones to “weaker” ones. Successfully matched cases are removed from the “pools”.

3. Data Preparation (continued)

• Sampling– Randomly Sample the data based on depositing account

numbers.

• Creating Training and Testing Sets

– Randomly split data into training and testing sets. There is no account number overlap between two sets.

4. Feature Variable Calculation

• Based on the original variables existing in the data, we are able to create more salient variables that capture interesting transaction patterns.

4. Feature Variable Calculation (Continued)‏

• Risk Group Variables

– We calculate historical fraud rate for each account type. Then we group account types into a small number of risk groups based on their historical fraud rates. The group numbers are used as one of input variables to the model.

– Use SQL “group by” to calculate fraud ratio.

4. Feature Variable Calculation (Continued)‏

• Transaction Time Series Variable• Life Time Maximum Check Amount From Paying Account

– max(AMOUNT) over (partition by ABA_ROUTING_NUMBER ,PAYING_ACCOUNT order by TRANSACTION_DATE rows between unbounded preceding and current row) life_time_max_amount

• Number of Checks Deposited in Last Two Days

– count(1) over (partition by ACCOUNT_NUMBER order by TRANSACTION_DATE range between 1 preceding and current row) num_2d

Calculate Feature Variables(Continued)

• It is helpful to create as many feature variables as possible.

• Then we can use statistical methods to select a small number of variables that are most relevant to the our target variable, i.e., returned item.

Data Preparation/Feature Variable Calculation Considerations ‏

• Use views or materialized views when possible.

– Challenge caused by separation of data and processing logic

– With a view or materialized view, the logic that is used to produce the view can be retrieved using the following query:

• select text from user_views where view_name='DATA_VIEW';

5.Predictive Model Building/Testing

• We used logistic regression models in Oracle 11g ODM.

• Variables used in the model including:– Account Type Risk Group

– Days Since Account Opened– Check Serial Number Out of Sequence– Third Party “Negative File” Hit– Number of Checks Deposited In Past Two Days

– Account Open Branch Risk

– Item Amount vs. Maximum Check Amount from the Paying Bank Account

5.Predictive Model Building/Testing (Continued)‏

• The final model is stored as an Oracle mining object.

• Prediction_Probability function: remove the traditional bottleneck of model deployment.– select a.id, prediction_probability(model_object,'1'

using *) score from table_data a;

6. Model Deployment

• All of the steps are implemented as materialized views.

• Score calculation for new data is simply refreshing a series of materialized views.

7. Reporting

• The checks are rank ordered based on the risk scores. Analysts can view the report using a web browser.

Model Score Performance

–If we maintain the same fraud prevention staff level, it is forecasted that the total fraud loss prevented will be up 20%.

Drawbacks of In-Database Modeling

• Data miners are not self-sufficient• Need support from database

administrator

Summary

• Predictive scoring model – Significantly improve the check deposit fraud

detection performance.• From data miner perspective

– Powerful integration of SQL and advanced modeling functions

• Single environment– Almost all the processes executed within a

database environment– Increased security, productivity, manageability and

scalability.

Questions?

Building In-Database Predictive Scoring Model: Check Fraud ... › ... · •Fraudsters take...

Documents

Self inflate tyres

1 Mining Fraudsters and Fraudulent Strategies in Large ...yangy.org/works/telecom_fraud/TKDE_Fraud_Yang.pdf · detection [5] and insurance fraud detection [6] technologies, to the

Wear it and ready set inflate

Beating the Fraudsters - First Data · Beating the Fraudsters ... ability to detect counterfeit and duplicate cards. ... progressive prediction, detection and prevention capabilities

QE2 Starts To Re-inflate Bubbles, Risks Stagflation!

DEFLATE INFLATE - g-ecx.images-amazon.comg-ecx.images-amazon.com/images/G/01/00/00/41/08/65/... · DEFLATE RIGHT DEFLATE LEFT INFLATE BOTH Single Dual INFLATE Standard Duty Compressor

Countrywide protected fraudsters by silencing

ARE ROBOCALLS AND FRAUDSTERS RUINING YOUR BRAND?

Fraudulent Documentation: Fraudsters’ Secret Weapon ... How to Disarm Them

inflate-a-mole - exo.net

fraudsters document · document used by fraudsters. Sample document used by fraudsters. Created Date: 1/12/2021 5:49:06 AM

Fraudulent Insurance Claims

Promotion Stock Secrets - Promotion Stock Secrets - GARBARINI … · 2018. 10. 4. · (OTC) market,2 and artificially inflate the price of their worthless stock through fraudulent

Fraudulent Transfer Claw-Back Litigation Post Tribune ...media.straffordpub.com/products/fraudulent-transfer-claw-back... · Fraudulent Transfer Claw-Back Litigation Post Tribune,

WorldCom: A Bunch of Dirty Rotten Fraudsters

Fishermen, Randies and Fraudsters by Malcolm Archibald Extract

1 Mining Fraudsters and Fraudulent Strategies in Large ......Zhejiang University, China. E-mail: yangya@zju.edu.cn, corresponding author. Yuhong Xu is with the NetEase Fuxi Lab, Hangzhou,

Inflate catalog 2012

Learn something about the 'Celebrity' Fraudsters

Fraudulent Transfers