
Objectives

Discussion and Future Work

Motivation
• Payments fraud is a significant and growing issue
• More than $8 billion in 2015, up 37% from 2012
• The key challenge with fraud data is class imbalance

Goal:
• Implement and assess ML algorithms to detect credit card fraud
• Investigate strategies to address class imbalance

Models

Aaron Rosenbaum | [email protected] | CS 229 | Spring 2019

Results: Stage 1

Results: Stage 2

Detecting Credit Card Fraud with Machine Learning

Sampling Methods

Data

• Oversampling and synthetic data generation, when properly tuned, can lead to superior predictive performance in the face of class imbalance

• Random forests are highly effective, easy to implement, and appear to be robust to class imbalance, at least for this particular dataset

• Given that payments fraud is constantly evolving, future areas of work might include the application of reinforcement learning to a real-time data stream


Implementation

Variable   Description
Time       Time since the first transaction
V1-V28     Non-descriptive variables (PCA-transformed to protect privacy)
Amount     Transaction amount
Class      1 = fraud; 0 otherwise

n = 284,807 transactions; d = 31 variables

Class       Volume    %
All Tx      284,807   100
Fraud           492   0.17
Not fraud   284,315   99.83

Class distribution of the credit card dataset from Kaggle
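As a point of reference, a minimal sketch (in Python, which the poster does not specify) of loading the Kaggle file and reproducing the class counts above; the file name creditcard.csv follows the Kaggle convention and is an assumption here.

import pandas as pd

# Load the Kaggle credit card fraud dataset (file name assumed):
# 284,807 rows, 31 columns.
df = pd.read_csv("creditcard.csv")
X = df.drop(columns="Class").to_numpy()
y = df["Class"].to_numpy()           # 1 = fraud, 0 = not fraud
print(df["Class"].value_counts())    # expect 284,315 not-fraud vs 492 fraud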

PCA visualizations of the transaction data (figures)

• Undersampling: randomly delete observations from the majority class
• Oversampling: randomly resample, with replacement, from the minority class
• Both: combine under- and oversampling
• ROSE: create artificial samples of the minority class in the neighborhood of existing ones (see the sketch below)
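A minimal sketch of these strategies under stated assumptions, not the poster's code. The rose_like function only approximates ROSE by jittering minority rows with Gaussian noise; the real ROSE method uses a smoothed-bootstrap kernel.

import numpy as np

def undersample(X, y, rng):
    """Randomly drop majority-class rows until the classes are balanced."""
    maj, mino = np.where(y == 0)[0], np.where(y == 1)[0]
    keep = rng.choice(maj, size=len(mino), replace=False)
    idx = np.concatenate([keep, mino])
    return X[idx], y[idx]

def oversample(X, y, rng):
    """Randomly duplicate minority-class rows until the classes are balanced."""
    maj, mino = np.where(y == 0)[0], np.where(y == 1)[0]
    extra = rng.choice(mino, size=len(maj), replace=True)
    idx = np.concatenate([maj, extra])
    return X[idx], y[idx]

def rose_like(X, y, rng, scale=0.1):
    """Crude stand-in for ROSE: jitter resampled minority rows with noise."""
    mino = np.where(y == 1)[0]
    n_new = int((y == 0).sum() - len(mino))
    base = X[rng.choice(mino, size=n_new, replace=True)]
    synth = base + rng.normal(scale=scale * X[mino].std(axis=0), size=base.shape)
    return np.vstack([X, synth]), np.concatenate([y, np.ones(n_new, dtype=y.dtype)])

Each function returns a resampled (X, y), e.g. Xr, yr = oversample(X_train, y_train, np.random.default_rng(0)); the "Both" strategy is simply an undersample followed by an oversample.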

Simple logistic regression with linear boundary
$h_\theta(x) = \dfrac{1}{1 + e^{-\theta^{T} x}}$, fit by maximizing the log-likelihood
$\ell(\theta) = \sum_{i=1}^{m} \left[\, y^{(i)} \log h_\theta(x^{(i)}) + \bigl(1 - y^{(i)}\bigr) \log\bigl(1 - h_\theta(x^{(i)})\bigr) \right]$
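A hedged sketch of this fit using scikit-learn (an assumption; the poster does not name its tooling). Setting C very large makes the fit essentially the unpenalized maximum-likelihood estimate written above.

from sklearn.linear_model import LogisticRegression

# Plain logistic regression with a linear decision boundary; a very large C
# approximates the unpenalized maximum-likelihood fit.
linear_logit = LogisticRegression(C=1e6, max_iter=1000)
linear_logit.fit(X_train, y_train)                 # splits assumed, see the partition sketch
p_val = linear_logit.predict_proba(X_val)[:, 1]    # predicted P(fraud) for AUPRC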

Logistic regression with quadratic boundary, LASSO
An $L_1$ penalty $\lambda \lVert \theta \rVert_1$ reduces variance by deleting some quadratic terms
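A hedged sketch of the quadratic-boundary variant: expand to degree-2 features, then let an L1 penalty zero out some of them. Hyperparameters such as C are illustrative, not the poster's values.

from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures, StandardScaler

# Degree-2 feature expansion followed by L1-penalized logistic regression;
# the LASSO penalty drives some quadratic coefficients exactly to zero.
quad_lasso = make_pipeline(
    PolynomialFeatures(degree=2, include_bias=False),
    StandardScaler(),
    LogisticRegression(penalty="l1", solver="saga", C=0.1, max_iter=5000),
)
quad_lasso.fit(X_train, y_train)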

Random forest
Averages many trees grown on bootstrap samples: $\hat{f}_{\mathrm{rf}}(x) = \frac{1}{B} \sum_{b=1}^{B} \hat{f}^{*b}(x)$
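A hedged sketch of the averaged-tree estimator; the number of trees is an assumption.

from sklearn.ensemble import RandomForestClassifier

# B = 500 bootstrap trees; predict_proba averages the trees' predictions,
# matching the bagged average written above.
rf = RandomForestClassifier(n_estimators=500, n_jobs=-1, random_state=0)
rf.fit(X_train, y_train)
p_val_rf = rf.predict_proba(X_val)[:, 1]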

Neural network
One hidden layer, fully connected, sigmoid activation
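A hedged sketch of the network; the hidden-layer width is an assumption, and "logistic" is scikit-learn's name for the sigmoid activation.

from sklearn.neural_network import MLPClassifier

# One fully connected hidden layer with sigmoid activations.
nn = MLPClassifier(hidden_layer_sizes=(16,), activation="logistic",
                   max_iter=500, random_state=0)
nn.fit(X_train, y_train)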

Dataset partition: 2/3 train, 1/6 validation, 1/6 test
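A minimal sketch of the 2/3 : 1/6 : 1/6 split; stratifying on the label is an assumption, since the poster does not say whether the split preserved class proportions.

from sklearn.model_selection import train_test_split

# First carve off 1/3 as a holdout, then split the holdout in half,
# giving a 2/3 train, 1/6 validation, 1/6 test partition.
X_train, X_hold, y_train, y_hold = train_test_split(
    X, y, test_size=1/3, stratify=y, random_state=0)
X_val, X_test, y_val, y_test = train_test_split(
    X_hold, y_hold, test_size=0.5, stratify=y_hold, random_state=0)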

Tournament-style procedure:
• Stage 1: train variants of each model using the different sampling strategies
• Stage 2: the best-performing model in each category on the validation set is refit on the combined train/validation set and assessed on the test set
• Primary performance metric: AUPRC, because of the class imbalance (see the sketch below)
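A hedged sketch of the Stage 1 loop. The variants mapping and the resampler call signature are hypothetical structures used only for this sketch, and average precision serves as the usual computable stand-in for AUPRC.

import numpy as np
from sklearn.metrics import average_precision_score

# Score every (model, sampling strategy) pair on the validation set by AUPRC.
# `variants` maps a name to (unfitted model, resampling function) and is
# an illustrative structure, not the poster's code.
results = {}
for name, (model, resampler) in variants.items():
    Xr, yr = resampler(X_train, y_train, np.random.default_rng(0))
    model.fit(Xr, yr)
    p = model.predict_proba(X_val)[:, 1]
    results[name] = average_precision_score(y_val, p)

best = max(results, key=results.get)   # category winners advance to Stage 2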

Stage 1: AUPRC on the validation set by model and sampling strategy (the winner in each category advanced to Stage 2)

Simple logistic: linear
  No sampling      0.6973303
  Undersampling    0.6813981
  Oversampling     0.7267684
  Both             0.7061640
  ROSE             0.7301219   (winner)

Logistic: quadratic, LASSO
  No sampling      did not converge
  Undersampling    0.6915803
  Oversampling     0.7935913   (winner)
  Both             0.7036836
  ROSE             0.5475979

Random forest
  No sampling      0.8448476
  Undersampling    0.8177367
  Oversampling     0.8504548   (winner)
  Both             0.8381937
  ROSE             0.7703227

Neural network
  No sampling      did not converge
  Undersampling    0.6919809
  Oversampling     0.2757658
  Both             0.2979033
  ROSE             0.7118781   (winner)
  Weighted option  0.2266319

CV for sampling proportion: simple logistic regression (figure)

LASSO results: logistic regression, quadratic boundary, undersampling (figure)

Growing a random forest: error rate vs. number of trees, oversampling (figure)

Finalist performance on the test set:

Model                    AUROC     AUPRC     Accuracy   Sensitivity   Specificity   F1
Linear logistic (ROSE)   0.98368   0.83476   0.9995     0.797619      0.999831      0.842673
Quad logistic (Over)     0.98398   0.88409   0.9993     0.734043      0.999873      0.816568
Random forest (Over)     0.98396   0.90957   0.9997     0.929577      0.999810      0.904110
Neural net (ROSE)        0.98280   0.73169   0.9984     0.831169      0.999768      0.842105

Accuracy, sensitivity, specificity, and F1 assume a decision threshold of 0.5 on the predicted probability of fraud.
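A minimal sketch of how these thresholded metrics can be computed; variable names such as p_test (the predicted fraud probabilities on the test set) are illustrative.

from sklearn.metrics import confusion_matrix, f1_score

# Threshold the predicted fraud probability at 0.5, then read the confusion
# matrix as (tn, fp, fn, tp) to get sensitivity and specificity.
y_hat = (p_test > 0.5).astype(int)
tn, fp, fn, tp = confusion_matrix(y_test, y_hat).ravel()
sensitivity = tp / (tp + fn)     # recall on the fraud class
specificity = tn / (tn + fp)
f1 = f1_score(y_test, y_hat)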

• Thank you to the CS229 teaching staff!

Confusion matrices on the test set (rows = prediction, columns = truth):

Linear logistic (ROSE)
          Truth 0   Truth 1
Pred 0      47426        17
Pred 1          8        67

Quadratic logistic (Over)
          Truth 0   Truth 1
Pred 0      47418        25
Pred 1          6        69

Random forest (Over)
          Truth 0   Truth 1
Pred 0      47438         5
Pred 1          9        66

Neural net (ROSE)
          Truth 0   Truth 1
Pred 0      47430        13
Pred 1         11        64