Data Mining – Best Practices, Part #2
Richard Derrig, PhD,
Opal Consulting LLC
CAS Spring Meeting
June 16-18, 2008
Data Mining
Data Mining, also known as Knowledge-Discovery in Databases (KDD), is the process of automatically searching large volumes of data for patterns. In order to achieve this, data mining uses computational techniques from statistics, machine learning and pattern recognition.
www.wikipedia.org
AGENDA
Predictive v Explanatory Models
Discussion of Methods
Example: Explanatory Models for Decision to Investigate Claims
The “Importance” of Explanatory and Predictive Variables
An Eight Step Program for Building a Successful Model
Predictive v Explanatory Models
Both take the same form: a Target (Dependent) Variable is modeled as a Function of Feature (Independent) Variables related to the Target Variable (see the schematic after this list)
Explanatory Models assume all Variables are Contemporaneous and Known
Predictive Models assume all Variables are Contemporaneous and Estimable
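In symbols, a minimal schematic of both model types (the notation y, x_1, ..., x_p, and the error term are illustrative assumptions, not from the slides):

y = f(x_1, x_2, \ldots, x_p) + \varepsilon

An explanatory model fits f with every x_j observed alongside y; a predictive model requires only that each x_j be estimable at the time the prediction is made.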
Desirable Properties of a Data Mining Method:
Any nonlinear relationship between target and features can be approximated
A method that works when the form of the nonlinearity is unknown
The effect of interactions can be easily determined and incorporated into the model
The method generalizes well on out-of-sample data
Major Kinds of Data Mining Methods

Supervised learning
Most common situation
Target variable: frequency, loss ratio, fraud/no fraud
Some methods: regression, decision trees, some neural networks

Unsupervised learning
No target variable
Group like records together (clustering): a group of claims with similar characteristics might be more likely to be of similar risk of loss
Ex: territory assignment
Some methods: PRIDIT, K-means clustering, Kohonen neural networks (a K-means sketch follows below)
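As one concrete instance of clustering, here is a minimal K-means sketch. The scikit-learn stand-in and the two claim features are assumptions for illustration; this is not the PRIDIT or Kohonen procedure used in the study.

```python
# Minimal K-means clustering sketch on hypothetical claim features.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
# Hypothetical features per claim: [provider bill, report lag in days]
claims = rng.gamma(shape=2.0, scale=[2500.0, 10.0], size=(500, 2))

X = StandardScaler().fit_transform(claims)  # scale so no feature dominates
kmeans = KMeans(n_clusters=4, n_init=10, random_state=0).fit(X)

# Each claim gets a cluster label; claims sharing a label are "like records".
print(kmeans.labels_[:10])
```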
The Supervised Methods and Software Evaluated
1) TREENET
2) Iminer Tree
3) S-PLUS Tree
4) CART
5) S-PLUS Neural
6) Iminer Neural
7) Iminer Ensemble
8) MARS
9) Random Forest
10) Exhaustive CHAID
11) Naïve Bayes (Baseline)
12) Logistic Regression (Baseline)
Decision Trees
In decision theory (for example, risk management), a decision tree is a graph of decisions and their possible consequences (including resource costs and risks) used to create a plan to reach a goal. Decision trees are constructed to help with making decisions. A decision tree is a special form of tree structure.
www.wikipedia.org
CART – Example of 1st split on Provider 2 Bill, With Paid as Dependent
For the entire database, the total squared deviation of paid losses around the predicted value (i.e., the mean) is 4.95 × 10^13. The SSE declines to 4.66 × 10^13 after the data are partitioned using $5,021 as the cutpoint.
Any other partition of the provider bill produces a larger SSE than 4.66 × 10^13. For instance, if a cutpoint of $10,000 is selected, the SSE is 4.76 × 10^13. (A sketch of this SSE computation follows the diagram below.)
[Diagram: 1st split]
All Data: Mean = 11,224
  Bill < 5,021: Mean = 10,770
  Bill >= 5,021: Mean = 59,250
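The split criterion above can be sketched in a few lines. The data here are synthetic (made-up bill and paid-loss amounts); the $5,021 cutpoint and the 10^13-scale SSE figures come from the original example and are not reproduced by this toy:

```python
# How CART scores a split: sum of squared errors around the node means,
# before and after partitioning on a candidate cutpoint.
import numpy as np

rng = np.random.default_rng(1)
bill = rng.gamma(2.0, 2500.0, size=1000)                 # provider 2 bill
paid = 8000 + 5.0 * bill + rng.normal(0, 4000, size=1000)  # paid losses

def sse(y):
    # Squared deviation around the node mean; an empty node contributes 0.
    return float(np.sum((y - y.mean()) ** 2)) if y.size else 0.0

def split_sse(x, y, cutpoint):
    return sse(y[x < cutpoint]) + sse(y[x >= cutpoint])

print("SSE, no split:       ", sse(paid))
print("SSE at cutpoint 5021:", split_sse(bill, paid, 5021.0))

# CART searches candidate cutpoints and keeps the one minimizing the SSE.
best = min(np.unique(bill), key=lambda c: split_sse(bill, paid, c))
print("best cutpoint found: ", best)
```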
Different Kinds of Decision Trees
Single Trees (CART, CHAID)
Ensemble Trees, a more recent development (TREENET, RANDOM FOREST):
A composite or weighted average of many trees (perhaps 100 or more)
There are many methods to fit the trees and prevent overfitting
Boosting: Iminer Ensemble and Treenet
Bagging: Random Forest
(A sketch contrasting bagging and boosting follows below.)
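A minimal sketch of the two ensemble ideas, using scikit-learn stand-ins rather than Treenet or Iminer themselves (RandomForestClassifier for bagging, GradientBoostingClassifier for boosting; the synthetic data are an assumption):

```python
# Bagging vs. boosting: both average many trees, fit differently.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=2000, n_features=20, random_state=0)

bagged = RandomForestClassifier(n_estimators=100, random_state=0)    # bagging
boosted = GradientBoostingClassifier(n_estimators=100, random_state=0)  # boosting

for name, model in [("bagging", bagged), ("boosting", boosted)]:
    auc = cross_val_score(model, X, y, cv=5, scoring="roc_auc").mean()
    print(f"{name}: mean AUROC = {auc:.3f}")
```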
Neural Networks
Three Layer Neural Network
Input Layer (Input Data) → Hidden Layer (Process Data) → Output Layer (Predicted Value)
(A toy forward pass follows below.)
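A minimal sketch of that three-layer forward pass, assuming NumPy; the weights are random stand-ins purely to show the structure, not a trained network:

```python
# Toy forward pass: input layer -> hidden layer -> output layer.
import numpy as np

rng = np.random.default_rng(2)
x = rng.normal(size=5)          # input layer: 5 features (input data)

W1 = rng.normal(size=(4, 5))    # weights, input -> hidden (4 hidden nodes)
W2 = rng.normal(size=(1, 4))    # weights, hidden -> output

hidden = np.tanh(W1 @ x)                      # hidden layer processes the data
output = 1 / (1 + np.exp(-(W2 @ hidden)))     # output layer: value in (0, 1)
print(output)
```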
Self-Organizing Feature Maps
T. Kohonen, 1982-1990 (Cybernetics)
Reference vectors of features map to the output format in a topologically faithful way. Example: map onto a 40x40 two-dimensional square.
An iterative process adjusts all reference vectors in a "neighborhood" of the nearest one; the neighborhood size shrinks over iterations (see the sketch below).
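A sketch of that iterative process, assuming NumPy and a 10x10 grid (smaller than the 40x40 example, to keep the code short); the learning rate and decay schedule are illustrative assumptions:

```python
# One Kohonen SOM training pass: find the nearest reference vector (BMU),
# then pull it and its grid neighborhood toward the input; the
# neighborhood radius shrinks over iterations.
import numpy as np

rng = np.random.default_rng(3)
grid_w, grid_h, n_features = 10, 10, 4
refs = rng.normal(size=(grid_w, grid_h, n_features))   # reference vectors
coords = np.stack(np.meshgrid(np.arange(grid_w), np.arange(grid_h),
                              indexing="ij"), axis=-1)  # grid coordinates

def train_step(x, radius, lr=0.1):
    # Best matching unit: reference vector nearest to the input x.
    dists = np.linalg.norm(refs - x, axis=-1)
    bmu = np.unravel_index(dists.argmin(), dists.shape)
    # Gaussian neighborhood on the 2-D grid around the BMU.
    grid_dist = np.linalg.norm(coords - np.array(bmu), axis=-1)
    h = np.exp(-(grid_dist ** 2) / (2 * radius ** 2))
    # All reference vectors in the neighborhood move toward x.
    refs[:] = refs + lr * h[..., None] * (x - refs)

for it in range(1000):
    radius = 5.0 * (0.99 ** it)   # neighborhood size shrinks over iterations
    train_step(rng.normal(size=n_features), radius)
```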
NEURAL NETWORKS
[Figure: Feature map of suspicion levels – grid of output nodes (rows S1-S16, columns 1-16), shaded by suspicion-level bins 0-1 through 4-5]
[Figure: Feature map showing the similarity of a claim – same output grid, shaded by bins 0-1 through 4-5]
DATA MODELING EXAMPLE: CLUSTERING
Data on 16,000 Medicaid providers analyzed by unsupervised neural net
Neural network clustered Medicaid providers based on 100+ features
Investigators validated a small set of known fraudulent providers
Visualization tool displays clustering, showing known fraud and abuse
Subset of 100 providers with similar patterns investigated: hit rate > 70%
Cube size proportional to annual Medicaid revenues
© 1999 Intelligent Technologies Corporation
Multiple Adaptive Regression Splines (MARS)
MARS fits a piecewise linear regression:
BF1 = max(0, X - 1,401.00)
BF2 = max(0, 1,401.00 - X)
BF3 = max(0, X - 70.00)
Y = 0.336 + 0.000145626 * BF1 - 0.000199072 * BF2 - 0.000145947 * BF3
where BF1, BF2, and BF3 are basis functions.
MARS uses statistical optimization to find best basis function(s)
A basis function is similar to a dummy variable in regression: it behaves like a combination of a dummy indicator and a linear independent variable (see the evaluation sketch below).
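The fitted model above can be evaluated directly; this is a transcription of the printed basis functions and coefficients (the sample X values are arbitrary):

```python
# Evaluating the printed MARS model: each max(0, ...) hinge is zero on one
# side of its knot and linear on the other, which is why a basis function
# acts like a dummy indicator times a linear term.
def mars_predict(x):
    bf1 = max(0.0, x - 1401.00)
    bf2 = max(0.0, 1401.00 - x)
    bf3 = max(0.0, x - 70.00)
    return 0.336 + 0.000145626 * bf1 - 0.000199072 * bf2 - 0.000145947 * bf3

for x in (0.0, 70.0, 1401.0, 5000.0):
    print(x, round(mars_predict(x), 4))
```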
Baseline Methods: Naive Bayes Classifier and Logistic Regression
Naive Bayes assumes the feature (predictor) variables are independent conditional on each target category.
Logistic Regression assumes the log-odds of the target is linear in the feature (predictor) variables.
(A sketch of both baselines follows below.)
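A minimal sketch of the two baselines with scikit-learn on synthetic data; GaussianNB is one naive Bayes variant chosen here for illustration, since the study's exact implementations are not specified in the slides:

```python
# The two baseline classifiers, scored by cross-validated AUROC.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import GaussianNB
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=2000, n_features=10, random_state=0)

for name, model in [("naive Bayes", GaussianNB()),
                    ("logistic regression", LogisticRegression(max_iter=1000))]:
    auc = cross_val_score(model, X, y, cv=5, scoring="roc_auc").mean()
    print(f"{name}: mean AUROC = {auc:.3f}")
```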
REAL CLAIM FRAUD DETECTION PROBLEM
Classify all claims
Identify valid classes: pay the claim, no hassle (the Visa example)
Identify (possible) fraud: investigation needed
Identify "gray" classes: minimize with "learning" algorithms
The Fraud Surrogates used as Target Decision Variables
Independent Medical Exam (IME) requested
Special Investigation Unit (SIU) referral
IME successful
SIU successful
DATA: Detailed Auto Injury Closed Claim Database for Massachusetts, Accident Years 1995-1997
DM Databases → Scoring Functions → Graded Output
Non-Suspicious Claims → Routine Claims
Suspicious Claims → Complicated Claims
ROC Curve: Area Under the ROC Curve
Want good performance on both sensitivity and specificity
Sensitivity and specificity depend on the cut points chosen for the binary target (yes/no)
Choose a series of different cut points and compute sensitivity and specificity for each of them
Graph the results: plot sensitivity vs. 1 - specificity
Compute an overall measure of "lift", the area under the curve (see the ROC sketch below)
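The same construction in code, assuming scikit-learn and synthetic model scores (the scoring rule below is made up for illustration):

```python
# Sweep cut points, get sensitivity (tpr) and 1 - specificity (fpr) at
# each, and measure the area under the resulting curve.
import numpy as np
from sklearn.metrics import roc_curve, roc_auc_score

rng = np.random.default_rng(4)
y_true = rng.integers(0, 2, size=1000)
# Hypothetical model scores: higher for true "yes" claims, plus noise.
y_score = y_true * 0.8 + rng.normal(0, 0.6, size=1000)

fpr, tpr, cutpoints = roc_curve(y_true, y_score)
print("AUROC =", round(roc_auc_score(y_true, y_score), 3))
```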
True/False Positives and True/False Negatives: The "Confusion" Matrix
Choose a "cut point" in the model score; claims above the cut point are classified "yes".
Sample Confusion Matrix: Sensitivity and Specificity
                 True Class
Prediction       No       Yes      Row Total
No               800      200      1,000
Yes              200      400      600
Column Total     1,000    600

             Correct   Total    Percent Correct
Sensitivity  800       1,000    80%
Specificity  400       600      67%
(A sketch computing these quantities from a cut point follows below.)
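The same bookkeeping in code, assuming scikit-learn, synthetic scores, and the standard definitions (sensitivity on the "yes" class, specificity on the "no" class); the counts in the slide's table above are not reproduced by this toy:

```python
# Build the confusion matrix at one cut point and read off the rates.
import numpy as np
from sklearn.metrics import confusion_matrix

rng = np.random.default_rng(5)
y_true = rng.integers(0, 2, size=1000)
y_score = y_true * 0.8 + rng.normal(0, 0.6, size=1000)

cut_point = 0.5
y_pred = (y_score > cut_point).astype(int)   # claims above the cut -> "yes"

tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
print("sensitivity =", tp / (tp + fn))
print("specificity =", tn / (tn + fp))
```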
[Figure: TREENET ROC Curve – IME, AUROC = 0.701]
[Figure: Logistic ROC Curve – IME, AUROC = 0.643]
Ranking of Methods/Software – IME Requested
Method/Software          AUROC    Lower Bound   Upper Bound
Random Forest            0.7030   0.6954        0.7107
Treenet                  0.7010   0.6935        0.7085
MARS                     0.6974   0.6897        0.7051
S-PLUS Neural            0.6961   0.6885        0.7038
S-PLUS Tree              0.6881   0.6802        0.6961
Logistic                 0.6771   0.6695        0.6848
Naïve Bayes              0.6763   0.6685        0.6841
SPSS Exhaustive CHAID    0.6730   0.6660        0.6820
CART Tree                0.6694   0.6613        0.6775
Iminer Neural            0.6681   0.6604        0.6759
Iminer Ensemble          0.6491   0.6408        0.6573
Iminer Tree              0.6286   0.6199        0.6372
Variable Importance (IME) Based on Average of Methods
Important Variable Summarizations for IME Tree Models, Other Models and Total
Variable                       Type   Total Score   Total Score Rank   Tree Score Rank   Other Score Rank
Health Insurance               F      16529         1                  2                 1
Provider 2 Bill                F      12514         2                  1                 3
Injury Type                    F      10311         3                  3                 2
Territory                      F      5180          4                  4                 7
Provider 2 Type                F      4911          5                  6                 4
Provider 1 Bill                F      4711          6                  5                 5
Attorneys Per Zip              DV     2731          7                  7                 14
Report Lag                     DV     2650          8                  10                8
Treatment Lag                  DV     2638          9                  13                6
Claimant per City              DV     2383          10                 12                9
Provider 1 Type                F      1794          11                 9                 13
Providers per City             DV     1708          12                 11                11
Attorney                       F      1642          13                 8                 16
Distance MP1 Zip to Clt Zip    DV     1134          14                 18                10
AGE                            F      1048          15                 17                12
Avg. Household Price/Zip       DM     907           16                 16                15
Emergency Treatment            F      660           17                 14                18
Income Household/Zip           DM     329           18                 15                20
Providers/Zip                  DV     288           19                 20                17
Household/Zip                  DM     242           20                 19                19
Policy Type                    F      4             21                 21                21
Claim Fraud Detection Plan
STEP 1: SAMPLE: Systematic benchmark of a random sample of claims.
STEP 2: FEATURES: Isolate red flags and other sorting characteristics.
STEP 3: FEATURE SELECTION: Separate features into objective and subjective; early, middle, and late arriving; acquisition cost levels; and other practical considerations.
STEP 4: CLUSTER: Apply unsupervised algorithms (Kohonen, PRIDIT, Fuzzy) to cluster claims; examine for needed homogeneity.
STEP 5: ASSESSMENT: Externally classify claims according to objectives for sorting.
STEP 6: MODEL: Build supervised models relating the selected features to the objectives (logistic regression, Naïve Bayes, neural networks, CART, MARS).
STEP 7: STATIC TESTING: Compare model output with expert assessment, and model output with cluster homogeneity (PRIDIT scores), on one or more samples.
STEP 8: DYNAMIC TESTING: Run the accepted model in real time, record outcomes, and repeat steps 1-7 as needed to fine-tune the model and parameters. Use PRIDIT to show gain or loss of feature power and changing data patterns; tune investigative proportions to optimize detection and deterrence of fraud and abuse.