© Deloitte Consulting, 2004
Other Modeling Techniques
James Guszcza, FCAS, MAAA
CAS Predictive Modeling Seminar
Chicago
October, 2004
Agenda
CART overview
Case study: Spam Detection
CART
Classification
And
Regression
Trees
CART
Developed by Breiman, Friedman, Olshen, Stone in early 80’s.
Jerome Friedman wrote the original CART software (Fortran) to accompany the original CART monograph (1984).
One of many tree-based modeling techniques: CART, CHAID, C5.0, and various software package variants.
Preface
“Tree Methodology… is a child of the computer age. Unlike many other statistical procedures which were moved from pencil and paper to calculators and then to computers, this use of trees was unthinkable before computers” --Breiman, Friedman, Olshen, Stone
The Basic Idea
Recursive Partitioning:
Take all of your data.
Consider all possible values of all variables.
Select the variable/value (X=t1) that produces the greatest “separation” in the target. (X=t1) is called a “split”.
If X < t1 then send the data point to the “left”; otherwise, send it to the “right”.
Now repeat the same process on these two “nodes”.
CART uses only binary splits.
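The split-selection step above can be sketched in a few lines. This is an illustrative toy, not the CART authors' implementation; variable names (`num_veh`, `liab_only`) and the use of Gini impurity as the "separation" measure are assumptions for the example.

```python
def gini(targets):
    """Gini impurity p*(1 - p) of a list of 0/1 targets."""
    if not targets:
        return 0.0
    p = sum(targets) / len(targets)
    return p * (1 - p)

def best_split(rows, target):
    """rows: list of dicts of predictor values; target: parallel 0/1 list.
    Scans every variable/value pair and returns the (variable, threshold)
    whose binary split X < t gives the lowest weighted child impurity."""
    n = len(target)
    best_var, best_t, best_score = None, None, gini(target)
    for var in rows[0]:
        for t in sorted({r[var] for r in rows}):
            left = [y for r, y in zip(rows, target) if r[var] < t]
            right = [y for r, y in zip(rows, target) if r[var] >= t]
            if not left or not right:
                continue  # not a real split
            score = (len(left) * gini(left) + len(right) * gini(right)) / n
            if score < best_score:
                best_var, best_t, best_score = var, t, score
    return best_var, best_t

# Toy data: four policies; claims occur only on the high-vehicle-count ones.
rows = [{"num_veh": 1, "liab_only": 0}, {"num_veh": 2, "liab_only": 1},
        {"num_veh": 8, "liab_only": 0}, {"num_veh": 9, "liab_only": 1}]
claims = [0, 0, 1, 1]
print(best_split(rows, claims))  # ('num_veh', 8): a perfectly pure split
```

Recursive partitioning is then just: apply `best_split` to the full data, send the rows left/right, and repeat on each child node.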
Let’s Split
Suppose you have 3 variables:
# vehicles: {1, 2, 3, …, 10+}
Age category: {1, 2, 3, …, 6}
Liability-only: {0, 1}
At each iteration, CART tests all 15 splits:
(#veh < 2), (#veh < 3), …, (#veh < 10)
(age < 2), …, (age < 6)
(lia < 1)
Select the split resulting in the greatest marginal purity.
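The count of 15 candidates follows because an ordered variable with k distinct values admits k−1 splits of the form X < t. A small sketch (variable names assumed for illustration):

```python
# Candidate-split enumeration for the three example variables:
# 9 splits for # vehicles, 5 for age category, 1 for liability-only = 15.
variables = {
    "num_veh": list(range(1, 11)),   # {1, 2, ..., 10+}
    "age_cat": list(range(1, 7)),    # {1, ..., 6}
    "liab_only": [0, 1],             # {0, 1}
}

candidate_splits = [
    (var, t)
    for var, values in variables.items()
    for t in values[1:]              # X < t for every value except the smallest
]
print(len(candidate_splits))  # 15
```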
Classification Tree Example: predict likelihood of a claim
Root node (N = 57203): class 0: 37891 (66.2%); class 1: 19312 (33.8%)
Split on NUM_VEH:
  NUM_VEH <= 4.5 → Terminal Node 1 (N = 36359): class 0: 29083 (80.0%); class 1: 7276 (20.0%)
  NUM_VEH > 4.5 → Terminal Node 2 (N = 20844): class 0: 8808 (42.3%); class 1: 12036 (57.7%)
Classification Tree Example: predict likelihood of a claim
Node 1, split on NUM_VEH (N = 57203):
  NUM_VEH <= 4.5 → Node 2, split on LIAB_ONLY (N = 36359):
    LIAB_ONLY <= 0.5 → Node 3, split on FREQ1_F_RPT (N = 28489):
      FREQ1_F_RPT <= 0.5 → Terminal Node 1, Class = 0 (N = 24122): 0: 18984 (78.7%); 1: 5138 (21.3%)
      FREQ1_F_RPT > 0.5 → Terminal Node 2, Class = 1 (N = 4367): 0: 2508 (57.4%); 1: 1859 (42.6%)
    LIAB_ONLY > 0.5 → Terminal Node 3, Class = 0 (N = 7870): 0: 7591 (96.5%); 1: 279 (3.5%)
  NUM_VEH > 4.5 → Node 4, split on NUM_VEH (N = 20844):
    NUM_VEH <= 10.5 → Node 5, split on AVGAGE_CAT (N = 11707):
      AVGAGE_CAT <= 8.5 → Terminal Node 4, Class = 1 (N = 8998): 0: 4327 (48.1%); 1: 4671 (51.9%)
      AVGAGE_CAT > 8.5 → Terminal Node 5, Class = 0 (N = 2709): 0: 2072 (76.5%); 1: 637 (23.5%)
    NUM_VEH > 10.5 → Terminal Node 6, Class = 1 (N = 9137): 0: 2409 (26.4%); 1: 6728 (73.6%)
Categorical Splits
Categorical predictors: CART considers every possible subset of categories
Left (1st split): dump, farm, no truck
Right (1st split): contractor, hauling, food delivery, special delivery, waste, other
Node 1, split on LINE_IND$ (N = 38300):
  ("dump", "farm", "no truck") → Terminal Node 1 (N = 11641)
  ("contr", …) → Node 2, split on LINE_IND$ (N = 26659):
    ("hauling", "specDel") → Node 3, split on LINE_IND$ (N = 901):
      ("hauling") → Terminal Node 2 (N = 652)
      ("specDel") → Terminal Node 3 (N = 249)
    ("contr", …) → Terminal Node 4 (N = 25758)
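The subset search over categories can be sketched as follows. For k categories there are 2^(k−1) − 1 distinct binary splits; the category names below are drawn from the truck-class example, and the helper is a toy illustration rather than CART's actual search.

```python
from itertools import combinations

def category_splits(categories):
    """Yield (left, right) partitions of the category set. Fixing the first
    category on the left avoids counting each split twice, giving
    2**(k-1) - 1 distinct binary splits for k categories."""
    first, rest = categories[0], categories[1:]
    for r in range(len(rest) + 1):
        for combo in combinations(rest, r):
            left = {first, *combo}
            right = set(categories) - left
            if right:  # skip the trivial everything-left "split"
                yield left, right

cats = ["dump", "farm", "hauling", "contractor"]
splits = list(category_splits(cats))
print(len(splits))  # 2**3 - 1 = 7
```

This exponential growth is why categorical predictors with many levels are expensive for tree algorithms.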
Gains Chart
Node 6: 16% of policies, 35% of claims.
Node 4: 16% of policies, 24% of claims.
Node 2: 8% of policies, 10% of claims.
…etc. The higher the gains chart, the stronger the model.
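The gains-chart points above come from ranking nodes by predicted claim rate and accumulating shares of policies and of claims. A minimal sketch (the two-node counts below are made-up toy numbers):

```python
def gains_points(nodes):
    """nodes: list of (n_policies, n_claims) per terminal node.
    Returns cumulative (pct_policies, pct_claims) pairs, best nodes first."""
    total_pol = sum(n for n, _ in nodes)
    total_clm = sum(c for _, c in nodes)
    ordered = sorted(nodes, key=lambda nc: nc[1] / nc[0], reverse=True)
    points, cum_pol, cum_clm = [], 0, 0
    for n, c in ordered:
        cum_pol += n
        cum_clm += c
        points.append((cum_pol / total_pol, cum_clm / total_clm))
    return points

# Toy example: the riskier node holds 40% of policies but 70% of claims.
print(gains_points([(60, 30), (40, 70)]))
# [(0.4, 0.7), (1.0, 1.0)]
```

A curve that rises steeply toward the upper left indicates strong segmentation; the diagonal is a model with no lift.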
Splitting Rules
Select the variable/value (X=t1) that produces the greatest “separation” in the target variable.
“Separation” can be defined in many ways:
Regression Trees (continuous target): use sum of squared errors.
Classification Trees (categorical target): choice of entropy, Gini measure, or the “twoing” splitting rule.
Regression Trees
Tree-based modeling for a continuous target variable; the most intuitively appropriate method for loss ratio analysis.
Find the split that produces the greatest separation in Σ(y − E(y))².
i.e.: find nodes with minimal within-node variance, and therefore greatest between-node variance (like credibility theory).
Every record in a node is assigned the same ŷ, so the model is a step function.
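The criterion just described can be made concrete with a small sketch: each child node predicts its own mean, and the split minimizes total within-node sum of squared errors (the toy data below is illustrative):

```python
def sse(ys):
    """Sum of squared errors around the node mean, i.e. sum (y - E(y))^2."""
    if not ys:
        return 0.0
    mean = sum(ys) / len(ys)
    return sum((y - mean) ** 2 for y in ys)

def split_sse(xs, ys, t):
    """Total within-node SSE if we split the data at X < t."""
    left = [y for x, y in zip(xs, ys) if x < t]
    right = [y for x, y in zip(xs, ys) if x >= t]
    return sse(left) + sse(right)

xs = [1, 2, 8, 9]
ys = [10.0, 12.0, 30.0, 32.0]
# Splitting between the two clusters removes almost all within-node variance:
print(split_sse(xs, ys, 8))  # 2.0 + 2.0 = 4.0
print(split_sse(xs, ys, 2))  # much larger
```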
Classification Trees
Tree-based modeling for a discrete target variable.
In contrast with regression trees, various measures of purity are used.
Common measures of purity: Gini, entropy, “twoing”.
Intuition: an ideal retention model would produce nodes that contain either defectors only or non-defectors only: completely pure nodes.
More on Splitting Criteria
Gini purity of a node: p(1 − p), where p = relative frequency of defectors.
Entropy of a node: −Σ p·log p = −[p·log(p) + (1 − p)·log(1 − p)].
Entropy/Gini are maximized when p = 0.5 and minimized when p = 0 or 1.
Gini might produce small but pure nodes; the “twoing” rule strikes a balance between purity and creating roughly equal-sized nodes.
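The two purity measures above in code form, for a two-class node with defector frequency p:

```python
import math

def gini(p):
    """Gini measure p(1 - p) for defector frequency p."""
    return p * (1 - p)

def entropy(p):
    """-[p*log(p) + (1-p)*log(1-p)]; the p=0 and p=1 limits are 0."""
    if p in (0.0, 1.0):
        return 0.0
    return -(p * math.log(p) + (1 - p) * math.log(1 - p))

# Both measures peak at p = 0.5 and vanish for completely pure nodes:
print(gini(0.5), entropy(0.5))  # 0.25 0.693...
print(gini(0.0), entropy(1.0))  # 0.0 0.0
```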
Classification Trees vs. Regression Trees
Classification Trees:
  Splitting criteria: Gini, entropy, twoing.
  Goodness-of-fit measure: misclassification rates.
  Prior probabilities and misclassification costs available as model “tuning parameters”.
Regression Trees:
  Splitting criterion: sum of squared errors.
  Goodness of fit: the same measure! Sum of squared errors.
  No priors or misclassification costs… just let it run.
CART advantages
Nonparametric (no probabilistic assumptions).
Automatically performs variable selection.
Uses any combination of continuous/discrete variables.
Discovers “interactions” among variables.
CART advantages
CART handles missing values automatically, using “surrogate splits”.
Invariant to monotonic transformations of predictor variables.
Not sensitive to outliers in predictor variables.
A great way to explore and visualize data.
CART Disadvantages
The model is a step function, not a continuous score. So if a tree has 10 terminal nodes, ŷ can take on only 10 possible values. (MARS improves on this.)
It might take a large tree to get good lift, but then it is hard to interpret.
Instability of model structure: with correlated variables, random data fluctuations can result in entirely different trees.
CART does a poor job of modeling linear structure.
Case Study
Spam Detection
CART
MARS
Neural Nets
GLM
The Data
Goal: build a model to predict whether an incoming email is spam.
Analogous to insurance fraud detection.
About 6000 data points, each representing an email message sent to an HP scientist.
Binary target variable: 1 = the message was spam; 0 = the message was not spam.
Predictive variables created based on frequencies of various words & characters.
The Predictive Variables
57 variables created:
Frequency of “George” (the scientist’s first name)
Frequency of “!”, “$”, etc.
Frequency of long strings of capital letters
Frequency of “receive”, “free”, “credit”…
Etc.
Variable creation required insight that (as yet) can’t be automated.
Analogous to the insurance variables an insightful actuary or underwriter can create.
Methodology
Divide the data 60%-40% into train and test.
Use multiple techniques to fit models on the train data.
Apply the models to the test data.
Compare their power using gains charts.
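The 60/40 protocol can be sketched as below. The row count (~6000) matches the data description; the seed is an arbitrary illustration.

```python
import random

def train_test_split(rows, train_frac=0.6, seed=42):
    """Shuffle once (reproducibly), then cut into train and test."""
    shuffled = rows[:]
    random.Random(seed).shuffle(shuffled)
    cut = int(len(shuffled) * train_frac)
    return shuffled[:cut], shuffled[cut:]

rows = list(range(6000))            # stand-in for ~6000 email records
train, test = train_test_split(rows)
print(len(train), len(test))  # 3600 2400
```

Every model is then fit on `train` only; `test` is touched only for the final gains-chart comparison.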
Un-pruned Tree
Just let CART keep splitting until the marginal improvement in purity diminishes.
Too big! Use cross-validation (on the train data) to prune back. Select the optimal sub-tree.
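The pruning idea can be sketched abstractly: grow a big tree, score candidate subtree sizes by cross-validated error on the train data, and keep the size with the lowest held-out error. The fold helper and the toy error numbers below are illustrative, not CART's actual cost-complexity machinery.

```python
def kfold_indices(n, k):
    """Split range(n) into k contiguous folds for cross-validation."""
    folds, start = [], 0
    for i in range(k):
        size = n // k + (1 if i < n % k else 0)
        folds.append(list(range(start, start + size)))
        start += size
    return folds

def pick_tree_size(cv_error):
    """cv_error: {n_terminal_nodes: mean held-out error}. Lowest error wins;
    ties go to the smaller (more interpretable) tree."""
    return min(sorted(cv_error), key=lambda size: cv_error[size])

# Toy cross-validated errors: error falls, bottoms out, then rises again
# as the ever-larger tree starts to overfit.
cv_error = {2: 0.30, 5: 0.18, 13: 0.12, 40: 0.14, 120: 0.19}
print(pick_tree_size(cv_error))  # 13
```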
Pruned Tree
[Pruned tree: successive splits on freq_! < 0.0785, freq_remove < 0.045, freq_money < 0.01, freq_$ < 0.0565, freq_george >= 0.08, capLen_avg < 2.755, freq_remove < 0.025, freq_free < 0.24, freq_your < 0.615, freq_$ < 0.015, freq_our < 0.58, and freq_hp >= 0.41; each terminal node is labeled 0 or 1 with its non-spam/spam counts.]
CART Gains Chart
Use the test data; 40% of the messages were spam.
The outer black line is the best one could do.
The 45° line is the monkey throwing darts.
The pruned tree is simple but does a good job.
[Figure: “Spam Email Detection - Gains Charts”; x-axis: Perc.Total.Pop, y-axis: Perc.Spam; curves for the perfect model and CART.]
Other Models
Fit a purely additive MARS model to the data (no interactions among basis functions).
Fit a neural network with 3 hidden nodes.
Fit a logistic regression (GLM).
Fit an ordinary multiple regression. This is a sin: the target is binary, not normal!
Neural Net Weights
Neural Net Intuition
You can think of a NNET as a set of logistic regressions embedded in another logistic regression:
  Z1 = 1 / (1 + exp(−(a01 + a11·X1 + a21·X2 + a31·X3)))
  Z2 = 1 / (1 + exp(−(a02 + a12·X1 + a22·X2 + a32·X3)))
  Y = 1 / (1 + exp(−(b0 + b1·Z1 + b2·Z2)))
[Figure: network diagram with inputs X1, X2, X3, hidden nodes Z1, Z2, and output Y.]
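The logistic-inside-logistic structure is easy to write out directly. The weights below are made-up illustrations, not the fitted spam model:

```python
import math

def logistic(x):
    return 1.0 / (1.0 + math.exp(-x))

def nnet_predict(x1, x2, x3, a, b):
    """a: {(input_index, hidden_index): weight}, with a[0, j] the biases;
    b: output-layer weights (b0, b1, b2). Each hidden node Z_j is a logistic
    regression on the inputs; Y is a logistic regression on Z1, Z2."""
    z1 = logistic(a[0, 1] + a[1, 1] * x1 + a[2, 1] * x2 + a[3, 1] * x3)
    z2 = logistic(a[0, 2] + a[1, 2] * x1 + a[2, 2] * x2 + a[3, 2] * x3)
    return logistic(b[0] + b[1] * z1 + b[2] * z2)

a = {(0, 1): 0.1, (1, 1): 0.5, (2, 1): -0.3, (3, 1): 0.2,
     (0, 2): -0.4, (1, 2): 0.7, (2, 2): 0.1, (3, 2): -0.6}
b = (0.2, 1.5, -1.0)
print(nnet_predict(1.0, 0.0, 1.0, a, b))  # a probability in (0, 1)
```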
MARS Basis Functions
[Figure: fitted MARS basis functions (response vs. predictor) for freq_our, freq_remove, freq_free, freq_you, freq_credit, freq_your, freq_font, freq_money, freq_hp, freq_george, freq_650, freq_meeting, freq_project, freq_edu, freq_;, freq_!, freq_$, and capLen_avg.]
Mars Intuition
This MARS model is just a regression model on the basis functions that MARS automatically found!
Less black-boxy than a NNET.
No interactions in this particular model.
Finding the basis functions is like CART taken a step further.
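The basis functions MARS finds are hinge terms like max(0, x − t), and an additive MARS model is a linear regression on them. The knot locations and coefficients below are made up for illustration; only the variable names come from the slides:

```python
def hinge(x, knot, direction=+1):
    """MARS-style basis function max(0, direction * (x - knot))."""
    return max(0.0, direction * (x - knot))

def mars_predict(row, intercept, terms):
    """terms: list of (variable, knot, direction, coefficient).
    Purely additive: no products of basis functions, as in the model above."""
    return intercept + sum(
        coef * hinge(row[var], knot, d) for var, knot, d, coef in terms
    )

# Hypothetical two-term spam score:
terms = [("freq_remove", 0.05, +1, 2.0),   # more "remove" -> spammier
         ("freq_george", 0.10, +1, -3.0)]  # more "george" -> less spammy
row = {"freq_remove": 0.55, "freq_george": 0.0}
print(mars_predict(row, 0.3, terms))  # 0.3 + 2.0*0.5 = 1.3
```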
Comparison of Techniques
All techniques work pretty well!
Good variable creation is at least as important as the modeling technique.
MARS/NNET a bit stronger.
GLM a strong contender.
CART weaker.
Even regression isn’t too bad!
[Figure: “Spam Email Detection - Gains Charts” on the test data; x-axis: Perc.Total.Pop, y-axis: Perc.Spam; curves for the perfect model, MARS, neural net, decision tree, GLM, and regression.]
Concluding Thoughts
Often the true power of a predictive model comes from insightful variable creation.
Subject-matter expertise is critical. We don’t have true AI yet!
CART is highly intuitive and a great way to select variables and get a feel for your data.
GLM remains a great bet.
Do a cost-benefit analysis to decide whether MARS or NNET are worth the complexity and trouble.
Concluding Thoughts
Generating a bunch of answers is easy. Asking the right questions is the hard part!
Strategic goal?
How to manage the project?
Model design?
Variable creation?
How to do IT implementation?
How to manage organizational buy-in?
How do we measure the success of the project? (Not just the model)