DataLabUSA
Overview
• Introduction
• The Case For Feature Selection
• Methodologies
• Case Study – DMA Analytics Challenge 2007
• Comparison of Approaches
• Advanced Algorithms
• Conclusion – Questions & Answers
The DataLab Environment
• DataLab USA
• Industries Served
• The Data Environment
• Analytical Framework
When more is not necessarily better
• TreeNet Models are naturally more robust than more traditional algorithms.
• Without any limitations a TreeNet Model in a typical DM environment can incorporate hundreds of independent variables.
• How many of these variables actually provide true informational gain?
Not all variables are created equal
• Certain types of variables can degrade TN model performance.
• High-order categoricals (e.g. state, cluster)
• Composite variables (e.g. risk score, cluster, family composition)
Why Not Specialize?
• A smaller number of variables allows for tighter parameters:
• Increased number of terminal nodes
• Decreased number of observations in minchild
• Allowance for more variable interactions (ICL)
You want me to build how many models?
• Brute Force = 2^N - 1
• 60 Variables = 1,152,921,504,606,846,975 Models
• Processing Time = 730,693,161,740 years
• Age of the Universe ≈ 13,730,000,000 years
• 1/2 will include top variable
• 1/4 will include top two variables
• 1/1024 will include top ten variables
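The arithmetic behind these figures can be checked in a few lines. A minimal sketch in Python; the 20-seconds-per-model cost is a hypothetical assumption (not stated on the slide) chosen only to reproduce the order of magnitude of the processing-time estimate:

```python
# Brute-force search over N variables tries every non-empty subset: 2^N - 1 models.
N = 60
models = 2 ** N - 1
print(models)  # 1152921504606846975

# Assumed cost per TreeNet run (hypothetical; chosen to match the slide's scale).
SECONDS_PER_MODEL = 20
years = models * SECONDS_PER_MODEL / (365.25 * 24 * 3600)
print(f"{years:.3e} years")  # on the order of 7.3e11 years

# A random subset includes the top k variables with probability ~1/2^k.
for k in (1, 2, 10):
    print(f"1 in {2 ** k} subsets includes the top {k} variable(s)")
```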
Feature Selection
• Feature Selection Goal – Efficiently identify the subset of independent variables that maximize model discrimination.
• Basic Feature Selection = N x (N+1)/2
• 60 Variables = 60 + 59 + 58 + … + 1 = 1,830 Models
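The stepwise count follows from testing every remaining candidate at each of up to N steps; a quick sketch of the worst case:

```python
def stepwise_model_count(n: int) -> int:
    """Worst-case model count for a stepwise search: n + (n-1) + ... + 1."""
    return n * (n + 1) // 2

print(stepwise_model_count(60))  # 1830
```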
Feature Selection - Framework
• The programmatic development and evaluation of TN batches is a necessity.
• Performance of initial models dictates the composition of later models.
• Too many decision points to rely on human interaction.
• SAS/C#
Variable Shaving
• Stepwise removal of variables from model based on variable importance.
• Typically starts with an unrestricted model and removes variables until stop condition is met or there are no more variables to remove.
• At each step variable with lowest importance is removed.
• Very low cost – requires only N models in total, since only one model is built per step.
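The shaving loop can be sketched as follows. Here fit_model() and importance() are hypothetical stand-ins for a TreeNet run and its variable-importance report, not actual TN APIs:

```python
def shave(variables, fit_model, importance, min_vars=1):
    """Repeatedly drop the least-important variable, recording each step."""
    history = []
    current = list(variables)
    while len(current) >= min_vars:
        model = fit_model(current)            # one model per step -> N models total
        history.append((list(current), model))
        if len(current) == min_vars:
            break
        # Rank by importance and remove the lowest-ranked variable.
        ranked = sorted(current, key=lambda v: importance(model, v))
        current.remove(ranked[0])
    return history
```

In practice the stop condition would compare each step's test-set performance, keeping the subset at the performance peak.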
Forward Selection
• Stepwise addition of variables to model based on performance criteria.
• Typically starts with 0 variables and grows until available variables are exhausted or a stop condition is met.
• Each step has the following substeps that are repeated up to N iterations:
1. Model Testing
2. Evaluation
3. Variable Selection
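The three substeps above can be sketched as a greedy loop. score() is a hypothetical helper that fits a model on a variable subset and returns its test-set ROC, and the patience stop rule is an illustrative choice, not necessarily the one used in the case study:

```python
def forward_select(candidates, score, patience=3):
    """Greedily add the variable that most improves test ROC."""
    selected, best_score, stalls = [], float("-inf"), 0
    remaining = list(candidates)
    while remaining and stalls < patience:
        # 1. Model Testing: try each remaining variable with the current set.
        trials = {v: score(selected + [v]) for v in remaining}
        # 2. Evaluation: pick the best-performing candidate.
        best_var = max(trials, key=trials.get)
        # 3. Variable Selection: keep it; track whether performance improved.
        if trials[best_var] > best_score:
            best_score, stalls = trials[best_var], 0
        else:
            stalls += 1
        selected.append(best_var)
        remaining.remove(best_var)
    return selected, best_score
```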
Backward Selection
• Stepwise removal of variables from model based on decision criteria.
• Typically starts with an unrestricted model and restricts variables until stop condition is met or there are no more variables to remove.
• Substeps are similar to forward selection:
1. Model Testing – candidate variables are removed from models.
2. Evaluation – identify the model with the highest performance.
3. Variable Removal – remove the corresponding variable from the model.
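The backward substeps can be sketched with the same hypothetical score() helper (fit a model on a subset, return test-set ROC); the stop rule shown, halting once every removal hurts performance, is an illustrative simplification:

```python
def backward_select(variables, score):
    """Greedily remove the variable whose removal best preserves test ROC."""
    current = list(variables)
    best_score = score(current)
    while len(current) > 1:
        # 1. Model Testing: score the model with each candidate removed.
        trials = {v: score([x for x in current if x != v]) for v in current}
        # 2. Evaluation: find the removal leaving the highest performance.
        victim = max(trials, key=trials.get)
        if trials[victim] < best_score:
            break                              # stop: any removal degrades ROC
        # 3. Variable Removal.
        best_score = trials[victim]
        current.remove(victim)
    return current, best_score
```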
Case Study – Overview
• 2007 DMA Analytics Challenge
• Dependent Variable: Response
• Independent Variables: 228 variables in total
• Household demographics
• Area/household level lifestyles and interests
• Geo demographics
• Socio-economic census
• Domain: 40k random mailpieces generating 20k responders
Case Study – Overview
• Model Parameters – TN 2.0
• Type: Logistic Binary – ROC stopping condition
• Nodes: 6
• Trees: 200
• Minchild: 200
• LR: 0.1
• SubSample: 0.5
• Validation Type: 50% internal test
• Performance (No Variable Restrictions):
• ROC (Learn/Test): .764/.736
• KS (Learn/Test): .392/.351
Case Study – Variable Shaving
• Decision Metric: Importance (TN 2.0)
• Resample: seed values are changed between runs
• Peak performance attained after 72 variables.
• 157 models required to identify best 72 out of 228 variables.
• ROC (Learn/Test): .763/.741
Case Study – Variable Shaving
[Chart: Learn ROC and Test ROC by # Variables (0–220), with 5-period moving averages of each; ROC axis 0.730–0.790]
Case Study – Forward Selection
• Decision Metric: ROC (Test)
• Resample: Rows of input file are physically shuffled after each batch.
• Peak performance attained after 25 variables.
• 6,400 models required to identify 25 out of 228 variables.
• ROC (Learn/Test): .768/.758
Case Study – Forward Selection
[Chart: Learn ROC and Test ROC by # Variables (0–220), with 5-period moving averages of each; ROC axis 0.730–0.790]
Case Study – Backward Selection
• Decision Metric: ROC (Test)
• Resample: Rows of input file are physically shuffled after each batch.
• Peak performance attained after 71 variables.
• 23,600 models required to identify best 71 out of 228 variables.
• ROC (Learn/Test): .761/.760
Case Study – Backward Selection
[Chart: Learn ROC and Test ROC by # Variables (0–220), with 5-period moving averages of each; ROC axis 0.730–0.790]
Case Study – Method Comparison
[Chart: Test ROC by # Variables (0–220) for Shaving, Forward, and Backward selection; ROC axis 0.730–0.760]
Comparison

                                                Forward   Backward   Shaving
Sensitivity to unstable/heavily used variables     +         +          -
Sensitivity to heavily interactive variables       -         +          +
Guard against composite variables                  -         +          -
Processing efficiency                              +         -          +
Suitability for large number of variables          +         -          +
Devising more advanced algorithms
• Combination of the two procedures
• Controlling for differences in parameters over variable space.
• Decision metric augmentation
• Variable sampling
• Internal re-sampling of learn vs. test
• ICL