Copyright © 2000-2012, Abbott Analytics, Inc. All rights reserved.
Why Ensembles Win Data Mining Competitions
A Predictive Analytics Center of Excellence (PACE) Tech Talk
November 14, 2012
Dean Abbott, Abbott Analytics, Inc.
Blog: http://abbottanalytics.blogspot.com
URL: http://www.abbottanalytics.com
Twitter: @deanabb
Email: [email protected]
Outline
Motivation for Ensembles
How Ensembles are Built
Do Ensembles Violate Occam's Razor?
Why Do Ensembles Win?
PAKDD Cup 2007 Results: The Scoring Metric Changes the Winner
| Modeling Technique | Implementation | Location | Affiliation | AUC ROC (Trapezoidal Rule) | AUC Rank | Top Decile Response Rate | Top Decile Rank |
|---|---|---|---|---|---|---|---|
| TreeNet + Logistic Regression | Salford Systems | Mainland China | Practitioner | 70.01% | 1 | 13.00% | 7 |
| Probit Regression | SAS | USA | Practitioner | 69.99% | 2 | 13.13% | 6 |
| MLP + n-Tuple Classifier | | Brazil | Practitioner | 69.62% | 3 | 13.88% | 1 |
| TreeNet | Salford Systems | USA | Practitioner | 69.61% | 4 | 13.25% | 4 |
| TreeNet | Salford Systems | Mainland China | Practitioner | 69.42% | 5 | 13.50% | 2 |
| Ridge Regression | Rank | Belgium | Practitioner | 69.28% | 6 | 12.88% | 9 |
| 2-Layer Linear Regression | | USA | Practitioner | 69.14% | 7 | 12.88% | 9 |
| Logistic Regression + Decision Stump + AdaBoost + VFI | | Mainland China | Academia | 69.10% | 8 | 13.25% | 4 |
| Logistic Average of Single Decision Functions | | Australia | Practitioner | 68.85% | 9 | 12.13% | 17 |
| Logistic Regression | Weka | Singapore | Academia | 68.69% | 10 | 12.38% | 16 |
| Logistic Regression | | Mainland China | Practitioner | 68.58% | 11 | 12.88% | 9 |
| Decision Tree + Neural Network + Logistic Regression | | Singapore | | 68.54% | 12 | 13.00% | 7 |
| Scorecard Linear Additive Model | Xeno | USA | Practitioner | 68.28% | 13 | 11.75% | 20 |
| Random Forest | Weka | USA | | 68.04% | 14 | 12.50% | 14 |
| Expanding Regression Tree + RankBoost + Bagging | Weka | Mainland China | Academia | 68.02% | 15 | 12.50% | 14 |
| Logistic Regression | SAS + Salford Systems | India | Practitioner | 67.58% | 16 | 12.00% | 19 |
| J48 + BayesNet | Weka | Mainland China | Academia | 67.56% | 17 | 11.63% | 21 |
| Neural Network + General Additive Model | Tiberius | USA | Practitioner | 67.54% | 18 | 11.63% | 21 |
| Decision Tree + Neural Network | | Mainland China | Academia | 67.50% | 19 | 12.88% | 9 |
| Decision Tree + Neural Network + Logistic Regression | SAS | USA | Academia | 66.71% | 20 | 13.50% | 2 |
| Neural Network | SAS | USA | Academia | 66.36% | 21 | 12.13% | 17 |
| Decision Tree + Neural Network + Logistic Regression | SAS | USA | Academia | 65.95% | 22 | 11.63% | 21 |
| Neural Network | SAS | USA | Academia | 65.69% | 23 | 9.25% | 32 |
| Multi-dimension Balanced Random Forest | | Mainland China | Academia | 65.42% | 24 | 12.63% | 13 |
| Neural Network | SAS | USA | Academia | 65.28% | 25 | 11.00% | 26 |
| CHAID Decision Tree | SPSS | Argentina | Academia | 64.53% | 26 | 11.25% | 24 |
| Under-Sampling Based on Clustering + CART Decision Tree | | Taiwan | Academia | 64.45% | 27 | 11.13% | 25 |
| Decision Tree + Neural Network + Polynomial Regression | SAS | USA | Academia | 64.26% | 28 | 9.38% | 30 |
[Callout: a brace in the original slide marks the ensemble entries in the table]
Netflix Prize
2006: Netflix's state-of-the-art system (Cinematch) had RMSE = 0.9525
Prize: reduce this RMSE by 10% => 0.8572
2007: Korbell team, Progress Prize winner – a 107-algorithm ensemble
– Top algorithm: SVD with RMSE = 0.8914
– 2nd algorithm: Restricted Boltzmann Machine with RMSE = 0.8990
– A mini-ensemble (SVD + RBM) achieved RMSE = 0.88
http://techblog.netflix.com/2012/04/netflix-recommendations-beyond-5-stars.html
Common Kinds of Ensembles vs. Single Models
From Zhuowen Tu, “Ensemble Classification Methods: Bagging, Boosting, and Random Forests”
[Figure: a list of classification methods, with a brace grouping the ensemble methods apart from the single classifiers]
What are Model Ensembles?
Combining the outputs from multiple models into a single decision
Models can be created using the same algorithm or several different algorithms
[Diagram: predictions from several models feed decision logic, which outputs the ensemble prediction]
Creating Model Ensembles Step 1: Generate Component Models
Can vary data or model parameters:
– Case (record) weights — bootstrapping, sampling
– Data values — add noise, recode data
– Learning parameters — vary learning rates, pruning severity, random seeds
– Variable subsets — vary candidate inputs, features

[Diagram: a single data set gives rise to multiple models and predictions]
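To make Step 1 concrete, here is a minimal sketch (not from the original slides) of generating component models by bootstrapping rows and varying random seeds; it assumes numpy arrays X and y plus scikit-learn, and the helper name build_component_models is illustrative.

```python
# Hypothetical sketch of Step 1: vary the data (bootstrap the rows) and the
# learning parameters (random seeds) to get diverse component models.
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def build_component_models(X, y, n_models=10, seed=42):
    rng = np.random.default_rng(seed)
    models = []
    n = len(X)
    for i in range(n_models):
        idx = rng.integers(0, n, size=n)  # bootstrap: sample rows with replacement
        tree = DecisionTreeClassifier(random_state=i)  # vary the seed per model
        tree.fit(X[idx], y[idx])
        models.append(tree)
    return models
```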
Creating Model Ensembles Step 2: Combining Models
Combining Methods
– Estimation: average outputs
– Classification: average probabilities or vote (best M of N)

Variance Reduction
– Build complex, overfit models
– All models built in the same manner

Bias Reduction
– Build simple models
– Subsequent models weight records with errors more heavily (or model the actual errors)

[Diagram: multiple models and predictions pass through a combine step to yield a decision or prediction value]
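A matching sketch of Step 2 (again an illustration, not the slide's implementation): average the component models' probabilities for estimation-style combining, and threshold the average for a classification decision. It assumes the models come from the Step 1 sketch above.

```python
# Hypothetical sketch of Step 2: combine component models by averaging
# their predicted probabilities, then vote via a threshold.
import numpy as np

def ensemble_predict(models, X, threshold=0.5):
    # Stack each model's P(class = 1) into an (n_models, n_rows) array.
    probs = np.stack([m.predict_proba(X)[:, 1] for m in models])
    avg_prob = probs.mean(axis=0)                  # estimation: average outputs
    decision = (avg_prob > threshold).astype(int)  # classification: threshold/vote
    return avg_prob, decision
```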
How Model Complexity Affects Errors
Giovanni Seni and John Elder, Ensemble Methods in Data Mining: Improving Accuracy Through Combining Predictions, Morgan and Claypool Publishers, 2010 (ISBN: 978-1608452842)
Commonly Used Information-Theoretic Complexity Penalties
BIC: Bayesian Information Criterion
AIC: Akaike Information Criterion
MDL: Minimum Description Length
For a nice summary: http://en.wikipedia.org/wiki/Regularization_(mathematics)
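For reference (these are the standard textbook definitions, not reproduced from the slide), the first two penalties are usually written as:

```latex
% k = number of model parameters, n = number of observations,
% \hat{L} = maximized value of the likelihood function.
\mathrm{AIC} = 2k - 2\ln\hat{L}
\qquad
\mathrm{BIC} = k\ln n - 2\ln\hat{L}
```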
Four Keys to Effective Ensembling
Diversity of opinion
Independence
Decentralization
Aggregation
From The Wisdom of Crowds, James Surowiecki
Bagging
Bagging Method
– Create many data sets by bootstrapping (can also do this with cross-validation)
– Create one decision tree for each data set
– Combine decision trees by averaging (or voting) final decisions
– Primarily reduces model variance rather than bias

Results
– On average, better than any individual tree

[Diagram: bootstrapped trees averaged into a final answer]
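A minimal bagging sketch in scikit-learn (an illustration consistent with the slide, not the author's own code); the dataset is a stand-in, and in scikit-learn versions before 1.2 the estimator argument is named base_estimator.

```python
# Bagging: many bootstrapped data sets, one complex (high-variance) tree
# per data set, predictions combined by voting/averaging.
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import BaggingClassifier
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)

bagged = BaggingClassifier(
    estimator=DecisionTreeClassifier(),  # deliberately unpruned, overfit base model
    n_estimators=100,
    bootstrap=True,   # each tree sees a bootstrap sample of the rows
    random_state=0,
)
print(cross_val_score(bagged, X, y, cv=5).mean())
```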
Boosting (AdaBoost)
Boosting Method
– Create a tree using the training data set
– Score each data point, indicating where each incorrect decision (error) is made
– Retrain, giving rows with incorrect decisions more weight; repeat
– The final prediction is a weighted average of all models -> model regularization
– Best to create "weak" models—simple models (just a few splits for a decision tree)—and let the boosting iterations find the complexity
– Often used with trees or Naïve Bayes

Results
– Usually better than an individual tree or Bagging

[Diagram: examples with incorrect classifications are reweighted; models are combined via a weighted sum]
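A minimal AdaBoost sketch (illustrative, with a stand-in dataset; the estimator argument is named base_estimator in scikit-learn versions before 1.2). One-split decision stumps play the role of the "weak" models described above.

```python
# AdaBoost: weak one-split stumps, rows reweighted after each round,
# models combined via a weighted sum.
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import AdaBoostClassifier
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)

boosted = AdaBoostClassifier(
    estimator=DecisionTreeClassifier(max_depth=1),  # "weak" model: a stump
    n_estimators=200,
    random_state=0,
)
print(cross_val_score(boosted, X, y, cv=5).mean())
```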
Random Forest Ensembles
Random Forest (RF) Method
– Exactly the same methodology as Bagging, but with a twist
– At each split, rather than using the entire set of candidate inputs, use a random subset of candidate inputs
– Generates diversity of samples and inputs (splits)

Results
– On average, better than any individual tree, Bagging, or even Boosting

[Diagram: bootstrapped, input-randomized trees averaged into a final answer]
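A minimal random forest sketch (illustrative; the dataset is a stand-in):

```python
# Random forest: bagging plus the twist that each split considers only a
# random subset of the candidate inputs (sqrt of the feature count here).
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

X, y = load_breast_cancer(return_X_y=True)

forest = RandomForestClassifier(
    n_estimators=500,
    max_features="sqrt",  # random input subset at each split
    bootstrap=True,       # random sample of rows per tree
    random_state=0,
)
print(cross_val_score(forest, X, y, cv=5).mean())
```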
Stochastic Gradient Boosting
Implemented in MART (Jerry Friedman) and TreeNet (Salford Systems)

Algorithm
– Begin with a simple model—a constant value
– Build a simple tree (perhaps 6 terminal nodes)—now there are 6 possible levels, whereas before there was one
– Score the model and compute the errors; the score is the sum of all previous trees, weighted by a "learning rate"
– Build a new tree with the errors as the target variable, and repeat

Results
– TreeNet has won 2 KDD Cup competitions and numerous others
– It is less prone to outliers and overfitting than AdaBoost

[Diagram: each new tree predicts the errors of the ensemble so far; the trees are combined via a weighted sum into an additive final model]
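Since MART and TreeNet are commercial products, here is a sketch of the same algorithm using scikit-learn's open-source implementation (a stand-in, not TreeNet itself):

```python
# Stochastic gradient boosting: start from a constant model, repeatedly fit
# a small tree to the current errors, and shrink each tree's contribution
# by a learning rate; subsample < 1.0 supplies the "stochastic" part.
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import cross_val_score

X, y = load_breast_cancer(return_X_y=True)

gbm = GradientBoostingClassifier(
    max_leaf_nodes=6,   # "perhaps 6 terminal nodes" per tree
    learning_rate=0.1,  # weight on each new tree
    subsample=0.5,      # fit each tree on a random half of the rows
    n_estimators=300,
    random_state=0,
)
print(cross_val_score(gbm, X, y, cv=5).mean())
```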
Ensembles of Trees: Smoothers
Ensembles smooth jagged decision boundaries
Pictures from T.G. Dietterich, "Ensemble Methods in Machine Learning", in Multiple Classifier Systems, Cagliari, Italy, 2000.
Heterogeneous Model Ensembles on Glass Data
Model prediction diversity obtained by using different algorithms: tree, NN, RBF, Gaussian, regression, k-NN

Combining 3-5 models is, on average, better than the best single model

Combining all 6 models is not best (the best is a 3- or 4-model combination), but it is close

This is an example of reducing model variance through ensembles, but not model bias
[Chart: percent classification error (0% to 40%) vs. number of models combined (1 to 6), with max, min, and average error curves]
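A sketch of a heterogeneous ensemble along these lines (illustrative: the Glass data is not bundled with scikit-learn, so a stand-in dataset is used, and the algorithm mix is a subset of the six on the slide):

```python
# Heterogeneous ensemble: different algorithms supply the diversity;
# soft voting averages their class probabilities.
from sklearn.datasets import load_wine
from sklearn.ensemble import VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.neural_network import MLPClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier

X, y = load_wine(return_X_y=True)

def scaled(model):
    # NN, k-NN, and regression benefit from standardized inputs.
    return make_pipeline(StandardScaler(), model)

hetero = VotingClassifier(
    estimators=[
        ("tree", DecisionTreeClassifier(random_state=0)),
        ("nn", scaled(MLPClassifier(max_iter=2000, random_state=0))),
        ("knn", scaled(KNeighborsClassifier())),
        ("reg", scaled(LogisticRegression(max_iter=1000))),
    ],
    voting="soft",  # average probabilities across the algorithms
)
print(cross_val_score(hetero, X, y, cv=5).mean())
```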
Direct Marketing Example: Considerations for I-Miner
Steps:

1. Join by record—all models applied to the same data in the same row order
2. Change the probability names
3. Average the probabilities
   – The "decision" is avg_prob > threshold
4. Decile the probability ranks
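A pandas sketch of these four steps (hypothetical: the column names and the helper average_model_scores are illustrative, not taken from the I-Miner stream):

```python
# Average per-model probabilities for records that arrive in the same row
# order, threshold the average, and assign decile ranks.
import pandas as pd

def average_model_scores(frames, threshold=0.5):
    # Step 1: join by record -- same data, same row order, so align by position.
    # Step 2: rename each model's probability column so they don't collide.
    scores = pd.concat(
        [f["prob"].rename(f"prob_model{i + 1}") for i, f in enumerate(frames)],
        axis=1,
    )
    # Step 3: average probabilities; the "decision" is avg_prob > threshold.
    scores["avg_prob"] = scores.mean(axis=1)
    scores["decision"] = (scores["avg_prob"] > threshold).astype(int)
    # Step 4: decile ranks of the averaged probability (0 = lowest, 9 = highest).
    scores["decile"] = pd.qcut(
        scores["avg_prob"].rank(method="first"), 10, labels=False
    )
    return scores
```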
From Abbott, D.W., "How to Improve Customer Acquisition Models with Ensembles", presented at Predictive Analytics World Conference, Washington, D.C., October 20, 2009.
Direct Marketing Example: Variable Inclusion in Model Ensembles
Twenty-five different variables were represented across the ten models

Only five were represented in seven or more models

Twelve were represented in just one or two models

[Chart: number of models sharing a variable (# Models) vs. number of variables (# Variables)]
Fraud Detection Example: Deployment Stream
Model scoring picks up scores from each model, combines them in an ensemble, and pushes the scores back to the database
Fraud Detection Example: Overall Model Score on Validation Data
Total Score (from validation population)

[Chart: normalized score for each of the 11 individual models, plus the ensemble, the average of all models, the average of the 5 best, the average of the 5 worst, and the best and worst models on testing data]

The "score" weights false alarms and sensitivity. Overall, the ensemble is clearly best, and much better than the best model on testing data.
From Abbott, D.W., and Tom Konchan, "Advanced Fraud Detection Techniques for Vendor Payments", Predictive Analytics Summit, San Diego, CA, February 24, 2011.
Are Ensembles Better?
Accuracy? Yes
Interpretability? No
Do ensembles contradict Occam's Razor?
– Principle: simpler models generalize better; avoid overfitting!
– Ensembles are more complex than single models (an RF may have hundreds of trees in the ensemble)
– Yet these more complex models perform better on held-out data
– But...are they really more complex?
Generalized Degrees of Freedom
Linear regression: a degree of freedom in the model is simply a parameter
– This does not extrapolate to non-linear methods
– The number of "parameters" in non-linear methods can produce more complexity or less

Enter...Generalized Degrees of Freedom (GDF)
– GDF (Ye 1998) "randomly perturbs (adds noise to) the output variable, re-runs the modeling procedure, and measures the changes to the estimates" (for the same number of parameters)
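A Monte Carlo sketch of Ye's procedure as quoted above (hypothetical: fit_predict is an illustrative callable that fits a model to (X, y) and returns its fitted values):

```python
# Estimate Generalized Degrees of Freedom by perturbing the target with
# noise, refitting, and measuring how much the fitted values move.
import numpy as np

def generalized_dof(fit_predict, X, y, n_perturb=50, sigma=0.1, seed=0):
    rng = np.random.default_rng(seed)
    base = fit_predict(X, y)               # fitted values on the clean target
    total = 0.0
    for _ in range(n_perturb):
        noise = rng.normal(0.0, sigma, size=len(y))
        refit = fit_predict(X, y + noise)  # re-run modeling on the noisy target
        # Sensitivity: covariance of fitted-value changes with the added noise.
        total += np.sum((refit - base) * noise) / sigma**2
    return total / n_perturb               # approximates sum_i d(yhat_i)/d(y_i)
```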
The Math of GDF
From Giovanni Seni and John Elder, Ensemble Methods in Data Mining: Improving Accuracy Through Combining Predictions, Morgan and Claypool Publishers, 2010 (ISBN: 978-1608452842)
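The book's derivation is not reproduced on the slide; the quantity it develops is standardly written (a textbook statement, not copied from the slide) as:

```latex
% GDF: total sensitivity of the fitted values \hat{y}_i to the responses y_i.
\mathrm{GDF} = \sum_{i=1}^{n} \frac{\partial \hat{y}_i}{\partial y_i}
```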
The Effect of GDF
From Elder, J.F. IV, "The Generalization Paradox of Ensembles", Journal of Computational and Graphical Statistics, Volume 12, Number 4, Pages 853–864
Why Ensembles Win
Performance, performance, performance

A single model sometimes provides insufficient accuracy:
– Neural networks become stuck in local minima
– Decision trees run out of data and are greedy—they can get fooled early
– Single algorithms keep pushing performance using the same ideas (basis function / algorithm), and are incapable of "thinking outside of their box"

Different algorithms, or algorithms built using resampled data, achieve the same level of accuracy but on different cases—they identify different ways to get the same level of accuracy
Conclusion
Ensembles can achieve significant model performance improvements
The key to good ensembles is diversity in sampling and variable selection
Can be applied to a single algorithm, or across multiple algorithms
Just do it!
References
Giovanni Seni and John Elder, Ensemble Methods in Data Mining: Improving Accuracy Through Combining Predictions, Morgan and Claypool Publishers, 2010 (ISBN: 978-1608452842)

Elder, J.F. IV, "The Generalization Paradox of Ensembles", Journal of Computational and Graphical Statistics, Volume 12, Number 4, Pages 853–864. DOI: 10.1198/1061860032733
Abbott, D.W., “The Benefits of Creating Ensembles of Classifiers”, Abbott Analytics, Inc., http://www.abbottanalytics.com/white-paper-classifiers.php
Abbott, D.W., “A Comparison of Algorithms at PAKDD2007”, Blog post at http://abbottanalytics.blogspot.com/2007/05/comparison-of-algorithms-at-pakdd2007.html
References
Tu, Zhuowen, “Ensemble Classification Methods: Bagging, Boosting, and Random Forests”, http://www.loni.ucla.edu/~ztu/courses/2010_CS_spring/cs269_2010_ensemble.pdf
Ye, J. (1998), “On Measuring and Correcting the Effects of Data Mining and Model Selection,” Journal of the American Statistical Association, 93, 120–131.