Copyright © 2000-2012, Abbott Analytics, Inc. All rights reserved.
Why Ensembles Win Data Mining Competitions
A Predictive Analytics Center of Excellence (PACE) Tech Talk
November 14, 2012
Dean Abbott, Abbott Analytics, Inc.
Blog: http://abbottanalytics.blogspot.com
URL: http://www.abbottanalytics.com
Twitter: @deanabb
Email: [email protected]
Outline
Motivation for Ensembles
How Ensembles are Built
Do Ensembles Violate Occam's Razor?
Why Do Ensembles Win?
PAKDD Cup 2007 Results: The Scoring Metric Changes the Winner
| Modeling Technique | Implementation | Location | Affiliation | AUC ROC (Trapezoidal Rule) | AUC Rank | Top Decile Response Rate | Top Decile Rank |
|---|---|---|---|---|---|---|---|
| TreeNet + Logistic Regression | Salford Systems | Mainland China | Practitioner | 70.01% | 1 | 13.00% | 7 |
| Probit Regression | SAS | USA | Practitioner | 69.99% | 2 | 13.13% | 6 |
| MLP + n-Tuple Classifier | | Brazil | Practitioner | 69.62% | 3 | 13.88% | 1 |
| TreeNet | Salford Systems | USA | Practitioner | 69.61% | 4 | 13.25% | 4 |
| TreeNet | Salford Systems | Mainland China | Practitioner | 69.42% | 5 | 13.50% | 2 |
| Ridge Regression | Rank | Belgium | Practitioner | 69.28% | 6 | 12.88% | 9 |
| 2-Layer Linear Regression | | USA | Practitioner | 69.14% | 7 | 12.88% | 9 |
| Logistic Regression + Decision Stump + AdaBoost + VFI | | Mainland China | Academia | 69.10% | 8 | 13.25% | 4 |
| Logistic Average of Single Decision Functions | | Australia | Practitioner | 68.85% | 9 | 12.13% | 17 |
| Logistic Regression | Weka | Singapore | Academia | 68.69% | 10 | 12.38% | 16 |
| Logistic Regression | | Mainland China | Practitioner | 68.58% | 11 | 12.88% | 9 |
| Decision Tree + Neural Network + Logistic Regression | | Singapore | | 68.54% | 12 | 13.00% | 7 |
| Scorecard Linear Additive Model | Xeno | USA | Practitioner | 68.28% | 13 | 11.75% | 20 |
| Random Forest | Weka | USA | | 68.04% | 14 | 12.50% | 14 |
| Expanding Regression Tree + RankBoost + Bagging | Weka | Mainland China | Academia | 68.02% | 15 | 12.50% | 14 |
| Logistic Regression | SAS + Salford Systems | India | Practitioner | 67.58% | 16 | 12.00% | 19 |
| J48 + BayesNet | Weka | Mainland China | Academia | 67.56% | 17 | 11.63% | 21 |
| Neural Network + General Additive Model | Tiberius | USA | Practitioner | 67.54% | 18 | 11.63% | 21 |
| Decision Tree + Neural Network | | Mainland China | Academia | 67.50% | 19 | 12.88% | 9 |
| Decision Tree + Neural Network + Logistic Regression | SAS | USA | Academia | 66.71% | 20 | 13.50% | 2 |
| Neural Network | SAS | USA | Academia | 66.36% | 21 | 12.13% | 17 |
| Decision Tree + Neural Network + Logistic Regression | SAS | USA | Academia | 65.95% | 22 | 11.63% | 21 |
| Neural Network | SAS | USA | Academia | 65.69% | 23 | 9.25% | 32 |
| Multi-dimension Balanced Random Forest | | Mainland China | Academia | 65.42% | 24 | 12.63% | 13 |
| Neural Network | SAS | USA | Academia | 65.28% | 25 | 11.00% | 26 |
| CHAID Decision Tree | SPSS | Argentina | Academia | 64.53% | 26 | 11.25% | 24 |
| Under-Sampling Based on Clustering + CART Decision Tree | | Taiwan | Academia | 64.45% | 27 | 11.13% | 25 |
| Decision Tree + Neural Network + Polynomial Regression | SAS | USA | Academia | 64.26% | 28 | 9.38% | 30 |
[Callout: a brace in the original slide marks the ensemble entries in the table]
Netflix Prize
2006: Netflix's state-of-the-art system (Cinematch) had RMSE = 0.9525
Prize: reduce this RMSE by 10% => 0.8572
2007: Korbell team, Progress Prize winner – a 107-algorithm ensemble
– Top algorithm: SVD with RMSE = 0.8914
– 2nd algorithm: Restricted Boltzmann Machine with RMSE = 0.8990
– A mini-ensemble (SVD + RBM) achieved RMSE = 0.88
http://techblog.netflix.com/2012/04/netflix-recommendations-beyond-5-stars.html
Common Kinds of Ensembles vs. Single Models
From Zhuowen Tu, “Ensemble Classification Methods: Bagging, Boosting, and Random Forests”
[Figure: a list of classification methods, with a brace grouping the ensemble methods apart from the single classifiers]
What are Model Ensembles?
Combining the outputs from multiple models into a single decision
Models can be created using the same algorithm or several different algorithms
[Diagram: predictions from several models feed decision logic, which outputs the ensemble prediction]
Creating Model Ensembles Step 1: Generate Component Models
Can vary data or model parameters:
– Case (record) weights — bootstrapping, sampling
– Data values — add noise, recode data
– Learning parameters — vary learning rates, pruning severity, random seeds
– Variable subsets — vary candidate inputs, features

[Diagram: a single data set gives rise to multiple models and predictions]
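To make Step 1 concrete, here is a minimal sketch (not from the original slides) of generating component models by bootstrapping rows and varying random seeds; it assumes numpy arrays X and y plus scikit-learn, and the helper name build_component_models is illustrative.

```python
# Hypothetical sketch of Step 1: vary the data (bootstrap the rows) and the
# learning parameters (random seeds) to get diverse component models.
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def build_component_models(X, y, n_models=10, seed=42):
    rng = np.random.default_rng(seed)
    models = []
    n = len(X)
    for i in range(n_models):
        idx = rng.integers(0, n, size=n)  # bootstrap: sample rows with replacement
        tree = DecisionTreeClassifier(random_state=i)  # vary the seed per model
        tree.fit(X[idx], y[idx])
        models.append(tree)
    return models
```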
Creating Model Ensembles Step 2: Combining Models
Combining Methods
– Estimation: average outputs
– Classification: average probabilities or vote (best M of N)

Variance Reduction
– Build complex, overfit models
– All models built in the same manner

Bias Reduction
– Build simple models
– Subsequent models weight records with errors more heavily (or model the actual errors)

[Diagram: multiple models and predictions pass through a combine step to yield a decision or prediction value]
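A matching sketch of Step 2 (again an illustration, not the slide's implementation): average the component models' probabilities for estimation-style combining, and threshold the average for a classification decision. It assumes the models come from the Step 1 sketch above.

```python
# Hypothetical sketch of Step 2: combine component models by averaging
# their predicted probabilities, then vote via a threshold.
import numpy as np

def ensemble_predict(models, X, threshold=0.5):
    # Stack each model's P(class = 1) into an (n_models, n_rows) array.
    probs = np.stack([m.predict_proba(X)[:, 1] for m in models])
    avg_prob = probs.mean(axis=0)                  # estimation: average outputs
    decision = (avg_prob > threshold).astype(int)  # classification: threshold/vote
    return avg_prob, decision
```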
How Model Complexity Affects Errors
Giovanni Seni and John Elder, Ensemble Methods in Data Mining: Improving Accuracy Through Combining Predictions, Morgan and Claypool Publishers, 2010 (ISBN: 978-1608452842)
Commonly Used Information-Theoretic Complexity Penalties
BIC: Bayesian Information Criterion
AIC: Akaike Information Criterion
MDL: Minimum Description Length
For a nice summary: http://en.wikipedia.org/wiki/Regularization_(mathematics)
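For reference (these are the standard textbook definitions, not reproduced from the slide), the first two penalties are usually written as:

```latex
% k = number of model parameters, n = number of observations,
% \hat{L} = maximized value of the likelihood function.
\mathrm{AIC} = 2k - 2\ln\hat{L}
\qquad
\mathrm{BIC} = k\ln n - 2\ln\hat{L}
```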
Four Keys to Effective Ensembling
Diversity of opinion
Independence
Decentralization
Aggregation
From The Wisdom of Crowds, James Surowiecki
Bagging
Bagging Method
– Create many data sets by bootstrapping (can also do this with cross-validation)
– Create one decision tree for each data set
– Combine decision trees by averaging (or voting) final decisions
– Primarily reduces model variance rather than bias

Results
– On average, better than any individual tree

[Diagram: bootstrapped trees averaged into a final answer]
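A minimal bagging sketch in scikit-learn (an illustration consistent with the slide, not the author's own code); the dataset is a stand-in, and in scikit-learn versions before 1.2 the estimator argument is named base_estimator.

```python
# Bagging: many bootstrapped data sets, one complex (high-variance) tree
# per data set, predictions combined by voting/averaging.
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import BaggingClassifier
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)

bagged = BaggingClassifier(
    estimator=DecisionTreeClassifier(),  # deliberately unpruned, overfit base model
    n_estimators=100,
    bootstrap=True,   # each tree sees a bootstrap sample of the rows
    random_state=0,
)
print(cross_val_score(bagged, X, y, cv=5).mean())
```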
Boosting (AdaBoost)
Boosting Method
– Create a tree using the training data set
– Score each data point, indicating where each incorrect decision (error) is made
– Retrain, giving rows with incorrect decisions more weight; repeat
– The final prediction is a weighted average of all models -> model regularization
– Best to create "weak" models—simple models (just a few splits for a decision tree)—and let the boosting iterations find the complexity
– Often used with trees or Naïve Bayes

Results
– Usually better than an individual tree or Bagging

[Diagram: examples with incorrect classifications are reweighted; models are combined via a weighted sum]
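A minimal AdaBoost sketch (illustrative, with a stand-in dataset; the estimator argument is named base_estimator in scikit-learn versions before 1.2). One-split decision stumps play the role of the "weak" models described above.

```python
# AdaBoost: weak one-split stumps, rows reweighted after each round,
# models combined via a weighted sum.
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import AdaBoostClassifier
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)

boosted = AdaBoostClassifier(
    estimator=DecisionTreeClassifier(max_depth=1),  # "weak" model: a stump
    n_estimators=200,
    random_state=0,
)
print(cross_val_score(boosted, X, y, cv=5).mean())
```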
Random Forest Ensembles
Random Forest (RF) Method
– Exactly the same methodology as Bagging, but with a twist
– At each split, rather than using the entire set of candidate inputs, use a random subset of candidate inputs
– Generates diversity of samples and inputs (splits)

Results
– On average, better than any individual tree, Bagging, or even Boosting

[Diagram: bootstrapped, input-randomized trees averaged into a final answer]
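A minimal random forest sketch (illustrative; the dataset is a stand-in):

```python
# Random forest: bagging plus the twist that each split considers only a
# random subset of the candidate inputs (sqrt of the feature count here).
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

X, y = load_breast_cancer(return_X_y=True)

forest = RandomForestClassifier(
    n_estimators=500,
    max_features="sqrt",  # random input subset at each split
    bootstrap=True,       # random sample of rows per tree
    random_state=0,
)
print(cross_val_score(forest, X, y, cv=5).mean())
```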
Stochastic Gradient Boosting
Implemented in MART (Jerry Friedman) and TreeNet (Salford Systems)

Algorithm
– Begin with a simple model—a constant value
– Build a simple tree (perhaps 6 terminal nodes)—now there are 6 possible levels, whereas before there was one
– Score the model and compute the errors; the score is the sum of all previous trees, weighted by a "learning rate"
– Build a new tree with the errors as the target variable, and repeat

Results
– TreeNet has won 2 KDD Cup competitions and numerous others
– It is less prone to outliers and overfitting than AdaBoost

[Diagram: each new tree predicts the errors of the ensemble so far; the trees are combined via a weighted sum into an additive final model]
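Since MART and TreeNet are commercial products, here is a sketch of the same algorithm using scikit-learn's open-source implementation (a stand-in, not TreeNet itself):

```python
# Stochastic gradient boosting: start from a constant model, repeatedly fit
# a small tree to the current errors, and shrink each tree's contribution
# by a learning rate; subsample < 1.0 supplies the "stochastic" part.
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import cross_val_score

X, y = load_breast_cancer(return_X_y=True)

gbm = GradientBoostingClassifier(
    max_leaf_nodes=6,   # "perhaps 6 terminal nodes" per tree
    learning_rate=0.1,  # weight on each new tree
    subsample=0.5,      # fit each tree on a random half of the rows
    n_estimators=300,
    random_state=0,
)
print(cross_val_score(gbm, X, y, cv=5).mean())
```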
Ensembles of Trees: Smoothers
Ensembles smooth jagged decision boundaries
Pictures from T.G. Dietterich, "Ensemble Methods in Machine Learning", in Multiple Classifier Systems, Cagliari, Italy, 2000.
Heterogeneous Model Ensembles on Glass Data
Model prediction diversity obtained by using different algorithms: tree, NN, RBF, Gaussian, regression, k-NN

Combining 3-5 models is, on average, better than the best single model

Combining all 6 models is not best (the best is a 3- or 4-model combination), but it is close

This is an example of reducing model variance through ensembles, but not model bias
[Chart: percent classification error (0% to 40%) vs. number of models combined (1 to 6), with max, min, and average error curves]
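A sketch of a heterogeneous ensemble along these lines (illustrative: the Glass data is not bundled with scikit-learn, so a stand-in dataset is used, and the algorithm mix is a subset of the six on the slide):

```python
# Heterogeneous ensemble: different algorithms supply the diversity;
# soft voting averages their class probabilities.
from sklearn.datasets import load_wine
from sklearn.ensemble import VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.neural_network import MLPClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier

X, y = load_wine(return_X_y=True)

def scaled(model):
    # NN, k-NN, and regression benefit from standardized inputs.
    return make_pipeline(StandardScaler(), model)

hetero = VotingClassifier(
    estimators=[
        ("tree", DecisionTreeClassifier(random_state=0)),
        ("nn", scaled(MLPClassifier(max_iter=2000, random_state=0))),
        ("knn", scaled(KNeighborsClassifier())),
        ("reg", scaled(LogisticRegression(max_iter=1000))),
    ],
    voting="soft",  # average probabilities across the algorithms
)
print(cross_val_score(hetero, X, y, cv=5).mean())
```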
Direct Marketing Example: Considerations for I-Miner
Steps:

1. Join by record—all models applied to the same data in the same row order
2. Change the probability names
3. Average the probabilities
   – The "decision" is avg_prob > threshold
4. Decile the probability ranks
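A pandas sketch of these four steps (hypothetical: the column names and the helper average_model_scores are illustrative, not taken from the I-Miner stream):

```python
# Average per-model probabilities for records that arrive in the same row
# order, threshold the average, and assign decile ranks.
import pandas as pd

def average_model_scores(frames, threshold=0.5):
    # Step 1: join by record -- same data, same row order, so align by position.
    # Step 2: rename each model's probability column so they don't collide.
    scores = pd.concat(
        [f["prob"].rename(f"prob_model{i + 1}") for i, f in enumerate(frames)],
        axis=1,
    )
    # Step 3: average probabilities; the "decision" is avg_prob > threshold.
    scores["avg_prob"] = scores.mean(axis=1)
    scores["decision"] = (scores["avg_prob"] > threshold).astype(int)
    # Step 4: decile ranks of the averaged probability (0 = lowest, 9 = highest).
    scores["decile"] = pd.qcut(
        scores["avg_prob"].rank(method="first"), 10, labels=False
    )
    return scores
```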
From Abbott, D.W., "How to Improve Customer Acquisition Models with Ensembles", presented at Predictive Analytics World Conference, Washington, D.C., October 20, 2009.
Direct Marketing Example: Variable Inclusion in Model Ensembles
Twenty-five different variables were represented across the ten models

Only five were represented in seven or more models

Twelve were represented in just one or two models

[Chart: number of models sharing a variable (# Models) vs. number of variables (# Variables)]
Fraud Detection Example: Deployment Stream
Model scoring picks up scores from each model, combines them in an ensemble, and pushes the scores back to the database
Fraud Detection Example: Overall Model Score on Validation Data
Total Score (from validation population)

[Chart: normalized score for each of the 11 individual models, plus the ensemble, the average of all models, the average of the 5 best, the average of the 5 worst, and the best and worst models on testing data]

The "score" weights false alarms and sensitivity. Overall, the ensemble is clearly best, and much better than the best model on testing data.
From Abbott, D.W., and Tom Konchan, "Advanced Fraud Detection Techniques for Vendor Payments", Predictive Analytics Summit, San Diego, CA, February 24, 2011.
Are Ensembles Better?
Accuracy? Yes
Interpretability? No
Do ensembles contradict Occam's Razor?
– Principle: simpler models generalize better; avoid overfitting!
– Ensembles are more complex than single models (an RF may have hundreds of trees in the ensemble)
– Yet these more complex models perform better on held-out data
– But...are they really more complex?
Generalized Degrees of Freedom
Linear regression: a degree of freedom in the model is simply a parameter
– This does not extrapolate to non-linear methods
– The number of "parameters" in non-linear methods can produce more complexity or less

Enter...Generalized Degrees of Freedom (GDF)
– GDF (Ye 1998) "randomly perturbs (adds noise to) the output variable, re-runs the modeling procedure, and measures the changes to the estimates" (for the same number of parameters)
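A Monte Carlo sketch of Ye's procedure as quoted above (hypothetical: fit_predict is an illustrative callable that fits a model to (X, y) and returns its fitted values):

```python
# Estimate Generalized Degrees of Freedom by perturbing the target with
# noise, refitting, and measuring how much the fitted values move.
import numpy as np

def generalized_dof(fit_predict, X, y, n_perturb=50, sigma=0.1, seed=0):
    rng = np.random.default_rng(seed)
    base = fit_predict(X, y)               # fitted values on the clean target
    total = 0.0
    for _ in range(n_perturb):
        noise = rng.normal(0.0, sigma, size=len(y))
        refit = fit_predict(X, y + noise)  # re-run modeling on the noisy target
        # Sensitivity: covariance of fitted-value changes with the added noise.
        total += np.sum((refit - base) * noise) / sigma**2
    return total / n_perturb               # approximates sum_i d(yhat_i)/d(y_i)
```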
The Math of GDF
From Giovanni Seni and John Elder, Ensemble Methods in Data Mining: Improving Accuracy Through Combining Predictions, Morgan and Claypool Publishers, 2010 (ISBN: 978-1608452842)
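The book's derivation is not reproduced on the slide; the quantity it develops is standardly written (a textbook statement, not copied from the slide) as:

```latex
% GDF: total sensitivity of the fitted values \hat{y}_i to the responses y_i.
\mathrm{GDF} = \sum_{i=1}^{n} \frac{\partial \hat{y}_i}{\partial y_i}
```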
The Effect of GDF
From Elder, J.F. IV, "The Generalization Paradox of Ensembles", Journal of Computational and Graphical Statistics, Volume 12, Number 4, Pages 853–864
Why Ensembles Win
Performance, performance, performance

A single model sometimes provides insufficient accuracy:
– Neural networks become stuck in local minima
– Decision trees run out of data and are greedy—they can get fooled early
– Single algorithms keep pushing performance using the same ideas (basis function / algorithm), and are incapable of "thinking outside of their box"

Different algorithms, or algorithms built using resampled data, achieve the same level of accuracy but on different cases—they identify different ways to get the same level of accuracy
Conclusion
Ensembles can achieve significant model performance improvements
The key to good ensembles is diversity in sampling and variable selection
Can be applied to a single algorithm, or across multiple algorithms
Just do it!
References
Giovanni Seni and John Elder, Ensemble Methods in Data Mining: Improving Accuracy Through Combining Predictions, Morgan and Claypool Publishers, 2010 (ISBN: 978-1608452842)

Elder, J.F. IV, "The Generalization Paradox of Ensembles", Journal of Computational and Graphical Statistics, Volume 12, Number 4, Pages 853–864. DOI: 10.1198/1061860032733
Abbott, D.W., “The Benefits of Creating Ensembles of Classifiers”, Abbott Analytics, Inc., http://www.abbottanalytics.com/white-paper-classifiers.php
Abbott, D.W., “A Comparison of Algorithms at PAKDD2007”, Blog post at http://abbottanalytics.blogspot.com/2007/05/comparison-of-algorithms-at-pakdd2007.html
References
Tu, Zhuowen, “Ensemble Classification Methods: Bagging, Boosting, and Random Forests”, http://www.loni.ucla.edu/~ztu/courses/2010_CS_spring/cs269_2010_ensemble.pdf
Ye, J. (1998), “On Measuring and Correcting the Effects of Data Mining and Model Selection,” Journal of the American Statistical Association, 93, 120–131.