
PACE Tech Talk 14-Nov-12 - Why Model Ensembles Win Data Mining Competitions


Page 1: PACE Tech Talk 14-Nov-12 - Why Model Ensembles Win Data Mining Competitions

Copyright © 2000-2012, Abbott Analytics, Inc. All rights reserved.

Why Ensembles Win Data Mining Competitions

A Predictive Analytics Center of Excellence (PACE) Tech Talk November 14, 2012 Dean Abbott Abbott Analytics, Inc. Blog: http://abbottanalytics.blogspot.com URL: http://www.abbottanalytics.com Twitter: @deanabb Email: [email protected]


Page 2: PACE Tech Talk 14-Nov-12 - Why Model Ensembles Win Data Mining Competitions


Outline

  Motivation for Ensembles
  How Ensembles are Built
  Do Ensembles Violate Occam’s Razor?
  Why Do Ensembles Win?

Page 3: PACE Tech Talk 14-Nov-12 - Why Model Ensembles Win Data Mining Competitions


PAKDD Cup 2007 Results: Score Metric Changes Winner

| Modeling Technique | Modeling Implementation | Participant Affiliation Location | Participant Affiliation Type | AUC-ROC (Trapezoidal Rule) | AUC-ROC Rank | Top Decile Response Rate | Top Decile Response Rate Rank |
|---|---|---|---|---|---|---|---|
| TreeNet + Logistic Regression | Salford Systems | Mainland China | Practitioner | 70.01% | 1 | 13.00% | 7 |
| Probit Regression | SAS | USA | Practitioner | 69.99% | 2 | 13.13% | 6 |
| MLP + n-Tuple Classifier |  | Brazil | Practitioner | 69.62% | 3 | 13.88% | 1 |
| TreeNet | Salford Systems | USA | Practitioner | 69.61% | 4 | 13.25% | 4 |
| TreeNet | Salford Systems | Mainland China | Practitioner | 69.42% | 5 | 13.50% | 2 |
| Ridge Regression | Rank | Belgium | Practitioner | 69.28% | 6 | 12.88% | 9 |
| 2-Layer Linear Regression |  | USA | Practitioner | 69.14% | 7 | 12.88% | 9 |
| Logistic Regression + Decision Stump + AdaBoost + VFI |  | Mainland China | Academia | 69.10% | 8 | 13.25% | 4 |
| Logistic Average of Single Decision Functions |  | Australia | Practitioner | 68.85% | 9 | 12.13% | 17 |
| Logistic Regression | Weka | Singapore | Academia | 68.69% | 10 | 12.38% | 16 |
| Logistic Regression |  | Mainland China | Practitioner | 68.58% | 11 | 12.88% | 9 |
| Decision Tree + Neural Network + Logistic Regression |  | Singapore |  | 68.54% | 12 | 13.00% | 7 |
| Scorecard Linear Additive Model | Xeno | USA | Practitioner | 68.28% | 13 | 11.75% | 20 |
| Random Forest | Weka | USA |  | 68.04% | 14 | 12.50% | 14 |
| Expanding Regression Tree + RankBoost + Bagging | Weka | Mainland China | Academia | 68.02% | 15 | 12.50% | 14 |
| Logistic Regression | SAS + Salford Systems | India | Practitioner | 67.58% | 16 | 12.00% | 19 |
| J48 + BayesNet | Weka | Mainland China | Academia | 67.56% | 17 | 11.63% | 21 |
| Neural Network + General Additive Model | Tiberius | USA | Practitioner | 67.54% | 18 | 11.63% | 21 |
| Decision Tree + Neural Network |  | Mainland China | Academia | 67.50% | 19 | 12.88% | 9 |
| Decision Tree + Neural Network + Logistic Regression | SAS | USA | Academia | 66.71% | 20 | 13.50% | 2 |
| Neural Network | SAS | USA | Academia | 66.36% | 21 | 12.13% | 17 |
| Decision Tree + Neural Network + Logistic Regression | SAS | USA | Academia | 65.95% | 22 | 11.63% | 21 |
| Neural Network | SAS | USA | Academia | 65.69% | 23 | 9.25% | 32 |
| Multi-dimension Balanced Random Forest |  | Mainland China | Academia | 65.42% | 24 | 12.63% | 13 |
| Neural Network | SAS | USA | Academia | 65.28% | 25 | 11.00% | 26 |
| CHAID Decision Tree | SPSS | Argentina | Academia | 64.53% | 26 | 11.25% | 24 |
| Under-Sampling Based on Clustering + CART Decision Tree |  | Taiwan | Academia | 64.45% | 27 | 11.13% | 25 |
| Decision Tree + Neural Network + Polynomial Regression | SAS | USA | Academia | 64.26% | 28 | 9.38% | 30 |


(The original slide highlights the ensemble entries in this table.)

Page 4: PACE Tech Talk 14-Nov-12 - Why Model Ensembles Win Data Mining Competitions


Netflix Prize

  2006 Netflix State-of-the-art (Cinematch) RMSE = 0.9525

  Prize: reduce this RMSE by 10% => 0.8572

  2007: Korbell team Progress Prize winner –  107-algorithm ensemble

–  Top algorithm: SVD with RMSE = 0.8914

–  2nd algorithm: Restricted Boltzmann Machine with RMSE = 0.8990

–  Mini-ensemble (SVD+RBM) has RMSE = 0.88

http://techblog.netflix.com/2012/04/netflix-recommendations-beyond-5-stars.html


Page 5: PACE Tech Talk 14-Nov-12 - Why Model Ensembles Win Data Mining Competitions


Common Kinds of Ensembles vs. Single Models


From Zhuowen Tu, “Ensemble Classification Methods: Bagging, Boosting, and Random Forests”

[Figure from Tu: common ensemble methods (bagging, boosting, random forests) contrasted with single classifiers]

Page 6: PACE Tech Talk 14-Nov-12 - Why Model Ensembles Win Data Mining Competitions


What are Model Ensembles?

 Combining outputs from multiple models into a single decision

 Models can be created using the same algorithm, or several different algorithms

[Diagram: outputs from multiple models → decision logic → ensemble prediction]

Page 7: PACE Tech Talk 14-Nov-12 - Why Model Ensembles Win Data Mining Competitions


Creating Model Ensembles Step 1: Generate Component Models

 Case (Record) Weights — bootstrapping, sampling

 Data Values — add noise, recode data

 Learning Parameters — vary learning rates, pruning severity, random seeds

 Variable Subsets — vary candidate inputs, features

Can Vary Data or Model Parameters:

[Diagram: a single data set, perturbed or re-parameterized, yields multiple models and predictions; a code sketch of this step follows]
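A minimal sketch of Step 1 under stated assumptions: X and y are NumPy arrays, scikit-learn is available, and the decision-tree learner is only an illustrative choice, not the talk's. Diversity comes from bootstrapping the records and varying the random seed.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def build_component_models(X, y, n_models=10, seed=42):
    """Generate diverse component models from a single data set by
    bootstrap-resampling the rows and varying each learner's random seed."""
    rng = np.random.default_rng(seed)
    models = []
    n = len(X)
    for i in range(n_models):
        idx = rng.integers(0, n, size=n)                 # bootstrap sample (with replacement)
        model = DecisionTreeClassifier(random_state=i)   # vary the learning "seed" as well
        model.fit(X[idx], y[idx])
        models.append(model)
    return models
```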

Page 8: PACE Tech Talk 14-Nov-12 - Why Model Ensembles Win Data Mining Competitions


Creating Model Ensembles Step 2: Combining Models

 Combining Methods
–  Estimation: Average outputs
–  Classification: Average probabilities or vote (best M of N)

 Variance Reduction
–  Build complex, overfit models
–  All models built in same manner

 Bias Reduction
–  Build simple models
–  Subsequent models weight records with errors more (or model actual errors)

[Diagram: multiple models and predictions → combine → decision or prediction value]
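A minimal sketch of the two combining methods, assuming a list of already-fitted binary classifiers with scikit-learn-style predict_proba/predict methods; the function names are mine, not from the talk.

```python
import numpy as np

def combine_average(models, X):
    """Estimation-style combine: average the class-1 probability across models."""
    probs = np.column_stack([m.predict_proba(X)[:, 1] for m in models])
    return probs.mean(axis=1)

def combine_vote(models, X, majority_frac=0.5):
    """Classification-style combine: each model casts a 0/1 vote; the ensemble
    predicts 1 when at least majority_frac of the models vote 1."""
    votes = np.column_stack([m.predict(X) for m in models])
    return (votes.mean(axis=1) >= majority_frac).astype(int)
```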

Page 9: PACE Tech Talk 14-Nov-12 - Why Model Ensembles Win Data Mining Competitions


How Model Complexity Affects Errors

Giovanni Seni and John Elder, Ensemble Methods in Data Mining: Improving Accuracy Through Combining Predictions, Morgan and Claypool Publishers, 2010 (ISBN: 978-1608452842)


Page 10: PACE Tech Talk 14-Nov-12 - Why Model Ensembles Win Data Mining Competitions


Commonly Used Information-Theoretic Complexity Penalties


BIC: Bayesian Information Criterion
AIC: Akaike Information Criterion
MDL: Minimum Description Length

For a nice summary: http://en.wikipedia.org/wiki/Regularization_(mathematics)
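For reference, the two most commonly quoted penalties are (standard definitions, not taken from the slides), with k the number of fitted parameters, n the number of training records, and the maximized likelihood written as L-hat; MDL leads to a criterion closely related to BIC:

```latex
\mathrm{AIC} = 2k - 2\ln\hat{L},
\qquad
\mathrm{BIC} = k\ln n - 2\ln\hat{L}
```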

Page 11: PACE Tech Talk 14-Nov-12 - Why Model Ensembles Win Data Mining Competitions


Four Keys to Effective Ensembling

  Diversity of opinion
  Independence
  Decentralization
  Aggregation

  From The Wisdom of Crowds, James Surowiecki


Page 12: PACE Tech Talk 14-Nov-12 - Why Model Ensembles Win Data Mining Competitions


Bagging

  Bagging Method
–  Create many data sets by bootstrapping (can also do this with cross validation)
–  Create one decision tree for each data set
–  Combine decision trees by averaging (or voting) final decisions
–  Primarily reduces model variance rather than bias

  Results
–  On average, better than any individual tree

[Diagram: trees built on bootstrap samples → final answer (average)]
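A minimal sketch of this recipe using scikit-learn's BaggingClassifier, whose default base learner is a decision tree; X_train, y_train, and X_test are hypothetical arrays, not data from the talk.

```python
from sklearn.ensemble import BaggingClassifier

# 100 bootstrap samples of the rows, one (deliberately high-variance) decision tree
# per sample; the ensemble averages class probabilities across the trees.
bagger = BaggingClassifier(n_estimators=100, bootstrap=True, random_state=0)
bagger.fit(X_train, y_train)
p_bagged = bagger.predict_proba(X_test)[:, 1]
```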

Page 13: PACE Tech Talk 14-Nov-12 - Why Model Ensembles Win Data Mining Competitions


Boosting (Adaboost)

 Boosting Method
–  Create a tree using the training data set
–  Score each data point, noting where each incorrect decision is made (errors)
–  Retrain, giving rows with incorrect decisions more weight. Repeat.
–  The final prediction is a weighted average of all models -> model regularization
–  Best to create “weak” models (simple models, e.g., just a few splits for a decision tree) and let the boosting iterations find the complexity
–  Often used with trees or Naïve Bayes

 Results
–  Usually better than an individual tree or Bagging

[Diagram: reweight examples where the classification is incorrect; combine models via a weighted sum]
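A minimal from-scratch sketch of the discrete AdaBoost weight update, assuming NumPy and scikit-learn are available and that y is a NumPy array coded as -1/+1; this is a teaching sketch, not the talk's implementation.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def adaboost(X, y, n_rounds=50):
    """Discrete AdaBoost with decision stumps; y must be coded as -1/+1."""
    n = len(X)
    w = np.full(n, 1.0 / n)                           # start with equal record weights
    stumps, alphas = [], []
    for _ in range(n_rounds):
        stump = DecisionTreeClassifier(max_depth=1)   # a deliberately "weak" model
        stump.fit(X, y, sample_weight=w)
        pred = stump.predict(X)
        err = w[pred != y].sum() / w.sum()
        if err == 0 or err >= 0.5:                    # stop if perfect or no better than chance
            break
        alpha = 0.5 * np.log((1 - err) / err)         # this model's weight in the final vote
        w *= np.exp(-alpha * y * pred)                # up-weight the misclassified records
        w /= w.sum()
        stumps.append(stump)
        alphas.append(alpha)
    return stumps, alphas

def adaboost_predict(stumps, alphas, X):
    # the final prediction is the sign of the alpha-weighted sum of stump votes
    scores = sum(a * s.predict(X) for a, s in zip(alphas, stumps))
    return np.sign(scores)
```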

Page 14: PACE Tech Talk 14-Nov-12 - Why Model Ensembles Win Data Mining Competitions


Random Forest Ensembles

 Random Forest (RF) Method
–  Exact same methodology as Bagging, but with a twist
–  At each split, rather than using the entire set of candidate inputs, use a random subset of candidate inputs
–  Generates diversity of samples and inputs (splits)

 Results
–  On average, better than any individual tree, Bagging, or even Boosting

[Diagram: trees built on bootstrap samples with random input subsets → final answer (average)]
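A minimal sketch with scikit-learn's RandomForestClassifier; the parameter values are illustrative, and X_train, y_train, X_test are hypothetical.

```python
from sklearn.ensemble import RandomForestClassifier

# Bagging plus the "twist": at every split, each tree considers only a random
# subset of the candidate inputs (here, the square root of the feature count).
rf = RandomForestClassifier(
    n_estimators=500,
    max_features="sqrt",   # size of the random input subset tried at each split
    bootstrap=True,        # bootstrap the rows, as in bagging
    random_state=0,
)
rf.fit(X_train, y_train)
p_rf = rf.predict_proba(X_test)[:, 1]
```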

Page 15: PACE Tech Talk 14-Nov-12 - Why Model Ensembles Win Data Mining Competitions


Stochastic Gradient Boosting

  Implemented in MART (Jerry Friedman) and TreeNet (Salford Systems)

 Algorithm
–  Begin with a simple model: a constant value
–  Build a simple tree (perhaps 6 terminal nodes); now there are 6 possible levels, whereas before there was one
–  Score the model and compute errors. The score is the sum of all previous trees, weighted by a “learning rate”
–  Build a new tree with the errors as the target variable, and repeat

 Results
–  TreeNet has won 2 KDD-Cup competitions and numerous others
–  It is less prone to outliers and overfitting than Adaboost

[Diagram: build a tree to predict the errors of the ensemble so far; combine models via a weighted sum → final answer (additive model)]
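A minimal least-squares sketch of this loop (my own simplification; MART/TreeNet add further refinements such as robust loss functions), assuming NumPy arrays and scikit-learn.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def gradient_boost(X, y, n_trees=200, learning_rate=0.1, subsample=0.5, seed=0):
    """Stochastic gradient boosting for regression: fit each small tree to the
    current residuals on a random subsample, and add it in with a learning rate."""
    rng = np.random.default_rng(seed)
    f0 = y.mean()                                       # start from a constant model
    pred = np.full(len(y), f0)
    trees = []
    for _ in range(n_trees):
        residuals = y - pred                            # errors of the ensemble so far
        idx = rng.choice(len(y), size=int(subsample * len(y)), replace=False)
        tree = DecisionTreeRegressor(max_leaf_nodes=6)  # a simple tree (~6 terminal nodes)
        tree.fit(X[idx], residuals[idx])
        pred += learning_rate * tree.predict(X)         # shrink each tree's contribution
        trees.append(tree)
    return f0, trees

def gb_predict(f0, trees, X, learning_rate=0.1):
    # the final model is additive: constant plus the shrunken sum of all trees
    return f0 + learning_rate * sum(t.predict(X) for t in trees)
```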

Page 16: PACE Tech Talk 14-Nov-12 - Why Model Ensembles Win Data Mining Competitions


Ensembles of Trees: Smoothers

 Ensembles smooth jagged decision boundaries

Pictures from T.G. Dietterich. Ensemble methods in machine learning. In Multiple Classifier Systems, Cagliari, Italy, 2000.

Page 17: PACE Tech Talk 14-Nov-12 - Why Model Ensembles Win Data Mining Competitions


Heterogeneous Model Ensembles on Glass Data

  Model prediction diversity obtained by using different algorithms: tree, NN, RBF, Gaussian, regression, k-NN (see the sketch after the chart below)

  Combining 3-5 models is, on average, better than the best single model

  Combining all 6 models is not the best (the best is a 3- or 4-model combination), but it is close

  This is an example of reducing model variance through ensembles, but not model bias

[Chart: percent classification error (0%-40%) vs. number of models combined (1-6), with max, min, and average error curves]
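A minimal sketch of a heterogeneous ensemble in scikit-learn; the four algorithms are illustrative stand-ins for the six used in the talk, and X_train, y_train, X_test are hypothetical.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier
from sklearn.neural_network import MLPClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier

# Diversity comes from different algorithm families rather than from resampling.
models = [
    DecisionTreeClassifier(random_state=0),
    MLPClassifier(max_iter=1000, random_state=0),
    LogisticRegression(max_iter=1000),
    KNeighborsClassifier(n_neighbors=5),
]
for m in models:
    m.fit(X_train, y_train)

# Average the class-1 probabilities across the heterogeneous models.
p_heterogeneous = np.mean([m.predict_proba(X_test)[:, 1] for m in models], axis=0)
```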

Page 18: PACE Tech Talk 14-Nov-12 - Why Model Ensembles Win Data Mining Competitions


Direct Marketing Example: Considerations for I-Miner

Steps:

1.  Join by record—all models applied to same data in same row order

2.  Change probability names

3.  Average probabilities

1.  “Decision” is avg_prob > threshold

4.  Decile Probability Ranks

From Abbott, D.W., "How to Improve Customer Acquisition Models with Ensembles", presented at Predictive Analytics World Conference, Washington, D.C., October 20, 2009.
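A minimal pandas sketch of steps 1-4 under stated assumptions: the file name, column names, and 0.5 threshold are hypothetical, and decile 10 is taken to be the highest-probability group.

```python
import pandas as pd

# Hypothetical scored file: one row per customer, one probability column per model,
# already joined by record so every model scores the same rows in the same order.
scores = pd.read_csv("scores.csv")                    # e.g. columns prob_model_1 ... prob_model_10
prob_cols = [c for c in scores.columns if c.startswith("prob_")]

scores["avg_prob"] = scores[prob_cols].mean(axis=1)           # step 3: average probabilities
scores["decision"] = (scores["avg_prob"] > 0.5).astype(int)   # "decision" is avg_prob > threshold
scores["decile"] = pd.qcut(scores["avg_prob"].rank(method="first"),
                           10, labels=False) + 1              # step 4: decile probability ranks
```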

Page 19: PACE Tech Talk 14-Nov-12 - Why Model Ensembles Win Data Mining Competitions


Direct Marketing Example: Variable Inclusion in Model Ensembles

  Twenty-Five different variables represented in the ten models

  Only five were represented in seven or more models

  Twelve were represented in one or two models

[Chart: # Variables by # Models: the number of models in which each variable appears]


From Abbott, D.W., "How to Improve Customer Acquisition Models with Ensembles", presented at Predictive Analytics World Conference, Washington, D.C., October 20, 2009.

Page 20: PACE Tech Talk 14-Nov-12 - Why Model Ensembles Win Data Mining Competitions


Fraud Detection Example: Deployment Stream

Model scoring picks up scores from each model, combines them in an ensemble, and pushes the scores back to the database.


Page 21: PACE Tech Talk 14-Nov-12 - Why Model Ensembles Win Data Mining Competitions


Fraud Detection Example: Overall Model Score on Validation Data

Total Score (from validation population)

[Chart: normalized score for each of the 11 individual models, the ensemble, the average of the 5 best models, the average of the 5 worst models, the best model on testing data, and the worst model on testing data]

The “score” weights false alarms and sensitivity. Overall, the ensemble is clearly best, and much better than the best model on testing data.

From Abbott, D, and Tom Konchan, “Advanced Fraud Detection Techniques for Vendor Payments”, Predictive Analytics Summit, San Diego, CA, February 24, 2011.

Page 22: PACE Tech Talk 14-Nov-12 - Why Model Ensembles Win Data Mining Competitions


Are Ensembles Better?

  Accuracy? Yes
  Interpretability? No
  Do Ensembles contradict Occam’s Razor?
–  Principle: simpler models generalize better; avoid overfit!
–  They are more complex than single models (an RF may have hundreds of trees in the ensemble)
–  Yet these more complex models perform better on held-out data
–  But…are they really more complex?

Page 23: PACE Tech Talk 14-Nov-12 - Why Model Ensembles Win Data Mining Competitions


Generalized Degrees of Freedom

  Linear Regression: a degree of freedom in the model is simply a parameter
–  Does not extrapolate to non-linear methods
–  The number of “parameters” in non-linear methods can produce more complexity, or less

  Enter…Generalized Degrees of Freedom (GDF)
–  GDF (Ye 1998) “randomly perturbs (adds noise to) the output variable, re-runs the modeling procedure, and measures the changes to the estimates” (for the same number of parameters)


Page 24: PACE Tech Talk 14-Nov-12 - Why Model Ensembles Win Data Mining Competitions


The Math of GDF


From Giovanni Seni and John Elder, Ensemble Methods in Data Mining: Improving Accuracy Through Combining Predictions, Morgan and Claypool Publishers, 2010 (ISBN: 978-1608452842)
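A sketch of the quantity being estimated, following Ye (1998); the notation here is mine, not copied from the slide. For a fitting procedure that turns targets y into fitted values y-hat, GDF sums the sensitivity of each fitted value to its own target, estimated by Monte Carlo perturbation:

```latex
\mathrm{GDF} \;=\; \sum_{i=1}^{n} \frac{\partial\, \mathrm{E}\!\left[\hat{y}_i\right]}{\partial\, y_i}
\;\approx\; \frac{1}{\tau^{2}} \sum_{i=1}^{n}
\operatorname{cov}_{\delta}\!\left(\hat{y}_i(\mathbf{y}+\boldsymbol{\delta}),\, \delta_i\right),
\qquad \delta_i \sim N\!\left(0,\tau^{2}\right)
```

For linear regression this recovers the usual parameter count, while flexible procedures such as single overfit trees can show a GDF far larger than their nominal parameter count; this is the sense in which an ensemble can turn out to be less complex than it looks.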

Page 25: PACE Tech Talk 14-Nov-12 - Why Model Ensembles Win Data Mining Competitions


The Effect of GDF


From Elder, J.F.E IV, “The Generalization Paradox of Ensembles”, Journal of Computational and Graphical Statistics, Volume 12, Number 4, Pages 853–864

Page 26: PACE Tech Talk 14-Nov-12 - Why Model Ensembles Win Data Mining Competitions


Why Ensembles Win

  Performance, performance, performance
  A single model sometimes provides insufficient accuracy
–  Neural networks become stuck in local minima
–  Decision trees run out of data and are greedy (they can get fooled early)
–  Single algorithms keep pushing performance using the same ideas (basis function / algorithm), and are incapable of “thinking outside of their box”
  Different algorithms, or algorithms built using resampled data, achieve the same level of accuracy but on different cases; they identify different ways to get the same level of accuracy

Page 27: PACE Tech Talk 14-Nov-12 - Why Model Ensembles Win Data Mining Competitions


Conclusion

  Ensembles can achieve significant model performance improvements

  The key to good ensembles is diversity in sampling and variable selection

  Can be applied to a single algorithm, or across multiple algorithms

  Just do it!

Page 28: PACE Tech Talk 14-Nov-12 - Why Model Ensembles Win Data Mining Competitions


References

  Giovanni Seni and John Elder, Ensemble Methods in Data Mining: Improving Accuracy Through Combining Predictions, Morgan and Claypool Publishers, 2010 (ISBN: 978-1608452842)

  Elder, J.F. IV, “The Generalization Paradox of Ensembles”, Journal of Computational and Graphical Statistics, Volume 12, Number 4, Pages 853-864, DOI: 10.1198/1061860032733

  Abbott, D.W., “The Benefits of Creating Ensembles of Classifiers”, Abbott Analytics, Inc., http://www.abbottanalytics.com/white-paper-classifiers.php

  Abbott, D.W., “A Comparison of Algorithms at PAKDD2007”, Blog post at http://abbottanalytics.blogspot.com/2007/05/comparison-of-algorithms-at-pakdd2007.html

Page 29: PACE Tech Talk 14-Nov-12 - Why Model Ensembles Win Data Mining Competitions


References

  Tu, Zhuowen, “Ensemble Classification Methods: Bagging, Boosting, and Random Forests”, http://www.loni.ucla.edu/~ztu/courses/2010_CS_spring/cs269_2010_ensemble.pdf

  Ye, J. (1998), “On Measuring and Correcting the Effects of Data Mining and Model Selection,” Journal of the American Statistical Association, 93, 120–131.