Model Stacking for "Online" Data
Jeremy Coyle, Sam Lendle, Sara Moore
May 7, 2013


Page 1

Model Stacking for “Online” Data

Jeremy Coyle, Sam Lendle, Sara Moore

May 7, 2013

Page 2

Background: Model Stacking

- Definition: An ensemble method in which many algorithms, possibly from different types of models, are combined to form a final estimator.

- Motivation: A wide variety of algorithms can be used to predict an outcome, but for any new prediction problem it is not clear in advance which one will best capture the true relationship between features and outcome.

- Method: The algorithms are trained together, then combined via a weighted average using hold-out data (or cross-validation).

- Implementation: Super Learner combines other algorithms by minimizing a suitable loss function of a linear combination of their predictions under cross-validation (see the weighting sketch after this list).

- van der Laan et al. (2003, 2006) proved that the combined approach performs asymptotically as well as or better than the "best" algorithm in the library of algorithms.
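To make the weighting step concrete, here is a minimal sketch of how non-negative least squares (NNLS) can produce combination weights from held-out predictions. It assumes scipy is available; the names preds and y are illustrative, not from the authors' implementation.

    # Minimal sketch of the stacking weight fit. Assumes scipy;
    # variable names are illustrative, not the authors' code.
    import numpy as np
    from scipy.optimize import nnls

    def stack_weights(preds, y):
        """preds: (n_obs, n_algorithms) held-out predictions.
        y: (n_obs,) true outcomes.
        Returns non-negative weights normalized to sum to one."""
        w, _ = nnls(preds, y)  # minimize ||preds @ w - y||^2 subject to w >= 0
        s = w.sum()
        return w / s if s > 0 else w

A new observation's stacked prediction is then the dot product of its per-algorithm predictions with these weights.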

Page 3

Background: Online (Machine) Learning

- In many contexts, the amount of data generated is so large that it is not feasible to store it. Online machine learning, a paradigm in which models are fit to data arriving in a "stream," eliminates the need to store all past observations.

- For each block of incoming data, the fit of the model is updated and then the data are discarded. Before the fit is updated, each new block can also be used to assess the performance of the model (a sort of cross-validation).

- Any algorithm that can be fit with stochastic gradient descent can easily be used in an online context (a minimal update step is sketched below).
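As a minimal illustration of such an update, here is one stochastic (mini-batch) gradient step for logistic regression on an incoming block, in the spirit of Bottou (2010); the function interface and the learning rate alpha are our assumptions for the sketch.

    # One mini-batch gradient step for logistic regression on a block.
    # The block can be discarded after the update. Illustrative sketch only.
    import numpy as np

    def sgd_logistic_step(w, X, y, alpha):
        """w: (n_features,) weights; X: (block_size, n_features);
        y: labels in {0, 1}; alpha: learning rate."""
        p = 1.0 / (1.0 + np.exp(-X @ w))   # predicted probabilities
        grad = X.T @ (p - y) / len(y)      # gradient of the mean log loss
        return w - alpha * grad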

Page 4

Research Question

- Can we develop an efficient online/streaming approach to training stacking algorithms?

- Secondary goal: speed up the method by training each algorithm on each block concurrently.

Page 5

Datasets: Wikipedia

- Most recent Wikipedia dump (updated since Assignment 3's version)

- Predict inclusion in the parent category of "Mathematics" (dichotomous)

- Training set size ≈ 4,500,000; validation set size = 100,000

- Used a dictionary of the 200,000 most common words, minus stopwords

- Single words from article text

Page 6

Datasets: Stack Overflow

- Stack Overflow questions up until July 31, 2012 (from kaggle.com)

- Predict whether a question ends up "closed" (output: OpenStatus, dichotomized)

- Training set size ≈ 3,000,000; validation set size ≈ 300,000

- Used feature hashing (a minimal sketch follows this list) on:
  - question tags
  - words in the question title
  - words in the question body (not longer than 15 characters)
  - user reputation at question posting time (ReputationAtPostCreation)
  - number of undeleted questions by the user (OwnerUndeletedAnswerCountAtPostTime)
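A minimal sketch of the feature-hashing step (the "hashing trick"): string tokens are mapped into a fixed-width vector by a hash function, so no dictionary needs to be stored. The width 2**20 and the signed-hashing detail are illustrative assumptions, not taken from the slides.

    # Illustrative feature hashing: fixed memory regardless of vocabulary size.
    import hashlib
    import numpy as np

    def hash_features(tokens, n_features=2**20):
        """Map string tokens to a fixed-length vector of signed counts."""
        x = np.zeros(n_features)
        for tok in tokens:
            h = int(hashlib.md5(tok.encode()).hexdigest(), 16)
            sign = 1 if h % 2 == 0 else -1   # signed hashing reduces collision bias
            x[(h >> 1) % n_features] += sign
        return x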

Page 7

Methods: Overview

- Train many estimators with stochastic or mini-batch gradient descent

- Update the weighted combination of estimators on each new block before training each estimator on that block

- Use a "moving window" of predictions and true outcomes to update the weights, where the window can span more than one block (see the buffer sketch after this list)

- Test on the validation set
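One simple way to implement the moving window is a pair of bounded deques holding recent predictions and outcomes; the window size W below is an arbitrary illustrative choice, not fixed by the slides.

    # Sketch of a moving window of (predictions, outcomes); old entries
    # fall off automatically once the window is full. W is illustrative.
    from collections import deque
    import numpy as np

    W = 50000
    pred_window = deque(maxlen=W)  # each entry: one observation's per-algorithm predictions
    y_window = deque(maxlen=W)     # corresponding true outcomes

    def update_window(block_preds, block_y):
        pred_window.extend(block_preds)
        y_window.extend(block_y)
        return np.asarray(pred_window), np.asarray(y_window)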

Page 8

Methods: Algorithm

Algorithm 1: Online Model Stacker

1.1 Using the first data block:
1.2     Take a gradient step for each algorithm in the library of algorithms
1.3 for each subsequent data block do
1.4     Predict the outcome of each observation using each algorithm
1.5     Calculate the risk for this data block (mean squared error)
1.6     Update the best weighted average of algorithms for predicting the true outcome (via NNLS)
1.7     Take a gradient step for each individual algorithm's fit
1.8 end
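Putting the pieces together, here is a compact sketch of Algorithm 1's loop. The learner interface (predict / partial_fit) is an assumption for illustration, and update_window is the moving-window helper sketched earlier; none of this is the authors' actual code.

    # Illustrative loop for Algorithm 1. Assumes learners expose predict(X)
    # and partial_fit(X, y); reuses update_window from the window sketch.
    import numpy as np
    from scipy.optimize import nnls

    def online_stack(blocks, learners):
        weights = np.ones(len(learners)) / len(learners)
        block_risks = []  # MSE per block, computed before updating (line 1.5)
        for i, (X, y) in enumerate(blocks):
            if i > 0:
                # Predict with each algorithm before updating the fits (line 1.4)
                preds = np.column_stack([lrn.predict(X) for lrn in learners])
                block_risks.append(np.mean((preds @ weights - y) ** 2))
                P, Y = update_window(preds, y)
                w, _ = nnls(P, Y)          # best non-negative weighted average (line 1.6)
                if w.sum() > 0:
                    weights = w / w.sum()
            for lrn in learners:
                lrn.partial_fit(X, y)      # one gradient step each, then discard block
        return weights, block_risks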

Page 9

Methods: Algorithm Library

- Logistic regression
  - Ridge regularization
  - LASSO regularization

- SVM

- Mean

Page 10

Methods: Performance assessment

- Overall model fit assessed via MSE on the predictions from each block (before updating the models).

- Performance of the individual algorithms and of the overall model assessed via MSE/AUC on the validation set after one pass through the training data (a sketch of these metrics follows).
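For concreteness, a minimal sketch of the two validation-set summaries; it assumes scikit-learn is available for the AUC computation.

    # Illustrative validation metrics; assumes scikit-learn is available.
    import numpy as np
    from sklearn.metrics import roc_auc_score

    def evaluate(y_true, y_pred):
        """y_pred: predicted probabilities. Returns rMSE and AUC."""
        rmse = np.sqrt(np.mean((y_true - y_pred) ** 2))
        return {"rMSE": rmse, "AUC": roc_auc_score(y_true, y_pred)}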

Page 11

Results: Choosing learning rate (Wikipedia data)

[Figure: "Risk by learning rate and SuperLearner". MSE (y-axis, about 0.004 to 0.007) vs. step (x-axis, 0 to 400) for SuperLearner(), LogisticRegression with alpha = 1.0, 0.1, 0.01, 0.001, and 0.0001, and MeanLearner().]

Page 12

Results: Choosing learning rate (Wikipedia data)

[Figure: "Weights by learning rate". Stacking weight (y-axis, 0.0 to 0.4) vs. step (x-axis, 0 to 400) for the five LogisticRegression learning rates and MeanLearner().]

Page 13

Results: Choosing learning rate (Wikipedia data)

[Figure: "Validation set performance of each learning rate and SuperLearner combination". Panels of rMSE, Accuracy, AUC, and F1 by model: MeanLearner(), LogisticRegression(alpha = 0.0001, 0.001, 0.01, 0.1, 1.0), and SuperLearner().]

Page 14

Results: Choosing learning rate (Wikipedia data)

[Figure: "Validation set ROC plot for each learning rate and Super Learner combination". True positive rate vs. false positive rate for SuperLearner(), the five LogisticRegression learning rates, and MeanLearner().]

Page 15

Results: Choosing penalty parameter (Wikipedia data)

[Figure: "Risk for different penalties". MSE (about 0.004 to 0.007) vs. step (0 to 400) for SuperLearner(), LassoLogisticRegression with lambda from 1.0 down to 1e-07 (alpha = 0.1), and MeanLearner().]

Page 16

Results: Overall results (Wikipedia data)

[Figure: "Risk by algorithm". MSE (about 0.004 to 0.007) vs. step (0 to 400) for SuperLearner(), LogisticRegression(alpha = 0.1), LassoLogisticRegression(lambda = 1e-05), RidgeLogisticRegression(lambda = 1e-05), SVM(lambda = 1e-06), and MeanLearner().]

Page 17

Results: Overall results (Wikipedia data)

[Figure: "Weights by algorithm". Stacking weight (0.0 to 0.6) vs. step (0 to 400) for LogisticRegression(alpha = 0.1), LassoLogisticRegression(lambda = 1e-05), RidgeLogisticRegression(lambda = 1e-05), SVM(lambda = 1e-06), and MeanLearner().]

Page 18

Results: Overall results (Wikipedia data)

[Figure: "Validation set performance". Panels of rMSE, Accuracy, AUC, and F1 by model: LassoLogisticRegression, LogisticRegression, MeanLearner, RidgeLogisticRegression, SuperLearner, and SVM.]

Page 19

Results: Overall results (Wikipedia data)

[Figure: "Validation set ROC plots". True positive rate vs. false positive rate for LassoLogisticRegression, LogisticRegression, MeanLearner, RidgeLogisticRegression, SuperLearner, and SVM.]

Page 20

Results: Results from Stack Overflow

[Figure: "Validation set performance". Panels of rMSE, Accuracy, AUC, and F1 (with per-block points) by model: LassoLogisticRegression, LogisticRegression, MeanLearner, RidgeLogisticRegression, SVM, and SuperLearner.]

Page 21

Results: Results from Stack Overflow

[Figure: "Validation set ROC plot". True positive rate vs. false positive rate for LassoLogisticRegression, LogisticRegression, MeanLearner, RidgeLogisticRegression, SVM, and SuperLearner.]

Page 22

Conclusions

- Does as well as any individual algorithm

- Doesn't overfit even when many algorithms are included

- Also a good way to auto-tune many learning rates concurrently

Page 23

References

Léon Bottou. Large-scale machine learning with stochastic gradient descent. In Proceedings of COMPSTAT'2010, pages 177-186. Springer, 2010.

John Duchi, Elad Hazan, and Yoram Singer. Adaptive subgradient methods for online learning and stochastic optimization. Journal of Machine Learning Research, 12:2121-2159, 2011.

Shai Shalev-Shwartz, Yoram Singer, and Nathan Srebro. Pegasos: primal estimated sub-gradient solver for SVM. In Proceedings of the 24th International Conference on Machine Learning, pages 807-814. ACM, 2007.

Mark J. van der Laan, Eric C. Polley, and Alan E. Hubbard. Super learner. Statistical Applications in Genetics and Molecular Biology, 6(1):1-21, 2007.