Model Stacking for "Online" Data
Jeremy Coyle, Sam Lendle, Sara Moore
May 7, 2013


Page 1

Model Stacking for “Online” Data

Jeremy Coyle, Sam Lendle, Sara Moore

May 7, 2013

Page 2

Background: Model Stacking

- Definition: An ensemble method in which many algorithms, possibly from different types of models, are combined to form a final estimator.

- Motivation: A wide variety of algorithms can be used to predict an outcome, but for any new prediction problem it is not clear in advance which one will best capture the true relationship between features and outcome.

- Method: The algorithms are trained together, then combined via a weighted average using hold-out data (or cross-validation).

- Implementation: Super Learner combines other algorithms by minimizing a suitable loss function of a linear combination of their predictions under cross-validation (see the weighting sketch after this list).

- van der Laan et al. (2003, 2006) proved that the combined approach performs asymptotically as well as or better than the "best" algorithm in the library of algorithms.
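To make the weighting step concrete, here is a minimal sketch of how non-negative least squares (NNLS) can produce combination weights from held-out predictions. It assumes scipy is available; the names preds and y are illustrative, not from the authors' implementation.

    # Minimal sketch of the stacking weight fit. Assumes scipy;
    # variable names are illustrative, not the authors' code.
    import numpy as np
    from scipy.optimize import nnls

    def stack_weights(preds, y):
        """preds: (n_obs, n_algorithms) held-out predictions.
        y: (n_obs,) true outcomes.
        Returns non-negative weights normalized to sum to one."""
        w, _ = nnls(preds, y)  # minimize ||preds @ w - y||^2 subject to w >= 0
        s = w.sum()
        return w / s if s > 0 else w

A new observation's stacked prediction is then the dot product of its per-algorithm predictions with these weights.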

Page 3

Background: Online (Machine) Learning

- In many contexts, the amount of data generated is so large that it is not feasible to store it. Online machine learning, a paradigm in which models are fit to data arriving in a "stream," eliminates the need to store all past observations.

- For each block of incoming data, the fit of the model is updated and then the data are discarded. Before the fit is updated, each new block can also be used to assess the performance of the model (a sort of cross-validation).

- Any algorithm that can be fit with stochastic gradient descent can easily be used in an online context (a minimal update step is sketched below).
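As a minimal illustration of such an update, here is one stochastic (mini-batch) gradient step for logistic regression on an incoming block, in the spirit of Bottou (2010); the function interface and the learning rate alpha are our assumptions for the sketch.

    # One mini-batch gradient step for logistic regression on a block.
    # The block can be discarded after the update. Illustrative sketch only.
    import numpy as np

    def sgd_logistic_step(w, X, y, alpha):
        """w: (n_features,) weights; X: (block_size, n_features);
        y: labels in {0, 1}; alpha: learning rate."""
        p = 1.0 / (1.0 + np.exp(-X @ w))   # predicted probabilities
        grad = X.T @ (p - y) / len(y)      # gradient of the mean log loss
        return w - alpha * grad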

Page 4

Research Question

- Can we develop an efficient online/streaming approach to training stacking algorithms?

- Secondary goal: speed up the method by training each algorithm on each block concurrently.

Page 5

Datasets: Wikipedia

- Most recent Wikipedia dump (updated since Assignment 3's version)

- Predict inclusion in the parent category of "Mathematics" (dichotomous)

- Training set size ≈ 4,500,000; validation set size = 100,000

- Used a dictionary of the 200,000 most common words, minus stopwords

- Single words from article text

Page 6

Datasets: Stack Overflow

- Stack Overflow questions up until July 31, 2012 (from kaggle.com)

- Predict whether a question ends up "closed" (output: OpenStatus, dichotomized)

- Training set size ≈ 3,000,000; validation set size ≈ 300,000

- Used feature hashing (a minimal sketch follows this list) on:
  - question tags
  - words in the question title
  - words in the question body (not longer than 15 characters)
  - user reputation at question posting time (ReputationAtPostCreation)
  - number of undeleted questions by the user (OwnerUndeletedAnswerCountAtPostTime)
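A minimal sketch of the feature-hashing step (the "hashing trick"): string tokens are mapped into a fixed-width vector by a hash function, so no dictionary needs to be stored. The width 2**20 and the signed-hashing detail are illustrative assumptions, not taken from the slides.

    # Illustrative feature hashing: fixed memory regardless of vocabulary size.
    import hashlib
    import numpy as np

    def hash_features(tokens, n_features=2**20):
        """Map string tokens to a fixed-length vector of signed counts."""
        x = np.zeros(n_features)
        for tok in tokens:
            h = int(hashlib.md5(tok.encode()).hexdigest(), 16)
            sign = 1 if h % 2 == 0 else -1   # signed hashing reduces collision bias
            x[(h >> 1) % n_features] += sign
        return x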

Page 7

Methods: Overview

- Train many estimators with stochastic or mini-batch gradient descent

- Update the weighted combination of estimators on each new block before training each estimator on that block

- Use a "moving window" of predictions and true outcomes to update the weights, where the window can span more than one block (see the buffer sketch after this list)

- Test on the validation set
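One simple way to implement the moving window is a pair of bounded deques holding recent predictions and outcomes; the window size W below is an arbitrary illustrative choice, not fixed by the slides.

    # Sketch of a moving window of (predictions, outcomes); old entries
    # fall off automatically once the window is full. W is illustrative.
    from collections import deque
    import numpy as np

    W = 50000
    pred_window = deque(maxlen=W)  # each entry: one observation's per-algorithm predictions
    y_window = deque(maxlen=W)     # corresponding true outcomes

    def update_window(block_preds, block_y):
        pred_window.extend(block_preds)
        y_window.extend(block_y)
        return np.asarray(pred_window), np.asarray(y_window)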

Page 8

Methods: Algorithm

Algorithm 1: Online Model Stacker

1.1 Using the first data block:
1.2     Take a gradient step for each algorithm in the library of algorithms
1.3 for each subsequent data block do
1.4     Predict the outcome of each observation using each algorithm
1.5     Calculate the risk for this data block (mean squared error)
1.6     Update the best weighted average of algorithms for predicting the true outcome (via NNLS)
1.7     Take a gradient step for each individual algorithm's fit
1.8 end
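Putting the pieces together, here is a compact sketch of Algorithm 1's loop. The learner interface (predict / partial_fit) is an assumption for illustration, and update_window is the moving-window helper sketched earlier; none of this is the authors' actual code.

    # Illustrative loop for Algorithm 1. Assumes learners expose predict(X)
    # and partial_fit(X, y); reuses update_window from the window sketch.
    import numpy as np
    from scipy.optimize import nnls

    def online_stack(blocks, learners):
        weights = np.ones(len(learners)) / len(learners)
        block_risks = []  # MSE per block, computed before updating (line 1.5)
        for i, (X, y) in enumerate(blocks):
            if i > 0:
                # Predict with each algorithm before updating the fits (line 1.4)
                preds = np.column_stack([lrn.predict(X) for lrn in learners])
                block_risks.append(np.mean((preds @ weights - y) ** 2))
                P, Y = update_window(preds, y)
                w, _ = nnls(P, Y)          # best non-negative weighted average (line 1.6)
                if w.sum() > 0:
                    weights = w / w.sum()
            for lrn in learners:
                lrn.partial_fit(X, y)      # one gradient step each, then discard block
        return weights, block_risks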

Page 9

Methods: Algorithm Library

- Logistic regression
  - Ridge regularization
  - LASSO regularization

- SVM

- Mean

Page 10

Methods: Performance assessment

- Overall model fit assessed via MSE on the predictions from each block (before updating the models).

- Performance of the individual algorithms and of the overall model assessed via MSE/AUC on the validation set after one pass through the training data (a sketch of these metrics follows).
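For concreteness, a minimal sketch of the two validation-set summaries; it assumes scikit-learn is available for the AUC computation.

    # Illustrative validation metrics; assumes scikit-learn is available.
    import numpy as np
    from sklearn.metrics import roc_auc_score

    def evaluate(y_true, y_pred):
        """y_pred: predicted probabilities. Returns rMSE and AUC."""
        rmse = np.sqrt(np.mean((y_true - y_pred) ** 2))
        return {"rMSE": rmse, "AUC": roc_auc_score(y_true, y_pred)}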

Page 11

Results: Choosing learning rate (Wikipedia data)

[Figure: "Risk by learning rate and SuperLearner". MSE (y-axis, about 0.004 to 0.007) vs. step (x-axis, 0 to 400) for SuperLearner(), LogisticRegression with alpha = 1.0, 0.1, 0.01, 0.001, and 0.0001, and MeanLearner().]

Page 12

Results: Choosing learning rate (Wikipedia data)

[Figure: "Weights by learning rate". Stacking weight (y-axis, 0.0 to 0.4) vs. step (x-axis, 0 to 400) for the five LogisticRegression learning rates and MeanLearner().]

Page 13

Results: Choosing learning rate (Wikipedia data)

[Figure: "Validation set performance of each learning rate and SuperLearner combination". Panels of rMSE, Accuracy, AUC, and F1 by model: MeanLearner(), LogisticRegression(alpha = 0.0001, 0.001, 0.01, 0.1, 1.0), and SuperLearner().]

Page 14

Results: Choosing learning rate (Wikipedia data)

[Figure: "Validation set ROC plot for each learning rate and Super Learner combination". True positive rate vs. false positive rate for SuperLearner(), the five LogisticRegression learning rates, and MeanLearner().]

Page 15

Results: Choosing penalty parameter (Wikipedia data)

[Figure: "Risk for different penalties". MSE (about 0.004 to 0.007) vs. step (0 to 400) for SuperLearner(), LassoLogisticRegression with lambda from 1.0 down to 1e-07 (alpha = 0.1), and MeanLearner().]

Page 16

Results: Overall results (Wikipedia data)

[Figure: "Risk by algorithm". MSE (about 0.004 to 0.007) vs. step (0 to 400) for SuperLearner(), LogisticRegression(alpha = 0.1), LassoLogisticRegression(lambda = 1e-05), RidgeLogisticRegression(lambda = 1e-05), SVM(lambda = 1e-06), and MeanLearner().]

Page 17

Results: Overall results (Wikipedia data)

[Figure: "Weights by algorithm". Stacking weight (0.0 to 0.6) vs. step (0 to 400) for LogisticRegression(alpha = 0.1), LassoLogisticRegression(lambda = 1e-05), RidgeLogisticRegression(lambda = 1e-05), SVM(lambda = 1e-06), and MeanLearner().]

Page 18

Results: Overall results (Wikipedia data)

[Figure: "Validation set performance". Panels of rMSE, Accuracy, AUC, and F1 by model: LassoLogisticRegression, LogisticRegression, MeanLearner, RidgeLogisticRegression, SuperLearner, and SVM.]

Page 19

Results: Overall results (Wikipedia data)

[Figure: "Validation set ROC plots". True positive rate vs. false positive rate for LassoLogisticRegression, LogisticRegression, MeanLearner, RidgeLogisticRegression, SuperLearner, and SVM.]

Page 20

Results: Results from Stack Overflow

[Figure: "Validation set performance". Panels of rMSE, Accuracy, AUC, and F1 (with per-block points) by model: LassoLogisticRegression, LogisticRegression, MeanLearner, RidgeLogisticRegression, SVM, and SuperLearner.]

Page 21

Results: Results from Stack Overflow

[Figure: "Validation set ROC plot". True positive rate vs. false positive rate for LassoLogisticRegression, LogisticRegression, MeanLearner, RidgeLogisticRegression, SVM, and SuperLearner.]

Page 22

Conclusions

- Does as well as any individual algorithm

- Doesn't overfit even when many algorithms are included

- Also a good way to auto-tune many learning rates concurrently

Page 23

References

Léon Bottou. Large-scale machine learning with stochastic gradient descent. In Proceedings of COMPSTAT'2010, pages 177-186. Springer, 2010.

John Duchi, Elad Hazan, and Yoram Singer. Adaptive subgradient methods for online learning and stochastic optimization. Journal of Machine Learning Research, 12:2121-2159, 2011.

Shai Shalev-Shwartz, Yoram Singer, and Nathan Srebro. Pegasos: primal estimated sub-gradient solver for SVM. In Proceedings of the 24th International Conference on Machine Learning, pages 807-814. ACM, 2007.

Mark J. van der Laan, Eric C. Polley, and Alan E. Hubbard. Super learner. Statistical Applications in Genetics and Molecular Biology, 6(1):1-21, 2007.