
STK4030: Bagging and random forests

Kristoffer H. Hellton

16th of November 2015

All figures and tables are from Hastie, Tibshirani and Friedman,

Elements of Statistical Learning.


Today

Bagging, Chap 8.7

Trees, Chap 9.2

Random forests, Chap 15


First: the vox populi, or the wisdom of the crowd

In 1906, Francis Galton visited a country fair in Plymouth where 800 people participated in a contest to estimate the weight of a slaughtered ox. No one managed to guess the exact answer, but Galton observed that the median guess, 1207 pounds, was accurate to within 1% of the true weight of 1198 pounds, and he stated:

the middlemost estimate expresses the vox populi, every other estimate being condemned as too low or too high by a majority of the voters¹,

emphasizing the wisdom of the crowd as a way to improve predictions.

¹ Galton (1906), Nature

Bootstrap

Given a training set $Z = (z_1, \ldots, z_n)$ and any quantity $S(Z)$ estimated from it.

Bootstrapping is a resampling tool used to assess statistical accuracy:

1. draw $B$ datasets $Z^{*b}$, $b = 1, \ldots, B$ (of the same size), randomly with replacement from the training data

2. calculate the quantity $S(Z^{*b})$ on each bootstrap sample

3. usually, the variance over the bootstrap values $S(Z^{*1}), \ldots, S(Z^{*B})$ is used as an estimate of the variance of $S(Z)$

Note: with replacement means some observations will be repeated in each bootstrap sample. For large $n$, roughly $1 - 1/e \approx 63\%$ of the observations will be unique.
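A minimal numerical sketch of the three steps, assuming Python with NumPy; the data, the sample size and the choice of the median as $S$ are made up for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)
z = rng.normal(loc=5.0, scale=2.0, size=100)      # hypothetical training data z_1,...,z_n

def S(sample):
    # the estimated quantity of interest; the median is just an illustrative choice
    return np.median(sample)

B = 1000
boot_values = np.empty(B)
for b in range(B):
    idx = rng.integers(0, len(z), size=len(z))    # 1. resample n indices with replacement
    boot_values[b] = S(z[idx])                    # 2. compute S on the bootstrap sample Z*b

# 3. the variance over the bootstrap values estimates the variance of S(Z)
print("bootstrap variance estimate:", boot_values.var(ddof=1))
print("unique fraction in last sample:", np.unique(idx).size / len(z))  # roughly 0.63
```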


Bagging

Bootstrap aggregation, or bagging, uses the bootstrap framework to build a prediction $\hat f^{*b}$ on each bootstrap sample $Z^{*b}$,

and then averages over the $B$ samples:

$$\hat f_{\mathrm{bag}}(x) = \frac{1}{B}\sum_{b=1}^{B} \hat f^{*b}(x),$$

thereby (hopefully) reducing the variance of the prediction (think of the central limit theorem).

This can lead to improvements for unstable procedures.
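A from-scratch sketch of this averaging, assuming Python with NumPy and scikit-learn; the data are simulated, and a regression tree stands in for the unstable base procedure (anticipating the next slides):

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(1)
X = rng.uniform(-3, 3, size=(200, 1))                     # hypothetical data
y = np.sin(X[:, 0]) + rng.normal(scale=0.3, size=200)

B = 100
x_new = np.linspace(-3, 3, 50).reshape(-1, 1)             # points where we want predictions
preds = np.zeros((B, len(x_new)))

for b in range(B):
    idx = rng.integers(0, len(X), size=len(X))            # bootstrap sample Z*b
    fit_b = DecisionTreeRegressor().fit(X[idx], y[idx])   # f^*b fitted on Z*b
    preds[b] = fit_b.predict(x_new)

f_bag = preds.mean(axis=0)                                # average over the B bootstrap fits
```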


Nonlinear estimates

The bagged estimate will only differ from the original estimate if $\hat f(x)$ is a nonlinear function of the data.

HTF 8.7: the B-spline smoother is linear in $y$ (for fixed inputs), $\hat f_{\mathrm{spline}}(x) = Sy$, so that $\hat f_{\mathrm{bag}}(x) \to \hat f(x)$ as $B \to \infty$; bagging therefore just reproduces the original estimate and gives no improvement.

Need TREES!

Trees are highly nonlinear and unstable with large variance: perfect for bagging.


Trees, Chap 9.2

A tree (in our setting) is a recursive binary partition (a sequence of two-part splits) of the feature space, which corresponds to rectangles $R_m$.

Tree regression: fit a constant $\gamma_m$ to each $R_m$,

$$f(x) = \sum_{m=1}^{M} \gamma_m I(x \in R_m),$$

with $\hat\gamma_m = \mathrm{Ave}(y_i \mid x_i \in R_m)$, the mean value of the $y_i$'s within each rectangle.
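A small sketch of such a piecewise-constant fit, assuming scikit-learn's DecisionTreeRegressor and made-up data:

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(2)
X = rng.uniform(0, 10, size=(200, 2))                        # two input features
y = np.where(X[:, 0] < 5, 1.0, 3.0) + rng.normal(scale=0.2, size=200)

# A tree of depth 2 partitions the feature space into at most M = 4 rectangles R_m;
# the prediction in each rectangle is the mean of the y_i that fall into it.
tree = DecisionTreeRegressor(max_depth=2).fit(X, y)
print(tree.predict([[2.0, 7.0], [8.0, 1.0]]))                # constants for two rectangles
```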


Tree example

The corresponding rectangles and regression tree:


How to estimate a tree

Finding the best binary partition in terms of minimum sum of squares is generally computationally infeasible. Instead, we proceed greedily.

Consider splitting on one variable: $R_1(j,s) = \{X \mid X_j \le s\}$ and $R_2(j,s) = \{X \mid X_j > s\}$; we then seek to minimize

$$\min_{j,s}\left[\,\min_{\gamma_1}\sum_{x_i \in R_1(j,s)} (y_i - \gamma_1)^2 \;+\; \min_{\gamma_2}\sum_{x_i \in R_2(j,s)} (y_i - \gamma_2)^2\right],$$

solved by $\hat\gamma_1 = \mathrm{Ave}(y_i \mid x_i \in R_1)$ and $\hat\gamma_2 = \mathrm{Ave}(y_i \mid x_i \in R_2)$.

For each splitting variable $j$, the best split point $s$ can be found quickly, so it is feasible to find the best pair $(j, s)$ by searching through all the inputs.
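A direct (non-optimized) sketch of this exhaustive search, assuming NumPy; the function name and the simulated data are illustrative:

```python
import numpy as np

def best_split(X, y):
    """Greedy search for the best single split (j, s) in terms of sum of squares."""
    best_j, best_s, best_sse = None, None, np.inf
    n, p = X.shape
    for j in range(p):                        # loop over splitting variables
        for s in np.unique(X[:, j]):          # candidate split points
            left = X[:, j] <= s
            right = ~left
            if not left.any() or not right.any():
                continue
            # the optimal constants are the means of y in each half, so the criterion is
            sse = ((y[left] - y[left].mean()) ** 2).sum() \
                + ((y[right] - y[right].mean()) ** 2).sum()
            if sse < best_sse:
                best_j, best_s, best_sse = j, s, sse
    return best_j, best_s

# example: with y driven by a threshold on X[:, 0], the search should recover j = 0
rng = np.random.default_rng(3)
X = rng.uniform(size=(100, 3))
y = np.where(X[:, 0] < 0.4, 0.0, 2.0) + rng.normal(scale=0.1, size=100)
print(best_split(X, y))
```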


Tree complexity

The tree size works as a tuning parameter controlling the model complexity:

fixed number of splits, for instance a stump (one split)

minimum node size: stop splitting a node if the number of observations is below a lower threshold

Preferred strategy: grow a large (or complete) tree and then prune it back (removing sections), as sketched after the list:

cost-complexity pruning
reduced-error pruning
weakest-link pruning
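An illustration of the grow-then-prune strategy, assuming scikit-learn's cost-complexity pruning; the data are simulated and the choice of penalty below is arbitrary (in practice it would be tuned, e.g. by cross-validation):

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(4)
X = rng.uniform(size=(300, 4))
y = 2 * X[:, 0] + rng.normal(scale=0.3, size=300)

full_tree = DecisionTreeRegressor(random_state=0).fit(X, y)     # grow a large tree
path = full_tree.cost_complexity_pruning_path(X, y)             # sequence of penalties alpha

# refit with a non-zero cost-complexity penalty; larger alpha gives a smaller (pruned) tree
pruned = DecisionTreeRegressor(ccp_alpha=path.ccp_alphas[-3], random_state=0).fit(X, y)
print(full_tree.get_n_leaves(), "->", pruned.get_n_leaves())
```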


Some advantages of trees

See Table 10.1 in Chap 10.7:

fast to construct, and interpretable models

can incorporate mixtures of numeric and categorical input

invariant to (strictly monotone) transformations: scaling is not necessary and the fits are robust to outliers in the inputs

automatic internal variable selection: resistant to irrelevant inputs

As a result, trees have emerged as one of the most popular data mining methods.


Back to bagging

Trees are unstable and nonlinear with high variance and low bias: perfect for bagging.

Each bootstrap tree will involve different features and a different number of terminal nodes.

Average the prediction at x for all B trees to get the bagged estimate:

$$\hat f_{\mathrm{bag}}(x) = \frac{1}{B}\sum_{b=1}^{B} \hat f^{*b}(x).$$
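The same averaging with a library implementation, as a sketch assuming scikit-learn; BaggingRegressor uses a decision tree as its default base estimator, and the data here are made up:

```python
import numpy as np
from sklearn.ensemble import BaggingRegressor

rng = np.random.default_rng(5)
X = rng.uniform(-3, 3, size=(300, 5))
y = np.sin(X[:, 0]) + X[:, 1] ** 2 + rng.normal(scale=0.5, size=300)

# B = 100 bootstrap samples, one fully grown tree per sample, predictions averaged
bag = BaggingRegressor(n_estimators=100, bootstrap=True, random_state=0).fit(X, y)
print(bag.predict(X[:3]))
```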


Bootstrap trees


Combination of weak predictors

Why can bagging lower variance?

$$\operatorname{var}\big[\hat f_{\mathrm{bag}}(x)\big] = \frac{1}{B^2}\sum_{b=1}^{B} \operatorname{var}\big[\hat f^{*b}(x)\big] + \frac{1}{B^2}\sum_{b \neq b'} \operatorname{cov}\big[\hat f^{*b}(x),\, \hat f^{*b'}(x)\big]$$

The variance is a trade-off between the variance of the individual predictors and their correlation. If the predictors are identically distributed with variance $\sigma^2$ and pairwise correlation $\rho > 0$, the variance is given by

$$\operatorname{var}\big[\hat f_{\mathrm{bag}}(x)\big] = \frac{\sigma^2}{B} + \frac{B(B-1)}{B^2}\,\rho\sigma^2 = \rho\sigma^2 + \frac{1-\rho}{B}\,\sigma^2.$$

The degree of pairwise correlation between the bagged trees limits the benefit of bagging!
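A quick numerical check of this formula, assuming NumPy; the values of $B$, $\rho$ and $\sigma^2$ are arbitrary illustrative choices:

```python
import numpy as np

B, rho, sigma2 = 25, 0.5, 1.0
# equicorrelation covariance matrix of B identically distributed predictors
cov = sigma2 * (rho * np.ones((B, B)) + (1 - rho) * np.eye(B))

rng = np.random.default_rng(6)
draws = rng.multivariate_normal(np.zeros(B), cov, size=200_000)

print(draws.mean(axis=1).var())                 # empirical variance of the average
print(rho * sigma2 + (1 - rho) * sigma2 / B)    # theoretical value: 0.52
```

As $B$ grows, the second term vanishes and the variance is floored at $\rho\sigma^2$.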


Random forests

The idea of random forests is to reduce the correlation between the trees, without increasing the variance too much.

This is achieved through random selection of the input variables: before each split, select $m \le p$ of the input variables at random as candidates for splitting.

The parameter $m$ is typically $\sqrt{p}$ or even as small as 1.
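A sketch of this in scikit-learn, with simulated data; max_features plays the role of $m$, and "sqrt" gives $m \approx \sqrt{p}$:

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(7)
X = rng.uniform(size=(300, 9))                                    # p = 9 inputs
y = X[:, 0] + 2 * X[:, 1] + rng.normal(scale=0.3, size=300)

# m = sqrt(p) = 3 candidate variables are drawn at random before each split
rf = RandomForestRegressor(n_estimators=200, max_features="sqrt",
                           random_state=0).fit(X, y)
print(rf.predict(X[:3]))
```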


Some characteristics of random forests

Out-of-bag (OOB) samples: for each observation $z_i$, construct its random forest predictor by averaging only over those bootstrap samples in which the observation did not appear.

This is almost identical to leave-one-out cross-validation, and it can be done alongside the fitting: once the OOB error stabilizes, the training can stop (see the sketch below).

For large $p$ with only a small fraction of relevant variables, random forests with small $m$ perform poorly.
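A sketch of the OOB error in scikit-learn, with simulated data; for a regressor, oob_score_ is reported as $R^2$ on the out-of-bag predictions:

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(8)
X = rng.uniform(size=(500, 10))
y = X[:, 0] - X[:, 1] + rng.normal(scale=0.3, size=500)

# with oob_score=True each observation is predicted by averaging only the trees
# whose bootstrap sample did not contain it
rf = RandomForestRegressor(n_estimators=300, oob_score=True,
                           random_state=0).fit(X, y)
print(rf.oob_score_)
```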


Boosting or bagging??

Bagging: the bias of the bagged trees is the same as for the individual trees (identically distributed), but there is hope for improvement through variance reduction.

Boosting: trees are grown in an adaptive way to remove bias (not identically distributed).

HTF state: “In our experience random forests do remarkably well with very little tuning required. But often boosting seems to do better...”


Summary

Boosting and bagging are both ensemble techniques, where weak learners, such as trees, are combined through averaging or majority vote to create a strong learner.

Bagging/RF: improvement only in the variance or stability, by combining bootstrap samples (drawn with replacement).
Needs an unstable/high-variance weak learner.
Performs better if the learners are decorrelated without increasing the variance.

Boosting: improves the learners in an adaptive way to (slowly) remove bias.
AdaBoost: uses all the data to train each learner, but misclassified instances are given more weight in subsequent steps.
Boosting with shrinkage: implicit L1, lasso-style penalty.
