
STK4030: Bagging and random forests

Kristoffer H. Hellton

16th of November 2015

All figures and tables are from Hastie, Tibshirani and Friedman,

Elements of Statistical Learning.


Today

Bagging, Chap 8.7

Trees, Chap 9.2

Random forests, Chap 15


First: the vox populi, or the wisdom of the crowd

In 1906, Francis Galton visited a country fair in Plymouth where 800 people participated in a contest to estimate the weight of a slaughtered ox. No one managed to guess the exact answer, but Galton observed that the median guess, 1207 pounds, was accurate to within 1% of the true weight of 1198 pounds, and he stated:

the middlemost estimate expresses the vox populi, every other estimate being condemned as too low or too high by a majority of the voters¹,

emphasizing the wisdom of the crowd as a way to improve predictions.

¹ Galton (1906), Nature

Bootstrap

Given a training set $Z = (z_1, \ldots, z_n)$ and any quantity $S(Z)$ estimated from it.

Bootstrapping is a resampling tool used to assess statistical accuracy:

1. draw $B$ datasets $Z^{*b}$, $b = 1, \ldots, B$ (of the same size), randomly with replacement from the training data

2. calculate the quantity $S(Z^{*b})$ on each bootstrap sample

3. usually, the variance over the bootstrap values $S(Z^{*1}), \ldots, S(Z^{*B})$ is used as an estimate of the variance of $S(Z)$

Note: with replacement means some observations will be repeated in each bootstrap sample. For large $n$, roughly $1 - 1/e \approx 63\%$ of the observations will be unique.
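A minimal numerical sketch of the three steps, assuming Python with NumPy; the data, the sample size and the choice of the median as $S$ are made up for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)
z = rng.normal(loc=5.0, scale=2.0, size=100)      # hypothetical training data z_1,...,z_n

def S(sample):
    # the estimated quantity of interest; the median is just an illustrative choice
    return np.median(sample)

B = 1000
boot_values = np.empty(B)
for b in range(B):
    idx = rng.integers(0, len(z), size=len(z))    # 1. resample n indices with replacement
    boot_values[b] = S(z[idx])                    # 2. compute S on the bootstrap sample Z*b

# 3. the variance over the bootstrap values estimates the variance of S(Z)
print("bootstrap variance estimate:", boot_values.var(ddof=1))
print("unique fraction in last sample:", np.unique(idx).size / len(z))  # roughly 0.63
```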


Bagging

Bootstrap aggregation, or bagging, uses the bootstrap framework to build a prediction $\hat f^{*b}$ on each bootstrap sample $Z^{*b}$,

and then averages over the $B$ samples:

$$\hat f_{\mathrm{bag}}(x) = \frac{1}{B}\sum_{b=1}^{B} \hat f^{*b}(x),$$

thereby (hopefully) reducing the variance of the prediction (think of the central limit theorem).

This can lead to improvements for unstable procedures.
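A from-scratch sketch of this averaging, assuming Python with NumPy and scikit-learn; the data are simulated, and a regression tree stands in for the unstable base procedure (anticipating the next slides):

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(1)
X = rng.uniform(-3, 3, size=(200, 1))                     # hypothetical data
y = np.sin(X[:, 0]) + rng.normal(scale=0.3, size=200)

B = 100
x_new = np.linspace(-3, 3, 50).reshape(-1, 1)             # points where we want predictions
preds = np.zeros((B, len(x_new)))

for b in range(B):
    idx = rng.integers(0, len(X), size=len(X))            # bootstrap sample Z*b
    fit_b = DecisionTreeRegressor().fit(X[idx], y[idx])   # f^*b fitted on Z*b
    preds[b] = fit_b.predict(x_new)

f_bag = preds.mean(axis=0)                                # average over the B bootstrap fits
```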


Nonlinear estimates

The bagged estimate will only differ from the original estimate if $\hat f(x)$ is a nonlinear function of the data.

HTF 8.7: the B-spline smoother is linear in $y$ (for fixed inputs), $\hat f_{\mathrm{spline}}(x) = Sy$, so that $\hat f_{\mathrm{bag}}(x) \to \hat f(x)$ as $B \to \infty$; bagging therefore just reproduces the original estimate and gives no improvement.

Need TREES!

Trees are highly nonlinear and unstable with large variance: perfect for bagging.


Trees, Chap 9.2

A tree (in our setting) is a recursive binary partition (a sequence of two-part splits) of the feature space, which corresponds to rectangles $R_m$.

Tree regression: fit a constant $\gamma_m$ to each $R_m$,

$$f(x) = \sum_{m=1}^{M} \gamma_m I(x \in R_m),$$

with $\hat\gamma_m = \mathrm{Ave}(y_i \mid x_i \in R_m)$, the mean value of the $y_i$'s within each rectangle.
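A small sketch of such a piecewise-constant fit, assuming scikit-learn's DecisionTreeRegressor and made-up data:

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(2)
X = rng.uniform(0, 10, size=(200, 2))                        # two input features
y = np.where(X[:, 0] < 5, 1.0, 3.0) + rng.normal(scale=0.2, size=200)

# A tree of depth 2 partitions the feature space into at most M = 4 rectangles R_m;
# the prediction in each rectangle is the mean of the y_i that fall into it.
tree = DecisionTreeRegressor(max_depth=2).fit(X, y)
print(tree.predict([[2.0, 7.0], [8.0, 1.0]]))                # constants for two rectangles
```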


Tree example

The corresponding rectangles and regression tree:


How to estimate a tree

Finding the best binary partition in terms of minimum sum of squares is generally computationally infeasible. Instead, we proceed greedily.

Consider splitting on one variable: $R_1(j,s) = \{X \mid X_j \le s\}$ and $R_2(j,s) = \{X \mid X_j > s\}$; we then seek to minimize

$$\min_{j,s}\left[\,\min_{\gamma_1}\sum_{x_i \in R_1(j,s)} (y_i - \gamma_1)^2 \;+\; \min_{\gamma_2}\sum_{x_i \in R_2(j,s)} (y_i - \gamma_2)^2\right],$$

solved by $\hat\gamma_1 = \mathrm{Ave}(y_i \mid x_i \in R_1)$ and $\hat\gamma_2 = \mathrm{Ave}(y_i \mid x_i \in R_2)$.

For each splitting variable $j$, the best split point $s$ can be found quickly, so it is feasible to find the best pair $(j, s)$ by searching through all the inputs.
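A direct (non-optimized) sketch of this exhaustive search, assuming NumPy; the function name and the simulated data are illustrative:

```python
import numpy as np

def best_split(X, y):
    """Greedy search for the best single split (j, s) in terms of sum of squares."""
    best_j, best_s, best_sse = None, None, np.inf
    n, p = X.shape
    for j in range(p):                        # loop over splitting variables
        for s in np.unique(X[:, j]):          # candidate split points
            left = X[:, j] <= s
            right = ~left
            if not left.any() or not right.any():
                continue
            # the optimal constants are the means of y in each half, so the criterion is
            sse = ((y[left] - y[left].mean()) ** 2).sum() \
                + ((y[right] - y[right].mean()) ** 2).sum()
            if sse < best_sse:
                best_j, best_s, best_sse = j, s, sse
    return best_j, best_s

# example: with y driven by a threshold on X[:, 0], the search should recover j = 0
rng = np.random.default_rng(3)
X = rng.uniform(size=(100, 3))
y = np.where(X[:, 0] < 0.4, 0.0, 2.0) + rng.normal(scale=0.1, size=100)
print(best_split(X, y))
```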


Tree complexity

The tree size works as a tuning parameter controlling the model complexity:

fixed number of splits, for instance a stump (one split)

minimum node size: stop splitting a node if the number of observations is below a lower threshold

Preferred strategy: grow a large (or complete) tree and then prune it back (removing sections), as sketched after the list:

cost-complexity pruning
reduced-error pruning
weakest-link pruning
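An illustration of the grow-then-prune strategy, assuming scikit-learn's cost-complexity pruning; the data are simulated and the choice of penalty below is arbitrary (in practice it would be tuned, e.g. by cross-validation):

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(4)
X = rng.uniform(size=(300, 4))
y = 2 * X[:, 0] + rng.normal(scale=0.3, size=300)

full_tree = DecisionTreeRegressor(random_state=0).fit(X, y)     # grow a large tree
path = full_tree.cost_complexity_pruning_path(X, y)             # sequence of penalties alpha

# refit with a non-zero cost-complexity penalty; larger alpha gives a smaller (pruned) tree
pruned = DecisionTreeRegressor(ccp_alpha=path.ccp_alphas[-3], random_state=0).fit(X, y)
print(full_tree.get_n_leaves(), "->", pruned.get_n_leaves())
```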


Some advantages of trees

See Table 10.1 in Chap 10.7:

fast to construct, and interpretable models

can incorporate mixtures of numeric and categorical input

invariant to (strictly monotone) transformations: scaling is not necessary and the fits are robust to outliers in the inputs

automatic internal variable selection: resistant to irrelevant inputs

As a result, trees have emerged as one of the most popular data mining methods.


Back to bagging

Trees are unstable and nonlinear with high variance and low bias: perfect for bagging.

Each bootstrap tree will involve different features and a different number of terminal nodes.

Average the prediction at x for all B trees to get the bagged estimate:

$$\hat f_{\mathrm{bag}}(x) = \frac{1}{B}\sum_{b=1}^{B} \hat f^{*b}(x).$$
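The same averaging with a library implementation, as a sketch assuming scikit-learn; BaggingRegressor uses a decision tree as its default base estimator, and the data here are made up:

```python
import numpy as np
from sklearn.ensemble import BaggingRegressor

rng = np.random.default_rng(5)
X = rng.uniform(-3, 3, size=(300, 5))
y = np.sin(X[:, 0]) + X[:, 1] ** 2 + rng.normal(scale=0.5, size=300)

# B = 100 bootstrap samples, one fully grown tree per sample, predictions averaged
bag = BaggingRegressor(n_estimators=100, bootstrap=True, random_state=0).fit(X, y)
print(bag.predict(X[:3]))
```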


Bootstrap trees


Combination of weak predictors

Why can bagging lower variance?

$$\operatorname{var}\big[\hat f_{\mathrm{bag}}(x)\big] = \frac{1}{B^2}\sum_{b=1}^{B} \operatorname{var}\big[\hat f^{*b}(x)\big] + \frac{1}{B^2}\sum_{b \neq b'} \operatorname{cov}\big[\hat f^{*b}(x),\, \hat f^{*b'}(x)\big]$$

The variance is a trade-off between the variance of the individual predictors and their correlation. If the predictors are identically distributed with variance $\sigma^2$ and pairwise correlation $\rho > 0$, the variance is given by

$$\operatorname{var}\big[\hat f_{\mathrm{bag}}(x)\big] = \frac{\sigma^2}{B} + \frac{B(B-1)}{B^2}\,\rho\sigma^2 = \rho\sigma^2 + \frac{1-\rho}{B}\,\sigma^2.$$

The degree of pairwise correlation between the bagged trees limits the benefit of bagging!
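A quick numerical check of this formula, assuming NumPy; the values of $B$, $\rho$ and $\sigma^2$ are arbitrary illustrative choices:

```python
import numpy as np

B, rho, sigma2 = 25, 0.5, 1.0
# equicorrelation covariance matrix of B identically distributed predictors
cov = sigma2 * (rho * np.ones((B, B)) + (1 - rho) * np.eye(B))

rng = np.random.default_rng(6)
draws = rng.multivariate_normal(np.zeros(B), cov, size=200_000)

print(draws.mean(axis=1).var())                 # empirical variance of the average
print(rho * sigma2 + (1 - rho) * sigma2 / B)    # theoretical value: 0.52
```

As $B$ grows, the second term vanishes and the variance is floored at $\rho\sigma^2$.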


Random forests

The idea of random forests is to reduce the correlation between the trees, without increasing the variance too much.

This is achieved through random selection of the input variables: before each split, select $m \le p$ of the input variables at random as candidates for splitting.

The parameter $m$ is typically $\sqrt{p}$ or even as small as 1.
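A sketch of this in scikit-learn, with simulated data; max_features plays the role of $m$, and "sqrt" gives $m \approx \sqrt{p}$:

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(7)
X = rng.uniform(size=(300, 9))                                    # p = 9 inputs
y = X[:, 0] + 2 * X[:, 1] + rng.normal(scale=0.3, size=300)

# m = sqrt(p) = 3 candidate variables are drawn at random before each split
rf = RandomForestRegressor(n_estimators=200, max_features="sqrt",
                           random_state=0).fit(X, y)
print(rf.predict(X[:3]))
```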


Some characteristics of random forests

Out-of-bag (OOB) samples: for each observation $z_i$, construct its random forest predictor by averaging only over those bootstrap samples in which the observation did not appear.

This is almost identical to leave-one-out cross-validation, and it can be done alongside the fitting: once the OOB error stabilizes, the training can stop (see the sketch below).

For large $p$ with only a small fraction of relevant variables, random forests with small $m$ perform poorly.
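A sketch of the OOB error in scikit-learn, with simulated data; for a regressor, oob_score_ is reported as $R^2$ on the out-of-bag predictions:

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(8)
X = rng.uniform(size=(500, 10))
y = X[:, 0] - X[:, 1] + rng.normal(scale=0.3, size=500)

# with oob_score=True each observation is predicted by averaging only the trees
# whose bootstrap sample did not contain it
rf = RandomForestRegressor(n_estimators=300, oob_score=True,
                           random_state=0).fit(X, y)
print(rf.oob_score_)
```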


Boosting or bagging??

Bagging: the bias of the bagged trees is the same as for the individual trees (identically distributed), but there is hope for improvement through variance reduction.

Boosting: trees are grown in an adaptive way to remove bias (not identically distributed).

HTF state: “In our experience random forests do remarkably well with very little tuning required. But often boosting seems to do better...”


Summary

Boosting and bagging are both ensemble techniques, where weak learners, such as trees, are combined through averaging or majority vote to create a strong learner.

Bagging/RF: improvement only in the variance or stability, by combining bootstrap samples (drawn with replacement).
Needs an unstable/high-variance weak learner.
Performs better if the learners are decorrelated without increasing the variance.

Boosting: improves the learners in an adaptive way to (slowly) remove bias.
AdaBoost: uses all the data to train each learner, but misclassified instances are given more weight in subsequent steps.
Boosting with shrinkage: implicit L1, lasso-style penalty.
