Random Forest And Analytics


    Random Forest

Applied Multivariate Statistics, Spring 2012


    Overview

    Intuition of Random Forest

    The Random Forest Algorithm

    De-correlation gives better accuracy

Out-of-bag error (OOB error)

Variable importance

[Figure: several trees voting on a sample; predictions: diseased, diseased, healthy, healthy, diseased]


    Intuition of Random Forest

[Figure: three decision trees (Tree 1, Tree 2, Tree 3) grown for the same problem, splitting on variables such as young/old, short/tall, male/female, and working/retired, with leaves labeled healthy or diseased]

New sample: old, retired, male, short

Tree predictions: diseased, healthy, diseased

Majority rule: diseased
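As a minimal R sketch of the majority rule (the three predictions are the slide's example values):

# Predictions of the three trees for the new sample (old, retired, male, short)
tree_preds <- c("diseased", "healthy", "diseased")

# Majority rule: pick the most frequent class label
votes <- table(tree_preds)
names(which.max(votes))  # "diseased"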


    The Random Forest Algorithm



    Differences to standard tree

Train each tree on a bootstrap resample of the data (bootstrap resample of a data set with N samples: make a new data set by drawing N samples with replacement; i.e., some samples will probably occur multiple times in the new data set)

    For each split, consider only m randomly selected variables

Don't prune

Fit B trees in this way and aggregate the results by averaging (regression) or majority voting (classification)
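In R's randomForest package these choices map directly onto arguments: ntree is B and mtry is m, and the trees are grown unpruned. A minimal sketch on the built-in iris data:

library(randomForest)

set.seed(1)
# B = 500 unpruned trees; m = 2 variables considered at each split;
# each tree is trained on a bootstrap resample of the data
rf <- randomForest(Species ~ ., data = iris, ntree = 500, mtry = 2)
print(rf)  # OOB error estimate and confusion matrix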



    Why Random Forest works 1/2

Mean Squared Error = Variance + Bias²

    If trees are sufficiently deep, they have very small bias

How could we improve the variance over that of a single tree?



    Why Random Forest works 2/2

For B identically distributed trees, each with variance σ² and pairwise correlation ρ, the variance of their average is

Var(average) = ρ σ² + ((1 − ρ) / B) σ²

The second term decreases if the number of trees B increases (irrespective of ρ).

The first term decreases if ρ decreases, i.e., if m decreases.

De-correlation gives better accuracy.
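A small simulation (a sketch with made-up values, σ = 1 and equicorrelated normal "predictions") illustrating the formula: increasing B only removes the second term, while decreasing ρ also shrinks the first:

set.seed(1)
var_of_average <- function(B, rho, sigma = 1, n_rep = 20000) {
  Z <- rnorm(n_rep)                        # shared component (gives correlation rho)
  E <- matrix(rnorm(n_rep * B), n_rep, B)  # independent components
  X <- sigma * (sqrt(rho) * Z + sqrt(1 - rho) * E)  # B correlated "tree predictions"
  var(rowMeans(X))                         # empirical variance of the average
}

var_of_average(B = 100, rho = 0.5)  # ~ 0.505 = 0.5 + 0.5/100
var_of_average(B = 100, rho = 0.1)  # ~ 0.109 = 0.1 + 0.9/100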


Estimating generalization error: Out-of-bag (OOB) error

Similar to leave-one-out cross-validation, but almost without any additional computational burden

The OOB error is a random quantity, since it is based on random resamples of the data

Worked example:

[Figure: decision tree grown on the resampled data, splitting on young/old and then short/tall, with leaves labeled healthy and diseased]

Data:
old, tall → healthy
old, short → diseased
young, tall → healthy
young, short → diseased
young, short → healthy
young, tall → healthy
old, short → diseased

Resampled data (drawn with replacement; used to grow the tree):
old, tall → healthy
old, short → diseased
young, tall → healthy
young, tall → healthy

Out-of-bag samples (not drawn into the resample):
young, short → diseased
young, short → healthy
young, tall → healthy
old, short → diseased

The tree misclassifies 1 of the 4 out-of-bag samples, so the out-of-bag (OOB) error rate = 1/4 = 0.25.
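In R, the randomForest function computes this estimate automatically; a minimal sketch (using iris as a stand-in for the toy data above):

library(randomForest)

set.seed(1)
rf <- randomForest(Species ~ ., data = iris, ntree = 500)

# Each sample is predicted only by the trees for which it was out of bag
head(rf$predicted)                  # OOB predictions
mean(rf$predicted != iris$Species)  # OOB error rate, also shown by print(rf)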


Variable Importance for variable i using Permutations

From the data, draw resampled data sets 1, …, m; grow tree j on resampled data set j, and let OOB data set j be the samples not drawn.

For each tree j = 1, …, m:

OOB error e_j: error of tree j on its OOB data set

Permute the values of variable i in OOB data set j; OOB error p_j: error of tree j on the permuted OOB data set

d_j = p_j − e_j (permuting an important variable increases the OOB error)

Aggregate over the trees:

d̄ = (1/m) Σ_{j=1}^{m} d_j

s_d² = (1/(m−1)) Σ_{j=1}^{m} (d_j − d̄)²

Variable importance of variable i: v_i = d̄ / s_d
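With importance = TRUE, randomForest computes this permutation measure; a minimal sketch:

library(randomForest)

set.seed(1)
rf <- randomForest(Species ~ ., data = iris, importance = TRUE)

# type = 1: mean decrease in accuracy under permutation,
# scaled by its standard deviation (the v_i above)
importance(rf, type = 1)
varImpPlot(rf)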


    Trees vs. Random Forest

Trees:

+ Yield insight into decision rules
+ Rather fast
+ Easy to tune parameters
- Predictions tend to have a high variance

Random Forest:

+ Smaller prediction variance and therefore usually a better general performance
+ Easy to tune parameters
- Rather slow
- Black box: rather hard to get insight into the decision rules


Comparing runtime (just for illustration)

[Plot: runtime of RF vs. a single tree for up to thousands of variables]

Problematic if there are categorical predictors with many levels (max: 32 levels)

RF: first predictor cut into 15 levels
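A rough way to see this in R (a sketch; timings depend on data size and hardware) is to time a single rpart tree against a forest on simulated data:

library(randomForest)
library(rpart)

set.seed(1)
n <- 5000
d <- data.frame(matrix(rnorm(n * 20), n, 20))  # 20 numeric predictors
d$y <- factor(d$X1 + rnorm(n) > 0)             # binary response

system.time(rpart(y ~ ., data = d))            # single tree: fast
system.time(randomForest(y ~ ., data = d))     # 500 trees: slower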


RF vs. LDA

LDA:

+ Very fast
+ Discriminants for visualizing group separation
+ Can read off the decision rule
- Can model only linear class boundaries
- Mediocre performance
- No variable selection
- Only for categorical responses
- Needs CV for estimating the prediction error

RF:

+ Can model nonlinear class boundaries
+ OOB error for free (no CV needed)
+ Works for continuous and categorical responses (regression / classification)
+ Gives variable importance
+ Very good performance
- Black box
- Slow



    Concepts to know

Idea of Random Forest and how it reduces the prediction variance of trees

    OOB error

    Variable Importance based on Permutation



    R functions to know

Functions randomForest and varImpPlot from package randomForest
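A compact end-to-end example of both functions (a sketch on the built-in iris data):

library(randomForest)

set.seed(1)
rf <- randomForest(Species ~ ., data = iris, importance = TRUE)

print(rf)                           # OOB error and confusion matrix
varImpPlot(rf)                      # permutation-based importance plot
predict(rf, newdata = iris[1:3, ])  # predict new samples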
