Random Forest And Analytics


    Random Forest

Applied Multivariate Statistics, Spring 2012


    Overview

    Intuition of Random Forest

    The Random Forest Algorithm

    De-correlation gives better accuracy

Out-of-bag error (OOB error)

Variable importance

[Figure: several trees voting on a sample; predictions: diseased, diseased, healthy, healthy, diseased]


    Intuition of Random Forest

[Figure: three decision trees (Tree 1, Tree 2, Tree 3) grown for the same problem, splitting on variables such as young/old, short/tall, male/female, and working/retired, with leaves labeled healthy or diseased]

New sample: old, retired, male, short

Tree predictions: diseased, healthy, diseased

Majority rule: diseased
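As a minimal R sketch of the majority rule (the three predictions are the slide's example values):

# Predictions of the three trees for the new sample (old, retired, male, short)
tree_preds <- c("diseased", "healthy", "diseased")

# Majority rule: pick the most frequent class label
votes <- table(tree_preds)
names(which.max(votes))  # "diseased"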


    The Random Forest Algorithm



    Differences to standard tree

Train each tree on a bootstrap resample of the data (bootstrap resample of a data set with N samples: make a new data set by drawing N samples with replacement; i.e., some samples will probably occur multiple times in the new data set)

    For each split, consider only m randomly selected variables

Don't prune

Fit B trees in this way and aggregate the results by averaging (regression) or majority voting (classification)
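In R's randomForest package these choices map directly onto arguments: ntree is B and mtry is m, and the trees are grown unpruned. A minimal sketch on the built-in iris data:

library(randomForest)

set.seed(1)
# B = 500 unpruned trees; m = 2 variables considered at each split;
# each tree is trained on a bootstrap resample of the data
rf <- randomForest(Species ~ ., data = iris, ntree = 500, mtry = 2)
print(rf)  # OOB error estimate and confusion matrix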



    Why Random Forest works 1/2

Mean Squared Error = Variance + Bias²

    If trees are sufficiently deep, they have very small bias

How could we improve the variance over that of a single tree?



    Why Random Forest works 2/2

For B identically distributed trees, each with variance σ² and pairwise correlation ρ, the variance of their average is

Var(average) = ρ σ² + ((1 − ρ) / B) σ²

The second term decreases if the number of trees B increases (irrespective of ρ).

The first term decreases if ρ decreases, i.e., if m decreases.

De-correlation gives better accuracy.
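A small simulation (a sketch with made-up values, σ = 1 and equicorrelated normal "predictions") illustrating the formula: increasing B only removes the second term, while decreasing ρ also shrinks the first:

set.seed(1)
var_of_average <- function(B, rho, sigma = 1, n_rep = 20000) {
  Z <- rnorm(n_rep)                        # shared component (gives correlation rho)
  E <- matrix(rnorm(n_rep * B), n_rep, B)  # independent components
  X <- sigma * (sqrt(rho) * Z + sqrt(1 - rho) * E)  # B correlated "tree predictions"
  var(rowMeans(X))                         # empirical variance of the average
}

var_of_average(B = 100, rho = 0.5)  # ~ 0.505 = 0.5 + 0.5/100
var_of_average(B = 100, rho = 0.1)  # ~ 0.109 = 0.1 + 0.9/100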


Estimating generalization error: Out-of-bag (OOB) error

Similar to leave-one-out cross-validation, but almost without any additional computational burden

The OOB error is a random quantity, since it is based on random resamples of the data

Worked example:

[Figure: decision tree grown on the resampled data, splitting on young/old and then short/tall, with leaves labeled healthy and diseased]

Data:
old, tall → healthy
old, short → diseased
young, tall → healthy
young, short → diseased
young, short → healthy
young, tall → healthy
old, short → diseased

Resampled data (drawn with replacement; used to grow the tree):
old, tall → healthy
old, short → diseased
young, tall → healthy
young, tall → healthy

Out-of-bag samples (not drawn into the resample):
young, short → diseased
young, short → healthy
young, tall → healthy
old, short → diseased

The tree misclassifies 1 of the 4 out-of-bag samples, so the out-of-bag (OOB) error rate = 1/4 = 0.25.
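In R, the randomForest function computes this estimate automatically; a minimal sketch (using iris as a stand-in for the toy data above):

library(randomForest)

set.seed(1)
rf <- randomForest(Species ~ ., data = iris, ntree = 500)

# Each sample is predicted only by the trees for which it was out of bag
head(rf$predicted)                  # OOB predictions
mean(rf$predicted != iris$Species)  # OOB error rate, also shown by print(rf)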


Variable Importance for variable i using Permutations

From the data, draw resampled data sets 1, …, m; grow tree j on resampled data set j, and let OOB data set j be the samples not drawn.

For each tree j = 1, …, m:

OOB error e_j: error of tree j on its OOB data set

Permute the values of variable i in OOB data set j; OOB error p_j: error of tree j on the permuted OOB data set

d_j = p_j − e_j (permuting an important variable increases the OOB error)

Aggregate over the trees:

d̄ = (1/m) Σ_{j=1}^{m} d_j

s_d² = (1/(m−1)) Σ_{j=1}^{m} (d_j − d̄)²

Variable importance of variable i: v_i = d̄ / s_d
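With importance = TRUE, randomForest computes this permutation measure; a minimal sketch:

library(randomForest)

set.seed(1)
rf <- randomForest(Species ~ ., data = iris, importance = TRUE)

# type = 1: mean decrease in accuracy under permutation,
# scaled by its standard deviation (the v_i above)
importance(rf, type = 1)
varImpPlot(rf)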


    Trees vs. Random Forest

Trees:

+ Yield insight into decision rules
+ Rather fast
+ Easy to tune parameters
- Predictions tend to have a high variance

Random Forest:

+ Smaller prediction variance and therefore usually a better general performance
+ Easy to tune parameters
- Rather slow
- Black box: rather hard to get insight into the decision rules


Comparing runtime (just for illustration)

[Plot: runtime of RF vs. a single tree for up to thousands of variables]

Problematic if there are categorical predictors with many levels (max: 32 levels)

RF: first predictor cut into 15 levels
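A rough way to see this in R (a sketch; timings depend on data size and hardware) is to time a single rpart tree against a forest on simulated data:

library(randomForest)
library(rpart)

set.seed(1)
n <- 5000
d <- data.frame(matrix(rnorm(n * 20), n, 20))  # 20 numeric predictors
d$y <- factor(d$X1 + rnorm(n) > 0)             # binary response

system.time(rpart(y ~ ., data = d))            # single tree: fast
system.time(randomForest(y ~ ., data = d))     # 500 trees: slower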


RF vs. LDA

LDA:

+ Very fast
+ Discriminants for visualizing group separation
+ Can read off the decision rule
- Can model only linear class boundaries
- Mediocre performance
- No variable selection
- Only for categorical responses
- Needs CV for estimating the prediction error

RF:

+ Can model nonlinear class boundaries
+ OOB error for free (no CV needed)
+ Works for continuous and categorical responses (regression / classification)
+ Gives variable importance
+ Very good performance
- Black box
- Slow



    Concepts to know

Idea of Random Forest and how it reduces the prediction variance of trees

    OOB error

    Variable Importance based on Permutation



    R functions to know

Functions randomForest and varImpPlot from package randomForest
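A compact end-to-end example of both functions (a sketch on the built-in iris data):

library(randomForest)

set.seed(1)
rf <- randomForest(Species ~ ., data = iris, importance = TRUE)

print(rf)                           # OOB error and confusion matrix
varImpPlot(rf)                      # permutation-based importance plot
predict(rf, newdata = iris[1:3, ])  # predict new samples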
