
Page 1: Tree Depth in a Forest - NUS

Tree Depth in a Forest

NUS / IMS Workshop on Classification and Regression Trees

Mark Segal
Center for Bioinformatics & Molecular Biostatistics

Division of Bioinformatics
Department of Epidemiology and Biostatistics

UCSF

Page 2: Tree Depth in a Forest - NUS

• Breiman, Friedman, Olshen, Stone (1984)

• Popularized tree-structured techniques

• Primary distinction from earlier approaches?

• Means for determining tree size

• Grow large / maximal initial tree

• capture all potential action

• Cost-complexity pruning

• Cross-validation based selection (see the sketch below)

• Size determination critical consideration

• Why??

CART
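A minimal sketch of the grow-then-prune workflow outlined above, assuming the rpart and mlbench R packages; the Boston Housing data and the control settings are illustrative choices, not taken from the slides:

```r
## Grow a deliberately large initial tree (cp = 0, minsplit = 2), then prune back to the
## subtree whose complexity parameter minimizes the cross-validated error (xerror).
library(rpart)
library(mlbench)

data(BostonHousing)
big <- rpart(medv ~ ., data = BostonHousing,
             control = rpart.control(cp = 0, minsplit = 2, xval = 10))

cptab  <- big$cptable                                  # one row per candidate subtree
best   <- cptab[which.min(cptab[, "xerror"]), "CP"]    # cross-validation based selection
pruned <- prune(big, cp = best)                        # cost-complexity pruning
```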

Page 3: Tree Depth in a Forest - NUS

[Figure 2.11 (from Chapter 2, "Overview of Supervised Learning"): Test and training error as a function of model complexity. Prediction error is plotted against model complexity (low to high) for the training sample and the test sample; low complexity corresponds to high bias / low variance, high complexity to low bias / high variance.]

be close to f(x0). As k grows, the neighbors are further away, and then anything can happen.

The variance term is simply the variance of an average here, and decreases as the inverse of k. So as k varies, there is a bias–variance tradeoff.

More generally, as the model complexity of our procedure is increased, the variance tends to increase and the squared bias tends to decrease. The opposite behavior occurs as the model complexity is decreased. For k-nearest neighbors, the model complexity is controlled by k.

Typically we would like to choose our model complexity to trade bias off with variance in such a way as to minimize the test error. An obvious estimate of test error is the training error $\frac{1}{N}\sum_i (y_i - \hat{y}_i)^2$. Unfortunately training error is not a good estimate of test error, as it does not properly account for model complexity.

Figure 2.11 shows the typical behavior of the test and training error, as model complexity is varied. The training error tends to decrease whenever we increase the model complexity, that is, whenever we fit the data harder. However with too much fitting, the model adapts itself too closely to the training data, and will not generalize well (i.e., have large test error). In that case the predictions $\hat{f}(x_0)$ will have large variance, as reflected in the last term of expression (2.46). In contrast, if the model is not complex enough, it will underfit and may have large bias, again resulting in poor generalization. In Chapter 7 we discuss methods for estimating the test error of a prediction method, and hence estimating the optimal amount of model complexity for a given prediction method and training set.

Predictive Performance

Page 4: Tree Depth in a Forest - NUS

Predictive Performance

CART Monograph

Page 5: Tree Depth in a Forest - NUS

• CART lived happily ever after

• widespread uptake in diverse fields

• many methodological refinements

• this workshop (thanks Wei-Yin!)

• But, what about predictive performance??

Page 6: Tree Depth in a Forest - NUS

• The better the model fits, the more sound the inference

• Conventional models and CART tend to fit very poorly

• Fit measured by prediction error (PE)

• Substantial gains in PE can be achieved by using ensembles of (weak) predictors

• in particular, individual trees

Breiman Mantra

Page 7: Tree Depth in a Forest - NUS

• Breiman (2001a,b)

• Have become a forefront prediction technique

• Notable gains in prediction performance over individual trees

• PE variance reduced by averaging over the randomness-injected ensemble

• Individual trees grown to large / maximal depth

• Major departure from CART paradigm

• Seemingly, averaging over the ensemble more than compensates for increased individual tree variability

Random Forests

Page 8: Tree Depth in a Forest - NUS

A RF is a collection of tree predictors

$h(x; \theta_t),\ t = 1, \dots, T$; the $\theta_t$ are iid random vectors.

For regression, the forest prediction is the unweighted average over the collection:

$\bar{h}(x) = \frac{1}{T}\sum_{t=1}^{T} h(x; \theta_t)$

As $T \to \infty$ the Law of Large Numbers ensures

$E_{X,Y}(Y - \bar{h}(X))^2 \;\to\; E_{X,Y}(Y - E_{\theta} h(X; \theta))^2 \;\equiv\; \mathrm{PE}^{*}_{f}$, the forest prediction error.

Convergence implies forests don't overfit
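A minimal sketch of both points, assuming the randomForest and mlbench packages stand in for the forests discussed here; the simulated Friedman #1 data and all settings are illustrative:

```r
## (1) the forest prediction equals the unweighted average of the T tree predictions;
## (2) the out-of-bag error stabilizes as T grows, mirroring the LLN convergence to PE*_f.
library(randomForest)
library(mlbench)

set.seed(1)
sim <- mlbench.friedman1(500)                       # synthetic regression benchmark
rf  <- randomForest(sim$x, sim$y, ntree = 1000)

pr <- predict(rf, sim$x, predict.all = TRUE)        # keep the individual tree predictions
all.equal(pr$aggregate, rowMeans(pr$individual))    # forest prediction = unweighted average

plot(rf$mse, type = "l",
     xlab = "Number of trees T", ylab = "Out-of-bag mean squared error")
```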

Page 9: Tree Depth in a Forest - NUS
Page 10: Tree Depth in a Forest - NUS

• Growing trees to maximal depth minimizes bias

• But potentially incurs prediction variance cost

• Averaging over ensemble putatively handles this

• But how was it established that such averaging (more than) compensates for increased individual tree variability??

• Hard to address theoretically (will try later)

• Breiman (2001a,b) addressed empirically using

• UCI Irvine machine learning benchmark datasets

• Includes classification and regression problems

• Simulated and (predominantly) real data

• Exported to R mlbench library
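The exported benchmark datasets can be listed directly from the package (a trivial sketch):

```r
library(mlbench)
data(package = "mlbench")$results[, "Item"]   # names of the bundled benchmark datasets
```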

Page 11: Tree Depth in a Forest - NUS

STATISTICAL MODELING: THE TWO CULTURES

Table 1: Data set descriptions

Data set      Training sample size   Test sample size   Variables   Classes
Cancer                 699                  —                 9          2
Ionosphere             351                  —                34          2
Diabetes               768                  —                 8          2
Glass                  214                  —                 9          6
Soybean                683                  —                35         19
Letters             15,000               5000                16         26
Satellite            4,435               2000                36          6
Shuttle             43,500             14,500                 9          7
DNA                  2,000              1,186                60          3
Digit                7,291              2,007               256         10

that in many states, the trials were anything but speedy. It funded a study of the causes of the delay. I visited many states and decided to do the analysis in Colorado, which had an excellent computerized court data system. A wealth of information was extracted and processed.

The dependent variable for each criminal case was the time from arraignment to the time of sentencing. All of the other information in the trial history were the predictor variables. A large decision tree was grown, and I showed it on an overhead and explained it to the assembled Colorado judges. One of the splits was on District N which had a larger delay time than the other districts. I refrained from commenting on this. But as I walked out I heard one judge say to another, “I knew those guys in District N were dragging their feet.”

While trees rate an A+ on interpretability, they are good, but not great, predictors. Give them, say, a B on prediction.

9.1 Growing Forests for Prediction

Instead of a single tree predictor, grow a forest of trees on the same data—say 50 or 100. If we are classifying, put the new x down each tree in the forest and get a vote for the predicted class. Let the forest prediction be the class that gets the most votes. There has been a lot of work in the last five years on ways to grow the forest. All of the well-known methods grow the forest by perturbing the training set, growing a tree on the perturbed training set, perturbing the training set again, growing another tree, etc. Some familiar methods are bagging (Breiman, 1996b), boosting (Freund and Schapire, 1996), arcing (Breiman, 1998), and additive logistic regression (Friedman, Hastie and Tibshirani, 1998).

My preferred method to date is random forests. In this approach successive decision trees are grown by introducing a random element into their construction. For example, suppose there are 20 predictor variables. At each node choose several of the 20 at random to use to split the node. Or use a random combination of a random selection of a few variables. This idea appears in Ho (1998), in Amit and Geman (1997) and is developed in Breiman (1999).

9.2 Forests Compared to Trees

We compare the performance of single trees (CART) to random forests on a number of small and large data sets, mostly from the UCI repository (ftp.ics.uci.edu/pub/MachineLearningDatabases). A summary of the data sets is given in Table 1.

Table 2 compares the test set error of a single tree to that of the forest. For the five smaller data sets above the line, the test set error was estimated by leaving out a random 10% of the data, then running CART and the forest on the other 90%. The left-out 10% was run down the tree and the forest and the error on this 10% computed for both. This was repeated 100 times and the errors averaged. The larger data sets below the line came with a separate test set. People who have been in the classification field for a while find these increases in accuracy startling. Some errors are halved. Others are reduced by one-third. In regression, where the […]

Table 2: Test set misclassification error (%)

Data set         Forest   Single tree
Breast cancer      2.9        5.9
Ionosphere         5.5       11.2
Diabetes          24.2       25.3
Glass             22.0       30.4
Soybean            5.7        8.6
Letters            3.4       12.4
Satellite          8.6       14.8
Shuttle (×10³)     7.0       62.0
DNA                3.9        6.2
Digit              6.2       17.1

Breiman (2001a,b)

Some classification results from UCI Irvine machine learning benchmark datasets:
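A minimal sketch of the 10% holdout comparison described in the excerpt above, assuming rpart for the single tree and randomForest for the forest; the Diabetes benchmark and the reduced number of repeats (20 rather than 100) are illustrative choices:

```r
## Leave out a random 10%, fit a single tree and a forest on the other 90%,
## score both on the held-out 10%, and average over repeats.
library(rpart)
library(randomForest)
library(mlbench)

data(PimaIndiansDiabetes)
dat <- PimaIndiansDiabetes
set.seed(1)

err <- replicate(20, {
  hold   <- sample(nrow(dat), round(0.1 * nrow(dat)))
  tree   <- rpart(diabetes ~ ., data = dat[-hold, ])
  forest <- randomForest(diabetes ~ ., data = dat[-hold, ])
  c(tree   = mean(predict(tree,   dat[hold, ], type = "class") != dat$diabetes[hold]),
    forest = mean(predict(forest, dat[hold, ])                 != dat$diabetes[hold]))
})
rowMeans(err)   # average misclassification: single tree vs. forest
```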

Page 12: Tree Depth in a Forest - NUS

• Many further comparisons using the UCI Irvine / mlbench repository datasets:

• several modeling / prediction frameworks:

• CART, ANNs, LDA, QDA, kNNs...

• regression and classification problems

• Conclusion: “Random Forests are A+ predictors”

• Discussion (Efron): Lots of knobs (tuning parameters)

• Rejoinder (Breiman): Essentially only one (mtry)

Page 13: Tree Depth in a Forest - NUS

• Random Forests have lived happily ever after

• But, let's take a closer look at the UCI Irvine / mlbench repository datasets

Page 14: Tree Depth in a Forest - NUS

[Figure 2: UCI Repository: Regression tree prediction error profiles (cross-validated error vs. number of splits) for Augmented Friedman #1, Boston Housing, Servo, Friedman #1, Friedman #2, and Friedman #3. Note that the upper left plot corresponds to modification of a synthetic repository dataset in order to achieve a non-monotone error profile; see Section 4.]

[Figure 3: UCI Repository: Classification tree prediction error profiles, I (Breast Cancer, Bupa Liver, Diabetes, Glass, Image, Ionosphere); cross-validated error vs. number of splits.]
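A minimal sketch of how profiles of this kind can be produced, assuming rpart; the Ionosphere dataset is an illustrative choice:

```r
## Fit an unpruned classification tree and plot the cross-validated relative error
## (xerror) from the cp table against the number of splits.
library(rpart)
library(mlbench)

data(Ionosphere)
big <- rpart(Class ~ ., data = Ionosphere,
             control = rpart.control(cp = 0, minsplit = 2, xval = 10))

cp <- big$cptable
plot(cp[, "nsplit"], cp[, "xerror"], type = "b",
     xlab = "Number of Splits", ylab = "Cross-validated Error", main = "Ionosphere")
```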

Page 15: Tree Depth in a Forest - NUS

[Figure 4: UCI Repository: Classification tree prediction error profiles, II (Letter Recognition, Promoters, Ringnorm, Satellite, Sonar, Soybean); cross-validated error vs. number of splits.]

[Figure 5: UCI Repository: Classification tree prediction error profiles, III (Threenorm, Twonorm, Vehicle, House Votes 84, Vowel, Waveform); cross-validated error vs. number of splits.]

Of course the error profiles depend on the class of model being fitted. While it is appropriate to utilize tree-structured models in dissecting the random forest mechanism, it is also purposeful to assess whether the datasets can't be overfit under other model classes. To that end we investigate error profiles corresponding to least angle regression (LARS). LARS represents a recently devised (Efron et al., 2004) technique that […]

Page 16: Tree Depth in a Forest - NUS

[Figure 4: UCI Repository: Classification tree prediction error profiles, II.]

[Figure 5: UCI Repository: Classification tree prediction error profiles, III.]

Page 17: Tree Depth in a Forest - NUS

[Figure 2: UCI Repository: Regression tree prediction error profiles.]

[Figure 3: UCI Repository: Classification tree prediction error profiles, I.]

Page 18: Tree Depth in a Forest - NUS

• Almost all UCI Irvine machine learning benchmark datasets exhibit this behaviour:

• they are hard to overfit {not just with trees}

• This will make the Random Forest strategy of growing trees to maximal depth look good

• “Benchmarks” are not representative of what is, at least, thought to be prototypic

• Will next showcase such an example

• Then offer some theory and characterizations

Page 19: Tree Depth in a Forest - NUS

[Plot: cross-validated error vs. number of splits (0 to 5000 splits).]

Page 20: Tree Depth in a Forest - NUS

Basal Splicing Signals

• Pre-messenger RNA splicing - responsible for precise removal of introns - is an essential step in expression of most genes

• Exons defined by short, degenerate splice site sequences at intron/exon boundaries: 5’ splice site (5’ss, donor); 3’ss, acceptor

• Each ss has a consensus sequence motif: essential nucleotides plus base usage preferences in flanking positions

Page 21: Tree Depth in a Forest - NUS

• Despite requirement for accurate splicing, human ss only moderately conserved

• Implies an abundance of decoy ss

• Further, strong and complex dependencies between ss nucleotides exist

• Improved understanding of basal ss is important for exon recognition and, ultimately, disease impact of splicing defects

• Approach as a classification problem -- real vs decoy ss -- using large database

Page 22: Tree Depth in a Forest - NUS

• Objective: predict 3’ splice site sequences

• Large n, small p datasets:

• training 8465 real; 180957 decoy

• test 4233 real; 90494 decoy

  ATTCTTACAAGTCCAATAAGGTT  real
  GAATCGCTTGAACCTGGGAGGTG  real
  CTGAAATGTCTCATCTGCAGTAC  decoy
  ATTTTATTTTTAAATTGCAGGTA  decoy

• each (non-degenerate, aligned) position constitutes an unordered covariate (p = 21; see the sketch below)

• data generation: Yeo and Burge (2003).
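An illustrative sketch (not the authors' pipeline) of this covariate construction, using the four example sequences shown on the slide; the assumption that the invariant AG dinucleotide sits at aligned positions 19-20 and is dropped to leave p = 21 positional factors is mine:

```r
## Each aligned sequence position becomes an unordered (factor) covariate.
seqs  <- c("ATTCTTACAAGTCCAATAAGGTT",    # real
           "GAATCGCTTGAACCTGGGAGGTG",    # real
           "CTGAAATGTCTCATCTGCAGTAC",    # decoy
           "ATTTTATTTTTAAATTGCAGGTA")    # decoy
label <- factor(c("real", "real", "decoy", "decoy"))

chars <- do.call(rbind, strsplit(seqs, ""))                        # one column per aligned position
X <- as.data.frame(chars[, -c(19, 20)], stringsAsFactors = TRUE)   # drop the invariant AG (assumed)
names(X) <- paste0("pos", seq_len(ncol(X)))                        # p = 21 unordered covariates
str(X)
```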

Page 23: Tree Depth in a Forest - NUS

[Plot: "3'ss: CV error for a single tree"; cross-validated error vs. number of splits (0 to 5000 splits).]

Page 24: Tree Depth in a Forest - NUS

[Plot: Random Forest ROC Curves, Test 3'ss Data. True Positive Rate (Sensitivity) vs. False Positive Rate (1 − Specificity), comparing Split Control and Node Size Control.]
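A minimal sketch of this kind of comparison, assuming the randomForest package and a synthetic two-class problem in place of the 3'ss data: one forest restricts terminal node size (nodesize), the other caps the number of terminal nodes (maxnodes, i.e. split control), and ROC curves are traced from the class-vote fractions:

```r
library(randomForest)
library(mlbench)

## Trace an ROC curve from a score vector by sweeping the classification threshold.
roc <- function(score, truth, pos) {
  thr <- sort(unique(score), decreasing = TRUE)
  data.frame(fpr = sapply(thr, function(t) mean(score[truth != pos] >= t)),
             tpr = sapply(thr, function(t) mean(score[truth == pos] >= t)))
}

set.seed(1)
tr <- mlbench.twonorm(2000)
te <- mlbench.twonorm(2000)

rf_node  <- randomForest(tr$x, tr$classes, nodesize = 50)   # node size control
rf_split <- randomForest(tr$x, tr$classes, maxnodes = 16)   # split control

r1 <- roc(predict(rf_node,  te$x, type = "prob")[, "1"], te$classes, "1")
r2 <- roc(predict(rf_split, te$x, type = "prob")[, "1"], te$classes, "1")

plot(r1$fpr, r1$tpr, type = "l",
     xlab = "False Positive Rate (1 - Specificity)",
     ylab = "True Positive Rate (Sensitivity)")
lines(r2$fpr, r2$tpr, lty = 2)
legend("bottomright", c("Node Size Control", "Split Control"), lty = 1:2)
```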

Page 25: Tree Depth in a Forest - NUS

[Plot: ROC Curves, Test 3'ss Data. True Positive Rate (Sensitivity) vs. False Positive Rate (1 − Specificity) for Random Forests, Support Vector Machines, and Maximum Entropy Models.]

{Aside: comparisons}

Page 26: Tree Depth in a Forest - NUS

• Individual tree size determined by inter-related tuning parameters that govern (terminal) node size, number of splits, depth, split improvement

• A priori regulation via node size specifications problematic in large n situations

• Guidelines, rules-of-thumb as a function of n are lacking (cf. defaults for m)

• Leekasso

Tree Depth in a Forest

Page 27: Tree Depth in a Forest - NUS

• Lin and Jeon (2006, JASA)

• Develop construct of k-PNNs (potential nearest neighbours; see the sketch below)

• Establish connections between Random Forests and k-PNNs where k is terminal node size

• k = 1 for trees grown to maximal depth

• Enables analysis of role of tree depth

Potential Nearest Neighbours
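An illustrative sketch of the k-PNN construct, under my paraphrase of the Lin and Jeon definition: a sample point x_i is a k-potential nearest neighbour of a target x0 if fewer than k other sample points fall in the axis-aligned hyperrectangle spanned by x0 and x_i, so that some sequence of rectangular splits could leave x_i in x0's size-k terminal node:

```r
is_kpnn <- function(i, x0, X, k) {
  lo <- pmin(x0, X[i, ]); hi <- pmax(x0, X[i, ])
  inside <- apply(X, 1, function(z) all(z >= lo & z <= hi))   # sample points in the rectangle
  sum(inside[-i]) < k                                         # fewer than k others inside
}

set.seed(1)
p  <- 5
X  <- matrix(runif(500 * p), 500, p)
x0 <- runif(p)
sum(sapply(seq_len(nrow(X)), is_kpnn, x0 = x0, X = X, k = 1))   # number of 1-PNNs of x0
```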

Page 28: Tree Depth in a Forest - NUS
Page 29: Tree Depth in a Forest - NUS
Page 30: Tree Depth in a Forest - NUS

Under simplifying assumptions Lin and Jeon show that a lower bound on the rate of convergence of RF MSE is $k^{-1}(\log n)^{-(p-1)}$.

Much inferior to the standard rate $n^{-2d/(2d+p)}$ (where d is the degree of target smoothness) attained by many nonparametric methods.

To achieve competitiveness, terminal node size k should increase with sample size n.

Intuitively: largest trees use 1-PNNs at $x_0$; the number of 1-PNNs is $\sim O_p[(\log n)^{p-1}]$, which is too small.

Page 31: Tree Depth in a Forest - NUS

Lin and Jeon: “growing large trees (k small) does not always give the best performance”

But, asymptotics require $n \gg p$ and even when seemingly applicable may not pertain.

Consider p = 10, d = 2, n = 100000. Then

$(\log n)^{p-1}/(p-1)! = 9793 \gg 27 = n^{2d/(2d+p)}$

Even more so the case for larger p, smaller n.

So, for high dimensional problems growing largest individual trees is often desirable.
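A quick check of the arithmetic above:

```r
n <- 1e5; p <- 10; d <- 2
log(n)^(p - 1) / factorial(p - 1)   # ~ 9793, the typical number of 1-PNNs
n^(2 * d / (2 * d + p))             # ~ 27
```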

Page 32: Tree Depth in a Forest - NUS

• UCI / mlbench data repositories are inadequate as representative testbeds

• k-PNNs provide a theoretic framework for (crudely) evaluating tree depth considerations

• In large sample settings (Big Data) growing the individual tree components of a Random Forest ensemble to maximal depth can be undesirable

• Guidelines, defaults, parameterizations, and tuning strategies that address tree depth are yet to be developed

Conclusions / Future Work

Page 33: Tree Depth in a Forest - NUS

• Eugene Yeo

• Leo Breiman

Acknowledgements