
Page 1: Introduction to Some Tree based Learning Method

Introduction to Some Ensemble Learning Methods: Random Forest, Boosted Trees, and so on

Honglin Yu

March 24, 2014

Page 2: Introduction to Some Tree based Learning Method

Outline

1 Big Picture: Ensemble Learning

2 Random Forest

3 Boosting

Page 3: Introduction to Some Tree based Learning Method


Big Picture

In statistics and machine learning, ensemble methods use multiple models to obtain better predictive performance than could be obtained from any of the constituent models.

— Wikipedia

Build a house (strong classifier) from bricks (weak classifiers)

Two questions:

1 What are the bricks? How do we make them?

2 How do we build the house from the bricks?


Page 5: Introduction to Some Tree based Learning Method


Classification and Regression Tree

Page 6: Introduction to Some Tree based Learning Method


Classification and Regression Tree (CART): Example and Inference

[Figure: a small example tree. The root splits on Smell (good / bad); its children split on Color (bright / dark) and Ingredient (tasty / spicy); the four leaves carry the predicted values 0.9, 0.6, 0.6 and 0.8.]

Inference:

1 Regression: the predicted response for an observation is given by the mean response of the training observations that belong to the same terminal node.

2 Classification: each observation is assigned to the most commonly occurring class among the training observations in the region to which it belongs.
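To make the inference rules concrete, here is a minimal scikit-learn sketch (not from the slides; the data and tree depth are made up). For regression, the prediction is simply the mean response in the leaf the new point falls into.

```python
# Minimal sketch (assumes scikit-learn); the data and max_depth are made up.
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(200, 2))             # two numeric features
y = np.sin(X[:, 0]) + 0.1 * rng.normal(size=200)  # noisy response

tree = DecisionTreeRegressor(max_depth=3).fit(X, y)

# Regression inference: the prediction is the mean response of the
# training observations that fall into the same terminal node (leaf).
print(tree.predict(np.array([[2.0, 5.0]])))
```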

Page 7: Introduction to Some Tree based Learning Method


Classification and Regression Tree (CART): Learning

How to learn or build or grow a tree?

Finding the optimal partitioning of the data is NP-complete

Practical solution: greedy method

Famous Example: C4.5 algorithm

Page 8: Introduction to Some Tree based Learning Method


Growing the Tree (1)

Main procedure (similar to building a k-d tree): recursively split a node of data $\{x_i\}$ into two child nodes $(N_{X_j < s},\, N_{X_j \ge s})$ using one feature dimension $X_j$ and a threshold $s$ (greedily)

Page 9: Introduction to Some Tree based Learning Method


Growing the Tree (2)

Split criterion (selecting $X_j$ and $s$):

Regression: “minimizing RSS”

$$\min_{j,\,s} \; \sum_{i:\,(x_i)_j < s} (y_i - \bar{y}_1)^2 \;+\; \sum_{i:\,(x_i)_j \ge s} (y_i - \bar{y}_2)^2$$

where $\bar{y}_1$ and $\bar{y}_2$ are the mean responses of the two resulting nodes.

Classification: minimizing the weighted (by the number of data points in each node) sum of one of the following node impurity measures, where $\hat{p}_{mk}$ denotes the proportion of class $k$ in node $m$:

Classification error rate of one node: $1 - \max_k \hat{p}_{mk}$

Gini index of one node: $\sum_{k=1}^{K} \hat{p}_{mk}(1 - \hat{p}_{mk})$

Cross-entropy of one node: $-\sum_{k=1}^{K} \hat{p}_{mk} \log \hat{p}_{mk}$
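A minimal NumPy sketch of these criteria (illustrative only; the function names are mine). The last function is a brute-force version of the greedy regression split from the previous slide.

```python
import numpy as np

# Node impurity measures for classification; p holds the class
# proportions p_mk inside one node m.
def misclassification_error(p):
    return 1.0 - np.max(p)

def gini_index(p):
    return np.sum(p * (1.0 - p))

def cross_entropy(p):
    p = p[p > 0]                     # drop zero proportions to avoid log(0)
    return -np.sum(p * np.log(p))

# Greedy split for regression: scan every feature j and candidate
# threshold s, keep the pair that minimizes the summed RSS of the children.
def best_rss_split(X, y):
    best_j, best_s, best_rss = None, None, np.inf
    for j in range(X.shape[1]):
        for s in np.unique(X[:, j]):
            left, right = y[X[:, j] < s], y[X[:, j] >= s]
            if len(left) == 0 or len(right) == 0:
                continue
            rss = ((left - left.mean()) ** 2).sum() + ((right - right.mean()) ** 2).sum()
            if rss < best_rss:
                best_j, best_s, best_rss = j, s, rss
    return best_j, best_s, best_rss
```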

Page 10: Introduction to Some Tree based Learning Method


Growing the Tree (3)

“When building a classification tree, either the Gini index or the cross-entropy are typically used to evaluate the quality of a particular split, since these two approaches are more sensitive to node purity than is the classification error rate. Any of these three approaches might be used when pruning the tree, but the classification error rate is preferable if prediction accuracy of the final pruned tree is the goal.”

— ISLR book

Stop criterion: for example, until “each terminal node has fewer than some minimum number of observations”


Page 12: Introduction to Some Tree based Learning Method


Pruning the Tree (1)

Why: the fully grown tree is often too large

Overfitting
Sensitive to data fluctuations

How: Cost complexity pruning or weakest link pruning

$$\min_{T \subseteq T_0} \; \sum_{m=1}^{|T|} \; \sum_{i:\, x_i \in R_m} (y_i - \bar{y}_{R_m})^2 \;+\; \alpha |T|$$

where $T$ is a subtree of the full tree $T_0$, $|T|$ is the number of terminal nodes of $T$, $R_m$ is the region corresponding to the $m$-th terminal node, and $\bar{y}_{R_m}$ is its mean response.
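As a hedged illustration, scikit-learn exposes this kind of cost complexity pruning through the ccp_alpha parameter and cost_complexity_pruning_path; the data and alpha values below are placeholders, and in practice alpha is chosen by cross-validation.

```python
# Sketch of cost complexity (weakest link) pruning with scikit-learn;
# the dataset is synthetic and the alphas are just the start of the path.
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(1)
X = rng.normal(size=(300, 4))
y = X[:, 0] - 2 * X[:, 1] + 0.5 * rng.normal(size=300)

full_tree = DecisionTreeRegressor(random_state=0).fit(X, y)

# Candidate alpha values along the pruning path of the full tree T0.
path = full_tree.cost_complexity_pruning_path(X, y)
for alpha in path.ccp_alphas[:5]:
    pruned = DecisionTreeRegressor(random_state=0, ccp_alpha=alpha).fit(X, y)
    print(alpha, pruned.tree_.node_count)   # larger alpha -> smaller subtree
```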

Page 13: Introduction to Some Tree based Learning Method


Pruning the Tree (2)

[Figure: pruning illustration, from the ISLR book]

Page 14: Introduction to Some Tree based Learning Method

Example

[Figure sequence: Pages 14-26 step through a worked example (figures only).]

Page 27: Introduction to Some Tree based Learning Method


Pros and Cons of CART

Pros:

1 easier to interpret (easy to convert to logic rules)

2 automatic variable selection

3 scales well

4 insensitive to monotone transformations of the inputs (splits are based on ranking the data points), so there is no need to scale the data

Cons:

1 predictive performance is not great (greedy fitting)

2 unstable

You cannot expect too much from a single brick ...
Question: what makes a good brick?

Page 28: Introduction to Some Tree based Learning Method


Question: Why Binary Split

No concrete answer

1 Similar to logic

2 Stable (otherwise there are more possibilities, and errors at the top will affect the splitting below)


Page 30: Introduction to Some Tree based Learning Method


Let’s begin to build the house!

Page 31: Introduction to Some Tree based Learning Method


Bootstrap Aggregation (Bagging): Just average it!

Bagging:

1 Build $M$ decision trees on bootstrap data (randomly resampled with replacement)

2 Average them:

$$f(x) = \frac{1}{M} \sum_{i=1}^{M} f_i(x)$$

Note that, CARTs here do not need pruning (why?)
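A rough sketch of bagging, assuming scikit-learn trees as the "bricks" (the helper names and M are illustrative; scikit-learn's own BaggingRegressor does the same job).

```python
# Bagging sketch: train M unpruned trees on bootstrap resamples and
# average their predictions.
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def bagged_trees(X, y, M=100, seed=0):
    rng = np.random.default_rng(seed)
    n = len(y)
    trees = []
    for _ in range(M):
        idx = rng.integers(0, n, size=n)            # resample with replacement
        trees.append(DecisionTreeRegressor().fit(X[idx], y[idx]))
    return trees

def bagged_predict(trees, X):
    # f(x) = (1/M) * sum_i f_i(x)
    return np.mean([t.predict(X) for t in trees], axis=0)
```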

Page 32: Introduction to Some Tree based Learning Method


Two things

1 How many trees ($M$) to use? The more the better; increase until the error converges (CART is very fast, and the trees can be trained in parallel)

2 What is the size of the bootstrap learning set? Breiman chose the same size as the original data; then every bootstrap set leaves out a fraction

$$\left(1 - \frac{1}{n}\right)^n$$

of the data (about 37% when $n$ is large)
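A quick numeric check of that left-out fraction, which tends to e^{-1} ≈ 0.368:

```python
# Fraction of the original data left out of one bootstrap sample of size n.
import numpy as np

for n in (10, 100, 1000, 100000):
    print(n, (1 - 1 / n) ** n)   # tends to exp(-1) ~= 0.368
print("exp(-1) =", np.exp(-1))
```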

Page 33: Introduction to Some Tree based Learning Method


Question 1: Why Bagging Works?

1 Bias-Variance Trade off:

$$E[(F(x) - f_i(x))^2] = (F(x) - E[f_i(x)])^2 + E\big[(f_i(x) - E[f_i(x)])^2\big] = (\mathrm{Bias}[f_i(x)])^2 + \mathrm{Var}[f_i(x)]$$

2 $|\mathrm{Bias}(f_i(x))|$ is small, but $\mathrm{Var}(f_i(x))$ is big

3 Then for the average $f(x) = \frac{1}{M}\sum_{i=1}^{M} f_i(x)$, if the $\{f_i(x)\}$ are mutually independent:

$$\mathrm{Bias}(f(x)) = \mathrm{Bias}(f_i(x)), \qquad \mathrm{Var}(f(x)) = \frac{1}{M}\,\mathrm{Var}(f_i(x))$$

Variance is reduced by averaging
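A small simulation (illustrative only, not from the slides) of this effect: averaging M independent, unbiased but noisy estimators keeps the mean and shrinks the variance roughly by 1/M.

```python
# Averaging M independent estimates of a target F(x): the bias of the
# average is unchanged, while its variance shrinks roughly like Var/M.
import numpy as np

rng = np.random.default_rng(0)
F_x = 2.0                                        # the true value F(x)
for M in (1, 10, 100):
    # each column plays the role of one low-bias, high-variance f_i(x)
    estimates = F_x + rng.normal(scale=1.0, size=(10000, M))
    averaged = estimates.mean(axis=1)            # f(x) = average of the f_i(x)
    print(M, round(averaged.mean(), 3), round(averaged.var(), 4))
```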


Page 35: Introduction to Some Tree based Learning Method


Question 2: Why does Bagging NOT work that well?

The trees are strongly correlated (the bootstrap samples repeat and overlap heavily).
To improve: random forest (add more randomness!)


Page 37: Introduction to Some Tree based Learning Method


Random Forest

Random forests are a combination of tree predictors such that each tree depends on the values of a random vector sampled independently and with the same distribution for all trees in the forest.

— LEO BREIMAN

Where randomness helps!


Page 39: Introduction to Some Tree based Learning Method


Random Forest: Learning

Where is the “extra” randomness:

As in bagging, we build a number of decision trees on bootstrapped training samples. But when building these decision trees, each time a split in a tree is considered, a random sample of m predictors is chosen as split candidates from the full set of p predictors. The split is allowed to use only one of those m predictors.

— ISLR book

Gives more chance to individually “weak” predictors (features)

Decorrelates the trees → reduces the variance of the final hypothesis
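In scikit-learn, this per-split feature subsampling is controlled by max_features; a rough sketch with placeholder data and parameter values:

```python
# Random forest sketch: bootstrap samples plus a random subset of m
# features considered at every split.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=500, n_features=20, random_state=0)

rf = RandomForestClassifier(
    n_estimators=200,        # M trees
    max_features="sqrt",     # m = sqrt(p) predictors tried at each split
    random_state=0,
).fit(X, y)
print(rf.score(X, y))        # training accuracy, just to show the fit ran
```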

Page 40: Introduction to Some Tree based Learning Method


Random Forest

How many sub-features $m$ to choose? (not very sensitive; try 1, half of the features, or $\mathrm{int}(\log_2 p + 1)$)

Internal test of Random Forest (OOB error)

Interpretation

Page 41: Introduction to Some Tree based Learning Method


Out of Bag Error (OOB error)

“By-product of bootstrapping”: out-of-bag (OOB) data

For every training point $(x_i, y_i)$, about 37% of the trees were not trained on it.

In testing, use those ~37% of trees to predict $(x_i, y_i)$; aggregating over all training points gives an RSS or a classification error rate.

“proven to be unbiased in many tests.”

No need to do cross-validation.
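With scikit-learn, the OOB estimate comes almost for free by setting oob_score=True (sketch with synthetic data; note that for classifiers, oob_score_ reports OOB accuracy rather than error):

```python
# OOB estimate: each tree is scored on the ~37% of points it never saw.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=1000, n_features=10, random_state=0)
rf = RandomForestClassifier(n_estimators=300, oob_score=True,
                            random_state=0).fit(X, y)
print("OOB accuracy:", rf.oob_score_)   # no separate cross-validation needed
```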

Page 42: Introduction to Some Tree based Learning Method


Random Forest: Interpretation (Variable Importance)

By averaging many trees, Random Forest loses the “easy interpretation” of a single CART. But it is still “interpretable”.

Page 43: Introduction to Some Tree based Learning Method


Permutation Test (1)

Test correlation by permutation

Figure: does not correlate

— Ken Rice, Thomas Lumley, 2008

Page 44: Introduction to Some Tree based Learning Method


Permutation Test (2)

Test correlation by permutation

Figure: correlates

— Ken Rice, Thomas Lumley, 2008

Page 45: Introduction to Some Tree based Learning Method


Interpret Random Forest

https://stat.ethz.ch/education/semesters/ss2012/ams/slides/v10.2.pdf
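A minimal sketch of the permutation idea applied to variable importance (illustrative only): shuffle one feature column at a time and record how much the model's score drops. A held-out set is used here for simplicity; random forests usually do this on the OOB samples.

```python
# Permutation importance sketch: shuffling an important feature breaks its
# link with y and hurts the score; shuffling an unimportant one does not.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=1000, n_features=8, n_informative=3,
                           random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)
rf = RandomForestClassifier(n_estimators=200, random_state=0).fit(X_tr, y_tr)

baseline = rf.score(X_te, y_te)
rng = np.random.default_rng(0)
for j in range(X_te.shape[1]):
    X_perm = X_te.copy()
    rng.shuffle(X_perm[:, j])                 # permute feature j in place
    print(f"feature {j}: drop in accuracy = {baseline - rf.score(X_perm, y_te):.3f}")
```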

Page 46: Introduction to Some Tree based Learning Method


Example

Feature importance: x : y = 0.34038216 : 0.65961784

OOB error: 0.947563375391

Page 47: Introduction to Some Tree based Learning Method


Pros and Cons

Pros:

1 Performance is good; many applications

2 Not hard to interpret

3 Easy to train in parallel

Cons:

1 Easy to overfit

2 (?) worse than SVMs in high-dimensional learning

Page 48: Introduction to Some Tree based Learning Method


Boosting

Main difference between Boosting and Bagging?

1 “Bricks” (CART here) are trained sequentially.

2 Every “brick” is trained on the whole data set (with modifications based on how the former “bricks” were trained)

Note that “Boosting” actually originates before the invention of “Random Forest”, though “Boosting” often performs better than “Random Forest” and is also more complex.

Page 49: Introduction to Some Tree based Learning Method


Boosting a Regression Tree

Each new tree can be seen as (ignoring shrinkage) being fit to the “residuals” left by the former trees.

— ISLR book
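A compact sketch in this spirit (fit each small tree to the current residuals and add it with a shrinkage factor); the parameter values and helper names are illustrative, not from the slides:

```python
# Boosted regression trees: each small tree is fit to the residuals of the
# current model and added with a shrinkage factor (learning rate).
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def boost_regression(X, y, n_trees=100, depth=2, lr=0.1):
    trees = []
    residual = y.astype(float).copy()          # start from f_hat = 0
    for _ in range(n_trees):
        t = DecisionTreeRegressor(max_depth=depth).fit(X, residual)
        residual -= lr * t.predict(X)          # update the residuals
        trees.append(t)
    return trees

def boost_predict(trees, X, lr=0.1):
    return lr * np.sum([t.predict(X) for t in trees], axis=0)
```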

Page 50: Introduction to Some Tree based Learning Method


Boosting a Classification Tree

Train trees sequentially: for each new tree, the “hard” data points that the old trees got wrong are given larger weights.

Reduces the bias (in early stages) and the variance (in later stages)
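One classical instance of this reweighting idea is AdaBoost; a minimal scikit-learn sketch (the dataset is synthetic, and the default weak learner is a depth-1 tree):

```python
# AdaBoost: points that earlier weak trees misclassify receive larger
# weights when the later trees are trained.
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier

X, y = make_classification(n_samples=500, random_state=0)
ada = AdaBoostClassifier(n_estimators=200).fit(X, y)   # default weak learner: a stump
print(ada.score(X, y))
```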

Page 51: Introduction to Some Tree based Learning Method


Compared with Random Forest

1 Performance is often better

2 Slower to train and test

3 Easier to overfit

Page 52: Introduction to Some Tree based Learning Method


Questions

Is a neural network also a form of ensemble learning?

Page 53: Introduction to Some Tree based Learning Method


Questions

What is the relationship between Linear Regression and Random Forest, etc.?

Page 54: Introduction to Some Tree based Learning Method


Big Picture again

Adaptive basis function models
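For context (this framing follows the adaptive-basis-function view, e.g. Murphy's textbook; the formula is added here and is not from the slides): tree ensembles such as bagging, random forests, and boosting all fit models of the form

$$f(x) = w_0 + \sum_{m=1}^{M} w_m \, \phi_m(x;\, v_m),$$

where each basis function $\phi_m$ (here, a tree with parameters $v_m$) is itself learned from the data rather than fixed in advance.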