
L3. Decision Trees


Page 1: L3. Decision Trees

Decision Trees

Gonzalo Martínez Muñoz Universidad Autónoma de Madrid

Page 2: L3. Decision Trees

Outline

•  What is a decision tree?

•  History

•  Decision tree learning algorithm

•  Growing the tree

•  Pruning the tree

•  Capabilities


Page 3: L3. Decision Trees

What is a decision tree?

•  Hierarchical learning model that recursively partitions the space using decision rules

•  Valid for classification and regression

•  Ok, but what is a decision tree?


Page 4: L3. Decision Trees

Example for classification

•  Labor negotiations: predict if a collective agreement is good or bad


[Figure: decision tree for the labor negotiations example. The tree splits on "wage increase 1st year" (≤36% / >36%) and on "dental plan" (≤full / >full), with leaf nodes labeled good or bad; the root node, internal nodes and leaf nodes are annotated. A scatter plot of wage increase in the 1st year against dental plan (N/A, none, half, full) shows the resulting partition of the data.]

Page 5: L3. Decision Trees

Example for regression

•  Boston housing: predict house values in Boston by neighbourhood


[Figure: regression tree for the Boston housing example. The root node splits on average rooms (≤5.5 / >5.5) and an internal node on average house year (≤75 / >75); the leaves predict values of 200000, 350000 and 500000. A scatter plot of average house year against average rooms shows the partition.]

Page 6: L3. Decision Trees

History

•  Precursors: Expert Based Systems (EBS)

EBS = Knowledge database + Inference Engine

•  MYCIN: medical diagnosis system based on about 600 rules

•  XCON: System for configuring VAX computers, 2500 rules (1982)

•  The rules were created by experts by hand!!

•  Knowledge acquisition has to be automated

•  Replace the expert with an archive of solved cases


Page 7: L3. Decision Trees

History

•  CHAID (CHi-squared Automatic Interaction Detector), Gordon V. Kass, 1980

•  CART (Classification and Regression Trees), Breiman, Friedman, Olshen and Stone, 1984

•  ID3 (Iterative Dichotomiser 3), Quinlan, 1986

•  C4.5, Quinlan 1993: Based on ID3


Page 8: L3. Decision Trees

Computational

•  Consider two binary variables. How many ways can we split the space using a decision tree?

•  Two possible splits and two possible class assignments to the leaf nodes → at least 8 possible trees




Page 9: L3. Decision Trees


Computational

•  Under what conditions does someone wait in a restaurant?

There are 2 × 2 × 2 × 2 × 3 × 3 × 2 × 2 × 4 × 4 = 9216 possible cases

and two classes → $2^{9216}$ possible hypotheses, and many more possible trees!

Page 10: L3. Decision Trees

Computational

•  It is just not feasible to find the optimal solution

•  A bias should be selected to build the models.

•  This is a general problem in Machine Learning.


Page 11: L3. Decision Trees

Computational

For decision trees a greedy approach is generally selected:

•  The tree is built step by step, instead of as a whole

•  At each step the best split with respect to the train data is selected (following a split criterion).

•  The tree is grown until a stopping criterion is met

•  The tree is generally pruned (following a pruning criterion) to avoid over-fitting.


Page 12: L3. Decision Trees


Basic Decision Tree Algorithm

trainTree(Dataset L)
1. T = growTree(L)
2. pruneTree(T, L)
3. return T

growTree(Dataset L)
1. T.s = findBestSplit(L)
2. if T.s == null return null
3. (L1, L2) = splitData(L, T.s)
4. T.left = growTree(L1)
5. T.right = growTree(L2)
6. return T

pruneTree(T, L) removes subtrees whose validity is uncertain.
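A minimal, self-contained Python sketch of this greedy recursion (the names grow_tree, find_best_split and the Gini-based split search are illustrative choices, not from the slides; unlike the pseudocode, the sketch returns a majority-class leaf instead of null when no useful split exists):

from collections import Counter
from dataclasses import dataclass
from typing import Optional

@dataclass
class Node:
    split: Optional[tuple] = None   # (attribute index, threshold) for internal nodes
    left: Optional["Node"] = None
    right: Optional["Node"] = None
    label: Optional[object] = None  # class label for leaf nodes

def gini(labels):
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def find_best_split(X, y):
    # Try every (attribute, threshold) pair and keep the largest impurity decrease
    best, best_gain = None, 0.0
    parent = gini(y)
    for a in range(len(X[0])):
        for threshold in sorted({row[a] for row in X}):
            left = [yi for row, yi in zip(X, y) if row[a] <= threshold]
            right = [yi for row, yi in zip(X, y) if row[a] > threshold]
            if not left or not right:
                continue
            p_l = len(left) / len(y)
            gain = parent - p_l * gini(left) - (1 - p_l) * gini(right)
            if gain > best_gain:
                best, best_gain = (a, threshold), gain
    return best

def grow_tree(X, y):
    split = find_best_split(X, y)
    if split is None:               # pure node or no useful split: make a leaf
        return Node(label=Counter(y).most_common(1)[0][0])
    a, threshold = split
    left_idx = [i for i, row in enumerate(X) if row[a] <= threshold]
    right_idx = [i for i, row in enumerate(X) if row[a] > threshold]
    node = Node(split=split)
    node.left = grow_tree([X[i] for i in left_idx], [y[i] for i in left_idx])
    node.right = grow_tree([X[i] for i in right_idx], [y[i] for i in right_idx])
    return node

# Tiny usage example with two numeric attributes
tree = grow_tree([[1, 4], [2, 1], [4, 3], [5, 5]], ["bad", "good", "good", "bad"])
print(tree.split)   # (1, 3): split on the second attribute at threshold 3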

Page 13: L3. Decision Trees


Finding the best split

findBestSplit(Dataset L)
1. Try all possible splits
2. return best

[Figure: a small two-class dataset on a 0-5 × 0-5 grid, with the candidate axis-parallel splits.]

There are not too many candidate splits, so the search is computationally feasible: O(number of attributes × number of data points) splits to evaluate.

But which one is best?

Page 14: L3. Decision Trees


Split Criterion

•  It should measure the impurity of a node:

$i(t) = \sum_{i=1}^{m} f_i\,(1 - f_i)$   (Gini impurity, CART)

where $f_i$ is the fraction of instances of class $i$ in node $t$

•  The improvement of a split is the variation of impurity before and after the split

$\Delta i(t,s) = i(t) - p_L\, i(t_L) - p_R\, i(t_R)$

where $p_L$ ($p_R$) is the proportion of instances going to the left (right) child node

Page 15: L3. Decision Trees


Split Criterion

[Figure: the example dataset of 12 points (8 of one class, 4 of the other) on a 0-5 × 0-5 grid, with one candidate split highlighted.]

$\Delta i(t,s) = i(t) - p_L\, i(t_L) - p_R\, i(t_R)$

$i(t) = 2 \times \frac{8}{12} \times \frac{4}{12} = 0.44$

$p_L = \frac{5}{12}$ ;  $p_R = \frac{7}{12}$

$i(t_L) = 2 \times \frac{2}{5} \times \frac{3}{5}$

$i(t_R) = 2 \times \frac{6}{7} \times \frac{1}{7}$

$\Delta i(t,s) = 0.102$

[Figure annotation: the impurity decreases of the candidate splits are 0.011, 0.102, 0.026, 0.020, 0.044, 0.111, 0.056 and 0.011.]

Which one is best?
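As a quick check, the same arithmetic in Python (plain expressions using the numbers above):

i_t = 2 * (8/12) * (4/12)               # impurity of the node: 0.44
i_left = 2 * (2/5) * (3/5)              # impurity of the left child
i_right = 2 * (6/7) * (1/7)             # impurity of the right child
delta = i_t - (5/12) * i_left - (7/12) * i_right
print(round(i_t, 2), round(delta, 3))   # 0.44 0.102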

Page 16: L3. Decision Trees

Other splitting criteria

•  Based on entropy

$H(t) = -\sum_{i=1}^{m} f_i \log_2 f_i$

•  Information gain (used in ID3 and C4.5)

$IG(t,s) = H(t) - p_L\, H(t_L) - p_R\, H(t_R)$

•  Information gain ratio (C4.5)

$IGR(t,s) = IG(t,s)\,/\,H(s)$

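The same criteria as a small Python sketch (entropy, information_gain and gain_ratio are illustrative names; here $H(s)$ is taken to be the entropy of the branch proportions, which is how the gain ratio denominator is usually defined):

import math

def entropy(fractions):
    # H(t) = -sum_i f_i * log2(f_i)
    return -sum(f * math.log2(f) for f in fractions if f > 0)

def information_gain(parent_fracs, p_left, left_fracs, right_fracs):
    # IG(t, s) = H(t) - p_L * H(t_L) - p_R * H(t_R)
    return (entropy(parent_fracs)
            - p_left * entropy(left_fracs)
            - (1 - p_left) * entropy(right_fracs))

def gain_ratio(parent_fracs, p_left, left_fracs, right_fracs):
    # IGR(t, s) = IG(t, s) / H(s)
    split_entropy = entropy([p_left, 1 - p_left])
    ig = information_gain(parent_fracs, p_left, left_fracs, right_fracs)
    return ig / split_entropy if split_entropy > 0 else 0.0

# The 12-point node from the previous example:
print(information_gain([8/12, 4/12], 5/12, [2/5, 3/5], [6/7, 1/7]))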

Page 17: L3. Decision Trees


Splitting criteria

[Figure: the splitting criteria plotted as a function of the class fraction for a binary problem.]

•  Any splitting criterion with this shape is good (this is for binary problems)

Page 18: L3. Decision Trees


Fully grown tree

[Figure: the training data on a 0-5 × 0-5 grid and the fully grown tree, which splits repeatedly on X1 and X2 at thresholds 1.5, 2.5, 3.5 and 4.5 until every leaf (labeled good or bad) is pure.]

Are we happy with this tree? How about a simpler one?

Page 19: L3. Decision Trees

Stopping criteria

•  All instances assigned to the node considered for splitting have the same class label

•  No split is found to further partition the data

•  The number of instances in each terminal node is smaller than a predefined threshold

•  The impurity decrease of the best split is below a given threshold

•  The tree has reached a maximum depth


The last three criteria are also called pre-pruning (a small sketch of these checks follows).
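A minimal sketch of these checks in Python (the function name and default thresholds are illustrative, not from the slides):

def should_stop(labels, best_gain, depth,
                min_samples=5, min_gain=1e-3, max_depth=10):
    if len(set(labels)) <= 1:        # all instances share the same class label
        return True
    if best_gain is None:            # no split found to further partition the data
        return True
    if len(labels) < min_samples:    # too few instances in the node
        return True
    if best_gain < min_gain:         # impurity decrease of the best split too small
        return True
    return depth >= max_depth        # maximum depth reached

print(should_stop(["good", "bad", "good"], best_gain=0.02, depth=3))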

Page 20: L3. Decision Trees


Pruning

•  Another option is post-pruning (or simply pruning). It consists in:

•  Grow the tree as much as possible

•  Prune it afterwards by replacing a subtree with a single leaf node if the error does not worsen significantly.

•  This process is continued until no more pruning is possible.

•  Actually we go back to smaller trees but through a different path

•  The idea of pruning is to avoid overfitting

Page 21: L3. Decision Trees


Cost-complexity pruning (CART)

•  Cost-complexity based pruning:

$R_\alpha(t) = R(t) + \alpha \cdot C(t)$

•  R(t) is the error of the decision tree rooted at node t

•  C(t) is the number of leaf nodes of the subtree rooted at node t

•  Parameter α specifies the relative weight between the accuracy and complexity of the tree
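As a concrete illustration (the function name is made up; the two calls reproduce the example on the next slide with α = 0.1):

def cost_complexity(error, n_leaves, alpha):
    # R_alpha(t) = R(t) + alpha * C(t)
    return error + alpha * n_leaves

print(round(cost_complexity(1/5, 1, alpha=0.1), 2))   # pruned subtree (1 leaf): 0.3
print(round(cost_complexity(0.0, 5, alpha=0.1), 2))   # unpruned subtree (5 leaves): 0.5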

Page 22: L3. Decision Trees


Pruning CART

[Figure: the fully grown tree and the training data from the earlier example; the subtree under one internal node is replaced by a single leaf labeled good.]

Let's say $\alpha = 0.1$:

Pruned:   $R_{\alpha=0.1}(t) = 1/5 + 0.1 \cdot 1 = 0.3$
Unpruned: $R_{\alpha=0.1}(t) = 0 + 0.1 \cdot 5 = 0.5$

The pruned subtree has the lower cost-complexity, so the subtree is pruned.

Page 23: L3. Decision Trees

Cost-complexity pruning (CART)

•  CART uses 10-fold cross-validation within the training data to estimate alpha: iteratively, nine folds are used for training a tree and one for testing.

•  A tree is trained on the nine folds and pruned using all possible alphas (there are only finitely many).

•  Then each of those trees is tested on the remaining fold.

•  The process is repeated 10 times and the alpha value that gives the best generalization accuracy is kept
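This selection is not a single built-in call, but a roughly analogous procedure can be sketched with scikit-learn's cost-complexity pruning path plus cross-validation (scikit-learn and the iris data are assumptions for illustration, not part of the slides):

from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

# Candidate alphas come from the cost-complexity pruning path of a fully grown tree
path = DecisionTreeClassifier(random_state=0).cost_complexity_pruning_path(X, y)

# Keep the alpha with the best cross-validated accuracy
scores = [(cross_val_score(DecisionTreeClassifier(ccp_alpha=a, random_state=0),
                           X, y, cv=10).mean(), a)
          for a in path.ccp_alphas]
best_score, best_alpha = max(scores)
print(best_alpha, best_score)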


Page 24: L3. Decision Trees

Statistical pruning (C4.5)

•  C4.5 estimates the error at the leaf nodes using the upper confidence bound of a normal approximation (the confidence level is a parameter) instead of the observed training error.

•  The error estimate for a subtree is the weighted sum of the error estimates of all its leaves

•  This error is higher when few data instances fall on a leaf.

•  Hence, leaf nodes with few instances tend to be pruned.


Page 25: L3. Decision Trees

Pruning (CART vs C4.5)

•  CART pruning is slower since it has to build 10 extra trees to estimate alpha.

•  C4.5 pruning is faster; however, the algorithm does not propose a principled way to set the confidence threshold

•  The statistical grounds for C4.5 pruning are questionable.

•  Using cross validation is safer


Page 26: L3. Decision Trees


Missing values

•  What can be done if a value is missing?

•  Suppose the value of the attribute "Pat" (from the restaurant example) is unknown for one instance.

•  The instance with the missing value falls down all three branches, but with fractional weights

•  And the validity of the split is computed as before
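A rough Python sketch of this weighted fall-through (the helper name is made up, and the weight sent down each branch is assumed proportional to the share of non-missing instances taking that branch, as C4.5 is usually described):

def split_with_missing(values, weights, branch_of):
    """Distribute (index, weight) pairs over branches; None marks a missing value."""
    branches = {}
    known = [(i, v, w) for i, (v, w) in enumerate(zip(values, weights)) if v is not None]
    total = sum(w for _, _, w in known)
    for i, v, w in known:
        branches.setdefault(branch_of(v), []).append((i, w))
    # Each branch's share of the known weight
    shares = {b: sum(w for _, w in items) / total for b, items in branches.items()}
    # A missing value goes down every branch with a proportional fraction of its weight
    for i, (v, w) in enumerate(zip(values, weights)):
        if v is None:
            for b, share in shares.items():
                branches[b].append((i, w * share))
    return branches

# One instance has a missing value for the three-valued attribute "Pat"
print(split_with_missing(["some", "full", "none", None], [1, 1, 1, 1],
                         branch_of=lambda v: v))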

Page 27: L3. Decision Trees


Oblique splits

•  The CART algorithm allows oblique splits, i.e. splits that are not orthogonal to the attribute axes

•  The algorithm searches for planes with good impurity reduction

•  The tree-growing process becomes slower

•  But trees become more expressive and compact

[Figure: a single oblique split, N1 > N2, whose true/false branches lead to the + and − leaves.]

Page 28: L3. Decision Trees

Parameters

•  Minimum number of instances necessary to split a node.

•  Pruning/No pruning

•  Pruning confidence. How much to prune?

•  For computational reasons, the number of nodes or the depth of the tree can be limited (see the example below)

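For reference, these parameters map roughly onto scikit-learn's DecisionTreeClassifier (scikit-learn is an assumption for illustration; it grows CART-style trees and uses cost-complexity pruning rather than C4.5's confidence-based pruning):

from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

tree = DecisionTreeClassifier(
    min_samples_split=10,        # minimum number of instances needed to split a node
    min_impurity_decrease=1e-3,  # impurity gain threshold (pre-pruning)
    max_depth=8,                 # limit the depth for computational reasons
    ccp_alpha=0.01,              # post-pruning strength; 0 means no pruning
).fit(X, y)

print(tree.get_depth(), tree.get_n_leaves())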

Page 29: L3. Decision Trees


Algorithm details

CART
•  Splitting criterion: Gini, Twoing
•  Pruning criterion: cross-validation post-pruning
•  Other features: regression/classification, nominal/numeric attributes, missing values, oblique splits, grouping of nominal splits

ID3
•  Splitting criterion: Information Gain (IG)
•  Pruning criterion: pre-pruning
•  Other features: classification, nominal attributes

C4.5
•  Splitting criterion: Information Gain (IG), Information Gain Ratio (IGR)
•  Pruning criterion: statistical post-pruning
•  Other features: classification, nominal/numeric attributes, missing values, rule generator, multi-way node splits

Page 30: L3. Decision Trees


Bad things about DT

•  None! Well maybe something…

•  They do not handle complex interactions between attributes very well: limited expressive power

Page 31: L3. Decision Trees


Bad things about DT

•  Replication problem. Can end up with similar subtrees in mutually exclusive regions.

Page 32: L3. Decision Trees


Good things about DT

•  Self-explanatory. Easy for non-experts to understand. Can be converted to rules.

•  Handle both nominal and numeric attributes.

•  Can handle uninformative and redundant attributes.

•  Can handle missing values.

•  Nonparametric method. In principle, no predefined idea of the concept to learn

•  Easy to tune. Do not have hundreds of parameters

Page 33: L3. Decision Trees

Thanks for listening

[Closing joke, drawn as a decision tree: "Questions?" If NO, go for a coffee. If YES: "Correctly answered?" Either way, go for a coffee.]