Decision Tree Rong Jin. Determine Mileage Per Gallon

Page 1: Decision Tree Rong Jin. Determine Mileage Per Gallon

Decision Tree

Rong Jin

Page 2

Determine Mileage Per Gallon

mpg   cylinders  displacement  horsepower  weight  acceleration  modelyear  maker
good  4          low           low         low     high          75to78     asia
bad   6          medium        medium      medium  medium        70to74     america
bad   4          medium        medium      medium  low           75to78     europe
bad   8          high          high        high    low           70to74     america
bad   6          medium        medium      medium  medium        70to74     america
bad   4          low           medium      low     medium        70to74     asia
bad   4          low           medium      low     low           70to74     asia
bad   8          high          high        high    low           75to78     america
:     :          :             :           :       :             :          :
:     :          :             :           :       :             :          :
:     :          :             :           :       :             :          :
bad   8          high          high        high    low           70to74     america
good  8          high          medium      high    high          79to83     america
bad   8          high          high        high    low           75to78     america
good  4          low           low         low     low           79to83     america
bad   6          medium        medium      medium  high          75to78     america
good  4          medium        low         low     low           79to83     america
good  4          low           low         medium  high          79to83     america
bad   8          high          high        high    low           70to74     america
good  4          low           medium      low     medium        75to78     europe
bad   5          medium        medium      medium  medium        75to78     europe

Page 3

A Decision Tree for Determining MPG

From slides of Andrew Moore

mpg   cylinders  displacement  horsepower  weight  acceleration  modelyear  maker
good  4          low           low         low     high          75to78     asia

Page 4

Decision Tree Learning

Extremely popular method
  Credit risk assessment
  Medical diagnosis
  Market analysis
Good at dealing with symbolic features
Easy to comprehend
  Compared to logistic regression models and support vector machines

Page 5

Representational Power

Q: Can trees represent arbitrary Boolean expressions?

Q: How many Boolean functions are there over N binary attributes?
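
Both questions have standard answers that the slide leaves open: a tree can represent any Boolean expression (in the worst case with one leaf per truth-table row), and there are 2^(2^N) distinct Boolean functions over N binary attributes. A minimal Python sketch that checks the count by brute-force enumeration for small N (the function name is illustrative, not from the slides):

```python
from itertools import product


def count_boolean_functions(n):
    """Count Boolean functions over n binary attributes by enumeration."""
    rows = list(product([0, 1], repeat=n))  # the 2**n truth-table rows
    # A Boolean function is one choice of output bit for every row.
    functions = set(product([0, 1], repeat=len(rows)))
    return len(functions)


for n in range(1, 4):
    # The enumerated count should agree with the closed form 2**(2**n).
    assert count_boolean_functions(n) == 2 ** (2 ** n)
    print(n, count_boolean_functions(n))
```

The count grows doubly exponentially, which is why enumerating every candidate tree is not practical beyond a handful of attributes.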

Page 6

How to Generate Trees from Training Data

Page 7

A Simple Idea

Enumerate all possible trees
Check how well each tree matches the training data
Pick the one that works best

Problems?
  Too many trees
  How to determine the quality of decision trees?

Page 8

Solution: A Greedy Approach

Choose the most informative feature
Split the data set
Recurse until each data item is classified correctly
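
A minimal sketch of the greedy procedure above, assuming information gain as the "most informative" criterion and using only the eight fully visible rows of the MPG table (not the full dataset). The names `grow_tree` and `predict` and the dict-based tree layout are illustrative choices, not taken from the slides.

```python
from collections import Counter
from math import log2

COLUMNS = ["cylinders", "displacement", "horsepower", "weight",
           "acceleration", "modelyear", "maker"]
# First eight rows of the MPG table; the label (mpg) is the first field.
DATA = """good 4 low low low high 75to78 asia
bad 6 medium medium medium medium 70to74 america
bad 4 medium medium medium low 75to78 europe
bad 8 high high high low 70to74 america
bad 6 medium medium medium medium 70to74 america
bad 4 low medium low medium 70to74 asia
bad 4 low medium low low 70to74 asia
bad 8 high high high low 75to78 america"""


def entropy(labels):
    n = len(labels)
    return -sum(c / n * log2(c / n) for c in Counter(labels).values())


def information_gain(rows, labels, attr):
    """Entropy of the labels minus the weighted entropy after splitting on attr."""
    n = len(rows)
    remainder = 0.0
    for value in set(r[attr] for r in rows):
        subset = [y for r, y in zip(rows, labels) if r[attr] == value]
        remainder += len(subset) / n * entropy(subset)
    return entropy(labels) - remainder


def grow_tree(rows, labels, attrs):
    majority = Counter(labels).most_common(1)[0][0]
    # Stop when the node is pure or no attribute is left to split on.
    if len(set(labels)) == 1 or not attrs:
        return majority
    best = max(attrs, key=lambda a: information_gain(rows, labels, a))
    remaining = [a for a in attrs if a != best]
    children = {}
    for value in set(r[best] for r in rows):
        idx = [i for i, r in enumerate(rows) if r[best] == value]
        children[value] = grow_tree([rows[i] for i in idx],
                                    [labels[i] for i in idx], remaining)
    return {"attr": best, "children": children, "default": majority}


def predict(tree, row):
    # Walk down until a leaf (a plain label) is reached.
    while isinstance(tree, dict):
        tree = tree["children"].get(row[tree["attr"]], tree["default"])
    return tree


rows, labels = [], []
for line in DATA.splitlines():
    mpg, *values = line.split()
    rows.append(dict(zip(COLUMNS, values)))
    labels.append(mpg)

tree = grow_tree(rows, labels, COLUMNS)
print(tree)
```

In this sketch an attribute value never seen during training falls back to the majority label stored at the node, which is one possible convention among several.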

Page 9

How to Determine the Best Feature?

Which feature is more informative to MPG?
What metric should be used?

From Andrew Moore’s slides

Mutual Information!

Page 10

Mutual Information for Selecting Best Features

\[
I(X; Y) = \sum_{x, y} P(x, y) \log \frac{P(x, y)}{P(x)\, P(y)}
\]

Y: MPG (good or bad), X: cylinder (3, 4, 6, 8)

From Andrew Moore’s slides
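
The definition can be evaluated directly from paired observations by estimating each probability with a relative frequency. A small sketch, assuming base-2 logarithms (so the result is in bits) and using only the eight fully visible rows of the MPG table; the function name is illustrative:

```python
from collections import Counter
from math import log2


def mutual_information(xs, ys):
    """Estimate I(X; Y) in bits from paired samples using relative frequencies."""
    n = len(xs)
    joint = Counter(zip(xs, ys))
    px, py = Counter(xs), Counter(ys)
    # P(x,y) / (P(x) P(y)) simplifies to c * n / (count(x) * count(y)).
    return sum(c / n * log2(c * n / (px[x] * py[y]))
               for (x, y), c in joint.items())


# cylinders and mpg columns of the first eight table rows (not the full dataset)
cylinders = [4, 6, 4, 8, 6, 4, 4, 8]
mpg = ["good", "bad", "bad", "bad", "bad", "bad", "bad", "bad"]
print(mutual_information(cylinders, mpg))
```

Joint outcomes that never occur contribute nothing to the sum, so only observed (x, y) pairs are iterated.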

Page 11

Another Example: Playing Tennis

Page 12

Example: Playing Tennis

Humidity: (9+, 5-) splits into High (3+, 4-) and Norm (6+, 1-)

\[
I_h = P(h, p) \log \frac{P(h, p)}{P(h) P(p)} + P(n, p) \log \frac{P(n, p)}{P(n) P(p)}
    + P(h, \neg p) \log \frac{P(h, \neg p)}{P(h) P(\neg p)} + P(n, \neg p) \log \frac{P(n, \neg p)}{P(n) P(\neg p)}
    = 0.151
\]

(h: high humidity, n: normal humidity, p: positive example, \neg p: negative example)

Wind: (9+, 5-) splits into Weak (6+, 2-) and Strong (3+, 3-)

\[
I_w = P(w, p) \log \frac{P(w, p)}{P(w) P(p)} + P(s, p) \log \frac{P(s, p)}{P(s) P(p)}
    + P(w, \neg p) \log \frac{P(w, \neg p)}{P(w) P(\neg p)} + P(s, \neg p) \log \frac{P(s, \neg p)}{P(s) P(\neg p)}
    = 0.048
\]

(w: weak wind, s: strong wind)
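
The two figures can be re-derived from the branch counts alone. A minimal sketch, assuming base-2 logarithms; the helper name and the dictionary layout are illustrative. The results come out near 0.15 bits for Humidity and 0.05 bits for Wind, in line with the slide's values up to rounding in the last digit.

```python
from math import log2


def mutual_information_from_counts(table):
    """table[branch] = (positives, negatives); returns I(attribute; label) in bits."""
    n = sum(pos + neg for pos, neg in table.values())
    # Column totals: all positives and all negatives across branches.
    col = [sum(counts[k] for counts in table.values()) for k in (0, 1)]
    total = 0.0
    for counts in table.values():
        row = sum(counts)  # number of examples in this branch
        for k in (0, 1):
            c = counts[k]
            if c > 0:  # empty cells contribute nothing
                total += c / n * log2(c * n / (row * col[k]))
    return total


humidity = {"high": (3, 4), "normal": (6, 1)}
wind = {"weak": (6, 2), "strong": (3, 3)}

i_humidity = mutual_information_from_counts(humidity)
i_wind = mutual_information_from_counts(wind)
print(round(i_humidity, 4), round(i_wind, 4))
```

Humidity carries more information about the label than Wind on this data, so a greedy learner would split on Humidity before Wind.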

Page 13

Prediction for Nodes

From Andrew Moore’s slides

What is the prediction for each node?

Page 14

Prediction for Nodes

Page 15

Recursively Growing Trees

Original dataset

Partition it according to the value of the attribute we split on:
  cylinders = 4
  cylinders = 5
  cylinders = 6
  cylinders = 8

From Andrew Moore’s slides

Page 16

Recursively Growing Trees

cylinders = 4: build tree from these records
cylinders = 5: build tree from these records
cylinders = 6: build tree from these records
cylinders = 8: build tree from these records

From Andrew Moore’s slides

Page 17

A Two Level Tree

Recursively growing trees

Page 18

When Should We Stop Growing Trees?

Should we split this node?

Page 19

Base Cases

Base Case One: If all records in the current data subset have the same output, then don’t recurse
Base Case Two: If all records have exactly the same set of input attributes, then don’t recurse

Page 20

Base Cases: An Idea

Base Case One: If all records in the current data subset have the same output, then don’t recurse
Base Case Two: If all records have exactly the same set of input attributes, then don’t recurse

Proposed Base Case 3: If all attributes have zero information gain, then don’t recurse

Is this a good idea?
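
One way to probe the question is the function y = a XOR b. At the root, neither attribute has any information gain on its own, so the proposed Base Case 3 would stop immediately, yet splitting on a and then on b classifies every record correctly. A small sketch on this toy data (the data and helper names are assumptions for illustration, not from the slides):

```python
from collections import Counter
from math import log2


def entropy(labels):
    n = len(labels)
    return -sum(c / n * log2(c / n) for c in Counter(labels).values())


def information_gain(rows, labels, attr):
    """Entropy of the labels minus the weighted entropy after splitting on attr."""
    n = len(rows)
    remainder = 0.0
    for value in set(r[attr] for r in rows):
        subset = [y for r, y in zip(rows, labels) if r[attr] == value]
        remainder += len(subset) / n * entropy(subset)
    return entropy(labels) - remainder


# Truth table of y = a XOR b
rows = [{"a": a, "b": b} for a in (0, 1) for b in (0, 1)]
labels = [r["a"] ^ r["b"] for r in rows]

# Gain of each attribute at the root
root_gains = {attr: information_gain(rows, labels, attr) for attr in ("a", "b")}

# Gain of b inside the a = 0 branch, i.e. after splitting on a anyway
branch = [(r, y) for r, y in zip(rows, labels) if r["a"] == 0]
gain_b_after_a = information_gain([r for r, _ in branch],
                                  [y for _, y in branch], "b")

print(root_gains, gain_b_after_a)
```

This suggests that a zero-gain test at a single node can be too short-sighted, which is one motivation for growing the full tree first and pruning afterwards.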

Page 21

Old Topic: Overfitting

Page 22

What Should We Do?

Pruning

Page 23

Pruning Decision Tree

Stop growing trees in time
Build the full decision tree as before, but when you can grow it no more, start to prune:
  Reduced error pruning
  Rule post-pruning

Page 24

Reduced Error Pruning

Split data into training and validation sets
Build a full decision tree over the training set
Keep removing the node whose removal maximally increases validation set accuracy
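
The loop above can be sketched as follows. This is a minimal illustration under stated assumptions, not a reference implementation: trees are nested dicts of the form {"attr", "children", "default"} with plain labels as leaves, "removing a node" means replacing it with its majority-label leaf, and pruning stops once no removal strictly improves validation accuracy (some variants also accept ties). The toy tree and validation data are made up.

```python
def predict(tree, row):
    while isinstance(tree, dict):
        tree = tree["children"].get(row[tree["attr"]], tree["default"])
    return tree


def accuracy(tree, rows, labels):
    return sum(predict(tree, r) == y for r, y in zip(rows, labels)) / len(rows)


def internal_paths(tree, path=()):
    """Yield the branch-value path leading to every internal node."""
    if isinstance(tree, dict):
        yield path
        for value, child in tree["children"].items():
            yield from internal_paths(child, path + (value,))


def collapse(tree, path):
    """Copy of tree with the node at `path` replaced by its majority-label leaf."""
    if not path:
        return tree["default"]
    children = dict(tree["children"])
    children[path[0]] = collapse(children[path[0]], path[1:])
    return {**tree, "children": children}


def reduced_error_prune(tree, val_rows, val_labels):
    best_acc = accuracy(tree, val_rows, val_labels)
    while isinstance(tree, dict):
        # Try removing each internal node and keep the best candidate.
        candidates = [collapse(tree, p) for p in internal_paths(tree)]
        top = max(candidates, key=lambda t: accuracy(t, val_rows, val_labels))
        top_acc = accuracy(top, val_rows, val_labels)
        if top_acc <= best_acc:
            break  # no removal improves validation accuracy
        tree, best_acc = top, top_acc
    return tree


# Hypothetical tree whose split on "b" fits noise in the training data
full_tree = {"attr": "a", "default": "bad", "children": {
    "0": "bad",
    "1": {"attr": "b", "default": "good",
          "children": {"0": "good", "1": "bad"}},
}}
val_rows = [{"a": "0", "b": "0"}, {"a": "1", "b": "0"},
            {"a": "1", "b": "1"}, {"a": "1", "b": "1"}]
val_labels = ["bad", "good", "good", "good"]

pruned = reduced_error_prune(full_tree, val_rows, val_labels)
print(pruned)
```

On this toy data the split on "b" is collapsed into a single leaf while the root split is kept, because removing the root would lower validation accuracy.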

Page 25

Original Decision Tree

Page 26

Pruned Decision Tree

Page 27

Reduced Error Pruning

Page 28

Rule Post-Pruning

Convert tree into rules
Prune rules by removing preconditions
Sort final rules by their estimated accuracy

Most widely used method (e.g., C4.5)

Other methods: statistical significance test (chi-square)
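
The three steps can be sketched as below. Everything specific here is an assumption for illustration: the dict-based tree layout, the toy data, and above all the accuracy estimate, which is simply measured on a held-out set. C4.5 itself uses a pessimistic estimate computed from the training data, which this sketch does not reproduce.

```python
def tree_to_rules(tree, conditions=()):
    """One rule per leaf: (tuple of (attribute, value) preconditions, label)."""
    if not isinstance(tree, dict):
        return [(conditions, tree)]
    rules = []
    for value, child in tree["children"].items():
        rules += tree_to_rules(child, conditions + ((tree["attr"], value),))
    return rules


def rule_accuracy(conditions, label, rows, labels):
    """Fraction of rows matching all preconditions that carry the rule's label."""
    matched = [y for r, y in zip(rows, labels)
               if all(r[a] == v for a, v in conditions)]
    return sum(y == label for y in matched) / len(matched) if matched else 0.0


def prune_rule(conditions, label, rows, labels):
    """Greedily drop preconditions while doing so strictly improves accuracy."""
    conditions = list(conditions)
    improved = True
    while improved and conditions:
        improved = False
        base = rule_accuracy(conditions, label, rows, labels)
        for c in conditions:
            shorter = [d for d in conditions if d != c]
            if rule_accuracy(shorter, label, rows, labels) > base:
                conditions, improved = shorter, True
                break
    return tuple(conditions)


# Hypothetical tree and held-out data
tree = {"attr": "a", "default": "bad", "children": {
    "0": "bad",
    "1": {"attr": "b", "default": "good",
          "children": {"0": "good", "1": "bad"}},
}}
rows = [{"a": "0", "b": "1"}, {"a": "0", "b": "1"}, {"a": "1", "b": "1"},
        {"a": "1", "b": "1"}, {"a": "1", "b": "0"}, {"a": "1", "b": "0"},
        {"a": "0", "b": "0"}]
labels = ["bad", "bad", "bad", "good", "good", "good", "bad"]

rules = tree_to_rules(tree)
pruned_rules = [(prune_rule(c, y, rows, labels), y) for c, y in rules]
# Sort by estimated accuracy, most reliable rule first.
pruned_rules.sort(key=lambda rule: rule_accuracy(rule[0], rule[1], rows, labels),
                  reverse=True)
for conditions, label in pruned_rules:
    print(conditions, "->", label)
```

Because each rule is pruned independently, a precondition can be dropped from one path without affecting the others, which is a flexibility that pruning whole subtrees does not offer.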

Page 29

Real Value Inputs

What should we do to deal with real value inputs?

mpg   cylinders  displacement  horsepower  weight  acceleration  modelyear  maker
good  4          97            75          2265    18.2          77         asia
bad   6          199           90          2648    15            70         america
bad   4          121           110         2600    12.8          77         europe
bad   8          350           175         4100    13            73         america
bad   6          198           95          3102    16.5          74         america
bad   4          108           94          2379    16.5          73         asia
bad   4          113           95          2228    14            71         asia
bad   8          302           139         3570    12.8          78         america
:     :          :             :           :       :             :          :
:     :          :             :           :       :             :          :
:     :          :             :           :       :             :          :
good  4          120           79          2625    18.6          82         america
bad   8          455           225         4425    10            70         america
good  4          107           86          2464    15.5          76         europe
bad   5          131           103         2830    15.9          78         europe

Page 30

Information Gain

x: a real-valued input
t: split value
Find the split value t such that the mutual information I(x, y: t) between x and the class label y is maximized.
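
A minimal sketch of the threshold search, assuming a binary test x < t, candidate thresholds at midpoints between consecutive distinct values, and base-2 logarithms. The weight and mpg values are the twelve fully visible rows of the real-valued table (not the full dataset); the function names are illustrative.

```python
from collections import Counter
from math import log2


def entropy(labels):
    n = len(labels)
    return -sum(c / n * log2(c / n) for c in Counter(labels).values())


def best_split(xs, ys):
    """Return (t, gain) maximizing the information gain of the test x < t."""
    n = len(xs)
    base = entropy(ys)
    values = sorted(set(xs))
    best_t, best_gain = None, -1.0
    # Candidate thresholds: midpoints between consecutive distinct values.
    for lo, hi in zip(values, values[1:]):
        t = (lo + hi) / 2
        left = [y for x, y in zip(xs, ys) if x < t]
        right = [y for x, y in zip(xs, ys) if x >= t]
        gain = (base - len(left) / n * entropy(left)
                - len(right) / n * entropy(right))
        if gain > best_gain:
            best_t, best_gain = t, gain
    return best_t, best_gain


weight = [2265, 2648, 2600, 4100, 3102, 2379, 2228, 3570,
          2625, 4425, 2464, 2830]
mpg = ["good", "bad", "bad", "bad", "bad", "bad", "bad", "bad",
       "good", "bad", "good", "bad"]

t, gain = best_split(weight, mpg)
print(t, round(gain, 4))
```

Only midpoints between observed values need to be tried, because any threshold between the same two neighbouring values produces the same partition of the data.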

Page 31

Conclusions

Decision trees are the single most popular data mining tool
  Easy to understand
  Easy to implement
  Easy to use
  Computationally cheap
It’s possible to get in trouble with overfitting
They do classification: predict a categorical output from categorical and/or real inputs

Page 32

Software

Most widely used decision tree: C4.5 (or C5.0)
  http://www2.cs.uregina.ca/~hamilton/courses/831/notes/ml/dtrees/c4.5/tutorial.html
  Source code, tutorial

Page 33

The End