Decision Tree
Rong Jin
Determine Mileage Per Gallon

mpg    cylinders  displacement  horsepower  weight  acceleration  modelyear  maker
good   4          low           low         low     high          75to78     asia
bad    6          medium        medium      medium  medium        70to74     america
bad    4          medium        medium      medium  low           75to78     europe
bad    8          high          high        high    low           70to74     america
bad    6          medium        medium      medium  medium        70to74     america
bad    4          low           medium      low     medium        70to74     asia
bad    4          low           medium      low     low           70to74     asia
bad    8          high          high        high    low           75to78     america
:      :          :             :           :       :             :          :
bad    8          high          high        high    low           70to74     america
good   8          high          medium      high    high          79to83     america
bad    8          high          high        high    low           75to78     america
good   4          low           low         low     low           79to83     america
bad    6          medium        medium      medium  high          75to78     america
good   4          medium        low         low     low           79to83     america
good   4          low           low         medium  high          79to83     america
bad    8          high          high        high    low           70to74     america
good   4          low           medium      low     medium        75to78     europe
bad    5          medium        medium      medium  medium        75to78     europe
A Decision Tree for Determining MPG
From slides of Andrew Moore
(Figure: the learned decision tree. For example, the record cylinders = 4, displacement = low, horsepower = low, weight = low, acceleration = high, modelyear = 75to78, maker = asia is routed to a leaf predicting mpg = good.)
Decision Tree Learning
- Extremely popular method
  - Credit risk assessment
  - Medical diagnosis
  - Market analysis
- Good at dealing with symbolic features
- Easy to comprehend, compared to logistic regression models and support vector machines
Representational Power
Q: Can trees represent arbitrary Boolean expressions? (Yes: a complete tree over all the attributes can encode any truth table.)
Q: How many Boolean functions are there over N binary attributes? (2^(2^N): one for each assignment of outputs to the 2^N input combinations.)
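The count can be checked by brute force for small N; the sketch below (my own illustration, not from the slides) enumerates every truth table over N = 2 binary attributes:

```python
from itertools import product

def num_boolean_functions(n):
    """Each of the 2**n input rows can independently map to 0 or 1."""
    return 2 ** (2 ** n)

n = 2
rows = list(product([0, 1], repeat=n))            # the 2**n input combinations
tables = list(product([0, 1], repeat=len(rows)))  # one output bit per row
print(len(tables))                                # 16 == num_boolean_functions(2)
```

A complete depth-N tree (one leaf per input row) realizes any one of these truth tables, which also answers the first question.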
How to Generate Trees from Training Data

A Simple Idea
- Enumerate all possible trees
- Check how well each tree matches the training data
- Pick the one that works best

Problems?
- Far too many trees to enumerate
- How do we determine the quality of a decision tree?
Solution: A Greedy Approach
- Choose the most informative feature
- Split the data set on that feature
- Recurse until each data item is classified correctly
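A minimal sketch of this greedy procedure, assuming categorical features stored as dicts (the helper names are my own, not from the slides):

```python
import math
from collections import Counter

def entropy(labels):
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def information_gain(records, labels, feature):
    """Mutual information between one categorical feature and the class label."""
    n = len(labels)
    split = {}
    for r, y in zip(records, labels):
        split.setdefault(r[feature], []).append(y)
    return entropy(labels) - sum(len(ys) / n * entropy(ys) for ys in split.values())

def grow_tree(records, labels, features):
    # Base case: all records in this subset share one output -> leaf.
    if len(set(labels)) == 1:
        return labels[0]
    # Base case: nothing left to split on -> majority-class leaf.
    if not features:
        return Counter(labels).most_common(1)[0][0]
    # Greedy step: split on the most informative feature, then recurse.
    best = max(features, key=lambda f: information_gain(records, labels, f))
    tree = {}
    for value in sorted(set(r[best] for r in records)):
        subset = [(r, y) for r, y in zip(records, labels) if r[best] == value]
        rs, ys = zip(*subset)
        tree[(best, value)] = grow_tree(list(rs), list(ys), features - {best})
    return tree

# toy usage: one binary feature that fully determines the label
print(grow_tree([{"a": "x"}, {"a": "y"}], ["good", "bad"], {"a"}))
# -> {('a', 'x'): 'good', ('a', 'y'): 'bad'}
```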
How to Determine the Best Feature?
- Which feature is more informative about MPG?
- What metric should be used?
Mutual Information!

Mutual Information for Selecting the Best Feature

    I(X; Y) = Σ_{x,y} P(x, y) log [ P(x, y) / (P(x) P(y)) ]

Y: MPG (good or bad), X: cylinders (3, 4, 6, 8)
Another Example: Playing Tennis
Split on Humidity: the full sample (9+, 5-) divides into
  High: (3+, 4-)    Norm: (6+, 1-)

    I(Humidity; Play) = Σ_{h ∈ {high, norm}} Σ_{c ∈ {+, -}} P(h, c) log [ P(h, c) / (P(h) P(c)) ] ≈ 0.151
Split on Wind: the full sample (9+, 5-) divides into
  Weak: (6+, 2-)    Strong: (3+, 3-)

    I(Wind; Play) = Σ_{w ∈ {weak, strong}} Σ_{c ∈ {+, -}} P(w, c) log [ P(w, c) / (P(w) P(c)) ] ≈ 0.048
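These two numbers can be reproduced directly from the joint counts in the splits above (a quick check; the helper function is my own, not from the slides):

```python
import math

def mutual_information(joint):
    """I(X; Y) = sum over x, y of P(x, y) log2[ P(x, y) / (P(x) P(y)) ],
    where `joint` maps (x, y) pairs to counts."""
    n = sum(joint.values())
    px, py = {}, {}
    for (x, y), c in joint.items():
        px[x] = px.get(x, 0) + c
        py[y] = py.get(y, 0) + c
    return sum((c / n) * math.log2((c / n) / (px[x] / n * py[y] / n))
               for (x, y), c in joint.items() if c > 0)

# joint counts read off the two splits of the 14 tennis examples
humidity = {("high", "+"): 3, ("high", "-"): 4, ("norm", "+"): 6, ("norm", "-"): 1}
wind = {("weak", "+"): 6, ("weak", "-"): 2, ("strong", "+"): 3, ("strong", "-"): 3}

print(round(mutual_information(humidity), 3))  # 0.152 (the slides round down to 0.151)
print(round(mutual_information(wind), 3))      # 0.048
```

Humidity carries more information about Play than Wind does, so the greedy learner would split on Humidity first.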
Prediction for Nodes
What is the prediction for each node? (A natural choice: predict the majority class among the training records that reach the node.)

Prediction for Nodes
Recursively Growing Trees
Original Dataset

Partition it according to the value of the attribute we split on:
cylinders = 4
cylinders = 5
cylinders = 6
cylinders = 8
Recursively Growing Trees
cylinders = 4
cylinders = 5
cylinders = 6
cylinders = 8
Build a tree from each subset of records.
A Two Level Tree
Recursively growing trees
When Should We Stop Growing Trees?

Should we split this node?
Base Cases
- Base Case One: If all records in the current data subset have the same output, then don't recurse.
- Base Case Two: If all records have exactly the same set of input attributes, then don't recurse.
Base Cases: An Idea
Proposed Base Case 3: If all attributes have zero information gain, then don't recurse.

Is this a good idea?
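It turns out not to be: for XOR-like data, every individual attribute has zero information gain, yet a two-level tree separates the classes perfectly. A self-contained check (the helper names are my own, not from the slides):

```python
import math
from collections import Counter

def entropy(labels):
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def information_gain(records, labels, feature):
    n = len(labels)
    split = {}
    for r, y in zip(records, labels):
        split.setdefault(r[feature], []).append(y)
    return entropy(labels) - sum(len(ys) / n * entropy(ys) for ys in split.values())

# y = a XOR b: neither attribute alone tells us anything about y ...
records = [{"a": 0, "b": 0}, {"a": 0, "b": 1}, {"a": 1, "b": 0}, {"a": 1, "b": 1}]
labels = [0, 1, 1, 0]
print(information_gain(records, labels, "a"))  # 0.0
print(information_gain(records, labels, "b"))  # 0.0

# ... yet after splitting on "a", the other attribute separates perfectly.
subset = [(r, y) for r, y in zip(records, labels) if r["a"] == 0]
rs, ys = zip(*subset)
print(information_gain(list(rs), list(ys), "b"))  # 1.0
```

So stopping as soon as all gains are zero would throw away learnable structure.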
Old Topic: Overfitting

What should we do?

Pruning
Pruning Decision Trees
- Option 1: Stop growing the tree in time.
- Option 2: Build the full decision tree as before; when you can grow it no more, start to prune:
  - Reduced-error pruning
  - Rule post-pruning
Reduced-Error Pruning
- Split the data into a training set and a validation set.
- Build a full decision tree over the training set.
- Repeatedly remove the node whose removal most increases validation-set accuracy, until no removal helps.
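The loop above can be sketched over a hypothetical dict-based tree representation (the structure and helper names are my own assumptions, not from the slides):

```python
import copy

# A tree is either a class label (a leaf) or a dict such as
#   {"feature": "a", "children": {value: subtree, ...}, "majority": label}
# where "majority" caches the majority class of the training records that
# reached this node, so the node can later be collapsed into a leaf.

def classify(tree, record):
    while isinstance(tree, dict):
        tree = tree["children"].get(record[tree["feature"]], tree["majority"])
    return tree

def accuracy(tree, records, labels):
    return sum(classify(tree, r) == y for r, y in zip(records, labels)) / len(labels)

def internal_nodes(tree, path=()):
    """Yield the path (sequence of child values) to every internal node."""
    if isinstance(tree, dict):
        yield path
        for value, sub in tree["children"].items():
            yield from internal_nodes(sub, path + (value,))

def prune_at(tree, path):
    """Return a copy of the tree with the node at `path` replaced by a leaf."""
    tree = copy.deepcopy(tree)
    parent, key, node = None, None, tree
    for value in path:
        parent, key, node = node, value, node["children"][value]
    if parent is None:
        return node["majority"]
    parent["children"][key] = node["majority"]
    return tree

def reduced_error_prune(tree, val_records, val_labels):
    # Greedily apply the single most helpful removal until none improves.
    while isinstance(tree, dict):
        best_acc, best_tree = accuracy(tree, val_records, val_labels), None
        for path in internal_nodes(tree):
            candidate = prune_at(tree, path)
            acc = accuracy(candidate, val_records, val_labels)
            if acc > best_acc:
                best_acc, best_tree = acc, candidate
        if best_tree is None:
            break
        tree = best_tree
    return tree

# demo: the test on "b" hurts on validation data, so it gets collapsed
inner = {"feature": "b", "children": {0: "bad", 1: "good"}, "majority": "bad"}
full = {"feature": "a", "children": {0: "good", 1: inner}, "majority": "good"}
val_records = [{"a": 1, "b": 0}, {"a": 1, "b": 1}, {"a": 0, "b": 0}]
val_labels = ["bad", "bad", "good"]
pruned = reduced_error_prune(full, val_records, val_labels)
print(pruned)  # {'feature': 'a', 'children': {0: 'good', 1: 'bad'}, 'majority': 'good'}
```

Caching the majority class at each node is what makes the collapse-to-leaf step cheap.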
(Figures: the original decision tree vs. the pruned decision tree.)
Rule Post-Pruning
- Convert the tree into rules, one rule per root-to-leaf path.
- Prune each rule by removing preconditions that do not hurt its estimated accuracy.
- Sort the final rules by their estimated accuracy.
- This is the most widely used method (e.g., in C4.5).

Other methods: statistical significance tests (e.g., chi-square).
Real-Valued Inputs

What should we do to deal with real-valued inputs?

mpg    cylinders  displacement  horsepower  weight  acceleration  modelyear  maker
good   4          97            75          2265    18.2          77         asia
bad    6          199           90          2648    15            70         america
bad    4          121           110         2600    12.8          77         europe
bad    8          350           175         4100    13            73         america
bad    6          198           95          3102    16.5          74         america
bad    4          108           94          2379    16.5          73         asia
bad    4          113           95          2228    14            71         asia
bad    8          302           139         3570    12.8          78         america
:      :          :             :           :       :             :          :
good   4          120           79          2625    18.6          82         america
bad    8          455           225         4425    10            70         america
good   4          107           86          2464    15.5          76         europe
bad    5          131           103         2830    15.9          78         europe
Information Gain for a Real-Valued Input
- x: a real-valued input
- t: a split value (threshold)
- Find the split value t such that the mutual information I(x, y : t) between x and the class label y is maximized.
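A sketch of this threshold search, scanning the midpoints between consecutive sorted values (the data below is illustrative, not the actual MPG table):

```python
import math
from collections import Counter

def entropy(labels):
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def best_split(xs, ys):
    """Return (t, gain): the threshold t on a real-valued input x maximizing
    the mutual information I(x, y : t) of the binary split x < t vs. x >= t.
    Candidates are midpoints between consecutive distinct sorted values."""
    pairs = sorted(zip(xs, ys))
    n = len(pairs)
    base = entropy(ys)
    best_t, best_gain = None, -1.0
    for i in range(1, n):
        if pairs[i - 1][0] == pairs[i][0]:
            continue  # no boundary between equal x values
        t = (pairs[i - 1][0] + pairs[i][0]) / 2
        left = [y for x, y in pairs if x < t]
        right = [y for x, y in pairs if x >= t]
        gain = base - len(left) / n * entropy(left) - len(right) / n * entropy(right)
        if gain > best_gain:
            best_t, best_gain = t, gain
    return best_t, best_gain

# illustrative weights and labels with a clean boundary near 3100
weights = [2265, 2648, 2600, 4100, 2379, 3570]
labels = ["good", "good", "good", "bad", "good", "bad"]
t, gain = best_split(weights, labels)
print(t, round(gain, 3))  # 3109.0 0.918
```

Only n - 1 candidate thresholds need checking, since the gain can change only at boundaries between consecutive sorted values.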
Conclusions
- Decision trees are the single most popular data mining tool:
  - Easy to understand
  - Easy to implement
  - Easy to use
  - Computationally cheap
- It's possible to get into trouble with overfitting.
- They do classification: predict a categorical output from categorical and/or real-valued inputs.
Software
- The most widely used decision tree implementation: C4.5 (and its successor C5.0)
- Source code and tutorial: http://www2.cs.uregina.ca/~hamilton/courses/831/notes/ml/dtrees/c4.5/tutorial.html
The End