Classification with Decision Trees and Rules
Evgueni Smirnov
Overview
• Classification Problem
• Decision Trees for Classification
• Decision Rules for Classification
Classification Task

Given:
• X is an instance space defined as {Xi}, i ∈ 1..N, where Xi is a discrete/continuous variable.
• Y is a finite class set.
• Training data D ⊆ X × Y.

Find:
• The class y ∈ Y of an instance x ∈ X.
Instances, Classes, Instance Spaces

A class is a set of objects in a world that are unified by a reason. A reason may be a similar appearance, structure, or function.

Example: the set {children, photos, cat, diplomas} can be viewed as a class "Most important things to take out of your apartment when it catches fire".

[Figure: the instance space X of robots. An example instance of the class "friendly robots": head = square, body = round, smiling = yes, holding = flag, color = yellow.]
Instances, Classes, Instance Spaces

[Figure: the instance space X, the class "friendly robots" as a subset of X, a hypothesis space H, and a model M; e.g., the hypothesis "smiling = yes → friendly robot" selects a region of X.]
Classification Problem
Decision Trees for Classification
• Classification Problem
• Definition of Decision Trees
• Variable Selection: Impurity Reduction, Entropy, and Information Gain
• Learning Decision Trees
• Overfitting and Pruning
• Handling Variables with Many Values
• Handling Missing Values
• Handling Large Data: Windowing
Decision Trees for Classification

• A decision tree is a tree where:
  – Each interior node tests a variable;
  – Each branch corresponds to a variable value;
  – Each leaf node is labelled with a class (class node).
[Figure: a decision tree with root A1 (branches a11, a12, a13), internal nodes A2 and A3 (branches a21, a22 and a31, a32), and leaves labelled c1 or c2.]
A Simple Database: PlayTennis

Day  Outlook  Temperature  Humidity  Wind  PlayTennis
D1 Sunny Hot High Weak No
D2 Sunny Hot High Strong No
D3 Overcast Hot High Weak Yes
D4 Rain Mild High Weak Yes
D5 Rain Cool Normal Weak Yes
D6 Rain Cool Normal Strong No
D7 Overcast Cool Normal Strong Yes
D8 Sunny Mild High Weak No
D9 Sunny Cool Normal Weak Yes
D10 Rain Mild Normal Weak Yes
D11 Sunny Mild Normal Strong Yes
D12 Overcast Mild High Strong Yes
D13 Overcast Hot Normal Weak Yes
D14 Rain Mild High Strong No
Decision Tree for Playing Tennis
Outlook
  sunny → Humidity
    high → no
    normal → yes
  overcast → yes
  rainy → Windy
    false → yes
    true → no
Classification with Decision Trees

Classify(x: instance, node: a node of the decision tree)
• if node is a classification node then
  – return the class of node;
• else
  – determine the child of node that matches x;
  – return Classify(x, child).

[Figure: the abstract decision tree with root A1 shown earlier.]
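Below is a minimal Python sketch of the Classify procedure, assuming a tree represented as nested dicts (an interior node as {"variable": ..., "children": {value: subtree}}, a leaf as {"class": ...}); this representation and the helper names are illustrative, not from the slides.

```python
def classify(x, node):
    """Recursively route instance x down the tree until a class node."""
    if "class" in node:                      # classification (leaf) node
        return node["class"]
    value = x[node["variable"]]              # value of the tested variable
    return classify(x, node["children"][value])

# Example: the PlayTennis tree shown above.
tree = {"variable": "Outlook", "children": {
    "sunny":    {"variable": "Humidity", "children": {
                    "high": {"class": "no"}, "normal": {"class": "yes"}}},
    "overcast": {"class": "yes"},
    "rainy":    {"variable": "Windy", "children": {
                    "false": {"class": "yes"}, "true": {"class": "no"}}}}}

print(classify({"Outlook": "sunny", "Humidity": "high"}, tree))  # -> no
```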
Decision Tree Learning

Basic Algorithm:
1. Choose Xi, the "best" decision variable for a node N.
2. Assign Xi as the decision variable for node N.
3. For each value of Xi, create a new descendant of N.
4. Sort the training examples to the leaf nodes.
5. IF the training examples are perfectly classified, THEN stop; ELSE iterate over the new leaf nodes.
Variable Quality Measures

Splitting the PlayTennis data on Outlook:

Sunny branch:
Outlook Temp Hum    Wind   Play
Sunny   Hot  High   Weak   No
Sunny   Hot  High   Strong No
Sunny   Mild High   Weak   No
Sunny   Cool Normal Weak   Yes
Sunny   Mild Normal Strong Yes

Overcast branch:
Outlook  Temp Hum    Wind   Play
Overcast Hot  High   Weak   Yes
Overcast Cool Normal Strong Yes

Rain branch:
Outlook Temp Hum    Wind   Play
Rain    Mild High   Weak   Yes
Rain    Cool Normal Weak   Yes
Rain    Cool Normal Strong No
Rain    Mild Normal Weak   Yes
Rain    Mild High   Strong No
Variable Quality Measures
• Let S be a sample of training instances and pj be the proportion of instances of class j (j = 1, …, J) in S.
• Define an impurity measure I(S) that satisfies:
  – I(S) is minimal only when pi = 1 and pj = 0 for j ≠ i (all objects are of the same class);
  – I(S) is maximal only when pj = 1/J for all j (there is exactly the same number of objects of each class);
  – I(S) is symmetric with respect to p1, …, pJ.
Reduction of Impurity: Discrete Variables
• The "best" variable is the variable Xi that determines a split maximizing the expected reduction of impurity:

  ΔI(S, Xi) = I(S) − Σ_j (|S_xij| / |S|) · I(S_xij)

  where S_xij is the subset of instances from S such that Xi = xij.

[Figure: a node testing Xi splits S into subsets S_xi1, S_xi2, …, S_xij.]
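A small Python sketch of this criterion, using the Gini index as one possible impurity measure I(S); the function names and the list-of-dicts data layout are assumptions for illustration.

```python
from collections import Counter

def gini(instances, target):
    """I(S) = 1 - sum_j p_j^2 over the class proportions in S."""
    n = len(instances)
    counts = Counter(x[target] for x in instances)
    return 1.0 - sum((c / n) ** 2 for c in counts.values())

def impurity_reduction(instances, variable, target, impurity=gini):
    """Delta I(S, Xi) = I(S) - sum_j |S_xij| / |S| * I(S_xij)."""
    n = len(instances)
    subsets = {}
    for x in instances:
        subsets.setdefault(x[variable], []).append(x)
    weighted = sum(len(s) / n * impurity(s, target)
                   for s in subsets.values())
    return impurity(instances, target) - weighted
```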
Information Gain: Entropy

Let S be a sample of training examples, where p+ is the proportion of positive examples in S and p− is the proportion of negative examples in S.

Then entropy measures the impurity of S:

  E(S) = − p+ log2 p+ − p− log2 p−
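As a quick sketch, the two-class entropy can be computed as follows (math.log2 is the only dependency; 0 · log2 0 is taken as 0):

```python
import math

def entropy(p_pos):
    """E(S) = -p+ log2 p+ - p- log2 p-, with 0 * log2 0 taken as 0."""
    result = 0.0
    for p in (p_pos, 1.0 - p_pos):
        if p > 0:
            result -= p * math.log2(p)
    return result

print(round(entropy(9 / 14), 2))  # the PlayTennis sample below: 0.94
```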
Entropy Example

In the PlayTennis dataset we have two target classes: yes and no. Out of 14 instances, 9 are classified yes and the remaining 5 no.

  p_yes = 9/14, p_no = 5/14

  E(S) = − p_yes log2 p_yes − p_no log2 p_no
       = − (9/14) log2 (9/14) − (5/14) log2 (5/14)
       ≈ 0.41 + 0.53 = 0.94
[Table: the 14 weather instances (the PlayTennis database shown earlier, with Windy = TRUE/FALSE), from which the entropy above is computed.]
Information Gain

Information gain is the expected reduction in entropy caused by partitioning the instances from S according to a given discrete variable Xi:

  Gain(S, Xi) = E(S) − Σ_j (|S_xij| / |S|) · E(S_xij)

  where S_xij is the subset of instances from S such that Xi = xij.

[Figure: a node testing Xi splits S into subsets S_xi1, S_xi2, …, S_xij.]
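A compact Python sketch of Gain(S, Xi) for discrete variables; the multi-class entropy helper and the list-of-dicts data layout are assumptions carried over from the earlier sketches.

```python
import math
from collections import Counter

def class_entropy(instances, target):
    """E(S) over the class distribution of the instances."""
    n = len(instances)
    counts = Counter(x[target] for x in instances)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())

def information_gain(instances, variable, target):
    """Gain(S, Xi) = E(S) - sum_j |S_xij| / |S| * E(S_xij)."""
    n = len(instances)
    subsets = {}
    for x in instances:
        subsets.setdefault(x[variable], []).append(x)
    remainder = sum(len(s) / n * class_entropy(s, target)
                    for s in subsets.values())
    return class_entropy(instances, target) - remainder
```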
Example

[Figure: the PlayTennis data split on Outlook, as shown earlier. The Overcast branch is pure; which attribute should be tested at the Sunny branch?]

Gain(Ssunny, Humidity)    = 0.970 − (3/5)·0.0 − (2/5)·0.0 = 0.970
Gain(Ssunny, Temperature) = 0.970 − (2/5)·0.0 − (2/5)·1.0 − (1/5)·0.0 = 0.570
Gain(Ssunny, Wind)        = 0.970 − (2/5)·1.0 − (3/5)·0.918 = 0.019

Humidity gives the highest gain, so it is tested at the Sunny branch.
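The same numbers can be reproduced with the information_gain sketch above on the Sunny subset (attribute keys abbreviated for illustration):

```python
sunny = [
    {"Temp": "Hot",  "Hum": "High",   "Wind": "Weak",   "Play": "No"},
    {"Temp": "Hot",  "Hum": "High",   "Wind": "Strong", "Play": "No"},
    {"Temp": "Mild", "Hum": "High",   "Wind": "Weak",   "Play": "No"},
    {"Temp": "Cool", "Hum": "Normal", "Wind": "Weak",   "Play": "Yes"},
    {"Temp": "Mild", "Hum": "Normal", "Wind": "Strong", "Play": "Yes"},
]
for attr in ("Hum", "Temp", "Wind"):
    print(attr, round(information_gain(sunny, attr, "Play"), 3))
# Hum 0.971, Temp 0.571, Wind 0.02 -> split on Humidity
```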
Continuous Variables

Temp: 80 85 83 75 68 65 64 72 75 70 69 72 81 71
Play: No No Yes Yes Yes No Yes No Yes Yes Yes Yes Yes No

Sort by temperature:

Temp: 64  65 68  69  70  71 72 72  75  75  80 81  83  85
Play: Yes No Yes Yes Yes No No Yes Yes Yes No Yes Yes No

Candidate thresholds and the quality of the resulting split:
Temp < 64.5  I = 0.048
Temp < 66.5  I = 0.010
Temp < 70.5  I = 0.045
Temp < 73.5  I = 0.001
Temp < 77.5  I = 0.025
Temp < 80.5  I = 0.000
Temp < 84    I = 0.113
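A rough sketch of scoring candidate thresholds for a continuous variable (midpoints between adjacent distinct sorted values); the helper names and the choice of information gain as the score are assumptions for illustration.

```python
import math

def entropy_counts(pos, neg):
    total = pos + neg
    e = 0.0
    for c in (pos, neg):
        if c:
            e -= (c / total) * math.log2(c / total)
    return e

def best_threshold(pairs):
    """pairs: list of (value, label) tuples with label 'Yes'/'No'."""
    pairs = sorted(pairs)
    n = len(pairs)
    total_pos = sum(1 for _, y in pairs if y == "Yes")
    base = entropy_counts(total_pos, n - total_pos)
    best = None
    pos_left = 0
    for i in range(1, n):
        if pairs[i - 1][1] == "Yes":
            pos_left += 1                    # positives in the left part
        if pairs[i - 1][0] == pairs[i][0]:
            continue                         # no threshold between equal values
        threshold = (pairs[i - 1][0] + pairs[i][0]) / 2
        left = entropy_counts(pos_left, i - pos_left)
        right = entropy_counts(total_pos - pos_left,
                               (n - i) - (total_pos - pos_left))
        gain = base - (i / n) * left - ((n - i) / n) * right
        if best is None or gain > best[1]:
            best = (threshold, gain)
    return best                              # (threshold, gain)
```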
ID3 Algorithm

Informally:
– Determine the variable with the highest information gain on the training set.
– Use this variable as the root; create a branch for each of the values the variable can have.
– For each branch, repeat the process with the subset of the training set that is classified by that branch.
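A compact recursive sketch of this procedure, reusing the information_gain helper and the nested-dict tree representation from the earlier sketches; it is illustrative, not the author's implementation.

```python
from collections import Counter

def majority_class(instances, target):
    return Counter(x[target] for x in instances).most_common(1)[0][0]

def id3(instances, variables, target):
    labels = {x[target] for x in instances}
    if len(labels) == 1:                     # pure node -> leaf
        return {"class": labels.pop()}
    if not variables:                        # no variables left -> majority leaf
        return {"class": majority_class(instances, target)}
    best = max(variables,
               key=lambda v: information_gain(instances, v, target))
    remaining = [v for v in variables if v != best]
    subsets = {}
    for x in instances:
        subsets.setdefault(x[best], []).append(x)
    children = {value: id3(subset, remaining, target)
                for value, subset in subsets.items()}
    return {"variable": best, "children": children}
```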
Hypothesis Space Search in ID3

• The hypothesis space is the set of all decision trees defined over the given set of variables.
• ID3's hypothesis space is a complete space; i.e., the target tree is in it!
• ID3 performs a simple-to-complex, hill-climbing search through this space.

Hypothesis Space Search in ID3
• The evaluation function is the information gain.
• ID3 maintains only a single current decision tree.
• ID3 performs no backtracking in its search.
• ID3 uses all training instances at each step of the search.
Decision Trees are Non-linear Classifiers

[Figure: the (A1, A2) unit square partitioned into axis-parallel rectangles labelled good/bad by a decision tree testing thresholds such as A2 < 0.33, A1 < 0.91, A1 < 0.23, A2 < 0.91, A2 < 0.75, A2 < 0.49, and A2 < 0.65.]
Posterior Class Probabilities

Outlook
  sunny → 2 pos and 3 neg: Ppos = 0.4, Pneg = 0.6
  overcast → 3 pos and 0 neg: Ppos = 1.0, Pneg = 0.0
  rainy → Windy
    false → 2 pos and 0 neg: Ppos = 1.0, Pneg = 0.0
    true → 0 pos and 2 neg: Ppos = 0.0, Pneg = 1.0
Overfitting

Definition: Given a hypothesis space H, a hypothesis h ∈ H is said to overfit the training data if there exists some hypothesis h' ∈ H such that h has smaller error than h' over the training instances, but h' has a smaller error than h over the entire distribution of instances.
Reasons for Overfitting

• Noisy training instances. Consider a noisy training example:
  Outlook = Sunny; Temp = Hot; Humidity = Normal; Wind = True; PlayTennis = No
  This instance conflicts with the training instances:
  Outlook = Sunny; Temp = Cool; Humidity = Normal; Wind = False; PlayTennis = Yes
  Outlook = Sunny; Temp = Mild; Humidity = Normal; Wind = True; PlayTennis = Yes

[Figure: the original PlayTennis decision tree, which sorts all three instances to the leaf under Outlook = sunny, Humidity = normal.]
Reasons for Overfitting

[Figure: the tree grown to fit the noisy example. Under Outlook = sunny, Humidity = normal the tree now tests Windy: false → yes; true → a further test on Temp, with mild → yes, hot → no, and cool → ?.]

Outlook = Sunny; Temp = Hot; Humidity = Normal; Wind = True; PlayTennis = No
Outlook = Sunny; Temp = Cool; Humidity = Normal; Wind = False; PlayTennis = Yes
Outlook = Sunny; Temp = Mild; Humidity = Normal; Wind = True; PlayTennis = Yes
[Figure: a two-class scatter plot in which fitting an isolated noisy "+" point inside the "−" region creates an area with probably wrong predictions.]
Reasons for Overfitting

• A small number of instances is associated with a leaf node. In this case it is possible for coincidental regularities to occur that are unrelated to the actual class borders.
Approaches to Avoiding Overfitting
• Pre-pruning: stop growing the tree earlier, before it reaches the point where it perfectly classifies the training data
• Post-pruning: Allow the tree to overfit the data, and then post-prune the tree.
Pre-pruning

• It is difficult to decide when to stop growing the tree.
• A possible scenario is to stop when a leaf node would receive fewer than m training instances. Here is an example for m = 5.

[Figure: in the full PlayTennis tree, the leaves under sunny and rainy receive only 2 or 3 instances each. With pre-pruning (m = 5) only the root split on Outlook remains, its branches labelled no, yes, and ?.]
Validation Set

• A validation set is a set of instances used to evaluate the utility of nodes in decision trees. The validation set has to be chosen so that it is unlikely to suffer from the same errors or fluctuations as the set used for decision-tree training.
• Usually before pruning the training data is split randomly into a growing set and a validation set.
Reduced-Error Pruning (Sub-tree Replacement)

Split data into growing and validation sets.

Pruning a decision node d consists of:
1. removing the subtree rooted at d;
2. making d a leaf node;
3. assigning d the most common classification of the training instances associated with d.

[Figure: the PlayTennis tree grown on the growing set; the Humidity sub-tree under sunny covers 3 + 2 instances.]
Accuracy of the tree on the validation set is 90%.
Reduced-Error Pruning (Sub-tree Replacement)

Split data into growing and validation sets.

Pruning a decision node d consists of:
1. removing the subtree rooted at d;
2. making d a leaf node;
3. assigning d the most common classification of the training instances associated with d.

[Figure: the tree after replacing the Humidity sub-tree with the leaf "no": Outlook = sunny → no, overcast → yes, rainy → Windy (false → yes, true → no).]
Accuracy of the tree on the validation set is 92.4%.
Reduced-Error Pruning (Sub-tree Replacement)

Split data into growing and validation sets.

Pruning a decision node d consists of:
1. removing the subtree rooted at d;
2. making d a leaf node;
3. assigning d the most common classification of the training instances associated with d.

Do until further pruning is harmful:
1. Evaluate the impact on the validation set of pruning each possible node (plus those below it).
2. Greedily remove the one that most improves validation set accuracy.

[Figure: the pruned tree from the previous slide.]
Accuracy of the tree on the validation set is 92.4%.
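A rough sketch of this greedy sub-tree replacement on the nested-dict trees of the earlier sketches; accuracy(), the node-enumeration helper, and the in-place mutation strategy are assumptions, and the sketch assumes every attribute value in the validation set also occurs in the tree.

```python
from collections import Counter

def accuracy(tree, instances, target):
    correct = sum(classify(x, tree) == x[target] for x in instances)
    return correct / len(instances)

def _interior_nodes(node, instances):
    """Yield (interior node, growing-set instances reaching it)."""
    if "class" in node:
        return
    yield node, instances
    for value, child in node["children"].items():
        subset = [x for x in instances if x[node["variable"]] == value]
        yield from _interior_nodes(child, subset)

def reduced_error_prune(tree, growing, validation, target):
    """Repeatedly replace the sub-tree whose removal most improves
    validation-set accuracy, until no replacement helps."""
    while True:
        base = accuracy(tree, validation, target)
        best_gain, best_node, best_label = 0.0, None, None
        for node, reached in _interior_nodes(tree, growing):
            backup = dict(node)
            label = Counter(x[target] for x in reached).most_common(1)[0][0]
            node.clear()
            node["class"] = label                 # trial pruning
            gain = accuracy(tree, validation, target) - base
            node.clear()
            node.update(backup)                   # undo the trial
            if gain > best_gain:
                best_gain, best_node, best_label = gain, node, label
        if best_node is None:
            return tree
        best_node.clear()
        best_node["class"] = best_label           # commit the best pruning
```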
Reduced-Error Pruning Example

[Figure: a sequence of trees T1–T5 obtained by successive sub-tree replacement, from the full tree (testing Outlook, Humidity, Wind, and Temp) down to a single split on Outlook, with growing-set errors of 0%, 6%, 13%, 27%, and 33% and validation-set errors of 10%, 8%, 15%, 25%, and 35% respectively. The tree with the lowest validation-set error is kept.]
Reduced-Error Pruning (Sub-tree Raising)

Split data into growing and validation sets.

Raising a sub-tree with root d consists of:
1. removing the sub-tree rooted at the parent of d;
2. placing d in the place of its parent;
3. sorting the training instances associated with the parent of d using the sub-tree with root d.

[Figure: the PlayTennis tree; the candidate sub-tree to raise is the Humidity sub-tree under sunny (covering 3 + 2 instances).]
Accuracy of the tree on the validation set is 90%.
Reduced-Error Pruning (Sub-tree Raising)

Split data into growing and validation sets.

Raising a sub-tree with root d consists of:
1. removing the sub-tree rooted at the parent of d;
2. placing d in the place of its parent;
3. sorting the training instances associated with the parent of d using the sub-tree with root d.

[Figure: the tree after raising the Humidity sub-tree to the root: Humidity = high → no, normal → yes.]

Accuracy of the tree on the validation set is 73%. So, no!
Rule Post-Pruning

1. Convert the tree to an equivalent set of rules.
2. Prune each rule independently of the others.
3. Sort the final rules by their estimated accuracy, and consider them in this sequence when classifying subsequent instances.

IF (Outlook = Sunny) & (Humidity = High) THEN PlayTennis = No
IF (Outlook = Sunny) & (Humidity = Normal) THEN PlayTennis = Yes
…

[Figure: the PlayTennis decision tree from which the rules are read off.]
Decision Trees are Non-linear. Can We Make Them Linear?

[Figure: the same axis-parallel partition of the (A1, A2) unit square shown earlier.]
Oblique Decision Trees
[Figure: an oblique test x + y < 1 whose boundary separates Class = + from Class = −.]
• Test condition may involve multiple attributes
• More expressive representation
• Finding optimal test condition is computationally expensive!
Variables with Many Values
• Problem:
  – Such splits are not good: they fragment the data too quickly, leaving insufficient data at the next level.
  – The reduction of impurity of such a test is often high (example: a split on the object id).
• Two solutions:
  – Change the splitting criterion to penalize variables with many values.
  – Consider only binary splits.

[Figure: a split on the variable Letter produces one branch per value a, b, c, …, y, z.]
Variables with Many Values

  SplitInfo(S, A) = − Σ_{i=1..c} (|Si| / |S|) · log2 (|Si| / |S|)

  GainRatio(S, A) = Gain(S, A) / SplitInfo(S, A)

• Example: Outlook in the PlayTennis data:
  – InfoGain(Outlook) = 0.246
  – SplitInformation(Outlook) = 1.577
  – GainRatio(Outlook) = 0.246 / 1.577 = 0.156 < 0.246
• Problem: the gain ratio favours unbalanced tests.
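A small Python sketch of SplitInfo and GainRatio, reusing the information_gain helper from the earlier sketch (the zero-SplitInfo guard is an assumption):

```python
import math
from collections import Counter

def split_info(instances, variable):
    """SplitInfo(S, A): entropy of the split proportions themselves."""
    n = len(instances)
    counts = Counter(x[variable] for x in instances)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())

def gain_ratio(instances, variable, target):
    si = split_info(instances, variable)
    if si == 0:                       # the variable takes a single value
        return 0.0
    return information_gain(instances, variable, target) / si

# For Outlook on the 14 PlayTennis instances this gives
# Gain = 0.246, SplitInfo = 1.577, GainRatio = 0.156.
```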
Missing Values

1. If node n tests variable Xi, assign the most common value of Xi among the other instances sorted to node n.
2. If node n tests variable Xi, assign a probability to each possible value of Xi, estimated from the observed frequencies of the values of Xi at node n. These probabilities (fractional counts) are then used in the information gain measure:

  ΔI(S, Xi) = I(S) − Σ_j (|S_xij| / |S|) · I(S_xij)
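A tiny sketch of strategy 1, filling a missing value with the most common observed value of the tested variable; the MISSING marker and data layout are assumptions for illustration.

```python
from collections import Counter

MISSING = None  # hypothetical marker for an absent value

def fill_most_common(instances, variable):
    """Strategy 1: replace missing values of `variable` by its most
    common value among the other instances sorted to this node."""
    observed = [x[variable] for x in instances if x[variable] is not MISSING]
    most_common = Counter(observed).most_common(1)[0][0]
    for x in instances:
        if x[variable] is MISSING:
            x[variable] = most_common
    return instances
```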
Windowing
If the data do not fit in main memory, use windowing:
1. Select randomly n instances from the training data D and put them in the window set W.
2. Train a decision tree DT on W.
3. Determine the set M of instances from D misclassified by DT.
4. W = W ∪ M.
5. IF NOT(StopCondition) THEN GO TO 2.
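A sketch of this loop around the id3/classify sketches above; stopping when no new instances are misclassified is one reasonable instantiation of StopCondition, not necessarily the author's.

```python
import random

def windowing(data, variables, target, n):
    window = random.sample(data, n)                 # step 1
    while True:
        tree = id3(window, variables, target)       # step 2
        misclassified = [x for x in data            # step 3
                         if x not in window
                         and classify(x, tree) != x[target]]
        if not misclassified:                       # StopCondition
            return tree
        window = window + misclassified             # step 4: W = W U M
```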
Summary Points

1. Decision tree learning provides a practical method for concept learning.
2. ID3-like algorithms search a complete hypothesis space.
3. The inductive bias of decision trees is a preference (search) bias.
4. Overfitting the training data is an important issue in decision tree learning.
5. A large number of extensions of the ID3 algorithm have been proposed for overfitting avoidance, handling missing attributes, handling numerical attributes, etc.
Learning Decision Rules

• Decision Rules
• Basic Sequential Covering Algorithm
• Learn-One-Rule Procedure
• Pruning
Definition of Decision Rules
Example: If you run the Prism algorithm from Weka on the weather data you will get the following set of decision rules:
if outlook = overcast then PlayTennis = yes
if humidity = normal and windy = FALSE then PlayTennis = yes
if temperature = mild and humidity = normal then PlayTennis = yes
if outlook = rainy and windy = FALSE then PlayTennis = yes
if outlook = sunny and humidity = high then PlayTennis = no
if outlook = rainy and windy = TRUE then PlayTennis = no
Definition: Decision rules are rules with the following form:
if <conditions> then concept C.
Why Decision Rules?

• Decision rules are more compact.
• Decision rules are more understandable.

Example: Let X ∈ {0,1}, Y ∈ {0,1}, Z ∈ {0,1}, W ∈ {0,1}. The rules are:
if X=1 and Y=1 then 1
if Z=1 and W=1 then 1
Otherwise 0;
[Figure: the equivalent decision tree tests X, then Y, then Z, then W; the Z–W sub-tree must be duplicated (once under X = 0 and once under Y = 0), so the tree is considerably larger than the two rules.]
Why Decision Rules?

[Figure: the same two-class data set shown twice, contrasting the decision boundaries of decision trees (axis-parallel splits that partition the entire space) with the decision boundaries of decision rules (rectangular regions covering only parts of the space).]
How to Learn Decision Rules?

1. We can convert trees to rules.
2. We can use specific rule-learning methods.
Sequential Covering Algorithms

function LearnRuleSet(Target, Attrs, Examples, Threshold):
  LearnedRules := ∅
  Rule := LearnOneRule(Target, Attrs, Examples)
  while performance(Rule, Examples) > Threshold, do
    LearnedRules := LearnedRules ∪ {Rule}
    Examples := Examples \ {examples covered by Rule}
    Rule := LearnOneRule(Target, Attrs, Examples)
  sort LearnedRules according to performance
  return LearnedRules
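A minimal Python sketch of this loop. Rules are represented as (conditions, class) pairs with conditions a dict {attribute: value}; covers(), performance() (relative frequency), and the learn_one_rule parameter are illustrative assumptions (a concrete greedy learn_one_rule is sketched further below).

```python
def covers(rule, x):
    """True if instance x satisfies every condition of the rule."""
    conditions, _ = rule
    return all(x.get(a) == v for a, v in conditions.items())

def performance(rule, examples, target):
    """Relative frequency of the rule's class among covered examples."""
    covered = [x for x in examples if covers(rule, x)]
    if not covered:
        return 0.0
    return sum(1 for x in covered if x[target] == rule[1]) / len(covered)

def learn_rule_set(target, attrs, examples, threshold, learn_one_rule):
    all_examples = list(examples)
    learned_rules = []
    rule = learn_one_rule(target, attrs, examples)
    while performance(rule, examples, target) > threshold:
        learned_rules.append(rule)
        examples = [x for x in examples if not covers(rule, x)]
        if not examples:
            break
        rule = learn_one_rule(target, attrs, examples)
    learned_rules.sort(key=lambda r: performance(r, all_examples, target),
                       reverse=True)
    return learned_rules
```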
Illustration

[Figure: a sequence of plots of a two-class data set illustrating sequential covering. The first rule is specialised step by step — IF true THEN pos, then IF A THEN pos, then IF A & B THEN pos — the positives it covers are removed, and a second rule is grown in the same way: IF true THEN pos, then IF C THEN pos, then IF C & D THEN pos.]
Learning One Rule

To learn one rule we use one of the strategies below:
• Top-down:
  – Start with a maximally general rule.
  – Add literals one by one.
• Bottom-up:
  – Start with a maximally specific rule.
  – Remove literals one by one.
Bottom-up vs. Top-down

[Figure: the same two-class data set. Top-down search typically produces more general rules; bottom-up search typically produces more specific rules.]
Learning One Rule

• Bottom-up: example-driven (the AQ family).
• Top-down: generate-then-test (CN2).
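A rough top-down, generate-then-test sketch of LearnOneRule that plugs into the learn_rule_set sketch above: start from the maximally general rule and greedily add the attribute = value literal that most improves accuracy on the current examples. The target_class default, tie-breaking, and stopping rule are assumptions.

```python
def learn_one_rule(target, attrs, examples, target_class="pos"):
    conditions = {}                                  # IF true THEN target_class
    rule = (conditions, target_class)
    literals = {(a, x[a]) for x in examples for a in attrs}
    while True:
        best, best_score = None, performance(rule, examples, target)
        for attr, value in literals:
            if attr in conditions:
                continue                             # one literal per attribute
            trial = (dict(conditions, **{attr: value}), target_class)
            score = performance(trial, examples, target)
            if score > best_score:
                best, best_score = (attr, value), score
        if best is None:                             # no literal improves the rule
            return rule
        conditions[best[0]] = best[1]                # specialise the rule
        if best_score == 1.0:                        # rule is consistent; stop
            return rule
```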
Example of Learning One Rule
Heuristics for Learning One Rule

• When is a rule "good"?
  – High accuracy;
  – Less important: high coverage.
• Possible evaluation functions:
  – Relative frequency: nc / n, where nc is the number of correctly classified instances and n is the number of instances covered by the rule;
  – m-estimate of accuracy: (nc + mp) / (n + m), where nc is the number of correctly classified instances, n is the number of instances covered by the rule, p is the prior probability of the class predicted by the rule, and m is the weight of p;
  – Entropy.
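These evaluation functions translate directly into Python; the sketch below is illustrative (for entropy, lower values indicate a better rule):

```python
import math

def relative_frequency(nc, n):
    return nc / n

def m_estimate(nc, n, p, m):
    # (nc + m*p) / (n + m): shrinks the relative frequency towards the
    # class prior p, with m controlling the strength of the shrinkage
    return (nc + m * p) / (n + m)

def rule_entropy(nc, n):
    # entropy of the class distribution among the n covered instances
    # (two outcomes: nc correct, n - nc incorrect)
    e = 0.0
    for c in (nc, n - nc):
        if c:
            e -= (c / n) * math.log2(c / n)
    return e
```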
How to Arrange the Rules

1. The rules are ordered according to the order in which they were learned. This order is used for instance classification.
2. The rules are ordered according to their accuracy. This order is used for instance classification.
3. The rules are not ordered, but there is a strategy for applying them (e.g., an instance covered by conflicting rules gets the classification of the rule that correctly classifies more training instances; an instance not covered by any rule gets the classification of the majority class in the training data).
Approaches to Avoiding Overfitting
• Pre-pruning: stop learning the decision rules before they reach the point where they perfectly classify the training data
• Post-pruning: allow the decision rules to overfit the training data, and then post-prune the rules.
Post-Pruning

1. Split instances into Growing Set and Pruning Set;
2. Learn the set SR of rules using Growing Set;
3. Find the best simplification BSR of SR;
4. while (Accuracy(BSR, Pruning Set) > Accuracy(SR, Pruning Set)) do
   4.1 SR = BSR;
   4.2 Find the best simplification BSR of SR;
5. return BSR.
Incremental Reduced Error Pruning

[Figure: data splits D1, D2 (with sub-splits D21 and D22), and D3, contrasting one-shot post-pruning with incremental reduced error pruning.]
Incremental Reduced Error Pruning

1. Split Training Set into Growing Set and Validation Set;
2. Learn rule R using Growing Set;
3. Prune the rule R using Validation Set;
4. if performance(R, Training Set) > Threshold
   4.1 Add R to Set of Learned Rules;
   4.2 Remove from Training Set the instances covered by R;
   4.3 go to 1;
5. else return Set of Learned Rules.
Summary Points
1. Decision rules are easier for human comprehension than decision trees.
2. Decision rules have simpler decision boundaries than decision trees.
3. Decision rules are learned by sequential covering of the training instances.
Lab 1: Some Details
Model Evaluation Techniques

• Evaluation on the training set: too optimistic.

[Figure: the classifier is trained and evaluated on the same training set.]
Model Evaluation Techniques

• Hold-out method: depends on the make-up of the test set.

[Figure: the data is split into a training set used to build the classifier and a test set used to evaluate it.]

• To improve the precision of the hold-out method, it is repeated many times.
Model Evaluation Techniques

• k-fold Cross-Validation

[Figure: the data is split into k folds (here k = 3); in each round a different fold serves as the test set (train-train-test, train-test-train, test-train-train) and the remaining folds train the classifier.]
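A small sketch of k-fold cross-validation built around the id3/classify sketches above (any train/predict pair could be substituted); the fold construction and the assumption that every attribute value occurs in each training fold are simplifications.

```python
import random

def cross_validate(data, variables, target, k=3, seed=0):
    data = list(data)
    random.Random(seed).shuffle(data)
    folds = [data[i::k] for i in range(k)]            # k roughly equal folds
    accuracies = []
    for i in range(k):
        test = folds[i]
        train = [x for j, fold in enumerate(folds) if j != i for x in fold]
        tree = id3(train, variables, target)
        correct = sum(classify(x, tree) == x[target] for x in test)
        accuracies.append(correct / len(test))
    return sum(accuracies) / k                        # mean accuracy
```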
Intro to Weka

@relation weather.symbolic

@attribute outlook {sunny, overcast, rainy}
@attribute temperature {hot, mild, cool}
@attribute humidity {high, normal}
@attribute windy {TRUE, FALSE}
@attribute play {TRUE, FALSE}

@data
sunny,hot,high,FALSE,FALSE
sunny,hot,high,TRUE,FALSE
overcast,hot,high,FALSE,TRUE
rainy,mild,high,FALSE,TRUE
rainy,cool,normal,FALSE,TRUE
rainy,cool,normal,TRUE,FALSE
overcast,cool,normal,TRUE,TRUE
…
References

• Mitchell, T. M. 1997. Machine Learning. New York: McGraw-Hill.
• Quinlan, J. R. 1986. Induction of Decision Trees. Machine Learning.
• Russell, S., and Norvig, P. 2010. Artificial Intelligence: A Modern Approach. New Jersey: Prentice Hall.