
Decision Tree Learning
Gabriella Kókai, Lehrstuhl für Informatik 2: Machine Learning


Contents

➔ Introduction
   Decision Tree Representation
   Appropriate Problems for Decision Tree Learning
   The Basic Decision Tree Learning Algorithm (ID3)
   Hypothesis Space Search in Decision Tree Learning
   Inductive Bias in Decision Tree Learning
   Issues in Decision Tree Learning
   Summary


Introduction

- Widely used practical method for inductive inference
- Approximates discrete-valued functions
- Searches a completely expressive hypothesis space
- Inductive bias: preferring small trees over large ones
- Robust to noisy data and capable of learning disjunctive expressions
- Learned trees can also be re-represented as a set of if-then rules
- Algorithms: ID3, ASSISTANT, C4.5


Contents

   Introduction
➔ Decision Tree Representation
   Appropriate Problems for Decision Tree Learning
   The Basic Decision Tree Learning Algorithm (ID3)
   Hypothesis Space Search in Decision Tree Learning
   Inductive Bias in Decision Tree Learning
   Issues in Decision Tree Learning
   Summary


Decision Tree Representation

A decision tree classifies instances:

- Node: an attribute which describes an instance
- Branch: the possible values of the attribute
- Leaf: the class to which the instances belong

Procedure (of classifying):
- An instance is classified by starting at the root node of the tree
- Repeat:
  - test the attribute specified by the node
  - move down the tree branch corresponding to the value of that attribute in the given example

Example: the instance
  (Outlook = Sunny, Temperature = Hot, Humidity = High, Wind = Strong)
is classified as a negative example.

In general, a decision tree represents a disjunction of conjunctions of constraints on the attribute values of the instances, e.g.

  (Outlook = Sunny ∧ Humidity = Normal) ∨ (Outlook = Overcast) ∨ (Outlook = Rain ∧ Wind = Weak)
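A minimal sketch (not from the slides) of how such a tree and the classification procedure can be written down concretely, using an assumed nested-dictionary encoding in Python:

# Assumed encoding: an inner node maps an attribute name to a dict of
# value -> subtree; a leaf is simply the class label ("Yes" / "No").
play_tennis_tree = {
    "Outlook": {
        "Sunny":    {"Humidity": {"High": "No", "Normal": "Yes"}},
        "Overcast": "Yes",
        "Rain":     {"Wind": {"Weak": "Yes", "Strong": "No"}},
    }
}

def classify(tree, instance):
    """Start at the root; repeatedly test the attribute of the current node and
    follow the branch matching the instance's value, until a leaf is reached."""
    while isinstance(tree, dict):
        attribute, branches = next(iter(tree.items()))
        tree = branches[instance[attribute]]
    return tree

# The instance above is classified as a negative example:
print(classify(play_tennis_tree, {"Outlook": "Sunny", "Temperature": "Hot",
                                  "Humidity": "High", "Wind": "Strong"}))  # -> No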


Decision Tree Representation 2


Contents

   Introduction
   Decision Tree Representation
➔ Appropriate Problems for Decision Tree Learning
   The Basic Decision Tree Learning Algorithm (ID3)
   Hypothesis Space Search in Decision Tree Learning
   Inductive Bias in Decision Tree Learning
   Issues in Decision Tree Learning
   Summary


Appropriate Problems for Decision Tree Learning

Decision tree learning is generally best suited to problems with the following characteristics:
- Instances are represented by attribute-value tuples
  - easiest case: each attribute takes on a small number of disjoint possible values
  - extension: handling real-valued attributes
- The target function has discrete output values
  - extension 1: learning functions with more than two possible output values
- Disjunctive descriptions may be required
- The training data may contain errors:
  - errors in the classification of the training examples
  - errors in the attribute values that describe these examples
- The training data may contain missing attribute values

Classification problems: problems in which the task is to classify examples into one of a set of possible categories.


Contents

   Introduction
   Decision Tree Representation
   Appropriate Problems for Decision Tree Learning
➔ The Basic Decision Tree Learning Algorithm (ID3)
   Hypothesis Space Search in Decision Tree Learning
   Inductive Bias in Decision Tree Learning
   Issues in Decision Tree Learning
   Summary


The Basic Decision Tree Learning Algorithm

- Top-down, greedy search through the space of possible decision trees
- ID3 (Quinlan 1986), C4.5 (Quinlan 1993) and other variations
- Question: which attribute should be tested at a node of the tree?
- Answer:
  - A statistical test selects the best attribute (how well it alone classifies the training examples)
  - Descendants of the root node are created (one for each possible value of this attribute), and the training examples are sorted to the appropriate descendant node
  - The process is then repeated for each descendant
  - The algorithm never backtracks to reconsider earlier choices


The Basic Decision Tree Learning Algorithm 2

ID3(examples, target_attr, attributes)
  Create a root node for the tree
  If all examples are positive, return root with label +
  If all examples are negative, return root with label -
  If attributes is empty, return root with label = most common value of target_attr in examples
  Otherwise:
    A <- the attribute from attributes with the highest Gain(examples, A)
    attr(root) <- A
    For each possible value vi of A:
      Add a new branch below root for the test A = vi
      examples_vi <- the subset of examples with value vi for A
      If examples_vi is empty:
        below this branch add a leaf labelled with the most common value of target_attr in examples
      Else:
        below this branch add the subtree ID3(examples_vi, target_attr, attributes - {A})
  Return root
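As an illustration only, the same recursion can be sketched in Python, representing examples as dictionaries and trees in the nested-dictionary form used above; the information_gain helper is assumed here and defined on the following slides:

from collections import Counter

def id3(examples, target_attr, attributes):
    """ID3 sketch: examples are dicts mapping attribute names to values."""
    labels = [ex[target_attr] for ex in examples]
    if len(set(labels)) == 1:                 # all examples in one class
        return labels[0]
    if not attributes:                        # no attributes left: majority leaf
        return Counter(labels).most_common(1)[0][0]
    # attribute with the highest information gain (helper defined on later slides)
    best = max(attributes, key=lambda a: information_gain(examples, a, target_attr))
    tree = {best: {}}
    # the pseudocode loops over every possible value of A; here we loop over the
    # values observed in examples, so the empty-subset case cannot arise
    for value in set(ex[best] for ex in examples):
        subset = [ex for ex in examples if ex[best] == value]
        remaining = [a for a in attributes if a != best]
        tree[best][value] = id3(subset, target_attr, remaining)
    return tree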


Which Attribute Is the Best Classifier?

INFORMATION GAIN: measures how well a given attribute separates the training examples.
ENTROPY: characterizes the (im)purity of an arbitrary collection of examples.

Given a collection S of positive and negative examples:

  Entropy(S) ≡ - p_+ log_2 p_+ - p_- log_2 p_-

where p_+ is the proportion of positive examples in S, p_- is the proportion of negative examples in S, and 0 log 0 is defined to be 0.

Example: S = [9+, 5-]

  Entropy([9+, 5-]) = -(9/14) log_2(9/14) - (5/14) log_2(5/14) = 0.940

Notice:
- Entropy is 0 if all members belong to the same class
- Entropy is 1 when the collection contains an equal number of positive and negative examples
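As a small check of this formula (an illustrative sketch, not part of the original slides), entropy can be computed in Python:

import math

def entropy(labels):
    """Entropy of a collection, given the list of its class labels."""
    n = len(labels)
    result = 0.0
    for cls in set(labels):                   # classes with p = 0 never occur here,
        p = labels.count(cls) / n             # matching the 0 log 0 = 0 convention
        result -= p * math.log2(p)
    return result

# S = [9+, 5-]
print(f"{entropy(['+'] * 9 + ['-'] * 5):.3f}")   # 0.940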


Which Attribute Is the Best Classifier?

Entropy specifies the minimum number of bits of information needed to encode the classification of an arbitrary member of S.

More generally, for c classes:

  Entropy(S) ≡ - Σ_{i=1}^{c} p_i log_2 p_i

where p_i is the proportion of S belonging to class i.

The entropy function relative to a boolean classification, as a function of the proportion p_+ of positive examples, varies between 0 and 1.



Information Gain Measures the Expected Reduction in Entropy

INFORMATION GAIN Gain(S, A): the expected reduction in entropy caused by partitioning the examples according to attribute A:

  Gain(S, A) ≡ Entropy(S) - Σ_{v ∈ Values(A)} (|S_v| / |S|) Entropy(S_v)

where Values(A) is the set of all possible values for A and S_v = {s ∈ S | A(s) = v}.

Example: Values(Wind) = {Weak, Strong}, S = [9+, 5-], S_Weak = [6+, 2-], S_Strong = [3+, 3-]

  Gain(S, Wind) = Entropy(S) - Σ_{v ∈ {Weak, Strong}} (|S_v| / |S|) Entropy(S_v)
                = Entropy(S) - (8/14) Entropy(S_Weak) - (6/14) Entropy(S_Strong)
                = 0.940 - (8/14) 0.811 - (6/14) 1.00 = 0.048
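The same quantity as a Python sketch (reusing the entropy helper above; examples are dicts as in the earlier sketches):

def information_gain(examples, attribute, target_attr="PlayTennis"):
    """Expected reduction in entropy from partitioning examples on attribute."""
    labels = [ex[target_attr] for ex in examples]
    gain = entropy(labels)
    for value in set(ex[attribute] for ex in examples):
        subset = [ex[target_attr] for ex in examples if ex[attribute] == value]
        gain -= len(subset) / len(examples) * entropy(subset)
    return gain

Applied to the PlayTennis table on a later slide, this reproduces Gain(S, Wind) ≈ 0.048.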


Information Gain Measures the Expected Reduction in Entropy 2


An Illustrative Example

ID3 determines the information gain for each candidate attribute:
  Gain(S, Outlook) = 0.246
  Gain(S, Humidity) = 0.151
  Gain(S, Wind) = 0.048
  Gain(S, Temperature) = 0.029

Outlook provides the best prediction; for Outlook = Overcast all examples are positive.

Day  Outlook   Temperature  Humidity  Wind    PlayTennis
D1   Sunny     Hot          High      Weak    No
D2   Sunny     Hot          High      Strong  No
D3   Overcast  Hot          High      Weak    Yes
D4   Rain      Mild         High      Weak    Yes
D5   Rain      Cool         Normal    Weak    Yes
D6   Rain      Cool         Normal    Strong  No
D7   Overcast  Cool         Normal    Strong  Yes
D8   Sunny     Mild         High      Weak    No
D9   Sunny     Cool         Normal    Weak    Yes
D10  Rain      Mild         Normal    Weak    Yes
D11  Sunny     Mild         Normal    Strong  Yes
D12  Overcast  Mild         High      Strong  Yes
D13  Overcast  Hot          Normal    Weak    Yes
D14  Rain      Mild         High      Strong  No
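Feeding this table into the sketches above reproduces the gains listed on the slide (small differences in the third decimal come from rounding intermediate entropies):

columns = ["Outlook", "Temperature", "Humidity", "Wind", "PlayTennis"]
rows = [
    ("Sunny", "Hot", "High", "Weak", "No"),          ("Sunny", "Hot", "High", "Strong", "No"),
    ("Overcast", "Hot", "High", "Weak", "Yes"),      ("Rain", "Mild", "High", "Weak", "Yes"),
    ("Rain", "Cool", "Normal", "Weak", "Yes"),       ("Rain", "Cool", "Normal", "Strong", "No"),
    ("Overcast", "Cool", "Normal", "Strong", "Yes"), ("Sunny", "Mild", "High", "Weak", "No"),
    ("Sunny", "Cool", "Normal", "Weak", "Yes"),      ("Rain", "Mild", "Normal", "Weak", "Yes"),
    ("Sunny", "Mild", "Normal", "Strong", "Yes"),    ("Overcast", "Mild", "High", "Strong", "Yes"),
    ("Overcast", "Hot", "Normal", "Weak", "Yes"),    ("Rain", "Mild", "High", "Strong", "No"),
]
examples = [dict(zip(columns, row)) for row in rows]

for attr in ["Outlook", "Humidity", "Wind", "Temperature"]:
    print(attr, round(information_gain(examples, attr), 3))
# Outlook ~0.247, Humidity ~0.152, Wind ~0.048, Temperature ~0.029:
# Outlook has the highest gain and becomes the root, as on the slide.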


An Illustrative Example 2


An Illustrative Example 3

The process continues for each new leaf node until either:
- every attribute has already been included along the path through the tree, or
- the training examples associated with this leaf node all have the same target attribute value


Contents

   Introduction
   Decision Tree Representation
   Appropriate Problems for Decision Tree Learning
   The Basic Decision Tree Learning Algorithm (ID3)
➔ Hypothesis Space Search in Decision Tree Learning
   Inductive Bias in Decision Tree Learning
   Issues in Decision Tree Learning
   Summary


Hypothesis Space Search in Decision Tree Learning

Hypothesis space for ID3: The set of possible decision trees

ID3 performs a simple-to-complex, hill-climbing search through this hypothesis space:
- Beginning: the empty tree
- Considering: progressively more elaborate hypotheses
- Evaluation function: information gain


Hypothesis Space Search in Decision Tree Learning 2

Capabilities and limitations:
- ID3's hypothesis space of all decision trees is the complete space of finite discrete-valued functions, relative to the available attributes
  => every finite discrete-valued function can be represented by some decision tree
  => avoids the risk that the hypothesis space might not contain the target function
- Maintains only a single current hypothesis (in contrast to the Candidate-Elimination algorithm)
- No backtracking in the search => may converge to a locally optimal solution
- Uses all training examples at each step => the resulting search is much less sensitive to errors in individual training examples


Contents

   Introduction
   Decision Tree Representation
   Appropriate Problems for Decision Tree Learning
   The Basic Decision Tree Learning Algorithm (ID3)
   Hypothesis Space Search in Decision Tree Learning
➔ Inductive Bias in Decision Tree Learning
   Issues in Decision Tree Learning
   Summary


Inductive Bias in Decision Tree Learning

INDUCTIVE BIAS: the set of assumptions that, together with the training data, deductively justify the classifications assigned by the learner to future instances.

In ID3 the basis is how it chooses one consistent hypothesis over the others. ID3's search strategy:
- Selects shorter trees in favour of larger ones
- Selects trees in which the attributes with the highest information gain are closest to the root

It is difficult to characterise this bias precisely, but approximately: shorter trees are preferred over larger ones. One could imagine an algorithm like ID3 that performs a breadth-first search (BFS-ID3); ID3 can be viewed as an efficient approximation of BFS-ID3, but it exhibits a more complex bias and does not always find the shortest tree.


Inductive Bias in Decision Tree Learning

A closer approximation to the inductive bias of ID3: shorter trees are preferred over longer trees, and trees that place high information gain attributes close to the root are preferred over those that do not.

Occam's razor: Prefer the simplest hypothesis that fits the data


Restriction Biases and Preference Biases

Difference in the inductive bias exhibited by ID3 and Candidate-Elimination:
- ID3 searches a complete hypothesis space incompletely; Candidate-Elimination searches an incomplete hypothesis space completely
- The inductive bias of ID3 follows from its search strategy; the inductive bias of the Candidate-Elimination algorithm follows from the definition of its search space
- The inductive bias of ID3 is thus a preference for certain hypotheses over others; the bias of the Candidate-Elimination algorithm takes the form of a categorical restriction on the set of hypotheses
- Typically a preference bias is more desirable than a restriction bias (the learner can work within the complete hypothesis space)
- A restriction bias (strictly limiting the set of potential hypotheses) is generally less desirable (possibility of excluding the unknown target function)


Contents

   Introduction
   Decision Tree Representation
   Appropriate Problems for Decision Tree Learning
   The Basic Decision Tree Learning Algorithm (ID3)
   Hypothesis Space Search in Decision Tree Learning
   Inductive Bias in Decision Tree Learning
➔ Issues in Decision Tree Learning
   Summary


Issues in Decision Tree Learning

Issues include:
- Determining how deeply to grow the decision tree
- Handling continuous attributes
- Choosing an appropriate attribute-selection measure
- Handling training data with missing attribute values
- Handling attributes with different costs
- Improving computational efficiency


Avoiding Overfitting the Data

Definition: Given a hypothesis space H, a hypothesis h ∈ H is said to overfit the training data if there exists some alternative hypothesis h' ∈ H such that h has smaller error than h' over the training examples, but h' has a smaller error than h over the entire distribution of instances.


Avoiding Overfitting the Data 2

How can it be possible that a tree h fits the training examples better than h', but performs more poorly on subsequent examples?
- The training examples contain random errors or noise
  Example: add the following positive training example, incorrectly labeled as negative:
    (Outlook = Sunny, Temperature = Hot, Humidity = Normal, Wind = Strong), PlayTennis = No
  Result: the new example is sorted to the same leaf as D9 and D11, and ID3 searches for further refinements below that node
- Small numbers of examples are associated with leaf nodes (coincidental regularities)

An experimental study of ID3 involving five different learning tasks (noisy, nondeterministic data) showed that overfitting decreased the accuracy by 10-20%.

APPROACHES:
- Stop growing the tree earlier, before it reaches the point where it perfectly classifies the training data
- Allow the tree to overfit the data and then post-prune the tree


Avoiding Overfitting the Data 3

Criteria to determine the correct final tree size:
- Training and validation set: use a set of examples, separate from the training examples, to evaluate the utility of post-pruning nodes from the tree
- Use all the available data for training, but apply a statistical test to estimate whether expanding (or pruning) a particular node is likely to produce an improvement beyond the training set
- Use an explicit measure of the complexity of encoding the training examples and the decision tree


Reduced Error Pruning

How exactly might a validation set be used to prevent overfitting? Reduced-error pruning (a sketch follows below):
- Consider each of the decision nodes to be a candidate for pruning
- Pruning a node means substituting the subtree rooted at that node by a leaf labelled with the most common class of the training examples assigned to that node
- Nodes are removed only if the resulting pruned tree performs no worse than the original over the validation set
- Nodes are pruned iteratively, always choosing the node whose removal most increases the accuracy of the decision tree over the validation set
- Continue until further pruning is harmful (i.e. decreases the accuracy over the validation set)
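A minimal sketch of this loop, assuming the nested-dictionary tree encoding and the classify helper from the earlier sketches; for simplicity the leaf label here is the overall majority class, whereas the slides use the majority class of the examples assigned to the pruned node:

import copy
from collections import Counter

def accuracy(tree, examples, target_attr="PlayTennis"):
    return sum(classify(tree, ex) == ex[target_attr] for ex in examples) / len(examples)

def prune_candidates(tree, path=()):
    """Yield the path (a tuple of (attribute, value) steps) to every decision node."""
    if isinstance(tree, dict):
        yield path
        attribute, branches = next(iter(tree.items()))
        for value, subtree in branches.items():
            yield from prune_candidates(subtree, path + ((attribute, value),))

def replace_with_leaf(tree, path, leaf):
    """Return a copy of tree with the subtree at path replaced by the leaf label."""
    if not path:
        return leaf
    pruned = copy.deepcopy(tree)
    node = pruned
    for attribute, value in path[:-1]:
        node = node[attribute][value]
    attribute, value = path[-1]
    node[attribute][value] = leaf
    return pruned

def reduced_error_pruning(tree, train, validation, target_attr="PlayTennis"):
    majority = Counter(ex[target_attr] for ex in train).most_common(1)[0][0]
    while True:
        best_tree, best_acc = None, accuracy(tree, validation, target_attr)
        for path in prune_candidates(tree):
            candidate = replace_with_leaf(tree, path, majority)
            acc = accuracy(candidate, validation, target_attr)
            if acc >= best_acc:              # prune only if no worse on the validation set
                best_tree, best_acc = candidate, acc
        if best_tree is None:                # further pruning would be harmful
            return tree
        tree = best_tree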


Reduced Error Pruning 2

Here the validation set used for pruning is distinct from both the training and test sets

Disadvantage: Data is limited (withholding part of it for the validation set reduces even further the number of examples available for training)

Many additional techniques have been proposed


Rule Post-Pruning

Rule post-pruning involves the following steps:
1. Infer the decision tree, growing it until the training data fit as well as possible and allowing overfitting to occur
2. Convert the learned tree into an equivalent set of rules by creating one rule for each path from the root to a leaf node
3. Prune each rule by removing any preconditions whose removal improves its estimated accuracy
4. Sort the pruned rules by their estimated accuracy, and consider them in this sequence when classifying subsequent instances

Example:


Rule Post-Pruning 2

One rule is generated for each leaf node in the tree:
- Antecedent: each attribute test along the path from the root to the leaf
- Consequent: the classification at the leaf
Example:
  IF (Outlook = Sunny) ∧ (Humidity = High) THEN PlayTennis = No

Remove any precondition whose removal does not worsen the rule's estimated accuracy.
Example: consider removing (Outlook = Sunny) and (Humidity = High)

C4.5 evaluates rule performance by using a pessimistic estimate:
- Calculate the rule accuracy over the training examples
- Calculate the standard deviation in this estimated accuracy, assuming a binomial distribution
- For a given confidence level, the lower-bound estimate is then taken as the measure of rule performance
Advantage: for large data sets the pessimistic estimate is very close to the observed accuracy
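The pessimistic estimate can be sketched as a simple lower confidence bound under a normal approximation to the binomial; C4.5's actual formula differs in its details, so this is only an illustration:

import math

def pessimistic_accuracy(correct, total, z=1.96):
    """Lower confidence bound on rule accuracy; z = 1.96 corresponds to ~95% confidence."""
    acc = correct / total
    stddev = math.sqrt(acc * (1.0 - acc) / total)
    return acc - z * stddev

# e.g. a rule that covers 40 training examples and classifies 36 correctly:
print(round(pessimistic_accuracy(36, 40), 3))   # ~0.807, used instead of the raw 0.9

The more training examples a rule covers, the smaller the standard deviation and the closer the pessimistic estimate is to the observed accuracy, which is the advantage noted above.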


Rule Post-Pruning 3

Why is it good to convert the decision tree to rules before pruning?
- It allows distinguishing among the different contexts in which the decision tree is used: 1 path = 1 rule, so pruning decisions can be made differently for each path
- It removes the distinction between attribute tests that occur near the root of the tree and those that occur near the leaves
- It avoids reorganising the tree if the root node is pruned
- Converting to rules improves readability: rules are often easier for people to understand


Summary

- Decision trees are a practical method for concept learning and for learning other discrete-valued functions
- ID3 infers decision trees by searching a complete hypothesis space => avoids the risk that the target function might not be present in the hypothesis space
- The inductive bias implicit in ID3 includes a preference for smaller trees
- Overfitting the training data is an important issue, addressed by pruning
- Extensions of the basic algorithm handle the issues listed above (continuous attributes, missing values, attribute costs, ...)