1
Universidad de Buenos Aires
Maestría en Data Mining y Knowledge Discovery

Machine Learning
5 - Decision Tree Induction (2/2)

Eduardo Poggi ([email protected])
Ernesto Mislej ([email protected])

Autumn 2005
2 Decision Trees

Definition
Mechanism
Splitting Functions
Hypothesis Space and Bias
Issues in Decision-Tree Learning
Avoiding overfitting through pruning
Numeric and missing attributes
3 Example of a Decision Tree

Example: learning to classify stars.

[Figure: a decision tree whose root tests Luminosity against threshold T1 and whose second internal node tests Mass against threshold T2; the leaves assign Type A, Type B, and Type C.]
4 Short vs Long Hypotheses

The top-down, greedy approach to constructing decision trees mentioned earlier denotes a preference for short hypotheses over long hypotheses.

Why is this the right thing to do?

Occam's Razor: prefer the simplest hypothesis that fits the data.

The idea dates back to William of Occam (1320) and remains a great debate in the philosophy of science.
5 Issues in Decision Tree Learning

Practical issues while building a decision tree can be enumerated as follows:

1) How deep should the tree be?
2) How do we handle continuous attributes?
3) What is a good splitting function?
4) What happens when attribute values are missing?
5) How do we improve the computational efficiency?
6 How Deep Should the Tree Be? Overfitting the Data

A tree overfits the data if we let it grow deep enough that it begins to capture "aberrations" in the data that harm the predictive power on unseen examples:
[Figure: scatter plot with axes size and humidity; extra thresholds t2 and t3 isolate a few examples. Possibly just noise, but the tree is grown deeper to capture these examples.]
7 Overfitting the Data: Definition

Assume a hypothesis space H. We say a hypothesis h in H overfits a dataset D if there is another hypothesis h' in H such that h has better classification accuracy than h' on D but worse classification accuracy than h' on unseen data D'.
[Figure: accuracy (0.5 to 1.0) vs. size of the tree. Accuracy on the training data keeps increasing while accuracy on the testing data peaks and then declines; the region where the curves diverge is overfitting.]
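The definition can be sketched on an illustrative 1-D toy problem: hypothesis h memorizes the training set (the behavior of a maximally deep tree), while h' is a single threshold test (the behavior of a depth-1 tree). With 10% label noise, h beats h' on the training data D but loses to it on unseen data D'. The data, noise rate, and threshold are all assumptions made for the sketch.

```python
import random

random.seed(0)

# Toy 1-D problem (illustrative): the true class is 1 when x > 0.5,
# but 10% of the training labels are flipped by noise.
def true_label(x):
    return 1 if x > 0.5 else 0

train_x = [random.random() for _ in range(200)]
train = [(x, true_label(x) if random.random() > 0.1 else 1 - true_label(x))
         for x in train_x]
test = [(x, true_label(x)) for x in (random.random() for _ in range(200))]

# h: memorize the training set (a maximally deep tree).
memory = dict(train)
def h(x):
    return memory.get(x, 0)  # unseen points fall back to class 0

# h': a single threshold test (a depth-1 tree).
def h_prime(x):
    return 1 if x > 0.5 else 0

def accuracy(f, data):
    return sum(f(x) == y for x, y in data) / len(data)

print(accuracy(h, train), accuracy(h_prime, train))  # h is better on D
print(accuracy(h, test), accuracy(h_prime, test))    # h' is better on D'
```

The memorizer is perfect on D (it stored every noisy label) but near-chance on D', while the simple rule loses a few noisy training points and classifies all unseen points correctly: exactly the h/h' relationship in the definition.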
8 Causes of Overfitting the Data

What causes a hypothesis to overfit the data?

1) Random errors or noise: examples have an incorrect class label or incorrect attribute values.

2) Coincidental patterns: by chance, examples seem to deviate from a pattern due to the small size of the sample.

Overfitting is a serious problem that can cause strong performance degradation.
9 Solutions for Overfitting the Data

There are two main classes of solutions:

1) Stop the tree early, before it begins to overfit the data.
+ In practice this solution is hard to implement because it is not clear what a good stopping point is.

2) Grow the tree until the algorithm stops, even if the overfitting problem shows up. Then prune the tree as a post-processing step.
+ This method has found great popularity in the machine learning community.
10 Decision Tree Pruning

1) Grow the tree to learn the training data.
2) Prune the tree to avoid overfitting the data.
11 Methods to Validate the New Tree

1. Training and Validation Set Approach

Divide dataset D into a training set TR and a testing set TE.
Build a decision tree on TR.
Test pruned trees on TE to decide the best final tree.

[Figure: Dataset D is split into Training TR and Testing TE.]
12 Methods to Validate the New Tree

2. Use a statistical test

Use all of dataset D for training.
Use a statistical test (e.g., chi-squared) to decide whether to expand a node or not.

[Figure: a candidate node asking "Should I expand or not?"]
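A sketch of such a test: compare each child's class counts against the counts expected if the split were irrelevant (i.e., if every child mirrored the parent's class proportions), and expand only when the chi-squared statistic exceeds a critical value. The class counts below and the 0.05 critical value for one degree of freedom are illustrative assumptions.

```python
def chi_squared_statistic(parent_counts, child_counts_list):
    """Chi-squared statistic comparing each child's class counts with the
    counts expected under the parent's class proportions."""
    n_parent = sum(parent_counts.values())
    stat = 0.0
    for child in child_counts_list:
        n_child = sum(child.values())
        for cls, n_cls in parent_counts.items():
            expected = n_cls * n_child / n_parent
            observed = child.get(cls, 0)
            stat += (observed - expected) ** 2 / expected
    return stat

# Illustrative counts: 50 positive / 50 negative examples at the node.
parent = {"+": 50, "-": 50}
uninformative = [{"+": 25, "-": 25}, {"+": 25, "-": 25}]  # mirrors the parent
informative = [{"+": 45, "-": 5}, {"+": 5, "-": 45}]      # separates classes
CRITICAL_95 = 3.84  # chi-squared critical value, 1 degree of freedom, alpha 0.05
print(chi_squared_statistic(parent, uninformative) > CRITICAL_95)  # False: don't expand
print(chi_squared_statistic(parent, informative) > CRITICAL_95)    # True: expand
```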
13 Methods to Validate the New Tree

3. Use an encoding scheme to capture the size of the tree and the errors made by the tree.

Use all of dataset D for training.
Use the encoding scheme to know when to stop growing the tree.
The method is known as the minimum description length principle.
14 Training and Validation

There are two approaches:

A. Reduced Error Pruning
B. Rule Post-Pruning

Dataset D is divided into:
Training TR (normally 2/3 of D)
Testing TE (normally 1/3 of D)
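The 2/3 - 1/3 split can be sketched with a hypothetical helper; the dataset, fraction, and seed are illustrative.

```python
import random

def split_train_test(dataset, train_fraction=2/3, seed=0):
    """Shuffle D and split it into a training set TR and a testing set TE."""
    data = list(dataset)
    random.Random(seed).shuffle(data)       # seeded for reproducibility
    cut = round(len(data) * train_fraction)
    return data[:cut], data[cut:]

D = list(range(30))        # illustrative dataset of 30 examples
TR, TE = split_train_test(D)
print(len(TR), len(TE))    # 20 10
```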
15 Reduced Error Pruning

Main idea:

1) Consider all internal nodes in the tree.
2) For each node, check whether removing it (along with the subtree below it) and assigning the most common class to it does not harm accuracy on the validation set.
3) Pick the node n* that yields the best performance and prune its subtree.
4) Go back to (2) until no more improvements are possible.
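The steps above can be sketched on a hypothetical tree representation: an internal node is a dict holding the tested attribute, its branches, and the most common class at that node; a leaf is just a class label. The tree and validation set below are illustrative.

```python
import copy

def classify(node, example):
    while isinstance(node, dict):
        node = node["branches"][example[node["attr"]]]
    return node

def accuracy(tree, data):
    return sum(classify(tree, x) == y for x, y in data) / len(data)

def internal_nodes(node, path=()):
    """Yield the branch-value path to every internal node (step 1)."""
    if isinstance(node, dict):
        yield path
        for value, child in node["branches"].items():
            yield from internal_nodes(child, path + (value,))

def prune_at(tree, path):
    """Replace the subtree at `path` by its most common class (step 2)."""
    tree = copy.deepcopy(tree)
    if not path:
        return tree["majority"]
    node = tree
    for value in path[:-1]:
        node = node["branches"][value]
    node["branches"][path[-1]] = node["branches"][path[-1]]["majority"]
    return tree

def reduced_error_prune(tree, validation):
    best_acc = accuracy(tree, validation)
    while isinstance(tree, dict):
        candidates = [prune_at(tree, p) for p in internal_nodes(tree)]
        best = max(candidates, key=lambda t: accuracy(t, validation))  # step 3
        if accuracy(best, validation) < best_acc:   # pruning would harm accuracy
            break
        tree, best_acc = best, accuracy(best, validation)  # step 4: repeat
    return tree

tree = {"attr": "x1", "majority": "A", "branches": {
    0: {"attr": "x2", "majority": "A", "branches": {0: "A", 1: "B"}},
    1: {"attr": "x3", "majority": "A", "branches": {0: "A", 1: "C"}},
}}
validation = [({"x1": 0, "x2": 0, "x3": 0}, "A"),
              ({"x1": 0, "x2": 1, "x3": 0}, "B"),
              ({"x1": 1, "x2": 0, "x3": 1}, "A")]
pruned = reduced_error_prune(tree, validation)
print(pruned)  # the x3 subtree is replaced by the leaf "A"
```

Here the full tree misclassifies the third validation example, so pruning the x3 node (and assigning its most common class "A") improves validation accuracy and is kept; any further pruning would harm accuracy, so the loop stops.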
16 Example

Original Tree
Possible trees after pruning:
17 Example

Pruned Tree
Possible trees after 2nd pruning:
18 Example

The process continues until no improvement is observed on the validation set:
[Figure: accuracy on the validation data (0.5 to 1.0) vs. size of the tree; stop pruning the tree when validation accuracy no longer improves.]
19 Reduced Error Pruning

Disadvantages:

If the original dataset is small, separating examples away for validation may leave you with few examples for training.

[Figure: a small dataset D split into Training TR and Testing TE; the training set is too small, and so is the validation set.]
20 Rule Post-Pruning

Main idea:

1) Convert the tree into a rule-based system.
2) Prune each rule first by removing redundant conditions.
3) Sort the rules by accuracy.
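Step 1 can be sketched as follows, on a hypothetical nested-dict tree matching the x1/x2/x3 example on the next slide: each root-to-leaf path becomes one rule.

```python
def tree_to_rules(node, conditions=()):
    """Turn every root-to-leaf path into a rule: (conditions, class)."""
    if not isinstance(node, dict):          # leaf: emit the accumulated rule
        return [(conditions, node)]
    rules = []
    for value, child in node["branches"].items():
        rules += tree_to_rules(child, conditions + ((node["attr"], value),))
    return rules

tree = {"attr": "x1", "branches": {
    0: {"attr": "x2", "branches": {0: "A", 1: "B"}},
    1: {"attr": "x3", "branches": {0: "A", 1: "C"}},
}}
for conds, cls in tree_to_rules(tree):
    print(" & ".join(f"{a}={v}" for a, v in conds), "-> Class", cls)
```

Step 2 would then drop, from each rule, any condition whose removal does not hurt the rule's accuracy on the validation set.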
21 Example

Original tree:

[Figure: the root tests x1; the x1=0 branch tests x2 (x2=0 -> A, x2=1 -> B) and the x1=1 branch tests x3 (x3=0 -> A, x3=1 -> C).]

Rules:
~x1 & ~x2 -> Class A
~x1 & x2 -> Class B
x1 & ~x3 -> Class A
x1 & x3 -> Class C

Possible rules after pruning (based on validation set):
~x1 -> Class A
~x1 & x2 -> Class B
~x3 -> Class A
x1 & x3 -> Class C
22 Advantages of Rule Post-Pruning

The language is more expressive.
It improves interpretability.
Pruning is more flexible.
In practice this method yields high-accuracy performance.
23 Decision Trees

Definition
Mechanism
Splitting Functions
Hypothesis Space and Bias
Issues in Decision-Tree Learning
Avoiding overfitting through pruning
Numeric and missing attributes
24 Discretizing Continuous Attributes

Example: attribute temperature.

1) Order all values in the training set.
2) Consider only those cut points where there is a change of class.
3) Choose the cut point that maximizes information gain.

temperature: 97  97.5  97.6  97.8  98.5  99.0  99.2  100  102.2  102.6  103.2
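The three steps can be sketched using the slide's temperature values with hypothetical class labels; the labels, and therefore the chosen cut, are illustrative assumptions.

```python
import math

def entropy(labels):
    n = len(labels)
    return -sum((c / n) * math.log2(c / n)
                for c in (labels.count(l) for l in set(labels)))

def best_cut_point(values, labels):
    """Order the values (step 1), consider midpoints where the class
    changes (step 2), and return the cut maximizing information gain
    (step 3) as (cut, gain)."""
    pairs = sorted(zip(values, labels))
    base = entropy([l for _, l in pairs])
    best = None
    for i in range(len(pairs) - 1):
        if pairs[i][1] != pairs[i + 1][1]:              # class changes here
            cut = (pairs[i][0] + pairs[i + 1][0]) / 2
            left = [l for v, l in pairs if v <= cut]
            right = [l for v, l in pairs if v > cut]
            gain = base - (len(left) * entropy(left)
                           + len(right) * entropy(right)) / len(pairs)
            if best is None or gain > best[1]:
                best = (cut, gain)
    return best

temps = [97, 97.5, 97.6, 97.8, 98.5, 99.0, 99.2, 100, 102.2, 102.6, 103.2]
labels = ["no"] * 7 + ["yes"] * 4   # hypothetical class labels
# The only class change is between 99.2 and 100, so the cut is their midpoint.
print(best_cut_point(temps, labels))
```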
25 Missing Attribute Values

We are at a node n in the decision tree. Different approaches:

1) Assign the most common value for that attribute in node n.
2) Assign the most common value in n among examples with the same classification as X.
3) Assign a probability to each value of the attribute based on the frequency of those values in node n; each fraction is propagated down the tree.

Example: X = (luminosity > T1, mass = ?)
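Approaches 1 and 3 can be sketched as follows; the examples assumed to have reached node n, and the T2 threshold labels, are illustrative.

```python
from collections import Counter

def most_common_value(examples, attr):
    """Approach 1: fill a missing value with the attribute's most
    common value among the examples at node n."""
    counts = Counter(e[attr] for e in examples if e[attr] is not None)
    return counts.most_common(1)[0][0]

def fractional_weights(examples, attr):
    """Approach 3: weight each attribute value by its frequency at node n;
    the example is propagated down every branch with the matching weight."""
    counts = Counter(e[attr] for e in examples if e[attr] is not None)
    total = sum(counts.values())
    return {value: c / total for value, c in counts.items()}

# Hypothetical examples that reached node n (mass thresholded at T2)
at_node = [{"mass": "<=T2"}, {"mass": "<=T2"}, {"mass": ">T2"}, {"mass": None}]
print(most_common_value(at_node, "mass"))   # <=T2
print(fractional_weights(at_node, "mass"))  # weights 2/3 and 1/3
```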
26 Summary

Decision-tree induction is a popular approach to classification that enables us to interpret the output hypothesis.
The hypothesis space is very powerful: all possible DNF formulas.
We prefer shorter trees over larger trees.
Overfitting is an important issue in decision-tree induction.
Different methods exist to avoid overfitting, such as reduced-error pruning and rule post-pruning.
Techniques exist to deal with continuous attributes and missing attribute values.
27 Homework

Read Chapter 3 of Mitchell from Section 3.7 onward.