1
Universidad de Buenos Aires
Maestría en Data Mining y Knowledge Discovery

Machine Learning
5 - Decision Tree Induction (2/2)

Eduardo Poggi ([email protected])
Ernesto Mislej ([email protected])

Autumn 2005
2 Decision Trees

Definition
Mechanism
Splitting Functions
Hypothesis Space and Bias
Issues in Decision-Tree Learning
Avoiding overfitting through pruning
Numeric and missing attributes
3 Example of a Decision Tree

Example: learning to classify stars.

[Figure: a decision tree whose root tests Luminosity against threshold T1 and whose second internal node tests Mass against threshold T2; the leaves assign Type A, Type B, and Type C.]
4 Short vs Long Hypotheses

The top-down, greedy approach to constructing decision trees mentioned earlier denotes a preference for short hypotheses over long hypotheses.

Why is this the right thing to do?

Occam's Razor: prefer the simplest hypothesis that fits the data.

The idea dates back to William of Occam (1320) and remains a great debate in the philosophy of science.
5 Issues in Decision Tree Learning

Practical issues while building a decision tree can be enumerated as follows:

1) How deep should the tree be?
2) How do we handle continuous attributes?
3) What is a good splitting function?
4) What happens when attribute values are missing?
5) How do we improve the computational efficiency?
6 How Deep Should the Tree Be? Overfitting the Data

A tree overfits the data if we let it grow deep enough that it begins to capture "aberrations" in the data that harm the predictive power on unseen examples:
[Figure: scatter plot with axes size and humidity; extra thresholds t2 and t3 isolate a few examples. Possibly just noise, but the tree is grown deeper to capture these examples.]
7 Overfitting the Data: Definition

Assume a hypothesis space H. We say a hypothesis h in H overfits a dataset D if there is another hypothesis h' in H such that h has better classification accuracy than h' on D but worse classification accuracy than h' on unseen data D'.
[Figure: accuracy (0.5 to 1.0) vs. size of the tree. Accuracy on the training data keeps increasing while accuracy on the testing data peaks and then declines; the region where the curves diverge is overfitting.]
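The definition can be sketched on an illustrative 1-D toy problem: hypothesis h memorizes the training set (the behavior of a maximally deep tree), while h' is a single threshold test (the behavior of a depth-1 tree). With 10% label noise, h beats h' on the training data D but loses to it on unseen data D'. The data, noise rate, and threshold are all assumptions made for the sketch.

```python
import random

random.seed(0)

# Toy 1-D problem (illustrative): the true class is 1 when x > 0.5,
# but 10% of the training labels are flipped by noise.
def true_label(x):
    return 1 if x > 0.5 else 0

train_x = [random.random() for _ in range(200)]
train = [(x, true_label(x) if random.random() > 0.1 else 1 - true_label(x))
         for x in train_x]
test = [(x, true_label(x)) for x in (random.random() for _ in range(200))]

# h: memorize the training set (a maximally deep tree).
memory = dict(train)
def h(x):
    return memory.get(x, 0)  # unseen points fall back to class 0

# h': a single threshold test (a depth-1 tree).
def h_prime(x):
    return 1 if x > 0.5 else 0

def accuracy(f, data):
    return sum(f(x) == y for x, y in data) / len(data)

print(accuracy(h, train), accuracy(h_prime, train))  # h is better on D
print(accuracy(h, test), accuracy(h_prime, test))    # h' is better on D'
```

The memorizer is perfect on D (it stored every noisy label) but near-chance on D', while the simple rule loses a few noisy training points and classifies all unseen points correctly: exactly the h/h' relationship in the definition.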
8 Causes of Overfitting the Data

What causes a hypothesis to overfit the data?

1) Random errors or noise: examples have an incorrect class label or incorrect attribute values.

2) Coincidental patterns: by chance, examples seem to deviate from a pattern due to the small size of the sample.

Overfitting is a serious problem that can cause strong performance degradation.
9 Solutions for Overfitting the Data

There are two main classes of solutions:

1) Stop the tree early, before it begins to overfit the data.
+ In practice this solution is hard to implement because it is not clear what a good stopping point is.

2) Grow the tree until the algorithm stops, even if the overfitting problem shows up. Then prune the tree as a post-processing step.
+ This method has found great popularity in the machine learning community.
10 Decision Tree Pruning

1) Grow the tree to learn the training data.
2) Prune the tree to avoid overfitting the data.
11 Methods to Validate the New Tree

1. Training and Validation Set Approach

Divide dataset D into a training set TR and a testing set TE.
Build a decision tree on TR.
Test pruned trees on TE to decide the best final tree.

[Figure: Dataset D is split into Training TR and Testing TE.]
12 Methods to Validate the New Tree

2. Use a statistical test

Use all of dataset D for training.
Use a statistical test (e.g., chi-squared) to decide whether to expand a node or not.

[Figure: a candidate node asking "Should I expand or not?"]
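A sketch of such a test: compare each child's class counts against the counts expected if the split were irrelevant (i.e., if every child mirrored the parent's class proportions), and expand only when the chi-squared statistic exceeds a critical value. The class counts below and the 0.05 critical value for one degree of freedom are illustrative assumptions.

```python
def chi_squared_statistic(parent_counts, child_counts_list):
    """Chi-squared statistic comparing each child's class counts with the
    counts expected under the parent's class proportions."""
    n_parent = sum(parent_counts.values())
    stat = 0.0
    for child in child_counts_list:
        n_child = sum(child.values())
        for cls, n_cls in parent_counts.items():
            expected = n_cls * n_child / n_parent
            observed = child.get(cls, 0)
            stat += (observed - expected) ** 2 / expected
    return stat

# Illustrative counts: 50 positive / 50 negative examples at the node.
parent = {"+": 50, "-": 50}
uninformative = [{"+": 25, "-": 25}, {"+": 25, "-": 25}]  # mirrors the parent
informative = [{"+": 45, "-": 5}, {"+": 5, "-": 45}]      # separates classes
CRITICAL_95 = 3.84  # chi-squared critical value, 1 degree of freedom, alpha 0.05
print(chi_squared_statistic(parent, uninformative) > CRITICAL_95)  # False: don't expand
print(chi_squared_statistic(parent, informative) > CRITICAL_95)    # True: expand
```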
13 Methods to Validate the New Tree

3. Use an encoding scheme to capture the size of the tree and the errors made by the tree.

Use all of dataset D for training.
Use the encoding scheme to know when to stop growing the tree.
The method is known as the minimum description length principle.
14 Training and Validation

There are two approaches:

A. Reduced Error Pruning
B. Rule Post-Pruning

Dataset D is divided into:
Training TR (normally 2/3 of D)
Testing TE (normally 1/3 of D)
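The 2/3 - 1/3 split can be sketched with a hypothetical helper; the dataset, fraction, and seed are illustrative.

```python
import random

def split_train_test(dataset, train_fraction=2/3, seed=0):
    """Shuffle D and split it into a training set TR and a testing set TE."""
    data = list(dataset)
    random.Random(seed).shuffle(data)       # seeded for reproducibility
    cut = round(len(data) * train_fraction)
    return data[:cut], data[cut:]

D = list(range(30))        # illustrative dataset of 30 examples
TR, TE = split_train_test(D)
print(len(TR), len(TE))    # 20 10
```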
15 Reduced Error Pruning

Main idea:

1) Consider all internal nodes in the tree.
2) For each node, check whether removing it (along with the subtree below it) and assigning the most common class to it does not harm accuracy on the validation set.
3) Pick the node n* that yields the best performance and prune its subtree.
4) Go back to (2) until no more improvements are possible.
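The steps above can be sketched on a hypothetical tree representation: an internal node is a dict holding the tested attribute, its branches, and the most common class at that node; a leaf is just a class label. The tree and validation set below are illustrative.

```python
import copy

def classify(node, example):
    while isinstance(node, dict):
        node = node["branches"][example[node["attr"]]]
    return node

def accuracy(tree, data):
    return sum(classify(tree, x) == y for x, y in data) / len(data)

def internal_nodes(node, path=()):
    """Yield the branch-value path to every internal node (step 1)."""
    if isinstance(node, dict):
        yield path
        for value, child in node["branches"].items():
            yield from internal_nodes(child, path + (value,))

def prune_at(tree, path):
    """Replace the subtree at `path` by its most common class (step 2)."""
    tree = copy.deepcopy(tree)
    if not path:
        return tree["majority"]
    node = tree
    for value in path[:-1]:
        node = node["branches"][value]
    node["branches"][path[-1]] = node["branches"][path[-1]]["majority"]
    return tree

def reduced_error_prune(tree, validation):
    best_acc = accuracy(tree, validation)
    while isinstance(tree, dict):
        candidates = [prune_at(tree, p) for p in internal_nodes(tree)]
        best = max(candidates, key=lambda t: accuracy(t, validation))  # step 3
        if accuracy(best, validation) < best_acc:   # pruning would harm accuracy
            break
        tree, best_acc = best, accuracy(best, validation)  # step 4: repeat
    return tree

tree = {"attr": "x1", "majority": "A", "branches": {
    0: {"attr": "x2", "majority": "A", "branches": {0: "A", 1: "B"}},
    1: {"attr": "x3", "majority": "A", "branches": {0: "A", 1: "C"}},
}}
validation = [({"x1": 0, "x2": 0, "x3": 0}, "A"),
              ({"x1": 0, "x2": 1, "x3": 0}, "B"),
              ({"x1": 1, "x2": 0, "x3": 1}, "A")]
pruned = reduced_error_prune(tree, validation)
print(pruned)  # the x3 subtree is replaced by the leaf "A"
```

Here the full tree misclassifies the third validation example, so pruning the x3 node (and assigning its most common class "A") improves validation accuracy and is kept; any further pruning would harm accuracy, so the loop stops.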
16 Example

Original Tree
Possible trees after pruning:
17 Example

Pruned Tree
Possible trees after 2nd pruning:
18 Example

The process continues until no improvement is observed on the validation set:
[Figure: accuracy on the validation data (0.5 to 1.0) vs. size of the tree; stop pruning the tree when validation accuracy no longer improves.]
19 Reduced Error Pruning

Disadvantages:

If the original dataset is small, separating examples away for validation may leave you with few examples for training.

[Figure: a small dataset D split into Training TR and Testing TE; the training set is too small, and so is the validation set.]
20 Rule Post-Pruning

Main idea:

1) Convert the tree into a rule-based system.
2) Prune each rule first by removing redundant conditions.
3) Sort the rules by accuracy.
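Step 1 can be sketched as follows, on a hypothetical nested-dict tree matching the x1/x2/x3 example on the next slide: each root-to-leaf path becomes one rule.

```python
def tree_to_rules(node, conditions=()):
    """Turn every root-to-leaf path into a rule: (conditions, class)."""
    if not isinstance(node, dict):          # leaf: emit the accumulated rule
        return [(conditions, node)]
    rules = []
    for value, child in node["branches"].items():
        rules += tree_to_rules(child, conditions + ((node["attr"], value),))
    return rules

tree = {"attr": "x1", "branches": {
    0: {"attr": "x2", "branches": {0: "A", 1: "B"}},
    1: {"attr": "x3", "branches": {0: "A", 1: "C"}},
}}
for conds, cls in tree_to_rules(tree):
    print(" & ".join(f"{a}={v}" for a, v in conds), "-> Class", cls)
```

Step 2 would then drop, from each rule, any condition whose removal does not hurt the rule's accuracy on the validation set.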
21 Example

Original tree:

[Figure: the root tests x1; the x1=0 branch tests x2 (x2=0 -> A, x2=1 -> B) and the x1=1 branch tests x3 (x3=0 -> A, x3=1 -> C).]

Rules:
~x1 & ~x2 -> Class A
~x1 & x2 -> Class B
x1 & ~x3 -> Class A
x1 & x3 -> Class C

Possible rules after pruning (based on validation set):
~x1 -> Class A
~x1 & x2 -> Class B
~x3 -> Class A
x1 & x3 -> Class C
22 Advantages of Rule Post-Pruning

The language is more expressive.
It improves interpretability.
Pruning is more flexible.
In practice this method yields high-accuracy performance.
23 Decision Trees

Definition
Mechanism
Splitting Functions
Hypothesis Space and Bias
Issues in Decision-Tree Learning
Avoiding overfitting through pruning
Numeric and missing attributes
24 Discretizing Continuous Attributes

Example: attribute temperature.

1) Order all values in the training set.
2) Consider only those cut points where there is a change of class.
3) Choose the cut point that maximizes information gain.

temperature: 97  97.5  97.6  97.8  98.5  99.0  99.2  100  102.2  102.6  103.2
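The three steps can be sketched using the slide's temperature values with hypothetical class labels; the labels, and therefore the chosen cut, are illustrative assumptions.

```python
import math

def entropy(labels):
    n = len(labels)
    return -sum((c / n) * math.log2(c / n)
                for c in (labels.count(l) for l in set(labels)))

def best_cut_point(values, labels):
    """Order the values (step 1), consider midpoints where the class
    changes (step 2), and return the cut maximizing information gain
    (step 3) as (cut, gain)."""
    pairs = sorted(zip(values, labels))
    base = entropy([l for _, l in pairs])
    best = None
    for i in range(len(pairs) - 1):
        if pairs[i][1] != pairs[i + 1][1]:              # class changes here
            cut = (pairs[i][0] + pairs[i + 1][0]) / 2
            left = [l for v, l in pairs if v <= cut]
            right = [l for v, l in pairs if v > cut]
            gain = base - (len(left) * entropy(left)
                           + len(right) * entropy(right)) / len(pairs)
            if best is None or gain > best[1]:
                best = (cut, gain)
    return best

temps = [97, 97.5, 97.6, 97.8, 98.5, 99.0, 99.2, 100, 102.2, 102.6, 103.2]
labels = ["no"] * 7 + ["yes"] * 4   # hypothetical class labels
# The only class change is between 99.2 and 100, so the cut is their midpoint.
print(best_cut_point(temps, labels))
```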
25 Missing Attribute Values

We are at a node n in the decision tree. Different approaches:

1) Assign the most common value for that attribute in node n.
2) Assign the most common value in n among examples with the same classification as X.
3) Assign a probability to each value of the attribute based on the frequency of those values in node n; each fraction is propagated down the tree.

Example: X = (luminosity > T1, mass = ?)
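Approaches 1 and 3 can be sketched as follows; the examples assumed to have reached node n, and the T2 threshold labels, are illustrative.

```python
from collections import Counter

def most_common_value(examples, attr):
    """Approach 1: fill a missing value with the attribute's most
    common value among the examples at node n."""
    counts = Counter(e[attr] for e in examples if e[attr] is not None)
    return counts.most_common(1)[0][0]

def fractional_weights(examples, attr):
    """Approach 3: weight each attribute value by its frequency at node n;
    the example is propagated down every branch with the matching weight."""
    counts = Counter(e[attr] for e in examples if e[attr] is not None)
    total = sum(counts.values())
    return {value: c / total for value, c in counts.items()}

# Hypothetical examples that reached node n (mass thresholded at T2)
at_node = [{"mass": "<=T2"}, {"mass": "<=T2"}, {"mass": ">T2"}, {"mass": None}]
print(most_common_value(at_node, "mass"))   # <=T2
print(fractional_weights(at_node, "mass"))  # weights 2/3 and 1/3
```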
26 Summary

Decision-tree induction is a popular approach to classification that enables us to interpret the output hypothesis.
The hypothesis space is very powerful: all possible DNF formulas.
We prefer shorter trees over larger trees.
Overfitting is an important issue in decision-tree induction.
Different methods exist to avoid overfitting, such as reduced-error pruning and rule post-pruning.
Techniques exist to deal with continuous attributes and missing attribute values.
27 Homework

Read Chapter 3 of Mitchell from Section 3.7 onward.