Upload
olympe
View
70
Download
0
Tags:
Embed Size (px)
DESCRIPTION
PUBLIC: A Decision Tree Classifier that Integrates Building and Pruning. Rajeev RastogiKyuseok Shim. Presented by: Alon Keinan. Presentation layout. Introduction: Classification and Decision Trees Decision Tree Building Algorithms SPRINT & MDL PUBLIC Performance Comparison - PowerPoint PPT Presentation
Citation preview
PUBLIC: A Decision Tree Classifier that Integrates Building and Pruning
Rajeev Rastogi Kyuseok Shim
Presented by: Alon Keinan
Presentation layout
• Introduction: Classification and Decision Trees
• Decision Tree Building Algorithms
• SPRINT & MDL
• PUBLIC
• Performance Comparison
• Conclusions
Introduction: Classification
• Classification in data mining:– Training sample set– Classifying future records
• Techniques: Bayesian, NN, Genetic, decision trees …
Introduction: Decision Trees
training
Presentation layout
• Introduction: Classification and Decision Trees
• Decision Tree Building Algorithms
• SPRINT & MDL
• PUBLIC
• Performance Comparison
• Conclusions
Decision Tree Building Algorithms
• 2 phases: – The building phase– The pruning phase
• The building constructs a “perfect” tree
• The pruning prevents “overfitting”
Building Phase Algorithms
• Differ in the selection of the test criterion for partitioning– CLS– ID3 & C4.5– CART, SLIQ & SPRINT
• Differ in their ability to handle large training sets
• All consider “guillotine-cut” only
Pruning Phase Algorithms
• MDL – Minimum Description Length
• Cost-Complexity Pruning
Presentation layout
• Introduction: Classification and Decision Trees
• Decision Tree Building Algorithms
• SPRINT & MDL
• PUBLIC
• Performance Comparison
• Conclusions
SPRINT
• Initialize root node• Initialize queue Q to contain root node• While Q is not empty do
– dequeue the first node N in Q– if N is not pure
• for each attribute evaluate splits• use least entropy split to split node N into N1 and
N2• append N1 and N2 to Q
Entropy
MDL
• The best tree is the one that can be encoded using the fewest number of bits
• Cost of encoding data records:
• Cost of encoding tree:– The structure of the tree– The splits– The classes in the leaves
Pruning algorithm
• computeCost&Prune(Node N)– If N is a leaf return (C(S)+1)
– minCostLeft:=computeCost&Prune(Nleft)
– minCostRight:=computeCost&Prune(Nright)
– minCost:=min{C(S)+1, Csplit(N)+1+minCostLeft+minCostRight}
– If minCost=C(S)+1• Prune child nodes Nleft and Nright
– return minCost
Presentation layout
• Introduction: Classification and Decision Trees
• Decision Tree Building Algorithms
• SPRINT & MDL
• PUBLIC
• Performance Comparison
• Conclusions
PUBLIC
• PUBLIC = PrUning and BuiLding Integrated in Classification
• Uses SPRINT for building• Prune periodically !!!• Basically uses MDL for pruning• Distinguished three types of leaves:
– “not expandable”– “pruned”– “yet to be expanded”
• Exact same tree
Lower Bound Computation
• PUBLIC(1) – Bound=1
• PUBLIC(S) – Incorporating split costs
• PUBLIC(V) – Incorporating split values
PUBLIC(S)
• Calculates a lower bound for s=0,..,k-1– For s=0: C(S)+1– For s>0:
• Takes the minimum of the bounds
• Computes by iterative addition
• O(klogk)
PUBLIC(V)
• PUBLIC(S) estimates each split as log(a)
• PUBLIC(V) estimates each split as log(a), plus the encoding of the splitting value\s
• Complexity: O(k*(logk+a))
Lower Bound ComputationSummary
PUBLIC(1) Fixed - 1 O(1)
PUBLIC(S) Incorporating split costs
O(klogk)
PUBLIC(V) Incorporating split value costs
O(k*(logk+a))
Presentation layout
• Introduction: Classification and Decision Trees
• Decision Tree Building Algorithms
• SPRINT & MDL
• PUBLIC
• Performance Comparison
• Conclusions
Performance Comparisons
• Algorithms:– SPRINT– PUBLIC(1)– PUBLIC(S)– PUBLIC(V)
• Data sets:– Real-life– Synthetic
Real-life Data Sets
0
500
1000
1500
2000
2500
3000
3500
4000
4500
5000
SPRINT
PUBLIC(1)
PUBLIC(S)
PUBLIC(V)
Synthetic Data Sets
0
500
1000
1500
2000
SPRINT
PUBLIC(1)
PUBLIC(S)
PUBLIC(V)
Noise
Other Parameters
• No. of Attributes
• No. of Classes
• Size of training set
Presentation layout
• Introduction: Classification and Decision Trees
• Decision Tree Building Algorithms
• SPRINT & MDL
• PUBLIC
• Performance Comparison
• Conclusions
Conclusion
• The pruning is integrated into the building phase• Computing lower bounds of the cost of “yet to be
expanded” leaves• Improved performance• Open:
– How often to invoke the pruning procedure?
– Expanding other algorithms …
– Developing a tighter lower bound…