ID3, C4.5, AQ, ILA Algorithms
By: Abdelfattah Al Zaqqa, PSUT, Amman, Jordan
Agenda
Introduction, AQ, ID3, C4.5, ILA
Introduction: Machine Learning
Machine learning is a branch of artificial intelligence concerned with the construction and study of systems that can learn from data.
Machine learning vs. data mining:
Machine learning focuses on prediction, based on known properties learned from the training data.
Data mining focuses on the discovery of previously unknown properties in the data.
The two fields overlap.
What is a Decision Tree?
A decision tree is a tree in which each branch node represents a choice between a number of alternatives, and each leaf node represents a decision.
Root node: an attribute
Edges: attribute values
Leaf node: the output (class or decision)
Introduction
ID3 (Iterative Dichotomiser 3) is an algorithm invented by Ross Quinlan that generates a decision tree from a dataset using Shannon entropy.
Typically used in the machine learning and natural language processing domains.
ID3 basics
ID3 employs top-down induction of decision trees (a greedy algorithm).
Attribute selection is the fundamental step in constructing a decision tree: at each step we select which attribute becomes the next node of the tree, and so on.
Two measures, entropy and information gain, are used to perform attribute selection.
Entropy
Entropy H(S) is a measure of the amount of uncertainty in the (data) set S.
The more uniform the class distribution, the higher the entropy; and the more entropy there is, the more information we can gain by splitting the set.
For a (data) set S containing positive and negative examples:
Entropy(S) = -P(positive) log2 P(positive) - P(negative) log2 P(negative)
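A minimal Python sketch of this formula (generalized to any number of classes; the helper name entropy is illustrative, not from the slides):

    import math
    from collections import Counter

    def entropy(labels):
        """Shannon entropy (in bits) of a list of class labels."""
        total = len(labels)
        return -sum((n / total) * math.log2(n / total)
                    for n in Counter(labels).values())

    # 9 positive and 5 negative examples, as in the Play ball dataset below:
    print(round(entropy(["Yes"] * 9 + ["No"] * 5), 3))  # 0.94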
Information Gain
Information gain is the difference in entropy from before to after the set S is split on an attribute A:
Gain(S, A) = Entropy(S) - Σ_v (|S_v|/|S|) Entropy(S_v), summing over the values v of attribute A.
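A sketch of the gain computation, assuming the dataset is a list of dicts keyed by attribute name (illustrative, not the slides' code):

    import math
    from collections import Counter

    def entropy(labels):
        total = len(labels)
        return -sum((n / total) * math.log2(n / total)
                    for n in Counter(labels).values())

    def information_gain(rows, attribute, target):
        """Gain(S, A): entropy before the split minus the weighted
        entropy of each subset S_v produced by the split."""
        labels = [r[target] for r in rows]
        gain = entropy(labels)
        for value in {r[attribute] for r in rows}:
            subset = [r[target] for r in rows if r[attribute] == value]
            gain -= (len(subset) / len(rows)) * entropy(subset)
        return gain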
Example
Outlook   Temperature  Humidity  Wind    Play ball
Sunny     Hot          High      Weak    No
Sunny     Hot          High      Strong  No
Overcast  Hot          High      Weak    Yes
Rain      Mild         High      Weak    Yes
Rain      Cool         Normal    Weak    Yes
Rain      Cool         Normal    Strong  No
Overcast  Cool         Normal    Strong  Yes
Sunny     Mild         High      Weak    No
Sunny     Cool         Normal    Weak    Yes
Rain      Mild         Normal    Weak    Yes
Sunny     Mild         Normal    Strong  Yes
Overcast  Mild         High      Strong  Yes
Overcast  Hot          Normal    Weak    Yes
Rain      Mild         High      Strong  No
Total: 14
Example: Dataset Elements
Collection (S): all the records in the table above are referred to as the collection S.
Attributes: Outlook, Temperature, Humidity, and Wind.
Class (C), or classifier: Play ball.
Based on Outlook, Temperature, Humidity, and Wind we need to decide whether to play ball or not; that is why Play ball is the classifier used to make the decision.
ID3 Algorithm
1. Compute Entropy(S) = -(9/14)log2(9/14) - (5/14)log2(5/14) = 0.940.
2. Compute the information gain of each attribute. For Wind: Weak = 8 examples (6+, 2-), Strong = 6 examples (3+, 3-).
Entropy(S_Weak) = -(6/8)log2(6/8) - (2/8)log2(2/8) = 0.811
Entropy(S_Strong) = -(3/6)log2(3/6) - (3/6)log2(3/6) = 1
Gain(S, Wind) = Entropy(S) - (8/14)Entropy(S_Weak) - (6/14)Entropy(S_Strong) = 0.940 - (8/14)(0.811) - (6/14)(1) = 0.048
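These numbers can be checked with a short script; the (Wind, Play ball) pairs below are transcribed from the table above:

    import math
    from collections import Counter

    def entropy(labels):
        total = len(labels)
        return -sum((n / total) * math.log2(n / total)
                    for n in Counter(labels).values())

    data = [("Weak", "No"), ("Strong", "No"), ("Weak", "Yes"), ("Weak", "Yes"),
            ("Weak", "Yes"), ("Strong", "No"), ("Strong", "Yes"), ("Weak", "No"),
            ("Weak", "Yes"), ("Weak", "Yes"), ("Strong", "Yes"), ("Strong", "Yes"),
            ("Weak", "Yes"), ("Strong", "No")]

    labels = [play for _, play in data]
    gain = entropy(labels)                      # Entropy(S) = 0.940
    for value in ("Weak", "Strong"):
        subset = [play for wind, play in data if wind == value]
        gain -= (len(subset) / len(data)) * entropy(subset)
    print(round(gain, 3))                       # 0.048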
3. Select the attribute with the maximum information gain for splitting:
Gain(S, Wind) = 0.048; Gain(S, Humidity) = 0.151; Gain(S, Temperature) = 0.029; Gain(S, Outlook) = 0.246.
Outlook has the highest gain, so it becomes the root node.
4. Apply ID3 recursively to each child node of the root, until a leaf node (a node with entropy 0) is reached.
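Putting the steps together, a compact illustrative sketch of the recursive procedure (assuming examples are dicts keyed by attribute name; this is not Quinlan's original implementation):

    import math
    from collections import Counter

    def entropy(labels):
        total = len(labels)
        return -sum((n / total) * math.log2(n / total)
                    for n in Counter(labels).values())

    def id3(rows, attributes, target):
        """Return a nested-dict tree: {attribute: {value: subtree or label}}."""
        labels = [r[target] for r in rows]
        if len(set(labels)) == 1:           # pure node (entropy 0): leaf
            return labels[0]
        if not attributes:                  # no attributes left: majority label
            return Counter(labels).most_common(1)[0][0]

        def gain(attr):
            g = entropy(labels)
            for value in {r[attr] for r in rows}:
                subset = [r[target] for r in rows if r[attr] == value]
                g -= (len(subset) / len(rows)) * entropy(subset)
            return g

        best = max(attributes, key=gain)    # attribute with maximum gain
        remaining = [a for a in attributes if a != best]
        return {best: {value: id3([r for r in rows if r[best] == value],
                                  remaining, target)
                       for value in {r[best] for r in rows}}}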
C4.5
C4.5 is an extension of Quinlan's earlier ID3 algorithm. It adds:
Handling both continuous and discrete attributes.
Handling training data with missing attribute values.
Pruning trees after creation.
Continuous-valued attributes
Outlook   Temperature  Humidity  Wind    Play ball
Sunny     Hot          0.90      Weak    No
Sunny     Hot          0.87      Strong  No
Overcast  Hot          0.93      Weak    Yes
Rain      Mild         0.89      Weak    Yes
Rain      Cool         0.80      Weak    Yes
Rain      Cool         0.59      Strong  No
Overcast  Cool         0.77      Strong  Yes
Sunny     Mild         0.91      Weak    No
Sunny     Cool         0.68      Weak    Yes
Rain      Mild         0.84      Weak    Yes
Sunny     Mild         0.72      Strong  Yes
Overcast  Mild         0.49      Strong  Yes
Overcast  Hot          0.74      Weak    Yes
Rain      Mild         0.86      Strong  No
Total: 14
Humidity Play ball
0.9 No
0.87 No
0.93 Yes
0.89 Yes
0.80 Yes
0.59 No
0.77 Yes
0.91 No
0.68 Yes
0.84 Yes
0.72 Yes
0.49 Yes
0.74 Yes
0.86 No
Sorted Humidity values (excerpt):  0.68  0.72  0.87  0.90  0.91
Play ball:                         Yes   Yes   No    No    No
1. Sort the numeric attribute values.
2. Identify adjacent examples that differ in their target classification, and pick the threshold as the midpoint between them:
Humidity > (0.72 + 0.87)/2, i.e. Humidity > 0.795
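A small sketch of this step (the helper name candidate_thresholds is illustrative); on the five-example excerpt above it reproduces the 0.795 threshold. C4.5 would then evaluate the information gain of each candidate and keep the best one:

    def candidate_thresholds(values, labels):
        """Midpoints between adjacent sorted values whose classes differ."""
        pairs = sorted(zip(values, labels))
        return [(a + b) / 2
                for (a, la), (b, lb) in zip(pairs, pairs[1:])
                if la != lb]

    humidity = [0.68, 0.72, 0.87, 0.90, 0.91]
    play = ["Yes", "Yes", "No", "No", "No"]
    print(candidate_thresholds(humidity, play))  # [0.795] (up to float rounding)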
Overfitting
[Figure: models illustrating underfitting, a good fit ("just right"), and overfitting]
Overfitting: if we have too many attributes (features), the learned hypothesis may fit the training set very well but fail to generalize to new examples (e.g., failing to predict prices for new examples).
Why does overfitting happen?
Presence of errors (noise) in the training examples (a general issue in machine learning).
Small numbers of examples associated with a leaf node.
Reduce Overfitting
Stop growing the tree earlier, before it reaches the point where it perfectly classifies the training data. (difficult)
Allow the tree to overfit the data, and then post-prune the tree.
Rule post-pruning
Convert the learned tree into one rule per leaf:
(Outlook = Sunny ∧ Humidity = Normal) → P
(Outlook = Sunny ∧ Humidity = High) → N
(Outlook = Overcast) → P
(Outlook = Rain ∧ Wind = Strong) → N
(Outlook = Rain ∧ Wind = Weak) → P
Consider these examples:
Outlook  Temp  Humidity  Wind    Tennis
Rain     Low   High      Weak    No
Rain     Hot   High      Strong  No
• Prune preconditions of the rules:
(Outlook = Sunny ∧ Humidity = High) → N
(Outlook = Sunny ∧ Humidity = Normal) → P
(Outlook = Overcast) → P
(Outlook = Rain ∧ Wind = Strong) → N
(Outlook = Rain ∧ Wind = Weak) → P
After pruning the Wind = Strong precondition from the fourth rule:
(Outlook = Sunny ∧ Humidity = High) → N
(Outlook = Sunny ∧ Humidity = Normal) → P
(Outlook = Overcast) → P
(Outlook = Rain) → N
(Outlook = Rain ∧ Wind = Weak) → P
New instances:
Outlook  Temp  Humidity  Wind  Tennis
Sunny    Low   Low       Weak  Yes
Rain     Hot   High      Weak  No
Rule post-pruning
Validation set: save a portion of the data for validation.
Reduced-error pruning (Quinlan 1987): if s ≤ t, prune the subtree, where s is the validation performance with the subtree at the node and t is the validation performance with a leaf in its place.
Rule post-pruning (Quinlan 1993): can remove smaller elements than whole subtrees; improves readability.
The data is split into a training set, a validation set, and a test set.
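A rough sketch of greedy precondition pruning for one rule, assuming a caller-supplied validate(preconditions, label) function that returns the rule's accuracy on the validation set (both names are hypothetical):

    def prune_rule(preconditions, label, validate):
        """Drop preconditions one at a time while validation accuracy
        does not decrease; return the pruned precondition list."""
        current = list(preconditions)
        improved = True
        while improved:
            improved = False
            base = validate(current, label)
            for p in list(current):
                trial = [q for q in current if q != p]
                if validate(trial, label) >= base:  # no worse without p: prune it
                    current = trial
                    improved = True
                    break
        return current

    # e.g. prune_rule([("Outlook", "Rain"), ("Wind", "Strong")], "N", validate)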
Missing information
Example: missing information in mammography data ("?" marks a missing value).
BI-RAD  Age  Shape  Margin  Density  Class
4 48 4 5 ? 1
5 67 3 5 3 1
5 57 4 4 3 1
5 60 ? 5 1 1
4 53 ? 4 3 1
4 28 1 1 3 0
4 70 ? 2 3 0
2 66 1 1 ? 0
5 63 3 ? 3 0
4 78 1 1 1 0
Missing information: fill in according to the most common value
Fill in each missing value with the most common value of the attribute among examples of the same class (a sketch follows the table).
BI-RAD  Age  Shape  Margin  Density  Class
4 48 4 5 3 1
5 67 3 5 3 1
5 57 4 4 3 1
5 60 4 5 1 1
4 53 4 4 3 1
4 28 1 1 3 0
4 70 1 2 3 0
2 66 1 1 3 0
5 63 3 ? 3 0
4 78 1 1 1 0
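A minimal sketch of this fill-in strategy, applied to the Shape column of the table (None stands for "?"; all names are illustrative):

    from collections import Counter

    def fill_most_common(rows, attribute, target):
        """Replace missing values (None) with the most common value
        of the attribute among rows of the same class."""
        mode = {}
        for cls in {r[target] for r in rows}:
            known = [r[attribute] for r in rows
                     if r[target] == cls and r[attribute] is not None]
            mode[cls] = Counter(known).most_common(1)[0][0]
        for r in rows:
            if r[attribute] is None:
                r[attribute] = mode[r[target]]

    rows = [{"shape": 4, "cls": 1}, {"shape": 3, "cls": 1}, {"shape": 4, "cls": 1},
            {"shape": None, "cls": 1}, {"shape": None, "cls": 1},
            {"shape": 1, "cls": 0}, {"shape": None, "cls": 0}, {"shape": 1, "cls": 0},
            {"shape": 3, "cls": 0}, {"shape": 1, "cls": 0}]
    fill_most_common(rows, "shape", "cls")
    print([r["shape"] for r in rows])  # [4, 3, 4, 4, 4, 1, 1, 1, 3, 1]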
Missing information: fill in according to proportions
Alternatively, split each example with a missing value into fractional examples, weighted by the proportions of the attribute's observed values within the same class.
Fraction  BI-RAD  Age  Shape  Margin  Density  Class
0.75 4 48 4 5 3 1
0.25 4 48 4 5 1 1
1 5 67 3 5 3 1
1 5 57 4 4 3 1
0.66 5 60 4 5 1 1
0.33 5 60 3 5 1 1
0.66 4 53 4 4 3 1
0.33 4 53 3 4 3 1
1 4 28 1 1 3 0
0.75 4 70 1 2 3 0
0.25 4 70 3 2 3 0
0.25 2 66 1 1 1 0
0.75 2 66 1 1 3 0
0.75 5 63 3 1 3 0
0.25 5 63 3 2 3 0
1 4 78 1 1 1 0
(The fractions, e.g. 3/4 and 1/4, are these class-conditional proportions.)
Summary
ID3 and C4.5 are algorithms for generating decision trees, developed by Ross Quinlan and typically used in the machine learning and natural language processing domains.
Both compute the entropy of each attribute and pick the attribute with the highest reduction in entropy (information gain) to determine which attribute the data should be split on first; then, through a series of recursive calls that calculate the entropy at each node, the process continues until all leaf nodes are pure.
THANKS