
Page 1

ID3, C4.5, AQ, ILA Algorithms

By: Abdelfattah Al Zaqqa PSUT-Amman-Jordan

Page 2

Agenda

• Introduction
• AQ
• ID3
• C4.5
• ILA

Page 3

Introduction: Machine Learning

Machine learning is a branch of artificial intelligence concerned with the construction and study of systems that can learn from data.

Machine learning vs. data mining:

• Machine learning focuses on prediction, based on known properties learned from the training data.
• Data mining focuses on the discovery of previously unknown properties in the data.

The two fields overlap.

Page 4

What is a Decision Tree?

A decision tree is a tree in which each branch node represents a choice between a number of alternatives, and each leaf node represents a decision.

• Root node: an attribute
• Edges: attribute values
• Leaf node: an output, class, or decision

Page 5

Introduction

ID3 (Iterative Dichotomiser 3) is an algorithm invented by Ross Quinlan to generate a decision tree from a dataset using Shannon entropy.

It is typically used in the machine learning and natural language processing domains.

Page 6

ID3 basics

ID3 employs top-down induction of decision trees; it is a greedy algorithm.

Attribute selection is the fundamental step in constructing a decision tree: at each step, select which attribute becomes the next node of the tree.

Two measures, entropy and information gain, are used for attribute selection.

Page 7

Entropy

Entropy H(S) is a measure of the amount of uncertainty in the (data) set S.

The more uniform the class distribution, the higher the entropy; and the higher the entropy, the more information we can gain by splitting the set.

Page 8

Entropy

For a set S partitioned into positive and negative examples:

Entropy(S) = -P(positive)·log2 P(positive) - P(negative)·log2 P(negative)
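As a small illustration, here is a minimal Python sketch of this two-class entropy (the helper name `entropy` is my own, not from the slides):

```python
import math

def entropy(positive, negative):
    """Two-class Shannon entropy of a set with the given class counts."""
    total = positive + negative
    result = 0.0
    for count in (positive, negative):
        if count:  # treat 0 * log2(0) as 0
            p = count / total
            result -= p * math.log2(p)
    return result

print(entropy(9, 5))   # ~0.940, the value used in the worked example below
print(entropy(7, 7))   # 1.0: a 50/50 split has maximum entropy
print(entropy(14, 0))  # 0.0: a pure set has no uncertainty
```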

Page 9

Information Gain

Information gain is the measure of the difference in entropy from before to after the set S is split on an attribute A.
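In general, for an attribute A:

Gain(S, A) = Entropy(S) - Σ over v in Values(A) of (|Sv| / |S|)·Entropy(Sv)

where Sv is the subset of S for which attribute A has value v.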

Page 10

Example

Outlook | Temperature | Humidity | Wind | Play ball
Sunny | Hot | High | Weak | No
Sunny | Hot | High | Strong | No
Overcast | Hot | High | Weak | Yes
Rain | Mild | High | Weak | Yes
Rain | Cool | Normal | Weak | Yes
Rain | Cool | Normal | Strong | No
Overcast | Cool | Normal | Strong | Yes
Sunny | Mild | High | Weak | No
Sunny | Cool | Normal | Weak | Yes
Rain | Mild | Normal | Weak | Yes
Sunny | Mild | Normal | Strong | Yes
Overcast | Mild | High | Strong | Yes
Overcast | Hot | Normal | Weak | Yes
Rain | Mild | High | Strong | No

Total: 14 examples (9 Yes, 5 No)

Page 11

Example: Dataset Elements

(The same 14-example dataset as on Page 10.)

Collection (S): all the records in the table together are referred to as the collection S.

Page 12

Example: Dataset Elements

(The same 14-example dataset as on Page 10.)

Attributes: Outlook, Temperature, Humidity, and Wind.

Class (C), or classifier: Play ball.

Based on Outlook, Temperature, Humidity, and Wind we need to decide whether we can play ball or not; that is why Play ball is the classifier used to make the decision.

Page 13

ID3 Algorithm

1. Compute Entropy(S) = -(9/14)·log2(9/14) - (5/14)·log2(5/14) = 0.940

2. Compute information gain for each attribute:

Wind: Weak = 8 examples (6+, 2-); Strong = 6 examples (3+, 3-)

Entropy(S_Weak) = -(6/8)·log2(6/8) - (2/8)·log2(2/8) = 0.811
Entropy(S_Strong) = -(3/6)·log2(3/6) - (3/6)·log2(3/6) = 1

Gain(S, Wind) = Entropy(S) - (8/14)·Entropy(S_Weak) - (6/14)·Entropy(S_Strong)
Gain(S, Wind) = 0.940 - (8/14)(0.811) - (6/14)(1) = 0.048
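A quick Python check of these numbers, reusing the `entropy` helper sketched earlier (the counts are taken from the Page 10 table):

```python
total = 14
weak = (6, 2)    # Wind = Weak: 8 examples, 6 positive, 2 negative
strong = (3, 3)  # Wind = Strong: 6 examples, 3 positive, 3 negative

gain_wind = (entropy(9, 5)
             - (sum(weak) / total) * entropy(*weak)
             - (sum(strong) / total) * entropy(*strong))
print(round(gain_wind, 3))  # 0.048
```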

Page 14

ID3 Algorithm

3. Select attribute with the maximum information gain for splitting:

Gain(S, Wind) = 0.048
Gain(S, Humidity) = 0.151
Gain(S, Temperature) = 0.029
Gain(S, Outlook) = 0.246

Outlook has the maximum information gain, so it becomes the root node.

Page 15

ID3 Algorithm

4. Apply ID3 recursively to each child of this root until a leaf node is reached, i.e., a node whose examples all belong to one class (entropy = 0).
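A compact sketch of the whole recursion, assuming categorical attributes and examples represented as dictionaries (all names here are my own; this is an illustration, not Quinlan's original code):

```python
import math
from collections import Counter

def entropy_of(labels):
    """Shannon entropy of a list of class labels."""
    total = len(labels)
    return -sum((c / total) * math.log2(c / total)
                for c in Counter(labels).values())

def gain(examples, attribute, target):
    """Information gain of splitting `examples` on `attribute`."""
    before = entropy_of([ex[target] for ex in examples])
    after = 0.0
    for value in {ex[attribute] for ex in examples}:
        subset = [ex[target] for ex in examples if ex[attribute] == value]
        after += (len(subset) / len(examples)) * entropy_of(subset)
    return before - after

def id3(examples, attributes, target):
    """Return a nested-dict tree {attribute: {value: subtree}} or a leaf label."""
    labels = [ex[target] for ex in examples]
    if len(set(labels)) == 1:   # entropy = 0: pure node becomes a leaf
        return labels[0]
    if not attributes:          # no attributes left: majority-class leaf
        return Counter(labels).most_common(1)[0][0]
    best = max(attributes, key=lambda a: gain(examples, a, target))
    tree = {best: {}}
    for value in {ex[best] for ex in examples}:
        subset = [ex for ex in examples if ex[best] == value]
        rest = [a for a in attributes if a != best]
        tree[best][value] = id3(subset, rest, target)
    return tree
```

On the Play-ball table this puts Outlook at the root, with a Humidity test under the Sunny branch, a Wind test under the Rain branch, and a pure Yes leaf under Overcast.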

Page 16

C4.5

C4.5 is an extension of Quinlan's earlier ID3 algorithm. It adds:

• Handling both continuous and discrete attributes.
• Handling training data with missing attribute values.
• Pruning trees after creation.

Page 17

Continuous-valued attributes

Outlook | Temperature | Humidity | Wind | Play ball
Sunny | Hot | 0.90 | Weak | No
Sunny | Hot | 0.87 | Strong | No
Overcast | Hot | 0.93 | Weak | Yes
Rain | Mild | 0.89 | Weak | Yes
Rain | Cool | 0.80 | Weak | Yes
Rain | Cool | 0.59 | Strong | No
Overcast | Cool | 0.77 | Strong | Yes
Sunny | Mild | 0.91 | Weak | No
Sunny | Cool | 0.68 | Weak | Yes
Rain | Mild | 0.84 | Weak | Yes
Sunny | Mild | 0.72 | Strong | Yes
Overcast | Mild | 0.49 | Strong | Yes
Overcast | Hot | 0.74 | Weak | Yes
Rain | Mild | 0.86 | Strong | No

Total: 14 examples

Page 18

Continuous-valued attributes

Humidity | Play ball
0.90 | No
0.87 | No
0.93 | Yes
0.89 | Yes
0.80 | Yes
0.59 | No
0.77 | Yes
0.91 | No
0.68 | Yes
0.84 | Yes
0.72 | Yes
0.49 | Yes
0.74 | Yes
0.86 | No

Sorted by Humidity (subset):

Humidity: 0.68, 0.72, 0.87, 0.90, 0.91
Play ball: Yes, Yes, No, No, No

1. Sort the numeric attribute values.
2. Identify adjacent examples that differ in their target classification and take the midpoint as a candidate threshold.

Here the classification changes between 0.72 (Yes) and 0.87 (No), so:

Humidity > (0.72 + 0.87) / 2, i.e., Humidity > 0.795
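A small sketch of this candidate-threshold search (function and variable names are my own):

```python
def candidate_thresholds(values, labels):
    """Midpoints between adjacent sorted values whose class labels differ."""
    pairs = sorted(zip(values, labels))
    return [(lo + hi) / 2
            for (lo, l1), (hi, l2) in zip(pairs, pairs[1:])
            if l1 != l2]

# The five-example subset shown above:
humidity = [0.68, 0.72, 0.87, 0.90, 0.91]
play = ['Yes', 'Yes', 'No', 'No', 'No']
print(candidate_thresholds(humidity, play))  # [0.795]
```

C4.5 would then evaluate the information gain of each candidate threshold and keep the best one.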

Page 19

Continuous-valued attributes

Page 20

Overfitting

[Figure: three model fits, labeled "Under fitting", "Just right", and "Over fitting".]

Overfitting: if we have too many attributes (features), the learned hypothesis may fit the training set very well but fail to generalize to new examples (e.g., predicting the price of new examples).

Page 21

Overfitting

Page 22

Overfitting

Page 23

Why does overfitting happen?

• Noise or errors in the training examples (a general issue in machine learning).
• Small numbers of examples associated with a leaf node.

Page 24

Reduce Overfitting

• Stop growing the tree earlier, before it reaches the point where it perfectly classifies the training data (difficult to do well).
• Allow the tree to overfit the data, and then post-prune it.

Page 25

Rule post-pruning

(Outlook = Sunny ∧ Humidity = Normal) → P
(Outlook = Sunny ∧ Humidity = High) → N
(Outlook = Overcast) → P
(Outlook = Rain ∧ Wind = Strong) → N
(Outlook = Rain ∧ Wind = Weak) → P

Page 26

Rule post-pruning

Outlook | Temp | Humidity | Wind | Tennis
Rain | Low | High | Weak | No
Rain | Hot | High | Strong | No

• Prune preconditions:

(Outlook = Sunny ∧ Humidity = High) → N
(Outlook = Sunny ∧ Humidity = Normal) → P
(Outlook = Overcast) → P
(Outlook = Rain ∧ Wind = Strong) → N
(Outlook = Rain ∧ Wind = Weak) → P

Page 27

Rule post-pruning

Outlook | Temp | Humidity | Wind | Tennis
Rain | Low | High | Weak | No
Rain | Hot | High | Strong | No

• Prune preconditions:

(Outlook = Sunny ∧ Humidity = High) → N
(Outlook = Sunny ∧ Humidity = Normal) → P
(Outlook = Overcast) → P
(Outlook = Rain) → N (the Wind = Strong precondition has been pruned)
(Outlook = Rain ∧ Wind = Weak) → P

New instances:

Outlook | Temp | Humidity | Wind | Tennis
Sunny | Low | Low | Weak | Yes
Rain | Hot | High | Weak | No

Page 28

Rule post-pruning

Validation set: save a portion of the training data for validation.

If s <= t, prune the subtree, where s is the validation-set performance with the subtree at the node and t is the validation-set performance with a leaf in place of the subtree.

Rule post-pruning (Quinlan 1993): can remove smaller elements than whole subtrees, and improves readability.

Reduced-error pruning (Quinlan 1987) …

The data is split into a training set, a validation set, and a test set.
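A sketch of reduced-error pruning over the nested-dict trees built by the `id3` sketch above (using accuracy as the validation measure is my assumption; this is not C4.5's exact procedure):

```python
from collections import Counter

def classify(tree, example):
    """Walk a nested-dict tree to a leaf label."""
    while isinstance(tree, dict):
        attribute, branches = next(iter(tree.items()))
        tree = branches[example[attribute]]  # assumes the value was seen in training
    return tree

def accuracy(tree, examples, target):
    return sum(classify(tree, ex) == ex[target] for ex in examples) / len(examples)

def reduced_error_prune(tree, train, validation, target):
    """Bottom-up: replace a subtree with a majority-class leaf whenever the
    leaf does at least as well on the validation set (s <= t above)."""
    if not isinstance(tree, dict):
        return tree
    attribute, branches = next(iter(tree.items()))
    for value in list(branches):
        sub_train = [ex for ex in train if ex[attribute] == value]
        sub_val = [ex for ex in validation if ex[attribute] == value]
        if sub_train:
            branches[value] = reduced_error_prune(branches[value], sub_train,
                                                  sub_val, target)
    leaf = Counter(ex[target] for ex in train).most_common(1)[0][0]
    if validation and accuracy(tree, validation, target) <= accuracy(leaf, validation, target):
        return leaf
    return tree
```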

Page 29

Missing information

Example: missing information in mammography data

BI-RAD | Age | Shape | Margin | Density | Class
4 | 48 | 4 | 5 | ? | 1
5 | 67 | 3 | 5 | 3 | 1
5 | 57 | 4 | 4 | 3 | 1
5 | 60 | ? | 5 | 1 | 1
4 | 53 | ? | 4 | 3 | 1
4 | 28 | 1 | 1 | 3 | 0
4 | 70 | ? | 2 | 3 | 0
2 | 66 | 1 | 1 | ? | 0
5 | 63 | 3 | ? | 3 | 0
4 | 78 | 1 | 1 | 1 | 0

Page 30

Missing information: fill with the most common value

Fill in each missing value with the most common value of that attribute among examples of the same class (a code sketch follows the table below).

BI-RAD | Age | Shape | Margin | Density | Class
4 | 48 | 4 | 5 | 3 | 1
5 | 67 | 3 | 5 | 3 | 1
5 | 57 | 4 | 4 | 3 | 1
5 | 60 | 4 | 5 | 1 | 1
4 | 53 | 4 | 4 | 3 | 1
4 | 28 | 1 | 1 | 3 | 0
4 | 70 | 1 | 2 | 3 | 0
2 | 66 | 1 | 1 | 3 | 0
5 | 63 | 3 | ? | 3 | 0
4 | 78 | 1 | 1 | 1 | 0
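A pandas sketch of this fill (using pandas and these exact column spellings are my choices; note that the table above leaves the Margin of the age-63 row as ?, while this code would fill every missing value):

```python
import pandas as pd

df = pd.DataFrame({
    'BI-RAD':  [4, 5, 5, 5, 4, 4, 4, 2, 5, 4],
    'Age':     [48, 67, 57, 60, 53, 28, 70, 66, 63, 78],
    'Shape':   [4, 3, 4, None, None, 1, None, 1, 3, 1],
    'Margin':  [5, 5, 4, 5, 4, 1, 2, 1, None, 1],
    'Density': [None, 3, 3, 1, 3, 3, 3, None, 3, 1],
    'Class':   [1, 1, 1, 1, 1, 0, 0, 0, 0, 0],
})

# For each attribute, replace missing values with the most common
# value among rows of the same class.
for column in ['Shape', 'Margin', 'Density']:
    df[column] = df.groupby('Class')[column].transform(
        lambda s: s.fillna(s.mode().iat[0]))
print(df)
```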

Page 31

Missing information: fill according to proportions

Alternatively, split each example that has a missing value into fractional examples, weighted by the proportions of the observed values within the same class (a code sketch follows the table below).

Fraction | BI-RAD | Age | Shape | Margin | Density | Class
0.75 | 4 | 48 | 4 | 5 | 3 | 1
0.25 | 4 | 48 | 4 | 5 | 1 | 1
1 | 5 | 67 | 3 | 5 | 3 | 1
1 | 5 | 57 | 4 | 4 | 3 | 1
0.66 | 5 | 60 | 4 | 5 | 1 | 1
0.33 | 5 | 60 | 3 | 5 | 1 | 1
0.66 | 4 | 53 | 4 | 4 | 3 | 1
0.33 | 4 | 53 | 3 | 4 | 3 | 1
1 | 4 | 28 | 1 | 1 | 3 | 0
0.75 | 4 | 70 | 1 | 2 | 3 | 0
0.25 | 4 | 70 | 3 | 2 | 3 | 0
0.25 | 2 | 66 | 1 | 1 | 1 | 0
0.75 | 2 | 66 | 1 | 1 | 3 | 0
0.75 | 5 | 63 | 3 | 1 | 3 | 0
0.25 | 5 | 63 | 3 | 2 | 3 | 0
1 | 4 | 78 | 1 | 1 | 1 | 0

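A minimal sketch of how the fractional weights in the table can be derived (the helper name is mine):

```python
from collections import Counter

def fractional_fills(known_values):
    """(value, weight) pairs proportional to the observed distribution
    of the attribute among examples of the same class."""
    counts = Counter(known_values)
    total = sum(counts.values())
    return [(value, count / total) for value, count in counts.items()]

# Shape values observed for Class = 1 are 4, 3, 4, so a missing Shape
# in a Class = 1 row is split into weight 2/3 for 4 and 1/3 for 3:
print(fractional_fills([4, 3, 4]))  # [(4, 0.666...), (3, 0.333...)]
```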

Page 32

Summary

ID3 and C4.5 are decision-tree algorithms developed by Ross Quinlan, typically used in the machine learning and natural language processing domains.

ID3 and C4.5 use the entropy of the class distribution under each attribute and pick the attribute with the highest reduction in entropy (information gain) to decide which attribute the data should be split on first; the process then continues recursively, computing the entropy at each node, until all leaf nodes are pure.

Page 33

THANKS