Decision Trees (Tirgul 5)
Using Decision Trees
• It could be difficult to decide which pet is right for you.
• We’ll find a nice algorithm to help us decide what to choose without having to think about it.
2
Using Decision Trees
• Another example: the tree helps a student decide what to do in the evening.
• Features: <Party? (Yes, No), Deadline (Urgent, Near, None), Lazy? (Yes, No)>
• Labels: Party / Study / TV / Pub
3
Definition
• A decision tree is a flowchart-like structure which provides a useful way to describe a hypothesis from a domain 𝒳 to a label set 𝒴 = {0, 1, …, k}.
• Given x ∈ 𝒳, the prediction h(x) of a decision tree h corresponds to a path from the root of the tree to a leaf.
• Most of the time we do not distinguish between the representation (the tree) and the hypothesis.
• An internal node corresponds to a “question”.
• A branch corresponds to an “answer”.
• A leaf corresponds to a label.
4
Should I study?
• Equivalent to:
• (Party == No ∧ Deadline == Urgent) ∨ (Party == No ∧ Deadline == Near ∧ Lazy == No)
5
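As a sketch, the same rule written as a Python predicate; the argument names, the string values, and the completion Lazy == No are my reading of the tree, not taken verbatim from the slide:

    def should_study(party, deadline, lazy):
        # Study if there is no party and the deadline is urgent, or if there is
        # no party, the deadline is near, and we are not lazy.
        return (party == "No" and deadline == "Urgent") or \
               (party == "No" and deadline == "Near" and lazy == "No")

    print(should_study("No", "Urgent", "Yes"))   # True
    print(should_study("Yes", "Urgent", "No"))   # False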
Constructing the Tree
• Features: Party, Deadline, Lazy.
• Based on these features, how do we construct the tree?
• The DT algorithm uses the following principle: build the tree in a greedy manner;
• Starting at the root, choose the most informative feature at each step.
6
Constructing the Tree
• “Informative” features?
• Choosing which feature to use next in the decision tree can be thought of as playing the game “20 Questions”.
• At each stage, you choose a question that gives you the most information given what you know already.
• Thus, you would ask “Is it an animal?” before you ask “Is it a cat?”.
7
Constructing the Tree
• “20 Questions” example.
(Screenshot: Akinator narrowing down my character with its questions.)
8
Constructing the Tree
• The idea: quantify how much information is provided.
• Mathematically: Information Theory
9
Pivot example: a dataset of 14 examples with the attributes Outlook, Temperature, Humidity and Wind, used in the following slides.
10
Quick Aside: Information Theory
Entropy
• S is a sample of training examples.
• p+ is the proportion of positive examples in S.
• p− is the proportion of negative examples in S.
• Entropy measures the impurity of S:
• Entropy(S) = −p+ · log2(p+) − p− · log2(p−)
• The smaller, the better.
• (we define 0 · log 0 = 0)
• In general, for more than two classes: Entropy(S) = −Σ_i p_i · log2(p_i)
12
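A minimal sketch of this computation in Python (the helper name entropy and the counts-as-a-list interface are my own choices, not from the slides):

    import math

    def entropy(counts):
        # Entropy of a label distribution, given as a list of per-class counts.
        total = sum(counts)
        result = 0.0
        for c in counts:
            if c > 0:                      # convention: 0 * log2(0) = 0
                p = c / total
                result -= p * math.log2(p)
        return result

    print(entropy([9, 5]))    # ~0.940 (the slides round the two terms and get 0.939)
    print(entropy([7, 7]))    # 1.0: a fair split, maximal uncertainty
    print(entropy([14, 0]))   # 0.0: a certain outcome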
Entropy
• Generally, entropy refers to disorder or uncertainty.
• Entropy = 0 if the outcome is certain.
• E.g., consider a fair coin toss: the probability of heads equals the probability of tails.
• The entropy of this coin toss is as high as it can be: −(1/2) log2(1/2) − (1/2) log2(1/2) = 1 bit.
• This is because there is no way to predict the outcome of the coin toss ahead of time: the best we can do is predict that the coin will come up heads, and our prediction will be correct with probability 1/2.
13
Entropy: Example
Entropy(S) = −Σ_i p_i · log2(p_i)
Entropy(S) = −p_yes · log2(p_yes) − p_no · log2(p_no)
= −(9/14) · log2(9/14) − (5/14) · log2(5/14)
= 0.409 + 0.530 = 0.939
The dataset: 14 examples, 9 positive and 5 negative.
14
Information Gain
• Important idea: find how much the entropy of the whole training set would decrease if we choose each particular feature for the next classification step.
• Called: “Information Gain”, defined as the entropy of the whole set minus the entropy when a particular feature is chosen.
15
Information Gain
• The information gain is the expected reduction in entropy caused by partitioning the examples with respect to an attribute.
• Given S is the set of examples (at the current node), A the attribute, and S_v the subset of S for which attribute A has value v:
• IG(S, A) = Entropy(S) − Σ_{v ∈ Values(A)} (|S_v| / |S|) · Entropy(S_v)
• That is, current entropy minus new entropy (a code sketch follows below).
16
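As a sketch, the same formula in Python, reusing the entropy helper above (the interface, per-value class counts, is my own choice):

    def information_gain(parent_counts, children_counts):
        # IG(S, A) = Entropy(S) - sum_v (|S_v|/|S|) * Entropy(S_v)
        # parent_counts:   class counts at the current node, e.g. [9, 5]
        # children_counts: class counts per attribute value, e.g. [[2, 3], [4, 0], [3, 2]]
        total = sum(parent_counts)
        weighted = sum(sum(child) / total * entropy(child) for child in children_counts)
        return entropy(parent_counts) - weighted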
Information Gain: Example
• Attribute A: Outlook
• Values(A) = {Sunny, Overcast, Rain}
• Split: Sunny → [2+, 3−], Overcast → [4+, 0−], Rain → [3+, 2−]
17
18
Information Gain: Example
• Entropy(S) = 0.939
• Values(A = Outlook) = {Sunny, Overcast, Rain}
• The terms to compute: (|S_Sunny|/|S|) · Entropy(S_Sunny), (|S_Overcast|/|S|) · Entropy(S_Overcast), (|S_Rain|/|S|) · Entropy(S_Rain)
• IG(S, A) = Entropy(S) − Σ_{v ∈ Values(A)} (|S_v|/|S|) · Entropy(S_v)
• Split (root [9+, 5−]): Sunny → [2+, 3−], Overcast → [4+, 0−], Rain → [3+, 2−]
19
Information Gain: Example
• Entropy(S) = 0.939
• Values(A = Outlook) = {Sunny, Overcast, Rain}
• (|S_Sunny|/|S|) · Entropy(S_Sunny) = (5/14) · Entropy(S_Sunny)
• (|S_Overcast|/|S|) · Entropy(S_Overcast) = (4/14) · Entropy(S_Overcast)
• (|S_Rain|/|S|) · Entropy(S_Rain) = (5/14) · Entropy(S_Rain)
20
Information Gain: Example
• Entropy(S) = 0.939
• Values(A = Outlook) = {Sunny, Overcast, Rain}
• (|S_Sunny|/|S|) · Entropy(S_Sunny) = (5/14) · (−(2/5) log2(2/5) − (3/5) log2(3/5)) = (5/14) · 0.970
• (|S_Overcast|/|S|) · Entropy(S_Overcast) = (4/14) · (−(4/4) log2(4/4) − (0/4) log2(0/4)) = (4/14) · 0
• (|S_Rain|/|S|) · Entropy(S_Rain) = (5/14) · (−(3/5) log2(3/5) − (2/5) log2(2/5)) = (5/14) · 0.970
21
Information Gain: Example
• Entropy(S) = 0.939
• Values(A = Outlook) = {Sunny, Overcast, Rain}
• (|S_Sunny|/|S|) · Entropy(S_Sunny) = (5/14) · 0.970 = 0.346
• (|S_Overcast|/|S|) · Entropy(S_Overcast) = (4/14) · 0 = 0
• (|S_Rain|/|S|) · Entropy(S_Rain) = (5/14) · 0.970 = 0.346
• IG(S, A = Outlook) = 0.939 − (0.346 + 0 + 0.346) = 0.247
22
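Using the two sketched helpers above, the same number can be checked directly from the per-branch counts:

    # Outlook splits [9+, 5-] into Sunny [2+, 3-], Overcast [4+, 0-], Rain [3+, 2-]:
    print(information_gain([9, 5], [[2, 3], [4, 0], [3, 2]]))   # ~0.247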
Information Gain: Example
• Attribute A: Wind
• Values(A) = {Weak, Strong}
• Split (root [9+, 5−]): Weak → [6+, 2−], Strong → [3+, 3−]
23
Information Gain: Example
• Entropy(S) = 0.939
• Values(A = Wind) = {Weak, Strong}
• (|S_Weak|/|S|) · Entropy(S_Weak) = (8/14) · (−(6/8) log2(6/8) − (2/8) log2(2/8)) = (8/14) · 0.811
• (|S_Strong|/|S|) · Entropy(S_Strong) = (6/14) · (−(3/6) log2(3/6) − (3/6) log2(3/6)) = (6/14) · 1
24
Information Gain: Example
• Entropy(S) = 0.939
• Values(A = Wind) = {Weak, Strong}
• (|S_Weak|/|S|) · Entropy(S_Weak) = (8/14) · 0.811 = 0.463
• (|S_Strong|/|S|) · Entropy(S_Strong) = (6/14) · 1 = 0.428
• IG(S, A = Wind) = 0.939 − (0.463 + 0.428) = 0.048
25
Information Gain
• The smaller the value of Σ_v (|S_v|/|S|) · Entropy(S_v), the larger IG becomes.
• The ID3 algorithm computes this IG for each attribute and chooses the one that produces the highest value (greedy).
• IG(S, A) = Entropy(S) − Σ_{v ∈ Values(A)} (|S_v|/|S|) · Entropy(S_v): the entropy “before” minus the weighted entropy “after”, where |S_v|/|S| is the probability of getting to the node.
26
ID3 Algorithm (figure from Artificial Intelligence: A Modern Approach)
27
ID3 Algorithm
• Majority Value function: returns the label of the majority of the training examples in the current subtree.
• Choose Attribute function: chooses the attribute that maximizes the Information Gain (a Python sketch of the whole procedure follows below).
• (Measures other than IG could also be used.)
28
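A minimal recursive sketch of the procedure in Python, reusing entropy and information_gain from above. It is not the exact pseudocode from the book; examples are assumed to be dicts mapping attribute names to values, and all helper names are mine:

    from collections import Counter

    def majority_value(labels):
        # Label of the majority of the training examples in the current subtree.
        return Counter(labels).most_common(1)[0][0]

    def attribute_gain(a, examples, labels):
        # Information gain of splitting the current examples on attribute a.
        groups = {}
        for x, y in zip(examples, labels):
            groups.setdefault(x[a], []).append(y)
        children = [list(Counter(g).values()) for g in groups.values()]
        return information_gain(list(Counter(labels).values()), children)

    def id3(examples, labels, attributes):
        if len(set(labels)) == 1:          # pure node: a single label remains
            return labels[0]
        if not attributes:                 # no attribute left: fall back to the majority
            return majority_value(labels)
        # Greedy step: choose the attribute that maximizes the information gain.
        best = max(attributes, key=lambda a: attribute_gain(a, examples, labels))
        tree = {best: {}}
        for v in {x[best] for x in examples}:
            pairs = [(x, y) for x, y in zip(examples, labels) if x[best] == v]
            tree[best][v] = id3([x for x, _ in pairs],
                                [y for _, y in pairs],
                                [a for a in attributes if a != best])
        return tree

On the 14-example weather dataset this sketch would place Outlook at the root, matching the slides.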
Back to the example
• IG(S, A = Outlook) = 0.247 (maximal value)
• IG(S, A = Wind) = 0.048
• IG(S, A = Temperature) = 0.028
• IG(S, A = Humidity) = 0.151
• The tree (for now): Outlook becomes the root.
29
Decision Tree after first step:
(Diagram: the training examples, shown by their Temperature values (hot, mild, cool), grouped under the Sunny, Overcast and Rain branches of the Outlook root.)
30
Decision Tree after first step:
(Same diagram; all examples under the Overcast branch are positive, so that branch becomes a leaf labelled Yes.)
31
The second step:
Candidate splits of the Sunny subset [2+, 3−]:
• Wind → Weak: [1+, 2−], Strong: [1+, 1−]
• Temperature → Hot: [0+, 2−], Mild: [1+, 1−], Cool: [1+, 0−]
• Humidity → High: [0+, 3−], Normal: [2+, 0−] (both branches have Entropy = 0, so Humidity's IG is maximal)
32
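A quick check of this step with the information_gain sketch from above, using the class counts of the Sunny subset read off the diagram:

    # Class counts within the Sunny subset [2+, 3-]:
    print(information_gain([2, 3], [[1, 2], [1, 1]]))           # Wind:        ~0.020
    print(information_gain([2, 3], [[0, 2], [1, 1], [1, 0]]))   # Temperature: ~0.571
    print(information_gain([2, 3], [[0, 3], [2, 0]]))           # Humidity:    ~0.971 (maximal)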
Decision Tree after second step: Humidity becomes the test under the Sunny branch.
33
Next…
34
Final DT:
35
Minimal Description Length
• The attribute we choose is the one with the highest information gain: we minimize the amount of information that is left.
• Thus, the algorithm is biased towards smaller trees.
• This is consistent with the well-known principle that short solutions are usually better than longer ones.
36
Minimal Description Length
• MDL: the shortest description of something, i.e., the most compressed one, is the best description.
• a.k.a. Occam’s Razor: among competing hypotheses, the one with the fewest assumptions should be selected.
37
The final tree: Outlook at the root.
• Sunny → Humidity: High → False, Normal → True
• Overcast → True
• Rain → Wind: Weak → True, Strong → False
New data: <Rain, Mild, High, Weak>, true label: False
Prediction: True
Wait What!?
38
Overfitting
• Overfitting: learning was performed for too long, or the learner has adjusted to very specific random features of the training data that have no causal relation to the target function.
39
Overfitting
40
Overfitting
• The performance on the training examples keeps increasing, while the performance on unseen data becomes worse.
41
42
Pruning
43
44
ID3 for real-valued features
• Until now, we assumed that the splitting rules are of the form 1[x_i = 1]*.
• For real-valued features, use threshold-based splitting rules of the form 1[x_i < θ] (see the sketch below).
* 1(logical expression) denotes the indicator function: it equals 1 if the expression is true and 0 otherwise.
45
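A sketch of how such a threshold θ could be chosen for one real-valued feature, reusing information_gain from above; taking midpoints between consecutive sorted values as candidate thresholds is an assumption, not something stated on the slides:

    from collections import Counter

    def best_threshold(values, labels):
        # Choose theta for the rule 1[x_i < theta] by maximizing information gain;
        # candidate thresholds are midpoints between consecutive sorted values.
        order = sorted(zip(values, labels))
        xs = [v for v, _ in order]
        ys = [y for _, y in order]
        parent = list(Counter(ys).values())
        best_theta, best_gain = None, -1.0
        for i in range(1, len(xs)):
            if xs[i] == xs[i - 1]:
                continue
            theta = (xs[i - 1] + xs[i]) / 2
            left = list(Counter(ys[:i]).values())    # examples with x < theta
            right = list(Counter(ys[i:]).values())   # examples with x >= theta
            gain = information_gain(parent, [left, right])
            if gain > best_gain:
                best_theta, best_gain = theta, gain
        return best_theta, best_gain

    # Toy usage (made-up numbers): returns (3.8, ~0.971)
    print(best_threshold([2.0, 3.5, 4.1, 6.0, 7.2], ["no", "no", "yes", "yes", "yes"]))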
Random forest
46
Random forest
• Collection of decision trees.
• Prediction: a majority vote over the predictions of the individual trees.
• Constructing the trees:
• Take a random subsample S′ (of size m′) from S, with replacement.
• Construct a sequence I_1, I_2, …, where I_t is a subset of the attributes (of size d).
• The algorithm grows a DT (using ID3) based on the sample S′.
• At each splitting stage t, the algorithm chooses the feature from I_t that maximizes IG (a sketch follows below).
47
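A sketch of this construction, reusing majority_value, attribute_gain and Counter from the ID3 sketch above; the parameter names (n_trees, m_prime, d) and the nested-dict tree representation are mine:

    import random
    from collections import Counter

    def random_tree(examples, labels, attributes, d):
        # ID3 as sketched above, except that each splitting stage only considers
        # a fresh random subset I_t of d attributes.
        if len(set(labels)) == 1:
            return labels[0]
        if not attributes:
            return majority_value(labels)
        subset = random.sample(attributes, min(d, len(attributes)))   # the subset I_t
        best = max(subset, key=lambda a: attribute_gain(a, examples, labels))
        tree = {best: {}}
        for v in {x[best] for x in examples}:
            pairs = [(x, y) for x, y in zip(examples, labels) if x[best] == v]
            tree[best][v] = random_tree([x for x, _ in pairs],
                                        [y for _, y in pairs],
                                        [a for a in attributes if a != best], d)
        return tree

    def random_forest(examples, labels, attributes, n_trees, m_prime, d):
        forest = []
        for _ in range(n_trees):
            # Bootstrap: a random subsample S' of size m', drawn with replacement.
            idx = [random.randrange(len(examples)) for _ in range(m_prime)]
            forest.append(random_tree([examples[i] for i in idx],
                                      [labels[i] for i in idx], attributes, d))
        return forest

    def tree_predict(tree, x):
        # Walk one nested-dict tree (assumes every value of x was seen in training).
        while isinstance(tree, dict):
            a = next(iter(tree))
            tree = tree[a][x[a]]
        return tree

    def forest_predict(forest, x):
        # Majority vote over the predictions of the individual trees.
        return Counter(tree_predict(t, x) for t in forest).most_common(1)[0][0]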
Summary
• Intro: Decision Trees
• Constructing the tree: Information Theory (Entropy, IG), the ID3 Algorithm, MDL
• Overfitting: Pruning
• ID3 for real-valued features
• Random Forests
• Weka
49