Decision Trees Tirgul 5

Decision Trees - BIU


Page 1: Decision Trees - BIU

Decision Trees: Tirgul 5

Page 2: Decision Trees - BIU

Using Decision Trees

• It could be difficult to decide which pet is right for you.

• We'll find a nice algorithm to help us decide what to choose without having to think about it.

2

Page 3: Decision Trees - BIU

Using Decision Trees

• Another example: the tree helps a student decide what to do in the evening.

• Features: <Party? (Yes, No), Deadline (Urgent, Near, None), Lazy? (Yes, No)>

• Labels: Party / Study / TV / Pub

3

Page 4: Decision Trees - BIU

Definition

• A decision tree is a flowchart-like structure which provides a useful way to describe a hypothesis h from a domain 𝒳 to a label set 𝒴 = {0, 1, …, k}.

• Given x ∈ 𝒳, the prediction h(x) of a decision tree h corresponds to a path from the root of the tree to a leaf.

• Most of the time we do not distinguish between the representation (the tree) and the hypothesis.

• An internal node corresponds to a "question".

• A branch corresponds to an "answer".

• A leaf corresponds to a label.

4

Page 5: Decision Trees - BIU

Should I study?

• Equivalent to:

• (Party == No ∧ Deadline == Urgent) ∨ (Party == No ∧ Deadline == Near ∧ Lazy == No)

5

Page 6: Decision Trees - BIU

Constructing the Tree

• Features: Party, Deadline, Lazy.

• Based on these features, how do we construct the tree?

• The DT algorithm uses the following principle:
• Build the tree in a greedy manner;
• Starting at the root, choose the most informative feature at each step.

6

Page 7: Decision Trees - BIU

Constructing the Tree

• "Informative" features?
• Choosing which feature to use next in the decision tree can be thought of as playing the game "20 Questions".

• At each stage, you choose a question that gives you the most information given what you know already.

• Thus, you would ask "Is it an animal?" before you ask "Is it a cat?".

7

Page 8: Decision Trees - BIU

Constructing the Tree

• "20 Questions" example: the Akinator game guesses my character by asking a sequence of questions.

(Figure: screenshot of Akinator's questions.)

8

Page 9: Decision Trees - BIU

Constructing the Tree

• The idea: quantify how much information each feature provides.

• Mathematically: Information Theory.

9

Page 10: Decision Trees - BIU

Pivot example: the dataset used throughout the rest of the tirgul.

(Table: 14 examples with attributes Outlook, Temperature, Humidity and Wind, and a yes/no label.)

10

Page 11: Decision Trees - BIU

Quick Aside: Information Theory

Page 12: Decision Trees - BIU

Entropy

• S is a sample of training examples.

• p+ is the proportion of positive examples in S.

• p− is the proportion of negative examples in S.

• Entropy measures the impurity of S:

• Entropy(S) = −p+ log₂ p+ − p− log₂ p−

• The smaller, the better.

• (We define 0 · log 0 = 0.)

πΈπ‘›π‘‘π‘Ÿπ‘œπ‘π‘¦ 𝑝 = βˆ’

𝑖

𝑝𝑖 log2 𝑝𝑖
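As a small illustration, the general formula can be written as a short Python helper (a minimal sketch; the function name and the list-of-probabilities interface are my own, not from the slides):

import math

def entropy(probabilities):
    """Entropy(p) = -sum_i p_i * log2(p_i), using the convention 0 * log2(0) = 0."""
    return sum(-p * math.log2(p) for p in probabilities if p > 0)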

12

Page 13: Decision Trees - BIU

Entropy

• Generally, entropy refers to disorder or uncertainty.

• Entropy = 0 if the outcome is certain.

• E.g., consider a coin toss where the probability of heads equals the probability of tails:

• The entropy of the coin toss is as high as it could be.

• This is because there is no way to predict the outcome of the coin toss ahead of time: the best we can do is predict that the coin will come up heads, and our prediction will be correct with probability 1/2.
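Using the entropy helper sketched above, the coin-toss claim can be checked numerically:

print(entropy([0.5, 0.5]))   # 1.0  - a fair coin: maximal uncertainty for two outcomes
print(entropy([1.0, 0.0]))   # 0.0  - a certain outcome: no uncertainty at all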

13

Page 14: Decision Trees - BIU

Entropy: Example

πΈπ‘›π‘‘π‘Ÿπ‘œπ‘π‘¦ 𝑝 = βˆ’

𝑖

𝑝𝑖 log2 𝑝𝑖

πΈπ‘›π‘‘π‘Ÿπ‘œπ‘π‘¦ 𝑆 = βˆ’π‘"𝑦𝑒𝑠" log2 𝑝"𝑦𝑒𝑠" βˆ’ 𝑝"π‘›π‘œ" log2 𝑝"π‘›π‘œ"

= βˆ’9

14log29

14βˆ’5

14log25

14

= 0.409 + 0.530 = 0.939

} 14 examples:β€’ 9 positiveβ€’ 5 negative

The dataset
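The same helper reproduces the number above for the 14-example dataset (up to rounding):

print(entropy([9 / 14, 5 / 14]))   # ~0.940 (0.409 + 0.530; the slide rounds to 0.939)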

14

Page 15: Decision Trees - BIU

Information Gain

• Important idea: find how much the entropy of the whole training set would decrease if we choose each particular feature for the next classification step.

• This is called the "Information Gain".
• It is defined as the entropy of the whole set minus the entropy when a particular feature is chosen.

15

Page 16: Decision Trees - BIU

Information Gain

• The information gain is the expected reduction in entropy caused by partitioning the examples with respect to an attribute.

• Let S be the set of examples (at the current node), A the attribute, and S_v the subset of S for which attribute A has value v:

• IG(S, A) = Entropy(S) − Σ_{v ∈ Values(A)} (|S_v| / |S|) · Entropy(S_v)

• That is, current entropy minus new entropy.
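A sketch of this computation in Python, building on the entropy helper above. Representing an example as a (features-dict, label) pair is my own assumption, not something the slides prescribe:

from collections import Counter, defaultdict

def class_entropy(examples):
    """Entropy of the label distribution of a list of (features, label) pairs."""
    counts = Counter(label for _, label in examples)
    return entropy([c / len(examples) for c in counts.values()])

def information_gain(examples, attribute):
    """IG(S, A) = Entropy(S) - sum over v of |S_v|/|S| * Entropy(S_v)."""
    subsets = defaultdict(list)
    for features, label in examples:
        subsets[features[attribute]].append((features, label))   # S_v for each value v
    remainder = sum(len(s) / len(examples) * class_entropy(s) for s in subsets.values())
    return class_entropy(examples) - remainder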

16

Page 17: Decision Trees - BIU

Information Gain: Example

• Attribute A: Outlook
• Values(A) = {Sunny, Overcast, Rain}

Outlook splits the examples into: Sunny [2+, 3−], Overcast [4+, 0−], Rain [3+, 2−]

17

Page 18: Decision Trees - BIU

18

Page 19: Decision Trees - BIU

Information Gain: Example

β€’ πΈπ‘›π‘‘π‘Ÿπ‘œπ‘π‘¦ 𝑆 = 0.939

β€’ π‘‰π‘Žπ‘™π‘’π‘’π‘  𝐴 = π‘‚π‘’π‘‘π‘™π‘œπ‘œπ‘˜ = 𝑠𝑒𝑛𝑛𝑦, π‘œπ‘£π‘’π‘Ÿπ‘π‘Žπ‘ π‘‘, π‘Ÿπ‘Žπ‘–π‘›

‒𝑆𝑠𝑒𝑛𝑛𝑦

π‘†πΈπ‘›π‘‘π‘Ÿπ‘œπ‘π‘¦ 𝑆𝑠𝑒𝑛𝑛𝑦

β€’π‘†π‘œπ‘£π‘’π‘Ÿπ‘π‘Žπ‘ π‘‘

π‘†πΈπ‘›π‘‘π‘Ÿπ‘œπ‘π‘¦ π‘†π‘œπ‘£π‘’π‘Ÿπ‘π‘Žπ‘ π‘‘

β€’π‘†π‘Ÿπ‘Žπ‘–π‘›

π‘†πΈπ‘›π‘‘π‘Ÿπ‘œπ‘π‘¦ π‘†π‘Ÿπ‘Žπ‘–π‘›

IG(S, a) = πΈπ‘›π‘‘π‘Ÿπ‘œπ‘π‘¦ 𝑆 βˆ’ π‘£βˆˆπ‘‰π‘Žπ‘™π‘’π‘’π‘  𝐴𝑆𝑣

π‘†πΈπ‘›π‘‘π‘Ÿπ‘œπ‘π‘¦ 𝑆𝑣

πΈπ‘›π‘‘π‘Ÿπ‘œπ‘π‘¦ 𝑝 = βˆ’

𝑖

𝑝𝑖 log2 𝑝𝑖

Outlook

[3 + , 2 -][2 + , 3-] [4 + , 0 -]

SunnyOvercast

Rain

[9 + , 5 -]

19

Page 20: Decision Trees - BIU

Information Gain: Example

β€’ πΈπ‘›π‘‘π‘Ÿπ‘œπ‘π‘¦ 𝑆 = 0.939

β€’ π‘‰π‘Žπ‘™π‘’π‘’π‘  𝐴 = π‘‚π‘’π‘‘π‘™π‘œπ‘œπ‘˜ = 𝑠𝑒𝑛𝑛𝑦, π‘œπ‘£π‘’π‘Ÿπ‘π‘Žπ‘ π‘‘, π‘Ÿπ‘Žπ‘–π‘›

‒𝑆𝑠𝑒𝑛𝑛𝑦

π‘†πΈπ‘›π‘‘π‘Ÿπ‘œπ‘π‘¦ 𝑆𝑠𝑒𝑛𝑛𝑦 =

5

14πΈπ‘›π‘‘π‘Ÿπ‘œπ‘π‘¦ 𝑆𝑠𝑒𝑛𝑛𝑦

β€’π‘†π‘œπ‘£π‘’π‘Ÿπ‘π‘Žπ‘ π‘‘

π‘†πΈπ‘›π‘‘π‘Ÿπ‘œπ‘π‘¦ π‘†π‘œπ‘£π‘’π‘Ÿπ‘π‘Žπ‘ π‘‘ =

4

14πΈπ‘›π‘‘π‘Ÿπ‘œπ‘π‘¦ π‘†π‘œπ‘£π‘’π‘Ÿπ‘π‘Žπ‘ π‘‘

β€’π‘†π‘Ÿπ‘Žπ‘–π‘›

π‘†πΈπ‘›π‘‘π‘Ÿπ‘œπ‘π‘¦ π‘†π‘Ÿπ‘Žπ‘–π‘› =

5

14πΈπ‘›π‘‘π‘Ÿπ‘œπ‘π‘¦ π‘†π‘Ÿπ‘Žπ‘–π‘›

IG(S, a) = πΈπ‘›π‘‘π‘Ÿπ‘œπ‘π‘¦ 𝑆 βˆ’ π‘£βˆˆπ‘‰π‘Žπ‘™π‘’π‘’π‘  𝐴𝑆𝑣

π‘†πΈπ‘›π‘‘π‘Ÿπ‘œπ‘π‘¦ 𝑆𝑣

πΈπ‘›π‘‘π‘Ÿπ‘œπ‘π‘¦ 𝑝 = βˆ’

𝑖

𝑝𝑖 log2 𝑝𝑖

Outlook

[3 + , 2 -][2 + , 3-] [4 + , 0 -]

SunnyOvercast

Rain

[9 + , 5 -]

20

Page 21: Decision Trees - BIU

Information Gain: Example

β€’ πΈπ‘›π‘‘π‘Ÿπ‘œπ‘π‘¦ 𝑆 = 0.939

β€’ π‘‰π‘Žπ‘™π‘’π‘’π‘  𝐴 = π‘‚π‘’π‘‘π‘™π‘œπ‘œπ‘˜ = 𝑠𝑒𝑛𝑛𝑦, π‘œπ‘£π‘’π‘Ÿπ‘π‘Žπ‘ π‘‘, π‘Ÿπ‘Žπ‘–π‘›

‒𝑆𝑠𝑒𝑛𝑛𝑦

π‘†πΈπ‘›π‘‘π‘Ÿπ‘œπ‘π‘¦ 𝑆𝑠𝑒𝑛𝑛𝑦 =

5

14βˆ’2

5π‘™π‘œπ‘”2

2

5βˆ’3

5π‘™π‘œπ‘”2

3

5=5

14β‹… 0.970

β€’π‘†π‘œπ‘£π‘’π‘Ÿπ‘π‘Žπ‘ π‘‘

π‘†πΈπ‘›π‘‘π‘Ÿπ‘œπ‘π‘¦ π‘†π‘œπ‘£π‘’π‘Ÿπ‘π‘Žπ‘ π‘‘ =

4

14βˆ’4

4π‘™π‘œπ‘”2

4

4βˆ’0

4π‘™π‘œπ‘”2

0

4=4

14β‹… 0

β€’π‘†π‘Ÿπ‘Žπ‘–π‘›

π‘†πΈπ‘›π‘‘π‘Ÿπ‘œπ‘π‘¦ π‘†π‘Ÿπ‘Žπ‘–π‘› =

5

14βˆ’3

5π‘™π‘œπ‘”2

3

5βˆ’2

5π‘™π‘œπ‘”2

2

5=5

14β‹… 0.970

IG(S, a) = πΈπ‘›π‘‘π‘Ÿπ‘œπ‘π‘¦ 𝑆 βˆ’ π‘£βˆˆπ‘‰π‘Žπ‘™π‘’π‘’π‘  𝐴𝑆𝑣

π‘†πΈπ‘›π‘‘π‘Ÿπ‘œπ‘π‘¦ 𝑆𝑣

πΈπ‘›π‘‘π‘Ÿπ‘œπ‘π‘¦ 𝑝 = βˆ’

𝑖

𝑝𝑖 log2 𝑝𝑖

Outlook

[3 + , 2 -][2 + , 3-] [4 + , 0 -]

SunnyOvercast

Rain

[9 + , 5 -]

21

Page 22: Decision Trees - BIU

Information Gain: Example

β€’ πΈπ‘›π‘‘π‘Ÿπ‘œπ‘π‘¦ 𝑆 = 0.939

β€’ π‘‰π‘Žπ‘™π‘’π‘’π‘  𝐴 = π‘‚π‘’π‘‘π‘™π‘œπ‘œπ‘˜ = 𝑠𝑒𝑛𝑛𝑦, π‘œπ‘£π‘’π‘Ÿπ‘π‘Žπ‘ π‘‘, π‘Ÿπ‘Žπ‘–π‘›

‒𝑆𝑠𝑒𝑛𝑛𝑦

π‘†πΈπ‘›π‘‘π‘Ÿπ‘œπ‘π‘¦ 𝑆𝑠𝑒𝑛𝑛𝑦 =

5

14β‹… 0.970 = 0.346

β€’π‘†π‘œπ‘£π‘’π‘Ÿπ‘π‘Žπ‘ π‘‘

π‘†πΈπ‘›π‘‘π‘Ÿπ‘œπ‘π‘¦ π‘†π‘œπ‘£π‘’π‘Ÿπ‘π‘Žπ‘ π‘‘ =

4

14β‹… 0 = 0

β€’π‘†π‘Ÿπ‘Žπ‘–π‘›

π‘†πΈπ‘›π‘‘π‘Ÿπ‘œπ‘π‘¦ π‘†π‘Ÿπ‘Žπ‘–π‘› =

5

14β‹… 0.970 = 0.346

IG(S, a) = πΈπ‘›π‘‘π‘Ÿπ‘œπ‘π‘¦ 𝑆 βˆ’ π‘£βˆˆπ‘‰π‘Žπ‘™π‘’π‘’π‘  𝐴𝑆𝑣

π‘†πΈπ‘›π‘‘π‘Ÿπ‘œπ‘π‘¦ 𝑆𝑣

Outlook

[3 + , 2 -][2 + , 3-] [4 + , 0 -]

SunnyOvercast

Rain

[9 + , 5 -]

IG(S, a = outlook) = 0.939 βˆ’ 0.346 + 0 + 0.346 = 𝟎. πŸπŸ’πŸ•
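As a sanity check, the branch counts from the figure (Sunny [2+, 3−], Overcast [4+, 0−], Rain [3+, 2−]) can be plugged into the entropy helper from earlier; the small weighted_entropy wrapper below is my own:

def weighted_entropy(branch_counts, total):
    """Sum over branches of |S_v|/|S| * Entropy(S_v), given (positives, negatives) per branch."""
    return sum((p + n) / total * entropy([p / (p + n), n / (p + n)])
               for p, n in branch_counts)

outlook_branches = [(2, 3), (4, 0), (3, 2)]        # Sunny, Overcast, Rain
print(entropy([9 / 14, 5 / 14]) - weighted_entropy(outlook_branches, 14))   # ~0.247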

22

Page 23: Decision Trees - BIU

Information Gain: Example

• Attribute A: Wind
• Values(A) = {Weak, Strong}

Wind splits S = [9+, 5−] into: Weak [6+, 2−], Strong [3+, 3−]

23

Page 24: Decision Trees - BIU

Information Gain: Example

β€’ πΈπ‘›π‘‘π‘Ÿπ‘œπ‘π‘¦ 𝑆 = 0.939

β€’ π‘‰π‘Žπ‘™π‘’π‘’π‘  𝐴 = π‘Šπ‘–π‘›π‘‘ = π‘€π‘’π‘Žπ‘˜, π‘ π‘‘π‘Ÿπ‘œπ‘›π‘”

β€’π‘†π‘€π‘’π‘Žπ‘˜

π‘†πΈπ‘›π‘‘π‘Ÿπ‘œπ‘π‘¦ π‘†π‘€π‘’π‘Žπ‘˜ =

8

14βˆ’6

8π‘™π‘œπ‘”2

6

8βˆ’2

8π‘™π‘œπ‘”2

2

8=8

14β‹… 0.811

β€’π‘†π‘ π‘‘π‘Ÿπ‘œπ‘›π‘”

π‘†πΈπ‘›π‘‘π‘Ÿπ‘œπ‘π‘¦ π‘†π‘ π‘‘π‘Ÿπ‘œπ‘›π‘” =

6

14βˆ’3

6π‘™π‘œπ‘”2

3

6βˆ’3

6π‘™π‘œπ‘”2

3

6=6

14β‹… 1

IG(S, a) = πΈπ‘›π‘‘π‘Ÿπ‘œπ‘π‘¦ 𝑆 βˆ’ π‘£βˆˆπ‘‰π‘Žπ‘™π‘’π‘’π‘  𝐴𝑆𝑣

π‘†πΈπ‘›π‘‘π‘Ÿπ‘œπ‘π‘¦ 𝑆𝑣

πΈπ‘›π‘‘π‘Ÿπ‘œπ‘π‘¦ 𝑝 = βˆ’

𝑖

𝑝𝑖 log2 𝑝𝑖

Wind

[3 + , 3 -][6 + , 2-]

weak strong

[9 + , 5-]

24

Page 25: Decision Trees - BIU

Information Gain: Example

β€’ πΈπ‘›π‘‘π‘Ÿπ‘œπ‘π‘¦ 𝑆 = 0.939

β€’ π‘‰π‘Žπ‘™π‘’π‘’π‘  𝐴 = π‘Šπ‘–π‘›π‘‘ = π‘€π‘’π‘Žπ‘˜, π‘ π‘‘π‘Ÿπ‘œπ‘›π‘”

β€’π‘†π‘€π‘’π‘Žπ‘˜

π‘†πΈπ‘›π‘‘π‘Ÿπ‘œπ‘π‘¦ π‘†π‘€π‘’π‘Žπ‘˜ =

8

14β‹… 0.811 = 0.463

β€’π‘†π‘ π‘‘π‘Ÿπ‘œπ‘›π‘”

π‘†πΈπ‘›π‘‘π‘Ÿπ‘œπ‘π‘¦ π‘†π‘ π‘‘π‘Ÿπ‘œπ‘›π‘” =

6

14β‹… 1 = 0.428

IG(S, a) = πΈπ‘›π‘‘π‘Ÿπ‘œπ‘π‘¦ 𝑆 βˆ’ π‘£βˆˆπ‘‰π‘Žπ‘™π‘’π‘’π‘  𝐴𝑆𝑣

π‘†πΈπ‘›π‘‘π‘Ÿπ‘œπ‘π‘¦ 𝑆𝑣

IG(S, a = wind) = 0.939 βˆ’ 0.463 + 0.428 = 𝟎. πŸŽπŸ’πŸ–

25

Page 26: Decision Trees - BIU

Information Gain

• The smaller the weighted sum Σ_{v ∈ Values(A)} (|S_v| / |S|) · Entropy(S_v) is, the larger the IG becomes.

• The ID3 algorithm computes this IG for each attribute and chooses the one that produces the highest value.
• Greedy.

IG(S, A) = Entropy(S) − Σ_{v ∈ Values(A)} (|S_v| / |S|) · Entropy(S_v)

• Here Entropy(S) is the entropy "before" the split, the weighted sum is the entropy "after" it, and the factor |S_v| / |S| is the probability of getting to the node for value v.

26

Page 27: Decision Trees - BIU

ID3 Algorithm
(from Artificial Intelligence: A Modern Approach)

27

Page 28: Decision Trees - BIU

ID3 Algorithm

• Majority Value function:
• Returns the label of the majority of the training examples in the current subtree.

• Choose Attribute function:
• Chooses the attribute that maximizes the Information Gain.
• (Measures other than IG could also be used.)
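Putting the pieces together, here is a compact sketch of the ID3 procedure described on this slide, reusing the information_gain helper from earlier. The nested-dict tree representation and the exact stopping conditions are my own simplifications, not the slides':

from collections import Counter

def majority_value(examples):
    """Return the label of the majority of the training examples."""
    return Counter(label for _, label in examples).most_common(1)[0][0]

def id3(examples, attributes, default=None):
    """Grow the tree greedily: at each node choose the attribute with maximal information gain."""
    if not examples:
        return default                              # no examples: use the parent's majority label
    labels = {label for _, label in examples}
    if len(labels) == 1:
        return labels.pop()                         # pure node: it becomes a leaf
    if not attributes:
        return majority_value(examples)             # no attributes left: majority label
    best = max(attributes, key=lambda a: information_gain(examples, a))
    subtree = {}
    for value in {features[best] for features, _ in examples}:
        subset = [(f, l) for f, l in examples if f[best] == value]
        subtree[value] = id3(subset, [a for a in attributes if a != best],
                             majority_value(examples))
    return {best: subtree}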

28

Page 29: Decision Trees - BIU

Back to the example

• IG(S, A = Outlook) = 0.247
• IG(S, A = Wind) = 0.048
• IG(S, A = Temperature) = 0.028
• IG(S, A = Humidity) = 0.151

• Outlook has the maximal value, so the tree (for now) is a single root node that splits on Outlook.
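Reproducing the choice of the root from these numbers (the gain values are the ones listed above):

gains = {"Outlook": 0.247, "Wind": 0.048, "Temperature": 0.028, "Humidity": 0.151}
print(max(gains, key=gains.get))   # Outlook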

29

Page 30: Decision Trees - BIU

Decision Tree after first step:

(Figure: the tree after the first split, with the root testing Outlook and the training examples, shown here by their Temperature values, grouped under the Sunny, Overcast and Rain branches.)

30

Page 31: Decision Trees - BIU

Decision Tree after first step:

(Figure: the same tree; every example under the Overcast branch is positive, so that branch becomes a "yes" leaf.)

31

Page 32: Decision Trees - BIU

The second step:

Splitting the Sunny branch, S_sunny = [2+, 3−]:

• Wind: Weak [1+, 2−], Strong [1+, 1−]
• Temperature: Hot [0+, 2−], Mild [1+, 1−], Cool [1+, 0−]
• Humidity: High [0+, 3−], Normal [2+, 0−]

• Both of Humidity's children have Entropy = 0, so its IG is maximal.
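The same sanity check as for the root, now restricted to the five Sunny examples, using the counts on this slide and the helpers sketched earlier:

entropy_sunny = entropy([2 / 5, 3 / 5])                 # ~0.971
splits = {
    "Wind":        [(1, 2), (1, 1)],                    # Weak, Strong
    "Temperature": [(0, 2), (1, 1), (1, 0)],            # Hot, Mild, Cool
    "Humidity":    [(0, 3), (2, 0)],                    # High, Normal
}
for attribute, counts in splits.items():
    print(attribute, entropy_sunny - weighted_entropy(counts, 5))
# Humidity wins: both of its children are pure, so its weighted entropy is 0.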

32

Page 33: Decision Trees - BIU

Decision Tree after second step:

33

Page 34: Decision Trees - BIU

Next…

34

Page 35: Decision Trees - BIU

Final DT:

35

Page 36: Decision Trees - BIU

Minimal Description Length

• The attribute we choose is the one with the highest information gain.
• We minimize the amount of information that is left.

• Thus, the algorithm is biased towards smaller trees.

• This is consistent with the well-known principle that short solutions are usually better than longer ones.

36

Page 37: Decision Trees - BIU

Minimal Description Length

• MDL: the shortest description of something, i.e., the most compressed one, is the best description.

• a.k.a. Occam's Razor: among competing hypotheses, the one with the fewest assumptions should be selected.

37

Page 38: Decision Trees - BIU

The learned tree:

• Outlook = Sunny → Humidity: High → False, Normal → True
• Outlook = Overcast → True
• Outlook = Rain → Wind: Weak → True, Strong → False

New data: <Rain, Mild, High, Weak>, with label False.

Prediction: True

Wait, what!?
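Written as a nested dict (the same encoding as in the id3 sketch earlier), the tree on this slide and a small traversal function reproduce the surprising prediction:

final_tree = {"Outlook": {
    "Sunny":    {"Humidity": {"High": False, "Normal": True}},
    "Overcast": True,
    "Rain":     {"Wind": {"Weak": True, "Strong": False}},
}}

def classify(tree, example):
    """Follow the path from the root to a leaf, answering each internal node's question."""
    while isinstance(tree, dict):
        attribute = next(iter(tree))          # the question asked at this node
        tree = tree[attribute][example[attribute]]
    return tree

new_example = {"Outlook": "Rain", "Temperature": "Mild", "Humidity": "High", "Wind": "Weak"}
print(classify(final_tree, new_example))      # True, although the actual label is False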

38

Page 39: Decision Trees - BIU

Overfitting

• Overfitting: learning was performed for too long, or the learner has adjusted to very specific random features of the training data that have no causal relation to the target function.

39

Page 40: Decision Trees - BIU

Overfitting

40

Page 41: Decision Trees - BIU

Overfitting

• The performance on the training examples still increases while the performance on unseen data becomes worse.

41

Page 42: Decision Trees - BIU

42

Page 43: Decision Trees - BIU

Pruning

43

Page 44: Decision Trees - BIU

44

Page 45: Decision Trees - BIU

ID3 for real-valued features

• Until now, we assumed that the splitting rules are of the form 𝕀[xᵢ = 1].

• For real-valued features, use threshold-based splitting rules: 𝕀[xᵢ < θ].

* 𝕀(boolean expression) is the indicator function (it equals 1 if the expression is true and 0 otherwise).
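One common way to choose θ, sketched below under my own assumptions (the slides do not spell out the procedure), reusing the entropy helper from earlier: sort the observed feature values and evaluate the information gain of the rule 𝕀[xᵢ < θ] at midpoints between consecutive distinct values.

from collections import Counter

def label_entropy(labels):
    """Entropy of a plain list of labels."""
    return entropy([c / len(labels) for c in Counter(labels).values()])

def best_threshold(values, labels):
    """Return the threshold theta with maximal information gain for the rule [x < theta]."""
    pairs = sorted(zip(values, labels))
    base = label_entropy(labels)
    best_gain, best_theta = 0.0, None
    for (v1, _), (v2, _) in zip(pairs, pairs[1:]):
        if v1 == v2:
            continue
        theta = (v1 + v2) / 2                          # midpoint between consecutive values
        left  = [l for v, l in pairs if v < theta]
        right = [l for v, l in pairs if v >= theta]
        remainder = (len(left) * label_entropy(left) + len(right) * label_entropy(right)) / len(labels)
        if base - remainder > best_gain:
            best_gain, best_theta = base - remainder, theta
    return best_theta, best_gain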

45

Page 46: Decision Trees - BIU

Random forest

46

Page 47: Decision Trees - BIU

Random forest

• A random forest is a collection of decision trees.

• Prediction: a majority vote over the predictions of the individual trees.

• Constructing the trees (see the sketch below):
• Take a random subsample S' (of size m') from S, with replacement.
• Construct a sequence I₁, I₂, …, where Iₜ is a subset of the attributes (of size k).
• The algorithm grows a DT (using ID3) based on the sample S'.
• At each splitting stage, the algorithm chooses a feature that maximizes the IG from within Iₜ.
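A sketch of this construction, reusing the id3-style helpers from earlier (majority_value, information_gain, classify). The function names, and drawing a fresh random attribute subset Iₜ at every split, are my reading of the description rather than a fixed recipe:

import random
from collections import Counter

def random_tree(examples, attributes, k):
    """Like id3, but each split only considers a random subset of k attributes."""
    labels = {label for _, label in examples}
    if len(labels) == 1:
        return labels.pop()
    if not attributes:
        return majority_value(examples)
    candidates = random.sample(attributes, min(k, len(attributes)))     # the subset I_t
    best = max(candidates, key=lambda a: information_gain(examples, a))
    subtree = {}
    for value in {f[best] for f, _ in examples}:
        subset = [(f, l) for f, l in examples if f[best] == value]
        subtree[value] = random_tree(subset, [a for a in attributes if a != best], k)
    return {best: subtree}

def random_forest(examples, attributes, n_trees, m, k):
    """Grow n_trees trees, each on a bootstrap sample S' of size m drawn with replacement."""
    return [random_tree(random.choices(examples, k=m), attributes, k) for _ in range(n_trees)]

def forest_predict(forest, example):
    """Majority vote over the predictions of the individual trees."""
    votes = [classify(tree, example) for tree in forest]
    return Counter(votes).most_common(1)[0][0]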

47

Page 48: Decision Trees - BIU

Weka

• http://www.cs.waikato.ac.nz/ml/weka/

48

Page 49: Decision Trees - BIU

Summary

• Intro: Decision Trees

• Constructing the tree:
• Information Theory: Entropy, IG
• ID3 Algorithm
• MDL

• Overfitting:
• Pruning

• ID3 for real-valued features

• Random Forests

• Weka

49