Machine Learning III: Decision Tree Induction (CSE 473)


Machine Learning III: Decision Tree Induction

CSE 473

© Daniel S. Weld 2

Machine Learning Outline

• Machine learning:
  √ Function approximation
  √ Bias

• Supervised learning
  √ Classifiers & concept learning
  √ Version spaces (restriction bias)
  Decision-tree induction (preference bias)

• Overfitting
• Ensembles of classifiers

© Daniel S. Weld 3

Two Strategies for ML

• Restriction bias: use prior knowledge to specify a restricted hypothesis space.
  Version space algorithm over conjunctions.
• Preference bias: use a broad hypothesis space, but impose an ordering on the hypotheses.
  Decision trees.

© Daniel S. Weld 4

Decision Trees

• Convenient representation
  Developed with learning in mind
  Deterministic
• Expressive
  Equivalent to propositional DNF
  Handles discrete and continuous parameters
• Simple learning algorithm
  Handles noise well
  The algorithm can be classified as:
    Constructive (builds the DT by adding nodes)
    Eager
    Batch (but incremental versions exist)

© Daniel S. Weld 5

Concept Learning

• E.g., learn the concept “edible mushroom”
  Target function has two values: T or F
• Represent concepts as decision trees
• Use hill-climbing search through the space of decision trees
  Start with a simple concept
  Refine it into a complex concept as needed

© Daniel S. Weld 7

Decision Tree Representation

Good day for tennis?  Leaves = classification; arcs = choice of value for the parent attribute.

[Tree: Outlook at the root.
  Sunny    → Humidity (High → No, Normal → Yes)
  Overcast → Yes
  Rain     → Wind (Strong → No, Weak → Yes)]

The decision tree is equivalent to logic in disjunctive normal form:
  G-Day ⇔ (Sunny ∧ Normal) ∨ Overcast ∨ (Rain ∧ Weak)
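A tree like this maps directly onto nested data. Below is a minimal Python sketch; the dict encoding, attribute names, and classify helper are illustrative assumptions, not from the slides:

# Each internal node is {"attr": name, "branches": {value: subtree}}; a leaf is a class label.
tennis_tree = {
    "attr": "Outlook",
    "branches": {
        "Sunny":    {"attr": "Humidity", "branches": {"High": "No", "Normal": "Yes"}},
        "Overcast": "Yes",
        "Rain":     {"attr": "Wind", "branches": {"Strong": "No", "Weak": "Yes"}},
    },
}

def classify(tree, example):
    # Follow the branch matching the example's value for each node's attribute
    # until a leaf (class label) is reached.
    while isinstance(tree, dict):
        tree = tree["branches"][example[tree["attr"]]]
    return tree

print(classify(tennis_tree, {"Outlook": "Rain", "Humidity": "High", "Wind": "Weak"}))  # -> Yes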

© Daniel S. Weld 8

© Daniel S. Weld 12

What is the Simplest Tree?

Day  Outlook  Temp  Humid  Wind  Play?
d1   s        h     h      w     n
d2   s        h     h      s     n
d3   o        h     h      w     y
d4   r        m     h      w     y
d5   r        c     n      w     y
d6   r        c     n      s     y
d7   o        c     n      s     y
d8   s        m     h      w     n
d9   s        c     n      w     y
d10  r        m     n      w     y
d11  s        m     n      s     y
d12  o        m     h      s     y
d13  o        h     n      w     y
d14  r        m     h      s     n

How good?  [10+, 4-] means: correct on 10 examples, incorrect on 4.

© Daniel S. Weld 13

Successors

[Figure: the current single-leaf tree (Yes) and the four candidate splits, on Outlook, Temp, Humid, and Wind.]

Which attribute should we use to split?

© Daniel S. Weld 14

To be decided:

• How to choose best attribute?
  Information gain
  Entropy (disorder)
• When to stop growing tree?

© Daniel S. Weld 15

Disorder is bad; homogeneity is good.

[Figure: example splits labeled from bad to good by how homogeneous the resulting subsets are.]

© Daniel S. Weld 16

[Figure: entropy vs. proportion of positive examples; entropy is 0 at P = 0 or 1 and reaches its maximum of 1.0 at P = 0.50.]

© Daniel S. Weld 17

Entropy (disorder) is bad; homogeneity is good.

• Let S be a set of examples
• Entropy(S) = -P log2(P) - N log2(N)
  where P is the proportion of positive examples,
  N is the proportion of negative examples,
  and 0 log2 0 == 0
• Example: S has 10 positive and 4 negative examples
  Entropy([10+, 4-]) = -(10/14) log2(10/14) - (4/14) log2(4/14) = 0.863
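A quick sanity check of the formula in Python (a sketch; the helper name is mine):

import math

def entropy(pos, neg):
    # Entropy of a set with `pos` positive and `neg` negative examples,
    # using the convention 0 * log2(0) == 0.
    total = pos + neg
    return -sum((c / total) * math.log2(c / total) for c in (pos, neg) if c > 0)

print(round(entropy(10, 4), 3))  # 0.863, the [10+, 4-] example above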

© Daniel S. Weld 18

Information Gain

• Measure of the expected reduction in entropy resulting from splitting along an attribute

  Gain(S, A) = Entropy(S) - Σ_{v ∈ Values(A)} (|S_v| / |S|) Entropy(S_v)

  where Entropy(S) = -P log2(P) - N log2(N)
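The same formula as a Python sketch over (feature-dict, label) pairs; the data representation and helper names are assumptions for illustration:

import math
from collections import Counter, defaultdict

def entropy(labels):
    # Entropy of a multiset of class labels (0 * log2(0) treated as 0).
    total = len(labels)
    return -sum((c / total) * math.log2(c / total) for c in Counter(labels).values())

def information_gain(examples, attribute):
    # Gain(S, A): entropy of S minus the size-weighted entropy of each
    # subset S_v obtained by splitting on attribute A.
    labels = [y for _, y in examples]
    by_value = defaultdict(list)
    for x, y in examples:
        by_value[x[attribute]].append(y)
    remainder = sum(len(sub) / len(examples) * entropy(sub) for sub in by_value.values())
    return entropy(labels) - remainder

toy = [({"Wind": "weak"}, "yes"), ({"Wind": "weak"}, "yes"),
       ({"Wind": "strong"}, "no"), ({"Wind": "strong"}, "no")]
print(information_gain(toy, "Wind"))  # 1.0: a perfect split removes all disorder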

© Daniel S. Weld 19

Gain of Splitting on Wind

Day  Wind  Tennis?
d1   weak  n
d2   s     n
d3   weak  yes
d4   weak  yes
d5   weak  yes
d6   s     yes
d7   s     yes
d8   weak  n
d9   weak  yes
d10  weak  yes
d11  s     yes
d12  s     yes
d13  weak  yes
d14  s     n

Values(Wind) = {weak, strong};  S = [10+, 4-];  S_weak = [6+, 2-];  S_s = [3+, 3-]

Gain(S, Wind) = Entropy(S) - Σ_{v ∈ {weak, s}} (|S_v| / |S|) Entropy(S_v)
              = Entropy(S) - (8/14) Entropy(S_weak) - (6/14) Entropy(S_s)
              = 0.863 - (8/14)(0.811) - (6/14)(1.00) = -0.029

© Daniel S. Weld 22

Evaluating Attributes

[Figure: the four candidate one-level splits with their information gains.]

Gain(S, Humid)   = 0.375
Gain(S, Outlook) = 0.401
Gain(S, Temp)    = 0.029
Gain(S, Wind)    = -0.029

(The values are actually different than this, but ignore that detail.)

© Daniel S. Weld 23

Resulting Tree …

Good day for tennis?

Outlook
  Sunny    → No  [2+, 3-]
  Overcast → Yes [4+]
  Rain     → No  [2+, 3-]

© Daniel S. Weld 24

Recurse!

Day  Temp  Humid  Wind  Tennis?
d1   h     h      weak  n
d2   h     h      s     n
d8   m     h      weak  n
d9   c     n      weak  yes
d11  m     n      s     yes

[Recursing on the Sunny branch of Outlook.]

© Daniel S. Weld 25

One Step Later…

Outlook
  Sunny    → Humidity
               High   → No  [3-]
               Normal → Yes [2+]
  Overcast → Yes [4+]
  Rain     → No  [2+, 3-]

© Daniel S. Weld 26

Decision Tree Algorithm

BuildTree(TrainingData)
  Split(TrainingData)

Split(D)
  If (all points in D are of the same class)
    Then Return
  For each attribute A
    Evaluate splits on attribute A
  Use the best split to partition D into D1, D2
  Split(D1)
  Split(D2)
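A minimal Python sketch of the same recursive procedure (a sketch, not the course code): it splits on every value of the highest-gain attribute rather than into exactly two parts, which is what the tennis example does, and it reuses the (feature-dict, label) example format assumed in the earlier sketches.

import math
from collections import Counter, defaultdict

def entropy(labels):
    total = len(labels)
    return -sum((c / total) * math.log2(c / total) for c in Counter(labels).values())

def information_gain(examples, attribute):
    labels = [y for _, y in examples]
    by_value = defaultdict(list)
    for x, y in examples:
        by_value[x[attribute]].append(y)
    return entropy(labels) - sum(len(sub) / len(examples) * entropy(sub)
                                 for sub in by_value.values())

def build_tree(examples, attributes):
    # Split(D): stop on a pure node (or when no attributes remain),
    # otherwise split on the best attribute and recurse on each partition.
    labels = [y for _, y in examples]
    if len(set(labels)) == 1:
        return labels[0]                               # pure: return the class label
    if not attributes:
        return Counter(labels).most_common(1)[0][0]    # fall back to the majority class
    best = max(attributes, key=lambda a: information_gain(examples, a))
    partitions = defaultdict(list)
    for x, y in examples:
        partitions[x[best]].append((x, y))
    remaining = [a for a in attributes if a != best]
    return {"attr": best,
            "branches": {v: build_tree(sub, remaining) for v, sub in partitions.items()}}

Run on the 14-day table with attributes Outlook, Temp, Humid, and Wind, this should pick Outlook at the root, matching the gains evaluated earlier.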

© Daniel S. Weld 27

Movie Recommendation

• Features?

Rambo, Matrix, Rambo 2

© Daniel S. Weld 28

Issues

• Content vs. Social

• Non-Boolean Attributes

• Missing Data

• Scaling up

© Daniel S. Weld 29

Missing Data 1

Day  Temp  Humid  Wind  Tennis?
d1   h     h      weak  n
d2   h     h      s     n
d8   m     h      weak  n
d9   c     ?      weak  yes
d11  m     n      s     yes

• Don’t use this instance for learning?
• Assign the attribute …
  the most common value at the node, or
  the most common value, … given the classification

© Daniel S. Weld 30

Fractional Values

• 75% h and 25% n
• Use these fractions in the gain calculations
• Further subdivide if other attributes are missing
• Use the same approach to classify a test example with a missing attribute
  The classification is the most probable one,
  summing over the leaves where the example got divided

Day  Temp  Humid  Wind  Tennis?
d1   h     h      weak  n
d2   h     h      s     n
d8   m     h      weak  n
d9   c     ?      weak  yes
d11  m     n      s     yes

Humid = h branch: [0.75+, 3-]     Humid = n branch: [1.25+, 0-]
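A small sketch of how the fractional counts enter the entropy calculation; the helper is illustrative, and the 0.75/0.25 split is the one quoted above for d9’s missing Humid value:

import math

def entropy_weighted(w_pos, w_neg):
    # Entropy from (possibly fractional) per-class weights, with 0 * log2(0) == 0.
    total = w_pos + w_neg
    return -sum((w / total) * math.log2(w / total) for w in (w_pos, w_neg) if w > 0)

# d9 (a positive example) is sent down both Humid branches with weight
# 0.75 (h) and 0.25 (n), giving the fractional subsets shown above.
high   = (0.75, 3.0)   # Humid = h: [0.75+, 3-]
normal = (1.25, 0.0)   # Humid = n: [1.25+, 0-]
total  = sum(high) + sum(normal)

remainder = (sum(high) / total) * entropy_weighted(*high) \
          + (sum(normal) / total) * entropy_weighted(*normal)
print(round(remainder, 3))  # size-weighted entropy after the Humid split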

© Daniel S. Weld 34

Machine Learning Outline

• Machine learning
• Supervised learning
• Overfitting
  What is the problem?
  Reduced-error pruning

• Ensembles of classifiers

© Daniel S. Weld 35

Overfitting

[Figure: accuracy (0.6–0.9) vs. number of nodes in the decision tree, plotted on the training data and on the test data.]

© Daniel S. Weld 36

Overfitting 2

[Figure from W. W. Cohen.]

© Daniel S. Weld 37

Overfitting…

• DT is overfit when there exists another tree DT’ such that
  DT has smaller error than DT’ on the training examples, but
  DT has bigger error than DT’ on the test examples
• Causes of overfitting
  Noisy data, or
  Training set is too small
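The effect is easy to reproduce; a minimal sketch (assuming scikit-learn and a synthetic noisy dataset, neither of which is part of the course materials): as the tree is allowed more leaves, training accuracy keeps climbing while test accuracy stalls or drops.

from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# A noisy synthetic problem (flip_y injects label noise, one cause of overfitting).
X, y = make_classification(n_samples=500, n_features=20, flip_y=0.2, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.5, random_state=0)

for leaves in (2, 4, 8, 16, 32, 64, 128):
    tree = DecisionTreeClassifier(max_leaf_nodes=leaves, random_state=0)
    tree.fit(X_train, y_train)
    print(leaves,
          round(tree.score(X_train, y_train), 2),  # accuracy on training data
          round(tree.score(X_test, y_test), 2))    # accuracy on test data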

© Daniel S. Weld 38

© Daniel S. Weld 39

© Daniel S. Weld 40

Effect of Reduced-Error Pruning

[Figure annotation: “Cut tree back to here.”]
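The slides show the effect rather than the procedure; below is a minimal sketch of one common formulation of reduced-error pruning, working bottom-up over the nested-dict tree format assumed in the earlier sketches and a held-out validation set of (feature-dict, label) pairs:

from collections import Counter

def classify(tree, example):
    # Assumes every attribute value seen in validation also has a branch in the tree.
    while isinstance(tree, dict):
        tree = tree["branches"][example[tree["attr"]]]
    return tree

def accuracy(tree, rows):
    return sum(classify(tree, x) == y for x, y in rows) / len(rows)

def majority_label(rows):
    return Counter(y for _, y in rows).most_common(1)[0][0]

def reduced_error_prune(tree, validation_rows):
    # Prune the subtrees first, routing validation examples down each branch,
    # then replace this node with a majority-class leaf if doing so does not
    # hurt accuracy on the validation examples that reach it.
    if not isinstance(tree, dict) or not validation_rows:
        return tree
    for value, subtree in tree["branches"].items():
        reaching = [(x, y) for x, y in validation_rows if x[tree["attr"]] == value]
        tree["branches"][value] = reduced_error_prune(subtree, reaching)
    leaf = majority_label(validation_rows)
    if accuracy(leaf, validation_rows) >= accuracy(tree, validation_rows):
        return leaf
    return tree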

© Daniel S. Weld 41

Machine Learning Outline

• Machine learning
• Supervised learning
• Overfitting
• Ensembles of classifiers
  Bagging
  Cross-validated committees
  Boosting