Decision Trees in the Big Picture
• Classification (vs. Rule Pattern Discovery)
• Supervised Learning (vs. Unsupervised)
• Inductive
• Generation (vs. Discrimination)



Page 1:

Decision Trees in the Big Picture

• Classification (vs. Rule Pattern Discovery)
• Supervised Learning (vs. Unsupervised)
• Inductive
• Generation (vs. Discrimination)

Page 2:

Example

age          income  veteran  college_educated  support_hillary
youth        low     no       no                no
youth        low     yes      no                no
middle_aged  low     no       no                yes
senior       low     no       no                yes
senior       medium  no       yes               no
senior       medium  yes      no                yes
middle_aged  medium  no       yes               no
youth        low     no       yes               no
youth        low     no       yes               no
senior       high    no       yes               yes
youth        low     no       no                no
middle_aged  high    no       yes               no
middle_aged  medium  yes      yes               yes
senior       high    no       yes               no

Page 3:

Example (same table as Page 2)

The support_hillary column holds the class labels.

Page 4:

Example: classify an unseen tuple

age          income  veteran  college_educated  support_hillary
middle_aged  medium  no       no                ?????

[Tree diagram: root node age with branches youth, middle_aged, senior; inner nodes college_educated and income (branches low, medium, high); leaves yes/no]

Inner nodes are ATTRIBUTES

Branches are attribute VALUES

Leaves are class-label VALUES

Page 5:

Example: classify an unseen tuple

age          income  veteran  college_educated  support_hillary
middle_aged  medium  no       no                yes (predicted)

[Tree diagram: same tree as Page 4; following the middle_aged branch yields the prediction]

Inner nodes are ATTRIBUTES

Branches are attribute VALUES

Leaves are class-label VALUES

ANSWER

Page 6:

Example

[Tree diagram: same tree as Page 4]

Induced Rules:

The youth do not support Hillary.

All who are middle-aged and low-income support Hillary.

Seniors support Hillary.

Etc. A rule is generated for each leaf.

Page 7:

Example — Induced Rules:

The youth do not support Hillary.

All who are middle-aged and low-income support Hillary.

Seniors support Hillary.

Nested IF-THEN:

IF age == youth THEN support_hillary = no
ELSE IF age == middle_aged AND income == low THEN support_hillary = yes
ELSE IF age == senior THEN support_hillary = yes
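The nested IF-THEN above can be sketched directly in Python. This is a minimal illustration; the function name and the catch-all `None` branch are my additions — the slide lists only these three rules.

```python
def support_hillary(age, income):
    """Nested IF-THEN rules read off the decision tree."""
    if age == "youth":
        return "no"
    elif age == "middle_aged" and income == "low":
        return "yes"
    elif age == "senior":
        return "yes"
    return None  # remaining branches are elided on the slide ("Etc.")

print(support_hillary("youth", "low"))        # no
print(support_hillary("middle_aged", "low"))  # yes
```

Each rule tests the attribute values along one root-to-leaf path and returns that leaf's class label.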

Page 8:

How do you construct one?

1. Select an attribute to place at the root node and make one branch for each possible value.

[Diagram: age at the root of the 14-tuple training set; the youth branch gets 5 tuples, middle_aged 4 tuples, senior 5 tuples]

Page 9:

How do you construct one?

2. For each branch, recursively process the remaining training examples by choosing an attribute to split them on. The chosen attribute cannot be one used in the ancestor nodes. If at any time all the training examples have the same class, stop processing that part of the tree.
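The two steps above can be sketched as a recursive Python function. This is an assumption-laden sketch: the attribute is picked naively (first in the list) rather than by a purity heuristic, which the deck introduces later, and all names are my own.

```python
from collections import Counter

def build_tree(rows, attrs, target):
    """Grow a decision tree per steps 1-2: split, recurse, stop when pure."""
    labels = [row[target] for row in rows]
    # Stop when all examples share one class: this branch becomes a leaf.
    if len(set(labels)) == 1:
        return labels[0]
    # No attributes (or rows) left: fall back to a majority vote.
    if not attrs or not rows:
        return Counter(labels).most_common(1)[0][0]
    attr = attrs[0]  # placeholder choice; a heuristic would go here
    tree = {attr: {}}
    for value in {row[attr] for row in rows}:
        subset = [row for row in rows if row[attr] == value]
        # Recurse; an attribute used here is excluded below this node.
        tree[attr][value] = build_tree(subset, attrs[1:], target)
    return tree

rows = [
    {"age": "youth", "support_hillary": "no"},
    {"age": "youth", "support_hillary": "no"},
    {"age": "senior", "support_hillary": "yes"},
]
tree = build_tree(rows, ["age"], "support_hillary")
assert tree == {"age": {"youth": "no", "senior": "yes"}}
```

Inner dict keys are attribute values (branches); string leaves are class labels, mirroring the slides' node/branch/leaf picture.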

Page 10:

How do you construct one? Subset where age = youth:

age    income  veteran  college_educated  support_hillary
youth  low     no       no                no
youth  low     yes      no                no
youth  low     no       yes               no
youth  low     no       yes               no
youth  low     no       no                no

[Diagram: under the age root, the youth branch becomes the leaf "no" — every youth tuple has class no]

Page 11:

How do you construct one? Subset where age = middle_aged:

age          income  veteran  college_educated  supports_hillary
middle_aged  low     no       no                yes
middle_aged  medium  no       yes               no
middle_aged  high    no       yes               no
middle_aged  medium  yes      yes               yes

[Diagram: youth branch is the leaf "no"; the middle_aged branch is split next on veteran (branches yes, no)]

Page 12:

(Same subset as Page 11.)

[Diagram: middle_aged branch split on veteran; the veteran = yes branch becomes the leaf "yes"]

Page 13:

(Same subset as Page 11.)

[Diagram: under middle_aged, veteran = yes is the leaf "yes"; the veteran = no branch is split on college_educated (branches yes, no)]

Page 14:

Subset where age = middle_aged and veteran = no:

age          income  veteran  college_educated  supports_hillary
middle_aged  low     no       no                yes
middle_aged  medium  no       yes               no
middle_aged  high    no       yes               no

[Diagram: under middle_aged → veteran = no → college_educated, the college_educated = yes branch becomes the leaf "no"]

Page 15:

(Same subset as Page 14.)

[Diagram: the college_educated = no branch becomes the leaf "yes"; the middle_aged subtree is complete]

Page 16:

Subset where age = senior:

age     income  veteran  college_educated  supports_hillary
senior  low     no       no                yes
senior  medium  no       yes               no
senior  medium  yes      no                yes
senior  high    no       yes               yes
senior  high    no       yes               no

[Diagram: tree so far, with the senior branch still to be processed]

Page 17:

(Same subset as Page 16.)

[Diagram: the senior branch is split on college_educated (branches yes, no)]

Page 18:

Subset where age = senior and college_educated = yes:

age     income  veteran  college_educated  supports_hillary
senior  medium  no       yes               no
senior  high    no       yes               yes
senior  high    no       yes               no

[Diagram: under senior → college_educated = yes, the subset is split on income (branches low, medium, high)]

Page 19:

(Same subset as Page 18.)

[Diagram: the income node's low branch has no training tuples]

No low-income college-educated seniors…

Page 20:

(Same subset as Page 18.)

[Diagram: the empty income = low branch becomes the leaf "no"]

When a branch receives no training tuples, label its leaf with the majority class of the parent's tuples — a “majority vote”. Here the majority of the three tuples is "no".

Page 21:

Subset where age = senior, college_educated = yes, income = medium:

age     income  veteran  college_educated  supports_hillary
senior  medium  no       yes               no

[Diagram: the income = medium branch holds a single tuple of class "no"]

Page 22:

(Same subset as Page 21.)

[Diagram: the income = medium branch becomes the leaf "no"]

Page 23:

Subset where age = senior, college_educated = yes, income = high:

age     income  veteran  college_educated  supports_hillary
senior  high    no       yes               yes
senior  high    no       yes               no

[Diagram: tree so far; the income = high branch still to be processed]

Page 24:

(Same subset as Page 23.)

[Diagram: the income = high branch is split on veteran — the only attribute left]

Page 25:

(Same subset as Page 23.)

[Diagram: both veteran branches under income = high are marked "???"]

A “majority vote” split fails here: both remaining tuples have veteran = no — there are no veterans — and they split evenly between "yes" and "no", so neither leaf has a majority.

Page 26:

Subset where age = senior and college_educated = no:

age     income  veteran  college_educated  supports_hillary
senior  low     no       no                yes
senior  medium  yes      no                yes

[Diagram: tree so far; the senior → college_educated = no branch still to be processed]

Page 27:

(Same subset as Page 26.)

[Diagram: both tuples have class "yes", so the college_educated = no branch becomes the leaf "yes"]

Page 28:

[Diagram: the completed tree — age at the root; youth → no; middle_aged → veteran (yes → yes; no → college_educated: yes → no, no → yes); senior → college_educated (no → yes; yes → income: low → no, medium → no, high → veteran with both branches "???")]

Page 29:

Cost to grow?

n = number of attributes
D = training set of tuples

O( n · |D| · log|D| )

Page 30:

Cost to grow?

n = number of attributes
D = training set of tuples

O( n · |D| · log|D| )

n · |D| — amount of work at each tree level
log|D| — maximum height of the tree

Page 31:

How do we minimize the cost?

• Constructing an optimal decision tree is NP-complete (shown by Hyafil and Rivest)

Page 32:

How do we minimize the cost?

• Constructing an optimal decision tree is NP-complete (shown by Hyafil and Rivest)

• Need a heuristic to pick the “best” attribute to split on.

Page 33:

[Diagram: the completed tree from Page 28, repeated]

Page 34:

How do we minimize the cost?

• Constructing an optimal decision tree is NP-complete (shown by Hyafil and Rivest)
• Most common approach is “greedy”
• Need a heuristic to pick the “best” attribute to split on.
• The “best” attribute results in the “purest” split. Pure = all tuples belong to the same class.

Page 35:

…A good split increases the purity of all child nodes.

Page 36:

Three Heuristics

1. Information gain

2. Gain Ratio

3. Gini Index

Page 37:

Information Gain

• Ross Quinlan’s ID3 (Iterative Dichotomiser 3) uses info gain as its heuristic.

• Heuristic based on Claude Shannon’s information theory.

Page 38:

[Figure: two class distributions — HIGH ENTROPY (classes evenly mixed) vs. LOW ENTROPY (one class dominates)]

Page 39:

Calculate Entropy for D

Entropy(D) = −Σi pi · log2(pi)

D = training set                                |D| = 14
m = number of classes                           m = 2
i = 1,…,m
Ci = distinct class                             C1 = yes, C2 = no
Ci,D = set of tuples in D of class Ci
pi = |Ci,D| / |D| = probability that a random   p1 = 5/14, p2 = 9/14
     tuple in D belongs to class Ci

Page 40:

Entropy(D) = −[ 5/14 · log2(5/14) + 9/14 · log2(9/14) ]
           = −[ .3571 · (−1.4854) + .6428 · (−.6374) ]
           = −[ −.5304 − .4097 ]
           = .9400 bits

Extremes:
Entropy = −[ 7/14 · log2(7/14) + 7/14 · log2(7/14) ] = 1 bit
Entropy = −[ 1/14 · log2(1/14) + 13/14 · log2(13/14) ] = .3712 bits
Entropy = −[ 0/14 · log2(0/14) + 14/14 · log2(14/14) ] = 0 bits   (0 · log2 0 is taken as 0)
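The calculations above can be reproduced with a small Python helper (a sketch; the function name is my own, and counts are the per-class tuple counts from the table):

```python
from math import log2

def entropy(class_counts):
    """Shannon entropy in bits from per-class tuple counts.
    0 * log2(0) is treated as 0, matching the extremes above."""
    total = sum(class_counts)
    return sum(-(c / total) * log2(c / total)
               for c in class_counts if c > 0)

print(round(entropy([5, 9]), 4))   # 0.9403, the slide rounds to .9400
print(entropy([7, 7]))             # 1.0 bit: maximally mixed
print(round(entropy([1, 13]), 4))  # 0.3712
print(entropy([0, 14]))            # 0.0 bits: pure
```

Skipping zero counts in the sum implements the 0 · log2 0 = 0 convention.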

Page 41:

Entropy for D split by A

EntropyA(D) = Σj ( |Dj| / |D| ) · Entropy(Dj)

A = attribute to split D on           e.g. age
v = number of distinct values of A    e.g. youth, middle_aged, senior
j = 1,…,v
Dj = subset of D where A = j          e.g. all tuples where age = youth

Page 42:

Entropyage(D)     = 5/14 · −[0/5·log2(0/5) + 5/5·log2(5/5)]
                  + 4/14 · −[2/4·log2(2/4) + 2/4·log2(2/4)]
                  + 5/14 · −[3/5·log2(3/5) + 2/5·log2(2/5)]
                  = .6324 bits

Entropyincome(D)  = 7/14 · −[2/7·log2(2/7) + 5/7·log2(5/7)]
                  + 4/14 · −[2/4·log2(2/4) + 2/4·log2(2/4)]
                  + 3/14 · −[1/3·log2(1/3) + 2/3·log2(2/3)]
                  = .9140 bits

Entropyveteran(D) = 3/14 · −[2/3·log2(2/3) + 1/3·log2(1/3)]
                  + 11/14 · −[3/11·log2(3/11) + 8/11·log2(8/11)]
                  = .8609 bits

Entropycollege_educated(D) = 8/14 · −[6/8·log2(6/8) + 2/8·log2(2/8)]
                  + 6/14 · −[3/6·log2(3/6) + 3/6·log2(3/6)]
                  = .8921 bits

Page 43:

Information Gain

Gain(A) = Entropy(D) − EntropyA(D)

Entropy(D): entropy of the full set of tuples D.
EntropyA(D): expected entropy after D is split on attribute A.

Choose the A with the highest Gain — the split that most decreases entropy.

Page 44:

Gain(A) = Entropy(D) − EntropyA(D)

Gain(age) = Entropy(D) − Entropyage(D)
          = .9400 − .6324 = .3076 bits

Gain(income) = .0259 bits
Gain(veteran) = .0790 bits
Gain(college_educated) = .0479 bits
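The winning gain can be reproduced in Python. A sketch with hypothetical helper names; each branch is the [yes, no] class counts of one subset Dj, read off the tables above:

```python
from math import log2

def entropy(counts):
    total = sum(counts)
    return sum(-(c / total) * log2(c / total) for c in counts if c > 0)

def info_gain(parent_counts, branches):
    """Gain(A) = Entropy(D) - Entropy_A(D)."""
    total = sum(parent_counts)
    split_entropy = sum(sum(b) / total * entropy(b) for b in branches)
    return entropy(parent_counts) - split_entropy

# Branch class counts [yes, no] for age: youth, middle_aged, senior.
gain_age = info_gain([5, 9], [[0, 5], [2, 2], [3, 2]])
print(round(gain_age, 4))  # 0.3078; the slide's pre-rounded inputs give .3076
```

The small difference from the slide comes from rounding .9400 and .6324 before subtracting.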

Page 45:

Entropy with more than 2 values

Entropy = −[7/13·log2(7/13) + 2/13·log2(2/13) + 2/13·log2(2/13) + 2/13·log2(2/13)] = 1.7272 bits

Entropy = −[5/13·log2(5/13) + 1/13·log2(1/13) + 6/13·log2(6/13) + 1/13·log2(1/13)] = 1.6143 bits

Page 46:

ss           age          income  veteran  college_educated  support_hillary
215-98-9343  youth        low     no       no                no
238-34-3493  youth        low     yes      no                no
234-28-2434  middle_aged  low     no       no                yes
243-24-2343  senior       low     no       no                yes
634-35-2345  senior       medium  no       yes               no
553-32-2323  senior       medium  yes      no                yes
554-23-4324  middle_aged  medium  no       yes               no
523-43-2343  youth        low     no       yes               no
553-23-1223  youth        low     no       yes               no
344-23-2321  senior       high    no       yes               yes
212-23-1232  youth        low     no       no                no
112-12-4521  middle_aged  high    no       yes               no
423-13-3425  middle_aged  medium  yes      yes               yes
423-53-4817  senior       high    no       yes               no

Added social security number attribute

Page 47:

[Diagram: a one-level tree splitting on ss — one branch per value, 215-98-9343 … 423-53-4817, each ending in a single-tuple leaf (yes/no)]

Will Information Gain split on ss?

Page 48:

[Diagram: the same one-level ss tree as Page 47]

Will Information Gain split on ss?

Yes, because Entropyss(D) = 0:
Entropyss(D) = 14 · 1/14 · −[1/1·log2(1/1) + 0/1·log2(0/1)] = 0

Page 49:

Gain ratio

• C4.5, a successor of ID3, uses this heuristic.

• Attempts to overcome Information Gain’s bias in favor of attributes with a large number of values.

Page 50:

Gain ratio

SplitInfoA(D) = −Σj ( |Dj| / |D| ) · log2( |Dj| / |D| )

GainRatio(A) = Gain(A) / SplitInfoA(D)

Page 51:

Gain ratio

Gain(ss) = .9400
SplitInfoss(D) = log2(14) = 3.8074
GainRatio(ss) = .2469

Gain(age) = .3076
SplitInfoage(D) = 1.5774
GainRatio(age) = .1950
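Recomputing these with a short script (hypothetical helper name; values are recomputed from the definitions, so they may differ slightly from the deck's rounded figures):

```python
from math import log2

def split_info(branch_sizes):
    """SplitInfo_A(D): entropy of the branch-size distribution itself."""
    total = sum(branch_sizes)
    return sum(-(s / total) * log2(s / total)
               for s in branch_sizes if s > 0)

# ss: 14 branches of one tuple each; age: branches of 5, 4, 5 tuples.
print(round(split_info([1] * 14), 4))           # 3.8074 = log2(14)
print(round(split_info([5, 4, 5]), 4))          # 1.5774
print(round(0.9400 / split_info([1] * 14), 4))  # GainRatio(ss)
print(round(0.3076 / split_info([5, 4, 5]), 4)) # GainRatio(age)
```

Even with the large SplitInfo penalty, GainRatio(ss) still exceeds GainRatio(age) in this example, so gain ratio alone does not rule out the ss split here.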

Page 52:

Gini Index

• CART uses this heuristic.

• Binary splits.

• Not biased toward multi-value attributes like Info Gain.

[Diagram: the multiway split age ∈ {youth, middle_aged, senior} vs. the binary split age ∈ {senior} / age ∈ {youth, middle_aged}]

Page 53:

Gini Index

For the attribute age the possible subsets are:
{youth, middle_aged, senior}, {youth, middle_aged}, {youth, senior},
{middle_aged, senior}, {youth}, {middle_aged}, {senior} and {}.

We exclude the full set and the empty set,
so we have to examine 2^v − 2 subsets.

Page 54:

(Same subsets as Page 53.)

CALCULATE GINI INDEX ON EACH SUBSET
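That enumeration can be sketched in Python. Helper names are my own; Gini impurity is 1 − Σ pi², a binary split is scored by the size-weighted Gini of its two sides, and the [yes, no] counts per age value come from the training table:

```python
from itertools import combinations

def gini(counts):
    """Gini impurity 1 - sum(p_i^2) from per-class tuple counts."""
    total = sum(counts)
    return 1.0 - sum((c / total) ** 2 for c in counts)

def gini_split(left_counts, right_counts):
    """Size-weighted Gini of a binary split."""
    n = sum(left_counts) + sum(right_counts)
    return (sum(left_counts) / n * gini(left_counts)
            + sum(right_counts) / n * gini(right_counts))

# Class counts [yes, no] per age value, from the training table.
by_value = {"youth": [0, 5], "middle_aged": [2, 2], "senior": [3, 2]}
values = list(by_value)

# The 2^v - 2 proper, non-empty subsets (each subset and its complement
# describe the same binary split, so v = 3 yields 3 distinct splits).
for size in (1, 2):
    for subset in combinations(values, size):
        left = [sum(x) for x in zip(*(by_value[v] for v in subset))]
        right = [sum(x) for x in zip(*(by_value[v] for v in values
                                       if v not in subset))]
        print(set(subset), round(gini_split(left, right), 4))
```

The split with the lowest weighted Gini would be chosen as the binary test for this attribute.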

Page 55:

Gini Index

Gini(D) = 1 − Σi pi²

For a binary split of D into D1 and D2:
GiniA(D) = ( |D1| / |D| ) · Gini(D1) + ( |D2| / |D| ) · Gini(D2)

Choose the split with the lowest GiniA(D).

Page 56:

Miscellaneous thoughts

• Widely applicable to data exploration, classification and scoring tasks
• Generate understandable rules
• Better for predicting discrete outcomes than continuous (“lumpy”)
• Error-prone when the number of training examples for a class is small
• Most business cases try to predict a few broad categories
