Decision Trees: Representation - svivek · Summary: Decision trees • Decision trees can represent...

Machine Learning

Decision Trees: Representation

Some slides from Tom Mitchell, Dan Roth and others

Key issues in machine learning

• Modeling: How to formulate your problem as a machine learning problem? How to represent data? Which algorithms to use? What learning protocols?

• Representation: Good hypothesis spaces and good features

• Algorithms
– What is a good learning algorithm?
– What is success?
– Generalization vs overfitting
– The computational question: How long will learning take?

Coming up… (the rest of the semester)

Different hypothesis spaces and learning algorithms
– Decision trees and the ID3 algorithm
– Linear classifiers
  • Perceptron
  • SVM
  • Logistic regression
– Combining multiple classifiers
  • Boosting, bagging
– Non-linear classifiers
– Nearest neighbors

Important issues to consider

1. What do these hypotheses represent?

2. Implicit assumptions and tradeoffs

3. Generalization?

4. How do we learn?

This lecture: Learning Decision Trees

1. Representation: What are decision trees?

2. Algorithm: Learning decision trees
– The ID3 algorithm: A greedy heuristic

3. Some extensions

Representing data

Data can be represented as a big table, with columns denoting different attributes

Name               Label
Claire Cardie      -
Peter Bartlett     +
Eric Baum          +
Haym Hirsh         +
Shai Ben-David     +
Michael I. Jordan  -

Each name can be described by several attributes, one per column:

Name               Name has      Second character  Length of first  Same first letter  Label
                   punctuation?  of first name     name > 5?        in two names?
Claire Cardie      No            l                 Yes              Yes                -
Peter Bartlett     No            e                 No               No                 +
Eric Baum          No            r                 No               No                 +
Haym Hirsh         No            a                 No               Yes                +
Shai Ben-David     Yes           h                 No               No                 +
Michael I. Jordan  Yes           i                 Yes              No                 -
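The attribute columns in the table above can be computed directly from the names. The following is a minimal sketch, not the lecture's own code: the function name `extract_features`, the dictionary keys, and the reading of "two names" as the first and last whitespace-separated tokens are all assumptions made for illustration.

```python
import string

# The six names and labels from the table above.
DATA = [
    ("Claire Cardie", "-"),
    ("Peter Bartlett", "+"),
    ("Eric Baum", "+"),
    ("Haym Hirsh", "+"),
    ("Shai Ben-David", "+"),
    ("Michael I. Jordan", "-"),
]


def extract_features(name):
    """Map a name to the four attributes used in the table.

    Assumption: "two names" means the first and the last
    whitespace-separated token of the full name.
    """
    parts = name.split()
    first, last = parts[0], parts[-1]
    return {
        "has_punctuation": any(ch in string.punctuation for ch in name),
        "second_char": first[1].lower(),
        "first_name_longer_than_5": len(first) > 5,
        "same_first_letter": first[0].lower() == last[0].lower(),
    }


# One (name, attributes, label) triple per row of the table.
rows = [(name, extract_features(name), label) for name, label in DATA]
for name, feats, label in rows:
    print(name, feats, label)
```

Under these assumptions the computed values agree with the six rows of the table.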

With these four attributes, how many unique rows are possible? 2 × 26 × 2 × 2 = 208

If there are 100 attributes, all binary, how many unique rows are possible? 2 × 2 × 2 × ⋯ × 2 (100 times) = 2^100

If we wanted to store all possible rows, this number is too large.

We need to figure out how to represent data in a better, more efficient way
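The counting argument is just a product of the number of values each attribute can take. A small sketch of the arithmetic, assuming 26 possible letters for the second character (the attribute names are illustrative):

```python
from math import prod

# Number of values each attribute can take (assuming 26 letters
# for the second character of the first name).
domain_sizes = {
    "has_punctuation": 2,
    "second_char": 26,
    "first_name_longer_than_5": 2,
    "same_first_letter": 2,
}

unique_rows = prod(domain_sizes.values())
print(unique_rows)  # 208

# 100 binary attributes: 2 multiplied by itself 100 times.
binary_rows = 2 ** 100
print(binary_rows)  # 1267650600228229401496703205376
```

The second number has 31 digits, which is why storing every possible row is not an option.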

What are decision trees?

A hierarchical data structure that represents data using a divide-and-conquer strategy

Can be used as a hypothesis class for non-parametric classification or regression

General idea: Given a collection of examples, learn a decision tree that represents it

• Decision trees are a family of classifiers for instances that are represented by collections of attributes (i.e. features)

• Nodes are tests for feature values

• There is one branch for every value that the feature can take

• Leaves of the tree specify the class labels
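The three ingredients listed above (nodes that test a feature, one branch per feature value, leaves that hold a class label) map naturally onto a small recursive data structure. This is one possible sketch; the class names `Node` and `Leaf` and the tiny example tree are hypothetical, not part of the lecture.

```python
from dataclasses import dataclass, field
from typing import Dict, Union


@dataclass
class Leaf:
    label: str  # class label predicted at this leaf


@dataclass
class Node:
    feature: str  # name of the feature tested at this node
    # One child per value that the feature can take.
    children: Dict[str, Union["Node", Leaf]] = field(default_factory=dict)


# A tiny hypothetical tree: test "Color", then predict a label.
tree = Node("Color", {"blue": Leaf("A"), "red": Leaf("B")})


def count_leaves(t):
    """Count the leaves below (and including) t."""
    if isinstance(t, Leaf):
        return 1
    return sum(count_leaves(child) for child in t.children.values())


print(count_leaves(tree))  # 2
```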

Let's build a decision tree for classifying shapes

[Figure: example shapes grouped by label: Label = A, Label = B, Label = C]

Before building a decision tree: What is the label for a red triangle? And why?

What are some attributes of the examples? Color, Shape

[Figure: the decision tree, built up step by step. The root node tests Color?, with branches Blue, Red and Green. One color branch ends directly in a leaf labeled B. The other two lead to Shape? nodes: one with branches square, triangle and circle (leaves A, B and C), and one with branches circle and square (leaves A and B).]

1. How do we learn a decision tree? Coming up soon…

2. How to use a decision tree for prediction?
• What is the label for a red triangle?
• Just follow a path from the root to a leaf
• What about a green triangle?
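Prediction by following a path from the root to a leaf can be sketched as a short loop. In the tree below, only the blue branch follows rules that the lecture states explicitly (blue triangle is B, blue square is A, blue circle is C); the red and green branches, the tuple/dict encoding, and the choice to return `None` when no branch matches are assumptions made for illustration, not a transcription of the figure.

```python
# Internal nodes are (feature, {value: subtree}) pairs; leaves are label strings.
TREE = ("color", {
    "blue": ("shape", {"triangle": "B", "square": "A", "circle": "C"}),
    # The two branches below are assumed for illustration only.
    "red": ("shape", {"circle": "A", "square": "B"}),
    "green": "B",
})


def predict(tree, example):
    """Follow one root-to-leaf path; return None if a branch is missing."""
    while not isinstance(tree, str):
        feature, branches = tree
        value = example[feature]
        if value not in branches:
            return None  # the tree has no branch for this feature value
        tree = branches[value]
    return tree


print(predict(TREE, {"color": "blue", "shape": "square"}))     # A
print(predict(TREE, {"color": "green", "shape": "triangle"}))  # B
print(predict(TREE, {"color": "red", "shape": "triangle"}))    # None
```

In this assumed tree the green branch never tests the shape, so every green example gets the same label, while a red triangle reaches a node with no matching branch. How a real tree should answer in that case is the kind of question the slide is raising.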

Expressivity of Decision trees

What Boolean functions can decision trees represent?
– Any Boolean function

Every path from the root to a leaf is a rule

The full tree is equivalent to the conjunction of all the rules

(Color = blue AND Shape = triangle ⇒ Label = B) AND
(Color = blue AND Shape = square ⇒ Label = A) AND
(Color = blue AND Shape = circle ⇒ Label = C) AND …

Any Boolean function can be represented as a decision tree.
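Reading a tree as a conjunction of rules amounts to enumerating its root-to-leaf paths. The sketch below does that for the blue subtree named in the rules above; the function name `extract_rules` and the tuple/dict encoding of the tree are assumptions made for illustration.

```python
def extract_rules(tree, conditions=()):
    """Yield one (conditions, label) rule per root-to-leaf path."""
    if isinstance(tree, str):  # leaf: emit the rule accumulated so far
        yield list(conditions), tree
        return
    feature, branches = tree
    for value, subtree in branches.items():
        yield from extract_rules(subtree, conditions + ((feature, value),))


# Only the blue branch is listed, matching the three rules spelled out above.
TREE = ("Color", {
    "blue": ("Shape", {"triangle": "B", "square": "A", "circle": "C"}),
})

for conds, label in extract_rules(TREE):
    lhs = " AND ".join(f"{feature}={value}" for feature, value in conds)
    print(f"({lhs} => Label={label})")
```

Each printed line is one rule; joining the lines with AND gives the formula for the whole tree.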

Decision Trees

• Outputs are discrete categories

• But real-valued outputs are also possible (regression trees)

• Well-studied methods for handling noisy data (noise in the label or in the features) and for handling missing attributes
– Pruning trees helps with noise
– More on this later…

Numeric attributes and decision boundaries

• We have seen instances represented as attribute-value pairs (color=blue, second letter=e, etc.)
– Values have been categorical

• How do we deal with numeric feature values? (e.g. length=?)
– Discretize them or use thresholds on the numeric values
– This example divides the feature space into axis-parallel rectangles

[Figure: points labeled + and - in the X-Y plane, with thresholds marked at 1 and 3 on the X axis and at 5 and 7 on the Y axis, splitting the plane into axis-parallel rectangles.]

[Figure: the corresponding decision tree, with internal nodes testing X < 3, Y < 5, Y > 7 and X < 1, each with yes/no branches, and leaves labeled + or -.]

Decision boundaries can be non-linear
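A tree over numeric features is a nest of threshold tests, and each leaf corresponds to an axis-parallel rectangle. The sketch below reuses three of the thresholds that appear in the example (X < 3, Y < 5, Y > 7), but the way they are nested and the label at each leaf are assumptions made for illustration, not a reconstruction of the figure.

```python
def classify(x, y):
    """Hypothetical threshold tree over two numeric features."""
    if x < 3:
        if y < 5:
            return "-"  # rectangle: x < 3, y < 5
        return "+"      # rectangle: x < 3, y >= 5
    if y > 7:
        return "-"      # rectangle: x >= 3, y > 7
    return "+"          # rectangle: x >= 3, y <= 7


# Print the predicted label on a coarse grid, top row = largest y.
for y in range(9, -1, -1):
    print("".join(classify(x, y) for x in range(6)))
```

The printed grid shows four rectangular blocks. The overall boundary between + and - is a staircase of axis-parallel segments rather than a single straight line, which is the sense in which the decision boundary is non-linear.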

Summary: Decision trees

• Decision trees can represent any Boolean function
• A way to represent a lot of data
• A natural representation (think 20 questions)
• Predicting with a decision tree is easy

• Clearly, given a dataset, there are many decision trees that can represent it. [Exercise: Why?]

• Learning a good representation from data is the next question

Exercises

1. Write down the decision tree for the shapes data if the root node was Shape instead of Color.

2. Will the two trees make the same predictions for unseen shape/color combinations?

3. Show that multiple structurally different decision trees can represent the same Boolean function of two or more variables. (Think about what it means for two trees to be structurally different.)

[Figure: the example shapes, grouped as Label = A, Label = B, Label = C]
