Classification: Decision Trees

Robot Image Credit: Viktoriya Sukhanova © 123RF.com

These slides were assembled by Byron Boots, with grateful acknowledgement to Eric Eaton and the many others who made their course materials freely available online. Feel free to reuse or adapt these slides for your own academic purposes, provided that you include proper attribution.



Page 1: Classification: Decision Trees · Basic Algorithm for Top-Down Learning of Decision Trees [ID3, C4.5 by Quinlan] node= root of decision tree Main loop: 1. Aßthe “best” decision


Page 2:

Last Time
• Common decomposition of machine learning, based on differences in inputs and outputs
  – Supervised Learning: Learn a function mapping inputs to outputs using labeled training data (you get instances/examples with both inputs and ground-truth output)
  – Unsupervised Learning: Learn something about just the data, without any labels (harder!), for example clustering instances that are "similar"
  – Reinforcement Learning: Learn how to make decisions given a sparse reward
• ML is an interaction between:
  – the input data
  – input to the function (features/attributes of the data)
  – the function (model) you choose, and
  – the optimization algorithm you use to explore the space of functions
• We are new at this, so let's start by learning if/then rules for classification!

Page 3:

Function Approximation Problem Setting
• Set of possible instances X
• Set of possible labels Y
• Unknown target function f : X → Y
• Set of function hypotheses H = { h | h : X → Y }

Input: Training examples of unknown target function f
{⟨xᵢ, yᵢ⟩}ⁿᵢ₌₁ = {⟨x₁, y₁⟩, …, ⟨xₙ, yₙ⟩}

Output: Hypothesis h ∈ H that best approximates f

Based on slide by Tom Mitchell

Page 4:

Sample Dataset (was Tennis played?)
• Columns denote features Xᵢ
• Rows denote labeled instances ⟨xᵢ, yᵢ⟩
• Class label denotes whether a tennis game was played

Page 5:

Decision Tree
• A possible decision tree for the data:
• Each internal node: test one attribute Xᵢ
• Each branch from a node: selects one value for Xᵢ
• Each leaf node: predict Y

Based on slide by Tom Mitchell

Page 6:

Decision Tree
• A possible decision tree for the data:
• What prediction would we make for <outlook=sunny, temperature=hot, humidity=high, wind=weak>?

Based on slide by Tom Mitchell

Page 7:

Decision Tree – Decision Boundary
• Decision trees divide the feature space into axis-parallel (hyper-)rectangles
• Each rectangular region is labeled with one label – or a probability distribution over labels

(Figure: decision boundary)

Page 8:

Expressiveness
• Given a particular space of functions, you may not be able to represent everything
• What functions can decision trees represent?
• Decision trees can represent any function of the input attributes!
  – Boolean operations (and, or, xor, etc.)? – Yes!
  – All boolean functions? – Yes!
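The xor case can be checked concretely. This is a minimal sketch (not from the slides) of a depth-2 decision tree computing XOR of two boolean attributes, written as nested if/then tests:

```python
# A depth-2 decision tree for XOR, written as nested if/then tests.
# Each "if" is an internal node testing one attribute; each return is a leaf.
def xor_tree(x1: bool, x2: bool) -> bool:
    if x1:                  # internal node: test X1
        return not x2       # branch X1=True: subtree on X2
    else:
        return x2           # branch X1=False: subtree on X2

# The tree reproduces the XOR truth table:
print([xor_tree(a, b) for a in (False, True) for b in (False, True)])
# [False, True, True, False]
```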

Page 9:

Stages of (Batch) Machine Learning
Given: labeled training data X, Y = {⟨xᵢ, yᵢ⟩}ⁿᵢ₌₁
• Assumes each xᵢ ~ D(X) with yᵢ = f_target(xᵢ)

Train the model:
model ← classifier.train(X, Y)

Apply the model to new data:
• Given: new unlabeled instance x ~ D(X)
y_prediction ← model.predict(x)
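The train-then-apply interface above can be sketched in a few lines. The MajorityClassifier below is a hypothetical stand-in for any classifier (it is not from the slides); what matters is the shape of the two calls:

```python
from collections import Counter

class MajorityClassifier:
    """Hypothetical baseline classifier: ignores the features entirely and
    predicts the most common training label. Illustrates the interface only."""
    def train(self, X, Y):
        self.label = Counter(Y).most_common(1)[0][0]
        return self

    def predict(self, x):
        return self.label

X = [["sunny", "hot"], ["rain", "mild"], ["sunny", "cool"]]
Y = ["No", "Yes", "Yes"]

model = MajorityClassifier().train(X, Y)           # model <- classifier.train(X, Y)
y_prediction = model.predict(["overcast", "hot"])  # y_prediction <- model.predict(x)
print(y_prediction)  # Yes
```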

Page 10:

Basic Algorithm for Top-Down Learning of Decision Trees
[ID3, C4.5 by Quinlan]

node = root of decision tree
Main loop:
1. A ← the "best" decision attribute for the next node.
2. Assign A as decision attribute for node.
3. For each value of A, create a new descendant of node.
4. Sort training examples to leaf nodes.
5. If training examples are perfectly classified, stop. Else, recurse over new leaf nodes.

How do we choose which attribute is best?
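The five-step loop above can be sketched as one recursive function. This is a simplified ID3-style sketch, not Quinlan's actual code: it uses information gain as the "best attribute" criterion (defined later in these slides), and the nested-dict tree representation is an assumption for illustration.

```python
import math
from collections import Counter

def entropy(labels):
    n = len(labels)
    return sum(-c / n * math.log2(c / n) for c in Counter(labels).values())

def info_gain(rows, labels, attr):
    # Expected reduction in entropy from splitting on attr.
    n, split = len(labels), {}
    for row, y in zip(rows, labels):
        split.setdefault(row[attr], []).append(y)
    return entropy(labels) - sum(len(ys) / n * entropy(ys) for ys in split.values())

def id3(rows, labels, attrs):
    if len(set(labels)) == 1:                      # perfectly classified: stop
        return labels[0]
    if not attrs:                                  # no attributes left: majority leaf
        return Counter(labels).most_common(1)[0][0]
    best = max(attrs, key=lambda a: info_gain(rows, labels, a))       # step 1
    node = {best: {}}                              # step 2: assign A to this node
    for v in {r[best] for r in rows}:              # step 3: one descendant per value
        sub = [(r, y) for r, y in zip(rows, labels) if r[best] == v]  # step 4: sort
        node[best][v] = id3([r for r, _ in sub], [y for _, y in sub],
                            [a for a in attrs if a != best])          # step 5: recurse
    return node

rows = [{"outlook": "sunny", "wind": "weak"},
        {"outlook": "sunny", "wind": "strong"},
        {"outlook": "rain",  "wind": "weak"}]
labels = ["Yes", "No", "Yes"]
print(id3(rows, labels, ["outlook", "wind"]))
```

On this toy data the split on wind separates the classes perfectly, so the sketch returns a one-level tree keyed on wind.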

Page 11:

Choosing the Best Attribute
Key problem: choosing which attribute to split a given set of examples
• Some possibilities are:
  – Random: Select any attribute at random
  – Least-Values: Choose the attribute with the smallest number of possible values
  – Most-Values: Choose the attribute with the largest number of possible values
  – Max-Gain: Choose the attribute that has the largest expected information gain
    • i.e., the attribute that results in the smallest expected size of the subtrees rooted at its children
• The ID3 algorithm uses the Max-Gain method of selecting the best attribute

Page 12:

Information Gain

Which test is more informative?
• Split over whether Balance exceeds 50K (branches: "Less or equal 50K" / "Over 50K")
• Split over whether applicant is Employed (branches: "Unemployed" / "Employed")

Based on slide by Pedro Domingos

Page 13:

Information Gain

Impurity/Entropy (informal): measures the level of impurity in a group of examples

Based on slide by Pedro Domingos

Page 14:

Impurity

Very impure group → Less impure → Minimum impurity

Based on slide by Pedro Domingos

Page 15:

Entropy: a common way to measure impurity

• Entropy = −Σᵢ pᵢ log₂ pᵢ
  pᵢ is the probability of class i. Compute it as the proportion of class i in the set.
• Entropy comes from information theory. The higher the entropy, the more the information content.

What does that mean for learning from examples?

Based on slide by Pedro Domingos
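The formula can be computed directly from a set of examples. A minimal sketch, with class probabilities estimated as proportions exactly as described above:

```python
import math
from collections import Counter

def entropy(examples):
    """Entropy = -sum_i p_i * log2(p_i), with p_i the proportion of class i."""
    n = len(examples)
    return sum(-c / n * math.log2(c / n) for c in Counter(examples).values())

print(entropy(["+", "+", "-", "-"]))   # 1.0  (50/50 split: maximum impurity)
print(entropy(["+", "+", "+", "+"]))   # 0.0  (pure group: minimum impurity)
```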

Page 16:

2-Class Cases:

Entropy: H(x) = −Σⁿᵢ₌₁ P(x = i) log₂ P(x = i)

• What is the entropy of a group in which all examples belong to the same class?
  – entropy = −1 · log₂(1) = 0 → minimum impurity; not a good training set for learning
• What is the entropy of a group with 50% in either class?
  – entropy = −0.5 · log₂(0.5) − 0.5 · log₂(0.5) = 1 → maximum impurity; a good training set for learning

Based on slide by Pedro Domingos

Page 17:

Sample Entropy

• Entropy H(X) of a random variable X
• Specific conditional entropy H(X|Y=v) of X given Y=v
• Conditional entropy H(X|Y) of X given Y
• Mutual information (aka Information Gain) of X and Y

Slide by Tom Mitchell
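The four quantities above can be sketched numerically for a small sample of (x, y) pairs. The helper names below are illustrative, not Mitchell's notation, and the sample estimates are assumptions for the example:

```python
import math
from collections import Counter

def H(values):
    # Entropy H(X), estimated from a sample.
    n = len(values)
    return sum(-c / n * math.log2(c / n) for c in Counter(values).values())

def H_given(xs, ys, v):
    # Specific conditional entropy H(X | Y = v).
    return H([x for x, y in zip(xs, ys) if y == v])

def H_cond(xs, ys):
    # Conditional entropy H(X|Y) = sum_v P(Y=v) * H(X | Y=v).
    n = len(ys)
    return sum(c / n * H_given(xs, ys, v) for v, c in Counter(ys).items())

def mutual_information(xs, ys):
    # Mutual information (information gain): I(X;Y) = H(X) - H(X|Y).
    return H(xs) - H_cond(xs, ys)

xs = ["+", "+", "-", "-"]
ys = ["a", "a", "b", "b"]          # here y determines x exactly
print(mutual_information(xs, ys))  # 1.0: knowing Y removes all uncertainty in X
```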

Page 18:

Information Gain
• We want to determine which attribute in a given set of training feature vectors is most useful for discriminating between the classes to be learned.
• Information gain tells us how important a given attribute of the feature vectors is.
• We will use it to decide the ordering of attributes in the nodes of a decision tree.

Based on slide by Pedro Domingos

Page 19:

From Entropy to Information Gain

• Entropy H(X) of a random variable X
• Specific conditional entropy H(X|Y=v) of X given Y=v
• Conditional entropy H(X|Y) of X given Y
• Mutual information (aka Information Gain) of X and Y

Slide by Tom Mitchell

Page 20:

Information Gain

Information Gain is the mutual information between input attribute A and target variable Y.

Information Gain is the expected reduction in entropy of target variable Y for data sample S, due to sorting on variable A.

Slide by Tom Mitchell

Page 21:

Calculating Information Gain

Information Gain = entropy(parent) − [average entropy(children)]

Entire population (30 instances, split 16/14 between the classes):
parent entropy = −(16/30) · log₂(16/30) − (14/30) · log₂(14/30) = 0.996

Left child (17 instances, split 13/4):
child entropy = −(13/17) · log₂(13/17) − (4/17) · log₂(4/17) = 0.787

Right child (13 instances, split 1/12):
child entropy = −(1/13) · log₂(1/13) − (12/13) · log₂(12/13) = 0.391

(Weighted) Average Entropy of Children = (17/30) · 0.787 + (13/30) · 0.391 = 0.615

Information Gain = 0.996 − 0.615 = 0.38

Based on slide by Pedro Domingos
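The arithmetic in this worked example can be verified in a few lines (computing exactly rather than from rounded intermediates, so the third decimal can differ slightly from the slide's rounded figures):

```python
import math

def entropy2(a, b):
    # Two-class entropy from class counts a and b.
    n = a + b
    return sum(-c / n * math.log2(c / n) for c in (a, b) if c > 0)

parent = entropy2(16, 14)                        # parent entropy
left, right = entropy2(13, 4), entropy2(1, 12)   # child entropies
weighted = (17 / 30) * left + (13 / 30) * right  # weighted average of children
gain = parent - weighted                         # information gain

print(round(left, 3), round(right, 3))  # 0.787 0.391
print(round(gain, 2))                   # 0.38
```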

Page 22:


Page 23:

Slide by Tom Mitchell

Page 24:

Slide by Tom Mitchell

Page 25:

Decision Tree Learning Applet
• http://www.cs.ualberta.ca/%7Eaixplore/learning/DecisionTrees/Applet/DecisionTreeApplet.html

Which Tree Should We Output?
• ID3 performs heuristic search through space of decision trees
• It stops at smallest acceptable tree. Why?

Occam's razor: prefer the simplest hypothesis that fits the data

Slide by Tom Mitchell

Page 26:

Preference bias: Ockham's Razor
• Principle stated by William of Ockham (1285-1347)
  – "non sunt multiplicanda entia praeter necessitatem"
  – entities are not to be multiplied beyond necessity
  – AKA Occam's Razor, Law of Economy, or Law of Parsimony
• Therefore, the smallest decision tree that correctly classifies all of the training examples is best
• Finding the provably smallest decision tree is NP-hard
• ...So instead of constructing the absolute smallest tree consistent with the training examples, construct one that is pretty small

Idea: The simplest consistent explanation is the best

Page 27:

Overfitting in Decision Trees
• Many kinds of "noise" can occur in the examples:
  – Two examples have the same attribute/value pairs, but different classifications
  – Some values of attributes are incorrect because of errors in the data acquisition process or the preprocessing phase
  – The instance was labeled incorrectly (+ instead of −)
• Also, some attributes are irrelevant to the decision-making process
  – e.g., color of a die is irrelevant to its outcome

Based on slide from M. desJardins & T. Finin

Page 28:

Overfitting in Decision Trees
• Irrelevant attributes can result in overfitting the training example data
  – If the hypothesis space has many dimensions (a large number of attributes), we may find meaningless regularity in the data that is irrelevant to the true, important, distinguishing features
• If we have too little training data, even a reasonable hypothesis space will 'overfit'

Based on slide from M. desJardins & T. Finin

Page 29:

Overfitting in Decision Trees
Consider adding a noisy training example to the following tree:

What would be the effect of adding:
<outlook=sunny, temperature=hot, humidity=normal, wind=strong, playTennis=No>?

Based on slide by Pedro Domingos

Page 30:

Overfitting in Decision Trees

Slide by Pedro Domingos

Page 31:

Overfitting in Decision Trees

Slide by Pedro Domingos

Page 32:

Avoiding Overfitting in Decision Trees
How can we avoid overfitting?
• Stop growing when the data split is not statistically significant
• Acquire more training data
• Remove irrelevant attributes (manual process – not always possible)
• Grow full tree, then post-prune

How to select the "best" tree:
• Measure performance over training data
• Measure performance over a separate validation data set
• Add complexity penalty to performance measure (heuristic: simpler is better)

Based on slide by Pedro Domingos

Page 33:

Reduced-Error Pruning

Split training data further into training and validation sets

Grow tree based on training set

Do until further pruning is harmful:
1. Evaluate impact on validation set of pruning each possible node (plus those below it)
2. Greedily remove the node that most improves validation set accuracy

Slide by Pedro Domingos

Page 34:

Pruning Decision Trees
• Pruning of the decision tree is done by replacing a whole subtree by a leaf node.
• The replacement takes place if a decision rule establishes that the expected error rate in the subtree is greater than in the single leaf.
• For example:

Training (split on Color):
  red:  1 positive, 0 negative
  blue: 0 positive, 2 negative

Validation (same split):
  red:  1 positive, 3 negative
  blue: 1 positive, 1 negative

On the validation set the subtree gets 2 correct, 4 incorrect.
If we had simply predicted the majority class (negative), we would make 2 errors instead of 4.

Based on example from M. desJardins & T. Finin
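The error counts in this pruning example can be checked in a few lines. The dict layout below is just an illustration of the example's counts; the subtree learned from the training set predicts positive for red and negative for blue:

```python
# Validation counts per branch: (positives, negatives).
val = {"red": (1, 3), "blue": (1, 1)}
# The subtree learned from the training counts predicts + for red, - for blue.
subtree_pred = {"red": "+", "blue": "-"}

# Errors if we keep the subtree: a "+" branch mislabels its negatives,
# a "-" branch mislabels its positives.
subtree_errors = sum(neg if subtree_pred[c] == "+" else pos
                     for c, (pos, neg) in val.items())
# Errors if we prune to a single majority-class ("-") leaf: all positives.
leaf_errors = sum(pos for pos, _ in val.values())

print(subtree_errors, leaf_errors)  # 4 2
```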

Page 35:

Effect of Reduced-Error Pruning

On training data it looks great.
But that's not the case for the test (validation) data.

Based on slide by Pedro Domingos

Page 36:

Effect of Reduced-Error Pruning

The tree is pruned back to the red line, where it gives more accurate results on the test data.

Based on slide by Pedro Domingos

Page 37:

Summary: Decision Tree Learning
• Widely used in practice
• Strengths include:
  – Fast and simple to implement
  – Can convert to rules
  – Handles noisy data
• Weaknesses include:
  – Univariate splits/partitioning using only one attribute at a time, which limits the types of possible trees
  – Large decision trees may be hard to understand
  – Requires fixed-length feature vectors
  – Non-incremental (i.e., batch method)

Page 38:

Summary: Decision Tree Learning
• Representation: decision trees
• Bias: prefer small decision trees
• Search algorithm: greedy
• Heuristic function: information gain or information content or others
• Overfitting / pruning

Slide by Pedro Domingos