Classification: Decision Trees

Robot Image Credit: Viktoriya Sukhanova © 123RF.com

These slides were assembled by Byron Boots, with grateful acknowledgement to Eric Eaton and the many others who made their course materials freely available online. Feel free to reuse or adapt these slides for your own academic purposes, provided that you include proper attribution.
Last Time
• Common decomposition of machine learning, based on differences in inputs and outputs:
  – Supervised Learning: Learn a function mapping inputs to outputs using labeled training data (you get instances/examples with both inputs and ground-truth outputs)
  – Unsupervised Learning: Learn something about just the data, without any labels (harder!), for example clustering instances that are "similar"
  – Reinforcement Learning: Learn how to make decisions given a sparse reward
• ML is an interaction between:
  – the input data
  – input to the function (features/attributes of the data)
  – the function (model) you choose, and
  – the optimization algorithm you use to explore the space of functions
• We are new at this, so let's start by learning if/then rules for classification!
Function Approximation Problem Setting
• Set of possible instances $\mathcal{X}$
• Set of possible labels $\mathcal{Y}$
• Unknown target function $f : \mathcal{X} \to \mathcal{Y}$
• Set of function hypotheses $H = \{ h \mid h : \mathcal{X} \to \mathcal{Y} \}$

Input: Training examples $\{\langle x_i, y_i \rangle\}_{i=1}^{n} = \{\langle x_1, y_1 \rangle, \dots, \langle x_n, y_n \rangle\}$ of unknown target function $f$
Output: Hypothesis $h \in H$ that best approximates $f$

Based on slide by Tom Mitchell
Sample Dataset (was Tennis Played?)
• Columns denote features $X_i$
• Rows denote labeled instances $\langle x_i, y_i \rangle$
• Class label denotes whether a tennis game was played
Decision Tree
• A possible decision tree for the data:
• Each internal node: tests one attribute $X_i$
• Each branch from a node: selects one value for $X_i$
• Each leaf node: predicts $Y$

Based on slide by Tom Mitchell
Decision Tree
• A possible decision tree for the data:
• What prediction would we make for <outlook=sunny, temperature=hot, humidity=high, wind=weak>?

Based on slide by Tom Mitchell
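To make the lookup concrete, here is a minimal Python sketch of reading a prediction off a fixed tree. It assumes the standard PlayTennis tree from Mitchell's example (Outlook at the root, then Humidity or Wind); the slide's tree may differ in details.

```python
# Decision tree as nested (attribute, branches) tuples; leaves are labels.
# This particular tree structure is an assumption (Mitchell's PlayTennis tree).
tree = ("outlook", {
    "sunny":    ("humidity", {"high": "No", "normal": "Yes"}),
    "overcast": "Yes",
    "rain":     ("wind", {"strong": "No", "weak": "Yes"}),
})

def predict(node, x):
    while isinstance(node, tuple):   # internal node: (attribute, branches)
        attr, branches = node
        node = branches[x[attr]]     # follow the branch matching x's value
    return node                      # leaf node: the predicted label

x = {"outlook": "sunny", "temperature": "hot", "humidity": "high", "wind": "weak"}
print(predict(tree, x))              # -> "No" (sunny outlook, high humidity)
```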
Decision Tree – Decision Boundary
• Decision trees divide the feature space into axis-parallel (hyper-)rectangles
• Each rectangular region is labeled with one label
  – or a probability distribution over labels

(figure: decision boundary)
Expressiveness
• Given a particular space of functions, you may not be able to represent everything
• What functions can decision trees represent?
• Decision trees can represent any function of the input attributes!
  – Boolean operations (and, or, xor, etc.)? – Yes!
  – All boolean functions? – Yes! (see the sketch below)
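As a quick check of that claim, here is a sketch of a tree that represents XOR, a function no single-attribute test captures: the tree splits on A, then on B in each branch. The nested-tuple encoding is just for illustration.

```python
# XOR as a fully expanded decision tree over two boolean attributes.
xor_tree = ("A", {
    0: ("B", {0: 0, 1: 1}),
    1: ("B", {0: 1, 1: 0}),
})

def predict(node, x):
    while isinstance(node, tuple):
        attr, branches = node
        node = branches[x[attr]]
    return node

for a in (0, 1):
    for b in (0, 1):
        assert predict(xor_tree, {"A": a, "B": b}) == a ^ b
print("XOR represented exactly")
```

The catch is size: representing an arbitrary boolean function this way may require a tree with exponentially many leaves.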
Stages of (Batch) Machine Learning
Given: labeled training data $X, Y = \{\langle x_i, y_i \rangle\}_{i=1}^{n}$
• Assumes each $x_i \sim \mathcal{D}(\mathcal{X})$ with $y_i = f_{\text{target}}(x_i)$

Train the model:
  model ← classifier.train(X, Y)

Apply the model to new data:
• Given: new unlabeled instance $x \sim \mathcal{D}(\mathcal{X})$
  y_prediction ← model.predict(x)

(diagram: X, Y → learner → model; x → model → y_prediction)
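A minimal sketch of these two stages, using scikit-learn's DecisionTreeClassifier as a stand-in for the generic classifier.train / model.predict interface; the toy data and its integer encoding are assumptions for illustration.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

# Toy PlayTennis-style data: columns = (outlook, temperature, humidity, wind),
# each categorical feature encoded as an integer.
X = np.array([[0, 0, 0, 0],    # sunny, hot, high, weak
              [0, 0, 0, 1],    # sunny, hot, high, strong
              [1, 0, 0, 0],    # overcast, hot, high, weak
              [2, 1, 0, 0]])   # rain, mild, high, weak
Y = np.array([0, 0, 1, 1])     # 0 = No, 1 = Yes

model = DecisionTreeClassifier(criterion="entropy").fit(X, Y)  # model <- train(X, Y)
x_new = np.array([[1, 2, 1, 0]])                               # new unlabeled instance
y_prediction = model.predict(x_new)                            # y <- model.predict(x)
print(y_prediction)
```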
Basic Algorithm for Top-Down Learning of Decision Trees
[ID3, C4.5 by Quinlan]

node = root of decision tree
Main loop:
1. A ← the "best" decision attribute for the next node.
2. Assign A as decision attribute for node.
3. For each value of A, create a new descendant of node.
4. Sort training examples to leaf nodes.
5. If training examples are perfectly classified, stop. Else, recurse over new leaf nodes.

How do we choose which attribute is best?
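Before turning to that question, here is a compact Python sketch of the loop above, using information gain (introduced below) as the "best" criterion. The data format (feature-dict, label pairs) is an assumption for illustration, and the real ID3/C4.5 algorithms add much more (missing values, continuous attributes, pruning).

```python
from collections import Counter
from math import log2

def entropy(labels):
    n = len(labels)
    return -sum(c / n * log2(c / n) for c in Counter(labels).values())

def gain(examples, attr):
    """Expected reduction in label entropy from splitting on attr."""
    labels = [y for _, y in examples]
    remainder = 0.0
    for v in {x[attr] for x, _ in examples}:
        subset = [y for x, y in examples if x[attr] == v]
        remainder += len(subset) / len(labels) * entropy(subset)
    return entropy(labels) - remainder

def id3(examples, attrs):
    labels = [y for _, y in examples]
    if len(set(labels)) == 1:                  # perfectly classified: stop
        return labels[0]
    if not attrs:                              # no tests left: majority label
        return Counter(labels).most_common(1)[0][0]
    best = max(attrs, key=lambda a: gain(examples, a))          # step 1
    branches = {}
    for v in {x[best] for x, _ in examples}:   # steps 3-4: sort examples to children
        subset = [(x, y) for x, y in examples if x[best] == v]
        branches[v] = id3(subset, [a for a in attrs if a != best])  # step 5: recurse
    return (best, branches)

data = [({"outlook": "sunny", "humidity": "high"}, "No"),
        ({"outlook": "sunny", "humidity": "normal"}, "Yes"),
        ({"outlook": "overcast", "humidity": "high"}, "Yes")]
print(id3(data, ["outlook", "humidity"]))
```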
Choosing the Best Attribute
Key problem: choosing which attribute to split a given set of examples on
• Some possibilities are:
  – Random: Select any attribute at random
  – Least-Values: Choose the attribute with the smallest number of possible values
  – Most-Values: Choose the attribute with the largest number of possible values
  – Max-Gain: Choose the attribute that has the largest expected information gain
    • i.e., the attribute that results in the smallest expected size of the subtrees rooted at its children
• The ID3 algorithm uses the Max-Gain method of selecting the best attribute
Information Gain
Which test is more informative?
• Split over whether Balance exceeds 50K (branches: Less or equal 50K / Over 50K)
• Split over whether applicant is employed (branches: Unemployed / Employed)

Based on slide by Pedro Domingos
Impurity/Entropy (informal)
– Measures the level of impurity in a group of examples

Based on slide by Pedro Domingos

Impurity
(figure: three groups of examples, ranging from a very impure group, to a less impure group, to minimum impurity)

Based on slide by Pedro Domingos
Entropy: a common way to measure impurity
• Entropy $= -\sum_i p_i \log_2 p_i$
  where $p_i$ is the probability of class $i$. Compute it as the proportion of class $i$ in the set.
• Entropy comes from information theory. The higher the entropy, the more the information content.
What does that mean for learning from examples?

Based on slide by Pedro Domingos
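A direct transcription of this formula as a sketch, assuming the $p_i$ are already given as class proportions; the two printed values match the 2-class cases on the next slide.

```python
from math import log2

def entropy(proportions):
    # -sum p_i log2 p_i, skipping zero-probability classes (0 log 0 = 0)
    return -sum(p * log2(p) for p in proportions if p > 0)

print(entropy([1.0]))        # all one class: 0.0 (minimum impurity)
print(entropy([0.5, 0.5]))   # 50/50 split:   1.0 (maximum impurity)
```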
2-Class Cases:
Entropy $H(X) = -\sum_{i=1}^{n} P(x = i) \log_2 P(x = i)$
• What is the entropy of a group in which all examples belong to the same class?
  – entropy $= -1 \log_2 1 = 0$ (minimum impurity: not a good training set for learning)
• What is the entropy of a group with 50% in either class?
  – entropy $= -0.5 \log_2 0.5 - 0.5 \log_2 0.5 = 1$ (maximum impurity: a good training set for learning)

Based on slide by Pedro Domingos
Sample Entropy
For a sample $S$ of training examples, with $p_{\oplus}$ the proportion of positive and $p_{\ominus}$ the proportion of negative examples in $S$:
$H(S) = -p_{\oplus} \log_2 p_{\oplus} - p_{\ominus} \log_2 p_{\ominus}$

Entropy $H(X)$ of a random variable $X$:
$H(X) = -\sum_i P(X = i) \log_2 P(X = i)$

Specific conditional entropy $H(X \mid Y = v)$ of $X$ given $Y = v$:
$H(X \mid Y = v) = -\sum_i P(X = i \mid Y = v) \log_2 P(X = i \mid Y = v)$

Conditional entropy $H(X \mid Y)$ of $X$ given $Y$:
$H(X \mid Y) = \sum_v P(Y = v)\, H(X \mid Y = v)$

Mutual information (aka Information Gain) of $X$ and $Y$:
$I(X, Y) = H(X) - H(X \mid Y) = H(Y) - H(Y \mid X)$

Slide by Tom Mitchell
Information Gain
• We want to determine which attribute in a given set of training feature vectors is most useful for discriminating between the classes to be learned.
• Information gain tells us how important a given attribute of the feature vectors is.
• We will use it to decide the ordering of attributes in the nodes of a decision tree.

Based on slide by Pedro Domingos
From Entropy to Information Gain
(Same entropy, conditional entropy, and mutual information definitions as above.)

Slide by Tom Mitchell
Information Gain
Information Gain is the mutual information between input attribute $A$ and target variable $Y$.
Information Gain is the expected reduction in entropy of target variable $Y$ for data sample $S$, due to sorting on variable $A$:
$Gain(S, A) = H_S(Y) - H_S(Y \mid A)$

Slide by Tom Mitchell
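A sample-based sketch of this quantity; the function names and the list-of-values data format are illustrative assumptions.

```python
from collections import Counter
from math import log2

def entropy(labels):
    n = len(labels)
    return -sum(c / n * log2(c / n) for c in Counter(labels).values())

def information_gain(attr_values, labels):
    """Gain(S, A) = H(Y) - H(Y | A), estimated from a sample."""
    n = len(labels)
    h_y_given_a = 0.0
    for v in set(attr_values):
        subset = [y for a, y in zip(attr_values, labels) if a == v]
        h_y_given_a += len(subset) / n * entropy(subset)
    return entropy(labels) - h_y_given_a

wind = ["weak", "strong", "weak", "weak"]
play = ["Yes", "No", "Yes", "No"]
print(information_gain(wind, play))   # how much knowing wind reduces entropy of play
```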
21
CalculatingInformationGain
996.03016log
3016
3014log
3014
22 =÷øö
çèæ ×-÷
øö
çèæ ×-=impurity
787.0174log
174
1713log
1713
22 =÷øö
çèæ ×-÷
øö
çèæ ×-=impurity
Entire population (30 instances)17 instances
13 instances
(Weighted) Average Entropy of Children = 615.0391.03013787.0
3017
=÷øö
çèæ ×+÷
øö
çèæ ×
Information Gain= 0.996 - 0.615 = 0.38
391.01312log
1312
131log
131
22 =÷øö
çèæ ×-÷
øö
çèæ ×-=impurity
InformationGain =entropy(parent)– [averageentropy(children)]
parententropy
childentropy
childentropy
BasedonslidebyPedroDomingos
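Plugging the slide's counts into the entropy formula confirms the arithmetic; this small sketch just reproduces the numbers above.

```python
from math import log2

def H(p, q):
    """Entropy of a two-class group with p and q instances per class."""
    n = p + q
    return -(p / n) * log2(p / n) - (q / n) * log2(q / n)

parent   = H(14, 16)                       # 0.996 (30 instances)
left     = H(13, 4)                        # 0.787 (17 instances)
right    = H(1, 12)                        # 0.391 (13 instances)
children = 17/30 * left + 13/30 * right    # 0.615
print(round(parent - children, 2))         # 0.38
```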
Decision Tree Learning Applet
• http://www.cs.ualberta.ca/%7Eaixplore/learning/DecisionTrees/Applet/DecisionTreeApplet.html
Which Tree Should We Output?
• ID3 performs a heuristic search through the space of decision trees
• It stops at the smallest acceptable tree. Why?

Occam's razor: prefer the simplest hypothesis that fits the data

Slide by Tom Mitchell
Preference bias: Ockham's Razor
• Principle stated by William of Ockham (1285-1347)
  – "non sunt multiplicanda entia praeter necessitatem"
  – entities are not to be multiplied beyond necessity
  – AKA Occam's Razor, Law of Economy, or Law of Parsimony
• Therefore, the smallest decision tree that correctly classifies all of the training examples is best
• Finding the provably smallest decision tree is NP-hard
• ...So instead of constructing the absolute smallest tree consistent with the training examples, construct one that is pretty small

Idea: The simplest consistent explanation is the best
Overfitting in Decision Trees
• Many kinds of "noise" can occur in the examples:
  – Two examples have the same attribute/value pairs, but different classifications
  – Some values of attributes are incorrect because of errors in the data acquisition process or the preprocessing phase
  – The instance was labeled incorrectly (+ instead of -)
• Also, some attributes are irrelevant to the decision-making process
  – e.g., the color of a die is irrelevant to its outcome

Based on slide from M. desJardins & T. Finin

• Irrelevant attributes can result in overfitting the training example data
  – If the hypothesis space has many dimensions (a large number of attributes), we may find meaningless regularity in the data that is irrelevant to the true, important, distinguishing features
• If we have too little training data, even a reasonable hypothesis space will 'overfit'

Based on slide from M. desJardins & T. Finin
Overfitting in Decision Trees
Consider adding a noisy training example to the following tree:
What would be the effect of adding: <outlook=sunny, temperature=hot, humidity=normal, wind=strong, playTennis=No>?

Based on slide by Pedro Domingos
Overfitting in Decision Trees
(figure slides)

Slides by Pedro Domingos
Avoiding Overfitting in Decision Trees
How can we avoid overfitting?
• Stop growing when the data split is not statistically significant
• Acquire more training data
• Remove irrelevant attributes (manual process; not always possible)
• Grow full tree, then post-prune

How to select the "best" tree:
• Measure performance over the training data
• Measure performance over a separate validation data set
• Add a complexity penalty to the performance measure (heuristic: simpler is better)

Based on slide by Pedro Domingos
Reduced-Error Pruning

Split training data further into training and validation sets

Grow tree based on the training set

Do until further pruning is harmful:
1. Evaluate the impact on the validation set of pruning each possible node (plus those below it)
2. Greedily remove the node that most improves validation set accuracy

Slide by Pedro Domingos
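A minimal sketch of this loop. The Node representation and the (feature-dict, label) validation format are illustrative assumptions; pruning that leaves validation accuracy unchanged is treated as non-harmful here, which prefers smaller trees.

```python
class Node:
    def __init__(self, attr=None, children=None, label=None):
        self.attr = attr                 # attribute tested here (None for a leaf)
        self.children = children or {}   # attribute value -> child Node
        self.label = label               # majority class of training examples here

def predict(node, x):
    while node.attr is not None:
        child = node.children.get(x[node.attr])
        if child is None:                # value unseen in training: use majority label
            return node.label
        node = child
    return node.label

def accuracy(tree, data):
    return sum(predict(tree, x) == y for x, y in data) / len(data)

def internal_nodes(node):
    if node.attr is not None:
        yield node
        for child in node.children.values():
            yield from internal_nodes(child)

def reduced_error_prune(root, val_data):
    while True:
        base = accuracy(root, val_data)
        best_gain, best = 0.0, None
        for node in list(internal_nodes(root)):
            saved = (node.attr, node.children)
            node.attr, node.children = None, {}      # tentatively make it a leaf
            g = accuracy(root, val_data) - base
            node.attr, node.children = saved         # undo the tentative pruning
            if g >= best_gain:                       # ties count: prefer smaller trees
                best_gain, best = g, node
        if best is None:                             # every pruning is harmful: stop
            return root
        best.attr, best.children = None, {}          # greedily apply the best pruning
```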
Pruning Decision Trees
• Pruning of the decision tree is done by replacing a whole subtree by a leaf node.
• The replacement takes place if a decision rule establishes that the expected error rate in the subtree is greater than in the single leaf.
• For example:

Training: split on Color
  – red: 1 positive, 0 negative
  – blue: 0 positive, 2 negative

Validation: split on Color
  – red: 1 positive, 3 negative
  – blue: 1 positive, 1 negative

On the validation set, the subtree gets 2 correct and 4 incorrect. If we had simply predicted the majority class (negative), we would make 2 errors instead of 4.

Based on example from M. desJardins & T. Finin
Effect of Reduced-Error Pruning
• On the training data, accuracy looks great
• But that's not the case for the test (validation) data
• The tree is pruned back to the red line, where it gives more accurate results on the test data

Based on slides by Pedro Domingos
Summary: Decision Tree Learning
• Widely used in practice
• Strengths include:
  – Fast and simple to implement
  – Can convert to rules
  – Handles noisy data
• Weaknesses include:
  – Univariate splits/partitioning using only one attribute at a time, which limits the types of possible trees
  – Large decision trees may be hard to understand
  – Requires fixed-length feature vectors
  – Non-incremental (i.e., a batch method)
Summary: Decision Tree Learning
• Representation: decision trees
• Bias: prefer small decision trees
• Search algorithm: greedy
• Heuristic function: information gain or information content or others
• Overfitting / pruning

Slide by Pedro Domingos