CSE 417T: Introduction to Machine Learning
Lecture 15. Instructor: Chien-Ju (CJ) Ho
Logistics: Homework 3
• Homework 3 is posted on the course website
• Due on March 25 (Wednesday), 2020
• Homework 4 will be announced before Homework 3 is due
Discussion on Exam 1
417T Part 2: Machine Learning Techniques
Focus of the rest of the semester
Decision Tree
Decision Tree Hypothesis
• 𝑥⃗ = (annual income, have debt)
• 𝑦 ∈ {approve, deny}
[Figure: Credit Card Approval Example — a decision tree that first splits on Annual Income (≥ 100k / ≥ 20k and < 100k / < 20k); the ≥ 100k branch predicts Approve, the < 20k branch predicts Deny, and the middle branch asks "have debt?" (yes → Deny, no → Approve)]
Decision Tree Hypothesis
• Pros
  • Easy to interpret (interpretability is getting attention and is important in some domains)
  • Can handle multi-type data (numerical, categorical, …)
  • Easy to implement (a bunch of if-else rules)
• Cons
  • Generally speaking, bad generalization
  • VC dimension is infinity
  • High variance (a small change in the data leads to a very different hypothesis)
  • Easily overfits
• Why do we care?
  • One of the classical models
  • Building block for other models (e.g., random forest)
Learning Decision Tree from Data
• Given dataset 𝐷, how to learn a decision tree hypothesis?

    x₁   x₂   x₃   y
    +1   +1   +1   +1
    +1   +1   -1   +1
    +1   -1   +1   +1
    +1   -1   -1   +1
    -1   +1   +1   +1
    -1   +1   -1   +1
    -1   -1   +1   -1
    -1   -1   -1   -1

• Potential approach
  • Find 𝑔 = argmin_{h∈𝐻} E_in(h)
• Multiple decision trees have zero E_in
[Figure: two different decision trees that both achieve zero E_in on this dataset — a small tree that splits on x₁ and then on x₂, and a deeper tree that splits on x₃ first and then on x₂ and x₁]
• Which one do you think might generalize better?
Learning Decision Tree from Data
• Conceptual intuition to deal with overfitting
  • Regularization: constrain 𝐻
  • Informally,
      minimize    E_in(h)
      subject to  size(tree) ≤ C
• This optimization is generally computationally intractable.
• Most decision tree learning algorithms rely on heuristics to approximate this goal.
Greedy-Based Decision Tree Algorithm
• DecisionTreeLearn(𝐷): input a dataset 𝐷, output a decision tree hypothesis
  • Create a root node 𝑟
  • If termination conditions are met
    • return a single-node tree with leaf prediction based on 𝐷
  • Else: greedily find a feature 𝐴 to split according to the split criteria
    • For each possible value 𝑣ᵢ of 𝐴
      • Let 𝐷ᵢ be the dataset containing the data with value 𝑣ᵢ for feature 𝐴
      • Create a subtree DecisionTreeLearn(𝐷ᵢ) as a child of root 𝑟
• Most decision tree learning algorithms follow this template, but with different choices of heuristics (a Python sketch of the template follows)
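A minimal Python sketch of this recursive template, assuming the dataset is a list of (feature-dict, label) pairs. The three heuristics are left as parameters because the concrete choices (termination, leaf prediction, split criterion such as ID3's information gain) are discussed next; all names here are illustrative, not from the lecture.

```python
# Minimal sketch of the greedy decision-tree template.
# termination_met, leaf_prediction, and choose_split_feature are placeholder
# heuristics; concrete choices (e.g., ID3's information gain) are plugged in later.

def decision_tree_learn(D, termination_met, leaf_prediction, choose_split_feature):
    """D is a list of (x, y) pairs, where x is a dict mapping feature name -> value."""
    if termination_met(D):
        return {"type": "leaf", "prediction": leaf_prediction(D)}

    A = choose_split_feature(D)                   # greedy choice by the split criterion
    values = {x[A] for x, _ in D}                 # possible values of feature A in D
    children = {}
    for v in values:
        D_v = [(x, y) for x, y in D if x[A] == v]  # subset of D with value v for A
        children[v] = decision_tree_learn(D_v, termination_met,
                                          leaf_prediction, choose_split_feature)
    return {"type": "node", "feature": A, "children": children}
```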
Example
• Call DecisionTreeLearn(𝐷) on the eight-point dataset above.
• Termination conditions are not met, so find a feature to split; split on x₁.
  • x₁ = +1 branch: all four points have label +1 => terminate; return a leaf with prediction +1.
  • x₁ = -1 branch: labels are mixed (two +1, two -1) => don't terminate; find the next feature to split and recurse with DecisionTreeLearn on this subset.
Example Heuristics
• Termination conditions
  • When the dataset is empty
  • When all labels are the same
  • When all features are the same
  • When the depth of the tree is too deep
  • …
• Leaf predictions
  • Majority voting
  • Average (for regression)
  • …
• Split criteria?
Split Criteria
• Which feature would you choose to split?
• Want the tree to be "smaller"
  • Intuition: choose the feature for which the resulting labels are more "pure"
  • Example: choose the one maximizing information gain => ID3 algorithm
• Example dataset:

    x₁   x₂   y
    +1   +1   +1
    +1   -1   +1
    -1   +1   -1
    -1   -1   -1

[Figure: the two candidate splits — splitting on x₁ gives two pure subsets (all +1 and all -1), while splitting on x₂ gives two mixed subsets (one +1 and one -1 each)]
Brief Intro to Information Entropy
• Assume there are 𝐾 possible labels
• Entropy:
  • H(D) = Σ_{i=1}^{K} p_i · log₂(1/p_i)   (see the sketch below)
  • p_i: ratio of points with label i in the data
  • By definition, 0 · log₂(1/0) = 0 and 1 · log₂(1/1) = 0
• Interpretations of entropy
  • Expected number of bits to encode a distribution
  • Higher entropy => data is less "pure"
  • "Pure" data => all labels are +1 or -1 => entropy = 0
• Want to choose splits that lead to pure data, i.e., lower entropy
[Figure: binary case with K = 2 — H(D) plotted against p₁; it is 0 at p₁ = 0 and p₁ = 1 and reaches its maximum of 1 at p₁ = 0.5]
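A small Python sketch of this entropy computation (the helper name and the ±1 label encoding are just illustrative):

```python
import math
from collections import Counter

def entropy(labels):
    """H(D) = sum_i p_i * log2(1 / p_i), with 0 * log2(1/0) taken to be 0."""
    n = len(labels)
    return sum((c / n) * math.log2(n / c) for c in Counter(labels).values())

# Binary examples: pure data has entropy 0, a 50/50 split has entropy 1.
print(entropy([+1, +1, +1, +1]))   # 0.0
print(entropy([+1, +1, -1, -1]))   # 1.0
```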
ID3: Using Information Gain as Selection Criteria
• Information gain of choosing feature 𝐴 to split:
  • Gain(D, A) = H(D) − Σ_i (|D_i| / |D|) · H(D_i)   [the amount of decrease in entropy]
• ID3: choose the split that maximizes Gain(D, A)
• Notation: |D| is the number of points in D
• ID3 termination conditions
  • If all labels are the same
  • If all features are the same
  • If the dataset is empty
• ID3 leaf predictions
  • Most common label (majority voting)
• ID3 split criteria
  • Information gain
ID3: Using Information Gain as Selection Criteria (Example)
• Consider again the four-point dataset from the split-criteria slide:
  • H(D) = 0.5 log₂ 2 + 0.5 log₂ 2 = 1
  • H(D_{x₁=+1}) = 0,  H(D_{x₁=-1}) = 0,  H(D_{x₂=+1}) = 1,  H(D_{x₂=-1}) = 1
  • Gain(D, x₁) = 1,  Gain(D, x₂) = 0
• ID3 will choose x₁ as the next split attribute (see the sketch below)
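A quick sketch that reproduces this calculation on the four-point dataset (helper and variable names are illustrative, not from the lecture):

```python
import math
from collections import Counter

def entropy(labels):
    n = len(labels)
    return sum((c / n) * math.log2(n / c) for c in Counter(labels).values())

def information_gain(D, feature):
    """Gain(D, A) = H(D) - sum_v (|D_v| / |D|) * H(D_v)."""
    gain = entropy([y for _, y in D])
    for v in {x[feature] for x, _ in D}:
        D_v = [y for x, y in D if x[feature] == v]
        gain -= (len(D_v) / len(D)) * entropy(D_v)
    return gain

# The four-point dataset from the slide: features x1, x2 and label y.
D = [({"x1": +1, "x2": +1}, +1),
     ({"x1": +1, "x2": -1}, +1),
     ({"x1": -1, "x2": +1}, -1),
     ({"x1": -1, "x2": -1}, -1)]

print(information_gain(D, "x1"))   # 1.0 -> ID3 splits on x1
print(information_gain(D, "x2"))   # 0.0
```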
Further Addressing Overfitting
• More regularization (constrain 𝐻) — see the sketch after this list
  • Do not split leaves past a fixed depth
  • Do not split leaves with fewer than 𝑐 labels
  • Do not split leaves where the maximal information gain is less than 𝜏
• Pruning (removing leaves)
  • Evaluate each split using a validation set and compare the validation error with and without that split (replacing it with the most common label at that point)
  • Use a statistical test to examine whether the split is "informative" (leads to different enough subtrees)
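One way these depth / size / gain constraints could be folded into the termination condition of the greedy template is sketched below. The parameter names (max_depth, min_points, min_gain) and the assumption that the recursion tracks the current depth are mine, not from the slides.

```python
# Sketch: combining the three regularization constraints into one termination test.
# Assumes the recursion passes the current depth and that best_gain_fn(D) returns
# the largest information gain achievable over all remaining features.

def make_termination_check(max_depth, min_points, min_gain, best_gain_fn):
    def termination_met(D, depth):
        if len(D) == 0:
            return True                       # empty dataset: nothing to split
        if depth >= max_depth:
            return True                       # do not split leaves past a fixed depth
        if len(D) < min_points:
            return True                       # do not split leaves with too few points
        if best_gain_fn(D) < min_gain:
            return True                       # no split is informative enough
        return False
    return termination_met
```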
More Discussions
• Real-valued features (continuous 𝑥)
  • Need to select a threshold for branching (see the sketch below)
• Regression (continuous 𝑦)
  • Change the leaf prediction: e.g., average instead of majority vote
  • Change the measure of "purity" of the data: e.g., squared error of the data
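A rough sketch of threshold selection for a single continuous feature in the regression setting, using squared error around the leaf average as the impurity measure. The midpoint-scanning strategy and all names are illustrative assumptions, not the lecture's prescribed method.

```python
def squared_error(ys):
    """Impurity of a leaf that predicts the average of its labels."""
    if not ys:
        return 0.0
    mean = sum(ys) / len(ys)
    return sum((y - mean) ** 2 for y in ys)

def best_threshold(xs, ys):
    """Scan midpoints between sorted feature values; return the threshold
    minimizing the total squared error of the two resulting branches."""
    pairs = sorted(zip(xs, ys))
    best_t, best_cost = None, float("inf")
    for i in range(1, len(pairs)):
        t = (pairs[i - 1][0] + pairs[i][0]) / 2            # candidate threshold
        left = [y for x, y in pairs if x < t]
        right = [y for x, y in pairs if x >= t]
        cost = squared_error(left) + squared_error(right)
        if cost < best_cost:
            best_t, best_cost = t, cost
    return best_t

print(best_threshold([1.0, 2.0, 3.0, 10.0], [1.0, 1.1, 0.9, 5.0]))  # 6.5
```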
Ensemble Learning
The focus of the next two lectures
Ensemble Learning
• Assume we are given a set of learned hypotheses
  • 𝑔₁, 𝑔₂, …, 𝑔_T
• What can we do?
  • Use validation to pick the best one
  • What if none of them is good enough?
• Can we aggregate them? (A minimal sketch of one simple aggregation rule follows.)
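As a preview, here is a minimal sketch of one simple aggregation rule: a uniform majority vote over binary classifiers. The specific aggregation schemes used by bagging and boosting come in the next lectures; the names and toy hypotheses below are illustrative.

```python
def majority_vote(hypotheses, x):
    """Aggregate binary classifiers g_1, ..., g_T by a uniform majority vote."""
    total = sum(g(x) for g in hypotheses)     # each g returns +1 or -1; ties go to +1
    return +1 if total >= 0 else -1

# Example with three toy hypotheses on a scalar input.
g1 = lambda x: +1 if x > 0 else -1
g2 = lambda x: +1 if x > 1 else -1
g3 = lambda x: +1 if x > -1 else -1
print(majority_vote([g1, g2, g3], 0.5))   # +1 (two of the three vote +1)
```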
Is Aggregation a Good Idea?
• At a 1906 country fair, ~800 people participated in a contest to guess the weight of an ox.
• A reward was given to the person with the closest guess.
• The average guess was 1,197 lbs. The true answer was 1,198 lbs.
Is Aggregation a Good Idea?
• Maybe
  • If the hypotheses are "diverse" and "on average" they seem good
• Questions:
  • How do we find a set of hypotheses that are diverse and "on average" good?
  • How do we aggregate the set of hypotheses?
• Ensemble learning
  • Bagging – Random Forest (March 17)
  • Boosting – AdaBoost (March 19)