CSE 417T Introduction to Machine Learning Lecture 15 Instructor: Chien-Ju (CJ) Ho



Page 1: CSE 417T Introduction to Machine Learning (chienjuho.com/courses/cse417t/lecture15.pdf)

CSE 417T Introduction to Machine Learning

Lecture 15. Instructor: Chien-Ju (CJ) Ho

Page 2:

Logistics: Homework 3

• Homework 3 is posted on the course website
• Due on March 25 (Wednesday), 2020
• Homework 4 will be announced before the due date of homework 3

Page 3:

Discussion on Exam 1

Page 4:

417T Part 2: Machine Learning Techniques

Page 5:

Page 6:

Focus of the rest of the semester

Page 7:

Decision Tree

Page 8:

Decision Tree Hypothesis

• x⃗ = (annual income, have debt)
• y ∈ {approve, deny}

Credit Card Approval Example (tree diagram):
  Annual Income
  ├─ ≥ 100k → Approve
  ├─ ≥ 20k and < 100k → have debt? (yes → Deny, no → Approve)
  └─ < 20k → Deny
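A decision tree hypothesis like this is just a nest of if-else rules. A minimal sketch of the credit-card example as code (the function name is mine, and the yes/no leaf assignment under "have debt?" is my reading of the diagram):

```python
def approve_credit(annual_income, has_debt):
    """Decision-tree hypothesis for the credit-card example:
    split on annual income first, then on debt status."""
    if annual_income >= 100_000:       # high income -> approve
        return "approve"
    elif annual_income >= 20_000:      # mid income -> check debt
        return "deny" if has_debt else "approve"
    else:                              # low income -> deny
        return "deny"
```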

Page 9:

Decision Tree Hypothesis

• Pros
  • Easy to interpret (interpretability is getting attention and is important in some domains)
  • Can handle multi-type data (numerical, categorical, …)
  • Easy to implement (a bunch of if-else rules)

• Cons
  • Generally speaking, bad generalization
  • VC dimension is infinite
  • High variance (a small change in the data leads to a very different hypothesis)
  • Easily overfits

• Why do we care?
  • One of the classical models
  • Building block for other models (e.g., random forest)

Credit Card Approval Example (same tree diagram as the previous slide)


Page 11:

Learning Decision Tree from Data

• Given dataset D, how to learn a decision tree hypothesis?

  x1  x2  x3  y
  +1  +1  +1  +1
  +1  +1  -1  +1
  +1  -1  +1  +1
  +1  -1  -1  +1
  -1  +1  +1  +1
  -1  +1  -1  +1
  -1  -1  +1  -1
  -1  -1  -1  -1

• Potential approach
  • Find g = argmin_{h∈H} E_in(h)
  • Multiple decision trees achieve zero E_in
    (Figure: several different trees consistent with the data, splitting on x1, x2, x3 in different orders and depths)

Which one do you think might generalize better?

Page 12:

Learning Decision Tree from Data

• Conceptual intuition to deal with overfitting
  • Regularization: constrain H

• Informally,

  minimize E_in(h)
  subject to size(tree) ≤ C

• This optimization is generally computationally intractable.
• Most decision tree learning algorithms rely on heuristics to approximate this goal.

Page 13:

Greedy-Based Decision Tree Algorithm

• DecisionTreeLearn(D): input a dataset D, output a decision tree hypothesis
  • Create a root node r
  • If termination conditions are met:
    • return a single-node tree with leaf prediction based on D
  • Else: greedily find a feature A to split according to split criteria
    • For each possible value v_i of A:
      • Let D_i be the dataset containing data with value v_i for feature A
      • Create a subtree DecisionTreeLearn(D_i) as a child of root r

• Most decision tree learning algorithms follow this template, but with different choices of heuristics
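The template above can be sketched in Python, assuming D is a list of (features-dict, label) pairs; the split criterion is deliberately left as a placeholder (here: just the first remaining feature), and all names are mine, not from the lecture:

```python
from collections import Counter

def decision_tree_learn(D, features, max_depth=5):
    """Greedy decision-tree template: terminate, or split on a feature
    and recurse on each value's subset."""
    labels = [y for _, y in D]
    # Termination: empty data, uniform labels, no features left, or depth limit
    if not D or len(set(labels)) == 1 or not features or max_depth == 0:
        leaf = Counter(labels).most_common(1)[0][0] if labels else None
        return {"leaf": leaf}
    # Placeholder split criterion; ID3 would pick the max-information-gain feature
    A = features[0]
    children = {}
    for v in {x[A] for x, _ in D}:
        D_v = [(x, y) for x, y in D if x[A] == v]
        children[v] = decision_tree_learn(D_v,
                                          [f for f in features if f != A],
                                          max_depth - 1)
    return {"split": A, "children": children}
```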

Page 14:

Example

• Split the eight-point dataset on x1:

  x1 = +1 subset (termination conditions met: all labels are the same):
  x1  x2  x3  y
  +1  +1  +1  +1
  +1  +1  -1  +1
  +1  -1  +1  +1
  +1  -1  -1  +1
  → terminate; leaf prediction +1

  x1 = -1 subset (termination conditions not met):
  x1  x2  x3  y
  -1  +1  +1  +1
  -1  +1  -1  +1
  -1  -1  +1  -1
  -1  -1  -1  -1
  → don't terminate; find the next feature to split and call DecisionTreeLearn recursively on each resulting subset

Page 15:

Example Heuristics

• Termination conditions
  • When the dataset is empty
  • When all labels are the same
  • When all features are the same
  • When the depth of the tree is too deep
  • …

• Leaf predictions
  • Majority voting
  • Average (for regression)
  • …

• Split criteria?

Page 16:

Split Criteria

• Which feature would you choose to split?
• Want the tree to be "smaller"
  • Intuition: choose the one whose resulting subsets are more "pure" in their labels
  • Example: choose the one maximizing information gain ⇒ ID3 algorithm

  Dataset:
  x1  x2  y
  +1  +1  +1
  +1  -1  +1
  -1  +1  -1
  -1  -1  -1

  Splitting on x1 gives subsets {(+1,+1,+1), (+1,-1,+1)} and {(-1,+1,-1), (-1,-1,-1)}: each pure.
  Splitting on x2 gives subsets {(+1,+1,+1), (-1,+1,-1)} and {(+1,-1,+1), (-1,-1,-1)}: each mixed.

Page 17:

Brief Intro to Information Entropy

• Assume there are K possible labels
• Entropy:
  H(D) = ∑_{i=1}^{K} p_i log₂(1/p_i)
  • p_i: ratio of points with label i in the data

• Binary case with K = 2
  By definition, 0·log₂(1/0) = 0 and 1·log₂(1/1) = 0

• Interpretations of entropy
  • Expected # bits to encode a distribution
  • Higher entropy ⇒ data is less "pure"
  • "Pure" data ⇒ all labels are +1 or -1 ⇒ entropy = 0

(Figure: plot of H(D) against p₁ for the binary case, 0 at p₁ ∈ {0, 1} and maximal at p₁ = 0.5)
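A direct transcription of the definition above (the helper name is mine; the 0·log₂(1/0) = 0 convention holds automatically because only labels that actually occur contribute a term):

```python
import math

def entropy(labels):
    """H(D) = sum_i p_i * log2(1/p_i), where p_i is the ratio of
    points with label i in the data."""
    n = len(labels)
    if n == 0:
        return 0.0
    counts = {}
    for y in labels:
        counts[y] = counts.get(y, 0) + 1
    # log2(n/c) == log2(1/p_i) for p_i = c/n
    return sum((c / n) * math.log2(n / c) for c in counts.values())
```

For example, a pure dataset has entropy 0, and a balanced binary dataset has entropy 1 bit.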

Page 18:

Brief Intro to Information Entropy

• Want to choose splits that lead to pure data, i.e., lower entropy

Page 19:

ID3: Using Information Gain as Selection Criteria

• Information gain of choosing feature A to split:
  Gain(D, A) = H(D) − ∑_i (|D_i| / |D|) · H(D_i)   [the amount of decrease in entropy]
  Notation: |D| is the number of points in D

• ID3: choose the split that maximizes Gain(D, A)

• ID3 termination conditions
  • If all labels are the same
  • If all features are the same
  • If the dataset is empty

• ID3 leaf predictions
  • Most common label (majority voting)

• ID3 split criteria
  • Information gain

Page 20:

ID3: Using Information Gain as Selection Criteria

• Information gain of choosing feature A to split:
  Gain(D, A) = H(D) − ∑_i (|D_i| / |D|) · H(D_i)
• ID3: choose the split that maximizes Gain(D, A)

  Dataset:
  x1  x2  y
  +1  +1  +1
  +1  -1  +1
  -1  +1  -1
  -1  -1  -1

  H(D) = 0.5·log₂2 + 0.5·log₂2 = 1
  H(D_{x1=+1}) = 0,  H(D_{x1=-1}) = 0,  H(D_{x2=+1}) = 1,  H(D_{x2=-1}) = 1
  Gain(D, x1) = 1,  Gain(D, x2) = 0

ID3 will choose x1 as the next split attribute
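As a sanity check, the numbers above can be reproduced in a few lines (the entropy and gain helpers are my own implementations of the definitions on this slide):

```python
import math

def entropy(labels):
    """H(D) = -sum_i p_i log2 p_i over labels that occur."""
    n = len(labels)
    probs = [labels.count(y) / n for y in set(labels)]
    return -sum(p * math.log2(p) for p in probs)

def gain(D, feature):
    """Gain(D, A) = H(D) - sum_v |D_v|/|D| * H(D_v)."""
    labels = [y for _, y in D]
    rem = 0.0
    for v in {x[feature] for x, _ in D}:
        D_v = [y for x, y in D if x[feature] == v]
        rem += len(D_v) / len(D) * entropy(D_v)
    return entropy(labels) - rem

# The four-point dataset from the slide
D = [({"x1": +1, "x2": +1}, +1),
     ({"x1": +1, "x2": -1}, +1),
     ({"x1": -1, "x2": +1}, -1),
     ({"x1": -1, "x2": -1}, -1)]
```

Here gain(D, "x1") evaluates to 1 and gain(D, "x2") to 0, matching the slide, so ID3 splits on x1.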

Page 21:

Further Addressing Overfitting

• More regularization (constrain H)
  • Do not split leaves past a fixed depth
  • Do not split leaves with fewer than c labels
  • Do not split leaves where the maximal information gain is less than τ

• Pruning (removing leaves)
  • Evaluate each split using a validation set: compare the validation error with and without that split (replacing it with the most common label at that point)
  • Use a statistical test to examine whether the split is "informative" (leads to different enough subtrees)

Page 22:

More Discussions

• Real-valued features (continuous x)
  • Need to select a threshold for branching

• Regression (continuous y)
  • Change leaf prediction: e.g., average instead of majority vote
  • Change the measure of "purity" of the data: e.g., squared error of the data
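For threshold selection, one common heuristic (a sketch of one option, not necessarily the one intended in lecture) is to consider only the midpoints between consecutive distinct sorted values of the feature, since any threshold between the same pair of points induces the same split:

```python
def candidate_thresholds(values):
    """Midpoints between consecutive distinct sorted feature values:
    the only thresholds that yield distinct binary splits."""
    vs = sorted(set(values))
    return [(a + b) / 2 for a, b in zip(vs, vs[1:])]
```

Each candidate threshold can then be scored like a categorical split, e.g. by information gain.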

Page 23:

Ensemble Learning

The focus of the next two lectures

Page 24:

Ensemble Learning

• Assume we are given a set of learned hypotheses
  • g₁, g₂, …, g_T

• What can we do?
  • Use validation to pick the best one
  • What if none of them is good enough?
  • Can we aggregate them?
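One simple aggregation scheme, assuming each gₜ maps an input to a label, is uniform majority voting (a sketch; bagging and boosting, coming next, refine how the hypotheses are generated and weighted):

```python
from collections import Counter

def aggregate_classifiers(hypotheses, x):
    """Aggregate learned hypotheses g_1..g_T by uniform majority vote."""
    votes = [g(x) for g in hypotheses]
    return Counter(votes).most_common(1)[0][0]
```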

Page 25:

Is Aggregation a Good Idea?

• At a 1906 country fair, ~800 people participated in a contest to guess the weight of an ox.
• The reward was given to the person with the closest guess.
• The average guess was 1,197 lbs. The true answer was 1,198 lbs.

Page 26:

Is Aggregation a Good Idea?

• Maybe
  • If the hypotheses are "diverse" and "on average" they seem good

• Questions:
  • How do we find a set of hypotheses that are diverse and "on average" good?
  • How do we aggregate the set of hypotheses?

• Ensemble learning
  • Bagging → Random Forest (March 17)
  • Boosting → AdaBoost (March 19)