CSE 417T Introduction to Machine Learning Lecture 15 Instructor: Chien-Ju (CJ) Ho



Page 1: CSE 417T Introduction to Machine Learning (chienjuho.com/courses/cse417t/lecture15.pdf)

CSE 417T Introduction to Machine Learning

Lecture 15. Instructor: Chien-Ju (CJ) Ho

Page 2:

Logistics: Homework 3

• Homework 3 is posted on the course website
• Due on March 25 (Wednesday), 2020
• Homework 4 will be announced before the due date of homework 3

Page 3:

Discussion on Exam 1

Page 4:

417T Part 2: Machine Learning Techniques

Page 5:

Page 6:

Focus of the rest of the semester

Page 7:

Decision Tree

Page 8:

Decision Tree Hypothesis

• x⃗ = (annual income, have debt)
• y ∈ {approve, deny}

Credit Card Approval Example (tree diagram):
  Annual Income
  ├─ ≥ 100k → Approve
  ├─ ≥ 20k and < 100k → have debt? (yes → Deny, no → Approve)
  └─ < 20k → Deny
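A decision tree hypothesis like this is just a nest of if-else rules. A minimal sketch of the credit-card example as code (the function name is mine, and the yes/no leaf assignment under "have debt?" is my reading of the diagram):

```python
def approve_credit(annual_income, has_debt):
    """Decision-tree hypothesis for the credit-card example:
    split on annual income first, then on debt status."""
    if annual_income >= 100_000:       # high income -> approve
        return "approve"
    elif annual_income >= 20_000:      # mid income -> check debt
        return "deny" if has_debt else "approve"
    else:                              # low income -> deny
        return "deny"
```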

Page 9:

Decision Tree Hypothesis

• Pros
  • Easy to interpret (interpretability is getting attention and is important in some domains)
  • Can handle multi-type data (numerical, categorical, …)
  • Easy to implement (a bunch of if-else rules)

• Cons
  • Generally speaking, bad generalization
  • VC dimension is infinite
  • High variance (a small change in the data leads to a very different hypothesis)
  • Easily overfits

• Why do we care?
  • One of the classical models
  • Building block for other models (e.g., random forest)

Credit Card Approval Example (same tree diagram as the previous slide)


Page 11:

Learning Decision Tree from Data

• Given dataset D, how to learn a decision tree hypothesis?

  x1  x2  x3  y
  +1  +1  +1  +1
  +1  +1  -1  +1
  +1  -1  +1  +1
  +1  -1  -1  +1
  -1  +1  +1  +1
  -1  +1  -1  +1
  -1  -1  +1  -1
  -1  -1  -1  -1

• Potential approach
  • Find g = argmin_{h∈H} E_in(h)
  • Multiple decision trees achieve zero E_in
    (Figure: several different trees consistent with the data, splitting on x1, x2, x3 in different orders and depths)

Which one do you think might generalize better?

Page 12:

Learning Decision Tree from Data

• Conceptual intuition to deal with overfitting
  • Regularization: constrain H

• Informally,

  minimize E_in(h)
  subject to size(tree) ≤ C

• This optimization is generally computationally intractable.
• Most decision tree learning algorithms rely on heuristics to approximate this goal.

Page 13:

Greedy-Based Decision Tree Algorithm

• DecisionTreeLearn(D): input a dataset D, output a decision tree hypothesis
  • Create a root node r
  • If termination conditions are met:
    • return a single-node tree with leaf prediction based on D
  • Else: greedily find a feature A to split according to split criteria
    • For each possible value v_i of A:
      • Let D_i be the dataset containing data with value v_i for feature A
      • Create a subtree DecisionTreeLearn(D_i) as a child of root r

• Most decision tree learning algorithms follow this template, but with different choices of heuristics
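The template above can be sketched in Python, assuming D is a list of (features-dict, label) pairs; the split criterion is deliberately left as a placeholder (here: just the first remaining feature), and all names are mine, not from the lecture:

```python
from collections import Counter

def decision_tree_learn(D, features, max_depth=5):
    """Greedy decision-tree template: terminate, or split on a feature
    and recurse on each value's subset."""
    labels = [y for _, y in D]
    # Termination: empty data, uniform labels, no features left, or depth limit
    if not D or len(set(labels)) == 1 or not features or max_depth == 0:
        leaf = Counter(labels).most_common(1)[0][0] if labels else None
        return {"leaf": leaf}
    # Placeholder split criterion; ID3 would pick the max-information-gain feature
    A = features[0]
    children = {}
    for v in {x[A] for x, _ in D}:
        D_v = [(x, y) for x, y in D if x[A] == v]
        children[v] = decision_tree_learn(D_v,
                                          [f for f in features if f != A],
                                          max_depth - 1)
    return {"split": A, "children": children}
```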

Page 14:

Example

• Split the eight-point dataset on x1:

  x1 = +1 subset (termination conditions met: all labels are the same):
  x1  x2  x3  y
  +1  +1  +1  +1
  +1  +1  -1  +1
  +1  -1  +1  +1
  +1  -1  -1  +1
  → terminate; leaf prediction +1

  x1 = -1 subset (termination conditions not met):
  x1  x2  x3  y
  -1  +1  +1  +1
  -1  +1  -1  +1
  -1  -1  +1  -1
  -1  -1  -1  -1
  → don't terminate; find the next feature to split and call DecisionTreeLearn recursively on each resulting subset

Page 15:

Example Heuristics

• Termination conditions
  • When the dataset is empty
  • When all labels are the same
  • When all features are the same
  • When the depth of the tree is too deep
  • …

• Leaf predictions
  • Majority voting
  • Average (for regression)
  • …

• Split criteria?

Page 16:

Split Criteria

• Which feature would you choose to split?
• Want the tree to be "smaller"
  • Intuition: choose the one whose resulting subsets are more "pure" in their labels
  • Example: choose the one maximizing information gain ⇒ ID3 algorithm

  Dataset:
  x1  x2  y
  +1  +1  +1
  +1  -1  +1
  -1  +1  -1
  -1  -1  -1

  Splitting on x1 gives subsets {(+1,+1,+1), (+1,-1,+1)} and {(-1,+1,-1), (-1,-1,-1)}: each pure.
  Splitting on x2 gives subsets {(+1,+1,+1), (-1,+1,-1)} and {(+1,-1,+1), (-1,-1,-1)}: each mixed.

Page 17:

Brief Intro to Information Entropy

• Assume there are K possible labels
• Entropy:
  H(D) = ∑_{i=1}^{K} p_i log₂(1/p_i)
  • p_i: ratio of points with label i in the data

• Binary case with K = 2
  By definition, 0·log₂(1/0) = 0 and 1·log₂(1/1) = 0

• Interpretations of entropy
  • Expected # bits to encode a distribution
  • Higher entropy ⇒ data is less "pure"
  • "Pure" data ⇒ all labels are +1 or -1 ⇒ entropy = 0

(Figure: plot of H(D) against p₁ for the binary case, 0 at p₁ ∈ {0, 1} and maximal at p₁ = 0.5)
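A direct transcription of the definition above (the helper name is mine; the 0·log₂(1/0) = 0 convention holds automatically because only labels that actually occur contribute a term):

```python
import math

def entropy(labels):
    """H(D) = sum_i p_i * log2(1/p_i), where p_i is the ratio of
    points with label i in the data."""
    n = len(labels)
    if n == 0:
        return 0.0
    counts = {}
    for y in labels:
        counts[y] = counts.get(y, 0) + 1
    # log2(n/c) == log2(1/p_i) for p_i = c/n
    return sum((c / n) * math.log2(n / c) for c in counts.values())
```

For example, a pure dataset has entropy 0, and a balanced binary dataset has entropy 1 bit.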

Page 18:

Brief Intro to Information Entropy

• Want to choose splits that lead to pure data, i.e., lower entropy

Page 19:

ID3: Using Information Gain as Selection Criteria

• Information gain of choosing feature A to split:
  Gain(D, A) = H(D) − ∑_i (|D_i| / |D|) · H(D_i)   [the amount of decrease in entropy]
  Notation: |D| is the number of points in D

• ID3: choose the split that maximizes Gain(D, A)

• ID3 termination conditions
  • If all labels are the same
  • If all features are the same
  • If the dataset is empty

• ID3 leaf predictions
  • Most common label (majority voting)

• ID3 split criteria
  • Information gain

Page 20:

ID3: Using Information Gain as Selection Criteria

• Information gain of choosing feature A to split:
  Gain(D, A) = H(D) − ∑_i (|D_i| / |D|) · H(D_i)
• ID3: choose the split that maximizes Gain(D, A)

  Dataset:
  x1  x2  y
  +1  +1  +1
  +1  -1  +1
  -1  +1  -1
  -1  -1  -1

  H(D) = 0.5·log₂2 + 0.5·log₂2 = 1
  H(D_{x1=+1}) = 0,  H(D_{x1=-1}) = 0,  H(D_{x2=+1}) = 1,  H(D_{x2=-1}) = 1
  Gain(D, x1) = 1,  Gain(D, x2) = 0

ID3 will choose x1 as the next split attribute
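As a sanity check, the numbers above can be reproduced in a few lines (the entropy and gain helpers are my own implementations of the definitions on this slide):

```python
import math

def entropy(labels):
    """H(D) = -sum_i p_i log2 p_i over labels that occur."""
    n = len(labels)
    probs = [labels.count(y) / n for y in set(labels)]
    return -sum(p * math.log2(p) for p in probs)

def gain(D, feature):
    """Gain(D, A) = H(D) - sum_v |D_v|/|D| * H(D_v)."""
    labels = [y for _, y in D]
    rem = 0.0
    for v in {x[feature] for x, _ in D}:
        D_v = [y for x, y in D if x[feature] == v]
        rem += len(D_v) / len(D) * entropy(D_v)
    return entropy(labels) - rem

# The four-point dataset from the slide
D = [({"x1": +1, "x2": +1}, +1),
     ({"x1": +1, "x2": -1}, +1),
     ({"x1": -1, "x2": +1}, -1),
     ({"x1": -1, "x2": -1}, -1)]
```

Here gain(D, "x1") evaluates to 1 and gain(D, "x2") to 0, matching the slide, so ID3 splits on x1.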

Page 21:

Further Addressing Overfitting

• More regularization (constrain H)
  • Do not split leaves past a fixed depth
  • Do not split leaves with fewer than c labels
  • Do not split leaves where the maximal information gain is less than τ

• Pruning (removing leaves)
  • Evaluate each split using a validation set: compare the validation error with and without that split (replacing it with the most common label at that point)
  • Use a statistical test to examine whether the split is "informative" (leads to different enough subtrees)

Page 22:

More Discussions

• Real-valued features (continuous x)
  • Need to select a threshold for branching

• Regression (continuous y)
  • Change leaf prediction: e.g., average instead of majority vote
  • Change the measure of "purity" of the data: e.g., squared error of the data
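For threshold selection, one common heuristic (a sketch of one option, not necessarily the one intended in lecture) is to consider only the midpoints between consecutive distinct sorted values of the feature, since any threshold between the same pair of points induces the same split:

```python
def candidate_thresholds(values):
    """Midpoints between consecutive distinct sorted feature values:
    the only thresholds that yield distinct binary splits."""
    vs = sorted(set(values))
    return [(a + b) / 2 for a, b in zip(vs, vs[1:])]
```

Each candidate threshold can then be scored like a categorical split, e.g. by information gain.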

Page 23:

Ensemble Learning

The focus of the next two lectures

Page 24:

Ensemble Learning

• Assume we are given a set of learned hypotheses
  • g₁, g₂, …, g_T

• What can we do?
  • Use validation to pick the best one
  • What if none of them is good enough?
  • Can we aggregate them?
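One simple aggregation scheme, assuming each gₜ maps an input to a label, is uniform majority voting (a sketch; bagging and boosting, coming next, refine how the hypotheses are generated and weighted):

```python
from collections import Counter

def aggregate_classifiers(hypotheses, x):
    """Aggregate learned hypotheses g_1..g_T by uniform majority vote."""
    votes = [g(x) for g in hypotheses]
    return Counter(votes).most_common(1)[0][0]
```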

Page 25:

Is Aggregation a Good Idea?

• At a 1906 country fair, ~800 people participated in a contest to guess the weight of an ox.
• The reward was given to the person with the closest guess.
• The average guess was 1,197 lbs. The true answer was 1,198 lbs.

Page 26:

Is Aggregation a Good Idea?

• Maybe
  • If the hypotheses are "diverse" and "on average" they seem good

• Questions:
  • How do we find a set of hypotheses that are diverse and "on average" good?
  • How do we aggregate the set of hypotheses?

• Ensemble learning
  • Bagging → Random Forest (March 17)
  • Boosting → AdaBoost (March 19)