Machine Learning & Data Mining CS/CNS/EE 155
Lecture 6: Boosting & Ensemble Selection
Today

• High-Level Overview of Ensemble Methods
• Boosting – Ensemble Method for Reducing Bias
• Ensemble Selection
Recall: Test Error

• "True" distribution: P(x,y)
  – Unknown to us
• Train: h_S(x) = y
  – Using training data $S = \{(x_i, y_i)\}_{i=1}^N$
  – Sampled from P(x,y)
• Test Error: $L_P(h_S) = \mathbb{E}_{(x,y)\sim P(x,y)}\big[L(y, h_S(x))\big]$
• Overfitting: Test Error >> Training Error
Training Set S:

Person | Age | Male? | Height > 55"
Alice | 14 | 0 | 1
Bob | 10 | 1 | 1
Carol | 13 | 0 | 1
Dave | 8 | 1 | 0
Erin | 11 | 0 | 0
Frank | 9 | 1 | 1
Gena | 8 | 0 | 0

True Distribution P(x,y) (samples continue indefinitely):

Person | Age | Male? | Height > 55"
James | 11 | 1 | 1
Jessica | 14 | 0 | 1
Alice | 14 | 0 | 1
Amy | 12 | 0 | 1
Bob | 10 | 1 | 1
Xavier | 9 | 1 | 0
Cathy | 9 | 0 | 1
Carol | 13 | 0 | 1
Eugene | 13 | 1 | 0
Rafael | 12 | 1 | 1
Dave | 8 | 1 | 0
Peter | 9 | 1 | 0
Henry | 13 | 1 | 0
Erin | 11 | 0 | 0
Rose | 7 | 0 | 0
Iain | 8 | 1 | 1
Paulo | 12 | 1 | 0
Margaret | 10 | 0 | 1
Frank | 9 | 1 | 1
Jill | 13 | 0 | 0
Leon | 10 | 1 | 0
Sarah | 12 | 0 | 0
Gena | 8 | 0 | 0
Patrick | 5 | 1 | 1
…

Test Error: $L(h) = \mathbb{E}_{(x,y)\sim P(x,y)}\big[L(h(x), y)\big]$, where h(x) predicts y.
Recall: Test Error

• Test Error: $L_P(h) = \mathbb{E}_{(x,y)\sim P(x,y)}\big[L(y, h(x))\big]$
• Treat h_S as a random variable: $h_S = \operatorname{argmin}_h \sum_{(x_i, y_i)\in S} L(y_i, h(x_i))$
• Expected Test Error: $\mathbb{E}_S[L_P(h_S)] = \mathbb{E}_S\big[\mathbb{E}_{(x,y)\sim P(x,y)}[L(y, h_S(x))]\big]$
  (aka the test error of the model class)
Recall: Bias-Variance Decomposition

• For squared error:

$\mathbb{E}_S[L_P(h_S)] = \mathbb{E}_S\big[\mathbb{E}_{(x,y)\sim P(x,y)}[L(y, h_S(x))]\big]$

$\mathbb{E}_S[L_P(h_S)] = \mathbb{E}_{(x,y)\sim P(x,y)}\Big[\mathbb{E}_S\big[(h_S(x) - H(x))^2\big] + (H(x) - y)^2\Big]$

where $H(x) = \mathbb{E}_S[h_S(x)]$ is the "average prediction on x"; the first term is the variance term and the second is the bias term.
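To make the two terms concrete, here is a minimal sketch (an illustration, not part of the lecture) that estimates them by simulation on a synthetic P(x,y) where the noise-free target is known. All names are illustrative; the bias term here is measured against the noise-free target, so irreducible label noise is excluded.

```python
import numpy as np

def sample_S(rng, N=20):
    """Draw one training set S from a known synthetic P(x,y)."""
    x = rng.uniform(0, 1, N)
    y = np.sin(2 * np.pi * x) + rng.normal(0, 0.3, N)
    return x, y

rng = np.random.default_rng(0)
x_test = np.linspace(0, 1, 200)
y_true = np.sin(2 * np.pi * x_test)        # noise-free target

# Fit a degree-1 polynomial (a "simple" model class) on many training sets S.
preds = []
for _ in range(500):
    x, y = sample_S(rng)
    w = np.polyfit(x, y, deg=1)
    preds.append(np.polyval(w, x_test))
preds = np.array(preds)

H = preds.mean(axis=0)                     # H(x) = E_S[h_S(x)]
variance = ((preds - H) ** 2).mean()       # E_S[(h_S(x) - H(x))^2], avg over x
bias = ((H - y_true) ** 2).mean()          # (H(x) - y)^2, avg over x
print(f"variance ~ {variance:.3f}, bias ~ {bias:.3f}")
```

Swapping in a higher-degree polynomial shows the tradeoff: bias drops while variance grows.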
Recall: Bias-Variance Decomposition

[Figure: three model classes fit to the same data, each panel showing the bias and variance of the fitted models.]

Some models experience high test error due to high bias. (Model class too simple to make accurate predictions.)
Some models experience high test error due to high variance. (Model class unstable due to insufficient training data.)
General Concept: Ensemble Methods

• Combine multiple learning algorithms or models
  – Previous lecture: Bagging & Random Forests
  – Today: Boosting & Ensemble Selection
• "Meta-learning" approach
  – Does not innovate on the base learning algorithm/model (decision trees, SVMs, etc.)
  – Ex: Bagging
    • New training sets via bootstrapping
    • Combines by averaging predictions
Intuition: Why Ensemble Methods Work

• Bias-variance tradeoff!
• Bagging reduces the variance of low-bias models
  – Low-bias models are "complex" and unstable
  – Bagging averages them together to create stability
• Boosting reduces the bias of low-variance models
  – Low-variance models are simple, with high bias
  – Boosting trains a sequence of simple models
  – The sum of simple models is complex/accurate
Boosting: "The Strength of Weak Learnability"*
* http://www.cs.princeton.edu/~schapire/papers/strengthofweak.pdf
Terminology: Shallow Decision Trees

• Decision trees with only a few nodes
• Very high bias & low variance
  – Different training sets lead to very similar trees
  – Error is high (barely better than a static baseline)
• Extreme case: "decision stumps"
  – Trees with exactly 1 split
Stability of Shallow Trees

• Tends to learn more-or-less the same model.
• h_S(x) has low variance
  – Over the randomness of the training set S
Terminology: Weak Learning

• Error rate: $\varepsilon_{h,P} = \mathbb{E}_{P(x,y)}\big[\mathbf{1}_{[h(x)\neq y]}\big]$
• Weak classifier: $\varepsilon_{h,P}$ slightly better than 0.5
  – Slightly better than random guessing
• Weak learner: can learn a weak classifier
Shallow decision trees are weak classifiers!
Weak learners are low variance & high bias!
How to "Boost" Weak Models?

• Weak models are high bias & low variance
• Bagging would not work
  – It reduces variance, not bias

Expected test error over the randomness of S (squared loss):
$\mathbb{E}_S[L_P(h_S)] = \mathbb{E}_{(x,y)\sim P(x,y)}\Big[\mathbb{E}_S\big[(h_S(x) - H(x))^2\big] + (H(x) - y)^2\Big]$
with $H(x) = \mathbb{E}_S[h_S(x)]$ (the "average prediction on x"); variance term + bias term.
First Try (for Regression)

• 1-dimensional regression
• Learn a decision stump
  – (single split, predict the mean of each of the two partitions)

Training set S:

x | y
0 | 0
1 | 1
2 | 4
3 | 9
4 | 16
5 | 25
6 | 36

Repeatedly fit a stump to the residuals, where y_t = y − h_{1:t−1}(x) ("residual") and h_{1:t}(x) = h_1(x) + … + h_t(x):

y1 | h1(x) | y2 | h2(x) | h1:2(x) | y3 | h3(x) | h1:3(x)
0 | 6 | -6 | -5.5 | 0.5 | -0.5 | -0.55 | -0.05
1 | 6 | -5 | -5.5 | 0.5 | 0.5 | -0.55 | -0.05
4 | 6 | -2 | 2.2 | 8.2 | -4.2 | -0.55 | 7.65
9 | 6 | -3 | 2.2 | 8.2 | 0.8 | -0.55 | 7.65
16 | 6 | 10 | 2.2 | 8.2 | 7.8 | -0.55 | 7.65
25 | 30.5 | -5.5 | 2.2 | 32.7 | -7.7 | -0.55 | 32.15
36 | 30.5 | 5.5 | 2.2 | 32.7 | 3.3 | 3.3 | 36
First Try (for Regression)

[Figure: for t = 1…4, the residual targets y_t, the fitted stump h_t, and the running ensemble h_{1:t}; the fit to the training data improves each round.]
y_t = y − h_{1:t−1}(x), h_{1:t}(x) = h_1(x) + … + h_t(x)
Gradient Boosting (Simple Version)
(for regression only)
http://statweb.stanford.edu/~jhf/ftp/trebst.pdf

h(x) = h1(x) + h2(x) + … + hn(x)

Given $S = \{(x_i, y_i)\}_{i=1}^N$:
Train h1 on $S_1 = \{(x_i, y_i)\}_{i=1}^N$
Train h2 on $S_2 = \{(x_i, y_i - h_1(x_i))\}_{i=1}^N$
…
Train hn on $S_n = \{(x_i, y_i - h_{1:n-1}(x_i))\}_{i=1}^N$

(Why is it called "gradient"? Answer in the next slides.)
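The pipeline above is easy to make concrete. Below is a minimal sketch (not the lecture's own code) of this simple version, using scikit-learn's DecisionTreeRegressor with max_depth=1 as the stump learner; fit_gradient_boosting and predict are illustrative helper names.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def fit_gradient_boosting(X, y, n_rounds=3):
    """Fit an ensemble h_{1:n}(x) = h_1(x) + ... + h_n(x) of stumps."""
    stumps = []
    residual = y.copy()                          # y_1 = y
    for _ in range(n_rounds):
        stump = DecisionTreeRegressor(max_depth=1)   # single split
        stump.fit(X, residual)                   # train on (x_i, y_i - h_{1:t-1}(x_i))
        residual -= stump.predict(X)             # update residual targets
        stumps.append(stump)
    return stumps

def predict(stumps, X):
    return sum(s.predict(X) for s in stumps)

# The y = x^2 example from the slides:
X = np.arange(7, dtype=float).reshape(-1, 1)
y = X.ravel() ** 2
stumps = fit_gradient_boosting(X, y, n_rounds=3)
print(predict(stumps, X))   # residual fits improve as n_rounds grows
```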
Axis-Aligned Gradient Descent
(for a linear model)

• Linear model: h(x) = w^T x
• Squared loss: L(y, y') = (y − y')²
• Similar to gradient descent
  – But only allow axis-aligned update directions
  – Updates are of the form: $w = w - \eta\, g_d\, e_d$

where $e_d = (0, \dots, 0, 1, 0, \dots, 0)^T$ is the unit vector along the d-th dimension, $g = \nabla_w \sum_i L(y_i, w^T x_i)$ is the gradient over the training set $S = \{(x_i, y_i)\}_{i=1}^N$, and $g_d$ is the projection of the gradient along the d-th dimension. Update along the axis with the greatest projection.
Axis-Aligned Gradient Descent

[Figure: descent path restricted to axis-aligned steps; each step updates along the axis with the largest projection.]

This concept will become useful in ~5 slides; a small sketch follows.
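A minimal sketch of the axis-aligned update above for a linear model with squared loss; the data, step size, and helper name are illustrative assumptions.

```python
import numpy as np

def axis_aligned_gd(X, y, eta=0.1, n_steps=500):
    N, D = X.shape
    w = np.zeros(D)
    for _ in range(n_steps):
        g = 2 * X.T @ (X @ w - y) / N    # gradient of mean squared loss
        d = np.argmax(np.abs(g))         # axis with the largest projection
        w[d] -= eta * g[d]               # update only the d-th coordinate
    return w

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))
w_true = np.array([1.0, -2.0, 0.0, 0.5, 3.0])
y = X @ w_true
print(axis_aligned_gd(X, y))             # converges toward w_true
```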
Function Space & Ensemble Methods

• Linear model = one coefficient per feature
  – Linear over the input feature space
• Ensemble methods = one coefficient per model
  – Linear over a function space
  – E.g., h = h1 + h2 + … + hn
Functional Gradient Descent
http://statweb.stanford.edu/~jhf/ftp/trebst.pdf

h(x) = h1(x) + h2(x) + … + hn(x)

Train h1(x) on S' = {(x, y)}
Train h2(x) on S' = {(x, y − h1(x))}
…
Train hn(x) on S' = {(x, y − h1(x) − … − h_{n−1}(x))}

"Function Space" (span of all shallow trees)
(Potentially infinite; most coefficients are 0)
Coefficient = 1 for the models used; coefficient = 0 for all other models.
Properties of Function Space

• Generalization of a vector space
• Closed under addition
  – The sum of two functions is a function
• Closed under scalar multiplication
  – Multiplying a function by a scalar yields a function
• Gradient descent: adding a scaled function to an existing function
Function Space of Models

• Every "axis" in the space is a weak model
  – Potentially infinitely many axes/dimensions
• Complex models are linear combinations of weak models
  – h = η1 h1 + η2 h2 + … + ηn hn
  – Equivalent to a point in function space
    • Defined by the coefficients η
Recall: Axis-Aligned Gradient Descent

[Figure: project the gradient onto the closest axis (smallest squared distance) and update along it.]

Imagine each axis is a weak model. Every point is a linear combination of weak models.
Functional Gradient Descent
(Gradient descent in function space; derivation for squared loss)

• Init h(x) = 0
• Loop n = 1, 2, 3, 4, …:

$h = h - \operatorname{argmax}_{h_n}\Big[\operatorname{project}_{h_n}\Big(\nabla_h \sum_i L(y_i, h(x_i))\Big)\Big] = h + \operatorname{argmin}_{h_n} \sum_i \big(y_i - h(x_i) - h_n(x_i)\big)^2$

over the training set $S = \{(x_i, y_i)\}_{i=1}^N$.
Project the functional gradient onto the best function; this is equivalent to finding the h_n that minimizes the residual loss.
Reduction to Vector Space

• Function space = axis-aligned unit vectors
  – Weak model = axis-aligned unit vector: $e_d = (\dots, 0, 1, 0, \dots)^T$
• A linear model w has the same functional form:
  – w = η1 e1 + η2 e2 + … + ηD eD
  – A point in the space of D "axis-aligned functions"
• Axis-aligned gradient descent = functional gradient descent on the space of axis-aligned unit-vector weak models.
Gradient Boosting (Full Version)
(an instance of functional gradient descent; for regression only)
http://statweb.stanford.edu/~jhf/ftp/trebst.pdf

h_{1:n}(x) = h1(x) + η2 h2(x) + … + ηn hn(x)

Given $S = \{(x_i, y_i)\}_{i=1}^N$:
Train h1 on $S_1 = \{(x_i, y_i)\}_{i=1}^N$
Train h2 on $S_2 = \{(x_i, y_i - h_1(x_i))\}_{i=1}^N$
…
Train hn on $S_n = \{(x_i, y_i - h_{1:n-1}(x_i))\}_{i=1}^N$

See the reference for how to set η.
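For practical use, a shrinkage-style variant of this procedure (a constant learning rate η for every stage, a common simplification of the per-stage choice in the reference) is available in scikit-learn. A minimal usage sketch, with illustrative hyperparameter values:

```python
from sklearn.ensemble import GradientBoostingRegressor

# n_estimators = number of boosting rounds n; learning_rate plays the role
# of eta; max_depth=1 makes each weak model a decision stump.
model = GradientBoostingRegressor(n_estimators=100, learning_rate=0.1, max_depth=1)
# model.fit(X_train, y_train); y_pred = model.predict(X_test)
```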
Recap: Basic Boosting

• Ensemble of many weak classifiers.
  – h(x) = η1 h1(x) + η2 h2(x) + … + ηn hn(x)
• Goal: reduce bias using low-variance models
• Derivation: via gradient descent in function space
  – The space of weak classifiers
• We've only seen the regression version so far…
AdaBoost: Adaptive Boosting for Classification
http://www.yisongyue.com/courses/cs155/lectures/msri.pdf
Boosting for Classification

• Gradient boosting was designed for regression
• Can we design one for classification?
• AdaBoost – Adaptive Boosting
AdaBoost = Functional Gradient Descent

• AdaBoost is also an instance of functional gradient descent:
  – h(x) = sign(a1 h1(x) + a2 h2(x) + … + an hn(x))
• E.g., the weak models hi(x) are classification trees
  – Always predict −1 or +1
  – (Gradient boosting used regression trees)
Combining Multiple Classifiers

Aggregate scoring function: f(x) = 0.1*h1(x) + 1.5*h2(x) + 0.4*h3(x) + 1.1*h4(x)
Aggregate classifier: h(x) = sign(f(x))

Data Point | h1(x) | h2(x) | h3(x) | h4(x) | f(x) | h(x)
x1 | +1 | +1 | +1 | -1 | 0.1 + 1.5 + 0.4 - 1.1 = 0.9 | +1
x2 | +1 | +1 | +1 | +1 | 0.1 + 1.5 + 0.4 + 1.1 = 3.1 | +1
x3 | -1 | +1 | -1 | -1 | -0.1 + 1.5 - 0.4 - 1.1 = -0.1 | -1
x4 | -1 | -1 | +1 | -1 | -0.1 - 1.5 + 0.4 - 1.1 = -2.3 | -1
Also Creates New Training Sets

• Gradients in function space
  – Weak model that outputs the residual of the loss function
    • Squared-loss target: y − h(x) (for regression)
  – Algorithmically equivalent to training a weak model on a modified training set
    • Gradient boosting = train on (x_i, y_i − h(x_i))
• What about AdaBoost?
  – Classification problem.
Reweighting Training Data

• Define a weighting D over $S = \{(x_i, y_i)\}_{i=1}^N$
  – Sums to 1: $\sum_i D(i) = 1$
• Examples (three different weightings over the same three points):

Data Point | D(i) | D(i) | D(i)
(x1, y1) | 1/3 | 0 | 1/6
(x2, y2) | 1/3 | 1/2 | 1/3
(x3, y3) | 1/3 | 1/2 | 1/2

• Weighted loss function: $L_D(h) = \sum_i D(i)\, L(y_i, h(x_i))$
Training Decision Trees with Weighted Training Data

• Slight modification of the splitting criterion.
• Example: Bernoulli variance:
$L(S') = |S'|\, p_{S'}(1 - p_{S'}) = \frac{\#\text{pos} \cdot \#\text{neg}}{|S'|}$
• Estimate the fraction of positives as:
$p_{S'} = \frac{1}{|S'|}\sum_{(x_i, y_i)\in S'} D(i)\,\mathbf{1}_{[y_i=1]}$, where $|S'| \equiv \sum_{(x_i, y_i)\in S'} D(i)$
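A minimal sketch of the weighted Bernoulli-variance criterion above for one candidate node; the function name and data are illustrative.

```python
import numpy as np

def weighted_bernoulli_variance(y, D):
    """y in {-1,+1}; D = nonnegative weights of the examples in this node."""
    mass = D.sum()                           # |S'| under the weighting D
    if mass == 0:
        return 0.0
    p = D[y == 1].sum() / mass               # weighted fraction of positives
    return mass * p * (1 - p)                # |S'| * p_{S'} * (1 - p_{S'})

y = np.array([1, 1, -1, -1])
D = np.array([0.4, 0.1, 0.4, 0.1])
print(weighted_bernoulli_variance(y, D))     # impurity of this candidate split
```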
AdaBoost Outline

$S = \{(x_i, y_i)\}_{i=1}^N$, $y_i \in \{-1, +1\}$

h(x) = sign(a1 h1(x) + a2 h2(x) + … + an hn(x))

Train h1(x) on (S, D1 = uniform), h2(x) on (S, D2), …, hn(x) on (S, Dn).

D_t – weighting on data points
a_t – weight in the linear combination

Stop when validation performance plateaus (will discuss later).
http://www.yisongyue.com/courses/cs155/lectures/msri.pdf
Intuition

Aggregate scoring function: f(x) = 0.1*h1(x) + 1.5*h2(x) + 0.4*h3(x) + 1.1*h4(x)
Aggregate classifier: h(x) = sign(f(x))

Data Point | Label | f(x) | h(x) | Position
x1 | y1=+1 | 0.9 | +1 | somewhat close to decision boundary
x2 | y2=+1 | 3.1 | +1 | safely far from decision boundary
x3 | y3=+1 | -0.1 | -1 | violates decision boundary
x4 | y4=-1 | -2.3 | -1 | safely far from decision boundary
Thought experiment: when we train a new h5(x) to add to f(x)… what happens when h5 mispredicts on everything?
Intuition

Aggregate scoring function: f1:5(x) = f1:4(x) + 0.5*h5(x)   (suppose a5 = 0.5)
Aggregate classifier: h1:5(x) = sign(f1:5(x))
Consider an h5(x) that mispredicts on everything:

Data Point | Label | f1:4(x) | h1:4(x) | Worst-case h5(x) | Worst-case f1:5(x) | Impact of h5(x)
x1 | y1=+1 | 0.9 | +1 | -1 | 0.4 | kind of bad
x2 | y2=+1 | 3.1 | +1 | -1 | 2.6 | irrelevant
x3 | y3=+1 | -0.1 | -1 | -1 | -0.6 | very bad
x4 | y4=-1 | -2.3 | -1 | +1 | -1.8 | irrelevant
h5(x) should definitely classify (x3, y3) correctly!
h5(x) should probably classify (x1, y1) correctly.
We don't care about (x2, y2) & (x4, y4).
This implies a weighting over the training examples.
Intuition

Aggregate scoring function: f1:4(x) = 0.1*h1(x) + 1.5*h2(x) + 0.4*h3(x) + 1.1*h4(x)
Aggregate classifier: h1:4(x) = sign(f1:4(x))

Data Point | Label | f1:4(x) | h1:4(x) | Desired D5
x1 | y1=+1 | 0.9 | +1 | Medium
x2 | y2=+1 | 3.1 | +1 | Low
x3 | y3=+1 | -0.1 | -1 | High
x4 | y4=-1 | -2.3 | -1 | Low
AdaBoost

$S = \{(x_i, y_i)\}_{i=1}^N$, $y_i \in \{-1, +1\}$

• Init D1(i) = 1/N
• Loop t = 1 … n:
  – Train classifier h_t(x) using (S, D_t)   (e.g., the best decision stump)
  – Compute error on (S, D_t): $\varepsilon_t \equiv L_{D_t}(h_t) = \sum_i D_t(i)\, L(y_i, h_t(x_i))$
  – Define step size: $a_t = \frac{1}{2}\log\frac{1-\varepsilon_t}{\varepsilon_t}$
  – Update weighting: $D_{t+1}(i) = \frac{D_t(i)\exp\{-a_t y_i h_t(x_i)\}}{Z_t}$, with $Z_t$ a normalization factor s.t. $D_{t+1}$ sums to 1
• Return: h(x) = sign(a1 h1(x) + … + an hn(x))
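A minimal sketch (not the lecture's code) of the loop above, assuming scikit-learn decision stumps as the weak learners, with the weighting D_t passed via sample_weight; adaboost and predict are illustrative helper names.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def adaboost(X, y, n_rounds=10):
    """y: np.array in {-1,+1}. Returns weak classifiers and weights a_t."""
    N = len(y)
    D = np.full(N, 1.0 / N)                    # D_1(i) = 1/N
    stumps, alphas = [], []
    for _ in range(n_rounds):
        stump = DecisionTreeClassifier(max_depth=1)   # best decision stump
        stump.fit(X, y, sample_weight=D)       # train on (S, D_t)
        pred = stump.predict(X)
        eps = D[pred != y].sum()               # weighted 0/1 error on (S, D_t)
        if eps >= 0.5:                         # no better than random: stop
            break
        eps = max(eps, 1e-12)                  # guard against a perfect stump
        a = 0.5 * np.log((1 - eps) / eps)      # step size a_t
        D = D * np.exp(-a * y * pred)          # reweight: mistakes go up
        D = D / D.sum()                        # normalize by Z_t
        stumps.append(stump)
        alphas.append(a)
    return stumps, alphas

def predict(stumps, alphas, X):
    f = sum(a * s.predict(X) for s, a in zip(stumps, alphas))
    return np.sign(f)                          # h(x) = sign(a_1 h_1 + ...)
```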
Example

(D1(i) = 0.01, i.e., N = 100 training points; the table shows the first four.)

Data Point | Label | D1 | h1(x) | D2 | h2(x) | D3 | h3(x)
x1 | y1=+1 | 0.01 | +1 | 0.008 | +1 | 0.007 | -1
x2 | y2=+1 | 0.01 | -1 | 0.012 | +1 | 0.011 | +1
x3 | y3=+1 | 0.01 | -1 | 0.012 | -1 | 0.013 | +1
x4 | y4=-1 | 0.01 | -1 | 0.008 | +1 | 0.009 | -1
… | … | … | … | … | … | … | …

ε1 = 0.4, a1 ≈ 0.2    ε2 = 0.45, a2 ≈ 0.1    ε3 = 0.35, a3 ≈ 0.31

$\varepsilon_t \equiv L_{D_t}(h_t) = \sum_i D_t(i)\, L(y_i, h_t(x_i))$
$a_t = \frac{1}{2}\log\frac{1-\varepsilon_t}{\varepsilon_t}$
$D_{t+1}(i) = \frac{D_t(i)\exp\{-a_t y_i h_t(x_i)\}}{Z_t}$, with $Z_t$ a normalization factor s.t. $D_{t+1}$ sums to 1

Note: $y_i h_t(x_i) = -1$ or $+1$, so each weight is multiplied by either $e^{-a_t}$ (correct) or $e^{a_t}$ (mistake) before normalizing.

What happens if ε = 0.5? Then a_t = ½ log 1 = 0: the new classifier gets zero weight, and the weighting D is left unchanged.
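A quick numeric sanity check of the first update (a sketch, assuming ε1 = 0.4 arises from 40 of the 100 uniformly weighted points being misclassified):

```python
import numpy as np

N, eps = 100, 0.4
a = 0.5 * np.log((1 - eps) / eps)          # a_1 ~ 0.203
D = np.full(N, 1.0 / N)
# 60 weighted-correct points (y_i h_1(x_i) = +1), 40 mistakes (= -1):
margins = np.array([1.0] * 60 + [-1.0] * 40)
D = D * np.exp(-a * margins)
D = D / D.sum()                            # normalize by Z_1
print(a, D[0], D[-1])
# ~ 0.203, 0.0083 (correct), 0.0125 (mistake); the slide rounds to 0.008 / 0.012
```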
Exponential Loss

$L(y, f(x)) = \exp\{-y f(x)\}$

[Figure: exp loss as a function of f(x) for target y; it upper-bounds the 0/1 loss.]

Exp loss upper-bounds 0/1 loss!
One can prove that AdaBoost minimizes exp loss (homework question).
Decomposing Exp Loss

$L(y, f(x)) = \exp\{-y f(x)\} = \exp\Big\{-y\sum_{t=1}^{n} a_t h_t(x)\Big\} = \prod_{t=1}^{n} \exp\{-y\, a_t h_t(x)\}$

This is the distribution update rule!
http://www.yisongyue.com/courses/cs155/lectures/msri.pdf
Intuition

• Exp loss operates in exponent space
• An additive update to f(x) = a multiplicative update to the exp loss of f(x)
• The reweighting scheme in AdaBoost can be derived via the residual exp loss

$L(y, f(x)) = \exp\Big\{-y\sum_{t=1}^{n} a_t h_t(x)\Big\} = \prod_{t=1}^{n} \exp\{-y\, a_t h_t(x)\}$

http://www.yisongyue.com/courses/cs155/lectures/msri.pdf
AdaBoost = Minimizing Exp Loss

• The algorithm is unchanged: init D1(i) = 1/N; for t = 1 … n, train h_t(x) on (S, D_t), compute ε_t, set a_t, and update D_{t+1}; return h(x) = sign(a1 h1(x) + … + an hn(x))
• The point: data points are reweighted according to exp loss!
http://www.yisongyue.com/courses/cs155/lectures/msri.pdf
Story So Far: AdaBoost

• AdaBoost iteratively finds the weak classifier that minimizes the residual exp loss
  – Trains the weak classifier on reweighted data (S, D_t).
• Homework: rigorously prove it!
  1. Formally prove Exp Loss ≥ 0/1 Loss
  2. Relate exp loss to Z_t: $D_{t+1}(i) = \frac{D_t(i)\exp\{-a_t y_i h_t(x_i)\}}{Z_t}$
  3. Justify the choice of $a_t = \frac{1}{2}\log\frac{1-\varepsilon_t}{\varepsilon_t}$
     • It gives the largest decrease in Z_t
  The proof outline is in the earlier slides.
http://www.yisongyue.com/courses/cs155/lectures/msri.pdf
Recap: AdaBoost

• Gradient descent in function space
  – The space of weak classifiers
• Final model = linear combination of weak classifiers
  – h(x) = sign(a1 h1(x) + … + an hn(x))
  – I.e., a point in function space
• Iteratively creates new training sets via reweighting
  – Trains each weak classifier on the reweighted training set
  – Derived via minimizing the residual exp loss
Ensemble Selection
Recall: Bias-Variance Decomposition

• For squared error:

$\mathbb{E}_S[L_P(h_S)] = \mathbb{E}_S\big[\mathbb{E}_{(x,y)\sim P(x,y)}[L(y, h_S(x))]\big]$

$\mathbb{E}_S[L_P(h_S)] = \mathbb{E}_{(x,y)\sim P(x,y)}\Big[\mathbb{E}_S\big[(h_S(x) - H(x))^2\big] + (H(x) - y)^2\Big]$

where $H(x) = \mathbb{E}_S[h_S(x)]$ is the "average prediction on x"; variance term + bias term.
Ensemble Methods

• Combine base models to improve performance
• Bagging: averages high-variance, low-bias models
  – Reduces variance
  – Indirectly deals with bias via low-bias base models
• Boosting: carefully combines simple models
  – Reduces bias
  – Indirectly deals with variance via low-variance base models
• Can we get the best of both worlds?
Insight: Use Validation Set

• Evaluate error on a validation set V: $L_V(h_S) = \mathbb{E}_{(x,y)\sim V}\big[L(y, h_S(x))\big]$
• Proxy for test error: $\mathbb{E}_V[L_V(h_S)] = L_P(h_S)$
  (expected validation error = test error)
Ensemble Selection
"Ensemble Selection from Libraries of Models", Caruana, Niculescu-Mizil, Crew & Ksikes, ICML 2004
Partition the data S into a training set S' and a validation set V'.

H = {2000 models trained using S'}

Maintain an ensemble model as a combination of models in H:
h(x) = h1(x) + h2(x) + … + hn(x)

Add the model from H that maximizes performance on V':
+ h_{n+1}(x)   (denote this model h_{n+1})

Repeat.

Models are trained on S'. The ensemble is built to optimize V'. A sketch of this greedy loop follows.
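A minimal sketch of greedy ensemble selection (assuming an averaging ensemble with selection-with-replacement, as in the basic procedure of the Caruana et al. paper, and a library of fitted models with a scikit-learn-style .predict; ensemble_selection is an illustrative name):

```python
import numpy as np

def ensemble_selection(library, X_val, y_val, n_rounds=50):
    """library: list of fitted models. Returns the indices of chosen models."""
    chosen = []
    ensemble_pred = np.zeros(len(y_val))
    preds = [m.predict(X_val) for m in library]      # cache predictions on V'
    for _ in range(n_rounds):
        best_j, best_err = None, np.inf
        for j, p in enumerate(preds):
            # Validation error if model j were added to the current ensemble:
            avg = (ensemble_pred * len(chosen) + p) / (len(chosen) + 1)
            err = np.mean((avg - y_val) ** 2)        # validation error as proxy
            if err < best_err:
                best_j, best_err = j, err
        chosen.append(best_j)                        # greedily add the best
        ensemble_pred = (ensemble_pred * (len(chosen) - 1)
                         + preds[best_j]) / len(chosen)
    return chosen   # final ensemble averages library[j] over the chosen j's
```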
Reduces Both Bias & Variance

• Expected test error = bias + variance
• Bagging: reduce the variance of low-bias models
• Boosting: reduce the bias of low-variance models
• Ensemble selection: who cares!
  – Use validation error to approximate test error
  – Directly minimize validation error
  – Don't worry about the bias/variance decomposition
What's the Catch?

• Relies heavily on the validation set
  – Bagging & boosting: use the training set to select the next model
  – Ensemble selection: uses the validation set to select the next model
• Requires the validation set to be sufficiently large
• In practice: implies smaller training sets
  – Training & validation = partitioning of finite data
• Often works very well in practice
"Ensemble Selection from Libraries of Models", Caruana, Niculescu-Mizil, Crew & Ksikes, ICML 2004
Ensemble selection often outperforms a more homogeneous set of models, and reduces overfitting by building the model using a validation set.
Ensemble selection won KDD Cup 2009: http://www.niculescu-mizil.org/papers/KDDCup09.pdf
References & Further Reading

"An Empirical Comparison of Voting Classification Algorithms: Bagging, Boosting, and Variants", Bauer & Kohavi, Machine Learning, 36, 105–139 (1999)
"Bagging Predictors", Leo Breiman, Tech Report #421, UC Berkeley, 1994, http://statistics.berkeley.edu/sites/default/files/tech-reports/421.pdf
"An Empirical Comparison of Supervised Learning Algorithms", Caruana & Niculescu-Mizil, ICML 2006
"An Empirical Evaluation of Supervised Learning in High Dimensions", Caruana, Karampatziakis & Yessenalina, ICML 2008
"Ensemble Methods in Machine Learning", Thomas Dietterich, Multiple Classifier Systems, 2000
"Ensemble Selection from Libraries of Models", Caruana, Niculescu-Mizil, Crew & Ksikes, ICML 2004
"Getting the Most Out of Ensemble Selection", Caruana, Munson & Niculescu-Mizil, ICDM 2006
"Explaining AdaBoost", Rob Schapire, https://www.cs.princeton.edu/~schapire/papers/explaining-adaboost.pdf
"Greedy Function Approximation: A Gradient Boosting Machine", Jerome Friedman, 2001, http://statweb.stanford.edu/~jhf/ftp/trebst.pdf
"Random Forests – Random Features", Leo Breiman, Tech Report #567, UC Berkeley, 1999
"Structured Random Forests for Fast Edge Detection", Dollár & Zitnick, ICCV 2013
"ABC-Boost: Adaptive Base Class Boost for Multi-class Classification", Ping Li, ICML 2009
"Additive Groves of Regression Trees", Sorokina, Caruana & Riedewald, ECML 2007, http://additivegroves.net/
"Winning the KDD Cup Orange Challenge with Ensemble Selection", Niculescu-Mizil et al., KDD 2009
"Lessons from the Netflix Prize Challenge", Bell & Koren, SIGKDD Explorations 9(2), 75–79, 2007
Next Lectures

• Deep Learning
• Recitation next Thursday
  – Keras tutorial (Joe Marino)