Machine Learning & Data Mining CS/CNS/EE 155
Lecture 6: Boosting & Ensemble Selection
Today

• High-Level Overview of Ensemble Methods
• Boosting – Ensemble Method for Reducing Bias
• Ensemble Selection
Recall: Test Error

• "True" distribution: P(x,y)
  – Unknown to us
• Train: h_S(x) = y
  – Using training data $S = \{(x_i, y_i)\}_{i=1}^N$
  – Sampled from P(x,y)
• Test Error: $L_P(h_S) = \mathbb{E}_{(x,y)\sim P(x,y)}\big[L(y, h_S(x))\big]$
• Overfitting: Test Error >> Training Error
Training Set S:

Person | Age | Male? | Height > 55"
Alice | 14 | 0 | 1
Bob | 10 | 1 | 1
Carol | 13 | 0 | 1
Dave | 8 | 1 | 0
Erin | 11 | 0 | 0
Frank | 9 | 1 | 1
Gena | 8 | 0 | 0

True Distribution P(x,y) (samples continue indefinitely):

Person | Age | Male? | Height > 55"
James | 11 | 1 | 1
Jessica | 14 | 0 | 1
Alice | 14 | 0 | 1
Amy | 12 | 0 | 1
Bob | 10 | 1 | 1
Xavier | 9 | 1 | 0
Cathy | 9 | 0 | 1
Carol | 13 | 0 | 1
Eugene | 13 | 1 | 0
Rafael | 12 | 1 | 1
Dave | 8 | 1 | 0
Peter | 9 | 1 | 0
Henry | 13 | 1 | 0
Erin | 11 | 0 | 0
Rose | 7 | 0 | 0
Iain | 8 | 1 | 1
Paulo | 12 | 1 | 0
Margaret | 10 | 0 | 1
Frank | 9 | 1 | 1
Jill | 13 | 0 | 0
Leon | 10 | 1 | 0
Sarah | 12 | 0 | 0
Gena | 8 | 0 | 0
Patrick | 5 | 1 | 1
…

Test Error: $L(h) = \mathbb{E}_{(x,y)\sim P(x,y)}\big[L(h(x), y)\big]$, where h(x) predicts y.
Recall: Test Error

• Test Error: $L_P(h) = \mathbb{E}_{(x,y)\sim P(x,y)}\big[L(y, h(x))\big]$
• Treat h_S as a random variable: $h_S = \operatorname{argmin}_h \sum_{(x_i, y_i)\in S} L(y_i, h(x_i))$
• Expected Test Error: $\mathbb{E}_S[L_P(h_S)] = \mathbb{E}_S\big[\mathbb{E}_{(x,y)\sim P(x,y)}[L(y, h_S(x))]\big]$
  (aka the test error of the model class)
Recall: Bias-Variance Decomposition

• For squared error:

$\mathbb{E}_S[L_P(h_S)] = \mathbb{E}_S\big[\mathbb{E}_{(x,y)\sim P(x,y)}[L(y, h_S(x))]\big]$

$\mathbb{E}_S[L_P(h_S)] = \mathbb{E}_{(x,y)\sim P(x,y)}\Big[\mathbb{E}_S\big[(h_S(x) - H(x))^2\big] + (H(x) - y)^2\Big]$

where $H(x) = \mathbb{E}_S[h_S(x)]$ is the "average prediction on x"; the first term is the variance term and the second is the bias term.
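To make the two terms concrete, here is a minimal sketch (an illustration, not part of the lecture) that estimates them by simulation on a synthetic P(x,y) where the noise-free target is known. All names are illustrative; the bias term here is measured against the noise-free target, so irreducible label noise is excluded.

```python
import numpy as np

def sample_S(rng, N=20):
    """Draw one training set S from a known synthetic P(x,y)."""
    x = rng.uniform(0, 1, N)
    y = np.sin(2 * np.pi * x) + rng.normal(0, 0.3, N)
    return x, y

rng = np.random.default_rng(0)
x_test = np.linspace(0, 1, 200)
y_true = np.sin(2 * np.pi * x_test)        # noise-free target

# Fit a degree-1 polynomial (a "simple" model class) on many training sets S.
preds = []
for _ in range(500):
    x, y = sample_S(rng)
    w = np.polyfit(x, y, deg=1)
    preds.append(np.polyval(w, x_test))
preds = np.array(preds)

H = preds.mean(axis=0)                     # H(x) = E_S[h_S(x)]
variance = ((preds - H) ** 2).mean()       # E_S[(h_S(x) - H(x))^2], avg over x
bias = ((H - y_true) ** 2).mean()          # (H(x) - y)^2, avg over x
print(f"variance ~ {variance:.3f}, bias ~ {bias:.3f}")
```

Swapping in a higher-degree polynomial shows the tradeoff: bias drops while variance grows.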
Recall: Bias-Variance Decomposition

[Figure: three model classes fit to the same data, each panel showing the bias and variance of the fitted models.]

Some models experience high test error due to high bias. (Model class too simple to make accurate predictions.)
Some models experience high test error due to high variance. (Model class unstable due to insufficient training data.)
General Concept: Ensemble Methods

• Combine multiple learning algorithms or models
  – Previous lecture: Bagging & Random Forests
  – Today: Boosting & Ensemble Selection
• "Meta-learning" approach
  – Does not innovate on the base learning algorithm/model (decision trees, SVMs, etc.)
  – Ex: Bagging
    • New training sets via bootstrapping
    • Combines by averaging predictions
Intuition: Why Ensemble Methods Work

• Bias-variance tradeoff!
• Bagging reduces the variance of low-bias models
  – Low-bias models are "complex" and unstable
  – Bagging averages them together to create stability
• Boosting reduces the bias of low-variance models
  – Low-variance models are simple, with high bias
  – Boosting trains a sequence of simple models
  – The sum of simple models is complex/accurate
Boosting: "The Strength of Weak Learnability"*
* http://www.cs.princeton.edu/~schapire/papers/strengthofweak.pdf
Terminology: Shallow Decision Trees

• Decision trees with only a few nodes
• Very high bias & low variance
  – Different training sets lead to very similar trees
  – Error is high (barely better than a static baseline)
• Extreme case: "decision stumps"
  – Trees with exactly 1 split
Stability of Shallow Trees

• Tends to learn more-or-less the same model.
• h_S(x) has low variance
  – Over the randomness of the training set S
Terminology: Weak Learning

• Error rate: $\varepsilon_{h,P} = \mathbb{E}_{P(x,y)}\big[\mathbf{1}_{[h(x)\neq y]}\big]$
• Weak classifier: $\varepsilon_{h,P}$ slightly better than 0.5
  – Slightly better than random guessing
• Weak learner: can learn a weak classifier
Shallow decision trees are weak classifiers!
Weak learners are low variance & high bias!
How to "Boost" Weak Models?

• Weak models are high bias & low variance
• Bagging would not work
  – It reduces variance, not bias

Expected test error over the randomness of S (squared loss):
$\mathbb{E}_S[L_P(h_S)] = \mathbb{E}_{(x,y)\sim P(x,y)}\Big[\mathbb{E}_S\big[(h_S(x) - H(x))^2\big] + (H(x) - y)^2\Big]$
with $H(x) = \mathbb{E}_S[h_S(x)]$ (the "average prediction on x"); variance term + bias term.
First Try (for Regression)

• 1-dimensional regression
• Learn a decision stump
  – (single split, predict the mean of each of the two partitions)

Training set S:

x | y
0 | 0
1 | 1
2 | 4
3 | 9
4 | 16
5 | 25
6 | 36

Repeatedly fit a stump to the residuals, where y_t = y − h_{1:t−1}(x) ("residual") and h_{1:t}(x) = h_1(x) + … + h_t(x):

y1 | h1(x) | y2 | h2(x) | h1:2(x) | y3 | h3(x) | h1:3(x)
0 | 6 | -6 | -5.5 | 0.5 | -0.5 | -0.55 | -0.05
1 | 6 | -5 | -5.5 | 0.5 | 0.5 | -0.55 | -0.05
4 | 6 | -2 | 2.2 | 8.2 | -4.2 | -0.55 | 7.65
9 | 6 | -3 | 2.2 | 8.2 | 0.8 | -0.55 | 7.65
16 | 6 | 10 | 2.2 | 8.2 | 7.8 | -0.55 | 7.65
25 | 30.5 | -5.5 | 2.2 | 32.7 | -7.7 | -0.55 | 32.15
36 | 30.5 | 5.5 | 2.2 | 32.7 | 3.3 | 3.3 | 36
First Try (for Regression)

[Figure: for t = 1…4, the residual targets y_t, the fitted stump h_t, and the running ensemble h_{1:t}; the fit to the training data improves each round.]
y_t = y − h_{1:t−1}(x), h_{1:t}(x) = h_1(x) + … + h_t(x)
Gradient Boosting (Simple Version)
(for regression only)
http://statweb.stanford.edu/~jhf/ftp/trebst.pdf

h(x) = h1(x) + h2(x) + … + hn(x)

Given $S = \{(x_i, y_i)\}_{i=1}^N$:
Train h1 on $S_1 = \{(x_i, y_i)\}_{i=1}^N$
Train h2 on $S_2 = \{(x_i, y_i - h_1(x_i))\}_{i=1}^N$
…
Train hn on $S_n = \{(x_i, y_i - h_{1:n-1}(x_i))\}_{i=1}^N$

(Why is it called "gradient"? Answer in the next slides.)
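The pipeline above is easy to make concrete. Below is a minimal sketch (not the lecture's own code) of this simple version, using scikit-learn's DecisionTreeRegressor with max_depth=1 as the stump learner; fit_gradient_boosting and predict are illustrative helper names.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def fit_gradient_boosting(X, y, n_rounds=3):
    """Fit an ensemble h_{1:n}(x) = h_1(x) + ... + h_n(x) of stumps."""
    stumps = []
    residual = y.copy()                          # y_1 = y
    for _ in range(n_rounds):
        stump = DecisionTreeRegressor(max_depth=1)   # single split
        stump.fit(X, residual)                   # train on (x_i, y_i - h_{1:t-1}(x_i))
        residual -= stump.predict(X)             # update residual targets
        stumps.append(stump)
    return stumps

def predict(stumps, X):
    return sum(s.predict(X) for s in stumps)

# The y = x^2 example from the slides:
X = np.arange(7, dtype=float).reshape(-1, 1)
y = X.ravel() ** 2
stumps = fit_gradient_boosting(X, y, n_rounds=3)
print(predict(stumps, X))   # residual fits improve as n_rounds grows
```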
Axis-Aligned Gradient Descent
(for a linear model)

• Linear model: h(x) = w^T x
• Squared loss: L(y, y') = (y − y')²
• Similar to gradient descent
  – But only allow axis-aligned update directions
  – Updates are of the form: $w = w - \eta\, g_d\, e_d$

where $e_d = (0, \dots, 0, 1, 0, \dots, 0)^T$ is the unit vector along the d-th dimension, $g = \nabla_w \sum_i L(y_i, w^T x_i)$ is the gradient over the training set $S = \{(x_i, y_i)\}_{i=1}^N$, and $g_d$ is the projection of the gradient along the d-th dimension. Update along the axis with the greatest projection.
Axis-Aligned Gradient Descent

[Figure: descent path restricted to axis-aligned steps; each step updates along the axis with the largest projection.]

This concept will become useful in ~5 slides; a small sketch follows.
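A minimal sketch of the axis-aligned update above for a linear model with squared loss; the data, step size, and helper name are illustrative assumptions.

```python
import numpy as np

def axis_aligned_gd(X, y, eta=0.1, n_steps=500):
    N, D = X.shape
    w = np.zeros(D)
    for _ in range(n_steps):
        g = 2 * X.T @ (X @ w - y) / N    # gradient of mean squared loss
        d = np.argmax(np.abs(g))         # axis with the largest projection
        w[d] -= eta * g[d]               # update only the d-th coordinate
    return w

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))
w_true = np.array([1.0, -2.0, 0.0, 0.5, 3.0])
y = X @ w_true
print(axis_aligned_gd(X, y))             # converges toward w_true
```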
Function Space & Ensemble Methods

• Linear model = one coefficient per feature
  – Linear over the input feature space
• Ensemble methods = one coefficient per model
  – Linear over a function space
  – E.g., h = h1 + h2 + … + hn
Functional Gradient Descent
http://statweb.stanford.edu/~jhf/ftp/trebst.pdf

h(x) = h1(x) + h2(x) + … + hn(x)

Train h1(x) on S' = {(x, y)}
Train h2(x) on S' = {(x, y − h1(x))}
…
Train hn(x) on S' = {(x, y − h1(x) − … − h_{n−1}(x))}

"Function Space" (span of all shallow trees)
(Potentially infinite; most coefficients are 0)
Coefficient = 1 for the models used; coefficient = 0 for all other models.
Properties of Function Space

• Generalization of a vector space
• Closed under addition
  – The sum of two functions is a function
• Closed under scalar multiplication
  – Multiplying a function by a scalar yields a function
• Gradient descent: adding a scaled function to an existing function
Function Space of Models

• Every "axis" in the space is a weak model
  – Potentially infinitely many axes/dimensions
• Complex models are linear combinations of weak models
  – h = η1 h1 + η2 h2 + … + ηn hn
  – Equivalent to a point in function space
    • Defined by the coefficients η
Recall: Axis-Aligned Gradient Descent

[Figure: project the gradient onto the closest axis (smallest squared distance) and update along it.]

Imagine each axis is a weak model. Every point is a linear combination of weak models.
Functional Gradient Descent
(Gradient descent in function space; derivation for squared loss)

• Init h(x) = 0
• Loop n = 1, 2, 3, 4, …:

$h = h - \operatorname{argmax}_{h_n}\Big[\operatorname{project}_{h_n}\Big(\nabla_h \sum_i L(y_i, h(x_i))\Big)\Big] = h + \operatorname{argmin}_{h_n} \sum_i \big(y_i - h(x_i) - h_n(x_i)\big)^2$

over the training set $S = \{(x_i, y_i)\}_{i=1}^N$.
Project the functional gradient onto the best function; this is equivalent to finding the h_n that minimizes the residual loss.
Reduction to Vector Space

• Function space = axis-aligned unit vectors
  – Weak model = axis-aligned unit vector: $e_d = (\dots, 0, 1, 0, \dots)^T$
• A linear model w has the same functional form:
  – w = η1 e1 + η2 e2 + … + ηD eD
  – A point in the space of D "axis-aligned functions"
• Axis-aligned gradient descent = functional gradient descent on the space of axis-aligned unit-vector weak models.
Gradient Boosting (Full Version)
(an instance of functional gradient descent; for regression only)
http://statweb.stanford.edu/~jhf/ftp/trebst.pdf

h_{1:n}(x) = h1(x) + η2 h2(x) + … + ηn hn(x)

Given $S = \{(x_i, y_i)\}_{i=1}^N$:
Train h1 on $S_1 = \{(x_i, y_i)\}_{i=1}^N$
Train h2 on $S_2 = \{(x_i, y_i - h_1(x_i))\}_{i=1}^N$
…
Train hn on $S_n = \{(x_i, y_i - h_{1:n-1}(x_i))\}_{i=1}^N$

See the reference for how to set η.
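For practical use, a shrinkage-style variant of this procedure (a constant learning rate η for every stage, a common simplification of the per-stage choice in the reference) is available in scikit-learn. A minimal usage sketch, with illustrative hyperparameter values:

```python
from sklearn.ensemble import GradientBoostingRegressor

# n_estimators = number of boosting rounds n; learning_rate plays the role
# of eta; max_depth=1 makes each weak model a decision stump.
model = GradientBoostingRegressor(n_estimators=100, learning_rate=0.1, max_depth=1)
# model.fit(X_train, y_train); y_pred = model.predict(X_test)
```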
Recap: Basic Boosting

• Ensemble of many weak classifiers.
  – h(x) = η1 h1(x) + η2 h2(x) + … + ηn hn(x)
• Goal: reduce bias using low-variance models
• Derivation: via gradient descent in function space
  – The space of weak classifiers
• We've only seen the regression version so far…
AdaBoost: Adaptive Boosting for Classification
http://www.yisongyue.com/courses/cs155/lectures/msri.pdf
Boosting for Classification

• Gradient boosting was designed for regression
• Can we design one for classification?
• AdaBoost – Adaptive Boosting
AdaBoost = Functional Gradient Descent

• AdaBoost is also an instance of functional gradient descent:
  – h(x) = sign(a1 h1(x) + a2 h2(x) + … + an hn(x))
• E.g., the weak models hi(x) are classification trees
  – Always predict −1 or +1
  – (Gradient boosting used regression trees)
Combining Multiple Classifiers

Aggregate scoring function: f(x) = 0.1*h1(x) + 1.5*h2(x) + 0.4*h3(x) + 1.1*h4(x)
Aggregate classifier: h(x) = sign(f(x))

Data Point | h1(x) | h2(x) | h3(x) | h4(x) | f(x) | h(x)
x1 | +1 | +1 | +1 | -1 | 0.1 + 1.5 + 0.4 - 1.1 = 0.9 | +1
x2 | +1 | +1 | +1 | +1 | 0.1 + 1.5 + 0.4 + 1.1 = 3.1 | +1
x3 | -1 | +1 | -1 | -1 | -0.1 + 1.5 - 0.4 - 1.1 = -0.1 | -1
x4 | -1 | -1 | +1 | -1 | -0.1 - 1.5 + 0.4 - 1.1 = -2.3 | -1
Also Creates New Training Sets

• Gradients in function space
  – Weak model that outputs the residual of the loss function
    • Squared-loss target: y − h(x) (for regression)
  – Algorithmically equivalent to training a weak model on a modified training set
    • Gradient boosting = train on (x_i, y_i − h(x_i))
• What about AdaBoost?
  – Classification problem.
Reweighting Training Data

• Define a weighting D over $S = \{(x_i, y_i)\}_{i=1}^N$
  – Sums to 1: $\sum_i D(i) = 1$
• Examples (three different weightings over the same three points):

Data Point | D(i) | D(i) | D(i)
(x1, y1) | 1/3 | 0 | 1/6
(x2, y2) | 1/3 | 1/2 | 1/3
(x3, y3) | 1/3 | 1/2 | 1/2

• Weighted loss function: $L_D(h) = \sum_i D(i)\, L(y_i, h(x_i))$
Training Decision Trees with Weighted Training Data

• Slight modification of the splitting criterion.
• Example: Bernoulli variance:
$L(S') = |S'|\, p_{S'}(1 - p_{S'}) = \frac{\#\text{pos} \cdot \#\text{neg}}{|S'|}$
• Estimate the fraction of positives as:
$p_{S'} = \frac{1}{|S'|}\sum_{(x_i, y_i)\in S'} D(i)\,\mathbf{1}_{[y_i=1]}$, where $|S'| \equiv \sum_{(x_i, y_i)\in S'} D(i)$
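A minimal sketch of the weighted Bernoulli-variance criterion above for one candidate node; the function name and data are illustrative.

```python
import numpy as np

def weighted_bernoulli_variance(y, D):
    """y in {-1,+1}; D = nonnegative weights of the examples in this node."""
    mass = D.sum()                           # |S'| under the weighting D
    if mass == 0:
        return 0.0
    p = D[y == 1].sum() / mass               # weighted fraction of positives
    return mass * p * (1 - p)                # |S'| * p_{S'} * (1 - p_{S'})

y = np.array([1, 1, -1, -1])
D = np.array([0.4, 0.1, 0.4, 0.1])
print(weighted_bernoulli_variance(y, D))     # impurity of this candidate split
```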
AdaBoost Outline

$S = \{(x_i, y_i)\}_{i=1}^N$, $y_i \in \{-1, +1\}$

h(x) = sign(a1 h1(x) + a2 h2(x) + … + an hn(x))

Train h1(x) on (S, D1 = uniform), h2(x) on (S, D2), …, hn(x) on (S, Dn).

D_t – weighting on data points
a_t – weight in the linear combination

Stop when validation performance plateaus (will discuss later).
http://www.yisongyue.com/courses/cs155/lectures/msri.pdf
Intuition

Aggregate scoring function: f(x) = 0.1*h1(x) + 1.5*h2(x) + 0.4*h3(x) + 1.1*h4(x)
Aggregate classifier: h(x) = sign(f(x))

Data Point | Label | f(x) | h(x) | Position
x1 | y1=+1 | 0.9 | +1 | somewhat close to decision boundary
x2 | y2=+1 | 3.1 | +1 | safely far from decision boundary
x3 | y3=+1 | -0.1 | -1 | violates decision boundary
x4 | y4=-1 | -2.3 | -1 | safely far from decision boundary
Thought experiment: when we train a new h5(x) to add to f(x)… what happens when h5 mispredicts on everything?
Intuition

Aggregate scoring function: f1:5(x) = f1:4(x) + 0.5*h5(x)   (suppose a5 = 0.5)
Aggregate classifier: h1:5(x) = sign(f1:5(x))
Consider an h5(x) that mispredicts on everything:

Data Point | Label | f1:4(x) | h1:4(x) | Worst-case h5(x) | Worst-case f1:5(x) | Impact of h5(x)
x1 | y1=+1 | 0.9 | +1 | -1 | 0.4 | kind of bad
x2 | y2=+1 | 3.1 | +1 | -1 | 2.6 | irrelevant
x3 | y3=+1 | -0.1 | -1 | -1 | -0.6 | very bad
x4 | y4=-1 | -2.3 | -1 | +1 | -1.8 | irrelevant
h5(x) should definitely classify (x3, y3) correctly!
h5(x) should probably classify (x1, y1) correctly.
We don't care about (x2, y2) & (x4, y4).
This implies a weighting over the training examples.
Intuition

Aggregate scoring function: f1:4(x) = 0.1*h1(x) + 1.5*h2(x) + 0.4*h3(x) + 1.1*h4(x)
Aggregate classifier: h1:4(x) = sign(f1:4(x))

Data Point | Label | f1:4(x) | h1:4(x) | Desired D5
x1 | y1=+1 | 0.9 | +1 | Medium
x2 | y2=+1 | 3.1 | +1 | Low
x3 | y3=+1 | -0.1 | -1 | High
x4 | y4=-1 | -2.3 | -1 | Low
AdaBoost

$S = \{(x_i, y_i)\}_{i=1}^N$, $y_i \in \{-1, +1\}$

• Init D1(i) = 1/N
• Loop t = 1 … n:
  – Train classifier h_t(x) using (S, D_t)   (e.g., the best decision stump)
  – Compute error on (S, D_t): $\varepsilon_t \equiv L_{D_t}(h_t) = \sum_i D_t(i)\, L(y_i, h_t(x_i))$
  – Define step size: $a_t = \frac{1}{2}\log\frac{1-\varepsilon_t}{\varepsilon_t}$
  – Update weighting: $D_{t+1}(i) = \frac{D_t(i)\exp\{-a_t y_i h_t(x_i)\}}{Z_t}$, with $Z_t$ a normalization factor s.t. $D_{t+1}$ sums to 1
• Return: h(x) = sign(a1 h1(x) + … + an hn(x))
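A minimal sketch (not the lecture's code) of the loop above, assuming scikit-learn decision stumps as the weak learners, with the weighting D_t passed via sample_weight; adaboost and predict are illustrative helper names.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def adaboost(X, y, n_rounds=10):
    """y: np.array in {-1,+1}. Returns weak classifiers and weights a_t."""
    N = len(y)
    D = np.full(N, 1.0 / N)                    # D_1(i) = 1/N
    stumps, alphas = [], []
    for _ in range(n_rounds):
        stump = DecisionTreeClassifier(max_depth=1)   # best decision stump
        stump.fit(X, y, sample_weight=D)       # train on (S, D_t)
        pred = stump.predict(X)
        eps = D[pred != y].sum()               # weighted 0/1 error on (S, D_t)
        if eps >= 0.5:                         # no better than random: stop
            break
        eps = max(eps, 1e-12)                  # guard against a perfect stump
        a = 0.5 * np.log((1 - eps) / eps)      # step size a_t
        D = D * np.exp(-a * y * pred)          # reweight: mistakes go up
        D = D / D.sum()                        # normalize by Z_t
        stumps.append(stump)
        alphas.append(a)
    return stumps, alphas

def predict(stumps, alphas, X):
    f = sum(a * s.predict(X) for s, a in zip(stumps, alphas))
    return np.sign(f)                          # h(x) = sign(a_1 h_1 + ...)
```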
Example

(D1(i) = 0.01, i.e., N = 100 training points; the table shows the first four.)

Data Point | Label | D1 | h1(x) | D2 | h2(x) | D3 | h3(x)
x1 | y1=+1 | 0.01 | +1 | 0.008 | +1 | 0.007 | -1
x2 | y2=+1 | 0.01 | -1 | 0.012 | +1 | 0.011 | +1
x3 | y3=+1 | 0.01 | -1 | 0.012 | -1 | 0.013 | +1
x4 | y4=-1 | 0.01 | -1 | 0.008 | +1 | 0.009 | -1
… | … | … | … | … | … | … | …

ε1 = 0.4, a1 ≈ 0.2    ε2 = 0.45, a2 ≈ 0.1    ε3 = 0.35, a3 ≈ 0.31

$\varepsilon_t \equiv L_{D_t}(h_t) = \sum_i D_t(i)\, L(y_i, h_t(x_i))$
$a_t = \frac{1}{2}\log\frac{1-\varepsilon_t}{\varepsilon_t}$
$D_{t+1}(i) = \frac{D_t(i)\exp\{-a_t y_i h_t(x_i)\}}{Z_t}$, with $Z_t$ a normalization factor s.t. $D_{t+1}$ sums to 1

Note: $y_i h_t(x_i) = -1$ or $+1$, so each weight is multiplied by either $e^{-a_t}$ (correct) or $e^{a_t}$ (mistake) before normalizing.

What happens if ε = 0.5? Then a_t = ½ log 1 = 0: the new classifier gets zero weight, and the weighting D is left unchanged.
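A quick numeric sanity check of the first update (a sketch, assuming ε1 = 0.4 arises from 40 of the 100 uniformly weighted points being misclassified):

```python
import numpy as np

N, eps = 100, 0.4
a = 0.5 * np.log((1 - eps) / eps)          # a_1 ~ 0.203
D = np.full(N, 1.0 / N)
# 60 weighted-correct points (y_i h_1(x_i) = +1), 40 mistakes (= -1):
margins = np.array([1.0] * 60 + [-1.0] * 40)
D = D * np.exp(-a * margins)
D = D / D.sum()                            # normalize by Z_1
print(a, D[0], D[-1])
# ~ 0.203, 0.0083 (correct), 0.0125 (mistake); the slide rounds to 0.008 / 0.012
```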
Exponential Loss

$L(y, f(x)) = \exp\{-y f(x)\}$

[Figure: exp loss as a function of f(x) for target y; it upper-bounds the 0/1 loss.]

Exp loss upper-bounds 0/1 loss!
One can prove that AdaBoost minimizes exp loss (homework question).
Decomposing Exp Loss

$L(y, f(x)) = \exp\{-y f(x)\} = \exp\Big\{-y\sum_{t=1}^{n} a_t h_t(x)\Big\} = \prod_{t=1}^{n} \exp\{-y\, a_t h_t(x)\}$

This is the distribution update rule!
http://www.yisongyue.com/courses/cs155/lectures/msri.pdf
Intuition

• Exp loss operates in exponent space
• An additive update to f(x) = a multiplicative update to the exp loss of f(x)
• The reweighting scheme in AdaBoost can be derived via the residual exp loss

$L(y, f(x)) = \exp\Big\{-y\sum_{t=1}^{n} a_t h_t(x)\Big\} = \prod_{t=1}^{n} \exp\{-y\, a_t h_t(x)\}$

http://www.yisongyue.com/courses/cs155/lectures/msri.pdf
AdaBoost = Minimizing Exp Loss

• The algorithm is unchanged: init D1(i) = 1/N; for t = 1 … n, train h_t(x) on (S, D_t), compute ε_t, set a_t, and update D_{t+1}; return h(x) = sign(a1 h1(x) + … + an hn(x))
• The point: data points are reweighted according to exp loss!
http://www.yisongyue.com/courses/cs155/lectures/msri.pdf
Story So Far: AdaBoost

• AdaBoost iteratively finds the weak classifier that minimizes the residual exp loss
  – Trains the weak classifier on reweighted data (S, D_t).
• Homework: rigorously prove it!
  1. Formally prove Exp Loss ≥ 0/1 Loss
  2. Relate exp loss to Z_t: $D_{t+1}(i) = \frac{D_t(i)\exp\{-a_t y_i h_t(x_i)\}}{Z_t}$
  3. Justify the choice of $a_t = \frac{1}{2}\log\frac{1-\varepsilon_t}{\varepsilon_t}$
     • It gives the largest decrease in Z_t
  The proof outline is in the earlier slides.
http://www.yisongyue.com/courses/cs155/lectures/msri.pdf
Recap: AdaBoost

• Gradient descent in function space
  – The space of weak classifiers
• Final model = linear combination of weak classifiers
  – h(x) = sign(a1 h1(x) + … + an hn(x))
  – I.e., a point in function space
• Iteratively creates new training sets via reweighting
  – Trains each weak classifier on the reweighted training set
  – Derived via minimizing the residual exp loss
Ensemble Selection
Recall: Bias-Variance Decomposition

• For squared error:

$\mathbb{E}_S[L_P(h_S)] = \mathbb{E}_S\big[\mathbb{E}_{(x,y)\sim P(x,y)}[L(y, h_S(x))]\big]$

$\mathbb{E}_S[L_P(h_S)] = \mathbb{E}_{(x,y)\sim P(x,y)}\Big[\mathbb{E}_S\big[(h_S(x) - H(x))^2\big] + (H(x) - y)^2\Big]$

where $H(x) = \mathbb{E}_S[h_S(x)]$ is the "average prediction on x"; variance term + bias term.
Ensemble Methods

• Combine base models to improve performance
• Bagging: averages high-variance, low-bias models
  – Reduces variance
  – Indirectly deals with bias via low-bias base models
• Boosting: carefully combines simple models
  – Reduces bias
  – Indirectly deals with variance via low-variance base models
• Can we get the best of both worlds?
Insight: Use Validation Set

• Evaluate error on a validation set V: $L_V(h_S) = \mathbb{E}_{(x,y)\sim V}\big[L(y, h_S(x))\big]$
• Proxy for test error: $\mathbb{E}_V[L_V(h_S)] = L_P(h_S)$
  (expected validation error = test error)
Ensemble Selection
"Ensemble Selection from Libraries of Models", Caruana, Niculescu-Mizil, Crew & Ksikes, ICML 2004
Partition the data S into a training set S' and a validation set V'.

H = {2000 models trained using S'}

Maintain an ensemble model as a combination of models in H:
h(x) = h1(x) + h2(x) + … + hn(x)

Add the model from H that maximizes performance on V':
+ h_{n+1}(x)   (denote this model h_{n+1})

Repeat.

Models are trained on S'. The ensemble is built to optimize V'. A sketch of this greedy loop follows.
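A minimal sketch of greedy ensemble selection (assuming an averaging ensemble with selection-with-replacement, as in the basic procedure of the Caruana et al. paper, and a library of fitted models with a scikit-learn-style .predict; ensemble_selection is an illustrative name):

```python
import numpy as np

def ensemble_selection(library, X_val, y_val, n_rounds=50):
    """library: list of fitted models. Returns the indices of chosen models."""
    chosen = []
    ensemble_pred = np.zeros(len(y_val))
    preds = [m.predict(X_val) for m in library]      # cache predictions on V'
    for _ in range(n_rounds):
        best_j, best_err = None, np.inf
        for j, p in enumerate(preds):
            # Validation error if model j were added to the current ensemble:
            avg = (ensemble_pred * len(chosen) + p) / (len(chosen) + 1)
            err = np.mean((avg - y_val) ** 2)        # validation error as proxy
            if err < best_err:
                best_j, best_err = j, err
        chosen.append(best_j)                        # greedily add the best
        ensemble_pred = (ensemble_pred * (len(chosen) - 1)
                         + preds[best_j]) / len(chosen)
    return chosen   # final ensemble averages library[j] over the chosen j's
```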
Reduces Both Bias & Variance

• Expected test error = bias + variance
• Bagging: reduce the variance of low-bias models
• Boosting: reduce the bias of low-variance models
• Ensemble selection: who cares!
  – Use validation error to approximate test error
  – Directly minimize validation error
  – Don't worry about the bias/variance decomposition
What's the Catch?

• Relies heavily on the validation set
  – Bagging & boosting: use the training set to select the next model
  – Ensemble selection: uses the validation set to select the next model
• Requires the validation set to be sufficiently large
• In practice: implies smaller training sets
  – Training & validation = partitioning of finite data
• Often works very well in practice
"Ensemble Selection from Libraries of Models", Caruana, Niculescu-Mizil, Crew & Ksikes, ICML 2004
Ensemble selection often outperforms a more homogeneous set of models, and reduces overfitting by building the model using a validation set.
Ensemble selection won KDD Cup 2009: http://www.niculescu-mizil.org/papers/KDDCup09.pdf
References & Further Reading

"An Empirical Comparison of Voting Classification Algorithms: Bagging, Boosting, and Variants", Bauer & Kohavi, Machine Learning, 36, 105–139 (1999)
"Bagging Predictors", Leo Breiman, Tech Report #421, UC Berkeley, 1994, http://statistics.berkeley.edu/sites/default/files/tech-reports/421.pdf
"An Empirical Comparison of Supervised Learning Algorithms", Caruana & Niculescu-Mizil, ICML 2006
"An Empirical Evaluation of Supervised Learning in High Dimensions", Caruana, Karampatziakis & Yessenalina, ICML 2008
"Ensemble Methods in Machine Learning", Thomas Dietterich, Multiple Classifier Systems, 2000
"Ensemble Selection from Libraries of Models", Caruana, Niculescu-Mizil, Crew & Ksikes, ICML 2004
"Getting the Most Out of Ensemble Selection", Caruana, Munson & Niculescu-Mizil, ICDM 2006
"Explaining AdaBoost", Rob Schapire, https://www.cs.princeton.edu/~schapire/papers/explaining-adaboost.pdf
"Greedy Function Approximation: A Gradient Boosting Machine", Jerome Friedman, 2001, http://statweb.stanford.edu/~jhf/ftp/trebst.pdf
"Random Forests – Random Features", Leo Breiman, Tech Report #567, UC Berkeley, 1999
"Structured Random Forests for Fast Edge Detection", Dollár & Zitnick, ICCV 2013
"ABC-Boost: Adaptive Base Class Boost for Multi-class Classification", Ping Li, ICML 2009
"Additive Groves of Regression Trees", Sorokina, Caruana & Riedewald, ECML 2007, http://additivegroves.net/
"Winning the KDD Cup Orange Challenge with Ensemble Selection", Niculescu-Mizil et al., KDD 2009
"Lessons from the Netflix Prize Challenge", Bell & Koren, SIGKDD Explorations 9(2), 75–79, 2007
Next Lectures

• Deep Learning
• Recitation next Thursday
  – Keras tutorial (Joe Marino)