Introduction to Boosting - The University of Manchester

Page 1

Introduction to Boosting

February 2, 2012

Page 2

Outline

1 Boosting - Intuitively

2 Adaboost - Overview

3 Adaboost - Demo

4 Adaboost - Details

Page 3

Boosting - Introduction

Train one base learner at a time.

Focus it on the mistakes of its predecessors.

Weight it based on how ‘useful’ it is in the ensemble (not on its training error).

Page 4

Boosting - Introduction

Under very mild assumptions on the base learners (“Weak learning” - they must perform better than random guessing):

1 Training error is eventually reduced to zero (we can fit the training data).

2 Generalisation error continues to be reduced even after training error is zero (there is no overfitting).

Page 5

Boosting - Introduction

The most popular ensemble algorithm is a boosting algorithm called “Adaboost”.

Adaboost is short for “Adaptive Boosting”, because the algorithm adapts weights on the base learners and training examples.

There are many explanations of precisely what Adaboost does and why it is so successful - but the basic idea is simple!

Page 6

Outline

1 Boosting - Intuitively

2 Adaboost - Overview

3 Adaboost - Demo

4 Adaboost - Details

Page 7

Adaboost - Pseudo-code

Adaboost Pseudo-code

Set uniform example weights.
for each base learner do
    Train base learner with weighted sample.
    Test base learner on all data.
    Set learner weight with weighted error.
    Set example weights based on ensemble predictions.
end for

Page 8

Adaboost - Loss Function

To see precisely how Adaboost is implemented, we need to understand its loss function:

L(H) = \sum_{i=1}^{N} e^{-m_i}    (1)

m_i = y_i \sum_{k=1}^{K} \alpha_k h_k(x_i)    (2)

m_i is the voting margin. There are N examples and K base learners. α_k is the weight of the kth learner, and h_k(x_i) is its prediction on x_i.
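As a concrete illustration, here is a minimal Python sketch of Eqs. (1)-(2). The arrays y, preds and alpha are made-up stand-ins for the labels, base-learner predictions and learner weights; nothing here is prescribed by the slides.

```python
import numpy as np

# Hypothetical inputs: labels y in {-1, +1}, per-learner predictions
# preds[k, i] = h_k(x_i) in {-1, +1}, and learner weights alpha[k].
y = np.array([1, -1, 1, 1, -1])
preds = np.array([[1, -1, 1, -1, -1],
                  [1,  1, 1,  1, -1]])
alpha = np.array([0.8, 0.4])

# Voting margin m_i = y_i * sum_k alpha_k * h_k(x_i)   (Eq. 2)
margins = y * (alpha @ preds)

# Exponential loss L(H) = sum_i exp(-m_i)               (Eq. 1)
loss = np.exp(-margins).sum()
print(margins, loss)
```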

Page 9

Adaboost - Loss Function

Why Exponential Loss?

Loss is higher when a prediction is wrong.

Loss is steeper when a prediction is wrong.

Precise reasons later…

Page 10

Adaboost - Loss Function

Page 11

Adaboost - Example Weights

After k learners, the example weights used to train the (k+1)th learner are:

D_k(i) \propto e^{-m_i}    (3)

= e^{-y_i \sum_{j=1}^{k} \alpha_j h_j(x_i)}    (4)

= e^{-y_i \sum_{j=1}^{k-1} \alpha_j h_j(x_i)} \, e^{-y_i \alpha_k h_k(x_i)}    (5)

\propto D_{k-1}(i) \, e^{-y_i \alpha_k h_k(x_i)}    (6)

D_k(i) = \frac{D_{k-1}(i) \, e^{-y_i \alpha_k h_k(x_i)}}{Z_k}    (7)

Z_k normalises D_k(i) such that \sum_{i=1}^{N} D_k(i) = 1.

Larger when m_i is negative.

Directly proportional to the loss function.
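A minimal sketch of the update in Eq. (7). The function name and arguments are hypothetical, assuming labels and predictions take values in {-1, +1}.

```python
import numpy as np

def update_weights(D_prev, alpha_k, y, h_k_preds):
    """Return D_k given D_{k-1}, the new learner's weight and its predictions (Eq. 7)."""
    D = D_prev * np.exp(-alpha_k * y * h_k_preds)  # un-normalised D_k(i)
    return D / D.sum()                             # divide by Z_k so the weights sum to 1
```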

Page 12

Adaboost - Learner Weights

Once the kth learner is trained, we evaluate its predictions h_k(x_i) on the training data and assign the weight:

\alpha_k = \frac{1}{2} \log \frac{1 - \varepsilon_k}{\varepsilon_k},    (8)

\varepsilon_k = \sum_{i=1}^{N} D_{k-1}(i) \, \delta[h_k(x_i) \neq y_i]    (9)

ε_k is the weighted error of the kth learner.

The log in the learner weight is related to the exponential in the loss function.
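A hedged sketch of Eqs. (8)-(9). The names learner_weight, D_prev, y and h_k_preds are illustrative, and the code assumes the weighted error lies strictly between 0 and 0.5.

```python
import numpy as np

def learner_weight(D_prev, y, h_k_preds):
    """Weighted error of learner k (Eq. 9) and the resulting learner weight (Eq. 8)."""
    eps_k = D_prev[h_k_preds != y].sum()         # weighted error; assumed 0 < eps_k < 0.5
    alpha_k = 0.5 * np.log((1 - eps_k) / eps_k)  # positive whenever eps_k < 0.5
    return alpha_k, eps_k
```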

Page 13

Adaboost - Learner Weights

Page 14

Adaboost - Pseudo-code

Adaboost Pseudo-code

D_k(i): weight of example i after learner k
α_k: weight of learner k

Set uniform example weights.
for each base learner do
    Train base learner with weighted sample.
    Test base learner on all data.
    Set learner weight with weighted error.
    Set example weights based on ensemble predictions.
end for

Page 15

Adaboost - Pseudo-code

Adaboost Pseudo-code

D_k(i): weight of example i after learner k
α_k: weight of learner k

∀i : D_0(i) ← 1/N
for each base learner do
    Train base learner with weighted sample.
    Test base learner on all data.
    Set learner weight with weighted error.
    Set example weights based on ensemble predictions.
end for

Page 16

Adaboost - Pseudo-code

Adaboost Pseudo-code

D_k(i): weight of example i after learner k
α_k: weight of learner k

∀i : D_0(i) ← 1/N
for k = 1 to K do
    D ← data sampled with D_{k−1}
    h_k ← base learner trained on D
    Test base learner on all data.
    Set learner weight with weighted error.
    Set example weights based on ensemble predictions.
end for

Page 17

Adaboost - Pseudo-code

Adaboost Pseudo-code

D_k(i): weight of example i after learner k
α_k: weight of learner k

∀i : D_0(i) ← 1/N
for k = 1 to K do
    D ← data sampled with D_{k−1}
    h_k ← base learner trained on D
    ε_k ← \sum_{i=1}^{N} D_{k−1}(i) δ[h_k(x_i) ≠ y_i]
    Set learner weight with weighted error.
    Set example weights based on ensemble predictions.
end for

Page 18

Adaboost - Pseudo-code

Adaboost Pseudo-code

D_k(i): weight of example i after learner k
α_k: weight of learner k

∀i : D_0(i) ← 1/N
for k = 1 to K do
    D ← data sampled with D_{k−1}
    h_k ← base learner trained on D
    ε_k ← \sum_{i=1}^{N} D_{k−1}(i) δ[h_k(x_i) ≠ y_i]
    α_k ← (1/2) log((1 − ε_k)/ε_k)
    Set example weights based on ensemble predictions.
end for

Page 19

Adaboost - Pseudo-code

Adaboost Pseudo-code

D_k(i): weight of example i after learner k
α_k: weight of learner k

∀i : D_0(i) ← 1/N
for k = 1 to K do
    D ← data sampled with D_{k−1}
    h_k ← base learner trained on D
    ε_k ← \sum_{i=1}^{N} D_{k−1}(i) δ[h_k(x_i) ≠ y_i]
    α_k ← (1/2) log((1 − ε_k)/ε_k)
    D_k(i) ← D_{k−1}(i) e^{−α_k y_i h_k(x_i)} / Z_k
end for

Page 20

Adaboost - Pseudo-code

Decision rule:

H(x) = \mathrm{sign}\left( \sum_{k=1}^{K} \alpha_k h_k(x) \right)    (10)
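Putting the pseudo-code and decision rule together, here is a minimal Python sketch. It assumes scikit-learn decision stumps as the (unspecified) base learners and uses reweighting via sample_weight in place of the explicit resampling step; the slides do not prescribe either choice.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def adaboost_train(X, y, K=50):
    """Minimal Adaboost sketch following the slides' pseudo-code; y must be in {-1, +1}."""
    N = len(y)
    D = np.full(N, 1.0 / N)                      # D_0(i) = 1/N
    learners, alphas = [], []
    for k in range(K):
        h = DecisionTreeClassifier(max_depth=1)  # decision stump as base learner
        h.fit(X, y, sample_weight=D)             # train with weights D_{k-1}
        pred = h.predict(X)                      # test base learner on all data
        eps = D[pred != y].sum()                 # weighted error (Eq. 9)
        if eps >= 0.5:                           # weak-learning assumption violated
            break
        eps = max(eps, 1e-10)                    # guard against eps = 0 in the log
        alpha = 0.5 * np.log((1 - eps) / eps)    # learner weight (Eq. 8)
        D = D * np.exp(-alpha * y * pred)        # example-weight update (Eq. 7)
        D /= D.sum()                             # normalise by Z_k
        learners.append(h)
        alphas.append(alpha)
    return learners, alphas

def adaboost_predict(learners, alphas, X):
    """Decision rule H(x) = sign(sum_k alpha_k h_k(x))   (Eq. 10)."""
    votes = sum(a * h.predict(X) for a, h in zip(alphas, learners))
    return np.sign(votes)
```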

Page 21

Outline

1 Boosting - Intuitively

2 Adaboost - Overview

3 Adaboost - Demo

4 Adaboost - Details

Page 22

Adaboost - Demo

[Demo]

Page 23

Outline

1 Boosting - Intuitively

2 Adaboost - Overview

3 Adaboost - Demo

4 Adaboost - Details

Page 24

Adaboost - Training Error

So the training error rate of Adaboost reaches 0 for large enough K.

Combination of exponential loss + weak learning assumption.

If the base learners' weighted error is always ε ≤ 1/2 − γ, then:

e_{tr} \leq \left( \sqrt{1 - 4\gamma^2} \right)^K    (11)
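A quick illustrative calculation of the bound in Eq. (11); γ = 0.1 is an arbitrary choice for the weak learners' edge over chance, not a value from the slides.

```python
import numpy as np

gamma = 0.1
for K in (10, 50, 100, 200):
    print(K, np.sqrt(1 - 4 * gamma**2) ** K)
# The bound decays geometrically, so it eventually drops below 1/N;
# at that point the (discrete) training error must be exactly zero.
```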

Page 25

Adaboost - Training Error

Page 26

Adaboost - Training Error

Let’s derive the bound!

1 Show that the exponential loss is an upper bound on the classification error.

2 Show that the loss can be written as a product of the normalisation constants.

3 Minimise the normalisation constant at each step (deriving the learner weight rule).

4 Write this constant using the weighted error at each step.

Page 27

Adaboost - Training Error

Show that the exponential loss is an upper bound on the classification error.

e_{tr} = \frac{1}{N} \sum_{i=1}^{N} \delta[H(x_i) \neq y_i]    (12)

= \frac{1}{N} \sum_{i=1}^{N} \delta[m_i \leq 0]    (13)

\leq \frac{1}{N} \sum_{i=1}^{N} e^{-m_i}    (14)

Because x ≤ 0 ⇒ e^{−x} ≥ 1, and x > 0 ⇒ e^{−x} ≥ 0.
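A tiny numeric check of the inequality behind Eq. (14); the margins are arbitrary example values.

```python
import numpy as np

margins = np.array([-2.0, -0.5, 0.0, 0.5, 2.0])   # arbitrary margins m_i
indicator = (margins <= 0).astype(float)          # delta[m_i <= 0]
print(np.all(np.exp(-margins) >= indicator))      # True: exp(-m) dominates the 0/1 error
```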

Page 28

Adaboost - Training Error

Show that the loss can be written as a product of the normalisation constants.

D_k(i) = \frac{D_{k-1}(i) \, e^{-y_i \alpha_k h_k(x_i)}}{Z_k}    (15)

= D_0(i) \prod_{j=1}^{k} \frac{e^{-y_i \alpha_j h_j(x_i)}}{Z_j}    (16)

= \frac{1}{N} \, e^{-y_i \sum_{j=1}^{k} \alpha_j h_j(x_i)} \, \frac{1}{\prod_{j=1}^{k} Z_j}    (17)

e^{-m_i} = N \, D_k(i) \prod_{j=1}^{k} Z_j    (18)

Page 29

Adaboost - Training Error

Show that the loss can be written as a product of the normalisation constants.

e^{-m_i} = N \, D_k(i) \prod_{j=1}^{k} Z_j    (19)

e_{tr} \leq \frac{1}{N} \sum_{i=1}^{N} e^{-m_i}    (20)

= \frac{1}{N} \sum_{i=1}^{N} N \, D_k(i) \prod_{j=1}^{k} Z_j    (21)

= \left[ \sum_{i=1}^{N} D_k(i) \right] \prod_{j=1}^{k} Z_j    (22)

= \prod_{j=1}^{k} Z_j    (23)

Page 30

Adaboost - Training Error

Minimise normalisation constant at each step.

Z_k = \sum_{i=1}^{N} D_{k-1}(i) \, e^{-y_i \alpha_k h_k(x_i)}    (24)

= \sum_{h_k(x_i) \neq y_i} D_{k-1}(i) \, e^{\alpha_k} + \sum_{h_k(x_i) = y_i} D_{k-1}(i) \, e^{-\alpha_k}    (25)

= \varepsilon_k e^{\alpha_k} + (1 - \varepsilon_k) e^{-\alpha_k}    (26)
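A small numeric sanity check of the split in Eqs. (24)-(26), using made-up weights, labels and predictions.

```python
import numpy as np

rng = np.random.default_rng(0)
N = 6
D = rng.random(N); D /= D.sum()        # arbitrary normalised example weights D_{k-1}
y = rng.choice([-1, 1], N)             # arbitrary labels
pred = rng.choice([-1, 1], N)          # arbitrary base-learner predictions h_k(x_i)
alpha = 0.7                            # arbitrary learner weight

Z_direct = (D * np.exp(-alpha * y * pred)).sum()            # Eq. (24)
eps = D[pred != y].sum()                                    # weighted error
Z_split = eps * np.exp(alpha) + (1 - eps) * np.exp(-alpha)  # Eq. (26)
print(np.isclose(Z_direct, Z_split))                        # True
```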

Page 31

Adaboost - Training Error

Page 32

Adaboost - Training Error

Minimise normalisation constant at each step.

Z_k = \varepsilon_k e^{\alpha_k} + (1 - \varepsilon_k) e^{-\alpha_k}    (27)

\frac{\partial Z_k}{\partial \alpha_k} = \varepsilon_k e^{\alpha_k} - (1 - \varepsilon_k) e^{-\alpha_k}    (28)

0 = \varepsilon_k e^{2\alpha_k} - (1 - \varepsilon_k) e^{0}    (29)

e^{2\alpha_k} = \frac{1 - \varepsilon_k}{\varepsilon_k}    (30)

\alpha_k = \frac{1}{2} \log \frac{1 - \varepsilon_k}{\varepsilon_k}    (31)

Minimising Zk gives us the learner weight rule.

Page 33

Adaboost - Training Error

Write the normalisation constant using weighted error.

Z_k = \varepsilon_k e^{\alpha_k} + (1 - \varepsilon_k) e^{-\alpha_k}    (32)

\alpha_k = \frac{1}{2} \log \frac{1 - \varepsilon_k}{\varepsilon_k}    (33)

Z_k = \varepsilon_k e^{\frac{1}{2} \log \frac{1 - \varepsilon_k}{\varepsilon_k}} + (1 - \varepsilon_k) e^{-\frac{1}{2} \log \frac{1 - \varepsilon_k}{\varepsilon_k}}    (34)

= 2 \sqrt{\varepsilon_k (1 - \varepsilon_k)}    (35)

So Z_k depends only on ε_k, and the weak learning assumption says that ε_k < 0.5.
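A quick check of Eqs. (34)-(35) for a few arbitrary values of ε_k; it also shows that Z_k < 1 whenever ε_k < 0.5.

```python
import numpy as np

for eps in (0.1, 0.3, 0.45):                       # arbitrary weighted errors below 0.5
    alpha = 0.5 * np.log((1 - eps) / eps)          # Eq. (33)
    Z = eps * np.exp(alpha) + (1 - eps) * np.exp(-alpha)
    print(eps, Z, 2 * np.sqrt(eps * (1 - eps)))    # the two Z values agree and are < 1
```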

Page 34

Adaboost - Training Error

We bound the training error using: e_{tr} \leq \prod_{j=1}^{k} Z_j

We've found that Z_j is a function of the weighted error: Z_j = 2\sqrt{\varepsilon_j (1 - \varepsilon_j)}

So, we can bound the training error: e_{tr} \leq \prod_{j=1}^{k} 2\sqrt{\varepsilon_j (1 - \varepsilon_j)}

Let ε_j ≤ 1/2 − γ, and we can write e_{tr} \leq \left( \sqrt{1 - 4\gamma^2} \right)^k

If γ > 0, then this bound will shrink to 0 when k is large enough.

Page 35

Adaboost - Training Error Summary

We have seen an upper bound on the training error which decreases as K increases.

We can prove the bound by considering the loss function as a bound on the error rate.

Examining the role of the normalisation constant in this bound reveals the learner weight rule.

Examining the role of γ in the bound shows why we can achieve 0 training error even with weak learners.