Introduction to Boosting - The University of Manchester

Page 1

Introduction to Boosting

February 2, 2012

Page 2

Outline

1 Boosting - Intuitively

2 Adaboost - Overview

3 Adaboost - Demo

4 Adaboost - Details

Page 3

Boosting - Introduction

Train one base learner at a time.

Focus it on the mistakes of its predecessors.

Weight it based on how ‘useful’ it is in the ensemble (not on its training error).

Page 4

Boosting - Introduction

Under very mild assumptions on the base learners (“Weak learning” - they must perform better than random guessing):

1 Training error is eventually reduced to zero (we can fit the training data).

2 Generalisation error continues to be reduced even after training error is zero (there is no overfitting).

Page 5

Boosting - Introduction

The most popular ensemble algorithm is a boosting algorithm called “Adaboost”.

Adaboost is short for “Adaptive Boosting”, because the algorithm adapts weights on the base learners and training examples.

There are many explanations of precisely what Adaboost does and why it is so successful - but the basic idea is simple!

Page 6

Outline

1 Boosting - Intuitively

2 Adaboost - Overview

3 Adaboost - Demo

4 Adaboost - Details

Page 7

Adaboost - Pseudo-code

Adaboost Pseudo-code

Set uniform example weights.
for each base learner do
    Train base learner with weighted sample.
    Test base learner on all data.
    Set learner weight with weighted error.
    Set example weights based on ensemble predictions.
end for

Page 8

Adaboost - Loss Function

To see precisely how Adaboost is implemented, we need to understand its loss function:

L(H) = \sum_{i=1}^{N} e^{-m_i}    (1)

m_i = y_i \sum_{k=1}^{K} \alpha_k h_k(x_i)    (2)

m_i is the voting margin. There are N examples and K base learners. α_k is the weight of the kth learner, and h_k(x_i) is its prediction on x_i.
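As a concrete illustration, here is a minimal Python sketch of Eqs. (1)-(2). The arrays y, preds and alpha are made-up stand-ins for the labels, base-learner predictions and learner weights; nothing here is prescribed by the slides.

```python
import numpy as np

# Hypothetical inputs: labels y in {-1, +1}, per-learner predictions
# preds[k, i] = h_k(x_i) in {-1, +1}, and learner weights alpha[k].
y = np.array([1, -1, 1, 1, -1])
preds = np.array([[1, -1, 1, -1, -1],
                  [1,  1, 1,  1, -1]])
alpha = np.array([0.8, 0.4])

# Voting margin m_i = y_i * sum_k alpha_k * h_k(x_i)   (Eq. 2)
margins = y * (alpha @ preds)

# Exponential loss L(H) = sum_i exp(-m_i)               (Eq. 1)
loss = np.exp(-margins).sum()
print(margins, loss)
```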

Page 9

Adaboost - Loss Function

Why Exponential Loss?

Loss is higher when a prediction is wrong.

Loss is steeper when a prediction is wrong.

Precise reasons later…

Page 10

Adaboost - Loss Function

Page 11

Adaboost - Example Weights

After k learners, the example weights used to train the (k+1)th learner are:

D_k(i) \propto e^{-m_i}    (3)

= e^{-y_i \sum_{j=1}^{k} \alpha_j h_j(x_i)}    (4)

= e^{-y_i \sum_{j=1}^{k-1} \alpha_j h_j(x_i)} \, e^{-y_i \alpha_k h_k(x_i)}    (5)

\propto D_{k-1}(i) \, e^{-y_i \alpha_k h_k(x_i)}    (6)

D_k(i) = \frac{D_{k-1}(i) \, e^{-y_i \alpha_k h_k(x_i)}}{Z_k}    (7)

Z_k normalises D_k(i) such that \sum_{i=1}^{N} D_k(i) = 1.

Larger when m_i is negative.

Directly proportional to the loss function.
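A minimal sketch of the update in Eq. (7). The function name and arguments are hypothetical, assuming labels and predictions take values in {-1, +1}.

```python
import numpy as np

def update_weights(D_prev, alpha_k, y, h_k_preds):
    """Return D_k given D_{k-1}, the new learner's weight and its predictions (Eq. 7)."""
    D = D_prev * np.exp(-alpha_k * y * h_k_preds)  # un-normalised D_k(i)
    return D / D.sum()                             # divide by Z_k so the weights sum to 1
```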

Page 12

Adaboost - Learner Weights

Once the kth learner is trained, we evaluate its predictions h_k(x_i) on the training data and assign the weight:

\alpha_k = \frac{1}{2} \log \frac{1 - \varepsilon_k}{\varepsilon_k},    (8)

\varepsilon_k = \sum_{i=1}^{N} D_{k-1}(i) \, \delta[h_k(x_i) \neq y_i]    (9)

ε_k is the weighted error of the kth learner.

The log in the learner weight is related to the exponential in the loss function.
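A hedged sketch of Eqs. (8)-(9). The names learner_weight, D_prev, y and h_k_preds are illustrative, and the code assumes the weighted error lies strictly between 0 and 0.5.

```python
import numpy as np

def learner_weight(D_prev, y, h_k_preds):
    """Weighted error of learner k (Eq. 9) and the resulting learner weight (Eq. 8)."""
    eps_k = D_prev[h_k_preds != y].sum()         # weighted error; assumed 0 < eps_k < 0.5
    alpha_k = 0.5 * np.log((1 - eps_k) / eps_k)  # positive whenever eps_k < 0.5
    return alpha_k, eps_k
```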

Page 13

Adaboost - Learner Weights

Page 14

Adaboost - Pseudo-code

Adaboost Pseudo-code

D_k(i): weight of example i after learner k
α_k: weight of learner k

Set uniform example weights.
for each base learner do
    Train base learner with weighted sample.
    Test base learner on all data.
    Set learner weight with weighted error.
    Set example weights based on ensemble predictions.
end for

Page 15

Adaboost - Pseudo-code

Adaboost Pseudo-code

D_k(i): weight of example i after learner k
α_k: weight of learner k

∀i : D_0(i) ← 1/N
for each base learner do
    Train base learner with weighted sample.
    Test base learner on all data.
    Set learner weight with weighted error.
    Set example weights based on ensemble predictions.
end for

Page 16

Adaboost - Pseudo-code

Adaboost Pseudo-code

D_k(i): weight of example i after learner k
α_k: weight of learner k

∀i : D_0(i) ← 1/N
for k = 1 to K do
    D ← data sampled with D_{k−1}
    h_k ← base learner trained on D
    Test base learner on all data.
    Set learner weight with weighted error.
    Set example weights based on ensemble predictions.
end for

Page 17

Adaboost - Pseudo-code

Adaboost Pseudo-code

D_k(i): weight of example i after learner k
α_k: weight of learner k

∀i : D_0(i) ← 1/N
for k = 1 to K do
    D ← data sampled with D_{k−1}
    h_k ← base learner trained on D
    ε_k ← \sum_{i=1}^{N} D_{k−1}(i) δ[h_k(x_i) ≠ y_i]
    Set learner weight with weighted error.
    Set example weights based on ensemble predictions.
end for

Page 18

Adaboost - Pseudo-code

Adaboost Pseudo-code

D_k(i): weight of example i after learner k
α_k: weight of learner k

∀i : D_0(i) ← 1/N
for k = 1 to K do
    D ← data sampled with D_{k−1}
    h_k ← base learner trained on D
    ε_k ← \sum_{i=1}^{N} D_{k−1}(i) δ[h_k(x_i) ≠ y_i]
    α_k ← (1/2) log((1 − ε_k)/ε_k)
    Set example weights based on ensemble predictions.
end for

Page 19

Adaboost - Pseudo-code

Adaboost Pseudo-code

D_k(i): weight of example i after learner k
α_k: weight of learner k

∀i : D_0(i) ← 1/N
for k = 1 to K do
    D ← data sampled with D_{k−1}
    h_k ← base learner trained on D
    ε_k ← \sum_{i=1}^{N} D_{k−1}(i) δ[h_k(x_i) ≠ y_i]
    α_k ← (1/2) log((1 − ε_k)/ε_k)
    D_k(i) ← D_{k−1}(i) e^{−α_k y_i h_k(x_i)} / Z_k
end for

Page 20

Adaboost - Pseudo-code

Decision rule:

H(x) = \mathrm{sign}\left( \sum_{k=1}^{K} \alpha_k h_k(x) \right)    (10)
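Putting the pseudo-code and decision rule together, here is a minimal Python sketch. It assumes scikit-learn decision stumps as the (unspecified) base learners and uses reweighting via sample_weight in place of the explicit resampling step; the slides do not prescribe either choice.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def adaboost_train(X, y, K=50):
    """Minimal Adaboost sketch following the slides' pseudo-code; y must be in {-1, +1}."""
    N = len(y)
    D = np.full(N, 1.0 / N)                      # D_0(i) = 1/N
    learners, alphas = [], []
    for k in range(K):
        h = DecisionTreeClassifier(max_depth=1)  # decision stump as base learner
        h.fit(X, y, sample_weight=D)             # train with weights D_{k-1}
        pred = h.predict(X)                      # test base learner on all data
        eps = D[pred != y].sum()                 # weighted error (Eq. 9)
        if eps >= 0.5:                           # weak-learning assumption violated
            break
        eps = max(eps, 1e-10)                    # guard against eps = 0 in the log
        alpha = 0.5 * np.log((1 - eps) / eps)    # learner weight (Eq. 8)
        D = D * np.exp(-alpha * y * pred)        # example-weight update (Eq. 7)
        D /= D.sum()                             # normalise by Z_k
        learners.append(h)
        alphas.append(alpha)
    return learners, alphas

def adaboost_predict(learners, alphas, X):
    """Decision rule H(x) = sign(sum_k alpha_k h_k(x))   (Eq. 10)."""
    votes = sum(a * h.predict(X) for a, h in zip(alphas, learners))
    return np.sign(votes)
```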

Page 21

Outline

1 Boosting - Intuitively

2 Adaboost - Overview

3 Adaboost - Demo

4 Adaboost - Details

Page 22

Adaboost - Demo

[Demo]

Page 23

Outline

1 Boosting - Intuitively

2 Adaboost - Overview

3 Adaboost - Demo

4 Adaboost - Details

Page 24

Adaboost - Training Error

So the training error rate of Adaboost reaches 0 for large enough K.

Combination of exponential loss + weak learning assumption.

If the base learners' weighted error is always ε ≤ 1/2 − γ, then:

e_{tr} \leq \left( \sqrt{1 - 4\gamma^2} \right)^K    (11)
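A quick illustrative calculation of the bound in Eq. (11); γ = 0.1 is an arbitrary choice for the weak learners' edge over chance, not a value from the slides.

```python
import numpy as np

gamma = 0.1
for K in (10, 50, 100, 200):
    print(K, np.sqrt(1 - 4 * gamma**2) ** K)
# The bound decays geometrically, so it eventually drops below 1/N;
# at that point the (discrete) training error must be exactly zero.
```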

Page 25

Adaboost - Training Error

Page 26

Adaboost - Training Error

Let’s derive the bound!

1 Show that the exponential loss is an upper bound on the classification error.

2 Show that the loss can be written as a product of the normalisation constants.

3 Minimise the normalisation constant at each step (deriving the learner weight rule).

4 Write this constant using the weighted error at each step.

Page 27

Adaboost - Training Error

Show that the exponential loss is an upper bound on the classification error.

e_{tr} = \frac{1}{N} \sum_{i=1}^{N} \delta[H(x_i) \neq y_i]    (12)

= \frac{1}{N} \sum_{i=1}^{N} \delta[m_i \leq 0]    (13)

\leq \frac{1}{N} \sum_{i=1}^{N} e^{-m_i}    (14)

Because x ≤ 0 ⇒ e^{−x} ≥ 1, and x > 0 ⇒ e^{−x} ≥ 0.
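A tiny numeric check of the inequality behind Eq. (14); the margins are arbitrary example values.

```python
import numpy as np

margins = np.array([-2.0, -0.5, 0.0, 0.5, 2.0])   # arbitrary margins m_i
indicator = (margins <= 0).astype(float)          # delta[m_i <= 0]
print(np.all(np.exp(-margins) >= indicator))      # True: exp(-m) dominates the 0/1 error
```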

Page 28

Adaboost - Training Error

Show that the loss can be written as a product of the normalisation constants.

D_k(i) = \frac{D_{k-1}(i) \, e^{-y_i \alpha_k h_k(x_i)}}{Z_k}    (15)

= D_0(i) \prod_{j=1}^{k} \frac{e^{-y_i \alpha_j h_j(x_i)}}{Z_j}    (16)

= \frac{1}{N} \, e^{-y_i \sum_{j=1}^{k} \alpha_j h_j(x_i)} \, \frac{1}{\prod_{j=1}^{k} Z_j}    (17)

e^{-m_i} = N \, D_k(i) \prod_{j=1}^{k} Z_j    (18)

Page 29

Adaboost - Training Error

Show that the loss can be written as a product of the normalisation constants.

e^{-m_i} = N \, D_k(i) \prod_{j=1}^{k} Z_j    (19)

e_{tr} \leq \frac{1}{N} \sum_{i=1}^{N} e^{-m_i}    (20)

= \frac{1}{N} \sum_{i=1}^{N} N \, D_k(i) \prod_{j=1}^{k} Z_j    (21)

= \left[ \sum_{i=1}^{N} D_k(i) \right] \prod_{j=1}^{k} Z_j    (22)

= \prod_{j=1}^{k} Z_j    (23)

Page 30

Adaboost - Training Error

Minimise normalisation constant at each step.

Z_k = \sum_{i=1}^{N} D_{k-1}(i) \, e^{-y_i \alpha_k h_k(x_i)}    (24)

= \sum_{h_k(x_i) \neq y_i} D_{k-1}(i) \, e^{\alpha_k} + \sum_{h_k(x_i) = y_i} D_{k-1}(i) \, e^{-\alpha_k}    (25)

= \varepsilon_k e^{\alpha_k} + (1 - \varepsilon_k) e^{-\alpha_k}    (26)
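A small numeric sanity check of the split in Eqs. (24)-(26), using made-up weights, labels and predictions.

```python
import numpy as np

rng = np.random.default_rng(0)
N = 6
D = rng.random(N); D /= D.sum()        # arbitrary normalised example weights D_{k-1}
y = rng.choice([-1, 1], N)             # arbitrary labels
pred = rng.choice([-1, 1], N)          # arbitrary base-learner predictions h_k(x_i)
alpha = 0.7                            # arbitrary learner weight

Z_direct = (D * np.exp(-alpha * y * pred)).sum()            # Eq. (24)
eps = D[pred != y].sum()                                    # weighted error
Z_split = eps * np.exp(alpha) + (1 - eps) * np.exp(-alpha)  # Eq. (26)
print(np.isclose(Z_direct, Z_split))                        # True
```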

Page 31

Adaboost - Training Error

Page 32

Adaboost - Training Error

Minimise normalisation constant at each step.

Z_k = \varepsilon_k e^{\alpha_k} + (1 - \varepsilon_k) e^{-\alpha_k}    (27)

\frac{\partial Z_k}{\partial \alpha_k} = \varepsilon_k e^{\alpha_k} - (1 - \varepsilon_k) e^{-\alpha_k}    (28)

0 = \varepsilon_k e^{2\alpha_k} - (1 - \varepsilon_k) e^{0}    (29)

e^{2\alpha_k} = \frac{1 - \varepsilon_k}{\varepsilon_k}    (30)

\alpha_k = \frac{1}{2} \log \frac{1 - \varepsilon_k}{\varepsilon_k}    (31)

Minimising Zk gives us the learner weight rule.

Page 33

Adaboost - Training Error

Write the normalisation constant using weighted error.

Z_k = \varepsilon_k e^{\alpha_k} + (1 - \varepsilon_k) e^{-\alpha_k}    (32)

\alpha_k = \frac{1}{2} \log \frac{1 - \varepsilon_k}{\varepsilon_k}    (33)

Z_k = \varepsilon_k e^{\frac{1}{2} \log \frac{1 - \varepsilon_k}{\varepsilon_k}} + (1 - \varepsilon_k) e^{-\frac{1}{2} \log \frac{1 - \varepsilon_k}{\varepsilon_k}}    (34)

= 2 \sqrt{\varepsilon_k (1 - \varepsilon_k)}    (35)

So Z_k depends only on ε_k, and the weak learning assumption says that ε_k < 0.5.
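A quick check of Eqs. (34)-(35) for a few arbitrary values of ε_k; it also shows that Z_k < 1 whenever ε_k < 0.5.

```python
import numpy as np

for eps in (0.1, 0.3, 0.45):                       # arbitrary weighted errors below 0.5
    alpha = 0.5 * np.log((1 - eps) / eps)          # Eq. (33)
    Z = eps * np.exp(alpha) + (1 - eps) * np.exp(-alpha)
    print(eps, Z, 2 * np.sqrt(eps * (1 - eps)))    # the two Z values agree and are < 1
```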

Page 34

Adaboost - Training Error

We bound the training error using: e_{tr} \leq \prod_{j=1}^{k} Z_j

We've found that Z_j is a function of the weighted error: Z_j = 2\sqrt{\varepsilon_j (1 - \varepsilon_j)}

So, we can bound the training error: e_{tr} \leq \prod_{j=1}^{k} 2\sqrt{\varepsilon_j (1 - \varepsilon_j)}

Let ε_j ≤ 1/2 − γ, and we can write e_{tr} \leq \left( \sqrt{1 - 4\gamma^2} \right)^k

If γ > 0, then this bound will shrink to 0 when k is large enough.

Page 35

Adaboost - Training Error Summary

We have seen an upper bound on the training error which decreases as K increases.

We can prove the bound by considering the loss function as a bound on the error rate.

Examining the role of the normalisation constant in this bound reveals the learner weight rule.

Examining the role of γ in the bound shows why we can achieve 0 training error even with weak learners.