
Online Learning: Wrap Up

&

Boosting: Interpretations, Extensions, Theory

Arindam Banerjee


Online: “Things we liked”

Adaptive (Natural)

Computationally efficient

Good guarantees [2]

Gradient descent based

No assumptions [2]


Online: “Things we didn’t like”

Assumes some good expert or oracle [6]

Not “robust” (noisy data)

Tracking of outliers

Intuition about convergence rates

Unsupervised versions

Stock market assumptions

Worst-case focus (conservative)

Learning rate (how to choose?)


Online: “General Questions”

Relations with Bayesian inference (e.g., Dirichlet)

Convergence proofs are good, but what about intuitive algorithms?

Is there a Batch to Online continuum?



Some Key Ideas

Hypothesis Spaces and Oracles

Approximation error and Estimation error

Complexity of Hypothesis Spaces (VC dimensions)

Regularization and “Being Conservative”

Sufficient conditions for “Learning”

Empirical Estimation Theory and Falsifiability

Let's get back to real life ... to Boosting ...


Weak Learning

A weak learner predicts “slightly better” than random

The PAC setting (basics):

Let X be an instance space, c : X → {0, 1} be a target concept, and H be a hypothesis space (h : X → {0, 1})

Let Q be a fixed (but unknown) distribution on X

An algorithm, after training on (x_i, c(x_i)), i = 1, …, m, selects h ∈ H such that

P_{x∼Q}[h(x) ≠ c(x)] ≤ 1/2 − γ

The algorithm is called a γ-weak learner

We assume the existence of such a learner
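As a concrete illustration (not from the slides), here is a minimal decision-stump learner in Python, assuming numpy and labels taken in {−1, +1} rather than {0, 1}: if, for every distribution D the booster produces, its weighted error stays below 1/2 − γ, it meets the γ-weak-learner definition above. This is a sketch under those assumptions.

```python
import numpy as np

def stump_weak_learner(X, y, D):
    """A minimal weak learner: an axis-aligned decision stump.
    X: (m, d) array, y: labels in {-1, +1}, D: example weights summing to 1.
    Returns a callable h with h(X) in {-1, +1}, chosen to minimize weighted error."""
    m, d = X.shape
    best_err, best = np.inf, None
    for j in range(d):                        # feature to split on
        for thr in np.unique(X[:, j]):        # candidate thresholds
            for s in (+1, -1):                # orientation of the stump
                pred = s * np.where(X[:, j] > thr, 1, -1)
                err = np.sum(D[pred != y])    # weighted error under D
                if err < best_err:
                    best_err, best = err, (j, thr, s)
    j, thr, s = best
    return lambda Xq: s * np.where(Xq[:, j] > thr, 1, -1)
```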


The Boosting Model

Boosting converts a weak learner to a strong learner

Boosting proceeds in rounds:

The booster constructs D_t on X, the train set

The weak learner produces a hypothesis h_t ∈ H so that

P_{x∼D_t}[h_t(x) ≠ c(x)] ≤ 1/2 − γ_t

After T rounds, the weak hypotheses h_t, t = 1, …, T, are combined into a final hypothesis h_final

We need procedures:

for obtaining D_t at each step

for combining the weak hypotheses


Adaboost

Input: Training set (x_1, y_1), …, (x_m, y_m)

Algorithm: Initialize D_1(i) = 1/m

For t = 1, …, T:

Train a weak learner using distribution D_t

Get weak hypothesis h_t with error ε_t = Pr_{i∼D_t}[h_t(x_i) ≠ y_i]

Choose α_t = (1/2) ln((1 − ε_t)/ε_t)

Update

D_{t+1}(i) = D_t(i) exp(−α_t y_i h_t(x_i)) / Z_t

where Z_t is the normalization factor

Output: h(x) = sign( Σ_{t=1}^T α_t h_t(x) )
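A minimal transcription of this pseudocode into Python (a sketch, assuming numpy; `weak_learner(X, y, D)` is a placeholder that must return a callable hypothesis with values in {−1, +1}, e.g., the stump sketched earlier):

```python
import numpy as np

def adaboost(X, y, weak_learner, T):
    """Adaboost as on this slide; y must be a numpy array with values in {-1, +1}."""
    m = len(y)
    D = np.full(m, 1.0 / m)                      # D_1(i) = 1/m
    hyps, alphas = [], []
    for t in range(T):
        h = weak_learner(X, y, D)                # train weak learner under D_t
        pred = h(X)
        eps = np.clip(np.sum(D[pred != y]), 1e-12, 1 - 1e-12)  # eps_t, clipped for safety
        alpha = 0.5 * np.log((1 - eps) / eps)    # alpha_t = (1/2) ln((1 - eps_t)/eps_t)
        D = D * np.exp(-alpha * y * pred)
        D = D / D.sum()                          # divide by Z_t, the normalization factor
        hyps.append(h)
        alphas.append(alpha)
    # final hypothesis: sign of the weighted vote
    return lambda Xq: np.sign(sum(a * h(Xq) for a, h in zip(alphas, hyps)))
```

For example, `adaboost(X, y, stump_weak_learner, T=50)` returns a classifier implementing h(x) = sign( Σ_t α_t h_t(x) ).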


Training Error, Margin Error

[Figure: “Loss functions for boosting”: training loss vs. margin yf(x) for the 0-1 loss, the Adaboost (exponential) loss, and the Logitboost (logistic) loss]

The 0-1 training set loss with convex upper bounds:exponential loss and logistic loss


More on the Training Error

The training error of the final classifier is bounded:

(1/m) |{i : h(x_i) ≠ y_i}| ≤ (1/m) Σ_{i=1}^m exp(−y_i f(x_i)) = Π_{t=1}^T Z_t

where f(x) = Σ_{t=1}^T α_t h_t(x) is the unthresholded combination and h(x) = sign(f(x))

Training error can be reduced most rapidly by choosing the α_t that minimizes

Z_t = Σ_{i=1}^m D_t(i) exp(−α_t y_i h_t(x_i))

Adaboost chooses the optimal αt

Margin is (sort of) maximized

Other boosting algorithms minimize other upper bounds
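A quick numerical sanity check of this claim (a sketch assuming numpy and scipy; the weights and margins are synthetic): the closed-form α_t from the previous slide should coincide with the numerical minimizer of Z_t(α).

```python
import numpy as np
from scipy.optimize import minimize_scalar

rng = np.random.default_rng(0)
m = 20
D = rng.random(m); D /= D.sum()                # a distribution D_t over the training points
margins = np.array([1] * 14 + [-1] * 6)        # y_i * h_t(x_i): 14 correct, 6 mistakes

def Z(alpha):                                  # Z_t(alpha) = sum_i D_t(i) exp(-alpha y_i h_t(x_i))
    return np.sum(D * np.exp(-alpha * margins))

eps = np.sum(D[margins == -1])                 # weighted error eps_t
alpha_closed = 0.5 * np.log((1 - eps) / eps)   # Adaboost's alpha_t
alpha_numeric = minimize_scalar(Z).x           # numerical minimizer of Z_t
print(alpha_closed, alpha_numeric)             # the two should agree up to solver tolerance
```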


Boosting as Entropy Projection

For a general α we have

D_{t+1}^α(i) = D_t(i) exp(−α y_i h_t(x_i)) / Z_t(α)

where Z_t(α) = Σ_{i=1}^m D_t(i) exp(−α y_i h_t(x_i)). Then

min_{D : E_D[y h_t(x)] = 0} KL(D, D_t) = max_α ( − ln Z_t(α) )

Further,

argmin_{D : E_D[y h_t(x)] = 0} KL(D, D_t) = D_{t+1}^{α_t},

which is the update Adaboost uses.

But what if labels are noisy?
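A small numerical illustration of the projection view (a sketch assuming numpy, with synthetic weights): at the α that maximizes −ln Z_t(α), i.e., Adaboost's α_t, the reweighted distribution puts equal total mass on correctly and incorrectly classified points, so it satisfies the constraint E_D[y h_t(x)] = 0 defining the set projected onto.

```python
import numpy as np

rng = np.random.default_rng(1)
m = 20
D_t = rng.random(m); D_t /= D_t.sum()          # current distribution D_t
u = np.array([1] * 12 + [-1] * 8)              # u_i = y_i * h_t(x_i)

eps = np.sum(D_t[u == -1])                     # weighted error of h_t under D_t
alpha = 0.5 * np.log((1 - eps) / eps)          # the alpha maximizing -ln Z_t(alpha)

D_next = D_t * np.exp(-alpha * u)
D_next /= D_next.sum()                         # D_{t+1}^alpha

print(np.dot(D_next, u))                       # E_{D_{t+1}}[y h_t(x)], approximately 0
```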


Additive Models

An additive model

F_T(x) = Σ_{t=1}^T w_t f_t(x)

Fit the t-th component to minimize the “residual” error

Error is measured by C(yf(x)), an upper bound on the 0-1 loss [y ≠ sign(f(x))]

For Adaboost, C(yf(x)) = exp(−yf(x))

For Logitboost,

C(yf(x)) = log(1 + exp(−yf(x)))
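Written out in Python (a sketch assuming numpy): the two margin costs as functions of the margin yf(x). Note that the natural-log logistic loss shown here needs rescaling by 1/ln 2 (i.e., log base 2) to dominate the 0-1 loss at margin 0, which is how it is usually plotted.

```python
import numpy as np

def zero_one_loss(margin):        # 1[y f(x) <= 0]
    return (margin <= 0).astype(float)

def adaboost_loss(margin):        # C(yf(x)) = exp(-y f(x))
    return np.exp(-margin)

def logitboost_loss(margin):      # C(yf(x)) = log(1 + exp(-y f(x)))
    return np.log1p(np.exp(-margin))

margins = np.linspace(-1, 1, 5)
for loss in (zero_one_loss, adaboost_loss, logitboost_loss):
    print(loss.__name__, np.round(loss(margins), 3))
```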


Upper Bounds on Training Errors

[Figure: “Loss functions for boosting”: training loss vs. margin yf(x) for the 0-1 loss, the Adaboost (exponential) loss, and the Logitboost (logistic) loss]

The 0-1 training set loss with convex upper bounds:exponential loss and logistic loss


Adaboost as Gradient Descent

Margin Cost function C(yh(x)) = exp(−yh(x))

Current classifier H_{t−1}(x), next weak learner h_t(x)

Next classifier H_t(x) = H_{t−1}(x) + α_t h_t(x)

The cost function to be minimized:

Σ_{i=1}^m C(y_i [H_{t−1}(x_i) + α_t h_t(x_i)])

The optimum solution:

α_t = (1/2) ln( Σ_{i : h_t(x_i) = y_i} D_t(i) / Σ_{i : h_t(x_i) ≠ y_i} D_t(i) ) = (1/2) ln((1 − ε_t)/ε_t)
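For completeness, a sketch of the calculus behind this closed form: since the unnormalized weights exp(−y_i H_{t−1}(x_i)) are proportional to D_t(i), the cost above is (up to a constant factor) Σ_i D_t(i) exp(−α y_i h_t(x_i)); splitting into correctly and incorrectly classified points and setting the derivative in α to zero gives the stated α_t.

```latex
\sum_i D_t(i)\, e^{-\alpha y_i h_t(x_i)}
   = (1-\epsilon_t)\, e^{-\alpha} + \epsilon_t\, e^{\alpha},
\qquad
\frac{d}{d\alpha}\left[(1-\epsilon_t)\, e^{-\alpha} + \epsilon_t\, e^{\alpha}\right]
   = -(1-\epsilon_t)\, e^{-\alpha} + \epsilon_t\, e^{\alpha} = 0
\;\Rightarrow\;
e^{2\alpha} = \frac{1-\epsilon_t}{\epsilon_t}
\;\Rightarrow\;
\alpha_t = \tfrac{1}{2}\ln\frac{1-\epsilon_t}{\epsilon_t}.
```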


Gradient Descent in Function Space

Choose any appropriate margin cost function C

Goal is to minimize margin cost on the training set

Current classifier H_{t−1}(x), next weak learner h_t(x)

Next classifier H_t(x) = H_{t−1}(x) + α_t h_t(x)

Choose α_t to minimize

Σ_{i=1}^m C(y_i H_t(x_i)) = Σ_{i=1}^m C(y_i [H_{t−1}(x_i) + α_t h_t(x_i)])

“Anyboost” class of algorithms
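A generic sketch of this idea in Python (assuming numpy; the grid line search and the interface `fit_weak(X, w)` returning a ±1-valued hypothesis fit to the weights are illustrative choices, not prescribed by the slide):

```python
import numpy as np

def anyboost(X, y, fit_weak, C, C_prime, T, alpha_grid=np.linspace(0.0, 2.0, 201)):
    """Functional gradient descent on the total margin cost sum_i C(y_i F(x_i)).
    C is the margin cost, C_prime its derivative; fit_weak(X, w) returns a
    callable h with h(X) in {-1, +1}, fit to correlate with the weights w."""
    hyps, coefs = [], []
    F = np.zeros(len(y))                          # F_0(x_i) = 0 on the training set
    for t in range(T):
        w = -y * C_prime(y * F)                   # negative functional gradient at each x_i
        h = fit_weak(X, w)                        # weak hypothesis approximating that direction
        hx = h(X)
        costs = [np.sum(C(y * (F + a * hx))) for a in alpha_grid]
        a = alpha_grid[int(np.argmin(costs))]     # crude line search for alpha_t
        F = F + a * hx
        hyps.append(h)
        coefs.append(a)
    return lambda Xq: np.sign(sum(a_ * h_(Xq) for a_, h_ in zip(coefs, hyps)))
```

With C(z) = exp(−z) this reduces to Adaboost's choice of h_t and, up to the discretized line search, its α_t.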


Margin Bounds: Why Boosting works

Test set error decreases even after train set error is 0

No simple “bias-variance” explanation

Bounds on the error rate of the boosted classifier depend on

the number of training examples

the margin achieved on the train set

the complexity of the weak learners

It does not depend on the number of classifiers combined


Margin Bound for Boosting

H: class of binary classifiers of VC-dimension d_H

co(H) = { h : h(x) = Σ_i α_i h_i(x), α_i ≥ 0, Σ_i α_i = 1 }, the convex hull of H

With probability at least (1 − δ) over the draw of a train set S of size m, for all h ∈ co(H) and all θ > 0 we have

Pr[yh(x) ≤ 0] ≤ Pr_S[yh(x) ≤ θ] + O( (1/θ) √(d_H / m) ) + O( √( log(1/δ) / m ) )


Maximizing the Margin

Do boosting algorithms maximize the margin?

Minimize upper bound on training error

Often get large margin in the process

Do not explicitly maximize the margin

Explicit margin maximization

Classical Boost: Minimize bound on Pr_S[yh(x) ≤ 0]

Margin Boost: Minimize (bound on) Pr_S[yh(x) ≤ θ]


Boosting: Wrap Up (for now)

Strong learner from convex combinations of weak learners

Choice of loss function is critical

Aggressive loss functions for clean data

Robust loss functions for noisy data

Gradient descent in function space of additive models

Generalization by implicit margin maximization

Explicit maximum margin boosting algorithms possible

Is bound minimization a good idea?

