
Online Learning: Wrap Up

&

Boosting: Interpretations, Extensions, Theory

Arindam Banerjee


Online: “Things we liked”

Adaptive (Natural)

Computationally efficient

Good guarantees [2]

Gradient descent based

No assumptions [2]


Online: “Things we didn’t like”

Assumes some good expert or oracle [6]

Not “robust” (noisy data)

Tracking of outliers

Intuition about convergence rates

Unsupervised versions

Stock market assumptions

Worst-case focus (conservative)

Learning rate (how to choose?)


Online: “General Questions”

Relations with Bayesian inference (e.g., Dirichlet)

Convergence proofs are good, but what about intuitive algorithms?

Is there a Batch to Online continuum?



Some Key Ideas

Hypothesis Spaces and Oracles

Approximation error and Estimation error

Complexity of Hypothesis Spaces (VC dimensions)

Regularization and “Being Conservative”

Sufficient conditions for “Learning”

Empirical Estimation Theory and Falsifiability

Let's get back to real life ... to Boosting ...


Weak Learning

A weak learner predicts “slightly better” than random

The PAC setting (basics):

Let X be an instance space, c : X → {0, 1} be a target concept, and H be a hypothesis space (h : X → {0, 1})

Let Q be a fixed (but unknown) distribution on X

An algorithm, after training on (x_i, c(x_i)), i = 1, …, m, selects h ∈ H such that

P_{x∼Q}[h(x) ≠ c(x)] ≤ 1/2 − γ

The algorithm is called a γ-weak learner

We assume the existence of such a learner
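As a concrete illustration (not from the slides), here is a minimal decision-stump learner in Python, assuming numpy and labels taken in {−1, +1} rather than {0, 1}: if, for every distribution D the booster produces, its weighted error stays below 1/2 − γ, it meets the γ-weak-learner definition above. This is a sketch under those assumptions.

```python
import numpy as np

def stump_weak_learner(X, y, D):
    """A minimal weak learner: an axis-aligned decision stump.
    X: (m, d) array, y: labels in {-1, +1}, D: example weights summing to 1.
    Returns a callable h with h(X) in {-1, +1}, chosen to minimize weighted error."""
    m, d = X.shape
    best_err, best = np.inf, None
    for j in range(d):                        # feature to split on
        for thr in np.unique(X[:, j]):        # candidate thresholds
            for s in (+1, -1):                # orientation of the stump
                pred = s * np.where(X[:, j] > thr, 1, -1)
                err = np.sum(D[pred != y])    # weighted error under D
                if err < best_err:
                    best_err, best = err, (j, thr, s)
    j, thr, s = best
    return lambda Xq: s * np.where(Xq[:, j] > thr, 1, -1)
```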


The Boosting Model

Boosting converts a weak learner to a strong learner

Boosting proceeds in rounds:

The booster constructs D_t on X, the train set

The weak learner produces a hypothesis h_t ∈ H so that

P_{x∼D_t}[h_t(x) ≠ c(x)] ≤ 1/2 − γ_t

After T rounds, the weak hypotheses h_t, t = 1, …, T, are combined into a final hypothesis h_final

We need procedures:

for obtaining D_t at each step

for combining the weak hypotheses


Adaboost

Input: Training set (x_1, y_1), …, (x_m, y_m)

Algorithm: Initialize D_1(i) = 1/m

For t = 1, …, T:

Train a weak learner using distribution D_t

Get weak hypothesis h_t with error ε_t = Pr_{i∼D_t}[h_t(x_i) ≠ y_i]

Choose α_t = (1/2) ln((1 − ε_t)/ε_t)

Update

D_{t+1}(i) = D_t(i) exp(−α_t y_i h_t(x_i)) / Z_t

where Z_t is the normalization factor

Output: h(x) = sign( Σ_{t=1}^T α_t h_t(x) )
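A minimal transcription of this pseudocode into Python (a sketch, assuming numpy; `weak_learner(X, y, D)` is a placeholder that must return a callable hypothesis with values in {−1, +1}, e.g., the stump sketched earlier):

```python
import numpy as np

def adaboost(X, y, weak_learner, T):
    """Adaboost as on this slide; y must be a numpy array with values in {-1, +1}."""
    m = len(y)
    D = np.full(m, 1.0 / m)                      # D_1(i) = 1/m
    hyps, alphas = [], []
    for t in range(T):
        h = weak_learner(X, y, D)                # train weak learner under D_t
        pred = h(X)
        eps = np.clip(np.sum(D[pred != y]), 1e-12, 1 - 1e-12)  # eps_t, clipped for safety
        alpha = 0.5 * np.log((1 - eps) / eps)    # alpha_t = (1/2) ln((1 - eps_t)/eps_t)
        D = D * np.exp(-alpha * y * pred)
        D = D / D.sum()                          # divide by Z_t, the normalization factor
        hyps.append(h)
        alphas.append(alpha)
    # final hypothesis: sign of the weighted vote
    return lambda Xq: np.sign(sum(a * h(Xq) for a, h in zip(alphas, hyps)))
```

For example, `adaboost(X, y, stump_weak_learner, T=50)` returns a classifier implementing h(x) = sign( Σ_t α_t h_t(x) ).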


Training Error, Margin Error

[Figure: “Loss functions for boosting”: training loss vs. margin yf(x) for the 0-1 loss, the Adaboost (exponential) loss, and the Logitboost (logistic) loss]

The 0-1 training set loss with convex upper bounds:exponential loss and logistic loss


More on the Training Error

The training error of the final classifier is bounded:

(1/m) |{i : h(x_i) ≠ y_i}| ≤ (1/m) Σ_{i=1}^m exp(−y_i f(x_i)) = Π_{t=1}^T Z_t

where f(x) = Σ_{t=1}^T α_t h_t(x) is the unthresholded combination and h(x) = sign(f(x))

Training error can be reduced most rapidly by choosing the α_t that minimizes

Z_t = Σ_{i=1}^m D_t(i) exp(−α_t y_i h_t(x_i))

Adaboost chooses the optimal αt

Margin is (sort of) maximized

Other boosting algorithms minimize other upper bounds
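A quick numerical sanity check of this claim (a sketch assuming numpy and scipy; the weights and margins are synthetic): the closed-form α_t from the previous slide should coincide with the numerical minimizer of Z_t(α).

```python
import numpy as np
from scipy.optimize import minimize_scalar

rng = np.random.default_rng(0)
m = 20
D = rng.random(m); D /= D.sum()                # a distribution D_t over the training points
margins = np.array([1] * 14 + [-1] * 6)        # y_i * h_t(x_i): 14 correct, 6 mistakes

def Z(alpha):                                  # Z_t(alpha) = sum_i D_t(i) exp(-alpha y_i h_t(x_i))
    return np.sum(D * np.exp(-alpha * margins))

eps = np.sum(D[margins == -1])                 # weighted error eps_t
alpha_closed = 0.5 * np.log((1 - eps) / eps)   # Adaboost's alpha_t
alpha_numeric = minimize_scalar(Z).x           # numerical minimizer of Z_t
print(alpha_closed, alpha_numeric)             # the two should agree up to solver tolerance
```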


Boosting as Entropy Projection

For a general α we have

D_{t+1}^α(i) = D_t(i) exp(−α y_i h_t(x_i)) / Z_t(α)

where Z_t(α) = Σ_{i=1}^m D_t(i) exp(−α y_i h_t(x_i)). Then

min_{D : E_D[y h_t(x)] = 0} KL(D, D_t) = max_α ( − ln Z_t(α) )

Further,

argmin_{D : E_D[y h_t(x)] = 0} KL(D, D_t) = D_{t+1}^{α_t},

which is the update Adaboost uses.

But what if labels are noisy?
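A small numerical illustration of the projection view (a sketch assuming numpy, with synthetic weights): at the α that maximizes −ln Z_t(α), i.e., Adaboost's α_t, the reweighted distribution puts equal total mass on correctly and incorrectly classified points, so it satisfies the constraint E_D[y h_t(x)] = 0 defining the set projected onto.

```python
import numpy as np

rng = np.random.default_rng(1)
m = 20
D_t = rng.random(m); D_t /= D_t.sum()          # current distribution D_t
u = np.array([1] * 12 + [-1] * 8)              # u_i = y_i * h_t(x_i)

eps = np.sum(D_t[u == -1])                     # weighted error of h_t under D_t
alpha = 0.5 * np.log((1 - eps) / eps)          # the alpha maximizing -ln Z_t(alpha)

D_next = D_t * np.exp(-alpha * u)
D_next /= D_next.sum()                         # D_{t+1}^alpha

print(np.dot(D_next, u))                       # E_{D_{t+1}}[y h_t(x)], approximately 0
```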


Additive Models

An additive model

F_T(x) = Σ_{t=1}^T w_t f_t(x)

Fit the t-th component to minimize the “residual” error

Error is measured by C(yf(x)), an upper bound on the 0-1 loss [y ≠ sign(f(x))]

For Adaboost, C(yf(x)) = exp(−yf(x))

For Logitboost,

C(yf(x)) = log(1 + exp(−yf(x)))
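Written out in Python (a sketch assuming numpy): the two margin costs as functions of the margin yf(x). Note that the natural-log logistic loss shown here needs rescaling by 1/ln 2 (i.e., log base 2) to dominate the 0-1 loss at margin 0, which is how it is usually plotted.

```python
import numpy as np

def zero_one_loss(margin):        # 1[y f(x) <= 0]
    return (margin <= 0).astype(float)

def adaboost_loss(margin):        # C(yf(x)) = exp(-y f(x))
    return np.exp(-margin)

def logitboost_loss(margin):      # C(yf(x)) = log(1 + exp(-y f(x)))
    return np.log1p(np.exp(-margin))

margins = np.linspace(-1, 1, 5)
for loss in (zero_one_loss, adaboost_loss, logitboost_loss):
    print(loss.__name__, np.round(loss(margins), 3))
```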


Upper Bounds on Training Errors

[Figure: “Loss functions for boosting”: training loss vs. margin yf(x) for the 0-1 loss, the Adaboost (exponential) loss, and the Logitboost (logistic) loss]

The 0-1 training set loss with convex upper bounds:exponential loss and logistic loss


Adaboost as Gradient Descent

Margin Cost function C(yh(x)) = exp(−yh(x))

Current classifier H_{t−1}(x), next weak learner h_t(x)

Next classifier H_t(x) = H_{t−1}(x) + α_t h_t(x)

The cost function to be minimized:

Σ_{i=1}^m C(y_i [H_{t−1}(x_i) + α_t h_t(x_i)])

The optimum solution:

α_t = (1/2) ln( Σ_{i : h_t(x_i) = y_i} D_t(i) / Σ_{i : h_t(x_i) ≠ y_i} D_t(i) ) = (1/2) ln((1 − ε_t)/ε_t)
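For completeness, a sketch of the calculus behind this closed form: since the unnormalized weights exp(−y_i H_{t−1}(x_i)) are proportional to D_t(i), the cost above is (up to a constant factor) Σ_i D_t(i) exp(−α y_i h_t(x_i)); splitting into correctly and incorrectly classified points and setting the derivative in α to zero gives the stated α_t.

```latex
\sum_i D_t(i)\, e^{-\alpha y_i h_t(x_i)}
   = (1-\epsilon_t)\, e^{-\alpha} + \epsilon_t\, e^{\alpha},
\qquad
\frac{d}{d\alpha}\left[(1-\epsilon_t)\, e^{-\alpha} + \epsilon_t\, e^{\alpha}\right]
   = -(1-\epsilon_t)\, e^{-\alpha} + \epsilon_t\, e^{\alpha} = 0
\;\Rightarrow\;
e^{2\alpha} = \frac{1-\epsilon_t}{\epsilon_t}
\;\Rightarrow\;
\alpha_t = \tfrac{1}{2}\ln\frac{1-\epsilon_t}{\epsilon_t}.
```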


Gradient Descent in Function Space

Choose any appropriate margin cost function C

Goal is to minimize margin cost on the training set

Current classifier H_{t−1}(x), next weak learner h_t(x)

Next classifier H_t(x) = H_{t−1}(x) + α_t h_t(x)

Choose α_t to minimize

Σ_{i=1}^m C(y_i H_t(x_i)) = Σ_{i=1}^m C(y_i [H_{t−1}(x_i) + α_t h_t(x_i)])

“Anyboost” class of algorithms
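A generic sketch of this idea in Python (assuming numpy; the grid line search and the interface `fit_weak(X, w)` returning a ±1-valued hypothesis fit to the weights are illustrative choices, not prescribed by the slide):

```python
import numpy as np

def anyboost(X, y, fit_weak, C, C_prime, T, alpha_grid=np.linspace(0.0, 2.0, 201)):
    """Functional gradient descent on the total margin cost sum_i C(y_i F(x_i)).
    C is the margin cost, C_prime its derivative; fit_weak(X, w) returns a
    callable h with h(X) in {-1, +1}, fit to correlate with the weights w."""
    hyps, coefs = [], []
    F = np.zeros(len(y))                          # F_0(x_i) = 0 on the training set
    for t in range(T):
        w = -y * C_prime(y * F)                   # negative functional gradient at each x_i
        h = fit_weak(X, w)                        # weak hypothesis approximating that direction
        hx = h(X)
        costs = [np.sum(C(y * (F + a * hx))) for a in alpha_grid]
        a = alpha_grid[int(np.argmin(costs))]     # crude line search for alpha_t
        F = F + a * hx
        hyps.append(h)
        coefs.append(a)
    return lambda Xq: np.sign(sum(a_ * h_(Xq) for a_, h_ in zip(coefs, hyps)))
```

With C(z) = exp(−z) this reduces to Adaboost's choice of h_t and, up to the discretized line search, its α_t.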


Margin Bounds: Why Boosting works

Test set error decreases even after train set error is 0

No simple “bias-variance” explanation

Bounds on the error rate of the boosted classifier depend on

the number of training examples

the margin achieved on the train set

the complexity of the weak learners

It does not depend on the number of classifiers combined


Margin Bound for Boosting

H: class of binary classifiers of VC-dimension d_H

co(H) = { h : h(x) = Σ_i α_i h_i(x), α_i ≥ 0, Σ_i α_i = 1 }, the convex hull of H

With probability at least (1 − δ) over the draw of a train set S of size m, for all h ∈ co(H) and all θ > 0 we have

Pr[yh(x) ≤ 0] ≤ Pr_S[yh(x) ≤ θ] + O( (1/θ) √(d_H / m) ) + O( √( log(1/δ) / m ) )


Maximizing the Margin

Do boosting algorithms maximize the margin?

Minimize upper bound on training error

Often get large margin in the process

Do not explicitly maximize the margin

Explicit margin maximization

Classical Boost: Minimize bound on Pr_S[yh(x) ≤ 0]

Margin Boost: Minimize (bound on) Pr_S[yh(x) ≤ θ]


Boosting: Wrap Up (for now)

Strong learner from convex combinations of weak learners

Choice of loss function is critical

Aggressive loss functions for clean data

Robust loss functions for noisy data

Gradient descent in function space of additive models

Generalization by implicit margin maximization

Explicit maximum margin boosting algorithms possible

Is bound minimization a good idea?

