
A Decision-Theoretic Generalization of On-Line Learning and an Application to Boosting

Sachin Gupta and Nitish Lakhanpal

November 1, 2010


Introduction

Scenario: A gambler wants to let his fellow gamblers make bets on his behalf.

He will wager a fixed sum of money on every race, but will apportion the money among his friends based on how well they do.

Obviously, if he knew which of his friends would do best, he would just give that friend all of the money.

We want an allocation strategy that mimics the payoff he would have attained had he bet with the luckiest friend.

The paper discusses a simple algorithm for solving these dynamic allocation problems.


On-line Allocation Model

Agent A has N options or strategies to choose from.

Number the strategies with the integers 1, ..., N.

On each time step t = 1, 2, ..., T:

The allocator A decides on a distribution $p^t$ over the strategies: $p^t_i \ge 0$ is the amount allocated to strategy $i$, with $\sum_{i=1}^{N} p^t_i = 1$.

Each strategy suffers a loss $l^t_i$, which is decided by the environment.

The loss suffered by A is $\sum_i p^t_i\, l^t_i$, i.e. the average loss of the strategies with respect to A's chosen allocation rule.

This loss function is called the mixture loss.
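To make the protocol concrete, here is a minimal sketch of one run of the allocation game (Python, with hypothetical environment losses; the uniform allocation is just a placeholder for the update rule introduced later):

    import numpy as np

    rng = np.random.default_rng(0)
    N, T = 5, 100                         # strategies, time steps
    losses = rng.random((T, N))           # hypothetical l[t, i] in [0, 1], set by the environment

    total_mixture_loss = 0.0
    for t in range(T):
        p = np.full(N, 1.0 / N)           # allocator's distribution p^t (uniform placeholder)
        total_mixture_loss += p @ losses[t]   # mixture loss: sum_i p^t_i * l^t_i
    print(total_mixture_loss)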


Bounds on Losses

The loss suffered by any strategy is bounded, so that without loss of generality $l^t_i \in [0, 1]$.

No assumption is made about the form of the loss vectors $l^t$ or the manner in which they are generated.

The adversary's choice of $l^t$ may depend on the allocator's chosen mixture $p^t$.

The goal of algorithm A is to minimize its cumulative loss relative to the loss suffered by the best strategy.

A attempts to minimize the net loss $L_A - \min_i L_i$, where $L_A = \sum_{t=1}^{T} p^t \cdot l^t$ is the total cumulative loss suffered by algorithm A on the first T trials, and $L_i = \sum_{t=1}^{T} l^t_i$ is the cumulative loss of strategy $i$.
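Continuing the sketch above, the quantity to control is the net loss; a self-contained computation (again with hypothetical losses and a placeholder allocation history):

    import numpy as np

    rng = np.random.default_rng(0)
    N, T = 5, 100
    losses = rng.random((T, N))               # hypothetical l[t, i] in [0, 1]
    p_history = np.full((T, N), 1.0 / N)      # placeholder allocations p^t

    L_A = np.einsum('ti,ti->', p_history, losses)   # L_A = sum_t p^t . l^t
    L = losses.sum(axis=0)                          # L_i = sum_t l^t_i
    print(L_A - L.min())                            # net loss L_A - min_i L_i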


Repeated n Decision Game

The Hedge algorithm can be represented as a repeated n decision game in a very natural way.

Let there be payoff vectors $p^t$ corresponding to our loss vectors $l^t$.

Let there be a decision space $\Delta$ consisting of decisions $x^t$ corresponding to our weight vectors $w^t$.


Proof of Regret Bound for Exponential Updating

We prove this in the repeated n decision game framework (equivalent to the paper's formulation).

Lemma 1: For any ε > 0, the $x^t$ of the weighted-majority algorithm Hedge(ε) maximizes $\frac{1}{\varepsilon} H(x) + p^1 \cdot x + \ldots + p^{t-1} \cdot x$ over $x \in \Delta_n$.
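As a sanity check (not spelled out on the slides), the first-order conditions with a Lagrange multiplier $\mu$ for the simplex constraint recover exactly the exponential-weights form, writing $P^t_i$ for the cumulative payoff of decision $i$ and $H(x) = \sum_i x_i \log\frac{1}{x_i}$:

$$\frac{\partial}{\partial x_i}\left[\frac{1}{\varepsilon}\sum_j x_j \log\frac{1}{x_j} + \sum_j P^t_j x_j - \mu \sum_j x_j\right] = \frac{1}{\varepsilon}\left(\log\frac{1}{x_i} - 1\right) + P^t_i - \mu = 0$$

$$\Longrightarrow\quad x_i \propto e^{\varepsilon P^t_i}, \quad\text{i.e.}\quad x^t_i = \frac{e^{\varepsilon P^t_i}}{Z^t}.$$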


Proof of Lemma 1 (Regularity)

By definition, we have for any $x \in \Delta_n$,

$$\frac{1}{\varepsilon} H(x) + p^1 \cdot x + \ldots + p^{t-1} \cdot x = \sum_i \left( \frac{1}{\varepsilon}\, x_i \log\frac{1}{x_i} + P^t_i x_i \right)$$

By simple algebra, the above is equal to

$$\sum_i \frac{1}{\varepsilon} \left( x_i \log\frac{1}{x_i} + \varepsilon\, x_i P^t_i \right) = \frac{1}{\varepsilon} \sum_i x_i \log\frac{e^{\varepsilon P^t_i}}{x_i}$$

For the $x_i$ chosen by the algorithm, the above expression is $\frac{1}{\varepsilon} \sum_i x_i \log Z^t = \frac{1}{\varepsilon} \log Z^t$, because $\sum_i x_i = 1$. Using Jensen's inequality, for any $x \in \Delta_n$ we have:

$$\frac{1}{\varepsilon} \sum_i x_i \log\frac{e^{\varepsilon P^t_i}}{x_i} \le \frac{1}{\varepsilon} \log \sum_i x_i\, \frac{e^{\varepsilon P^t_i}}{x_i} = \frac{1}{\varepsilon} \log Z^t$$
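A quick numerical check of Lemma 1 (a sketch with made-up payoffs): the exponential-weights point should score at least as high as any other point of the simplex.

    import numpy as np

    rng = np.random.default_rng(1)
    eps, n = 0.5, 4
    P = rng.normal(size=n)                 # hypothetical cumulative payoffs P^t_i

    def objective(x):
        # (1/eps) H(x) + P . x  with  H(x) = sum_i x_i log(1/x_i)
        return -(x * np.log(x)).sum() / eps + P @ x

    x_star = np.exp(eps * P)
    x_star /= x_star.sum()                 # Hedge's choice e^{eps P} / Z

    samples = rng.dirichlet(np.ones(n), size=10000)
    assert all(objective(x_star) >= objective(x) - 1e-12 for x in samples)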


Proof of Lemma 2 (Stability)

Lemma 2: For any ε, M > 0, t ≥ 1, and $p^1, p^2, \ldots, p^t \in [-M, M]^n$,

$$\left| p^t \cdot x^{t+1} - p^t \cdot x^t \right| \le 4 \varepsilon M^2$$

Proof

Note first that $P^{t+1}_i - M \le P^t_i \le P^{t+1}_i + M$ and hence

$$e^{\varepsilon P^{t+1}_i} e^{-\varepsilon M} \le e^{\varepsilon P^t_i} \le e^{\varepsilon P^{t+1}_i} e^{\varepsilon M}$$

The right-hand inequality, summed over i, implies that $Z^t \le Z^{t+1} e^{\varepsilon M}$; combined with the left-hand inequality, this gives

$$x^t_i = \frac{e^{\varepsilon P^t_i}}{Z^t} \ge e^{-2\varepsilon M}\, \frac{e^{\varepsilon P^{t+1}_i}}{Z^{t+1}} = e^{-2\varepsilon M} x^{t+1}_i \quad \forall i \le n$$

Finally, since $e^{-s} \ge 1 - s$ for $s \ge 0$, we have that $x^t_i \ge (1 - 2\varepsilon M)\, x^{t+1}_i$.


Proof of Lemma 2 (Stability)

Let λ = 2εM. First, if λ > 1, the lemma is trivial because then 4εM² > 2M, and the difference in payoff between two decisions can never be greater than 2M. Hence, WLOG, we may assume λ ∈ [0, 1]. Let $z^t \in \mathbb{R}^n$ be the unique vector such that

$$x^t = (1 - \lambda)\, x^{t+1} + \lambda z^t$$

Then we claim that $z^t \in \Delta_n$. The fact that $z^t_i \ge 0$ follows directly from the inequality $x^t_i \ge (1 - \lambda) x^{t+1}_i$ above. The fact that $\sum_i z^t_i = 1$ follows from $\sum_i x^t_i = 1$ and $\sum_i x^{t+1}_i = 1$, since $x^t$ is an affine combination of $x^{t+1}$ and $z^t$ with coefficients summing to one.


Proof of Lemma 2 (Stability)

Finally,

$$x^t \cdot p^t - x^{t+1} \cdot p^t = -\lambda\, x^{t+1} \cdot p^t + \lambda\, z^t \cdot p^t$$

Since $y \cdot p^t \in [-M, M]$ for all $y \in \Delta_n$, the magnitude of the above quantity is at most $2\lambda M = 4\varepsilon M^2$, as required by the lemma.
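And a matching numerical check of the stability bound (again a sketch with made-up payoffs):

    import numpy as np

    rng = np.random.default_rng(2)
    eps, M, n, T = 0.05, 1.0, 6, 200
    p = rng.uniform(-M, M, size=(T, n))    # hypothetical payoff vectors in [-M, M]^n

    P = np.zeros(n)
    for t in range(T):
        x_t = np.exp(eps * P); x_t /= x_t.sum()           # x^t
        P += p[t]
        x_next = np.exp(eps * P); x_next /= x_next.sum()  # x^{t+1}
        assert abs(p[t] @ x_next - p[t] @ x_t) <= 4 * eps * M**2 + 1e-12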


Regret Bound

Theorem: For any n, T ≥ 1, M ≥ 0, and for any $p^1, p^2, \ldots, p^T \in [-M, M]^n$, the weighted-majority algorithm achieves

$$\mathrm{regret} \le \frac{1}{T} \sum_{t=1}^{T} 4\varepsilon M^2 + \frac{\log n}{T\varepsilon} = 4\varepsilon M^2 + \frac{\log n}{T\varepsilon}$$

Proof

The stability-regularization lemma, combined with Lemma 1, implies that

$$\mathrm{regret} \le \frac{1}{T} \sum_{t=1}^{T} \left| x^{t+1} \cdot p^t - x^t \cdot p^t \right| + \frac{1}{T} \max_{x, x' \in \Delta_n} \left| \frac{1}{\varepsilon} H(x) - \frac{1}{\varepsilon} H(x') \right|$$


Regret Bound

Since we have shown 0 ≤ H(x) ≤ log n, we have $\frac{1}{T} \max_{x, x' \in \Delta_n} \left( \frac{1}{\varepsilon} H(x) - \frac{1}{\varepsilon} H(x') \right) \le \frac{\log n}{T\varepsilon}$. Lemma 2 bounds the "stability" term. Putting these together gives

$$\mathrm{regret} \le \frac{1}{T} \sum_{t=1}^{T} 4\varepsilon M^2 + \frac{\log n}{T\varepsilon} = 4\varepsilon M^2 + \frac{\log n}{T\varepsilon}$$
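A step the slides leave implicit: the bound can be tuned over ε. Setting the derivative to zero,

$$\frac{d}{d\varepsilon}\left(4\varepsilon M^2 + \frac{\log n}{T\varepsilon}\right) = 4M^2 - \frac{\log n}{T\varepsilon^2} = 0 \;\Longrightarrow\; \varepsilon^* = \frac{1}{2M}\sqrt{\frac{\log n}{T}}, \qquad \mathrm{regret} \le 4M\sqrt{\frac{\log n}{T}},$$

so the per-round regret vanishes as $T \to \infty$.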


Applications

The framework is quite general and can be applied to a number of learning problems.

Take a decision space $\Delta$, a space of outcomes $\Omega$, and a bounded loss function $\lambda : \Delta \times \Omega \to [0, 1]$.

At each time step t, the learning algorithm selects a decision $\delta^t$ from $\Delta$ and receives an outcome $\omega^t$ from $\Omega$, suffering loss $\lambda(\delta^t, \omega^t)$. In general, we allow the learner to select a distribution $D^t$ over the space of decisions, suffering the expected loss of a decision randomly selected according to $D^t$.

The expected loss is $\Lambda(D^t, \omega^t) = \mathbb{E}_{\delta \sim D^t}[\lambda(\delta, \omega^t)]$.

Assuming that the learner has access to a set of N experts, we can decide on an appropriate distribution $D^t$.

At each time step t, each expert i produces its own distribution $E^t_i$ on $\Delta$ and suffers loss $\Lambda(E^t_i, \omega^t)$.

The learner needs to combine the distributions produced by the experts so as to suffer an expected loss "not much worse" than that of the best expert.


Applications

Run algorithm Hedge(β), treating each expert as a strategy.

Hedge(β) produces a distribution $p^t$ on the set of experts that is used to construct the mixture distribution $D^t = \sum_{i=1}^{N} p^t_i E^t_i$.

For any outcome $\omega^t$, the loss suffered by Hedge(β) is then $\Lambda(D^t, \omega^t) = \sum_{i=1}^{N} p^t_i\, \Lambda(E^t_i, \omega^t)$.

If we define $l^t_i = \Lambda(E^t_i, \omega^t)$, then the loss suffered by the learner is $p^t \cdot l^t$.

It follows that for any loss function λ, for any set of experts, and for any sequence of outcomes, the expected loss of Hedge(β) used as described above is at most

$$\sum_{t=1}^{T} \Lambda(D^t, \omega^t) \le \min_i \sum_{t=1}^{T} \Lambda(E^t_i, \omega^t) + \sqrt{2L \ln N} + \ln N$$

where $L \le T$ is an assumed bound on the expected loss of the best expert and $\beta = g(L/\ln N)$.
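A minimal sketch of Hedge(β) as used above (the multiplicative update $w^{t+1}_i = w^t_i\, \beta^{l^t_i}$ is the paper's; the loss array is hypothetical):

    import numpy as np

    def hedge(expert_losses, beta=0.9):
        """Run Hedge(beta) on a (T, N) array of per-round expert losses in [0, 1]."""
        T, N = expert_losses.shape
        w = np.ones(N)                          # initial weights
        total_loss = 0.0
        for t in range(T):
            p = w / w.sum()                     # p^t_i = w^t_i / sum_j w^t_j
            total_loss += p @ expert_losses[t]  # mixture loss p^t . l^t
            w *= beta ** expert_losses[t]       # w^{t+1}_i = w^t_i * beta^{l^t_i}
        return total_loss

    rng = np.random.default_rng(3)
    losses = rng.random((500, 10))              # hypothetical expert losses
    print(hedge(losses), losses.sum(axis=0).min())  # Hedge's loss vs. the best expert's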


Rock, Paper, Scissors

The loss function λ can represent an arbitrary matrix game, like Rock, Paper, Scissors.

Here $\Delta = \Omega = \{R, P, S\}$.

Loss function (rows: learner's play δ; columns: adversary's play ω):

        ω = R   ω = P   ω = S
δ = R    1/2     1       0
δ = P     0     1/2      1
δ = S     1      0      1/2

$\lambda(\delta, \omega)$ is the learner's loss: 1 if he loses, 0 if he wins, 1/2 if he ties.

The cumulative loss is simply the expected number of losses in a series of rounds of game play.

The results of the paper's study show that its algorithm has expected loss nearly identical to that of the best expert.
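Tying the threads together, a hedged sketch of Hedge playing this game against a made-up biased adversary, with the three pure strategies R, P, S as the "experts":

    import numpy as np

    LOSS = np.array([[0.5, 1.0, 0.0],   # rows: learner plays R, P, S
                     [0.0, 0.5, 1.0],   # cols: adversary plays R, P, S
                     [1.0, 0.0, 0.5]])

    rng = np.random.default_rng(4)
    beta, w = 0.95, np.ones(3)          # one weight per pure strategy
    total = 0.0
    for t in range(1000):
        p = w / w.sum()                             # learner's mixed strategy
        adv = rng.choice(3, p=[0.5, 0.3, 0.2])      # hypothetical biased adversary
        losses = LOSS[:, adv]                       # loss of each pure strategy
        total += p @ losses                         # expected loss this round
        w *= beta ** losses                         # Hedge(beta) update
    print(total / 1000)   # approaches the best pure strategy's loss rate (P, here 0.35)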


Boosting

Bias-Variance Tradeoff

Expected error of a learner = variance + squared bias = sensitivity of the prediction + systematic error.

Low bias → high variance.

Example: 1st-order curves vs. 9th-order curves.
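For the squared-error case the slide alludes to, the decomposition is the standard identity (not derived in the deck):

$$\mathbb{E}\big[(\hat{f}(x) - f(x))^2\big] = \underbrace{\mathbb{E}\big[(\hat{f}(x) - \mathbb{E}\hat{f}(x))^2\big]}_{\text{variance (sensitivity)}} + \underbrace{\big(\mathbb{E}\hat{f}(x) - f(x)\big)^2}_{\text{squared bias (systematic error)}}$$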


Boosting

Can we reduce variance without increasing bias?

Use committees.

Bias: the committee's average performance is at least as good as the average performance of its members.

Variance: spurious patterns are outvoted.


Boosting

Intuition

Break the training set up into samples focusing on the "hardest" parts of the space.

This forces the weak learners to generate new hypotheses → fewer mistakes are made on these parts.


Boosting

Strong Learning

For given ε, δ > 0 and access to random examples, outputs a hypothesis with error at most ε, with probability at least 1 − δ.

A strong learner is a classifier that is arbitrarily well correlated with the true classification.

Weak Learning

Same as above, except that ε ≥ 1/2 − γ, where γ > 0 is a constant or decreases as 1/p for a polynomial p.

Intuitively, a weak learner is a classifier which is only slightly correlated with the true classification (it can label examples better than random guessing).

Boosting

Strong Learning

For a given ε, δ > 0 and access to random examples, outputs ahypothesis with error at most ε with probability 1− δ.

A strong learner is a classifier that is arbitrarily well correlatedwith the true classification.

Weak Learning

Same as above except that ε ≥ 1

2− γ where γ > 0 or

decreases as 1/p for polynomial pIntuitively, a weak learner is defined to be a classifier which isonly slightly correlated with the true classification (it can labelexamples better than random guessing)

Sachin Gupta and Nitish Lakhanpal A Decision-Theoretic Generalization of On-Line Learning and an Application to Boosting

Boosting

Strong Learning

For a given ε, δ > 0 and access to random examples, outputs ahypothesis with error at most ε with probability 1− δ.A strong learner is a classifier that is arbitrarily well correlatedwith the true classification.

Weak Learning

Same as above except that ε ≥ 1

2− γ where γ > 0 or

decreases as 1/p for polynomial pIntuitively, a weak learner is defined to be a classifier which isonly slightly correlated with the true classification (it can labelexamples better than random guessing)

Sachin Gupta and Nitish Lakhanpal A Decision-Theoretic Generalization of On-Line Learning and an Application to Boosting

Boosting

Strong Learning

For a given ε, δ > 0 and access to random examples, outputs ahypothesis with error at most ε with probability 1− δ.A strong learner is a classifier that is arbitrarily well correlatedwith the true classification.

Weak Learning

Same as above except that ε ≥ 1

2− γ where γ > 0 or

decreases as 1/p for polynomial pIntuitively, a weak learner is defined to be a classifier which isonly slightly correlated with the true classification (it can labelexamples better than random guessing)

Sachin Gupta and Nitish Lakhanpal A Decision-Theoretic Generalization of On-Line Learning and an Application to Boosting

Boosting

Strong Learning

For a given ε, δ > 0 and access to random examples, outputs ahypothesis with error at most ε with probability 1− δ.A strong learner is a classifier that is arbitrarily well correlatedwith the true classification.

Weak Learning

Same as above except that ε ≥ 1

2− γ where γ > 0 or

decreases as 1/p for polynomial p

Intuitively, a weak learner is defined to be a classifier which isonly slightly correlated with the true classification (it can labelexamples better than random guessing)

Sachin Gupta and Nitish Lakhanpal A Decision-Theoretic Generalization of On-Line Learning and an Application to Boosting

Boosting

Strong Learning

For a given ε, δ > 0 and access to random examples, outputs ahypothesis with error at most ε with probability 1− δ.A strong learner is a classifier that is arbitrarily well correlatedwith the true classification.

Weak Learning

Same as above except that ε ≥ 1

2− γ where γ > 0 or

decreases as 1/p for polynomial pIntuitively, a weak learner is defined to be a classifier which isonly slightly correlated with the true classification (it can labelexamples better than random guessing)

Sachin Gupta and Nitish Lakhanpal A Decision-Theoretic Generalization of On-Line Learning and an Application to Boosting

Boosting

Strong Learning

For a given ε, δ > 0 and access to random examples, outputs ahypothesis with error at most ε with probability 1− δ.A strong learner is a classifier that is arbitrarily well correlatedwith the true classification.

Weak Learning

Same as above except that ε ≥ 1

2− γ where γ > 0 or

decreases as 1/p for polynomial pIntuitively, a weak learner is defined to be a classifier which isonly slightly correlated with the true classification (it can labelexamples better than random guessing)

Sachin Gupta and Nitish Lakhanpal A Decision-Theoretic Generalization of On-Line Learning and an Application to Boosting
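A concrete example of a weak learner (our illustration, not the paper's): a single-feature threshold classifier, a "decision stump", which on hard distributions does only slightly better than random guessing. A minimal sketch in Python, assuming numpy, an (N, d) feature matrix X, labels y in {0, 1}, and a distribution p over the N examples:

    import numpy as np

    def train_stump(X, y, p):
        # Exhaustively pick the (feature, threshold, polarity) triple that
        # minimizes the weighted 0-1 error under the distribution p.
        best = (0, 0.0, 0, 1.0)   # (feature j, threshold, polarity, weighted error)
        for j in range(X.shape[1]):
            for thr in np.unique(X[:, j]):
                for polarity in (0, 1):
                    pred = (X[:, j] > thr).astype(int) ^ polarity
                    err = np.sum(p * (pred != y))
                    if err < best[3]:
                        best = (j, thr, polarity, err)
        return best

    def stump_predict(stump, X):
        # Apply a trained stump to a feature matrix.
        j, thr, polarity, _ = stump
        return (X[:, j] > thr).astype(int) ^ polarity

A weak learning guarantee for such a learner says that the returned weighted error is at most 1/2 − γ for some γ > 0.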

Boosting

The Algorithm

Given a distribution over the training set, calculate weights over the examples

Compute a normalized distribution

Feed the normalized distribution to a weak learning algorithm

Generate a new set of weights

Output a hypothesis combining the outputs of the weak hypotheses by a weighted majority vote (a Python sketch of the full loop follows below)
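A minimal sketch of this loop, reusing the stump helpers above. This is our illustration for binary labels in {0, 1} under the update rules of the next slide, not a reference implementation:

    import numpy as np

    def adaboost(X, y, T):
        # Assumes every weak hypothesis achieves error 0 < eps_t < 1/2.
        N = len(y)
        w = np.ones(N)                              # unnormalized example weights
        hypotheses, betas = [], []
        for t in range(T):
            p = w / w.sum()                         # normalized distribution
            h = train_stump(X, y, p)                # weak learner sees p
            pred = stump_predict(h, X)
            eps = np.sum(p * np.abs(pred - y))      # weighted error of h_t
            beta = eps / (1.0 - eps)
            w = w * beta ** (1 - np.abs(pred - y))  # shrink correctly classified
            hypotheses.append(h)
            betas.append(beta)
        return hypotheses, betas

    def adaboost_predict(hypotheses, betas, X):
        # Weighted majority vote with per-hypothesis weight log(1/beta_t).
        alphas = [np.log(1.0 / b) for b in betas]
        votes = sum(a * stump_predict(h, X) for a, h in zip(alphas, hypotheses))
        return (votes >= 0.5 * sum(alphas)).astype(int)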

Boosting

Rules for generating a new set of weights

Calculate the error of the weak hypothesis:
$\varepsilon_t = \sum_{i=1}^{N} p_i^t \, |h_t(x_i) - y_i|$

Set $\beta_t = \dfrac{\varepsilon_t}{1 - \varepsilon_t}$

Calculate the new weights as:
$w_i^{t+1} = w_i^t \, \beta_t^{\,1 - |h_t(x_i) - y_i|}$
(a worked numeric example follows below)
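A quick worked instance of the update, with numbers of our choosing: suppose $\varepsilon_t = 0.25$. Then

$\beta_t = \frac{0.25}{0.75} = \frac{1}{3}, \qquad w_i^{t+1} = \begin{cases} w_i^t / 3 & \text{if } h_t(x_i) = y_i \text{ (exponent 1)} \\ w_i^t & \text{if } h_t(x_i) \neq y_i \text{ (exponent 0)} \end{cases}$

so correctly classified examples lose weight, and the next distribution concentrates on the mistakes.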

Relationship Between Boosting and Weighted Majority

Dual relationship between Hedge and AdaBoost

"There is a direct mapping or reduction of the boosting problem to the on-line allocation problem"

The examples used in AdaBoost can be treated as strategies and the weak hypotheses as trials, giving a restatement of AdaBoost in the terminology of Hedge

We also need to reverse the loss function, because in AdaBoost our loss is large when we encounter a bad prediction, while in Hedge it is large when we encounter a good prediction

The loss is kept large for bad predictions in AdaBoost in order to focus on the "hard" examples
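In symbols (our reading of the reduction; the notation is ours): Hedge shrinks a strategy's weight in proportion to its loss, so the reduction assigns example $i$ the reversed loss

$\ell_i^t = 1 - |h_t(x_i) - y_i|,$

which is large exactly when $h_t$ classifies example $i$ correctly. Hedge's update $w_i^{t+1} = w_i^t \beta^{\ell_i^t}$ then reproduces AdaBoost's rule $w_i^{t+1} = w_i^t \beta_t^{\,1 - |h_t(x_i) - y_i|}$: correctly classified examples lose weight, and the hard ones come to dominate the distribution.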

Boosting — Critique

Not enough data

Can’t produce something out of nothing

Base learner too weak

Committee may not be able to come up with good hypothesis

Base learner too strong

Individual learner may overfit on its own

Susceptibility to noisy data

Work too hard to find pattern in outlier

Boosting — A Bayesian Perspective

In a Bayesian framework with examples generated from some distribution, we are given a set of binary hypotheses $h_1, \ldots, h_T$ with the goal of combining them in an optimal way. We should predict the label y with the highest likelihood.

Predict 1 if
$\Pr[y = 1 \mid h_1(x), \ldots, h_T(x)] > \Pr[y = 0 \mid h_1(x), \ldots, h_T(x)]$

and predict 0 if
$\Pr[y = 0 \mid h_1(x), \ldots, h_T(x)] > \Pr[y = 1 \mid h_1(x), \ldots, h_T(x)]$

Assuming the error events $h_t(x) \neq y$ are conditionally independent of the label $y$ and of one another, we can use Bayes' rule to rewrite our selection criterion as:

Predict 1 if
$\Pr[y = 1] \prod_{t : h_t(x) = 0} \varepsilon_t \prod_{t : h_t(x) = 1} (1 - \varepsilon_t) > \Pr[y = 0] \prod_{t : h_t(x) = 0} (1 - \varepsilon_t) \prod_{t : h_t(x) = 1} \varepsilon_t$
and 0 otherwise, where $\varepsilon_t = \Pr[h_t(x) \neq y]$

Adding the trivial hypothesis $h_0$, which always predicts 1, replacing $\Pr[y = 0]$ with $\varepsilon_0$, and taking the log of both sides yields a rule identical to the combination rule generated by AdaBoost (the log step is spelled out below)
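The log step, spelled out (our algebra, filling in the claim above): taking logs of both sides gives

$\log \Pr[y=1] + \sum_{t : h_t(x) = 0} \log \varepsilon_t + \sum_{t : h_t(x) = 1} \log (1 - \varepsilon_t) > \log \Pr[y=0] + \sum_{t : h_t(x) = 0} \log (1 - \varepsilon_t) + \sum_{t : h_t(x) = 1} \log \varepsilon_t,$

which rearranges to: predict 1 iff

$\sum_{t : h_t(x) = 1} \log \frac{1 - \varepsilon_t}{\varepsilon_t} > \log \frac{\Pr[y=0]}{\Pr[y=1]} + \sum_{t : h_t(x) = 0} \log \frac{1 - \varepsilon_t}{\varepsilon_t}.$

Since $\log \frac{1 - \varepsilon_t}{\varepsilon_t} = \log \frac{1}{\beta_t}$, and the $h_0$ term absorbs the prior ratio, this is exactly AdaBoost's weighted majority vote: predict 1 iff $\sum_t \left(\log \frac{1}{\beta_t}\right) h_t(x) \geq \frac{1}{2} \sum_t \log \frac{1}{\beta_t}$.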

Multiclass Extensions of AdaBoost

We modify our setup so that the set of labels is now $Y = \{1, 2, \ldots, k\}$

Correspondingly, our weak learners return hypotheses $h_t : X \to Y$

We still start with a set of weights over the examples and get a hypothesis based on this set from the weak learner

We calculate the error of the hypothesis as
$\varepsilon_t = \sum_{i=1}^{N} p_i^t \, [\![ h_t(x_i) \neq y_i ]\!]$, where $[\![ z ]\!]$ is 1 if $z$ holds and 0 otherwise

Let $\beta_t = \dfrac{\varepsilon_t}{1 - \varepsilon_t}$

The weights are updated as
$w_i^{t+1} = w_i^t \, \beta_t^{\,1 - [\![ h_t(x_i) \neq y_i ]\!]}$

The final hypothesis is
$h_f(x) = \arg\max_{y \in Y} \sum_{t=1}^{T} \left( \log \frac{1}{\beta_t} \right) [\![ h_t(x) = y ]\!]$
(a sketch of this vote follows below)
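A minimal sketch of the final vote in Python (our illustration, assuming numpy, 0-based labels, a T×N integer array preds holding each weak hypothesis's predictions, and the corresponding β values in betas):

    import numpy as np

    def multiclass_vote(preds, betas, k):
        # Per-example argmax over labels y of sum_t log(1/beta_t) * [[h_t(x) = y]].
        T, N = preds.shape
        alphas = np.log(1.0 / np.asarray(betas))         # vote weight per hypothesis
        scores = np.zeros((N, k))
        for t in range(T):
            scores[np.arange(N), preds[t]] += alphas[t]  # h_t votes for its label
        return scores.argmax(axis=1)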

Notes on the Multiclass Extensions of AdaBoost

Has no way of forcing the weak learner to discriminatebetween labels that are especially hard to distinguish

More sophisticated extensions exist, some of which arediscussed in the paper

Can also approach the problem by converting it into many binary problems, boosting on each of those, and then patching the results together (a sketch follows below)
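One simple, hypothetical way to realize this reduction, reusing the binary adaboost sketch above: train one booster per label ("label j vs. the rest") and pick the label whose booster casts the largest weighted vote. Comparing raw vote totals across boosters is a heuristic patching rule, one of several possibilities:

    import numpy as np

    def one_vs_rest_boost(X, y, k, T):
        # One binary booster per label: "is the label j?"
        return [adaboost(X, (y == j).astype(int), T) for j in range(k)]

    def one_vs_rest_predict(models, X):
        # Patch the k boosters together by comparing their vote totals.
        scores = []
        for hypotheses, betas in models:
            alphas = [np.log(1.0 / b) for b in betas]
            scores.append(sum(a * stump_predict(h, X)
                              for a, h in zip(alphas, hypotheses)))
        return np.argmax(np.stack(scores), axis=0)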

Summary

Hedge Algorithm, applications of Hedge

AdaBoost Algorithm, relationship to Hedge, relationship to Bayesian statistics, and multiclass extensions
