
Lab Seminar: Contextual Bandit Survey

Sangwoo Mo


[email protected]

August 4, 2016

Sangwoo Mo (KAIST) Contextual Bandit Survey August 4, 2016 1 / 32



1 Problem Setting

2 Naive Approach: Reduce to MAB

3 Stochastic Contextual Bandit
    UCB & Thompson Sampling
    Arbitrary Set of Policies

4 Adversarial Contextual Bandit

5 Supervised Learning to Contextual Bandit


Problem Setting


Multi-Armed Bandit

At each time t, the agent selects an arm $a_t \in \{1, \dots, K\}$

Then the agent receives a reward $r_t$ ($= r_{a_t,t}$) from the environment

If the $r_{i,t}$ are drawn i.i.d. from some distribution, we call the problem a stochastic bandit; if $r_{i,t}$ is chosen by the environment, we call it an adversarial bandit

The goal of MAB is to find a policy $\pi \in \Pi$, mapping the history to the next arm,

$\pi(a_1, r_1, \dots, a_{t-1}, r_{t-1}) = a_t$

which minimizes the regret [1]

$R_T := \max_{i=1,\dots,K} \sum_{t=1}^{T} r_{i,t} - \sum_{t=1}^{T} r_{a_t,t}$

[1] Properly speaking, the cumulative pseudo-regret.
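As a small illustration of the regret definition, the sketch below computes it for a fully known reward table. The function name `cumulative_regret` and the toy table are assumptions made for this example; in a real bandit problem only the rewards of the chosen arms are observed, so this quantity is used for analysis rather than computed by the agent.

```python
def cumulative_regret(rewards, actions):
    """Regret against the best single arm in hindsight.

    rewards[t][i] is r_{i,t}; actions[t] is the arm a_t chosen at time t.
    """
    num_arms = len(rewards[0])
    # Total reward of the best fixed arm.
    best_fixed = max(sum(row[i] for row in rewards) for i in range(num_arms))
    # Total reward actually collected by the agent.
    collected = sum(row[a] for row, a in zip(rewards, actions))
    return best_fixed - collected


# Hypothetical toy problem: K = 2 arms, T = 3 rounds.
rewards = [[1, 0], [1, 0], [0, 1]]
actions = [1, 0, 0]
regret = cumulative_regret(rewards, actions)
```

Here the best fixed arm (arm 0) collects 2, while the agent collects 1.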


Contextual Bandit

In a contextual bandit, the agent receives additional information (the context) $c_t \in C$ [1] at the beginning of time t

In a stochastic contextual bandit, the reward $r_{i,t}$ can be represented as a function of the context $c_{i,t}$ plus noise $\epsilon_{i,t}$

$r_{i,t} = f(c_{i,t}) + \epsilon_{i,t}$

or simply $r_{i,t} = f_i(c_t) + \epsilon_{i,t}$ if $c_t$ does not depend on i

In an adversarial contextual bandit, the reward $r_{i,t}$ is chosen by the environment, as in the non-contextual MAB

[1] Much of the literature writes $c_{i,t}$ to emphasize that each arm i has its own context. The two notations are equivalent, since we can build a single vector $c_t$ by concatenating the $c_{i,t}$.


Optimal Regret Bound

Stochastic Bandit: $\Omega(\log T)$ [1]

Adversarial Bandit: $\Omega(\sqrt{KT})$ [2]

Contextual Bandit: $\Omega(d\sqrt{T})$ [3]

[1] Lai & Robbins. Asymptotically efficient adaptive allocation rules. Advances in Applied Mathematics, 1985.

[2] Auer et al. Gambling in a rigged casino: The adversarial multi-armed bandit problem. FOCS, 1995. The bound follows from a minimax strategy. Note that the adversarial bandit can be viewed as a two-player game between the agent and the environment.

[3] Dani et al. Stochastic Linear Optimization under Bandit Feedback. COLT, 2008. Remark that the lower bound is $\Omega(\sqrt{T})$ even for the stochastic contextual bandit, since contexts may arrive adversarially.


Naive Approach: Reduce to MAB


Approach 1: assume the context set is finite (|C| = N)

Run a MAB algorithm (e.g., EXP3) for each context independently

The regret bound is $O(\sqrt{TNK \log K})$ [1] (with EXP3)

Approach 2: assume the policy space is finite (|H| = M)

Run a MAB algorithm (e.g., EXP3) on policies, instead of arms

The regret bound is $O(\sqrt{TM \log M})$ (with EXP3)

[1] $\sum_{c=1}^{N} O(\sqrt{n_c}\,\sqrt{K \log K}) \le O(\sqrt{TN}\,\sqrt{K \log K})$, where $n_c$ is the number of times context c is observed (by the Cauchy-Schwarz inequality).
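The Cauchy-Schwarz step behind the bound for Approach 1 can be spelled out as follows. This is a sketch of the standard argument, with constants hidden inside the O(·); it uses only the fact that the per-context counts $n_c$ sum to T.

```latex
\sum_{c=1}^{N} \sqrt{n_c}
  = \sum_{c=1}^{N} 1 \cdot \sqrt{n_c}
  \le \sqrt{\sum_{c=1}^{N} 1^2}\,\sqrt{\sum_{c=1}^{N} n_c}
  = \sqrt{N T}
\quad\Longrightarrow\quad
\sum_{c=1}^{N} \sqrt{n_c K \log K} \le \sqrt{T N K \log K}
```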


Stochastic Contextual Bandit: UCB & Thompson Sampling


Review: Index Policy and Greedy Algorithm

Since the Gittins index [1], index policies have become one of the most popular strategies for MAB problems

Idea: at each time t, define a score $s_{i,t}$ (the index) for each arm i, and select the arm with the highest score

Question: how to define a proper $s_{i,t}$?

Naive approach: use the empirical mean [2]! (greedy algorithm)

However, the naive greedy algorithm may incur linear, O(T), regret

[1] Gittins. Bandit Processes and Dynamic Allocation Indices. Journal of the Royal Statistical Society, 1979.

[2] Note that MAB becomes trivial if we know the true means. The general goal of MAB algorithms is to estimate the means correctly and rapidly (the explore-exploit dilemma).
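To see why pure greedy can get stuck, the sketch below runs the empirical-mean index on a hand-built, deterministic reward sequence. The names `run_greedy` and `unlucky_rewards` and the reward values are assumptions for this example: arm 0 is the better arm but happens to pay 0 on its first pull, so greedy never returns to it.

```python
def run_greedy(reward_fn, num_arms, horizon):
    """Pull each arm once, then always pull the arm with the best empirical mean."""
    counts = [0] * num_arms
    sums = [0.0] * num_arms
    chosen = []
    for t in range(horizon):
        if t < num_arms:
            arm = t  # initialization: one pull per arm
        else:
            arm = max(range(num_arms), key=lambda i: sums[i] / counts[i])
        reward = reward_fn(arm, counts[arm])
        counts[arm] += 1
        sums[arm] += reward
        chosen.append(arm)
    return chosen


def unlucky_rewards(arm, pull_index):
    """Hypothetical rewards: arm 0 pays 0 once and 1 afterwards, arm 1 always pays 0.5."""
    if arm == 0:
        return 0.0 if pull_index == 0 else 1.0
    return 0.5


chosen = run_greedy(unlucky_rewards, num_arms=2, horizon=100)
```

After the two initialization pulls, the empirical means are 0 and 0.5, so every later round goes to arm 1 and the loss per round stays constant, which is the linear-regret behavior mentioned above.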


Review: UCB1

Assume $r_{i,t} \sim P_i$ with support [0, 1] and mean $\mu_i$

Idea: favor rarely selected arms and discount frequently selected ones. In other words, add a confidence bonus [1]!

UCB1 [2]: define the score as

$s_{i,t} = \hat\mu_{i,t} + \sqrt{\frac{2 \log t}{n_{i,t}}}$

where $\hat\mu_{i,t}$ is the empirical mean of arm i, and $n_{i,t}$ is the number of times arm i has been selected

The UCB1 policy guarantees the optimal regret $O(\log T)$

Also, there are other choices for the UCB (e.g., KL-UCB [3], Bayes-UCB [4])

[1] We call this bonus the UCB (upper confidence bound). Thus, score = estimated mean + UCB.

[2] Auer et al. Finite-time Analysis of the Multiarmed Bandit Problem. Machine Learning, 2002.

[3] Garivier & Cappé. The KL-UCB Algorithm for Bounded Stochastic Bandits and Beyond. COLT, 2011.

[4] Kaufmann et al. On Bayesian Upper Confidence Bounds for Bandit Problems. AISTATS, 2012.
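A minimal sketch of the UCB1 score in code, assuming Bernoulli arms with made-up means and a fixed random seed. The function name `ucb1` and its parameters are assumptions for this example.

```python
import math
import random


def ucb1(means, horizon, seed=0):
    """Run UCB1 on Bernoulli arms; returns how often each arm was pulled."""
    rng = random.Random(seed)
    num_arms = len(means)
    counts = [0] * num_arms   # n_{i,t}
    sums = [0.0] * num_arms   # running reward sums, for the empirical means
    for t in range(1, horizon + 1):
        if t <= num_arms:
            arm = t - 1       # pull every arm once so that n_{i,t} > 0
        else:
            scores = [
                sums[i] / counts[i] + math.sqrt(2.0 * math.log(t) / counts[i])
                for i in range(num_arms)
            ]
            arm = max(range(num_arms), key=scores.__getitem__)
        reward = 1.0 if rng.random() < means[arm] else 0.0
        counts[arm] += 1
        sums[arm] += reward
    return counts


counts = ucb1([0.2, 0.8], horizon=2000)
```

The bonus term shrinks as an arm is pulled more often, so the worse arm is still tried occasionally, but most pulls go to the arm with mean 0.8.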



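Assume $r_{i,t} \sim P(r_{i,t} \mid c_{i,t}, \theta)$ where $E[r_{i,t}] = c_{i,t}^\top \theta^*$ ($c_{i,t}, \theta \in \mathbb{R}^d$)

As in UCB1, we want to define the score as

$s_{i,t} = c_{i,t}^\top \hat\theta_t + \mathrm{UCB}_{i,t}$

Question: how to choose a proper $\mathrm{UCB}_{i,t}$?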



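Idea: let $\hat\theta_t$ be the ridge-regression estimator of $\theta^*$

$\hat\theta_t = (C_t^\top C_t + \lambda I_d)^{-1} C_t^\top R_t$

where $C_t = [c_1, \dots, c_{t-1}]$ and $R_t = [r_1, \dots, r_{t-1}]$ collect the past contexts and rewards

Then, the inequality below holds with probability $1 - \delta/T$

$\left| c_{i,t}^\top \hat\theta_t - c_{i,t}^\top \theta^* \right| \le (\epsilon + 1) \sqrt{c_{i,t}^\top A_t^{-1} c_{i,t}}$

where $A_t = C_t^\top C_t + I_d$ and $\epsilon = \sqrt{\frac{1}{2} \log \frac{2TK}{\delta}}$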




LinUCB [1]: define the score as

$s_{i,t} = c_{i,t}^\top \hat\theta_t + \alpha \sqrt{c_{i,t}^\top A_t^{-1} c_{i,t}}$

Regret bound (with probability $1 - \delta$): [formula garbled in the source; surviving fragments: "√T log", "1 + T"]

The LinUCB policy guarantees the optimal regret $O(d\sqrt{T})$, up to logarithmic factors

Also, there are other choices for the UCB (e.g., LinREL [2], CoFineUCB [3])

[1] Li et al. A contextual-bandit approach to personalized news article recommendation. WWW, 2010.

[2] Auer. Using Confidence Bounds for Exploitation-Exploration Trade-offs. JMLR, 2002.

[3] Yue et al. Hierarchical Exploration for Accelerating Contextual Bandits. ICML, 2012.
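The LinUCB score can be sketched in a few lines of dependency-free Python. This is an illustrative sketch, not the reference implementation from Li et al.: the class name `LinUCB`, the choice of a single shared parameter vector, the Sherman-Morrison update of $A_t^{-1}$, and the noiseless toy rewards are all assumptions made for this example.

```python
import math


def mat_vec(matrix, vector):
    return [sum(m * v for m, v in zip(row, vector)) for row in matrix]


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


class LinUCB:
    """score_i = c_i^T theta_hat + alpha * sqrt(c_i^T A^{-1} c_i), with A = C^T C + I_d."""

    def __init__(self, dim, alpha=1.0):
        self.alpha = alpha
        # A starts as the identity, so its inverse is the identity as well.
        self.a_inv = [[1.0 if i == j else 0.0 for j in range(dim)] for i in range(dim)]
        self.b = [0.0] * dim  # running C^T R, the sum of reward * context

    def select(self, contexts):
        theta_hat = mat_vec(self.a_inv, self.b)  # ridge estimate with lambda = 1

        def score(c):
            bonus = math.sqrt(dot(c, mat_vec(self.a_inv, c)))
            return dot(c, theta_hat) + self.alpha * bonus

        return max(range(len(contexts)), key=lambda i: score(contexts[i]))

    def update(self, context, reward):
        # Sherman-Morrison: (A + c c^T)^{-1} = A^{-1} - (A^{-1} c)(A^{-1} c)^T / (1 + c^T A^{-1} c)
        u = mat_vec(self.a_inv, context)
        denom = 1.0 + dot(context, u)
        for i in range(len(u)):
            for j in range(len(u)):
                self.a_inv[i][j] -= u[i] * u[j] / denom
        for i, c_i in enumerate(context):
            self.b[i] += reward * c_i


# Hypothetical toy problem: two arms with fixed one-hot contexts, noiseless rewards.
theta_star = [0.2, 0.8]
contexts = [[1.0, 0.0], [0.0, 1.0]]
agent = LinUCB(dim=2, alpha=1.0)
picks = []
for _ in range(50):
    arm = agent.select(contexts)
    agent.update(contexts[arm], dot(contexts[arm], theta_star))
    picks.append(arm)
```

In this toy run the agent tries arm 0 first (ties go to the lower index), sees the low reward, and then stays on arm 1 for the rest of the 50 rounds because its score remains higher.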


Review: Thompson Sampling

Another popular strategy for MAB is Thompson Sampling [1]

It can be applied to both contextual and non-contextual bandits

Assume $r_{i,t} \sim P(r_{i,t} \mid c_{i,t}, \theta^*)$ with prior $\theta^* \sim P(\theta)$

Idea: sample an estimate $\theta_t$ from the posterior distribution

step 1. draw $\theta_t$ from the posterior $P(\theta \mid D)$, where $D = \{(c_\tau, a_\tau, r_\tau)\}$ is the data observed so far

step 2. select arm $a_t = \arg\max_i E[r_{i,t} \mid c_{i,t}, \theta_t]$

The idea is simple, but it works well both in theory [2] and in practice [3]

[1] Thompson. On the Likelihood that One Unknown Probability Exceeds Another in View of the Evidence of Two Samples. Biometrika, 1933.

[2] Agrawal & Goyal. Analysis of Thompson Sampling for the Multi-armed Bandit Problem. COLT, 2012.

[3] Scott. A modern Bayesian look at the multi-armed bandit. Applied Stochastic Models in Business and Industry, 2010.
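The two steps can be sketched for the simplest non-contextual case, assuming Bernoulli rewards with a uniform Beta(1, 1) prior on each arm's mean. The function name `thompson_bernoulli`, the arm means, and the seed are assumptions for this example.

```python
import random


def thompson_bernoulli(means, horizon, seed=0):
    """Beta-Bernoulli Thompson Sampling; returns how often each arm was pulled."""
    rng = random.Random(seed)
    num_arms = len(means)
    successes = [0] * num_arms
    failures = [0] * num_arms
    counts = [0] * num_arms
    for _ in range(horizon):
        # Step 1: draw one sample per arm from its Beta posterior.
        samples = [
            rng.betavariate(successes[i] + 1, failures[i] + 1)
            for i in range(num_arms)
        ]
        # Step 2: play the arm whose sampled mean is largest.
        arm = max(range(num_arms), key=samples.__getitem__)
        if rng.random() < means[arm]:
            successes[arm] += 1
        else:
            failures[arm] += 1
        counts[arm] += 1
    return counts


counts = thompson_bernoulli([0.2, 0.8], horizon=2000)
```

Exploration comes from the posterior spread: an arm with few observations has a wide posterior and is occasionally sampled high, while a well-explored bad arm is rarely chosen.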



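Assume $r_{i,t} \sim N(c_{i,t}^\top \theta^*, v^2)$ and $\theta^* \sim N(\hat\theta_t, v^2 B_t^{-1})$ where

$B_t = I_d + \sum_{\tau=1}^{t-1} c_{a_\tau,\tau} c_{a_\tau,\tau}^\top, \quad \hat\theta_t = B_t^{-1} \sum_{\tau=1}^{t-1} c_{a_\tau,\tau} r_{a_\tau,\tau}$

$r_{i,t} \in [\bar r_{i,t} - R, \bar r_{i,t} + R], \quad v = R \sqrt{\frac{24}{\epsilon} d \log \frac{1}{\delta}}$

with $\bar r_{i,t} = E[r_{i,t}]$

Then, the posterior of $\theta^*$ is $N(\hat\theta_{t+1}, v^2 B_{t+1}^{-1})$

LinTS [1]: run Thompson Sampling under these assumptions

Regret bound (with probability $1 - \delta$) is

$O\left( \frac{d^2}{\epsilon} \sqrt{T^{1+\epsilon}} \log(Td) \log \frac{1}{\delta} \right)$

[1] Agrawal & Goyal. Thompson Sampling for Contextual Bandits with Linear Payoffs. ICML, 2013.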


UCB & TS: Nonlinear Case

Assume $E[r_{i,t}] = f(c_{i,t})$ for a general nonlinear function f

If we assume a generalized linear model (rewards from an exponential family, with f a link function applied to a linear function of the context), we can use GLM-UCB [1]

If we assume f is sampled from a Gaussian process, we can use GP-UCB [2] / CGP-UCB [3]

If we assume f is an element of a reproducing kernel Hilbert space, we can use KernelUCB [4]

Also, we can use Thompson Sampling if we know the form of the reward distribution

[1] Filippi et al. Parametric Bandits: The Generalized Linear Case. NIPS, 2010.

[2] Srinivas et al. Gaussian Process Optimization in the Bandit Setting: No Regret and Experimental Design. ICML, 2010.

[3] Krause & Ong. Contextual Gaussian Process Bandit Optimization. NIPS, 2011.

[4] Valko et al. Finite-Time Analysis of Kernelised Contextual Bandits. UAI, 2013.


Stochastic Contextual Bandit: Arbitrary Set of Policies



Assume the policy space H is finite [1]

Idea: explore for T′ steps and exploit for T − T′ steps (epsilon-first)

issue 1. how to get an unbiased estimate of the best policy?

issue 2. how to balance exploration and exploitation if we don't know T?

trick 1: use the data $D = \{(c_t, a_t, r_t)\}$ observed in the exploration steps

$\hat\pi = \arg\max_{\pi \in H} \sum_{(c_t, a_t, r_t) \in D} r_{a_t,t} \, \mathbb{I}(\pi(c_t) = a_t)$

trick 2: run epsilon-first in mini-batches (a partition of T)

[1] The case of an infinite H with finite VC dimension can be derived in a similar way.
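Trick 1 amounts to scoring every policy on the logged exploration data and keeping the best one. The sketch below assumes the exploration arms were chosen uniformly at random, so that the agreement-weighted sum ranks policies fairly; the function name `best_policy`, the four toy policies, and the logged tuples are assumptions for this example.

```python
def best_policy(policies, data):
    """Return the name of the policy with the largest agreement-weighted reward.

    policies maps a name to a function context -> arm;
    data is a list of (context, chosen_arm, reward) tuples from the exploration steps.
    """
    def value(name):
        policy = policies[name]
        # Only rounds where the policy agrees with the logged arm contribute.
        return sum(r for c, a, r in data if policy(c) == a)

    return max(policies, key=value)


# Hypothetical setup: binary contexts, two arms, reward 1 when the arm matches the context.
policies = {
    "always_0": lambda c: 0,
    "always_1": lambda c: 1,
    "identity": lambda c: c,
    "flip": lambda c: 1 - c,
}
data = [(0, 0, 1.0), (0, 1, 0.0), (1, 0, 0.0), (1, 1, 1.0)]
chosen = best_policy(policies, data)
```

On this log the identity policy scores 2.0, the two constant policies score 1.0 each, and the flip policy scores 0.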



Epoch-Greedy [1]: combine trick 1 & trick 2

Regret bound is $O(T^{2/3})$ (not optimal!)

[1] Langford & Zhang. The Epoch-Greedy Algorithm for Multi-armed Bandits with Side Information. NIPS, 2007.



Idea: estimate a distribution $P_t$ over the policy space H

RandomizedUCB [1]

Regret bound is $O(\sqrt{T})$, but time complexity is $O(T^6)$

[1] Dudik et al. Efficient Optimal Learning for Contextual Bandits. UAI, 2011.



Idea: similar to RandomizedUCB, but with improved time complexity

ILOVECONBANDITS [1] (Importance-weighted LOw-Variance Epoch-Timed Oracleized CONtextual BANDITS)

Regret bound is $O(\sqrt{T})$, and time complexity is $O(T^{1.5})$

[1] Agarwal et al. Taming the Monster: A Fast and Simple Algorithm for Contextual Bandits. ICML, 2014.


Adversarial Contextual Bandit


Review: EXP3

Assume $r_{i,t} \in [0, 1]$ is chosen by the environment

In the adversarial setting, the agent must select arms randomly

Idea: put more probability on arms with higher observed rewards

EXP3 [1] (EXPonential-weight algorithm for EXPloration and EXPloitation)

Regret bound is $O(\sqrt{TK \log K})$

[1] Auer et al. The nonstochastic multiarmed bandit problem. SIAM Journal on Computing, 2002.
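A minimal sketch of the exponential-weight idea, following the usual textbook form of EXP3 (mix the weight distribution with uniform exploration, then update the played arm with an importance-weighted reward). The function name `exp3`, the exploration rate `gamma=0.1`, the weight renormalization, and the fixed reward sequence are assumptions for this example.

```python
import math
import random


def exp3(reward_fn, num_arms, horizon, gamma=0.1, seed=0):
    """Run EXP3; reward_fn(t, arm) must return a reward in [0, 1]."""
    rng = random.Random(seed)
    weights = [1.0] * num_arms
    counts = [0] * num_arms
    for t in range(horizon):
        total = sum(weights)
        # Exponential weights mixed with uniform exploration.
        probs = [(1.0 - gamma) * w / total + gamma / num_arms for w in weights]
        arm = rng.choices(range(num_arms), weights=probs)[0]
        # Importance-weighted estimate; only the played arm is updated.
        estimate = reward_fn(t, arm) / probs[arm]
        weights[arm] *= math.exp(gamma * estimate / num_arms)
        # Rescale so the weights never overflow; probabilities are unchanged.
        top = max(weights)
        weights = [w / top for w in weights]
        counts[arm] += 1
    return counts


# Hypothetical environment: arm 1 always pays 1, arm 0 always pays 0.
counts = exp3(lambda t, arm: 1.0 if arm == 1 else 0.0, num_arms=2, horizon=1000)
```

Because every arm keeps probability at least gamma / K, the estimate is always well defined, and the agent keeps sampling arm 0 occasionally even after arm 1 dominates.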



Idea: run EXP3 on policies, instead of arms

EXP4 [1] (EXPonential-weight algorithm for EXPloration and EXPloitation using EXPert advice)

Regret bound is $O(\sqrt{TK \log N})$, where N is the number of policies (experts), but the variance is high

[1] Auer et al. The nonstochastic multiarmed bandit problem. SIAM Journal on Computing, 2002.
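The sketch below illustrates the "weights on policies" idea for deterministic policies: arm probabilities are induced by the policy weights, and every policy that recommended the played arm shares the importance-weighted reward. It is a simplified illustration under stated assumptions, not the exact algorithm from the paper; the function name `exp4`, `gamma=0.1`, the three toy policies, and the reward rule are assumptions for this example.

```python
import math
import random


def exp4(policies, contexts, reward_fn, num_arms, gamma=0.1, seed=0):
    """Exponential weights over deterministic policies; returns rescaled final weights."""
    rng = random.Random(seed)
    weights = {name: 1.0 for name in policies}
    for t, context in enumerate(contexts):
        total = sum(weights.values())
        advice = {name: policy(context) for name, policy in policies.items()}
        # Arm probabilities: policy-weight mass per arm, mixed with uniform exploration.
        probs = [gamma / num_arms] * num_arms
        for name, arm in advice.items():
            probs[arm] += (1.0 - gamma) * weights[name] / total
        arm = rng.choices(range(num_arms), weights=probs)[0]
        estimate = reward_fn(t, context, arm) / probs[arm]
        # Every policy that recommended the played arm gets the estimated reward.
        for name in weights:
            if advice[name] == arm:
                weights[name] *= math.exp(gamma * estimate / num_arms)
        top = max(weights.values())
        weights = {name: w / top for name, w in weights.items()}
    return weights


# Hypothetical setup: binary contexts, reward 1 when the arm matches the context.
policies = {
    "always_0": lambda c: 0,
    "always_1": lambda c: 1,
    "identity": lambda c: c,
}
contexts = [t % 2 for t in range(500)]
weights = exp4(policies, contexts, lambda t, c, a: 1.0 if a == c else 0.0, num_arms=2)
```

In this toy problem the identity policy is rewarded on both contexts, while each constant policy is rewarded on only one, so identity ends with the largest weight.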



Idea: run EXP4 with better weights, to make the algorithm more stable

EXP4.P [1] (EXP4 with Probability)

Regret bound is $O(\sqrt{TK \log N})$, with high probability

[1] Beygelzimer et al. Contextual Bandit Algorithms with Supervised Learning Guarantees. AISTATS, 2011.


Supervised Learning to Contextual Bandit


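Idea: note that the contextual bandit can be viewed as a supervised learning problem in which the labels are only partially observed

Trick: use a randomized algorithm (e.g., epsilon-greedy) and, instead of the observed reward $r'_{a_t,t}$, the unbiased estimator of the true reward

$\hat r_{a_t,t} = \frac{r'_{a_t,t}}{p_{a_t}}$

where $p_i$ is the probability of selecting arm i, and $\hat r_{i,t} = 0$ for the arms that were not selected. The estimator is unbiased because

$E[\hat r_{i,t}] = p_i \cdot \frac{r_{i,t}}{p_i} + (1 - p_i) \cdot 0 = r_{i,t}$

Using this trick, any supervised learning algorithm can be converted to a contextual bandit algorithm

Banditron and NeuralBandit are examples based on neural networks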
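The unbiasedness of the importance-weighted estimator can be checked exactly by enumerating which arm is selected. The helper name `ips_estimates`, the reward values, and the selection probabilities below are assumptions for this example; `Fraction` is used so that the check involves no rounding.

```python
from fractions import Fraction


def ips_estimates(chosen_arm, observed_reward, probs):
    """Estimated reward vector: r' / p for the chosen arm, 0 for every other arm."""
    return [
        observed_reward / probs[i] if i == chosen_arm else Fraction(0)
        for i in range(len(probs))
    ]


# Hypothetical true rewards and selection probabilities for three arms.
true_rewards = [Fraction(1, 5), Fraction(7, 10), Fraction(1, 2)]
probs = [Fraction(1, 2), Fraction(1, 4), Fraction(1, 4)]

# Exact expectation over the random choice of arm.
expected = [Fraction(0)] * len(probs)
for arm, p in enumerate(probs):
    estimates = ips_estimates(arm, true_rewards[arm], probs)
    expected = [e + p * x for e, x in zip(expected, estimates)]
```

The price of unbiasedness is variance: an arm selected with small probability produces a large estimate on the rare rounds where it is played.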


Banditron and NeuralBandit

Both Banditron [1] and NeuralBandit [2] use perceptron-based models (a linear multiclass perceptron for Banditron, multi-layer perceptrons for NeuralBandit) and an epsilon-greedy algorithm with the unbiased reward estimator

However, Banditron uses the 0-1 loss (classification) while NeuralBandit uses the L2 loss (regression)

The regret bound of the original Banditron is $O(T^{2/3})$, and a second-order variant [3] reduces it to $O(\sqrt{T})$

No theoretical guarantee has been proved for NeuralBandit yet

[1] Kakade et al. Efficient Bandit Algorithms for Online Multiclass Prediction. ICML, 2008.

[2] Allesiardo et al. A Neural Networks Committee for the Contextual Bandit Problem. ICONIP, 2014.

[3] Crammer & Gentile. Multiclass Classification with Bandit Feedback using Adaptive Regularization. Machine Learning, 2013.


Summary & Reference



[Zhou 2015] A Survey on Contextual Multi-armed Bandits. arXiv, 2015.

[Burtini 2015] A Survey of Online Experiment Design with the Stochastic Multi-Armed Bandit. arXiv, 2015.

[Bubeck 2012] Regret Analysis of Stochastic and Nonstochastic Multi-armed Bandit Problems. arXiv, 2012.

