Lab Seminar: Contextual Bandit Survey
Sangwoo Mo
KAIST
August 4, 2016
Sangwoo Mo (KAIST) Contextual Bandit Survey August 4, 2016 1 / 32
Overview
1 Problem Setting
2 Naïve Approach: Reduce to MAB
3 Stochastic Contextual Bandit
   UCB & Thompson Sampling
   Arbitrary Set of Policies
4 Adversarial Contextual Bandit
5 Supervised Learning to Contextual Bandit
Problem Setting
Multi-Armed Bandit
At each time $t$, the agent selects an arm $a_t \in \{1, \dots, K\}$
Then the agent receives a reward $r_t$ $(= r_{a_t,t})$ from the environment
If $r_{i,t}$ is drawn i.i.d. from some distribution, we call it a stochastic bandit; if $r_{i,t}$ is selected by the environment, we call it an adversarial bandit
The goal of MAB is to find the policy $\pi \in \Pi$ with
$$\pi(a_1, r_1, \dots, a_{t-1}, r_{t-1}) = a_t$$
which minimizes the regret¹
$$R_T := \max_{i=1,\dots,K} \mathbb{E}\left[\sum_{t=1}^{T} r_{i,t} - \sum_{t=1}^{T} r_{a_t,t}\right]$$
¹ Properly speaking, the cumulative pseudo-regret.
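As a small worked example of the regret definition, the sketch below assumes a stochastic bandit whose arm means are known to the simulator, so that the pseudo-regret of a fixed pull sequence reduces to the sum of the gaps to the best mean. The function name `pseudo_regret` and the example means are illustrative assumptions, not part of the original slides.

```python
def pseudo_regret(means, pulls):
    """Cumulative pseudo-regret of a fixed sequence of pulls.

    means[i] is the expected reward of arm i (known only to the simulator);
    pulls is the list of arms a_1, ..., a_T that the agent selected.
    """
    best = max(means)  # expected reward of the best fixed arm
    return sum(best - means[a] for a in pulls)


# Two of the four pulls go to the worse arm (hypothetical means 0.5 and 0.9).
regret = pseudo_regret([0.5, 0.9], [0, 1, 1, 0])
```

For a randomized agent, one would average this quantity over independent runs.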
Contextual Bandit
In a contextual bandit, the agent receives additional information (the context) $c_t$¹ $\in C$ at the beginning of time $t$
In a stochastic contextual bandit, the reward $r_{i,t}$ can be represented as a function of the context $c_{i,t}$ and noise $\epsilon_{i,t}$
$$r_{i,t} = f(c_{i,t}) + \epsilon_{i,t}$$
or simply $r_{i,t} = f_i(c_t) + \epsilon_{i,t}$ if $c_t$ does not depend on $i$
In an adversarial contextual bandit, the reward $r_{i,t}$ is selected by the environment, as in the non-contextual MAB
¹ Much of the literature writes $c_{i,t}$ to emphasize that each arm $i$ has its own context. The two notations are equivalent, since we can construct a single vector $c_t$ by concatenating the $c_{i,t}$.
Optimal Regret Bound
Stochastic bandit: $\Omega(\log T)$¹
Adversarial bandit: $\Omega(\sqrt{KT})$²
Contextual bandit: $\Omega(d\sqrt{T})$³, where $d$ is the dimension of the context
¹ Lai & Robbins. Asymptotically efficient adaptive allocation rules. Advances in Applied Mathematics, 1985.
² Auer et al. Gambling in a rigged casino: The adversarial multi-armed bandit problem. FOCS, 1995. Obtained by a minimax strategy; note that the adversarial bandit can be viewed as a two-player game between the agent and the environment.
³ Dani et al. Stochastic Linear Optimization under Bandit Feedback. COLT, 2008. Note that the lower bound is $\Omega(\sqrt{T})$ even for the stochastic contextual bandit, since the contexts may arrive adversarially.
Naïve Approach: Reduce to MAB
Approach 1: assume the context set is finite ($|C| = N$)
Run an MAB algorithm (e.g. EXP3) for each context independently
The regret bound is $O(\sqrt{TNK\log K})$¹ (with EXP3)
Approach 2: assume the policy space is finite ($|H| = M$)
Run an MAB algorithm (e.g. EXP3) on policies instead of arms
The regret bound is $O(\sqrt{TM\log M})$ (with EXP3)
¹ $\sum_{c=1}^{N} O(\sqrt{n_c K\log K}) \le O(\sqrt{TNK\log K})$ by the Cauchy-Schwarz inequality, where $n_c$ is the number of times context $c$ is observed
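The Cauchy-Schwarz step in the footnote can be spelled out as follows. This is a sketch that assumes EXP3 incurs regret of order $\sqrt{n_c K \log K}$ on the $n_c$ rounds in which context $c$ appears, and uses $\sum_{c} n_c = T$.

```latex
\sum_{c=1}^{N} \sqrt{n_c K \log K}
  = \sqrt{K \log K} \sum_{c=1}^{N} 1 \cdot \sqrt{n_c}
  \le \sqrt{K \log K} \, \sqrt{\sum_{c=1}^{N} 1^2} \, \sqrt{\sum_{c=1}^{N} n_c}
  = \sqrt{K \log K} \, \sqrt{N} \, \sqrt{T}
  = \sqrt{T N K \log K}
```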
Stochastic Contextual Bandit: UCB & Thompson Sampling
Review: Index Policy and Greedy Algorithm
Since the Gittins index¹, index policies have become one of the most popular strategies for MAB problems
Idea: at each time $t$, define a score $s_{i,t}$ (the index) for each arm $i$, and select the arm with the highest score
Question: how should we define $s_{i,t}$?
Naïve approach: use the empirical mean² (greedy algorithm)
However, the naïve greedy algorithm may incur $O(T)$ regret
¹ Gittins. Bandit Processes and Dynamic Allocation Indices. Journal of the Royal Statistical Society, 1979.
² Note that MAB becomes trivial if we know the true means. The general goal of MAB algorithms is to estimate the means correctly and rapidly (the explore-exploit dilemma).
Review: UCB1
Assume $r_{i,t} \sim P_i$ with support $[0, 1]$ and mean $\mu_i$
Idea: select rarely selected arms more often and frequently selected arms less often. In other words, add a confidence bonus¹
UCB1²: define the score as
$$s_{i,t} = \hat\mu_{i,t} + \sqrt{\frac{2\log t}{n_{i,t}}}$$
where $\hat\mu_{i,t}$ is the empirical mean of arm $i$ and $n_{i,t}$ is the number of times arm $i$ has been selected
The UCB1 policy guarantees the optimal regret $O(\log T)$
There are also other choices of UCB (e.g. KL-UCB³, Bayes-UCB⁴)
¹ We call this bonus the UCB (upper confidence bound). Thus, score = estimated mean + UCB.
² Auer et al. Finite-time Analysis of the Multiarmed Bandit Problem. Machine Learning, 2002.
³ Garivier & Cappé. The KL-UCB Algorithm for Bounded Stochastic Bandits and Beyond. COLT, 2011.
⁴ Kaufmann et al. On Bayesian Upper Confidence Bounds for Bandit Problems. AISTATS, 2012.
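The UCB1 score above can be sketched in a few lines of Python. This is a minimal illustration rather than a reference implementation: the callback `pull(i)`, the function name `ucb1`, and the Bernoulli arm means in the usage lines are assumptions made for the example. Each arm is played once first so that $n_{i,t} > 0$ before the score is evaluated.

```python
import math
import random


def ucb1(pull, n_arms, horizon):
    """Run UCB1 for `horizon` rounds; `pull(i)` returns a reward in [0, 1]."""
    counts = [0] * n_arms   # n_{i,t}: number of times arm i was selected
    sums = [0.0] * n_arms   # running sum of rewards per arm
    for t in range(1, horizon + 1):
        if t <= n_arms:
            arm = t - 1     # initialisation: play every arm once
        else:
            # score = empirical mean + sqrt(2 log t / n_{i,t})
            arm = max(
                range(n_arms),
                key=lambda i: sums[i] / counts[i]
                + math.sqrt(2 * math.log(t) / counts[i]),
            )
        reward = pull(arm)
        counts[arm] += 1
        sums[arm] += reward
    return counts


# Usage with two hypothetical Bernoulli arms.
rng = random.Random(0)
means = [0.2, 0.8]
counts = ucb1(lambda i: float(rng.random() < means[i]), n_arms=2, horizon=1000)
```

The returned `counts` show how often each arm was played; the better arm should dominate as the horizon grows.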
LinUCB
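Assume $r_{i,t} \sim P(r_{i,t} \mid c_{i,t}, \theta^*)$ where $\mathbb{E}[r_{i,t}] = c_{i,t}^\top \theta^*$ ($c_{i,t}, \theta^* \in \mathbb{R}^d$)
As in UCB1, we want to define the score as
$$s_{i,t} = c_{i,t}^\top \hat\theta_t + \mathrm{UCB}_{i,t}$$
Question: how should we choose $\mathrm{UCB}_{i,t}$?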
LinUCB
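Idea: let $\hat\theta_t$ be the ridge regression estimator of $\theta^*$
$$\hat\theta_t = (C_t^\top C_t + \lambda I_d)^{-1} C_t^\top R_t$$
where $C_t$ stacks the observed contexts $c_1, \dots, c_{t-1}$ as rows and $R_t = (r_1, \dots, r_{t-1})$ collects the observed rewards
Then the inequality below holds with probability $1 - \delta/T$
$$\left| c_{i,t}^\top \hat\theta_t - c_{i,t}^\top \theta^* \right| \le (\epsilon + 1) \sqrt{c_{i,t}^\top A_t^{-1} c_{i,t}}$$
where $A_t = C_t^\top C_t + I_d$ (the ridge matrix with $\lambda = 1$) and $\epsilon = \sqrt{\frac{1}{2} \log \frac{2TK}{\delta}}$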
LinUCB
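LinUCB¹: define the score as
$$s_{i,t} = c_{i,t}^\top \hat\theta_t + \alpha \sqrt{c_{i,t}^\top A_t^{-1} c_{i,t}}$$
where $\alpha$ is an exploration parameter ($\alpha = \epsilon + 1$ matches the confidence bound of the previous slide)
The regret bound (with probability $1 - \delta$) is
$$O\left(d \sqrt{T \log \frac{1 + T}{\delta}}\right)$$
The LinUCB policy guarantees the optimal regret $O(d\sqrt{T})$ up to logarithmic factors
There are also other choices of UCB (e.g. LinREL², CoFineUCB³)
¹ Li et al. A contextual-bandit approach to personalized news article recommendation. WWW, 2010.
² Auer. Using Confidence Bounds for Exploitation-Exploration Trade-offs. JMLR, 2002.
³ Yue et al. Hierarchical Exploration for Accelerating Contextual Bandits. ICML, 2012.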
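The LinUCB score can be sketched in plain Python as below. This is an illustrative sketch under stated assumptions, not the implementation from Li et al.: it uses a single shared parameter vector, fixes $\lambda = 1$, and maintains $A_t^{-1}$ incrementally with the Sherman-Morrison identity instead of re-inverting the matrix. The class name `LinUCB` and the helpers `dot` and `matvec` are hypothetical.

```python
import math


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def matvec(M, v):
    return [dot(row, v) for row in M]


class LinUCB:
    """Score = c^T theta_hat + alpha * sqrt(c^T A^{-1} c), with A_1 = I_d."""

    def __init__(self, dim, alpha=1.0):
        self.alpha = alpha
        # A starts as the identity, so its inverse is the identity as well.
        self.A_inv = [[1.0 if i == j else 0.0 for j in range(dim)]
                      for i in range(dim)]
        self.b = [0.0] * dim  # accumulates sum of reward * context

    def theta(self):
        # Ridge estimate theta_hat = A^{-1} b.
        return matvec(self.A_inv, self.b)

    def select(self, contexts):
        """contexts[i] is the feature vector c_{i,t} of arm i."""
        theta = self.theta()

        def score(c):
            width = math.sqrt(max(dot(c, matvec(self.A_inv, c)), 0.0))
            return dot(c, theta) + self.alpha * width

        return max(range(len(contexts)), key=lambda i: score(contexts[i]))

    def update(self, c, reward):
        # Sherman-Morrison: (A + c c^T)^{-1}
        #   = A^{-1} - (A^{-1} c)(A^{-1} c)^T / (1 + c^T A^{-1} c)
        Ac = matvec(self.A_inv, c)
        denom = 1.0 + dot(c, Ac)
        for i in range(len(c)):
            for j in range(len(c)):
                self.A_inv[i][j] -= Ac[i] * Ac[j] / denom
        self.b = [bi + reward * ci for bi, ci in zip(self.b, c)]
```

A larger `alpha` widens the confidence bonus, so rarely observed directions in context space are explored more aggressively.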
Review: Thompson Sampling
Another popular strategy for MAB is Thompson Sampling¹
It can be applied to both contextual and non-contextual bandits
Assume $r_{i,t} \sim P(r_{i,t} \mid c_{i,t}, \theta^*)$ with prior $\theta^* \sim P(\theta)$
Idea: sample the estimator $\hat\theta_t$ from the posterior distribution
step 1. draw $\hat\theta_t$ from the posterior $P(\theta \mid D)$, where $D = \{(c_\tau, a_\tau, r_\tau)\}_{\tau < t}$ is the history observed so far
step 2. select arm $a_t = \arg\max_i \mathbb{E}[r_{i,t} \mid c_{i,t}, \hat\theta_t]$
The idea is simple, but it works well both in theory² and in practice³
¹ Thompson. On the Likelihood that One Unknown Probability Exceeds Another in View of the Evidence of Two Samples. Biometrika, 1933.
² Agrawal & Goyal. Analysis of Thompson Sampling for the Multi-armed Bandit Problem. COLT, 2012.
³ Scott. A modern Bayesian look at the multi-armed bandit. Applied Stochastic Models in Business and Industry, 2010.
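The two steps above can be illustrated in the simplest non-contextual case. The sketch below assumes Bernoulli rewards with independent Beta(1, 1) priors, for which the posterior of each arm is again a Beta distribution; this conjugate model, the function name `thompson_bernoulli`, and the callback `pull(i)` are assumptions chosen for the example.

```python
import random


def thompson_bernoulli(pull, n_arms, horizon, rng=random):
    """Thompson Sampling for Bernoulli rewards with Beta(1, 1) priors."""
    wins = [0] * n_arms    # number of observed rewards equal to 1
    losses = [0] * n_arms  # number of observed rewards equal to 0
    for _ in range(horizon):
        # Step 1: draw one mean per arm from its Beta posterior.
        samples = [rng.betavariate(wins[i] + 1, losses[i] + 1)
                   for i in range(n_arms)]
        # Step 2: play the arm whose sampled mean is largest.
        arm = max(range(n_arms), key=lambda i: samples[i])
        if pull(arm):
            wins[arm] += 1
        else:
            losses[arm] += 1
    return wins, losses
```

Exploration comes from the posterior spread alone: an arm with few observations has a wide posterior and is therefore still sampled as the best arm from time to time.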
LinTS
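Assume $r_{i,t} \sim N(c_{i,t}^\top \theta^*, v^2)$ with prior $\theta^* \sim N(\hat\theta_t, v^2 B_t^{-1})$, where
$$B_t = \sum_{\tau=1}^{t-1} c_{a_\tau,\tau} c_{a_\tau,\tau}^\top + I_d, \qquad \hat\theta_t = B_t^{-1} \left( \sum_{\tau=1}^{t-1} c_{a_\tau,\tau} \, r_{a_\tau,\tau} \right)$$
(the sums run over the contexts and rewards of the arms actually played), and
$$r_{i,t} \in [\bar r_{i,t} - R, \ \bar r_{i,t} + R], \qquad v = R \sqrt{\frac{24}{\epsilon} \, d \log \frac{t}{\delta}}$$
with $\bar r_{i,t} = \mathbb{E}[r_{i,t}] = c_{i,t}^\top \theta^*$ and a parameter $\epsilon \in (0, 1)$
Then the posterior of $\theta^*$ is $N(\hat\theta_{t+1}, v^2 B_{t+1}^{-1})$
LinTS¹: run Thompson Sampling under these assumptions
The regret bound (with probability $1 - \delta$) is
$$O\left( \frac{d^2}{\epsilon} \sqrt{T^{1+\epsilon}} \, \log(Td) \log \frac{1}{\delta} \right)$$
¹ Agrawal & Goyal. Thompson Sampling for Contextual Bandits with Linear Payoffs. ICML, 2013.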
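A minimal sketch of LinTS under the Gaussian model above is given below. It is not the authors' code: it treats $v$ as a fixed constant instead of the schedule given on the slide, maintains $B_t^{-1}$ with the Sherman-Morrison identity, and samples from $N(\hat\theta_t, v^2 B_t^{-1})$ through a hand-written Cholesky factor. The names `LinTS`, `dot` and `cholesky` are hypothetical.

```python
import math
import random


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def cholesky(M):
    """Lower-triangular L with L L^T = M, for symmetric positive-definite M."""
    n = len(M)
    L = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            s = sum(L[i][k] * L[j][k] for k in range(j))
            if i == j:
                L[i][j] = math.sqrt(M[i][i] - s)
            else:
                L[i][j] = (M[i][j] - s) / L[j][j]
    return L


class LinTS:
    def __init__(self, dim, v=1.0, rng=random):
        self.v = v          # posterior scale, assumed constant here
        self.rng = rng
        self.B_inv = [[1.0 if i == j else 0.0 for j in range(dim)]
                      for i in range(dim)]
        self.f = [0.0] * dim  # accumulates sum of reward * context

    def sample_theta(self):
        theta_hat = [dot(row, self.f) for row in self.B_inv]
        L = cholesky(self.B_inv)
        z = [self.rng.gauss(0.0, 1.0) for _ in theta_hat]
        # theta_hat + v * L z has covariance v^2 * L L^T = v^2 * B^{-1}.
        return [m + self.v * dot(L[i], z) for i, m in enumerate(theta_hat)]

    def select(self, contexts):
        theta = self.sample_theta()
        return max(range(len(contexts)), key=lambda i: dot(contexts[i], theta))

    def update(self, c, reward):
        # Sherman-Morrison update of B^{-1} after adding c c^T to B.
        Bc = [dot(row, c) for row in self.B_inv]
        denom = 1.0 + dot(c, Bc)
        for i in range(len(c)):
            for j in range(len(c)):
                self.B_inv[i][j] -= Bc[i] * Bc[j] / denom
        self.f = [fi + reward * ci for fi, ci in zip(self.f, c)]
```

Setting `v=0.0` removes the posterior noise and reduces the agent to the greedy ridge estimate, which is a convenient way to check the bookkeeping.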
UCB & TS: Nonlinear Case
Assume $\mathbb{E}[r_{i,t}] = f(c_{i,t})$ is a general nonlinear function
If we assume the reward follows a generalized linear model (an exponential family distribution), we can use GLM-UCB¹
If we assume $f$ is sampled from a Gaussian process, we can use GP-UCB²/CGP-UCB³
If we assume $f$ is an element of a reproducing kernel Hilbert space, we can use KernelUCB⁴
We can also use Thompson Sampling if we know the form of the probability distribution
¹ Filippi et al. Parametric Bandits: The Generalized Linear Case. NIPS, 2010.
² Srinivas et al. Gaussian Process Optimization in the Bandit Setting: No Regret and Experimental Design. ICML, 2010.
³ Krause & Ong. Contextual Gaussian Process Bandit Optimization. NIPS, 2011.
⁴ Valko et al. Finite-Time Analysis of Kernelised Contextual Bandits. UAI, 2013.
Stochastic Contextual Bandit: Arbitrary Set of Policies
Epoch-Greedy
Assume the policy space $H$ is finite¹
Idea: explore for $T'$ steps and exploit for $T - T'$ steps (epsilon-first)
issue 1. how to get an unbiased estimator of the best policy?
issue 2. how to balance exploration and exploitation if we don't know $T$?
trick 1: use the data $D = \{(c_t, a_t, r_t)\}$ observed in the exploration steps, where each arm is chosen uniformly at random (with probability $1/K$)
$$\hat\pi = \arg\max_{\pi \in H} \sum_{(c_t, a_t, r_t) \in D} \frac{r_t \, \mathbb{I}(\pi(c_t) = a_t)}{1/K}$$
trick 2: run epsilon-first in mini-batches (a partition of $T$)
¹ The infinite case with finite VC-dimension can be derived in a similar way
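Trick 1 is an importance-weighted estimate of each policy's reward, followed by an argmax over the policy space. The sketch below assumes the exploration data was collected by choosing arms uniformly at random, and divides by $|D|$ so that the value is a per-round average (this does not change the argmax). The names `ips_value` and `best_policy` are illustrative.

```python
def ips_value(policy, data, n_arms):
    """Estimate the per-round reward of `policy` from uniform exploration data.

    data is a list of (context, arm, reward) tuples in which `arm` was
    chosen uniformly at random, i.e. with probability 1 / n_arms.
    """
    total = sum(r / (1.0 / n_arms) for (c, a, r) in data if policy(c) == a)
    return total / len(data)


def best_policy(policies, data, n_arms):
    """Empirically best policy in a finite policy space."""
    return max(policies, key=lambda pi: ips_value(pi, data, n_arms))
```

Only the rounds where the policy agrees with the logged arm contribute, and the factor $K$ compensates for that agreement happening with probability $1/K$.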
Epoch-Greedy
Epoch-Greedy¹: combine trick 1 and trick 2
The regret bound is $O(T^{2/3})$ (not optimal!)
¹ Langford & Zhang. The Epoch-Greedy Algorithm for Multi-armed Bandits with Side Information. NIPS, 2007.
RandomizedUCB
Idea: estimate the distribution $P_t$ over the policy space $H$
RandomizedUCB¹: the regret bound is $O(\sqrt{T})$, but the time complexity is $O(T^6)$
¹ Dudik et al. Efficient Optimal Learning for Contextual Bandits. UAI, 2011.
ILOVECONBANDITS
Idea: similar to RandomizedUCB, but with improved time complexity
ILOVECONBANDITS¹ (Importance-weighted LOw-Variance Epoch-Timed Oracleized CONtextual BANDITS): the regret bound is $O(\sqrt{T})$, and the time complexity is $O(T^{1.5})$
¹ Agarwal et al. Taming the Monster: A Fast and Simple Algorithm for Contextual Bandits. ICML, 2014.
Adversarial Contextual Bandit
Review: EXP3
Assume $r_{i,t} \in [0, 1]$ is selected by the environment
In the adversarial setting, the agent must select arms randomly, since a deterministic policy can be exploited by the environment
Idea: put more probability on arms with higher observed rewards
EXP3¹ (EXPonential-weight algorithm for EXPloration and EXPloitation): the regret bound is $O(\sqrt{TK \log K})$
¹ Auer et al. The nonstochastic multiarmed bandit problem. SIAM Journal on Computing, 2002.
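The algorithm listing for this slide is not preserved in the text, so the sketch below follows the standard textbook form of EXP3: mix the normalized weights with uniform exploration of rate $\gamma$, importance-weight the observed reward, and apply an exponential update to the chosen arm. The callback `reward_fn(t, arm)`, the default `gamma`, and the per-round renormalisation of the weights (added only to avoid floating-point overflow) are assumptions of this example.

```python
import math
import random


def exp3_probs(weights, gamma):
    """Mix the normalized weights with uniform exploration."""
    total = sum(weights)
    k = len(weights)
    return [(1 - gamma) * w / total + gamma / k for w in weights]


def exp3(reward_fn, n_arms, horizon, gamma=0.1, rng=random):
    """Run EXP3; `reward_fn(t, arm)` returns the chosen arm's reward in [0, 1]."""
    weights = [1.0] * n_arms
    for t in range(horizon):
        probs = exp3_probs(weights, gamma)
        arm = rng.choices(range(n_arms), weights=probs)[0]
        reward = reward_fn(t, arm)
        estimate = reward / probs[arm]  # unbiased estimate of the arm's reward
        weights[arm] *= math.exp(gamma * estimate / n_arms)
        top = max(weights)              # rescale; probabilities are unchanged
        weights = [w / top for w in weights]
    return exp3_probs(weights, gamma)
```

The function returns the final sampling distribution, which never drops below $\gamma / K$ for any arm.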
EXP4
Idea: run EXP3 on policies instead of arms
EXP4¹ (EXPonential-weight algorithm for EXPloration and EXPloitation using EXPert advice): the regret bound is $O(\sqrt{TK \log N})$, where $N$ is the number of policies (experts), but the variance is high
¹ Auer et al. The nonstochastic multiarmed bandit problem. SIAM Journal on Computing, 2002.
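A sketch of EXP4 in its standard form, restricted to deterministic policies for brevity: each policy votes for one arm, the arm distribution is the weight-averaged vote mixed with uniform exploration, and every policy that voted for the played arm shares the importance-weighted reward. The callback `reward_fn(t, context, arm)`, the default `gamma`, and the weight renormalisation are assumptions of this example rather than details taken from the slide.

```python
import math
import random


def exp4(policies, contexts, reward_fn, n_arms, gamma=0.1, rng=random):
    """Run EXP4 over deterministic policies; returns the final policy weights."""
    weights = [1.0] * len(policies)
    for t, c in enumerate(contexts):
        advice = [pi(c) for pi in policies]  # arm recommended by each policy
        total = sum(weights)
        probs = [gamma / n_arms] * n_arms
        for w, a in zip(weights, advice):
            probs[a] += (1 - gamma) * w / total
        arm = rng.choices(range(n_arms), weights=probs)[0]
        estimate = reward_fn(t, c, arm) / probs[arm]
        # Policies that recommended the played arm receive the estimated reward.
        weights = [
            w * math.exp(gamma * estimate / n_arms) if a == arm else w
            for w, a in zip(weights, advice)
        ]
        top = max(weights)                   # rescale to avoid overflow
        weights = [w / top for w in weights]
    return weights
```

The weights live on policies rather than arms, which is why the regret depends on $\log N$ instead of on the number of contexts.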
EXP4.P
Idea: run EXP4 with better weights to make the algorithm stable
EXP4.P¹ (EXP4 with Probability): the regret bound is $O(\sqrt{TK \log N})$ with high probability
¹ Beygelzimer et al. Contextual Bandit Algorithms with Supervised Learning Guarantees. AISTATS, 2011.
Supervised Learning to Contextual Bandit
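Idea: a contextual bandit can be viewed as a supervised learning problem with partially observed feedback (only the reward of the chosen arm is revealed)
Trick: use a randomized algorithm (e.g. epsilon-greedy) and the unbiased reward estimator $\hat r_{a_t,t} = r'_{a_t,t} / p_{a_t}$ instead of the observed reward $r'_{a_t,t}$, where $p_i$ is the probability of selecting arm $i$ and $\hat r_{i,t} = 0$ for every arm that was not selected
Then, since the observed reward of a selected arm $i$ equals its true reward $r_{i,t}$,
$$\mathbb{E}[\hat r_{i,t}] = p_i \cdot \frac{r_{i,t}}{p_i} + (1 - p_i) \cdot 0 = r_{i,t}$$
Using this trick, any supervised learning algorithm can be converted into a contextual bandit algorithm
Banditron and NeuralBandit are examples built on perceptron and neural network learners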
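The unbiasedness argument can be checked exactly with rational arithmetic. The sketch below assumes epsilon-greedy selection probabilities and averages the estimator over which arm ends up being chosen; the function names and the example rewards are illustrative assumptions.

```python
from fractions import Fraction


def epsilon_greedy_probs(greedy_arm, n_arms, epsilon):
    """Selection probabilities p_i of epsilon-greedy."""
    return [
        epsilon / n_arms + (1 - epsilon if i == greedy_arm else 0)
        for i in range(n_arms)
    ]


def estimated_rewards(chosen_arm, observed_reward, probs):
    """r_hat vector: observed / p for the chosen arm, 0 for all other arms."""
    return [
        observed_reward / probs[i] if i == chosen_arm else 0
        for i in range(len(probs))
    ]


def expected_estimates(true_rewards, probs):
    """Exact E[r_hat], averaging over which arm gets chosen."""
    k = len(probs)
    totals = [0] * k
    for chosen in range(k):
        est = estimated_rewards(chosen, true_rewards[chosen], probs)
        for i in range(k):
            totals[i] += probs[chosen] * est[i]
    return totals


probs = epsilon_greedy_probs(greedy_arm=1, n_arms=3, epsilon=Fraction(1, 10))
true_rewards = [Fraction(1, 2), Fraction(1), Fraction(0)]
expected = expected_estimates(true_rewards, probs)
```

Here `expected` coincides with `true_rewards`, although in any single round only one entry of the estimate vector is non-zero. The price is variance: arms with small $p_i$ produce large estimates when they are selected.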
Banditron and NeuralBandit
Banditron¹ (a multiclass perceptron) and NeuralBandit² (multi-layer perceptrons) both use an epsilon-greedy algorithm with the unbiased reward estimator
However, Banditron uses the 0-1 loss (classification), while NeuralBandit uses the L2 loss (regression)
The regret bound of the original Banditron is $O(T^{2/3})$, and a second-order variant³ reduced it to $O(\sqrt{T})$
No theoretical guarantee has been proved for NeuralBandit yet
¹ Kakade et al. Efficient Bandit Algorithms for Online Multiclass Prediction. ICML, 2008.
² Allesiardo et al. A Neural Networks Committee for the Contextual Bandit Problem. ICONIP, 2014.
³ Crammer & Gentile. Multiclass Classification with Bandit Feedback using Adaptive Regularization. ICML, 2013.
Summary & Reference
Summary
Reference
[Zhou 2015] A Survey on Contextual Multi-armed Bandits. arXiv, 2015.
[Burtini et al. 2015] A Survey of Online Experiment Design with the Stochastic Multi-Armed Bandit. arXiv, 2015.
[Bubeck & Cesa-Bianchi 2012] Regret Analysis of Stochastic and Nonstochastic Multi-armed Bandit Problems. arXiv, 2012.