Introduction to Bandits:
Algorithms and Theory
Jean-Yves Audibert (1,2) & Rémi Munos (3)
1. Université Paris-Est, LIGM, Imagine; 2. CNRS / École Normale Supérieure / INRIA, LIENS, Sierra; 3. INRIA Sequential Learning team, France
ICML 2011, Bellevue (WA), USA
Outline
- Bandit problems and applications
- Bandits with small set of actions
  - Stochastic setting
  - Adversarial setting
- Bandits with large set of actions
  - unstructured set
  - structured set: linear bandits, Lipschitz bandits, tree bandits
- Extensions
Bandit game

Parameters available to the forecaster: the number of arms (or actions) K and the number of rounds n.
Unknown to the forecaster: the way the gain vectors $g_t = (g_{1,t}, \dots, g_{K,t}) \in [0,1]^K$ are generated.

For each round t = 1, 2, ..., n:
1. the forecaster chooses an arm $I_t \in \{1, \dots, K\}$
2. the forecaster receives the gain $g_{I_t,t}$
3. only $g_{I_t,t}$ is revealed to the forecaster

Cumulative regret goal: maximize the cumulative gains obtained. More precisely, minimize
\[
R_n = \Big(\max_{i=1,\dots,K} \mathbb{E} \sum_{t=1}^n g_{i,t}\Big) - \mathbb{E} \sum_{t=1}^n g_{I_t,t}
\]
where the expectation comes from both a possible stochastic generation of the gain vector and a possible randomization in the choice of $I_t$.
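The interaction protocol above can be written as a short simulation loop. This is a minimal sketch only: the `policy` interface (`select_arm`/`update`) and the `draw_gains` callback are illustrative names, not part of the tutorial, and arms are 0-indexed.

```python
def run_bandit_game(policy, draw_gains, n_rounds):
    """Generic bandit protocol: at each round the policy picks an arm,
    observes only the gain of that arm, and updates itself."""
    total_gain = 0.0
    for t in range(1, n_rounds + 1):
        arm = policy.select_arm(t)   # I_t in {0, ..., K-1}
        gains = draw_gains(t)        # full gain vector g_t in [0,1]^K (hidden from the policy)
        reward = gains[arm]          # only g_{I_t,t} is revealed
        policy.update(arm, reward)
        total_gain += reward
    return total_gain
```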
Stochastic and adversarial environments

- Stochastic environment: the gain vector $g_t$ is sampled from an unknown product distribution $\nu_1 \otimes \dots \otimes \nu_K$ on $[0,1]^K$, that is $g_{i,t} \sim \nu_i$.
- Adversarial environment: the gain vector $g_t$ is chosen by an adversary (which, at time t, knows all the past, but not $I_t$).
Numerous variants
- different environments: adversarial, "stochastic", non-stationary
- different targets: cumulative regret, simple regret, tracking the best expert
- continuous or discrete set of actions
- extensions with additional rules: varying set of arms, pay-per-observation, ...
Various applications
- Clinical trials (Thompson, 1933)
- Ads placement on webpages
- Nash equilibria (traffic or communication networks, agent simulation, phantom tic-tac-toe, ...)
- Game-playing computers (Go, Urban Rivals, ...)
- Packet routing, itinerary selection
- ...
Outline
- Bandit problems and applications
- Bandits with small set of actions
  - Stochastic setting
  - Adversarial setting
- Bandits with large set of actions
  - unstructured set
  - structured set: linear bandits, Lipschitz bandits, tree bandits
- Extensions
Stochastic bandit game (Robbins, 1952)
Parameters available to the forecaster: K and n.
Parameters unknown to the forecaster: the reward distributions $\nu_1, \dots, \nu_K$ of the arms (with respective means $\mu_1, \dots, \mu_K$).

For each round t = 1, 2, ..., n:
1. the forecaster chooses an arm $I_t \in \{1, \dots, K\}$
2. the environment draws the gain vector $g_t = (g_{1,t}, \dots, g_{K,t})$ according to $\nu_1 \otimes \dots \otimes \nu_K$
3. the forecaster receives the gain $g_{I_t,t}$

Notation: $i^* = \arg\max_{i=1,\dots,K} \mu_i$, $\mu^* = \max_{i=1,\dots,K} \mu_i$, $\Delta_i = \mu^* - \mu_i$, $T_i(n) = \sum_{t=1}^n \mathbf{1}_{I_t = i}$

Cumulative regret: $\hat R_n = \sum_{t=1}^n g_{i^*,t} - \sum_{t=1}^n g_{I_t,t}$

Goal: minimize the expected cumulative regret
\[
R_n = \mathbb{E}\hat R_n = n\mu^* - \mathbb{E}\sum_{t=1}^n g_{I_t,t} = n\mu^* - \mathbb{E}\sum_{i=1}^K T_i(n)\,\mu_i = \sum_{i=1}^K \Delta_i\, \mathbb{E}\, T_i(n)
\]
A simple policy: ε-greedy
- Playing the arm with the highest empirical mean does not work
- ε-greedy: at time t,
  - with probability $1 - \varepsilon_t$, play the arm with the highest empirical mean
  - with probability $\varepsilon_t$, play a random arm
- Theoretical guarantee (Auer, Cesa-Bianchi, Fischer, 2002):
  - Let $\Delta = \min_{i:\Delta_i>0} \Delta_i$ and consider $\varepsilon_t = \min\big(\frac{6K}{\Delta^2 t}, 1\big)$
  - When $t \geq \frac{6K}{\Delta^2}$, the probability of choosing a suboptimal arm i is bounded by $\frac{C}{\Delta^2 t}$ for some constant C > 0
  - As a consequence, $\mathbb{E}[T_i(n)] \leq \frac{C}{\Delta^2}\log n$ and $R_n \leq \sum_{i:\Delta_i>0} \frac{C\,\Delta_i}{\Delta^2}\log n$ → logarithmic regret
- Drawbacks:
  - requires knowledge of ∆
  - outperformed by the UCB policy in practice

(a minimal sketch of the policy is given below)
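The following sketch implements the schedule $\varepsilon_t = \min(6K/(\Delta^2 t), 1)$ described above. The class name and the `select_arm`/`update` interface are illustrative assumptions, not taken from the tutorial; the knowledge of the gap `delta` is required, which is exactly the drawback noted on the slide.

```python
import random

class EpsilonGreedy:
    """epsilon_t-greedy with decreasing schedule eps_t = min(6K / (Delta^2 t), 1)."""

    def __init__(self, n_arms, delta):
        self.n_arms = n_arms
        self.delta = delta            # smallest positive gap Delta, assumed known
        self.counts = [0] * n_arms    # T_i(t-1)
        self.means = [0.0] * n_arms   # empirical means

    def select_arm(self, t):
        eps_t = min(6.0 * self.n_arms / (self.delta ** 2 * t), 1.0)
        if random.random() < eps_t:
            return random.randrange(self.n_arms)                          # explore
        return max(range(self.n_arms), key=lambda i: self.means[i])       # exploit

    def update(self, arm, reward):
        self.counts[arm] += 1
        self.means[arm] += (reward - self.means[arm]) / self.counts[arm]  # incremental mean
```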
Optimism in face of uncertainty
- At time t, from past observations and some probabilistic argument, you have an upper confidence bound (UCB) on the expected rewards.
- Simple implementation: play the arm having the largest UCB!
Why does it make sense?
- Could we keep drawing a wrong arm for a long time? No, since:
  - the more we draw a wrong arm i, the closer its UCB gets to the expected reward $\mu_i$,
  - and $\mu_i < \mu^* \leq$ UCB on $\mu^*$.
Illustration of UCB policy

[Figure]
Confidence intervals vs sampling times

[Figure]
Hoeffding-based UCB (Auer, Cesa-Bianchi, Fischer, 2002)
- Hoeffding's inequality: let $X, X_1, \dots, X_m$ be i.i.d. r.v. taking their values in [0, 1]. For any ε > 0, with probability at least 1 − ε, we have
\[
\mathbb{E}X \leq \frac{1}{m}\sum_{s=1}^m X_s + \sqrt{\frac{\log(\varepsilon^{-1})}{2m}}
\]
- UCB1 policy: at time t, play
\[
I_t \in \arg\max_{i \in \{1,\dots,K\}} \Big\{ \hat\mu_{i,t-1} + \sqrt{\frac{2\log t}{T_i(t-1)}} \Big\},
\]
where $\hat\mu_{i,t-1} = \frac{1}{T_i(t-1)}\sum_{s=1}^{T_i(t-1)} X_{i,s}$
- Regret bound:
\[
R_n \leq \sum_{i \neq i^*} \min\Big( \frac{10}{\Delta_i}\log n,\; n\Delta_i \Big)
\]

(a minimal sketch of UCB1 is given below)
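A short sketch of the UCB1 index above. The class name and the `select_arm`/`update` interface are illustrative assumptions; the index itself is the one on the slide, $\hat\mu_i + \sqrt{2\log t / T_i}$.

```python
import math

class UCB1:
    """UCB1: play the arm maximizing mu_hat_i + sqrt(2 * log(t) / T_i)."""

    def __init__(self, n_arms):
        self.n_arms = n_arms
        self.counts = [0] * n_arms    # T_i(t-1)
        self.means = [0.0] * n_arms   # empirical means mu_hat_{i,t-1}

    def select_arm(self, t):
        # Play each arm once before the index is well defined.
        for i in range(self.n_arms):
            if self.counts[i] == 0:
                return i
        def index(i):
            return self.means[i] + math.sqrt(2.0 * math.log(t) / self.counts[i])
        return max(range(self.n_arms), key=index)

    def update(self, arm, reward):
        self.counts[arm] += 1
        self.means[arm] += (reward - self.means[arm]) / self.counts[arm]
```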
Hoeffding-based UCB (Auer, Cesa-Bianchi, Fischer, 2002)
- (Same Hoeffding inequality and UCB1 index as on the previous slide.)
- UCB1 is an anytime policy (it does not need to know n to be implemented)
Hoeffding-based UCB (Auer, Cesa-Bianchi, Fischer, 2002)
- (Same Hoeffding inequality and UCB1 index as above.)
- UCB1 corresponds to $2\log t = \frac{\log(\varepsilon^{-1})}{2}$, hence $\varepsilon = 1/t^4$
- Critical confidence level: $\varepsilon = 1/t$ (Lai & Robbins, 1985; Agrawal, 1995; Burnetas & Katehakis, 1996; Audibert, Munos, Szepesvári, 2009; Honda & Takemura, 2010)
Better confidence bounds imply smaller regret

- Hoeffding's inequality, $\frac{1}{t}$-confidence bound:
\[
\mathbb{E}X \leq \frac{1}{m}\sum_{s=1}^m X_s + \sqrt{\frac{\log t}{2m}}
\]
- Bernstein's inequality, $\frac{1}{t}$-confidence bound:
\[
\mathbb{E}X \leq \frac{1}{m}\sum_{s=1}^m X_s + \sqrt{\frac{2\log(t)\,\mathrm{Var}\,X}{m}} + \frac{\log t}{3m}
\]
- Empirical Bernstein's inequality, $\frac{1}{t}$-confidence bound:
\[
\mathbb{E}X \leq \frac{1}{m}\sum_{s=1}^m X_s + \sqrt{\frac{2\log(t)\,\widehat{\mathrm{Var}}\,X}{m}} + \frac{8\log t}{3m}
\]
(Audibert, Munos, Szepesvári, 2009; Maurer, 2009; Audibert, 2010)
- The asymptotic confidence bound leads to catastrophe:
\[
\mathbb{E}X \leq \frac{1}{m}\sum_{s=1}^m X_s + \sqrt{\frac{\mathrm{Var}\,X}{m}}\, x \quad \text{with } x \text{ s.t. } \int_x^{+\infty} \frac{e^{-u^2/2}}{\sqrt{2\pi}}\,du = \frac{1}{t}
\]
Better confidence bounds imply smaller regret
Hoeffding-based UCB:
\[
\mathbb{E}X \leq \frac{1}{m}\sum_{s=1}^m X_s + \sqrt{\frac{\log(\varepsilon^{-1})}{2m}}
\qquad\Rightarrow\qquad
R_n \leq \sum_{i\neq i^*} \min\Big(\frac{c}{\Delta_i}\log n,\; n\Delta_i\Big)
\]

Empirical Bernstein-based UCB:
\[
\mathbb{E}X \leq \frac{1}{m}\sum_{s=1}^m X_s + \sqrt{\frac{2\log(\varepsilon^{-1})\,\widehat{\mathrm{Var}}\,X}{m}} + \frac{8\log(\varepsilon^{-1})}{3m}
\qquad\Rightarrow\qquad
R_n \leq \sum_{i\neq i^*} \min\Big( c\Big(\frac{\mathrm{Var}\,\nu_i}{\Delta_i} + 1\Big)\log n,\; n\Delta_i\Big)
\]

(a sketch of an empirical-Bernstein index is given below)
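For illustration, here is how the empirical Bernstein confidence bound above translates into an arm index (in the spirit of a UCB-V-style policy). The function name and the constant choices follow the bound on the slide; the exact tuning in the cited papers may differ.

```python
import math

def empirical_bernstein_index(rewards, t):
    """Upper confidence index from the empirical Bernstein bound with confidence 1/t.

    `rewards` are the observations collected so far for one arm, t is the current round."""
    m = len(rewards)
    mean = sum(rewards) / m
    var = sum((x - mean) ** 2 for x in rewards) / m   # empirical variance
    log_t = math.log(t)
    return mean + math.sqrt(2.0 * log_t * var / m) + 8.0 * log_t / (3.0 * m)
```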
Tuning the exploration: simple vs difficult bandit problems
- UCB1(ρ) policy: at time t, play
\[
I_t \in \arg\max_{i\in\{1,\dots,K\}} \Big\{ \hat\mu_{i,t-1} + \sqrt{\frac{\rho\log t}{T_i(t-1)}} \Big\}
\]

[Figures: expected regret of UCB1(ρ) as a function of the exploration parameter ρ ∈ [0, 2], for n = 1000 and K = 2 arms. Left: Dirac(0.6) and Ber(0.5). Right: Ber(0.6) and Ber(0.5).]
Tuning the exploration parameter: from theory to practice
- Theory:
  - for ρ < 0.5, UCB1(ρ) has polynomial regret
  - for ρ > 0.5, UCB1(ρ) has logarithmic regret
- Practice: ρ = 0.2 seems to be the best default value for $n < 10^8$

[Figures: expected regret of UCB1(ρ) as a function of ρ, for n = 1000. Left: K = 3 arms, Ber(0.6), Ber(0.5), Ber(0.5). Right: K = 5 arms, Ber(0.7), Ber(0.6), Ber(0.5), Ber(0.4), Ber(0.3).]
Deviations of UCB1 regret
- UCB1 policy: at time t, play
\[
I_t \in \arg\max_{i\in\{1,\dots,K\}}\Big\{ \hat\mu_{i,t-1} + \sqrt{\frac{2\log t}{T_i(t-1)}}\Big\}
\]
- An inequality of the form $\mathbb{P}(\hat R_n > \mathbb{E}\hat R_n + \gamma) \leq c\, e^{-c\gamma}$ does not hold!
- If the smallest reward observable from the optimal arm is smaller than the mean reward of the second best arm, then the regret of UCB1 satisfies: for any C > 0, there exists C' > 0 such that for any n ≥ 2,
\[
\mathbb{P}\big(\hat R_n > \mathbb{E}\hat R_n + C\log n\big) > \frac{1}{C'(\log n)^{C'}}
\]
(Audibert, Munos, Szepesvári, 2009)
Anytime UCB policies have a heavy-tailed regret

- For some difficult bandit problems, the regret of UCB1 satisfies: for any C > 0, there exists C' > 0 such that for any n ≥ 2,
\[
\mathbb{P}\big(\hat R_n > \mathbb{E}\hat R_n + C\log n\big) > \frac{1}{C'(\log n)^{C'}} \qquad (A)
\]
- UCB-Horizon policy: at time t, play
\[
I_t \in \arg\max_{i\in\{1,\dots,K\}}\Big\{ \hat\mu_{i,t-1} + \sqrt{\frac{2\log n}{T_i(t-1)}}\Big\}
\]
(Audibert, Munos, Szepesvári, 2009)
- UCB-H satisfies $\mathbb{P}(\hat R_n > \mathbb{E}\hat R_n + C\log n) \leq \frac{C}{n}$ for some C
- (A) is unavoidable for anytime policies (Salomon, Audibert, 2011); a short sketch of the UCB-H index follows
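The only change with respect to UCB1 is the exploration term, which uses the known horizon n instead of the current time t. A minimal sketch of the index, with illustrative argument names:

```python
import math

def ucb_h_index(mean_i, count_i, horizon_n):
    """UCB-Horizon index: same as UCB1 but with log(n) (known horizon) in place of log(t)."""
    return mean_i + math.sqrt(2.0 * math.log(horizon_n) / count_i)
```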
Comparison of UCB1 (solid lines) and UCB-H (dotted lines)

[Figure]
Comparison of UCB1(ρ) and UCB-H(ρ) in expectation
[Figures: expected regret of GCL*, UCB1(ρ) and UCB-H(ρ) as a function of the exploration parameter ρ, for n = 1000 and K = 2 arms. Left: Dirac(0.6) vs Bernoulli(0.5). Right: Bernoulli(0.6) vs Dirac(0.5).]
Comparison of UCB1(ρ) and UCB-H(ρ) in deviations
- For n = 1000 and K = 2 arms: Bernoulli(0.6) and Dirac(0.5)

[Figures comparing GCL*, UCB1(0.2), UCB-H(0.2), UCB1(0.5) and UCB-H(0.5). Left: smoothed probability mass function of the regret. Right: tail distribution of the regret $\mathbb{P}(\hat R_n > r)$.]
Knowing the horizon: theory and practice
- Theory: use UCB-H to avoid heavy tails of the regret
- Practice: theory is right. Besides, thanks to this robustness, UCB-H(ρ) consistently achieves a smaller expected regret than UCB1(ρ), but the gain is small.
Knowing µ∗
- Hoeffding-based GCL* policy: play each arm once, then play
\[
I_t \in \arg\min_{i\in\{1,\dots,K\}} T_i(t-1)\,\big(\mu^* - \hat\mu_{i,t-1}\big)_+^2
\]
- Underlying ideas:
  - compare the p-values of the K tests $H_0 = \{\mu_i = \mu^*\}$, $i \in \{1,\dots,K\}$
  - the p-values are estimated using Hoeffding's inequality:
\[
\mathbb{P}_{H_0}\big(\hat\mu_{i,t-1} \leq \hat\mu^{(\mathrm{obs})}_{i,t-1}\big) \lessapprox \exp\Big(-2\,T_i(t-1)\,\big(\mu^* - \hat\mu^{(\mathrm{obs})}_{i,t-1}\big)_+^2\Big)
\]
  - play the arm for which we have the Greatest Confidence Level that it is the optimal arm
- Advantages:
  - logarithmic expected regret
  - anytime policy
  - regret with a subexponential right tail
  - parameter-free policy!
  - outperforms any other Hoeffding-based algorithm!

(a minimal sketch of the index follows)
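A minimal sketch of the Hoeffding-based GCL* selection rule above, assuming every arm has already been played once. Function and argument names are illustrative, not from the tutorial.

```python
def hoeffding_gcl_star_arm(means, counts, mu_star):
    """Play the arm whose test H0 = {mu_i = mu*} has the largest estimated p-value,
    i.e. the smallest exponent T_i * (mu* - mu_hat_i)_+^2."""
    def exponent(i):
        gap = max(mu_star - means[i], 0.0)   # (mu* - mu_hat_i)_+
        return counts[i] * gap ** 2
    return min(range(len(means)), key=exponent)
```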
From Chernoff’s inequality to KL-based algorithms
- Let $K(p, q)$ be the Kullback-Leibler divergence between Bernoulli distributions of respective parameters p and q
- Let $X_1, \dots, X_T$ be i.i.d. r.v. with mean µ, taking their values in [0, 1]. Let $\bar X = \frac{1}{T}\sum_{i=1}^T X_i$. For any γ > 0,
\[
\mathbb{P}\big(\bar X \leq \mu - \gamma\big) \leq \exp\big(-T\,K(\mu-\gamma, \mu)\big).
\]
In particular, we have
\[
\mathbb{P}\big(\bar X \leq \bar X^{(\mathrm{obs})}\big) \leq \exp\Big(-T\,K\big(\min(\bar X^{(\mathrm{obs})}, \mu), \mu\big)\Big).
\]
- If µ* is known, using the same idea of comparing the p-values of the tests $H_0 = \{\mu_i = \mu^*\}$, $i\in\{1,\dots,K\}$, we get the Chernoff-based GCL* policy: play each arm once, then play
\[
I_t \in \arg\min_{i\in\{1,\dots,K\}} T_i(t-1)\,K\big(\min(\hat\mu_{i,t-1}, \mu^*), \mu^*\big)
\]

(a short sketch of this index follows)
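The Chernoff-based GCL* rule above only needs the Bernoulli KL divergence. A minimal sketch, with illustrative names and a small clipping constant to keep the logarithms finite:

```python
import math

def bernoulli_kl(p, q, eps=1e-12):
    """KL divergence K(p, q) between Bernoulli(p) and Bernoulli(q)."""
    p = min(max(p, eps), 1 - eps)
    q = min(max(q, eps), 1 - eps)
    return p * math.log(p / q) + (1 - p) * math.log((1 - p) / (1 - q))

def chernoff_gcl_star_arm(means, counts, mu_star):
    """Play the arm minimizing T_i * K(min(mu_hat_i, mu*), mu*); mu* is assumed known."""
    def exponent(i):
        return counts[i] * bernoulli_kl(min(means[i], mu_star), mu_star)
    return min(range(len(means)), key=exponent)
```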
Back to unknown µ∗
- When µ* is unknown, the principle
  "play the arm for which we have the greatest confidence level that it is the optimal arm"
  is replaced by being optimistic in the face of uncertainty:
  - an arm i is represented by the highest mean of a distribution ν for which the hypothesis $H_0 = \{\nu_i = \nu\}$ has a p-value greater than $\frac{1}{t^\beta}$ (critical β = 1, as usual)
  - the arm with the highest index (= UCB) is played
KL-based algorithms when µ∗ is unknown
- Approximating the p-value using Sanov's theorem is tightly linked to the DMED policy, which satisfies
\[
\limsup_{n\to+\infty} \frac{R_n}{\log n} \;\leq\; \sum_{i:\Delta_i>0} \frac{\Delta_i}{\inf_{\nu:\,\mathbb{E}_{X\sim\nu} X \geq \mu^*} K(\nu_i, \nu)}
\]
for distributions with finite support (included in [0, 1]) (Burnetas & Katehakis, 1996; Honda & Takemura, 2010). It matches the lower bound
\[
\liminf_{n\to+\infty} \frac{R_n}{\log n} \;\geq\; \sum_{i:\Delta_i>0} \frac{\Delta_i}{\inf_{\nu:\,\mathbb{E}_{X\sim\nu} X \geq \mu^*} K(\nu_i, \nu)}
\]
(Lai & Robbins, 1985; Burnetas & Katehakis, 1996)
- Approximating the p-value using a non-asymptotic version of Sanov's theorem leads to KL-UCB (Cappé & Garivier, COLT 2011) and the K-strategy (Maillard, Munos, Stoltz, COLT 2011); a sketch of a KL-UCB-style index is given below
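For Bernoulli rewards, a KL-based upper confidence index can be computed by inverting the KL divergence with bisection. This is a hedged sketch of the idea behind KL-UCB, not the exact index of the cited papers (which may include additional terms such as log log t):

```python
import math

def bernoulli_kl(p, q, eps=1e-12):
    p = min(max(p, eps), 1 - eps)
    q = min(max(q, eps), 1 - eps)
    return p * math.log(p / q) + (1 - p) * math.log((1 - p) / (1 - q))

def kl_ucb_index(mean_i, count_i, t, n_iter=30):
    """Largest q >= mu_hat_i with T_i * K(mu_hat_i, q) <= log(t), found by bisection."""
    target = math.log(t) / count_i
    lo, hi = mean_i, 1.0
    for _ in range(n_iter):
        mid = (lo + hi) / 2.0
        if bernoulli_kl(mean_i, mid) <= target:
            lo = mid        # mid is still inside the confidence region, go higher
        else:
            hi = mid
    return lo
```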
Outline
- Bandit problems and applications
- Bandits with small set of actions
  - Stochastic setting
  - Adversarial setting
- Bandits with large set of actions
  - unstructured set
  - structured set: linear bandits, Lipschitz bandits, tree bandits
- Extensions
Adversarial bandit
Parameters: the number of arms K and the number of rounds n
For each round t = 1, 2, . . . , n
1. the forecaster chooses an arm $I_t \in \{1,\dots,K\}$, possibly with the help of an external randomization
2. the adversary chooses a gain vector $g_t = (g_{1,t},\dots,g_{K,t}) \in [0,1]^K$
3. the forecaster receives and observes only the gain $g_{I_t,t}$

Goal: maximize the cumulative gains obtained. We consider the regret
\[
R_n = \Big(\max_{i=1,\dots,K} \mathbb{E}\sum_{t=1}^n g_{i,t}\Big) - \mathbb{E}\sum_{t=1}^n g_{I_t,t}.
\]

- In full information, step 3 is replaced by: the forecaster receives $g_{I_t,t}$ and observes the full gain vector $g_t$
- In both settings, the forecaster should use an external randomization to achieve o(n) regret.
Adversarial setting in full information: an optimal policy
- Cumulative reward on [1, t−1]: $G_{i,t-1} = \sum_{s=1}^{t-1} g_{i,s}$
- Follow-the-leader, $I_t \in \arg\max_{i\in\{1,\dots,K\}} G_{i,t-1}$, is a bad policy
- An "optimal" policy is obtained by considering
\[
p_{i,t} = \mathbb{P}(I_t = i) = \frac{e^{\eta G_{i,t-1}}}{\sum_{k=1}^K e^{\eta G_{k,t-1}}}
\]
- For this policy, $R_n \leq \frac{n\eta}{8} + \frac{\log K}{\eta}$
- In particular, for $\eta = \sqrt{\frac{8\log K}{n}}$, we have $R_n \leq \sqrt{\frac{n\log K}{2}}$ (a minimal sketch follows the references below)
(Littlestone, Warmuth, 1994; Long, 1996; Bylanger, 1997; Cesa-Bianchi, 1999)
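A minimal sketch of this exponentially weighted forecaster in the full-information setting. The function name and the `gain_vectors` input (the sequence of full gain vectors revealed after each round) are illustrative assumptions:

```python
import math
import random

def exponential_weights_full_info(gain_vectors, eta):
    """At round t, draw arm i with probability proportional to exp(eta * G_{i,t-1})."""
    K = len(gain_vectors[0])
    G = [0.0] * K                    # cumulative gains G_{i,t-1}
    total = 0.0
    for g_t in gain_vectors:
        m = max(G)                   # shift for numerical stability
        w = [math.exp(eta * (G_i - m)) for G_i in G]
        Z = sum(w)
        p = [w_i / Z for w_i in w]
        arm = random.choices(range(K), weights=p)[0]
        total += g_t[arm]
        for i in range(K):           # full information: every g_{i,t} is observed
            G[i] += g_t[i]
    return total
```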
Proof of the regret bound
With
\[
p_{i,t} = \mathbb{P}(I_t = i) = \frac{e^{\eta G_{i,t-1}}}{\sum_{k=1}^K e^{\eta G_{k,t-1}}},
\]
we have
\begin{align*}
\mathbb{E}\sum_t g_{I_t,t}
&= \mathbb{E}\sum_t \sum_i p_{i,t}\, g_{i,t} \\
&= \mathbb{E}\sum_t \Big( -\tfrac{1}{\eta}\log \sum_i p_{i,t}\, e^{\eta\left(g_{i,t} - \sum_j p_{j,t} g_{j,t}\right)} + \tfrac{1}{\eta}\log \sum_i p_{i,t}\, e^{\eta g_{i,t}} \Big) \\
&= \mathbb{E}\sum_t \Big( -\tfrac{1}{\eta}\log \mathbb{E}\, e^{\eta (V_t - \mathbb{E}V_t)} + \tfrac{1}{\eta}\log \frac{\sum_i e^{\eta G_{i,t}}}{\sum_i e^{\eta G_{i,t-1}}}\Big)
\qquad \text{where } \mathbb{P}(V_t = g_{i,t}) = p_{i,t} \\
&\geq \mathbb{E}\Big(-\sum_t \frac{\eta}{8}\Big) + \frac{1}{\eta}\,\mathbb{E}\log \frac{\sum_j e^{\eta G_{j,n}}}{\sum_j e^{\eta G_{j,0}}} \\
&\geq -\frac{n\eta}{8} + \frac{1}{\eta}\,\mathbb{E}\log \frac{e^{\eta \max_j G_{j,n}}}{K}
= -\frac{n\eta}{8} - \frac{\log K}{\eta} + \mathbb{E}\max_j G_{j,n}
\end{align*}
The first inequality uses Hoeffding's lemma, $\log \mathbb{E}\, e^{\eta(V_t-\mathbb{E}V_t)} \leq \eta^2/8$ for $V_t \in [0,1]$, and the telescoping of the ratios $\sum_i e^{\eta G_{i,t}} / \sum_i e^{\eta G_{i,t-1}}$.
Adapting the exponentially weighted forecaster

- In the bandit setting, the cumulative gains $G_{i,t-1}$, i = 1, ..., K, are not observed. Trick: estimate them.
- Precisely, $G_{i,t-1}$ is estimated by $\tilde G_{i,t-1} = \sum_{s=1}^{t-1} \tilde g_{i,s}$ with
\[
\tilde g_{i,s} = 1 - \frac{1 - g_{i,s}}{p_{i,s}}\,\mathbf{1}_{I_s = i}.
\]
Note that $\mathbb{E}_{I_s\sim p_s}\, \tilde g_{i,s} = 1 - \sum_{k=1}^K p_{k,s}\,\frac{1-g_{i,s}}{p_{i,s}}\,\mathbf{1}_{k=i} = g_{i,s}$.
- The policy then plays
\[
p_{i,t} = \mathbb{P}(I_t = i) = \frac{e^{\eta \tilde G_{i,t-1}}}{\sum_{k=1}^K e^{\eta \tilde G_{k,t-1}}}
\]
- For this policy, $R_n \leq \frac{nK\eta}{2} + \frac{\log K}{\eta}$
- In particular, for $\eta = \sqrt{\frac{2\log K}{nK}}$, we have $R_n \leq \sqrt{2nK\log K}$ (a minimal sketch follows the reference below)
(Auer, Cesa-Bianchi, Freund, Schapire, 1995)
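A minimal sketch of this estimated-gain forecaster (in the spirit of EXP3). The `draw_gains` callback standing in for the adversary is an illustrative assumption; only the played arm's gain is used, and the unobserved gains are replaced by the unbiased estimates from the slide above.

```python
import math
import random

def exp_weights_bandit(n_rounds, n_arms, draw_gains, eta):
    """Exponentially weighted forecaster on estimated cumulative gains."""
    G = [0.0] * n_arms                    # estimated cumulative gains G_tilde
    total = 0.0
    for t in range(1, n_rounds + 1):
        m = max(G)                        # shift for numerical stability
        w = [math.exp(eta * (G_i - m)) for G_i in G]
        Z = sum(w)
        p = [w_i / Z for w_i in w]
        arm = random.choices(range(n_arms), weights=p)[0]
        reward = draw_gains(t)[arm]       # only g_{I_t,t} is revealed
        total += reward
        # g_tilde_{i,t} = 1 - (1 - g_{i,t}) / p_{i,t} for the played arm, and 1 otherwise
        for i in range(n_arms):
            G[i] += 1.0 - ((1.0 - reward) / p[arm] if i == arm else 0.0)
    return total
```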
Implicitly Normalized Forecaster (Audibert, Bubeck, 2010)
Let $\psi : \mathbb{R}^*_- \to \mathbb{R}^*_+$ be increasing, convex, twice continuously differentiable, and such that $[\tfrac{1}{K}, 1] \subset \psi(\mathbb{R}^*_-)$.
Let $p_1$ be the uniform distribution over $\{1, \dots, K\}$.

For each round t = 1, 2, ...:
- draw $I_t \sim p_t$
- compute $p_{t+1} = (p_{1,t+1}, \dots, p_{K,t+1})$, where
\[
p_{i,t+1} = \psi\big(\tilde G_{i,t} - C_t\big)
\]
and $C_t$ is the unique real number such that $\sum_{i=1}^K p_{i,t+1} = 1$ (a minimal sketch follows)
Minimax policy
- $\psi(x) = \exp(\eta x)$ with η > 0: this corresponds exactly to the exponentially weighted forecaster
- $\psi(x) = (-\eta x)^{-q}$ with q > 1 and η > 0: this is a new policy which is minimax optimal: for q = 2 and $\eta = \sqrt{2n}$, we have
\[
R_n \leq 2\sqrt{2nK}
\]
(Audibert, Bubeck, 2010; Audibert, Bubeck, Lugosi, 2011), while for any strategy, we have
\[
\sup R_n \geq \frac{1}{20}\sqrt{nK}
\]
(Auer, Cesa-Bianchi, Freund, Schapire, 1995)
Outline
- Bandit problems and applications
- Bandits with small set of actions
  - Stochastic setting
  - Adversarial setting
- Bandits with large set of actions
  - unstructured set
  - structured set: linear bandits, Lipschitz bandits, tree bandits
- Extensions