Bandits: Part I
Stochastic, Finite-Armed Bandits
Csaba Szepesvari
Department of Computing Science & AICML, University of Alberta (& DeepMind since August)
August 31, 2017
Summer School @ DS3
1 / 108
Outline

1. Overview of Talks
2. Introduction: Learning Objectives; Brief History; Why Should We Care?; Exploration vs. Exploitation; Applications
3. Finite-Armed Stochastic Bandits: Bandits; Stochastic Bandits; Basic Properties of the Regret
4. Measure Concentration
Outline (continued)

5. Explore-then-Commit (ETC): Algorithm; Regret Upper Bound; Tuning ETC; Exercise/Illustration
6. Upper Confidence Bound (UCB): Optimism Principle; The UCB Algorithm; Regret Upper Bounds for UCB; Empirical Illustration; Asymptopia; Zoo of UCBs and Risk Management
7. Summary
8. Further Reading
Overview of Talks
• Talk 1: Basics
  • What, why, applications
  • Bandits: problem definition
  • Stochastic finite-armed bandits: basics
  • Measure concentration; subgaussianity
  • Explore-then-commit
  • UCB, optimism, optimality
• Talk 2: Adversarial and linear bandits
  • Adversarial finite-armed bandits
  • Contextual bandits
  • Exp4
  • Stochastic linear bandits
  • Adversarial linear bandits
• Talk 3: Bandits in the wild
Learning Material
• This lecture (and more): http://banditalgs.com
• Joint effort with Tor Lattimore
• Book to be published by early next year: stay tuned!
• Tor's lightweight C++ bandit library
• I will share some exercises later
• Sebastien Bubeck's tutorial: blog posts 1 and 2
• Bubeck and Cesa-Bianchi's book (Bubeck and Cesa-Bianchi, 2012)
banditalgs.com
Learning Objectives: Knowledge
The goal is to gain knowledge about:
• Bandit problems: What are they?
• Types of bandit problems: How do they differ?
• Key ideas: Explore vs. exploit; why significant? What to do?
• Basic solution techniques
• Core results: How far did we get? How far can we go?
• Peek into contemporary research
Learning Objectives: Skills
Skills to be acquired. Ability to . . .
• . . . recognize bandit problems;
• . . . recognize types/variants of bandits;
• . . . write code for bandits (algorithms, environment, . . . );
• . . . recognize insurmountable tradeoffs; limits of what is possible;
• . . . get around in the literature.
Origin of the Name
Mouse learning in a T-maze. 1953: Frederick Mosteller and Robert Bush, psychologists: "A Stochastic Model with Applications to Learning".

Generalization to humans: "two-armed bandits".
Classics
• First paper on bandits: Thompson (1933)
• Major effort by Herbert Robbins in the 50s; earliest is (Robbins, 1952)
• Another pioneer: Herman Chernoff, e.g., (Bather and Chernoff, 1967)
• Breakthrough in Bayesian bandits (computation): (Gittins, 1979)
• A breakthrough in stochastic bandits: (Lai and Robbins, 1985), asymptotic regret-optimality in a frequentist setting
• UCB as we know it: (Auer et al., 2002), finite-time optimality
• Exp3, adversarial bandits: (Auer et al., 1995)
• Early books (sequential design, Bayesian setting): (Chernoff, 1959; Berry and Fristedt, 1985; Gittins et al., 2011)
Outline
1 Overview of Talks
2 IntroductionLearning ObjectivesBrief HistoryWhy Should We Care?Exploration vs. ExploitationApplications
3 Finite-Armed Stochastic BanditsBanditsStochastic BanditsBasic Properties of the Regret
4 Measure Concentration
12 / 108
The Number Game
Number of papers on bandits published in 5-year periods (present = 2016), as reported by Google Scholar.
Why Do People Care?
• Decision making under uncertainty is a significant challenge
• Bandits capture a key aspect of this challenge: the exploration-exploitation dilemma

Some examples:
• Which drugs should a patient receive?
• How should I allocate my study time between courses?
• Which version of a website will generate the most revenue?
• What move should be considered next when playing chess/go?
• ...
The Exploration-Exploitation Dilemma

Payoffs after 5 pulls of each arm:
Left arm: 0, 10, 0, 0, 10
Right arm: 10, 0, 0, 0, 0

Left arm average payoff: 4 dollars per round; right arm average payoff: 2 dollars per round. Budget: 20 more trials (pulls).

• Shall we pull left only? "Exploit"?
• Shall we pull the right arm at all? "Explore"?
• Why?
• How many times to pull each arm?
• "Good luck/bad luck?"

Play time!
Applications, applications...

1. A/B testing
2. Drug testing
3. Advert placement
4. Network routing (packets, planes, cars)
5. Tree search (MCTS)
6. Recommendation services (for example, news or movies)
7. Ranking (for example, search)
8. Educational games
9. Resource allocation (memory, bandwidth, manufacturing space)
10. Waiting problems (hard-disk shutdown, auto logout, waiting for a bus)
11. Dynamic pricing (for example, on Amazon)
12. A core component of RL
Questions?
Bandits: Interaction Protocol and Goal

For rounds t = 1, 2, ..., n:

1. The learner chooses an action A_t from a set A of available actions. The chosen action is sent to the environment.
2. The environment generates a response in the form of a real-valued reward X_t ∈ R, which is sent back to the learner.

The goal of the learner is to maximize the sum of rewards it receives, ∑_{t=1}^n X_t.
Bandits: Interaction Protocol and Goal: Concepts

• History at the end of round t: H_t = (A_1, X_1, ..., A_t, X_t).
• The learner can base its action A_{t+1} in round t + 1 on the history.
• The learner "uses" a "policy": a map from all possible histories to actions.
• The learner is also allowed to randomize.
Bandits and Regret

Definition (Regret – informal definition)
The regret of a learner relative to action a = (the total reward gained when a is used for all n rounds) − (the total reward gained by the learner in n rounds with its chosen actions).

• Maximize reward ⇔ minimize regret.
• Advantage? The scale is normalized so that zero has a special meaning (it kills meaningless shifts).

Questions:
• What does it mean that a learner has positive regret?
• (What does it mean that a learner has positive reward?)
• What does it mean that a learner has zero regret?
Regret – II.

Definition (Zero-regret learner)
A "zero-regret learner": if R_n is the regret at time n, then R_n/n → 0; a.k.a. "vanishing regret" or sublinear regret; R_n = o(n).

In general we care about how fast R_n/n → 0, i.e., how slowly R_n grows.

Examples:
• R_n = O(√n) (so R_n/n = O(1/√n)).
• R_n = O(log(n)) (so R_n/n = O(log(n)/n)).
Discussion

• Why compare with fixed actions? Is this limiting? What does this capture?
• Stationarity.
• Does the policy know the "horizon"? Is it "anytime"?
• Good learner: "small" regret (small worst-case regret!?) over a large class of environments.
• Instance-dependent regret.
• Regret lower and upper bounds: "worst-case" vs. instance-dependent.
Types of Environments
• Stochastic
• Adversarial (“unconstrained”)
• Unstructured vs. structured payoff
• Contextual
• . . .
Stochastic Bandits

Definition
A K-armed stochastic bandit environment is a tuple of distributions ν = (P_1, P_2, ..., P_K), where P_i is a distribution over the reals for each i ∈ [K] := {1, 2, ..., K}.

Interaction
For rounds t = 1, 2, 3, ...

1. Based on its past observations (if any), the learner chooses an action A_t ∈ [K] following some policy π: A_t ∼ π(·|A_1, X_1, ..., A_{t−1}, X_{t−1}). The chosen action is sent to the environment.
2. The environment generates a random reward X_t whose distribution is P_{A_t} (in notation: X_t ∼ P_{A_t}). The generated reward is sent back to the learner.
A note

• In the environment, no joint distribution (over all arms) is specified. Why?
• The learner sees only X_t, not the rewards of the other arms. Even if those exist, they are latent; the learner cannot access them, hence the properties of the joint (even if it were specified) do not matter.
• Precise answer: all information about an interconnected environment-policy pair (ν, π) is in the joint distribution of action-reward sequences.
Typical Environments

Name                     Symbol         Definition
Bernoulli                E_B^K          {(B(μ_i))_i : μ ∈ [0, 1]^K}
Uniform                  E_U^K          {(U(a_i, b_i))_i : a, b ∈ R^K, a ≤ b}
Gaussian (known var.)    E_N^K(σ²)      {(N(μ_i, σ²))_i : μ ∈ R^K}
Gaussian (unknown var.)  E_N^K          {(N(μ_i, σ_i²))_i : μ ∈ R^K, σ² ∈ R_+^K}
Finite variance          E_V^K(σ²)      {(P_i)_i : V_{X∼P_i}[X] ≤ σ² for all i}
Finite kurtosis          E_Kurt^K(κ)    {(P_i)_i : Kurt_{X∼P_i}[X] ≤ κ for all i}
Bounded support          E_{[a,b]}^K    {(P_i)_i : Supp(P_i) ⊆ [a, b]}
Subgaussian              E_SG^K(σ²)     {(P_i)_i : P_i is σ-subgaussian for all i}

Supp(P) is the support of distribution P; Kurt(X) = E[(X − E[X])⁴] / V[X]².

Table: Typical environment classes for stochastic bandits
Expected Reward/Regret

• S_n = ∑_{t=1}^n X_t: the total reward. Random!
• Possible goal: maximize E[S_n], the expected reward.
• OK?
• Same as minimizing R_n, the (expected) regret, where

  R_n := nμ* − E[S_n],

  with μ* = max_{i∈[K]} μ_i and μ_i = ∫ x P_i(dx).
Basic Properties of Regret – I.

• Let R_n(π, ν) be the (expected!) regret of policy π on environment ν.

Lemma (Warmup – Exercise 0)
(a) R_n(π, ν) ≥ 0 for all policies π.
(b) The policy π choosing A_t ∈ argmax_i μ_i for all t satisfies R_n(π, ν) = 0.
(c) If R_n(π, ν) = 0 for some policy π, then for all t, A_t ∈ [K] is optimal with probability one: P(μ_{A_t} = μ*) = 1.
Regret Decomposition – Important!!!

• Let ν = (P_1, ..., P_K) be a bandit environment.
• Let Δ_i(ν) = μ*(ν) − μ_i(ν): the sub-optimality gap (action gap, immediate regret) of action i.
• Usage count of action i:

  T_i(t) = ∑_{s=1}^t I{A_s=i}.

• Note: Δ_i := Δ_i(ν) is non-random, while T_i(t) is random! (Why?)

Lemma (Regret Decomposition Lemma)
For any policy π, K-armed stochastic bandit environment ν and horizon n ∈ N, the regret R_n of policy π in ν satisfies

R_n = ∑_{i=1}^K Δ_i E[T_i(n)].
Proof of the Regret Decomposition Lemma

Proof. For any fixed t we have ∑_k I{A_t=k} = 1. Hence S_n = ∑_t X_t = ∑_t ∑_k X_t I{A_t=k}, and thus

R_n = nμ* − E[S_n] = ∑_{k=1}^K ∑_{t=1}^n E[(μ* − X_t) I{A_t=k}].

Now, knowing A_t, the expected reward is μ_{A_t}. Thus we have

E[(μ* − X_t) I{A_t=k} | A_t] = I{A_t=k} E[μ* − X_t | A_t]
                             = I{A_t=k} (μ* − μ_{A_t})
                             = I{A_t=k} (μ* − μ_k).

Take expectations and sum both sides over t = 1, ..., n.
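A closed-form sanity check of the Regret Decomposition Lemma can be done with a (hypothetical) uniformly-random policy, for which E[T_i(n)] = n/K exactly, so both sides of the identity are computable without simulation; the arm means below are illustrative values, not from the slides.

```python
# Check R_n = sum_i Delta_i * E[T_i(n)] for the uniform-random policy,
# where E[T_i(n)] = n/K exactly for every arm i.
means = [0.5, 0.6, 0.3]            # illustrative arm means
n = 1000
K = len(means)
mu_star = max(means)

# Left-hand side: R_n = n*mu_star - E[S_n]
lhs = n * mu_star - n * sum(means) / K
# Right-hand side: sum_i Delta_i * E[T_i(n)]
rhs = sum((mu_star - mu) * (n / K) for mu in means)

assert abs(lhs - rhs) < 1e-9
```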
Questions?
Exercise 1: (get code)
Implement a Bernoulli bandit environment in Python using the code snippet below:

class BernoulliBandit:
    # Accepts a list of K >= 2 floats, each lying in [0, 1]
    def __init__(self, means):
        pass

    # Should return the number of arms
    def K(self):
        pass

    # Accepts a parameter 0 <= a <= K-1 and returns the
    # realisation of a random variable X with P(X = 1) being
    # the mean of the (a+1)th arm
    def pull(self, a):
        pass

    # Returns the regret incurred so far
    def regret(self):
        pass
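For concreteness, here is one way the skeleton above might be filled in; the pseudo-regret bookkeeping (summing the true gaps of the pulled arms in `regret()`) is an assumption about what the exercise intends, not something the slide specifies.

```python
import random

class BernoulliBandit:
    # Accepts a list of K >= 2 floats, each lying in [0, 1]
    def __init__(self, means):
        assert len(means) >= 2 and all(0.0 <= m <= 1.0 for m in means)
        self.means = list(means)
        self.best = max(self.means)
        self._regret = 0.0

    # Returns the number of arms
    def K(self):
        return len(self.means)

    # Returns a Bernoulli(means[a]) reward and records the gap of arm a
    def pull(self, a):
        self._regret += self.best - self.means[a]
        return 1 if random.random() < self.means[a] else 0

    # Returns the (pseudo-)regret incurred so far
    def regret(self):
        return self._regret

bandit = BernoulliBandit([0.4, 0.6])
assert bandit.K() == 2
for _ in range(10):
    assert bandit.pull(0) in (0, 1)
assert abs(bandit.regret() - 10 * 0.2) < 1e-9
```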
Exercise 2: Follow-the-Leader (get code)

Implement the following simple algorithm, called 'Follow-the-Leader' (FTL), which chooses each action once and subsequently chooses the action with the largest average observed so far. Ties should be broken randomly.

def FollowTheLeader(bandit, n):
    # Implement the Follow-the-Leader algorithm by replacing
    # the code below, which just plays the first arm in every round
    for t in range(n):
        bandit.pull(0)

Note: depending on the literature you are reading, Follow-the-Leader may be called 'stay with the winner' or the 'greedy algorithm'.
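A possible FTL implementation against the Exercise 1 interface (`K()`, `pull(a)`), with ties broken uniformly at random as the exercise asks; the tiny noiseless bandit below is only a stand-in used to exercise the function, not the graded environment.

```python
import random

def follow_the_leader(bandit, n):
    K = bandit.K()
    counts = [0] * K
    sums = [0.0] * K
    for t in range(n):
        if t < K:
            a = t                      # play each arm once first
        else:
            means = [sums[i] / counts[i] for i in range(K)]
            best = max(means)
            # break ties uniformly at random
            a = random.choice([i for i in range(K) if means[i] == best])
        x = bandit.pull(a)
        counts[a] += 1
        sums[a] += x
    return counts

class _FixedBandit:
    """Stand-in bandit whose arms pay their means deterministically."""
    def __init__(self, means):
        self.means = means
    def K(self):
        return len(self.means)
    def pull(self, a):
        return self.means[a]

counts = follow_the_leader(_FixedBandit([0.2, 0.8]), 100)
# With noiseless rewards, FTL locks onto arm 1 after one pull of each.
assert counts == [1, 99]
```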
Exercise 3: Distribution of (random) regret

Consider a Bernoulli bandit with two arms and means μ_1 = 0.5 and μ_2 = 0.6.

(a) Using a horizon of n = 100, run 1000 simulations of your implementation of Follow-the-Leader on the Bernoulli bandit above and record the (random) regret, nμ* − S_n, in each simulation.
(b) Plot the results using a histogram (see the figure).
(c) Explain the results in the figure.

Figure: Histogram of regret for FTL over 1000 trials on a Bernoulli bandit with means μ_1 = 0.5, μ_2 = 0.6.
Exercise 4: Regret Over Time

Consider the same Bernoulli bandit as in the previous question.

(a) Run 1000 simulations of your implementation of FTL for each horizon n ∈ {100, 200, 300, ..., 1000}.
(b) Plot the average regret obtained as a function of n (see the figure). Include error bars.
(c) Explain the plot. Do you think FTL is a good algorithm? Why/why not?

Figure: Average regret of FTL over 1000 trials as a function of the horizon n.
Statistics 101

• X_1, ..., X_n independent, identically distributed (∼ P), real-valued random variables (think rewards).
• What is μ := E[X_1] (= E[X_t])?
• Estimate: μ̂ = (1/n) ∑_{t=1}^n X_t.
• A "statistic" of the data: any function of X_1, ..., X_n.
• Notice that μ̂ is random, but μ̂ ≈ μ.
• The distribution of μ̂ depends on P.
• "How close?": characterize the distribution of |μ̂ − μ|.
• A priori characterization: the characterization depends on P, where all we know is that P ∈ P.
• A posteriori characterization: the characterization depends on X_1, ..., X_n.
Tail probabilities

Imagine X := μ̂. We care about the probability masses P(μ̂ < μ − ε) and P(μ̂ > μ + ε).

Either, given ε, give lower or upper bounds on these ("probability bound"), or, for a given probability mass, give upper and/or lower bounds on ε ("deviation bound").
Markov and Chebyshev

Lemma
For any random variable X with finite mean and ε > 0 it holds that:

(a) (Markov): P(|X| ≥ ε) ≤ E[|X|]/ε.
(b) (Chebyshev): P(|X − E[X]| ≥ ε) ≤ V[X]/ε².

• Exercise 1: Prove Markov. Hint: prove it for nonnegative r.v.s; use E[X] = ∫_0^∞ x P(dx) and split the integral.
• Exercise 2: Prove Chebyshev. Hint: apply Markov.
• Chebyshev applied to μ̂: V[μ̂] = σ²/n. Hence,

  P(|μ̂ − μ| ≥ ε) ≤ σ²/(nε²).

• Note: Chebyshev is more precise. You can further increase precision by applying Markov to |X − E[X]|^k for large k.
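A quick Monte Carlo sanity check of Chebyshev applied to the sample mean; the uniform-on-[0,1] rewards (so σ² = 1/12) are an assumed toy distribution, not part of the slides.

```python
import math
import random

# Empirical frequency of {|mean - mu| >= eps} should stay below the
# Chebyshev bound sigma^2 / (n * eps^2) for uniform [0,1] samples.
random.seed(1)
n, eps, reps = 50, 0.1, 10000
sigma2 = 1.0 / 12.0                # variance of Uniform[0, 1]
hits = 0
for _ in range(reps):
    mean = sum(random.random() for _ in range(n)) / n
    if abs(mean - 0.5) >= eps:
        hits += 1
bound = sigma2 / (n * eps * eps)
assert hits / reps <= bound
```

The empirical frequency is in fact far below the bound, which is the point of the "Chebyshev is loose" remark above.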
Central Limit Theorem (CLT)

Theorem (CLT)
Let X_t be iid, σ² = V[X_1] < ∞, S_n = ∑_{t=1}^n (X_t − μ), Z_n = S_n/√(σ²n). Then F_{Z_n} → F_Z as n → ∞, where Z ∼ N(0, 1).

Note: F_Z(u) = P(Z ≥ u) = ∫_u^∞ (1/√(2π)) exp(−x²/2) dx.

Bounding F_Z(u):

∫_u^∞ (1/√(2π)) exp(−x²/2) dx ≤ (1/(u√(2π))) ∫_u^∞ x exp(−x²/2) dx
                              = √(1/(2πu²)) exp(−u²/2).

Hence

P(μ̂ ≥ μ + ε) = P( S_n/√(σ²n) ≥ ε√(n/σ²) ) ≈ P( Z ≥ ε√(n/σ²) )
             ≤ √(σ²/(2πnε²)) exp(−nε²/(2σ²)).
How good is the CLT?

P(μ̂_n ≥ μ + ε) ⪅ √(σ²/(2πnε²)) exp(−nε²/(2σ²)).

Question: can we safely replace ⪅ with ≤, e.g., when X_t ∼ P and P is supported on [0, 1]?

(a) Yes: when n ≥ 30, the error will be very small for these distributions.
(b) Yes: when n ≥ 1000, the error will be very small for these distributions.
(c) No: no n will make the error uniformly small when P ranges over all distributions with support in [0, 1].
Subgaussianity

Definition (Subgaussianity)
A random variable X is σ-subgaussian if for all λ ∈ R it holds that E[exp(λX)] ≤ exp(λ²σ²/2).

Moment/cumulant generating functions of X, M_X, ψ_X : R → R: M_X(λ) = E[exp(λX)], ψ_X(λ) = log M_X(λ), λ ∈ R.

Lemma
X is σ-subgaussian iff ψ_X(λ) ≤ λ²σ²/2 for all λ ∈ R.

Example: Z ∼ N(0, σ²). Then M_Z(λ) = exp(λ²σ²/2).

Does M_X (or ψ_X) always exist? No: e.g., for X ∼ Exp(1), M_X(λ) = ∞ for λ ≥ 1.
Why the Name?

Theorem
If X is σ-subgaussian, then for any ε ≥ 0,

P(X ≥ ε) ≤ exp(−ε²/(2σ²)). (1)
Proof

Proof. We take a generic approach called the Cramer-Chernoff method. Let λ > 0 be a constant to be tuned later. Then

P(X ≥ ε) = P( exp(λX) ≥ exp(λε) )
         ≤ E[exp(λX)] exp(−λε)        (Markov's inequality)
         ≤ exp( λ²σ²/2 − λε ).         (def. of subgaussianity)

Now, λ was an arbitrary positive constant, and in particular it may be chosen to minimize the bound above, which is achieved by λ = ε/σ².
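A numeric sanity check of (1) in the Gaussian case (a standard normal is 1-subgaussian): the Cramer-Chernoff bound exp(−ε²/2) must dominate the exact tail, which we can compute with the complementary error function.

```python
import math

# Exact standard-normal tail P(X >= eps) = 0.5 * erfc(eps / sqrt(2))
# versus the Chernoff bound exp(-eps^2 / 2) from the proof above.
for eps in [0.5, 1.0, 2.0, 3.0]:
    exact_tail = 0.5 * math.erfc(eps / math.sqrt(2.0))
    chernoff = math.exp(-eps * eps / 2.0)
    assert exact_tail <= chernoff
```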
Variations

Union bound: P(A ∪ B) ≤ P(A) + P(B).

Corollary: P(|X| ≥ ε) ≤ 2 exp(−ε²/(2σ²)).

Equivalent "deviation" forms:

P( X ≥ √(2σ² log(1/δ)) ) ≤ δ,    P( |X| ≥ √(2σ² log(2/δ)) ) ≤ δ,

or, with probability 1 − δ, X ∈ ( −√(2σ² log(2/δ)), √(2σ² log(2/δ)) ).
Algebra with Subgaussian Random Variables

Lemma
Suppose that X is σ-subgaussian, and that X_1 and X_2 are independent and σ_1- and σ_2-subgaussian, respectively. Then:

(a) E[X] = 0 and V[X] ≤ σ².
(b) cX is |c|σ-subgaussian for all c ∈ R.
(c) X_1 + X_2 is √(σ_1² + σ_2²)-subgaussian.

Note
No matter what, X_1 + X_2 is (σ_1 + σ_2)-subgaussian. Independence improves this to √(σ_1² + σ_2²).
Concentration of the Mean

Corollary
Assume that the X_i − μ are independent, σ-subgaussian random variables. Then, for any ε ≥ 0,

P(μ̂ ≥ μ + ε) ≤ exp(−nε²/(2σ²))  and  P(μ̂ ≤ μ − ε) ≤ exp(−nε²/(2σ²)), (2)

where μ̂ = (1/n) ∑_{t=1}^n X_t.

Exercise 5
Using exp(−x) ≤ 1/(ex) (which holds for all x ≥ 0), show that except for very small ε the above inequality is strictly stronger than what we obtained via Chebyshev's inequality, and exponentially smaller (tighter) if nε² is large relative to σ².
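A Monte Carlo check of (2) with standard normal (hence 1-subgaussian) rewards; the sample sizes and seed are arbitrary choices for the demonstration.

```python
import math
import random

# Frequency of {sample mean >= mu + eps} over many repetitions should
# stay below the subgaussian bound exp(-n * eps^2 / 2) (sigma = 1).
random.seed(0)
n, eps, reps = 20, 0.5, 20000
hits = sum(
    1
    for _ in range(reps)
    if sum(random.gauss(0.0, 1.0) for _ in range(n)) / n >= eps
)
bound = math.exp(-n * eps * eps / 2.0)
assert hits / reps <= bound
```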
Deviation Form

Corollary
For any δ ∈ (0, 1], with probability at least 1 − δ,

μ̂ ≤ μ + √(2σ² log(1/δ)/n). (3)

Symmetrically, it also follows that with probability at least 1 − δ,

μ̂ ≥ μ − √(2σ² log(1/δ)/n). (4)
Further Examples

• If X is distributed like a Gaussian with zero mean and variance σ², then X is σ-subgaussian.
• If X is bounded and zero-mean (i.e., E[X] = 0 and |X| ≤ B almost surely for some B ≥ 0), then X is B-subgaussian.
• Specifically, if X is a shifted Bernoulli with P(X = 1 − p) = p and P(X = −p) = 1 − p, then X is 1/2-subgaussian.

Extension of the Definition
• X is σ-subgaussian if the noise X − E[X] is σ-subgaussian.
• A distribution is called σ-subgaussian if a random variable drawn from it is σ-subgaussian.
Bibliography

• Basic reference: Boucheron et al. (2013) (concentration under independence).
• Matrix versions of many standard results: Tropp (2015).
• Survey of classical results: McDiarmid (1998).
• Self-normalization (by standard deviation/variance): Pena et al. (2008).
• Empirical process theory: van de Geer (2000) or Dudley (2014).
Questions?
Standing Assumption (until further notice)

Assumption
All bandit instances are in E_SG^K(1), i.e., the reward distribution of every arm is 1-subgaussian.

Is this restrictive?
1. All the algorithms that follow rely on knowledge of σ.
2. Unequal subgaussianity constants across arms.
Notation

Empirical mean of arm i:

μ̂_i(t) = (1/T_i(t)) ∑_{s=1}^t I{A_s=i} X_s,

where

T_i(t) = ∑_{s=1}^t I{A_s=i}

and A_s ∈ [K] = {1, ..., K} is the index of the arm chosen in round s.
Explore-then-Commit (ETC)

• Explore each of the K arms m times.
• Go with the winner for the remaining rounds.

1: Input m ∈ N.
2: In round t choose action

A_t = i,                   if (t mod K) + 1 = i and t ≤ mK;
A_t = argmax_i μ̂_i(mK),   if t > mK

(ties in the argmax are broken arbitrarily).
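The pseudocode above can be sketched in a few lines of Python; the bandit interface (`K()`, `pull(a)`) mirrors the earlier exercises and is an assumption of this sketch, and the noiseless stand-in bandit below exists only to exercise the function.

```python
def explore_then_commit(bandit, m, n):
    K = bandit.K()
    counts = [0] * K
    sums = [0.0] * K
    committed = None
    for t in range(n):
        if t < m * K:
            a = t % K                  # exploration phase: round-robin
        elif committed is None:
            # commit once, after mK rounds, to the empirically best arm
            means = [sums[i] / counts[i] for i in range(K)]
            committed = means.index(max(means))
            a = committed
        else:
            a = committed              # exploitation phase
        x = bandit.pull(a)
        counts[a] += 1
        sums[a] += x
    return counts

class _FixedBandit:
    """Noiseless stand-in bandit: each arm pays its mean."""
    def __init__(self, means):
        self.means = means
    def K(self):
        return len(self.means)
    def pull(self, a):
        return self.means[a]

counts = explore_then_commit(_FixedBandit([0.3, 0.7]), m=5, n=100)
assert counts == [5, 95]   # 5 exploratory pulls each, then commit to arm 1
```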
History and Related Algorithms

• ε-greedy and friends: choose the winner in every round with probability 1 − ε; explore uniformly at random over all arms with probability ε (origin lost in history);
• "Certainty equivalence with forcing": Robbins (1952);
• In more complex bandits:
  • "epoch-greedy": Langford and Zhang (2008);
  • "Forced exploration": Abbasi-Yadkori et al. (2009); Abbasi-Yadkori (2009);
  • "Phased exploration and greedy exploitation" (PEGE): Rusmevichientong and Tsitsiklis (2010).
Notes

• Looks silly: explore for m steps, then exploit? Why should we care?
• Simplicity is great! Educational!
• ε-greedy and friends: choose the winner in every round with probability 1 − ε, explore uniformly at random over all arms with probability ε.
• In n rounds, a particular arm i will be chosen on average about nε/K times. So m ≈ nε/K, or ε = mK/n.
• Will the additional randomness help ε-greedy (for the environments considered)?
• Does it make sense to intermix exploration and exploitation steps, rather than exploring first and then exploiting?
• How to choose m? (How to choose ε?)
How Big Can the Regret Be?

Let a ∧ b = min(a, b) and (x)_+ = max(x, 0) for a, b, x ∈ R.

Basic regret decomposition identity:

R_n = ∑_{i=1}^K Δ_i E[T_i(n)].

Thus, it is enough to bound E[T_i(n)]:

E[T_i(n)] ≤ m ∧ ⌈n/K⌉ + (n − mK)_+ P(i = A_{mK+1})
          ≤ m ∧ ⌈n/K⌉ + (n − mK)_+ P( μ̂_i(mK) ≥ max_{j≠i} μ̂_j(mK) ).
Bounding the Probability

We have

P( μ̂_i(mK) ≥ max_{j≠i} μ̂_j(mK) ) ≤ P( μ̂_i(mK) ≥ μ̂_1(mK) )
                                  = P( μ̂_i(mK) − μ_i − (μ̂_1(mK) − μ_1) ≥ Δ_i ).

Claim: μ̂_i(mK) − μ_i − (μ̂_1(mK) − μ_1) is √(2/m)-subgaussian.

Hence, by (2), we have

P( μ̂_i(mK) − μ_i − μ̂_1(mK) + μ_1 ≥ Δ_i ) ≤ exp(−mΔ_i²/4).
ETC Regret Upper Bound

Theorem (Instance-Dependent Bound)
After n rounds, the expected regret R_n of the ETC policy satisfies

R_n ≤ (m ∧ ⌈n/K⌉) ∑_{i=1}^K Δ_i + (n − mK)_+ ∑_{i=1}^K Δ_i exp(−mΔ_i²/4). (5)

Instance-dependent: the bound depends on the Δ_i (properties of the bandit environment instance). Also known as gap-dependent or problem-dependent.

How to choose m?!
Optimal choice of m

Take K = 2. WLOG Δ_1 = 0, Δ := Δ_2. Then

R_n ≤ mΔ + (n − 2m)_+ Δ exp(−mΔ²/4) ≤ mΔ + nΔ exp(−mΔ²/4). (6)

Assume n is reasonably large. Then the optimal choice for m is

m*(n, Δ) = max{ 0, ⌈(4/Δ²) log(nΔ²/4)⌉ } (7)

and we get

R_n ≤ Δ + (4/Δ)(1 + max{0, log(nΔ²/4)}). (8)
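Formula (7) is easy to evaluate numerically; note that for small Δ (relative to n) the logarithm is negative and m* = 0, i.e., no dedicated exploration pays off. The specific (n, Δ) values below are illustrative.

```python
import math

# Tuned exploration length m*(n, Delta) from (7).
def m_star(n, delta):
    return max(0, math.ceil(4.0 / delta**2 * math.log(n * delta**2 / 4.0)))

assert m_star(1000, 0.5) == 67    # ceil(16 * log(62.5))
assert m_star(1000, 0.01) == 0    # log(0.025) < 0, so no exploration
```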
A Simple but Important Improvement

Regardless of the policy, R_n = Δ E[T_2(n)] ≤ Δn. Combining this with (8), we get

R_n ≤ min{ nΔ, Δ + (4/Δ)(1 + log(nΔ²/4)) }. (9)

Corollary ((Infeasible) Worst-Case Bound)
Consider ETC with m = ⌈n/K⌉ ∧ m*(Δ, n). Then there exists C > 0 such that for any Δ > 0 and n > 0, R_n ≤ C√n.
Instance-Adaptive Algorithms

Is there a good, non-cheating choice of m (dependence on n is allowed, but not on Δ)?

Claim: the best such choice gives R_n = Θ(n^{2/3}).

Earlier we had R_n = O(n^{1/2}), much better! Is there some other algorithm that achieves this without knowing Δ? "Adaptation to Δ". (E.g., Auer and Ortner (2010) and Garivier et al. (2016).)

Notice that a worst-case bound shows(!) whether adaptation to individual instances happens!
Exercise 6 (soln.)

Gaussian bandit, K = 2, σ² = 1, μ_1 = 0 and μ_2 = −Δ. Set n = 1000, repeat the simulations N = 10⁴ times and report the average.

Plot, as a function of Δ ∈ [0, 1]:

(a) the theoretical upper bound given in (9) (blue);
(b) the regret of the ETC algorithm with m set as suggested in (7) (green);
(c) the regret of the ETC algorithm with "the" optimal m (yellow; calculated numerically using that the noise is exactly Gaussian).

What can we conclude?

Figure: Expected regret as a function of Δ.
Exercise 7

Let R̄_n = ∑_{t=1}^n Δ_{A_t} (random; the "pseudo-regret").

(a) Fix Δ = 1/10 and plot the average of R̄_n as a function of m with n = 2000 (upper plot).
(b) Plot the variance V[R̄_n] as a function of m for the same bandit (lower plot).
(c) Explain the curves and reconcile them with the theory.
(d) Did it make sense to plot V[R̄_n]? Why or why not?

Figure: Expected regret (upper) and variance of the regret (lower) of ETC as functions of m.
Questions?
Optimism Principle

Optimism in the Face of Uncertainty (OFU) Principle
Choose your actions as if the environment is as nice as plausibly possible.
Illustration

Visiting a new country. Shall I try the local cuisine/beer/...? Or stick to what I know?

Optimism: yes. Pessimism: no.

Optimism leads to exploration; pessimism prevents exploration.

Exploration is necessary: one can be unlucky with one's priors! Hence, optimism is good.

How much?
On What is Plausible

Recall: if X_1, X_2, ..., X_n are independent and 1-subgaussian with mean μ, and μ̂ = ∑_{t=1}^n X_t/n, then for any δ ∈ (0, 1],

P( μ̂ ≥ μ + √(2 log(1/δ)/n) ) ≤ δ. (10)

Round t: how big can μ_i be? Data: μ̂_i(t − 1) (the empirical mean), based on T_i(t − 1) observations. Define

UCB_i(t − 1, δ) = μ̂_i(t − 1) + √(2 log(1/δ) / T_i(t − 1)). (11)

Caveat: T_i(t − 1) is random, hence it is not clear whether P(UCB_i(t − 1, δ) ≤ μ_i) ≤ δ holds.
The UCB(δ) Algorithm

1: Input K and δ.
2: Choose each action once.
3: For rounds t > K choose action

A_t = argmax_i UCB_i(t − 1, δ).

Note
Although there are many versions of the UCB algorithm, we often do not distinguish them by name and hope the context makes clear which is meant. For the rest of this tutorial we will usually call UCB(δ) just UCB.
Notes

• The algorithm first chooses each arm once, which is necessary because the term inside the square root is undefined when T_i(t − 1) = 0.
• The value inside the argmax is called the index of arm i.
• An index algorithm chooses, in each round, the arm that maximizes some value (the index), which usually depends only on the current time step and the samples from that arm.
• In the case of UCB, the index is the sum of the empirical mean of the rewards experienced so far and the exploration bonus (also known as the confidence width).
• The algorithm will work so that the UCB indices are approximately the same all the time (why?).
Demo: http://downloads.tor-lattimore.com/bandits/
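The index policy can be sketched as follows, using the formula in (11); the bandit interface (`K()`, `pull(a)`) again mirrors the earlier exercises and is an assumption of this sketch, as is the noiseless stand-in bandit used to exercise it.

```python
import math

def ucb(bandit, delta, n):
    K = bandit.K()
    counts = [0] * K
    sums = [0.0] * K
    for t in range(n):
        if t < K:
            a = t                      # initialization: each arm once
        else:
            # index_i = empirical mean + exploration bonus, as in (11)
            index = [
                sums[i] / counts[i]
                + math.sqrt(2.0 * math.log(1.0 / delta) / counts[i])
                for i in range(K)
            ]
            a = index.index(max(index))
        x = bandit.pull(a)
        counts[a] += 1
        sums[a] += x
    return counts

class _FixedBandit:
    """Noiseless stand-in bandit: each arm pays its mean."""
    def __init__(self, means):
        self.means = means
    def K(self):
        return len(self.means)
    def pull(self, a):
        return self.means[a]

n = 1000
counts = ucb(_FixedBandit([0.9, 0.1]), delta=1.0 / n**2, n=n)
assert sum(counts) == n and counts[0] > counts[1]
```

Even with noiseless rewards the suboptimal arm is pulled more than once: its shrinking exploration bonus has to fall below the gap before UCB stops choosing it.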
Instance-Dependent Bound for UCB

Theorem
Consider UCB, as shown earlier, on a stochastic K-armed 1-subgaussian bandit problem. For any horizon n, if δ = 1/n², then

R_n ≤ 3 ∑_{i=1}^K Δ_i + ∑_{i: Δ_i>0} 16 log(n)/Δ_i.
Proof: Main Ideas

Regret decomposition identity:

R_n = ∑_{i=1}^K Δ_i E[T_i(n)].

Take i such that Δ_i > 0 and bound E[T_i(n)].

Key observation: after initialization, action i can only be chosen if its index is higher than that of an optimal arm. This can only happen if at least one of the following is true:
(a) the index of action i is larger than the true mean of a specific optimal arm;
(b) the index of a specific optimal arm is smaller than its true mean.

Both of these have low probability. In particular, E[T_i(n)] ≤ c_1 + c_2 log(n)/Δ_i². QED.
Proof: The Reward Consumption Model

For i ∈ [K], let (Z_{i,s})_s be an i.i.d. sequence with Z_{i,s} ∼ P_i. Define X_t to be the T_{A_t}(t)-th element of the sequence (Z_{A_t,s})_s:

X_t = Z_{A_t, T_{A_t}(t)}. (12)

Is there any loss of generality? No: the interaction protocol is equivalent to a constraint on the distribution of (A_1, X_1, ..., A_n, X_n), and this constraint clearly holds in this case.

Benefit: defining the usual sample means

μ̂_{i,s} = (1/s) ∑_{u=1}^s Z_{i,u},  s ∈ [n],

we have μ̂_i(t) = μ̂_{i, T_i(t)}.
Proof: Some Details II.

WLOG μ_1 = μ*. Fix i such that Δ_i > 0. Let G_i be the 'good' event defined by

G_i = { μ_1 < min_{t∈[n]} UCB_1(t) } ∩ { μ̂_{i,u_i} + √((2/u_i) log(1/δ)) < μ_1 },

where u_i ∈ [n] is a constant to be chosen later. Then:

1. If G_i occurs, then T_i(n) ≤ u_i.
2. The complement event G_i^c occurs with low probability (governed, in a way yet to be discovered, by u_i).

Because T_i(n) ≤ n no matter what, this will mean that

E[T_i(n)] = E[I{G_i} T_i(n)] + E[I{G_i^c} T_i(n)] ≤ u_i + P(G_i^c) n. (13)

For details, see our website.
Worst-case Bound

Theorem
If δ = 1/n², then the regret of UCB(δ) on any environment ν ∈ E_SG^K(1) is bounded by

R_n ≤ 8√(nK log(n)) + 3 ∑_{i=1}^K Δ_i.
Notes

Recall: for all ν ∈ E_SG^K(1),

R_n(ν) ≤ 8√(nK log(n)) + 3 ∑_{i=1}^K Δ_i.

• Same tuning as in the other result. Good!
• Not anytime. Hmm...
• The additive ∑_i Δ_i term is unavoidable: all reasonable algorithms must play each arm once (exercise: what if not?).
• The bound is close to optimal: no algorithm can enjoy regret smaller than const·√(nK) over all problems in E_SG^K(1).
• A more complicated variant of UCB(δ) shaves the logarithmic term from the upper bound given above.
Proof

The previous proof actually gives

E[T_i(n)] ≤ 3 + 16 log(n)/Δ_i².

Using the basic regret decomposition and choosing some Δ > 0,

R_n = ∑_{i=1}^K Δ_i E[T_i(n)] = ∑_{i: Δ_i<Δ} Δ_i E[T_i(n)] + ∑_{i: Δ_i≥Δ} Δ_i E[T_i(n)]
    ≤ nΔ + ∑_{i: Δ_i≥Δ} ( 3Δ_i + 16 log(n)/Δ_i )
    ≤ nΔ + 16K log(n)/Δ + 3 ∑_i Δ_i
    ≤ 8√(nK log(n)) + 3 ∑_{i=1}^K Δ_i,

where the first inequality follows because ∑_{i: Δ_i<Δ} T_i(n) ≤ n, and the last line by choosing Δ = √(16K log(n)/n). QED.
Demo/Exercise 8

Setup: n = 1000, K = 2, P_1 = N(0, 1), P_2 = N(−Δ, 1). Plot estimates of the expected regret of UCB and of ETC with m ∈ {25, 50, 75, 100, Optimum} as Δ ranges over [0, 1].

Figure: Expected regret as a function of Δ for ETC (m = 25, 50, 75, 100, optimal m) and UCB.

What can we conclude?
Exercise 9/Demo
Compare the regret histogram of UCB and ETC.
Demo
Literature

• Confidence bounds and the OFU principle are due to Lai and Robbins (1985) (context: parametric bandits, asymptotics).
• UCB algorithm: Katehakis and Robbins (1995) (Gaussian bandits) and Agrawal (1995). Still asymptotics. Agrawal (1995)'s analysis is modular: all that is needed are appropriate UCBs (the form is irrelevant).
• Independently, Kaelbling (1993) also discovered UCB. No regret analysis, or clear advice on how to tune the confidence parameter.
• "This" UCB is most similar to UCB1 of Auer et al. (2002), except that n is t in UCB1. Finite-time regret bound, payoffs bounded in [0, 1] (i.e., 1/2-subgaussian).
• Worst-case bound: Bubeck and Cesa-Bianchi (2012), which focuses on the subgaussian setup.
Questions?
94 / 108
Asymptotically Optimal UCB (AO-UCB)
1: Input: K
2: Choose each arm once
3: Subsequently choose

    At = argmax_i ( µi(t − 1) + √( 2 log f(t) / Ti(t − 1) ) ),

where f(t) = 1 + t log²(t).

96 / 108
Asymptotic Regret, Asymptotic Optimality
Theorem (Upper Bound)

The regret of AO-UCB satisfies

    lim sup_{n→∞} Rn(ν) / log(n) ≤ Σ_{i: Δi>0} 2/Δi .    (14)

Theorem (Lower Bound)

For any policy π that has subpolynomial regret for all 1-subgaussian environments (i.e., Rn(ν, π) = o(n^p) for all p > 0 and all ν), and for any instance ν with gaps Δ = Δ(ν),

    lim inf_{n→∞} Rn(ν, π) / log(n) ≥ Σ_{i: Δi>0} 2/Δi .    (15)
97 / 108
Finite-time Regret for AO-UCB
Corollary

There exists a universal constant C > 0 such that the regret of AO-UCB is bounded by

    Rn ≤ C Σ_{i: Δi>0} ( Δi + log(n)/Δi ),

and, in particular,

    Rn ≤ C Σ_{i=1}^K Δi + 2 √( C n K log(n) ) .
98 / 108
Zoo of UCBs
| Strategy     | Reference                  | Conf.       | Dist.-free regret | Asy. opt. | Inst. opt. | Anytime |
|--------------|----------------------------|-------------|-------------------|-----------|------------|---------|
| UCB          | Auer et al. (2002)         | t           | √(kn log(n))      | ✓         | ✓          | ✓       |
| UCB*         | Lai (1987)                 | n/T         | √(kn log(k))      | ✓         | ✓          | ✗       |
| UCB+         | Garivier and Cappé (2011)  | t/T         | √(kn log(k))      | ✓         | ✓          | ✓       |
| MOSS         | Audibert and Bubeck (2009) | n/(kT)      | √(kn)             | ✓         | ✗          | ✗       |
| Anytime MOSS | Degenne and Perchet (2016) | t/(kT)      | √(kn)             | ✓         | ✗          | ✓       |
| OCUCB        | Lattimore (2015)           | (n/t)^{1+ε} | √(kn)             | ✗         | ✓          | ✗       |
| UCB†         | Lattimore (2017)           | see ref.    | √(kn)             | ✓         | ✓          | ✗       |
| UCB‡         | Lattimore (2017)           | see ref.    | √(kn log log(n))  | ✓         | ✓          | ✓       |

All strategies choose At = t for 1 ≤ t ≤ k and subsequently

    At = argmax_i ( µi(t − 1) + √( 2 log~(Conf.) / Ti(t − 1) ) ),    (16)

where the square-root term is the exploration bonus and log~(x) ≈ log(x) is approximately logarithmic. In the table, T stands for Ti(t − 1).
Demo with UCB and OCUCB:http://downloads.tor-lattimore.com/bandits/
100 / 108
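To see how the table's rows share template (16), here is a sketch of the generic index (my own code; as an assumption, log~ is replaced by x ↦ log(max(x, e)), just one concrete choice of an approximately logarithmic function, not the exact form used by any row):

```python
import math

def ucb_index(mean_hat, T_i, conf):
    """Generic index from (16): empirical mean plus exploration bonus.

    log~ is approximated by log(max(conf, e)), which keeps the
    argument of the logarithm at least e (an assumption of this sketch)."""
    return mean_hat + math.sqrt(2 * math.log(max(conf, math.e)) / T_i)

# Hypothetical instantiations of the "Conf." column for one arm:
t, n, k, T_i = 100, 1000, 5, 10
ucb_val  = ucb_index(0.4, T_i, t)              # UCB: Conf. = t
moss_val = ucb_index(0.4, T_i, n / (k * T_i))  # MOSS: Conf. = n/(k T_i)
```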
Dealing with Risk
Setup: K = 2, Gaussian noise, gap Δ. If δ is the probability that some (reasonable) algorithm misses the optimal arm, then

    Rn = O( nΔδ + (1/Δ) log(1/δ) ).

Optimizing over δ gives δ = 1/(nΔ²) and

    Rn = O( (1/Δ)(1 + log(nΔ²)) ).

How about E[Rn²]? We get E[Rn²] ≈ δ(nΔ)² = n. Too big!! Choose δ = (nΔ)^{−2} instead to get

    Rn = O( (1/Δ)(1/n + log(n²Δ²)) )

and E[Rn²] ≈ log²(n)!! For further info, see Audibert et al. (2007).

101 / 108
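The arithmetic behind the two tunings can be checked directly (illustrative only, using the slide's approximation E[Rn²] ≈ δ(nΔ)² for the contribution of the "miss" event):

```python
import math

n, Delta = 10_000, 0.5

delta_mean = 1 / (n * Delta**2)    # tuning that optimizes the expected regret
delta_risk = (n * Delta) ** -2     # tuning that also controls the second moment

# Contribution of the miss event to E[R_n^2] under each tuning:
second_moment_mean = delta_mean * (n * Delta) ** 2   # = n   -> too big
second_moment_risk = delta_risk * (n * Delta) ** 2   # = 1   -> fine
```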
Summary
• Bandits: The drosophila of "explore-exploit" problems.
  • Learner interacts with its environment, which is initially unknown.
  • We care about the total reward/regret.
  • Exploration is necessary to avoid large loss due to a (plausible) unlucky start.
  • Exploitation is necessary to avoid large loss due to being just curious all the time.
  • What is the right amount?
• Stochastic, finite-armed bandits:
  • Explore-then-commit: Ideal, but infeasible tuning shows what can be achieved.
  • Optimism does it: UCB.
  • Simultaneously satisfying many optimality criteria is possible with a carefully tuned UCB.
• We left out lower bound proofs and many other topics.

103 / 108
Questions?
104 / 108
References I
Abbasi-Yadkori, Y. (2009). Forced-exploration based algorithms for playing in bandits with large action sets. Master's thesis, University of Alberta, Department of Computing Science.

Abbasi-Yadkori, Y., Antos, A., and Szepesvári, C. (2009). Forced-exploration based algorithms for playing in stochastic linear bandits. In COLT Workshop on On-line Learning with Limited Feedback.

Agrawal, R. (1995). Sample mean based index policies with O(log n) regret for the multi-armed bandit problem. Advances in Applied Probability, pages 1054–1078.

Audibert, J.-Y. and Bubeck, S. (2009). Minimax policies for adversarial and stochastic bandits. In Proceedings of Conference on Learning Theory (COLT), pages 217–226.

Audibert, J.-Y., Munos, R., and Szepesvári, C. (2007). Tuning bandit algorithms in stochastic environments. In Algorithmic Learning Theory (ALT), pages 150–165. Springer.

Auer, P., Cesa-Bianchi, N., and Fischer, P. (2002). Finite-time analysis of the multiarmed bandit problem. Machine Learning, 47:235–256.

Auer, P., Cesa-Bianchi, N., Freund, Y., and Schapire, R. E. (1995). Gambling in a rigged casino: The adversarial multi-armed bandit problem. In 36th Annual Symposium on Foundations of Computer Science (FOCS), pages 322–331. IEEE.

Auer, P. and Ortner, R. (2010). UCB revisited: Improved regret bounds for the stochastic multi-armed bandit problem. Periodica Mathematica Hungarica, 61(1-2):55–65.
105 / 108
References II
Bather, J. and Chernoff, H. (1967). Sequential decisions in the control of a spaceship. In Fifth Berkeley Symposium on Mathematical Statistics and Probability, volume 3, pages 181–207.

Berry, D. and Fristedt, B. (1985). Bandit Problems: Sequential Allocation of Experiments. Chapman and Hall, London; New York.

Boucheron, S., Lugosi, G., and Massart, P. (2013). Concentration Inequalities: A Nonasymptotic Theory of Independence. OUP Oxford.

Bubeck, S. and Cesa-Bianchi, N. (2012). Regret Analysis of Stochastic and Nonstochastic Multi-armed Bandit Problems. Foundations and Trends in Machine Learning. Now Publishers Incorporated.

Chernoff, H. (1959). Sequential design of experiments. The Annals of Mathematical Statistics, 30(3):755–770.

Degenne, R. and Perchet, V. (2016). Anytime optimal algorithms in stochastic multi-armed bandits. In Proceedings of International Conference on Machine Learning (ICML).

Dudley, R. M. (2014). Uniform Central Limit Theorems, volume 142. Cambridge University Press.

Garivier, A. and Cappé, O. (2011). The KL-UCB algorithm for bounded stochastic bandits and beyond. In Proceedings of Conference on Learning Theory (COLT).

Garivier, A., Kaufmann, E., and Lattimore, T. (2016). On explore-then-commit strategies. In Advances in Neural Information Processing Systems (NIPS).

Gittins, J. (1979). Bandit processes and dynamic allocation indices. Journal of the Royal Statistical Society, Series B (Methodological), 41(2):148–177.
106 / 108
References III
Gittins, J., Glazebrook, K., and Weber, R. (2011). Multi-armed Bandit Allocation Indices. John Wiley & Sons.

Kaelbling, L. P. (1993). Learning in Embedded Systems. MIT Press.

Katehakis, M. N. and Robbins, H. (1995). Sequential choice from several populations. Proceedings of the National Academy of Sciences of the United States of America, 92(19):8584.

Lai, T. L. (1987). Adaptive treatment allocation and the multi-armed bandit problem. The Annals of Statistics, pages 1091–1114.

Lai, T. L. and Robbins, H. (1985). Asymptotically efficient adaptive allocation rules. Advances in Applied Mathematics, 6(1):4–22.

Langford, J. and Zhang, T. (2008). The epoch-greedy algorithm for multi-armed bandits with side information. In NIPS, pages 817–824.

Lattimore, T. (2015). Optimally confident UCB: Improved regret for finite-armed bandits. arXiv preprint arXiv:1507.07880.

Lattimore, T. (2017). Auto-tuning the confidence level for optimistic bandit strategies. Technical report.

McDiarmid, C. (1998). Concentration. In Probabilistic Methods for Algorithmic Discrete Mathematics, pages 195–248. Springer.

Peña, V. H., Lai, T. L., and Shao, Q.-M. (2008). Self-normalized Processes: Limit Theory and Statistical Applications. Springer Science & Business Media.
107 / 108
References IV
Robbins, H. (1952). Some aspects of the sequential design of experiments. Bulletin of the American Mathematical Society, 58(5):527–535.

Rusmevichientong, P. and Tsitsiklis, J. N. (2010). Linearly parameterized bandits. Mathematics of Operations Research, 35(2):395–411.

Thompson, W. (1933). On the likelihood that one unknown probability exceeds another in view of the evidence of two samples. Biometrika, 25(3/4):285–294.

Tropp, J. A. (2015). An introduction to matrix concentration inequalities. Foundations and Trends in Machine Learning, 8(1-2):1–230.

van de Geer, S. (2000). Empirical Processes in M-estimation, volume 6. Cambridge University Press.
108 / 108