Bandits: Part I
Stochastic, Finite-Armed Bandits
Csaba Szepesvari
Department of Computing Science & AICML, University of Alberta (& DeepMind since August)
August 31, 2017
Summer School @ DS3
1 / 108
Outline

1. Overview of Talks
2. Introduction: Learning Objectives; Brief History; Why Should We Care?; Exploration vs. Exploitation; Applications
3. Finite-Armed Stochastic Bandits: Bandits; Stochastic Bandits; Basic Properties of the Regret
4. Measure Concentration
Outline (continued)

5. Explore-then-Commit (ETC): Algorithm; Regret Upper Bound; Tuning ETC; Exercise/Illustration
6. Upper Confidence Bound (UCB): Optimism Principle; The UCB Algorithm; Regret Upper Bounds for UCB; Empirical Illustration; Asymptopia; Zoo of UCBs and Risk Management
7. Summary
8. Further Reading
Overview of Talks
• Talk 1: Basics
  • What, why, applications
  • Bandits: problem definition
  • Stochastic finite-armed bandits: basics
  • Measure concentration; subgaussianity
  • Explore-then-commit
  • UCB, optimism, optimality
• Talk 2: Adversarial and linear bandits
  • Adversarial finite-armed bandits
  • Contextual bandits
  • Exp4
  • Stochastic linear bandits
  • Adversarial linear bandits
• Talk 3: Bandits in the wild
Learning Material
• This lecture (and more): http://banditalgs.com
• Joint effort with Tor Lattimore
• Book to be published by early next year: stay tuned!
• Tor's lightweight C++ bandit library
• I will share some exercises later
• Sebastien Bubeck's tutorial: blog posts 1 and 2
• Bubeck and Cesa-Bianchi's book (Bubeck and Cesa-Bianchi, 2012)
banditalgs.com
Learning Objectives: Knowledge
The goal is to gain knowledge about:
• Bandit problems: What are they?
• Types of bandit problems: How do they differ?
• Key ideas: Explore vs. exploit; why significant? What to do?
• Basic solution techniques
• Core results: How far did we get? How far can we go?
• Peek into contemporary research
Learning Objectives: Skills
Skills to be acquired. Ability to . . .
• . . . recognize bandit problems;
• . . . recognize types/variants of bandits;
• . . . write code for bandits (algorithms, environment, . . . );
• . . . recognize insurmountable tradeoffs; limits of what is possible;
• . . . get around in the literature.
Origin of the Name
Mouse learning in a T-maze. 1953: Frederick Mosteller and Robert Bush, psychologists: "A Stochastic Model with Applications to Learning".

Generalization to humans: "two-armed bandits".
Classics
• First paper on bandits: Thompson (1933)
• Major effort by Herbert Robbins in the 50s; earliest is (Robbins, 1952)
• Another pioneer: Herman Chernoff, e.g., (Bather and Chernoff, 1967)
• Breakthrough in Bayesian bandits (computation): (Gittins, 1979)
• A breakthrough in stochastic bandits: (Lai and Robbins, 1985), asymptotic regret-optimality in a frequentist setting
• UCB as we know it: (Auer et al., 2002), finite-time optimality
• Exp3, adversarial bandits: (Auer et al., 1995)
• Early books (sequential design, Bayesian setting): (Chernoff, 1959; Berry and Fristedt, 1985; Gittins et al., 2011)
Outline
1 Overview of Talks
2 IntroductionLearning ObjectivesBrief HistoryWhy Should We Care?Exploration vs. ExploitationApplications
3 Finite-Armed Stochastic BanditsBanditsStochastic BanditsBasic Properties of the Regret
4 Measure Concentration
12 / 108
The Number Game
Number of papers on bandits published in 5-year periods (present = 2016), as reported by Google Scholar.
Why Do People Care?
• Decision making under uncertainty is a significant challenge
• Bandits capture a key aspect of this challenge: the exploration-exploitation dilemma

Some examples:
• Which drugs should a patient receive?
• How should I allocate my study time between courses?
• Which version of a website will generate the most revenue?
• What move should be considered next when playing chess/go?
• ...
The Exploration-Exploitation Dilemma

Payoffs after 5 pulls of each arm:
Left arm: 0, 10, 0, 0, 10
Right arm: 10, 0, 0, 0, 0

Left arm average payoff: 4 dollars per round; right arm average payoff: 2 dollars per round. Budget: 20 more trials (pulls).

• Shall we pull left only? "Exploit"?
• Shall we pull the right arm at all? "Explore"?
• Why?
• How many times to pull each arm?
• "Good luck/bad luck?"

Play time!
Applications, applications...

1. A/B testing
2. Drug testing
3. Advert placement
4. Network routing (packets, planes, cars)
5. Tree search (MCTS)
6. Recommendation services (for example, news or movies)
7. Ranking (for example, search)
8. Educational games
9. Resource allocation (memory, bandwidth, manufacturing space)
10. Waiting problems (hard-disk shutdown, auto logout, waiting for a bus)
11. Dynamic pricing (for example, on Amazon)
12. A core component of RL
Questions?
Bandits: Interaction Protocol and Goal

For rounds t = 1, 2, ..., n:

1. The learner chooses an action A_t from a set A of available actions. The chosen action is sent to the environment.
2. The environment generates a response in the form of a real-valued reward X_t ∈ R, which is sent back to the learner.

The goal of the learner is to maximize the sum of rewards it receives, ∑_{t=1}^n X_t.
Bandits: Interaction Protocol and Goal: Concepts

• History at the end of round t: H_t = (A_1, X_1, ..., A_t, X_t).
• The learner can base its action A_{t+1} in round t + 1 on the history.
• The learner "uses" a "policy": a map from all possible histories to actions.
• The learner is also allowed to randomize.
Bandits and Regret

Definition (Regret – informal definition)
The regret of a learner relative to action a = (the total reward gained when a is used for all n rounds) − (the total reward gained by the learner in n rounds with its chosen actions).

• Maximize reward ⇔ minimize regret.
• Advantage? The scale is normalized so that zero has a special meaning (it kills meaningless shifts).

Questions:
• What does it mean that a learner has positive regret?
• (What does it mean that a learner has positive reward?)
• What does it mean that a learner has zero regret?
Regret – II.

Definition (Zero-regret learner)
A "zero-regret learner": if R_n is the regret at time n, then R_n/n → 0; a.k.a. "vanishing regret" or sublinear regret; R_n = o(n).

In general we care about how fast R_n/n → 0, i.e., how slowly R_n grows.

Examples:
• R_n = O(√n) (so R_n/n = O(1/√n)).
• R_n = O(log(n)) (so R_n/n = O(log(n)/n)).
Discussion

• Why compare with fixed actions? Is this limiting? What does this capture?
• Stationarity.
• Does the policy know the "horizon"? Is it "anytime"?
• Good learner: "small" regret (small worst-case regret!?) over a large class of environments.
• Instance-dependent regret.
• Regret lower and upper bounds: "worst-case" vs. instance-dependent.
Types of Environments
• Stochastic
• Adversarial (“unconstrained”)
• Unstructured vs. structured payoff
• Contextual
• . . .
Stochastic Bandits

Definition
A K-armed stochastic bandit environment is a tuple of distributions ν = (P_1, P_2, ..., P_K), where P_i is a distribution over the reals for each i ∈ [K] := {1, 2, ..., K}.

Interaction
For rounds t = 1, 2, 3, ...

1. Based on its past observations (if any), the learner chooses an action A_t ∈ [K] following some policy π: A_t ∼ π(·|A_1, X_1, ..., A_{t−1}, X_{t−1}). The chosen action is sent to the environment.
2. The environment generates a random reward X_t whose distribution is P_{A_t} (in notation: X_t ∼ P_{A_t}). The generated reward is sent back to the learner.
A note

• In the environment, no joint distribution (over all arms) is specified. Why?
• The learner sees only X_t, not the rewards of the other arms. Even if those exist, they are latent; the learner cannot access them, hence the properties of the joint (even if it were specified) do not matter.
• Precise answer: all information about an interconnected environment-policy pair (ν, π) is in the joint distribution of action-reward sequences.
Typical Environments

Name                     Symbol         Definition
Bernoulli                E_B^K          {(B(μ_i))_i : μ ∈ [0, 1]^K}
Uniform                  E_U^K          {(U(a_i, b_i))_i : a, b ∈ R^K, a ≤ b}
Gaussian (known var.)    E_N^K(σ²)      {(N(μ_i, σ²))_i : μ ∈ R^K}
Gaussian (unknown var.)  E_N^K          {(N(μ_i, σ_i²))_i : μ ∈ R^K, σ² ∈ R_+^K}
Finite variance          E_V^K(σ²)      {(P_i)_i : V_{X∼P_i}[X] ≤ σ² for all i}
Finite kurtosis          E_Kurt^K(κ)    {(P_i)_i : Kurt_{X∼P_i}[X] ≤ κ for all i}
Bounded support          E_{[a,b]}^K    {(P_i)_i : Supp(P_i) ⊆ [a, b]}
Subgaussian              E_SG^K(σ²)     {(P_i)_i : P_i is σ-subgaussian for all i}

Supp(P) is the support of distribution P; Kurt(X) = E[(X − E[X])⁴] / V[X]².

Table: Typical environment classes for stochastic bandits
Expected Reward/Regret

• S_n = ∑_{t=1}^n X_t: the total reward. Random!
• Possible goal: maximize E[S_n], the expected reward.
• OK?
• Same as minimizing R_n, the (expected) regret, where

  R_n := nμ* − E[S_n],

  with μ* = max_{i∈[K]} μ_i and μ_i = ∫ x P_i(dx).
Basic Properties of Regret – I.

• Let R_n(π, ν) be the (expected!) regret of policy π on environment ν.

Lemma (Warmup – Exercise 0)
(a) R_n(π, ν) ≥ 0 for all policies π.
(b) The policy π choosing A_t ∈ argmax_i μ_i for all t satisfies R_n(π, ν) = 0.
(c) If R_n(π, ν) = 0 for some policy π, then for all t, A_t ∈ [K] is optimal with probability one: P(μ_{A_t} = μ*) = 1.
Regret Decomposition – Important!!!

• Let ν = (P_1, ..., P_K) be a bandit environment.
• Let Δ_i(ν) = μ*(ν) − μ_i(ν): the sub-optimality gap (action gap, immediate regret) of action i.
• Usage count of action i:

  T_i(t) = ∑_{s=1}^t I{A_s=i}.

• Note: Δ_i := Δ_i(ν) is non-random, while T_i(t) is random! (Why?)

Lemma (Regret Decomposition Lemma)
For any policy π, K-armed stochastic bandit environment ν and horizon n ∈ N, the regret R_n of policy π in ν satisfies

R_n = ∑_{i=1}^K Δ_i E[T_i(n)].
Proof of the Regret Decomposition Lemma

Proof. For any fixed t we have ∑_k I{A_t=k} = 1. Hence S_n = ∑_t X_t = ∑_t ∑_k X_t I{A_t=k}, and thus

R_n = nμ* − E[S_n] = ∑_{k=1}^K ∑_{t=1}^n E[(μ* − X_t) I{A_t=k}].

Now, knowing A_t, the expected reward is μ_{A_t}. Thus we have

E[(μ* − X_t) I{A_t=k} | A_t] = I{A_t=k} E[μ* − X_t | A_t]
                             = I{A_t=k} (μ* − μ_{A_t})
                             = I{A_t=k} (μ* − μ_k).

Take expectations and sum both sides over t = 1, ..., n.
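A closed-form sanity check of the Regret Decomposition Lemma can be done with a (hypothetical) uniformly-random policy, for which E[T_i(n)] = n/K exactly, so both sides of the identity are computable without simulation; the arm means below are illustrative values, not from the slides.

```python
# Check R_n = sum_i Delta_i * E[T_i(n)] for the uniform-random policy,
# where E[T_i(n)] = n/K exactly for every arm i.
means = [0.5, 0.6, 0.3]            # illustrative arm means
n = 1000
K = len(means)
mu_star = max(means)

# Left-hand side: R_n = n*mu_star - E[S_n]
lhs = n * mu_star - n * sum(means) / K
# Right-hand side: sum_i Delta_i * E[T_i(n)]
rhs = sum((mu_star - mu) * (n / K) for mu in means)

assert abs(lhs - rhs) < 1e-9
```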
Questions?
Exercise 1: (get code)
Implement a Bernoulli bandit environment in Python using the code snippet below:

class BernoulliBandit:
    # Accepts a list of K >= 2 floats, each lying in [0, 1]
    def __init__(self, means):
        pass

    # Should return the number of arms
    def K(self):
        pass

    # Accepts a parameter 0 <= a <= K-1 and returns the
    # realisation of a random variable X with P(X = 1) being
    # the mean of the (a+1)th arm
    def pull(self, a):
        pass

    # Returns the regret incurred so far
    def regret(self):
        pass
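For concreteness, here is one way the skeleton above might be filled in; the pseudo-regret bookkeeping (summing the true gaps of the pulled arms in `regret()`) is an assumption about what the exercise intends, not something the slide specifies.

```python
import random

class BernoulliBandit:
    # Accepts a list of K >= 2 floats, each lying in [0, 1]
    def __init__(self, means):
        assert len(means) >= 2 and all(0.0 <= m <= 1.0 for m in means)
        self.means = list(means)
        self.best = max(self.means)
        self._regret = 0.0

    # Returns the number of arms
    def K(self):
        return len(self.means)

    # Returns a Bernoulli(means[a]) reward and records the gap of arm a
    def pull(self, a):
        self._regret += self.best - self.means[a]
        return 1 if random.random() < self.means[a] else 0

    # Returns the (pseudo-)regret incurred so far
    def regret(self):
        return self._regret

bandit = BernoulliBandit([0.4, 0.6])
assert bandit.K() == 2
for _ in range(10):
    assert bandit.pull(0) in (0, 1)
assert abs(bandit.regret() - 10 * 0.2) < 1e-9
```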
Exercise 2: Follow-the-Leader (get code)

Implement the following simple algorithm, called 'Follow-the-Leader' (FTL), which chooses each action once and subsequently chooses the action with the largest average observed so far. Ties should be broken randomly.

def FollowTheLeader(bandit, n):
    # Implement the Follow-the-Leader algorithm by replacing
    # the code below, which just plays the first arm in every round
    for t in range(n):
        bandit.pull(0)

Note: depending on the literature you are reading, Follow-the-Leader may be called 'stay with the winner' or the 'greedy algorithm'.
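A possible FTL implementation against the Exercise 1 interface (`K()`, `pull(a)`), with ties broken uniformly at random as the exercise asks; the tiny noiseless bandit below is only a stand-in used to exercise the function, not the graded environment.

```python
import random

def follow_the_leader(bandit, n):
    K = bandit.K()
    counts = [0] * K
    sums = [0.0] * K
    for t in range(n):
        if t < K:
            a = t                      # play each arm once first
        else:
            means = [sums[i] / counts[i] for i in range(K)]
            best = max(means)
            # break ties uniformly at random
            a = random.choice([i for i in range(K) if means[i] == best])
        x = bandit.pull(a)
        counts[a] += 1
        sums[a] += x
    return counts

class _FixedBandit:
    """Stand-in bandit whose arms pay their means deterministically."""
    def __init__(self, means):
        self.means = means
    def K(self):
        return len(self.means)
    def pull(self, a):
        return self.means[a]

counts = follow_the_leader(_FixedBandit([0.2, 0.8]), 100)
# With noiseless rewards, FTL locks onto arm 1 after one pull of each.
assert counts == [1, 99]
```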
Exercise 3: Distribution of (random) regret

Consider a Bernoulli bandit with two arms and means μ_1 = 0.5 and μ_2 = 0.6.

(a) Using a horizon of n = 100, run 1000 simulations of your implementation of Follow-the-Leader on the Bernoulli bandit above and record the (random) regret, nμ* − S_n, in each simulation.
(b) Plot the results using a histogram (see the figure).
(c) Explain the results in the figure.

Figure: Histogram of regret for FTL over 1000 trials on a Bernoulli bandit with means μ_1 = 0.5, μ_2 = 0.6.
Exercise 4: Regret Over Time

Consider the same Bernoulli bandit as in the previous question.

(a) Run 1000 simulations of your implementation of FTL for each horizon n ∈ {100, 200, 300, ..., 1000}.
(b) Plot the average regret obtained as a function of n (see the figure). Include error bars.
(c) Explain the plot. Do you think FTL is a good algorithm? Why/why not?

Figure: Average regret of FTL over 1000 trials as a function of the horizon n.
Statistics 101

• X_1, ..., X_n independent, identically distributed (∼ P), real-valued random variables (think rewards).
• What is μ := E[X_1] (= E[X_t])?
• Estimate: μ̂ = (1/n) ∑_{t=1}^n X_t.
• A "statistic" of the data: any function of X_1, ..., X_n.
• Notice that μ̂ is random, but μ̂ ≈ μ.
• The distribution of μ̂ depends on P.
• "How close?": characterize the distribution of |μ̂ − μ|.
• A priori characterization: the characterization depends on P, where all we know is that P ∈ P.
• A posteriori characterization: the characterization depends on X_1, ..., X_n.
Tail probabilities

Imagine X := μ̂. We care about the probability masses P(μ̂ < μ − ε) and P(μ̂ > μ + ε).

Either, given ε, give lower or upper bounds on these ("probability bound"), or, for a given probability mass, give upper and/or lower bounds on ε ("deviation bound").
Markov and Chebyshev

Lemma
For any random variable X with finite mean and ε > 0 it holds that:

(a) (Markov): P(|X| ≥ ε) ≤ E[|X|]/ε.
(b) (Chebyshev): P(|X − E[X]| ≥ ε) ≤ V[X]/ε².

• Exercise 1: Prove Markov. Hint: prove it for nonnegative r.v.s; use E[X] = ∫_0^∞ x P(dx) and split the integral.
• Exercise 2: Prove Chebyshev. Hint: apply Markov.
• Chebyshev applied to μ̂: V[μ̂] = σ²/n. Hence,

  P(|μ̂ − μ| ≥ ε) ≤ σ²/(nε²).

• Note: Chebyshev is more precise. You can further increase precision by applying Markov to |X − E[X]|^k for large k.
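A quick Monte Carlo sanity check of Chebyshev applied to the sample mean; the uniform-on-[0,1] rewards (so σ² = 1/12) are an assumed toy distribution, not part of the slides.

```python
import math
import random

# Empirical frequency of {|mean - mu| >= eps} should stay below the
# Chebyshev bound sigma^2 / (n * eps^2) for uniform [0,1] samples.
random.seed(1)
n, eps, reps = 50, 0.1, 10000
sigma2 = 1.0 / 12.0                # variance of Uniform[0, 1]
hits = 0
for _ in range(reps):
    mean = sum(random.random() for _ in range(n)) / n
    if abs(mean - 0.5) >= eps:
        hits += 1
bound = sigma2 / (n * eps * eps)
assert hits / reps <= bound
```

The empirical frequency is in fact far below the bound, which is the point of the "Chebyshev is loose" remark above.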
Central Limit Theorem (CLT)

Theorem (CLT)
Let X_t be iid, σ² = V[X_1] < ∞, S_n = ∑_{t=1}^n (X_t − μ), Z_n = S_n/√(σ²n). Then F_{Z_n} → F_Z as n → ∞, where Z ∼ N(0, 1).

Note: F_Z(u) = P(Z ≥ u) = ∫_u^∞ (1/√(2π)) exp(−x²/2) dx.

Bounding F_Z(u):

∫_u^∞ (1/√(2π)) exp(−x²/2) dx ≤ (1/(u√(2π))) ∫_u^∞ x exp(−x²/2) dx
                              = √(1/(2πu²)) exp(−u²/2).

Hence

P(μ̂ ≥ μ + ε) = P( S_n/√(σ²n) ≥ ε√(n/σ²) ) ≈ P( Z ≥ ε√(n/σ²) )
             ≤ √(σ²/(2πnε²)) exp(−nε²/(2σ²)).
How good is the CLT?

P(μ̂_n ≥ μ + ε) ⪅ √(σ²/(2πnε²)) exp(−nε²/(2σ²)).

Question: can we safely replace ⪅ with ≤, e.g., when X_t ∼ P and P is supported on [0, 1]?

(a) Yes: when n ≥ 30, the error will be very small for these distributions.
(b) Yes: when n ≥ 1000, the error will be very small for these distributions.
(c) No: no n will make the error uniformly small when P ranges over all distributions with support in [0, 1].
Subgaussianity

Definition (Subgaussianity)
A random variable X is σ-subgaussian if for all λ ∈ R it holds that E[exp(λX)] ≤ exp(λ²σ²/2).

Moment/cumulant generating functions of X, M_X, ψ_X : R → R: M_X(λ) = E[exp(λX)], ψ_X(λ) = log M_X(λ), λ ∈ R.

Lemma
X is σ-subgaussian iff ψ_X(λ) ≤ λ²σ²/2 for all λ ∈ R.

Example: Z ∼ N(0, σ²). Then M_Z(λ) = exp(λ²σ²/2).

Does M_X (or ψ_X) always exist? No: e.g., for X ∼ Exp(1), M_X(λ) = ∞ for λ ≥ 1.
Why the Name?

Theorem
If X is σ-subgaussian, then for any ε ≥ 0,

P(X ≥ ε) ≤ exp(−ε²/(2σ²)). (1)
Proof

Proof. We take a generic approach called the Cramer-Chernoff method. Let λ > 0 be a constant to be tuned later. Then

P(X ≥ ε) = P( exp(λX) ≥ exp(λε) )
         ≤ E[exp(λX)] exp(−λε)        (Markov's inequality)
         ≤ exp( λ²σ²/2 − λε ).         (def. of subgaussianity)

Now, λ was an arbitrary positive constant, and in particular it may be chosen to minimize the bound above, which is achieved by λ = ε/σ².
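A numeric sanity check of (1) in the Gaussian case (a standard normal is 1-subgaussian): the Cramer-Chernoff bound exp(−ε²/2) must dominate the exact tail, which we can compute with the complementary error function.

```python
import math

# Exact standard-normal tail P(X >= eps) = 0.5 * erfc(eps / sqrt(2))
# versus the Chernoff bound exp(-eps^2 / 2) from the proof above.
for eps in [0.5, 1.0, 2.0, 3.0]:
    exact_tail = 0.5 * math.erfc(eps / math.sqrt(2.0))
    chernoff = math.exp(-eps * eps / 2.0)
    assert exact_tail <= chernoff
```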
Variations

Union bound: P(A ∪ B) ≤ P(A) + P(B).

Corollary: P(|X| ≥ ε) ≤ 2 exp(−ε²/(2σ²)).

Equivalent "deviation" forms:

P( X ≥ √(2σ² log(1/δ)) ) ≤ δ,    P( |X| ≥ √(2σ² log(2/δ)) ) ≤ δ,

or, with probability 1 − δ, X ∈ ( −√(2σ² log(2/δ)), √(2σ² log(2/δ)) ).
Algebra with Subgaussian Random Variables

Lemma
Suppose that X is σ-subgaussian, and that X_1 and X_2 are independent and σ_1- and σ_2-subgaussian, respectively. Then:

(a) E[X] = 0 and V[X] ≤ σ².
(b) cX is |c|σ-subgaussian for all c ∈ R.
(c) X_1 + X_2 is √(σ_1² + σ_2²)-subgaussian.

Note
No matter what, X_1 + X_2 is (σ_1 + σ_2)-subgaussian. Independence improves this to √(σ_1² + σ_2²).
Concentration of the Mean

Corollary
Assume that the X_i − μ are independent, σ-subgaussian random variables. Then, for any ε ≥ 0,

P(μ̂ ≥ μ + ε) ≤ exp(−nε²/(2σ²))  and  P(μ̂ ≤ μ − ε) ≤ exp(−nε²/(2σ²)), (2)

where μ̂ = (1/n) ∑_{t=1}^n X_t.

Exercise 5
Using exp(−x) ≤ 1/(ex) (which holds for all x ≥ 0), show that except for very small ε the above inequality is strictly stronger than what we obtained via Chebyshev's inequality, and exponentially smaller (tighter) if nε² is large relative to σ².
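A Monte Carlo check of (2) with standard normal (hence 1-subgaussian) rewards; the sample sizes and seed are arbitrary choices for the demonstration.

```python
import math
import random

# Frequency of {sample mean >= mu + eps} over many repetitions should
# stay below the subgaussian bound exp(-n * eps^2 / 2) (sigma = 1).
random.seed(0)
n, eps, reps = 20, 0.5, 20000
hits = sum(
    1
    for _ in range(reps)
    if sum(random.gauss(0.0, 1.0) for _ in range(n)) / n >= eps
)
bound = math.exp(-n * eps * eps / 2.0)
assert hits / reps <= bound
```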
Deviation Form

Corollary
For any δ ∈ (0, 1], with probability at least 1 − δ,

μ̂ ≤ μ + √(2σ² log(1/δ)/n). (3)

Symmetrically, it also follows that with probability at least 1 − δ,

μ̂ ≥ μ − √(2σ² log(1/δ)/n). (4)
Further Examples

• If X is distributed like a Gaussian with zero mean and variance σ², then X is σ-subgaussian.
• If X is bounded and zero-mean (i.e., E[X] = 0 and |X| ≤ B almost surely for some B ≥ 0), then X is B-subgaussian.
• Specifically, if X is a shifted Bernoulli with P(X = 1 − p) = p and P(X = −p) = 1 − p, then X is 1/2-subgaussian.

Extension of the Definition
• X is σ-subgaussian if the noise X − E[X] is σ-subgaussian.
• A distribution is called σ-subgaussian if a random variable drawn from it is σ-subgaussian.
Bibliography

• Basic reference: Boucheron et al. (2013) (concentration under independence).
• Matrix versions of many standard results: Tropp (2015).
• Survey of classical results: McDiarmid (1998).
• Self-normalization (by standard deviation/variance): Pena et al. (2008).
• Empirical process theory: van de Geer (2000) or Dudley (2014).
Questions?
Standing Assumption (until further notice)

Assumption
All bandit instances are in E_SG^K(1), i.e., the reward distribution of every arm is 1-subgaussian.

Is this restrictive?
1. All the algorithms that follow rely on knowledge of σ.
2. Unequal subgaussianity constants across arms.
Notation

Empirical mean of arm i:

μ̂_i(t) = (1/T_i(t)) ∑_{s=1}^t I{A_s=i} X_s,

where

T_i(t) = ∑_{s=1}^t I{A_s=i}

and A_s ∈ [K] = {1, ..., K} is the index of the arm chosen in round s.
Explore-then-Commit (ETC)

• Explore each of the K arms m times.
• Go with the winner for the remaining rounds.

1: Input m ∈ N.
2: In round t choose action

A_t = i,                   if (t mod K) + 1 = i and t ≤ mK;
A_t = argmax_i μ̂_i(mK),   if t > mK

(ties in the argmax are broken arbitrarily).
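The pseudocode above can be sketched in a few lines of Python; the bandit interface (`K()`, `pull(a)`) mirrors the earlier exercises and is an assumption of this sketch, and the noiseless stand-in bandit below exists only to exercise the function.

```python
def explore_then_commit(bandit, m, n):
    K = bandit.K()
    counts = [0] * K
    sums = [0.0] * K
    committed = None
    for t in range(n):
        if t < m * K:
            a = t % K                  # exploration phase: round-robin
        elif committed is None:
            # commit once, after mK rounds, to the empirically best arm
            means = [sums[i] / counts[i] for i in range(K)]
            committed = means.index(max(means))
            a = committed
        else:
            a = committed              # exploitation phase
        x = bandit.pull(a)
        counts[a] += 1
        sums[a] += x
    return counts

class _FixedBandit:
    """Noiseless stand-in bandit: each arm pays its mean."""
    def __init__(self, means):
        self.means = means
    def K(self):
        return len(self.means)
    def pull(self, a):
        return self.means[a]

counts = explore_then_commit(_FixedBandit([0.3, 0.7]), m=5, n=100)
assert counts == [5, 95]   # 5 exploratory pulls each, then commit to arm 1
```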
History and Related Algorithms

• ε-greedy and friends: choose the winner in every round with probability 1 − ε; explore uniformly at random over all arms with probability ε (origin lost in history);
• "Certainty equivalence with forcing": Robbins (1952);
• In more complex bandits:
  • "epoch-greedy": Langford and Zhang (2008);
  • "Forced exploration": Abbasi-Yadkori et al. (2009); Abbasi-Yadkori (2009);
  • "Phased exploration and greedy exploitation" (PEGE): Rusmevichientong and Tsitsiklis (2010).
Notes

• Looks silly: explore for m steps, then exploit? Why should we care?
• Simplicity is great! Educational!
• ε-greedy and friends: choose the winner in every round with probability 1 − ε, explore uniformly at random over all arms with probability ε.
• In n rounds, a particular arm i will be chosen on average about nε/K times. So m ≈ nε/K, or ε = mK/n.
• Will the additional randomness help ε-greedy (for the environments considered)?
• Does it make sense to intermix exploration and exploitation steps, rather than exploring first and then exploiting?
• How to choose m? (How to choose ε?)
How Big Can the Regret Be?

Let a ∧ b = min(a, b) and (x)_+ = max(x, 0) for a, b, x ∈ R.

Basic regret decomposition identity:

R_n = ∑_{i=1}^K Δ_i E[T_i(n)].

Thus, it is enough to bound E[T_i(n)]:

E[T_i(n)] ≤ m ∧ ⌈n/K⌉ + (n − mK)_+ P(i = A_{mK+1})
          ≤ m ∧ ⌈n/K⌉ + (n − mK)_+ P( μ̂_i(mK) ≥ max_{j≠i} μ̂_j(mK) ).
Bounding the Probability

We have

P( μ̂_i(mK) ≥ max_{j≠i} μ̂_j(mK) ) ≤ P( μ̂_i(mK) ≥ μ̂_1(mK) )
                                  = P( μ̂_i(mK) − μ_i − (μ̂_1(mK) − μ_1) ≥ Δ_i ).

Claim: μ̂_i(mK) − μ_i − (μ̂_1(mK) − μ_1) is √(2/m)-subgaussian.

Hence, by (2), we have

P( μ̂_i(mK) − μ_i − μ̂_1(mK) + μ_1 ≥ Δ_i ) ≤ exp(−mΔ_i²/4).
ETC Regret Upper Bound

Theorem (Instance-Dependent Bound)
After n rounds, the expected regret R_n of the ETC policy satisfies

R_n ≤ (m ∧ ⌈n/K⌉) ∑_{i=1}^K Δ_i + (n − mK)_+ ∑_{i=1}^K Δ_i exp(−mΔ_i²/4). (5)

Instance-dependent: the bound depends on the Δ_i (properties of the bandit environment instance). Also known as gap-dependent or problem-dependent.

How to choose m?!
Optimal choice of m

Take K = 2. WLOG Δ_1 = 0, Δ := Δ_2. Then

R_n ≤ mΔ + (n − 2m)_+ Δ exp(−mΔ²/4) ≤ mΔ + nΔ exp(−mΔ²/4). (6)

Assume n is reasonably large. Then the optimal choice for m is

m*(n, Δ) = max{ 0, ⌈(4/Δ²) log(nΔ²/4)⌉ } (7)

and we get

R_n ≤ Δ + (4/Δ)(1 + max{0, log(nΔ²/4)}). (8)
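Formula (7) is easy to evaluate numerically; note that for small Δ (relative to n) the logarithm is negative and m* = 0, i.e., no dedicated exploration pays off. The specific (n, Δ) values below are illustrative.

```python
import math

# Tuned exploration length m*(n, Delta) from (7).
def m_star(n, delta):
    return max(0, math.ceil(4.0 / delta**2 * math.log(n * delta**2 / 4.0)))

assert m_star(1000, 0.5) == 67    # ceil(16 * log(62.5))
assert m_star(1000, 0.01) == 0    # log(0.025) < 0, so no exploration
```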
A Simple but Important Improvement

Regardless of the policy, R_n = Δ E[T_2(n)] ≤ Δn. Combining this with (8), we get

R_n ≤ min{ nΔ, Δ + (4/Δ)(1 + log(nΔ²/4)) }. (9)

Corollary ((Infeasible) Worst-Case Bound)
Consider ETC with m = ⌈n/K⌉ ∧ m*(Δ, n). Then there exists C > 0 such that for any Δ > 0 and n > 0, R_n ≤ C√n.
Instance-Adaptive Algorithms

Is there a good, non-cheating choice of m (dependence on n is allowed, but not on Δ)?

Claim: the best such choice gives R_n = Θ(n^{2/3}).

Earlier we had R_n = O(n^{1/2}), much better! Is there some other algorithm that achieves this without knowing Δ? "Adaptation to Δ". (E.g., Auer and Ortner (2010) and Garivier et al. (2016).)

Notice that a worst-case bound shows(!) whether adaptation to individual instances happens!
Exercise 6 (soln.)

Gaussian bandit, K = 2, σ² = 1, μ_1 = 0 and μ_2 = −Δ. Set n = 1000, repeat the simulations N = 10⁴ times and report the average.

Plot, as a function of Δ ∈ [0, 1]:

(a) the theoretical upper bound given in (9) (blue);
(b) the regret of the ETC algorithm with m set as suggested in (7) (green);
(c) the regret of the ETC algorithm with "the" optimal m (yellow; calculated numerically using that the noise is exactly Gaussian).

What can we conclude?

Figure: Expected regret as a function of Δ.
Exercise 7

Let R̄_n = ∑_{t=1}^n Δ_{A_t} (random; the "pseudo-regret").

(a) Fix Δ = 1/10 and plot the average of R̄_n as a function of m with n = 2000 (upper plot).
(b) Plot the variance V[R̄_n] as a function of m for the same bandit (lower plot).
(c) Explain the curves and reconcile them with the theory.
(d) Did it make sense to plot V[R̄_n]? Why or why not?

Figure: Expected regret (upper) and variance of the regret (lower) of ETC as functions of m.
Questions?
Optimism Principle

Optimism in the Face of Uncertainty (OFU) Principle
Choose your actions as if the environment is as nice as plausibly possible.
Illustration

Visiting a new country. Shall I try the local cuisine/beer/...? Or stick to what I know?

Optimism: yes. Pessimism: no.

Optimism leads to exploration; pessimism prevents exploration.

Exploration is necessary: one can be unlucky with one's priors! Hence, optimism is good.

How much?
On What is Plausible

Recall: if X_1, X_2, ..., X_n are independent and 1-subgaussian with mean μ, and μ̂ = ∑_{t=1}^n X_t/n, then for any δ ∈ (0, 1],

P( μ̂ ≥ μ + √(2 log(1/δ)/n) ) ≤ δ. (10)

Round t: how big can μ_i be? Data: μ̂_i(t − 1) (the empirical mean), based on T_i(t − 1) observations. Define

UCB_i(t − 1, δ) = μ̂_i(t − 1) + √(2 log(1/δ) / T_i(t − 1)). (11)

Caveat: T_i(t − 1) is random, hence it is not clear whether P(UCB_i(t − 1, δ) ≤ μ_i) ≤ δ holds.
The UCB(δ) Algorithm

1: Input K and δ.
2: Choose each action once.
3: For rounds t > K choose action

A_t = argmax_i UCB_i(t − 1, δ).

Note
Although there are many versions of the UCB algorithm, we often do not distinguish them by name and hope the context makes clear which is meant. For the rest of this tutorial we will usually call UCB(δ) just UCB.
Notes

• The algorithm first chooses each arm once, which is necessary because the term inside the square root is undefined when T_i(t − 1) = 0.
• The value inside the argmax is called the index of arm i.
• An index algorithm chooses, in each round, the arm that maximizes some value (the index), which usually depends only on the current time step and the samples from that arm.
• In the case of UCB, the index is the sum of the empirical mean of the rewards experienced so far and the exploration bonus (also known as the confidence width).
• The algorithm will work so that the UCB indices are approximately the same all the time (why?).
Demo: http://downloads.tor-lattimore.com/bandits/
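The index policy can be sketched as follows, using the formula in (11); the bandit interface (`K()`, `pull(a)`) again mirrors the earlier exercises and is an assumption of this sketch, as is the noiseless stand-in bandit used to exercise it.

```python
import math

def ucb(bandit, delta, n):
    K = bandit.K()
    counts = [0] * K
    sums = [0.0] * K
    for t in range(n):
        if t < K:
            a = t                      # initialization: each arm once
        else:
            # index_i = empirical mean + exploration bonus, as in (11)
            index = [
                sums[i] / counts[i]
                + math.sqrt(2.0 * math.log(1.0 / delta) / counts[i])
                for i in range(K)
            ]
            a = index.index(max(index))
        x = bandit.pull(a)
        counts[a] += 1
        sums[a] += x
    return counts

class _FixedBandit:
    """Noiseless stand-in bandit: each arm pays its mean."""
    def __init__(self, means):
        self.means = means
    def K(self):
        return len(self.means)
    def pull(self, a):
        return self.means[a]

n = 1000
counts = ucb(_FixedBandit([0.9, 0.1]), delta=1.0 / n**2, n=n)
assert sum(counts) == n and counts[0] > counts[1]
```

Even with noiseless rewards the suboptimal arm is pulled more than once: its shrinking exploration bonus has to fall below the gap before UCB stops choosing it.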
Instance-Dependent Bound for UCB

Theorem
Consider UCB, as shown earlier, on a stochastic K-armed 1-subgaussian bandit problem. For any horizon n, if δ = 1/n², then

R_n ≤ 3 ∑_{i=1}^K Δ_i + ∑_{i: Δ_i>0} 16 log(n)/Δ_i.
Proof: Main Ideas

Regret decomposition identity:

R_n = ∑_{i=1}^K Δ_i E[T_i(n)].

Take i such that Δ_i > 0 and bound E[T_i(n)].

Key observation: after initialization, action i can only be chosen if its index is higher than that of an optimal arm. This can only happen if at least one of the following is true:
(a) the index of action i is larger than the true mean of a specific optimal arm;
(b) the index of a specific optimal arm is smaller than its true mean.

Both of these have low probability. In particular, E[T_i(n)] ≤ c_1 + c_2 log(n)/Δ_i². QED.
Proof: The Reward Consumption Model

For i ∈ [K], let (Z_{i,s})_s be an i.i.d. sequence with Z_{i,s} ∼ P_i. Define X_t to be the T_{A_t}(t)-th element of the sequence (Z_{A_t,s})_s:

X_t = Z_{A_t, T_{A_t}(t)}. (12)

Is there any loss of generality? No: the interaction protocol is equivalent to a constraint on the distribution of (A_1, X_1, ..., A_n, X_n), and this constraint clearly holds in this case.

Benefit: defining the usual sample means

μ̂_{i,s} = (1/s) ∑_{u=1}^s Z_{i,u},  s ∈ [n],

we have μ̂_i(t) = μ̂_{i, T_i(t)}.
Proof: Some Details II.

WLOG μ_1 = μ*. Fix i such that Δ_i > 0. Let G_i be the 'good' event defined by

G_i = { μ_1 < min_{t∈[n]} UCB_1(t) } ∩ { μ̂_{i,u_i} + √((2/u_i) log(1/δ)) < μ_1 },

where u_i ∈ [n] is a constant to be chosen later. Then:

1. If G_i occurs, then T_i(n) ≤ u_i.
2. The complement event G_i^c occurs with low probability (governed, in a way yet to be discovered, by u_i).

Because T_i(n) ≤ n no matter what, this will mean that

E[T_i(n)] = E[I{G_i} T_i(n)] + E[I{G_i^c} T_i(n)] ≤ u_i + P(G_i^c) n. (13)

For details, see our website.
Worst-case Bound

Theorem
If δ = 1/n², then the regret of UCB(δ) on any environment ν ∈ E_SG^K(1) is bounded by

R_n ≤ 8√(nK log(n)) + 3 ∑_{i=1}^K Δ_i.
Notes

Recall: for all ν ∈ E_SG^K(1),

R_n(ν) ≤ 8√(nK log(n)) + 3 ∑_{i=1}^K Δ_i.

• Same tuning as in the other result. Good!
• Not anytime. Hmm...
• The additive ∑_i Δ_i term is unavoidable: all reasonable algorithms must play each arm once (exercise: what if not?).
• The bound is close to optimal: no algorithm can enjoy regret smaller than const·√(nK) over all problems in E_SG^K(1).
• A more complicated variant of UCB(δ) shaves the logarithmic term from the upper bound given above.
Proof

The previous proof actually gives

E[T_i(n)] ≤ 3 + 16 log(n)/Δ_i².

Using the basic regret decomposition and choosing some Δ > 0,

R_n = ∑_{i=1}^K Δ_i E[T_i(n)] = ∑_{i: Δ_i<Δ} Δ_i E[T_i(n)] + ∑_{i: Δ_i≥Δ} Δ_i E[T_i(n)]
    ≤ nΔ + ∑_{i: Δ_i≥Δ} ( 3Δ_i + 16 log(n)/Δ_i )
    ≤ nΔ + 16K log(n)/Δ + 3 ∑_i Δ_i
    ≤ 8√(nK log(n)) + 3 ∑_{i=1}^K Δ_i,

where the first inequality follows because ∑_{i: Δ_i<Δ} T_i(n) ≤ n, and the last line by choosing Δ = √(16K log(n)/n). QED.
Demo/Exercise 8

Setup: n = 1000, K = 2, P_1 = N(0, 1), P_2 = N(−Δ, 1). Plot estimates of the expected regret of UCB and of ETC with m ∈ {25, 50, 75, 100, Optimum} as Δ ranges over [0, 1].

Figure: Expected regret as a function of Δ for ETC (m = 25, 50, 75, 100, optimal m) and UCB.

What can we conclude?
Exercise 9/Demo
Compare the regret histogram of UCB and ETC.
Demo
Literature

• Confidence bounds and the OFU principle are due to Lai and Robbins (1985) (context: parametric bandits, asymptotics).
• UCB algorithm: Katehakis and Robbins (1995) (Gaussian bandits) and Agrawal (1995). Still asymptotics. Agrawal (1995)'s analysis is modular: all that is needed are appropriate UCBs (the form is irrelevant).
• Independently, Kaelbling (1993) also discovered UCB. No regret analysis, or clear advice on how to tune the confidence parameter.
• "This" UCB is most similar to UCB1 of Auer et al. (2002), except that n is t in UCB1. Finite-time regret bound, payoffs bounded in [0, 1] (i.e., 1/2-subgaussian).
• Worst-case bound: Bubeck and Cesa-Bianchi (2012), which focuses on the subgaussian setup.
Questions?
94 / 108
Asymptotically Optimal UCB (AO-UCB)
1: Input: K
2: Choose each arm once
3: Subsequently choose

    At = argmax_i ( µi(t − 1) + √( 2 log f(t) / Ti(t − 1) ) ),

where f(t) = 1 + t log²(t).

96 / 108
Asymptotic Regret, Asymptotic Optimality
Theorem (Upper Bound)

The regret of AO-UCB satisfies

    lim sup_{n→∞} Rn(ν) / log(n) ≤ Σ_{i: Δi>0} 2/Δi .    (14)

Theorem (Lower Bound)

For any policy π that has subpolynomial regret for all 1-subgaussian environments (i.e., Rn(ν, π) = o(n^p) for all p > 0 and all ν), and for any instance ν with gaps Δ = Δ(ν),

    lim inf_{n→∞} Rn(ν, π) / log(n) ≥ Σ_{i: Δi>0} 2/Δi .    (15)
97 / 108
Finite-time Regret for AO-UCB
Corollary

There exists a universal constant C > 0 such that the regret of AO-UCB is bounded by

    Rn ≤ C Σ_{i: Δi>0} ( Δi + log(n)/Δi ),

and, in particular,

    Rn ≤ C Σ_{i=1}^K Δi + 2 √( C n K log(n) ) .
98 / 108
Zoo of UCBs
| Strategy     | Reference                  | Conf.       | Dist.-free regret | Asy. opt. | Inst. opt. | Anytime |
|--------------|----------------------------|-------------|-------------------|-----------|------------|---------|
| UCB          | Auer et al. (2002)         | t           | √(kn log(n))      | ✓         | ✓          | ✓       |
| UCB*         | Lai (1987)                 | n/T         | √(kn log(k))      | ✓         | ✓          | ✗       |
| UCB+         | Garivier and Cappé (2011)  | t/T         | √(kn log(k))      | ✓         | ✓          | ✓       |
| MOSS         | Audibert and Bubeck (2009) | n/(kT)      | √(kn)             | ✓         | ✗          | ✗       |
| Anytime MOSS | Degenne and Perchet (2016) | t/(kT)      | √(kn)             | ✓         | ✗          | ✓       |
| OCUCB        | Lattimore (2015)           | (n/t)^{1+ε} | √(kn)             | ✗         | ✓          | ✗       |
| UCB†         | Lattimore (2017)           | see ref.    | √(kn)             | ✓         | ✓          | ✗       |
| UCB‡         | Lattimore (2017)           | see ref.    | √(kn log log(n))  | ✓         | ✓          | ✓       |

All strategies choose At = t for 1 ≤ t ≤ k and subsequently

    At = argmax_i ( µi(t − 1) + √( 2 log~(Conf.) / Ti(t − 1) ) ),    (16)

where the square-root term is the exploration bonus and log~(x) ≈ log(x) is approximately logarithmic. In the table, T stands for Ti(t − 1).
Demo with UCB and OCUCB:http://downloads.tor-lattimore.com/bandits/
100 / 108
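To see how the table's rows share template (16), here is a sketch of the generic index (my own code; as an assumption, log~ is replaced by x ↦ log(max(x, e)), just one concrete choice of an approximately logarithmic function, not the exact form used by any row):

```python
import math

def ucb_index(mean_hat, T_i, conf):
    """Generic index from (16): empirical mean plus exploration bonus.

    log~ is approximated by log(max(conf, e)), which keeps the
    argument of the logarithm at least e (an assumption of this sketch)."""
    return mean_hat + math.sqrt(2 * math.log(max(conf, math.e)) / T_i)

# Hypothetical instantiations of the "Conf." column for one arm:
t, n, k, T_i = 100, 1000, 5, 10
ucb_val  = ucb_index(0.4, T_i, t)              # UCB: Conf. = t
moss_val = ucb_index(0.4, T_i, n / (k * T_i))  # MOSS: Conf. = n/(k T_i)
```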
Dealing with Risk
Setup: K = 2, Gaussian noise, gap Δ. If δ is the probability that some (reasonable) algorithm misses the optimal arm, then

    Rn = O( nΔδ + (1/Δ) log(1/δ) ).

Optimizing over δ gives δ = 1/(nΔ²) and

    Rn = O( (1/Δ)(1 + log(nΔ²)) ).

How about E[Rn²]? We get E[Rn²] ≈ δ(nΔ)² = n. Too big!! Choose δ = (nΔ)^{−2} instead to get

    Rn = O( (1/Δ)(1/n + log(n²Δ²)) )

and E[Rn²] ≈ log²(n)!! For further info, see Audibert et al. (2007).

101 / 108
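The arithmetic behind the two tunings can be checked directly (illustrative only, using the slide's approximation E[Rn²] ≈ δ(nΔ)² for the contribution of the "miss" event):

```python
import math

n, Delta = 10_000, 0.5

delta_mean = 1 / (n * Delta**2)    # tuning that optimizes the expected regret
delta_risk = (n * Delta) ** -2     # tuning that also controls the second moment

# Contribution of the miss event to E[R_n^2] under each tuning:
second_moment_mean = delta_mean * (n * Delta) ** 2   # = n   -> too big
second_moment_risk = delta_risk * (n * Delta) ** 2   # = 1   -> fine
```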
Summary
• Bandits: The drosophila of "explore-exploit" problems.
  • Learner interacts with its environment, which is initially unknown.
  • We care about the total reward/regret.
  • Exploration is necessary to avoid large loss due to a (plausible) unlucky start.
  • Exploitation is necessary to avoid large loss due to being just curious all the time.
  • What is the right amount?
• Stochastic, finite-armed bandits:
  • Explore-then-commit: Ideal, but infeasible tuning shows what can be achieved.
  • Optimism does it: UCB.
  • Simultaneously satisfying many optimality criteria is possible with a carefully tuned UCB.
• We left out lower bound proofs and many other topics.

103 / 108
Questions?
104 / 108
References I
Abbasi-Yadkori, Y. (2009). Forced-exploration based algorithms for playing in bandits with large action sets. Master's thesis, University of Alberta, Department of Computing Science.

Abbasi-Yadkori, Y., Antos, A., and Szepesvári, C. (2009). Forced-exploration based algorithms for playing in stochastic linear bandits. In COLT Workshop on On-line Learning with Limited Feedback.

Agrawal, R. (1995). Sample mean based index policies with O(log n) regret for the multi-armed bandit problem. Advances in Applied Probability, pages 1054–1078.

Audibert, J.-Y. and Bubeck, S. (2009). Minimax policies for adversarial and stochastic bandits. In Proceedings of Conference on Learning Theory (COLT), pages 217–226.

Audibert, J.-Y., Munos, R., and Szepesvári, C. (2007). Tuning bandit algorithms in stochastic environments. In Algorithmic Learning Theory (ALT), pages 150–165. Springer.

Auer, P., Cesa-Bianchi, N., and Fischer, P. (2002). Finite-time analysis of the multiarmed bandit problem. Machine Learning, 47:235–256.

Auer, P., Cesa-Bianchi, N., Freund, Y., and Schapire, R. E. (1995). Gambling in a rigged casino: The adversarial multi-armed bandit problem. In 36th Annual Symposium on Foundations of Computer Science (FOCS), pages 322–331. IEEE.

Auer, P. and Ortner, R. (2010). UCB revisited: Improved regret bounds for the stochastic multi-armed bandit problem. Periodica Mathematica Hungarica, 61(1-2):55–65.
105 / 108
References II
Bather, J. and Chernoff, H. (1967). Sequential decisions in the control of a spaceship. In Fifth Berkeley Symposium on Mathematical Statistics and Probability, volume 3, pages 181–207.

Berry, D. and Fristedt, B. (1985). Bandit Problems: Sequential Allocation of Experiments. Chapman and Hall, London; New York.

Boucheron, S., Lugosi, G., and Massart, P. (2013). Concentration Inequalities: A Nonasymptotic Theory of Independence. OUP Oxford.

Bubeck, S. and Cesa-Bianchi, N. (2012). Regret Analysis of Stochastic and Nonstochastic Multi-armed Bandit Problems. Foundations and Trends in Machine Learning. Now Publishers Incorporated.

Chernoff, H. (1959). Sequential design of experiments. The Annals of Mathematical Statistics, 30(3):755–770.

Degenne, R. and Perchet, V. (2016). Anytime optimal algorithms in stochastic multi-armed bandits. In Proceedings of International Conference on Machine Learning (ICML).

Dudley, R. M. (2014). Uniform Central Limit Theorems, volume 142. Cambridge University Press.

Garivier, A. and Cappé, O. (2011). The KL-UCB algorithm for bounded stochastic bandits and beyond. In Proceedings of Conference on Learning Theory (COLT).

Garivier, A., Kaufmann, E., and Lattimore, T. (2016). On explore-then-commit strategies. In Advances in Neural Information Processing Systems (NIPS).

Gittins, J. (1979). Bandit processes and dynamic allocation indices. Journal of the Royal Statistical Society, Series B (Methodological), 41(2):148–177.
106 / 108
References III
Gittins, J., Glazebrook, K., and Weber, R. (2011). Multi-armed Bandit Allocation Indices. John Wiley & Sons.

Kaelbling, L. P. (1993). Learning in Embedded Systems. MIT Press.

Katehakis, M. N. and Robbins, H. (1995). Sequential choice from several populations. Proceedings of the National Academy of Sciences of the United States of America, 92(19):8584.

Lai, T. L. (1987). Adaptive treatment allocation and the multi-armed bandit problem. The Annals of Statistics, pages 1091–1114.

Lai, T. L. and Robbins, H. (1985). Asymptotically efficient adaptive allocation rules. Advances in Applied Mathematics, 6(1):4–22.

Langford, J. and Zhang, T. (2008). The epoch-greedy algorithm for multi-armed bandits with side information. In NIPS, pages 817–824.

Lattimore, T. (2015). Optimally confident UCB: Improved regret for finite-armed bandits. arXiv preprint arXiv:1507.07880.

Lattimore, T. (2017). Auto-tuning the confidence level for optimistic bandit strategies. Technical report.

McDiarmid, C. (1998). Concentration. In Probabilistic Methods for Algorithmic Discrete Mathematics, pages 195–248. Springer.

Peña, V. H., Lai, T. L., and Shao, Q.-M. (2008). Self-normalized Processes: Limit Theory and Statistical Applications. Springer Science & Business Media.
107 / 108
References IV
Robbins, H. (1952). Some aspects of the sequential design of experiments. Bulletin of the American Mathematical Society, 58(5):527–535.

Rusmevichientong, P. and Tsitsiklis, J. N. (2010). Linearly parameterized bandits. Mathematics of Operations Research, 35(2):395–411.

Thompson, W. (1933). On the likelihood that one unknown probability exceeds another in view of the evidence of two samples. Biometrika, 25(3/4):285–294.

Tropp, J. A. (2015). An introduction to matrix concentration inequalities. Foundations and Trends in Machine Learning, 8(1-2):1–230.

van de Geer, S. (2000). Empirical Processes in M-estimation, volume 6. Cambridge University Press.
108 / 108