
Exploration-Exploitation in Reinforcement Learning
Part 2 – Regret Minimization in Tabular MDPs

Mohammad Ghavamzadeh, Alessandro Lazaric and Matteo Pirotta
Facebook AI Research

Outline

1. Tabular Model-Based
   • Optimistic
   • Randomized

2. Tabular Model-Free Algorithms

Website: https://rlgammazero.github.io

Minimax Lower Bound

Theorem (adapted from Jaksch et al. [2010])
For any MDP $M^\star = \langle S, A, p_h, r_h, H \rangle$ with stationary transitions ($p_1 = p_2 = \ldots = p_H$), any algorithm $\mathcal{A}$ after $K$ episodes suffers a regret of at least
$$\Omega\big(\sqrt{HSAT}\big), \quad \text{with } T = HK.$$

If the transitions are non-stationary:
• $p_1, \ldots, p_H$ can be arbitrarily different
• the effective number of states is $S' = HS$
• lower bound: $\Omega\big(H\sqrt{SAT}\big)$

Tabular MDPs: Outline

1. Tabular Model-Based
   • Optimistic
   • Randomized

2. Tabular Model-Free Algorithms


The Optimism Principle: Intuition

Exploration vs. Exploitation

Optimism in the Face of Uncertainty: when you are uncertain, consider the best possible world (reward-wise).

• If the best possible world is correct =⇒ no regret (exploitation)
• If the best possible world is wrong =⇒ learn useful information (exploration)

Optimism in the value function

History: OFU for Regret Minimization in Tabular MDPs

FH: finite-horizon, AR: average reward

• Agrawal [1990]
• Auer and Ortner [2006] (AR)
• Bartlett and Tewari [2009] (AR)
• Filippi et al. [2010] (AR)
• Jaksch et al. [2010] (AR)
• Talebi and Maillard [2018] (AR)
• Fruit et al. [2018b] (AR)
• Fruit et al. [2018a] (AR)
• Fruit et al. [2019] (arXiv, not published)
• Tossou et al. [2019] (arXiv, not published; possibly incorrect)
• Qian et al. [2019] (AR)
• Zhang and Ji [2019] (AR)
• Azar et al. [2017] (FH)
• Zanette and Brunskill [2018] (FH)
• Kakade et al. [2018] (arXiv, not published; possibly incorrect)
• Jin et al. [2018] (FH)
• Zanette and Brunskill [2019] (FH)
• Efroni et al. [2019] (FH)


Learning Problem

Input: $S$, $A$ ($r_h$ and $p_h$ unknown)
Initialize $Q_{h,1}(s,a) = 0$ for all $(s,a) \in S \times A$ and $h = 1, \ldots, H$; $D_1 = \emptyset$
for $k = 1, \ldots, K$ do  // episodes
    Observe initial state $s_{1k}$ (arbitrary)
    Compute $(Q_{hk})_{h=1}^H$ from $D_k$   ← this step defines the type of algorithm
    Define $\pi_k$ based on $(Q_{hk})_{h=1}^H$
    for $h = 1, \ldots, H$ do
        Execute $a_{hk} = \pi_{hk}(s_{hk})$
        Observe $r_{hk}$ and $s_{h+1,k}$
    end
    Add trajectory $(s_{hk}, a_{hk}, r_{hk})_{h=1}^H$ to $D_{k+1}$
end
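To make the protocol concrete, here is a minimal Python sketch of this episodic loop (illustrative only; the `env` interface and the `compute_q` hook are hypothetical names, not from the tutorial). Every algorithm in the rest of the section only changes how `compute_q` turns the dataset $D_k$ into $(Q_{hk})_{h=1}^H$.

```python
import numpy as np

def run_episodes(env, compute_q, H, K):
    """Generic episodic learning loop: only `compute_q(data) -> Q[h, s, a]` changes."""
    data = []                                # D_k: list of past trajectories
    for k in range(K):
        Q = compute_q(data)                  # shape (H, S, A); defines the algorithm
        s = env.reset()                      # arbitrary initial state s_{1k}
        traj = []
        for h in range(H):
            a = int(np.argmax(Q[h, s]))      # pi_{hk}(s_{hk}) = argmax_a Q_{hk}(s_{hk}, a)
            s_next, r = env.step(h, s, a)    # observe r_{hk} and s_{h+1,k}
            traj.append((h, s, a, r, s_next))
            s = s_next
        data.append(traj)                    # D_{k+1}
    return data
```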

Model-based Learning

Input: $S$, $A$ ($r_h$ and $p_h$ unknown)
Initialize $Q_{h,1}(s,a) = 0$ for all $(s,a) \in S \times A$ and $h = 1, \ldots, H$; $D_1 = \emptyset$
for $k = 1, \ldots, K$ do  // episodes
    Observe initial state $s_{1k}$ (arbitrary)
    Estimate the empirical MDP $\widehat M_k = (S, A, \hat p_{hk}, \hat r_{hk}, H)$ from $D_k$:
    $\hat p_{hk}(s'|s,a) = \frac{\sum_{i=1}^{k-1} \mathbb{1}\{(s_{hi}, a_{hi}, s_{h+1,i}) = (s,a,s')\}}{N_{hk}(s,a)}, \qquad \hat r_{hk}(s,a) = \frac{\sum_{i=1}^{k-1} r_{hi}\,\mathbb{1}\{(s_{hi}, a_{hi}) = (s,a)\}}{N_{hk}(s,a)}$
    Planning (by backward induction) for $\pi_{hk}$
    for $h = 1, \ldots, H$ do
        Execute $a_{hk} = \pi_{hk}(s_{hk})$
        Observe $r_{hk}$ and $s_{h+1,k}$
    end
    Add trajectory $(s_{hk}, a_{hk}, r_{hk})_{h=1}^H$ to $D_{k+1}$
end
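A possible numpy implementation of these empirical estimators, assuming trajectories are stored as `(h, s, a, r, s_next)` tuples as in the generic loop sketched above (the tuple layout is an assumption of this sketch):

```python
import numpy as np

def empirical_model(data, S, A, H):
    """Visit counts N_hk(s,a), empirical transitions p_hat and mean rewards r_hat."""
    N = np.zeros((H, S, A))
    p_hat = np.zeros((H, S, A, S))
    r_sum = np.zeros((H, S, A))
    for traj in data:
        for (h, s, a, r, s_next) in traj:
            N[h, s, a] += 1
            p_hat[h, s, a, s_next] += 1
            r_sum[h, s, a] += r
    N_safe = np.maximum(N, 1)            # N_hk(s,a) ∨ 1, avoids division by zero
    p_hat /= N_safe[..., None]           # empirical transition frequencies
    r_hat = r_sum / N_safe               # empirical mean rewards
    return N, p_hat, r_hat
```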

Measuring Uncertainty

Bounded parameter MDP [Strehl and Littman, 2008]

$\mathcal{M}_k = \big\{ \langle S, A, r_h, p_h, H \rangle : \forall h \in [H],\ \forall (s,a) \in S \times A,\ r_h(s,a) \in B^r_{hk}(s,a),\ p_h(\cdot|s,a) \in B^p_{hk}(s,a) \big\}$

Compact confidence sets:
$B^r_{hk}(s,a) := \big[\hat r_{hk}(s,a) - \beta^r_{hk}(s,a),\ \hat r_{hk}(s,a) + \beta^r_{hk}(s,a)\big]$
$B^p_{hk}(s,a) := \big\{ p(\cdot|s,a) \in \Delta(S) : \|p(\cdot|s,a) - \hat p_{hk}(\cdot|s,a)\|_1 \le \beta^p_{hk}(s,a) \big\}$

Confidence bounds based on [Hoeffding, 1963] and [Weissman et al., 2003]:
$\beta^r_{hk}(s,a) \propto \sqrt{\frac{\log(N_{hk}(s,a)/\delta)}{N_{hk}(s,a)}}, \qquad \beta^p_{hk}(s,a) \propto \sqrt{\frac{S \log(N_{hk}(s,a)/\delta)}{N_{hk}(s,a)}}$

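A minimal sketch of these confidence widths; the slide only states proportionality, so the constants and the exact argument of the logarithm below are illustrative assumptions:

```python
import numpy as np

def confidence_widths(N, S, delta=0.05):
    """Hoeffding width for rewards, Weissman (L1) width for transitions."""
    N_safe = np.maximum(N, 1)                 # N_hk(s,a) ∨ 1
    log_term = np.log(N_safe / delta)
    beta_r = np.sqrt(log_term / N_safe)       # beta^r_hk(s,a)
    beta_p = np.sqrt(S * log_term / N_safe)   # beta^p_hk(s,a)
    return beta_r, beta_p
```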


Bounded Parameter MDP: Optimism

Fix a policy $\pi$. With high probability the set $\mathcal{M}_t$ of plausible MDPs contains $M^\star$, so maximizing the value of $\pi$ over $\mathcal{M}_t$ gives an upper confidence bound:

Optimism: $UCB_t(V^\pi_1) = \max_{M \in \mathcal{M}_t} V^\pi_{1,M} \ge V^\pi_{1,M^\star}$


Extended Value Iteration [Jaksch et al., 2010]

Input: $S$, $A$, $B^r_{hk}$, $B^p_{hk}$
Set $Q_{H+1,k}(s,a) = 0$ for all $(s,a) \in S \times A$
for $h = H, \ldots, 1$ do
    for $(s,a) \in S \times A$ do
        Compute
        $Q_{hk}(s,a) = \max_{r_h \in B^r_{hk}(s,a)} r_h(s,a) + \max_{p_h \in B^p_{hk}(s,a)} \mathbb{E}_{s' \sim p_h(\cdot|s,a)}\big[V_{h+1,k}(s')\big]$
        $\phantom{Q_{hk}(s,a)} = \hat r_{hk}(s,a) + \beta^r_{hk}(s,a) + \max_{p_h \in B^p_{hk}(s,a)} \mathbb{E}_{s' \sim p_h(\cdot|s,a)}\big[V_{h+1,k}(s')\big]$
        $V_{hk}(s) = \min\big\{H - (h-1),\ \max_{a \in A} Q_{hk}(s,a)\big\}$
    end
end
return $\pi_{hk}(s) = \arg\max_{a \in A} Q_{hk}(s,a)$   (policy executed at episode $k$)
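A possible implementation of EVI for the Hoeffding/Weissman confidence sets above. The inner maximization over the L1 ball follows the standard routine of Jaksch et al. [2010] (shift probability mass towards the highest-valued next states); array shapes follow the hypothetical helpers sketched earlier.

```python
import numpy as np

def max_expectation_l1(p_hat_sa, beta_sa, V_next):
    """max of p^T V_next over p in the simplex with ||p - p_hat||_1 <= beta.
    Standard routine: move mass towards the highest-valued next states."""
    p = p_hat_sa.copy()
    order = np.argsort(-V_next)                       # states by decreasing value
    best = order[0]
    p[best] = min(1.0, p_hat_sa[best] + beta_sa / 2)  # add up to beta/2 to the best state
    for s in order[::-1]:                             # remove the excess from the worst states
        excess = p.sum() - 1.0
        if excess <= 1e-12:
            break
        p[s] = max(0.0, p[s] - excess)
    return float(p @ V_next)

def extended_value_iteration(r_hat, p_hat, beta_r, beta_p, H):
    """Optimistic backward induction over the bounded-parameter MDP."""
    S, A = r_hat.shape[1], r_hat.shape[2]
    Q = np.zeros((H + 1, S, A))
    V = np.zeros((H + 1, S))
    for h in range(H - 1, -1, -1):
        for s in range(S):
            for a in range(A):
                ev = max_expectation_l1(p_hat[h, s, a], beta_p[h, s, a], V[h + 1])
                Q[h, s, a] = r_hat[h, s, a] + beta_r[h, s, a] + ev
            V[h, s] = min(H - h, Q[h, s].max())       # V_hk(s) = min{H-(h-1), max_a Q}
    return Q, V
```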


Optimism

(Figure: the optimistic estimate $Q_{hk}(s,a)$ computed by EVI lies above the true $Q^\star_h(s,a)$ computed by VI; the gap reflects the uncertainty on rewards and transitions.)

$\forall h \in [H],\ \forall (s,a): \quad Q_{hk}(s,a) \ge Q^\star_h(s,a)$

UCRL2-CH for Finite Horizon

Theorem (adapted from [Jaksch et al., 2010])
For any tabular MDP with stationary transitions, UCRL2 with Chernoff-Hoeffding confidence intervals (UCRL2-CH) suffers, with high probability, a regret
$R(K, M^\star, \text{UCRL2-CH}) = O\big(HS\sqrt{AT} + H^2SA\big)$

• Order optimal in $\sqrt{AT}$
• $\sqrt{HS}$ factor worse than the lower bound
• Lower bound: $\Omega(\sqrt{HSAT})$ (stationary transitions)

Extended Value Iteration

$Q_{hk}(s,a) = \max_{(r,p) \in B^r_{hk}(s,a) \times B^p_{hk}(s,a)} r + p^\top V_{h+1,k}$
$\phantom{Q_{hk}(s,a)} = \max_{r \in B^r_{hk}(s,a)} r + \max_{p \in B^p_{hk}(s,a)} p^\top V_{h+1,k}$
$\phantom{Q_{hk}(s,a)} = \hat r_{hk}(s,a) + \beta^r_{hk}(s,a) + \max_{p \in B^p_{hk}(s,a)} p^\top V_{h+1,k}$
$\phantom{Q_{hk}(s,a)} \le \hat r_{hk}(s,a) + \beta^r_{hk}(s,a) + \|p - \hat p_{hk}(\cdot|s,a)\|_1 \|V_{h+1,k}\|_\infty + \hat p_{hk}(\cdot|s,a)^\top V_{h+1,k}$
$\phantom{Q_{hk}(s,a)} \le \hat r_{hk}(s,a) + \beta^r_{hk}(s,a) + H\beta^p_{hk}(s,a) + \hat p_{hk}(\cdot|s,a)^\top V_{h+1,k}$

⇒ Exploration bonus $(1 + H\sqrt{S})\,\beta^r_{hk}(s,a)$ for the reward


UCBVI [Azar et al., 2017]

Replace EVI with an exploration bonus.

Input: $S$, $A$, $\hat r_{hk}$, $\hat p_{hk}$, $b_{hk}$
Set $Q_{H+1,k}(s,a) = 0$ for all $(s,a) \in S \times A$
for $h = H, \ldots, 1$ do
    for $(s,a) \in S \times A$ do
        Compute
        $Q_{hk}(s,a) = \hat r_{hk}(s,a) + b_{hk}(s,a) + \mathbb{E}_{s' \sim \hat p_{hk}(\cdot|s,a)}\big[V_{h+1,k}(s')\big]$
        $V_{hk}(s) = \min\big\{H - (h-1),\ \max_{a' \in A} Q_{hk}(s,a')\big\}$
    end
end
return $\pi_{hk}(s) = \arg\max_{a \in A} Q_{hk}(s,a)$

Equivalent to value iteration on $\widehat M_k = (S, A, \hat r_{hk} + b_{hk}, \hat p_{hk}, H)$
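A minimal sketch of UCBVI's backward induction with the Chernoff-Hoeffding bonus from the next slide. Constants in the bonus are illustrative, and `r_hat`, `p_hat`, `N` are assumed to come from the empirical-model helper sketched earlier.

```python
import numpy as np

def ucbvi_backward_induction(r_hat, p_hat, N, H, delta=0.05):
    """UCBVI-CH: value iteration on the empirical MDP with an additive bonus b_hk."""
    S, A = r_hat.shape[1], r_hat.shape[2]
    N_safe = np.maximum(N, 1)
    bonus = (H + 1) * np.sqrt(np.log(N_safe / delta) / N_safe)   # b_hk(s,a), constants illustrative
    Q = np.zeros((H + 1, S, A))
    V = np.zeros((H + 1, S))
    for h in range(H - 1, -1, -1):
        Q[h] = r_hat[h] + bonus[h] + p_hat[h] @ V[h + 1]          # (S,A,S) @ (S,) -> (S,A)
        V[h] = np.minimum(H - h, Q[h].max(axis=1))                # clipped greedy value
    return Q, V
```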

UCBVI: Measuring Uncertainty

Combine the uncertainties on rewards and transitions in a smart way:
$b_{hk}(s,a) = (H+1)\sqrt{\frac{\log(N_{hk}(s,a)/\delta)}{N_{hk}(s,a)}} \;<\; \beta^r_{hk} + H\beta^p_{hk}$

Save a $\sqrt{S}$ factor: it is enough to control the transition error against the fixed function $V^\star$ (Hoeffding) rather than uniformly over all value functions (Weissman):
$\Big|\big(p_h(\cdot|s,a) - \hat p_{hk}(\cdot|s,a)\big)^\top \underbrace{V^\star_h}_{\le H}\Big| \le H \underbrace{\sqrt{\frac{\log(N_{hk}(s,a)/\delta)}{N_{hk}(s,a)}}}_{= \beta^p_{hk}/\sqrt{S}}$


UCBVI-CH: Regret

Theorem (Thm. 1 of Azar et al. [2017])
For any tabular MDP with stationary transitions, UCBVI with Chernoff-Hoeffding confidence intervals (UCBVI-CH) suffers, with high probability, a regret
$R(K, M^\star, \text{UCBVI-CH}) = O\big(H\sqrt{SAT} + H^2S^2A\big)$

• Order optimal in $\sqrt{SAT}$
• $\sqrt{H}$ factor worse than the lower bound
• Long "warm up" phase (the $H^2S^2A$ term)
• If the transitions are non-stationary: $O\big(H^{3/2}\sqrt{SAT}\big)$
• Lower bound: $\Omega(\sqrt{HSAT})$ (stationary transitions)

Refined Confidence Bounds19

UCRL2 with Bernstein-Freedman bounds (instead of Hoeffding/Weissman): * see tutorial website

R(K,M?,UCRL2B) = O(√

H Γ SAT +H2S2A

), Still not matching the lower-bound!

UCBVI with Bernstein-Freedman bounds: *

R(K,M?,UCBVI-BF) = O(√

HSAT +H2S2A+H√T)

- Matching the Lower-Bound!

, Long “warm up” phase

Γ = maxh,s,a

‖ph(·|s, a)‖0 ≤ S

* stationary model (p1 = . . . = pH)

Ghavamzadeh, Lazaric and Pirotta


Refined Confidence Bounds

EULER [Zanette and Brunskill, 2019] keeps upper and lower bounds on $V^\star_h$:
$R(K, M^\star, \text{EULER}) = O\big(\sqrt{\mathcal{Q}^\star SAT} + \sqrt{S}\,SAH^2(\sqrt{S} + \sqrt{H})\big)$

Problem-dependent bound based on the environmental norm [Maillard et al., 2014]:
$\mathcal{Q}^\star = \max_{s,a,h} \big( \mathbb{V}(r_h(s,a)) + \mathbb{V}_{x \sim p_h(\cdot|s,a)}(V^\star_{h+1}(x)) \big), \qquad \mathbb{V}_{x \sim p}(f(x)) = \mathbb{E}_{x \sim p}\big[\big(f(x) - \mathbb{E}_{y \sim p}[f(y)]\big)^2\big]$

• Can remove the dependence on $H$
• Matching the lower bound in the worst case

UCRL2: RiverSwim

Hoeffding:
$b^r_{hk}(s,a) = r_{\max}\sqrt{\frac{L}{N}}, \qquad b^p_{hk}(s,a) = \sqrt{\frac{SL}{N}}$

Bernstein:
$b^r_{hk}(s,a) = \sqrt{\frac{L\,\mathbb{V}(\hat r_{hk})}{N}} + r_{\max}\frac{L}{N}, \qquad b^p_{hk}(s,a) = \sqrt{\frac{L\,\mathbb{V}(\hat p_{hk})}{N}} + \frac{L}{N}$

$\mathbb{V}(\hat r_{hk}) = \frac{1}{N}\sum_i (r_{h,i} - \hat r_{hk})^2$ is the population variance

$N = N_{hk}(s,a) \vee 1, \qquad L = \log(SAN/\delta)$

(Plot: cumulative regret $R(K)$ on RiverSwim over $K \le 4 \times 10^4$ episodes, comparing UCRL2-H and UCRL2-B.)


UCBVI: RiverSwim

Hoeffding:
$b_{hk}(s,a) = (H-h)\frac{L}{\sqrt{N}}$

Bernstein:
$b_{hk}(s,a) = \sqrt{\frac{L\,\mathbb{V}_{\hat p_{hk}}(V_{h+1,k})}{N}} + (H-h)\frac{L}{N} + \frac{H-h}{\sqrt{N}}$

$\mathbb{V}_p(V) = \mathbb{E}_{x \sim p}[(V(x) - \mu)^2]$ with $\mu = \mathbb{E}_{x \sim p}[V(x)]$

$N = N_{hk}(s,a) \vee 1, \qquad L = \log(SAN/\delta)$

(Plot: cumulative regret $R(K)$ on RiverSwim over $K \le 4 \times 10^4$ episodes, comparing UCRL2-H, UCRL2-B, UCBVI-H and UCBVI-B.)

Model-Based Advantages

• Learning efficiency: first-order optimal, matching the lower bound
• Counterfactual reasoning: optimistic/pessimistic value estimates for any $\pi$, useful for inference (e.g., safety)

Model-Based Issues

Complexity
• Space: $O(HS^2A)$ — a non-stationary model requires $H(\underbrace{S^2A}_{\text{transitions}} + \underbrace{SA}_{\text{rewards}})$
• Time: $O(K \cdot \underbrace{HS^2A}_{\text{planning by VI}})$ — can be reduced with incremental updates


Real-Time Dynamic Programming (RTDP) [Barto et al., 1995]

Input: $S$, $A$, $r_h$, $p_h$ (known model)
Initialize $V_{h,0}(s) = H - (h-1)$ for all $s \in S$ and $h \in [H]$
for $k = 1, \ldots, K$ do  // episodes
    Observe initial state $s_{1k}$ (arbitrary)
    for $h = 1, \ldots, H$ do
        $a_{hk} \in \arg\max_{a \in A} r_h(s_{hk}, a) + p_h(\cdot|s_{hk}, a)^\top V_{h+1,k-1}$   (1-step planning)
        $V_{h,k}(s_{hk}) = r_h(s_{hk}, a_{hk}) + p_h(\cdot|s_{hk}, a_{hk})^\top V_{h+1,k-1}$
        Observe $s_{h+1,k} \sim p_h(\cdot|s_{hk}, a_{hk})$
    end
end


Opt-RTDP: Incremental Planning [Efroni et al., 2019]

Optimism + RTDP: replace full backward-induction planning with a forward, optimistic 1-step update.

Input: $S$, $A$ ($r_h$ and $p_h$ unknown)
Initialize $V_{h,0}(s) = H - (h-1)$ for all $s \in S$ and $h \in [H]$; $D_1 = \emptyset$
for $k = 1, \ldots, K$ do  // episodes
    Observe initial state $s_{1k}$ (arbitrary)
    Estimate the empirical MDP $\widehat M_k = (S, A, \hat p_{hk}, \hat r_{hk}, H)$ from $D_k$
    for $h = 1, \ldots, H$ do
        Build an optimistic estimate $\overline Q(s_{hk}, a)$ for all $a \in A$ using $\hat p_{hk}$, $\hat r_{hk}$ and $V_{h+1,k-1}$
        (forward estimate, not backward: next stage but previous episode!)
        Set $V_{hk}(s_{hk}) = \min\big\{V_{h,k-1}(s_{hk}),\ \max_{a' \in A} \overline Q(s_{hk}, a')\big\}$
        Execute $a_{hk} = \pi_{hk}(s_{hk}) = \arg\max_{a \in A} \overline Q(s_{hk}, a)$
        Observe $r_{hk}$ and $s_{h+1,k}$
    end
    Add trajectory $(s_{hk}, a_{hk}, r_{hk})_{h=1}^H$ to $D_{k+1}$
end
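A minimal sketch of the per-step optimistic update, written around a generic `opt_expectation` routine (for a UCRL2-like step this could be the `max_expectation_l1` helper sketched for EVI above). All names are illustrative.

```python
import numpy as np

def opt_rtdp_step(h, s, V_prev, r_hat, p_hat, beta_r, beta_p, opt_expectation):
    """One forward Opt-RTDP step at stage h in state s.
    `V_prev` holds the value tables from the *previous* episode (k-1).
    `opt_expectation(p_hat_sa, beta_sa, V_next)` maximizes p^T V_next over the
    L1 confidence ball, e.g. the max_expectation_l1 routine sketched for EVI."""
    A = r_hat.shape[2]
    q = np.array([r_hat[h, s, a] + beta_r[h, s, a]
                  + opt_expectation(p_hat[h, s, a], beta_p[h, s, a], V_prev[h + 1])
                  for a in range(A)])
    a_t = int(np.argmax(q))                  # action executed: argmax_a Qbar(s_hk, a)
    v_new = min(V_prev[h][s], q.max())       # clipping keeps the estimates non-increasing
    return a_t, v_new
```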

Opt-RTDP: Properties

Non-increasing estimates: $V_{hk}(s) \le V_{h,k-1}(s)$. How?
• Optimistic initialization: $V_{h,0}(s) = H - (h-1)$
• Clipping: $V_{hk}(s_{hk}) = \min\big\{V_{h,k-1}(s_{hk}),\ \max_{a' \in A} \overline Q(s_{hk}, a')\big\}$

Opt-RTDP: Properties

Optimistic estimates: $V_{hk}(s) \ge V^\star_h(s)$. How?
• Optimistic initialization: $V_{h,0}(s) = H - (h-1)$
• Optimistic update. Example: UCRL2-like step
$\overline Q(s_{hk}, a) = \max_{r \in B^r_{hk}(s_{hk},a)} r(s_{hk}, a) + \max_{p \in B^p_{hk}(s_{hk},a)} \mathbb{E}_{s' \sim p(\cdot|s_{hk},a)}\big[V_{h+1,k-1}(s')\big]$

$V_{h+1,k-1}$ is one episode behind but still optimistic, hence $\overline Q$ is optimistic!


Opt-RTDP: Regret

Theorem (Thm. 8 of Efroni et al. [2019])
For any tabular MDP with stationary transitions, UCRL2-GP (Opt-RTDP based on UCRL2 with Hoeffding bounds) suffers, with high probability, a regret
$R(K, M^\star, \text{UCRL2-GP}) = O\big(HS\sqrt{AT} + H^2S^{3/2}A\big)$

• Same regret as UCRL2-CH
• Computationally more efficient: $O(SA)$ time per step, total runtime $O(HSAK)$
• Can be adapted to any algorithm (e.g., UCBVI, EULER)


UCRL2-GP: RiverSwim

Hoeffding:
$b^r_{hk}(s,a) = r_{\max}\sqrt{\frac{L}{N}}, \qquad b^p_{hk}(s,a) = \sqrt{\frac{SL}{N}}$

Bernstein:
$b^r_{hk}(s,a) = \sqrt{\frac{L\,\mathbb{V}(\hat r_{hk})}{N}} + r_{\max}\frac{L}{N}, \qquad b^p_{hk}(s,a) = \sqrt{\frac{L\,\mathbb{V}(\hat p_{hk})}{N}} + \frac{L}{N}$

$N = N_{hk}(s,a) \vee 1, \qquad L = \log(SAN/\delta)$

(Plot: cumulative regret $R(K)$ on RiverSwim over $K \le 4 \times 10^4$ episodes, comparing UCRL2-H, UCRL2-B, UCRL2-GP-H and UCRL2-GP-B.)

Tabular MDPs: Outline

1. Tabular Model-Based
   • Optimistic
   • Randomized

2. Tabular Model-Free Algorithms

Posterior Sampling (PS), a.k.a. Thompson Sampling [Thompson, 1933]

Keep a Bayesian posterior distribution $\mu_t$ over the set of MDPs (i.e., over the unknown MDP).

A sample from the posterior is used as an estimate of the unknown MDP.

• Few samples =⇒ uncertainty in the estimate =⇒ exploration
• More samples =⇒ posterior concentrates on the true MDP =⇒ exploitation

History: PS for Regret Minimization in Tabular MDPs

FH: finite-horizon, AR: average reward

• Thompson [1933]
• Strens [2000]
• Osband et al. [2013] (FH)
• Gopalan and Mannor [2015] (AR)
• Abbasi-Yadkori and Szepesvári [2015] (AR, possibly incorrect)
• Osband and Roy [2016] (arXiv, not published)
• Osband and Roy [2017] (FH, possibly incorrect)
• Ouyang et al. [2017] (AR, possibly incorrect assumptions)
• Agrawal and Jia [2017] (AR, possibly incorrect)
• Theocharous et al. [2018] (AR, possibly incorrect assumptions)
• Russo [2019] (FH)

Bayesian Regret

$R_B(K, \mu_1, \mathcal{A}) = \mathbb{E}_{M^\star \sim \mu_1}\big[R(K, M^\star, \mathcal{A})\big] = \mathbb{E}_{M^\star}\Big[\sum_{k=1}^K V^\star_{1,M^\star}(s_{1k}) - V^{\pi_k}_{1,M^\star}(s_{1k})\Big]$

where $R(K, M^\star, \mathcal{A})$ is the expected (frequentist) regret on the MDP $M^\star$.

Posterior Sampling [Osband and Roy, 2017]

Input: $S$, $A$ ($r_h$ and $p_h$ unknown), prior $\mu_1$
Initialize $D_1 = \emptyset$
for $k = 1, \ldots, K$ do  // episodes
    Observe initial state $s_{1k}$ (arbitrary)
    Sample $\widetilde M_k \sim \mu_k(\cdot|D_k)$
    Compute $\pi_k \in \arg\max_\pi V^\pi_{1,\widetilde M_k}$
    for $h = 1, \ldots, H$ do
        Execute $a_{hk} = \pi_{hk}(s_{hk})$
        Observe $r_{hk}$ and $s_{h+1,k}$
    end
    Add trajectory $(s_{hk}, a_{hk}, r_{hk})_{h=1}^H$ to $D_{k+1}$
end

Prior distribution: $\forall \Theta,\ \mathbb{P}(M^\star \in \Theta) = \mu_1(\Theta)$
Posterior distribution: $\forall \Theta,\ \mathbb{P}(M^\star \in \Theta \mid D_k, \mu_1) = \mu_k(\Theta)$

Priors: Dirichlet (transitions); Beta, Normal-Gamma, etc. (rewards)

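A minimal PSRL sketch under Dirichlet priors on the transitions; it assumes the rewards are summarized by a point estimate `mu_r` (a full implementation would also sample them, e.g. from the Beta or Normal-Gamma posteriors listed above). All names are illustrative.

```python
import numpy as np

def psrl_episode(alpha, mu_r, H, rng):
    """One PSRL episode: sample an MDP from the posterior, then plan on it.
    alpha[h, s, a] holds Dirichlet parameters for p_h(.|s,a); mu_r[h, s, a] is a
    point estimate of the mean reward."""
    S, A = alpha.shape[1], alpha.shape[2]
    p_tilde = np.zeros((H, S, A, S))
    for h in range(H):                                # sample the transitions
        for s in range(S):
            for a in range(A):
                p_tilde[h, s, a] = rng.dirichlet(alpha[h, s, a])
    V = np.zeros((H + 1, S))                          # exact planning by backward induction
    policy = np.zeros((H, S), dtype=int)
    for h in range(H - 1, -1, -1):
        Q = mu_r[h] + p_tilde[h] @ V[h + 1]           # shape (S, A)
        policy[h] = Q.argmax(axis=1)
        V[h] = Q.max(axis=1)
    return policy, V

# Usage (illustrative): policy, _ = psrl_episode(alpha, mu_r, H, np.random.default_rng(0))
```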


Model Update with Dirichlet Priors

(Assume $r$ is known.) The update maps $\mu_t$ and the observed transition $(s_t, a_t, s_{t+1}) \sim \mathcal{H}_t$ to $\mu_{t+1}$:

$\mu_t(s,a) = \text{Dirichlet}(\alpha_1, \ldots, \alpha_S)$ on $p(\cdot|s,a)$

Observe $s_{t+1} \sim p(\cdot|s_t, a_t)$ (outcome of a multivariate Bernoulli) with $s_{t+1} = i$. The Bayesian posterior is
$\mu_{t+1}(s,a) = \text{Dirichlet}(\alpha_1, \ldots, \alpha_i + 1, \ldots, \alpha_S)$

• Posterior mean: $\bar p_{t+1}(s_i|s,a) = \frac{\alpha_i}{n}$, with $n = \sum_{i=1}^S \alpha_i$
• Variance bounded by $\frac{1}{n}$
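The conjugate update above is a one-liner in code; a small worked example (illustrative, with a uniform Dirichlet(1, 1, 1) prior assumed for concreteness):

```python
import numpy as np

def dirichlet_update(alpha, s_next):
    """Posterior update after observing a transition to state s_next:
    Dirichlet(a_1, ..., a_S) -> Dirichlet(..., a_i + 1, ...)."""
    alpha = alpha.copy()
    alpha[s_next] += 1.0
    return alpha

def dirichlet_mean(alpha):
    """Posterior mean p(s_i | s, a) = alpha_i / n, with n = sum_i alpha_i."""
    return alpha / alpha.sum()

# Example: uniform prior over 3 next states, then observe a transition to state 2.
alpha = np.ones(3)                   # Dirichlet(1, 1, 1)
alpha = dirichlet_update(alpha, 2)   # Dirichlet(1, 1, 2)
print(dirichlet_mean(alpha))         # [0.25, 0.25, 0.5]
```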

Posterior Sampling is Usually Better

Bandits: [Chapelle and Li, 2011]          Finite-horizon RL: [Osband and Roy, 2017]

(Empirical comparison figures omitted.)

PSRL: Regret

Theorem (Osband and Roy [2017] revisited)
For any prior $\mu_1$ with independent Dirichlet priors over the (stationary) transitions, the Bayesian regret of PSRL is bounded as
$R_B(K, \mu_1, \text{PSRL}) = O\big(HS\sqrt{AT}\big)$

• Order optimal in $\sqrt{AT}$
• $\sqrt{HS}$ factor suboptimal
• Lower bound: $\Omega(\sqrt{HSAT})$ (stationary transitions)

* The bound stated in [Osband and Roy, 2017] is $O(H\sqrt{SAT})$ for stationary MDPs, but there is a mistake in their Lem. 3 (see [Qian et al., 2020]).

PSRL: RiverSwim

(Plot 1: cumulative regret $R(K)$ of PSRL on RiverSwim over $K \le 4 \times 10^4$ episodes.)

(Plot 2: the same experiment comparing PSRL with UCRL2-B and UCBVI-B.)

Tabular Randomized Least-Squares Value Iteration (RLSVI) [Russo, 2019]

Input: $S$, $A$, $H$
for $k = 1, \ldots, K$ do  // episodes
    Observe initial state $s_{1k}$ (arbitrary)
    Run Tabular-RLSVI on $D_k$ to obtain $(Q_{hk})_{h=1}^H$
    for $h = 1, \ldots, H$ do
        Execute $a_{hk} = \pi_{hk}(s_{hk}) = \arg\max_a Q_{hk}(s_{hk}, a)$
        Observe $r_{hk}$ and $s_{h+1,k}$
    end
    Add trajectory $(s_{hk}, a_{hk}, r_{hk})_{h=1}^H$ to $D_{k+1}$
end

* It is not necessary to store all the data; the updates can be done incrementally.

Tabular-RLSVI

Input: dataset $D_k = (s_{hi}, a_{hi}, r_{hi})_{h=1,i=1}^{H,k}$
Estimate the empirical MDP $\widehat M_k = (S, A, \hat p_{hk}, \hat r_{hk}, H)$:
$\hat p_{hk}(s'|s,a) = \frac{1}{N_{hk}(s,a)}\sum_{i=1}^{k-1} \mathbb{1}\{(s_{hi}, a_{hi}, s_{h+1,i}) = (s,a,s')\}, \qquad \hat r_{hk}(s,a) = \frac{1}{N_{hk}(s,a)}\sum_{i=1}^{k-1} r_{hi}\,\mathbb{1}\{(s_{hi}, a_{hi}) = (s,a)\}$

for $h = H, \ldots, 1$ do  // backward induction
    Sample $\xi_{hk} \sim \mathcal{N}(0, \sigma^2_{hk} I)$
    Compute, for all $(s,a) \in S \times A$,
    $Q_{hk}(s,a) = \hat r_{hk}(s,a) + \xi_{hk}(s,a) + \sum_{s' \in S} \hat p_{hk}(s'|s,a) V_{h+1,k}(s')$
end
return $(Q_{hk})_{h=1}^H$

⇒ Planning on the randomly perturbed MDP $\widetilde M_k = (S, A, \hat r_{hk} + \xi_{hk}, \hat p_{hk}, H)$
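A minimal sketch of the perturbed backward induction. It assumes the greedy value $V_{h,k}(s) = \max_a Q_{hk}(s,a)$ and an independent Gaussian draw per $(h, s, a)$; names and shapes are illustrative.

```python
import numpy as np

def rlsvi_backward_induction(r_hat, p_hat, sigma, H, rng):
    """Tabular RLSVI: value iteration on the empirical MDP with Gaussian reward
    perturbations xi_hk ~ N(0, sigma_hk^2), drawn independently per (h, s, a)."""
    S, A = r_hat.shape[1], r_hat.shape[2]
    Q = np.zeros((H + 1, S, A))
    V = np.zeros((H + 1, S))
    for h in range(H - 1, -1, -1):
        xi = rng.normal(0.0, sigma[h])                 # noise, shape (S, A)
        Q[h] = r_hat[h] + xi + p_hat[h] @ V[h + 1]
        V[h] = Q[h].max(axis=1)                        # greedy value wrt the perturbed Q
    return Q
```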

Tabular-RLSVI: Frequentist Regret

Theorem (Russo [2019])
For any tabular MDP with non-stationary transitions, Tab-RLSVI with
$\sigma_{hk}(s,a) = O\Big(\sqrt{\frac{SH^3}{N_{hk}(s,a) + 1}}\Big)$
suffers, with high probability, a frequentist regret
$R(K, M^\star, \text{Tab-RLSVI}) = O\big(H^{5/2} S^{3/2} \sqrt{AT}\big)$

• Order optimal in $\sqrt{AT}$
• $H^{3/2} S$ factor worse than the lower bound $\Omega(H\sqrt{SAT})$
• The analysis can be improved!

Tab-RLSVI (varying σ): RiverSwim

$\sigma_1$ (theory): $\sigma_h(s,a) = \frac{1}{4}\sqrt{\frac{(H-h)^3 S L}{N}}$

$\sigma_2$: $\sigma_h(s,a) = \frac{1}{4}\sqrt{\frac{(H-h)^2 L}{N}}$

$\sigma_3$: $\sigma_h(s,a) = \frac{1}{4}\sqrt{\frac{L}{N}}$

$N = N_{hk}(s,a) \vee 1, \qquad L = \log(SAN/\delta)$

(Plot 1: cumulative regret $R(K)$ on RiverSwim over $K \le 4 \times 10^4$ episodes for Tab-RLSVI-1, Tab-RLSVI-2 and Tab-RLSVI-3.)

(Plot 2: the same experiment comparing Tab-RLSVI-3 with UCRL2-H.)

Tabular MDPs: Outline

1. Tabular Model-Based
   • Optimistic
   • Randomized

2. Tabular Model-Free Algorithms

Model-Based Issues

Complexity
• Space: $O(HS^2A)$ — a non-stationary model requires $H(\underbrace{S^2A}_{\text{transitions}} + \underbrace{SA}_{\text{rewards}})$
• Time: $O(K \cdot \underbrace{HS^2A}_{\text{planning by VI}})$

Solutions
1. Time complexity: incremental planning (e.g., Opt-RTDP)
2. Space complexity: avoid estimating rewards and transitions
   ⇒ Optimistic Q-learning (Opt-QL): Space $O(HSA)$, Time $O(HAK)$


RiverSwim: Q-learning with ε-greedy Exploration

ε schedules compared:
• $\varepsilon_t = 1.0$
• $\varepsilon_t = 0.5$
• $\varepsilon_t = \frac{\varepsilon_0}{(N(s_t) - 1000)^{2/3}}$
• $\varepsilon_t = 1.0$ if $t < 6000$, else $\frac{\varepsilon_0}{N(s_t)^{1/2}}$
• $\varepsilon_t = 1.0$ if $t < 7000$, else $\frac{\varepsilon_0}{N(s_t)^{1/2}}$

Tuning the ε schedule is difficult and problem dependent.

Regret: $\Omega\big(\min\{T, A^{H/2}\}\big)$

Optimistic Q-learning

Input: $S$, $A$ ($r_h$ and $p_h$ unknown)
Initialize $Q_h(s,a) = H - (h-1)$ and $N_h(s,a) = 0$ for all $(s,a) \in S \times A$ and $h \in [H]$
for $k = 1, \ldots, K$ do  // episodes
    Observe initial state $s_{1k}$ (arbitrary)
    for $h = 1, \ldots, H$ do
        Execute $a_{hk} = \pi_{hk}(s_{hk}) = \arg\max_a Q_h(s_{hk}, a)$
        Observe $r_{hk}$ and $s_{h+1,k}$
        Set $N_h(s_{hk}, a_{hk}) = N_h(s_{hk}, a_{hk}) + 1$ and $t = N_h(s_{hk}, a_{hk})$
        Update
        $Q_h(s_{hk}, a_{hk}) = (1 - \alpha_t)\,Q_h(s_{hk}, a_{hk}) + \alpha_t\big(r_{hk} + V_{h+1}(s_{h+1,k}) + b_t\big)$   ($b_t$: upper-confidence bound)
        Set $V_h(s_{hk}) = \min\big\{H - (h-1),\ \max_{a \in A} Q_h(s_{hk}, a)\big\}$
    end
end
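A minimal sketch of a single Opt-QL update, with the step size $\alpha_t = \frac{H+1}{H+t}$ and a Hoeffding-style bonus $b_t \approx c\sqrt{H^3 \log(SAT/\delta)/t}$; the constant `c`, the `delta`, and passing `T` explicitly are illustrative assumptions.

```python
import numpy as np

def optql_update(Q, V, N, h, s, a, r, s_next, H, T, c=1.0, delta=0.05):
    """One Opt-QL update after observing (s, a, r, s') at stage h.
    Q has shape (H, S, A); V has shape (H+1, S); N counts visits per (h, s, a)."""
    N[h, s, a] += 1
    t = N[h, s, a]
    alpha = (H + 1.0) / (H + t)                                # alpha_t = (H+1)/(H+t)
    S, A = Q.shape[1], Q.shape[2]
    b = c * np.sqrt(H ** 3 * np.log(S * A * T / delta) / t)    # b_t ~ sqrt(H^3 log(SAT/d)/t)
    target = r + V[h + 1, s_next] + b
    Q[h, s, a] = (1 - alpha) * Q[h, s, a] + alpha * target
    V[h, s] = min(H - h, Q[h, s].max())                        # clipped greedy value
    return Q, V, N
```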

Step size αt

Q-learning typically uses $\alpha_t$ of order $O(1/t)$ or $O(1/\sqrt{t})$, with $t = N_{hk}(s,a)$.

Opt-QL instead uses
$\alpha_t = \frac{H+1}{H+t}$


Step size αt

Unrolling the Q-learning update ($t = N_{hk}(s,a)$):
$Q_{hk}(s,a) = \mathbb{1}\{t = 0\}\,H + \sum_{i=1}^t \alpha^i_t\big(r_{k_i} + V_{h+1,k_i}(s_{h+1,k_i}) + b_i\big), \qquad \text{with } \alpha^i_t = \alpha_i \prod_{j=i+1}^t (1 - \alpha_j)$

Optimistic initialization + weighted average of bootstrapped values, where $k_i = k : N_{hk}(s,a) = i$ is the episode of the $i$-th visit to $(s,a)$.

Idea: favor later updates — with $\alpha_t = \frac{H+1}{H+t}$, only the last $1/H$ fraction of the samples of $(s,a)$ has non-negligible weight; the earlier $1 - 1/H$ fraction is essentially forgotten.

Example: $H = 10$ and $t = N_{hk}(s,a) = 1000$.
(Plot: weights $\alpha^i_t$ as a function of the sample index $i$, for $\alpha_i = (H+1)/(H+i)$, $\alpha_i = 1/i$ and $\alpha_i = 1/\sqrt{i}$.)
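To see this effect numerically, the short script below (illustrative, not from the tutorial) computes the weights $\alpha^i_t$ for $H = 10$ and $t = 1000$ and measures how much weight falls on the most recent $1/H$ fraction of the samples.

```python
import numpy as np

def weights(alphas):
    """alpha^i_t = alpha_i * prod_{j=i+1}^t (1 - alpha_j), for i = 1, ..., t."""
    w = np.array(alphas, dtype=float)
    for i in range(len(w)):
        w[i] *= np.prod(1.0 - np.asarray(alphas[i + 1:]))
    return w

H, t = 10, 1000
i = np.arange(1, t + 1)
w_optql = weights((H + 1.0) / (H + i))     # alpha_i = (H+1)/(H+i)
w_avg = weights(1.0 / i)                   # alpha_i = 1/i (uniform averaging)

last = i > t * (1 - 1.0 / H)               # most recent 1/H fraction of the samples
print(w_optql[last].sum())                 # most of the total weight (which sums to 1)
print(w_avg[last].sum())                   # exactly 1/H with uniform weights
```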

Exploration Bonus bt

Let $t = N_{hk}(s,a)$. Then
$\Big|\sum_{i=1}^t \alpha^i_t\big(V^\star_{h+1}(s_{h+1,k_i}) - \mathbb{E}_{s'|s,a}[V^\star_{h+1}(s')]\big)\Big| \le \underbrace{c\sqrt{\frac{H^3 \log(SAT/\delta)}{t}}}_{:= b_t}$

Note that $\sum_{i=1}^t \alpha^i_t = 1$.

Opt-Q-learning: Regret

Theorem ([Jin et al., 2018])
For any tabular MDP with non-stationary transitions, Opt-QL with Hoeffding inequalities ($b_t = O(\sqrt{H^3/t})$) suffers, with high probability, a regret
$R(K, M^\star, \text{Opt-QL}) = O\big(H^2\sqrt{SAT} + H^2SA\big)$

• Order optimal in $\sqrt{SAT}$
• $H$ factor worse than the lower bound $\Omega(H\sqrt{SAT})$
• $\sqrt{H}$ factor worse than model-based with Hoeffding inequalities (UCBVI-CH for non-stationary $p_h$ suffers $O(H^{3/2}\sqrt{SAT})$), but with better second-order terms

The bound does not improve in stationary MDPs (i.e., $p_1 = \ldots = p_H$).

Opt-Q-learning: Example

(Plot: cumulative regret $R(K)$ of OptQL over $K \le 4 \times 10^4$ episodes.)

Refined Confidence Intervals

Opt-QL with Bernstein-Freedman bounds (instead of Hoeffding):
$R(K) = O\big(H^{3/2}\sqrt{SAT} + \sqrt{H^9S^3A^3}\big)$
• Still not matching the lower bound!
• $\sqrt{H}$ worse than model-based (e.g., UCBVI-BF)

Open Questions

1. Prove a frequentist regret bound for PSRL.
2. Should the gap between the regret of model-based and model-free algorithms exist?
3. Which algorithm is better in practice?

References

Yasin Abbasi-Yadkori and Csaba Szepesvári. Bayesian optimal control of smoothly parameterized systems. In UAI, pages 1–11. AUAI Press, 2015.

Rajeev Agrawal. Adaptive control of markov chains under the weak accessibility condition. In 29th IEEE Conference on Decision and Control, pages 1426–1431. IEEE, 1990.

Shipra Agrawal and Randy Jia. Optimistic posterior sampling for reinforcement learning: worst-case regret bounds. In NIPS, pages 1184–1194, 2017.

Peter Auer and Ronald Ortner. Logarithmic online regret bounds for undiscounted reinforcement learning. In NIPS, pages 49–56. MIT Press, 2006.

Mohammad Gheshlaghi Azar, Ian Osband, and Rémi Munos. Minimax regret bounds for reinforcement learning. In ICML, volume 70 of Proceedings of Machine Learning Research, pages 263–272. PMLR, 2017.

Peter L. Bartlett and Ambuj Tewari. REGAL: A regularization based algorithm for reinforcement learning in weakly communicating MDPs. In UAI, pages 35–42. AUAI Press, 2009.

Andrew G. Barto, Steven J. Bradtke, and Satinder P. Singh. Learning to act using real-time dynamic programming. Artificial Intelligence, 72(1-2):81–138, 1995.

Olivier Chapelle and Lihong Li. An empirical evaluation of Thompson sampling. In Advances in Neural Information Processing Systems 24, pages 2249–2257. Curran Associates, Inc., 2011. URL http://papers.nips.cc/paper/4321-an-empirical-evaluation-of-thompson-sampling.pdf.

Yonathan Efroni, Nadav Merlis, Mohammad Ghavamzadeh, and Shie Mannor. Tight regret bounds for model-based reinforcement learning with greedy policies. In NeurIPS, pages 12203–12213, 2019.

Sarah Filippi, Olivier Cappé, and Aurélien Garivier. Optimism in reinforcement learning and Kullback-Leibler divergence. In 2010 48th Annual Allerton Conference on Communication, Control, and Computing (Allerton), pages 115–122, 2010.

Ronan Fruit, Matteo Pirotta, and Alessandro Lazaric. Near optimal exploration-exploitation in non-communicating Markov decision processes. In NeurIPS, pages 2998–3008, 2018a.

Ronan Fruit, Matteo Pirotta, Alessandro Lazaric, and Ronald Ortner. Efficient bias-span-constrained exploration-exploitation in reinforcement learning. In ICML, Proceedings of Machine Learning Research. PMLR, 2018b.

Ronan Fruit, Matteo Pirotta, and Alessandro Lazaric. Improved analysis of UCRL2B, 2019. URL https://rlgammazero.github.io/docs/ucrl2b_improved.pdf.

Aditya Gopalan and Shie Mannor. Thompson sampling for learning parameterized Markov decision processes. In COLT, volume 40 of JMLR Workshop and Conference Proceedings, pages 861–898. JMLR.org, 2015.

Wassily Hoeffding. Probability inequalities for sums of bounded random variables. Journal of the American Statistical Association, 58(301):13–30, 1963. URL http://www.jstor.org/stable/2282952.

Thomas Jaksch, Ronald Ortner, and Peter Auer. Near-optimal regret bounds for reinforcement learning. Journal of Machine Learning Research, 11:1563–1600, 2010.

Chi Jin, Zeyuan Allen-Zhu, Sébastien Bubeck, and Michael I. Jordan. Is Q-learning provably efficient? In NeurIPS, pages 4868–4878, 2018.

Sham Kakade, Mengdi Wang, and Lin F. Yang. Variance reduction methods for sublinear reinforcement learning. CoRR, abs/1802.09184, 2018.

Odalric-Ambrym Maillard, Timothy A. Mann, and Shie Mannor. How hard is my MDP? "The distribution-norm to the rescue". In NIPS, pages 1835–1843, 2014.

Ian Osband and Benjamin Van Roy. Posterior sampling for reinforcement learning without episodes. CoRR, abs/1608.02731, 2016.

Ian Osband and Benjamin Van Roy. Why is posterior sampling better than optimism for reinforcement learning? In ICML, volume 70 of Proceedings of Machine Learning Research, pages 2701–2710. PMLR, 2017.

Ian Osband, Daniel Russo, and Benjamin Van Roy. (More) efficient reinforcement learning via posterior sampling. In NIPS, pages 3003–3011, 2013.

Yi Ouyang, Mukul Gagrani, Ashutosh Nayyar, and Rahul Jain. Learning unknown Markov decision processes: A Thompson sampling approach. In NIPS, pages 1333–1342, 2017.

Jian Qian, Ronan Fruit, Matteo Pirotta, and Alessandro Lazaric. Exploration bonus for regret minimization in discrete and continuous average reward MDPs. In NeurIPS, pages 4891–4900, 2019.

Jian Qian, Ronan Fruit, Matteo Pirotta, and Alessandro Lazaric. Concentration inequalities for multinoulli random variables. CoRR, abs/2001.11595, 2020.

Daniel Russo. Worst-case regret bounds for exploration via randomized value functions. In NeurIPS, pages 14410–14420, 2019.

Alexander L. Strehl and Michael L. Littman. An analysis of model-based interval estimation for Markov decision processes. Journal of Computer and System Sciences, 74(8):1309–1331, 2008.

Malcolm Strens. A Bayesian framework for reinforcement learning. In Proceedings of the Seventeenth International Conference on Machine Learning, pages 943–950. ICML, 2000.

Mohammad Sadegh Talebi and Odalric-Ambrym Maillard. Variance-aware regret bounds for undiscounted reinforcement learning in MDPs. In ALT, volume 83 of Proceedings of Machine Learning Research, pages 770–805. PMLR, 2018.

Georgios Theocharous, Zheng Wen, Yasin Abbasi, and Nikos Vlassis. Scalar posterior sampling with applications. In NeurIPS, pages 7696–7704, 2018.

William R. Thompson. On the likelihood that one unknown probability exceeds another in view of the evidence of two samples. Biometrika, 25(3-4):285–294, 1933.

Aristide C. Y. Tossou, Debabrota Basu, and Christos Dimitrakakis. Near-optimal optimistic reinforcement learning using empirical Bernstein inequalities. CoRR, abs/1905.12425, 2019.

Tsachy Weissman, Erik Ordentlich, Gadiel Seroussi, Sergio Verdú, and Marcelo J. Weinberger. Inequalities for the L1 deviation of the empirical distribution. 2003.

Andrea Zanette and Emma Brunskill. Problem dependent reinforcement learning bounds which can identify bandit structure in MDPs. In ICML, volume 80 of JMLR Workshop and Conference Proceedings, pages 5732–5740. JMLR.org, 2018.

Andrea Zanette and Emma Brunskill. Tighter problem-dependent regret bounds in reinforcement learning without domain knowledge using value function bounds. In ICML, volume 97 of Proceedings of Machine Learning Research, pages 7304–7312. PMLR, 2019.

Zihan Zhang and Xiangyang Ji. Regret minimization for reinforcement learning by evaluating the optimal bias function. In NeurIPS, pages 2823–2832, 2019.