
Exploration-Exploitation in Reinforcement Learning
Part 2 – Regret Minimization in Tabular MDPs

Mohammad Ghavamzadeh, Alessandro Lazaric and Matteo Pirotta
Facebook AI Research

Outline

1. Tabular Model-Based
   • Optimistic
   • Randomized

2. Tabular Model-Free Algorithms

Website: https://rlgammazero.github.io

Minimax Lower Bound

Theorem (adapted from Jaksch et al. [2010])
For any MDP $M^\star = \langle S, A, p_h, r_h, H \rangle$ with stationary transitions ($p_1 = p_2 = \ldots = p_H$), any algorithm $\mathcal{A}$ after $K$ episodes suffers a regret of at least
$$\Omega\big(\sqrt{HSAT}\big), \quad \text{with } T = HK.$$

If the transitions are non-stationary:
• $p_1, \ldots, p_H$ can be arbitrarily different
• the effective number of states is $S' = HS$
• lower bound: $\Omega\big(H\sqrt{SAT}\big)$

Tabular MDPs: Outline

1. Tabular Model-Based
   • Optimistic
   • Randomized

2. Tabular Model-Free Algorithms


The Optimism Principle: Intuition

Exploration vs. Exploitation

Optimism in the Face of Uncertainty: when you are uncertain, consider the best possible world (reward-wise).

• If the best possible world is correct =⇒ no regret (exploitation)
• If the best possible world is wrong =⇒ learn useful information (exploration)

Optimism in the value function

History: OFU for Regret Minimization in Tabular MDPs

FH: finite-horizon, AR: average reward

• Agrawal [1990]
• Auer and Ortner [2006] (AR)
• Bartlett and Tewari [2009] (AR)
• Filippi et al. [2010] (AR)
• Jaksch et al. [2010] (AR)
• Talebi and Maillard [2018] (AR)
• Fruit et al. [2018b] (AR)
• Fruit et al. [2018a] (AR)
• Fruit et al. [2019] (arXiv, not published)
• Tossou et al. [2019] (arXiv, not published; possibly incorrect)
• Qian et al. [2019] (AR)
• Zhang and Ji [2019] (AR)
• Azar et al. [2017] (FH)
• Zanette and Brunskill [2018] (FH)
• Kakade et al. [2018] (arXiv, not published; possibly incorrect)
• Jin et al. [2018] (FH)
• Zanette and Brunskill [2019] (FH)
• Efroni et al. [2019] (FH)


Learning Problem

Input: $S$, $A$ ($r_h$ and $p_h$ unknown)
Initialize $Q_{h,1}(s,a) = 0$ for all $(s,a) \in S \times A$ and $h = 1, \ldots, H$; $D_1 = \emptyset$
for $k = 1, \ldots, K$ do  // episodes
    Observe initial state $s_{1k}$ (arbitrary)
    Compute $(Q_{hk})_{h=1}^H$ from $D_k$   ← this step defines the type of algorithm
    Define $\pi_k$ based on $(Q_{hk})_{h=1}^H$
    for $h = 1, \ldots, H$ do
        Execute $a_{hk} = \pi_{hk}(s_{hk})$
        Observe $r_{hk}$ and $s_{h+1,k}$
    end
    Add trajectory $(s_{hk}, a_{hk}, r_{hk})_{h=1}^H$ to $D_{k+1}$
end
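To make the protocol concrete, here is a minimal Python sketch of this episodic loop (illustrative only; the `env` interface and the `compute_q` hook are hypothetical names, not from the tutorial). Every algorithm in the rest of the section only changes how `compute_q` turns the dataset $D_k$ into $(Q_{hk})_{h=1}^H$.

```python
import numpy as np

def run_episodes(env, compute_q, H, K):
    """Generic episodic learning loop: only `compute_q(data) -> Q[h, s, a]` changes."""
    data = []                                # D_k: list of past trajectories
    for k in range(K):
        Q = compute_q(data)                  # shape (H, S, A); defines the algorithm
        s = env.reset()                      # arbitrary initial state s_{1k}
        traj = []
        for h in range(H):
            a = int(np.argmax(Q[h, s]))      # pi_{hk}(s_{hk}) = argmax_a Q_{hk}(s_{hk}, a)
            s_next, r = env.step(h, s, a)    # observe r_{hk} and s_{h+1,k}
            traj.append((h, s, a, r, s_next))
            s = s_next
        data.append(traj)                    # D_{k+1}
    return data
```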

Model-based Learning

Input: $S$, $A$ ($r_h$ and $p_h$ unknown)
Initialize $Q_{h,1}(s,a) = 0$ for all $(s,a) \in S \times A$ and $h = 1, \ldots, H$; $D_1 = \emptyset$
for $k = 1, \ldots, K$ do  // episodes
    Observe initial state $s_{1k}$ (arbitrary)
    Estimate the empirical MDP $\widehat M_k = (S, A, \hat p_{hk}, \hat r_{hk}, H)$ from $D_k$:
    $\hat p_{hk}(s'|s,a) = \frac{\sum_{i=1}^{k-1} \mathbb{1}\{(s_{hi}, a_{hi}, s_{h+1,i}) = (s,a,s')\}}{N_{hk}(s,a)}, \qquad \hat r_{hk}(s,a) = \frac{\sum_{i=1}^{k-1} r_{hi}\,\mathbb{1}\{(s_{hi}, a_{hi}) = (s,a)\}}{N_{hk}(s,a)}$
    Planning (by backward induction) for $\pi_{hk}$
    for $h = 1, \ldots, H$ do
        Execute $a_{hk} = \pi_{hk}(s_{hk})$
        Observe $r_{hk}$ and $s_{h+1,k}$
    end
    Add trajectory $(s_{hk}, a_{hk}, r_{hk})_{h=1}^H$ to $D_{k+1}$
end
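A possible numpy implementation of these empirical estimators, assuming trajectories are stored as `(h, s, a, r, s_next)` tuples as in the generic loop sketched above (the tuple layout is an assumption of this sketch):

```python
import numpy as np

def empirical_model(data, S, A, H):
    """Visit counts N_hk(s,a), empirical transitions p_hat and mean rewards r_hat."""
    N = np.zeros((H, S, A))
    p_hat = np.zeros((H, S, A, S))
    r_sum = np.zeros((H, S, A))
    for traj in data:
        for (h, s, a, r, s_next) in traj:
            N[h, s, a] += 1
            p_hat[h, s, a, s_next] += 1
            r_sum[h, s, a] += r
    N_safe = np.maximum(N, 1)            # N_hk(s,a) ∨ 1, avoids division by zero
    p_hat /= N_safe[..., None]           # empirical transition frequencies
    r_hat = r_sum / N_safe               # empirical mean rewards
    return N, p_hat, r_hat
```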

Measuring Uncertainty

Bounded parameter MDP [Strehl and Littman, 2008]

$\mathcal{M}_k = \big\{ \langle S, A, r_h, p_h, H \rangle : \forall h \in [H],\ \forall (s,a) \in S \times A,\ r_h(s,a) \in B^r_{hk}(s,a),\ p_h(\cdot|s,a) \in B^p_{hk}(s,a) \big\}$

Compact confidence sets:
$B^r_{hk}(s,a) := \big[\hat r_{hk}(s,a) - \beta^r_{hk}(s,a),\ \hat r_{hk}(s,a) + \beta^r_{hk}(s,a)\big]$
$B^p_{hk}(s,a) := \big\{ p(\cdot|s,a) \in \Delta(S) : \|p(\cdot|s,a) - \hat p_{hk}(\cdot|s,a)\|_1 \le \beta^p_{hk}(s,a) \big\}$

Confidence bounds based on [Hoeffding, 1963] and [Weissman et al., 2003]:
$\beta^r_{hk}(s,a) \propto \sqrt{\frac{\log(N_{hk}(s,a)/\delta)}{N_{hk}(s,a)}}, \qquad \beta^p_{hk}(s,a) \propto \sqrt{\frac{S \log(N_{hk}(s,a)/\delta)}{N_{hk}(s,a)}}$

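A minimal sketch of these confidence widths; the slide only states proportionality, so the constants and the exact argument of the logarithm below are illustrative assumptions:

```python
import numpy as np

def confidence_widths(N, S, delta=0.05):
    """Hoeffding width for rewards, Weissman (L1) width for transitions."""
    N_safe = np.maximum(N, 1)                 # N_hk(s,a) ∨ 1
    log_term = np.log(N_safe / delta)
    beta_r = np.sqrt(log_term / N_safe)       # beta^r_hk(s,a)
    beta_p = np.sqrt(S * log_term / N_safe)   # beta^p_hk(s,a)
    return beta_r, beta_p
```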


Bounded Parameter MDP: Optimism

Fix a policy $\pi$. With high probability the set $\mathcal{M}_t$ of plausible MDPs contains $M^\star$, so maximizing the value of $\pi$ over $\mathcal{M}_t$ gives an upper confidence bound:

Optimism: $UCB_t(V^\pi_1) = \max_{M \in \mathcal{M}_t} V^\pi_{1,M} \ge V^\pi_{1,M^\star}$


Extended Value Iteration [Jaksch et al., 2010]

Input: $S$, $A$, $B^r_{hk}$, $B^p_{hk}$
Set $Q_{H+1,k}(s,a) = 0$ for all $(s,a) \in S \times A$
for $h = H, \ldots, 1$ do
    for $(s,a) \in S \times A$ do
        Compute
        $Q_{hk}(s,a) = \max_{r_h \in B^r_{hk}(s,a)} r_h(s,a) + \max_{p_h \in B^p_{hk}(s,a)} \mathbb{E}_{s' \sim p_h(\cdot|s,a)}\big[V_{h+1,k}(s')\big]$
        $\phantom{Q_{hk}(s,a)} = \hat r_{hk}(s,a) + \beta^r_{hk}(s,a) + \max_{p_h \in B^p_{hk}(s,a)} \mathbb{E}_{s' \sim p_h(\cdot|s,a)}\big[V_{h+1,k}(s')\big]$
        $V_{hk}(s) = \min\big\{H - (h-1),\ \max_{a \in A} Q_{hk}(s,a)\big\}$
    end
end
return $\pi_{hk}(s) = \arg\max_{a \in A} Q_{hk}(s,a)$   (policy executed at episode $k$)
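A possible implementation of EVI for the Hoeffding/Weissman confidence sets above. The inner maximization over the L1 ball follows the standard routine of Jaksch et al. [2010] (shift probability mass towards the highest-valued next states); array shapes follow the hypothetical helpers sketched earlier.

```python
import numpy as np

def max_expectation_l1(p_hat_sa, beta_sa, V_next):
    """max of p^T V_next over p in the simplex with ||p - p_hat||_1 <= beta.
    Standard routine: move mass towards the highest-valued next states."""
    p = p_hat_sa.copy()
    order = np.argsort(-V_next)                       # states by decreasing value
    best = order[0]
    p[best] = min(1.0, p_hat_sa[best] + beta_sa / 2)  # add up to beta/2 to the best state
    for s in order[::-1]:                             # remove the excess from the worst states
        excess = p.sum() - 1.0
        if excess <= 1e-12:
            break
        p[s] = max(0.0, p[s] - excess)
    return float(p @ V_next)

def extended_value_iteration(r_hat, p_hat, beta_r, beta_p, H):
    """Optimistic backward induction over the bounded-parameter MDP."""
    S, A = r_hat.shape[1], r_hat.shape[2]
    Q = np.zeros((H + 1, S, A))
    V = np.zeros((H + 1, S))
    for h in range(H - 1, -1, -1):
        for s in range(S):
            for a in range(A):
                ev = max_expectation_l1(p_hat[h, s, a], beta_p[h, s, a], V[h + 1])
                Q[h, s, a] = r_hat[h, s, a] + beta_r[h, s, a] + ev
            V[h, s] = min(H - h, Q[h, s].max())       # V_hk(s) = min{H-(h-1), max_a Q}
    return Q, V
```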


Optimism

(Figure: the optimistic estimate $Q_{hk}(s,a)$ computed by EVI lies above the true $Q^\star_h(s,a)$ computed by VI; the gap reflects the uncertainty on rewards and transitions.)

$\forall h \in [H],\ \forall (s,a): \quad Q_{hk}(s,a) \ge Q^\star_h(s,a)$

UCRL2-CH for Finite Horizon

Theorem (adapted from [Jaksch et al., 2010])
For any tabular MDP with stationary transitions, UCRL2 with Chernoff-Hoeffding confidence intervals (UCRL2-CH) suffers, with high probability, a regret
$R(K, M^\star, \text{UCRL2-CH}) = O\big(HS\sqrt{AT} + H^2SA\big)$

• Order optimal in $\sqrt{AT}$
• $\sqrt{HS}$ factor worse than the lower bound
• Lower bound: $\Omega(\sqrt{HSAT})$ (stationary transitions)

Extended Value Iteration

$Q_{hk}(s,a) = \max_{(r,p) \in B^r_{hk}(s,a) \times B^p_{hk}(s,a)} r + p^\top V_{h+1,k}$
$\phantom{Q_{hk}(s,a)} = \max_{r \in B^r_{hk}(s,a)} r + \max_{p \in B^p_{hk}(s,a)} p^\top V_{h+1,k}$
$\phantom{Q_{hk}(s,a)} = \hat r_{hk}(s,a) + \beta^r_{hk}(s,a) + \max_{p \in B^p_{hk}(s,a)} p^\top V_{h+1,k}$
$\phantom{Q_{hk}(s,a)} \le \hat r_{hk}(s,a) + \beta^r_{hk}(s,a) + \|p - \hat p_{hk}(\cdot|s,a)\|_1 \|V_{h+1,k}\|_\infty + \hat p_{hk}(\cdot|s,a)^\top V_{h+1,k}$
$\phantom{Q_{hk}(s,a)} \le \hat r_{hk}(s,a) + \beta^r_{hk}(s,a) + H\beta^p_{hk}(s,a) + \hat p_{hk}(\cdot|s,a)^\top V_{h+1,k}$

⇒ Exploration bonus $(1 + H\sqrt{S})\,\beta^r_{hk}(s,a)$ for the reward


UCBVI [Azar et al., 2017]

Replace EVI with an exploration bonus.

Input: $S$, $A$, $\hat r_{hk}$, $\hat p_{hk}$, $b_{hk}$
Set $Q_{H+1,k}(s,a) = 0$ for all $(s,a) \in S \times A$
for $h = H, \ldots, 1$ do
    for $(s,a) \in S \times A$ do
        Compute
        $Q_{hk}(s,a) = \hat r_{hk}(s,a) + b_{hk}(s,a) + \mathbb{E}_{s' \sim \hat p_{hk}(\cdot|s,a)}\big[V_{h+1,k}(s')\big]$
        $V_{hk}(s) = \min\big\{H - (h-1),\ \max_{a' \in A} Q_{hk}(s,a')\big\}$
    end
end
return $\pi_{hk}(s) = \arg\max_{a \in A} Q_{hk}(s,a)$

Equivalent to value iteration on $\widehat M_k = (S, A, \hat r_{hk} + b_{hk}, \hat p_{hk}, H)$
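A minimal sketch of UCBVI's backward induction with the Chernoff-Hoeffding bonus from the next slide. Constants in the bonus are illustrative, and `r_hat`, `p_hat`, `N` are assumed to come from the empirical-model helper sketched earlier.

```python
import numpy as np

def ucbvi_backward_induction(r_hat, p_hat, N, H, delta=0.05):
    """UCBVI-CH: value iteration on the empirical MDP with an additive bonus b_hk."""
    S, A = r_hat.shape[1], r_hat.shape[2]
    N_safe = np.maximum(N, 1)
    bonus = (H + 1) * np.sqrt(np.log(N_safe / delta) / N_safe)   # b_hk(s,a), constants illustrative
    Q = np.zeros((H + 1, S, A))
    V = np.zeros((H + 1, S))
    for h in range(H - 1, -1, -1):
        Q[h] = r_hat[h] + bonus[h] + p_hat[h] @ V[h + 1]          # (S,A,S) @ (S,) -> (S,A)
        V[h] = np.minimum(H - h, Q[h].max(axis=1))                # clipped greedy value
    return Q, V
```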

UCBVI: Measuring Uncertainty

Combine the uncertainties on rewards and transitions in a smart way:
$b_{hk}(s,a) = (H+1)\sqrt{\frac{\log(N_{hk}(s,a)/\delta)}{N_{hk}(s,a)}} \;<\; \beta^r_{hk} + H\beta^p_{hk}$

Save a $\sqrt{S}$ factor: it is enough to control the transition error against the fixed function $V^\star$ (Hoeffding) rather than uniformly over all value functions (Weissman):
$\Big|\big(p_h(\cdot|s,a) - \hat p_{hk}(\cdot|s,a)\big)^\top \underbrace{V^\star_h}_{\le H}\Big| \le H \underbrace{\sqrt{\frac{\log(N_{hk}(s,a)/\delta)}{N_{hk}(s,a)}}}_{= \beta^p_{hk}/\sqrt{S}}$


UCBVI-CH: Regret

Theorem (Thm. 1 of Azar et al. [2017])
For any tabular MDP with stationary transitions, UCBVI with Chernoff-Hoeffding confidence intervals (UCBVI-CH) suffers, with high probability, a regret
$R(K, M^\star, \text{UCBVI-CH}) = O\big(H\sqrt{SAT} + H^2S^2A\big)$

• Order optimal in $\sqrt{SAT}$
• $\sqrt{H}$ factor worse than the lower bound
• Long "warm up" phase (the $H^2S^2A$ term)
• If the transitions are non-stationary: $O\big(H^{3/2}\sqrt{SAT}\big)$
• Lower bound: $\Omega(\sqrt{HSAT})$ (stationary transitions)

Refined Confidence Bounds19

UCRL2 with Bernstein-Freedman bounds (instead of Hoeffding/Weissman): * see tutorial website

R(K,M?,UCRL2B) = O(√

H Γ SAT +H2S2A

), Still not matching the lower-bound!

UCBVI with Bernstein-Freedman bounds: *

R(K,M?,UCBVI-BF) = O(√

HSAT +H2S2A+H√T)

- Matching the Lower-Bound!

, Long “warm up” phase

Γ = maxh,s,a

‖ph(·|s, a)‖0 ≤ S

* stationary model (p1 = . . . = pH)

Ghavamzadeh, Lazaric and Pirotta


Refined Confidence Bounds

EULER [Zanette and Brunskill, 2019] keeps upper and lower bounds on $V^\star_h$:
$R(K, M^\star, \text{EULER}) = O\big(\sqrt{\mathcal{Q}^\star SAT} + \sqrt{S}\,SAH^2(\sqrt{S} + \sqrt{H})\big)$

Problem-dependent bound based on the environmental norm [Maillard et al., 2014]:
$\mathcal{Q}^\star = \max_{s,a,h} \big( \mathbb{V}(r_h(s,a)) + \mathbb{V}_{x \sim p_h(\cdot|s,a)}(V^\star_{h+1}(x)) \big), \qquad \mathbb{V}_{x \sim p}(f(x)) = \mathbb{E}_{x \sim p}\big[\big(f(x) - \mathbb{E}_{y \sim p}[f(y)]\big)^2\big]$

• Can remove the dependence on $H$
• Matching the lower bound in the worst case

UCRL2: RiverSwim

Hoeffding:
$b^r_{hk}(s,a) = r_{\max}\sqrt{\frac{L}{N}}, \qquad b^p_{hk}(s,a) = \sqrt{\frac{SL}{N}}$

Bernstein:
$b^r_{hk}(s,a) = \sqrt{\frac{L\,\mathbb{V}(\hat r_{hk})}{N}} + r_{\max}\frac{L}{N}, \qquad b^p_{hk}(s,a) = \sqrt{\frac{L\,\mathbb{V}(\hat p_{hk})}{N}} + \frac{L}{N}$

$\mathbb{V}(\hat r_{hk}) = \frac{1}{N}\sum_i (r_{h,i} - \hat r_{hk})^2$ is the population variance

$N = N_{hk}(s,a) \vee 1, \qquad L = \log(SAN/\delta)$

(Plot: cumulative regret $R(K)$ on RiverSwim over $K \le 4 \times 10^4$ episodes, comparing UCRL2-H and UCRL2-B.)


UCBVI: RiverSwim

Hoeffding:
$b_{hk}(s,a) = (H-h)\frac{L}{\sqrt{N}}$

Bernstein:
$b_{hk}(s,a) = \sqrt{\frac{L\,\mathbb{V}_{\hat p_{hk}}(V_{h+1,k})}{N}} + (H-h)\frac{L}{N} + \frac{H-h}{\sqrt{N}}$

$\mathbb{V}_p(V) = \mathbb{E}_{x \sim p}[(V(x) - \mu)^2]$ with $\mu = \mathbb{E}_{x \sim p}[V(x)]$

$N = N_{hk}(s,a) \vee 1, \qquad L = \log(SAN/\delta)$

(Plot: cumulative regret $R(K)$ on RiverSwim over $K \le 4 \times 10^4$ episodes, comparing UCRL2-H, UCRL2-B, UCBVI-H and UCBVI-B.)

Model-Based Advantages

• Learning efficiency: first-order optimal, matching the lower bound
• Counterfactual reasoning: optimistic/pessimistic value estimates for any $\pi$, useful for inference (e.g., safety)

Model-Based Issues

Complexity
• Space: $O(HS^2A)$ — a non-stationary model requires $H(\underbrace{S^2A}_{\text{transitions}} + \underbrace{SA}_{\text{rewards}})$
• Time: $O(K \cdot \underbrace{HS^2A}_{\text{planning by VI}})$ — can be reduced with incremental updates


Real-Time Dynamic Programming (RTDP) [Barto et al., 1995]

Input: $S$, $A$, $r_h$, $p_h$ (known model)
Initialize $V_{h,0}(s) = H - (h-1)$ for all $s \in S$ and $h \in [H]$
for $k = 1, \ldots, K$ do  // episodes
    Observe initial state $s_{1k}$ (arbitrary)
    for $h = 1, \ldots, H$ do
        $a_{hk} \in \arg\max_{a \in A} r_h(s_{hk}, a) + p_h(\cdot|s_{hk}, a)^\top V_{h+1,k-1}$   (1-step planning)
        $V_{h,k}(s_{hk}) = r_h(s_{hk}, a_{hk}) + p_h(\cdot|s_{hk}, a_{hk})^\top V_{h+1,k-1}$
        Observe $s_{h+1,k} \sim p_h(\cdot|s_{hk}, a_{hk})$
    end
end


Opt-RTDP: Incremental Planning [Efroni et al., 2019]

Optimism + RTDP: replace full backward-induction planning with a forward, optimistic 1-step update.

Input: $S$, $A$ ($r_h$ and $p_h$ unknown)
Initialize $V_{h,0}(s) = H - (h-1)$ for all $s \in S$ and $h \in [H]$; $D_1 = \emptyset$
for $k = 1, \ldots, K$ do  // episodes
    Observe initial state $s_{1k}$ (arbitrary)
    Estimate the empirical MDP $\widehat M_k = (S, A, \hat p_{hk}, \hat r_{hk}, H)$ from $D_k$
    for $h = 1, \ldots, H$ do
        Build an optimistic estimate $\overline Q(s_{hk}, a)$ for all $a \in A$ using $\hat p_{hk}$, $\hat r_{hk}$ and $V_{h+1,k-1}$
        (forward estimate, not backward: next stage but previous episode!)
        Set $V_{hk}(s_{hk}) = \min\big\{V_{h,k-1}(s_{hk}),\ \max_{a' \in A} \overline Q(s_{hk}, a')\big\}$
        Execute $a_{hk} = \pi_{hk}(s_{hk}) = \arg\max_{a \in A} \overline Q(s_{hk}, a)$
        Observe $r_{hk}$ and $s_{h+1,k}$
    end
    Add trajectory $(s_{hk}, a_{hk}, r_{hk})_{h=1}^H$ to $D_{k+1}$
end
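A minimal sketch of the per-step optimistic update, written around a generic `opt_expectation` routine (for a UCRL2-like step this could be the `max_expectation_l1` helper sketched for EVI above). All names are illustrative.

```python
import numpy as np

def opt_rtdp_step(h, s, V_prev, r_hat, p_hat, beta_r, beta_p, opt_expectation):
    """One forward Opt-RTDP step at stage h in state s.
    `V_prev` holds the value tables from the *previous* episode (k-1).
    `opt_expectation(p_hat_sa, beta_sa, V_next)` maximizes p^T V_next over the
    L1 confidence ball, e.g. the max_expectation_l1 routine sketched for EVI."""
    A = r_hat.shape[2]
    q = np.array([r_hat[h, s, a] + beta_r[h, s, a]
                  + opt_expectation(p_hat[h, s, a], beta_p[h, s, a], V_prev[h + 1])
                  for a in range(A)])
    a_t = int(np.argmax(q))                  # action executed: argmax_a Qbar(s_hk, a)
    v_new = min(V_prev[h][s], q.max())       # clipping keeps the estimates non-increasing
    return a_t, v_new
```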

Opt-RTDP: Properties

Non-increasing estimates: $V_{hk}(s) \le V_{h,k-1}(s)$. How?
• Optimistic initialization: $V_{h,0}(s) = H - (h-1)$
• Clipping: $V_{hk}(s_{hk}) = \min\big\{V_{h,k-1}(s_{hk}),\ \max_{a' \in A} \overline Q(s_{hk}, a')\big\}$

Opt-RTDP: Properties

Optimistic estimates: $V_{hk}(s) \ge V^\star_h(s)$. How?
• Optimistic initialization: $V_{h,0}(s) = H - (h-1)$
• Optimistic update. Example: UCRL2-like step
$\overline Q(s_{hk}, a) = \max_{r \in B^r_{hk}(s_{hk},a)} r(s_{hk}, a) + \max_{p \in B^p_{hk}(s_{hk},a)} \mathbb{E}_{s' \sim p(\cdot|s_{hk},a)}\big[V_{h+1,k-1}(s')\big]$

$V_{h+1,k-1}$ is one episode behind but still optimistic, hence $\overline Q$ is optimistic!


Opt-RTDP: Regret

Theorem (Thm. 8 of Efroni et al. [2019])
For any tabular MDP with stationary transitions, UCRL2-GP (Opt-RTDP based on UCRL2 with Hoeffding bounds) suffers, with high probability, a regret
$R(K, M^\star, \text{UCRL2-GP}) = O\big(HS\sqrt{AT} + H^2S^{3/2}A\big)$

• Same regret as UCRL2-CH
• Computationally more efficient: $O(SA)$ time per step, total runtime $O(HSAK)$
• Can be adapted to any algorithm (e.g., UCBVI, EULER)


UCRL2-GP: RiverSwim

Hoeffding:
$b^r_{hk}(s,a) = r_{\max}\sqrt{\frac{L}{N}}, \qquad b^p_{hk}(s,a) = \sqrt{\frac{SL}{N}}$

Bernstein:
$b^r_{hk}(s,a) = \sqrt{\frac{L\,\mathbb{V}(\hat r_{hk})}{N}} + r_{\max}\frac{L}{N}, \qquad b^p_{hk}(s,a) = \sqrt{\frac{L\,\mathbb{V}(\hat p_{hk})}{N}} + \frac{L}{N}$

$N = N_{hk}(s,a) \vee 1, \qquad L = \log(SAN/\delta)$

(Plot: cumulative regret $R(K)$ on RiverSwim over $K \le 4 \times 10^4$ episodes, comparing UCRL2-H, UCRL2-B, UCRL2-GP-H and UCRL2-GP-B.)

Tabular MDPs: Outline

1. Tabular Model-Based
   • Optimistic
   • Randomized

2. Tabular Model-Free Algorithms

Posterior Sampling (PS), a.k.a. Thompson Sampling [Thompson, 1933]

Keep a Bayesian posterior distribution $\mu_t$ over the set of MDPs (i.e., over the unknown MDP).

A sample from the posterior is used as an estimate of the unknown MDP.

• Few samples =⇒ uncertainty in the estimate =⇒ exploration
• More samples =⇒ posterior concentrates on the true MDP =⇒ exploitation

History: PS for Regret Minimization in Tabular MDPs

FH: finite-horizon, AR: average reward

• Thompson [1933]
• Strens [2000]
• Osband et al. [2013] (FH)
• Gopalan and Mannor [2015] (AR)
• Abbasi-Yadkori and Szepesvári [2015] (AR, possibly incorrect)
• Osband and Roy [2016] (arXiv, not published)
• Osband and Roy [2017] (FH, possibly incorrect)
• Ouyang et al. [2017] (AR, possibly incorrect assumptions)
• Agrawal and Jia [2017] (AR, possibly incorrect)
• Theocharous et al. [2018] (AR, possibly incorrect assumptions)
• Russo [2019] (FH)

Bayesian Regret

$R_B(K, \mu_1, \mathcal{A}) = \mathbb{E}_{M^\star \sim \mu_1}\big[R(K, M^\star, \mathcal{A})\big] = \mathbb{E}_{M^\star}\Big[\sum_{k=1}^K V^\star_{1,M^\star}(s_{1k}) - V^{\pi_k}_{1,M^\star}(s_{1k})\Big]$

where $R(K, M^\star, \mathcal{A})$ is the expected (frequentist) regret on the MDP $M^\star$.

Posterior Sampling [Osband and Roy, 2017]

Input: $S$, $A$ ($r_h$ and $p_h$ unknown), prior $\mu_1$
Initialize $D_1 = \emptyset$
for $k = 1, \ldots, K$ do  // episodes
    Observe initial state $s_{1k}$ (arbitrary)
    Sample $\widetilde M_k \sim \mu_k(\cdot|D_k)$
    Compute $\pi_k \in \arg\max_\pi V^\pi_{1,\widetilde M_k}$
    for $h = 1, \ldots, H$ do
        Execute $a_{hk} = \pi_{hk}(s_{hk})$
        Observe $r_{hk}$ and $s_{h+1,k}$
    end
    Add trajectory $(s_{hk}, a_{hk}, r_{hk})_{h=1}^H$ to $D_{k+1}$
end

Prior distribution: $\forall \Theta,\ \mathbb{P}(M^\star \in \Theta) = \mu_1(\Theta)$
Posterior distribution: $\forall \Theta,\ \mathbb{P}(M^\star \in \Theta \mid D_k, \mu_1) = \mu_k(\Theta)$

Priors: Dirichlet (transitions); Beta, Normal-Gamma, etc. (rewards)

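A minimal PSRL sketch under Dirichlet priors on the transitions; it assumes the rewards are summarized by a point estimate `mu_r` (a full implementation would also sample them, e.g. from the Beta or Normal-Gamma posteriors listed above). All names are illustrative.

```python
import numpy as np

def psrl_episode(alpha, mu_r, H, rng):
    """One PSRL episode: sample an MDP from the posterior, then plan on it.
    alpha[h, s, a] holds Dirichlet parameters for p_h(.|s,a); mu_r[h, s, a] is a
    point estimate of the mean reward."""
    S, A = alpha.shape[1], alpha.shape[2]
    p_tilde = np.zeros((H, S, A, S))
    for h in range(H):                                # sample the transitions
        for s in range(S):
            for a in range(A):
                p_tilde[h, s, a] = rng.dirichlet(alpha[h, s, a])
    V = np.zeros((H + 1, S))                          # exact planning by backward induction
    policy = np.zeros((H, S), dtype=int)
    for h in range(H - 1, -1, -1):
        Q = mu_r[h] + p_tilde[h] @ V[h + 1]           # shape (S, A)
        policy[h] = Q.argmax(axis=1)
        V[h] = Q.max(axis=1)
    return policy, V

# Usage (illustrative): policy, _ = psrl_episode(alpha, mu_r, H, np.random.default_rng(0))
```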


Model Update with Dirichlet Priors

(Assume $r$ is known.) The update maps $\mu_t$ and the observed transition $(s_t, a_t, s_{t+1}) \sim \mathcal{H}_t$ to $\mu_{t+1}$:

$\mu_t(s,a) = \text{Dirichlet}(\alpha_1, \ldots, \alpha_S)$ on $p(\cdot|s,a)$

Observe $s_{t+1} \sim p(\cdot|s_t, a_t)$ (outcome of a multivariate Bernoulli) with $s_{t+1} = i$. The Bayesian posterior is
$\mu_{t+1}(s,a) = \text{Dirichlet}(\alpha_1, \ldots, \alpha_i + 1, \ldots, \alpha_S)$

• Posterior mean: $\bar p_{t+1}(s_i|s,a) = \frac{\alpha_i}{n}$, with $n = \sum_{i=1}^S \alpha_i$
• Variance bounded by $\frac{1}{n}$
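The conjugate update above is a one-liner in code; a small worked example (illustrative, with a uniform Dirichlet(1, 1, 1) prior assumed for concreteness):

```python
import numpy as np

def dirichlet_update(alpha, s_next):
    """Posterior update after observing a transition to state s_next:
    Dirichlet(a_1, ..., a_S) -> Dirichlet(..., a_i + 1, ...)."""
    alpha = alpha.copy()
    alpha[s_next] += 1.0
    return alpha

def dirichlet_mean(alpha):
    """Posterior mean p(s_i | s, a) = alpha_i / n, with n = sum_i alpha_i."""
    return alpha / alpha.sum()

# Example: uniform prior over 3 next states, then observe a transition to state 2.
alpha = np.ones(3)                   # Dirichlet(1, 1, 1)
alpha = dirichlet_update(alpha, 2)   # Dirichlet(1, 1, 2)
print(dirichlet_mean(alpha))         # [0.25, 0.25, 0.5]
```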

Posterior Sampling is Usually Better

Bandits: [Chapelle and Li, 2011]          Finite-horizon RL: [Osband and Roy, 2017]

(Empirical comparison figures omitted.)

PSRL: Regret

Theorem (Osband and Roy [2017] revisited)
For any prior $\mu_1$ with independent Dirichlet priors over the (stationary) transitions, the Bayesian regret of PSRL is bounded as
$R_B(K, \mu_1, \text{PSRL}) = O\big(HS\sqrt{AT}\big)$

• Order optimal in $\sqrt{AT}$
• $\sqrt{HS}$ factor suboptimal
• Lower bound: $\Omega(\sqrt{HSAT})$ (stationary transitions)

* The bound stated in [Osband and Roy, 2017] is $O(H\sqrt{SAT})$ for stationary MDPs, but there is a mistake in their Lem. 3 (see [Qian et al., 2020]).

PSRL: RiverSwim

(Plot 1: cumulative regret $R(K)$ of PSRL on RiverSwim over $K \le 4 \times 10^4$ episodes.)

(Plot 2: the same experiment comparing PSRL with UCRL2-B and UCBVI-B.)

Tabular Randomized Least-Squares Value Iteration (RLSVI) [Russo, 2019]

Input: $S$, $A$, $H$
for $k = 1, \ldots, K$ do  // episodes
    Observe initial state $s_{1k}$ (arbitrary)
    Run Tabular-RLSVI on $D_k$ to obtain $(Q_{hk})_{h=1}^H$
    for $h = 1, \ldots, H$ do
        Execute $a_{hk} = \pi_{hk}(s_{hk}) = \arg\max_a Q_{hk}(s_{hk}, a)$
        Observe $r_{hk}$ and $s_{h+1,k}$
    end
    Add trajectory $(s_{hk}, a_{hk}, r_{hk})_{h=1}^H$ to $D_{k+1}$
end

* It is not necessary to store all the data; the updates can be done incrementally.

Tabular-RLSVI

Input: dataset $D_k = (s_{hi}, a_{hi}, r_{hi})_{h=1,i=1}^{H,k}$
Estimate the empirical MDP $\widehat M_k = (S, A, \hat p_{hk}, \hat r_{hk}, H)$:
$\hat p_{hk}(s'|s,a) = \frac{1}{N_{hk}(s,a)}\sum_{i=1}^{k-1} \mathbb{1}\{(s_{hi}, a_{hi}, s_{h+1,i}) = (s,a,s')\}, \qquad \hat r_{hk}(s,a) = \frac{1}{N_{hk}(s,a)}\sum_{i=1}^{k-1} r_{hi}\,\mathbb{1}\{(s_{hi}, a_{hi}) = (s,a)\}$

for $h = H, \ldots, 1$ do  // backward induction
    Sample $\xi_{hk} \sim \mathcal{N}(0, \sigma^2_{hk} I)$
    Compute, for all $(s,a) \in S \times A$,
    $Q_{hk}(s,a) = \hat r_{hk}(s,a) + \xi_{hk}(s,a) + \sum_{s' \in S} \hat p_{hk}(s'|s,a) V_{h+1,k}(s')$
end
return $(Q_{hk})_{h=1}^H$

⇒ Planning on the randomly perturbed MDP $\widetilde M_k = (S, A, \hat r_{hk} + \xi_{hk}, \hat p_{hk}, H)$
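A minimal sketch of the perturbed backward induction. It assumes the greedy value $V_{h,k}(s) = \max_a Q_{hk}(s,a)$ and an independent Gaussian draw per $(h, s, a)$; names and shapes are illustrative.

```python
import numpy as np

def rlsvi_backward_induction(r_hat, p_hat, sigma, H, rng):
    """Tabular RLSVI: value iteration on the empirical MDP with Gaussian reward
    perturbations xi_hk ~ N(0, sigma_hk^2), drawn independently per (h, s, a)."""
    S, A = r_hat.shape[1], r_hat.shape[2]
    Q = np.zeros((H + 1, S, A))
    V = np.zeros((H + 1, S))
    for h in range(H - 1, -1, -1):
        xi = rng.normal(0.0, sigma[h])                 # noise, shape (S, A)
        Q[h] = r_hat[h] + xi + p_hat[h] @ V[h + 1]
        V[h] = Q[h].max(axis=1)                        # greedy value wrt the perturbed Q
    return Q
```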

Tabular-RLSVI: Frequentist Regret

Theorem (Russo [2019])
For any tabular MDP with non-stationary transitions, Tab-RLSVI with
$\sigma_{hk}(s,a) = O\Big(\sqrt{\frac{SH^3}{N_{hk}(s,a) + 1}}\Big)$
suffers, with high probability, a frequentist regret
$R(K, M^\star, \text{Tab-RLSVI}) = O\big(H^{5/2} S^{3/2} \sqrt{AT}\big)$

• Order optimal in $\sqrt{AT}$
• $H^{3/2} S$ factor worse than the lower bound $\Omega(H\sqrt{SAT})$
• The analysis can be improved!

Tab-RLSVI (varying σ): RiverSwim

$\sigma_1$ (theory): $\sigma_h(s,a) = \frac{1}{4}\sqrt{\frac{(H-h)^3 S L}{N}}$

$\sigma_2$: $\sigma_h(s,a) = \frac{1}{4}\sqrt{\frac{(H-h)^2 L}{N}}$

$\sigma_3$: $\sigma_h(s,a) = \frac{1}{4}\sqrt{\frac{L}{N}}$

$N = N_{hk}(s,a) \vee 1, \qquad L = \log(SAN/\delta)$

(Plot 1: cumulative regret $R(K)$ on RiverSwim over $K \le 4 \times 10^4$ episodes for Tab-RLSVI-1, Tab-RLSVI-2 and Tab-RLSVI-3.)

(Plot 2: the same experiment comparing Tab-RLSVI-3 with UCRL2-H.)

Tabular MDPs: Outline

1. Tabular Model-Based
   • Optimistic
   • Randomized

2. Tabular Model-Free Algorithms

Model-Based Issues

Complexity
• Space: $O(HS^2A)$ — a non-stationary model requires $H(\underbrace{S^2A}_{\text{transitions}} + \underbrace{SA}_{\text{rewards}})$
• Time: $O(K \cdot \underbrace{HS^2A}_{\text{planning by VI}})$

Solutions
1. Time complexity: incremental planning (e.g., Opt-RTDP)
2. Space complexity: avoid estimating rewards and transitions
   ⇒ Optimistic Q-learning (Opt-QL): Space $O(HSA)$, Time $O(HAK)$


RiverSwim: Q-learning with ε-greedy Exploration

ε schedules compared:
• $\varepsilon_t = 1.0$
• $\varepsilon_t = 0.5$
• $\varepsilon_t = \frac{\varepsilon_0}{(N(s_t) - 1000)^{2/3}}$
• $\varepsilon_t = 1.0$ if $t < 6000$, else $\frac{\varepsilon_0}{N(s_t)^{1/2}}$
• $\varepsilon_t = 1.0$ if $t < 7000$, else $\frac{\varepsilon_0}{N(s_t)^{1/2}}$

Tuning the ε schedule is difficult and problem dependent.

Regret: $\Omega\big(\min\{T, A^{H/2}\}\big)$

Optimistic Q-learning

Input: $S$, $A$ ($r_h$ and $p_h$ unknown)
Initialize $Q_h(s,a) = H - (h-1)$ and $N_h(s,a) = 0$ for all $(s,a) \in S \times A$ and $h \in [H]$
for $k = 1, \ldots, K$ do  // episodes
    Observe initial state $s_{1k}$ (arbitrary)
    for $h = 1, \ldots, H$ do
        Execute $a_{hk} = \pi_{hk}(s_{hk}) = \arg\max_a Q_h(s_{hk}, a)$
        Observe $r_{hk}$ and $s_{h+1,k}$
        Set $N_h(s_{hk}, a_{hk}) = N_h(s_{hk}, a_{hk}) + 1$ and $t = N_h(s_{hk}, a_{hk})$
        Update
        $Q_h(s_{hk}, a_{hk}) = (1 - \alpha_t)\,Q_h(s_{hk}, a_{hk}) + \alpha_t\big(r_{hk} + V_{h+1}(s_{h+1,k}) + b_t\big)$   ($b_t$: upper-confidence bound)
        Set $V_h(s_{hk}) = \min\big\{H - (h-1),\ \max_{a \in A} Q_h(s_{hk}, a)\big\}$
    end
end
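A minimal sketch of a single Opt-QL update, with the step size $\alpha_t = \frac{H+1}{H+t}$ and a Hoeffding-style bonus $b_t \approx c\sqrt{H^3 \log(SAT/\delta)/t}$; the constant `c`, the `delta`, and passing `T` explicitly are illustrative assumptions.

```python
import numpy as np

def optql_update(Q, V, N, h, s, a, r, s_next, H, T, c=1.0, delta=0.05):
    """One Opt-QL update after observing (s, a, r, s') at stage h.
    Q has shape (H, S, A); V has shape (H+1, S); N counts visits per (h, s, a)."""
    N[h, s, a] += 1
    t = N[h, s, a]
    alpha = (H + 1.0) / (H + t)                                # alpha_t = (H+1)/(H+t)
    S, A = Q.shape[1], Q.shape[2]
    b = c * np.sqrt(H ** 3 * np.log(S * A * T / delta) / t)    # b_t ~ sqrt(H^3 log(SAT/d)/t)
    target = r + V[h + 1, s_next] + b
    Q[h, s, a] = (1 - alpha) * Q[h, s, a] + alpha * target
    V[h, s] = min(H - h, Q[h, s].max())                        # clipped greedy value
    return Q, V, N
```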

Step size αt

Q-learning typically uses $\alpha_t$ of order $O(1/t)$ or $O(1/\sqrt{t})$, with $t = N_{hk}(s,a)$.

Opt-QL instead uses
$\alpha_t = \frac{H+1}{H+t}$


Step size αt

Unrolling the Q-learning update ($t = N_{hk}(s,a)$):
$Q_{hk}(s,a) = \mathbb{1}\{t = 0\}\,H + \sum_{i=1}^t \alpha^i_t\big(r_{k_i} + V_{h+1,k_i}(s_{h+1,k_i}) + b_i\big), \qquad \text{with } \alpha^i_t = \alpha_i \prod_{j=i+1}^t (1 - \alpha_j)$

Optimistic initialization + weighted average of bootstrapped values, where $k_i = k : N_{hk}(s,a) = i$ is the episode of the $i$-th visit to $(s,a)$.

Idea: favor later updates — with $\alpha_t = \frac{H+1}{H+t}$, only the last $1/H$ fraction of the samples of $(s,a)$ has non-negligible weight; the earlier $1 - 1/H$ fraction is essentially forgotten.

Example: $H = 10$ and $t = N_{hk}(s,a) = 1000$.
(Plot: weights $\alpha^i_t$ as a function of the sample index $i$, for $\alpha_i = (H+1)/(H+i)$, $\alpha_i = 1/i$ and $\alpha_i = 1/\sqrt{i}$.)
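To see this effect numerically, the short script below (illustrative, not from the tutorial) computes the weights $\alpha^i_t$ for $H = 10$ and $t = 1000$ and measures how much weight falls on the most recent $1/H$ fraction of the samples.

```python
import numpy as np

def weights(alphas):
    """alpha^i_t = alpha_i * prod_{j=i+1}^t (1 - alpha_j), for i = 1, ..., t."""
    w = np.array(alphas, dtype=float)
    for i in range(len(w)):
        w[i] *= np.prod(1.0 - np.asarray(alphas[i + 1:]))
    return w

H, t = 10, 1000
i = np.arange(1, t + 1)
w_optql = weights((H + 1.0) / (H + i))     # alpha_i = (H+1)/(H+i)
w_avg = weights(1.0 / i)                   # alpha_i = 1/i (uniform averaging)

last = i > t * (1 - 1.0 / H)               # most recent 1/H fraction of the samples
print(w_optql[last].sum())                 # most of the total weight (which sums to 1)
print(w_avg[last].sum())                   # exactly 1/H with uniform weights
```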

Exploration Bonus bt

Let $t = N_{hk}(s,a)$. Then
$\Big|\sum_{i=1}^t \alpha^i_t\big(V^\star_{h+1}(s_{h+1,k_i}) - \mathbb{E}_{s'|s,a}[V^\star_{h+1}(s')]\big)\Big| \le \underbrace{c\sqrt{\frac{H^3 \log(SAT/\delta)}{t}}}_{:= b_t}$

Note that $\sum_{i=1}^t \alpha^i_t = 1$.

Opt-Q-learning: Regret

Theorem ([Jin et al., 2018])
For any tabular MDP with non-stationary transitions, Opt-QL with Hoeffding inequalities ($b_t = O(\sqrt{H^3/t})$) suffers, with high probability, a regret
$R(K, M^\star, \text{Opt-QL}) = O\big(H^2\sqrt{SAT} + H^2SA\big)$

• Order optimal in $\sqrt{SAT}$
• $H$ factor worse than the lower bound $\Omega(H\sqrt{SAT})$
• $\sqrt{H}$ factor worse than model-based with Hoeffding inequalities (UCBVI-CH for non-stationary $p_h$ suffers $O(H^{3/2}\sqrt{SAT})$), but with better second-order terms

The bound does not improve in stationary MDPs (i.e., $p_1 = \ldots = p_H$).

Opt-Q-learning: Example

(Plot: cumulative regret $R(K)$ of OptQL over $K \le 4 \times 10^4$ episodes.)

Refined Confidence Intervals

Opt-QL with Bernstein-Freedman bounds (instead of Hoeffding):
$R(K) = O\big(H^{3/2}\sqrt{SAT} + \sqrt{H^9S^3A^3}\big)$
• Still not matching the lower bound!
• $\sqrt{H}$ worse than model-based (e.g., UCBVI-BF)

Open Questions

1. Prove a frequentist regret bound for PSRL.
2. Should the gap between the regret of model-based and model-free algorithms exist?
3. Which algorithm is better in practice?

References

Yasin Abbasi-Yadkori and Csaba Szepesvári. Bayesian optimal control of smoothly parameterized systems. In UAI, pages 1–11. AUAI Press, 2015.

Rajeev Agrawal. Adaptive control of markov chains under the weak accessibility condition. In 29th IEEE Conference on Decision and Control, pages 1426–1431. IEEE, 1990.

Shipra Agrawal and Randy Jia. Optimistic posterior sampling for reinforcement learning: worst-case regret bounds. In NIPS, pages 1184–1194, 2017.

Peter Auer and Ronald Ortner. Logarithmic online regret bounds for undiscounted reinforcement learning. In NIPS, pages 49–56. MIT Press, 2006.

Mohammad Gheshlaghi Azar, Ian Osband, and Rémi Munos. Minimax regret bounds for reinforcement learning. In ICML, volume 70 of Proceedings of Machine Learning Research, pages 263–272. PMLR, 2017.

Peter L. Bartlett and Ambuj Tewari. REGAL: A regularization based algorithm for reinforcement learning in weakly communicating MDPs. In UAI, pages 35–42. AUAI Press, 2009.

Andrew G. Barto, Steven J. Bradtke, and Satinder P. Singh. Learning to act using real-time dynamic programming. Artificial Intelligence, 72(1-2):81–138, 1995.

Olivier Chapelle and Lihong Li. An empirical evaluation of Thompson sampling. In Advances in Neural Information Processing Systems 24, pages 2249–2257. Curran Associates, Inc., 2011. URL http://papers.nips.cc/paper/4321-an-empirical-evaluation-of-thompson-sampling.pdf.

Yonathan Efroni, Nadav Merlis, Mohammad Ghavamzadeh, and Shie Mannor. Tight regret bounds for model-based reinforcement learning with greedy policies. In NeurIPS, pages 12203–12213, 2019.

Sarah Filippi, Olivier Cappé, and Aurélien Garivier. Optimism in reinforcement learning and Kullback-Leibler divergence. In 2010 48th Annual Allerton Conference on Communication, Control, and Computing (Allerton), pages 115–122, 2010.

Ronan Fruit, Matteo Pirotta, and Alessandro Lazaric. Near optimal exploration-exploitation in non-communicating Markov decision processes. In NeurIPS, pages 2998–3008, 2018a.

Ronan Fruit, Matteo Pirotta, Alessandro Lazaric, and Ronald Ortner. Efficient bias-span-constrained exploration-exploitation in reinforcement learning. In ICML, Proceedings of Machine Learning Research. PMLR, 2018b.

Ronan Fruit, Matteo Pirotta, and Alessandro Lazaric. Improved analysis of UCRL2B, 2019. URL https://rlgammazero.github.io/docs/ucrl2b_improved.pdf.

Aditya Gopalan and Shie Mannor. Thompson sampling for learning parameterized Markov decision processes. In COLT, volume 40 of JMLR Workshop and Conference Proceedings, pages 861–898. JMLR.org, 2015.

Wassily Hoeffding. Probability inequalities for sums of bounded random variables. Journal of the American Statistical Association, 58(301):13–30, 1963. URL http://www.jstor.org/stable/2282952.

Thomas Jaksch, Ronald Ortner, and Peter Auer. Near-optimal regret bounds for reinforcement learning. Journal of Machine Learning Research, 11:1563–1600, 2010.

Chi Jin, Zeyuan Allen-Zhu, Sébastien Bubeck, and Michael I. Jordan. Is Q-learning provably efficient? In NeurIPS, pages 4868–4878, 2018.

Sham Kakade, Mengdi Wang, and Lin F. Yang. Variance reduction methods for sublinear reinforcement learning. CoRR, abs/1802.09184, 2018.

Odalric-Ambrym Maillard, Timothy A. Mann, and Shie Mannor. How hard is my MDP? "The distribution-norm to the rescue". In NIPS, pages 1835–1843, 2014.

Ian Osband and Benjamin Van Roy. Posterior sampling for reinforcement learning without episodes. CoRR, abs/1608.02731, 2016.

Ian Osband and Benjamin Van Roy. Why is posterior sampling better than optimism for reinforcement learning? In ICML, volume 70 of Proceedings of Machine Learning Research, pages 2701–2710. PMLR, 2017.

Ian Osband, Daniel Russo, and Benjamin Van Roy. (More) efficient reinforcement learning via posterior sampling. In NIPS, pages 3003–3011, 2013.

Yi Ouyang, Mukul Gagrani, Ashutosh Nayyar, and Rahul Jain. Learning unknown Markov decision processes: A Thompson sampling approach. In NIPS, pages 1333–1342, 2017.

Jian Qian, Ronan Fruit, Matteo Pirotta, and Alessandro Lazaric. Exploration bonus for regret minimization in discrete and continuous average reward MDPs. In NeurIPS, pages 4891–4900, 2019.

Jian Qian, Ronan Fruit, Matteo Pirotta, and Alessandro Lazaric. Concentration inequalities for multinoulli random variables. CoRR, abs/2001.11595, 2020.

Daniel Russo. Worst-case regret bounds for exploration via randomized value functions. In NeurIPS, pages 14410–14420, 2019.

Alexander L. Strehl and Michael L. Littman. An analysis of model-based interval estimation for Markov decision processes. Journal of Computer and System Sciences, 74(8):1309–1331, 2008.

Malcolm Strens. A Bayesian framework for reinforcement learning. In Proceedings of the Seventeenth International Conference on Machine Learning, pages 943–950. ICML, 2000.

Mohammad Sadegh Talebi and Odalric-Ambrym Maillard. Variance-aware regret bounds for undiscounted reinforcement learning in MDPs. In ALT, volume 83 of Proceedings of Machine Learning Research, pages 770–805. PMLR, 2018.

Georgios Theocharous, Zheng Wen, Yasin Abbasi, and Nikos Vlassis. Scalar posterior sampling with applications. In NeurIPS, pages 7696–7704, 2018.

William R. Thompson. On the likelihood that one unknown probability exceeds another in view of the evidence of two samples. Biometrika, 25(3-4):285–294, 1933.

Aristide C. Y. Tossou, Debabrota Basu, and Christos Dimitrakakis. Near-optimal optimistic reinforcement learning using empirical Bernstein inequalities. CoRR, abs/1905.12425, 2019.

Tsachy Weissman, Erik Ordentlich, Gadiel Seroussi, Sergio Verdú, and Marcelo J. Weinberger. Inequalities for the L1 deviation of the empirical distribution. 2003.

Andrea Zanette and Emma Brunskill. Problem dependent reinforcement learning bounds which can identify bandit structure in MDPs. In ICML, volume 80 of JMLR Workshop and Conference Proceedings, pages 5732–5740. JMLR.org, 2018.

Andrea Zanette and Emma Brunskill. Tighter problem-dependent regret bounds in reinforcement learning without domain knowledge using value function bounds. In ICML, volume 97 of Proceedings of Machine Learning Research, pages 7304–7312. PMLR, 2019.

Zihan Zhang and Xiangyang Ji. Regret minimization for reinforcement learning by evaluating the optimal bias function. In NeurIPS, pages 2823–2832, 2019.