
Page 1:

Shipra Agrawal

IEOR 8100 (Reinforcement Learning)

Page 2:

Computational complexity

the amount of per-time-step computation the algorithm uses during learning;

Space complexity

the amount of memory used by the algorithm;

Learning complexity

a measure of how much experience the algorithm needs to learn in a given task.


Page 3:

Using a single thread of experience

No resets or generative model

Vs.

Generative models (simulator)

http://hunch.net/~jl/projects/RL/EE.html


Page 4:

Optimal learning complexity?

"Optimally explore" means obtaining the maximum expected discounted reward $E\!\left[\sum_{t=1}^{\infty} \gamma^{t-1} r_t\right]$ over a known prior over the unknown MDPs.

Tractable in special cases ($\gamma < 1$, MAB; Gittins, 1989).

Page 5:

Bound the number of steps on which a suboptimal policy is played

Act near-optimally on all but a polynomial number of steps (Kakade, 2003; Strehl and Littman, 2008b; Strehl et al. 2006, 2009)

Example: [Strehl et al. 2006] (may have been improved in recent work)

Page 6:

Some examples:

An O(SA) analysis of Q-learning:

Michael Kearns and Satinder Singh, Finite-Sample Convergence Rates for Q-learning and Indirect Algorithms, NIPS 1999.

An analysis of TD(λ):

Michael Kearns and Satinder Singh, Bias-Variance Error Bounds for Temporal Difference Updates, COLT 2000.

Page 7:

Bound the difference in total reward of the algorithm compared to a benchmark policy:

$\mathrm{Regret}(M, T) = T\,\rho^*(s_0) - E\!\left[\sum_{t=1}^{T} r_t\right]$

where $\rho^*(s_0)$ is the infinite-horizon average reward achieved by the best stationary policy $\pi^*$.

Episodic / single thread:

Episodic: the starting state is reset every H steps

Single thread: assume a communicating MDP with diameter D

Worst-case regret bounds; example:

Theorem [Agrawal, Jia, 2017]: For any communicating MDP M with (unknown) diameter D, and for $T \ge S^5 A$, our algorithm achieves: $\mathrm{Regret}(M, T) \le O(D\sqrt{SAT})$

Bayesian regret bounds (expectation over a known prior on the MDP M); example:

Theorem [Osband, Russo, Van Roy, 2013]: For any known prior distribution f on the MDP M, our algorithm achieves in the episodic setting: $E_{M\sim f}[\mathrm{Regret}(M, T)] \le O(HS\sqrt{AT})$

Page 8:

Exploration-exploitation and regret minimization for multi-armed bandit

UCB

Thompson Sampling

Regret minimization for RL (tabular)

UCRL

Posterior sampling based algorithms


Page 9:

Two-armed bandit

[Figure: a single state s with two self-loop actions, each taken with probability 1.0 and yielding rewards r(s,1) and r(s,2); edges labeled P(s,a,s'), r(s,a).]

Page 10:

Online decisions:

At every time step $t = 1, \ldots, T$, pull one arm out of N arms

Stochastic feedback:

For each arm i, the reward is generated i.i.d. from a fixed but unknown distribution $\nu_i$ with support [0,1] and mean $\mu_i$

Bandit feedback:

Only the reward of the pulled arm can be observed

Page 11:

Multiple rigged slot machines in a casino. Which one to put money on?

• Try each one out

Arms == actions

WHEN TO STOP TRYING (EXPLORATION) AND START PLAYING (EXPLOITATION)?

Page 12:

Minimize the expected regret in time T:

$\mathrm{Regret}(T) = T\mu^* - E\!\left[\sum_{t=1}^{T} r_t\right]$, where $\mu^* = \max_j \mu_j$

Equivalent formulation: the expected regret for playing arm $I_t = i$ at time t is $\Delta_i = \mu^* - \mu_i$. Let $n_{i,T}$ be the number of times arm i is played up to time T. Then

$\mathrm{Regret}(T) = E\!\left[\sum_t (\mu^* - \mu_{I_t})\right] = \sum_{i:\, \mu_i \ne \mu^*} \Delta_i\, E[n_{i,T}]$

If we can bound $E[n_{i,T}]$ by $\frac{C \log T}{\Delta_i^2}$, then the regret bound is $\sum_{i:\, \mu_i \ne \mu^*} \frac{C \log T}{\Delta_i}$ — a problem-dependent bound, close to the lower bound [Lai and Robbins 1985].

For a problem-independent bound of $C\sqrt{NT\log T}$: separately bound the total regret from arms with $\Delta_i \le \sqrt{N\log T / T}$ and arms with $\Delta_i > \sqrt{N\log T / T}$.

Lower bound: $\Omega(\sqrt{NT})$
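To make the decomposition above concrete, here is a small simulation sketch; the arm means and the uniform-random play rule are purely illustrative assumptions, not from the slides:

```python
import numpy as np

rng = np.random.default_rng(0)
mu = np.array([0.9, 0.8, 0.5])   # hypothetical true means (illustrative only)
T = 10_000

# Naive rule purely for illustration: pull an arm uniformly at random each round.
pulls = rng.integers(0, len(mu), size=T)
rewards = rng.random(T) < mu[pulls]           # Bernoulli(mu_i) rewards

mu_star = mu.max()
delta = mu_star - mu                           # per-arm gaps Delta_i
n_T = np.bincount(pulls, minlength=len(mu))    # n_{i,T}: pulls of each arm

direct = T * mu_star - rewards.sum()           # T*mu* minus realized total reward
decomposed = (delta * n_T).sum()               # sum_i Delta_i * n_{i,T}
print(direct, decomposed)  # equal in expectation given the pulls; close for large T
```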

Page 13:

Two arms, black and red

Random rewards with unknown means: $\mu_1 = 1.1$, $\mu_2 = 1$

The optimal expected reward in T time steps is $1.1 \times T$

Exploit-only strategy: use the current best estimate (MLE/empirical mean) of the unknown means to pick arms

A few initial trials can mislead the strategy into playing the red action forever:

1.1, 1, 0.2, …

1, 1, 1, 1, 1, 1, 1, …

The expected regret in T steps is then close to $0.1 \times T$

Page 14:

Exploitation: play the empirical mean reward maximizer

Exploration: play less explored actions to ensure empirical estimates converge


Page 15:

Empirical mean at time t for arm 𝑖

Upper confidence bound

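As a reference for these two quantities, here is a minimal UCB1-style sketch for Bernoulli rewards. The index used below, $\hat\mu_{i,t} + \sqrt{2\ln t / n_{i,t}}$, is the standard UCB1 choice and is an assumption here; the constant implicit in the analysis on the next page may differ.

```python
import math
import random

def ucb1(pull_arm, n_arms, T):
    """Minimal UCB1 sketch: play each arm once, then maximize the UCB index."""
    counts = [0] * n_arms          # n_{i,t}: number of pulls of arm i
    means = [0.0] * n_arms         # empirical mean reward of arm i
    for t in range(1, T + 1):
        if t <= n_arms:
            i = t - 1              # initialization: play each arm once
        else:
            # UCB index: empirical mean + exploration bonus
            i = max(range(n_arms),
                    key=lambda a: means[a] + math.sqrt(2 * math.log(t) / counts[a]))
        r = pull_arm(i)            # bandit feedback: only the pulled arm's reward
        counts[i] += 1
        means[i] += (r - means[i]) / counts[i]   # incremental mean update
    return counts, means

# Example usage with hypothetical Bernoulli arms (means are illustrative only):
mus = [0.5, 0.6, 0.7]
counts, means = ucb1(lambda i: 1.0 if random.random() < mus[i] else 0.0, len(mus), 10_000)
print(counts, [round(m, 3) for m in means])
```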

Page 16:

Using Azuma-Hoeffding (since $E[r_t \mid I_t = i] = \mu_i$): with probability $1 - \frac{2}{t^2}$, the empirical mean $\hat\mu_{i,t}$ lies within the confidence radius of $\mu_i$.

Two implications, for every arm i:

$UCB_{i,t} > \mu_i$ with probability $1 - \frac{2}{t^2}$

If $n_{i,t} > \frac{16\ln T}{\Delta_i^2}$, then $\hat\mu_{i,t} < \mu_i + \frac{\Delta_i}{2}$ (recall $\Delta_i = \mu^* - \mu_i$)

Each suboptimal arm is played at most $\frac{16\ln T}{\Delta_i^2}$ times, since after that many plays $UCB_{i,t} < \mu_i + \Delta_i = \mu^* < UCB_{i^*,t}$ with high probability, so arm i is no longer selected.

Page 17:

A natural and efficient heuristic:

Maintain a belief about the parameters (e.g., mean reward) of each arm

Observe feedback, update the belief of the pulled arm i in a Bayesian manner

Pull an arm with probability equal to its posterior probability of being the best arm

NOT the same as choosing the arm that is most likely to be the best

Page 18:

Maintain Bayesian posteriors for the unknown parameters

With more trials, posteriors concentrate on the true parameters; the mode captures the MLE: enables exploitation

Fewer trials mean more uncertainty in the estimates; the spread/variance captures uncertainty: enables exploration

A sample from the posterior is used as an estimate of the unknown parameters to make decisions

Page 19:

Uniform distribution = Beta(1,1)

Beta(α, β) prior ⇒ posterior is

  Beta(α + 1, β) if you observe 1

  Beta(α, β + 1) if you observe 0

Start with a Beta(α₀, β₀) prior belief for every arm

In round t:

• For every arm i, sample θ_{i,t} independently from the posterior Beta(α₀ + S_{i,t}, β₀ + F_{i,t}), where S_{i,t} and F_{i,t} are the numbers of 1-rewards and 0-rewards observed for arm i so far

• Play arm i_t = argmax_i θ_{i,t}

• Observe the reward and update the Beta posterior for arm i_t
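A minimal sketch of this Beta-Bernoulli Thompson Sampling loop; the Beta(1,1) prior and the Bernoulli arms in the usage example are illustrative assumptions:

```python
import random

def thompson_sampling(pull_arm, n_arms, T, alpha0=1.0, beta0=1.0):
    """Beta-Bernoulli Thompson Sampling: sample from each posterior, play the argmax."""
    successes = [0] * n_arms   # S_{i,t}: number of 1-rewards seen for arm i
    failures = [0] * n_arms    # F_{i,t}: number of 0-rewards seen for arm i
    for _ in range(T):
        # Sample theta_i from Beta(alpha0 + S_i, beta0 + F_i) for every arm.
        theta = [random.betavariate(alpha0 + successes[i], beta0 + failures[i])
                 for i in range(n_arms)]
        i = max(range(n_arms), key=lambda a: theta[a])    # argmax of the samples
        r = pull_arm(i)                                   # bandit feedback: {0,1} reward
        if r == 1:
            successes[i] += 1
        else:
            failures[i] += 1
    return successes, failures

# Example usage with hypothetical Bernoulli arms (means are illustrative only):
mus = [0.45, 0.55, 0.60]
S, F = thompson_sampling(lambda i: 1 if random.random() < mus[i] else 0, len(mus), 10_000)
print([s + f for s, f in zip(S, F)])   # pull counts concentrate on the best arm
```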

Page 20:

Theorem [Russo and Van Roy 2014]: Given any prior f over MAB instances M described by reward distributions with [0,1]-bounded support (e.g., a prior distribution over $\mu_1, \mu_2, \ldots, \mu_N$ in the case of Bernoulli rewards),

$\mathrm{BayesianRegret}(T) = E_{M\sim f}[\mathrm{Regret}(T, M)] \le O(\sqrt{NT\ln T})$

Note: the Regret(T, M) notation is used to explicitly indicate the dependence of regret on the MAB instance M. In comparison, the worst-case regret is

$\mathrm{Regret}(T) = \max_M \mathrm{Regret}(T, M)$

Page 21:

[A. and Goyal COLT 2012, AISTATS 2013, JACM 2017; Kaufmann et al. 2013]

Instance-dependent bounds for {0,1} rewards:

$\mathrm{Regret}(T) \le (1+\epsilon)\,\ln T \sum_i \frac{\Delta_i}{KL(\mu_i \,\|\, \mu^*)} + O\!\left(\frac{N}{\epsilon^2}\right)$

Matches the asymptotic instance-wise lower bound [Lai and Robbins 1985]

The UCB algorithm achieves this only after careful tuning [Kaufmann et al. 2012]

Arbitrary bounded [0,1] rewards (using Beta and Gaussian priors):

$\mathrm{Regret}(T) \le O\!\left(\ln T \sum_i \frac{1}{\Delta_i}\right)$

Matches the best available bounds for UCB for general reward distributions

Instance-independent bounds (Beta and Gaussian priors):

$\mathrm{Regret}(T) \le O(\sqrt{NT\ln T})$

Prior and likelihood mismatch allowed!

Page 22:

Two arms, $\mu_1 \ge \mu_2$, $\Delta = \mu_1 - \mu_2$

Every time arm 2 is pulled, Δ regret is incurred

Bound the number of pulls of arm 2 by $\frac{\log T}{\Delta^2}$ to get a $\frac{\log T}{\Delta}$ regret bound

How many pulls of arm 2 are actually needed?

Page 23:

After $n = O\!\left(\frac{\log T}{\Delta^2}\right)$ pulls of arm 2 and arm 1:

Empirical means are well separated:

error $|\hat\mu_i - \mu_i| \le \sqrt{\frac{\log T}{n}} \le \frac{\Delta}{4}$ w.h.p. (using the Azuma-Hoeffding inequality)

Beta posteriors are well separated:

mean $= \frac{\alpha_i}{\alpha_i + \beta_i} = \hat\mu_i$, standard deviation $\simeq \frac{1}{\sqrt{\alpha + \beta}} = \frac{1}{\sqrt{n}} \le \frac{\Delta}{4}$

The two arms can be distinguished! No more arm 2 pulls.

[Figure: the two posteriors, concentrated near $\hat\mu_2$ and $\hat\mu_1$ respectively, separated by the gap Δ and barely overlapping.]
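For concreteness, a quick check of the arithmetic behind the Δ/4 separation, assuming the $n = \frac{16\log T}{\Delta^2}$ threshold from the UCB analysis earlier (the exact constant hidden in the O(·) on this slide is not spelled out):

```latex
n = \frac{16\,\log T}{\Delta^{2}}
\quad\Longrightarrow\quad
\sqrt{\frac{\log T}{n}} = \sqrt{\frac{\Delta^{2}}{16}} = \frac{\Delta}{4},
\qquad
\frac{1}{\sqrt{n}} = \frac{\Delta}{4\sqrt{\log T}} \le \frac{\Delta}{4}
\quad (\text{for } T \ge e).
```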

Page 24:

What if arm 2 is pulled fewer than $n = O\!\left(\frac{\log T}{\Delta^2}\right)$ times?

Then the regret is at most $n\Delta = O\!\left(\frac{\log T}{\Delta}\right)$

Page 25:

At least $\frac{\log T}{\Delta^2}$ pulls of arm 2, but few pulls of arm 1

[Figure: the posterior for arm 2 (many pulls) is concentrated near $\hat\mu_2$, while the posterior for arm 1 (few pulls) is still wide; the means are separated by Δ.]

Page 26:

In this situation, arm 1 will be played roughly once every constant number of steps

It will take at most a constant $\times \frac{\log T}{\Delta^2}$ steps (extra pulls of arm 2) to get out of this situation

So the total number of pulls of arm 2 is at most $O\!\left(\frac{\log T}{\Delta^2}\right)$

Summary: the variance of the posterior enables exploration

Optimal bounds (and the case of multiple arms) require more careful use of the posterior structure

Page 27:

Using UCB and TS ideas for exploration-exploitation in RL


Page 28:

Single-state MDP; solution concept: optimal action

Multi-armed bandit problem

Uncertainty in rewards: random rewards with unknown means $\mu_1, \mu_2$

Exploit only: use the current best estimate (MLE/empirical mean) of the unknown means to pick arms

A few initial trials can mislead the strategy into playing the red action forever:

1.1, 1, 0.2, …

1, 1, 1, 1, 1, 1, 1, …

[Figure: a single state s with two self-loop actions, each taken with probability 1.0 and yielding rewards +1 and +1.1; edges labeled P(s,a,s'), r(s,a).]

Page 29:

Uncertainty in rewards and state transitions

Unknown reward distributions and transition probabilities

Exploration-exploitation:

Explore actions/states/policies, learn the reward distributions and the transition model

Exploit the (seemingly) best policy

[Figure: a two-state MDP on S1 and S2; edges are labeled with (transition probability, reward) pairs: (1.0, $1), (0.6, $0), (0.9, $100), (0.4, $0), (0.1, $0), (1.0, $1).]

Page 30:

Upper confidence bound based algorithms [Jaksch, Ortner, Auer, 2010], [Bartlett, Tewari, 2012]:

Worst-case regret bound $O(DS\sqrt{AT})$ for communicating MDPs

Lower bound $\Omega(\sqrt{DSAT})$

Optimistic Posterior Sampling [A., Jia 2017]:

Worst-case regret bound $O(D\sqrt{SAT})$ for communicating MDPs of diameter D

Improvement by a factor of $\sqrt{S}$

Optimistic Value Iteration [Azar, Osband, Munos, 2017]:

Worst-case regret bound $O(\sqrt{HSAT})$ in the episodic setting

Posterior sampling in the known-prior setting [Osband, Russo, and Van Roy 2013; Osband and Van Roy, 2016, 2017]:

Bayesian regret bound of $O(H\sqrt{SAT})$ in the episodic setting, with length-H episodes

Page 31:

UCRL: Upper confidence bound based algorithm for RL

Posterior sampling based algorithm for RL

Main result

Proof techniques


Page 32:

Similar principles as UCB

This is a model-based approach:

Maintain an estimate of the model $(\hat P, \hat R)$

Occasionally solve the MDP $(S, A, \hat P, \hat R, s_1)$ to find a policy

Run this policy for some time to get samples, and update the model estimate

Compare to a "model-free" or direct-learning approach like Q-learning:

Directly update Q-values, the value function, or the policy using samples.

Page 33:

Proceed in epochs; an epoch ends when the number of visits to some state-action pair doubles.

At the beginning of every epoch k:

Use the samples to compute an optimistic MDP $(S, A, \tilde R, \tilde P, s_1)$

an MDP with value greater than the true MDP (an upper bound!)

Solve the optimistic MDP to find its optimal policy $\tilde\pi$

Execute $\tilde\pi$ in epoch k:

observe samples $(s_t, r_t, s_{t+1})$

go to the next epoch if the number of visits to some state-action pair doubles, i.e., if $n_k(s, a) \ge 2\, n_{k-1}(s, a)$ for some $(s, a)$

Page 34:

At the beginning of every epoch k:

For every $(s, a)$, compute the empirical model estimate:

let $n_k(s, a)$ be the number of times $(s, a)$ was visited before this epoch, and $n_k(s, a, s')$ the number of transitions to $s'$

set $\hat R(s, a)$ to the average reward over these $n_k(s, a)$ steps

set $\hat P(s, a, s') = \frac{n_k(s, a, s')}{n_k(s, a)}$

Compute an optimistic model estimate:

use Chernoff bounds to define a confidence region around $(\hat R, \hat P)$, e.g. $|\tilde P(s, a, s') - \hat P(s, a, s')| \le \sqrt{\frac{\log t}{n_k(s, a)}}$ with probability $1 - \frac{1}{t^2}$

the true $(R, P)$ lies in this region (with high probability)

Find the best combination $(\tilde R, \tilde P)$ in this region, i.e., the MDP $(S, A, \tilde R, \tilde P, s_1)$ with maximum value

It will have value greater than the true MDP
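A minimal sketch of the per-epoch bookkeeping described above (visit counts, empirical estimates, and confidence widths). Solving the optimistic MDP itself, e.g. by extended value iteration, is omitted, and the confidence-width constant is an illustrative assumption:

```python
import numpy as np

def empirical_model(counts_sas, reward_sums, t):
    """Empirical estimates and confidence widths for a UCRL-style epoch.

    counts_sas:  array [S, A, S] of transition counts n_k(s, a, s')
    reward_sums: array [S, A] of summed rewards observed at (s, a)
    t:           current time step (enters the confidence width)
    """
    n_sa = counts_sas.sum(axis=2)                      # n_k(s, a)
    n_safe = np.maximum(n_sa, 1)                       # avoid division by zero
    P_hat = counts_sas / n_safe[:, :, None]            # empirical transition probabilities
    R_hat = reward_sums / n_safe                       # empirical mean rewards
    # Confidence width for each (s, a); the leading constant is an illustrative choice.
    width = np.sqrt(np.log(max(t, 2)) / n_safe)
    return P_hat, R_hat, width

# Tiny usage example with hypothetical counts (S=2 states, A=2 actions):
counts = np.array([[[3, 1], [0, 2]],
                   [[1, 1], [4, 0]]], dtype=float)
rewards = np.array([[2.5, 1.0], [0.5, 3.0]])
P_hat, R_hat, width = empirical_model(counts, rewards, t=100)
print(P_hat[0, 0], R_hat[0, 0], width[0, 0])
```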

Page 35:

Recall the regret:

$\mathrm{Regret}(M, T) = T\,\rho^* - \sum_{t=1}^{T} r(s_t, a_t)$

Theorem: For any communicating MDP M with (unknown) diameter D, with high probability:

$\mathrm{Regret}(M, T) \le O(DS\sqrt{AT})$

The $O(\cdot)$ notation hides logarithmic factors in S, A, T, beyond constants.

Page 36:

Our setting, regret definition

Posterior sampling algorithm for MDPs

Main result

Proof techniques


Page 37:

Maintain Bayesian posteriors for the unknown parameters

With more trials, posteriors concentrate on the true parameters; the mode captures the MLE: enables exploitation

Fewer trials mean more uncertainty in the estimates; the spread/variance captures uncertainty: enables exploration

A sample from the posterior is used as an estimate of the unknown parameters to make decisions

Page 38:

Assume for simplicity: known reward distribution

The algorithm needs to learn the unknown transition probability vector $P_{s,a} = (P_{s,a}(1), \ldots, P_{s,a}(S))$ for every $(s, a)$

In any state $s_t = s$ with action $a_t = a$, it observes the new state $s_{t+1}$:

the outcome of a multivariate Bernoulli (multinoulli) trial with probability vector $P_{s,a}$

Page 39:

Given a Dirichlet$(\alpha_1, \alpha_2, \ldots, \alpha_S)$ prior on $P_{s,a}$:

After a multinoulli trial with outcome (new state) i, the Bayesian posterior on $P_{s,a}$ is Dirichlet$(\alpha_1, \alpha_2, \ldots, \alpha_i + 1, \ldots, \alpha_S)$

After $n_{s,a} = \alpha_1 + \cdots + \alpha_S$ observations for a state-action pair $(s, a)$:

the posterior mean vector is the empirical mean, $\hat P_{s,a}(i) = \frac{\alpha_i}{\alpha_1 + \cdots + \alpha_S} = \frac{\alpha_i}{n_{s,a}}$

the variance is bounded by $\frac{1}{n_{s,a}}$

With more trials of $(s, a)$, the posterior mean concentrates around the true mean
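A minimal sketch of this Dirichlet update and posterior sampling for a single (s, a) pair; the prior pseudo-counts of 1 and the observed transitions are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)
S = 4                                   # number of states (illustrative)
alpha = np.ones(S)                      # Dirichlet(1, ..., 1) prior on P_{s,a}

def observe_transition(alpha, next_state):
    """Bayesian update: increment the pseudo-count of the observed next state."""
    alpha = alpha.copy()
    alpha[next_state] += 1
    return alpha

# Suppose we observe a few transitions out of (s, a):
for s_next in [2, 2, 0, 2, 3]:
    alpha = observe_transition(alpha, s_next)

posterior_mean = alpha / alpha.sum()     # concentrates on the empirical frequencies
sampled_P = rng.dirichlet(alpha)         # one posterior sample of P_{s,a}
print(posterior_mean, sampled_P)
```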

Page 40:

Learning:

Maintain a Dirichlet posterior for $P_{s,a}$ for every $(s, a)$; after round t, on observing the outcome $s_{t+1}$, update the posterior for state $s_t$ and action $a_t$

To decide the action:

Sample a $\tilde P_{s,a}$ for every $(s, a)$

Compute the optimal policy $\tilde\pi$ for the sampled MDP $(S, A, \tilde P, r, s_0)$

Choose $a_t = \tilde\pi(s_t)$

Exploration-exploitation:

Exploitation: with more observations the Dirichlet posterior concentrates, so $\tilde P_{s,a}$ approaches the empirical mean $\hat P_{s,a}$

Exploration: anti-concentration of the Dirichlet ensures exploration of less-explored states/actions/policies
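A minimal posterior-sampling sketch of this loop. For compactness it resamples the model every step and solves the sampled MDP with discounted value iteration rather than the average-reward formulation used in these slides; the rewards are assumed known, as on the previous pages, and the small environment is purely illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)

def value_iteration(P, r, gamma=0.95, iters=200):
    """Approximately solve the sampled MDP (S, A, P, r); return a greedy policy."""
    S, A, _ = P.shape
    V = np.zeros(S)
    for _ in range(iters):
        Q = r + gamma * (P @ V)        # Q[s, a] = r[s, a] + gamma * sum_s' P[s,a,s'] V[s']
        V = Q.max(axis=1)
    return Q.argmax(axis=1)            # greedy policy w.r.t. the sampled model

def psrl_step(alpha, r, state):
    """One decision step: sample P from the Dirichlet posteriors, plan, act."""
    S, A, _ = alpha.shape
    P_sample = np.array([[rng.dirichlet(alpha[s, a]) for a in range(A)]
                         for s in range(S)])
    policy = value_iteration(P_sample, r)
    return policy[state]

# Tiny illustrative environment: S=2 states, A=2 actions, known rewards.
S, A = 2, 2
true_P = np.array([[[0.9, 0.1], [0.4, 0.6]],
                   [[0.1, 0.9], [1.0, 0.0]]])
r = np.array([[1.0, 0.0], [0.0, 100.0]])   # r[s, a], known for simplicity
alpha = np.ones((S, A, S))                 # Dirichlet(1, ..., 1) prior for every (s, a)

state = 0
for t in range(500):
    a = psrl_step(alpha, r, state)
    next_state = rng.choice(S, p=true_P[state, a])
    alpha[state, a, next_state] += 1       # Bayesian posterior update
    state = next_state
```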

Page 41:

Proceed in epochs; an epoch ends when the number of visits $N_{s,a}$ of any state-action pair doubles.

In every epoch:

For every $(s, a)$, generate multiple ($\psi = O(S)$) independent samples from the Dirichlet posterior for $P_{s,a}$

Form an extended sample MDP $(S, \psi A, \tilde P, r, s_0)$

Find its optimal policy $\tilde\pi$ and use it throughout the epoch

Further, initial exploration: for $(s, a)$ with very small $N_{s,a} < \sqrt{TS/A}$, use a simple optimistic sampling that provides extra exploration

Page 42:

An algorithm based on posterior sampling, with a high-probability near-optimal worst-case regret upper bound.

Recall the regret:

$\mathrm{Regret}(M, T) = T\,\rho^* - \sum_{t=1}^{T} r(s_t, a_t)$

Theorem: For any communicating MDP M with (unknown) diameter D, and for $T \ge S^5 A$, with high probability:

$\mathrm{Regret}(M, T) \le O(D\sqrt{SAT})$

An improvement by a factor of $\sqrt{S}$ over the UCB-based algorithm.
