Shipra Agrawal
IEOR 8100 (Reinforcement Learning)
Computational complexity
the amount of per-time-step computation the algorithm uses during learning;
Space complexity
the amount of memory used by the algorithm;
Learning complexity
a measure of how much experience the algorithm needs to learn in a given task.
2
Using a single thread of experience
No resets or generative model
Vs.
Generative models (simulator)
http://hunch.net/~jl/projects/RL/EE.html
3
Optimal learning complexity?
"Optimally explore" means: obtain the maximum expected discounted reward E[∑_{t=1}^∞ γ^{t−1} r_t] over a known prior over the unknown MDPs
Tractable in special cases (γ < 1, MAB; Gittins, 1989).
4
Bound the number of steps on which a suboptimal policy is played
Near-optimal on all but a polynomial number of steps (Kakade, 2003; Strehl and Littman, 2008b; Strehl et al. 2006, 2009)
Example: [Strehl et al. 2006] (may have been improved in recent work)
5
Some examples:
An O(SA) analysis of Q-learning.
Michael Kearns and Satinder Singh, Finite-Sample Convergence Rates for Q-learning and Indirect Algorithms NIPS 1999.
An analysis of TD-lambda.
Michael Kearns and Satinder Singh, Bias-Variance Error Bounds for Temporal Difference Updates, COLT 2000.
6
Bound the difference in total reward of the algorithm compared to a benchmark policy:
Reg(M, T) = T·ρ*(s_0) − E[∑_{t=1}^T r_t]
ρ*(s_0) is the infinite-horizon average reward achieved by the best stationary policy π*
Episodic/single thread:
Episodic: the starting state is reset every H steps
Single thread: assume a communicating MDP with diameter D
Worst-case regret bounds; example:
Theorem [Agrawal, Jia, 2017]: For any communicating MDP M with (unknown) diameter D, and for T ≥ S⁵A, our algorithm achieves: Regret(M, T) ≤ Õ(D√(SAT))
Bayesian regret bounds (expectation over a known prior on MDP M); example:
Theorem [Osband, Russo, Van Roy, 2013]: For any known prior distribution f on MDP M, our algorithm achieves in the episodic setting: E_{M∼f}[Regret(M, T)] ≤ Õ(HS√(AT))
7
Exploration-exploitation and regret minimization for multi-armed bandit
UCB
Thompson Sampling
Regret minimization for RL (tabular)
UCRL
Posterior sampling based algorithms
8
Two-armed bandit
9
[Figure: a single state s with two self-loop actions, each taken with probability 1.0, with rewards r(s, 1) and r(s, 2); MDP notation P(s, a, s′), r(s, a)]
Online decisions
At every time step 𝑡 = 1,… , 𝑇, pull one arm out of 𝑁 arms
Stochastic feedback
For each arm 𝑖, reward is generated i.i.d. from a fixed but unknown distribution 𝜈𝑖
support [0,1], mean 𝜇𝑖
Bandit feedback
Only the reward of the pulled arm can be observed
Multiple rigged slot machines in a casino.
Which one to put money on?
• Try each one out
Arms == actions
When to stop trying (exploration) and start playing (exploitation)?
Minimize expected regret in time T: Regret(T) = T·μ* − E[∑_{t=1}^T r_t], where μ* = max_j μ_j
Equivalent formulation: expected regret for playing arm I_t = i at time t is Δ_i = μ* − μ_i
Let n_{i,T} be the number of times arm i is played up to time T:
Regret(T) = ∑_t (μ* − μ_{I_t}) = ∑_{i: μ_i ≠ μ*} Δ_i·E[n_{i,T}]
If we can bound E[n_{i,T}] by C·(log T)/Δ_i², then the regret bound is ∑_{i: μ_i ≠ μ*} C·(log T)/Δ_i
Problem-dependent bound, close to the lower bound [Lai and Robbins 1985]
For a problem-independent bound: C√(NT log T)
Separately bound the total regret from arms with Δ_i ≤ √((N log T)/T) and arms with Δ_i > √((N log T)/T)
Lower bound: Ω(√(NT))
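The two equivalent forms of regret above (per-step sum vs. per-arm counts) can be checked numerically; the means and pull sequence in this sketch are made up for illustration:

```python
def regret_from_pulls(mu, pulls):
    # Regret(T) = sum_t (mu* - mu_{I_t}) = sum_i Delta_i * n_{i,T}
    mu_star = max(mu)
    per_step = sum(mu_star - mu[i] for i in pulls)            # per-step form
    counts = [pulls.count(i) for i in range(len(mu))]         # n_{i,T}
    per_arm = sum((mu_star - mu[i]) * c for i, c in enumerate(counts))
    assert abs(per_step - per_arm) < 1e-12                    # the two forms agree
    return per_step

# Arms with means 0.9 (optimal), 0.5, 0.2; arm 1 pulled twice, arm 2 once.
print(regret_from_pulls([0.9, 0.5, 0.2], [0, 1, 2, 0, 1]))   # 2*0.4 + 1*0.7 = 1.5
```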
Two arms: black and red
Random rewards with unknown means μ₁ = 1.1, μ₂ = 1
Optimal expected reward in T time steps is 1.1 × T
Exploit-only strategy: use the current best estimate (MLE/empirical mean) of the unknown means to pick arms
The initial few trials can mislead into playing the red action forever
Example observed reward sequence: 1.1, 1, 0.2, then 1, 1, 1, 1, 1, 1, 1, ……
Expected regret in T steps is then close to 0.1 × T
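One plausible reading of this failure mode can be scripted deterministically; the helper below is purely illustrative (arm 0 = black with mean 1.1, arm 1 = red with mean 1, and the scripted draws 1.1, 1, 0.2 are the slide's example):

```python
# Deterministic illustration: a greedy (exploit-only) learner can lock
# onto the suboptimal arm forever after a few misleading draws.

def greedy_bandit(scripted, steps):
    # scripted[i] = rewards arm i will return, in order; once exhausted,
    # the arm returns its true mean (1.1 for arm 0, 1.0 for arm 1).
    means = [1.1, 1.0]
    counts = [0, 0]
    sums = [0.0, 0.0]
    history = []
    for _ in range(steps):
        if 0 in counts:                       # pull each arm once first
            arm = counts.index(0)
        else:                                 # then pure exploitation
            arm = 0 if sums[0]/counts[0] >= sums[1]/counts[1] else 1
        r = scripted[arm].pop(0) if scripted[arm] else means[arm]
        counts[arm] += 1
        sums[arm] += r
        history.append(arm)
    return history

history = greedy_bandit(scripted=[[1.1, 0.2], [1.0]], steps=50)
# After observing 1.1, 1, 0.2 the estimate of arm 0 drops to 0.65 < 1.0,
# so the greedy learner plays arm 1 (regret 0.1 per step) forever after.
```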
4/20/2019 13
Exploitation: play the empirical mean reward maximizer
Exploration: play less explored actions to ensure empirical estimates converge
14
Empirical mean at time t for arm i: μ̂_{i,t}
Upper confidence bound: UCB_{i,t} = μ̂_{i,t} + 2√((ln t)/n_{i,t}), where n_{i,t} is the number of plays of arm i so far
15
Using Azuma-Hoeffding (since E[r_t | I_t = i] = μ_i), with probability 1 − 2/t²: |μ̂_{i,t} − μ_i| ≤ 2√((ln t)/n_{i,t})
Two implications, for every arm i:
UCB_{i,t} > μ_i with probability 1 − 2/t²
If n_{i,t} > 16(ln T)/Δ_i², then μ̂_{i,t} < μ_i + Δ_i/2, so UCB_{i,t} < μ_i + Δ_i = μ* (whp)
Each suboptimal arm is played at most 16(ln T)/Δ_i² times, since after that many plays its UCB falls below μ*, while the UCB of the optimal arm stays above μ*
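The index computation can be sketched as follows; the Gaussian reward noise and the specific arm means are illustrative assumptions, with the bonus 2√((ln t)/n_i) matching the slides:

```python
import math
import random

def ucb(means, T, seed=0):
    # UCB with bonus 2*sqrt(ln t / n_i); rewards are Gaussian with the
    # given (hypothetical) means and small noise.
    rng = random.Random(seed)
    n = [0] * len(means)       # play counts n_{i,t}
    s = [0.0] * len(means)     # reward sums
    for t in range(1, T + 1):
        if 0 in n:
            i = n.index(0)     # play each arm once to initialize
        else:
            i = max(range(len(means)),
                    key=lambda j: s[j]/n[j] + 2*math.sqrt(math.log(t)/n[j]))
        r = means[i] + rng.gauss(0, 0.1)
        n[i] += 1
        s[i] += r
    return n

counts = ucb([0.9, 0.5], T=2000)
# The suboptimal arm's play count grows only logarithmically in T.
```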
16
A natural and efficient heuristic
Maintain belief about parameters (e.g., mean reward) of each arm
Observe feedback, update belief of pulled arm i in Bayesian manner
Pull arm with posterior probability of being best arm
NOT same as choosing the arm that is most likely to be best
Maintain Bayesian posteriors for unknown parameters
With more trials, posteriors concentrate on the true parameters. The mode captures the MLE: enables exploitation
Fewer trials mean more uncertainty in the estimates. The spread/variance captures uncertainty: enables exploration
A sample from the posterior is used as an estimate for unknown parameters to make decisions
18
Uniform distribution: Beta(1, 1)
Beta(α, β) prior ⇒ posterior:
Beta(α + 1, β) if you observe 1
Beta(α, β + 1) if you observe 0
Start with a Beta(α₀, β₀) prior belief for every arm
In round t:
• For every arm i, sample θ_{i,t} independently from the posterior Beta(α₀ + S_{i,t}, β₀ + F_{i,t})
• Play arm i_t = arg max_i θ_{i,t}
• Observe the reward and update the Beta posterior for arm i_t
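The three steps above, as a runnable sketch for Bernoulli arms with a Beta(1, 1) prior; the arm success probabilities here are made up:

```python
import random

def thompson_bernoulli(probs, T, seed=0):
    # Beta(1,1) priors; S[i]/F[i] track successes/failures of arm i.
    rng = random.Random(seed)
    S = [0] * len(probs)
    F = [0] * len(probs)
    counts = [0] * len(probs)
    for _ in range(T):
        # Sample theta_i from Beta(1 + S_i, 1 + F_i) for each arm.
        theta = [rng.betavariate(1 + S[i], 1 + F[i]) for i in range(len(probs))]
        i = theta.index(max(theta))           # i_t = argmax_i theta_{i,t}
        r = 1 if rng.random() < probs[i] else 0
        S[i] += r
        F[i] += 1 - r
        counts[i] += 1
    return counts

counts = thompson_bernoulli([0.8, 0.3], T=2000)
# The posterior of the better arm concentrates, so it is pulled far more often.
```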
Theorem [Russo and Van Roy 2014]: Given any prior f over a MAB instance M described by reward distributions with [0,1]-bounded support (e.g., a prior distribution over μ₁, μ₂, …, μ_N in the case of Bernoulli rewards):
BayesianRegret(T) = E_{M∼f}[Regret(T, M)] ≤ O(√(NT ln T))
Note: the notation Regret(T, M) explicitly indicates the dependence of regret on the MAB instance M.
In comparison, the worst-case regret is Regret(T) = max_M Regret(T, M)
20
[A. and Goyal COLT 2012, AISTATS 2013, JACM 2017; Kaufmann et al. 2013]
Instance-dependent bounds for {0,1} rewards:
Regret(T) ≤ (1 + ε) ln T · ∑_i Δ_i/KL(μ_i||μ*) + O(N/ε²)
Matches the asymptotic instance-wise lower bound [Lai and Robbins 1985]
The UCB algorithm achieves this only after careful tuning [Kaufmann et al. 2012]
Arbitrary bounded [0,1] rewards (using Beta and Gaussian priors):
Regret(T) ≤ O(ln T · ∑_i 1/Δ_i)
Matches the best available bounds for UCB for general reward distributions
Instance-independent bounds (Beta and Gaussian priors):
Regret(T) ≤ O(√(NT ln T))
Prior and likelihood mismatch allowed!
Two arms, μ₁ ≥ μ₂, Δ = μ₁ − μ₂
Every time arm 2 is pulled, Δ regret is incurred
Bound the number of pulls of arm 2 by (log T)/Δ² to get a (log T)/Δ regret bound
How many pulls of arm 2 are actually needed?
After n = O((log T)/Δ²) pulls each of arm 2 and arm 1:
Empirical means are well separated: error |μ̂_i − μ_i| ≤ √((log T)/n) ≤ Δ/4 whp (using the Azuma-Hoeffding inequality)
Beta posteriors are well separated: mean α_i/(α_i + β_i) ≈ μ̂_i, standard deviation ≈ 1/√(α + β) = 1/√n ≤ Δ/4
The two arms can be distinguished! No more arm 2 pulls.
What if arm 2 is pulled fewer than n = O((log T)/Δ²) times?
Then regret is at most nΔ = O((log T)/Δ)
What if there are at least (log T)/Δ² pulls of arm 2, but few pulls of arm 1?
Arm 1's posterior is still spread out, so arm 1 will be played roughly every constant number of steps in this situation
It will take at most a constant × (log T)/Δ² steps (extra pulls of arm 2) to get out of this situation
Total number of pulls of arm 2 is at most O((log T)/Δ²)
Summary: variance of posterior enables exploration
Optimal bounds (and for multiple arms) require more careful use of posterior structure
Using UCB and TS ideas for exploration-exploitation in RL
27
Single-state MDP. Solution concept: optimal action
Multi-armed bandit problem
Uncertainty in rewards: random rewards with unknown means μ₁, μ₂
Exploit only: use the current best estimate (MLE/empirical mean) of the unknown means to pick arms
The initial few trials can mislead into playing the red action forever
Example observed reward sequence: 1.1, 1, 0.2, then 1, 1, 1, 1, 1, 1, 1, ……
[Figure: a single state s with two self-loop actions, each taken with probability 1.0, with rewards +1.1 and +1; MDP notation P(s, a, s′), r(s, a)]
28
Uncertainty in rewards, state transitions
Unknown reward distribution, transition probabilities
Exploration-exploitation:
Explore actions/states/policies, learn reward distributions and transition model
Exploit the (seemingly) best policy
29
[Figure: a two-state MDP with states S1, S2; edges labeled (probability, reward): 1.0, $1; 0.6, $0; 0.9, $100; 0.4, $0; 0.1, $0; 1.0, $1]
Upper confidence bound based algorithms [Jaksch, Ortner, Auer, 2010] [Bartlett, Tewari, 2012]
Worst-case regret bound Õ(DS√(AT)) for communicating MDPs
Lower bound Ω(√(DSAT))
Optimistic posterior sampling [A., Jia 2017]
Worst-case regret bound Õ(D√(SAT)) for communicating MDPs of diameter D
Improvement by a factor of √S
Optimistic value iteration [Azar, Osband, Munos, 2017]
Worst-case regret bound Õ(√(HSAT)) in the episodic setting
Posterior sampling in the known-prior setting [Osband, Russo, and Van Roy 2013; Osband and Van Roy 2016, 2017]
Bayesian regret bound of Õ(H√(SAT)) in the episodic setting, with length-H episodes
30
UCRL: Upper confidence bound based algorithm for RL
Posterior sampling based algorithm for RL
Main result
Proof techniques
31
Similar principles as UCB
This is a model-based approach:
Maintain an estimate of the model P̂, R̂
Occasionally solve the MDP (S, A, P̂, R̂, s₁) to find a policy
Run this policy for some time to get samples, and update the model estimate
Compare to the "model-free" or direct-learning approach, e.g. Q-learning:
Directly update Q-values or the value function or policy using samples.
32
Proceed in epochs; an epoch ends when the number of visits to some state-action pair doubles.
At the beginning of every epoch k:
Use samples to compute an optimistic MDP (S, A, R̃, P̃, s₁)
An MDP with value greater than the true MDP (an upper bound!)
Solve the optimistic MDP to find its optimal policy π̃
Execute π̃ in epoch k:
Observe samples (s_t, r_t, s_{t+1})
Go to the next epoch if the number of visits to some state-action pair doubles:
If n_k(s, a) ≥ 2·n_{k−1}(s, a) for some (s, a)
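The doubling stopping rule can be sketched as below; `visits` is a hypothetical stream of state-action pair indices, and the function returns the resulting epoch lengths:

```python
def epoch_lengths(visits, num_pairs):
    # UCRL-style epoch rule: the current epoch ends as soon as some
    # pair's within-epoch visit count v(s,a) reaches max(1, N(s,a)),
    # i.e. its total visit count doubles.
    N = [0] * num_pairs        # visits before the current epoch
    v = [0] * num_pairs        # visits within the current epoch
    lengths, steps = [], 0
    for sa in visits:
        v[sa] += 1
        steps += 1
        if v[sa] >= max(1, N[sa]):
            lengths.append(steps)
            N = [N[i] + v[i] for i in range(num_pairs)]
            v = [0] * num_pairs
            steps = 0
    return lengths

lengths = epoch_lengths([0] * 15, num_pairs=1)
# → [1, 1, 2, 4]: each epoch lasts until pair 0's total count doubles,
# so there are only O(log T) epochs per state-action pair.
```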
33
At the beginning of every epoch k:
For every (s, a), compute the empirical model estimate. Let n_k(s, a) be the number of times (s, a) was visited before this epoch, and n_k(s, a, s′) the number of transitions to s′:
Set R̂(s, a) as the average reward over these n_k(s, a) steps
Set P̂(s, a, s′) = n_k(s, a, s′)/n_k(s, a)
Compute an optimistic model estimate. Use Chernoff bounds to define a confidence region around R̂, P̂:
|P′(s, a, s′) − P̂(s, a, s′)| ≤ √((log t)/n_k(s, a)); the true R, P lie in this region with probability 1 − 1/t²
Find the best combination R̃, P̃ in this region: the MDP (S, A, R̃, P̃, s₁) with maximum value
It will have value more than the true MDP
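The empirical estimates and confidence width above can be computed directly from raw samples; the tiny transition log in this sketch is made up:

```python
import math
from collections import defaultdict

def empirical_model(transitions, t):
    # transitions: list of (s, a, r, s_next) samples observed so far.
    # Returns P_hat, R_hat, and the width sqrt(log t / n(s,a)) used on
    # these slides to define the confidence region around the estimates.
    n = defaultdict(int)       # n(s, a)
    n3 = defaultdict(int)      # n(s, a, s')
    rsum = defaultdict(float)  # summed rewards per (s, a)
    for s, a, r, s2 in transitions:
        n[(s, a)] += 1
        n3[(s, a, s2)] += 1
        rsum[(s, a)] += r
    P_hat = {k: n3[k] / n[(k[0], k[1])] for k in n3}
    R_hat = {k: rsum[k] / n[k] for k in n}
    width = {k: math.sqrt(math.log(t) / n[k]) for k in n}
    return P_hat, R_hat, width

samples = [("s1", "a", 1.0, "s1"), ("s1", "a", 1.0, "s2"),
           ("s1", "a", 0.0, "s2"), ("s1", "a", 1.0, "s1")]
P_hat, R_hat, width = empirical_model(samples, t=4)
# P_hat[("s1","a","s1")] == 0.5 and R_hat[("s1","a")] == 0.75
```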
34
Recall regret: Regret(M, T) = T·ρ* − ∑_{t=1}^T r(s_t, a_t)
Theorem: For any communicating MDP M with (unknown) diameter D, with high probability:
Regret(M, T) ≤ Õ(DS√(AT))
The Õ(⋅) notation hides logarithmic factors in S, A, T beyond constants.
35
Our setting, regret definition
Posterior sampling algorithm for MDPs
Main result
Proof techniques
36
Maintain Bayesian posteriors for unknown parameters
With more trials, posteriors concentrate on the true parameters. The mode captures the MLE: enables exploitation
Fewer trials mean more uncertainty in the estimates. The spread/variance captures uncertainty: enables exploration
A sample from the posterior is used as an estimate for unknown parameters to make decisions
37
Assume for simplicity: known reward distribution
Need to learn the unknown transition probability vector P_{s,a} = (P_{s,a}(1), …, P_{s,a}(S)) for all (s, a)
In any state s_t = s with action a_t = a, the algorithm observes the new state s_{t+1}:
the outcome of a Multinoulli trial with probability vector P_{s,a}
38
Given a prior Dirichlet(α₁, α₂, …, α_S) on P_{s,a}:
After a Multinoulli trial with outcome (new state) i, the Bayesian posterior on P_{s,a} is Dirichlet(α₁, α₂, …, α_i + 1, …, α_S)
After n_{s,a} = α₁ + ⋯ + α_S observations for a state-action pair (s, a):
The posterior mean vector is the empirical mean: P̂_{s,a}(i) = α_i/(α₁ + ⋯ + α_S) = α_i/n_{s,a}
The variance is bounded by 1/n_{s,a}
With more trials of 𝑠, 𝑎, the posterior mean concentrates around true mean
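The conjugate update and posterior mean are straightforward to code; a minimal sketch, assuming a uniform Dirichlet(1, 1, 1) prior over S = 3 next states:

```python
def dirichlet_update(alpha, outcome):
    # Posterior update after one Multinoulli trial landing in state `outcome`:
    # increment the corresponding Dirichlet parameter.
    alpha = list(alpha)
    alpha[outcome] += 1
    return alpha

def posterior_mean(alpha):
    # Mean of Dirichlet(alpha_1, ..., alpha_S) is alpha_i / sum(alpha).
    n = sum(alpha)
    return [a / n for a in alpha]

alpha = [1, 1, 1]                 # uniform prior over 3 next states
for s_next in [0, 2, 2, 2]:       # observed transitions
    alpha = dirichlet_update(alpha, s_next)
print(posterior_mean(alpha))      # [2/7, 1/7, 4/7]
```

As more transitions are observed, the mean approaches the empirical transition frequencies and the spread shrinks like 1/n_{s,a}.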
39
Learning:
Maintain a Dirichlet posterior for P_{s,a} for every (s, a). After round t, on observing outcome s_{t+1}, update the posterior for state s_t and action a_t
To decide an action:
Sample P̃_{s,a} for every (s, a)
Compute the optimal policy π̃ for the sampled MDP (S, A, P̃, r, s₀)
Choose a_t = π̃(s_t)
Exploration-exploitation:
Exploitation: with more observations the Dirichlet posterior concentrates, and P̃_{s,a} approaches the empirical mean P̂_{s,a}
Exploration: anti-concentration of the Dirichlet ensures exploration of less-explored states/actions/policies
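The decide-action steps can be sketched end to end: sample one transition model from the Dirichlet posteriors, then solve the sampled MDP. The discounted value iteration and the toy two-state MDP below are simplifying assumptions for illustration (the slides use the average-reward criterion):

```python
import random

def sample_dirichlet(rng, alpha):
    # Dirichlet sample via normalized Gamma draws.
    g = [rng.gammavariate(a, 1.0) for a in alpha]
    s = sum(g)
    return [x / s for x in g]

def psrl_policy(alpha, r, gamma=0.9, iters=200, seed=0):
    # alpha[s][a] = Dirichlet parameters over next states for (s, a);
    # r[s][a] = known reward (the slides assume known rewards).
    # Sample one MDP from the posterior, then solve it by value iteration.
    rng = random.Random(seed)
    S, A = len(alpha), len(alpha[0])
    P = [[sample_dirichlet(rng, alpha[s][a]) for a in range(A)] for s in range(S)]
    V = [0.0] * S
    for _ in range(iters):
        V = [max(r[s][a] + gamma * sum(P[s][a][s2] * V[s2] for s2 in range(S))
                 for a in range(A)) for s in range(S)]
    # Greedy policy of the sampled MDP.
    return [max(range(A),
                key=lambda a: r[s][a] + gamma * sum(P[s][a][s2] * V[s2]
                                                    for s2 in range(S)))
            for s in range(S)]

# Toy posterior: in state 0, action 1 almost surely reaches the rewarding
# state 1, so the sampled MDP's optimal policy picks action 1 there.
policy = psrl_policy(alpha=[[[100, 1], [1, 100]], [[1, 100], [1, 100]]],
                     r=[[0.0, 0.0], [1.0, 1.0]])
```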
40
Proceed in epochs; an epoch ends when the number of visits N_{s,a} of some state-action pair doubles.
In every epoch:
For every (s, a), generate multiple (ψ = Õ(S)) independent samples from the Dirichlet posterior for P_{s,a}
Form an extended sample MDP (S, ψA, P̃, r, s₀)
Find its optimal policy π̃ and use it through the epoch
Further, initial exploration: for (s, a) with very small N_{s,a} < √(TS/A), use a simple optimistic sampling that provides extra exploration
41
An algorithm based on posterior sampling, with a high-probability near-optimal worst-case regret upper bound
Recall regret: Regret(M, T) = T·ρ* − ∑_{t=1}^T r(s_t, a_t)
Theorem: For any communicating MDP M with (unknown) diameter D, and for T ≥ S⁵A, with high probability:
Regret(M, T) ≤ Õ(D√(SAT))
An improvement by a factor of √S over the UCB-based algorithm
42