Lecture: Reinforcement Learning
http://bicmr.pku.edu.cn/~wenzw/bigdata2017.html
Acknowledgement: these slides are based on Prof. David Silver's lecture notes
Thanks to Mingming Zhao for preparing these slides
Outline
Introduction of MDP
Dynamic Programming
Model-free Control
Large-Scale RL
Model-based RL
What is RL
Reinforcement learning is learning what to do (how to map situations to actions) so as to maximize a numerical reward signal. The decision-maker is called the agent; the thing it interacts with is called the environment.
A reinforcement learning task that satisfies the Markov property is called a Markov Decision Process, or MDP
We assume that all RL tasks can be approximated by tasks with the Markov property, so this talk is based on MDPs.
Definition
A Markov Decision Process is a tuple (S, A, P, r, γ):
S is a finite set of states, s ∈ S
A is a finite set of actions, a ∈ A
P is the transition probability distribution: the probability of reaching state s′ from state s with action a is P(s′|s, a); also called the model or the dynamics
r is a reward function r(s, a, s′), sometimes just r(s), or rt after time step t
γ ∈ [0, 1] is a discount factor; why discount?
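To make the tuple concrete, here is a minimal sketch of how such a finite MDP could be stored as plain arrays. It is a hypothetical 3-state, 2-action example; all transition and reward numbers are invented for illustration and are not taken from the slides.

```python
import numpy as np

# Hypothetical 3-state, 2-action MDP; all numbers are invented for illustration.
n_states, n_actions = 3, 2
gamma = 0.9  # discount factor

# P[s, a, s'] = P(s'|s, a); each row P[s, a, :] sums to 1.
P = np.zeros((n_states, n_actions, n_states))
P[0, 0] = [0.8, 0.2, 0.0]
P[0, 1] = [0.0, 0.9, 0.1]
P[1, 0] = [0.1, 0.8, 0.1]
P[1, 1] = [0.0, 0.2, 0.8]
P[2, 0] = [0.0, 0.0, 1.0]
P[2, 1] = [0.5, 0.0, 0.5]

# r[s, a, s'] = r(s, a, s'); here, any transition into state 2 pays reward 1.
r = np.zeros((n_states, n_actions, n_states))
r[:, :, 2] = 1.0
```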
Example
A simple MDP with three states and two actions:
Markov Property
A state st is Markov iff
P(st+1|st) = P(st+1|s1, ..., st)
the state captures all relevant information from the history
once the state is known, the history may be thrown away
i.e. the state is a sufficient statistic of the future
Agent and Environment
The agent selects actions based on the observations and rewards received at each time-step t
The environment selects observations and rewards based on the actions received at each time-step t
Policy
π(a|s) = P(at = a|st = s)
policy π defines the behaviour of an agent
for an MDP, the policy depends only on the current state (Markov property)
deterministic policy: a = π(s)
Value function
Denote
G(t) = rt+1 + γrt+2 + γ²rt+3 + ... = ∑_{k=0}^∞ γ^k rt+k+1
state-value function
Vπ(s) is the total amount of reward expected to accumulate over the future starting from the state s and then following policy π
Vπ(s) = Eπ(G(t)|st = s)
action-value function
Qπ(s, a) is the total amount of reward expected to accumulate over the future starting from the state s, taking action a, and then following policy π
Qπ(s, a) = Eπ(G(t)|st = s, at = a)
Bellman Equation
Vπ(s) = Eπ(rt+1 + γrt+2 + γ²rt+3 + ... | st = s)
      = Eπ(rt+1 + γ(rt+2 + γrt+3 + ...) | st = s)
      = Eπ(rt+1 + γG(t+1) | st = s)
      = Eπ(rt+1 + γVπ(st+1) | st = s)
For the state-value function,
Vπ(s) = Eπ(rt+1 + γVπ(st+1) | st = s)
      = ∑_{a∈A} π(a|s) ∑_{s′∈S} P(s′|s, a)[r(s, a, s′) + γVπ(s′)]
Similarly,
Qπ(s, a) = Eπ(r(s) + γQπ(s′, a′) | s, a)
         = ∑_{s′∈S} P(s′|s, a)[r(s, a, s′) + γ ∑_{a′∈A} π(a′|s′)Qπ(s′, a′)]
Bellman Equation
The Bellman equation is a linear equation; it can be solved directly, but only for small MDPs (see the sketch after this list)
The Bellman equation motivates a lot of iterative methods for MDP,
Dynamic Programming
Monte-Carlo evaluation
Temporal-Difference learning
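The direct-solution remark can be made concrete: under a fixed policy the Bellman expectation equation is Vπ = rπ + γPπVπ, a linear system in Vπ. A minimal sketch, assuming the hypothetical arrays P and r from the earlier illustration and a stochastic policy matrix pi with pi[s, a] = π(a|s):

```python
import numpy as np

def evaluate_policy_directly(P, r, pi, gamma):
    """Solve V = r_pi + gamma * P_pi * V exactly; feasible only for small MDPs.

    P[s, a, s'] = P(s'|s, a), r[s, a, s'] = r(s, a, s'), pi[s, a] = pi(a|s).
    """
    n_states = P.shape[0]
    # State-to-state transition matrix and expected one-step reward under pi.
    P_pi = np.einsum("sa,sat->st", pi, P)
    r_pi = np.einsum("sa,sat,sat->s", pi, P, r)
    # V_pi = (I - gamma * P_pi)^{-1} r_pi
    return np.linalg.solve(np.eye(n_states) - gamma * P_pi, r_pi)

# Example: a uniformly random policy over the toy MDP sketched earlier.
# pi = np.full((3, 2), 0.5); V = evaluate_policy_directly(P, r, pi, 0.9)
```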
Optimal value function
The optimal state-value function V∗(s) is
V∗(s) = maxπ Vπ(s)
The optimal action-value function q∗(s, a) is
q∗(s, a) = maxπ Qπ(s, a)
The goal in any MDP is to find the optimal value function
or, equivalently, an optimal policy π∗:
for any policy π, Vπ∗(s) ≥ Vπ(s), ∀s ∈ S
Bellman Optimality Equation
V∗(s) = max_{a∈A} ∑_{s′∈S} P(s′|s, a)[r(s, a, s′) + γV∗(s′)]
q∗(s, a) = ∑_{s′∈S} P(s′|s, a)[r(s, a, s′) + γ max_{a′∈A} q∗(s′, a′)]
The Bellman optimality equation is non-linear
many iterative methods:
value iteration
Q-learning
Structure of RL
Basis: Policy Evaluation + Policy Improvement
Policy iteration
Policy evaluation
for a given policy π, evaluate the state-value function Vπ(s) at each state s ∈ S
iterative application of the Bellman expectation backup
Vπ(s) ← ∑_{a∈A} π(a|s) ∑_{s′∈S} P(s′|s, a)[r(s, a, s′) + γVπ(s′)]
Policy improvement
consider a deterministic policy (greedy policy)
π(s) ← arg max_a ∑_{s′∈S} P(s′|s, a)[r(s, a, s′) + γVπ(s′)]
ε-greedy policy
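A compact sketch of the policy iteration loop above on the same hypothetical tabular arrays (iterative policy evaluation followed by greedy improvement; tolerances and sweep limits are arbitrary choices):

```python
import numpy as np

def policy_iteration(P, r, gamma, eval_tol=1e-8, max_sweeps=1000):
    n_states, n_actions, _ = P.shape
    policy = np.zeros(n_states, dtype=int)   # deterministic policy pi(s)
    V = np.zeros(n_states)
    while True:
        # Policy evaluation: repeated Bellman expectation backups for fixed pi.
        for _ in range(max_sweeps):
            V_new = np.array([
                np.sum(P[s, policy[s]] * (r[s, policy[s]] + gamma * V))
                for s in range(n_states)
            ])
            delta = np.max(np.abs(V_new - V))
            V = V_new
            if delta < eval_tol:
                break
        # Policy improvement: act greedily with respect to the evaluated V.
        Q = np.einsum("sat,sat->sa", P, r + gamma * V[None, None, :])
        new_policy = Q.argmax(axis=1)
        if np.array_equal(new_policy, policy):
            return policy, V
        policy = new_policy
```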
Policy iteration
Value iteration
One-step truncated policy evaluation
Iterative application of Bellman optimality backup
V(s) ← max_{a∈A} ∑_{s′∈S} P(s′|s, a)[r(s, a, s′) + γV(s′)]
Learn optimal value function directly
Unlike policy iteration, there is no explicit policy
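A corresponding value iteration sketch on the same hypothetical arrays; a greedy policy is only read off at the end, matching the remark that no explicit policy is maintained during the iteration:

```python
import numpy as np

def value_iteration(P, r, gamma, tol=1e-8, max_sweeps=10000):
    n_states = P.shape[0]
    V = np.zeros(n_states)
    for _ in range(max_sweeps):
        # Bellman optimality backup for every state at once.
        Q = np.einsum("sat,sat->sa", P, r + gamma * V[None, None, :])
        V_new = Q.max(axis=1)
        delta = np.max(np.abs(V_new - V))
        V = V_new
        if delta < tol:
            break
    # A greedy policy is extracted only after the values have converged.
    Q = np.einsum("sat,sat->sa", P, r + gamma * V[None, None, :])
    return Q.argmax(axis=1), V
```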
Value iteration
State-value function space
Consider the vector space V over state-value functions
There are |S| dimensions
Each point in this space fully specifies a state-value function V(s)
For any U, V ∈ V, we measure the distance between U and V by
||U − V||∞ = max_{s∈S} |U(s) − V(s)|
Convergence
Define the Bellman expectation backup operator Tπ, for any state s
Tπ(V(s)) = ∑_{a∈A} π(a|s) ∑_{s′∈S} P(s′|s, a)[r(s, a, s′) + γV(s′)]
then
‖Tπ(V(s)) − Tπ(U(s))‖∞ = ‖γ ∑_{a∈A} π(a|s) ∑_{s′∈S} P(s′|s, a)[V(s′) − U(s′)]‖∞
                        ≤ γ max_{s′∈S} |V(s′) − U(s′)|
Convergence
thus
‖Tπ(V)− Tπ(U)‖∞ ≤ γ‖V − U‖∞
we say the operator Tπ is a γ-contraction
Contraction Mapping Theorem
For any metric space V that is complete under an operator T(v), where T is a γ-contraction:
T converges to a unique fixed point
at a linear convergence rate of γ
Convergence
Both the Bellman expectation operator Tπ and the Bellman optimality operator T∗ are γ-contractions
The Bellman equation shows Vπ is the fixed point of Tπ
Bellman optimality equation shows V∗ is the fixed point of T∗
Iterative policy evaluation converges to Vπ; policy iteration converges to V∗
Value iteration converges to V∗
Model-free Algorithm
Previous discussion:
planning by dynamic programming
the model and dynamics of the MDP are required
curse of dimensionality
Next discussion:
model-free prediction and control
estimate and optimize the value function of an unknown MDP
In RL, we always deal with an unknown model; exploration and exploitation are needed
Model-free prediction
Goal: learn Vπ(s) from episodes of experience under policy π
On policy learning
No requirement of MDP transitions or rewards
Monte-Carlo learning
Temporal-Difference learning
Monte-Carlo
Update Vπ(s) incrementally from episodes of experience under policy π
Every (or the first) time-step t that state s is visited in an episode:
N(s) ← N(s) + 1
V(s) ← [(N(s) − 1)V(s) + Gt] / N(s)
or, V(s) ← V(s) + α(Gt − V(s))
By the law of large numbers, V(s)→ Vπ(s) as N(s)→∞
MC must learn from complete episodes; no bootstrapping.
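An every-visit Monte-Carlo prediction sketch implementing the incremental update above. The env object with reset() and step(a) returning (next_state, reward, done) is an assumed interface, not something defined in the slides:

```python
from collections import defaultdict

def mc_evaluate(env, policy, gamma, n_episodes=1000, alpha=None):
    """Every-visit Monte-Carlo prediction of V_pi from complete episodes."""
    V = defaultdict(float)
    N = defaultdict(int)
    for _ in range(n_episodes):
        # Generate one complete episode following pi.
        episode, s, done = [], env.reset(), False
        while not done:
            s_next, reward, done = env.step(policy(s))
            episode.append((s, reward))
            s = s_next
        # Walk backwards through the episode to accumulate returns G_t.
        G = 0.0
        for s, reward in reversed(episode):
            G = reward + gamma * G
            N[s] += 1
            step = alpha if alpha is not None else 1.0 / N[s]
            V[s] += step * (G - V[s])   # incremental mean / constant-alpha update
    return V
```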
Temporal-Difference
The simplest: TD(0)
Update Vπ(s) incrementally from episodes of experience under policy π
The update only requires one step of an episode
V(s)← V(s) + α(r(s, a, s′) + γV(s′)− V(s))
TD target: r(s, a, s′) + γV(s′)
TD error: r(s, a, s′) + γV(s′)− V(s)
r(s, a, s′) is the observed reward from one step forward simulation
TD(0) learns from incomplete episodes, by bootstrapping.
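A matching TD(0) prediction sketch under the same assumed environment interface; the update happens after every step, using the bootstrapped target r + γV(s′):

```python
from collections import defaultdict

def td0_evaluate(env, policy, gamma, alpha=0.1, n_episodes=1000):
    """TD(0) prediction of V_pi, bootstrapping from the current estimate V(s')."""
    V = defaultdict(float)
    for _ in range(n_episodes):
        s, done = env.reset(), False
        while not done:
            s_next, reward, done = env.step(policy(s))
            td_target = reward + (0.0 if done else gamma * V[s_next])
            V[s] += alpha * (td_target - V[s])   # step in the direction of the TD error
            s = s_next
    return V
```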
Comparison between MC and TD(0)
TD(0) can learn before knowing the final outcome, even without a final outcome.
MC must wait until the end of the episode before the return is known.
In MC, the return Gt is an unbiased estimate of Vπ(st); it depends on many random actions, transitions and rewards, which yields high variance.
However, the TD(0) target is a biased estimate of Vπ(st); it depends on one random action, transition and reward, which yields low variance.
MC has good convergence properties and is not very sensitive to the initial value of V(s).
TD(0) also converges, but is more sensitive to the initial value of V(s). TD(0) is more efficient than MC.
TD(λ)
Consider the n-step return, n = 1, 2, ..., defined as
G^(n)_t = rt+1 + γrt+2 + ... + γ^{n−1}rt+n + γ^n V(st+n)
n-step TD learning
V(st) ← V(st) + α(G^(n)_t − V(st))
n-step TD → MC, as n → ∞ (n = 1 recovers TD(0))
The λ-return G^λ_t combines all n-step returns G^(n)_t:
G^λ_t = (1 − λ) ∑_{n=1}^∞ λ^{n−1} G^(n)_t
Forward-view TD(λ)
V(st) ← V(st) + α(G^λ_t − V(st))
Unified view of RL
Policy Improvement
After evaluating the value function of policy π, we now improve the policy
greedy policy improvement
π′(s) = arg max_a ∑_{s′∈S} P(s′|s, a)(r(s, a, s′) + γVπ(s′))
π′(s) = arg max_a Qπ(s, a)
it is model-free and easier to obtain the policy from Qπ(s, a)
ε-greedy policy improvement
π′(a|s) = ε/|A| + 1 − ε  if a = arg max_{a′} Q(s, a′), and ε/|A| otherwise
ε-greedy policy ensures continual exploration, all actions are tried
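A small sketch of ε-greedy action selection over a tabular Q keyed by (state, action) pairs (tie-breaking and the random-number source are implementation choices):

```python
import random

def epsilon_greedy_action(Q, state, n_actions, epsilon):
    """With prob. 1 - eps act greedily w.r.t. Q; otherwise pick a uniform random action."""
    if random.random() < epsilon:
        return random.randrange(n_actions)                       # explore
    return max(range(n_actions), key=lambda a: Q[(state, a)])    # exploit
```

Because the random branch can also pick the greedy action, this gives the greedy action probability ε/|A| + 1 − ε and every other action probability ε/|A|, matching the formula above.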
ε-Greedy Policy Improvement
Theorem
For any ε-greedy policy π, the ε-greedy policy π′ w.r.t. Qπ is an improvement, i.e. Vπ′(s) ≥ Vπ(s), ∀s ∈ S.
Pf:
Qπ(s, π′(s)) = ∑_{a∈A} π′(a|s)Qπ(s, a)
= (1 − ε) max_{a∈A} Qπ(s, a) + (ε/|A|) ∑_{a∈A} Qπ(s, a)
≥ (1 − ε) ∑_{a∈A} [(π(a|s) − ε/|A|) / (1 − ε)] Qπ(s, a) + (ε/|A|) ∑_{a∈A} Qπ(s, a)
= ∑_{a∈A} π(a|s)Qπ(s, a) = Vπ(s)
ε-Greedy Policy Improvement (Proof)
On the other hand,
Qπ(s, π′(s)) =Eπ′ [rt+1 + γVπ(st+1)|st = s]
≤Eπ′ [rt+1 + γQπ(st+1, π′(st+1))|st = s]
=Eπ′ [rt+1 + γEπ′ [rt+2 + γVπ(st+2)]|st = s]
=Eπ′ [rt+1 + γrt+2 + γ2Vπ(st+2)|st = s]
≤Eπ′ [rt+1 + γrt+2 + γ2Qπ(st+2, π′(st+2))|st = s]
...
≤Eπ′ [rt+1 + γrt+2 + γ2rt+3 + γ3rt+4 + ...|st = s]
=Vπ′ (s)
Thus, Vπ′ (s) ≥ Vπ(s).
Monte-Carlo Policy Iteration
Policy evaluation: Monte-Carlo
Policy improvement: ε-greedy policy
TD Policy iteration is similar
Monte-Carlo Control
For each episode:
Policy evaluation: Monte-Carlo
Policy improvement: ε-greedy policy
GLIE
Definition
A sequence of policies {πk}∞k=1 is called GLIE (Greedy in the Limit with Infinite Exploration) if:
all state-action pairs are explored infinitely many times: lim_{k→∞} Nk(s, a) = ∞
the policy converges on a greedy policy: lim_{k→∞} πk(a′|s) = 1(a′ = arg max_a Qk(s, a))
For example, an ε-greedy policy is GLIE if ε reduces to zero as εk = 1/k
Theorem
GLIE Monte-Carlo control converges to the optimal action-value function, i.e. Q(s, a) → q∗(s, a)
TD control–Sarsa
For each time-step:
apply TD to evaluate Qπ(s, a), then use ε-greedy policy improvement
Q(s, a)← Q(s, a) + α(r(s, a, s′) + γQ(s′, a′)− Q(s, a))
action a′ is chosen from s′ using a policy derived from Q (e.g., ε-greedy)
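A Sarsa control sketch built on the ε-greedy helper sketched earlier, under the same assumed environment interface; the hyperparameter values are illustrative only:

```python
from collections import defaultdict

def sarsa(env, n_actions, gamma, alpha=0.1, epsilon=0.1, n_episodes=5000):
    """On-policy TD control: evaluate and improve the policy that is being followed."""
    Q = defaultdict(float)
    for _ in range(n_episodes):
        s, done = env.reset(), False
        a = epsilon_greedy_action(Q, s, n_actions, epsilon)
        while not done:
            s_next, reward, done = env.step(a)
            a_next = epsilon_greedy_action(Q, s_next, n_actions, epsilon)
            target = reward + (0.0 if done else gamma * Q[(s_next, a_next)])
            Q[(s, a)] += alpha * (target - Q[(s, a)])
            s, a = s_next, a_next
    return Q
```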
Convergence of Sarsa
Theorem
Sarsa converges to the optimal action-value function, i.e. Q(s, a) → q∗(s, a), under the following conditions:
GLIE sequence of policies πk(a|s)
Robbins-Monro sequence of step-sizes αk: ∑_{k=1}^∞ αk = ∞, ∑_{k=1}^∞ α²k < ∞
Off-policy learning
we have assumed that the episode is generated following the learning policy π
on-policy learning: the learning policy and the behaviour policy coincide
consider following a behaviour policy µ(a|s): {s1, a1, s2, a2, ..., sT} ∼ µ
importance sampling
off-policy learning
why is this important?
re-use experience generated from old policies
learn about the optimal policy while following an exploratory policy
learn about multiple policies while following one policy
Importance sampling
To estimate the expectation EX∼P[f(X)], we can sample from a different distribution instead
EX∼P[f(X)] = ∑ P(X)f(X) = ∑ Q(X) [P(X)/Q(X)] f(X) = EX∼Q[(P(X)/Q(X)) f(X)]
Q is some simple or known distribution
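A tiny numerical sketch of the identity; the two distributions below are arbitrary choices made only for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)
values = np.arange(4)
P = np.array([0.1, 0.2, 0.3, 0.4])       # target distribution P
Q = np.array([0.25, 0.25, 0.25, 0.25])   # sampling distribution Q

x = rng.choice(values, size=100_000, p=Q)   # sample X ~ Q
weights = P[x] / Q[x]                       # importance weights P(X)/Q(X)
estimate = np.mean(weights * x)             # estimates E_{X~P}[X]
print(estimate, np.sum(P * values))         # the two numbers should be close
```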
Off-policy version of MC and TD
MC
use rewards generated from µ to evaluate π
G^{π/µ}_t = [π(at|st)/µ(at|st)] [π(at+1|st+1)/µ(at+1|st+1)] ... [π(aT|sT)/µ(aT|sT)] Gt
V(st) ← V(st) + α(G^{π/µ}_t − V(st))
TD
use rewards generated from µ to evaluate π
V(st) ← V(st) + α([π(at|st)/µ(at|st)](r(st, at, st+1) + γV(st+1)) − V(st))
much lower variance than MC importance sampling
Q-learning
Off-policy learning of action-values Q(s, a)
Q(s, a) ← Q(s, a) + α(r(s, a, s′) + γ max_{a′} Q(s′, a′) − Q(s, a))
Why is Q-learning considered off-policy?
the learned Q directly approximates q∗, while the behaviour policy may not be optimal
Comparing with Sarsa
Q(s, a)← Q(s, a) + α(r(s, a, s′) + γQ(s′, a′)− Q(s, a))
Sarsa tends to learn more cautiously in an environment where exploration is costly, while Q-learning does not
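A Q-learning sketch for comparison with the Sarsa code above (same assumed helpers and environment interface); the sampled next action is replaced by a max over next actions:

```python
from collections import defaultdict

def q_learning(env, n_actions, gamma, alpha=0.1, epsilon=0.1, n_episodes=5000):
    """Off-policy TD control: behave eps-greedily, but learn about the greedy policy."""
    Q = defaultdict(float)
    for _ in range(n_episodes):
        s, done = env.reset(), False
        while not done:
            a = epsilon_greedy_action(Q, s, n_actions, epsilon)
            s_next, reward, done = env.step(a)
            best_next = max(Q[(s_next, b)] for b in range(n_actions))
            target = reward + (0.0 if done else gamma * best_next)
            Q[(s, a)] += alpha * (target - Q[(s, a)])
            s = s_next
    return Q
```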
Q-learning vs. Sarsa
Summary
Value function approximation
So far, we have represented value function by a lookup table
every state s (or state-action pair s, a) has an entry V(s) (or Q(s, a))
For large scale MDPs
too many states s (or state-action pairs s, a) to store
too slow to learn each value individually
Estimate value function with function approximation
V̂(s,w) ≈Vπ(s)
Q̂(s, a,w) ≈Qπ(s, a)
generalize from seen states to unseen states
update parameter w by MC or TD learning
Types of function approximation
Function approximators, e.g
Linear combinations of features, neural networks, decision trees, nearest neighbours, Fourier/wavelet bases...
Basic idea
Goal: find a parameter vector w minimizing the mean-square error between the approximate value V̂(s,w) and the true value Vπ(s)
J(w) = Eπ[(Vπ(s)− V̂(s,w))2]
Gradient descent (α is the stepsize)
∆w = −(1/2)α∇wJ(w) = αEπ[(Vπ(s) − V̂(s,w))∇wV̂(s,w)]
Stochastic gradient descent
∆w = α(Vπ(s)− V̂(s,w))∇wV̂(s,w)
Linear case
V̂(s,w) = x(s)ᵀw = ∑_{i=1}^N xi(s)wi, where x(s) is a feature vector
Objective function is quadratic in w
J(w) = Eπ[(Vπ(s)− x(s)Tw)2]
Update is simple
∆w = α(Vπ(s)− V̂(s,w))x(s)
Practical algorithm
The update requires the true value function
But in RL, it is unavailable; we only have rewards from experience
In practice, we substitute a target for Vπ(s), e.g
for MC, the target is Gt
for TD(0), the target is r(s, a, s′) + γV̂(s′,w)
for TD(λ), the target is the λ-return G^λ_t
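A semi-gradient TD(0) sketch with a linear approximator, substituting the TD target for Vπ(s) as described above; the feature map x(s) and the environment interface are assumptions carried over from the earlier sketches:

```python
import numpy as np

def linear_td0(env, policy, x, n_features, gamma, alpha=0.01, n_episodes=1000):
    """Semi-gradient TD(0) with a linear approximator V_hat(s, w) = x(s)^T w."""
    w = np.zeros(n_features)
    for _ in range(n_episodes):
        s, done = env.reset(), False
        while not done:
            s_next, reward, done = env.step(policy(s))
            v = x(s) @ w
            v_next = 0.0 if done else x(s_next) @ w
            td_error = reward + gamma * v_next - v
            w += alpha * td_error * x(s)   # gradient of x(s)^T w w.r.t. w is x(s)
            s = s_next
    return w
```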
Similarly, we can build approximators of the action-value function in the same way
Approximate value function methods suffer from a lack of strong theoretical performance guarantees¹
¹ Sham Kakade, John Langford. Approximately Optimal Approximate Reinforcement Learning
Batch methods
Gradient descent is simple and appealing, but it is not sample efficient
Experience D consisting of < state, value > pairs
D = {< s1,Vπ1 >,< s2,Vπ2 >, ..., < sT ,VπT >}
Minimising the sum-squared error
LS(w) = ∑_{t=1}^T (Vπt − V̂(st,w))² = ED[(Vπ − V̂(s,w))²]
SGD with experience replay
Given experience D consisting of < state, value > pairs
D = {< s1,Vπ1 >,< s2,Vπ2 >, ..., < sT ,VπT >}
Repeat:
1. sample state, value from experience: < s, Vπ > ∼ D
2. apply SGD update: ∆w = α(Vπ − V̂(s,w))∇wV̂(s,w)
Converges to the least squares solution
wπ = arg min_w LS(w)
In practice, we use noisy or biased samples of Vπt, e.g.
LSMC (Least Squares Monte-Carlo): use the return Gt ≈ Vπt
Policy Gradient
So far, we are discussing value-based RL
learn the value function
implicit policy (e.g. ε-greedy)
Now, consider parametrizing the policy
πθ(a|s) = P(a|s, θ)
Policy-based RL
no value function
learn the policy directly
Policy-based RL
Advantages:
better convergence properties
effective in high-dimensional or continuous action spaces
can learn stochastic policies
Disadvantages:
typically converge to a local rather than a global optimum
evaluating a policy is typically inefficient and high variance
Measure the quality of a policy πθ
In episodic environments:
s1 is the start state of an episode
J1(θ) = Vπθ(s1)
In continuing environments:
JavV(θ) = ∑_s pπθ(s)Vπθ(s)
or, JavR(θ) = ∑_s pπθ(s) ∑_a πθ(a|s) r_s^a
pπθ(s) is the stationary distribution of the Markov chain for πθ
pπθ(s) = P(s1 = s) + γP(s2 = s) + γ²P(s3 = s) + ...
Policy Gradient
Goal: find θ to maximize J(θ) by ascending the gradient of J(θ) w.r.t. θ
∆θ = α∇θJ(θ), where ∇θJ(θ) = (∂J(θ)/∂θ1, ..., ∂J(θ)/∂θN)ᵀ
How to compute gradient?
Perturbing θ by a small amount ε in the kth dimension, ∀k ∈ [1, N]:
∂J(θ)/∂θk ≈ [J(θ + εek) − J(θ)] / ε
Simple, but noisy, inefficient in most cases
Score function
Assume policy πθ is differentiable whenever it is non-zero
Likelihood ratios exploit the following identity
∇θπθ(a|s) = πθ(a|s) [∇θπθ(a|s) / πθ(a|s)] = πθ(a|s) ∇θ log πθ(a|s)
The score function is ∇θ logπθ(a|s)
For example, consider a Gaussian policy: a ∼ N (µ(s), σ2)
Mean is linear combination of state features µ(s) = φ(s)Tθ
The score function is
∇θ log πθ(a|s) = (a − µ(s))φ(s) / σ²
One-step MDPs
Consider one-step MDPs
start with s ∼ p(s), and terminate after one step with reward r_s^a
J(θ) = ∑_s p(s) ∑_a πθ(a|s) r_s^a
∇θJ(θ) = ∑_s p(s) ∑_a ∇θπθ(a|s) r_s^a
        = ∑_s p(s) ∑_a πθ(a|s) ∇θ log πθ(a|s) r_s^a
        = Eπθ[∇θ log πθ(a|s) r_s^a]
Policy Gradient Theorem
For multi-step MDPs, we can use the likelihood ratio to obtain a similar conclusion:
Theorem
For any differentiable policy πθ(a|s), and for any of the policy objective functions J = J1, JavR or (1/(1−γ))JavV, the policy gradient is
∇θJ(θ) = Eπθ[∇θ log πθ(a|s) Qπθ(s, a)]
Monte-Carlo Policy Gradient (REINFORCE)
Apply stochastic gradient ascent
Experience an episode
{s1, a1, r2, s2, a2, r3, ..., sT−1, aT−1, rT , sT} ∼ πθ
Use the return vt as an unbiased sample of Qπθ(st, at), t = 1, ..., T − 1
∆θt = α∇θ log πθ(at|st) vt
High variance
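A REINFORCE sketch for a discrete action space, assuming a softmax policy over linear features; the parameterization, the feature map phi(s, a), and the environment interface are illustrative assumptions rather than anything prescribed by the slides:

```python
import numpy as np

def softmax_policy(theta, phi, s, n_actions):
    """pi_theta(a|s) proportional to exp(phi(s, a)^T theta)."""
    prefs = np.array([phi(s, a) @ theta for a in range(n_actions)])
    prefs -= prefs.max()                       # for numerical stability
    probs = np.exp(prefs)
    return probs / probs.sum()

def reinforce(env, phi, n_features, n_actions, gamma, alpha=0.01, n_episodes=2000):
    theta = np.zeros(n_features)
    for _ in range(n_episodes):
        # Sample one episode with the current policy.
        states, actions, rewards = [], [], []
        s, done = env.reset(), False
        while not done:
            probs = softmax_policy(theta, phi, s, n_actions)
            a = np.random.choice(n_actions, p=probs)
            s_next, reward, done = env.step(a)
            states.append(s)
            actions.append(a)
            rewards.append(reward)
            s = s_next
        # Stochastic gradient ascent: grad log pi(a_t|s_t) times the return v_t.
        G = 0.0
        for t in reversed(range(len(states))):
            G = rewards[t] + gamma * G
            s, a = states[t], actions[t]
            probs = softmax_policy(theta, phi, s, n_actions)
            expected_phi = sum(probs[b] * phi(s, b) for b in range(n_actions))
            grad_log_pi = phi(s, a) - expected_phi   # softmax score function
            theta += alpha * grad_log_pi * G
    return theta
```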
Actor-Critic
Reduce the variance
Use a critic to estimate the action-value function
Qw(s, a) ≈ Qπθ(s, a)
Actor-Critic algorithms maintain two sets of parameters
Critic: update action-value function parameters w
Actor: update policy parameters θ in the direction suggested by the Critic
∇θJ(θ) ≈ Eπθ[∇θ logπθ(a|s)Qw(s, a)]
∆θ = α∇θ logπθ(a|s)Qw(s, a)
Can we avoid any bias by choosing action-value functionapproximation carefully?
Compatible Function Approximation Theorem
Theorem
If the action-value function approximator Qw(s, a) satisfies the following two conditions:
∇wQw(s, a) = ∇θ log πθ(a|s)
w = arg min_{w′} Eπθ[(Qπθ(s, a) − Qw′(s, a))²]
then the policy gradient is exact, i.e.
∇θJ(θ) = Eπθ[∇θ logπθ(a|s)Qw(s, a)]
The function approximator is compatible with the policy in the sense that if we use the approximation Qw(s, a) in lieu of the true value Qπθ(s, a) to compute the gradient, the result is still exact
Proof
Pf: Denote ε = Eπθ[(Qπθ(s, a) − Qw(s, a))²], then
∇wε = 0
Eπθ[(Qπθ(s, a) − Qw(s, a))∇wQw(s, a)] = 0
Eπθ[(Qπθ(s, a) − Qw(s, a))∇θ log πθ(a|s)] = 0
Eπθ[Qπθ(s, a)∇θ log πθ(a|s)] = Eπθ[Qw(s, a)∇θ log πθ(a|s)].
Baseline
Reduce the variance
A baseline function B(s):
Eπθ[∇θ log πθ(a|s)B(s)] = ∑_s pπθ(s) ∑_a ∇θπθ(a|s)B(s)
                        = ∑_s pπθ(s)B(s)∇θ ∑_a πθ(a|s) = 0
A good baseline B(s) = Vπθ(s)
Advantage function
Aπθ(s, a) = Qπθ(s, a) − Vπθ(s)
Policy gradient
∇θJ(θ) = Eπθ[∇θ log πθ(a|s)Qπθ(s, a)] = Eπθ[∇θ log πθ(a|s)Aπθ(s, a)]
∆θ = α∇θ log πθ(a|s)Aπθ(s, a)
The advantage function can significantly reduce variance of policygradient
Estimate both Vπθ(s) and Qπθ(s, a) to obtain Aπθ(s, a), e.g. by TD learning
Estimation of advantage function
Apply TD learning to estimate value function
The TD error
δπθ = r(s, a, s′) + γVπθ(s′) − Vπθ(s)
is an unbiased estimate of the advantage function:
Eπθ[δπθ | s, a] = Eπθ[r(s, a, s′) + γVπθ(s′) − Vπθ(s) | s, a]
                = Eπθ[r(s, a, s′) + γVπθ(s′) | s, a] − Vπθ(s)
                = Qπθ(s, a) − Vπθ(s) = Aπθ(s, a)
thus the update
∆θ = α∇θ logπθ(a|s)δπθ
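An actor-critic sketch that uses this TD error as the advantage estimate; it reuses the assumed softmax_policy helper from the REINFORCE sketch and a linear critic over state features x(s), with illustrative step sizes:

```python
import numpy as np

def actor_critic_td(env, phi, x, n_theta, n_w, n_actions, gamma,
                    alpha_theta=0.01, alpha_w=0.05, n_episodes=2000):
    theta = np.zeros(n_theta)   # actor (policy) parameters
    w = np.zeros(n_w)           # critic (state-value) parameters
    for _ in range(n_episodes):
        s, done = env.reset(), False
        while not done:
            probs = softmax_policy(theta, phi, s, n_actions)
            a = np.random.choice(n_actions, p=probs)
            s_next, reward, done = env.step(a)
            # Critic: TD error, used as the advantage estimate.
            v = x(s) @ w
            v_next = 0.0 if done else x(s_next) @ w
            delta = reward + gamma * v_next - v
            w += alpha_w * delta * x(s)
            # Actor: policy gradient step scaled by the TD error.
            expected_phi = sum(probs[b] * phi(s, b) for b in range(n_actions))
            theta += alpha_theta * delta * (phi(s, a) - expected_phi)
            s = s_next
    return theta, w
```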
Natural Policy Gradient 2
Consider
max_{dθ} J(θ + dθ)  s.t. ‖dθ‖_{Gθ} = a
where a is a small constant, and ‖dθ‖²_{Gθ} = (dθ)ᵀGθ(dθ).
The optimal
(dθ)∗ := ∇nat J(θ) = Gθ⁻¹ ∇θJ(θ)
Take the metric matrix Gθ to be the Fisher information matrix
Gθ = Eπθ[∇θ log πθ(a|s) ∇θ log πθ(a|s)ᵀ]
² Sham Kakade. A Natural Policy Gradient. In Advances in Neural Information Processing Systems, pp. 1057-1063. MIT Press, 2002.
Natural Actor-Critic
Use the compatible function approximation Qw(s, a) = ∇θ log πθ(a|s)ᵀw
w = arg min_{w′} Eπθ[(Qπθ(s, a) − Qw′(s, a))²]
then
∇θJ(θ) =Eπθ[∇θ logπθ(a|s)Qw(s, a)]
=Eπθ[∇θ logπθ(a|s)∇θ logπθ(a|s)Tw]
=Eπθ[∇θ logπθ(a|s)∇θ logπθ(a|s)T ]w
=Gθw
so, ∇nat J(θ) = Gθ⁻¹ ∇θJ(θ) = w
i.e. the natural policy gradient updates the Actor parameters θ in the direction of the Critic parameters w
Model-based RL
Learn a model directly from experience
Use planning to construct a value function or policy
Learn a model
For an MDP (S,A,P,R)
A model M = < Pη, Rη >, parameterized by η
Pη ≈ P, Rη ≈ R
s′ ∼ Pη(s′|s, a), r(s, a, s′) = Rη(s, a, s′)
Typically assume conditional independence between state transitionsand rewards
P(s′, r(s, a, s′)|s, a) = P(s′|s, a)P(r(s, a, s′)|s, a)
Learn a model
Experience s1, a1, r2, s2, a2, r3, ...sT
s1, a1 →r2, s2
s2, a2 →r3, s3
...sT−1, aT−1 →rT , sT
Learning s, a→ r is a regression problem
Learning s, a→ s′ is a density estimation problem
find parameters η to minimise empirical loss
Example: Table Lookup Model
count visits N(s, a) to each state-action pair
P̂(s′|s, a) = (1/N(s, a)) ∑_{t=1}^T 1(st = s, at = a, st+1 = s′)
r̂_s^a = (1/N(s, a)) ∑_{t=1}^T 1(st = s, at = a) rt
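A sketch of the table-lookup model estimator following these counts; storing counts in dictionaries keyed by (s, a) is an implementation choice, and the episode is assumed to be given as a list of (s, a, r, s') transitions:

```python
from collections import defaultdict

def fit_table_lookup_model(transitions):
    """Estimate P_hat(s'|s, a) and r_hat(s, a) from a list of (s, a, r, s_next) tuples."""
    N = defaultdict(int)              # visit counts N(s, a)
    next_counts = defaultdict(int)    # counts of (s, a, s_next)
    reward_sum = defaultdict(float)   # summed rewards per (s, a)
    for s, a, r, s_next in transitions:
        N[(s, a)] += 1
        next_counts[(s, a, s_next)] += 1
        reward_sum[(s, a)] += r
    P_hat = {(s, a, s_next): c / N[(s, a)] for (s, a, s_next), c in next_counts.items()}
    r_hat = {sa: total / N[sa] for sa, total in reward_sum.items()}
    return P_hat, r_hat
```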
Planning with a model
After estimating a model, we can plan with the algorithms introduced before
Dynamic Programming
Policy iteration
Value iteration
...
Model-free RL
Monte-Carlo
Sarsa
Q-learning
...
The performance of model-based RL is limited to the optimal policy for the approximate MDP (S, A, Pη, Rη)
when the estimated model is imperfect, model-free RL is more efficient
Dyna Architecture
Learn a model from real experience (true MDP)
Learn and plan the value function (and/or policy) from real and simulated experience