Lecture: Reinforcement Learning
http://bicmr.pku.edu.cn/~wenzw/bigdata2017.html
Acknowledgement: these slides are based on Prof. David Silver's lecture notes
Thanks to Mingming Zhao for preparing these slides
Outline
Introduction of MDP
Dynamic Programming
Model-free Control
Large-Scale RL
Model-based RL
What is RL
Reinforcement learning is learning what to do (how to map situations to actions) so as to maximize a numerical reward signal. The decision-maker is called the agent; the thing it interacts with is called the environment.
A reinforcement learning task that satisfies the Markov property is called a Markov Decision Process, or MDP
We assume that all RL tasks can be approximated by tasks with the Markov property, so this talk is based on MDPs.
Definition
A Markov Decision Process is a tuple (S, A, P, r, γ):
S is a finite set of states, s ∈ S
A is a finite set of actions, a ∈ A
P is the transition probability distribution: the probability of reaching state s′ from state s with action a is P(s′|s, a); also called the model or the dynamics
r is a reward function r(s, a, s′), sometimes just r(s), or rt after time step t
γ ∈ [0, 1] is a discount factor; why discount?
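To make the tuple concrete, here is a minimal sketch of how such a finite MDP could be stored as plain arrays. It is a hypothetical 3-state, 2-action example; all transition and reward numbers are invented for illustration and are not taken from the slides.

```python
import numpy as np

# Hypothetical 3-state, 2-action MDP; all numbers are invented for illustration.
n_states, n_actions = 3, 2
gamma = 0.9  # discount factor

# P[s, a, s'] = P(s'|s, a); each row P[s, a, :] sums to 1.
P = np.zeros((n_states, n_actions, n_states))
P[0, 0] = [0.8, 0.2, 0.0]
P[0, 1] = [0.0, 0.9, 0.1]
P[1, 0] = [0.1, 0.8, 0.1]
P[1, 1] = [0.0, 0.2, 0.8]
P[2, 0] = [0.0, 0.0, 1.0]
P[2, 1] = [0.5, 0.0, 0.5]

# r[s, a, s'] = r(s, a, s'); here, any transition into state 2 pays reward 1.
r = np.zeros((n_states, n_actions, n_states))
r[:, :, 2] = 1.0
```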
Example
A simple MDP with three states and two actions:
Markov Property
A state st is Markov iff
P(st+1|st) = P(st+1|s1, ..., st)
the state captures all relevant information from the history
once the state is known, the history may be thrown away
i.e. the state is a sufficient statistic of the future
Agent and Environment
The agent selects actions based on the observations and rewards received at each time-step t
The environment selects observations and rewards based on the actions received at each time-step t
Policy
π(a|s) = P(at = a|st = s)
policy π defines the behaviour of an agent
for an MDP, the policy depends only on the current state (Markov property)
deterministic policy: a = π(s)
Value function
Denote
G(t) = rt+1 + γrt+2 + γ²rt+3 + ... = ∑_{k=0}^∞ γ^k rt+k+1
state-value function
Vπ(s) is the total amount of reward expected to accumulate over the future starting from the state s and then following policy π
Vπ(s) = Eπ(G(t)|st = s)
action-value function
Qπ(s, a) is the total amount of reward expected to accumulate over the future starting from the state s, taking action a, and then following policy π
Qπ(s, a) = Eπ(G(t)|st = s, at = a)
Bellman Equation
Vπ(s) = Eπ(rt+1 + γrt+2 + γ²rt+3 + ... | st = s)
      = Eπ(rt+1 + γ(rt+2 + γrt+3 + ...) | st = s)
      = Eπ(rt+1 + γG(t+1) | st = s)
      = Eπ(rt+1 + γVπ(st+1) | st = s)
For the state-value function,
Vπ(s) = Eπ(rt+1 + γVπ(st+1) | st = s)
      = ∑_{a∈A} π(a|s) ∑_{s′∈S} P(s′|s, a)[r(s, a, s′) + γVπ(s′)]
Similarly,
Qπ(s, a) = Eπ(r(s) + γQπ(s′, a′) | s, a)
         = ∑_{s′∈S} P(s′|s, a)[r(s, a, s′) + γ ∑_{a′∈A} π(a′|s′)Qπ(s′, a′)]
Bellman Equation
The Bellman equation is a linear equation; it can be solved directly, but only for small MDPs (see the sketch after this list)
The Bellman equation motivates a lot of iterative methods for MDP,
Dynamic Programming
Monte-Carlo evaluation
Temporal-Difference learning
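The direct-solution remark can be made concrete: under a fixed policy the Bellman expectation equation is Vπ = rπ + γPπVπ, a linear system in Vπ. A minimal sketch, assuming the hypothetical arrays P and r from the earlier illustration and a stochastic policy matrix pi with pi[s, a] = π(a|s):

```python
import numpy as np

def evaluate_policy_directly(P, r, pi, gamma):
    """Solve V = r_pi + gamma * P_pi * V exactly; feasible only for small MDPs.

    P[s, a, s'] = P(s'|s, a), r[s, a, s'] = r(s, a, s'), pi[s, a] = pi(a|s).
    """
    n_states = P.shape[0]
    # State-to-state transition matrix and expected one-step reward under pi.
    P_pi = np.einsum("sa,sat->st", pi, P)
    r_pi = np.einsum("sa,sat,sat->s", pi, P, r)
    # V_pi = (I - gamma * P_pi)^{-1} r_pi
    return np.linalg.solve(np.eye(n_states) - gamma * P_pi, r_pi)

# Example: a uniformly random policy over the toy MDP sketched earlier.
# pi = np.full((3, 2), 0.5); V = evaluate_policy_directly(P, r, pi, 0.9)
```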
Optimal value function
The optimal state-value function V∗(s) is
V∗(s) = maxπ Vπ(s)
The optimal action-value function q∗(s, a) is
q∗(s, a) = maxπ Qπ(s, a)
The goal in any MDP is to find the optimal value function
or, equivalently, an optimal policy π∗:
for any policy π, Vπ∗(s) ≥ Vπ(s), ∀s ∈ S
Bellman Optimality Equation
V∗(s) = max_{a∈A} ∑_{s′∈S} P(s′|s, a)[r(s, a, s′) + γV∗(s′)]
q∗(s, a) = ∑_{s′∈S} P(s′|s, a)[r(s, a, s′) + γ max_{a′∈A} q∗(s′, a′)]
The Bellman optimality equation is non-linear
many iterative methods:
value iteration
Q-learning
Structure of RL
Basis: Policy Evaluation + Policy Improvement
Policy iteration
Policy evaluation
for a given policy π, evaluate the state-value function Vπ(s) at each state s ∈ S
iterative application of the Bellman expectation backup
Vπ(s) ← ∑_{a∈A} π(a|s) ∑_{s′∈S} P(s′|s, a)[r(s, a, s′) + γVπ(s′)]
Policy improvement
consider a deterministic policy (greedy policy)
π(s) ← arg max_a ∑_{s′∈S} P(s′|s, a)[r(s, a, s′) + γVπ(s′)]
ε-greedy policy
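A compact sketch of the policy iteration loop above on the same hypothetical tabular arrays (iterative policy evaluation followed by greedy improvement; tolerances and sweep limits are arbitrary choices):

```python
import numpy as np

def policy_iteration(P, r, gamma, eval_tol=1e-8, max_sweeps=1000):
    n_states, n_actions, _ = P.shape
    policy = np.zeros(n_states, dtype=int)   # deterministic policy pi(s)
    V = np.zeros(n_states)
    while True:
        # Policy evaluation: repeated Bellman expectation backups for fixed pi.
        for _ in range(max_sweeps):
            V_new = np.array([
                np.sum(P[s, policy[s]] * (r[s, policy[s]] + gamma * V))
                for s in range(n_states)
            ])
            delta = np.max(np.abs(V_new - V))
            V = V_new
            if delta < eval_tol:
                break
        # Policy improvement: act greedily with respect to the evaluated V.
        Q = np.einsum("sat,sat->sa", P, r + gamma * V[None, None, :])
        new_policy = Q.argmax(axis=1)
        if np.array_equal(new_policy, policy):
            return policy, V
        policy = new_policy
```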
Policy iteration
Value iteration
One-step truncated policy evaluation
Iterative application of Bellman optimality backup
V(s) ← max_{a∈A} ∑_{s′∈S} P(s′|s, a)[r(s, a, s′) + γV(s′)]
Learn optimal value function directly
Unlike policy iteration, there is no explicit policy
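A corresponding value iteration sketch on the same hypothetical arrays; a greedy policy is only read off at the end, matching the remark that no explicit policy is maintained during the iteration:

```python
import numpy as np

def value_iteration(P, r, gamma, tol=1e-8, max_sweeps=10000):
    n_states = P.shape[0]
    V = np.zeros(n_states)
    for _ in range(max_sweeps):
        # Bellman optimality backup for every state at once.
        Q = np.einsum("sat,sat->sa", P, r + gamma * V[None, None, :])
        V_new = Q.max(axis=1)
        delta = np.max(np.abs(V_new - V))
        V = V_new
        if delta < tol:
            break
    # A greedy policy is extracted only after the values have converged.
    Q = np.einsum("sat,sat->sa", P, r + gamma * V[None, None, :])
    return Q.argmax(axis=1), V
```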
Value iteration
State-value function space
Consider the vector space V over state-value functions
There are |S| dimensions
Each point in this space fully specifies a state-value function V(s)
For any U, V ∈ V, we measure the distance between U and V by
||U − V||∞ = max_{s∈S} |U(s) − V(s)|
Convergence
Define the Bellman expectation backup operator Tπ, for any state s
Tπ(V(s)) = ∑_{a∈A} π(a|s) ∑_{s′∈S} P(s′|s, a)[r(s, a, s′) + γV(s′)]
then
‖Tπ(V(s)) − Tπ(U(s))‖∞ = ‖γ ∑_{a∈A} π(a|s) ∑_{s′∈S} P(s′|s, a)[V(s′) − U(s′)]‖∞
                        ≤ γ max_{s′∈S} |V(s′) − U(s′)|
Convergence
thus
‖Tπ(V)− Tπ(U)‖∞ ≤ γ‖V − U‖∞
we say the operator Tπ is a γ-contraction
Contraction Mapping Theorem
For any metric space V that is complete under an operator T(v), where T is a γ-contraction:
T converges to a unique fixed point
at a linear convergence rate of γ
Convergence
Both the Bellman expectation operator Tπ and the Bellman optimality operator T∗ are γ-contractions
The Bellman equation shows Vπ is the fixed point of Tπ
Bellman optimality equation shows V∗ is the fixed point of T∗
Iterative policy evaluation converges to Vπ; policy iteration converges to V∗
Value iteration converges to V∗
Model-free Algorithm
Previous discussion:
planning by dynamic programming
the model and dynamics of the MDP are required
curse of dimensionality
Next discussion:
model-free prediction and control
estimate and optimize the value function of an unknown MDP
In RL, we always deal with an unknown model; exploration and exploitation are needed
Model-free prediction
Goal: learn Vπ(s) from episodes of experience under policy π
On policy learning
No requirement of MDP transitions or rewards
Monte-Carlo learning
Temporal-Difference learning
Monte-Carlo
Update Vπ(s) incrementally from episodes of experience under policy π
Every (or the first) time-step t that state s is visited in an episode:
N(s) ← N(s) + 1
V(s) ← [(N(s) − 1)V(s) + Gt] / N(s)
or, V(s) ← V(s) + α(Gt − V(s))
By the law of large numbers, V(s)→ Vπ(s) as N(s)→∞
MC must learn from complete episodes; no bootstrapping.
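An every-visit Monte-Carlo prediction sketch implementing the incremental update above. The env object with reset() and step(a) returning (next_state, reward, done) is an assumed interface, not something defined in the slides:

```python
from collections import defaultdict

def mc_evaluate(env, policy, gamma, n_episodes=1000, alpha=None):
    """Every-visit Monte-Carlo prediction of V_pi from complete episodes."""
    V = defaultdict(float)
    N = defaultdict(int)
    for _ in range(n_episodes):
        # Generate one complete episode following pi.
        episode, s, done = [], env.reset(), False
        while not done:
            s_next, reward, done = env.step(policy(s))
            episode.append((s, reward))
            s = s_next
        # Walk backwards through the episode to accumulate returns G_t.
        G = 0.0
        for s, reward in reversed(episode):
            G = reward + gamma * G
            N[s] += 1
            step = alpha if alpha is not None else 1.0 / N[s]
            V[s] += step * (G - V[s])   # incremental mean / constant-alpha update
    return V
```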
Temporal-Difference
The simplest: TD(0)
Update Vπ(s) incrementally from episodes of experience under policy π
The update only requires one step of an episode
V(s)← V(s) + α(r(s, a, s′) + γV(s′)− V(s))
TD target: r(s, a, s′) + γV(s′)
TD error: r(s, a, s′) + γV(s′)− V(s)
r(s, a, s′) is the observed reward from one step forward simulation
TD(0) learns from incomplete episodes, by bootstrapping.
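A matching TD(0) prediction sketch under the same assumed environment interface; the update happens after every step, using the bootstrapped target r + γV(s′):

```python
from collections import defaultdict

def td0_evaluate(env, policy, gamma, alpha=0.1, n_episodes=1000):
    """TD(0) prediction of V_pi, bootstrapping from the current estimate V(s')."""
    V = defaultdict(float)
    for _ in range(n_episodes):
        s, done = env.reset(), False
        while not done:
            s_next, reward, done = env.step(policy(s))
            td_target = reward + (0.0 if done else gamma * V[s_next])
            V[s] += alpha * (td_target - V[s])   # step in the direction of the TD error
            s = s_next
    return V
```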
Comparison between MC and TD(0)
TD(0) can learn before knowing the final outcome, even without a final outcome.
MC must wait until the end of the episode before the return is known.
In MC, the return Gt is an unbiased estimate of Vπ(st); it depends on many random actions, transitions and rewards, which yields high variance.
However, the TD(0) target is a biased estimate of Vπ(st); it depends on one random action, transition and reward, which yields low variance.
MC has good convergence properties and is not very sensitive to the initial value of V(s).
TD(0) also converges, but is more sensitive to the initial value of V(s). TD(0) is more efficient than MC.
TD(λ)
Consider the n-step return, n = 1, 2, ..., defined as
G^(n)_t = rt+1 + γrt+2 + ... + γ^{n−1}rt+n + γ^n V(st+n)
n-step TD learning
V(st) ← V(st) + α(G^(n)_t − V(st))
n-step TD → MC, as n → ∞ (n = 1 recovers TD(0))
The λ-return G^λ_t combines all n-step returns G^(n)_t:
G^λ_t = (1 − λ) ∑_{n=1}^∞ λ^{n−1} G^(n)_t
Forward-view TD(λ)
V(st) ← V(st) + α(G^λ_t − V(st))
Unified view of RL
Policy Improvement
After evaluating the value function of policy π, we now improve the policy
greedy policy improvement
π′(s) = arg max_a ∑_{s′∈S} P(s′|s, a)(r(s, a, s′) + γVπ(s′))
π′(s) = arg max_a Qπ(s, a)
it is model-free and easier to obtain the policy from Qπ(s, a)
ε-greedy policy improvement
π′(a|s) = ε/|A| + 1 − ε  if a = arg max_{a′} Q(s, a′), and ε/|A| otherwise
ε-greedy policy ensures continual exploration, all actions are tried
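A small sketch of ε-greedy action selection over a tabular Q keyed by (state, action) pairs (tie-breaking and the random-number source are implementation choices):

```python
import random

def epsilon_greedy_action(Q, state, n_actions, epsilon):
    """With prob. 1 - eps act greedily w.r.t. Q; otherwise pick a uniform random action."""
    if random.random() < epsilon:
        return random.randrange(n_actions)                       # explore
    return max(range(n_actions), key=lambda a: Q[(state, a)])    # exploit
```

Because the random branch can also pick the greedy action, this gives the greedy action probability ε/|A| + 1 − ε and every other action probability ε/|A|, matching the formula above.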
ε-Greedy Policy Improvement
Theorem
For any ε-greedy policy π, the ε-greedy policy π′ w.r.t. Qπ is an improvement, i.e. Vπ′(s) ≥ Vπ(s), ∀s ∈ S.
Pf:
Qπ(s, π′(s)) = ∑_{a∈A} π′(a|s)Qπ(s, a)
= (1 − ε) max_{a∈A} Qπ(s, a) + (ε/|A|) ∑_{a∈A} Qπ(s, a)
≥ (1 − ε) ∑_{a∈A} [(π(a|s) − ε/|A|) / (1 − ε)] Qπ(s, a) + (ε/|A|) ∑_{a∈A} Qπ(s, a)
= ∑_{a∈A} π(a|s)Qπ(s, a) = Vπ(s)
ε-Greedy Policy Improvement (Proof)
On the other hand,
Qπ(s, π′(s)) =Eπ′ [rt+1 + γVπ(st+1)|st = s]
≤Eπ′ [rt+1 + γQπ(st+1, π′(st+1))|st = s]
=Eπ′ [rt+1 + γEπ′ [rt+2 + γVπ(st+2)]|st = s]
=Eπ′ [rt+1 + γrt+2 + γ2Vπ(st+2)|st = s]
≤Eπ′ [rt+1 + γrt+2 + γ2Qπ(st+2, π′(st+2))|st = s]
...
≤Eπ′ [rt+1 + γrt+2 + γ2rt+3 + γ3rt+4 + ...|st = s]
=Vπ′ (s)
Thus, Vπ′ (s) ≥ Vπ(s).
Monte-Carlo Policy Iteration
Policy evaluation: Monte-Carlo
Policy improvement: ε-greedy policy
TD Policy iteration is similar
Monte-Carlo Control
For each episode:
Policy evaluation: Monte-Carlo
Policy improvement: ε-greedy policy
GLIE
Definition
A sequence of policies {πk}∞k=1 is called GLIE (Greedy in the Limit with Infinite Exploration) if:
all state-action pairs are explored infinitely many times: lim_{k→∞} Nk(s, a) = ∞
the policy converges on a greedy policy: lim_{k→∞} πk(a′|s) = 1(a′ = arg max_a Qk(s, a))
For example, an ε-greedy policy is GLIE if ε reduces to zero as εk = 1/k
Theorem
GLIE Monte-Carlo control converges to the optimal action-value function, i.e. Q(s, a) → q∗(s, a)
TD control–Sarsa
For each time-step:
apply TD to evaluate Qπ(s, a), then use ε-greedy policy improvement
Q(s, a)← Q(s, a) + α(r(s, a, s′) + γQ(s′, a′)− Q(s, a))
action a′ is chosen from s′ using a policy derived from Q (e.g., ε-greedy)
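A Sarsa control sketch built on the ε-greedy helper sketched earlier, under the same assumed environment interface; the hyperparameter values are illustrative only:

```python
from collections import defaultdict

def sarsa(env, n_actions, gamma, alpha=0.1, epsilon=0.1, n_episodes=5000):
    """On-policy TD control: evaluate and improve the policy that is being followed."""
    Q = defaultdict(float)
    for _ in range(n_episodes):
        s, done = env.reset(), False
        a = epsilon_greedy_action(Q, s, n_actions, epsilon)
        while not done:
            s_next, reward, done = env.step(a)
            a_next = epsilon_greedy_action(Q, s_next, n_actions, epsilon)
            target = reward + (0.0 if done else gamma * Q[(s_next, a_next)])
            Q[(s, a)] += alpha * (target - Q[(s, a)])
            s, a = s_next, a_next
    return Q
```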
Convergence of Sarsa
Theorem
Sarsa converges to the optimal action-value function, i.e. Q(s, a) → q∗(s, a), under the following conditions:
GLIE sequence of policies πk(a|s)
Robbins-Monro sequence of step-sizes αk: ∑_{k=1}^∞ αk = ∞, ∑_{k=1}^∞ α²k < ∞
Off-policy learning
we have assumed that the episode is generated following the learning policy π
on-policy learning: the learning policy and the behaviour policy coincide
consider following a behaviour policy µ(a|s): {s1, a1, s2, a2, ..., sT} ∼ µ
importance sampling
off-policy learning
why is this important?
re-use experience generated from old policies
learn about the optimal policy while following an exploratory policy
learn about multiple policies while following one policy
Importance sampling
To estimate the expectation EX∼P[f(X)], we can sample from a different distribution instead
EX∼P[f(X)] = ∑ P(X)f(X) = ∑ Q(X) [P(X)/Q(X)] f(X) = EX∼Q[(P(X)/Q(X)) f(X)]
Q is some simple or known distribution
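A tiny numerical sketch of the identity; the two distributions below are arbitrary choices made only for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)
values = np.arange(4)
P = np.array([0.1, 0.2, 0.3, 0.4])       # target distribution P
Q = np.array([0.25, 0.25, 0.25, 0.25])   # sampling distribution Q

x = rng.choice(values, size=100_000, p=Q)   # sample X ~ Q
weights = P[x] / Q[x]                       # importance weights P(X)/Q(X)
estimate = np.mean(weights * x)             # estimates E_{X~P}[X]
print(estimate, np.sum(P * values))         # the two numbers should be close
```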
Off-policy version of MC and TD
MC
use rewards generated from µ to evaluate π
G^{π/µ}_t = [π(at|st)/µ(at|st)] [π(at+1|st+1)/µ(at+1|st+1)] ... [π(aT|sT)/µ(aT|sT)] Gt
V(st) ← V(st) + α(G^{π/µ}_t − V(st))
TD
use rewards generated from µ to evaluate π
V(st) ← V(st) + α([π(at|st)/µ(at|st)](r(st, at, st+1) + γV(st+1)) − V(st))
much lower variance than MC importance sampling
Q-learning
Off-policy learning of action-values Q(s, a)
Q(s, a) ← Q(s, a) + α(r(s, a, s′) + γ max_{a′} Q(s′, a′) − Q(s, a))
Why is Q-learning considered off-policy?
the learned Q directly approximates q∗, while the behaviour policy may not be optimal
Comparing with Sarsa
Q(s, a)← Q(s, a) + α(r(s, a, s′) + γQ(s′, a′)− Q(s, a))
Sarsa tends to learn more cautiously in an environment where exploration is costly, while Q-learning does not
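A Q-learning sketch for comparison with the Sarsa code above (same assumed helpers and environment interface); the sampled next action is replaced by a max over next actions:

```python
from collections import defaultdict

def q_learning(env, n_actions, gamma, alpha=0.1, epsilon=0.1, n_episodes=5000):
    """Off-policy TD control: behave eps-greedily, but learn about the greedy policy."""
    Q = defaultdict(float)
    for _ in range(n_episodes):
        s, done = env.reset(), False
        while not done:
            a = epsilon_greedy_action(Q, s, n_actions, epsilon)
            s_next, reward, done = env.step(a)
            best_next = max(Q[(s_next, b)] for b in range(n_actions))
            target = reward + (0.0 if done else gamma * best_next)
            Q[(s, a)] += alpha * (target - Q[(s, a)])
            s = s_next
    return Q
```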
Q-learning vs. Sarsa
Summary
Value function approximation
So far, we have represented value function by a lookup table
every state s (or state-action pair s, a) has an entry V(s) (or Q(s, a))
For large scale MDPs
too many states s (or state-action pairs s, a) to store
too slow to learn each value individually
Estimate value function with function approximation
V̂(s,w) ≈Vπ(s)
Q̂(s, a,w) ≈Qπ(s, a)
generalize from seen states to unseen states
update parameter w by MC or TD learning
Types of function approximation
Function approximators, e.g
Linear combinations of features, neural networks, decision trees, nearest neighbours, Fourier/wavelet bases...
Basic idea
Goal: find a parameter vector w minimizing the mean-square error between the approximate value V̂(s,w) and the true value Vπ(s)
J(w) = Eπ[(Vπ(s)− V̂(s,w))2]
Gradient descent (α is the stepsize)
∆w = −(1/2)α∇wJ(w) = αEπ[(Vπ(s) − V̂(s,w))∇wV̂(s,w)]
Stochastic gradient descent
∆w = α(Vπ(s)− V̂(s,w))∇wV̂(s,w)
Linear case
V̂(s,w) = x(s)ᵀw = ∑_{i=1}^N xi(s)wi, where x(s) is a feature vector
Objective function is quadratic in w
J(w) = Eπ[(Vπ(s)− x(s)Tw)2]
Update is simple
∆w = α(Vπ(s)− V̂(s,w))x(s)
Practical algorithm
The update requires the true value function
But in RL, it is unavailable; we only have rewards from experience
In practice, we substitute a target for Vπ(s), e.g
for MC, the target is Gt
for TD(0), the target is r(s, a, s′) + γV̂(s′,w)
for TD(λ), the target is the λ-return G^λ_t
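A semi-gradient TD(0) sketch with a linear approximator, substituting the TD target for Vπ(s) as described above; the feature map x(s) and the environment interface are assumptions carried over from the earlier sketches:

```python
import numpy as np

def linear_td0(env, policy, x, n_features, gamma, alpha=0.01, n_episodes=1000):
    """Semi-gradient TD(0) with a linear approximator V_hat(s, w) = x(s)^T w."""
    w = np.zeros(n_features)
    for _ in range(n_episodes):
        s, done = env.reset(), False
        while not done:
            s_next, reward, done = env.step(policy(s))
            v = x(s) @ w
            v_next = 0.0 if done else x(s_next) @ w
            td_error = reward + gamma * v_next - v
            w += alpha * td_error * x(s)   # gradient of x(s)^T w w.r.t. w is x(s)
            s = s_next
    return w
```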
Similarly, we can build approximators of the action-value function in the same way
Approximate value function methods suffer from a lack of strong theoretical performance guarantees¹
¹ Sham Kakade, John Langford. Approximately Optimal Approximate Reinforcement Learning
Batch methods
Gradient descent is simple and appealing, but it is not sample efficient
Experience D consisting of < state, value > pairs
D = {< s1,Vπ1 >,< s2,Vπ2 >, ..., < sT ,VπT >}
Minimising the sum-squared error
LS(w) = ∑_{t=1}^T (Vπt − V̂(st,w))² = ED[(Vπ − V̂(s,w))²]
SGD with experience replay
Given experience D consisting of < state, value > pairs
D = {< s1,Vπ1 >,< s2,Vπ2 >, ..., < sT ,VπT >}
Repeat:
1. sample state, value from experience: < s, Vπ > ∼ D
2. apply SGD update: ∆w = α(Vπ − V̂(s,w))∇wV̂(s,w)
Converges to the least squares solution
wπ = arg min_w LS(w)
In practice, we use noisy or biased samples of Vπt, e.g.
LSMC (Least Squares Monte-Carlo): use the return Gt ≈ Vπt
Policy Gradient
So far, we are discussing value-based RL
learn the value function
implicit policy (e.g. ε-greedy)
Now, consider parametrizing the policy
πθ(a|s) = P(a|s, θ)
Policy-based RL
no value function
learn the policy directly
Policy-based RL
Advantages:
better convergence properties
effective in high-dimensional or continuous action spaces
can learn stochastic policies
Disadvantages:
typically converge to a local rather than a global optimum
evaluating a policy is typically inefficient and high variance
Measure the quality of a policy πθ
In episodic environments:
s1 is the start state of an episode
J1(θ) = Vπθ(s1)
In continuing environments:
JavV(θ) = ∑_s pπθ(s)Vπθ(s)
or, JavR(θ) = ∑_s pπθ(s) ∑_a πθ(a|s) r_s^a
pπθ(s) is the stationary distribution of the Markov chain for πθ
pπθ(s) = P(s1 = s) + γP(s2 = s) + γ²P(s3 = s) + ...
Policy Gradient
Goal: find θ to maximize J(θ) by ascending the gradient of J(θ) w.r.t. θ
∆θ = α∇θJ(θ), where ∇θJ(θ) = (∂J(θ)/∂θ1, ..., ∂J(θ)/∂θN)ᵀ
How to compute gradient?
Perturbing θ by a small amount ε in the kth dimension, ∀k ∈ [1, N]:
∂J(θ)/∂θk ≈ [J(θ + εek) − J(θ)] / ε
Simple, but noisy, inefficient in most cases
Score function
Assume policy πθ is differentiable whenever it is non-zero
Likelihood ratios exploit the following identity
∇θπθ(a|s) = πθ(a|s) [∇θπθ(a|s) / πθ(a|s)] = πθ(a|s) ∇θ log πθ(a|s)
The score function is ∇θ logπθ(a|s)
For example, consider a Gaussian policy: a ∼ N (µ(s), σ2)
Mean is linear combination of state features µ(s) = φ(s)Tθ
The score function is
∇θ log πθ(a|s) = (a − µ(s))φ(s) / σ²
One-step MDPs
Consider one-step MDPs
start with s ∼ p(s), and terminate after one step with reward r_s^a
J(θ) = ∑_s p(s) ∑_a πθ(a|s) r_s^a
∇θJ(θ) = ∑_s p(s) ∑_a ∇θπθ(a|s) r_s^a
        = ∑_s p(s) ∑_a πθ(a|s) ∇θ log πθ(a|s) r_s^a
        = Eπθ[∇θ log πθ(a|s) r_s^a]
Policy Gradient Theorem
For multi-step MDPs, we can use the likelihood ratio to obtain a similar conclusion:
Theorem
For any differentiable policy πθ(a|s), and for any of the policy objective functions J = J1, JavR or (1/(1−γ))JavV, the policy gradient is
∇θJ(θ) = Eπθ[∇θ log πθ(a|s) Qπθ(s, a)]
Monte-Carlo Policy Gradient (REINFORCE)
Apply stochastic gradient ascent
Experience an episode
{s1, a1, r2, s2, a2, r3, ..., sT−1, aT−1, rT , sT} ∼ πθ
Use the return vt as an unbiased sample of Qπθ(st, at), t = 1, ..., T − 1
∆θt = α∇θ log πθ(at|st) vt
High variance
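A REINFORCE sketch for a discrete action space, assuming a softmax policy over linear features; the parameterization, the feature map phi(s, a), and the environment interface are illustrative assumptions rather than anything prescribed by the slides:

```python
import numpy as np

def softmax_policy(theta, phi, s, n_actions):
    """pi_theta(a|s) proportional to exp(phi(s, a)^T theta)."""
    prefs = np.array([phi(s, a) @ theta for a in range(n_actions)])
    prefs -= prefs.max()                       # for numerical stability
    probs = np.exp(prefs)
    return probs / probs.sum()

def reinforce(env, phi, n_features, n_actions, gamma, alpha=0.01, n_episodes=2000):
    theta = np.zeros(n_features)
    for _ in range(n_episodes):
        # Sample one episode with the current policy.
        states, actions, rewards = [], [], []
        s, done = env.reset(), False
        while not done:
            probs = softmax_policy(theta, phi, s, n_actions)
            a = np.random.choice(n_actions, p=probs)
            s_next, reward, done = env.step(a)
            states.append(s)
            actions.append(a)
            rewards.append(reward)
            s = s_next
        # Stochastic gradient ascent: grad log pi(a_t|s_t) times the return v_t.
        G = 0.0
        for t in reversed(range(len(states))):
            G = rewards[t] + gamma * G
            s, a = states[t], actions[t]
            probs = softmax_policy(theta, phi, s, n_actions)
            expected_phi = sum(probs[b] * phi(s, b) for b in range(n_actions))
            grad_log_pi = phi(s, a) - expected_phi   # softmax score function
            theta += alpha * grad_log_pi * G
    return theta
```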
Actor-Critic
Reduce the variance
Use a critic to estimate the action-value function
Qw(s, a) ≈ Qπθ(s, a)
Actor-Critic algorithms maintain two sets of parameters
Critic: update action-value function parameters w
Actor: update policy parameters θ in the direction suggested by the Critic
∇θJ(θ) ≈ Eπθ[∇θ logπθ(a|s)Qw(s, a)]
∆θ = α∇θ logπθ(a|s)Qw(s, a)
Can we avoid any bias by choosing action-value functionapproximation carefully?
Compatible Function Approximation Theorem
Theorem
If the action-value function approximator Qw(s, a) satisfies the following two conditions:
∇wQw(s, a) = ∇θ log πθ(a|s)
w = arg min_{w′} Eπθ[(Qπθ(s, a) − Qw′(s, a))²]
then the policy gradient is exact, i.e.
∇θJ(θ) = Eπθ[∇θ logπθ(a|s)Qw(s, a)]
The function approximator is compatible with the policy in the sense that if we use the approximation Qw(s, a) in lieu of the true value Qπθ(s, a) to compute the gradient, the result is still exact
Proof
Pf: Denote ε = Eπθ[(Qπθ(s, a) − Qw(s, a))²], then
∇wε = 0
Eπθ[(Qπθ(s, a) − Qw(s, a))∇wQw(s, a)] = 0
Eπθ[(Qπθ(s, a) − Qw(s, a))∇θ log πθ(a|s)] = 0
Eπθ[Qπθ(s, a)∇θ log πθ(a|s)] = Eπθ[Qw(s, a)∇θ log πθ(a|s)].
Baseline
Reduce the variance
A baseline function B(s):
Eπθ[∇θ log πθ(a|s)B(s)] = ∑_s pπθ(s) ∑_a ∇θπθ(a|s)B(s)
                        = ∑_s pπθ(s)B(s)∇θ ∑_a πθ(a|s) = 0
A good baseline B(s) = Vπθ(s)
Advantage function
Aπθ(s, a) = Qπθ(s, a) − Vπθ(s)
Policy gradient
∇θJ(θ) = Eπθ[∇θ log πθ(a|s)Qπθ(s, a)] = Eπθ[∇θ log πθ(a|s)Aπθ(s, a)]
∆θ = α∇θ log πθ(a|s)Aπθ(s, a)
The advantage function can significantly reduce variance of policygradient
Estimate both Vπθ(s) and Qπθ(s, a) to obtain Aπθ(s, a), e.g. by TD learning
Estimation of advantage function
Apply TD learning to estimate value function
The TD error
δπθ = r(s, a, s′) + γVπθ(s′) − Vπθ(s)
is an unbiased estimate of the advantage function:
Eπθ[δπθ | s, a] = Eπθ[r(s, a, s′) + γVπθ(s′) − Vπθ(s) | s, a]
                = Eπθ[r(s, a, s′) + γVπθ(s′) | s, a] − Vπθ(s)
                = Qπθ(s, a) − Vπθ(s) = Aπθ(s, a)
thus the update
∆θ = α∇θ logπθ(a|s)δπθ
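An actor-critic sketch that uses this TD error as the advantage estimate; it reuses the assumed softmax_policy helper from the REINFORCE sketch and a linear critic over state features x(s), with illustrative step sizes:

```python
import numpy as np

def actor_critic_td(env, phi, x, n_theta, n_w, n_actions, gamma,
                    alpha_theta=0.01, alpha_w=0.05, n_episodes=2000):
    theta = np.zeros(n_theta)   # actor (policy) parameters
    w = np.zeros(n_w)           # critic (state-value) parameters
    for _ in range(n_episodes):
        s, done = env.reset(), False
        while not done:
            probs = softmax_policy(theta, phi, s, n_actions)
            a = np.random.choice(n_actions, p=probs)
            s_next, reward, done = env.step(a)
            # Critic: TD error, used as the advantage estimate.
            v = x(s) @ w
            v_next = 0.0 if done else x(s_next) @ w
            delta = reward + gamma * v_next - v
            w += alpha_w * delta * x(s)
            # Actor: policy gradient step scaled by the TD error.
            expected_phi = sum(probs[b] * phi(s, b) for b in range(n_actions))
            theta += alpha_theta * delta * (phi(s, a) - expected_phi)
            s = s_next
    return theta, w
```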
Natural Policy Gradient 2
Consider
max_{dθ} J(θ + dθ)  s.t. ‖dθ‖_{Gθ} = a
where a is a small constant, and ‖dθ‖²_{Gθ} = (dθ)ᵀGθ(dθ).
The optimal
(dθ)∗ := ∇nat J(θ) = Gθ⁻¹ ∇θJ(θ)
Take the metric matrix Gθ to be the Fisher information matrix
Gθ = Eπθ[∇θ log πθ(a|s) ∇θ log πθ(a|s)ᵀ]
² Sham Kakade. A Natural Policy Gradient. In Advances in Neural Information Processing Systems, pp. 1057-1063. MIT Press, 2002.
Natural Actor-Critic
Use the compatible function approximation Qw(s, a) = ∇θ log πθ(a|s)ᵀw
w = arg min_{w′} Eπθ[(Qπθ(s, a) − Qw′(s, a))²]
then
∇θJ(θ) =Eπθ[∇θ logπθ(a|s)Qw(s, a)]
=Eπθ[∇θ logπθ(a|s)∇θ logπθ(a|s)Tw]
=Eπθ[∇θ logπθ(a|s)∇θ logπθ(a|s)T ]w
=Gθw
so, ∇nat J(θ) = Gθ⁻¹ ∇θJ(θ) = w
i.e. the natural policy gradient updates the Actor parameters θ in the direction of the Critic parameters w
Model-based RL
Learn a model directly from experience
Use planning to construct a value function or policy
Learn a model
For an MDP (S,A,P,R)
A model M = < Pη, Rη >, parameterized by η
Pη ≈ P, Rη ≈ R
s′ ∼ Pη(s′|s, a), r(s, a, s′) = Rη(s, a, s′)
Typically assume conditional independence between state transitionsand rewards
P(s′, r(s, a, s′)|s, a) = P(s′|s, a)P(r(s, a, s′)|s, a)
Learn a model
Experience s1, a1, r2, s2, a2, r3, ...sT
s1, a1 →r2, s2
s2, a2 →r3, s3
...sT−1, aT−1 →rT , sT
Learning s, a→ r is a regression problem
Learning s, a→ s′ is a density estimation problem
find parameters η to minimise empirical loss
Example: Table Lookup Model
count visits N(s, a) to each state-action pair
P̂(s′|s, a) = (1/N(s, a)) ∑_{t=1}^T 1(st = s, at = a, st+1 = s′)
r̂_s^a = (1/N(s, a)) ∑_{t=1}^T 1(st = s, at = a) rt
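A sketch of the table-lookup model estimator following these counts; storing counts in dictionaries keyed by (s, a) is an implementation choice, and the episode is assumed to be given as a list of (s, a, r, s') transitions:

```python
from collections import defaultdict

def fit_table_lookup_model(transitions):
    """Estimate P_hat(s'|s, a) and r_hat(s, a) from a list of (s, a, r, s_next) tuples."""
    N = defaultdict(int)              # visit counts N(s, a)
    next_counts = defaultdict(int)    # counts of (s, a, s_next)
    reward_sum = defaultdict(float)   # summed rewards per (s, a)
    for s, a, r, s_next in transitions:
        N[(s, a)] += 1
        next_counts[(s, a, s_next)] += 1
        reward_sum[(s, a)] += r
    P_hat = {(s, a, s_next): c / N[(s, a)] for (s, a, s_next), c in next_counts.items()}
    r_hat = {sa: total / N[sa] for sa, total in reward_sum.items()}
    return P_hat, r_hat
```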
Planning with a model
After estimating a model, we can plan with the algorithms introduced before
Dynamic Programming
Policy iteration
Value iteration
...
Model-free RL
Monte-Carlo
Sarsa
Q-learning
...
The performance of model-based RL is limited to the optimal policy for the approximate MDP (S, A, Pη, Rη)
when the estimated model is imperfect, model-free RL is more efficient
Dyna Architecture
Learn a model from real experience (true MDP)
Learn and plan the value function (and/or policy) from real and simulated experience