Lecture: Reinforcement Learning

http://bicmr.pku.edu.cn/~wenzw/bigdata2017.html

Acknowledgement: these slides are based on Prof. David Silver’s lecture notes

Thanks to Mingming Zhao for preparing these slides



Outline

Introduction of MDP

Dynamic Programming

Model-free Control

Large-Scale RL

Model-based RL


What is RL

Reinforcement learning

is learning what to do – how to map situations to actions – so as to maximize a numerical reward signal. The decision-maker is called the agent; the thing it interacts with is called the environment.

A reinforcement learning task that satisfies the Markov property is called a Markov Decision Process, or MDP

We assume that all RL tasks can be approximated by tasks with the Markov property, so this talk is based on MDPs.


Definition

A Markov Decision Process is a tuple (S, A, P, r, γ):

S is a finite set of states, s ∈ S

A is a finite set of actions, a ∈ A

P is the transition probability distribution: the probability of moving from state s to state s′ under action a is P(s′|s, a); also called the model or the dynamics

r is a reward function, r(s, a, s′); sometimes just r(s), or rt after time step t

γ ∈ [0, 1] is a discount factor, why discount?


Example

A simple MDP with three states and two actions:
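The figure from the original slide is not reproduced here. As a stand-in, a hypothetical three-state, two-action MDP can be written down explicitly; every number below is invented purely for illustration and is not taken from the slide:

```python
import numpy as np

# Hypothetical 3-state, 2-action MDP (numbers invented for illustration).
# P[s, a, s'] = P(s' | s, a), R[s, a, s'] = r(s, a, s'), gamma = discount factor.
n_states, n_actions = 3, 2
P = np.zeros((n_states, n_actions, n_states))
R = np.zeros((n_states, n_actions, n_states))

P[0, 0] = [0.7, 0.3, 0.0]; P[0, 1] = [0.0, 0.4, 0.6]
P[1, 0] = [0.1, 0.8, 0.1]; P[1, 1] = [0.0, 0.0, 1.0]
P[2, 0] = [1.0, 0.0, 0.0]; P[2, 1] = [0.2, 0.0, 0.8]

R[0, 1, 2] = 1.0   # reaching state 2 via action 1 pays +1
R[1, 1, 2] = 2.0
R[2, 0, 0] = -0.5  # resetting to state 0 costs 0.5
gamma = 0.9

# Each row P[s, a, :] sums to one, as a transition distribution must.
assert np.allclose(P.sum(axis=2), 1.0)
```

The later sketches reuse this P/R array convention.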


Markov Property

A state st is Markov iff

P(st+1|st) = P(st+1|s1, ..., st)

the state captures all relevant information from the history

once the state is known, the history may be thrown away

i.e. the state is a sufficient statistic of the future


Agent and Environment

The agent selects actions based on the observations and rewards received at each time-step t

The environment emits observations and rewards based on the actions received at each time-step t


Policy

π(a|s) = P(at = a|st = s)

policy π defines the behaviour of an agent

for an MDP, the policy depends on the current state (Markov property)

deterministic policy: a = π(s)


Value function

Denote

G(t) = rt+1 + γrt+2 + γ^2 rt+3 + ... = ∑_{k=0}^∞ γ^k rt+k+1

state-value function

Vπ(s) is the total amount of reward expected to accumulate over the future starting from state s and then following policy π

Vπ(s) = Eπ(G(t)|st = s)

action-value function

Qπ(s, a) is the total amount of reward expected to accumulate over the future starting from state s, taking action a, and then following policy π

Qπ(s, a) = Eπ(G(t)|st = s, at = a)


Bellman Equation

Vπ(s) = Eπ(rt+1 + γrt+2 + γ^2 rt+3 + ... | st = s)
      = Eπ(rt+1 + γ(rt+2 + γrt+3 + ...) | st = s)
      = Eπ(rt+1 + γG(t + 1) | st = s)
      = Eπ(rt+1 + γVπ(st+1) | st = s)

For state-value function

Vπ(s) = Eπ(rt+1 + γVπ(st+1) | st = s)
      = ∑_{a∈A} π(a|s) ∑_{s′∈S} P(s′|s, a)[r(s, a, s′) + γVπ(s′)]

Similarly,

Qπ(s, a) = Eπ(r(s) + γQπ(s′, a′) | s, a)
         = ∑_{s′∈S} P(s′|s, a)[r(s, a, s′) + γ ∑_{a′∈A} π(a′|s′)Qπ(s′, a′)]


Bellman Equation

The Bellman equation is a linear equation; it can be solved directly, but only for small MDPs (see the sketch after the list below)

The Bellman equation motivates a lot of iterative methods for MDPs:

Dynamic Programming

Monte-Carlo evaluation

Temporal-Difference learning
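As noted above, for a small MDP and a fixed policy the Bellman equation is linear and can be solved directly as V = (I − γPπ)⁻¹ rπ. A minimal sketch, assuming the P[s, a, s′]/R[s, a, s′] array convention of the illustrative MDP earlier and a stochastic policy array pi[s, a]; these names are ours, not the slides':

```python
import numpy as np

def bellman_solve(P, R, pi, gamma):
    """Solve V = r_pi + gamma * P_pi @ V exactly (feasible only for small MDPs)."""
    # Policy-averaged transition matrix P_pi[s, s'] and expected one-step reward r_pi[s].
    P_pi = np.einsum('sa,sat->st', pi, P)
    r_pi = np.einsum('sa,sat,sat->s', pi, P, R)
    n = P.shape[0]
    return np.linalg.solve(np.eye(n) - gamma * P_pi, r_pi)
```

For anything larger, the iterative methods listed above are used instead.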


Optimal value function

The optimal state-value function V∗(s) is

V∗(s) = maxπ Vπ(s)

The optimal action-value function q∗(s, a) is

q∗(s, a) = maxπ Qπ(s, a)

The goal for any MDP is finding the optimal value function

Or equivalently an optimal policy π∗

for any policy π, Vπ∗(s) ≥ Vπ(s),∀s ∈ S


Bellman Optimality Equation

V∗(s) = max_{a∈A} ∑_{s′∈S} P(s′|s, a)[r(s, a, s′) + γV∗(s′)]

q∗(s, a) = ∑_{s′∈S} P(s′|s, a)[r(s, a, s′) + γ max_{a′∈A} q∗(s′, a′)]

The Bellman optimality equation is non-linear

many iterative methods

value iteration

Q-learning


Structure of RL

Basis: Policy Evaluation + Policy Improvement


Policy iteration

Policy evaluation

for a given policy π, evaluate the state-value function Vπ(s) at each state s ∈ S

iterative application of the Bellman expectation backup

Vπ(s) ← ∑_{a∈A} π(a|s) ∑_{s′∈S} P(s′|s, a)[r(s, a, s′) + γVπ(s′)]

Policy improvement

consider a deterministic (greedy) policy

π(s) ← arg max_a ∑_{s′∈S} P(s′|s, a)[r(s, a, s′) + γVπ(s′)]

ε-greedy policy


Policy iteration
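The diagram on this slide is omitted in this text version. As a minimal sketch of the loop it illustrates (iterative policy evaluation alternating with greedy improvement), assuming the same hypothetical P[s, a, s′]/R[s, a, s′] arrays as before:

```python
import numpy as np

def policy_iteration(P, R, gamma, tol=1e-8):
    """Alternate policy evaluation and greedy policy improvement."""
    n_states, n_actions, _ = P.shape
    policy = np.zeros(n_states, dtype=int)          # deterministic policy pi(s)
    while True:
        # Policy evaluation: iterate the Bellman expectation backup to convergence.
        V = np.zeros(n_states)
        while True:
            V_new = np.array([np.sum(P[s, policy[s]] * (R[s, policy[s]] + gamma * V))
                              for s in range(n_states)])
            delta = np.max(np.abs(V_new - V))
            V = V_new
            if delta < tol:
                break
        # Policy improvement: act greedily with respect to the evaluated V.
        Q = np.einsum('sat,sat->sa', P, R + gamma * V)   # Q[s, a]
        new_policy = Q.argmax(axis=1)
        if np.array_equal(new_policy, policy):
            return policy, V
        policy = new_policy
```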


Value iteration

One-step truncated policy evaluation

Iterative application of Bellman optimality backup

V(s) ← max_{a∈A} ∑_{s′∈S} P(s′|s, a)[r(s, a, s′) + γV(s′)]

Learn optimal value function directly

Unlike policy iteration, there is no explicit policy


Value iteration
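The algorithm figure on this slide is likewise omitted. A minimal sketch of the value-iteration backup above, with the same assumed array conventions; a greedy policy is only extracted at the end, since no explicit policy is maintained:

```python
import numpy as np

def value_iteration(P, R, gamma, tol=1e-8):
    """Iterate the Bellman optimality backup until the value function stabilizes."""
    n_states = P.shape[0]
    V = np.zeros(n_states)
    while True:
        # Q[s, a] = sum_{s'} P(s'|s, a) [r(s, a, s') + gamma * V(s')]
        Q = np.einsum('sat,sat->sa', P, R + gamma * V)
        V_new = Q.max(axis=1)
        if np.max(np.abs(V_new - V)) < tol:
            return V_new, Q.argmax(axis=1)   # optimal values and a greedy policy
        V = V_new
```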


State-value function space

Consider the vector space V over state-value functions

There are |S| dimensions

Each point in this space fully specifies a state-value function V(s)

For any U, V ∈ V, we measure the distance between U and V by

‖U − V‖∞ = max_{s∈S} |U(s) − V(s)|


Convergence

Define the Bellman expectation backup operator Tπ, for any state s

Tπ(V(s)) = ∑_{a∈A} π(a|s) ∑_{s′∈S} P(s′|s, a)[r(s, a, s′) + γV(s′)]

then

‖Tπ(V(s)) − Tπ(U(s))‖∞ = ‖γ ∑_{a∈A} π(a|s) ∑_{s′∈S} P(s′|s, a)[V(s′) − U(s′)]‖∞
                       ≤ γ max_{s′∈S} |V(s′) − U(s′)|


Convergence

thus

‖Tπ(V)− Tπ(U)‖∞ ≤ γ‖V − U‖∞

we say the operator Tπ is a γ-contraction

Contraction Mapping Theorem: for any metric space V that is complete under an operator T(v), where T is a γ-contraction,

T converges to a unique fixed point

at a linear convergence rate of γ


Convergence

Both the Bellman expectation operator Tπ and the Bellman optimality operator T∗ are γ-contractions

The Bellman equation shows Vπ is the fixed point of Tπ

The Bellman optimality equation shows V∗ is the fixed point of T∗

Iterative policy evaluation converges to Vπ, policy iteration converges to V∗, and value iteration converges to V∗


Model-free Algorithm

Previous discussion:

planning by dynamic programming

the model and dynamics of the MDP are required

curse of dimensionality

Next discussion:

model-free prediction and control

estimate and optimize the value function of an unknown MDP

In RL we always deal with an unknown model, so exploration and exploitation are needed


Model-free prediction

Goal: learn Vπ(s) from episodes of experience under policy π

On-policy learning

No requirement of MDP transitions or rewards

Monte-Carlo learning

Temporal-Difference learning


Monte-Carlo

Update Vπ(s) incrementally from episodes of experience under policy π

Every time-step t (or only the first) at which state s is visited in an episode:

N(s) ← N(s) + 1

V(s) ← ((N(s) − 1)V(s) + Gt) / N(s)

or, V(s) ← V(s) + α(Gt − V(s))

By the law of large numbers, V(s)→ Vπ(s) as N(s)→∞

MC must learn from complete episodes: no bootstrapping.
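A minimal sketch of incremental every-visit Monte-Carlo prediction. The episode format (a list of (state, reward) pairs, where the reward is the one received on leaving that state) is an assumption made for illustration:

```python
from collections import defaultdict

def mc_prediction(episodes, gamma):
    """Every-visit Monte-Carlo prediction from complete episodes."""
    V = defaultdict(float)
    N = defaultdict(int)
    for episode in episodes:                       # episode: list of (state, reward)
        G = 0.0
        # Walk backwards so the return G_t can be accumulated incrementally.
        for state, reward in reversed(episode):
            G = reward + gamma * G
            N[state] += 1
            V[state] += (G - V[state]) / N[state]  # running-mean update
    return dict(V)
```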


Temporal-Difference

The simplest: TD(0)

Update Vπ(s) incrementally from episodes of experience under policy π

The update only requires one step of an episode

V(s)← V(s) + α(r(s, a, s′) + γV(s′)− V(s))

TD target: r(s, a, s′) + γV(s′)

TD error: r(s, a, s′) + γV(s′)− V(s)

r(s, a, s′) is the observed reward from one step forward simulation

TD(0) learns from incomplete episodes, by bootstrapping.
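A minimal sketch of tabular TD(0) prediction. The `env.reset()`/`env.step(a)` interface and the `policy(s)` callable are hypothetical placeholders, not anything defined in the slides:

```python
def td0_prediction(env, policy, gamma=0.9, alpha=0.1, n_episodes=1000):
    """Tabular TD(0): move V(s) toward r + gamma * V(s') after every single step."""
    V = {}
    for _ in range(n_episodes):
        s = env.reset()
        done = False
        while not done:
            a = policy(s)
            s_next, r, done = env.step(a)              # assumed environment interface
            bootstrap = 0.0 if done else V.get(s_next, 0.0)
            td_error = r + gamma * bootstrap - V.get(s, 0.0)   # TD error
            V[s] = V.get(s, 0.0) + alpha * td_error
            s = s_next
    return V
```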


Comparison between MC and TD(0)

TD(0) can learn before knowing the final outcome, and even without a final outcome.

MC must wait until the end of the episode before the return is known.

In MC, the return Gt is an unbiased estimate of Vπ(st); it depends on many random actions, transitions and rewards, which yields high variance.

The TD(0) target is a biased estimate of Vπ(st); it depends on one random action, transition and reward, which yields low variance.

MC has good convergence properties and is not very sensitive to the initial value of V(s).

TD(0) also converges, but is more sensitive to the initial value of V(s). TD(0) is usually more efficient than MC.


TD(λ)

Consider the n-step return, n = 1, 2, ...; define

G^n_t = rt+1 + γrt+2 + ... + γ^{n−1} rt+n + γ^n V(st+n)

n-step TD learning

V(st) ← V(st) + α(G^n_t − V(st))

n-step TD → MC, as n → ∞

The λ-return G^λ_t combines all n-step returns G^n_t:

G^λ_t = (1 − λ) ∑_{n=1}^∞ λ^{n−1} G^n_t

Forward-view TD(λ)

V(st) ← V(st) + α(G^λ_t − V(st))


Unified view of RL


Policy Improvement

After evaluating the value function of policy π, we now improve the policy

greedy policy improvement

π′(s) = arg max_a ∑_{s′∈S} P(s′|s, a)(r(s, a, s′) + γVπ(s′))

π′(s) = arg max_a Qπ(s, a)

it is model-free and easier to obtain the policy from Qπ(s, a)

ε-greedy policy improvement

π′(a|s) = ε/|A| + 1 − ε, if a = arg max_a Q(s, a); ε/|A|, otherwise

The ε-greedy policy ensures continual exploration: all actions are tried
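A small sketch of ε-greedy action selection over a tabular Q, which is all the model-free improvement step needs; the layout `Q[state] -> array of action values` is an assumption for illustration:

```python
import numpy as np

def epsilon_greedy(Q, state, epsilon):
    """Pick a uniformly random action w.p. eps, else the greedy action w.r.t. Q.

    This realizes pi'(a|s) = eps/|A| + 1 - eps for the argmax action and eps/|A|
    for every other action, since the random branch also covers the greedy action.
    """
    n_actions = len(Q[state])
    if np.random.random() < epsilon:
        return np.random.randint(n_actions)   # explore
    return int(np.argmax(Q[state]))           # exploit
```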


ε-Greedy Policy Improvement

Theorem: For any ε-greedy policy π, the ε-greedy policy π′ w.r.t. Qπ is an improvement, i.e. Vπ′(s) ≥ Vπ(s), ∀s ∈ S.

Pf:

Qπ(s, π′(s)) = ∑_{a∈A} π′(a|s) Qπ(s, a)
             = (1 − ε) max_{a∈A} Qπ(s, a) + (ε/|A|) ∑_{a∈A} Qπ(s, a)
             ≥ (1 − ε) ∑_{a∈A} ((π(a|s) − ε/|A|)/(1 − ε)) Qπ(s, a) + (ε/|A|) ∑_{a∈A} Qπ(s, a)
             = ∑_{a∈A} π(a|s) Qπ(s, a) = Vπ(s)


ε-Greedy Policy Improvement (Proof)

On the other hand,

Qπ(s, π′(s)) = Eπ′[rt+1 + γVπ(st+1) | st = s]
             ≤ Eπ′[rt+1 + γQπ(st+1, π′(st+1)) | st = s]
             = Eπ′[rt+1 + γEπ′[rt+2 + γVπ(st+2)] | st = s]
             = Eπ′[rt+1 + γrt+2 + γ^2 Vπ(st+2) | st = s]
             ≤ Eπ′[rt+1 + γrt+2 + γ^2 Qπ(st+2, π′(st+2)) | st = s]
             ...
             ≤ Eπ′[rt+1 + γrt+2 + γ^2 rt+3 + γ^3 rt+4 + ... | st = s]
             = Vπ′(s)

Combining this with Vπ(s) ≤ Qπ(s, π′(s)) from the previous slide, Vπ′(s) ≥ Vπ(s).


Monte-Carlo Policy Iteration

Policy evaluation: Monte-Carlo

Policy improvement: ε-greedy policy

TD Policy iteration is similar


Monte-Carlo Control

For each episode:

Policy evaluation: Monte-Carlo

Policy improvement: ε-greedy policy


GLIE

Definition: A sequence of policies {πk}∞k=1 is called GLIE (Greedy in the Limit with Infinite Exploration) if:

all state-action pairs are explored infinitely many times:

lim_{k→∞} Nk(s, a) = ∞

the policy converges to a greedy policy:

lim_{k→∞} πk(a′|s) = 1(a′ = arg max_a Qk(s, a))

For example, the ε-greedy policy is GLIE if ε is reduced to zero as εk = 1/k

Theorem: GLIE Monte-Carlo control converges to the optimal action-value function, i.e. Q(s, a) → q∗(s, a)


TD control–Sarsa

For each time-step:

apply TD to evaluate Qπ(s, a), then use ε-greedy policy improvement

Q(s, a)← Q(s, a) + α(r(s, a, s′) + γQ(s′, a′)− Q(s, a))

action a′ is chosen from s′ using the policy derived from Q (e.g., ε-greedy)
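A minimal tabular Sarsa sketch under the same hypothetical Gym-style environment interface used earlier; note that the bootstrap uses the action a′ actually selected by the ε-greedy behavior policy:

```python
import numpy as np
from collections import defaultdict

def sarsa(env, n_actions, gamma=0.9, alpha=0.1, epsilon=0.1, n_episodes=1000):
    """On-policy TD control: the target uses the action actually taken next."""
    Q = defaultdict(lambda: np.zeros(n_actions))

    def eps_greedy(s):
        if np.random.random() < epsilon:
            return np.random.randint(n_actions)
        return int(np.argmax(Q[s]))

    for _ in range(n_episodes):
        s = env.reset()
        a = eps_greedy(s)
        done = False
        while not done:
            s_next, r, done = env.step(a)                    # assumed interface
            a_next = eps_greedy(s_next)
            target = r + (0.0 if done else gamma * Q[s_next][a_next])
            Q[s][a] += alpha * (target - Q[s][a])            # Sarsa update
            s, a = s_next, a_next
    return Q
```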


Convergence of Sarsa

Theorem: Sarsa converges to the optimal action-value function, i.e. Q(s, a) → q∗(s, a), under the following conditions:

GLIE sequence of policies πk(a|s)

Robbins-Monro sequence of step-sizes αk:

∑_{k=1}^∞ αk = ∞,  ∑_{k=1}^∞ αk^2 < ∞


Off-policy learning

We have assumed that the episode is generated following the learning policy π

on-policy learning: the learning policy and the behavior policy coincide

consider following a behavior policy µ(a|s): {s1, a1, s2, a2, ..., sT} ∼ µ

importance sampling

off-policy learning

why is this important?

Re-use experience generated from old policies

learn about the optimal policy while following an exploratory policy

learn about multiple policies while following one policy


Importance sampling

To estimate the expectation EX∼P[f(X)], we sample from a different distribution instead

EX∼P[f(X)] = ∑ P(X)f(X) = ∑ Q(X) (P(X)/Q(X)) f(X) = EX∼Q[(P(X)/Q(X)) f(X)]

Q is some simple or known distribution


Off-policy version of MC and TD

MC

use rewards generated from µ to evaluate π

G^{π/µ}_t = (π(at|st)/µ(at|st)) (π(at+1|st+1)/µ(at+1|st+1)) ... (π(aT|sT)/µ(aT|sT)) Gt

V(st) ← V(st) + α(G^{π/µ}_t − V(st))

TD

use rewards generated from µ to evaluate π

V(st) ← V(st) + α((π(at|st)/µ(at|st))(r(st, at, st+1) + γV(st+1)) − V(st))

much lower variance than MC importance sampling


Q-learning

Off-policy learning of action-values Q(s, a)

Q(s, a) ← Q(s, a) + α(r(s, a, s′) + γ max_{a′} Q(s′, a′) − Q(s, a))

Why is Q-learning considered off-policy?

the learned Q directly approximates q∗, while the behavior policy π may not be optimal

Compared with Sarsa

Q(s, a)← Q(s, a) + α(r(s, a, s′) + γQ(s′, a′)− Q(s, a))

Sarsa tends to learn more cautiously in an environment where exploration is costly, while Q-learning does not
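For contrast, a minimal Q-learning sketch under the same assumed environment interface; only the target differs from Sarsa, bootstrapping from max_{a′} Q(s′, a′) regardless of which action the behavior policy takes next:

```python
import numpy as np
from collections import defaultdict

def q_learning(env, n_actions, gamma=0.9, alpha=0.1, epsilon=0.1, n_episodes=1000):
    """Off-policy TD control: behave eps-greedily, bootstrap from the greedy value."""
    Q = defaultdict(lambda: np.zeros(n_actions))
    for _ in range(n_episodes):
        s = env.reset()
        done = False
        while not done:
            # Behavior policy: eps-greedy with respect to the current Q.
            if np.random.random() < epsilon:
                a = np.random.randint(n_actions)
            else:
                a = int(np.argmax(Q[s]))
            s_next, r, done = env.step(a)                     # assumed interface
            target = r + (0.0 if done else gamma * np.max(Q[s_next]))
            Q[s][a] += alpha * (target - Q[s][a])             # greedy (off-policy) target
            s = s_next
    return Q
```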


Q-learning vs. Sarsa


Summary



Value function approximation

So far, we have represented the value function by a lookup table

every state s (or state-action pair s, a) has an entry V(s) (or Q(s, a))

For large-scale MDPs:

too many states s (or state-action pairs (s, a)) to store

too slow to learn each value individually

Estimate value function with function approximation

V̂(s, w) ≈ Vπ(s)

Q̂(s, a, w) ≈ Qπ(s, a)

generalize from seen states to unseen states

update the parameter w by MC or TD learning


Types of function approximation

Function approximators, e.g.

linear combinations of features, neural networks, decision trees, nearest neighbor, Fourier/wavelet bases, ...


Basic idea

Goal: find a parameter vector w minimizing the mean-square error between the approximate value V̂(s, w) and the true value Vπ(s)

J(w) = Eπ[(Vπ(s)− V̂(s,w))2]

Gradient descent (α is stepsize)

∆w = −(1/2)α∇wJ(w) = αEπ[(Vπ(s) − V̂(s, w))∇wV̂(s, w)]

Stochastic gradient descent

∆w = α(Vπ(s)− V̂(s,w))∇wV̂(s,w)


Linear case

V̂(s, w) = x(s)ᵀw = ∑_{i=1}^N xi(s)wi, where x(s) is a feature vector

Objective function is quadratic in w

J(w) = Eπ[(Vπ(s)− x(s)Tw)2]

Update is simple

∆w = α(Vπ(s)− V̂(s,w))x(s)
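A minimal sketch of this linear update. The `target` argument stands in for the unknown Vπ(s); in practice it is an MC return or a TD target, as the next slide explains:

```python
import numpy as np

def linear_vfa_update(w, x_s, target, alpha):
    """One stochastic gradient step for V_hat(s, w) = x(s)^T w."""
    v_hat = x_s @ w                              # current prediction
    return w + alpha * (target - v_hat) * x_s    # delta_w = alpha * (target - v_hat) * x(s)

# Example: 4 features, one update toward a (hypothetical) target of 1.0.
w = np.zeros(4)
x_s = np.array([1.0, 0.5, 0.0, -1.0])
w = linear_vfa_update(w, x_s, target=1.0, alpha=0.1)
```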


Practical algorithm

The update requires the true value function

But in RL it is unavailable; we only have rewards from experience

In practice, we substitute a target for Vπ(s), e.g.

for MC, the target is Gt

for TD(0), the target is r(s, a, s′) + γV̂(s′, w)

for TD(λ), the target is the λ-return G^λ_t

Approximators of the action-value function can be generated in the same way

Approximate value function methods suffer from a lack of strong theoretical performance guarantees¹

¹Sham Kakade, John Langford. Approximately Optimal Approximate Reinforcement Learning.


Batch methods

Gradient descent is simple and appealing, but it is not sample efficient

Experience D consisting of ⟨state, value⟩ pairs

D = {⟨s1, Vπ1⟩, ⟨s2, Vπ2⟩, ..., ⟨sT, VπT⟩}

Minimising sum-squared error

LS(w) = ∑_{t=1}^T (Vπt − V̂(st, w))^2 = ED[(Vπ − V̂(s, w))^2]


SGD with experience replay

Given experience D consisting of ⟨state, value⟩ pairs

D = {⟨s1, Vπ1⟩, ⟨s2, Vπ2⟩, ..., ⟨sT, VπT⟩}

Repeat:

1. sample a ⟨state, value⟩ pair from experience: ⟨s, Vπ⟩ ∼ D

2. apply the SGD update: ∆w = α(Vπ − V̂(s, w))∇wV̂(s, w)

Converges to the least squares solution

wπ = arg min_w LS(w)

In practice, we use noisy or biased samples of Vπt , e.g.

LSMC (Least Squares Monte-Carlo): use the return Gt ≈ Vπt


Policy Gradient

So far, we have been discussing value-based RL:

learn a value function

implicit policy (e.g. ε-greedy)

Now, consider parametrizing the policy

πθ(a|s) = P(a|s, θ)

Policy-based RL

no value function

learn the policy


Policy-based RL

Advantages:

Better convergence properties

effective in high-dimensional or continuous action spaces

can learn stochastic policies

Disadvantages:

Typically converges to a local rather than a global optimum

evaluating a policy is typically inefficient and has high variance


Measure the quality of a policy πθ

In episodic environments:

s1 is the start state of an episode

J1(θ) = Vπθ(s1)

In continuing environments:

JavV(θ) = ∑_s pπθ(s) Vπθ(s)

or, JavR(θ) = ∑_s pπθ(s) ∑_a πθ(a|s) r_s^a

pπθ(s) is the stationary distribution of the Markov chain for πθ:

pπθ(s) = P(s1 = s) + γP(s2 = s) + γ^2 P(s3 = s) + ...


Policy Gradient

Goal: find θ that maximizes J(θ) by ascending the gradient of the policy objective J(θ) w.r.t. θ

∆θ = α∇θJ(θ), where ∇θJ(θ) = (∂J(θ)/∂θ1, ..., ∂J(θ)/∂θN)ᵀ

How to compute gradient?

Perturb θ by a small amount ε in the kth dimension, ∀k ∈ [1, N]

∂J(θ)/∂θk ≈ (J(θ + εek) − J(θ))/ε

Simple, but noisy, inefficient in most cases


Score function

Assume policy πθ is differentiable whenever it is non-zero

Likelihood ratios exploit the following identity

∇θπθ(a|s) = πθ(a|s) ∇θπθ(a|s)/πθ(a|s) = πθ(a|s) ∇θ log πθ(a|s)

The score function is ∇θ logπθ(a|s)

For example, consider a Gaussian policy: a ∼ N (µ(s), σ2)

The mean is a linear combination of state features: µ(s) = φ(s)ᵀθ

The score function is

∇θ log πθ(a|s) = (a − µ(s))φ(s)/σ^2


One-step MDPs

Consider one-step MDPs

start from a state s ∼ p(s), and terminate after one step with reward r_s^a

J(θ) = ∑_s p(s) ∑_a πθ(a|s) r_s^a

∇θJ(θ) = ∑_s p(s) ∑_a ∇θπθ(a|s) r_s^a
       = ∑_s p(s) ∑_a πθ(a|s) ∇θ log πθ(a|s) r_s^a
       = Eπθ[∇θ log πθ(a|s) r_s^a]


Policy Gradient Theorem

For multi-step MDPs, we can use the likelihood ratio to obtain a similar conclusion:

Theorem: For any differentiable policy πθ(a|s), and for any of the policy objective functions J = J1, JavR, or (1/(1 − γ))JavV, the policy gradient of J(θ) is

∇θJ(θ) = Eπθ[∇θ log πθ(a|s) Qπθ(s, a)]


Monte-Carlo Policy Gradient (REINFORCE)

Apply stochastic gradient ascent

Experience an episode

{s1, a1, r2, s2, a2, r3, ..., sT−1, aT−1, rT , sT} ∼ πθ

Use the return vt as an unbiased sample of Qπθ(st, at), t = 1, ..., T − 1

∆θt = α∇θ log πθ(at|st) vt

High variance
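A minimal REINFORCE sketch. The softmax policy with linear preferences over features phi(s) (theta is a features × actions matrix) is an assumed parameterization chosen for illustration, and the episode format of (phi_s, action, reward) triples is likewise hypothetical:

```python
import numpy as np

def softmax_policy(theta, phi_s):
    """pi_theta(a|s) with linear preferences phi(s)^T theta[:, a]."""
    prefs = phi_s @ theta
    prefs -= prefs.max()                      # numerical stability
    e = np.exp(prefs)
    return e / e.sum()

def reinforce_update(theta, episode, gamma, alpha):
    """One Monte-Carlo policy-gradient pass over a complete episode."""
    G = 0.0
    for phi_s, a, r in reversed(episode):     # episode: list of (phi_s, action, reward)
        G = r + gamma * G                     # return v_t from this step onward
        probs = softmax_policy(theta, phi_s)
        # grad of log pi(a|s) for linear softmax: phi(s) outer (one_hot(a) - probs)
        grad_log = np.outer(phi_s, -probs)
        grad_log[:, a] += phi_s
        theta = theta + alpha * grad_log * G  # unbiased but high-variance step
    return theta
```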


Actor-Critic

Reduce the variance

Use a critic to estimate the action-value function

Qw(s, a) ≈ Qπθ(s, a)

Actor-Critic algorithms maintain two sets of parameters:

Critic: update the action-value function parameters w

Actor: update the policy parameters θ in the direction suggested by the critic

∇θJ(θ) ≈ Eπθ[∇θ logπθ(a|s)Qw(s, a)]

∆θ = α∇θ logπθ(a|s)Qw(s, a)

Can we avoid any bias by choosing the action-value function approximation carefully?


Compatible Function Approximation Theorem

Theorem: If the action-value function approximator Qw(s, a) satisfies the following two conditions:

∇wQw(s, a) = ∇θ log πθ(a|s)

w = arg min_{w′} Eπθ[(Qπθ(s, a) − Qw′(s, a))^2]

then the policy gradient is exact, i.e.

∇θJ(θ) = Eπθ[∇θ log πθ(a|s) Qw(s, a)]

The function approximator is compatible with the policy in the sense that if we use the approximation Qw(s, a) in lieu of the true values to compute the gradient, then the result would be exact


Proof

Pf: Denote ε = Eπθ[(Qπθ(s, a) − Qw(s, a))^2]. Then

∇wε = 0

Eπθ[(Qπθ(s, a) − Qw(s, a)) ∇wQw(s, a)] = 0

Eπθ[(Qπθ(s, a) − Qw(s, a)) ∇θ log πθ(a|s)] = 0

Eπθ[Qπθ(s, a) ∇θ log πθ(a|s)] = Eπθ[Qw(s, a) ∇θ log πθ(a|s)].


Baseline

Reduce the variance

A baseline function B(s)

Eπθ[∇θ log πθ(a|s) B(s)] = ∑_s pπθ(s) ∑_a ∇θπθ(a|s) B(s)
                         = ∑_s pπθ(s) B(s) ∇θ ∑_a πθ(a|s) = 0

A good baseline is B(s) = Vπθ(s)

Advantage function

Aπθ(s, a) = Qπθ(s, a) − Vπθ(s)


Policy gradient

∇θJ(θ) = Eπθ[∇θ log πθ(a|s) Qπθ(s, a)] = Eπθ[∇θ log πθ(a|s) Aπθ(s, a)]

∆θ = α∇θ log πθ(a|s) Aπθ(s, a)

The advantage function can significantly reduce the variance of the policy gradient

Estimate both Vπθ(s) and Qπθ(s, a) to obtain Aπθ(s, a)

e.g. TD learning


Estimation of advantage function

Apply TD learning to estimate value function

The TD error

δπθ = r(s, a, s′) + γVπθ(s′) − Vπθ(s)

is an unbiased estimate of the advantage function:

Eπθ[δπθ | s, a] = Eπθ[r(s, a, s′) + γVπθ(s′) − Vπθ(s) | s, a]
               = Eπθ[r(s, a, s′) + γVπθ(s′) | s, a] − Vπθ(s)
               = Qπθ(s, a) − Vπθ(s) = Aπθ(s, a)

thus the update

∆θ = α∇θ logπθ(a|s)δπθ


Natural Policy Gradient²

Consider

max_{dθ} J(θ + dθ)  s.t. ‖dθ‖Gθ = a

where a is a small constant, and ‖dθ‖²Gθ = (dθ)ᵀGθ(dθ).

The optimal direction is

(dθ)∗ := ∇natJ(θ) = Gθ⁻¹∇θJ(θ)

Take the metric matrix Gθ to be the Fisher information matrix

Gθ = Eπθ[∇θ logπθ(a|s)∇θ logπθ(a|s)T ]

²Sham Kakade. A Natural Policy Gradient. In Advances in Neural Information Processing Systems, pp. 1057-1063. MIT Press, 2002.


Natural Actor-Critic

Use the compatible function approximation Qw(s, a) = ∇θ log πθ(a|s)ᵀw

w = arg min_{w′} Eπθ[(Qπθ(s, a) − Qw′(s, a))^2]

then

∇θJ(θ) = Eπθ[∇θ log πθ(a|s) Qw(s, a)]
       = Eπθ[∇θ log πθ(a|s) ∇θ log πθ(a|s)ᵀ w]
       = Eπθ[∇θ log πθ(a|s) ∇θ log πθ(a|s)ᵀ] w
       = Gθ w

so, ∇natJ(θ) = Gθ⁻¹∇θJ(θ) = w

i.e. the natural policy gradient updates the actor parameters θ in the direction of the critic parameters w


Model-based RL

Learn a model directly from experience

Use planning to construct a value function or policy


Learn a model

For an MDP (S,A,P,R)

A model M = ⟨Pη, Rη⟩, parameterized by η

Pη ≈ P, Rη ≈ R

s′ ∼ Pη(s′|s, a), r(s, a, s′) = Rη(s, a, s′)

Typically assume conditional independence between state transitions and rewards

P(s′, r(s, a, s′)|s, a) = P(s′|s, a)P(r(s, a, s′)|s, a)


Learn a model

Experience s1, a1, r2, s2, a2, r3, ...sT

s1, a1 → r2, s2

s2, a2 → r3, s3

...

sT−1, aT−1 → rT, sT

Learning s, a→ r is a regression problem

Learning s, a→ s′ is a density estimation problem

find parameters η to minimise empirical loss


Example: Table Lookup Model

count visits N(s, a) to each state-action pair

P̂(s′|s, a) = (1/N(s, a)) ∑_{t=1}^T 1(st = s, at = a, st+1 = s′)

r̂_s^a = (1/N(s, a)) ∑_{t=1}^T 1(st = s, at = a) rt
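A minimal sketch of these counting estimates, given a list of observed (s, a, r, s′) transitions (an assumed input format):

```python
from collections import defaultdict

def table_lookup_model(transitions):
    """Estimate P_hat(s'|s, a) and r_hat(s, a) by counting observed transitions."""
    counts = defaultdict(lambda: defaultdict(int))   # counts[(s, a)][s_next]
    reward_sum = defaultdict(float)
    visits = defaultdict(int)
    for s, a, r, s_next in transitions:              # (s, a, r, s') tuples
        counts[(s, a)][s_next] += 1
        reward_sum[(s, a)] += r
        visits[(s, a)] += 1
    P_hat = {sa: {s2: c / visits[sa] for s2, c in nxt.items()}
             for sa, nxt in counts.items()}
    r_hat = {sa: reward_sum[sa] / visits[sa] for sa in visits}
    return P_hat, r_hat
```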


Planning with a model

After estimating a model, we can plan with the algorithms introduced before

Dynamic Programming

Policy iteration

Value iteration

...

Model-free RL

Monte-Carlo

Sarsa

Q-learning

...

The performance of model-based RL is limited by the optimal policy of the approximate MDP (S, A, Pη, Rη)

when the estimated model is imperfect, model-free RL is more efficient


Dyna Architecture

Learn a model from real experience (true MDP)

Learn and plan the value function (and/or policy) from real and simulated experience
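A compact Dyna-Q style sketch combining the two points above: Q-learning on real transitions, a simple last-seen table-lookup model, and extra planning backups on transitions replayed from that model. The environment interface and hyperparameters are assumptions, and the deterministic model is a simplification:

```python
import random
import numpy as np
from collections import defaultdict

def dyna_q(env, n_actions, gamma=0.95, alpha=0.1, epsilon=0.1,
           n_episodes=200, planning_steps=10):
    """Q-learning from real experience plus planning on a learned lookup model."""
    Q = defaultdict(lambda: np.zeros(n_actions))
    model = {}                                   # (s, a) -> last observed (r, s_next, done)
    for _ in range(n_episodes):
        s = env.reset()
        done = False
        while not done:
            a = (np.random.randint(n_actions) if np.random.random() < epsilon
                 else int(np.argmax(Q[s])))
            s_next, r, done = env.step(a)        # real experience (assumed interface)
            Q[s][a] += alpha * (r + (0.0 if done else gamma * np.max(Q[s_next])) - Q[s][a])
            model[(s, a)] = (r, s_next, done)
            # Planning: extra backups on simulated transitions drawn from the model.
            for _ in range(planning_steps):
                (ps, pa), (pr, pnext, pdone) = random.choice(list(model.items()))
                Q[ps][pa] += alpha * (pr + (0.0 if pdone else gamma * np.max(Q[pnext])) - Q[ps][pa])
            s = s_next
    return Q
```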