
Reinforcement Learning

Markov decision process & Dynamic programming

value function, Bellman equation, optimality, Markov property, Markov decision process, dynamic programming, value iteration, policy iteration.

Vien Ngo
MLR, University of Stuttgart


Outline

• Reinforcement learning problem
– Elements of reinforcement learning
– Markov process
– Markov reward process
– Markov decision process

• Dynamic programming
– Value iteration
– Policy iteration


Reinforcement Learning Problem
Elements of a Reinforcement Learning Problem

• Agent vs. Environment.

• State, Action, Reward, Goal, Return.

• The Markov property.

• Markov decision process.

• Bellman equations.

• Optimality and Approximation.


Agent vs. Environment

• The learner and decision-maker is called the agent.

• The thing it interacts with, comprising everything outside the agent, is called the environment.

• The environment is formally formulated as a Markov decision process, which is a mathematically principled framework for sequential decision problems.

(from Introduction to RL book, Sutton & Barto)


The Markov property
A state that summarizes past sensations compactly yet in such a way that all relevant information is retained. This normally requires more than the immediate sensations, but never more than the complete history of all past sensations. A state that succeeds in retaining all relevant information is said to be Markov, or to have the Markov property.

(Introduction to RL book, Sutton & Barto)

• Formally,

$$\Pr(s_{t+1}, r_{t+1} \mid s_t, a_t, r_t, \dots, s_0, a_0, r_0) = \Pr(s_{t+1}, r_{t+1} \mid s_t, a_t, r_t)$$

• Examples: the current configuration of the chess board (for predicting the next moves); in the cart-pole domain, the position and velocity of the cart together with the angle of the pole and its rate of change.


Markov Process

• A Markov Process (Markov Chain) is defined as a 2-tuple $(\mathcal{S}, \mathcal{P})$.
– $\mathcal{S}$ is a state space.
– $\mathcal{P}$ is a state transition probability matrix: $\mathcal{P}_{ss'} = P(s_{t+1} = s' \mid s_t = s)$


Markov Process: Example
Recycling Robot's Markov Chain

[Figure: two-state Markov chain over battery levels (Battery: high, Battery: low) with transitions labelled wait, search, recharge, and stop; transition probabilities range from 0.1 to 1.0.]


Markov Reward Process

• A Markov Reward Process is defined as a 4-tuple $(\mathcal{S}, \mathcal{P}, \mathcal{R}, \gamma)$.
– $\mathcal{S}$ is a state space of $n$ states.
– $\mathcal{P}$ is a state transition probability matrix: $\mathcal{P}_{ss'} = P(s_{t+1} = s' \mid s_t = s)$
– $\mathcal{R}$ is a reward matrix with entries $\mathcal{R}_s$.
– $\gamma$ is a discount factor, $\gamma \in [0, 1]$.

• The total return is
$$\rho_t = R_t + \gamma R_{t+1} + \gamma^2 R_{t+2} + \dots$$


Markov Reward Process: Example

[Figure: the recycling robot's Markov chain annotated as a Markov reward process; each transition between Battery: high and Battery: low (wait, search, recharge, stop) is labelled with a probability; reward pair, e.g. 0.9; 0.0, 0.1; −10.0, 0.5; −1.0.]


Markov Reward Process: Bellman Equations

• The value function $V(s)$:
$$V(s) = \mathbb{E}[\rho_t \mid s_t = s] = \mathbb{E}[R_t + \gamma V(s_{t+1}) \mid s_t = s]$$

• In matrix form $V = R + \gamma P V$, hence $V = (I - \gamma P)^{-1} R$.
We will revisit this for MDPs.
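As a concrete illustration of the closed form above, here is a minimal NumPy sketch (not from the slides; the two-state chain and its numbers are made up) that solves $V = R + \gamma P V$ directly:

```python
import numpy as np

# Hypothetical 2-state Markov reward process (illustrative numbers only).
P = np.array([[0.9, 0.1],     # state transition matrix P_ss'
              [0.5, 0.5]])
R = np.array([1.0, -1.0])     # per-state rewards R_s
gamma = 0.9

# Solve (I - gamma * P) V = R instead of forming the inverse explicitly.
V = np.linalg.solve(np.eye(len(R)) - gamma * P, R)
print(V)
```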


Markov Reward Process: Why a Discount Factor?
It has several interpretations:

• It weighs the importance of differently timed rewards, giving more weight to rewards received sooner.

• It represents uncertainty about whether future rewards will be received at all, i.e. a geometric distribution over when the process terminates.

• It models human/animal preferences over the ordering of received rewards.


Markov decision process


Markov decision process

• A reinforcement learning problem that satisfies the Markov property is called a Markov decision process, or MDP.

• MDP $= \{\mathcal{S}, \mathcal{A}, \mathcal{T}, \mathcal{R}, P_0, \gamma\}$.
– $\mathcal{S}$: consists of all possible states.
– $\mathcal{A}$: consists of all possible actions.
– $\mathcal{T}$: a transition function which defines the probability $\mathcal{T}(s', s, a) = \Pr(s' \mid s, a)$.
– $\mathcal{R}$: a reward function which defines the reward $R(s, a)$.
– $P_0$: the probability distribution over initial states.
– $\gamma \in [0, 1]$: a discount factor.
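For later examples it helps to fix a concrete data structure. The sketch below is not part of the lecture; the states, actions and numbers are hypothetical, loosely inspired by the recycling-robot example. It stores a small tabular MDP in plain Python dictionaries, and the dynamic-programming sketches further down assume this representation.

```python
# Hypothetical tabular MDP {S, A, T, R, P0, gamma} as plain Python dictionaries.
S = ["high", "low"]                      # battery levels
A = ["search", "wait", "recharge"]

# T[(s, a)] = list of (probability, next_state); not every action is defined in every state.
T = {
    ("high", "search"):   [(0.9, "high"), (0.1, "low")],
    ("high", "wait"):     [(1.0, "high")],
    ("low",  "search"):   [(0.5, "low"), (0.5, "high")],
    ("low",  "wait"):     [(1.0, "low")],
    ("low",  "recharge"): [(1.0, "high")],
}

# R[(s, a)] = expected immediate reward R(s, a).
R = {
    ("high", "search"): 1.0, ("high", "wait"): 0.5,
    ("low", "search"): -1.0, ("low", "wait"): 0.5, ("low", "recharge"): 0.0,
}

P0 = {"high": 1.0, "low": 0.0}           # distribution over initial states
gamma = 0.9                              # discount factor
```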


Example: Recycling Robot MDP


[Figure: the agent-environment interaction unrolled in time as a trajectory $s_0, a_0, r_0, s_1, a_1, r_1, s_2, a_2, r_2, \dots$]

• A policy is a mapping from state space to action space:
$$\mu : \mathcal{S} \mapsto \mathcal{A}$$

• Objective function:
– Expected average reward:
$$\eta = \lim_{T \to \infty} \frac{1}{T}\, \mathbb{E}\Big[ \sum_{t=0}^{T-1} r(s_t, a_t, s_{t+1}) \Big]$$
– Expected discounted reward:
$$\eta_\gamma = \mathbb{E}\Big[ \sum_{t=0}^{\infty} \gamma^t r(s_t, a_t, s_{t+1}) \Big]$$

• Singh et al. 1994:
$$\eta_\gamma = \frac{1}{1 - \gamma}\, \eta$$
For example, a constant reward of $r$ per step gives $\eta = r$ and $\eta_\gamma = r/(1-\gamma)$.


Dynamic Programming


Dynamic Programming

• State Value Functions

• Bellman Equations

• Value Iteration

• Policy Iteration


State value function

• The value (expected discounted return) of policy $\pi$ when started in state $s$:
$$V^\pi(s) = \mathbb{E}_\pi\{r_0 + \gamma r_1 + \gamma^2 r_2 + \dots \mid s_0 = s\} \qquad (1)$$
with discount factor $\gamma \in [0, 1]$.

• Definition of optimality: a behavior $\pi^*$ is optimal iff
$$\forall s : \; V^{\pi^*}(s) = V^*(s) \quad \text{where} \quad V^*(s) = \max_\pi V^\pi(s)$$
(simultaneously maximizing the value in all states).

(In MDPs there always exists at least one optimal deterministic policy.)


Bellman optimality equation

$$\begin{aligned}
V^\pi(s) &= \mathbb{E}\{r_0 + \gamma r_1 + \gamma^2 r_2 + \dots \mid s_0 = s; \pi\} \\
&= \mathbb{E}\{r_0 \mid s_0 = s; \pi\} + \gamma\, \mathbb{E}\{r_1 + \gamma r_2 + \dots \mid s_0 = s; \pi\} \\
&= R(\pi(s), s) + \gamma \sum_{s'} P(s' \mid \pi(s), s)\, \mathbb{E}\{r_1 + \gamma r_2 + \dots \mid s_1 = s'; \pi\} \\
&= R(\pi(s), s) + \gamma \sum_{s'} P(s' \mid \pi(s), s)\, V^\pi(s')
\end{aligned}$$

• We can write this in vector notation as $V^\pi = R^\pi + \gamma P^\pi V^\pi$,
with vectors $V^\pi_s = V^\pi(s)$, $R^\pi_s = R(\pi(s), s)$ and matrix $P^\pi_{s's} = P(s' \mid \pi(s), s)$.

• For a stochastic policy $\pi(a \mid s)$:
$$V^\pi(s) = \sum_a \pi(a \mid s)\, R(a, s) + \gamma \sum_{s', a} \pi(a \mid s)\, P(s' \mid a, s)\, V^\pi(s')$$

• Bellman optimality equation:
$$V^*(s) = \max_a \Big[ R(a, s) + \gamma \sum_{s'} P(s' \mid a, s)\, V^*(s') \Big]$$
$$\pi^*(s) = \operatorname*{argmax}_a \Big[ R(a, s) + \gamma \sum_{s'} P(s' \mid a, s)\, V^*(s') \Big]$$
(Sketch of proof: if $\pi$ selected an action other than $\operatorname{argmax}_a[\cdot]$, then the policy $\pi'$ that equals $\pi$ everywhere except $\pi'(s) = \operatorname{argmax}_a[\cdot]$ would be better.)

• This is the principle of optimality in the stochastic case (related to Viterbi and the max-product algorithm).


Richard E. Bellman (1920–1984)
Bellman's principle of optimality

[Figure: diagram of a trajectory through points A and B illustrating the principle of optimality: A opt ⇒ B opt.]

$$V^*(s) = \max_a \Big[ R(a, s) + \gamma \sum_{s'} P(s' \mid a, s)\, V^*(s') \Big]$$
$$\pi^*(s) = \operatorname*{argmax}_a \Big[ R(a, s) + \gamma \sum_{s'} P(s' \mid a, s)\, V^*(s') \Big]$$


Value Iteration

• Given the Bellman equation
$$V^*(s) = \max_a \Big[ R(a, s) + \gamma \sum_{s'} P(s' \mid a, s)\, V^*(s') \Big]$$
→ iterate
$$\forall s : \; V_{k+1}(s) = \max_a \Big[ R(a, s) + \gamma \sum_{s'} P(s' \mid a, s)\, V_k(s') \Big]$$
with stopping criterion
$$\max_s |V_{k+1}(s) - V_k(s)| \le \epsilon$$

• Value Iteration converges to the optimal value function $V^*$ (proof below).
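A minimal Python sketch of this loop, assuming the dictionary MDP representation (S, A, T, R, gamma) introduced earlier; the function name and tolerance are illustrative, not part of the lecture.

```python
def value_iteration(S, A, T, R, gamma, eps=1e-6):
    """Tabular value iteration; T[(s, a)] = [(prob, next_state), ...], R[(s, a)] = reward."""
    V = {s: 0.0 for s in S}
    while True:
        V_new = {
            s: max(R[(s, a)] + gamma * sum(p * V[s2] for p, s2 in T[(s, a)])
                   for a in A if (s, a) in T)             # only actions available in s
            for s in S
        }
        if max(abs(V_new[s] - V[s]) for s in S) <= eps:   # stopping criterion
            return V_new
        V = V_new
```

For example, `value_iteration(S, A, T, R, gamma)` on the toy MDP sketched above returns a value estimate for each state.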


2x2 Maze

[Figure: a 2x2 grid world with cell rewards 0.0 and 1.0 and stochastic moves (80% intended direction, 10%/10% otherwise); to be solved manually.]


State-action value function (Q-function)

• The state-action value function (or Q-function) is the expected discounted return when starting in state $s$ and taking first action $a$:
$$Q^\pi(a, s) = \mathbb{E}_\pi\{r_0 + \gamma r_1 + \gamma^2 r_2 + \dots \mid s_0 = s, a_0 = a\} = R(a, s) + \gamma \sum_{s'} P(s' \mid a, s)\, Q^\pi(\pi(s'), s')$$
(Note: $V^\pi(s) = Q^\pi(\pi(s), s)$.)

• Bellman optimality equation for the Q-function:
$$Q^*(a, s) = R(a, s) + \gamma \sum_{s'} P(s' \mid a, s) \max_{a'} Q^*(a', s')$$
$$\pi^*(s) = \operatorname*{argmax}_a Q^*(a, s)$$


Q-Iteration

• Given the Bellman equation
$$Q^*(a, s) = R(a, s) + \gamma \sum_{s'} P(s' \mid a, s) \max_{a'} Q^*(a', s')$$
→ iterate
$$\forall a, s : \; Q_{k+1}(a, s) = R(a, s) + \gamma \sum_{s'} P(s' \mid a, s) \max_{a'} Q_k(a', s')$$
with stopping criterion
$$\max_{a, s} |Q_{k+1}(a, s) - Q_k(a, s)| \le \epsilon$$

• Q-Iteration converges to the optimal state-action value function $Q^*$.
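The same fixed-point iteration on a Q-table, again a sketch under the hypothetical dictionary representation; it also extracts the greedy policy $\pi^*(s) = \operatorname{argmax}_a Q^*(a, s)$.

```python
def q_iteration(S, A, T, R, gamma, eps=1e-6):
    """Tabular Q-iteration; returns (Q*, greedy policy)."""
    Q = {sa: 0.0 for sa in T}                       # one entry per defined (s, a) pair
    while True:
        Q_new = {
            (s, a): R[(s, a)] + gamma * sum(
                p * max(Q[(s2, a2)] for a2 in A if (s2, a2) in T)
                for p, s2 in T[(s, a)])
            for (s, a) in T
        }
        if max(abs(Q_new[sa] - Q[sa]) for sa in T) <= eps:
            pi = {s: max((a for a in A if (s, a) in T), key=lambda a, s=s: Q_new[(s, a)])
                  for s in S}
            return Q_new, pi
        Q = Q_new
```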


Proof of convergence

• Let $\Delta_k = \|Q^* - Q_k\|_\infty = \max_{a, s} |Q^*(a, s) - Q_k(a, s)|$. Then
$$\begin{aligned}
Q_{k+1}(a, s) &= R(a, s) + \gamma \sum_{s'} P(s' \mid a, s) \max_{a'} Q_k(a', s') \\
&\le R(a, s) + \gamma \sum_{s'} P(s' \mid a, s) \max_{a'} \big[ Q^*(a', s') + \Delta_k \big] \\
&= \Big[ R(a, s) + \gamma \sum_{s'} P(s' \mid a, s) \max_{a'} Q^*(a', s') \Big] + \gamma \Delta_k \\
&= Q^*(a, s) + \gamma \Delta_k
\end{aligned}$$
Similarly, $Q_k \ge Q^* - \Delta_k \;\Rightarrow\; Q_{k+1} \ge Q^* - \gamma \Delta_k$.


Convergence

• Contraction property: $\|U_{k+1} - V_{k+1}\| \le \gamma \|U_k - V_k\|$, which guarantees convergence to the same fixed point from different initial values $U_0, V_0$ of two approximations:
$$\|U_{k+1} - V_{k+1}\| \le \gamma \|U_k - V_k\| \le \dots \le \gamma^{k+1} \|U_0 - V_0\|$$

• Stopping condition: $\|V_{k+1} - V_k\| \le \epsilon \;\Rightarrow\; \|V_{k+1} - V^*\| \le \epsilon\gamma/(1 - \gamma)$

Proof:
$$\frac{\|V_{k+1} - V^*\|}{\gamma} \le \|V_k - V^*\| \le \|V_{k+1} - V_k\| + \|V_{k+1} - V^*\|$$
$$\frac{\|V_{k+1} - V^*\|}{\gamma} \le \epsilon + \|V_{k+1} - V^*\|$$
Rearranging gives $\|V_{k+1} - V^*\| \le \epsilon\gamma/(1 - \gamma)$.


Policy Evaluation
Value Iteration and Q-Iteration compute $V^*$ and $Q^*$ directly.

If we want to evaluate a given policy $\pi$, we want to compute $V^\pi$ or $Q^\pi$:

• Iterate using $\pi$ instead of $\max_a$:
$$\forall s : \; V_{k+1}(s) = R(\pi(s), s) + \gamma \sum_{s'} P(s' \mid \pi(s), s)\, V_k(s')$$
$$\forall a, s : \; Q_{k+1}(a, s) = R(a, s) + \gamma \sum_{s'} P(s' \mid a, s)\, Q_k(\pi(s'), s')$$

• Or invert the matrix equation:
$$V^\pi = R^\pi + \gamma P^\pi V^\pi$$
$$V^\pi - \gamma P^\pi V^\pi = R^\pi$$
$$(I - \gamma P^\pi)\, V^\pi = R^\pi$$
$$V^\pi = (I - \gamma P^\pi)^{-1} R^\pi$$
This requires inversion of an $n \times n$ matrix for $|\mathcal{S}| = n$, which is $O(n^3)$.
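A minimal sketch of the iterative variant for a deterministic policy stored as a dict from state to action, again assuming the hypothetical dictionary MDP from earlier. The matrix-inversion variant corresponds to the NumPy sketch shown for the MRP, with $P^\pi$ and $R^\pi$ built from $\pi$.

```python
def policy_evaluation(S, T, R, gamma, pi, eps=1e-6):
    """Iterative evaluation of a deterministic policy pi: state -> action."""
    V = {s: 0.0 for s in S}
    while True:
        V_new = {
            s: R[(s, pi[s])] + gamma * sum(p * V[s2] for p, s2 in T[(s, pi[s])])
            for s in S
        }
        if max(abs(V_new[s] - V[s]) for s in S) <= eps:
            return V_new
        V = V_new
```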


Policy Iteration

• How does computing $V^\pi$ or $Q^\pi$ help us find the optimal policy?

• Policy Iteration:
1. Initialise $\pi_0$ somehow (e.g. randomly).
2. Iterate:
– Policy evaluation: compute $V^{\pi_k}$ or $Q^{\pi_k}$.
– Policy improvement: $\pi_{k+1}(s) \leftarrow \operatorname{argmax}_a Q^{\pi_k}(a, s)$.

demo: 2x2 maze
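A sketch of the full loop, reusing the policy_evaluation function above (all names hypothetical); it stops when the greedy policy no longer changes.

```python
def policy_iteration(S, A, T, R, gamma):
    """Policy iteration: alternate policy evaluation and greedy policy improvement."""
    pi = {s: next(a for a in A if (s, a) in T) for s in S}   # arbitrary initial policy
    while True:
        V = policy_evaluation(S, T, R, gamma, pi)            # policy evaluation step
        pi_new = {                                           # policy improvement step
            s: max((a for a in A if (s, a) in T),
                   key=lambda a, s=s: R[(s, a)] + gamma * sum(p * V[s2]
                                                              for p, s2 in T[(s, a)]))
            for s in S
        }
        if pi_new == pi:                                     # policy stable => optimal
            return pi, V
        pi = pi_new
```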


Convergence proof
The proof rests on the following facts:

• After policy improvement, $V^{\pi_k} \le V^{\pi_{k+1}}$ (a sketch proof is given in Rich Sutton's book).

• The policy space is finite: $|\mathcal{A}|^{|\mathcal{S}|}$.

• The Bellman operator has a unique fixed point (due to the strict contraction property for $0 < \gamma < 1$ on a Banach space). The same property is used to prove convergence of the VI algorithm to its fixed point.


VI vs. PI

• VI is PI with a single step of policy evaluation.

• PI converges surprisingly rapidly; however, each iteration is computationally expensive because the policy evaluation step waits for convergence of $V^\pi$.

• PI is preferred if the action set is large.


Asynchronous Dynamic Programming

• The value function table is updated asynchronously.
• Computation is significantly reduced.
• If all states continue to be updated infinitely often, convergence is still guaranteed.

• Three simple algorithms:

• Gauss-Seidel Value Iteration

• Real-time dynamic programming

• Prioritised sweeping


Gauss-Seidel Value Iteration

• The standard VI algorithm updates all states in the next iteration using the old values from the previous iteration (an iteration finishes when all states have been updated).

Algorithm 1: Standard Value Iteration
1: while (!converged) do
2:   $V_{\text{old}} = V$
3:   for (each $s \in \mathcal{S}$) do
4:     $V(s) = \max_a \{ R(s, a) + \gamma \sum_{s'} P(s' \mid s, a)\, V_{\text{old}}(s') \}$

• Gauss-Seidel VI updates each state using the values from the most recent computation.

Algorithm 2: Gauss-Seidel Value Iteration
1: while (!converged) do
2:   for (each $s \in \mathcal{S}$) do
3:     $V(s) = \max_a \{ R(s, a) + \gamma \sum_{s'} P(s' \mid s, a)\, V(s') \}$
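A minimal sketch of one Gauss-Seidel sweep under the same hypothetical dictionary MDP: the only difference from the standard backup is that V is overwritten in place, so states later in the sweep already see the new values.

```python
def gauss_seidel_sweep(S, A, T, R, gamma, V):
    """One in-place sweep; returns the largest change, usable as a stopping test."""
    delta = 0.0
    for s in S:                                   # fixed sweep order over the states
        v_new = max(R[(s, a)] + gamma * sum(p * V[s2] for p, s2 in T[(s, a)])
                    for a in A if (s, a) in T)
        delta = max(delta, abs(v_new - V[s]))
        V[s] = v_new                              # overwrite immediately (in place)
    return delta
```

Repeating `gauss_seidel_sweep` until the returned change drops below $\epsilon$ gives the Gauss-Seidel VI of Algorithm 2.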


Prioritised Sweeping

• Similar to Gauss-Seidel VI, but the order in which states are updated is determined by the magnitude of their updates (Bellman errors).

• Define the Bellman error as $E(s; V_t) = |V_{t+1}(s) - V_t(s)|$, i.e. the change of $s$'s value after the most recent update.

Algorithm 3: Prioritised Sweeping VI
1: Initialize $V_0(s)$ and priority values $H_0(s)$, $\forall s \in \mathcal{S}$.
2: for $k = 0, 1, 2, 3, \dots$ do
3:   pick the state with the highest priority to update: $s_k \in \operatorname{argmax}_{s \in \mathcal{S}} H_k(s)$
4:   value update: $V_{k+1}(s_k) = \max_{a \in \mathcal{A}} \big[ R(s_k, a) + \gamma \sum_{s'} P(s' \mid s_k, a)\, V_k(s') \big]$
5:   for $s \neq s_k$: $V_{k+1}(s) = V_k(s)$
6:   update priority values: $\forall s \in \mathcal{S}$, $H_{k+1}(s) \leftarrow E(s; V_{k+1})$ (note: the error is w.r.t. the future update)
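A deliberately naive sketch of Algorithm 3 under the hypothetical dictionary MDP: the state with the largest Bellman error is backed up, and the priorities are then recomputed against the new value function. Practical implementations instead keep a priority queue and only touch the priorities of predecessor states rather than recomputing all of them.

```python
def prioritised_sweeping(S, A, T, R, gamma, n_updates=200):
    """Naive prioritised sweeping: always back up the state with the largest Bellman error."""
    def backup(s, V):
        return max(R[(s, a)] + gamma * sum(p * V[s2] for p, s2 in T[(s, a)])
                   for a in A if (s, a) in T)

    V = {s: 0.0 for s in S}
    H = {s: abs(backup(s, V) - V[s]) for s in S}       # initial priorities (Bellman errors)
    for _ in range(n_updates):
        s_k = max(S, key=lambda s: H[s])               # highest-priority state
        V[s_k] = backup(s_k, V)                        # update only that state
        H = {s: abs(backup(s, V) - V[s]) for s in S}   # priorities w.r.t. the new V
    return V
```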


Real-Time Dynamic Programming

• Similar to Gauss-Seidel VI, but the sequence of states updated in each iteration is generated by simulating transitions.

Algorithm 4: Real-Time Value Iteration
1: start at an arbitrary $s_0$ and initialize $V_0(s)$, $\forall s \in \mathcal{S}$.
2: for $k = 0, 1, 2, 3, \dots$ do
3:   action selection: $a_k \in \operatorname{argmax}_{a \in \mathcal{A}} \big\{ R(s_k, a) + \gamma \sum_{s'} P(s' \mid s_k, a)\, V_k(s') \big\}$
4:   value update: $V_{k+1}(s_k) = R(s_k, a_k) + \gamma \sum_{s'} P(s' \mid s_k, a_k)\, V_k(s')$
5:   for $s \neq s_k$: $V_{k+1}(s) = V_k(s)$
6:   simulate the next state: $s_{k+1} \sim P(s' \mid s_k, a_k)$
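A minimal sketch of Algorithm 4 under the same hypothetical dictionaries: the greedy action is chosen from the current value estimates, only the visited state is backed up, and the next state is sampled from the model.

```python
import random

def rtdp(S, A, T, R, gamma, s0, n_steps=1000):
    """Real-time dynamic programming along a simulated trajectory starting in s0."""
    V = {s: 0.0 for s in S}
    s = s0
    for _ in range(n_steps):
        def q(a):                                           # one-step lookahead value
            return R[(s, a)] + gamma * sum(p * V[s2] for p, s2 in T[(s, a)])
        a = max((a for a in A if (s, a) in T), key=q)       # greedy action selection
        V[s] = q(a)                                         # back up only the visited state
        probs, succs = zip(*T[(s, a)])
        s = random.choices(succs, weights=probs)[0]         # simulate the next state
    return V
```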


• So far, we have introduced the basic notions of an MDP and value functions, and methods to compute optimal policies assuming that we know the world (i.e. we know $P(s' \mid a, s)$ and $R(a, s)$):

– Value Iteration / Q-Iteration → $V^*$, $Q^*$, $\pi^*$

– Policy Evaluation → $V^\pi$, $Q^\pi$

– Policy Improvement: $\pi(s) \leftarrow \operatorname{argmax}_a Q^{\pi_k}(a, s)$

– Policy Iteration (iterate Policy Evaluation and Policy Improvement)

• Reinforcement Learning?
