
Reinforcement Learning

Markov decision process & Dynamic programming

value function, Bellman equation, optimality, Markov property, Markov decision process, dynamic programming, value iteration, policy iteration.

Vien Ngo
MLR, University of Stuttgart


Outline

• Reinforcement learning problem
– Elements of reinforcement learning
– Markov process
– Markov reward process
– Markov decision process

• Dynamic programming
– Value iteration
– Policy iteration


Reinforcement Learning Problem
Elements of a Reinforcement Learning Problem

• Agent vs. Environment.

• State, Action, Reward, Goal, Return.

• The Markov property.

• Markov decision process.

• Bellman equations.

• Optimality and Approximation.


Agent vs. Environment

• The learner and decision-maker is called the agent.

• The thing it interacts with, comprising everything outside the agent, is called the environment.

• The environment is formally formulated as a Markov decision process, which is a mathematically principled framework for sequential decision problems.

(from Introduction to RL book, Sutton & Barto)


The Markov property
A state that summarizes past sensations compactly yet in such a way that all relevant information is retained. This normally requires more than the immediate sensations, but never more than the complete history of all past sensations. A state that succeeds in retaining all relevant information is said to be Markov, or to have the Markov property.

(Introduction to RL book, Sutton & Barto)

• Formally,

$$\Pr(s_{t+1}, r_{t+1} \mid s_t, a_t, r_t, \dots, s_0, a_0, r_0) = \Pr(s_{t+1}, r_{t+1} \mid s_t, a_t, r_t)$$

• Examples: the current configuration of the chess board (for predicting the next moves); in the cart-pole domain, the position and velocity of the cart together with the angle of the pole and its rate of change.


Markov Process

• A Markov Process (Markov Chain) is defined as a 2-tuple $(\mathcal{S}, \mathcal{P})$.
– $\mathcal{S}$ is a state space.
– $\mathcal{P}$ is a state transition probability matrix: $\mathcal{P}_{ss'} = P(s_{t+1} = s' \mid s_t = s)$


Markov Process: Example
Recycling Robot's Markov Chain

[Figure: two-state Markov chain over battery levels (Battery: high, Battery: low) with transitions labelled wait, search, recharge, and stop; transition probabilities range from 0.1 to 1.0.]


Markov Reward Process

• A Markov Reward Process is defined as a 4-tuple $(\mathcal{S}, \mathcal{P}, \mathcal{R}, \gamma)$.
– $\mathcal{S}$ is a state space of $n$ states.
– $\mathcal{P}$ is a state transition probability matrix: $\mathcal{P}_{ss'} = P(s_{t+1} = s' \mid s_t = s)$
– $\mathcal{R}$ is a reward matrix with entries $\mathcal{R}_s$.
– $\gamma$ is a discount factor, $\gamma \in [0, 1]$.

• The total return is
$$\rho_t = R_t + \gamma R_{t+1} + \gamma^2 R_{t+2} + \dots$$


Markov Reward Process: Example

[Figure: the recycling robot's Markov chain annotated as a Markov reward process; each transition between Battery: high and Battery: low (wait, search, recharge, stop) is labelled with a probability; reward pair, e.g. 0.9; 0.0, 0.1; −10.0, 0.5; −1.0.]


Markov Reward Process: Bellman Equations

• The value function $V(s)$:
$$V(s) = \mathbb{E}[\rho_t \mid s_t = s] = \mathbb{E}[R_t + \gamma V(s_{t+1}) \mid s_t = s]$$

• In matrix form $V = R + \gamma P V$, hence $V = (I - \gamma P)^{-1} R$.
We will revisit this for MDPs.
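As a concrete illustration of the closed form above, here is a minimal NumPy sketch (not from the slides; the two-state chain and its numbers are made up) that solves $V = R + \gamma P V$ directly:

```python
import numpy as np

# Hypothetical 2-state Markov reward process (illustrative numbers only).
P = np.array([[0.9, 0.1],     # state transition matrix P_ss'
              [0.5, 0.5]])
R = np.array([1.0, -1.0])     # per-state rewards R_s
gamma = 0.9

# Solve (I - gamma * P) V = R instead of forming the inverse explicitly.
V = np.linalg.solve(np.eye(len(R)) - gamma * P, R)
print(V)
```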


Markov Reward Process: Why a Discount Factor?
It has several interpretations:

• It weighs the importance of differently timed rewards, giving more weight to rewards received sooner.

• It represents uncertainty about whether future rewards will be received at all, i.e. a geometric distribution over when the process terminates.

• It models human/animal preferences over the ordering of received rewards.


Markov decision process


Markov decision process

• A reinforcement learning problem that satisfies the Markov property is called a Markov decision process, or MDP.

• MDP $= \{\mathcal{S}, \mathcal{A}, \mathcal{T}, \mathcal{R}, P_0, \gamma\}$.
– $\mathcal{S}$: consists of all possible states.
– $\mathcal{A}$: consists of all possible actions.
– $\mathcal{T}$: a transition function which defines the probability $\mathcal{T}(s', s, a) = \Pr(s' \mid s, a)$.
– $\mathcal{R}$: a reward function which defines the reward $R(s, a)$.
– $P_0$: the probability distribution over initial states.
– $\gamma \in [0, 1]$: a discount factor.
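For later examples it helps to fix a concrete data structure. The sketch below is not part of the lecture; the states, actions and numbers are hypothetical, loosely inspired by the recycling-robot example. It stores a small tabular MDP in plain Python dictionaries, and the dynamic-programming sketches further down assume this representation.

```python
# Hypothetical tabular MDP {S, A, T, R, P0, gamma} as plain Python dictionaries.
S = ["high", "low"]                      # battery levels
A = ["search", "wait", "recharge"]

# T[(s, a)] = list of (probability, next_state); not every action is defined in every state.
T = {
    ("high", "search"):   [(0.9, "high"), (0.1, "low")],
    ("high", "wait"):     [(1.0, "high")],
    ("low",  "search"):   [(0.5, "low"), (0.5, "high")],
    ("low",  "wait"):     [(1.0, "low")],
    ("low",  "recharge"): [(1.0, "high")],
}

# R[(s, a)] = expected immediate reward R(s, a).
R = {
    ("high", "search"): 1.0, ("high", "wait"): 0.5,
    ("low", "search"): -1.0, ("low", "wait"): 0.5, ("low", "recharge"): 0.0,
}

P0 = {"high": 1.0, "low": 0.0}           # distribution over initial states
gamma = 0.9                              # discount factor
```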


Example: Recycling Robot MDP


[Figure: the agent-environment interaction unrolled in time as a trajectory $s_0, a_0, r_0, s_1, a_1, r_1, s_2, a_2, r_2, \dots$]

• A policy is a mapping from state space to action space:
$$\mu : \mathcal{S} \mapsto \mathcal{A}$$

• Objective function:
– Expected average reward:
$$\eta = \lim_{T \to \infty} \frac{1}{T}\, \mathbb{E}\Big[ \sum_{t=0}^{T-1} r(s_t, a_t, s_{t+1}) \Big]$$
– Expected discounted reward:
$$\eta_\gamma = \mathbb{E}\Big[ \sum_{t=0}^{\infty} \gamma^t r(s_t, a_t, s_{t+1}) \Big]$$

• Singh et al. 1994:
$$\eta_\gamma = \frac{1}{1 - \gamma}\, \eta$$
For example, a constant reward of $r$ per step gives $\eta = r$ and $\eta_\gamma = r/(1-\gamma)$.


Dynamic Programming


Dynamic Programming

• State Value Functions

• Bellman Equations

• Value Iteration

• Policy Iteration


State value function

• The value (expected discounted return) of policy $\pi$ when started in state $s$:
$$V^\pi(s) = \mathbb{E}_\pi\{r_0 + \gamma r_1 + \gamma^2 r_2 + \dots \mid s_0 = s\} \qquad (1)$$
with discount factor $\gamma \in [0, 1]$.

• Definition of optimality: a behavior $\pi^*$ is optimal iff
$$\forall s : \; V^{\pi^*}(s) = V^*(s) \quad \text{where} \quad V^*(s) = \max_\pi V^\pi(s)$$
(simultaneously maximizing the value in all states).

(In MDPs there always exists at least one optimal deterministic policy.)


Bellman optimality equation

$$\begin{aligned}
V^\pi(s) &= \mathbb{E}\{r_0 + \gamma r_1 + \gamma^2 r_2 + \dots \mid s_0 = s; \pi\} \\
&= \mathbb{E}\{r_0 \mid s_0 = s; \pi\} + \gamma\, \mathbb{E}\{r_1 + \gamma r_2 + \dots \mid s_0 = s; \pi\} \\
&= R(\pi(s), s) + \gamma \sum_{s'} P(s' \mid \pi(s), s)\, \mathbb{E}\{r_1 + \gamma r_2 + \dots \mid s_1 = s'; \pi\} \\
&= R(\pi(s), s) + \gamma \sum_{s'} P(s' \mid \pi(s), s)\, V^\pi(s')
\end{aligned}$$

• We can write this in vector notation as $V^\pi = R^\pi + \gamma P^\pi V^\pi$,
with vectors $V^\pi_s = V^\pi(s)$, $R^\pi_s = R(\pi(s), s)$ and matrix $P^\pi_{s's} = P(s' \mid \pi(s), s)$.

• For a stochastic policy $\pi(a \mid s)$:
$$V^\pi(s) = \sum_a \pi(a \mid s)\, R(a, s) + \gamma \sum_{s', a} \pi(a \mid s)\, P(s' \mid a, s)\, V^\pi(s')$$

• Bellman optimality equation:
$$V^*(s) = \max_a \Big[ R(a, s) + \gamma \sum_{s'} P(s' \mid a, s)\, V^*(s') \Big]$$
$$\pi^*(s) = \operatorname*{argmax}_a \Big[ R(a, s) + \gamma \sum_{s'} P(s' \mid a, s)\, V^*(s') \Big]$$
(Sketch of proof: if $\pi$ selected an action other than $\operatorname{argmax}_a[\cdot]$, then the policy $\pi'$ that equals $\pi$ everywhere except $\pi'(s) = \operatorname{argmax}_a[\cdot]$ would be better.)

• This is the principle of optimality in the stochastic case (related to Viterbi and the max-product algorithm).


Richard E. Bellman (1920–1984)
Bellman's principle of optimality

[Figure: diagram of a trajectory through points A and B illustrating the principle of optimality: A opt ⇒ B opt.]

$$V^*(s) = \max_a \Big[ R(a, s) + \gamma \sum_{s'} P(s' \mid a, s)\, V^*(s') \Big]$$
$$\pi^*(s) = \operatorname*{argmax}_a \Big[ R(a, s) + \gamma \sum_{s'} P(s' \mid a, s)\, V^*(s') \Big]$$


Value Iteration

• Given the Bellman equation
$$V^*(s) = \max_a \Big[ R(a, s) + \gamma \sum_{s'} P(s' \mid a, s)\, V^*(s') \Big]$$
→ iterate
$$\forall s : \; V_{k+1}(s) = \max_a \Big[ R(a, s) + \gamma \sum_{s'} P(s' \mid a, s)\, V_k(s') \Big]$$
with stopping criterion
$$\max_s |V_{k+1}(s) - V_k(s)| \le \epsilon$$

• Value Iteration converges to the optimal value function $V^*$ (proof below).
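A minimal Python sketch of this loop, assuming the dictionary MDP representation (S, A, T, R, gamma) introduced earlier; the function name and tolerance are illustrative, not part of the lecture.

```python
def value_iteration(S, A, T, R, gamma, eps=1e-6):
    """Tabular value iteration; T[(s, a)] = [(prob, next_state), ...], R[(s, a)] = reward."""
    V = {s: 0.0 for s in S}
    while True:
        V_new = {
            s: max(R[(s, a)] + gamma * sum(p * V[s2] for p, s2 in T[(s, a)])
                   for a in A if (s, a) in T)             # only actions available in s
            for s in S
        }
        if max(abs(V_new[s] - V[s]) for s in S) <= eps:   # stopping criterion
            return V_new
        V = V_new
```

For example, `value_iteration(S, A, T, R, gamma)` on the toy MDP sketched above returns a value estimate for each state.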


2x2 Maze

[Figure: a 2x2 grid world with cell rewards 0.0 and 1.0 and stochastic moves (80% intended direction, 10%/10% otherwise); to be solved manually.]


State-action value function (Q-function)

• The state-action value function (or Q-function) is the expected discounted return when starting in state $s$ and taking first action $a$:
$$Q^\pi(a, s) = \mathbb{E}_\pi\{r_0 + \gamma r_1 + \gamma^2 r_2 + \dots \mid s_0 = s, a_0 = a\} = R(a, s) + \gamma \sum_{s'} P(s' \mid a, s)\, Q^\pi(\pi(s'), s')$$
(Note: $V^\pi(s) = Q^\pi(\pi(s), s)$.)

• Bellman optimality equation for the Q-function:
$$Q^*(a, s) = R(a, s) + \gamma \sum_{s'} P(s' \mid a, s) \max_{a'} Q^*(a', s')$$
$$\pi^*(s) = \operatorname*{argmax}_a Q^*(a, s)$$


Q-Iteration

• Given the Bellman equation
$$Q^*(a, s) = R(a, s) + \gamma \sum_{s'} P(s' \mid a, s) \max_{a'} Q^*(a', s')$$
→ iterate
$$\forall a, s : \; Q_{k+1}(a, s) = R(a, s) + \gamma \sum_{s'} P(s' \mid a, s) \max_{a'} Q_k(a', s')$$
with stopping criterion
$$\max_{a, s} |Q_{k+1}(a, s) - Q_k(a, s)| \le \epsilon$$

• Q-Iteration converges to the optimal state-action value function $Q^*$.
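The same fixed-point iteration on a Q-table, again a sketch under the hypothetical dictionary representation; it also extracts the greedy policy $\pi^*(s) = \operatorname{argmax}_a Q^*(a, s)$.

```python
def q_iteration(S, A, T, R, gamma, eps=1e-6):
    """Tabular Q-iteration; returns (Q*, greedy policy)."""
    Q = {sa: 0.0 for sa in T}                       # one entry per defined (s, a) pair
    while True:
        Q_new = {
            (s, a): R[(s, a)] + gamma * sum(
                p * max(Q[(s2, a2)] for a2 in A if (s2, a2) in T)
                for p, s2 in T[(s, a)])
            for (s, a) in T
        }
        if max(abs(Q_new[sa] - Q[sa]) for sa in T) <= eps:
            pi = {s: max((a for a in A if (s, a) in T), key=lambda a, s=s: Q_new[(s, a)])
                  for s in S}
            return Q_new, pi
        Q = Q_new
```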


Proof of convergence

• Let $\Delta_k = \|Q^* - Q_k\|_\infty = \max_{a, s} |Q^*(a, s) - Q_k(a, s)|$. Then
$$\begin{aligned}
Q_{k+1}(a, s) &= R(a, s) + \gamma \sum_{s'} P(s' \mid a, s) \max_{a'} Q_k(a', s') \\
&\le R(a, s) + \gamma \sum_{s'} P(s' \mid a, s) \max_{a'} \big[ Q^*(a', s') + \Delta_k \big] \\
&= \Big[ R(a, s) + \gamma \sum_{s'} P(s' \mid a, s) \max_{a'} Q^*(a', s') \Big] + \gamma \Delta_k \\
&= Q^*(a, s) + \gamma \Delta_k
\end{aligned}$$
Similarly, $Q_k \ge Q^* - \Delta_k \;\Rightarrow\; Q_{k+1} \ge Q^* - \gamma \Delta_k$.


Convergence

• Contraction property: $\|U_{k+1} - V_{k+1}\| \le \gamma \|U_k - V_k\|$, which guarantees convergence to the same fixed point from different initial values $U_0, V_0$ of two approximations:
$$\|U_{k+1} - V_{k+1}\| \le \gamma \|U_k - V_k\| \le \dots \le \gamma^{k+1} \|U_0 - V_0\|$$

• Stopping condition: $\|V_{k+1} - V_k\| \le \epsilon \;\Rightarrow\; \|V_{k+1} - V^*\| \le \epsilon\gamma/(1 - \gamma)$

Proof:
$$\frac{\|V_{k+1} - V^*\|}{\gamma} \le \|V_k - V^*\| \le \|V_{k+1} - V_k\| + \|V_{k+1} - V^*\|$$
$$\frac{\|V_{k+1} - V^*\|}{\gamma} \le \epsilon + \|V_{k+1} - V^*\|$$
Rearranging gives $\|V_{k+1} - V^*\| \le \epsilon\gamma/(1 - \gamma)$.


Policy Evaluation
Value Iteration and Q-Iteration compute $V^*$ and $Q^*$ directly.

If we want to evaluate a given policy $\pi$, we want to compute $V^\pi$ or $Q^\pi$:

• Iterate using $\pi$ instead of $\max_a$:
$$\forall s : \; V_{k+1}(s) = R(\pi(s), s) + \gamma \sum_{s'} P(s' \mid \pi(s), s)\, V_k(s')$$
$$\forall a, s : \; Q_{k+1}(a, s) = R(a, s) + \gamma \sum_{s'} P(s' \mid a, s)\, Q_k(\pi(s'), s')$$

• Or invert the matrix equation:
$$V^\pi = R^\pi + \gamma P^\pi V^\pi$$
$$V^\pi - \gamma P^\pi V^\pi = R^\pi$$
$$(I - \gamma P^\pi)\, V^\pi = R^\pi$$
$$V^\pi = (I - \gamma P^\pi)^{-1} R^\pi$$
This requires inversion of an $n \times n$ matrix for $|\mathcal{S}| = n$, which is $O(n^3)$.
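A minimal sketch of the iterative variant for a deterministic policy stored as a dict from state to action, again assuming the hypothetical dictionary MDP from earlier. The matrix-inversion variant corresponds to the NumPy sketch shown for the MRP, with $P^\pi$ and $R^\pi$ built from $\pi$.

```python
def policy_evaluation(S, T, R, gamma, pi, eps=1e-6):
    """Iterative evaluation of a deterministic policy pi: state -> action."""
    V = {s: 0.0 for s in S}
    while True:
        V_new = {
            s: R[(s, pi[s])] + gamma * sum(p * V[s2] for p, s2 in T[(s, pi[s])])
            for s in S
        }
        if max(abs(V_new[s] - V[s]) for s in S) <= eps:
            return V_new
        V = V_new
```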


Policy Iteration

• How does computing $V^\pi$ or $Q^\pi$ help us find the optimal policy?

• Policy Iteration:
1. Initialise $\pi_0$ somehow (e.g. randomly).
2. Iterate:
– Policy evaluation: compute $V^{\pi_k}$ or $Q^{\pi_k}$.
– Policy improvement: $\pi_{k+1}(s) \leftarrow \operatorname{argmax}_a Q^{\pi_k}(a, s)$.

demo: 2x2 maze
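A sketch of the full loop, reusing the policy_evaluation function above (all names hypothetical); it stops when the greedy policy no longer changes.

```python
def policy_iteration(S, A, T, R, gamma):
    """Policy iteration: alternate policy evaluation and greedy policy improvement."""
    pi = {s: next(a for a in A if (s, a) in T) for s in S}   # arbitrary initial policy
    while True:
        V = policy_evaluation(S, T, R, gamma, pi)            # policy evaluation step
        pi_new = {                                           # policy improvement step
            s: max((a for a in A if (s, a) in T),
                   key=lambda a, s=s: R[(s, a)] + gamma * sum(p * V[s2]
                                                              for p, s2 in T[(s, a)]))
            for s in S
        }
        if pi_new == pi:                                     # policy stable => optimal
            return pi, V
        pi = pi_new
```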


Convergence proof
The proof rests on the following facts:

• After policy improvement, $V^{\pi_k} \le V^{\pi_{k+1}}$ (a sketch proof is given in Rich Sutton's book).

• The policy space is finite: $|\mathcal{A}|^{|\mathcal{S}|}$.

• The Bellman operator has a unique fixed point (due to the strict contraction property for $0 < \gamma < 1$ on a Banach space). The same property is used to prove convergence of the VI algorithm to its fixed point.


VI vs. PI

• VI is PI with a single step of policy evaluation.

• PI converges surprisingly rapidly; however, each iteration is computationally expensive because the policy evaluation step waits for convergence of $V^\pi$.

• PI is preferred if the action set is large.


Asynchronous Dynamic Programming

• The value function table is updated asynchronously.
• Computation is significantly reduced.
• If all states continue to be updated infinitely often, convergence is still guaranteed.

• Three simple algorithms:

• Gauss-Seidel Value Iteration

• Real-time dynamic programming

• Prioritised sweeping


Gauss-Seidel Value Iteration

• The standard VI algorithm updates all states in the next iteration using the old values from the previous iteration (an iteration finishes when all states have been updated).

Algorithm 1: Standard Value Iteration
1: while (!converged) do
2:   $V_{\text{old}} = V$
3:   for (each $s \in \mathcal{S}$) do
4:     $V(s) = \max_a \{ R(s, a) + \gamma \sum_{s'} P(s' \mid s, a)\, V_{\text{old}}(s') \}$

• Gauss-Seidel VI updates each state using the values from the most recent computation.

Algorithm 2: Gauss-Seidel Value Iteration
1: while (!converged) do
2:   for (each $s \in \mathcal{S}$) do
3:     $V(s) = \max_a \{ R(s, a) + \gamma \sum_{s'} P(s' \mid s, a)\, V(s') \}$
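A minimal sketch of one Gauss-Seidel sweep under the same hypothetical dictionary MDP: the only difference from the standard backup is that V is overwritten in place, so states later in the sweep already see the new values.

```python
def gauss_seidel_sweep(S, A, T, R, gamma, V):
    """One in-place sweep; returns the largest change, usable as a stopping test."""
    delta = 0.0
    for s in S:                                   # fixed sweep order over the states
        v_new = max(R[(s, a)] + gamma * sum(p * V[s2] for p, s2 in T[(s, a)])
                    for a in A if (s, a) in T)
        delta = max(delta, abs(v_new - V[s]))
        V[s] = v_new                              # overwrite immediately (in place)
    return delta
```

Repeating `gauss_seidel_sweep` until the returned change drops below $\epsilon$ gives the Gauss-Seidel VI of Algorithm 2.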


Prioritised Sweeping

• Similar to Gauss-Seidel VI, but the order in which states are updated is determined by the magnitude of their updates (Bellman errors).

• Define the Bellman error as $E(s; V_t) = |V_{t+1}(s) - V_t(s)|$, i.e. the change of $s$'s value after the most recent update.

Algorithm 3: Prioritised Sweeping VI
1: Initialize $V_0(s)$ and priority values $H_0(s)$, $\forall s \in \mathcal{S}$.
2: for $k = 0, 1, 2, 3, \dots$ do
3:   pick the state with the highest priority to update: $s_k \in \operatorname{argmax}_{s \in \mathcal{S}} H_k(s)$
4:   value update: $V_{k+1}(s_k) = \max_{a \in \mathcal{A}} \big[ R(s_k, a) + \gamma \sum_{s'} P(s' \mid s_k, a)\, V_k(s') \big]$
5:   for $s \neq s_k$: $V_{k+1}(s) = V_k(s)$
6:   update priority values: $\forall s \in \mathcal{S}$, $H_{k+1}(s) \leftarrow E(s; V_{k+1})$ (note: the error is w.r.t. the future update)
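A deliberately naive sketch of Algorithm 3 under the hypothetical dictionary MDP: the state with the largest Bellman error is backed up, and the priorities are then recomputed against the new value function. Practical implementations instead keep a priority queue and only touch the priorities of predecessor states rather than recomputing all of them.

```python
def prioritised_sweeping(S, A, T, R, gamma, n_updates=200):
    """Naive prioritised sweeping: always back up the state with the largest Bellman error."""
    def backup(s, V):
        return max(R[(s, a)] + gamma * sum(p * V[s2] for p, s2 in T[(s, a)])
                   for a in A if (s, a) in T)

    V = {s: 0.0 for s in S}
    H = {s: abs(backup(s, V) - V[s]) for s in S}       # initial priorities (Bellman errors)
    for _ in range(n_updates):
        s_k = max(S, key=lambda s: H[s])               # highest-priority state
        V[s_k] = backup(s_k, V)                        # update only that state
        H = {s: abs(backup(s, V) - V[s]) for s in S}   # priorities w.r.t. the new V
    return V
```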


Real-Time Dynamic Programming

• Similar to Gauss-Seidel VI, but the sequence of states updated in each iteration is generated by simulating transitions.

Algorithm 4: Real-Time Value Iteration
1: start at an arbitrary $s_0$ and initialize $V_0(s)$, $\forall s \in \mathcal{S}$.
2: for $k = 0, 1, 2, 3, \dots$ do
3:   action selection: $a_k \in \operatorname{argmax}_{a \in \mathcal{A}} \big\{ R(s_k, a) + \gamma \sum_{s'} P(s' \mid s_k, a)\, V_k(s') \big\}$
4:   value update: $V_{k+1}(s_k) = R(s_k, a_k) + \gamma \sum_{s'} P(s' \mid s_k, a_k)\, V_k(s')$
5:   for $s \neq s_k$: $V_{k+1}(s) = V_k(s)$
6:   simulate the next state: $s_{k+1} \sim P(s' \mid s_k, a_k)$
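A minimal sketch of Algorithm 4 under the same hypothetical dictionaries: the greedy action is chosen from the current value estimates, only the visited state is backed up, and the next state is sampled from the model.

```python
import random

def rtdp(S, A, T, R, gamma, s0, n_steps=1000):
    """Real-time dynamic programming along a simulated trajectory starting in s0."""
    V = {s: 0.0 for s in S}
    s = s0
    for _ in range(n_steps):
        def q(a):                                           # one-step lookahead value
            return R[(s, a)] + gamma * sum(p * V[s2] for p, s2 in T[(s, a)])
        a = max((a for a in A if (s, a) in T), key=q)       # greedy action selection
        V[s] = q(a)                                         # back up only the visited state
        probs, succs = zip(*T[(s, a)])
        s = random.choices(succs, weights=probs)[0]         # simulate the next state
    return V
```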


• So far, we have introduced the basic notions of an MDP and value functions, and methods to compute optimal policies assuming that we know the world (i.e. we know $P(s' \mid a, s)$ and $R(a, s)$):

– Value Iteration / Q-Iteration → $V^*$, $Q^*$, $\pi^*$

– Policy Evaluation → $V^\pi$, $Q^\pi$

– Policy Improvement: $\pi(s) \leftarrow \operatorname{argmax}_a Q^{\pi_k}(a, s)$

– Policy Iteration (iterate Policy Evaluation and Policy Improvement)

• Reinforcement Learning?
