Lecture 4: Markov Decision Processes
Bolei Zhou
The Chinese University of Hong Kong
January 24, 2019
Today’s Plan
1 Last Time
  1 Key elements of an RL agent: model, value, policy
  2 Multi-armed Bandits
2 This Time
  1 Markov decision processes
Define the model of the environment
Markov Processes
Markov Reward Processes (MRPs)
Markov Decision Processes (MDPs)
Full Observability: Markov Decision Process (MDP)
1 A Markov Decision Process can model many real-world problems. It formally describes an environment for reinforcement learning
2 Under an MDP, the environment is fully observable
  1 Optimal control primarily deals with continuous MDPs
  2 Partially observable problems can be converted into MDPs
  3 Bandits are MDPs with one state (why?)
Markov Property
1 The history of states: h_t = {s_1, s_2, s_3, ..., s_t}
2 State s_t is Markov if and only if:

p(s_{t+1} | s_t, a_t) = p(s_{t+1} | h_t, a_t)    (1)

3 "The future is independent of the past given the present"
Markov Process/Markov Chain
1 P is the dynamics/transition model that specifies p(s_{t+1} = s' | s_t = s)
2 State transition matrix (row: current state s, column: successor state s'):

P = \begin{bmatrix}
P(s_1|s_1) & P(s_2|s_1) & \cdots & P(s_N|s_1) \\
P(s_1|s_2) & P(s_2|s_2) & \cdots & P(s_N|s_2) \\
\vdots & \vdots & \ddots & \vdots \\
P(s_1|s_N) & P(s_2|s_N) & \cdots & P(s_N|s_N)
\end{bmatrix}

3 Note that there are no rewards and no actions (a sampling sketch follows below)
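To make the dynamics concrete, here is a minimal Python sketch of sampling a trajectory from a Markov chain; the function name and array layout are my own, with P[i, j] = P(s_j | s_i) matching the row convention above.

import numpy as np

def sample_chain(P, start, n_steps, rng=None):
    # Row P[i] is the distribution over successors of state i
    rng = np.random.default_rng() if rng is None else rng
    states = [start]
    for _ in range(n_steps):
        # Draw the next state index from the current state's row
        states.append(rng.choice(len(P), p=P[states[-1]]))
    return states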
Example: USA Mars Rover
Example: Chinese Jade Rabbit Rover on Moon
Video of China's historic Moon landing: https://www.youtube.com/watch?v=XfEwjPKJ4JE
Example: Rover MP
1 Sample episodes starting from s_3:
  1 s_3, s_4, s_5, s_6, s_6
  2 s_3, s_2, s_3, s_2, s_1
  3 s_3, s_4, s_4, s_5, s_5
Markov Reward Process (MRP)
1 A Markov Reward Process is a Markov chain + reward
2 Definition of Markov Reward Process (MRP)
  1 S is a (finite) set of states (s ∈ S)
  2 P is the dynamics/transition model that specifies P(s_{t+1} = s' | s_t = s)
  3 R is a reward function: R(s_t = s) = E[r_t | s_t = s]
  4 Discount factor γ ∈ [0, 1]
3 If the number of states is finite, R can be represented as a vector (see the sketch below)
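To fix a concrete representation for the solvers on later slides, a minimal sketch assuming NumPy arrays (class and field names are my own, not from the course):

from dataclasses import dataclass
import numpy as np

@dataclass
class MRP:
    P: np.ndarray   # (N, N) transition matrix, P[s, s'] = P(s'|s)
    R: np.ndarray   # (N,) reward vector, R[s] = E[r_t | s_t = s]
    gamma: float    # discount factor γ ∈ [0, 1]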
Example: Rover MRP
Reward: +5 in s_1, +10 in s_7, 0 in all other states, so we can represent R = [5, 0, 0, 0, 0, 0, 10]
Return function and Value function
1 Definition of Horizon
  1 Maximum number of time steps in each episode
  2 Can be infinite; otherwise the process is called a finite Markov (reward) process
2 Definition of Return
  1 Discounted sum of rewards from time step t to the horizon (a small code helper follows this slide):

G_t = R_{t+1} + γR_{t+2} + γ^2 R_{t+3} + γ^3 R_{t+4} + ... + γ^{T-t-1} R_T

3 Definition of the state value function V_t(s) for an MRP
  1 Expected return from time t in state s:

V_t(s) = E[G_t | s_t = s]
       = E[R_{t+1} + γR_{t+2} + γ^2 R_{t+3} + ... + γ^{T-t-1} R_T | s_t = s]

  2 The present value of future rewards
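As a worked illustration of the return, a small Python helper (my own naming) that computes G_t from a finite list of rewards [R_{t+1}, ..., R_T] by folding from the end:

def discounted_return(rewards, gamma):
    # G = R_{t+1} + γ(R_{t+2} + γ(R_{t+3} + ...)), accumulated backwards
    g = 0.0
    for r in reversed(rewards):
        g = r + gamma * g
    return g

For example, discounted_return([0, 0, 0, 10], 0.5) gives 1.25, matching the rover episode s_4, s_5, s_6, s_7 two slides ahead.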
Why Discount Factor γ
1 Mathematically convenient to discount rewards
2 Avoids infinite returns in cyclic Markov processes
3 Uncertainty about the future may not be fully represented
4 If the reward is financial, immediate rewards may earn more interest than delayed rewards
5 Animal/human behaviour shows a preference for immediate reward
6 It is sometimes possible to use undiscounted Markov reward processes (i.e. γ = 1), e.g. if all sequences terminate
  1 γ = 0: only care about the immediate reward
  2 γ = 1: future rewards are weighted the same as immediate rewards
Example: Rover MRP
1 Reward: +5 in s_1, +10 in s_7, 0 in all other states, so we can represent R = [5, 0, 0, 0, 0, 0, 10]
2 Sample returns for 4-step episodes with γ = 1/2:
  1 s_4, s_5, s_6, s_7: 0 + (1/2) × 0 + (1/4) × 0 + (1/8) × 10 = 1.25
  2 s_4, s_3, s_2, s_1: 0 + (1/2) × 0 + (1/4) × 0 + (1/8) × 5 = 0.625
3 How to compute the value function?
Computing the Value of a Markov Reward Process
1 Value function: the expected return starting from state s

V(s) = E[G_t | s_t = s] = E[R_{t+1} + γR_{t+2} + γ^2 R_{t+3} + ... | s_t = s]
2 MRP value function satisfies the following Bellman equation:
V(s) = \underbrace{R(s)}_{\text{immediate reward}} + \underbrace{γ \sum_{s' \in S} P(s'|s) V(s')}_{\text{discounted sum of future rewards}}
3 Practice: derive the Bellman equation for V(s) (one possible sketch follows)
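One possible sketch, using G_t = R_{t+1} + γG_{t+1}, the tower property of expectation, and the Markov property (try it yourself first):

V(s) = E[G_t | s_t = s]
     = E[R_{t+1} + γ G_{t+1} | s_t = s]
     = R(s) + γ E[ E[G_{t+1} | s_{t+1}] | s_t = s ]
     = R(s) + γ Σ_{s'∈S} P(s'|s) V(s')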
Matrix Form of Bellman Equation for MRP
Therefore, we can express V in matrix form:

\begin{bmatrix} V(s_1) \\ V(s_2) \\ \vdots \\ V(s_N) \end{bmatrix}
=
\begin{bmatrix} R(s_1) \\ R(s_2) \\ \vdots \\ R(s_N) \end{bmatrix}
+ γ
\begin{bmatrix}
P(s_1|s_1) & P(s_2|s_1) & \cdots & P(s_N|s_1) \\
P(s_1|s_2) & P(s_2|s_2) & \cdots & P(s_N|s_2) \\
\vdots & \vdots & \ddots & \vdots \\
P(s_1|s_N) & P(s_2|s_N) & \cdots & P(s_N|s_N)
\end{bmatrix}
\begin{bmatrix} V(s_1) \\ V(s_2) \\ \vdots \\ V(s_N) \end{bmatrix}

or compactly

V = R + γPV
1 Analytic solution for the value of an MRP: V = (I − γP)^{-1} R
  1 The matrix inverse has complexity O(N^3) for N states
  2 Only feasible for small MRPs (see the code sketch below)
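The analytic solution in Python, as a sketch with my own names; np.linalg.solve is used instead of forming the inverse explicitly, which is cheaper and more numerically stable:

import numpy as np

def mrp_value_analytic(P, R, gamma):
    # Solve (I − γP) V = R, i.e. V = (I − γP)^{-1} R
    N = len(R)
    return np.linalg.solve(np.eye(N) - gamma * P, R)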
Iterative Algorithm for Computing the Value of an MRP
1 Iterative methods for large MRPs:
  1 Dynamic Programming
  2 Monte-Carlo evaluation
  3 Temporal-Difference learning
Iterative Algorithm for Computing the Value of an MRP
Algorithm 1 Monte Carlo simulation to calculate the MRP value function

1: i ← 0, G_t ← 0
2: while i ≠ N do
3:   generate an episode, starting from state s and time t
4:   using the generated episode, calculate the return g = Σ_{k=t}^{H−1} γ^{k−t} r_k
5:   G_t ← G_t + g, i ← i + 1
6: end while
7: V_t(s) ← G_t / N
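A Python sketch of Algorithm 1, assuming a hypothetical helper sample_rewards(s) that rolls out one episode from state s and returns its reward list, and reusing discounted_return from the return slide:

def mc_value(sample_rewards, s, gamma, n_episodes):
    # Average the discounted returns of n_episodes rollouts from s
    total = 0.0
    for _ in range(n_episodes):
        total += discounted_return(sample_rewards(s), gamma)
    return total / n_episodes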
Iterative Algorithm for Computing the Value of an MRP
Algorithm 2 Iterative algorithm to calculate the MRP value function

1: for all states s ∈ S: V'(s) ← 0, V(s) ← ∞
2: while ||V − V'|| > ε do
3:   V ← V'
4:   for all states s ∈ S: V'(s) ← R(s) + γ Σ_{s'∈S} P(s'|s) V(s')
5: end while
6: return V'(s) for all s ∈ S
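Algorithm 2 in vectorized form, a sketch under my own naming; the backup V ← R + γPV updates every state at once:

import numpy as np

def mrp_value_iterative(P, R, gamma, eps=1e-6):
    V = np.zeros(len(R))
    while True:
        V_new = R + gamma * P @ V   # Bellman backup for all states
        if np.max(np.abs(V_new - V)) < eps:
            return V_new
        V = V_new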
Markov Decision Process (MDP)
1 A Markov Decision Process is a Markov Reward Process with decisions
2 Definition of MDP
  1 S is a finite set of states
  2 A is a finite set of actions
  3 P^a is the dynamics/transition model for each action: P(s_{t+1} = s' | s_t = s, a_t = a)
  4 R is a reward function: R(s_t = s, a_t = a) = E[r_t | s_t = s, a_t = a]
  5 Discount factor γ ∈ [0, 1]
3 An MDP is a tuple (S, A, P, R, γ)
MDP Policies
1 A policy specifies what action to take in each state
2 Given a state, a policy specifies a distribution over actions
3 Policy: π(a|s) = P(a_t = a | s_t = s)
4 Policies are stationary (time-independent): A_t ∼ π(a|s) for any t > 0
MDP Policies
1 Given an MDP (S, A, P, R, γ) and a policy π:
2 The state sequence S_1, S_2, ... is a Markov process (S, P^π)
3 The state and reward sequence S_1, R_2, S_2, R_3, ... is a Markov reward process (S, P^π, R^π, γ), where

P^π(s'|s) = Σ_{a∈A} π(a|s) P(s'|s, a)

R^π(s) = Σ_{a∈A} π(a|s) R(s, a)
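These two averaging formulas translate directly into code; a sketch under array conventions of my own choosing: P has shape (A, S, S) with P[a, s, s'] = P(s'|s, a), while R and pi have shape (S, A):

import numpy as np

def mdp_to_mrp(P, R, pi):
    P_pi = np.einsum('sa,ast->st', pi, P)   # P^π(s'|s) = Σ_a π(a|s) P(s'|s, a)
    R_pi = np.einsum('sa,sa->s', pi, R)     # R^π(s)    = Σ_a π(a|s) R(s, a)
    return P_pi, R_pi

The resulting (P_pi, R_pi, γ) can be fed straight into the MRP solvers from the earlier slides.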
Value function for MDP
1 The state-value function v_π(s) of an MDP is the expected return starting from state s and then following policy π:

v_π(s) = E_π[G_t | s_t = s]    (2)

2 The action-value function q_π(s, a) is the expected return starting from state s, taking action a, and then following policy π:

q_π(s, a) = E_π[G_t | s_t = s, A_t = a]    (3)
Bellman Expectation Equation
1 The state-value function can be decomposed into the immediate reward plus the discounted value of the successor state:

v_π(s) = E_π[R_{t+1} + γ v_π(s_{t+1}) | s_t = s]    (4)

2 The action-value function can similarly be decomposed:

q_π(s, a) = E_π[R_{t+1} + γ q_π(s_{t+1}, A_{t+1}) | s_t = s, A_t = a]    (5)
Bellman Expectation Equation for v_π and q_π
v_π(s) = Σ_{a∈A} π(a|s) q_π(s, a)    (6)

q_π(s, a) = R(s, a) + γ Σ_{s'∈S} P(s'|s, a) v_π(s')    (7)

Thus

v_π(s) = Σ_{a∈A} π(a|s) ( R(s, a) + γ Σ_{s'∈S} P(s'|s, a) v_π(s') )    (8)

q_π(s, a) = R(s, a) + γ Σ_{s'∈S} P(s'|s, a) Σ_{a'∈A} π(a'|s') q_π(s', a')    (9)
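Equation (7) as code, under the same (A, S, S) and (S, A) array conventions as before (a sketch, not the course's reference implementation):

import numpy as np

def q_from_v(P, R, v, gamma):
    # q_π(s, a) = R(s, a) + γ Σ_{s'} P(s'|s, a) v_π(s')
    return R + gamma * np.einsum('ast,t->sa', P, v)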
Bellman Expectation Equation for v_π
v_π(s) = Σ_{a∈A} π(a|s) ( R(s, a) + γ Σ_{s'∈S} P(s'|s, a) v_π(s') )    (10)
Bellman Expectation Equation for q_π
q_π(s, a) = R(s, a) + γ Σ_{s'∈S} P(s'|s, a) Σ_{a'∈A} π(a'|s') q_π(s', a')    (11)
Example: Policy Evaluation
1 Two actions: a_1 and a_2
2 For all actions, reward: +5 in s_1, +10 in s_7, 0 in all other states, so we can represent R = [5, 0, 0, 0, 0, 0, 10]
3 Take the deterministic policy π(s) = a_1 for every state s, and γ = 0
4 Iterative update: v^π_k(s) = r(s, π(s)) + γ Σ_{s'∈S} P(s'|s, π(s)) v^π_{k−1}(s')
5 V^π = [5, 0, 0, 0, 0, 0, 10] (see the check below)
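A quick check of the claimed fixed point, sketched using the γ = 0 shortcut: the future term vanishes, so the transition model is never needed and the iteration converges after one step.

import numpy as np

R = np.array([5., 0., 0., 0., 0., 0., 10.])  # reward vector from item 2 (same for both actions)
gamma = 0.0
# v_π(s) = r(s, π(s)) + 0 · (...) = r(s, π(s))
v = R.copy()
print(v)   # [ 5.  0.  0.  0.  0.  0. 10.]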
Next Time
1 More Examples on MDP
2 Policy Evaluation
3 Policy Improvement: how to find better actions
4 Policy Iteration and Value Iteration