Reinforcement Learning and Deep Reinforcement Learningcse.ucdenver.edu/.../Class-22-Reinforcement-learning-DL.pdf · 2018. 11. 28. · Outlines 1 Principles of Reinforcement Learning


Reinforcement Learning and Deep Reinforcement Learning

Ashis Kumer Biswas, Ph.D.

[email protected]

Deep Learning November 5, 2018 1 / 64


Outlines

1 Principles of Reinforcement Learning

2 The Q value

3 Q-learning example

4 Q-learning in Python

5 Non-deterministic Environment

6 Temporal difference Learning

7 Q-learning on OpenAI Gym

8 Deep Q-network (DQN)

9 DQN on Keras

10 Double DQN (DDQN)


Reinforcement Learning

Figure: The perception-action-learning loop (diagram labels: Environment, Agent, Action, Interpreter, Reward, State). Image source: https://en.wikipedia.org/wiki/Reinforcement_learning

1 An agent takes an action in an environment.

2 That action is interpreted into a reward, R, and a representation of the state, S.

3 These two are then fed back into the agent.
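
The three steps of the loop can be sketched in a few lines of Python. This is a minimal illustration, not code from the lecture: the environment `EchoEnv`, its reward rule, and the helper `run_loop` are all hypothetical names invented for this example.

```python
import random


class EchoEnv:
    """Hypothetical toy environment: reward 1 if the action equals the shown state."""

    def __init__(self, seed=0):
        self.rng = random.Random(seed)
        self.state = self.rng.choice([0, 1])

    def step(self, action):
        # The "interpreter" turns the action into a reward and a new state.
        reward = 1.0 if action == self.state else 0.0
        self.state = self.rng.choice([0, 1])
        return self.state, reward


def run_loop(env, policy, n_steps):
    """Run the perception-action loop and return the total reward collected."""
    state, total = env.state, 0.0
    for _ in range(n_steps):
        action = policy(state)            # 1. the agent takes an action
        state, reward = env.step(action)  # 2. the action yields a reward and a state
        total += reward                   # 3. both are fed back to the agent
    return total
```

With the policy `lambda s: s`, which simply echoes the observed state, every step earns the reward, so `run_loop(EchoEnv(), lambda s: s, 10)` collects a reward on each of the 10 steps.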


A Crawling Robot learns to crawl

Figure: A crawling robot developed by Francis wyffels. Video link: https://www.youtube.com/watch?v=2iNrJx6IDEo


Introduction to Markov Decision Process, MDP

MDPs formally describe an environment for reinforcement learning (RL) in which the environment is fully observable. That is, the current state completely characterizes the process.

Almost all RL problems can be formalized as MDPs.


Markov Property

“The future is independent of the past given the present”.

Definition: A state $S_t$ is Markov if and only if

$$\mathbb{P}[S_{t+1} \mid S_t] = \mathbb{P}[S_{t+1} \mid S_1, \dots, S_t]$$

The state captures all relevant information from the history. Once the state is known, the history may be thrown away.

i.e., the state is a sufficient statistic of the future.


State Transition Matrix

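For a Markov state $s$ and a successor state $s'$, the state transition probability is defined as:

$$\mathcal{P}_{ss'} = \mathbb{P}[S_{t+1} = s' \mid S_t = s]$$

The state transition matrix $\mathcal{P}$ defines the transition probabilities from all states $s$ to all successor states $s'$,

$$\mathcal{P} = \begin{bmatrix} \mathcal{P}_{11} & \cdots & \mathcal{P}_{1n} \\ \vdots & \ddots & \vdots \\ \mathcal{P}_{n1} & \cdots & \mathcal{P}_{nn} \end{bmatrix}$$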

where each row of the matrix sums to 1.
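
As a small sketch of these properties, the snippet below stores a transition matrix as a list of rows, checks that every row sums to 1, and propagates a distribution over states by one step. The 3-state matrix `P` and both helper names are hypothetical and chosen only for illustration.

```python
# Hypothetical 3-state chain; the numbers are illustrative only.
P = [
    [0.5, 0.5, 0.0],
    [0.1, 0.6, 0.3],
    [0.0, 0.0, 1.0],
]


def is_row_stochastic(matrix, tol=1e-9):
    """True if all entries are non-negative and every row sums to 1."""
    return all(
        min(row) >= 0.0 and abs(sum(row) - 1.0) <= tol
        for row in matrix
    )


def next_state_distribution(matrix, dist):
    """One step of the chain: dist'[t] = sum_s dist[s] * P[s][t]."""
    n = len(matrix)
    return [sum(dist[s] * matrix[s][t] for s in range(n)) for t in range(n)]
```

Starting with all probability mass on the first state, `next_state_distribution(P, [1.0, 0.0, 0.0])` returns the first row of `P`.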


Markov Process

A Markov Process is a memoryless random process, i.e., a sequence of random states $S_1, S_2, \dots$ with the Markov property.

Definition: A Markov Process (or Markov Chain) is a tuple $\langle \mathcal{S}, \mathcal{P} \rangle$ such that:

$\mathcal{S}$ is a finite set of states.

$\mathcal{P}$ is a state transition probability matrix, $\mathcal{P}_{ss'} = \mathbb{P}[S_{t+1} = s' \mid S_t = s]$


Example: Student Markov Chain

Sample episodes for the Student Markov Chain starting from $S_1$ = C1:

C1, C2, C3, Pass, Sleep

C1, FB, FB, C1, C2, Sleep

C1, C2, C3, Pub, C2, C3, Pass, Sleep

C1, FB, FB, C1, C2, C3, Pub, C1, FB, FB, FB, C1, C2, C3, Pub, C2, Sleep
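
Episodes like these can be generated by repeatedly sampling a successor state until the terminal state is reached. The sketch below assumes transition probabilities for the student chain; the text lists only sample episodes, so every number in `STUDENT_CHAIN` is an illustrative assumption, and `sample_episode` is a hypothetical helper.

```python
import random

# Assumed transition probabilities (not given in the text; illustration only).
STUDENT_CHAIN = {
    "C1": [("C2", 0.5), ("FB", 0.5)],
    "C2": [("C3", 0.8), ("Sleep", 0.2)],
    "C3": [("Pass", 0.6), ("Pub", 0.4)],
    "Pass": [("Sleep", 1.0)],
    "Pub": [("C1", 0.2), ("C2", 0.4), ("C3", 0.4)],
    "FB": [("C1", 0.1), ("FB", 0.9)],
}


def sample_episode(chain, start="C1", terminal="Sleep", rng=None, max_steps=10_000):
    """Sample states until the terminal state is reached (or max_steps is hit)."""
    rng = rng or random.Random()
    episode = [start]
    while episode[-1] != terminal and len(episode) < max_steps:
        successors, weights = zip(*chain[episode[-1]])
        # The next state depends only on the current state (Markov property).
        episode.append(rng.choices(successors, weights=weights)[0])
    return episode
```

Calling `sample_episode(STUDENT_CHAIN, rng=random.Random(0))` returns one list of state names that starts at C1; a different seed gives a different episode.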



Markov Reward Process, MRP

A Markov reward process is a Markov Chain with values.

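Definition: A Markov Reward Process is a tuple $\langle \mathcal{S}, \mathcal{P}, \mathcal{R}, \gamma \rangle$

$\mathcal{S}$ is a finite set of states.

$\mathcal{P}$ is a state transition probability matrix, $\mathcal{P}_{ss'} = \mathbb{P}[S_{t+1} = s' \mid S_t = s]$

$\mathcal{R}$ is a reward function, $\mathcal{R}_s = \mathbb{E}[R_{t+1} \mid S_t = s]$

$\gamma$ is a discount factor, $\gamma \in [0, 1]$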


The Student MRP


Return

Definition: The return $G_t$ is the total discounted reward from time-step $t$:

$$G_t = R_{t+1} + \gamma R_{t+2} + \gamma^2 R_{t+3} + \cdots = \sum_{k=0}^{\infty} \gamma^k R_{t+k+1}$$

The discount $\gamma \in [0, 1]$ is the present value of future rewards.

The value of receiving reward $R$ after $k + 1$ time-steps is $\gamma^k R$.

This values immediate reward above delayed reward:

$\gamma$ close to 0 leads to “myopic” evaluation.

$\gamma$ close to 1 leads to “far-sighted” evaluation.
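
For a finite episode the return can be computed with a short loop. The sketch below is an illustration rather than code from the lecture; the function name `discounted_return` and the reward list in the usage note are assumptions.

```python
def discounted_return(rewards, gamma):
    """G_t = sum over k of gamma**k * R_{t+k+1}, for a finite reward sequence."""
    g = 0.0
    # Work backwards so each step is one multiply-add: G = R + gamma * G_next.
    for r in reversed(rewards):
        g = r + gamma * g
    return g
```

For the reward sequence `[-2, -2, -2, 10]`, setting `gamma=0` keeps only the first reward (the myopic case), while `gamma=1` adds all rewards without discounting (the far-sighted case).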


Why discount?

Most Markov reward and decision processes are discounted. Why?

Avoids infinite returns in cyclic Markov Processes.

Uncertainty about the future may not be fully represented.

If the reward is financial, immediate rewards may earn more interest than delayed rewards.

Animal/human behavior shows preference for immediate reward.


Value function

The value function v(s) gives the long-term value of state s.

Definition: The state value function $v(s)$ of an MRP is the expected return starting from state $s$:

$$v(s) = \mathbb{E}[G_t \mid S_t = s]$$


The Student MRP

Sample returns for the Student MRP, starting from $S_1$ = C1 with $\gamma = 0.5$:

$$G_1 = R_2 + \gamma R_3 + \cdots$$

C1, C2, C3, Pass, Sleep:

$$v_1 = -2 - 2 \cdot \tfrac{1}{2} - 2 \cdot \tfrac{1}{4} + 10 \cdot \tfrac{1}{8} = -2.25$$


C1, FB, FB, C1, C2, Sleep:

$$v_1 = -2 - 1 \cdot \tfrac{1}{2} - 1 \cdot \tfrac{1}{4} - 2 \cdot \tfrac{1}{8} - 2 \cdot \tfrac{1}{16} = -3.125$$


State-Value function for Student MRP (γ = 0)

A myopic evaluation!!!


State-Value function for Student MRP (γ = 0.9)


State-Value function for Student MRP (γ = 1)

A far-sighted evaluation!!!


Bellman Equation for MRPs

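The value function $v(s)$ can be decomposed into two parts:

immediate reward, $R_{t+1}$

discounted value of the successor state, $\gamma v(S_{t+1})$

$$\begin{aligned}
v(s) &= \mathbb{E}[G_t \mid S_t = s] \\
&= \mathbb{E}[R_{t+1} + \gamma R_{t+2} + \gamma^2 R_{t+3} + \cdots \mid S_t = s] \\
&= \mathbb{E}[R_{t+1} + \gamma (R_{t+2} + \gamma R_{t+3} + \cdots) \mid S_t = s] \\
&= \mathbb{E}[R_{t+1} + \gamma G_{t+1} \mid S_t = s] \\
&= \mathbb{E}[R_{t+1} + \gamma v(S_{t+1}) \mid S_t = s]
\end{aligned}$$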


Bellman Equation for MRPs

$$v(s) = \mathbb{E}[R_{t+1} + \gamma v(S_{t+1}) \mid S_t = s]$$

It can be represented as:

$$v(s) = \mathcal{R}_s + \gamma \sum_{s' \in \mathcal{S}} \mathcal{P}_{ss'} v(s')$$


Bellman Equation for Student MRP


Bellman Equation in Matrix Form

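The Bellman equation can be expressed concisely using matrices,

$$v = \mathcal{R} + \gamma \mathcal{P} v$$

where $v$ is a column vector with one entry per state:

$$\begin{bmatrix} v(1) \\ \vdots \\ v(n) \end{bmatrix} = \begin{bmatrix} \mathcal{R}_1 \\ \vdots \\ \mathcal{R}_n \end{bmatrix} + \gamma \begin{bmatrix} \mathcal{P}_{11} & \cdots & \mathcal{P}_{1n} \\ \vdots & \ddots & \vdots \\ \mathcal{P}_{n1} & \cdots & \mathcal{P}_{nn} \end{bmatrix} \begin{bmatrix} v(1) \\ \vdots \\ v(n) \end{bmatrix}$$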


Solving the Bellman Equation

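It is a linear equation.

It can be solved directly:

$$\begin{aligned}
v &= \mathcal{R} + \gamma \mathcal{P} v \\
(I - \gamma \mathcal{P}) v &= \mathcal{R} \\
v &= (I - \gamma \mathcal{P})^{-1} \mathcal{R}
\end{aligned}$$

where $I$ is the identity matrix.

Computational complexity is $O(n^3)$ for $n$ states.

The direct solution is only practical for small MRPs.

There are iterative methods for large MRPs:

Dynamic Programming

Monte-Carlo Evaluation

Temporal Difference Learning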
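
Both routes can be compared on a tiny example. The sketch below solves $(I - \gamma \mathcal{P}) v = \mathcal{R}$ by Gauss-Jordan elimination and, separately, by repeating the Bellman backup $v \leftarrow \mathcal{R} + \gamma \mathcal{P} v$. The 3-state MRP and all function names are hypothetical, and the code assumes $\gamma < 1$ so that the linear system has a unique solution.

```python
def solve_linear(a, b):
    """Solve a x = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(b)
    m = [list(row) + [rhs] for row, rhs in zip(a, b)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(n):
            if r != col:
                factor = m[r][col] / m[col][col]
                m[r] = [x - factor * y for x, y in zip(m[r], m[col])]
    return [m[i][n] / m[i][i] for i in range(n)]


def mrp_values_direct(P, R, gamma):
    """Direct solution v = (I - gamma P)^{-1} R; cost grows as O(n^3)."""
    n = len(R)
    a = [[(1.0 if i == j else 0.0) - gamma * P[i][j] for j in range(n)]
         for i in range(n)]
    return solve_linear(a, R)


def mrp_values_iterative(P, R, gamma, sweeps=1000):
    """Iterative solution: repeat the Bellman backup v <- R + gamma P v."""
    n = len(R)
    v = [0.0] * n
    for _ in range(sweeps):
        v = [R[s] + gamma * sum(P[s][t] * v[t] for t in range(n))
             for s in range(n)]
    return v


# Hypothetical 3-state MRP; the last state is absorbing with zero reward.
P = [[0.5, 0.5, 0.0], [0.0, 0.5, 0.5], [0.0, 0.0, 1.0]]
R = [-1.0, -2.0, 0.0]
```

Calling `mrp_values_direct(P, R, 0.9)` and `mrp_values_iterative(P, R, 0.9)` should give matching values up to floating-point error. The iterative version never forms or inverts a matrix, which is why methods of this kind scale to larger state spaces.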


Markov Decision Process, MDP

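A Markov Decision Process (MDP) is a Markov Reward Process with decisions. It is an environment in which all states are Markov.

Definition: A Markov Decision Process is a tuple $\langle \mathcal{S}, \mathcal{A}, \mathcal{P}, \mathcal{R}, \gamma \rangle$

$\mathcal{S}$ is a finite set of states.

$\mathcal{A}$ is a finite set of actions.

$\mathcal{P}$ is a state transition probability matrix, $\mathcal{P}^a_{ss'} = \mathbb{P}[S_{t+1} = s' \mid S_t = s, A_t = a]$

$\mathcal{R}$ is a reward function, $\mathcal{R}^a_s = \mathbb{E}[R_{t+1} \mid S_t = s, A_t = a]$

$\gamma$ is a discount factor, $\gamma \in [0, 1]$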


Student Markov Process


Student Markov Reward Process, MRP


Student Markov Decision Process, MDP


Policies (1 of 2)

Definition: A policy $\pi$ is a distribution over actions given states,

$$\pi(a \mid s) = \mathbb{P}[A_t = a \mid S_t = s]$$

A policy fully defines the behavior of an agent.

MDP policies depend on the current state (and not the history).

That is, policies are stationary (time independent),

$$A_t \sim \pi(\cdot \mid S_t), \quad \forall t > 0$$
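
A stochastic policy can be stored as a table of action probabilities per state, and an action drawn from it by looking only at the current state. The state names, action names, and probabilities below are hypothetical, as is the helper `sample_action`.

```python
import random

# Hypothetical stochastic policy: policy[state][action] = pi(action | state).
policy = {
    "C1": {"study": 0.5, "facebook": 0.5},
    "C2": {"study": 1.0},
}


def sample_action(policy, state, rng):
    """Draw A_t ~ pi(. | S_t = state); the draw ignores time and history."""
    actions = list(policy[state])
    weights = [policy[state][a] for a in actions]
    return rng.choices(actions, weights=weights)[0]
```

Because the function takes no time index and no history, the same state always yields the same action distribution, which is what stationarity means here.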


Policies (2 of 2)

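Given an MDP $\mathcal{M} = \langle \mathcal{S}, \mathcal{A}, \mathcal{P}, \mathcal{R}, \gamma \rangle$ and a policy $\pi$:

The state sequence $S_1, S_2, \dots$ is a Markov process $\langle \mathcal{S}, \mathcal{P}^\pi \rangle$.

The state and reward sequence $S_1, R_2, S_2, \dots$ is a Markov Reward Process (MRP) $\langle \mathcal{S}, \mathcal{P}^\pi, \mathcal{R}^\pi, \gamma \rangle$, where

$$\mathcal{P}^\pi_{ss'} = \sum_{a \in \mathcal{A}} \pi(a \mid s) \mathcal{P}^a_{ss'}$$

$$\mathcal{R}^\pi_s = \sum_{a \in \mathcal{A}} \pi(a \mid s) \mathcal{R}^a_s$$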
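
The two averaging formulas translate directly into code. In this sketch the MDP is stored as `P[a][s][t]` and `R[a][s]` and the policy as `policy[s][a]`; this layout, the two-state example, and the name `induced_mrp` are assumptions made for illustration.

```python
def induced_mrp(P, R, policy):
    """
    Average the MDP dynamics over the policy.

    P[a][s][t]   : probability of moving s -> t under action a
    R[a][s]      : expected reward for taking action a in state s
    policy[s][a] : pi(a | s)
    """
    actions = list(P)
    n = len(policy)
    P_pi = [
        [sum(policy[s][a] * P[a][s][t] for a in actions) for t in range(n)]
        for s in range(n)
    ]
    R_pi = [sum(policy[s][a] * R[a][s] for a in actions) for s in range(n)]
    return P_pi, R_pi


# Hypothetical two-state MDP with actions "stay" and "go".
P = {"stay": [[1.0, 0.0], [0.0, 1.0]], "go": [[0.0, 1.0], [1.0, 0.0]]}
R = {"stay": [0.0, 1.0], "go": [-1.0, -1.0]}
uniform = [{"stay": 0.5, "go": 0.5}, {"stay": 0.5, "go": 0.5}]
```

For the uniform policy, `induced_mrp(P, R, uniform)` gives a transition matrix whose entries are all 0.5 and the reward vector `[-0.5, 0.0]`.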


Value Functions

Definition: The state-value function $v_\pi(s)$ of an MDP is the expected return starting from state $s$, and then following policy $\pi$.

$$v_\pi(s) = \mathbb{E}_\pi[G_t \mid S_t = s]$$

Definition: The action-value function $q_\pi(s, a)$ is the expected return starting from state $s$, taking action $a$, and then following policy $\pi$.

$$q_\pi(s, a) = \mathbb{E}_\pi[G_t \mid S_t = s, A_t = a]$$


Example: State-Value Function for Student MDP


Bellman Expectation Equation

The state-value function can again be decomposed into immediate reward plus discounted value of the successor state,

$$v_\pi(s) = \mathbb{E}_\pi[R_{t+1} + \gamma v_\pi(S_{t+1}) \mid S_t = s]$$

The action-value function can also be decomposed as,

$$q_\pi(s, a) = \mathbb{E}_\pi[R_{t+1} + \gamma q_\pi(S_{t+1}, A_{t+1}) \mid S_t = s, A_t = a]$$


Bellman Expectation Equation for vπ(.)

$$v_\pi(s) = \sum_{a \in \mathcal{A}} \pi(a \mid s) \, q_\pi(s, a)$$


Bellman Expectation Equation for qπ(.) (1 of 2)

$$q_\pi(s, a) = \mathcal{R}^a_s + \gamma \sum_{s' \in \mathcal{S}} \mathcal{P}^a_{ss'} v_\pi(s')$$


Bellman Expectation Equation for vπ(.)

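$$v_\pi(s) = \sum_{a \in \mathcal{A}} \pi(a \mid s) \left( \mathcal{R}^a_s + \gamma \sum_{s' \in \mathcal{S}} \mathcal{P}^a_{ss'} v_\pi(s') \right)$$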
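
Applying this equation repeatedly as an update rule gives iterative policy evaluation. The sketch below is one possible implementation under assumptions: the storage layout (`P[a][s][t]`, `R[a][s]`, `policy[s][a]`), the two-state MDP, and the function names are all hypothetical.

```python
def q_values(P, R, v, gamma):
    """q(s, a) = R^a_s + gamma * sum_t P^a_{st} v(t), for every s and a."""
    n = len(v)
    return [
        {a: R[a][s] + gamma * sum(P[a][s][t] * v[t] for t in range(n)) for a in P}
        for s in range(n)
    ]


def policy_evaluation(P, R, policy, gamma, sweeps=200):
    """Repeat the Bellman expectation backup v(s) <- sum_a pi(a|s) q(s, a)."""
    n = len(policy)
    v = [0.0] * n
    for _ in range(sweeps):
        q = q_values(P, R, v, gamma)
        v = [sum(policy[s][a] * q[s][a] for a in policy[s]) for s in range(n)]
    return v


# Hypothetical two-state MDP with actions "stay" and "go", uniform policy.
P = {"stay": [[1.0, 0.0], [0.0, 1.0]], "go": [[0.0, 1.0], [1.0, 0.0]]}
R = {"stay": [0.0, 1.0], "go": [-1.0, -1.0]}
policy = [{"stay": 0.5, "go": 0.5}, {"stay": 0.5, "go": 0.5}]
```

With `gamma=0.5` the values settle near $v_\pi = (-0.75, -0.25)$, which can be checked by substituting them back into the equation by hand.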


Bellman Expectation Equation for qπ(.) (2 of 2)

$$q_\pi(s, a) = \mathcal{R}^a_s + \gamma \sum_{s' \in \mathcal{S}} \mathcal{P}^a_{ss'} \sum_{a' \in \mathcal{A}} \pi(a' \mid s') \, q_\pi(s', a')$$


Bellman Expectation Equation in Student MDP


Optimal Value Function

Definition: The optimal state-value function $v_*(s)$ is the maximum state-value function over all policies,

$$v_*(s) = \max_\pi v_\pi(s)$$

Definition: The optimal action-value function $q_*(s, a)$ is the maximum action-value function over all policies,

$$q_*(s, a) = \max_\pi q_\pi(s, a)$$

The optimal value function specifies the best possible performance in the MDP.

An MDP is considered “solved” when we know the optimal value function.


Optimal State-Value function for the Student MDP


Optimal Action-Value function for the Student MDP


Optimal Policy

Define a partial ordering over policies,

$$\pi \ge \pi' \ \text{ if } \ v_\pi(s) \ge v_{\pi'}(s), \ \forall s$$

Theorem: For any Markov Decision Process,

There exists an optimal policy $\pi_*$ that is better than or equal to all other policies, $\pi_* \ge \pi, \ \forall \pi$.

All optimal policies achieve the optimal state-value function, $v_{\pi_*}(s) = v_*(s)$.

All optimal policies achieve the optimal action-value function, $q_{\pi_*}(s, a) = q_*(s, a)$.


Finding an optimal policy

An optimal policy can be found by maximizing over q∗(s, a):

$$\pi_*(a \mid s) = \begin{cases} 1 & \text{if } a = \arg\max_{a \in \mathcal{A}} q_*(s, a) \\ 0 & \text{otherwise} \end{cases}$$

There is always a deterministic optimal policy for any MDP.
If we know q∗(s, a), we immediately have the optimal policy.
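
Reading the policy off a known q∗ table can be sketched in a few lines. The table `q_star` below is made up for illustration (its states, actions, and numbers are assumptions), and `greedy_policy` is a hypothetical helper name.

```python
def greedy_policy(q):
    """Return pi[s][a] = 1.0 for the argmax action of q[s] and 0.0 otherwise."""
    policy = {}
    for state, action_values in q.items():
        # max() keeps the first maximizer, so ties go to the earlier action.
        best_action = max(action_values, key=action_values.get)
        policy[state] = {a: 1.0 if a == best_action else 0.0
                         for a in action_values}
    return policy

# Hypothetical optimal action values for a two-state, two-action problem.
q_star = {
    "s0": {"left": 1.0, "right": 2.5},
    "s1": {"left": 0.0, "right": -1.0},
}
pi_star = greedy_policy(q_star)
```

The result is deterministic: each state puts probability 1 on a single action, which is why knowing q∗ is enough to act optimally without a model of the transitions.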

Example: Optimal Policy for the Student MDP

Bellman Optimality Equation for v∗

The optimal value functions are recursively related by the Bellman optimality equations:

$$v_*(s) = \max_{a} q_*(s, a)$$

Bellman Optimality Equation for q∗

$$q_*(s, a) = \mathcal{R}_s^a + \gamma \sum_{s' \in \mathcal{S}} \mathcal{P}_{ss'}^a \, v_*(s')$$

Bellman Optimality Equation for v∗ (2)

$$v_*(s) = \max_{a} \left( \mathcal{R}_s^a + \gamma \sum_{s' \in \mathcal{S}} \mathcal{P}_{ss'}^a \, v_*(s') \right)$$
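
One way to use this equation is to apply its right-hand side repeatedly as an update rule until the values stop changing. This procedure is commonly called value iteration; it is not named on the slides and is shown here only as an illustration. The MDP below (states `A`, `B`, `T`, rewards, transition probabilities, discount factor) is hypothetical.

```python
GAMMA = 0.9  # discount factor (assumed)

# Hypothetical MDP: R[s][a] is the expected reward,
# P[s][a] maps next states to probabilities. T is terminal (no actions).
R = {"A": {"stay": 0.0, "go": 1.0}, "B": {"back": 0.0, "finish": 10.0}, "T": {}}
P = {
    "A": {"stay": {"A": 1.0}, "go": {"B": 0.8, "A": 0.2}},
    "B": {"back": {"A": 1.0}, "finish": {"T": 1.0}},
    "T": {},
}

def q_from_v(v, s, a):
    """One-step lookahead: R_s^a + gamma * sum_s' P_ss'^a * v(s')."""
    return R[s][a] + GAMMA * sum(p * v[s2] for s2, p in P[s][a].items())

def value_iteration(tol=1e-10, max_sweeps=10_000):
    v = {s: 0.0 for s in R}
    for _ in range(max_sweeps):
        delta = 0.0
        for s in R:
            if not R[s]:  # terminal state keeps value 0
                continue
            new_v = max(q_from_v(v, s, a) for a in R[s])  # Bellman optimality backup
            delta = max(delta, abs(new_v - v[s]))
            v[s] = new_v
        if delta < tol:  # stop once a full sweep changes nothing noticeable
            break
    return v

v_star = value_iteration()
# Greedy policy with respect to the converged values.
pi_star = {s: max(R[s], key=lambda a: q_from_v(v_star, s, a)) for s in R if R[s]}
```

For this toy problem the fixed point can also be solved by hand: v∗(B) = 10 from `finish`, and v∗(A) = 1 + 0.9 (0.8 · 10 + 0.2 v∗(A)) gives v∗(A) = 10.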

Bellman Optimality Equation for q∗ (2)

$$q_*(s, a) = \mathcal{R}_s^a + \gamma \sum_{s' \in \mathcal{S}} \mathcal{P}_{ss'}^a \max_{a'} q_*(s', a')$$
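
This form follows from the two relations already stated for the optimal value functions: substitute the expression of v∗ in terms of q∗ into the equation for q∗. Written out with the same symbols:

```latex
\begin{align*}
q_*(s, a) &= \mathcal{R}_s^a + \gamma \sum_{s' \in \mathcal{S}} \mathcal{P}_{ss'}^a \, v_*(s') \\
          &= \mathcal{R}_s^a + \gamma \sum_{s' \in \mathcal{S}} \mathcal{P}_{ss'}^a \, \max_{a'} q_*(s', a')
          \qquad \text{since } v_*(s') = \max_{a'} q_*(s', a').
\end{align*}
```

The result is a recursion in q∗ alone, which is the form that Q-learning approximates from sampled transitions.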

Bellman Optimality Equation for the Student MDP

Solving the Bellman Optimality Equation

The Bellman optimality equation is non-linear.
There is no closed-form solution (in general).
However, many iterative solution methods are available:

Policy gradient
Q-learning
SARSA
· · ·
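
As an illustration of one of these methods, here is a minimal tabular Q-learning sketch. Everything concrete in it is an assumption made for the example: the four-state chain environment, the reward of 1 for reaching the last state, the learning rate, the discount factor, and the uniformly random behaviour policy (Q-learning is off-policy, so it can still learn the greedy values while exploring at random).

```python
import random

N_STATES = 4          # states 0..3; state 3 is terminal (hypothetical chain)
ACTIONS = (0, 1)      # 0 = left, 1 = right
GAMMA, ALPHA = 0.9, 0.5  # discount factor and learning rate (assumed)

def step(state, action):
    """Deterministic toy environment: reward 1 on reaching the terminal state."""
    next_state = max(0, state - 1) if action == 0 else state + 1
    done = next_state == N_STATES - 1
    return next_state, (1.0 if done else 0.0), done

def q_learning(episodes=500, seed=0):
    rng = random.Random(seed)  # seeded for reproducibility
    Q = [[0.0, 0.0] for _ in range(N_STATES)]
    for _ in range(episodes):
        state, done = 0, False
        while not done:
            action = rng.choice(ACTIONS)  # uniformly random behaviour policy
            next_state, reward, done = step(state, action)
            # Sampled Bellman optimality target: r + gamma * max_a' Q(s', a').
            target = reward if done else reward + GAMMA * max(Q[next_state])
            Q[state][action] += ALPHA * (target - Q[state][action])
            state = next_state
    return Q

Q = q_learning()
# Greedy action in each non-terminal state after learning.
greedy = [max(ACTIONS, key=lambda a: Q[s][a]) for s in range(N_STATES - 1)]
```

After training, the greedy action in every non-terminal state is `1` (right), and the learned values approach the fixed point of the Bellman optimality equation for q∗ on this chain, for example Q[2][1] ≈ 1 and Q[0][1] ≈ 0.81.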

The Q value

How do we find the optimal policy π∗?
How does the agent learn by interacting with the environment?

Instead of finding the policy that maximizes the state value for all states, find the action that maximizes the Q value in each state.

Thanks! Questions?
