
Reinforcement Learning and Deep Reinforcement Learning

Ashis Kumer Biswas, Ph.D.

ashis.biswas@ucdenver.edu


Outlines

1 Principles of Reinforcement Learning

2 The Q value

3 Q-learning example

4 Q-learning in Python

5 Non-deterministic Environment

6 Temporal difference Learning

7 Q-learning on OpenAI Gym

8 Deep Q-network (DQN)

9 DQN on Keras

10 Double DQN (DDQN)


Reinforcement Learning

Figure: The perception-action-learning loop (environment, agent, interpreter, action, reward, state). Image source: https://en.wikipedia.org/wiki/Reinforcement_learning

1 An agent takes an action in an environment.

2 That action is interpreted into a reward, R, and a representation of the state, S.

3 These two are then fed back into the agent.


A Crawling Robot learns to crawl

Figure: A crawling robot developed by Francis wyffels. Video link: https://www.youtube.com/watch?v=2iNrJx6IDEo


Introduction to Markov Decision Process, MDP

MDPs formally describe an environment for reinforcement learning (RL) in which the environment is fully observable. That is, the current state completely characterizes the process.

Almost all RL problems can be formalized as MDPs.


Markov Property

“The future is independent of the past given the present”.

Definition: A state St is Markov if and only if

P[St+1 | St] = P[St+1 | S1, · · · , St]

The state captures all relevant information from the history. Once the state is known, the history may be thrown away.

i.e., the state is a sufficient statistic of the future.


State Transition Matrix

For a Markov state s and a successor state s′, the state transition probability is defined as:

Pss′ = P[St+1 = s′ | St = s]

The state transition matrix P defines the transition probabilities from all states s to all successor states s′:

P =
[ P11 · · · P1n ]
[  ...  . . .  ... ]
[ Pn1 · · · Pnn ]

where each row of the matrix sums to 1.
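As a minimal sketch of this idea in Python (the deck's later examples use Python), a transition matrix can be stored as a NumPy array; the three states and probabilities below are made up purely for illustration:

```python
import numpy as np

# Illustrative 3-state transition matrix: P[s, s'] = P[S_{t+1} = s' | S_t = s].
# The probabilities are made-up numbers, not taken from any figure.
P = np.array([
    [0.5, 0.4, 0.1],
    [0.0, 0.6, 0.4],
    [0.0, 0.0, 1.0],   # an absorbing (terminal) state
])

# Each row must be a probability distribution over successor states.
assert np.allclose(P.sum(axis=1), 1.0)

# Sampling a successor state from state s = 0.
rng = np.random.default_rng(0)
s = 0
s_next = rng.choice(len(P), p=P[s])
print("P(.|s=0) =", P[s], "-> sampled s' =", s_next)
```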


Markov Process

A Markov Process is a memoryless random process, i.e., a sequence of random states S1, S2, · · · with the Markov property.

Definition: A Markov Process (or Markov Chain) is a tuple 〈S, P〉, such that:

S is a finite set of states.
P is a state transition probability matrix: Pss′ = P[St+1 = s′ | St = s]


Example: Student Markov Chain

Sample episodes for the Student Markov Chain, starting from S1 = C1:

C1, C2, C3, Pass, Sleep

C1, FB, FB, C1, C2, Sleep

C1, C2, C3, Pub, C2, C3, Pass, Sleep

C1, FB, FB, C1, C2, C3, Pub, C1, FB, FB, FB, C1, C2, C3, Pub, C2, Sleep
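A short sketch of how such episodes can be generated by repeatedly sampling from the transition probabilities. The diagram itself is not reproduced in the text, so the probabilities below are the commonly used values for this example and should be treated as assumptions:

```python
import numpy as np

# Student Markov Chain transition probabilities (assumed values; the figure
# with the actual numbers is not reproduced in the text).
P = {
    "C1":   {"C2": 0.5, "FB": 0.5},
    "C2":   {"C3": 0.8, "Sleep": 0.2},
    "C3":   {"Pass": 0.6, "Pub": 0.4},
    "Pass": {"Sleep": 1.0},
    "Pub":  {"C1": 0.2, "C2": 0.4, "C3": 0.4},
    "FB":   {"C1": 0.1, "FB": 0.9},
}

rng = np.random.default_rng(1)

def sample_episode(start="C1", terminal="Sleep"):
    """Sample one episode S1, S2, ... until the terminal state is reached."""
    episode, s = [start], start
    while s != terminal:
        successors, probs = zip(*P[s].items())
        s = rng.choice(successors, p=probs)
        episode.append(str(s))
    return episode

for _ in range(3):
    print(", ".join(sample_episode()))
```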


Markov Reward Process, MRP

A Markov reward process is a Markov Chain with values.

Definition: A Markov Reward Process is a tuple 〈S, P, R, γ〉, such that:

S is a finite set of states.
P is a state transition probability matrix: Pss′ = P[St+1 = s′ | St = s]
R is a reward function: Rs = E[Rt+1 | St = s]
γ is a discount factor: γ ∈ [0, 1]


The Student MRP


Return

Definition: The return Gt is the total discounted reward from time-step t:

Gt = Rt+1 + γRt+2 + γ²Rt+3 + · · · = Σ_{k≥0} γ^k Rt+k+1

The discount γ ∈ [0, 1] is the present value of future rewards. The value of receiving reward R after k + 1 time-steps is γ^k R. This values immediate reward above delayed reward:

γ close to 0 leads to "myopic" evaluation.
γ close to 1 leads to "far-sighted" evaluation.


Why discount?

Most Markov reward and decision processes are discounted. Why?

Avoids infinite returns in cyclic Markov Processes.
Uncertainty about the future may not be fully represented.
If the reward is financial, immediate rewards may earn more interest than delayed rewards.
Animal/human behavior shows a preference for immediate reward.


Value function

The value function v(s) gives the long-term value of state s.

Definition: The state-value function v(s) of an MRP is the expected return starting from state s:

v(s) = E[Gt|St = s]


The Student MRP

Sample returns for the Student MRP, starting from S1 = C1 with γ = 0.5:

G1 = R2 + γR3 + · · ·

C1, C2, C3, Pass, Sleep:
v1 = −2 − 2·(1/2) − 2·(1/4) + 10·(1/8) = −2.25


The Student MRP

Sample returns for the Student MRP, starting from S1 = C1 with γ = 0.5:

G1 = R2 + γR3 + · · ·

C1, FB, FB, C1, C2, Sleep:
v1 = −2 − 1·(1/2) − 1·(1/4) − 2·(1/8) − 2·(1/16) = −3.125
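These two sample returns are easy to verify in Python. A minimal sketch, using the per-state rewards implied by the calculations above (−2 for C1, C2, and C3, −1 for FB, +10 for Pass; Sleep is terminal):

```python
def discounted_return(rewards, gamma):
    """G1 = R2 + gamma*R3 + gamma^2*R4 + ..."""
    return sum(r * gamma**k for k, r in enumerate(rewards))

# Per-state rewards implied by the worked examples above.
R = {"C1": -2, "C2": -2, "C3": -2, "FB": -1, "Pass": 10}
gamma = 0.5

episodes = [
    ["C1", "C2", "C3", "Pass"],        # ..., Sleep (terminal, no further reward)
    ["C1", "FB", "FB", "C1", "C2"],    # ..., Sleep (terminal, no further reward)
]
for episode in episodes:
    print(episode, "->", discounted_return([R[s] for s in episode], gamma))
# Prints -2.25 and -3.125, matching the returns computed above.
```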


State-Value function for Student MRP (γ = 0)

A myopic evaluation!!!


State-Value function for Student MRP (γ = 0.9)


State-Value function for Student MRP (γ = 1)

A far-sighted evaluation!!!


Bellman Equation for MRPs

The value function v(s) can be decomposed into two parts:

immediate reward, Rt+1
discounted value of the successor state, γv(St+1)

v(s) = E[Gt | St = s]
     = E[Rt+1 + γRt+2 + γ²Rt+3 + · · · | St = s]
     = E[Rt+1 + γ(Rt+2 + γRt+3 + · · · ) | St = s]
     = E[Rt+1 + γGt+1 | St = s]
     = E[Rt+1 + γv(St+1) | St = s]


Bellman Equation for MRPs

v(s) = E[Rt+1 + γv(St+1)|St = s]

It can be represented as:

v(s) = Rs + γ Σs′∈S Pss′ v(s′)


Bellman Equation for Student MRP


Bellman Equation in Matrix Form

The Bellman equation can be expressed concisely using matrices,

v = R + γPv

where v is a column vector with one entry per state:

[ v(1) ]   [ R1 ]       [ P11 · · · P1n ] [ v(1) ]
[  ...  ] = [ ...  ] + γ [  ...   . . .   ...  ] [  ...  ]
[ v(n) ]   [ Rn ]       [ Pn1 · · · Pnn ] [ v(n) ]
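A minimal NumPy sketch of this matrix form, solving v = R + γPv directly for a small made-up 3-state MRP (the numbers are illustrative only):

```python
import numpy as np

# Made-up 3-state MRP, purely for illustration.
P = np.array([[0.5, 0.4, 0.1],
              [0.0, 0.6, 0.4],
              [0.0, 0.0, 1.0]])        # state 2 is absorbing
R = np.array([-2.0, -1.0, 0.0])        # expected immediate reward per state
gamma = 0.9

# v = R + gamma * P v  =>  (I - gamma * P) v = R
v = np.linalg.solve(np.eye(len(R)) - gamma * P, R)
print(v)
```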


Solving the Bellman Equation

It is a linear equation.

It can be solved directly:

v = R + γPv
(I − γP)v = R
v = (I − γP)⁻¹R

Computational complexity is O(n³) for n states.
A direct solution is only possible for small MRPs.
There are iterative methods for large MRPs (a simple dynamic-programming sketch follows below):

Dynamic Programming
Monte-Carlo Evaluation
Temporal Difference Learning
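As a rough sketch of the dynamic-programming alternative, the Bellman equation can be applied repeatedly as an update, v ← R + γPv, which avoids the O(n³) inversion (same illustrative MRP as in the previous sketch):

```python
import numpy as np

# Same illustrative 3-state MRP as above.
P = np.array([[0.5, 0.4, 0.1],
              [0.0, 0.6, 0.4],
              [0.0, 0.0, 1.0]])
R = np.array([-2.0, -1.0, 0.0])
gamma = 0.9

# Iterative evaluation: repeatedly apply v <- R + gamma * P v until convergence.
v = np.zeros(len(R))
for _ in range(10_000):
    v_new = R + gamma * P @ v
    if np.max(np.abs(v_new - v)) < 1e-10:
        break
    v = v_new
print(v)   # matches the direct solve v = (I - gamma*P)^(-1) R
```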


Markov Decision Process, MDP

A Markov Decision Process (MDP) is a Markov Reward Process with decisions. It is an environment in which all states are Markov.

Definition: A Markov Decision Process is a tuple 〈S, A, P, R, γ〉, such that:

S is a finite set of states.
A is a finite set of actions.
P is a state transition probability matrix: Pass′ = P[St+1 = s′ | St = s, At = a]
R is a reward function: Ras = E[Rt+1 | St = s, At = a]
γ is a discount factor: γ ∈ [0, 1]


Student Markov Process


Student Markov Reward Process, MRP


Student Markov Decision Process, MDP


Policies (1 of 2)

Definition: A policy π is a distribution over actions given states,

π(a|s) = P[At = a|St = s]

A policy fully defines the behavior of an agent.
MDP policies depend on the current state (and not the history).

That is, policies are stationary (time-independent):

At ∼ π(· | St), ∀t > 0


Policies (2 of 2)

Given an MDP, M = 〈S, A, P, R, γ〉, and a policy π:

The state sequence S1, S2, · · · is a Markov process 〈S, Pπ〉.
The state and reward sequence S1, R2, S2, · · · is a Markov Reward Process 〈S, Pπ, Rπ, γ〉, where

Pπss′ = Σa∈A π(a|s) Pass′

Rπs = Σa∈A π(a|s) Ras
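A minimal sketch of these two formulas with NumPy. The MDP below has |S| = 3 states and |A| = 2 actions; all transition probabilities, rewards, and policy values are made-up numbers for illustration:

```python
import numpy as np

# P[a, s, s'] = P[S_{t+1} = s' | S_t = s, A_t = a]   (illustrative numbers)
P = np.array([[[0.7, 0.2, 0.1],
               [0.1, 0.8, 0.1],
               [0.0, 0.0, 1.0]],
              [[0.2, 0.7, 0.1],
               [0.0, 0.5, 0.5],
               [0.0, 0.0, 1.0]]])
# R[a, s] = E[R_{t+1} | S_t = s, A_t = a]
R = np.array([[ 1.0, 0.0, 0.0],
              [-1.0, 2.0, 0.0]])
# pi[s, a] = probability of taking action a in state s
pi = np.array([[0.5, 0.5],
               [0.9, 0.1],
               [0.5, 0.5]])

# Induced MRP:  P^pi_{ss'} = sum_a pi(a|s) P^a_{ss'},   R^pi_s = sum_a pi(a|s) R^a_s
P_pi = np.einsum("sa,ast->st", pi, P)
R_pi = np.einsum("sa,as->s", pi, R)

assert np.allclose(P_pi.sum(axis=1), 1.0)   # still a valid transition matrix
print(P_pi)
print(R_pi)
```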


Value Functions

Definition: The state-value function vπ(s) of an MDP is the expected return starting from state s, and then following policy π:

vπ(s) = Eπ[Gt|St = s]

Definition: The action-value function qπ(s, a) is the expected return starting from state s, taking action a, and then following policy π:

qπ(s, a) = Eπ[Gt|St = s,At = a]


Example: State-Value Function for Student MDP


Bellman Expectation Equation

The state-value function can again be decomposed into immediate reward plus the discounted value of the successor state,

vπ(s) = Eπ[Rt+1 + γvπ(St+1)|St = s]

The action-value function can also be decomposed as,

qπ(s, a) = Eπ[Rt+1 + γqπ(St+1, At+1)|St = s,At = a]


Bellman Expectation Equation for vπ(.)

vπ(s) = Σa∈A π(a|s) qπ(s, a)


Bellman Expectation Equation for qπ(.) (1 of 2)

qπ(s, a) = Ras + γ Σs′∈S Pass′ vπ(s′)


Bellman Expectation Equation for vπ(.)

vπ(s) = Σa∈A π(a|s) ( Ras + γ Σs′∈S Pass′ vπ(s′) )


Bellman Expectation Equation for qπ(.) (2 of 2)

qπ(s, a) = Ras + γ Σs′∈S Pass′ Σa′∈A π(a′|s′) qπ(s′, a′)
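A small sketch that ties the equations above together: solve for vπ on the MRP induced by the policy, recover qπ from vπ, and check that averaging qπ over the policy gives back vπ. The MDP arrays are the same illustrative ones used in the earlier policy sketch:

```python
import numpy as np

# Same illustrative MDP and policy as in the earlier sketch.
P = np.array([[[0.7, 0.2, 0.1], [0.1, 0.8, 0.1], [0.0, 0.0, 1.0]],
              [[0.2, 0.7, 0.1], [0.0, 0.5, 0.5], [0.0, 0.0, 1.0]]])  # P[a, s, s']
R = np.array([[1.0, 0.0, 0.0], [-1.0, 2.0, 0.0]])                    # R[a, s]
pi = np.array([[0.5, 0.5], [0.9, 0.1], [0.5, 0.5]])                  # pi[s, a]
gamma = 0.9

# v_pi from the induced MRP:  v_pi = (I - gamma * P_pi)^(-1) R_pi
P_pi = np.einsum("sa,ast->st", pi, P)
R_pi = np.einsum("sa,as->s", pi, R)
v_pi = np.linalg.solve(np.eye(3) - gamma * P_pi, R_pi)

# q_pi(s, a) = R^a_s + gamma * sum_{s'} P^a_{ss'} v_pi(s')
q_pi = R + gamma * np.einsum("ast,t->as", P, v_pi)                   # shape (A, S)

# Consistency check: v_pi(s) = sum_a pi(a|s) q_pi(s, a)
assert np.allclose(v_pi, np.einsum("sa,as->s", pi, q_pi))
print(v_pi)
print(q_pi)
```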


Bellman Expectation Equation in Student MDP


Optimal Value Function

Definition: The optimal state-value function v∗(s) is the maximum state-value function over all policies:

v∗(s) = maxπ vπ(s)

Definition: The optimal action-value function q∗(s, a) is the maximum action-value function over all policies:

q∗(s, a) = maxπ qπ(s, a)

The optimal value function specifies the best possible performance in the MDP.
An MDP is considered "solved" when we know the optimal value function.


Optimal State-Value function for the Student MDP


Optimal Action-Value function for the Student MDP


Optimal Policy

The value functions define a partial ordering over policies:

π ≥ π′ if vπ(s) ≥ vπ′(s), ∀s

Theorem: For any Markov Decision Process,

There exists an optimal policy π∗ that is better than or equal to all other policies: π∗ ≥ π, ∀π.
All optimal policies achieve the optimal state-value function: vπ∗(s) = v∗(s).
All optimal policies achieve the optimal action-value function: qπ∗(s, a) = q∗(s, a).


Finding an optimal policy

An optimal policy can be found by maximizing over q∗(s, a),

π∗(a|s) = 1 if a = argmaxa∈A q∗(s, a), and 0 otherwise.

There is always a deterministic optimal policy for any MDP.
If we know q∗(s, a), we immediately have the optimal policy (see the sketch below).
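A minimal sketch of extracting this deterministic policy from a tabular q∗; the Q values below are placeholders, not from any worked example:

```python
import numpy as np

# Placeholder optimal action values q_star[s, a] for 3 states and 2 actions.
q_star = np.array([[ 1.0, 2.5],
                   [ 0.3, 0.1],
                   [-1.0, 0.0]])

greedy = q_star.argmax(axis=1)          # a* = argmax_a q*(s, a) for each state

# pi_star(a|s) = 1 for the greedy action, 0 otherwise.
pi_star = np.zeros_like(q_star)
pi_star[np.arange(len(q_star)), greedy] = 1.0
print(greedy)     # [1 0 1]
print(pi_star)
```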


Example: Optimal Policy for the Student MDP


Bellman Optimality Equation for v∗

The optimal value functions are recursively related by the Bellman optimality equations:

v∗(s) = maxa q∗(s, a)


Bellman Optimality Equation for q∗

q∗(s, a) = Ras + γ Σs′∈S Pass′ v∗(s′)


Bellman Optimality Equation for v∗ (2)

v∗(s) = maxa ( Ras + γ Σs′∈S Pass′ v∗(s′) )


Bellman Optimality Equation for q∗ (2)

q∗(s, a) = Ras + γ Σs′∈S Pass′ maxa′ q∗(s′, a′)


Bellman Optimality Equation for the Student MDP


Solving the Bellman Optimality Equation

The Bellman Optimality Equation is non-linear.
There is no closed-form solution (in general).
However, many iterative solution methods are available (a model-based sketch follows below):

Policy gradient
Q-learning
SARSA
· · ·
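When the model 〈P, R〉 is known, one simple iterative scheme (not on the list above, which focuses on sample-based methods) is to apply the Bellman optimality backup for q∗ repeatedly until it converges. A rough sketch on the same illustrative MDP used earlier:

```python
import numpy as np

# Same illustrative MDP as before: P[a, s, s'], R[a, s].
P = np.array([[[0.7, 0.2, 0.1], [0.1, 0.8, 0.1], [0.0, 0.0, 1.0]],
              [[0.2, 0.7, 0.1], [0.0, 0.5, 0.5], [0.0, 0.0, 1.0]]])
R = np.array([[1.0, 0.0, 0.0], [-1.0, 2.0, 0.0]])
gamma = 0.9

# Repeatedly apply  q(s,a) <- R^a_s + gamma * sum_{s'} P^a_{ss'} max_{a'} q(s',a')
q = np.zeros_like(R)
for _ in range(10_000):
    q_new = R + gamma * np.einsum("ast,t->as", P, q.max(axis=0))
    if np.max(np.abs(q_new - q)) < 1e-10:
        break
    q = q_new

v_star = q.max(axis=0)       # v*(s) = max_a q*(s, a)
pi_star = q.argmax(axis=0)   # a greedy (hence optimal) action for each state
print(v_star)
print(pi_star)
```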


The Q value

How to find the optimal policy π∗?
How does the agent learn by interacting with the environment?

Instead of finding the policy that maximizes the state-value for all states, find the action that maximizes the Q value for all states.
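The following sections develop this idea as Q-learning: the agent keeps a table Q(s, a) and, after each interaction, moves Q(s, a) toward r + γ maxa′ Q(s′, a′). As a preview, a minimal tabular sketch on a made-up toy environment (all states, rewards, and hyperparameters below are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
n_states, n_actions = 4, 2
Q = np.zeros((n_states, n_actions))
alpha, gamma, epsilon = 0.1, 0.9, 0.1     # learning rate, discount, exploration

def toy_step(s, a):
    """Made-up toy environment: action 1 moves right, action 0 stays put.
    Reaching the last state gives reward +1 and ends the episode."""
    s_next = min(s + a, n_states - 1)
    done = s_next == n_states - 1
    return s_next, (1.0 if done else 0.0), done

for _ in range(500):                       # episodes
    s, done = 0, False
    while not done:
        # epsilon-greedy action selection
        a = rng.integers(n_actions) if rng.random() < epsilon else int(Q[s].argmax())
        s_next, r, done = toy_step(s, a)
        # Q-learning update: move Q(s,a) toward r + gamma * max_a' Q(s',a')
        target = r + gamma * (0.0 if done else Q[s_next].max())
        Q[s, a] += alpha * (target - Q[s, a])
        s = s_next

print(Q.argmax(axis=1))   # greedy action per non-terminal state should be 1 (move right)
```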


Thanks! Questions?
