
Introduction to Reinforcement Learning

Riashat Islam

Background


Overview


Branches of Machine Learning: Reinforcement Learning


Motivations

- Goal-directed learning
- Learning from interaction with our surroundings
- Learning what to do in order to achieve goals
- A foundational idea in learning and intelligence
- A computational approach to learning from interaction


Reinforcement Learning Applications

- AlphaGo in the game of Go
- Playing Atari games
- Beating the world champion in Backgammon
- Helicopter manoeuvres


Applications: Game of Go - Google DeepMind

https://www.youtube.com/watch?v=SUbqykXVx0A

https://www.youtube.com/watch?v=g-dKXOlsf98


Applications: Playing Atari Games - Google DeepMind

https://www.youtube.com/watch?v=V1eYniJ0Rnk


Applications: Helicopter Manoeuvres - Stanford

https://www.youtube.com/watch?v=VCdxqn0fcnE


Applications: Robotics - Humanoid Robots

https://www.youtube.com/watch?v=370cT-OAzzM


Agent and Environment


Agent and Environment

- A learning agent interacts with its environment
- The agent learns what to do in order to maximize reward
- It must discover which actions yield the most reward
- Actions may affect long-term rewards
- The agent must trade off preferring actions it has already tried against trying new actions


Key Ideas in RL

- Markov Decision Processes (MDPs)
- Rewards
- The policy of the agent
- Value functions
- Environment model and dynamics


Sequential Decision Process and MDPs

- Decision-making to maximize utility/satisfaction
- Select actions so as to achieve goals
- The goal is to maximize the expected cumulative reward


Goals and Rewards

- A reward signal R_t
- A feedback signal that indicates the agent's performance or behaviour
- A sequence of actions, states, and rewards
- Maximize the cumulative reward of the agent

All goals can be described by the maximisation of the expected cumulative reward.

- Do you agree with this statement?


Policy

- Describes the agent's behaviour
- A mapping from states to actions
- A sequence of actions, one at each state, to achieve the goal
- Find the optimal policy: the optimal behaviour of the agent


Value Functions

- Estimate how good it is for the agent to be in the current state
- "Goodness" is defined in terms of expected future rewards
- Value functions are defined with respect to policies (behaviours)


Model - Environment Dynamics

- The environment in which the agent is located
- Defines the state transitions
- The states the agent can reach from the current state
- The next immediate reward the agent can receive
- Model-free: unknown environment dynamics
- Model-based: known environment dynamics


Summary

- The agent interacts with the environment
- The agent improves its policy
- Find the optimal policy to maximize cumulative reward
- The policy can also be defined in terms of value functions
- The agent may act in a known or unknown environment


Overview


Markov Decision Processes


Introduction to MDPs

- A model for sequential decision making under uncertainty
- Used to formulate RL problems
- Describes the environment of the RL agent
- An extension of decision theory to long-term plans of actions


MDP Framework

- MDPs are discrete-time state transition systems
- An MDP is described by 5 components:
  - States: the state of the system, which must be observed by the decision maker when a decision has to be made
  - Actions: an action chosen from the action set available in the current state
  - Transition probabilities P(s_{t+1} | s_t, a_t): describe the dynamics of the world
  - Reward R(s) or R(s, a): a real-valued reward that may depend on the state, or on both state and action
  - Discount factor γ ∈ [0, 1]
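To make these components concrete, the following is a minimal Python sketch (illustrative only, not from the slides) of a tiny two-state MDP; all state names, probabilities, and rewards are made up.

    # Illustrative 5 components of a made-up two-state MDP (not from the slides).
    states = ["s0", "s1"]
    actions = ["stay", "move"]

    # Transition probabilities P(s' | s, a), indexed as P[s][a][s'].
    P = {
        "s0": {"stay": {"s0": 0.9, "s1": 0.1}, "move": {"s0": 0.2, "s1": 0.8}},
        "s1": {"stay": {"s0": 0.1, "s1": 0.9}, "move": {"s0": 0.7, "s1": 0.3}},
    }

    # Reward R(s, a): real-valued, here depending on both state and action.
    R = {
        "s0": {"stay": 0.0, "move": 1.0},
        "s1": {"stay": 2.0, "move": 0.0},
    }

    gamma = 0.9  # discount factor in [0, 1]

    # Sanity check: the transition probabilities out of each (s, a) sum to 1.
    for s in states:
        for a in actions:
            assert abs(sum(P[s][a].values()) - 1.0) < 1e-9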


Markov Property

A state s_t is Markov if and only if

P(s_{t+1} \mid s_t) = P(s_{t+1} \mid s_1, s_2, s_3, \dots, s_t)    (1)

This means the state captures all the relevant information from the history.

The state transition probability P(s_{t+1} | s_t) defines the probability of transitioning from the current state s_t to the next (successor) state s_{t+1}.

The conditional probability distribution of future states depends only on the present state, and not on the sequence of events that preceded it.


Graphical Model for MDPs


Return

The return G_t is the total discounted reward from time step t:

G_t = R_{t+1} + \gamma R_{t+2} + \dots = \sum_{k=0}^{T} \gamma^k R_{t+k+1}    (2)

After k + 1 time steps, a reward R is worth γ^k R.

The discount factor γ is used to give more weight to immediate rewards than to delayed rewards.
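As a small worked illustration (the reward values below are arbitrary), the return can be computed from a finite list of rewards as follows.

    # Discounted return G_t = sum_k gamma^k R_{t+k+1}, for an arbitrary reward list.
    def discounted_return(rewards, gamma):
        g = 0.0
        for k, r in enumerate(rewards):
            g += (gamma ** k) * r
        return g

    rewards = [1.0, 0.0, 2.0, 3.0]  # R_{t+1}, R_{t+2}, R_{t+3}, R_{t+4} (made up)
    print(discounted_return(rewards, gamma=0.9))
    # 1.0 + 0.9*0.0 + 0.81*2.0 + 0.729*3.0 = 4.807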


Policy

A mapping from states to actions: at every state s_t, the agent takes an action a_t prescribed by the policy π(a|s).

The policy defines the behaviour of the agent.

Due to the Markov property, the choice of action depends only on the current state s_t and not on any of the previous states.

A (stochastic) policy is a distribution over actions given states:

\pi(a \mid s) = P[A_t = a \mid S_t = s]    (3)

A policy can also be viewed as a sequence of decision rules over the time steps.
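One possible way to represent a stochastic policy in code (an illustrative sketch, not from the slides) is a table of action probabilities per state, from which actions are sampled.

    import random

    # Illustrative stochastic policy: pi[s][a] = probability of taking action a in state s.
    pi = {
        "s0": {"stay": 0.3, "move": 0.7},
        "s1": {"stay": 0.8, "move": 0.2},
    }

    def sample_action(pi, s):
        """Sample a ~ pi(.|s)."""
        acts, probs = zip(*pi[s].items())
        return random.choices(acts, weights=probs, k=1)[0]

    print(sample_action(pi, "s0"))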


Value Functions

An estimate of how good it is for an agent to be in a given state.

The goodness of a state is defined in terms of future rewards, i.e. the expected return.

The value of a state under a policy π is V^π(s): the expected return when starting in state s and following policy π.

v_\pi(s) = \mathbb{E}_\pi[G_t \mid S_t = s]    (4)

This is the same as:

v_\pi(s) = \mathbb{E}_\pi\left[ \sum_{k=0}^{\infty} \gamma^k R_{t+k+1} \;\middle|\; S_t = s \right]    (5)


Bellman Equation

Value functions satisfy a recursive relationship (dynamic programming):

v_\pi(s) = \mathbb{E}_\pi[G_t \mid S_t = s]
         = \mathbb{E}_\pi[R_{t+1} + \gamma R_{t+2} + \gamma^2 R_{t+3} + \dots \mid S_t = s]
         = \mathbb{E}_\pi[R_{t+1} + \gamma (R_{t+2} + \gamma R_{t+3} + \dots) \mid S_t = s]
         = \mathbb{E}_\pi[R_{t+1} + \gamma G_{t+1} \mid S_t = s]
         = \mathbb{E}_\pi[R_{t+1} + \gamma v_\pi(S_{t+1}) \mid S_t = s]    (6)

Value functions can therefore be decomposed into the immediate reward plus the discounted value of the successor state:

v_\pi(s) = \mathbb{E}_\pi[R_{t+1} + \gamma v_\pi(S_{t+1}) \mid S_t = s]    (7)


Bellman Equation for Value Functions

v_\pi(s) = \mathbb{E}_\pi[R_{t+1} + \gamma v_\pi(S_{t+1}) \mid S_t = s]    (8)

Equivalently, we can write the value function as

v_\pi(s) = R_s + \gamma \sum_{s' \in S} P_{ss'} \, v_\pi(s')    (9)


Bellman Equation for Value Functions

We can therefore write the Bellman equation for value functions in matrix form as

v = R + \gamma P v    (10)

This Bellman equation can be solved directly:

v = (I - \gamma P)^{-1} R    (11)

However, this direct solution is only feasible for small MDPs.

We will look at iterative methods for solving the Bellman equation (later!).
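For a small MDP under a fixed policy, this direct solution is a few lines of NumPy; the transition matrix and rewards below are made-up illustrative values.

    import numpy as np

    # Solve v = R + gamma * P v directly, i.e. v = (I - gamma P)^{-1} R,
    # for a made-up 2-state example (P is the transition matrix under a fixed policy).
    P = np.array([[0.5, 0.5],
                  [0.2, 0.8]])
    R = np.array([1.0, 0.0])  # expected immediate reward per state
    gamma = 0.9

    v = np.linalg.solve(np.eye(2) - gamma * P, R)
    print(v)

Using np.linalg.solve avoids forming the matrix inverse explicitly; either way the cost grows cubically with the number of states, which is why the direct solution is limited to small MDPs.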


Value Functions and Policy


Action-Value Function Qπ

Similar to the value function V^π, we can also define the action-value function Q^π.

It is the value of taking action a in state s under a policy π, denoted Q^π(s, a).

It is the expected return starting from state s, taking the action a, and thereafter following the policy π:

Q_\pi(s, a) = \mathbb{E}_\pi\left[ \sum_{k=0}^{\infty} \gamma^k R_{t+k+1} \;\middle|\; S_t = s, A_t = a \right]    (12)

We can therefore write the Bellman equation for Q^π(s, a):

Q_\pi(s, a) = \mathbb{E}_\pi[R_{t+1} + \gamma Q_\pi(S_{t+1}, A_{t+1}) \mid S_t = s, A_t = a]    (13)


Bellman Expectation Equations

We can write V^π in terms of Q^π:

v_\pi(s) = \sum_{a \in A} \pi(a \mid s) \, Q_\pi(s, a)    (14)

and

Q_\pi(s, a) = R_s^a + \gamma \sum_{s' \in S} P_{ss'}^a \, V_\pi(s')    (15)

Therefore, the value function is

V_\pi(s) = \sum_{a \in A} \pi(a \mid s) \left( R_s^a + \gamma \sum_{s' \in S} P_{ss'}^a \, V_\pi(s') \right)    (16)

and the action-value function is

Q_\pi(s, a) = R_s^a + \gamma \sum_{s' \in S} P_{ss'}^a \sum_{a' \in A} \pi(a' \mid s') \, Q_\pi(s', a')    (17)


Optimal Value Functions

The optimal value function gives the maximum value in every state: it is the maximum value function over all policies,

V_*(s) = \max_\pi V_\pi(s)    (18)

Similarly, we have the optimal action-value function: the maximum action-value over all policies,

Q_*(s, a) = \max_\pi Q_\pi(s, a)    (19)

The optimal value functions correspond to the best achievable performance of the agent in the environment.

The agent then behaves optimally in the environment: it follows an optimal policy.


Optimal Policy (1)

An optimal policy is a policy that is better than or equal to all other policies.

A policy π is better than another policy π′ when its expected return is greater than or equal to that of π′.

A policy π is better than π′ if and only if V_π(s) ≥ V_π′(s) for all states s.

All optimal policies π* therefore achieve the optimal V_*(s) and Q_*(s, a).


Optimal Policy (2)

If we know the optimal action-value function, we can find the optimal policy: Q_*(s, a) tells us the best achievable value in every state.

If we know Q_*(s, a), we know the optimal action to take in every state.

The optimal policy π* selects the optimal action in every state: by following π*, we reach the goal with maximum long-term reward.

\pi_*(a \mid s) =
\begin{cases}
1, & \text{if } a = \arg\max_{a \in A} Q_*(s, a) \\
0, & \text{otherwise}
\end{cases}
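Given a table of optimal action values, extracting this greedy policy is just an argmax per state; a small sketch with arbitrary Q-values:

    # Extract the greedy (deterministic) policy from a made-up table Q[s][a].
    Q = {
        "s0": {"stay": 0.1, "move": 0.8},
        "s1": {"stay": 0.5, "move": 0.2},
    }

    greedy_policy = {s: max(q, key=q.get) for s, q in Q.items()}
    print(greedy_policy)  # {'s0': 'move', 's1': 'stay'}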


Bellman Optimality Equations

The optimal value functions are also recursively related, by the Bellman optimality equations:

V_*(s) = \max_a Q_*(s, a)    (20)

The Bellman optimality equation for Q_*(s, a) is

Q_*(s, a) = R_s^a + \gamma \sum_{s' \in S} P_{ss'}^a \, V_*(s')    (21)

and V_*(s) is given as

V_*(s) = \max_a \left( R_s^a + \gamma \sum_{s' \in S} P_{ss'}^a \, V_*(s') \right)    (22)

Finally, writing Q_*(s, a) again, now in terms of itself:

Q_*(s, a) = R_s^a + \gamma \sum_{s' \in S} P_{ss'}^a \max_{a'} Q_*(s', a')    (23)


Solving Bellman Optimality Equations

However, there is no closed-form solution for the Bellman optimality equations.

If we could solve for V_*(s) and Q_*(s, a) directly, RL problems would be easy!

There is no closed form for V_*(s) and Q_*(s, a) in large MDPs.

The Bellman optimality equations are non-linear.

- We use iterative solution methods to find the optimal value functions (next few lectures...)
- Value Iteration and Policy Iteration
- TD-Learning (Q-Learning and SARSA)


Policy Evaluation and Improvement


Policy Evaluation

- Used to evaluate a given policy π
- Solve the Bellman expectation backup
- v_1 → v_2 → v_3 → ... → v_π
- Solve using synchronous backups:
  - At each iteration k + 1
  - For all states s ∈ S
  - Update v_{k+1}(s) from v_k(s')
- Iterative policy evaluation converges to v_π


Iterative Policy Evaluation

v_{k+1}(s) = \sum_{a \in A} \pi(a \mid s) \left( R_s^a + \gamma \sum_{s' \in S} P_{ss'}^a \, v_k(s') \right)    (24)

v_{k+1} = R^\pi + \gamma P^\pi v_k    (25)

- We therefore consider iterative solution methods
- We obtain a sequence of approximate value functions V_0, V_1, V_2, ...
- Each successive approximation is obtained by applying the Bellman equation for V^π
- The sequence can be shown to converge to V^π as k → ∞
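A minimal sketch of synchronous iterative policy evaluation for a tabular MDP; it assumes the illustrative P, R, gamma, and pi structures from the earlier sketches.

    def policy_evaluation(states, actions, P, R, gamma, pi, tol=1e-8):
        """Iterate v_{k+1}(s) = sum_a pi(a|s) (R(s,a) + gamma sum_s' P(s'|s,a) v_k(s'))."""
        v = {s: 0.0 for s in states}
        while True:
            delta, v_new = 0.0, {}
            for s in states:
                v_new[s] = sum(
                    pi[s][a] * (R[s][a] + gamma * sum(P[s][a][s2] * v[s2] for s2 in states))
                    for a in actions
                )
                delta = max(delta, abs(v_new[s] - v[s]))
            v = v_new  # synchronous backup: replace all values at once
            if delta < tol:
                return v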


Policy Improvement

- Given a policy π
- Evaluate the policy π:

v_\pi(s) = \mathbb{E}_\pi[R_{t+1} + \gamma R_{t+2} + \gamma^2 R_{t+3} + \dots \mid S_t = s]    (26)

- Improve the policy by acting greedily with respect to v_π:

\pi' = \text{greedy}(v_\pi)    (27)

- By iterating policy evaluation and greedy improvement, we want to find the optimal policy π*
- This process is known as policy iteration, and it converges to π*


Policy Improvement

We compute value functions of policies so that we can find better policies. For a given value function V^π(s):

- For a state s, we want to know whether or not we should change the policy
- We want to know how good it is to follow the current policy from state s, which is V^π(s)

Consider the policy improvement step → select action a in s and thereafter follow the existing policy π. In other words, we compute Q^π(s, a).

- See whether Q^π(s, a) is greater or less than V^π(s).


Policy Improvement

As before, first evaluate the policy by computing value functions:

Q_\pi(s, a) = \mathbb{E}_\pi[R_{t+1} + \gamma V_\pi(S_{t+1}) \mid S_t = s, A_t = a]    (28)

- Improve the policy by acting greedily:

\pi'(s) = \arg\max_{a \in A} Q_\pi(s, a)    (29)

- This improves the value from any state s over one step:

Q_\pi(s, \pi'(s)) = \max_{a \in A} Q_\pi(s, a) \ge Q_\pi(s, \pi(s)) = V_\pi(s)

- This therefore improves the value function:

Q_\pi(s, \pi'(s)) \ge V_\pi(s)    (30)


Policy Improvement

That acting greedily in this way yields an improved policy is the content of the policy improvement theorem:

Q_\pi(s, \pi'(s)) \ge V_\pi(s)    (31)

This means that the new policy π′ must be as good as or better than the old policy π. That is, it must obtain greater or equal expected return from all states s ∈ S:

V_{\pi'}(s) \ge V_\pi(s)    (32)


Policy Improvement

- If policy improvement stops, i.e.

Q_\pi(s, \pi'(s)) = \max_{a \in A} Q_\pi(s, a) = Q_\pi(s, \pi(s)) = V_\pi(s)    (33)

- then the Bellman optimality equation has been satisfied:

V_\pi(s) = \max_{a \in A} Q_\pi(s, a)    (34)

- This means we have reached the optimal value function:

V_\pi(s) = V_*(s)    (35)

- Therefore, π is an optimal policy: π = π*


Policy Iteration

Policy evaluation:

- Iterative policy evaluation
- Estimate V^π: for a given π, compute V^π

Policy improvement:

- Greedy policy improvement
- Generate π′ ≥ π
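Combining the two steps gives policy iteration; a compact sketch, reusing the illustrative policy_evaluation function above and the same tabular P, R, gamma structures.

    def policy_iteration(states, actions, P, R, gamma):
        """Alternate policy evaluation and greedy policy improvement until stable."""
        policy = {s: actions[0] for s in states}  # arbitrary initial deterministic policy
        while True:
            # Represent the deterministic policy as a degenerate distribution and evaluate it.
            pi = {s: {a: 1.0 if a == policy[s] else 0.0 for a in actions} for s in states}
            v = policy_evaluation(states, actions, P, R, gamma, pi)
            # Greedy improvement with respect to v.
            stable = True
            for s in states:
                q = {a: R[s][a] + gamma * sum(P[s][a][s2] * v[s2] for s2 in states)
                     for a in actions}
                best = max(q, key=q.get)
                if best != policy[s]:
                    policy[s], stable = best, False
            if stable:
                return policy, v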


Policy Iteration


Value Iteration

Instead of the two-step process of policy iteration, we can use value iteration.

- The goal is again to find the optimal policy π*
- Iteratively apply the Bellman optimality backup


Value Iteration

V_1 → V_2 → ... → V_*

- Using synchronous backups:
  - At each iteration k + 1
  - For all states s ∈ S
  - Update V_{k+1}(s) from V_k(s')
- This can be proved to converge to V_* (not included here)
- Unlike policy iteration, in value iteration there is no explicit greedy policy improvement step
- In other words, by improving the value function iteratively, the policy itself is implicitly improved
- The update is

V_{k+1}(s) = \max_{a \in A} \left( R_s^a + \gamma \sum_{s' \in S} P_{ss'}^a \, V_k(s') \right)    (36)
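A sketch of value iteration that applies the backup in (36) directly, under the same illustrative tabular representation as before.

    def value_iteration(states, actions, P, R, gamma, tol=1e-8):
        """Iterate V_{k+1}(s) = max_a (R(s,a) + gamma sum_s' P(s'|s,a) V_k(s'))."""
        v = {s: 0.0 for s in states}
        while True:
            delta, v_new = 0.0, {}
            for s in states:
                v_new[s] = max(
                    R[s][a] + gamma * sum(P[s][a][s2] * v[s2] for s2 in states)
                    for a in actions
                )
                delta = max(delta, abs(v_new[s] - v[s]))
            v = v_new
            if delta < tol:
                return v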


Model Free Prediction


Monte-Carlo Reinforcement Learning

- MC methods learn directly from episodes of experience
- MC is model-free: no knowledge of MDP transitions / rewards
- MC learns from complete episodes: no bootstrapping
- MC uses the simplest possible idea: value = mean return
- Caveat: MC can only be applied to episodic MDPs (i.e. all episodes must terminate)


Monte-Carlo Policy Evaluation

- Goal: learn V^π from episodes of experience under policy π
- Recall that the return is the total discounted reward:

G_t = R_{t+1} + \gamma R_{t+2} + \dots + \gamma^{T-1} R_T    (37)

- Recall that the value function is the expected return:

V_\pi(s) = \mathbb{E}_\pi[G_t \mid S_t = s]    (38)

- Monte-Carlo policy evaluation uses the empirical mean return instead of the expected return
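A sketch of every-visit Monte-Carlo policy evaluation from complete episodes; the episode format (a list of (state, reward) pairs collected while following π) is an assumption made for illustration.

    from collections import defaultdict

    def mc_policy_evaluation(episodes, gamma):
        """Every-visit MC: estimate V(s) as the empirical mean return following visits to s.
        Each episode is a list of (state, reward) pairs, the reward being R_{t+1}."""
        returns_sum = defaultdict(float)
        returns_count = defaultdict(int)
        for episode in episodes:
            g = 0.0
            # Walk the episode backwards so g is the return from each time step.
            for state, reward in reversed(episode):
                g = reward + gamma * g
                returns_sum[state] += g
                returns_count[state] += 1
        return {s: returns_sum[s] / returns_count[s] for s in returns_sum}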


Temporal-Difference Learning

- TD methods learn directly from episodes of experience
- TD is model-free: no knowledge of MDP transitions / rewards
- TD learns from incomplete episodes, by bootstrapping
- TD updates a guess towards a guess


MC and TD

- Goal: learn V^π online from experience under policy π
- Incremental every-visit Monte-Carlo:
  - Update value V(s_t) towards the actual return G_t

V(s_t) ← V(s_t) + α (G_t − V(s_t))    (39)

- Simplest temporal-difference learning algorithm: TD(0)
  - Update value V(s_t) towards the estimated return R_{t+1} + γ V(s_{t+1})
  - V(s_t) ← V(s_t) + α (R_{t+1} + γ V(s_{t+1}) − V(s_t))
  - R_{t+1} + γ V(s_{t+1}) is called the TD target
  - δ_t = R_{t+1} + γ V(s_{t+1}) − V(s_t) is called the TD error
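A sketch of the TD(0) update applied to a single transition; the (state, reward, next_state) format and the step size α are assumptions for illustration.

    def td0_update(V, transition, alpha, gamma):
        """One TD(0) step: move V(s_t) towards the TD target R_{t+1} + gamma * V(s_{t+1})."""
        s, r, s_next = transition
        td_target = r + gamma * V[s_next]
        td_error = td_target - V[s]      # delta_t
        V[s] += alpha * td_error
        return V

    # Illustrative usage with made-up states and rewards.
    V = {"s0": 0.0, "s1": 0.0}
    V = td0_update(V, ("s0", 1.0, "s1"), alpha=0.1, gamma=0.9)
    print(V)  # {'s0': 0.1, 's1': 0.0}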


Advantages and Disadvantages of MC vs TD

- TD can learn before knowing the final outcome
  - TD can learn online after every step
  - MC must wait until the end of the episode before the return is known
- TD can learn without the final outcome
  - TD can learn from incomplete sequences
  - MC can only learn from complete sequences
  - TD works in continuing (non-terminating) environments
  - MC only works for episodic (terminating) environments


Monte-Carlo Backup


Temporal-Difference Backup


Model Free Control


On and Off-Policy Learning


ε-Greedy Exploration


MC vs TD Control


Updating Action-Value Functions with Sarsa


On-Policy Control with Sarsa


Sarsa Algorithm for On-Policy Control


Off-Policy Learning


Q-Learning


Off-Policy Control with Q-Learning


Q-Learning Control Algorithm


Q-Learning Algorithm for Off-Policy Control


Questions
