Introduction to Reinforcement Learning
Riashat Islam
Branches of Machine Learning: Reinforcement Learning
Motivations
- Goal-directed learning
- Learning from interaction with our surroundings
- What to do to achieve goals
- Foundational idea of learning and intelligence
- Computational approach to learning from interaction
Reinforcement Learning: Applications
- AlphaGo in the game of Go
- Playing Atari games
- Beating the world champion in Backgammon
- Helicopter manoeuvres
Applications: Game of Go - Google DeepMind
https://www.youtube.com/watch?v=SUbqykXVx0A
https://www.youtube.com/watch?v=g-dKXOlsf98
Applications: Playing Atari Games - Google DeepMind
https://www.youtube.com/watch?v=V1eYniJ0Rnk
Applications: Helicopter Manoeuvres - Stanford
https://www.youtube.com/watch?v=VCdxqn0fcnE
Applications: Robotics - Humanoid Robots
https://www.youtube.com/watch?v=370cT-OAzzM
Agent and Environment
- A learning agent interacts with the environment
- Learns what to do in order to maximize reward
- Discovers which actions yield the most reward
- Actions may affect long-term rewards
- Trade-off: prefer actions already tried, or explore new actions
Key Ideas in RL
- Markov Decision Processes (MDPs)
- Rewards
- Policy of the agent
- Value functions
- Environment model and dynamics
Sequential Decision Process and MDPs
- Decision-making to maximize utility/satisfaction
- Select actions so as to achieve goals
- The goal is to maximize the expected cumulative reward
Goals and Rewards
- A reward signal R_t
- A feedback signal indicating the agent's performance or behaviour
- A sequence of actions, states, and rewards
- Maximize the cumulative reward of the agent

All goals can be described by the maximisation of the expected cumulative reward.

- Do you agree with this statement?
Policy
- Describes the agent's behaviour
- A mapping from states to actions
- A sequence of actions at each state to achieve the goal
- Find the optimal policy: the optimal behaviour of the agent
Value Functions
- Estimate how good it is for an agent to be in the current state
- Goodness is defined in terms of expected future rewards
- Value functions are defined with respect to policies/behaviour
Model - Environment Dynamics
- The environment in which the agent is located
- Defines the state transitions
- The states the agent can reach from the current state
- The next immediate reward the agent can receive
- Model-free: unknown environment dynamics
- Model-based: known environment dynamics
Summary
- The agent interacts with the environment
- The agent improves its policy from experience
- Find the optimal policy to maximize cumulative reward
- The policy can also be defined in terms of value functions
- The agent may be in an unknown or known environment
Markov Decision Processes
Introduction to MDPs
- A model for sequential decision making under uncertainty
- Used to formulate RL problems
- Describes the environment of the RL agent
- An extension of decision theory to long-term plans of actions
MDP Framework
- MDPs are discrete-time state transition systems
- An MDP is described by 5 components (sketched in code below):
  - States: the state of the system that the decision maker observes when a decision has to be made
  - Actions: an action chosen from the action set available in the current state
  - Transition probabilities P(s_{t+1} | s_t, a_t): describe the dynamics of the world
  - Reward R(s) or R(s, a): a real-valued reward that may depend on the state, or on both state and action
  - Discount factor \gamma \in [0, 1]
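To make these components concrete, here is a minimal sketch of a finite MDP stored as plain arrays in Python; the two-state numbers are made up for illustration, not taken from the lecture.

```python
import numpy as np

# A finite MDP with |S| states and |A| actions, stored as arrays:
#   P[a, s, s2] = transition probability P(s_{t+1} = s2 | s_t = s, a_t = a)
#   R[s, a]     = expected immediate reward for taking action a in state s
n_states, n_actions = 2, 2
gamma = 0.9  # discount factor in [0, 1]

P = np.array([
    # action 0
    [[0.8, 0.2],
     [0.1, 0.9]],
    # action 1
    [[0.5, 0.5],
     [0.3, 0.7]],
])
R = np.array([
    [1.0, 0.0],   # rewards in state 0 for actions 0, 1
    [0.0, 2.0],   # rewards in state 1 for actions 0, 1
])

# Sanity check: each row of each per-action transition matrix sums to 1.
assert np.allclose(P.sum(axis=2), 1.0)
```

Storing one transition matrix per action makes the Bellman backups later in this lecture one-line matrix expressions.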
Markov Property
A state s_t is Markov if and only if:

P(s_{t+1} | s_t) = P(s_{t+1} | s_1, s_2, \ldots, s_t)   (1)

This means the state captures all the relevant information from the history.

The state transition probability P(s_{t+1} | s_t) defines the probability of moving from the current state s_t to the next (successor) state s_{t+1}.

The conditional probability distribution of future states depends only on the present state, not on the sequence of events that preceded it.
Graphical Model for MDPs
Return
The return G_t is the total discounted reward from time step t:

G_t = R_{t+1} + \gamma R_{t+2} + \cdots = \sum_{k=0}^{T} \gamma^k R_{t+k+1}   (2)

A reward R received after k + 1 time steps is discounted to \gamma^k R.

The discount factor \gamma gives more weight to immediate rewards than to delayed rewards.
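As a small worked example, the return can be accumulated backwards through an episode, using G_t = R_{t+1} + \gamma G_{t+1}; the reward sequence below is arbitrary.

```python
def discounted_return(rewards, gamma):
    """Compute G_t = sum_k gamma^k * R_{t+k+1} for a finite reward sequence."""
    g = 0.0
    # Walk backwards: G_t = R_{t+1} + gamma * G_{t+1}.
    for r in reversed(rewards):
        g = r + gamma * g
    return g

print(discounted_return([1.0, 0.0, 2.0], gamma=0.9))  # 1 + 0.9*0 + 0.81*2 = 2.62
```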
Policy
A policy is a mapping from states to actions: at every state s_t, the agent takes the action a_t prescribed by the policy \pi(a|s).

The policy defines the behaviour of the agent.

Due to the Markov property, the choice of action depends only on the current state s_t, not on any of the previous states.

A policy is a distribution over actions given states:

\pi(a|s) = P[A_t = a \mid S_t = s]   (3)

A policy can also be viewed as a sequence of decision rules over the time steps.
Value Functions
A value function estimates how good it is for an agent to be in the current state.

The goodness of a state is defined in terms of future rewards: the expected return.

The value of a state s under a policy \pi is v_\pi(s), the expected return when starting in state s and following policy \pi:

v_\pi(s) = \mathbb{E}_\pi[G_t \mid S_t = s]   (4)

This is the same as:

v_\pi(s) = \mathbb{E}_\pi\!\left[ \sum_{k=0}^{\infty} \gamma^k R_{t+k+1} \,\middle|\, S_t = s \right]   (5)
Bellman Equation
Value functions satisfy a recursive relationship (dynamic programming):

v_\pi(s) = \mathbb{E}_\pi[G_t \mid S_t = s]
         = \mathbb{E}_\pi[R_{t+1} + \gamma R_{t+2} + \gamma^2 R_{t+3} + \cdots \mid S_t = s]
         = \mathbb{E}_\pi[R_{t+1} + \gamma (R_{t+2} + \gamma R_{t+3} + \cdots) \mid S_t = s]
         = \mathbb{E}_\pi[R_{t+1} + \gamma G_{t+1} \mid S_t = s]
         = \mathbb{E}_\pi[R_{t+1} + \gamma v_\pi(S_{t+1}) \mid S_t = s]   (6)

A value function can therefore be decomposed into the immediate reward plus the discounted value of the successor state:

v_\pi(s) = \mathbb{E}_\pi[R_{t+1} + \gamma v_\pi(S_{t+1}) \mid S_t = s]   (7)
Bellman Equation for Value Functions
v_\pi(s) = \mathbb{E}_\pi[R_{t+1} + \gamma v_\pi(S_{t+1}) \mid S_t = s]   (8)

Equivalently, we can write the value function as

v_\pi(s) = R_s + \gamma \sum_{s' \in S} P_{ss'} v_\pi(s')   (9)
Bellman Equation for Value Functions
In matrix form, the Bellman equation for value functions is

v = R + \gamma P v   (10)

This Bellman equation can be solved directly:

v = (I - \gamma P)^{-1} R   (11)

However, this direct solution is only feasible for small MDPs. We will look into iterative methods for solving the Bellman equation (later!).
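For a small MDP, equation (11) is a few lines of numpy. This sketch assumes a fixed policy whose induced transition matrix P and reward vector R are given; the numbers are made up.

```python
import numpy as np

# Made-up 3-state example: P[s, s2] and R[s] under some fixed policy.
P = np.array([[0.5, 0.5, 0.0],
              [0.0, 0.5, 0.5],
              [0.0, 0.0, 1.0]])
R = np.array([1.0, 2.0, 0.0])
gamma = 0.9

# v = (I - gamma * P)^{-1} R: solve the linear system rather than inverting.
v = np.linalg.solve(np.eye(3) - gamma * P, R)
print(v)
```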
Value Functions and Policy
Action-Value Function Q_\pi

Similar to the value function v_\pi, we can also define the action-value function Q_\pi.

Q_\pi(s, a) is the value of taking action a in state s under a policy \pi: the expected return starting from state s, taking the action a, and thereafter following the policy \pi.

Q_\pi(s, a) = \mathbb{E}_\pi\!\left[ \sum_{k=0}^{\infty} \gamma^k R_{t+k+1} \,\middle|\, S_t = s, A_t = a \right]   (12)

We can therefore write the Bellman equation for Q_\pi(s, a):

Q_\pi(s, a) = \mathbb{E}_\pi[R_{t+1} + \gamma Q_\pi(S_{t+1}, A_{t+1}) \mid S_t = s, A_t = a]   (13)
Bellman Expectation Equations
We can write v_\pi in terms of Q_\pi:

v_\pi(s) = \sum_{a \in A} \pi(a|s) Q_\pi(s, a)   (14)

and

Q_\pi(s, a) = R_s^a + \gamma \sum_{s' \in S} P_{ss'}^a v_\pi(s')   (15)

Therefore, the value function is

v_\pi(s) = \sum_{a \in A} \pi(a|s) \left( R_s^a + \gamma \sum_{s' \in S} P_{ss'}^a v_\pi(s') \right)   (16)

and the action-value function is

Q_\pi(s, a) = R_s^a + \gamma \sum_{s' \in S} P_{ss'}^a \sum_{a' \in A} \pi(a'|s') Q_\pi(s', a')   (17)
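Equation (14) is just an expectation over actions. A quick sketch with made-up tables Q[s, a] and pi[s, a], not values from the slides:

```python
import numpy as np

Q = np.array([[1.0, 2.0],
              [0.5, 0.0]])        # Q[s, a]
pi = np.array([[0.3, 0.7],
               [0.9, 0.1]])       # pi[s, a] = probability of action a in state s

# v_pi(s) = sum_a pi(a|s) * Q_pi(s, a): a row-wise expectation.
v = (pi * Q).sum(axis=1)
print(v)  # [0.3*1 + 0.7*2, 0.9*0.5 + 0.1*0] = [1.7, 0.45]
```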
Optimal Value Functions
The optimal value function gives the maximum value in every state: the maximum value function over all policies.

v_*(s) = \max_\pi v_\pi(s)   (18)

Similarly, we have the optimal action-value function: the maximum action-value over all policies.

Q_*(s, a) = \max_\pi Q_\pi(s, a)   (19)

The optimal value functions correspond to the best possible performance of the agent in the environment. An agent that behaves optimally in the environment follows an optimal policy.
Optimal Policy (1)
An optimal policy is a policy that is better than or equal to all other policies.

A policy \pi is better than another policy \pi' when its expected return is greater than or equal to that of \pi' in every state: \pi \geq \pi' if and only if v_\pi(s) \geq v_{\pi'}(s) for all s.

All optimal policies \pi_* therefore achieve the optimal value functions v_*(s) and Q_*(s, a).
Optimal Policy (2)
If we know the optimal action-value function, we can find the optimal policy: Q_*(s, a) gives the best achievable value in every state.

If we know Q_*(s, a), we know the optimal action to take at every state.

The optimal policy \pi_* selects the optimal action at every state; following \pi_*, we can achieve the goal state with maximum long-term reward:

\pi_*(a|s) = \begin{cases} 1 & \text{if } a = \arg\max_{a \in A} Q_*(s, a) \\ 0 & \text{otherwise} \end{cases}
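If Q_* were available as a table, extracting this greedy policy is one argmax per state; a sketch with a made-up table:

```python
import numpy as np

Q_star = np.array([[1.0, 3.0],
                   [2.5, 0.5]])   # Q_star[s, a]

# pi_*(s) puts probability 1 on argmax_a Q_*(s, a).
pi_star = Q_star.argmax(axis=1)
print(pi_star)  # optimal action index for each state: [1, 0]
```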
Bellman Optimality Equations
The optimal value functions are also recursively related by the Bellman optimality equations:

v_*(s) = \max_a Q_*(s, a)   (20)

The Bellman optimality equation for Q_*(s, a) is

Q_*(s, a) = R_s^a + \gamma \sum_{s' \in S} P_{ss'}^a v_*(s')   (21)

and v_*(s) is given as

v_*(s) = \max_a \left( R_s^a + \gamma \sum_{s' \in S} P_{ss'}^a v_*(s') \right)   (22)

Finally, writing Q_*(s, a) in terms of itself:

Q_*(s, a) = R_s^a + \gamma \sum_{s' \in S} P_{ss'}^a \max_{a'} Q_*(s', a')   (23)
Solving Bellman Optimality Equations
However, there is no closed-form solution of the Bellman optimality equations in general.

If we could solve for v_*(s) and Q_*(s, a) directly, RL problems would be easy!

There is no closed form of v_*(s) and Q_*(s, a) for large MDPs, and the Bellman optimality equations are non-linear.

- We use iterative solution methods to find optimal value functions (next few lectures...)
- Value iteration and policy iteration
- TD-learning (Q-Learning and SARSA)
Policy Evaluation and Improvement
Policy Evaluation
- Used to evaluate a given policy \pi
- Solve the Bellman expectation backup
- v_1 \to v_2 \to v_3 \to \cdots \to v_\pi
- Solve using synchronous backups: at each iteration k + 1, for all states s \in S, update v_{k+1}(s) from v_k(s')
- Iterative policy evaluation converges to v_\pi
Iterative Policy Evaluation
v_{k+1}(s) = \sum_{a \in A} \pi(a|s) \left( R_s^a + \gamma \sum_{s' \in S} P_{ss'}^a v_k(s') \right)   (24)

or, in matrix form,

v_{k+1} = R^\pi + \gamma P^\pi v_k   (25)

- We therefore consider iterative solution methods, as sketched below
- A sequence of approximate value functions v_0, v_1, v_2, ...
- Each successive approximation is obtained by applying the Bellman equation for v_\pi
- The sequence can be shown to converge to v_\pi as k \to \infty
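A minimal sketch of iterative policy evaluation implementing equation (24); it assumes the array layout P[a, s, s'] and R[s, a] from the earlier MDP sketch.

```python
import numpy as np

def policy_evaluation(P, R, pi, gamma, tol=1e-8):
    """Iterative policy evaluation with synchronous backups.

    P[a, s, s2]: transition probabilities, R[s, a]: rewards,
    pi[s, a]: action probabilities under the policy being evaluated.
    """
    v = np.zeros(R.shape[0])
    while True:
        # q[s, a] = R(s, a) + gamma * sum_{s'} P(s'|s, a) v_k(s')
        q = R + gamma * np.einsum("ast,t->sa", P, v)
        # v_{k+1}(s) = sum_a pi(a|s) q[s, a]
        v_new = (pi * q).sum(axis=1)
        if np.max(np.abs(v_new - v)) < tol:
            return v_new
        v = v_new

# Example with the 2-state MDP sketched earlier and a uniform random policy.
P = np.array([[[0.8, 0.2], [0.1, 0.9]],
              [[0.5, 0.5], [0.3, 0.7]]])
R = np.array([[1.0, 0.0], [0.0, 2.0]])
pi = np.full((2, 2), 0.5)
print(policy_evaluation(P, R, pi, gamma=0.9))
```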
Policy Improvement
- Given a policy \pi
- Evaluate the policy \pi:

v_\pi(s) = \mathbb{E}_\pi[R_{t+1} + \gamma R_{t+2} + \gamma^2 R_{t+3} + \cdots \mid S_t = s]   (26)

- Improve the policy by acting greedily with respect to v_\pi:

\pi' = \text{greedy}(v_\pi)   (27)

- By iterating policy improvement, we want to find the optimal policy \pi_*
- This process is known as policy iteration, and it converges to \pi_*
Policy Improvement
We compute value functions of policies so that we can find better policies. For a given value function v(s):

- For state s, we want to know whether or not to change the policy
- We want to know how good it is to follow the current policy from state s, which is v_\pi(s)

Consider a policy improvement step: select action a in s and thereafter follow the existing policy \pi. In other words, compute Q_\pi(s, a).

- Then check whether Q_\pi(s, a) is greater or smaller than v_\pi(s)
Policy Improvement

As before, first evaluate the policy by computing value functions:

Q_\pi(s, a) = \mathbb{E}_\pi[R_{t+1} + \gamma v_\pi(S_{t+1}) \mid S_t = s, A_t = a]   (28)

- Improve the policy by acting greedily:

\pi'(s) = \arg\max_{a \in A} Q_\pi(s, a)   (29)

- This improves the value from any state s over one step:

Q_\pi(s, \pi'(s)) = \max_{a \in A} Q_\pi(s, a) \geq Q_\pi(s, \pi(s))

- It therefore improves the value function:

Q_\pi(s, \pi'(s)) \geq v_\pi(s)   (30)
Policy Improvement
This improvement in policy by acting greedily is the content of the policy improvement theorem:

Q_\pi(s, \pi'(s)) \geq v_\pi(s)   (31)

This means that the new policy \pi' must be as good as or better than the old policy \pi, i.e., it must obtain greater or equal expected return from all states s \in S:

v_{\pi'}(s) \geq v_\pi(s)   (32)
Policy Improvement
- If the policy improvement stops, i.e.,

Q_\pi(s, \pi'(s)) = \max_{a \in A} Q_\pi(s, a) = Q_\pi(s, \pi(s)) = v_\pi(s)   (33)

- then the Bellman optimality equation has been satisfied:

v_\pi(s) = \max_{a \in A} Q_\pi(s, a)   (34)

- This means we have reached the optimal value function:

v_\pi(s) = v_*(s)   (35)

- Therefore, \pi is the optimal policy: \pi = \pi_*
Policy Iteration
Policy evaluation:
- Iterative policy evaluation
- Estimate v_\pi: for a given \pi, compute v_\pi

Policy improvement:
- Greedy policy improvement
- Generate \pi' \geq \pi

The two steps are combined in the sketch below.
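Putting the two steps together gives the following sketch of policy iteration, under the same assumed array layout as the earlier snippets; evaluation here solves the linear system directly, which is fine for small MDPs.

```python
import numpy as np

def policy_iteration(P, R, gamma):
    """Alternate policy evaluation and greedy improvement until the policy is stable."""
    n_states, n_actions = R.shape
    policy = np.zeros(n_states, dtype=int)      # deterministic: one action per state
    while True:
        # Policy evaluation: solve v = R_pi + gamma * P_pi v directly.
        R_pi = R[np.arange(n_states), policy]
        P_pi = P[policy, np.arange(n_states)]
        v = np.linalg.solve(np.eye(n_states) - gamma * P_pi, R_pi)
        # Policy improvement: act greedily with respect to v.
        q = R + gamma * np.einsum("ast,t->sa", P, v)
        new_policy = q.argmax(axis=1)
        if np.array_equal(new_policy, policy):
            return policy, v
        policy = new_policy
```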
Value Iteration
Instead of the two-step process of policy iteration, we can instead use value iteration.

- The goal is again to find the optimal policy \pi_*
- Iteratively compute the Bellman optimality backup
Value Iteration
v_1 \to v_2 \to \cdots \to v_*

- Using synchronous backups: at each iteration k + 1, for all states s \in S, update v_{k+1}(s) from v_k(s')
- This can be proved to converge to v_* (not included here)
- Unlike policy iteration, in value iteration there is no explicit greedy policy improvement step
- In other words, by improving the value function iteratively, the policy itself is improved

v_{k+1}(s) = \max_{a \in A} \left( R_s^a + \gamma \sum_{s' \in S} P_{ss'}^a v_k(s') \right)   (36)
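Equation (36) translates directly into a loop; a minimal sketch under the same array conventions as before.

```python
import numpy as np

def value_iteration(P, R, gamma, tol=1e-8):
    """Iterate v_{k+1}(s) = max_a (R(s,a) + gamma * sum_{s'} P(s'|s,a) v_k(s'))."""
    v = np.zeros(R.shape[0])
    while True:
        # q[s, a] = one-step lookahead value of each action
        q = R + gamma * np.einsum("ast,t->sa", P, v)
        v_new = q.max(axis=1)
        if np.max(np.abs(v_new - v)) < tol:
            # The greedy policy with respect to the converged values is optimal.
            return v_new, q.argmax(axis=1)
        v = v_new
```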
Model-Free Prediction
Monte-Carlo Reinforcement Learning
- MC methods learn directly from episodes of experience
- MC is model-free: no knowledge of MDP transitions/rewards is required
- MC learns from complete episodes: no bootstrapping
- MC uses the simplest idea: value = mean return
- Caveat: MC can only be applied to episodic MDPs (i.e., all episodes must terminate)
Monte-Carlo Policy Evaluation
- Goal: learn v_\pi from episodes of experience under policy \pi
- Recall that the return is the total discounted reward:

G_t = R_{t+1} + \gamma R_{t+2} + \cdots + \gamma^{T-1} R_T   (37)

- Recall that the value function is the expected return:

v_\pi(s) = \mathbb{E}_\pi[G_t \mid S_t = s]   (38)

- Monte-Carlo policy evaluation uses the empirical mean return instead of the expected return
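A sketch of every-visit Monte-Carlo evaluation from sampled episodes; the episode format of (state, reward) pairs below is an assumed convention, not one fixed by the slides.

```python
from collections import defaultdict

def mc_policy_evaluation(episodes, gamma):
    """Every-visit MC: V(s) = empirical mean of returns observed from s.

    episodes: list of episodes, each a list of (state, reward) pairs,
    where reward is the reward received after leaving that state.
    """
    returns_sum = defaultdict(float)
    returns_count = defaultdict(int)
    for episode in episodes:
        g = 0.0
        # Walk the episode backwards, accumulating the discounted return.
        for state, reward in reversed(episode):
            g = reward + gamma * g
            returns_sum[state] += g
            returns_count[state] += 1
    return {s: returns_sum[s] / returns_count[s] for s in returns_sum}

# Two tiny hand-made episodes over states "A" and "B".
episodes = [[("A", 1.0), ("B", 0.0)], [("B", 2.0)]]
print(mc_policy_evaluation(episodes, gamma=0.9))
```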
Temporal-Difference Learning
- TD methods learn directly from episodes of experience
- TD is model-free: no knowledge of MDP transitions/rewards is required
- TD learns from incomplete episodes, by bootstrapping
- TD updates a guess towards a guess
MC and TD
- Goal: learn v_\pi online from experience under policy \pi
- Incremental every-visit Monte-Carlo: update value V(s_t) towards the actual return G_t

V(s_t) \leftarrow V(s_t) + \alpha (G_t - V(s_t))   (39)

- Simplest temporal-difference learning algorithm, TD(0): update value V(s_t) towards the estimated return R_{t+1} + \gamma V(s_{t+1})

V(s_t) \leftarrow V(s_t) + \alpha (R_{t+1} + \gamma V(s_{t+1}) - V(s_t))

- R_{t+1} + \gamma V(s_{t+1}) is called the TD target
- \delta_t = R_{t+1} + \gamma V(s_{t+1}) - V(s_t) is called the TD error
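The TD(0) update written as code; a sketch in which the stream of (s, r, s', done) transitions is an assumed interface.

```python
from collections import defaultdict

def td0_evaluation(transitions, gamma, alpha):
    """Tabular TD(0): V(s) <- V(s) + alpha * (r + gamma * V(s') - V(s))."""
    V = defaultdict(float)
    for s, r, s_next, done in transitions:
        # Bootstrapped target; terminal transitions contribute no future value.
        target = r + (0.0 if done else gamma * V[s_next])
        td_error = target - V[s]
        V[s] += alpha * td_error
    return dict(V)

transitions = [("A", 1.0, "B", False), ("B", 0.0, "A", True)]
print(td0_evaluation(transitions, gamma=0.9, alpha=0.5))
```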
Advantages and Disadvantages of MC vs TD
- TD can learn before knowing the final outcome:
  - TD can learn online after every step
  - MC must wait until the end of the episode before the return is known
- TD can learn without the final outcome:
  - TD can learn from incomplete sequences
  - MC can only learn from complete sequences
  - TD works in continuing (non-terminating) environments
  - MC only works for episodic (terminating) environments
Monte-Carlo Backup
Temporal-Difference Backup
Model-Free Control
On and Off-Policy Learning
ε-Greedy Exploration
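ε-greedy exploration balances trying new actions with exploiting current estimates: with probability ε take a random action, otherwise take the greedy one. A minimal sketch over a tabular Q (the array shape is assumed):

```python
import numpy as np

def epsilon_greedy(Q, state, epsilon, rng):
    """Pick a random action with probability epsilon, else the greedy action."""
    n_actions = Q.shape[1]
    if rng.random() < epsilon:
        return int(rng.integers(n_actions))   # explore
    return int(Q[state].argmax())             # exploit

rng = np.random.default_rng(0)
Q = np.zeros((2, 2))
print(epsilon_greedy(Q, state=0, epsilon=0.1, rng=rng))
```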
MC vs TD Control
Updating Action-Value Functions with Sarsa
On-Policy Control with Sarsa
Sarsa Algorithm for On-Policy Control
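A minimal sketch of tabular SARSA, the on-policy control algorithm named above; the `env` object with `reset()` and `step(a)` returning (next_state, reward, done) is a hypothetical interface, not one from the lecture.

```python
import numpy as np

def sarsa(env, n_states, n_actions, n_episodes, gamma=0.9, alpha=0.1, epsilon=0.1, seed=0):
    """On-policy TD control: Q(s,a) <- Q(s,a) + alpha*(r + gamma*Q(s',a') - Q(s,a))."""
    rng = np.random.default_rng(seed)
    Q = np.zeros((n_states, n_actions))

    def behaviour(s):
        # epsilon-greedy with respect to the current Q estimates
        if rng.random() < epsilon:
            return int(rng.integers(n_actions))
        return int(Q[s].argmax())

    for _ in range(n_episodes):
        s = env.reset()
        a = behaviour(s)
        done = False
        while not done:
            s_next, r, done = env.step(a)       # hypothetical env interface
            a_next = behaviour(s_next)
            # SARSA target uses the action actually selected next (on-policy).
            target = r if done else r + gamma * Q[s_next, a_next]
            Q[s, a] += alpha * (target - Q[s, a])
            s, a = s_next, a_next
    return Q
```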
Off-Policy Learning
Q-Learning
Off-Policy Control with Q-Learning
Q-Learning Control Algorithm
Q-Learning Algorithm for Off-Policy Control
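A minimal sketch of tabular Q-learning, using the same hypothetical `env` interface as the SARSA sketch; the only change is the max over next actions in the target, which is what makes it off-policy.

```python
import numpy as np

def q_learning(env, n_states, n_actions, n_episodes, gamma=0.9, alpha=0.1, epsilon=0.1, seed=0):
    """Off-policy TD control: Q(s,a) <- Q(s,a) + alpha*(r + gamma*max_a' Q(s',a') - Q(s,a))."""
    rng = np.random.default_rng(seed)
    Q = np.zeros((n_states, n_actions))
    for _ in range(n_episodes):
        s = env.reset()
        done = False
        while not done:
            # Behaviour policy: epsilon-greedy. Target policy: greedy (the max below).
            if rng.random() < epsilon:
                a = int(rng.integers(n_actions))
            else:
                a = int(Q[s].argmax())
            s_next, r, done = env.step(a)       # hypothetical env interface
            target = r if done else r + gamma * Q[s_next].max()
            Q[s, a] += alpha * (target - Q[s, a])
            s = s_next
    return Q
```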
Questions