Introduction to Reinforcement Learning
Riashat Islam
Branches of Machine Learning: Reinforcement Learning
Motivations
- Goal-directed learning
- Learning from interaction with our surroundings
- What to do to achieve goals
- Foundational idea of learning and intelligence
- Computational approach to learning from interaction
Reinforcement Learning: Applications
- AlphaGo in the game of Go
- Playing Atari games
- Beating the world champion in Backgammon
- Helicopter manoeuvres
Applications: Game of Go - Google DeepMind
https://www.youtube.com/watch?v=SUbqykXVx0A
https://www.youtube.com/watch?v=g-dKXOlsf98
Applications: Playing Atari Games - Google DeepMind
https://www.youtube.com/watch?v=V1eYniJ0Rnk
Applications: Helicopter Manoeuvres - Stanford
https://www.youtube.com/watch?v=VCdxqn0fcnE
Applications: Robotics - Humanoid Robots
https://www.youtube.com/watch?v=370cT-OAzzM
Agent and Environment
- A learning agent interacts with the environment
- Learns what to do in order to maximize reward
- Discovers which actions yield the most reward
- Actions may affect long-term rewards
- Trade-off: prefer actions already tried, or explore new actions
Key Ideas in RL
- Markov Decision Processes (MDPs)
- Rewards
- Policy of the agent
- Value functions
- Environment model and dynamics
Sequential Decision Process and MDPs
- Decision-making to maximize utility/satisfaction
- Select actions so as to achieve goals
- The goal is to maximize the expected cumulative reward
Goals and Rewards
- A reward signal R_t
- A feedback signal indicating the agent's performance or behaviour
- A sequence of actions, states, and rewards
- Maximize the cumulative reward of the agent

All goals can be described by the maximisation of the expected cumulative reward.

- Do you agree with this statement?
Policy
- Describes the agent's behaviour
- A mapping from states to actions
- A sequence of actions at each state to achieve the goal
- Find the optimal policy: the optimal behaviour of the agent
Value Functions
- Estimate how good it is for an agent to be in the current state
- Goodness is defined in terms of expected future rewards
- Value functions are defined with respect to policies/behaviour
Model - Environment Dynamics
- The environment in which the agent is located
- Defines the state transitions
- The states the agent can reach from the current state
- The next immediate reward the agent can receive
- Model-free: unknown environment dynamics
- Model-based: known environment dynamics
Summary
- The agent interacts with the environment
- The agent improves its policy from experience
- Find the optimal policy to maximize cumulative reward
- The policy can also be defined in terms of value functions
- The agent may be in an unknown or known environment
Markov Decision Processes
Introduction to MDPs
- A model for sequential decision making under uncertainty
- Used to formulate RL problems
- Describes the environment of the RL agent
- An extension of decision theory to long-term plans of actions
MDP Framework
- MDPs are discrete-time state transition systems
- An MDP is described by 5 components (sketched in code below):
  - States: the state of the system that the decision maker observes when a decision has to be made
  - Actions: an action chosen from the action set available in the current state
  - Transition probabilities P(s_{t+1} | s_t, a_t): describe the dynamics of the world
  - Reward R(s) or R(s, a): a real-valued reward that may depend on the state, or on both state and action
  - Discount factor \gamma \in [0, 1]
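To make these components concrete, here is a minimal sketch of a finite MDP stored as plain arrays in Python; the two-state numbers are made up for illustration, not taken from the lecture.

```python
import numpy as np

# A finite MDP with |S| states and |A| actions, stored as arrays:
#   P[a, s, s2] = transition probability P(s_{t+1} = s2 | s_t = s, a_t = a)
#   R[s, a]     = expected immediate reward for taking action a in state s
n_states, n_actions = 2, 2
gamma = 0.9  # discount factor in [0, 1]

P = np.array([
    # action 0
    [[0.8, 0.2],
     [0.1, 0.9]],
    # action 1
    [[0.5, 0.5],
     [0.3, 0.7]],
])
R = np.array([
    [1.0, 0.0],   # rewards in state 0 for actions 0, 1
    [0.0, 2.0],   # rewards in state 1 for actions 0, 1
])

# Sanity check: each row of each per-action transition matrix sums to 1.
assert np.allclose(P.sum(axis=2), 1.0)
```

Storing one transition matrix per action makes the Bellman backups later in this lecture one-line matrix expressions.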
Markov Property
A state s_t is Markov if and only if:

P(s_{t+1} | s_t) = P(s_{t+1} | s_1, s_2, \ldots, s_t)   (1)

This means the state captures all the relevant information from the history.

The state transition probability P(s_{t+1} | s_t) defines the probability of moving from the current state s_t to the next (successor) state s_{t+1}.

The conditional probability distribution of future states depends only on the present state, not on the sequence of events that preceded it.
Graphical Model for MDPs
Return
The return G_t is the total discounted reward from time step t:

G_t = R_{t+1} + \gamma R_{t+2} + \cdots = \sum_{k=0}^{T} \gamma^k R_{t+k+1}   (2)

A reward R received after k + 1 time steps is discounted to \gamma^k R.

The discount factor \gamma gives more weight to immediate rewards than to delayed rewards.
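As a small worked example, the return can be accumulated backwards through an episode, using G_t = R_{t+1} + \gamma G_{t+1}; the reward sequence below is arbitrary.

```python
def discounted_return(rewards, gamma):
    """Compute G_t = sum_k gamma^k * R_{t+k+1} for a finite reward sequence."""
    g = 0.0
    # Walk backwards: G_t = R_{t+1} + gamma * G_{t+1}.
    for r in reversed(rewards):
        g = r + gamma * g
    return g

print(discounted_return([1.0, 0.0, 2.0], gamma=0.9))  # 1 + 0.9*0 + 0.81*2 = 2.62
```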
Policy
A policy is a mapping from states to actions: at every state s_t, the agent takes the action a_t prescribed by the policy \pi(a|s).

The policy defines the behaviour of the agent.

Due to the Markov property, the choice of action depends only on the current state s_t, not on any of the previous states.

A policy is a distribution over actions given states:

\pi(a|s) = P[A_t = a \mid S_t = s]   (3)

A policy can also be viewed as a sequence of decision rules over the time steps.
Value Functions
A value function estimates how good it is for an agent to be in the current state.

The goodness of a state is defined in terms of future rewards: the expected return.

The value of a state s under a policy \pi is v_\pi(s), the expected return when starting in state s and following policy \pi:

v_\pi(s) = \mathbb{E}_\pi[G_t \mid S_t = s]   (4)

This is the same as:

v_\pi(s) = \mathbb{E}_\pi\!\left[ \sum_{k=0}^{\infty} \gamma^k R_{t+k+1} \,\middle|\, S_t = s \right]   (5)
Bellman Equation
Value functions satisfy a recursive relationship (dynamic programming):

v_\pi(s) = \mathbb{E}_\pi[G_t \mid S_t = s]
         = \mathbb{E}_\pi[R_{t+1} + \gamma R_{t+2} + \gamma^2 R_{t+3} + \cdots \mid S_t = s]
         = \mathbb{E}_\pi[R_{t+1} + \gamma (R_{t+2} + \gamma R_{t+3} + \cdots) \mid S_t = s]
         = \mathbb{E}_\pi[R_{t+1} + \gamma G_{t+1} \mid S_t = s]
         = \mathbb{E}_\pi[R_{t+1} + \gamma v_\pi(S_{t+1}) \mid S_t = s]   (6)

A value function can therefore be decomposed into the immediate reward plus the discounted value of the successor state:

v_\pi(s) = \mathbb{E}_\pi[R_{t+1} + \gamma v_\pi(S_{t+1}) \mid S_t = s]   (7)
Bellman Equation for Value Functions
v_\pi(s) = \mathbb{E}_\pi[R_{t+1} + \gamma v_\pi(S_{t+1}) \mid S_t = s]   (8)

Equivalently, we can write the value function as

v_\pi(s) = R_s + \gamma \sum_{s' \in S} P_{ss'} v_\pi(s')   (9)
Bellman Equation for Value Functions
In matrix form, the Bellman equation for value functions is

v = R + \gamma P v   (10)

This Bellman equation can be solved directly:

v = (I - \gamma P)^{-1} R   (11)

However, this direct solution is only feasible for small MDPs. We will look into iterative methods for solving the Bellman equation (later!).
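For a small MDP, equation (11) is a few lines of numpy. This sketch assumes a fixed policy whose induced transition matrix P and reward vector R are given; the numbers are made up.

```python
import numpy as np

# Made-up 3-state example: P[s, s2] and R[s] under some fixed policy.
P = np.array([[0.5, 0.5, 0.0],
              [0.0, 0.5, 0.5],
              [0.0, 0.0, 1.0]])
R = np.array([1.0, 2.0, 0.0])
gamma = 0.9

# v = (I - gamma * P)^{-1} R: solve the linear system rather than inverting.
v = np.linalg.solve(np.eye(3) - gamma * P, R)
print(v)
```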
Value Functions and Policy
Action-Value Function Q_\pi

Similar to the value function v_\pi, we can also define the action-value function Q_\pi.

Q_\pi(s, a) is the value of taking action a in state s under a policy \pi: the expected return starting from state s, taking the action a, and thereafter following the policy \pi.

Q_\pi(s, a) = \mathbb{E}_\pi\!\left[ \sum_{k=0}^{\infty} \gamma^k R_{t+k+1} \,\middle|\, S_t = s, A_t = a \right]   (12)

We can therefore write the Bellman equation for Q_\pi(s, a):

Q_\pi(s, a) = \mathbb{E}_\pi[R_{t+1} + \gamma Q_\pi(S_{t+1}, A_{t+1}) \mid S_t = s, A_t = a]   (13)
Bellman Expectation Equations
We can write v_\pi in terms of Q_\pi:

v_\pi(s) = \sum_{a \in A} \pi(a|s) Q_\pi(s, a)   (14)

and

Q_\pi(s, a) = R_s^a + \gamma \sum_{s' \in S} P_{ss'}^a v_\pi(s')   (15)

Therefore, the value function is

v_\pi(s) = \sum_{a \in A} \pi(a|s) \left( R_s^a + \gamma \sum_{s' \in S} P_{ss'}^a v_\pi(s') \right)   (16)

and the action-value function is

Q_\pi(s, a) = R_s^a + \gamma \sum_{s' \in S} P_{ss'}^a \sum_{a' \in A} \pi(a'|s') Q_\pi(s', a')   (17)
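Equation (14) is just an expectation over actions. A quick sketch with made-up tables Q[s, a] and pi[s, a], not values from the slides:

```python
import numpy as np

Q = np.array([[1.0, 2.0],
              [0.5, 0.0]])        # Q[s, a]
pi = np.array([[0.3, 0.7],
               [0.9, 0.1]])       # pi[s, a] = probability of action a in state s

# v_pi(s) = sum_a pi(a|s) * Q_pi(s, a): a row-wise expectation.
v = (pi * Q).sum(axis=1)
print(v)  # [0.3*1 + 0.7*2, 0.9*0.5 + 0.1*0] = [1.7, 0.45]
```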
Optimal Value Functions
The optimal value function gives the maximum value in every state: the maximum value function over all policies.

v_*(s) = \max_\pi v_\pi(s)   (18)

Similarly, we have the optimal action-value function: the maximum action-value over all policies.

Q_*(s, a) = \max_\pi Q_\pi(s, a)   (19)

The optimal value functions correspond to the best possible performance of the agent in the environment. An agent that behaves optimally in the environment follows an optimal policy.
Optimal Policy (1)
An optimal policy is a policy that is better than or equal to all other policies.

A policy \pi is better than another policy \pi' when its expected return is greater than or equal to that of \pi' in every state: \pi \geq \pi' if and only if v_\pi(s) \geq v_{\pi'}(s) for all s.

All optimal policies \pi_* therefore achieve the optimal value functions v_*(s) and Q_*(s, a).
Optimal Policy (2)
If we know the optimal action-value function, we can find the optimal policy: Q_*(s, a) gives the best achievable value in every state.

If we know Q_*(s, a), we know the optimal action to take at every state.

The optimal policy \pi_* selects the optimal action at every state; following \pi_*, we can achieve the goal state with maximum long-term reward:

\pi_*(a|s) = \begin{cases} 1 & \text{if } a = \arg\max_{a \in A} Q_*(s, a) \\ 0 & \text{otherwise} \end{cases}
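If Q_* were available as a table, extracting this greedy policy is one argmax per state; a sketch with a made-up table:

```python
import numpy as np

Q_star = np.array([[1.0, 3.0],
                   [2.5, 0.5]])   # Q_star[s, a]

# pi_*(s) puts probability 1 on argmax_a Q_*(s, a).
pi_star = Q_star.argmax(axis=1)
print(pi_star)  # optimal action index for each state: [1, 0]
```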
Bellman Optimality Equations
The optimal value functions are also recursively related by the Bellman optimality equations:

v_*(s) = \max_a Q_*(s, a)   (20)

The Bellman optimality equation for Q_*(s, a) is

Q_*(s, a) = R_s^a + \gamma \sum_{s' \in S} P_{ss'}^a v_*(s')   (21)

and v_*(s) is given as

v_*(s) = \max_a \left( R_s^a + \gamma \sum_{s' \in S} P_{ss'}^a v_*(s') \right)   (22)

Finally, writing Q_*(s, a) in terms of itself:

Q_*(s, a) = R_s^a + \gamma \sum_{s' \in S} P_{ss'}^a \max_{a'} Q_*(s', a')   (23)
Solving Bellman Optimality Equations
However, there is no closed-form solution of the Bellman optimality equations in general.

If we could solve for v_*(s) and Q_*(s, a) directly, RL problems would be easy!

There is no closed form of v_*(s) and Q_*(s, a) for large MDPs, and the Bellman optimality equations are non-linear.

- We use iterative solution methods to find optimal value functions (next few lectures...)
- Value iteration and policy iteration
- TD-learning (Q-Learning and SARSA)
Policy Evaluation and Improvement
Policy Evaluation
- Used to evaluate a given policy \pi
- Solve the Bellman expectation backup
- v_1 \to v_2 \to v_3 \to \cdots \to v_\pi
- Solve using synchronous backups: at each iteration k + 1, for all states s \in S, update v_{k+1}(s) from v_k(s')
- Iterative policy evaluation converges to v_\pi
Iterative Policy Evaluation
v_{k+1}(s) = \sum_{a \in A} \pi(a|s) \left( R_s^a + \gamma \sum_{s' \in S} P_{ss'}^a v_k(s') \right)   (24)

or, in matrix form,

v_{k+1} = R^\pi + \gamma P^\pi v_k   (25)

- We therefore consider iterative solution methods, as sketched below
- A sequence of approximate value functions v_0, v_1, v_2, ...
- Each successive approximation is obtained by applying the Bellman equation for v_\pi
- The sequence can be shown to converge to v_\pi as k \to \infty
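A minimal sketch of iterative policy evaluation implementing equation (24); it assumes the array layout P[a, s, s'] and R[s, a] from the earlier MDP sketch.

```python
import numpy as np

def policy_evaluation(P, R, pi, gamma, tol=1e-8):
    """Iterative policy evaluation with synchronous backups.

    P[a, s, s2]: transition probabilities, R[s, a]: rewards,
    pi[s, a]: action probabilities under the policy being evaluated.
    """
    v = np.zeros(R.shape[0])
    while True:
        # q[s, a] = R(s, a) + gamma * sum_{s'} P(s'|s, a) v_k(s')
        q = R + gamma * np.einsum("ast,t->sa", P, v)
        # v_{k+1}(s) = sum_a pi(a|s) q[s, a]
        v_new = (pi * q).sum(axis=1)
        if np.max(np.abs(v_new - v)) < tol:
            return v_new
        v = v_new

# Example with the 2-state MDP sketched earlier and a uniform random policy.
P = np.array([[[0.8, 0.2], [0.1, 0.9]],
              [[0.5, 0.5], [0.3, 0.7]]])
R = np.array([[1.0, 0.0], [0.0, 2.0]])
pi = np.full((2, 2), 0.5)
print(policy_evaluation(P, R, pi, gamma=0.9))
```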
Policy Improvement
- Given a policy \pi
- Evaluate the policy \pi:

v_\pi(s) = \mathbb{E}_\pi[R_{t+1} + \gamma R_{t+2} + \gamma^2 R_{t+3} + \cdots \mid S_t = s]   (26)

- Improve the policy by acting greedily with respect to v_\pi:

\pi' = \text{greedy}(v_\pi)   (27)

- By iterating policy improvement, we want to find the optimal policy \pi_*
- This process is known as policy iteration, and it converges to \pi_*
Policy Improvement
We compute value functions of policies so that we can find better policies. For a given value function v(s):

- For state s, we want to know whether or not to change the policy
- We want to know how good it is to follow the current policy from state s, which is v_\pi(s)

Consider a policy improvement step: select action a in s and thereafter follow the existing policy \pi. In other words, compute Q_\pi(s, a).

- Then check whether Q_\pi(s, a) is greater or smaller than v_\pi(s)
Policy Improvement

As before, first evaluate the policy by computing value functions:

Q_\pi(s, a) = \mathbb{E}_\pi[R_{t+1} + \gamma v_\pi(S_{t+1}) \mid S_t = s, A_t = a]   (28)

- Improve the policy by acting greedily:

\pi'(s) = \arg\max_{a \in A} Q_\pi(s, a)   (29)

- This improves the value from any state s over one step:

Q_\pi(s, \pi'(s)) = \max_{a \in A} Q_\pi(s, a) \geq Q_\pi(s, \pi(s))

- It therefore improves the value function:

Q_\pi(s, \pi'(s)) \geq v_\pi(s)   (30)
Policy Improvement
This improvement in policy by acting greedily is the content of the policy improvement theorem:

Q_\pi(s, \pi'(s)) \geq v_\pi(s)   (31)

This means that the new policy \pi' must be as good as or better than the old policy \pi, i.e., it must obtain greater or equal expected return from all states s \in S:

v_{\pi'}(s) \geq v_\pi(s)   (32)
Policy Improvement
- If the policy improvement stops, i.e.,

Q_\pi(s, \pi'(s)) = \max_{a \in A} Q_\pi(s, a) = Q_\pi(s, \pi(s)) = v_\pi(s)   (33)

- then the Bellman optimality equation has been satisfied:

v_\pi(s) = \max_{a \in A} Q_\pi(s, a)   (34)

- This means we have reached the optimal value function:

v_\pi(s) = v_*(s)   (35)

- Therefore, \pi is the optimal policy: \pi = \pi_*
Policy Iteration
Policy evaluation:
- Iterative policy evaluation
- Estimate v_\pi: for a given \pi, compute v_\pi

Policy improvement:
- Greedy policy improvement
- Generate \pi' \geq \pi

The two steps are combined in the sketch below.
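Putting the two steps together gives the following sketch of policy iteration, under the same assumed array layout as the earlier snippets; evaluation here solves the linear system directly, which is fine for small MDPs.

```python
import numpy as np

def policy_iteration(P, R, gamma):
    """Alternate policy evaluation and greedy improvement until the policy is stable."""
    n_states, n_actions = R.shape
    policy = np.zeros(n_states, dtype=int)      # deterministic: one action per state
    while True:
        # Policy evaluation: solve v = R_pi + gamma * P_pi v directly.
        R_pi = R[np.arange(n_states), policy]
        P_pi = P[policy, np.arange(n_states)]
        v = np.linalg.solve(np.eye(n_states) - gamma * P_pi, R_pi)
        # Policy improvement: act greedily with respect to v.
        q = R + gamma * np.einsum("ast,t->sa", P, v)
        new_policy = q.argmax(axis=1)
        if np.array_equal(new_policy, policy):
            return policy, v
        policy = new_policy
```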
Value Iteration
Instead of the two-step process of policy iteration, we can instead use value iteration.

- The goal is again to find the optimal policy \pi_*
- Iteratively compute the Bellman optimality backup
Value Iteration
v_1 \to v_2 \to \cdots \to v_*

- Using synchronous backups: at each iteration k + 1, for all states s \in S, update v_{k+1}(s) from v_k(s')
- This can be proved to converge to v_* (not included here)
- Unlike policy iteration, in value iteration there is no explicit greedy policy improvement step
- In other words, by improving the value function iteratively, the policy itself is improved

v_{k+1}(s) = \max_{a \in A} \left( R_s^a + \gamma \sum_{s' \in S} P_{ss'}^a v_k(s') \right)   (36)
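Equation (36) translates directly into a loop; a minimal sketch under the same array conventions as before.

```python
import numpy as np

def value_iteration(P, R, gamma, tol=1e-8):
    """Iterate v_{k+1}(s) = max_a (R(s,a) + gamma * sum_{s'} P(s'|s,a) v_k(s'))."""
    v = np.zeros(R.shape[0])
    while True:
        # q[s, a] = one-step lookahead value of each action
        q = R + gamma * np.einsum("ast,t->sa", P, v)
        v_new = q.max(axis=1)
        if np.max(np.abs(v_new - v)) < tol:
            # The greedy policy with respect to the converged values is optimal.
            return v_new, q.argmax(axis=1)
        v = v_new
```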
Model-Free Prediction
Monte-Carlo Reinforcement Learning
- MC methods learn directly from episodes of experience
- MC is model-free: no knowledge of MDP transitions/rewards is required
- MC learns from complete episodes: no bootstrapping
- MC uses the simplest idea: value = mean return
- Caveat: MC can only be applied to episodic MDPs (i.e., all episodes must terminate)
Monte-Carlo Policy Evaluation
- Goal: learn v_\pi from episodes of experience under policy \pi
- Recall that the return is the total discounted reward:

G_t = R_{t+1} + \gamma R_{t+2} + \cdots + \gamma^{T-1} R_T   (37)

- Recall that the value function is the expected return:

v_\pi(s) = \mathbb{E}_\pi[G_t \mid S_t = s]   (38)

- Monte-Carlo policy evaluation uses the empirical mean return instead of the expected return
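A sketch of every-visit Monte-Carlo evaluation from sampled episodes; the episode format of (state, reward) pairs below is an assumed convention, not one fixed by the slides.

```python
from collections import defaultdict

def mc_policy_evaluation(episodes, gamma):
    """Every-visit MC: V(s) = empirical mean of returns observed from s.

    episodes: list of episodes, each a list of (state, reward) pairs,
    where reward is the reward received after leaving that state.
    """
    returns_sum = defaultdict(float)
    returns_count = defaultdict(int)
    for episode in episodes:
        g = 0.0
        # Walk the episode backwards, accumulating the discounted return.
        for state, reward in reversed(episode):
            g = reward + gamma * g
            returns_sum[state] += g
            returns_count[state] += 1
    return {s: returns_sum[s] / returns_count[s] for s in returns_sum}

# Two tiny hand-made episodes over states "A" and "B".
episodes = [[("A", 1.0), ("B", 0.0)], [("B", 2.0)]]
print(mc_policy_evaluation(episodes, gamma=0.9))
```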
Temporal-Difference Learning
- TD methods learn directly from episodes of experience
- TD is model-free: no knowledge of MDP transitions/rewards is required
- TD learns from incomplete episodes, by bootstrapping
- TD updates a guess towards a guess
MC and TD
- Goal: learn v_\pi online from experience under policy \pi
- Incremental every-visit Monte-Carlo: update value V(s_t) towards the actual return G_t

V(s_t) \leftarrow V(s_t) + \alpha (G_t - V(s_t))   (39)

- Simplest temporal-difference learning algorithm, TD(0): update value V(s_t) towards the estimated return R_{t+1} + \gamma V(s_{t+1})

V(s_t) \leftarrow V(s_t) + \alpha (R_{t+1} + \gamma V(s_{t+1}) - V(s_t))

- R_{t+1} + \gamma V(s_{t+1}) is called the TD target
- \delta_t = R_{t+1} + \gamma V(s_{t+1}) - V(s_t) is called the TD error
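The TD(0) update written as code; a sketch in which the stream of (s, r, s', done) transitions is an assumed interface.

```python
from collections import defaultdict

def td0_evaluation(transitions, gamma, alpha):
    """Tabular TD(0): V(s) <- V(s) + alpha * (r + gamma * V(s') - V(s))."""
    V = defaultdict(float)
    for s, r, s_next, done in transitions:
        # Bootstrapped target; terminal transitions contribute no future value.
        target = r + (0.0 if done else gamma * V[s_next])
        td_error = target - V[s]
        V[s] += alpha * td_error
    return dict(V)

transitions = [("A", 1.0, "B", False), ("B", 0.0, "A", True)]
print(td0_evaluation(transitions, gamma=0.9, alpha=0.5))
```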
Advantages and Disadvantages of MC vs TD
- TD can learn before knowing the final outcome:
  - TD can learn online after every step
  - MC must wait until the end of the episode before the return is known
- TD can learn without the final outcome:
  - TD can learn from incomplete sequences
  - MC can only learn from complete sequences
  - TD works in continuing (non-terminating) environments
  - MC only works for episodic (terminating) environments
Monte-Carlo Backup
Temporal-Difference Backup
Model-Free Control
On and Off-Policy Learning
ε-Greedy Exploration
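ε-greedy exploration balances trying new actions with exploiting current estimates: with probability ε take a random action, otherwise take the greedy one. A minimal sketch over a tabular Q (the array shape is assumed):

```python
import numpy as np

def epsilon_greedy(Q, state, epsilon, rng):
    """Pick a random action with probability epsilon, else the greedy action."""
    n_actions = Q.shape[1]
    if rng.random() < epsilon:
        return int(rng.integers(n_actions))   # explore
    return int(Q[state].argmax())             # exploit

rng = np.random.default_rng(0)
Q = np.zeros((2, 2))
print(epsilon_greedy(Q, state=0, epsilon=0.1, rng=rng))
```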
MC vs TD Control
Updating Action-Value Functions with Sarsa
On-Policy Control with Sarsa
Sarsa Algorithm for On-Policy Control
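A minimal sketch of tabular SARSA, the on-policy control algorithm named above; the `env` object with `reset()` and `step(a)` returning (next_state, reward, done) is a hypothetical interface, not one from the lecture.

```python
import numpy as np

def sarsa(env, n_states, n_actions, n_episodes, gamma=0.9, alpha=0.1, epsilon=0.1, seed=0):
    """On-policy TD control: Q(s,a) <- Q(s,a) + alpha*(r + gamma*Q(s',a') - Q(s,a))."""
    rng = np.random.default_rng(seed)
    Q = np.zeros((n_states, n_actions))

    def behaviour(s):
        # epsilon-greedy with respect to the current Q estimates
        if rng.random() < epsilon:
            return int(rng.integers(n_actions))
        return int(Q[s].argmax())

    for _ in range(n_episodes):
        s = env.reset()
        a = behaviour(s)
        done = False
        while not done:
            s_next, r, done = env.step(a)       # hypothetical env interface
            a_next = behaviour(s_next)
            # SARSA target uses the action actually selected next (on-policy).
            target = r if done else r + gamma * Q[s_next, a_next]
            Q[s, a] += alpha * (target - Q[s, a])
            s, a = s_next, a_next
    return Q
```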
Off-Policy Learning
Q-Learning
Off-Policy Control with Q-Learning
Q-Learning Control Algorithm
Q-Learning Algorithm for Off-Policy Control
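A minimal sketch of tabular Q-learning, using the same hypothetical `env` interface as the SARSA sketch; the only change is the max over next actions in the target, which is what makes it off-policy.

```python
import numpy as np

def q_learning(env, n_states, n_actions, n_episodes, gamma=0.9, alpha=0.1, epsilon=0.1, seed=0):
    """Off-policy TD control: Q(s,a) <- Q(s,a) + alpha*(r + gamma*max_a' Q(s',a') - Q(s,a))."""
    rng = np.random.default_rng(seed)
    Q = np.zeros((n_states, n_actions))
    for _ in range(n_episodes):
        s = env.reset()
        done = False
        while not done:
            # Behaviour policy: epsilon-greedy. Target policy: greedy (the max below).
            if rng.random() < epsilon:
                a = int(rng.integers(n_actions))
            else:
                a = int(Q[s].argmax())
            s_next, r, done = env.step(a)       # hypothetical env interface
            target = r if done else r + gamma * Q[s_next].max()
            Q[s, a] += alpha * (target - Q[s, a])
            s = s_next
    return Q
```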
Questions