REINFORCEMENT LEARNING
Multi-state RL
OVERVIEW
• Reinforcement Learning
• Markov Decision Process
• Q-Learning & Sarsa
• Convergence
• Planning & learning
• Actor Critic
• Monte Carlo
• Reinforcement Learning in Normal Form Games
MARKOV DECISION PROCESS
• It is often useful to assume that all relevant information is present in the current state: the Markov property
• If an RL task has the Markov property, it is basically a Markov Decision Process (MDP)
• Assuming finite state and action spaces, it is a finite MDP

$P(s_{t+1}, r_{t+1} \mid s_t, a_t) = P(s_{t+1}, r_{t+1} \mid s_t, a_t, r_t, s_{t-1}, a_{t-1}, \ldots, r_1, a_0, s_0)$
MARKOV DECISION PROCESS
An MDP is defined by:
• State and action sets
• A transition function
$P^a_{ss'} = P(s_{t+1} = s' \mid s_t = s, a_t = a)$
• A reward function
$R^a_{ss'} = E(r_{t+1} \mid s_t = s, a_t = a, s_{t+1} = s')$
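For concreteness, these two tables can be written out directly; a minimal sketch in Python (the states, actions and numbers are invented for illustration):

```python
# A toy finite MDP as plain tables, mirroring the definitions above.
# P[s][a] is a list of (s_next, probability) pairs  -- the P^a_{ss'} table
# R[(s, a, s_next)] is the expected reward          -- the R^a_{ss'} table
P = {
    "s0": {"stay": [("s0", 0.9), ("s1", 0.1)],
           "go":   [("s1", 1.0)]},
    "s1": {"stay": [("s1", 1.0)],
           "go":   [("s0", 1.0)]},
}
R = {
    ("s0", "stay", "s0"): 0.0, ("s0", "stay", "s1"): 1.0,
    ("s0", "go",   "s1"): 1.0,
    ("s1", "stay", "s1"): 0.0,
    ("s1", "go",   "s0"): 0.0,
}
```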
AGENT-ENVIRONMENT INTERFACE
[Figure: agent-environment loop. At each step t the agent observes state $s_t$ and selects action $a_t$; the environment returns reward $r_{t+1}$ and next state $s_{t+1}$, producing the trajectory $s_t, a_t, r_{t+1}, s_{t+1}, a_{t+1}, r_{t+2}, s_{t+2}, a_{t+2}, \ldots$]
VALUE FUNCTIONS
• Goal: learn $\pi : S \to A$, given experience tuples $\langle\langle s, a, s'\rangle, r\rangle$
• When following a fixed policy $\pi$ we can define the value of a state s under that policy as
$V^\pi(s) = E_\pi(R_t \mid s_t = s) = E_\pi\left(\sum_{k=0}^{\infty} \gamma^k r_{t+k+1} \mid s_t = s\right)$
• Similarly, we can define the value of taking action a in state s as
$Q^\pi(s, a) = E_\pi(R_t \mid s_t = s, a_t = a)$
• Optimal policy: $\pi^* = \arg\max_\pi V^\pi(s)$
BACKUP DIAGRAMS
$V^\pi(s) = \sum_a \pi(s, a) \sum_{s'} P^a_{ss'} \left[R^a_{ss'} + \gamma V^\pi(s')\right]$

$Q^\pi(s, a) = \sum_{s'} P^a_{ss'} \left[R^a_{ss'} + \gamma V^\pi(s')\right]$

$V^*(s) = \max_a Q^*(s, a)$
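The first Bellman equation turns directly into an iterative update; a minimal sketch of policy evaluation over tables shaped like the toy MDP above (pi[s][a] is the probability the policy assigns to action a in state s; all names are illustrative):

```python
def policy_evaluation(P, R, pi, gamma=0.9, theta=1e-8):
    """Sweep V(s) <- sum_a pi(s,a) sum_s' P^a_{ss'} [R^a_{ss'} + gamma V(s')]
    until the largest change in any state falls below theta."""
    V = {s: 0.0 for s in P}
    while True:
        delta = 0.0
        for s in P:
            v = sum(pi[s][a] * sum(prob * (R[(s, a, s2)] + gamma * V[s2])
                                   for s2, prob in P[s][a])
                    for a in P[s])
            delta = max(delta, abs(v - V[s]))
            V[s] = v
        if delta < theta:
            return V

# e.g. evaluate the uniform random policy on the toy MDP:
# V = policy_evaluation(P, R, {s: {a: 0.5 for a in P[s]} for s in P})
```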
STATE VALUES & STATE-ACTION VALUES
MODEL BASED: DYNAMIC PROGRAMMING
[Backup diagram: a full DP backup from $s_t$ over all possible next states $s_{t+1}$ and rewards $r_{t+1}$]

$V^\pi(s) = E_\pi\{R_t \mid s_t = s\} = E_\pi\{r_{t+1} + \gamma V^\pi(s_{t+1}) \mid s_t = s\}$
MODEL FREE: REINFORCEMENT LEARNING
[Backup diagram: a single sampled transition from $s_t$ to $s_{t+1}$ with reward $r_{t+1}$, as used by model-free methods]
Q-LEARNING
One-step Q-learning:

$Q(s_t, a_t) \leftarrow Q(s_t, a_t) + \alpha\left[r_{t+1} + \gamma \max_{a'} Q(s_{t+1}, a') - Q(s_t, a_t)\right]$
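A minimal tabular sketch of this update, together with the ε-greedy behaviour policy usually paired with it (state and action encodings are left abstract):

```python
import random
from collections import defaultdict

Q = defaultdict(float)  # tabular Q; unseen (s, a) pairs default to 0

def q_learning_step(Q, s, a, r, s_next, actions, alpha=0.1, gamma=0.9):
    """Off-policy TD update: bootstrap on the best next action, max_a' Q(s', a')."""
    target = r + gamma * max(Q[(s_next, a2)] for a2 in actions)
    Q[(s, a)] += alpha * (target - Q[(s, a)])

def epsilon_greedy(Q, s, actions, eps=0.1):
    """Behaviour policy: explore with probability eps, otherwise act greedily."""
    if random.random() < eps:
        return random.choice(actions)
    return max(actions, key=lambda a: Q[(s, a)])
```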
Q-LEARNING: EXAMPLE
• Epoch 1: 1,2,4
• Epoch 2: 1,6
• Epoch 3: 1,3
• Epoch 4: 1,2,5
• Epoch 6: 2,5
[Figure: six-state MDP with states 1-6, actions a-d, rewards R = 1, 1, 4, 5, 2, 10 and transition probabilities 0.2/0.8, 1.0, 1.0, 0.3/0.7 on the edges]
UPDATING Q: IN PRACTICE
CONVERGENCE OF DETERMINISTIC Q-LEARNING
Q-learning is guaranteed to converge in a Markovian setting, i.e. $\hat{Q}$ converges to $Q$ when each (s,a) is visited infinitely often.

Extra material: Tsitsiklis, J.N., Asynchronous Stochastic Approximation and Q-learning. Machine Learning, 16:185-202, 1994.
CONVERGENCE OF DETERMINISTIC Q-LEARNING
Proof:
• Let a full interval be an interval during which each (s,a) is visited
• Let $\hat{Q}_n$ be the Q-table after n updates
• $\Delta_n$ is the maximum error in $\hat{Q}_n$:

$\Delta_n = \max_{s,a} |\hat{Q}_n(s, a) - Q(s, a)|$
CONVERGENCE OF DETERMINISTIC Q-LEARNING
For any table entry $\hat{Q}_n(s, a)$ updated on iteration n+1, the error in the revised estimate $\hat{Q}_{n+1}(s, a)$ is

$|\hat{Q}_{n+1}(s, a) - Q(s, a)| = |(r + \gamma \max_{a'} \hat{Q}_n(s', a')) - (r + \gamma \max_{a'} Q(s', a'))|$
$= |\gamma \max_{a'} \hat{Q}_n(s', a') - \gamma \max_{a'} Q(s', a')|$
$\le \gamma \max_{a'} |\hat{Q}_n(s', a') - Q(s', a')|$
$\le \gamma \max_{s'', a'} |\hat{Q}_n(s'', a') - Q(s'', a')|$

so $|\hat{Q}_{n+1}(s, a) - Q(s, a)| \le \gamma \Delta_n < \Delta_n$.
SARSA: ON-POLICY TD-CONTROL
$Q(s_t, a_t) \leftarrow Q(s_t, a_t) + \alpha\left[r_{t+1} + \gamma Q(s_{t+1}, a_{t+1}) - Q(s_t, a_t)\right]$
Q-LEARNING VS SARSA
• One-step Q-learning (off-policy):
$Q(s_t, a_t) \leftarrow Q(s_t, a_t) + \alpha\left[r_{t+1} + \gamma \max_{a'} Q(s_{t+1}, a') - Q(s_t, a_t)\right]$
• Sarsa (on-policy):
$Q(s_t, a_t) \leftarrow Q(s_t, a_t) + \alpha\left[r_{t+1} + \gamma Q(s_{t+1}, a_{t+1}) - Q(s_t, a_t)\right]$
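In code the two methods differ in a single line; a sketch of the Sarsa step for comparison with q_learning_step above (Sarsa requires the next action, chosen by the same policy, before it can update):

```python
def sarsa_step(Q, s, a, r, s_next, a_next, alpha=0.1, gamma=0.9):
    """On-policy TD update: bootstrap on the action a_next actually taken,
    instead of Q-learning's max over next actions."""
    target = r + gamma * Q[(s_next, a_next)]
    Q[(s, a)] += alpha * (target - Q[(s, a)])
```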
CLIFF WALKING EXAMPLE
[Figure: cliff-walking gridworld (start S, goal G, the cliff with r = −100 along the bottom edge; a safe path and an optimal path) and reward-per-episode curves (−100 to −25) over 500 episodes for Sarsa and Q-learning]

• Actions: up, down, left, right
• Reward: cliff = −100, goal = 0, default = −1
• $\epsilon$-greedy, with $\epsilon = 0.1$
PLANNING AND LEARNING
• Model: anything the agent can use to predict how the environment will respond to its actions
• Distribution model: description of all possibilities and their probabilities, e.g., $P^a_{ss'}$ and $R^a_{ss'}$ for all s, s' and $a \in A(s)$
• Sample model: produces sample experiences, e.g., a simulation model
• Both types of models can be used to produce simulated experience
• Sample models are often easier to come by
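A sample model can be as simple as a table of remembered outcomes; a minimal sketch, assuming a deterministic environment (the form Dyna-Q uses below):

```python
model = {}  # (s, a) -> (r, s_next): a deterministic sample model

def remember(s, a, r, s_next):
    """Model learning: record the observed outcome of (s, a)."""
    model[(s, a)] = (r, s_next)

def simulate(s, a):
    """Produce a simulated experience for a previously tried pair."""
    return model[(s, a)]
```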
PLANNING
• Planning is any computational process that uses a model to create or improve a policy
• Planning in AI: state-space planning (e.g. heuristic search methods)
• We take the following (unusual) view:
- all state-space planning methods involve computing value functions, either explicitly or implicitly
- they all apply backups to simulated experience

[Diagram: model → (planning) → policy; expanded: model → simulated experience → (backups) → values → policy]
PLANNING
Two uses of real experience:
• model learning: to improve the model
• direct RL: to directly improve the value function and policy

Improving the value function and/or policy via a model is sometimes called indirect RL, model-based RL or planning.

[Diagram: experience → (direct RL) → value/policy; experience → (model learning) → model → (planning) → value/policy; value/policy → (acting) → experience]
INDIRECT VS DIRECT RL
Indirect methods:
• Make fuller use of experience: get a better policy with fewer environment interactions
Direct methods:
• simpler
• not affected by bad models

These are closely related: planning, acting, model learning and direct RL can occur simultaneously and in parallel.
DYNA-Q ALGORITHM
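A minimal sketch of tabular Dyna-Q, reusing q_learning_step, epsilon_greedy and the deterministic model table from the sketches above (the env object with reset()/step() is a hypothetical stand-in):

```python
import random

def dyna_q_episode(env, Q, model, actions, n_planning=50):
    """One episode: act, direct RL, model learning, then n_planning
    planning backups on randomly sampled remembered pairs."""
    s = env.reset()
    done = False
    while not done:
        a = epsilon_greedy(Q, s, actions)
        r, s_next, done = env.step(a)
        q_learning_step(Q, s, a, r, s_next, actions)    # direct RL
        model[(s, a)] = (r, s_next)                     # model learning
        for _ in range(n_planning):                     # planning
            ps, pa = random.choice(list(model))         # previously seen pair
            pr, ps_next = model[(ps, pa)]
            q_learning_step(Q, ps, pa, pr, ps_next, actions)
        s = s_next
```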
DYNA-Q IN A MAZE
[Figure: maze with start S and goal G; steps per episode (up to 800) over the first 50 episodes for 0 planning steps (direct RL only), 5 planning steps and 50 planning steps: more planning steps yield far faster learning]

• Reward = 0, until the goal, where it is 1
• $\epsilon$-greedy, $\epsilon = 0.1$
• learning rate = 0.1
• initial Q-values = 0
• discount factor = 0.95
DYNA-Q: SNAPSHOTS
[Figure: greedy policy snapshots in the maze, WITHOUT PLANNING (N=0) and WITH PLANNING (N=50)]
DYNA-Q: WRONG MODEL
• The environment changes to become easier

[Figure: the maze before and after the change (start S, goal G), and cumulative reward over 6000 time steps for Dyna-Q+, Dyna-Q and Dyna-AC]
DYNA-Q: WRONG MODEL
• The environment changes to become harder

[Figure: the maze before and after the change (start S, goal G), and cumulative reward over 3000 time steps for Dyna-Q+, Dyna-Q and Dyna-AC]
WHAT IS DYNA-Q+?
Uses an 'exploration bonus':
• Keep track of the time since each state-action pair was tried for real
• An extra reward is added for transitions caused by state-action pairs, related to how long ago they were tried: the longer a pair has gone untried, the greater the reward for visiting it
• The agent actually "plans" how to visit long-unvisited states

$r + \kappa\sqrt{n}$, with $\kappa$ a weight factor and n the time since the pair was last tried
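During planning this is one extra term on the reward; a small sketch (last_tried is the bookkeeping the bullets above describe; the weight value is illustrative):

```python
import math

last_tried = {}  # (s, a) -> time step at which the pair was last tried for real

def planning_reward(r, s, a, t, k=1e-3):
    """Dyna-Q+ bonus: r + k * sqrt(n), n = time steps since (s, a) was tried."""
    n = t - last_tried.get((s, a), 0)
    return r + k * math.sqrt(n)
```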
ACTOR CRITIC METHODS
• Explicit representation of the policy as well as the value function
• Minimal computation to select actions
• Can learn an explicit stochastic policy
• Can put constraints on policies
• Appealing as psychological and neural models

[Diagram: actor-critic architecture. The actor (policy) maps states to actions; the critic (value function) receives the reward and state from the environment and emits a TD error that trains both actor and critic.]
ACTOR-CRITIC DETAILS
If actions are determined by preferences, $p(s, a)$, as follows:

$\pi_t(s, a) = \Pr\{a_t = a \mid s_t = s\} = \frac{e^{p(s,a)}}{\sum_b e^{p(s,b)}}$

then you can update the preferences like this:

$p(s_t, a_t) \leftarrow p(s_t, a_t) + \beta \delta_t$

The TD error is used to evaluate actions:

$\delta_t = r_{t+1} + \gamma V(s_{t+1}) - V(s_t)$
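A minimal sketch of these three updates with tabular preferences and values (p and V as defaultdicts; beta is the actor step size from the preference update, alpha the critic's):

```python
import math
import random
from collections import defaultdict

p = defaultdict(float)  # actor: action preferences p(s, a)
V = defaultdict(float)  # critic: state values V(s)

def softmax_action(p, s, actions):
    """Sample a ~ pi_t(s, .) = exp(p(s,a)) / sum_b exp(p(s,b))."""
    weights = [math.exp(p[(s, a)]) for a in actions]
    return random.choices(actions, weights=weights)[0]

def actor_critic_step(s, a, r, s_next, alpha=0.1, beta=0.1, gamma=0.9):
    """Critic computes the TD error; it then trains both critic and actor."""
    delta = r + gamma * V[s_next] - V[s]  # TD error
    V[s] += alpha * delta                 # critic update
    p[(s, a)] += beta * delta             # actor: preference update
```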
MONTE CARLO METHODS
• Monte Carlo methods learn from complete sample returns
- only defined for episodic tasks
- no bootstrapping
• Monte Carlo methods learn directly from experience
- online: no model necessary, and still attains optimality
- simulated: no need for a full model
MONTE CARLO POLICY EVALUATION
• Goal: learn $V^\pi(s)$
• Given: some number of episodes under $\pi$ which contain s
• Idea: average the returns observed after visits to s
• Every-visit MC: average the returns for every time s is visited in an episode
• First-visit MC: average the returns only for the first time s is visited in an episode
• Both converge asymptotically
FIRST-VISIT MONTE CARLO POLICY EVALUATION
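A minimal sketch of first-visit MC policy evaluation over recorded episodes (each episode is assumed to be a list of (state, reward) pairs, the reward being the one received on leaving that state; the trace format is illustrative):

```python
from collections import defaultdict

def first_visit_mc(episodes, gamma=1.0):
    """Estimate V(s) as the average return following the first visit to s."""
    total = defaultdict(float)
    count = defaultdict(int)
    for episode in episodes:
        G = 0.0
        first_return = {}
        for s, r in reversed(episode):    # accumulate returns backwards
            G = gamma * G + r
            first_return[s] = G           # earliest visit overwrites later ones
        for s, G in first_return.items():
            total[s] += G
            count[s] += 1
    return {s: total[s] / count[s] for s in total}
```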
BLACKJACK EXAMPLE
• Objective: have your card sum be greater than the dealer's without exceeding 21
• States (200 in total):
- current sum (12-21)
- dealer's showing card (ace-10)
- do I have a usable ace?
• Reward: +1 for winning, 0 for a draw, -1 for losing
• Actions: stick (stop receiving cards), hit (receive another card)
• Policy: stick if my sum is 20 or 21, else hit
• Dealer's policy: sticks on any sum of 17 or greater, otherwise hits
BLACKJACK VALUE FUNCTIONS
BACKUP SCHEME DP
[Backup diagram: full-width DP backup over all successor states]

$V(s_t) \leftarrow E_\pi\{r_{t+1} + \gamma V(s_{t+1})\}$
BACKUP SCHEME TD
[Backup diagram: a single sampled transition from $s_t$ to $s_{t+1}$]

$V(s_t) \leftarrow V(s_t) + \alpha\left[r_{t+1} + \gamma V(s_{t+1}) - V(s_t)\right]$
SIMPLE MONTE CARLO
[Backup diagram: a complete sampled episode from $s_t$ to a terminal state]

$V(s_t) \leftarrow V(s_t) + \alpha\left[R_t - V(s_t)\right]$, where $R_t$ is the actual return following state $s_t$
MONTE CARLO ESTIMATION OF ACTION VALUES (Q)
$Q^\pi(s, a)$ = average return starting from state s and action a, following $\pi$

This also converges asymptotically if every state-action pair is visited.

Exploring starts: every state-action pair has a non-zero probability of being the starting pair.
MONTE CARLO CONTROL
MC policy iteration: policy evaluation using MC methods, followed by policy improvement.
Policy improvement step: greedify with respect to the value (or action-value) function.

The greedified policy meets the conditions for policy improvement.

This assumes exploring starts and an infinite number of episodes for MC policy evaluation. To work around the latter:
- update only to a given level of performance
- alternate between evaluation and improvement per episode
CONVERGENCE OF MC CONTROL
MONTE CARLO EXPLORING STARTS
BLACKJACK EXAMPLE CONTINUED
• Exploring starts
• Initial policy as described before
ON-POLICY MONTE CARLO CONTROL
• On-policy: learn about the policy currently being executed
• The policy can also be non-deterministic, e.g. an $\epsilon$-soft policy:
- probability of selecting a non-best action: $\frac{\epsilon}{|A(s)|}$
- probability of selecting the best action: $1 - \epsilon + \frac{\epsilon}{|A(s)|}$
• Similar to GPI: move the policy towards the greedy policy (i.e. $\epsilon$-soft)
• Converges to the best $\epsilon$-soft policy
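A minimal sketch of the ε-greedy member of this class, making the two probabilities above concrete (Q and the action set are placeholders):

```python
def epsilon_soft_probs(Q, s, actions, eps=0.1):
    """pi(s, .): every action gets eps/|A(s)|; the greedy one gets 1-eps extra."""
    best = max(actions, key=lambda a: Q[(s, a)])
    n = len(actions)
    return {a: eps / n + (1.0 - eps) * (a == best) for a in actions}
```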
ON-POLICY MC CONTROL