REINFORCEMENT LEARNING
Multi-state RL

OVERVIEW

• Reinforcement Learning
• Markov Decision Process
• Q-Learning & Sarsa
• Convergence
• Planning & learning
• Actor-Critic
• Monte Carlo
• Reinforcement Learning in Normal Form Games

MARKOV DECISION PROCESS

• It is often useful to assume that all relevant information is present in the current state: the Markov property
• If an RL task has the Markov property, it is essentially a Markov Decision Process (MDP)
• Assuming finite state and action spaces, it is a finite MDP

$P(s_{t+1}, r_{t+1} \mid s_t, a_t) = P(s_{t+1}, r_{t+1} \mid s_t, a_t, r_t, s_{t-1}, a_{t-1}, \ldots, r_1, a_0, s_0)$

MARKOV DECISION PROCESS

An MDP is defined by:
• State and action sets
• A transition function
  $P^a_{ss'} = P(s_{t+1} = s' \mid s_t = s, a_t = a)$
• A reward function
  $R^a_{ss'} = E(r_{t+1} \mid s_t = s, a_t = a, s_{t+1} = s')$
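As a concrete illustration of these definitions, a finite MDP can be written down directly as the transition probabilities $P^a_{ss'}$ and expected rewards $R^a_{ss'}$. The two-state, two-action example below (states, actions and numbers) is a hypothetical sketch for illustration only, not part of the slides:

```python
import random

# Minimal sketch of a finite MDP as plain data (hypothetical two-state example).
states = ["s0", "s1"]
actions = ["stay", "move"]

# P[s][a][s'] is the transition probability P^a_{ss'}
P = {
    "s0": {"stay": {"s0": 1.0}, "move": {"s1": 0.9, "s0": 0.1}},
    "s1": {"stay": {"s1": 1.0}, "move": {"s0": 0.9, "s1": 0.1}},
}

# R[s][a][s'] is the expected reward R^a_{ss'}
R = {
    "s0": {"stay": {"s0": 0.0}, "move": {"s1": 1.0, "s0": 0.0}},
    "s1": {"stay": {"s1": 0.0}, "move": {"s0": 0.0, "s1": 0.0}},
}

def sample_transition(rng, s, a):
    """Sample model: draw s' ~ P^a_{s.} and return (s', R^a_{ss'})."""
    successors, probs = zip(*P[s][a].items())
    s_next = rng.choices(successors, weights=probs)[0]
    return s_next, R[s][a][s_next]

rng = random.Random(0)
print(sample_transition(rng, "s0", "move"))
```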

AGENT-ENVIRONMENT INTERFACE

[Figure: the agent-environment loop. At each time step t the agent observes state $s_t$ and takes action $a_t$; the environment returns reward $r_{t+1}$ and next state $s_{t+1}$, producing the trajectory $\ldots, s_t, a_t, r_{t+1}, s_{t+1}, a_{t+1}, r_{t+2}, s_{t+2}, a_{t+2}, r_{t+3}, s_{t+3}, a_{t+3}, \ldots$]

VALUE FUNCTIONS

• Goal: learn $\pi : S \rightarrow A$, given experience tuples $\langle\langle s, a\rangle, r\rangle$
• When following a fixed policy $\pi$ we can define the value of a state s under that policy as
  $V^\pi(s) = E_\pi(R_t \mid s_t = s) = E_\pi\left(\sum_{k=0}^{\infty} \gamma^k r_{t+k+1} \mid s_t = s\right)$
• Similarly, we can define the value of taking action a in state s as
  $Q^\pi(s, a) = E_\pi(R_t \mid s_t = s, a_t = a)$
• Optimal policy: $\pi^* = \arg\max_\pi V^\pi(s)$

BACKUP DIAGRAMS

$V^\pi(s) = \sum_a \pi(s, a) \sum_{s'} P^a_{ss'} \left[ R^a_{ss'} + \gamma V^\pi(s') \right]$

$Q^\pi(s, a) = \sum_{s'} P^a_{ss'} \left[ R^a_{ss'} + \gamma V^\pi(s') \right]$

For the optimal policy: $V^*(s) = \max_a Q^*(s, a)$
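The Bellman equation for $V^\pi$ translates directly into iterative policy evaluation. The sketch below reuses the `states`, `actions`, `P` and `R` structures from the hypothetical MDP sketch above and assumes a uniformly random policy; both are illustrative assumptions:

```python
def policy_evaluation(states, actions, P, R, pi, gamma=0.9, theta=1e-8):
    """Iteratively apply V(s) <- sum_a pi(s,a) sum_s' P^a_ss' [R^a_ss' + gamma V(s')]
    until the largest change in any state value falls below theta."""
    V = {s: 0.0 for s in states}
    while True:
        delta = 0.0
        for s in states:
            v_new = 0.0
            for a in actions:
                for s_next, p in P[s][a].items():
                    v_new += pi[s][a] * p * (R[s][a][s_next] + gamma * V[s_next])
            delta = max(delta, abs(v_new - V[s]))
            V[s] = v_new
        if delta < theta:
            return V

# Evaluate a uniformly random policy on the two-state example MDP above
pi_uniform = {s: {a: 1.0 / len(actions) for a in actions} for s in states}
print(policy_evaluation(states, actions, P, R, pi_uniform))
```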

STATE VALUES & STATE-ACTION VALUES


MODEL BASED: DYNAMIC PROGRAMMING

[Figure: full backup diagram over all successor states $s_{t+1}$, rewards $r_{t+1}$ and terminal nodes T]

$V^\pi(s) = E_\pi\{R_t \mid s_t = s\} = E_\pi\{r_{t+1} + \gamma V(s_{t+1}) \mid s_t = s\}$

MODEL FREE: REINFORCEMENT LEARNING

[Figure: sample backup along a single experienced transition $s_t, r_{t+1}, s_{t+1}$, rather than a full backup over all successors]

Q-LEARNING

One-step Q-learning:

$Q(s_t, a_t) \leftarrow Q(s_t, a_t) + \alpha \left[ r_{t+1} + \gamma \max_{a'} Q(s_{t+1}, a') - Q(s_t, a_t) \right]$
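A minimal tabular sketch of this update with ε-greedy action selection. The environment interface (env.reset() returning a state, env.step(a) returning (next state, reward, done)) is an assumption, not something defined in the slides:

```python
import random
from collections import defaultdict

def epsilon_greedy(Q, s, actions, rng, eps=0.1):
    """With probability eps pick a random action, otherwise argmax_a Q(s, a)."""
    if rng.random() < eps:
        return rng.choice(actions)
    return max(actions, key=lambda a: Q[(s, a)])

def q_learning_episode(env, Q, actions, rng, alpha=0.1, gamma=0.95, eps=0.1):
    """Run one episode of one-step Q-learning (off-policy TD control)."""
    s = env.reset()
    done = False
    while not done:
        a = epsilon_greedy(Q, s, actions, rng, eps)
        s_next, r, done = env.step(a)
        # Q(s,a) <- Q(s,a) + alpha [ r + gamma max_a' Q(s',a') - Q(s,a) ]
        target = r + gamma * max(Q[(s_next, b)] for b in actions)
        Q[(s, a)] += alpha * (target - Q[(s, a)])
        s = s_next

Q = defaultdict(float)  # tabular Q, defaults to 0 for unseen (s, a) pairs
```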

Q-LEARNING: EXAMPLE

[Figure: example MDP with six states (1-6), actions a-d, rewards R = 1, 1, 4, 5, 2, 10 on the transitions, and transition probabilities 0.2/0.8, 0.3/0.7 and 1.0]

• Epoch 1: 1, 2, 4
• Epoch 2: 1, 6
• Epoch 3: 1, 3
• Epoch 4: 1, 2, 5
• Epoch 6: 2, 5

UPDATING Q: IN PRACTICE


CONVERGENCE OF DETERMINISTIC Q-LEARNING

Q-learning is guaranteed to converge in a Markovian setting, i.e. $\hat{Q}$ converges to $Q$ when each (s, a) pair is visited infinitely often.

Extra material: Tsitsiklis, J. N. (1994). Asynchronous Stochastic Approximation and Q-learning. Machine Learning, 16:185-202.

CONVERGENCE OF DETERMINISTIC Q-LEARNING

Proof:

• Let a full interval be an interval during which each (s, a) is visited
• Let $\hat{Q}_n$ be the Q-table after n updates
• $\Delta_n$ is the maximum error in $\hat{Q}_n$:

$\Delta_n = \max_{s,a} |\hat{Q}_n(s, a) - Q(s, a)|$

CONVERGENCE OF DETERMINISTIC Q-LEARNING

For any table entry $\hat{Q}_n(s, a)$ updated on iteration n+1, the error in the revised estimate $\hat{Q}_{n+1}(s, a)$ is

$|\hat{Q}_{n+1}(s, a) - Q(s, a)| = |(r + \gamma \max_{a'} \hat{Q}_n(s', a')) - (r + \gamma \max_{a'} Q(s', a'))|$
$\qquad = |\gamma \max_{a'} \hat{Q}_n(s', a') - \gamma \max_{a'} Q(s', a')|$
$\qquad \le \gamma \max_{a'} |\hat{Q}_n(s', a') - Q(s', a')|$
$\qquad \le \gamma \max_{s'', a'} |\hat{Q}_n(s'', a') - Q(s'', a')|$

$|\hat{Q}_{n+1}(s, a) - Q(s, a)| \le \gamma \Delta_n < \Delta_n$

SARSA: ON-POLICY TD-CONTROL

$Q(s_t, a_t) \leftarrow Q(s_t, a_t) + \alpha \left[ r_{t+1} + \gamma Q(s_{t+1}, a_{t+1}) - Q(s_t, a_t) \right]$

Q-LEARNING VS SARSA

• One-step Q-learning (off-policy):

$Q(s_t, a_t) \leftarrow Q(s_t, a_t) + \alpha \left[ r_{t+1} + \gamma \max_{a'} Q(s_{t+1}, a') - Q(s_t, a_t) \right]$

• Sarsa (on-policy):

$Q(s_t, a_t) \leftarrow Q(s_t, a_t) + \alpha \left[ r_{t+1} + \gamma Q(s_{t+1}, a_{t+1}) - Q(s_t, a_t) \right]$
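The only difference between the two rules is the TD target. A small, hypothetical helper makes the contrast explicit (the Q-table and action list follow the conventions of the earlier sketch):

```python
def td_target(Q, actions, r, s_next, a_next, gamma, method):
    """TD target for the two updates above.
    Q-learning (off-policy): r + gamma * max_a' Q(s', a')
    Sarsa      (on-policy) : r + gamma * Q(s', a'), with a' the action actually taken."""
    if method == "q_learning":
        return r + gamma * max(Q[(s_next, a)] for a in actions)
    if method == "sarsa":
        return r + gamma * Q[(s_next, a_next)]
    raise ValueError(f"unknown method: {method}")
```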

CLIFF WALKING EXAMPLE

[Figure: cliff-walking gridworld from S to G with the cliff (r = -100) along the bottom edge; the safe path runs away from the cliff, the optimal path along its edge. Plot: reward per episode over 500 episodes, with Sarsa obtaining higher online reward than Q-learning.]

• Actions: up, down, left, right
• Reward: cliff = -100, goal = 0, default = -1
• ε-greedy, with ε = 0.1

PLANNING AND LEARNING

• Model: anything the agent can use to predict how the environment will respond to its actions
• Distribution model: a description of all possibilities and their probabilities, e.g. $P^a_{ss'}$ and $R^a_{ss'}$ for all s, s' and $a \in A(s)$
• Sample model: produces sample experiences, e.g. a simulation model
• Both types of models can be used to produce simulated experience
• Sample models are often easier to come by

PLANNING

• Planning is any computational process that uses a model to create or improve a policy:
  model → (planning) → policy
• Planning in AI:
  - state-space planning (e.g. heuristic search methods)
• We take the following (unusual) view:
  - all state-space planning methods involve computing value functions, either explicitly or implicitly
  - they all apply backups to simulated experience:
  model → simulated experience → (backups) → values → policy

PLANNING

Two uses of real experience:

• model learning: to improve the model
• direct RL: to directly improve the value function and policy

Improving the value function and/or policy via a model is sometimes called indirect RL, model-based RL or planning.

[Figure: loop connecting experience, model and value/policy — acting produces experience; experience drives model learning and direct RL; the model drives planning back into the value/policy.]

INDIRECT VS DIRECT RL

Indirect methods:
• make fuller use of experience: get a better policy with fewer environment interactions

Direct methods:
• simpler
• not affected by bad models

These are closely related; planning, acting, model learning and direct RL can occur simultaneously and in parallel.

DYNA-Q ALGORITHM

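A minimal tabular Dyna-Q sketch, combining direct RL, model learning and planning as described on the previous slides. It reuses the epsilon_greedy helper and the env.reset()/env.step() interface assumed in the earlier Q-learning sketch:

```python
import random
from collections import defaultdict

def dyna_q(env, actions, episodes, n_planning=5, alpha=0.1, gamma=0.95,
           eps=0.1, seed=0):
    """Tabular Dyna-Q: each real step performs (a) a direct Q-learning update,
    (b) a model-learning update, and (c) n_planning simulated backups drawn
    from the learned (deterministic) model."""
    rng = random.Random(seed)
    Q = defaultdict(float)
    model = {}  # (s, a) -> (r, s') last observed for that pair
    for _ in range(episodes):
        s = env.reset()
        done = False
        while not done:
            a = epsilon_greedy(Q, s, actions, rng, eps)
            s_next, r, done = env.step(a)
            # (a) direct RL on the real transition
            Q[(s, a)] += alpha * (r + gamma * max(Q[(s_next, b)] for b in actions)
                                  - Q[(s, a)])
            # (b) model learning: remember what the environment did
            model[(s, a)] = (r, s_next)
            # (c) planning: n_planning backups on simulated experience
            for _ in range(n_planning):
                (ps, pa), (pr, ps_next) = rng.choice(list(model.items()))
                Q[(ps, pa)] += alpha * (pr + gamma * max(Q[(ps_next, b)] for b in actions)
                                        - Q[(ps, pa)])
            s = s_next
    return Q
```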

DYNA-Q IN A MAZE

[Figure: maze gridworld from S to G, and a plot of steps per episode (up to 800) against episodes (up to 50) for 0 planning steps (direct RL only), 5 planning steps and 50 planning steps; more planning steps converge in far fewer episodes.]

• Reward = 0, except 1 at the goal
• ε-greedy, ε = 0.1
• learning rate = 0.1
• initial Q-values = 0
• discount factor = 0.95

DYNA-Q: SNAPSHOTS

[Figure: learned policies in the maze, without planning (N = 0) vs with planning (N = 50).]

DYNA-Q: WRONG MODEL

• Easier environment: the maze changes so that a better path becomes available

[Figure: maze before and after the change, and cumulative reward over 6000 time steps for Dyna-Q+, Dyna-Q and Dyna-AC.]

DYNA-Q: WRONG MODEL

• Harder environment: the maze changes so that the learned path is blocked

[Figure: maze before and after the change, and cumulative reward over 3000 time steps for Dyna-Q+, Dyna-Q and Dyna-AC.]

WHAT IS DYNA-Q+?

Uses an 'exploration bonus':

• Keep track of the time since each state-action pair was last tried for real
• An extra reward is added for transitions caused by state-action pairs, related to how long ago they were tried: the longer unvisited, the more reward for visiting
• The agent effectively "plans" how to visit long-unvisited states

Planning backups use the reward $r + k\sqrt{n}$, with k a weight factor and n the time since the pair was last tried.
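A minimal sketch of that bonus as it would be applied to the reward used in planning backups; the function name and the default value of k are illustrative assumptions:

```python
import math

def dyna_q_plus_reward(r, steps_since_tried, k=1e-3):
    """Reward used in Dyna-Q+ planning backups: r + k * sqrt(n), where n is the
    number of time steps since this (s, a) pair was last tried for real."""
    return r + k * math.sqrt(steps_since_tried)
```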

ACTOR-CRITIC METHODS

• Explicit representation of the policy as well as the value function
• Minimal computation to select actions
• Can learn an explicitly stochastic policy
• Can put constraints on policies
• Appealing as psychological and neural models

[Figure: actor-critic architecture — the actor (policy) selects actions; the critic (value function) turns the environment's state and reward into a TD error that updates both.]

ACTOR-CRITIC DETAILS

If actions are determined by preferences p(s, a) as follows:

$\pi_t(s, a) = \Pr\{a_t = a \mid s_t = s\} = \dfrac{e^{p(s,a)}}{\sum_b e^{p(s,b)}}$

then you can update the preferences like this:

$p(s_t, a_t) \leftarrow p(s_t, a_t) + \beta \delta_t$

where the TD error is used to evaluate actions:

$\delta_t = r_{t+1} + \gamma V(s_{t+1}) - V(s_t)$
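A minimal tabular sketch of these updates: a Gibbs (softmax) actor over preferences p(s, a) and a TD-error critic. The step sizes α (critic) and β (actor) follow the formulas above; everything else (names, defaults) is an illustrative assumption:

```python
import math
import random
from collections import defaultdict

def gibbs_action(p, s, actions, rng):
    """Sample a ~ pi_t(s, .) = exp(p(s, a)) / sum_b exp(p(s, b))."""
    prefs = [p[(s, a)] for a in actions]
    m = max(prefs)  # subtract the max for numerical stability
    weights = [math.exp(x - m) for x in prefs]
    return rng.choices(actions, weights=weights)[0]

def actor_critic_update(p, V, s, a, r, s_next, alpha=0.1, beta=0.1, gamma=0.95):
    """Compute delta_t = r + gamma V(s') - V(s); the critic updates V and the
    actor updates the preference p(s, a) with the same TD error."""
    delta = r + gamma * V[s_next] - V[s]
    V[s] += alpha * delta        # critic update
    p[(s, a)] += beta * delta    # actor update
    return delta

p = defaultdict(float)  # actor: action preferences p(s, a)
V = defaultdict(float)  # critic: state values V(s)
```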

MONTE CARLO METHODS

• Monte Carlo methods learn from complete sample returns
  - only defined for episodic tasks
  - no bootstrapping
• Monte Carlo methods learn directly from experience
  - online: no model necessary and still attains optimality
  - simulated: no need for a full model

MONTE CARLO POLICY EVALUATION

• Goal: learn $V^\pi(s)$
• Given: some number of episodes under $\pi$ which contain s
• Idea: average the returns observed after visits to s
• Every-visit MC: average the returns for every time s is visited in an episode
• First-visit MC: average the returns only for the first time s is visited in an episode
• Both converge asymptotically

FIRST-VISIT MONTE CARLO POLICY EVALUATION

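A minimal sketch of first-visit MC policy evaluation, assuming episodes are already available as lists of (state, reward) pairs; the toy data at the end is purely illustrative:

```python
from collections import defaultdict

def first_visit_mc(episodes, gamma=1.0):
    """First-visit Monte Carlo policy evaluation.
    `episodes` is a list of episodes; each episode is a list of (s, r) pairs,
    with r the reward received after leaving s. Returns V(s) as the average of
    the returns that follow the first visit to s in each episode."""
    returns = defaultdict(list)
    for episode in episodes:
        # returns computed backwards: G_t = r_{t+1} + gamma * G_{t+1}
        G = [0.0] * (len(episode) + 1)
        for t in reversed(range(len(episode))):
            G[t] = episode[t][1] + gamma * G[t + 1]
        first_visit = {}
        for t, (s, _) in enumerate(episode):
            first_visit.setdefault(s, t)
        for s, t in first_visit.items():
            returns[s].append(G[t])
    return {s: sum(gs) / len(gs) for s, gs in returns.items()}

# Hypothetical toy usage: two episodes of (state, reward) pairs
episodes = [[("A", 1.0), ("B", 0.0), ("A", 2.0)], [("B", 1.0)]]
print(first_visit_mc(episodes))  # averages the first-visit returns per state
```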

BLACKJACK EXAMPLE

• Objective: have your card sum be greater than the dealer's without exceeding 21
• States (200 in total):
  - current sum (12-21)
  - dealer's showing card (ace-10)
  - do I have a usable ace?
• Reward: +1 for winning, 0 for a draw, -1 for losing
• Actions: stick (stop receiving cards), hit (receive another card)
• Policy: stick if my sum is 20 or 21, else hit
• Dealer's policy: stick on any sum of 17 or greater, otherwise hit

BLACKJACK VALUE FUNCTIONS


BACKUP SCHEME: DP

$V(s_t) \leftarrow E_\pi\{r_{t+1} + \gamma V(s_{t+1})\}$

[Figure: full backup diagram over all successor states $s_{t+1}$, rewards $r_{t+1}$ and terminal nodes T]

BACKUP SCHEME: TD

$V(s_t) \leftarrow V(s_t) + \alpha\left[r_{t+1} + \gamma V(s_{t+1}) - V(s_t)\right]$

[Figure: sample backup along the single experienced transition $s_t, r_{t+1}, s_{t+1}$]

SIMPLE MONTE CARLO

$V(s_t) \leftarrow V(s_t) + \alpha\left[R_t - V(s_t)\right]$, where $R_t$ is the actual return following state $s_t$

[Figure: backup along an entire sampled episode from $s_t$ to a terminal state]

MONTE CARLO ESTIMATION OF ACTION VALUES (Q)

$Q^\pi(s, a)$ = average return starting from state s and action a, following $\pi$

Also converges asymptotically if every state-action pair is visited.

Exploring starts: every state-action pair has a non-zero probability of being the starting pair.

MONTE CARLO CONTROL

• MC policy iteration: policy evaluation using MC methods, followed by policy improvement
• Policy improvement step: greedify with respect to the value (or action-value) function
• The greedified policy meets the conditions for policy improvement
• This assumes exploring starts and an infinite number of episodes for MC policy evaluation
• To work around the latter:
  - update only to a given level of performance
  - alternate between evaluation and improvement per episode

CONVERGENCE OF MC CONTROL


MONTE CARLO EXPLORING STARTS


BLACKJACK EXAMPLE CONTINUED

• Exploring starts
• Initial policy as described before

ON-POLICY MONTE CARLO CONTROL

• On-policy: learn about the policy currently being executed
• The policy can also be non-deterministic, e.g. an ε-soft policy:
  - probability of selecting a non-best action: $\epsilon / |A(s)|$
  - probability of selecting the best action: $1 - \epsilon + \epsilon / |A(s)|$
• Similar to GPI: move the policy towards the greedy policy (i.e. ε-soft)
• Converges to the best ε-soft policy

ON-POLICY MC CONTROL

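A minimal sketch of on-policy first-visit MC control with an ε-soft (here ε-greedy) policy, using the same assumed env.reset()/env.step() interface as the earlier sketches:

```python
import random
from collections import defaultdict

def epsilon_soft_action(Q, s, actions, rng, eps=0.1):
    """epsilon-soft behaviour: every action keeps probability at least eps/|A|."""
    if rng.random() < eps:
        return rng.choice(actions)
    return max(actions, key=lambda a: Q[(s, a)])

def on_policy_mc_control(env, actions, n_episodes, gamma=1.0, eps=0.1, seed=0):
    """On-policy first-visit MC control for an epsilon-soft policy: generate an
    episode with the current policy, then update Q towards the observed
    first-visit returns (sample-average updates)."""
    rng = random.Random(seed)
    Q = defaultdict(float)
    counts = defaultdict(int)
    for _ in range(n_episodes):
        # generate one episode following the current epsilon-soft policy
        trajectory, s, done = [], env.reset(), False
        while not done:
            a = epsilon_soft_action(Q, s, actions, rng, eps)
            s_next, r, done = env.step(a)
            trajectory.append((s, a, r))
            s = s_next
        # backward returns and first-visit indices per (s, a) pair
        G = [0.0] * (len(trajectory) + 1)
        for t in reversed(range(len(trajectory))):
            G[t] = trajectory[t][2] + gamma * G[t + 1]
        first_visit = {}
        for t, (s, a, _) in enumerate(trajectory):
            first_visit.setdefault((s, a), t)
        for (s, a), t in first_visit.items():
            counts[(s, a)] += 1
            Q[(s, a)] += (G[t] - Q[(s, a)]) / counts[(s, a)]
    return Q
```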
