
REINFORCEMENT LEARNING
Multi-state RL

OVERVIEW

Reinforcement Learning
• Markov Decision Process
• Q-Learning & Sarsa
• Convergence
• Planning & learning
• Actor Critic
• Monte Carlo

Reinforcement Learning in Normal Form Games

MARKOV DECISION PROCESS

• It is often useful to assume that all relevant information is present in the current state: the Markov property
• If an RL task has the Markov property, it is essentially a Markov Decision Process (MDP)
• Assuming finite state and action spaces, it is a finite MDP

$$P(s_{t+1}, r_{t+1} \mid s_t, a_t) = P(s_{t+1}, r_{t+1} \mid s_t, a_t, r_t, s_{t-1}, a_{t-1}, \ldots, r_1, a_0, s_0)$$

MARKOV DECISION PROCESS

An MDP is defined by:
• State and action sets
• A transition function

$$P^a_{ss'} = P(s_{t+1} = s' \mid s_t = s, a_t = a)$$

• A reward function

$$R^a_{ss'} = E(r_{t+1} \mid s_t = s, a_t = a, s_{t+1} = s')$$
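For concreteness, a finite MDP fits in two arrays indexed by (action, state, next state). The following minimal Python sketch uses a made-up 3-state, 2-action MDP; all numbers are illustrative assumptions, not from the slides:

```python
import numpy as np

# Hypothetical finite MDP: 3 states, 2 actions (numbers are illustrative).
n_states, n_actions = 3, 2

# P[a, s, s'] = P(s_{t+1} = s' | s_t = s, a_t = a); each (a, s) row sums to 1.
P = np.array([
    [[0.8, 0.2, 0.0], [0.0, 0.5, 0.5], [0.0, 0.0, 1.0]],   # action 0
    [[0.1, 0.9, 0.0], [0.0, 0.0, 1.0], [0.0, 0.0, 1.0]],   # action 1
])

# R[a, s, s'] = expected reward for the transition (s, a, s').
R = np.zeros((n_actions, n_states, n_states))
R[:, :, 2] = 1.0  # entering state 2 pays +1 in this toy example

assert np.allclose(P.sum(axis=2), 1.0)  # valid transition function
```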

AGENT-ENVIRONMENT INTERFACE

[Figure: at each step $t$ the agent observes state $s_t$ and reward $r_t$ and selects action $a_t$; the environment returns $r_{t+1}$ and $s_{t+1}$, yielding the trajectory $\ldots s_t, a_t, r_{t+1}, s_{t+1}, a_{t+1}, r_{t+2}, s_{t+2}, a_{t+2}, r_{t+3}, s_{t+3}, a_{t+3} \ldots$]

VALUE FUNCTIONS

• Goal: learn $\pi : S \to A$, given $\langle\langle s, a\rangle, r\rangle$
• When following a fixed policy $\pi$ we can define the value of a state s under that policy as

$$V^\pi(s) = E_\pi(R_t \mid s_t = s) = E_\pi\left(\sum_{k=0}^{\infty} \gamma^k r_{t+k+1} \,\middle|\, s_t = s\right)$$

• Similarly, we can define the value of taking action a in state s as

$$Q^\pi(s, a) = E_\pi(R_t \mid s_t = s, a_t = a)$$

• Optimal policy:

$$\pi^* = \arg\max_\pi V^\pi(s)$$

BACKUP DIAGRAMS

$$V^\pi(s) = \sum_a \pi(s, a) \sum_{s'} P^a_{ss'}\left[R^a_{ss'} + \gamma V^\pi(s')\right] \qquad Q^\pi(s, a) = \sum_{s'} P^a_{ss'}\left[R^a_{ss'} + \gamma V^\pi(s')\right]$$

$$V^*(s) = \max_a Q^*(s, a)$$
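The first Bellman equation above maps directly onto iterative policy evaluation. A minimal sketch, reusing the hypothetical P and R arrays from the MDP example (function and parameter names are my own):

```python
import numpy as np

def policy_evaluation(P, R, pi, gamma=0.95, tol=1e-8):
    """Sweep the Bellman backup for a fixed stochastic policy pi[s, a]."""
    n_actions, n_states, _ = P.shape
    V = np.zeros(n_states)
    while True:
        # Q[s, a] = sum_{s'} P^a_{ss'} (R^a_{ss'} + gamma V(s'))
        Q = np.einsum('asn,asn->sa', P, R + gamma * V[None, None, :])
        V_new = (pi * Q).sum(axis=1)      # V(s) = sum_a pi(s, a) Q(s, a)
        if np.max(np.abs(V_new - V)) < tol:
            return V_new
        V = V_new

# Usage with the toy MDP above and a uniformly random policy:
# V = policy_evaluation(P, R, pi=np.full((3, 2), 0.5))
```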

STATE VALUES & STATE-ACTION VALUES

MODEL BASED: DYNAMIC PROGRAMMING

$$V^\pi(s) = E_\pi\{R_t \mid s_t = s\} = E_\pi\{r_{t+1} + \gamma V^\pi(s_{t+1}) \mid s_t = s\}$$

[Figure: full DP backup tree from $s_t$ over every possible $r_{t+1}$ and $s_{t+1}$, down to terminal states]

MODEL FREE: REINFORCEMENT LEARNING

[Figure: the same backup tree, but only a single sampled branch $s_t \to r_{t+1}, s_{t+1}$ is followed, as experienced by the agent]

Q-LEARNING

One-step Q-learning:

$$Q(s_t, a_t) \leftarrow Q(s_t, a_t) + \alpha\left[r_{t+1} + \gamma \max_{a'} Q(s_{t+1}, a') - Q(s_t, a_t)\right]$$
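As a minimal sketch (tabular Q stored as a NumPy array; names are my own):

```python
import numpy as np

def q_learning_update(Q, s, a, r, s_next, alpha=0.1, gamma=0.95):
    """One tabular Q-learning backup on the transition (s, a, r, s_next)."""
    td_target = r + gamma * np.max(Q[s_next])     # greedy bootstrap
    Q[s, a] += alpha * (td_target - Q[s, a])      # move Q(s, a) toward it
```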

Q-LEARNING: EXAMPLE

• Epoch 1: 1,2,4
• Epoch 2: 1,6
• Epoch 3: 1,3
• Epoch 4: 1,2,5
• Epoch 6: 2,5

[Figure: a six-state MDP (states 1-6) with actions a-d, rewards R ∈ {1, 1, 2, 4, 5, 10} on its transitions, and stochastic branches with probabilities 0.2/0.8 and 0.3/0.7 (the remaining transitions have probability 1.0)]

UPDATING Q: IN PRACTICE

CONVERGENCE OF DETERMINISTIC Q-LEARNING

Q-learning is guaranteed to converge in a Markovian setting, i.e. $\hat{Q}$ converges to $Q$ when each (s, a) pair is visited infinitely often.

Extra material: Tsitsiklis, J.N., "Asynchronous Stochastic Approximation and Q-learning," Machine Learning, Vol. 16, pp. 185-202, 1994.

CONVERGENCE OF DETERMINISTIC Q-LEARNING

Proof:

• Let a full interval be an interval during which each (s, a) is visited
• Let $\hat{Q}_n$ be the Q-table after n updates
• $\Delta_n$ is the maximum error in $\hat{Q}_n$:

$$\Delta_n = \max_{s,a} |\hat{Q}_n(s, a) - Q(s, a)|$$

CONVERGENCE OF DETERMINISTIC Q-LEARNING

For any table entry $\hat{Q}_n(s, a)$ updated on iteration n+1, the error in the revised estimate $\hat{Q}_{n+1}(s, a)$ is

$$\begin{aligned}
|\hat{Q}_{n+1}(s, a) - Q(s, a)| &= \left|\left(r + \gamma \max_{a'} \hat{Q}_n(s', a')\right) - \left(r + \gamma \max_{a'} Q(s', a')\right)\right| \\
&= \left|\gamma \max_{a'} \hat{Q}_n(s', a') - \gamma \max_{a'} Q(s', a')\right| \\
&\le \gamma \max_{a'} \left|\hat{Q}_n(s', a') - Q(s', a')\right| \\
&\le \gamma \max_{s'', a'} \left|\hat{Q}_n(s'', a') - Q(s'', a')\right|
\end{aligned}$$

$$|\hat{Q}_{n+1}(s, a) - Q(s, a)| \le \gamma \Delta_n < \Delta_n$$

SARSA: ON-POLICY TD-CONTROL

$$Q(s_t, a_t) \leftarrow Q(s_t, a_t) + \alpha\left[r_{t+1} + \gamma Q(s_{t+1}, a_{t+1}) - Q(s_t, a_t)\right]$$

Q-LEARNING VS SARSA

• One-step Q-learning (off-policy):

$$Q(s_t, a_t) \leftarrow Q(s_t, a_t) + \alpha\left[r_{t+1} + \gamma \max_{a'} Q(s_{t+1}, a') - Q(s_t, a_t)\right]$$

• Sarsa (on-policy):

$$Q(s_t, a_t) \leftarrow Q(s_t, a_t) + \alpha\left[r_{t+1} + \gamma Q(s_{t+1}, a_{t+1}) - Q(s_t, a_t)\right]$$
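In code the two methods differ by one line: the bootstrap target. A minimal Sarsa episode sketch, assuming a hypothetical environment with reset() -> s and step(a) -> (s', r, done):

```python
import numpy as np

rng = np.random.default_rng(0)

def epsilon_greedy(Q, s, epsilon=0.1):
    """Random action with probability epsilon, else the greedy action."""
    if rng.random() < epsilon:
        return int(rng.integers(Q.shape[1]))
    return int(np.argmax(Q[s]))

def sarsa_episode(env, Q, alpha=0.1, gamma=0.95, epsilon=0.1):
    s = env.reset()
    a = epsilon_greedy(Q, s, epsilon)
    done = False
    while not done:
        s_next, r, done = env.step(a)
        a_next = epsilon_greedy(Q, s_next, epsilon)
        # On-policy target: the action a_next we will actually take.
        # Q-learning would use np.max(Q[s_next]) here instead.
        Q[s, a] += alpha * (r + gamma * Q[s_next, a_next] - Q[s, a])
        s, a = s_next, a_next
```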

CLIFF WALKING EXAMPLE

• Actions: up, down, left, right
• Reward: cliff = -100, goal = 0, default = -1
• $\epsilon$-greedy, with $\epsilon = 0.1$

[Figure: gridworld from start S to goal G along a cliff (r = -100 for stepping off, r = -1 otherwise); Sarsa learns the longer safe path, Q-learning the optimal path along the cliff edge. The plot of reward per episode (-100 to -25, over 500 episodes) shows Sarsa performing better online.]

PLANNING AND LEARNING

• Model: anything the agent can use to predict how the environment will respond to its actions
• Distribution model: description of all possibilities and their probabilities, e.g. $P^a_{ss'}$ and $R^a_{ss'}$ for all $s, s'$ and $a \in A(s)$
• Sample model: produces sample experiences, e.g. a simulation model
• Both types of model can be used to produce simulated experience
• Sample models are often easier to come by

PLANNING

• Planning is any computational process that uses a model to create or improve a policy:

  model --> (planning) --> policy

• Planning in AI:
  - state-space planning (e.g. heuristic search methods)
• We take the following (unusual) view:
  - all state-space planning methods involve computing value functions, either explicitly or implicitly
  - they all apply backups to simulated experience:

  model --> simulated experience --> (backups) --> values --> policy

PLANNING

Two uses of real experience:
• model learning: to improve the model
• direct RL: to directly improve the value function and policy

Improving the value function and/or policy via a model is sometimes called indirect RL, model-based RL, or planning.

[Figure: loop in which acting produces experience; experience drives both direct RL (updating value/policy) and model learning; the model supports planning, which also updates value/policy]

INDIRECT VS DIRECT RL

Indirect methods:
• make fuller use of experience: get a better policy with fewer environment interactions

Direct methods:
• simpler
• not affected by bad models

These are closely related: planning, acting, model learning, and direct RL can occur simultaneously and in parallel.

DYNA-Q ALGORITHM
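The slide's algorithm box did not survive extraction; below is a minimal sketch of tabular Dyna-Q as usually stated (direct RL + model learning + planning), assuming a deterministic environment and the same hypothetical env interface as the Sarsa sketch:

```python
import numpy as np

rng = np.random.default_rng(0)

def dyna_q(env, n_states, n_actions, n_planning=50,
           episodes=50, alpha=0.1, gamma=0.95, epsilon=0.1):
    Q = np.zeros((n_states, n_actions))
    model = {}                                   # (s, a) -> (r, s')
    for _ in range(episodes):
        s, done = env.reset(), False
        while not done:
            # epsilon-greedy action selection
            if rng.random() < epsilon:
                a = int(rng.integers(n_actions))
            else:
                a = int(np.argmax(Q[s]))
            s_next, r, done = env.step(a)
            # (a) direct RL: one-step Q-learning on the real transition
            Q[s, a] += alpha * (r + gamma * np.max(Q[s_next]) - Q[s, a])
            # (b) model learning: remember what the environment did
            model[(s, a)] = (r, s_next)
            # (c) planning: n backups on randomly replayed transitions
            seen = list(model)
            for _ in range(n_planning):
                ps, pa = seen[rng.integers(len(seen))]
                pr, pn = model[(ps, pa)]
                Q[ps, pa] += alpha * (pr + gamma * np.max(Q[pn]) - Q[ps, pa])
            s = s_next
    return Q
```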

DYNA-Q IN A MAZE

• Reward = 0 until the goal, where it is +1
• $\epsilon$-greedy, $\epsilon = 0.1$
• learning rate = 0.1
• initial Q-values = 0
• discount factor = 0.95

[Figure: maze from S to G; steps per episode (up to ~800) over 50 episodes for 0 planning steps (direct RL only), 5 planning steps, and 50 planning steps; more planning steps cut the number of episodes needed dramatically]

DYNA-Q: SNAPSHOTS

[Figure: policies in the maze learned without planning (N = 0) and with planning (N = 50)]

DYNA-Q: WRONG MODEL

• Easier environment: partway through learning the maze changes so that a shorter path opens up

[Figure: cumulative reward over 6000 time steps for Dyna-Q+, Dyna-Q, and Dyna-AC; Dyna-Q+ performs best]

DYNA-Q: WRONG MODEL

• Harder environment: partway through learning the maze changes so that the learned path is blocked

[Figure: cumulative reward over 3000 time steps for Dyna-Q+, Dyna-Q, and Dyna-AC; Dyna-Q+ recovers fastest]

WHAT IS DYNA-Q+

Uses an "exploration bonus":

• Keep track of the time since each state-action pair was tried for real
• An extra reward is added for transitions caused by state-action pairs, related to how long ago they were tried: the longer unvisited, the more reward for visiting
• The agent actually "plans" how to visit long-unvisited states

$$r + k\sqrt{n}, \text{ with } k \text{ a weight factor and } n \text{ the time since the pair was last tried}$$
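Grafted onto the Dyna-Q sketch above, the bonus only changes the planning backup; tau is an assumed |S| x |A| counter of steps since each pair was last tried for real:

```python
import numpy as np

k = 1e-3  # illustrative weight factor

def dyna_q_plus_backup(Q, model, tau, ps, pa, alpha=0.1, gamma=0.95):
    """Planning backup with an exploration bonus on the simulated reward."""
    pr, pn = model[(ps, pa)]
    bonus = k * np.sqrt(tau[ps, pa])   # longer unvisited => larger bonus
    Q[ps, pa] += alpha * (pr + bonus + gamma * np.max(Q[pn]) - Q[ps, pa])
```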

ACTOR CRITIC METHODS

• Explicit representation of the policy as well as the value function
• Minimal computation to select actions
• Can learn an explicit stochastic policy
• Can put constraints on policies
• Appealing as psychological and neural models

[Figure: the actor (policy) maps states to actions; the critic (value function) turns the environment's reward into a TD error that updates both actor and critic]

ACTOR-CRITIC DETAILS

If actions are determined by preferences $p(s, a)$ as follows:

$$\pi_t(s, a) = \Pr\{a_t = a \mid s_t = s\} = \frac{e^{p(s, a)}}{\sum_b e^{p(s, b)}}$$

then you can update the preferences like this:

$$p(s_t, a_t) \leftarrow p(s_t, a_t) + \beta \delta_t$$

The TD error is used to evaluate actions:

$$\delta_t = r_{t+1} + \gamma V(s_{t+1}) - V(s_t)$$
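A minimal tabular sketch of these three equations (array names and step sizes are my own; beta is the actor step size):

```python
import numpy as np

def softmax_policy(prefs, s):
    """Gibbs/softmax action probabilities from preferences p(s, .)."""
    z = np.exp(prefs[s] - np.max(prefs[s]))    # shift for numerical stability
    return z / z.sum()

def actor_critic_step(prefs, V, s, a, r, s_next,
                      alpha=0.1, beta=0.1, gamma=0.95):
    delta = r + gamma * V[s_next] - V[s]       # TD error (the critic's signal)
    V[s] += alpha * delta                      # critic: TD(0) value update
    prefs[s, a] += beta * delta                # actor: preference update
```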

MONTE CARLO METHODS

• Monte Carlo methods learn from complete sample returns
  - only defined for episodic tasks
  - no bootstrapping
• Monte Carlo methods learn directly from experience
  - online: no model necessary and still attains optimality
  - simulated: no need for a full model

MONTE CARLO POLICY EVALUATION

• Goal: learn $V^\pi(s)$
• Given: some number of episodes under $\pi$ which contain s
• Idea: average the returns observed after visits to s
• Every-visit MC: average returns for every time s is visited in an episode
• First-visit MC: average returns only for the first time s is visited in an episode
• Both converge asymptotically

FIRST-VISIT MONTE CARLO POLICY EVALUATION
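The slide's algorithm box was lost in extraction; a minimal sketch of first-visit MC evaluation (the episode format is my own assumption: a list of (state, reward) pairs, with reward received on leaving that state):

```python
from collections import defaultdict

def first_visit_mc(episodes, gamma=0.95):
    """First-visit Monte Carlo policy evaluation over recorded episodes."""
    returns_sum = defaultdict(float)
    returns_cnt = defaultdict(int)
    for episode in episodes:
        # Return G_t following every time step, computed backwards.
        G, G_at = 0.0, [0.0] * len(episode)
        for t in reversed(range(len(episode))):
            _, r = episode[t]
            G = r + gamma * G
            G_at[t] = G
        # Average only the return following the FIRST visit to each state.
        seen = set()
        for t, (s, _) in enumerate(episode):
            if s not in seen:
                seen.add(s)
                returns_sum[s] += G_at[t]
                returns_cnt[s] += 1
    return {s: returns_sum[s] / returns_cnt[s] for s in returns_sum}
```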

BLACKJACK EXAMPLE

• Objective: have your card sum be greater than the dealer's without exceeding 21
• States (200 in total):
  - current sum (12-21)
  - dealer's showing card (ace-10)
  - do I have a usable ace?
• Reward: +1 for winning, 0 for a draw, -1 for losing
• Actions: stick (stop receiving cards), hit (receive another card)
• Policy: stick if my sum is 20 or 21, else hit
• Dealer's policy: stick on any sum of 17 or greater, otherwise hit
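Both fixed policies are one-line threshold rules; in Python:

```python
def player_policy(current_sum: int) -> str:
    """The policy being evaluated: stick on 20 or 21, otherwise hit."""
    return "stick" if current_sum >= 20 else "hit"

def dealer_policy(dealer_sum: int) -> str:
    """The dealer's fixed rule: stick on 17 or greater, otherwise hit."""
    return "stick" if dealer_sum >= 17 else "hit"
```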

BLACKJACK VALUE FUNCTIONS

BACKUP SCHEME DP

$$V(s_t) \leftarrow E_\pi\{r_{t+1} + \gamma V(s_{t+1})\}$$

[Figure: full-width DP backup from $s_t$ over all possible $r_{t+1}$ and $s_{t+1}$]

BACKUP SCHEME TD

$$V(s_t) \leftarrow V(s_t) + \alpha\left[r_{t+1} + \gamma V(s_{t+1}) - V(s_t)\right]$$

[Figure: TD backup along the single sampled transition $s_t \to r_{t+1}, s_{t+1}$]

SIMPLE MONTE CARLO

$$V(s_t) \leftarrow V(s_t) + \alpha\left[R_t - V(s_t)\right], \text{ where } R_t \text{ is the actual return following state } s_t$$

[Figure: MC backup along an entire sampled episode from $s_t$ to a terminal state]

MONTE CARLO ESTIMATION OF ACTION VALUES (Q)

• $Q^\pi(s, a)$ = average return starting from state s and action a, following $\pi$
• Also converges asymptotically if every state-action pair is visited
• Exploring starts: every state-action pair has a non-zero probability of being the starting pair

MONTE CARLO CONTROL

• MC policy iteration: policy evaluation using MC methods, followed by policy improvement
• Policy improvement step: greedify with respect to the value (or action-value) function

CONVERGENCE OF MC CONTROL

• The greedified policy meets the conditions for policy improvement
• This assumes exploring starts and an infinite number of episodes for MC policy evaluation
• To work around the latter:
  - update only to a given level of performance
  - alternate between evaluation and improvement per episode

MONTE CARLO EXPLORING STARTS
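The algorithm box on this slide was lost; below is a minimal sketch of MC control with exploring starts. sample_episode(s0, a0, policy) is an assumed helper that rolls out one episode from the forced start (s0, a0) and returns (state, action, reward) triples:

```python
import random
from collections import defaultdict

def mc_exploring_starts(sample_episode, n_states, n_actions,
                        n_iters=100_000, gamma=0.95):
    Q = defaultdict(float)
    counts = defaultdict(int)
    policy = [random.randrange(n_actions) for _ in range(n_states)]
    for _ in range(n_iters):
        # Exploring start: every (s, a) pair can begin an episode.
        s0, a0 = random.randrange(n_states), random.randrange(n_actions)
        episode = sample_episode(s0, a0, policy)
        # Returns following each step, computed backwards.
        G, G_at = 0.0, [0.0] * len(episode)
        for t in reversed(range(len(episode))):
            G = episode[t][2] + gamma * G
            G_at[t] = G
        # First-visit updates, then greedify at the visited states.
        seen = set()
        for t, (s, a, _) in enumerate(episode):
            if (s, a) not in seen:
                seen.add((s, a))
                counts[(s, a)] += 1
                Q[(s, a)] += (G_at[t] - Q[(s, a)]) / counts[(s, a)]
                policy[s] = max(range(n_actions), key=lambda b: Q[(s, b)])
    return Q, policy
```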

BLACKJACK EXAMPLE CONTINUED

• Exploring starts
• Initial policy as described before

ON-POLICY MONTE CARLO CONTROL

• On-policy: learn about the policy currently being executed
• The policy can also be non-deterministic, e.g. an $\epsilon$-soft policy:
  - probability of selecting each non-best action: $\epsilon / |A(s)|$
  - probability of selecting the best action: $1 - \epsilon + \epsilon / |A(s)|$
• Similar to GPI: move the policy towards the greedy policy (i.e. $\epsilon$-soft)
• Converges to the best $\epsilon$-soft policy
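As a minimal sketch, sampling an action from an ε-soft policy derived from Q (names are my own):

```python
import numpy as np

rng = np.random.default_rng(0)

def epsilon_soft_action(Q, s, epsilon=0.1):
    """Every action keeps probability >= epsilon/|A(s)|; the greedy action
    receives the remaining 1 - epsilon on top of that."""
    n_actions = Q.shape[1]
    probs = np.full(n_actions, epsilon / n_actions)
    probs[int(np.argmax(Q[s]))] += 1.0 - epsilon
    return int(rng.choice(n_actions, p=probs))
```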

ON-POLICY MC CONTROL