
Chapter 5: Monte Carlo Methods


Page 1: Chapter 5: Monte Carlo Methods

R. S. Sutton and A. G. Barto: Reinforcement Learning: An Introduction 1

Chapter 5: Monte Carlo Methods

Monte Carlo methods learn from complete sample returns
Defined only for episodic tasks

Monte Carlo methods learn directly from experience
On-line experience: no model necessary, yet optimality is still attained
Simulated experience: no need for a full model

Page 2: Chapter 5: Monte Carlo Methods

Monte Carlo Policy Evaluation

Goal: learn Vπ(s)
Given: some number of episodes under π which contain s
Idea: average returns observed after visits to s

Every-Visit MC: average returns for every time s is visited in an episode

First-visit MC: average returns only for first time s is visited in an episode

Both converge asymptotically

Page 3: Chapter 5: Monte Carlo Methods

First-visit Monte Carlo Policy Evaluation

Initialize:
  π ← policy to be evaluated
  V ← an arbitrary state-value function
  Returns(s) ← an empty list, for all s ∈ S

Repeat forever:
  Generate an episode using π
  For each state s appearing in the episode:
    R ← return following the first occurrence of s
    Append R to Returns(s)
    V(s) ← average(Returns(s))
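As a concrete companion to this procedure, here is a minimal Python sketch of first-visit MC evaluation. The episode generator (returning a list of (state, reward-that-followed) pairs produced under the policy being evaluated) and the discount `gamma` are assumptions for illustration, not part of the slide.

```python
from collections import defaultdict

def first_visit_mc_evaluation(generate_episode, num_episodes, gamma=1.0):
    """First-visit MC policy evaluation, following the procedure above.

    `generate_episode` is assumed to return a list of (state, reward)
    pairs produced by following the policy pi being evaluated, where the
    reward is the one received after leaving that state.
    """
    returns = defaultdict(list)   # Returns(s): returns observed after first visits
    V = defaultdict(float)        # V(s): current estimate

    for _ in range(num_episodes):
        episode = generate_episode()          # e.g. [(s0, r1), (s1, r2), ...]
        states = [s for s, _ in episode]

        # Backwards pass: return following each time step.
        G, returns_after = 0.0, [0.0] * len(episode)
        for t in reversed(range(len(episode))):
            G = gamma * G + episode[t][1]
            returns_after[t] = G

        # Average only the return following the FIRST occurrence of each state.
        for t, s in enumerate(states):
            if s not in states[:t]:
                returns[s].append(returns_after[t])
                V[s] = sum(returns[s]) / len(returns[s])
    return V
```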

Page 4: Chapter 5: Monte Carlo Methods

Blackjack example

Object: have your card sum be greater than the dealer's without exceeding 21

States (200 of them):
  current sum (12-21)
  dealer's showing card (ace-10)
  do I have a usable ace?

Reward: +1 for winning, 0 for a draw, -1 for losing
Actions: stick (stop receiving cards), hit (receive another card)
Policy: stick if my sum is 20 or 21, else hit
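To make the setup concrete, here is a sketch of generating one episode under the slide's fixed policy. The infinite-deck dealing and the dealer hitting until 17 are conventional Blackjack assumptions not spelled out on the slide.

```python
import random

def draw_card():
    # Infinite deck; face cards count as 10 (a simplifying assumption).
    return min(random.randint(1, 13), 10)

def hand_value(cards):
    """Return (sum, usable_ace): one ace counts as 11 if that does not bust."""
    total = sum(cards)
    if 1 in cards and total + 10 <= 21:
        return total + 10, True
    return total, False

def play_episode():
    """One Blackjack episode under the fixed policy 'stick on 20 or 21'."""
    player = [draw_card(), draw_card()]
    dealer = [draw_card(), draw_card()]
    dealer_showing = dealer[0]

    # Player's turn: states are (current sum 12-21, dealer's card, usable ace).
    states = []
    while True:
        total, usable = hand_value(player)
        if total < 12:            # sums below 12 always hit; no decision to record
            player.append(draw_card())
            continue
        states.append((total, dealer_showing, usable))
        if total >= 20:           # the slide's policy: stick on 20 or 21
            break
        player.append(draw_card())
        if hand_value(player)[0] > 21:
            return states, -1     # player busts: reward -1

    # Dealer's turn (assumed conventional rule: hit until reaching 17).
    while hand_value(dealer)[0] < 17:
        dealer.append(draw_card())
    p, d = hand_value(player)[0], hand_value(dealer)[0]
    if d > 21 or p > d:
        return states, +1
    return (states, 0) if p == d else (states, -1)
```

Episodes in this (visited states, final reward) form can be fed to a first-visit evaluator like the earlier sketch by attaching the reward to the last state and zero to the others.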

Page 5: Chapter 5: Monte Carlo Methods

Blackjack Value Functions

After many MC evaluations of state visits, the state-value function is well approximated
The quantities Dynamic Programming needs (transition probabilities and expected rewards) are difficult to formulate here!

For instance, with a given hand and a decision to stick, what would be the expected return?

Page 6: Chapter 5: Monte Carlo Methods

Backup Diagram for Monte Carlo

The entire episode is included, whereas DP backs up only one-step transitions

Only one choice is considered at each state (unlike DP, which branches over all possible transitions in one step)

Estimates for each state are independent, so MC does not bootstrap (build on other estimates)

Time required to estimate one state does not depend on the total number of states

Page 7: Chapter 5: Monte Carlo Methods

The Power of Monte Carlo

Example: Elastic Membrane (Dirichlet Problem)

How do we compute the shape of the membrane or bubble attached to a fixed frame?

Page 8: Chapter 5: Monte Carlo Methods

Two Approaches

Relaxation: iterate over the grid, computing local averages (like DP iterations)

Kakutani's algorithm, 1945: use many random walks and average the values at the boundary points where they terminate (like the MC approach)
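A small sketch of Kakutani's random-walk idea on a square grid: the membrane height at an interior point is estimated by averaging the frame (boundary) values at the points where many random walks first exit. The grid size and boundary function below are illustrative assumptions.

```python
import random

def estimate_height(boundary_value, start, size, num_walks=1000):
    """Estimate the membrane height at `start` on a size x size grid.

    `boundary_value(x, y)` gives the fixed frame height on the boundary;
    interior points satisfy 0 < x, y < size - 1.
    """
    total = 0.0
    for _ in range(num_walks):
        x, y = start
        # Walk randomly until the boundary of the grid is hit.
        while 0 < x < size - 1 and 0 < y < size - 1:
            dx, dy = random.choice([(1, 0), (-1, 0), (0, 1), (0, -1)])
            x, y = x + dx, y + dy
        total += boundary_value(x, y)
    return total / num_walks   # average of boundary values where walks exited

# Example: frame held at height 1 along one edge, 0 elsewhere (an assumption).
height = estimate_height(lambda x, y: 1.0 if y == 0 else 0.0,
                         start=(5, 5), size=11)
```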

Page 9: Chapter 5: Monte Carlo Methods

Monte Carlo Estimation of Action Values (Q)

Monte Carlo is most useful when a model is not available
We want to learn Q*

Qπ(s,a): the average return after starting in state s, taking action a, and thereafter following π

Converges asymptotically if every state-action pair is visited infinitely often

To assure this we must maintain exploration, so that state-action pairs keep being visited

Exploring starts: Every state-action pair has a non-zero probability of being the starting pair

Page 10: Chapter 5: Monte Carlo Methods

Monte Carlo Control (to approximate an optimal policy)

MC policy iteration: Policy evaluation by approximating Q using MC methods, followed by policy improvement

Policy improvement step: make the policy greedy with respect to the Q (action-value) function; no model is needed to construct the greedy policy

Generalized Policy Iteration (GPI)

Page 11: Chapter 5: Monte Carlo Methods

Convergence of MC Control

Policy improvement theorem tells us:

$$Q^{\pi_k}\big(s, \pi_{k+1}(s)\big) = Q^{\pi_k}\big(s, \arg\max_a Q^{\pi_k}(s,a)\big) = \max_a Q^{\pi_k}(s,a) \;\ge\; Q^{\pi_k}\big(s, \pi_k(s)\big) = V^{\pi_k}(s)$$

This assumes exploring starts and an infinite number of episodes for MC policy evaluation
To solve the latter:
  update only to a given level of performance
  alternate between evaluation and improvement per episode

Page 12: Chapter 5: Monte Carlo Methods

Monte Carlo Exploring Starts

Its fixed point is the optimal policy π*

A proof of convergence is one of the fundamental open questions of RL
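A compact Python sketch in the spirit of Monte Carlo control with exploring starts; the state/action lists and the `generate_episode(s0, a0, policy)` helper, which is assumed to return (state, action, reward) triples for an episode begun at the random pair and then following the current policy, are illustrative interfaces.

```python
from collections import defaultdict
import random

def mc_exploring_starts(states, actions, generate_episode, num_episodes, gamma=1.0):
    """Monte Carlo control with exploring starts (a sketch)."""
    Q = defaultdict(float)
    returns = defaultdict(list)
    policy = {s: random.choice(actions) for s in states}

    for _ in range(num_episodes):
        # Exploring start: every (s, a) pair has nonzero start probability.
        s0, a0 = random.choice(states), random.choice(actions)
        episode = generate_episode(s0, a0, policy)

        G = 0.0
        for t in reversed(range(len(episode))):
            s, a, r = episode[t]
            G = gamma * G + r
            # First-visit check for the state-action pair.
            if (s, a) not in [(x, y) for x, y, _ in episode[:t]]:
                returns[(s, a)].append(G)
                Q[(s, a)] = sum(returns[(s, a)]) / len(returns[(s, a)])
                # Policy improvement: greedy with respect to Q at this state.
                policy[s] = max(actions, key=lambda a_: Q[(s, a_)])
    return policy, Q
```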

Page 13: Chapter 5: Monte Carlo Methods

Blackjack Example (continued)

Exploring starts: easy to enforce by generating state-action pairs randomly
Initial policy as described before (sticks only at 20 or 21)
Initial action-value function equal to zero

Page 14: Chapter 5: Monte Carlo Methods

On-policy Monte Carlo Control

On-policy: learn about or improve the policy currently being executed
How do we get rid of exploring starts?
Need soft policies: π(s,a) > 0 for all s and a, e.g. an ε-soft policy with action-selection probabilities:

$$\pi(s,a) = \begin{cases} 1 - \varepsilon + \dfrac{\varepsilon}{|A(s)|} & \text{greedy action} \\[4pt] \dfrac{\varepsilon}{|A(s)|} & \text{non-max actions} \end{cases}$$

Similar to GPI: move the policy towards the greedy policy (i.e. ε-greedy)
Converges to the best ε-soft policy
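A minimal sketch of drawing an action with exactly these ε-soft probabilities; the Q table keyed by (state, action) and the `actions` list are assumptions.

```python
import random

def epsilon_soft_action(Q, state, actions, epsilon=0.1):
    """Select an action with the epsilon-soft probabilities above:
    1 - epsilon + epsilon/|A(s)| for the greedy action,
    epsilon/|A(s)| for every other action."""
    if random.random() < epsilon:
        return random.choice(actions)          # explore: uniform over A(s)
    # Exploit: greedy action w.r.t. the current action-value estimates.
    return max(actions, key=lambda a: Q.get((state, a), 0.0))
```

With probability ε a uniform random action is taken, so the greedy action receives total probability 1 - ε + ε/|A(s)| and every other action ε/|A(s)|.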

Page 15: Chapter 5: Monte Carlo Methods

On-policy MC Control

Page 16: Chapter 5: Monte Carlo Methods

Learning about a policy π while following another policy π′

Page 17: Chapter 5: Monte Carlo Methods

Off-policy Monte Carlo control

On-policy methods estimate the value of a policy while following it, so the policy that generates behavior and the policy being evaluated are identical
Off-policy methods assume separate behavior and estimation policies

Behavior policy: generates behavior in the environment; may be randomized so as to sample all actions

Estimation policy: the policy being learned about; may be deterministic (greedy)

Off-policy methods estimate one policy while following another
Returns from the behavior policy can be averaged because it selects, with nonzero probability, all actions the estimation policy might take
May be slow to find improvement after a nongreedy action
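A sketch of this idea using weighted importance sampling: returns generated by the behavior policy are reweighted by the ratio of the two policies' action probabilities to estimate action values for the estimation policy. The episode format and the `target_prob` / `behavior_prob` functions are assumed.

```python
def off_policy_mc_q(episodes, target_prob, behavior_prob, gamma=1.0):
    """Weighted importance-sampling estimate of Q(s, a) for the estimation
    (target) policy pi, from episodes generated by the behavior policy b.

    Episodes are assumed to be lists of (state, action, reward) triples;
    `target_prob(s, a)` and `behavior_prob(s, a)` give pi(a|s) and b(a|s).
    """
    Q, C = {}, {}                      # estimates and cumulative weights
    for episode in episodes:
        G, W = 0.0, 1.0                # return and importance-sampling weight
        for s, a, r in reversed(episode):
            G = gamma * G + r
            C[(s, a)] = C.get((s, a), 0.0) + W
            q = Q.get((s, a), 0.0)
            Q[(s, a)] = q + (W / C[(s, a)]) * (G - q)
            W *= target_prob(s, a) / behavior_prob(s, a)
            if W == 0.0:               # pi would never take this action: stop
                break
    return Q
```

The incremental weighted-average update used inside the loop is the same one developed on the "Incremental Implementation" slide below.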

Page 18: Chapter 5: Monte Carlo Methods

Off-policy MC control

Page 19: Chapter 5: Monte Carlo Methods

Incremental Implementation

MC can be implemented incrementally, which saves memory

Compute the weighted average of the returns

Non-incremental:

$$V_n = \frac{\sum_{k=1}^{n} w_k R_k}{\sum_{k=1}^{n} w_k}$$

Incremental equivalent:

$$V_{n+1} = V_n + \frac{w_{n+1}}{W_{n+1}}\big(R_{n+1} - V_n\big), \qquad W_{n+1} = W_n + w_{n+1}, \qquad V_0 = W_0 = 0$$
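A line-for-line sketch of the incremental update, with variable names mirroring the formulas above; the input is assumed to be a sequence of (weight, return) pairs.

```python
def incremental_weighted_average(weighted_returns):
    """Incrementally compute V_n = sum(w_k * R_k) / sum(w_k) without
    storing past returns; `weighted_returns` is a sequence of (w, R) pairs."""
    V, W = 0.0, 0.0                    # V_0 = 0, W_0 = 0
    for w, R in weighted_returns:
        W = W + w                      # W_{n+1} = W_n + w_{n+1}
        V = V + (w / W) * (R - V)      # V_{n+1} = V_n + (w_{n+1}/W_{n+1})(R_{n+1} - V_n)
    return V

# e.g. incremental_weighted_average([(1.0, 2.0), (0.5, 4.0)]) -> (1*2 + 0.5*4) / 1.5
```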

Page 20: Chapter 5: Monte Carlo Methods

Racetrack Exercise

States: grid square, plus horizontal and vertical velocity

Rewards: -1 on track, -5 off track
Only right turns are allowed

Actions: +1, -1, or 0 added to each velocity component
0 < Velocity < 5
Stochastic: 50% of the time the car moves 1 extra square up or right
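A sketch of just the transition dynamics described above; the track-membership predicate `on_track`, the interpretation of the stochastic extra move, and the row/column sign conventions are assumptions for illustration.

```python
import random

def racetrack_step(position, velocity, action, on_track):
    """One transition of the racetrack dynamics sketched above.

    `action` is a pair of increments in {-1, 0, +1} for the (vertical,
    horizontal) velocity components; `on_track(row, col)` is an assumed
    predicate describing the track layout.
    """
    # Apply the increments, keeping each component nonnegative and below 5
    # (the slide's "0 < Velocity < 5" bound).
    vy = min(max(velocity[0] + action[0], 0), 4)
    vx = min(max(velocity[1] + action[1], 0), 4)

    # Stochastic dynamics: 50% of the time, one extra square up or right
    # (read here as a uniform choice between the two -- an assumption).
    extra = random.choice([(1, 0), (0, 1)]) if random.random() < 0.5 else (0, 0)

    row = position[0] - (vy + extra[0])   # "up" decreases the row index
    col = position[1] + (vx + extra[1])

    reward = -1 if on_track(row, col) else -5
    return (row, col), (vy, vx), reward
```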

Page 21: Chapter 5: Monte Carlo Methods

Summary

MC has several advantages over DP:
  can learn directly from interaction with the environment
  no need for full models
  no need to learn about ALL states
  less harm from violations of the Markov property (later in the book)

MC methods provide an alternate policy evaluation process

One issue to watch for: maintaining sufficient exploration (exploring starts, soft policies)

No bootstrapping (as opposed to DP)