
Chapter 5: Monte Carlo Methods


Page 1: Chapter 5: Monte Carlo Methods

R. S. Sutton and A. G. Barto: Reinforcement Learning: An Introduction 1

Chapter 5: Monte Carlo Methods

Monte Carlo methods learn from complete sample returns
Defined only for episodic tasks

Monte Carlo methods learn directly from experience
On-line experience: no model necessary, yet optimality is still attained
Simulated experience: no need for a full model

Page 2: Chapter 5: Monte Carlo Methods

Monte Carlo Policy Evaluation

Goal: learn Vπ(s)
Given: some number of episodes under π which contain s
Idea: average returns observed after visits to s

Every-Visit MC: average returns for every time s is visited in an episode

First-visit MC: average returns only for first time s is visited in an episode

Both converge asymptotically

Page 3: Chapter 5: Monte Carlo Methods

First-visit Monte Carlo Policy Evaluation

Initialize:
  π ← policy to be evaluated
  V ← an arbitrary state-value function
  Returns(s) ← an empty list, for all s ∈ S

Repeat forever:
  Generate an episode using π
  For each state s appearing in the episode:
    R ← return following the first occurrence of s
    Append R to Returns(s)
    V(s) ← average(Returns(s))
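As a concrete companion to this procedure, here is a minimal Python sketch of first-visit MC evaluation. The episode generator (returning a list of (state, reward-that-followed) pairs produced under the policy being evaluated) and the discount `gamma` are assumptions for illustration, not part of the slide.

```python
from collections import defaultdict

def first_visit_mc_evaluation(generate_episode, num_episodes, gamma=1.0):
    """First-visit MC policy evaluation, following the procedure above.

    `generate_episode` is assumed to return a list of (state, reward)
    pairs produced by following the policy pi being evaluated, where the
    reward is the one received after leaving that state.
    """
    returns = defaultdict(list)   # Returns(s): returns observed after first visits
    V = defaultdict(float)        # V(s): current estimate

    for _ in range(num_episodes):
        episode = generate_episode()          # e.g. [(s0, r1), (s1, r2), ...]
        states = [s for s, _ in episode]

        # Backwards pass: return following each time step.
        G, returns_after = 0.0, [0.0] * len(episode)
        for t in reversed(range(len(episode))):
            G = gamma * G + episode[t][1]
            returns_after[t] = G

        # Average only the return following the FIRST occurrence of each state.
        for t, s in enumerate(states):
            if s not in states[:t]:
                returns[s].append(returns_after[t])
                V[s] = sum(returns[s]) / len(returns[s])
    return V
```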

Page 4: Chapter 5: Monte Carlo Methods

Blackjack example

Object: have your card sum be greater than the dealer's without exceeding 21

States (200 of them):
  current sum (12-21)
  dealer's showing card (ace-10)
  do I have a usable ace?

Reward: +1 for winning, 0 for a draw, -1 for losing
Actions: stick (stop receiving cards), hit (receive another card)
Policy: stick if my sum is 20 or 21, else hit
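To make the setup concrete, here is a sketch of generating one episode under the slide's fixed policy. The infinite-deck dealing and the dealer hitting until 17 are conventional Blackjack assumptions not spelled out on the slide.

```python
import random

def draw_card():
    # Infinite deck; face cards count as 10 (a simplifying assumption).
    return min(random.randint(1, 13), 10)

def hand_value(cards):
    """Return (sum, usable_ace): one ace counts as 11 if that does not bust."""
    total = sum(cards)
    if 1 in cards and total + 10 <= 21:
        return total + 10, True
    return total, False

def play_episode():
    """One Blackjack episode under the fixed policy 'stick on 20 or 21'."""
    player = [draw_card(), draw_card()]
    dealer = [draw_card(), draw_card()]
    dealer_showing = dealer[0]

    # Player's turn: states are (current sum 12-21, dealer's card, usable ace).
    states = []
    while True:
        total, usable = hand_value(player)
        if total < 12:            # sums below 12 always hit; no decision to record
            player.append(draw_card())
            continue
        states.append((total, dealer_showing, usable))
        if total >= 20:           # the slide's policy: stick on 20 or 21
            break
        player.append(draw_card())
        if hand_value(player)[0] > 21:
            return states, -1     # player busts: reward -1

    # Dealer's turn (assumed conventional rule: hit until reaching 17).
    while hand_value(dealer)[0] < 17:
        dealer.append(draw_card())
    p, d = hand_value(player)[0], hand_value(dealer)[0]
    if d > 21 or p > d:
        return states, +1
    return (states, 0) if p == d else (states, -1)
```

Episodes in this (visited states, final reward) form can be fed to a first-visit evaluator like the earlier sketch by attaching the reward to the last state and zero to the others.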

Page 5: Chapter 5: Monte Carlo Methods

Blackjack Value Functions

After many MC evaluations of state visits, the state-value function is well approximated
The quantities Dynamic Programming needs (transition probabilities and expected rewards) are difficult to formulate here!

For instance, with a given hand and a decision to stick, what would be the expected return?

Page 6: Chapter 5: Monte Carlo Methods

Backup Diagram for Monte Carlo

The entire episode is included, whereas DP backs up only one-step transitions

Only one choice is considered at each state (unlike DP, which branches over all possible transitions in one step)

Estimates for each state are independent, so MC does not bootstrap (build on other estimates)

Time required to estimate one state does not depend on the total number of states

Page 7: Chapter 5: Monte Carlo Methods

The Power of Monte Carlo

Example: Elastic Membrane (Dirichlet Problem)

How do we compute the shape of the membrane or bubble attached to a fixed frame?

Page 8: Chapter 5: Monte Carlo Methods

Two Approaches

Relaxation: iterate over the grid, computing local averages (like DP iterations)

Kakutani's algorithm, 1945: use many random walks and average the values at the boundary points where they terminate (like the MC approach)
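A small sketch of Kakutani's random-walk idea on a square grid: the membrane height at an interior point is estimated by averaging the frame (boundary) values at the points where many random walks first exit. The grid size and boundary function below are illustrative assumptions.

```python
import random

def estimate_height(boundary_value, start, size, num_walks=1000):
    """Estimate the membrane height at `start` on a size x size grid.

    `boundary_value(x, y)` gives the fixed frame height on the boundary;
    interior points satisfy 0 < x, y < size - 1.
    """
    total = 0.0
    for _ in range(num_walks):
        x, y = start
        # Walk randomly until the boundary of the grid is hit.
        while 0 < x < size - 1 and 0 < y < size - 1:
            dx, dy = random.choice([(1, 0), (-1, 0), (0, 1), (0, -1)])
            x, y = x + dx, y + dy
        total += boundary_value(x, y)
    return total / num_walks   # average of boundary values where walks exited

# Example: frame held at height 1 along one edge, 0 elsewhere (an assumption).
height = estimate_height(lambda x, y: 1.0 if y == 0 else 0.0,
                         start=(5, 5), size=11)
```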

Page 9: Chapter 5: Monte Carlo Methods

Monte Carlo Estimation of Action Values (Q)

Monte Carlo is most useful when a model is not available
We want to learn Q*

Qπ(s,a): the average return after starting in state s, taking action a, and thereafter following π

Converges asymptotically if every state-action pair is visited infinitely often

To assure this we must maintain exploration, so that state-action pairs keep being visited

Exploring starts: Every state-action pair has a non-zero probability of being the starting pair

Page 10: Chapter 5: Monte Carlo Methods

Monte Carlo Control (to approximate an optimal policy)

MC policy iteration: Policy evaluation by approximating Q using MC methods, followed by policy improvement

Policy improvement step: make the policy greedy with respect to the Q (action-value) function; no model is needed to construct the greedy policy

Generalized Policy Iteration (GPI)

Page 11: Chapter 5: Monte Carlo Methods

Convergence of MC Control

Policy improvement theorem tells us:

$$Q^{\pi_k}\big(s, \pi_{k+1}(s)\big) = Q^{\pi_k}\big(s, \arg\max_a Q^{\pi_k}(s,a)\big) = \max_a Q^{\pi_k}(s,a) \;\ge\; Q^{\pi_k}\big(s, \pi_k(s)\big) = V^{\pi_k}(s)$$

This assumes exploring starts and an infinite number of episodes for MC policy evaluation
To solve the latter:
  update only to a given level of performance
  alternate between evaluation and improvement per episode

Page 12: Chapter 5: Monte Carlo Methods

Monte Carlo Exploring Starts

Its fixed point is the optimal policy π*

A proof of convergence is one of the fundamental open questions of RL
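A compact Python sketch in the spirit of Monte Carlo control with exploring starts; the state/action lists and the `generate_episode(s0, a0, policy)` helper, which is assumed to return (state, action, reward) triples for an episode begun at the random pair and then following the current policy, are illustrative interfaces.

```python
from collections import defaultdict
import random

def mc_exploring_starts(states, actions, generate_episode, num_episodes, gamma=1.0):
    """Monte Carlo control with exploring starts (a sketch)."""
    Q = defaultdict(float)
    returns = defaultdict(list)
    policy = {s: random.choice(actions) for s in states}

    for _ in range(num_episodes):
        # Exploring start: every (s, a) pair has nonzero start probability.
        s0, a0 = random.choice(states), random.choice(actions)
        episode = generate_episode(s0, a0, policy)

        G = 0.0
        for t in reversed(range(len(episode))):
            s, a, r = episode[t]
            G = gamma * G + r
            # First-visit check for the state-action pair.
            if (s, a) not in [(x, y) for x, y, _ in episode[:t]]:
                returns[(s, a)].append(G)
                Q[(s, a)] = sum(returns[(s, a)]) / len(returns[(s, a)])
                # Policy improvement: greedy with respect to Q at this state.
                policy[s] = max(actions, key=lambda a_: Q[(s, a_)])
    return policy, Q
```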

Page 13: Chapter 5: Monte Carlo Methods

Blackjack Example (continued)

Exploring starts: easy to enforce by generating state-action pairs randomly
Initial policy as described before (sticks only at 20 or 21)
Initial action-value function equal to zero

Page 14: Chapter 5: Monte Carlo Methods

On-policy Monte Carlo Control

On-policy: learn about or improve the policy currently being executed
How do we get rid of exploring starts?
Need soft policies: π(s,a) > 0 for all s and a, e.g. an ε-soft policy with action-selection probabilities:

$$\pi(s,a) = \begin{cases} 1 - \varepsilon + \dfrac{\varepsilon}{|A(s)|} & \text{greedy action} \\[4pt] \dfrac{\varepsilon}{|A(s)|} & \text{non-max actions} \end{cases}$$

Similar to GPI: move the policy towards the greedy policy (i.e. ε-greedy)
Converges to the best ε-soft policy
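A minimal sketch of drawing an action with exactly these ε-soft probabilities; the Q table keyed by (state, action) and the `actions` list are assumptions.

```python
import random

def epsilon_soft_action(Q, state, actions, epsilon=0.1):
    """Select an action with the epsilon-soft probabilities above:
    1 - epsilon + epsilon/|A(s)| for the greedy action,
    epsilon/|A(s)| for every other action."""
    if random.random() < epsilon:
        return random.choice(actions)          # explore: uniform over A(s)
    # Exploit: greedy action w.r.t. the current action-value estimates.
    return max(actions, key=lambda a: Q.get((state, a), 0.0))
```

With probability ε a uniform random action is taken, so the greedy action receives total probability 1 - ε + ε/|A(s)| and every other action ε/|A(s)|.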

Page 15: Chapter 5: Monte Carlo Methods

On-policy MC Control

Page 16: Chapter 5: Monte Carlo Methods

Learning about a policy π while following another policy π′

Page 17: Chapter 5: Monte Carlo Methods

Off-policy Monte Carlo control

On-policy methods estimate the value of a policy while following it, so the policy that generates behavior and the policy being evaluated are identical
Off-policy methods assume separate behavior and estimation policies

Behavior policy: generates behavior in the environment; may be randomized so as to sample all actions

Estimation policy: the policy being learned about; may be deterministic (greedy)

Off-policy methods estimate one policy while following another
Returns from the behavior policy can be averaged because it selects, with nonzero probability, all actions the estimation policy might take
May be slow to find improvement after a nongreedy action
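A sketch of this idea using weighted importance sampling: returns generated by the behavior policy are reweighted by the ratio of the two policies' action probabilities to estimate action values for the estimation policy. The episode format and the `target_prob` / `behavior_prob` functions are assumed.

```python
def off_policy_mc_q(episodes, target_prob, behavior_prob, gamma=1.0):
    """Weighted importance-sampling estimate of Q(s, a) for the estimation
    (target) policy pi, from episodes generated by the behavior policy b.

    Episodes are assumed to be lists of (state, action, reward) triples;
    `target_prob(s, a)` and `behavior_prob(s, a)` give pi(a|s) and b(a|s).
    """
    Q, C = {}, {}                      # estimates and cumulative weights
    for episode in episodes:
        G, W = 0.0, 1.0                # return and importance-sampling weight
        for s, a, r in reversed(episode):
            G = gamma * G + r
            C[(s, a)] = C.get((s, a), 0.0) + W
            q = Q.get((s, a), 0.0)
            Q[(s, a)] = q + (W / C[(s, a)]) * (G - q)
            W *= target_prob(s, a) / behavior_prob(s, a)
            if W == 0.0:               # pi would never take this action: stop
                break
    return Q
```

The incremental weighted-average update used inside the loop is the same one developed on the "Incremental Implementation" slide below.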

Page 18: Chapter 5: Monte Carlo Methods

Off-policy MC control

Page 19: Chapter 5: Monte Carlo Methods

Incremental Implementation

MC can be implemented incrementally, which saves memory

Compute the weighted average of the returns

Non-incremental:

$$V_n = \frac{\sum_{k=1}^{n} w_k R_k}{\sum_{k=1}^{n} w_k}$$

Incremental equivalent:

$$V_{n+1} = V_n + \frac{w_{n+1}}{W_{n+1}}\big(R_{n+1} - V_n\big), \qquad W_{n+1} = W_n + w_{n+1}, \qquad V_0 = W_0 = 0$$
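A line-for-line sketch of the incremental update, with variable names mirroring the formulas above; the input is assumed to be a sequence of (weight, return) pairs.

```python
def incremental_weighted_average(weighted_returns):
    """Incrementally compute V_n = sum(w_k * R_k) / sum(w_k) without
    storing past returns; `weighted_returns` is a sequence of (w, R) pairs."""
    V, W = 0.0, 0.0                    # V_0 = 0, W_0 = 0
    for w, R in weighted_returns:
        W = W + w                      # W_{n+1} = W_n + w_{n+1}
        V = V + (w / W) * (R - V)      # V_{n+1} = V_n + (w_{n+1}/W_{n+1})(R_{n+1} - V_n)
    return V

# e.g. incremental_weighted_average([(1.0, 2.0), (0.5, 4.0)]) -> (1*2 + 0.5*4) / 1.5
```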

Page 20: Chapter 5: Monte Carlo Methods

Racetrack Exercise

States: grid square, plus horizontal and vertical velocity

Rewards: -1 on track, -5 off track
Only right turns are allowed

Actions: +1, -1, or 0 added to each velocity component
0 < Velocity < 5
Stochastic: 50% of the time the car moves 1 extra square up or right
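A sketch of just the transition dynamics described above; the track-membership predicate `on_track`, the interpretation of the stochastic extra move, and the row/column sign conventions are assumptions for illustration.

```python
import random

def racetrack_step(position, velocity, action, on_track):
    """One transition of the racetrack dynamics sketched above.

    `action` is a pair of increments in {-1, 0, +1} for the (vertical,
    horizontal) velocity components; `on_track(row, col)` is an assumed
    predicate describing the track layout.
    """
    # Apply the increments, keeping each component nonnegative and below 5
    # (the slide's "0 < Velocity < 5" bound).
    vy = min(max(velocity[0] + action[0], 0), 4)
    vx = min(max(velocity[1] + action[1], 0), 4)

    # Stochastic dynamics: 50% of the time, one extra square up or right
    # (read here as a uniform choice between the two -- an assumption).
    extra = random.choice([(1, 0), (0, 1)]) if random.random() < 0.5 else (0, 0)

    row = position[0] - (vy + extra[0])   # "up" decreases the row index
    col = position[1] + (vx + extra[1])

    reward = -1 if on_track(row, col) else -5
    return (row, col), (vy, vx), reward
```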

Page 21: Chapter 5: Monte Carlo Methods

Summary

MC has several advantages over DP:
  can learn directly from interaction with the environment
  no need for full models
  no need to learn about ALL states
  less harm from violations of the Markov property (later in the book)

MC methods provide an alternate policy evaluation process

One issue to watch for: maintaining sufficient exploration (exploring starts, soft policies)

No bootstrapping (as opposed to DP)