
Page 1

The Right Way to do Reinforcement Learning

with Function Approximation

Rich Sutton, AT&T Labs
with thanks to Satinder Singh, David McAllester, Mike Kearns

Page 2

The Prize
• To find the “Right Way” to do RL with FA
  – sound (stable, non-divergent)
  – ends up with a good policy
  – gets there quickly, efficiently
  – applicable to any (discrete-time, finite-state) MDP
  – compatible with (at least) linear FA
  – online and incremental
• To prove that it is so
• Tensions:
  – Proof and practice often pull in different directions
  – Lack of knowledge ≠ negative knowledge
  – We have not handled this well as a field

Critical to the viability of RL!

Page 3

Outline
• Questions
• History: from policy to value and back to policy
• Problem Definition
  – Why function approximation changes everything
• REINFORCE
• Policy Gradient Theory
• Do we need values? Do we need TD?
  – Return baselines: using values without bias
  – TD/bootstrapping/truncation
    • may not be possible without bias
    • but seems essential for reducing variance

Page 4

Questions

Is RL theory fundamentally different/harder with FA? yes

Are value methods unsound with FA? absolutely not

Should we prefer policy methods for other reasons? probably

Is it sufficient to learn just a policy, not value? apparently not

Didn’t we already do all this policy stuff in the 1980s? only some of it

Can values be used without introducing bias? yes

Can TD (bootstrapping) be done without bias? I wish

Is TD much more efficient than Monte Carlo? apparently

Is it TD that makes FA hard? yes and no, but mostly no

So are we stuck with dual, “actor-critic” methods? maybe so

Are we talking about genetic algorithms? No!

What about learning “heuristic” or “relative” values? Are these policy methods or value methods? policy

Page 5

The Swing towards Value Functions

• Early RL methods all used parameterized policies
• But adding value functions seemed key to efficiency
• Why not just learn action value functions and compute policies from them!
• A prediction problem - almost supervised
• Fewer parameters
• Enabled first proofs of convergence to the optimal policy
• Impressive applications using FA
• So successful that early policy work was bypassed

$$Q^*(s,a) = E\left\{ r_1 + \gamma r_2 + \gamma^2 r_3 + \cdots \mid s_0 = s,\, a_0 = a,\, \pi^* \right\}$$
$$\pi^*(s) = \arg\max_a Q^*(s,a)$$
Cleaner, simpler, easier to use.

Q-learning (Watkins, 1989)
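As a concrete illustration of the "learn action values, then compute the policy" recipe, here is a minimal tabular Q-learning sketch in Python. It is a generic textbook version, not the talk's own code, and it assumes a hypothetical `env` object with `reset()` and `step(a)` returning `(next_state, reward, done)`.

```python
import numpy as np

def q_learning(env, n_states, n_actions, episodes=500,
               alpha=0.1, gamma=0.95, epsilon=0.1, seed=0):
    """Learn Q(s,a) ~ Q*(s,a), then act greedily: pi(s) = argmax_a Q(s,a)."""
    rng = np.random.default_rng(seed)
    Q = np.zeros((n_states, n_actions))
    for _ in range(episodes):
        s, done = env.reset(), False
        while not done:
            # epsilon-greedy behavior derived from the current value estimates
            a = rng.integers(n_actions) if rng.random() < epsilon else int(np.argmax(Q[s]))
            s_next, r, done = env.step(a)
            target = r if done else r + gamma * np.max(Q[s_next])  # backup toward max_a' Q(s',a')
            Q[s, a] += alpha * (target - Q[s, a])
            s = s_next
    return Q, np.argmax(Q, axis=1)  # the policy is *computed* from the values
```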

Page 6

The Swing away from Value Functions

Why?
• Theory hit a brick wall for RL with FA
• Q-learning shown to diverge with linear FA
• Many counterexamples to convergence
• Widespread scepticism about any argmax VF solution
  – that is, about any way to get conventional convergence
• But is this really a problem?
• In practice, on-policy methods perform well
• Is this only a problem for our theory?
• With Gordon’s latest result these concerns seem to have been hasty, now invalid

[Diagram]

Page 7

Why FA makes RL hard

• All the states interact and must be balanced, traded off
• Which states are visited is affected by the policy

A small change (or error) in the VF estimate can cause a large, discontinuous change in the policy, which can cause a large change in the VF estimate.

Page 8

Diagram of What Happens in Value Function Space

[Diagram: the space of value functions, showing the true V*, the region of π* (value functions consistent with π*), the surrounding admissible and inadmissible value functions, and the best admissible policy. The trajectories of different methods are sketched:]
• Original naive hope: guaranteed convergence to a good policy
• Residual gradient et al.: guaranteed convergence to a less desirable policy
• Sarsa, TD(λ) & other on-policy methods: chattering, without divergence or guaranteed convergence
• Q-learning, DP & other off-policy methods: divergence possible

Page 9

…and towards Policy Parameterization

• A parameterized policy (PP) can be changed continuously
• A PP can find stochastic policies
  – the optimal policy will often be stochastic with FA
• A PP can omit the argmax (the action-space search)
  – necessary for large/continuous action spaces
• A PP can be more direct, simpler
• Prior knowledge is often better expressed as a PP
• A PP method can be proven convergent to a local optimum for general differentiable FA

REINFORCE (Williams, 1988)

Page 10

Defining the Problem (RL with FA)

Part I: Parameterized Policies

• Finite state and action sets: $s \in S$, $a \in A$, $|S| = N$
• Discrete time: $t = 0, 1, 2, 3, \ldots$
• Transition probabilities: $p^a_{ss'} = \Pr\{ s_{t+1} = s' \mid s_t = s, a_t = a \}$
• Expected rewards: $r^a_s = E\{ r_{t+1} \mid s_t = s, a_t = a \}$
• Stochastic policy: $\pi(s,a) = \Pr\{ a_t = a \mid s_t = s \}$, possibly parameterized: $\pi = \pi_\theta$, $\theta \in \Re^n$, with $n \ll N$ (w.l.o.g.)

Page 11

Examples of Policy Parameterizations

[Diagram: two ways of producing action probabilities $\pi(s_t,a)$]
1. Features of $s_t$, combined with weights $\theta$, give ranking #s, one per action; a Gibbs (softmax) or normalizing ($\sum = 1$) step turns the ranking #s into action probabilities $\pi(s_t,a)$ (sketched in code below).
2. Features of $s_t$ and $a$ give a single ranking #; this is repeated for each $a \in A$, and again Gibbs or normalization yields $\pi(s_t,a)$.

Ranking #s are mechanistically like action values, but do not have value semantics. Many “heuristic” or “relative” values are better viewed as ranking #s.
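A minimal sketch of the first parameterization (Gibbs/softmax over linear ranking numbers) in Python with NumPy; the feature shapes and values here are illustrative assumptions, not the talk's.

```python
import numpy as np

def gibbs_policy(theta, phi_s):
    """Action probabilities pi(s, .) from per-action ranking numbers.

    phi_s: array of shape (n_actions, n_features) -- features of s (one row per action).
    theta: weight vector of shape (n_features,).
    """
    ranking = phi_s @ theta            # one ranking number per action
    ranking -= ranking.max()           # subtract max for numerical stability
    expd = np.exp(ranking)             # Gibbs / softmax
    return expd / expd.sum()           # normalization: probabilities sum to 1

# illustrative use: 4 actions, 8 features, random features for one state
rng = np.random.default_rng(0)
phi_s = rng.normal(size=(4, 8))
theta = np.zeros(8)                    # theta = 0 gives the uniform random policy
print(gibbs_policy(theta, phi_s))      # [0.25 0.25 0.25 0.25]
```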

Page 12

More Policy Parameterizations

Continuous actions work too, e.g.:

[Diagram: Gaussian sampler] Features of $s_t$, with weights $\theta$, determine the mean of $a_t$ and the std. dev. of $a_t$; these implicitly determine the continuous distribution $\pi(s_t,a)$ from which $a_t$ is drawn.

Much stranger parameterizations are possible, e.g., cascades of interacting stochastic processes, such as in a communications network or factory. We require only that our policy process produces $a_t$ according to some distribution $\pi(s_t,a)$, and perhaps that $\nabla_\theta\pi(s_t,a)$ is available.
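A sketch of the Gaussian parameterization in the same spirit; using exp to keep the standard deviation positive is an assumption made here for the example, not something stated on the slide.

```python
import numpy as np

def gaussian_policy_sample(theta_mu, theta_sigma, x, rng):
    """Sample a continuous action a_t ~ N(mu(s_t), sigma(s_t)^2).

    x: feature vector of s_t; the two weight vectors determine mean and std. dev.
    """
    mu = theta_mu @ x                   # mean of a_t
    sigma = np.exp(theta_sigma @ x)     # std. dev. of a_t, kept positive via exp
    return rng.normal(mu, sigma), mu, sigma

# illustrative use
rng = np.random.default_rng(0)
x = rng.normal(size=6)
a, mu, sigma = gaussian_policy_sample(np.zeros(6), np.zeros(6), x, rng)
```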

Page 13

Defining the Problem (RL with FA) II

Choose $\pi$ to maximize a measure of total future reward, called the return, $R_t$.

Values are expected returns:
$$V^\pi(s) = E\{ R_t \mid s_t = s, \pi \} \qquad\qquad Q^\pi(s,a) = E\{ R_t \mid s_t = s, a_t = a, \pi \}$$

Optimal policies:
$$\pi^* = \arg\max_\pi V^\pi(s) \;\;\forall s \in S \qquad\qquad \pi^*(s) = \arg\max_a Q^{\pi^*}(s,a)$$

Value methods maintain a parameterized approximation to a value function ($V^\pi$, $Q^\pi$, $V^{\pi^*}$, or $Q^{\pi^*}$) and then compute their policy from it, e.g.,
$$\pi(s) = \arg\max_a \hat Q(s,a)$$

Page 14

FA Breaks the Standard Problem Def’n!

Discounted case. One infinite, ergodic episode: $s_0, a_0, r_1, s_1, a_1, r_2, s_2, \ldots$

Return: $R_t = \frac{1}{1-\gamma}\left( r_{t+1} + \gamma r_{t+2} + \gamma^2 r_{t+3} + \cdots \right)$

Let $\Pi$ be the space of all policies. Let $\Pi_\theta \subseteq \Pi$ be all policies consistent with the parameterization.

Problem: $\arg\max_{\pi\in\Pi_\theta} V^\pi(s)$ depends on $s$! No one policy in $\Pi_\theta$ is best for all states; states compete for control of $\pi$.

Need an overall (not per-state) measure of policy quality, e.g.,
$$J(\pi) = \sum_s d^\pi(s)\, V^\pi(s)$$
where $d^\pi(s)$ is the asymptotic fraction of time spent in $s$ under $\pi$.

But! Thm: $J(\pi)$ is then essentially independent of $\gamma$: up to a constant factor, $J(\pi)$ is just the average reward per step.
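A short sketch of why this theorem holds, assuming $d^\pi$ is the stationary distribution of the chain induced by $\pi$; the symbol $\bar r(\pi)$ for the average reward per step is introduced here for the argument and does not appear on the slide:
$$\sum_s d^\pi(s)\, E\left\{ r_{t+1} + \gamma r_{t+2} + \gamma^2 r_{t+3} + \cdots \mid s_t = s \right\}
 = \sum_{k \ge 0} \gamma^k \sum_s d^\pi(s)\, E\{ r_{t+k+1} \mid s_t = s \}
 = \sum_{k \ge 0} \gamma^k\, \bar r(\pi)
 = \frac{\bar r(\pi)}{1-\gamma},$$
because starting from the stationary distribution, the expected reward on every future step equals $\bar r(\pi)$. The $\gamma$-dependence is a constant factor that does not involve $\pi$, so maximizing $J(\pi)$ is the same as maximizing average reward per step, whatever $\gamma$ is.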

Page 15

RL Cases Consistent with FA

Average-reward case. One infinite, ergodic episode: $s_0, a_0, r_1, s_1, a_1, r_2, s_2, \ldots$
$$J(\pi) = \text{average reward per time step under } \pi \qquad\qquad R_t = \sum_{k=1}^{\infty} \left[ r_{t+k} - J(\pi) \right]$$

Episodic case. Many episodes, all starting from $s_0$:
$$J(\pi) = V^\pi(s_0) \qquad\qquad R_t = r_{t+1} + \gamma r_{t+2} + \gamma^2 r_{t+3} + \cdots + \gamma^{T-t-1} r_T$$

Page 16

Outline
• Questions
• History: from policy to value and back to policy
• Problem Definition
  – Why function approximation changes everything
• REINFORCE
• Policy Gradient Theory
• Do we need values? Do we need TD?
  – Return baselines: using values without bias
  – TD/bootstrapping/truncation
    • may not be possible without bias
    • but seems essential for reducing variance

Page 17

Do we need Values at all?

Extended REINFORCE (Williams, 1988):
$$\Delta_t = \alpha\, R_t\, \frac{\nabla_\theta \pi(s_t, a_t)}{\pi(s_t, a_t)} \qquad\qquad \Delta = \sum_t \Delta_t \quad \text{(offline updating, episodic case)}$$

Thm: $E_\pi\{\Delta\} = \alpha\, \nabla_\theta J(\pi)$

There is also an online, incremental implementation using eligibility traces, and an extension to the average-reward case (Baxter and Bartlett, 1999).

Converges to a local optimum of $J$ for general differentiable FA! Simple, clean, a single parameter... Why didn't we love this algorithm in 1988?

No TD/bootstrapping (it is a Monte Carlo method), so it was thought to be inefficient.
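A minimal per-episode (offline) REINFORCE sketch in Python, implementing $\Delta_t = \alpha R_t \nabla_\theta\pi(s_t,a_t)/\pi(s_t,a_t)$ for a Gibbs policy with linear ranking numbers; the episode format and feature handling are illustrative assumptions.

```python
import numpy as np

def grad_log_pi(theta, phi_s, a):
    """grad_theta log pi(s,a) = grad pi / pi for a Gibbs policy over theta . phi(s,a).

    For this parameterization it equals phi(s,a) - sum_b pi(s,b) phi(s,b).
    """
    prefs = phi_s @ theta
    prefs -= prefs.max()
    probs = np.exp(prefs) / np.exp(prefs).sum()
    return phi_s[a] - probs @ phi_s

def reinforce_update(theta, episode, alpha, gamma=1.0):
    """One offline REINFORCE update, Delta = sum_t Delta_t, from a completed episode.

    episode: list of (phi_s, a, r_next) tuples gathered while following pi_theta.
    """
    rewards = [r for (_, _, r) in episode]
    delta = np.zeros_like(theta)
    for t, (phi_s, a, _) in enumerate(episode):
        R_t = sum(gamma**k * rewards[t + k] for k in range(len(rewards) - t))  # return from t
        delta += alpha * R_t * grad_log_pi(theta, phi_s, a)   # Delta_t = alpha R_t grad(pi)/pi
    return theta + delta                                      # applied once per episode
```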

Page 18

Policy Gradient Theorem

Thm:
$$\nabla_\theta J(\pi) = \sum_s d^\pi(s) \sum_a Q^\pi(s,a)\, \nabla_\theta \pi(s,a)$$
where $d^\pi(s)$ is how often $s$ occurs under $\pi$. Note that the gradient does not involve $\nabla_\theta d^\pi(s)$!

Marbach & Tsitsiklis ’98; Jaakkola, Singh & Jordan ’95; Cao & Chen ’97; Sutton, McAllester, Singh & Mansour ’99; Konda & Tsitsiklis ’99; Williams ’88

Page 19

Policy Gradient Theory

Thm:
$$\nabla_\theta J(\pi) = \sum_s d^\pi(s) \sum_a Q^\pi(s,a)\, \nabla_\theta \pi(s,a) \qquad \text{(} d^\pi(s)\text{: how often } s \text{ occurs under } \pi\text{)}$$
$$= \sum_s d^\pi(s) \sum_a \left[ Q^\pi(s,a) - b(s) \right] \nabla_\theta \pi(s,a) \qquad \text{for any } b: S \to \Re, \text{ since } \sum_a \nabla_\theta \pi(s,a) = 0 \;\;\forall s$$
$$= \sum_s d^\pi(s) \sum_a \pi(s,a) \left[ Q^\pi(s,a) - b(s) \right] \frac{\nabla_\theta \pi(s,a)}{\pi(s,a)} \qquad \text{(} d^\pi(s)\,\pi(s,a)\text{: how often } s,a \text{ occurs under } \pi\text{)}$$
$$= E_\pi\left\{ \left[ Q^\pi(s_t,a_t) - b(s_t) \right] \frac{\nabla_\theta \pi(s_t,a_t)}{\pi(s_t,a_t)} \right\}$$

Page 20

$$\nabla_\theta J(\pi) = E_\pi\left\{ \left[ Q^\pi(s_t,a_t) - b(s_t) \right] \frac{\nabla_\theta \pi(s_t,a_t)}{\pi(s_t,a_t)} \right\}$$
$$= E_\pi\left\{ R_t\, \frac{\nabla_\theta \pi(s_t,a_t)}{\pi(s_t,a_t)} \right\} \qquad \text{REINFORCE}$$

Page 21

$$\nabla_\theta J(\pi) = E_\pi\left\{ \left[ Q^\pi(s_t,a_t) - b(s_t) \right] \frac{\nabla_\theta \pi(s_t,a_t)}{\pi(s_t,a_t)} \right\}$$
$$= E_\pi\left\{ R_t\, \frac{\nabla_\theta \pi(s_t,a_t)}{\pi(s_t,a_t)} \right\} \qquad \text{REINFORCE}$$
or
$$= E_\pi\left\{ \left[ r_{t+1} + \gamma V^\pi(s_{t+1}) - V^\pi(s_t) \right] \frac{\nabla_\theta \pi(s_t,a_t)}{\pi(s_t,a_t)} \right\}$$
$$\approx E_\pi\left\{ \left[ r_{t+1} + \gamma \hat V^\pi(s_{t+1}) - \hat V^\pi(s_t) \right] \frac{\nabla_\theta \pi(s_t,a_t)}{\pi(s_t,a_t)} \right\} \qquad \text{actor-critic}$$
or
$$\approx E_\pi\left\{ \left[ R_t^\lambda - \hat V^\pi(s_t) \right] \frac{\nabla_\theta \pi(s_t,a_t)}{\pi(s_t,a_t)} \right\} \qquad \text{general form, includes all above}$$
(the $\lambda$-return $R_t^\lambda$ is where TD/bootstrapping becomes possible; is $\hat V^\pi(s_t)$ the ideal baseline?)
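A sketch of the actor-critic variant above (the one-step TD error times the eligibility term) for a linear-Gibbs actor and a linear critic; the state-feature construction and step sizes are illustrative assumptions.

```python
import numpy as np

def actor_critic_step(theta, w, phi_s, a, r, phi_s_next, done,
                      alpha_theta=0.01, alpha_w=0.1, gamma=1.0):
    """One online actor-critic update.

    Critic: linear state values V_hat(s) = w . x(s), with x(s) taken here (for
            simplicity) as the mean of the per-action feature rows phi_s.
    Actor:  Gibbs policy over theta . phi(s,a); the TD error stands in for
            Q - b in the policy-gradient expression (approximate, possibly biased).
    """
    x = phi_s.mean(axis=0)
    x_next = phi_s_next.mean(axis=0)
    v = w @ x
    v_next = 0.0 if done else w @ x_next
    td_error = r + gamma * v_next - v                    # r_{t+1} + gamma V(s') - V(s)

    w = w + alpha_w * td_error * x                       # critic: semi-gradient TD(0)

    prefs = phi_s @ theta
    prefs -= prefs.max()
    probs = np.exp(prefs) / np.exp(prefs).sum()
    grad_log = phi_s[a] - probs @ phi_s                  # = grad pi / pi for the Gibbs policy
    theta = theta + alpha_theta * td_error * grad_log    # actor update
    return theta, w
```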

Page 22

$$\nabla_\theta J(\pi) = E_\pi\left\{ \left[ Q^\pi(s_t,a_t) - b(s_t) \right] \frac{\nabla_\theta \pi(s_t,a_t)}{\pi(s_t,a_t)} \right\}$$

Conjecture: The ideal baseline is $b(s) = V^\pi(s)$, in which case our error term is an advantage:
$$Q^\pi(s_t,a_t) - V^\pi(s_t) = A^\pi(s_t,a_t) \qquad \text{(Baird ’93)}$$

No bias is introduced by an approximation here: $b(s) = \hat V^\pi(s)$.

How important is a baseline to the efficiency of REINFORCE? Apparently very important, but previous tests were flawed.

Page 23

Random MDP Testbed

• 50 randomly constructed episodic MDPs
  – 50 states, uniform starting distribution
  – 2 actions per state
  – 2 possible next states per action
  – expected rewards drawn from N(1,1); actual rewards add N(0,0.1) noise
  – 0.1 probability of termination on each step
• State-aggregation FA: 5 groups of 10 states each
• Gibbs action selection
• Baseline $b(s) = \hat V^\pi(s)$ learned by gradient descent (sketched below)
• Parameters initially $\theta = 0$, $w = 0$
• Step-size parameters $\alpha$ (policy) and $\alpha_w$ (baseline)
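A sketch of how such a learned baseline plugs into the per-episode REINFORCE update; the `group_of` state-aggregation mapping and the undiscounted return are assumptions made for the example, not details taken from the testbed.

```python
import numpy as np

def reinforce_with_baseline(theta, w, episode, group_of, alpha=0.1, alpha_w=0.1):
    """Per-episode REINFORCE with a learned state-aggregation baseline b(s) = w[group_of[s]].

    episode:  list of (phi_s, s, a, r_next) tuples from one episode under pi_theta.
    group_of: maps a state index s to its aggregation group (one baseline weight per group).
    """
    rewards = [r for (_, _, _, r) in episode]
    d_theta, d_w = np.zeros_like(theta), np.zeros_like(w)
    for t, (phi_s, s, a, _) in enumerate(episode):
        R_t = float(sum(rewards[t:]))                    # undiscounted episodic return from t
        g = group_of[s]
        prefs = phi_s @ theta
        prefs -= prefs.max()
        probs = np.exp(prefs) / np.exp(prefs).sum()
        grad_log = phi_s[a] - probs @ phi_s              # grad pi / pi for the Gibbs policy
        d_theta += alpha * (R_t - w[g]) * grad_log       # baseline-subtracted REINFORCE term
        d_w[g] += alpha_w * (R_t - w[g])                 # gradient descent of b(s) toward V_pi(s)
    return theta + d_theta, w + d_w
```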

Page 24

Effect of Learned Baseline

[Plot: $J(\pi)$ after 50 episodes as a function of the step size $\alpha$ (log scale), for REINFORCE with per-episode updating, without a baseline and with a learned baseline at several baseline step sizes.]

Much better to learn a baseline approximating $V^\pi$.

Page 25

Can We TD without Introducing Bias?

$$\nabla_\theta J(\pi) = E_\pi\left\{ \left[ Q^\pi(s_t,a_t) - b(s_t) \right] \frac{\nabla_\theta \pi(s_t,a_t)}{\pi(s_t,a_t)} \right\}$$

Thm: An approximation $\hat Q$ can replace $Q^\pi$ here without bias if it is of the form
$$\hat Q(s,a) = w^T\, \frac{\nabla_\theta \pi(s,a)}{\pi(s,a)}$$
and has converged to a local optimum. (Sutton et al. ’99; Konda & Tsitsiklis ’99)

However! Thm: Under batch updating, such a $\hat Q$ results in exactly the same updates as REINFORCE. There is no useful bootstrapping. Empirically, there is also no win with per-episode updating. (Singh, McAllester & Sutton, unpublished)
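To make the required form concrete, here is a small sketch of these "compatible" features and of one plausible (assumed, not prescribed by the slide) way to fit $w$, by least squares against observed returns, for the linear-Gibbs policy used earlier.

```python
import numpy as np

def compatible_features(theta, phi_s, a):
    """The compatible feature vector grad pi(s,a) / pi(s,a) = grad log pi(s,a)."""
    prefs = phi_s @ theta
    prefs -= prefs.max()
    probs = np.exp(prefs) / np.exp(prefs).sum()
    return phi_s[a] - probs @ phi_s

def fit_compatible_q(theta, samples):
    """Fit Q_hat(s,a) = w . compatible_features(s,a) by least squares on returns.

    samples: list of (phi_s, a, R) with R an unbiased return observed from (s,a) under pi.
    """
    X = np.array([compatible_features(theta, phi_s, a) for (phi_s, a, _) in samples])
    R = np.array([r for (_, _, r) in samples])
    w, *_ = np.linalg.lstsq(X, R, rcond=None)
    return w   # then Q_hat(s,a) = w @ compatible_features(theta, phi_s, a)
```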

Page 26

Effect of Unbiased Linear $\hat Q$

[Plot: $J(\pi)$ after 50 episodes as a function of the step size $\alpha$; REINFORCE compared with the unbiased $\hat Q$ at its best; per-episode updating.]

Page 27

TD Creates Bias; Must We TD?

Is TD really more efficient than Monte Carlo?

Apparently “Yes”, but this question deserves a better answer.

[Plots: performance as a function of $\lambda$ (from 0 to 1), with accumulating and replacing traces, on four tasks: Random Walk (RMS error), Cart and Pole (failures per 100,000 steps), Mountain Car (steps per episode), and Puddle World (cost per episode).]

Page 28

Is it TD that makes FA hard?

• Yes, TD prediction with FA is trickier than Monte Carlo
  – even the linear case converges only to near an optimum
  – nonlinear cases can even diverge
• No, TD is not the reason the control case is hard
• This problem is intrinsic to control + FA
• It happens even with Monte Carlo methods

[Diagram: a small change in value → a discontinuous change in policy → a large change in state distribution → a large change in value, and around again.]

Page 29

Small Sample Importance Sampling - A Superior Eligibility Term?

Thm:
$$\nabla_\theta J(\pi) = \sum_s d^\pi(s) \sum_a Q^\pi(s,a)\, \nabla_\theta \pi(s,a) \qquad \text{(} d^\pi(s)\text{: how often } s \text{ occurs under } \pi\text{)}$$
$$= \sum_s d^\pi(s) \sum_a \left[ Q^\pi(s,a) - b(s) \right] \nabla_\theta \pi(s,a) \qquad \text{for any } b: S \to \Re, \text{ since } \sum_a \nabla_\theta \pi(s,a) = 0 \;\;\forall s$$
$$= \sum_s d^\pi(s) \sum_a \pi(s,a) \left[ Q^\pi(s,a) - b(s) \right] \frac{\nabla_\theta \pi(s,a)}{\pi(s,a)} \qquad \text{(} d^\pi(s)\,\pi(s,a)\text{: how often } s,a \text{ occurs under } \pi\text{)}$$
$$= E_\pi\left\{ \left[ Q^\pi(s_t,a_t) - b(s_t) \right] \frac{\nabla_\theta \pi(s_t,a_t)}{\pi(s_t,a_t)} \right\}$$

Page 30

Questions

Is RL theory fundamentally different/harder with FA? yes

Are value methods unsound with FA? absolutely not

Should we prefer policy methods for other reasons? probably

Is it sufficient to learn just a policy, not value? apparently not

Can values be used without introducing bias? yes

Can TD (bootstrapping) be done without bias? I wish

Is TD much more efficient than Monte Carlo? apparently

Is it TD that makes FA hard? yes and no, but mostly no

So are we stuck with dual, “actor-critic” methods? maybe so