The Mathematical Foundations of Policy Gradient Methods
Sham M. Kakade, University of Washington & Microsoft Research


Page 1:

The Mathematical Foundations of Policy Gradient Methods

Sham M. Kakade
University of Washington & Microsoft Research

Page 2:

Reinforcement (interactive) learning (RL):

Page 3:

Setting: Markov decision processes

S states; start with s_0 ∼ d_0.
A actions. Dynamics model P(s′|s, a). Reward function r(s). Discount factor γ.

Sutton, Barto ’18

Stochastic policy π: s_t → a_t

Standard objective: find ⇡ which maximizes:

V^π(s_0) = E[ r(s_0) + γ r(s_1) + γ² r(s_2) + … ]

where the distribution of s_t, a_t is induced by π.


Markov Decision Processes: a framework for RL

• A policy π: States → Actions

• We execute π to obtain a trajectory: s_0, a_0, r_0, s_1, a_1, r_1, …

• Total γ-discounted reward:

V^π(s_0) = E[ Σ_{t=0}^∞ γ^t r_t | s_0, π ]

Goal: Find a policy that maximizes our value, V^π(s_0).
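As a concrete illustration of this objective, here is a minimal Monte Carlo sketch that estimates V^π(s_0) by averaging sampled discounted returns on a small tabular MDP. The transition matrix P, reward r, and policy pi below are made-up placeholders, not anything taken from the slides.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy MDP (illustrative numbers only): S states, A actions, discount gamma.
S, A, gamma = 3, 2, 0.9
P = rng.dirichlet(np.ones(S), size=(S, A))   # P[s, a] is a distribution over next states
r = rng.uniform(size=S)                      # reward r(s), as on the slide
pi = np.full((S, A), 1.0 / A)                # a (uniform) stochastic policy pi(a|s)

def sampled_return(s0, horizon=200):
    """One sampled discounted return r(s_0) + gamma r(s_1) + gamma^2 r(s_2) + ... (truncated)."""
    s, total, discount = s0, 0.0, 1.0
    for _ in range(horizon):
        total += discount * r[s]
        a = rng.choice(A, p=pi[s])           # a_t ~ pi(.|s_t)
        s = rng.choice(S, p=P[s, a])         # s_{t+1} ~ P(.|s_t, a_t)
        discount *= gamma
    return total

# Monte Carlo estimate of V^pi(s_0): average many sampled returns.
print("estimated V^pi(s_0) ~", np.mean([sampled_return(s0=0) for _ in range(2000)]))
```

Truncating at a finite horizon introduces an error of order γ^horizon, which is negligible here.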


Page 4:

Challenges in RL

1. Exploration (the environment may be unknown)

2. Credit assignment problem (due to delayed rewards)

3. Large state/action spaces: hand state: joint angles/velocities; cube state: configuration; actions: forces applied to actuators

Dexterous Robotic Hand Manipulation (OpenAI, Oct 15, 2019)

Page 5:

Values, State-Action Values, and Advantages

• Expectation with respect to sampled trajectories under π
• Have S states and A actions.
• Effective “horizon” is 1/(1 − γ) time steps.

V^π(s_0) = E[ Σ_{t=0}^∞ γ^t r(s_t, a_t) | s_0, π ]

Q^π(s_0, a_0) = E[ Σ_{t=0}^∞ γ^t r(s_t, a_t) | s_0, a_0, π ]

A^π(s, a) = Q^π(s, a) − V^π(s)
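In the tabular case these three quantities can be computed exactly by solving the Bellman equations; the sketch below is one minimal way to do so (the arrays P, r, pi are assumed inputs in the same toy format as the rollout sketch above).

```python
import numpy as np

def value_functions(P, r, pi, gamma):
    """Exact V^pi, Q^pi, A^pi for a tabular MDP with reward r(s, a).

    P: (S, A, S) transition probabilities, r: (S, A) rewards, pi: (S, A) policy.
    """
    S, A = r.shape
    # Chain induced by pi: P_pi[s, s'] = sum_a pi(a|s) P(s'|s, a); r_pi[s] = E_{a~pi}[r(s, a)]
    P_pi = np.einsum("sa,sat->st", pi, P)
    r_pi = np.einsum("sa,sa->s", pi, r)
    # V^pi solves the linear system (I - gamma P_pi) V = r_pi
    V = np.linalg.solve(np.eye(S) - gamma * P_pi, r_pi)
    # Q^pi(s, a) = r(s, a) + gamma E_{s'}[ V^pi(s') ];  A^pi(s, a) = Q^pi(s, a) - V^pi(s)
    Q = r + gamma * np.einsum("sat,t->sa", P, V)
    return V, Q, Q - V[:, None]
```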


Page 6:

The “Tabular” Dynamic Programming approach

• Table: ‘bookkeeping’ for dynamic programming (with known rewards/dynamics)

1. Estimate the state-action value Q^π(s, a) for every entry in the table.

2. Update the policy π and go to step 1.

• Generalization: how can we deal with this infinite table? using sampling/supervised learning?

State s: (joint angles, …, cube config, …)
Action a: (forces at joints)
Q^π(s, a): state-action value, the “one step look-ahead value” using π

Example entry: state (31°, 12°, …, 8134, …), action (1.2 Newton, 0.1 Newton, …), value 8 units of reward
⋮
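A minimal sketch of the two bookkeeping steps above for a small known MDP (evaluate Q^π for every table entry, then update the policy, then repeat); the arrays P and r are assumed placeholder inputs as in the earlier sketches. With the hand-manipulation state space this table is effectively infinite, which is exactly the generalization question raised above.

```python
import numpy as np

def policy_iteration(P, r, gamma, iters=50):
    """Tabular policy iteration with known rewards/dynamics (illustrative sketch only)."""
    S, A = r.shape
    pi = np.full((S, A), 1.0 / A)                    # start from the uniform policy
    for _ in range(iters):
        # Step 1: compute the state-action value Q^pi(s, a) for every table entry.
        P_pi = np.einsum("sa,sat->st", pi, P)
        r_pi = np.einsum("sa,sa->s", pi, r)
        V = np.linalg.solve(np.eye(S) - gamma * P_pi, r_pi)
        Q = r + gamma * np.einsum("sat,t->sa", P, V)
        # Step 2: update the policy (greedily w.r.t. Q) and go to step 1.
        new_pi = np.zeros_like(pi)
        new_pi[np.arange(S), Q.argmax(axis=1)] = 1.0
        if np.allclose(new_pi, pi):
            break                                    # policy is stable: done
        pi = new_pi
    return pi, Q
```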

Page 7:

§ Part I: Basics
  A. Derivation and Estimation
  B. Preconditioning and the Natural Policy Gradient

§ Part II: Convergence and Approximation
  A. Convergence: this is a non-convex problem!
  B. Approximation: how to think about the role of deep learning?

This Tutorial: Mathematical Foundations of Policy Gradient Methods


Page 8:

Part-1: Basics

Page 9:

State-Action Visitation Measures!
• This helps to clean up notation!
• “Occupancy frequency” of being in state s and action a, after following π starting in s_0

d^π_{s_0}(s) = (1 − γ) Σ_{t=0}^∞ γ^t Pr(s_t = s | s_0, π)

• d^π_{s_0} is a probability distribution

• With this notation:

V^π(s_0) = (1/(1 − γ)) E_{s∼d^π_{s_0}, a∼π}[ r(s, a) ]
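A small sketch (same toy tabular setup as before) that computes d^π_{s_0} in closed form and can be used to check the identity above; P and pi are assumed placeholder inputs.

```python
import numpy as np

def visitation_distribution(P, pi, s0, gamma):
    """d^pi_{s0}(s) = (1 - gamma) * sum_t gamma^t Pr(s_t = s | s_0, pi)."""
    S = P.shape[0]
    P_pi = np.einsum("sa,sat->st", pi, P)            # state-to-state chain under pi
    e_s0 = np.zeros(S)
    e_s0[s0] = 1.0
    # Closed form: d = (1 - gamma) * (I - gamma P_pi^T)^{-1} e_{s0}; it sums to 1.
    return (1 - gamma) * np.linalg.solve(np.eye(S) - gamma * P_pi.T, e_s0)

# Sanity check of the identity, with r(s, a) rewards and value_functions from the sketch above:
# V, _, _ = value_functions(P, r, pi, gamma)
# d = visitation_distribution(P, pi, s0=0, gamma=gamma)
# assert np.isclose(V[0], np.sum(d[:, None] * pi * r) / (1 - gamma))
```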

Page 10:

Direct Policy Optimization over Stochastic Policies

• π_θ(a|s) is the probability of action a given s, parameterized by θ:

π_θ(a|s) ∝ exp(f_θ(s, a))

• Softmax policy class: f_θ(s, a) = θ_{s,a}
• Linear policy class: f_θ(s, a) = θ ⋅ φ(s, a), where φ(s, a) ∈ R^d
• Neural policy class: f_θ(s, a) is a neural network
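A minimal sketch of the parameterization π_θ(a|s) ∝ exp(f_θ(s, a)) for the linear class; the feature map phi below is a made-up placeholder. The tabular softmax class is the special case where φ(s, a) is the indicator of the (s, a) entry, and the neural class replaces θ ⋅ φ(s, a) with a network.

```python
import numpy as np

def softmax_policy(theta, phi, s, actions):
    """pi_theta(a|s) proportional to exp(theta . phi(s, a))."""
    logits = np.array([theta @ phi(s, a) for a in actions])
    logits -= logits.max()                 # subtract the max for numerical stability
    probs = np.exp(logits)
    return probs / probs.sum()

# Illustrative usage with a placeholder feature map phi(s, a) in R^d:
d, actions = 4, [0, 1, 2]
phi = lambda s, a: np.cos(np.arange(d) * (s + 2 * a))
theta = np.zeros(d)                        # theta = 0 gives the uniform policy
print(softmax_policy(theta, phi, s=1.0, actions=actions))
```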


Page 11:

In practice, policy gradient methods rule…

• Why do we like them?

• They easily deal with large state/action spaces (through the neural net parameterization)

• We can estimate the gradient using only simulation of our current policy π_θ (the expectation is under the states and actions visited under π_θ)

• They directly optimize the cost function of interest!

They are the most effective method for obtaining state-of-the-art results.

θ ← θ + η ∇V^θ(s_0)

Page 12:

Two (equal) expressions for the policy gradient!

(some shorthand notation in the expressions below)

• Where do these expressions come from?

• How do we compute this?

∇V^θ(s_0) = (1/(1 − γ)) E_{s∼d^θ, a∼π_θ}[ A^θ(s, a) ∇log π_θ(a|s) ]

∇V^θ(s_0) = (1/(1 − γ)) E_{s∼d^θ, a∼π_θ}[ Q^θ(s, a) ∇log π_θ(a|s) ]
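One way to see why the two expressions agree (a step the slide leaves implicit): for any fixed state s,

E_{a∼π_θ(⋅|s)}[ ∇log π_θ(a|s) ] = Σ_a ∇π_θ(a|s) = ∇ Σ_a π_θ(a|s) = ∇1 = 0,

so subtracting the baseline V^θ(s) from Q^θ(s, a) does not change the expectation:

E[ Q^θ(s, a) ∇log π_θ(a|s) ] = E[ (Q^θ(s, a) − V^θ(s)) ∇log π_θ(a|s) ] = E[ A^θ(s, a) ∇log π_θ(a|s) ].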


Page 13:

Example: an important special case!
• Remember the softmax policy class (a “tabular” parameterization):

π_θ(a|s) ∝ exp(θ_{s,a})
• Complete class with S ⋅ A params:
one parameter per state-action pair, so it contains the optimal policy

• Expression for softmax class:

∂V^θ(s_0)/∂θ_{s,a} = (1/(1 − γ)) d^θ_{s_0}(s) π_θ(a|s) A^θ(s, a)

• Intuition: increase θ_{s,a} if the ‘weighted’ advantage is large.
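A minimal sketch of exact gradient ascent with this expression for the tabular softmax class; the visitation distribution d and advantage table A are assumed to come from sketches like the earlier ones (they must be recomputed for the current π_θ at every step).

```python
import numpy as np

def softmax_pi(theta):
    """Tabular softmax: pi_theta(a|s) proportional to exp(theta[s, a])."""
    z = np.exp(theta - theta.max(axis=1, keepdims=True))
    return z / z.sum(axis=1, keepdims=True)

def exact_softmax_gradient(theta, d, A, gamma):
    """dV^theta(s_0)/dtheta[s, a] = (1/(1-gamma)) d(s) pi_theta(a|s) A^theta(s, a)."""
    return d[:, None] * softmax_pi(theta) * A / (1 - gamma)

# Ascent loop (schematic): recompute d and A for the current policy, then step.
# theta = np.zeros((S, A))
# for _ in range(num_iters):
#     pi = softmax_pi(theta)
#     V, Q, A = value_functions(P, r, pi, gamma)
#     d = visitation_distribution(P, pi, s0=0, gamma=gamma)
#     theta = theta + eta * exact_softmax_gradient(theta, d, A, gamma)
```

Note how each entry of the gradient is scaled by both d^θ(s) and π_θ(a|s); this scaling is what the later slides probe.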

Page 14:

Part-1A: Derivations and Estimation


Page 15:

General Derivation

∇V^{π_θ}(s_0)
= ∇ Σ_{a_0} π_θ(a_0|s_0) Q^{π_θ}(s_0, a_0)
= Σ_{a_0} ( ∇π_θ(a_0|s_0) ) Q^{π_θ}(s_0, a_0) + Σ_{a_0} π_θ(a_0|s_0) ∇Q^{π_θ}(s_0, a_0)
= Σ_{a_0} π_θ(a_0|s_0) ( ∇log π_θ(a_0|s_0) ) Q^{π_θ}(s_0, a_0)
  + Σ_{a_0} π_θ(a_0|s_0) ∇( r(s_0, a_0) + γ Σ_{s_1} P(s_1|s_0, a_0) V^{π_θ}(s_1) )
= Σ_{a_0} π_θ(a_0|s_0) ( ∇log π_θ(a_0|s_0) ) Q^{π_θ}(s_0, a_0) + γ Σ_{a_0, s_1} π_θ(a_0|s_0) P(s_1|s_0, a_0) ∇V^{π_θ}(s_1)
= E[ Q^{π_θ}(s_0, a_0) ∇log π_θ(a_0|s_0) ] + γ E[ ∇V^{π_θ}(s_1) ].
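Unrolling this recursion (the step left implicit in the last line) recovers the expression quoted earlier:

∇V^{π_θ}(s_0) = Σ_{t=0}^∞ γ^t E[ Q^{π_θ}(s_t, a_t) ∇log π_θ(a_t|s_t) ] = (1/(1 − γ)) E_{s∼d^{π_θ}_{s_0}, a∼π_θ}[ Q^{π_θ}(s, a) ∇log π_θ(a|s) ].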


Page 16:

SL vs RL: How do we obtain gradients?
• In supervised learning, how do we compute the gradient of our loss L(θ)?

θ ← θ − η ∇L(θ)
• Hint: can we compute our loss?

• In reinforcement learning, how do we compute the policy gradient ∇V^θ(s_0)?

θ ← θ + η ∇V^θ(s_0)

∇V^θ(s_0) = (1/(1 − γ)) E_{s,a}[ A^θ(s, a) ∇log π_θ(a|s) ]

(Note: even computing Q^θ or V^θ exactly is tricky here; we only have access to sampled trajectories.)

Page 17:

Monte Carlo Estimation
• Sample a trajectory: execute π_θ and obtain s_0, a_0, r_0, s_1, a_1, r_1, …

Q̂(s_t, a_t) = Σ_{t′=0}^∞ γ^{t′} r(s_{t+t′}, a_{t+t′})

∇̂V^θ = Σ_{t=0}^∞ γ^t Q̂(s_t, a_t) ∇log π_θ(a_t|s_t)

• Lemma [Glynn ’90, Williams ’92]: This gives an unbiased estimate of the gradient:

E[ ∇̂V^θ ] = ∇V^θ(s_0)

This is the “likelihood ratio” method.
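A minimal sketch of this likelihood-ratio estimator from a single trajectory, truncated at a finite horizon (the truncation adds a bias of order γ^H); grad_log_pi is an assumed helper returning ∇_θ log π_θ(a|s) for the current parameters.

```python
import numpy as np

def reinforce_gradient(traj, grad_log_pi, gamma):
    """Likelihood-ratio estimate of grad V^theta from one sampled trajectory.

    traj: list of (s_t, a_t, r_t) tuples obtained by executing pi_theta.
    grad_log_pi(s, a): assumed helper returning the vector grad_theta log pi_theta(a|s).
    """
    rewards = np.array([r for (_, _, r) in traj])
    discounts = gamma ** np.arange(len(traj))
    grad = 0.0
    for t, (s, a, _) in enumerate(traj):
        # Q_hat(s_t, a_t): discounted reward-to-go from time t onward.
        Q_hat = np.sum(discounts[: len(traj) - t] * rewards[t:])
        grad = grad + discounts[t] * Q_hat * grad_log_pi(s, a)
    return grad
```

Averaging this estimator over many independent trajectories keeps it unbiased (per the lemma) while reducing its variance.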


Page 18:

Back to the softmax policy class…

π_θ(a|s) ∝ exp(θ_{s,a})
• Expression for softmax class:

∂V^θ(s_0)/∂θ_{s,a} = (1/(1 − γ)) d^θ_{s_0}(s) π_θ(a|s) A^θ(s, a)

• What might be making gradient estimation difficult here? (hint: when does gradient descent “effectively” stop?)

(Gradient descent effectively stops when the gradient is small. Here the partial derivative is scaled by π_θ(a|s), so it can be small even when A^θ(s, a) ≫ 0, if π_θ(a|s) is small.)

Page 19:

Part-1B: Preconditioning and the Natural Policy Gradient

Page 20:

A closer look at Natural Policy Gradient (NPG)

• Practice: (almost) all methods are gradient based, usually variants of: Natural Policy Gradient [K. ’01]; TRPO [Schulman ’15]; PPO [Schulman ’17]

• NPG warps the distance metric (using the Fisher information metric) to stretch out the corners and move ‘more’ near the boundaries. The update is:

F(θ) = E_{s∼d^θ, a∼π_θ}[ ∇log π_θ(a|s) (∇log π_θ(a|s))^T ]

θ ← θ + η F(θ)^{−1} ∇V^θ(s_0)
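A minimal sample-based sketch of this update: estimate F(θ) and the gradient from the same batch of (state, action) samples, then solve the preconditioned system. The small damping term added before inverting is a common practical choice, not something stated on the slide; grad_logs and advantages are assumed inputs.

```python
import numpy as np

def npg_direction(grad_logs, advantages, damping=1e-3):
    """Natural gradient direction F(theta)^{-1} grad V^theta, estimated from samples.

    grad_logs:  (N, d) rows are grad_theta log pi_theta(a_i|s_i) for sampled (s_i, a_i)
    advantages: (N,) estimated A^theta(s_i, a_i)
    """
    N, d = grad_logs.shape
    F = grad_logs.T @ grad_logs / N           # Fisher matrix E[ glog glog^T ]
    g = grad_logs.T @ advantages / N          # proportional to the policy gradient
    return np.linalg.solve(F + damping * np.eye(d), g)

# Update: theta = theta + eta * npg_direction(grad_logs, advantages)
# (the 1/(1 - gamma) factor in the gradient is absorbed into the step size eta here)
```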


Page 21:

TRPO (Trust Region Policy Optimization)

• TRPO [Schulman ’15] (related: PPO [Schulman ’17]): move while staying “close” in KL to the previous policy:

θ^{t+1} = argmax_θ V^θ(s_0)   s.t.   E_{s∼d^{θ_t}}[ KL( π_θ(⋅|s) ‖ π_{θ_t}(⋅|s) ) ] ≤ δ

• NPG = TRPO: they are first-order equivalent (and have the same practical behavior)


Page 22:

NPG intuition. But first…
• NPG as preconditioning:

θ ← θ + η F(θ)^{−1} ∇V^θ(s_0)

OR

θ ← θ + (η/(1 − γ)) E[ ∇log π_θ(a|s) (∇log π_θ(a|s))^T ]^{−1} E[ ∇log π_θ(a|s) A^θ(s, a) ]

• What does the following problem remind you of?

E[ X X^T ]^{−1} E[ X Y ]

• What is NPG trying to approximate?
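With X = ∇log π_θ(a|s) and Y = A^θ(s, a), this is the ordinary least-squares solution, so the NPG direction is (up to the 1/(1 − γ) scaling) the best linear fit of the advantage function in the features ∇log π_θ, the “compatible function approximation” view:

w* = argmin_w E_{s∼d^θ, a∼π_θ}[ ( w ⋅ ∇log π_θ(a|s) − A^θ(s, a) )² ] = E[ X X^T ]^{−1} E[ X Y ] = (1 − γ) F(θ)^{−1} ∇V^θ(s_0).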


Page 23:

Equivalent Update Rule (for the softmax)
• Take the best linear fit of A^θ in the “policy space” features: this gives

w*_{s,a} = A^θ(s, a)
• Using the NPG update rule:

θ_{s,a} ← θ_{s,a} + (η/(1 − γ)) A^θ(s, a)

• And so an equivalent update rule to NPG is:

π_θ(a|s) ← π_θ(a|s) ⋅ exp( (η/(1 − γ)) A^θ(s, a) ) / Z

• What algorithm does this remind you of?

Questions: convergence? General case/approximation?

(Answer: this is a “soft” policy iteration; as η → ∞, it approaches the exact policy iteration update.)
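A minimal sketch of this multiplicative update for a tabular policy; the advantage table A is assumed to come from evaluating the current policy (e.g. as in the earlier value_functions sketch).

```python
import numpy as np

def soft_policy_iteration_step(pi, A, eta, gamma):
    """Multiplicative update: new pi(a|s) proportional to pi(a|s) * exp(eta/(1-gamma) * A(s, a)).

    pi: (S, A) current policy, A: (S, A) advantages of that policy.
    As eta grows, this approaches the greedy (policy iteration) update.
    """
    logits = np.log(pi) + (eta / (1 - gamma)) * A
    logits -= logits.max(axis=1, keepdims=True)   # normalize in log space for stability
    new_pi = np.exp(logits)
    return new_pi / new_pi.sum(axis=1, keepdims=True)
```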

Page 24:

But does gradient descent even work in RL??

[Figure: optimization landscapes, Supervised Learning vs. Reinforcement Learning]

What about approximation?

Stay tuned!!

Page 25:

Part-2: Convergence and Approximation

Page 26:

The Optimization Landscape

Supervised Learning:
• Gradient descent tends to ‘just work’ in practice and is not sensitive to initialization
• Saddle points not a problem…

Reinforcement Learning:
• Local search depends on initialization in many real problems, due to “very” flat regions
• Gradients can be exponentially small in the “horizon”

Page 27:

RL and the vanishing gradient problem

Reinforcement Learning:
• The random initialization has “very” flat regions in real problems (lack of ‘exploration’)
• Lemma [Agarwal, Lee, K., Mahajan 2019]: With random initialization, all k-th higher-order gradients are 2^{−Ω(H)} in magnitude, for up to k ≤ H/ln H orders, where H = 1/(1 − γ).

• This is a landscape/optimization issue (also a statistical issue if we used random initialization).

Prior work: The Explore/Exploit Tradeoff

Random search does not find the reward quickly. [Thrun ’92]

(theory) Balancing the explore/exploit tradeoff:
• [Kearns & Singh ’02]: E³ is a near-optimal algorithm
• Sample complexity: [K. ’03, Azar ’17]
• Model free: [Strehl et al. ’06; Dann & Brunskill ’15; Szita & Szepesvari ’10; Lattimore et al. ’14; Jin et al. ’18]



Page 28:

Part 2: Understanding the convergence properties of the (NPG) policy gradient methods!

§ A: Convergence
• Let’s look at the tabular/“softmax” case

§ B: Approximation
• “Linear” policies and neural nets

Page 29:

NPG: back to the “soft” policy iteration interpretation

• Remember the softmax policy class

π_θ(a|s) ∝ exp(θ_{s,a}), which has S ⋅ A params

• At iteration t, the NPG update rule

θ^{t+1} ← θ^t + η F(θ^t)^{−1} ∇V^t(s_0)

is equivalent to a “soft” (exact) policy iteration update rule:

π^{t+1}(a|s) ← π^t(a|s) ⋅ exp( (η/(1 − γ)) A^t(s, a) ) / Z_t

• What happens for this non-convex update rule?
