Page 1:

Advanced Policy Gradients

CS 285: Deep Reinforcement Learning, Decision Making, and Control

Sergey Levine

Page 2:

Class Notes

1. Homework 2 due today (11:59 pm)!
   • Don’t be late!

2. Homework 3 comes out this week
   • Start early! Q-learning takes a while to run

Page 3:

Today’s Lecture

1. Why does policy gradient work?

2. Policy gradient is a type of policy iteration

3. Policy gradient as a constrained optimization

4. From constrained optimization to natural gradient

5. Natural gradients and trust regions

• Goals:

• Understand the policy iteration view of policy gradient

• Understand how to analyze policy gradient improvement

• Understand what natural gradient does and how to use it

Page 4:

Recap: policy gradients

generate samples (i.e. run the policy)

fit a model to estimate return

improve the policy

“reward to go” (can also use function approximation here, i.e., a learned critic)
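
The equations on this slide are not in the transcript; as a reference, the reward-to-go policy gradient estimator from the earlier lecture has the standard form

\nabla_\theta J(\theta) \approx \frac{1}{N} \sum_{i=1}^{N} \sum_{t=1}^{T} \nabla_\theta \log \pi_\theta(a_{i,t} \mid s_{i,t}) \, \hat{Q}_{i,t}, \qquad \hat{Q}_{i,t} = \sum_{t'=t}^{T} r(s_{i,t'}, a_{i,t'})

where \hat{Q}_{i,t} is the “reward to go”; a learned value or advantage estimate can take its place.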

Page 5:

Why does policy gradient work?

generate samples (i.e. run the policy)

fit a model to estimate return

improve the policy

look familiar?

Page 6:

Policy gradient as policy iteration
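
The central claim of this slide, written out here in its standard form (the slide's own equations were not extracted), is that the improvement in the RL objective equals the expected advantage of the old policy under trajectories sampled from the new policy:

J(\theta') - J(\theta) = E_{\tau \sim p_{\theta'}(\tau)} \left[ \sum_t \gamma^t A^{\pi_\theta}(s_t, a_t) \right]

This is exactly the structure of policy iteration: estimate the advantage A^{\pi_\theta} of the current policy, then choose \theta' to maximize it.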

Page 7:

Policy gradient as policy iteration: importance sampling
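
Expanding the expectation per time step and applying importance sampling to the action distribution (standard derivation, reconstructed here):

E_{\tau \sim p_{\theta'}(\tau)} \left[ \sum_t \gamma^t A^{\pi_\theta}(s_t, a_t) \right]
= \sum_t E_{s_t \sim p_{\theta'}(s_t)} \left[ E_{a_t \sim \pi_{\theta'}(a_t \mid s_t)} \left[ \gamma^t A^{\pi_\theta}(s_t, a_t) \right] \right]
= \sum_t E_{s_t \sim p_{\theta'}(s_t)} \left[ E_{a_t \sim \pi_\theta(a_t \mid s_t)} \left[ \frac{\pi_{\theta'}(a_t \mid s_t)}{\pi_\theta(a_t \mid s_t)} \gamma^t A^{\pi_\theta}(s_t, a_t) \right] \right]

The inner expectation now uses old-policy actions, but the outer expectation is still over the new policy's state marginal p_{\theta'}(s_t), which we cannot sample from without running \pi_{\theta'}.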

Page 8:

Ignoring distribution mismatch??

why do we want this to be true?

is it true? and when?
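
The approximation in question is to replace the new policy's state marginal with the old one (standard form):

J(\theta') - J(\theta) \approx \bar{A}(\theta') = \sum_t E_{s_t \sim p_\theta(s_t)} \left[ E_{a_t \sim \pi_\theta(a_t \mid s_t)} \left[ \frac{\pi_{\theta'}(a_t \mid s_t)}{\pi_\theta(a_t \mid s_t)} \gamma^t A^{\pi_\theta}(s_t, a_t) \right] \right]

We want this because \bar{A}(\theta') can be estimated and maximized entirely with samples from the old policy \pi_\theta; the next slides ask when the swap is justified.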

Page 9:

Bounding the distribution change

seem familiar?

not a great bound, but a bound!
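
The “seem familiar?” refers to the same argument as in the behavioral cloning / DAgger analysis: if \pi_\theta is (near-)deterministic and \pi_{\theta'} deviates from it with probability at most \epsilon in each state, then

p_{\theta'}(s_t) = (1 - \epsilon)^t p_\theta(s_t) + \left( 1 - (1 - \epsilon)^t \right) p_{\text{mistake}}(s_t)

which gives a total variation bound on the state marginals:

\left| p_{\theta'}(s_t) - p_\theta(s_t) \right| \le 2 \left( 1 - (1 - \epsilon)^t \right) \le 2 \epsilon t

linear in both \epsilon and t: not a great bound, but a bound.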

Page 10:

Bounding the distribution change

Proof based on: Schulman, Levine, Moritz, Jordan, Abbeel. “Trust Region Policy Optimization.”
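
The TRPO-style proof extends this to arbitrary (stochastic) policy pairs: if the policies are close in total variation, \left| \pi_{\theta'}(a_t \mid s_t) - \pi_\theta(a_t \mid s_t) \right| \le \epsilon for all s_t, then the same state-marginal bound holds (statement reconstructed from the standard result):

\left| p_{\theta'}(s_t) - p_\theta(s_t) \right| \le 2 \epsilon t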

Page 11:

Bounding the objective value
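
With the state-marginal bound in hand, any bounded function's expectation changes by at most a controlled amount (standard form):

E_{s_t \sim p_{\theta'}(s_t)} \left[ f(s_t) \right] \ge E_{s_t \sim p_\theta(s_t)} \left[ f(s_t) \right] - 2 \epsilon t \max_{s_t} f(s_t)

Applying this with f(s_t) equal to the importance-weighted expected advantage shows that maximizing \bar{A}(\theta') maximizes a lower bound on J(\theta') - J(\theta), up to an error on the order of \epsilon T^2 r_{\max} in the finite-horizon case (or \epsilon r_{\max} / (1 - \gamma)^2 in the discounted case), so keeping \epsilon small guarantees improvement.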

Page 12:

Where are we at so far?

Page 13:

Break

Page 14:

A more convenient bound

KL divergence has some very convenient properties that make it much easier to estimate and optimize than total variation divergence!
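
Specifically, total variation is bounded by the KL divergence (Pinsker's inequality):

\left| \pi_{\theta'}(a_t \mid s_t) - \pi_\theta(a_t \mid s_t) \right| \le \sqrt{ \tfrac{1}{2} D_{KL} \left( \pi_{\theta'}(\cdot \mid s_t) \,\|\, \pi_\theta(\cdot \mid s_t) \right) }

so constraining the KL also constrains total variation, and the KL can be computed in closed form for common policy classes (e.g., Gaussians) and estimated from samples.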

Page 15:

How do we optimize the objective?
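
The resulting constrained problem, in its standard form, is

\theta' \leftarrow \arg\max_{\theta'} \; \bar{A}(\theta') \quad \text{s.t.} \quad E_{s_t \sim p_\theta(s_t)} \left[ D_{KL} \left( \pi_{\theta'}(\cdot \mid s_t) \,\|\, \pi_\theta(\cdot \mid s_t) \right) \right] \le \epsilon

and for small enough \epsilon the bound from the previous slides guarantees that improving \bar{A} improves the true objective J.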

Page 16:

How do we enforce the constraint?

can do this incompletely (for a few grad steps)
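
A sketch of the Lagrangian / dual gradient descent scheme this annotation refers to (reconstruction of the standard approach, not a verbatim copy of the slide):

\mathcal{L}(\theta', \lambda) = \bar{A}(\theta') - \lambda \left( E_{s_t \sim p_\theta} \left[ D_{KL} \left( \pi_{\theta'} \,\|\, \pi_\theta \right) \right] - \epsilon \right)

1. Maximize \mathcal{L}(\theta', \lambda) with respect to \theta' (can be done incompletely, with a few gradient steps).
2. Update \lambda \leftarrow \lambda + \alpha \left( E_{s_t \sim p_\theta} \left[ D_{KL} \left( \pi_{\theta'} \,\|\, \pi_\theta \right) \right] - \epsilon \right).

The multiplier grows when the constraint is violated and shrinks otherwise, so the penalty strength adapts automatically.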

Page 17:

How (else) do we optimize the objective?

Use a first-order Taylor approximation for the objective (a.k.a. linearization)
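
Linearizing the surrogate around the current parameters gives (standard form):

\theta' \leftarrow \arg\max_{\theta'} \; \nabla_\theta \bar{A}(\theta)^\top (\theta' - \theta) \quad \text{s.t.} \quad E_{s_t \sim p_\theta} \left[ D_{KL} \left( \pi_{\theta'} \,\|\, \pi_\theta \right) \right] \le \epsilon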

Page 18:

How do we optimize the objective?

(see policy gradient lecture for derivation)

exactly the normal policy gradient!
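
Written out (a hedged reconstruction of the standard derivation), evaluating the gradient of the importance-sampled objective at \theta' = \theta gives

\nabla_{\theta'} \bar{A}(\theta') \Big|_{\theta' = \theta} = \sum_t E_{s_t \sim p_\theta} \left[ E_{a_t \sim \pi_\theta} \left[ \gamma^t \nabla_\theta \log \pi_\theta(a_t \mid s_t) \, A^{\pi_\theta}(s_t, a_t) \right] \right] = \nabla_\theta J(\theta)

since the importance weight equals 1 at \theta' = \theta and its gradient produces the \nabla \log \pi term.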

Page 19:

Can we just use the gradient then?

Page 20:

Can we just use the gradient then?

not the same!

second order Taylor expansion
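
Plain gradient ascent actually solves the linearized problem under a Euclidean constraint \| \theta' - \theta \|^2 \le \epsilon, which is not the same as the KL constraint: equal-sized steps in parameter space can change the policy distribution by very different amounts. The fix is to expand the KL divergence to second order around \theta' = \theta (standard result):

D_{KL} \left( \pi_{\theta'} \,\|\, \pi_\theta \right) \approx \tfrac{1}{2} (\theta' - \theta)^\top \mathbf{F} (\theta' - \theta), \qquad \mathbf{F} = E_{\pi_\theta} \left[ \nabla_\theta \log \pi_\theta(a \mid s) \, \nabla_\theta \log \pi_\theta(a \mid s)^\top \right]

where \mathbf{F} is the Fisher information matrix, which can be estimated with samples from \pi_\theta.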

Page 21:

Can we just use the gradient then?

natural gradient
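
Solving the linearized objective under the quadratic (Fisher) approximation of the KL constraint gives the natural gradient update in closed form:

\theta' = \theta + \alpha \, \mathbf{F}^{-1} \nabla_\theta J(\theta), \qquad \alpha = \sqrt{ \frac{2 \epsilon}{ \nabla_\theta J(\theta)^\top \mathbf{F}^{-1} \nabla_\theta J(\theta) } }

Picking \alpha this way enforces the (approximate) KL constraint; using a fixed learning rate instead gives the plain natural policy gradient.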

Page 22:

Is this even a problem in practice?

(image from Peters & Schaal 2008)

Essentially the same problem as this:

(figure from Peters & Schaal 2008)

Page 23:

Practical methods and notes

• Natural policy gradient
  • Generally a good choice to stabilize policy gradient training
  • See this paper for details: Peters, Schaal. Reinforcement learning of motor skills with policy gradients.
  • Practical implementation: requires efficient Fisher-vector products, which is a bit non-trivial to do without computing the full matrix (see the sketch after this list)
    • See: Schulman et al. Trust region policy optimization

• Trust region policy optimization

• Just use the IS objective directly
  • Use regularization to stay close to the old policy
  • See: Proximal policy optimization
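
As a rough illustration of the “efficient Fisher-vector products” point (a minimal sketch, not the implementation from either paper; policy_grad and fvp are hypothetical inputs assumed to be supplied by the surrounding training code), the natural gradient direction F^{-1} g can be computed with conjugate gradient so that the full Fisher matrix is never formed:

import numpy as np

def conjugate_gradient(fvp, g, iters=10, tol=1e-10):
    """Approximately solve F x = g using only Fisher-vector products fvp(v) = F v."""
    x = np.zeros_like(g)
    r = g.copy()   # residual r = g - F x (x starts at zero)
    p = g.copy()   # search direction
    rr = r @ r
    for _ in range(iters):
        Fp = fvp(p)
        alpha = rr / (p @ Fp)
        x += alpha * p
        r -= alpha * Fp
        rr_new = r @ r
        if rr_new < tol:
            break
        p = r + (rr_new / rr) * p
        rr = rr_new
    return x

def natural_gradient_step(theta, policy_grad, fvp, epsilon=0.01):
    """One natural policy gradient step with the KL-derived step size.
    policy_grad: estimated policy gradient at theta (assumed given).
    fvp: function returning F v at theta (in practice, a Hessian-vector
         product of the sampled KL divergence, computed with autodiff)."""
    direction = conjugate_gradient(fvp, policy_grad)   # ~= F^{-1} * policy_grad
    step_size = np.sqrt(2.0 * epsilon / (policy_grad @ direction + 1e-8))
    return theta + step_size * direction

TRPO additionally follows this direction with a backtracking line search that checks the surrogate objective and the actual KL divergence before accepting the step.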

Page 24:

generate samples (i.e. run the policy)

fit a model to estimate return

improve the policy

Review

• Policy gradient = policy iteration

• Optimize advantage under new policy state distribution

• Using old policy state distribution optimizes a bound, if the policies are close enough

• Results in constrained optimization problem

• First order approximation to objective = gradient ascent

• Regular gradient ascent has the wrong constraint, use natural gradient

• Practical algorithms

• Natural policy gradient

• Trust region policy optimization