Model-Based Reinforcement Learning CS 294-112: Deep Reinforcement Learning Sergey Levine

Model-Based Reinforcement Learning - University of California, Berkeley · 2017. 9. 24. · 1. Last lecture: choose good actions autonomously by backpropagating (or planning) through


Model-Based Reinforcement Learning

CS 294-112: Deep Reinforcement Learning

Sergey Levine


Class Notes

1. Homework 3 due in one week
   • Don’t put it off! It takes a while to train.

2. Project proposal due in two weeks!


Overview

1. Last lecture: choose good actions autonomously by backpropagating (or planning) through known system dynamics (e.g. known physics)

2. Today: what do we do if the dynamics are unknown?
   a. Fitting global dynamics models (“model-based RL”)
   b. Fitting local dynamics models

3. Wednesday: combining optimal control and policy search to train neural network policies with the aid of optimal control


Today’s Lecture

1. Overview of model-based RL
   • Learn only the model
   • Learn model & policy

2. What kind of models can we use?

3. Global models and local models

4. Learning with local models and trust regions

• Goals:
   • Understand the terminology and formalism of model-based RL
   • Understand the options for models we can use in model-based RL
   • Understand practical considerations of model learning


Why learn the model?



Does it work? Yes!

• Essentially how system identification works in classical robotics

• Some care should be taken to design a good base policy

• Particularly effective if we can hand-engineer a dynamics representation using our knowledge of physics, and fit just a few parameters
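The recipe behind this slide (run a base policy, record transitions, fit a dynamics model) can be sketched in a few lines. This is a minimal illustration, not the lecture's code: the point-mass system in `true_step`, the uniform random base policy, and the linear model class are all assumptions chosen so that a least-squares fit with only a few parameters is enough.

```python
import numpy as np

def true_step(x, u):
    """Hypothetical ground-truth dynamics: a damped point mass (position, velocity)."""
    A = np.array([[1.0, 0.1], [0.0, 0.95]])
    B = np.array([[0.0], [0.1]])
    return A @ x + B @ u

def collect_random_data(n_steps, rng):
    """Run a random base policy and record (x, u, x') transitions."""
    X, U, Xn = [], [], []
    x = rng.normal(size=2)
    for _ in range(n_steps):
        u = rng.uniform(-1.0, 1.0, size=1)
        x_next = true_step(x, u)
        X.append(x)
        U.append(u)
        Xn.append(x_next)
        x = x_next
    return np.array(X), np.array(U), np.array(Xn)

def fit_linear_model(X, U, Xn):
    """Least-squares fit of x' ~ A x + B u (the 'few parameters' case)."""
    Z = np.hstack([X, U])                        # regressors, shape (N, dx + du)
    W, *_ = np.linalg.lstsq(Z, Xn, rcond=None)   # shape (dx + du, dx)
    dx = X.shape[1]
    return W[:dx].T, W[dx:].T                    # A_hat, B_hat

rng = np.random.default_rng(0)
X, U, Xn = collect_random_data(200, rng)
A_hat, B_hat = fit_linear_model(X, U, Xn)
```

Because the assumed system is exactly linear and noise-free, the fit recovers the true matrices; with a richer model class, the same data would only constrain the model near the states the base policy visited.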


Does it work? No!

• Distribution mismatch problem becomes exacerbated as we use more expressive model classes

[Figure annotation: “go right to get higher!”]


Can we do better?


What if we make a mistake?


Can we do better?

[Figure label: every N steps]

This will be on HW4!


How to replan?

[Figure label: every N steps]

• The more you replan, the less perfect each individual plan needs to be

• Can use shorter horizons

• Even random sampling can often work well here!
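The points above can be sketched as a model-predictive control loop with a random-shooting planner. This is an illustration under stated assumptions, not the homework's reference code: `model_step` stands in for a learned model (and, in this toy, also for the real system), while `cost`, the horizon, and the number of candidates are hypothetical choices.

```python
import numpy as np

def model_step(x, u):
    """Hypothetical learned model: point mass with state (position, velocity)."""
    pos, vel = x
    vel = 0.95 * vel + 0.1 * u
    pos = pos + 0.1 * vel
    return np.array([pos, vel])

def cost(x, u):
    """Hypothetical cost: reach the origin with small velocity and small actions."""
    return x[0] ** 2 + 0.1 * x[1] ** 2 + 0.01 * u ** 2

def plan_random_shooting(x0, horizon, n_candidates, rng):
    """Sample action sequences, roll each out in the model, keep the cheapest one."""
    candidates = rng.uniform(-1.0, 1.0, size=(n_candidates, horizon))
    best_cost, best_first = np.inf, 0.0
    for seq in candidates:
        x, total = x0, 0.0
        for u in seq:
            total += cost(x, u)
            x = model_step(x, u)
        if total < best_cost:
            best_cost, best_first = total, seq[0]
    return best_first  # only the first action is executed before replanning

def run_mpc(x0, n_steps, horizon=10, n_candidates=200, seed=0):
    """Replan at every step with a short horizon."""
    rng = np.random.default_rng(seed)
    x = np.array(x0, dtype=float)
    for _ in range(n_steps):
        u = plan_random_shooting(x, horizon, n_candidates, rng)
        x = model_step(x, u)  # stand-in for stepping the real system
    return x

final_state = run_mpc([1.0, 0.0], n_steps=50)
```

No individual plan here is close to optimal, yet because only the first action of each plan is executed, the closed-loop behavior still drives the position toward the origin.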


That seems like a lot of work…

[Figure label: every N steps]


Backpropagate directly into the policy?

[Figure labels: backprop, backprop, backprop]

Easy for deterministic policies, but also possible for stochastic policies (more on this later)
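For a deterministic policy, backpropagating through the model and the policy is ordinary backpropagation through time. The sketch below writes the backward pass out by hand for a scalar example; the model `x' = a x + b u`, the linear policy `u = k x`, the quadratic cost, and all constants are assumptions made for illustration only.

```python
import numpy as np

A_COEF, B_COEF, R = 1.0, 0.5, 0.1  # hypothetical model x' = a*x + b*u and action penalty

def rollout_cost_and_grad(k, x0, horizon):
    """Total cost of the policy u = k*x under the model, and its gradient w.r.t. k."""
    # Forward pass: unroll policy and model, storing states and actions.
    xs, us = [x0], []
    for _ in range(horizon):
        u = k * xs[-1]
        us.append(u)
        xs.append(A_COEF * xs[-1] + B_COEF * u)
    total = sum(x ** 2 + R * u ** 2 for x, u in zip(xs[:-1], us))

    # Backward pass: propagate d(total)/d(x_{t+1}) back through model and policy.
    grad_k, grad_x_next = 0.0, 0.0
    for t in reversed(range(horizon)):
        grad_u = 2.0 * R * us[t] + B_COEF * grad_x_next       # cost term + model path
        grad_k += grad_u * xs[t]                              # u_t = k * x_t
        grad_x_next = 2.0 * xs[t] + A_COEF * grad_x_next + k * grad_u
    return total, grad_k

def train_policy(k=0.0, x0=1.0, horizon=20, lr=0.01, iters=200):
    """Plain gradient descent on the policy parameter."""
    for _ in range(iters):
        _, grad = rollout_cost_and_grad(k, x0, horizon)
        k -= lr * grad
    return k

k_trained = train_policy()
```

The gradient is a sum of products along the whole rollout, so it can grow or shrink quickly with the horizon; the step size here was picked by hand for this toy system.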


Summary

• Version 0.5: collect random samples, train dynamics, plan
   • Pro: simple, no iterative procedure
   • Con: distribution mismatch problem

• Version 1.0: iteratively collect data, replan, collect data
   • Pro: simple, solves distribution mismatch
   • Con: open loop plan might perform poorly, esp. in stochastic domains

• Version 1.5: iteratively collect data using MPC (replan at each step)
   • Pro: robust to small model errors
   • Con: computationally expensive, but have a planning algorithm available

• Version 2.0: backpropagate directly into policy
   • Pro: computationally cheap at runtime
   • Con: can be numerically unstable, especially in stochastic domains (more on this later)


Case study: model-based policy search with GPs


What kind of models can we use?

• Gaussian process

• neural network (image: Punjani & Abbeel ’14)

• other

• video prediction? (more on this later in the course)


Break


The trouble with global models

• Planner will seek out regions where the model is erroneously optimistic

• Need to find a very good model in most of the state space to converge on a good solution


• In some tasks, the model is much more complex than the policy


Local models



What controller to execute?



Local models


How to fit the dynamics?

Can we do better?
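The slide's equations did not survive extraction. One common way to fit local dynamics, assumed here, is a time-varying linear model x_{t+1} ≈ A_t x_t + B_t u_t + c_t fitted by least squares separately at each time step from a batch of rollouts. The function name `fit_time_varying_linear`, the array shapes, and the synthetic data below are all hypothetical.

```python
import numpy as np

def fit_time_varying_linear(X, U):
    """Fit x_{t+1} ~ A_t x_t + B_t u_t + c_t at each time step.

    X: states, shape (N, T + 1, dx); U: actions, shape (N, T, du).
    Returns A (T, dx, dx), B (T, dx, du), c (T, dx).
    """
    N, T, du = U.shape
    dx = X.shape[2]
    A = np.zeros((T, dx, dx))
    B = np.zeros((T, dx, du))
    c = np.zeros((T, dx))
    for t in range(T):
        # Regress next states on [x_t, u_t, 1] using the N samples at time t.
        Z = np.hstack([X[:, t], U[:, t], np.ones((N, 1))])
        W, *_ = np.linalg.lstsq(Z, X[:, t + 1], rcond=None)
        A[t] = W[:dx].T
        B[t] = W[dx:dx + du].T
        c[t] = W[-1]
    return A, B, c

# Synthetic rollouts from a known time-varying linear system (illustration only).
rng = np.random.default_rng(0)
N, T = 20, 5
A_true = np.stack([np.array([[1.0, 0.1], [-0.05 * t, 0.9]]) for t in range(T)])
B_true = np.array([[0.0], [0.1]])
X = np.zeros((N, T + 1, 2))
U = rng.normal(size=(N, T, 1))
X[:, 0] = rng.normal(size=(N, 2))
for t in range(T):
    X[:, t + 1] = X[:, t] @ A_true[t].T + U[:, t] @ B_true.T

A_hat, B_hat, c_hat = fit_time_varying_linear(X, U)
```

Each time step needs at least dx + du + 1 sufficiently varied samples for the regression to be well posed, which is one reason to look for something better when rollouts are expensive.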


What if we go too far?


How to stay close to old controller?


KL-divergences between trajectories

• Not just for trajectory optimization – really important for model-free policy search too! More on this in later lectures

dynamics & initial state are the same!


negative entropy
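The two annotations on these slides (“dynamics & initial state are the same!” and “negative entropy”) correspond to the following standard derivation. The notation is assumed, since the slide equations did not survive extraction: p is the new controller's trajectory distribution, \bar{p} the old one, and both share the same initial state distribution and dynamics.

```latex
\begin{align*}
p(\tau) &= p(x_1) \prod_{t=1}^{T} p(u_t \mid x_t)\, p(x_{t+1} \mid x_t, u_t), \\
\bar{p}(\tau) &= p(x_1) \prod_{t=1}^{T} \bar{p}(u_t \mid x_t)\, p(x_{t+1} \mid x_t, u_t), \\
D_{\mathrm{KL}}\big(p(\tau) \,\|\, \bar{p}(\tau)\big)
  &= E_{p(\tau)}\big[\log p(\tau) - \log \bar{p}(\tau)\big] \\
  &= \sum_{t=1}^{T} E_{p(x_t, u_t)}\big[\log p(u_t \mid x_t) - \log \bar{p}(u_t \mid x_t)\big] \\
  &= \sum_{t=1}^{T} \Big( E_{p(x_t, u_t)}\big[-\log \bar{p}(u_t \mid x_t)\big]
     - E_{p(x_t)}\big[\mathcal{H}\big(p(u_t \mid x_t)\big)\big] \Big).
\end{align*}
```

The initial-state and dynamics terms cancel in the third line because they are identical in both distributions, and the remaining E[log p(u_t | x_t)] term is the negative entropy of the new controller.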



Digression: dual gradient descent

how to maximize? Compute the gradient!
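As a small worked instance of dual gradient descent, consider minimizing f(x) = x1² + x2² subject to C(x) = x1 + x2 − 1 = 0. This toy problem and the names below are assumptions for illustration. The Lagrangian is L(x, λ) = f(x) + λ C(x), its minimizer over x has a closed form, and the gradient of the dual function g(λ) = L(x*(λ), λ) is simply C(x*(λ)).

```python
import numpy as np

def argmin_lagrangian(lam):
    """Closed-form minimizer over x of L(x, lam) = x1^2 + x2^2 + lam * (x1 + x2 - 1)."""
    return np.array([-lam / 2.0, -lam / 2.0])

def constraint(x):
    """C(x) = x1 + x2 - 1, which should be zero at the solution."""
    return x[0] + x[1] - 1.0

def dual_gradient_descent(lam=0.0, step=0.5, iters=50):
    for _ in range(iters):
        x_star = argmin_lagrangian(lam)   # 1. minimize the Lagrangian over x
        dual_grad = constraint(x_star)    # 2. dg/dlam = C(x*(lam))
        lam = lam + step * dual_grad      # 3. gradient ascent step on the dual
    return argmin_lagrangian(lam), lam

x_opt, lam_opt = dual_gradient_descent()
```

For this problem the iteration converges to x = (0.5, 0.5) with λ = −1. In general, step 1 has no closed form and is the expensive part.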



DGD with iterative LQR


this is the hard part, everything else is easy!
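The equations on these slides were lost in extraction. A sketch of how dual gradient descent is usually combined with iterative LQR is given below, under assumed notation: c is the cost, \bar{p} the previous controller, ε the trust-region size, λ ≥ 0 the dual variable, and α a step size.

```latex
\begin{align*}
&\min_{p} \sum_{t=1}^{T} E_{p(x_t, u_t)}\big[c(x_t, u_t)\big]
  \quad \text{s.t.} \quad D_{\mathrm{KL}}\big(p(\tau) \,\|\, \bar{p}(\tau)\big) \le \epsilon, \\
&\mathcal{L}(p, \lambda) = \sum_{t=1}^{T} E_{p(x_t, u_t)}\big[c(x_t, u_t)
  - \lambda \log \bar{p}(u_t \mid x_t)\big]
  - \lambda \sum_{t=1}^{T} E_{p(x_t)}\big[\mathcal{H}\big(p(u_t \mid x_t)\big)\big]
  - \lambda \epsilon, \\
&\text{1. } p^\star \leftarrow \arg\min_{p} \mathcal{L}(p, \lambda)
  \quad \text{(iterative LQR with cost } \tilde{c}(x_t, u_t)
  = \tfrac{1}{\lambda} c(x_t, u_t) - \log \bar{p}(u_t \mid x_t)\text{)}, \\
&\text{2. } \frac{dg}{d\lambda} = D_{\mathrm{KL}}\big(p^\star(\tau) \,\|\, \bar{p}(\tau)\big) - \epsilon, \\
&\text{3. } \lambda \leftarrow \max\Big(0,\; \lambda + \alpha \frac{dg}{d\lambda}\Big).
\end{align*}
```

Step 1 is the expensive one: dividing the Lagrangian by λ leaves an entropy-regularized objective, which the linear-Gaussian LQR solution minimizes under the modified cost. Steps 2 and 3 are a scalar update.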



Trust regions & trajectory distributions

• Bounding KL-divergences between two policies or controllers, whether linear-Gaussian or more complex (e.g. neural networks), is really useful

• Bounding KL-divergences between policies is equivalent to bounding KL-divergences between trajectory distributions

• We’ll use this later in the course in model-free RL too!
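For linear-Gaussian controllers, the per-state KL-divergence has a closed form, which makes a trust-region check cheap. The sketch below assumes controllers of the form u ~ N(K x + k, Σ), stored as hypothetical `(K, k, Sigma)` tuples, and uses the standard KL formula for two Gaussians.

```python
import numpy as np

def gaussian_kl(mu0, S0, mu1, S1):
    """KL( N(mu0, S0) || N(mu1, S1) ) for multivariate Gaussians."""
    d = mu0.shape[0]
    S1_inv = np.linalg.inv(S1)
    diff = mu1 - mu0
    _, logdet0 = np.linalg.slogdet(S0)
    _, logdet1 = np.linalg.slogdet(S1)
    return 0.5 * (np.trace(S1_inv @ S0) + diff @ S1_inv @ diff - d + logdet1 - logdet0)

def controller_kl(x, new, old):
    """KL between two linear-Gaussian controllers u ~ N(K x + k, Sigma) at state x."""
    K_new, k_new, S_new = new
    K_old, k_old, S_old = old
    return gaussian_kl(K_new @ x + k_new, S_new, K_old @ x + k_old, S_old)

# Hypothetical controllers: same gain and covariance, offset shifted by 1.
old = (np.array([[-1.0, 0.0]]), np.array([0.0]), np.array([[1.0]]))
new = (np.array([[-1.0, 0.0]]), np.array([1.0]), np.array([[1.0]]))
x = np.array([0.3, 0.1])

kl = controller_kl(x, new, old)   # 0.5 for a unit mean shift with unit variance
within_trust_region = kl <= 1.0   # epsilon = 1.0 is an arbitrary illustrative bound
```

Averaging this quantity over the states visited by the new controller and summing over time steps gives the trajectory-level divergence that the trust region bounds.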


Case study: local models & iterative LQR



Case study: combining global and local models
