Model-Based Reinforcement Learning
CS 294-112: Deep Reinforcement Learning
Sergey Levine
Source: rail.eecs.berkeley.edu › deeprlcourse-fa17 › f17docs › lecture_9_model_based_rl.pdf


Page 1:

Model-Based Reinforcement Learning

CS 294-112: Deep Reinforcement Learning

Sergey Levine

Page 2:

Class Notes

1. Homework 3 due in one week
• Don’t put it off! It takes a while to train.

2. Project proposal due in two weeks!

Page 3:

1. Last lecture: choose good actions autonomously by backpropagating (or planning) through known system dynamics (e.g. known physics)

2. Today: what do we do if the dynamics are unknown?
a. Fitting global dynamics models (“model-based RL”)

b. Fitting local dynamics models

3. Wednesday: combining optimal control and policy search to train neural network policies with the aid of optimal control

Overview

Page 4:

1. Overview of model-based RL
• Learn only the model

• Learn model & policy

2. What kind of models can we use?

3. Global models and local models

4. Learning with local models and trust regions

• Goals:
• Understand the terminology and formalism of model-based RL

• Understand the options for models we can use in model-based RL

• Understand practical considerations of model learning

Today’s Lecture

Pages 5-7:

Why learn the model?

Page 8:

Does it work? Yes!

• Essentially how system identification works in classical robotics

• Some care should be taken to design a good base policy

• Particularly effective if we can hand-engineer a dynamics representation using our knowledge of physics, and fit just a few parameters
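To make this concrete, here is a minimal sketch of the basic recipe these slides refer to: run a base policy to collect transitions, fit a simple dynamics model to them, then plan through the fitted model. The gym-style `env` API, the linear least-squares model, and the `planner`/`cost_fn` hooks are illustrative assumptions, not the lecture's code.

```python
import numpy as np

def collect_data(env, base_policy, num_steps):
    """Roll out a base policy (possibly random) and record (s, a, s') transitions."""
    transitions = []
    s = env.reset()
    for _ in range(num_steps):
        a = base_policy(s)
        s_next, _, done, _ = env.step(a)
        transitions.append((s, a, s_next))
        s = env.reset() if done else s_next
    return transitions

def fit_dynamics(transitions):
    """Fit s' ~ W^T [s; a; 1] by least squares, a stand-in for whatever
    (possibly physics-based, few-parameter) model class is appropriate."""
    X = np.array([np.concatenate([s, a, [1.0]]) for s, a, _ in transitions])
    Y = np.array([s_next for _, _, s_next in transitions])
    W, *_ = np.linalg.lstsq(X, Y, rcond=None)
    return lambda s, a: np.concatenate([s, a, [1.0]]) @ W

# "Model-based RL version 0.5": collect data once, fit the model, plan through it.
# transitions = collect_data(env, base_policy, num_steps=1000)
# model = fit_dynamics(transitions)
# actions = planner(model, cost_fn)   # planner / cost_fn are placeholder hooks
```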

Page 9:

Does it work? No!

• Distribution mismatch problem becomes exacerbated as we use more expressive model classes

(figure annotation: “go right to get higher!”)

Page 10:

Can we do better?

Page 11:

What if we make a mistake?

Page 12:

Can we do better?

(figure: the model-based RL loop, replanning every N steps)

This will be on HW4!

Page 13:

How to replan?

(figure: the same loop, replanning every N steps)

• The more you replan, the less perfect each individual plan needs to be

• Can use shorter horizons

• Even random sampling can often work well here!
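As one way to picture this, here is a minimal random-shooting MPC sketch: sample many random action sequences, score them under the learned model over a short horizon, execute only the first action of the best sequence, then replan. The `model`, `cost_fn`, action range, and gym-style `env` calls are assumed placeholders, not the lecture's implementation.

```python
import numpy as np

def random_shooting_mpc(model, cost_fn, s0, horizon=15, num_candidates=1000, action_dim=2):
    """Sample random action sequences, roll each out through the learned model,
    and return the first action of the cheapest sequence (then replan next step)."""
    best_cost, best_first_action = np.inf, None
    for _ in range(num_candidates):
        actions = np.random.uniform(-1.0, 1.0, size=(horizon, action_dim))
        s, cost = s0, 0.0
        for a in actions:
            cost += cost_fn(s, a)
            s = model(s, a)          # predicted next state under the learned model
        if cost < best_cost:
            best_cost, best_first_action = cost, actions[0]
    return best_first_action

# MPC loop: execute only the first action, observe the real next state, replan.
# s = env.reset()
# for t in range(T):
#     a = random_shooting_mpc(model, cost_fn, s)
#     s, _, done, _ = env.step(a)
```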

Page 14:

That seems like a lot of work…

Page 15:

Backpropagate directly into the policy?


easy for deterministic policies, but also possible for stochastic policies (more on this later)
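A minimal sketch of the deterministic case, under assumptions of my own (PyTorch, small MLPs, a quadratic cost), not the lecture's code: unroll the policy through a differentiable learned dynamics model and backpropagate the summed cost into the policy parameters.

```python
import torch
import torch.nn as nn

state_dim, action_dim, horizon = 4, 2, 20

# Placeholder learned dynamics model and policy network (architectures are illustrative).
dynamics = nn.Sequential(nn.Linear(state_dim + action_dim, 64), nn.Tanh(),
                         nn.Linear(64, state_dim))
policy = nn.Sequential(nn.Linear(state_dim, 64), nn.Tanh(),
                       nn.Linear(64, action_dim))
optimizer = torch.optim.Adam(policy.parameters(), lr=1e-3)

def cost(s, a):
    # Example quadratic cost: drive the state to zero with small actions.
    return (s ** 2).sum() + 0.01 * (a ** 2).sum()

s = torch.ones(state_dim)            # example initial state
total_cost = torch.zeros(())
for t in range(horizon):
    a = policy(s)
    total_cost = total_cost + cost(s, a)
    s = s + dynamics(torch.cat([s, a]))   # model predicts the change in state

optimizer.zero_grad()
total_cost.backward()                # gradients flow through the dynamics into the policy
optimizer.step()                     # one gradient step on the policy parameters
```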

Page 16:

Summary

• Version 0.5: collect random samples, train dynamics, plan
  • Pro: simple, no iterative procedure
  • Con: distribution mismatch problem

• Version 1.0: iteratively collect data, replan, collect data
  • Pro: simple, solves distribution mismatch
  • Con: open-loop plan might perform poorly, especially in stochastic domains

• Version 1.5: iteratively collect data using MPC (replan at each step)
  • Pro: robust to small model errors
  • Con: computationally expensive, but you do have a planning algorithm available

• Version 2.0: backpropagate directly into the policy
  • Pro: computationally cheap at runtime
  • Con: can be numerically unstable, especially in stochastic domains (more on this later)

Pages 17-19:

Case study: model-based policy search with GPs

Pages 20-21:

What kind of models can we use?

• Gaussian process
• neural network (image: Punjani & Abbeel ’14; a minimal sketch follows below)
• other
• video prediction? (more on this later in the course)
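For the neural-network option, a dynamics model trained by plain regression on observed transitions might look like the following sketch (PyTorch and delta-state prediction are my assumptions, not the lecture's prescription).

```python
import torch
import torch.nn as nn

class DynamicsModel(nn.Module):
    """Neural-network dynamics model f(s, a) -> s'. Predicting the change in
    state and adding it back often trains more stably than predicting s' directly."""
    def __init__(self, state_dim, action_dim, hidden=128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim + action_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, state_dim))

    def forward(self, s, a):
        return s + self.net(torch.cat([s, a], dim=-1))

def train_dynamics(model, states, actions, next_states, epochs=100, lr=1e-3):
    """Fit the model by mean-squared-error regression on observed transitions."""
    opt = torch.optim.Adam(model.parameters(), lr=lr)
    for _ in range(epochs):
        pred = model(states, actions)
        loss = ((pred - next_states) ** 2).mean()
        opt.zero_grad()
        loss.backward()
        opt.step()
    return model

# model = DynamicsModel(state_dim=4, action_dim=2)
# train_dynamics(model, states, actions, next_states)   # tensors of shape (N, dim)
```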

Page 22:


Pages 23-24:

Break

Page 25:

The trouble with global models

• Planner will seek out regions where the model is erroneously optimistic

• Need to find a very good model in most of the state space to converge on a good solution

Page 26:

The trouble with global models

• Planner will seek out regions where the model is erroneously optimistic

• Need to find a very good model in most of the state space to converge on a good solution

• In some tasks, the model is much more complex than the policy

Pages 27-29:

Local models

Pages 30-32:

What controller to execute?

Page 33:

Local models

Page 34:

How to fit the dynamics?

Can we do better?
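One straightforward answer, sketched below under assumptions about the data layout: at each time step, fit time-varying linear dynamics x_{t+1} ≈ A_t x_t + B_t u_t + c_t by least squares over the rollouts collected at that step. The "can we do better?" question presumably points toward sharing information across time steps, e.g. using a global model as a prior (see the later case study on combining global and local models).

```python
import numpy as np

def fit_local_linear_dynamics(states, actions, next_states):
    """Fit time-varying linear dynamics x_{t+1} ~= A_t x_t + B_t u_t + c_t
    by least squares at every time step.

    states, actions, next_states: arrays of shape (N, T, dim) from N rollouts.
    Returns per-timestep matrices A_t, B_t and offsets c_t."""
    N, T, xdim = states.shape
    udim = actions.shape[-1]
    A, B, c = [], [], []
    for t in range(T):
        # Regressors: [x_t, u_t, 1] across the N rollouts at this time step.
        X = np.concatenate([states[:, t], actions[:, t], np.ones((N, 1))], axis=1)
        Y = next_states[:, t]
        W, *_ = np.linalg.lstsq(X, Y, rcond=None)      # shape (xdim + udim + 1, xdim)
        A.append(W[:xdim].T)
        B.append(W[xdim:xdim + udim].T)
        c.append(W[-1])
    return A, B, c
```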

Page 35:

What if we go too far?

Page 36:

How to stay close to old controller?

Page 37:

KL-divergences between trajectories

• Not just for trajectory optimization – really important for model-free policy search too! More on this in later lectures

dynamics & initial state are the same!

Page 38:

KL-divergences between trajectories

Page 39:

KL-divergences between trajectories

negative entropy
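Reconstructing the step behind these slides (x for states, u for actions, $\bar p$ for the previous trajectory distribution): because the dynamics and initial state are the same for both distributions, those terms cancel, and the KL between trajectory distributions reduces to per-time-step action terms, one of which is the negative entropy noted above.

$$
D_{\mathrm{KL}}\!\left(p(\tau)\,\|\,\bar p(\tau)\right)
= E_{p(\tau)}\!\left[\log p(\tau) - \log \bar p(\tau)\right]
= E_{p(\tau)}\!\left[\sum_{t=1}^{T} \log p(u_t \mid x_t) - \log \bar p(u_t \mid x_t)\right]
$$
$$
= \sum_{t=1}^{T} E_{p(x_t,u_t)}\!\left[-\log \bar p(u_t \mid x_t)\right]
  - E_{p(x_t)}\!\left[\mathcal{H}\!\left(p(u_t \mid x_t)\right)\right].
$$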

Page 40:

KL-divergences between trajectories

Page 41:

Digression: dual gradient descent

how to maximize? Compute the gradient!
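Filling in the derivation the slide gestures at, in generic constrained-optimization notation: dual gradient descent alternates between minimizing the Lagrangian over the primal variable and taking a gradient step on the dual variable, and the gradient of the dual is simply the constraint value at the minimizer.

$$
\min_x f(x)\;\;\text{s.t.}\;\;C(x)=0,
\qquad
\mathcal{L}(x,\lambda) = f(x) + \lambda\, C(x),
\qquad
g(\lambda) = \min_x \mathcal{L}(x,\lambda).
$$
$$
\frac{dg}{d\lambda}
= \frac{\partial \mathcal{L}}{\partial \lambda}\bigg|_{x^\star(\lambda)}
= C\!\left(x^\star(\lambda)\right)
\quad\text{(the $\partial\mathcal{L}/\partial x$ term vanishes at the minimizer),}
$$
so each iteration sets $x^\star \leftarrow \arg\min_x \mathcal{L}(x,\lambda)$ and then $\lambda \leftarrow \lambda + \alpha\, C(x^\star)$.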

Pages 42-43:

Digression: dual gradient descent

Page 44:

DGD with iterative LQR

Page 45:

DGD with iterative LQR

this is the hard part, everything else is easy!
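A hedged reconstruction of how the pieces fit together (the surrogate-cost form below is my summary of the standard KL-constrained trajectory optimization, not a quote from the slides): the constrained problem is

$$
\min_{p}\;\sum_{t} E_{p(x_t,u_t)}\!\left[c(x_t,u_t)\right]
\quad\text{s.t.}\quad
D_{\mathrm{KL}}\!\left(p(\tau)\,\|\,\bar p(\tau)\right) \le \epsilon .
$$

Forming the Lagrangian and minimizing over $p$ for fixed $\lambda$ (the "hard part") becomes, after dividing by $\lambda$ and using the KL decomposition above, an LQR-style problem with the surrogate cost
$$
\tilde c(x_t,u_t) = \tfrac{1}{\lambda}\, c(x_t,u_t) - \log \bar p(u_t \mid x_t),
$$
which iterative LQR can solve. The easy part is the dual update $\lambda \leftarrow \lambda + \alpha\left(D_{\mathrm{KL}}(p(\tau)\,\|\,\bar p(\tau)) - \epsilon\right)$, and the two steps are alternated.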

Pages 46-47:

DGD with iterative LQR

Page 48:

Trust regions & trajectory distributions

• Bounding KL-divergences between two policies or controllers, whether linear-Gaussian or more complex (e.g. neural networks), is really useful

• Bounding the KL-divergence between policies is equivalent to bounding the KL-divergence between trajectory distributions

• We’ll use this later in the course in model-free RL too!

Pages 49-52:

Case study: local models & iterative LQR

Pages 53-55:

Case study: combining global and local models