
Model-Based Reinforcement Learning

CS 294-112: Deep Reinforcement Learning

Sergey Levine

Class Notes

1. Homework 3 due in one week
   • Don’t put it off! It takes a while to train.

2. Project proposal due in two weeks!

1. Last lecture: choose good actions autonomously by backpropagating (or planning) through known system dynamics (e.g. known physics)

2. Today: what do we do if the dynamics are unknown?
   a. Fitting global dynamics models (“model-based RL”)

b. Fitting local dynamics models

3. Wednesday: combining optimal control and policy search, i.e. training neural network policies with the aid of optimal control

Overview

1. Overview of model-based RL
   • Learn only the model

   • Learn model & policy

2. What kind of models can we use?

3. Global models and local models

4. Learning with local models and trust regions

• Goals:
   • Understand the terminology and formalism of model-based RL

   • Understand the options for models we can use in model-based RL

   • Understand practical considerations of model learning

Today’s Lecture

Why learn the model?


Does it work? Yes!

• Essentially how system identification works in classical robotics

• Some care should be taken to design a good base policy

• Particularly effective if we can hand-engineer a dynamics representation using our knowledge of physics, and fit just a few parameters (see the sketch below)
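
As the last bullet suggests, the fitting step reduces to regression on observed transitions. Below is a minimal, hypothetical sketch of that idea: a hand-engineered dynamics form with two unknown physical parameters, fit by nonlinear least squares on (s, a, s') data collected under a base policy. The model form and function names are illustrative assumptions, not code from the slides.

```python
import numpy as np
from scipy.optimize import least_squares

# Hypothetical hand-engineered dynamics s' = f(s, a; theta), where theta holds
# a few unknown physical parameters (here: mass and friction of a point mass).
def predict_next_state(theta, s, a, dt=0.05):
    mass, friction = theta
    accel = a / mass - friction * s[1]          # state = [position, velocity]
    return np.array([s[0] + dt * s[1], s[1] + dt * accel])

def residuals(theta, transitions):
    # prediction errors over all observed (s, a, s') transitions from the base policy
    return np.concatenate([predict_next_state(theta, s, a) - s_next
                           for s, a, s_next in transitions])

def fit_parameters(transitions, theta0=(1.0, 0.1)):
    # classic system identification: nonlinear least squares over the few parameters
    return least_squares(residuals, x0=np.array(theta0), args=(transitions,)).x
```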

Does it work? No!

• Distribution mismatch problem becomes exacerbated as we use more expressive model classes

(figure annotation: “go right to get higher!”; the planner exploits the model’s erroneously optimistic extrapolation outside the training distribution)

Can we do better?

What if we make a mistake?

Can we do better?

(diagram annotation: “every N steps”)

This will be on HW4!

How to replan?

(diagram annotation: “every N steps”)

• The more you replan, the less perfect each individual plan needs to be

• Can use shorter horizons

• Even random sampling can often work well here! (see the sketch below)
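
To make the last bullet concrete, here is a minimal random-shooting planner used MPC-style: at every step, sample candidate action sequences, roll them out through the learned model, and execute only the first action of the cheapest sequence. The `dynamics_model` and `cost` callables are assumed interfaces, not code from the lecture.

```python
import numpy as np

def random_shooting_action(state, dynamics_model, cost,
                           horizon=15, n_candidates=1000, action_dim=2):
    # sample candidate action sequences uniformly in [-1, 1]
    actions = np.random.uniform(-1, 1, size=(n_candidates, horizon, action_dim))
    states = np.repeat(state[None, :], n_candidates, axis=0)
    total_cost = np.zeros(n_candidates)
    for t in range(horizon):
        total_cost += cost(states, actions[:, t])       # assumed batched cost function
        states = dynamics_model(states, actions[:, t])  # learned f(s, a), batched
    best = np.argmin(total_cost)
    # MPC: execute only the first action, then replan from the next state
    return actions[best, 0]
```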

That seems like a lot of work…

(diagram annotation: “every N steps”)

Backpropagate directly into the policy?

(figure: unrolled computation graph with “backprop” arrows through the learned dynamics into the policy)

easy for deterministic policies, but also possible for stochastic policies (more on this later)
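
A minimal sketch (in PyTorch, an assumption; the course does not prescribe this code) of what backpropagating directly into a deterministic policy through a learned dynamics model can look like:

```python
import torch
import torch.nn as nn

# assumed toy dimensions: state_dim = 4, action_dim = 2
dynamics = nn.Sequential(nn.Linear(6, 64), nn.Tanh(), nn.Linear(64, 4))  # learned model
policy   = nn.Sequential(nn.Linear(4, 64), nn.Tanh(), nn.Linear(64, 2))  # deterministic policy
optimizer = torch.optim.Adam(policy.parameters(), lr=1e-3)

def rollout_cost(s0, horizon, cost_fn):
    # unroll policy + learned dynamics; gradients flow back through both
    s, total = s0, 0.0
    for _ in range(horizon):
        a = policy(s)
        s = dynamics(torch.cat([s, a], dim=-1))   # predicted next state
        total = total + cost_fn(s, a)             # cost_fn returns a scalar tensor
    return total

def policy_update(s0, horizon, cost_fn):
    optimizer.zero_grad()
    loss = rollout_cost(s0, horizon, cost_fn)
    loss.backward()      # backpropagate through the dynamics into the policy
    optimizer.step()     # only the policy parameters are updated here
    return loss.item()
```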

Summary

• Version 0.5: collect random samples, train dynamics, plan
   • Pro: simple, no iterative procedure
   • Con: distribution mismatch problem

• Version 1.0: iteratively collect data, replan, collect data
   • Pro: simple, solves the distribution mismatch problem
   • Con: open-loop plan might perform poorly, especially in stochastic domains

• Version 1.5: iteratively collect data using MPC (replan at each step); see the loop sketched after this list
   • Pro: robust to small model errors
   • Con: computationally expensive, but we already have a planning algorithm available

• Version 2.0: backpropagate directly into the policy
   • Pro: computationally cheap at runtime
   • Con: can be numerically unstable, especially in stochastic domains (more on this later)
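
Putting versions 0.5 through 1.5 together gives the outer loop sketched below. Everything here is a caller-supplied placeholder (base_policy, fit_dynamics, plan_action, collect_rollout); the sketch only shows how the pieces from the summary fit together.

```python
def model_based_rl(env, base_policy, fit_dynamics, plan_action, collect_rollout,
                   n_iterations=10):
    """Assumed interfaces:
      base_policy(s) -> a
      fit_dynamics(data) -> model              (supervised learning on (s, a, s'))
      plan_action(s, model) -> a               (e.g. MPC via random shooting)
      collect_rollout(env, policy) -> list of (s, a, s') transitions
    """
    # Version 0.5: data from a base (e.g. random) policy, fit once, plan
    data = collect_rollout(env, base_policy)
    model = fit_dynamics(data)
    for _ in range(n_iterations):
        # Version 1.5: act with MPC, replanning at every step
        new_data = collect_rollout(env, lambda s: plan_action(s, model))
        # Version 1.0: aggregate on-policy data to fix distribution mismatch
        data = data + new_data
        model = fit_dynamics(data)
    return model
```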

Case study: model-based policy search with GPs


What kind of models can we use?

• Gaussian process

• neural network (image: Punjani & Abbeel ‘14; see the sketch below)

• other: video prediction? (more on this later in the course)
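
For the neural-network option, one common setup (an assumption here, not something the slides prescribe) is to regress the change in state from (s, a) on observed transitions:

```python
import torch
import torch.nn as nn

class DynamicsModel(nn.Module):
    """Predicts the next state via the state change delta = s' - s."""
    def __init__(self, state_dim, action_dim, hidden=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim + action_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, state_dim))

    def forward(self, s, a):
        return s + self.net(torch.cat([s, a], dim=-1))

def fit(model, s, a, s_next, epochs=200, lr=1e-3):
    # plain supervised regression on (s, a, s') tensors of shape (N, dim)
    opt = torch.optim.Adam(model.parameters(), lr=lr)
    for _ in range(epochs):
        opt.zero_grad()
        loss = ((model(s, a) - s_next) ** 2).mean()
        loss.backward()
        opt.step()
    return model
```

Roughly, Gaussian processes are very data-efficient but can struggle with non-smooth dynamics and large datasets, while neural networks are more expressive and scale to large datasets at the cost of data efficiency.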


Break

The trouble with global models

• Planner will seek out regions where the model is erroneously optimistic

• Need to find a very good model in most of the state space to converge on a good solution


• In some tasks, the model is much more complex than the policy

Local models


What controller to execute?


Local models

How to fit the dynamics?
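
The usual answer in this local-model setting (a hedged sketch; the exact procedure depends on the controller and noise model) is to fit time-varying linear-Gaussian dynamics p(x_{t+1} | x_t, u_t) = N(A_t x_t + B_t u_t + c_t, N_t) around samples from the current controller, e.g. by per-time-step linear regression:

```python
import numpy as np

def fit_local_dynamics(rollouts):
    """Fit time-varying linear dynamics x_{t+1} ~ A_t x_t + B_t u_t + c_t.

    rollouts: tuple (X, U) with X of shape (n_rollouts, T+1, state_dim)
              and U of shape (n_rollouts, T, action_dim).
    Returns a list of (A_t, B_t, c_t) for t = 0..T-1.
    """
    X, U = rollouts
    n, T, dx, du = X.shape[0], U.shape[1], X.shape[2], U.shape[2]
    params = []
    for t in range(T):
        # regressors: [x_t, u_t, 1]; targets: x_{t+1}, stacked across rollouts
        Z = np.concatenate([X[:, t], U[:, t], np.ones((n, 1))], axis=1)
        Y = X[:, t + 1]
        W, *_ = np.linalg.lstsq(Z, Y, rcond=None)   # shape (dx + du + 1, dx)
        A_t, B_t, c_t = W[:dx].T, W[dx:dx + du].T, W[-1]
        params.append((A_t, B_t, c_t))
    return params
```

Fitting a separate regression at every time step can require many rollouts; using a global model as a prior (Bayesian linear regression) is one common way to reduce that cost.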

Can we do better?

What if we go too far?

How to stay close to old controller?
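
One standard way to formalize “staying close”, which the next slides develop, is to constrain the KL-divergence between the new and old trajectory distributions (notation assumed: new controller p(u_t | x_t), previous controller p̄(u_t | x_t)):

```latex
\min_{p}\ \sum_{t=1}^{T} E_{p(x_t, u_t)}\big[c(x_t, u_t)\big]
\quad \text{s.t.} \quad
D_{\mathrm{KL}}\big(p(\tau) \,\|\, \bar{p}(\tau)\big) \le \epsilon
```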

KL-divergences between trajectories

• Not just for trajectory optimization – really important for model-free policy search too! More on this in later lectures

(slide note: the dynamics and initial state are the same for both trajectory distributions, so those terms cancel in the KL)

(slide note: the remaining E[log p(u_t | x_t)] term is a negative entropy)
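
A reconstruction of the derivation those notes refer to: both trajectory distributions share the dynamics and the initial state, so only the per-step action terms survive in the KL.

```latex
\begin{aligned}
D_{\mathrm{KL}}\big(p(\tau)\,\|\,\bar{p}(\tau)\big)
  &= E_{p(\tau)}\big[\log p(\tau) - \log \bar{p}(\tau)\big] \\
  &= E_{p(\tau)}\Big[\sum_{t=1}^{T} \log p(u_t \mid x_t) - \log \bar{p}(u_t \mid x_t)\Big]
     \quad \text{(dynamics and initial state cancel)} \\
  &= \sum_{t=1}^{T} \Big( E_{p(x_t,u_t)}\big[-\log \bar{p}(u_t \mid x_t)\big]
     - E_{p(x_t)}\big[\mathcal{H}\big(p(\,\cdot \mid x_t)\big)\big] \Big)
     \quad \text{(the } E[\log p(u_t \mid x_t)] \text{ term is a negative entropy)} \\
  &= \sum_{t=1}^{T} E_{p(x_t)}\Big[ D_{\mathrm{KL}}\big(p(u_t \mid x_t)\,\|\,\bar{p}(u_t \mid x_t)\big) \Big]
\end{aligned}
```

Because the divergence reduces to per-time-step terms in the actions (an expected −log p̄ cost plus an entropy), it can be folded into LQR-style machinery, which is what the dual gradient descent construction below exploits.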

Digression: dual gradient descent

how to maximize? Compute the gradient!

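
Written out, the dual gradient descent recipe the slide annotations refer to (standard form, for a generic constrained problem):

```latex
\min_x f(x) \ \ \text{s.t.}\ \ C(x) = 0,
\qquad L(x, \lambda) = f(x) + \lambda\, C(x)

g(\lambda) = L\big(x^{\star}(\lambda), \lambda\big),
\qquad x^{\star}(\lambda) = \arg\min_x L(x, \lambda),
\qquad \frac{dg}{d\lambda} = C\big(x^{\star}(\lambda)\big)

\text{repeat:}\quad
x^{\star} \leftarrow \arg\min_x L(x, \lambda),
\qquad \lambda \leftarrow \lambda + \alpha\, \frac{dg}{d\lambda}
```

The “compute the gradient” note above is the middle line: once the inner minimization is solved, the dual gradient is just the constraint value evaluated at x*.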

DGD with iterative LQR


this is the hard part, everything else is easy!

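
A hedged sketch of how DGD combines with iterative LQR for the KL-constrained problem above. The inner minimization (the “hard part” flagged on the slide) becomes an LQR-style problem on a modified cost, because the KL expands into an expected −log p̄ cost plus a (negative) entropy, which a linear-Gaussian maximum-entropy LQR backward pass can optimize:

```latex
L(p, \lambda) = \sum_{t=1}^{T} E_{p(x_t, u_t)}\big[c(x_t, u_t)\big]
  + \lambda \Big( D_{\mathrm{KL}}\big(p(\tau)\,\|\,\bar{p}(\tau)\big) - \epsilon \Big)

\text{inner step:}\quad
\min_p L(p, \lambda)
\;\;\Longleftrightarrow\;\;
\text{(max-entropy) iterative LQR on the surrogate cost }
\tilde{c}(x_t, u_t) = \tfrac{1}{\lambda}\, c(x_t, u_t) - \log \bar{p}(u_t \mid x_t)

\text{outer step:}\quad
\lambda \leftarrow \lambda + \alpha \Big( D_{\mathrm{KL}}\big(p(\tau)\,\|\,\bar{p}(\tau)\big) - \epsilon \Big)
```

Dividing the Lagrangian by λ > 0 does not change the minimizer of the inner problem, which is why the surrogate cost is scaled by 1/λ.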

Trust regions & trajectory distributions

• Bounding the KL-divergence between two policies or controllers, whether linear-Gaussian or more complex (e.g. neural networks), is really useful

• Bounding the KL-divergence between policies is equivalent to bounding the KL-divergence between their trajectory distributions

• We’ll use this later in the course in model-free RL too!

Case study: local models & iterative LQR


Case study: combining global and local models
