Supervised Learning of Behaviors
CS 294-112: Deep Reinforcement Learning
Sergey Levine

Source: rail.eecs.berkeley.edu/deeprlcourse-fa18/static/slides/lec-2.pdf


Page 1:

Supervised Learning of Behaviors

CS 294-112: Deep Reinforcement Learning

Sergey Levine

Page 2:

Class Notes

1. Make sure you sign up for Piazza!

2. Homework 1 is now out

3. Remember to start forming final project groups

Page 3:

Today’s Lecture

1. Definition of sequential decision problems

2. Imitation learning: supervised learning for decision making
a. Does direct imitation work?

b. How can we make it work more often?

3. Case studies of recent work in (deep) imitation learning

4. What is missing from imitation learning?

• Goals:
• Understand definitions & notation

• Understand basic imitation learning algorithms

• Understand their strengths & weaknesses

Page 4:

1. run away

2. ignore

3. pet

Terminology & notation

Page 5:

1. run away

2. ignore

3. pet

Terminology & notation

Page 6:

Aside: notation

Richard Bellman Lev Pontryagin

управление (Russian: “control”)

Page 7:

Images: Bojarski et al. ‘16, NVIDIA

training data

supervised learning

Imitation Learning

behavior cloning
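The behavior-cloning pipeline (training data → supervised learning → policy) can be sketched in a few lines. This is an illustrative toy, not the NVIDIA system: a least-squares "policy" and synthetic linear expert data stand in for a deep network and real driving logs.

```python
import numpy as np

def behavior_cloning(observations, actions):
    """Fit a policy a = pi(o) by ordinary supervised learning.

    Linear least squares stands in for the deep network trained with
    SGD in practice; the imitation recipe is the same either way:
    collect expert (observation, action) pairs, then regress.
    """
    W, *_ = np.linalg.lstsq(observations, actions, rcond=None)
    return lambda obs: obs @ W  # the cloned policy

# Toy synthetic "expert" (hypothetical): the action is a fixed linear
# function of a 3-dimensional observation.
rng = np.random.default_rng(0)
obs = rng.normal(size=(200, 3))
expert_actions = obs @ np.array([[1.0], [-0.5], [0.2]])

policy = behavior_cloning(obs, expert_actions)
```

On states drawn from the expert's own distribution, the clone matches well; the rest of the lecture is about what happens off that distribution.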

Page 8:

Does it work? No!

Page 9:

Does it work? Yes!

Video: Bojarski et al. ‘16, NVIDIA

Page 10:

Why did that work?

Bojarski et al. ‘16, NVIDIA

Page 11:

Can we make it work more often?

cost

stability

Page 12:

Learning from a stabilizing controller

(more on this later)

Page 13:

Can we make it work more often?

Page 14:

Can we make it work more often?

DAgger: Dataset Aggregation

Ross et al. ‘11
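Schematically, DAgger's loop is: train a policy on dataset D, run that policy to see which states it actually visits, have the expert label those states with actions, aggregate into D, repeat. A sketch under toy assumptions follows; the hooks `run_policy`, `expert_label`, and `fit` and the linear setup are hypothetical stand-ins, not from the slides.

```python
import numpy as np

def dagger(run_policy, expert_label, fit, n_iters=5):
    """DAgger (Ross et al. '11), schematically:
       1. train pi_i on the aggregated dataset D
       2. run pi_i to observe which states it visits
       3. ask the expert to label those states with actions
       4. aggregate into D and repeat."""
    obs = run_policy(None)                 # seed D with an initial rollout
    D_obs, D_act = [obs], [expert_label(obs)]
    for _ in range(n_iters):
        pi = fit(np.concatenate(D_obs), np.concatenate(D_act))
        obs = run_policy(pi)               # states visited by pi itself
        D_obs.append(obs)
        D_act.append(expert_label(obs))    # expert relabels pi's states
    return pi

# Toy instantiation: linear expert, least-squares learner, and a
# stand-in "environment" that just emits random observations.
rng = np.random.default_rng(1)
true_W = np.array([[2.0], [1.0]])
expert_label = lambda o: o @ true_W

def run_policy(pi):
    return rng.normal(size=(50, 2))

def fit(X, y):
    W, *_ = np.linalg.lstsq(X, y, rcond=None)
    return lambda o: o @ W

pi = dagger(run_policy, expert_label, fit)
```

The key design point is step 3: the expert labels the learner's own states, so the training distribution tracks the policy's visitation distribution instead of drifting away from it.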

Page 15:

DAgger Example

Ross et al. ‘11

Page 16:

What’s the problem?

Ross et al. ‘11

Page 17:

Can we make it work without more data?

• DAgger addresses the problem of distributional “drift”

• What if our model is so good that it doesn’t drift?

• Need to mimic expert behavior very accurately

• But don’t overfit!

Page 18:

Why might we fail to fit the expert?

1. Non-Markovian behavior

2. Multimodal behavior

behavior depends only on current observation

If we see the same thing twice, we do the same thing twice, regardless of what happened before

Often very unnatural for human demonstrators

behavior depends on all past observations

Page 19:

How can we use the whole history?

variable number of frames, too many weights

Page 20:

How can we use the whole history?

[Diagram: RNN states chained across timesteps, with shared weights at every step]

Typically, LSTM cells work better here
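A minimal sketch of the shared-weight recurrence, with a plain tanh cell standing in for the LSTM cells the slide recommends; all weight names are illustrative.

```python
import numpy as np

def rnn_policy(history, W_h, W_o, W_out):
    """Map a variable-length observation history to an action.

    The same recurrent weights (W_h, W_o) are reused at every timestep,
    so one parameter set handles any history length; this avoids the
    'too many weights' problem of concatenating all frames."""
    h = np.zeros(W_h.shape[0])            # initial RNN state
    for obs in history:
        h = np.tanh(W_h @ h + W_o @ obs)  # shared-weight state update
    return W_out @ h                      # action read out from final state

rng = np.random.default_rng(0)
W_h = rng.normal(size=(4, 4))
W_o = rng.normal(size=(4, 3))
W_out = rng.normal(size=(2, 4))
short_hist = [rng.normal(size=3) for _ in range(2)]
long_hist = short_hist + [rng.normal(size=3) for _ in range(5)]
```

Note that `rnn_policy` accepts both `short_hist` and `long_hist` with the same weight matrices, which is the point of the weight sharing.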

Page 21:

Why might we fail to fit the expert?

1. Non-Markovian behavior

2. Multimodal behavior

1. Output mixture of Gaussians

2. Latent variable models

3. Autoregressive discretization

Page 22:

Why might we fail to fit the expert?

1. Output mixture of Gaussians

2. Latent variable models

3. Autoregressive discretization
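A sketch of why a mixture-of-Gaussians output head helps with multimodal demonstrations: a single Gaussian fit to "steer left" and "steer right" demos averages the modes and predicts "go straight", while a mixture keeps both modes. The two-mode steering data below is a made-up illustration; the weights, means, and stds would be network outputs.

```python
import numpy as np

def sample_mog(weights, means, stds, rng):
    """Sample one action from a mixture-of-Gaussians output head."""
    k = rng.choice(len(weights), p=weights)  # pick a mixture component
    return rng.normal(means[k], stds[k])     # sample within that mode

rng = np.random.default_rng(0)
# Two expert modes at -1 (left) and +1 (right); the experts never
# output the mean value 0.
samples = [sample_mog([0.5, 0.5], [-1.0, 1.0], [0.01, 0.01], rng)
           for _ in range(1000)]
```

Every sample lands near one of the two modes, never at their average, which is what a unimodal Gaussian head would produce.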

Page 23:

Why might we fail to fit the expert?

1. Output mixture of Gaussians

2. Latent variable models

3. Autoregressive discretization

Look up some of these:
• Conditional variational autoencoder
• Normalizing flow/realNVP
• Stein variational gradient descent

Page 24:

Why might we fail to fit the expert?

1. Output mixture of Gaussians

2. Latent variable models

3. Autoregressive discretization

[Diagram: a (discretized) distribution over dimension 1 only; discrete sampling gives the dim 1 value, which conditions the distribution from which the dim 2 value is sampled, and so on]
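The autoregressive scheme (discretize dimension 1, sample, condition on that value, discretize dimension 2, and so on) can be written as follows; `logits_fn` is a hypothetical stand-in for the per-dimension network head.

```python
import numpy as np

def autoregressive_sample(logits_fn, n_dims, n_bins, rng):
    """Sample a multi-dimensional action one dimension at a time.

    Discretizing the joint action space would need n_bins ** n_dims
    outputs; discretizing one dimension at a time, each conditioned on
    the values already sampled, needs only n_dims * n_bins."""
    values = []
    for d in range(n_dims):
        logits = logits_fn(d, values)        # head sees earlier dims
        p = np.exp(logits - logits.max())
        p /= p.sum()                         # softmax over the bins
        b = rng.choice(n_bins, p=p)          # discrete sampling
        values.append(b / (n_bins - 1))      # bin index -> action value
    return values

# Hypothetical head that strongly prefers bin d for dimension d (a real
# head would depend on the state and on the earlier sampled values).
logits_fn = lambda d, prev: np.where(np.arange(5) == d, 50.0, 0.0)
action = autoregressive_sample(logits_fn, n_dims=2, n_bins=5,
                               rng=np.random.default_rng(0))
```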

Page 25:

Imitation learning: recap

• Often (but not always) insufficient by itself
• Distribution mismatch problem

• Sometimes works well
• Hacks (e.g. left/right images)

• Samples from a stable trajectory distribution

• Add more on-policy data, e.g. using DAgger

• Better models that fit more accurately

training data

supervised learning

Page 26:

Break

Page 27:

Case study 1: trail following as classification

Page 28:
Page 29:

Case study 2: Imitation with LSTMs

Page 30:
Page 31:
Page 32:

Follow-up: adding vision

Page 33:
Page 34:

Other topics in imitation learning

• Structured prediction

• Inverse reinforcement learning
▪ Instead of copying the demonstration, figure out the goal

▪ Will be covered later in this course

[Diagram: structured prediction example: x: “where are you”, y: “I’m at work” (alternatively: “in school”)]

Page 35:

Imitation learning: what’s the problem?

• Humans need to provide data, which is typically finite
• Deep learning works best when data is plentiful

• Humans are not good at providing some kinds of actions

• Humans can learn autonomously; can our machines do the same?
• Unlimited data from own experience

• Continuous self-improvement

Page 36:

1. run away

2. ignore

3. pet

Terminology & notation

Page 37:

Aside: notation

Richard Bellman Lev Pontryagin

управление (Russian: “control”)

Page 38:

A cost function for imitation?

training data

supervised learning

Ross et al. ‘11
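In the standard statement of Ross et al.'s analysis, the cost for imitation is the 0-1 loss against the expert policy $\pi^\star$, evaluated under the distribution of states the learned policy visits:

```latex
c(\mathbf{s}, \mathbf{a}) =
\begin{cases}
  0 & \text{if } \mathbf{a} = \pi^\star(\mathbf{s}) \\
  1 & \text{otherwise,}
\end{cases}
\qquad
\text{goal: minimize } \;
\mathbb{E}_{\mathbf{s}_t \sim p_{\pi_\theta}} \Big[ \textstyle\sum_t c(\mathbf{s}_t, \mathbf{a}_t) \Big]
```

The subtlety is the expectation: it is taken over $p_{\pi_\theta}$, the learner's own state distribution, not the expert's.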

Page 39:

Some analysis: how bad is it?

Page 40:

More general analysis

For more analysis, see Ross et al. “A Reduction of Imitation Learning and Structured Prediction to No-Regret Online Learning”
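In the standard statement of that result: assume the learned policy deviates from the expert with probability at most $\epsilon$ on states from the training distribution. Then for naive behavior cloning,

```latex
\pi_\theta(\mathbf{a} \neq \pi^\star(\mathbf{s}) \mid \mathbf{s}) \le \epsilon
\;\;\Longrightarrow\;\;
\mathbb{E}\Big[\sum_{t=1}^{T} c(\mathbf{s}_t, \mathbf{a}_t)\Big]
\;\le\; \epsilon T + \epsilon (T - 1) + \cdots
\;=\; O(\epsilon T^2).
```

Intuitively, one early mistake (probability $\epsilon$) can push the policy into out-of-distribution states where it keeps erring for every remaining step. Training on the states the learned policy itself visits, as DAgger does, improves this to $O(\epsilon T)$.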

Page 41:

Cost/reward functions in theory and practice

Page 42:

The trouble with cost & reward functions

More on this later…

Page 43:

A note about terminology…

the “R” word

a bit of history…

Richard Sutton, Andrew Barto, Lev Pontryagin, Richard Bellman

reinforcement learning

(the problem statement)

reinforcement learning

(the method): without using the model