Exploration (Part 2) CS 285 Instructor: Sergey Levine UC Berkeley


Page 1:

Exploration (Part 2)

CS 285

Instructor: Sergey Levine, UC Berkeley

Page 2:

Recap: what’s the problem?

[figure: "this is easy (mostly)" vs. "this is impossible"]

Why?

Page 3:

Unsupervised learning of diverse behaviors

What if we want to recover diverse behavior without any reward function at all?

Why?

➢ Learn skills without supervision, then use them to accomplish goals

➢ Learn sub-skills to use with hierarchical reinforcement learning

➢ Explore the space of possible behaviors

Page 4:

An Example Scenario

training time: unsupervised

How can you prepare for an unknown future goal?

Page 5:

In this lecture…

➢ Definitions & concepts from information theory

➢ Learning without a reward function by reaching goals

➢ A state distribution-matching formulation of reinforcement learning

➢ Is coverage of valid states a good exploration objective?

➢ Beyond state covering: covering the space of skills

Page 6:

In this lecture…

➢ Definitions & concepts from information theory

➢ Learning without a reward function by reaching goals

➢ A state distribution-matching formulation of reinforcement learning

➢ Is coverage of valid states a good exploration objective?

➢ Beyond state covering: covering the space of skills

Page 7:

Some useful identities

Page 8:

Some useful identities
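The identities themselves appear as equations in the slide images and are not captured in this transcript; for reference, the standard definitions the rest of the lecture relies on are:

```latex
% Entropy of a distribution p(x): how broad / spread out it is
\mathcal{H}(p(x)) = -\mathbb{E}_{x \sim p(x)}\left[\log p(x)\right]

% KL divergence from q to p
D_{\mathrm{KL}}\!\left(p(x) \,\|\, q(x)\right) = \mathbb{E}_{x \sim p(x)}\!\left[\log \frac{p(x)}{q(x)}\right]

% Mutual information: reduction in uncertainty about x from observing y
\mathcal{I}(x; y) = D_{\mathrm{KL}}\!\left(p(x, y) \,\|\, p(x)\,p(y)\right) = \mathcal{H}(p(x)) - \mathcal{H}(p(x \mid y))
```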

Page 9:

Information theoretic quantities in RL

quantifies coverage

can be viewed as quantifying “control authority” in an information-theoretic way
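A hedged reconstruction of the two annotations on this slide: "coverage" refers to the entropy of the state marginal induced by the policy, and "control authority" to the mutual information between actions and resulting states, a quantity often called empowerment.

```latex
% Coverage: entropy of the state marginal p_\pi(s) visited by policy \pi
\mathcal{H}(p_\pi(s)) = -\mathbb{E}_{s \sim p_\pi(s)}\left[\log p_\pi(s)\right]

% "Control authority" (empowerment): how much the chosen action determines the next state
\mathcal{I}(s_{t+1}; a_t) = \mathcal{H}(s_{t+1}) - \mathcal{H}(s_{t+1} \mid a_t)
```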

Page 10:

In this lecture…

➢ Definitions & concepts from information theory

➢ Learning without a reward function by reaching goals

➢ A state distribution-matching formulation of reinforcement learning

➢ Is coverage of valid states a good exploration objective?

➢ Beyond state covering: covering the space of skills

Page 11:

An Example Scenario

training time: unsupervised

How can you prepare for an unknown future goal?

Page 12:

Learn without any rewards at all

(but there are many other choices)

Nair*, Pong*, Bahl, Dalal, Lin, Levine. Visual Reinforcement Learning with Imagined Goals. 2018.
Dalal*, Pong*, Lin*, Nair, Bahl, Levine. Skew-Fit: State-Covering Self-Supervised Reinforcement Learning. 2019.

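One concrete choice of self-supervised reward for this setting, sketched below: negative distance to the imagined goal in a learned latent space, as in the RIG-style setup these slides describe. This is a minimal illustration, not the exact implementation; the latent encodings are assumed to come from some state encoder (e.g. a VAE) trained on observed states.

```python
import numpy as np

def goal_reaching_reward(z_state: np.ndarray, z_goal: np.ndarray) -> float:
    """Reward-free 'reward': negative distance between the current state and the
    imagined goal, both encoded into a learned latent space. This is only one
    option ("but there are many other choices", per the slide)."""
    return -float(np.linalg.norm(z_state - z_goal))

# Example with made-up 4-dimensional latent codes:
z_state = np.array([0.1, -0.3, 0.0, 0.8])
z_goal = np.array([0.0, -0.2, 0.1, 1.0])
print(goal_reaching_reward(z_state, z_goal))
```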

Page 13:

Learn without any rewards at all

Nair*, Pong*, Bahl, Dalal, Lin, Levine. Visual Reinforcement Learning with Imagined Goals. 2018.
Dalal*, Pong*, Lin*, Nair, Bahl, Levine. Skew-Fit: State-Covering Self-Supervised Reinforcement Learning. 2019.


Page 14:

Learn without any rewards at all

Nair*, Pong*, Bahl, Dalal, Lin, Levine. Visual Reinforcement Learning with Imagined Goals. 2018.
Dalal*, Pong*, Lin*, Nair, Bahl, Levine. Skew-Fit: State-Covering Self-Supervised Reinforcement Learning. 2019.


Page 15:

How do we get diverse goals?

Nair*, Pong*, Bahl, Dalal, Lin, Levine. Visual Reinforcement Learning with Imagined Goals. 2018.
Dalal*, Pong*, Lin*, Nair, Bahl, Levine. Skew-Fit: State-Covering Self-Supervised Reinforcement Learning. 2019.


Page 16:

How do we get diverse goals?

Nair*, Pong*, Bahl, Dalal, Lin, Levine. Visual Reinforcement Learning with Imagined Goals. 2018.
Dalal*, Pong*, Lin*, Nair, Bahl, Levine. Skew-Fit: State-Covering Self-Supervised Reinforcement Learning. 2019.


Page 17:

How do we get diverse goals?

Nair*, Pong*, Bahl, Dalal, Lin, Levine. Visual Reinforcement Learning with Imagined Goals. 2018.
Dalal*, Pong*, Lin*, Nair, Bahl, Levine. Skew-Fit: State-Covering Self-Supervised Reinforcement Learning. 2019.


goals get higher entropy due to Skew-Fit

[figure: proposed goal vs. final state reached]
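The core of Skew-Fit, sketched below under the paper's general recipe: reweight previously visited states by their estimated density raised to a negative power, so rare states are proposed as goals more often and the entropy of the goal distribution grows. The density estimates here are placeholders for whatever generative model is being fit.

```python
import numpy as np

def skew_fit_weights(state_densities: np.ndarray, alpha: float = -1.0) -> np.ndarray:
    """Importance weights w(s) proportional to p_theta(s)^alpha with alpha < 0:
    low-density (rare) states get large weight, so sampling goals according to
    these weights skews the goal distribution toward higher entropy."""
    assert alpha < 0, "alpha must be negative to up-weight rare states"
    weights = state_densities ** alpha
    return weights / weights.sum()

# Example: three frequently visited states and one rare one.
densities = np.array([0.4, 0.3, 0.25, 0.05])
print(skew_fit_weights(densities))  # the rare state dominates the goal proposals
```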

Page 18:

How do we get diverse goals?

Nair*, Pong*, Bahl, Dalal, Lin, Levine. Visual Reinforcement Learning with Imagined Goals. 2018.
Dalal*, Pong*, Lin*, Nair, Bahl, Levine. Skew-Fit: State-Covering Self-Supervised Reinforcement Learning. 2019.


Page 19:

Reinforcement learning with imagined goals

Nair*, Pong*, Bahl, Dalal, Lin, Levine. Visual Reinforcement Learning with Imagined Goals. 2018.
Dalal*, Pong*, Lin*, Nair, Bahl, Levine. Skew-Fit: State-Covering Self-Supervised Reinforcement Learning. 2019.

[figure: imagined goal → RL episode]

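A hedged sketch of the loop this slide pictures, using a self-supervised reward like the latent-distance one sketched earlier. The components (vae, policy, env, replay_buffer, update_policy) and their interfaces are hypothetical stand-ins, not the authors' actual implementation.

```python
def unsupervised_goal_reaching_iteration(vae, policy, env, replay_buffer, update_policy):
    """One iteration of the 'imagined goal -> RL episode' loop (sketch)."""
    z_goal = vae.sample_prior()                  # imagine a goal in latent space
    obs, done, states = env.reset(), False, []
    while not done:
        action = policy.act(obs, z_goal)         # goal-conditioned policy
        obs, done = env.step(action)             # hypothetical env interface
        states.append(obs)
    # Self-supervised reward: closeness of the final state's encoding to the goal.
    reward = -float(((vae.encode(states[-1]) - z_goal) ** 2).sum())
    replay_buffer.add(states, z_goal, reward)
    update_policy(policy, replay_buffer)         # RL step on self-generated data
    vae.fit(replay_buffer.all_states())          # refit the goal proposal model
```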

Page 20:

In this lecture…

➢ Definitions & concepts from information theory

➢ Learning without a reward function by reaching goals

➢ A state distribution-matching formulation of reinforcement learning

➢ Is coverage of valid states a good exploration objective?

➢ Beyond state covering: covering the space of skills

Page 21:

Aside: exploration with intrinsic motivation
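For context, a hedged reminder of the intrinsic-motivation recipe from Part 1 that this aside builds on: add a novelty bonus to the task reward, for example one based on the estimated density of the visited state.

```latex
% Example density-based exploration bonus (one of several forms covered in Part 1)
\tilde{r}(s) = r(s) - \log p_\theta(s)
```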

Page 22:

Can we use this for state marginal matching?

Lee*, Eysenbach*, Parisotto*, Xing, Levine, Salakhutdinov. Efficient Exploration via State Marginal Matching.
See also: Hazan, Kakade, Singh, Van Soest. Provably Efficient Maximum Entropy Exploration.

Page 23:

State marginal matching for exploration

[figure: state coverage of MaxEnt on actions vs. variants of SMM; the SMM variants achieve much better coverage]

Lee*, Eysenbach*, Parisotto*, Xing, Levine, Salakhutdinov. Efficient Exploration via State Marginal Matching.
See also: Hazan, Kakade, Singh, Van Soest. Provably Efficient Maximum Entropy Exploration.
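A hedged sketch of the state marginal matching idea from Lee et al.: choose the policy so that its state marginal $p_\pi(s)$ matches a target distribution $p^\star(s)$, which amounts to an RL problem with the reward below (iterated, since $p_\pi$ changes as the policy changes; the paper averages over the history of policies).

```latex
% State marginal matching objective and its induced reward
\min_\pi \; D_{\mathrm{KL}}\!\left(p_\pi(s) \,\|\, p^\star(s)\right)
\;\;\Longleftrightarrow\;\;
\max_\pi \; \mathbb{E}_{s \sim p_\pi(s)}\!\left[\log p^\star(s) - \log p_\pi(s)\right],
\qquad
\tilde{r}(s) = \log p^\star(s) - \log p_\pi(s)
```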

Page 24:

In this lecture…

➢ Definitions & concepts from information theory

➢ Learning without a reward function by reaching goals

➢ A state distribution-matching formulation of reinforcement learning

➢ Is coverage of valid states a good exploration objective?

➢ Beyond state covering: covering the space of skills

Page 25:

Is state entropy really a good objective?


more or less the same thing

Gupta, Eysenbach, Finn, Levine. Unsupervised Meta-Learning for Reinforcement Learning

See also: Hazan, Kakade, Singh, Van Soest. Provably Efficient Maximum Entropy Exploration
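A hedged sketch of why state entropy is argued to be a good objective here: with a uniform target $p^\star(s)$, the state marginal matching objective above reduces (up to a constant) to maximizing state entropy, and the maximum-entropy state marginal is the best preparation for a goal chosen adversarially at test time.

```latex
% Uniform target: matching the marginal reduces to maximizing state entropy
D_{\mathrm{KL}}\!\left(p_\pi(s) \,\|\, \mathrm{Unif}(s)\right) = -\mathcal{H}(p_\pi(s)) + \text{const}

% Adversarial-goal view: the worst-case goal coverage
% \max_{p_\pi} \min_{g} \log p_\pi(g) is attained by the maximum-entropy (uniform) p_\pi(s)
```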

Page 26:

In this lecture…

➢ Definitions & concepts from information theory

➢ Learning without a reward function by reaching goals

➢ A state distribution-matching formulation of reinforcement learning

➢ Is coverage of valid states a good exploration objective?

➢ Beyond state covering: covering the space of skills

Page 27:

Learning diverse skills

task index z (the latent variable the policy is conditioned on)

Intuition: different skills should visit different state-space regions

Eysenbach, Gupta, Ibarz, Levine. Diversity is All You Need.

Reaching diverse goals is not the same as performing diverse tasks

not all behaviors can be captured by goal-reaching

Page 28:

Diversity-promoting reward function

[diagram: a skill z is sampled and given to the policy (agent); the policy takes actions in the environment; the discriminator D observes the resulting states and tries to predict the skill z]

Eysenbach, Gupta, Ibarz, Levine. Diversity is All You Need.
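A minimal sketch of the diversity-promoting reward in DIAYN: the agent is rewarded when the discriminator can recover which skill z was being executed from the visited state, relative to the (uniform) skill prior. The discriminator here is just a stand-in producing per-skill logits.

```python
import numpy as np

def diayn_reward(discriminator_logits: np.ndarray, z: int) -> float:
    """Pseudo-reward r(s, z) = log q_phi(z | s) - log p(z), with p(z) uniform over
    the discrete skills. High reward means the state makes the skill easy to
    identify, which pushes different skills into different state-space regions."""
    num_skills = discriminator_logits.shape[0]
    log_q = discriminator_logits[z] - np.log(np.exp(discriminator_logits).sum())  # log-softmax
    log_p_z = -np.log(num_skills)                                                  # uniform prior
    return float(log_q - log_p_z)

# Example: a state from which skill 2 (out of 4 skills) is clearly identifiable.
logits = np.array([0.1, -0.5, 3.0, 0.2])
print(diayn_reward(logits, z=2))  # positive: better than the uniform prior
```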

Page 29:

Eysenbach, Gupta, Ibarz, Levine. Diversity is All You Need.

Examples of learned tasks

[videos: Cheetah, Ant, Mountain car]

Page 30:

A connection to mutual information

Eysenbach, Gupta, Ibarz, Levine. Diversity is All You Need.

See also: Gregor et al. Variational Intrinsic Control. 2016
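For the connection the slide title refers to, a short standard derivation consistent with the DIAYN paper: maximizing the discriminator-based reward maximizes a variational lower bound on the mutual information between skills and states.

```latex
% Mutual information between skill z and state s, and its variational lower bound
\mathcal{I}(z; s) = \mathcal{H}(z) - \mathcal{H}(z \mid s)
  \;\ge\; \mathcal{H}(z) + \mathbb{E}_{z \sim p(z),\; s \sim \pi_z}\!\left[\log q_\phi(z \mid s)\right]

% The inequality holds for any discriminator q_\phi, since
% \mathbb{E}[\log p(z \mid s)] - \mathbb{E}[\log q_\phi(z \mid s)]
%   = \mathbb{E}_s\!\left[D_{\mathrm{KL}}\!\left(p(\cdot \mid s) \,\|\, q_\phi(\cdot \mid s)\right)\right] \ge 0.
% With p(z) fixed (e.g. uniform), \mathcal{H}(z) is constant, so maximizing the bound
% means maximizing \mathbb{E}[\log q_\phi(z \mid s)], i.e. the diversity-promoting reward.
```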