
Page 1:

Inverse Reinforcement Learning
Chelsea Finn
3/5/2017

Page 2:

Course Reminders:
• March 22nd: Project group & title due
• April 17th: Milestone report due & milestone presentations
• April 26th: Beginning of project presentations

Page 3:

Inverse RL: Outline
1. Motivation & Definition
2. Early Approaches
3. Maximum Entropy Inverse RL
4. Scaling inverse RL to deep cost functions

Page 4:

Inverse RL: Outline
1. Motivation & Definition
2. Early Approaches
3. Maximum Entropy Inverse RL
4. Scaling inverse RL to deep cost functions

Page 5:

Mnih et al. ’15

video from Montessori New Zealand

what is the reward?
[diagram: reinforcement learning agent receiving a reward]

In the real world, humans don’t get a score.


Page 6:

reward function is essential for RL

real-world domains: reward/cost often difficult to specify

• robotic manipulation
• autonomous driving
• dialog systems
• virtual assistants
• and more…

Kohl & Stone, ’04; Mnih et al. ’15; Silver et al. ’16; Tesauro ’95


Page 7:

Motivation
Behavioral Cloning: mimic actions of expert
- but no reasoning about outcomes or dynamics
Can we reason about what the expert is trying to achieve?
- the expert might have different degrees of freedom
Inverse Optimal Control / Inverse Reinforcement Learning (IOC/IRL): infer cost/reward function from expert demonstrations (Kalman ’64, Ng & Russell ’00)

Page 8:


Inverse Optimal Control / Inverse Reinforcement Learning: infer cost/reward function from demonstrations

Challenges:
- underdefined problem
- difficult to evaluate a learned cost
- demonstrations may not be precisely optimal

given:
- state & action space
- roll-outs from π*
- dynamics model [sometimes]

goal:
- recover reward function
- then use reward to get policy

Compare to DAgger: no direct access to π*
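
For concreteness, the problem can be stated as follows (a hedged sketch in standard IRL notation, not taken verbatim from the slide): given demonstrations D = {τ_1, …, τ_N} produced by the expert policy π*, find cost parameters θ such that the expert is (near-)optimal under c_θ,

\[
\pi^* \approx \arg\min_{\pi} \; \mathbb{E}_{\tau \sim \pi}\Big[\sum_t c_\theta(x_t, u_t)\Big],
\]

and then obtain a policy by running forward RL / optimal control on the recovered cost.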

Page 9:


Early IRL Approaches

All: alternate between solving MDP w.r.t. cost and updating cost

Ng & Russell ‘00: expert actions should have higher value than other actions, larger gap is better

Abbeel & Ng ’04: the policy that is optimal w.r.t. the learned cost should match the feature counts of the expert trajectories

Ratliff et al. ’06: max margin formulation between value of expert actions and other actions

How to handle ambiguity? What if expert is not perfect?
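
To make the feature-count idea concrete, here is a hedged sketch (notation mine, not from the slide): with a linear cost c_θ(x, u) = θᵀ f(x, u), two policies with equal expected feature counts have equal expected cost for every θ, so Abbeel & Ng look for a policy whose feature expectations match the expert's:

\[
\mathbb{E}_{\tau \sim \pi}\Big[\sum_t f(x_t, u_t)\Big] \;\approx\; \mathbb{E}_{\tau \sim \pi^*}\Big[\sum_t f(x_t, u_t)\Big].
\]

Ratliff et al. instead pose a structured max-margin problem: find θ under which the expert's trajectories cost less than alternative trajectories by a margin that grows with how much the alternative deviates from the demonstration.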

Page 10:

Inverse RL: Outline
1. Motivation & Examples
2. Early Approaches
3. Maximum Entropy Inverse RL
4. Scaling inverse RL to deep cost functions

Page 11:

Maximum Entropy Inverse RL (Ziebart et al. ’08)

Whiteboard

Notation:

c_θ: cost with parameters θ [linear case: c_θ(x_t, u_t) = θᵀ f(x_t, u_t)]

D = {τ_i}: dataset of demonstrations

p(x_{t+1} | x_t, u_t): transition dynamics
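
The whiteboard derivation is not reproduced in the slides; as a reference, a hedged sketch of the MaxEnt IRL formulation in this notation (stated for deterministic dynamics; with stochastic dynamics the trajectory density also includes the dynamics terms): trajectories are modeled as

\[
p_\theta(\tau) = \frac{1}{Z}\exp\big(-c_\theta(\tau)\big), \qquad
c_\theta(\tau) = \sum_t c_\theta(x_t, u_t), \qquad
Z = \int \exp\big(-c_\theta(\tau)\big)\, d\tau,
\]

and θ is fit by minimizing the negative log-likelihood of the demonstrations,

\[
\mathcal{L}(\theta) = \frac{1}{N}\sum_{\tau_i \in \mathcal{D}} c_\theta(\tau_i) + \log Z, \qquad
\nabla_\theta \mathcal{L} = \mathbb{E}_{\tau \sim \mathcal{D}}\big[\nabla_\theta c_\theta(\tau)\big]
  - \mathbb{E}_{\tau \sim p_\theta}\big[\nabla_\theta c_\theta(\tau)\big].
\]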

Page 12:

Maximum Entropy Inverse RL (Ziebart et al. ’08)

Page 13:

Inverse RL: Outline
1. Motivation & Examples
2. Early Approaches
3. Maximum Entropy IRL
4. Scaling IRL to deep cost functions

Page 14:

Case Study: MaxEnt Deep IRL
MaxEnt IRL with known dynamics (tabular setting), neural net cost

NIPS Deep RL workshop 2015

IROS 2016

Page 15:

Case Study: MaxEnt Deep IRL
MaxEnt IRL with known dynamics (tabular setting), neural net cost

Need to iteratively solve the MDP for every weight update
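
For reference, a minimal sketch of that inner loop in the tabular setting, assuming a small discrete MDP with known dynamics P[s, a, s'] and a per-state cost vector (function and variable names are illustrative, not from the papers above):

    import numpy as np
    from scipy.special import logsumexp

    def soft_value_iteration(P, cost, T):
        """Backward pass: soft Bellman backups under the current cost give a stochastic policy."""
        S, A, _ = P.shape
        V = np.zeros(S)
        for _ in range(T):
            Q = -cost[:, None] + P @ V   # Q(s, a) = -c(s) + E_{s'}[V(s')]
            V = logsumexp(Q, axis=1)     # soft maximum over actions
        return np.exp(Q - V[:, None])    # pi(a|s) proportional to exp(Q(s, a))

    def expected_svf(P, policy, p0, T):
        """Forward pass: propagate the start-state distribution to get expected state visitation counts."""
        mu, svf = p0.copy(), p0.copy()
        for _ in range(T - 1):
            mu = np.einsum('s,sa,sap->p', mu, policy, P)
            svf += mu
        return svf

    # Each cost/weight update re-solves the MDP and forms the MaxEnt gradient:
    #   policy = soft_value_iteration(P, cost, T)
    #   svf = expected_svf(P, policy, p0, T)
    #   d(neg. log-likelihood)/d cost(s) = expert_svf(s) - svf(s)
    # With a neural-net cost (MaxEnt deep IRL), this per-state gradient is then
    # backpropagated through the network; the sign flips if a reward is learned instead.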

Page 16:

Case Study: MaxEnt Deep IRL
MaxEnt IRL with known dynamics (tabular setting), neural net cost


[figure: demonstrations and terrain feature maps (mean height, height variance, cell visibility); 120 km of demonstration data]
[table: test-set trajectory prediction (MHD: modified Hausdorff distance) for the learned cost vs. the manually designed cost]

Page 17:

Case Study: MaxEnt Deep IRL
MaxEnt IRL with known dynamics (tabular setting), neural net cost


Strengths
- scales to neural net costs
- efficient enough for real robots

Limitations
- still need to repeatedly solve the MDP
- assumes known dynamics

Page 18:

What about unknown dynamics?

Whiteboard

Page 19:

Case Study: Guided Cost Learning (ICML 2016)

Goals:
- remove the need to solve the MDP in the inner loop
- be able to handle unknown dynamics
- handle continuous state & action spaces

Page 20:

guided cost learning algorithm
[diagram: loop between policy π (initialized as π0) and a neural-net cost c_θ(x):
- generate policy samples from π
- update cost using samples & demos
- update π w.r.t. cost]
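
Read as pseudocode, the loop in the diagram looks roughly as follows; this is a sketch only, with the sub-procedures passed in as callables (the names are placeholders, not the paper's API), since their concrete forms are what the following slides discuss:

    def guided_cost_learning(demos, policy, cost,
                             sample_trajectories, update_cost, policy_opt_step,
                             n_iters=100):
        """Alternate the three boxes in the diagram, starting from an initial policy pi_0."""
        for _ in range(n_iters):
            samples = sample_trajectories(policy)            # generate policy samples from pi
            cost = update_cost(cost, demos, samples)         # update cost using samples & demos
            policy = policy_opt_step(policy, cost, samples)  # update pi w.r.t. cost
        return cost, policy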

Page 21:

guided cost learning algorithm
[same diagram, annotated with the GAN view: policy π as generator, neural-net cost c_θ(x) as discriminator:
- generate policy samples from π
- update cost using samples & demos (update cost in inner loop of policy optimization)
- update π w.r.t. cost (partially optimize)]

Page 22:

guided cost learning algorithm
[same diagram: policy π as generator, neural-net cost c_θ(x) as discriminator:
- generate policy samples from π
- update cost using samples & demos
- update π w.r.t. cost (partially optimize)]
Ho et al., ICML ’16, NIPS ’16
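
The generator/discriminator labels come from the GAN view of this procedure (developed in the work cited above and in the related GAN/IRL connection); a hedged sketch: the policy π plays the generator, and the cost induces a discriminator of the special form

\[
D_\theta(\tau) = \frac{\tfrac{1}{Z}\exp(-c_\theta(\tau))}{\tfrac{1}{Z}\exp(-c_\theta(\tau)) + q(\tau)},
\]

trained with the usual cross-entropy loss to tell demonstrations from policy samples, where q is the density of the sample-generating policy. Under this parameterization, the discriminator update corresponds to the MaxEnt IRL cost update, and training the generator against it corresponds to the policy update.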

Page 23:

What about unknown dynamics?
Adaptive importance sampling
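
The whiteboard detail is not in the slides; a hedged sketch of the idea: the intractable partition function in the MaxEnt objective is estimated with samples τ_j drawn from a sampling distribution q (in practice, the current policy), giving

\[
\mathcal{L}(\theta) \approx \frac{1}{N}\sum_{\tau_i \in \mathcal{D}} c_\theta(\tau_i)
  + \log \frac{1}{M}\sum_{j=1}^{M} \frac{\exp\big(-c_\theta(\tau_j)\big)}{q(\tau_j)},
\]

since Z = E_{τ∼q}[exp(−c_θ(τ))/q(τ)]. The estimator's variance is smallest when q is close to the distribution proportional to exp(−c_θ(τ)), which is why the sampling policy is adapted toward the current cost rather than kept fixed.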

Page 24:

GCL Experiments

Real-world Tasks
- dish placement: state includes goal plate pose
- pouring almonds: state includes unsupervised visual features [Finn et al. ’16]
action: joint torques

Page 25:

Comparisons
- Relative Entropy IRL (Boularias et al. ’11)
- Path Integral IRL (Kalakrishnan et al. ’13)
[diagram: generate policy samples from a sampling distribution q; update the neural-net cost c_θ(x) using samples & demos]

Page 26:

Dish placement, demos


Page 27:

Dish placement, standard cost


Page 28:

Dish placement, RelEnt IRL

[video: dish placement, baseline method]

Page 29:

Dish placement, GCL policy

[video: dish placement, our method (samples & reoptimizing)]

Page 30:

Pouring, demos

[video: pouring demos]

Page 31:

Pouring, RelEnt IRL

[video: pouring, baseline method]

Page 32:

Pouring, GCL policy

[video: pouring, our method (samples)]

Page 33:

Conclusion: We can recover successful policies for new positions.

Is the cost function also useful for new scenarios?


Page 34:

Dish placement - GCL reopt.

[video: dish placement, our method (samples & reoptimizing)]

Page 35:

Pouring - GCL reopt.

[video: pouring, our method (reoptimization)]

Note: normally the GAN discriminator is discarded

Page 36:


Case Study: Generative Adversarial Imitation Learning

NIPS 2016

Page 37:


Case Study: Generative Adversarial Imitation Learning

- demonstrations from TRPO-optimized policy
- use TRPO as a policy optimizer
- OpenAI gym tasks
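
For reference, a hedged sketch of the objective from the GAIL paper: a discriminator D_w(s, a) is trained to separate policy samples from expert samples, and the policy is updated with TRPO using the discriminator output as a surrogate cost, together with a causal-entropy bonus H(π):

\[
\min_{\pi}\max_{w}\;
\mathbb{E}_{\pi}\big[\log D_w(s, a)\big]
+ \mathbb{E}_{\pi_E}\big[\log\big(1 - D_w(s, a)\big)\big]
- \lambda H(\pi).
\]

(The sign convention depends on which class the discriminator is trained to label as expert.)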

Page 38:

Guided Cost Learning & Generative Adversarial Imitation Learning

Strengths
- can handle unknown dynamics
- scales to neural net costs
- efficient enough for real robots

Limitations
- adversarial optimization is hard
- can’t scale to raw pixel observations of demos
- demonstrations typically collected with kinesthetic teaching or teleoperation (first person)

Page 39:

Next Time: Back to forward RL (advanced policy gradients)



Page 41:

IOC is under-defined
need regularization:
• encourage slowly changing cost
• cost of demos decreases strictly monotonically in time
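
A hedged sketch of regularizers in this spirit (following the guided cost learning paper; the exact forms and constants may differ): for a demonstration with per-timestep costs c_t = c_θ(x_t, u_t),

\[
g_{\mathrm{lcr}}(\tau) = \sum_t \big[(c_{t+1} - c_t) - (c_t - c_{t-1})\big]^2
\qquad \text{(slowly changing cost)},
\]
\[
g_{\mathrm{mono}}(\tau) = \sum_t \Big[\max\big(0,\; c_t - c_{t-1} - 1\big)\Big]^2
\qquad \text{(demo cost decreases over time)},
\]

which correspond to the "lcr" and "mono" regularizers in the ablation on the next slide.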

Page 42:

Regularization ablation
[plots: distance vs. number of samples for Reaching and Peg Insertion; curves: ours (demo init), ours (rand. init), ours without lcr reg (demo init, rand. init), ours without mono reg (demo init, rand. init)]