
Page 1:

INVERSE REWARD DESIGNDylan Hadfield-Menell, Smith Milli, Pieter Abbeel, Stuart Russell, Anca Dragan

University of California, Berkeley

Slides by Anthony Chen

Page 2:

Inverse Reinforcement Learning (Review)

Inverse Reward Design (Intuition)

Inverse Reward Design (Formalism)

2

Page 3:

INVERSE REINFORCEMENT LEARNING

• Given a Markov Decision Process (MDP) with a reward function, there are a number of ways to obtain a near-optimal or optimal policy.

• However, reward functions can be hard to specify.

• Instead, we can learn a policy by learning from the trajectories (data) generated by an expert.

• This is called inverse reinforcement learning (IRL).

3

Page 4:

DRIVING
• Many things go into a reward: safety, comfort, etc.
• Humans learn by watching experts drive.

• In machines, we can learn from data generated by an expert.

• We don’t want to exactly copy the expert, since exact imitation doesn’t generalize.

• Instead, learn a policy that implicitly maximizes reward from expert data.

4

Page 5:

NOTATION
• An MDP is generally defined as a tuple $M = \langle S, A, T, \gamma, r \rangle$:
  • $S$: set of states
  • $A$: set of actions
  • $T$: transition probabilities
  • $\gamma$: discount factor
  • $r$: reward function, bounded by 1
• An MDP without a reward function is called a world model, denoted by $\widetilde{M} = \langle S, A, T, \gamma \rangle$.
• Assume a feature vector for states: $\phi(s) \in \mathbb{R}^k$, $\phi : S \to [0, 1]^k$.
• Assume an optimal reward function that is linear in these features: $r^* = w^{*\top} \phi(s)$, with $w^* \in \mathbb{R}^k$ and $\|w^*\|_1 \le 1$.
(A minimal code sketch of this setup follows below.)

5
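The following is not from the slides: a minimal Python sketch of how this notation might be represented, assuming a tabular world model and the linear reward parameterization above. The class and function names are hypothetical.

```python
import numpy as np
from dataclasses import dataclass

@dataclass
class WorldModel:
    """World model M~ = <S, A, T, gamma>: an MDP with the reward function removed."""
    n_states: int
    n_actions: int
    T: np.ndarray      # transition probabilities, shape (S, A, S'), rows sum to 1
    gamma: float       # discount factor in [0, 1)
    phi: np.ndarray    # state features phi(s), shape (S, k), entries in [0, 1]

def linear_reward(world: WorldModel, w: np.ndarray) -> np.ndarray:
    """Reward r(s) = w . phi(s) for every state; w has shape (k,)."""
    return world.phi @ w
```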

Page 6:

• A policy, $\pi$, is a mapping from states to a probability distribution over actions.
• The value of a policy can be written as follows:

$$E[V^{\pi}(s_0)] = E\left[\sum_{t=0}^{\infty} \gamma^t R(s_t) \,\Big|\, \pi\right] = E\left[\sum_{t=0}^{\infty} \gamma^t \, w \cdot \phi(s_t) \,\Big|\, \pi\right] = w \cdot E\left[\sum_{t=0}^{\infty} \gamma^t \phi(s_t) \,\Big|\, \pi\right]$$

• We can then define the feature expectations to be

$$\mu(\pi) = E\left[\sum_{t=0}^{\infty} \gamma^t \phi(s_t) \,\Big|\, \pi\right]$$

so that the policy value is

$$E[V^{\pi}(s_0)] = w \cdot \mu(\pi)$$

(A code sketch of these two quantities follows below.)

6
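Not from the slides: a sketch of how $\mu(\pi)$ and $w \cdot \mu(\pi)$ could be computed exactly for a tabular policy by solving for the discounted state-visitation frequencies. It reuses the hypothetical `WorldModel` above; `policy[s, a]` is $\pi(a \mid s)$ and `p0` is the initial-state distribution.

```python
def feature_expectations(world: WorldModel, policy: np.ndarray,
                         p0: np.ndarray) -> np.ndarray:
    """mu(pi) = E[sum_t gamma^t phi(s_t) | pi], shape (k,)."""
    # State-to-state transition matrix under the policy:
    # P_pi[s, s'] = sum_a pi(a|s) * T[s, a, s'].
    P_pi = np.einsum("sa,sap->sp", policy, world.T)
    # Discounted visitation d = sum_t gamma^t (P_pi^T)^t p0 solves
    # (I - gamma * P_pi^T) d = p0.
    d = np.linalg.solve(np.eye(world.n_states) - world.gamma * P_pi.T, p0)
    return world.phi.T @ d

def policy_value(world: WorldModel, policy: np.ndarray,
                 p0: np.ndarray, w: np.ndarray) -> float:
    """E[V^pi(s0)] = w . mu(pi)."""
    return float(w @ feature_expectations(world, policy, p0))
```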

Page 7:

IRL FORMALISM
• Given a world model, $\widetilde{M}$, a predefined feature mapping, $\phi$, and a set of trajectories $\{s_0^{(i)}, s_1^{(i)}, \ldots\}_{i=1}^{m}$ generated from an expert policy, $\pi_E$.
• Think of this expert policy as achieving the optimal value.
• Let $\mu_E$ be the expert feature expectations and $\hat{\mu}_E$ be the estimated expert feature expectations:

$$\hat{\mu}_E = \frac{1}{m} \sum_{i=1}^{m} \sum_{t=0}^{\infty} \gamma^t \, \phi\big(s_t^{(i)}\big)$$

• We can find a policy that implicitly maximizes the expert’s unknown reward function via the following optimization (see the sketch below):

$$\hat{\pi} = \arg\min_{\pi} \|\mu(\pi) - \hat{\mu}_E\|_2$$

7
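Not from the slides: a sketch of the two quantities on this page, the empirical expert feature expectations $\hat{\mu}_E$ computed from sampled trajectories and the matching objective $\|\mu(\pi) - \hat{\mu}_E\|_2$ that an IRL algorithm would minimize over policies. It assumes the hypothetical helpers sketched earlier.

```python
def estimate_expert_mu(trajectories, phi: np.ndarray, gamma: float) -> np.ndarray:
    """mu_hat_E = (1/m) * sum_i sum_t gamma^t phi(s_t^(i)).
    Each trajectory is a list of state indices; phi has shape (S, k)."""
    mu_hat = np.zeros(phi.shape[1])
    for traj in trajectories:
        for t, s in enumerate(traj):
            mu_hat += gamma**t * phi[s]
    return mu_hat / len(trajectories)

def feature_matching_objective(world: WorldModel, policy: np.ndarray,
                               p0: np.ndarray, mu_hat_E: np.ndarray) -> float:
    """||mu(pi) - mu_hat_E||_2, to be minimized over policies."""
    return float(np.linalg.norm(feature_expectations(world, policy, p0) - mu_hat_E))
```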

Page 8:

Inverse Reinforcement Learning (Review)

Inverse Reward Design (Intuition)

Inverse Reward Design (Formalism)

8

Page 9:

INTUITION
• When a human designs a reward function (known as a proxy reward function), they try to capture as much about the world as possible.
• However, we can’t possibly capture all scenarios, which leads to negative side effects when the agent encounters new states.
• Machines should therefore treat our reward function as a guess at what the true reward function is and assign uncertainty estimates to the rewards it generates.

9

Page 10:

Image taken from paper

10

Page 11:

KEY IDEAS
• Proxy reward functions are likely to be accurate in the context in which they were defined. Assign high confidence to rewards generated in this case.
• Proxy reward functions can be very wrong in contexts they were not designed for. Assign high uncertainty to rewards generated in this case.
• Our agent should try to avoid high-uncertainty scenarios.
• Formally, inverse reward design tries to infer the true reward function given a proxy reward function.

11

Page 12:

INVERSE RL VS. IRD
• IRL and IRD both tackle the problem of value alignment: the problem of communicating a good reward to an agent.
• Inverse Reinforcement Learning: it is hard to design a reward function, so we are instead given trajectories from an expert.
• Inverse Reward Design: it is more tractable to design a reward function; the reward function is designed by an “expert” but is not necessarily complete.

12

Page 13:

Inverse Reinforcement Learning (Review)

Inverse Reward Design (Intuition)

Inverse Reward Design (Formalism)

13

Page 14:

BAYESIAN MODELING OF IRD
• In the IRD formalism, we are given a world model, $\widetilde{M}$, and a proxy reward function, $\tilde{r} = \tilde{w} \cdot \phi(\xi)$, where $\xi$ represents a trajectory.
• Our goal is to recover the optimal reward function, $r^* = w^* \cdot \phi(\xi)$, by recovering the optimal set of weights $w^*$.
• Let $\pi(\xi \mid \tilde{r}, \widetilde{M})$ be the probability of a trajectory under the world model and the proxy reward function.
• We can represent a posterior distribution over the optimal reward function via Bayes’ rule (see the sketch below):

$$P(w^* \mid \tilde{w}, \widetilde{M}) \propto P(\tilde{w} \mid w^*, \widetilde{M}) \cdot P(w^*)$$

14
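Not from the slides: one concrete way to realize $\pi(\xi \mid \tilde{r}, \widetilde{M})$ in a tabular world model is to plan with the proxy reward and use a Boltzmann-rational policy, then summarize the induced trajectory distribution by its feature expectations $\tilde{\mu}$. The soft value iteration below is an assumption for illustration (the paper's planner may differ) and reuses the hypothetical helpers above.

```python
def soft_value_iteration(world: WorldModel, w: np.ndarray,
                         beta: float = 1.0, n_iters: int = 200) -> np.ndarray:
    """Boltzmann-rational policy pi(a|s) ∝ exp(beta * Q(s, a)) for r(s) = w . phi(s)."""
    r = world.phi @ w                                    # per-state reward, shape (S,)
    V = np.zeros(world.n_states)
    for _ in range(n_iters):
        Q = r[:, None] + world.gamma * (world.T @ V)     # shape (S, A)
        V = np.log(np.exp(beta * Q).sum(axis=1)) / beta  # soft (log-sum-exp) max over actions
    Q = r[:, None] + world.gamma * (world.T @ V)
    policy = np.exp(beta * (Q - Q.max(axis=1, keepdims=True)))
    return policy / policy.sum(axis=1, keepdims=True)

def proxy_feature_expectations(world: WorldModel, w_tilde: np.ndarray,
                               p0: np.ndarray, beta: float = 1.0) -> np.ndarray:
    """mu~: feature expectations of trajectories drawn from pi(xi | w~, M~)."""
    return feature_expectations(world, soft_value_iteration(world, w_tilde, beta), p0)
```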

Page 15:

• Remember, we assume that our proxy reward function yields near-optimal behavior in the situations for which it was designed.
• We can model the likelihood as follows:

$$P(\tilde{w} \mid w^*, \widetilde{M}) \propto \exp\!\Big(\beta \cdot E\big[w^{*\top} \phi(\xi) \,\big|\, \xi \sim \pi(\xi \mid \tilde{w}, \widetilde{M})\big]\Big)$$

• This says that a proxy reward is more likely to have been chosen the higher the true reward of the trajectories it induces: in the situations where it was designed, the proxy leads to behavior that is near optimal with respect to the true reward function.
• The posterior can then be written as

$$P(w = w^* \mid \tilde{w}, \widetilde{M}) \propto \frac{\exp\big(\beta \cdot w^{\top} \tilde{\mu}\big)}{\widetilde{Z}(w)} \, P(w), \qquad \widetilde{Z}(w) = \int_{\tilde{w}} \exp\big(\beta \, w^{\top} \tilde{\mu}(\tilde{w})\big) \, d\tilde{w}$$

where $\tilde{\mu}$ denotes the feature expectations of trajectories generated under the proxy reward and $\widetilde{Z}(w)$ is a normalizing constant, which needs to be approximated (a sample-based sketch follows below).

15
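Not from the slides (and not the authors' implementation): a sketch of the unnormalized posterior over a finite set of candidate true-weight vectors, with $\widetilde{Z}(w)$ approximated by a Monte Carlo average over sampled proxy rewards and a uniform prior $P(w)$. `proxy_feature_expectations` is the hypothetical helper above.

```python
def ird_posterior(world: WorldModel, w_tilde: np.ndarray, p0: np.ndarray,
                  candidate_ws: np.ndarray, beta: float = 1.0,
                  n_z_samples: int = 100, seed: int = 0) -> np.ndarray:
    """P(w = w* | w~, M~) ∝ exp(beta * w . mu~) / Z~(w), with a uniform prior P(w)."""
    rng = np.random.default_rng(seed)
    mu_tilde = proxy_feature_expectations(world, w_tilde, p0, beta)
    # Monte Carlo approximation of Z~(w) = ∫ exp(beta * w . mu~(w')) dw'
    # using proxies drawn uniformly from [-1, 1]^k.
    proxies = rng.uniform(-1, 1, size=(n_z_samples, world.phi.shape[1]))
    sampled_mus = np.array([proxy_feature_expectations(world, wp, p0, beta)
                            for wp in proxies])
    weights = []
    for w in candidate_ws:
        z_w = np.mean(np.exp(beta * (sampled_mus @ w)))
        weights.append(np.exp(beta * (w @ mu_tilde)) / z_w)
    weights = np.array(weights)
    return weights / weights.sum()   # normalize over the candidate set
```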

Page 16:

RISK-AVERSE POLICY
• We can obtain a risk-averse policy in the following way (a code sketch follows below):
  1. Sample a set of weights $\{w_i\}$ from our posterior distribution.
  2. Given those weights, pick a trajectory that maximizes the reward in the worst case:

$$\xi^* = \arg\max_{\xi} \; \min_{w \in \{w_i\}} \; w^{\top} \phi(\xi)$$

16
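Not from the slides: a sketch of the risk-averse trajectory choice, assuming a discrete posterior over candidate weights (as in the sketch above) and a finite set of candidate trajectories given as lists of state indices; all names are hypothetical.

```python
def risk_averse_trajectory(candidate_trajectories, phi: np.ndarray, gamma: float,
                           candidate_ws: np.ndarray, posterior: np.ndarray,
                           n_samples: int = 50, seed: int = 0):
    """xi* = argmax_xi  min_{w in sampled set}  w . phi(xi)."""
    rng = np.random.default_rng(seed)
    # 1. Sample a set of weights from the posterior distribution.
    idx = rng.choice(len(candidate_ws), size=n_samples, p=posterior)
    sampled_ws = np.asarray(candidate_ws)[idx]                # shape (n_samples, k)
    # 2. Pick the trajectory with the best worst-case reward.
    best_traj, best_worst = None, -np.inf
    for traj in candidate_trajectories:
        feats = sum(gamma**t * phi[s] for t, s in enumerate(traj))  # phi(xi)
        worst = float((sampled_ws @ feats).min())             # worst case over sampled w
        if worst > best_worst:
            best_traj, best_worst = traj, worst
    return best_traj
```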

Page 17:

EVALUATION: LAVALAND

[Figure from the paper: example Lavaland maps, one panel for training and one for testing.]

17

Page 18:

• Maps are initialized at random.
• Testing maps: 5% lava, 66.5% dirt, 28.5% grass.
• Features: 3-dimensional.
• Proxy reward: weights generated uniformly at random.
• Metric: % of test-time trajectories that contain lava.

18
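Not from the slides: a small sketch of how a random test map and a random proxy reward might be generated, using the terrain proportions and 3-dimensional features stated above; the exact encoding is an assumption.

```python
def random_lavaland_map(n_cells: int = 100, seed: int = 0) -> np.ndarray:
    """Random test-time terrain: roughly 5% lava, 66.5% dirt, 28.5% grass."""
    rng = np.random.default_rng(seed)
    return rng.choice(["lava", "dirt", "grass"], size=n_cells, p=[0.05, 0.665, 0.285])

def random_proxy_weights(k: int = 3, seed: int = 0) -> np.ndarray:
    """Proxy reward weights over the k features, drawn uniformly at random."""
    rng = np.random.default_rng(seed)
    return rng.uniform(-1, 1, size=k)
```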

Page 19:

CONCLUSIONS
• In this work, the authors present a method that accounts for uncertainty in the reward function.
• They propose a Bayesian framework to model this uncertainty.
• They show empirically, on toy examples, that their framework can avoid catastrophic events in the face of high uncertainty.

19