
Page 1:

Cooperative Inverse Reinforcement Learning

Robert “π” Pinsler and Adrià Garriga Alonso

October 26, 2017

D. Hadfield-Menell, S. J. Russell, P. Abbeel, and A. Dragan. Cooperative Inverse Reinforcement Learning. NIPS 2016.

Page 2:
Page 3:

Value Alignment Problem

How can we guarantee agents behave according to human objectives?

• humans are bad at stating what they want
• stated vs. intended reward function
• related to the principal-agent problem in economics

Page 4:

Reward Hacking

Figure 1: See https://www.youtube.com/watch?v=tlOIHko8ySg

As an optimization problem: if the objective depends on only K < N variables, the optimizer will often set the remaining N − K unconstrained variables to extreme values.

→ Alternative: use Inverse Reinforcement Learning to infer the reward function from the human

Page 5:

Reinforcement Learning

Markov Decision Process (MDP): M = (S, A, T, R, γ)

• state space S
• action space A
• transition dynamics T(·) = p(s_{t+1} | s_t, a_t)
• discount factor γ ∈ [0, 1]
• reward function R : S × A → ℝ

Behaviour is characterized by a policy π(a | s).
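To make these objects concrete, here is a minimal tabular sketch (not from the slides; sizes and names are illustrative) of an MDP together with value iteration for recovering an optimal policy:

```python
import numpy as np

# Illustrative tabular MDP: |S| = 4 states, |A| = 2 actions.
n_states, n_actions, gamma = 4, 2, 0.9
T = np.full((n_states, n_actions, n_states), 1.0 / n_states)  # T[s, a, s'] = p(s' | s, a)
R = np.random.rand(n_states, n_actions)                       # R(s, a)

# Value iteration: V(s) <- max_a [ R(s, a) + gamma * sum_s' T(s, a, s') V(s') ]
V = np.zeros(n_states)
for _ in range(100):
    Q = R + gamma * (T @ V)        # Q[s, a]
    V = Q.max(axis=1)
pi = Q.argmax(axis=1)              # deterministic greedy policy pi(s)
```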

Page 6:

Inverse Reinforcement Learning (IRL)

Given an MDP\R and teacher demonstrations D = {ζ1, . . . , ζN}, find the reward function R* which best explains the teacher's behavior:

E[ Σ_t γ^t R*(s_t) | π* ] ≥ E[ Σ_t γ^t R*(s_t) | π ]   ∀π   (1)

Usually demonstration by expert (DBE): assumes the teacher is optimal.
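As a hedged sketch of what condition (1) asks for (my own code, assuming a tabular MDP, a state-only reward, and a deterministic expert): a candidate reward explains the expert iff the expert's value matches the optimal value under that reward.

```python
import numpy as np

def explains_expert(R_hat, T, pi_expert, gamma=0.9, iters=500, tol=1e-6):
    """Check condition (1): under state reward R_hat(s), no policy beats the expert."""
    n_states = T.shape[0]
    T_pi = T[np.arange(n_states), pi_expert]   # T_pi[s, s'] = p(s' | s, pi_expert(s))

    V_pi = np.zeros(n_states)                  # policy evaluation of the expert
    V_star = np.zeros(n_states)                # optimal values under R_hat
    for _ in range(iters):
        V_pi = R_hat + gamma * (T_pi @ V_pi)
        V_star = R_hat + gamma * (T @ V_star).max(axis=1)
    return bool(np.all(V_pi >= V_star - tol))
```

Note that R_hat = 0 passes this check for any expert policy, which is exactly the ambiguity addressed on the next slide.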

Page 7:

IRL Problem Formulations

Need to resolve reward function ambiguity (e.g. R = 0 is a solution)

• Max-margin approaches (Ng and Russell, 2000; Abbeel and Ng, 2004; Ratliff et al., 2006)
• Probabilistic approaches (Ramachandran and Amir, 2007; Ziebart et al., 2008)

Issues

• often we don't want the robot to imitate the human
• assumes the human is unaware of being observed
• action selection is independent of reward uncertainty

Page 8:

Contributions of CIRL paper

1. Formalise Cooperative Inverse Reinforcement Learning (CIRL).

2. Characterise solution: relation to POMDP, sufficient statistics for policy.

3. Relate to existing Apprenticeship Learning.

4. Show that Demonstration By Expert (an assumption of existing IRL approaches) is not optimal.

5. Give an empirically better algorithm.

Page 9:

Cooperative Inverse Reinforcement Learning

Idea: observing passively while a task is performed efficiently is not the best way to learn how to perform the task, or which parts of it the performer cares about.
Examples: making coffee, playing a musical instrument.

Our problem setting should allow R to ask things, and H to answer and correct R's mistakes.

Human H and Robot R act simultaneously and interactively to maximise H's reward.

Page 10:

CIRL formulation

CIRL is M = (S, AH, AR, T, Θ, R, P0, γ)

• T(s′ | s, aH, aR): a probability distribution over the next state, given the current state and both agents' actions.
• R : S × AH × AR × Θ → ℝ: a parameterised reward function.
• P0(s0, θ): a probability distribution over the initial state and θ. Bayesian prior, known in the experiments.

Agents observe each other's actions. Behaviour is defined by a pair of policies (πH, πR):

• πH : [AH × AR × S]* × Θ → AH
• πR : [AH × AR × S]* → AR
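Purely as a structural sketch (field names and types are mine, not the paper's), the CIRL tuple could be represented like this:

```python
from dataclasses import dataclass
from typing import Callable, Dict, Sequence

State, ActH, ActR, Theta = int, int, int, float   # illustrative placeholder types

@dataclass
class CIRLGame:
    states: Sequence[State]                                         # S
    actions_h: Sequence[ActH]                                       # A^H
    actions_r: Sequence[ActR]                                       # A^R
    transition: Callable[[State, ActH, ActR], Dict[State, float]]   # T(s' | s, aH, aR)
    thetas: Sequence[Theta]                                         # Θ
    reward: Callable[[State, ActH, ActR, Theta], float]             # R(s, aH, aR; θ)
    prior: Callable[[State, Theta], float]                          # P0(s0, θ)
    gamma: float = 0.9
```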

Page 11:

Computing optimal policy pair

• Analog of the optimal policy in an MDP, but not a solution.
• If multiple optimal policy pairs exist, there is a coordination problem.
• Instance of a decentralised partially observable MDP (Dec-POMDP): several agents (H and R) with their corresponding policies (πH and πR) and reward functions (R and R), each potentially receiving partial information.
• Reducible to a single-agent coordination-POMDP MC:
  • State space: SC = S × Θ.
  • Observation space: the original states S.
  • Action space: AC = (Θ → AH) × AR.
• Solving this POMDP exactly is exponential in |S|.
• Consequence: the optimal policy depends only on the state and R's belief about θ.

Page 12:

Apprenticeship Learning

Idea: use the recovered R (from IRL) to learn π.

Apprenticeship CIRL: turn-based CIRL game with two phases

• learning phase: H and R take turns acting
• deployment phase: R acts independently

Page 13:

Example

Task: Help H Make Office Supplies

• ps: #paperclips, qs: #staples
• θ ∈ [0, 1] (unobserved by R): preference for paperclips vs. staples
• action (pa, qa) creates paperclips/staples
• Reward function: R(s, (pa, qa); θ) = θ·pa + (1 − θ)·qa

[Diagram: game tree over t = 0, 1, 2. H first picks aH ∈ {(2, 0), (1, 1), (0, 2)}; R then picks aR ∈ {(90, 0), (50, 50), (0, 90)}.]
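A quick sketch of this example's reward and H's total (undiscounted) payoff when its action is followed by one robot action, with the pairings taken from the game tree above; the code and the chosen theta value are illustrative, not from the slides.

```python
def reward(p, q, theta):
    """R(s, (p, q); theta) = theta * p + (1 - theta) * q."""
    return theta * p + (1 - theta) * q

def total_payoff(a_h, a_r, theta):
    """H's return when H produces a_h and R then produces a_r."""
    return reward(*a_h, theta) + reward(*a_r, theta)

# For theta = 0.2 (H mostly wants staples):
print(total_payoff((0, 2), (0, 90), 0.2))   # 73.6
print(total_payoff((1, 1), (50, 50), 0.2))  # 51.0
print(total_payoff((2, 0), (90, 0), 0.2))   # 18.4
```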

Page 14:

Deployment Phase

Theorem 2: In the deployment phase, the optimal policy for R maximizes reward in the MDP induced by the mean reward E_bR[R]. [1] [2]

Example

Suppose πH(θ) =
  (0, 2)   θ ∈ [0, 1/3)
  (1, 1)   θ ∈ [1/3, 2/3]
  (2, 0)   otherwise

bR0 uniform on θ; observe aH = (0, 2)
⇒ bR1 = Unif([0, 1/3)) → the optimal R behaves as though θ = 1/6

[1] Proof: reduces to an MDP under a fixed distribution over R; can apply Theorem 3 from Ramachandran and Amir, 2007.
[2] Reduces to θ̄ = E[θ] if R is linear in θ.
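A minimal sketch of the deployment-phase logic on this example (my own; the discretisation of θ and the variable names are assumptions): condition the belief on the observed aH, then act for the posterior mean, which Theorem 2 licenses here because R is linear in θ.

```python
import numpy as np

thetas = np.linspace(0.0, 1.0, 1001)              # discretised support of the belief over theta
prior = np.ones_like(thetas) / len(thetas)        # bR0: uniform

def pi_h(theta):
    """H's policy from the example above."""
    if theta < 1/3:
        return (0, 2)
    if theta <= 2/3:
        return (1, 1)
    return (2, 0)

observed = (0, 2)
likelihood = np.array([pi_h(t) == observed for t in thetas], dtype=float)
posterior = prior * likelihood
posterior /= posterior.sum()                      # bR1 = Unif([0, 1/3))

theta_bar = float(thetas @ posterior)             # posterior mean, ~1/6
actions_r = [(90, 0), (50, 50), (0, 90)]
best = max(actions_r, key=lambda a: theta_bar * a[0] + (1 - theta_bar) * a[1])
print(theta_bar, best)                            # ~0.166 (i.e. ~1/6), (0, 90)
```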

Page 15:

Learning Phase

DBE policy πE: H greedily maximizes reward on its turn

• ignores impact on R's belief about θ

πE(θ) =
  (0, 2)   θ < 0.5
  (1, 1)   θ = 0.5
  (2, 0)   θ > 0.5

Classic IRL as a best response for R assuming DBE (πH = πE):

• Learning Phase: use IRL to compute bR
• Deployment Phase: act to maximize reward under θ̄ (Theorem 2)

Page 16:

Learning Phase (contd.)

Theorem 3: There exist Apprenticeship CIRL games where the best response of H violates the DBE assumption: br(br(πE)) ≠ πE

Proof.

br(πE) = πR(aH) =
  (0, 90)    aH = (0, 2)
  (50, 50)   aH = (1, 1)
  (90, 0)    aH = (2, 0)

br(br(πE)) = πH(θ) =
  (0, 2)    θ < 41/92
  (1, 1)    41/92 ≤ θ ≤ 51/92
  (2, 0)    θ ≥ 51/92

≠ πE

→ present demonstrations for fast learning instead
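To see where 41/92 and 51/92 come from, here is a small sketch (my own; it assumes one H action followed by one R reply and no discounting, matching the game tree earlier) that recomputes H's best response when R replies with br(πE):

```python
# R's IRL-based reply to each observed H action, i.e. br(pi_E) from the proof.
robot_reply = {(0, 2): (0, 90), (1, 1): (50, 50), (2, 0): (90, 0)}

def value(a_h, theta):
    """Total reward for H: its own action plus R's reply, under R = theta*p + (1-theta)*q."""
    p_r, q_r = robot_reply[a_h]
    p, q = a_h[0] + p_r, a_h[1] + q_r
    return theta * p + (1 - theta) * q

# (0, 2) yields 92*(1-theta), (1, 1) yields 51, (2, 0) yields 92*theta.
# Crossovers: 92*(1-theta) = 51 at theta = 41/92; 92*theta = 51 at theta = 51/92.
for theta in (0.3, 0.45, 0.5, 0.56, 0.7):
    print(theta, max(robot_reply, key=lambda a: value(a, theta)))
# -> (0, 2), (1, 1), (1, 1), (2, 0), (2, 0): for theta in (41/92, 51/92) away from 0.5,
#    br(br(pi_E)) picks (1, 1) while pi_E does not, hence br(br(pi_E)) != pi_E.
```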

Page 17:

Generating Instructive Demonstrations

Compute H’s best response when R uses IRL

• simplification: assume R(s, aH, aR; θ) = φ(s)^T θ
• IRL objective changes:

find R* s.t. E[ Σ_t γ^t R*(s_t) | π* ] ≥ E[ Σ_t γ^t R*(s_t) | π ]  ∀π
⇓
find θ* s.t. θ*^T μ(π*) ≥ θ*^T μ(π)  ∀π

where μ(π) = E[ Σ_t γ^t φ(s_t) | π ] are the expected feature counts

→ br(πE) computes a policy that matches the observed feature counts μ from the learning phase

Page 18:

Generating Instructive Demonstrations (contd.)

Idea:

• compute the feature counts μ(πE_θ) that R expects to see under the true parameters θ
• select the trajectory ζ that most closely matches these features

Approximate best response to feature matching:

ζH ← argmax_ζ  φ(ζ)^T θ − η ||μ(πE_θ) − φ(ζ)||²

where φ(ζ) = Σ_{s∈ζ} φ(s)
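A hedged sketch of this selection rule, assuming a finite pool of candidate trajectories and a precomputed expert feature expectation mu_expert (both names are mine, not the paper's):

```python
import numpy as np

def traj_features(traj, phi):
    """phi(zeta) = sum of phi(s) over the states in the trajectory."""
    return sum(phi(s) for s in traj)

def instructive_demo(candidates, phi, theta, mu_expert, eta=1.0):
    """zeta_H <- argmax_zeta  phi(zeta)^T theta - eta * ||mu(pi_E_theta) - phi(zeta)||^2."""
    def score(traj):
        f = traj_features(traj, phi)
        return float(f @ theta) - eta * float(np.sum((mu_expert - f) ** 2))
    return max(candidates, key=score)
```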

Page 19:

Experiment 0: Instructive Demonstration

[Figure: three square plots (from the CIRL paper, edited): Ground Truth, Expert Demonstration, Instructive Demonstration; θ = (0.1, 1.0, 1.5), φ(s) = {e^{−||x − x_i||}}_i.]

Page 20:

Experiment 1: approximate br(br(πE)) vs. πE

Is Approximate Best Response better, if R uses an IRL algorithm?

Manipulated variables:

• H-policy: approximate br(br(πE)) (abbreviated br) vs. πE.
• Number of RBFs: 3, . . . , 10.

For a random sample of 500 θGT per experiment, measure:

• Regret: difference between the values of the optimal policies for θGT and θ.
• KL-divergence between the trajectory distributions induced by θGT and θ.
• Squared distance between parameters: ||θGT − θ||²

Page 21:

Experiment 1: findings; experiment 2: effect of MaxEnt λ

ANOVA: for all measures and all numbers of RBFs, br outperformed πE (F > 962, p < .0001).
There is an interaction effect between the number of RBFs and the KL-divergence and squared-distance measures: the gap is larger with fewer RBFs.

Figure from CIRL paper.

In the MaxEnt algorithm, λ controls how optimally R expects H to behave: λ = 0 means H behaves independently of their preferences, whereas λ → ∞ means H behaves exactly optimally.
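One standard way to realise such a λ (my assumption of the usual Boltzmann-rational form; the slide does not spell this out) is a softmax over action values scaled by λ: λ = 0 gives behaviour independent of preferences (uniform), λ → ∞ gives exactly optimal behaviour.

```python
import numpy as np

def boltzmann_policy(action_values, lam):
    """p(a) proportional to exp(lam * Q(a)); lam = 0 -> uniform, lam -> inf -> greedy."""
    logits = lam * np.asarray(action_values, dtype=float)
    logits -= logits.max()                  # for numerical stability
    p = np.exp(logits)
    return p / p.sum()

print(boltzmann_policy([1.0, 2.0, 3.0], 0.0))    # uniform
print(boltzmann_policy([1.0, 2.0, 3.0], 10.0))   # nearly all mass on the best action
```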

Page 22:

Conclusions and Future Work

• Game-theoretic model of learning from another cooperative agent's demonstrations. Keys: the robot knows it is in a shared environment, and adopts the human reward function as its own.
• Conceptual connections with RL with corrupted rewards, and some more tenuous ones with scalable supervision.
• Explore the coordination problem in the future; coordinating in reality is infeasible.
• Did not fully explore the consequences and emerging strategies of this new setting, focusing instead only on apprenticeship CIRL.

Page 23:

References I

P. Abbeel and A. Y. Ng. Apprenticeship learning via inverse reinforcement learning. ICML, 2004.

D. Hadfield-Menell, S. J. Russell, P. Abbeel, and A. Dragan. Cooperative inverse reinforcement learning. NIPS, 2016, pp. 3909–3917.

A. Y. Ng and S. J. Russell. Algorithms for inverse reinforcement learning. ICML, 2000, pp. 663–670.

D. Ramachandran and E. Amir. Bayesian inverse reinforcement learning. IJCAI, 2007, pp. 2586–2591.

N. D. Ratliff, J. A. Bagnell, and M. A. Zinkevich. Maximum margin planning. ICML, 2006, pp. 729–736.

B. D. Ziebart, A. L. Maas, J. A. Bagnell, and A. K. Dey. Maximum entropy inverse reinforcement learning. AAAI, 2008, pp. 1433–1438.