
Page 1:

Cooperative Inverse Reinforcement Learning

Robert “π” Pinsler and Adrià Garriga Alonso

October 26, 2017

D. Hadfield-Menell, S. J. Russell, P. Abbeel, and A. Dragan. Cooperative Inverse Reinforcement Learning. NIPS 2016.

Page 2:
Page 3:

Value Alignment Problem

How can we guarantee agents behave according to human objectives?

• humans are bad at stating what they want
• stated vs. intended reward function
• related to the principal-agent problem in economics

Page 4:

Reward Hacking

Figure 1: See https://www.youtube.com/watch?v=tlOIHko8ySg

As an optimization problem: if the objective depends on only K < N variables, the optimizer will often set the remaining N − K unconstrained variables to extreme values.

→ Alternative: use Inverse Reinforcement Learning to infer the reward function from the human

Page 5:

Reinforcement Learning

Markov Decision Process (MDP): M = (S, A, T, R, γ)

• state space S
• action space A
• transition dynamics T(·) = p(s_{t+1} | s_t, a_t)
• discount factor γ ∈ [0, 1]
• reward function R : S × A → ℝ

Behaviour is characterized by a policy π(a | s).
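To make these objects concrete, here is a minimal tabular sketch (not from the slides; sizes and names are illustrative) of an MDP together with value iteration for recovering an optimal policy:

```python
import numpy as np

# Illustrative tabular MDP: |S| = 4 states, |A| = 2 actions.
n_states, n_actions, gamma = 4, 2, 0.9
T = np.full((n_states, n_actions, n_states), 1.0 / n_states)  # T[s, a, s'] = p(s' | s, a)
R = np.random.rand(n_states, n_actions)                       # R(s, a)

# Value iteration: V(s) <- max_a [ R(s, a) + gamma * sum_s' T(s, a, s') V(s') ]
V = np.zeros(n_states)
for _ in range(100):
    Q = R + gamma * (T @ V)        # Q[s, a]
    V = Q.max(axis=1)
pi = Q.argmax(axis=1)              # deterministic greedy policy pi(s)
```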

Page 6:

Inverse Reinforcement Learning (IRL)

Given an MDP\R and teacher demonstrations D = {ζ1, . . . , ζN}, find the reward function R* which best explains the teacher's behavior:

E[ Σ_t γ^t R*(s_t) | π* ] ≥ E[ Σ_t γ^t R*(s_t) | π ]   ∀π   (1)

Usually demonstration by expert (DBE): assumes the teacher is optimal.
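As a hedged sketch of what condition (1) asks for (my own code, assuming a tabular MDP, a state-only reward, and a deterministic expert): a candidate reward explains the expert iff the expert's value matches the optimal value under that reward.

```python
import numpy as np

def explains_expert(R_hat, T, pi_expert, gamma=0.9, iters=500, tol=1e-6):
    """Check condition (1): under state reward R_hat(s), no policy beats the expert."""
    n_states = T.shape[0]
    T_pi = T[np.arange(n_states), pi_expert]   # T_pi[s, s'] = p(s' | s, pi_expert(s))

    V_pi = np.zeros(n_states)                  # policy evaluation of the expert
    V_star = np.zeros(n_states)                # optimal values under R_hat
    for _ in range(iters):
        V_pi = R_hat + gamma * (T_pi @ V_pi)
        V_star = R_hat + gamma * (T @ V_star).max(axis=1)
    return bool(np.all(V_pi >= V_star - tol))
```

Note that R_hat = 0 passes this check for any expert policy, which is exactly the ambiguity addressed on the next slide.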

Page 7:

IRL Problem Formulations

Need to resolve reward function ambiguity (e.g. R = 0 is a solution)

• Max-margin approaches (Ng and Russell, 2000; Abbeel and Ng, 2004; Ratliff et al., 2006)
• Probabilistic approaches (Ramachandran and Amir, 2007; Ziebart et al., 2008)

Issues

• often we don't want the robot to imitate the human
• assumes the human is unaware of being observed
• action selection is independent of reward uncertainty

Page 8:

Contributions of CIRL paper

1. Formalise Cooperative Inverse Reinforcement Learning (CIRL).

2. Characterise solution: relation to POMDP, sufficient statistics for policy.

3. Relate to existing Apprenticeship Learning.

4. Show that Demonstration By Expert (an assumption of existing IRL approaches) is not optimal.

5. Give an empirically better algorithm.

Page 9:

Cooperative Inverse Reinforcement Learning

Idea: observing passively while a task is performed efficiently is not the best way to learn how to perform the task, or which parts of it the performer cares about.
Examples: making coffee, playing a musical instrument.

Our problem setting should allow R to ask things, and H to answer and correct R's mistakes.

Human H and Robot R act simultaneously and interactively to maximise H's reward.

Page 10:

CIRL formulation

CIRL is M = (S, AH, AR, T, Θ, R, P0, γ)

• T(s′ | s, aH, aR): a probability distribution over the next state, given the current state and both agents' actions.
• R : S × AH × AR × Θ → ℝ: a parameterised reward function.
• P0(s0, θ): a probability distribution over the initial state and θ. Bayesian prior, known in the experiments.

Agents observe each other's actions. Behaviour is defined by a pair of policies (πH, πR):

• πH : [AH × AR × S]* × Θ → AH
• πR : [AH × AR × S]* → AR
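Purely as a structural sketch (field names and types are mine, not the paper's), the CIRL tuple could be represented like this:

```python
from dataclasses import dataclass
from typing import Callable, Dict, Sequence

State, ActH, ActR, Theta = int, int, int, float   # illustrative placeholder types

@dataclass
class CIRLGame:
    states: Sequence[State]                                         # S
    actions_h: Sequence[ActH]                                       # A^H
    actions_r: Sequence[ActR]                                       # A^R
    transition: Callable[[State, ActH, ActR], Dict[State, float]]   # T(s' | s, aH, aR)
    thetas: Sequence[Theta]                                         # Θ
    reward: Callable[[State, ActH, ActR, Theta], float]             # R(s, aH, aR; θ)
    prior: Callable[[State, Theta], float]                          # P0(s0, θ)
    gamma: float = 0.9
```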

Page 11:

Computing optimal policy pair

• Analog of the optimal policy in an MDP, but not a solution.
• If multiple optimal policy pairs exist, there is a coordination problem.
• Instance of a decentralised partially observable MDP (Dec-POMDP): several agents (H and R) with their corresponding policies (πH and πR) and reward functions (R and R), each potentially receiving partial information.
• Reducible to a single-agent coordination-POMDP MC:
  • State space: SC = S × Θ.
  • Observation space: the original states S.
  • Action space: AC = (Θ → AH) × AR.
• Solving this POMDP exactly is exponential in |S|.
• Consequence: the optimal policy depends only on the state and R's belief about θ.

Page 12:

Apprenticeship Learning

Idea: use the recovered R (from IRL) to learn π.

Apprenticeship CIRL: turn-based CIRL game with two phases

• learning phase: H and R take turns acting
• deployment phase: R acts independently

Page 13:

Example

Task: Help H Make Office Supplies

• ps: #paperclips, qs: #staples
• θ ∈ [0, 1] (unobserved by R): preference for paperclips vs. staples
• action (pa, qa) creates paperclips/staples
• Reward function: R(s, (pa, qa); θ) = θ·pa + (1 − θ)·qa

[Diagram: game tree over t = 0, 1, 2. H first picks aH ∈ {(2, 0), (1, 1), (0, 2)}; R then picks aR ∈ {(90, 0), (50, 50), (0, 90)}.]
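A quick sketch of this example's reward and H's total (undiscounted) payoff when its action is followed by one robot action, with the pairings taken from the game tree above; the code and the chosen theta value are illustrative, not from the slides.

```python
def reward(p, q, theta):
    """R(s, (p, q); theta) = theta * p + (1 - theta) * q."""
    return theta * p + (1 - theta) * q

def total_payoff(a_h, a_r, theta):
    """H's return when H produces a_h and R then produces a_r."""
    return reward(*a_h, theta) + reward(*a_r, theta)

# For theta = 0.2 (H mostly wants staples):
print(total_payoff((0, 2), (0, 90), 0.2))   # 73.6
print(total_payoff((1, 1), (50, 50), 0.2))  # 51.0
print(total_payoff((2, 0), (90, 0), 0.2))   # 18.4
```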

Page 14:

Deployment Phase

Theorem 2: In the deployment phase, the optimal policy for R maximizes reward in the MDP induced by the mean reward E_bR[R]. [1] [2]

Example

Suppose πH(θ) =
  (0, 2)   θ ∈ [0, 1/3)
  (1, 1)   θ ∈ [1/3, 2/3]
  (2, 0)   otherwise

bR0 uniform on θ; observe aH = (0, 2)
⇒ bR1 = Unif([0, 1/3)) → the optimal R behaves as though θ = 1/6

[1] Proof: reduces to an MDP under a fixed distribution over R; can apply Theorem 3 from Ramachandran and Amir, 2007.
[2] Reduces to θ̄ = E[θ] if R is linear in θ.
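A minimal sketch of the deployment-phase logic on this example (my own; the discretisation of θ and the variable names are assumptions): condition the belief on the observed aH, then act for the posterior mean, which Theorem 2 licenses here because R is linear in θ.

```python
import numpy as np

thetas = np.linspace(0.0, 1.0, 1001)              # discretised support of the belief over theta
prior = np.ones_like(thetas) / len(thetas)        # bR0: uniform

def pi_h(theta):
    """H's policy from the example above."""
    if theta < 1/3:
        return (0, 2)
    if theta <= 2/3:
        return (1, 1)
    return (2, 0)

observed = (0, 2)
likelihood = np.array([pi_h(t) == observed for t in thetas], dtype=float)
posterior = prior * likelihood
posterior /= posterior.sum()                      # bR1 = Unif([0, 1/3))

theta_bar = float(thetas @ posterior)             # posterior mean, ~1/6
actions_r = [(90, 0), (50, 50), (0, 90)]
best = max(actions_r, key=lambda a: theta_bar * a[0] + (1 - theta_bar) * a[1])
print(theta_bar, best)                            # ~0.166 (i.e. ~1/6), (0, 90)
```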

Page 15:

Learning Phase

DBE policy πE: H greedily maximizes reward on its turn

• ignores impact on R's belief about θ

πE(θ) =
  (0, 2)   θ < 0.5
  (1, 1)   θ = 0.5
  (2, 0)   θ > 0.5

Classic IRL as a best response for R assuming DBE (πH = πE):

• Learning Phase: use IRL to compute bR
• Deployment Phase: act to maximize reward under θ̄ (Theorem 2)

Page 16:

Learning Phase (contd.)

Theorem 3: There exist Apprenticeship CIRL games where the best response of H violates the DBE assumption: br(br(πE)) ≠ πE

Proof.

br(πE) = πR(aH) =
  (0, 90)    aH = (0, 2)
  (50, 50)   aH = (1, 1)
  (90, 0)    aH = (2, 0)

br(br(πE)) = πH(θ) =
  (0, 2)    θ < 41/92
  (1, 1)    41/92 ≤ θ ≤ 51/92
  (2, 0)    θ ≥ 51/92

≠ πE

→ present demonstrations for fast learning instead
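To see where 41/92 and 51/92 come from, here is a small sketch (my own; it assumes one H action followed by one R reply and no discounting, matching the game tree earlier) that recomputes H's best response when R replies with br(πE):

```python
# R's IRL-based reply to each observed H action, i.e. br(pi_E) from the proof.
robot_reply = {(0, 2): (0, 90), (1, 1): (50, 50), (2, 0): (90, 0)}

def value(a_h, theta):
    """Total reward for H: its own action plus R's reply, under R = theta*p + (1-theta)*q."""
    p_r, q_r = robot_reply[a_h]
    p, q = a_h[0] + p_r, a_h[1] + q_r
    return theta * p + (1 - theta) * q

# (0, 2) yields 92*(1-theta), (1, 1) yields 51, (2, 0) yields 92*theta.
# Crossovers: 92*(1-theta) = 51 at theta = 41/92; 92*theta = 51 at theta = 51/92.
for theta in (0.3, 0.45, 0.5, 0.56, 0.7):
    print(theta, max(robot_reply, key=lambda a: value(a, theta)))
# -> (0, 2), (1, 1), (1, 1), (2, 0), (2, 0): for theta in (41/92, 51/92) away from 0.5,
#    br(br(pi_E)) picks (1, 1) while pi_E does not, hence br(br(pi_E)) != pi_E.
```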

Page 17:

Generating Instructive Demonstrations

Compute H’s best response when R uses IRL

• simplification: assume R(s, aH, aR; θ) = φ(s)^T θ
• IRL objective changes:

find R* s.t. E[ Σ_t γ^t R*(s_t) | π* ] ≥ E[ Σ_t γ^t R*(s_t) | π ]  ∀π
⇓
find θ* s.t. θ*^T μ(π*) ≥ θ*^T μ(π)  ∀π

where μ(π) = E[ Σ_t γ^t φ(s_t) | π ] are the expected feature counts

→ br(πE) computes a policy that matches the observed feature counts μ from the learning phase

Page 18:

Generating Instructive Demonstrations (contd.)

Idea:

• compute the feature counts μ(πE_θ) that R expects to see under the true parameters θ
• select the trajectory ζ that most closely matches these features

Approximate best response to feature matching:

ζH ← argmax_ζ  φ(ζ)^T θ − η ||μ(πE_θ) − φ(ζ)||²

where φ(ζ) = Σ_{s∈ζ} φ(s)
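A hedged sketch of this selection rule, assuming a finite pool of candidate trajectories and a precomputed expert feature expectation mu_expert (both names are mine, not the paper's):

```python
import numpy as np

def traj_features(traj, phi):
    """phi(zeta) = sum of phi(s) over the states in the trajectory."""
    return sum(phi(s) for s in traj)

def instructive_demo(candidates, phi, theta, mu_expert, eta=1.0):
    """zeta_H <- argmax_zeta  phi(zeta)^T theta - eta * ||mu(pi_E_theta) - phi(zeta)||^2."""
    def score(traj):
        f = traj_features(traj, phi)
        return float(f @ theta) - eta * float(np.sum((mu_expert - f) ** 2))
    return max(candidates, key=score)
```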

Page 19:

Experiment 0: Instructive Demonstration

[Figure: three square plots (from the CIRL paper, edited): Ground Truth, Expert Demonstration, Instructive Demonstration; θ = (0.1, 1.0, 1.5), φ(s) = {e^{−||x − x_i||}}_i.]

Page 20:

Experiment 1: approximate br(br(πE)) vs. πE

Is Approximate Best Response better, if R uses an IRL algorithm?

Manipulated variables:

• H-policy: approximate br(br(πE)) (abbreviated br) vs. πE.
• Number of RBFs: 3, . . . , 10.

For a random sample of 500 θGT per experiment, measure:

• Regret: difference between the values of the optimal policies for θGT and θ.
• KL-divergence between the trajectory distributions induced by θGT and θ.
• Squared distance between parameters: ||θGT − θ||²

Page 21:

Experiment 1: findings; experiment 2: effect of MaxEnt λ

ANOVA: for all measures and all numbers of RBFs, br outperformed πE (F > 962, p < .0001).
There is an interaction effect between the number of RBFs and the KL-divergence and squared-distance measures: the gap is larger with fewer RBFs.

Figure from CIRL paper.

In the MaxEnt algorithm, λ controls how optimally R expects H to behave: λ = 0 means H behaves independently of their preferences, whereas λ → ∞ means H behaves exactly optimally.
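One standard way to realise such a λ (my assumption of the usual Boltzmann-rational form; the slide does not spell this out) is a softmax over action values scaled by λ: λ = 0 gives behaviour independent of preferences (uniform), λ → ∞ gives exactly optimal behaviour.

```python
import numpy as np

def boltzmann_policy(action_values, lam):
    """p(a) proportional to exp(lam * Q(a)); lam = 0 -> uniform, lam -> inf -> greedy."""
    logits = lam * np.asarray(action_values, dtype=float)
    logits -= logits.max()                  # for numerical stability
    p = np.exp(logits)
    return p / p.sum()

print(boltzmann_policy([1.0, 2.0, 3.0], 0.0))    # uniform
print(boltzmann_policy([1.0, 2.0, 3.0], 10.0))   # nearly all mass on the best action
```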

Page 22:

Conclusions and Future Work

• Game-theoretic model of learning from another cooperative agent's demonstrations. Keys: the robot knows it is in a shared environment, and adopts the human reward function as its own.
• Conceptual connections with RL with corrupted rewards, and some more tenuous ones with scalable supervision.
• Explore the coordination problem in the future; coordinating in reality is infeasible.
• Did not fully explore the consequences and emerging strategies of this new setting, focusing instead only on apprenticeship CIRL.

Page 23:

References I

P. Abbeel and A. Y. Ng. Apprenticeship learning via inverse reinforcement learning. ICML, 2004.

D. Hadfield-Menell, S. J. Russell, P. Abbeel, and A. Dragan. Cooperative inverse reinforcement learning. NIPS, 2016, pp. 3909–3917.

A. Y. Ng and S. J. Russell. Algorithms for inverse reinforcement learning. ICML, 2000, pp. 663–670.

D. Ramachandran and E. Amir. Bayesian inverse reinforcement learning. IJCAI, 2007, pp. 2586–2591.

N. D. Ratliff, J. A. Bagnell, and M. A. Zinkevich. Maximum margin planning. ICML, 2006, pp. 729–736.

B. D. Ziebart, A. L. Maas, J. A. Bagnell, and A. K. Dey. Maximum entropy inverse reinforcement learning. AAAI, 2008, pp. 1433–1438.