Cooperative Inverse Reinforcement Learning
Robert “π” Pinsler and Adria Garriga Alonso
October 26, 2017
D. Hadfield-Menell, S. J. Russell, P. Abbeel, and A. Dragan. Cooperative
Inverse Reinforcement Learning. NIPS 2016.
Value Alignment Problem
How can we guarantee agents behave according to human objectives?
- humans are bad at stating what they want
- stated vs. intended reward function
- related to the principal-agent problem in economics
Reward Hacking
Figure 1: See https://www.youtube.com/watch?v=tlOIHko8ySg
As an optimization problem: if the objective depends on only K < N variables, the optimizer will often set the remaining N − K unconstrained variables to extreme values.
→ Alternative: use Inverse Reinforcement Learning to infer the reward function from the human
Reinforcement Learning
Markov Decision Process (MDP): M = (S, A, T, R, γ)
- state space S
- action space A
- transition dynamics T(·) = p(s_{t+1} | s_t, a_t)
- discount factor γ ∈ [0, 1]
- reward function R : S × A → ℝ
Behavior is characterized by a policy π(a | s).
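As a concrete sketch, the MDP tuple above can be held in a small Python container (the class and field names are ours, not from the paper):

```python
import random
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple

# Minimal container for the MDP tuple M = (S, A, T, R, gamma).
# States and actions are plain ints; T maps (s, a) to a probability
# vector over next states; R is a state reward, as in the slides.
@dataclass
class MDP:
    states: List[int]
    actions: List[int]
    T: Dict[Tuple[int, int], List[float]]  # p(s' | s, a)
    R: Callable[[int], float]
    gamma: float

    def step(self, s: int, a: int) -> int:
        # Sample s_{t+1} ~ T(. | s_t, a_t)
        return random.choices(self.states, weights=self.T[(s, a)])[0]
```

A policy π(a | s) would then simply be a function from states to (distributions over) actions.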
Inverse Reinforcement Learning (IRL)
Given MDP\R and teacher demonstrations D = {ζ_1, . . . , ζ_N}, find the reward function R* which best explains the teacher's behavior:

  E[∑_t γ^t R*(s_t) | π*] ≥ E[∑_t γ^t R*(s_t) | π]   ∀π    (1)
Usually demonstration by expert (DBE): assumes the teacher is optimal.
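Condition (1) can be checked numerically on a toy problem. The sketch below (our construction, not the paper's) uses a 2-state chain with deterministic transitions and enumerates every deterministic stationary policy to test whether a candidate R* explains the expert:

```python
import itertools

gamma = 0.9
states, actions = [0, 1], [0, 1]

# Deterministic transitions: taking action a moves to state a.
def value(policy, R, s0=0, horizon=50):
    # Truncated discounted return of a deterministic policy from s0.
    v, s = 0.0, s0
    for t in range(horizon):
        v += (gamma ** t) * R[s]
        s = policy[s]  # next state = action chosen in s
    return v

R_star = [0.0, 1.0]    # candidate reward: state 1 is good
expert = {0: 1, 1: 1}  # expert always moves to state 1
others = [dict(zip(states, acts))
          for acts in itertools.product(actions, repeat=2)]
# Condition (1): expert's return under R* beats every other policy's.
explains = all(value(expert, R_star) >= value(pi, R_star) - 1e-9
               for pi in others)
print(explains)  # True: the expert is optimal under R*
```

Note the ambiguity the next slide mentions: R = [0, 0] would also satisfy the inequality trivially.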
IRL Problem Formulations
Need to resolve reward-function ambiguity (e.g. R = 0 is always a solution):
- Max-margin approaches (Ng and Russell, 2000; Abbeel and Ng, 2004; Ratliff et al., 2006)
- Probabilistic approaches (Ramachandran and Amir, 2007; Ziebart et al., 2008)
Issues
- we often don't want the robot to imitate the human
- assumes the human is unaware of being observed
- action selection is independent of reward uncertainty
Contributions of CIRL paper
1. Formalise Cooperative Inverse Reinforcement Learning (CIRL).
2. Characterise the solution: relation to POMDPs, sufficient statistics for the policy.
3. Relate CIRL to existing Apprenticeship Learning.
4. Show that Demonstration By Expert (an assumption of existing IRL approaches) is not optimal.
5. Give an empirically better algorithm.
Cooperative Inverse Reinforcement Learning
Idea: passively observing while a task is performed efficiently is not the best way to learn how to perform the task, or which parts of it the performer cares about. Examples: making coffee, playing a musical instrument.
Our problem setting should allow R to ask things, and H to answer and correct R's mistakes.
Human H and robot R act simultaneously and interactively, to maximise H's reward.
CIRL formulation
CIRL is M = (S, AH, AR, T, Θ, R, P_0, γ):
- T(s′ | s, aH, aR): a probability distribution over the next state, given the current state and both actions.
- R : S × AH × AR × Θ → ℝ: a parameterised reward function.
- P_0(s_0, θ): a probability distribution over initial states and θ. A Bayesian prior, known in the experiments.
Agents observe each other's actions. Behaviour is defined by a pair of policies (πH, πR):
- πH : [AH × AR × S]* × Θ → AH
- πR : [AH × AR × S]* → AR
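One round of such a game can be sketched as follows (an illustrative sketch, not code from the paper; T, pi_H and pi_R are placeholders supplied by the caller):

```python
# One round of a CIRL game: H and R act given the shared history;
# both observe the joint history, but only H knows theta.
def play_round(state, theta, pi_H, pi_R, history, T):
    aH = pi_H(history, theta)      # H's policy may condition on theta
    aR = pi_R(history)             # R's policy cannot see theta
    next_state = T(state, aH, aR)  # joint transition dynamics
    history.append((aH, aR, next_state))  # both agents observe everything
    return next_state
```

The asymmetry in the two policy signatures above (only pi_H takes theta) is exactly the asymmetry in the πH and πR definitions on this slide.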
Computing optimal policy pair
- Analog of the optimal policy in an MDP, but not a solution by itself.
- If multiple optimal policy pairs exist, there is a coordination problem.
- Instance of a decentralised partially observable MDP (Dec-POMDP): several agents (H and R) with their corresponding policies (πH and πR) and reward functions (R and R), each potentially receiving partial information.
- Reducible to a single-agent coordination-POMDP M_C:
  - State space: S_C = S × Θ.
  - Observation space: the original states S.
  - Action space: A_C = (Θ → AH) × AR.
- Solving this POMDP exactly is exponential in |S|.
- Consequence: the optimal policy depends only on the state and the belief of R about θ.
Apprenticeship Learning
Idea: use the recovered R (from IRL) to learn π.
Apprenticeship CIRL: turn-based CIRL game with two phases:
- learning phase: H and R take turns acting
- deployment phase: R acts independently
Example
Task: Help H Make Office Supplies
- p_s: #paperclips, q_s: #staples
- θ ∈ [0, 1] (unobserved by R): preference for paperclips vs. staples
- action (p_a, q_a) creates p_a paperclips and q_a staples
- Reward function: R(s, (p_a, q_a); θ) = θ p_a + (1 − θ) q_a
[Game tree from the CIRL paper: at t = 0, H chooses aH ∈ {(2, 0), (1, 1), (0, 2)}; at t = 1, R chooses aR ∈ {(90, 0), (50, 50), (0, 90)}; the game ends at t = 2.]
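For a given θ, the rewards in this example can be evaluated directly. A quick sketch (θ = 0.8 is an arbitrary illustrative value, meaning H strongly prefers paperclips):

```python
# R(s, (p_a, q_a); theta) = theta * p_a + (1 - theta) * q_a
def reward(pa, qa, theta):
    return theta * pa + (1 - theta) * qa

H_actions = [(2, 0), (1, 1), (0, 2)]      # H's learning-phase options
R_actions = [(90, 0), (50, 50), (0, 90)]  # R's deployment options

theta = 0.8  # illustrative: H strongly prefers paperclips
best_H = max(H_actions, key=lambda a: reward(*a, theta))
best_R = max(R_actions, key=lambda a: reward(*a, theta))
print(best_H, best_R)  # (2, 0) (90, 0)
```

If R knew θ, it would simply pick the deployment action above; the interesting case, covered next, is when R must infer θ from aH.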
Deployment Phase
Theorem 2
In the deployment phase, the optimal policy for R maximizes reward in the MDP induced by the mean reward E_bR[R].¹ ²
Example
Suppose πH(θ) =
  (0, 2)  if θ ∈ [0, 1/3)
  (1, 1)  if θ ∈ [1/3, 2/3]
  (2, 0)  otherwise

bR_0 uniform on θ; R observes aH = (0, 2)
bR_1 = Unif([0, 1/3)) → the optimal R behaves as though θ = 1/6
¹Proof: reduces to an MDP under a fixed distribution over R; apply Theorem 3 from Ramachandran and Amir, 2007.
²Reduces to acting under θ = E[θ] if R is linear in θ.
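This deployment example can be checked numerically. The sketch below (the grid discretisation is ours) starts from a uniform belief over θ, conditions on observing aH = (0, 2) under the stated πH, and, since R is linear in θ, acts as though θ = E[θ | aH] = 1/6:

```python
# Discretised Unif[0, 1] prior over theta
N = 60_000
grid = [(i + 0.5) / N for i in range(N)]

def pi_H(theta):
    if theta < 1/3:
        return (0, 2)
    if theta <= 2/3:
        return (1, 1)
    return (2, 0)

# Bayesian update: keep theta values consistent with aH = (0, 2)
posterior = [th for th in grid if pi_H(th) == (0, 2)]
theta_hat = sum(posterior) / len(posterior)  # posterior mean = 1/6

# Deployment: best action under the posterior-mean theta
deploy = max([(90, 0), (50, 50), (0, 90)],
             key=lambda a: theta_hat * a[0] + (1 - theta_hat) * a[1])
print(theta_hat, deploy)  # ~0.1667 (0, 90)
```

So after seeing the all-staples demonstration, R makes only staples in deployment.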
Learning Phase
DBE policy πE: H greedily maximizes reward on its own turn
- ignores the impact on R's belief about θ

πE(θ) =
  (0, 2)  if θ < 0.5
  (1, 1)  if θ = 0.5
  (2, 0)  if θ > 0.5
Classic IRL as R's best response assuming DBE (πH = πE):
- Learning phase: use IRL to compute bR
- Deployment phase: act to maximize reward under the mean of bR (Theorem 2)
Learning Phase (contd.)
Theorem 3
There exist Apprenticeship CIRL games where the best response of H violates the DBE assumption: br(br(πE)) ≠ πE
Proof.
br(πE) = πR(aH) =
  (0, 90)  if aH = (0, 2)
  (50, 50) if aH = (1, 1)
  (90, 0)  if aH = (2, 0)

br(br(πE)) = πH(θ) =
  (0, 2)  if θ < 41/92
  (1, 1)  if 41/92 ≤ θ ≤ 51/92
  (2, 0)  if θ ≥ 51/92

≠ πE
→ present demonstrations for fast learning instead
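The thresholds 41/92 and 51/92 in the proof can be verified numerically. The sketch below assumes, as in the example, that H's payoff is the undiscounted sum of its immediate reward and the reward of the deployment action br(πE) assigns to its observation:

```python
from fractions import Fraction

# Deployment action R takes after observing each demonstration
br_R = {(0, 2): (0, 90), (1, 1): (50, 50), (2, 0): (90, 0)}

def total(aH, theta):
    # H's two-phase payoff: own action plus induced deployment action
    pa, qa = aH
    pr, qr = br_R[aH]
    return theta * (pa + pr) + (1 - theta) * (qa + qr)

def best(theta):
    return max(br_R, key=lambda a: total(a, theta))

eps = Fraction(1, 1000)
assert best(Fraction(41, 92) - eps) == (0, 2)
assert best(Fraction(41, 92) + eps) == (1, 1)
assert best(Fraction(51, 92) + eps) == (2, 0)
print("thresholds 41/92 and 51/92 confirmed")
```

The payoffs are 92(1 − θ), 51, and 92θ respectively, which is where the 41/92 and 51/92 crossing points come from.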
Generating Instructive Demonstrations
Compute H’s best response when R uses IRL
I simplification: assume R(s, aH , aR ; θ) = φ(s)T θ
I IRL objective changes:
find R∗ s.t. E [∑
t γtR∗(st)|π∗] ≥ E [
∑t γ
tR∗(st)|π] ∀π⇓
find θ∗ s.t. θ∗Tµ(π∗) ≥ θ∗Tµ(π) ∀π
where µ(π) = E [∑
t γtφ(st)|π] are expected feature counts
→ br(πE ) computes policy that matches observed feature countsµ from learning phase
Generating Instructive Demonstrations (contd.)
Idea:
- compute the feature counts μ(πE_θ) that R expects to see under the true parameters θ
- select the trajectory ζ that most closely matches these features

Approximate best response to feature matching:

  ζH ← argmax_ζ  φ(ζ)^T θ − η ||μ(πE_θ) − φ(ζ)||²

where φ(ζ) = ∑_{s ∈ ζ} φ(s)
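The selection rule above can be sketched directly: among candidate trajectories, trade off the trajectory's own reward φ(ζ)^T θ against matching the expected feature counts μ. The feature map, candidate set, and η below are illustrative placeholders, not the paper's experimental setup:

```python
def phi_traj(zeta, phi):
    # phi(zeta) = sum of per-state feature vectors along the trajectory
    return [sum(f) for f in zip(*(phi(s) for s in zeta))]

def instructive_demo(candidates, phi, theta, mu, eta=1.0):
    # zeta_H <- argmax_zeta  phi(zeta)^T theta - eta * ||mu - phi(zeta)||^2
    def score(zeta):
        fz = phi_traj(zeta, phi)
        reward = sum(t * f for t, f in zip(theta, fz))
        mismatch = sum((m - f) ** 2 for m, f in zip(mu, fz))
        return reward - eta * mismatch
    return max(candidates, key=score)

# Example: one-hot state features; theta favours state 0, but matching
# mu = [1, 1] pulls the demonstration towards visiting both states.
phi = lambda s: [1.0, 0.0] if s == 0 else [0.0, 1.0]
print(instructive_demo([[0, 0], [0, 1], [1, 1]], phi,
                       theta=[1.0, 0.0], mu=[1.0, 1.0]))  # [0, 1]
```

This is the sense in which an instructive demonstration can be individually suboptimal for H's reward while still being the best way to teach R.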
Experiment 0: Instructive Demonstration
[Three square plots from the CIRL paper, edited: Ground Truth, Expert Demonstration, and Instructive Demonstration, for θ = (0.1, 1.0, 1.5) with RBF features φ(s) = {e^(−||x − x_i||)}_i.]
Experiment 1: approximate br(br(πE)) vs. πE
Is the Approximate Best Response better, if R uses an IRL algorithm? Manipulated variables:
- H-policy: approximate br(br(πE)) (abbreviated br) vs. πE.
- Number of RBFs: 3, . . . , 10.
For a random sample of 500 θ_GT per experiment, measure:
- Regret: difference between the values of the optimal policies for θ_GT and θ.
- KL-divergence between the trajectory distributions induced by θ_GT and θ.
- Squared distance between parameters: ||θ_GT − θ||²
Experiment 1: findings; experiment 2: effect of MaxEnt λ
ANOVA: for all measures and all numbers of RBFs, br outperformed πE (F > 962, p < .0001). Interaction effect of the number of RBFs on the KL-divergence and the squared distance: the gap is larger with fewer RBFs.
Figure from CIRL paper.
In the MaxEnt algorithm, λ controls how optimally R expects H to behave: λ = 0 means H behaves independently of their preferences, whereas λ → ∞ means H behaves exactly optimally.
Conclusions and Future Work
- Game-theoretic model of learning from another, cooperative agent's demonstrations. Keys: the robot knows it is in a shared environment, and adopts the human's reward function as its own.
- Conceptual connections with RL with corrupted rewards, and some more tenuous ones with scalable supervision.
- Future work: explore the coordination problem; coordinating in reality is infeasible.
- The paper does not fully explore the consequences and emerging strategies of this new setting, focusing instead only on apprenticeship CIRL.
References I
P. Abbeel and A. Y. Ng, Apprenticeship learning via inversereinforcement learning, ICML, 2004.
D. Hadfield-Menell, S. J. Russell, P. Abbeel, and A. Dragan,Cooperative inverse reinforcement learning, NIPS, 2016,pp. 3909–3917.
A. Y. Ng and S. J. Russell, Algorithms for inversereinforcement learning, ICML, 2000, pp. 663–670.
D. Ramachandran and E. Amir, Bayesian inverse reinforcementlearning, IJCAI, 2007, pp. 2586–2591.
N. D. Ratliff, J. A. Bagnell, and M. A. Zinkevich, Maximummargin planning, ICML, 2006, pp. 729–736.
B. D. Ziebart, A. L. Maas, J. A. Bagnell, and A. K. Dey, Maximum entropy inverse reinforcement learning, AAAI, 2008, pp. 1433–1438.