
Learning to Share and Hide Intentions using Information Regularization

DJ Strouse1, Max Kleiman-Weiner2, Josh Tenenbaum2, Matt Botvinick3,4, David Schwab5

1 Princeton University, 2 MIT, 3 DeepMind, 4 UCL, 5 CUNY Graduate Center

    Abstract

Learning to cooperate with friends and compete with foes is a key component of multi-agent reinforcement learning. Typically to do so, one requires access to either a model of or interaction with the other agent(s). Here we show how to learn effective strategies for cooperation and competition in an asymmetric information game with no such model or interaction. Our approach is to encourage an agent to reveal or hide their intentions using an information-theoretic regularizer. We consider both the mutual information between goal and action given state, as well as the mutual information between goal and state. We show how to optimize these regularizers in a way that is easy to integrate with policy gradient reinforcement learning. Finally, we demonstrate that cooperative (competitive) policies learned with our approach lead to more (less) reward for a second agent in two simple asymmetric information games.

    1 Introduction

In order to effectively interact with others, an intelligent agent must understand the intentions of others. In order to successfully cooperate, collaborative agents that share their intentions will do a better job of coordinating their plans together [Tomasello et al., 2005]. This is especially salient when information pertinent to a goal is known asymmetrically between agents. When competing with others, a sophisticated agent might aim to hide this information from its adversary in order to deceive or surprise them. This type of sophisticated planning is thought to be a distinctive aspect of human intelligence compared to other animal species [Tomasello et al., 2005].

Furthermore, agents that share their intentions might have behavior that is more interpretable and understandable by people. Many reinforcement learning (RL) systems plan in ways that can seem opaque to an observer. In particular, when an agent's reward function is not aligned with the designer's goal, its behavior often deviates from what is intended [Hadfield-Menell et al., 2016]. If these agents are also trained to share high-level and often abstract information about their behavior (i.e. intentions), it is more likely that a human operator or collaborator can understand, predict, and explain that agent's decisions. This is a key requirement for building machines that people can trust.

Previous approaches have tackled aspects of this problem, but all share a similar structure [Dragan et al., 2013, Ho et al., 2016, Hadfield-Menell et al., 2016, Shafto et al., 2014]. They optimize their behavior against a known model of an observer which has a theory-of-mind [Baker et al., 2009, Ullman et al., 2009, Rabinowitz et al., 2018] or is doing some form of inverse-RL [Ng et al., 2000, Abbeel and Ng, 2004]. In this work we take an alternative approach based on an information-theoretic formulation of the problem of sharing and hiding intentions. This approach does not require an explicit model of or interaction with the other agent, which could be especially useful in settings where interactive training is expensive or dangerous. Our approach also naturally combines with scalable policy-gradient methods commonly used in deep reinforcement learning.

    32nd Conference on Neural Information Processing Systems (NeurIPS 2018), Montréal, Canada.

2 Hiding and revealing intentions via information-theoretic regularization

We consider multi-goal environments in the form of a discrete-time finite-horizon discounted Markov decision process (MDP) defined by the tuple M ≡ (S, A, G, P, ρG, ρS, r, γ, T), where S is a state set, A an action set, P : S × A × S → R+ a (goal-independent) probability distribution over transitions, G a goal set, ρG : G → R+ a distribution over goals, ρS : S → R+ a probability distribution over initial states, r : S × G → R a (goal-dependent) reward function, γ ∈ [0, 1] a discount factor, and T the horizon.

In each episode, a goal is sampled and determines the reward structure for that episode. One agent, Alice, will have access to this goal and thus knowledge of the environment's reward structure, while a second agent, Bob, will not and instead must infer it from observing Alice. We assume that Alice knows in advance whether Bob is a friend or foe and wants to make his task easier or harder, respectively, but that she has no model of him and must train without any interaction with him.

Of course, Alice also wishes to maximize her own expected reward
$$\eta[\pi] = \mathbb{E}_\tau\!\left[\sum_{t=0}^{T} \gamma^t\, r(s_t, g)\right],$$
where τ = (g, s0, a0, s1, a1, . . . , sT) denotes the episode trajectory, g ∼ ρG, s0 ∼ ρS, at ∼ πg(at | st), st+1 ∼ P(st+1 | st, at), and πg(a | s; θ) : G × S × A → R+ is Alice's goal-dependent probability distribution over actions (policy), parameterized by θ.

It is common in RL to consider loss functions of the form J[π] = η[π] + βℓ[π], where ℓ is a regularizer meant to help guide the agent toward desirable solutions. For example, the policy entropy is a common choice to encourage exploration [Mnih et al., 2016], while pixel prediction and control have been proposed to encourage exploration in visually rich environments with sparse rewards [Jaderberg et al., 2017].

The setting we imagine is one in which we would like Alice to perform well in a joint environment with rewards rjoint, but we are only able to train her in a solo setting with rewards rsolo. How do we make sure that Alice's learned behavior in the solo environment transfers well to the joint environment? We propose the training objective Jtrain = E[rsolo] + βI (where I is some task-relevant information measure) as a useful proxy for the test objective Jtest = E[rjoint]. The structure of rjoint determines whether the task is cooperative or competitive, and therefore the appropriate sign of β. For example, in the spatial navigation game of section 4.1, a competitive rjoint might provide +1 reward only to the first agent to reach the correct goal (and −1 for reaching the wrong one), whereas a cooperative rjoint might provide each of Alice and Bob with the sum of their individual rewards. In figure 2, we plot related metrics after training Alice with Jtrain. On the bottom row, we plot the percentage of time Alice beats Bob to the goal (which is her expected reward for the competitive rjoint). On the top row, we plot Bob's expected time steps per unit reward, relative to Alice's. Their combined steps per unit reward would be more directly related to the cooperative rjoint described above, but we plot Bob's individual contribution (relative to Alice's), since his individual contribution to the joint reward rate varies dramatically with β, whereas Alice's does not. We note that one advantage of our approach is that it unifies cooperative and competitive strategies in the same one-parameter (β) family.
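Written out with the discounted sums of the previous paragraphs made explicit (a restatement of the objectives above, not an additional assumption), the train and test objectives are:

```latex
J_{\text{train}}[\pi] = \mathbb{E}_\tau\!\left[\sum_{t=0}^{T}\gamma^t\, r_{\text{solo}}(s_t, g)\right] + \beta I,
\qquad
J_{\text{test}}[\pi] = \mathbb{E}_\tau\!\left[\sum_{t=0}^{T}\gamma^t\, r_{\text{joint}}(s_t, g)\right],
```

with β > 0 when Bob is a friend (Alice should share information) and β < 0 when he is a foe (she should hide it).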

Below, we will consider two different information regularizers meant to encourage/discourage Alice from sharing goal information with Bob: the (conditional) mutual information between goal and action given state, Iaction[π] ≡ I(A;G | S), which we will call the "action information," and the mutual information between state and goal, Istate[π] ≡ I(S;G), which we will call the "state information." Since the mutual information is a general measure of dependence (linear and non-linear) between two variables, Iaction and Istate measure the ease of inferring the goal from the actions and states, respectively, generated by the policy π. Thus, if Alice wants Bob to do well, she should choose a policy with high information, and vice versa if not.

We consider both action and state informations because they have different advantages and disadvantages. Using action information assumes that Bob (the observer) can see both Alice's states and actions, which may be unrealistic in some environments, such as one in which the actions are the torques a robot applies to its joints [Eysenbach et al., 2019]. Using state information instead only assumes that Bob can observe Alice's states (and not her actions); however, it does so at the cost of requiring Alice to count goal-dependent state frequencies under the current policy. Optimizing action information, on the other hand, does not require state counting. So, in summary, action information is simpler to optimize, but state information may be more appropriate in a setting where an observer cannot observe (or infer) the observee's actions.


The generality with which mutual information measures dependence is at once its biggest strength and weakness. On the one hand, using information allows Alice to prepare for interaction with Bob with neither a model of nor interaction with him. On the other hand, Bob might have limited computational resources (for example, perhaps his policy is linear with respect to his observations of Alice), and so he may not be able to "decode" all of the goal information that Alice makes available to him. Nevertheless, Iaction and Istate can at least be considered upper bounds on Bob's inference performance; if Iaction = 0 or Istate = 0, it would be impossible for Bob to guess the goal (above chance) from Alice's actions or states, respectively, alone.

Optimizing information can be equivalent to optimizing reward under certain conditions, such as in the following example. Consider Bob's subtask of identifying the correct goal in a 2-goal setup. If his belief over the goal is represented by p(g), then he should guess g∗ = argmax_g p(g), which results in error probability perr = 1 − max_g p(g). Since the binary entropy H(g) ≡ H[p(g)] increases monotonically with perr, optimizing one is equivalent to optimizing the other. Denoting the parts of Alice's behavior observable by Bob as x, H(g | x) is the post-observation entropy of Bob's beliefs, and optimizing it is equivalent to optimizing I(g; x) = H(g) − H(g | x), since the pre-observation entropy H(g) does not depend on Alice's behavior. If Bob receives reward r when identifying the right goal, and 0 otherwise, then his expected reward is (1 − perr) r. Thus, in this simplified setup, optimizing information is directly related to optimizing reward. In general, when one considers the temporal dynamics of an episode, more than two goals, or more complicated reward structures, the relationship becomes more complicated. However, information is useful in abstracting away that complexity, preparing Alice generically for a plethora of possible task setups.
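To keep the quantities in this two-goal example straight, the relations used in the argument above can be collected as follows (a restatement only, with x denoting the part of Alice's behavior that Bob observes):

```latex
p_{\mathrm{err}} = 1 - \max_g p(g), \qquad
\mathbb{E}[\text{Bob's reward}] = (1 - p_{\mathrm{err}})\, r, \qquad
I(g; x) = H(g) - H(g \mid x),
```

so that, with H(g) fixed, increasing I(g; x) lowers the post-observation entropy H(g | x), which for two goals lowers perr and raises Bob's expected reward; decreasing I(g; x) does the opposite.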

    2.1 Optimizing action information: Iaction ≡ I(A;G | S)

First, we discuss regularization via optimizing the mutual information between goal and action (conditioned on state), Iaction ≡ I(A;G | S), where G is the goal for the episode, A is the chosen action, and S is the state of the agent. That is, we will train an agent to maximize the objective Jaction[π] ≡ E[r] + βIaction, where β is a tradeoff parameter whose sign determines whether we want the agent to signal (positive) or hide (negative) their intentions, and whose magnitude determines the relative preference for rewards and intention signaling/hiding.

Iaction is a functional of the multi-goal policy πg(a | s) ≡ p(a | s, g), that is, the probability distribution over actions given the current goal and state, and is given by:

\begin{align}
I_{\text{action}} \equiv I(A;G \mid S) &= \sum_s p(s)\, I(A;G \mid S = s) \tag{1}\\
&= \sum_g \rho_G(g) \sum_s p(s \mid g) \sum_a \pi_g(a \mid s) \log \frac{\pi_g(a \mid s)}{p(a \mid s)}. \tag{2}
\end{align}

The quantity involving the sum over actions is a KL divergence between two distributions: the goal-dependent policy πg(a | s) and a goal-independent policy p(a | s). This goal-independent policy comes from marginalizing out the goal, that is, p(a | s) = Σ_g ρG(g) πg(a | s), and can be thought of as a fictitious policy that represents the agent's "habit" in the absence of knowing the goal. We will denote π0(a | s) ≡ p(a | s) and refer to it as the "base policy," whereas we will refer to πg(a | s) as simply the "policy." Thus, we can rewrite the information above as:

$$I_{\text{action}} = \sum_g \rho_G(g) \sum_s p(s \mid g)\, \mathrm{KL}\big[\pi_g(a \mid s) \mid \pi_0(a \mid s)\big] = \mathbb{E}_\tau\Big[\mathrm{KL}\big[\pi_g(a \mid s) \mid \pi_0(a \mid s)\big]\Big]. \tag{3}$$

Writing the information this way suggests a method for stochastically estimating it. First, we sample a goal g from p(g), that is, we initialize an episode of some task. Next, we sample states s from p(s | g), that is, we generate state trajectories using our policy πg(a | s). At each step, we measure the KL between the policy and the base policy. Averaging this quantity over episodes and steps gives us our estimate of Iaction.
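As a concrete illustration, this estimate can be computed for a tabular goal-conditioned policy as in the sketch below (the array layout, the sample_episode helper, and the episode count are illustrative assumptions, not the authors' implementation):

```python
import numpy as np

def kl(p, q, eps=1e-12):
    """KL divergence between two discrete action distributions."""
    return np.sum(p * (np.log(p + eps) - np.log(q + eps)))

def estimate_I_action(pi, rho_G, sample_episode, n_episodes=1000):
    """Monte Carlo estimate of I(A;G|S) = E_tau[ KL[pi_g(.|s) | pi_0(.|s)] ].

    pi[g, s] is the action distribution of a tabular goal-conditioned policy,
    rho_G is the goal distribution, and sample_episode(g) returns the states
    visited in one episode generated under goal g (a stand-in for the MDP).
    """
    # Base policy pi_0(a|s) = sum_g rho_G(g) pi_g(a|s), the goal-marginalized "habit".
    pi0 = np.einsum('g,gsa->sa', rho_G, pi)
    kls = []
    for _ in range(n_episodes):
        g = np.random.choice(len(rho_G), p=rho_G)    # sample a goal from rho_G
        for s in sample_episode(g):                  # states sampled from p(s|g)
            kls.append(kl(pi[g, s], pi0[s]))         # per-step KL[pi_g | pi_0]
    return float(np.mean(kls))                       # average over episodes and steps
```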

Optimizing Iaction with respect to the policy parameters θ is a bit trickier, however, because the expectation above is with respect to a distribution that depends on θ.


Algorithm 1 Action information regularized REINFORCE with value baseline.

Input: β, ρG, γ, and ability to sample MDP M
Initialize π, parameterized by θ
Initialize V, parameterized by φ
for i = 1 to Nepisodes do
    Generate trajectory τ = (g, s0, a0, s1, a1, . . . , sT)
    for t = 0 to T − 1 do
        Update policy in direction of ∇θ Jaction(t) using equation 6
        Update value in direction of −∇φ (Vg(st) − R̃t)², with r̃(t) according to equation 7
    end for
end for

Thus, the gradient of Iaction with respect to θ has two terms:

\begin{align}
\nabla_\theta I_{\text{action}} &= \sum_g \rho_G(g) \sum_s \left(\nabla_\theta p(s \mid g)\right) \mathrm{KL}\big[\pi_g(a \mid s) \mid \pi_0(a \mid s)\big] \tag{4}\\
&\quad + \sum_g \rho_G(g) \sum_s p(s \mid g)\, \nabla_\theta \mathrm{KL}\big[\pi_g(a \mid s) \mid \pi_0(a \mid s)\big]. \tag{5}
\end{align}

The second term involves the same sum over goals and states as in equation 3, so it can be written as an expectation over trajectories, Eτ[∇θ KL[πg(a | s) | π0(a | s)]], and therefore is straightforward to estimate from samples. The first term is more cumbersome, however, since it requires us to model (the policy dependence of) the goal-dependent state probabilities, which in principle involves knowing the dynamics of the environment. Perhaps surprisingly, however, the gradient can still be estimated purely from sampled trajectories, by employing the so-called "log derivative" trick to rewrite the term as an expectation over trajectories. The calculation is identical to the proof of the policy gradient theorem [Sutton et al., 2000], except with reward replaced by the KL divergence above.

    The resulting Monte Carlo policy gradient (MCPG) update is:

$$\nabla_\theta J_{\text{action}}(t) = A_{\text{action}}(t)\, \nabla_\theta \log \pi_g(a_t \mid s_t) + \beta\, \nabla_\theta \mathrm{KL}\big[\pi_g(a \mid s_t) \mid \pi_0(a \mid s_t)\big], \tag{6}$$
where Aaction(t) ≡ R̃t − Vg(st) is a modified advantage, Vg(st) is a goal-state value function regressed toward R̃t, R̃t = Σ_{t′=t}^{T} γ^{t′−t} r̃_{t′} is a modified return, and the following is the modified reward feeding into that return:
$$\tilde{r}_t \equiv r_t + \beta\, \mathrm{KL}\big[\pi_g(a \mid s_t) \mid \pi_0(a \mid s_t)\big]. \tag{7}$$

The second term in equation 6 encourages the agent to alter the policy to share or hide information in the present state. The first term, on the other hand, encourages modifications which lead the agent to states in the future which result in reward and the sharing or hiding of information. Together, these terms optimize Jaction. This procedure is summarized in Algorithm 1.
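For concreteness, a minimal per-episode update implementing equations 6 and 7 for a tabular softmax policy might look as follows (a sketch in PyTorch; the tensor shapes, the use of automatic differentiation for the KL term, and all names are our own assumptions, not the authors' TensorFlow implementation):

```python
import torch

def action_info_reinforce_episode(logits, values, goal, states, actions,
                                  rewards, rho_G, beta, gamma, optimizer):
    """One episode of REINFORCE with the action-information bonus (eqs. 6-7).

    logits: [n_goals, n_states, n_actions] policy parameters (theta);
    values: [n_goals, n_states] value table (phi); both registered with
    `optimizer`. goal is an int, states/actions are LongTensors of length T,
    rewards is a float tensor of length T. All names are illustrative.
    """
    pi = torch.softmax(logits, dim=-1)                   # pi_g(a | s)
    pi0 = (rho_G[:, None, None] * pi).sum(dim=0)         # base policy pi_0(a | s)
    pi_gs = pi[goal, states]                             # [T, n_actions]
    kls = (pi_gs * (pi_gs.log() - pi0[states].log())).sum(dim=-1)  # KL per step

    r_tilde = rewards + beta * kls.detach()              # modified reward (eq. 7)
    T = len(rewards)
    returns = torch.zeros(T)                             # R~_t = sum_{t'>=t} gamma^{t'-t} r~_{t'}
    running = torch.tensor(0.0)
    for t in reversed(range(T)):
        running = r_tilde[t] + gamma * running
        returns[t] = running

    advantage = returns - values[goal, states].detach()  # A_action(t)
    logp = pi_gs.gather(1, actions[:, None]).squeeze(1).log()
    # Ascend eq. 6: score-function term plus direct gradient of the KL bonus.
    policy_loss = -(advantage * logp + beta * kls).sum()
    value_loss = ((values[goal, states] - returns) ** 2).sum()

    optimizer.zero_grad()
    (policy_loss + value_loss).backward()
    optimizer.step()
```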

    2.2 Optimizing state information: Istate ≡ I(S;G)

We now consider how to regularize an agent by the information its states give away about the goal, using the mutual information between state and goal, Istate ≡ I(S;G). This can be written:

$$I_{\text{state}} = \sum_g \rho_G(g) \sum_s p(s \mid g) \log \frac{p(s \mid g)}{p(s)} = \mathbb{E}_\tau\!\left[\log \frac{p(s \mid g)}{p(s)}\right]. \tag{8}$$

In order to estimate this quantity, we could track and plug into the above equation the empirical state frequencies pemp(s | g) ≡ Ng(s)/Ng and pemp(s) ≡ N(s)/N, where Ng(s) is the number of times state s was visited during episodes with goal g, Ng ≡ Σ_s Ng(s) is the total number of steps taken under goal g, N(s) ≡ Σ_g Ng(s) is the number of times state s was visited across all goals, and N ≡ Σ_{g,s} Ng(s) = Σ_g Ng = Σ_s N(s) is the total number of state visits across all goals and states. Thus, keeping a moving average of log[pemp(st | g)/pemp(st)] across episodes and steps yields an estimate of Istate.
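A count-based estimator of this kind can be sketched in a few lines (illustrative only; the class name and data structure are our own choices, and a batch average over recorded visits is used here in place of the moving average described above):

```python
import numpy as np
from collections import defaultdict

class StateInfoEstimator:
    """Empirical-count estimate of I(S;G) from visited (goal, state) pairs."""

    def __init__(self):
        self.counts = defaultdict(float)     # N_g(s)

    def update(self, goal, states):
        """Record the states visited in one episode with the given goal."""
        for s in states:
            self.counts[(goal, s)] += 1.0

    def log_ratio(self, goal, state):
        """Per-visit term log p_emp(s | g) - log p_emp(s)."""
        N_gs = self.counts[(goal, state)]
        N_g = sum(c for (g, _), c in self.counts.items() if g == goal)
        N_s = sum(c for (_, s), c in self.counts.items() if s == state)
        N = sum(self.counts.values())
        return np.log(N_gs / N_g) - np.log(N_s / N)

    def estimate(self):
        """Average the log ratio over all recorded visits (approximates eq. 8)."""
        N = sum(self.counts.values())
        return sum(c * self.log_ratio(g, s)
                   for (g, s), c in self.counts.items()) / N
```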


Algorithm 2 State information regularized REINFORCE with value baseline.

Input: β, ρG, γ, and ability to sample MDP M
Initialize π, parameterized by θ
Initialize V, parameterized by φ
Initialize the state counts Ng(s)
for i = 1 to Nepisodes do
    Generate trajectory τ = (g, s0, a0, s1, a1, . . . , sT)
    Update Ng(s) (and therefore pemp(s | g)) according to τ
    for t = 0 to T − 1 do
        Update policy in direction of ∇θ Jstate(t) using equation 11
        Update value in direction of −∇φ (Vg(st) − R̃t)², with r̃(t) according to equation 12
    end for
end for

However, we are of course interested in optimizing Istate, and so, as in the last section, we need to employ a slightly more sophisticated estimation procedure. Taking the gradient of Istate with respect to the policy parameters θ, we get:

\begin{align}
\nabla_\theta I_{\text{state}} &= \sum_g \rho_G(g) \sum_s \left(\nabla_\theta p(s \mid g)\right) \log \frac{p(s \mid g)}{p(s)} \tag{9}\\
&\quad + \sum_g \rho_G(g) \sum_s p(s \mid g) \left(\frac{\nabla_\theta p(s \mid g)}{p(s \mid g)} - \frac{\nabla_\theta p(s)}{p(s)}\right). \tag{10}
\end{align}

The calculation is similar to that for evaluating ∇θ Iaction, and details can be found in section S1. The resulting MCPG update is:

$$\nabla_\theta J_{\text{state}}(t) = A_{\text{state}}(t)\, \nabla_\theta \log \pi_g(a_t \mid s_t) - \beta \sum_{g' \neq g} \rho_G(g')\, R_{\text{cf}}(t, g, g')\, \nabla_\theta \log \pi_{g'}(a_t \mid s_t), \tag{11}$$
where Astate(t) ≡ R̃t − Vg(st) is a modified advantage, Vg(st) is a goal-state value function regressed toward R̃t, R̃t ≡ Σ_{t′=t}^{T} γ^{t′−t} r̃_{t′} is a modified return, Rcf(t, g, g′) ≡ Σ_{t′=t}^{T} γ^{t′−t} rcf(t′, g, g′) is a "counterfactual goal return", and the following are a modified reward and a "counterfactual goal reward", respectively, which feed into the above returns:
$$\tilde{r}_t \equiv r_t + \beta \left(1 - p_{\text{emp}}(g \mid s_t) + \log \frac{p_{\text{emp}}(s_t \mid g)}{p_{\text{emp}}(s_t)}\right) \tag{12}$$
$$r_{\text{cf}}(t, g, g') \equiv \left[\prod_{t'=0}^{t} \frac{\pi_{g'}(a_{t'} \mid s_{t'})}{\pi_g(a_{t'} \mid s_{t'})}\right] \frac{p_{\text{emp}}(s_t \mid g)}{p_{\text{emp}}(s_t)}, \tag{13}$$

where pemp(g | st) ≡ ρG(g) pemp(st | g)/pemp(st). The modified reward can be viewed as adding a "state uniqueness bonus" log[pemp(st | g)/pemp(st)] that tries to increase the frequency of the present state under the present goal to the extent that the present state is more common under the present goal. If the present state is less common than average under the present goal, then this bonus becomes a penalty. The counterfactual goal reward, on the other hand, tries to make the present state less common under other goals, and is again scaled by the uniqueness of the present state under the present goal, pemp(st | g)/pemp(st). It also includes importance sampling weights to account for the fact that the trajectory was generated under the current goal, but the policy is being modified under other goals. This procedure is summarized in Algorithm 2.
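To illustrate the counterfactual term, the returns Rcf(t, g, g′) in equation 11 could be computed from a sampled trajectory as sketched below (illustrative assumptions: a tabular policy array, empirical-frequency arrays, and names of our own choosing; the sum here runs over the T action steps of the trajectory):

```python
import numpy as np

def counterfactual_returns(pi, states, actions, goal, other_goal,
                           p_emp_sg, p_emp_s, gamma):
    """Counterfactual goal returns R_cf(t, g, g') of eqs. 11 and 13 (sketch).

    pi[g, s, a]: tabular goal-conditioned policy; p_emp_sg[s, g], p_emp_s[s]:
    empirical state frequencies; states/actions: the episode's s_t and a_t.
    """
    T = len(actions)
    # Cumulative importance weights prod_{t'<=t} pi_g'(a_t'|s_t') / pi_g(a_t'|s_t').
    iw = np.cumprod([pi[other_goal, s, a] / pi[goal, s, a]
                     for s, a in zip(states[:T], actions)])
    # Per-step counterfactual rewards r_cf(t, g, g') (eq. 13).
    r_cf = iw * np.array([p_emp_sg[s, goal] / p_emp_s[s] for s in states[:T]])
    # Discount-to-go: R_cf(t, g, g') = sum_{t'>=t} gamma^(t'-t) r_cf(t', g, g').
    R_cf = np.zeros(T)
    running = 0.0
    for t in reversed(range(T)):
        running = r_cf[t] + gamma * running
        R_cf[t] = running
    return R_cf
```

The per-step update of equation 11 would then subtract β Σ_{g′≠g} ρG(g′) Rcf(t, g, g′) ∇θ log πg′(at | st) alongside the usual advantage-weighted score-function term.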

    3 Related work

Whye Teh et al. [2017] recently proposed an algorithm similar to our action information regularized approach (Algorithm 1), but with very different motivations. They argued that constraining goal-specific policies to be close to a distilled base policy promotes transfer by sharing knowledge across goals. Due to this difference in motivation, they only explored the β < 0 regime (i.e. our "competitive" regime). They also did not derive their update from an information-theoretic cost function, but instead proposed the update directly. Because of this, their approach differs in that it did not include the β∇θKL[πg | π0] term, and instead only included the modified return. Moreover, they did not calculate the full KLs in the modified return, but instead estimated them from single samples (e.g. KL[πg(a | st) | π0(a | st)] ≈ log[πg(at | st)/π0(at | st)]). Nevertheless, the similarity of our approaches suggests a link between transfer and competitive strategies, although we do not explore this here.

Eysenbach et al. [2019] also recently proposed an algorithm similar to ours, which used both Istate and Iaction but with the "goal" replaced by a randomly sampled "skill" label in an unsupervised setting (i.e. no reward). Their motivation was to learn a diversity of skills that would later be useful for a supervised (i.e. reward-yielding) task. Their approach to optimizing Istate differs from ours in that it uses a discriminator, a powerful approach but one that, in our setting, would imply a more specific model of the observer, which we wanted to avoid.

Tsitsiklis and Xu [2018] derive an inverse tradeoff between an agent's delay in reaching a goal and the ability of an adversary to predict that goal. Their approach relies on a number of assumptions about the environment (e.g. the agent's only source of reward is reaching the goal, the opponent need only identify the correct goal and not reach it as well, the goal distribution is nearly uniform), but is suggestive of the general tradeoff. It is an interesting open question as to under what conditions our information-regularized approach achieves the optimal tradeoff.

Dragan et al. [2013] considered training agents to reveal their goals (in the setting of a robot grasping task), but did so by building an explicit model of the observer. Ho et al. [2016] used a similar approach to capture human-generated actions that "show" a goal, again relying on an explicit model of the observer. There is also a long history of work on training RL agents to cooperate and compete through interactive training and a joint reward (e.g. [Littman, 1994, 2001, Kleiman-Weiner et al., 2016, Leibo et al., 2017, Peysakhovich and Lerer, 2018, Hughes et al., 2018]), or through modeling one's effect on another agent's learning or behavior (e.g. [Foerster et al., 2018, Jaques et al., 2018]). Our approach differs in that it requires neither access to an opponent's rewards, nor even interaction with or a model of the opponent. Without this knowledge, one can still be cooperative (competitive) with others by being as (un)clear as possible about one's own intentions. Our work achieves this by directly optimizing the information shared.

    4 Experiments

We demonstrate the effectiveness of our approach in two stages. First, we show that training Alice (who has access to the goal of the episode) with information regularization effectively encourages both goal signaling and hiding, depending on the sign of the coefficient β. Second, we show that Alice's goal signaling and hiding translate to higher and lower rates of reward acquisition, respectively, for Bob (who does not have access to the goal and must infer it from observing Alice). We demonstrate these results in two different simple settings. Our code is available at https://github.com/djstrouse/InfoMARL.

    4.1 Spatial navigation

The first setting we consider is a simple grid world spatial navigation task, where we can fully visualize and understand Alice's regularized policies. The 5 × 5 environment contains two possible goals: the top left state or the top right. On any given episode, one goal is chosen randomly (so ρG is uniform) and that goal state is worth +1 reward; the other goal state is then worth −1. Both are terminal. Alice and Bob each spawn in a random (non-terminal) state and take actions in A = {left, right, up, down, stay}. A step into a wall is equivalent to the stay action but results in a penalty of −.1 reward. We first train Alice alone, and then freeze her parameters and introduce Bob.
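A minimal environment along these lines might look as follows (a sketch under the assumptions stated in the comments, not the released code in the repository above):

```python
import numpy as np

class TwoGoalGridWorld:
    """Sketch of the 5x5 navigation task described above (illustrative only).
    Goals are the top-left and top-right corners; rewards match the text."""

    ACTIONS = {0: (0, -1), 1: (0, 1), 2: (-1, 0), 3: (1, 0), 4: (0, 0)}  # L, R, U, D, stay

    def __init__(self, size=5):
        self.size = size
        self.goals = [(0, 0), (0, size - 1)]             # top-left, top-right

    def reset(self):
        self.goal = int(np.random.randint(2))            # rho_G uniform over the two goals
        while True:
            pos = (int(np.random.randint(self.size)), int(np.random.randint(self.size)))
            if pos not in self.goals:                    # spawn in a non-terminal state
                break
        self.pos = pos
        return self.goal, self.pos

    def step(self, action):
        dr, dc = self.ACTIONS[action]
        r, c = self.pos[0] + dr, self.pos[1] + dc
        reward = 0.0
        if not (0 <= r < self.size and 0 <= c < self.size):
            r, c = self.pos                              # stepping into a wall = stay
            reward -= 0.1                                # wall penalty
        self.pos = (r, c)
        done = self.pos in self.goals
        if done:                                         # +1 correct goal, -1 wrong goal
            reward += 1.0 if self.pos == self.goals[self.goal] else -1.0
        return self.pos, reward, done
```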

Alice was trained using implementations of Algorithms 1 and 2 in TensorFlow [Abadi et al., 2016]. Given the small, discrete environment, we used tabular representations for both π and V. See section S2.1 for training parameters.

Examples of Alice's resulting policies are shown in figure 1. The top row contains policies regularized with Iaction, the bottom with Istate. The left column contains "cooperative" policies encouraged to share goal information (β = .025), the middle "ambivalent" policies that are unregularized (β = 0), and the right "competitive" policies encouraged to hide goal information (β = −.025).


Figure 1: Information-regularized policies. Top row: regularization with Iaction. Bottom row: regularization with Istate. Left column: β = .025. Center column: β = 0. Right column: β = −.025. See main text for additional details.

In each case, two policies are shown: at left, the policy when the goal is in the top left, and at right, when it is in the top right. The arrows are proportional to Alice's action probabilities. The color scale indicates KL[πg | π0] and log[pemp(s | g)/pemp(s)] in bits for action and state information regularization, respectively; bluer indicates more goal information shared in that state.

In the cooperative case with action information regularization, Alice wants to maximize KL[πg | π0] and thus she wants her goal-dependent policies to differ as much as possible. This primarily results in her going left in most states when the goal is in the top left, and right when it is in the top right. She can also choose to have some probability of going up in those states, as long as she only does so under one goal or the other and not both. Ultimately this means that no matter what state Alice spawns in, she immediately signals the goal with her very first action, and continues to do so with each subsequent action.

In the competitive case with action information regularization, Alice instead wants to minimize KL[πg | π0] and thus she wants her goal-dependent policies to match as much as possible. In the present case, this means that she chooses to go up as far as she can until she must pivot in order to still reach the goal in the minimal number of steps. This leads to her hiding the goal for as long as possible without sacrificing reward.

Without information regularization, Alice's policy is a random mixture of the competitive and cooperative strategies, the details of which are determined by initialization and the randomness of training trajectories. Regularization with state information yields similar policies to those generated with action information in all three cases, since in this deterministic setting, different action choices are tightly coupled with different state trajectories.

To demonstrate that Alice's goal revealing and hiding behaviors are useful for cooperation and competition, respectively, we then trained a second agent, Bob, who does not have access to the goal and instead must infer it from observing Alice. Thus, while Alice's inputs at time t were the present goal g and her state $s_t^{\text{alice}}$, Bob's are Alice's present state and action $s_t^{\text{alice}}$ and $a_t^{\text{alice}}$, as well as his own state $s_t^{\text{bob}}$. Details are available in section S2.1, but in brief, Bob processes Alice's state-action trajectories with an RNN to form a belief about the goal, which then feeds into his policy, all of which is trained end-to-end via REINFORCE.

We trained 5 of each of the 3 versions of Alice above, and 10 Bobs per Alice. We plot the results for the best performing Bob for each Alice (so 5 × 3 = 15 curves) in figure 2. We use all 5 Alices to estimate the variance in our approach, but the best-of-10 Bob to provide a reasonable estimate of the best performance of a friend/foe.

We measure Bob's performance in terms of his episode length, relative to Alice's, as well as the percentage of time he beats Alice to the goal. For both action and state information regularization, encouraging Alice to hide goal information leads to Bob taking about 30% longer to reach the goal, relative to when Alice is encouraged to share goal information. Information-hiding Alice receives a boost of similar magnitude in the frequency with which she beats Bob to the goal. Training without information regularization leads to results in between the competitive and cooperative strategies, although closer to the cooperative strategy in this case.


Figure 2: The effect of Alice hiding/sharing goal information on Bob's performance. Left column: regularization with Iaction. Right column: regularization with Istate. Top row: Bob's episode length relative to Alice's (moving average over 500 episodes). Bottom row: the percentage of time Alice beats Bob to the goal (moving average over 1000 episodes).

We also note that the variance in Bob's performance was higher for the unregularized case, and much higher for the competitive case, with nearly zero variance in performance for the cooperative case, indicating that information hiding and sharing make training harder and easier, respectively.

    4.2 Key-and-door game

In the above spatial navigation task, information regularization of Alice breaks symmetries between equally-preferred (in terms of reward) navigation strategies. However, in many scenarios, it might be worthwhile for an agent to give up some reward if it means large gains in the ability to hide or share information. To demonstrate that our approach could also discover such "lossy" strategies, we designed a simple key-and-door game with this feature (figure 3, left).

It is again a two-goal (door) game with the same action space and reward structure as the spatial navigation setting. Again, Alice alone receives the goal, and Bob must infer it from observing her. The difference is that, in order to enter the terminal states, Alice and Bob must first pick up an appropriate key. Each agent has goal-specific keys that only they can pick up (top/bottom rows, color-coded to door, labeled with A/B for Alice/Bob). Alice also has access to a master key that can open both doors (center right). Agents can only pick up one key per episode - the first they encounter. Bob spawns in the same location every time (the "B"), while Alice spawns in any of the 3 spaces between her two goal-specific keys (the "A" and the spaces above/below). This means that Bob has a shorter path to the goals, and thus if Alice telegraphs the goal right away, Bob will beat her to it. While Alice's master key is strictly on a longer path to the goal, picking it up allows her to delay informing Bob of the goal such that she can beat him to it.

We trained Alice with action information regularization as in the previous section (see section S2.2 for training parameters). When unregularized or encouraged to share goal information (β = .25), Alice simply took the shortest path to the goal, never picking up the master key. When Bob was trained on these Alices, he beat/tied her to the goal on approximately 100% of episodes (figure 3, right). When encouraged to hide information (β = −.25), however, we found that Alice learned to take the longer path via the master key on about half of initializations (example in figure 3, center). When Bob was trained on these Alices, he beat/tied her to the goal much less than half the time (figure 3, right). Thus, our approach successfully encourages Alice to forgo rewards during solo training in order to later compete more effectively in an interactive setting.



Figure 3: Key-and-door game results. Left: depiction of the game. Center: percentage of episodes in which Alice picks up the goal-specific vs master key during training in an example run (moving average over 100 episodes). Right: percentage of episodes in which Bob beats/ties Alice to the goal (moving average over 1000 episodes).

    5 Discussion

In this work, we developed a new framework for building agents that balance reward-seeking with information-hiding/sharing behavior. We demonstrated that our approach allows agents to learn effective cooperative and competitive strategies in asymmetric information games without an explicit model of or interaction with the other agent(s). Such an approach could be particularly useful in settings where interactive training with other agents could be dangerous or costly, such as the training of expensive robots or the deployment of financial trading strategies.

We have here focused on simple environments with discrete and finite states, goals, and actions, and so we briefly describe how to generalize our approach to more complex environments. When optimizing Iaction with many or continuous actions, one could stochastically approximate the action sum in KL[πg | π0] and its gradient (as in [Whye Teh et al., 2017]). Alternatively, one could choose a form for the policy πg and base policy π0 such that the KL is analytic. For example, it is common for πg to be Gaussian when actions are continuous. If one also chooses to use a Gaussian approximation for π0 (forming a variational bound on Iaction), then KL[πg | π0] is closed form. For optimizing Istate with continuous states, one can no longer count states exactly, so these counts could be replaced with, for example, a pseudo-count based on an approximate density model [Bellemare et al., 2016, Ostrovski et al., 2017]. Of course, for both types of information regularization, continuous states or actions also necessitate using function approximation for the policy representation. Finally, although we have assumed access to the goal distribution ρG, one could also approximate it from experience.
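As an illustration of the closed-form option, the KL between diagonal Gaussians, which one would use if both πg and a Gaussian approximation to π0 were parameterized by means and log standard deviations, can be computed as follows (a generic sketch, not part of the paper's experiments):

```python
import numpy as np

def kl_diag_gaussians(mu_g, logstd_g, mu_0, logstd_0):
    """Closed-form KL[ N(mu_g, diag(sigma_g^2)) | N(mu_0, diag(sigma_0^2)) ],
    playing the role of KL[pi_g | pi_0] for Gaussian policies (illustrative)."""
    var_g, var_0 = np.exp(2 * logstd_g), np.exp(2 * logstd_0)
    # Per-dimension: log(sigma_0/sigma_g) + (sigma_g^2 + (mu_g - mu_0)^2)/(2 sigma_0^2) - 1/2
    return np.sum(logstd_0 - logstd_g
                  + (var_g + (mu_g - mu_0) ** 2) / (2 * var_0)
                  - 0.5)
```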

    Acknowledgements

The authors would like to acknowledge Dan Roberts and our anonymous reviewers for careful comments on the original draft; Jane Wang, David Pfau, and Neil Rabinowitz for discussions on the original idea; and funding from the Hertz Foundation (DJ and Max), the Center for Brains, Minds and Machines (NSF #1231216) (Max and Josh), the NSF Center for the Physics of Biological Function (PHY-1734030) (David), and a Simons Investigator award in the MMLS (David).

    References

Martín Abadi, Paul Barham, Jianmin Chen, Zhifeng Chen, Andy Davis, Jeffrey Dean, Matthieu Devin, Sanjay Ghemawat, Geoffrey Irving, Michael Isard, Manjunath Kudlur, Josh Levenberg, Rajat Monga, Sherry Moore, Derek G. Murray, Benoit Steiner, Paul Tucker, Vijay Vasudevan, Pete Warden, Martin Wicke, Yuan Yu, and Xiaoqiang Zheng. Tensorflow: A system for large-scale machine learning. In Proceedings of the 12th USENIX Conference on Operating Systems Design and Implementation, 2016.

Pieter Abbeel and Andrew Y Ng. Apprenticeship learning via inverse reinforcement learning. In Proceedings of the Twenty-First International Conference on Machine Learning (ICML), 2004.

Chris L Baker, Rebecca Saxe, and Joshua B Tenenbaum. Action understanding as inverse planning. Cognition, 113(3):329–349, 2009.


Marc Bellemare, Sriram Srinivasan, Georg Ostrovski, Tom Schaul, David Saxton, and Remi Munos. Unifying count-based exploration and intrinsic motivation. In Advances in Neural Information Processing Systems (NIPS) 29, pages 1471–1479, 2016.

Kyunghyun Cho, Bart van Merrienboer, Çaglar Gülçehre, Dzmitry Bahdanau, Fethi Bougares, Holger Schwenk, and Yoshua Bengio. Learning phrase representations using RNN encoder-decoder for statistical machine translation. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 1724–1734, 2014.

Anca D. Dragan, Kenton C.T. Lee, and Siddhartha S. Srinivasa. Legibility and predictability of robot motion. International Conference on Human-Robot Interaction (HRI), pages 301–308, 2013.

Benjamin Eysenbach, Abhishek Gupta, Julian Ibarz, and Sergey Levine. Diversity is all you need: Learning skills without a reward function. In International Conference on Learning Representations (ICLR), 2019.

Jakob Foerster, Richard Y. Chen, Maruan Al-Shedivat, Shimon Whiteson, Pieter Abbeel, and Igor Mordatch. Learning with opponent-learning awareness. In Proceedings of the 17th International Conference on Autonomous Agents and MultiAgent Systems (AAMAS), pages 122–130, 2018.

Dylan Hadfield-Menell, Stuart J Russell, Pieter Abbeel, and Anca Dragan. Cooperative inverse reinforcement learning. In Advances in Neural Information Processing Systems (NIPS) 29, pages 3909–3917, 2016.

Mark K Ho, Michael Littman, James MacGlashan, Fiery Cushman, and Joseph L Austerweil. Showing versus doing: Teaching by demonstration. In Advances in Neural Information Processing Systems (NIPS) 29, pages 3027–3035, 2016.

Edward Hughes, Joel Z Leibo, Matthew Phillips, Karl Tuyls, Edgar Dueñez Guzman, Antonio García Castañeda, Iain Dunning, Tina Zhu, Kevin McKee, Raphael Koster, Heather Roff, and Thore Graepel. Inequity aversion improves cooperation in intertemporal social dilemmas. In Advances in Neural Information Processing Systems (NIPS) 31, pages 3330–3340, 2018.

Max Jaderberg, Volodymyr Mnih, Wojciech Marian Czarnecki, Tom Schaul, Joel Z. Leibo, David Silver, and Koray Kavukcuoglu. Reinforcement learning with unsupervised auxiliary tasks. In International Conference on Learning Representations (ICLR), 2017.

Natasha Jaques, Angeliki Lazaridou, Edward Hughes, Çaglar Gülçehre, Pedro A. Ortega, DJ Strouse, Joel Z. Leibo, and Nando de Freitas. Intrinsic social motivation via causal influence in multi-agent RL. CoRR, abs/1810.08647, 2018.

Max Kleiman-Weiner, Mark K Ho, Joseph L Austerweil, Michael L Littman, and Joshua B Tenenbaum. Coordinate to cooperate or compete: abstract goals and joint intentions in social interaction. In Proceedings of the 38th Annual Conference of the Cognitive Science Society, 2016.

Joel Z Leibo, Vinicius Zambaldi, Marc Lanctot, Janusz Marecki, and Thore Graepel. Multi-agent reinforcement learning in sequential social dilemmas. In Proceedings of the 16th Conference on Autonomous Agents and MultiAgent Systems (AAMAS), pages 464–473, 2017.

Michael L Littman. Markov games as a framework for multi-agent reinforcement learning. In Proceedings of the 11th International Conference on Machine Learning (ICML), pages 157–163, 1994.

Michael L Littman. Friend-or-foe Q-learning in general-sum games. In Proceedings of the 18th International Conference on Machine Learning (ICML), pages 322–328, 2001.

Volodymyr Mnih, Adria Puigdomenech Badia, Mehdi Mirza, Alex Graves, Timothy Lillicrap, Tim Harley, David Silver, and Koray Kavukcuoglu. Asynchronous methods for deep reinforcement learning. In Proceedings of The 33rd International Conference on Machine Learning (ICML), pages 1928–1937, 2016.

Andrew Y Ng, Stuart J Russell, et al. Algorithms for inverse reinforcement learning. In Proceedings of the 17th International Conference on Machine Learning (ICML), pages 663–670, 2000.


Georg Ostrovski, Marc G. Bellemare, Aäron van den Oord, and Rémi Munos. Count-based exploration with neural density models. In Proceedings of the 34th International Conference on Machine Learning (ICML), pages 2721–2730, 2017.

Alexander Peysakhovich and Adam Lerer. Prosocial learning agents solve generalized stag hunts better than selfish ones. In Proceedings of the 17th International Conference on Autonomous Agents and MultiAgent Systems (AAMAS), pages 2043–2044, 2018.

Neil Rabinowitz, Frank Perbet, Francis Song, Chiyuan Zhang, S. M. Ali Eslami, and Matthew Botvinick. Machine theory of mind. In Proceedings of the 35th International Conference on Machine Learning (ICML), pages 4218–4227, 2018.

Patrick Shafto, Noah D Goodman, and Thomas L Griffiths. A rational account of pedagogical reasoning: Teaching by, and learning from, examples. Cognitive Psychology, 71:55–89, 2014.

Richard S Sutton, David A. McAllester, Satinder P. Singh, and Yishay Mansour. Policy gradient methods for reinforcement learning with function approximation. In Advances in Neural Information Processing Systems (NIPS) 12, pages 1057–1063, 2000.

Michael Tomasello, Malinda Carpenter, Josep Call, Tanya Behne, and Henrike Moll. Understanding and sharing intentions: The origins of cultural cognition. Behavioral and Brain Sciences, 28(05):675–691, 2005.

John N. Tsitsiklis and Kuang Xu. Delay-predictability trade-offs in reaching a secret goal. Operations Research, 66(2):587–596, 2018.

Tomer Ullman, Chris Baker, Owen Macindoe, Owain Evans, Noah Goodman, and Joshua B. Tenenbaum. Help or hinder: Bayesian models of social goal inference. In Advances in Neural Information Processing Systems (NIPS) 22, pages 1874–1882, 2009.

Yee Whye Teh, Victor Bapst, Wojciech M. Czarnecki, John Quan, James Kirkpatrick, Raia Hadsell, Nicolas Heess, and Razvan Pascanu. Distral: Robust multitask reinforcement learning. In Advances in Neural Information Processing Systems (NIPS) 30, pages 4496–4506, 2017.
