
A Regularized Opponent Model with Maximum Entropy Objective

Zheng Tian1∗, Ying Wen1∗, Zhichen Gong1, Faiz Punakkath1, Shihao Zou2 and Jun Wang1

1University College London, 2University of Alberta

{zheng.tian, ying.wen, jun.wang}@cs.ucl.ac.uk

Abstract

In a single-agent setting, reinforcement learning (RL) tasks can be cast into an inference problem by introducing a binary random variable o, which stands for "optimality". In this paper, we redefine the binary random variable o in the multi-agent setting and formalize multi-agent reinforcement learning (MARL) as probabilistic inference. We derive a variational lower bound of the likelihood of achieving the optimality and name it the Regularized Opponent Model with Maximum Entropy Objective (ROMMEO). From ROMMEO, we present a novel perspective on opponent modeling and show how it can improve the performance of training agents theoretically and empirically in cooperative games. To optimize ROMMEO, we first introduce a tabular Q-iteration method, ROMMEO-Q, with proof of convergence. We extend the exact algorithm to complex environments by proposing an approximate version, ROMMEO-AC. We evaluate these two algorithms on the challenging iterated matrix game and differential game respectively and show that they can outperform strong MARL baselines.

1 Introduction

Casting decision making and optimal control as an inference problem has a long history, dating back to [Kalman, 1960], where Kalman smoothing is used to solve optimal control in linear dynamics with quadratic cost. Bayesian methods can capture the uncertainties regarding the transition probabilities, the reward functions in the environment, or other agents' policies. This distributional information can be used to formulate a more structured exploration/exploitation strategy than those commonly used in classical RL, e.g. ε-greedy. A common approach in many works [Toussaint and Storkey, 2006; Rawlik et al., 2013; Levine and Koltun, 2013; Abdolmaleki et al., 2018] for framing RL as an inference problem is to introduce a binary random variable o which represents "optimality". In this way, RL problems lend themselves to powerful inference tools [Levine, 2018]. However, the Bayesian approach in a multi-agent environment is less well studied.

∗The first two authors contributed equally.

In many single-agent works, maximizing entropy is part of a training agent's objective for resolving ambiguities in inverse reinforcement learning [Ziebart et al., 2008] and for improving the diversity [Florensa et al., 2017], robustness [Fox et al., 2015] and compositionality [Haarnoja et al., 2018a] of the learned policy. In Bayesian RL, it often appears in the evidence lower bound (ELBO) for the log likelihood of optimality [Haarnoja et al., 2017; Schulman et al., 2017; Haarnoja et al., 2018b], commonly known as the maximum entropy objective (MEO), which encourages the optimal policy to maximize the expected return and long-term entropy.

In MARL, there is more than one agent interacting with a stationary environment. In contrast with the single-agent environment, an agent's reward depends not only on the current environment state and the agent's own action but also on the actions of others. The existence of other agents increases the uncertainty in the environment. Therefore, the capability of reasoning about other agents' beliefs, private information, behavior, strategy, and other characteristics is crucial. A reasoning model can be used in many different ways, but the most common case is where an agent utilizes its reasoning model to aid its own decision making [Brown, 1951; Heinrich and Silver, 2016; He et al., 2016; Raileanu et al., 2018; Wen et al., 2019; Tian et al., 2018]. In this work, we use the word "opponent" when referring to another agent in the environment, irrespective of the environment's cooperative or adversarial nature.

In our work, we reformulate the MARL problem as Bayesian inference and derive a multi-agent version of the MEO, which we call the Regularized Opponent Model with Maximum Entropy Objective (ROMMEO). Optimizing this objective with respect to one agent's opponent model gives rise to a new perspective on opponent modeling. We present two off-policy RL algorithms for optimizing ROMMEO in MARL. ROMMEO-Q applies to the discrete-action case and comes with a proof of convergence. For complex and continuous-action environments, we propose ROMMEO Actor-Critic (ROMMEO-AC), which approximates the former procedure and extends to continuous problems. We evaluate these two approaches on the matrix game and the differential game against strong baselines and show that our methods can outperform all the baselines in terms of overall performance and speed of convergence.


2 Method

2.1 Stochastic Games

For an n-agent stochastic game [Shapley, 1953], we define a tuple (S, A^1, ..., A^n, R^1, ..., R^n, p, T, γ), where S denotes the state space, p is the distribution of the initial state, γ is a discount factor, and A^i and R^i = R^i(s, a^i, a^{-i}) are the action space and the reward function for agent i ∈ {1, ..., n} respectively. States are transitioned according to T : S × A, where A = {A^1, ..., A^n}. Agent i chooses its action a^i ∈ A^i according to the policy π^i_{θ^i}(a^i | s) parameterized by θ^i, conditioning on some given state s ∈ S. Let us define the joint policy as the collection of all agents' policies π_θ, with θ representing the joint parameter. It is convenient to interpret the joint policy from the perspective of agent i such that π_θ = (π^i_{θ^i}(a^i | s), π^{-i}_{θ^{-i}}(a^{-i} | s)), where a^{-i} = (a^j)_{j≠i}, θ^{-i} = (θ^j)_{j≠i}, and π^{-i}_{θ^{-i}}(a^{-i} | s) is a compact representation of the joint policy of all complementary agents of i. At each stage of the game, actions are taken simultaneously. Each agent is presumed to pursue the maximal cumulative reward, expressed as

max η^i(π_θ) = E[ ∑_{t=1}^∞ γ^t R^i(s_t, a^i_t, a^{-i}_t) ],    (1)

with (a^i_t, a^{-i}_t) sampled from (π^i_{θ^i}, π^{-i}_{θ^{-i}}). In fully cooperative games, we assume there exists at least one joint policy π_θ such that all agents can achieve the maximal cumulative reward with this joint policy.
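To make the notation concrete, the following minimal sketch (our own illustration, not from the paper's code) represents a stateless two-player fully cooperative game by its shared payoff matrix and estimates the objective η^i(π_θ) of Eq. 1 by Monte Carlo sampling of the joint policy. The payoff table anticipates the climbing game used in Sec. 5.1; all names are illustrative.

```python
import numpy as np

# Shared payoff R^1 = R^2 for joint actions (a^1, a^2) in {A, B, C}^2 (climbing game, Sec. 5.1).
CLIMBING_R = np.array([[ 11., -30.,   0.],
                       [-30.,   7.,   6.],
                       [  0.,   0.,   5.]])

def expected_return(pi_i, pi_minus_i, R=CLIMBING_R, gamma=0.9, horizon=50, n_rollouts=200, seed=0):
    """Monte-Carlo estimate of eta^i(pi) = E[sum_t gamma^t R^i(s_t, a^i_t, a^-i_t)] (Eq. 1).
    The game is stateless, so every stage is an independent draw from the joint policy."""
    rng = np.random.default_rng(seed)
    returns = []
    for _ in range(n_rollouts):
        total = 0.0
        for t in range(1, horizon + 1):
            a_i = rng.choice(3, p=pi_i)              # a^i_t  ~ pi^i
            a_minus_i = rng.choice(3, p=pi_minus_i)  # a^-i_t ~ pi^-i
            total += gamma ** t * R[a_i, a_minus_i]
        returns.append(total)
    return float(np.mean(returns))

# A joint policy putting all mass on (A, A) attains the maximal cumulative reward.
print(expected_return(np.array([1., 0., 0.]), np.array([1., 0., 0.])))   # roughly 11 * gamma / (1 - gamma)
print(expected_return(np.full(3, 1 / 3), np.full(3, 1 / 3)))             # uniform play does much worse
```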

2.2 A Variational Lower Bound for Multi-Agent Reinforcement Learning Problems

We transform the control problem into an inference problem by introducing a binary random variable o^i_t which serves as the indicator of "optimality" for each agent i at each time step t. Recall that in the single-agent problem, the reward R(s_t, a_t) is bounded, but whether the maximum reward is achieved given the action a_t is unknown. Therefore, in the single-agent case, o_t indicates the optimality of achieving the bounded maximum reward r*_t. It can thus be regarded as a random variable and we have P(o_t = 1 | s_t, a_t) ∝ exp(R(s_t, a_t)). Intuitively, this formulation dictates that higher rewards reflect a higher likelihood of achieving optimality, i.e., the case when o_t = 1. However, the definition of "optimality" in the multi-agent case is subtly different from the one in the single-agent situation.

In cooperative multi-agent reinforcement learning (CMARL), to define agent i's optimality, we first introduce the definition of the optimum and the optimal policy:

Definition 1. In cooperative multi-agent reinforcement learning, the optimum is a strategy profile (π^{1*}, ..., π^{n*}) such that:

E_{s∼p_s, a^{i*}_t∼π^{i*}, a^{-i*}_t∼π^{-i*}} [ ∑_{t=1}^∞ γ^t R^i(s_t, a^{i*}_t, a^{-i*}_t) ] ≥ E_{s∼p_s, a^i_t∼π^i, a^{-i}_t∼π^{-i}} [ ∑_{t=1}^∞ γ^t R^i(s_t, a^i_t, a^{-i}_t) ]    (2)

∀ π ∈ Π, i ∈ {1, ..., n},

where π = (π^i, π^{-i}) and agent i's optimal policy is π^{i*}.

In CMARL, a single agent's "optimality" o^i_t cannot imply that it obtains the maximum reward, because the reward depends on the joint actions of all agents (a^i, a^{-i}). Therefore, we define o^i_t = 1 to indicate only that agent i's policy at time step t is optimal. The posterior probability of agent i's optimality given its action a^i_t is the probability that the action is sampled from the optimal policy:

P(o^i_t | a^i_t) = P(a^i_t ∼ π^{i*} | a^i_t) = π^{i*}(a^i_t).    (3)

We also assume that, given the other players playing optimally (o^{-i}_t = 1), the higher the reward agent i receives, the higher the probability that agent i's current policy is optimal (o^i_t = 1):

P(o^i_t = 1 | o^{-i}_t = 1, s_t, a^i_t, a^{-i}_t) ∝ exp(R(s_t, a^i_t, a^{-i}_t)).    (4)

The conditional probabilities of "optimality" in CMARL and in the single-agent case have similar forms. However, it is worth mentioning that the "optimality" in CMARL has a different interpretation from the one in the single-agent case.

For cooperative games, if all agents play optimally, then the agents receive the maximum rewards, which is the optimum of the game. Therefore, given that the other agents are playing their optimal policies, o^{-i} = 1, the probability that agent i also plays its optimal policy, P(o^i = 1 | o^{-i} = 1), is the probability of obtaining the maximum reward from agent i's perspective. Therefore, we define agent i's objective as:

max J ≜ log P(o^i_{1:T} = 1 | o^{-i}_{1:T} = 1).    (5)

As we assume no knowledge of the optimal policies or the model of the environment, we treat them as latent variables. To optimize the observed evidence defined in Eq. 5, we therefore use variational inference (VI) with an auxiliary distribution over these latent variables, q(a^i_{1:T}, a^{-i}_{1:T}, s_{1:T} | o^i_{1:T} = 1, o^{-i}_{1:T} = 1). Without loss of generality, we here derive the solution for agent i. We factorize q(a^i_{1:T}, a^{-i}_{1:T}, s_{1:T} | o^i_{1:T} = 1, o^{-i}_{1:T} = 1) so as to capture agent i's conditional policy given the current state and opponents' actions, as well as its beliefs regarding opponents' actions. This way, agent i will learn the optimal policy while also possessing the capability to model opponents' actions a^{-i}. Using all modelling assumptions, we may factorize q(a^i_{1:T}, a^{-i}_{1:T}, s_{1:T} | o^i_{1:T} = 1, o^{-i}_{1:T} = 1) as:

q(a^i_{1:T}, a^{-i}_{1:T}, s_{1:T} | o^i_{1:T} = 1, o^{-i}_{1:T} = 1)
= P(s_1) ∏_t P(s_{t+1} | s_t, a_t) q(a^i_t | a^{-i}_t, s_t, o^i_t = o^{-i}_t = 1) q(a^{-i}_t | s_t, o^i_t = o^{-i}_t = 1)
= P(s_1) ∏_t P(s_{t+1} | s_t, a_t) π(a^i_t | s_t, a^{-i}_t) ρ(a^{-i}_t | s_t),

where we have assumed the same initial-state and state-transition distributions as in the original model.


With this factorization, we derive a lower bound on the likelihood of the optimality of agent i:

log P(o^i_{1:T} = 1 | o^{-i}_{1:T} = 1)
≥ J(π, ρ) ≜ ∑_t E_{(s_t, a^i_t, a^{-i}_t)∼q} [ R^i(s_t, a^i_t, a^{-i}_t) + H(π(a^i_t | s_t, a^{-i}_t)) − D_KL(ρ(a^{-i}_t | s_t) || P(a^{-i}_t | s_t)) ]    (6)
= ∑_t E_{s_t} [ E_{a^i_t∼π, a^{-i}_t∼ρ} [ R^i(s_t, a^i_t, a^{-i}_t) + H(π(a^i_t | s_t, a^{-i}_t)) ]  (the MEO term)
  − E_{a^{-i}_t∼ρ} [ D_KL(ρ(a^{-i}_t | s_t) || P(a^{-i}_t | s_t)) ] ]  (the regularizer of ρ).    (7)

Written out in full, ρ(a^{-i}_t | s_t, o^{-i}_t = 1) is agent i's opponent model estimating the optimal policies of its opponents, π(a^i_t | s_t, a^{-i}_t, o^i_t = 1, o^{-i}_t = 1) is agent i's conditional policy at the optimum (o^i_t = o^{-i}_t = 1), and P(a^{-i}_t | s_t, o^{-i}_t = 1) is the prior over the opponents' optimal policies. In our work, we set the prior P(a^{-i}_t | s_t, o^{-i}_t = 1) equal to the observed empirical distribution of the opponents' actions given states. As we are only interested in the case where (o^i_t = 1, o^{-i}_t = 1), we drop these variables in π, ρ and P(a^{-i}_t | s_t) here and thereafter. H(·) is the entropy function. Eq. 6 is a variational lower bound of log P(o^i_{1:T} = 1 | o^{-i}_{1:T} = 1); the derivation is deferred to Appendix B.1.

2.3 The Learning of the Opponent Model

We can further expand Eq. 6 into Eq. 7, and we find that it resembles the maximum entropy objective in single-agent reinforcement learning [Kappen, 2005; Todorov, 2007; Ziebart et al., 2008; Haarnoja et al., 2017]. We denote agent i's expectation of the reward R(s_t, a^i_t, a^{-i}_t) plus the entropy of the conditional policy H(π(a^i | s, a^{-i})) as agent i's maximum entropy objective (MEO). In the multi-agent version, however, it is worth noting that optimizing the MEO also leads to the optimization of ρ. This can be counter-intuitive at first sight, as opponent behaviour models are normally trained only with past state-action data (s, a^{-i}) to predict opponents' actions.

However, recall that ρ(a^{-i}_t | s_t, o^{-i}_t = 1) models the opponents' optimal policies in our work. With agent i's policy π^i fixed, optimizing the MEO with respect to ρ updates agent i's opponent model in the direction of higher shared reward R(s, a^i, a^{-i}) and a more stochastic conditional policy π^i(a^i | s, a^{-i}), making it closer to the real optimal policies of the opponents. Without any regularization, at iteration d, agent i can freely learn a new opponent model ρ^i_{d+1} which is the closest to the optimal opponent policies π^{-i*} from its perspective given π^i_d(a^i | s, a^{-i}). Next, agent i can optimize the lower bound with respect to π^i_{d+1}(a^i | s, a^{-i}) given ρ^i_{d+1}. We then have an EM-like iterative training procedure and can show that it monotonically increases the probability that the opponent model ρ corresponds to the optimal policies of the opponents. Then, by acting optimally with respect to the converged opponent model ρ^i_∞, we can recover agent i's optimal policy π^{i*}.

However, it is unrealistic to learn such an opponent model. As the opponents have no access to agent i's conditional policy π^i_d(a^i | s, a^{-i}), the learning of their policies can differ from the learning of agent i's opponent model. The actual opponent policies π^{-i}_{d+1} can then be very different from agent i's converged opponent model ρ^i_∞ learned in the above way given agent i's conditional policy π^i_d(a^i | s, a^{-i}). Therefore, acting optimally with respect to an opponent model far from the real opponents' policies can lead to poor performance.

The last term in Eq. 7 can prevent agent i from building an unrealistic opponent model. The Kullback-Leibler (KL) divergence between the opponent model and a prior, D_KL(ρ(a^{-i}_t | s_t) || P(a^{-i}_t | s_t)), acts as a regularizer of ρ. By setting the prior to the empirical distribution of the opponents' past behaviour, the KL divergence penalizes ρ heavily if it deviates too much from the empirical distribution. As the objective in Eq. 7 can be seen as a maximum entropy objective for one agent's policy and opponent model with regularization on the opponent model, we call this objective the Regularized Opponent Model with Maximum Entropy Objective (ROMMEO).
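A tiny numerical illustration of this regularizer (our own sketch, with illustrative names): the prior is taken as the empirical frequency of observed opponent actions, and the KL penalty stays small for opponent models close to that behaviour but grows quickly for an "optimistic" model that ignores it.

```python
import numpy as np

def empirical_prior(observed, n_actions):
    """Empirical frequency of observed opponent actions (add-one smoothing), used as P(a^-i)."""
    counts = np.bincount(observed, minlength=n_actions) + 1.0
    return counts / counts.sum()

def kl(rho, prior):
    """D_KL(rho || P), the regularizer of the opponent model in Eq. (7)."""
    return float(np.sum(rho * (np.log(rho + 1e-12) - np.log(prior + 1e-12))))

prior = empirical_prior([2, 2, 1, 2, 2, 1, 2], n_actions=3)   # opponent mostly plays action C
print(kl(np.array([0.1, 0.2, 0.7]), prior))    # close to observed behaviour: small penalty (~0.03)
print(kl(np.array([0.9, 0.05, 0.05]), prior))  # model far from observed behaviour: large penalty (~1.8)
```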

3 Multi-Agent Soft Actor Critic

To optimize the ROMMEO objective in Eq. 7 derived in the previous section, we propose two off-policy algorithms. We first introduce an exact tabular Q-iteration method with proof of convergence. For practical implementation in complex continuous environments, we then propose the ROMMEO actor-critic (ROMMEO-AC), which is an approximation of this procedure.

3.1 Regularized Opponent Model with Maximum Entropy Objective Q-Iteration

In this section, we derive a multi-agent version of the Soft Q-iteration algorithm proposed in [Haarnoja et al., 2017] and name our algorithm ROMMEO-Q. The derivation follows a similar logic to [Haarnoja et al., 2017], but the extension of Soft Q-learning to MARL is still nontrivial. From this section on, we slightly modify the objective in Eq. 7 by adding a weighting factor α to the entropy term; the original objective can be recovered by setting α = 1.

We first define the multi-agent soft Q-function and V-function respectively. Then we can show that the conditional policy and opponent model defined in Eq. 10 and 11 below are optimal solutions with respect to the objective defined in Eq. 7:

Theorem 1. We define the soft state-action value function of agent i as

Q^{π*,ρ*}_soft(s_t, a^i_t, a^{-i}_t) = r_t + E_{(s_{t+l}, a^i_{t+l}, a^{-i}_{t+l}, ...)∼q} [ ∑_{l=1}^∞ γ^l ( r_{t+l} + αH(π*(a^i_{t+l} | a^{-i}_{t+l}, s_{t+l})) − D_KL(ρ*(a^{-i}_{t+l} | s_{t+l}) || P(a^{-i}_{t+l} | s_{t+l})) ) ],    (8)

and the soft state value function as

V*(s) = log ∑_{a^{-i}} P(a^{-i} | s) ( ∑_{a^i} exp( (1/α) Q*_soft(s, a^i, a^{-i}) ) )^α.    (9)

Then the optimal conditional policy and opponent model for Eq. 6 are

π*(a^i | s, a^{-i}) = exp( (1/α) Q^{π*,ρ*}_soft(s, a^i, a^{-i}) ) / ∑_{a^i} exp( (1/α) Q^{π*,ρ*}_soft(s, a^i, a^{-i}) ),    (10)

and

ρ*(a^{-i} | s) = P(a^{-i} | s) ( ∑_{a^i} exp( (1/α) Q*_soft(s, a^i, a^{-i}) ) )^α / exp(V*(s)).    (11)

Proof. See Appendix C.2.
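To make Theorem 1 concrete, here is a small tabular sketch (our own illustration, assuming a given soft Q-table and prior, with illustrative names) that evaluates Eqs. 9-11 directly. For large or continuous problems these sums are intractable, which motivates ROMMEO-AC below.

```python
import numpy as np

def soft_value(Q, prior, alpha=1.0):
    # V*(s) = log sum_{a^-i} P(a^-i|s) (sum_{a^i} exp(Q/alpha))^alpha   (Eq. 9)
    inner = np.exp(Q / alpha).sum(axis=1)                  # sum over a^i -> shape [S, A_opp]
    return np.log((prior * inner ** alpha).sum(axis=1))

def conditional_policy(Q, alpha=1.0):
    # pi*(a^i | s, a^-i) proportional to exp(Q(s, a^i, a^-i)/alpha)     (Eq. 10)
    logits = Q / alpha
    p = np.exp(logits - logits.max(axis=1, keepdims=True))
    return p / p.sum(axis=1, keepdims=True)

def opponent_model(Q, prior, alpha=1.0):
    # rho*(a^-i | s) = P(a^-i|s) (sum_{a^i} exp(Q/alpha))^alpha / exp(V*(s))   (Eq. 11)
    inner = np.exp(Q / alpha).sum(axis=1) ** alpha
    rho = prior * inner
    return rho / rho.sum(axis=1, keepdims=True)            # normalising equals dividing by exp(V*)

# One dummy state with 3 actions per agent; Q[s, a_i, a_opp]; uniform prior over opponent actions.
Q = np.array([[[11., -30., 0.], [-30., 7., 6.], [0., 0., 5.]]])
P = np.full((1, 3), 1 / 3)
print(soft_value(Q, P), conditional_policy(Q)[0], opponent_model(Q, P))
```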

Following from Theorem 1, we can find the optimal solution of Eq. 7 by learning the soft multi-agent Q-function first and then recovering the optimal policy π* and opponent model ρ* via Equations 10 and 11. To learn the Q-function, we show that it satisfies a Bellman-like equation, which we name the multi-agent soft Bellman equation:

Theorem 2. We define the soft multi-agent Bellman equation for the soft state-action value function Q^{π,ρ}_soft(s, a^i, a^{-i}) of agent i as

Q^{π,ρ}_soft(s, a^i, a^{-i}) = r_t + γ E_{s_{t+1}} [ V_soft(s_{t+1}) ].    (12)

Proof. See Appendix C.3.

With the Bellman equation defined above, we can derive a solution to Eq. 12 with a fixed-point iteration, which we call ROMMEO Q-iteration (ROMMEO-Q). Additionally, we can show that it converges to the optimal Q*_soft and V*_soft under certain restrictions, as stated in [Wen et al., 2019]:

Theorem 3 (ROMMEO Q-iteration). Consider a symmetric game with only one global optimum, i.e. E_{π*}[Q^i_t(s)] ≥ E_π[Q^i_t(s)], where π* is the optimal strategy profile. Let Q_soft(·, ·, ·) and V_soft(·) be bounded, assume

∑_{a^{-i}} P(a^{-i} | s) ( ∑_{a^i} exp( (1/α) Q*_soft(s, a^i, a^{-i}) ) )^α < ∞,

and assume that Q*_soft < ∞ exists. Then the fixed-point iteration

Q_soft(s_t, a^i_t, a^{-i}_t) ← r_t + γ E_{s_{t+1}} [ V_soft(s_{t+1}) ],    (13)

where V_soft(s_t) ← log ∑_{a^{-i}_t} P(a^{-i}_t | s_t) ( ∑_{a^i_t} exp( (1/α) Q_soft(s_t, a^i_t, a^{-i}_t) ) )^α, ∀ s_t, a^i_t, a^{-i}_t, converges to Q*_soft and V*_soft respectively.

Proof. See Appendix C.3.
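A minimal tabular sketch of the fixed-point iteration in Eq. 13 is given below. It assumes a known transition kernel and a fixed prior P(a^{-i} | s) (the full ROMMEO-Q algorithm in Appendix A instead estimates the prior from observed play); all names are illustrative.

```python
import numpy as np

def v_soft(Q, prior, alpha):
    # V_soft(s) = log sum_{a^-i} P(a^-i|s) (sum_{a^i} exp(Q/alpha))^alpha
    inner = np.exp(Q / alpha).sum(axis=1) ** alpha          # shape [S, A_opp]
    return np.log((prior * inner).sum(axis=1))               # shape [S]

def rommeo_q_iteration(R, T, prior, alpha=1.0, gamma=0.95, n_iters=200):
    """Fixed-point iteration of Eq. (13): Q <- R + gamma * E_{s'}[V_soft(s')]."""
    Q = np.zeros_like(R)
    for _ in range(n_iters):
        V = v_soft(Q, prior, alpha)                           # shape [S]
        Q = R + gamma * T @ V                                 # contract T[s, a_i, a_opp, s'] with V[s']
    return Q, v_soft(Q, prior, alpha)

rng = np.random.default_rng(0)
S, A = 4, 3
T = rng.dirichlet(np.ones(S), size=(S, A, A))                 # T[s, a_i, a_opp, :] sums to 1
R = rng.normal(size=(S, A, A))                                 # shared reward for a random game
prior = np.full((S, A), 1 / A)                                 # fixed uniform prior (a simplification)
Q, V = rommeo_q_iteration(R, T, prior)
print(np.round(V, 3))
```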

3.2 Regularized Opponent Model with Maximum Entropy Objective Actor Critic

ROMMEO-Q assumes that we have the model of the environment and is impractical to implement in high-dimensional continuous problems. To solve such problems, we propose the ROMMEO actor-critic (ROMMEO-AC), which is a model-free method. We use neural networks (NNs) as function approximators for the conditional policy, opponent model and Q-function, and learn these functions by stochastic gradient descent. We parameterize the Q-function, conditional policy and opponent model by Q_ω(s, a^i, a^{-i}), π_θ(a^i_t | s_t, ã^{-i}_t) and ρ_φ(ã^{-i}_t | s_t) respectively.

Without access to the environment model, we first replace the Q-iteration with Q-learning. We can therefore train ω to minimize:

J_Q(ω) = E_{(s_t, a^i_t, a^{-i}_t)∼D} [ ½ ( Q_ω(s_t, a^i_t, a^{-i}_t) − R(s_t, a^i_t, a^{-i}_t) − γ E_{s_{t+1}∼p_s} [ V(s_{t+1}) ] )² ],    (14)

with

V(s_{t+1}) = Q_ω̄(s_{t+1}, a^i_{t+1}, ã^{-i}_{t+1}) − log ρ_φ(ã^{-i}_{t+1} | s_{t+1}) − α log π_θ(a^i_{t+1} | s_{t+1}, ã^{-i}_{t+1}) + log P(ã^{-i}_{t+1} | s_{t+1}),    (15)

where Q_ω̄ is a target function providing relatively stable target values. We use ã^{-i}_t to denote an action sampled from agent i's opponent model ρ_φ(ã^{-i}_t | s_t); it should be distinguished from a^{-i}_t, the real action taken by agent i's opponent. Eq. 15 can be derived from Eq. 10 and 11.
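As a per-sample sanity check on Eqs. 14-15, the snippet below (our own illustration) computes the bootstrapped value target V(s_{t+1}) and the resulting squared TD error, assuming the required target-network Q value and log-probabilities have already been computed elsewhere; all names and numbers are illustrative.

```python
def value_target(q_target, log_rho, log_pi, log_prior, alpha=1.0):
    """V(s') = Q_target(s', a^i', a~^-i') - log rho(a~^-i'|s')
               - alpha * log pi(a^i'|s', a~^-i') + log P(a~^-i'|s')   (Eq. 15)."""
    return q_target - log_rho - alpha * log_pi + log_prior

def critic_loss(q_value, reward, v_next, gamma=0.99):
    """0.5 * (Q(s, a^i, a^-i) - r - gamma * V(s'))^2, cf. Eq. (14)."""
    td_error = q_value - reward - gamma * v_next
    return 0.5 * td_error ** 2

v_next = value_target(q_target=4.2, log_rho=-1.1, log_pi=-0.7, log_prior=-1.3)
print(critic_loss(q_value=3.0, reward=1.0, v_next=v_next))
```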

To recover the optimal conditional policy and opponent model while avoiding the intractable inference steps defined in Eq. 10 and 11 in complex problems, we follow the method in [Haarnoja et al., 2018b], where θ and φ are trained to minimize the KL-divergences:

J_π(θ) = E_{s_t∼D, ã^{-i}_t∼ρ} [ D_KL( π_θ(· | s_t, ã^{-i}_t) || exp( (1/α) Q_ω(s_t, ·, ã^{-i}_t) ) / Z_ω(s_t, ã^{-i}_t) ) ],    (16)

J_ρ(φ) = E_{(s_t, a^i_t)∼D} [ D_KL( ρ_φ(· | s_t) || P(· | s_t) ( exp( (1/α) Q_ω(s_t, a^i_t, ·) ) / π_θ(a^i_t | s_t, ·) )^α / Z_ω(s_t) ) ].    (17)

By using the reparameterization trick, ã^{-i}_t = g_φ(ε^{-i}_t; s_t) and a^i_t = f_θ(ε^i_t; s_t, ã^{-i}_t), we can rewrite the objectives above as:

J_π(θ) = E_{s_t∼D, ε^i_t∼N, ã^{-i}_t∼ρ} [ α log π_θ( f_θ(ε^i_t; s_t, ã^{-i}_t) | s_t, ã^{-i}_t ) − Q_ω( s_t, f_θ(ε^i_t; s_t, ã^{-i}_t), ã^{-i}_t ) ],    (18)

J_ρ(φ) = E_{(s_t, a_t)∼D, ε^{-i}_t∼N} [ log ρ_φ( g_φ(ε^{-i}_t; s_t) | s_t ) − log P( g_φ(ε^{-i}_t; s_t) | s_t ) − Q_ω( s_t, a^i_t, g_φ(ε^{-i}_t; s_t) ) + α log π_θ( a^i_t | s_t, g_φ(ε^{-i}_t; s_t) ) ].    (19)

The gradients of Eq. 14, 18 and 19 with respect to the corresponding parameters are listed below:

∇_ω J_Q(ω) = ∇_ω Q_ω(s_t, a^i_t, a^{-i}_t) ( Q_ω(s_t, a^i_t, a^{-i}_t) − R(s_t, a^i_t, a^{-i}_t) − γ V(s_{t+1}) ),    (20)

∇_θ J_π(θ) = ∇_θ α log π_θ(a^i_t | s_t, ã^{-i}_t) + ( ∇_{a^i_t} α log π_θ(a^i_t | s_t, ã^{-i}_t) − ∇_{a^i_t} Q_ω(s_t, a^i_t, ã^{-i}_t) ) ∇_θ f_θ(ε^i_t; s_t, ã^{-i}_t),    (21)

∇_φ J_ρ(φ) = ∇_φ log ρ_φ(ã^{-i}_t | s_t) + ( ∇_{ã^{-i}_t} log ρ_φ(ã^{-i}_t | s_t) − ∇_{ã^{-i}_t} log P(ã^{-i}_t | s_t) − ∇_{ã^{-i}_t} Q_ω(s_t, a^i_t, ã^{-i}_t) + ∇_{ã^{-i}_t} α log π_θ(a^i_t | s_t, ã^{-i}_t) ) ∇_φ g_φ(ε^{-i}_t; s_t).    (22)

We list the pseudo-code of ROMMEO-Q and ROMMEO-AC in Appendix A.
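Below is a compact PyTorch sketch of one ROMMEO-AC update for 1-D continuous actions, written as our own illustration of Eqs. 14, 18 and 19 rather than the authors' implementation. It makes several simplifying assumptions: the prior P(a^{-i} | s) is a fixed standard normal (the paper fits it to observed opponent actions), there is no target network or action squashing, and the batch is synthetic; all module and variable names are ours. Each objective is differentiated and stepped with its own optimizer, mirroring the separate gradients in Eqs. 20-22.

```python
import torch
import torch.nn as nn
from torch.distributions import Normal

class GaussianHead(nn.Module):
    """Small MLP producing a diagonal Gaussian; rsample() plays the role of g_phi / f_theta."""
    def __init__(self, in_dim):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(in_dim, 64), nn.ReLU(), nn.Linear(64, 2))
    def forward(self, x):
        mean, log_std = self.net(x).chunk(2, dim=-1)
        return Normal(mean, log_std.clamp(-5, 2).exp())

s_dim, alpha, gamma = 2, 1.0, 0.99
rho = GaussianHead(s_dim)                 # opponent model rho_phi(a^-i | s)
pi = GaussianHead(s_dim + 1)              # conditional policy pi_theta(a^i | s, a~^-i)
Q = nn.Sequential(nn.Linear(s_dim + 2, 64), nn.ReLU(), nn.Linear(64, 1))   # Q_omega(s, a^i, a^-i)
opt_q = torch.optim.Adam(Q.parameters(), lr=3e-4)
opt_pi = torch.optim.Adam(pi.parameters(), lr=3e-4)
opt_rho = torch.optim.Adam(rho.parameters(), lr=3e-4)
prior = Normal(0.0, 1.0)                  # assumed fixed prior P(a^-i | s)

# A synthetic replay-buffer batch (s, a^i, a^-i, r, s'); real data would come from self-play.
B = 32
s, s_next = torch.randn(B, s_dim), torch.randn(B, s_dim)
a_i, a_opp, r = torch.randn(B, 1), torch.randn(B, 1), torch.randn(B, 1)

# Critic loss (Eq. 14) with the bootstrapped target of Eq. 15 (no target network in this sketch).
with torch.no_grad():
    a_opp_next = rho(s_next).rsample()
    pi_next = pi(torch.cat([s_next, a_opp_next], dim=-1))
    a_i_next = pi_next.rsample()
    v_next = (Q(torch.cat([s_next, a_i_next, a_opp_next], dim=-1))
              - rho(s_next).log_prob(a_opp_next)
              - alpha * pi_next.log_prob(a_i_next)
              + prior.log_prob(a_opp_next))
q_loss = 0.5 * (Q(torch.cat([s, a_i, a_opp], dim=-1)) - r - gamma * v_next).pow(2).mean()
opt_q.zero_grad(); q_loss.backward(); opt_q.step()

# Policy loss (Eq. 18): the opponent sample a~^-i is treated as given, so it is detached.
a_opp_tilde = rho(s).rsample().detach()
pi_dist = pi(torch.cat([s, a_opp_tilde], dim=-1))
a_i_tilde = pi_dist.rsample()             # f_theta(eps; s, a~^-i), reparameterized
pi_loss = (alpha * pi_dist.log_prob(a_i_tilde)
           - Q(torch.cat([s, a_i_tilde, a_opp_tilde], dim=-1))).mean()
opt_pi.zero_grad(); pi_loss.backward(); opt_pi.step()

# Opponent-model loss (Eq. 19): gradients flow to phi through a~^-i = g_phi(eps; s).
rho_dist = rho(s)
a_opp_tilde = rho_dist.rsample()
rho_loss = (rho_dist.log_prob(a_opp_tilde) - prior.log_prob(a_opp_tilde)
            - Q(torch.cat([s, a_i, a_opp_tilde], dim=-1))
            + alpha * pi(torch.cat([s, a_opp_tilde], dim=-1)).log_prob(a_i)).mean()
opt_rho.zero_grad(); rho_loss.backward(); opt_rho.step()
```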


Figure 1: (a) Learning curves of ROMMEO and baselines on ICG over 100 episodes. (b) Probability of convergence to the global optimum for ROMMEO and baselines on ICG over 100 episodes; the vertical axis is the joint probability of both agents taking action A. (c) Probability of taking A estimated by agent i's opponent model ρ^i and the observed empirical frequency P^i in one trial of training, i ∈ {1, 2}.

4 Related Works

In early works, the maximum entropy principle was used for policy search in linear dynamics [Todorov, 2010; Toussaint, 2009; Levine and Koltun, 2013] and for path integral control in general dynamics [Kappen, 2005; Theodorou et al., 2010]. Recently, off-policy methods [Haarnoja et al., 2017; Schulman et al., 2017; Nachum et al., 2017] have been proposed to improve the sample efficiency of optimizing the MEO. To avoid a complex sampling procedure, training the policy in a supervised fashion is employed in [Haarnoja et al., 2018b]. Our work is closely related to this series of recent works because ROMMEO is an extension of the MEO to MARL.

A few works related to ours have been conducted in multi-agent soft Q-learning [Wen et al., 2019; Wei et al., 2018; Grau-Moya et al., 2018], where variants of soft Q-learning are applied to different problems in MARL. However, unlike previous works, we do not take soft Q-learning as given and apply it to MARL problems with modifications. In our work, we first establish a novel objective, ROMMEO, and ROMMEO-Q is only one off-policy method, derived with a complete convergence proof, that can optimize this objective. There are other ways of optimizing ROMMEO, for example on-policy gradient-based methods, but they are not included in this paper.

There has been substantial progress in combining RL with probabilistic inference. However, most of the existing works focus on the single-agent case, and the literature on Bayesian methods in MARL is limited. Among these are methods for cooperative games that assume prior knowledge of the distributions of the game model and the possible strategies of others [Chalkiadakis and Boutilier, 2003], or of the policy parameters and possible roles of other agents [Wilson et al., 2010]. In our work, we assume very limited prior knowledge of the environment model, the optimal policy, the opponents, or the observations during play. In addition, our algorithms are fully decentralized at training and execution, which is more challenging than problems with centralized training [Foerster et al., 2017; Lowe et al., 2017; Rashid et al., 2018].

In our work, we give a new definition of optimality in CMARL and derive a novel objective, ROMMEO. We provide two off-policy RL algorithms for optimizing ROMMEO, and the exact version comes with a convergence proof. In addition, we provide a natural perspective on opponent modeling in coordination problems: biasing one's opponent model towards the optimum from one's own perspective while regularizing it with the empirical distribution of the opponent's real behavior.

5 Experiments

5.1 Iterated Matrix Games

We first present proof-of-principle results of ROMMEO-Q¹ on iterated matrix games where players need to cooperate to achieve the shared maximum reward. To this end, we study the iterated climbing game (ICG), a classic purely cooperative two-player stateless iterated matrix game.

The climbing game (CG) is a fully cooperative game proposed in [Claus and Boutilier, 1998] whose payoff matrix is summarized as follows:

            A            B            C
    A   (11, 11)    (−30, −30)     (0, 0)
R = B   (−30, −30)    (7, 7)       (6, 6)
    C    (0, 0)       (0, 0)       (5, 5)

It is a challenging benchmark because of the difficulty of converging to its global optimum. There are two Nash equilibria, (A, A) and (B, B), but only one global optimum, (A, A). The punishment for miscoordination when choosing a given action increases in the order C → B → A: the safest action is C, and the miscoordination punishment is most severe for A. Therefore it is very difficult for agents to converge to the global optimum in ICG.
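As a quick illustration of this claim, the sketch below (our own, with an illustrative helper name) enumerates best responses in the shared-payoff climbing game and recovers the two pure-strategy Nash equilibria and the global optimum.

```python
import numpy as np

R = np.array([[11., -30., 0.],     # shared payoff; rows = player 1's action (A, B, C)
              [-30., 7., 6.],      # columns = player 2's action (A, B, C)
              [0., 0., 5.]])

def pure_nash(R):
    eqs = []
    for a1 in range(3):
        for a2 in range(3):
            best_1 = R[:, a2].max() <= R[a1, a2]   # no profitable deviation for player 1
            best_2 = R[a1, :].max() <= R[a1, a2]   # no profitable deviation for player 2
            if best_1 and best_2:
                eqs.append(("ABC"[a1], "ABC"[a2]))
    return eqs

print(pure_nash(R))                           # [('A', 'A'), ('B', 'B')]
print(np.unravel_index(R.argmax(), R.shape))  # global optimum at (0, 0), i.e. (A, A)
```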

We compare our method to a series of strong baselines in MARL, including Joint Action Learner (JAL) [Claus and Boutilier, 1998], WoLF Policy Hillclimbing (WoLF-PHC) [Bowling and Veloso, 2001], Frequency Maximum Q (FMQ) [Kapetanakis and Kudenko, 2002] and Probabilistic Recursive Reasoning (PR2) [Wen et al., 2019]. ROMMEO-Q-EMP is an ablation study to evaluate the effectiveness of our proposed opponent-model learning process, where we replace our opponent model with the empirical frequency. Fig. 1a shows the learning curves on ICG for the different algorithms. The difference in rewards between ROMMEO-Q and FMQ-c10 may seem small because of the small reward margin between the global optimum and the local one. However, ROMMEO-Q actually outperforms all baselines significantly in terms of converging to the global optimum, as shown in Fig. 1b.

¹The experiment code and appendix are available at https://github.com/rommeoijcai2019/rommeo.

Figure 2: Experiments on the Max of Two Quadratic Game. (a) Reward surface and learning path of the agents; scattered points are actions taken at each step. (b) Learning curves of ROMMEO and baselines. (c) Means of the agents' policies π and opponent models ρ.

To further analyze the opponent modeling described in Sec. 2.3, we visualize the probability of agent −i taking the optimal action A as estimated by agent i's opponent model ρ^i and by its true policy π^{-i} in Fig. 1c. Agent i's opponent model "thinks ahead" of agent −i and converges to agent −i's optimal policy before agent −i itself converges to that policy. This helps agent i respond optimally to its opponent model by choosing action A, which in turn leads to the improvement of agent −i's opponent model and policy. Therefore, the game converges to the global optimum. Note that the large drop of P(A) for both policies and opponent models at the beginning of training comes from the severe punishment for miscoordination associated with action A.

5.2 Differential Games

We adopt the differential Max of Two Quadratic Game [Wei et al., 2018] for the continuous case. The agents have the continuous action space [−10, 10]. Each agent's reward depends on the joint action according to:

r^1(a^1, a^2) = r^2(a^1, a^2) = max(f_1, f_2),
where f_1 = 0.8 × [ −((a^1 + 5)/3)^2 − ((a^2 + 5)/3)^2 ],  f_2 = 1.0 × [ −((a^1 − 5)/1)^2 − ((a^2 − 5)/1)^2 ] + 10.

We compare the algorithm with a series of baselines including PR2 [Wen et al., 2019], MASQL [Wei et al., 2018; Grau-Moya et al., 2018], MADDPG [Lowe et al., 2017] and an independent learner via DDPG [Lillicrap et al., 2015]. To compare against traditional opponent modeling methods, similar to [Rabinowitz et al., 2018; He et al., 2016], we implement an additional baseline of DDPG with an opponent module that is trained online with supervision in order to capture the latest opponent behaviors, called DDPG-OM. We train all agents for 200 episodes with 25 steps per episode.

This is a challenging task for most continuous gradient-based RL algorithms because the gradient update tends to direct the training agent to the sub-optimal point. The reward surface is provided in Fig. 2a; there is a local maximum of 0 at (−5, −5) and a global maximum of 10 at (5, 5), with a deep valley in between. If the agents' policies are initialized to (0, 0) (the red starred point), which lies within the basin of attraction of the left local maximum, gradient-based methods tend to fail to find the global maximum equilibrium point because the valley blocks the upper-right area.
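The following short check (our own, with illustrative names) evaluates the reward defined above numerically and confirms the two maxima mentioned in this paragraph.

```python
import numpy as np

def reward(a1, a2):
    """Shared reward of the Max of Two Quadratic Game, r^1 = r^2 = max(f1, f2)."""
    f1 = 0.8 * (-((a1 + 5) / 3) ** 2 - ((a2 + 5) / 3) ** 2)
    f2 = 1.0 * (-((a1 - 5) / 1) ** 2 - ((a2 - 5) / 1) ** 2) + 10
    return max(f1, f2)

print(reward(-5.0, -5.0))   # 0.0  (local maximum)
print(reward(5.0, 5.0))     # 10.0 (global maximum)

# A coarse grid search over the action space [-10, 10]^2 confirms where the global maximum lies.
grid = np.linspace(-10, 10, 201)
best = max(((reward(x, y), x, y) for x in grid for y in grid))
print(best)                 # approximately (10.0, 5.0, 5.0)
```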

A learning path of ROMMEO-AC is summarized in Fig. 2a; the solid bright circle in the upper-right corner indicates convergence to the global optimum. The learning curves are presented in Fig. 2b: ROMMEO-AC shows the capability of converging to the global optimum in a limited number of steps, while most of the baselines can only reach the sub-optimal point. PR2-AC can also achieve the global optimum but requires many more steps to explore and learn. Additionally, fine-tuning of the exploration noise or a separate exploration stage is required for the deterministic RL methods (MADDPG, DDPG, DDPG-OM, PR2-AC), and the learning outcome of the energy-based RL method (MASQL) is extremely sensitive to the annealing scheme for the temperature. In contrast, ROMMEO-AC employs a stochastic policy and controls the exploration level with the weighting factor α. It does not need a separate exploration stage at the beginning of training or a delicately designed annealing scheme for α.

Furthermore, we analyze the learning paths of the policy π and the modeled opponent policy ρ during training; the results are shown in Fig. 2c. The red and orange lines are the means of the modeled opponent policies ρ, which always learn to approach the optimum ahead of the policies π (dashed blue and green lines). This helps the agents establish trust and converge to the optimum quickly, which further justifies the effectiveness and benefits of the regularized opponent model proposed in Sec. 2.3.

6 Conclusion

In this paper, we use Bayesian inference to formulate the MARL problem and derive a novel objective, ROMMEO, which gives rise to a new perspective on opponent modeling. We design an off-policy algorithm, ROMMEO-Q, with a complete convergence proof for optimizing ROMMEO. For better generality, we also propose ROMMEO-AC, an actor-critic algorithm powered by NNs, to solve complex and continuous problems. We give an insightful analysis of the effect of the new opponent-model learning process on an agent's performance in MARL. We evaluate our methods on the challenging matrix game and differential game and show that they can outperform a series of strong baselines. It is worth noting that Theorems 1, 2 and 3 only guarantee convergence to optimal solutions with respect to the ROMMEO objective, not to the optimum of the game. Reaching the optimum of the game also relies on the opponent's learning algorithm. In our work, we demonstrate ROMMEO-Q/AC's convergence to the optimum of the game in the self-play setting. The convergence to the optimum of games in non-self-play settings will be studied in future work.


References

[Abdolmaleki et al., 2018] Abbas Abdolmaleki, Jost Tobias Springenberg, Yuval Tassa, Remi Munos, Nicolas Heess, and Martin A. Riedmiller. Maximum a posteriori policy optimisation. CoRR, abs/1806.06920, 2018.

[Bowling and Veloso, 2001] Michael Bowling and Manuela Veloso. Rational and convergent learning in stochastic games. In IJCAI, San Francisco, CA, USA, 2001.

[Brown, 1951] George W. Brown. Iterative solution of games by fictitious play. In AAPA, New York, 1951.

[Chalkiadakis and Boutilier, 2003] Georgios Chalkiadakis and Craig Boutilier. Coordination in multiagent reinforcement learning: A Bayesian approach. In AAMAS, pages 709–716, New York, NY, USA, 2003. ACM.

[Claus and Boutilier, 1998] Caroline Claus and Craig Boutilier. The dynamics of reinforcement learning in cooperative multiagent systems. AAAI '98/IAAI '98, Menlo Park, CA, USA, 1998. American Association for Artificial Intelligence.

[Florensa et al., 2017] Carlos Florensa, Yan Duan, and Pieter Abbeel. Stochastic neural networks for hierarchical reinforcement learning. arXiv e-prints, April 2017.

[Foerster et al., 2017] Jakob Foerster, Gregory Farquhar, Triantafyllos Afouras, Nantas Nardelli, and Shimon Whiteson. Counterfactual multi-agent policy gradients. arXiv e-prints, arXiv:1705.08926, May 2017.

[Fox et al., 2015] Roy Fox, Ari Pakman, and Naftali Tishby. Taming the noise in reinforcement learning via soft updates. arXiv e-prints, arXiv:1512.08562, December 2015.

[Grau-Moya et al., 2018] Jordi Grau-Moya, Felix Leibfried, and Haitham Bou-Ammar. Balancing two-player stochastic games with soft Q-learning. IJCAI, 2018.

[Haarnoja et al., 2017] Tuomas Haarnoja, Haoran Tang, Pieter Abbeel, and Sergey Levine. Reinforcement learning with deep energy-based policies. CoRR, abs/1702.08165, 2017.

[Haarnoja et al., 2018a] Tuomas Haarnoja, Vitchyr Pong, Aurick Zhou, Murtaza Dalal, Pieter Abbeel, and Sergey Levine. Composable deep reinforcement learning for robotic manipulation. arXiv e-prints, arXiv:1803.06773, March 2018.

[Haarnoja et al., 2018b] Tuomas Haarnoja, Aurick Zhou, Pieter Abbeel, and Sergey Levine. Soft actor-critic: Off-policy maximum entropy deep reinforcement learning with a stochastic actor. arXiv e-prints, arXiv:1801.01290, January 2018.

[He et al., 2016] He He, Jordan Boyd-Graber, Kevin Kwok, and Hal Daumé III. Opponent modeling in deep reinforcement learning. In International Conference on Machine Learning, pages 1804–1813, 2016.

[Heinrich and Silver, 2016] Johannes Heinrich and David Silver. Deep reinforcement learning from self-play in imperfect-information games. CoRR, abs/1603.01121, 2016.

[Kalman, 1960] Rudolf Kalman. A new approach to linear filtering and prediction problems. Transactions of the ASME - Journal of Basic Engineering, 82:35–45, 1960.

[Kapetanakis and Kudenko, 2002] Spiros Kapetanakis and Daniel Kudenko. Reinforcement learning of coordination in cooperative multi-agent systems. In Eighteenth National Conference on Artificial Intelligence, Menlo Park, CA, USA, 2002.

[Kappen, 2005] Hilbert J. Kappen. Path integrals and symmetry breaking for optimal control theory. Journal of Statistical Mechanics: Theory and Experiment, 2005(11), 2005.

[Levine and Koltun, 2013] Sergey Levine and Vladlen Koltun. Variational policy search via trajectory optimization. In NIPS, 2013.

[Levine, 2018] Sergey Levine. Reinforcement learning and control as probabilistic inference: Tutorial and review. arXiv e-prints, arXiv:1805.00909, May 2018.

[Lillicrap et al., 2015] Timothy P. Lillicrap, Jonathan J. Hunt, Alexander Pritzel, Nicolas Heess, Tom Erez, Yuval Tassa, David Silver, and Daan Wierstra. Continuous control with deep reinforcement learning. arXiv preprint arXiv:1509.02971, 2015.

[Lowe et al., 2017] Ryan Lowe, Yi Wu, Aviv Tamar, Jean Harb, Pieter Abbeel, and Igor Mordatch. Multi-agent actor-critic for mixed cooperative-competitive environments. arXiv e-prints, arXiv:1706.02275, June 2017.

[Nachum et al., 2017] Ofir Nachum, Mohammad Norouzi, Kelvin Xu, and Dale Schuurmans. Bridging the gap between value and policy based reinforcement learning. arXiv e-prints, arXiv:1702.08892, February 2017.

[Rabinowitz et al., 2018] Neil C. Rabinowitz, Frank Perbet, H. Francis Song, Chiyuan Zhang, S. M. Eslami, and Matthew Botvinick. Machine theory of mind. arXiv preprint arXiv:1802.07740, 2018.

[Raileanu et al., 2018] Roberta Raileanu, Emily Denton, Arthur Szlam, and Rob Fergus. Modeling others using oneself in multi-agent reinforcement learning. CoRR, abs/1802.09640, 2018.

[Rashid et al., 2018] Tabish Rashid, Mikayel Samvelyan, Christian Schroeder de Witt, Gregory Farquhar, Jakob Foerster, and Shimon Whiteson. QMIX: Monotonic value function factorisation for deep multi-agent reinforcement learning. arXiv e-prints, arXiv:1803.11485, March 2018.

[Rawlik et al., 2013] Konrad Rawlik, Marc Toussaint, and Sethu Vijayakumar. On stochastic optimal control and reinforcement learning by approximate inference. IJCAI '13, 2013.

[Schulman et al., 2017] John Schulman, Xi Chen, and Pieter Abbeel. Equivalence between policy gradients and soft Q-learning. arXiv e-prints, arXiv:1704.06440, April 2017.

[Shapley, 1953] Lloyd S. Shapley. Stochastic games. Proceedings of the National Academy of Sciences, 39(10):1095–1100, 1953.

[Theodorou et al., 2010] Evangelos Theodorou, Jonas Buchli, and Stefan Schaal. A generalized path integral control approach to reinforcement learning. Journal of Machine Learning Research, 11, December 2010.

[Tian et al., 2018] Zheng Tian, Shihao Zou, Tim Warr, Lisheng Wu, and Jun Wang. Learning multi-agent implicit communication through actions: A case study in contract bridge, a collaborative imperfect-information game. arXiv e-prints, arXiv:1810.04444, October 2018.

[Todorov, 2007] Emanuel Todorov. Linearly-solvable Markov decision problems. In NIPS, 2007.

[Todorov, 2010] Emanuel Todorov. Policy gradients in linearly-solvable MDPs. In NIPS, 2010.

[Toussaint and Storkey, 2006] Marc Toussaint and Amos Storkey. Probabilistic inference for solving discrete and continuous state Markov decision processes. In ICML '06, pages 945–952, New York, NY, USA, 2006. ACM.

[Toussaint, 2009] Marc Toussaint. Robot trajectory optimization using approximate inference. In ICML '09, pages 1049–1056, New York, NY, USA, 2009. ACM.

[Wei et al., 2018] Ermo Wei, Drew Wicke, David Freelan, and Sean Luke. Multiagent soft Q-learning. AAAI, 2018.

[Wen et al., 2019] Ying Wen, Yaodong Yang, Rui Luo, Jun Wang, and Wei Pan. Probabilistic recursive reasoning for multi-agent reinforcement learning. In ICLR, 2019.

[Wilson et al., 2010] Aaron Wilson, Alan Fern, and Prasad Tadepalli. Bayesian policy search for multi-agent role discovery. In AAAI '10, pages 624–629. AAAI Press, 2010.

[Ziebart et al., 2008] Brian D. Ziebart, Andrew Maas, J. Andrew Bagnell, and Anind K. Dey. Maximum entropy inverse reinforcement learning. In AAAI '08, 2008.


A Algorithms

Algorithm 1 Multi-agent Soft Q-learning (ROMMEO-Q)

Result: policy π^i, opponent model ρ^i
Initialization: Initialize the replay buffer M to capacity M. Initialize Q_{ω^i}(s, a^i, a^{-i}) with random parameters ω^i and P(a^{-i} | s) arbitrarily; set γ as the discount factor. Initialize the target network Q_{ω̄^i}(s, a^i, a^{-i}) with parameters ω̄^i; set C as the target-parameter update interval.

while not converged do
    // Collect experience
    For the current state s_t, compute the opponent model ρ^i(a^{-i}_t | s_t) and conditional policy π^i(a^i_t | s_t, a^{-i}_t) respectively from:
        ρ^i(a^{-i}_t | s_t) ∝ P(a^{-i}_t | s_t) ( ∑_{a^i_t} exp( (1/α) Q_{ω^i}(s_t, a^i_t, a^{-i}_t) ) )^α,
        π^i(a^i_t | s_t, a^{-i}_t) ∝ exp( (1/α) Q_{ω^i}(s_t, a^i_t, a^{-i}_t) ).
    Compute the marginal policy π^i(a^i_t | s_t) and sample an action from it:
        a^i_t ∼ π^i(a^i_t | s_t) = ∑_{a^{-i}} π^i(a^i_t | s_t, a^{-i}_t) ρ^i(a^{-i}_t | s_t).
    Observe the next state s_{t+1}, opponent action a^{-i}_t and reward r^i_t; save the new experience in the replay buffer:
        M ← M ∪ {(s_t, a^i_t, a^{-i}_t, s_{t+1}, r^i_t)}.
    Update the prior from the replay buffer:
        P(a^{-i}_t | s_t) = ∑_{m=1}^{|M|} I(s = s_t, a^{-i} = a^{-i}_t) / ∑_{m=1}^{|M|} I(s = s_t),  ∀ s_t, a^{-i}_t ∈ M.
    Sample a mini-batch from the replay buffer:
        {s^{(n)}_t, a^{i,(n)}_t, a^{-i,(n)}_t, s^{(n)}_{t+1}, r^{(n)}_t}_{n=1}^N ∼ M.
    // Update Q_{ω^i}(s, a^i, a^{-i})
    for each tuple (s^{(n)}_t, a^{i,(n)}_t, a^{-i,(n)}_t, s^{(n)}_{t+1}, r^{(n)}_t) do
        Sample {a^{-i,(n,k)}}_{k=1}^K ∼ ρ^i and {a^{i,(n,k)}}_{k=1}^K ∼ π^i.
        Compute the empirical V^i(s^{(n)}_{t+1}) as:
            V^i(s^{(n)}_{t+1}) = log (1/K) ∑_{k=1}^K [ ( P^{1/α}(a^{-i,(n,k)} | s^{(n)}_{t+1}) exp( (1/α) Q_{ω̄^i}(s^{(n)}_{t+1}, a^{i,(n,k)}, a^{-i,(n,k)}) ) )^α / ( π^i(a^{i,(n,k)} | s^{(n)}_{t+1}, a^{-i,(n,k)}) ρ^i(a^{-i,(n,k)} | s^{(n)}_{t+1}) ) ].
        Set
            y^{(n)} = r^{(n)}_t                          for terminal s^{(n)}_{t+1},
            y^{(n)} = r^{(n)}_t + γ V^i(s^{(n)}_{t+1})    for non-terminal s^{(n)}_{t+1}.
        Perform a gradient descent step on (y^{(n)} − Q_{ω^i}(s^{(n)}_t, a^{i,(n)}_t, a^{-i,(n)}_t))^2 with respect to the parameters ω^i.
        Every C gradient descent steps, reset the target parameters: ω̄^i ← ω^i.
    end for
end while
Compute the converged π^i and ρ^i.


Algorithm 2 Multi-agent Variational Actor Critic (ROMMEO-AC)

Result: policy π_{θ^i}, opponent model ρ_{φ^i}
Initialization: Initialize parameters θ^i, φ^i, ω^i, ψ^i for each agent i and the random process N for action exploration. Assign the target parameters of the joint-action Q-function: ω̄^i ← ω^i. Initialize learning rates λ_V, λ_Q, λ_π, λ_φ and α, and set γ as the discount factor.

for each episode d = 1, ..., D do
    Initialize the random process N for action exploration.
    for each time step t do
        For the current state s_t, sample an action and an opponent action using:
            ã^{-i}_t ← g_{φ^i}(ε^{-i}_t; s_t), where ε^{-i}_t ∼ N,
            a^i_t ← f_{θ^i}(ε^i_t; s_t, ã^{-i}_t), where ε^i_t ∼ N.
        Observe the next state s_{t+1}, opponent action a^{-i}_t and reward r^i_t; save the new experience in the replay buffer:
            D^i ← D^i ∪ {(s_t, a^i_t, ã^{-i}_t, a^{-i}_t, s_{t+1}, r^i_t)}.
        Update the prior from the replay buffer (maximum likelihood on observed opponent actions):
            ψ^i = arg max_{ψ^i} E_{D^i} [ log P_{ψ^i}(a^{-i} | s) ].
        Sample a mini-batch from the replay buffer:
            {s^{(n)}_t, a^{i,(n)}_t, ã^{-i,(n)}_t, a^{-i,(n)}_t, s^{(n)}_{t+1}, r^{(n)}_t}_{n=1}^N ∼ D^i.
        For the state s^{(n)}_{t+1}, sample an action and an opponent action using:
            ã^{-i,(n)}_{t+1} ← g_{φ^i}(ε^{-i}_{t+1}; s^{(n)}_{t+1}), where ε^{-i}_{t+1} ∼ N,
            a^{i,(n)}_{t+1} ← f_{θ^i}(ε^i_{t+1}; s^{(n)}_{t+1}, ã^{-i,(n)}_{t+1}), where ε^i_{t+1} ∼ N.
        Compute
            V^i(s^{(n)}_{t+1}) = Q_{ω̄}(s^{(n)}_{t+1}, a^{i,(n)}_{t+1}, ã^{-i,(n)}_{t+1}) − α log π_{θ^i}(a^{i,(n)}_{t+1} | s^{(n)}_{t+1}, ã^{-i,(n)}_{t+1}) − log ρ_{φ^i}(ã^{-i,(n)}_{t+1} | s^{(n)}_{t+1}) + log P_{ψ^i}(ã^{-i,(n)}_{t+1} | s^{(n)}_{t+1}).
        Set
            y^{(n)} = r^{(n)}_t                          for terminal s^{(n)}_{t+1},
            y^{(n)} = r^{(n)}_t + γ V^i(s^{(n)}_{t+1})    for non-terminal s^{(n)}_{t+1}.
        Compute the gradients:
            ∇_{ω^i} J_Q(ω^i) = ∇_{ω^i} Q_{ω^i}(s^{(n)}_t, a^{i,(n)}_t, a^{-i,(n)}_t) ( Q_{ω^i}(s^{(n)}_t, a^{i,(n)}_t, a^{-i,(n)}_t) − y^{(n)} ),
            ∇_{θ^i} J_π(θ^i) = ∇_{θ^i} α log π_{θ^i}(a^{i,(n)}_t | s^{(n)}_t, ã^{-i,(n)}_t) + ( ∇_{a^{i,(n)}_t} α log π_{θ^i}(a^{i,(n)}_t | s^{(n)}_t, ã^{-i,(n)}_t) − ∇_{a^{i,(n)}_t} Q_ω(s^{(n)}_t, a^{i,(n)}_t, ã^{-i,(n)}_t) ) ∇_θ f_{θ^i}(ε^i_t; s^{(n)}_t, ã^{-i,(n)}_t),
            ∇_{φ^i} J_ρ(φ^i) = ∇_{φ^i} log ρ_{φ^i}(ã^{-i,(n)}_t | s^{(n)}_t) + ( ∇_{ã^{-i,(n)}_t} log ρ_{φ^i}(ã^{-i,(n)}_t | s^{(n)}_t) − ∇_{ã^{-i,(n)}_t} log P_{ψ^i}(ã^{-i,(n)}_t | s^{(n)}_t) − ∇_{ã^{-i,(n)}_t} Q_{ω^i}(s^{(n)}_t, a^{i,(n)}_t, ã^{-i,(n)}_t) + ∇_{ã^{-i,(n)}_t} α log π_{θ^i}(a^{i,(n)}_t | s^{(n)}_t, ã^{-i,(n)}_t) ) ∇_{φ^i} g_{φ^i}(ε^{-i}_t; s^{(n)}_t).
        Update the parameters:
            ω^i ← ω^i − λ_Q ∇_{ω^i} J_Q(ω^i),
            θ^i ← θ^i − λ_π ∇_{θ^i} J_π(θ^i),
            φ^i ← φ^i − λ_φ ∇_{φ^i} J_ρ(φ^i).
    end for
    Every C gradient descent steps, reset the target parameters: ω̄^i ← β ω^i + (1 − β) ω̄^i.
end for


B Variational Lower Bounds in Multi-agent Reinforcement LearningB.1 The Lower Bound of The Log Likelihood of OptimalityWe can factorize P (ai1:T , a

−i1:T , s1:T |o−i1:T ) as :

P (ai1:T , a−i1:T , s1:T |o−i1:T ) = P (s1)

∏t

P (st+1|st, at)P (ait|a−it , st, o−it )P (a−it |st, o−it ), (23)

where P (ait|a−it , st, o−it ) is the conditional policy of agent i when other agents −i achieve optimality. As agent i has no

knowledge about rewards of other agents, we set P (ait|a−it , st, o−it ) ∝ 1.

Analogously, we factorize q(ai1:T , a−i1:T , s1:T |oi1:T , o

−i1:T ) as:

q(ai1:T , a−i1:T , s1:T |oi1:T , o

−i1:T ) = P (s1)

∏t

P (st+1|st, at)q(ait|a−it , st, oit, o−it )q(a−it |st, oit, o−it ) (24)

= P (s1)∏t

P (st+1|st, at)π(ait|st, a−it )ρ(a−it |st), (25)

where π(ait|a−it , st) is agent 1’s conditional policy at optimum and ρ(a−it |st) is agent 1’s model about opponents’ optimalpolicies.

With the above factorization, we have:

logP (oi1:T |o−i1:T )

= log∑

ai1:T ,a−i1:T ,s1:T

P (oi1:T , ai1:T , a

−i1:T , s1:T |o−i1:T ) (26)

≥∑

q(ai1:T , a−i1:T , s1:T |oi1:T , o

−i1:T ) log

P (oi1:T , ai1:T , a

−i1:T , s1:T |o−i1:T )

q(ai1:T , a−i1:T , s1:T |oi1:T , o

−i1:T )

(27)

= E(ai1:T ,a−i1:T ,s1:T∼q)

[

T∑t=1

logP (oit|o−it , st, ait, a−it ) +

XXXXXlogP (s1) +

XXXXXXXXXXXX

T∑t=1

logP (st+1|st, ait, a−it ) (28)

XXXXXX− logP (s1) −XXXXXXXXXXXX

T∑t=1

logP (st+1|st, ait, a−it ) (29)

−T∑t=1

log π(ait|st, a−it )−T∑t=1

logρ(a−it |st)

P (a−it |st, o−it )+

T∑t=1

logP (ait|st, a−it , o−it )] (30)

= E(ai1:T ,a−i1:T ,s1:T∼q)

[

T∑t=1

Ri(st, ait, a−it )− log π(ait|st, a−it )− log

ρ(a−it |st)P (a−it |st, o−it )

+ log 1] (31)

=∑t

E(st,ait,a−it )∼q[R

i(st, ait, a−it ) +H(π(ait|st, a−it ))−DKL(ρ(a−it |st)||P (a−it |st, o−it ))]. (32)

C Multi-Agent Soft-Q Learning

C.1 Soft Q-Function

We define the soft state-action value function Q^{π,ρ}_soft(s, a^i, a^{-i}) of agent i in a stochastic game as:

Q^{π,ρ}_soft(s_t, a^i_t, a^{-i}_t)
= r_t + E_{(s_{t+l}, a^i_{t+l}, a^{-i}_{t+l}, ...)∼q} [ ∑_{l=1}^∞ γ^l ( r_{t+l} + αH(π(a^i_{t+l} | a^{-i}_{t+l}, s_{t+l})) − D_KL(ρ(a^{-i}_{t+l} | s_{t+l}) || P(a^{-i}_{t+l} | s_{t+l})) ) ]    (33)
= E_{(s_{t+1}, a^i_{t+1}, a^{-i}_{t+1})} [ r_t + γ ( αH(π(a^i_{t+1} | s_{t+1}, a^{-i}_{t+1})) − D_KL(ρ(a^{-i} | s_{t+1}) || P(a^{-i} | s_{t+1})) + Q^{π,ρ}_soft(s_{t+1}, a^i_{t+1}, a^{-i}_{t+1}) ) ]    (34)
= E_{(s_{t+1}, a^{-i}_{t+1})} [ r_t + γ ( αH(π(· | s_{t+1}, a^{-i}_{t+1})) − D_KL(ρ(a^{-i} | s_{t+1}) || P(a^{-i} | s_{t+1})) + E_{a^i_{t+1}∼π} [ Q^{π,ρ}_soft(s_{t+1}, a^i_{t+1}, a^{-i}_{t+1}) ] ) ]    (35)
= E_{s_{t+1}} [ r_t + γ ( E_{a^{-i}_{t+1}∼ρ, a^i_{t+1}∼π} [ αH(π(a^i_{t+1} | s_{t+1}, a^{-i}_{t+1})) ] − D_KL(ρ(· | s_{t+1}) || P(· | s_{t+1})) + E_{a^{-i}_{t+1}∼ρ, a^i_{t+1}∼π} [ Q^{π,ρ}_soft(s_{t+1}, a^i_{t+1}, a^{-i}_{t+1}) ] ) ].    (36)

Then we can easily see that the objective in Eq. 7 can be rewritten as:

J(π, ρ) = ∑_t E_{(s_t, a^i_t, a^{-i}_t)∼(p_s, π, ρ)} [ Q^{π,ρ}_soft(s_t, a^i_t, a^{-i}_t) + αH(π(a^i_t | s_t, a^{-i}_t)) − D_KL(ρ(a^{-i}_t | s_t) || P(a^{-i}_t | s_t)) ],    (37)

by setting α = 1.

C.2 Policy Improvement and Opponent Model Improvement

Theorem 4 (Policy improvement theorem). Given a conditional policy π and opponent model ρ, define a new conditional policy π̃ as

π̃(· | s, a^{-i}) ∝ exp( (1/α) Q^{π,ρ}_soft(s, ·, a^{-i}) ),  ∀ s, a^{-i}.    (38)

Assume that throughout our computation Q is bounded and that ∑_{a^i} Q(s, a^i, a^{-i}) is bounded for any s and a^{-i} (for both π and π̃). Then Q^{π̃,ρ}_soft(s, a^i, a^{-i}) ≥ Q^{π,ρ}_soft(s, a^i, a^{-i}), ∀ s, a.

Theorem 5 (Opponent model improvement theorem). Given a conditional policy π and opponent model ρ, define a new opponent model ρ̃ as

ρ̃(· | s) ∝ exp( ∑_{a^i} Q^{π,ρ}_soft(s, a^i, ·) π(a^i | ·, s) + αH(π(· | s, ·)) + log P(· | s) ),  ∀ s, a^i.    (39)

Assume that throughout our computation Q is bounded and that ∑_{a^{-i}} exp( ∑_{a^i} Q(s, a^i, a^{-i}) π(a^i | s, a^{-i}) ) is bounded for any s and a^i (for both ρ and ρ̃). Then Q^{π,ρ̃}_soft(s, a^i, a^{-i}) ≥ Q^{π,ρ}_soft(s, a^i, a^{-i}), ∀ s, a.

The proof of Theorems 4 and 5 is based on the two observations that:

αH(π(· | s, a^{-i})) + E_{a^i∼π} [ Q^{π,ρ}_soft(s, a^i, a^{-i}) ] ≤ αH(π̃(· | s, a^{-i})) + E_{a^i∼π̃} [ Q^{π,ρ}_soft(s, a^i, a^{-i}) ],    (40)

and

E_{a^{-i}_{t+1}∼ρ, a^i_{t+1}∼π} [ αH(π(a^i_{t+1} | s_{t+1}, a^{-i}_{t+1})) ] − D_KL(ρ(· | s_{t+1}) || P(· | s_{t+1})) + E_{a^{-i}_{t+1}∼ρ, a^i_{t+1}∼π} [ Q^{π,ρ}_soft(s_{t+1}, a^i_{t+1}, a^{-i}_{t+1}) ]    (41)
≤ E_{a^{-i}_{t+1}∼ρ̃, a^i_{t+1}∼π} [ αH(π(a^i_{t+1} | s_{t+1}, a^{-i}_{t+1})) ] − D_KL(ρ̃(· | s_{t+1}) || P(· | s_{t+1})) + E_{a^{-i}_{t+1}∼ρ̃, a^i_{t+1}∼π} [ Q^{π,ρ}_soft(s_{t+1}, a^i_{t+1}, a^{-i}_{t+1}) ].    (42)

First, we notice that

αH(π(· | s, a^{-i})) + E_{a^i∼π} [ Q^{π,ρ}_soft(s, a^i, a^{-i}) ] = −α D_KL(π(· | s, a^{-i}) || π̃(· | s, a^{-i})) + α log ∑_{a^i} exp( (1/α) Q^{π,ρ}_soft(s, a^i, a^{-i}) ).    (43)

Therefore, the left-hand side is maximized only if the KL-divergence on the right-hand side is minimized. This KL-divergence is minimized only when π = π̃, which proves Equation 40.

Similarly, we have

E_{a^{-i}∼ρ, a^i∼π} [ αH(π(a^i | s, a^{-i})) ] − D_KL(ρ(· | s) || P(· | s)) + E_{a^{-i}∼ρ, a^i∼π} [ Q^{π,ρ}_soft(s, a^i, a^{-i}) ]
= −D_KL(ρ(· | s) || ρ̃(· | s)) + log ∑_{a^{-i}} exp( ∑_{a^i} Q^{π,ρ}(s, a^i, a^{-i}) π(a^i | s, a^{-i}) + αH(π(· | s, a^{-i})) + log P(a^{-i} | s) ),    (44)

which proves Equation 42.


With the above observations, the proof of Theorems 4 and 5 is completed as follows:

Q^{π,ρ}_soft(s_t, a^i_t, a^{-i}_t)
= E_{(s_{t+1}, a^i_{t+1}, a^{-i}_{t+1})} [ r_t + γ ( αH(π(a^i_{t+1} | s_{t+1}, a^{-i}_{t+1})) − D_KL(ρ(a^{-i}_{t+1} | s_{t+1}) || P(a^{-i}_{t+1} | s_{t+1})) + Q^{π,ρ}_soft(s_{t+1}, a^i_{t+1}, a^{-i}_{t+1}) ) ]    (45)
= E_{(s_{t+1}, a^{-i}_{t+1})} [ r_t + γ ( αH(π(· | s_{t+1}, a^{-i}_{t+1})) − D_KL(ρ(a^{-i}_{t+1} | s_{t+1}) || P(a^{-i}_{t+1} | s_{t+1})) + E_{a^i_{t+1}∼π} [ Q^{π,ρ}_soft(s_{t+1}, a^i_{t+1}, a^{-i}_{t+1}) ] ) ]    (46)
≤ E_{(s_{t+1}, a^{-i}_{t+1})} [ r_t + γ ( αH(π̃(· | s_{t+1}, a^{-i}_{t+1})) − D_KL(ρ(a^{-i}_{t+1} | s_{t+1}) || P(a^{-i}_{t+1} | s_{t+1})) + E_{a^i_{t+1}∼π̃} [ Q^{π,ρ}_soft(s_{t+1}, a^i_{t+1}, a^{-i}_{t+1}) ] ) ]    (47)
= E_{s_{t+1}} [ r_t + γ ( E_{a^{-i}_{t+1}∼ρ, a^i_{t+1}∼π̃} [ αH(π̃(a^i_{t+1} | s_{t+1}, a^{-i}_{t+1})) ] − D_KL(ρ(· | s_{t+1}) || P(· | s_{t+1})) + E_{a^{-i}_{t+1}∼ρ, a^i_{t+1}∼π̃} [ Q^{π,ρ}_soft(s_{t+1}, a^i_{t+1}, a^{-i}_{t+1}) ] ) ]    (48)
≤ E_{s_{t+1}} [ r_t + γ ( E_{a^{-i}_{t+1}∼ρ̃, a^i_{t+1}∼π̃} [ αH(π̃(a^i_{t+1} | s_{t+1}, a^{-i}_{t+1})) ] − D_KL(ρ̃(· | s_{t+1}) || P(· | s_{t+1})) + E_{a^{-i}_{t+1}∼ρ̃, a^i_{t+1}∼π̃} [ Q^{π,ρ}_soft(s_{t+1}, a^i_{t+1}, a^{-i}_{t+1}) ] ) ]    (49)
= E_{(s_{t+1}, a^i_{t+1}, a^{-i}_{t+1})∼q̃} [ r_t + γ ( αH(π̃(a^i_{t+1} | s_{t+1}, a^{-i}_{t+1})) − D_KL(ρ̃(a^{-i} | s_{t+1}) || P(a^{-i} | s_{t+1})) + r_{t+1} )
  + γ^2 E_{(s_{t+2}, a^{-i}_{t+2})} [ αH(π(· | s_{t+2}, a^{-i}_{t+2})) − D_KL(ρ(a^{-i}_{t+2} | s_{t+2}) || P(a^{-i}_{t+2} | s_{t+2})) + E_{a^i_{t+2}∼π} [ Q^{π,ρ}_soft(s_{t+2}, a^i_{t+2}, a^{-i}_{t+2}) ] ] ]    (50)
≤ E_{(s_{t+1}, a^i_{t+1}, a^{-i}_{t+1})∼q̃} [ r_t + γ ( αH(π̃(a^i_{t+1} | s_{t+1}, a^{-i}_{t+1})) − D_KL(ρ̃(a^{-i} | s_{t+1}) || P(a^{-i} | s_{t+1})) + r_{t+1} )
  + γ^2 E_{(s_{t+2}, a^{-i}_{t+2})} [ αH(π̃(· | s_{t+2}, a^{-i}_{t+2})) − D_KL(ρ̃(a^{-i}_{t+2} | s_{t+2}) || P(a^{-i}_{t+2} | s_{t+2})) + E_{a^i_{t+2}∼π̃} [ Q^{π,ρ}_soft(s_{t+2}, a^i_{t+2}, a^{-i}_{t+2}) ] ] ]    (51)
...
≤ r_t + E_{(s_{t+l}, a^i_{t+l}, a^{-i}_{t+l}, ...)∼q̃} [ ∑_{l=1}^∞ γ^l ( r_{t+l} + αH(π̃(a^i_{t+l} | a^{-i}_{t+l}, s_{t+l})) − D_KL(ρ̃(a^{-i}_{t+l} | s_{t+l}) || P(a^{-i}_{t+l} | s_{t+l})) ) ]    (52)
= Q^{π̃,ρ̃}_soft(s_t, a^i_t, a^{-i}_t).    (53)

With Theorems 4 and 5 and the above inequalities, we can see that if we start from an arbitrary conditional policy π_0 and an arbitrary opponent model ρ_0 and iterate between policy improvement,

π_{t+1}(· | s, a^{-i}) ∝ exp( (1/α) Q^{π_t, ρ_t}_soft(s, ·, a^{-i}) ),    (54)

and opponent model improvement,

ρ_{t+1}(· | s) ∝ exp( ∑_{a^i} Q^{π_{t+1}, ρ_t}_soft(s, a^i, ·) π_{t+1}(a^i | ·, s) + αH(π_{t+1}(· | s, ·)) + log P(· | s) ),    (55)

then Q^{π_t, ρ_t}_soft(s, a^i, a^{-i}) can be shown to increase monotonically. Similar to [Haarnoja et al., 2017], we can show that with certain regularity conditions satisfied, any non-optimal policy and opponent model can be improved this way, and Theorem 1 is proved.

C.3 Soft Bellman Equation

As we show in Appendix C.2, when the training converges, we have:

π*(a^i | s, a^{-i}) = exp( (1/α) Q*(s, a^i, a^{-i}) ) / ∑_{a^i} exp( (1/α) Q*(s, a^i, a^{-i}) ),    (56)

and

ρ*(a^{-i} | s) = exp( ∑_{a^i} Q*(s, a^i, a^{-i}) π*(a^i | s, a^{-i}) + αH(π*(a^i | s, a^{-i})) + log P(a^{-i} | s) ) / ∑_{a^{-i}} exp( ∑_{a^i} Q*(s, a^i, a^{-i}) π*(a^i | s, a^{-i}) + αH(π*(a^i | s, a^{-i})) + log P(a^{-i} | s) )
= P(a^{-i} | s) ( ∑_{a^i} exp( (1/α) Q*_soft(s, a^i, a^{-i}) ) )^α / exp(V*(s)),    (57)


where the equality in Eq. 57 comes from substituting π* with Eq. 56, and we define the soft state value function V^{π,ρ}_soft(s) of agent i as:

V^{π,ρ}_soft(s_t) = log ∑_{a^{-i}_t} P(a^{-i}_t | s_t) ( ∑_{a^i_t} exp( (1/α) Q^{π,ρ}_soft(s_t, a^i_t, a^{-i}_t) ) )^α.    (58)

Then we can show that

Q^{π*,ρ*}_soft(s, a^i, a^{-i})
= r_t + γ E_{s'∼p_s} [ E_{a^{-i}_{t+1}∼ρ, a^i_{t+1}∼π} [ αH(π(a^i_{t+1} | s_{t+1}, a^{-i}_{t+1})) ] − D_KL(ρ(· | s_{t+1}) || P(· | s_{t+1})) + E_{a^{-i}_{t+1}∼ρ, a^i_{t+1}∼π} [ Q^{π,ρ}_soft(s_{t+1}, a^i_{t+1}, a^{-i}_{t+1}) ] ]
= r_t + γ E_{s'∼p_s} [ V*(s') ].    (59)

We define the soft value iteration operator T as:

T Q(s, a^i, a^{-i}) ≜ R(s, a^i, a^{-i}) + γ E_{s'∼p_s} [ log ∑_{a^{-i'}} P(a^{-i'} | s') ( ∑_{a^{i'}} exp( (1/α) Q(s', a^{i'}, a^{-i'}) ) )^α ].    (60)

In a symmetric fully cooperative game with only one global optimum, we can show, as done in [Wen et al., 2019], that the operator defined above is a contraction mapping. We define a norm on Q-values, ||Q^i_1 − Q^i_2|| ≜ max_{s, a^i, a^{-i}} |Q^i_1(s, a^i, a^{-i}) − Q^i_2(s, a^i, a^{-i})|. Let ε = ||Q^i_1 − Q^i_2||. Then we have:

log ∑_{a^{-i'}} P(a^{-i'} | s') ( ∑_{a^{i'}} exp( (1/α) Q_1(s', a^{i'}, a^{-i'}) ) )^α
≤ log ∑_{a^{-i'}} P(a^{-i'} | s') ( ∑_{a^{i'}} exp( (1/α) Q_2(s', a^{i'}, a^{-i'}) + ε ) )^α
= log ∑_{a^{-i'}} P(a^{-i'} | s') ( ∑_{a^{i'}} exp( (1/α) Q_2(s', a^{i'}, a^{-i'}) ) exp(ε) )^α
= log ∑_{a^{-i'}} P(a^{-i'} | s') exp(ε)^α ( ∑_{a^{i'}} exp( (1/α) Q_2(s', a^{i'}, a^{-i'}) ) )^α
= αε + log ∑_{a^{-i'}} P(a^{-i'} | s') ( ∑_{a^{i'}} exp( (1/α) Q_2(s', a^{i'}, a^{-i'}) ) )^α.    (61)

Similarly, log ∑_{a^{-i'}} P(a^{-i'} | s') ( ∑_{a^{i'}} exp( (1/α) Q_1(s', a^{i'}, a^{-i'}) ) )^α ≥ −αε + log ∑_{a^{-i'}} P(a^{-i'} | s') ( ∑_{a^{i'}} exp( (1/α) Q_2(s', a^{i'}, a^{-i'}) ) )^α.

Therefore ||T Q^i_1 − T Q^i_2|| ≤ γε = γ ||Q^i_1 − Q^i_2||, where α = 1.