
Multi-Agent Determinantal Q-Learning

Yaodong Yang*1,2, Ying Wen*1,2, Liheng Chen3, Jun Wang2, Kun Shao1, David Mguni1, Weinan Zhang3

Abstract

Centralized training with decentralized execution has become an important paradigm in multi-agent learning. Though practical, current methods rely on restrictive assumptions to decompose the centralized value function across agents for execution. In this paper, we eliminate this restriction by proposing multi-agent determinantal Q-learning. Our method is established on Q-DPP, an extension of the determinantal point process (DPP) with a partition-matroid constraint to the multi-agent setting. Q-DPP promotes agents to acquire diverse behavioral models; this allows a natural factorization of the joint Q-function with no need for a priori structural constraints on the value function or special network architectures. We demonstrate that Q-DPP generalizes major solutions including VDN, QMIX, and QTRAN on decentralizable cooperative tasks. To efficiently draw samples from Q-DPP, we adopt an existing linear-time sampler with a theoretical approximation guarantee. The sampler also benefits exploration by coordinating agents to cover orthogonal directions in the state space during multi-agent training. We evaluate our algorithm on various cooperative benchmarks; its effectiveness has been demonstrated when compared with the state-of-the-art.

1. Introduction

Multi-agent reinforcement learning (MARL) methods hold great potential to solve a variety of real-world problems, such as mastering multi-player video games (Peng et al., 2017), dispatching taxi orders (Li et al., 2019), and studying population dynamics (Yang et al., 2018). In this work, we consider the multi-agent cooperative setting (Panait & Luke, 2005), where a team of agents collaborates to achieve one common goal in a partially observed environment.

*Equal contribution. 1Huawei Technology R&D UK. 2University College London. 3Shanghai Jiaotong University. Correspondence to: Yaodong Yang <[email protected]>.

Proceedings of the 37th International Conference on Machine Learning, Vienna, Austria, PMLR 108, 2020. Copyright 2020 by the author(s).

A full spectrum of MARL algorithms has been developed to solve cooperative tasks (Panait & Luke, 2005); the two endpoints of the spectrum are independent and centralized learning (see Fig. 1). Independent learning (IL) (Tan, 1993) simply treats other agents' influence on the system as part of the environment. The learning agent not only faces a non-stationary environment, but also suffers from spurious rewards (Sunehag et al., 2017). Centralized learning (CL), at the other extreme, treats a multi-agent problem as a single-agent problem, despite the fact that many real-world applications require local autonomy. Importantly, CL approaches exhibit combinatorial complexity and can hardly scale beyond tens of agents (Yang et al., 2019).

Another paradigm typically considered is a hybrid of centralized training and decentralized execution (CTDE) (Oliehoek et al., 2008). For value-based approaches in the CTDE framework, a fundamental challenge is how to correctly decompose the centralized value function among agents for decentralized execution. For a cooperative task to be deemed decentralizable, the local maxima of each agent's individual value function must jointly correspond to the global maximum of the joint value function. To enforce such a condition, current state-of-the-art methods rely on restrictive structural constraints and/or network architectures. For instance, Value Decomposition Network (VDN) (Sunehag et al., 2017) and Factorized-Q (Zhou et al., 2019) propose to directly factorize the joint value function into a summation of individual value functions. QMIX (Rashid et al., 2018) generalizes the summation to non-linear aggregations while maintaining a monotonic relationship between the centralized and individual value functions. QTRAN (Son et al., 2019) introduces a refined learning objective on top of QMIX along with specific network designs.

Unsurprisingly, the structural constraints put forward by VDN / QMIX / QTRAN inhibit the representational power of the centralized value function (Son et al., 2019); as a result, the class of decentralizable cooperative tasks these methods can tackle is limited. For example, poor empirical results of QTRAN have been reported on multiple multi-agent cooperative benchmarks (Mahajan et al., 2019).

Apart from the aforementioned problems, structural constraints also hinder efficient exploration when applied to value-function decomposition.


Figure 1: Spectrum of MARL methods on cooperative tasks. The spectrum spans independent learning, centralized training with decentralized execution, and centralized learning, with methods further distinguished by independent versus coordinated exploration and by value-based versus actor-critic learning; methods shown include Ind. Q (Tan, 1993), VDN (Sunehag et al., 2017), Factor. Q (Zhou et al., 2019), QMIX (Rashid et al., 2018), QTRAN (Son et al., 2019), MAVEN (Mahajan et al., 2019), Det. SARSA (Osogami et al., 2019), COMA (Foerster et al., 2018), MADDPG (Lowe et al., 2017), BiCNet (Peng et al., 2017), and Q-DPP.

In fact, since agents are treated independently during the execution stage, CTDE methods inevitably lack a principled exploration strategy (Matignon et al., 2007). Clearly, increasing the per-agent exploration rate of ε-greedy helps exploration in the single-agent setting; however, it has been proved (Mahajan et al., 2019) that, due to the structural constraints (e.g., the monotonicity assumption in QMIX), increasing ε in the multi-agent setting only lowers the probability of obtaining the optimal value function. As a treatment, MAVEN (Mahajan et al., 2019) introduces a hierarchical model to coordinate diverse exploration among agents. Yet, a principled exploration strategy with minimal structural constraints on the value function is still missing for value-based CTDE methods.

To eliminate restrictive constraints on the value-function decomposition, one reasonable solution is to make agents acquire a diverse set of behavioral models during training so that the optimal action of one agent does not depend on the actions of the other agents. In such a scenario, the equivalence between the local maxima of the per-agent value functions and the global maximum of the joint value function is achieved automatically. As a result, the centralized value function enjoys a natural factorization with no need for any structural constraints imposed beforehand.

In this paper, we present a new value-based solution in the CTDE paradigm for multi-agent cooperative tasks. We establish Q-DPP, an extension of the determinantal point process (DPP) (Macchi, 1977) with a partition constraint, and apply it to multi-agent learning. DPPs are elegant probabilistic models on sets that capture both quality and diversity when a subset is sampled from a ground set; this makes them ideal for modeling the set containing different agents' observation-action pairs in the multi-agent learning context. We adopt Q-DPP as a function approximator for the centralized value function. An attractive property of Q-DPP is that, at the optimum, it offers a natural factorization of the centralized value function, provided agents have acquired a diverse set of behaviors. Our method eliminates the need for a priori structural constraints or bespoke neural architectures.

In fact, we demonstrate that Q-DPP generalizes current solvers including VDN, QMIX, and QTRAN: all of these methods can be derived as special cases of Q-DPP. As an additional contribution, we adopt a tractable sampler for Q-DPP, based on the sample-by-projection idea in P-DPP (Celis et al., 2018), with a theoretical approximation guarantee. Our sampler makes agents explore in a sequential manner; agents who act later are coordinated to visit only the orthogonal areas of the state space that previous agents have not explored. Such a coordinated way of exploration effectively boosts the sampling efficiency in the CTDE setting. Building upon these advantages, we finally demonstrate that our proposed Q-DPP algorithm is superior to the existing state-of-the-art solutions on a variety of multi-agent cooperation benchmarks.

2. Preliminaries: Determinantal Point Process

A DPP is a probabilistic framework that characterizes how likely a subset is to be sampled from a ground set. Originating from quantum physics as a model of repulsive Fermion particles (Macchi, 1977), DPPs have recently been introduced to the machine learning community due to their probabilistic nature (Kulesza et al., 2012).

Definition 1 (DPP). For a ground set of items $\mathcal{Y} = \{1, 2, \ldots, M\}$, a DPP, denoted by $\mathcal{P}$, is a probability measure on the set of all subsets of $\mathcal{Y}$, i.e., $2^{\mathcal{Y}}$. Given an $M \times M$ positive semi-definite (PSD) kernel $L$ that measures similarity for any pair of items in $\mathcal{Y}$, let $\mathbf{Y}$ be a random subset drawn according to $\mathcal{P}$; then, for all $Y \subseteq \mathcal{Y}$,

$$\mathcal{P}_L(\mathbf{Y} = Y) \propto \det(L_Y) = \mathrm{Vol}^2\big(\{w_i\}_{i \in Y}\big), \qquad (1)$$

where $L_Y := [L_{i,j}]_{i,j \in Y}$ denotes the submatrix of $L$ whose entries are indexed by the items included in $Y$. If we write $L := W W^{\top}$ with $W \in \mathbb{R}^{M \times P}$, $P \le M$, and rows of $W$ being $\{w_i\}$, then the determinant is essentially the squared $|Y|$-dimensional volume of the parallelepiped spanned by the rows of $W$ corresponding to the elements of $Y$.

A PSD kernel ensures that all principal minors of $L$ are non-negative, $\det(L_Y) \ge 0$; Eq. 1 thus induces a proper probability distribution. The normalizer can be computed as $\sum_{Y \subseteq \mathcal{Y}} \det(L_Y) = \det(L + I)$, where $I$ is the $M \times M$ identity matrix. Intuitively, one can think of a diagonal entry $L_{i,i}$ as capturing the quality of item $i$, while an off-diagonal entry $L_{i,j}$ measures the similarity between items $i$ and $j$. A DPP models the repulsive connections among the items in a sampled subset. In the two-item example,

$$\mathcal{P}_L(\{i,j\}) \propto \begin{vmatrix} L_{i,i} & L_{i,j} \\ L_{j,i} & L_{j,j} \end{vmatrix} = L_{i,i} L_{j,j} - L_{i,j} L_{j,i},$$

which suggests that if items $i$ and $j$ are perfectly similar, i.e., $L_{i,j} = \sqrt{L_{i,i} L_{j,j}}$, then the two items will almost surely not co-occur; such a two-item subset $\{i, j\}$ will never be sampled from the ground set.
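The quantities in Definition 1 are easy to check numerically. The following sketch (ours, not from the paper) builds a toy $3 \times 3$ kernel $L = WW^{\top}$, enumerates all subsets, and verifies that the normalizer equals $\det(L + I)$:

```python
import numpy as np
from itertools import combinations

# Toy DPP on a ground set of M = 3 items; L = W W^T is PSD by construction.
rng = np.random.default_rng(0)
W = rng.normal(size=(3, 3))
L = W @ W.T

def det_sub(L, subset):
    """det(L_Y): determinant of the principal submatrix indexed by `subset`."""
    idx = list(subset)
    return np.linalg.det(L[np.ix_(idx, idx)]) if idx else 1.0

subsets = [s for r in range(4) for s in combinations(range(3), r)]
weights = np.array([det_sub(L, s) for s in subsets])      # unnormalized det(L_Y)

# Normalizer of Eq. 1: summing over all subsets equals det(L + I).
assert np.isclose(weights.sum(), np.linalg.det(L + np.eye(3)))

for s, p in zip(subsets, weights / weights.sum()):
    print(s, round(float(p), 4))
```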


DPPs are attractive in that they only require training the kernel matrix $L$, which can be learned via maximum likelihood (Affandi et al., 2014). A trainable DPP favors many supervised learning tasks where diversified outcomes are desired, such as image generation (Elfeki et al., 2019), video summarization (Sharghi et al., 2018), model ensembles (Pang et al., 2019), and recommender systems (Chen et al., 2018). It is, however, non-trivial to adapt DPPs to a multi-agent setting, since additional restrictions must be placed on the ground set so that valid samples can be drawn for the purpose of multi-agent training. This leads to our Q-DPP.

3. Multi-Agent Determinantal Q-Learning

We offer a new value-based solution to multi-agent cooperative tasks. In particular, we introduce Q-DPP as a general function approximator for the centralized value function, similar to neural networks in deep Q-learning (Mnih et al., 2015). We start from the problem formulation.

3.1. Problem Formulation of Multi-Agent Cooperation

Multi-agent cooperation in a partially observed environment is usually modeled as a Dec-POMDP (Oliehoek et al., 2016) denoted by a tuple $G = \langle \mathcal{S}, \mathcal{N}, \mathcal{A}, \mathcal{O}, \mathcal{Z}, \mathcal{P}, \mathcal{R}, \gamma \rangle$. Within $G$, $s \in \mathcal{S}$ denotes the global environmental state. At every time-step $t \in \mathbb{Z}^{+}$, each agent $i \in \mathcal{N} = \{1, \ldots, N\}$ selects an action $a_i \in \mathcal{A}$, and the joint action is $\mathbf{a} := (a_i)_{i \in \mathcal{N}} \in \mathcal{A}^{N}$. Since the environment is partially observed, each agent only has access to its local observation $o \in \mathcal{O}$, acquired through the observation function $\mathcal{Z}(s, a): \mathcal{S} \times \mathcal{A} \to \mathcal{O}$. The state-transition dynamics are determined by $\mathcal{P}(s' \mid s, \mathbf{a}): \mathcal{S} \times \mathcal{A}^{N} \times \mathcal{S} \to [0, 1]$. Agents optimize towards one shared goal whose performance is measured by $\mathcal{R}(s, \mathbf{a}): \mathcal{S} \times \mathcal{A}^{N} \to \mathbb{R}$, and $\gamma \in [0, 1)$ discounts future rewards. Each agent maintains an observation-action history $\tau_i \in \mathcal{T} := (\mathcal{O} \times \mathcal{A})^{t}$ and executes a stochastic policy $\pi_i(a_i \mid \tau_i): \mathcal{T} \times \mathcal{A} \to [0, 1]$ conditioned on $\tau_i$. The collection of all agents' histories is $\boldsymbol{\tau} := (\tau_i)_{i \in \mathcal{N}} \in \mathcal{T}^{N}$. Given a joint policy $\boldsymbol{\pi} := (\pi_i)_{i \in \mathcal{N}}$, the joint action-value function at time $t$ is $Q^{\boldsymbol{\pi}}(\boldsymbol{\tau}^{t}, \mathbf{a}^{t}) = \mathbb{E}_{s^{t+1:\infty}, \mathbf{a}^{t+1:\infty}}\big[G^{t} \mid \boldsymbol{\tau}^{t}, \mathbf{a}^{t}\big]$, where $G^{t} = \sum_{i=0}^{\infty} \gamma^{i} R^{t+i}$ is the total accumulated reward.

The goal is to find the optimal value function $Q^{*} = \max_{\boldsymbol{\pi}} Q^{\boldsymbol{\pi}}(\boldsymbol{\tau}^{t}, \mathbf{a}^{t})$ and the corresponding policy $\boldsymbol{\pi}^{*}$. A direct centralized approach is to learn the joint value function, parameterized by $\theta$, by minimizing the squared temporal-difference error $\mathcal{L}(\theta)$ (Watkins & Dayan, 1992) over a sampled mini-batch of transitions $\{\langle \boldsymbol{\tau}, \mathbf{a}, R, \boldsymbol{\tau}' \rangle\}_{j=1}^{E}$, i.e.,

$$\mathcal{L}(\theta) = \sum_{j=1}^{E} \Big\| R + \gamma \max_{\mathbf{a}'} Q(\boldsymbol{\tau}', \mathbf{a}'; \theta^{-}) - Q^{\boldsymbol{\pi}}(\boldsymbol{\tau}, \mathbf{a}; \theta) \Big\|^{2}, \qquad (2)$$

where $\theta^{-}$ denotes the target parameters that are periodically copied from $\theta$ during training.
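As a concrete reading of Eq. 2, the sketch below computes the squared TD error for a mini-batch, assuming hypothetical callables q_joint (current parameters) and q_joint_target (target parameters) that return scalar joint Q-values; it is an illustration, not the paper's implementation:

```python
def td_loss(q_joint, q_joint_target, batch, gamma=0.99):
    """Squared temporal-difference error of Eq. 2 over a mini-batch.

    `batch` is a list of (tau, a, R, tau_next, legal_joint_actions) tuples;
    q_joint and q_joint_target are assumed callables returning scalars.
    """
    loss = 0.0
    for tau, a, R, tau_next, legal_next in batch:
        # Greedy bootstrap target computed with the (frozen) target parameters.
        target = R + gamma * max(q_joint_target(tau_next, a_next)
                                 for a_next in legal_next)
        loss += (target - q_joint(tau, a)) ** 2
    return loss
```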

In our work, apart from the joint value function, we also focus on obtaining a decentralized policy for each agent. CTDE is a paradigm for solving Dec-POMDPs (Oliehoek et al., 2008) in which the algorithm has access to all agents' local histories $\boldsymbol{\tau}$ during training; during testing, however, each agent uses only its own history $\tau_i$ for execution. CTDE methods provide valid solutions to multi-agent cooperative tasks that are decentralizable, which is formally defined below.

Definition 2 (Decentralizable Cooperative Tasks, a.k.a. the IGM Condition (Son et al., 2019)). A cooperative task is decentralizable if there exist $\{Q_i\}_{i=1}^{N}$ such that, for all $\boldsymbol{\tau} \in \mathcal{T}^{N}$ and $\mathbf{a} \in \mathcal{A}^{N}$,

$$\arg\max_{\mathbf{a}} Q^{\boldsymbol{\pi}}(\boldsymbol{\tau}, \mathbf{a}) = \begin{pmatrix} \arg\max_{a_1} Q_1(\tau_1, a_1) \\ \vdots \\ \arg\max_{a_N} Q_N(\tau_N, a_N) \end{pmatrix}. \qquad (3)$$

Eq. 3 states that the local maxima of the extracted per-agent value functions must jointly correspond to the global maximum of the joint value function. A key challenge for CTDE methods is then how to correctly extract each agent's individual Q-function $\{Q_i\}_{i=1}^{N}$, and hence an executable policy, from a centralized Q-function $Q^{\boldsymbol{\pi}}$.
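Definition 2 can be checked by brute force in small tabular cases; the sketch below (an illustration with toy shapes, not the paper's code) compares the joint argmax with the tuple of per-agent argmaxes:

```python
import numpy as np

def satisfies_igm(q_joint, q_locals):
    """Brute-force check of Eq. 3 for a fixed joint history.

    q_joint: array of shape (|A|,) * N with the joint values Q^pi(tau, a);
    q_locals: list of N arrays with the individual values Q_i(tau_i, a_i).
    """
    joint_argmax = np.unravel_index(np.argmax(q_joint), q_joint.shape)
    local_argmax = tuple(int(np.argmax(q)) for q in q_locals)
    return joint_argmax == local_argmax

# A decentralizable 2-agent, 2-action example: both argmaxes pick action 1.
q_joint = np.array([[0.0, 1.0], [1.0, 5.0]])
print(satisfies_igm(q_joint, [np.array([0.2, 0.9]), np.array([0.1, 0.8])]))  # True
```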

To satisfy Eq. 3, current solutions rely on restrictive assumptions that enforce structural constraints on the factorization of the joint Q-function. For example, VDN (Sunehag et al., 2017) adopts the additivity assumption, $Q^{\boldsymbol{\pi}}(\boldsymbol{\tau}, \mathbf{a}) := \sum_{i=1}^{N} Q_i(\tau_i, a_i)$. QMIX (Rashid et al., 2018) applies the monotonicity assumption, ensuring $\partial Q^{\boldsymbol{\pi}}(\boldsymbol{\tau}, \mathbf{a}) / \partial Q_i(\tau_i, a_i) \ge 0$ for all $i \in \mathcal{N}$. QTRAN (Son et al., 2019) introduces a refined factorizable learning objective in addition to QMIX. Nonetheless, such structural constraints harm the representational power of the centralized value function and also hinder efficient exploration (Son et al., 2019). To mitigate these problems, we propose Q-DPP as an alternative that naturally factorizes the joint Q-function by learning a diverse set of behavioral models among agents.

3.2. Q-DPP: A Constrained DPP for MARL

Our method is established on Q-DPP, an extension of the DPP that suits MARL. We assume that the local observation $o_i$ encodes the full history $\tau_i$ at each time-step. We model the ground set of all agents' observation-action pairs by a DPP, i.e., $\mathcal{Y} = \big\{(o_1^1, a_1^1), \ldots, (o_N^{|\mathcal{O}|}, a_N^{|\mathcal{A}|})\big\}$, with the size of the ground set being $|\mathcal{Y}| = N|\mathcal{O}||\mathcal{A}|$.

In the context of multi-agent learning, each agent takes one valid action depending on its local observation. A valid sample from the DPP is therefore expected to include exactly one observation-action pair for each agent, and the observations in the sampled pairs must match the true observations that the agents receive at that time-step. To meet these requirements, we propose a new type of DPP, named Q-DPP.


Figure 2: Example of Q-DPP with the quality-diversity kernel decomposition $L = WW^{\top}$, $W = DB^{\top}$, in a single-state, three-player learning task where each agent has three actions to choose from. The size of the ground set is $|\mathcal{Y}| = 9$, and the number of valid subsets is $|C(o)| = 3^3 = 27$. Each row of $W$ combines a quality term $d_i = \exp\big(\tfrac{1}{2} Q_{\mathcal{I}(o_i,a_i)}(o_i, a_i)\big)$ with a diversity feature $b_{\mathcal{I}(o_i,a_i)}(o_i, a_i)$; different colors represent the different partitions of each agent's observation-action pairs. Suppose all three agents select their 2nd action; then the Q-value of the joint action according to Eq. 5 is $Q^{\boldsymbol{\pi}}(\mathbf{o}, \mathbf{a}) = \det\big([L_{i,j}]_{i,j \in \{2,5,8\}}\big)$.

Definition 3 (Q-DPP). Given a ground set $\mathcal{Y}$ of size $M$ that includes all $N$ agents' possible observation-action pairs, $\mathcal{Y} = \big\{(o_1^1, a_1^1), \ldots, (o_N^{|\mathcal{O}|}, a_N^{|\mathcal{A}|})\big\}$, we partition $\mathcal{Y}$ into $N$ disjoint parts, i.e., $\mathcal{Y} = \bigcup_{i=1}^{N} \mathcal{Y}_i$ with $\sum_{i=1}^{N} |\mathcal{Y}_i| = M = N|\mathcal{O}||\mathcal{A}|$, where each partition contains one individual agent's possible observation-action pairs. At every time-step, given the agents' observations $\mathbf{o} = (o_i)_{i \in \mathcal{N}}$, we define $C(\mathbf{o})$ to be the set of valid subsets of $\mathcal{Y}$ that include only observation-action pairs the agents are allowed to take,

$$C(\mathbf{o}) := \big\{ Y \subseteq \mathcal{Y} : |Y \cap \mathcal{Y}_i(o_i)| = 1, \ \forall i \in \{1, \ldots, N\} \big\},$$

with $|C(\mathbf{o})| = |\mathcal{A}|^{N}$, where $\mathcal{Y}_i(o_i)$, of size $|\mathcal{A}|$, denotes the set of pairs in partition $\mathcal{Y}_i$ whose observation is $o_i$,

$$\mathcal{Y}_i(o_i) = \big\{ (o_i, a_i^1), \ldots, (o_i, a_i^{|\mathcal{A}|}) \big\}.$$

Q-DPP, denoted by $\mathcal{P}$, defines a probability measure over the valid subsets $Y \in C(\mathbf{o})$. Let $\mathbf{Y}$ be a random subset drawn according to $\mathcal{P}$; its probability distribution is defined as

$$\mathcal{P}_L\big(\mathbf{Y} = Y \mid \mathbf{Y} \in C(\mathbf{o})\big) := \frac{\det(L_Y)}{\sum_{Y' \in C(\mathbf{o})} \det(L_{Y'})}. \qquad (4)$$

In addition, given a valid sample $Y \in C(\mathbf{o})$, we define an identifying function $\mathcal{I}: \mathcal{Y} \to \mathcal{N}$ that specifies the agent index for each valid pair in $Y$, and an index function $\mathcal{J}: \mathcal{Y} \to \{1, \ldots, M\}$ that specifies the position of each item of $Y$ in the ground set $\mathcal{Y}$.
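To make Definition 3 concrete, the following sketch (with hypothetical toy sizes) enumerates the ground set, its agent-wise partitions, and the valid subsets $C(\mathbf{o})$ for a two-agent task with two observations and two actions per agent:

```python
from itertools import product

# Hypothetical toy task: N = 2 agents, observations {0, 1}, actions {0, 1}.
N, observations, actions = 2, (0, 1), (0, 1)

# Ground set: one (agent, observation, action) triple per item, partitioned by agent.
ground_set = [(i, o, a) for i in range(N) for o in observations for a in actions]
partitions = {i: [item for item in ground_set if item[0] == i] for i in range(N)}
assert len(ground_set) == N * len(observations) * len(actions)   # M = N|O||A| = 8

def valid_subsets(joint_obs):
    """C(o): exactly one pair per agent, whose observation matches the true o_i."""
    per_agent = [[item for item in partitions[i] if item[1] == joint_obs[i]]
                 for i in range(N)]
    return [frozenset(combo) for combo in product(*per_agent)]

print(len(valid_subsets((0, 1))))   # |A|^N = 4
```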

The construction of Q-DPP is inspired by P-DPP (Celis et al., 2018). However, the partitioned sets in P-DPP stay fixed, whereas in Q-DPP, $C(\mathbf{o}^t)$ changes at every time-step with the new observations, and the kernel $L$ is learned through reinforcement learning rather than being given. More differences are listed in Appendix A.3.

Given Q-DPP, we can represent the centralized value function by adopting Q-DPP as a general function approximator:

$$Q^{\boldsymbol{\pi}}(\mathbf{o}, \mathbf{a}) := \log \det\big(L_{Y = \{(o_1, a_1), \ldots, (o_N, a_N)\} \in C(\mathbf{o}^t)}\big), \qquad (5)$$

where $L_Y$ denotes the submatrix of $L$ whose entries are indexed by the pairs included in $Y$. Q-DPP embeds the connection between the joint action and each agent's individual actions into a subset-sampling process, and the Q-value is quantified by the determinant of a kernel matrix whose elements are indexed by the associated observation-action pairs. The goal of multi-agent learning is to learn an optimal joint Q-function. Eq. 5 states that $\det(L_Y) = \exp\big(Q^{\boldsymbol{\pi}}(\mathbf{o}, \mathbf{a})\big)$, meaning that Q-DPP assigns large probability to the subsets that have large Q-values. Given that $\det(L_Y)$ is always positive, the log operator ensures that Q-DPP, as a general function approximator, can recover any real-valued Q-function.
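A minimal sketch of Eq. 5, reusing the toy ground set from the previous sketch and a random stand-in for the learned kernel $L$ (both are illustrative assumptions):

```python
import numpy as np

# Toy ground set: 2 agents x 2 observations x 2 actions (see the previous sketch).
ground_set = [(i, o, a) for i in range(2) for o in (0, 1) for a in (0, 1)]
M = len(ground_set)

rng = np.random.default_rng(1)
W = rng.normal(size=(M, M))
L = W @ W.T                                   # stand-in for a learned PSD kernel

def q_joint(chosen_pairs):
    """Q^pi(o, a) = log det(L_Y) for the selected observation-action pairs (Eq. 5)."""
    idx = [ground_set.index(p) for p in chosen_pairs]
    return float(np.log(np.linalg.det(L[np.ix_(idx, idx)])))

# Agents observe o = (0, 1) and choose actions a = (1, 0).
print(q_joint([(0, 0, 1), (1, 1, 0)]))
```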

DPPs capture both the quality and the diversity of a sampled subset; the joint Q-function represented by Q-DPP should therefore acknowledge not only the quality of each agent's individual action towards a large reward, but also the diversification of the agents' actions. The remaining question is how to obtain such a quality-diversity representation.

3.3. Representation of Q-DPP Kernels

For any PSD matrix $L$, a $W$ can always be found such that $L = W W^{\top}$ with $W \in \mathbb{R}^{M \times P}$, $P \le M$. Since the diagonal and off-diagonal entries of $L$ represent quality and diversity respectively, we adopt an interpretable decomposition by expressing each row of $W$ as the product of a quality term $d_i \in \mathbb{R}^{+}$ and a diversity feature $b_i \in \mathbb{R}^{P \times 1}$ with $\|b_i\| \le 1$, i.e., $w_i = d_i b_i^{\top}$. An example of such a decomposition is visualized in Fig. 2, where we define $B = [b_1, \ldots, b_M]$ and $D = \mathrm{diag}(d_1, \ldots, d_M)$. Note that both $D$ and $B$ are free parameters that can be learned from the environment during the Q-learning process in Eq. 2.

If we set the quality term to be each agent's individual Q-value for a given observation-action pair, i.e., $d_i := \exp\big(\tfrac{1}{2} Q_{\mathcal{I}(o_i, a_i)}(o_i, a_i)\big)$ for all $(o_i, a_i) \in \mathcal{Y}$, $i \in \{1, \ldots, M\}$, then Eq. 5 can be further written as

$$Q^{\boldsymbol{\pi}}(\mathbf{o}, \mathbf{a}) = \log \det\big(W_Y W_Y^{\top}\big) = \log\Big( \det\big(D_Y^{\top} D_Y\big) \det\big(B_Y^{\top} B_Y\big) \Big) = \sum_{i=1}^{N} Q_{\mathcal{I}(o_i, a_i)}(o_i, a_i) + \log \det\big(B_Y^{\top} B_Y\big). \qquad (6)$$


Since the determinant only reaches its maximum when the associated vectors in $B_Y$ are mutually orthogonal (Noble et al., 1988), Eq. 6 essentially stipulates that Q-DPP represents the joint value function by taking into account not only the quality of each agent's contribution towards reward maximization but also, more importantly and from a holistic perspective, the orthogonalization of the agents' actions.

In fact, diversifying agents' behaviors is an important factor in satisfying the condition in Eq. 3. Intuitively, in a decentralizable task with a shared goal, promoting orthogonality between agents' actions helps clarify the functionality and responsibility of each agent, which in return leads to a better instantiation of Eq. 3. On the other hand, diversity does not mean that agents have to take different actions all the time. Since the goal is still to achieve a large reward by optimizing Eq. 2, scenarios in which agents need to take identical actions to accomplish a task are not excluded as a result of promoting diversity.

3.4. Connections to Current Methods

Based on the quality-diversity representation, one can draw a key connection between Q-DPP and existing methods. It turns out that, under the sufficient condition that the learned diversity features corresponding to the optimal actions are mutually orthogonal, Q-DPP degenerates to VDN (Sunehag et al., 2017), QMIX (Rashid et al., 2018), and QTRAN (Son et al., 2019), respectively.

To elaborate this condition, denote $a_i^{*} = \arg\max_{a_i} Q_i(o_i, a_i)$, $\mathbf{a}^{*} = (a_i^{*})_{i \in \mathcal{N}}$, and $Y^{*} = \{(o_i, a_i^{*})\}_{i=1}^{N}$. With $\|b_i\| = 1$ and $b_i^{\top} b_j = 0$ for all $i \ne j$, we have

$$\det\big(B_{Y^{*}}^{\top} B_{Y^{*}}\big) = 1. \qquad (7)$$

Connection to VDN. When $\{b_j\}_{j=1}^{M}$ are pairwise orthogonal, plugging Eq. 7 into Eq. 6 gives

$$Q^{\boldsymbol{\pi}}(\mathbf{o}, \mathbf{a}^{*}) = \sum_{i=1}^{N} Q_{\mathcal{I}(o_i, a_i^{*})}(o_i, a_i^{*}). \qquad (8)$$

Eq. 8 recovers the exact additivity constraint that VDN applies to factorize the joint value function in meeting Eq. 3.

Connection to QMIX. Q-DPP also generalizes QMIX, which adopts a monotonicity constraint on the centralized value function to meet Eq. 3. In the special case where $\{b_j\}_{j=1}^{M}$ are mutually orthogonal, Q-DPP meets the monotonicity condition because

$$\frac{\partial Q^{\boldsymbol{\pi}}(\mathbf{o}, \mathbf{a}^{*})}{\partial Q_{\mathcal{I}(o_i, a_i^{*})}(o_i, a_i^{*})} = 1 \ge 0, \quad \forall \mathcal{I}(o_i, a_i^{*}) \in \mathcal{N}. \qquad (9)$$

Connection to QTRAN. Q-DPP also meets the sufficient conditions that QTRAN proposes for meeting Eq. 3, namely

$$\sum_{i=1}^{N} Q_i(o_i, a_i) - Q^{\boldsymbol{\pi}}(\mathbf{o}, \mathbf{a}) + V(\mathbf{o}) = \begin{cases} 0 & \mathbf{a} = \mathbf{a}^{*} \\ \ge 0 & \mathbf{a} \ne \mathbf{a}^{*} \end{cases}, \qquad (10)$$

where $V(\mathbf{o}) = \max_{\mathbf{a}} Q^{\boldsymbol{\pi}}(\mathbf{o}, \mathbf{a}) - \sum_{i=1}^{N} Q_i(o_i, a_i^{*})$. Through Eq. 6, the left-hand side of Eq. 10 can be written for Q-DPP as

$$-\log \det\big(B_Y^{\top} B_Y\big) + \max_{\mathbf{a}} Q^{\boldsymbol{\pi}}(\mathbf{o}, \mathbf{a}) - \sum_{i=1}^{N} Q_i(o_i, a_i^{*}). \qquad (11)$$

When $\mathbf{a} = \mathbf{a}^{*}$, for pairwise orthogonal $\{b_j\}_{j=1}^{M}$, Q-DPP satisfies the first condition, since Eq. 11 equals zero due to $\log \det(B_{Y^{*}}^{\top} B_{Y^{*}}) = 0$. When $\mathbf{a} \ne \mathbf{a}^{*}$, Eq. 11 equals $-\log \det(B_Y^{\top} B_Y) + \log \det(B_{Y^{*}}^{\top} B_{Y^{*}})$, which is always positive since $\det(B_Y^{\top} B_Y) < 1$ for all $Y \ne Y^{*}$; Q-DPP thereby meets the second condition of Eq. 10 and recovers QTRAN.
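The degeneration to VDN under mutually orthogonal diversity features can also be checked numerically; the sketch below uses orthonormal columns for $B$ (an assumption made for illustration) so that Eq. 7 holds and Eq. 8 follows:

```python
import numpy as np

rng = np.random.default_rng(3)
M = P = 6                                          # toy ground set with M = P

q_individual = rng.normal(size=M)                  # per-pair individual Q-values
d = np.exp(0.5 * q_individual)
B, _ = np.linalg.qr(rng.normal(size=(P, M)))       # orthonormal columns b_j
W = np.diag(d) @ B.T                               # rows w_i = d_i * b_i^T

Y_star = [0, 2, 4]                                 # toy optimal pair of each agent
assert np.isclose(np.linalg.det(B[:, Y_star].T @ B[:, Y_star]), 1.0)   # Eq. 7

q_joint = np.log(np.linalg.det(W[Y_star] @ W[Y_star].T))               # Eq. 5/6
assert np.isclose(q_joint, q_individual[Y_star].sum())                 # Eq. 8 (VDN)
print(float(q_joint))
```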

Other Related Work. Determinantal SARSA (Osogami & Raymond, 2019) applies a standard DPP to model the ground set of joint state-action pairs $\big\{(s^0, a_1^0, \ldots, a_N^0), \ldots, (s^{|\mathcal{S}|}, a_1^{|\mathcal{A}|}, \ldots, a_N^{|\mathcal{A}|})\big\}$. It does not consider a ground set suited to multi-agent problems, which leads to the number of subsets being $2^{|\mathcal{S}||\mathcal{A}|^{N}}$, i.e., doubly exponential in the number of agents. Furthermore, unlike Q-DPP, which learns decentralized policies, Determinantal SARSA learns a centralized joint-action policy, which strongly limits its applicability to scalable real-world tasks.

3.5. Sampling from Q-DPP

Agents need to explore the environment effectively during training; however, how to sample from the Q-DPP defined in Eq. 4 is still unknown. In fact, sampling from DPPs with a partition-matroid constraint is a non-trivial task. So far, the best known exact sampler for partitioned DPPs has $O(mp)$ time complexity, with $m$ being the ground-set size and $p$ the number of partitions (Li et al., 2016; Celis et al., 2017). Nonetheless, such samplers still pose great computational challenges for multi-agent learning tasks and cannot scale to a large number of agents, since for multi-agent learning tasks we have $m = |C(\mathbf{o})| = |\mathcal{A}|^{N}$.

In this work, we instead adopt a biased yet tractable sampler for Q-DPP. Our sampler is an application of the sampling-by-projection idea of Celis et al. (2018) and Chen et al. (2018), which leverages the property that the Gram-Schmidt process preserves the determinant. One benefit of our sampler is that it promotes efficient exploration among agents during training. Importantly, it enjoys only linear-time complexity w.r.t. the number of agents. The intuition is as follows.

Additional Notations. In a Euclidean space $\mathbb{R}^{n}$ equipped with an inner product $\langle \cdot, \cdot \rangle$, let $U \subseteq \mathbb{R}^{n}$ be any linear subspace, and let $U^{\perp}$ be its orthogonal complement, $U^{\perp} := \{x \in \mathbb{R}^{n} \mid \langle x, y \rangle = 0, \forall y \in U\}$. We define the orthogonal projection operator $q_U: \mathbb{R}^{n} \to \mathbb{R}^{n}$ such that, for all $u \in \mathbb{R}^{n}$, if $u = u_1 + u_2$ with $u_1 \in U$ and $u_2 \in U^{\perp}$, then $q_U(u) = u_2$.

Gram-Schmidt (Noble et al., 1988) is a process for orthogonalizing a set of vectors; given a set of linearly independent vectors $\{w_i\}$, it outputs a mutually orthogonal set of vectors $\{\bar{w}_i\}$ by computing $\bar{w}_i := q_{U_i}(w_i)$, where $U_i = \mathrm{span}\{w_1, \ldots, w_{i-1}\}$. Note that we neglect the normalizing step of Gram-Schmidt in this work. Finally, if the rows $\{w_i\}$ of a matrix $W$ are mutually orthogonal, the determinant can be computed as $\det(W W^{\top}) = \prod_i \|w_i\|^{2}$.


Algorithm 1: Multi-Agent Determinantal Q-Learning

1:  DEF Orthogonalizing-Sampler(𝒴, D, B, o):
2:    Init: b_j ← B[:, j], Y ← ∅, ℬ ← ∅, J ← ∅.
3:    for each partition 𝒴_i do
4:      Define, ∀(o, a) ∈ 𝒴_i(o_i): q(o, a) := ||b_𝒥(o,a)||² · exp(D_{𝒥(o,a),𝒥(o,a)}).
5:      Sample (o_i, a_i) ∈ 𝒴_i(o_i) from the distribution { q(o, a) / Σ_{(o,a)∈𝒴_i(o_i)} q(o, a) }_{(o,a)∈𝒴_i(o_i)}.
6:      Let Y ← Y ∪ {(o_i, a_i)}, ℬ ← ℬ ∪ {b_𝒥(o_i,a_i)}, J ← J ∪ {𝒥(o_i, a_i)}.
7:      // Gram-Schmidt orthogonalization
8:      Set b_j = q_{span{ℬ}}(b_j), ∀j ∈ {1, ..., M} \ J.
9:    end for
10:   Return: Y.
11:
12: DEF Determinantal-Q-Learning(θ = [θ_D, θ_B], 𝒴):
13:   Init: θ⁻ ← θ, 𝒟 ← ∅.
14:   for each time-step do
15:     Collect observations o = [o_1, ..., o_N] for all agents.
16:     a = Orthogonalizing-Sampler(𝒴, θ_D, θ_B, o).
17:     Execute a and store the transition ⟨o, a, R, o′⟩ in 𝒟.
18:     Sample a mini-batch of {⟨o, a, R, o′⟩}_{j=1}^{E} from 𝒟.
19:     For each transition in the mini-batch, compute
          max_{a′} Q(o′, a′; θ⁻) = log det(L_{Y = {(o′_1, a*_1), ..., (o′_N, a*_N)}}),
        where // off-policy decentralized execution
          a*_i = argmax_{a_i ∈ 𝒜_i} [ ||θ⁻_{b_𝒥(o′_i,a_i)}||² · exp(θ⁻_{D_{𝒥(o′_i,a_i),𝒥(o′_i,a_i)}}) ].
20:     // centralized training
21:     Update θ by minimizing ℒ(θ) defined in Eq. 2.
22:     Update the target θ⁻ = θ periodically.
23:   end for
24:   Return: θ_D, θ_B.

The Q-DPP sampler is built upon the following property.

Proposition 1 (Volume preservation of Gram-Schmidt; see Chapter 7 in Shafarevich & Remizov (2012), also Lemma 3.1 in Celis et al. (2018)). Let $U_i = \mathrm{span}\{w_1, \ldots, w_{i-1}\}$ and let $w_i \in \mathbb{R}^{P}$ be the $i$-th row of $W \in \mathbb{R}^{M \times P}$; then $\prod_{i=1}^{M} \|q_{U_i}(w_i)\|^{2} = \det(W W^{\top})$.
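Proposition 1 can be verified directly (a sketch, independent of the paper's code): project each row of W onto the orthogonal complement of the span of the previous rows and compare the product of squared residual norms with det(W Wᵀ):

```python
import numpy as np

rng = np.random.default_rng(4)
M, P = 4, 6
W = rng.normal(size=(M, P))

prod, basis = 1.0, []                    # `basis`: orthonormal basis of U_i
for w in W:
    residual = w - sum((w @ u) * u for u in basis)   # q_{U_i}(w_i)
    prod *= float(residual @ residual)               # ||q_{U_i}(w_i)||^2
    basis.append(residual / np.linalg.norm(residual))

assert np.isclose(prod, np.linalg.det(W @ W.T))
print(prod)
```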

We also provide intuition via Gaussian elimination in Appendix A.1. Proposition 1 states that the determinant of a Gram matrix is invariant to applying Gram-Schmidt orthogonalization to the rows of its factor $W$. In Q-DPP's case, a kernel matrix whose rows are mutually orthogonal greatly simplifies the sampling process. In that scenario, an effective sampler is the following: from each partition $\mathcal{Y}_i$, sample an item $i \in \mathcal{Y}_i$ with $\mathcal{P}(i) \propto \|d_i b_i^{\top}\|^{2}$, add $i$ to the output sample $Y$, and move to the next partition; these steps iterate until all partitions are covered. It is straightforward to see that the probability of obtaining a sample $Y$ in this way is

$$\mathcal{P}(Y) \propto \prod_{i \in Y} \|d_i b_i^{\top}\|^{2} = \prod_{i \in Y} \|w_i\|^{2} = \det\big(W_Y W_Y^{\top}\big) \propto \det(L_Y). \qquad (12)$$

We formally describe the orthogonalizing sampling procedure in Algorithm 1. As suggested in Celis et al. (2018), the time complexity of the sampling function is $O(NMP)$ (see also the step-by-step breakdown in Appendix A.4); given that the input size is $O(MP)$, our sampler thus enjoys linear-time complexity w.r.t. the number of agents.
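Below is a minimal NumPy sketch in the spirit of Eq. 12 and the sampling half of Algorithm 1: within each partition it samples an index with probability proportional to the current squared row norm, then projects all remaining rows off the chosen direction (a Gram-Schmidt-style projection applied here to the full rows rather than only the diversity features as in line 8 of Algorithm 1; shapes and partition layout are illustrative assumptions):

```python
import numpy as np

def orthogonalizing_sample(W, partitions, rng):
    """Draw one index per partition from a matrix whose rows are w_i = d_i * b_i^T.

    Within each partition, an index is sampled with probability proportional to
    ||w_i||^2 (Eq. 12); afterwards every remaining row is projected onto the
    orthogonal complement of the chosen row.
    """
    W = np.array(W, dtype=float, copy=True)
    chosen = []
    for part in partitions:                           # one partition per agent
        norms = np.square(np.linalg.norm(W[part], axis=1))
        pick = part[rng.choice(len(part), p=norms / norms.sum())]
        chosen.append(pick)
        u = W[pick] / np.linalg.norm(W[pick])         # direction of the chosen row
        W -= np.outer(W @ u, u)                       # project every row off u
    return chosen

rng = np.random.default_rng(5)
W = rng.normal(size=(9, 9))                           # 3 agents x 3 valid pairs each
print(orthogonalizing_sample(W, [[0, 1, 2], [3, 4, 5], [6, 7, 8]], rng))
```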

Though the Gram-Schmidt process preserves the determinant and simplifies the sampling process, it comes at the price of introducing bias into the normalization term. Specifically, the normalization in our proposed sampler is conducted at each agent/partition level $\mathcal{Y}_i(o_i)$ (see line 5 of Algorithm 1), which does not match Eq. 4, whose normalizer sums over all valid samples across all partitions, $C(\mathbf{o})$; this directly leads to a sampled subset from our sampler having a larger probability than what Q-DPP defines. Interestingly, it turns out that this violation can be controlled by bounding the singular values of each partition in the kernel matrix (see Assumption 1), a technique also known as the β-balance condition introduced in P-DPP (Celis et al., 2018).

Assumption 1 (Singular-Value Constraint on Partitions). For a Q-DPP as defined in Definition 3, parameterized by $D \in \mathbb{R}^{M \times M}$, $B \in \mathbb{R}^{P \times M}$ and $W := D B^{\top}$, let $\sigma_1 \ge \ldots \ge \sigma_P$ denote the singular values of $W$, and let $\sigma_{i,1} \ge \ldots \ge \sigma_{i,P}$ denote the singular values of $W_{\mathcal{Y}_i}$, the submatrix of $W$ whose rows correspond to the $i$-th partition $\mathcal{Y}_i$. We assume that, for all $j \in \{1, \ldots, P\}$, there exists $\delta \in (0, 1]$ such that $\min_{i \in \{1, \ldots, N\}} \sigma_{i,j}^{2} / \delta \ge \sigma_{j}^{2}$ holds.

Theorem 1 (Approximation Guarantee of the Orthogonalizing Sampler). For a Q-DPP as defined in Definition 3, under Assumption 1, the orthogonalizing sampler described in Algorithm 1 returns a sampled subset $Y \in C(\mathbf{o})$ with probability $\mathcal{P}(Y) \le 1/\delta^{N} \cdot \mathcal{P}(\mathbf{Y} = Y)$, where $N$ is the number of agents, $\mathcal{P}$ is defined in Eq. 4, and $\delta$ is defined in Assumption 1.

Proof. The proof is in Appendix A.2. It can also be seen as a special case of Theorem 3.2 in Celis et al. (2018) when the number of samples drawn from each partition is one. □

Theorem 1 effectively suggests a way to bound the error between our sampler and the true Q-DPP distribution by minimizing the difference between $\sigma_{j}^{2}$ and $\sigma_{i,j}^{2}$.


3.6. Determinantal Q-Learning

We present the full learning procedure in Algorithm 1. Determinantal Q-learning is a CTDE method. During training, agents' exploration is conducted through the orthogonalizing sampler. The parameters of $B$ and $D$ are updated through Q-learning in a centralized way by following Eq. 2. To meet Assumption 1, one can add an auxiliary loss of $\max(0, \sigma_{j}^{2} - \sigma_{i,j}^{2}/\delta)$ to Eq. 2, where $\delta$ is a hyper-parameter. Given Theorem 1, for large $N$, $\delta$ should be set close to 1 to keep the bound tight. It is also worth mentioning that the Gram-Schmidt process adopted in the sampler boosts sampling efficiency for multi-agent training. Since agents' diversity features of observation-action pairs are orthogonalized each time a partition is visited, agents who act later are essentially coordinated to explore the observation-action space that is distinct from that of all previous agents. This speeds up training in the early stages.
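A hedged sketch of such an auxiliary penalty (the partition layout, δ, and the decision to compare only the top min(|𝒴_i|, P) singular values are illustrative choices, not the paper's exact implementation):

```python
import numpy as np

def assumption1_penalty(W, partitions, delta=0.5):
    """Sum of max(0, sigma_j^2 - sigma_{i,j}^2 / delta) over partitions i and ranks j.

    W is the full M x P matrix with rows d_i * b_i^T; W[part] is the submatrix
    whose rows belong to partition i.
    """
    sigma_global = np.linalg.svd(W, compute_uv=False)
    penalty = 0.0
    for part in partitions:
        sigma_part = np.linalg.svd(W[part], compute_uv=False)
        k = min(len(sigma_part), len(sigma_global))
        gap = sigma_global[:k] ** 2 - sigma_part[:k] ** 2 / delta
        penalty += float(np.clip(gap, 0.0, None).sum())
    return penalty

rng = np.random.default_rng(6)
W = rng.normal(size=(9, 9))
print(assumption1_penalty(W, [[0, 1, 2], [3, 4, 5], [6, 7, 8]]))
```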

During execution, agents only need access to the parameters in their own partitions to compute the greedy action (see line 19 of Algorithm 1). Note that neural networks can be seamlessly applied to represent both $B$ and $D$ to tackle continuous states. Though a full treatment of deep Q-DPP requires substantial future work, we show a proof of concept in Appendix C. Hereafter, we use Q-DPP to refer to our proposed algorithm.

4. Experiments

We compare Q-DPP with state-of-the-art CTDE solvers for multi-agent cooperative tasks, including COMA (Foerster et al., 2018), VDN (Sunehag et al., 2017), QMIX (Rashid et al., 2018), QTRAN (Son et al., 2019), and MAVEN (Mahajan et al., 2019). All baselines are imported from PyMARL (Samvelyan et al., 2019a). Detailed settings are in Appendix B. Code is released at https://github.com/QDPP-GitHub/QDPP. We consider the four cooperative tasks in Fig. 3, all of which require non-trivial value-function decomposition to achieve the largest reward.

Pathological Stochastic Game. The optimal policy of this game is for both agents to keep playing the top-left joint action until the 10th step and then switch to bottom-right, which yields the optimal return of 13. The design of this stochastic game is intentionally pathological. First, it is non-monotonic (thus QMIX surely fails); second, it exhibits relative overgeneralization (Wei et al., 2018), because for each agent the 1st action offers a higher average reward of 10 when matched with arbitrary actions from the other agent. We allow agents to observe the current step number and the joint action from the last time-step. A zero reward leads to immediate termination. Fig. 4a shows that Q-DPP converges to the global optimum in only 20K steps while the other baselines struggle.

Figure 3: Multi-agent cooperative tasks: (a) Pathological Stochastic Game, (b) Blocker Game, (c) Coordinated Navigation, (d) Predator-Prey World. The size of the ground set for each task is (a) 176, (b) 420, (c) 720, (d) 3920.

Blocker Game & Coordinated Navigation. The Blocker game (Heess et al., 2012) requires the agents to reach the bottom row by coordinating with their teammates to deceive the blockers, which can move left/right to block them. The navigation game requires four agents to reach four different landmarks. In both tasks, all agents receive a reward of −1 per time-step until they all reach their destinations. Depending on the starting points, the largest returns of the two games are −3 and −6, respectively. Both tasks are challenging in the sense that coordination is difficult for agents that only have decentralized policies and local observations. Figs. 4b and 4c suggest that Q-DPP still achieves the best performance.

Predator-Prey World. In this task, four predators attempt to capture two randomly moving preys. Each predator can move in four directions but only has a local view. The predators get a team reward of +1 if two or more predators capture the same prey at the same time, and they are penalized by −0.5 if only one of them catches a prey. The game terminates when all preys are caught. Fig. 4d shows Q-DPP's superior performance over all other baselines.

Apart from achieving the best performance in terms of reward, we now offer more insight into why and how Q-DPP works well.

The Importance of Assumption 1. Assumption 1 is the premise for the correctness of the Q-DPP sampler. To investigate its impact in practice, we conduct an ablation study on the Blocker and Navigation games. We implement the assumption via an auxiliary loss of $\max(0, \sigma_{j}^{2} - \sigma_{i,j}^{2}/\delta)$ that penalizes violations of the assumption, with $\delta = 0.5$. Fig. 4e presents the performance of Q-DPP trained with and without this additional loss.


Figure 4: (a)-(d) Performance (return) over training steps of Q-DPP (ours), QMIX, COMA, MAVEN, VDN, and QTRAN on the Multi-Step Matrix Game, Blocker Game, Coordinated Navigation, and Predator-Prey World. (e) Ablation study on Assumption 1 on the Blocker and Navigation games. (f) The ratio of diversity to quality, i.e., $\log\det(B_Y^{\top} B_Y) / \sum_{i=1}^{N} Q_{\mathcal{I}(o_i,a_i)}(o_i, a_i)$, during training on the Blocker game.

Figure 5: (a)-(c) Each agent's decentralized policy, i.e., $\arg\max_{a} Q_i(o_i, a)$, during execution on the Blocker game, visualized as per-cell greedy action directions together with the corresponding quality values.

We can tell that maintaining the condition, though not improving the final performance, stabilizes the training process by significantly reducing the variance of the returns. We believe this is because violating Assumption 1 leads to over-estimating the probability of certain observation-action pairs in the partition where the violation happens; such over-estimation can make the agent stick to a poor local observation-action pair for some time.

The Satisfaction of Eq. 3. We show empirical evidence on the Blocker game that the natural factorization that Q-DPP offers indeed satisfies Eq. 3. Intuitively, Q-DPP encourages agents to acquire diverse behavioral models during training, so that the optimal action of one agent does not depend on the actions of the other agents during decentralized execution; as a result, Eq. 3 can be satisfied. Figs. 5(a-c) support this intuition by showing that Q-DPP learns mutually orthogonal behavioral models. Given the distinction among the agents' individual policies, one can tell that the joint optimum is reached through the individual optima.

Quality versus Diversity. We investigate how the relative importance of quality versus diversity changes during training. On the Blocker game, we show the ratio $\log\det(B_Y^{\top} B_Y) / \sum_{i=1}^{N} Q_{\mathcal{I}(o_i,a_i)}(o_i, a_i)$, which reflects how the learning algorithm balances maximizing reward against encouraging diverse behaviors. In Fig. 4f, we can see that the ratio gradually converges to 0. The diversity term plays a less important role as training progresses; this is expected, since exploration tends to be rewarded more in the early stage of a task.

5. Conclusion

We proposed Q-DPP, a new type of value-function approximator for cooperative multi-agent reinforcement learning. Q-DPP, as a probabilistic model of sets, considers not only the quality of agents' actions towards reward maximization, but also the diversity of agents' behaviors. We have demonstrated that Q-DPP addresses the limitations of current major solutions, including VDN, QMIX, and QTRAN, by learning the value-function decomposition without structural constraints. In the future, we plan to investigate other kernel representations for Q-DPP to tackle tasks with continuous states and continuous actions.


Acknowledgement

We sincerely thank Dr. Haitham Bou Ammar for his constructive comments.

ReferencesAffandi, R. H., Fox, E., Adams, R., and Taskar, B. Learning

the parameters of determinantal point process kernels.In International Conference on Machine Learning, pp.1224–1232, 2014.

Celis, L. E., Deshpande, A., Kathuria, T., Straszak, D., andVishnoi, N. K. On the complexity of constrained determi-nantal point processes. In Approximation, Randomization,and Combinatorial Optimization. Algorithms and Tech-niques (APPROX/RANDOM 2017). Schloss Dagstuhl-Leibniz-Zentrum fuer Informatik, 2017.

Celis, L. E., Keswani, V., Straszak, D., Deshpande,A., Kathuria, T., and Vishnoi, N. K. Fair and di-verse dpp-based data summarization. arXiv preprintarXiv:1802.04023, 2018.

Chen, L., Zhang, G., and Zhou, E. Fast greedy map infer-ence for determinantal point process to improve recom-mendation diversity. In Advances in Neural InformationProcessing Systems, pp. 5622–5633, 2018.

Deshpande, A., Rademacher, L., Vempala, S., and Wang,G. Matrix approximation and projective clustering viavolume sampling. Theory of Computing, 2(1):225–247,2006.

Elfeki, M., Couprie, C., Riviere, M., and Elhoseiny, M.Gdpp: Learning diverse generations using determinantalpoint processes. In International Conference on MachineLearning, pp. 1774–1783, 2019.

Foerster, J. N., Farquhar, G., Afouras, T., Nardelli, N., andWhiteson, S. Counterfactual multi-agent policy gradi-ents. In Thirty-Second AAAI Conference on ArtificialIntelligence (AAAI), 2018.

Heess, N., Silver, D., and Teh, Y. W. Actor-critic reinforce-ment learning with energy-based policies. In EWRL, pp.43–58, 2012.

Kulesza, A., Taskar, B., et al. Determinantal point pro-cesses for machine learning. Foundations and Trends R©in Machine Learning, 5(2–3):123–286, 2012.

Li, C., Sra, S., and Jegelka, S. Fast mixing markov chainsfor strongly rayleigh measures, dpps, and constrainedsampling. In Advances in Neural Information ProcessingSystems, pp. 4188–4196, 2016.

Li, M., Qin, Z., Jiao, Y., Yang, Y., Wang, J., Wang, C., Wu,G., and Ye, J. Efficient ridesharing order dispatching withmean field multi-agent reinforcement learning. In TheWorld Wide Web Conference, pp. 983–994. ACM, 2019.

Macchi, O. The fermion processa model of stochastic pointprocess with repulsive points. In Transactions of theSeventh Prague Conference on Information Theory, Sta-tistical Decision Functions, Random Processes and ofthe 1974 European Meeting of Statisticians, pp. 391–398.Springer, 1977.

Mahajan, A., Rashid, T., Samvelyan, M., and Whiteson,S. Maven: Multi-agent variational exploration. In Ad-vances in Neural Information Processing Systems, pp.7611–7622, 2019.

Matignon, L., Laurent, G. J., and Le Fort-Piat, N. Hys-teretic q-learning: an algorithm for decentralized rein-forcement learning in cooperative multi-agent teams. In2007 IEEE/RSJ International Conference on IntelligentRobots and Systems, pp. 64–69. IEEE, 2007.

Mnih, V., Kavukcuoglu, K., Silver, D., Rusu, A. A., Veness,J., Bellemare, M. G., Graves, A., Riedmiller, M., Fidje-land, A. K., Ostrovski, G., et al. Human-level controlthrough deep reinforcement learning. Nature, 518(7540):529–533, 2015.

Noble, B., Daniel, J. W., et al. Applied linear algebra,volume 3. Prentice-Hall New Jersey, 1988.

Oliehoek, F. A., Spaan, M. T., and Vlassis, N. Optimal andapproximate q-value functions for decentralized pomdps.Journal of Artificial Intelligence Research, 32:289–353,2008.

Oliehoek, F. A., Amato, C., et al. A concise introduction todecentralized POMDPs, volume 1. Springer, 2016.

Osogami, T. and Raymond, R. Determinantal reinforcementlearning. In Proceedings of the AAAI Conference onArtificial Intelligence, volume 33, pp. 4659–4666, 2019.

Panait, L. and Luke, S. Cooperative multi-agent learning:The state of the art. Autonomous agents and multi-agentsystems, 11(3):387–434, 2005.

Pang, T., Xu, K., Du, C., Chen, N., and Zhu, J. Improvingadversarial robustness via promoting ensemble diversity.In International Conference on Machine Learning, pp.4970–4979, 2019.

Peng, P., Wen, Y., Yang, Y., Yuan, Q., Tang, Z., Long,H., and Wang, J. Multiagent bidirectionally-coordinatednets: Emergence of human-level coordination in learn-ing to play starcraft combat games. arXiv preprintarXiv:1703.10069, 2017.


Rashid, T., Samvelyan, M., De Witt, C. S., Farquhar, G., Foerster, J., and Whiteson, S. QMIX: Monotonic value function factorisation for deep multi-agent reinforcement learning. arXiv preprint arXiv:1803.11485, 2018.

Samvelyan, M., Rashid, T., de Witt, C. S., Farquhar, G., Nardelli, N., Rudner, T. G. J., Hung, C.-M., Torr, P. H. S., Foerster, J., and Whiteson, S. The StarCraft Multi-Agent Challenge. CoRR, abs/1902.04043, 2019a.

Samvelyan, M., Rashid, T., Schroeder de Witt, C., Farquhar, G., Nardelli, N., Rudner, T. G., Hung, C.-M., Torr, P. H., Foerster, J., and Whiteson, S. The StarCraft Multi-Agent Challenge. In Proceedings of the 18th International Conference on Autonomous Agents and MultiAgent Systems, pp. 2186–2188. International Foundation for Autonomous Agents and Multiagent Systems, 2019b.

Shafarevich, I. R. and Remizov, A. O. Linear algebra and geometry. Springer Science & Business Media, 2012.

Sharghi, A., Borji, A., Li, C., Yang, T., and Gong, B. Improving sequential determinantal point processes for supervised video summarization. In Proceedings of the European Conference on Computer Vision (ECCV), pp. 517–533, 2018.

Son, K., Kim, D., Kang, W. J., Hostallero, D. E., and Yi, Y. QTRAN: Learning to factorize with transformation for cooperative multi-agent reinforcement learning. arXiv preprint arXiv:1905.05408, 2019.

Sunehag, P., Lever, G., Gruslys, A., Czarnecki, W. M., Zambaldi, V., Jaderberg, M., Lanctot, M., Sonnerat, N., Leibo, J. Z., Tuyls, K., et al. Value-decomposition networks for cooperative multi-agent learning. arXiv preprint arXiv:1706.05296, 2017.

Tan, M. Multi-agent reinforcement learning: independent versus cooperative agents. In International Conference on Machine Learning (ICML), pp. 330–337, 1993.

Watkins, C. J. and Dayan, P. Q-learning. Machine Learning, 8(3-4):279–292, 1992.

Wei, E., Wicke, D., Freelan, D., and Luke, S. Multiagent soft Q-learning. In 2018 AAAI Spring Symposium Series, 2018.

Yang, Y., Yu, L., Bai, Y., Wen, Y., Zhang, W., and Wang, J. A study of AI population dynamics with million-agent reinforcement learning. In Proceedings of the 17th International Conference on Autonomous Agents and MultiAgent Systems, pp. 2133–2135. International Foundation for Autonomous Agents and Multiagent Systems, 2018.

Yang, Y., Tutunov, R., Sakulwongtana, P., Ammar, H. B., and Wang, J. Alpha-alpha-rank: Scalable multi-agent evaluation through evolution. AAMAS 2020, 2019.

Zhou, M., Chen, Y., Wen, Y., Yang, Y., Su, Y., Zhang, W., Zhang, D., and Wang, J. Factorized Q-learning for large-scale multi-agent systems. In Proceedings of the First International Conference on Distributed Artificial Intelligence, pp. 1–7, 2019.


A. Detailed Proofs

A.1. Proof of Proposition 1

Proposition 1 (Volume preservation of Gram-Schmidt, see Chapter 7 in Shafarevich & Remizov (2012), also Lemma 3.1 in Celis et al. (2018)). Let $U_i = \mathrm{span}\{w_1,\dots,w_{i-1}\}$ and $w_i \in \mathbb{R}^P$ be the $i$-th row of $W \in \mathbb{R}^{M\times P}$; then $\prod_{i=1}^M \|q_{U_i}(w_i)\|^2 = \det(WW^\top)$.

Proof. This property is mentioned in linear algebra textbooks, e.g., Chapter 7 of Shafarevich & Remizov (2012). Celis et al. (2018) also gave a proof by induction in their Lemma 3.1.¹ Here we provide our own intuition for this property through the classical Gaussian elimination method.

¹ We believe their proof is a special case, as interchanging the order of rows can actually change the determinant value, i.e., $\det(WW^\top) \neq \det\Big(\begin{bmatrix} w_k \\ W' \end{bmatrix}\begin{bmatrix} w_k^\top & W'^\top \end{bmatrix}\Big)$, where the row vectors are denoted as $W = \{w_1,\dots,w_k\}$ and $W' = \{w_1,\dots,w_{k-1}\}$.

We first define an orthogonalization operator $u_{w_i}(w_j)$ that takes an input vector $w_j \in \mathbb{R}^P$ and outputs another vector that is orthogonal to a given vector $w_i \in \mathbb{R}^P$ by
$$u_{w_i}(w_j) := w_j - w_i\frac{\langle w_i, w_j\rangle}{\|w_i\|^2}. \tag{13}$$
Based on Eq. 13, we know that, for all $w_i, w_j, w_k \in \mathbb{R}^P$,
$$u_{w_i}(w_j + w_k) = u_{w_i}(w_j) + u_{w_i}(w_k).$$

Besides, the orthogonalization operator has two properties that will be used later; we present them as lemmas.

Lemma 1 (Change of Projection Base). Let $w_i, w_j, w_k \in \mathbb{R}^P$; then $w_j \cdot u_{w_i}(w_k)^\top = u_{w_i}(w_j)\cdot u_{w_i}(w_k)^\top$.

Proof. Expanding $u_{w_i}(w_j)$ by the definition in Eq. 13 and using $w_i \cdot u_{w_i}(w_k)^\top = 0$, the right-hand side reduces to the left-hand side. □

Lemma 2 (Subspace Orthogonalization). Let $w_i, w_j, w_k \in \mathbb{R}^P$; then $u_{u_{w_i}(w_j)}(w_k)\cdot u_{w_i}(w_k)^\top = \|q_{U_k}(w_k)\|^2$, where $U_k = \mathrm{span}\{w_i, w_j\}$.

Proof. The left-hand side of the equation can be written as
$$
\begin{aligned}
u_{u_{w_i}(w_j)}(w_k)\cdot u_{w_i}(w_k)^\top
&= \Big(w_k - u_{w_i}(w_j)\frac{\langle u_{w_i}(w_j), w_k\rangle}{\|u_{w_i}(w_j)\|^2}\Big)\cdot\Big(w_k - w_i\frac{\langle w_i, w_k\rangle}{\|w_i\|^2}\Big)^\top \\
&= w_k w_k^\top - \frac{u_{w_i}(w_j)\cdot w_k^\top}{\|u_{w_i}(w_j)\|}\Big\langle \frac{u_{w_i}(w_j)}{\|u_{w_i}(w_j)\|}, w_k\Big\rangle - \frac{w_k\cdot w_i^\top}{\|w_i\|}\Big\langle\frac{w_i}{\|w_i\|}, w_k\Big\rangle,
\end{aligned}
\tag{14}
$$
where the cross term vanishes because $u_{w_i}(w_j)\cdot w_i^\top = 0$. On the other hand, $q_{U_k}(w_k)$ is the component of $w_k$ orthogonal to the subspace spanned by $w_i$ and $w_j$. Since $\big\{\frac{w_i}{\|w_i\|}, \frac{u_{w_i}(w_j)}{\|u_{w_i}(w_j)\|}\big\}$ form an orthonormal basis for the subspace $U_k = \mathrm{span}\{w_i, w_j\}$, according to the definition of $q_{U_k}(w_k)$ in Section 3.5, we can write $q_{U_k}(w_k)$ as $w_k$ minus the projection of $w_k$ onto the subspace spanned by $\big\{\frac{w_i}{\|w_i\|}, \frac{u_{w_i}(w_j)}{\|u_{w_i}(w_j)\|}\big\}$, i.e.,
$$
q_{U_k}(w_k) = w_k - \Big\langle w_k, \frac{w_i}{\|w_i\|}\Big\rangle\frac{w_i}{\|w_i\|} - \Big\langle w_k, \frac{u_{w_i}(w_j)}{\|u_{w_i}(w_j)\|}\Big\rangle\frac{u_{w_i}(w_j)}{\|u_{w_i}(w_j)\|}.
\tag{15}
$$
Using the orthonormality $\frac{w_i}{\|w_i\|}\cdot\big(\frac{u_{w_i}(w_j)}{\|u_{w_i}(w_j)\|}\big)^\top = 0$ and $\big\|\frac{w_i}{\|w_i\|}\big\|^2 = \big\|\frac{u_{w_i}(w_j)}{\|u_{w_i}(w_j)\|}\big\|^2 = 1$, squaring both sides of Eq. 15 leads to Eq. 14, i.e., $\|q_{U_k}(w_k)\|^2 = u_{u_{w_i}(w_j)}(w_k)\cdot u_{w_i}(w_k)^\top$. □

Assuming $\{w_1,\dots,w_M\}$ are the rows of $W$, applying the Gram-Schmidt orthogonalization process gives
$$\mathrm{Gram\text{-}Schmidt}\big(\{w_i\}_{i=1}^M\big) = \big\{q_{U_i}(w_i)\big\}_{i=1}^M,$$
where $U_i = \mathrm{span}\{w_1,\dots,w_{i-1}\}$. Note that we do not consider normalizing each $q_{U_i}(w_i)$ in this work.

In fact, the effect on the Gram matrix determinant $\det(WW^\top)$ of applying the Gram-Schmidt process to the rows of $W$ is equivalent to applying Gaussian elimination (Noble et al., 1988) to transform the Gram matrix into an upper-triangular matrix. Since adding a row/column of a matrix multiplied by a scalar to another row/column of that matrix does not change the determinant of the original matrix (Noble et al., 1988), Gaussian elimination, and therefore the Gram-Schmidt process, preserves the determinant.

To illustrate the above equivalence, we demonstrate the Gaussian elimination process step by step for the case of $M = 3$; the determinant of such a Gram matrix is
$$
\det(WW^\top) = \det\begin{pmatrix} w_1w_1^\top & w_1w_2^\top & w_1w_3^\top \\ w_2w_1^\top & w_2w_2^\top & w_2w_3^\top \\ w_3w_1^\top & w_3w_2^\top & w_3w_3^\top \end{pmatrix}.
\tag{16}
$$

To apply Gaussian elimination and make the Gram matrix upper triangular, we first multiply the 1st row by $-\frac{w_2w_1^\top}{w_1w_1^\top}$ and add the result to the 2nd row; without affecting the determinant, the 2nd row is transformed into
$$
\begin{aligned}
&\Big[0,\; w_2w_2^\top - \tfrac{w_2w_1^\top}{w_1w_1^\top}w_1w_2^\top,\; w_2w_3^\top - \tfrac{w_2w_1^\top}{w_1w_1^\top}w_1w_3^\top\Big] \\
&= \big[0,\; w_2\cdot u_{w_1}(w_2)^\top,\; w_3\cdot u_{w_1}(w_2)^\top\big] \\
&= \big[0,\; u_{w_1}(w_2)\cdot u_{w_1}(w_2)^\top,\; u_{w_1}(w_3)\cdot u_{w_1}(w_2)^\top\big]. \quad\text{(Lemma 1)}
\end{aligned}
\tag{17}
$$


Similarly, we can apply the same process to the 3rd row, which becomes
$$
\begin{aligned}
&\Big[0,\; w_3w_2^\top - \tfrac{w_3w_1^\top}{w_1w_1^\top}w_1w_2^\top,\; w_3w_3^\top - \tfrac{w_3w_1^\top}{w_1w_1^\top}w_1w_3^\top\Big] \\
&= \big[0,\; u_{w_1}(w_2)\cdot u_{w_1}(w_3)^\top,\; u_{w_1}(w_3)\cdot u_{w_1}(w_3)^\top\big].
\end{aligned}
\tag{18}
$$
To make $WW^\top$ upper triangular, we still need the 2nd element of the 3rd row to be zero. To achieve that, we multiply Eq. 17 by $-\frac{u_{w_1}(w_2)\cdot u_{w_1}(w_3)^\top}{u_{w_1}(w_2)\cdot u_{w_1}(w_2)^\top}$ and add the result to Eq. 18; the 3rd row is then further transformed into
$$
\begin{aligned}
&\Big[0,\;0,\; u_{w_1}(w_3)\cdot u_{w_1}(w_3)^\top - \frac{u_{w_1}(w_2)\cdot u_{w_1}(w_3)^\top}{u_{w_1}(w_2)\cdot u_{w_1}(w_2)^\top}\, u_{w_1}(w_3)\cdot u_{w_1}(w_2)^\top\Big] \\
&= \big[0,\;0,\; u_{w_1}(w_3)\cdot u_{u_{w_1}(w_2)}\big(u_{w_1}(w_3)\big)^\top\big] \\
&= \Big[0,\;0,\; u_{w_1}(w_3)\cdot u_{u_{w_1}(w_2)}\Big(w_3 - w_1\tfrac{\langle w_1,w_3\rangle}{\|w_1\|^2}\Big)^\top\Big] \\
&= \Big[0,\;0,\; u_{w_1}(w_3)\cdot\Big(u_{u_{w_1}(w_2)}(w_3) - u_{u_{w_1}(w_2)}\Big(w_1\tfrac{\langle w_1,w_3\rangle}{\|w_1\|^2}\Big)\Big)^\top\Big] \\
&= \big[0,\;0,\; u_{w_1}(w_3)\cdot u_{u_{w_1}(w_2)}(w_3)^\top\big] \\
&= \big[0,\;0,\; \|q_{U_3}(w_3)\|^2\big]. \quad\text{(Lemma 2)}
\end{aligned}
\tag{19}
$$
In the fourth equality of Eq. 19, we use the property that $u_{w_1}(\cdot)\cdot u_{u_{w_1}(\cdot)}(w_1)^\top = 0$, i.e., the inner product between a vector and its own orthogonalization equals zero.

Given that the Gram matrix is now upper triangular, putting Eq. 17 and Eq. 19 into Eq. 16 and defining $U_1 = \emptyset$, $U_2 = \mathrm{span}\{w_1\}$, $U_3 = \mathrm{span}\{w_1, w_2\}$, we can write the determinant as
$$
\det(WW^\top) = \det\begin{pmatrix} w_1w_1^\top & w_1w_2^\top & w_1w_3^\top \\ 0 & \|u_{w_1}(w_2)\|^2 & u_{w_1}(w_3)\cdot u_{w_1}(w_2)^\top \\ 0 & 0 & \|q_{U_3}(w_3)\|^2 \end{pmatrix} = \prod_{i=1}^{3}\big\|q_{U_i}(w_i)\big\|^2.
$$

When $M \ge 3$, eliminating all the $j$-th elements ($j < i$) in the $i$-th row of the Gram matrix $WW^\top$ by Gaussian elimination is equivalent to the $i$-th step of the Gram-Schmidt process applied to the vector set $\{w_i\}_{i=1}^M$; in other words, the $(i,i)$-th element of the Gram matrix after Gaussian elimination is exactly the squared norm of $q_{U_i}(w_i)$. Finally, since the determinant of an upper-triangular matrix is simply the product of its diagonal elements, we have $\det(WW^\top) = \prod_{i=1}^M\big\|q_{U_i}(w_i)\big\|^2$. □
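As a quick numerical sanity check of Proposition 1, the following NumPy snippet (our own illustration; the variable names are not from the paper) runs unnormalized Gram-Schmidt on the rows of a random $W$ and confirms that the product of squared residual norms matches $\det(WW^\top)$.

```python
import numpy as np

rng = np.random.default_rng(0)
M, P = 4, 6
W = rng.normal(size=(M, P))            # rows w_1, ..., w_M

# Unnormalized Gram-Schmidt: q is the residual of w against span{previous rows}.
basis, residual_sq_norms = [], []
for w in W:
    q = w.copy()
    for b in basis:
        q -= b * (b @ w) / (b @ b)     # subtract the projection of w onto b
    basis.append(q)
    residual_sq_norms.append(q @ q)

lhs = np.prod(residual_sq_norms)       # prod_i ||q_{U_i}(w_i)||^2
rhs = np.linalg.det(W @ W.T)           # det(W W^T)
print(lhs, rhs)                        # agree up to floating-point error
assert np.isclose(lhs, rhs)
```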

A.2. Proof of Theorem 1

Theorem 1 (Approximation Guarantee of Orthogonalizing Sampler). For a Q-DPP defined in Definition 1, under Assumption 1, the orthogonalizing sampler described in Algorithm 1 returns a sampled subset $Y \in \mathcal{C}(o)$ with probability $\mathbb{P}(Y) \le \frac{1}{\delta^N}\cdot P(Y)$, where $N$ is the number of agents, $P$ is defined in Eq. 4, and $\delta$ is defined in Assumption 1.

Proof. This result can be regarded as a special case of Theorem 3.2 in Celis et al. (2018) when the number of samples drawn from each partition of the P-DPP is set to one (please see Appendix A.3 for the differences between P-DPP and Q-DPP).

Since our sampling algorithm generates samples with probability proportional to the determinant value $\det(L_Y)$, which is also the numerator in Eq. 4, it is then necessary to bound the denominator of the probability of samples from our proposed sampler so that the error relative to the exact denominator defined in Eq. 4 can be controlled. We start from the lemmas that will be used.

Lemma 3 (Eckart-Young-Mirsky Theorem). For a real matrix $W \in \mathbb{R}^{M\times P}$ with $M \ge P$, suppose that $W = U\Sigma V^\top$ is the singular value decomposition (SVD) of $W$. Then the best rank-$k$ approximation to $W$ under the Frobenius norm $\|\cdot\|_F$, described as
$$\min_{W':\,\mathrm{rank}(W')=k}\big\|W - W'\big\|_F^2,$$
is given by $W' = W_k = \sum_{i=1}^k \sigma_i u_i v_i^\top$, where $u_i$ and $v_i$ denote the $i$-th columns of $U$ and $V$ respectively, and
$$\big\|W - W_k\big\|_F^2 = \Big\|\sum_{i=k+1}^{P}\sigma_i u_i v_i^\top\Big\|_F^2 = \sum_{i=k+1}^{P}\sigma_i^2.$$
Note that the singular values $\sigma_i$ in $\Sigma$ are sorted by the SVD procedure such that $\sigma_1 \ge \dots \ge \sigma_P$. □

Lemma 4 (Lemma 3.1 in Deshpande et al. (2006)). For a matrix $W \in \mathbb{R}^{M\times P}$ with $M \ge P \ge N$, assume $\{\sigma_i\}_{i=1}^P$ are the singular values of $W$ and $W_Y$ is the submatrix of $W$ with rows indexed by the elements in $Y$; then we have
$$\sum_{|Y|=N}\det\big(W_YW_Y^\top\big) = \sum_{k_1<\dots<k_N}\sigma_{k_1}^2\cdots\sigma_{k_N}^2. \qquad\square$$
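For intuition, Lemma 4 can also be verified numerically on a small example; the sketch below (our own illustration, not part of the original proof) compares the sum of $\det(W_YW_Y^\top)$ over all $N$-row subsets with the corresponding sum of products of $N$ squared singular values.

```python
import numpy as np
from itertools import combinations

rng = np.random.default_rng(1)
M, P, N = 6, 4, 2
W = rng.normal(size=(M, P))
sigma = np.linalg.svd(W, compute_uv=False)        # singular values of W

lhs = sum(np.linalg.det(W[list(Y)] @ W[list(Y)].T)
          for Y in combinations(range(M), N))      # all N-row submatrices of W
rhs = sum(np.prod(sigma[list(ks)] ** 2)
          for ks in combinations(range(P), N))     # sum over k_1 < ... < k_N of products
print(lhs, rhs)                                    # agree up to floating-point error
```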

To stay consistent in notation, we use $N$ for the number of agents, $M$ for the size of the ground set of the Q-DPP, and $P$ for the dimension of the diversity feature vectors; we assume $M \ge P \ge N$. Let $\mathbf{Y}$ be the random variable representing the output of our proposed sampler in Algorithm 1. Since the algorithm visits each partition of the Q-DPP sequentially, a sample $Y = \{(o_1^1, a_1^1),\dots,(o_N^{|O|}, a_N^{|A|})\}$ is therefore an ordered set. Note that the algorithm is agnostic to the partition number (i.e., the agent identity); without loss of generality,


we denote the first partition chosen as $Y_1$. We further denote $Y_i$, $i \in \{1,\dots,N\}$, as the $i$-th observation-action pair in $Y$, and $I(Y_i) \in \{1,\dots,N\}$ denotes the partition from which the $i$-th pair is sampled.

According to Algorithm 1, at the first step we choose $Y_1$: based on the corresponding observation $o_1$, we locate the valid subset $\mathcal{Y}_1(o_1)$ of observation-action pairs, and finally sample one pair from $\mathcal{Y}_1(o_1)$ with probability proportional to the norm of the vector defined in Lines 4-5 of Algorithm 1, that is,

$$\mathbb{P}(Y_i) \propto \big\|w_{\mathcal{J}(o,a)}\big\|^2 = \big\|b_{\mathcal{J}(o,a)}\big\|^2\exp\big(D_{\mathcal{J}(o,a),\mathcal{J}(o,a)}\big). \tag{20}$$
After $Y_i$ is sampled, the algorithm moves to the next partition and repeats the same process until all $N$ partitions are covered.

The specialty of this sampler is that, before sampling at each partition $i \in \{1,\dots,N\}$, the Gram-Schmidt process is applied to ensure that all the rows in the $i$-th partition of $W$ are orthogonal to all previously sampled pairs:
$$b_j^i = q_{\mathrm{span}\{B_i\}}\big(b_j^{i-1}\big), \quad \forall j \in \{1,\dots,M\}\setminus J,$$
where $B_i = \{b^t_{\mathcal{J}(o^t,a^t)}\}_{t=1}^{i-1}$ and $J = \{\mathcal{J}(o^t,a^t)\}_{t=1}^{i-1}$. Note that since $D$ only contributes a scalar to $w_j$, and $b_j$ is a $P$-dimensional vector of the same shape as $w_j$, in practice the Gram-Schmidt orthogonalization only needs to be conducted on $b_j$ in order to make the rows of $W$ mutually orthogonal.

Based on the above sampling process at each time-step $i$, we can write the probability of obtaining a sample $Y$ as
$$
\begin{aligned}
\mathbb{P}(\mathbf{Y}=Y)
&= \mathbb{P}(Y_1)\prod_{i=2}^{N}\mathbb{P}\big(Y_i \mid Y_1,\dots,Y_{i-1}\big) \\
&= \prod_{i=1}^{N}\frac{\big\|q_{\mathrm{span}\{B_i\}}\big(w_{I(Y_i)}\big)\big\|^2}{\sum_{(o,a)\in\mathcal{Y}_{I(Y_i)}}\big\|q_{\mathrm{span}\{B_i\}}\big(w_{I(o,a)}\big)\big\|^2} \\
&= \frac{\prod_{i=1}^{N}\big\|q_{\mathrm{span}\{B_i\}}\big(w_{I(Y_i)}\big)\big\|^2}{\prod_{i=1}^{N}\Big(\sum_{(o,a)\in\mathcal{Y}_{I(Y_i)}}\big\|q_{\mathrm{span}\{B_i\}}\big(w_{I(o,a)}\big)\big\|^2\Big)} \\
&= \frac{\det\big(W_YW_Y^\top\big)}{\prod_{i=1}^{N}\Big(\sum_{(o,a)\in\mathcal{Y}_{I(Y_i)}}\big\|q_{\mathrm{span}\{B_i\}}\big(w_{I(o,a)}\big)\big\|^2\Big)},
\end{aligned}
\tag{21}
$$
where the fourth equality in Eq. 21 holds because of Proposition 1.

For each term in the denominator, according to the definition of the operator $q_{\mathrm{span}\{B_i\}}$, we can rewrite
$$
\sum_{(o,a)\in\mathcal{Y}_{I(Y_i)}}\big\|q_{\mathrm{span}\{B_i\}}\big(w_{I(o,a)}\big)\big\|^2 = \big\|W_{I(Y_i)} - W'_{I(Y_i)}\big\|_F^2,
$$
where the rows of $W_{I(Y_i)}$ are $\{w_{I(o,a)}\}_{(o,a)\in\mathcal{Y}_{I(Y_i)}}$, i.e., the submatrix of $W$ that corresponds to partition $I(Y_i)$, and the rows of $W'_{I(Y_i)}$ are the orthogonal projections of $\{w_{I(o,a)}\}_{(o,a)\in\mathcal{Y}_{I(Y_i)}}$ onto $\mathrm{span}\{B_i\}$; we know that $\mathrm{rank}\big(W'_{I(Y_i)}\big) = |B_i| = i-1$. According to Lemma 3, with $\sigma_{I(Y_i),k}$ being the $k$-th singular value of $W_{I(Y_i)}$, we know that
$$
\big\|W_{I(Y_i)} - W'_{I(Y_i)}\big\|_F^2 \ge \sum_{k=i}^{P}\sigma^2_{I(Y_i),k}.
\tag{22}
$$

Therefore, we have, for the denominator of Eq. 21,
$$
\begin{aligned}
\prod_{i=1}^{N}\Big(\sum_{(o,a)\in\mathcal{Y}_{I(Y_i)}}\big\|q_{\mathrm{span}\{B_i\}}\big(w_{I(o,a)}\big)\big\|^2\Big)
&\ge \prod_{i=1}^{N}\sum_{k=i}^{P}\sigma^2_{I(Y_i),k} \\
&\ge \prod_{i=1}^{N}\sum_{k=i}^{P}\delta\cdot\sigma_k^2 \qquad\text{(Assumption 1)} \\
&\ge \delta^N\cdot\sum_{k_1<\dots<k_N}\sigma_{k_1}^2\cdots\sigma_{k_N}^2 \\
&= \delta^N\cdot\sum_{Y\subseteq\mathcal{Y}:\,|Y|=N}\det\big(W_YW_Y^\top\big) \qquad\text{(Lemma 4)} \\
&\ge \delta^N\cdot\sum_{Y\in\mathcal{C}(o)}\det\big(W_YW_Y^\top\big).
\end{aligned}
\tag{23}
$$

Taking Eq. 23 into Eq. 21, we obtain
$$
\mathbb{P}(\mathbf{Y}=Y) \le \frac{\det\big(W_YW_Y^\top\big)}{\delta^N\cdot\sum_{Y'\in\mathcal{C}(o)}\det\big(W_{Y'}W_{Y'}^\top\big)} = \frac{1}{\delta^N}\cdot P(\mathbf{Y}=Y),
$$
where $\mathbb{P}(\mathbf{Y}=Y)$ is the probability of obtaining the sample $Y$ from our proposed sampler and $P(\mathbf{Y}=Y)$ is the probability of obtaining that sample under Eq. 4. □


A.3. Difference between Q-DPP and P-DPP

The design of Q-DPP and its sample-by-projection sampling process is inspired by and based on P-DPP (Celis et al., 2018). However, we would like to highlight multiple differences: 1) P-DPP is designed for modeling fairness in data summarization, whereas Q-DPP serves as a function approximator for the joint Q-function in the context of multi-agent learning; 2) though we analyze Eq. 21 based on $W$, the actual orthogonalization step of our sampler only needs to be performed on the vectors $b_j$ rather than the entire matrix $W$, due to our quality-diversity decomposition of the joint Q-function in Eq. 6; 3) the set of elements in each partition $\mathcal{Y}_i(o_i)$ of Q-DPP changes with the observation at each time-step, while the partitions stay fixed in the case of P-DPP; 4) the parameters of $W$ are learned through a trial-and-error multi-agent reinforcement learning process, whereas in P-DPP the kernel is given by hand-crafted features (e.g., SIFT features on images); 5) we implement the constraint in Assumption 1 via a penalty term during the CTDE learning process, while P-DPP does not consider meeting such an assumption through optimization.

A.4. Time Complexity of Algorithm 1

Let us analyze the time complexity of the proposed Q-DPP sampler in steps 1-10 of Algorithm 1. Given the observation $o$ and the input matrices $D, B$ (whose sizes are $M \times M$ and $P \times M$, with $M = |A| \times N$ being the total number of all $N$ agents' allowed actions under $o$ and $P$ being the diversity feature dimension), the sampler samples one action for each agent sequentially, so the outer loop of step 3 is $O(N)$. Within the partition of each agent, step 4 is $O(P)$, step 5 is $O(P|A|)$, and step 6 is $O(1)$, so the complexity so far is $O(NP|A|)$. Computing step 8 over all partitions is $O(N^2P|A|)$†. The overall complexity is therefore $O(N^2P|A|) = O(NMP)$; since the input size is $O(MP)$ and the number of agents $N$ is a constant, our sampler runs in linear time with respect to the input, and also in linear time with respect to the number of agents. This argument is in line with the project-and-sample sampler in Celis et al. (2018).

†: In the Gram-Schmidt process, orthogonalizing one vector against another takes $O(P)$. Considering all valid actions of one agent takes $O(P|A|)$. Note that while looping over the partitions, the remaining unsampled partitions do not need to be repeatedly orthogonalized against all previous samples; in fact, they only need to be orthogonalized against the latest sample. In the example of Fig. 2, after agent 2 selects action 5, agent 3's three actions only need to be orthogonalized against action 5 but not action 2, because that was already performed when the partition of agent 1 was visited. So the total number of partition orthogonalizations is $(N-1)N/2$ across all partitions, leading to $O(N^2P|A|)$ time for step 8.
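To make the analysis above concrete, the following Python sketch mirrors the sample-by-projection procedure; the function and argument names are ours rather than the paper's, we assume the quality term enters as $\exp(D_{jj})$ scaling the squared feature norm (consistent with Eq. 20), and each remaining column is orthogonalized only against the latest sample, as argued in the dagger note.

```python
import numpy as np

def orthogonalizing_sampler(D_diag, B, partitions, rng=None):
    """Hedged sketch of a sample-by-projection sampler.

    D_diag     : (M,) per-item quality logits, so quality = exp(D_diag[j]).
    B          : (P, M) diversity feature matrix; column j is b_j.
    partitions : list of N index lists, one per agent, holding the rows
                 (observation-action pairs) valid under the current observation.
    Returns one sampled item index per partition.
    """
    rng = np.random.default_rng() if rng is None else rng
    B = B.astype(float)               # working copy, orthogonalized in place
    sampled = []
    for part in partitions:
        # Unnormalized probability of each valid item: ||b_j||^2 * exp(D_jj)  (cf. Eq. 20).
        weights = np.array([B[:, j] @ B[:, j] * np.exp(D_diag[j]) for j in part])
        j_star = part[rng.choice(len(part), p=weights / weights.sum())]
        sampled.append(j_star)
        # Project every remaining column away from the latest sampled direction;
        # earlier samples were already removed in previous iterations.
        b_star = B[:, j_star].copy()
        denom = b_star @ b_star
        if denom > 1e-12:
            B -= np.outer(b_star, (b_star @ B) / denom)
    return sampled
```

The quadratic-in-$N$ cost shows up in the projection step: each of the $N$ passes projects the remaining $O(M)$ columns once at $O(P)$ each, giving $O(NMP) = O(N^2P|A|)$ overall.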

B. Experimental Parameter Settings

The hyper-parameter settings for Q-DPP are given in Table 1. For all experiments we update the target networks after every 100 episodes. All activation functions in hidden layers are ReLU. The optimization is conducted using RMSprop with a learning rate of $5\times 10^{-4}$ and $\alpha = 0.99$, with no weight decay or momentum.
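For reference, the optimizer configuration above corresponds roughly to the following PyTorch call (a minimal sketch; `q_network` is only a placeholder for one of the agent networks, not the exact model used in the paper):

```python
import torch
import torch.nn as nn

# Placeholder network standing in for an agent's individual Q-network.
q_network = nn.Sequential(nn.Linear(64, 32), nn.ReLU(), nn.Linear(32, 5))

# RMSprop with lr = 5e-4 and alpha = 0.99, no weight decay or momentum,
# matching the common settings listed above.
optimizer = torch.optim.RMSprop(q_network.parameters(), lr=5e-4, alpha=0.99,
                                momentum=0.0, weight_decay=0.0)
```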

Unless otherwise indicated, all the baselines use the common settings listed in Section B. IQL, VDN, QMIX, MAVEN, and QTRAN use the same kind of individual action-value networks as Q-DPP; each consists of two 32-width hidden layers. The specialized parameter settings for each algorithm are provided in Table 2:


Table 1: Q-DPP Hyper-parameter Settings.

Setting                            Value         Description

Common settings
  Learning rate                    0.0005        Optimizer learning rate.
  Batch size                       32            Number of episodes to use for each update.
  Gamma                            0.99          Long-term discount factor.
  Hidden dimension                 64            Size of hidden states.
  Number of hidden layers          3             Number of hidden layers.
  Target update interval           100           Interval of updating the target network.

Multi-step matrix game
  Step                             40K           Maximum time steps.
  Feature matrix size              176 × 32      Number of observation-action pairs times embedding size.
  Individual policy type           RNN           Recurrent DQN.
  Epsilon decay scheme             Linear        Decay from 1 to 0.05 in 30K steps.

Coordinated navigation
  Step                             100K          Maximum time steps.
  Feature matrix size              720 × 32      Number of observation-action pairs times embedding size.
  Individual policy type           RNN           Recurrent DQN.
  Epsilon decay scheme             Linear        Decay from 1 to 0.1 in 10K steps.

Blocker game
  Step                             200K          Maximum time steps.
  Feature matrix size              420 × 32      Number of observation-action pairs times embedding size.
  Individual policy type           RNN           Recurrent DQN.
  Epsilon decay scheme             Linear        Decay from 1 to 0.01 in 100K steps.

Predator-prey world (four predators, two preys)
  Step                             4M            Maximum time steps.
  Feature matrix size              3920 × 32     Number of observation-action pairs times embedding size.
  Individual policy type           Feedforward   Feedforward DQN.
  Epsilon decay scheme             Linear        Decay from 1 to 0.1 in 300K steps.

Table 2: Hyper-parameter Settings for Baseline Algorithms.

Setting                              Value    Description

QMIX
  Monotone network layers            2        Number of layers of the monotone network.
  Monotone network size              64       Hidden layer size of the monotone network.

QTRAN
  Joint action-value network layers  2        Number of layers of the joint action-value network.
  Joint action-value network size    64       Hidden layer size of the joint action-value network.

MAVEN
  z                                  2        Noise dimension.
  λ_MI                               0.001    Weight of the MI objective.
  λ_QL                               1        Weight of the QL objective.
  Entropy regularization             0.001    Weight of entropy regularization.
  Discriminator layers               1        Number of discriminator network layers.
  Discriminator size                 32       Hidden layer size of the discriminator network.


C. Solution for Continuous States: Deep Q-DPP

Although our proposed Q-DPP serves as a new type of function approximator for the value function in multi-agent reinforcement learning, deep neural networks can also be seamlessly applied to Q-DPP. Specifically, one can adopt deep networks to represent the quality and diversity terms in the kernel of Q-DPP respectively, so as to tackle continuous state-action spaces; we name this approach Deep Q-DPP. In Fig. 2, one can think of Deep Q-DPP as modeling $D$ and $B$ by neural networks rather than look-up tables. Deep Q-DPP is to Q-DPP what Deep Q-learning (Mnih et al., 2015) is to Q-learning (Watkins & Dayan, 1992). As the main motivation for introducing Q-DPP is to eliminate structural constraints and bespoke neural architecture designs in solving multi-agent cooperative tasks, we omit the study of Deep Q-DPP in the main body of this paper. Here we demonstrate a proof of concept for Deep Q-DPP and its effectiveness on StarCraft II micro-management tasks (Samvelyan et al., 2019b) as an initial study. However, we do believe a full treatment needs substantial future work.

C.1. Neural Architectures for Deep Q-DPP.

Figure 6: Neural Architecture of Deep Q-DPP. The middle part of the diagram shows the overall architecture of Q-DPP, which consists of each agent's individual Q-network (an MLP-GRU-MLP stack) and a centralized mixing network; details of the mixing network (the exp, index, state-encoder, and final log-det operations) are presented on the left. We compute the quality term, $d_i$, by applying the exponential operator to the individual Q-value, and compute the diversity feature term, $b_i$, by indexing the corresponding vector in $B$ through the global state $s$ and each action $a_i$.

A critical advantage of Deep Q-DPP is that it can deal with continuous states/observations. When the input state $s$ is continuous, we first index the raw diversity feature $b'_i$ based on the embedding of the discrete action $a_i$. To integrate the information of the continuous state, we use two multi-layer feed-forward neural networks $f_d$ and $f_n$, which encode the direction and the norm of the diversity feature separately: $f_d$ outputs a feature vector with the same shape as $b'_i$ indicating the direction, and $f_n$ outputs a real value for computing the norm. In practice, we find that modeling the direction and the norm of the diversity features separately by two neural networks helps stabilize training, and the diversity feature vector is computed as $b_i = f_d(b'_i, s)\times\sigma(f_n(b'_i, s))$. Finally, the centralized Q-value can then be computed from $d_i$ and $b_i$ following Eq. 6.
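A possible instantiation of this diversity branch is sketched below in PyTorch; the module name, layer sizes, and the use of an embedding table for the raw features $b'_i$ are illustrative assumptions rather than the exact architecture used in the paper.

```python
import torch
import torch.nn as nn

class DiversityFeatureEncoder(nn.Module):
    """Sketch of the Deep Q-DPP diversity branch: direction and norm come from
    two separate networks, b_i = f_d(b'_i, s) * sigmoid(f_n(b'_i, s))."""

    def __init__(self, state_dim, num_actions, feature_dim, hidden=64):
        super().__init__()
        # Raw diversity features b'_i, indexed by the discrete action id.
        self.raw_features = nn.Embedding(num_actions, feature_dim)
        in_dim = feature_dim + state_dim
        self.f_d = nn.Sequential(nn.Linear(in_dim, hidden), nn.ReLU(),
                                 nn.Linear(hidden, feature_dim))  # direction
        self.f_n = nn.Sequential(nn.Linear(in_dim, hidden), nn.ReLU(),
                                 nn.Linear(hidden, 1))            # norm logit

    def forward(self, action_ids, state):
        b_raw = self.raw_features(action_ids)        # (batch, feature_dim)
        x = torch.cat([b_raw, state], dim=-1)
        direction = self.f_d(x)
        norm = torch.sigmoid(self.f_n(x))            # scalar norm in (0, 1)
        return direction * norm                      # diversity feature b_i
```

Following Fig. 6, the quality term $d_i$ would then come from exponentiating each agent's individual Q-value, and the mixing network combines $d_i$ and $b_i$ into the centralized Q-value via Eq. 6.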


C.2. Experiments on StarCraft II Micro-Management

Figure 7: StarCraft II micro-management on the scenario of 2 Marines vs. 1 Zealot and its performance. (a) Scenario screenshot. (b) 2m vs 1z.

We study one of the simplest continuous state-action micro-management games in StarCraft II from SMAC (Samvelyan et al., 2019b), i.e., 2m vs 1z; a screenshot of the scenario is given in Fig. 7(a). In the 2m vs 1z map, we control a team of 2 Marines to fight against 1 enemy Zealot. This task requires the Marine units to take advantage of their larger firing range to defeat the Zealot, which can only attack enemies at close range. The agents observe a continuous feature vector including the health, positions, and weapon cooldown of other agents. In terms of reward design, we keep the default setting: all agents receive a large final reward for winning a battle, and at the same time they also receive immediate rewards proportional to the difference in total damage between the two teams at every time-step. We compare Q-DPP with the aforementioned baseline models, i.e., COMA, VDN, QMIX, MAVEN, and QTRAN, and plot the results in Fig. 7(b). The results show that Q-DPP can perform as well as the state-of-the-art model, QMIX, even when the state features are continuous. However, the performance is not stable and exhibits high variance. We believe a full treatment needs substantial future work.