



On Emergent Communication in Competitive Multi-Agent Teams

Paul Pu Liang, Jeffrey Chen, Ruslan Salakhutdinov, Louis-Philippe Morency, Satwik Kottur
Carnegie Mellon University

[email protected]

ABSTRACT

Several recent works have found the emergence of grounded compositional language in the communication protocols developed by mostly cooperative multi-agent systems when learned end-to-end to maximize performance on a downstream task. However, human populations learn to solve complex tasks involving communicative behaviors not only in fully cooperative settings but also in scenarios where competition acts as an additional external pressure for improvement. In this work, we investigate whether competition for performance from an external, similar agent team could act as a social influence that encourages multi-agent populations to develop better communication protocols for improved performance, compositionality, and convergence speed. We start from Task & Talk, a previously proposed referential game between two cooperative agents, as our testbed and extend it into Task, Talk & Compete, a game involving two competitive teams each consisting of two aforementioned cooperative agents. Using this new setting, we provide an empirical study demonstrating the impact of competitive influence on multi-agent teams. Our results show that an external competitive influence leads to improved accuracy and generalization, as well as faster emergence of communicative languages that are more informative and compositional.

CCS CONCEPTS

• Theory of computation → Multi-agent learning; • Computing methodologies → Artificial intelligence; Multi-agent systems;

KEYWORDS

Learning agent-to-agent interactions; Multi-agent learning

ACM Reference Format:

Paul Pu Liang, Jeffrey Chen, Ruslan Salakhutdinov, Louis-Philippe Morency, Satwik Kottur. 2020. On Emergent Communication in Competitive Multi-Agent Teams. In Proc. of the 19th International Conference on Autonomous Agents and Multiagent Systems (AAMAS 2020), Auckland, New Zealand, May 9–13, 2020, IFAAMAS, 10 pages.

1 INTRODUCTION

The emergence and evolution of languages through human life, societies, and cultures has always been one of the hallmarks of human intelligence [11, 34, 42]. Humans intelligently communicate through language to solve multiple real-world tasks involving vision, navigation, reasoning, and learning [35, 43]. Language emerges naturally for us humans in both individuals through inner speech [1, 3, 5] as well as in groups through grounded dialog in

Proc. of the 19th International Conference on Autonomous Agents and Multiagent Systems (AAMAS 2020), B. An, N. Yorke-Smith, A. El Fallah Seghrouchni, G. Sukthankar (eds.), May 9–13, 2020, Auckland, New Zealand. © 2020 International Foundation for Autonomous Agents and Multiagent Systems (www.ifaamas.org). All rights reserved.

[Figure 1 graphic: (a) Instances span three attributes: shape {triangle, square, circle, star}, color {blue, green, red, purple}, and style {filled, dashed, dotted, solid}; (b) Tasks are ordered attribute pairs, e.g. (color, shape), (shape, color), (style, color), (color, style), (shape, style), (style, shape); (c) Communication among agents and teams: Team 1 and Team 2 each exchange rounds of questions and answers and make a prediction, connected by (1) reward sharing, (2) dialog overhearing, and (3) task sharing.]


Figure 1: We propose the Task, Talk & Compete game involving two competitive teams, each consisting of multiple rounds of dialog between Q-bot and A-bot. Within each team, A-bot is given a target instance unknown to Q-bot, and Q-bot is assigned a task (e.g. find the color and shape of the instance), unknown to A-bot, to uncover certain attributes of the target instance. This informational asymmetry necessitates communication between the two agents via multiple rounds of dialog, where Q-bot asks questions regarding the task and A-bot provides answers using the target instance. We investigate whether competition for performance from an external, similar agent team 2 could result in improved compositionality of emergent language and convergence of task performance within team 1. Competition is introduced through three aspects: 1) reward sharing, 2) dialog overhearing, and 3) task sharing. Our hypothesis is that teams are able to leverage information from the performance of the other team and learn more efficiently beyond the sole reward signal obtained from their own performance. Our findings show that competition for performance with a similar team leads to improved overall accuracy and generalization, as well as faster emergence of communicative language.



both cooperative and competitive settings [23, 24, 41]. As a result, there has been a push to build artificial intelligence models that can effectively communicate their own intentions to both humans as well as other agents through language and dialog grounded in vision [10, 29, 30, 38] and navigation [14].

A recent line of work has applied reinforcement learning techniques [44] for end-to-end learning of communication protocols between agents situated in virtual environments [9, 27] and found the emergence of grounded, interpretable, and compositional symbolic language in the communication protocols developed by the agents [2, 26]. To bring this closer to how natural language evolves within communities, we observe that human populations learn to solve complex tasks involving communicative behaviors not only in fully cooperative settings but also in scenarios where competition acts as an additional external pressure for improvement [15, 33]. Furthermore, recent studies have also provided support for external competition in behavioral economics [17] and evolutionary biology [12, 28, 36]. Inspired by the emergence of human behavior and language in competitive settings, the goal of our paper is therefore to understand how social influences like competition affect emergent language in multi-agent communication.

In this work, we conceptually distinguish two types of competition: 1) constant sum competition, where agents are competing for a finite amount of resources and thus a gain for a group of agents results in a loss for another [4, 32], and 2) variable sum or non-constant sum competition, where competition is inherent but both mutual gains and mutual losses of power are possible [6, 40]. For the sake of simplicity, we call these competition for resources and competition for performance respectively, and focus on the latter. In this setting, each individual (or team) performs similar tasks with access to individual resources but is motivated to do better due to the external pressure from seeing how well other individuals (or teams) are performing. We hypothesize that competition for performance from an external, similar agent team could act as a social influence that encourages multi-agent populations to develop better communication protocols for improved performance, compositionality, and convergence.

To investigate this hypothesis, we extend the Task & Talk referential game between two cooperative agents [25] into Task, Talk & Compete, a game involving two competitive teams each consisting of the aforementioned two cooperative agents (Figure 1). At a high level, Task & Talk requires the two cooperative agents (Q-bot and A-bot) to solve a task by interacting with each other via dialog, at the end of which a reward is assigned to measure their performance. In such a setting, we introduce competition for performance between two teams of Q-bot and A-bot through three aspects: 1) reward sharing, where we modify the reward structure so that a team can assess its performance relative to the other team, 2) dialog overhearing, where we modify the agents' policy networks such that they can overhear the concurrent dialog from the other team, and 3) task sharing, where teams gain additional information about the tasks given to the opposing team. Using this new setting, we provide an empirical study demonstrating the impact of competitive influence on multi-agent dialog games. Our results show that external competitive influence leads to improved accuracy and generalization, as well as faster emergence of communicative languages that are more informative and compositional.

2 THE TASK & TALK GAME

The Task & Talk benchmark is a cooperative reference game proposed to evaluate the emergence of language in multi-agent populations with imperfect information [13, 25]. Task & Talk takes place between two agents, Q-bot (which is tasked to ask questions) and A-bot (which is tasked to answer questions), in a synthetic world of instances comprised of three attributes: color, style, and shape. At the start of the game, A-bot is given a target instance I unknown to Q-bot, and Q-bot is assigned a task G (unknown to A-bot) to uncover certain attributes of the target instance I. This informational asymmetry necessitates communication between the two agents via multiple rounds of dialog where Q-bot asks questions regarding the task and A-bot provides answers using the target instance. At the end of the game, Q-bot uses the information conveyed from the dialog with A-bot to solve the task at hand. Both agents are given equal rewards based on the accuracy of Q-bot's prediction.¹
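For illustration, here is a minimal Python sketch of this synthetic world, with the attribute values taken from Figure 1 (the constants and helper names are ours, not from the original implementation): an instance assigns one value to each attribute, and a task is an ordered pair of attributes whose values Q-bot must report.

```python
import itertools
import random

# Attribute inventory from Figure 1: 3 attributes x 4 values each.
ATTRIBUTES = {
    "color": ["blue", "green", "red", "purple"],
    "style": ["filled", "dashed", "dotted", "solid"],
    "shape": ["triangle", "square", "circle", "star"],
}

# All 4^3 = 64 instances: one value per attribute.
INSTANCES = [
    dict(zip(ATTRIBUTES, values))
    for values in itertools.product(*ATTRIBUTES.values())
]

# All 6 ordered tasks: pairs of distinct attributes, e.g. (color, shape).
TASKS = [(a, b) for a in ATTRIBUTES for b in ATTRIBUTES if a != b]

def sample_episode(rng=random):
    """A-bot privately receives the instance I; Q-bot privately receives the task G."""
    instance = rng.choice(INSTANCES)
    task = rng.choice(TASKS)
    ground_truth = tuple(instance[attr] for attr in task)  # what Q-bot must predict
    return instance, task, ground_truth
```

Sampling an episode gives, for example, I = {'color': 'green', 'style': 'filled', 'shape': 'star'} and G = ('color', 'shape'), matching the example shown in Figure 1.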

Base States and Actions: Each agent begins by observing its specific input: task G for Q-bot and instance I for A-bot. The game proceeds as a dialog over multiple rounds t = 1, ..., T. We use lower case characters (e.g. s^Q_t) to denote token symbols and upper case (e.g. S^Q_t) to denote corresponding representations. Q-bot is modeled with three modules – speaking, listening, and prediction. At round t, Q-bot stores an initial state representation S^Q_{t-1} from which it conditionally generates output utterances q_t ∈ V_Q, where V_Q is the vocabulary of Q-bot. S^Q_{t-1} is updated using answers a_t from A-bot and is used to make a prediction ŵ^G in the final round. A-bot is modeled with two modules – speaking and listening. A-bot encodes instance I into its initial state S^A_t from which it conditionally generates output utterances a_t ∈ V_A, where V_A is the vocabulary of A-bot. S^A_t is updated using questions q_t from Q-bot.

In more detail, Q-bot and A-bot are modeled as stochastic policies π_Q(q_t | s^Q_t; θ_Q) and π_A(a_t | s^A_t; θ_A) implemented as recurrent networks [20, 21]. At the beginning of round t, Q-bot observes state s^Q_t = [G, q_1, a_1, ..., q_{t-1}, a_{t-1}] representing the task G and the dialog conveyed up to round t−1. Q-bot conditions on state s^Q_t and utters a question represented by some token q_t ∈ V_Q. A-bot also observes the dialog history and this new utterance as state s^A_t = [I, q_1, a_1, ..., q_{t-1}, a_{t-1}, q_t]. A-bot conditions on this state s^A_t and utters an answer a_t ∈ V_A. At the final round, Q-bot predicts a pair of attributes ŵ^G = (ŵ^G_1, ŵ^G_2) using a network π_G(g | s^Q_T; θ_Q) to solve the task.
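As a rough illustration of how such recurrent policies can be parameterized, the following PyTorch sketch pairs a listener (embedding plus LSTM cell) with a speaker (linear layer plus categorical sampling). This is a minimal stand-in under our own design choices, not the exact architecture of [25].

```python
import torch
import torch.nn as nn
from torch.distributions import Categorical

class Bot(nn.Module):
    """Minimal speak/listen agent: embeds heard tokens, keeps an LSTM state,
    and samples its next utterance from a categorical distribution."""

    def __init__(self, in_vocab, out_vocab, embed_dim=32, hidden_dim=64):
        super().__init__()
        self.embed = nn.Embedding(in_vocab, embed_dim)   # ListenNet: token -> vector
        self.cell = nn.LSTMCell(embed_dim, hidden_dim)   # recurrent state update
        self.speak = nn.Linear(hidden_dim, out_vocab)    # SpeakNet: state -> token logits
        self.hidden_dim = hidden_dim

    def init_state(self, batch_size):
        h = torch.zeros(batch_size, self.hidden_dim)
        c = torch.zeros(batch_size, self.hidden_dim)
        return h, c

    def listen(self, token, state):
        # Fold one observed token (question, answer, task, or instance symbol) into the state.
        return self.cell(self.embed(token), state)

    def act(self, state):
        # Sample an utterance and keep its log-probability for REINFORCE.
        logits = self.speak(state[0])
        dist = Categorical(logits=logits)
        token = dist.sample()
        return token, dist.log_prob(token)
```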

Learning the Policy: Q-bot and A-bot are trained to cooperate by maximizing a shared objective function that is determined by whether Q-bot is able to solve the task at hand. Q-bot and A-bot receive an identical base reward of R if Q-bot's prediction ŵ^G matches the ground truth w^G and a negative reward of −10R otherwise. R is a hyperparameter that affects the rate of convergence.

J(θ_Q, θ_A) = E_{π_Q, π_A}[ R(ŵ^G, w^G) ],    (1)

where R is a reward function. To train these agents, we learn policy parameters θ_Q and θ_A that maximize the expected reward J(θ_Q, θ_A)

¹ A detailed review of Task & Talk is provided in the appendix.


[Figure 2 graphic: policy networks for Q-bot and A-bot in Team 1 and Team 2, each built from ListenNet and SpeakNet modules with a prediction LSTM, unrolled over rounds of dialog; questions q_t(j) and answers a_t(j) flow within each team, with cross-team connections for overheard dialog and the shared reward R.]

Figure 2: Detailed policy networks (implemented as conditional recurrent networks) and communication protocols for Q-bot and A-bot in both teams. At the start, A-bot is given a target instance I unknown to Q-bot and Q-bot is assigned a task G (unknown to A-bot) to uncover certain attributes of the target instance I. At each round t, Q-bot observes state s^Q_t and utters some token q_t ∈ V_Q. A-bot observes the history and this new utterance as state s^A_t and utters a_t ∈ V_A. At the final round, Q-bot predicts a pair of attribute values ŵ^G = (ŵ^G_1, ŵ^G_2) to solve the task. Policy parameters θ_Q and θ_A are updated using the REINFORCE policy gradient algorithm [44] on a positive reward +R if Q-bot's prediction ŵ^G matches the ground truth w^G and a negative reward of −10R otherwise. Competition is introduced via 1) reward sharing (R(2) passed to team 1), 2) dialog overhearing (team 1 overhears and conditions on q_t(2), a_t(2)), and 3) task sharing (team 1 knows about I(2) and G(2)). Blue dotted lines represent (cooperative) dialog within a team and red dotted lines represent (competitive) dialog across teams.

Algorithm 1 Training a single team of cooperative agents.
CooperativeTrain:
1: Given Q-bot params θ_Q and A-bot params θ_A.
2: for (I, G) in each batch do
3:   for communication round t = 1, ..., T do
4:     s^Q_t = [G, q_1, a_1, ..., q_{t-1}, a_{t-1}]
5:     q_t = π_Q(q_t | s^Q_t; θ_Q)
6:     s^A_t = [I, q_1, a_1, ..., q_{t-1}, a_{t-1}, q_t]
7:     a_t = π_A(a_t | s^A_t; θ_A)
8:   end for
9:   ŵ^G = (ŵ^G_1, ŵ^G_2) = π_G(g | s^Q_T; θ_Q)
10:  J(θ_Q, θ_A) = E_{π_Q, π_A}[ R(ŵ^G, w^G) ].
11:  θ_Q = θ_Q + η ∇_{θ_Q} J(θ_Q, θ_A).
12:  θ_A = θ_A + η ∇_{θ_A} J(θ_Q, θ_A).
     ▷ Update θ_Q, θ_A using estimated ∇J(θ_Q, θ_A).
13: end for

using the REINFORCE policy gradient algorithm [44]. The expectations of the policy gradients for θ_Q and θ_A are given by

∇_{θ_Q} J(θ_Q, θ_A) = E_{π_Q, π_A}[ R(ŵ^G, w^G) ∇_{θ_Q} log π_Q(q_t | s^Q_t; θ_Q) ],    (2)

∇_{θ_A} J(θ_Q, θ_A) = E_{π_Q, π_A}[ R(ŵ^G, w^G) ∇_{θ_A} log π_A(a_t | s^A_t; θ_A) ].    (3)

These expectations are approximated by sample averages across object instances and tasks in a batch, as well as across dialog rounds for a given object instance and task. Using the estimated gradients, the parameters θ_Q and θ_A are updated using gradient-based methods in an alternating fashion until convergence. We summarize the procedure for cooperative training of a single team of agents in Algorithm 1, which we call CooperativeTrain.
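To make the estimator concrete, the sketch below rolls out one team's dialog and applies the REINFORCE update of Equations (2) and (3), reusing the hypothetical Bot interface from the earlier sketch; the batch encoding, the prediction head, and the optimizer handling are our own simplifications rather than the exact training code.

```python
import torch
from torch.distributions import Categorical

def cooperative_train_step(q_bot, a_bot, predictor, optimizer, batch, num_rounds, R=1.0):
    """One REINFORCE update for a single team (a sketch of CooperativeTrain).

    `batch` yields (task_tokens, instance_tokens, ground_truth), where the token
    inputs are lists of shape-(1,) LongTensors encoding G and I, and ground_truth
    is an integer index over attribute-value pairs (our own encoding assumptions).
    """
    total_loss = 0.0
    for task_tokens, instance_tokens, ground_truth in batch:
        q_state, a_state = q_bot.init_state(1), a_bot.init_state(1)
        for tok in task_tokens:          # Q-bot listens to its private task G
            q_state = q_bot.listen(tok, q_state)
        for tok in instance_tokens:      # A-bot listens to its private instance I
            a_state = a_bot.listen(tok, a_state)

        log_probs = []
        for _ in range(num_rounds):
            q_tok, q_lp = q_bot.act(q_state)            # Q-bot asks q_t
            a_state = a_bot.listen(q_tok, a_state)
            a_tok, a_lp = a_bot.act(a_state)            # A-bot answers a_t
            q_state = q_bot.listen(a_tok, q_state)
            log_probs += [q_lp, a_lp]

        # Final prediction head (pi_G), also trained with REINFORCE.
        pred_dist = Categorical(logits=predictor(q_state[0]))
        pred = pred_dist.sample()
        log_probs.append(pred_dist.log_prob(pred))

        reward = R if pred.item() == ground_truth else -10.0 * R
        # Maximize E[R * log pi]  <=>  minimize -R * (sum of log-probabilities).
        total_loss = total_loss - reward * torch.stack(log_probs).sum()

    optimizer.zero_grad()
    total_loss.backward()
    optimizer.step()
```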

3 THE TASK, TALK & COMPETE GAME

We modify the Task & Talk benchmark into the Task, Talk & Compete game (Figure 2). Our setting now consists of two teams of agents: Q-bot(1) and A-bot(1) belonging to team 1 and Q-bot(2) and A-bot(2) belonging to team 2. Similar to the traditional Task & Talk game, the agents Q-bot(1) and A-bot(1) in team 1 cooperate and communicate to solve a task, and likewise for the agents in team 2.

The Task, Talk & Compete game begins with two target instances I(1) and I(2) presented to A-bot(1) and A-bot(2) respectively, and two tasks G(1) and G(2) presented to Q-bot(1) and Q-bot(2) respectively. Within a team, we largely follow the setting in Task & Talk as described in the previous section. A team consists of a pair of agents, Q-bot and A-bot, cooperating in a partially observable world to solve task G given instance I. The key difference is that agents in one team are competing against those in the other team. In the following subsections, we explain the various sources of competition that we introduce into the game and the modified training procedure for teams of agents.

We use subscripts to index the rounds and indices in parentheses to denote which team the agents belong to (e.g. s^Q_t(1)). We drop the team index if it is clear from the context (i.e. same team). Note that for grounding to happen across teams (i.e. for team 1 to understand team 2 during dialog overhearing and vice versa), the teams share vocabulary sizes (i.e. |V_Q| is the same for Q-bot(1) and Q-bot(2)). They do not share vocabularies, since both teams train differently, but they have access to the same number of symbols.

3.1 Sources of Competition

When the two teams do not share any information and are trained completely independently, the Task, Talk & Compete game reduces to (two copies of) the Task & Talk game. Therefore, information sharing across the teams is necessary to introduce competition. In the following section we highlight the information sharing that can happen in Task, Talk & Compete and the various sources of competition that can subsequently arise.


Algorithm 2 Training two teams to compete against each other.
CompetitiveTrain:
 1: Given: Q-bot(j) parameters $\theta_Q(j)$ and A-bot(j) parameters $\theta_A(j)$, for $j = 1, 2$.
 2: for $(I, G)$ in each batch do
 3:     for communication round $t = 1, \ldots, T$ do
 4:         for $j = 1, 2$ do
 5:             $j' = 3 - j$    ▷ Index of the other team
 6:             $s^Q_t(j) = [G(j), q_1(j), a_1(j), \ldots, q_{t-1}(j), a_{t-1}(j)]$
 7:             if Dialog Overhearing then
 8:                 $s^Q_t(j) \mathrel{+}= [q_1(j'), a_1(j'), \ldots, q_{t-1}(j'), a_{t-1}(j')]$    ▷ Overhear dialog from the other team
 9:             end if
10:             if Task Sharing then
11:                 $s^Q_t(j) \mathrel{+}= [G(j')]$
12:             end if
13:             $q_t(j) = \pi_Q(q_t \mid s^Q_t(j); \theta_Q(j))$
14:             (and symmetrically for $s^A_t(j)$ of A-bot(j))
15:         end for
16:     end for
17:     Compute $\hat{w}^G = (\hat{w}^G_1, \hat{w}^G_2) = \pi_G(g \mid s^Q_T; \theta_Q)$ independently for each team.
18:     if Reward Sharing then
19:         $J(\theta_Q(j), \theta_A(j)) = \mathbb{E}_{\pi_Q, \pi_A}[R_{\mathrm{shared}}(\hat{w}^G(j), w^G(j))]$, $j = 1, 2$.    ▷ Compute $J(\cdot)$ jointly using predictions of both teams
20:     else
21:         $J(\theta_Q(j), \theta_A(j)) = \mathbb{E}_{\pi_Q, \pi_A}[R(\hat{w}^G(j), w^G(j))]$, $j = 1, 2$.    ▷ Compute $J(\cdot)$ independently for each team
22:     end if
23:     $\theta_Q(j) = \theta_Q(j) + \eta \nabla_{\theta_Q(j)} J(\theta_Q(j), \theta_A(j))$, $j = 1, 2$.
24:     $\theta_A(j) = \theta_A(j) + \eta \nabla_{\theta_A(j)} J(\theta_Q(j), \theta_A(j))$, $j = 1, 2$.    ▷ Update parameters using the estimated $\nabla J(\cdot)$
25: end for

FullTrain:
26: for epoch = 1, ..., MAX do
27:     CooperativeTrain agents within team 1.
28:     CooperativeTrain agents within team 2.
29:     CompetitiveTrain agents across teams 1 and 2.
30: end for

Reward Sharing (RS): We modify the reward structure so that a team can assess their performance relative to other teams. Starting with a base reward of $R$ in the fully cooperative setting, we modify this reward for the competitive setting as shown in the following table:

            Team 2 ✓        Team 2 ✗
Team 1 ✓    (+R, +R)        (+R, −100R)
Team 1 ✗    (−100R, +R)     (−10R, −10R)

When both teams get the same result, this reduces to the base reward setting: $+R$ for correct answers and $-10R$ for wrong ones. When there is asymmetry in performance across the two teams, the correct team gains a reward of $+R$ while the incorrect team suffers a larger penalty of $-100R$. This encourages teams to compete and assess their performance relative to the other team.
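Read as code, the payoff table above could be implemented as in the following sketch (our own illustration; the function name and signature are hypothetical, and R defaults to the paper's reward multiplier of 100).

def shared_reward(team1_correct: bool, team2_correct: bool, R: float = 100.0):
    # Reward-sharing payoffs (team 1 reward, team 2 reward) from the table above.
    if team1_correct and team2_correct:
        return (+R, +R)              # both correct: base cooperative reward
    if team1_correct and not team2_correct:
        return (+R, -100.0 * R)      # only team 1 correct: team 2 takes the larger penalty
    if team2_correct and not team1_correct:
        return (-100.0 * R, +R)      # only team 2 correct
    return (-10.0 * R, -10.0 * R)    # both wrong: base cooperative penalty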

Dialog Overhearing (DO): Overhearing the conversations of another team is a common way to gain information about the tactics, knowledge, and progress of one's competitors. In a similar fashion, we modify the agents' policy networks so that they can now overhear the concurrent dialog from the other team. Take Q-bot(1) and A-bot(1) in team 1 as an example. At round $t$, Q-bot(1) now observes state

$s^Q_t(1) = [G(1), q_1(1), a_1(1), q_1(2), a_1(2), \ldots, q_{t-1}(1), a_{t-1}(1), q_{t-1}(2), a_{t-1}(2)]$,   (4)

and similarly for Q-bot(2). A-bot(1) observes

$s^A_t(1) = [I(1), q_1(1), a_1(1), q_1(2), a_1(2), \ldots, q_{t-1}(1), a_{t-1}(1), q_{t-1}(2), a_{t-1}(2), q_t(1), q_t(2)]$,   (5)

and similarly for A-bot(2). We view this as augmenting reward sharing by informing the agents as to why they were penalized: a team can listen to what the other team is communicating and use that to its own advantage. In practice, we define an overhear fraction $\rho$ which determines how often overhearing occurs during the training epochs; $\rho$ is a hyperparameter in our experiments.

Task Sharing (TS): Finally, we fully augment both reward sharing and dialog overhearing with task sharing, where the two sets of instances and tasks are known to both teams (i.e., $G(2)$ is added to Equation 4 and $I(2)$ is added to Equation 5). Task sharing adds an additional level of grounding: one team can now ground the overheard dialog in a specific task and compare their performance to that of the other team. Overall, Algorithm 2 summarizes the procedure for training teams of agents in a competitive setting using the aforementioned three sources of competition. We call the resulting algorithm CompetitiveTrain.
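For concreteness, the sketch below (our own illustration; the function and argument names are hypothetical) assembles Q-bot(1)'s observation from Equation 4, optionally appending the overheard dialog from team 2 and, under task sharing, team 2's task; disabled or withheld information is replaced by zeros, as described under asynchronous training below.

from typing import List, Optional, Tuple

def build_qbot_state(task: Tuple[int, int],
                     own_dialog: List[Tuple[int, int]],
                     overheard_dialog: Optional[List[Tuple[int, int]]] = None,
                     shared_task: Optional[Tuple[int, int]] = None,
                     pad: int = 0) -> List[int]:
    # Q-bot's observation at round t (cf. Eq. 4): own task G(1), then for each
    # past round the own (q, a) pair followed by the other team's overheard
    # (q, a) pair, and finally the shared task G(2) when task sharing is on.
    # Overheard symbols and the shared task are padded with zeros when the
    # corresponding source of competition is disabled, on non-overhearing
    # epochs (a fraction 1 - rho of epochs), and at test time.
    state = list(task)
    for i, (q, a) in enumerate(own_dialog):
        state += [q, a]
        state += list(overheard_dialog[i]) if overheard_dialog is not None else [pad, pad]
    state += list(shared_task) if shared_task is not None else [pad, pad]
    return state

A-bot(1)'s state from Equation 5 would be built analogously, with the instance $I(1)$ in place of the task and the current questions $q_t(1)$, $q_t(2)$ appended.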

Asynchronous Training: The baseline from Kottur et al. [25] only evaluates the performance of a single team at test time. For a fair comparison with this baseline, we evaluate performance with a single team 1 as well. In the case of dialog overhearing and task sharing, we replace the overheard symbols and tasks from team 2 with zeros during evaluation. This removes the confounding explanation that improved performance is due to having more information from team 2 during testing.

To prevent data mismatch during training and testing [16, 39], we use a three-stage training process. During stage (1), Q-bot(1) and A-bot(1) are trained to cooperate, independent of team 2; all information shared from team 2 is passed to team 1 as zeros. During stage (2), Q-bot(2) and A-bot(2) are trained to cooperate, and in stage (3), both teams are trained together with reward sharing and/or dialog overhearing and/or task sharing. These three stages are repeated until convergence. Intuitively, this procedure means that the agents within each team must learn how to cooperate with each other in addition to competing with the other team. The final algorithm, which takes into account asynchronous training both within and across teams, is shown in Algorithm 2; we call it FullTrain, and it is the algorithm used for training both teams of agents.
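A compact sketch of the resulting outer loop (our own pseudocode-like Python; cooperative_train and competitive_train are hypothetical names standing in for Algorithms 1 and 2):

import random

def full_train(team1, team2, cooperative_train, competitive_train,
               max_epochs: int, rho: float = 0.5) -> None:
    # Three-stage FullTrain loop: each epoch, (1) team 1 trains cooperatively
    # with all team-2 information zeroed out, (2) team 2 does the same, and
    # (3) both teams train jointly with the enabled sources of competition.
    for epoch in range(max_epochs):
        cooperative_train(team1)                              # stage (1)
        cooperative_train(team2)                              # stage (2)
        overhear = random.random() < rho                      # overhear fraction rho
        competitive_train(team1, team2, overhear=overhear)    # stage (3)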


Due to how we model competition in our setting (either through reward sharing or through task and dialog overhearing), our hypothesis is that teams are able to leverage information from the performance of the other team and learn more efficiently beyond the signal obtained solely from their own performance.

4 EXPERIMENTAL SETUP

We present our experimental results and observations on the effect of competitive influence on both final task performance and the emergence of grounded, interpretable, and compositional language. Our experimental testbed is the Task, Talk & Compete game. We implement a variety of different algorithms spanning cooperative baselines and competitive methods, which we detail in the following subsections.2

4.1 Cooperative Baselines

These are baseline methods that involve only a cooperative team (or teams) of agents. These baselines test for various confounders in our experiments (e.g., improved performance due to an increase in the number of parameters).

(1) Coop, base: a single-team, fully cooperative setting with reward structure $(+R, -10R)$, which is the baseline from [25].

(2) Coop, rewards: a single-team, fully cooperative setting with reward structure $(+R, -100R)$, adjusted for our “extreme” reward setting that is more strict in penalizing incorrect answers. This baseline ensures that our improvements are not simply due to better reward shaping [18, 19].

(3) Coop, params: a single-team, fully cooperative setting with (roughly) double the number of parameters of the baseline. This ensures that the improvement in performance is not due to an increase in the number of parameters from dialog overhearing or task sharing. This baseline is obtained by increasing the LSTM hidden size of the question and answer bots so that the total number of parameters matches that of the competitive settings, which have roughly double the number of parameters.

(4) Coop, double: two teams, each trained independently without sharing any information. At test time, the team with the higher validation score is used to evaluate performance. This baseline ensures that our improvements are not simply due to double the chance of beginning with a better random seed or luckier training.

4.2 Competitive Methods

The following methods introduce an extra level of competition between teams on top of cooperation within teams. Competitive behavior is encouraged via combinations of reward sharing, dialog overhearing, and task sharing.

(1) Comp, TS: two competitive teams with task sharing.
(2) Comp, DO: two competitive teams with dialog overhearing.
(3) Comp, DO+TS: two competitive teams with dialog overhearing and task sharing.
(4) Comp, RS: two competitive teams with reward sharing.
(5) Comp, RS+TS: two competitive teams with reward sharing and task sharing.
(6) Comp, RS+DO: two competitive teams with reward sharing and dialog overhearing.
(7) Comp, RS+DO+TS: two competitive teams with reward sharing, dialog overhearing, and task sharing.

2 Our data and models are publicly released at https://github.com/pliang279/Competitive-Emergent-Communication.

Hyperparameters: We perform all experiments with the same hyperparameters (except the LSTM hidden dimension, which is fixed at 100 but increased to 150 for the Coop, params setting to experiment with an increase in the number of parameters). We set the reward multiplier $R = 100$, the overhear fraction $\rho = 0.5$, and the vocabulary sizes of Q-bot and A-bot to $|V_Q| = 3$ and $|V_A| = 4$ respectively. Following Kottur et al. [25], we set A-bot(1) and A-bot(2) to be memoryless to ensure consistent A-bot grounding across rounds, which is important for generalization and compositional language. All other parameters follow those in Kottur et al. [25].

Metrics: In addition to evaluating the train and test accuracies [25], we would also like to investigate the impact of competitive pressure on the emergence of language among agents. We measure Instantaneous Coordination (IC) [22], defined as the mutual information between one agent's message and another agent's action, i.e., $\mathrm{MI}(a_t(j), \hat{w}^G_i(j))$. Higher IC implies that Q-bot's action depends more strongly on A-bot's message (and vice versa), which is in turn indicative that messages are used in a positive manner to signal actions within a team. Although Lowe et al. [31] mentioned other metrics such as speaker consistency [8, 22] and entropy, we believe that IC is the most suited for question-answering dialog tasks where responding to another agent's messages is key. We also measured several other metrics that were recently proposed to measure how informative a language is with respect to the agent's actions [31]: Speaker Consistency (SC), which measures the mutual information between an agent's message and its own future action, $\mathrm{MI}(q_t(j), \hat{w}^G_i(j))$, and Entropy (H), which measures the entropy of an agent's sequence of outgoing messages.
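As a reference point for how such scores can be computed, the sketch below (our own illustration, not the evaluation code of [22, 31]) estimates mutual information and entropy from empirical counts of (message, action) pairs gathered over evaluation episodes.

import math
from collections import Counter
from typing import Iterable, Tuple

def mutual_information(pairs: Iterable[Tuple[int, int]]) -> float:
    # Plug-in estimate of MI(X; Y) in nats from observed (x, y) pairs, e.g.
    # (A-bot message a_t, Q-bot guess) for IC, or (Q-bot message q_t, Q-bot guess) for SC.
    pairs = list(pairs)
    n = len(pairs)
    joint = Counter(pairs)
    px = Counter(x for x, _ in pairs)
    py = Counter(y for _, y in pairs)
    return sum((c / n) * math.log((c * n) / (px[x] * py[y])) for (x, y), c in joint.items())

def entropy(messages: Iterable[int]) -> float:
    # Entropy (in nats) of an agent's outgoing message distribution.
    msgs = list(messages)
    counts, n = Counter(msgs), len(msgs)
    return -sum((c / n) * math.log(c / n) for c in counts.values())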

Training Details: For cooperative baselines, we set the maximum number of epochs to 100,000 and stop training early when training accuracy reaches 100%. For competitive methods, we also set the maximum number of epochs to 100,000 and stop training early when the first team reaches a training accuracy of 100%. This team is labeled as the winning team and is the focus of our experiments. The other, losing team can be viewed as an auxiliary team that helps the performance of the winning team. All experiments are repeated 10 times with randomly chosen seeds. Results are reported as average ± standard deviation over the 10 runs. Implementation details are provided in the appendix.

5 RESULTS AND ANALYSIS

We study the effect of competition between teams on 1) the generalization abilities of the agents in new environments at test time, 2) the rate of convergence of train and test accuracies during training, and 3) the emergence of informative communication protocols between agents when solving the Task, Talk & Compete game.

5.1 Quantitative Results

We begin by studying the effect of competition on the performance of agents in the Task, Talk & Compete game. The teams are trained in various cooperative and competitive settings. We aggregate scores for both winning and losing teams as described in the training details above. These results are summarized in Table 1. From our results, we draw the following general observations regarding the generalization capabilities of the trained agents:


Type                    Method                   RS  DO  TS    Train Acc (%)                   Test Acc (%)
                                                               Winning Team    Losing Team     Winning Team    Losing Team
Cooperative baselines   Coop, base [25]          ✗   ✗   ✗     88.5 ± 11.6     -               45.6 ± 18.9     -
                        Coop, rewards [18, 19]   ✗   ✗   ✗     87.0 ± 13.7     -               49.7 ± 22.9     -
                        Coop, params             ✗   ✗   ✗     85.5 ± 14.6     -               53.3 ± 26.2     -
                        Coop, double             ✗   ✗   ✗     91.4 ± 12.0     74.6 ± 16.6     57.8 ± 28.5     38.3 ± 27.0
Competitive methods     Comp, TS                 ✗   ✗   ✓     94.4 ± 3.0      80.4 ± 13.3     53.1 ± 19.5     41.7 ± 25.7
                        Comp, DO                 ✗   ✓   ✗     96.8 ± 5.4      71.4 ± 14.2     65.7 ± 26.1     23.1 ± 10.7
                        Comp, DO+TS              ✗   ✓   ✓     98.6 ± 2.0      75.8 ± 12.1     75.8 ± 17.9     41.1 ± 26.0
                        Comp, RS                 ✓   ✗   ✗     99.5 ± 1.3      79.6 ± 12.2     68.5 ± 19.4     43.8 ± 20.9
                        Comp, RS+TS              ✓   ✗   ✓     99.5 ± 1.4      66.6 ± 10.3     68.9 ± 16.9     26.7 ± 11.3
                        Comp, RS+DO              ✓   ✓   ✗     100.0 ± 0.0     63.9 ± 12.8     78.3 ± 10.7     28.3 ± 14.4
                        Comp, RS+DO+TS           ✓   ✓   ✓     98.8 ± 2.4      78.9 ± 10.9     77.2 ± 16.5     28.9 ± 21.8

Table 1: Train and test accuracies measured across teams trained in various cooperative and competitive settings. All cooperative baselines are in shades of blue and competitive teams are in red. RS: reward sharing, DO: dialog overhearing, TS: task sharing. Accuracies are reported separately for winning and losing teams, with the best accuracies for winning teams in bold. Winning teams in competitive settings display faster convergence and improved performance.

(1) Sharing messages via overhearing dialog improves generalization performance: Dialog overhearing contributes the most towards improvement in test accuracy, from below 60% in the baselines without dialog overhearing to 75.8% with it. We believe this is because dialog overhearing transmits the most information to the other team, as compared to a single scalar in reward sharing or a single image in task sharing.

(2) Composing sources of competition improves performance: While dialog overhearing on its own displays strong improvements in test performance, we found that composing multiple sources of competition improves performance even more. In the Comp, RS+DO setting, the winning team's train accuracy quickly increases to 100% very consistently across all 10 runs, while test accuracy is also the highest at 78.3%. Other settings that worked well included Comp, DO+TS with a winning test accuracy of 75.8%, and Comp, RS+DO+TS with a winning test accuracy of 77.2%.

(3) Increasing competitive pressure increases the gap between winning and losing teams: Another finding is that as more competitive pressure is introduced, the gap between winning and losing teams increases. The winning team increasingly performs better, especially in train accuracy, and the losing team increasingly performs worse than the single-team cooperative baseline (i.e., lower than ∼50%). This confirms our hypothesis that the losing team acts as an auxiliary team that boosts the performance of the winning team at its own expense. Furthermore, winning teams that survive through multiple sources of competition learn better communication protocols that allow them to generalize better to new test environments.

(4) Reward shaping does not improve performance: In both cooperative and competitive settings, one cannot rely solely on reward shaping to improve generalization. The test accuracies are largely similar across various reward settings.

(5) Improvement in performance is not due to other factors: To ensure that the empirical results we observe are not due to confounding factors, we compare the performance of teams of competitive agents with purely cooperative baselines with reward shaping, double the number of parameters, and double the number of teams. None of these baselines generalize well, which shows that the improvement in performance is not due to better rewards, more parameters, or luckier training.

5.2 Rates of Convergence

We also compare the convergence in test accuracies of teams trained in both fully cooperative and competitive settings. For runs that ended early because train accuracy reached 100%, we propagate the test accuracy corresponding to the best train accuracy over the remaining epochs. This ensures that the test accuracies over multiple runs are averaged consistently across epochs. We outline these results in Figure 3. We find that in all settings, training teams of competitive agents leads to faster convergence and improved performance as compared to the cooperative baselines. Furthermore, composing multiple sources of competition during training steadily improves performance. Figure 3(b) shows a clear trend: Comp, DO+TS > Comp, DO ≈ Comp, TS > Coop, base. By further adding reward sharing, Figure 3(c) shows that Comp, RS+DO+TS ≈ Comp, RS+DO > Comp, RS+TS > Coop, base.

We find that the fastest convergence happens in the earlier stages of training. Winning teams tend to quickly pull ahead of their losing counterparts during earlier stages, and losing teams are unable to recover in later stages of training. This observation is also shared in Table 1, where the gap between winning and losing teams grows as sources of competition are composed. Interestingly, these observations are also mirrored in studies in psychology which argue that competition during childhood is beneficial for early cognitive and social development [7, 37].

Finally, another interesting observation is that high-performing winning teams trained in competitive settings tend to display lower variance in test-time evaluation as compared to their cooperative counterparts, thereby leading to more stable training.
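The curve-averaging step described at the start of this subsection can be sketched as follows (our own illustration with hypothetical names): runs that stopped early carry forward the test accuracy at their best-train-accuracy epoch so that per-epoch means and standard deviations over the 10 runs are well defined.

from typing import List

def propagate_early_stopped_curve(test_acc: List[float],
                                  train_acc: List[float],
                                  total_epochs: int) -> List[float]:
    # Extend a test-accuracy curve from an early-stopped run (train accuracy hit
    # 100%) by repeating the test accuracy at the epoch with the best train accuracy.
    best_epoch = max(range(len(train_acc)), key=lambda e: train_acc[e])
    return test_acc + [test_acc[best_epoch]] * (total_epochs - len(test_acc))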

5.3 Emergence of Language

In addition to generalization performance, we also compare the quality of the communication protocols that emerge between the teams of trained agents. Specifically, we measure the signaling that occurs between agents using the Instantaneous Coordination (IC) metric [22].



Figure 3: We plot the convergence of test accuracy across training epochs for the winning team trained in various cooperative and competitive settings. All cooperative baselines are in shades of blue and competitive teams are in red, black, and green. Lines represent the mean across 10 runs and shaded boundaries represent the standard deviations. In (a), we compare the cooperative baselines Coop, base, Coop, rewards, Coop, params, and Coop, double with a well-performing competitive method, Comp, DO+TS. In (b), we compare Coop, base with competitive teams involving Dialog Overhearing (DO) and Task Sharing (TS) (i.e., Comp, TS, Comp, DO, Comp, DO+TS). In (c), we compare Coop, base with competitive teams that additionally incorporate Reward Sharing (RS) (i.e., Comp, RS+TS, Comp, RS+DO, Comp, RS+DO+TS). Winning teams in competitive settings display faster convergence and improved generalization performance in test environments.

Type                    Method                   IC
                                                 Winning Team    Losing Team
Cooperative baselines   Coop, base [25]          0.675 ± 0.099   -
                        Coop, rewards [18, 19]   0.646 ± 0.050   -
                        Coop, params             0.689 ± 0.101   -
                        Coop, double             0.719 ± 0.145   0.691 ± 0.153
Competitive methods     Comp, TS                 0.650 ± 0.139   0.592 ± 0.128
                        Comp, DO                 0.778 ± 0.161   0.757 ± 0.179
                        Comp, DO+TS              0.806 ± 0.202   0.800 ± 0.204
                        Comp, RS                 0.793 ± 0.165   0.776 ± 0.161
                        Comp, RS+TS              0.726 ± 0.207   0.718 ± 0.118
                        Comp, RS+DO              0.814 ± 0.154   0.743 ± 0.116
                        Comp, RS+DO+TS           0.834 ± 0.203   0.740 ± 0.142

Table 2: The Instantaneous Coordination (IC) metric measured across teams trained in various cooperative and competitive settings. All cooperative baselines are in shades of blue and competitive teams are in red. RS: reward sharing, DO: dialog overhearing, TS: task sharing. IC scores are reported separately for winning and losing teams, with the best scores for winning teams in bold. Winning teams in competitive settings perform more informative communication, as measured by a higher IC score.

We report these results in Table 2 and focus on the IC between Q-bot and A-bot from the winning team. We observe that IC is highest for the fully competitive setting Comp, RS+DO+TS. Furthermore, by comparing Table 1 with Table 2, we observe a strong correlation between winning teams that signal clearly with high IC scores and winning teams that perform best on test environments. Comp, DO+TS, Comp, RS+DO, and Comp, RS+DO+TS are the training settings that lead to such winning teams. These observations support our hypothesis that having external

Type                    Method             |VQ|   |VA|   Winning Team Test Acc (%)
Cooperative baselines   Coop, base [25]    3      4      45.6 ± 18.9
                                           16     16     26.4 ± 5.1
                                           64     64     22.6 ± 4.6
Competitive methods     Comp, RS+DO+TS     3      4      77.2 ± 16.5
                                           16     16     50.8 ± 26.1
                                           64     64     47.5 ± 25.2

Table 3: Effect of vocabulary size on both cooperative and competitive training. Similar to Kottur et al. [25], we found that test performance is hurt at large vocabulary sizes, even under competitive training. For the same fixed vocabulary size, we also see consistent improvements using competitive training as compared to the cooperative baselines, suggesting the utility of our approach across different hyperparameter settings such as vocabulary sizes.

pressure from similar agents encourages the team's Q-bot and A-bot to coordinate better through emergent language, thereby leading to superior task performance, which is another benefit of our proposed competitive training method.

Finally, we find that the learned communication protocol is compositional by the same measure as Kottur et al. [25]. For example, Q-bot assigns Y to represent the tasks (shape, style) and (style, shape), and X for (style, color). The small vocabulary size and memoryless A-bot mean that the messages must compose across entities to generalize at test time to unseen instances. We further note from the convergence graphs in Figure 3 that compositionality in emergent language is achieved faster in competitive settings as compared to the fully cooperative counterparts.

We also experimented with larger vocabulary sizes of $|V_Q| = |V_A| = 16$ and $64$. We report these results in Table 3. Similar to Kottur et al. [25], we found that test performance is hurt at large vocabulary sizes,


Type                    Method                   SC (↑)                           H (↓)
                                                 Winning Team    Losing Team      Winning Team    Losing Team
Cooperative baselines   Coop, base [25]          0.631 ± 0.114   -                1.186 ± 0.124   -
                        Coop, rewards [18, 19]   0.640 ± 0.117   -                1.060 ± 0.023   -
                        Coop, params             0.676 ± 0.132   -                1.138 ± 0.143   -
                        Coop, double             0.683 ± 0.150   0.675 ± 0.133    1.196 ± 0.128   1.210 ± 0.131
Competitive methods     Comp, TS                 0.610 ± 0.123   0.618 ± 0.143    1.089 ± 0.119   1.107 ± 0.154
                        Comp, DO                 0.635 ± 0.149   0.647 ± 0.185    1.212 ± 0.123   1.210 ± 0.135
                        Comp, DO+TS              0.635 ± 0.146   0.646 ± 0.188    1.215 ± 0.133   1.211 ± 0.124
                        Comp, RS                 0.677 ± 0.149   0.658 ± 0.118    1.205 ± 0.137   1.207 ± 0.137
                        Comp, RS+TS              0.622 ± 0.157   0.596 ± 0.143    1.197 ± 0.143   1.205 ± 0.130
                        Comp, RS+DO              0.701 ± 0.114   0.670 ± 0.073    1.174 ± 0.151   1.217 ± 0.123
                        Comp, RS+DO+TS           0.679 ± 0.219   0.615 ± 0.157    1.197 ± 0.164   1.207 ± 0.150

Table 4: Speaker Consistency (SC) and Entropy (H) metrics measured across teams in all settings. All cooperative baselines are in shades of blue and competitive teams are in red. RS: reward sharing, DO: dialog overhearing, TS: task sharing. SC and H scores are reported separately for winning and losing teams, with the best scores for winning teams in bold.

even under competitive training. Therefore, we set the vocabulary sizes to $|V_Q| = 3$ and $|V_A| = 4$ respectively, following Kottur et al. [25]. With these limited vocabulary sizes, we observed good generalization of the language to new object instances. When using large vocabulary sizes, the agents tend to use every vocabulary symbol to memorize pairs of concepts (e.g., symbol a represents a green circle and symbol b represents a green square) instead of representing compositional concepts (e.g., symbol a represents the color green and symbol b represents the shape square). The compositional vocabulary learned in the latter case is required for generalization to new pairs of concepts at test time.

From Table 3, it is interesting to note that for the same fixed vocabulary size, we also see consistent improvements using competitive training as compared to the cooperative baselines. This further suggests the utility of our approach across different hyperparameter settings, and that competitive training approaches are more robust to hyperparameter choices such as vocabulary size.

5.4 Speaker Consistency and Entropy

Here we report results on two more metrics proposed to measure how informative a language is with respect to the agent's actions [31]: Speaker Consistency (SC), which measures the mutual information between an agent's message and its own future action, $\mathrm{MI}(q_t(j), \hat{w}^G_i(j))$, and Entropy (H), which measures the entropy of an agent's sequence of outgoing messages. We show these experimental results in Table 4.

In general, competitive teams display a higher speaker consistency score, again showing strong correlation with the best performing teams Comp, RS+DO and Comp, RS+DO+TS. This again implies that the better performing teams trained via competition demonstrate more signaling using their vocabulary. As for entropy, this metric is hard to interpret [31]. It is traditionally thought that lower entropy in languages represents more compositionality and efficiency in the way meaning is encoded in language. On the other hand, it is also possible for an agent to always send the same symbol, which implies the lowest possible entropy, but such messages are unlikely to be informative. The results show that the entropies across all settings are roughly similar, which we believe implies that the agents are learning communication protocols that are equally complex and rich in nature. However, the improved speaker consistency and instantaneous coordination scores imply that the communication protocols learnt via competition are more informative to the other agents.

6 CONCLUSION

In this paper, we revisited emergent language in multi-agent teams through the lens of competition for performance: scenarios where competition acts as an additional external pressure for improvement. We start from Task & Talk, a previously proposed referential game between two cooperative agents, as our testbed and extend it into Task, Talk & Compete, a game involving two competitive teams each consisting of cooperative agents. Using our newly proposed Task, Talk & Compete benchmark, we showed that competition from an external team acts as a social influence that encourages multi-agent populations to develop more informative communication protocols for improved generalization and faster convergence. Our controlled experiments also show that these results are not due to confounding factors such as more parameters, more agents, or reward shaping. This line of work constitutes a step towards studying the emergence of language from agents that are both cooperative and competitive at different levels. Future work can explore the effect of competitive multi-agent training in various real-world settings as well as the emergence of natural language and multimodal dialog.

7 ACKNOWLEDGEMENTS

This material is based upon work partially supported by the National Science Foundation (Awards #1750439, #1722822), the National Institutes of Health, DARPA SAGAMORE HR00111990016, AFRL CogDeCON FA875018C0014, and NSF IIS1763562. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the author(s) and do not necessarily reflect the views of the National Science Foundation, National Institutes of Health, DARPA, and AFRL, and no official endorsement should be inferred. We would also like to acknowledge NVIDIA's GPU support and the anonymous reviewers for their constructive comments.


REFERENCES

[1] Ben Alderson-Day and Charles Fernyhough. 2015. Inner Speech: Development, Cognitive Functions, Phenomenology, and Neurobiology. In Psychological Bulletin, American Psychological Association.
[2] Jacob Andreas. 2019. Measuring Compositionality in Representation Learning. In International Conference on Learning Representations. https://openreview.net/forum?id=HJz05o0qK7
[3] Bernard J. Baars. 2017. The Global Workspace Theory of Consciousness. John Wiley & Sons, Ltd, Chapter 16, 227–242. https://doi.org/10.1002/9781119132363.ch16
[4] Michael Bacharach. 1991. Zero-sum Games. Palgrave Macmillan UK, London, 727–731. https://doi.org/10.1007/978-1-349-21315-3_100
[5] Alan D. Baddeley and Graham Hitch. 1974. Working Memory. Psychology of Learning and Motivation, Vol. 8. Academic Press, 47–89. https://doi.org/10.1016/S0079-7421(08)60452-1
[6] Helmy H. Baligh and Leon E. Richartz. 1967. Variable-Sum Game Models of Marketing Problems. Journal of Marketing Research 4, 2 (1967), 173–183. https://doi.org/10.1177/002224376700400209
[7] Daphna Bassok. 2012. Competition or Collaboration?: Head Start Enrollment During the Rapid Expansion of State Pre-kindergarten. Educational Policy 26, 1 (2012), 96–116. https://doi.org/10.1177/0895904811428973
[8] Ben Bogin, Mor Geva, and Jonathan Berant. 2018. Emergence of Communication in an Interactive World with Consistent Speakers. CoRR abs/1809.00549 (2018). arXiv:1809.00549 http://arxiv.org/abs/1809.00549
[9] Antoine Bordes and Jason Weston. 2016. Learning End-to-End Goal-Oriented Dialog. CoRR abs/1605.07683 (2016). arXiv:1605.07683 http://arxiv.org/abs/1605.07683
[10] Diane Bouchacourt and Marco Baroni. 2018. How agents see things: On visual representations in an emergent language game. CoRR abs/1808.10696 (2018). arXiv:1808.10696 http://arxiv.org/abs/1808.10696
[11] Noam Chomsky. 1957. Syntactic Structures. Mouton and Co., The Hague.
[12] F. B. Christiansen and V. Loeschcke. 1990. Evolution and Competition. Springer Berlin Heidelberg, Berlin, Heidelberg, 367–394. https://doi.org/10.1007/978-3-642-74474-7_13
[13] Michael Cogswell, Jiasen Lu, Stefan Lee, Devi Parikh, and Dhruv Batra. 2019. Emergence of Compositional Language with Deep Generational Transmission. CoRR abs/1904.09067 (2019). arXiv:1904.09067 http://arxiv.org/abs/1904.09067
[14] Abhishek Das, Samyak Datta, Georgia Gkioxari, Stefan Lee, Devi Parikh, and Dhruv Batra. 2018. Embodied Question Answering. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
[15] Brynne C. DiMenichi and Elizabeth Tricomi. 2015. The power of competition: Effects of social motivation on attention, sustained physical effort, and learning. Frontiers in Psychology (2015).
[16] Xavier Glorot, Antoine Bordes, and Yoshua Bengio. 2011. Domain Adaptation for Large-scale Sentiment Classification: A Deep Learning Approach. In ICML.
[17] S. Grossberg. 2013. Behavioral economics and neuroeconomics: Cooperation, competition, preference, and decision making. In The 2013 International Joint Conference on Neural Networks (IJCNN). 1–5. https://doi.org/10.1109/IJCNN.2013.6706709
[18] Marek Grześ. 2017. Reward Shaping in Episodic Reinforcement Learning. In AAMAS.
[19] Marek Grześ and Daniel Kudenko. 2008. Multigrid Reinforcement Learning with Reward Shaping. In ICANN.
[20] Sepp Hochreiter and Jürgen Schmidhuber. 1997. Long short-term memory. Neural Computation 9, 8 (1997), 1735–1780.
[21] L. C. Jain and L. R. Medsker. 1999. Recurrent Neural Networks: Design and Applications (1st ed.). CRC Press, Inc., Boca Raton, FL, USA.
[22] Natasha Jaques, Angeliki Lazaridou, Edward Hughes, Çaglar Gülçehre, Pedro A. Ortega, DJ Strouse, Joel Z. Leibo, and Nando de Freitas. 2019. Social Influence as Intrinsic Motivation for Multi-Agent Deep Reinforcement Learning. In Proceedings of the 36th International Conference on Machine Learning, ICML 2019, 9-15 June 2019, Long Beach, California, USA (Proceedings of Machine Learning Research), Kamalika Chaudhuri and Ruslan Salakhutdinov (Eds.), Vol. 97. PMLR, 3040–3049. http://proceedings.mlr.press/v97/jaques19a.html
[23] Simon Kirby, Hannah Cornish, and Kenny Smith. 2008. Cumulative cultural evolution in the laboratory: An experimental approach to the origins of structure in human language. Proceedings of the National Academy of Sciences 105, 31 (2008), 10681–10686. https://doi.org/10.1073/pnas.0707835105
[24] Simon Kirby, Monica Tamariz, Hannah Cornish, and Kenny Smith. 2015. Compression and communication in the cultural evolution of linguistic structure. Cognition 141 (2015), 87–102.
[25] Satwik Kottur, José Moura, Stefan Lee, and Dhruv Batra. 2017. Natural Language Does Not Emerge ‘Naturally’ in Multi-Agent Dialog. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing. Association for Computational Linguistics, Copenhagen, Denmark, 2962–2967. https://doi.org/10.18653/v1/D17-1321
[26] Angeliki Lazaridou, Karl Moritz Hermann, Karl Tuyls, and Stephen Clark. 2018. Emergence of Linguistic Communication from Referential Games with Symbolic and Pixel Input. In International Conference on Learning Representations. https://openreview.net/forum?id=HJGv1Z-AW
[27] Angeliki Lazaridou, Alexander Peysakhovich, and Marco Baroni. 2017. Multi-Agent Cooperation and the Emergence of (Natural) Language. In 5th International Conference on Learning Representations, ICLR 2017, Toulon, France, April 24-26, 2017, Conference Track Proceedings. OpenReview.net. https://openreview.net/forum?id=Hk8N3Sclg
[28] E. G. Leigh Jr. 2010. The evolution of mutualism. Journal of Evolutionary Biology 23, 12 (2010), 2507–2528. https://doi.org/10.1111/j.1420-9101.2010.02114.x
[29] Paul Pu Liang, Yao Chong Lim, Yao-Hung Hubert Tsai, Ruslan Salakhutdinov, and Louis-Philippe Morency. 2019. Strong and Simple Baselines for Multimodal Utterance Embeddings. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers). Association for Computational Linguistics, Minneapolis, Minnesota, 2599–2609. https://doi.org/10.18653/v1/N19-1267
[30] Paul Pu Liang, Ziyin Liu, AmirAli Bagher Zadeh, and Louis-Philippe Morency. 2018. Multimodal Language Analysis with Recurrent Multistage Fusion. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing. Association for Computational Linguistics, Brussels, Belgium, 150–161. https://doi.org/10.18653/v1/D18-1014
[31] Ryan Lowe, Jakob Foerster, Y-Lan Boureau, Joelle Pineau, and Yann Dauphin. 2019. On the Pitfalls of Measuring Emergent Communication. In Proceedings of the 18th International Conference on Autonomous Agents and MultiAgent Systems (AAMAS '19). International Foundation for Autonomous Agents and Multiagent Systems, Richland, SC, 693–701.
[32] Ryan Lowe, Yi Wu, Aviv Tamar, Jean Harb, Pieter Abbeel, and Igor Mordatch. 2017. Multi-Agent Actor-Critic for Mixed Cooperative-Competitive Environments. In Proceedings of the 31st International Conference on Neural Information Processing Systems (NIPS '17). Curran Associates Inc., Red Hook, NY, USA, 6382–6393.
[33] Dean Mobbs, Demis Hassabis, Rongjun Yu, Carlton Chu, Matthew Rushworth, Erie Boorman, and Tim Dalgleish. 2013. Foraging under Competition: The Neural Basis of Input-Matching in Humans. Journal of Neuroscience 33, 23 (2013), 9866–9872. https://doi.org/10.1523/JNEUROSCI.2238-12.2013
[34] Martin A. Nowak and David C. Krakauer. 1999. The evolution of language. Proceedings of the National Academy of Sciences 96, 14 (1999), 8028–8033. https://doi.org/10.1073/pnas.96.14.8028
[35] Martin A. Nowak, Joshua B. Plotkin, and Vincent A. A. Jansen. 2000. The evolution of syntactic communication. Nature 404 (2000), 495–498.
[36] Minna Pekkonen, Tarmo Ketola, and Jouni T. Laakso. 2013. Resource availability and competition shape the evolution of survival and growth ability in a bacterial community. PLoS One 8, 9 (2013).
[37] Emmy A. Pepitone. 1985. Children in Cooperation and Competition. Springer US, Boston, MA, 17–65. https://doi.org/10.1007/978-1-4899-3650-9_2
[38] Hai Pham, Paul Pu Liang, Thomas Manzini, Louis-Philippe Morency, and Barnabás Póczos. 2019. Found in Translation: Learning Robust Joint Representations by Cyclic Translations between Modalities. In The Thirty-Third AAAI Conference on Artificial Intelligence, AAAI 2019. 6892–6899. https://doi.org/10.1609/aaai.v33i01.33016892
[39] Joaquin Quionero-Candela, Masashi Sugiyama, Anton Schwaighofer, and Neil D. Lawrence. 2009. Dataset Shift in Machine Learning. The MIT Press.
[40] James H. Read. 2012. Is power zero-sum or variable-sum? Old arguments and new beginnings. Journal of Political Power (2012). https://doi.org/10.1080/2158379X.2012.659865
[41] Kenny Smith, Simon Kirby, and Henry Brighton. 2003. Iterated Learning: A Framework for the Emergence of Language. Artif. Life 9, 4 (Sept. 2003), 371–386. https://doi.org/10.1162/106454603322694825
[42] Monica Tamariz and Simon Kirby. 2015. The Cultural Evolution of Language. Current Opinion in Psychology 8 (09 2015). https://doi.org/10.1016/j.copsyc.2015.09.003
[43] Paul Vogt. 2005. The emergence of compositional structures in perceptually grounded language games. Artif. Intell. 167 (2005), 206–242.
[44] Ronald J. Williams. 1992. Simple Statistical Gradient-Following Algorithms for Connectionist Reinforcement Learning. Mach. Learn. 8, 3-4 (May 1992), 229–256. https://doi.org/10.1007/BF00992696


APPENDIX

A MODELING THE AGENTS

The Task, Talk & Compete game begins with two tasks $G(1)$ and $G(2)$ presented to Q-bot(1) and Q-bot(2) respectively, and two target instances $I(1)$ and $I(2)$ presented to A-bot(1) and A-bot(2) respectively. Within a team, we largely follow the setting by Kottur et al. [25]. A team consists of agents Q-bot and A-bot cooperating in a partially observable world to solve task $G$ given instance $I$. We use lower-case characters (e.g., $s^Q_t$) to denote token symbols and upper-case characters (e.g., $S^Q_t$) to denote the corresponding representations. We use subscripts to index the rounds and a parenthesized index to denote which team the pair of agents belongs to (e.g., $s^Q_t(1)$). We drop the team index when it is clear from the context (i.e., same team).

Base States and Actions: Each agent observes its specific input (task $G$ for Q-bot and instance $I$ for A-bot) and the output of the other agent in the same team. At the beginning of round $t$, Q-bot observes state $s^Q_t = [G, q_1, a_1, \ldots, q_{t-1}, a_{t-1}]$ and utters some token $q_t \in V_Q$. A-bot observes the history and this new utterance as state $s^A_t = [I, q_1, a_1, \ldots, q_{t-1}, a_{t-1}, q_t]$ and utters $a_t \in V_A$. Note that the two teams share vocabularies (i.e., $V_Q$ is shared by Q-bot(1) and Q-bot(2)). At the final round, Q-bot predicts a pair of attribute values $\hat{w}^G = (\hat{w}^G_1, \hat{w}^G_2)$ to solve the task. Q-bot and A-bot are modeled as stochastic policies $\pi_Q(q_t \mid s^Q_t; \theta_Q)$ and $\pi_A(a_t \mid s^A_t; \theta_A)$ implemented as recurrent networks [20, 21]. Q-bot is modeled with three modules: speaking, listening, and prediction. Given task $G$, Q-bot stores an initial state $S^Q_{t-1}$ from which it conditionally generates output utterances $q_t \in V_Q$. $S^Q_{t-1}$ is updated using answers $a_t$ from A-bot and is used to make a prediction $\hat{w}^G$ in the final round. A-bot is modeled with two modules: speaking and listening. A-bot encodes instance $I$ into its initial state $S^A_t$ from which it conditionally generates output utterances $a_t \in V_A$. $S^A_t$ is updated using questions $q_t$ from Q-bot. Q-bot and A-bot receive an identical base reward of $+R$ if Q-bot's prediction $\hat{w}^G$ matches the ground truth $w^G$ and a negative reward of $-10R$ otherwise. $R$ is a hyperparameter which affects the rate of convergence. To train these agents, we update policy parameters $\theta_Q$ and $\theta_A$ using the REINFORCE policy gradient algorithm [44].
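For concreteness, a highly simplified PyTorch sketch of the recurrent speak/listen/predict structure described above (our own illustration; the layer sizes, module names, and exact interfaces are assumptions and differ from the released implementation):

import torch
import torch.nn as nn

class QBot(nn.Module):
    # Toy Q-bot with listen (LSTM state update), speak (token distribution over V_Q),
    # and predict (attribute-pair guess) heads, mirroring the three modules above.
    def __init__(self, vocab_in: int, vocab_out: int, num_attr_values: int, hidden: int = 50):
        super().__init__()
        self.embed = nn.Embedding(vocab_in, hidden)      # embeds heard tokens (and task symbols)
        self.rnn = nn.LSTMCell(hidden, hidden)           # recurrent state S_t^Q
        self.speak_head = nn.Linear(hidden, vocab_out)   # distribution over V_Q
        self.predict_head = nn.Linear(hidden, 2 * num_attr_values)  # (w_1^G, w_2^G) logits

    def listen(self, token: torch.Tensor, state=None):
        return self.rnn(self.embed(token), state)        # update S_t^Q with the heard token

    def speak(self, state):
        dist = torch.distributions.Categorical(logits=self.speak_head(state[0]))
        q_t = dist.sample()                              # q_t ~ pi_Q(. | s_t^Q)
        return q_t, dist.log_prob(q_t)                   # log-prob feeds the REINFORCE update

    def predict(self, state):
        return self.predict_head(state[0]).chunk(2, dim=-1)  # logits for the two attributes

A-bot can be sketched analogously with only the speak and listen heads; one way to realize the memoryless variant is to reset its recurrent state after every round.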

B IMPLEMENTATION DETAILS

Table 5 lists the hyperparameters used. We train all models using the same set of hyperparameters and only modify the rewards and the information being shared among agents. All reported results are averaged over 10 runs with randomly initialized seeds.

In order to report results on the Instantaneous Coordination (IC) metric, we compute the mutual information across the following: within team 1, the mutual information between A-bot's messages and Q-bot's first guess, and between A-bot's messages and Q-bot's second guess. We average these numbers to obtain an aggregated IC score for team 1 before repeating the procedure for team 2.

Parameter                   Value
attribute embedding size    20
instance embedding size     60
R                           100
|VQ|                        3
|VA|                        4
ρ                           0.5
batch size                  1000
LSTM dimension              50
episodes                    1000
max epochs                  50000
learning rate               0.01
gradient clipping           [-5.0, +5.0]
optimizer                   Adam
num repeats                 10

Table 5: Table of hyperparameters.