
TiKick: Towards Playing Multi-agent Football Full Games from Single-agent Demonstrations

Shiyu Huang* (Tsinghua University, Beijing, China) [email protected]

Wenze Chen* (Tsinghua University, Beijing, China) [email protected]

Longfei Zhang (National University of Defense Technology, Changsha, China) [email protected]

Shizhen Xu (RealAI, Beijing, China) [email protected]

Ziyang Li (Tencent AI Lab, Shenzhen, China) [email protected]

Fengming Zhu (Tencent AI Lab, Shenzhen, China) [email protected]

Deheng Ye (Tencent AI Lab, Shenzhen, China) [email protected]

Ting Chen (Tsinghua University, Beijing, China) [email protected]

Jun Zhu (Tsinghua University, Beijing, China) [email protected]

Abstract

Deep reinforcement learning (DRL) has achieved super-human performance on complex video games (e.g., StarCraft II and Dota 2). However, current DRL systems still suffer from challenges of multi-agent coordination, sparse rewards, stochastic environments, etc. In seeking to address these challenges, we employ a football video game, Google Research Football (GRF), as our testbed and develop an end-to-end learning-based AI system (denoted as TiKick) to complete this challenging task. In this work, we first generated a large replay dataset from the self-play of single-agent experts, which were obtained from league training. We then developed a distributed learning system and new offline algorithms to learn a powerful multi-agent AI from the fixed single-agent dataset. To the best of our knowledge, TiKick is the first learning-based AI system that can take over the multi-agent Google Research Football full game, while previous work could either control a single agent or experiment on toy academic scenarios. Extensive experiments further show that our pre-trained model can accelerate the training process of a modern multi-agent algorithm and that our method achieves state-of-the-art performance on various academic scenarios.

* Equal contribution
2 Code can be found at https://github.com/TARTRL/TiKick.
3 Videos are available at https://sites.google.com/view/tikick.

Preprint. Under review.

arXiv:2110.04507v5 [cs.AI] 30 Nov 2021

1 Introduction

Deep reinforcement learning (DRL) has shown great success in many video games, including the Atari games (Mnih et al., 2013), StarCraft II (Vinyals et al., 2019), Dota 2 (Berner et al., 2019), etc. However, current DRL systems still suffer from challenges of multi-agent coordination (Rashid et al., 2018; Mahajan et al., 2019; Yu et al., 2021), sparse rewards (Taiga et al., 2019; Zhang et al., 2020), stochastic environments (Kurach et al., 2019; Team et al., 2021), etc. In seeking to address these challenges, we employ a football video game, Google Research Football (GRF) (Kurach et al., 2019), as our testbed.

Even though a lot of work has been done recently (Kurach et al., 2019; Li et al., 2021; Liu et al., 2021b), many problems remain in building agents for the GRF: (1) Multiple players: In the GRF, there are both cooperative and competitive players. For cooperative players, the joint action space is huge, so it is hard to build a single agent that controls all the players. Moreover, competitive players mean that the opponents are not fixed, so the agents must adapt to various opponents. (2) Sparse rewards: The goal of the football game is to maximize the score, and a goal can only be obtained after a long sequence of near-perfect decisions; it is almost impossible to receive a positive reward when starting from random agents. (3) Stochastic environments: The GRF introduces stochasticity into the environment, which means the outcome of taking a specific action is not deterministic. This can improve the robustness of the trained agents but also increases the training difficulty.

To address the aforementioned issues, we develop an end-to-end learning-based AI system (denoted as TiKick) to complete this challenging task. In this work, we first generated a large replay dataset from the self-play of single-agent experts, which were obtained from league training. We then developed a distributed learning system and new offline algorithms to learn a powerful multi-agent AI from the fixed single-agent dataset. To the best of our knowledge, TiKick is the first learning-based AI system that can take over the multi-agent Google Research Football full game, while previous work could either control a single agent or experiment on toy academic scenarios. Extensive experiments further show that our pre-trained model can accelerate the training process of a modern multi-agent algorithm and that our method achieves state-of-the-art performance on various academic scenarios.

2 Background

In this section, we first introduce basic notations in multi-agent reinforcement learning (MARL) and then briefly review imitation learning and offline reinforcement learning. We also present the efforts made in the football game AI research field.

2.1 Multi-Agent Reinforcement Learning

Multi-agent reinforcement learning can be formalized as a Dec-POMDP (Oliehoek and Amato, 2016). Formally, a Dec-POMDP is represented as a tuple $(A, S, U, T, r, O, G, \gamma)$, where $S$, $U$, $O$ and $\gamma$ are the state space, action space, observation space and discount factor, respectively. $A \equiv \{1, ..., n\}$ is the set of $n$ agents. At each time step $t$, each agent $i \in A$ chooses an action $u^i \in U$ to form a joint action $\mathbf{u} \in \mathbf{U} \equiv U^n$. The immediate reward function $r(s, \mathbf{u})$ gives the reward received when taking joint action $\mathbf{u}$ in state $s$ and is shared by all the agents. $T(s, \mathbf{u}, s') : S \times \mathbf{U} \times S \to [0, 1]$ is the state-transition function, which defines the probability of the succeeding state $s'$ after taking joint action $\mathbf{u}$ in state $s$. In a Dec-POMDP, each agent only has access to a partial observation $o \in O$ according to the observation function $G(s, i) : S \times A \to O$. Each agent uses a policy $\pi^i(u^i \mid o^i_{1:t})$ to produce its action $u^i$ from its local observation history $o^i_{1:t}$. $\mathbf{u}^{-i}$ denotes the actions of all agents other than $i$, and a similar convention holds for the policies $\pi^{-i}$. The goal is to learn a joint policy $\pi$ that maximizes the discounted accumulated reward $\mathbb{E}_{s_t, \mathbf{u}_t}\left[\sum_t \gamma^t r(s_t, \mathbf{u}_t)\right]$.
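As a concrete illustration of this objective, the short sketch below computes the discounted team return for one episode of shared rewards; it is our own minimal example, not code from the TiKick implementation.

```python
# Minimal sketch: discounted team return sum_t gamma^t * r_t for one episode of
# shared rewards, as in the Dec-POMDP objective above. Illustrative only.
from typing import Sequence

def discounted_return(rewards: Sequence[float], gamma: float = 0.99) -> float:
    ret = 0.0
    for t, r in enumerate(rewards):
        ret += (gamma ** t) * r
    return ret

# A sparse-reward episode: a single goal scored at the final step.
print(discounted_return([0.0] * 99 + [1.0], gamma=0.99))
```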

2.2 Imitation Learning and Offline Reinforcement Learning

Training agents from pre-collected datasets has long been studied (Ho and Ermon, 2016; Kang, Jie, and Feng, 2018; Oh et al., 2018; Peng et al., 2019; Fu et al., 2020; Qin et al., 2021). Peng et al. (2019) used standard supervised learning methods to train reinforcement learning agents, aiming to maximize the expected improvement. Besides, Oh et al. (2018) and Vinyals et al. (2019) tried to develop


Figure 1: Google Research Football screenshots. (a) Multi-agent Full Game; (b) Run to Score with Keeper; (c) Run, Pass and Shoot with Keeper; (d) Corner; (e) 3 vs 1 with Keeper; (f) Hard Counter-attack. (a) shows the multi-agent full game, which requires all ten players to be controlled. (b)-(f) show the five GRF academic scenarios, which require fewer players to be controlled and are easier than the full game.

new algorithms to learn policies from past good experiences. Moreover, Ho and Ermon (2016) proposed a generative adversarial training method (i.e., GAIL) to train a policy to mimic the behavior policy, and Song et al. (2018) extended GAIL to multi-agent scenarios. More recently, offline reinforcement learning has attracted attention because it intends to learn a better policy than the behavior policies that generated the dataset (Fu et al., 2020; Gulcehre et al., 2020; Qin et al., 2021). Although much work has been proposed, such as model-based methods (Kidambi et al., 2020; Yu et al., 2020), policy constraint methods (Fujimoto, Meger, and Precup, 2019; Wu, Tucker, and Nachum, 2019), value regularization methods (Kumar et al., 2020) and uncertainty-based methods (Agarwal, Schuurmans, and Norouzi, 2020), there is no work on learning multiple cooperative agents from an incomplete (or partially observable) dataset. Our work tries to build a powerful AI system for the multi-agent football full game from single-agent demonstrations.

2.3 Football Games and Football Game AIs

Many researchers have developed agents for other football games (e.g., the RoboCup Soccer Simulator (Kitano et al., 1997) and the DeepMind MuJoCo Multi-Agent Soccer Environment (Liu et al., 2019, 2021a)). Different from Google Research Football, these environments focus more on the low-level control of a physics simulation of robots, while GRF focuses on high-level actions. In the Google Research Football Competition 2020 (Google, 2020), an RL-based agent named WeKick (Ziyang Li, 2020) took first place. WeKick utilized imitation learning, the multi-head value trick, and distributed league training to achieve the top performance on the GRF. However, it is a single-agent AI that cannot be extended to multi-agent control. Liu et al. (2021b) also proposed a new algorithm for the GRF with single-agent control. Instead, our work tries to build a powerful game AI for multi-agent control on the GRF. More recently, Li et al. (2021) proposed a value-based multi-agent algorithm (i.e., CDS) for the GRF with mutual information maximization between agents. However, they only conducted experiments on toy academic scenarios. Compared with CDS, our AI system achieves much better performance and lower sample complexity on these academic scenarios without the mutual information maximization process, and our method can even play the GRF full game.

3 Methodology

In this section, we introduce the TiKick algorithm in detail, including our observation and model design and our proposed offline reinforcement learning method.


3.1 Observation and Model Design

Designing the observation is the first step in building a DRL model. The Google Research Football simulator originally provides three types of observations. The first type is the pixel-level representation, which consists of a 1280 × 720 RGB image corresponding to the rendered screen. The second type is the Super Mini Map (SMM), which consists of four 72 × 96 matrices encoding information about the current situation. The third type is the vector representation, which provides a compact encoding and consists of a 115-dimensional vector. The pixel-level and SMM representations need more burdensome deep models to extract hidden features, which results in lower training speed and higher memory consumption. Thus, we use the vector representation as the input of our deep model. Furthermore, we extend the standard 115-dimensional vector with auxiliary features, such as relative poses among teammates and opponents, offside flags to mark potentially offside teammates, etc. Finally, we obtain a 268-dimensional vector as the input. More details about the observation can be found in the Appendix.
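As an illustration of how such auxiliary features might be assembled, the sketch below concatenates relative poses and distances onto the standard 115-dimensional vector and pads to a fixed width. The key names follow the GRF raw-observation format, while the exact layout and ordering of the 268-dimensional TiKick input (see Appendix A) are not reproduced here.

```python
# Illustrative sketch only: concatenating auxiliary features onto the base 115-d
# vector to form a fixed-size per-player input. The padding size and feature
# ordering are our assumptions, not the paper's exact 268-d layout.
import numpy as np

def build_player_observation(raw_obs: dict, player_id: int, base_vec: np.ndarray) -> np.ndarray:
    me = raw_obs["left_team"][player_id]                     # (x, y) of the controlled player
    rel_teammates = raw_obs["left_team"] - me                # relative poses to teammates
    rel_opponents = raw_obs["right_team"] - me               # relative poses to opponents
    dist_opponents = np.linalg.norm(rel_opponents, axis=-1)  # distances to opponents
    ball_dist = np.linalg.norm(raw_obs["ball"][:2] - me)     # distance to the ball
    feats = np.concatenate([
        base_vec,                        # the standard 115-d GRF vector
        rel_teammates.flatten(),
        rel_opponents.flatten(),
        dist_opponents,
        [ball_dist],
    ]).astype(np.float32)
    # Zero-pad to a fixed width so every player shares the same input size.
    out = np.zeros(268, dtype=np.float32)
    out[:min(len(feats), 268)] = feats[:268]
    return out
```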

To extract hidden features from the raw input, we build a deep model with four fully connected layers and GRU cells (Cho et al., 2014). All the hidden layers are followed by a ReLU, except for the last output layer, and all the hidden sizes are set to 256. The learning rate is set to 1e-4 and is fixed during training. Network parameters are initialized with orthogonal matrices (Saxe, McClelland, and Ganguli, 2013), and the Adam optimizer (Kingma and Ba, 2014) is used for the parameter updates. More details about the network structure and hyper-parameters can be found in the Appendix.
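A minimal PyTorch sketch consistent with this description is given below: four linear layers in total (three hidden plus the output head), a GRU, ReLU activations, hidden size 256, orthogonal initialization, and Adam with a fixed 1e-4 learning rate. The exact layer arrangement in TiKick may differ.

```python
# Sketch of the policy backbone described above; layer arrangement is an assumption.
import torch
import torch.nn as nn

class PolicyNet(nn.Module):
    def __init__(self, obs_dim=268, act_dim=20, hidden=256):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(obs_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
        )
        self.gru = nn.GRU(hidden, hidden, batch_first=True)
        self.head = nn.Linear(hidden, act_dim)   # last layer: no activation
        for p in self.parameters():              # orthogonal initialization
            if p.dim() >= 2:
                nn.init.orthogonal_(p)

    def forward(self, obs_seq, h0=None):
        # obs_seq: (batch, time, obs_dim); h0: (1, batch, hidden)
        x = self.mlp(obs_seq)
        x, h = self.gru(x, h0)
        return self.head(x), h                   # action logits and new hidden state

model = PolicyNet()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)
```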

3.2 Offline RL Training with Single-agent Dataset

In this part, we will introduce how we collect the single-agent dataset and how to utilize this dataset to learn multi-agent control models.

3.2.1 Single-agent Dataset Collection

To collect an expert single-agent dataset, we first obtain a single-agent AI, denoted as WeKick, from self-play league training. WeKick took first place in the Google Research Football Competition 2020 (Google, 2020) and is currently the most powerful football AI in the world. We let WeKick play against itself and store all the battle data, including raw observations, actions, and rewards. During self-play, only the designated player on one side is controlled by WeKick; the designated player is not fixed and changes automatically according to the built-in strategy. Each round (or episode) lasts 3,000 steps. In total, we collected 21,947 episodes from self-play. This dataset is then used for training our multi-agent offline RL model and the other offline RL baselines.

3.2.2 Multi-agent Offline RL Training

The dataset is collected from single-agent control, so it is easy to train a single-agent model with behavior cloning. However, such a model cannot be directly applied to multi-agent control, and many problems arise. For example, we find that if we control all ten players on the court with the same trained model, all the players huddle together because every player tends to chase the ball. In single-agent play, we only need to control the player who is closest to the ball most of the time, and the designated player changes dynamically, which means the observation inputs for the model are switched between different players. For multi-agent control, however, we also need to control players who are far from the ball, and the observation inputs for each control model always come from a specific player. To handle these differences between single-agent and multi-agent play, we carefully designed the observations, the actions, and our learning algorithm.

As for the observations, we construct a 268-dimensional vector (as described in Section 3.1) from the raw observation for each player. As for the actions, we extend the original 19 discrete actions to 20 discrete actions by adding an extra built-in action. All the non-designated players are assigned the built-in action. When a player takes the built-in action, the player behaves like a built-in agent. Currently, such built-in agents are obtained from a rule-based tactic, and we will try to convert them to a learning-based controller in future work. To be clear, the built-in action is only used for the full game and is dropped for all the academic scenarios.
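The bookkeeping for this relabeling can be as simple as the following sketch; it is our own illustration, and the action id and player count are assumptions consistent with the text.

```python
# Sketch: converting a single-agent replay step into per-player training targets
# by assigning the extra "built-in" action id to every non-designated player.
import numpy as np

NUM_PLAYERS = 10          # controlled players in the full game (goalkeeper excluded)
BUILT_IN_ACTION = 19      # id of the added 20th action (assumed)

def relabel_step(designated_player: int, designated_action: int) -> np.ndarray:
    """Return a length-10 action vector: the expert action for the designated
    player and the built-in action for everyone else."""
    actions = np.full(NUM_PLAYERS, BUILT_IN_ACTION, dtype=np.int64)
    actions[designated_player] = designated_action
    return actions

# Example: the expert controlled player 4 and chose action 12 at this step.
print(relabel_step(designated_player=4, designated_action=12))
```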


The most direct way to build a multi-agent AI is to train behavior-cloning models with the pre-processed data. The observation-action pairs are represented as $\{(o^i_{1:t}, u^i_t)\}_{i=1}^{n}$, where $o^i_{1:t}$ is the observation history of agent $i$, $u^i_t$ is the action of agent $i$ at time step $t$, and $n$ is the number of agents ($n = 10$ in the GRF full game). Our goal is to train a parametric neural network policy $\pi_\theta(u^i \mid o^i_{1:t})$ to mimic the strategy implied in the data, where $\theta$ denotes the parameters of the policy network and all the agents share the same parameters. The policy is updated via behavior cloning with the log loss:

$$\mathcal{L}_{bc} = \mathbb{E}_{\tau\sim\mathcal{D}}\left[\frac{1}{T}\sum_{t=1}^{T}-\frac{1}{n}\sum_{i=1}^{n}\log\pi_\theta(u^i_t\mid o^i_{1:t})\right], \qquad (1)$$

where $\tau = \{o^i_t, u^i_t, r_t\}_{t=1:T}^{i=1:n}$ is the joint trajectory sampled from the dataset $\mathcal{D}$. In our experiments, we find that an agent trained directly with behavior cloning tends to choose only the built-in action. This is because there is a serious class imbalance problem in the dataset, i.e., the number of built-in actions is much larger than that of any other action. To relieve the class imbalance problem, we introduce a weighting factor into the log loss. Similar to the focal loss (Lin et al., 2017), we use an α-balanced variant of the log loss:

$$\mathcal{L}_{\alpha\text{-}balance} = \mathbb{E}_{\tau\sim\mathcal{D}}\left[\frac{1}{T}\sum_{t=1}^{T}-\frac{1}{n}\sum_{i=1}^{n}\alpha(u^i)\log\pi_\theta(u^i_t\mid o^i_{1:t})\right], \qquad (2)$$

where $\alpha(u^i)$ is the weighting factor function, conditioned on the action label $u^i$. Besides, there is one designated player among the ten players who should not use the built-in action, so we add an extra loss that minimizes the log probability of the built-in action for the designated player:

$$\mathcal{L}_{\text{min-built-in}} = \mathbb{E}_{\tau\sim\mathcal{D}}\left[\frac{1}{T}\sum_{t=1}^{T}\sum_{i=1}^{n}-\mathbb{I}(i,t)\log\left(1-\pi_\theta(u^i_{\text{built-in}}\mid o^i_{1:t})\right)\right], \qquad (3)$$

where $u^i_{\text{built-in}}$ denotes the built-in action and $\mathbb{I}(i, t)$ is an indicator function defined as:

$$\mathbb{I}(i,t) = \begin{cases} 0 & \text{if agent $i$ is a non-designated player at time step $t$,}\\ 1 & \text{if agent $i$ is the designated player at time step $t$.}\end{cases} \qquad (4)$$
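A PyTorch sketch of the α-balanced loss in Equation 2 and the built-in-action minimization loss in Equation 3 is shown below. Tensor shapes, the averaging convention, and the per-action weights α are illustrative assumptions rather than the exact TiKick implementation.

```python
# Sketch of Eq. (2) and Eq. (3); shapes and averaging are assumptions.
import torch
import torch.nn.functional as F

def alpha_balance_and_min_built_in(logits, actions, designated_mask, alpha, built_in_id=19):
    # logits: (B, T, n, A) action logits; actions: (B, T, n) long tensor of labels;
    # designated_mask: (B, T, n) float tensor, 1 for the designated player, else 0;
    # alpha: (A,) per-action weighting factors.
    log_probs = F.log_softmax(logits, dim=-1)
    chosen = log_probs.gather(-1, actions.unsqueeze(-1)).squeeze(-1)   # log pi(u^i_t | o^i_{1:t})
    weights = alpha[actions]                                           # alpha(u^i)
    l_alpha = -(weights * chosen).mean()                               # Eq. (2)
    p_built_in = log_probs[..., built_in_id].exp()
    l_min_built_in = -(designated_mask *
                       torch.log(1.0 - p_built_in + 1e-8)).mean()      # Eq. (3)
    return l_alpha, l_min_built_in
```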

Using past good experiences as supervision has shown great success in many control tasks (Oh et al., 2018; Vinyals et al., 2019; Zha et al., 2021). Similar to the ranking buffer trick proposed by Zha et al. (2021), we weight each trajectory in the dataset by its cumulative reward, so that trajectories with higher cumulative rewards receive higher weights and contribute more to the training loss:

$$\mathcal{L}_{buffer\_ranking} = \mathbb{E}_{\tau\sim\mathcal{D}}\left[\rho(\tau)\,\frac{1}{T}\sum_{t=1}^{T}-\frac{1}{n}\sum_{i=1}^{n}\alpha(u^i)\log\pi_\theta(u^i_t\mid o^i_{1:t})\right], \qquad (5)$$

where $\rho(\tau)$ is the trajectory weighting factor. The practical setting of the trajectory weighting factor can be found in the Appendix.

To drive policy improvement, we further adopt Advantage-Weighted Regression (AWR) (Peng et al., 2019). We first train a centralized value network $V^{\mathcal{D}}_\phi(s_{1:t})$ using the rewards from the dataset, and then compute the advantage with the cumulative reward (or return) $\mathcal{R}^{\mathcal{D}}$:

$$A(\mathbf{u}_t, s_{1:t}) = \mathcal{R}^{\mathcal{D}} - V^{\mathcal{D}}_\phi(s_{1:t}). \qquad (6)$$

Then the advantage weight is defined as:

$$w(\mathbf{u}_t, s_{1:t}) = \mathrm{clip}\!\left(\exp\!\left(\frac{1}{\beta}A(\mathbf{u}_t, s_{1:t})\right),\, w_{min},\, w_{max}\right), \qquad (7)$$

where $\beta > 0$ is a fixed temperature parameter and $\mathrm{clip}(\cdot)$ limits the weight to the range $[w_{min}, w_{max}]$. The clipping function helps relieve gradient explosion and gradient vanishing problems. The advantage weight is then applied to the training loss:

$$\mathcal{L}_{adv} = \mathbb{E}_{\tau\sim\mathcal{D}}\left[\rho(\tau)\,\frac{1}{T}\sum_{t=1}^{T}-\frac{w(\mathbf{u}_t, s_{1:t})}{n}\sum_{i=1}^{n}\alpha(u^i)\log\pi_\theta(u^i_t\mid o^i_{1:t})\right]. \qquad (8)$$


Figure 2: The overall structure of the distributed training framework (worker modules, a data server, and learner modules; the deep model consists of MLP layers, a GRU, and output heads for values and actions). The worker modules produce the training data. In offline reinforcement learning, a worker loads data from the disk and produces RNN hidden states. In deep reinforcement learning, a worker interacts with the environment and produces actions and RNN hidden states. All the data collected by the worker modules are sent to a data server via a gRPC stream. The learner modules then fetch data from the data server via a gRPC stream. After each training step, the weights are sent back to the worker modules through the data server.

In conclusion, the final training loss of our offline algorithm is the combination of the advantage-weighted loss and the built-in action minimization loss:

$$\mathcal{L}_{total} = \mathcal{L}_{adv} + \eta\,\mathcal{L}_{\text{min-built-in}}, \qquad (9)$$

where $\eta$ is a loss balancing coefficient (i.e., a fixed hyper-parameter), and the loss is optimized with the Adam optimizer. In the next section, we introduce how to train our offline RL algorithm in a practical distributed training framework.
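Before turning to the training infrastructure, the sketch below ties Equations 6 to 9 together in PyTorch-style code. The hyper-parameter values (β, w_min, w_max, η) are placeholders, not the settings used in the paper, and the tensor shapes are assumptions.

```python
# Sketch of Eqs. (6)-(9); value targets and hyper-parameter values are placeholders.
import torch

def advantage_weights(returns, values, beta=1.0, w_min=0.5, w_max=5.0):
    # returns: (B, T) cumulative rewards R^D; values: (B, T) centralized value
    # estimates V^D_phi(s_{1:t}).
    advantage = returns - values                                            # Eq. (6)
    return torch.clamp(torch.exp(advantage / beta), w_min, w_max).detach()  # Eq. (7)

def tikick_total_loss(per_agent_weighted_logp, w, rho, l_min_built_in, eta=1.0):
    # per_agent_weighted_logp: (B, T, n) = alpha(u^i) * log pi_theta(u^i_t | o^i_{1:t})
    # w: (B, T) advantage weights; rho: (B,) trajectory weights from Eq. (11).
    l_adv = -(rho[:, None] * w * per_agent_weighted_logp.mean(dim=-1)).mean()  # Eq. (8)
    return l_adv + eta * l_min_built_in                                        # Eq. (9)
```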

3.3 Distributed Training Framework

Training a powerful deep model requires large-scale computation. In this work, we build a unified distributed training framework for both offline reinforcement learning and deep reinforcement learning. There are two main types of modules in our framework: the worker modules and the learner modules. The worker modules produce the training data needed by the learner modules. In offline reinforcement learning, the main function of the worker module is loading data from the disk and producing RNN hidden states. In deep reinforcement learning, the main function of the worker module is interacting with environments (or simulators) to collect observations and rewards and producing actions and RNN hidden states. In our distributed framework, a single worker module can load data in parallel or interact with hundreds of environments. We can also launch multiple worker modules independently to further improve the throughput. All the data collected by the worker modules are sent to a data server via a gRPC (grpc, 2020) stream. The learner modules then fetch data from the data server via a gRPC stream and start to train with the pre-defined losses. In this work, we use the GRU unit to extract historical information. In the learner module, the update step of the GRU is set to 25, and the initial GRU hidden states are produced by the worker modules. After each training step, the weights are sent back to the worker modules through the data server. Figure 2 shows the overall structure of our distributed training framework; PyTorch (Paszke et al., 2019) is used for building the deep models and optimizing their parameters.
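For the offline setting, a worker's data preparation can be sketched as follows. The data layout is our assumption; the chunk length of 25 follows the text, everything else is illustrative.

```python
# Sketch: splitting a loaded trajectory into fixed-length chunks of 25 steps,
# pairing each chunk with the GRU hidden state at its start, before streaming it
# to the data server. Data layout is assumed.
import numpy as np

CHUNK_LEN = 25

def make_chunks(obs, actions, hidden_states):
    # obs: (T, n, 268); actions: (T, n); hidden_states: (T, n, 256) produced by the
    # worker's forward passes through the current network weights.
    chunks = []
    for start in range(0, len(obs) - CHUNK_LEN + 1, CHUNK_LEN):
        end = start + CHUNK_LEN
        chunks.append({
            "obs": obs[start:end],
            "actions": actions[start:end],
            "init_hidden": hidden_states[start],   # GRU state at the chunk boundary
        })
    return chunks
```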

4 Experiments

In this section, we will evaluate our algorithm on the GRF full game and also show how the pre-trained model can accelerate multi-agent reinforcement learning in academic scenarios.

4.1 Multi-Agent Google Research Football Full Game Evaluation

In the multi-agent GRF full game, as shown in Figure 1(a), we need to control all the players except the goalkeeper. To the best of our knowledge, TiKick is the first learning-based AI system that can take over the multi-agent GRF full game, while previous work could either control a single agent


                CQL         MABC        +α          +MinBuilt-in  +BR         +AW
Win Rate        0           0.834       0.890       0.926         0.930       0.944
Draw Rate       0.258       0.112       0.088       0.064         0.058       0.036
Failure Rate    0.742       0.054       0.022       0.010         0.012       0.020
Goal Diff       -1.496      2.140       2.732       3.056         2.816       3.096
TrueSkill       6.33±3.16   19.33±3.21  20.88±3.20  22.18±3.23    22.39±3.23  22.84±3.25

Table 1: Evaluation results of different algorithms on the multi-agent GRF full game. Our final algorithm (+AW) achieves the best performance, with the highest winning rate and the highest goal difference. The results of the TrueSkill evaluation are shown in the last row; mean scores with standard deviations are reported. Our final TiKick model achieves the highest TrueSkill score.

Figure 3: (a) The winning, draw, and failure rates and the goal differences of the different algorithms (MABC, +α, +MinBuilt-in, +BR, +AW). Our final algorithm (+AW) achieves the best performance, with the highest winning rate and the highest goal difference. (b) TrueSkill evaluation of the different algorithms. Our final TiKick model achieves the highest TrueSkill score.

or experiment on toy academic scenarios. To evaluate our proposed offline reinforcement learning algorithm, we construct a series of incremental baselines starting from the vanilla multi-agent behavior cloning (MABC) algorithm.

Baselines:

CQL: CQL (Kumar et al., 2020) is an offline reinforcement learning algorithm that learns conservative Q-values by adding penalties to the Q-function.

MABC: The multi-agent behavior cloning algorithm described in Section 3.2.2, which uses the naive supervised loss shown in Equation 1.

+ α-Balance (abbreviated as +α): Adds an α-balance weight to the MABC baseline to relieve the class imbalance problem, as shown in Equation 2.

+ Min Built-in (abbreviated as +MinBuilt-in): Adds the built-in action minimization loss to the previous baseline to force the designated player to choose a non-built-in action, as shown in Equation 3.

+ Buffer Ranking (abbreviated as +BR): Adds the buffer ranking trick to take advantage of past good experiences, as shown in Equation 5.

+ Advantage Weight (abbreviated as +AW): Adds the advantage-weighted loss to the previous baseline, yielding the final loss shown in Equation 9. This method serves as the final TiKick model.


Figure 4: Comparison of our method against the baseline algorithms (TiKick (ours), MAPPO, CDS, QMIX, MADDPG) on the GRF academic scenarios (Run to Score with Keeper; Run, Pass and Shoot with Keeper; Corner; 3 vs 1 with Keeper; Hard Counter-attack), reporting test win rate against environment steps (up to 10M). Results show that our method, TiKick, achieves the best performance and the lowest sample complexity on all the scenarios.

Figure 5: The average goal difference on all the academic scenarios for MAPPO (left) and TiKick (right) at 1M, 5M, and 15M steps. Results show that our method achieves the highest scores on all the scenarios and reaches the highest scores within just 1M steps on four of the five scenarios.

All the baselines share the same set of hyper-parameters and the same neural network backbone. Table 1 shows the evaluation results of the different algorithms on the multi-agent GRF full game. We report the winning, draw, and failure rates, as well as the goal differences, after playing against the built-in AI for 500 rounds (with 3,000 steps per round). Results show that CQL fails to defeat the built-in AI, with a zero winning rate, and that our final algorithm (i.e., + Advantage Weight) achieves the best performance, with the highest winning rate and the highest goal difference. Figure 3(a) shows the winning, draw, and failure rates and the goal differences of the different algorithms. To further evaluate the performance of the different algorithms, we rank all the algorithms with the TrueSkill rating system (Herbrich, Minka, and Graepel, 2006); the results are shown in the last row of Table 1 and in Figure 3(b). Our final TiKick model achieves the highest TrueSkill score.
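For reference, the reported match statistics can be computed from per-round score lines as in the sketch below; the data structure is our assumption, while the 500-round protocol follows the text.

```python
# Sketch: win/draw/failure rates and average goal difference from a list of
# per-round (our_goals, opponent_goals) pairs.
def match_statistics(score_lines):
    wins = sum(1 for a, b in score_lines if a > b)
    draws = sum(1 for a, b in score_lines if a == b)
    losses = sum(1 for a, b in score_lines if a < b)
    n = len(score_lines)
    goal_diff = sum(a - b for a, b in score_lines) / n
    return {"win_rate": wins / n, "draw_rate": draws / n,
            "failure_rate": losses / n, "avg_goal_diff": goal_diff}

# Example over 4 hypothetical rounds.
print(match_statistics([(3, 0), (2, 2), (1, 0), (4, 1)]))
```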

4.2 Accelerate Multi-agent Reinforcement Learning Training

Previous work (Silver et al., 2016; Vinyals et al., 2019; Ye et al., 2020) shows that pre-trained supervised models greatly help the follow-up deep reinforcement learning training. In this section, we examine whether our pre-trained offline RL model can accelerate the multi-agent reinforcement learning training process. We consider the five GRF academic scenarios originally proposed in Kurach et al. (2019):

- Run_to_Score_with_Keeper
- Run_Pass_and_Shoot_with_Keeper
- Corner
- 3_vs_1_with_Keeper
- Hard_Counter-attack

Figure 1(b)-(f) show the screenshots of these five scenarios. In this experiment, we use QMIX (Rashid et al., 2018), MADDPG (Lowe et al., 2017), CDS (Li et al., 2021) and MAPPO (Yu et al., 2021) as the baselines.

Baselines:

QMIX: A value-based method that learns a monotonic approximation of the joint action-value function. QMIX factors the joint action-value function into a monotonic nonlinear combination of the individual utilities of each agent. A mixer network with non-negative weights is responsible for combining the agents' utilities.

MADDPG: A policy-based multi-agent algorithm, which adapts the single-agent DDPG algorithm to the multi-agent setting.

CDS: A value-based multi-agent algorithm with mutual information maximization between agents' identities and their trajectories. To be clear, CDS only plays against three enemies in the Hard_Counter-attack scenario, while our method plays against eleven enemies, which is much harder than the four-enemy version.

MAPPO: A policy-based method that uses a centralized value function without individual utilities. MAPPO adapts the single-agent PPO algorithm to the multi-agent setting. Many useful tricks are used to stabilize training, including Generalized Advantage Estimation (GAE) (Schulman et al., 2015), PopArt (Hessel et al., 2019), observation normalization, value clipping, layer normalization, and ReLU activation with orthogonal initialization.

Our method is built on the MAPPO algorithm, and we load the weights from our pre-trained TiKick offline RL model. Because there is a minor difference between the GRF full game and the academic scenarios (i.e., the academic scenarios have no built-in action), we do not load the weights of the last layer of the pre-trained model. In the experiments, we measure the winning rates and goal differences against the built-in AI over 5 seeds for all the algorithms, and each evaluation data point is obtained from 64 test rounds.
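Loading the pre-trained weights while skipping the incompatible output layer can be done as in the following PyTorch sketch; the parameter names and checkpoint format are assumptions, and only the idea of filtering out the head weights is taken from the text.

```python
# Sketch: initialize a model from a pre-trained checkpoint while skipping the
# final layer, whose action dimension differs (20 actions in the full game vs.
# 19 in the academic scenarios). Names are assumed.
import torch

def load_pretrained_except_head(model, checkpoint_path, head_prefix="head."):
    state = torch.load(checkpoint_path, map_location="cpu")   # assumed to be a state_dict
    filtered = {k: v for k, v in state.items() if not k.startswith(head_prefix)}
    missing, unexpected = model.load_state_dict(filtered, strict=False)
    return missing, unexpected   # the head parameters stay randomly initialized
```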

We show the training performance comparison against the baselines in Figure 4. Results show that our method, TiKick, achieves the best performance and the lowest sample complexity on all the scenarios. Figure 5 shows the average goal difference on all the academic scenarios for MAPPO and TiKick. Results show that our method achieves the highest scores on all the scenarios and reaches the highest scores within just 1M steps on four of the five scenarios. All the experiments show that the pre-trained offline RL model can accelerate the multi-agent RL training process.

5 Conclusion

In this paper, we have proposed a new multi-agent offline reinforcement learning method for the Google Research Football full game. Our proposed learning framework, TiKick, is implemented with deep recurrent neural networks and carefully designed training losses. Experimental results show that our final algorithm achieves the best performance on the GRF full game. Furthermore, experiments on the GRF academic scenarios show that our pre-trained model can accelerate the multi-agent RL training process, with better performance and lower sample complexity. In the future, we will try to apply our pre-trained model to multi-agent league training to further improve performance and robustness.

References

Agarwal, R.; Schuurmans, D.; and Norouzi, M. 2020. An optimistic perspective on offline reinforcement learning. In International Conference on Machine Learning, 104–114. PMLR.

Berner, C.; Brockman, G.; Chan, B.; Cheung, V.; Debiak, P.; Dennison, C.; Farhi, D.; Fischer, Q.; Hashme, S.; Hesse, C.; et al. 2019. Dota 2 with large scale deep reinforcement learning. arXiv preprint arXiv:1912.06680.

Cho, K.; Van Merriënboer, B.; Gulcehre, C.; Bahdanau, D.; Bougares, F.; Schwenk, H.; and Bengio, Y. 2014. Learning phrase representations using RNN encoder-decoder for statistical machine translation. arXiv preprint arXiv:1406.1078.

Fu, J.; Kumar, A.; Nachum, O.; Tucker, G.; and Levine, S. 2020. D4RL: Datasets for deep data-driven reinforcement learning. arXiv preprint arXiv:2004.07219.

Fujimoto, S.; Meger, D.; and Precup, D. 2019. Off-policy deep reinforcement learning without exploration. In International Conference on Machine Learning, 2052–2062.

Google. 2020. Google Research Football Competition 2020. https://www.kaggle.com/c/google-football.

grpc. 2020. gRPC. https://grpc.io/.

Gulcehre, C.; Wang, Z.; Novikov, A.; Le Paine, T.; Gomez Colmenarejo, S.; Zolna, K.; Agarwal, R.; Merel, J.; Mankowitz, D.; Paduraru, C.; et al. 2020. RL Unplugged: Benchmarks for offline reinforcement learning. arXiv e-prints arXiv–2006.

Herbrich, R.; Minka, T.; and Graepel, T. 2006. TrueSkill™: A Bayesian skill rating system. In Proceedings of the 19th International Conference on Neural Information Processing Systems, 569–576.

Hessel, M.; Soyer, H.; Espeholt, L.; Czarnecki, W.; Schmitt, S.; and van Hasselt, H. 2019. Multi-task deep reinforcement learning with PopArt. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 33, 3796–3803.

Ho, J., and Ermon, S. 2016. Generative adversarial imitation learning. Advances in Neural Information Processing Systems 29:4565–4573.

Kang, B.; Jie, Z.; and Feng, J. 2018. Policy optimization with demonstrations. In International Conference on Machine Learning, 2469–2478. PMLR.

Kidambi, R.; Rajeswaran, A.; Netrapalli, P.; and Joachims, T. 2020. MOReL: Model-based offline reinforcement learning. arXiv preprint arXiv:2005.05951.

Kingma, D. P., and Ba, J. 2014. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980.

Kitano, H.; Asada, M.; Kuniyoshi, Y.; Noda, I.; and Osawa, E. 1997. RoboCup: The robot world cup initiative. In Proceedings of the First International Conference on Autonomous Agents, 340–347.

Kumar, A.; Zhou, A.; Tucker, G.; and Levine, S. 2020. Conservative Q-learning for offline reinforcement learning. arXiv preprint arXiv:2006.04779.

Kurach, K.; Raichuk, A.; Stanczyk, P.; Zając, M.; Bachem, O.; Espeholt, L.; Riquelme, C.; Vincent, D.; Michalski, M.; Bousquet, O.; et al. 2019. Google Research Football: A novel reinforcement learning environment. arXiv preprint arXiv:1907.11180.

Li, C.; Wu, C.; Wang, T.; Yang, J.; Zhao, Q.; and Zhang, C. 2021. Celebrating diversity in shared multi-agent reinforcement learning. arXiv preprint arXiv:2106.02195.

Lin, T.-Y.; Goyal, P.; Girshick, R.; He, K.; and Dollár, P. 2017. Focal loss for dense object detection. In Proceedings of the IEEE International Conference on Computer Vision, 2980–2988.

Liu, S.; Lever, G.; Merel, J.; Tunyasuvunakool, S.; Heess, N.; and Graepel, T. 2019. Emergent coordination through competition. arXiv preprint arXiv:1902.07151.

Liu, S.; Lever, G.; Wang, Z.; Merel, J.; Eslami, S.; Hennes, D.; Czarnecki, W. M.; Tassa, Y.; Omidshafiei, S.; Abdolmaleki, A.; et al. 2021a. From motor control to team play in simulated humanoid football. arXiv preprint arXiv:2105.12196.

Liu, X.; Jia, H.; Wen, Y.; Yang, Y.; Hu, Y.; Chen, Y.; Fan, C.; and Hu, Z. 2021b. Unifying behavioral and response diversity for open-ended learning in zero-sum games. arXiv preprint arXiv:2106.04958.

Lowe, R.; Wu, Y.; Tamar, A.; Harb, J.; Abbeel, P.; and Mordatch, I. 2017. Multi-agent actor-critic for mixed cooperative-competitive environments. arXiv preprint arXiv:1706.02275.

Mahajan, A.; Rashid, T.; Samvelyan, M.; and Whiteson, S. 2019. MAVEN: Multi-agent variational exploration. arXiv preprint arXiv:1910.07483.

Mnih, V.; Kavukcuoglu, K.; Silver, D.; Graves, A.; Antonoglou, I.; Wierstra, D.; and Riedmiller, M. 2013. Playing Atari with deep reinforcement learning. arXiv preprint arXiv:1312.5602.

Oh, J.; Guo, Y.; Singh, S.; and Lee, H. 2018. Self-imitation learning. In International Conference on Machine Learning, 3878–3887. PMLR.

Oliehoek, F. A., and Amato, C. 2016. A Concise Introduction to Decentralized POMDPs. Springer.

Paszke, A.; Gross, S.; Massa, F.; Lerer, A.; Bradbury, J.; Chanan, G.; Killeen, T.; Lin, Z.; Gimelshein, N.; Antiga, L.; et al. 2019. PyTorch: An imperative style, high-performance deep learning library. Advances in Neural Information Processing Systems 32:8026–8037.

Peng, X. B.; Kumar, A.; Zhang, G.; and Levine, S. 2019. Advantage-weighted regression: Simple and scalable off-policy reinforcement learning. arXiv preprint arXiv:1910.00177.

Qin, R.; Gao, S.; Zhang, X.; Xu, Z.; Huang, S.; Li, Z.; Zhang, W.; and Yu, Y. 2021. NeoRL: A near real-world benchmark for offline reinforcement learning. arXiv preprint arXiv:2102.00714.

Rashid, T.; Samvelyan, M.; Schroeder, C.; Farquhar, G.; Foerster, J.; and Whiteson, S. 2018. QMIX: Monotonic value function factorisation for deep multi-agent reinforcement learning. In International Conference on Machine Learning, 4295–4304. PMLR.

Saxe, A. M.; McClelland, J. L.; and Ganguli, S. 2013. Exact solutions to the nonlinear dynamics of learning in deep linear neural networks. arXiv preprint arXiv:1312.6120.

Schulman, J.; Moritz, P.; Levine, S.; Jordan, M.; and Abbeel, P. 2015. High-dimensional continuous control using generalized advantage estimation. arXiv preprint arXiv:1506.02438.

Silver, D.; Huang, A.; Maddison, C. J.; Guez, A.; Sifre, L.; Van Den Driessche, G.; Schrittwieser, J.; Antonoglou, I.; Panneershelvam, V.; Lanctot, M.; et al. 2016. Mastering the game of Go with deep neural networks and tree search. Nature 529(7587):484–489.

Song, J.; Ren, H.; Sadigh, D.; and Ermon, S. 2018. Multi-agent generative adversarial imitation learning. arXiv preprint arXiv:1807.09936.

Taiga, A. A.; Fedus, W.; Machado, M. C.; Courville, A.; and Bellemare, M. G. 2019. On bonus based exploration methods in the arcade learning environment. In International Conference on Learning Representations.

Team, E. L.; Stooke, A.; Mahajan, A.; Barros, C.; Deck, C.; Bauer, J.; Sygnowski, J.; Trebacz, M.; Jaderberg, M.; Mathieu, M.; et al. 2021. Open-ended learning leads to generally capable agents. arXiv preprint arXiv:2107.12808.

Vinyals, O.; Babuschkin, I.; Czarnecki, W. M.; Mathieu, M.; Dudzik, A.; Chung, J.; Choi, D. H.; Powell, R.; Ewalds, T.; Georgiev, P.; et al. 2019. Grandmaster level in StarCraft II using multi-agent reinforcement learning. Nature 575(7782):350–354.

Wu, Y.; Tucker, G.; and Nachum, O. 2019. Behavior regularized offline reinforcement learning. arXiv preprint arXiv:1911.11361.

Ye, D.; Chen, G.; Zhao, P.; Qiu, F.; Yuan, B.; Zhang, W.; Chen, S.; Sun, M.; Li, X.; Li, S.; et al. 2020. Supervised learning achieves human-level performance in MOBA games: A case study of Honor of Kings. IEEE Transactions on Neural Networks and Learning Systems.

Yu, T.; Thomas, G.; Yu, L.; Ermon, S.; Zou, J.; Levine, S.; Finn, C.; and Ma, T. 2020. MOPO: Model-based offline policy optimization. arXiv preprint arXiv:2005.13239.

Yu, C.; Velu, A.; Vinitsky, E.; Wang, Y.; Bayen, A.; and Wu, Y. 2021. The surprising effectiveness of MAPPO in cooperative, multi-agent games. arXiv preprint arXiv:2103.01955.

Zha, D.; Lai, K.-H.; Zhou, K.; and Hu, X. 2021. Simplifying deep reinforcement learning via self-supervision. arXiv preprint arXiv:2106.05526.

Zhang, T.; Xu, H.; Wang, X.; Wu, Y.; Keutzer, K.; Gonzalez, J. E.; and Tian, Y. 2020. BeBold: Exploration beyond the boundary of explored regions. arXiv preprint arXiv:2012.08621.

Ziyang Li, Kaiwen Zhu, F. Z. 2020. WeKick. https://www.kaggle.com/c/google-football/discussion/202232.

A Observation Design

In this section, we describe our observation design in detail. Our observation consists of the following parts:

- absolute coordinates of players on our side
- moving directions of players on our side
- absolute coordinates of enemies
- moving directions of enemies
- absolute coordinate of the current player
- absolute coordinate of the ball
- moving direction of the ball
- one-hot indicator for the ball-owned team
- one-hot indicator for the active player
- one-hot indicator for the game mode
- sticky actions
- distance between the ball and the current player
- tired factors of players on our side
- yellow card indicators of our side
- red card indicators of our side
- offside indicators of our side
- offside indicators of enemies
- distances between the enemies and the current player
- relative coordinates of players on our side
- relative coordinates of players on our side
- relative coordinates of players on our side
- number of steps left until the end of the match
- number of steps left until the end of the half-time
- goal difference
- zero vector of 27 dimensions (reserved for future use)


B Hyper-parameters

In this section, we will describe the hyper-parameters in detail.

Hyper-parameter                            Value
policy network learning rate               1e-4
critic network learning rate               1e-4
number of environment parallel threads     64
number of MLP layers                       4
number of GRU layers                       1
training episode length                    1,500
test episode length                        3,000
data chunk length                          25
observation size                           268
action size                                20 for the GRF full game, 19 for academic scenarios
number of evaluation rounds                500 for the GRF full game, 64 for academic scenarios

C Trajectory Weighting Factor Design

In this section, we describe how we design the trajectory weighting factor. Given a joint trajectory $\tau = \{o^i_t, u^i_t, r_t\}_{t=1:T}^{i=1:n}$ sampled from the dataset, we can calculate the cumulative reward as:

$$R(\tau) = \sum_{t=1}^{T} r_t. \qquad (10)$$

Then we define a simple trajectory weighting factor as:

$$\rho(\tau) = \begin{cases} \rho_0 & \text{if } R(\tau) < R_{threshold},\\ \rho_1 & \text{if } R(\tau) \geq R_{threshold}, \end{cases} \qquad (11)$$

where $\rho_0$, $\rho_1$ and $R_{threshold}$ are hyper-parameters. In this paper, we set $\rho_0 = 0$, $\rho_1 = 1$ and $R_{threshold} = 3$.
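In code, this weighting factor reduces to a simple threshold function, for example:

```python
# Transcription of Eqs. (10)-(11) with the settings stated above
# (rho_0 = 0, rho_1 = 1, R_threshold = 3).
def trajectory_weight(rewards, rho_0=0.0, rho_1=1.0, r_threshold=3.0):
    cumulative = sum(rewards)                               # R(tau), Eq. (10)
    return rho_1 if cumulative >= r_threshold else rho_0    # rho(tau), Eq. (11)
```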
