Solving Simple StarCraft II Problems with Dueling
Deep Q-Learning with Limited Available Actions
Sylvester Shan
Supervisor: Penny Kyburz
Australian National University, Canberra 2600, Australia
Abstract. This paper compares the performance of two reinforcement learning (RL)
algorithms, Double Deep Q-Networks (DDQN) and Dueling Double Deep Q-Networks
(DDDQN), by training them on three mini-games with a limited set of available actions
in the StarCraft II reinforcement learning environment. The paper shows that it is
possible to solve broken-down problems efficiently with limited actions using the
DDDQN training method and a non-static state representation of the environment.
Keywords: Dueling Deep Q-Learning, Double Deep Q-Learning, Reinforcement
Learning, Hierarchical Reinforcement Learning, Deep Learning, StarCraft II.
1 Introduction
In recent years, Reinforcement Learning (RL) has been applied in fields such as
chemistry, robotics, engineering, and neuroscience, but it is best known for playing
games, such as Atari 2600 computer games [1] and defeating the world's top Go
player [2]; both agents were developed by DeepMind.
DeepMind released a reinforcement learning environment called SC2LE (StarCraft II
Learning Environment) for StarCraft II, a popular real-time strategy game. The
objective of the game is to expand one's base and create an army to destroy the
opponent. It is a fast-paced game that requires the player to make hundreds of decisions
and micro-actions every minute. Several features make the game a perfect environment
for reinforcement learning. The environment is not fully observable for each player: to
obtain information, the player needs to scout the map with a unit. The game has a large
state space and a large action space at each state, making it infeasible for existing
traditional algorithms to record all possible states. In other words, no scripted agent
can cover all possible scenarios. These properties are what make the game challenging
to solve. [3]
In a previously published paper, DeepMind stated the long-term goal of developing a
deep reinforcement learning agent that can defeat top competitive human players. [5]
This goal was recently achieved with the release of AlphaStar, which is capable of
defeating professional players such as Grzegorz "MaNa" Komincz 5-0. AlphaStar can
be considered the state of the art for reinforcement learning; more information on how
AlphaStar was trained can be found in DeepMind's blog post. [4] Although it is unlikely
that a better agent than AlphaStar can be reproduced or created here, there are still
areas worth exploring and evaluating in StarCraft II.
Besides plain Reinforcement Learning, Hierarchical Reinforcement Learning (HRL)
can also be used to solve problems. HRL decomposes a problem into smaller
subproblems until they cannot be broken down further, then solves each subproblem in
order to solve the original problem.
An agent needs to be trained to solve each subproblem broken down from the larger
problem. Two training methods are compared in this paper, Double Deep Q-Networks
(DDQN) and Dueling Double Deep Q-Networks (DDDQN), using the environment and
mini-games in PySC2. Three different mini-games are used, and for each mini-game
one agent is trained with each method. The paper compares the performance of the two
agents on the same mini-game and discusses how these trained agents could be used to
construct a hierarchical agent that plays the full game of StarCraft II.
2 Background
2.1 Reinforcement Learning
Reinforcement learning problems are often represented as a finite Markov Decision
Process (MDP), which can be described by the following tuple:

M = (S, A, P, R, γ)

where S is the state space, A is the action space, P(s′ | s, a) is the probability of the
next state being s′ given current state s and action a, R(s, a) is the immediate reward
received after taking action a in state s, and γ is the discount factor that weights
predicted future rewards.
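For reference, a Double Q-learning update consistent with this notation is sketched below; the learning rate α and the target-network parameters θ⁻ are standard DQN notation assumed here rather than defined in this paper:

$$Q_{\theta}(s,a) \leftarrow Q_{\theta}(s,a) + \alpha\Big[R(s,a) + \gamma\, Q_{\theta^{-}}\big(s',\ \arg\max_{a'} Q_{\theta}(s',a')\big) - Q_{\theta}(s,a)\Big]$$

The online network Q_θ selects the next action while the target network Q_{θ⁻} evaluates it; this decoupling is what distinguishes DDQN from plain DQN.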
Despite its many successes, reinforcement learning still has disadvantages. The curse
of dimensionality refers to a phenomenon that occurs when the dimensionality of the
data increases. [9] For example, a simple problem may be solvable with an existing
reinforcement learning algorithm, but when the dimensionality of the problem is scaled
up, the larger action space or state space can make that algorithm infeasible.
Reinforcement learning is also poor at generalization: an agent trained to solve one
complex problem will not be able to solve a similar complex problem, because it cannot
transfer its learning experience to a different environment. In other words, a trained
agent is overspecialized for the problem it is solving. [8] For example, AlphaStar only
plays the race Protoss; its learning experience cannot be transferred to play the race
Zerg. The initial building of the Protoss is the Nexus, which lets the player create a
worker unit called a Probe, while the initial building of the Zerg is the Hatchery, which
creates a unit called a Larva every 11 seconds; each Larva can then be morphed into a
worker unit called a Drone or into two military units called Zerglings.
2.2 Hierarchical Reinforcement Learning
Hierarchical Reinforcement Learning (HRL) can overcome these disadvantages of
Reinforcement Learning. HRL breaks problems down to avoid the curse of
dimensionality. For example, how to win a StarCraft II game can be broken down into
how to expand the base and how to build a military. How to build a military can be
broken down into how to build Barracks, which can in turn be broken down into
selecting a worker and then building the Barracks at a specific location, and so on.
Breaking down the problem also means HRL is better at generalization. [8] When a
problem cannot be broken down further, it can be solved by a single policy, and that
lowest-level problem may well be shared between the three races in StarCraft II. For
example, the basic military buildings for Terran, Protoss and Zerg are the Barracks, the
Gateway and the Spawning Pool, and the common action sequence to build them is
selecting a worker and building the structure at a specified location. This suggests that
if there exists a policy for the Terran race that selects a worker unit and builds a
Barracks, then it should be possible to transfer that learning experience to an agent
playing Zerg, which selects a worker unit and builds a Spawning Pool, as sketched
below.
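As a minimal sketch of this transfer idea (all class, method and dictionary names below are illustrative assumptions, not code from any cited work), a single learned low-level policy could be parameterized by the race-specific building:

```python
# Hypothetical sketch: one learned low-level policy reused across races.
BASIC_MILITARY_BUILDING = {
    "Terran": "Barracks",
    "Protoss": "Gateway",
    "Zerg": "SpawningPool",
}

def build_basic_military_building(policy, obs, race):
    """Select a worker and build the race's basic military building.

    The policy's two decisions (which worker to select, where to build)
    are the race-agnostic parts that HRL could transfer between races.
    """
    worker = policy.select_worker(obs)   # learned, race-agnostic
    x, y = policy.choose_location(obs)   # learned, race-agnostic
    return worker.build(BASIC_MILITARY_BUILDING[race], x, y)
```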
3 Related Work
In February 2019, Pang and his team developed an HRL agent that defeated the most
difficult built-in StarCraft II AI that does not cheat, with a win rate of 93%. [6] The
following figure shows the architecture of the HRL agent.
Figure 1: Hierarchical reinforcement learning architecture [6]
The global state is represented by non-spatial features such as the number of resources
the agent has, the number of workers, the size of the army, etc. Detailed information
can be found in the appendix of their paper. [6]
Before training the HRL agent, they mined macro-actions from replays; these are fixed
sequences of actions that human players perform. For example, they found that
selecting a Probe and performing the action Harvest-Gather-Probe-screen appeared
2711 times during the mining session. This reduces the action space of each policy and
increases both testing speed and learning efficiency.
The global state is then fed into a controller, which picks a sub-policy every 8 seconds.
After a sub-policy is chosen, the screen the agent sees locally is passed into the chosen
sub-policy, which then chooses which macro-action to perform. [6] A sketch of this
loop follows.
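The sketch below illustrates the controller/sub-policy loop just described; the class interfaces, the `game_time` field and the observation attributes are illustrative assumptions, not the actual implementation from [6]:

```python
CONTROL_INTERVAL = 8  # seconds between controller decisions, per [6]

def run_episode(env, controller, sub_policies):
    """Controller picks a sub-policy from the global state every 8 seconds;
    the chosen sub-policy picks macro-actions from the local screen."""
    obs = env.reset()
    current, last_switch = None, 0.0
    while not obs.done:
        if current is None or obs.game_time - last_switch >= CONTROL_INTERVAL:
            current = sub_policies[controller.choose(obs.global_state)]
            last_switch = obs.game_time
        macro_action = current.choose(obs.local_screen)  # a fixed action sequence
        obs = env.step(macro_action)
```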
The mined action sequences are mostly for expanding the base and creating more
military buildings, so the focus area is where the base is. Extracting combat sequences
would be hard, since combat is not limited to the base but can happen all over the map;
it is therefore impossible for the replays to cover all, or even most, possible game states
in which combat happens. During combat, the action a player chooses should depend
on the current scenario, so they used simple combat systems instead. One combat
system has three actions: attack a given location, retreat to a given location, or do
nothing. In the second, when a battle sub-policy is chosen, the army attacks the center
of the most injured units. [6]
It is impossible to mine and store every sequence of macro-actions. However, if a
problem in StarCraft II can be repeatedly broken down into smaller problems, then each
such problem should be solvable by a simple agent with a small action space.
4 Methods
4.1 Mini-Games
Instead of training an agent to play the whole game, mini-games are used to help
understand how the agent learns and behaves. The following three mini-games were
chosen for the experiments. For each mini-game below, one agent is trained using
DDQN and the other using DDDQN. Of the three mini-games,
CollectMineralShards is the easiest and DefeatZerglingsAndBanelings is the hardest.
The camera is fixed in all three mini-games.
4.1.1 CollectMineralShards
This mini-game has mineral shards scattered around the map. The player has two
military units and moves them to collect as many mineral shards as possible within a
time limit. After all mineral shards have been collected, more spawn at random
locations on the map. The agent receives a reward of 1 point for each shard it collects.
4.1.2 DefeatRoaches
The agent receives 9 Marines and fights 4 Roaches. Both Marines and Roaches can
attack from a distance, but Roaches have more hit points than Marines. The objective
is for the agent controlling the Marines to focus-fire the Roaches down one by one.
After all the Roaches are defeated, the agent receives 5 additional Marines and 4 new
Roaches spawn on the map. The agent gains 10 points for each Roach it defeats and
loses 1 point for each Marine it loses.
4.1.3 DefeatZerglingsAndBanelings
The agent receives 9 Marines and fights 6 Zerglings and 4 Banelings. Zerglings are
close-combat units. Banelings are military units that explode, dealing splash damage
around them. The agent gains 5 points for defeating an enemy unit and loses 1 point
for losing a Marine.
4.2 Global State
The global state differs between types of mini-games. If the agent cannot see any
enemy units on the screen, the global state is a feature_screen_size x
feature_screen_size 2D array (feature_screen_size = 84 by default) representing the
coordinates of the map. Each element of this array has the value 0, 1, 2, 3 or 4, meaning
that at this location there is nothing, one of the agent's units, an ally's unit, a neutral
object (a mineral shard) or an enemy unit, respectively. (This layer is equivalent to
player_relative in PySC2.) If an enemy is seen on the map, the global state is instead a
feature_screen_size x feature_screen_size 2D array in which each element is the health
of the enemy unit at that location.
The global state is fed as input into the convolutional neural network used by DDQN
and DDDQN to calculate the Q-values.
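A minimal sketch of how such a state could be assembled from PySC2's feature layers, assuming the PySC2 2.0 observation interface; the function name is ours, and only the `player_relative` and `unit_hit_points` layers come from the library:

```python
import numpy as np
from pysc2.lib import features

_ENEMY = features.PlayerRelative.ENEMY  # value 4 in the player_relative layer

def build_global_state(obs):
    """Return player_relative while no enemy is visible, otherwise a map of
    enemy hit points (zero everywhere an enemy is not standing)."""
    screen = obs.observation.feature_screen
    player_relative = np.asarray(screen.player_relative)
    if not (player_relative == _ENEMY).any():
        return player_relative
    hit_points = np.asarray(screen.unit_hit_points)
    return np.where(player_relative == _ENEMY, hit_points, 0)
```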
4.3 Action Space
All three mini-games are limited to two actions: Move_screen and Attack_screen. Both
actions take the coordinates of a pixel on the feature screen, which can be anywhere on
the screen. Move_screen moves the selected units to the specified location; they will
not attack any enemy units until they reach the destination, although if the specified
location is within line of sight and an enemy unit occupies it, they will attack that unit.
Attack_screen moves the selected units to the specified location, and the units will
attack any enemy units within line of sight along the way.
Note that all units are selected with the action select_army; there is no drag-selection
or clicking on individual units. A sketch of how these calls are issued through PySC2
appears after Figure 2.
Figure 2: The light green box on the left of the figure shows the available positions to
choose from. The camera on this map is fixed.
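For illustration, the snippet below shows how select_army and the two movement actions are issued through PySC2's action interface; the coordinates are placeholders:

```python
from pysc2.lib import actions

_NOT_QUEUED = [0]  # execute immediately rather than queueing behind other orders
_SELECT_ALL = [0]

# Select every unit the agent owns (no drag-select or individual clicks).
select_all = actions.FunctionCall(actions.FUNCTIONS.select_army.id, [_SELECT_ALL])

# On a later step: move or attack-move the selection to feature-screen pixel (x, y).
x, y = 42, 17  # any coordinate on the 84 x 84 feature screen
move = actions.FunctionCall(actions.FUNCTIONS.Move_screen.id, [_NOT_QUEUED, [x, y]])
attack = actions.FunctionCall(actions.FUNCTIONS.Attack_screen.id, [_NOT_QUEUED, [x, y]])
```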
4.4 DDDQN neural network and output
The DDDQN implementation is based on Ray Heberer's implementation of DDQN;
the GitHub repository for that implementation can be found in the Appendix. After the
output of the convolutional neural network is flattened, the result is fed into two
separate fully connected networks.
The first network calculates the value of the state. Its first layer takes the flattened
result as input and is fully connected to a hidden layer with 1024 neurons, which is in
turn fully connected to a single output: the value of the state V(s).
The second network calculates the advantage of taking each action compared to all
other actions. Its first layer takes the flattened result as input and is fully connected to
a hidden layer with 1024 neurons, which is fully connected to an output layer of the
same size as the input layer. This network predicts the advantage A(s, a) of each action.
The Q-value then follows the standard dueling combination: the average advantage
over all actions in the current state is subtracted from the advantage of the chosen
action, and the result is added to the value of the current state:

Q(s, a) = V(s) + A(s, a) − (1/|A|) Σa′ A(s, a′)
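The following is a minimal sketch of these two streams and how they combine, written against a Keras-style TensorFlow API; the variable names and the use of ReLU activations are our assumptions, not details confirmed by the implementation:

```python
import tensorflow as tf

def dueling_head(flattened, num_actions, hidden=1024):
    """Split flattened conv features into value and advantage streams."""
    # Value stream: flattened features -> 1024 hidden units -> scalar V(s).
    v = tf.keras.layers.Dense(hidden, activation="relu")(flattened)
    value = tf.keras.layers.Dense(1)(v)
    # Advantage stream: flattened features -> 1024 hidden units -> A(s, a).
    a = tf.keras.layers.Dense(hidden, activation="relu")(flattened)
    advantage = tf.keras.layers.Dense(num_actions)(a)
    # Q(s, a) = V(s) + (A(s, a) - mean over a' of A(s, a')).
    return value + (advantage - tf.reduce_mean(advantage, axis=1, keepdims=True))
```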
4.5 Training Termination
The DDQN and DDDQN training methods train an agent on one mini-game at a time.
The number of episodes is not fixed in advance for each mini-game; instead, there are
three ways the training of an agent is terminated. First, training can be terminated by
an error: the terminal prints "MemoryError" and the training ends (more about this
error can be found in the file errormessage.md in the GitHub repository linked in the
Appendix). Second, training can be terminated by inspecting the score graph generated
by TensorBoard. The third method requires the score graph of a training run already
finished on the same mini-game: e.g. when a DDQN agent has finished training on
CollectMineralShards and a DDDQN agent is currently training on
CollectMineralShards, the second run is terminated when its number of episodes is
approximately the same.
4.6 Experiments
4.6.1 Number of hidden neurons in DDDQN
In DDDQN, choosing the number of hidden neurons for the two fully connected
networks was not trivial. Two agents were trained on the same mini-game,
DefeatRoaches. The first agent had 512 hidden neurons in the hidden layer of both
networks; the second agent likewise had 1024.
4.6.2 Global state decision
When training a reinforcement learning agent, the choice of state representation is
important, and different states can produce different results. With player_relative as
the state, it was observed after the score had converged that, whenever enemies were
on the screen, the agent would keep clicking on all the units. Therefore, an experiment
was run with two DDDQN agents training on the same mini-game, DefeatRoaches,
one using the state player_relative and the other enemy_hp.
4.6.3 Scores
To evaluate which training method is better, a DDDQN agent and a DDQN agent use
the same state and are trained separately on each of the three mini-games.
5 Results
Note that all graphs are generated by TensorBoard and smoothed with a factor of 0.9
for better visibility. The x-axis is the number of episodes the agent has trained for; the
y-axis is the score the agent achieved in that episode.
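For clarity, smoothing with weight 0.9 is an exponential moving average; the sketch below shows approximately what TensorBoard's smoothing slider computes on the raw per-episode scores:

```python
def smooth(values, weight=0.9):
    """Exponential moving average over a list of per-episode scores."""
    smoothed, last = [], values[0]
    for v in values:
        last = weight * last + (1 - weight) * v  # blend new value into the average
        smoothed.append(last)
    return smoothed
```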
5.1 Number of hidden neurons in DDDQN
After training the two agents, the score graphs for both are shown below.
Figure 3: Left graph is the performance of the DDDQN agent with 1024 hidden neurons;
Right graph is the performance of the DDDQN agent with 512 hidden neurons. (x-axis:
number of episodes, y-axis: score achieved at the corresponding episode)
Table 1: The max and mean scores of the two agents with different numbers of hidden
neurons.

Number of hidden neurons | Max score | Mean score
512                      | 211.00    | 25.52
1024                     | 335.00    | 79.00
5.2 Global state decision
After training the two DDDQN agents with different states, their scores are shown
below.
Figure 4: Left graph is the performance of the DDDQN agent with the state as
enemy_hp; Right graph is the performance of the DDDQN agent with the state
player_relative. (x-axis: number of episodes, y-axis: score achieved at the
corresponding episode)
Table 2: The max and mean scores of the two agents with two different states.

State name      | Max score | Mean score
player_relative | 81.00     | 21.40
enemy_hp        | 335.00    | 79.00
5.3 Scores
After training the two agents on each of the mini-games, the scores are presented below.
Figure 5: After training on CollectMineralShards. Left graph is the performance of
DDDQN and the right graph is the performance of DDQN. (x-axis: number of episodes,
y-axis: score achieved at the corresponding episode)
Table 3: The max and mean scores of the two agents trained on CollectMineralShards.

Training method | Max score | Mean score
DDDQN           | 90.00     | 66.36
DDQN            | 111.00    | 65.11
Figure 6: After training on DefeatRoaches. Left graph is the performance of DDDQN
and the right graph is the performance of DDQN. (x-axis: number of episodes, y-axis:
score achieved at the corresponding episode)
Table 4: The max and mean scores of the two agents trained on DefeatRoaches.

Training method | Max score | Mean score
DDDQN           | 335.00    | 79.00
DDQN            | 302.00    | 35.32
Figure 7: After training on DefeatZerglingsAndBanelings. Left graph is the
performance of DDDQN and the right graph is the performance of DDQN. (x-axis:
number of episodes, y-axis: score achieved at the corresponding episode)
Table 5: The max and mean scores of the two agents trained on
DefeatZerglingsAndBanelings.

Training method | Max score | Mean score
DDDQN           | 89.00     | 17.18
DDQN            | 88.00     | 21.90
6 Discussion
6.1 Mini-games behavior
Note that the mini-game behavior was observed on the feature layers before terminating
the training, not in actual gameplay, under the assumption that the score had already
converged. Replays are available in this project's GitHub repository.
6.1.1 CollectMineralShards
This is one of the simplest games in the mini-game list, and there was no big difference
in how the two agents behaved. From the table above, although the DDQN agent has a
higher max score, this could simply be luck, since the mineral shards spawn randomly
around the units.
6.1.2 DefeatRoaches
The behavior of the two agents differs in this game. The DDDQN agent was able to
learn how to focus-fire a single unit at a time: after an opponent unit dies, it performs
Attack_screen on the next opponent in line and continues. According to the graph, it
learned this technique at around episode 200. The score is not consistent because of
the angle at which the units move into the Roaches. By default, the Roaches attack the
first unit they can see, and that unit is destroyed within two shots from each Roach,
which leads to a lower score, since the agent's army cannot win if it loses a unit too
early. Another interesting observation was that the agent did not repeatedly click on
the focus-fired Roach or perform other random actions: between ordering an attack on
a Roach and that Roach's hit points reaching zero, it does not click on anything.
The DDQN agent's performance was not ideal. It could not learn to focus-fire within
the same number of episodes the DDDQN agent took; it constantly clicked on one or
two Roaches, which quickly led to the death of the agent's army. Given enough time it
could possibly achieve results like the DDDQN agent's, but this is not worth pursuing,
since the DDDQN agent learned in half the time and performs better.
6.1.3 DefeatZerglingsAndBanelings
Looking at the graph, neither agent was able to perform well, but they had one behavior
in common: running away from the opponent's army. In the early stage of the game,
the agent's army attacks the opponent's units. The problem is that the opponent has 4
Banelings, which explode at close range and deal a large amount of damage to light-
armored units such as Marines. After the Banelings have killed or damaged most of the
units, the Zerglings come and destroy the rest.
After some trial and error, the DDDQN agent learned that it does not have a large
enough action space to deal with the opponent, so it started fleeing to the top-left corner
of the screen to avoid the opponent. The max score DDDQN obtained was most likely
from early in training, when it still tried to fight. In the later part of training it was still
collecting some score when the expected score should have been zero. The reason is
that when an episode restarts, there is a chance the agent's army spawns on the right
side of the map while the opponent spawns on the left; when the agent then flees toward
the top-left corner, it runs into combat and may trigger one or two Banelings to explode,
which grants it some reward. The evidence that the agent runs into the opponent is that
the agent's scores are mostly zero in the left graph of Figure 7.
The DDQN agent was not able to learn that its number of available actions is limited;
even after spending around twice as many episodes as the DDDQN agent, it was still
unable to defeat or run away from the opponent.
6.2 Action Space
The objective of this paper is to show that even with a limited action space, an agent
can still be trained to solve a problem given the right training method. HRL is about
breaking problems down into smaller problems, and smaller problems should be
solvable with a small action space. This is why the available actions were limited to
only Move_screen and Attack_screen.
6.3 Number of hidden neurons in DDDQN
The agent with more hidden neurons had a much higher max and mean score than the
agent with fewer hidden neurons. Therefore, all agents trained with the DDDQN
method use 1024 hidden neurons. The values 512 and 1024 were chosen because they
are powers of two, which are convenient sizes to store and a common convention for
layer widths. After this experiment, no further experiments were run to find the optimal
number of hidden neurons.
6.4 Global State Decision
In section 5.2, it can be observed that the DDDQN agent performs much better with
the state enemy_hp than with player_relative. One explanation is that with enemy_hp
the agent learns that it receives a reward when the hp of an enemy reaches 0, and the
fastest way to achieve that is to focus-fire one unit at a time. With player_relative, by
contrast, the value 4, which stands for ENEMY, remains at a position until the unit is
destroyed. The agent with the state player_relative seems to know that clicking on
positions labelled 4 returns a reward, but it cannot learn that the fastest strategy is to
focus-fire a single unit, because the label 4 does not decrease while the enemy's hp
does.
7 Conclusion
It is possible to create agents with only one or two available actions and a small action
space. Learning with DDDQN was more efficient than with DDQN. The representation
chosen for the game state is also an important aspect of an agent's learning efficiency:
non-static values appear to speed up learning.
8 Future Work
Further investigation is needed to find the optimal number of hidden neurons in the
neural network for the DDDQN training algorithm, and more time may be needed for
the agents to learn each mini-game fully under both training methods.
Although DDDQN is a better training method than DDQN, according to DeepMind,
Rainbow DQN is the most efficient learning algorithm among DDQN, DDDQN,
prioritized deep Q-learning, etc. [7] It would be interesting to compare a DDDQN agent
with a Rainbow DQN agent to see how an agent learns using Rainbow DQN as the
training method.
The next objective is to develop an HRL agent from pre-trained agents that solve the
smaller problems. Connecting these pre-trained agents in a hierarchical structure with
a neural network could be a solution, but this needs further investigation before any
claims can be made.
9 Acknowledgement
Special thanks to Dr Penny Kyburz and Ray Heberer.
10 References
[1] V. Mnih et al., "Playing Atari with Deep Reinforcement Learning", 2013.
Available: https://arxiv.org/pdf/1312.5602v1.pdf. [Accessed 31 May 2019].
[2] D. Silver et al., "Mastering the game of Go without human knowledge", Nature,
vol. 550, no. 7676, pp. 354-359, 2017. Available: 10.1038/nature24270.
[3] T. Le, N. Vien and T. Chung, "A Deep Hierarchical Reinforcement Learning
Algorithm in Partially Observable Markov Decision Processes", IEEE Access, vol. 6,
pp. 49089-49102, 2018. Available: 10.1109/access.2018.2854283.
[4] "AlphaStar: Mastering the Real-Time Strategy Game StarCraft II | DeepMind",
DeepMind, 2019. [Online]. Available: https://deepmind.com/blog/alphastar-mastering-
real-time-strategy-game-starcraft-ii/. [Accessed: 31 May 2019].
[5] O. Vinyals et al., "StarCraft II: A New Challenge for Reinforcement Learning",
2017. [Accessed 31 May 2019].
[6] Z. Pang et al., "On Reinforcement Learning for Full-length Game of StarCraft",
2019. [Accessed 31 May 2019].
[7] M. Hessel et al., "Rainbow: Combining Improvements in Deep Reinforcement
Learning", 2017. [Accessed 31 May 2019].
[8] "The Promise of Hierarchical Reinforcement Learning", The Gradient, 2019.
[Online]. Available: https://thegradient.pub/the-promise-of-hierarchical-
reinforcement-learning/. [Accessed: 31 May 2019].
[9] "What Killed the Curse of Dimensionality? – Camron's Blog", Camron.xyz, 2017.
[Online]. Available: http://camron.xyz/index.php/2017/09/06/what-killed-the-curse-
of-dimensionality. [Accessed: 31 May 2019].
11 Appendix
Link to this project's GitHub repository: https://github.com/ZestyVesty/SC2Agents
Link to the implementation of DDQN: https://github.com/rayheberer/SC2Agents

Table 6: The training environment.

OS/Package | Version
Ubuntu     | 19.04
PySC2      | 2.0.2
conda      | 4.6.14
Python     | 3.7.3

For the pre-trained models and other data, please contact the author by email.