Solving Simple StarCraft II Problems with Dueling
Deep Q-Learning with Limited Available Actions
Sylvester Shan
Supervisor: Penny Kyburz
Australian National University, Canberra 2600, Australia
Abstract. This paper compares the performance of two reinforcement learning (RL)
algorithms, Double Deep Q-Networks (DDQN) and Dueling Double Deep Q-Networks
(DDDQN), by training them on three mini-games with a limited set of available actions
in the StarCraft II reinforcement learning environment. The paper shows that it is
possible to solve broken-down problems efficiently with limited actions using the
DDDQN training method and a non-static state representation of the environment.
Keywords: Dueling Deep Q-Learning, Double Deep Q-Learning, Reinforcement
Learning, Hierarchical Reinforcement Learning, Deep Learning, StarCraft II.
1 Introduction
In recent years, Reinforcement Learning (RL) has been applied in fields such as
chemistry, robotics, engineering, and neuroscience, but it is best known for playing
games, such as Atari 2600 computer games [1] and defeating the world's top Go
player [2]; both agents were developed by DeepMind.
DeepMind released a reinforcement learning environment called SC2LE (StarCraft II
Learning Environment) for StarCraft II, a popular real-time strategy game. The
objective of the game is to expand one's base and create an army to destroy the
opponent. It is a fast-paced game that requires the player to make hundreds of decisions
and micro-actions every minute. Several features make the game a perfect environment
for reinforcement learning. The environment is not fully observable for each player: to
obtain information, the player needs to scout the map with a unit. The game has a large
state space and a large action space at each state, making it infeasible for existing
traditional algorithms to record all possible states. In other words, no scripted agent
can cover all possible scenarios. These properties are what make the game challenging
to solve. [3]
In a previously published paper, DeepMind stated the long-term goal of developing a
deep reinforcement learning agent that can defeat top competitive human players. [5]
This goal was recently achieved with the release of AlphaStar, which is capable of
defeating professional players such as Grzegorz "MaNa" Komincz 5-0. AlphaStar can
be considered the state of the art for reinforcement learning; more information on how
AlphaStar was trained can be found in DeepMind's blog post. [4] Although it is unlikely
that a better agent than AlphaStar can be reproduced or created here, there are still
areas worth exploring and evaluating in StarCraft II.
Besides plain Reinforcement Learning, Hierarchical Reinforcement Learning (HRL)
can also be used to solve problems. HRL decomposes a problem into smaller
subproblems until they cannot be broken down further, then solves each subproblem in
order to solve the original problem.
An agent needs to be trained to solve each subproblem broken down from the larger
problem. Two training methods are compared in this paper, Double Deep Q-Networks
(DDQN) and Dueling Double Deep Q-Networks (DDDQN), using the environment and
mini-games in PySC2. Three different mini-games are used, and for each mini-game
one agent is trained with each method. The paper compares the performance of the two
agents on the same mini-game and discusses how these trained agents could be used to
construct a hierarchical agent that plays the full game of StarCraft II.
2 Background
2.1 Reinforcement Learning
Reinforcement learning problems are often represented as a finite Markov Decision
Process (MDP), which can be described by the following tuple:

M = (S, A, P, R, γ)

where S is the state space, A is the action space, P(s′ | s, a) is the probability of the
next state being s′ given current state s and action a, R(s, a) is the immediate reward
received after taking action a in state s, and γ is the discount factor that weights
predicted future rewards.
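For reference, a Double Q-learning update consistent with this notation is sketched below; the learning rate α and the target-network parameters θ⁻ are standard DQN notation assumed here rather than defined in this paper:

$$Q_{\theta}(s,a) \leftarrow Q_{\theta}(s,a) + \alpha\Big[R(s,a) + \gamma\, Q_{\theta^{-}}\big(s',\ \arg\max_{a'} Q_{\theta}(s',a')\big) - Q_{\theta}(s,a)\Big]$$

The online network Q_θ selects the next action while the target network Q_{θ⁻} evaluates it; this decoupling is what distinguishes DDQN from plain DQN.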
Despite its many successes, reinforcement learning still has disadvantages. The curse
of dimensionality refers to a phenomenon that occurs when the dimensionality of the
data increases. [9] For example, a simple problem may be solvable with an existing
reinforcement learning algorithm, but when the dimensionality of the problem is scaled
up, the larger action space or state space can make that algorithm infeasible.
Reinforcement learning is also poor at generalization: an agent trained to solve one
complex problem will not be able to solve a similar complex problem, because it cannot
transfer its learning experience to a different environment. In other words, a trained
agent is overspecialized for the problem it is solving. [8] For example, AlphaStar only
plays the race Protoss; its learning experience cannot be transferred to play the race
Zerg. The initial building of the Protoss is the Nexus, which lets the player create a
worker unit called a Probe, while the initial building of the Zerg is the Hatchery, which
creates a unit called a Larva every 11 seconds; each Larva can then be morphed into a
worker unit called a Drone or into two military units called Zerglings.
2.2 Hierarchical Reinforcement Learning
Hierarchical Reinforcement Learning (HRL) can overcome these disadvantages of
Reinforcement Learning. HRL breaks problems down to avoid the curse of
dimensionality. For example, how to win a StarCraft II game can be broken down into
how to expand the base and how to build a military. How to build a military can be
broken down into how to build Barracks, which can in turn be broken down into
selecting a worker and then building the Barracks at a specific location, and so on.
Breaking down the problem also means HRL is better at generalization. [8] When a
problem cannot be broken down further, it can be solved by a single policy, and that
lowest-level problem may well be shared between the three races in StarCraft II. For
example, the basic military buildings for Terran, Protoss and Zerg are the Barracks, the
Gateway and the Spawning Pool, and the common action sequence to build them is
selecting a worker and building the structure at a specified location. This suggests that
if there exists a policy for the Terran race that selects a worker unit and builds a
Barracks, then it should be possible to transfer that learning experience to an agent
playing Zerg, which selects a worker unit and builds a Spawning Pool, as sketched
below.
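As a minimal sketch of this transfer idea (all class, method and dictionary names below are illustrative assumptions, not code from any cited work), a single learned low-level policy could be parameterized by the race-specific building:

```python
# Hypothetical sketch: one learned low-level policy reused across races.
BASIC_MILITARY_BUILDING = {
    "Terran": "Barracks",
    "Protoss": "Gateway",
    "Zerg": "SpawningPool",
}

def build_basic_military_building(policy, obs, race):
    """Select a worker and build the race's basic military building.

    The policy's two decisions (which worker to select, where to build)
    are the race-agnostic parts that HRL could transfer between races.
    """
    worker = policy.select_worker(obs)   # learned, race-agnostic
    x, y = policy.choose_location(obs)   # learned, race-agnostic
    return worker.build(BASIC_MILITARY_BUILDING[race], x, y)
```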
3 Related Work
In February 2019, Pang and his team developed an HRL agent that defeated the most
difficult built-in StarCraft II AI that does not cheat, with a win rate of 93%. [6] The
following figure shows the architecture of the HRL agent.
Figure 1: Hierarchical reinforcement learning architecture [6]
The global state is represented by non-spatial features such as the number of resources
the agent has, the number of workers, the size of the army, etc. Detailed information
can be found in the appendix of their paper. [6]
Before training the HRL agent, they mined macro-actions from replays; these are fixed
sequences of actions that human players perform. For example, they found that
selecting a Probe and performing the action Harvest-Gather-Probe-screen appeared
2711 times during the mining session. This reduces the action space of each policy and
increases both testing speed and learning efficiency.
The global state is then fed into a controller, which picks a sub-policy every 8 seconds.
After a sub-policy is chosen, the screen the agent sees locally is passed into the chosen
sub-policy, which then chooses which macro-action to perform. [6] A sketch of this
loop follows.
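The sketch below illustrates the controller/sub-policy loop just described; the class interfaces, the `game_time` field and the observation attributes are illustrative assumptions, not the actual implementation from [6]:

```python
CONTROL_INTERVAL = 8  # seconds between controller decisions, per [6]

def run_episode(env, controller, sub_policies):
    """Controller picks a sub-policy from the global state every 8 seconds;
    the chosen sub-policy picks macro-actions from the local screen."""
    obs = env.reset()
    current, last_switch = None, 0.0
    while not obs.done:
        if current is None or obs.game_time - last_switch >= CONTROL_INTERVAL:
            current = sub_policies[controller.choose(obs.global_state)]
            last_switch = obs.game_time
        macro_action = current.choose(obs.local_screen)  # a fixed action sequence
        obs = env.step(macro_action)
```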
The mined action sequences are mostly for expanding the base and creating more
military buildings, so the focus area is where the base is. Extracting combat sequences
would be hard, since combat is not limited to the base but can happen all over the map;
it is therefore impossible for the replays to cover all, or even most, possible game states
in which combat happens. During combat, the action a player chooses should depend
on the current scenario, so they used simple combat systems instead. One combat
system has three actions: attack a given location, retreat to a given location, or do
nothing. In the second, when a battle sub-policy is chosen, the army attacks the center
of the most injured units. [6]
It is impossible to mine and store every sequence of macro-actions. However, if a
problem in StarCraft II can be repeatedly broken down into smaller problems, then each
such problem should be solvable by a simple agent with a small action space.
4 Methods
4.1 Mini-Games
Instead of training an agent to play the whole game, mini-games are used to help
understand how the agent learns and behaves. The following three mini-games were
chosen for the experiments. For each mini-game below, one agent is trained using
DDQN and the other using DDDQN. Of the three mini-games,
CollectMineralShards is the easiest and DefeatZerglingsAndBanelings is the hardest.
The camera is fixed in all three mini-games.
4.1.1 CollectMineralShards
This mini-game has mineral shards scattered around the map. The player has two
military units and moves them to collect as many mineral shards as possible within a
time limit. After all mineral shards have been collected, more spawn at random
locations on the map. The agent receives a reward of 1 point for each shard it collects.
4.1.2 DefeatRoaches
The agent receives 9 Marines and fights 4 Roaches. Both Marines and Roaches can
attack from a distance, but Roaches have more hit points than Marines. The objective
is for the agent controlling the Marines to focus-fire the Roaches down one by one.
After all the Roaches are defeated, the agent receives 5 additional Marines and 4 new
Roaches spawn on the map. The agent gains 10 points for each Roach it defeats and
loses 1 point for each Marine it loses.
4.1.3 DefeatZerglingsAndBanelings
The agent receives 9 Marines and fights 6 Zerglings and 4 Banelings. Zerglings are
close-combat units. Banelings are military units that explode, dealing splash damage
around them. The agent gains 5 points for defeating an enemy unit and loses 1 point
for losing a Marine.
4.2 Global State
The global state differs between types of mini-games. If the agent cannot see any
enemy units on the screen, the global state is a feature_screen_size x
feature_screen_size 2D array (feature_screen_size = 84 by default) representing the
coordinates of the map. Each element of this array has the value 0, 1, 2, 3 or 4, meaning
that at this location there is nothing, one of the agent's units, an ally's unit, a neutral
object (a mineral shard) or an enemy unit, respectively. (This layer is equivalent to
player_relative in PySC2.) If an enemy is seen on the map, the global state is instead a
feature_screen_size x feature_screen_size 2D array in which each element is the health
of the enemy unit at that location.
The global state is fed as input into the convolutional neural network used by DDQN
and DDDQN to calculate the Q-values.
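A minimal sketch of how such a state could be assembled from PySC2's feature layers, assuming the PySC2 2.0 observation interface; the function name is ours, and only the `player_relative` and `unit_hit_points` layers come from the library:

```python
import numpy as np
from pysc2.lib import features

_ENEMY = features.PlayerRelative.ENEMY  # value 4 in the player_relative layer

def build_global_state(obs):
    """Return player_relative while no enemy is visible, otherwise a map of
    enemy hit points (zero everywhere an enemy is not standing)."""
    screen = obs.observation.feature_screen
    player_relative = np.asarray(screen.player_relative)
    if not (player_relative == _ENEMY).any():
        return player_relative
    hit_points = np.asarray(screen.unit_hit_points)
    return np.where(player_relative == _ENEMY, hit_points, 0)
```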
4.3 Action Space
All three mini-games are limited to two actions: Move_screen and Attack_screen. Both
actions take the coordinates of a pixel on the feature screen, which can be anywhere on
the screen. Move_screen moves the selected units to the specified location; they will
not attack any enemy units until they reach the destination, although if the specified
location is within line of sight and an enemy unit occupies it, they will attack that unit.
Attack_screen moves the selected units to the specified location, and the units will
attack any enemy units within line of sight along the way.
Note that all units are selected with the action select_army; there is no drag-selection
or clicking on individual units. A sketch of how these calls are issued through PySC2
appears after Figure 2.
Figure 2: The light green box on the left of the figure shows the available positions to
choose from. The camera on this map is fixed.
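For illustration, the snippet below shows how select_army and the two movement actions are issued through PySC2's action interface; the coordinates are placeholders:

```python
from pysc2.lib import actions

_NOT_QUEUED = [0]  # execute immediately rather than queueing behind other orders
_SELECT_ALL = [0]

# Select every unit the agent owns (no drag-select or individual clicks).
select_all = actions.FunctionCall(actions.FUNCTIONS.select_army.id, [_SELECT_ALL])

# On a later step: move or attack-move the selection to feature-screen pixel (x, y).
x, y = 42, 17  # any coordinate on the 84 x 84 feature screen
move = actions.FunctionCall(actions.FUNCTIONS.Move_screen.id, [_NOT_QUEUED, [x, y]])
attack = actions.FunctionCall(actions.FUNCTIONS.Attack_screen.id, [_NOT_QUEUED, [x, y]])
```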
4.4 DDDQN neural network and output
The DDDQN implementation is based on Ray Heberer's implementation of DDQN;
the GitHub repository for that implementation can be found in the Appendix. After the
output of the convolutional neural network is flattened, the result is fed into two
separate fully connected networks.
The first network calculates the value of the state. Its first layer takes the flattened
result as input and is fully connected to a hidden layer with 1024 neurons, which is in
turn fully connected to a single output: the value of the state V(s).
The second network calculates the advantage of taking each action compared to all
other actions. Its first layer takes the flattened result as input and is fully connected to
a hidden layer with 1024 neurons, which is fully connected to an output layer of the
same size as the input layer. This network predicts the advantage A(s, a) of each action.
The Q-value then follows the standard dueling combination: the average advantage
over all actions in the current state is subtracted from the advantage of the chosen
action, and the result is added to the value of the current state:

Q(s, a) = V(s) + A(s, a) − (1/|A|) Σa′ A(s, a′)
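The following is a minimal sketch of these two streams and how they combine, written against a Keras-style TensorFlow API; the variable names and the use of ReLU activations are our assumptions, not details confirmed by the implementation:

```python
import tensorflow as tf

def dueling_head(flattened, num_actions, hidden=1024):
    """Split flattened conv features into value and advantage streams."""
    # Value stream: flattened features -> 1024 hidden units -> scalar V(s).
    v = tf.keras.layers.Dense(hidden, activation="relu")(flattened)
    value = tf.keras.layers.Dense(1)(v)
    # Advantage stream: flattened features -> 1024 hidden units -> A(s, a).
    a = tf.keras.layers.Dense(hidden, activation="relu")(flattened)
    advantage = tf.keras.layers.Dense(num_actions)(a)
    # Q(s, a) = V(s) + (A(s, a) - mean over a' of A(s, a')).
    return value + (advantage - tf.reduce_mean(advantage, axis=1, keepdims=True))
```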
4.5 Training Termination
The DDQN and DDDQN training methods train an agent on one mini-game at a time.
The number of episodes is not fixed in advance for each mini-game; instead, there are
three ways the training of an agent is terminated. First, training can be terminated by
an error: the terminal prints "MemoryError" and the training ends (more about this
error can be found in the file errormessage.md in the GitHub repository linked in the
Appendix). Second, training can be terminated by inspecting the score graph generated
by TensorBoard. The third method requires the score graph of a training run already
finished on the same mini-game: e.g. when a DDQN agent has finished training on
CollectMineralShards and a DDDQN agent is currently training on
CollectMineralShards, the second run is terminated when its number of episodes is
approximately the same.
4.6 Experiments
4.6.1 Number of hidden neurons in DDDQN
In DDDQN, choosing the number of hidden neurons for the two fully connected
networks was not trivial. Two agents were trained on the same mini-game,
DefeatRoaches. The first agent had 512 hidden neurons in the hidden layer of both
networks; the second agent likewise had 1024.
4.6.2 Global state decision
When training a reinforcement learning agent, the choice of state representation is
important, and different states can produce different results. With player_relative as
the state, it was observed after the score had converged that, whenever enemies were
on the screen, the agent would keep clicking on all the units. Therefore, an experiment
was run with two DDDQN agents training on the same mini-game, DefeatRoaches,
one using the state player_relative and the other enemy_hp.
4.6.3 Scores
To evaluate which training method is better, a DDDQN agent and a DDQN agent use
the same state and are trained separately on each of the three mini-games.
5 Results
Note that all graphs are generated by TensorBoard and smoothed with a factor of 0.9
for better visibility. The x-axis is the number of episodes the agent has trained for; the
y-axis is the score the agent achieved in that episode.
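For clarity, smoothing with weight 0.9 is an exponential moving average; the sketch below shows approximately what TensorBoard's smoothing slider computes on the raw per-episode scores:

```python
def smooth(values, weight=0.9):
    """Exponential moving average over a list of per-episode scores."""
    smoothed, last = [], values[0]
    for v in values:
        last = weight * last + (1 - weight) * v  # blend new value into the average
        smoothed.append(last)
    return smoothed
```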
5.1 Number of hidden neurons in DDDQN
After training the two agents, the score graphs for both are shown below.
Figure 3: Left graph is the performance of the DDDQN agent with 1024 hidden neurons;
Right graph is the performance of the DDDQN agent with 512 hidden neurons. (x-axis:
number of episodes, y-axis: score achieved at the corresponding episode)
Table 1: The max and mean scores of the two agents with different numbers of hidden
neurons.

Number of hidden neurons | Max score | Mean score
512                      | 211.00    | 25.52
1024                     | 335.00    | 79.00
5.2 Global state decision
After training the two DDDQN agents with different states, their scores are shown
below.
Figure 4: Left graph is the performance of the DDDQN agent with the state as
enemy_hp; Right graph is the performance of the DDDQN agent with the state
player_relative. (x-axis: number of episodes, y-axis: score achieved at the
corresponding episode)
Table 2: The max and mean scores of the two agents with two different states.

State name      | Max score | Mean score
player_relative | 81.00     | 21.40
enemy_hp        | 335.00    | 79.00
5.3 Scores
After training the two agents on each of the mini-games, the scores are presented below.
Figure 5: After training on CollectMineralShards. Left graph is the performance of
DDDQN and the right graph is the performance of DDQN. (x-axis: number of episodes,
y-axis: score achieved at the corresponding episode)
Table 3: The max and mean scores of the two agents trained on CollectMineralShards.

Training method | Max score | Mean score
DDDQN           | 90.00     | 66.36
DDQN            | 111.00    | 65.11
Figure 6: After training on DefeatRoaches. Left graph is the performance of DDDQN
and the right graph is the performance of DDQN. (x-axis: number of episodes, y-axis:
score achieved at the corresponding episode)
Table 4: The max and mean scores of the two agents trained on DefeatRoaches.

Training method | Max score | Mean score
DDDQN           | 335.00    | 79.00
DDQN            | 302.00    | 35.32
Figure 7: After training on DefeatZerglingsAndBanelings. Left graph is the
performance of DDDQN and the right graph is the performance of DDQN. (x-axis:
number of episodes, y-axis: score achieved at the corresponding episode)
Table 5: The max and mean scores of the two agents trained on
DefeatZerglingsAndBanelings.

Training method | Max score | Mean score
DDDQN           | 89.00     | 17.18
DDQN            | 88.00     | 21.90
6 Discussion
6.1 Mini-games behavior
Note that the mini-game behavior was observed on the feature layers before terminating
the training, not in actual gameplay, under the assumption that the score had already
converged. Replays are available in this project's GitHub repository.
6.1.1 CollectMineralShards
This is one of the simplest games in the mini-game list, and there was no big difference
in how the two agents behaved. From the table above, although the DDQN agent has a
higher max score, this could simply be luck, since the mineral shards spawn randomly
around the units.
6.1.2 DefeatRoaches
The behavior of the two agents differs in this game. The DDDQN agent was able to
learn how to focus-fire a single unit at a time: after an opponent unit dies, it performs
Attack_screen on the next opponent in line and continues. According to the graph, it
learned this technique at around episode 200. The score is not consistent because of
the angle at which the units move into the Roaches. By default, the Roaches attack the
first unit they can see, and that unit is destroyed within two shots from each Roach,
which leads to a lower score, since the agent's army cannot win if it loses a unit too
early. Another interesting observation was that the agent did not repeatedly click on
the focus-fired Roach or perform other random actions: between ordering an attack on
a Roach and that Roach's hit points reaching zero, it does not click on anything.
The DDQN agent's performance was not ideal. It could not learn to focus-fire within
the same number of episodes the DDDQN agent took; it constantly clicked on one or
two Roaches, which quickly led to the death of the agent's army. Given enough time it
could possibly achieve results like the DDDQN agent's, but this is not worth pursuing,
since the DDDQN agent learned in half the time and performs better.
6.1.3 DefeatZerglingsAndBanelings
Looking at the graph, neither agent was able to perform well, but they had one behavior
in common: running away from the opponent's army. In the early stage of the game,
the agent's army attacks the opponent's units. The problem is that the opponent has 4
Banelings, which explode at close range and deal a large amount of damage to light-
armored units such as Marines. After the Banelings have killed or damaged most of the
units, the Zerglings come and destroy the rest.
After some trial and error, the DDDQN agent learned that it does not have a large
enough action space to deal with the opponent, so it started fleeing to the top-left corner
of the screen to avoid the opponent. The max score DDDQN obtained was most likely
from early in training, when it still tried to fight. In the later part of training it was still
collecting some score when the expected score should have been zero. The reason is
that when an episode restarts, there is a chance the agent's army spawns on the right
side of the map while the opponent spawns on the left; when the agent then flees toward
the top-left corner, it runs into combat and may trigger one or two Banelings to explode,
which grants it some reward. The evidence that the agent runs into the opponent is that
the agent's scores are mostly zero in the left graph of Figure 7.
The DDQN agent was not able to learn that its number of available actions is limited;
even after spending around twice as many episodes as the DDDQN agent, it was still
unable to defeat or run away from the opponent.
6.2 Action Space
The objective of this paper is to show that even with a limited action space, an agent
can still be trained to solve a problem given the right training method. HRL is about
breaking problems down into smaller problems, and smaller problems should be
solvable with a small action space. This is why the available actions were limited to
only Move_screen and Attack_screen.
6.3 Number of hidden neurons in DDDQN
The agent with more hidden neurons had a much higher max and mean score than the
agent with fewer hidden neurons. Therefore, all agents trained with the DDDQN
method use 1024 hidden neurons. The values 512 and 1024 were chosen because they
are powers of two, which are convenient sizes to store and a common convention for
layer widths. After this experiment, no further experiments were run to find the optimal
number of hidden neurons.
6.4 Global State Decision
In section 5.2, it can be observed that the DDDQN agent performs much better with
the state enemy_hp than with player_relative. One explanation is that with enemy_hp
the agent learns that it receives a reward when the hp of an enemy reaches 0, and the
fastest way to achieve that is to focus-fire one unit at a time. With player_relative, by
contrast, the value 4, which stands for ENEMY, remains at a position until the unit is
destroyed. The agent with the state player_relative seems to know that clicking on
positions labelled 4 returns a reward, but it cannot learn that the fastest strategy is to
focus-fire a single unit, because the label 4 does not decrease while the enemy's hp
does.
7 Conclusion
It is possible to create agents with only one or two available actions and a small action
space. Learning with DDDQN was more efficient than with DDQN. The representation
chosen for the game state is also an important aspect of an agent's learning efficiency:
non-static values appear to speed up learning.
8 Future Work
Further investigation is needed to find the optimal number of hidden neurons in the
neural network for the DDDQN training algorithm, and more time may be needed for
the agents to learn each mini-game fully under both training methods.
Although DDDQN is a better training method than DDQN, according to DeepMind,
Rainbow DQN is the most efficient learning algorithm among DDQN, DDDQN,
prioritized deep Q-learning, etc. [7] It would be interesting to compare a DDDQN agent
with a Rainbow DQN agent to see how an agent learns using Rainbow DQN as the
training method.
The next objective is to develop an HRL agent from pre-trained agents that solve the
smaller problems. Connecting these pre-trained agents in a hierarchical structure with
a neural network could be a solution, but this needs further investigation before any
claims can be made.
9 Acknowledgement
Special thanks to Dr Penny Kyburz and Ray Heberer.
10 References
[1] V. Mnih et al., "Playing Atari with Deep Reinforcement Learning", 2013.
Available: https://arxiv.org/pdf/1312.5602v1.pdf. [Accessed 31 May 2019].
[2] D. Silver et al., "Mastering the game of Go without human knowledge", Nature,
vol. 550, no. 7676, pp. 354-359, 2017. Available: 10.1038/nature24270.
[3] T. Le, N. Vien and T. Chung, "A Deep Hierarchical Reinforcement Learning
Algorithm in Partially Observable Markov Decision Processes", IEEE Access, vol. 6,
pp. 49089-49102, 2018. Available: 10.1109/access.2018.2854283.
[4] "AlphaStar: Mastering the Real-Time Strategy Game StarCraft II | DeepMind",
DeepMind, 2019. [Online]. Available: https://deepmind.com/blog/alphastar-mastering-
real-time-strategy-game-starcraft-ii/. [Accessed: 31 May 2019].
[5] O. Vinyals et al., "StarCraft II: A New Challenge for Reinforcement Learning",
2017. [Accessed 31 May 2019].
[6] Z. Pang et al., "On Reinforcement Learning for Full-length Game of StarCraft",
2019. [Accessed 31 May 2019].
[7] M. Hessel et al., "Rainbow: Combining Improvements in Deep Reinforcement
Learning", 2017. [Accessed 31 May 2019].
[8] "The Promise of Hierarchical Reinforcement Learning", The Gradient, 2019.
[Online]. Available: https://thegradient.pub/the-promise-of-hierarchical-
reinforcement-learning/. [Accessed: 31 May 2019].
[9] "What Killed the Curse of Dimensionality? – Camron's Blog", Camron.xyz, 2017.
[Online]. Available: http://camron.xyz/index.php/2017/09/06/what-killed-the-curse-
of-dimensionality. [Accessed: 31 May 2019].
11 Appendix
Link to this project's GitHub repository: https://github.com/ZestyVesty/SC2Agents
Link to the implementation of DDQN: https://github.com/rayheberer/SC2Agents

Table 6: The training environment.

OS/Package | Version
Ubuntu     | 19.04
PySC2      | 2.0.2
conda      | 4.6.14
Python     | 3.7.3

For the pre-trained models and other data, please contact the author by email.