
Deep Imitation Learning for Playing Real Time Strategy Games

Chuanbo Pan

[email protected]

Jeffrey Barratt

[email protected]

Motivation

Competitive computer games, despite recent progress in the area, remain a largely unexplored application of Machine Learning (ML), Artificial Intelligence (AI), and Computer Vision. Real Time Strategy games such as StarCraft II provide an ideal testing environment for AI and ML techniques, as:
•  They run in real time and involve incomplete information (the player cannot see the whole battlefield at once).
•  Users must balance making units, controlling those units, executing a strategy, and hiding information from their enemy to successfully win a game of StarCraft II.
•  Simple actions made early on in the game can greatly impact later stages of the game.

Challenges

Apart from the reasons mentioned in the motivation, StarCraft II is a challenging game because:
•  The action space is huge: there are O(n^4) possible selections (a drag-selection is defined by two corner points on an n x n screen) and O(n^2) possible places to move to.
•  The game is in real time: the agent can't spend forever thinking!
•  The game is huge and multi-layered.
   o  The agent has to consider attacking, producing, gathering, etc.
   o  Our mini-game already includes tasks such as moving to the beacon (enemy), attacking, and moving back.
•  Overfitting the data is an easy trap to fall into.

Problem Definition

[Figure: four panels of the mini-game. 1) The starting position of the PySC2-provided mini-game. 2) The marines can start by attacking the more dangerous targets first. 3) Splitting the marines mitigates the effect of splash damage. 4) Kiting can be used to trade more efficiently with the Zerglings.]

We focus our efforts on one mini-game which involves "splitting": a process of engaging area-of-effect units with a group of low-health ranged units, where the low-health units are moved apart so that any single area-of-effect hit damages as few units at once as possible, alongside other management tactics such as kiting and targeting. We collected data by playing the mini-game repeatedly and recording all actions taken and the state they were made in, yielding around 10,000 state-action pairs.
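The recording side can be as simple as the sketch below; the hook name record_pair and the file name are hypothetical, standing in for however the human player's actions are intercepted:

import pickle

dataset = []  # demonstration (state, action) pairs gathered from human play

def record_pair(obs, action):
    """Append one demonstration pair; called every time the human acts."""
    dataset.append((obs, action))

# After enough games (around 10,000 pairs in our case), persist to disk:
with open("splitting_demos.pkl", "wb") as f:
    pickle.dump(dataset, f)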

Features and Model

We use the newly-released PySC2 [1] framework to interface with the StarCraft II environment. This API gives us the following channels to communicate with StarCraft II:
•  A state feature vector that represents what the agent currently sees on the screen.
   o  17 84x84 feature layers, each representing a different aspect of the game (health, alliance status, etc.).
   o  Because they preserve the spatial layout of the screen, feature layers act as "images" of what's currently seen.
•  A set of available actions that the agent can legally make in the current state.

We wrote code to convert the features and actions from StarCraft II to the network and vice versa (see the sketch below). Our approach utilizes Deep Behavioral Cloning: we take advantage of the image-like nature of the state by using a Convolutional Neural Network (CNN).
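A minimal sketch of the observation-side conversion, assuming a dict-like PySC2 observation; the field names "feature_screen" and "available_actions" and the shapes are our assumptions for illustration, not guaranteed to match every PySC2 version:

import numpy as np

def obs_to_network_input(obs, n_total_actions):
    """Stack the 17 84x84 feature layers into one image-like array and
    build a 0/1 mask over action IDs so illegal actions can be ruled out."""
    screen = np.asarray(obs["feature_screen"], dtype=np.float32)  # (17, 84, 84), assumed key
    legal = np.zeros(n_total_actions, dtype=np.float32)
    legal[list(obs["available_actions"])] = 1.0   # assumed key: action IDs legal this step
    return screen, legal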

The network outputs scores for each possible action (select, attack, move, no-op, etc.) as well as a best guess for each of their parameters in the current state, which will for the most part be the coordinates of where to attack, select, move, etc.
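As a concrete illustration of this two-headed output, here is a minimal behavioral-cloning network of that shape. PyTorch, the layer sizes, and all names are our assumptions for illustration; the poster does not specify the exact architecture:

import torch
import torch.nn as nn

class BCNet(nn.Module):
    """Sketch of a behavioral-cloning policy over stacked feature layers.

    Outputs logits over action types (select, attack, move, no-op, ...)
    plus logits over screen coordinates, the spatial parameter shared by
    most of those actions.
    """

    def __init__(self, n_layers=17, n_actions=5):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(n_layers, 16, kernel_size=5, padding=2), nn.ReLU(),
            nn.Conv2d(16, 32, kernel_size=3, padding=1), nn.ReLU(),
        )
        self.action_head = nn.Linear(32 * 84 * 84, n_actions)  # one score per action type
        self.spatial_head = nn.Conv2d(32, 1, kernel_size=1)    # one logit per screen pixel

    def forward(self, screen):                        # screen: (B, 17, 84, 84)
        h = self.conv(screen)                         # (B, 32, 84, 84)
        action_logits = self.action_head(h.flatten(1))
        spatial_logits = self.spatial_head(h).flatten(1)       # (B, 84*84)
        return action_logits, spatial_logits

Behavioral cloning then reduces to supervised learning on the recorded pairs, e.g. cross_entropy(action_logits, demo_action) + cross_entropy(spatial_logits, demo_y * 84 + demo_x).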

Results

Discussion and Future Work

We can observe several things from our results:
•  The theoretical score ranges from -9 to 750+. Obviously, we are not near the maximum potential.
•  The learning curve indicates that we are improving our score over random play but still suffer from overfitting.
•  The agent sometimes gets stuck moving back and forth between two states.
•  The agent doesn't seem to have a full understanding of what it's doing; it just mimics what looks right.

Future work, were we to have 6 months to work on it:
•  Using an LSTM Recurrent Neural Network (RNN) to help determine actions, especially in sequence (see the sketch after this list).
•  Refining our abstraction to ensure stability during training.
•  Gathering more training data to cover more possible cases.
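For the LSTM direction, one way to wire it is to run a per-frame CNN embedding through a recurrent core before the action head, so the policy conditions on recent history rather than a single frame. This is a sketch under assumed PyTorch and assumed sizes, not a tested design:

import torch
import torch.nn as nn

class RecurrentBCNet(nn.Module):
    """Sketch: CNN frame encoder followed by an LSTM core."""

    def __init__(self, n_layers=17, n_actions=5, hidden=256):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Conv2d(n_layers, 16, kernel_size=5, padding=2), nn.ReLU(),
            nn.AdaptiveAvgPool2d(8), nn.Flatten(),    # -> (B*T, 16*8*8)
        )
        self.lstm = nn.LSTM(16 * 8 * 8, hidden, batch_first=True)
        self.action_head = nn.Linear(hidden, n_actions)

    def forward(self, screens):              # screens: (B, T, 17, 84, 84)
        b, t = screens.shape[:2]
        feats = self.encoder(screens.flatten(0, 1)).view(b, t, -1)
        out, _ = self.lstm(feats)             # (B, T, hidden)
        return self.action_head(out)          # per-step action-type logits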

[Figure: the graphs on top show training loss during training with and without regularization; the graphs on the bottom show average scores for the mini-game during training.]

References

[1] Vinyals, Oriol, et al. "StarCraft II: A New Challenge for Reinforcement Learning." arXiv preprint arXiv:1708.04782 (2017). [Online]. Available: https://arxiv.org/abs/1708.04782
