
Playing DOOM using Deep Reinforcement Learning {tdhoot, danikhan, bkonyi}@stanford.edu

Problem Statement
Teach an agent to perform well in various scenarios in the video game classic DOOM. Maximize rewards in our training scenarios using only state that would be available to a human player (images, items, health). Train the agent to perform well in the "seek-and-destroy" and "health gathering" scenarios.

DQN Architecture
First convolutional layer: 32 filters, 7x7 kernel, stride of 1, ReLU
Second convolutional layer: 32 filters, 5x5 kernel, stride of 1, ReLU
Max-pooling: 2x2 window, stride of 2 (after each convolutional layer)
First fully connected layer: 1024 output units; 38,400 inputs from the processed images, plus our scalar states
Second fully connected layer: 512 output units
Linear output layer: one output per valid action
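A minimal PyTorch sketch of this architecture. The "same" padding on the convolutions and the two scalar state inputs (ammunition, health) are assumptions; with that padding a 120x160 grayscale frame flattens to the 38,400 features mentioned above.

```python
import torch
import torch.nn as nn

class DoomDQN(nn.Module):
    def __init__(self, n_actions: int, n_scalars: int = 2):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(1, 32, kernel_size=7, stride=1, padding=3), nn.ReLU(),   # 1x120x160 -> 32x120x160
            nn.MaxPool2d(kernel_size=2, stride=2),                             # -> 32x60x80
            nn.Conv2d(32, 32, kernel_size=5, stride=1, padding=2), nn.ReLU(),  # -> 32x60x80
            nn.MaxPool2d(kernel_size=2, stride=2),                             # -> 32x30x40 = 38,400 features
        )
        self.fc = nn.Sequential(
            nn.Linear(32 * 30 * 40 + n_scalars, 1024), nn.ReLU(),
            nn.Linear(1024, 512), nn.ReLU(),
            nn.Linear(512, n_actions),  # linear output layer: one output per valid action
        )

    def forward(self, frames: torch.Tensor, scalars: torch.Tensor) -> torch.Tensor:
        x = self.conv(frames).flatten(start_dim=1)
        return self.fc(torch.cat([x, scalars], dim=1))
```

For example, `DoomDQN(n_actions=3)(torch.zeros(8, 1, 120, 160), torch.zeros(8, 2))` returns an (8, 3) tensor of Q-values, one per valid action.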

Dataset, Features & Training
The data used to train our DQN consisted of raw pixel data and game state that would be readily available to a human player (current ammunition count, health).
Frame size: 640x480, scaled down to 160x120 (RGB and grayscale)
Reward shaping: encouraged movement by giving rewards based on distance moved
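A sketch of the frame preprocessing step, assuming screens arrive as 480x640x3 RGB arrays (e.g., from a ViZDoom-style screen buffer); the exact pipeline and normalization are assumptions.

```python
import cv2
import numpy as np

def preprocess_frame(frame: np.ndarray) -> np.ndarray:
    """Convert a 480x640x3 RGB screen to a 120x160 grayscale float32 frame in [0, 1]."""
    gray = cv2.cvtColor(frame, cv2.COLOR_RGB2GRAY)
    small = cv2.resize(gray, (160, 120), interpolation=cv2.INTER_AREA)  # dsize is (width, height)
    return small.astype(np.float32) / 255.0
```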

$$\mathrm{Loss} = \Big(Q(s,a) - \big(r + \gamma \max_{a'} Q(s', a')\big)\Big)^2$$
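A sketch of this squared TD-error loss on a replay batch. The separate target network, terminal-state masking, and tensor names are assumptions rather than details from the poster.

```python
import torch
import torch.nn.functional as F

def dqn_loss(q_net, target_net, states, actions, rewards, next_states, dones, gamma=0.99):
    """Mean of (Q(s,a) - (r + gamma * max_a' Q_target(s', a')))^2 over the batch."""
    q_sa = q_net(states).gather(1, actions.unsqueeze(1)).squeeze(1)
    with torch.no_grad():
        next_q = target_net(next_states).max(dim=1).values
        target = rewards + gamma * (1.0 - dones) * next_q  # no bootstrap on terminal states
    return F.mse_loss(q_sa, target)
```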

Network Variations
We extended the baseline DQN [2] in three ways:
Double DQN [1]: split action selection and action evaluation across two networks to reduce overestimation (target shown below).
Dueling DQN [3]: split the network into 2 streams, a value function for state s, V(s), and an advantage function for each action a in state s, A(s, a).
Multi-Frame States: the state consists of four frames instead of a single frame, which captures movement in time (see the sketch below).
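A minimal sketch of the multi-frame state variant, assuming the last four preprocessed frames are stacked along the channel axis (the class name and bookkeeping are assumptions).

```python
from collections import deque
import numpy as np

class FrameStack:
    """Maintain the last k preprocessed frames as one stacked state."""

    def __init__(self, k: int = 4):
        self.frames = deque(maxlen=k)

    def reset(self, first_frame: np.ndarray) -> np.ndarray:
        # At episode start, fill the stack with copies of the first frame.
        for _ in range(self.frames.maxlen):
            self.frames.append(first_frame)
        return np.stack(self.frames, axis=0)  # shape (4, 120, 160)

    def step(self, frame: np.ndarray) -> np.ndarray:
        self.frames.append(frame)
        return np.stack(self.frames, axis=0)
```

With this variant the network's first convolution takes 4 input channels instead of 1.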

[Figures: Dueling DQN architecture diagram; screenshots of the Seek & Destroy and Health Gathering scenarios.]

Double DQN target:
$$Y_t^{\mathrm{DoubleDQN}} = R_{t+1} + \gamma\, Q\big(S_{t+1},\ \arg\max_a Q(S_{t+1}, a; \theta_t);\ \theta_t^-\big)$$

Dueling Q-value decomposition:
$$Q(s,a;\theta,\alpha,\beta) = V(s;\theta,\beta) + A(s,a;\theta,\alpha) - \frac{1}{|\mathcal{A}|}\sum_{a'} A(s,a';\theta,\alpha)$$
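A PyTorch sketch of the Double DQN target and the dueling aggregation above; the network and tensor names are assumptions, not the authors' code.

```python
import torch

def double_dqn_target(online_net, target_net, rewards, next_states, dones, gamma=0.99):
    """Y = R + gamma * Q_target(S', argmax_a Q_online(S', a)) for non-terminal S'."""
    with torch.no_grad():
        best_a = online_net(next_states).argmax(dim=1, keepdim=True)   # action selection: online network
        next_q = target_net(next_states).gather(1, best_a).squeeze(1)  # action evaluation: target network
    return rewards + gamma * (1.0 - dones) * next_q

def dueling_aggregate(value: torch.Tensor, advantage: torch.Tensor) -> torch.Tensor:
    """Q(s,a) = V(s) + A(s,a) - mean_a' A(s,a'); value is (B, 1), advantage is (B, |A|)."""
    return value + advantage - advantage.mean(dim=1, keepdim=True)
```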

Results

[Figure: Health Gathering validation reward vs. epoch (roughly 250 to 700) for Dueling, Double, and Double Dueling, each with and without reward shaping.]

Discussion
After each training epoch (50 at most), we performed a validation run with a deterministic policy using the trained DQN. The DQN performed very well on seek & destroy, approaching a validation reward above 60 (quickly killing the monster, worth +99 points). Results for health gathering were mixed overall: reward shaping improved performance somewhat, double dueling performed the best, and there was large variation with reward shaping. Training loss converged quickly (within a few epochs), and it converged in half that time with the dueling network.

References
[1] H. van Hasselt, A. Guez, and D. Silver, "Deep Reinforcement Learning with Double Q-learning." https://arxiv.org/abs/1509.06461
[2] V. Mnih et al., "Playing Atari with Deep Reinforcement Learning." https://www.cs.toronto.edu/~vmnih/docs/dqn.pdf
[3] Z. Wang, T. Schaul, M. Hessel, H. van Hasselt, M. Lanctot, and N. de Freitas, "Dueling Network Architectures for Deep Reinforcement Learning." https://arxiv.org/abs/1511.06581

[Figure: Seek & Destroy validation reward vs. epoch (roughly -400 to 100) for Double, Double Dueling, and Double Dueling without 4-frame states.]