GoGoGo: Improving Deep Neural Network Based Go Playing AI with Residual Networks
Xingyu Liu

Introduction
[Fig. 2 (a) Policy Network: input (19x19x5) followed by a stack of 3x3, 64-channel convolutions (CONV3, 64), ending in an output layer producing move probabilities P.]
[Fig. 2 (b) Value Network: input (19x19x5) followed by a stack of 3x3, 64-channel convolutions (CONV3, 64).]
Network Architecture

Experiment Result
• Go-playing AIs using traditional search:
GNU Go, Pachi, Fuego, Zen, etc.
Training Methodology and Data
• From Kifu to Input Feature Maps
Channels: 1) empty (space) positions; 2) black positions; 3) white positions; 4) current player; 5) ko positions
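As a sketch of how these five channels could be encoded (the function name, board encoding, and ko representation are assumptions for illustration, not taken from the poster):

```python
import numpy as np

BOARD = 19  # 19x19 Go board
EMPTY, BLACK, WHITE = 0, 1, 2

def board_to_features(board, to_play, ko_point=None):
    """Encode a board into the 5 channels listed above:
    1) empty positions, 2) black stones, 3) white stones,
    4) current player (constant plane), 5) ko position."""
    feats = np.zeros((BOARD, BOARD, 5), dtype=np.float32)
    feats[:, :, 0] = (board == EMPTY)
    feats[:, :, 1] = (board == BLACK)
    feats[:, :, 2] = (board == WHITE)
    feats[:, :, 3] = 1.0 if to_play == BLACK else 0.0
    if ko_point is not None:  # mark the single illegal ko recapture point
        feats[ko_point[0], ko_point[1], 4] = 1.0
    return feats

# Example: two stones on the board, black to play, a ko point at (2, 3)
board = np.zeros((BOARD, BOARD), dtype=np.int8)
board[3, 3] = BLACK
board[15, 15] = WHITE
x = board_to_features(board, to_play=BLACK, ko_point=(2, 3))
```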
References
[1] David Silver et al., “Mastering the game of Go with deep neural networks and tree search”, Nature, 529:484–489, 2016.
[2] Kaiming He et al., “Deep residual learning for image recognition”, CoRR, abs/1512.03385, 2015.
SL on Policy Network → RL on Policy Network → RL on Value Network
• Dynamic Board State Expansion
Supports ko-fight handling; saves disk space and keeps memory usage small.
• Use the Ing Chang-ki rules
Board state + ko is the full game state, so there is no need to remember the number of captured stones.
• Two Levels of Batches (kifus, moves)
Random shuffling with small memory usage and good locality.
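A minimal sketch of this two-level batching scheme, assuming kifus are stored as lists of moves (the function name and batch sizes are illustrative): shuffle whole games first, then shuffle the moves of only a small window of games, so memory stays bounded by the window while moves are still well mixed.

```python
import random

def two_level_batches(kifus, kifu_batch=16, move_batch=128, seed=0):
    """Level 1: shuffle kifus (whole games). Level 2: load kifu_batch
    games at a time, shuffle their moves, and yield move minibatches.
    Only kifu_batch games need to be in memory at once."""
    rng = random.Random(seed)
    order = list(range(len(kifus)))
    rng.shuffle(order)                       # level 1: shuffle games
    for i in range(0, len(order), kifu_batch):
        moves = [m for k in order[i:i + kifu_batch] for m in kifus[k]]
        rng.shuffle(moves)                   # level 2: shuffle moves
        for j in range(0, len(moves), move_batch):
            yield moves[j:j + move_batch]
```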
• Powered by deep learning:
Zen → Deep Zen Go, darkforest, AlphaGo
• Goal: from a vanilla CNN to ResNets
[Fig. 2 (c) Residual Module: Data → CONV3 → Batch Norm → ReLU → CONV3 → Batch Norm → eltwise add with skip connection → ReLU]
FC: 23104 → 1
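A minimal NumPy sketch of the residual module, under stated assumptions: batch norm is approximated by per-channel standardization, and the elementwise add comes before the final ReLU, following the standard ResNet formulation [2]. The naive convolution and helper names are for illustration only.

```python
import numpy as np

def conv3(x, w):
    """Naive 3x3 'same' convolution, channels-last: (H, W, Cin) -> (H, W, Cout)."""
    h, wid, _ = x.shape
    pad = np.pad(x, ((1, 1), (1, 1), (0, 0)))
    out = np.empty((h, wid, w.shape[-1]), dtype=x.dtype)
    for i in range(h):
        for j in range(wid):
            # sum over the 3x3 window and input channels
            out[i, j] = np.tensordot(pad[i:i + 3, j:j + 3, :], w, axes=3)
    return out

def residual_module(x, w1, w2):
    """Data -> CONV3 -> norm -> ReLU -> CONV3 -> norm -> eltwise add -> ReLU."""
    def norm(t):  # stand-in for batch norm: per-channel standardization
        mu = t.mean(axis=(0, 1), keepdims=True)
        sd = t.std(axis=(0, 1), keepdims=True) + 1e-5
        return (t - mu) / sd
    y = np.maximum(norm(conv3(x, w1)), 0.0)
    y = norm(conv3(y, w2))
    return np.maximum(x + y, 0.0)  # skip connection, then final ReLU
```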
(b) Hyperparameters
Hyperparameter       Value
Base learning rate   2e-4
Decay policy         exponential
Decay rate           0.95
Decay step (kifu)    200
Loss function        softmax
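With the values from the table, the exponential decay schedule can be sketched as follows; whether the decay is applied continuously or in discrete staircase steps is an assumption here, not stated on the poster.

```python
def learning_rate(kifu_index, base_lr=2e-4, decay_rate=0.95, decay_step=200):
    """Exponential decay per the table: lr = base * rate ** (kifu / step)."""
    return base_lr * decay_rate ** (kifu_index / decay_step)
```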
Future Work

Fig. 1 Explicit ko-fight expansion
• Training Accuracy ~ 32%
• Testing Accuracy ~ 26%
Fig. 3 GoGoGo plays against itself, policy network only
[Plot: supervised learning training loss (y-axis roughly 0–7) vs. training iterations, one point per 100 batches.]
• Monte Carlo Tree Search
a_t = argmax_a ( Q(s_t, a) + u(s_t, a) ),  where u(s, a) ∝ P(s, a) / (1 + N(s, a))
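The selection rule above can be sketched as follows; the node layout (a dict of per-action edge statistics Q, P, N) and the exploration constant c_puct are assumptions for illustration.

```python
def select_action(node, c_puct=5.0):
    """Pick a_t = argmax_a (Q(s_t, a) + u(s_t, a)), with the exploration
    bonus u(s, a) proportional to P(s, a) / (1 + N(s, a))."""
    best_a, best_score = None, float('-inf')
    for a, edge in node.items():
        u = c_puct * edge['P'] / (1.0 + edge['N'])  # prior-weighted bonus
        score = edge['Q'] + u
        if score > best_score:
            best_a, best_score = a, score
    return best_a
```

An unvisited edge (N = 0) keeps its full prior bonus, so high-prior moves are tried before heavily visited ones dominate via Q.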
• Reinforcement Learning of Value Network
• Network Architecture Exploration
• Real Match Testing against Human Players