AlphaGo: An AI Go Player based on Deep Neural Networks and Monte Carlo Tree Search
Michael J. Moon
M.Sc. Candidate in Biostatistics
Dalla Lana School of Public Health
University of Toronto
April 7, 2016
Agenda
> Introduction
> Methodologies
> Design
> Discussion
> References
AlphaGo | M.Moon
Introduction | Background

The Game of Go
> Played on a square grid called a board, usually 19 x 19
> Stones – black and white – are placed alternately
> Points awarded for surrounding empty space

Complexity
> Possible number of sequences is a googol1 times greater than in chess
> Viewed as an unsolved “grand challenge” for AI

“the pinnacle of perfect information games”
– Demis Hassabis, Co-founder of DeepMind

1. 1 googol = 10^100

Example of a Go board; shades represent territories
Introduction | Background

> Google DeepMind’s AI Go player1

Oct 2015: 5-0 against Fan Hui
> Victory against the three-time European champion
> First program to win against a professional player in an even game

Mar 2016: 4-1 against Lee Sedol
> Victory against the world’s top player of the past decade
> Awarded the highest Go ranking after the match2

1. Image source: https://deepmind.com/alpha-go.html; 2. Source: http://www.straitstimes.com/asia/east-asia/googles-alphago-gets-divine-go-ranking
Introduction | Overview of the Design

Training pipeline
> 30M human moves train the SL policy network and the fast rollout policy
> The RL policy network is initialized from the SL policy network and improved through self-play
> The RL value network is trained on self-play games
> Monte Carlo tree search combines the networks for move selection

Asynchronous Multi-threaded Search
> 40 search threads
> 48 CPUs
> 8 GPUs

Distributed Version1
> 40 search threads
> 1,202 CPUs
> 176 GPUs

1. Used against Fan Hui; 1,920 CPUs and 280 GPUs against Lee. Source: http://www.economist.com/news/science-and-technology/21694540-win-or-lose-best-five-battle-contest-another-milestone
Methodologies | Deep Neural Network

Deep Learning Architecture
> Multilayer (5–20) stack of simple modules subject to learning

Forward pass with input units $x_i$, hidden units $H_1$ and $H_2$, and output units $y_l$:

$z_j = \sum_{i \in In} w_{ij} x_i, \quad y_j = f(z_j)$
$z_k = \sum_{j \in H_1} w_{jk} y_j, \quad y_k = f(z_k)$
$z_l = \sum_{k \in H_2} w_{kl} y_k, \quad y_l = f(z_l)$
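The forward pass above can be sketched in a few lines of NumPy; the layer sizes and random weights below are arbitrary illustrative choices, not AlphaGo's.

```python
import numpy as np

rng = np.random.default_rng(0)

def relu(z):
    return np.maximum(0.0, z)  # the non-linearity f

# Illustrative layer sizes (assumptions): In(4) -> H1(5) -> H2(3) -> Out(2)
W_ij = rng.normal(size=(4, 5))   # weights w_ij between input units and H1
W_jk = rng.normal(size=(5, 3))   # weights w_jk between H1 and H2
W_kl = rng.normal(size=(3, 2))   # weights w_kl between H2 and output units
x = rng.normal(size=4)           # input units x_i

# Forward pass, mirroring the slide's equations layer by layer
z_j = x @ W_ij;   y_j = relu(z_j)   # z_j = sum_i w_ij x_i,  y_j = f(z_j)
z_k = y_j @ W_jk; y_k = relu(z_k)   # z_k = sum_j w_jk y_j,  y_k = f(z_k)
z_l = y_k @ W_kl; y_l = relu(z_l)   # z_l = sum_k w_kl y_k,  y_l = f(z_l)

print(y_l.shape)  # (2,)
```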
Methodologies | Deep Neural Network

Backpropagation Training
> Trained by simple stochastic gradient descent to minimize error
> Applies the chain rule for derivatives to obtain the gradients

$\frac{\partial E}{\partial z_l} = \frac{\partial E}{\partial y_l} \frac{\partial y_l}{\partial z_l}$
$\frac{\partial E}{\partial y_k} = \sum_{l \in Out} w_{kl} \frac{\partial E}{\partial z_l}, \quad \frac{\partial E}{\partial z_k} = \frac{\partial E}{\partial y_k} \frac{\partial y_k}{\partial z_k}$
$\frac{\partial E}{\partial y_j} = \sum_{k \in H_2} w_{jk} \frac{\partial E}{\partial z_k}, \quad \frac{\partial E}{\partial z_j} = \frac{\partial E}{\partial y_j} \frac{\partial y_j}{\partial z_j}$
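These chain-rule equations translate into a short NumPy sketch of one stochastic gradient descent step. The squared-error loss, ReLU units, and layer sizes are illustrative assumptions, not AlphaGo's actual configuration.

```python
import numpy as np

rng = np.random.default_rng(1)

def relu(z):
    return np.maximum(0.0, z)

def relu_grad(z):
    return (z > 0).astype(float)  # dy/dz for a ReLU unit

# Tiny net In(4) -> H1(5) -> H2(3) -> Out(2); loss E = 0.5*||y_l - t||^2
W_ij = rng.normal(size=(4, 5))
W_jk = rng.normal(size=(5, 3))
W_kl = rng.normal(size=(3, 2))
x, t = rng.normal(size=4), rng.normal(size=2)

# Forward pass
z_j = x @ W_ij;   y_j = relu(z_j)
z_k = y_j @ W_jk; y_k = relu(z_k)
z_l = y_k @ W_kl; y_l = relu(z_l)

# Backward pass: the chain rule, term by term as on the slide
dE_dyl = y_l - t
dE_dzl = dE_dyl * relu_grad(z_l)   # dE/dz_l = dE/dy_l * dy_l/dz_l
dE_dyk = W_kl @ dE_dzl             # dE/dy_k = sum_l w_kl dE/dz_l
dE_dzk = dE_dyk * relu_grad(z_k)
dE_dyj = W_jk @ dE_dzk             # dE/dy_j = sum_k w_jk dE/dz_k
dE_dzj = dE_dyj * relu_grad(z_j)

# One SGD step: w <- w - lr * dE/dw, where dE/dw_kl = y_k * dE/dz_l, etc.
lr = 0.01
W_kl -= lr * np.outer(y_k, dE_dzl)
W_jk -= lr * np.outer(y_j, dE_dzk)
W_ij -= lr * np.outer(x, dE_dzj)
```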
Methodologies | Deep Neural Network

> The rectified linear unit (ReLU), $f(z) = \max(0, z)$, learns faster than other non-linearities
Methodologies | Deep Convolutional Neural Network

Input
Arrays such as signals, images and videos

Local Connections
Each unit connects only to a local patch of the layer below

Shared Weights
Each filter applies common weights and a bias across locations to create a feature map

Non-linearity
Local weighted sums are passed through a non-linearity such as ReLU

Pooling
Coarse-grains the position of each feature, typically by taking the max over neighbouring features

Size and Stride
e.g., a filter of size 3 applied with stride 2

Deep Architecture
Uses stacks of many such layers

Motivated by properties of natural signals
Methodologies | Deep Convolutional Neural Network

Architecture
> Local groups of values are highly correlated
> Local statistics are invariant to location

Properties
> Compositional hierarchy of features
> Invariant to small shifts and distortions due to pooling
> Weights trained through backpropagation
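These building blocks – shared weights, stride, a non-linearity, and pooling – can be sketched in plain NumPy. The toy image, the mean filter, and the sizes are illustrative assumptions:

```python
import numpy as np

def conv2d(img, kernel, stride=1):
    """Valid 2-D convolution (cross-correlation) with one shared kernel."""
    kh, kw = kernel.shape
    oh = (img.shape[0] - kh) // stride + 1
    ow = (img.shape[1] - kw) // stride + 1
    out = np.empty((oh, ow))
    for i in range(oh):
        for j in range(ow):
            patch = img[i*stride:i*stride+kh, j*stride:j*stride+kw]
            out[i, j] = np.sum(patch * kernel)  # same weights at every location
    return out

def max_pool(fmap, size=2):
    """Coarse-grain features by taking the max over non-overlapping patches."""
    h, w = fmap.shape[0] // size, fmap.shape[1] // size
    return fmap[:h*size, :w*size].reshape(h, size, w, size).max(axis=(1, 3))

img = np.arange(36, dtype=float).reshape(6, 6)       # toy single-channel image
kernel = np.ones((3, 3)) / 9.0                       # shared weights: 3x3 mean filter
feat = np.maximum(0.0, conv2d(img, kernel, stride=2))  # convolution + ReLU, stride 2
pooled = max_pool(feat)
print(feat.shape, pooled.shape)  # (2, 2) (1, 1)
```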
Methodologies | Monte Carlo Tree Search

Overview
Finds optimal decisions by:
> Taking random samples in the decision space
> Building a search tree according to the results

Tree Policy – selects and expands nodes, trying to balance exploration and exploitation
Default Policy – simulates playouts from newly added nodes

Selection
Traverse to the most urgent expandable node

Expansion
Add a child node from the selected node

Simulation
Simulate from the newly added node to an outcome with reward $r(s')$

Backpropagation
Back up the simulation result through the selected nodes
Methodologies | Monte Carlo Tree Search

Strengths
> Anytime algorithm – gives a valid solution whenever interrupted
> Values of intermediate states need not be evaluated – domain knowledge is not required
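The four phases above can be sketched as a minimal UCT-style search. This is a generic single-player toy domain, not Go: the state space, reward, and exploration constant are all assumptions for illustration.

```python
import math
import random

class Node:
    def __init__(self, state, parent=None):
        self.state, self.parent = state, parent
        self.children = {}              # action -> child Node
        self.visits, self.value = 0, 0.0

# Toy domain (an assumption): from each state s < 8, actions 0/1 lead to
# states 2s and 2s+1; states >= 8 are terminal with reward in [0, 1].
def actions(s): return [] if s >= 8 else [0, 1]
def step(s, a): return s * 2 + a
def reward(s): return (s % 8) / 7.0

def uct_select(node, c=1.4):
    # Tree policy: balance exploitation (mean value) against exploration
    return max(node.children.values(),
               key=lambda ch: ch.value / ch.visits
                              + c * math.sqrt(math.log(node.visits) / ch.visits))

def mcts(root_state, n_iter=500):
    root = Node(root_state)
    for _ in range(n_iter):
        node = root
        # 1. Selection: descend while fully expanded and non-terminal
        while actions(node.state) and len(node.children) == len(actions(node.state)):
            node = uct_select(node)
        # 2. Expansion: add one unexplored child
        untried = [a for a in actions(node.state) if a not in node.children]
        if untried:
            a = random.choice(untried)
            node.children[a] = Node(step(node.state, a), parent=node)
            node = node.children[a]
        # 3. Simulation: default (random) policy until a terminal state
        s = node.state
        while actions(s):
            s = step(s, random.choice(actions(s)))
        r = reward(s)
        # 4. Backpropagation: back up the result through the selected nodes
        while node:
            node.visits += 1
            node.value += r
            node = node.parent
    # Play the most-visited action at the root
    return max(root.children.items(), key=lambda kv: kv[1].visits)[0]

print(mcts(1))  # best action from state 1
```

Because terminal rewards are higher under state 3 than under state 2, the search concentrates its visits on action 1.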
Design | Problem Setting

> State of the game $s$
> Legal actions $a \in \mathcal{A}(s)$
> Deterministic state transitions
> Reward $r(s)$ for the player at state $s$; zero for all non-terminal states
> Terminal reward $z_T = \pm r(s_T)$ at the end of the game

Value Function
> Unique optimal value function $v^*(s)$ under perfect play by both players

Policy
> Probability distribution $p(a \mid s)$ over legal actions
Design | Rollout Policy

> A fast, linear softmax policy for simulation
> Pattern-based feature inputs
> Trained using 8 million positions
> Less domain knowledge implemented compared to existing MCTS Go programs
> 24.2% prediction accuracy
> A similar policy is used for tree expansion
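A linear softmax over pattern features can be sketched as follows. The feature dimensions, the binary pattern vectors, and the weights are hypothetical stand-ins for the real pattern features:

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(z):
    z = z - z.max()          # subtract max for numerical stability
    e = np.exp(z)
    return e / e.sum()

# Hypothetical setup: each legal action has a binary pattern-based feature
# vector; the policy is a linear softmax over the actions' feature scores.
n_actions, n_features = 5, 10
theta = rng.normal(size=n_features)                                    # learned weights
phi = rng.integers(0, 2, size=(n_actions, n_features)).astype(float)   # pattern features

p = softmax(phi @ theta)          # probability of each legal action
a = rng.choice(n_actions, p=p)    # sample a move during a rollout
print(p.sum())
```

Because the policy is linear in the features, a move can be sampled in microseconds, which is what makes thousands of rollouts per move feasible.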
Design | Neural Network Architectures

Input
> 19 x 19 intersections x 48 feature planes (+1 extra plane for the value network)

Input Feature Space (with respect to the current player)
> Stone colour
> Ones & zeros
> Turns since
> Liberties
> Capture size
> Self-atari size
> Liberties after move
> Ladder capture
> Ladder escape
> Sensibleness

Extra Feature for Value Network
> Player colour
Design | Neural Network Architectures

First Convolution Layer
> Zero-padding to (19+4) x (19+4)
> Filters of kernel size 5 x 5 with stride 1
> ReLU non-linearity
Design | Neural Network Architectures

> Each convolution layer preserves a 19 x 19 output
Design | Neural Network Architectures

Convolution Layers 2–12 (x11)
> Zero-padding to (19+2) x (19+2)
> Filters of kernel size 3 x 3 with stride 1
> ReLU non-linearity
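The padding choices above are exactly what keeps every layer's output at 19 x 19, which the standard convolution output-size formula makes explicit:

```python
def conv_out(n, k, p, s=1):
    """Output size of an n x n input convolved with a k x k kernel,
    p zero-padding on each side, and stride s: (n + 2p - k) // s + 1."""
    return (n + 2 * p - k) // s + 1

# AlphaGo's layers all preserve the 19 x 19 board size:
print(conv_out(19, 5, 2))  # first layer: 5x5 kernel, padded to 23x23 -> 19
print(conv_out(19, 3, 1))  # layers 2-12: 3x3 kernel, padded to 21x21 -> 19
print(conv_out(19, 1, 0))  # final 1x1 kernel, no padding -> 19
```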
Design | Neural Network Architectures

Policy Output
> 1-stride convolution with 1 kernel of size 1 x 1 and a different bias for each intersection
> Softmax function outputs a probability for each of the 19 x 19 intersections

Value Output
> 1-stride convolution with 1 kernel of size 1 x 1
> Fully-connected layer of 256 rectifiers
> Fully-connected tanh layer outputs a single scalar
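The two output heads can be sketched in NumPy. The random feature map and head weights below are illustrative placeholders for the outputs of the trained convolution stack:

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-in for the 19 x 19 plane produced by the final 1x1 convolution
fmap = rng.normal(size=(19, 19))

# Policy head: softmax over all 19 x 19 = 361 intersections
z = fmap.ravel()
policy = np.exp(z - z.max())
policy /= policy.sum()            # a probability for every intersection

# Value head: 256 rectifiers, then a fully-connected tanh unit -> one scalar
W1, b1 = rng.normal(size=(361, 256)), np.zeros(256)
W2, b2 = rng.normal(size=256), 0.0
h = np.maximum(0.0, fmap.ravel() @ W1 + b1)   # 256 rectified hidden units
value = np.tanh(h @ W2 + b2)                  # scalar in (-1, 1)

print(policy.shape, float(value))
```

The tanh squashes the value into (-1, 1), matching the game outcome's +-1 coding, while the softmax ensures the policy is a proper distribution over moves.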
Design | Supervised Learning Policy Network

> Trained using mini-batches of 16 positions randomly selected from 28.4 million positions
> Trained on 50 GPUs over 3 weeks
> Tested with 1 million held-out positions
> 57.0% prediction accuracy
Design | Reinforcement Learning Policy Network

> Trained using self-play between the current network and a randomly selected previous iteration of itself
> Trained over 10,000 mini-batches of 128 games
> Evaluated through game play without search
> 80% win rate against the SL policy network
> 85% win rate against the strongest open-source Go program
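The self-play update is a policy-gradient (REINFORCE-style) step: raise the log-probability of moves in proportion to the game outcome. The sketch below is not AlphaGo's actual network update; it is a minimal version on a hypothetical linear softmax policy with a toy "action 0 always wins" outcome:

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

# Hypothetical linear policy over 3 actions, parameters rho, features phi
phi = rng.normal(size=(3, 4))
rho = np.zeros(4)

def play_and_update(rho, lr=0.1):
    """One REINFORCE-style step: move rho along z * grad log p(a | s),
    where z = +1 for a win and -1 for a loss."""
    p = softmax(phi @ rho)
    a = rng.choice(3, p=p)
    z = 1.0 if a == 0 else -1.0       # toy outcome: action 0 always wins
    grad_log = phi[a] - p @ phi       # d log p(a) / d rho for a linear softmax
    return rho + lr * z * grad_log

for _ in range(200):
    rho = play_and_update(rho)

print(softmax(phi @ rho)[0])  # probability of the winning action has grown
```

After repeated updates the policy concentrates on the winning action, which mirrors how self-play outcomes shift the RL network toward winning moves.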
Design | Value Network

> Trained using 30 million distinct positions, each drawn from a separate game, generated by a random mix of the SL and RL policies to prevent overfitting
> Consistently more accurate than Monte Carlo rollouts using the fast rollout policy
> Approaches the accuracy of Monte Carlo rollouts using the RL policy network, with far less computation

$v_\theta(s) \approx v^{p_\rho}(s) \approx v^*(s)$
Design | Search Algorithm

Each edge $(s,a)$ of the search tree stores a prior probability, a visit count and an action value

Selection – descend the tree by choosing the action maximizing the action value plus a bonus that is proportional to the prior probability and decays with the visit count
Expansion – process a leaf position with the SL policy network to obtain prior probabilities for each legal action
Evaluation – evaluate a leaf with the value network and with a fast rollout to the end of the game
Backup – update the action values and visit counts of all traversed edges
Select Move – play the most-visited move at the root

*Images captured from Silver D. et al. (2016)
Discussion | Performance

Against AI Players
> Played against the strongest commercial and open-source Go programs, all based on MCTS
> Single-machine AlphaGo won 494 out of 495 even games
> The distributed version won 77% of games against the single-machine version and 100% against the other programs
Discussion | Performance

Against Fan Hui
> Won 5-0 in formal games with 1 hour of main time + three 30s byoyomi1 periods
> Won 3-2 in informal games with three 30s byoyomi1 periods

1. Time slots consumed after exhausting main time; a period resets in full if not exceeded in a single turn
*Image captured from Silver D. et al. (2016)
Discussion | Performance

Against Lee Sedol
> Won 4-1 in formal games with 2 hours of main time + three 60s byoyomi periods
> Game 4 – the only loss – is still being analyzed
> MCTS may have overlooked Lee’s game-changing move – the only move that could have saved the game at that state

Game 4: Lee Sedol (White) vs. AlphaGo (Black); Lee Sedol wins by resignation
*Image captured from https://gogameguru.com/lee-sedol-defeats-alphago-masterful-comeback-game-4/
Discussion | Future Work

Next Potential Matches
> Imperfect information games (e.g., Poker, StarCraft)
> AlphaGo based on pure learning
> Testbed for future algorithmic research

Application Areas
> Gaming
> Healthcare
> Smartphone assistants

Healthcare Applications
> Medical diagnosis from images
> Longitudinal tracking of vital signs to help people lead healthier lifestyles

“it’d be cool if one day an AI was involved in finding a new particle”
– Demis Hassabis, Co-founder of DeepMind
References
Browne, C. B., Powley, E., Whitehouse, D., Lucas, S. M., Cowling, P. I., Rohlfshagen, P., . . . Colton, S. (2012). A Survey of Monte Carlo Tree Search Methods. IEEE Trans. Comput. Intell. AI Games IEEE Transactions on Computational Intelligence and AI in Games, 4(1), 1-43.
Byford, S. (2016, March 10). DeepMind founder Demis Hassabis on how AI will shape the future. The Verge. Retrieved April 02, 2016, from http://www.theverge.com/2016/3/10/11192774/demis-hassabis-interview-alphago-google-deepmind-ai
Google Inc. (2016). AlphaGo | Google DeepMind. Retrieved April 02, 2016, from https://deepmind.com/alpha-go.html
Lecun, Y., Bengio, Y., & Hinton, G. (2015). Deep learning. Nature, 521(7553), 436-444.
Ormerod, D. (2016, March 13). Lee Sedol defeats AlphaGo in masterful comeback - Game 4. Retrieved April 06, 2016, from https://gogameguru.com/lee-sedol-defeats-alphago-masterful-comeback-game-4/
Silver, D., Huang, A., Maddison, C. J., Guez, A., Sifre, L., Driessche, G. V., . . . Hassabis, D. (2016). Mastering the game of Go with deep neural networks and tree search. Nature, 529(7587), 484-489.