AlphaGo: An AI Go Player based on Deep Neural Networks and Monte Carlo Tree Search
Michael J. Moon
M.Sc. Candidate in Biostatistics
Dalla Lana School of Public Health
University of Toronto
April 7, 2016
Agenda
> Introduction
> Methodologies
> Design
> Discussion
> References
AlphaGo | M.Moon
Introduction | Background

The Game of Go
> Played on a square grid called a board, usually 19 x 19
> Stones – black and white – are placed alternately
> Points awarded for surrounding empty space

Complexity
> Possible number of sequences is a googol1 times greater than in chess
> Viewed as an unsolved “grand challenge” for AI

“the pinnacle of perfect information games”
– Demis Hassabis, Co-founder of DeepMind

1. 1 googol = 10^100

Example of a Go board; shades represent territories
Introduction | Background

> Google DeepMind’s AI Go player1

Oct 2015: 5-0 against Fan Hui
> Victory against the three-time European champion
> First program to win against a professional player in an even game

Mar 2016: 4-1 against Lee Sedol
> Victory against the world’s top player of the past decade
> Awarded the highest Go ranking after the match2

1. Image source: https://deepmind.com/alpha-go.html; 2. Source: http://www.straitstimes.com/asia/east-asia/googles-alphago-gets-divine-go-ranking
Introduction | Overview of the Design

Training pipeline
> 30M human moves train the SL policy network and the fast rollout policy
> The RL policy network is initialized from the SL policy network and improved through self-play
> The RL value network is trained on self-play games
> Monte Carlo tree search combines the networks for move selection

Asynchronous Multi-threaded Search
> 40 search threads
> 48 CPUs
> 8 GPUs

Distributed Version1
> 40 search threads
> 1,202 CPUs
> 176 GPUs

1. Used against Fan Hui; 1,920 CPUs and 280 GPUs against Lee. Source: http://www.economist.com/news/science-and-technology/21694540-win-or-lose-best-five-battle-contest-another-milestone
Methodologies | Deep Neural Network

Deep Learning Architecture
> Multilayer (5–20) stack of simple modules subject to learning

Forward pass with input units $x_i$, hidden units $H_1$ and $H_2$, and output units $y_l$:

$z_j = \sum_{i \in In} w_{ij} x_i, \quad y_j = f(z_j)$
$z_k = \sum_{j \in H_1} w_{jk} y_j, \quad y_k = f(z_k)$
$z_l = \sum_{k \in H_2} w_{kl} y_k, \quad y_l = f(z_l)$
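The forward pass above can be sketched in a few lines of NumPy; the layer sizes and random weights below are arbitrary illustrative choices, not AlphaGo's.

```python
import numpy as np

rng = np.random.default_rng(0)

def relu(z):
    return np.maximum(0.0, z)  # the non-linearity f

# Illustrative layer sizes (assumptions): In(4) -> H1(5) -> H2(3) -> Out(2)
W_ij = rng.normal(size=(4, 5))   # weights w_ij between input units and H1
W_jk = rng.normal(size=(5, 3))   # weights w_jk between H1 and H2
W_kl = rng.normal(size=(3, 2))   # weights w_kl between H2 and output units
x = rng.normal(size=4)           # input units x_i

# Forward pass, mirroring the slide's equations layer by layer
z_j = x @ W_ij;   y_j = relu(z_j)   # z_j = sum_i w_ij x_i,  y_j = f(z_j)
z_k = y_j @ W_jk; y_k = relu(z_k)   # z_k = sum_j w_jk y_j,  y_k = f(z_k)
z_l = y_k @ W_kl; y_l = relu(z_l)   # z_l = sum_k w_kl y_k,  y_l = f(z_l)

print(y_l.shape)  # (2,)
```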
Methodologies | Deep Neural Network

Backpropagation Training
> Trained by simple stochastic gradient descent to minimize error
> Applies the chain rule for derivatives to obtain the gradients

$\frac{\partial E}{\partial z_l} = \frac{\partial E}{\partial y_l} \frac{\partial y_l}{\partial z_l}$
$\frac{\partial E}{\partial y_k} = \sum_{l \in Out} w_{kl} \frac{\partial E}{\partial z_l}, \quad \frac{\partial E}{\partial z_k} = \frac{\partial E}{\partial y_k} \frac{\partial y_k}{\partial z_k}$
$\frac{\partial E}{\partial y_j} = \sum_{k \in H_2} w_{jk} \frac{\partial E}{\partial z_k}, \quad \frac{\partial E}{\partial z_j} = \frac{\partial E}{\partial y_j} \frac{\partial y_j}{\partial z_j}$
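These chain-rule equations translate into a short NumPy sketch of one stochastic gradient descent step. The squared-error loss, ReLU units, and layer sizes are illustrative assumptions, not AlphaGo's actual configuration.

```python
import numpy as np

rng = np.random.default_rng(1)

def relu(z):
    return np.maximum(0.0, z)

def relu_grad(z):
    return (z > 0).astype(float)  # dy/dz for a ReLU unit

# Tiny net In(4) -> H1(5) -> H2(3) -> Out(2); loss E = 0.5*||y_l - t||^2
W_ij = rng.normal(size=(4, 5))
W_jk = rng.normal(size=(5, 3))
W_kl = rng.normal(size=(3, 2))
x, t = rng.normal(size=4), rng.normal(size=2)

# Forward pass
z_j = x @ W_ij;   y_j = relu(z_j)
z_k = y_j @ W_jk; y_k = relu(z_k)
z_l = y_k @ W_kl; y_l = relu(z_l)

# Backward pass: the chain rule, term by term as on the slide
dE_dyl = y_l - t
dE_dzl = dE_dyl * relu_grad(z_l)   # dE/dz_l = dE/dy_l * dy_l/dz_l
dE_dyk = W_kl @ dE_dzl             # dE/dy_k = sum_l w_kl dE/dz_l
dE_dzk = dE_dyk * relu_grad(z_k)
dE_dyj = W_jk @ dE_dzk             # dE/dy_j = sum_k w_jk dE/dz_k
dE_dzj = dE_dyj * relu_grad(z_j)

# One SGD step: w <- w - lr * dE/dw, where dE/dw_kl = y_k * dE/dz_l, etc.
lr = 0.01
W_kl -= lr * np.outer(y_k, dE_dzl)
W_jk -= lr * np.outer(y_j, dE_dzk)
W_ij -= lr * np.outer(x, dE_dzj)
```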
Methodologies | Deep Neural Network

> The rectified linear unit (ReLU), $f(z) = \max(0, z)$, learns faster than other non-linearities
Methodologies | Deep Convolutional Neural Network

Input
Arrays such as signals, images and videos

Local Connections
Each unit connects only to a local patch of the layer below

Shared Weights
Each filter applies common weights and a bias across locations to create a feature map

Non-linearity
Local weighted sums are passed through a non-linearity such as ReLU

Pooling
Coarse-grains the position of each feature, typically by taking the max over neighbouring features

Size and Stride
e.g., a filter of size 3 applied with stride 2

Deep Architecture
Uses stacks of many such layers

Motivated by properties of natural signals
Methodologies | Deep Convolutional Neural Network

Architecture
> Local groups of values are highly correlated
> Local statistics are invariant to location

Properties
> Compositional hierarchy of features
> Invariant to small shifts and distortions due to pooling
> Weights trained through backpropagation
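These building blocks – shared weights, stride, a non-linearity, and pooling – can be sketched in plain NumPy. The toy image, the mean filter, and the sizes are illustrative assumptions:

```python
import numpy as np

def conv2d(img, kernel, stride=1):
    """Valid 2-D convolution (cross-correlation) with one shared kernel."""
    kh, kw = kernel.shape
    oh = (img.shape[0] - kh) // stride + 1
    ow = (img.shape[1] - kw) // stride + 1
    out = np.empty((oh, ow))
    for i in range(oh):
        for j in range(ow):
            patch = img[i*stride:i*stride+kh, j*stride:j*stride+kw]
            out[i, j] = np.sum(patch * kernel)  # same weights at every location
    return out

def max_pool(fmap, size=2):
    """Coarse-grain features by taking the max over non-overlapping patches."""
    h, w = fmap.shape[0] // size, fmap.shape[1] // size
    return fmap[:h*size, :w*size].reshape(h, size, w, size).max(axis=(1, 3))

img = np.arange(36, dtype=float).reshape(6, 6)       # toy single-channel image
kernel = np.ones((3, 3)) / 9.0                       # shared weights: 3x3 mean filter
feat = np.maximum(0.0, conv2d(img, kernel, stride=2))  # convolution + ReLU, stride 2
pooled = max_pool(feat)
print(feat.shape, pooled.shape)  # (2, 2) (1, 1)
```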
Methodologies | Monte Carlo Tree Search

Overview
Finds optimal decisions by:
> Taking random samples in the decision space
> Building a search tree according to the results

Tree Policy – selects and expands nodes, trying to balance exploration and exploitation
Default Policy – simulates playouts from newly added nodes

Selection
Traverse to the most urgent expandable node

Expansion
Add a child node from the selected node

Simulation
Simulate from the newly added node to an outcome with reward $r(s')$

Backpropagation
Back up the simulation result through the selected nodes
Methodologies | Monte Carlo Tree Search

Strengths
> Anytime algorithm – gives a valid solution whenever interrupted
> Values of intermediate states need not be evaluated – domain knowledge is not required
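The four phases above can be sketched as a minimal UCT-style search. This is a generic single-player toy domain, not Go: the state space, reward, and exploration constant are all assumptions for illustration.

```python
import math
import random

class Node:
    def __init__(self, state, parent=None):
        self.state, self.parent = state, parent
        self.children = {}              # action -> child Node
        self.visits, self.value = 0, 0.0

# Toy domain (an assumption): from each state s < 8, actions 0/1 lead to
# states 2s and 2s+1; states >= 8 are terminal with reward in [0, 1].
def actions(s): return [] if s >= 8 else [0, 1]
def step(s, a): return s * 2 + a
def reward(s): return (s % 8) / 7.0

def uct_select(node, c=1.4):
    # Tree policy: balance exploitation (mean value) against exploration
    return max(node.children.values(),
               key=lambda ch: ch.value / ch.visits
                              + c * math.sqrt(math.log(node.visits) / ch.visits))

def mcts(root_state, n_iter=500):
    root = Node(root_state)
    for _ in range(n_iter):
        node = root
        # 1. Selection: descend while fully expanded and non-terminal
        while actions(node.state) and len(node.children) == len(actions(node.state)):
            node = uct_select(node)
        # 2. Expansion: add one unexplored child
        untried = [a for a in actions(node.state) if a not in node.children]
        if untried:
            a = random.choice(untried)
            node.children[a] = Node(step(node.state, a), parent=node)
            node = node.children[a]
        # 3. Simulation: default (random) policy until a terminal state
        s = node.state
        while actions(s):
            s = step(s, random.choice(actions(s)))
        r = reward(s)
        # 4. Backpropagation: back up the result through the selected nodes
        while node:
            node.visits += 1
            node.value += r
            node = node.parent
    # Play the most-visited action at the root
    return max(root.children.items(), key=lambda kv: kv[1].visits)[0]

print(mcts(1))  # best action from state 1
```

Because terminal rewards are higher under state 3 than under state 2, the search concentrates its visits on action 1.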
Design | Problem Setting

> State of the game $s$
> Legal actions $a \in \mathcal{A}(s)$
> Deterministic state transitions
> Reward $r(s)$ for the player at state $s$; zero for all non-terminal states
> Terminal reward $z_T = \pm r(s_T)$ at the end of the game

Value Function
> Unique optimal value function $v^*(s)$ under perfect play by both players

Policy
> Probability distribution $p(a \mid s)$ over legal actions
Design | Rollout Policy

> A fast, linear softmax policy for simulation
> Pattern-based feature inputs
> Trained using 8 million positions
> Less domain knowledge implemented compared to existing MCTS Go programs
> 24.2% prediction accuracy
> A similar policy is used for tree expansion
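A linear softmax over pattern features can be sketched as follows. The feature dimensions, the binary pattern vectors, and the weights are hypothetical stand-ins for the real pattern features:

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(z):
    z = z - z.max()          # subtract max for numerical stability
    e = np.exp(z)
    return e / e.sum()

# Hypothetical setup: each legal action has a binary pattern-based feature
# vector; the policy is a linear softmax over the actions' feature scores.
n_actions, n_features = 5, 10
theta = rng.normal(size=n_features)                                    # learned weights
phi = rng.integers(0, 2, size=(n_actions, n_features)).astype(float)   # pattern features

p = softmax(phi @ theta)          # probability of each legal action
a = rng.choice(n_actions, p=p)    # sample a move during a rollout
print(p.sum())
```

Because the policy is linear in the features, a move can be sampled in microseconds, which is what makes thousands of rollouts per move feasible.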
Design | Neural Network Architectures

Input
> 19 x 19 intersections x 48 feature planes (+1 extra plane for the value network)

Input Feature Space (with respect to the current player)
> Stone colour
> Ones & zeros
> Turns since
> Liberties
> Capture size
> Self-atari size
> Liberties after move
> Ladder capture
> Ladder escape
> Sensibleness

Extra Feature for Value Network
> Player colour
Design | Neural Network Architectures

First Convolution Layer
> Zero-padding to (19+4) x (19+4)
> Filters of kernel size 5 x 5 with stride 1
> ReLU non-linearity
Design | Neural Network Architectures

> Each convolution layer preserves a 19 x 19 output
Design | Neural Network Architectures

Convolution Layers 2–12 (x11)
> Zero-padding to (19+2) x (19+2)
> Filters of kernel size 3 x 3 with stride 1
> ReLU non-linearity
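The padding choices above are exactly what keeps every layer's output at 19 x 19, which the standard convolution output-size formula makes explicit:

```python
def conv_out(n, k, p, s=1):
    """Output size of an n x n input convolved with a k x k kernel,
    p zero-padding on each side, and stride s: (n + 2p - k) // s + 1."""
    return (n + 2 * p - k) // s + 1

# AlphaGo's layers all preserve the 19 x 19 board size:
print(conv_out(19, 5, 2))  # first layer: 5x5 kernel, padded to 23x23 -> 19
print(conv_out(19, 3, 1))  # layers 2-12: 3x3 kernel, padded to 21x21 -> 19
print(conv_out(19, 1, 0))  # final 1x1 kernel, no padding -> 19
```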
Design | Neural Network Architectures

Policy Output
> 1-stride convolution with 1 kernel of size 1 x 1 and a different bias for each intersection
> Softmax function outputs a probability for each of the 19 x 19 intersections

Value Output
> 1-stride convolution with 1 kernel of size 1 x 1
> Fully-connected layer of 256 rectifiers
> Fully-connected tanh layer outputs a single scalar
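The two output heads can be sketched in NumPy. The random feature map and head weights below are illustrative placeholders for the outputs of the trained convolution stack:

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-in for the 19 x 19 plane produced by the final 1x1 convolution
fmap = rng.normal(size=(19, 19))

# Policy head: softmax over all 19 x 19 = 361 intersections
z = fmap.ravel()
policy = np.exp(z - z.max())
policy /= policy.sum()            # a probability for every intersection

# Value head: 256 rectifiers, then a fully-connected tanh unit -> one scalar
W1, b1 = rng.normal(size=(361, 256)), np.zeros(256)
W2, b2 = rng.normal(size=256), 0.0
h = np.maximum(0.0, fmap.ravel() @ W1 + b1)   # 256 rectified hidden units
value = np.tanh(h @ W2 + b2)                  # scalar in (-1, 1)

print(policy.shape, float(value))
```

The tanh squashes the value into (-1, 1), matching the game outcome's +-1 coding, while the softmax ensures the policy is a proper distribution over moves.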
Design | Supervised Learning Policy Network

> Trained using mini-batches of 16 positions randomly selected from 28.4 million positions
> Trained on 50 GPUs over 3 weeks
> Tested with 1 million held-out positions
> 57.0% prediction accuracy
Design | Reinforcement Learning Policy Network

> Trained using self-play between the current network and a randomly selected previous iteration of itself
> Trained over 10,000 mini-batches of 128 games
> Evaluated through game play without search
> 80% win rate against the SL policy network
> 85% win rate against the strongest open-source Go program
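The self-play update is a policy-gradient (REINFORCE-style) step: raise the log-probability of moves in proportion to the game outcome. The sketch below is not AlphaGo's actual network update; it is a minimal version on a hypothetical linear softmax policy with a toy "action 0 always wins" outcome:

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

# Hypothetical linear policy over 3 actions, parameters rho, features phi
phi = rng.normal(size=(3, 4))
rho = np.zeros(4)

def play_and_update(rho, lr=0.1):
    """One REINFORCE-style step: move rho along z * grad log p(a | s),
    where z = +1 for a win and -1 for a loss."""
    p = softmax(phi @ rho)
    a = rng.choice(3, p=p)
    z = 1.0 if a == 0 else -1.0       # toy outcome: action 0 always wins
    grad_log = phi[a] - p @ phi       # d log p(a) / d rho for a linear softmax
    return rho + lr * z * grad_log

for _ in range(200):
    rho = play_and_update(rho)

print(softmax(phi @ rho)[0])  # probability of the winning action has grown
```

After repeated updates the policy concentrates on the winning action, which mirrors how self-play outcomes shift the RL network toward winning moves.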
Design | Value Network

> Trained using 30 million distinct positions, each drawn from a separate game, generated by a random mix of the SL and RL policies to prevent overfitting
> Consistently more accurate than Monte Carlo rollouts using the fast rollout policy
> Approaches the accuracy of Monte Carlo rollouts using the RL policy network, with far less computation

$v_\theta(s) \approx v^{p_\rho}(s) \approx v^*(s)$
Design | Search Algorithm

Each edge $(s,a)$ of the search tree stores a prior probability, a visit count and an action value

Selection – descend the tree by choosing the action maximizing the action value plus a bonus that is proportional to the prior probability and decays with the visit count
Expansion – process a leaf position with the SL policy network to obtain prior probabilities for each legal action
Evaluation – evaluate a leaf with the value network and with a fast rollout to the end of the game
Backup – update the action values and visit counts of all traversed edges
Select Move – play the most-visited move at the root

*Images captured from Silver D. et al. (2016)
Discussion | Performance

Against AI Players
> Played against the strongest commercial and open-source Go programs, all based on MCTS
> Single-machine AlphaGo won 494 out of 495 even games
> The distributed version won 77% of games against the single-machine version and 100% against the other programs
Discussion | Performance

Against Fan Hui
> Won 5-0 in formal games with 1 hour of main time + three 30s byoyomi1 periods
> Won 3-2 in informal games with three 30s byoyomi1 periods

1. Time slots consumed after exhausting main time; a period resets in full if not exceeded in a single turn
*Image captured from Silver D. et al. (2016)
Discussion | Performance

Against Lee Sedol
> Won 4-1 in formal games with 2 hours of main time + three 60s byoyomi periods
> Game 4 – the only loss – is still being analyzed
> MCTS may have overlooked Lee’s game-changing move – the only move that could have saved the game at that state

Game 4: Lee Sedol (White) vs. AlphaGo (Black); Lee Sedol wins by resignation
*Image captured from https://gogameguru.com/lee-sedol-defeats-alphago-masterful-comeback-game-4/
Discussion | Future Work

Next Potential Matches
> Imperfect information games (e.g., Poker, StarCraft)
> AlphaGo based on pure learning
> Testbed for future algorithmic research

Application Areas
> Gaming
> Healthcare
> Smartphone assistants

Healthcare Applications
> Medical diagnosis from images
> Longitudinal tracking of vital signs to help people lead healthier lifestyles

“it’d be cool if one day an AI was involved in finding a new particle”
– Demis Hassabis, Co-founder of DeepMind
References
Browne, C. B., Powley, E., Whitehouse, D., Lucas, S. M., Cowling, P. I., Rohlfshagen, P., . . . Colton, S. (2012). A Survey of Monte Carlo Tree Search Methods. IEEE Trans. Comput. Intell. AI Games IEEE Transactions on Computational Intelligence and AI in Games, 4(1), 1-43.
Byford, S. (2016, March 10). DeepMind founder Demis Hassabis on how AI will shape the future. The Verge. Retrieved April 02, 2016, from http://www.theverge.com/2016/3/10/11192774/demis-hassabis-interview-alphago-google-deepmind-ai
Google Inc. (2016). AlphaGo | Google DeepMind. Retrieved April 02, 2016, from https://deepmind.com/alpha-go.html
Lecun, Y., Bengio, Y., & Hinton, G. (2015). Deep learning. Nature, 521(7553), 436-444.
Ormerod, D. (2016, March 13). Lee Sedol defeats AlphaGo in masterful comeback - Game 4. Retrieved April 06, 2016, from https://gogameguru.com/lee-sedol-defeats-alphago-masterful-comeback-game-4/
Silver, D., Huang, A., Maddison, C. J., Guez, A., Sifre, L., Driessche, G. V., . . . Hassabis, D. (2016). Mastering the game of Go with deep neural networks and tree search. Nature, 529(7587), 484-489.