
Page 1: AlphaGo: An AI Go player based on deep neural networks and monte carlo tree search

AlphaGo: An AI Go Player based on Deep Neural Networks and Monte Carlo Tree Search

Michael J. Moon, M.Sc. Candidate in Biostatistics, Dalla Lana School of Public Health, University of Toronto

April 7, 2016

Page 2: AlphaGo: An AI Go player based on deep neural networks and monte carlo tree search

Agenda

> Introduction
> Methodologies
> Design
> Discussion
> References

Page 3: AlphaGo: An AI Go player based on deep neural networks and monte carlo tree search

Introduction | Background

The Game of Go
> Played on a square grid called a board, usually 19 x 19
> Black and white stones are placed alternately
> Points awarded for surrounding empty space

Example of a Go board (shades represent territories)

Page 4: AlphaGo: An AI Go player based on deep neural networks and monte carlo tree search

Introduction | Background

The Game of Go
> Played on a square grid called a board, usually 19 x 19
> Black and white stones are placed alternately
> Points awarded for surrounding empty space

Complexity
> Possible number of move sequences: a googol¹ times more than chess
> Viewed as an unsolved "grand challenge" for AI

"pinnacle of perfect information games"
Demis Hassabis, Co-founder of DeepMind

1. 1 googol = 10^100

Example of a Go board (shades represent territories)

Page 5: AlphaGo: An AI Go player based on deep neural networks and monte carlo tree search

Introduction | Background

> Google DeepMind's AI Go player

Page 6: AlphaGo: An AI Go player based on deep neural networks and monte carlo tree search

Introduction | Background

Oct 2015
5-0 against Fan Hui
> Victory against the 3-time European champion
> First program to win against a professional player in an even game

Page 7: AlphaGo: An AI Go player based on deep neural networks and monte carlo tree search

Introduction | Background

Oct 2015
5-0 against Fan Hui
> Victory against the 3-time European champion
> First program to win against a professional player in an even game

Mar 2016
4-1 against Lee Sedol
> Victory against the world's top player of the past decade
> Awarded the highest Go ranking after the match²

1. Image source: https://deepmind.com/alpha-go.html; 2. Source: http://www.straitstimes.com/asia/east-asia/googles-alphago-gets-divine-go-ranking

Page 8: AlphaGo: An AI Go player based on deep neural networks and monte carlo tree search

Introduction | Overview of the Design

[Training pipeline diagram]
30M Human Moves → SL Policy Network, Rollout Policy
SL Policy Network → RL Policy Network → RL Value Network

Page 9: AlphaGo: An AI Go player based on deep neural networks and monte carlo tree search

Introduction | Overview of the Design

[Training pipeline diagram]
30M Human Moves → SL Policy Network, Rollout Policy
SL Policy Network → RL Policy Network → RL Value Network
Networks combined in Monte Carlo Tree Search → Move Selection

Page 10: AlphaGo: An AI Go player based on deep neural networks and monte carlo tree search

Introduction | Overview of the Design

[Training pipeline diagram]
30M Human Moves → SL Policy Network, Rollout Policy
SL Policy Network → RL Policy Network → RL Value Network
Networks combined in Monte Carlo Tree Search → Move Selection

Asynchronous Multi-threaded Search
> 40 search threads
> 48 CPUs
> 8 GPUs

Distributed Version¹
> 40 search threads
> 1,202 CPUs
> 176 GPUs

1. Used against Fan Hui; 1,920 CPUs and 280 GPUs against Lee: http://www.economist.com/news/science-and-technology/21694540-win-or-lose-best-five-battle-contest-another-milestone

Page 11: AlphaGo: An AI Go player based on deep neural networks and monte carlo tree search

Methodologies | Deep Neural Network

Deep Learning Architecture
> Multilayer (5~20) stack of simple modules subject to learning

[Network diagram: input units $x_i$, hidden units $H_1$ and $H_2$, output units $y_l$, connected by weights $w_{ij}$, $w_{jk}$, $w_{kl}$]

Forward pass:
$$z_j = \sum_{i \in In} w_{ij} x_i, \quad y_j = f(z_j)$$
$$z_k = \sum_{j \in H_1} w_{jk} y_j, \quad y_k = f(z_k)$$
$$z_l = \sum_{k \in H_2} w_{kl} y_k, \quad y_l = f(z_l)$$
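To make the forward pass concrete, here is a minimal NumPy sketch of the equations above; the layer sizes and the choice of ReLU for $f$ are illustrative assumptions, not AlphaGo's configuration.

```python
import numpy as np

def f(z):
    # Non-linearity; ReLU is the choice highlighted later in the slides
    return np.maximum(0.0, z)

rng = np.random.default_rng(0)

# Illustrative sizes: 4 input units, hidden layers H1 (5) and H2 (3), 2 output units
W_ij = rng.normal(size=(4, 5))   # input i  -> hidden j (H1)
W_jk = rng.normal(size=(5, 3))   # hidden j -> hidden k (H2)
W_kl = rng.normal(size=(3, 2))   # hidden k -> output l

x = rng.normal(size=4)           # input vector x_i

# Forward pass, layer by layer: z = weighted sum of inputs, y = f(z)
z_j = x @ W_ij;   y_j = f(z_j)
z_k = y_j @ W_jk; y_k = f(z_k)
z_l = y_k @ W_kl; y_l = f(z_l)   # network output y_l
print(y_l)
```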

Page 12: AlphaGo: An AI Go player based on deep neural networks and monte carlo tree search

Methodologies | Deep Neural Network

Deep Learning Architecture
> Multilayer (5~20) stack of simple modules subject to learning

Backpropagation Training
> Trained by simple stochastic gradient descent to minimize error

[Same network diagram and forward-pass equations as the previous slide]

Page 13: AlphaGo: An AI Go player based on deep neural networks and monte carlo tree search

Methodologies | Deep Neural Network

Deep Learning Architecture
> Multilayer (5~20) stack of simple modules subject to learning

Backpropagation Training
> Trained by simple stochastic gradient descent to minimize error

Backward pass (application of the chain rule for derivatives to obtain the gradients):
$$\frac{\partial E}{\partial z_l} = \frac{\partial E}{\partial y_l} \frac{\partial y_l}{\partial z_l}$$
$$\frac{\partial E}{\partial y_k} = \sum_{l \in Out} w_{kl} \frac{\partial E}{\partial z_l}, \quad \frac{\partial E}{\partial z_k} = \frac{\partial E}{\partial y_k} \frac{\partial y_k}{\partial z_k}$$
$$\frac{\partial E}{\partial y_j} = \sum_{k \in H_2} w_{jk} \frac{\partial E}{\partial z_k}, \quad \frac{\partial E}{\partial z_j} = \frac{\partial E}{\partial y_j} \frac{\partial y_j}{\partial z_j}$$
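A matching NumPy sketch of one stochastic gradient descent step using these chain-rule gradients; the squared-error loss $E$ and the learning rate are assumptions for illustration.

```python
import numpy as np

def f(z):  return np.maximum(0.0, z)       # ReLU
def df(z): return (z > 0).astype(float)    # dy/dz for ReLU

rng = np.random.default_rng(0)
W_ij, W_jk, W_kl = (rng.normal(size=s) for s in [(4, 5), (5, 3), (3, 2)])
x, target = rng.normal(size=4), np.array([1.0, 0.0])

# Forward pass (as on the previous slides)
z_j = x @ W_ij;   y_j = f(z_j)
z_k = y_j @ W_jk; y_k = f(z_k)
z_l = y_k @ W_kl; y_l = f(z_l)

# Backward pass: chain rule from the output layer toward the input layer
dE_dz_l = (y_l - target) * df(z_l)      # dE/dy_l * dy_l/dz_l for E = 0.5*||y_l - target||^2
dE_dz_k = (W_kl @ dE_dz_l) * df(z_k)    # dE/dy_k = sum_l w_kl * dE/dz_l
dE_dz_j = (W_jk @ dE_dz_k) * df(z_j)    # dE/dy_j = sum_k w_jk * dE/dz_k

# Stochastic gradient descent update: w <- w - eta * dE/dw
eta = 0.01
W_kl -= eta * np.outer(y_k, dE_dz_l)
W_jk -= eta * np.outer(y_j, dE_dz_k)
W_ij -= eta * np.outer(x,   dE_dz_j)
```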

Page 14: AlphaGo: An AI Go player based on deep neural networks and monte carlo tree search

Methodologies | Deep Neural Network

Deep Learning Architecture
> Multilayer (5~20) stack of simple modules subject to learning

Backpropagation Training
> Trained by simple stochastic gradient descent to minimize error
> Rectified linear unit (ReLU) learns faster than other non-linearities

[Same network diagram and forward-pass equations as the previous slides]

Page 15: AlphaGo: An AI Go player based on deep neural networks and monte carlo tree search

Methodologies | Deep Convolutional Neural Network

Input
Arrays such as signals, images and videos

Local Connections
Each unit connects only to a local patch of units in the previous layer

Shared Weights
Each filter applies common weights and bias across locations to create a feature map

Non-linearity
Local weighted sums passed through a non-linearity such as ReLU

Pooling
Coarse-grains the position of each feature, typically by taking the max over neighbouring features

Size and Stride
e.g., a filter of size 3 applied with stride 2

Deep Architecture
Uses stacks of many such layers

These components exploit properties of natural signals
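A minimal 1-D NumPy sketch of these components (shared-weight convolution, ReLU non-linearity, max pooling, filter size 3 with stride 2); the toy signal and filter are made up for illustration.

```python
import numpy as np

def conv1d(x, w, b, stride):
    """Slide one shared-weight filter along the input (local connections + shared weights)."""
    n = (len(x) - len(w)) // stride + 1
    return np.array([x[i * stride : i * stride + len(w)] @ w + b for i in range(n)])

def relu(z):
    return np.maximum(0.0, z)

def max_pool(y, size=2):
    """Coarse-grain feature positions by taking the max over neighbouring features."""
    return np.array([y[i:i + size].max() for i in range(0, len(y) - size + 1, size)])

x = np.array([0., 1., 1., 0., 1., 0., 0., 1., 1.])  # toy 1-D input signal
w, b = np.array([1., -1., 1.]), 0.0                  # one filter of size 3
feature_map = max_pool(relu(conv1d(x, w, b, stride=2)))
print(feature_map)
```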

Page 16: AlphaGo: An AI Go player based on deep neural networks and monte carlo tree search

Methodologies | Deep Convolutional Neural Network

Architecture
> Highly correlated local groups
> Local statistics invariant to location

Properties
> Compositional hierarchy
> Invariant to small shifts and distortions due to pooling
> Weights trained through backpropagation

Page 17: AlphaGo: An AI Go player based on deep neural networks and monte carlo tree search

Methodologies | Monte Carlo Tree Search

Overview
Finds optimal decisions by:
> Taking random samples in the decision space
> Building a search tree according to the results

[Diagram: Tree Policy | Default Policy]
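A compact sketch of the generic MCTS loop that the next slides walk through phase by phase; the game-state interface (legal_moves, play, result) is a hypothetical stand-in, and two-player sign handling is omitted for brevity.

```python
import math, random

class Node:
    def __init__(self, state, parent=None, move=None):
        self.state, self.parent, self.move = state, parent, move
        self.children = []
        self.untried = list(state.legal_moves())   # expandable actions
        self.visits, self.value = 0, 0.0

def uct_select(node, c=1.4):
    # Tree policy: balance exploitation (mean reward) and exploration
    return max(node.children,
               key=lambda ch: ch.value / ch.visits
                              + c * math.sqrt(math.log(node.visits) / ch.visits))

def mcts(root_state, n_iter=1000):
    root = Node(root_state)
    for _ in range(n_iter):
        node = root
        # 1. Selection: traverse to the most urgent expandable node
        while not node.untried and node.children:
            node = uct_select(node)
        # 2. Expansion: add a child node from the selected node
        if node.untried:
            move = node.untried.pop()
            child = Node(node.state.play(move), parent=node, move=move)
            node.children.append(child)
            node = child
        # 3. Simulation: default policy plays randomly to an outcome
        state = node.state
        while state.legal_moves():
            state = state.play(random.choice(state.legal_moves()))
        reward = state.result()
        # 4. Backpropagation: back up the result through the selected nodes
        while node is not None:
            node.visits += 1
            node.value += reward
            node = node.parent
    return max(root.children, key=lambda ch: ch.visits).move
```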

Page 18: AlphaGo: An AI Go player based on deep neural networks and monte carlo tree search

Methodologies | Monte Carlo Tree Search

Selection
Traverse to the most urgent expandable node

Tree Policy
Tries to balance exploration and exploitation (see the UCB1 rule below)

Default Policy
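One standard tree policy is UCB1 applied to trees (UCT), as surveyed in Browne et al. (2012): at each node, descend to the child $j$ maximizing

$$UCT_j = \bar{X}_j + 2 C_p \sqrt{\frac{2 \ln n}{n_j}},$$

where $\bar{X}_j$ is the average reward of child $j$, $n_j$ its visit count, $n$ the parent's visit count, and $C_p > 0$ an exploration constant.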

Page 19: AlphaGo: An AI Go player based on deep neural networks and monte carlo tree search

Methodologies | Monte Carlo Tree Search

Selection
Traverse to the most urgent expandable node

Expansion
Add a child node from the selected node

Tree Policy
Tries to balance exploration and exploitation

Default Policy

Page 20: AlphaGo: An AI Go player based on deep neural networks and monte carlo tree search

Methodologies | Monte Carlo Tree Search

Selection
Traverse to the most urgent expandable node

Expansion
Add a child node from the selected node

Simulation
Simulate from the newly added node to an outcome with reward $r(s')$

Tree Policy | Default Policy

Page 21: AlphaGo: An AI Go player based on deep neural networks and monte carlo tree search

Methodologies | Monte Carlo Tree Search

Selection
Traverse to the most urgent expandable node

Expansion
Add a child node from the selected node

Simulation
Simulate from the newly added node to an outcome with reward $r(s')$

Backpropagation
Back up the simulation result through the selected nodes

Tree Policy | Default Policy

Page 22: AlphaGo: An AI Go player based on deep neural networks and monte carlo tree search

Methodologies | Monte Carlo Tree Search

[Diagram: Tree Policy | Default Policy, reward $r(s')$]

Strengths
> Anytime algorithm: returns a valid solution at any point of interruption
> Values of intermediate states are not evaluated, so domain knowledge is not required

Page 23: AlphaGo: An AI Go player based on deep neural networks and monte carlo tree search

Design | Problem Setting

> State of the game $s$
> Legal actions $a \in \mathcal{A}(s)$
> Deterministic state transitions $f(s, a)$
> Reward $r(s)$ for the current player at state $s$; zero for all non-terminal states
> Terminal reward $z_t = \pm r(s_T)$ at the terminal state $s_T$

Value Function
> Unique optimal value function $v^*(s)$: the outcome from position $s$ under perfect play by both players

Policy
> Probability distribution $p(a \mid s)$ over legal actions
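With rewards given from the current player's perspective and players alternating turns, the optimal value function satisfies a standard negamax recursion (this rendering is mine, not a formula from the slides):

$$v^*(s) = \begin{cases} r(s_T) & \text{if } s = s_T \text{ is terminal,} \\ \max_{a \in \mathcal{A}(s)} \; -v^*\big(f(s, a)\big) & \text{otherwise.} \end{cases}$$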

Page 24: AlphaGo: An AI Go player based on deep neural networks and monte carlo tree search

Design | Rollout Policy

> A fast, linear softmax policy $p_\pi$ used for simulation
> Pattern-based feature inputs
> Trained using 8 million positions
> Implements less domain knowledge than existing MCTS Go programs
> 24.2% move prediction accuracy
> A similar policy is used for tree expansion
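A minimal sketch of such a linear softmax policy over pattern features; the feature vectors and weights here are hypothetical placeholders, not AlphaGo's actual patterns.

```python
import numpy as np

def rollout_policy(legal_moves, phi, theta):
    """Linear softmax: p(a|s) proportional to exp(phi(s,a) . theta)."""
    scores = np.array([phi[m] @ theta for m in legal_moves])
    scores -= scores.max()              # for numerical stability
    p = np.exp(scores)
    return p / p.sum()

# Hypothetical example: 3 legal moves with 4 binary pattern features each
phi = {0: np.array([1., 0., 1., 0.]),
       1: np.array([0., 1., 0., 0.]),
       2: np.array([1., 1., 0., 1.])}
theta = np.array([0.5, -0.2, 0.8, 0.1])   # learned pattern weights (placeholder)
print(rollout_policy([0, 1, 2], phi, theta))
```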

Page 25: AlphaGo: An AI Go player based on deep neural networks and monte carlo tree search

Design | Neural Network Architectures

Input
> 19 x 19 intersections x 48 feature planes (binary; an example plane is shown on the slide)

Input Feature Space (with respect to the current player)
> Stone colour
> Ones & zeros
> Turns since
> Liberties
> Capture size
> Self-atari size
> Liberties after move
> Ladder capture
> Ladder escape
> Sensibleness

Extra Feature for Value Network (+1 plane)
> Player colour
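A sketch of how such a binary feature tensor might be assembled; the plane ordering and the few planes filled in are my own illustrative choices.

```python
import numpy as np

def encode_position(stones, to_play, n_planes=48):
    """Stack binary 19x19 feature planes; only a few illustrative planes are filled in."""
    x = np.zeros((19, 19, n_planes), dtype=np.float32)
    x[..., 0] = (stones == to_play)    # current player's stones
    x[..., 1] = (stones == -to_play)   # opponent's stones
    x[..., 2] = (stones == 0)          # empty intersections
    x[..., 3] = 1.0                    # plane of ones
    # ... remaining planes would encode turns since, liberties, capture size, etc.
    return x

stones = np.zeros((19, 19), dtype=int)   # 0 = empty, +1 = black, -1 = white
stones[3, 3], stones[15, 15] = 1, -1
print(encode_position(stones, to_play=1).shape)   # (19, 19, 48)
```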

Page 26: AlphaGo: An AI Go player based on deep neural networks and monte carlo tree search

Design | Neural Network Architectures

First Convolution Layer
> Zero-padding: (19+4) x (19+4)
> Filters: kernel size 5 x 5 with stride 1 convolution
> ReLU non-linearity

Page 27: AlphaGo: An AI Go player based on deep neural networks and monte carlo tree search

Design | Neural Network Architectures

[Figure: each convolution layer produces a 19 x 19 output]

Page 28: AlphaGo: An AI Go player based on deep neural networks and monte carlo tree search

Design | Neural Network Architectures

Hidden Convolution Layers (x11)
> Zero-padding: (19+2) x (19+2)
> Filters: kernel size 3 x 3 with stride 1 convolution
> ReLU non-linearity

Page 29: AlphaGo: An AI Go player based on deep neural networks and monte carlo tree search

Design | Neural Network Architectures

Convolution Layers (x11, as above)

Policy Head
> 1-stride convolution: 1 kernel of size 1 x 1 with a different bias for each intersection
> Softmax function outputs a probability for each of the 19 x 19 intersections

Value Head
> 1-stride convolution: 1 kernel of size 1 x 1
> Fully-connected layer of 256 rectifiers
> Fully-connected tanh layer outputs a single scalar
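Putting the pieces together, here is a PyTorch sketch of the policy network as described above; the filter count of 192 matches one configuration reported in Silver et al. (2016), but the rest is a hedged reconstruction, not the actual implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class PolicyNet(nn.Module):
    """Sketch: 19x19x48 input -> 12 convolution layers -> softmax over 361 moves."""
    def __init__(self, planes=48, filters=192):
        super().__init__()
        self.conv1 = nn.Conv2d(planes, filters, kernel_size=5, padding=2)   # pad to 23x23
        self.hidden = nn.ModuleList(
            nn.Conv2d(filters, filters, kernel_size=3, padding=1)           # pad to 21x21
            for _ in range(11))
        self.head = nn.Conv2d(filters, 1, kernel_size=1, bias=False)        # 1x1 kernel
        self.bias = nn.Parameter(torch.zeros(19 * 19))   # a different bias per intersection

    def forward(self, x):
        x = F.relu(self.conv1(x))
        for conv in self.hidden:
            x = F.relu(conv(x))
        logits = self.head(x).flatten(1) + self.bias
        return F.softmax(logits, dim=1)      # distribution over the 361 intersections

net = PolicyNet()
probs = net(torch.zeros(1, 48, 19, 19))      # dummy batch of one position
print(probs.shape)                            # torch.Size([1, 361])
```

The value network would share this convolutional stack but end in a 1 x 1 convolution, a fully-connected layer of 256 rectifiers, and a tanh unit producing a single scalar.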

Page 30: AlphaGo: An AI Go player based on deep neural networks and monte carlo tree search

Design | Supervised Learning Policy Network

> Trained using mini-batches of 16 positions randomly selected from 28.4 million positions
> Trained on 50 GPUs over 3 weeks
> Tested with 1 million positions
> 57.0% move prediction accuracy
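A hedged sketch of one supervised training step, maximizing the log-likelihood of the expert move by stochastic gradient descent; net reuses the PolicyNet sketch above, and the learning rate is a placeholder.

```python
import torch
import torch.nn.functional as F

net = PolicyNet()                                   # from the sketch above
opt = torch.optim.SGD(net.parameters(), lr=0.003)   # learning rate is a placeholder

def sl_step(positions, expert_moves):
    """One SGD step on a mini-batch of 16 (position, expert move) pairs.
    positions: (16, 48, 19, 19) tensor; expert_moves: (16,) move indices in [0, 361)."""
    probs = net(positions)
    loss = F.nll_loss(torch.log(probs + 1e-12), expert_moves)
    opt.zero_grad()
    loss.backward()
    opt.step()
    return loss.item()

# Dummy batch, just to show the call shapes
print(sl_step(torch.zeros(16, 48, 19, 19), torch.randint(0, 361, (16,))))
```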

Page 31: AlphaGo: An AI Go player based on deep neural networks and monte carlo tree search

Design | Reinforcement Learning Policy Network

> Trained using self-play between the current network $p_\rho$ and a randomly selected previous iteration of itself
> Trained over 10,000 mini-batches of 128 games
> Evaluated through game play without search
> Won 80% of games against the SL policy network $p_\sigma$
> Won 85% of games against the strongest open-source Go program
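The paper's policy-gradient update (REINFORCE) moves the weights $\rho$ in the direction that makes winning moves more likely, using the final outcome $z_t$ from the current player's perspective:

$$\Delta\rho \propto \frac{\partial \log p_\rho(a_t \mid s_t)}{\partial \rho} \, z_t$$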

Page 32: AlphaGo: An AI Go player based on deep neural networks and monte carlo tree search

Design | Value Network

> Trained using 30 million distinct positions, each sampled from a separate game, generated by a random mix of $p_\sigma$ and $p_\rho$ to prevent overfitting
> Consistently more accurate than the rollout policy $p_\pi$
> Approaches the accuracy of Monte Carlo rollouts using $p_\rho$ with far less computation

$$v_\theta(s) \approx v^{p_\rho}(s) \approx v^*(s)$$
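The value network weights $\theta$ are trained by regression on state-outcome pairs $(s, z)$, minimizing the mean squared error; the paper's gradient step is:

$$\Delta\theta \propto \frac{\partial v_\theta(s)}{\partial \theta} \, \big(z - v_\theta(s)\big)$$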

Page 33: AlphaGo: An AI Go player based on deep neural networks and monte carlo tree search

Design | Search Algorithm

Each edge $(s, a)$ of the search tree stores prior probability $P(s, a)$, visit counts, and an action value $Q(s, a)$

*Image captured from Silver, D. et al. (2016)

Page 34: AlphaGo: An AI Go player based on deep neural networks and monte carlo tree search

Design | Search Algorithm

Selection
Choose actions by their stored edge $(s, a)$ statistics (see the rule below)

*Image captured from Silver, D. et al. (2016)
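In Silver et al. (2016), selection at each in-tree step picks the action maximizing the action value plus a bonus proportional to the prior probability and decaying with repeated visits:

$$a_t = \operatorname*{argmax}_a \big( Q(s_t, a) + u(s_t, a) \big), \qquad u(s, a) \propto \frac{P(s, a)}{1 + N(s, a)}$$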

Page 35: AlphaGo: An AI Go player based on deep neural networks and monte carlo tree search

Design | Search Algorithm

Expansion
A new node's edge priors $P(s, a)$ are set by the SL policy network $p_\sigma$

*Image captured from Silver, D. et al. (2016)

Page 36: AlphaGo: An AI Go player based on deep neural networks and monte carlo tree search

Design | Search Algorithm

Evaluation
Leaf positions are evaluated by a mix of the value network $v_\theta$ and a fast rollout with $p_\pi$ (see below)

*Image captured from Silver, D. et al. (2016)
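The leaf evaluation mixes the value network's prediction with the rollout outcome $z_L$ through a mixing parameter $\lambda$ (set to 0.5 in the paper):

$$V(s_L) = (1 - \lambda)\, v_\theta(s_L) + \lambda\, z_L$$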

Page 37: AlphaGo: An AI Go player based on deep neural networks and monte carlo tree search

Design | Search Algorithm

Backup
Edge statistics along the traversed path are updated with the evaluation result (see below)

*Image captured from Silver, D. et al. (2016)
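After each simulation, the traversed edges update their visit counts and mean evaluations; in the paper's notation, with $\mathbf{1}(s, a, i)$ indicating whether edge $(s, a)$ was traversed in simulation $i$:

$$N(s, a) = \sum_{i=1}^{n} \mathbf{1}(s, a, i), \qquad Q(s, a) = \frac{1}{N(s, a)} \sum_{i=1}^{n} \mathbf{1}(s, a, i)\, V\big(s_L^i\big)$$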

Page 38: AlphaGo: An AI Go player based on deep neural networks and monte carlo tree search

Design | Search Algorithm

Select Move
Once the search completes, AlphaGo plays the most visited move from the root position

*Image captured from Silver, D. et al. (2016)

Page 39: AlphaGo: An AI Go player based on deep neural networks and monte carlo tree search

Discussion | Performance

Against AI Players
> Played against the strongest commercial and open-source Go programs, all based on MCTS
> Single-machine AlphaGo won 494 out of 495 even games
> The distributed version won 77% of games against the single-machine version and 100% against the other programs

Page 40: AlphaGo: An AI Go player based on deep neural networks and monte carlo tree search

Discussion | Performance

Against Fan Hui
> Won 5-0 in formal games with 1 hour of main time plus three 30-second byoyomi periods¹
> Won 3-2 in informal games with three 30-second byoyomi periods¹

1. Time periods consumed after main time is exhausted; a period resets to its full length if not exceeded in a single turn
*Image captured from Silver, D. et al. (2016)

Page 41: AlphaGo: An AI Go player based on deep neural networks and monte carlo tree search

Discussion | Performance

Against Lee Sedol
> Won 4-1 in formal games with 2 hours of main time plus three 60-second byoyomi periods
> Game 4, the only loss, is still being analyzed
> MCTS may have overlooked Lee's game-changing move, the only move that could have saved the game in that position

Game 4: Lee Sedol (White) vs. AlphaGo (Black); Lee Sedol wins by resignation
*Image captured from https://gogameguru.com/lee-sedol-defeats-alphago-masterful-comeback-game-4/

Page 42: AlphaGo: An AI Go player based on deep neural networks and monte carlo tree search

Discussion | Future Work

Next Potential Matches
> Imperfect-information games (e.g., Poker, StarCraft)
> AlphaGo based on pure learning
> Testbed for future algorithmic research

Application Areas
> Gaming
> Healthcare
> Smartphone assistants

Healthcare Applications
> Medical diagnosis from images
> Longitudinal tracking of vital signs to help people lead healthier lifestyles

Page 43: AlphaGo: An AI Go player based on deep neural networks and monte carlo tree search

Discussion | Future Work

Next Potential Matches
> Imperfect-information games (e.g., Poker, StarCraft)
> AlphaGo based on pure learning
> Testbed for future algorithmic research

"it'd be cool if one day an AI was involved in finding a new particle"
Demis Hassabis, Co-founder of DeepMind

Application Areas
> Gaming
> Healthcare
> Smartphone assistants

Healthcare Applications
> Medical diagnosis from images
> Longitudinal tracking of vital signs to help people lead healthier lifestyles

Page 44: AlphaGo: An AI Go player based on deep neural networks and monte carlo tree search

References

Browne, C. B., Powley, E., Whitehouse, D., Lucas, S. M., Cowling, P. I., Rohlfshagen, P., . . . Colton, S. (2012). A survey of Monte Carlo tree search methods. IEEE Transactions on Computational Intelligence and AI in Games, 4(1), 1-43.

Byford, S. (2016, March 10). DeepMind founder Demis Hassabis on how AI will shape the future. The Verge. Retrieved April 02, 2016, from http://www.theverge.com/2016/3/10/11192774/demis-hassabis-interview-alphago-google-deepmind-ai

Google Inc. (2016). AlphaGo | Google DeepMind. Retrieved April 02, 2016, from https://deepmind.com/alpha-go.html

LeCun, Y., Bengio, Y., & Hinton, G. (2015). Deep learning. Nature, 521(7553), 436-444.

Ormerod, D. (2016, March 13). Lee Sedol defeats AlphaGo in masterful comeback - Game 4. Retrieved April 06, 2016, from https://gogameguru.com/lee-sedol-defeats-alphago-masterful-comeback-game-4/

Silver, D., Huang, A., Maddison, C. J., Guez, A., Sifre, L., van den Driessche, G., . . . Hassabis, D. (2016). Mastering the game of Go with deep neural networks and tree search. Nature, 529(7587), 484-489.