AlphaGo – Artificial Intelligence
CIS 601 - Graduate Seminar Presentation
Seungyoon Jang
CSU ID: 2725495
Outline
Introduction
Monte Carlo Tree Search (MCTS)
Policy Network & Value Network
Supervised Learning & Reinforcement Learning
Evaluation of AlphaGo
References
Introduction
What is “Go?”
“Go” is an abstract strategy board game for two players, in which the aim is to surround more territory than the opponent with black and white stones.
Introduction
Simple rules, an enormous number of cases
1) One player uses the white stones and the other the black
2) Players take turns placing stones on the vacant intersections (called “points”) of a board with a 19x19 grid of lines
3) Compared to chess, Go is far larger: a lower bound on the number of legal board positions is 2 x 10^170
Introduction
Problem
1) All games of perfect information have an optimal value function, v*(s), which determines the outcome of the game
2) Such a game may be solved by recursively computing the optimal value function in a search tree containing approximately b^d possible sequences of moves
* b: the game’s breadth (number of legal moves per position) / d: its depth (game length)
3) In large games, especially Go (b ≈ 250, d ≈ 150), exhaustive search is infeasible (see the quick check below)
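For a rough sense of scale, here is a quick back-of-the-envelope check in Python, assuming the paper’s b ≈ 250 and d ≈ 150:

```python
# Approximate size of Go's game tree: b^d with b = 250, d = 150.
b, d = 250, 150
tree_size = b ** d                   # Python integers are arbitrary precision
print(f"b^d is roughly 10^{len(str(tree_size)) - 1}")   # -> b^d is roughly 10^359
```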
Monte Carlo Tree Search (MCTS)
What’s Monte Carlo Tree Search?
1) A heuristic search algorithm for some kinds of decision processes, most notably those employed in game play
2) The focus of MCTS is on the analysis of the most promising moves, expanding the search tree based on random sampling of the search space
3) As more simulations are executed, the search tree grows larger and the relevant values become more accurate
4) Before AlphaGo, however, MCTS programs had been limited to shallow policies or value functions based on a linear combination of input features (a minimal sketch of the core loop follows)
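To make the four MCTS phases (selection, expansion, simulation, backpropagation) concrete, here is a minimal UCT-style sketch in Python; the game interface (legal_moves, play, is_terminal, winner) is a hypothetical stand-in for any two-player game:

```python
import math, random

class Node:
    """One search-tree node: a game state plus visit statistics."""
    def __init__(self, state, parent=None, move=None):
        self.state, self.parent, self.move = state, parent, move
        self.children, self.visits, self.wins = [], 0, 0.0
        self.untried = list(state.legal_moves())    # hypothetical game API

def uct_select(node, c=1.4):
    # UCB1 score: average result (exploitation) + visit-count bonus (exploration).
    return max(node.children,
               key=lambda ch: ch.wins / ch.visits
                              + c * math.sqrt(math.log(node.visits) / ch.visits))

def mcts(root_state, n_simulations=1000):
    root = Node(root_state)
    for _ in range(n_simulations):
        node = root
        # 1) Selection: descend while the node is fully expanded.
        while not node.untried and node.children:
            node = uct_select(node)
        # 2) Expansion: add one untried move as a new child.
        if node.untried:
            move = node.untried.pop()
            node = Node(node.state.play(move), parent=node, move=move)
            node.parent.children.append(node)
        # 3) Simulation: random rollout to a terminal state.
        state = node.state
        while not state.is_terminal():
            state = state.play(random.choice(state.legal_moves()))
        # 4) Backpropagation: update statistics on the path back to the root.
        result = state.winner()      # +1 / -1 from the root player's view
        while node is not None:      # (per-player sign handling simplified)
            node.visits += 1
            node.wins += result
            node = node.parent
    # The most visited move at the root is the one chosen.
    return max(root.children, key=lambda ch: ch.visits).move
```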
Monte Carlo Tree Search (MCTS)
What’s Monte Carlo Tree Search?
Monte Carlo Tree Search (MCTS)
Employing Deep Convolutional Neural Networks
1) Deep CNNs have achieved unprecedented performance in visual domains (e.g., image classification, face recognition)
2) AlphaGo uses them to reduce the effective depth and breadth of the search tree, by:
3) Evaluating positions using a “value network”
4) Sampling actions using a “policy network”
Policy Network & Value Network
Neural Network Training Pipeline Architecture
Policy Network & Value Network
Supervised Learning (SL) Policy Network
1) Alternates between convolutional layers with weights σ and rectifier nonlinearities
2) A final softmax layer outputs a probability distribution over all legal moves a
3) The input s to the policy network is a simple representation of the board state
4) Trained on randomly sampled state-action pairs (s, a), using stochastic gradient ascent to maximize the likelihood of the human move a selected in state s (see the sketch below)
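A minimal sketch of this supervised step in a PyTorch style; the layer sizes are illustrative (AlphaGo’s actual SL network is a deeper 13-layer CNN over 48 input feature planes):

```python
import torch
import torch.nn as nn

# Illustrative policy network: conv layers + ReLU, logits over the 361 points.
policy_net = nn.Sequential(
    nn.Conv2d(48, 192, kernel_size=5, padding=2), nn.ReLU(),
    nn.Conv2d(192, 192, kernel_size=3, padding=1), nn.ReLU(),
    nn.Conv2d(192, 1, kernel_size=1),
    nn.Flatten(),                        # (batch, 361) move logits
)

optimizer = torch.optim.SGD(policy_net.parameters(), lr=3e-3)
loss_fn = nn.CrossEntropyLoss()  # minimizing this = maximizing log-likelihood

def sl_step(states, human_moves):
    """One gradient step on sampled (s, a) pairs from human expert games."""
    logits = policy_net(states)          # states: (batch, 48, 19, 19)
    loss = loss_fn(logits, human_moves)  # human_moves: (batch,) point indices
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```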
Policy Network & Value Network
Rollout Policy
1) A faster but much simpler policy, trained alongside the SL policy network to compensate for the networks’ slow evaluation speed during search
2) Uses a linear softmax of small pattern features, with weights π (see the sketch below)
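A minimal sketch of such a linear softmax policy; the pattern features here are a hypothetical stand-in for AlphaGo’s hand-crafted local features:

```python
import numpy as np

def rollout_policy(features, pi):
    """Linear softmax over candidate moves.

    features: (n_moves, n_features) pattern features for each candidate move
    pi:       (n_features,) learned weights
    Returns a probability distribution over the candidate moves.
    """
    logits = features @ pi
    exp = np.exp(logits - logits.max())  # subtract the max for numerical stability
    return exp / exp.sum()

# Hypothetical usage: 5 candidate moves, 8 pattern features each.
probs = rollout_policy(np.random.rand(5, 8), np.random.rand(8))
move = np.random.choice(5, p=probs)      # sample the next rollout move
```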
Policy Network & Value Network
Reinforcement Learning of Policy Network
1) Identical in structure to the SL policy network
2) Weights ρ are initialized to the same values, ρ = σ
3) Games are played between the current policy network p_ρ and a randomly selected previous iteration of the policy network
4) Weights are then updated at each time step t by stochastic gradient ascent in the direction that maximizes expected outcome: Δρ ∝ (∂ log p_ρ(a_t | s_t) / ∂ρ) · z_t (see the sketch below)
5) Here z_t is the terminal reward at the end of the game from the perspective of the current player at time step t: +1 for winning and -1 for losing
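A minimal sketch of this REINFORCE-style update, reusing the illustrative policy_net from the SL sketch above; the self-play plumbing that produces the trajectories is omitted:

```python
import torch

rl_optimizer = torch.optim.SGD(policy_net.parameters(), lr=1e-3)

def reinforce_step(states, actions, z):
    """One policy-gradient step on a finished self-play game.

    states:  (T, 48, 19, 19) board tensors, one per time step
    actions: (T,) the moves actually played
    z:       terminal reward, +1 for a win and -1 for a loss
             (the per-player sign alternation is simplified away here)
    """
    logits = policy_net(states)
    log_probs = torch.log_softmax(logits, dim=1)
    chosen = log_probs[torch.arange(len(actions)), actions]
    # Ascent on z * log p_rho(a_t | s_t) == descent on its negative.
    loss = -(z * chosen).sum()
    rl_optimizer.zero_grad()
    loss.backward()
    rl_optimizer.step()
```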
Policy Network & Value Network
Reinforcement Learning of Value Network
1) Estimates a value function v^p(s) that predicts the outcome from position s of games played using policy p for both players
2) Similar architecture to the policy network, but outputs a single prediction instead of a probability distribution
3) Trained with weights θ to minimize the mean squared error (MSE) between the predicted value v_θ(s) and the corresponding outcome z (see the sketch below)
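A minimal sketch of the regression step, again in a PyTorch style; the value_net layout is illustrative (AlphaGo’s value network mirrors the policy network but ends in a single tanh unit):

```python
import torch
import torch.nn as nn

# Illustrative value network: conv features -> one scalar v_theta(s) in (-1, 1).
value_net = nn.Sequential(
    nn.Conv2d(48, 192, kernel_size=5, padding=2), nn.ReLU(),
    nn.Conv2d(192, 1, kernel_size=1), nn.Flatten(),
    nn.Linear(361, 256), nn.ReLU(),
    nn.Linear(256, 1), nn.Tanh(),
)

v_optimizer = torch.optim.SGD(value_net.parameters(), lr=1e-3)
mse = nn.MSELoss()

def value_step(states, outcomes):
    """One gradient step minimizing (v_theta(s) - z)^2 on self-play positions."""
    pred = value_net(states).squeeze(1)   # (batch,)
    loss = mse(pred, outcomes)            # outcomes z are +1 or -1
    v_optimizer.zero_grad()
    loss.backward()
    v_optimizer.step()
    return loss.item()
```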
Policy Network & Value Network
Searching with Policy and Value Networks
1) At each time step t of each simulation, an action a_t is selected from state s_t (see the code sketch after this list):
a_t = argmax_a ( Q(s_t, a) + u(s_t, a) )
where the exploration bonus u(s_t, a) is proportional to the policy network’s prior probability for the move but decays with repeated visits
2) The leaf node s_L is evaluated in two very different ways:
- By the value network: v_θ(s_L)
- By the outcome z_L of a random rollout played out until terminal step T using the fast rollout policy p_π
These evaluations are combined, using a mixing parameter λ, into a leaf evaluation V(s_L):
V(s_L) = (1 - λ) v_θ(s_L) + λ z_L
3) At the end of search, the algorithm chooses the most visited move from the root position
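A minimal sketch of the selection rule and the mixed leaf evaluation; the per-edge statistics (P, N, Q), the constants, and the value_net/rollout callables are illustrative assumptions:

```python
import math

C_PUCT = 5.0   # exploration constant (illustrative value)
LAMBDA = 0.5   # mixing parameter; the paper reports lambda = 0.5 worked best

def select_action(node):
    """a_t = argmax_a Q(s_t, a) + u(s_t, a); u decays as an edge is visited."""
    total_visits = sum(edge.N for edge in node.edges)
    def u(edge):
        return C_PUCT * edge.P * math.sqrt(total_visits) / (1 + edge.N)
    return max(node.edges, key=lambda edge: edge.Q + u(edge))

def evaluate_leaf(leaf, value_net, rollout):
    """V(s_L) = (1 - lambda) * v_theta(s_L) + lambda * z_L."""
    v = value_net(leaf.state)   # value-network estimate of the leaf
    z = rollout(leaf.state)     # fast-rollout outcome from the leaf, +1 / -1
    return (1 - LAMBDA) * v + LAMBDA * z
```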
Monte Carlo Tree Search (MCTS)
One more quick review of MCTS structure
Evaluation of AlphaGo
* Elo rating: a method for calculating the relative skill levels of players in zero-sum games (a quick worked example follows)
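For reference, a minimal sketch of the standard Elo expected-score and update formulas (the K-factor value is illustrative):

```python
def elo_expected(r_a, r_b):
    """Expected score of player A against player B."""
    return 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400.0))

def elo_update(r_a, r_b, score_a, k=32):
    """A's new rating after one game; score_a is 1 (win), 0.5 (draw), or 0 (loss)."""
    return r_a + k * (score_a - elo_expected(r_a, r_b))

# E.g., a player rated 200 points higher is expected to score about 0.76.
print(round(elo_expected(2000, 1800), 2))   # -> 0.76
```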
References
Silver, David, et al. "Mastering the game of Go with deep neural networks and tree search." Nature 529.7587 (2016): 484-489.
https://en.wikipedia.org/wiki/Monte_Carlo_tree_search
https://en.wikipedia.org/wiki/Go_(game)