Reinforcement Learning
10-701 Introduction to Machine Learning
Matt Gormley, Lecture 22
November 30, 2016
School of Computer Science
Readings: Mitchell Ch. 13
1
Reminders
• Poster Session – Fri, Dec 2, 2:30pm – 5:30pm
• Final Report – due Fri, Dec 9
2
REINFORCEMENT LEARNING
3
Eric Xing
What is Learning?
• Learning takes place as a result of interaction between an agent and the world.
• The idea behind learning is that percepts received by an agent should be used not only for understanding, interpreting, and prediction, as in the machine learning tasks we have addressed so far, but also for acting, and furthermore for improving the agent's ability to behave optimally in the future to achieve its goal.
4 © Eric Xing @ CMU, 2006-2011
Types of Learning
• Supervised Learning
  A situation in which sample (input, output) pairs of the function to be learned can be perceived or are given. You can think of it as if there is a kind teacher.
  - Training data: (X, Y) (features, label)
  - Predict Y, minimizing some loss.
  - Regression, Classification.
• Unsupervised Learning
  - Training data: X (features only)
  - Find "similar" points in high-dimensional X-space.
  - Clustering.
5 © Eric Xing @ CMU, 2006-2011
Example of Supervised Learning
• Predict the price of a stock 6 months from now, based on economic data. (Regression)
• Predict whether a patient, hospitalized due to a heart attack, will have a second heart attack, based on demographic, diet, and clinical measurements for that patient. (Logistic Regression)
• Identify the numbers in a handwritten ZIP code from a digitized image (pixels). (Classification)
6 © Eric Xing @ CMU, 2006-2011
Example of Unsupervised Learning
• From DNA micro-array data, determine which genes are most "similar" in terms of their expression profiles. (Clustering)
7 © Eric Xing @ CMU, 2006-2011
Types of Learning (Cont'd)
• Reinforcement Learning
  The agent acts on its environment and receives some evaluation of its action (reinforcement), but is not told which action is the correct one to achieve its goal.
  - Training data: (S, A, R) (State, Action, Reward)
  - Develop an optimal policy (sequence of decision rules) for the learner so as to maximize its long-term reward.
  - Robotics, board-game-playing programs.
8 © Eric Xing @ CMU, 2006-2011
RL is learning from interaction
9 © Eric Xing @ CMU, 2006-2011
Examples of Reinforcement Learning
• How should a robot behave so as to optimize its "performance"? (Robotics)
• How to automate the motion of a helicopter? (Control Theory)
• How to make a good chess-playing program? (Artificial Intelligence)
10 © Eric Xing @ CMU, 2006-2011
Autonomous Helicopter
Video: https://www.youtube.com/watch?v=VCdxqn0fcnE
11
Robot in a room
• What's the strategy to achieve max reward?
• What if the actions were NOT deterministic?
12 © Eric Xing @ CMU, 2006-2011
Pole Balancing
• Task: move the cart left/right to keep the pole balanced
• State representation:
  - Position and velocity of the cart
  - Angle and angular velocity of the pole
13 © Eric Xing @ CMU, 2006-2011
History of Reinforcement Learning
• Roots in the psychology of animal learning (Thorndike, 1911).
• Another independent thread was the problem of optimal control, and its solution using dynamic programming (Bellman, 1957).
• Idea of temporal difference learning (an on-line method), e.g., for playing board games (Samuel, 1959).
• A major breakthrough was the discovery of Q-learning (Watkins, 1989).
14 © Eric Xing @ CMU, 2006-2011
What is special about RL?
• RL is learning how to map states to actions, so as to maximize a numerical reward over time.
• Unlike other forms of learning, it is a multistage decision-making process (often Markovian).
• An RL agent must learn by trial and error. (Not entirely supervised, but interactive.)
• Actions may affect not only the immediate reward but also subsequent rewards (delayed effect).
15 © Eric Xing @ CMU, 2006-2011
Elements of RL
• A policy
  - A map from state space to action space.
  - May be stochastic.
• A reward function
  - Maps each state (or state-action pair) to a real number, called the reward.
• A value function
  - The value of a state (or state-action pair) is the total expected reward, starting from that state (or state-action pair).
16 © Eric Xing @ CMU, 2006-2011
Maze Example
17
• Rewards: -1 per time-step
• Actions: N, E, S, W
• States: Agent's location
(Grid shows a maze with marked Start and Goal cells.)
Slide from David Silver (Intro RL lecture)
Maze Example: Policy
Arrows on the grid represent the policy π(s) for each state s.
18 Slide from David Silver (Intro RL lecture)
Maze Example: Value Function (Expected Future Reward)
Numbers on the grid represent the value v_π(s) of each state s, ranging from -24 in the farthest state down to -1 in the state next to the Goal.
19 Slide from David Silver (Intro RL lecture)
Maze Example: Model
The agent may have an internal model of the environment:
• Dynamics: how actions change the state
• Rewards: how much reward comes from each state
The model may be imperfect.
The grid layout represents the transition model P^a_{ss'}; the numbers (-1) represent the immediate reward R^a_s from each state s (the same for all actions a).
20 Slide from David Silver (Intro RL lecture)
Policy
21 © Eric Xing @ CMU, 2006-2011
Reward for each step: -2
22 © Eric Xing @ CMU, 2006-2011
Reward for each step: -0.1
23 © Eric Xing @ CMU, 2006-2011
Reward for each step: -0.04
24 © Eric Xing @ CMU, 2006-2011
The Precise Goal
• To find a policy that maximizes the value function.
• Transitions and rewards are usually not available.
• There are different approaches to achieve this goal in various situations.
• Value iteration and policy iteration are two classic approaches to this problem; essentially, both are dynamic programming.
• Q-learning is a more recent approach to this problem; essentially, it is a temporal-difference method.
25 © Eric Xing @ CMU, 2006-2011
MARKOV DECISION PROCESSES
26
Markov Decision Processes
A Markov decision process is a tuple (S, A, {P_sa}, γ, R), where:
• S is a set of states
• A is a set of actions
• P_sa are the state transition probabilities: for each state s ∈ S and action a ∈ A, P_sa is a distribution over next states
• γ ∈ [0, 1) is the discount factor
• R is the reward function, mapping states (or state-action pairs) to real numbers
27 © Eric Xing @ CMU, 2006-2011
The dynamics of an MDP
• We start in some state s0, and get to choose some action a0 ∈ A.
• As a result of our choice, the state of the MDP randomly transitions to some successor state s1, drawn according to s1 ~ P_{s0 a0}.
• Then, we get to pick another action a1, and so on ...
28 © Eric Xing @ CMU, 2006-2011
The dynamics of an MDP (Cont'd)
• Upon visiting the sequence of states s0, s1, ..., with actions a0, a1, ..., our total payoff is given by
    R(s0, a0) + γ R(s1, a1) + γ² R(s2, a2) + ⋯
• Or, when we are writing rewards as a function of the states only, this becomes
    R(s0) + γ R(s1) + γ² R(s2) + ⋯
• For most of our development, we will use the simpler state rewards R(s), though the generalization to state-action rewards R(s, a) offers no special difficulties.
• Our goal in reinforcement learning is to choose actions over time so as to maximize the expected value of the total payoff:
    E[ R(s0) + γ R(s1) + γ² R(s2) + ⋯ ]
29 © Eric Xing @ CMU, 2006-2011
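The total payoff above is just a discounted sum, and it is easy to make the definition concrete in code. The helper below is a sketch (not from the lecture); `total_payoff` and its arguments are names invented here for illustration:

```python
# Total discounted payoff for a sequence of state rewards R(s_0), R(s_1), ...
# total_payoff computes sum_t gamma^t * R(s_t).
def total_payoff(rewards, gamma):
    return sum((gamma ** t) * r for t, r in enumerate(rewards))

# With a reward of 1 at every step and gamma = 0.9, the payoff
# approaches 1 / (1 - 0.9) = 10 as the horizon grows.
print(total_payoff([1.0] * 200, 0.9))
```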
FIXED POINT ITERATION
30
Fixed Point Iteration for Optimization
• Fixed point iteration is a general tool for solving systems of equations.
• It can also be applied to optimization.
1. Given objective function J(θ).
2. Compute the derivative, set it to zero (call this function f):
    dJ(θ)/dθ_i = 0 = f(θ)
3. Rearrange the equation s.t. one of the parameters appears on the LHS:
    0 = f(θ)  ⇒  θ_i = g(θ)
4. Initialize the parameters.
5. For i in {1, ..., K}, update each parameter and increment t:
    θ_i^{(t+1)} = g(θ^{(t)})
6. Repeat #5 until convergence.
31
Fixed Point Iteration for Optimization
• Fixed point iteration is a general tool for solving systems of equations.
• It can also be applied to optimization.
1. Given objective function:
    J(x) = x³/3 - (3/2)x² + 2x
2. Compute the derivative, set it to zero (call this function f):
    dJ(x)/dx = f(x) = x² - 3x + 2 = 0
3. Rearrange the equation s.t. the parameter appears on the LHS:
    ⇒ x = (x² + 2)/3 = g(x)
4. Initialize the parameter.
5. Update and increment t:
    x ← (x² + 2)/3
6. Repeat #5 until convergence.
32
Fixed Point Iteration for Optimization
We can implement our example in a few lines of python.
33
    J(x) = x³/3 - (3/2)x² + 2x
    dJ(x)/dx = f(x) = x² - 3x + 2 = 0
    ⇒ x = (x² + 2)/3 = g(x)
    Update: x ← (x² + 2)/3
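The script itself is not reproduced on the slide; a minimal reconstruction that matches the printed trace on the next slide (the file name fixed-point-iteration.py is taken from that output) could look like this:

```python
# fixed-point-iteration.py
# Minimize J(x) = x^3/3 - (3/2)x^2 + 2x by fixed point iteration:
# set f(x) = dJ/dx = x^2 - 3x + 2 = 0 and rearrange to x = g(x) = (x^2 + 2)/3.

def f(x):
    """Derivative of J; we seek a root of f."""
    return x**2 - 3*x + 2

def g(x):
    """Update function obtained by rearranging f(x) = 0."""
    return (x**2 + 2) / 3.0

x = 0.0
for i in range(21):
    print("i=%2d x=%.4f f(x)=%.4f" % (i, x, f(x)))
    x = g(x)
```

Starting from x = 0, the iterates converge to the root x = 1 of f.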
Fixed Point Iteration for Optimization
34
$ python fixed-point-iteration.py
i= 0 x=0.0000 f(x)=2.0000
i= 1 x=0.6667 f(x)=0.4444
i= 2 x=0.8148 f(x)=0.2195
i= 3 x=0.8880 f(x)=0.1246
i= 4 x=0.9295 f(x)=0.0755
i= 5 x=0.9547 f(x)=0.0474
i= 6 x=0.9705 f(x)=0.0304
i= 7 x=0.9806 f(x)=0.0198
i= 8 x=0.9872 f(x)=0.0130
i= 9 x=0.9915 f(x)=0.0086
i=10 x=0.9944 f(x)=0.0057
i=11 x=0.9963 f(x)=0.0038
i=12 x=0.9975 f(x)=0.0025
i=13 x=0.9983 f(x)=0.0017
i=14 x=0.9989 f(x)=0.0011
i=15 x=0.9993 f(x)=0.0007
i=16 x=0.9995 f(x)=0.0005
i=17 x=0.9997 f(x)=0.0003
i=18 x=0.9998 f(x)=0.0002
i=19 x=0.9999 f(x)=0.0001
i=20 x=0.9999 f(x)=0.0001
VALUE ITERATION
35
Elements of RL
• A policy
  - A map from state space to action space.
  - May be stochastic.
• A reward function
  - Maps each state (or state-action pair) to a real number, called the reward.
• A value function
  - The value of a state (or state-action pair) is the total expected reward, starting from that state (or state-action pair).
36 © Eric Xing @ CMU, 2006-2011
Policy
• A policy is any function π : S → A mapping from the states to the actions.
• We say that we are executing some policy π if, whenever we are in state s, we take action a = π(s).
• We also define the value function for a policy π according to
    V^π(s) = E[ R(s0) + γ R(s1) + γ² R(s2) + ⋯ | s0 = s, π ]
• V^π(s) is simply the expected sum of discounted rewards upon starting in state s, and taking actions according to π.
37 © Eric Xing @ CMU, 2006-2011
Value Function
• Given a fixed policy π, its value function V^π satisfies the Bellman equations:
    V^π(s) = R(s) + γ Σ_{s'∈S} P_{s π(s)}(s') V^π(s')
    (immediate reward) + (expected sum of future discounted rewards)
• Bellman's equations can be used to efficiently solve for V^π (see later).
38 © Eric Xing @ CMU, 2006-2011
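For a fixed policy the Bellman equations are linear in V^π, so they can be solved directly as V = (I - γ P_π)^{-1} R. The numpy illustration below uses a hypothetical 3-state chain invented here (the MDP, rewards, and names are not from the lecture):

```python
import numpy as np

# Solve the Bellman equations V = R + gamma * P_pi @ V for a fixed policy:
# V = (I - gamma * P_pi)^{-1} R.  P_pi[s, s'] is the transition probability
# from s to s' under the policy; the 3-state chain is invented for illustration.
gamma = 0.9
R = np.array([0.0, 0.0, 1.0])        # state rewards R(s)
P_pi = np.array([[0.0, 1.0, 0.0],    # transitions under the policy:
                 [0.0, 0.0, 1.0],    # 0 -> 1 -> 2, then stay at 2
                 [0.0, 0.0, 1.0]])
V = np.linalg.solve(np.eye(3) - gamma * P_pi, R)
print(V)  # [8.1, 9.0, 10.0]: state 2 is worth 1/(1 - 0.9) = 10
```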
The Grid world
• Transition model M: 0.8 in the direction you want to go; 0.2 perpendicular (0.1 left, 0.1 right)
• Policy: mapping from states to actions
• An optimal policy for the stochastic environment is shown as arrows on the 4x3 grid, with terminal states +1 and -1.
• Utilities of states:
    0.812   0.868   0.912   +1
    0.762   (wall)  0.660   -1
    0.705   0.655   0.611   0.388
• Environment:
  - Observable (accessible): percept identifies the state
  - Partially observable
• Markov property: transition probabilities depend on the state only, not on the path to the state → Markov decision problem (MDP).
• Partially observable MDP (POMDP): percepts do not have enough info to identify transition probabilities.
39 © Eric Xing @ CMU, 2006-2011
Optimal value function
• We define the optimal value function according to
    V*(s) = max_π V^π(s)          (1)
• In other words, this is the best possible expected sum of discounted rewards that can be attained using any policy.
• There is a version of Bellman's equations for the optimal value function:
    V*(s) = R(s) + max_{a∈A} γ Σ_{s'∈S} P_sa(s') V*(s')          (2)
• Why?
40 © Eric Xing @ CMU, 2006-2011
Optimal policy
• We also define a policy π* : S → A as follows:
    π*(s) = argmax_{a∈A} Σ_{s'∈S} P_sa(s') V*(s')          (3)
• Fact: V*(s) = V^{π*}(s) ≥ V^π(s) for every policy π and every state s.
• Policy π* has the interesting property that it is the optimal policy for all states s.
  - It is not the case that if we were starting in some state s then there would be some optimal policy for that state, and if we were starting in some other state s' then there would be some other policy that is optimal for s'.
  - The same policy π* attains the maximum in (1) for all states s. This means that we can use the same policy no matter what the initial state of our MDP is.
41 © Eric Xing @ CMU, 2006-2011
The Basic Setting for Learning
• Training data: n finite-horizon trajectories, of the form
    {s0, a0, r0, s1, a1, r1, ..., sT, aT, rT, sT+1}.
• Deterministic or stochastic policy: a sequence of decision rules
    {π0, π1, ..., πT}.
• Each πt maps from the observable history (states and actions) to the action space at that time point.
42 © Eric Xing @ CMU, 2006-2011
Algorithm 1: Value iteration
• Consider only MDPs with finite state and action spaces.
• The value iteration algorithm:
    1. For each state s, initialize V(s) := 0.
    2. Repeat until convergence:
         For every state s, update V(s) := R(s) + max_{a∈A} γ Σ_{s'} P_sa(s') V(s').
• The update can be done with synchronous updates (compute all new values from the old ones) or asynchronous updates (update one state at a time, in place).
• It can be shown that value iteration will cause V to converge to V*. Having found V*, we can then use Equation (3) to find the optimal policy.
43 © Eric Xing @ CMU, 2006-2011
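A sketch of synchronous value iteration, on a hypothetical 3-state, 2-action MDP invented here for illustration (the transition probabilities, rewards, and tolerance are all assumptions, not from the lecture):

```python
import numpy as np

# Value iteration on a hypothetical 3-state, 2-action MDP (invented for
# illustration).  P[a, s, s'] is the transition probability P_sa(s'),
# R[s] is the state reward, gamma is the discount factor.
gamma = 0.9
R = np.array([0.0, 0.0, 1.0])
P = np.array([
    [[0.9, 0.1, 0.0], [0.1, 0.9, 0.0], [0.0, 0.1, 0.9]],  # action 0
    [[0.1, 0.9, 0.0], [0.0, 0.1, 0.9], [0.0, 0.0, 1.0]],  # action 1
])

V = np.zeros(3)
while True:
    # Synchronous update: V(s) := R(s) + gamma * max_a sum_{s'} P_sa(s') V(s')
    V_new = R + gamma * np.max(P @ V, axis=0)
    if np.max(np.abs(V_new - V)) < 1e-10:
        break
    V = V_new

# Extract the greedy policy, as in Equation (3)
pi = np.argmax(P @ V, axis=0)
print(pi)  # [1 1 1]: always move toward the rewarding state
```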
Algorithm 2: Policy iteration
• The policy iteration algorithm:
    1. Initialize π randomly.
    2. Repeat until convergence:
         (a) Let V := V^π.
         (b) For each state s, let π(s) := argmax_{a∈A} Σ_{s'} P_sa(s') V(s') (greedy update).
• The inner loop repeatedly computes the value function for the current policy, and then updates the policy using the current value function.
• After at most a finite number of iterations of this algorithm, V will converge to V*, and π will converge to π*.
44 © Eric Xing @ CMU, 2006-2011
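Policy iteration can be sketched in the same style; the 3-state, 2-action MDP below is invented here for illustration, and the policy evaluation step uses an exact linear solve of the Bellman equations:

```python
import numpy as np

# Policy iteration on a hypothetical 3-state, 2-action MDP (invented for
# illustration): alternate exact policy evaluation with greedy improvement.
gamma = 0.9
R = np.array([0.0, 0.0, 1.0])
P = np.array([
    [[0.9, 0.1, 0.0], [0.1, 0.9, 0.0], [0.0, 0.1, 0.9]],  # action 0
    [[0.1, 0.9, 0.0], [0.0, 0.1, 0.9], [0.0, 0.0, 1.0]],  # action 1
])

pi = np.zeros(3, dtype=int)                  # arbitrary initial policy
while True:
    P_pi = P[pi, np.arange(3)]               # row s is P_{s, pi(s)}(.)
    V = np.linalg.solve(np.eye(3) - gamma * P_pi, R)   # V := V^pi exactly
    pi_new = np.argmax(P @ V, axis=0)        # greedy update
    if np.array_equal(pi_new, pi):
        break                                # policy is stable
    pi = pi_new
print(pi)  # [1 1 1]
```

Because each improvement step strictly increases the value of at least one state (or leaves the policy unchanged), the loop terminates after finitely many iterations.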
Convergence
• The plot of utility values for selected states at each iteration step, in the application of VALUE-ITERATION to the 4x3 world in our example, is omitted here.
• Theorem: As t → ∞, value iteration converges to the exact U even if updates are done asynchronously and the state i is picked randomly at every step.
(4x3 grid with the Start state and terminal states +1 and -1.)
45 © Eric Xing @ CMU, 2006-2011
When to stop value iteration?
Convergence
46 © Eric Xing @ CMU, 2006-2011
Q-LEARNING
47
Q learning
• Define the Q-value function:
    Q*(s, a) = R(s) + γ Σ_{s'} P_sa(s') max_{a'} Q*(s', a')
• Q-value function updating rule:
    Q(s, a) := Q(s, a) + α ( r + γ max_{a'} Q(s', a') - Q(s, a) )
  - See subsequent slides
• Key idea of TD-Q learning:
  - Combined with the temporal difference approach
• Rule to choose the action to take:
  - e.g., greedily, a = argmax_a Q(s, a), optionally with exploration
48 © Eric Xing @ CMU, 2006-2011
Algorithm 3: Q learning
For each pair (s, a), initialize Q(s, a)
Observe the current state s
Loop forever {
    Select an action a (optionally with ε-exploration) and execute it
    Receive immediate reward r and observe the new state s'
    Update Q(s, a)
    s := s'
}
49 © Eric Xing @ CMU, 2006-2011
Exploration
• Tradeoff between exploitation (control) and exploration (identification)
• Extremes: greedy vs. random acting (n-armed bandit models)
• Q-learning converges to optimal Q-values if:
  - Every state is visited infinitely often (due to exploration),
  - The action selection becomes greedy as time approaches infinity, and
  - The learning rate α is decreased fast enough but not too fast (as we discussed in TD learning).
50 © Eric Xing @ CMU, 2006-2011
RL EXAMPLES
51
A Success Story
• TD-Gammon (Tesauro, G., 1992)
  - A backgammon-playing program.
  - An application of temporal difference learning.
  - The basic learner is a neural network.
  - It trained itself to the world-class level by playing against itself and learning from the outcome. So smart!!
  - More information: http://www.research.ibm.com/massive/tdl.html
52 © Eric Xing @ CMU, 2006-2011
Atari Example: Reinforcement Learning
(Diagram: at each step the agent receives observation O_t and reward R_t, and emits action A_t.)
• Rules of the game are unknown
• Learn directly from interactive game-play
• Pick actions on the joystick, see pixels and scores
Playing Atari with Deep RL
• Setup: RL system observes the pixels on the screen
• It receives rewards as the game score
• Actions decide how to move the joystick / buttons
53 Figures from David Silver (Intro RL lecture)
Playing Atari with Deep RL
Videos: – Atari: https://www.youtube.com/watch?v=V1eYniJ0Rnk
– Space Invaders: https://www.youtube.com/watch?v=ePv0Fs9cGgU
54
Figure 1: Screen shots from five Atari 2600 Games: (Left-to-right) Pong, Breakout, Space Invaders, Seaquest, Beam Rider

... an experience replay mechanism [13] which randomly samples previous transitions, and thereby smooths the training distribution over many past behaviors.

We apply our approach to a range of Atari 2600 games implemented in The Arcade Learning Environment (ALE) [3]. Atari 2600 is a challenging RL testbed that presents agents with a high dimensional visual input (210 × 160 RGB video at 60Hz) and a diverse and interesting set of tasks that were designed to be difficult for human players. Our goal is to create a single neural network agent that is able to successfully learn to play as many of the games as possible. The network was not provided with any game-specific information or hand-designed visual features, and was not privy to the internal state of the emulator; it learned from nothing but the video input, the reward and terminal signals, and the set of possible actions, just as a human player would. Furthermore the network architecture and all hyperparameters used for training were kept constant across the games. So far the network has outperformed all previous RL algorithms on six of the seven games we have attempted and surpassed an expert human player on three of them. Figure 1 provides sample screenshots from five of the games used for training.

2 Background

We consider tasks in which an agent interacts with an environment E, in this case the Atari emulator, in a sequence of actions, observations and rewards. At each time-step the agent selects an action a_t from the set of legal game actions, A = {1, ..., K}. The action is passed to the emulator and modifies its internal state and the game score. In general E may be stochastic. The emulator's internal state is not observed by the agent; instead it observes an image x_t ∈ R^d from the emulator, which is a vector of raw pixel values representing the current screen. In addition it receives a reward r_t representing the change in game score. Note that in general the game score may depend on the whole prior sequence of actions and observations; feedback about an action may only be received after many thousands of time-steps have elapsed.

Since the agent only observes images of the current screen, the task is partially observed and many emulator states are perceptually aliased, i.e. it is impossible to fully understand the current situation from only the current screen x_t. We therefore consider sequences of actions and observations, s_t = x_1, a_1, x_2, ..., a_{t-1}, x_t, and learn game strategies that depend upon these sequences. All sequences in the emulator are assumed to terminate in a finite number of time-steps. This formalism gives rise to a large but finite Markov decision process (MDP) in which each sequence is a distinct state. As a result, we can apply standard reinforcement learning methods for MDPs, simply by using the complete sequence s_t as the state representation at time t.

The goal of the agent is to interact with the emulator by selecting actions in a way that maximises future rewards. We make the standard assumption that future rewards are discounted by a factor of γ per time-step, and define the future discounted return at time t as R_t = Σ_{t'=t}^{T} γ^{t'-t} r_{t'}, where T is the time-step at which the game terminates. We define the optimal action-value function Q*(s, a) as the maximum expected return achievable by following any strategy, after seeing some sequence s and then taking some action a, Q*(s, a) = max_π E[R_t | s_t = s, a_t = a, π], where π is a policy mapping sequences to actions (or distributions over actions).

The optimal action-value function obeys an important identity known as the Bellman equation. This is based on the following intuition: if the optimal value Q*(s', a') of the sequence s' at the next time-step was known for all possible actions a', then the optimal strategy is to select the action a' ...

Figures from Mnih et al. (2013)
Playing Atari with Deep RL
55
                  B. Rider  Breakout  Enduro  Pong   Q*bert  Seaquest  S. Invaders
Random            354       1.2       0       -20.4  157     110       179
Sarsa [3]         996       5.2       129     -19    614     665       271
Contingency [4]   1743      6         159     -17    960     723       268
DQN               4092      168       470     20     1952    1705      581
Human             7456      31        368     -3     18900   28010     3690
HNeat Best [8]    3616      52        106     19     1800    920       1720
HNeat Pixel [8]   1332      4         91      -16    1325    800       1145
DQN Best          5184      225       661     21     4500    1740      1075

Table 1: The upper table compares average total reward for various learning methods by running an ε-greedy policy with ε = 0.05 for a fixed number of steps. The lower table reports results of the single best performing episode for HNeat and DQN. HNeat produces deterministic policies that always get the same score while DQN used an ε-greedy policy with ε = 0.05.
types of objects on the Atari screen. The HNeat Pixel score is obtained by using the special 8 color channel representation of the Atari emulator that represents an object label map at each channel. This method relies heavily on finding a deterministic sequence of states that represents a successful exploit. It is unlikely that strategies learnt in this way will generalize to random perturbations; therefore the algorithm was only evaluated on the highest scoring single episode. In contrast, our algorithm is evaluated on ε-greedy control sequences, and must therefore generalize across a wide variety of possible situations. Nevertheless, we show that on all the games, except Space Invaders, not only our max evaluation results (row 8), but also our average results (row 4) achieve better performance.

Finally, we show that our method achieves better performance than an expert human player on Breakout, Enduro and Pong and it achieves close to human performance on Beam Rider. The games Q*bert, Seaquest, Space Invaders, on which we are far from human performance, are more challenging because they require the network to find a strategy that extends over long time scales.

6 Conclusion

This paper introduced a new deep learning model for reinforcement learning, and demonstrated its ability to master difficult control policies for Atari 2600 computer games, using only raw pixels as input. We also presented a variant of online Q-learning that combines stochastic minibatch updates with experience replay memory to ease the training of deep networks for RL. Our approach gave state-of-the-art results in six of the seven games it was tested on, with no adjustment of the architecture or hyperparameters.
References

[1] Leemon Baird. Residual algorithms: Reinforcement learning with function approximation. In Proceedings of the 12th International Conference on Machine Learning (ICML 1995), pages 30-37. Morgan Kaufmann, 1995.
[2] Marc Bellemare, Joel Veness, and Michael Bowling. Sketch-based linear value function approximation. In Advances in Neural Information Processing Systems 25, pages 2222-2230, 2012.
[3] Marc G. Bellemare, Yavar Naddaf, Joel Veness, and Michael Bowling. The arcade learning environment: An evaluation platform for general agents. Journal of Artificial Intelligence Research, 47:253-279, 2013.
[4] Marc G. Bellemare, Joel Veness, and Michael Bowling. Investigating contingency awareness using Atari 2600 games. In AAAI, 2012.
[5] Marc G. Bellemare, Joel Veness, and Michael Bowling. Bayesian learning of recursively factored environments. In Proceedings of the Thirtieth International Conference on Machine Learning (ICML 2013), pages 1211-1219, 2013.
Alpha Go
Game of Go (圍棋)
• 19x19 board
• Players alternately play black/white stones
• Goal is to fully encircle the largest region on the board
• Simple rules, but extremely complex game play
56 Figure from Silver et al. (2016)
488 | NATURE | VOL 529 | 28 JANUARY 2016

on high-performance MCTS algorithms. In addition, we included the open source program GnuGo, a Go program using state-of-the-art search methods that preceded MCTS. All programs were allowed 5 s of computation time per move.

The results of the tournament (see Fig. 4a) suggest that single-machine AlphaGo is many dan ranks stronger than any previous Go program, winning 494 out of 495 games (99.8%) against other Go programs. To provide a greater challenge to AlphaGo, we also played games with four handicap stones (that is, free moves for the opponent); AlphaGo won 77%, 86%, and 99% of handicap games against Crazy Stone, Zen and Pachi, respectively. The distributed version of AlphaGo was significantly stronger, winning 77% of games against single-machine AlphaGo and 100% of its games against other programs.

We also assessed variants of AlphaGo that evaluated positions using just the value network (λ = 0) or just rollouts (λ = 1) (see Fig. 4b). Even without rollouts AlphaGo exceeded the performance of all other Go programs, demonstrating that value networks provide a viable alternative to Monte Carlo evaluation in Go. However, the mixed evaluation (λ = 0.5) performed best, winning ≥95% of games against other variants. This suggests that the two position-evaluation mechanisms are complementary: the value network approximates the outcome of games played by the strong but impractically slow p_ρ, while the rollouts can precisely score and evaluate the outcome of games played by the weaker but faster rollout policy p_π. Figure 5 visualizes the evaluation of a real game position by AlphaGo.

Finally, we evaluated the distributed version of AlphaGo against Fan Hui, a professional 2 dan, and the winner of the 2013, 2014 and 2015 European Go championships. Over 5-9 October 2015 AlphaGo and Fan Hui competed in a formal five-game match. AlphaGo won the match 5 games to 0 (Fig. 6 and Extended Data Table 1). This is the first time that a computer Go program has defeated a human professional player, without handicap, in the full game of Go, a feat that was previously believed to be at least a decade away [3,7,31].

Discussion

In this work we have developed a Go program, based on a combination of deep neural networks and tree search, that plays at the level of the strongest human players, thereby achieving one of artificial intelligence's "grand challenges" [31-33]. We have developed, for the first time, effective move selection and position evaluation functions for Go, based on deep neural networks that are trained by a novel combination

Figure 6 | Games from the match between AlphaGo and the European champion, Fan Hui. Moves are shown in a numbered sequence corresponding to the order in which they were played. Repeated moves on the same intersection are shown in pairs below the board. The first move number in each pair indicates when the repeat move was played, at an intersection identified by the second move number (see Supplementary Information).
(The five board diagrams of Figure 6, with their numbered move sequences, are omitted here.)
Game 1: Fan Hui (Black), AlphaGo (White). AlphaGo wins by 2.5 points.
Game 2: AlphaGo (Black), Fan Hui (White). AlphaGo wins by resignation.
Game 3: Fan Hui (Black), AlphaGo (White). AlphaGo wins by resignation.
Game 4: AlphaGo (Black), Fan Hui (White). AlphaGo wins by resignation.
Game 5: Fan Hui (Black), AlphaGo (White). AlphaGo wins by resignation.
© 2016 Macmillan Publishers Limited. All rights reserved
Alpha Go • The state space is too large to represent explicitly, since the
number of sequences of moves is O(b^d) – Go: b = 250, d = 150 – Chess: b = 35, d = 80
• Key idea: – Define a neural network to approximate the value function – Train by policy gradient
57
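The gap between these two search spaces is easy to quantify: b^d has roughly d·log10(b) decimal digits. A quick check, using the b and d values quoted above:

```python
import math

def tree_size_digits(b, d):
    """Number of decimal digits in b**d, the rough count of move sequences."""
    return int(d * math.log10(b)) + 1

print(tree_size_digits(250, 150))  # Go: a ~360-digit number of sequences
print(tree_size_digits(35, 80))    # Chess: a ~124-digit number
```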
28 January 2016 | Vol 529 | Nature | 485
ARTICLE RESEARCH
sampled state-action pairs (s, a), using stochastic gradient ascent to maximize the likelihood of the human move a selected in state s
Δσ ∝ ∂ log pσ(a|s) / ∂σ
We trained a 13-layer policy network, which we call the SL policy network, from 30 million positions from the KGS Go Server. The network predicted expert moves on a held-out test set with an accuracy of 57.0% using all input features, and 55.7% using only raw board position and move history as inputs, compared to the state of the art from other research groups of 44.4% at date of submission [24] (full results in Extended Data Table 3). Small improvements in accuracy led to large improvements in playing strength (Fig. 2a); larger networks achieve better accuracy but are slower to evaluate during search. We also trained a faster but less accurate rollout policy pπ(a|s), using a linear softmax of small pattern features (see Extended Data Table 4) with weights π; this achieved an accuracy of 24.2%, using just 2 µs to select an action, rather than 3 ms for the policy network.
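The supervised update above can be sketched for a toy linear softmax policy. The actual SL policy network is a 13-layer convolutional network, so the linear model, feature sizes, and learning rate here are illustrative stand-ins only:

```python
import numpy as np

def sl_policy_update(W, s, a, lr=0.1):
    """One stochastic-gradient-ascent step on log p_W(a|s) for a
    linear softmax policy p(a|s) = softmax(W @ s)[a]."""
    logits = W @ s
    p = np.exp(logits - logits.max())
    p /= p.sum()
    grad = -np.outer(p, s)   # softmax part of d log p(a|s) / dW
    grad[a] += s             # indicator term for the expert move actually played
    return W + lr * grad

rng = np.random.default_rng(0)
W = np.zeros((3, 4))         # 3 candidate moves, 4 board features (toy sizes)
s = rng.normal(size=4)
for _ in range(100):
    W = sl_policy_update(W, s, a=2)  # repeatedly reinforce expert move 2
```

After enough updates the policy assigns move 2 the highest probability in state s, mirroring how the SL network comes to imitate expert play.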
Reinforcement learning of policy networks
The second stage of the training pipeline aims at improving the policy network by policy gradient reinforcement learning (RL) [25, 26]. The RL policy network pρ is identical in structure to the SL policy network, and its weights ρ are initialized to the same values, ρ = σ. We play games between the current policy network pρ and a randomly selected previous iteration of the policy network. Randomizing from a pool of opponents in this way stabilizes training by preventing overfitting to the current policy. We use a reward function r(s) that is zero for all non-terminal time steps t < T. The outcome zt = ±r(sT) is the terminal reward at the end of the game from the perspective of the current player at time step t: +1 for winning and −1 for losing. Weights are then updated at each time step t by stochastic gradient ascent in the direction that maximizes expected outcome [25]
Δρ ∝ (∂ log pρ(at|st) / ∂ρ) · zt
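A minimal sketch of this REINFORCE-style update for a toy linear softmax policy (the real pρ is a deep convolutional network; the shapes and learning rate here are illustrative):

```python
import numpy as np

def reinforce_episode(W, states, actions, z, lr=0.01):
    """For each time step t, move the weights of a linear softmax policy
    in the direction (d log p_W(a_t|s_t) / dW) * z_t, where z is +1 for
    a win and -1 for a loss."""
    for s, a in zip(states, actions):
        logits = W @ s
        p = np.exp(logits - logits.max())
        p /= p.sum()
        grad = -np.outer(p, s)
        grad[a] += s
        W = W + lr * z * grad
    return W

# One winning episode (z = +1) makes the played moves more likely.
s0 = np.array([1.0, 0.0, 0.0])
W = reinforce_episode(np.zeros((2, 3)), [s0], [0], z=+1)
```

Note that losing episodes (z = −1) push the same gradient the other way, making the losing moves less likely.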
We evaluated the performance of the RL policy network in game play, sampling each move at ∼ pρ(·|st) from its output probability distribution over actions. When played head-to-head, the RL policy network won more than 80% of games against the SL policy network. We also tested against the strongest open-source Go program, Pachi [14], a sophisticated Monte Carlo search program, ranked at 2 amateur dan on KGS, that executes 100,000 simulations per move. Using no search at all, the RL policy network won 85% of games against Pachi. In comparison, the previous state of the art, based only on supervised
Figure 1 | Neural network training pipeline and architecture. a, A fast rollout policy pπ and supervised learning (SL) policy network pσ are trained to predict human expert moves in a data set of positions. A reinforcement learning (RL) policy network pρ is initialized to the SL policy network, and is then improved by policy gradient learning to maximize the outcome (that is, winning more games) against previous versions of the policy network. A new data set is generated by playing games of self-play with the RL policy network. Finally, a value network vθ is trained by regression to predict the expected outcome (that is, whether the current player wins) in positions from the self-play data set. b, Schematic representation of the neural network architecture used in AlphaGo. The policy network takes a representation of the board position s as its input, passes it through many convolutional layers with parameters σ (SL policy network) or ρ (RL policy network), and outputs a probability distribution pσ(a|s) or pρ(a|s) over legal moves a, represented by a probability map over the board. The value network similarly uses many convolutional layers with parameters θ, but outputs a scalar value vθ(s′) that predicts the expected outcome in position s′.
Figure 2 | Strength and accuracy of policy and value networks. a, Plot showing the playing strength of policy networks as a function of their training accuracy. Policy networks with 128, 192, 256 and 384 convolutional filters per layer were evaluated periodically during training; the plot shows the winning rate of AlphaGo using that policy network against the match version of AlphaGo. b, Comparison of evaluation accuracy between the value network and rollouts with different policies.
Positions and outcomes were sampled from human expert games. Each position was evaluated by a single forward pass of the value network vθ, or by the mean outcome of 100 rollouts, played out using either uniform random rollouts, the fast rollout policy pπ, the SL policy network pσ or the RL policy network pρ. The mean squared error between the predicted value and the actual game outcome is plotted against the stage of the game (how many moves had been played in the given position).
Figure from Silver et al. (2016)
Alpha Go
• Results of a tournament
• From Silver et al. (2016): “a 230 point gap corresponds to a 79% probability of winning”
58 Figure from Silver et al. (2016)
better in AlphaGo than a value function vθ(s) ≈ v^pσ(s) derived from the SL policy network.
Evaluating policy and value networks requires several orders of magnitude more computation than traditional search heuristics. To efficiently combine MCTS with deep neural networks, AlphaGo uses an asynchronous multi-threaded search that executes simulations on CPUs, and computes policy and value networks in parallel on GPUs. The final version of AlphaGo used 40 search threads, 48 CPUs, and 8 GPUs. We also implemented a distributed version of AlphaGo that exploited multiple machines, 40 search threads, 1,202 CPUs and 176 GPUs. The Methods section provides full details of asynchronous and distributed MCTS.
Evaluating the playing strength of AlphaGo
To evaluate AlphaGo, we ran an internal tournament among variants of AlphaGo and several other Go programs, including the strongest commercial programs Crazy Stone [13] and Zen, and the strongest open source programs Pachi [14] and Fuego [15]. All of these programs are based
Figure 4 | Tournament evaluation of AlphaGo. a, Results of a tournament between different Go programs (see Extended Data Tables 6–11). Each program used approximately 5 s computation time per move. To provide a greater challenge to AlphaGo, some programs (pale upper bars) were given four handicap stones (that is, free moves at the start of every game) against all opponents. Programs were evaluated on an Elo scale [37]: a 230 point gap corresponds to a 79% probability of winning, which roughly corresponds to one amateur dan rank advantage on KGS [38]; an approximate correspondence to human ranks is also shown, and horizontal lines show KGS ranks achieved online by that program. Games against the human European champion Fan Hui were also included; these games used longer time controls. 95% confidence intervals are shown. b, Performance of AlphaGo, on a single machine, for different combinations of components. The version solely using the policy network does not perform any search. c, Scalability study of MCTS in AlphaGo with search threads and GPUs, using asynchronous search (light blue) or distributed search (dark blue), for 2 s per move.
Figure 5 | How AlphaGo (black, to play) selected its move in an informal game against Fan Hui. For each of the following statistics, the location of the maximum value is indicated by an orange circle. a, Evaluation of all successors s′ of the root position s, using the value network vθ(s′); estimated winning percentages are shown for the top evaluations. b, Action values Q(s, a) for each edge (s, a) in the tree from root position s; averaged over value network evaluations only (λ = 0). c, Action values Q(s, a), averaged over rollout evaluations only (λ = 1).
d, Move probabilities directly from the SL policy network, pσ(a|s); reported as a percentage (if above 0.1%). e, Percentage frequency with which actions were selected from the root during simulations. f, The principal variation (path with maximum visit count) from AlphaGo's search tree. The moves are presented in a numbered sequence. AlphaGo selected the move indicated by the red circle; Fan Hui responded with the move indicated by the white square; in his post-game commentary he preferred the move (labelled 1) predicted by AlphaGo.
SUMMARY
59
Eric Xing
Summary l Both value iteration and policy iteration are standard
algorithms for solving MDPs, and there isn't currently universal agreement over which algorithm is better.
l For small MDPs, value iteration is often very fast and converges with very few iterations. However, for MDPs with large state spaces, solving for V explicitly would involve solving a large system of linear equations, and could be difficult.
l In these problems, policy iteration may be preferred. In practice value iteration seems to be used more often than policy iteration.
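For a small MDP, value iteration is only a few lines; the 2-state transition matrices below are illustrative toy inputs:

```python
import numpy as np

def value_iteration(P, R, gamma=0.9, tol=1e-8):
    """Value iteration for a small MDP.
    P[a][s, s'] = transition probability under action a; R[s] = reward in s.
    Returns the optimal value function V."""
    V = np.zeros(len(R))
    while True:
        # Bellman optimality backup: V(s) = R(s) + gamma * max_a sum_s' P(s'|s,a) V(s')
        Q = np.array([R + gamma * P[a] @ V for a in range(len(P))])
        V_new = Q.max(axis=0)
        if np.max(np.abs(V_new - V)) < tol:
            return V_new
        V = V_new

# Toy 2-state MDP: action 0 stays put, action 1 jumps to state 1 (reward 1).
P = np.array([[[1., 0.], [0., 1.]],
              [[0., 1.], [0., 1.]]])
R = np.array([0., 1.])
V = value_iteration(P, R)   # converges to V = [9, 10] for gamma = 0.9
```

Here V(1) solves V(1) = 1 + 0.9·V(1) = 10, and V(0) = 0.9·V(1) = 9, matching the fixed point of the Bellman backup.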
l Q-learning is model-free: it learns action values directly from experience, using temporal-difference updates between successive value estimates.
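The tabular temporal-difference update at the heart of Q-learning, with an illustrative 2-state, 2-action table:

```python
import numpy as np

def q_update(Q, s, a, r, s_next, alpha=0.5, gamma=0.9):
    """One tabular Q-learning step: move Q(s, a) toward the TD target
    r + gamma * max_a' Q(s', a'), with no model of the MDP required."""
    td_target = r + gamma * Q[s_next].max()
    Q[s, a] += alpha * (td_target - Q[s, a])
    return Q

Q = np.zeros((2, 2))                           # 2 states x 2 actions
Q = q_update(Q, s=0, a=1, r=1.0, s_next=1)     # Q[0, 1] moves halfway toward 1.0
```

Repeating such updates along experienced transitions drives Q toward the optimal action values, from which the greedy policy can be read off.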
60 © Eric Xing @ CMU, 2006-2011
Eric Xing
Types of Learning l Supervised Learning
- Training data: (X,Y). (features, label) - Predict Y, minimizing some loss. - Regression, Classification.
l Unsupervised Learning - Training data: X. (features only) - Find “similar” points in high-dim X-space. - Clustering.
l Reinforcement Learning - Training data: (S, A, R). (State-Action-Reward)
- Develop an optimal policy (sequence of decision rules) for the learner so as to maximize its long-term reward. - Robotics, Board game playing programs
61 © Eric Xing @ CMU, 2006-2011