Reinforcement Learning
10-701 Introduction to Machine Learning
Matt Gormley, Lecture 22
November 30, 2016
School of Computer Science
Readings: Mitchell Ch. 13
1
Reminders
• Poster Session – Fri, Dec 2, 2:30pm – 5:30pm
• Final Report – due Fri, Dec 9
2
REINFORCEMENT LEARNING
3
Eric Xing
What is Learning?
• Learning takes place as a result of interaction between an agent and the world.
• The idea behind learning is that percepts received by an agent should be used not only for understanding, interpreting, and prediction, as in the machine learning tasks we have addressed so far, but also for acting, and furthermore for improving the agent's ability to behave optimally in the future to achieve its goal.
4 © Eric Xing @ CMU, 2006-2011
Types of Learning
• Supervised Learning
  A situation in which sample (input, output) pairs of the function to be learned can be perceived or are given. You can think of it as if there is a kind teacher.
  - Training data: (X, Y) (features, label)
  - Predict Y, minimizing some loss.
  - Regression, Classification.
• Unsupervised Learning
  - Training data: X (features only)
  - Find "similar" points in high-dimensional X-space.
  - Clustering.
5 © Eric Xing @ CMU, 2006-2011
Example of Supervised Learning
• Predict the price of a stock 6 months from now, based on economic data. (Regression)
• Predict whether a patient, hospitalized due to a heart attack, will have a second heart attack, based on demographic, diet, and clinical measurements for that patient. (Logistic Regression)
• Identify the numbers in a handwritten ZIP code from a digitized image (pixels). (Classification)
6 © Eric Xing @ CMU, 2006-2011
Example of Unsupervised Learning
• From DNA micro-array data, determine which genes are most "similar" in terms of their expression profiles. (Clustering)
7 © Eric Xing @ CMU, 2006-2011
Types of Learning (Cont'd)
• Reinforcement Learning
  The agent acts on its environment and receives some evaluation of its action (reinforcement), but is not told which action is the correct one to achieve its goal.
  - Training data: (S, A, R) (State, Action, Reward)
  - Develop an optimal policy (sequence of decision rules) for the learner so as to maximize its long-term reward.
  - Robotics, board-game-playing programs.
8 © Eric Xing @ CMU, 2006-2011
RL is learning from interaction
9 © Eric Xing @ CMU, 2006-2011
Examples of Reinforcement Learning
• How should a robot behave so as to optimize its "performance"? (Robotics)
• How to automate the motion of a helicopter? (Control Theory)
• How to make a good chess-playing program? (Artificial Intelligence)
10 © Eric Xing @ CMU, 2006-2011
Autonomous Helicopter
Video: https://www.youtube.com/watch?v=VCdxqn0fcnE
11
Robot in a room
• What's the strategy to achieve max reward?
• What if the actions were NOT deterministic?
12 © Eric Xing @ CMU, 2006-2011
Pole Balancing
• Task: move the cart left/right to keep the pole balanced
• State representation:
  - Position and velocity of the cart
  - Angle and angular velocity of the pole
13 © Eric Xing @ CMU, 2006-2011
History of Reinforcement Learning
• Roots in the psychology of animal learning (Thorndike, 1911).
• Another independent thread was the problem of optimal control, and its solution using dynamic programming (Bellman, 1957).
• Idea of temporal difference learning (an on-line method), e.g., for playing board games (Samuel, 1959).
• A major breakthrough was the discovery of Q-learning (Watkins, 1989).
14 © Eric Xing @ CMU, 2006-2011
What is special about RL?
• RL is learning how to map states to actions, so as to maximize a numerical reward over time.
• Unlike other forms of learning, it is a multistage decision-making process (often Markovian).
• An RL agent must learn by trial and error. (Not entirely supervised, but interactive.)
• Actions may affect not only the immediate reward but also subsequent rewards (delayed effect).
15 © Eric Xing @ CMU, 2006-2011
Elements of RL
• A policy
  - A map from state space to action space.
  - May be stochastic.
• A reward function
  - Maps each state (or state-action pair) to a real number, called the reward.
• A value function
  - The value of a state (or state-action pair) is the total expected reward, starting from that state (or state-action pair).
16 © Eric Xing @ CMU, 2006-2011
Maze Example
17
• Rewards: -1 per time-step
• Actions: N, E, S, W
• States: Agent's location
(Grid shows a maze with marked Start and Goal cells.)
Slide from David Silver (Intro RL lecture)
Maze Example: Policy
Arrows on the grid represent the policy π(s) for each state s.
18 Slide from David Silver (Intro RL lecture)
Maze Example: Value Function (Expected Future Reward)
Numbers on the grid represent the value v_π(s) of each state s, ranging from -24 in the farthest state down to -1 in the state next to the Goal.
19 Slide from David Silver (Intro RL lecture)
Maze Example: Model
The agent may have an internal model of the environment:
• Dynamics: how actions change the state
• Rewards: how much reward comes from each state
The model may be imperfect.
The grid layout represents the transition model P^a_{ss'}; the numbers (-1) represent the immediate reward R^a_s from each state s (the same for all actions a).
20 Slide from David Silver (Intro RL lecture)
Policy
21 © Eric Xing @ CMU, 2006-2011
Reward for each step: -2
22 © Eric Xing @ CMU, 2006-2011
Reward for each step: -0.1
23 © Eric Xing @ CMU, 2006-2011
Reward for each step: -0.04
24 © Eric Xing @ CMU, 2006-2011
The Precise Goal
• To find a policy that maximizes the value function.
• Transitions and rewards are usually not available.
• There are different approaches to achieve this goal in various situations.
• Value iteration and policy iteration are two classic approaches to this problem; essentially, both are dynamic programming.
• Q-learning is a more recent approach to this problem; essentially, it is a temporal-difference method.
25 © Eric Xing @ CMU, 2006-2011
MARKOV DECISION PROCESSES
26
Markov Decision Processes
A Markov decision process is a tuple (S, A, {P_sa}, γ, R), where:
• S is a set of states
• A is a set of actions
• P_sa are the state transition probabilities: for each state s ∈ S and action a ∈ A, P_sa is a distribution over next states
• γ ∈ [0, 1) is the discount factor
• R is the reward function, mapping states (or state-action pairs) to real numbers
27 © Eric Xing @ CMU, 2006-2011
The dynamics of an MDP
• We start in some state s0, and get to choose some action a0 ∈ A.
• As a result of our choice, the state of the MDP randomly transitions to some successor state s1, drawn according to s1 ~ P_{s0 a0}.
• Then, we get to pick another action a1, and so on ...
28 © Eric Xing @ CMU, 2006-2011
The dynamics of an MDP (Cont'd)
• Upon visiting the sequence of states s0, s1, ..., with actions a0, a1, ..., our total payoff is given by
    R(s0, a0) + γ R(s1, a1) + γ² R(s2, a2) + ⋯
• Or, when we are writing rewards as a function of the states only, this becomes
    R(s0) + γ R(s1) + γ² R(s2) + ⋯
• For most of our development, we will use the simpler state rewards R(s), though the generalization to state-action rewards R(s, a) offers no special difficulties.
• Our goal in reinforcement learning is to choose actions over time so as to maximize the expected value of the total payoff:
    E[ R(s0) + γ R(s1) + γ² R(s2) + ⋯ ]
29 © Eric Xing @ CMU, 2006-2011
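The total payoff above is just a discounted sum, and it is easy to make the definition concrete in code. The helper below is a sketch (not from the lecture); `total_payoff` and its arguments are names invented here for illustration:

```python
# Total discounted payoff for a sequence of state rewards R(s_0), R(s_1), ...
# total_payoff computes sum_t gamma^t * R(s_t).
def total_payoff(rewards, gamma):
    return sum((gamma ** t) * r for t, r in enumerate(rewards))

# With a reward of 1 at every step and gamma = 0.9, the payoff
# approaches 1 / (1 - 0.9) = 10 as the horizon grows.
print(total_payoff([1.0] * 200, 0.9))
```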
FIXED POINT ITERATION
30
Fixed Point Iteration for Optimization
• Fixed point iteration is a general tool for solving systems of equations.
• It can also be applied to optimization.
1. Given objective function J(θ).
2. Compute the derivative, set it to zero (call this function f):
    dJ(θ)/dθ_i = 0 = f(θ)
3. Rearrange the equation s.t. one of the parameters appears on the LHS:
    0 = f(θ)  ⇒  θ_i = g(θ)
4. Initialize the parameters.
5. For i in {1, ..., K}, update each parameter and increment t:
    θ_i^{(t+1)} = g(θ^{(t)})
6. Repeat #5 until convergence.
31
Fixed Point Iteration for Optimization
• Fixed point iteration is a general tool for solving systems of equations.
• It can also be applied to optimization.
1. Given objective function:
    J(x) = x³/3 - (3/2)x² + 2x
2. Compute the derivative, set it to zero (call this function f):
    dJ(x)/dx = f(x) = x² - 3x + 2 = 0
3. Rearrange the equation s.t. the parameter appears on the LHS:
    ⇒ x = (x² + 2)/3 = g(x)
4. Initialize the parameter.
5. Update and increment t:
    x ← (x² + 2)/3
6. Repeat #5 until convergence.
32
Fixed Point Iteration for Optimization
We can implement our example in a few lines of python.
33
    J(x) = x³/3 - (3/2)x² + 2x
    dJ(x)/dx = f(x) = x² - 3x + 2 = 0
    ⇒ x = (x² + 2)/3 = g(x)
    Update: x ← (x² + 2)/3
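The script itself is not reproduced on the slide; a minimal reconstruction that matches the printed trace on the next slide (the file name fixed-point-iteration.py is taken from that output) could look like this:

```python
# fixed-point-iteration.py
# Minimize J(x) = x^3/3 - (3/2)x^2 + 2x by fixed point iteration:
# set f(x) = dJ/dx = x^2 - 3x + 2 = 0 and rearrange to x = g(x) = (x^2 + 2)/3.

def f(x):
    """Derivative of J; we seek a root of f."""
    return x**2 - 3*x + 2

def g(x):
    """Update function obtained by rearranging f(x) = 0."""
    return (x**2 + 2) / 3.0

x = 0.0
for i in range(21):
    print("i=%2d x=%.4f f(x)=%.4f" % (i, x, f(x)))
    x = g(x)
```

Starting from x = 0, the iterates converge to the root x = 1 of f.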
Fixed Point Iteration for Optimization
34
$ python fixed-point-iteration.py
i= 0 x=0.0000 f(x)=2.0000
i= 1 x=0.6667 f(x)=0.4444
i= 2 x=0.8148 f(x)=0.2195
i= 3 x=0.8880 f(x)=0.1246
i= 4 x=0.9295 f(x)=0.0755
i= 5 x=0.9547 f(x)=0.0474
i= 6 x=0.9705 f(x)=0.0304
i= 7 x=0.9806 f(x)=0.0198
i= 8 x=0.9872 f(x)=0.0130
i= 9 x=0.9915 f(x)=0.0086
i=10 x=0.9944 f(x)=0.0057
i=11 x=0.9963 f(x)=0.0038
i=12 x=0.9975 f(x)=0.0025
i=13 x=0.9983 f(x)=0.0017
i=14 x=0.9989 f(x)=0.0011
i=15 x=0.9993 f(x)=0.0007
i=16 x=0.9995 f(x)=0.0005
i=17 x=0.9997 f(x)=0.0003
i=18 x=0.9998 f(x)=0.0002
i=19 x=0.9999 f(x)=0.0001
i=20 x=0.9999 f(x)=0.0001
VALUE ITERATION
35
Elements of RL
• A policy
  - A map from state space to action space.
  - May be stochastic.
• A reward function
  - Maps each state (or state-action pair) to a real number, called the reward.
• A value function
  - The value of a state (or state-action pair) is the total expected reward, starting from that state (or state-action pair).
36 © Eric Xing @ CMU, 2006-2011
Policy
• A policy is any function π : S → A mapping from the states to the actions.
• We say that we are executing some policy π if, whenever we are in state s, we take action a = π(s).
• We also define the value function for a policy π according to
    V^π(s) = E[ R(s0) + γ R(s1) + γ² R(s2) + ⋯ | s0 = s, π ]
• V^π(s) is simply the expected sum of discounted rewards upon starting in state s, and taking actions according to π.
37 © Eric Xing @ CMU, 2006-2011
Value Function
• Given a fixed policy π, its value function V^π satisfies the Bellman equations:
    V^π(s) = R(s) + γ Σ_{s'∈S} P_{s π(s)}(s') V^π(s')
    (immediate reward) + (expected sum of future discounted rewards)
• Bellman's equations can be used to efficiently solve for V^π (see later).
38 © Eric Xing @ CMU, 2006-2011
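For a fixed policy the Bellman equations are linear in V^π, so they can be solved directly as V = (I - γ P_π)^{-1} R. The numpy illustration below uses a hypothetical 3-state chain invented here (the MDP, rewards, and names are not from the lecture):

```python
import numpy as np

# Solve the Bellman equations V = R + gamma * P_pi @ V for a fixed policy:
# V = (I - gamma * P_pi)^{-1} R.  P_pi[s, s'] is the transition probability
# from s to s' under the policy; the 3-state chain is invented for illustration.
gamma = 0.9
R = np.array([0.0, 0.0, 1.0])        # state rewards R(s)
P_pi = np.array([[0.0, 1.0, 0.0],    # transitions under the policy:
                 [0.0, 0.0, 1.0],    # 0 -> 1 -> 2, then stay at 2
                 [0.0, 0.0, 1.0]])
V = np.linalg.solve(np.eye(3) - gamma * P_pi, R)
print(V)  # [8.1, 9.0, 10.0]: state 2 is worth 1/(1 - 0.9) = 10
```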
The Grid world
• Transition model M: 0.8 in the direction you want to go; 0.2 perpendicular (0.1 left, 0.1 right)
• Policy: mapping from states to actions
• An optimal policy for the stochastic environment is shown as arrows on the 4x3 grid, with terminal states +1 and -1.
• Utilities of states:
    0.812   0.868   0.912   +1
    0.762   (wall)  0.660   -1
    0.705   0.655   0.611   0.388
• Environment:
  - Observable (accessible): percept identifies the state
  - Partially observable
• Markov property: transition probabilities depend on the state only, not on the path to the state → Markov decision problem (MDP).
• Partially observable MDP (POMDP): percepts do not have enough info to identify transition probabilities.
39 © Eric Xing @ CMU, 2006-2011
Optimal value function
• We define the optimal value function according to
    V*(s) = max_π V^π(s)          (1)
• In other words, this is the best possible expected sum of discounted rewards that can be attained using any policy.
• There is a version of Bellman's equations for the optimal value function:
    V*(s) = R(s) + max_{a∈A} γ Σ_{s'∈S} P_sa(s') V*(s')          (2)
• Why?
40 © Eric Xing @ CMU, 2006-2011
Optimal policy
• We also define a policy π* : S → A as follows:
    π*(s) = argmax_{a∈A} Σ_{s'∈S} P_sa(s') V*(s')          (3)
• Fact: V*(s) = V^{π*}(s) ≥ V^π(s) for every policy π and every state s.
• Policy π* has the interesting property that it is the optimal policy for all states s.
  - It is not the case that if we were starting in some state s then there would be some optimal policy for that state, and if we were starting in some other state s' then there would be some other policy that is optimal for s'.
  - The same policy π* attains the maximum in (1) for all states s. This means that we can use the same policy no matter what the initial state of our MDP is.
41 © Eric Xing @ CMU, 2006-2011
The Basic Setting for Learning
• Training data: n finite-horizon trajectories, of the form
    {s0, a0, r0, s1, a1, r1, ..., sT, aT, rT, sT+1}.
• Deterministic or stochastic policy: a sequence of decision rules
    {π0, π1, ..., πT}.
• Each πt maps from the observable history (states and actions) to the action space at that time point.
42 © Eric Xing @ CMU, 2006-2011
Algorithm 1: Value iteration
• Consider only MDPs with finite state and action spaces.
• The value iteration algorithm:
    1. For each state s, initialize V(s) := 0.
    2. Repeat until convergence:
         For every state s, update V(s) := R(s) + max_{a∈A} γ Σ_{s'} P_sa(s') V(s').
• The update can be done with synchronous updates (compute all new values from the old ones) or asynchronous updates (update one state at a time, in place).
• It can be shown that value iteration will cause V to converge to V*. Having found V*, we can then use Equation (3) to find the optimal policy.
43 © Eric Xing @ CMU, 2006-2011
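A sketch of synchronous value iteration, on a hypothetical 3-state, 2-action MDP invented here for illustration (the transition probabilities, rewards, and tolerance are all assumptions, not from the lecture):

```python
import numpy as np

# Value iteration on a hypothetical 3-state, 2-action MDP (invented for
# illustration).  P[a, s, s'] is the transition probability P_sa(s'),
# R[s] is the state reward, gamma is the discount factor.
gamma = 0.9
R = np.array([0.0, 0.0, 1.0])
P = np.array([
    [[0.9, 0.1, 0.0], [0.1, 0.9, 0.0], [0.0, 0.1, 0.9]],  # action 0
    [[0.1, 0.9, 0.0], [0.0, 0.1, 0.9], [0.0, 0.0, 1.0]],  # action 1
])

V = np.zeros(3)
while True:
    # Synchronous update: V(s) := R(s) + gamma * max_a sum_{s'} P_sa(s') V(s')
    V_new = R + gamma * np.max(P @ V, axis=0)
    if np.max(np.abs(V_new - V)) < 1e-10:
        break
    V = V_new

# Extract the greedy policy, as in Equation (3)
pi = np.argmax(P @ V, axis=0)
print(pi)  # [1 1 1]: always move toward the rewarding state
```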
Algorithm 2: Policy iteration
• The policy iteration algorithm:
    1. Initialize π randomly.
    2. Repeat until convergence:
         (a) Let V := V^π.
         (b) For each state s, let π(s) := argmax_{a∈A} Σ_{s'} P_sa(s') V(s') (greedy update).
• The inner loop repeatedly computes the value function for the current policy, and then updates the policy using the current value function.
• After at most a finite number of iterations of this algorithm, V will converge to V*, and π will converge to π*.
44 © Eric Xing @ CMU, 2006-2011
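Policy iteration can be sketched in the same style; the 3-state, 2-action MDP below is invented here for illustration, and the policy evaluation step uses an exact linear solve of the Bellman equations:

```python
import numpy as np

# Policy iteration on a hypothetical 3-state, 2-action MDP (invented for
# illustration): alternate exact policy evaluation with greedy improvement.
gamma = 0.9
R = np.array([0.0, 0.0, 1.0])
P = np.array([
    [[0.9, 0.1, 0.0], [0.1, 0.9, 0.0], [0.0, 0.1, 0.9]],  # action 0
    [[0.1, 0.9, 0.0], [0.0, 0.1, 0.9], [0.0, 0.0, 1.0]],  # action 1
])

pi = np.zeros(3, dtype=int)                  # arbitrary initial policy
while True:
    P_pi = P[pi, np.arange(3)]               # row s is P_{s, pi(s)}(.)
    V = np.linalg.solve(np.eye(3) - gamma * P_pi, R)   # V := V^pi exactly
    pi_new = np.argmax(P @ V, axis=0)        # greedy update
    if np.array_equal(pi_new, pi):
        break                                # policy is stable
    pi = pi_new
print(pi)  # [1 1 1]
```

Because each improvement step strictly increases the value of at least one state (or leaves the policy unchanged), the loop terminates after finitely many iterations.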
Convergence
• The plot of utility values for selected states at each iteration step, in the application of VALUE-ITERATION to the 4x3 world in our example, is omitted here.
• Theorem: As t → ∞, value iteration converges to the exact U even if updates are done asynchronously and the state i is picked randomly at every step.
(4x3 grid with the Start state and terminal states +1 and -1.)
45 © Eric Xing @ CMU, 2006-2011
When to stop value iteration?
Convergence
46 © Eric Xing @ CMU, 2006-2011
Q-LEARNING
47
Q learning
• Define the Q-value function:
    Q*(s, a) = R(s) + γ Σ_{s'} P_sa(s') max_{a'} Q*(s', a')
• Q-value function updating rule:
    Q(s, a) := Q(s, a) + α ( r + γ max_{a'} Q(s', a') - Q(s, a) )
  - See subsequent slides
• Key idea of TD-Q learning:
  - Combined with the temporal difference approach
• Rule to choose the action to take:
  - e.g., greedily, a = argmax_a Q(s, a), optionally with exploration
48 © Eric Xing @ CMU, 2006-2011
Algorithm 3: Q learning
For each pair (s, a), initialize Q(s, a)
Observe the current state s
Loop forever {
    Select an action a (optionally with ε-exploration) and execute it
    Receive immediate reward r and observe the new state s'
    Update Q(s, a)
    s := s'
}
49 © Eric Xing @ CMU, 2006-2011
Exploration
• Tradeoff between exploitation (control) and exploration (identification)
• Extremes: greedy vs. random acting (n-armed bandit models)
• Q-learning converges to optimal Q-values if:
  - Every state is visited infinitely often (due to exploration),
  - The action selection becomes greedy as time approaches infinity, and
  - The learning rate α is decreased fast enough but not too fast (as we discussed in TD learning).
50 © Eric Xing @ CMU, 2006-2011
RL EXAMPLES
51
A Success Story
• TD-Gammon (Tesauro, G., 1992)
  - A backgammon-playing program.
  - An application of temporal difference learning.
  - The basic learner is a neural network.
  - It trained itself to the world-class level by playing against itself and learning from the outcome. So smart!!
  - More information: http://www.research.ibm.com/massive/tdl.html
52 © Eric Xing @ CMU, 2006-2011
Atari Example: Reinforcement Learning
(Diagram: at each step the agent receives observation O_t and reward R_t, and emits action A_t.)
• Rules of the game are unknown
• Learn directly from interactive game-play
• Pick actions on the joystick, see pixels and scores
Playing Atari with Deep RL
• Setup: RL system observes the pixels on the screen
• It receives rewards as the game score
• Actions decide how to move the joystick / buttons
53 Figures from David Silver (Intro RL lecture)
Playing Atari with Deep RL
Videos: – Atari: https://www.youtube.com/watch?v=V1eYniJ0Rnk
– Space Invaders: https://www.youtube.com/watch?v=ePv0Fs9cGgU
54
Figure 1: Screen shots from five Atari 2600 Games: (Left-to-right) Pong, Breakout, Space Invaders, Seaquest, Beam Rider

... an experience replay mechanism [13] which randomly samples previous transitions, and thereby smooths the training distribution over many past behaviors.

We apply our approach to a range of Atari 2600 games implemented in The Arcade Learning Environment (ALE) [3]. Atari 2600 is a challenging RL testbed that presents agents with a high dimensional visual input (210 × 160 RGB video at 60Hz) and a diverse and interesting set of tasks that were designed to be difficult for human players. Our goal is to create a single neural network agent that is able to successfully learn to play as many of the games as possible. The network was not provided with any game-specific information or hand-designed visual features, and was not privy to the internal state of the emulator; it learned from nothing but the video input, the reward and terminal signals, and the set of possible actions, just as a human player would. Furthermore the network architecture and all hyperparameters used for training were kept constant across the games. So far the network has outperformed all previous RL algorithms on six of the seven games we have attempted and surpassed an expert human player on three of them. Figure 1 provides sample screenshots from five of the games used for training.

2 Background

We consider tasks in which an agent interacts with an environment E, in this case the Atari emulator, in a sequence of actions, observations and rewards. At each time-step the agent selects an action a_t from the set of legal game actions, A = {1, ..., K}. The action is passed to the emulator and modifies its internal state and the game score. In general E may be stochastic. The emulator's internal state is not observed by the agent; instead it observes an image x_t ∈ R^d from the emulator, which is a vector of raw pixel values representing the current screen. In addition it receives a reward r_t representing the change in game score. Note that in general the game score may depend on the whole prior sequence of actions and observations; feedback about an action may only be received after many thousands of time-steps have elapsed.

Since the agent only observes images of the current screen, the task is partially observed and many emulator states are perceptually aliased, i.e. it is impossible to fully understand the current situation from only the current screen x_t. We therefore consider sequences of actions and observations, s_t = x_1, a_1, x_2, ..., a_{t-1}, x_t, and learn game strategies that depend upon these sequences. All sequences in the emulator are assumed to terminate in a finite number of time-steps. This formalism gives rise to a large but finite Markov decision process (MDP) in which each sequence is a distinct state. As a result, we can apply standard reinforcement learning methods for MDPs, simply by using the complete sequence s_t as the state representation at time t.

The goal of the agent is to interact with the emulator by selecting actions in a way that maximises future rewards. We make the standard assumption that future rewards are discounted by a factor of γ per time-step, and define the future discounted return at time t as R_t = Σ_{t'=t}^{T} γ^{t'-t} r_{t'}, where T is the time-step at which the game terminates. We define the optimal action-value function Q*(s, a) as the maximum expected return achievable by following any strategy, after seeing some sequence s and then taking some action a, Q*(s, a) = max_π E[R_t | s_t = s, a_t = a, π], where π is a policy mapping sequences to actions (or distributions over actions).

The optimal action-value function obeys an important identity known as the Bellman equation. This is based on the following intuition: if the optimal value Q*(s', a') of the sequence s' at the next time-step was known for all possible actions a', then the optimal strategy is to select the action a' ...

Figures from Mnih et al. (2013)
Playing Atari with Deep RL
55
                  B. Rider  Breakout  Enduro  Pong   Q*bert  Seaquest  S. Invaders
Random            354       1.2       0       -20.4  157     110       179
Sarsa [3]         996       5.2       129     -19    614     665       271
Contingency [4]   1743      6         159     -17    960     723       268
DQN               4092      168       470     20     1952    1705      581
Human             7456      31        368     -3     18900   28010     3690
HNeat Best [8]    3616      52        106     19     1800    920       1720
HNeat Pixel [8]   1332      4         91      -16    1325    800       1145
DQN Best          5184      225       661     21     4500    1740      1075

Table 1: The upper table compares average total reward for various learning methods by running an ε-greedy policy with ε = 0.05 for a fixed number of steps. The lower table reports results of the single best performing episode for HNeat and DQN. HNeat produces deterministic policies that always get the same score while DQN used an ε-greedy policy with ε = 0.05.
types of objects on the Atari screen. The HNeat Pixel score is obtained by using the special 8 color channel representation of the Atari emulator that represents an object label map at each channel. This method relies heavily on finding a deterministic sequence of states that represents a successful exploit. It is unlikely that strategies learnt in this way will generalize to random perturbations; therefore the algorithm was only evaluated on the highest scoring single episode. In contrast, our algorithm is evaluated on ε-greedy control sequences, and must therefore generalize across a wide variety of possible situations. Nevertheless, we show that on all the games, except Space Invaders, not only our max evaluation results (row 8), but also our average results (row 4) achieve better performance.

Finally, we show that our method achieves better performance than an expert human player on Breakout, Enduro and Pong and it achieves close to human performance on Beam Rider. The games Q*bert, Seaquest, Space Invaders, on which we are far from human performance, are more challenging because they require the network to find a strategy that extends over long time scales.

6 Conclusion

This paper introduced a new deep learning model for reinforcement learning, and demonstrated its ability to master difficult control policies for Atari 2600 computer games, using only raw pixels as input. We also presented a variant of online Q-learning that combines stochastic minibatch updates with experience replay memory to ease the training of deep networks for RL. Our approach gave state-of-the-art results in six of the seven games it was tested on, with no adjustment of the architecture or hyperparameters.
References

[1] Leemon Baird. Residual algorithms: Reinforcement learning with function approximation. In Proceedings of the 12th International Conference on Machine Learning (ICML 1995), pages 30-37. Morgan Kaufmann, 1995.
[2] Marc Bellemare, Joel Veness, and Michael Bowling. Sketch-based linear value function approximation. In Advances in Neural Information Processing Systems 25, pages 2222-2230, 2012.
[3] Marc G. Bellemare, Yavar Naddaf, Joel Veness, and Michael Bowling. The arcade learning environment: An evaluation platform for general agents. Journal of Artificial Intelligence Research, 47:253-279, 2013.
[4] Marc G. Bellemare, Joel Veness, and Michael Bowling. Investigating contingency awareness using Atari 2600 games. In AAAI, 2012.
[5] Marc G. Bellemare, Joel Veness, and Michael Bowling. Bayesian learning of recursively factored environments. In Proceedings of the Thirtieth International Conference on Machine Learning (ICML 2013), pages 1211-1219, 2013.
Alpha Go
Game of Go (圍棋)
• 19x19 board
• Players alternately play black/white stones
• Goal is to fully encircle the largest region on the board
• Simple rules, but extremely complex game play
56 Figure from Silver et al. (2016)
488 | NATURE | VOL 529 | 28 JANUARY 2016

on high-performance MCTS algorithms. In addition, we included the open source program GnuGo, a Go program using state-of-the-art search methods that preceded MCTS. All programs were allowed 5 s of computation time per move.

The results of the tournament (see Fig. 4a) suggest that single-machine AlphaGo is many dan ranks stronger than any previous Go program, winning 494 out of 495 games (99.8%) against other Go programs. To provide a greater challenge to AlphaGo, we also played games with four handicap stones (that is, free moves for the opponent); AlphaGo won 77%, 86%, and 99% of handicap games against Crazy Stone, Zen and Pachi, respectively. The distributed version of AlphaGo was significantly stronger, winning 77% of games against single-machine AlphaGo and 100% of its games against other programs.

We also assessed variants of AlphaGo that evaluated positions using just the value network (λ = 0) or just rollouts (λ = 1) (see Fig. 4b). Even without rollouts AlphaGo exceeded the performance of all other Go programs, demonstrating that value networks provide a viable alternative to Monte Carlo evaluation in Go. However, the mixed evaluation (λ = 0.5) performed best, winning ≥95% of games against other variants. This suggests that the two position-evaluation mechanisms are complementary: the value network approximates the outcome of games played by the strong but impractically slow p_ρ, while the rollouts can precisely score and evaluate the outcome of games played by the weaker but faster rollout policy p_π. Figure 5 visualizes the evaluation of a real game position by AlphaGo.

Finally, we evaluated the distributed version of AlphaGo against Fan Hui, a professional 2 dan, and the winner of the 2013, 2014 and 2015 European Go championships. Over 5-9 October 2015 AlphaGo and Fan Hui competed in a formal five-game match. AlphaGo won the match 5 games to 0 (Fig. 6 and Extended Data Table 1). This is the first time that a computer Go program has defeated a human professional player, without handicap, in the full game of Go, a feat that was previously believed to be at least a decade away [3,7,31].

Discussion

In this work we have developed a Go program, based on a combination of deep neural networks and tree search, that plays at the level of the strongest human players, thereby achieving one of artificial intelligence's "grand challenges" [31-33]. We have developed, for the first time, effective move selection and position evaluation functions for Go, based on deep neural networks that are trained by a novel combination

Figure 6 | Games from the match between AlphaGo and the European champion, Fan Hui. Moves are shown in a numbered sequence corresponding to the order in which they were played. Repeated moves on the same intersection are shown in pairs below the board. The first move number in each pair indicates when the repeat move was played, at an intersection identified by the second move number (see Supplementary Information).
(The five board diagrams of Figure 6, with their numbered move sequences, are omitted here.)
Game 1: Fan Hui (Black), AlphaGo (White). AlphaGo wins by 2.5 points.
Game 2: AlphaGo (Black), Fan Hui (White). AlphaGo wins by resignation.
Game 3: Fan Hui (Black), AlphaGo (White). AlphaGo wins by resignation.
Game 4: AlphaGo (Black), Fan Hui (White). AlphaGo wins by resignation.
Game 5: Fan Hui (Black), AlphaGo (White). AlphaGo wins by resignation.
© 2016 Macmillan Publishers Limited. All rights reserved
Alpha Go • The state space is too large to represent explicitly, since the
number of sequences of moves is O(b^d) – Go: b = 250, d = 150 – Chess: b = 35, d = 80
• Key idea: – Define a neural network to approximate the value function – Train by policy gradient
57
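The gap between these two search spaces is easy to quantify: b^d has roughly d·log10(b) decimal digits. A quick check, using the b and d values quoted above:

```python
import math

def tree_size_digits(b, d):
    """Number of decimal digits in b**d, the rough count of move sequences."""
    return int(d * math.log10(b)) + 1

print(tree_size_digits(250, 150))  # Go: a ~360-digit number of sequences
print(tree_size_digits(35, 80))    # Chess: a ~124-digit number
```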
28 January 2016 | Vol 529 | Nature | 485
ARTICLE RESEARCH
sampled state-action pairs (s, a), using stochastic gradient ascent to maximize the likelihood of the human move a selected in state s
Δσ ∝ ∂ log pσ(a|s) / ∂σ
We trained a 13-layer policy network, which we call the SL policy network, from 30 million positions from the KGS Go Server. The network predicted expert moves on a held-out test set with an accuracy of 57.0% using all input features, and 55.7% using only raw board position and move history as inputs, compared to the state of the art from other research groups of 44.4% at date of submission [24] (full results in Extended Data Table 3). Small improvements in accuracy led to large improvements in playing strength (Fig. 2a); larger networks achieve better accuracy but are slower to evaluate during search. We also trained a faster but less accurate rollout policy pπ(a|s), using a linear softmax of small pattern features (see Extended Data Table 4) with weights π; this achieved an accuracy of 24.2%, using just 2 µs to select an action, rather than 3 ms for the policy network.
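The supervised update above can be sketched for a toy linear softmax policy. The actual SL policy network is a 13-layer convolutional network, so the linear model, feature sizes, and learning rate here are illustrative stand-ins only:

```python
import numpy as np

def sl_policy_update(W, s, a, lr=0.1):
    """One stochastic-gradient-ascent step on log p_W(a|s) for a
    linear softmax policy p(a|s) = softmax(W @ s)[a]."""
    logits = W @ s
    p = np.exp(logits - logits.max())
    p /= p.sum()
    grad = -np.outer(p, s)   # softmax part of d log p(a|s) / dW
    grad[a] += s             # indicator term for the expert move actually played
    return W + lr * grad

rng = np.random.default_rng(0)
W = np.zeros((3, 4))         # 3 candidate moves, 4 board features (toy sizes)
s = rng.normal(size=4)
for _ in range(100):
    W = sl_policy_update(W, s, a=2)  # repeatedly reinforce expert move 2
```

After enough updates the policy assigns move 2 the highest probability in state s, mirroring how the SL network comes to imitate expert play.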
Reinforcement learning of policy networks
The second stage of the training pipeline aims at improving the policy network by policy gradient reinforcement learning (RL) [25, 26]. The RL policy network pρ is identical in structure to the SL policy network, and its weights ρ are initialized to the same values, ρ = σ. We play games between the current policy network pρ and a randomly selected previous iteration of the policy network. Randomizing from a pool of opponents in this way stabilizes training by preventing overfitting to the current policy. We use a reward function r(s) that is zero for all non-terminal time steps t < T. The outcome zt = ±r(sT) is the terminal reward at the end of the game from the perspective of the current player at time step t: +1 for winning and −1 for losing. Weights are then updated at each time step t by stochastic gradient ascent in the direction that maximizes expected outcome [25]
Δρ ∝ (∂ log pρ(at|st) / ∂ρ) · zt
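A minimal sketch of this REINFORCE-style update for a toy linear softmax policy (the real pρ is a deep convolutional network; the shapes and learning rate here are illustrative):

```python
import numpy as np

def reinforce_episode(W, states, actions, z, lr=0.01):
    """For each time step t, move the weights of a linear softmax policy
    in the direction (d log p_W(a_t|s_t) / dW) * z_t, where z is +1 for
    a win and -1 for a loss."""
    for s, a in zip(states, actions):
        logits = W @ s
        p = np.exp(logits - logits.max())
        p /= p.sum()
        grad = -np.outer(p, s)
        grad[a] += s
        W = W + lr * z * grad
    return W

# One winning episode (z = +1) makes the played moves more likely.
s0 = np.array([1.0, 0.0, 0.0])
W = reinforce_episode(np.zeros((2, 3)), [s0], [0], z=+1)
```

Note that losing episodes (z = −1) push the same gradient the other way, making the losing moves less likely.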
We evaluated the performance of the RL policy network in game play, sampling each move at ∼ pρ(·|st) from its output probability distribution over actions. When played head-to-head, the RL policy network won more than 80% of games against the SL policy network. We also tested against the strongest open-source Go program, Pachi [14], a sophisticated Monte Carlo search program, ranked at 2 amateur dan on KGS, that executes 100,000 simulations per move. Using no search at all, the RL policy network won 85% of games against Pachi. In comparison, the previous state of the art, based only on supervised
Figure 1 | Neural network training pipeline and architecture. a, A fast rollout policy pπ and supervised learning (SL) policy network pσ are trained to predict human expert moves in a data set of positions. A reinforcement learning (RL) policy network pρ is initialized to the SL policy network, and is then improved by policy gradient learning to maximize the outcome (that is, winning more games) against previous versions of the policy network. A new data set is generated by playing games of self-play with the RL policy network. Finally, a value network vθ is trained by regression to predict the expected outcome (that is, whether the current player wins) in positions from the self-play data set. b, Schematic representation of the neural network architecture used in AlphaGo. The policy network takes a representation of the board position s as its input, passes it through many convolutional layers with parameters σ (SL policy network) or ρ (RL policy network), and outputs a probability distribution pσ(a|s) or pρ(a|s) over legal moves a, represented by a probability map over the board. The value network similarly uses many convolutional layers with parameters θ, but outputs a scalar value vθ(s′) that predicts the expected outcome in position s′.
Figure 2 | Strength and accuracy of policy and value networks. a, Plot showing the playing strength of policy networks as a function of their training accuracy. Policy networks with 128, 192, 256 and 384 convolutional filters per layer were evaluated periodically during training; the plot shows the winning rate of AlphaGo using that policy network against the match version of AlphaGo. b, Comparison of evaluation accuracy between the value network and rollouts with different policies.
Positions and outcomes were sampled from human expert games. Each position was evaluated by a single forward pass of the value network vθ, or by the mean outcome of 100 rollouts, played out using either uniform random rollouts, the fast rollout policy pπ, the SL policy network pσ or the RL policy network pρ. The mean squared error between the predicted value and the actual game outcome is plotted against the stage of the game (how many moves had been played in the given position).
Figure from Silver et al. (2016)
Alpha Go
• Results of a tournament
• From Silver et al. (2016): “a 230 point gap corresponds to a 79% probability of winning”
58 Figure from Silver et al. (2016)
better in AlphaGo than a value function vθ(s) ≈ v^pσ(s) derived from the SL policy network.
Evaluating policy and value networks requires several orders of magnitude more computation than traditional search heuristics. To efficiently combine MCTS with deep neural networks, AlphaGo uses an asynchronous multi-threaded search that executes simulations on CPUs, and computes policy and value networks in parallel on GPUs. The final version of AlphaGo used 40 search threads, 48 CPUs, and 8 GPUs. We also implemented a distributed version of AlphaGo that exploited multiple machines, 40 search threads, 1,202 CPUs and 176 GPUs. The Methods section provides full details of asynchronous and distributed MCTS.
Evaluating the playing strength of AlphaGo
To evaluate AlphaGo, we ran an internal tournament among variants of AlphaGo and several other Go programs, including the strongest commercial programs Crazy Stone [13] and Zen, and the strongest open source programs Pachi [14] and Fuego [15]. All of these programs are based
Figure 4 | Tournament evaluation of AlphaGo. a, Results of a tournament between different Go programs (see Extended Data Tables 6–11). Each program used approximately 5 s computation time per move. To provide a greater challenge to AlphaGo, some programs (pale upper bars) were given four handicap stones (that is, free moves at the start of every game) against all opponents. Programs were evaluated on an Elo scale [37]: a 230 point gap corresponds to a 79% probability of winning, which roughly corresponds to one amateur dan rank advantage on KGS [38]; an approximate correspondence to human ranks is also shown, and horizontal lines show KGS ranks achieved online by that program. Games against the human European champion Fan Hui were also included; these games used longer time controls. 95% confidence intervals are shown. b, Performance of AlphaGo, on a single machine, for different combinations of components. The version solely using the policy network does not perform any search. c, Scalability study of MCTS in AlphaGo with search threads and GPUs, using asynchronous search (light blue) or distributed search (dark blue), for 2 s per move.
Figure 5 | How AlphaGo (black, to play) selected its move in an informal game against Fan Hui. For each of the following statistics, the location of the maximum value is indicated by an orange circle. a, Evaluation of all successors s′ of the root position s, using the value network vθ(s′); estimated winning percentages are shown for the top evaluations. b, Action values Q(s, a) for each edge (s, a) in the tree from root position s; averaged over value network evaluations only (λ = 0). c, Action values Q(s, a), averaged over rollout evaluations only (λ = 1).
d, Move probabilities directly from the SL policy network, pσ(a|s); reported as a percentage (if above 0.1%). e, Percentage frequency with which actions were selected from the root during simulations. f, The principal variation (path with maximum visit count) from AlphaGo's search tree. The moves are presented in a numbered sequence. AlphaGo selected the move indicated by the red circle; Fan Hui responded with the move indicated by the white square; in his post-game commentary he preferred the move (labelled 1) predicted by AlphaGo.
SUMMARY
59
Eric Xing
Summary l Both value iteration and policy iteration are standard
algorithms for solving MDPs, and there isn't currently universal agreement over which algorithm is better.
l For small MDPs, value iteration is often very fast and converges with very few iterations. However, for MDPs with large state spaces, solving for V explicitly would involve solving a large system of linear equations, and could be difficult.
l In these problems, policy iteration may be preferred. In practice value iteration seems to be used more often than policy iteration.
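For a small MDP, value iteration is only a few lines; the 2-state transition matrices below are illustrative toy inputs:

```python
import numpy as np

def value_iteration(P, R, gamma=0.9, tol=1e-8):
    """Value iteration for a small MDP.
    P[a][s, s'] = transition probability under action a; R[s] = reward in s.
    Returns the optimal value function V."""
    V = np.zeros(len(R))
    while True:
        # Bellman optimality backup: V(s) = R(s) + gamma * max_a sum_s' P(s'|s,a) V(s')
        Q = np.array([R + gamma * P[a] @ V for a in range(len(P))])
        V_new = Q.max(axis=0)
        if np.max(np.abs(V_new - V)) < tol:
            return V_new
        V = V_new

# Toy 2-state MDP: action 0 stays put, action 1 jumps to state 1 (reward 1).
P = np.array([[[1., 0.], [0., 1.]],
              [[0., 1.], [0., 1.]]])
R = np.array([0., 1.])
V = value_iteration(P, R)   # converges to V = [9, 10] for gamma = 0.9
```

Here V(1) solves V(1) = 1 + 0.9·V(1) = 10, and V(0) = 0.9·V(1) = 9, matching the fixed point of the Bellman backup.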
l Q-learning is model-free: it learns action values directly from experience, using temporal-difference updates between successive value estimates.
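The tabular temporal-difference update at the heart of Q-learning, with an illustrative 2-state, 2-action table:

```python
import numpy as np

def q_update(Q, s, a, r, s_next, alpha=0.5, gamma=0.9):
    """One tabular Q-learning step: move Q(s, a) toward the TD target
    r + gamma * max_a' Q(s', a'), with no model of the MDP required."""
    td_target = r + gamma * Q[s_next].max()
    Q[s, a] += alpha * (td_target - Q[s, a])
    return Q

Q = np.zeros((2, 2))                           # 2 states x 2 actions
Q = q_update(Q, s=0, a=1, r=1.0, s_next=1)     # Q[0, 1] moves halfway toward 1.0
```

Repeating such updates along experienced transitions drives Q toward the optimal action values, from which the greedy policy can be read off.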
60 © Eric Xing @ CMU, 2006-2011
Eric Xing
Types of Learning l Supervised Learning
- Training data: (X,Y). (features, label) - Predict Y, minimizing some loss. - Regression, Classification.
l Unsupervised Learning - Training data: X. (features only) - Find “similar” points in high-dim X-space. - Clustering.
l Reinforcement Learning - Training data: (S, A, R). (State-Action-Reward)
- Develop an optimal policy (sequence of decision rules) for the learner so as to maximize its long-term reward. - Robotics, Board game playing programs
61 © Eric Xing @ CMU, 2006-2011