
Lecture 14: MCTS

Emma Brunskill

CS234 Reinforcement Learning.

Winter 2018

With many slides from or derived from David Silver

Class Structure

Last time: Batch RL

This Time: MCTS

Next time: Human in the Loop RL


Table of Contents

1 Introduction

2 Model-Based Reinforcement Learning

3 Simulation-Based Search

4 Integrated Architectures


Model-Based Reinforcement Learning

Previous lectures: learn value function or policy directly from experience

This lecture: learn model directly from experience

and use planning to construct a value function or policy

Integrate learning and planning into a single architecture


Model-Based and Model-Free RL

Model-Free RL

No model
Learn value function (and/or policy) from experience


Model-Based and Model-Free RL

Model-Free RL

No model
Learn value function (and/or policy) from experience

Model-Based RL

Learn a model from experience
Plan value function (and/or policy) from model


Model-Free RL


Model-Based RL


Table of Contents

1 Introduction

2 Model-Based Reinforcement Learning

3 Simulation-Based Search

4 Integrated Architectures


Model-Based RL


Advantages of Model-Based RL

Advantages:

Can efficiently learn model by supervised learning methods
Can reason about model uncertainty (like in upper confidence bound methods for exploration/exploitation trade-offs)

Disadvantages

First learn a model, then construct a value function ⇒ two sources of approximation error


MDP Model Refresher

A model M is a representation of an MDP < S, A, P, R >, parametrized by η

We will assume state space S and action space A are known

So a model M = < Pη, Rη > represents state transitions Pη ≈ P and rewards Rη ≈ R

St+1 ∼ Pη(St+1 | St ,At)

Rt+1 = Rη(Rt+1 | St ,At)

Typically assume conditional independence between state transitions and rewards

P[St+1,Rt+1 | St ,At ] = P[St+1 | St ,At ]P[Rt+1 | St ,At ]


Model Learning

Goal: estimate model Mη from experience {S1, A1, R2, ..., ST}
This is a supervised learning problem

S1, A1 → R2, S2
S2, A2 → R3, S3
...
ST−1, AT−1 → RT, ST

Learning s, a → r is a regression problem
Learning s, a → s′ is a density estimation problem

Pick loss function, e.g. mean-squared error, KL divergence, . . .

Find parameters η that minimize empirical loss
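
As a concrete illustration (my own sketch, not from the slides), the episode S1, A1, R2, S2, ..., ST can be split into (s, a) → s′ and (s, a) → r training pairs before fitting the two models separately. The function and argument names below are hypothetical.

```python
# Illustrative sketch: turn one episode into supervised-learning pairs for model fitting.
def make_dataset(states, actions, rewards):
    """states = [S1, ..., S_T], actions = [A1, ..., A_{T-1}], rewards = [R2, ..., R_T]."""
    transition_data = []  # (s, a) -> s'   (density estimation target)
    reward_data = []      # (s, a) -> r    (regression target)
    for t in range(len(actions)):
        transition_data.append(((states[t], actions[t]), states[t + 1]))
        reward_data.append(((states[t], actions[t]), rewards[t]))
    return transition_data, reward_data
```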


Examples of Models

Table Lookup Model

Linear Expectation Model

Linear Gaussian Model

Gaussian Process Model

Deep Belief Network Model

. . .


Table Lookup Model

Model is an explicit MDP, P̂, R̂
Count visits N(s, a) to each state-action pair

\hat{P}^a_{s,s'} = \frac{1}{N(s,a)} \sum_{t=1}^{T} \mathbf{1}(S_t, A_t, S_{t+1} = s, a, s')

\hat{R}^a_s = \frac{1}{N(s,a)} \sum_{t=1}^{T} \mathbf{1}(S_t, A_t = s, a)\, R_t

Alternatively

At each time-step t, record experience tuple < St, At, Rt+1, St+1 >
To sample model, randomly pick tuple matching < s, a, ·, · >
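
A minimal sketch of the count-based table-lookup model, assuming discrete, hashable states and actions; the class and method names are my own, not from the lecture.

```python
from collections import defaultdict

class TableLookupModel:
    """Count-based MLE model: P-hat and R-hat estimated from visit counts N(s, a)."""
    def __init__(self):
        self.n = defaultdict(int)                                 # N(s, a)
        self.next_counts = defaultdict(lambda: defaultdict(int))  # counts of s' after (s, a)
        self.reward_sum = defaultdict(float)                      # sum of rewards after (s, a)

    def update(self, s, a, r, s_next):
        self.n[(s, a)] += 1
        self.next_counts[(s, a)][s_next] += 1
        self.reward_sum[(s, a)] += r

    def p_hat(self, s, a, s_next):
        return self.next_counts[(s, a)][s_next] / self.n[(s, a)]

    def r_hat(self, s, a):
        return self.reward_sum[(s, a)] / self.n[(s, a)]
```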


AB Example

Two states A,B; no discounting; 8 episodes of experience

We have constructed a table lookup model from the experience

Recall: For a particular policy, TD with a tabular representation and infinite experience replay will converge to the same value as computed if we construct an MLE model and do planning

Check Your Memory: Will MC methods converge to the same solution?


Planning with a Model

Given a model Mη = < Pη, Rη >
Solve the MDP < S, A, Pη, Rη >
Using favourite planning algorithm

Value iteration
Policy iteration
Tree search
...


Sample-Based Planning

A simple but powerful approach to planning

Use the model only to generate samples

Sample experience from model

St+1 ∼ Pη(St+1 | St ,At)

Rt+1 = Rη(Rt+1 | St ,At)

Apply model-free RL to samples, e.g.:

Monte-Carlo control
Sarsa
Q-learning

Sample-based planning methods are often more efficient
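
A hedged sketch of sample-based planning: run ordinary Q-learning, but on transitions drawn from the learned model rather than from the environment. The `model.sample(s, a) -> (r, s_next)` interface and the other names are assumptions for illustration; terminal states are not handled, so every sampled next state is assumed to be in `states`.

```python
import random

def plan_with_q_learning(model, states, actions, n_updates=10000,
                         alpha=0.1, gamma=1.0, epsilon=0.1):
    """Q-learning on simulated experience only; `model.sample` is an assumed interface."""
    Q = {(s, a): 0.0 for s in states for a in actions}
    s = random.choice(states)
    for _ in range(n_updates):
        # epsilon-greedy behaviour on the simulated trajectory
        if random.random() < epsilon:
            a = random.choice(actions)
        else:
            a = max(actions, key=lambda a_: Q[(s, a_)])
        r, s_next = model.sample(s, a)        # sample from (R_eta, P_eta)
        target = r + gamma * max(Q[(s_next, a_)] for a_ in actions)
        Q[(s, a)] += alpha * (target - Q[(s, a)])
        s = s_next
    return Q
```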


Back to the AB Example

Construct a table-lookup model from real experience

Apply model-free RL to sampled experience

Real experience:
A, 0, B, 0
B, 1
B, 1
B, 1
B, 1
B, 1
B, 1
B, 0

Sampled experience:
B, 1
B, 0
B, 1
A, 0, B, 1
B, 1
A, 0, B, 1
B, 1
B, 0

e.g. Monte-Carlo learning: V(A) = 1, V(B) = 0.75

Check Your Memory: What would MC on the original experience have converged to?
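
As a sanity check (my own sketch, not part of the slides), the sampled experience above reproduces these values under undiscounted Monte-Carlo evaluation:

```python
# Each sampled episode is a list of (state, reward) pairs.
sampled_episodes = [
    [("B", 1)], [("B", 0)], [("B", 1)],
    [("A", 0), ("B", 1)], [("B", 1)],
    [("A", 0), ("B", 1)], [("B", 1)], [("B", 0)],
]

returns = {"A": [], "B": []}
for episode in sampled_episodes:
    rewards = [r for _, r in episode]
    for t, (s, _) in enumerate(episode):
        returns[s].append(sum(rewards[t:]))   # undiscounted return from time t

V = {s: sum(g) / len(g) for s, g in returns.items()}
print(V)   # {'A': 1.0, 'B': 0.75}
```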


Planning with an Inaccurate Model

Given an imperfect model < Pη, Rη > ≠ < P, R >

Performance of model-based RL is limited to optimal policy for approximate MDP < S, A, Pη, Rη >
i.e. Model-based RL is only as good as the estimated model

When the model is inaccurate, planning process will compute a sub-optimal policy

Solution 1: when model is wrong, use model-free RL

Solution 2: reason explicitly about model uncertainty (see Lectures onExploration / Exploitation)


Table of Contents

1 Introduction

2 Model-Based Reinforcement Learning

3 Simulation-Based Search

4 Integrated Architectures


Forward Search

Forward search algorithms select the best action by lookahead

They build a search tree with the current state st at the root

Using a model of the MDP to look ahead

No need to solve whole MDP, just sub-MDP starting from now


Simulation-Based Search

Forward search paradigm using sample-based planning

Simulate episodes of experience from now with the model

Apply model-free RL to simulated episodes


Simulation-Based Search (2)

Simulate episodes of experience from now with the model

\{S_t^k, A_t^k, R_{t+1}^k, \ldots, S_T^k\}_{k=1}^{K} \sim \mathcal{M}_\nu

Apply model-free RL to simulated episodes

Monte-Carlo control → Monte-Carlo search
Sarsa → TD search


Simple Monte-Carlo Search

Given a model Mv and a simulation policy π

For each action a ∈ A
Simulate K episodes from current (real) state st

\{s_t, a, R_{t+1}^k, \ldots, S_T^k\}_{k=1}^{K} \sim \mathcal{M}_\nu, \pi

Evaluate actions by mean return (Monte-Carlo evaluation)

Q(s_t, a) = \frac{1}{K} \sum_{k=1}^{K} G_t \xrightarrow{P} q_\pi(s_t, a) \quad (1)

Select current (real) action with maximum value

a_t = \arg\max_{a \in \mathcal{A}} Q(s_t, a)
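
A minimal sketch of simple Monte-Carlo search under these definitions; `model.sample(s, a) -> (r, s_next, done)` and the simulation policy `pi` are assumed interfaces, not part of the lecture.

```python
import random

def simple_mc_search(model, s_t, actions, pi, K=100, gamma=1.0, max_depth=100):
    """Evaluate each action by K Monte-Carlo roll-outs, then act greedily."""
    def rollout(s, a):
        g, discount = 0.0, 1.0
        for _ in range(max_depth):
            r, s, done = model.sample(s, a)
            g += discount * r
            discount *= gamma
            if done:
                break
            a = pi(s)                       # simulation policy picks later actions
        return g

    Q = {a: sum(rollout(s_t, a) for _ in range(K)) / K for a in actions}
    return max(Q, key=Q.get)                # a_t = argmax_a Q(s_t, a)
```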


Recall Expectimax Tree

If we have an MDP model Mv

Can compute optimal q(s, a) values for current state by constructing an expectimax tree

Limitations: Size of tree scales as


Monte-Carlo Tree Search (MCTS)

Given a model Mv

Build a search tree rooted at the current state st

Sample actions and next states

Iteratively construct and update tree by performing K simulation episodes starting from the root state

After search is finished, select current (real) action with maximum value in search tree

a_t = \arg\max_{a \in \mathcal{A}} Q(s_t, a)


Monte-Carlo Tree Search

Simulating an episode involves two phases (in-tree, out-of-tree)

Tree policy: pick actions to maximize Q(S, A)
Roll-out policy: e.g. pick actions randomly, or another policy

To evaluate the value of a tree node i at state-action pair (s, a), average over all rewards received from that node onwards across simulated episodes in which this tree node was reached

Q(i) = \frac{1}{N(i)} \sum_{k=1}^{K} \sum_{u=t}^{T} \mathbf{1}(i \in \mathrm{epi.}\,k)\, G_k(i) \xrightarrow{P} q(s, a) \quad (2)

Under mild conditions, converges to the optimal search tree, Q(S, A) → q∗(S, A)
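
A hedged skeleton of the full procedure, assuming a generative model interface `model.sample(s, a) -> (r, s_next, done)`, a greedy in-tree policy, and a uniformly random roll-out policy (a UCT-style tree policy, shown on the next slides, would replace the greedy argmax). All names are illustrative.

```python
import random
from collections import defaultdict

def mcts(model, root, actions, K=1000, gamma=1.0, max_depth=100):
    N = defaultdict(int)      # visit counts for tree nodes (s, a)
    Q = defaultdict(float)    # mean return from tree nodes (s, a)
    tree = {root}             # states currently in the search tree

    for _ in range(K):
        s, path = root, []
        g, discount, depth, done, expanded = 0.0, 1.0, 0, False, False
        while not done and depth < max_depth:
            if s in tree:
                a = max(actions, key=lambda a_: Q[(s, a_)])   # in-tree (tree) policy
                path.append((s, a, g, discount))              # remember stats at this node
            else:
                if not expanded:
                    tree.add(s)                               # expand one new node per simulation
                    expanded = True
                a = random.choice(actions)                    # out-of-tree (roll-out) policy
            r, s, done = model.sample(s, a)
            g += discount * r
            discount *= gamma
            depth += 1
        # Backup: update each visited tree node with its return from that node onwards.
        for (s_i, a_i, g_before, disc) in path:
            g_node = (g - g_before) / disc
            N[(s_i, a_i)] += 1
            Q[(s_i, a_i)] += (g_node - Q[(s_i, a_i)]) / N[(s_i, a_i)]

    return max(actions, key=lambda a: Q[(root, a)])           # real action: argmax at the root
```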


Upper Confidence Tree (UCT) Search

How to select what action to take during a simulated episode?


Upper Confidence Tree (UCT) Search

How to select what action to take during a simulated episode?

UCT: borrow idea from bandit literature and treat each node where we can select actions as a multi-armed bandit (MAB) problem

Maintain an upper confidence bound over reward of each arm


Upper Confidence Tree (UCT) Search

How to select what action to take during a simulated episode?

UCT: borrow idea from bandit literature and treat each node where we can select actions as a multi-armed bandit (MAB) problem

Maintain an upper confidence bound over reward of each arm

Q(s, a, i) = \frac{1}{N(s, a, i)} \sum_{k=1}^{K} \sum_{u=t}^{T} \mathbf{1}(i \in \mathrm{epi.}\,k)\, G_k(s, a, i) + c \sqrt{\frac{\ln n(s)}{n(s, a)}} \quad (3)

For simplicity can treat each node as a separate MAB

For simulated episode k at node i, select action/arm with highest upper bound to simulate and expand (or evaluate) in the tree

a_{ik} = \arg\max_a Q(s, a, i) \quad (4)

This implies that the policy used to simulate episodes with (and expand/update the tree) can change across each episode
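
A small sketch of this selection rule (Equations 3 and 4), assuming the node's statistics are kept in dictionaries; unvisited arms are forced to be tried first, and all names are illustrative. In the MCTS skeleton above, this would replace the greedy argmax in the in-tree phase.

```python
import math

def uct_select(actions, q_mean, n_sa, n_s, c=1.41):
    """Pick the arm maximising mean return plus the bonus c * sqrt(ln n(s) / n(s, a))."""
    def ucb(a):
        if n_sa[a] == 0:
            return float("inf")          # try every arm at least once
        return q_mean[a] + c * math.sqrt(math.log(n_s) / n_sa[a])
    return max(actions, key=ucb)
```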


Case Study: the Game of Go

Go is 2500 years old

Hardest classic board game

Grand challenge task (John McCarthy)

Traditional game-tree search has failed in Go

Check your understanding: does playing Go involve learning to make decisions in a world where dynamics and reward model are unknown?


Rules of Go

Usually played on 19x19, also 13x13 or 9x9 board

Simple rules, complex strategy

Black and white place down stones alternately

Surrounded stones are captured and removed

The player with more territory wins the game


Position Evaluation in Go

How good is a position s?

Reward function (undiscounted):

Rt = 0 for all non-terminal steps t < T

R_T = \begin{cases} 1, & \text{if Black wins} \\ 0, & \text{if White wins} \end{cases} \quad (5)

Policy π = < πB, πW > selects moves for both players

Value function (how good is position s):

vπ(s) = Eπ[RT | S = s] = P[Black wins | S = s]

v_*(s) = \max_{\pi_B} \min_{\pi_W} v_\pi(s)


Monte-Carlo Evaluation in Go


Applying Monte-Carlo Tree Search (1)

Go is a 2 player game so tree is a minimax tree instead of expectimax

White minimizes future reward and Black maximizes future reward when computing action to simulate


Applying Monte-Carlo Tree Search (2)


Applying Monte-Carlo Tree Search (3)


Applying Monte-Carlo Tree Search (4)


Applying Monte-Carlo Tree Search (5)


Advantages of MC Tree Search

Highly selective best-first search

Evaluates states dynamically (unlike e.g. DP)

Uses sampling to break curse of dimensionality

Works for “black-box” models (only requires samples)

Computationally efficient, anytime, parallelisable


In more depth: Upper Confidence Tree (UCT) Search

UCT: borrow idea from bandit literature and treat each tree node where we can select actions as a multi-armed bandit (MAB) problem

Maintain an upper confidence bound over reward of each arm and select the best arm

Check your understanding: Why is this slightly strange? Hint: why were upper confidence bounds a good idea for exploration/exploitation? Is there an exploration/exploitation problem during simulated episodes?*

* Relates to metalevel reasoning (for an example related to Go see "Selecting Computations: Theory and Applications", Hay, Russell, Tolpin and Shimony 2012)


MCTS and Early Go Results


MCTS Variants

UCT and vanilla MCTS are just the beginning

Potential extensions / alterations?


MCTS Variants

UCT and vanilla MCTS are just the beginning

Potential extensions / alterations?

Use a better rollout policy (have a policy network? Learned from expert data or from data gathered in the real world)
Learn a value function (can be used in combination with simulated trajectories to get a state-action estimate, can be used to bias initial actions considered, can be used to avoid having to roll out to the full episode length, ...)
Many other possibilities


MCTS and AlphaGo / AlphaZero ...

MCTS was a critical advance for defeating Go

Several new versions including AlphaGo Zero and AlphaZero whichhave even more impressive performance

AlphaZero has also been applied to other games now including Chess


Table of Contents

1 Introduction

2 Model-Based Reinforcement Learning

3 Simulation-Based Search

4 Integrated Architectures


Real and Simulated Experience

We consider two sources of experience

Real experience: sampled from environment (true MDP)

S′ ∼ P^a_{s,s′}

R = R^a_s

Simulated experience: sampled from model (approximate MDP)

S ′ ∼ Pη(S ′ | S ,A)

R = Rη(R | S ,A)


Integrating Learning and Planning

Model-Free RL

No model
Learn value function (and/or policy) from real experience


Integrating Learning and Planning

Model-Free RL

No model
Learn value function (and/or policy) from real experience

Model-Based RL (using Sample-Based Planning)

Learn a model from real experience
Plan value function (and/or policy) from simulated experience


Integrating Learning and Planning

Model-Free RL

No model
Learn value function (and/or policy) from real experience

Model-Based RL (using Sample-Based Planning)

Learn a model from real experience
Plan value function (and/or policy) from simulated experience

Dyna

Learn a model from real experience
Learn and plan value function (and/or policy) from real and simulated experience


Dyna Architecture


Dyna-Q Algorithm
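
The slide shows the tabular Dyna-Q pseudocode as a figure. A hedged Python sketch of that algorithm, following the standard Sutton & Barto description, is below; the `env.reset()` / `env.step(a) -> (s_next, r, done)` interface and all other names are assumptions for illustration.

```python
import random

def dyna_q(env, states, actions, n_planning=5, episodes=100,
           alpha=0.1, gamma=0.95, epsilon=0.1):
    Q = {(s, a): 0.0 for s in states for a in actions}
    model = {}   # (s, a) -> (r, s', done): deterministic table-lookup model
    for _ in range(episodes):
        s, done = env.reset(), False
        while not done:
            a = (random.choice(actions) if random.random() < epsilon
                 else max(actions, key=lambda a_: Q[(s, a_)]))
            s_next, r, done = env.step(a)
            # (a) direct RL: Q-learning update from real experience
            bootstrap = 0.0 if done else max(Q[(s_next, a_)] for a_ in actions)
            Q[(s, a)] += alpha * (r + gamma * bootstrap - Q[(s, a)])
            # (b) model learning: remember the last observed outcome of (s, a)
            model[(s, a)] = (r, s_next, done)
            # (c) planning: n simulated Q-learning updates on previously seen pairs
            for _ in range(n_planning):
                (sp, ap), (rp, sp_next, dp) = random.choice(list(model.items()))
                bootstrap = 0.0 if dp else max(Q[(sp_next, a_)] for a_ in actions)
                Q[(sp, ap)] += alpha * (rp + gamma * bootstrap - Q[(sp, ap)])
            s = s_next
    return Q
```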


Dyna-Q on a Simple Maze


Dyna-Q with an Inaccurate Model


Dyna-Q with an Inaccurate Model (2)


Class Structure

Last time: Batch RL

This Time: MCTS

Next time: Human in the Loop RL
