Reinforcement Learning and Optimal Control: A Selective Overview
Dimitri P. Bertsekas
Laboratory for Information and Decision Systems, Massachusetts Institute of Technology
2018 CDC, December 2018
Bertsekas (M.I.T.) Reinforcement Learning 1 / 33
Reinforcement Learning (RL): A Happy Union of AI and Decision/Control Ideas

[Figure: two complementary sets of ideas, merging in the late 80s-early 90s. Decision/Control/DP side: principle of optimality, Markov decision problems, POMDP, policy iteration, value iteration. AI/RL side: learning through experience, simulation and model-free methods, feature-based representations, A*/games/heuristics.]
Historical highlights
Exact DP, optimal control (Bellman, Shannon, 1950s ...)
First major successes: backgammon programs (Tesauro, 1992, 1996)
Algorithmic progress, analysis, applications, first books (mid 90s ...)
Machine learning, big data, robotics, deep neural networks (mid 2000s ...)
AlphaGo and AlphaZero (DeepMind, 2016, 2017)
AlphaGo (2016) and AlphaZero (2017)
AlphaZero (Google DeepMind)
Plays differently!
Learned from scratch ... with 4 hours of training!
Plays much better than all chess programs
Same algorithm learned multiple games (Go, Shogi)
AlphaZero was Trained Using Self-Generated Data
[Figure: AlphaZero's self-learning loop. The current player $\mu$ generates self-play games; a deep neural network maps position features $F(i)$ to a position "value" and move "probabilities", a feature-based parametric cost approximation $\tilde J_\mu(F(i), r)$ with weight vector $r = (r_1, \ldots, r_s)$; MCTS lookahead minimization then produces an "improved" player, and the cycle repeats (self-learning / policy iteration).]
The "current" player plays games that are used to "train" an "improved" player
At a given position, the "move probabilities" and the "value" of the position are approximated by a deep neural net (NN)
Successive NNs are trained using self-generated data and a form of regression
A form of randomized policy improvement, Monte Carlo Tree Search (MCTS), generates the move probabilities
AlphaZero bears similarity to earlier works, e.g., TD-Gammon (Tesauro, 1992), but is more complicated because of the MCTS and the deep NN
The success of AlphaZero is due to a skillful implementation/integration of known ideas, and awesome computational power
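A minimal sketch of this self-play training loop, with a toy chain problem in place of a game and a linear least-squares fit standing in for the deep NN and its regression (all problem details are illustrative assumptions, not from the talk):

```python
import random

# Toy "game" (assumed): states 0..9 on a line; actions -1/+1; reward 1 on
# reaching state 9. The "player" is one-step lookahead against a learned
# linear value approximation value(s) = r[0] + r[1] * s/9.
N_STATES, ACTIONS, GAMMA = 10, (-1, 1), 0.9

def step(s, a):
    s2 = min(max(s + a, 0), N_STATES - 1)
    return s2, (1.0 if s2 == N_STATES - 1 else 0.0)

def value(s, r):
    return r[0] + r[1] * s / (N_STATES - 1)

def improved_action(s, r):
    # greedy policy improvement step (AlphaZero uses randomized MCTS here;
    # this sketch uses plain one-step lookahead)
    return max(ACTIONS, key=lambda a: step(s, a)[1] + GAMMA * value(step(s, a)[0], r))

def policy_iteration(rounds=5, episodes=40, horizon=25):
    r = [0.0, 0.0]
    for _ in range(rounds):
        data = []                              # (state, return-to-go) pairs
        for _ in range(episodes):              # "self-play" with current player
            s = random.randrange(N_STATES)
            path, rews = [], []
            for _ in range(horizon):
                path.append(s)
                s, rew = step(s, improved_action(s, r))
                rews.append(rew)
                if rew > 0:
                    break
            G = 0.0
            for x, rew in zip(reversed(path), reversed(rews)):
                G = rew + GAMMA * G
                data.append((x, G))
        # "train the improved player": least-squares regression of values
        xs = [s / (N_STATES - 1) for s, _ in data]
        gs = [g for _, g in data]
        mx, mg = sum(xs) / len(xs), sum(gs) / len(gs)
        var = sum((x - mx) ** 2 for x in xs)
        slope = sum((x - mx) * (g - mg) for x, g in zip(xs, gs)) / var if var else 0.0
        r = [mg - slope * mx, slope]
    return r

random.seed(0)
r = policy_iteration()
```

After a few rounds the fitted values increase toward the goal state, so the lookahead player heads right from any state; the same alternation of self-generated data and regression is the core of the AlphaZero scheme.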
Approximate DP/RL Methodology is now Ambitious and Universal
Exact DP applies (in principle) to a very broad range of optimization problems:
Deterministic ↔ Stochastic
Combinatorial optimization ↔ Optimal control w/ infinite state/control spaces
One decision maker ↔ Two-player games
... BUT is plagued by the curse of dimensionality and the need for a math model
Approximate DP/RL overcomes the difficulties of exact DP by:
Approximation (use neural nets and other architectures to reduce dimension)
Simulation (use a computer model in place of a math model)
State of the art:
Broadly applicable methodology: can address a broad range of challenging problems; deterministic-stochastic-dynamic, discrete-continuous, games, etc.
There are no methods that are guaranteed to work for all or even most problems
There are enough methods to try with a reasonable chance of success for most types of optimization problems
Role of the theory: guide the art, delineate the sound ideas
Approximation in Value Space
Central idea: lookahead with an approximate cost
Compute an approximation $\tilde J$ to the optimal cost function $J^*$
At the current state, apply the control that attains the minimum in
Current Stage Cost + $\tilde J$(Next State)
Multistep lookahead extension:
At the current state, solve an ℓ-step DP problem using terminal cost $\tilde J$
Apply the first control in the optimal policy for the ℓ-step problem
Example approaches to compute $\tilde J$:
Problem approximation: use as $\tilde J$ the optimal cost function of a simpler problem
Rollout and model predictive control: use a single policy iteration, with cost evaluated on-line by simulation or limited optimization
Self-learning/approximate policy iteration (API): use as $\tilde J$ an approximation to the cost function of the final policy obtained through a policy iteration process
Role of neural networks: "learn" the cost functions of policies in the context of API; "learn" policies obtained by value space approximation
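The one-step lookahead rule can be sketched in a few lines; the dynamics, costs, and cost-to-go approximation below are illustrative assumptions (a toy inventory problem), not from the talk:

```python
# One-step lookahead: at state x, apply the control minimizing
#   E[ g(x, u, w) + J_tilde(f(x, u, w)) ]
# Toy inventory example (all numbers assumed): x < 0 means backlog.

CONTROLS = (0, 1, 2)                       # units ordered
DISTURBANCES = ((0, 0.5), (1, 0.5))        # (demand w, probability)

def f(x, u, w):                            # next stock level
    return x + u - w

def g(x, u, w):                            # ordering + holding/backlog cost
    return 2.0 * u + abs(f(x, u, w))

def J_tilde(x):                            # cost-to-go approximation (assumed)
    return 1.5 * abs(x)

def lookahead_control(x):
    def q(u):                              # expected stage cost + approximation
        return sum(p * (g(x, u, w) + J_tilde(f(x, u, w)))
                   for w, p in DISTURBANCES)
    return min(CONTROLS, key=q)
```

With these numbers the rule orders nothing when stock is ample and reorders when backlogged; swapping in a different $\tilde J$ changes the behavior without touching the lookahead logic, which is the point of the method.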
Aims and References of this Talk
The purpose of this talk: to selectively review some of the methods, and bring out some of the AI-DP connections
References:
Quite a few exact DP books (1950s-present, starting with Bellman; my latest book "Abstract DP" came out earlier this year)
Quite a few DP/approximate DP/RL/neural nets books (1996-present):
Bertsekas and Tsitsiklis, Neuro-Dynamic Programming, 1996
Sutton and Barto, Reinforcement Learning, 1998 (new edition 2019, draft on-line)
NEW DRAFT BOOK: Bertsekas, Reinforcement Learning and Optimal Control, 2019, on-line
Many surveys on all aspects of the subject; Tesauro's papers on computer backgammon, and Silver et al.'s papers on AlphaZero
Terminology in RL/AI and DP/Control
RL uses Max/Value, DP uses Min/Cost:
Reward of a stage = (opposite of) cost of a stage
State value = (opposite of) state cost
Value (or state-value) function = (opposite of) cost function
Controlled system terminology:
Agent = decision maker or controller
Action = control
Environment = dynamic system
Methods terminology:
Learning = solving a DP-related problem using simulation
Self-learning (or self-play in the context of games) = solving a DP problem using simulation-based policy iteration
Planning vs learning distinction = solving a DP problem with model-based vs model-free simulation
Outline
1 Approximation in Value Space
2 Problem Approximation
3 Rollout and Model Predictive Control
4 Parametric Approximation - Neural Networks
5 Neural Networks and Approximation in Value Space
6 Model-free DP in Terms of Q-Factors
7 Policy Iteration - Self-Learning
Finite Horizon Problem - Exact DP

[Figure: closed loop — controller applies $u_k = \mu_k(x_k)$ to the system $x_{k+1} = f_k(x_k, u_k, w_k)$, driven by disturbance $w_k$]

System: $x_{k+1} = f_k(x_k, u_k, w_k)$, $k = 0, 1, \ldots, N-1$,
where $x_k$: state, $u_k$: control, $w_k$: random disturbance

Cost function:
$E\Big\{ g_N(x_N) + \sum_{k=0}^{N-1} g_k(x_k, u_k, w_k) \Big\}$

Perfect state information: $u_k$ is applied with (exact) knowledge of $x_k$
Optimization over feedback policies $\{\mu_0, \ldots, \mu_{N-1}\}$: rules that specify the control $\mu_k(x_k)$ to apply at each possible state $x_k$ that can occur
The DP Algorithm and Approximation in Value Space

Go backwards, $k = N-1, \ldots, 0$, using
$J_N(x_N) = g_N(x_N)$
$J_k(x_k) = \min_{u_k} E_{w_k}\big\{ g_k(x_k, u_k, w_k) + J_{k+1}\big(f_k(x_k, u_k, w_k)\big) \big\}$
$J_k(x_k)$: optimal cost-to-go starting from state $x_k$

Approximate DP is motivated by the ENORMOUS computational demands of exact DP
Approximation in value space: use an approximate cost-to-go function $\tilde J_{k+1}$
$\mu_k(x_k) \in \arg\min_{u_k} E_{w_k}\big\{ g_k(x_k, u_k, w_k) + \tilde J_{k+1}\big(f_k(x_k, u_k, w_k)\big) \big\}$

There is also a multistep lookahead version:
At state $x_k$ solve an ℓ-step DP problem with terminal cost function approximation $\tilde J_{k+\ell}$. Use the first control in the optimal ℓ-step sequence.
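The backward recursion can be written out directly for a small finite-state problem; the states, dynamics, and costs below are illustrative assumptions:

```python
# Exact finite-horizon DP: tabulate J_k(x) backwards from J_N = g_N.
# Tiny assumed problem: states 0..4, controls -1/0/+1, quadratic stage costs.

N = 3                                          # horizon
STATES = range(5)
CONTROLS = (-1, 0, 1)
DISTURBANCES = ((0, 0.8), (1, 0.2))            # (w, probability)

def f(x, u, w):                                # next state, clipped to 0..4
    return min(max(x + u + w, 0), 4)

def g(x, u, w):                                # stage cost
    return x * x + u * u

def g_N(x):                                    # terminal cost
    return 10.0 * x

def dp():
    J = {x: g_N(x) for x in STATES}            # J_N
    policy = []
    for k in reversed(range(N)):
        Jk, mu = {}, {}
        for x in STATES:
            # Q-factor of each control: expected stage cost + cost-to-go
            q = {u: sum(p * (g(x, u, w) + J[f(x, u, w)])
                        for w, p in DISTURBANCES) for u in CONTROLS}
            mu[x] = min(q, key=q.get)
            Jk[x] = q[mu[x]]
        J, policy = Jk, [mu] + policy
    return J, policy

J0, policy = dp()
```

The per-stage tables `J0` and `policy` are exactly what becomes intractable when the state space is large, which is the motivation for replacing $J_{k+1}$ by an approximation $\tilde J_{k+1}$.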
Approximation in Value Space Methods
One-step case at state xk :min
uk,µk+1,...,µk+ℓ−1
E
!gk(xk, uk, wk) +
k+ℓ−1"
m=k+1
gk
#xm, µm(xm), wm
$+ Jk+ℓ(xk+ℓ)
%
First ℓ Steps “Future”Nonlinear Ay(x) + b φ1(x, v) φ2(x, v) φm(x, v) r x Initial
Selective Depth Lookahead Tree σ(ξ) ξ 1 0 -1 Encoding y(x)
Linear Layer Parameter v = (A, b) Sigmoidal Layer Linear WeightingCost Approximation r′φ(x, v)
Feature Extraction Features: Material Balance, uk = µdk
#xk(Ik)
$
Mobility, Safety, etc Weighting of Features Score Position EvaluatorStates xk+1 States xk+2
State xk Feature Vector φk(xk) Approximator r′kφk(xk)
x0 xk im−1 im . . . (0, 0) (N, −N) (N, 0) i (N, N) −N 0 g(i) I N − 2N i
s i1 im−1 im . . . (0, 0) (N, −N) (N, 0) i (N, N) −N 0 g(i) I N − 2 Ni
u1k u2
k u3k u4
k Selective Depth Adaptive Simulation Tree Projections ofLeafs of the Tree
p(j1) p(j2) p(j3) p(j4)
Neighbors of im Projections of Neighbors of im
State x Feature Vector φ(x) Approximator φ(x)′r
ℓ Stages Riccati Equation Iterates P P0 P1 P2 γ2 − 1 γ2PP+1
Cost of Period k Stock Ordered at Period k Inventory Systemr(uk) + cuk xk+1 = xk + u + k − wk
Stock at Period k +1 Initial State A C AB AC CA CD ABC
1
Approximations: Replace E{·} with nominal values (certainty equiv-alent control)
minuk,µk+1,...,µk+ℓ−1
E
!gk(xk, uk, wk) +
k+ℓ−1"
m=k+1
gk
#xm, µm(xm), wm
$+ Jk+ℓ(xk+ℓ)
%
First ℓ Steps “Future”Nonlinear Ay(x) + b φ1(x, v) φ2(x, v) φm(x, v) r x Initial
Selective Depth Lookahead Tree σ(ξ) ξ 1 0 -1 Encoding y(x)
Linear Layer Parameter v = (A, b) Sigmoidal Layer Linear WeightingCost Approximation r′φ(x, v)
Feature Extraction Features: Material Balance, uk = µdk
#xk(Ik)
$
Mobility, Safety, etc Weighting of Features Score Position EvaluatorStates xk+1 States xk+2
State xk Feature Vector φk(xk) Approximator r′kφk(xk)
x0 xk im−1 im . . . (0, 0) (N, −N) (N, 0) i (N, N) −N 0 g(i) I N − 2N i
s i1 im−1 im . . . (0, 0) (N, −N) (N, 0) i (N, N) −N 0 g(i) I N − 2 Ni
u1k u2
k u3k u4
k Selective Depth Adaptive Simulation Tree Projections ofLeafs of the Tree
p(j1) p(j2) p(j3) p(j4)
Neighbors of im Projections of Neighbors of im
State x Feature Vector φ(x) Approximator φ(x)′r
ℓ Stages Riccati Equation Iterates P P0 P1 P2 γ2 − 1 γ2PP+1
1
Approximations: Computation of Jk+ℓ:
DP minimization Replace E{·} with nominal values
(certainty equivalent control)
Limited simulation (Monte Carlo tree search)
Simple choices Parametric approximation Problem approximation
Rollout
minuk,µk+1,...,µk+ℓ−1
E
!gk(xk, uk, wk) +
k+ℓ−1"
m=k+1
gk
#xm, µm(xm), wm
$+ Jk+ℓ(xk+ℓ)
%
First ℓ Steps “Future”Nonlinear Ay(x) + b φ1(x, v) φ2(x, v) φm(x, v) r x Initial
Selective Depth Lookahead Tree σ(ξ) ξ 1 0 -1 Encoding y(x)
Linear Layer Parameter v = (A, b) Sigmoidal Layer Linear WeightingCost Approximation r′φ(x, v)
Feature Extraction Features: Material Balance, uk = µdk
#xk(Ik)
$
Mobility, Safety, etc Weighting of Features Score Position EvaluatorStates xk+1 States xk+2
State xk Feature Vector φk(xk) Approximator r′kφk(xk)
x0 xk im−1 im . . . (0, 0) (N, −N) (N, 0) i (N, N) −N 0 g(i) I N − 2N i
s i1 im−1 im . . . (0, 0) (N, −N) (N, 0) i (N, N) −N 0 g(i) I N − 2 Ni
u1k u2
k u3k u4
k Selective Depth Adaptive Simulation Tree Projections ofLeafs of the Tree
p(j1) p(j2) p(j3) p(j4)
1
Approximations: Computation of Jk+ℓ: (Could be approximate)
DP minimization Replace E{·} with nominal values
(certainty equivalent control)
Limited simulation (Monte Carlo tree search)
Simple choices Parametric approximation Problem approximation
Rollout
minuk
E!gk(xk, uk, wk) + Jk+1(xk+ℓ)
"
minuk,µk+1,...,µk+ℓ−1
E
#gk(xk, uk, wk) +
k+ℓ−1$
m=k+1
gk
%xm, µm(xm), wm
&+ Jk+ℓ(xk+ℓ)
'
First ℓ Steps “Future” First StepNonlinear Ay(x) + b φ1(x, v) φ2(x, v) φm(x, v) r x Initial
Selective Depth Lookahead Tree σ(ξ) ξ 1 0 -1 Encoding y(x)
Linear Layer Parameter v = (A, b) Sigmoidal Layer Linear WeightingCost Approximation r′φ(x, v)
Feature Extraction Features: Material Balance, uk = µdk
%xk(Ik)
&
Mobility, Safety, etc Weighting of Features Score Position EvaluatorStates xk+1 States xk+2
State xk Feature Vector φk(xk) Approximator r′kφk(xk)
x0 xk im−1 im . . . (0, 0) (N, −N) (N, 0) i (N, N) −N 0 g(i) I N − 2N i
s i1 im−1 im . . . (0, 0) (N, −N) (N, 0) i (N, N) −N 0 g(i) I N − 2 Ni
u1k u2
k u3k u4
k Selective Depth Adaptive Simulation Tree Projections ofLeafs of the Tree
1
Approximations: Computation of Jk+ℓ: (Could be approximate)
DP minimization Replace E{·} with nominal values
(certainty equivalent control) Computation of Jk+1:
Limited simulation (Monte Carlo tree search)
Simple choices Parametric approximation Problem approximation
Rollout
minuk
E!
gk(xk, uk, wk) + Jk+1(xk+1)"
minuk,µk+1,...,µk+ℓ−1
E
#gk(xk, uk, wk) +
k+ℓ−1$
m=k+1
gk
%xm, µm(xm), wm
&+ Jk+ℓ(xk+ℓ)
'
First ℓ Steps “Future” First StepNonlinear Ay(x) + b φ1(x, v) φ2(x, v) φm(x, v) r x Initial
Selective Depth Lookahead Tree σ(ξ) ξ 1 0 -1 Encoding y(x)
Linear Layer Parameter v = (A, b) Sigmoidal Layer Linear WeightingCost Approximation r′φ(x, v)
Feature Extraction Features: Material Balance, uk = µdk
%xk(Ik)
&
Mobility, Safety, etc Weighting of Features Score Position EvaluatorStates xk+1 States xk+2
State xk Feature Vector φk(xk) Approximator r′kφk(xk)
x0 xk im−1 im . . . (0, 0) (N, −N) (N, 0) i (N, N) −N 0 g(i) I N − 2N i
s i1 im−1 im . . . (0, 0) (N, −N) (N, 0) i (N, N) −N 0 g(i) I N − 2 Ni
u1k u2
k u3k u4
k Selective Depth Adaptive Simulation Tree Projections ofLeafs of the Tree
1
Approximations: Computation of Jk+ℓ: (Could be approximate)
DP minimization Replace E{·} with nominal values
(certainty equivalent control) Computation of Jk+1:
Limited simulation (Monte Carlo tree search)
Simple choices Parametric approximation Problem approximation
Rollout
minuk
E!
gk(xk, uk, wk) + Jk+1(xk+1)"
minuk,µk+1,...,µk+ℓ−1
E
#gk(xk, uk, wk) +
k+ℓ−1$
m=k+1
gk
%xm, µm(xm), wm
&+ Jk+ℓ(xk+ℓ)
'
First ℓ Steps “Future” First StepNonlinear Ay(x) + b φ1(x, v) φ2(x, v) φm(x, v) r x Initial
Selective Depth Lookahead Tree σ(ξ) ξ 1 0 -1 Encoding y(x)
Linear Layer Parameter v = (A, b) Sigmoidal Layer Linear WeightingCost Approximation r′φ(x, v)
Feature Extraction Features: Material Balance, uk = µdk
%xk(Ik)
&
Mobility, Safety, etc Weighting of Features Score Position EvaluatorStates xk+1 States xk+2
State xk Feature Vector φk(xk) Approximator r′kφk(xk)
x0 xk im−1 im . . . (0, 0) (N, −N) (N, 0) i (N, N) −N 0 g(i) I N − 2N i
s i1 im−1 im . . . (0, 0) (N, −N) (N, 0) i (N, N) −N 0 g(i) I N − 2 Ni
u1k u2
k u3k u4
k Selective Depth Adaptive Simulation Tree Projections ofLeafs of the Tree
1
Approximations:
In the lookahead minimization: approximate minimization; replacement of E{·} with nominal values (certainty equivalent control); limited simulation (Monte Carlo tree search).
Computation of \tilde J_{k+\ell} (which could itself be approximate): simple choices; parametric approximation; problem approximation; rollout; model predictive control.
[Figure residue from later slides: aggregation, adaptive simulation, Monte Carlo tree search, certainty equivalence; shortest-path node test d_i + a_{ij} < UPPER − h_j; a two-state example with J(1) = min{c, a + J(2)}, J(2) = b + J(1); proper vs. improper policies and the regions W_p, W_p^+.]
\tilde J_\mu(F(i), r): feature-based parametric architecture, where F(i) = (F_1(i), \ldots, F_s(i)) is a vector of features of state i and r is a vector of weights.

If \tilde J_\mu(F(i), r) = \sum_{\ell=1}^{s} F_\ell(i) r_\ell, it is a linear feature-based architecture (r_1, \ldots, r_s: scalar weights).

Aggregation pipeline: generate features F(i) with a neural network or other scheme (possibly including "handcrafted" features); form the aggregate states I_1, \ldots, I_q; choose the aggregation and disaggregation probabilities; formulate the aggregate problem; generate an "improved" policy \tilde\mu by "solving" the aggregate problem.

[Additional residue: position "value" and move "probabilities"; aggregate costs r^*_\ell; the regions W_p and W^+ = {J | J \ge J^+, J(t) = 0}; VI converges to J^+ from within W^+, and to J_p from within W_p when g(x_k, u_k) \ge 0.]
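Hard aggregation, the simplest aggregation scheme, can be sketched concretely: states in the same aggregate group share a single scalar value r_g, so the approximation is piecewise constant over the groups. The groups and the sampled costs below are made-up illustrations:

```python
from collections import defaultdict

# Sketch of hard aggregation as a piecewise constant cost approximation.

group = {0: "low", 1: "low", 2: "mid", 3: "mid", 4: "high", 5: "high"}

# Sampled (state, cost) pairs, e.g., from simulation of some policy (made up).
samples = [(0, 1.0), (1, 1.4), (2, 3.1), (3, 2.9), (4, 5.2), (5, 4.8)]

def fit_aggregate_costs(samples, group):
    """Average the sampled costs within each aggregate group."""
    sums, counts = defaultdict(float), defaultdict(int)
    for i, cost in samples:
        sums[group[i]] += cost
        counts[group[i]] += 1
    return {grp: sums[grp] / counts[grp] for grp in sums}

r = fit_aggregate_costs(samples, group)

def J_tilde(i):                        # piecewise constant over the groups
    return r[group[i]]
```

All states in a group get the same approximate cost (here, the group average), which is exactly the piecewise constant character of hard aggregation.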
Multistep case at state x_k (first ℓ steps, then the "future"):

\min_{u_k, \mu_{k+1}, \ldots, \mu_{k+\ell-1}} E\Bigl\{ g_k(x_k, u_k, w_k) + \sum_{m=k+1}^{k+\ell-1} g_m\bigl(x_m, \mu_m(x_m), w_m\bigr) + \tilde J_{k+\ell}(x_{k+\ell}) \Bigr\}

[Figure residue: selective depth lookahead/adaptive simulation tree; linear feature approximator φ(x)'r; Riccati equation iterates P_0, P_1, P_2 over ℓ stages; inventory system x_{k+1} = x_k + u_k − w_k with period-k cost r(u_k) + c u_k.]
[Figure residue: tail problem approximation with constraint relaxation (U, U_1, U_2); at the current state x_k: MCTS lookahead minimization, rollout, and cost-to-go approximation; rollout = simulation with a fixed policy, with parametric approximation at the end and Monte Carlo tree search; subspace S = {Φr | r ∈ ℜ^s}; multistep extrapolation formula T^{(λ)} = P^{(c)} · T = T · P^{(c)}.]
Problem Approximation: Simplify the Tail Problem and Solve it Exactly

Use as cost-to-go approximation \tilde J_{k+1} the exact cost-to-go of a simpler problem.

Many problem-dependent possibilities:

Probabilistic approximation
  Certainty equivalence: replace stochastic quantities by deterministic ones (makes the lookahead minimization deterministic)
  Approximate expected values by limited simulation
  Partial versions of certainty equivalence

Enforced decomposition of coupled subsystems
  One-subsystem-at-a-time optimization
  Constraint decomposition
  Lagrangian relaxation

Aggregation: group states together and view the groups as aggregate states
  Hard aggregation: \tilde J_{k+1} is a piecewise constant approximation to J_{k+1}
  Feature-based aggregation: the aggregate states are defined by "features" of the original states
  Biased hard aggregation: \tilde J_{k+1} is a piecewise constant local correction to some other approximation, e.g., one provided by a neural net
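Certainty equivalence can be illustrated concretely: plugging a nominal (mean) disturbance value into the lookahead makes the minimization deterministic, at the price of possibly selecting a different control than the exact expectation would. The model data below are illustrative assumptions:

```python
# Sketch of certainty equivalent control vs. exact one-step lookahead.

W = [(0, 0.3), (1, 0.5), (2, 0.2)]            # demand values, probabilities
w_nominal = sum(w * p for w, p in W)          # mean demand = 0.9

def f(x, u, w): return x + u - w              # next stock (can go negative)
def g(x, u, w): return u + abs(x + u - w)     # order cost + imbalance cost
def J_tilde(x): return 0.5 * abs(x)           # crude cost-to-go approximation

def cec_control(x, controls):
    """Deterministic lookahead with the nominal disturbance plugged in."""
    return min(controls,
               key=lambda u: g(x, u, w_nominal) + J_tilde(f(x, u, w_nominal)))

def exact_control(x, controls):
    """One-step lookahead with the full expectation over the disturbance."""
    return min(controls,
               key=lambda u: sum(p * (g(x, u, w) + J_tilde(f(x, u, w)))
                                 for w, p in W))
```

With these made-up numbers the two rules disagree at x = 0: the certainty equivalent rule orders one unit while the exact lookahead orders nothing, which is the kind of discrepancy the simplification can introduce.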
Rollout: On-Line Simulation-Based Approximation in Value Space
[Figure residue: selective depth lookahead tree; chess position evaluator with weighted features (material balance, mobility, safety, etc.); schedule-completion example over sequences A, C, AB, AC, ...; parking problem with spaces c(0), \ldots, c(N − 1) over N stages; machine repair example (repair / do not repair) with transition probabilities p_{11}, p_{12}, \ldots]
[Figure residue from other slides: approximate policy iteration with error zone of width (ε + 2αδ)/(1 − α)^2; Q-factor evaluation of the current policy µ and improvement \tilde\mu(i) ∈ arg min_{u∈U(i)} \tilde Q_\mu(i, u, r); variable-length and selective-depth rollout with a base heuristic and terminal cost function; belief-state controller u_k = \mu_k(p_k); shortest-path and two-state examples.]
[Figure: rollout at state x_k — lookahead over states x_{k+1}, x_{k+2}; adaptive simulation of a heuristic/suboptimal base policy; terminal cost approximation; an "improved" policy is obtained from the current policy µ by lookahead minimization. Remaining labels duplicate the feature-based architecture and aggregation slides.]
The base policy can be any suboptimal policy (obtained by another method)
One-step or multistep lookahead; exact minimization or a "randomized form of lookahead" that involves "adaptive" simulation and Monte Carlo tree search
With or without terminal cost approximation (obtained by another method)
Some forms of model predictive control can be viewed as special cases (the base policy is a short-term deterministic optimization)
Important theoretical fact: with exact lookahead and no terminal cost approximation, the rollout policy improves over the base policy
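A minimal sketch of rollout under stated assumptions (hypothetical inventory model, an arbitrary heuristic base policy): the Q-factor of each candidate control is estimated by Monte Carlo simulation of the base policy over a finite horizon, and the rollout policy picks the minimizing control.

```python
import random

# Sketch of one-step rollout with a heuristic base policy (illustrative model).

random.seed(0)
DEMANDS = [0, 1, 1, 2]                 # demand samples, drawn uniformly

def f(x, u, w): return max(x + u - w, 0)
def g(x, u, w): return u + abs(x + u - w)

def base_policy(x):                    # heuristic: order up to stock level 1
    return max(1 - x, 0)

def rollout_q(x, u, horizon=10, num_sims=200):
    """Monte Carlo estimate of the Q-factor of control u at state x."""
    total = 0.0
    for _ in range(num_sims):
        w = random.choice(DEMANDS)
        cost, y = g(x, u, w), f(x, u, w)
        for _ in range(horizon - 1):   # continue with the base policy
            w = random.choice(DEMANDS)
            v = base_policy(y)
            cost += g(y, v, w)
            y = f(y, v, w)
        total += cost
    return total / num_sims

def rollout_control(x, controls):
    """The rollout policy: minimize the simulated Q-factor."""
    return min(controls, key=lambda u: rollout_q(x, u))
```

In line with the cost improvement property stated above, the rollout policy can be expected to perform at least as well as the base policy when the lookahead and simulation are exact; with finite sampling, as here, the improvement holds only approximately.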
Example of Rollout: Backgammon (Tesauro, 1996)
[Figure: from the current position, each possible move is evaluated by the average score of Monte Carlo simulations that continue the game with the base policy.]
The base policy was a backgammon player developed by a different RL method [TD(λ) trained with a neural network]; it was also used for terminal cost approximation
The best backgammon players are based on rollout ... but are too slow for real-time play (Monte Carlo simulation takes too long)
AlphaGo has a similar structure: the base policy and terminal cost approximation are obtained with a deep neural net. In AlphaZero the rollout-with-base-policy part was dropped (long lookahead suffices)
Parametric Approximation in Value Space
[Figure: a neural network as cost approximator for the lookahead scheme. The state x is encoded as y(x), passed through a linear layer with parameter v = (A, b) and a sigmoidal layer σ(ξ) to produce features φ_1(x, v), \ldots, φ_m(x, v), which are linearly weighted to give the cost approximation r'φ(x, v). Also shown: a chess position evaluator with feature extraction (material balance, mobility, safety, etc.) and approximator r'_k φ_k(x_k).]
minuk,µk+1,...,µk+ℓ−1
E
!gk(xk, uk, wk) +
k+ℓ−1"
m=k+1
gk
#xm, µm(xm), wm
$+ Jk+ℓ(xk+ℓ)
%
First ℓ Steps “Future”Nonlinear Ay(x) + b φ1(x, v) φ2(x, v) φm(x, v) r x Initial
Selective Depth Lookahead Tree σ(ξ) ξ 1 0 -1 Encoding y(x)
Linear Layer Parameter v = (A, b) Sigmoidal Layer Linear WeightingCost Approximation r′φ(x, v)
Feature Extraction Features: Material Balance, uk = µdk
#xk(Ik)
$
Mobility, Safety, etc Weighting of Features Score Position EvaluatorStates xk+1 States xk+2
State xk Feature Vector φk(xk) Approximator r′kφk(xk)
x0 xk im−1 im . . . (0, 0) (N, −N) (N, 0) i (N, N) −N 0 g(i) I N − 2N i
s i1 im−1 im . . . (0, 0) (N, −N) (N, 0) i (N, N) −N 0 g(i) I N − 2 Ni
u1k u2
k u3k u4
k Selective Depth Adaptive Simulation Tree Projections ofLeafs of the Tree
p(j1) p(j2) p(j3) p(j4)
Neighbors of im Projections of Neighbors of im
State x Feature Vector φ(x) Approximator φ(x)′r
ℓ Stages Riccati Equation Iterates P P0 P1 P2 γ2 − 1 γ2PP+1
Cost of Period k Stock Ordered at Period k Inventory Systemr(uk) + cuk xk+1 = xk + u + k − wk
Stock at Period k +1 Initial State A C AB AC CA CD ABC
1
minuk,µk+1,...,µk+ℓ−1
E
!gk(xk, uk, wk) +
k+ℓ−1"
m=k+1
gk
#xm, µm(xm), wm
$+ Jk+ℓ(xk+ℓ)
%
First ℓ Steps “Future”Nonlinear Ay(x) + b φ1(x, v) φ2(x, v) φm(x, v) r x Initial
Selective Depth Lookahead Tree σ(ξ) ξ 1 0 -1 Encoding y(x)
Linear Layer Parameter v = (A, b) Sigmoidal Layer Linear WeightingCost Approximation r′φ(x, v)
Feature Extraction Features: Material Balance, uk = µdk
#xk(Ik)
$
Mobility, Safety, etc Weighting of Features Score Position EvaluatorStates xk+1 States xk+2
State xk Feature Vector φk(xk) Approximator r′kφk(xk)
x0 xk im−1 im . . . (0, 0) (N, −N) (N, 0) i (N, N) −N 0 g(i) I N − 2N i
s i1 im−1 im . . . (0, 0) (N, −N) (N, 0) i (N, N) −N 0 g(i) I N − 2 Ni
u1k u2
k u3k u4
k Selective Depth Adaptive Simulation Tree Projections ofLeafs of the Tree
p(j1) p(j2) p(j3) p(j4)
Neighbors of im Projections of Neighbors of im
State x Feature Vector φ(x) Approximator φ(x)′r
ℓ Stages Riccati Equation Iterates P P0 P1 P2 γ2 − 1 γ2PP+1
Cost of Period k Stock Ordered at Period k Inventory Systemr(uk) + cuk xk+1 = xk + u + k − wk
Stock at Period k +1 Initial State A C AB AC CA CD ABC
1
Approximations: Computation of Jk+ℓ:
DP minimization Replace E{·} with nominal values
(certainty equivalent control)
Limited simulation (Monte Carlo tree search)
Simple choices Parametric approximation Problem approximation
Rollout
minuk,µk+1,...,µk+ℓ−1
E
!gk(xk, uk, wk) +
k+ℓ−1"
m=k+1
gk
#xm, µm(xm), wm
$+ Jk+ℓ(xk+ℓ)
%
First ℓ Steps “Future”Nonlinear Ay(x) + b φ1(x, v) φ2(x, v) φm(x, v) r x Initial
Selective Depth Lookahead Tree σ(ξ) ξ 1 0 -1 Encoding y(x)
Linear Layer Parameter v = (A, b) Sigmoidal Layer Linear WeightingCost Approximation r′φ(x, v)
Feature Extraction Features: Material Balance, uk = µdk
#xk(Ik)
$
Mobility, Safety, etc Weighting of Features Score Position EvaluatorStates xk+1 States xk+2
State xk Feature Vector φk(xk) Approximator r′kφk(xk)
x0 xk im−1 im . . . (0, 0) (N, −N) (N, 0) i (N, N) −N 0 g(i) I N − 2N i
s i1 im−1 im . . . (0, 0) (N, −N) (N, 0) i (N, N) −N 0 g(i) I N − 2 Ni
u1k u2
k u3k u4
k Selective Depth Adaptive Simulation Tree Projections ofLeafs of the Tree
p(j1) p(j2) p(j3) p(j4)
1
[Figure: at the current state x_k — MCTS lookahead minimization, rollout, tail problem approximation, constraint relaxation, and terminal cost-to-go approximation. The ℓ-step lookahead minimization is

min over (u_k, μ_{k+1}, ..., μ_{k+ℓ−1}) of E{ g_k(x_k, u_k, w_k) + Σ_{m=k+1}^{k+ℓ−1} g_m(x_m, μ_m(x_m), w_m) + J̃_{k+ℓ}(x_{k+ℓ}) }

Rollout: simulation with a fixed policy, parametric approximation at the end, Monte Carlo tree search. Multistep extrapolation formula: T^(λ) = P^(c) · T = T · P^(c).]
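The lookahead-plus-rollout scheme above can be sketched in code. This is a minimal illustration on made-up scalar dynamics, stage cost, and base policy (all names and numbers are assumptions, not the slide's example): one-step lookahead with the cost-to-go approximated by Monte Carlo rollout of a fixed policy.

```python
import random

# Sketch of one-step lookahead with rollout (simulation of a fixed base
# policy) as the cost-to-go approximation. Dynamics, cost, and policy
# below are illustrative, not from the slides.

random.seed(0)

def f(x, u, w):          # system: x_{k+1} = f(x_k, u_k, w_k)
    return x + u + w

def g(x, u, w):          # stage cost
    return x * x + u * u

def base_policy(x):      # fixed heuristic policy used for rollout
    return -0.5 * x

def rollout_cost(x, horizon=20, n_sims=10):
    """Monte Carlo estimate of the base policy's cost-to-go from x."""
    total = 0.0
    for _ in range(n_sims):
        y, c = x, 0.0
        for _ in range(horizon):
            u = base_policy(y)
            w = random.gauss(0.0, 0.1)
            c += g(y, u, w)
            y = f(y, u, w)
        total += c
    return total / n_sims

def lookahead_control(x, controls=(-1.0, -0.5, 0.0, 0.5, 1.0), n_w=5):
    """One-step lookahead: minimize over u an estimate of
    E{ g(x,u,w) + J~(f(x,u,w)) }, with J~ given by rollout."""
    ws = [random.gauss(0.0, 0.1) for _ in range(n_w)]
    def q(u):
        return sum(g(x, u, w) + rollout_cost(f(x, u, w)) for w in ws) / n_w
    return min(controls, key=q)

u0 = lookahead_control(2.0)
```

From x = 2 the lookahead pushes the state toward the origin, so the most negative available control wins the minimization.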
J̃_k comes from a class of functions J̃_k(x_k, r_k), where r_k is a tunable parameter vector
Feature-based architectures: The linear case
[Figure: feature-based linear architecture — State i → Feature Extraction Mapping → Feature Vector φ(i) → Linear Cost Approximator φ(i)′r]
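The linear feature-based architecture (state → feature extraction → linear weighting) can be sketched as follows; the features, sample states, and target costs here are illustrative, and the weights r are fit by plain least squares.

```python
import numpy as np

# Linear feature-based cost approximator: J~(i, r) = phi(i)' r,
# where phi is a hand-crafted feature extraction mapping (illustrative).

def feature_extraction(i):
    """Map a state i (here a scalar) to a feature vector phi(i)."""
    return np.array([1.0, i, i * i])

def linear_cost_approximator(i, r):
    return feature_extraction(i) @ r

# Fit the weights r by least squares on sampled (state, target-cost) pairs.
states = np.array([-2.0, -1.0, 0.0, 1.0, 2.0])
targets = states ** 2 + 1.0            # pretend these are sampled costs
Phi = np.vstack([feature_extraction(i) for i in states])
r, *_ = np.linalg.lstsq(Phi, targets, rcond=None)

approx_at_1_5 = linear_cost_approximator(1.5, r)
```

Since the targets lie in the span of the features, the fit is exact here; in general the architecture can only represent costs that are (approximately) linear in the chosen features.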
[Figure: selective depth adaptive simulation (lookahead) tree, as in computer chess — feature extraction (material balance, mobility, safety, etc.), weighting of features, and a position evaluator producing a score; approximator r′_k φ_k(x_k)]
Bertsekas (M.I.T.) Reinforcement Learning 20 / 33
Training with Fitted Value Iteration
This is just DP with intermediate approximation at each step
Start with J̃_N = g_N and sequentially train going backwards, until k = 0

Given J̃_{k+1}, we construct a number of samples (x_k^s, β_k^s), s = 1, ..., q, where

β_k^s = min_u E{ g_k(x_k^s, u, w_k) + J̃_{k+1}(f_k(x_k^s, u, w_k), r_{k+1}) }, s = 1, ..., q

We "train" J̃_k on the set of samples (x_k^s, β_k^s), s = 1, ..., q

Training by least squares/regression
We minimize over r_k

Σ_{s=1}^q ( J̃_k(x_k^s, r_k) − β_k^s )² + γ ‖r_k − r̄‖²

where r̄ is an initial guess for r_k and γ > 0 is a regularization parameter
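The backward training pass described above can be sketched as follows, on a made-up scalar linear-quadratic problem with a linear-in-features approximator; the dynamics, cost, feature choice, and sampling sizes are all assumptions for illustration.

```python
import numpy as np

# Fitted value iteration sketch (illustrative scalar problem).
# Dynamics: x_{k+1} = x_k + u + w,  stage cost: g = x^2 + u^2.
# Approximator: J~_k(x, r_k) = r_k' phi(x) with quadratic features.

rng = np.random.default_rng(0)
N = 10                                  # horizon
gamma_reg = 1e-3                        # regularization weight gamma
controls = np.linspace(-2.0, 2.0, 21)   # discretized control set
noise = rng.normal(0.0, 0.1, size=30)   # samples of w for the expectation

def phi(x):
    return np.array([1.0, x, x * x])    # feature vector

def approx(x, r_vec):
    return phi(x) @ r_vec

r = [np.zeros(3) for _ in range(N + 1)]
r[N] = np.array([0.0, 0.0, 1.0])        # terminal cost g_N(x) = x^2 exactly

# Backward pass: train J~_k using targets computed from J~_{k+1}
for k in range(N - 1, -1, -1):
    xs = rng.uniform(-3.0, 3.0, size=50)        # sample states x_k^s
    betas = []
    for x in xs:
        # beta_k^s = min_u E{ g(x,u,w) + J~_{k+1}(f(x,u,w), r_{k+1}) }
        q = [np.mean([(x ** 2 + u ** 2) + approx(x + u + w, r[k + 1])
                      for w in noise]) for u in controls]
        betas.append(min(q))
    # Regularized least squares over r_k
    Phi = np.array([phi(x) for x in xs])
    A = Phi.T @ Phi + gamma_reg * np.eye(3)
    r[k] = np.linalg.solve(A, Phi.T @ np.array(betas))
```

For this linear-quadratic instance the learned J̃_0 should be close to quadratic, with a positive coefficient on x² (the Riccati solution for this system is about 1.6).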
Bertsekas (M.I.T.) Reinforcement Learning 21 / 33
Neural Networks for Constructing Cost-to-Go Approximations J̃_k
Major fact about neural networks
They automatically construct features to be used in a linear architecture
Neural nets are approximation architectures of the form

J̃(x, v, r) = Σ_{i=1}^m r_i φ_i(x, v) = r′φ(x, v)
involving two parameter vectors r and v with different roles
View φ(x , v) as a feature vector
View r as a vector of linear weights for φ(x , v)
By training v jointly with r , we obtain automatically generated features!
Neural nets can be used in the fitted value iteration scheme
Train the stage k neural net (i.e., compute J̃_k) using a training set generated with the stage k + 1 neural net (which defines J̃_{k+1})
Bertsekas (M.I.T.) Reinforcement Learning 23 / 33
Neural Network with a Single Nonlinear Layer
[Figure: single-nonlinear-layer network — state x → encoding y(x) → linear layer Ay(x) + b (parameter v = (A, b)) → sigmoidal layer producing features φ_1(x, v), ..., φ_m(x, v) → linear weighting r → cost approximation r′φ(x, v); inset: plot of the sigmoidal unit σ(ξ) with levels 1, 0, −1]
State encoding (could be the identity, could include special features of the state)
Linear layer Ay(x) + b [parameters to be determined: v = (A, b)]
Nonlinear layer produces m outputs φ_i(x, v) = σ((Ay(x) + b)_i), i = 1, ..., m
σ is a scalar nonlinear differentiable function; several types have been used (hyperbolic tangent, logistic, rectified linear unit)
The training problem is to use the training set (x^s, β^s), s = 1, ..., q, to solve

min over (v, r) of Σ_{s=1}^q ( Σ_{i=1}^m r_i φ_i(x^s, v) − β^s )² + (Regularization Term)
Often solved with incremental gradient methods (known as backpropagation)
Universal approximation theorem: With a sufficiently large number of parameters, "arbitrarily" complex functions can be closely approximated
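A minimal sketch of this single-nonlinear-layer architecture and its training problem, assuming a hyperbolic tangent σ and plain batch gradient descent in place of incremental gradient/backpropagation; the training data, hidden width, and step size are illustrative.

```python
import numpy as np

# Single-nonlinear-layer network: y(x) -> linear layer A y(x) + b ->
# sigmoidal units phi_i(x, v) = tanh((A y(x) + b)_i) -> linear weights r.
# Trained on pairs (x^s, beta^s) by batch gradient descent (illustrative).

rng = np.random.default_rng(1)
m = 20                                   # number of hidden units
xs = np.linspace(-2, 2, 40)              # training inputs x^s
betas = np.abs(xs)                       # targets beta^s (|x| is not linear)

A = rng.normal(0, 1, (m, 1))             # v = (A, b): linear layer parameters
b = rng.normal(0, 1, m)
r = np.zeros(m)                          # final linear weights

def features(x):
    z = A @ np.atleast_1d(x) + b         # linear layer
    return np.tanh(z)                    # sigmoidal layer: phi(x, v)

step = 0.01
for _ in range(2000):
    gA = np.zeros_like(A); gb = np.zeros_like(b); gr = np.zeros_like(r)
    for x, beta in zip(xs, betas):
        ph = features(x)
        err = r @ ph - beta              # residual of r' phi(x, v) - beta
        dz = err * r * (1 - ph ** 2)     # chain rule through tanh
        gr += err * ph
        gA += np.outer(dz, [x])
        gb += dz
    A -= step * gA / len(xs)
    b -= step * gb / len(xs)
    r -= step * gr / len(xs)

mse = np.mean([(r @ features(x) - beta) ** 2 for x, beta in zip(xs, betas)])
```

Training v = (A, b) jointly with r is what turns the φ_i into automatically generated features: the hidden units move to where the target function bends.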
Bertsekas (M.I.T.) Reinforcement Learning 24 / 33
Deep Neural Networks
[Figure: a neural-network cost approximator. The state x is encoded as a vector y(x), passed through a linear layer Ay(x) + b with parameter v = (A, b), then through a nonlinear (sigmoidal) layer that produces the features φ1(x, v), ..., φm(x, v), which are combined by a linear weighting to give the cost approximation r'φ(x, v). Also shown: a hand-crafted feature extractor for chess (material balance, mobility, safety, etc.), with a weighted score serving as position evaluator.]
More complex NNs are formed by concatenation of multiple layers

The outputs of each nonlinear layer become the inputs of the next linear layer

A hierarchy of features

Considerable success has been achieved in major contexts

Possible reasons for the success
With more complex features, the number of parameters in the linear layers may be drastically decreased

We may use matrices A with a special structure that encodes special linear operations such as convolution
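To make the layered architecture concrete, here is a minimal pure-Python sketch of a single linear-plus-sigmoidal layer followed by a linear weighting of the resulting features. The encoding y(x) and all parameter values (A, b, r) are hypothetical, chosen only for illustration.

```python
import math

def features(x, v):
    """Sigmoidal layer: phi_l(x, v) = sigmoid(A_l . y(x) + b_l)."""
    A, b = v
    y = [x, 1.0]  # a simple hand-coded state encoding y(x) (illustrative)
    return [1.0 / (1.0 + math.exp(-(sum(a * yi for a, yi in zip(row, y)) + bl)))
            for row, bl in zip(A, b)]

def approx(x, v, r):
    """Linear weighting of the learned features: r' phi(x, v)."""
    return sum(rl * fl for rl, fl in zip(r, features(x, v)))

# Hypothetical parameters: two sigmoidal units (rows of A), linear weights r
v = ([[1.0, 0.0], [0.5, -1.0]], [0.0, 0.2])
r = [2.0, -1.0]
print(approx(1.5, v, r))
```

Training would tune v = (A, b) and r jointly (e.g. by gradient descent on a least-squares cost); only the forward evaluation is sketched here.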
Q-Factors - Model-Free RL
The Q-factor of a state-control pair (xk, uk) at time k is defined by

Qk(xk, uk) = E{ gk(xk, uk, wk) + Jk+1(xk+1) }

where Jk+1 is the optimal cost-to-go function for stage k + 1

Note that

Jk(xk) = min_{u ∈ Uk(xk)} Qk(xk, u)

so the DP algorithm can be written in terms of Qk:

Qk(xk, uk) = E{ gk(xk, uk, wk) + min_{u ∈ Uk+1(xk+1)} Qk+1(xk+1, u) }

We can approximate Q-factors instead of costs
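The backward Q-factor recursion above can be checked on a toy problem. The following self-contained sketch runs it on a hypothetical two-state, two-control, two-stage problem with zero terminal cost; all probabilities and costs are invented for illustration.

```python
import itertools

# Hypothetical finite-horizon problem: states 0,1; controls 0,1; N = 2 stages
N = 2
states, controls = [0, 1], [0, 1]
# p[i][u][j]: probability of moving from state i to state j under control u
p = {0: {0: [0.8, 0.2], 1: [0.3, 0.7]},
     1: {0: [0.5, 0.5], 1: [0.1, 0.9]}}
# g[i][u]: expected stage cost of applying control u at state i
g = {0: {0: 1.0, 1: 0.4}, 1: {0: 2.0, 1: 1.5}}

# Backward DP written in terms of Q-factors:
#   Q_k(i, u) = g(i, u) + sum_j p_ij(u) * J_{k+1}(j)
#   J_k(i)    = min_u Q_k(i, u),  with terminal cost J_N = 0
J = {i: 0.0 for i in states}  # J_N
for k in reversed(range(N)):
    Q = {(i, u): g[i][u] + sum(p[i][u][j] * J[j] for j in states)
         for i, u in itertools.product(states, controls)}
    J = {i: min(Q[(i, u)] for u in controls) for i in states}

# Optimal first-stage controls, read off the stage-0 Q-factors
policy0 = {i: min(controls, key=lambda u: Q[(i, u)]) for i in states}
print(J, policy0)
```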
Fitted Value Iteration for Q-Factors: Model-Free Approximate DP
Consider fitted value iteration with Q-factor parametric approximations:

Q̃k(xk, uk, rk) ≈ E{ gk(xk, uk, wk) + min_{u ∈ Uk+1(xk+1)} Q̃k+1(xk+1, u, rk+1) }

(Note a bit of mathematical magic: the order of E{·} and min has been reversed.)

We obtain Q̃k(xk, uk, rk) by training with many pairs ((x^s_k, u^s_k), β^s_k), where β^s_k is a sample of the approximate Q-factor of (x^s_k, u^s_k). No need to compute E{·}

No need for a model to obtain β^s_k: it is sufficient to have a simulator that generates random samples of state-control-cost-next-state quadruples ((xk, uk), (gk(xk, uk, wk), xk+1))

Having computed rk, the one-step lookahead control is obtained on-line as

µ̃k(xk) = arg min_{u ∈ Uk(xk)} Q̃k(xk, u, rk)

without the need for a model or expected value calculations

Also, the on-line calculation of the control is simplified
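A minimal sketch of this sample-based training step, under a tabular architecture (one weight per state-control pair), for which least-squares fitting reduces to per-pair sample averaging. The simulator's dynamics and costs, and the given next-stage approximation Q_next, are all hypothetical.

```python
import random

random.seed(0)

# Black-box simulator for one stage: given (state i, control u), returns a
# sampled (stage cost, next state).  Its internals are hidden from the
# training algorithm, which only draws samples from it.
def simulator(i, u):
    prob0 = {(0, 0): 0.8, (0, 1): 0.3, (1, 0): 0.5, (1, 1): 0.1}[(i, u)]
    j = 0 if random.random() < prob0 else 1
    cost = {(0, 0): 1.0, (0, 1): 0.4, (1, 0): 2.0, (1, 1): 1.5}[(i, u)]
    return cost, j

# Already-computed approximation of the next-stage Q-factors Q_{k+1}(j, u)
Q_next = {(0, 0): 1.0, (0, 1): 0.4, (1, 0): 2.0, (1, 1): 1.5}

# Collect samples beta^s = g + min_u' Q_{k+1}(j, u') and "train" on pairs
# ((i, u), beta^s); no model and no expected-value computation is needed
sums, counts = {}, {}
for _ in range(20000):
    i, u = random.choice([0, 1]), random.choice([0, 1])
    cost, j = simulator(i, u)
    beta = cost + min(Q_next[(j, up)] for up in (0, 1))
    sums[(i, u)] = sums.get((i, u), 0.0) + beta
    counts[(i, u)] = counts.get((i, u), 0) + 1
Q_fit = {key: sums[key] / counts[key] for key in sums}

# One-step lookahead control at state 0, computed on-line without a model
mu0 = min((0, 1), key=lambda u: Q_fit[(0, u)])
print(round(Q_fit[(0, 1)], 3), mu0)
```

With a richer parametric architecture in place of the table, the averaging step becomes a regression (least-squares training) on the same pairs.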
A Few Remarks on Infinite Horizon Problems
[Figure: the self-learning cycle of approximate policy iteration. Guess an initial policy. Evaluate the approximate cost J̃µ(r) = Φr of the current policy µ using simulation. Generate an "improved" policy µ̄ by the lookahead minimization

µ̄(i) = arg min_{u ∈ U(i)} Σ_{j=1}^n p_ij(u) ( g(i, u, j) + α J̃(j, r) )

and repeat with µ̄ in place of µ.]
Most popular setting: Stationary finite-state system, stationary policies, discounting or a termination state

Policy iteration (PI) generates a sequence of policies:
The current policy µ is evaluated using a parametric architecture J̃µ(x, r)
An "improved" policy µ̄ is obtained by one-step lookahead using J̃µ(x, r)

The architecture is trained using simulation data generated with µ

Thus the system "observes itself" under µ and uses the data to "learn" the improved policy µ̄: "self-learning"

Exact PI converges to an optimal policy; approximate PI "converges" to within an "error zone" of the optimal, then oscillates

TD-Gammon, AlphaGo, and AlphaZero all use forms of approximate PI for training
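The self-learning cycle can be sketched end-to-end on a made-up two-state discounted problem: simulation-based (Monte Carlo rollout) policy evaluation alternating with a model-based one-step lookahead improvement. All numbers are invented for illustration.

```python
import random

random.seed(1)

# Hypothetical discounted two-state problem; p[(i, u)][j] and costs g[(i, u)]
alpha = 0.9
p = {(0, 0): [0.5, 0.5], (0, 1): [0.9, 0.1],
     (1, 0): [0.3, 0.7], (1, 1): [0.6, 0.4]}
g = {(0, 0): 1.0, (0, 1): 0.5, (1, 0): 2.0, (1, 1): 1.0}

def step(i, u):
    """Simulate one transition: return (stage cost, next state)."""
    return g[(i, u)], (0 if random.random() < p[(i, u)][0] else 1)

def evaluate(mu, rollouts=300, horizon=60):
    """Approximate policy evaluation by truncated Monte Carlo rollouts."""
    J = {}
    for i0 in (0, 1):
        total = 0.0
        for _ in range(rollouts):
            i, discount, ret = i0, 1.0, 0.0
            for _ in range(horizon):
                cost, i = step(i, mu[i])
                ret += discount * cost
                discount *= alpha
            total += ret
        J[i0] = total / rollouts
    return J

def improve(J):
    """One-step lookahead policy improvement (uses the model here)."""
    return {i: min((0, 1),
                   key=lambda u: g[(i, u)]
                   + alpha * sum(p[(i, u)][j] * J[j] for j in (0, 1)))
            for i in (0, 1)}

mu = {0: 0, 1: 0}        # guess an initial policy
for _ in range(3):       # the self-learning cycle
    mu = improve(evaluate(mu))
print(mu)
```

For this made-up problem, control 1 has both lower stage cost and better transitions, so the cycle settles on it at both states after the first improvement.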
A Few Topics we did not Cover in this Talk
Infinite horizon extensions: Approximate value and policy iteration methods, error bounds, model-based and model-free methods

Temporal difference methods: A class of methods for policy evaluation in infinite horizon problems with a rich theory, issues of variance-bias tradeoff

Sampling for exploration, in the context of policy iteration

Monte Carlo tree search, and related methods

Aggregation methods, synergism with other approximate DP methods

Approximation in policy space, actor-critic methods, policy gradient and cross-entropy methods

Special aspects of imperfect state information problems, connections with traditional control schemes

Infinite spaces optimal control, connections with aggregation schemes

Special aspects of deterministic problems: Shortest paths and their use in approximate DP

A broad view of using simulation for large-scale computations: Methods for large systems of equations and linear programs, connection to proximal algorithms
Concluding Remarks
Some words of caution
There are challenging implementation issues in all approaches, and no fool-proof methods
Problem approximation and feature selection require domain-specific knowledge
Training algorithms are not as reliable as you might think by reading the literature
Approximate PI involves oscillations
Recognizing success or failure can be a challenge!
The RL successes in game contexts are spectacular, but they have benefited from perfectly known and stable models and a small number of controls (per state)
Problems with partial state observation remain a big challenge
On the positive side
Massive computational power together with distributed computation is a source of hope
Silver lining: We can begin to address practical problems of unimaginable difficulty!
There is an exciting journey ahead!
Thank you!