Reinforcement Learning and Optimal Control: A Selective Overview
Dimitri P. Bertsekas
Laboratory for Information and Decision Systems, Massachusetts Institute of Technology
2018 CDC, December 2018
Bertsekas (M.I.T.) Reinforcement Learning 1 / 33
Reinforcement Learning (RL): A Happy Union of AI and Decision/Control Ideas

[Figure: two complementary sets of ideas, merging in the late 80s-early 90s. Decision/Control/DP side: principle of optimality, Markov decision problems, POMDP, policy iteration, value iteration. AI/RL side: learning through experience, simulation and model-free methods, feature-based representations, A*/games/heuristics.]
Historical highlights
Exact DP, optimal control (Bellman, Shannon, 1950s ...)
First major successes: backgammon programs (Tesauro, 1992, 1996)
Algorithmic progress, analysis, applications, first books (mid 90s ...)
Machine learning, big data, robotics, deep neural networks (mid 2000s ...)
AlphaGo and AlphaZero (DeepMind, 2016, 2017)
AlphaGo (2016) and AlphaZero (2017)
AlphaZero (Google DeepMind)
Plays differently!
Learned from scratch ... with 4 hours of training!
Plays much better than all chess programs
Same algorithm learned multiple games (Go, Shogi)
AlphaZero was Trained Using Self-Generated Data
[Figure: AlphaZero's self-learning loop. The current player $\mu$ generates self-play games; a deep neural network maps position features $F(i)$ to a position "value" and move "probabilities", a feature-based parametric cost approximation $\tilde J_\mu(F(i), r)$ with weight vector $r = (r_1, \ldots, r_s)$; MCTS lookahead minimization then produces an "improved" player, and the cycle repeats (self-learning / policy iteration).]
The "current" player plays games that are used to "train" an "improved" player
At a given position, the "move probabilities" and the "value" of the position are approximated by a deep neural net (NN)
Successive NNs are trained using self-generated data and a form of regression
A form of randomized policy improvement, Monte Carlo Tree Search (MCTS), generates the move probabilities
AlphaZero bears similarity to earlier works, e.g., TD-Gammon (Tesauro, 1992), but is more complicated because of the MCTS and the deep NN
The success of AlphaZero is due to a skillful implementation/integration of known ideas, and awesome computational power
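A minimal sketch of this self-play training loop, with a toy chain problem in place of a game and a linear least-squares fit standing in for the deep NN and its regression (all problem details are illustrative assumptions, not from the talk):

```python
import random

# Toy "game" (assumed): states 0..9 on a line; actions -1/+1; reward 1 on
# reaching state 9. The "player" is one-step lookahead against a learned
# linear value approximation value(s) = r[0] + r[1] * s/9.
N_STATES, ACTIONS, GAMMA = 10, (-1, 1), 0.9

def step(s, a):
    s2 = min(max(s + a, 0), N_STATES - 1)
    return s2, (1.0 if s2 == N_STATES - 1 else 0.0)

def value(s, r):
    return r[0] + r[1] * s / (N_STATES - 1)

def improved_action(s, r):
    # greedy policy improvement step (AlphaZero uses randomized MCTS here;
    # this sketch uses plain one-step lookahead)
    return max(ACTIONS, key=lambda a: step(s, a)[1] + GAMMA * value(step(s, a)[0], r))

def policy_iteration(rounds=5, episodes=40, horizon=25):
    r = [0.0, 0.0]
    for _ in range(rounds):
        data = []                              # (state, return-to-go) pairs
        for _ in range(episodes):              # "self-play" with current player
            s = random.randrange(N_STATES)
            path, rews = [], []
            for _ in range(horizon):
                path.append(s)
                s, rew = step(s, improved_action(s, r))
                rews.append(rew)
                if rew > 0:
                    break
            G = 0.0
            for x, rew in zip(reversed(path), reversed(rews)):
                G = rew + GAMMA * G
                data.append((x, G))
        # "train the improved player": least-squares regression of values
        xs = [s / (N_STATES - 1) for s, _ in data]
        gs = [g for _, g in data]
        mx, mg = sum(xs) / len(xs), sum(gs) / len(gs)
        var = sum((x - mx) ** 2 for x in xs)
        slope = sum((x - mx) * (g - mg) for x, g in zip(xs, gs)) / var if var else 0.0
        r = [mg - slope * mx, slope]
    return r

random.seed(0)
r = policy_iteration()
```

After a few rounds the fitted values increase toward the goal state, so the lookahead player heads right from any state; the same alternation of self-generated data and regression is the core of the AlphaZero scheme.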
Approximate DP/RL Methodology is now Ambitious and Universal
Exact DP applies (in principle) to a very broad range of optimization problems:
Deterministic ↔ Stochastic
Combinatorial optimization ↔ Optimal control w/ infinite state/control spaces
One decision maker ↔ Two-player games
... BUT is plagued by the curse of dimensionality and the need for a math model
Approximate DP/RL overcomes the difficulties of exact DP by:
Approximation (use neural nets and other architectures to reduce dimension)
Simulation (use a computer model in place of a math model)
State of the art:
Broadly applicable methodology: can address a broad range of challenging problems; deterministic-stochastic-dynamic, discrete-continuous, games, etc.
There are no methods that are guaranteed to work for all or even most problems
There are enough methods to try with a reasonable chance of success for most types of optimization problems
Role of the theory: guide the art, delineate the sound ideas
Approximation in Value Space
Central idea: lookahead with an approximate cost
Compute an approximation $\tilde J$ to the optimal cost function $J^*$
At the current state, apply the control that attains the minimum in
Current Stage Cost + $\tilde J$(Next State)
Multistep lookahead extension:
At the current state, solve an ℓ-step DP problem using terminal cost $\tilde J$
Apply the first control in the optimal policy for the ℓ-step problem
Example approaches to compute $\tilde J$:
Problem approximation: use as $\tilde J$ the optimal cost function of a simpler problem
Rollout and model predictive control: use a single policy iteration, with cost evaluated on-line by simulation or limited optimization
Self-learning/approximate policy iteration (API): use as $\tilde J$ an approximation to the cost function of the final policy obtained through a policy iteration process
Role of neural networks: "learn" the cost functions of policies in the context of API; "learn" policies obtained by value space approximation
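The one-step lookahead rule can be sketched in a few lines; the dynamics, costs, and cost-to-go approximation below are illustrative assumptions (a toy inventory problem), not from the talk:

```python
# One-step lookahead: at state x, apply the control minimizing
#   E[ g(x, u, w) + J_tilde(f(x, u, w)) ]
# Toy inventory example (all numbers assumed): x < 0 means backlog.

CONTROLS = (0, 1, 2)                       # units ordered
DISTURBANCES = ((0, 0.5), (1, 0.5))        # (demand w, probability)

def f(x, u, w):                            # next stock level
    return x + u - w

def g(x, u, w):                            # ordering + holding/backlog cost
    return 2.0 * u + abs(f(x, u, w))

def J_tilde(x):                            # cost-to-go approximation (assumed)
    return 1.5 * abs(x)

def lookahead_control(x):
    def q(u):                              # expected stage cost + approximation
        return sum(p * (g(x, u, w) + J_tilde(f(x, u, w)))
                   for w, p in DISTURBANCES)
    return min(CONTROLS, key=q)
```

With these numbers the rule orders nothing when stock is ample and reorders when backlogged; swapping in a different $\tilde J$ changes the behavior without touching the lookahead logic, which is the point of the method.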
Aims and References of this Talk
The purpose of this talk: to selectively review some of the methods, and bring out some of the AI-DP connections
References:
Quite a few exact DP books (1950s-present, starting with Bellman; my latest book "Abstract DP" came out earlier this year)
Quite a few DP/approximate DP/RL/neural nets books (1996-present):
Bertsekas and Tsitsiklis, Neuro-Dynamic Programming, 1996
Sutton and Barto, Reinforcement Learning, 1998 (new edition 2019, draft on-line)
NEW DRAFT BOOK: Bertsekas, Reinforcement Learning and Optimal Control, 2019, on-line
Many surveys on all aspects of the subject; Tesauro's papers on computer backgammon, and Silver et al.'s papers on AlphaZero
Terminology in RL/AI and DP/Control
RL uses Max/Value, DP uses Min/Cost:
Reward of a stage = (opposite of) cost of a stage
State value = (opposite of) state cost
Value (or state-value) function = (opposite of) cost function
Controlled system terminology:
Agent = decision maker or controller
Action = control
Environment = dynamic system
Methods terminology:
Learning = solving a DP-related problem using simulation
Self-learning (or self-play in the context of games) = solving a DP problem using simulation-based policy iteration
Planning vs learning distinction = solving a DP problem with model-based vs model-free simulation
Outline
1 Approximation in Value Space
2 Problem Approximation
3 Rollout and Model Predictive Control
4 Parametric Approximation - Neural Networks
5 Neural Networks and Approximation in Value Space
6 Model-free DP in Terms of Q-Factors
7 Policy Iteration - Self-Learning
Finite Horizon Problem - Exact DP

[Figure: closed loop — controller applies $u_k = \mu_k(x_k)$ to the system $x_{k+1} = f_k(x_k, u_k, w_k)$, driven by disturbance $w_k$]

System: $x_{k+1} = f_k(x_k, u_k, w_k)$, $k = 0, 1, \ldots, N-1$,
where $x_k$: state, $u_k$: control, $w_k$: random disturbance

Cost function:
$E\Big\{ g_N(x_N) + \sum_{k=0}^{N-1} g_k(x_k, u_k, w_k) \Big\}$

Perfect state information: $u_k$ is applied with (exact) knowledge of $x_k$
Optimization over feedback policies $\{\mu_0, \ldots, \mu_{N-1}\}$: rules that specify the control $\mu_k(x_k)$ to apply at each possible state $x_k$ that can occur
The DP Algorithm and Approximation in Value Space

Go backwards, $k = N-1, \ldots, 0$, using
$J_N(x_N) = g_N(x_N)$
$J_k(x_k) = \min_{u_k} E_{w_k}\big\{ g_k(x_k, u_k, w_k) + J_{k+1}\big(f_k(x_k, u_k, w_k)\big) \big\}$
$J_k(x_k)$: optimal cost-to-go starting from state $x_k$

Approximate DP is motivated by the ENORMOUS computational demands of exact DP
Approximation in value space: use an approximate cost-to-go function $\tilde J_{k+1}$
$\mu_k(x_k) \in \arg\min_{u_k} E_{w_k}\big\{ g_k(x_k, u_k, w_k) + \tilde J_{k+1}\big(f_k(x_k, u_k, w_k)\big) \big\}$

There is also a multistep lookahead version:
At state $x_k$ solve an ℓ-step DP problem with terminal cost function approximation $\tilde J_{k+\ell}$. Use the first control in the optimal ℓ-step sequence.
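The backward recursion can be written out directly for a small finite-state problem; the states, dynamics, and costs below are illustrative assumptions:

```python
# Exact finite-horizon DP: tabulate J_k(x) backwards from J_N = g_N.
# Tiny assumed problem: states 0..4, controls -1/0/+1, quadratic stage costs.

N = 3                                          # horizon
STATES = range(5)
CONTROLS = (-1, 0, 1)
DISTURBANCES = ((0, 0.8), (1, 0.2))            # (w, probability)

def f(x, u, w):                                # next state, clipped to 0..4
    return min(max(x + u + w, 0), 4)

def g(x, u, w):                                # stage cost
    return x * x + u * u

def g_N(x):                                    # terminal cost
    return 10.0 * x

def dp():
    J = {x: g_N(x) for x in STATES}            # J_N
    policy = []
    for k in reversed(range(N)):
        Jk, mu = {}, {}
        for x in STATES:
            # Q-factor of each control: expected stage cost + cost-to-go
            q = {u: sum(p * (g(x, u, w) + J[f(x, u, w)])
                        for w, p in DISTURBANCES) for u in CONTROLS}
            mu[x] = min(q, key=q.get)
            Jk[x] = q[mu[x]]
        J, policy = Jk, [mu] + policy
    return J, policy

J0, policy = dp()
```

The per-stage tables `J0` and `policy` are exactly what becomes intractable when the state space is large, which is the motivation for replacing $J_{k+1}$ by an approximation $\tilde J_{k+1}$.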
Approximation in Value Space Methods
One-step case at state xk :min
uk,µk+1,...,µk+ℓ−1
E
!gk(xk, uk, wk) +
k+ℓ−1"
m=k+1
gk
#xm, µm(xm), wm
$+ Jk+ℓ(xk+ℓ)
%
First ℓ Steps “Future”Nonlinear Ay(x) + b φ1(x, v) φ2(x, v) φm(x, v) r x Initial
Selective Depth Lookahead Tree σ(ξ) ξ 1 0 -1 Encoding y(x)
Linear Layer Parameter v = (A, b) Sigmoidal Layer Linear WeightingCost Approximation r′φ(x, v)
Feature Extraction Features: Material Balance, uk = µdk
#xk(Ik)
$
Mobility, Safety, etc Weighting of Features Score Position EvaluatorStates xk+1 States xk+2
State xk Feature Vector φk(xk) Approximator r′kφk(xk)
x0 xk im−1 im . . . (0, 0) (N, −N) (N, 0) i (N, N) −N 0 g(i) I N − 2N i
s i1 im−1 im . . . (0, 0) (N, −N) (N, 0) i (N, N) −N 0 g(i) I N − 2 Ni
u1k u2
k u3k u4
k Selective Depth Adaptive Simulation Tree Projections ofLeafs of the Tree
p(j1) p(j2) p(j3) p(j4)
Neighbors of im Projections of Neighbors of im
State x Feature Vector φ(x) Approximator φ(x)′r
ℓ Stages Riccati Equation Iterates P P0 P1 P2 γ2 − 1 γ2PP+1
Cost of Period k Stock Ordered at Period k Inventory Systemr(uk) + cuk xk+1 = xk + u + k − wk
Stock at Period k +1 Initial State A C AB AC CA CD ABC
1
Approximations: Replace E{·} with nominal values (certainty equiv-alent control)
minuk,µk+1,...,µk+ℓ−1
E
!gk(xk, uk, wk) +
k+ℓ−1"
m=k+1
gk
#xm, µm(xm), wm
$+ Jk+ℓ(xk+ℓ)
%
First ℓ Steps “Future”Nonlinear Ay(x) + b φ1(x, v) φ2(x, v) φm(x, v) r x Initial
Selective Depth Lookahead Tree σ(ξ) ξ 1 0 -1 Encoding y(x)
Linear Layer Parameter v = (A, b) Sigmoidal Layer Linear WeightingCost Approximation r′φ(x, v)
Feature Extraction Features: Material Balance, uk = µdk
#xk(Ik)
$
Mobility, Safety, etc Weighting of Features Score Position EvaluatorStates xk+1 States xk+2
State xk Feature Vector φk(xk) Approximator r′kφk(xk)
x0 xk im−1 im . . . (0, 0) (N, −N) (N, 0) i (N, N) −N 0 g(i) I N − 2N i
s i1 im−1 im . . . (0, 0) (N, −N) (N, 0) i (N, N) −N 0 g(i) I N − 2 Ni
u1k u2
k u3k u4
k Selective Depth Adaptive Simulation Tree Projections ofLeafs of the Tree
p(j1) p(j2) p(j3) p(j4)
Neighbors of im Projections of Neighbors of im
State x Feature Vector φ(x) Approximator φ(x)′r
ℓ Stages Riccati Equation Iterates P P0 P1 P2 γ2 − 1 γ2PP+1
1
Approximations: Computation of Jk+ℓ:
DP minimization Replace E{·} with nominal values
(certainty equivalent control)
Limited simulation (Monte Carlo tree search)
Simple choices Parametric approximation Problem approximation
Rollout
minuk,µk+1,...,µk+ℓ−1
E
!gk(xk, uk, wk) +
k+ℓ−1"
m=k+1
gk
#xm, µm(xm), wm
$+ Jk+ℓ(xk+ℓ)
%
First ℓ Steps “Future”Nonlinear Ay(x) + b φ1(x, v) φ2(x, v) φm(x, v) r x Initial
Selective Depth Lookahead Tree σ(ξ) ξ 1 0 -1 Encoding y(x)
Linear Layer Parameter v = (A, b) Sigmoidal Layer Linear WeightingCost Approximation r′φ(x, v)
Feature Extraction Features: Material Balance, uk = µdk
#xk(Ik)
$
Mobility, Safety, etc Weighting of Features Score Position EvaluatorStates xk+1 States xk+2
State xk Feature Vector φk(xk) Approximator r′kφk(xk)
x0 xk im−1 im . . . (0, 0) (N, −N) (N, 0) i (N, N) −N 0 g(i) I N − 2N i
s i1 im−1 im . . . (0, 0) (N, −N) (N, 0) i (N, N) −N 0 g(i) I N − 2 Ni
u1k u2
k u3k u4
k Selective Depth Adaptive Simulation Tree Projections ofLeafs of the Tree
p(j1) p(j2) p(j3) p(j4)
1
Approximations: Computation of Jk+ℓ: (Could be approximate)
DP minimization Replace E{·} with nominal values
(certainty equivalent control)
Limited simulation (Monte Carlo tree search)
Simple choices Parametric approximation Problem approximation
Rollout
minuk
E!gk(xk, uk, wk) + Jk+1(xk+ℓ)
"
minuk,µk+1,...,µk+ℓ−1
E
#gk(xk, uk, wk) +
k+ℓ−1$
m=k+1
gk
%xm, µm(xm), wm
&+ Jk+ℓ(xk+ℓ)
'
First ℓ Steps “Future” First StepNonlinear Ay(x) + b φ1(x, v) φ2(x, v) φm(x, v) r x Initial
Selective Depth Lookahead Tree σ(ξ) ξ 1 0 -1 Encoding y(x)
Linear Layer Parameter v = (A, b) Sigmoidal Layer Linear WeightingCost Approximation r′φ(x, v)
Feature Extraction Features: Material Balance, uk = µdk
%xk(Ik)
&
Mobility, Safety, etc Weighting of Features Score Position EvaluatorStates xk+1 States xk+2
State xk Feature Vector φk(xk) Approximator r′kφk(xk)
x0 xk im−1 im . . . (0, 0) (N, −N) (N, 0) i (N, N) −N 0 g(i) I N − 2N i
s i1 im−1 im . . . (0, 0) (N, −N) (N, 0) i (N, N) −N 0 g(i) I N − 2 Ni
u1k u2
k u3k u4
k Selective Depth Adaptive Simulation Tree Projections ofLeafs of the Tree
1
Approximations: Computation of Jk+ℓ: (Could be approximate)
DP minimization Replace E{·} with nominal values
(certainty equivalent control) Computation of Jk+1:
Limited simulation (Monte Carlo tree search)
Simple choices Parametric approximation Problem approximation
Rollout
minuk
E!
gk(xk, uk, wk) + Jk+1(xk+1)"
minuk,µk+1,...,µk+ℓ−1
E
#gk(xk, uk, wk) +
k+ℓ−1$
m=k+1
gk
%xm, µm(xm), wm
&+ Jk+ℓ(xk+ℓ)
'
First ℓ Steps “Future” First StepNonlinear Ay(x) + b φ1(x, v) φ2(x, v) φm(x, v) r x Initial
Selective Depth Lookahead Tree σ(ξ) ξ 1 0 -1 Encoding y(x)
Linear Layer Parameter v = (A, b) Sigmoidal Layer Linear WeightingCost Approximation r′φ(x, v)
Feature Extraction Features: Material Balance, uk = µdk
%xk(Ik)
&
Mobility, Safety, etc Weighting of Features Score Position EvaluatorStates xk+1 States xk+2
State xk Feature Vector φk(xk) Approximator r′kφk(xk)
x0 xk im−1 im . . . (0, 0) (N, −N) (N, 0) i (N, N) −N 0 g(i) I N − 2N i
s i1 im−1 im . . . (0, 0) (N, −N) (N, 0) i (N, N) −N 0 g(i) I N − 2 Ni
u1k u2
k u3k u4
k Selective Depth Adaptive Simulation Tree Projections ofLeafs of the Tree
1
Approximations: Computation of Jk+ℓ: (Could be approximate)
DP minimization Replace E{·} with nominal values
(certainty equivalent control) Computation of Jk+1:
Limited simulation (Monte Carlo tree search)
Simple choices Parametric approximation Problem approximation
Rollout
minuk
E!
gk(xk, uk, wk) + Jk+1(xk+1)"
minuk,µk+1,...,µk+ℓ−1
E
#gk(xk, uk, wk) +
k+ℓ−1$
m=k+1
gk
%xm, µm(xm), wm
&+ Jk+ℓ(xk+ℓ)
'
First ℓ Steps “Future” First StepNonlinear Ay(x) + b φ1(x, v) φ2(x, v) φm(x, v) r x Initial
Selective Depth Lookahead Tree σ(ξ) ξ 1 0 -1 Encoding y(x)
Linear Layer Parameter v = (A, b) Sigmoidal Layer Linear WeightingCost Approximation r′φ(x, v)
Feature Extraction Features: Material Balance, uk = µdk
%xk(Ik)
&
Mobility, Safety, etc Weighting of Features Score Position EvaluatorStates xk+1 States xk+2
State xk Feature Vector φk(xk) Approximator r′kφk(xk)
x0 xk im−1 im . . . (0, 0) (N, −N) (N, 0) i (N, N) −N 0 g(i) I N − 2N i
s i1 im−1 im . . . (0, 0) (N, −N) (N, 0) i (N, N) −N 0 g(i) I N − 2 Ni
u1k u2
k u3k u4
k Selective Depth Adaptive Simulation Tree Projections ofLeafs of the Tree
1
Approximations:
In the lookahead minimization: approximate minimization; replacement of E{·} with nominal values (certainty equivalent control); limited simulation (Monte Carlo tree search).
Computation of \tilde J_{k+\ell} (which could itself be approximate): simple choices; parametric approximation; problem approximation; rollout; model predictive control.
[Figure residue from later slides: aggregation, adaptive simulation, Monte Carlo tree search, certainty equivalence; shortest-path node test d_i + a_{ij} < UPPER − h_j; a two-state example with J(1) = min{c, a + J(2)}, J(2) = b + J(1); proper vs. improper policies and the regions W_p, W_p^+.]
\tilde J_\mu(F(i), r): feature-based parametric architecture, where F(i) = (F_1(i), \ldots, F_s(i)) is a vector of features of state i and r is a vector of weights.

If \tilde J_\mu(F(i), r) = \sum_{\ell=1}^{s} F_\ell(i) r_\ell, it is a linear feature-based architecture (r_1, \ldots, r_s: scalar weights).

Aggregation pipeline: generate features F(i) with a neural network or other scheme (possibly including "handcrafted" features); form the aggregate states I_1, \ldots, I_q; choose the aggregation and disaggregation probabilities; formulate the aggregate problem; generate an "improved" policy \tilde\mu by "solving" the aggregate problem.

[Additional residue: position "value" and move "probabilities"; aggregate costs r^*_\ell; the regions W_p and W^+ = {J | J \ge J^+, J(t) = 0}; VI converges to J^+ from within W^+, and to J_p from within W_p when g(x_k, u_k) \ge 0.]
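Hard aggregation, the simplest aggregation scheme, can be sketched concretely: states in the same aggregate group share a single scalar value r_g, so the approximation is piecewise constant over the groups. The groups and the sampled costs below are made-up illustrations:

```python
from collections import defaultdict

# Sketch of hard aggregation as a piecewise constant cost approximation.

group = {0: "low", 1: "low", 2: "mid", 3: "mid", 4: "high", 5: "high"}

# Sampled (state, cost) pairs, e.g., from simulation of some policy (made up).
samples = [(0, 1.0), (1, 1.4), (2, 3.1), (3, 2.9), (4, 5.2), (5, 4.8)]

def fit_aggregate_costs(samples, group):
    """Average the sampled costs within each aggregate group."""
    sums, counts = defaultdict(float), defaultdict(int)
    for i, cost in samples:
        sums[group[i]] += cost
        counts[group[i]] += 1
    return {grp: sums[grp] / counts[grp] for grp in sums}

r = fit_aggregate_costs(samples, group)

def J_tilde(i):                        # piecewise constant over the groups
    return r[group[i]]
```

All states in a group get the same approximate cost (here, the group average), which is exactly the piecewise constant character of hard aggregation.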
Multistep case at state x_k (first ℓ steps, then the "future"):

\min_{u_k, \mu_{k+1}, \ldots, \mu_{k+\ell-1}} E\Bigl\{ g_k(x_k, u_k, w_k) + \sum_{m=k+1}^{k+\ell-1} g_m\bigl(x_m, \mu_m(x_m), w_m\bigr) + \tilde J_{k+\ell}(x_{k+\ell}) \Bigr\}

[Figure residue: selective depth lookahead/adaptive simulation tree; linear feature approximator φ(x)'r; Riccati equation iterates P_0, P_1, P_2 over ℓ stages; inventory system x_{k+1} = x_k + u_k − w_k with period-k cost r(u_k) + c u_k.]
[Figure residue: tail problem approximation with constraint relaxation (U, U_1, U_2); at the current state x_k: MCTS lookahead minimization, rollout, and cost-to-go approximation; rollout = simulation with a fixed policy, with parametric approximation at the end and Monte Carlo tree search; subspace S = {Φr | r ∈ ℜ^s}; multistep extrapolation formula T^{(λ)} = P^{(c)} · T = T · P^{(c)}.]
Problem Approximation: Simplify the Tail Problem and Solve it Exactly

Use as cost-to-go approximation \tilde J_{k+1} the exact cost-to-go of a simpler problem.

Many problem-dependent possibilities:

Probabilistic approximation
  Certainty equivalence: replace stochastic quantities by deterministic ones (makes the lookahead minimization deterministic)
  Approximate expected values by limited simulation
  Partial versions of certainty equivalence

Enforced decomposition of coupled subsystems
  One-subsystem-at-a-time optimization
  Constraint decomposition
  Lagrangian relaxation

Aggregation: group states together and view the groups as aggregate states
  Hard aggregation: \tilde J_{k+1} is a piecewise constant approximation to J_{k+1}
  Feature-based aggregation: the aggregate states are defined by "features" of the original states
  Biased hard aggregation: \tilde J_{k+1} is a piecewise constant local correction to some other approximation, e.g., one provided by a neural net
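Certainty equivalence can be illustrated concretely: plugging a nominal (mean) disturbance value into the lookahead makes the minimization deterministic, at the price of possibly selecting a different control than the exact expectation would. The model data below are illustrative assumptions:

```python
# Sketch of certainty equivalent control vs. exact one-step lookahead.

W = [(0, 0.3), (1, 0.5), (2, 0.2)]            # demand values, probabilities
w_nominal = sum(w * p for w, p in W)          # mean demand = 0.9

def f(x, u, w): return x + u - w              # next stock (can go negative)
def g(x, u, w): return u + abs(x + u - w)     # order cost + imbalance cost
def J_tilde(x): return 0.5 * abs(x)           # crude cost-to-go approximation

def cec_control(x, controls):
    """Deterministic lookahead with the nominal disturbance plugged in."""
    return min(controls,
               key=lambda u: g(x, u, w_nominal) + J_tilde(f(x, u, w_nominal)))

def exact_control(x, controls):
    """One-step lookahead with the full expectation over the disturbance."""
    return min(controls,
               key=lambda u: sum(p * (g(x, u, w) + J_tilde(f(x, u, w)))
                                 for w, p in W))
```

With these made-up numbers the two rules disagree at x = 0: the certainty equivalent rule orders one unit while the exact lookahead orders nothing, which is the kind of discrepancy the simplification can introduce.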
Rollout: On-Line Simulation-Based Approximation in Value Space
[Figure residue: selective depth lookahead tree; chess position evaluator with weighted features (material balance, mobility, safety, etc.); schedule-completion example over sequences A, C, AB, AC, ...; parking problem with spaces c(0), \ldots, c(N − 1) over N stages; machine repair example (repair / do not repair) with transition probabilities p_{11}, p_{12}, \ldots]
[Figure residue from other slides: approximate policy iteration with error zone of width (ε + 2αδ)/(1 − α)^2; Q-factor evaluation of the current policy µ and improvement \tilde\mu(i) ∈ arg min_{u∈U(i)} \tilde Q_\mu(i, u, r); variable-length and selective-depth rollout with a base heuristic and terminal cost function; belief-state controller u_k = \mu_k(p_k); shortest-path and two-state examples.]
[Figure: rollout at state x_k — lookahead over states x_{k+1}, x_{k+2}; adaptive simulation of a heuristic/suboptimal base policy; terminal cost approximation; an "improved" policy is obtained from the current policy µ by lookahead minimization. Remaining labels duplicate the feature-based architecture and aggregation slides.]
The base policy can be any suboptimal policy (obtained by another method)
One-step or multistep lookahead; exact minimization or a "randomized form of lookahead" that involves "adaptive" simulation and Monte Carlo tree search
With or without terminal cost approximation (obtained by another method)
Some forms of model predictive control can be viewed as special cases (the base policy is a short-term deterministic optimization)
Important theoretical fact: with exact lookahead and no terminal cost approximation, the rollout policy improves over the base policy
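A minimal sketch of rollout under stated assumptions (hypothetical inventory model, an arbitrary heuristic base policy): the Q-factor of each candidate control is estimated by Monte Carlo simulation of the base policy over a finite horizon, and the rollout policy picks the minimizing control.

```python
import random

# Sketch of one-step rollout with a heuristic base policy (illustrative model).

random.seed(0)
DEMANDS = [0, 1, 1, 2]                 # demand samples, drawn uniformly

def f(x, u, w): return max(x + u - w, 0)
def g(x, u, w): return u + abs(x + u - w)

def base_policy(x):                    # heuristic: order up to stock level 1
    return max(1 - x, 0)

def rollout_q(x, u, horizon=10, num_sims=200):
    """Monte Carlo estimate of the Q-factor of control u at state x."""
    total = 0.0
    for _ in range(num_sims):
        w = random.choice(DEMANDS)
        cost, y = g(x, u, w), f(x, u, w)
        for _ in range(horizon - 1):   # continue with the base policy
            w = random.choice(DEMANDS)
            v = base_policy(y)
            cost += g(y, v, w)
            y = f(y, v, w)
        total += cost
    return total / num_sims

def rollout_control(x, controls):
    """The rollout policy: minimize the simulated Q-factor."""
    return min(controls, key=lambda u: rollout_q(x, u))
```

In line with the cost improvement property stated above, the rollout policy can be expected to perform at least as well as the base policy when the lookahead and simulation are exact; with finite sampling, as here, the improvement holds only approximately.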
Example of Rollout: Backgammon (Tesauro, 1996)
[Figure: from the current position, each possible move is evaluated by the average score of Monte Carlo simulations that continue the game with the base policy.]
The base policy was a backgammon player developed by a different RL method [TD(λ) trained with a neural network]; it was also used for terminal cost approximation
The best backgammon players are based on rollout ... but are too slow for real-time play (Monte Carlo simulation takes too long)
AlphaGo has a similar structure: the base policy and terminal cost approximation are obtained with a deep neural net. In AlphaZero the rollout-with-base-policy part was dropped (long lookahead suffices)
Parametric Approximation in Value Space
[Figure: a neural network as cost approximator for the lookahead scheme. The state x is encoded as y(x), passed through a linear layer with parameter v = (A, b) and a sigmoidal layer σ(ξ) to produce features φ_1(x, v), \ldots, φ_m(x, v), which are linearly weighted to give the cost approximation r'φ(x, v). Also shown: a chess position evaluator with feature extraction (material balance, mobility, safety, etc.) and approximator r'_k φ_k(x_k).]
minuk,µk+1,...,µk+ℓ−1
E
!gk(xk, uk, wk) +
k+ℓ−1"
m=k+1
gk
#xm, µm(xm), wm
$+ Jk+ℓ(xk+ℓ)
%
First ℓ Steps “Future”Nonlinear Ay(x) + b φ1(x, v) φ2(x, v) φm(x, v) r x Initial
Selective Depth Lookahead Tree σ(ξ) ξ 1 0 -1 Encoding y(x)
Linear Layer Parameter v = (A, b) Sigmoidal Layer Linear WeightingCost Approximation r′φ(x, v)
Feature Extraction Features: Material Balance, uk = µdk
#xk(Ik)
$
Mobility, Safety, etc Weighting of Features Score Position EvaluatorStates xk+1 States xk+2
State xk Feature Vector φk(xk) Approximator r′kφk(xk)
x0 xk im−1 im . . . (0, 0) (N, −N) (N, 0) i (N, N) −N 0 g(i) I N − 2N i
s i1 im−1 im . . . (0, 0) (N, −N) (N, 0) i (N, N) −N 0 g(i) I N − 2 Ni
u1k u2
k u3k u4
k Selective Depth Adaptive Simulation Tree Projections ofLeafs of the Tree
p(j1) p(j2) p(j3) p(j4)
Neighbors of im Projections of Neighbors of im
State x Feature Vector φ(x) Approximator φ(x)′r
ℓ Stages Riccati Equation Iterates P P0 P1 P2 γ2 − 1 γ2PP+1
Cost of Period k Stock Ordered at Period k Inventory Systemr(uk) + cuk xk+1 = xk + u + k − wk
Stock at Period k +1 Initial State A C AB AC CA CD ABC
1
minuk,µk+1,...,µk+ℓ−1
E
!gk(xk, uk, wk) +
k+ℓ−1"
m=k+1
gk
#xm, µm(xm), wm
$+ Jk+ℓ(xk+ℓ)
%
First ℓ Steps “Future”Nonlinear Ay(x) + b φ1(x, v) φ2(x, v) φm(x, v) r x Initial
Selective Depth Lookahead Tree σ(ξ) ξ 1 0 -1 Encoding y(x)
Linear Layer Parameter v = (A, b) Sigmoidal Layer Linear WeightingCost Approximation r′φ(x, v)
Feature Extraction Features: Material Balance, uk = µdk
#xk(Ik)
$
Mobility, Safety, etc Weighting of Features Score Position EvaluatorStates xk+1 States xk+2
State xk Feature Vector φk(xk) Approximator r′kφk(xk)
x0 xk im−1 im . . . (0, 0) (N, −N) (N, 0) i (N, N) −N 0 g(i) I N − 2N i
s i1 im−1 im . . . (0, 0) (N, −N) (N, 0) i (N, N) −N 0 g(i) I N − 2 Ni
u1k u2
k u3k u4
k Selective Depth Adaptive Simulation Tree Projections ofLeafs of the Tree
p(j1) p(j2) p(j3) p(j4)
Neighbors of im Projections of Neighbors of im
State x Feature Vector φ(x) Approximator φ(x)′r
ℓ Stages Riccati Equation Iterates P P0 P1 P2 γ2 − 1 γ2PP+1
Cost of Period k Stock Ordered at Period k Inventory Systemr(uk) + cuk xk+1 = xk + u + k − wk
Stock at Period k +1 Initial State A C AB AC CA CD ABC
1
Approximations: Computation of Jk+ℓ:
DP minimization Replace E{·} with nominal values
(certainty equivalent control)
Limited simulation (Monte Carlo tree search)
Simple choices Parametric approximation Problem approximation
Rollout
minuk,µk+1,...,µk+ℓ−1
E
!gk(xk, uk, wk) +
k+ℓ−1"
m=k+1
gk
#xm, µm(xm), wm
$+ Jk+ℓ(xk+ℓ)
%
First ℓ Steps “Future”Nonlinear Ay(x) + b φ1(x, v) φ2(x, v) φm(x, v) r x Initial
Selective Depth Lookahead Tree σ(ξ) ξ 1 0 -1 Encoding y(x)
Linear Layer Parameter v = (A, b) Sigmoidal Layer Linear WeightingCost Approximation r′φ(x, v)
Feature Extraction Features: Material Balance, uk = µdk
#xk(Ik)
$
Mobility, Safety, etc Weighting of Features Score Position EvaluatorStates xk+1 States xk+2
State xk Feature Vector φk(xk) Approximator r′kφk(xk)
x0 xk im−1 im . . . (0, 0) (N, −N) (N, 0) i (N, N) −N 0 g(i) I N − 2N i
s i1 im−1 im . . . (0, 0) (N, −N) (N, 0) i (N, N) −N 0 g(i) I N − 2 Ni
u1k u2
k u3k u4
k Selective Depth Adaptive Simulation Tree Projections ofLeafs of the Tree
p(j1) p(j2) p(j3) p(j4)
1
[Figure: at the current state x_k — MCTS lookahead minimization, rollout, tail problem approximation, constraint relaxation, and terminal cost-to-go approximation. The ℓ-step lookahead minimization is

min over (u_k, μ_{k+1}, ..., μ_{k+ℓ−1}) of E{ g_k(x_k, u_k, w_k) + Σ_{m=k+1}^{k+ℓ−1} g_m(x_m, μ_m(x_m), w_m) + J̃_{k+ℓ}(x_{k+ℓ}) }

Rollout: simulation with a fixed policy, parametric approximation at the end, Monte Carlo tree search. Multistep extrapolation formula: T^(λ) = P^(c) · T = T · P^(c).]
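The lookahead-plus-rollout scheme above can be sketched in code. This is a minimal illustration on made-up scalar dynamics, stage cost, and base policy (all names and numbers are assumptions, not the slide's example): one-step lookahead with the cost-to-go approximated by Monte Carlo rollout of a fixed policy.

```python
import random

# Sketch of one-step lookahead with rollout (simulation of a fixed base
# policy) as the cost-to-go approximation. Dynamics, cost, and policy
# below are illustrative, not from the slides.

random.seed(0)

def f(x, u, w):          # system: x_{k+1} = f(x_k, u_k, w_k)
    return x + u + w

def g(x, u, w):          # stage cost
    return x * x + u * u

def base_policy(x):      # fixed heuristic policy used for rollout
    return -0.5 * x

def rollout_cost(x, horizon=20, n_sims=10):
    """Monte Carlo estimate of the base policy's cost-to-go from x."""
    total = 0.0
    for _ in range(n_sims):
        y, c = x, 0.0
        for _ in range(horizon):
            u = base_policy(y)
            w = random.gauss(0.0, 0.1)
            c += g(y, u, w)
            y = f(y, u, w)
        total += c
    return total / n_sims

def lookahead_control(x, controls=(-1.0, -0.5, 0.0, 0.5, 1.0), n_w=5):
    """One-step lookahead: minimize over u an estimate of
    E{ g(x,u,w) + J~(f(x,u,w)) }, with J~ given by rollout."""
    ws = [random.gauss(0.0, 0.1) for _ in range(n_w)]
    def q(u):
        return sum(g(x, u, w) + rollout_cost(f(x, u, w)) for w in ws) / n_w
    return min(controls, key=q)

u0 = lookahead_control(2.0)
```

From x = 2 the lookahead pushes the state toward the origin, so the most negative available control wins the minimization.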
J̃_k comes from a class of functions J̃_k(x_k, r_k), where r_k is a tunable parameter vector
Feature-based architectures: The linear case
[Figure: feature-based linear architecture — State i → Feature Extraction Mapping → Feature Vector φ(i) → Linear Cost Approximator φ(i)′r]
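The linear feature-based architecture (state → feature extraction → linear weighting) can be sketched as follows; the features, sample states, and target costs here are illustrative, and the weights r are fit by plain least squares.

```python
import numpy as np

# Linear feature-based cost approximator: J~(i, r) = phi(i)' r,
# where phi is a hand-crafted feature extraction mapping (illustrative).

def feature_extraction(i):
    """Map a state i (here a scalar) to a feature vector phi(i)."""
    return np.array([1.0, i, i * i])

def linear_cost_approximator(i, r):
    return feature_extraction(i) @ r

# Fit the weights r by least squares on sampled (state, target-cost) pairs.
states = np.array([-2.0, -1.0, 0.0, 1.0, 2.0])
targets = states ** 2 + 1.0            # pretend these are sampled costs
Phi = np.vstack([feature_extraction(i) for i in states])
r, *_ = np.linalg.lstsq(Phi, targets, rcond=None)

approx_at_1_5 = linear_cost_approximator(1.5, r)
```

Since the targets lie in the span of the features, the fit is exact here; in general the architecture can only represent costs that are (approximately) linear in the chosen features.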
[Figure: selective depth adaptive simulation (lookahead) tree, as in computer chess — feature extraction (material balance, mobility, safety, etc.), weighting of features, and a position evaluator producing a score; approximator r′_k φ_k(x_k)]
Bertsekas (M.I.T.) Reinforcement Learning 20 / 33
Training with Fitted Value Iteration
This is just DP with intermediate approximation at each step
Start with J̃_N = g_N and sequentially train going backwards, until k = 0

Given J̃_{k+1}, we construct a number of samples (x_k^s, β_k^s), s = 1, ..., q, where

β_k^s = min_u E{ g_k(x_k^s, u, w_k) + J̃_{k+1}(f_k(x_k^s, u, w_k), r_{k+1}) }, s = 1, ..., q

We "train" J̃_k on the set of samples (x_k^s, β_k^s), s = 1, ..., q

Training by least squares/regression
We minimize over r_k

Σ_{s=1}^q ( J̃_k(x_k^s, r_k) − β_k^s )² + γ ‖r_k − r̄‖²

where r̄ is an initial guess for r_k and γ > 0 is a regularization parameter
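The backward training pass described above can be sketched as follows, on a made-up scalar linear-quadratic problem with a linear-in-features approximator; the dynamics, cost, feature choice, and sampling sizes are all assumptions for illustration.

```python
import numpy as np

# Fitted value iteration sketch (illustrative scalar problem).
# Dynamics: x_{k+1} = x_k + u + w,  stage cost: g = x^2 + u^2.
# Approximator: J~_k(x, r_k) = r_k' phi(x) with quadratic features.

rng = np.random.default_rng(0)
N = 10                                  # horizon
gamma_reg = 1e-3                        # regularization weight gamma
controls = np.linspace(-2.0, 2.0, 21)   # discretized control set
noise = rng.normal(0.0, 0.1, size=30)   # samples of w for the expectation

def phi(x):
    return np.array([1.0, x, x * x])    # feature vector

def approx(x, r_vec):
    return phi(x) @ r_vec

r = [np.zeros(3) for _ in range(N + 1)]
r[N] = np.array([0.0, 0.0, 1.0])        # terminal cost g_N(x) = x^2 exactly

# Backward pass: train J~_k using targets computed from J~_{k+1}
for k in range(N - 1, -1, -1):
    xs = rng.uniform(-3.0, 3.0, size=50)        # sample states x_k^s
    betas = []
    for x in xs:
        # beta_k^s = min_u E{ g(x,u,w) + J~_{k+1}(f(x,u,w), r_{k+1}) }
        q = [np.mean([(x ** 2 + u ** 2) + approx(x + u + w, r[k + 1])
                      for w in noise]) for u in controls]
        betas.append(min(q))
    # Regularized least squares over r_k
    Phi = np.array([phi(x) for x in xs])
    A = Phi.T @ Phi + gamma_reg * np.eye(3)
    r[k] = np.linalg.solve(A, Phi.T @ np.array(betas))
```

For this linear-quadratic instance the learned J̃_0 should be close to quadratic, with a positive coefficient on x² (the Riccati solution for this system is about 1.6).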
Bertsekas (M.I.T.) Reinforcement Learning 21 / 33
Neural Networks for Constructing Cost-to-Go Approximations J̃_k
Major fact about neural networks
They automatically construct features to be used in a linear architecture
Neural nets are approximation architectures of the form

J̃(x, v, r) = Σ_{i=1}^m r_i φ_i(x, v) = r′φ(x, v)
involving two parameter vectors r and v with different roles
View φ(x , v) as a feature vector
View r as a vector of linear weights for φ(x , v)
By training v jointly with r , we obtain automatically generated features!
Neural nets can be used in the fitted value iteration scheme
Train the stage k neural net (i.e., compute J̃_k) using a training set generated with the stage k + 1 neural net (which defines J̃_{k+1})
Bertsekas (M.I.T.) Reinforcement Learning 23 / 33
Neural Network with a Single Nonlinear Layer
[Figure: single-nonlinear-layer network — state x → encoding y(x) → linear layer Ay(x) + b (parameter v = (A, b)) → sigmoidal layer producing features φ_1(x, v), ..., φ_m(x, v) → linear weighting r → cost approximation r′φ(x, v); inset: plot of the sigmoidal unit σ(ξ) with levels 1, 0, −1]
State encoding (could be the identity, could include special features of the state)
Linear layer Ay(x) + b [parameters to be determined: v = (A, b)]
Nonlinear layer produces m outputs φ_i(x, v) = σ((Ay(x) + b)_i), i = 1, ..., m
σ is a scalar nonlinear differentiable function; several types have been used (hyperbolic tangent, logistic, rectified linear unit)
The training problem is to use the training set (x^s, β^s), s = 1, ..., q, to solve

min over (v, r) of Σ_{s=1}^q ( Σ_{i=1}^m r_i φ_i(x^s, v) − β^s )² + (Regularization Term)
Often solved with incremental gradient methods (known as backpropagation)
Universal approximation theorem: With a sufficiently large number of parameters, "arbitrarily" complex functions can be closely approximated
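A minimal sketch of this single-nonlinear-layer architecture and its training problem, assuming a hyperbolic tangent σ and plain batch gradient descent in place of incremental gradient/backpropagation; the training data, hidden width, and step size are illustrative.

```python
import numpy as np

# Single-nonlinear-layer network: y(x) -> linear layer A y(x) + b ->
# sigmoidal units phi_i(x, v) = tanh((A y(x) + b)_i) -> linear weights r.
# Trained on pairs (x^s, beta^s) by batch gradient descent (illustrative).

rng = np.random.default_rng(1)
m = 20                                   # number of hidden units
xs = np.linspace(-2, 2, 40)              # training inputs x^s
betas = np.abs(xs)                       # targets beta^s (|x| is not linear)

A = rng.normal(0, 1, (m, 1))             # v = (A, b): linear layer parameters
b = rng.normal(0, 1, m)
r = np.zeros(m)                          # final linear weights

def features(x):
    z = A @ np.atleast_1d(x) + b         # linear layer
    return np.tanh(z)                    # sigmoidal layer: phi(x, v)

step = 0.01
for _ in range(2000):
    gA = np.zeros_like(A); gb = np.zeros_like(b); gr = np.zeros_like(r)
    for x, beta in zip(xs, betas):
        ph = features(x)
        err = r @ ph - beta              # residual of r' phi(x, v) - beta
        dz = err * r * (1 - ph ** 2)     # chain rule through tanh
        gr += err * ph
        gA += np.outer(dz, [x])
        gb += dz
    A -= step * gA / len(xs)
    b -= step * gb / len(xs)
    r -= step * gr / len(xs)

mse = np.mean([(r @ features(x) - beta) ** 2 for x, beta in zip(xs, betas)])
```

Training v = (A, b) jointly with r is what turns the φ_i into automatically generated features: the hidden units move to where the target function bends.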
Bertsekas (M.I.T.) Reinforcement Learning 24 / 33
Deep Neural Networks
[Figure: a neural-network cost approximator. The state x is encoded as a vector y(x), passed through a linear layer Ay(x) + b with parameter v = (A, b), then through a nonlinear (sigmoidal) layer that produces the features φ1(x, v), ..., φm(x, v), which are combined by a linear weighting to give the cost approximation r'φ(x, v). Also shown: a hand-crafted feature extractor for chess (material balance, mobility, safety, etc.), with a weighted score serving as position evaluator.]
More complex NNs are formed by concatenation of multiple layers

The outputs of each nonlinear layer become the inputs of the next linear layer

A hierarchy of features

Considerable success has been achieved in major contexts

Possible reasons for the success
With more complex features, the number of parameters in the linear layers may be drastically decreased

We may use matrices A with a special structure that encodes special linear operations such as convolution
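To make the layered architecture concrete, here is a minimal pure-Python sketch of a single linear-plus-sigmoidal layer followed by a linear weighting of the resulting features. The encoding y(x) and all parameter values (A, b, r) are hypothetical, chosen only for illustration.

```python
import math

def features(x, v):
    """Sigmoidal layer: phi_l(x, v) = sigmoid(A_l . y(x) + b_l)."""
    A, b = v
    y = [x, 1.0]  # a simple hand-coded state encoding y(x) (illustrative)
    return [1.0 / (1.0 + math.exp(-(sum(a * yi for a, yi in zip(row, y)) + bl)))
            for row, bl in zip(A, b)]

def approx(x, v, r):
    """Linear weighting of the learned features: r' phi(x, v)."""
    return sum(rl * fl for rl, fl in zip(r, features(x, v)))

# Hypothetical parameters: two sigmoidal units (rows of A), linear weights r
v = ([[1.0, 0.0], [0.5, -1.0]], [0.0, 0.2])
r = [2.0, -1.0]
print(approx(1.5, v, r))
```

Training would tune v = (A, b) and r jointly (e.g. by gradient descent on a least-squares cost); only the forward evaluation is sketched here.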
Q-Factors - Model-Free RL
The Q-factor of a state-control pair (xk, uk) at time k is defined by

Qk(xk, uk) = E{ gk(xk, uk, wk) + Jk+1(xk+1) }

where Jk+1 is the optimal cost-to-go function for stage k + 1

Note that

Jk(xk) = min_{u ∈ Uk(xk)} Qk(xk, u)

so the DP algorithm can be written in terms of Qk:

Qk(xk, uk) = E{ gk(xk, uk, wk) + min_{u ∈ Uk+1(xk+1)} Qk+1(xk+1, u) }

We can approximate Q-factors instead of costs
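The backward Q-factor recursion above can be checked on a toy problem. The following self-contained sketch runs it on a hypothetical two-state, two-control, two-stage problem with zero terminal cost; all probabilities and costs are invented for illustration.

```python
import itertools

# Hypothetical finite-horizon problem: states 0,1; controls 0,1; N = 2 stages
N = 2
states, controls = [0, 1], [0, 1]
# p[i][u][j]: probability of moving from state i to state j under control u
p = {0: {0: [0.8, 0.2], 1: [0.3, 0.7]},
     1: {0: [0.5, 0.5], 1: [0.1, 0.9]}}
# g[i][u]: expected stage cost of applying control u at state i
g = {0: {0: 1.0, 1: 0.4}, 1: {0: 2.0, 1: 1.5}}

# Backward DP written in terms of Q-factors:
#   Q_k(i, u) = g(i, u) + sum_j p_ij(u) * J_{k+1}(j)
#   J_k(i)    = min_u Q_k(i, u),  with terminal cost J_N = 0
J = {i: 0.0 for i in states}  # J_N
for k in reversed(range(N)):
    Q = {(i, u): g[i][u] + sum(p[i][u][j] * J[j] for j in states)
         for i, u in itertools.product(states, controls)}
    J = {i: min(Q[(i, u)] for u in controls) for i in states}

# Optimal first-stage controls, read off the stage-0 Q-factors
policy0 = {i: min(controls, key=lambda u: Q[(i, u)]) for i in states}
print(J, policy0)
```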
Fitted Value Iteration for Q-Factors: Model-Free Approximate DP
Consider fitted value iteration with Q-factor parametric approximations:

Q̃k(xk, uk, rk) ≈ E{ gk(xk, uk, wk) + min_{u ∈ Uk+1(xk+1)} Q̃k+1(xk+1, u, rk+1) }

(Note a bit of mathematical magic: the order of E{·} and min has been reversed.)

We obtain Q̃k(xk, uk, rk) by training with many pairs ((x^s_k, u^s_k), β^s_k), where β^s_k is a sample of the approximate Q-factor of (x^s_k, u^s_k). No need to compute E{·}

No need for a model to obtain β^s_k: it is sufficient to have a simulator that generates random samples of state-control-cost-next-state quadruples ((xk, uk), (gk(xk, uk, wk), xk+1))

Having computed rk, the one-step lookahead control is obtained on-line as

µ̃k(xk) = arg min_{u ∈ Uk(xk)} Q̃k(xk, u, rk)

without the need for a model or expected value calculations

Also, the on-line calculation of the control is simplified
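A minimal sketch of this sample-based training step, under a tabular architecture (one weight per state-control pair), for which least-squares fitting reduces to per-pair sample averaging. The simulator's dynamics and costs, and the given next-stage approximation Q_next, are all hypothetical.

```python
import random

random.seed(0)

# Black-box simulator for one stage: given (state i, control u), returns a
# sampled (stage cost, next state).  Its internals are hidden from the
# training algorithm, which only draws samples from it.
def simulator(i, u):
    prob0 = {(0, 0): 0.8, (0, 1): 0.3, (1, 0): 0.5, (1, 1): 0.1}[(i, u)]
    j = 0 if random.random() < prob0 else 1
    cost = {(0, 0): 1.0, (0, 1): 0.4, (1, 0): 2.0, (1, 1): 1.5}[(i, u)]
    return cost, j

# Already-computed approximation of the next-stage Q-factors Q_{k+1}(j, u)
Q_next = {(0, 0): 1.0, (0, 1): 0.4, (1, 0): 2.0, (1, 1): 1.5}

# Collect samples beta^s = g + min_u' Q_{k+1}(j, u') and "train" on pairs
# ((i, u), beta^s); no model and no expected-value computation is needed
sums, counts = {}, {}
for _ in range(20000):
    i, u = random.choice([0, 1]), random.choice([0, 1])
    cost, j = simulator(i, u)
    beta = cost + min(Q_next[(j, up)] for up in (0, 1))
    sums[(i, u)] = sums.get((i, u), 0.0) + beta
    counts[(i, u)] = counts.get((i, u), 0) + 1
Q_fit = {key: sums[key] / counts[key] for key in sums}

# One-step lookahead control at state 0, computed on-line without a model
mu0 = min((0, 1), key=lambda u: Q_fit[(0, u)])
print(round(Q_fit[(0, 1)], 3), mu0)
```

With a richer parametric architecture in place of the table, the averaging step becomes a regression (least-squares training) on the same pairs.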
A Few Remarks on Infinite Horizon Problems
[Figure: the self-learning cycle of approximate policy iteration. Guess an initial policy. Evaluate the approximate cost J̃µ(r) = Φr of the current policy µ using simulation. Generate an "improved" policy µ̄ by the lookahead minimization

µ̄(i) = arg min_{u ∈ U(i)} Σ_{j=1}^n p_ij(u) ( g(i, u, j) + α J̃(j, r) )

and repeat with µ̄ in place of µ.]
Most popular setting: Stationary finite-state system, stationary policies, discounting or a termination state

Policy iteration (PI) generates a sequence of policies:
The current policy µ is evaluated using a parametric architecture J̃µ(x, r)
An "improved" policy µ̄ is obtained by one-step lookahead using J̃µ(x, r)

The architecture is trained using simulation data generated with µ

Thus the system "observes itself" under µ and uses the data to "learn" the improved policy µ̄: "self-learning"

Exact PI converges to an optimal policy; approximate PI "converges" to within an "error zone" of the optimal, then oscillates

TD-Gammon, AlphaGo, and AlphaZero all use forms of approximate PI for training
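The self-learning cycle can be sketched end-to-end on a made-up two-state discounted problem: simulation-based (Monte Carlo rollout) policy evaluation alternating with a model-based one-step lookahead improvement. All numbers are invented for illustration.

```python
import random

random.seed(1)

# Hypothetical discounted two-state problem; p[(i, u)][j] and costs g[(i, u)]
alpha = 0.9
p = {(0, 0): [0.5, 0.5], (0, 1): [0.9, 0.1],
     (1, 0): [0.3, 0.7], (1, 1): [0.6, 0.4]}
g = {(0, 0): 1.0, (0, 1): 0.5, (1, 0): 2.0, (1, 1): 1.0}

def step(i, u):
    """Simulate one transition: return (stage cost, next state)."""
    return g[(i, u)], (0 if random.random() < p[(i, u)][0] else 1)

def evaluate(mu, rollouts=300, horizon=60):
    """Approximate policy evaluation by truncated Monte Carlo rollouts."""
    J = {}
    for i0 in (0, 1):
        total = 0.0
        for _ in range(rollouts):
            i, discount, ret = i0, 1.0, 0.0
            for _ in range(horizon):
                cost, i = step(i, mu[i])
                ret += discount * cost
                discount *= alpha
            total += ret
        J[i0] = total / rollouts
    return J

def improve(J):
    """One-step lookahead policy improvement (uses the model here)."""
    return {i: min((0, 1),
                   key=lambda u: g[(i, u)]
                   + alpha * sum(p[(i, u)][j] * J[j] for j in (0, 1)))
            for i in (0, 1)}

mu = {0: 0, 1: 0}        # guess an initial policy
for _ in range(3):       # the self-learning cycle
    mu = improve(evaluate(mu))
print(mu)
```

For this made-up problem, control 1 has both lower stage cost and better transitions, so the cycle settles on it at both states after the first improvement.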
A Few Topics we did not Cover in this Talk
Infinite horizon extensions: Approximate value and policy iteration methods, error bounds, model-based and model-free methods

Temporal difference methods: A class of methods for policy evaluation in infinite horizon problems with a rich theory, issues of variance-bias tradeoff

Sampling for exploration, in the context of policy iteration

Monte Carlo tree search, and related methods

Aggregation methods, synergism with other approximate DP methods

Approximation in policy space, actor-critic methods, policy gradient and cross-entropy methods

Special aspects of imperfect state information problems, connections with traditional control schemes

Infinite spaces optimal control, connections with aggregation schemes

Special aspects of deterministic problems: Shortest paths and their use in approximate DP

A broad view of using simulation for large-scale computations: Methods for large systems of equations and linear programs, connection to proximal algorithms
Concluding Remarks
Some words of caution
There are challenging implementation issues in all approaches, and no fool-proof methods
Problem approximation and feature selection require domain-specific knowledge
Training algorithms are not as reliable as you might think by reading the literature
Approximate PI involves oscillations
Recognizing success or failure can be a challenge!
The RL successes in game contexts are spectacular, but they have benefited from perfectly known and stable models and a small number of controls (per state)
Problems with partial state observation remain a big challenge
On the positive side
Massive computational power together with distributed computation is a source of hope
Silver lining: We can begin to address practical problems of unimaginable difficulty!
There is an exciting journey ahead!
Thank you!