Tutorial: A Unified Modeling and Algorithmic Framework for Optimization under Uncertainty
Informs Optimization Society Meeting
March 23, 2018
Warren B. Powell
Princeton UniversityDepartment of Operations Research
and Financial Engineering
© 2016 Warren B. Powell, Princeton University
Drug discovery
Designing molecules
» X and Y are sites where we can hang substituents to change the behavior of the molecule. We approximate the performance using a linear belief model:
0
ij ijsites i substituents j
Y X
» How to sequence experiments to learn the best molecule as quickly as possible?
Real-time logisticsUber/Lyft» Provides real-time, on-demand
transportation.» Drivers are encouraged to enter or leave
the system using pricing signals and informational guidance.
Decisions:» How to price to get the right balance of
drivers relative to customers.» Real-time management of drivers.» Policies (rules for managing drivers,
customers, …)
Ride sharing
RidersCars
t t+1 t+2
Ride sharing
t t+1 t+2
Ride sharing
t t+1 t+2
Ride sharing
Matching buyers with sellersNow we have a logistic curve for each origin-destination pair (i,j)
Number of offers for each (i,j) pair is relatively small.Need to generalize the learning across hundreds to thousands of markets.
0
0( , | )1
aij ij ij
aij ij ij
p aY
p a
eP p ae
Buyer Seller
Offered price
Meeting variability with portfolios of generationwith mixtures of dispatchability
Storage applicationsHow much energy to store in a battery to handle the volatility of wind and spot prices to meet demands?
Outline
Canonical problemsElements of a dynamic modelAn energy storage illustrationSolution strategies and problem classesModeling uncertaintyDesigning policies
Outline
Canonical problemsElements of a dynamic modelAn energy storage illustrationSolution strategies and problem classesModeling uncertaintyDesigning policies
Canonical problems
Decision trees
Canonical problems
Stochastic search (derivative based)» Basic problem:
» Stochastic gradient
» Asymptotic convergence:
» Finite time performance
max ( , )x F x W
1 1( , )n n n nn xx x F x W
*lim ( , ) ( , )nn F x W F x W
Manufacturing network (x=design)Unit commitment problem (x=day ahead decisions)Inventory system (x=design, replenishment policy)Battery system (x=choice of material)Patient treatment cost (x=drug, treatments)Trucking company (x=fleet size and mix)
,max ( , ) where is an algorithm (or policy)nF x W
Canonical problems
Ranking and selection (derivative free)» Basic problem:
» We need to design a policy that finds a design given by
» We refer to this objective as maximizing the final reward.
1 ,...,max ( , )Mx x x F x W
( )nX S
,max ,NF x W
,Nx
Canonical problems
Multi-armed bandit problems» We learn the reward from playing each
“arm”
» We need to find a policy for playing machine x that maximizes:
where
We refer to this problem as maximizing cumulative reward.
( )nX S
11
0
max ( ( ), )N
n n
n
F X S W
1 "winnings"State of knowledge
( )
n
n
n n
WSx X S Choose next “arm” to play
New information
What we know about each slot machine
Canonical problems
(Discrete) Markov decision processes» Bellman’s optimality equation
» This is also the same as solving
where the optimal policy has the form
1 1
1 1 1'
( ) min ( , ) ( ) |
min ( , ) ( ' | , ) ( )
t
t
t t a t t t t t
a t t t t t t ts
V S C S a V S S
C S a p S s S a V S
A
A
{ }( )1 1( ) arg min ( , ) ( ) | ,tt x t t t t t tX S C S x V S S xp
+ += +
00
min , ( ) |T
t t tt
E C S X S S
Canonical problems
Optimal stopping I» Model:
• Exogenous process:
• Decision:
• Reward:
» Optimization problem:
where is a “stopping time” (or )
1 If we stop and sell at time ( )
0 Otherwise t
tX
1 2, ,..., Sequence of stock pricesTp p p
Price received if we stop at time tp t
max p X " measurable function"tF
Canonical problems
Optimal stopping II» Model:
• Exogenous process:
• State:
• Policy:
» Optimization problem:
1 ( | )
0 Otherwise t t
t t
p pX S
1 2
1
, ,..., Sequence of stock prices(1 )
T
t t t
p p pp p p
0 0max ( | ) max ( | )
T T
t t t tt t
p X S p X S
1 if we are holding asset, 0 otherwise.( , , )
t
t t t t
RS R p p
=
=
Canonical problems
Linear quadratic regulation (LQR)» A popular optimal control problem in engineering
involves solving:
» where:
» Possible to show that the optimal policy looks like:
where is a complicated function of Q and R.
0 ,...,
0min ( ) ( )
T
TT T
u u t t t tt
x Qx u Ru
1
State at time Control at time (must be measurable)
( , ) ( is random at time )
t
t t
t t t t t
x tu t F
x f x u w w t
*( )t t t tU x K x
tK
Canonical problems
Stochastic programming» A (two-stage) stochastic programming problem
where
» This is the canonical form of stochastic programming, which might also be written over multiple periods:
0 0 0 0 0 1min ( , )x X c x Q x
1 10 1 ( ) ( ) 1 1( , ( )) min ( ) ( )x XQ x c x
0 01
min ( ) ( ) ( )T
t tt
c x p c x
Canonical problems
Stochastic programming» A (two-stage) stochastic programming policy
where
» This is the canonical form of stochastic programming, which might also be written over multiple periods:
1min ( , )t tx X t t t tc x Q x
1 11 ( ) ( ) 1 1( , ( )) min ( ) ( )t tt t x X t tQ x c x
' '' 1
min ( ) ( ) ( )t t
t H
t t t tt ttt t
c x p c x
Canonical problems
A robust optimization problem would be written
» This means finding the best design x for the worst outcome w in an “uncertainty set”
» This has been adapted to multiperiod problems
( )min max ( , )x X w F x w W
0 1 0,..., ( ,..., ) ( ) ' ' '' 0
min max ( )T
T
x x w w t t tt
c w x
( )
,..., ( ,..., ) ( ) ' ' ''
min max ( )t t H t t H
t H
x x w w t t tt t
c w x
Canonical problems
Why do we need a unified framework?
» The classical frameworks and algorithms are fragile.
» Small changes to problems invalidate optimality conditions, or make algorithmic approaches intractable.
» Practitioners need robust approaches that will provide high quality solutions for all problems.
Outline
Canonical problemsElements of a dynamic modelAn energy storage illustrationSolution strategies problem classesModeling uncertaintyDesigning policies
Storage applicationsHow much energy to store in a battery to handle the volatility of wind and spot prices to meet demands?
Modeling
For deterministic problems, we speak the language of mathematical programming» Linear programming:
» For time-staged problems
min x cx
0Ax b
x
1 1
0
t t t t t
t t t
t
A x B x bD x u
x
0 ,...,0
minT
T
x x t tt
c x
Arguably Dantzig’s biggest contribution, more so than the simplex algorithm, was his articulation of optimization problems in a standard format, which has given algorithmic researchers a common language.
Stochastic programming
Markov decision processes
Reinforcement learning
Optimal control
Model predictive
control
Robust optimization
Approximate dynamic
programming
Online computation
Simulation optimization
Stochastic search
Decision
analysis
Stochastic control
Simulation optimization
DynamicProgramming
andcontrol
Optimal learning
Banditproblems
Stochastic programming
Markov decision processes
Reinforcement learning
Optimal control
Model predictive
control
Robust optimization
Approximate dynamic
programming
Online computation
Simulation optimization
Stochastic search
Decision
analysis
Stochastic control
Simulation optimization
DynamicProgramming
andcontrol
Optimal learning
Banditproblems
Modeling
Before we can solve complex problems, we have to know how to think about them.
The biggest challenge when making decisions under uncertainty is modeling.
Min E {cx}Ax = bx > 0
Mathematician
Software
Organize classlibraries, and set up
communications and databases
Modeling
We lack a standard language for modeling sequential, stochastic decision problems.» In the slides that follow, we propose to model problems
along five fundamental dimensions:
• State variables• Decision variables• Exogenous information• Transition function• Objective function
» This framework draws heavily from Markov decision processes and the control theory communities, but it is not the standard form used anywhere.
Modeling dynamic problems
The state variable:
Controls community "Information state"Operations research/MDP/Computer science , , System state, where: Resource state (physical state) Loca
t
t t t t
t
x
S R I BR
tion/status of truck/train/plane Energy in storage Information state Prices Weather Belief state ("state of knowl
t
t
I
B
edge") Belief about traffic delays Belief about the status of equipment
The state variable
Illustrating state variables» A deterministic graph
1
2
3
4
5
6
7
8
9
10
11
12.6
8.4
9.2 3.6
8.1
17.4
15.9
16.5 20.2
13.5
8.912.7
15.9
2.34.5
7.3
9.65.7
?tS ( ) 6t tS N
The state variable
Illustrating state variables» A stochastic graph
1
2
3
4
5
6
7
8
9
10
11
3.6
8.1
13.5
8.9
12.712.6
8.4
9.2
15.9
?tS
The state variable
Illustrating state variables» A stochastic graph
1
2
3
4
5
6
7
8
9
10
11
3.6
8.1
13.5
8.9
12.712.6
8.4
9.2
15.9
?tS , ,, 6, 12.7,8.9,13.5tt t t N j j
S N c
tR
tI
The state variable
Illustrating state variables» A stochastic graph with left turn penalties
1
2
3
4
5
6
7
8
9
10
11
3.6
8.1
13.5
8.9
12.6
8.4
9.2
15.9
?tS , , 1, , 6, 12,7,8.9,13.5 ,3tt t t N j tj
S N c N
tR
tI
12.7( .7)
The state variable
Variant of problem in Puterman (2005):» Find best path from 1 to 11 that minimizes the second
highest arc cost along the path:
» If the traveler is at node 9, what is her state?
1 3
2
4
6
5
7
8
8
1212
14
87
3
11 14
11
8
9
8
10
112
4
510
15
815
( , highest,second highest) (9,15,12)t tS N ?tS
The state variables
What is a state variable?» Bellman’s classic text on dynamic programming (1957)
describes the state variable with:• “… we have a physical system characterized at any stage by a
small set of parameters, the state variables.”
» The most popular book on dynamic programming (Puterman, 2005, p.18) “defines” a state variable with the following sentence:
• “At each decision epoch, the system occupies a state.”
» Wikipedia:• A state variable is one of the set of variables that are used to
describe the mathematical ‘state’ of a dynamical system
The state variable
My definition of a state variable:
» The first depends on a policy. The second depends only on the problem (and includes the constraints).
» Using either definition, all properly modeled problems are Markovian!
Modeling dynamic problems
Decisions:Markov decision processes/Computer science Discrete actionControl theory Low-dimensional continuous vectorOperations research Usually a discrete or continuous but high-dimensional
t
t
t
a
u
x
vector of decisions.
At this point, we do not specify to make a decision.Instead, we define the function ( ) (or ( ) or ( )), where specifies the type of policy. " " carries informationabout the type of functi
howX s A s U s
.on , and any tunable parameters ff
The decision variables
Styles of decisions» Binary
» Finite
» Continuous scalar
» Continuous vector
» Discrete vector
» Categorical
0,1x X
1,2,...,x X M
,x X a b
1( ,..., ), K kx x x x
1( ,..., ), K kx x x x
1( ,..., ), is a category (e.g. red/green/blue)I ix a a a
Modeling dynamic problems
Exogenous information:
New information that first became known at time
ˆ ˆ ˆˆ = , , ,
ˆ Equipment failures, delays, new arrivals New drivers being hired to the network
ˆ New customer demands
t
t t t t
t
t
W t
R D p E
R
D
ˆ Changes in pricesˆ Information about the environment (temperature, ...) t
t
p
E
Note: Any variable indexed by t is known at time t. This convention, which is not standard in control theory, dramatically simplifies the modeling of information.
1 2Below, we let represent a sequence of actual observations , ,....
refers to a sample realization of the random variable .t t
W WW W
Modeling dynamic problems
The transition function
1 1
1 1
1 1
1 1
( , , )ˆ Inventoriesˆ Spot pricesˆ Market demands
Mt t t t
t t t t
t t t
t t t
S S S x W
R R x Rp p p
D D D
Also known as the:“System model”“State transition model”“Plant model”“Plant equation”“State equation”
“Transfer function”“Transformation function”“Law of motion”“Model”“transition function”
For many applications, these equations are unknown. This is known as “model-free” dynamic programming.
1 00
max , ( ), |
T
t t t t tt
E C S X S W S
Modeling stochastic, dynamic problemsThe universal objective function
Given a system model (transition function)
and a stochastic process:
Now we just have to find the best policy.
Decision function (policy)State variable
Contribution function
Finding the best policy
Expectation over allrandom outcomes
1 1, , ( )Mt t t tS S S x W
Initial state variableNew information
0 1 2, , ,..., TS W W W
1 00
max , ( ), |
T
t t t t tt
E C S X S W S
Modeling stochastic, dynamic problemsThe universal objective function
Given a system model (transition function)
and a stochastic process:
Now we just have to find the best policy.
1 1, , ( )Mt t t tS S S x W
0 1 2, , ,..., TS W W W
Modeling Deterministic » Objective function
» Decision variables:
» Constraints: • at time t
• Transition function
Stochastic» Objective function
» Policy
» Constraints at time t
» Transition function
» Exogenous information
1 00
max , ( ), |
T
t t t t tt
E C S X S W S0 ,..., 0min
T
T
t tx x tc x
( )t t t tx X S
1 1t t t tR b B x 1 1, ,Mt t t tS S S x W
0 1 2( , , ,..., )TS W W W
0t t t
t
A x Rx
t
0 ,..., Tx x :X S
Outline
Canonical problemsElements of a dynamic modelAn energy storage illustrationSolution strategies and problem classesModeling uncertaintyDesigning policies
An energy storage problem
Consider a basic energy storage problem:
» We are going to show that with minor variations in the characteristics of this problem, we can make each class of policy work best.
An energy storage problem
A model of our problem
» State variables
» Decision variables
» Exogenous information
» Transition function
» Objective function
An energy storage problem
State variables
» We will present the full model, accumulating the information we need in the state variable.
» We will highlight information we need as we proceed. This information will make up our state variable.
E
G
B
L
An energy storage problem
Decision variables
» Constraints;
E
G
B
L
, , , , ,EL EB GL GB BLt t t t t tx x x x x x
An energy storage problem
Exogenous information
E
G
B
L
' '
' '
ˆ Change in energy from wind between 1 and
Noise in the price process between 1 and
Forecast of demand provided by vendor at time
Provided exogenously
Differenc
tp
tD
tt t
D Dt tt t tDt
E t t
t t
f D t
f f
e between actual demand and forecast
tW
An energy storage problem
Transition function
E
G
B
L
1 1
1 0 1 1 2 2 1 1
1 , 1 1
1
ˆ
( )t t t
p T pt t t t t t t t t t t
D Dt t t t
battery batteryt t t
E E E
p p p p p
D f
R R x
1
2
t
t t
t
pp p
p
Learning in stochastic optimization
Updating the demand parameter» Let be the new price and let
» We update our estimate using our recursive least squares equations:
1 11
1t t t t t
t
B pq q eg+ +
+
= -
( )1 1
11
1
( | ) , 1 ( )
1 ( )
t t t t t
Tt t t t t t
t
Tt t t t
F p p
B B B p p B
p B p
e q
g
g
+ +
++
+
= -
= -
= +
0 1 1 2 2( | ) ( )price Tt t t t t t t t t t tF p p p p p
An energy storage problem
Objective function
E
G
B
L
( , ) GB GLt t t t tC S x p x x
1 00
min ( , ( ), ) | T
t t t t tt
C S X S W S
1 0 1 1 2 2 1
1 , 1 1
1 11
1 ( )
pt t t t t t t t
D Dt t t t
t t t t tt
p p p p
D f
B x
An energy storage problem
State variables» Cost function
» Decision functionConstraints:
» Transition function
Price of electricitytp
1 2, , , ( , , ) , , ( , )Dt t t t t t t t t tS E L R p p p f B
Outline
Elements of a dynamic modelAn energy storage illustrationSolution strategies and problem classesModeling uncertaintyDesigning policies
Solution strategies and problem classesSpecial structure» There are special cases where we can solve
exactly. But not very many.
Sampled problems (SAA, scenario trees)» If the only problem is that we cannot compute the expectation, we
might solve a sampled approximation
Adaptive learning algorithms» This is what we have to turn to for most problems, and is the focus
of this tutorial.
max ( , )x F x W
ˆmax ( , )x F x W
Solution strategies and problem classes
State independent problems» The problem does not depend on the state of the system.
» The only state variable is what we know (or believe) about the unknown function , called the belief state , so .
State dependent problems» Now the problem may depend on what we know at time t:
» Now the state is
max ( , ) min( , )xF x W p x W cx
( , )F x W
0max ( , , ) min( , )
tx R tC S x W p x W cx
Solution strategies and problem classes
Offline (final reward)» We can iteratively search for the best solution, but only
care about the final answer.» Asymptotic formulation:
» Finite horizon formulation:
Online (cumulative reward)» We have to learn as we go
max ( , )xF x W
,max ( , )NF x W
11
0
max ( ( ), )N
n n
n
F X S W
üïïïïïýïïïïïþ
“ranking and selection”or
“stochastic search”
Solution strategies and problem classes
Solution strategies and problem classes
Solution strategies and problem classes
Learning policies:Approximate dynamic programmingQ-learningSDDP…
Solution strategies and problem classes
“Online” (cumulative reward) dynamic programming is recognized as the “dynamic programming problem,” but the entire literature on solving dynamic programs describes class (4) problems. This appears to be an open problem.
Outline
Canonical problemsElements of a dynamic modelAn energy storage illustrationSolution strategies and problem classesModeling uncertaintyDesigning policies
Modeling uncertainty
Observational uncertaintyPrognostic uncertainty (forecasting)Experimental noise/variabilityTransitional uncertaintyInferential uncertaintyModel uncertaintySystematic exogenous uncertaintyControl/implementation uncertaintyAlgorithmic noiseGoal uncertainty
Modeling uncertainty in the context of stochastic optimization is a relatively untapped area of research.
Outline
Canonical problemsElements of a dynamic modelAn energy storage illustrationSolution strategies and problem classesModeling uncertaintyDesigning policies
Designing policies
We have to start by describing what we mean by a policy.» Definition:
A policy is a mapping from a state to an action. … any mapping.
How do we search over an arbitrary space of policies?
Designing policies
Two fundamental strategies:
1) Policy search – Search over a class of functions for making decisions to optimize some metric.
2) Lookahead approximations – Approximate the impact of a decision now on the future.
0( , )0
max , ( | ) |f f
T
t t tf Ft
E C S X S S
*' ' ' 1
' 1( ) arg max ( , ) max ( , ( )) | | ,
t
T
t t x t t t t t t t tt t
X S C S x C S X S S S x
Designing policiesPolicy search:1a) Policy function approximations (PFAs)
• Lookup tables – “when in this state, take this action”
• Parametric functions– Order-up-to policies: if inventory is less than s, order up to S.– Affine policies -– Neural networks
• Locally/semi/non parametric– Requires optimizing over local regions
1b) Cost function approximations (CFAs)• Optimizing a deterministic model modified to handle uncertainty
(buffer stocks, schedule slack)
( )( | ) arg max ( , | )
t t
CFAt t tx
X S C S x
X
( | )PFAt tx X S
( | ) ( )PFAt t f f t
f Fx X S S
Designing policiesLookahead approximations – Approximate the impact of a decision now on the future:» An optimal policy (based on looking ahead):
2a) Approximating the value of being in a downstream state using machine learning (“value function approximations”)
*1 1
1 1
( ) arg max ( , ) ( ) | ,
( ) arg max ( , ) ( ) | ,
arg max ( , ) ( )
t
t
t
t t x t t t t t t
VFAt t x t t t t t t
x xx t t t t
X S C S x V S S x
X S C S x V S S x
C S x V S
*' ' ' 1
' 1( ) arg max ( , ) max ( , ( )) | | ,
t
T
t t x t t t t t t t tt t
X S C S x C S X S S S x
Designing policiesLookahead approximations – Approximate the impact of a decision now on the future:» An optimal policy (based on looking ahead):
2a) Approximating the value of being in a downstream state using machine learning (“value function approximations”)
*1 1
1 1
( ) arg max ( , ) ( ) | ,
( ) arg max ( , ) ( ) | ,
arg max ( , ) ( )
t
t
t
t t x t t t t t t
VFAt t x t t t t t t
x xx t t t t
X S C S x V S S x
X S C S x V S S x
C S x V S
*' ' ' 1
' 1( ) arg max ( , ) max ( , ( )) | | ,
t
T
t t x t t t t t t t tt t
X S C S x C S X S S S x
Designing policiesLookahead approximations – Approximate the impact of a decision now on the future:» An optimal policy (based on looking ahead):
2a) Approximating the value of being in a downstream state using machine learning (“value function approximations”)
*1 1
1 1
( ) arg max ( , ) ( ) | ,
( ) arg max ( , ) ( ) | ,
arg max ( , ) ( )
t
t
t
t t x t t t t t t
VFAt t x t t t t t t
x xx t t t t
X S C S x V S S x
X S C S x V S S x
C S x V S
*' ' ' 1
' 1( ) arg max ( , ) max ( , ( )) | | ,
t
T
t t x t t t t t t t tt t
X S C S x C S X S S S x
The ultimate lookahead policy is optimal*
' ' ' 1' 1
( ) arg max ( , ) max ( , ( )) | | ,
t
T
t t x t t t t t t t tt t
X S C S x C S X S S S x
Designing policies
*' ' ' 1
' 1( ) arg max ( , ) max ( , ( )) | | ,
t
T
t t x t t t t t t t tt t
X S C S x C S X S S S x
Designing policies
The ultimate lookahead policy is optimal
Expectations that we cannot compute
Maximization that we cannot compute
Designing policies
The ultimate lookahead policy is optimal
» 2b) Instead, we have to solve an approximation called the lookahead model:
» A lookahead policy works by approximating the lookahead model.
*' ' ' 1
' 1( ) arg max ( , ) max ( , ( )) | | ,
t
T
t t x t t t t t t t tt t
X S C S x C S X S S S x
*' ' ' , 1
' 1( ) arg max ( , ) max ( , ( )) | | ,
t
t H
t t x t t tt tt tt t t t tt t
X S C S x C S X S S S x
Designing policiesTypes of lookahead approximations » One-step lookahead – Widely used in pure learning
policies:• Bayes greedy/naïve Bayes• Expected improvement• Value of information (knowledge gradient)
» Multi-step lookahead• Deterministic lookahead, also known as model predictive
control, rolling horizon procedure• Stochastic lookahead:
– Two-stage (widely used in stochastic linear programming)– Multistage
» Monte carlo tree search (MCTS) for discrete action spaces
» Multistage scenario trees (stochastic linear programming) – typically not tractable.
1) Policy function approximations (PFAs)» Lookup tables, rules, parametric/nonparametric functions
2) Cost function approximation (CFAs)»
3) Policies based on value function approximations (VFAs)»
4) Direct lookahead policies (DLAs)» Deterministic lookahead/rolling horizon proc./model predictive control
» Chance constrained programming
» Stochastic lookahead /stochastic prog/Monte Carlo tree search
» “Robust optimization”
Four (meta)classes of policies
( )( | ) arg max ( , | )
t t
CFAt t tx
X S C S x
X
( ) arg max ( , ) ( , )t
VFA x xt t x t t t t t tX S C S x V S S x
' '' 1
( ) arg max ( , ) ( ) ( ( ), ( ))t
TLA St t tt tt tt tt
t tX S C S x p C S x
xtt , xt ,t1,..., xt ,tT
,' '( ),..., ' 1
( ) arg max min ( , ) ( ( ), ( ))ttt t t H
TLA ROt t tt tt tt ttw Wx x t t
X S C S x C S w x w
[ ( )] 1t tP A x f W
Polic
y se
arch
,' ',..., ' 1
( ) arg max ( , ) ( , )
tt t t H
TLA Dt t tt tt tt ttx x t t
X S C S x C S x
Look
ahea
dap
prox
imat
ions
1) Policy function approximations (PFAs)» Lookup tables, rules, parametric/nonparametric functions
2) Cost function approximation (CFAs)»
3) Policies based on value function approximations (VFAs)»
4) Direct lookahead policies (DLAs)» Deterministic lookahead/rolling horizon proc./model predictive control
» Chance constrained programming
» Stochastic lookahead /stochastic prog/Monte Carlo tree search
» “Robust optimization”
Four (meta)classes of policies
( )( | ) arg max ( , | )
t t
CFAt t tx
X S C S x
X
,' ',..., ' 1
( ) arg max ( , ) ( , )
tt t t H
TLA Dt t tt tt tt ttx x t t
X S C S x C S x
( ) arg max ( , ) ( , )t
VFA x xt t x t t t t t tX S C S x V S S x
' '' 1
( ) arg max ( , ) ( ) ( ( ), ( ))t
TLA St t tt tt tt tt
t tX S C S x p C S x
xtt , xt ,t1,..., xt ,tT
,' '( ),..., ' 1
( ) arg max min ( , ) ( ( ), ( ))ttt t t H
TLA ROt t tt tt tt ttw Wx x t t
X S C S x C S w x w
[ ( )] 1t tP A x f W
Func
tion
appr
ox.
1) Policy function approximations (PFAs)» Lookup tables, rules, parametric/nonparametric functions
2) Cost function approximation (CFAs)»
3) Policies based on value function approximations (VFAs)»
4) Direct lookahead policies (DLAs)» Deterministic lookahead/rolling horizon proc./model predictive control
» Chance constrained programming
» Stochastic lookahead /stochastic prog/Monte Carlo tree search
» “Robust optimization”
Four (meta)classes of policies
( )( | ) arg max ( , | )
t t
CFAt t tx
X S C S x
X
,' ',..., ' 1
( ) arg max ( , ) ( , )
tt t t H
TLA Dt t tt tt tt ttx x t t
X S C S x C S x
( ) arg max ( , ) ( , )t
VFA x xt t x t t t t t tX S C S x V S S x
' '' 1
( ) arg max ( , ) ( ) ( ( ), ( ))t
TLA St t tt tt tt tt
t tX S C S x p C S x
xtt , xt ,t1,..., xt ,tT
,' '( ),..., ' 1
( ) arg max min ( , ) ( ( ), ( ))ttt t t H
TLA ROt t tt tt tt ttw Wx x t t
X S C S x C S w x w
[ ( )] 1t tP A x f W
Imbe
dded
opt
imiz
atio
n
Learning problems
Functions we have to learn:» Approximating the objective
.» Designing a policy .» A value function approximation
.» Designing a cost function approximation:
• The objective function ̅ , | .• The constraints |
» Approximating the transition function
Approximation strategies
Approximation strategies» Lookup tables
• Independent beliefs • Correlated beliefs
» Linear parametric models• Linear models • Sparse-linear• Tree regression
» Nonlinear parametric models• Logistic regression• Neural networks
» Nonparametric models• Gaussian process regression• Kernel regression• Support vector machines• Deep neural networks
Designing policies
Finding the best policy» We have to first articulate our classes of policies
» So minimizing over means:
» We then have to pick an objective such as
or
10 0
max , ( | ) ( | ),T T
t t t tt t
C S X S F X S W
, , ,
Parameters that characterize each family.f
f PFAs CFAs VFAs DLAs
, ff
max , ( , )T T TC S X F X W
Outline
The four classes of policies» Policy function approximations (PFAs)» Cost function approximations (CFAs)» Value function approximations (VFAs)» Direct lookahead policies (DLAs)» A hybrid lookahead/CFA
Outline
The four classes of policies» Policy function approximations (PFAs)» Cost function approximations (CFAs)» Value function approximations (VFAs)» Direct lookahead policies (DLAs)» A hybrid lookahead/CFA
Policy function approximations
Battery arbitrage – When to charge, when to discharge, given volatile LMPs
0.00
20.00
40.00
60.00
80.00
100.00
120.00
140.00
1 3 5 7 9 11 13 15 17 19 21 23 25 27 29 31 33 35 37 39 41 43 45 47 49 51 53 55 57 59 61 63 65 67 69 71
Grid operators require that batteries bid charge and discharge prices, an hour in advance.
We have to search for the best values for the policy parameters
DischargeCharge
Charge Dischargeand .
Policy function approximations
Policy function approximations
Our policy function might be the parametric model (this is nonlinear in the parameters):
charge
charge discharge
charge
1 if ( | ) 0 if
1 if
t
t t
t
pX S p
p
Energy in storage:
Price of electricity:
Policy function approximations
Finding the best policy» We need to maximize
» We cannot compute the expectation, so we run simulations:
DischargeCharge
0
max ( ) , ( | )T
tt t t
tF C S X S
Outline
The four classes of policies» Policy function approximations (PFAs)» Cost function approximations (CFAs)» Value function approximations (VFAs)» Direct lookahead policies (DLAs)» A hybrid lookahead/CFA
Cost function approximations
Lookup table» We can organize potential catalysts into groups» Scientists using domain knowledge can estimate
correlations in experiments between similar catalysts.
92
Cost function approximations
Correlated beliefs: Testing one material teaches us about other materials
1 2 3 4 4 5
Cost function approximations
Designing a policy» Examples of policies for making decisions:
• Interval estimation:
• Upper confidence bounding
• Thompson sampling:
• Knowledge gradient (expected value of information):
( | ) arg max Std. dev. of IE n IE n IE n n nx x x x xX S
ˆ ˆ( ) arg max ( , )TS n n n n nx x x x xX S N
log( | ) arg max No. of times is tested.UCB n UCB n UCB nx x xn
x
nX S N xN
, 1( ) max ( , ( )) max ( , )KG n n ny yx E F y B x F y B
94
Cost function approximations
Picking means we are evaluating each choice at the mean.
1 2 3 4 4 5
95
Cost function approximations
Picking means we are evaluating each choice at the 95th percentile.
1 2 3 4 4 5
Cost function approximationsOptimizing the policy» We optimize to maximize:
where
Notes:» This can handle any belief model,
including correlated beliefs, nonlinear belief models.
» All we require is that we be able to simulate a policy.
( | ) arg max ( , )n IE n IE n IE n n n nx x x x xx X S S
,max ( ) ,IEIE NF F x W
Cost function approximations
Inventory management
» How much product should I order to anticipate future demands?
» Need to accommodate different sources of uncertainty.
• Market behavior• Transit times• Supplier uncertainty• Product quality
Cost function approximations
Imagine that we want to purchase parts from different suppliers. Let be the amount of product we purchase at time t from supplier p to meet forecasted demand . We would solve
» This assumes our demand forecast is accurate.
tD
tpx
tD
( ) arg maxtt t x p tp
p PX S c x
subject to
0
tp tp P
tp p
tp
x D
x u
x
t
Imagine that we want to purchase parts from different suppliers. Let be the amount of product we purchase at time t from supplier p to meet forecasted demand . We would solve
» This is a “parametric cost function approximation”
( )
Reserve
buffer
( | ) arg max
subject to
tt t p tpx
p P
tp tp P
tp p
tp
X S c x
x D
x u
x
( )t
Cost function approximations
tD
tpx
Reserve
Buffer stock
A general way of creating CFAs:» Define our policy:
subject to
» We tune by optimizing:
Cost function approximations
( ) arg max ( , | )t x t tX C S x
0
min ( ) ( , ( ))T
t tt
F C S X
( )Ax b
Parametricallymodified costs
Parametricallymodified constraints
Outline
The four classes of policies» Policy function approximations (PFAs)» Cost function approximations (CFAs)» Value function approximations (VFAs)» Direct lookahead policies (DLAs)» A hybrid lookahead/CFA
Locational marginal prices on the gridLMPs – Locational marginal prices
$977/MW !!!
Slide 103
Grid level storage Optimizing battery storage
Slide 103
Slide 104
Grid level storage Optimizing battery storage
Slide 104
Slide 105
Value function approximations
Slide 105
Slide 106
Value function approximationsMonday
Time :05 :10 :15 :20
Slide 106
Slide 107
Value function approximationsMonday
Time :05 :10 :15 :20
Slide 107
Slide 108
Value function approximationsMonday
Time :05 :10 :15 :20
Slide 108
Exploiting concavity
Derivatives are used to estimate a piecewise linear approximation
( )t tV R
tR
Slide 109
Approximate value iteration
Step 1: Start with a pre-decision state Step 2: Solve the deterministic optimization using
an approximate value function:
to obtain . Step 3: Update the value function approximation
Step 4: Obtain Monte Carlo sample of andcompute the next pre-decision state:
Step 5: Return to step 1.
, 1 ,1 1 1 1 1 1 ˆ( ) (1 ) ( )n x n n x n n
t t n t t n tV S V S v
1 ,ˆ min ( , ) ( ( , ) )n n n M x nt x t t t t t tv C S x V S S x
ntS
( )ntW
1 1( , , ( ))n M n n nt t t tS S S x W
Simulation
Deterministicoptimization
Recursivestatistics
“on policy learning”
nx
Approximate value iteration
Step 1: Start with a pre-decision state Step 2: Solve the deterministic optimization using
an approximate value function:
to obtain . Step 3: Update the value function approximation
Step 4: Obtain Monte Carlo sample of andcompute the next pre-decision state:
Step 5: Return to step 1.
, 1 ,1 1 1 1 1 1 ˆ( ) (1 ) ( )n x n n x n n
t t n t t n tV S V S v
1 ,ˆ min ( , ) ( ( , ) )n n n M x nt x t t t t t tv C S x V S S x
ntS
( )ntW
1 1( , , ( ))n M n n nt t t tS S S x W
Simulation
Deterministicoptimization
Recursivestatistics
nx
1200000
1300000
1400000
1500000
1600000
1700000
1800000
1900000
0 100 200 300 400 500 600 700 800 900 1000
Obj
ectiv
e fu
nctio
n
Iterations
Approximate dynamic programming
… a typical performance graph.
The value of grid level storage
Without storage
The value of grid level storage
With storage
“I think you give a too rosy a picture of ADP….”Andy Barto, in comments on a paper (2009)
“Is the RL glass half full, or half empty?” Rich Sutton, NIPS workshop, (2014)
Approximate value functions can work very well, but you need structure to guide the learning process. ADP needs benchmarks and careful tuning.
Outline
The four classes of policies» Policy function approximations (PFAs)» Cost function approximations (CFAs)» Value function approximations (VFAs)» Direct lookahead policies (DLAs)» A hybrid lookahead/CFA
Lookahead policies
Planning your next chess move:
» You put your finger on the piece while you think about moves into the future. This is a lookahead policy, illustrated for a problem with discrete actions.
Lookahead policies
Decision trees:
Lookahead policies
Modeling lookahead policies» Lookahead policies solve a lookahead model, which is an
approximation of the future.» It is important to understand the difference between the:
• Base model – this is the model we are trying to solve by finding the best policy. This is usually some form of simulator.
• The lookahead model, which is our approximation of the future to help us make better decisions now.
» The base model is typically a simulator, or it might be the real world.
Lookahead policies
Lookahead models use five classes of approximations:» Horizon truncation – Replacing a longer horizon problem
with a shorter horizon» Stage aggregation – Replacing multistage problems with
two-stage approximation.» Outcome aggregation/sampling – Simplifying the
exogenous information process» Discretization – Of time, states and decisions» Dimensionality reduction – We may ignore some variables
(such as forecasts) in the lookahead model that we capture in the base model (these become latent variables in the lookahead model).
Lookahead policies
Lookahead policies are the trickiest to model:» We create “tilde variables” for the lookahead model:
» All variables are indexed by t (when the lookaheadmodel is being generated) and t’ (the time within the lookahead model).
, '
, '
, , 1 ,
Approximated state variable (e.g coarse discretization)Decision we plan on implementing at time ' when we are
planning at time , ' , 1,...,
, ,...,
t t
t t
t t t t t t t H
Sx t
t t t t t H
x x x x
, '
, '
, '
Approximation of information processForecast of costs at time ' made at time
Forecast of right hand sides for time ' made at time
t t
t t
t t
Wc t t
b t t
Lookahead policies
We can use this notation to create a policy based on our lookahead model:
» Simplest lookahead is deterministic.
*' ' ' , 1
' 1( ) arg max ( , ) max ( , ( )) | | ,
t H
t t t t tt tt tt t t t tt t
X S C S x C S X S S S x
Restricted/simplified set of policies
Simplified/discretized set of state variables
Simplified/discretized set of decision variables
Sampled set of realizations (or deterministic);Aggregated staging of decisions and information
Limited horizon
Lookahead policies
Deterministic lookahead
Stochastic lookahead (with two-stage approximation)
' '' 1
( ) arg max ( , ) ( , )T
LA Dt t tt tt tt tt
t t
X S C S x C S x
xtt , xt ,t1,..., xt ,tT
' '' 1
( ) arg max ( , ) ( ) ( ( ), ( ))t
TLA St t tt tt tt tt
t tX S C S x p C S x
xtt , xt ,t1,..., xt ,tT
Scenario trees
Lookahead policies
Lookahead policies peek into the future» Optimize over deterministic lookahead model
The real process
. . . .
The
look
ahea
dm
odel
t 1t 2t 3t
Lookahead policies
Lookahead policies peek into the future» Optimize over deterministic lookahead model
The real process
. . . .
The
look
ahea
dm
odel
t 1t 2t 3t
Lookahead policies
Lookahead policies peek into the future» Optimize over deterministic lookahead model
The real process
. . . .
The
look
ahea
dm
odel
t 1t 2t 3t
Lookahead policies
Lookahead policies peek into the future» Optimize over deterministic lookahead model
The real process
. . . .
The
look
ahea
dm
odel
t 1t 2t 3t
Lookahead policies
Stochastic lookahead» Here, we approximate the information model by using a
Monte Carlo sample to create a scenario tree: 1am 2am 3am 4am 5am …..
Change in wind speed
Change in wind speed
Change in wind speed
Lookahead policies
We can then simulate this lookahead policy over time:
. . . .
t 1t 2t 3t
The
look
ahea
dm
odel
The base model
. . . .
t 1t 2t 3t
Lookahead policies
We can then simulate this lookahead policy over time:
The
look
ahea
dm
odel
The base model
. . . .
t 1t 2t 3t
Lookahead policies
We can then simulate this lookahead policy over time:
The
look
ahea
dm
odel
The base model
. . . .
t 1t 2t 3t
Lookahead policies
We can then simulate this lookahead policy over time:
The
look
ahea
dm
odel
The base model
Learning damaged networks
XX
X
134callcall
call
call
134
call
call
Which way should the truck go? X
X
X
call
Decision Outcome Decision Outcome Decision
Lookahead policies
Monte Carlo tree search:
C. Browne, E. Powley, D. Whitehouse, S. Lucas, P. I. Cowling, P. Rohlfshagen, S. Tavener, D. Perez, S. Samothrakis and S. Colton, “A survey of Monte Carlo tree search methods,” IEEE Transactions on Computational Intelligence and AI in Games, vol. 4, no. 1, pp. 1–49, March 2012.
Lookahead policies
Notes:
» Solving stochastic lookahead policies can be hard!
» … but this is still just a lookahead policy which is a class of rolling horizon heuristic.
» Even if solving the lookahead model is hard, an optimal solution of a lookahead model (even a stochastic one) is (with rare exceptions) not an optimal policy.
Outline
The four classes of policies» Policy function approximations (PFAs)» Cost function approximations (CFAs)» Value function approximations (VFAs)» Direct lookahead policies (DLAs)» A hybrid lookahead/CFA
Parametric cost function approximation
An energy storage problem:
Parametric cost function approximationForecasts evolve over time as new information arrives:
Actual
Rolling forecasts, updated each hour.
Forecast made at midnight:
Parametric cost function approximation
Benchmark policy – Deterministic lookahead
' ' ' '
' ' '
' ' '
max' ' '
' ' 'arg
' 'arg
' '
wd rd gd Dtt tt tt tt
gd gr Gtt tt tt
rd rgtt tt tt
wr grtt tt ttwr wd Ett tt ttwr gr ch ett ttrd rg disch ett tt
x x x f
x x f
x x R
x x R R
x x f
x x
x x
b
g
g
+ + £
+ £
+ £
+ £ -
+ £
+ £
+ £
Parametric cost function approximation
Parametric cost function approximations» Replace the constraint
with:» Lookup table modified forecasts (one adjustment term for
each time in the future):
» Exponential function for adjustments (just two parameters)
» Constant adjustment (one parameter)
't t
' ' ' 'wr wd Ett tt t t ttx x f
'wrttx '
wdttx
2 ( ' )
' ' 1 '
t twr wd Ett tt ttx x e f
' ' 'wr wd Ett tt ttx x f
Parametric cost function approximation
Optimizing the CFA:» Let be a simulation of our policy given by
» We then compute the gradient with respect to
» The parameter is found using a classical stochastic gradient algorithm:
We tested several stepsize formulas and found that ADAGRAD worked best:
( , ) ( , )F F
( , )F
0
( , ) ( ), ( ( ) | )T
t t tt
F C S X S
1 1( , )n n n nnF
( )21
' 0
( , )t
n t x t ttt
G F x WGh
ae +
=
= = +
å
Parametric cost function approximation
Optimizing the CFA:» We compute the gradient by applying the chain rule
where the interaction from one time period to the next is captured using
» Assuming there are no integer variables, these equations are quite easy to compute.
0fs = 10fs =
20fs = 30fs =Lookup table
Constant parameterExponential function
Parametric cost function approximation
Improvement over deterministic benchmark:
Lookup tableExponential
Constant
Parametric cost function approximation
The parametric CFA represents a fundamental rethinking of the modeling of stochastic programming problems:» From thinking of the lookahead model as the objective
function:
» To acknowledging that the lookahead model is a policy for solving the base model…
…. which is a simulator where we do not have to make any of the standard approximations required in stochastic programming.
0 01
max ( ) ( ) ( )T
t tt
c x p c x
1 00
max , ( | ), |T
t t t t tt
E C S X S W S
An energy storage problem
Consider a basic energy storage problem:
» We are going to show that with minor variations in the characteristics of this problem, we can make each class of policy work best.
An energy storage problem
We can create distinct flavors of this problem:» Problem class 1 – Best for PFAs
• Highly stochastic (heavy tailed) electricity prices• Stationary data
» Problem class 2 – Best for CFAs• Stochastic prices and wind (but not heavy tailed)• Stationary data
» Problem class 3 - Best for VFAs• Stochastic wind and prices (but not too random)• Time varying loads, but inaccurate wind forecasts
» Problem class 4 – Best for deterministic lookaheads• Relatively low noise problem with accurate forecasts
» Problem class 5 – A hybrid policy worked best here• Stochastic prices and wind, nonstationary data, noisy forecasts.
An energy storage problemThe policies» The PFA:
• Charge battery when price is below p1• Discharge when price is above p2
» The CFA• Optimize over a horizon H; maintain upper and lower bounds (u, l)
for every time period except the first (note that this is a hybrid with a lookahead).
» The VFA• Piecewise linear, concave value function in terms of energy, indexed
by time.» The lookahead (deterministic)
• Optimize over a horizon H (only tunable parameter) using forecasts of demand, prices and wind energy
» The lookahead CFA• Use a lookahead policy (deterministic), but with a tunable parameter
that improves robustness.
An energy storage problem
Each policy is best on certain problems» Results are percent of posterior optimal solution
» … any policy might be best depending on the data.Joint research with Prof. Stephan Meisel, University of Muenster, Germany.
Stochastic programming
Markov decision processes
Reinforcement learning
Optimal control
Model predictive
control
Robust optimization
Approximate dynamic
programming
Online computation
Simulation optimization
Stochastic search
Decision
analysis
Stochastic control
Simulation optimization
DynamicProgramming
andcontrol
Optimal learning
Banditproblems
© Warren Powell 2017
Multi-armed bandits/optimal learning
Reinforcementlearning/ADP
Stochastic search
Simulationoptimization
Stochastic programming
Optimal controlMarkov decisionprocesses
Lookahead policies (DLAs)
Stochastic gradients
© Warren Powell 2017
Ranking andselection
Policy search(PFAs)
Cost functionapprox. (CFAs)
Value function approx. (VFAs)
Theory
Applications
Computation
Modeling
Thank you!http://www.castlelab.princeton.edu/
A tutorial on this topic is available at the top of
http://www.castlelab.princeton.edu/jungle/