© 2008 Warren B. Powell Slide 1
Approximate Dynamic Programming: Solving the Curses of Dimensionality
INFORMS Computing Society Tutorial
October, 2008
Warren Powell
CASTLE Laboratory, Princeton University
http://www.castlelab.princeton.edu
© 2008 Warren B. Powell, Princeton University
The fractional jet ownership industry
NetJets Inc.
Schneider National
Planning for a risky world
Weather
•Robust design of emergency response networks.
•Design of financial instruments to hedge against weather emergencies to protect individuals, companies and municipalities.
•Design of sensor networks and communication systems to manage responses to major weather events.
Disease
•Models of disease propagation for response planning.
•Management of medical personnel, equipment and vaccines to respond to a disease outbreak.
•Robust design of supply chains to mitigate the disruption of transportation systems.
Energy management

Energy resource allocation
• What is the right mix of energy technologies?
• How should the use of different energy resources be coordinated over space and time?
• What should my energy R&D portfolio look like?
• Should I invest in nuclear energy?
• What is the impact of a carbon tax?

Energy markets
• How should I hedge energy commodities?
• How do I price energy assets?
• What is the right price for energy futures?
High value spare parts
Electric Power Grid
•PJM oversees an aging investment in high-voltage transformers.
•Replacement strategy needs to anticipate a bulge in retirements and failures.
•1-2 year lag times on orders. Typical cost of a replacement ~ $5 million.
•Failures vary widely in terms of economic impact on the network.
Spare parts for business jets
•ADP is used to determine purchasing and allocation strategies for over 400 high value spare parts.
•Inventory strategy has to determine what to buy, when and where to store it. Many parts are very low volume (e.g. 7 spares spread across 15 service centers).
•Inventories have to meet global targets on level of service and inventory costs.
Challenges

Real-time control
» Scheduling aircraft, pilots, generators, tankers
» Pricing stocks, options
» Electricity resource allocation

Near-term tactical planning
» Can I accept a customer request?
» Should I lease equipment?
» How much energy can I commit with my windmills?

Strategic planning
» What is the right equipment mix?
» What is the value of this contract?
» What is the value of more reliable aircraft?
Outline

A resource allocation model
An introduction to ADP
ADP and the post-decision state variable
A blood management example
Hierarchical aggregation
Stepsizes
How well does it work?
Some applications
» Transportation
» Energy
A resource allocation model
Attribute vectors:

$$a = \begin{bmatrix}\text{Location}\\ \text{ETA}\\ \text{A/C type}\\ \text{Fuel level}\\ \text{Home shop}\\ \text{Crew}\\ \text{Eqpt1}\\ \vdots\\ \text{Eqpt100}\end{bmatrix}
\qquad
\begin{bmatrix}\text{Location}\\ \text{ETA}\\ \text{Home}\\ \text{Experience}\\ \text{Driving hours}\end{bmatrix}
\qquad
\begin{bmatrix}\text{Type}\\ \text{Location}\\ \text{Age}\end{bmatrix}
\qquad
\begin{bmatrix}\text{Asset class}\\ \text{Time invested}\end{bmatrix}$$
A resource allocation model
Modeling resources:
» The attributes of a single resource: $a \in \mathcal{A}$, where $a$ is the attribute vector of a single resource and $\mathcal{A}$ is the attribute space.
» The resource state vector: $R_{ta}$ = the number of resources with attribute $a$, and $R_t = (R_{ta})_{a \in \mathcal{A}}$.
» The information process: $\hat{R}_{ta}$ = the change in the number of resources with attribute $a$.
A resource allocation model
Modeling demands:
» The attributes of a single demand: $b \in \mathcal{B}$, where $b$ is the attribute vector of a demand to be served and $\mathcal{B}$ is the attribute space.
» The demand state vector: $D_{tb}$ = the number of demands with attribute $b$, and $D_t = (D_{tb})_{b \in \mathcal{B}}$.
» The information process: $\hat{D}_{tb}$ = the change in the number of demands with attribute $b$.
Energy resource modeling
The system state:

$S_t = (R_t, D_t, \rho_t)$ = system state, where:
» $R_t$ = resource state (how much capacity, reserves)
» $D_t$ = market demands
» $\rho_t$ = "system parameters":
• state of the technology (costs, performance)
• climate, weather (temperature, rainfall, wind)
• government policies (tax rebates on solar panels)
• market prices (oil, coal)
Energy resource modeling
The decision variable:

$x_t$ = (new capacity, retired capacity), for each:
» Type
» Location
» Technology
Energy resource modeling
Exogenous information:

New information = $W_t = (\hat{R}_t, \hat{D}_t, \hat{\rho}_t)$, where:
» $\hat{R}_t$ = exogenous changes in capacity, reserves
» $\hat{D}_t$ = new demands for energy from each source
» $\hat{\rho}_t$ = exogenous changes in parameters
Energy resource modeling
The transition function:

$$S_{t+1} = S^M(S_t, x_t, W_{t+1})$$
A resource allocation model
[Figure: a network of resources assigned to demands]
A resource allocation model
[Figure: the resource network unfolding over times t, t+1, t+2]
A resource allocation model
[Figure: the network over times t, t+1, t+2]

Optimizing at a point in time vs. optimizing over time.
Outline

A resource allocation model
An introduction to ADP
ADP and the post-decision state variable
A blood management example
Hierarchical aggregation
Stepsizes
How well does it work?
Some applications
» Transportation
» Energy
Introduction to ADP
We just solved Bellman’s equation:
» We found the value of being in each state by stepping backward through the tree.
$$V_t(S_t) = \max_{x_t \in \mathcal{X}_t} \left( C_t(S_t, x_t) + E\{V_{t+1}(S_{t+1}) \mid S_t\} \right)$$
Introduction to ADP
The challenge of dynamic programming:
Problem: Curse of dimensionality
$$V_t(S_t) = \max_{x_t \in \mathcal{X}_t} \left( C_t(S_t, x_t) + E\{V_{t+1}(S_{t+1}) \mid S_t\} \right)$$
Three curses:
» State space
» Outcome space
» Action space (feasible region)
Introduction to ADP
The computational challenge:

$$V_t(S_t) = \max_{x_t \in \mathcal{X}_t} \left( C_t(S_t, x_t) + E\{V_{t+1}(S_{t+1}) \mid S_t\} \right)$$

» How do we find $V_{t+1}(S_{t+1})$?
» How do we compute the expectation?
» How do we find the optimal solution?
Introduction to ADP
Classical ADP
» Most applications of ADP focus on the challenge of handling multidimensional state variables.
» Start with

$$V_t(S_t) = \max_{x_t \in \mathcal{X}_t} \left( C_t(S_t, x_t) + E\{V_{t+1}(S_{t+1}) \mid S_t\} \right)$$

» Now replace the value function with some sort of approximation:

$$V_t(S_t) \approx \bar{V}_t(S_t)$$
Introduction to ADP
Approximating the value function:
» We have to exploit the structure of the value function (e.g. concavity).
» We might approximate the value function using a simple polynomial:

$$\bar{V}_t(S_t \mid \theta) = \theta_0 + \theta_1 S_t + \theta_2 S_t^2$$

» ... or a complicated one:

$$\bar{V}_t(S_t \mid \theta) = \theta_0 + \theta_1 S_t + \theta_2 S_t^2 + \theta_3 \ln S_t + \theta_4 \sin(S_t)$$

» Sometimes, they get really messy:
[The slide shows a lengthy approximation $\bar{V}_t(\cdot \mid \theta)$ built from coefficient blocks $\theta^{(0)}, \theta^{(1)}, \dots, \theta^{(5)}$ multiplying sums, products and ratios of the resource variables; the full expression is not recoverable from the extraction.]
Introduction to ADP
We can write a model of the observed value of being in a state as:

$$\hat{v} = \theta_0 + \theta_1 S_t + \theta_2 S_t^2 + \theta_3 \ln S_t + \theta_4 \sin(S_t) + \varepsilon$$

This is often written as a generic regression model:

$$Y = \theta_0 + \theta_1 X_1 + \theta_2 X_2 + \theta_3 X_3 + \theta_4 X_4$$

The ADP community refers to the independent variables as basis functions:

$$Y(S) = \theta_0 \varphi_0(S) + \theta_1 \varphi_1(S) + \theta_2 \varphi_2(S) + \theta_3 \varphi_3(S) + \theta_4 \varphi_4(S) = \sum_{f \in \mathcal{F}} \theta_f \varphi_f(S)$$

The $\varphi_f(S)$ are also known as features.
Introduction to ADP
Methods for estimating $\theta$:
» Generate observations $\hat{v}^1, \hat{v}^2, \ldots, \hat{v}^N$, and use traditional regression methods to fit $\theta$.
» Stochastic gradient for updating $\theta^n$:

$$\theta^n = \theta^{n-1} - \alpha_{n-1} \left( \bar{V}(S^n \mid \theta^{n-1}) - \hat{v}^n \right) \nabla_\theta \bar{V}(S^n \mid \theta^{n-1})
= \theta^{n-1} - \alpha_{n-1} \underbrace{\left( \bar{V}(S^n \mid \theta^{n-1}) - \hat{v}^n \right)}_{\text{Error}} \underbrace{\begin{pmatrix} \varphi_1(S^n) \\ \varphi_2(S^n) \\ \vdots \\ \varphi_F(S^n) \end{pmatrix}}_{\text{Basis functions}}$$
Introduction to ADP
Methods for estimating $\theta$:
» Recursive statistics:

$$\theta^n = \theta^{n-1} - H^n x^n \varepsilon^n \qquad (\varepsilon^n \text{ is the error})$$

• Iterative equations avoid the matrix inverse. With $B^n = [(X^n)^T X^n]^{-1}$ (an $F \times F$ matrix):

$$H^n = \frac{1}{\gamma^n} B^{n-1}$$

$$B^n = B^{n-1} - \frac{1}{\gamma^n} B^{n-1} x^n (x^n)^T B^{n-1}$$

$$\gamma^n = 1 + (x^n)^T B^{n-1} x^n$$
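A pure-Python sketch of these recursive equations for a small number of basis functions: $B$ plays the role of $[(X)^T X]^{-1}$, and the Sherman-Morrison-style update avoids any explicit matrix inversion. The diffuse initialization $B^0 = 100\,I$ is an assumption (it acts like a very weak prior).

```python
def rls_update(theta, B, x, y):
    """One recursive least squares step for observation (x, y):
    gamma^n = 1 + x^T B^{n-1} x
    theta^n = theta^{n-1} - (1/gamma^n) B^{n-1} x * error
    B^n     = B^{n-1} - (1/gamma^n) (B^{n-1} x)(B^{n-1} x)^T
    """
    F = len(theta)
    Bx = [sum(B[i][j] * x[j] for j in range(F)) for i in range(F)]
    gamma = 1.0 + sum(x[i] * Bx[i] for i in range(F))
    error = sum(theta[i] * x[i] for i in range(F)) - y
    theta = [theta[i] - (Bx[i] / gamma) * error for i in range(F)]
    B = [[B[i][j] - Bx[i] * Bx[j] / gamma for j in range(F)] for i in range(F)]
    return theta, B

# Fit y = 2 + 3*S from four noise-free observations with basis (1, S).
theta = [0.0, 0.0]
B = [[100.0, 0.0], [0.0, 100.0]]    # diffuse prior; shrinks as data arrives
for S, y in [(0.0, 2.0), (1.0, 5.0), (2.0, 8.0), (3.0, 11.0)]:
    theta, B = rls_update(theta, B, [1.0, S], y)

print([round(t, 2) for t in theta])  # close to [2, 3]
```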
Introduction to ADP
Other statistical methods:
» Regression trees
• Combine regression with techniques for discrete variables.
» Data mining
• Good for categorical data.
» Neural networks
• Engineers like these for low-dimensional continuous problems.
Introduction to ADP
Notes:
» When approximating value functions, we are basically drawing on the entire field of statistics.
» Choosing an approximation is primarily an art.
» In special cases, the resulting algorithm can produce optimal solutions.
» Most of the time, we are hoping for "good" solutions.
» In some cases, it can work terribly.
» As a general rule, you have to use problem structure. Value function approximations have to capture the right structure; blind use of polynomials will rarely be successful.
Introduction to ADP
What you will struggle with:
» Stepsizes
• Can't live with 'em, can't live without 'em.
• Too small, and you think you have converged when you have really just stalled ("apparent convergence").
• Too large, and the system is unstable.
» Stability
• There are two sources of randomness: the traditional exogenous randomness, and an evolving policy.
» Exploration vs. exploitation
• You sometimes have to choose to visit a state just to collect information about the value of being in that state.
Introduction to ADP
But we are not out of the woods...
» Assume we have an approximate value function.
» We still have to solve a problem that looks like:

$$V_t(S_t) = \max_{x \in \mathcal{X}} \left( C_t(S_t, x) + E \sum_{f \in \mathcal{F}} \theta_f \phi_f(S_{t+1}) \right)$$

» This means we still have to deal with a maximization problem (which might be a linear, nonlinear or integer program) with an expectation.
Outline

A resource allocation model
An introduction to ADP
ADP and the post-decision state variable
A blood management example
Hierarchical aggregation
Stepsizes
How well does it work?
Some applications
» Transportation
» Energy
A decision tree example: deciding whether to schedule or cancel a game, with the option of first using a weather report.

First decision: use the weather report, or do not use it.
» Without the report, the prior probabilities are: Rain .2, Clouds .3, Sun .5.
» With the report, the forecast is Sunny (.6), Cloudy (.3) or Rain (.1), with conditional probabilities:
• Forecast rain: Rain .8, Clouds .2, Sun .0
• Forecast cloudy: Rain .1, Clouds .5, Sun .4
• Forecast sunny: Rain .1, Clouds .2, Sun .7

Second decision: schedule the game or cancel it. If we schedule: Rain -$2000, Clouds $1000, Sun $5000. Canceling pays -$200 regardless of the weather.

In the tree, squares are decision nodes and circles are outcome nodes; the process alternates information, action, information, action, and the state captures what we know at each node.

Rolling back the tree, the expected value of scheduling at each outcome node is:
» Forecast rain: .8(-2000) + .2(1000) + .0(5000) = -$1400 (vs. -$200 to cancel)
» Forecast cloudy: .1(-2000) + .5(1000) + .4(5000) = $2300
» Forecast sunny: .1(-2000) + .2(1000) + .7(5000) = $3500
» No report: .2(-2000) + .3(1000) + .5(5000) = $2400

Taking the best decision at each decision node (-$200, $2300, $3500, $2400) and then the expectation over forecasts:
» Use the weather report: .6($3500) + .3($2300) + .1(-$200) = $2770
» Do not use the weather report: $2400
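The rollback can be reproduced in a few lines of Python; the probabilities and payoffs are taken from the tree above.

```python
PAYOFF = {"rain": -2000, "clouds": 1000, "sun": 5000}   # payoff if we schedule
CANCEL = -200                                            # payoff if we cancel

def best_decision(p):
    """Value at a decision node: schedule (expected payoff) or cancel."""
    schedule = sum(p[w] * PAYOFF[w] for w in p)
    return max(schedule, CANCEL)

forecast = {                       # P(weather | forecast)
    "sunny":  {"rain": .1, "clouds": .2, "sun": .7},
    "cloudy": {"rain": .1, "clouds": .5, "sun": .4},
    "rain":   {"rain": .8, "clouds": .2, "sun": .0},
}
p_forecast = {"sunny": .6, "cloudy": .3, "rain": .1}
prior = {"rain": .2, "clouds": .3, "sun": .5}

use_report = sum(p_forecast[f] * best_decision(forecast[f]) for f in forecast)
no_report = best_decision(prior)
print(round(use_report), round(no_report))   # 2770 2400
```

The difference, $370, is the value of the information in the weather report.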
The post-decision state
The post-decision state
Managing blood inventories over time:

$$S_0 \xrightarrow{\,x_0\,} S_0^x \xrightarrow{\,\hat{R}_1, \hat{D}_1\,} S_1 \xrightarrow{\,x_1\,} S_1^x \xrightarrow{\,\hat{R}_2, \hat{D}_2\,} S_2 \xrightarrow{\,x_2\,} S_2^x \xrightarrow{\,\hat{R}_3, \hat{D}_3\,} S_3 \xrightarrow{\,x_3\,} S_3^x \cdots$$

with decisions $x_t$ made each week ($t = 0$ is week 0, $t = 1$ is week 1, and so on).
The post-decision state
What happens if we use Bellman's equation?
» The state variable is:
• The supply of each type of blood (8 blood types, 6 ages)
• The demand for each type of blood (8 blood types)
• A 56-dimensional state vector
» The decision variable is how much of 8 blood types to supply to 8 demand types:
• A 162-dimensional decision vector
» Random information:
• Blood donations by week (8 types)
• New demands for blood (8 types)
• A 16-dimensional information vector
[The decision tree is shown again, annotated with Bellman's equation

$$V_t(S_t) = \max_{x_t \in \mathcal{X}_t} \left( C_t(S_t, x_t) + E\{V_{t+1}(S_{t+1}) \mid S_t\} \right)$$

where $S_t$ is the current decision node and $S_{t+1}$ the next one. Decision nodes correspond to pre-decision states; outcome nodes correspond to post-decision states.]
The post-decision state
New concept:
» The "pre-decision" state variable:
• $S_t$ = the information required to make a decision $x_t$.
• Same as a "decision node" in a decision tree.
» The "post-decision" state variable:
• $S_t^x$ = the state of what we know immediately after we make a decision.
• Same as an "outcome node" in a decision tree.
The post-decision state
An inventory problem:
» Our basic inventory equation:

$$R_{t+1} = \max\{0, R_t + x_t - \hat{D}_{t+1}\}$$

where $R_t$ = inventory at time $t$, $x_t$ = the amount we ordered, and $\hat{D}_{t+1}$ = demand in the next time period.

» Using pre- and post-decision states:

$$R_t^x = R_t + x_t \qquad \text{(post-decision state)}$$
$$R_{t+1} = \max\{0, R_t^x - \hat{D}_{t+1}\} \qquad \text{(pre-decision state)}$$
The post-decision state
Pre-decision state, state-action pair, and post-decision state:

Pre-decision state → action → post-decision state: 93 states, 93 × 9 state-action pairs, 93 states.
The post-decision state
Pre-decision state: resources and demands.

$$S_t = (R_t, D_t)$$
$$S_t^x = S^{M,x}(S_t, x_t)$$
The post-decision state
The post-decision state
$$W_{t+1} = (\hat{R}_{t+1}, \hat{D}_{t+1}) \qquad S_{t+1} = S^{M,W}(S_t^x, W_{t+1})$$
The post-decision state
The next pre-decision state: $S_{t+1}$.
The post-decision state
Classical form of Bellman's equation:

$$V_t(S_t) = \max_{x_t \in \mathcal{X}_t} \left( C_t(S_t, x_t) + E\{V_{t+1}(S_{t+1}) \mid S_t\} \right)$$

Bellman's equations around pre- and post-decision states:
» Optimization problem (making the decision):

$$V_t(S_t) = \max_{x_t} \left( C_t(S_t, x_t) + V_t^x\big(S^{M,x}(S_t, x_t)\big) \right)$$

• Note: this problem is deterministic!
» Simulation problem (the effect of exogenous information):

$$V_t^x(S_t^x) = E\left\{ V_{t+1}\big(S^{M,W}(S_t^x, W_{t+1})\big) \mid S_t^x \right\}$$
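The two equations can be made concrete on a scalar inventory model. The price, order cost and demand distribution below are illustrative assumptions, with revenue booked when demand is revealed: the optimization step involves no expectation, while the simulation step does.

```python
P_D = {0: 0.3, 1: 0.4, 2: 0.3}          # demand distribution (assumed)
PRICE, COST, MAX_INV = 5.0, 3.0, 6

def simulation_step(V_next):
    """V^x(R^x) = E[ PRICE*min(R^x, D) + V_{t+1}(max(0, R^x - D)) ]:
    the expectation over the exogenous demand."""
    return [sum(p * (PRICE * min(rx, d) + V_next[max(0, rx - d)])
                for d, p in P_D.items())
            for rx in range(MAX_INV + 1)]

def optimization_step(Vx):
    """V(R) = max_x ( -COST*x + V^x(R + x) ): deterministic given V^x."""
    return [max(-COST * x + Vx[r + x] for x in range(MAX_INV - r + 1))
            for r in range(MAX_INV + 1)]

V_next = [0.0] * (MAX_INV + 1)           # terminal values
Vx = simulation_step(V_next)             # post-decision values
V = optimization_step(Vx)                # pre-decision values
print([round(v, 2) for v in V])          # [0.5, 3.5, 5.0, 5.0, 5.0, 5.0, 5.0]
```

Stepping these two maps backward over a horizon is exact dynamic programming; ADP replaces the expectation in `simulation_step` with sampled updates.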
The post-decision state
Challenges:
» For most practical problems, we are not going to be able to compute $V_t^x(S_t^x)$ in

$$V_t(S_t) = \max_x \left( C_t(S_t, x) + V_t^x(S_t^x) \right)$$

» Concept: replace it with an approximation $\bar{V}_t(S_t^x)$ and solve

$$V_t(S_t) = \max_x \left( C_t(S_t, x) + \bar{V}_t(S_t^x) \right)$$

» So now we face:
• What should the approximation look like?
• How do we estimate it?
The post-decision state
Value function approximations:
» Linear (in the resource state):

$$\bar{V}_t(R_t^x) = \sum_{a \in \mathcal{A}} \bar{v}_{ta} R_{ta}^x$$

» Piecewise linear, separable:

$$\bar{V}_t(R_t^x) = \sum_{a \in \mathcal{A}} \bar{V}_{ta}(R_{ta}^x)$$

» Indexed piecewise linear, separable:

$$\bar{V}_t(R_t^x) = \sum_{a \in \mathcal{A}} \bar{V}_{ta}\big(R_{ta}^x \mid \text{features}(S_t)\big)$$
The post-decision state
Value function approximations:
» Ridge regression (Klabjan and Adelman):

$$\bar{V}_t(R_t) = \sum_{f \in \mathcal{F}} \bar{V}_{tf}(R_{tf}), \qquad R_{tf} = \sum_{a \in \mathcal{A}} f_a R_{ta}$$

» Benders cuts. [Figure: cuts approximating $V_t(R_t)$ as a function of the decisions $x_0$, $x_1$]
The post-decision state
Comparison to other methods:
» Classical MDP (value iteration):

$$V^n(S) = \max_x \left( C(S, x) + \gamma\, E\, V^{n-1}(S') \right)$$

» Classical ADP (pre-decision state):

$$\hat{v}_t^n = \max_x \left( C(S_t^n, x) + \sum_{s'} p(s' \mid S_t^n, x)\, \bar{V}_{t+1}^{n-1}(s') \right)$$
$$\bar{V}_t^n(S_t^n) = (1 - \alpha_{n-1}) \bar{V}_t^{n-1}(S_t^n) + \alpha_{n-1} \hat{v}_t^n$$

The expectation must still be computed explicitly, and $\hat{v}_t^n$ updates $\bar{V}_t(S_t^n)$ around the pre-decision state.

» Our method (update around the post-decision state):

$$\hat{v}_t^n = \max_x \left( C(S_t^n, x) + \bar{V}_t^{x,n-1}\big(S^{M,x}(S_t^n, x)\big) \right)$$
$$\bar{V}_{t-1}^{x,n}(S_{t-1}^{x,n}) = (1 - \alpha_{n-1}) \bar{V}_{t-1}^{x,n-1}(S_{t-1}^{x,n}) + \alpha_{n-1} \hat{v}_t^n$$

Here $\hat{v}_t^n$ updates $\bar{V}_{t-1}^x(S_{t-1}^{x,n})$ around the previous post-decision state, and no expectation appears inside the max.
The post-decision state
Step 1: Start with a pre-decision state $S_t^n$.
Step 2: Solve the deterministic optimization using an approximate value function:

$$\hat{v}_t^n = \max_x \left( C(S_t^n, x) + \bar{V}^{n-1}\big(S^{M,x}(S_t^n, x)\big) \right)$$

to obtain $x_t^n$. (Deterministic optimization.)
Step 3: Update the value function approximation (recursive statistics):

$$\bar{V}_{t-1}^{x,n}(S_{t-1}^{x,n}) = (1 - \alpha_{n-1}) \bar{V}_{t-1}^{x,n-1}(S_{t-1}^{x,n}) + \alpha_{n-1} \hat{v}_t^n$$

Step 4: Obtain a Monte Carlo sample of $W_t(\omega^n)$ (simulation) and compute the next pre-decision state:

$$S_{t+1}^n = S^M(S_t^n, x_t^n, W_{t+1}(\omega^n))$$

Step 5: Return to Step 1.
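The five steps above can be sketched for a scalar inventory problem with a lookup-table value function over the post-decision state. Every number here (price, cost, demand distribution, horizon) is an illustrative assumption; the stepsize is $1/n$ in the number of visits.

```python
import random

random.seed(0)
T, N = 20, 2000
PRICE, COST, MAX_ORDER, MAX_INV = 6.0, 4.0, 5, 15
V = [[0.0] * (MAX_INV + 1) for _ in range(T + 1)]   # V[t][Rx]: post-decision values
count = [[0] * (MAX_INV + 1) for _ in range(T + 1)]

for n in range(N):
    R, D = 3, random.randint(0, 4)      # Step 1: pre-decision state (R_t, D_t)
    prev = None                         # previous post-decision state (t-1, Rx)
    for t in range(T):
        sales = min(R, D)
        leftover = R - sales
        # Step 2: deterministic optimization using the current approximation.
        def q(x):
            Rx = min(leftover + x, MAX_INV)
            return PRICE * sales - COST * x + V[t][Rx]
        x = max(range(MAX_ORDER + 1), key=q)
        v_hat = q(x)
        # Step 3: smooth v_hat into the *previous* post-decision state.
        if prev is not None:
            tp, Rxp = prev
            count[tp][Rxp] += 1
            a = 1.0 / count[tp][Rxp]    # stepsize 1/n visits
            V[tp][Rxp] = (1 - a) * V[tp][Rxp] + a * v_hat
        # Step 4: Monte Carlo sample of demand -> next pre-decision state.
        Rx = min(leftover + x, MAX_INV)
        prev = (t, Rx)
        R, D = Rx, random.randint(0, 4)
        # Step 5: continue with the next time period / iteration.

print(round(max(V[0]), 1))   # value of the best post-decision inventory at t=0
```

Note that the expectation never appears: randomness enters only through the sampled demand, which is exactly the point of updating around the post-decision state.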
Approximate dynamic programming

Approximate dynamic programming combines simulation and optimization (e.g. Cplex) in a rigorous yet flexible framework.

Optimization
» Strengths: produces optimal decisions; mature technology.
» Weaknesses: cannot handle uncertainty; cannot handle high levels of complexity.

Simulation
» Strengths: extremely flexible; high level of detail; easily handles uncertainty.
» Weaknesses: models decisions using user-specified rules; low solution quality.
Outline

A resource allocation model
An introduction to ADP
ADP and the post-decision state variable
A blood management example
Hierarchical aggregation
Stepsizes
How well does it work?
Some applications
» Transportation
» Energy
Blood management
Managing blood inventories
Blood management
Managing blood inventories over time:

$$S_0 \xrightarrow{\,x_0\,} S_0^x \xrightarrow{\,\hat{R}_1, \hat{D}_1\,} S_1 \xrightarrow{\,x_1\,} S_1^x \xrightarrow{\,\hat{R}_2, \hat{D}_2\,} S_2 \xrightarrow{\,x_2\,} S_2^x \xrightarrow{\,\hat{R}_3, \hat{D}_3\,} S_3 \xrightarrow{\,x_3\,} S_3^x \cdots$$
[Figure sequence: the blood assignment network over one week.
» The pre-decision state is $S_t = (R_t, \hat{D}_t)$: supplies $R_{t,(a,0)}, R_{t,(a,1)}, R_{t,(a,2)}$ for each blood type $a \in \{$AB+, AB-, A+, A-, B+, B-, O+, O-$\}$ and age, and demands $\hat{D}_{t,a}$ by type.
» Each unit of blood either satisfies a demand or is held, producing the post-decision inventories $R_t^x$ with ages incremented (e.g. AB+,0 through AB+,3; O-,0 through O-,3).
» New donations $\hat{R}_{t+1}$ then arrive to form the next pre-decision supplies $R_{t+1}$.
» The one-period assignment problem $F_t(R_t)$ is solved as a linear program.
» The dual variables $\hat{\nu}_{t,(a,\text{age})}$ give the value of an additional unit of blood of each type and age.]
Updating the value function approximation
Estimate the gradient at $R_{t,(AB+,2)}^n$: the dual variable $\hat{\nu}_{t,(AB+,2)}^n$ gives the slope of $F_t(R)$ at the current resource level $R_t^n$.
Updating the value function approximation

Update the value function at the previous post-decision state $R_{t-1}^{x,n}$: the dual estimate $\hat{\nu}_{t,(AB+,2)}^n$ is smoothed into the slopes of $\bar{V}_{t-1}^{n-1}(R_{t-1}^x)$ to produce the updated approximation $\bar{V}_{t-1}^n(R_{t-1}^x)$. [Figures: the piecewise linear approximation before and after the update.]
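A sketch of this update for a piecewise linear, concave approximation in one resource dimension, in the spirit of the SPAR algorithm: the function is stored as slopes (the marginal value of each unit), a dual estimate is smoothed into one slope, and a pool-adjacent-violators projection restores concavity (nonincreasing slopes). The numbers are illustrative.

```python
def project_concave(v):
    """Pool adjacent violators: least-squares projection of the slope vector
    onto nonincreasing sequences, i.e. onto concave piecewise linear functions."""
    blocks = []                                   # [sum, count] per merged block
    for x in v:
        blocks.append([x, 1])
        while (len(blocks) > 1 and
               blocks[-2][0] / blocks[-2][1] < blocks[-1][0] / blocks[-1][1]):
            s, c = blocks.pop()                   # merge violating neighbors
            blocks[-1][0] += s
            blocks[-1][1] += c
    out = []
    for s, c in blocks:
        out.extend([s / c] * c)
    return out

def update_slopes(v, R, nu_hat, alpha):
    """Smooth the dual estimate nu_hat into the slope at R, then re-project."""
    v = v[:]
    v[R] = (1 - alpha) * v[R] + alpha * nu_hat
    return project_concave(v)

v = [10.0, 8.0, 6.0, 4.0, 2.0]                    # marginal values, decreasing
v = update_slopes(v, R=3, nu_hat=9.0, alpha=0.5)
print(v)  # [10.0, 8.0, 6.25, 6.25, 2.0]
```

The smoothed slope at $R=3$ rises to 6.5, breaking concavity against its left neighbor; the projection levels the two slopes at 6.25.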
Iterative learning

[Figures: forward simulations through the network, repeated over iterations to learn the value functions.]
Approximate dynamic programming

With luck, the objective function will improve steadily... [Chart: objective function vs. iteration, rising smoothly from about 97 to 100]

... but performance can be jagged. [Chart: objective function over iterations 300-1000, fluctuating between roughly 1.70 and 1.90 million]
Outline

A resource allocation model
An introduction to ADP
ADP and the post-decision state variable
A blood management example
Hierarchical aggregation
Stepsizes
How well does it work?
Some applications
» Transportation
» Energy
Hierarchical aggregation
Attribute vectors:

$$a = \begin{bmatrix}\text{Location}\\ \text{ETA}\\ \text{A/C type}\\ \text{Fuel level}\\ \text{Home shop}\\ \text{Crew}\\ \text{Eqpt1}\\ \vdots\\ \text{Eqpt100}\end{bmatrix}
\qquad
\begin{bmatrix}\text{Location}\\ \text{ETA}\\ \text{Bus. segment}\\ \text{Single/team}\\ \text{Domicile}\\ \text{Drive hours}\\ \text{Days from home}\end{bmatrix}
\qquad
\begin{bmatrix}\text{Blood type}\\ \text{Age}\\ \text{Frozen?}\end{bmatrix}
\qquad
\begin{bmatrix}\text{Asset class}\\ \text{Time invested}\end{bmatrix}$$
Estimating value functions

Drivers may have very detailed attributes, while the value function approximation may have fewer attributes than the driver:

$$\bar{v}^n \begin{bmatrix}\text{Location}\\ \text{Fleet}\\ \text{Domicile}\end{bmatrix} = (1 - \alpha)\, \bar{v}^{n-1} \begin{bmatrix}\text{Location}\\ \text{Fleet}\\ \text{Domicile}\end{bmatrix} + \alpha\, \hat{v} \begin{bmatrix}\text{Location}\\ \text{Fleet}\\ \text{Domicile}\\ \text{DOThrs}\\ \text{DaysFromHome}\end{bmatrix}$$

For example: $\$2050 = (1 - 0.10) \times \$2000 + (0.10) \times \$2500$.
Hierarchical aggregation

Estimating value functions at three levels of aggregation:

» Most disaggregate level:

$$\bar{v}^n \begin{bmatrix}\text{Location}\\ \text{Fleet}\\ \text{Domicile}\end{bmatrix} = (1 - \alpha)\, \bar{v}^{n-1} \begin{bmatrix}\text{Location}\\ \text{Fleet}\\ \text{Domicile}\end{bmatrix} + \alpha\, \hat{v} \begin{bmatrix}\text{Location}\\ \text{Fleet}\\ \text{Domicile}\\ \text{DOThrs}\\ \text{DaysFromHome}\end{bmatrix}$$

» Middle level of aggregation:

$$\bar{v}^n \begin{bmatrix}\text{Location}\\ \text{Fleet}\end{bmatrix} = (1 - \alpha)\, \bar{v}^{n-1} \begin{bmatrix}\text{Location}\\ \text{Fleet}\end{bmatrix} + \alpha\, \hat{v} \begin{bmatrix}\text{Location}\\ \text{Fleet}\\ \text{Domicile}\\ \text{DOThrs}\\ \text{DaysFromHome}\end{bmatrix}$$

» Most aggregate level:

$$\bar{v}^n \big[\text{Location}\big] = (1 - \alpha)\, \bar{v}^{n-1} \big[\text{Location}\big] + \alpha\, \hat{v} \begin{bmatrix}\text{Location}\\ \text{Fleet}\\ \text{Domicile}\\ \text{DOThrs}\\ \text{DaysFromHome}\end{bmatrix}$$
State-dependent weighted aggregation:
» Now we have a linear regression for each attribute:

$$\bar{v}_a = \sum_g w_a^{(g)} \bar{v}_a^{(g)}$$

» We cannot solve hundreds of thousands of linear regressions. Instead, use weights inversely proportional to the estimated variance plus the squared estimated bias:

$$w_a^{(g)} \propto \left( \text{Var}\big(\bar{v}_a^{(g)}\big) + \big(\beta_a^{(g)}\big)^2 \right)^{-1}, \qquad \sum_g w_a^{(g)} = 1$$
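The weighting formula can be sketched directly. The variance and bias values below are illustrative placeholders for what would be estimated statistically from the observations at each aggregation level.

```python
def weighted_estimate(v, var, beta):
    """Combine estimates across aggregation levels g with weights proportional
    to 1 / (variance + bias^2), normalized to sum to one."""
    inv = [1.0 / (s + b * b) for s, b in zip(var, beta)]
    total = sum(inv)
    w = [i / total for i in inv]
    return sum(wi * vi for wi, vi in zip(w, v)), w

# Three levels, from disaggregate (low bias, high variance) to
# aggregate (low variance, high bias). All numbers are illustrative.
v = [2500.0, 2300.0, 2000.0]        # estimates of the same value at each level
var = [400.0, 100.0, 25.0]          # estimated variances
beta = [0.0, 10.0, 25.0]            # estimated biases
est, w = weighted_estimate(v, var, beta)
print(round(est, 1), [round(x, 3) for x in w])
```

The middle level wins the largest weight here because it balances the two error sources; as more observations arrive the variances shrink and the weights shift toward the disaggregate level.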
Hierarchical aggregation
[Chart: objective function (1.40-1.90 million) over 1000 iterations, comparing value functions estimated at the aggregate and disaggregate levels]
Hierarchical aggregation

[Chart: the same comparison with the weighted combination added; objective function over 1000 iterations for the Weighted Combination, Aggregate and Disaggregate estimates]
Outline

A resource allocation model
An introduction to ADP
ADP and the post-decision state variable
A blood management example
Hierarchical aggregation
Stepsizes
How well does it work?
Some applications
» Transportation
» Energy
Stepsizes
Stepsizes:
» Fundamental to ADP is an updating equation that looks like:

$$\underbrace{\bar{V}_{t-1}^{n}(S_{t-1}^{x,n})}_{\text{Updated estimate}} = (1 - \underbrace{\alpha_{n-1}}_{\text{Stepsize}})\, \underbrace{\bar{V}_{t-1}^{n-1}(S_{t-1}^{x,n})}_{\text{Old estimate}} + \alpha_{n-1}\, \underbrace{\hat{v}_t^n}_{\text{New observation}}$$

The stepsize is also known as a "learning rate" or "smoothing factor".
Stepsizes
[Charts: two regimes over roughly 100 iterations.
» High-noise data: the best stepsize declines toward $\alpha_n = 1/n$.
» Low-noise, changing signal: the best stepsize stays near $\alpha_n \approx 1$.]
Stepsizes
Theorem 1 (general knowledge):
» If the data is stationary, then 1/n is the best possible stepsize.

Theorem 2 (e.g. Tsitsiklis 1994):
» For general problems, 1/n is provably convergent.

Theorem 3 (Frazier and Powell):
» 1/n works so badly it should (almost) never be used.
Stepsizes
Lower bound on the value function after n iterations with a 1/n stepsize. [Chart: the bound converges extremely slowly, on a logarithmic iteration scale]
Stepsizes
The bias-adjusted Kalman filter (BAKF):

$$\alpha_n = 1 - \frac{\sigma^2}{(1 + \lambda^{n-1})\sigma^2 + (\beta^n)^2}$$

where:

$$\lambda^n = (1 - \alpha_n)^2 \lambda^{n-1} + \alpha_n^2$$

Here $\sigma^2$ is the estimate of the variance (the noise) and $\beta^n$ is the estimate of the bias.
» As the noise $\sigma^2$ increases, the stepsize decreases toward $1/n$.
» As the bias $\beta^n$ increases, the stepsize increases toward 1.
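A sketch of the BAKF recursion. In practice $\sigma^2$ and $\beta^n$ must themselves be estimated from the data; here they are fixed inputs so the limiting behavior is visible: with zero bias the rule reduces exactly to $1/n$, and with bias large relative to the noise it stays near 1.

```python
def bakf_stepsizes(sigma2, beta, N):
    """BAKF stepsizes for fixed noise variance sigma2 and bias beta:
    alpha_n  = 1 - sigma2 / ((1 + lambda^{n-1}) * sigma2 + beta^2)
    lambda^n = (1 - alpha_n)^2 * lambda^{n-1} + alpha_n^2
    """
    alphas, lam = [], 0.0
    for n in range(1, N + 1):
        a = 1.0 if n == 1 else 1.0 - sigma2 / ((1.0 + lam) * sigma2 + beta * beta)
        lam = (1.0 - a) ** 2 * lam + a * a
        alphas.append(a)
    return alphas

# Zero bias: collapses to 1/n (1, 1/2, 1/3, ...).
print([round(a, 3) for a in bakf_stepsizes(sigma2=1.0, beta=0.0, N=5)])
# Bias dominating the noise: stepsizes stay near 1.
print([round(a, 3) for a in bakf_stepsizes(sigma2=0.01, beta=1.0, N=5)])
```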
Stepsizes
The bias-adjusted Kalman filter:

[Chart: observed values over ~100 iterations, with the estimates produced by the BAKF stepsize rule and by the 1/n stepsize rule]

[Chart: a second series of observed values tracked by the BAKF stepsize rule]
Stepsizes

The BAKF formula can equivalently be written:

$$\alpha_n = \frac{\lambda^{n-1}\sigma^2 + (\beta^n)^2}{(1 + \lambda^{n-1})\sigma^2 + (\beta^n)^2}$$

Our newest formula:
» [The full expression, which involves a discount-related parameter in place of the bias term, is not recoverable from the extraction.]
» It is the first stepsize formula designed specifically for dynamic programming (note: no need to estimate the bias $\beta^n$).
» It captures the natural growth of a value function through iterative learning.
Stepsizes
I recommend:
» Start with a constant stepsize.
• Vary it, and get a sense of what works best.
» Next try a deterministic stepsize such as a/(a+n).
• Choose a so that the stepsize declines to roughly your best constant stepsize at an iteration comparable to where the constant-stepsize run seems to stabilize (might be 50 iterations, might be 5000).
» Then try an (optimal) adaptive stepsize rule.
• These can work very well if there is not too much noise.
• Adaptive rules work well when there is a need to keep the stepsize from declining too quickly (but you do not know how quickly).
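The a/(a+n) suggestion can be turned into a small helper: pick a so the stepsize passes through a target value at a target iteration. The numbers are illustrative.

```python
def harmonic(a, n):
    """Deterministic stepsize alpha_n = a / (a + n)."""
    return a / (a + n)

def choose_a(target_alpha, target_n):
    """Solve a/(a+n) = alpha for a:  a = alpha*n / (1 - alpha)."""
    return target_alpha * target_n / (1.0 - target_alpha)

# e.g. if a constant stepsize of 0.1 seemed to stabilize around iteration 50:
a = choose_a(0.1, 50)              # = 50/9, about 5.56
print(round(harmonic(a, 50), 3))   # 0.1 by construction
```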
Outline

A resource allocation model
An introduction to ADP
ADP and the post-decision state variable
A blood management example
Hierarchical aggregation
Stepsizes
How well does it work?
Some applications
» Transportation
» Energy
Two-stage optimization

Piecewise linear, separable value function approximations:

$$\bar{V}_t(R_t) = \sum_{l \in \mathcal{L}} \bar{V}_{tl}(R_{tl})$$
Two-stage optimization

Benders decomposition: [Figure: cuts under $V_t(R_t)$ as a function of $x_0$, $x_1$]

Multidimensional cuts produce a provably convergent, nonseparable value function approximation.
The competition

Exact solutions using Benders:
» "L-shaped" decomposition (Van Slyke and Wets)
» Stochastic decomposition (Higle and Sen)
» CUPPS (Chen and Powell)
The competition

[Chart: percent over optimal after 100 iterations for SD, L-shaped and CUPPS (Benders-based) and SPAR (separable), on problems with 10, 25, 50 and 100 locations; the percent error of the Benders methods grows, up to roughly 40%, as the problem size increases]
Multistage problems

Deterministic, single commodity flow
» ADP solution as a percent of the optimal solution (from Cplex).
Multistage problems

Stochastic, single commodity flow

[Chart: percent of posterior optimal (100 = optimal posterior bound) for a rolling horizon procedure vs. ADP, on instances (20,100), (20,200), (20,400), (40,100), (40,200), (40,400), (80,100), (80,200), (80,400)]
Multistage problems

Deterministic, (integer) multicommodity flow

[Chart: percent of optimal (100 = optimal continuous relaxation) across datasets Base, T_30, T_90, I_10, I_40, C_II, C_III, C_IV, R_1, R_5, R_100, R_400, C_1, C_8; values range roughly from 60 to 105]
Multistage problems

Stochastic, (integer) multicommodity flow

[Chart: percent of posterior optimal for a rolling horizon procedure vs. ADP across datasets Base, I_10, I_40, C_II, C_III, C_IV, R_1, R_5, R_100, R_400, C_1, C_8]
Outline

A resource allocation model
An introduction to ADP
ADP and the post-decision state variable
A blood management example
Hierarchical aggregation
Stepsizes
How well does it work?
Some applications
» Transportation
» Energy
Schneider National
[Charts: model calibration. Revenue per WU and utilization by capacity category (US_SOLO, US_IC, US_TEAM), comparing the calibrated model's simulation against the historical minimum and maximum.]
Schneider National
Princeton team: Warren Powell, Belgacem Bouzaiene-Ayari
NS team: Clark Cheng, Ricardo Fiorillo, Junxia Chang, Sourav Das
Convergence
Train coverage
Model calibration
Train delay curve – October 2007. [Chart: delay hours/day vs. fleet size]
Model calibration

Train delay curve – March 2008. [Chart: delay hours/day vs. fleet size]
Outline

A resource allocation model
An introduction to ADP
ADP and the post-decision state variable
A blood management example
Hierarchical aggregation
Stepsizes
How well does it work?
Some applications
» Transportation
» Energy
Energy resource modeling
Energy resource modeling
Hourly model
» Decisions at time t impact t+1 through the amount of water held in the reservoir. [Figure: hour t linked to hour t+1]
Energy resource modeling
Hourly model
» Decisions at time t impact t+1 through the amount of water held in the reservoir.
» The value of holding water in the reservoir for future time periods. [Figure: hour t, with a value function standing in for the future hours]
Energy resource modeling

[Figure: the hourly timeline — hours 1, 2, 3, 4, ..., 8760 of 2008, rolling into hours 1, 2, ... of 2009]
Annual energy model

Optimal: [Chart: reservoir level and demand over 1000 time periods; series include Precipitation, Hydro, Usage, Remaining, Wastage, Demand, Coal, Wind]
Annual energy model

Approximate dynamic programming: [Chart: reservoir level and demand over 1000 time periods; series include Precipitation, Usage, Remaining, Wastage, Demand, Coal, Wind]
Annual energy model

ADP vs. optimal reservoir levels for stochastic rainfall: [Chart: reservoir level over 800 time periods; the ADP solution at the last iteration tracks the optimal solutions computed for individual scenarios]