From Perturbation Analysis
to a New Paradigm of Optimization
Xi-Ren Cao, Shanghai Jiao Tong University
The Problem in Optimization
Policy Space: Best Policy?
The policy space D is too large for exhaustive search (100 states, 2 actions: $2^{100} \approx 10^{30}$ policies; at 10 GHz, about $10^{12}$ years just to count them).
The state space is too large; we cannot analyze every policy.
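A quick sanity check of this count, as a sketch (assuming an enumeration rate of $10^{10}$ policies per second):

```python
years = 2**100 / 1e10 / (3600 * 24 * 365)  # 2^100 policies at 10^10 per second
print(f"{years:.2e} years")                # ~4e12 years, on the order of 10^12
```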
Perturbation Analysis (PA): a gradient-based approach
With special structure, by analyzing one policy we can obtain the performance of its neighboring policies, i.e., the performance gradient.
Queueing networks, Markov processes.
[Figure: performance at θ and θ + Δθ; gradient hill climbing]
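The hill-climbing loop itself is simple; a minimal sketch follows, where `performance` and `estimate_gradient` are hypothetical stand-ins (a toy objective and a finite difference) for the PA estimate obtained from a single sample path:

```python
def performance(theta):
    # Toy performance measure eta(theta); in a real PA application this
    # would be estimated from a sample path of the queueing network.
    return -(theta - 2.0) ** 2

def estimate_gradient(theta, eps=1e-4):
    # Stand-in for the PA gradient d(eta)/d(theta); PA would derive this
    # from one sample path, here we use a central finite difference.
    return (performance(theta + eps) - performance(theta - eps)) / (2 * eps)

theta, step = 0.0, 0.1
for _ in range(500):          # hill climbing: theta <- theta + step * gradient
    g = estimate_gradient(theta)
    if abs(g) < 1e-8:         # gradient ~ 0: a local optimum
        break
    theta += step * g
print(f"theta* ~ {theta:.4f}, eta(theta*) ~ {performance(theta):.6f}")
```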
Policies at a Distance?
With special structure, by analyzing one policy we can find a better policy at a distance: Policy Iteration (PI), the discrete version of PA.
Continuous: Perturbation Analysis (PA)
➢ Performance derivative $d\eta(\theta)/d\theta = \,?$
➢ Find the best direction
➢ Hill climbing
➢ Gradient ≤ 0: local optimum

Discrete: Relative Optimization (RO)
➢ Performance difference $\eta' - \eta = \,?$
➢ Find a better policy
➢ Policy Iteration
➢ No better policy: global optimum
PA and RO are probably the only ways to overcome the difficulties of exhaustive search.
A Sensitivity-Based View of Optimization
▪ Continuous space: θ → θ + Δθ (perturbation analysis)
▪ Discrete policy space: performance difference formula (PDF), policy iteration
Dynamic Programming
Working locally in time and states: optimal policy at k+1 → optimal policy at k.
[Figure: a small DP example, states X(k) ∈ {1, ..., 6} over k = 0, ..., 3, actions a1, a2, a3, and optimal values $\eta_2^*(5)$, $\eta_2^*(2)$, $\eta_2^*(4)$]

$\eta_k^d(x) = \sum_{i=k}^{K} f_d(i, X^d(i)) + F[X^d(K+1)]$
Problem: under-selectivity for time-nonhomogeneous systems. The long-run average

$\eta(x) = \lim_{K\to\infty} \frac{1}{K}\, E\Big\{\sum_{k=0}^{K-1} f_k(X_k) \,\Big|\, X_0 = x\Big\}$

does not depend on the policies used in any finite period.
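A small simulation sketch of under-selectivity (the chain, rewards, and the 100-step switch point are made up for illustration): using a different transition matrix during any finite initial period leaves the long-run average unchanged.

```python
import numpy as np

rng = np.random.default_rng(0)
P  = np.array([[0.9, 0.1], [0.2, 0.8]])  # policy used from some point on
P0 = np.array([[0.1, 0.9], [0.7, 0.3]])  # different policy, used only for k < 100
f  = np.array([1.0, 5.0])                # reward f(i)

def long_run_average(first, K=200_000):
    x, total = 0, 0.0
    for k in range(K):
        total += f[x]
        Pk = first if k < 100 else P     # transient actions may differ
        x = rng.choice(2, p=Pk[x])
    return total / K

# Both estimates agree up to Monte Carlo error (~2.33): the long-run
# average is blind to what happens in any finite period.
print(long_run_average(P), long_run_average(P0))
```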
Stochastic Control

$dX(t) = b_d[X(t)]\,dt + \sigma_d[X(t)]\,dW(t)$

[Figure: a diffusion sample path X(t) over [t, t + Δt]]
Problem: non-smooth value functions. The local property leads to a differential equation, which does not work for non-smooth value functions (viscosity solutions).
Dynamic Programming:
➢ Works backwards in time: t + Δt → t
➢ Local information

Any Weakness?
➢ Not convenient for the long-run average
➢ Not for non-smooth value functions (viscosity solutions)
➢ Degenerate processes not well explored
➢ Not necessary under the under-selectivity issue (the long-run average does not depend on transient actions)
Sensitivity-Based: PA and RO
PA: given a sample path X, construct the perturbed sample path $X^\delta$.
[Figure: nominal path X and perturbed path $X^\delta$ over [0, T]]
[Figure: two sample paths over k = 0, ..., 5: ABCDEF under policy d, AGHIJK under policy d']

Two policies d and d', with $X^d(0) = X^{d'}(0) = x$.
Two performance measures:
$\eta_0^d(x)$: total reward on ABCDEF
$\eta_0^{d'}(x)$: total reward on AGHIJK
Relative Optimization: Comparing two policies
Relative Optimization
Two policies:
Time-nonhomogeneous Markov chains, with transition probability matrices $P_k = [P_k(j|i)]_{i,j\in S}$, $k = 0, 1, 2, \dots$, and rewards $f_k := (f_k(1), \dots, f_k(S))^T$.
Policy 1: $P_0, P_1, \dots, P_k, \dots$ with rewards $f_0, f_1, \dots$ and value functions $g_0, g_1, \dots$
Policy 2: $P'_0, P'_1, \dots, P'_k, \dots$ with rewards $f'_0, f'_1, \dots$ and value functions $g'_0, g'_1, \dots$
Long-run averages: η and η'.
Relative Optimization
Performance difference formula:

$\eta' = \lim_{K\to\infty}\frac{1}{K}\, E'\Big\{\sum_{k=0}^{K-1} \big[(P'_k - I)g'_k + f'_k\big](X'_k) \,\Big|\, X'_0 = x\Big\}$

$\eta = \lim_{K\to\infty}\frac{1}{K}\, E'\Big\{\sum_{k=0}^{K-1} \big[(P_k - I)g_k + f_k\big](X'_k) \,\Big|\, X'_0 = x\Big\}$

Hence $\eta' \ge \eta$ if

$\big[(P'_k - I)g'_k + f'_k\big](x) \ge \big[(P_k - I)g_k + f_k\big](x)$

for all x in S and all k = 0, 1, 2, ..., except for a finite period, or on a subsequence $k_1, k_2, \dots$ with $\lim_{n\to\infty} n/k_n = 0$.
HJB
Stochastic Control

$dX(t) = b_d[X(t)]\,dt + \sigma_d[X(t)]\,dW(t)$

Finite-horizon optimization problem (stationary):

$\eta_d(x) = E\Big\{\int_0^T f_d(X(s))\,ds + F(X(T)) \,\Big|\, X(0) = x\Big\}$

Goal: $\eta^*(x) = \max_d\{\eta_d(x)\}$, for all x.
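As a sketch, $\eta_d(x)$ can be estimated by Euler-Maruyama simulation; the drift, diffusion, and rewards below are made-up illustrative choices, not from the talk.

```python
import numpy as np

rng = np.random.default_rng(1)
b     = lambda x: -x           # drift b_d(x)          (illustrative)
sigma = lambda x: 1.0          # diffusion sigma_d(x)  (illustrative)
f     = lambda x: -x**2        # running reward f_d(x) (illustrative)
F     = lambda x: -np.abs(x)   # terminal reward F(x)  (illustrative)

def eta(x0, T=1.0, dt=1e-3, n_paths=20_000):
    """Monte Carlo estimate of E[ int_0^T f(X(s))ds + F(X(T)) | X(0)=x0 ]."""
    X = np.full(n_paths, float(x0))
    reward = np.zeros(n_paths)
    for _ in range(int(T / dt)):   # Euler-Maruyama step of the SDE
        reward += f(X) * dt
        X += b(X) * dt + sigma(X) * np.sqrt(dt) * rng.standard_normal(n_paths)
    return (reward + F(X)).mean()

print(eta(0.5))
```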
Itô formula: for a smooth function η(x),

$E\{d\eta[X(t)] \mid X(t) = x\} = \big[b(x)\eta'(x) + \tfrac{1}{2}\sigma^2(x)\eta''(x)\big]\,dt$

Dynamic programming ⟹ the HJB equation.
Itô-Tanaka formula: for a non-smooth function η(x),

$E\{d\eta[X(t)] \mid X(t) = x\} = \big[b(x)\eta'(x) + \tfrac{1}{2}\sigma^2(x)\eta''(x)\big]\,dt + [\eta'(z+) - \eta'(z-)]\,E\{dL_z^X(t) \mid X(0) = x\}$

$E[dL_z^X(t) \mid X(t) = z] = \sigma(z)\sqrt{2\,dt/\pi} \sim \sqrt{dt}$

where z is the non-smooth point, $L_z^X(t)$, $t \in [0, T]$, is the local time of X at z, and $\eta'(z+)$, $\eta'(z-)$ are the right- and left-sided derivatives. Note $\lim_{dt\to 0}\sqrt{dt}/dt = \infty$.
Derivatives??
PDF for a non-smooth value function η(x):

$\eta'(x) - \eta(x) = E'\Big\{\int_0^T \big[(b'-b)\eta' + \tfrac{1}{2}(\sigma'^2 - \sigma^2)\eta'' + (f'-f)\big](X'(t))\,dt \,\Big|\, X'(0) = x\Big\} + [\eta'(z+) - \eta'(z-)]\,E'\{L_z^{X'}(T) \mid X'(0) = x\}$

In addition to the HJB equation at the smooth points, at the non-smooth points we need

$\eta'(z+) \le \eta'(z-)$
Relative Optimization: Based on Comparison
1. No viscosity solution is needed!
2. The order in dt is $\sqrt{dt}$: $E[dL_z^X(t) \mid X(t) = z] = \sigma(z)\sqrt{2\,dt/\pi}$, and $\lim_{dt\to 0}\sqrt{dt}/dt = \infty$.
3. X(t) hits the non-smooth point z rarely, but each time it hits, the effect in dt is infinite. This cannot be captured by derivatives.
Example: $X(t) = W(t)$ (standard Brownian motion).
1. $\eta(x) = x$: $E[d\eta[X(t)] \mid X(0) = 0] = E[dW(t) \mid X(0) = 0] = 0$.
2. $\eta(x) = |x|$: $E[d\eta[X(t)] \mid X(0) = 0] = E[d|W(t)| \mid X(0) = 0] = \sqrt{2\,dt/\pi}$.
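A quick Monte Carlo check of the $\sqrt{dt}$ order in case 2 (a sketch): $E[|W(dt)|] = \sqrt{2\,dt/\pi}$, so the per-step effect divided by dt blows up as dt → 0.

```python
import numpy as np

rng = np.random.default_rng(2)
for dt in (1e-2, 1e-3, 1e-4):
    dW = np.sqrt(dt) * rng.standard_normal(1_000_000)
    est = np.abs(dW).mean()               # estimates E[d|W(t)|] at W(t) = 0
    theory = np.sqrt(2 * dt / np.pi)
    print(dt, est, theory, est / dt)      # est matches theory; est/dt grows without bound
```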
Other Applications???
➢ Global information over the entire horizon [0, T] or [0, ∞) handles under-selectivity and non-smoothness
➢ Degenerate processes explored in detail
➢ Long-run average: state classification, bias optimality, multi-class optimization
➢ Insights for further research on control and stochastic processes: local times on curves
➢ No viscosity solution needed
Relative Optimization (based on comparing the performance of any two policies)
[Diagram: Performance Optimization, solved either by Dynamic Programming (HJB, etc.) or by Relative Optimization, both leading to solutions]
THANKS!
Example: Long-run Average
Two policies with finite state space:
Transition probability matrices P, P' (n × n); long-run averages η, η'; steady-state probabilities π, π' (n-dimensional row vectors); reward function f (n-dimensional column vector).

Poisson equation: $(I - P)g + \eta e = f$   (1)

g: the potential, an n-dimensional column vector; $e = (1, 1, \dots, 1)^T$, an n-dimensional column vector.
Noting $\eta' = \pi' f$ and $\pi' e = 1$, left-multiplying (1) by π' yields the PDF:

$\eta' - \eta = \pi'(P' - P)g$

➢ $\eta' > \eta$ if $P'g > Pg$: policy iteration
➢ $P^*$ is optimal if $P^* g^* \ge P g^*$ for all P: the HJB equation!
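A numerical check of this PDF, as a sketch with randomly generated chains (the eigenvector and least-squares steps are just one convenient way to compute π and g):

```python
import numpy as np

rng = np.random.default_rng(3)
n = 4
P  = rng.random((n, n)); P  /= P.sum(1, keepdims=True)   # policy P
Pp = rng.random((n, n)); Pp /= Pp.sum(1, keepdims=True)  # policy P'
f  = rng.random(n)                                       # common reward f

def steady_state(P):
    # left eigenvector of P for eigenvalue 1, normalized to sum to 1
    w, v = np.linalg.eig(P.T)
    pi = np.real(v[:, np.argmin(np.abs(w - 1))])
    return pi / pi.sum()

pi, pip = steady_state(P), steady_state(Pp)
eta, etap = pi @ f, pip @ f
# Poisson equation (I - P)g + eta*e = f, pinned down by pi @ g = 0:
A = np.vstack([np.eye(n) - P, pi])
g = np.linalg.lstsq(A, np.append(f - eta, 0.0), rcond=None)[0]
print(etap - eta, pip @ (Pp - P) @ g)   # the two numbers coincide (the PDF)
```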
Markov Decision Processes (MDPs) & Policy Iteration
$\eta' - \eta = \pi'(P' - P)g$

1. $\eta' > \eta$ if $P'g \ge Pg$, with > for at least one component.
2. Policy iteration: at every state, find a policy P' with $P'g \ge Pg$.
3. Improve performance iteratively; stop when no improvement can be made.
4. Optimality equation: $\hat{P}$ is optimal if $\hat{P}\hat{g} \ge P\hat{g}$, for all P.
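Steps 1-4 translate directly into code; a minimal policy-iteration sketch for the average-reward MDP, with made-up random transition matrices and rewards (n states, m actions):

```python
import numpy as np

rng = np.random.default_rng(4)
n, m = 5, 3
P = rng.random((m, n, n)); P /= P.sum(2, keepdims=True)  # P[a]: n x n
f = rng.random((m, n))                                   # f[a][i]: reward

def evaluate(policy):
    """Solve the Poisson equation for the chain induced by the policy."""
    Pd = P[policy, np.arange(n)]         # row i taken from P[policy[i]]
    fd = f[policy, np.arange(n)]
    w, v = np.linalg.eig(Pd.T)
    pi = np.real(v[:, np.argmin(np.abs(w - 1))]); pi /= pi.sum()
    eta = pi @ fd
    A = np.vstack([np.eye(n) - Pd, pi])
    g = np.linalg.lstsq(A, np.append(fd - eta, 0.0), rcond=None)[0]
    return eta, g

policy = np.zeros(n, dtype=int)
while True:
    eta, g = evaluate(policy)
    new = (f + P @ g).argmax(0)          # at every state, maximize f(a) + P(a)g
    if np.array_equal(new, policy):      # no better policy: optimal (HJB)
        break
    policy = new
print("optimal policy:", policy, " eta* =", eta)
```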
The Optimization Problem
Action $a \in \{a_1, a_2, \dots, a_N\}$.
Transition, deterministic: $X(k+1) = \phi^a[k, X(k)]$; stochastic: a determines the distribution of X(k+1).
Reward: $f^a(k, x)$. Terminating reward: $F(x)$.
A policy d: $a = d(k, X(k))$; under policy d, $X(k+1) = \phi[k, X(k), d]$ with reward $f(k, x, d)$.
[Figure: states 1-8 over time k = 1, ..., K+1, with actions a1-a4 along a path]
A (Deterministic) Policy d

[Figure: sample paths ABCDEF and RSUV over k = 0, 1, ..., 5; states $X^d(k)$]

Total reward from $X^d(k) = x$ to $X^d(K+1)$:

$\eta_k^d(x) = \sum_{i=k}^{K} f_d(i, X^d(i)) + F[X^d(K+1)]$, with $X^d(k) = x$.

Optimization: $\eta_k^*(x) = \max\{\eta_k^d(x) : d \in D\}$, for all k and x.
Dynamic Programming
Working backwards in time, horizontally: optimal policy at k+1 → optimal policy at k.

[Figure: X(k) on states 1-8 over k = 0, ..., 5, with terminal rewards F(6), F(5), F(3) and policies d1, d2, d3]
The Optimality Condition:

$f^{d^*}(k, x) + \eta_{k+1}^*(\phi^{d^*}[k, x]) \ge f^a(k, x) + \eta_{k+1}^*(\phi^a[k, x])$, for all a, all x, and k = 0, 1, ..., K.   (**)

For stochastic systems, the mapping is replaced by a transition probability $P^d(y|x)$, and the performance is replaced by its mean:

$f^{d^*}(k, x) + \sum_y P^{d^*}(y|x)\,\eta_{k+1}^*(y) \ge f^a(k, x) + \sum_y P^a(y|x)\,\eta_{k+1}^*(y)$, for all a, all x, and k = 0, 1, ..., K.
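The condition suggests computing $\eta_k^*$ by backward induction; a minimal sketch with made-up random data (n states, m actions, horizon K):

```python
import numpy as np

rng = np.random.default_rng(5)
n, m, K = 6, 3, 10                       # states, actions, horizon
P = rng.random((m, n, n)); P /= P.sum(2, keepdims=True)  # P^a(y|x)
f = rng.random((K + 1, m, n))            # f^a(k, x), k = 0, ..., K
F = rng.random(n)                        # terminal reward F(x)

eta = F.copy()                           # eta*_{K+1} = F
policy = np.zeros((K + 1, n), dtype=int)
for k in reversed(range(K + 1)):         # work backwards: k+1 -> k
    Q = f[k] + P @ eta                   # Q[a, x] = f^a(k,x) + sum_y P^a(y|x) eta(y)
    policy[k] = Q.argmax(0)              # optimal action at (k, x)
    eta = Q.max(0)                       # eta*_k(x)
print("eta*_0 =", eta)
```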
Adding auxiliary paths: starting from the sample path of d' at each time k, but following policy d afterwards.
Every such path has a total reward $\eta_k^d[X^{d'}(k)]$.

[Figure: sample paths of d (ABCDEF) and d' (AGHIJK) over k = 0, ..., 5, states 1-8, with the auxiliary paths following d]
[Figure: paths ABCDEF (policy d) and AGHIJK (policy d'), with auxiliary paths through the states L, M, O, P, Q, R, S, U, V]

$\eta^{d'} - \eta^d$ = AGHIJK - ABCDEF
= (AGHIJK - AGHIJM) + (AGHIJM - AGHILQ) + (AGHILQ - AGHOPQ) + (AGHOPQ - AGRSUV) + (AGRSUV - ABCDEF)
= (JK - JM) + (IJM - ILQ) + (HILQ - HOPQ) + (GHOPQ - GRSUV) + (AGRSUV - ABCDEF)
[Figure: the same two sample paths and auxiliary paths, repeated]

$\eta^{d'} - \eta^d$ = (JK - JM) + (IJM - ILQ) + (HILQ - HOPQ) + (GHOPQ - GRSUV) + (AGRSUV - ABCDEF)

We have

JK - JM = $\{f'(4, J) + F(K)\} - \{f(4, J) + F(M)\}$ = $\{f'(4, J) + F(\phi'[4, J])\} - \{f(4, J) + F(\phi[4, J])\}$
......
GHOPQ - GRSUV = $\{f'(1, X'(1)) + \eta_2^d(\phi'[1, X'(1)])\} - \{f(1, X'(1)) + \eta_2^d(\phi[1, X'(1)])\}$
......
and the term at k = 4 takes the generic form $\{f'(4, X'(4)) + F(\phi'[4, X'(4)])\} - \{f(4, X'(4)) + F(\phi[4, X'(4)])\}$.
Thus, we get the Performance Difference Formula (PDF):

$\eta^{d'} - \eta^d = \sum_{k=0}^{K} \Big(\{f'(k, X'(k)) + \eta_{k+1}^d(\phi'[k, X'(k)])\} - \{f(k, X'(k)) + \eta_{k+1}^d(\phi[k, X'(k)])\}\Big)$

Optimality Condition:

$f(k, x, d) + \eta_{k+1}(\phi[k, x, d]) \ge f'(k, x, d') + \eta_{k+1}(\phi'[k, x, d'])$, for all d', all x, and k = 0, 1, ..., K.   (**)

This is the same as the DP equation.
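A numeric check of this telescoping PDF, as a sketch with a made-up deterministic system (random transition maps, rewards, and two random policies d and d'):

```python
import numpy as np

rng = np.random.default_rng(6)
n, m, K = 6, 3, 5
phi = rng.integers(0, n, size=(K + 1, m, n))   # transition phi[k, a, x]
f   = rng.random((K + 1, m, n))                # reward f(k, x, a) -> f[k, a, x]
F   = rng.random(n)                            # terminal reward
d   = rng.integers(0, m, (K + 1, n))           # policy d:  a = d[k, x]
dp  = rng.integers(0, m, (K + 1, n))           # policy d': a = dp[k, x]

def eta(policy, k, x):
    """Total reward eta_k^d(x) from time k, state x, to the end."""
    total = 0.0
    for i in range(k, K + 1):
        a = policy[i, x]
        total += f[i, a, x]
        x = phi[i, a, x]
    return total + F[x]

x0 = 0
diff, x = 0.0, x0
for k in range(K + 1):                         # PDF terms along the path of d'
    a, ap = d[k, x], dp[k, x]
    diff += (f[k, ap, x] + eta(d, k + 1, phi[k, ap, x])) \
          - (f[k, a,  x] + eta(d, k + 1, phi[k, a,  x]))
    x = phi[k, ap, x]                          # follow policy d'
print(diff, eta(dp, 0, x0) - eta(d, 0, x0))    # the two numbers coincide
```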
Comparison
DP: like Riemann integration. Local information at time k; a derivative in continuous time.
Direct comparison (PDF): like Lebesgue integration. Global information over the entire horizon [0, K]; more than a derivative.
Application to Stochastic Control