Automation & Robotics Research Institute (ARRI), The University of Texas at Arlington
F.L. Lewis, Moncrief-O’Donnell Endowed Chair
Head, Controls & Sensors Group
Talk available online at http://ARRI.uta.edu/acs
Notes on Optimal Control
Supported by: NSF (Paul Werbos), ARO (Randy Zachery)
Draguna Vrabie
Blocking zeros; zero synthesis; module theory of multivariable zeros of dynamical systems; exactness of maps; TSP; nonlinear feedback synthesis
Michael K. Sain and Cheryl B. Schrader, “Bilinear Operators and Matrices,” in Mathematics for Circuits and Filters. Wai-Kai Chen, ed., CRC Press, pp. 23-41, 2000.
Cheryl B. Schrader and Michael K. Sain, “Zero Principles for Implicit Feedback Systems,” Circuits, Systems, and Signal Processing: Special Issue on Implicit and Robust Systems, Vol. 13, No. 2-3, pp. 273-293, 1994.
Michael K. Sain and Cheryl B. Schrader, “Feedback, Zeros, and Blocking Dynamics,” in Recent Advances in Mathematical Theory of Systems, Control, Networks and Signal Processing I. H. Kimura and S. Kodama, eds., Tokyo: Mita Press, pp. 227-232, 1992.
Michael K. Sain: Minimal Torsion Spaces and the Partial Input/Output Problem Information and Control 29(2): 103-124 (1975)
Ronald W. Diersing, Michael K. Sain, and Chang-Hee Won, “Bi-Cumulant Games: A Generalization of H-inf and H2/H-inf Control,” IEEE Transactions on Automatic Control, Submitted, 2007.
M.K. Sain, B.F. Wyman, J.L. Peczkowski, “Extended zeros and model matching,” SIAM Journal on Control and Optimization, May 1991
James L. Massey, Michael K. Sain: Inverse Problems in Coding, Automata, and Continuous Systems FOCS 1967: 226-232
Michael K. Sain
Module theoretic zero structures for system matrices
Author(s): Wyman, Bostwick F.; Sain, Michael K.
Abstract: The coordinate-free module-theoretic treatment of transmission zeros for MIMO transfer functions developed by Wyman and Sain (1981) is generalized to include noncontrollable and nonobservable linear dynamical systems. ...
NASA Center: NASA (non Center Specific); Publication Year: 1987; Added to NTRS: 2004-11-03; Accession Number: 87A30190; Document ID: 19870042916
“Did I say that? Well, that was then. This is now.”
EIC editorial, IEEE Circuits & Systems magazine, v.3, no. 1, 2003
- M.K. Sain
Cell Homeostasis
The individual cell is a complex feedback control system. It pumps ions across the cell membrane to maintain homeostasis, and has only limited energy to do so.
Cellular Metabolism
Permeability control of the cell membrane
http://www.accessexcellence.org/RC/VL/GG/index.html
Optimality in Biological Systems
Optimality in Control Systems Design
R. Kalman 1960
Rocket Orbit Injection
http://microsat.sm.bmstu.ru/e-library/Launch/Dnepr_GEO.pdf
Dynamics
$\dot{r} = w$
$\dot{w} = \frac{v^2}{r} - \frac{\mu}{r^2} + \frac{F}{m}\sin\phi$
$\dot{v} = -\frac{wv}{r} + \frac{F}{m}\cos\phi$
$\dot{m} = -Fm$
Objectives
- Get to orbit in minimum time
- Use minimum fuel
Adaptive Control is Not Optimal
Optimal Control is off-line, and needs to know the system dynamics to solve design eqs.
We want ONLINE ADAPTIVE OPTIMAL Control
Continuous-Time Optimal Control
System: $\dot{x} = f(x,u)$
Cost: $V(x(t)) = \int_t^\infty r(x,u)\,dt = \int_t^\infty \left(Q(x) + u^T R u\right) dt$
Hamiltonian: $0 = \dot{V} + r(x,u) = \left(\frac{\partial V}{\partial x}\right)^T \dot{x} + r(x,u) = \left(\frac{\partial V}{\partial x}\right)^T f(x,u) + r(x,u) \equiv H\left(x, \frac{\partial V}{\partial x}, u\right), \quad V(0) = 0$
For a given control, the cost satisfies this eq. In LQR, this is a Lyapunov eq.
Optimal cost (Bellman): $0 = \min_{u(t)} \left( r(x,u) + \left(\frac{\partial V^*}{\partial x}\right)^T \dot{x} \right) = \min_{u(t)} \left( r(x,u) + \left(\frac{\partial V^*}{\partial x}\right)^T f(x,u) \right)$
Optimal control (for dynamics of the form $\dot{x} = f(x) + g(x)u$): $h^*(x(t)) = -\frac{1}{2} R^{-1} g^T(x) \frac{\partial V^*}{\partial x}$
HJB equation: $0 = \left(\frac{dV^*}{dx}\right)^T f + Q(x) - \frac{1}{4} \left(\frac{dV^*}{dx}\right)^T g R^{-1} g^T \frac{dV^*}{dx}, \quad V(0) = 0$
In LQR, this is a Riccati eq.
Linear system, quadratic cost
System: $\dot{x} = Ax + Bu$
Utility: $r(x,u) = x^T Q x + u^T R u; \quad R > 0,\ Q \ge 0$
The cost is quadratic: $V(x(t)) = \int_t^\infty r(x,u)\,d\tau = x^T(t) P x(t)$
Optimal control (state feedback): $u(t) = -R^{-1} B^T P x(t) = -L x(t)$
HJB equation is the algebraic Riccati equation (ARE): $0 = PA + A^T P + Q - P B R^{-1} B^T P$
Full system dynamics must be known
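As a quick sanity check of the ARE $0 = PA + A^TP + Q - PBR^{-1}B^TP$, here is a minimal sketch for the scalar case, where the equation reduces to a quadratic in $p$ and can be solved in closed form. The function name `scalar_lqr` and the numbers `a = b = q = r = 1` are illustrative assumptions, not values from the talk.

```python
import math

def scalar_lqr(a, b, q, r):
    """Positive root p of the scalar ARE 0 = 2*a*p + q - (b*p)**2 / r, plus the gain L = b*p/r."""
    p = r * (a + math.sqrt(a * a + b * b * q / r)) / (b * b)
    L = b * p / r
    return p, L

# Hypothetical unstable plant xdot = a*x + b*u with cost weights q, r.
a, b, q, r = 1.0, 1.0, 1.0, 1.0
p, L = scalar_lqr(a, b, q, r)
residual = 2 * a * p + q - (b * p) ** 2 / r   # should vanish at the solution
closed_loop_pole = a - b * L                  # should be negative (stable)
print(p, L, residual, closed_loop_pole)
```

Note that both `a` and `b` are needed to evaluate the formula, which is the point of the remark that the full system dynamics must be known.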
CT Policy Iteration
Utility: $r(x,u) = Q(x) + u^T R u$
Cost for any given u(t) (Lyapunov equation): $0 = \left(\frac{\partial V}{\partial x}\right)^T f(x,u) + r(x,u) \equiv H\left(x, \frac{\partial V}{\partial x}, u\right)$
Iterative solution, to avoid solving the HJB equation:
- Pick stabilizing initial control
- Find cost: $0 = \left(\frac{\partial V_k}{\partial x}\right)^T f(x, h_k(x)) + r(x, h_k(x)), \quad V_k(0) = 0$
- Update control: $h_{k+1}(x) = -\frac{1}{2} R^{-1} g^T(x) \frac{\partial V_k}{\partial x}$
• Convergence proved by Saridis 1979 if Lyapunov eq. solved exactly
• Beard & Saridis used complicated Galerkin integrals to solve Lyapunov eq.
• Abu Khalaf & Lewis used NN to approx. V for nonlinear systems and proved convergence
Full system dynamics must be known
LQR Policy iteration = Kleinman algorithm (Kleinman 1968)
1. For a given control policy $u = -L_k x$ solve for the cost (Lyapunov eq.):
$0 = A_k^T P_k + P_k A_k + C^T C + L_k^T R L_k, \quad A_k = A - B L_k$
2. Improve policy:
$L_k = R^{-1} B^T P_{k-1}$
If started with a stabilizing control policy $L_0$, the matrix $P_k$ monotonically converges to the unique positive definite solution of the Riccati equation.
Every iteration step will return a stabilizing controller.
The system has to be known.
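The two Kleinman steps can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the speaker's implementation: the helper names `lyap` and `kleinman`, the example matrices, and the choice $L_0 = 0$ (valid here only because the assumed $A$ is already stable) are all hypothetical. The Lyapunov equation is solved by Kronecker-product vectorization, which is only sensible for small state dimension.

```python
import numpy as np

def lyap(Ac, M):
    """Solve Ac^T P + P Ac + M = 0 for P by vectorizing with Kronecker products."""
    n = Ac.shape[0]
    I = np.eye(n)
    K = np.kron(Ac.T, I) + np.kron(I, Ac.T)
    return np.linalg.solve(K, -M.flatten()).reshape(n, n)

def kleinman(A, B, Q, R, L0, iters=10):
    """Policy iteration: evaluate the current gain, then improve it."""
    L = L0
    for _ in range(iters):
        Ak = A - B @ L
        P = lyap(Ak, Q + L.T @ R @ L)       # step 1: cost of the current policy
        L = np.linalg.solve(R, B.T @ P)     # step 2: improved gain R^{-1} B^T P
    return P, L

# Hypothetical example system (eigenvalues of A are -1, -1, so L0 = 0 is stabilizing).
A = np.array([[0.0, 1.0], [-1.0, -2.0]])
B = np.array([[0.0], [1.0]])
Q = np.eye(2)
R = np.array([[1.0]])

P, L = kleinman(A, B, Q, R, L0=np.zeros((1, 2)))
are_residual = A.T @ P + P @ A + Q - P @ B @ np.linalg.solve(R, B.T @ P)
print(P, np.linalg.norm(are_residual))
```

Every step uses both `A` and `B`, which is why the slide notes that the system has to be known.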
Policy Iteration Solution
$Ric(P) \equiv A^T P + P A + Q - P B B^T P$
Policy iteration:
$(A - B B^T P_i)^T P_{i+1} + P_{i+1} (A - B B^T P_i) + P_i B B^T P_i + Q = 0$
Frechet derivative:
$Ric'_{P_i}(P) \equiv (A - B B^T P_i)^T P + P (A - B B^T P_i)$
Then, policy iteration is
$P_{i+1} = P_i - \left(Ric'_{P_i}\right)^{-1} Ric(P_i), \quad i = 0, 1, \ldots$
This is in fact Newton’s method.
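The equivalence between the policy iteration step and the Newton step can be checked by a short expansion. The following is a sketch of that algebra, using only the definitions of $Ric$ and $Ric'_{P_i}$ given on this slide (with $R = I$) and the shorthand $A_i = A - BB^TP_i$, which is introduced here for brevity.

```latex
\begin{align*}
P_{i+1} = P_i - (Ric'_{P_i})^{-1} Ric(P_i)
  &\iff Ric'_{P_i}(P_{i+1} - P_i) = -Ric(P_i) \\
  &\iff A_i^T P_{i+1} + P_{i+1} A_i - \left(A_i^T P_i + P_i A_i\right)
        = -\left(A^T P_i + P_i A + Q - P_i B B^T P_i\right) \\
  &\iff A_i^T P_{i+1} + P_{i+1} A_i
        - \left(A^T P_i + P_i A - 2 P_i B B^T P_i\right)
        = -A^T P_i - P_i A - Q + P_i B B^T P_i \\
  &\iff A_i^T P_{i+1} + P_{i+1} A_i + P_i B B^T P_i + Q = 0 .
\end{align*}
```

The last line is the Lyapunov equation solved at each policy iteration step.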
Policy Iterations without Lyapunov Equations
Dynamic programming built on Bellman’s optimality principle: alternative form for CT systems [Lewis & Syrmos 1995]
$V^*(x(t)) = \min_{u(\tau),\ t \le \tau < t+\Delta t} \left\{ \int_t^{t+\Delta t} r(x(\tau), u(\tau))\,d\tau + V^*(x(t+\Delta t)) \right\}$
$r(x(\tau), u(\tau)) = x^T(\tau) Q x(\tau) + u^T(\tau) R u(\tau)$
f(x) and g(x) do not appear
Solving for the cost – Our approach
For a given control $u = -Lx$, the cost satisfies
$V(x(t)) = \int_t^{t+T} \left(x^T Q x + u^T R u\right) dt + V(x(t+T))$
LQR case:
$x^T(t) P x(t) = \int_t^{t+T} \left(x^T Q x + u^T R u\right) dt + x^T(t+T) P x(t+T)$
Optimal gain is $L = R^{-1} B^T P$
f(x) and g(x) do not appear
1. Policy iteration
Initial stabilizing control is needed
Cost update: $V_{k+1}(x(t)) = \int_t^{t+T} \left(x^T Q x + u_k^T R u_k\right) dt + V_{k+1}(x(t+T))$
Control gain update: $L_{k+1} = R^{-1} B^T P_{k+1}$
For LQR case, with $u_k(t) = -L_k x(t)$:
$x^T(t) P_{k+1} x(t) = \int_t^{t+T} x^T(\tau) \left(Q + L_k^T R L_k\right) x(\tau)\,d\tau + x^T(t+T) P_{k+1} x(t+T)$
A and B do not appear in the cost update
B needed for control update
Solving for the cost – Our approach
Theorem. This algorithm converges and is equivalent to Kleinman’s algorithm:
$(A - B B^T P_i)^T P_{i+1} + P_{i+1} (A - B B^T P_i) + P_i B B^T P_i + Q = 0$
$P_{i+1} = P_i - \left(Ric'_{P_i}\right)^{-1} Ric(P_i), \quad i = 0, 1, \ldots$
Lemma 1.
$x^T(t) P_{k+1} x(t) = \int_t^{t+T} x^T(\tau) \left(Q + L_k^T R L_k\right) x(\tau)\,d\tau + x^T(t+T) P_{k+1} x(t+T)$
is equivalent to
$(A - B R^{-1} B^T P_k)^T P_{k+1} + P_{k+1} (A - B R^{-1} B^T P_k) + P_k B R^{-1} B^T P_k + Q = 0$
with $L_k = R^{-1} B^T P_k$.
Proof: Write $K_i$ for the gain, $A_i = A - B K_i$ for the closed-loop matrix, and $P_i$ for the solution of the Lyapunov equation associated with $K_i$. Along closed-loop trajectories,
$\frac{d(x^T P_i x)}{dt} = x^T \left(A_i^T P_i + P_i A_i\right) x = -x^T \left(K_i^T R K_i + Q\right) x$
so that
$\int_t^{t+T} x^T \left(Q + K_i^T R K_i\right) x\,d\tau = -\int_t^{t+T} d\left(x^T P_i x\right) = x^T(t) P_i x(t) - x^T(t+T) P_i x(t+T)$
Solves Lyapunov equation without knowing A or B
Only B is needed, for the gain update
Critic update
$x^T(t) P_{k+1} x(t) = \int_t^{t+T} x^T(\tau) \left(Q + L_k^T R L_k\right) x(\tau)\,d\tau + x^T(t+T) P_{k+1} x(t+T)$
Use Kronecker product $vec(ABC) = (C^T \otimes A)\,vec(B)$ to set this up as
$\bar{p}_{k+1}^T \bar{x}(t) = \int_t^{t+T} x^T(\tau) \left(Q + L_k^T R L_k\right) x(\tau)\,d\tau + \bar{p}_{k+1}^T \bar{x}(t+T)$
$\bar{p}_{k+1}^T \varphi(t) \equiv \bar{p}_{k+1}^T \left[\bar{x}(t) - \bar{x}(t+T)\right] = \int_t^{t+T} x^T(\tau) \left(Q + L_k^T R L_k\right) x(\tau)\,d\tau \equiv \rho(t, t+T)$
where $\bar{x}(t) = x(t) \otimes x(t)$ is the quadratic basis set, $\bar{p}_{k+1} = vec(P_{k+1})$, $\varphi(t)$ is the quadratic regression vector, and $\rho(t, t+T)$ is the reinforcement on the time interval [t, t+T].
Now use RLS along the trajectory to get new weights $\bar{p}_{k+1}$
Unpack weights into the matrix $P_{k+1}$
Then find updated FB gain $L_{k+1} = R^{-1} B^T P_{k+1}$
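The Kronecker identity $vec(ABC) = (C^T \otimes A)\,vec(B)$ and its consequence $x^T P x = (x \otimes x)^T vec(P)$ are easy to verify numerically. The sketch below uses random matrices with an arbitrary seed; the variable names are illustrative, and `vec` is taken to be column stacking, which is what the identity assumes.

```python
import numpy as np

rng = np.random.default_rng(0)
vec = lambda M: M.flatten(order="F")   # column-stacking vec operator

# vec(ABC) = (C^T kron A) vec(B) for conformable random matrices.
A = rng.normal(size=(2, 3))
Bm = rng.normal(size=(3, 4))
C = rng.normal(size=(4, 2))
lhs = vec(A @ Bm @ C)
rhs = np.kron(C.T, A) @ vec(Bm)

# Special case used by the critic: x^T P x is linear in the entries of P.
x = rng.normal(size=3)
S = rng.normal(size=(3, 3))
P = S + S.T                            # symmetric weight matrix
quad_direct = x @ P @ x
quad_kron = np.kron(x, x) @ vec(P)

print(np.allclose(lhs, rhs), np.isclose(quad_direct, quad_kron))
```

Because the quadratic form is linear in $vec(P)$, the critic update becomes an ordinary linear regression in the unknown weights.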
Algorithm Implementation
1. Select initial control policy
2. Find associated cost
$\bar{p}_{k+1}^T \left[\bar{x}(t) - \bar{x}(t+T)\right] = \int_t^{t+T} x^T(\tau) \left(Q + L_k^T R L_k\right) x(\tau)\,d\tau = \rho(t, t+T)$
3. Improve control
$L_{k+1} = R^{-1} B^T P_{k+1}$
Solves Lyapunov eq. without knowing dynamics
On each interval [t, t+T]:
- observe x(t)
- apply $u_k(t) = -L_k x(t)$
- observe cost integral
- observe x(t+T)
- update P
Do RLS until convergence to $P_{k+1}$, then update control gain to $L_{k+1}$
Measure cost increment (reinforcement) by adding V as a state. Then
$\dot{V} = x^T Q x + u_k^T R u_k$
A is not needed anywhere
Algorithm Implementation
The critic update
$x^T(t) P_{k+1} x(t) = \int_t^{t+T} x^T(\tau) \left(Q + L_k^T R L_k\right) x(\tau)\,d\tau + x^T(t+T) P_{k+1} x(t+T)$
can be set up as
$\bar{p}_{k+1}^T \varphi(t) \equiv \bar{p}_{k+1}^T \left[\bar{x}(t) - \bar{x}(t+T)\right] = \int_t^{t+T} x^T(\tau) \left(Q + L_k^T R L_k\right) x(\tau)\,d\tau \equiv d(\bar{x}(t), L_k)$
where $\bar{x}(t) = x(t) \otimes x(t)$ is the quadratic basis set.
Evaluating for n(n+1)/2 trajectory points, one can set up a least squares problem to solve
$\bar{p}_{k+1} = (X X^T)^{-1} X Y$
$X = [\varphi(t_1)\ \varphi(t_2)\ \ldots\ \varphi(t_N)], \quad Y = [d(\bar{x}^1, L_k)\ d(\bar{x}^2, L_k)\ \ldots\ d(\bar{x}^N, L_k)]^T$
Or use batch least-squares solution along the trajectory
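The policy iteration loop with a batch least-squares critic can be sketched as follows. This is an illustration under several assumptions that are not from the talk: the example matrices, the interval length `T = 0.5`, the RK4 simulator standing in for the real plant, the fixed set of start states used to obtain well-conditioned data, and the reduced basis `[x1^2, 2*x1*x2, x2^2]` (the n(n+1)/2 independent entries of $x \otimes x$). The matrix `A` appears only inside the simulator; the learning steps use measured states, the measured cost integral, and `B`.

```python
import numpy as np

# Hypothetical plant; A is used only to generate "measurements".
A = np.array([[0.0, 1.0], [-1.0, -2.0]])
B = np.array([[0.0], [1.0]])
Q = np.eye(2)
R = np.array([[1.0]])

def basis(x):
    """Independent quadratic terms, so x^T P x = [P11, P12, P22] . basis(x)."""
    return np.array([x[0] ** 2, 2 * x[0] * x[1], x[1] ** 2])

def run_interval(x, L, T=0.5, dt=0.005):
    """Simulate u = -L x over [t, t+T] with the cost V added as an extra state."""
    M = Q + L.T @ R @ L
    Acl = A - B @ L
    def f(z):
        xs = z[:2]
        return np.append(Acl @ xs, xs @ M @ xs)   # [xdot, Vdot]
    z = np.append(x, 0.0)
    for _ in range(int(round(T / dt))):           # classical RK4 steps
        k1 = f(z); k2 = f(z + 0.5 * dt * k1)
        k3 = f(z + 0.5 * dt * k2); k4 = f(z + dt * k3)
        z = z + dt * (k1 + 2 * k2 + 2 * k3 + k4) / 6
    return z[:2], z[2]                            # x(t+T), observed cost integral

def evaluate_policy(L, starts, intervals=3):
    """Critic update: least-squares fit of P from state and cost measurements."""
    Phi, rho = [], []
    for x0 in starts:
        x = np.array(x0, dtype=float)
        for _ in range(intervals):
            x_next, cost = run_interval(x, L)
            Phi.append(basis(x) - basis(x_next))  # regression vector
            rho.append(cost)                      # reinforcement on the interval
            x = x_next
    p, *_ = np.linalg.lstsq(np.array(Phi), np.array(rho), rcond=None)
    return np.array([[p[0], p[1]], [p[1], p[2]]])

L = np.zeros((1, 2))                              # stabilizing because A is stable
starts = [(1, 0), (0, 1), (1, 1), (1, -1)]
for k in range(8):
    P = evaluate_policy(L, starts)                # find associated cost
    L = np.linalg.solve(R, B.T @ P)               # improve control (needs B only)

are_residual = A.T @ P + P @ A + Q - P @ B @ np.linalg.solve(R, B.T @ P)
print(P, np.linalg.norm(are_residual))
```

The ARE residual is computed with `A` only to check the result; it plays no part in the learning loop.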
Direct Optimal Adaptive Controller
A hybrid continuous/discrete dynamic controller whose internal state is the observed cost over the interval
“This is a very weird control structure whose likes I have not seen in control system theory”
- suspicious quote by F.L. Lewis, 2007
[Block diagram: System $\dot{x} = Ax + Bu,\ x(0) = x_0$; Critic integrating $\dot{V} = x^T Q x + u^T R u$, sampled with period T through a ZOH; Actor with FB gain L (multiplier $L_k$). Critic and actor together form the dynamic control system.]
Continuous-time control with discrete gain updates
Control: $u_k(t) = -L_k x(t)$
[Timing diagram: gain $L_k$ versus time t, with gain updates (policy) at sample instants k = 0, 1, 2, 3, 4, 5]
Sample periods need not be the same. They can be selected on-line in real time.
2. CT ADP Greedy iteration
No initial stabilizing control needed
Control policy: $u_k(t) = -L_k x(t)$
Cost update: $V_{k+1}(x(t)) = \int_t^{t+T} \left(x^T Q x + u_k^T R u_k\right) dt + V_k(x(t+T))$
LQR: $x^T(t) P_{k+1} x(t) = \int_t^{t+T} x^T(\tau) \left(Q + L_k^T R L_k\right) x(\tau)\,d\tau + x^T(t+T) P_k x(t+T)$
Control gain update: $L_{k+1} = R^{-1} B^T P_{k+1}$
A and B do not appear in the cost update
B needed for control update
Direct Optimal Adaptive Control for Partially Unknown CT Systems
The critic update
$x^T(t) P_{k+1} x(t) = \int_t^{t+T} x^T(\tau) \left(Q + L_k^T R L_k\right) x(\tau)\,d\tau + x^T(t+T) P_k x(t+T)$
Use Kronecker product $vec(ABC) = (C^T \otimes A)\,vec(B)$ to set this up as
$\bar{p}_{k+1}^T \bar{x}(t) = \int_t^{t+T} x^T(\tau) \left(Q + L_k^T R L_k\right) x(\tau)\,d\tau + \bar{p}_k^T \bar{x}(t+T)$
where $\bar{x}(t) = x(t) \otimes x(t)$ is the quadratic basis set.
Now use RLS along the trajectory to get new weights $\bar{p}_{k+1}$
Unpack weights into the matrix $P_{k+1}$
Then find updated FB gain $L_{k+1} = R^{-1} B^T P_{k+1}$
Algorithm Implementation
1. Select control policy
2. Find associated cost
$\bar{p}_{k+1}^T \bar{x}(t) = \int_t^{t+T} x^T(\tau) \left(Q + L_k^T R L_k\right) x(\tau)\,d\tau + \bar{p}_k^T \bar{x}(t+T)$
Here $\bar{x}(t)$ is the regression vector and $\bar{p}_k$ are the previous weights.
3. Improve control
$L_{k+1} = R^{-1} B^T P_{k+1}$
Solves for cost update without knowing dynamics
On each interval [t, t+T]:
- observe x(t)
- apply u
- observe cost integral
- observe x(t+T)
- update P
Do RLS until convergence to $P_{k+1}$, then update control gain to $L_{k+1}$
Measure cost increment by adding V as a state. Then $\dot{V} = x^T Q x + u_k^T R u_k$
No initial stabilizing control needed
A is not needed anywhere
Direct Optimal Adaptive Controller
A hybrid continuous/discrete dynamic controller whose internal state is the observed value over the interval
[Block diagram: System $\dot{x} = Ax + Bu,\ x(0) = x_0$; Critic integrating $\dot{V} = x^T Q x + u^T R u$, sampled with period T through a ZOH; Actor with FB gain L (multiplier $L_k$). Critic and actor together form the dynamic control system.]
Has a different critic cost update
No initial stabilizing gain needed
Analysis of the algorithm
For a given control policy $u_k(t) = -L_k x(t)$, with
$L_k = R^{-1} B^T P_k, \quad A_k = A - B R^{-1} B^T P_k$
the greedy update
$V_{k+1}(x(t)) = \int_t^{t+T} \left(x^T Q x + u_k^T R u_k\right) d\tau + V_k(x(t+T)), \quad V_0 = 0$
is equivalent to
$P_{k+1} = \int_0^T e^{A_k^T t} \left(Q + L_k^T R L_k\right) e^{A_k t}\,dt + e^{A_k^T T} P_k e^{A_k T}$
a strange pseudo-discretized RE
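c.f. DT RE
$P_{k+1} = A^T P_k A + Q - A^T P_k B \left(R + B^T P_k B\right)^{-1} B^T P_k A$
or, in closed-loop form with $L_k = \left(R + B^T P_k B\right)^{-1} B^T P_k A$ and $A_k = A - B L_k$,
$P_{k+1} = A_k^T P_k A_k + Q + L_k^T R L_k$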
Lemma 2. CT HDP is equivalent to
$P_{k+1} - P_k = \int_0^T e^{A_k^T t} \left(P_k A + A^T P_k + Q - L_k^T R L_k\right) e^{A_k t}\,dt, \quad A_k = A - B R^{-1} B^T P_k$
ADP solves the CT ARE without knowledge of the system dynamics A
This extra term means the initial control action need not be stabilizing
When ADP converges, the resulting P satisfies the continuous-time ARE!
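A scalar example makes the greedy recursion concrete. For $\dot{x} = ax + bu$ the update $P_{k+1} = \int_0^T e^{A_k^T t}(Q + L_k^T R L_k)e^{A_k t}\,dt + e^{A_k^T T} P_k e^{A_k T}$ has a closed form, so the iteration can be run without any simulation. The sketch below is an illustration under assumptions not taken from the talk: the function name `greedy_step`, the open-loop unstable plant `a = 1`, the interval `T = 0.1`, and the start `p = 0` (zero gain, which is not stabilizing here).

```python
import math

def greedy_step(p, a, b, q, r, T):
    """One scalar greedy (HDP) update of the cost weight p over an interval of length T."""
    L = b * p / r                       # current gain L_k = R^{-1} B^T P_k
    ak = a - b * L                      # closed-loop pole A_k
    # integral of exp(2*ak*t) over [0, T]; equals T in the limit ak -> 0
    gain = T if abs(ak) < 1e-12 else math.expm1(2 * ak * T) / (2 * ak)
    return (q + r * L * L) * gain + math.exp(2 * ak * T) * p

# Hypothetical open-loop unstable plant; the initial policy u = 0 is not stabilizing.
a, b, q, r, T = 1.0, 1.0, 1.0, 1.0, 0.1
p = 0.0
for _ in range(300):
    p = greedy_step(p, a, b, q, r, T)

p_are = r * (a + math.sqrt(a * a + b * b * q / r)) / (b * b)   # positive ARE root
print(p, p_are)
```

In this example the iterates increase monotonically from zero toward the ARE solution, consistent with the remark that the initial control action need not be stabilizing.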
Direct OPTIMAL ADAPTIVE CONTROL
Solve the Riccati Equation WITHOUT knowing the plant dynamics
Model-free ADP
Works for nonlinear systems: a neural network is used to approximate the cost
Robustness?
Comparison with adaptive control methods?
Policy Evaluation – Critic update
Let K be any state feedback gain for the system (1). One can measure the associated cost over the infinite time horizon
$V(t, x(t)) = \int_t^{t+T} x^T(\tau) \left(Q + K^T R K\right) x(\tau)\,d\tau + W(t+T, x(t+T))$
where $W(t+T, x(t+T))$ is an initial infinite horizon cost to go.
What to do about the tail: issues in Receding Horizon Control
[Figure: simulation results. Panels: Control signal, Controller parameters, and System states, each versus Time (s) over 0 to 2 s; Critic parameters versus Time (s) over 0 to 6 s, showing P(1,1), P(1,2), P(2,2) and their optimal values.]
Simulations on:
- F-16 autopilot
- Load frequency control for power system
A matrix not needed
Converge to SS Riccati equation soln
Continuous-Time Optimal Control
Bellman equation for a given control, in which the dynamics appear explicitly:
$0 = \left(\frac{\partial V}{\partial x}\right)^T f(x,u) + r(x,u) \equiv H\left(x, \frac{\partial V}{\partial x}, u\right), \quad V(0) = 0$
c.f. DT value recursion, where f(), g() do not appear:
$V_h(x_k) = r(x_k, h(x_k)) + \gamma V_h(x_{k+1})$