Moncrief-O’Donnell Chair, UTA Research Institute (UTARI), The University of Texas at Arlington, USA
F.L. Lewis, National Academy of Inventors
Talk available online at http://www.UTA.edu/UTARI/acs
Applications of Integral Reinforcement Learning: Microgrids, UAV, Human-Robot Interaction
Supported by: ONR, US NSF
Applications of Reinforcement Learning
• Microgrid Control
• Human-Robot Interactive Learning
• Industrial process control: mineral grinding in Gansu, China
• H-infinity control for UAV
• Resilient Control to Cyber-Attacks in Networked Multi-agent Systems
• Decision & Control for Heterogeneous MAS (different dynamics)
Work of Vahidreza Nasirian with Ali Davoudi
Game-theoretic Control for DC Microgrids
AC Microgrid:
1) Complex synchronization procedure for grid-tied operation (frequency, magnitude, and phase match is required)
2) Complex control circuitry (voltage, frequency, and active/reactive power control)
3) Unwanted transmission loss due to reactive power exchange
4) Redundant dc-ac-dc conversions for integration of renewable sources, loads, and storage units
5) Harmonic current management and phase unbalances
DC Microgrid:
1) Only voltage and power control is needed
2) No reactive power flow and, thus, an improved overall efficiency
3) Converted renewable energies are basically dc and, thus, a dc distribution is more effective for integration of these sources
4) No harmonic current or phase unbalance issue
Advantages of DC Microgrids
Cooperative Game-theoretic Control of Active Loads in DC Microgrids
Power buffer operation during a step change in power demand.
Supplies excess power needed during load changes until sources can respond
Power buffers in Microgrid Network
Ling-ling Fan, V. Nasirian, H. Modares, F.L. Lewis, Y.D. Song, and A. Davoudi, “Game-theoretic Control of Active Loads in DC Microgrids,” IEEE Trans. Energy Conversion, vol. 31, no. 3, pp. 882-895, 2016.
Active Load Power Buffer

$$\dot{e}_i = \frac{v_i^2}{r_i} - p_i, \qquad \dot{r}_i = u_i$$

where $e_i$ is the stored energy, $r_i$ the input impedance, $v_i$ the bus voltage, $u_i$ the control input, and $p_i$ the output power (a disturbance).
Vahid Nasirian
Nonlinear dynamics. Not obvious how to handle the disturbance $p_i$.
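To make the buffer model concrete, here is a minimal simulation sketch of the two-state dynamics above under a step change in demanded power. The Euler integration, the regulation law for $u_i$, and all numerical values (bus voltage, gains, references) are illustrative assumptions, not values from the cited paper:

```python
import numpy as np

def simulate_buffer(v=48.0, dt=1e-3, T=2.0, e_ref=10.0):
    """Euler simulation of one active-load power buffer:
         e_dot = v**2 / r - p    (stored energy)
         r_dot = u               (input impedance, driven by the control)
    All numbers are illustrative, not values from the cited paper."""
    e, r = e_ref, 57.6                      # start at an equilibrium
    log = []
    for k in range(int(T / dt)):
        t = k * dt
        p = 40.0 if t < 1.0 else 60.0       # step change in power demand
        # simple stabilizing law: regulate e to e_ref, with damping
        u = 2.0 * (e - e_ref) + 5.0 * (v**2 / r - p)
        e += dt * (v**2 / r - p)            # buffer absorbs the transient
        r += dt * u
        log.append((t, e, r))
    return np.array(log)

traj = simulate_buffer()
print(traj[-1])   # (t, e, r) after the demand step settles
```

During the step the stored energy dips while the impedance is steered to the new equilibrium, which is the "supplies excess power until sources can respond" behavior described above.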
Define coupled performance indices

$$J_i = \int_0^\infty \Big( \sum_{j \in N_i} \mathbf{x}_j^T Q_{ij}\, \mathbf{x}_j + \rho_i u_i^2 \Big)\, dt, \qquad i = M+1, \ldots, M+N$$
Solve for the bus voltage to get coupled agent dynamics

$$\dot{\mathbf{x}}_i = A_i \mathbf{x}_i + B_i u_i + D_i w_i + \sum_{j = M+1,\ j \neq i}^{M+N} A_{ij}\, \mathbf{x}_j, \qquad i = M+1, \ldots, M+N$$

with agent state $\mathbf{x}_i$ built from the stored energy $e_i$, input impedance $r_i$, and load power $p_i$.
Define Communication Graph
• Sparse, efficient topology
• Optimal design provides resilience and disturbance rejection
Vahid Nasirian, Reza Modares, Dr. Ali Davoudi
Linearize. Add $p_i$ as a state. Formulate as an H-infinity problem.
Coupling terms
Optimal Cooperative Control as a Dynamic Game
Minimize the performance function for active loads
$$J_i = \int_0^\infty \Big( \sum_{j \in N_i} \mathbf{x}_j^T Q_{ij}\, \mathbf{x}_j + \rho_i u_i^2 \Big)\, dt$$

Let's define the neighborhood state vector as $\underline{\mathbf{x}}_i = \big[\, \mathbf{x}_i^T,\ \{\mathbf{x}_j^T\}_{j \in N_i} \,\big]^T$.

The optimal solution is of the general form $u_i = k_i\, \underline{\mathbf{x}}_i$.

With such solutions, the performance function $J_i$ is quadratic in $\underline{\mathbf{x}}_i$:

$$J_i(\underline{\mathbf{x}}_i) = \underline{\mathbf{x}}_i^T P_i\, \underline{\mathbf{x}}_i$$

which helps to find the optimal solution by solving an algebraic Riccati equation:

$$u_i^* = -\rho_i^{-1} B_{ii}^T P_i\, \underline{\mathbf{x}}_i$$
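As a concrete illustration of this Riccati step, the sketch below solves a single continuous-time ARE and forms the feedback $u_i^* = -\rho_i^{-1} B_{ii}^T P_i \underline{\mathbf{x}}_i$. All matrices are illustrative placeholders (not the microgrid model's values), and the true graphical-game solution couples the AREs across neighbors:

```python
import numpy as np
from scipy.linalg import solve_continuous_are

# Illustrative 3-state agent (energy, impedance, load power);
# matrices are placeholders, not the microgrid model's actual values.
A = np.array([[0.0, -0.5, -1.0],
              [0.0,  0.0,  0.0],
              [0.0,  0.0, -2.0]])
B = np.array([[0.0], [1.0], [0.0]])   # control drives the impedance state
Q = np.diag([10.0, 1.0, 0.1])         # state weighting Q_ii
R = np.array([[1.0]])                 # control weighting rho_i

P = solve_continuous_are(A, B, Q, R)  # algebraic Riccati equation
K = np.linalg.solve(R, B.T @ P)       # u_i* = -rho_i^{-1} B_ii^T P_i x_i
print("P =\n", P, "\nK =", K)
```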
Graphical Game
Optimal Cooperative Control: Policy Iteration finds Optimal Solutions
• Substituting the optimal solution in Bellman equations leads to the following coupled Algebraic Riccati Equations (ARE)
• Policy iteration (a class of reinforcement learning) is used to solve ARE and find Pi and the optimal control input
• Policy evaluation: the performance of a given control policy, ui, is evaluated using the Bellman equation, and Pi are found.
• Policy improvement: an improved control policy, ui, is found for each agent, using Pi found in the first step.
• Policy evaluation and improvement are repeated until no improvement in control policies, ui, of any agent is observed.
$$H_i = \underline{\mathbf{x}}_i^T Q_i\, \underline{\mathbf{x}}_i + \rho_i (u_i^*)^2 + \underline{\mathbf{x}}_i^T P_i \big( A_i \underline{\mathbf{x}}_i + B_i u_i^* + D_i w_i(\underline{\mathbf{x}}_i) \big) + \big( A_i \underline{\mathbf{x}}_i + B_i u_i^* + D_i w_i(\underline{\mathbf{x}}_i) \big)^T P_i\, \underline{\mathbf{x}}_i = 0$$

evaluated at the neighborhood policies $\underline{u}_i^* = \big[\, u_i^*,\ \{u_j^*\}_{j \in N_i} \,\big]^T$, with

$$u_i^* = -\rho_i^{-1} B_{ii}^T P_i\, \underline{\mathbf{x}}_i$$
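A minimal sketch of the policy-evaluation / policy-improvement cycle described above, in the linear-quadratic case: policy evaluation solves a Lyapunov equation for $P$ under the current gain, and policy improvement updates the gain from $P$ (Kleinman's algorithm). The plant matrices are illustrative placeholders, not the coupled microgrid game:

```python
import numpy as np
from scipy.linalg import solve_continuous_lyapunov

def policy_iteration(A, B, Q, R, K0, iters=20):
    """LQR policy iteration: evaluate the current policy u = -K x via a
    Lyapunov equation, then improve K from the resulting P."""
    K = K0  # must be stabilizing
    for _ in range(iters):
        Ac = A - B @ K
        # Policy evaluation: Ac^T P + P Ac + Q + K^T R K = 0
        P = solve_continuous_lyapunov(Ac.T, -(Q + K.T @ R @ K))
        # Policy improvement: K = R^{-1} B^T P
        K = np.linalg.solve(R, B.T @ P)
    return K, P

A = np.array([[0.0, 1.0], [-1.0, -0.5]])   # open-loop stable, so K0 = 0 works
B = np.array([[0.0], [1.0]])
Q, R = np.eye(2), np.eye(1)
K, P = policy_iteration(A, B, Q, R, K0=np.zeros((1, 2)))
print(K)
```

Each pass through the loop is one evaluation/improvement cycle; iteration stops in practice when the gain no longer changes, matching the stopping rule in the bullets above.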
(a) DC microgrid system, (b) Active load, (c) Communication network
Controller Implementation
Microgrid Setup and Cooperative Controller
Controller Performance with Load Change
(a) Microgrid bus voltages at the load terminals, (b) output voltage of the power buffers, (c) output voltage across the resistive loads, (d) source currents, (e) stored energies in power buffers, (f) input impedance of the power buffers, (g) output of the active loads, (h) energy-impedance trajectory of power buffers during the load transient.
Load change in bus 5: buffers 4 & 5 assisting. Load change in bus 4: multiple assistive buffers.
Intelligent Operational Control for Complex Industrial Processes
Professor Chai Tianyou
State Key Laboratory of Synthetical Automation for Process Industries
Northeastern University, May 20, 2013
Jinliang Ding
1. Jinliang Ding, H. Modares, Tianyou Chai, and F.L. Lewis, “Data-based Multi-objective Plant-wide Performance Optimization of Industrial Processes under Dynamic Environments,” IEEE Trans. Industrial Informatics, vol. 12, no. 2, pp. 454-465, April 2016.
2. Xinglong Lu, B. Kiumarsi, Tianyou Chai, and F.L. Lewis, “Data-driven Optimal Control of Operational Indices for a Class of Industrial Processes,” IET Control Theory & Applications, vol. 10, no. 12, pp. 1348-1356, 2016.
Manufacturing as the Interactions of Multiple Agents
• Each machine has its own dynamics and cost function
• Neighboring machines influence each other most strongly
• There are local optimization requirements as well as global necessities
Production line for mineral processing plant
Mineral Processing Plant in Gansu China
Existing Manual Control for Plant production indices, unit operational indices, and unit process control for a production line
[Diagram: overall plant-wide control hierarchy relating plant production indices $Q_k(mT)$ and their estimates $\hat{Q}_k(t)$, targets $\{Q_k^*, Q_k^{\min}, Q_k^{\max}\}$, and unit operational indices $r_{i,j}(mT)$, $\hat{r}(t)$, $r_{i,j}^*$, $i = 1, \ldots, n$.]
Automated online reinforcement learning for determining operational indices
Implemented by Jinliang Ding and Chai Tianyou's group in the largest hematite iron ore mineral processing factory in China, in Gansu Province.
Savings of 30.75 million RMB per year were realized by implementing this automated optimization procedure instead of the standard industry practice of human operator selection of process operational indices.
Two RL loops and value function approximation
Xinglong Lu, B. Kiumarsi, Tianyou Chai, and F.L. Lewis, “Data-driven Optimal Control of Operational Indices for a Class of Industrial Processes,” IET Control Theory & Applications, vol. 10, no. 12, pp. 1348-1356, 2016.
Yi Jiang, Jialu Fan, Tianyou Chai, Jinna Li, and F.L. Lewis, “Data-Driven Flotation Industrial Process Operational Optimal Control Based on Reinforcement Learning,” IEEE Trans. Industrial Informatics, to appear, 2018.
Jinna Li, Tianyou Chai, F.L. Lewis, Jialu Fan, Zhangtao Ding, and Jinliang Ding, “Off-policy Q-learning: set-point design for optimizing dual-rate rougher flotation operational processes,” IEEE Trans. Industrial Electronics, vol. 65, no. 5, pp. 4092-4102, May 2018.
Jinna Li, Bahare Kiumarsi, Tianyou Chai, F.L. Lewis, and Jialu Fan, “Off-Policy Reinforcement Learning: Optimal Operational Control for Two-Time-Scale Industrial Processes,” IEEE Trans. Cybernetics, vol. 47, no. 12, pp. 4547-4558, Dec. 2017.
Jinliang Ding, H. Modares, Tianyou Chai, and F.L. Lewis, “Data-based Multi-objective Plant-wide Performance Optimization of Industrial Processes under Dynamic Environments,” IEEE Trans. Industrial Informatics, vol. 12, no. 2, pp. 454-465, April 2016.
Control of Non-affine Aerial Systems Using Off-policy Reinforcement Learning
UAV dynamics: non-affine nonlinear aerial vehicle model

$$\dot{X}(t) = f(X(t)) + g(X(t))\,L(u) + D\,w(t)$$

$$\begin{aligned}
\dot{x}_1 &= V\cos\gamma\cos\chi + d_1 w_1\\
\dot{x}_2 &= V\cos\gamma\sin\chi + d_2 w_2\\
\dot{x}_3 &= -V\sin\gamma + d_3 w_3\\
\dot{V} &= g\,(n_x - \sin\gamma)\\
\dot{\gamma} &= \frac{g}{V}\,(n\cos\sigma - \cos\gamma)\\
\dot{\chi} &= \frac{g\, n \sin\sigma}{V\cos\gamma}
\end{aligned}$$

with

$$n_x = \frac{T\cos\alpha - D_n}{mg}, \qquad n = \frac{T\sin\alpha + K}{mg}, \qquad T \le T_{\max}$$
Dynamics

$$\dot{X}(t) = f(X(t)) + g(X(t))\,L(u) + D\,w(t)$$

where

$$f(X(t)) = \begin{bmatrix} x_4\cos(x_5)\cos(x_6)\\ x_4\cos(x_5)\sin(x_6)\\ -x_4\sin(x_5)\\ -g\sin(x_5)\\ -\dfrac{g}{x_4}\cos(x_5)\\ 0 \end{bmatrix}, \qquad
L(u(t)) = \begin{bmatrix} L_1\\ L_2\\ L_3\\ L_4\\ L_5 \end{bmatrix} = \begin{bmatrix} u_1\\ u_2^2\\ u_2\\ u_2\cos(u_3)\\ u_2\sin(u_3) \end{bmatrix}$$

and $g(X(t))$ is the $6 \times 5$ input matrix whose first three (position) rows are zero, so the inputs enter only through the force equations (rows 4-6, with entries involving $1/x_4$, $\cos(x_5)$, and the angle of attack $\alpha$).

State: $X = \{x_1, x_2, x_3, V, \gamma, \chi\}^T$
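The position kinematics in $f(X)$ can be sanity-checked with a short integrator; the sketch below propagates only the first three states under constant speed, flight-path angle, and heading (the force rows and input map should be taken from the cited paper; all values here are illustrative):

```python
import numpy as np

def integrate_kinematics(V, gamma, chi, dt=0.01, T=10.0):
    """Integrate only the position kinematics of the UAV model:
         x1_dot =  V cos(gamma) cos(chi)
         x2_dot =  V cos(gamma) sin(chi)
         x3_dot = -V sin(gamma)
    for constant V, gamma, chi (illustrative values below)."""
    pos = np.zeros(3)
    for _ in range(int(T / dt)):
        pos += dt * np.array([V * np.cos(gamma) * np.cos(chi),
                              V * np.cos(gamma) * np.sin(chi),
                              -V * np.sin(gamma)])
    return pos

# 10 s of flight at 30 m/s, 3 deg flight-path angle, 45 deg heading
print(integrate_kinematics(V=30.0, gamma=np.deg2rad(3.0), chi=np.deg2rad(45.0)))
```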
Optimal Control for Constrained Input Systems
This is a quasi-norm:

$$W(u) = 2\int_0^u \tanh^{-T}(v)\, R\, dv$$

Weaker than a norm: the homogeneity property is replaced by the weaker symmetry property $W(-u) = W(u)$. (Used by Lyshevsky for H2 control.)

Control constrained by the saturation function $\tanh(p)$, which saturates at $\pm 1$.

Encode the constraint into the value function:

$$J(x, u) = \int_0^\infty \Big( Q(x) + 2\int_0^u \tanh^{-T}(v)\, R\, dv \Big)\, dt$$

Then

$$u = -\tanh\Big( \tfrac{1}{2}\, R^{-1} g^T(x)\, \frac{\partial V}{\partial x} \Big)$$

is BOUNDED.
Murad Abu-Khalaf
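A small numerical sketch of this encoding in the scalar case, with saturation limit $\lambda$ and $R = 1$ (all values illustrative): the penalty $W(u) = 2\int_0^u \lambda \tanh^{-1}(v/\lambda)\, dv$ replaces $u^T R u$, and the resulting control stays inside $(-\lambda, \lambda)$ no matter how large the value-function gradient gets:

```python
import numpy as np
from scipy.integrate import quad

lam = 1.0  # saturation limit of the tanh-type actuator (illustrative)

def W(u):
    """Constrained-input penalty W(u) = 2 * int_0^u lam*atanh(v/lam) dv,
    which encodes |u| < lam into the value function."""
    val, _ = quad(lambda v: 2.0 * lam * np.arctanh(v / lam), 0.0, u)
    return val

def bounded_control(gradV_g):
    """u = -lam * tanh(g^T V_x / (2*lam)): bounded for ANY gradient."""
    return -lam * np.tanh(gradV_g / (2.0 * lam))

print(W(0.9))                  # penalty grows steeply near the limit
print(bounded_control(100.0))  # stays within (-lam, lam)
```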
H-infinity Control Tracking Problem

UAV dynamics: $\dot{X}(t) = f(X(t)) + g(X(t))\,L(u) + D\,w(t)$

Desired trajectory generator: $\dot{X}_d(t) = h_d(X_d)$

Bounded L2 norm (discounted L2-gain condition):

$$\int_t^\infty e^{-\alpha(\tau - t)}\, \|z(\tau)\|^2\, d\tau \;\le\; \gamma^2 \int_t^\infty e^{-\alpha(\tau - t)}\, \|w(\tau)\|^2\, d\tau$$

where $\|z(t)\|^2 = (X - X_d)^T Q\,(X - X_d) + W(L(u))$

Constrained controls: $|u_1| \le \bar{u}_1$, $|u_2| \le \bar{u}_2$

Formulate as Optimal Control Problem:

$$J = \int_t^\infty e^{-\alpha(\tau - t)} \Big[ (X - X_d)^T Q\,(X - X_d) + W(L(u)) - \gamma^2 w^T w \Big]\, d\tau$$
Write Augmented System and Leader Dynamics

Tracking error: $e(t) = X(t) - X_d(t)$

Augmented state: $Z(t) = \begin{bmatrix} e(t) \\ X_d(t) \end{bmatrix}$

Augmented tracking dynamics:

$$\dot{Z}(t) = \begin{bmatrix} \dot{e}(t) \\ \dot{X}_d(t) \end{bmatrix} = \begin{bmatrix} f(e + X_d) - h_d(X_d) \\ h_d(X_d) \end{bmatrix} + \begin{bmatrix} g(e + X_d) \\ 0 \end{bmatrix} L(u) + \begin{bmatrix} D \\ 0 \end{bmatrix} w(t) \;\equiv\; F(Z(t)) + G(Z(t))\,L(u) + K\,w(t)$$

Performance index:

$$J(L(u), w) = \int_t^\infty e^{-\alpha(\tau - t)} \Big[ Z^T Q_1 Z + W(L(u)) - \gamma^2 w^T w \Big]\, d\tau, \qquad \text{with } Q_1 = \begin{bmatrix} Q & 0 \\ 0 & 0 \end{bmatrix}$$
Optimal H-inf Tracker

Bellman equation:

$$H(Z, L(u), w, \nabla_Z V) = Z^T Q_1 Z + W(L(u)) - \gamma^2 w^T w - \alpha V(Z) + (\nabla_Z V)^T \big( F(Z) + G(Z)\,L(u) + K w \big) = 0$$

Stationarity conditions give the optimal control and worst-case disturbance:

$$L(u^*) = \arg\min_{L(u)} H(Z, L(u), w, V^*), \qquad w^* = \arg\max_{w} H(Z, L(u), w, V^*)$$

So that

$$L(u^*) = -\bar{L}\, \tanh\!\big( (2\bar{L})^{-1}\, G^T(Z)\, \nabla_Z V^* \big), \qquad w^* = \frac{1}{2\gamma^2}\, K^T \nabla_Z V^*$$

Assume $L(u)$ is invertible. Then $u^* = L^{-1}\big( -\bar{L}\, \tanh(\bar{v}^*) \big)$.
Reinforcement Learning Policy Iteration Solution
Need to know input matrices G and K
Augmented dynamics: $\dot{Z}(t) = F(Z(t)) + G(Z(t))\,L(u) + K\,w(t)$

Rewrite in terms of the iteration-$j$ policies $u_j, w_j$:

$$\dot{Z}(t) = F(Z(t)) + G(Z(t))\,L(u_j) + K w_j + G(Z(t))\big( L(u) - L(u_j) \big) + K\,(w - w_j)$$
Off‐Policy IRL Solution
Do not need any of the dynamics of UAV or leader
Off-Policy Bellman Equation
Data‐Driven Real‐Time Solution Using VFA
Approximate critic, control, disturbance
Plug into Off‐Policy Bellman Equation to get algebraic equations for the weights
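A schematic of the VFA step, assuming a quadratic polynomial critic $V(Z) \approx W^T \phi(Z)$: each data sample contributes one row to a least-squares problem for the critic weights. The basis, the synthetic data, and the scalar targets below are illustrative stand-ins for the integral off-policy Bellman equation, not the paper's exact construction:

```python
import numpy as np

def phi(Z):
    """Quadratic polynomial basis for the critic V(Z) ~ W^T phi(Z)."""
    z1, z2 = Z
    return np.array([z1 * z1, z1 * z2, z2 * z2])

# One least-squares row per sample. In the off-policy Bellman equation the
# row would be the change of the basis along measured trajectory data and
# the target the integrated utility; here synthetic data from a known
# quadratic value function stands in for that (illustrative only).
rng = np.random.default_rng(0)
W_true = np.array([2.0, 0.5, 1.0])
rows, targets = [], []
for _ in range(50):
    Z = rng.normal(size=2)
    rows.append(phi(Z))
    targets.append(W_true @ phi(Z) + 0.01 * rng.normal())

W_hat, *_ = np.linalg.lstsq(np.array(rows), np.array(targets), rcond=None)
print(W_hat)   # recovers approximately W_true
```

The same least-squares structure carries over when the actor and disturbance approximators are stacked alongside the critic, which is how the weight equations become algebraic.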
RL for Human-Robot Interaction (HRI)
1. H. Modares, I. Ranatunga, F.L. Lewis, and D.O. Popa, “Optimized Assistive Human-robot Interaction using Reinforcement Learning,” IEEE Transactions on Cybernetics, vol. 46, no. 3, pp. 655-667, 2016.
2. I. Ranatunga, F.L. Lewis, D.O. Popa, and S.M. Tousif, “Adaptive Admittance Control for Human-Robot Interaction Using Model Reference Design and Adaptive Inverse Filtering,” IEEE Transactions on Control Systems Technology, vol. 25, no. 1, pp. 278-285, Jan. 2017.
3. B. AlQaudi, H. Modares, I. Ranatunga, S.M. Tousif, F.L. Lewis, and D.O. Popa, “Model reference adaptive impedance control for physical human robot interaction,” Control Theory and Technology, vol. 14, no. 1, pp. 1-15, Feb. 2016.
PR2 meets Isura
Robot dynamics
Prescribed Error system
Control torque depends on impedance model parameters
Impedance Control
Standard Robot Trajectory Tracking Controller
Where is the human?
Human task learning has 2 components:
1. Human learns a robot dynamics model to compensate for robot nonlinearities
2. Human learns a task model to properly perform a task
Inner Robot-Specific Control Loop: INDEPENDENT OF TASK
Outer Task-Specific Control Loop: INDEPENDENT OF ROBOT DETAILS
Human Performance Factors Studies
Robot control inner loop
Task control outer loop
RL for Human‐Robot Interactions
• No task trajectory information is used in this inner-loop robot controller
• The inner-loop robot controller makes the model-following error small
• The admittance model parameters are not needed
• Only the admittance model trajectories $x_m, \dot{x}_m, \ddot{x}_m$ are needed
New Inner Robot Control Loop
Three Outer-Loop Designs (to appear, 2016)
2C. Outer‐loop Task Specific Design #3
Reinforcement Learning for minimum human effort
Feedforward assistive control term
[Block diagram: human operator (gain $K_h$) and controller $(K_p s + K_d)\,s^{-1}$ acting through the prescribed impedance model $(Ms^2 + Bs + K)^{-1}$, with desired trajectory $x_d$, model trajectory $x_m$, human force $f_h$, and tracking error $e_d$.]
Find robot impedance model parameters $M, B, K$ to minimize human force effort $f_h$ and task trajectory following error $e_d$.
Human force amplifier
Work of Reza Modares
Force exerted by the human indicates discontent: a measure of human intent.
Feedback linearization loop
Robot Impedance Model Unknown Human Model
Unknown human model:

$$(K_d s + K_p)\, f_h = k_e\, e_d \quad\Longrightarrow\quad \dot{f}_h = -K_d^{-1} K_p\, f_h + K_d^{-1} k_e\, e_d \;\equiv\; A_h f_h + E_h e_d$$
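A quick simulation of this first-order human response model under a step tracking error; the gains $K_d, K_p, k_e$ are illustrative, not identified values:

```python
import numpy as np

Kd, Kp, ke = 1.0, 4.0, 2.0          # illustrative human-model gains
dt, T = 1e-3, 3.0
f_h = 0.0                            # human force
for _ in range(int(T / dt)):
    e_d = 0.1                        # step tracking error
    # f_h_dot = A_h f_h + E_h e_d  (first-order human response)
    f_h += dt * (-Kp / Kd * f_h + ke / Kd * e_d)
print(f_h)   # settles near (ke / Kp) * e_d = 0.05
```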
Minimize human effort and tracking error
Performance index:

$$J = \int_t^\infty \big( e_d^T Q_d\, e_d + f_h^T Q_h\, f_h + u^T R\, u \big)\, d\tau$$

Then the control is $u = K_1 e_d + K_2 f_h$, and in terms of the augmented state $X$:

$$J = \int_t^\infty \big( X^T Q_e X + u^T R\, u \big)\, d\tau$$
Overall Augmented Dynamics
$$e_d = x_d - x_m \in \mathbb{R}^n, \qquad \bar{e}_d = \big[\, e_d^T \ \ \dot{e}_d^T \,\big]^T \in \mathbb{R}^{2n}$$
Augmented Tracker Dynamics with Human and Tracking Error
We want an online method to learn the optimal control without knowing the system matrix A.
Optimal Design Always Admits Reinforcement Learning for Real‐time Optimal Adaptive Control
Optimal control is an offline method, based on solving an ARE, knowing all the plant dynamics.
Take enough data along the system trajectory to solve this equation using least-squares.
OFF-POLICY Reinforcement Learning needs NO knowledge of the system dynamics.
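A compact sketch of these last bullets for the LQR case: collect cost and state data along short trajectory windows under the current policy and solve the IRL Bellman equation $V(x_0) - V(x_1) = \int \big( x^T Q x + u^T R u \big)\, dt$ by least squares. The system matrix $A$ below is used only to generate data and is never read by the learner; the policy-improvement step shown uses $B$, while fully off-policy variants also identify that term from data. All numbers are illustrative:

```python
import numpy as np

# True plant, used ONLY to generate data; the learner never reads A.
A = np.array([[0.0, 1.0], [-2.0, -0.3]])
B = np.array([[0.0], [1.0]])
Q, R = np.eye(2), np.eye(1)

def rollout(x, K, dt, steps):
    """Integrate x_dot = Ax + Bu under u = -Kx; return end state and cost."""
    cost = 0.0
    for _ in range(steps):
        u = -K @ x
        cost += dt * (x @ Q @ x + u @ R @ u)
        x = x + dt * (A @ x + B @ u)
    return x, cost

def vech_basis(x):
    """Basis for symmetric P: V(x) = p1*x1^2 + 2*p2*x1*x2 + p3*x2^2."""
    return np.array([x[0]**2, 2 * x[0] * x[1], x[1]**2])

K = np.zeros((1, 2))                  # initial policy (plant is open-loop stable)
rng = np.random.default_rng(1)
for it in range(8):
    rows, rhs = [], []
    for _ in range(20):               # data windows along trajectories
        x0 = rng.normal(size=2)
        x1, c = rollout(x0, K, dt=1e-3, steps=200)
        rows.append(vech_basis(x0) - vech_basis(x1))   # IRL Bellman equation:
        rhs.append(c)                 # V(x0) - V(x1) = integrated cost
    p = np.linalg.lstsq(np.array(rows), np.array(rhs), rcond=None)[0]
    P = np.array([[p[0], p[1]], [p[1], p[2]]])
    K = np.linalg.solve(R, B.T @ P)   # policy improvement
print("learned K:", K)
```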