New Developments in Integral Reinforcement Learning: Graphical Games, Off-policy, Tracking

F.L. Lewis, National Academy of Inventors
Moncrief-O'Donnell Chair, UTA Research Institute (UTARI), The University of Texas at Arlington, USA
and
Qian Ren Consulting Professor, State Key Laboratory of Synthetical Automation for Process Industries, Northeastern University, Shenyang, China

Talk available online at http://www.UTA.edu/UTARI/acs

Supported by: ONR, US NSF, China NNSF, China Project 111
Data-driven Online Solution of Differential Games: Synchronous Solution of Multi-player Non-Zero-sum Games

Multi-player Game Solutions, IEEE Control Systems Magazine, Dec. 2017
Multi‐player Differential Games
Sun Tzu, The Art of War (孙子兵法), c. 500 BC
Multi‐player Differential Games
Manufacturing as the Interactions of Multiple Agents:
Each machine has its own dynamics and cost function.
Neighboring machines influence each other most strongly.
There are local optimization requirements as well as global necessities.
$$\dot\delta_i = A\delta_i + (d_i+g_i)B_i u_i - \sum_{j\in N_i} e_{ij}B_j u_j$$

$$J_i(\delta_i(0),u_i,u_{-i}) = \tfrac12\int_0^\infty \Big(\delta_i^T Q_{ii}\delta_i + u_i^T R_{ii} u_i + \sum_{j\in N_i} u_j^T R_{ij} u_j\Big)dt$$
Each process has its own dynamics
And cost function
Each process helps other processes achieve optimality and efficiency
Real‐Time Solution of Multi‐Player NZS Games
Multi-Player Nonlinear Systems

$$\dot x = f(x) + \sum_{j=1}^N g_j(x)\,u_j$$
Optimal control:
$$V_i^*(x(0),u_1,u_2,\ldots,u_N) = \min_{u_i}\int_0^\infty\Big(Q_i(x) + \sum_{j=1}^N u_j^T R_{ij} u_j\Big)dt,\quad i\in N$$

Nash equilibrium:
$$V_i^* \triangleq V_i(u_1^*,u_2^*,\ldots,u_N^*) \le V_i(u_1^*,\ldots,u_{i-1}^*,u_i,u_{i+1}^*,\ldots,u_N^*),\quad i\in N$$
$$0 = P_i A_c + A_c^T P_i + Q_i + \sum_{j=1}^N P_j B_j R_{jj}^{-T} R_{ij} R_{jj}^{-1} B_j^T P_j,\quad i\in N,
\qquad A_c = A - \sum_{j=1}^N B_j R_{jj}^{-1} B_j^T P_j$$
Requires offline solution of coupled Hamilton-Jacobi-Bellman equations.
Kyriakos Vamvoudakis, Automatica 2011
Continuous‐time, N players
$$0 = \nabla V_i^T\Big(f(x) - \tfrac12\sum_{j=1}^N g_j(x)R_{jj}^{-1}g_j^T(x)\nabla V_j\Big) + Q_i(x) + \tfrac14\sum_{j=1}^N \nabla V_j^T g_j(x)R_{jj}^{-T}R_{ij}R_{jj}^{-1}g_j^T(x)\nabla V_j,\qquad V_i(0)=0$$
Linear Quadratic Regulator Case‐ coupled AREs
These are hard to solve. In the nonlinear case, the HJ equations generally cannot be solved.
Control policies:
$$u_i(x) = -\tfrac12 R_{ii}^{-1} g_i^T(x)\nabla V_i,\quad i\in N$$
$$J_1 = \tfrac13(J_1+J_2+J_3) + \tfrac13(J_1-J_2) + \tfrac13(J_1-J_3) \equiv J_{team} + J_1^{coi}$$
$$J_2 = \tfrac13(J_1+J_2+J_3) + \tfrac13(J_2-J_1) + \tfrac13(J_2-J_3) \equiv J_{team} + J_2^{coi}$$
$$J_3 = \tfrac13(J_1+J_2+J_3) + \tfrac13(J_3-J_1) + \tfrac13(J_3-J_2) \equiv J_{team} + J_3^{coi}$$

The objective function of each player can be written as a team-average term plus a conflict-of-interest term. For N players:
$$J_i = \frac1N\sum_{j=1}^N J_j + \frac1N\sum_{j=1}^N (J_i - J_j) \equiv J_{team} + J_i^{coi},\quad i = 1,\ldots,N$$
For N-player zero-sum games, the first term is zero, i.e. the players have no goals in common.
Team Interest vs. Self Interest
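This decomposition is an algebraic identity, so it can be checked directly. Below is a quick numeric sanity check in Python with made-up scalar costs for three players (the numbers are arbitrary, not from any example in this talk).

    import numpy as np

    # Hypothetical scalar costs for N = 3 players (arbitrary values).
    J = np.array([3.0, 5.0, 10.0])
    N = len(J)

    J_team = J.mean()  # team-average term (1/N) * sum_j J_j
    # Conflict-of-interest term (1/N) * sum_j (J_i - J_j) for each player i.
    J_coi = np.array([(J[i] - J).sum() / N for i in range(N)])

    # Each player's cost splits exactly into team interest plus self interest.
    assert np.allclose(J_team + J_coi, J)
    print(J_team, J_coi)  # -> 6.0 [-3. -1.  4.]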
Real‐Time Solution of Multi‐Player Games
Value functions:
$$V_i(x(0),u_1,u_2,\ldots,u_N) = \int_0^\infty\Big(Q_i(x) + \sum_{j=1}^N u_j^T R_{ij} u_j\Big)dt,\quad i\in N$$
Solve Bellman eq.
Policy Iteration Solution:

1. Solve the coupled Bellman equations for the values $V_i^k$:
$$0 = r(x,u_1^k,\ldots,u_N^k) + (\nabla V_i^k)^T\Big(f(x) + \sum_{j=1}^N g_j(x)u_j^k\Big),\qquad V_i^k(0)=0,\quad i\in N$$

2. Policy update:
$$u_i^{k+1}(x) = -\tfrac12 R_{ii}^{-1} g_i^T(x)\nabla V_i^k,\quad i\in N$$
Non-Zero-Sum Games – Synchronous Policy Iteration (Kyriakos Vamvoudakis)
Convergence has not been proven, and the Hamiltonian equation is hard to solve. But this gives the structure we need for the online Synchronous PI solution.
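For the LQ case, each step of this policy iteration reduces to Lyapunov equations, so the structure can be sketched offline in a few lines. The following is a minimal two-player illustration with made-up system matrices; as noted above, convergence of this coupled iteration is not guaranteed in general.

    import numpy as np
    from scipy.linalg import solve_continuous_lyapunov

    # Made-up 2-player LQ game data; player i has weights Qi, Rii, Rij.
    A  = np.array([[0.0, 1.0], [-1.0, -2.0]])   # open-loop stable here
    B1 = np.array([[0.0], [1.0]])
    B2 = np.array([[0.0], [0.5]])
    Q1, Q2   = np.eye(2), 2 * np.eye(2)
    R11, R22 = np.eye(1), np.eye(1)
    R12, R21 = 0.5 * np.eye(1), 0.5 * np.eye(1)

    K1, K2 = np.zeros((1, 2)), np.zeros((1, 2))  # admissible since A is stable
    for _ in range(50):
        Ac = A - B1 @ K1 - B2 @ K2
        # Policy evaluation: each coupled Bellman eq. is a Lyapunov eq. offline.
        P1 = solve_continuous_lyapunov(Ac.T, -(Q1 + K1.T @ R11 @ K1 + K2.T @ R12 @ K2))
        P2 = solve_continuous_lyapunov(Ac.T, -(Q2 + K2.T @ R22 @ K2 + K1.T @ R21 @ K1))
        # Policy improvement: u_i = -Rii^{-1} Bi^T Pi x, i.e. gain Ki.
        K1 = np.linalg.solve(R11, B1.T @ P1)
        K2 = np.linalg.solve(R22, B2.T @ P2)
    print(P1, P2, sep="\n")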
Differential equivalent gives coupled Bellman eqs.
$$0 = Q_i(x) + \sum_{j=1}^N u_j^T R_{ij} u_j + \nabla V_i^T\Big(f(x) + \sum_{j=1}^N g_j(x)u_j\Big) \equiv H_i(x,\nabla V_i,u_1,\ldots,u_N),\quad i\in N$$
Leibniz gives Differential equivalent
Real‐Time Solution of Multi‐Player Games
N Critic Neural Networks for VFA: $\hat V_1(x) = \hat W_1^T\phi_1(x)$, $\hat V_2(x) = \hat W_2^T\phi_2(x)$

Critic tuning (normalized gradient descent on the Bellman error):
$$\dot{\hat W}_1 = -a_1\frac{\sigma_1}{(\sigma_1^T\sigma_1+1)^2}\big[\sigma_1^T\hat W_1 + Q_1(x) + u_1^T R_{11}u_1 + u_2^T R_{12}u_2\big]$$

Actor tuning (for player 1):
$$\dot{\hat W}_3 = -a_3\Big\{(F_2\hat W_3 - F_1\bar\sigma_1^T\hat W_1) - \tfrac14\nabla\phi_1\, g_1(x)R_{11}^{-T}R_{21}R_{11}^{-1}g_1^T(x)\nabla\phi_1^T\hat W_3\,\bar m_2^T\hat W_2 - \tfrac14\bar D_1(x)\hat W_3\,\bar m_1^T\hat W_1\Big\}$$
N Actor Neural Networks
On‐Line Learning – for Player 1:
$$\hat u_1(x) = -\tfrac12 R_{11}^{-1}g_1^T(x)\nabla\phi_1^T\hat W_3$$
Online Synchronous PI Solution for Multi‐Player Games
Learns Bellman eq. solution
Kyriakos Vamvoudakis
$$\hat u_2(x) = -\tfrac12 R_{22}^{-1}g_2^T(x)\nabla\phi_2^T\hat W_4$$
2-player case
Each player needs 2 NN – a Critic and an Actor
Learns control policy
Player 1 Player 2
Convergence is proven using Lyapunov functions
K.G. Vamvoudakis and F.L. Lewis, “Multi-Player Non-Zero Sum Games: Online Adaptive LearningSolution of Coupled Hamilton-Jacobi Equations,” Automatica, vol. 47, pp. 1556-1569, 2011.
Lyapunov energy-based proof:
$$L(t) = V(x) + \tfrac12\,\mathrm{tr}\big(\tilde W_1^T a_1^{-1}\tilde W_1\big) + \tfrac12\,\mathrm{tr}\big(\tilde W_2^T a_2^{-1}\tilde W_2\big)$$
$V(x)$ = unknown solution to the HJB equation:
$$0 = \frac{dV^T}{dx}f + Q(x) - \tfrac14\frac{dV^T}{dx}gR^{-1}g^T\frac{dV}{dx}$$
$$\tilde W_1 = W_1 - \hat W_1,\qquad \tilde W_2 = W_2 - \hat W_2$$

$W_1$ = unknown least-squares solution to the Bellman equation for the given NN basis:
$$H(x,W_1,u) = W_1^T\nabla\phi_1(f+gu) + Q(x) + u^TRu$$
Guarantees stability
Multi-Agent Learning: Guaranteed Convergence Proof
[Diagram: Process i cost function optimization; control updates of process i and of neighbor processes j1 and j2.]

Optimal performance of each process depends on the controls of its neighbor processes.
The control policy of each process depends on the performance of its neighbor processes.
Multi-player Games for Multi-Process Optimal Control: Data-Driven Optimization (DDO)
Simulation – Nonlinear System – 2-player game

$$\dot x = f(x) + g(x)u + k(x)d,\qquad x\in\mathbb{R}^2$$
$$Q_1 = 2Q_2 = 2I,\qquad R_{11} = 2R_{22} = 2I,\qquad R_{12} = 2R_{21} = 2I$$
Optimal policies:
$$u^*(x) = -\tfrac12\big(\cos(2x_1)+2\big)\,x_2,\qquad d^*(x) = -\tfrac12\big(\sin(4x_1^2)+2\big)\,x_2$$
Select VFA basis set
Algorithm converges to
$$\hat W_2(t_f) = [0.2514\;\; 0.0006\;\; 0.5001]^T \approx \hat W_4(t_f)$$
$$f(x) = \begin{bmatrix} x_2 - 2x_1\\ -\tfrac12 x_1 - x_2 + \tfrac14 x_2\big(\cos(2x_1)+2\big)^2 + \tfrac14 x_2\big(\sin(4x_1^2)+2\big)^2 \end{bmatrix},\qquad
g(x) = \begin{bmatrix} 0\\ \cos(2x_1)+2 \end{bmatrix},\qquad
k(x) = \begin{bmatrix} 0\\ \sin(4x_1^2)+2 \end{bmatrix}$$

Optimal values:
$$V_1^*(x) = \tfrac12 x_1^2 + x_2^2,\qquad V_2^*(x) = \tfrac14 x_1^2 + \tfrac12 x_2^2$$

VFA basis: $\phi_1(x) = \phi_2(x) = [x_1^2\;\; x_1x_2\;\; x_2^2]^T$
$$\hat W_1(t_f) = [0.5015\;\; 0.0007\;\; 1.0001]^T \approx \hat W_3(t_f)$$

Learned policies, evaluated with the converged weights:
$$\hat u_1(x) = -\tfrac12 R_{11}^{-1}\begin{bmatrix} 0\\ \cos(2x_1)+2 \end{bmatrix}^T\begin{bmatrix} 2x_1 & 0\\ x_2 & x_1\\ 0 & 2x_2 \end{bmatrix}^T\begin{bmatrix} 0.5015\\ 0.0007\\ 1.0001 \end{bmatrix},\qquad
\hat d_2(x) = -\tfrac12 R_{22}^{-1}\begin{bmatrix} 0\\ \sin(4x_1^2)+2 \end{bmatrix}^T\begin{bmatrix} 2x_1 & 0\\ x_2 & x_1\\ 0 & 2x_2 \end{bmatrix}^T\begin{bmatrix} 0.2514\\ 0.0006\\ 0.5001 \end{bmatrix}$$
Solves the coupled HJ equations online:
$$0 = \nabla V_i^T\Big(f(x) - \tfrac12\sum_{j=1}^N g_j(x)R_{jj}^{-1}g_j^T(x)\nabla V_j\Big) + Q_i(x) + \tfrac14\sum_{j=1}^N \nabla V_j^T g_j(x)R_{jj}^{-T}R_{ij}R_{jj}^{-1}g_j^T(x)\nabla V_j,\qquad V_i(0)=0$$
[Figures: Critic 1 and Critic 2 NN parameters; evolution of the states; 3D approximation error of the control and of the value for player 1.]
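As a sanity check, the learned policies can be evaluated directly against the known optimal policies. The snippet below does this at a few sample states, using the converged weights and the basis from the slides, and assuming the reconstructed weightings $R_{11}=2$, $R_{22}=1$ (these values are an assumption of the reconstruction above, not confirmed by the source).

    import numpy as np

    # Converged critic weights from the slides, basis phi = [x1^2, x1*x2, x2^2].
    W1 = np.array([0.5015, 0.0007, 1.0001])
    W2 = np.array([0.2514, 0.0006, 0.5001])

    def grad_phi(x):                 # Jacobian of the basis, rows d(phi_i)/dx
        x1, x2 = x
        return np.array([[2*x1, 0.0], [x2, x1], [0.0, 2*x2]])

    g = lambda x: np.array([0.0, np.cos(2*x[0]) + 2])
    k = lambda x: np.array([0.0, np.sin(4*x[0]**2) + 2])
    R11, R22 = 2.0, 1.0              # assumed weightings (R11 = 2*R22)

    for x in [np.array([0.5, -0.3]), np.array([-1.0, 0.8])]:
        u_hat = -0.5 / R11 * g(x) @ (grad_phi(x).T @ W1)
        d_hat = -0.5 / R22 * k(x) @ (grad_phi(x).T @ W2)
        u_star = -0.5 * (np.cos(2*x[0]) + 2) * x[1]
        d_star = -0.5 * (np.sin(4*x[0]**2) + 2) * x[1]
        print(u_hat - u_star, d_hat - d_star)   # both errors ~1e-3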
Games on Communication Graphs

Sun Tzu, The Art of War (孙子兵法), c. 500 BC
F.L. Lewis, H. Zhang, A. Das, K. Hengster-Movric, Cooperative Control of Multi-Agent Systems: Optimal Design and Adaptive Control, Springer-Verlag, 2013
Key Point
Lyapunov functions and performance indices must depend on the graph topology.
Hongwei Zhang, F.L. Lewis, and Abhijit Das, “Optimal design for synchronization of cooperative systems: state feedback, observer and output feedback,” IEEE Trans. Automatic Control, vol. 56, no. 8, pp. 1948-1952, August 2011.
H. Zhang, F.L. Lewis, and Z. Qu, "Lyapunov, Adaptive, and Optimal Design Techniques for Cooperative Systems on Directed Communication Graphs," IEEE Trans. Industrial Electronics, vol. 59, no. 7, pp. 3026‐3041, July 2012.
Graphical Games: Synchronization – Cooperative Tracker Problem

Node dynamics: $\dot x_i = Ax_i + B_iu_i$, with $x_i(t)\in\mathbb{R}^n$, $u_i(t)\in\mathbb{R}^{m_i}$
Target generator dynamics: $\dot x_0 = Ax_0$ (leader node $x_0(t)$)
Synchronization problem: $x_i(t)\to x_0(t),\ \forall i$

Local neighborhood tracking error (Lihua Xie):
$$\delta_i = \sum_{j\in N_i} e_{ij}(x_i - x_j) + g_i(x_i - x_0)$$
Local neighborhood tracking error dynamics (local agent dynamics driven by neighbors' controls):
$$\dot\delta_i = A\delta_i + (d_i+g_i)B_iu_i - \sum_{j\in N_i} e_{ij}B_ju_j$$

Define the local neighborhood performance index (values driven by neighbors' controls):
$$J_i(\delta_i(0),u_i,u_{-i}) = \tfrac12\int_0^\infty\Big(\delta_i^TQ_{ii}\delta_i + u_i^TR_{ii}u_i + \sum_{j\in N_i}u_j^TR_{ij}u_j\Big)dt = \tfrac12\int_0^\infty L_i\big(\delta_i(t),u_i(t),u_{-i}(t)\big)\,dt$$
K.G. Vamvoudakis, F.L. Lewis, and G.R. Hudas, “Multi-Agent Differential Graphical Games: online adaptive learning solution for synchronization with optimality,” Automatica, vol. 48, no. 8, pp. 1598-1611, Aug. 2012.
M. Abouheaf, K. Vamvoudakis, F.L. Lewis, S. Haesaert, and R. Babuska, “Multi-Agent Discrete-Time GraphicalGames and Reinforcement Learning Solutions,” Automatica, Vol. 50, no. 12, pp. 3038-3053, 2014.
New Differential Graphical Game

State dynamics of agent $i$ (local dynamics):
$$\dot\delta_i = A\delta_i + (d_i+g_i)B_iu_i - \sum_{j\in N_i} e_{ij}B_ju_j$$

Local value function (depends only on graph neighbors):
$$J_i(\delta_i(0),u_i,u_{-i}) = \tfrac12\int_0^\infty\Big(\delta_i^TQ_{ii}\delta_i + u_i^TR_{ii}u_i + \sum_{j\in N_i}u_j^TR_{ij}u_j\Big)dt$$

$u_i$ = control action of player $i$; $V_i$ = value function of player $i$
Standard Multi-Agent Differential Game

Central dynamics:
$$\dot z = Az + \sum_{i=1}^N B_iu_i$$

Value function of player $i$ (depends on ALL other control actions):
$$J_i(z(0),u_1,\ldots,u_N) = \tfrac12\int_0^\infty\Big(z^TQ_iz + \sum_{j=1}^N u_j^TR_{ij}u_j\Big)dt$$

$u_i$ = control action of player $i$
Problems with the Nash Equilibrium Definition on Graphical Games

Game objective:
$$V_i^*(\delta_i(t)) = \min_{u_i}\tfrac12\int_t^\infty\Big(\delta_i^TQ_{ii}\delta_i + u_i^TR_{ii}u_i + \sum_{j\in N_i}u_j^TR_{ij}u_j\Big)dt$$

Neighbors of node $i$: $u_{-i} = \{u_j : j\in N_i\}$
All other nodes in the graph: $u_{G-i} = \{u_j : j\in N,\, j\ne i\}$

Def. (Nash equilibrium): $u_1^*, u_2^*, \ldots, u_N^*$ are in Nash equilibrium if
$$J_i^* \triangleq J_i(u_i^*, u_{G-i}^*) \le J_i(u_i, u_{G-i}^*),\quad \forall i\in N$$

Counterexample: a disconnected graph. Let each node play its optimal control, so $J_i^* = J_i(u_i^*)$. Then each agent's cost does not depend on any other agent:
$$J_i(u) = J_i(u_i, u_{G-i}) = J_i(u_i, u'_{G-i}),\quad \forall i$$
Then all agents are in Nash equilibrium. Note: this Nash is also coalition-proof.

Another example:
New Definition of Nash Equilibrium for Graphical Games

Def. (Interactive Nash equilibrium): $u_1^*, u_2^*, \ldots, u_N^*$ are in Interactive Nash equilibrium if:
1. They are in Nash equilibrium:
$$J_i^* \triangleq J_i(u_i^*, u_{G-i}^*) \le J_i(u_i, u_{G-i}^*),\quad \forall i\in N$$
2. Interaction condition: for each $j$ there exists a policy $u_j$ such that
$$J_i(u_j^*, u_{G-j}^*) \ne J_i(u_j, u_{G-j}^*),\quad \forall i,j\in N$$

That is, every player can find a policy that changes the value of every other player. This restores the symmetry of Nash equilibrium.
Graphical Game Solution Equations

Value function:
$$V_i(\delta_i(t)) = \tfrac12\int_t^\infty\Big(\delta_i^TQ_{ii}\delta_i + u_i^TR_{ii}u_i + \sum_{j\in N_i}u_j^TR_{ij}u_j\Big)d\tau$$

Differential equivalent (Leibniz formula) is Bellman's equation:
$$H_i\Big(\delta_i,\frac{\partial V_i}{\partial\delta_i},u_i,u_{-i}\Big) = \frac{\partial V_i^T}{\partial\delta_i}\Big(A\delta_i + (d_i+g_i)B_iu_i - \sum_{j\in N_i}e_{ij}B_ju_j\Big) + \tfrac12\delta_i^TQ_{ii}\delta_i + \tfrac12 u_i^TR_{ii}u_i + \tfrac12\sum_{j\in N_i}u_j^TR_{ij}u_j = 0$$

Stationarity condition:
$$0 = \frac{\partial H_i}{\partial u_i}\ \Rightarrow\ u_i = -(d_i+g_i)R_{ii}^{-1}B_i^T\frac{\partial V_i}{\partial\delta_i}$$

Coupled HJ equations:
$$0 = \frac{\partial V_i^T}{\partial\delta_i}A_c\delta_i + \tfrac12\delta_i^TQ_{ii}\delta_i - \tfrac12(d_i+g_i)^2\frac{\partial V_i^T}{\partial\delta_i}B_iR_{ii}^{-1}B_i^T\frac{\partial V_i}{\partial\delta_i} + \tfrac12\sum_{j\in N_i}(d_j+g_j)^2\frac{\partial V_j^T}{\partial\delta_j}B_jR_{jj}^{-T}R_{ij}R_{jj}^{-1}B_j^T\frac{\partial V_j}{\partial\delta_j},\quad i\in N$$

where
$$A_c\delta_i = A\delta_i - (d_i+g_i)^2B_iR_{ii}^{-1}B_i^T\frac{\partial V_i}{\partial\delta_i} + \sum_{j\in N_i}e_{ij}(d_j+g_j)B_jR_{jj}^{-1}B_j^T\frac{\partial V_j}{\partial\delta_j}$$

At the optimum: $H_i\big(\delta_i,\frac{\partial V_i}{\partial\delta_i},u_i^*,u_{-i}^*\big) = 0$
Now use Synchronous PI to learn optimal Nash policies online in real‐time as players interact
Distributed Multi‐Agent Learning Proofs
Online Solution of Graphical Games
Use Reinforcement Learning Convergence Results
POLICY ITERATION
Kyriakos Vamvoudakis: Multi-agent Learning Convergence Proofs
Data-driven Online Solution of Differential Games: Zero-sum 2-Player Games and H-infinity Control
L2 Gain Problem

System:
$$\dot x = f(x) + g(x)u + k(x)d,\qquad y = h(x),\qquad z = \begin{bmatrix} y\\ u \end{bmatrix}$$
with control $u = l(x)$, disturbance $d$, performance output $z$.

Find the control $u(t)$ so that, for all $L_2$ disturbances and a prescribed gain $\gamma^2$,
$$\frac{\int_0^\infty \|z(t)\|^2dt}{\int_0^\infty \|d(t)\|^2dt} = \frac{\int_0^\infty\big(h^Th + \|u\|^2\big)dt}{\int_0^\infty \|d(t)\|^2dt} \le \gamma^2$$
H‐Infinity Control Using Reinforcement Learning
Zero‐Sum differential game ‐ Nature as the opposing player
Disturbance Rejection
Define the 2-player zero-sum game as
$$V^*(x(0)) = \min_u\max_d V(x(0),u,d) = \min_u\max_d\int_0^\infty\big(h^Th + u^TRu - \gamma^2\|d\|^2\big)dt$$

The game has a unique value (saddle-point solution) iff the Nash condition holds:
$$\min_u\max_d V(x(0),u,d) = \max_d\min_u V(x(0),u,d)$$

A necessary condition for this is the Isaacs condition:
$$\min_u\max_d H(x,\nabla V,u,d) = \max_d\min_u H(x,\nabla V,u,d)$$

Stationarity conditions:
$$\frac{\partial H}{\partial u} = 0,\qquad \frac{\partial H}{\partial d} = 0$$
Linear Quadratic Zero-Sum Games

$$\dot x = Ax + B_1u_1 + B_2u_2,\qquad y = Cx,\qquad Q = C^TC$$

$$J_1(x(t),u_1,u_2) = -J_2(x(t),u_1,u_2) = \tfrac12\int_t^\infty\big(x^TQx + u_1^TR_{11}u_1 - u_2^TR_{12}u_2\big)d\tau$$

Game Algebraic Riccati Equation:
$$0 = A^TP + PA + Q - PB_1R_{11}^{-1}B_1^TP + PB_2R_{12}^{-1}B_2^TP$$

$$u_1 = -K_1x = -R_{11}^{-1}B_1^TPx,\qquad u_2 = -K_2x = R_{12}^{-1}B_2^TPx$$
Online Zero-Sum Differential Games (H-infinity control, 2 players)

System:
$$\dot x = f(x,u) = f(x) + g(x)u + k(x)d,\qquad y = h(x)$$

Cost:
$$V(x(t),u,d) = \int_t^\infty\big(h^Th + u^TRu - \gamma^2\|d\|^2\big)d\tau = \int_t^\infty r(x,u,d)\,d\tau$$

Given any stabilizing control and disturbance policies $u(x)$, $d(x)$, the cost value is found by solving this nonlinear Lyapunov equation. Leibniz gives the differential equivalent, the ZS game Bellman equation:
$$0 = r(x,u,d) + \frac{\partial V^T}{\partial x}\big(f(x) + g(x)u + k(x)d\big) \equiv H\Big(x,\frac{\partial V}{\partial x},u,d\Big),\qquad V(0) = 0$$
K.G. Vamvoudakis and F.L. Lewis, “Online solution of nonlinear two-player zero-sum games using synchronous policy iteration,” Int. J. Robust and Nonlinear Control, vol. 22, pp. 1460-1483, 2012.
D. Vrabie and F.L. Lewis, “Adaptive dynamic programming for online solution of a zero-sum differential game,” J Control Theory App., vol. 9, no. 3, pp. 353–360, 2011
Game saddle-point solution found from the Hamiltonian – the ZS game Bellman equation:
$$0 = H(x,\nabla V,u,d) = h^Th + u^TRu - \gamma^2\|d\|^2 + \nabla V^T\big(f(x) + g(x)u + k(x)d\big),\qquad V(0)=0$$

Optimal control/disturbance policies found by the stationarity conditions $\frac{\partial H}{\partial u} = 0$, $\frac{\partial H}{\partial d} = 0$:
$$u = -\tfrac12R^{-1}g^T(x)\nabla V,\qquad d = \frac{1}{2\gamma^2}k^T(x)\nabla V$$

HJI equation ('nonlinear game Riccati' equation):
$$0 = H(x,\nabla V^*,u^*,d^*) = h^Th + \nabla V^T(x)f(x) - \tfrac14\nabla V^T(x)g(x)R^{-1}g^T(x)\nabla V(x) + \frac{1}{4\gamma^2}\nabla V^T(x)kk^T\nabla V(x)$$
Double Policy Iteration Algorithm to Solve the HJI

Start with a stabilizing initial policy $u_0(x)$.

1. For a given control policy $u_j(x)$, solve for the value $V_j(x(t))$ using the inner loop below.

2. Set $d^0 = 0$. For $i = 0,1,\ldots$: solve the ZS game Bellman equation for $V_j^i(x(t))$,
$$0 = h^Th + \nabla V_j^{iT}\big(f + gu_j + kd^i\big) + u_j^TRu_j - \gamma^2\|d^i\|^2,$$
then update the disturbance (the inner loop solves for the available storage):
$$d^{i+1} = \frac{1}{2\gamma^2}k^T(x)\nabla V_j^i$$
On convergence, set $V_j(x) = V_j^\infty(x)$.

3. Improve the policy:
$$u_{j+1}(x) = -\tfrac12R^{-1}g^T(x)\nabla V_j$$

• Convergence proved by Van der Schaft if the nonlinear Lyapunov equation can be solved exactly
• Abu-Khalaf & Lewis used NNs to approximate V for nonlinear systems and proved convergence

ONLINE SOLUTION: Use IRL to solve the ZS Bellman equation and update the actors at each step.
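For the LQ zero-sum case, both loops of this algorithm reduce to Lyapunov equations, which makes the structure easy to sketch offline. The following is a minimal illustration with made-up matrices and $R = I$; it assumes $\gamma$ is large enough for the inner (available storage) loop to converge.

    import numpy as np
    from scipy.linalg import solve_continuous_lyapunov

    # Made-up ZS game data: dx = A x + B2 u + B1 d, z = [Cx; u], R = I.
    A  = np.array([[-1.0, 1.0], [0.0, -2.0]])
    B2 = np.array([[0.0], [1.0]])      # control input
    B1 = np.array([[1.0], [0.0]])      # disturbance input
    C  = np.eye(2)
    gamma = 2.0

    Ku = np.zeros((1, 2))              # u = -Ku x; A stable, so Ku = 0 is admissible
    for _ in range(20):                # outer loop: improve the control player
        Ld, P = np.zeros((1, 2)), None # d = Ld x; restart the disturbance
        for _ in range(30):            # inner loop: available storage for fixed u
            Ac = A - B2 @ Ku + B1 @ Ld
            P = solve_continuous_lyapunov(
                Ac.T, -(C.T @ C + Ku.T @ Ku - gamma**2 * Ld.T @ Ld))
            Ld = (1.0 / gamma**2) * B1.T @ P    # disturbance improvement
        Ku = B2.T @ P                           # control improvement
    res = (A.T @ P + P @ A + C.T @ C
           - P @ (B2 @ B2.T - (1.0 / gamma**2) * B1 @ B1.T) @ P)
    print(np.abs(res).max())           # GARE residual, should be near zero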
Murad Abu-Khalaf, Draguna Vrabie
Actor‐Critic structure ‐ three time scales
System: $\dot x = Ax + B_2u + B_1w$, $x_0$ given

Controller/Player 1 (fast update): $u = -B_2^TP^{i,k+1}x$
Disturbance/Player 2 (slow update): $w = B_1^TP^ix$

Critic learning procedure: over each sampling interval the critic solves the IRL Bellman equation by least squares for the value kernel $P^{i,k}$; the control player is updated at every step $k$, while the disturbance player is updated only after the control-player iteration has converged (index $i$).
Simulation: H-infinity control for an electric power plant – load frequency control (LFC)

$$\dot x = Ax + B_2u + B_1d,\qquad x(t) = [\Delta f(t)\;\; \Delta P_g(t)\;\; \Delta X_g(t)\;\; \Delta E(t)]^T$$
(state: frequency deviation, generator output, governor position, integral control; $d$ = load disturbance)

$$A = \begin{bmatrix} -1/T_p & K_p/T_p & 0 & 0\\ 0 & -1/T_T & 1/T_T & 0\\ -1/(RT_G) & 0 & -1/T_G & -1/T_G\\ K_E & 0 & 0 & 0 \end{bmatrix}
= \begin{bmatrix} -0.0665 & 8 & 0 & 0\\ 0 & -3.663 & 3.663 & 0\\ -6.86 & 0 & -13.736 & -13.736\\ 0.6 & 0 & 0 & 0 \end{bmatrix}$$

$$B_2 = [0\;\; 0\;\; 13.7355\;\; 0]^T,\qquad B_1 = [8\;\; 0\;\; 0\;\; 0]^T$$

$A$ is unknown; $B_1$, $B_2$ are known.

Game Algebraic Riccati Equation:
$$0 = A^TP + PA + C^TC - P\big(B_2B_2^T - B_1B_1^T\big)P$$
Simulation result – Electric Power Plant LFC
• System – power plant, an internally stable system; system state $x = [\Delta f(t)\;\; \Delta P_g(t)\;\; \Delta X_g(t)\;\; \Delta E(t)]^T$ (incremental changes of frequency deviation, generator output, governor position, and integral control)
• Player 1 – controller; Player 2 – load disturbance
• Nash equilibrium solution:
$$P = \begin{bmatrix} 0.6036 & 0.7398 & 0.0609 & 0.5877\\ 0.7398 & 1.5438 & 0.1702 & 0.5978\\ 0.0609 & 0.1702 & 0.0502 & 0.0357\\ 0.5877 & 0.5978 & 0.0357 & 2.3307 \end{bmatrix}$$
• Online learned solution using ADP, after 5 updates of the parameters of Player 2:
$$P_u^5 = \begin{bmatrix} 0.6036 & 0.7399 & 0.0609 & 0.5877\\ 0.7399 & 1.5440 & 0.1702 & 0.5979\\ 0.0609 & 0.1702 & 0.0502 & 0.0357\\ 0.5877 & 0.5979 & 0.0357 & 2.3307 \end{bmatrix}$$
Solves the GARE online without knowing $A$:
$$0 = A^TP + PA + C^TC - P\big(B_2B_2^T - B_1B_1^T\big)P$$
Parameters of the critic = GARE solution elements
– Cost function learning using least squares
– Sampling integration time T = 0.1 s
– The policy of Player 1 is updated every 2.5 s
– The policy of Player 2 is updated only when the policy of Player 1 has converged
– The figure marks the moments when Player 2 is updated (several updates of Player 1 occur before each update of Player 2)
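The "cost function learning using least squares" step can be sketched as follows: over each sampling interval the IRL Bellman equation is linear in the entries of P, so a batch of intervals gives an ordinary least-squares problem. This is a generic sketch (the function and variable names are my own), with a synthetic consistency check in place of the power-plant data.

    import numpy as np

    # Each IRL interval gives one linear equation in the entries of P:
    #   x(t)^T P x(t) - x(t+T)^T P x(t+T) = integral of the utility over [t, t+T].
    # Stack many intervals and solve by batch least squares.

    def quad_basis(x):
        # Basis for symmetric P: [x1^2, 2*x1*x2, ..., xn^2], paired with
        # the stacked upper-triangular entries [P11, P12, ..., Pnn].
        n = len(x)
        return np.array([(1.0 if i == j else 2.0) * x[i] * x[j]
                         for i in range(n) for j in range(i, n)])

    def fit_P(x_start, x_end, costs):
        Phi = np.array([quad_basis(a) - quad_basis(b)
                        for a, b in zip(x_start, x_end)])
        p, *_ = np.linalg.lstsq(Phi, costs, rcond=None)
        return p  # upper-triangular entries of P

    # Consistency check with a known P and synthetic interval endpoints.
    rng = np.random.default_rng(0)
    P_true = np.array([[2.0, 0.5], [0.5, 1.0]])
    xs, xTs = rng.standard_normal((20, 2)), rng.standard_normal((20, 2))
    costs = (np.einsum('ki,ij,kj->k', xs, P_true, xs)
             - np.einsum('ki,ij,kj->k', xTs, P_true, xTs))
    print(fit_P(xs, xTs, costs))  # recovers [2.0, 0.5, 1.0]

In the simulation above, the integrated utilities over each T = 0.1 s interval are approximated by quadrature of the measured instantaneous cost.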
[Figure: Parameters of the cost function of the game vs. time (s), over 0-300 s; the elements P(1,1), P(1,2), P(2,2), P(3,4), P(4,4) converge to the GARE solution.]
New Developments in IRL for CT Systems: Q-Learning for CT Systems, Experience Replay, Off-Policy IRL
Q learning for CT Systems
The CT Q function is quadratic in $x$, $u$, and the derivatives of $u$:
$$Q\big(x(t),u_{[t:t+T)}\big) = \int_t^{t+T} e^{-\gamma(\tau-t)}\big(x^TSx + u^TRu\big)d\tau + e^{-\gamma T}V(x(t+T))$$

$$\int_t^{t+T} e^{-\gamma(\tau-t)}\big(x^TSx + u_k^TRu_k\big)d\tau = Z_k^T(t)\,\bar M(S,R)\,Z_k(t) + E(t,T),\qquad
Z_k(t) = \begin{bmatrix} x(t)\\ u(t)\\ \dot u(t)\\ \vdots\\ u^{(k)}(t) \end{bmatrix}$$
M. Palanisamy, H. Modares, F.L. Lewis, and M. Aurangzeb,“Continuous-time Q-learning for Infinite-horizon DiscountedCost Linear Quadratic Regulator Problems,” IEEE Trans.Cybernetics, vol. 45, no. 2, pp. 165-176, Feb. 2015.
IRL with Experience Replay: humans use memories of past experiences to tune current policies.
System: $\dot x(t) = f(x(t)) + g(x(t))u(t)$

Value (with constrained inputs $|u|\le\lambda$):
$$V(x(t)) = \int_t^\infty\Big(Q(x) + 2\int_0^u \lambda\tanh^{-1}(v/\lambda)^T R\,dv\Big)d\tau$$

Bellman equation:
$$Q(x) + 2\int_0^u \lambda\tanh^{-1}(v/\lambda)^T R\,dv + \nabla V^T\big(f(x) + g(x)u\big) = 0,\qquad V(0) = 0$$

Action update:
$$u^* = -\lambda\tanh\Big(\frac{1}{2\lambda}R^{-1}g^T(x)\nabla V^*(x)\Big)$$
VFA – Value Function Approximation: $\hat V(x) = \hat W_1^T\phi(x)$

Standard critic weight tuning (normalized gradient descent on the IRL Bellman error):
$$\dot{\hat W}_1(t) = -\alpha\frac{\Delta\phi(t)}{\big(1+\Delta\phi^T(t)\Delta\phi(t)\big)^2}\Big(p(t) + \Delta\phi^T(t)\hat W_1(t)\Big)$$
Modares and Lewis, Automatica 2014; Girish Chowdhary – concurrent learning; Sutton and Barto book
i/o data measurements:
$$\Delta\phi(x(t)) = \phi(x(t)) - \phi(x(t-T)),\qquad
p(t) = \int_{t-T}^t\Big(Q(x) + 2\int_0^u \lambda\tanh^{-1}(v/\lambda)^T R\,dv\Big)d\tau$$
IRL Bellman equation:
$$V(x(t-T)) = \int_{t-T}^t\Big(Q(x) + 2\int_0^u \lambda\tanh^{-1}(v/\lambda)^T R\,dv\Big)d\tau + V(x(t))$$

The Bellman equation gives a linear equation for the weights:
$$\int_{t-T}^t\Big(Q(x) + 2\int_0^u \lambda\tanh^{-1}(v/\lambda)^T R\,dv\Big)d\tau + W_B^T\Delta\phi(x(t)) \equiv \varepsilon_B(t)$$
IRL with Experience Replay: humans use memories of past experiences to tune current policies.

VFA: $\hat V(x) = \hat W_1^T\phi(x)$

NN weight tuning uses past samples (data from previous time intervals):
$$\dot{\hat W}_1(t) = -\alpha\frac{\Delta\phi(t)}{\big(1+\Delta\phi^T\Delta\phi\big)^2}\Big(p(t) + \Delta\phi^T(t)\hat W_1(t)\Big) - \alpha\sum_{j=1}^{l}\frac{\Delta\phi_j}{\big(1+\Delta\phi_j^T\Delta\phi_j\big)^2}\Big(p_j + \Delta\phi_j^T\hat W_1(t)\Big)$$
Improvements:
1. Speeds up convergence
2. The PE condition is milder
i/o data measurements (previous data):
$$\Delta\phi(x(t)) = \phi(x(t)) - \phi(x(t-T)),\qquad
p(t) = \int_{t-T}^t\Big(Q(x) + 2\int_0^u \lambda\tanh^{-1}(v/\lambda)^T R\,dv\Big)d\tau$$
H. Modares, F.L. Lewis, and M.B. Naghibi Sistani,“Integral reinforcement learning and experiencereplay for adaptive optimal control of partially-unknown constrained-input continuous-timesystems,” Automatica, vol. 50, pp. 193-202, 2014.
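A minimal discrete-time sketch of this experience-replay tuning (an Euler step of the update law above; the function name and buffer format are my own):

    import numpy as np

    def er_critic_step(W, dphi, p, buffer, alpha=1.0, dt=0.01):
        """One Euler step of the experience-replay critic update.
        dphi, p : current measured (delta-phi, integral reinforcement) pair
        buffer  : stored past (dphi_j, p_j) pairs (the replay memory)
        """
        def grad(dphi_j, p_j):
            e_j = p_j + dphi_j @ W           # IRL Bellman residual on one sample
            m_j = (1.0 + dphi_j @ dphi_j) ** 2
            return alpha * dphi_j * e_j / m_j
        dW = -grad(dphi, p)                  # current-time gradient term
        for dphi_j, p_j in buffer:           # replayed past-data terms
            dW -= grad(dphi_j, p_j)
        return W + dt * dW

The replay buffer only needs enough samples that the stacked matrix of recorded $\Delta\phi_j$ vectors has full column rank, which is the milder condition that replaces persistence of excitation.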
New Principles
Off-Policy Learning
On‐policy RL
Target and behavior policy
System, Ref.
Target policy: the policy that we are learning about. Behavior policy: the policy that generates actions and behavior.
Target policy and behavior policy are the same
Sutton and Barto Book
Off‐Policy Reinforcement Learning
Humans can learn optimal policies while actually playing suboptimal policies
Off‐policy RL
Behavior Policy System
Target policy
Ref.
Target policy and behavior policy are different
Humans can learn optimal policies while actually applying suboptimal policies
H. Modares, F.L. Lewis, and Z.-P. Jiang, “H-infinity Tracking Control of Completely-unknown Continuous-time Systems via Off-policyReinforcement Learning ,” IEEE Transactions on Neural Networks and Learning Systems, vol. 26, no. 10, pp. 2550-2562, Oct. 2015.
Ruizhuo Song, F.L. Lewis, Qinglai Wei, “Off-Policy Integral Reinforcement Learning Method to Solve Nonlinear Continuous-TimeMulti-Player Non-Zero-Sum Games,” IEEE Transactions on Neural Networks and Learning Systems, vol. 28, no. 3, pp. 704-713, 2017.
Bahare Kiumarsi, Frank L. Lewis, Zhong-Ping Jiang, “H-infinity Control of Linear Discrete-time Systems: Off-policy ReinforcementLearning,” Automatica, vol. 78, pp. 144-152, 2017.
Off-policy IRL: humans can learn optimal policies while actually applying suboptimal policies.
$$\dot x = f(x) + g(x)u,\qquad
J(x) = \int_t^\infty r(x,u)\,d\tau = \int_t^\infty\big(Q(x) + u^TRu\big)d\tau$$

On-policy IRL:
$$J^{[i]}(x(t-T)) = \int_{t-T}^t\big(Q(x) + u^{[i]T}Ru^{[i]}\big)d\tau + J^{[i]}(x(t)),\qquad
u^{[i+1]} = -\tfrac12R^{-1}g^T\nabla J^{[i]}$$

Off-policy IRL: rewrite the dynamics as
$$\dot x = f + gu^{[i]} + g\big(u - u^{[i]}\big)$$
so that
$$J^{[i]}(x(t-T)) - J^{[i]}(x(t)) = \int_{t-T}^t\big(Q(x) + u^{[i]T}Ru^{[i]}\big)d\tau + 2\int_{t-T}^t u^{[i+1]T}R\big(u - u^{[i]}\big)d\tau$$

This is a linear equation for $J^{[i]}$ and $u^{[i+1]}$: they can be found simultaneously, online, from measured data using Kronecker products and VFA.
1. Completely unknown system dynamics
2. Can use the applied u(t) for:
– disturbance rejection – Z.P. Jiang; R. Song and Lewis, 2015
– robust control – Y. Jiang & Z.P. Jiang, IEEE TCS 2012
– exploring probing noise, without bias! – J.Y. Lee, J.B. Park, Y.H. Choi 2012
DDO – must know $g(x)$; the learned optimal target policy differs from the applied behavior policy.
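A minimal end-to-end sketch of off-policy IRL for the LQR special case (in the spirit of Y. Jiang & Z.P. Jiang): simulate once under a fixed behavior input with probing noise, store interval data, then repeatedly solve one least-squares problem for $P^{[i]}$ and the next target gain $K^{[i+1]}$ together. The system matrices and tuning values below are made up; note the learner never uses A or B.

    import numpy as np

    np.random.seed(0)
    A = np.array([[0.0, 1.0], [-0.5, -0.1]])   # unknown to the learner
    B = np.array([[0.0], [1.0]])               # unknown to the learner
    Q, R = np.eye(2), 1.0                      # scalar input weighting

    # --- collect data once, under a behavior input with probing noise
    dt, steps, n_int = 0.001, 50, 300          # Euler step, steps/interval, intervals
    freqs = [1.0, 3.7, 7.3, 11.1]
    x, t, intervals = np.array([1.0, -1.0]), 0.0, []
    for _ in range(n_int):
        a, Sxx, Sxu = x.copy(), np.zeros((2, 2)), np.zeros(2)
        for _ in range(steps):
            u = 0.3 * sum(np.sin(w * t) for w in freqs)  # probing input
            Sxx += np.outer(x, x) * dt                   # running integrals
            Sxu += x * u * dt
            x = x + (A @ x + B[:, 0] * u) * dt           # Euler simulation
            t += dt
        intervals.append((a, x.copy(), Sxx, Sxu))

    # --- off-policy iterations reuse the SAME data for every target policy
    phiP = lambda x: np.array([x[0]**2, 2 * x[0] * x[1], x[1]**2])
    K = np.zeros((1, 2))                       # u^[0] = 0 is admissible (A stable)
    for i in range(8):
        rows, rhs = [], []
        for a, b, Sxx, Sxu in intervals:
            c = np.trace((Q + R * K.T @ K) @ Sxx)        # integrated cost
            Ixu = Sxu + Sxx @ K[0]                       # integral of x*(u - u^[i])
            rows.append(np.concatenate([phiP(a) - phiP(b), 2 * R * Ixu]))
            rhs.append(c)
        theta, *_ = np.linalg.lstsq(np.array(rows), np.array(rhs), rcond=None)
        K = theta[3:].reshape(1, 2)                      # next target gain K^[i+1]
    print(K)  # approaches the LQR gain R^{-1} B^T P

The exploring noise enters the stored u(t) directly, so it causes no bias in the least-squares solution, which is one of the points listed above.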
Off-policy for Multi-player NZS Games

On-policy:
$$\dot x = f(x) + \sum_{j=1}^N g_j(x)u_j,\qquad
V_i(x(t)) = \int_t^\infty r_i(x,u_1,\ldots,u_N)\,d\tau = \int_t^\infty\Big(Q_i(x) + \sum_{j=1}^N u_j^TR_{ij}u_j\Big)d\tau$$

Off-policy: rewrite the dynamics as
$$\dot x = f(x) + \sum_{j=1}^N g_j(x)u_j^{[k]} + \sum_{j=1}^N g_j(x)\big(u_j - u_j^{[k]}\big)$$
so that
$$V_i^{[k]}(x(t-T)) - V_i^{[k]}(x(t)) = \int_{t-T}^t\Big(Q_i(x) + \sum_{j=1}^N u_j^{[k]T}R_{ij}u_j^{[k]}\Big)d\tau + 2\int_{t-T}^t\sum_{j=1}^N u_j^{[k+1]T}R_{jj}\big(u_j - u_j^{[k]}\big)d\tau$$

1. Solve online, using measured data, for $V_i^{[k]}$ and $u_i^{[k+1]}$
2. Completely unknown dynamics
3. Add exploring noise with no bias (DDO)
Ruizhuo Song, F.L. Lewis, Qinglai Wei, “Off-Policy Integral Reinforcement Learning Method to SolveNonlinear Continuous-Time Multi-Player Non-Zero-Sum Games,” IEEE Transactions on NeuralNetworks and Learning Systems, vol. 28, no. 3, pp. 704-713, March 2017.
Off-policy IRL for ZS Games – H-infinity Control (optimal tracker)

Extended state, including the reference trajectory dynamics:
$$X(t) = [e^T(t)\;\; r_d^T(t)]^T\in\mathbb{R}^{2n},\qquad
\dot X(t) = F(X(t)) + G(X(t))u(t) + K(X(t))d(t)$$

Discounted value:
$$V(u,d) = \int_t^\infty e^{-\alpha(\tau-t)}\big(X^TQ_TX + u^TRu - \gamma^2d^Td\big)d\tau$$

Off-policy dynamics:
$$\dot X = F + Gu^i + Kd^i + G\big(u - u^i\big) + K\big(d - d^i\big)$$

1. Solve online, using data, for $V^{i+1}$, $u^{i+1}$, $d^{i+1}$
2. Completely unknown dynamics
3. The disturbance does not need to be specified: observe the played policy of the disturbance and compute its worst-case malicious attack, should it choose to play it
4. Add exploring noise with no bias (DDO)

Biao Luo, H.N. Wu, T. Huang, IEEE T. Cybernetics; Modares, Lewis, Z.P. Jiang, TNNLS 2015; H. Li, Derong Liu, D. Wang, IEEE TASE 2015

H. Modares, F.L. Lewis, and Z.-P. Jiang, “H-infinity Tracking Control of Completely-unknown Continuous-time Systems via Off-policy Reinforcement Learning,” IEEE Transactions on Neural Networks and Learning Systems, vol. 26, no. 10, pp. 2550-2562, Oct. 2015.
Off-Policy Learning for Estimating Malicious Adversaries' Hidden True Intent (two opposing teams)

MAS H-infinity control:
$$\dot x_i = Ax_i + Bu_i + Dv_i$$

Cost:
$$V_i(x_i(t)) = \tfrac12\int_t^\infty\Big(x_i^TQ_{ii}x_i + u_i^TR_{ii}u_i + \sum_j u_j^TR_{ij}u_j - v_i^TT_{ii}v_i - \sum_j v_j^TT_{ij}v_j\Big)dt = \int_t^\infty r_i(x_i,u,v)\,dt$$

Off-policy IRL, with actual behavior policies $u_i$, $v_i$ and target policies $u_i^k$, $v_i^k$:
$$\dot x_i = Ax_i + Bu_i^k + Dv_i^k + B\big(u_i - u_i^k\big) + D\big(v_i - v_i^k\big)$$

Off-policy Bellman equation:
$$V_i^k(x_i(t)) - V_i^k(x_i(t+T)) = \int_t^{t+T} r_i\big(x_i,u_i^k,v_i^k\big)dt + \int_t^{t+T}\Big(2u_i^{(k+1)T}R_{ii}\big(u_i - u_i^k\big) + 2v_i^{(k+1)T}T_{ii}\big(v_i - v_i^k\big)\Big)dt$$

Optimal target policies:
$$u_i^{k+1} = -\tfrac12R_{ii}^{-1}B^T\frac{\partial V_i^k}{\partial x_i},\qquad v_i^{k+1} = \tfrac12T_{ii}^{-1}D^T\frac{\partial V_i^k}{\partial x_i}$$
New Principles – 3. Off-Policy LQ Tracker for Continuous-time Systems
LQ Tracker for Continuous-time Systems – Follow a Leader

System dynamics: $\dot x = Ax + Bu$, $y = Cx$
Assume the reference trajectory is generated by $\dot y_d = Fy_d$
Augmented system state: $X(t) = [x^T(t)\;\; y_d^T(t)]^T$

Augmented system:
$$\dot X = \begin{bmatrix} A & 0\\ 0 & F \end{bmatrix}X + \begin{bmatrix} B\\ 0 \end{bmatrix}u \equiv TX + B_1u$$

Value function: discounted LQ cost on the augmented state (see the tracker ARE below)
Control: $u = KX = K_1x + K_2y_d$

H. Modares and F.L. Lewis, “Linear Quadratic Tracking Control of Partially-Unknown Continuous-time Systems using Reinforcement Learning,” IEEE Trans. Automatic Control, vol. 59, no. 11, pp. 3051-3068, Nov. 2014.
Optimal Tracker Solution by Reinforcement Learning

Optimal control solution:
$$K = [K_1, K_2] = -R^{-1}B_1^TP$$

Tracker ARE (with discount factor $\gamma$):
$$T^TP + PT - \gamma P + C_1^TQC_1 - PB_1R^{-1}B_1^TP = 0$$

The optimal target policy is learned while the actual behavior policy is applied.
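For readers who want to verify a solution numerically: with the discount, the tracker ARE is just a standard ARE for the shifted matrix $T - \tfrac{\gamma}{2}I$, so an off-the-shelf solver applies. A small sketch with made-up plant and reference-generator matrices:

    import numpy as np
    from scipy.linalg import solve_continuous_are

    # Made-up plant and reference generator; yd' = 0 gives a constant setpoint.
    A = np.array([[-1.0, 1.0], [0.0, -2.0]])
    B = np.array([[0.0], [1.0]])
    C = np.array([[1.0, 0.0]])
    F = np.array([[0.0]])
    gamma = 0.1                                  # discount factor

    T  = np.block([[A, np.zeros((2, 1))], [np.zeros((1, 2)), F]])
    B1 = np.vstack([B, np.zeros((1, 1))])
    C1 = np.hstack([C, -np.eye(1)])              # tracking error e = y - yd
    Qe, R = np.eye(1), np.eye(1)

    # The discount only shifts the system matrix: a standard ARE in
    # T - (gamma/2) I reproduces T'P + PT - gamma P + C1'Qe C1 - P B1 R^-1 B1'P = 0.
    P = solve_continuous_are(T - 0.5 * gamma * np.eye(3), B1, C1.T @ Qe @ C1, R)
    K = -np.linalg.solve(R, B1.T @ P)            # u = K X = K1 x + K2 yd
    print(K)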
Output Synchronization of Heterogeneous MAS
Heterogeneous Multi‐Agents
Leader
MAS agent dynamics (heterogeneous):
$$\dot x_i = A_ix_i + B_iu_i,\qquad y_i = C_ix_i$$

Leader (node $x_0$):
$$\dot z_0 = Sz_0,\qquad y_0 = Rz_0$$

Output regulation error:
$$\lim_{t\to\infty}\big(y_i(t) - y_0(t)\big) = 0$$

Output regulator equations:
$$A_i\Pi_i + B_i\Gamma_i = \Pi_iS,\qquad C_i\Pi_i = R$$

Dynamics are different and state dimensions can be different; the output regulator equations capture the common core of all the agents' dynamics and define a synchronization manifold. A model-based sketch of solving these equations follows.
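The standard (model-based) route these equations require can be sketched directly: stack the regulator equations into one linear system with Kronecker products and solve for $(\Pi_i,\Gamma_i)$. All matrices below are made-up stand-ins.

    import numpy as np

    # Solve  Ai Pi + Bi Gam = Pi S,  Ci Pi = R  for (Pi, Gam),
    # using vec(A X B) = (B^T kron A) vec(X).
    Ai = np.array([[0.0, 1.0], [-2.0, -3.0]])
    Bi = np.array([[0.0], [1.0]])
    Ci = np.array([[1.0, 0.0]])
    S  = np.array([[0.0, 1.0], [-1.0, 0.0]])   # leader: harmonic oscillator
    R  = np.array([[1.0, 0.0]])

    n, m, p, q = 2, 1, 1, 2                    # dims of x_i, u_i, y_i, z_0
    Iq = np.eye(q)
    top = np.hstack([np.kron(Iq, Ai) - np.kron(S.T, np.eye(n)),
                     np.kron(Iq, Bi)])          # Ai Pi + Bi Gam - Pi S = 0
    bot = np.hstack([np.kron(Iq, Ci),
                     np.zeros((p * q, m * q))]) # Ci Pi = R
    M   = np.vstack([top, bot])
    rhs = np.concatenate([np.zeros(n * q), R.flatten(order='F')])
    sol = np.linalg.lstsq(M, rhs, rcond=None)[0]
    Pi  = sol[:n * q].reshape(n, q, order='F')
    Gam = sol[n * q:].reshape(m, q, order='F')
    print(np.allclose(Ai @ Pi + Bi @ Gam, Pi @ S), np.allclose(Ci @ Pi, R))

This requires knowing $A_i$, $B_i$, $C_i$, $S$, $R$ exactly, which is precisely what the off-policy solution below avoids.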
Output Synchronization of Heterogeneous MAS
Standard Solution
Agents: $\dot x_i = A_ix_i + B_iu_i$, $y_i = C_ix_i$; leader: $\dot z_0 = Sz_0$, $y_0 = Rz_0$

Output regulator equations:
$$A_i\Pi_i + B_i\Gamma_i = \Pi_iS,\qquad C_i\Pi_i = R$$

Distributed observer of the leader state:
$$\dot\zeta_i = S\zeta_i + c\Big[\sum_{j=1}^N a_{ij}(\zeta_j - \zeta_i) + g_i(\zeta_0 - \zeta_i)\Big]$$

Control:
$$u_i = K_{1i}(x_i - \Pi_i\zeta_i) + \Gamma_i\zeta_i = K_{1i}x_i + (\Gamma_i - K_{1i}\Pi_i)\zeta_i \equiv K_{1i}x_i + K_{2i}\zeta_i$$

This gives output regulation: $y_i(t) - y_0(t)\to 0,\ \forall i$.

Must know your own dynamics $A_i, B_i$ and the leader's dynamics $S, R$, and solve the output regulator equations.
Pioneered by Jie Huang
Optimal Output Synchronization of Heterogeneous MAS Using Off-policy IRL
Our Solution

Agents: $\dot x_i = A_ix_i + B_iu_i$, $y_i = C_ix_i$; leader: $\dot z_0 = Sz_0$, $y_0 = Rz_0$

Optimal tracker problem. Augmented systems:
$$X_i(t) = [x_i^T(t)\;\; z_0^T(t)]^T\in\mathbb{R}^{n_i+p},\qquad
\dot X_i = T_iX_i + B_{1i}u_i,\qquad
T_i = \begin{bmatrix} A_i & 0\\ 0 & S \end{bmatrix},\quad B_{1i} = \begin{bmatrix} B_i\\ 0 \end{bmatrix}$$

Performance index:
$$V_i(X_i(t)) = \int_t^\infty e^{-\gamma_i(\tau-t)}X_i^T\big(\bar C_i^TQ_i\bar C_i + K_i^TW_iK_i\big)X_i\,d\tau = X_i^T(t)P_iX_i(t)$$

Control: $u_i = K_{1i}x_i + K_{2i}z_0 = K_iX_i$
H. Modares, S.P. Nageshrao, G. Lopes, R. Babuska, and F.L. Lewis, “Optimal Model-freeOutput Synchronization of Heterogeneous Systems Using Off-policy ReinforcementLearning,” Automatica, to appear.
Optimal Tracker Solution by Reinforcement Learning

Optimal control: $K_i = [K_{1i}, K_{2i}] = -W_i^{-1}B_{1i}^TP_i$, where $P_i$ solves the tracker ARE:
$$T_i^TP_i + P_iT_i - \gamma_iP_i + \bar C_i^TQ_i\bar C_i - P_iB_{1i}W_i^{-1}B_{1i}^TP_i = 0$$

Algorithm 1. On-policy IRL state-feedback algorithm

Policy evaluation – solve the IRL Bellman equation:
$$e^{-\gamma_i\delta t}X_i^T(t+\delta t)P_i^\kappa X_i(t+\delta t) - X_i^T(t)P_i^\kappa X_i(t) = -\int_t^{t+\delta t}e^{-\gamma_i(\tau-t)}\big((y_i-y_0)^TQ_i(y_i-y_0) + u_i^TW_iu_i\big)d\tau$$

Policy update:
$$K_i^{\kappa+1} = [K_{1i}^{\kappa+1}, K_{2i}^{\kappa+1}] = -W_i^{-1}B_{1i}^TP_i^\kappa$$

Must know $B_{1i}$. The Bellman equation is solved using RLS or batch LS; it requires a persistence of excitation (PE) condition that may be hard to satisfy.

Theorem: Algorithm 1 converges to the solution of the ARE.

Control: $u_i = K_{1i}x_i + K_{2i}z_0 = K_iX_i$
Off-Policy RL

Tracker dynamics: $\dot X_i = T_iX_i + B_{1i}u_i$. Rewrite as
$$\dot X_i = \big(T_i + B_{1i}K_i^\kappa\big)X_i + B_{1i}\big(u_i - K_i^\kappa X_i\big) \equiv T_i^\kappa X_i + B_{1i}\big(u_i - K_i^\kappa X_i\big)$$

Now the Bellman equation becomes
$$e^{-\gamma_i\delta t}X_i^T(t+\delta t)P_i^\kappa X_i(t+\delta t) - X_i^T(t)P_i^\kappa X_i(t) = -\int_t^{t+\delta t}e^{-\gamma_i(\tau-t)}(y_i-y_0)^TQ_i(y_i-y_0)\,d\tau + 2\int_t^{t+\delta t}e^{-\gamma_i(\tau-t)}\big(u_i - K_i^\kappa X_i\big)^TW_iK_i^{\kappa+1}X_i\,d\tau$$
with an extra term containing $K_i^{\kappa+1}$.

Algorithm 2. Off-policy IRL data-based algorithm

Iterate on this equation and solve for $P_i^\kappa$ and $K_i^{\kappa+1}$ simultaneously at each step.

Note about probing noise: if $u_i = K_i^\kappa X_i + e$, then $u_i - K_i^\kappa X_i = e$.

Do not have to know any dynamics: neither the agents' ($\dot x_i = A_ix_i + B_iu_i$, $y_i = C_ix_i$) nor the leader's ($\dot z_0 = Sz_0$, $y_0 = Rz_0$).
Theorem: Off-policy Algorithm 2 converges to the solution of the ARE
$$T_i^TP_i + P_iT_i - \gamma_iP_i + \bar C_i^TQ_i\bar C_i - P_iB_{1i}W_i^{-1}B_{1i}^TP_i = 0$$

Theorem (output regulator equation solution): Let
$$P_i = \begin{bmatrix} P_{11}^i & P_{12}^i\\ P_{21}^i & P_{22}^i \end{bmatrix}$$
Then the solution to the output regulator equations
$$A_i\Pi_i + B_i\Gamma_i = \Pi_iS,\qquad C_i\Pi_i = R$$
is given by
$$\Pi_i = -\big(P_{11}^i\big)^{-1}P_{12}^i,\qquad \Gamma_i = K_{2i} - K_{1i}\big(P_{11}^i\big)^{-1}P_{12}^i$$

Do not have to know the agent dynamics or the leader's dynamics (S, R).
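In code, extracting the regulator solution from the learned, partitioned $P_i$ is one line each (the numbers below are hypothetical placeholders, just to show the shapes):

    import numpy as np

    P11 = np.array([[2.0, 0.3], [0.3, 1.5]])   # learned ARE blocks (hypothetical)
    P12 = np.array([[-1.0], [0.2]])
    K1  = np.array([[0.5, 0.1]])               # learned gains (hypothetical)
    K2  = np.array([[0.7]])

    Pi_i  = -np.linalg.solve(P11, P12)         # Pi_i = -(P11)^-1 P12
    Gam_i = K2 - K1 @ np.linalg.solve(P11, P12)   # = K2 + K1 @ Pi_i
    print(Pi_i, Gam_i)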
To avoid knowledge of the leader's state in the control $u_i = K_{1i}x_i + K_{2i}z_0 = K_iX_i$, use an adaptive observer for the leader's state:
$$\dot{\hat\zeta}_i = \hat S_i\zeta_i + c\Big[\sum_{j=1}^N a_{ij}(\zeta_j - \zeta_i) + g_i(\zeta_0 - \zeta_i)\Big]$$
$$\dot{\widehat{\mathrm{vec}}}(S_i) = -\Gamma_{Si}\big(I_q\otimes\zeta_i\big)\Big[\sum_{j=1}^N a_{ij}(\zeta_j - \zeta_i) + g_i(\zeta_0 - \zeta_i)\Big]$$

Then use the control
$$u_i = K_{1i}x_i + K_{2i}\hat\zeta_i \equiv K_i\hat X_i,\qquad \hat X_i = \begin{bmatrix} x_i\\ \hat\zeta_i \end{bmatrix}$$

Note that $\hat S_i$ may not converge to the actual leader matrix $S$. Do not have to know the leader's dynamics (S, R).
New Principles
There Appear to be Multiple Reinforcement Learning Loops in the Brain
Multiple Actor-Critic Learning Structures
Narendra MMAC - Multiple Model Adaptive Control
Limbic system – amygdala and orbitofrontal cortex (OFC); anterior cingulate cortex (ACC) and dorsolateral prefrontal cortex (DLPFC)
Recent work in Cognitive Neuropsychology reveals the interactions of brain regions in fast decision‐making
New research by Dan Levine
Hippocampus – spatial maps & task context
Thalamus; basal ganglia – dopamine neurons
If there is mismatch or dissonance, the ACC sends a reset to the OFC to form a new category.
If there is risk or stress, the ACC recruits the DLPFC for deliberative decision and control based on real-time data – perhaps optimality?
The hippocampus is invoked for detailed task knowledge.
Basal ganglia select actual actions taken
LTM / STM; ACC and DLPFC – deliberative decision; longer-term memory
Fast skilled response based on stored memories, using gist and SATISFICING – intuition and limited data.
Amygdala and OFC – deliberative decision and control based on more real-time data.
Dan Levine, Neural Dynamics of affect, gist, probability, and choice, Cognitive Sys Research, 2012
The human brain operates with MULTIPLE REINFORCEMENT LOOPS
[Brain diagram: thalamus; basal ganglia – dopamine neurons (reinforcement learning); cerebellum – direct and inverse models (timing pulses, inferior olive); brainstem; spinal cord; interoceptive and exteroceptive receptors; muscle contraction and movement; hippocampus – spatial maps & task context; limbic system – amygdala (emotion, intuition), OFC, ACC (reset); DLPFC / cerebral cortex – new actor-critic reinforcement learning loop. Rhythms: theta 4-10 Hz (RL), gamma 30-100 Hz, motor control 200 Hz.]
There Appear to be Multiple Reinforcement Learning Loops in the Brain
Known work by Doya
Multiple time scales
Fast skilled response using gist
Work by Dan Levine and others
Shunting inhibition NN: the neuron output divides an excitatory synaptic activation by an inhibitory one,
$$g_j = \frac{\sigma\big(w_j^Tx_h + w_{0j}\big) + b_j}{a_j + f\big(c_j^Tx_h + c_{0j}\big)}$$

Standard value function approximation is based only on excitatory NN synapses:
$$\hat V_i(x) = \hat W_i^T\phi_i(x) = \sum_k w_{ik}\phi_{ik}(x)$$
No system dynamics model needed; no model identification needed.
New Fast Decision Structure Using Shunting Inhibition and Multiple Actor-Critic Learning
Ruizhuo Song, F.L. Lewis, Qinglai Wei, and Huaguang Zhang, “Multiple Actor‐Critic Structures forContinuous‐Time Optimal Control Using Input‐Output Data,” IEEE Transactions on NeuralNetworks and Learning Systems, vol. 26, no. 4, pp. 851‐865, April 2015.
We use shunting inhibition‐ it is faster
New Value Function Approximation (VFA) Structures Can Encode Risk, Gist, and Emotional Salience
1. Adaptive Self‐Organizing Map
ASOM
New Actor-Critic ADP structure
Our new VFA structure is
$$V(X(k)) = \sum_{j=1}^J I_j(k)\sum_{i=1}^L w_{ij}\phi_i(X(k))$$
No system dynamics model needed; no model identification needed.

Standard NN approximation is based on excitatory NN synapses:
$$\hat V_i(x) = \hat W_i^T\phi_i(x) = \sum_k w_{ik}\phi_{ik}(x)$$
Indicator function depends on ASOM Classification
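A structural sketch of this indicator-weighted VFA (the numbers and the winner-take-all classifier are stand-ins, not the trained model from the paper):

    import numpy as np

    rng = np.random.default_rng(1)
    som_nodes = rng.standard_normal((5, 2))   # J = 5 SOM prototype vectors
    W = rng.standard_normal((5, 3))           # one critic weight vector per node

    def phi(X):                               # shared basis, L = 3 activations
        return np.array([X[0]**2, X[0] * X[1], X[1]**2])

    def V(X):
        # ASOM classification: the indicator I_j selects the winning node only.
        j = np.argmin(np.linalg.norm(som_nodes - X, axis=1))
        I = np.zeros(len(som_nodes))
        I[j] = 1.0
        return I @ (W @ phi(X))               # V(X) = sum_j I_j * w_j^T phi(X)

    print(V(np.array([0.4, -0.2])))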
2. New Actor‐Critic ADP structure
New Fast Decision Structure Using Adaptive Self‐organizing Map and Multiple Actor‐Critic Learning
B. Kiumarsi, F.L. Lewis, and D. Levine, “Optimal Control of Nonlinear Discrete Time‐varying Systems Usinga New Neural Network Approximation Structure,” Neurocomputing, vol. 156, pp. 157‐165, May 2015
New Value Function Approximation (VFA) Structures Can Encode Risk, Gist, and Emotional Salience