New Developments in Integral Reinforcement Learning: Graphical Games, Off-policy, Tracking

F.L. Lewis, National Academy of Inventors
Moncrief-O'Donnell Chair, UTA Research Institute (UTARI), The University of Texas at Arlington, USA
and
Qian Ren Consulting Professor, State Key Laboratory of Synthetical Automation for Process Industries, Northeastern University, Shenyang, China

Talk available online at http://www.UTA.edu/UTARI/acs

Supported by: ONR, US NSF, China NNSF, China Project 111
Data-driven Online Solution of Differential Games: Synchronous Solution of Multi-player Non-Zero-sum Games

Multi-player Game Solutions, IEEE Control Systems Magazine, Dec. 2017
Multi‐player Differential Games
Sun Tzu, The Art of War (孙子兵法), c. 500 BC
Multi‐player Differential Games
Manufacturing as the Interactions of Multiple Agents:
Each machine has its own dynamics and cost function.
Neighboring machines influence each other most strongly.
There are local optimization requirements as well as global necessities.
$$\dot\delta_i = A\delta_i + (d_i+g_i)B_i u_i - \sum_{j\in N_i} e_{ij}B_j u_j$$

$$J_i(\delta_i(0),u_i,u_{-i}) = \tfrac12\int_0^\infty \Big(\delta_i^T Q_{ii}\delta_i + u_i^T R_{ii} u_i + \sum_{j\in N_i} u_j^T R_{ij} u_j\Big)dt$$
Each process has its own dynamics
And cost function
Each process helps other processes achieve optimality and efficiency
Real‐Time Solution of Multi‐Player NZS Games
Multi-Player Nonlinear Systems

$$\dot x = f(x) + \sum_{j=1}^N g_j(x)\,u_j$$
Optimal control:
$$V_i^*(x(0),u_1,u_2,\ldots,u_N) = \min_{u_i}\int_0^\infty\Big(Q_i(x) + \sum_{j=1}^N u_j^T R_{ij} u_j\Big)dt,\quad i\in N$$

Nash equilibrium:
$$V_i^* \triangleq V_i(u_1^*,u_2^*,\ldots,u_N^*) \le V_i(u_1^*,\ldots,u_{i-1}^*,u_i,u_{i+1}^*,\ldots,u_N^*),\quad i\in N$$
$$0 = P_i A_c + A_c^T P_i + Q_i + \sum_{j=1}^N P_j B_j R_{jj}^{-T} R_{ij} R_{jj}^{-1} B_j^T P_j,\quad i\in N,
\qquad A_c = A - \sum_{j=1}^N B_j R_{jj}^{-1} B_j^T P_j$$
Requires offline solution of coupled Hamilton-Jacobi-Bellman equations.
Kyriakos Vamvoudakis, Automatica 2011
Continuous‐time, N players
$$0 = \nabla V_i^T\Big(f(x) - \tfrac12\sum_{j=1}^N g_j(x)R_{jj}^{-1}g_j^T(x)\nabla V_j\Big) + Q_i(x) + \tfrac14\sum_{j=1}^N \nabla V_j^T g_j(x)R_{jj}^{-T}R_{ij}R_{jj}^{-1}g_j^T(x)\nabla V_j,\qquad V_i(0)=0$$
Linear Quadratic Regulator Case‐ coupled AREs
These are hard to solve. In the nonlinear case, the HJ equations generally cannot be solved.
Control policies:
$$u_i(x) = -\tfrac12 R_{ii}^{-1} g_i^T(x)\nabla V_i,\quad i\in N$$
$$J_1 = \tfrac13(J_1+J_2+J_3) + \tfrac13(J_1-J_2) + \tfrac13(J_1-J_3) \equiv J_{team} + J_1^{coi}$$
$$J_2 = \tfrac13(J_1+J_2+J_3) + \tfrac13(J_2-J_1) + \tfrac13(J_2-J_3) \equiv J_{team} + J_2^{coi}$$
$$J_3 = \tfrac13(J_1+J_2+J_3) + \tfrac13(J_3-J_1) + \tfrac13(J_3-J_2) \equiv J_{team} + J_3^{coi}$$

The objective function of each player can be written as a team-average term plus a conflict-of-interest term. For N players:
$$J_i = \frac1N\sum_{j=1}^N J_j + \frac1N\sum_{j=1}^N (J_i - J_j) \equiv J_{team} + J_i^{coi},\quad i = 1,\ldots,N$$
For N-player zero-sum games, the first term is zero, i.e. the players have no goals in common.
Team Interest vs. Self Interest
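This decomposition is an algebraic identity, so it can be checked directly. Below is a quick numeric sanity check in Python with made-up scalar costs for three players (the numbers are arbitrary, not from any example in this talk).

    import numpy as np

    # Hypothetical scalar costs for N = 3 players (arbitrary values).
    J = np.array([3.0, 5.0, 10.0])
    N = len(J)

    J_team = J.mean()  # team-average term (1/N) * sum_j J_j
    # Conflict-of-interest term (1/N) * sum_j (J_i - J_j) for each player i.
    J_coi = np.array([(J[i] - J).sum() / N for i in range(N)])

    # Each player's cost splits exactly into team interest plus self interest.
    assert np.allclose(J_team + J_coi, J)
    print(J_team, J_coi)  # -> 6.0 [-3. -1.  4.]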
Real‐Time Solution of Multi‐Player Games
Value functions:
$$V_i(x(0),u_1,u_2,\ldots,u_N) = \int_0^\infty\Big(Q_i(x) + \sum_{j=1}^N u_j^T R_{ij} u_j\Big)dt,\quad i\in N$$
Solve Bellman eq.
Policy Iteration Solution:

1. Solve the coupled Bellman equations for the values $V_i^k$:
$$0 = r(x,u_1^k,\ldots,u_N^k) + (\nabla V_i^k)^T\Big(f(x) + \sum_{j=1}^N g_j(x)u_j^k\Big),\qquad V_i^k(0)=0,\quad i\in N$$

2. Policy update:
$$u_i^{k+1}(x) = -\tfrac12 R_{ii}^{-1} g_i^T(x)\nabla V_i^k,\quad i\in N$$
Non-Zero-Sum Games – Synchronous Policy Iteration (Kyriakos Vamvoudakis)
Convergence has not been proven, and the Hamiltonian equation is hard to solve. But this gives the structure we need for the online Synchronous PI solution.
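For the LQ case, each step of this policy iteration reduces to Lyapunov equations, so the structure can be sketched offline in a few lines. The following is a minimal two-player illustration with made-up system matrices; as noted above, convergence of this coupled iteration is not guaranteed in general.

    import numpy as np
    from scipy.linalg import solve_continuous_lyapunov

    # Made-up 2-player LQ game data; player i has weights Qi, Rii, Rij.
    A  = np.array([[0.0, 1.0], [-1.0, -2.0]])   # open-loop stable here
    B1 = np.array([[0.0], [1.0]])
    B2 = np.array([[0.0], [0.5]])
    Q1, Q2   = np.eye(2), 2 * np.eye(2)
    R11, R22 = np.eye(1), np.eye(1)
    R12, R21 = 0.5 * np.eye(1), 0.5 * np.eye(1)

    K1, K2 = np.zeros((1, 2)), np.zeros((1, 2))  # admissible since A is stable
    for _ in range(50):
        Ac = A - B1 @ K1 - B2 @ K2
        # Policy evaluation: each coupled Bellman eq. is a Lyapunov eq. offline.
        P1 = solve_continuous_lyapunov(Ac.T, -(Q1 + K1.T @ R11 @ K1 + K2.T @ R12 @ K2))
        P2 = solve_continuous_lyapunov(Ac.T, -(Q2 + K2.T @ R22 @ K2 + K1.T @ R21 @ K1))
        # Policy improvement: u_i = -Rii^{-1} Bi^T Pi x, i.e. gain Ki.
        K1 = np.linalg.solve(R11, B1.T @ P1)
        K2 = np.linalg.solve(R22, B2.T @ P2)
    print(P1, P2, sep="\n")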
Differential equivalent gives coupled Bellman eqs.
$$0 = Q_i(x) + \sum_{j=1}^N u_j^T R_{ij} u_j + \nabla V_i^T\Big(f(x) + \sum_{j=1}^N g_j(x)u_j\Big) \equiv H_i(x,\nabla V_i,u_1,\ldots,u_N),\quad i\in N$$
Leibniz gives Differential equivalent
Real‐Time Solution of Multi‐Player Games
N Critic Neural Networks for VFA: $\hat V_1(x) = \hat W_1^T\phi_1(x)$, $\hat V_2(x) = \hat W_2^T\phi_2(x)$

Critic tuning (normalized gradient descent on the Bellman error):
$$\dot{\hat W}_1 = -a_1\frac{\sigma_1}{(\sigma_1^T\sigma_1+1)^2}\big[\sigma_1^T\hat W_1 + Q_1(x) + u_1^T R_{11}u_1 + u_2^T R_{12}u_2\big]$$

Actor tuning (for player 1):
$$\dot{\hat W}_3 = -a_3\Big\{(F_2\hat W_3 - F_1\bar\sigma_1^T\hat W_1) - \tfrac14\nabla\phi_1\, g_1(x)R_{11}^{-T}R_{21}R_{11}^{-1}g_1^T(x)\nabla\phi_1^T\hat W_3\,\bar m_2^T\hat W_2 - \tfrac14\bar D_1(x)\hat W_3\,\bar m_1^T\hat W_1\Big\}$$
N Actor Neural Networks
On‐Line Learning – for Player 1:
$$\hat u_1(x) = -\tfrac12 R_{11}^{-1}g_1^T(x)\nabla\phi_1^T\hat W_3$$
Online Synchronous PI Solution for Multi‐Player Games
Learns Bellman eq. solution
Kyriakos Vamvoudakis
$$\hat u_2(x) = -\tfrac12 R_{22}^{-1}g_2^T(x)\nabla\phi_2^T\hat W_4$$
2-player case
Each player needs 2 NN – a Critic and an Actor
Learns control policy
Player 1 Player 2
Convergence is proven using Lyapunov functions
K.G. Vamvoudakis and F.L. Lewis, “Multi-Player Non-Zero Sum Games: Online Adaptive LearningSolution of Coupled Hamilton-Jacobi Equations,” Automatica, vol. 47, pp. 1556-1569, 2011.
Lyapunov energy-based proof:
$$L(t) = V(x) + \tfrac12\,\mathrm{tr}\big(\tilde W_1^T a_1^{-1}\tilde W_1\big) + \tfrac12\,\mathrm{tr}\big(\tilde W_2^T a_2^{-1}\tilde W_2\big)$$
$V(x)$ = unknown solution to the HJB equation:
$$0 = \frac{dV^T}{dx}f + Q(x) - \tfrac14\frac{dV^T}{dx}gR^{-1}g^T\frac{dV}{dx}$$
$$\tilde W_1 = W_1 - \hat W_1,\qquad \tilde W_2 = W_2 - \hat W_2$$

$W_1$ = unknown least-squares solution to the Bellman equation for the given NN basis:
$$H(x,W_1,u) = W_1^T\nabla\phi_1(f+gu) + Q(x) + u^TRu$$
Guarantees stability
Multi-Agent Learning: Guaranteed Convergence Proof
[Diagram: Process i cost function optimization; control updates of process i and of neighbor processes j1 and j2.]

Optimal performance of each process depends on the controls of its neighbor processes.
The control policy of each process depends on the performance of its neighbor processes.
Multi-player Games for Multi-Process Optimal Control: Data-Driven Optimization (DDO)
Simulation – Nonlinear System – 2-player game

$$\dot x = f(x) + g(x)u + k(x)d,\qquad x\in\mathbb{R}^2$$
$$Q_1 = 2Q_2 = 2I,\qquad R_{11} = 2R_{22} = 2I,\qquad R_{12} = 2R_{21} = 2I$$
Optimal policies:
$$u^*(x) = -\tfrac12\big(\cos(2x_1)+2\big)\,x_2,\qquad d^*(x) = -\tfrac12\big(\sin(4x_1^2)+2\big)\,x_2$$
Select VFA basis set
Algorithm converges to
$$\hat W_2(t_f) = [0.2514\;\; 0.0006\;\; 0.5001]^T \approx \hat W_4(t_f)$$
$$f(x) = \begin{bmatrix} x_2 - 2x_1\\ -\tfrac12 x_1 - x_2 + \tfrac14 x_2\big(\cos(2x_1)+2\big)^2 + \tfrac14 x_2\big(\sin(4x_1^2)+2\big)^2 \end{bmatrix},\qquad
g(x) = \begin{bmatrix} 0\\ \cos(2x_1)+2 \end{bmatrix},\qquad
k(x) = \begin{bmatrix} 0\\ \sin(4x_1^2)+2 \end{bmatrix}$$

Optimal values:
$$V_1^*(x) = \tfrac12 x_1^2 + x_2^2,\qquad V_2^*(x) = \tfrac14 x_1^2 + \tfrac12 x_2^2$$

VFA basis: $\phi_1(x) = \phi_2(x) = [x_1^2\;\; x_1x_2\;\; x_2^2]^T$
$$\hat W_1(t_f) = [0.5015\;\; 0.0007\;\; 1.0001]^T \approx \hat W_3(t_f)$$

Learned policies, evaluated with the converged weights:
$$\hat u_1(x) = -\tfrac12 R_{11}^{-1}\begin{bmatrix} 0\\ \cos(2x_1)+2 \end{bmatrix}^T\begin{bmatrix} 2x_1 & 0\\ x_2 & x_1\\ 0 & 2x_2 \end{bmatrix}^T\begin{bmatrix} 0.5015\\ 0.0007\\ 1.0001 \end{bmatrix},\qquad
\hat d_2(x) = -\tfrac12 R_{22}^{-1}\begin{bmatrix} 0\\ \sin(4x_1^2)+2 \end{bmatrix}^T\begin{bmatrix} 2x_1 & 0\\ x_2 & x_1\\ 0 & 2x_2 \end{bmatrix}^T\begin{bmatrix} 0.2514\\ 0.0006\\ 0.5001 \end{bmatrix}$$
Solves the coupled HJ equations online:
$$0 = \nabla V_i^T\Big(f(x) - \tfrac12\sum_{j=1}^N g_j(x)R_{jj}^{-1}g_j^T(x)\nabla V_j\Big) + Q_i(x) + \tfrac14\sum_{j=1}^N \nabla V_j^T g_j(x)R_{jj}^{-T}R_{ij}R_{jj}^{-1}g_j^T(x)\nabla V_j,\qquad V_i(0)=0$$
[Figures: Critic 1 and Critic 2 NN parameters; evolution of the states; 3D approximation error of the control and of the value for player 1.]
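As a sanity check, the learned policies can be evaluated directly against the known optimal policies. The snippet below does this at a few sample states, using the converged weights and the basis from the slides, and assuming the reconstructed weightings $R_{11}=2$, $R_{22}=1$ (these values are an assumption of the reconstruction above, not confirmed by the source).

    import numpy as np

    # Converged critic weights from the slides, basis phi = [x1^2, x1*x2, x2^2].
    W1 = np.array([0.5015, 0.0007, 1.0001])
    W2 = np.array([0.2514, 0.0006, 0.5001])

    def grad_phi(x):                 # Jacobian of the basis, rows d(phi_i)/dx
        x1, x2 = x
        return np.array([[2*x1, 0.0], [x2, x1], [0.0, 2*x2]])

    g = lambda x: np.array([0.0, np.cos(2*x[0]) + 2])
    k = lambda x: np.array([0.0, np.sin(4*x[0]**2) + 2])
    R11, R22 = 2.0, 1.0              # assumed weightings (R11 = 2*R22)

    for x in [np.array([0.5, -0.3]), np.array([-1.0, 0.8])]:
        u_hat = -0.5 / R11 * g(x) @ (grad_phi(x).T @ W1)
        d_hat = -0.5 / R22 * k(x) @ (grad_phi(x).T @ W2)
        u_star = -0.5 * (np.cos(2*x[0]) + 2) * x[1]
        d_star = -0.5 * (np.sin(4*x[0]**2) + 2) * x[1]
        print(u_hat - u_star, d_hat - d_star)   # both errors ~1e-3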
Games on Communication Graphs

Sun Tzu, The Art of War (孙子兵法), c. 500 BC
F.L. Lewis, H. Zhang, A. Das, K. Hengster-Movric, Cooperative Control of Multi-Agent Systems: Optimal Design and Adaptive Control, Springer-Verlag, 2013
Key Point
Lyapunov functions and performance indices must depend on the graph topology.
Hongwei Zhang, F.L. Lewis, and Abhijit Das, “Optimal design for synchronization of cooperative systems: state feedback, observer and output feedback,” IEEE Trans. Automatic Control, vol. 56, no. 8, pp. 1948-1952, August 2011.
H. Zhang, F.L. Lewis, and Z. Qu, "Lyapunov, Adaptive, and Optimal Design Techniques for Cooperative Systems on Directed Communication Graphs," IEEE Trans. Industrial Electronics, vol. 59, no. 7, pp. 3026‐3041, July 2012.
Graphical Games: Synchronization – Cooperative Tracker Problem

Node dynamics: $\dot x_i = Ax_i + B_iu_i$, with $x_i(t)\in\mathbb{R}^n$, $u_i(t)\in\mathbb{R}^{m_i}$
Target generator dynamics: $\dot x_0 = Ax_0$ (leader node $x_0(t)$)
Synchronization problem: $x_i(t)\to x_0(t),\ \forall i$

Local neighborhood tracking error (Lihua Xie):
$$\delta_i = \sum_{j\in N_i} e_{ij}(x_i - x_j) + g_i(x_i - x_0)$$
Local neighborhood tracking error dynamics (local agent dynamics driven by neighbors' controls):
$$\dot\delta_i = A\delta_i + (d_i+g_i)B_iu_i - \sum_{j\in N_i} e_{ij}B_ju_j$$

Define the local neighborhood performance index (values driven by neighbors' controls):
$$J_i(\delta_i(0),u_i,u_{-i}) = \tfrac12\int_0^\infty\Big(\delta_i^TQ_{ii}\delta_i + u_i^TR_{ii}u_i + \sum_{j\in N_i}u_j^TR_{ij}u_j\Big)dt = \tfrac12\int_0^\infty L_i\big(\delta_i(t),u_i(t),u_{-i}(t)\big)\,dt$$
K.G. Vamvoudakis, F.L. Lewis, and G.R. Hudas, “Multi-Agent Differential Graphical Games: online adaptive learning solution for synchronization with optimality,” Automatica, vol. 48, no. 8, pp. 1598-1611, Aug. 2012.
M. Abouheaf, K. Vamvoudakis, F.L. Lewis, S. Haesaert, and R. Babuska, “Multi-Agent Discrete-Time GraphicalGames and Reinforcement Learning Solutions,” Automatica, Vol. 50, no. 12, pp. 3038-3053, 2014.
New Differential Graphical Game

State dynamics of agent $i$ (local dynamics):
$$\dot\delta_i = A\delta_i + (d_i+g_i)B_iu_i - \sum_{j\in N_i} e_{ij}B_ju_j$$

Local value function (depends only on graph neighbors):
$$J_i(\delta_i(0),u_i,u_{-i}) = \tfrac12\int_0^\infty\Big(\delta_i^TQ_{ii}\delta_i + u_i^TR_{ii}u_i + \sum_{j\in N_i}u_j^TR_{ij}u_j\Big)dt$$

$u_i$ = control action of player $i$; $V_i$ = value function of player $i$
Standard Multi-Agent Differential Game

Central dynamics:
$$\dot z = Az + \sum_{i=1}^N B_iu_i$$

Value function of player $i$ (depends on ALL other control actions):
$$J_i(z(0),u_1,\ldots,u_N) = \tfrac12\int_0^\infty\Big(z^TQ_iz + \sum_{j=1}^N u_j^TR_{ij}u_j\Big)dt$$

$u_i$ = control action of player $i$
Problems with the Nash Equilibrium Definition on Graphical Games

Game objective:
$$V_i^*(\delta_i(t)) = \min_{u_i}\tfrac12\int_t^\infty\Big(\delta_i^TQ_{ii}\delta_i + u_i^TR_{ii}u_i + \sum_{j\in N_i}u_j^TR_{ij}u_j\Big)dt$$

Neighbors of node $i$: $u_{-i} = \{u_j : j\in N_i\}$
All other nodes in the graph: $u_{G-i} = \{u_j : j\in N,\, j\ne i\}$

Def. (Nash equilibrium): $u_1^*, u_2^*, \ldots, u_N^*$ are in Nash equilibrium if
$$J_i^* \triangleq J_i(u_i^*, u_{G-i}^*) \le J_i(u_i, u_{G-i}^*),\quad \forall i\in N$$

Counterexample: a disconnected graph. Let each node play its optimal control, so $J_i^* = J_i(u_i^*)$. Then each agent's cost does not depend on any other agent:
$$J_i(u) = J_i(u_i, u_{G-i}) = J_i(u_i, u'_{G-i}),\quad \forall i$$
Then all agents are in Nash equilibrium. Note: this Nash is also coalition-proof.

Another example:
New Definition of Nash Equilibrium for Graphical Games

Def. (Interactive Nash equilibrium): $u_1^*, u_2^*, \ldots, u_N^*$ are in Interactive Nash equilibrium if:
1. They are in Nash equilibrium:
$$J_i^* \triangleq J_i(u_i^*, u_{G-i}^*) \le J_i(u_i, u_{G-i}^*),\quad \forall i\in N$$
2. Interaction condition: for each $j$ there exists a policy $u_j$ such that
$$J_i(u_j^*, u_{G-j}^*) \ne J_i(u_j, u_{G-j}^*),\quad \forall i,j\in N$$

That is, every player can find a policy that changes the value of every other player. This restores the symmetry of Nash equilibrium.
Graphical Game Solution Equations

Value function:
$$V_i(\delta_i(t)) = \tfrac12\int_t^\infty\Big(\delta_i^TQ_{ii}\delta_i + u_i^TR_{ii}u_i + \sum_{j\in N_i}u_j^TR_{ij}u_j\Big)d\tau$$

Differential equivalent (Leibniz formula) is Bellman's equation:
$$H_i\Big(\delta_i,\frac{\partial V_i}{\partial\delta_i},u_i,u_{-i}\Big) = \frac{\partial V_i^T}{\partial\delta_i}\Big(A\delta_i + (d_i+g_i)B_iu_i - \sum_{j\in N_i}e_{ij}B_ju_j\Big) + \tfrac12\delta_i^TQ_{ii}\delta_i + \tfrac12 u_i^TR_{ii}u_i + \tfrac12\sum_{j\in N_i}u_j^TR_{ij}u_j = 0$$

Stationarity condition:
$$0 = \frac{\partial H_i}{\partial u_i}\ \Rightarrow\ u_i = -(d_i+g_i)R_{ii}^{-1}B_i^T\frac{\partial V_i}{\partial\delta_i}$$

Coupled HJ equations:
$$0 = \frac{\partial V_i^T}{\partial\delta_i}A_c\delta_i + \tfrac12\delta_i^TQ_{ii}\delta_i - \tfrac12(d_i+g_i)^2\frac{\partial V_i^T}{\partial\delta_i}B_iR_{ii}^{-1}B_i^T\frac{\partial V_i}{\partial\delta_i} + \tfrac12\sum_{j\in N_i}(d_j+g_j)^2\frac{\partial V_j^T}{\partial\delta_j}B_jR_{jj}^{-T}R_{ij}R_{jj}^{-1}B_j^T\frac{\partial V_j}{\partial\delta_j},\quad i\in N$$

where
$$A_c\delta_i = A\delta_i - (d_i+g_i)^2B_iR_{ii}^{-1}B_i^T\frac{\partial V_i}{\partial\delta_i} + \sum_{j\in N_i}e_{ij}(d_j+g_j)B_jR_{jj}^{-1}B_j^T\frac{\partial V_j}{\partial\delta_j}$$

At the optimum: $H_i\big(\delta_i,\frac{\partial V_i}{\partial\delta_i},u_i^*,u_{-i}^*\big) = 0$
Now use Synchronous PI to learn optimal Nash policies online in real‐time as players interact
Distributed Multi‐Agent Learning Proofs
Online Solution of Graphical Games
Use Reinforcement Learning Convergence Results
POLICY ITERATION
Kyriakos Vamvoudakis: Multi-agent Learning Convergence Proofs
Data-driven Online Solution of Differential Games: Zero-sum 2-Player Games and H-infinity Control
L2 Gain Problem

System:
$$\dot x = f(x) + g(x)u + k(x)d,\qquad y = h(x),\qquad z = \begin{bmatrix} y\\ u \end{bmatrix}$$
with control $u = l(x)$, disturbance $d$, performance output $z$.

Find the control $u(t)$ so that, for all $L_2$ disturbances and a prescribed gain $\gamma^2$,
$$\frac{\int_0^\infty \|z(t)\|^2dt}{\int_0^\infty \|d(t)\|^2dt} = \frac{\int_0^\infty\big(h^Th + \|u\|^2\big)dt}{\int_0^\infty \|d(t)\|^2dt} \le \gamma^2$$
H‐Infinity Control Using Reinforcement Learning
Zero‐Sum differential game ‐ Nature as the opposing player
Disturbance Rejection
Define the 2-player zero-sum game as
$$V^*(x(0)) = \min_u\max_d V(x(0),u,d) = \min_u\max_d\int_0^\infty\big(h^Th + u^TRu - \gamma^2\|d\|^2\big)dt$$

The game has a unique value (saddle-point solution) iff the Nash condition holds:
$$\min_u\max_d V(x(0),u,d) = \max_d\min_u V(x(0),u,d)$$

A necessary condition for this is the Isaacs condition:
$$\min_u\max_d H(x,\nabla V,u,d) = \max_d\min_u H(x,\nabla V,u,d)$$

Stationarity conditions:
$$\frac{\partial H}{\partial u} = 0,\qquad \frac{\partial H}{\partial d} = 0$$
Linear Quadratic Zero-Sum Games

$$\dot x = Ax + B_1u_1 + B_2u_2,\qquad y = Cx,\qquad Q = C^TC$$

$$J_1(x(t),u_1,u_2) = -J_2(x(t),u_1,u_2) = \tfrac12\int_t^\infty\big(x^TQx + u_1^TR_{11}u_1 - u_2^TR_{12}u_2\big)d\tau$$

Game Algebraic Riccati Equation:
$$0 = A^TP + PA + Q - PB_1R_{11}^{-1}B_1^TP + PB_2R_{12}^{-1}B_2^TP$$

$$u_1 = -K_1x = -R_{11}^{-1}B_1^TPx,\qquad u_2 = -K_2x = R_{12}^{-1}B_2^TPx$$
Online Zero-Sum Differential Games (H-infinity control, 2 players)

System:
$$\dot x = f(x,u) = f(x) + g(x)u + k(x)d,\qquad y = h(x)$$

Cost:
$$V(x(t),u,d) = \int_t^\infty\big(h^Th + u^TRu - \gamma^2\|d\|^2\big)d\tau = \int_t^\infty r(x,u,d)\,d\tau$$

Given any stabilizing control and disturbance policies $u(x)$, $d(x)$, the cost value is found by solving this nonlinear Lyapunov equation. Leibniz gives the differential equivalent, the ZS game Bellman equation:
$$0 = r(x,u,d) + \frac{\partial V^T}{\partial x}\big(f(x) + g(x)u + k(x)d\big) \equiv H\Big(x,\frac{\partial V}{\partial x},u,d\Big),\qquad V(0) = 0$$
K.G. Vamvoudakis and F.L. Lewis, “Online solution of nonlinear two-player zero-sum games using synchronous policy iteration,” Int. J. Robust and Nonlinear Control, vol. 22, pp. 1460-1483, 2012.
D. Vrabie and F.L. Lewis, “Adaptive dynamic programming for online solution of a zero-sum differential game,” J Control Theory App., vol. 9, no. 3, pp. 353–360, 2011
Game saddle-point solution found from the Hamiltonian – the ZS game Bellman equation:
$$0 = H(x,\nabla V,u,d) = h^Th + u^TRu - \gamma^2\|d\|^2 + \nabla V^T\big(f(x) + g(x)u + k(x)d\big),\qquad V(0)=0$$

Optimal control/disturbance policies found by the stationarity conditions $\frac{\partial H}{\partial u} = 0$, $\frac{\partial H}{\partial d} = 0$:
$$u = -\tfrac12R^{-1}g^T(x)\nabla V,\qquad d = \frac{1}{2\gamma^2}k^T(x)\nabla V$$

HJI equation ('nonlinear game Riccati' equation):
$$0 = H(x,\nabla V^*,u^*,d^*) = h^Th + \nabla V^T(x)f(x) - \tfrac14\nabla V^T(x)g(x)R^{-1}g^T(x)\nabla V(x) + \frac{1}{4\gamma^2}\nabla V^T(x)kk^T\nabla V(x)$$
Double Policy Iteration Algorithm to Solve the HJI

Start with a stabilizing initial policy $u_0(x)$.

1. For a given control policy $u_j(x)$, solve for the value $V_j(x(t))$ using the inner loop below.

2. Set $d^0 = 0$. For $i = 0,1,\ldots$: solve the ZS game Bellman equation for $V_j^i(x(t))$,
$$0 = h^Th + \nabla V_j^{iT}\big(f + gu_j + kd^i\big) + u_j^TRu_j - \gamma^2\|d^i\|^2,$$
then update the disturbance (the inner loop solves for the available storage):
$$d^{i+1} = \frac{1}{2\gamma^2}k^T(x)\nabla V_j^i$$
On convergence, set $V_j(x) = V_j^\infty(x)$.

3. Improve the policy:
$$u_{j+1}(x) = -\tfrac12R^{-1}g^T(x)\nabla V_j$$

• Convergence proved by Van der Schaft if the nonlinear Lyapunov equation can be solved exactly
• Abu-Khalaf & Lewis used NNs to approximate V for nonlinear systems and proved convergence

ONLINE SOLUTION: Use IRL to solve the ZS Bellman equation and update the actors at each step.
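For the LQ zero-sum case, both loops of this algorithm reduce to Lyapunov equations, which makes the structure easy to sketch offline. The following is a minimal illustration with made-up matrices and $R = I$; it assumes $\gamma$ is large enough for the inner (available storage) loop to converge.

    import numpy as np
    from scipy.linalg import solve_continuous_lyapunov

    # Made-up ZS game data: dx = A x + B2 u + B1 d, z = [Cx; u], R = I.
    A  = np.array([[-1.0, 1.0], [0.0, -2.0]])
    B2 = np.array([[0.0], [1.0]])      # control input
    B1 = np.array([[1.0], [0.0]])      # disturbance input
    C  = np.eye(2)
    gamma = 2.0

    Ku = np.zeros((1, 2))              # u = -Ku x; A stable, so Ku = 0 is admissible
    for _ in range(20):                # outer loop: improve the control player
        Ld, P = np.zeros((1, 2)), None # d = Ld x; restart the disturbance
        for _ in range(30):            # inner loop: available storage for fixed u
            Ac = A - B2 @ Ku + B1 @ Ld
            P = solve_continuous_lyapunov(
                Ac.T, -(C.T @ C + Ku.T @ Ku - gamma**2 * Ld.T @ Ld))
            Ld = (1.0 / gamma**2) * B1.T @ P    # disturbance improvement
        Ku = B2.T @ P                           # control improvement
    res = (A.T @ P + P @ A + C.T @ C
           - P @ (B2 @ B2.T - (1.0 / gamma**2) * B1 @ B1.T) @ P)
    print(np.abs(res).max())           # GARE residual, should be near zero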
Murad Abu-Khalaf, Draguna Vrabie
Actor‐Critic structure ‐ three time scales
System: $\dot x = Ax + B_2u + B_1w$, $x_0$ given

Controller/Player 1 (fast update): $u = -B_2^TP^{i,k+1}x$
Disturbance/Player 2 (slow update): $w = B_1^TP^ix$

Critic learning procedure: over each sampling interval the critic solves the IRL Bellman equation by least squares for the value kernel $P^{i,k}$; the control player is updated at every step $k$, while the disturbance player is updated only after the control-player iteration has converged (index $i$).
Simulation: H-infinity control for an electric power plant – load frequency control (LFC)

$$\dot x = Ax + B_2u + B_1d,\qquad x(t) = [\Delta f(t)\;\; \Delta P_g(t)\;\; \Delta X_g(t)\;\; \Delta E(t)]^T$$
(state: frequency deviation, generator output, governor position, integral control; $d$ = load disturbance)

$$A = \begin{bmatrix} -1/T_p & K_p/T_p & 0 & 0\\ 0 & -1/T_T & 1/T_T & 0\\ -1/(RT_G) & 0 & -1/T_G & -1/T_G\\ K_E & 0 & 0 & 0 \end{bmatrix}
= \begin{bmatrix} -0.0665 & 8 & 0 & 0\\ 0 & -3.663 & 3.663 & 0\\ -6.86 & 0 & -13.736 & -13.736\\ 0.6 & 0 & 0 & 0 \end{bmatrix}$$

$$B_2 = [0\;\; 0\;\; 13.7355\;\; 0]^T,\qquad B_1 = [8\;\; 0\;\; 0\;\; 0]^T$$

$A$ is unknown; $B_1$, $B_2$ are known.

Game Algebraic Riccati Equation:
$$0 = A^TP + PA + C^TC - P\big(B_2B_2^T - B_1B_1^T\big)P$$
Simulation result – Electric Power Plant LFC
• System – power plant, an internally stable system; system state $x = [\Delta f(t)\;\; \Delta P_g(t)\;\; \Delta X_g(t)\;\; \Delta E(t)]^T$ (incremental changes of frequency deviation, generator output, governor position, and integral control)
• Player 1 – controller; Player 2 – load disturbance
• Nash equilibrium solution:
$$P = \begin{bmatrix} 0.6036 & 0.7398 & 0.0609 & 0.5877\\ 0.7398 & 1.5438 & 0.1702 & 0.5978\\ 0.0609 & 0.1702 & 0.0502 & 0.0357\\ 0.5877 & 0.5978 & 0.0357 & 2.3307 \end{bmatrix}$$
• Online learned solution using ADP, after 5 updates of the parameters of Player 2:
$$P_u^5 = \begin{bmatrix} 0.6036 & 0.7399 & 0.0609 & 0.5877\\ 0.7399 & 1.5440 & 0.1702 & 0.5979\\ 0.0609 & 0.1702 & 0.0502 & 0.0357\\ 0.5877 & 0.5979 & 0.0357 & 2.3307 \end{bmatrix}$$
Solves the GARE online without knowing $A$:
$$0 = A^TP + PA + C^TC - P\big(B_2B_2^T - B_1B_1^T\big)P$$
Parameters of the critic = GARE solution elements
– Cost function learning using least squares
– Sampling integration time T = 0.1 s
– The policy of Player 1 is updated every 2.5 s
– The policy of Player 2 is updated only when the policy of Player 1 has converged
– The figure marks the moments when Player 2 is updated (several updates of Player 1 occur before each update of Player 2)
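The "cost function learning using least squares" step can be sketched as follows: over each sampling interval the IRL Bellman equation is linear in the entries of P, so a batch of intervals gives an ordinary least-squares problem. This is a generic sketch (the function and variable names are my own), with a synthetic consistency check in place of the power-plant data.

    import numpy as np

    # Each IRL interval gives one linear equation in the entries of P:
    #   x(t)^T P x(t) - x(t+T)^T P x(t+T) = integral of the utility over [t, t+T].
    # Stack many intervals and solve by batch least squares.

    def quad_basis(x):
        # Basis for symmetric P: [x1^2, 2*x1*x2, ..., xn^2], paired with
        # the stacked upper-triangular entries [P11, P12, ..., Pnn].
        n = len(x)
        return np.array([(1.0 if i == j else 2.0) * x[i] * x[j]
                         for i in range(n) for j in range(i, n)])

    def fit_P(x_start, x_end, costs):
        Phi = np.array([quad_basis(a) - quad_basis(b)
                        for a, b in zip(x_start, x_end)])
        p, *_ = np.linalg.lstsq(Phi, costs, rcond=None)
        return p  # upper-triangular entries of P

    # Consistency check with a known P and synthetic interval endpoints.
    rng = np.random.default_rng(0)
    P_true = np.array([[2.0, 0.5], [0.5, 1.0]])
    xs, xTs = rng.standard_normal((20, 2)), rng.standard_normal((20, 2))
    costs = (np.einsum('ki,ij,kj->k', xs, P_true, xs)
             - np.einsum('ki,ij,kj->k', xTs, P_true, xTs))
    print(fit_P(xs, xTs, costs))  # recovers [2.0, 0.5, 1.0]

In the simulation above, the integrated utilities over each T = 0.1 s interval are approximated by quadrature of the measured instantaneous cost.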
[Figure: Parameters of the cost function of the game vs. time (s), over 0-300 s; the elements P(1,1), P(1,2), P(2,2), P(3,4), P(4,4) converge to the GARE solution.]
New Developments in IRL for CT Systems: Q-Learning for CT Systems, Experience Replay, Off-Policy IRL
Q learning for CT Systems
The CT Q function is quadratic in $x$, $u$, and the derivatives of $u$:
$$Q\big(x(t),u_{[t:t+T)}\big) = \int_t^{t+T} e^{-\gamma(\tau-t)}\big(x^TSx + u^TRu\big)d\tau + e^{-\gamma T}V(x(t+T))$$

$$\int_t^{t+T} e^{-\gamma(\tau-t)}\big(x^TSx + u_k^TRu_k\big)d\tau = Z_k^T(t)\,\bar M(S,R)\,Z_k(t) + E(t,T),\qquad
Z_k(t) = \begin{bmatrix} x(t)\\ u(t)\\ \dot u(t)\\ \vdots\\ u^{(k)}(t) \end{bmatrix}$$
M. Palanisamy, H. Modares, F.L. Lewis, and M. Aurangzeb,“Continuous-time Q-learning for Infinite-horizon DiscountedCost Linear Quadratic Regulator Problems,” IEEE Trans.Cybernetics, vol. 45, no. 2, pp. 165-176, Feb. 2015.
IRL with Experience Replay: humans use memories of past experiences to tune current policies.
System: $\dot x(t) = f(x(t)) + g(x(t))u(t)$

Value (with constrained inputs $|u|\le\lambda$):
$$V(x(t)) = \int_t^\infty\Big(Q(x) + 2\int_0^u \lambda\tanh^{-1}(v/\lambda)^T R\,dv\Big)d\tau$$

Bellman equation:
$$Q(x) + 2\int_0^u \lambda\tanh^{-1}(v/\lambda)^T R\,dv + \nabla V^T\big(f(x) + g(x)u\big) = 0,\qquad V(0) = 0$$

Action update:
$$u^* = -\lambda\tanh\Big(\frac{1}{2\lambda}R^{-1}g^T(x)\nabla V^*(x)\Big)$$
VFA – Value Function Approximation: $\hat V(x) = \hat W_1^T\phi(x)$

Standard critic weight tuning (normalized gradient descent on the IRL Bellman error):
$$\dot{\hat W}_1(t) = -\alpha\frac{\Delta\phi(t)}{\big(1+\Delta\phi^T(t)\Delta\phi(t)\big)^2}\Big(p(t) + \Delta\phi^T(t)\hat W_1(t)\Big)$$
Modares and Lewis, Automatica 2014; Girish Chowdhary – concurrent learning; Sutton and Barto book
i/o data measurements:
$$\Delta\phi(x(t)) = \phi(x(t)) - \phi(x(t-T)),\qquad
p(t) = \int_{t-T}^t\Big(Q(x) + 2\int_0^u \lambda\tanh^{-1}(v/\lambda)^T R\,dv\Big)d\tau$$
IRL Bellman equation:
$$V(x(t-T)) = \int_{t-T}^t\Big(Q(x) + 2\int_0^u \lambda\tanh^{-1}(v/\lambda)^T R\,dv\Big)d\tau + V(x(t))$$

The Bellman equation gives a linear equation for the weights:
$$\int_{t-T}^t\Big(Q(x) + 2\int_0^u \lambda\tanh^{-1}(v/\lambda)^T R\,dv\Big)d\tau + W_B^T\Delta\phi(x(t)) \equiv \varepsilon_B(t)$$
IRL with Experience Replay: humans use memories of past experiences to tune current policies.

VFA: $\hat V(x) = \hat W_1^T\phi(x)$

NN weight tuning uses past samples (data from previous time intervals):
$$\dot{\hat W}_1(t) = -\alpha\frac{\Delta\phi(t)}{\big(1+\Delta\phi^T\Delta\phi\big)^2}\Big(p(t) + \Delta\phi^T(t)\hat W_1(t)\Big) - \alpha\sum_{j=1}^{l}\frac{\Delta\phi_j}{\big(1+\Delta\phi_j^T\Delta\phi_j\big)^2}\Big(p_j + \Delta\phi_j^T\hat W_1(t)\Big)$$
Improvements:
1. Speeds up convergence
2. The PE condition is milder
i/o data measurements (previous data):
$$\Delta\phi(x(t)) = \phi(x(t)) - \phi(x(t-T)),\qquad
p(t) = \int_{t-T}^t\Big(Q(x) + 2\int_0^u \lambda\tanh^{-1}(v/\lambda)^T R\,dv\Big)d\tau$$
H. Modares, F.L. Lewis, and M.B. Naghibi Sistani,“Integral reinforcement learning and experiencereplay for adaptive optimal control of partially-unknown constrained-input continuous-timesystems,” Automatica, vol. 50, pp. 193-202, 2014.
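A minimal discrete-time sketch of this experience-replay tuning (an Euler step of the update law above; the function name and buffer format are my own):

    import numpy as np

    def er_critic_step(W, dphi, p, buffer, alpha=1.0, dt=0.01):
        """One Euler step of the experience-replay critic update.
        dphi, p : current measured (delta-phi, integral reinforcement) pair
        buffer  : stored past (dphi_j, p_j) pairs (the replay memory)
        """
        def grad(dphi_j, p_j):
            e_j = p_j + dphi_j @ W           # IRL Bellman residual on one sample
            m_j = (1.0 + dphi_j @ dphi_j) ** 2
            return alpha * dphi_j * e_j / m_j
        dW = -grad(dphi, p)                  # current-time gradient term
        for dphi_j, p_j in buffer:           # replayed past-data terms
            dW -= grad(dphi_j, p_j)
        return W + dt * dW

The replay buffer only needs enough samples that the stacked matrix of recorded $\Delta\phi_j$ vectors has full column rank, which is the milder condition that replaces persistence of excitation.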
New Principles
Off-Policy Learning
On‐policy RL
Target and behavior policy
System, Ref.
Target policy: the policy that we are learning about. Behavior policy: the policy that generates actions and behavior.
Target policy and behavior policy are the same
Sutton and Barto Book
Off‐Policy Reinforcement Learning
Humans can learn optimal policies while actually playing suboptimal policies
Off‐policy RL
Behavior Policy System
Target policy
Ref.
Target policy and behavior policy are different
Humans can learn optimal policies while actually applying suboptimal policies
H. Modares, F.L. Lewis, and Z.-P. Jiang, “H-infinity Tracking Control of Completely-unknown Continuous-time Systems via Off-policyReinforcement Learning ,” IEEE Transactions on Neural Networks and Learning Systems, vol. 26, no. 10, pp. 2550-2562, Oct. 2015.
Ruizhuo Song, F.L. Lewis, Qinglai Wei, “Off-Policy Integral Reinforcement Learning Method to Solve Nonlinear Continuous-TimeMulti-Player Non-Zero-Sum Games,” IEEE Transactions on Neural Networks and Learning Systems, vol. 28, no. 3, pp. 704-713, 2017.
Bahare Kiumarsi, Frank L. Lewis, Zhong-Ping Jiang, “H-infinity Control of Linear Discrete-time Systems: Off-policy ReinforcementLearning,” Automatica, vol. 78, pp. 144-152, 2017.
Off-policy IRL: humans can learn optimal policies while actually applying suboptimal policies.
$$\dot x = f(x) + g(x)u,\qquad
J(x) = \int_t^\infty r(x,u)\,d\tau = \int_t^\infty\big(Q(x) + u^TRu\big)d\tau$$

On-policy IRL:
$$J^{[i]}(x(t-T)) = \int_{t-T}^t\big(Q(x) + u^{[i]T}Ru^{[i]}\big)d\tau + J^{[i]}(x(t)),\qquad
u^{[i+1]} = -\tfrac12R^{-1}g^T\nabla J^{[i]}$$

Off-policy IRL: rewrite the dynamics as
$$\dot x = f + gu^{[i]} + g\big(u - u^{[i]}\big)$$
so that
$$J^{[i]}(x(t-T)) - J^{[i]}(x(t)) = \int_{t-T}^t\big(Q(x) + u^{[i]T}Ru^{[i]}\big)d\tau + 2\int_{t-T}^t u^{[i+1]T}R\big(u - u^{[i]}\big)d\tau$$

This is a linear equation for $J^{[i]}$ and $u^{[i+1]}$: they can be found simultaneously, online, from measured data using Kronecker products and VFA.
1. Completely unknown system dynamics
2. Can use the applied u(t) for:
– disturbance rejection – Z.P. Jiang; R. Song and Lewis, 2015
– robust control – Y. Jiang & Z.P. Jiang, IEEE TCS 2012
– exploring probing noise, without bias! – J.Y. Lee, J.B. Park, Y.H. Choi 2012
DDO – must know $g(x)$; the learned optimal target policy differs from the applied behavior policy.
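A minimal end-to-end sketch of off-policy IRL for the LQR special case (in the spirit of Y. Jiang & Z.P. Jiang): simulate once under a fixed behavior input with probing noise, store interval data, then repeatedly solve one least-squares problem for $P^{[i]}$ and the next target gain $K^{[i+1]}$ together. The system matrices and tuning values below are made up; note the learner never uses A or B.

    import numpy as np

    np.random.seed(0)
    A = np.array([[0.0, 1.0], [-0.5, -0.1]])   # unknown to the learner
    B = np.array([[0.0], [1.0]])               # unknown to the learner
    Q, R = np.eye(2), 1.0                      # scalar input weighting

    # --- collect data once, under a behavior input with probing noise
    dt, steps, n_int = 0.001, 50, 300          # Euler step, steps/interval, intervals
    freqs = [1.0, 3.7, 7.3, 11.1]
    x, t, intervals = np.array([1.0, -1.0]), 0.0, []
    for _ in range(n_int):
        a, Sxx, Sxu = x.copy(), np.zeros((2, 2)), np.zeros(2)
        for _ in range(steps):
            u = 0.3 * sum(np.sin(w * t) for w in freqs)  # probing input
            Sxx += np.outer(x, x) * dt                   # running integrals
            Sxu += x * u * dt
            x = x + (A @ x + B[:, 0] * u) * dt           # Euler simulation
            t += dt
        intervals.append((a, x.copy(), Sxx, Sxu))

    # --- off-policy iterations reuse the SAME data for every target policy
    phiP = lambda x: np.array([x[0]**2, 2 * x[0] * x[1], x[1]**2])
    K = np.zeros((1, 2))                       # u^[0] = 0 is admissible (A stable)
    for i in range(8):
        rows, rhs = [], []
        for a, b, Sxx, Sxu in intervals:
            c = np.trace((Q + R * K.T @ K) @ Sxx)        # integrated cost
            Ixu = Sxu + Sxx @ K[0]                       # integral of x*(u - u^[i])
            rows.append(np.concatenate([phiP(a) - phiP(b), 2 * R * Ixu]))
            rhs.append(c)
        theta, *_ = np.linalg.lstsq(np.array(rows), np.array(rhs), rcond=None)
        K = theta[3:].reshape(1, 2)                      # next target gain K^[i+1]
    print(K)  # approaches the LQR gain R^{-1} B^T P

The exploring noise enters the stored u(t) directly, so it causes no bias in the least-squares solution, which is one of the points listed above.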
Off-policy for Multi-player NZS Games

On-policy:
$$\dot x = f(x) + \sum_{j=1}^N g_j(x)u_j,\qquad
V_i(x(t)) = \int_t^\infty r_i(x,u_1,\ldots,u_N)\,d\tau = \int_t^\infty\Big(Q_i(x) + \sum_{j=1}^N u_j^TR_{ij}u_j\Big)d\tau$$

Off-policy: rewrite the dynamics as
$$\dot x = f(x) + \sum_{j=1}^N g_j(x)u_j^{[k]} + \sum_{j=1}^N g_j(x)\big(u_j - u_j^{[k]}\big)$$
so that
$$V_i^{[k]}(x(t-T)) - V_i^{[k]}(x(t)) = \int_{t-T}^t\Big(Q_i(x) + \sum_{j=1}^N u_j^{[k]T}R_{ij}u_j^{[k]}\Big)d\tau + 2\int_{t-T}^t\sum_{j=1}^N u_j^{[k+1]T}R_{jj}\big(u_j - u_j^{[k]}\big)d\tau$$

1. Solve online, using measured data, for $V_i^{[k]}$ and $u_i^{[k+1]}$
2. Completely unknown dynamics
3. Add exploring noise with no bias (DDO)
Ruizhuo Song, F.L. Lewis, Qinglai Wei, “Off-Policy Integral Reinforcement Learning Method to SolveNonlinear Continuous-Time Multi-Player Non-Zero-Sum Games,” IEEE Transactions on NeuralNetworks and Learning Systems, vol. 28, no. 3, pp. 704-713, March 2017.
Off-policy IRL for ZS Games – H-infinity Control (optimal tracker)

Extended state, including the reference trajectory dynamics:
$$X(t) = [e^T(t)\;\; r_d^T(t)]^T\in\mathbb{R}^{2n},\qquad
\dot X(t) = F(X(t)) + G(X(t))u(t) + K(X(t))d(t)$$

Discounted value:
$$V(u,d) = \int_t^\infty e^{-\alpha(\tau-t)}\big(X^TQ_TX + u^TRu - \gamma^2d^Td\big)d\tau$$

Off-policy dynamics:
$$\dot X = F + Gu^i + Kd^i + G\big(u - u^i\big) + K\big(d - d^i\big)$$

1. Solve online, using data, for $V^{i+1}$, $u^{i+1}$, $d^{i+1}$
2. Completely unknown dynamics
3. The disturbance does not need to be specified: observe the played policy of the disturbance and compute its worst-case malicious attack, should it choose to play it
4. Add exploring noise with no bias (DDO)

Biao Luo, H.N. Wu, T. Huang, IEEE T. Cybernetics; Modares, Lewis, Z.P. Jiang, TNNLS 2015; H. Li, Derong Liu, D. Wang, IEEE TASE 2015

H. Modares, F.L. Lewis, and Z.-P. Jiang, “H-infinity Tracking Control of Completely-unknown Continuous-time Systems via Off-policy Reinforcement Learning,” IEEE Transactions on Neural Networks and Learning Systems, vol. 26, no. 10, pp. 2550-2562, Oct. 2015.
Off-Policy Learning for Estimating Malicious Adversaries' Hidden True Intent (two opposing teams)

MAS H-infinity control:
$$\dot x_i = Ax_i + Bu_i + Dv_i$$

Cost:
$$V_i(x_i(t)) = \tfrac12\int_t^\infty\Big(x_i^TQ_{ii}x_i + u_i^TR_{ii}u_i + \sum_j u_j^TR_{ij}u_j - v_i^TT_{ii}v_i - \sum_j v_j^TT_{ij}v_j\Big)dt = \int_t^\infty r_i(x_i,u,v)\,dt$$

Off-policy IRL, with actual behavior policies $u_i$, $v_i$ and target policies $u_i^k$, $v_i^k$:
$$\dot x_i = Ax_i + Bu_i^k + Dv_i^k + B\big(u_i - u_i^k\big) + D\big(v_i - v_i^k\big)$$

Off-policy Bellman equation:
$$V_i^k(x_i(t)) - V_i^k(x_i(t+T)) = \int_t^{t+T} r_i\big(x_i,u_i^k,v_i^k\big)dt + \int_t^{t+T}\Big(2u_i^{(k+1)T}R_{ii}\big(u_i - u_i^k\big) + 2v_i^{(k+1)T}T_{ii}\big(v_i - v_i^k\big)\Big)dt$$

Optimal target policies:
$$u_i^{k+1} = -\tfrac12R_{ii}^{-1}B^T\frac{\partial V_i^k}{\partial x_i},\qquad v_i^{k+1} = \tfrac12T_{ii}^{-1}D^T\frac{\partial V_i^k}{\partial x_i}$$
New Principles – 3. Off-Policy LQ Tracker for Continuous-time Systems
LQ Tracker for Continuous-time Systems – Follow a Leader

System dynamics: $\dot x = Ax + Bu$, $y = Cx$
Assume the reference trajectory is generated by $\dot y_d = Fy_d$
Augmented system state: $X(t) = [x^T(t)\;\; y_d^T(t)]^T$

Augmented system:
$$\dot X = \begin{bmatrix} A & 0\\ 0 & F \end{bmatrix}X + \begin{bmatrix} B\\ 0 \end{bmatrix}u \equiv TX + B_1u$$

Value function: discounted LQ cost on the augmented state (see the tracker ARE below)
Control: $u = KX = K_1x + K_2y_d$

H. Modares and F.L. Lewis, “Linear Quadratic Tracking Control of Partially-Unknown Continuous-time Systems using Reinforcement Learning,” IEEE Trans. Automatic Control, vol. 59, no. 11, pp. 3051-3068, Nov. 2014.
Optimal Tracker Solution by Reinforcement Learning

Optimal control solution:
$$K = [K_1, K_2] = -R^{-1}B_1^TP$$

Tracker ARE (with discount factor $\gamma$):
$$T^TP + PT - \gamma P + C_1^TQC_1 - PB_1R^{-1}B_1^TP = 0$$

The optimal target policy is learned while the actual behavior policy is applied.
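For readers who want to verify a solution numerically: with the discount, the tracker ARE is just a standard ARE for the shifted matrix $T - \tfrac{\gamma}{2}I$, so an off-the-shelf solver applies. A small sketch with made-up plant and reference-generator matrices:

    import numpy as np
    from scipy.linalg import solve_continuous_are

    # Made-up plant and reference generator; yd' = 0 gives a constant setpoint.
    A = np.array([[-1.0, 1.0], [0.0, -2.0]])
    B = np.array([[0.0], [1.0]])
    C = np.array([[1.0, 0.0]])
    F = np.array([[0.0]])
    gamma = 0.1                                  # discount factor

    T  = np.block([[A, np.zeros((2, 1))], [np.zeros((1, 2)), F]])
    B1 = np.vstack([B, np.zeros((1, 1))])
    C1 = np.hstack([C, -np.eye(1)])              # tracking error e = y - yd
    Qe, R = np.eye(1), np.eye(1)

    # The discount only shifts the system matrix: a standard ARE in
    # T - (gamma/2) I reproduces T'P + PT - gamma P + C1'Qe C1 - P B1 R^-1 B1'P = 0.
    P = solve_continuous_are(T - 0.5 * gamma * np.eye(3), B1, C1.T @ Qe @ C1, R)
    K = -np.linalg.solve(R, B1.T @ P)            # u = K X = K1 x + K2 yd
    print(K)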
Output Synchronization of Heterogeneous MAS
Heterogeneous Multi‐Agents
Leader
MAS agent dynamics (heterogeneous):
$$\dot x_i = A_ix_i + B_iu_i,\qquad y_i = C_ix_i$$

Leader (node $x_0$):
$$\dot z_0 = Sz_0,\qquad y_0 = Rz_0$$

Output regulation error:
$$\lim_{t\to\infty}\big(y_i(t) - y_0(t)\big) = 0$$

Output regulator equations:
$$A_i\Pi_i + B_i\Gamma_i = \Pi_iS,\qquad C_i\Pi_i = R$$

Dynamics are different and state dimensions can be different; the output regulator equations capture the common core of all the agents' dynamics and define a synchronization manifold. A model-based sketch of solving these equations follows.
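The standard (model-based) route these equations require can be sketched directly: stack the regulator equations into one linear system with Kronecker products and solve for $(\Pi_i,\Gamma_i)$. All matrices below are made-up stand-ins.

    import numpy as np

    # Solve  Ai Pi + Bi Gam = Pi S,  Ci Pi = R  for (Pi, Gam),
    # using vec(A X B) = (B^T kron A) vec(X).
    Ai = np.array([[0.0, 1.0], [-2.0, -3.0]])
    Bi = np.array([[0.0], [1.0]])
    Ci = np.array([[1.0, 0.0]])
    S  = np.array([[0.0, 1.0], [-1.0, 0.0]])   # leader: harmonic oscillator
    R  = np.array([[1.0, 0.0]])

    n, m, p, q = 2, 1, 1, 2                    # dims of x_i, u_i, y_i, z_0
    Iq = np.eye(q)
    top = np.hstack([np.kron(Iq, Ai) - np.kron(S.T, np.eye(n)),
                     np.kron(Iq, Bi)])          # Ai Pi + Bi Gam - Pi S = 0
    bot = np.hstack([np.kron(Iq, Ci),
                     np.zeros((p * q, m * q))]) # Ci Pi = R
    M   = np.vstack([top, bot])
    rhs = np.concatenate([np.zeros(n * q), R.flatten(order='F')])
    sol = np.linalg.lstsq(M, rhs, rcond=None)[0]
    Pi  = sol[:n * q].reshape(n, q, order='F')
    Gam = sol[n * q:].reshape(m, q, order='F')
    print(np.allclose(Ai @ Pi + Bi @ Gam, Pi @ S), np.allclose(Ci @ Pi, R))

This requires knowing $A_i$, $B_i$, $C_i$, $S$, $R$ exactly, which is precisely what the off-policy solution below avoids.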
Output Synchronization of Heterogeneous MAS
Standard Solution
Agents: $\dot x_i = A_ix_i + B_iu_i$, $y_i = C_ix_i$; leader: $\dot z_0 = Sz_0$, $y_0 = Rz_0$

Output regulator equations:
$$A_i\Pi_i + B_i\Gamma_i = \Pi_iS,\qquad C_i\Pi_i = R$$

Distributed observer of the leader state:
$$\dot\zeta_i = S\zeta_i + c\Big[\sum_{j=1}^N a_{ij}(\zeta_j - \zeta_i) + g_i(\zeta_0 - \zeta_i)\Big]$$

Control:
$$u_i = K_{1i}(x_i - \Pi_i\zeta_i) + \Gamma_i\zeta_i = K_{1i}x_i + (\Gamma_i - K_{1i}\Pi_i)\zeta_i \equiv K_{1i}x_i + K_{2i}\zeta_i$$

This gives output regulation: $y_i(t) - y_0(t)\to 0,\ \forall i$.

Must know your own dynamics $A_i, B_i$ and the leader's dynamics $S, R$, and solve the output regulator equations.
Pioneered by Jie Huang
Optimal Output Synchronization of Heterogeneous MAS Using Off-policy IRL
Our Solution

Agents: $\dot x_i = A_ix_i + B_iu_i$, $y_i = C_ix_i$; leader: $\dot z_0 = Sz_0$, $y_0 = Rz_0$

Optimal tracker problem. Augmented systems:
$$X_i(t) = [x_i^T(t)\;\; z_0^T(t)]^T\in\mathbb{R}^{n_i+p},\qquad
\dot X_i = T_iX_i + B_{1i}u_i,\qquad
T_i = \begin{bmatrix} A_i & 0\\ 0 & S \end{bmatrix},\quad B_{1i} = \begin{bmatrix} B_i\\ 0 \end{bmatrix}$$

Performance index:
$$V_i(X_i(t)) = \int_t^\infty e^{-\gamma_i(\tau-t)}X_i^T\big(\bar C_i^TQ_i\bar C_i + K_i^TW_iK_i\big)X_i\,d\tau = X_i^T(t)P_iX_i(t)$$

Control: $u_i = K_{1i}x_i + K_{2i}z_0 = K_iX_i$
H. Modares, S.P. Nageshrao, G. Lopes, R. Babuska, and F.L. Lewis, “Optimal Model-freeOutput Synchronization of Heterogeneous Systems Using Off-policy ReinforcementLearning,” Automatica, to appear.
Optimal Tracker Solution by Reinforcement Learning

Optimal control: $K_i = [K_{1i}, K_{2i}] = -W_i^{-1}B_{1i}^TP_i$, where $P_i$ solves the tracker ARE:
$$T_i^TP_i + P_iT_i - \gamma_iP_i + \bar C_i^TQ_i\bar C_i - P_iB_{1i}W_i^{-1}B_{1i}^TP_i = 0$$

Algorithm 1. On-policy IRL state-feedback algorithm

Policy evaluation – solve the IRL Bellman equation:
$$e^{-\gamma_i\delta t}X_i^T(t+\delta t)P_i^\kappa X_i(t+\delta t) - X_i^T(t)P_i^\kappa X_i(t) = -\int_t^{t+\delta t}e^{-\gamma_i(\tau-t)}\big((y_i-y_0)^TQ_i(y_i-y_0) + u_i^TW_iu_i\big)d\tau$$

Policy update:
$$K_i^{\kappa+1} = [K_{1i}^{\kappa+1}, K_{2i}^{\kappa+1}] = -W_i^{-1}B_{1i}^TP_i^\kappa$$

Must know $B_{1i}$. The Bellman equation is solved using RLS or batch LS; it requires a persistence of excitation (PE) condition that may be hard to satisfy.

Theorem: Algorithm 1 converges to the solution of the ARE.

Control: $u_i = K_{1i}x_i + K_{2i}z_0 = K_iX_i$
Off-Policy RL

Tracker dynamics: $\dot X_i = T_iX_i + B_{1i}u_i$. Rewrite as
$$\dot X_i = \big(T_i + B_{1i}K_i^\kappa\big)X_i + B_{1i}\big(u_i - K_i^\kappa X_i\big) \equiv T_i^\kappa X_i + B_{1i}\big(u_i - K_i^\kappa X_i\big)$$

Now the Bellman equation becomes
$$e^{-\gamma_i\delta t}X_i^T(t+\delta t)P_i^\kappa X_i(t+\delta t) - X_i^T(t)P_i^\kappa X_i(t) = -\int_t^{t+\delta t}e^{-\gamma_i(\tau-t)}(y_i-y_0)^TQ_i(y_i-y_0)\,d\tau + 2\int_t^{t+\delta t}e^{-\gamma_i(\tau-t)}\big(u_i - K_i^\kappa X_i\big)^TW_iK_i^{\kappa+1}X_i\,d\tau$$
with an extra term containing $K_i^{\kappa+1}$.

Algorithm 2. Off-policy IRL data-based algorithm

Iterate on this equation and solve for $P_i^\kappa$ and $K_i^{\kappa+1}$ simultaneously at each step.

Note about probing noise: if $u_i = K_i^\kappa X_i + e$, then $u_i - K_i^\kappa X_i = e$.

Do not have to know any dynamics: neither the agents' ($\dot x_i = A_ix_i + B_iu_i$, $y_i = C_ix_i$) nor the leader's ($\dot z_0 = Sz_0$, $y_0 = Rz_0$).
Theorem: Off-policy Algorithm 2 converges to the solution of the ARE
$$T_i^TP_i + P_iT_i - \gamma_iP_i + \bar C_i^TQ_i\bar C_i - P_iB_{1i}W_i^{-1}B_{1i}^TP_i = 0$$

Theorem (output regulator equation solution): Let
$$P_i = \begin{bmatrix} P_{11}^i & P_{12}^i\\ P_{21}^i & P_{22}^i \end{bmatrix}$$
Then the solution to the output regulator equations
$$A_i\Pi_i + B_i\Gamma_i = \Pi_iS,\qquad C_i\Pi_i = R$$
is given by
$$\Pi_i = -\big(P_{11}^i\big)^{-1}P_{12}^i,\qquad \Gamma_i = K_{2i} - K_{1i}\big(P_{11}^i\big)^{-1}P_{12}^i$$

Do not have to know the agent dynamics or the leader's dynamics (S, R).
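In code, extracting the regulator solution from the learned, partitioned $P_i$ is one line each (the numbers below are hypothetical placeholders, just to show the shapes):

    import numpy as np

    P11 = np.array([[2.0, 0.3], [0.3, 1.5]])   # learned ARE blocks (hypothetical)
    P12 = np.array([[-1.0], [0.2]])
    K1  = np.array([[0.5, 0.1]])               # learned gains (hypothetical)
    K2  = np.array([[0.7]])

    Pi_i  = -np.linalg.solve(P11, P12)         # Pi_i = -(P11)^-1 P12
    Gam_i = K2 - K1 @ np.linalg.solve(P11, P12)   # = K2 + K1 @ Pi_i
    print(Pi_i, Gam_i)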
To avoid knowledge of the leader's state in the control $u_i = K_{1i}x_i + K_{2i}z_0 = K_iX_i$, use an adaptive observer for the leader's state:
$$\dot{\hat\zeta}_i = \hat S_i\zeta_i + c\Big[\sum_{j=1}^N a_{ij}(\zeta_j - \zeta_i) + g_i(\zeta_0 - \zeta_i)\Big]$$
$$\dot{\widehat{\mathrm{vec}}}(S_i) = -\Gamma_{Si}\big(I_q\otimes\zeta_i\big)\Big[\sum_{j=1}^N a_{ij}(\zeta_j - \zeta_i) + g_i(\zeta_0 - \zeta_i)\Big]$$

Then use the control
$$u_i = K_{1i}x_i + K_{2i}\hat\zeta_i \equiv K_i\hat X_i,\qquad \hat X_i = \begin{bmatrix} x_i\\ \hat\zeta_i \end{bmatrix}$$

Note that $\hat S_i$ may not converge to the actual leader matrix $S$. Do not have to know the leader's dynamics (S, R).
New Principles
There Appear to be Multiple Reinforcement Learning Loops in the Brain
Multiple Actor-Critic Learning Structures
Narendra MMAC - Multiple Model Adaptive Control
Limbic system – amygdala and orbitofrontal cortex (OFC); anterior cingulate cortex (ACC) and dorsolateral prefrontal cortex (DLPFC)
Recent work in Cognitive Neuropsychology reveals the interactions of brain regions in fast decision‐making
New research by Dan Levine
Hippocampus – spatial maps & task context
Thalamus; basal ganglia – dopamine neurons
If there is mismatch or dissonance, the ACC sends a reset to the OFC to form a new category.
If there is risk or stress, the ACC recruits the DLPFC for deliberative decision and control based on real-time data – perhaps optimality?
The hippocampus is invoked for detailed task knowledge.
Basal ganglia select actual actions taken
LTM / STM; ACC and DLPFC – deliberative decision; longer-term memory
Fast skilled response based on stored memories, using gist and SATISFICING – intuition and limited data.
Amygdala and OFC – deliberative decision and control based on more real-time data.
Dan Levine, Neural Dynamics of affect, gist, probability, and choice, Cognitive Sys Research, 2012
The human brain operates with MULTIPLE REINFORCEMENT LOOPS
[Brain diagram: thalamus; basal ganglia – dopamine neurons (reinforcement learning); cerebellum – direct and inverse models (timing pulses, inferior olive); brainstem; spinal cord; interoceptive and exteroceptive receptors; muscle contraction and movement; hippocampus – spatial maps & task context; limbic system – amygdala (emotion, intuition), OFC, ACC (reset); DLPFC / cerebral cortex – new actor-critic reinforcement learning loop. Rhythms: theta 4-10 Hz (RL), gamma 30-100 Hz, motor control 200 Hz.]
There Appear to be Multiple Reinforcement Learning Loops in the Brain
Known work by Doya
Multiple time scales
Fast skilled response using gist
Work by Dan Levine and others
Shunting inhibition NN: the neuron output divides an excitatory synaptic activation by an inhibitory one,
$$g_j = \frac{\sigma\big(w_j^Tx_h + w_{0j}\big) + b_j}{a_j + f\big(c_j^Tx_h + c_{0j}\big)}$$

Standard value function approximation is based only on excitatory NN synapses:
$$\hat V_i(x) = \hat W_i^T\phi_i(x) = \sum_k w_{ik}\phi_{ik}(x)$$
No system dynamics model needed; no model identification needed.
New Fast Decision Structure Using Shunting Inhibition and Multiple Actor-Critic Learning
Ruizhuo Song, F.L. Lewis, Qinglai Wei, and Huaguang Zhang, “Multiple Actor‐Critic Structures forContinuous‐Time Optimal Control Using Input‐Output Data,” IEEE Transactions on NeuralNetworks and Learning Systems, vol. 26, no. 4, pp. 851‐865, April 2015.
We use shunting inhibition‐ it is faster
New Value Function Approximation (VFA) Structures Can Encode Risk, Gist, and Emotional Salience
1. Adaptive Self‐Organizing Map
ASOM
New Actor-Critic ADP structure
Our new VFA structure is
$$V(X(k)) = \sum_{j=1}^J I_j(k)\sum_{i=1}^L w_{ij}\phi_i(X(k))$$
No system dynamics model needed; no model identification needed.

Standard NN approximation is based on excitatory NN synapses:
$$\hat V_i(x) = \hat W_i^T\phi_i(x) = \sum_k w_{ik}\phi_{ik}(x)$$
Indicator function depends on ASOM Classification
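A structural sketch of this indicator-weighted VFA (the numbers and the winner-take-all classifier are stand-ins, not the trained model from the paper):

    import numpy as np

    rng = np.random.default_rng(1)
    som_nodes = rng.standard_normal((5, 2))   # J = 5 SOM prototype vectors
    W = rng.standard_normal((5, 3))           # one critic weight vector per node

    def phi(X):                               # shared basis, L = 3 activations
        return np.array([X[0]**2, X[0] * X[1], X[1]**2])

    def V(X):
        # ASOM classification: the indicator I_j selects the winning node only.
        j = np.argmin(np.linalg.norm(som_nodes - X, axis=1))
        I = np.zeros(len(som_nodes))
        I[j] = 1.0
        return I @ (W @ phi(X))               # V(X) = sum_j I_j * w_j^T phi(X)

    print(V(np.array([0.4, -0.2])))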
2. New Actor‐Critic ADP structure
New Fast Decision Structure Using Adaptive Self‐organizing Map and Multiple Actor‐Critic Learning
B. Kiumarsi, F.L. Lewis, and D. Levine, “Optimal Control of Nonlinear Discrete Time‐varying Systems Usinga New Neural Network Approximation Structure,” Neurocomputing, vol. 156, pp. 157‐165, May 2015
New Value Function Approximation (VFA) Structures Can Encode Risk, Gist, and Emotional Salience