
Adaptive Control Through Reinforcement Learning


Page 1: Adaptive Control Through Reinforcement Learning

Adaptive Control Through Reinforcement Learning

J. Miranda Lemos

INESC-ID, Instituto Superior Técnico/Universidade de Lisboa, Portugal

[email protected]

Seminars on Mathematics, Physics and Machine Learning

July 16, 2020


Page 2: Adaptive Control Through Reinforcement Learning

Objective

Explain to a wide audience:

How to design adaptive optimal controllers by combining optimal control with reinforcement learning, approximate dynamic programming, and artificial neural networks?

[Diagram: optimal control combined with learning (RL, ADP, ANN) and Q-learning yields RL-based adaptive optimal control.]


Page 3: Adaptive Control Through Reinforcement Learning

Presentation road map

What is adaptive control?

Approaches to adaptive control

Early Reinforcement Learning based controllers

RL based linear Model Predictive Control (MPC)

How to tackle adaptive nonlinear optimal control?

Approximate Dynamic Programming (ADP)

Q-Learning

Conclusions


Page 4: Adaptive Control Through Reinforcement Learning

What is control?

Stimulate a system such that it behaves in a specified way.

[Block diagram: the controller (agent) acts on the plant (environment) through an actuator; a sensor returns the measured variables (state, reward); the reference and disturbances also enter the loop; the manipulated variable is the action.]

Physical system (good old gravity law!). Control modifies the dynamic behaviour, yielding a cyber-physical system.

Help of Paula and Francisco kindly acknowledged.


Page 5: Adaptive Control Through Reinforcement Learning

What is control? An example: anesthesia

Controlling neuromuscular blockade for a patient subject to general anesthesia.

[Annotated photo: the controller sends manipulated commands to the actuator; the sensor returns measurement data.]

Source: Project GALENO, Photo taken at Hospital de S. Antonio, Porto, Portugal.


Page 6: Adaptive Control Through Reinforcement Learning

Uncertainty

Uncertainty: Unpredictable variability in plant dynamics.

[Plot: BIS versus time (min), illustrating variability.]


Page 7: Adaptive Control Through Reinforcement Learning

Robustness

Robustness: design the controller for a nominal model, but it works with nearby systems (with graceful degradation in performance). Example: control of the level of unconsciousness in patients subject to general anesthesia. Clinical results:

[Clinical results: BIS and its reference versus time (min), with the controller on/off instants marked; propofol infusion rate u_prop (µg kg⁻¹ min⁻¹) with baseline u0.]

J. M. Lemos, D. V. Caiado, B. A. Costa, L. A. Paz, T. F. Mendonca, R. Rabico, S. Esteves and M. Seabra (2014). Robust Control of Maintenance Phase Anesthesia. IEEE Control Systems, 34(6):24-38.


Page 8: Adaptive Control Through Reinforcement Learning

What is adaptive control?

Modify the control law (= control policy) to make it match the plant. Learn the "best" control policy, not merely the plant inverse.

[Block diagram: the basic control loop of the previous slide, augmented with an adaptation block that uses plant data to make changes in the control policy.]


Page 9: Adaptive Control Through Reinforcement Learning

Two time-scales system

[Diagram: the control action acts on the plant at a fast time scale; adaptation/learning acts at a slow time scale.]


Page 10: Adaptive Control Through Reinforcement Learning

Why use adaptive control?

Controlling time-varying processes. Controlling processes with large variability.

Source: Hizook, 2012. KIVA robots for automated warehouses (now Amazon Robotics). The use of low-cost components causes large variability; adaptive control is used to compensate for the uncertainty.


Page 11: Adaptive Control Through Reinforcement Learning

Approaches to adaptive control

Joint parameter and state nonlinear estimation

Certainty equivalence

SMMAC - Supervised Multiple Model Adaptive Control

Model falsification

Reinforcement Learning (RL)

Control Lyapunov Functions (CLF)


Page 12: Adaptive Control Through Reinforcement Learning

Joint parameter and state nonlinear control

dx/dt = f(x, θ)

Stochastic control of the hyperstate is intractable in computational terms. There is a need for approximate solutions.

Augment the state:

z(t) = [x(t); θ],   dz = [f(x, θ); 0] dt + [0; σ] dw

For a given parameter, the state has a well-defined evolution. If the parameter is a r.v. with a known distribution, how can we compute the state pdf?

[Figure: trajectories x(t, θ) versus time, each generated for a different value of the parameter, and the resulting state pdf p(x, t).]
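As an illustration of the question above, the state pdf can be approximated by Monte Carlo: simulate many trajectories of the augmented system z = [x; θ], each starting from a parameter value drawn from its prior, and histogram the resulting states. The minimal Python sketch below assumes an illustrative scalar drift f(x, θ) = −θx and an illustrative Gaussian prior on θ; these choices are not from the talk.

```python
import numpy as np

def simulate_augmented_sde(n_paths=2000, T=10.0, dt=0.01, sigma=0.05, seed=0):
    """Euler-Maruyama simulation of the augmented state z = [x, theta]:

        dx     = f(x, theta) dt     (state driven by the unknown parameter)
        dtheta = sigma dw           (parameter modelled as a slow random walk)

    The drift f(x, theta) = -theta * x and the Gaussian prior on theta are
    illustrative choices only."""
    rng = np.random.default_rng(seed)
    x = np.ones(n_paths)                       # known initial state x(0) = 1
    theta = rng.normal(1.0, 0.3, n_paths)      # parameter drawn from its prior
    for _ in range(int(T / dt)):
        x = x + (-theta * x) * dt                                   # dx = f(x, theta) dt
        theta = theta + sigma * np.sqrt(dt) * rng.standard_normal(n_paths)
    return x, theta

if __name__ == "__main__":
    x_T, _ = simulate_augmented_sde()
    # A histogram of x_T approximates the state pdf p(x, T)
    print("approximate p(x, T): mean %.3f, std %.3f" % (x_T.mean(), x_T.std()))
```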


Page 13: Adaptive Control Through Reinforcement Learning

Suboptimal solution: Joint state-parameter estimation

dz_t = f(z_t) dt + σ dw_t

p(z, t) satisfies the Fokker-Planck equation (scalar case for simplicity):

∂p/∂t = −f_z(z) p − f(z) ∂p/∂z + (σ²/2) ∂²p/∂z²
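For the scalar case, the Fokker-Planck equation above can also be integrated directly on a grid. The sketch below uses a simple explicit finite-difference scheme with an illustrative drift f(z) = −z and absorbing boundaries; it is only meant to make the equation concrete, not to reproduce the scheme used in the cited work.

```python
import numpy as np

def fokker_planck_scalar(f, sigma, z, p0, T, dt):
    """Explicit finite-difference integration of
        dp/dt = -d(f p)/dz + (sigma^2 / 2) d^2 p / dz^2
    on a uniform grid z, starting from the density p0 (small dt needed for stability)."""
    dz = z[1] - z[0]
    p = p0.copy()
    fz = f(z)
    for _ in range(int(T / dt)):
        flux = fz * p                                             # advective flux f(z) p
        dflux = (np.roll(flux, -1) - np.roll(flux, 1)) / (2 * dz)
        diff = (np.roll(p, -1) - 2 * p + np.roll(p, 1)) / dz ** 2
        p = p + dt * (-dflux + 0.5 * sigma ** 2 * diff)
        p[0] = p[-1] = 0.0                                        # absorbing boundaries
        p = np.clip(p, 0.0, None)
        p = p / (p.sum() * dz)                                    # keep total probability equal to 1
    return p

if __name__ == "__main__":
    z = np.linspace(-4.0, 4.0, 401)
    dz = z[1] - z[0]
    p0 = np.exp(-0.5 * ((z - 1.0) / 0.3) ** 2)                    # initial Gaussian belief
    p0 = p0 / (p0.sum() * dz)
    pT = fokker_planck_scalar(lambda s: -s, sigma=0.5, z=z, p0=p0, T=2.0, dt=1e-4)
    print("mean of p(z, T):", round(float((z * pT).sum() * dz), 4))
```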

Example with an unknown gain. Cautious adaptive control.

Joint work with Antonio Silva

[Figure: two simulation runs showing, versus time, the component x1 = a (the unknown parameter) and the state x2, together with the reference, the real state, the observations, and the filtered maximum-value estimate.]


Page 14: Adaptive Control Through Reinforcement Learning

Certainty equivalence

Assume the estimated model to be the true model

[Block diagram: plant data feed a parameter estimator; the parameter estimates drive a control policy redesign block that sets the controller gains; the controller then computes the control action from plant data.]

Kalman, 1958: self-optimizing controller. Åström and Wittenmark, 1972: self-tuning controller.


Page 15: Adaptive Control Through Reinforcement Learning

Issues with Certainty equivalence: Complex dynamics (1)

Plant dynamics (linear)

y(t) + a1 y(t−1) + a2 y(t−2) = K u(t−1)

Controller

θ(t) = θ(t−1) + p y(t−1) [y(t) − θ(t−1) y(t−1) − K u(t−1)],

u(t) = (r − θ(t) y(t)) / K

The plant is assumed to be of first order although it is of second order.
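The closed loop defined by these three equations is easy to simulate. The sketch below implements them directly; the numerical values of a1, K, p and the reference r are illustrative assumptions, and sweeping a2 over the values shown on the next slide lets one inspect how the behaviour changes.

```python
import numpy as np

def simulate_ce_loop(a1=-1.0, a2=0.2, K=1.0, p=0.5, r=1.0, n=2000):
    """Certainty-equivalence adaptive loop with unmodelled second-order dynamics.

    True plant:  y(t) + a1*y(t-1) + a2*y(t-2) = K*u(t-1)
    Estimator:   theta(t) = theta(t-1) + p*y(t-1)*(y(t) - theta(t-1)*y(t-1) - K*u(t-1))
    Controller:  u(t) = (r - theta(t)*y(t)) / K     (assumes a first-order plant)

    The numerical values of a1, K, p and r are illustrative."""
    y, u, theta = np.zeros(n), np.zeros(n), np.zeros(n)
    y[0] = y[1] = 0.1
    for t in range(2, n):
        y[t] = -a1 * y[t - 1] - a2 * y[t - 2] + K * u[t - 1]
        if not np.isfinite(y[t]) or abs(y[t]) > 1e9:      # stop if the loop diverges
            return y[:t], theta[:t]
        theta[t] = theta[t - 1] + p * y[t - 1] * (y[t] - theta[t - 1] * y[t - 1] - K * u[t - 1])
        u[t] = (r - theta[t] * y[t]) / K
    return y, theta

if __name__ == "__main__":
    for a2 in (0.2, 0.6, 0.71, 0.8):
        y, _ = simulate_ce_loop(a2=a2)
        print(f"a2 = {a2}: last outputs {np.round(y[-4:], 3)}")
```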


Page 16: Adaptive Control Through Reinforcement Learning

Issues with Certainty equivalence: Complex dynamics (2)

With moderate un-modelled dynamics, the output converges to the reference. Increasing the level of un-modelled dynamics causes a sequence of bifurcations that leads to chaos.

[Phase plots of θ versus y for a2 = 0.2, 0.6, 0.71 and 0.8, showing the route from convergence to chaos.]

B. E. Ydstie (1986). Bifurcations and complex dynamics in adaptive control systems. Proc. 25th CDC, Athens, 2232-2236.

B. E. Ydstie and M. P. Golden (1987). Chaos and strange attractors in adaptive control systems. Proc. 10th IFAC World Congress, Munich, 10:127-132.

B. E. Ydstie (1991). Stability of the Direct Self-Tuning Regulator. In P. V. Kokotovic (ed.), Foundations of Adaptive Control, Springer, 1992, 201-237.


Page 17: Adaptive Control Through Reinforcement Learning

Issues with Certainty equivalence: Equivocation

Maximum entropy approach to control, Saridis, 1988

Equivalence between optimal cost and entropy.

Describe the possible controls by a pdf p.

Maximize the entropy subject to ∫_Ω p = 1 and E[J(u)] = J(u*).

Linear case: Separation theorem

Non-linear case: there is no separation theorem

[Diagram: entropy terms H_ctrl, H_est, H_equiv, H_opt.]

Use good adaptation and good control to reduce the equivocation.


Page 18: Adaptive Control Through Reinforcement Learning

SMMAC - Supervised Multiple Model Adaptive Control

Lainiotis, 1974: partitioning (much criticised at the time). Morse, Hespanha, Mosca, ... (1997-present).

[Block diagram: a bank of models/estimators M1/E1, ..., MN/EN with corresponding controllers C1, ..., CN; a supervisory logic SL compares the estimation errors e1, ..., eN and produces the switching signal σ that selects the controller applied to the plant; r is the reference.]

Clinical results for neuromuscular blockade

[Clinical results: r(t) (%), infusion rate u(t) (µg kg⁻¹ min⁻¹), filtered r(t) (%), and switching signal versus time (minutes); the end of infusion is marked and the mean infusion rate is 6.3 µg kg⁻¹ min⁻¹.]

T. Mendonca, J. M. Lemos, H. Magalhaes, P. Rocha and S. Esteves (2009). Drug delivery for neuromuscular blockade with supervised multimodel adaptive control. IEEE Trans. Control Systems Technology, 17(6):1237-1244.


Page 19: Adaptive Control Through Reinforcement Learning

Model falsification

Based on Karl Popper's falsification approach to philosophy. Carve the model bank by eliminating the models that are incompatible with the data. Computationally very heavy.

Architecture based on set-valued observers. Experimental results: fan with varying flow.

P. Rosa, T. Simao, C. Silvestre and J. M. Lemos (2016). Fault tolerant control of an air heating fan using set-valued observers: an experimental evaluation. Int. J. Adaptive Control and Signal Processing, 30(2):336-358.


Page 20: Adaptive Control Through Reinforcement Learning

Control Lyapunov Functions (CLF)

[Figure: a Lyapunov function V over the (x1, x2) plane, evaluated along a trajectory from x(0), with V(x(t)) decreasing from V(x(0)).]

In adaptive control: postulate a Lyapunov function for the hyperstate. Choose the adaptation law so as to force the time derivative of the Lyapunov function to be negative semi-definite.

Convergence follows from the invariant set theorem.

Parks, 1966 and many others since then

Alexander Lyapunov (1857-1918). Lyapunov, 1892; LaSalle, 1950.


Page 21: Adaptive Control Through Reinforcement Learning

Example: Lyapunov adaptation of a solar field

[Block diagram: the collector field is controlled by feedback linearization with CLF-based adaptation of the parameter α; the oil flow is the manipulated variable, the outlet oil temperature is the controlled variable, and the inlet oil temperature and solar radiation enter as external inputs; r is the reference, K a feedback gain acting through a virtual control variable.]

Experimental results

[Experimental results: outlet oil temperature (°C) and oil flow (l/s) versus time (hours).]

Barao, M., J. M. Lemos and R. N. Silva (2002). Reduced complexity adaptive nonlinear control of a distributed collector solar field. J. Process Control, 12:131-141.


Page 22: Adaptive Control Through Reinforcement Learning

Control Lyapunov Functions and Reinforcement Learning

Control Lyapunov functions play a key role in control using reinforcement learning.

See the recent book (2018) and many papers on the subject.

The long-term reward can be used to build Lyapunov functions.


Page 23: Adaptive Control Through Reinforcement Learning

A priori information versus performance

Increasing the a priori information on plant dynamics increases performance but reduces the range of possible applications.

[Qualitative plot: performance versus usability with different plants; increasing the a priori information moves from data-driven toward model-driven approaches.]

Lemos, Neves Silva and Igreja, Adaptive Control of Solar Energy Collector Systems, Springer, 2014.


Page 24: Adaptive Control Through Reinforcement Learning

Reinforcement Learning

Perception causes action

Action influences perception

Learn the optimal action by trial and error to maximize a reward

Apply non-optimal actions with a low probability in order to learn by exploring different regions of the state space

Exploitation and exploration

What is an adequate reward for control design?

How can exploration be carried out in control?

Early roots: Pavlov's (1849-1936) experiments on reflex conditioning.

Countless works since then.


Page 25: Adaptive Control Through Reinforcement Learning

Early RL based adaptive controllers

Whitaker, 1958 MIT rule

A gradient rule to minimize the instantaneous squared tracking error e of a Model Reference Adaptive Controller (MRAC) by adjusting a gain:

dθ/dt = −γ e ∂e/∂θ

Due to technology limitations they used

dθ/dt = −γ e sign(∂e/∂θ)

Crash of the X-15 aircraft on 15 November 1967, which caused the death of the pilot Michael J. Adams. A lot of enthusiasm, poor technology, and no theory at all.
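A minimal sketch of the MIT rule for the classic feedforward-gain example (adjust θ in u = θ·uc so that a plant with unknown gain k matches a reference model with gain k0); all numerical values are illustrative, and the sensitivity ∂e/∂θ is replaced, as usual, by a signal proportional to the model output ym.

```python
import numpy as np

def mit_rule_demo(k=2.0, k0=1.0, a=1.0, gamma=0.2, dt=0.01, T=200.0):
    """MIT-rule adaptation of a feedforward gain (all values illustrative).

    Plant:            dy/dt  = -a*y  + k*u,   with u = theta * uc
    Reference model:  dym/dt = -a*ym + k0*uc
    Error:            e = y - ym,  sensitivity de/dtheta proportional to ym
    MIT rule:         dtheta/dt = -gamma * e * ym"""
    n = int(T / dt)
    y = ym = theta = 0.0
    hist = np.zeros((n, 2))
    for i in range(n):
        uc = 1.0 if (i * dt) % 20.0 < 10.0 else -1.0      # square-wave command signal
        u = theta * uc
        y += dt * (-a * y + k * u)
        ym += dt * (-a * ym + k0 * uc)
        e = y - ym
        theta += dt * (-gamma * e * ym)                   # gradient step of the MIT rule
        hist[i] = (e, theta)
    return hist

if __name__ == "__main__":
    hist = mit_rule_demo()
    print("final adapted gain:", round(hist[-1, 1], 3), "(matching value k0/k = 0.5)")
```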


Page 26: Adaptive Control Through Reinforcement Learning

The road to Predictive Adaptive Control (Adaptive MPC)

Self-tuning regulator, Åström and Wittenmark, 1972: RLS + minimum variance. Unable to stabilize non-minimum-phase plants.

Detuned self-tuning regulator, Clarke and Gawthrop, 1974: includes a penalty on the action. Unable to stabilize non-minimum-phase plants that are also unstable.

GPC, Clarke, Mohtadi and Tuffs, 1987: stabilizes any linear plant for a sufficiently large horizon.

[Diagram: how to choose the present-time action? Look at an extended horizon that slides with the present time.]

Key ideas

Enlarge the horizon

Receding horizon control (a minimal sketch follows below)
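To make the receding-horizon idea concrete, the sketch below solves, at every sampling instant, a finite-horizon LQ problem by a backward Riccati recursion and applies only the first control action. The plant matrices, weights and horizon are illustrative assumptions; in an adaptive predictive controller the model would be re-identified between steps, which is why the gain is recomputed inside the loop.

```python
import numpy as np

def finite_horizon_gain(A, B, Q, R, N):
    """Backward Riccati recursion over N steps; returns the first-step gain K0 (u = -K0 x)."""
    P = Q.copy()
    K0 = None
    for _ in range(N):
        K0 = np.linalg.solve(R + B.T @ P @ B, B.T @ P @ A)
        P = Q + A.T @ P @ (A - B @ K0)
    return K0

def receding_horizon(A, B, Q, R, x0, N=20, steps=50):
    """Receding-horizon loop: re-solve the horizon-N problem at every step and
    apply only the first action (re-solving matters once the model is re-identified)."""
    x = x0.copy()
    traj = [x.copy()]
    for _ in range(steps):
        K0 = finite_horizon_gain(A, B, Q, R, N)
        u = -K0 @ x
        x = A @ x + B @ u
        traj.append(x.copy())
    return np.array(traj)

if __name__ == "__main__":
    A = np.array([[1.2, 1.0], [0.0, 0.9]])   # illustrative open-loop unstable plant
    B = np.array([[0.0], [1.0]])
    Q, R = np.eye(2), np.array([[0.1]])
    traj = receding_horizon(A, B, Q, R, x0=np.array([1.0, -1.0]))
    print("state norm after 50 steps:", round(float(np.linalg.norm(traj[-1])), 6))
```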


Page 27: Adaptive Control Through Reinforcement Learning

RL based linear adaptive MPC

[Diagram: look at an extended horizon that slides with the present time; future control actions are assumed to be given by a constant state feedback.]

F_k = F_{k-1} − γ R_s⁻¹ ∇J

May start from a non-stabilizing gain.

[Figure: evolution of the MUSMAR feedback gains (f1, f2) in the gain plane; unstable regions and the set E u² < 1.88 (CM) are marked, illustrating MUSMAR convergence.]


Page 28: Adaptive Control Through Reinforcement Learning

Example 1: steam temperature control in a boiler. Experimental results.

[Experimental results: superheated steam temperature (°C) versus time (hours); the instants where adaptation starts and where ρ decreases are marked.]

Silva, R. N., P. O. Shirley, J. M. Lemos and A. C. Goncalves (2000). Adaptive regulation of super-heated steam temperature: a case study in an industrial boiler. Control Engineering Practice, 8:1405-1415.


Page 29: Adaptive Control Through Reinforcement Learning

Example 2: rate of cooling in arc welding. Experimental results.

[Experimental results: trailing centerline temperature (°C), tension (V), and controller gains versus time (s).]

Plate with varying thickness. Resulting seams: non-adaptive pole placement (above) and adaptive MPC (below).

Santos, T. O., R. B. Caetano, J. M. Lemos and F. J. Coito (2000). Multipredictive Adaptive Control of Arc Welding Trailing Centerline Temperature. IEEE Trans. Control Systems Technology, 8(1):159-169.


Page 30: Adaptive Control Through Reinforcement Learning

Dual control and persistency of excitation

Duality: learning requires probing (exploration) and conflicts with optimal control.

Feldbaum, 1961

The optimal dual controller is impossible to design, except in very simple cases. One needs to resort to suboptimal dual strategies.


Page 31: Adaptive Control Through Reinforcement Learning

Dual adaptive MPC

Temperature control of solar field

Use a multicriterion approach to adjust the action, reaching a balance between persistency of excitation and good control performance.

Optimize the excitation (probing) to improve learning.

Experimental results

[Experimental results: outlet temperature Tout (°C) versus time (h) during start-up, for the dual MUSMAR and the standard MUSMAR controllers.]

Silva, R. N., N. Filatov, J. M. Lemos and H. Unbehauen (2005). A dual approach to start-up of an Adaptive Predictive Controller. IEEE Trans. Control Systems Technology, 13(6):877-883.


Page 32: Adaptive Control Through Reinforcement Learning

How to tackle adaptive nonlinear optimal control

Approximate Dynamic Programming: a computationally feasible approach to compute the long-term reward.

Q-learning: eliminates model-knowledge assumptions.

Recursive learning/estimation algorithms: embed adaptation.

Werbos, 1992; Sutton and Barto, 1998; Bertsekas, 1996. But there was much work and many publications before.

See F. Lewis and D. Vrabie (2009), Reinforcement Learning and Adaptive Dynamic Programming for Feedback Control, IEEE Circuits and Systems Mag., 9(3):32-50, for a tutorial on the details.


Page 33: Adaptive Control Through Reinforcement Learning

Dynamic Programming

Bellman, 1957 (but actually since Jacob Bernoulli, XVII cent.)

Performance measure (infinite horizon):

V_h(x_k) = Σ_{i=k..∞} γ^(i−k) r(x_i, u_i),   with r(x_k, u_k) = Q(x_k) + u_kᵀ R u_k

Plant state model:

x_{k+1} = f(x_k) + g(x_k) u_k

Control policy u_k = h(x_k). Minimize the performance measure subject to the dynamics.

Bellman's optimality principle

Hamilton-Jacobi-Bellman equation:

V*(x_k) = min_{h(·)} [ r(x_k, h(x_k)) + γ V*(x_{k+1}) ]

Optimal policy:

h*(x_k) = arg min_{h(·)} [ r(x_k, h(x_k)) + γ V*(x_{k+1}) ]


Page 34: Adaptive Control Through Reinforcement Learning

Policy iteration (PI)

Requires a stabilizing initial estimate of the control policy.

Policy evaluation step:

V_{j+1}(x_k) = r(x_k, h_j(x_k)) + γ V_{j+1}(x_{k+1})

Policy improvement step:

h_{j+1}(x_k) = arg min_h [ r(x_k, h(x_k)) + γ V_{j+1}(x_{k+1}) ]

Corresponds to the difference Riccati equation in the LQ case.

Value iteration: at each time step, perform only a limited number (e.g. 1) of policy updates.
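In the LQ case the two steps take an explicit form: policy evaluation solves a Lyapunov equation for the cost matrix P of the current gain, and policy improvement is a least-squares gain update. A minimal model-based sketch follows, assuming γ = 1 and illustrative matrices; the initial gain must be stabilizing, as stated above.

```python
import numpy as np

def policy_evaluation(A, B, Q, R, K, tol=1e-10):
    """Solve P = Q + K'RK + (A-BK)' P (A-BK) by fixed-point iteration
    (valid when A - BK is stable)."""
    Acl, S = A - B @ K, Q + K.T @ R @ K
    P = np.zeros_like(Q)
    while True:
        P_new = S + Acl.T @ P @ Acl
        if np.max(np.abs(P_new - P)) < tol:
            return P_new
        P = P_new

def policy_iteration(A, B, Q, R, K0, iters=20):
    """Alternate policy evaluation with the improvement K <- (R + B'PB)^{-1} B'PA."""
    K = K0
    for _ in range(iters):
        P = policy_evaluation(A, B, Q, R, K)
        K = np.linalg.solve(R + B.T @ P @ B, B.T @ P @ A)
    return K, P

if __name__ == "__main__":
    A = np.array([[1.0, 0.1], [0.0, 1.0]])   # illustrative double-integrator-like plant
    B = np.array([[0.0], [0.1]])
    Q, R = np.eye(2), np.array([[1.0]])
    K0 = np.array([[10.0, 10.0]])            # stabilizing (though poor) initial gain
    K, _ = policy_iteration(A, B, Q, R, K0)
    print("converged LQ gain:", np.round(K, 4))
```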


Page 35: Adaptive Control Through Reinforcement Learning

Adaptive Dynamic Programming

Temporal difference error:

e_k = r(x_k, h(x_k)) + γ V_h(x_{k+1}) − V_h(x_k)

Approximate the value function by V_h(x) ≈ Wᵀφ(x), with the weights W estimated from data.

On-line policy iteration algorithm. Policy evaluation step (obtain W from RLS):

W_{j+1}ᵀ (φ(x_k) − γ φ(x_{k+1})) = r(x_k, h_j(x_k))

Policy improvement step:

h_{j+1}(x_k) = arg min_h [ r(x_k, h(x_k)) + γ W_{j+1}ᵀ φ(x_{k+1}) ]

May start from a non-stabilizing policy.
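A minimal sketch of this on-line policy iteration for a linear plant with a quadratic value function: the basis φ(x) collects the quadratic monomials of the state, the weights W are obtained by RLS from the evaluation equation above, and the improved gain is computed from the recovered matrix P. The plant matrices, discount factor and restart schedule (used here to provide excitation) are illustrative assumptions; note that the improvement step still uses the input matrix B, unlike Q-learning.

```python
import numpy as np

def phi(x):
    """Quadratic basis for a 2-state system: [x1^2, x1*x2, x2^2]."""
    return np.array([x[0] * x[0], x[0] * x[1], x[1] * x[1]])

def W_to_P(W):
    """Recover the symmetric matrix P such that x'Px = W'phi(x)."""
    return np.array([[W[0], W[1] / 2], [W[1] / 2, W[2]]])

def online_policy_iteration(A, B, Q, R, K, gamma=0.99, n_iter=10, n_samples=400, seed=0):
    rng = np.random.default_rng(seed)
    for _ in range(n_iter):
        # Policy evaluation: RLS on  W'(phi(x_k) - gamma*phi(x_{k+1})) = r(x_k, h_j(x_k))
        W, Prls = np.zeros(3), 1e3 * np.eye(3)
        x = rng.standard_normal(2)
        for k in range(n_samples):
            u = -K @ x                                   # on-policy action h_j(x_k)
            r = x @ Q @ x + u @ R @ u
            x_next = A @ x + B @ u
            psi = phi(x) - gamma * phi(x_next)           # regressor of the evaluation equation
            g = Prls @ psi / (1.0 + psi @ Prls @ psi)
            W = W + g * (r - psi @ W)                    # RLS update of the weights
            Prls = Prls - np.outer(g, psi @ Prls)
            x = rng.standard_normal(2) if k % 10 == 9 else x_next   # restarts give excitation
        # Policy improvement (still uses the input matrix B, unlike Q-learning)
        P = W_to_P(W)
        K = np.linalg.solve(R + gamma * B.T @ P @ B, gamma * B.T @ P @ A)
    return K

if __name__ == "__main__":
    A = np.array([[0.9, 0.2], [0.0, 0.8]])   # illustrative stable plant
    B = np.array([[0.0], [1.0]])
    Q, R = np.eye(2), np.array([[1.0]])
    print("learned gain:", np.round(online_policy_iteration(A, B, Q, R, K=np.zeros((1, 2))), 3))
```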


Page 36: Adaptive Control Through Reinforcement Learning

Q-Learning

Q (quality) function:

Q_h(x_k, u_k) = r(x_k, u_k) + γ V_h(x_{k+1})

where u is the control action. Assume a parametric approximator or a neural network of the form

Q_h(x, u) = Wᵀφ(x, u)

The optimal value of the action may be computed from

∂Q*(x_k, u)/∂u = 0

This does not require any derivatives involving model parameters.
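A minimal sketch of Q-learning for the linear-quadratic case: the quality function is quadratic in [x; u], its weights are fitted by least squares from data, and the improved feedback gain follows from ∂Q/∂u = 0 using only the blocks of the estimated matrix H (no model derivatives). The plant below is an illustrative assumption and is used only to generate the data, never inside the learner.

```python
import numpy as np

def phi(x, u):
    """Quadratic basis in z = [x; u] (2 states, 1 input): the monomials of z z'."""
    z = np.concatenate([x, u])
    return np.array([z[0] * z[0], z[0] * z[1], z[0] * z[2],
                     z[1] * z[1], z[1] * z[2], z[2] * z[2]])

def W_to_H(W):
    """Recover the symmetric 3x3 matrix H with z'Hz = W'phi(x, u)."""
    return np.array([[W[0],     W[1] / 2, W[2] / 2],
                     [W[1] / 2, W[3],     W[4] / 2],
                     [W[2] / 2, W[4] / 2, W[5]]])

def q_learning_lq(A, B, Q, R, K, gamma=0.99, n_iter=10, n_samples=600, seed=0):
    """Policy-iteration flavoured Q-learning for a linear plant with quadratic cost."""
    rng = np.random.default_rng(seed)
    for _ in range(n_iter):
        Phi, targets = [], []
        x = rng.standard_normal(2)
        for k in range(n_samples):
            u = -K @ x + 0.2 * rng.standard_normal(1)      # exploratory action
            r = x @ Q @ x + u @ R @ u
            x_next = A @ x + B @ u                          # the plant acts as a black box
            u_next = -K @ x_next                            # current policy at x_{k+1}
            Phi.append(phi(x, u) - gamma * phi(x_next, u_next))
            targets.append(r)
            x = rng.standard_normal(2) if k % 20 == 19 else x_next
        W = np.linalg.lstsq(np.array(Phi), np.array(targets), rcond=None)[0]
        H = W_to_H(W)
        Hux, Huu = H[2:, :2], H[2:, 2:]                     # blocks of H = [[Hxx,Hxu],[Hux,Huu]]
        K = np.linalg.solve(Huu, Hux)                       # dQ/du = 0  =>  u = -Huu^{-1} Hux x
    return K

if __name__ == "__main__":
    A = np.array([[0.9, 0.2], [0.0, 0.8]])   # illustrative plant, used only as a simulator
    B = np.array([[0.0], [1.0]])
    Q, R = np.eye(2), np.array([[1.0]])
    print("learned gain:", np.round(q_learning_lq(A, B, Q, R, K=np.zeros((1, 2))), 3))
```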


Page 37: Adaptive Control Through Reinforcement Learning

Other problems and issues

Difference and differential adaptive games (Soccer!)

Distributed adaptive control

Minimum attention and event-driven adaptive control

Forgetting and adaptation

Dynamic weights and robustness


Page 38: Adaptive Control Through Reinforcement Learning

Conclusions

[Diagram: adaptive control as a meeting point of machine learning (reinforcement learning, supervised learning/SMMAC) and physics (chaos and complex dynamics, maximum entropy, diffusion models and the Fokker-Planck equation, path integral control).]

Adaptive control provides a meeting arena for machine learning and physics (as well as for mathematics!).

The cross-breeding between RL, ADP and Q-learning is boosting algorithms with increased performance for adaptive nonlinear optimal control.


Page 39: Adaptive Control Through Reinforcement Learning

A final word

Guy de Maupassant (1850-1893): "Il fit une philosophie comme on fait un bon roman: tout parut vraisemblable, et rien ne fut vrai."

He did philosophy as one writes a good novel: everything looks plausible, but nothing is true.

We can easily develop plausible algorithms for adaptive control based on "intuition" that actually do not work.

To avoid this pitfall, use the anchors provided by mathematical theories forstability, robustness, limits of performance.

Combining machine learning and model-based methods is a far-reaching ship, but the above anchors must be used to avoid shipwrecks.


Page 40: Adaptive Control Through Reinforcement Learning

It is now time to stop and rest. Thank you for your attention.
