
Adaptive On-Line Reinforcement Learning Controller
Eduardo F. Costa
Vilma A. Oliveira

Departamento de Engenharia Elétrica - USP/EESC, Caixa Postal 359, 13560-970, São Carlos, SP, Brazil


Abstract: In this paper we consider the use of on-line learning for non-linear plants. Because we use only plant input/output to train the controller, a reinforcement control scheme is devised. The main contribution is the reduction of the occurrence of failures during plant operation. Results for a magnetic suspension system are presented to illustrate the effectiveness of the controller developed.

Resumo: In this article, we consider the use of real-time learning for nonlinear plants. Since only the plant input/output is used to train the controller, a control scheme using reinforcement learning is suitable. The main contribution lies in the reduction of the occurrence of failures during system operation. Results for a magnetic suspension system are presented to illustrate the efficiency of the developed controller.

1. Introduction

Conditions that assert the solution of nonlinear systems control problems are not available, except in local terms, as in the neighbourhood of a point (Levin, 1992). Also, design procedures work well only if a number of conditions are satisfied. If time-varying plants with modelling uncertainty and constraints on both the states and the control signal are dealt with, the design procedure becomes even more complicated.

The use of reinforcement learning provides an interesting alternative for the control of systems (Milan, 1996), (Dorigo and Colombetti, 1994), (Costa, Tinós, Oliveira and Araújo, 1997). This technique does not need plant parameters or a plant model (Hoskins and Himmelblau, 1992), (Barto, Sutton and Anderson, 1983) and facilitates the incorporation of practical constraints in the controller design. As a result, this technique may be applied to a large class of systems.

Reinforcement learning addresses the problem of an agent that must learn behaviour through trial-and-error interactions in a dynamic environment. In control applications this involves determining, through experience, a function that maps current states of a plant into control actions and constructing a critic to judge which control actions are acceptable (success) and which are not (failure).

The objective of this paper is to construct a control system capable of driving the plant to an equilibrium point and holding it about that point. We present a learning scheme based on the model-free Adaptive Critic strategies developed by Barto, Sutton and Anderson (1983) and Costa, Tinós, Oliveira and Araújo (1997), which are modified to decrease the occurrence of failures during system operation, resulting in a control system more robust to parametric changes.

2. Problem Statement

This paper deals with the following class of problems: stabilize and maintain a dynamic nonlinear system, described by (1), around an equilibrium point considering constraints on the states and on the amplitude of the control signal.

x(k+1) = f_k[k, x(k), u(k)]    (1)

where x ∈ R^n is the state, u ∈ R is the input, k denotes discrete time and f_k is a non-linear function. The states are assumed to be available and the plant model unknown.

In the sequel, we consider the following definitions. A state x_0 is an equilibrium point of the system (1) if f_k(k, x_0, u_0) = 0 for a given u_0. A failure state occurs when x(k) does not obey a desired behaviour. A trial corresponds to the running of the controller from k = 0 until the occurrence of a failure.

3. Reinforcement Learning Control Schemes

There are two main classes of reinforcement learning algorithms: model-based and model-free algorithms. In the first class, the model-based algorithms learn a model and use it to derive a controller (Sutton, 1990a), (Kumar and Varaiya, 1986), (Sutton, 1990b). In the second class, the model-free algorithms learn a controller but do not learn a model (Barto, Sutton and Anderson, 1983), (Watkins, 1989), (Dayan and Sejnowski, 1994), (Sutton, 1988).

Model-based algorithms demand great computational effort in order to obtain a reward function and a state transition function (Kaelbling, Littman and Moore, 1996). Comparative studies suggest that, among the model-free algorithms, the Barto, Sutton and Anderson (1983) algorithm (BSA hereafter) has lower failure indices (Moriarty and Miikkulainen, 1996).

The main features of the BSA algorithm may be summarized as follows. The control signal is given by an adaptive search element (ASE) which is in charge of mapping the plant states in such a way that one out of two control actions happens, one for better and one for worse. This mapping is done by associating weights to partitions of the state space. The weights are changed by a rule which considers both a control action and a reinforcement signal in such a manner that it decreases the probability of a control action that has happened at instants close to a failure occurrence. One can conclude that a failure inhibits the associated behaviour. The reinforcement signal may be given by a measure of the plant performance or may be provided by an associative critic element (ACE) which is expected to learn to predict failures.

The training strategy above is not suitable for the incorporation of a variety of control actions. Actually, a failure occurrence still indicates that the weights must be changed; however, it does not indicate how, since more than two actions would be taken. For instance, a failure may indicate that a behaviour must be stimulated rather than inhibited.

This problem was considered in Costa, Tinós, Oliveira and Araújo (1997), where two solutions were presented. In the first one, the designer analyses the plant in order to provide a suitable reinforcement criterion. In the second one, called Reinforcement Learning Control Scheme with Temporal Failure Evaluation (RLCSF for short), the controller incorporates a temporal failure evaluation criterion in order to find the correct weight adjustment direction. In this paper we consider a learning strategy based on the RLCSF solution. In this strategy there is no need of plant parameters or models.

In the RLCSF algorithm, the weight adjustment depends on the internal reinforcement signal r̂ provided by the critic element. However, such a dependence may be inappropriate because the temporal failure evaluation criterion associates a fixed weight update direction with an entire trial. Thus, the weight adjustment may be inaccurate when the system stays in a region that did not affect the temporal failure evaluation criterion and which is associated with a failure by the critic. In this case, the controller parameters can be set such that the critic has little importance in the training process. The result is a fast learning control scheme, but one highly dependent on failures to learn properly.

To overcome this difficulty, here we consider a learning scheme with two weight update rules which are associated with the system operation phases. At the beginning of learning, failures tend to occur within a few time instants. In contrast, after enough learning, failures tend not to happen anymore. From the fact that there exists a relationship between the instant of failure occurrence and how much learning the controller has had, one can define a time instant k_a after which failures are not likely to happen.

The first rule considers the temporal failure evaluation criterion; it does not depend on the critic (but on the external reinforcement signal r) and is used before instant k_a. The other rule does not use the temporal failure evaluation criterion but the internal reinforcement signal r̂ and is used after instant k_a.

The result is a fast learning control scheme which is capable of learning after k_a in the absence of failures. In this way, it decreases the probability of failure occurrence during system operation when changes occur, as for instance in the case of parametric variations. The learning scheme, which we call Reinforcement Control Scheme with Two Learning Phases (RSTP for short), is described in the next section.

4. Reinforcement Control Scheme with Two Learning Phases

The strategy of the control scheme is as follows. The control problem is split into sub-problems by dividing the state space Ψ into a number of partitions. Each partition is associated with a specific control signal by adaptive elements. These elements have their weights adjusted by taking into consideration a reinforcement signal, which is a qualitative measure of the plant performance. As in the BSA algorithm, the adaptive elements are called the adaptive search element (ASE) and the associative critic element (ACE).

A decoder is used to partition Ψ. The decoder output d_i is equal to one when the system is in partition i and zero elsewhere, for i = 1, ..., i_T, where i_T is the total number of partitions.

The ASE output u(k) is the control signal which is sent to the plant:

u(k) = f[ Σ_{i=1}^{i_T} w_i(k) d_i(k) + n_s(k) ]    (2)

where k is the discrete time, w_i is the weight associated with partition i, d_i is the decoder output, n_s is a real random variable with probability density function h, and f is either a threshold, sigmoid or identity function.
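To make the decoder output d_i(k) and the ASE output (2) concrete, a minimal Python sketch follows. The interval edges and the choices of f (identity) and n_s (zero-mean, unit-variance Gaussian) are taken from the Appendix; the mixed-radix enumeration of the 288 partitions is an assumption for illustration, not necessarily the authors' encoding.

```python
import numpy as np

# Interval edges per state variable (from the Appendix: 12 intervals for x1,
# 6 for x2 and 4 for x3, giving 12*6*4 = 288 partitions).
EDGES = [
    np.array([0.0050, 0.0065, 0.0078, 0.0088, 0.0094, 0.0099, 0.0100,
              0.0101, 0.0106, 0.0112, 0.0122, 0.0135, 0.0150]),   # x1
    np.array([-0.40, -0.20, -0.05, 0.00, 0.05, 0.20, 0.40]),      # x2
    np.array([0.00, 0.94, 1.26, 1.57, 2.51]),                     # x3
]
I_T = int(np.prod([len(e) - 1 for e in EDGES]))   # total number of partitions i_T

def decode(x):
    """Decoder output d(k): d_i = 1 for the partition containing x, 0 elsewhere.
    The partition index is built as a mixed-radix number (assumed enumeration)."""
    idx = 0
    for xj, edges in zip(x, EDGES):
        j = int(np.clip(np.searchsorted(edges, xj) - 1, 0, len(edges) - 2))
        idx = idx * (len(edges) - 1) + j
    d = np.zeros(I_T)
    d[idx] = 1.0
    return d

def ase_output(w, d, f=lambda s: s, rng=np.random.default_rng(0)):
    """ASE output (2): u(k) = f( sum_i w_i(k) d_i(k) + n_s(k) ), with f the
    identity and n_s a zero-mean, unit-variance Gaussian sample (Appendix)."""
    return f(float(w @ d) + rng.normal(0.0, 1.0))
```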


As said before, we will use two weight update rules. The first one is the same as in the RLCSF and is written as (3), where the positive constant α_1 is the learning rate, r is the external reinforcement signal, e_i is the eligibility term given by (6), which associates the responsibility of the weights of partition i with the failure occurrence, and sg is given by

sg(t, h) = sg(t−1, h),   if h ≥ 0
sg(t, h) = −sg(t−1, h),  if h < 0    (4)

with k_f(t) the instant k at which the failure occurs at trial t, and sg(0, h) = 1.

The eligibility is given by

e_i(k+1) = δ e_i(k) + (1 − δ) d_i(k)    (6)

where δ is a decay rate (0 ≤ δ < 1).

For obtaining the second rule, we consider that the weight adjustment directions taken in the period before k_a are adequate for the period after k_a. We may consider the difference between the initial weights w_i(0) and the weights at time instant k_a, w_i(k_a), as a register of the weight adjustments. Thus, the second rule is written as (5), where r̂ is the internal reinforcement signal provided by the ACE and α_2 is the learning rate for this phase.

The ACE objective is to predict a failure before it happens, allowing anticipative weight adjustment. The ACE equations, based on the BSA algorithm, are as follows. The ACE output is

p(k) = Σ_{i=1}^{i_T} v_i(k) d_i(k)    (7)

where v_i(k) is the weight associated with partition i.

The internal reinforcement signal r̂ is given by

r̂(k) = | r(k) + γ p(k) − p(k−1) |    (8)

where r is the external reinforcement signal, responsible for adjusting the weights or not, and γ is a real valued constant (0 < γ ≤ 1). Equation (8) is the absolute value of the internal reinforcement signal given in the BSA algorithm. This may be explained as follows. In the BSA algorithm, the critic provides a signal which may represent a punishment (if the system enters a worse region) or a reward (otherwise). In this way, the critic also affects the weight update direction. Considering that after the instant k_a the weight update directions are the same as before instant k_a, it is not necessary, and sometimes unproductive, to use the critic in that way. Actually, as mentioned before, because we are considering a variety of control actions, the fact that the system enters a worse or a better region does not mean that the controller behaviour must be stimulated or inhibited, respectively.

The weight update equation and the critic eligibility signal equation are the same as in the BSA algorithm:

v_i(k+1) = v_i(k) + β r̂(k) ē_i(k)    (9)

ē_i(k+1) = λ ē_i(k) + (1 − λ) d_i(k)    (10)

where ē_i is analogous to the eligibility in the ASE, λ is a real valued constant (0 ≤ λ < 1), and the positive constant β is the critic learning rate.

The steps of the learning algorithm are as follows, where t is the trial number, t_max is the total number of trials and k_max is the total number of sampling periods.
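A compact sketch of the critic computations (7)-(10) and the ASE eligibility (6), assuming the one-hot decoder output d(k) of the previous sketch; the parameter values are the RSTP values listed in the Appendix, and the per-step ordering of the updates is one reasonable reading of the algorithm listed below.

```python
import numpy as np

class ACE:
    """Associative critic element: prediction p(k) (7), internal reinforcement
    r_hat(k) (8), critic weight update (9) and critic eligibility (10)."""
    def __init__(self, i_T, beta=0.01, gamma=1.0, lam=0.7):
        self.v = np.zeros(i_T)       # critic weights v_i
        self.e_bar = np.zeros(i_T)   # critic eligibility trace e_bar_i
        self.p_prev = 0.0            # p(k-1)
        self.beta, self.gamma, self.lam = beta, gamma, lam

    def step(self, d, r):
        p = float(self.v @ d)                                    # (7)
        r_hat = abs(r + self.gamma * p - self.p_prev)            # (8)
        self.v += self.beta * r_hat * self.e_bar                 # (9)
        self.e_bar = self.lam * self.e_bar + (1 - self.lam) * d  # (10)
        self.p_prev = p
        return r_hat

def ase_eligibility(e, d, delta=0.9):
    """ASE eligibility (6): e_i(k+1) = delta*e_i(k) + (1 - delta)*d_i(k)."""
    return delta * e + (1 - delta) * d
```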

1. Initialise the ASE and ACE weights, set t = 1;
2. While t < t_max, do:
2.1. t = t + 1;
2.2. k = 0, r(k) = 0, initialise x(k);
2.3. While r(k) = 0 and k < k_max, do:
2.3.1. k = k + 1;
2.3.2. Generate d_i(k), i = 1, ..., i_T, by decoding the actual system state vector x_j(k), j = 1, ..., n;
2.3.3. If k ≤ k_a, calculate the ASE output u(k) (2); if k > k_a, calculate the ASE output u(k) (2) and the ACE output p(k) (7);
2.3.4. Present u(k) to the system and obtain x_j(k+1), j = 1, ..., n (1);
2.3.5. Calculate e_i(k+1) (6) and ē_i(k+1) (10), i = 1, ..., i_T;
2.3.6. If k ≤ k_a, determine r(k+1) using a defined criterion; if k > k_a, determine r(k+1) using the criterion and r̂(k+1) (8);
2.3.7. Update the ACE weights v_i(k+1) (9);
2.3.8. Update the ASE weights w_i(k+1) using (3) if k ≤ k_a or using (5) if k > k_a;
2.4. If k = k_max, finish the algorithm.

Remark: This algorithm may be changed in order to fit specific problems. For instance, it is possible to consider only the first phase (k < k_a) to pre-train the controller.
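The listing above can be read as the skeleton below, a hedged Python sketch that reuses decode, ase_output, ase_eligibility and ACE from the earlier sketches. The two ASE weight update rules (3) and (5) are not reproduced in this copy, so they appear as user-supplied callables with illustrative signatures, and for brevity the critic runs at every step rather than only for k > k_a as in steps 2.3.3 and 2.3.6.

```python
import numpy as np

def run_training(plant_step, reinforcement, ase_rule1, ase_rule2, x0_set,
                 i_T, k_a=500, t_max=100, k_max=2000):
    """Skeleton of steps 1-2.4. plant_step implements (1); reinforcement
    implements the external criterion, e.g. (12); ase_rule1/ase_rule2 stand
    in for the ASE updates (3) and (5). k_a here is an illustrative value."""
    w = np.zeros(i_T)                      # ASE weights w_i
    ace = ACE(i_T)
    for t in range(1, t_max):              # steps 2-2.1
        x = x0_set[t % len(x0_set)]        # step 2.2: initialise x(k)
        e = np.zeros(i_T)
        r, k = 0.0, 0
        while r == 0.0 and k < k_max:      # step 2.3
            k += 1                         # step 2.3.1
            d = decode(x)                  # step 2.3.2
            u = ase_output(w, d)           # step 2.3.3, eq. (2)
            x = plant_step(k, x, u)        # step 2.3.4, eq. (1)
            e = ase_eligibility(e, d)      # step 2.3.5, eq. (6)
            r = reinforcement(k, x)        # step 2.3.6, e.g. eq. (12)
            r_hat = ace.step(d, r)         # steps 2.3.6-2.3.7, eqs. (7)-(10)
            # step 2.3.8: first-phase rule (3) before k_a, second-phase rule (5) after
            w = ase_rule1(w, e, r, t, k) if k <= k_a else ase_rule2(w, r_hat)
        if k == k_max:                     # step 2.4
            break
    return w
```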

5. Results

In this section we present results for a magnetic suspension system to illustrate the application and effectiveness of the given control scheme. Practical constraints and the discrete-time dynamic equations used to simulate the plant are found in the Appendix. For comparison purposes, results for the RLCSF are included.



We consider that the coil resistance R and the parameter a are time varying. R and a are changed by equations (A-4) and (A-5) given in the Appendix, with the parameters c_R, c_a related to the amplitude of the variations and τ_R, τ_a related to the velocity of the variations. During the learning procedure, we adopt c_R = -10, c_a = 0, τ_R = τ_a = 2000.

The adopted number of partitions of the state space Ψ is 288 for the two schemes: 12 intervals for x_1, 6 for x_2 and 4 for x_3. The learning algorithm is executed for different initial states, as follows: we define a set of initial states X_ini = {x^1, ..., x^5} with x^l = [x_1^l 0 x_30]', where x_30 is the equilibrium current for the equilibrium position x_10 = 0.01. In step 2.2 the system states are initialised as x(0) = X_ini(n_l) and the learning algorithm is run for n_l = 1, ..., 5.

Because we considered such high amplitudes of the variations (observing that R = 19.9), the RLCSF showed a high number of trials for each initial position. Actually, this number was frequently equal to the maximum number of trials allowed, t_max = 100. The typical number of trials for each initial state was 100 and 37 for the RLCSF and the RSTP, respectively. The controller parameters are found in the Appendix.

For both schemes the function f, which gives the value of the controller output, is given by (11).

The reinforcement criterion for the controllers was developed by imposing a minimal acceptable performance. Equation (12) shows the reinforcement criterion; see also Figure 1.

r(k) = 0,   if x_1inf(k) < x_1(k) < x_1sup(k) and x_2inf(k) < x_2(k) < x_2sup(k)
r(k) = -1,  otherwise    (12)

where

x_jsup(k) = [(x_jcmax − x_jimax) k / k_c] + x_jimax, for k < k_c;   x_jsup(k) = x_jcmax, for k ≥ k_c

x_jinf(k) = [(x_jcmin − x_jimin) k / k_c] + x_jimin, for k < k_c;   x_jinf(k) = x_jcmin, for k ≥ k_c

k_c is a sampling time chosen by the designer, x_jcmax is the maximum x_j at sampling time k_c, x_jcmin is the minimum x_j at sampling time k_c, x_jimax is the maximum initial x_j, x_jimin is the minimum initial x_j, j = 1, ..., n_s.
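A sketch of the external reinforcement criterion (12) with the ramped bounds x_jsup and x_jinf is given below. The values of x_1cmax, x_1cmin, x_1imin, x_2cmax, x_2cmin, x_2imin and k_c come from the Appendix; x_1imax and x_2imax are not listed in this copy, so the values used for them here are assumptions.

```python
def bound(k, k_c, c_val, i_val):
    """Linear ramp from the initial bound i_val to the bound c_val reached at
    k_c, constant afterwards; used for both x_jsup(k) and x_jinf(k)."""
    return (c_val - i_val) * k / k_c + i_val if k < k_c else c_val

def reinforcement(k, x, k_c=50):
    """External reinforcement r(k), eq. (12): 0 while x1 and x2 stay inside the
    shrinking envelope, -1 (failure) otherwise."""
    x1_sup = bound(k, k_c, 0.011, 0.015)    # x1cmax given; x1imax assumed
    x1_inf = bound(k, k_c, 0.009, 0.008)    # x1cmin, x1imin given
    x2_sup = bound(k, k_c, 0.05, 0.30)      # x2cmax given; x2imax assumed
    x2_inf = bound(k, k_c, -0.05, -0.30)    # x2cmin, x2imin given
    inside = (x1_inf < x[0] < x1_sup) and (x2_inf < x[1] < x2_sup)
    return 0.0 if inside else -1.0
```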

Figure 1 shows the results for the RSTP with c_R = 0, c_a = 0, for comparison with the results for different values of c_R, c_a. The results for the RLCSF under the same conditions are very close to these, and are not presented.

Figure 1 - RSTP results (sphere position versus sampling time k) for different initial positions with c_R = 0, c_a = 0. The reinforcement criterion is also shown (dotted line).


Figure 2 shows the results for both controllers with c_R = -10, c_a = 0, τ_R = τ_a = 2000, which correspond to the values used during the training procedure. Table 1 shows the time instants at which the failure occurs for the different initial states considered in the training.

Figure 2 - Results (sphere position versus sampling time k) of the RLCSF and RSTP with c_R = -10, c_a = 0, τ_R = τ_a = 2000.

Table 1 - Time instants at which the failure occurs with c_R = -10, c_a = 0, τ_R = τ_a = 2000.
x_1(0)   RLCSF   RSTP
0.0085   259     -
0.0090   319     -
0.0100   296     -
0.0110   480     -
0.0115   330     -

Figures 3 and 4 show the results for the RLCSF and RSTP with c_R = 10, c_a = 0, τ_R = τ_a = 2000 and with c_R = 0, c_a = 0.12, τ_R = τ_a = 2000, respectively, both for the initial position x_1(0) = 0.008. Tables 2 and 3 show the time instants at which the failure occurs for the different initial states.

Figure 3 - Results (sphere position versus sampling time k) of the RLCSF and RSTP with c_R = 10, c_a = 0, τ_R = τ_a = 2000.

Table 2 - Time instants at which the failure occurs with c_R = 10, c_a = 0, τ_R = τ_a = 2000.
x_1(0)   RLCSF   RSTP
0.0085   656     -
0.0090   658     -
0.0100   641     -
0.0110   656     -
0.0115   48      53

Figure 4 - Results (sphere position versus sampling time k) of the RLCSF and RSTP with c_R = 0, c_a = 0.12, τ_R = τ_a = 2000.

Table 3 - Time instants at which the failure occurs with c_R = 0, c_a = 0.12, τ_R = τ_a = 2000.
x_1(0)   RLCSF   RSTP
0.0085   422     -
0.0090   368     -
0.0100   247     -
0.0110   439     -
0.0115   266     -

6. Discussions and Conclusions

In this paper, we present a reinforcement learning controller with two weight update rules corresponding to two distinct system operation phases. The controller can be used on-line, is adaptive and capable of learning in the absence of failures, and needs no knowledge of the plant parameters or models. It also exhibits fast learning, requiring few trials to be trained.

The results showed that the proposed controller is more robust to parametric variation when compared to the previous reinforcement learning controller on which this work was based.

Well developed control methods can be successfully applied to suspension systems, but they are more dependent on the knowledge of the plant dynamics and uncertainties.


Appendix

The suspension system dynamic equations used to simulate the plant include

x_2(k+1) = (1/T_0) (x_1(k) − x_1(k−1))    (A-2)

x_3(k+1) = (L x_3(k) + T_0 u(k)) / (T_0 R + L)    (A-3)

where x_1 is the sphere position, x_2 the sphere speed, x_3 the coil current, T_0 the sampling period, u the coil voltage, g the gravity constant, m the mass of the steel ball, R the coil resistance, L the coil inductance, L_b0 the coil inductance when x_1 = 0, and a a constant.

R and a vary in time according to

R(k) = R_0 + c_R sin(τ_R k)    (A-4)

a(k) = a_0 + c_a sin(τ_a k)    (A-5)

where R_0 is the initial coil resistance, a_0 is the initial value of the parameter a, and c_R, c_a, τ_R and τ_a are constants.
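To illustrate how the simulated plant with the time-varying parameters (A-4), (A-5) can be put together, here is a hedged Python sketch. The position update (A-1) is not reproduced above, so it enters as a user-supplied function, and the sinusoidal drift arguments follow the reconstruction given here; treat this as an assumption-laden outline rather than the authors' simulator.

```python
import math

def simulate(position_update, controller, n_steps, T0=0.001, L=0.47,
             R0=19.9, a0=0.00607, cR=-10.0, cA=0.0, tauR=2000.0, tauA=2000.0,
             x1_init=0.01, x3_init=0.876):
    """Sketch of the suspension simulation. position_update(k, x1, x1_prev, x3, a)
    stands in for the position dynamics (A-1); controller(k, x) returns the coil
    voltage u(k). The remaining updates follow (A-2)-(A-5) as reconstructed."""
    x1_prev, x1, x3 = x1_init, x1_init, x3_init
    x2 = 0.0
    traj = []
    for k in range(1, n_steps + 1):
        R = R0 + cR * math.sin(tauR * k)          # (A-4)
        a = a0 + cA * math.sin(tauA * k)          # (A-5)
        u = controller(k, (x1, x2, x3))           # coil voltage from the controller
        x1_next = position_update(k, x1, x1_prev, x3, a)  # (A-1), user supplied
        x2 = (x1_next - x1) / T0                  # (A-2): backward-difference speed
        x3 = (L * x3 + T0 * u) / (T0 * R + L)     # (A-3) as reconstructed above
        x1_prev, x1 = x1, x1_next
        traj.append((x1, x2, x3))
    return traj
```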



Parameters of the real magnetic suspension system: m = 0.02255 kg, R_0 = 19.9 Ω, L_b0 = 0.0245 H, a_0 = 0.00607, L = 0.47 H and T_0 = 0.001 s.

Equilibrium point: (x_10, x_20, x_30) = (0.01, 0.0, 0.876). Practical constraints on the states: 0.005 ≤ x_1 ≤ 0.015 and 0 ≤ x_3 ≤ 2.51. Practical constraints on the control action: 0 ≤ u ≤ 50 V.

Intervals for x_1: [0.0050 0.0065]; [0.0065 0.0078]; [0.0078 0.0088]; [0.0088 0.0094]; [0.0094 0.0099]; [0.0099 0.0100]; [0.0100 0.0101]; [0.0101 0.0106]; [0.0106 0.0112]; [0.0112 0.0122]; [0.0122 0.0135]; [0.0135 0.0150]. Intervals for x_2: [-0.40 -0.20]; [-0.20 -0.05]; [-0.05 0.00]; [0.00 0.05]; [0.05 0.20]; [0.20 0.40]. Intervals for x_3: [0.00 0.94]; [0.94 1.26]; [1.26 1.57]; [1.57 2.51].

Parameters of the RLCSF and RSTP: t_max = 100; k_max = 2000; k_c = 50; x_1cmax = 0.011; x_1cmin = 0.009; x_1imin = 0.008; x_2cmax = 0.05; x_2cmin = -0.05; x_2imin = -0.3; x_3cmax = x_3imax = ∞; x_3cmin = x_3imin = -∞; u_max = 50; u_min = 0; μ_emax = 9; f is the identity function and n_s is a real random variable with normal probability density function with zero mean and variance equal to one.

Parameters of the RLCSF: α = 1; δ = 0.9; β = 0.005; λ = 0.5; γ = 1. Parameters of the RSTP: α_1 = 0.9; α_2 = 0.01; δ = 0.9; β = 0.01; λ = 0.7; γ = 1.

References

Barto, A. G., R. S. Sutton and C. W. Anderson (1983). Neuronlike Adaptive Elements That Can Solve Difficult Learning Control Problems. IEEE Transactions on Systems, Man, and Cybernetics, SMC-13, pp. 834-846.

Costa, E. F., R. Tinós, V. A. Oliveira and A. F. R. Araújo (1997). Reinforcement Learning Schemes for Control Design. In Proc. of the 1997 American Control Conference, Albuquerque, NM, USA.

Dayan, P. and T. J. Sejnowski (1994). TD(λ) Converges with Probability 1. Machine Learning, 14 (3).

Dorigo, M. and M. Colombetti (1994). Robot Shaping: Developing Autonomous Agents through Learning. Artificial Intelligence, 71 (2), pp. 321-370.

Hoskins, J. C. and D. M. Himmelblau (1992). Process Control via Artificial Neural Networks and Reinforcement Learning. Computers Chem. Engng., 16 (4), pp. 241-251.

Kaelbling, L. P., M. L. Littman and A. W. Moore (1996). Reinforcement Learning: A Survey. Computer Science Department, Brown University, Providence, USA.

Kumar, P. R. and P. P. Varaiya (1986). Stochastic Systems: Estimation, Identification and Adaptive Control. Prentice Hall.

Levin, A. U. (1992). Neural Networks in Dynamical Systems. PhD Thesis, Yale University.

Milan, J. A. R. (1996). Rapid, Safe and Incremental Learning of Navigation Strategies. IEEE Transactions on Systems, Man and Cybernetics, 26 (3).

Moriarty, D. A. and R. Miikkulainen (1996). Efficient Reinforcement Learning through Symbiotic Evolution. Machine Learning, 22 (1), pp. 11-32.

Sutton, R. S. (1988). Learning to Predict by the Methods of Temporal Differences. Machine Learning, 3 (1), pp. 9-44.

Sutton, R. S. (1990a). Integrated Architectures for Learning, Planning and Reacting Based on Approximating Dynamic Programming. In Proc. of the Seventh International Conference on Machine Learning, Austin, TX, USA.

Sutton, R. S. (1990b). First Results with Dyna, an Integrated Architecture for Learning, Planning and Reacting. In W. T. Miller, R. S. Sutton and P. J. Werbos (Eds.), Neural Networks for Control, MIT Press.

Watkins, C. (1989). Learning from Delayed Rewards. PhD Thesis, Cambridge University, Cambridge, England.

