
Proceedings of International Joint Conference on Neural Networks, Montreal, Canada, July 31 - August 4, 2005

Locally Recurrent Neural Networks Optimal Filtering Algorithms: Application to Wind Speed Prediction Using Spatial Correlation

T. G. Barbounis and J. B. Theocharis
Aristotle University of Thessaloniki
Department of Electrical and Computer Eng.
54124 Thessaloniki, Greece
E-mail: [email protected], [email protected]

Abstract -- This paper focuses on a locally recurrent multilayer network with internal feedback paths, the IIR-MLP. The computation of the partial derivatives of the network's output with respect to its trainable weights is achieved using backpropagation through adjoints, and a second-order global recursive prediction error (GRPE) training algorithm is developed. A local version of the GRPE is also presented in order to cope with the increased computational burden of the global version. The efficiency of the proposed learning schemes, as compared to conventional gradient-based methods, is tested on the problem of wind speed prediction from 15 min to 3 h ahead at a site, using spatial correlation and exploiting measurements from nearby sites up to 40 km away.

I. INTRODUCTION

Recurrent neural networks are alternative architectures including feedback, thus allowing temporal information to be internally represented and manipulated. Unlike traditional feedforward neural networks, the recurrent ones can establish a dynamic mapping relating input to output sequences.

The first class of dynamic networks suggested is the time-delay neural networks (TDNN) [1], where tapped delay lines are employed either at the network's input side or at the input of each neuron. A different but equivalent structure to the TDNN is the FIR-MLP, where the synapses are implemented by FIR filters [2]. Apart from the limited past history of the above networks, which restricts their representation capabilities, a major drawback is that a sufficient temporal depth requires a large number of weights, thus increasing the computational complexity. The main paradigm of dynamic networks is the fully recurrent neural network [3], composed of a single layer of neurons that are fully interconnected with each other, while an enriched network structure consisting of several recurrent layers is suggested by Puskorius and Feldkamp [4]. Due to their general architecture they can model a wide class of systems, although simpler structures might be considered based on prior knowledge, thus facilitating the computational task.

During the last years a great effort has been devoted to the so-called local feedback multilayer networks (LF-MLN). According to this approach, recurrency is locally introduced within each neuron through feedback or using IIR filters. The advantages of the LF-MLN include small-size hierarchical structures, moderate parametric complexity for a specific problem and simpler learning compared to fully recurrent networks.

Training of recurrent neural networks is mainly accomplished using two popular methods based on gradient descent: the backpropagation through time (BPTT) [3],[5] and the real-time recurrent learning (RTRL) [6] algorithms. BPTT is an extension of the standard BP algorithm to dynamic networks. During the learning phase, the network computations are performed in a backward fashion. Therefore, in order to develop an on-line learning scheme, additional routines are necessary, including the causalization and the truncation of the past history. On the other hand, RTRL requires the computation and storage of the partial derivatives of the neuron outputs with respect to the trainable weights of the network. Although the RTRL algorithm is, under certain conditions, computationally more complex, it is intrinsically suitable for on-line applications. Both BPTT and RTRL exhibit long convergence times due to the small learning rates required and often become trapped in local minima of the error function.

In this paper, we discuss an alternative gradient-based algorithm for on-line training of recurrent networks composed of IIR-MLP neurons. Along this line, based on the notion of ordered derivatives, we first develop and explain the adjoint networks that provide the partial derivatives of the output with respect to the network weights; they are computed backwards in a manner similar to BPTT. Following that, two versions of a recursive prediction error (RPE) algorithm are devised, providing on-line estimates of the weights, namely the GRPE and its local version, the Decoupled RPE (DRPE). These algorithms have enhanced learning qualities and take a form that strongly resembles that of the recursive least squares method.

II. NETWORK ARCHITECTURE

The local recurrent networks are composed of layers arranged in a feedforward fashion. Each layer contains dynamic processing units (neurons) with time-delay lines and/or feedback. In this paper we focus on the IIR-MLP architecture suggested by Tsoi and Back [7], where the synaptic weights are replaced with infinite impulse response (IIR) linear filters, also referred to as autoregressive moving average (ARMA) models (Fig. 1). To cope with the structural complexity of the networks, the following notational convention is employed. As shown in


Fig. 1, the IIR-MLP network is assumed to consist of $l = 0, \ldots, M$ layers, with $l = 0$ and $l = M$ denoting the input and the output layer, respectively. The $l$-th layer contains $N_l$ neurons, with $N_0$ and $N_M$ being the number of neurons in the input and the output layer, respectively. $x_n^{(l)}[t]$ is the output of the $n$-th neuron of the $l$-th layer at time $t$. In particular, $x_n^{(0)}[t]$, $n = 1, \ldots, N_0$, are the input signals, while $x_n^{(M)}[t]$, $n = 1, \ldots, N_M$, are the output signals. $s_n^{(l)}[t]$ is the output of the summing point, that is, the input to the activation function of the $n$-th neuron of the $l$-th layer at time $t$. $y_{nm}^{(l)}[t]$ is the output at time $t$ of the synaptic filter connecting the $n$-th neuron in the $l$-th layer with the $m$-th neuron of the $(l-1)$-th layer. $(L_{nm}^{(l)} - 1)$ and $I_{nm}^{(l)}$ denote the order of the MA and the AR part, respectively, of the synapse connecting the $n$-th neuron in the $l$-th layer with the $m$-th input coming from the $(l-1)$-th layer, with $L_{nm}^{(l)} \geq 1$, $L_{n0}^{(l)} = 1$ and $I_{nm}^{(l)} \geq 0$, $I_{n0}^{(l)} = 0$. $w_{nm}^{(l)}(p)$, $p = 0, \ldots, L_{nm}^{(l)} - 1$, and $v_{nm}^{(l)}(p)$, $p = 1, \ldots, I_{nm}^{(l)}$, are the coefficients of the MA and the AR part, respectively, of the corresponding synapse. $w_{n0}^{(l)}$ are the bias terms, which have a dc input equal to 1. Finally, $\mathrm{sgm}(\cdot)$ and $\mathrm{sgm}'(\cdot)$ are the node's activation function and its derivative.

Fig. 1. Neuron's model corresponding to the IIR-MLP architecture.

The forward run at time $t$, evaluated for $l = 1, \ldots, M$ and $n = 1, \ldots, N_l$, is described as follows:

$$y_{nm}^{(l)}[t] = \sum_{p=0}^{L_{nm}^{(l)}-1} w_{nm}^{(l)}(p)\, x_m^{(l-1)}[t-p] + \sum_{p=1}^{I_{nm}^{(l)}} v_{nm}^{(l)}(p)\, y_{nm}^{(l)}[t-p] \qquad (1)$$

$$s_n^{(l)}[t] = \sum_{m=1}^{N_{l-1}} y_{nm}^{(l)}[t] + w_{n0}^{(l)} \qquad (2)$$

$$x_n^{(l)}[t] = \mathrm{sgm}\!\left(s_n^{(l)}[t]\right) \qquad (3)$$

It should be noticed that there is actually a dependence of the polynomial weights on time through a time index $t$, that is, we have $w_{nm}^{(l)}(p)[t]$ and $v_{nm}^{(l)}(p)[t]$, since we consider that these weights can be adapted on-line at each time step. Nevertheless, for the sake of notational simplicity, the explicit indication of time in the weights is omitted, as we also accept that the weights change slowly during the process.
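For concreteness, the following Python sketch implements the forward run of equations (1)-(3) for a single IIR-MLP layer; the array shapes, the history-buffer layout and the tanh choice for sgm(·) are illustrative assumptions, not prescriptions taken from the paper.

```python
import numpy as np

def iir_mlp_layer_forward(x_hist, y_hist, w, v, w0, t):
    """One forward step of an IIR-MLP layer at time t, following eqs. (1)-(3).

    x_hist : (T, N_prev)      outputs of the previous layer for all times up to t
    y_hist : (T, N, N_prev)   synaptic filter outputs y_nm for all times (row t is filled here)
    w      : (N, N_prev, L)   MA coefficients w_nm(p), p = 0..L-1
    v      : (N, N_prev, I)   AR coefficients v_nm(p), p = 1..I
    w0     : (N,)             bias terms with dc input 1
    """
    N, N_prev, L = w.shape
    I = v.shape[2]
    s = np.copy(w0)                                   # summing point, eq. (2)
    for n in range(N):
        for m in range(N_prev):
            # MA part: sum_p w_nm(p) * x_m[t-p]
            y = sum(w[n, m, p] * x_hist[t - p, m] for p in range(L) if t - p >= 0)
            # AR part: sum_p v_nm(p) * y_nm[t-p]
            y += sum(v[n, m, p - 1] * y_hist[t - p, n, m] for p in range(1, I + 1) if t - p >= 0)
            y_hist[t, n, m] = y                       # eq. (1)
            s[n] += y
    return np.tanh(s), s                              # eq. (3), assuming sgm = tanh
```

Stacking such layers, with each layer fed the outputs of the previous one, reproduces the forward run described above.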

III. GRADIENT CALCULATIONS

The learning algorithms to be developed require the computation of the derivatives of the network's output with respect to all trainable weights. Due to the existence of internal dynamics, the traditional procedure of the standard BP cannot be applied to achieve the above task. Therefore, we resort to the method of ordered derivatives [8], as they provide an efficient tool for calculating partial derivatives in complex recurrent networks. Note that, since the BPTT approach is adopted, the chain rule derivative expansion is developed in a backward fashion and, for that reason, we use the BP through adjoints proposed in [9].

Let us consider the IIR neuron model shown in Fig. 1. To construct its respective adjoint model the following rules are applied: 1) all signal flows are reversed, 2) nodes are redefined as summing junctions and vice-versa, 3) time $k$ in all arguments is replaced by the backward time $\varphi = t - k$, where $t$ is the current time, 4) any non-linear blocks are linearized by using their derivative. Using the above rules, the IIR neuron's adjoint model is shown in Fig. 2a, while the adjoint of an IIR filter is depicted in Fig. 2b.

If we consider that the weights in layer $l$ are of order $Q_l$, then they also have a memory horizon of $Q_l$ time steps. As a result, we have to run through the adjoint networks backwards in time for at least $D = \sum_{l=1}^{M} Q_l$ time steps, a value considered sufficient to include the influence of the weights of the first layer on the network's output, since it is the layer with the largest influence horizon.

So, for the calculation of the derivatives of the network's output $o[t]$ with respect to all weights, we back-propagate through the adjoint network, for $\varphi = 0, \ldots, D$, a unit impulse that equals 1 for $\varphi = 0$ and is zero elsewhere, that is, $\lambda x_q^{(M)}[\varphi] = \delta(\varphi)$, $q = 1, \ldots, N_M$. This way we are able to compute the ordered derivatives $\frac{\partial^+ o[t]}{\partial x_n^{(l)}[t-\varphi]} = \lambda x_n^{(l)}[\varphi]$. For the neuron in Fig. 2, for example, we have that

$$\lambda s_n^{(l)}[\varphi] = \lambda x_n^{(l)}[\varphi]\;\mathrm{sgm}'\!\left(s_n^{(l)}[t-\varphi]\right) \qquad (4)$$

$$\lambda y_{nm}^{(l)}[\varphi] = \lambda s_n^{(l)}[\varphi] + \sum_{p=1}^{I_{nm}^{(l)}} v_{nm}^{(l)}(p)\, \lambda y_{nm}^{(l)}[\varphi - p] \qquad (5)$$

and

$$\lambda x_m^{(l-1)}[\varphi] = \sum_{p=0}^{L_{nm}^{(l)}-1} w_{nm}^{(l)}(p)\, \lambda y_{nm}^{(l)}[\varphi - p] \qquad (6)$$

For an arbitrary weight $\theta_{nm}^{(l)}[t-\varphi]$, which can be either a $w_{nm}^{(l)}(p)[t-\varphi]$ or a $v_{nm}^{(l)}(p)[t-\varphi]$, applying the chain rule derivative expansion in a backward fashion [10], the derivative of the network's output with respect to $\theta_{nm}^{(l)}[t-\varphi]$ is given by

$$\frac{\partial^+ o[t]}{\partial \theta_{nm}^{(l)}[t-\varphi]} = \frac{\partial^+ o[t]}{\partial y_{nm}^{(l)}[t-\varphi]} \cdot \frac{\partial y_{nm}^{(l)}[t-\varphi]}{\partial \theta_{nm}^{(l)}[t-\varphi]} \qquad (7)$$

Fig. 2. (a) The adjoint of the IIR neuron of Fig. 1 and (b) the adjoint of the IIR block. Note that, in order to avoid notational clutter, the subscripts nm of the weights and lambdas are omitted.

or equivalently

$$\frac{\partial^+ o[t]}{\partial \theta_{nm}^{(l)}[t-\varphi]} = \lambda y_{nm}^{(l)}[\varphi] \cdot \frac{\partial y_{nm}^{(l)}[t-\varphi]}{\partial \theta_{nm}^{(l)}[t-\varphi]} \qquad (8)$$

The second term of this product is a direct derivative and its value depends on whether the respective weight is a $w$ or a $v$. Therefore, for each case, we have that

$$\frac{\partial^+ o[t]}{\partial w_{nm}^{(l)}(p)[t-\varphi]} = \lambda y_{nm}^{(l)}[\varphi]\; x_m^{(l-1)}[t-\varphi-p] \qquad (9)$$

and

$$\frac{\partial^+ o[t]}{\partial v_{nm}^{(l)}(p)[t-\varphi]} = \lambda y_{nm}^{(l)}[\varphi]\; y_{nm}^{(l)}[t-\varphi-p] \qquad (10)$$

for $\varphi = 0, \ldots, D$.

Finally, after the completion of all $D$ backward runs, the required gradients, which are subsequently used in the following training algorithms, are summed over all $\varphi$ for each weight, meaning that

$$\frac{\partial o[t]}{\partial \theta_{nm}^{(l)}} = \sum_{\varphi=0}^{D} \frac{\partial^+ o[t]}{\partial \theta_{nm}^{(l)}[t-\varphi]} \qquad (11)$$
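As an illustration, the sketch below applies the adjoint recursions (4)-(5) and the gradient accumulation (9)-(11) to a single IIR neuron (one layer, one output), seeded with the unit impulse described above; the array layout and the tanh activation are assumptions added for the example, not quoted from the paper.

```python
import numpy as np

def iir_neuron_gradients(x, w, v, w0, D):
    """Adjoint-based gradients of o[t] = tanh(s[t]) for a single IIR neuron.

    x : (N_in, T) input sequences, with t = T-1 the current time
    w : (N_in, L) MA coefficients, v : (N_in, I) AR coefficients, w0 : scalar bias
    Returns o[t] and dO/dw, dO/dv accumulated over phi = 0..D as in eq. (11).
    """
    N_in, T = x.shape
    L, I = w.shape[1], v.shape[1]
    assert T > D
    # --- forward run, eqs. (1)-(3) ---
    y = np.zeros((N_in, T))
    s = np.full(T, float(w0))
    for t in range(T):
        for m in range(N_in):
            y[m, t] = sum(w[m, p] * x[m, t - p] for p in range(L) if t - p >= 0) \
                    + sum(v[m, p - 1] * y[m, t - p] for p in range(1, I + 1) if t - p >= 0)
            s[t] += y[m, t]
    o = np.tanh(s)
    t = T - 1
    # --- backward run through the adjoint, phi = 0..D ---
    lam_x = np.zeros(D + 1); lam_x[0] = 1.0                          # unit impulse at phi = 0
    lam_s = lam_x * (1.0 - np.tanh(s[t - np.arange(D + 1)]) ** 2)    # eq. (4)
    lam_y = np.zeros((N_in, D + 1))
    dW, dV = np.zeros_like(w), np.zeros_like(v)
    for phi in range(D + 1):
        for m in range(N_in):
            lam_y[m, phi] = lam_s[phi] + sum(v[m, p - 1] * lam_y[m, phi - p]
                                             for p in range(1, I + 1) if phi - p >= 0)  # eq. (5)
            for p in range(L):                        # eq. (9), summed over phi as in eq. (11)
                if t - phi - p >= 0:
                    dW[m, p] += lam_y[m, phi] * x[m, t - phi - p]
            for p in range(1, I + 1):                 # eq. (10), summed over phi as in eq. (11)
                if t - phi - p >= 0:
                    dV[m, p - 1] += lam_y[m, phi] * y[m, t - phi - p]
    return o[t], dW, dV
```

In a multilayer network, eq. (6) would additionally propagate the lambdas to the previous layer so that the same accumulation can be repeated there.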

IV. LEARNING ALGORITHMS

A. Global approach (GRPE algorithm)

Let $\vartheta(t)$ denote a composite $(W \times 1)$ vector including all synaptic weights of the IIR-MLP, where $W$ is the total number of weights included in the network. We consider a nonlinear recurrent predictor model of the type

$$\hat{y}(t\,|\,\vartheta) = h\!\left(\vartheta, u(t), \varphi(t, \vartheta)\right) \qquad (12)$$

where $h(\cdot)$ describes the structure of the network. $\hat{y}(t\,|\,\vartheta)$, an $(N_M \times 1)$ vector, comprises the network's outputs $x_i^{(M)}$, $i = 1, \ldots, N_M$; $u(t)$ is an $(N_0 \times 1)$ vector including the network's inputs $x_i^{(0)}$, $i = 1, \ldots, N_0$; and $\varphi(t, \vartheta)$ represents the internal network dynamics. The real process to be modeled by the network, denoted as $y(t)$, is obtained by

$$y(t) = \hat{y}(t\,|\,\vartheta) + \varepsilon(t, \vartheta) \qquad (13)$$

where $\varepsilon(t, \vartheta)$ is the prediction error for a particular $\vartheta$. Attempting to improve the learning performance, we employ the recursive prediction error (RPE) identification algorithm [11], which exhibits enhanced training qualities. According to this method, the weight estimate vector $\hat{\vartheta}(t)$ is continuously determined, at each $t$, using the following recursion:

$$\varepsilon(t) = y(t) - \hat{y}\!\left(t\,|\,\hat{\vartheta}(t-1)\right) \qquad (14)$$

$$Q(t) = Q(t-1) + \gamma(t)\left[\varepsilon(t)\,\varepsilon^T(t) - Q(t-1)\right] \qquad (15)$$

$$S(t) = \psi^T(t)\,P(t-1)\,\psi(t) + \lambda(t)\,Q(t) \qquad (16)$$

$$L(t) = P(t-1)\,\psi(t)\,S^{-1}(t) \qquad (17)$$

$$\hat{\vartheta}(t) = \left[\hat{\vartheta}(t-1) + \mu(t)\,L(t)\,\varepsilon(t)\right]_{D_s} \qquad (18)$$

$$P(t) = \left[P(t-1) - L(t)\,S(t)\,L^T(t)\right] / \lambda(t) \qquad (19)$$

$$\lambda(t) = \gamma(t-1)\left[1 - \gamma(t)\right] / \gamma(t) \qquad (20)$$

$Q(t)$ is an $(N_M \times N_M)$ weighting matrix putting different weights on different observations. Notice that for the single-output case $Q(t)$ is a scalar, acting as a scaling factor; thus, it might be chosen as unity and equation (15) is dispensed with. $S(t)$ is an $(N_M \times N_M)$ matrix, while $L(t)$ is the gain matrix of size $(W \times N_M)$ controlling the weight update. $P(t)$ is the error covariance matrix of size $(W \times W)$ that defines the search changes along the Gauss-Newton direction. Assuming that no prior information is available, $P(0)$ is usually taken as $P(0) = aI$, where $a$ is an arbitrarily large number and $I$ is the identity matrix. $\psi(t)$ is a $(W \times N_M)$ matrix containing the partial derivatives of the predictor model, that is, of the network's outputs, with respect to the trainable weights. The RPE algorithm provides a recursive way of minimizing, using the stochastic Gauss-Newton search method, a quadratic criterion of the prediction errors of the form

$$V_t(\vartheta) = \sum_{k=1}^{t} \beta(t,k)\left\{\varepsilon^T(k)\,Q^{-1}(k)\,\varepsilon(k)\right\} \qquad (21)$$

where $\beta(t,k)$ are weight factors used as a means to discount old measurements, given by

$$\beta(t,k) = \prod_{j=k+1}^{t} \lambda(j) \qquad (22)$$

with $\lambda(j)$ being the forgetting factor, taking values in (0, 1). When $\lambda(k)$ takes a constant value $\lambda(k) = \lambda$, the discount factor becomes $\beta(t,k) = \lambda^{t-k}$, providing an exponential forgetting profile in the error criterion (21). $\gamma(t)$ is a sequence of positive scalars related to the sequence of $\lambda(t)$ through (20). The choice of the sequence $\lambda(t)$ (or equivalently $\gamma(t)$) depends on whether the system under investigation is time-invariant or time-varying. For time-invariant systems it is desirable to let $\lambda(t) \to 1$ as $t \to \infty$, an objective achieved in this paper by adopting an exponentially growing profile described by

$$\lambda(t) = \lambda_0\,\lambda(t-1) + (1 - \lambda_0) \qquad (23)$$

where the rate $\lambda_0$ and the initial value $\lambda(0)$ are the design parameters. Under these conditions $\gamma(t)$ tends to zero, thus asymptotically eliminating the effect of noisy measurements. On the other hand, for time-varying systems the objective is to track the time-varying system's parameters. In that case, the forgetting sequences are chosen so as to provide a trade-off between tracking ability and noise sensitivity. Accordingly, $\lambda(t)$ takes a constant value $\lambda(t) = \lambda$, close to unity, giving a memory time constant of $T_0 = 1/(1-\lambda)$, and $\gamma(t)$ converges asymptotically to a small number $\gamma_0 = 1 - \lambda$. The choice of $\lambda$ (or equivalently $\gamma_0$), and hence of the time constant $T_0$, is a problem-dependent issue, according to the expected variation of the parameters.

$[\cdot]_{D_s}$ in (18) implements a projection mechanism into the stability region. The necessary and sufficient condition for the gradients to tend to zero is that the AR parts of the synaptic filters are stable [12]. This dictates that the zeros of $[1 - A_{nm}^{(l)}(q^{-1})]$, where $A_{nm}^{(l)}(q^{-1})$ denotes the AR polynomial of the corresponding synaptic filter, should lie within the unit circle, determining the stability region of the algorithm. In this paper we follow a simple approach, where the correction term is successively reduced until the new estimates fall within the stability region.

$\mu(t)$ denotes the learning rate, taking values in the range [0, 1]. Initially, due to the aggressiveness of the RPE, $\mu(t)$ takes a relatively small value, e.g. 0.01, and after every epoch, as training proceeds, it is raised towards unity following a user-defined profile, at which point the RPE takes over fully.
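Before moving to the local version, the following is a minimal sketch of one GRPE update step (14)-(20), assuming that the prediction $\hat{y}$ and the gradient matrix $\psi(t)$ (obtained from the adjoint computations of Section III) are supplied by the caller; the projection into the stability region is reduced here to a pluggable helper, and all names are illustrative.

```python
import numpy as np

def grpe_step(theta, P, Q, gamma_prev, gamma, mu, y, y_hat, psi, project=lambda th: th):
    """One GRPE update, eqs. (14)-(20).

    theta : (W,) weight estimate, P : (W, W) covariance, Q : (N_M, N_M) weighting
    y, y_hat : (N_M,) target and prediction, psi : (W, N_M) output derivatives
    project : projection into the stability region, used in eq. (18); identity by default
    """
    eps = y - y_hat                                        # eq. (14)
    Q = Q + gamma * (np.outer(eps, eps) - Q)               # eq. (15)
    lam = gamma_prev * (1.0 - gamma) / gamma               # eq. (20)
    S = psi.T @ P @ psi + lam * Q                          # eq. (16)
    L = P @ psi @ np.linalg.inv(S)                         # eq. (17)
    theta = project(theta + mu * (L @ eps))                # eq. (18)
    P = (P - L @ S @ L.T) / lam                            # eq. (19)
    return theta, P, Q
```

For a single output, Q(t) reduces to a scalar and (15) can simply be skipped, as noted above.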

B. Local approach: the DRPE algorithm (Decoupled RPE)

In order to reduce the algorithmic complexity of the GRPE, we developed the following simplified version by partitioning the global optimization problem into a set of sub-problems that can be handled more easily. If $G$ is the number of the network's neurons, the weights are divided into $M_i$-dimensional vector groups $\vartheta_i$, $i = 1, \ldots, G$, each one comprising the MA and AR weights $w_{nm}^{(l)}(p)$ and $v_{nm}^{(l)}(p)$ of the synapses pointing to the $i$-th neuron. Note that $\sum_{i=1}^{G} M_i = W$. As a result, the local approach consists of a set of decoupled algorithms where the weight groups are independently updated at each time step. Because of the low dimensionality of the matrices involved in the local recursions, the computational and storage burden is considerably reduced.

Let us consider the covariance matrix $P(t)$. Regardless of the network's architecture, the synaptic weights in the vector $\vartheta$ can be arranged as $\vartheta = [\vartheta_1^T, \ldots, \vartheta_G^T]^T$, so that $P(t)$ can be written in a block-diagonal form, where the off-diagonal matrices are zero, as shown below:

$$P(t) = \begin{bmatrix} P_1(t) & \cdots & 0 \\ \vdots & \ddots & \vdots \\ 0 & \cdots & P_G(t) \end{bmatrix}$$

where the $P_i(t)$ are matrices of dimensions $(M_i \times M_i)$. The existence of the zeros outside the diagonal blocks means that the $G$ groups do not affect each other during training, hence the expression "decoupled". Furthermore, the $(W \times N_M)$ gradient matrix

$$\psi(t) = \left.\frac{\partial\, \hat{y}^T(t\,|\,\vartheta)}{\partial \vartheta}\right|_{\vartheta = \hat{\vartheta}(t-1)}$$

can be broken up in a similar way,

$$\psi(t) = \begin{bmatrix} \partial x_1^{(M)} / \partial \vartheta_1 & \cdots & \partial x_{N_M}^{(M)} / \partial \vartheta_1 \\ \vdots & & \vdots \\ \partial x_1^{(M)} / \partial \vartheta_G & \cdots & \partial x_{N_M}^{(M)} / \partial \vartheta_G \end{bmatrix} = \begin{bmatrix} \psi_1(t) \\ \vdots \\ \psi_G(t) \end{bmatrix}$$

where $\partial x_q^{(M)} / \partial \vartheta_i$ and $\psi_i(t)$ are of dimensions $(M_i \times 1)$ and $(M_i \times N_M)$, respectively. Having the above in mind, (16) can be written as

$$S(t) = \lambda(t)\,Q(t) + \sum_{i=1}^{G} \psi_i^T(t)\,P_i(t-1)\,\psi_i(t) \qquad (24)$$

From (17) we have

$$L(t) = \begin{bmatrix} P_1(t-1)\,\psi_1(t)\,S^{-1}(t) \\ \vdots \\ P_G(t-1)\,\psi_G(t)\,S^{-1}(t) \end{bmatrix} = \begin{bmatrix} L_1(t) \\ \vdots \\ L_G(t) \end{bmatrix} \qquad (25)$$

where we have decomposed the $(W \times N_M)$ matrix $L(t)$ into the respective $(M_i \times N_M)$-sized $L_i(t)$ matrices. After calculating $S(t)$ and the $L_i(t)$ from (24) and (25), respectively, the DRPE algorithm is completed with the following decoupled update equations for the $i$-th neuron group, $i = 1, \ldots, G$:



$$\hat{\vartheta}_i(t) = \left[\hat{\vartheta}_i(t-1) + \mu(t)\,L_i(t)\,\varepsilon(t)\right]_{D_s} \qquad (26)$$

$$P_i(t) = \left[P_i(t-1) - L_i(t)\,S(t)\,L_i^T(t)\right] / \lambda(t) \qquad (27)$$

Notice that, while the DRPE still requires the inversion of the $(N_M \times N_M)$ matrix $S(t)$ at each time step, it is a simplification of the GRPE into a set of decoupled algorithms that use the $(M_i \times M_i)$ diagonal blocks instead of the whole $(W \times W)$ global covariance matrix $P(t)$.
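A sketch of the decoupled update, under the same assumptions as the GRPE snippet above; keeping the per-group covariances $P_i$ and gradient blocks $\psi_i$ in Python lists is an implementation choice for the example, not something prescribed by the paper.

```python
import numpy as np

def drpe_step(theta_blocks, P_blocks, Q, gamma_prev, gamma, mu, y, y_hat, psi_blocks,
              project=lambda th: th):
    """One DRPE update, eqs. (24)-(27); blocks correspond to per-neuron weight groups."""
    eps = y - y_hat                                        # eq. (14)
    Q = Q + gamma * (np.outer(eps, eps) - Q)               # eq. (15)
    lam = gamma_prev * (1.0 - gamma) / gamma               # eq. (20)
    # eq. (24): S(t) assembled from the per-group blocks
    S = lam * Q + sum(psi.T @ P @ psi for psi, P in zip(psi_blocks, P_blocks))
    S_inv = np.linalg.inv(S)
    for i, (psi, P) in enumerate(zip(psi_blocks, P_blocks)):
        L_i = P @ psi @ S_inv                              # eq. (25), i-th block of L(t)
        theta_blocks[i] = project(theta_blocks[i] + mu * (L_i @ eps))   # eq. (26)
        P_blocks[i] = (P - L_i @ S @ L_i.T) / lam          # eq. (27)
    return theta_blocks, P_blocks, Q
```

The only global quantities are the small matrix S(t) and the prediction error; everything of size comparable to W stays block-diagonal.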

V. COMPUTATIONAL COSTS

The computational costs regard the order of the number of additions, multiplications and memory locations required to train a network for one time step. Let us consider a network structure with $M$ layers, $N_0$ inputs, $N_l$ neurons in each hidden layer and $N_M$ outputs. For simplicity, we also assume that the neurons belonging to the same layer have filters of the same order ($L_{nm}^{(l)} = L^{(l)}$, $I_{nm}^{(l)} = I^{(l)}$, $l = 1, \ldots, M$). Also let $M^{(l)}$ denote the number of weights composing the group of a neuron belonging to the $l$-th layer. The total number $W$ of weights included in the network is then given by $W = \sum_{l=1}^{M} N_l M^{(l)}$. In view of the above definitions, the complexity of the GRPE is $O(N_M W^2)$ and that of the DRPE is $O\!\left(N_M \sum_{l=1}^{M} N_l (M^{(l)})^2\right)$. Since $\sum_{l=1}^{M} N_l (M^{(l)})^2 \ll W^2$, it is concluded that the local algorithm has considerably smaller computational demands compared to the GRPE.
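To make the comparison concrete, the short calculation below plugs in the network used later in Section VI (3 inputs, one hidden layer of 10 neurons, 1 output, 4 MA and 4 AR coefficients per synapse, plus a bias per neuron); the per-neuron weight counts are derived here from those figures and are not quoted from the paper.

```python
# Weights per neuron group: (#incoming synapses) * (MA + AR coefficients) + bias
M_hidden = 3 * (4 + 4) + 1          # 25 weights per hidden neuron
M_output = 10 * (4 + 4) + 1         # 81 weights for the output neuron
W = 10 * M_hidden + 1 * M_output    # total: 331 weights

N_M = 1                                                      # single output
grpe_cost = N_M * W ** 2                                     # O(N_M * W^2)  ~ 109,561
drpe_cost = N_M * (10 * M_hidden ** 2 + 1 * M_output ** 2)   # ~ 12,811
print(W, grpe_cost, drpe_cost, round(grpe_cost / drpe_cost, 1))
```

For this configuration the decoupled recursion needs roughly an order of magnitude fewer operations per time step than the global one.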

VI. EXPERIMENTAL RESULTS

A. Problem formulation

In our experiment, we consider the area around the Gulf of Thessaloniki, in Northern Greece. As seen in fig. 3, it has a smooth terrain at almost sea level. Our model tries to capture the strong prevailing winds of the region, which are of N-NW direction. Hence, meteorological stations are installed at positions S1, S2 and S0 on a line along the direction of the prevailing winds (fig. 3). The distances between them are (S1-S2) = 27 km, (S2-S0) = 12 km and so (S1-S0) = 39 km. A wind change arriving at site S1 at time t = 0 is assumed to propagate towards S2 and S0. Fig. 4 shows a typical view of an oncoming event as measured at the three sites. The main goal is to effectively forecast the wind speed $v_{S_0}$ at site S0 for up to three hours into the future, using the current measurements at the three stations.

B. Simulation setup

The wind data were collected over a time period of one year. Any invalid data, due to random failures of any of the measurement stations, were subsequently discarded. Finally, the remaining data were averaged over 15 min intervals, in order to eliminate random noise, to finally obtain 3200 available input-output data pairs. These were divided into two independent sets, namely, a training and a checking data set, composed of 2240 and 960 patterns, respectively. The checking data are employed to evaluate the forecast performance of the resulting models.
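As a sketch of how such input-output pairs can be assembled for a given prediction horizon, assuming the three 15-min-averaged speed series are held in arrays named v_s0, v_s1 and v_s2 (names chosen here for illustration only):

```python
import numpy as np

def make_pairs(v_s0, v_s1, v_s2, horizon_min=15, step_min=15):
    """Build (input, target) pairs: inputs are the current speeds at S0, S1, S2;
    the target is the S0 speed `horizon_min` minutes ahead."""
    h = horizon_min // step_min                              # horizon in samples (1 for 15 min)
    X = np.column_stack([v_s0[:-h], v_s1[:-h], v_s2[:-h]])   # v[t] at the three sites
    y = v_s0[h:]                                             # v_S0[t + h]
    return X, y

# Split into 2240 training and 960 checking patterns, as in the paper:
# X, y = make_pairs(v_s0, v_s1, v_s2, horizon_min=60)
# X_train, y_train, X_check, y_check = X[:2240], y[:2240], X[2240:], y[2240:]
```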


Fig. 4. A typical oncoming front as it is measured consecutively at the three sites S1, S2 and S0. It is easy to observe that the delay of the wind is proportional to the sites' distances (S1-S2) = 27 km and (S2-S0) = 12 km.

The developed models are fed the current speed measurements of the three sites, thus having 3 inputs. Also, the networks are chosen to have one output node and one hidden layer with 10 neurons per layer. Regarding the filter orders, we set $L^{(1)} = L^{(2)} = 4$ for the MA parts and $I^{(1)} = I^{(2)} = 4$ for the AR parts of the hidden and output layers. The network weights $w_{nm}^{(l)}(p)$ of the MA parts of the filters are initially selected at random in the range [-0.5, 0.5], while the $v_{nm}^{(l)}(p)$, involved in the AR parts of the synaptic filters, were initialized so that the roots of the resulting polynomial $[1 - A_{nm}^{(l)}(q^{-1})]$ lie inside the unit circle, as required for stable operation of the RPE algorithm. For comparison purposes we use an IIR-MLP that is trained by the GRPE, the DRPE and two simple gradient methods, the RTRL and the BPTT(h) with h = 12 and a learning rate of 0.001. For the RPE algorithms, the learning rate $\mu(t)$ is gradually increased to unity, following the formula $\mu(t) = \xi\,\mu(t-1) + (1-\xi)$, with $\xi = 0.85$. Moreover, to account for the non-stationarity of the process,

2715

a constant forgetting factor of $\lambda = 0.989$ is chosen, resulting in a memory time constant of about 1 day (90 steps). For the initialization of the covariance matrices $P(t)$ and $P_i(t)$ we select $a = 500$. Furthermore, during learning, stability monitoring of the RPE algorithm is continuously performed, as implemented in (18) and (27), by successively multiplying the correction term by a factor of 0.2 until the revised weight estimates fulfill the stability conditions for each neuron. For each algorithm we use a separate network for each future time horizon of $\Delta t$ = 15, 30, 60, 90, 120, 150 and 180 minutes, which we train for 200 epochs. In this approach the model is of the form $\hat{v}_{S_0}[t+h] = f(v_{S_0}[t], v_{S_1}[t], v_{S_2}[t])$, with $t$ the current time and $h$ a constant depending on the selected $\Delta t$. The algorithms are evaluated in terms of their capability to effectively train the IIR-MLP so that it produces efficient forecasts. Their performance is measured in terms of the mean square error (MSE).
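A quick numerical check of these settings, using the profile formulas quoted above (whether the learning-rate recursion is applied per update or per epoch is a design detail left open by the text, so the loop below is purely illustrative):

```python
lam = 0.989
T0 = 1 / (1 - lam)              # memory time constant: ~90.9 steps, i.e. about 1 day of 15-min samples

mu, xi = 0.01, 0.85             # learning-rate ramp mu = xi*mu + (1 - xi)
for _ in range(30):
    mu = xi * mu + (1 - xi)     # rises from 0.01 towards 1, at which point the RPE takes fully over
print(round(T0, 1), round(mu, 3))
```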


Fig. 5. A typical plot of the real wind on S0 and of the output of the IIR-MLP model trained with the GRPE for 1 step (15 min) ahead on the checking data set.

C. Experimental Results

A set of 10 experimental runs was carried out for each setup with different weight initializations. The obtained results were averaged and are cited in Table I, while their respective variances are shown in Table II. It is obvious that the GRPE algorithm shows the best improvement, followed by the DRPE. However, note that the DRPE requires much less computational effort than the GRPE. Furthermore, both proposed algorithms exhibit much better performance than the BPTT(12) and the RTRL. A typical curve of the actual wind on S0 and of the IIR-MLP output for 15 min ahead prediction is shown in fig. 5.

VII. CONCLUSIONS

Two on-line learning algorithms are suggested in this paper for local recurrent neural networks. Due to the internal feedback, the network's gradients are derived through the construction of the adjoint networks. The convergence behavior of the algorithms is very good, although the computational burden is considerably increased, as observed by training an IIR-MLP network on a real-world application, the wind prediction problem.

TABLE I
MSE OF THE FOUR ALGORITHMS TRAINING THE IIR-MLP

Minutes   GRPE     DRPE     BPTT(12)   RTRL
15        0.4160   0.4220   0.4857     0.4814
30        0.8526   0.8662   0.9730     0.9772
60        1.4783   1.5178   1.7005     1.6962
90        2.0096   2.0625   2.3187     2.3292
120       2.5752   2.6884   3.1664     3.1178
150       3.1426   3.2126   3.7022     3.7259
180       3.5675   3.7177   4.0756     4.1892

TABLE II
VARIANCES OF THE MSE OF THE 10 EXPERIMENTAL RUNS (ALL VALUES x 10^-4)

Minutes   GRPE    DRPE    BPTT(12)   RTRL
15         8.87    6.13    9.18      15.23
30         3.53   17.25   13.58       9.81
60         2.03    6.60    9.74       3.17
90         8.33    1.06    7.15       6.93
120       10.72   16.80    8.64       9.74
150       21.41    3.96   12.40      13.23
180       14.07    3.70    6.33       9.88

REFERENCES

[1] A. Waibel, T. Hanazawa, G. Hinton, K. Shikano, and K. J. Lang, "Phoneme recognition using time-delay neural networks," IEEE Trans. Acoust., Speech, Signal Processing, vol. 37, pp. 328-339, 1989.
[2] E. A. Wan, "Temporal backpropagation for FIR neural networks," in Proc. Int. Joint Conf. Neural Networks, vol. 1, pp. 575-580, 1990.
[3] R. J. Williams and J. Peng, "An efficient gradient-based algorithm for on-line training of recurrent network trajectories," Neural Comput., vol. 2, pp. 490-501, 1990.
[4] G. V. Puskorius and L. A. Feldkamp, "Neurocontrol of nonlinear dynamical systems with Kalman filter-trained recurrent networks," IEEE Trans. Neural Networks, vol. 5, pp. 279-297, 1994.
[5] A. F. Atiya and A. G. Parlos, "New results on recurrent network training: Unifying the algorithms and accelerating convergence," IEEE Trans. Neural Networks, vol. 11, no. 3, pp. 697-709, 2000.
[6] R. J. Williams and D. Zipser, "A learning algorithm for continually running fully recurrent neural networks," Neural Comput., vol. 1, pp. 270-280, 1989.
[7] A. C. Tsoi and A. D. Back, "Locally recurrent globally feedforward networks: A critical review of architectures," IEEE Trans. Neural Networks, vol. 5, pp. 229-239, 1994.
[8] P. J. Werbos, "Beyond regression: New tools for prediction and analysis in the behavioral sciences," Ph.D. thesis, Harvard Univ., Cambridge, MA, 1974.
[9] B. Srinivasan, U. R. Prasad and N. J. Rao, "Back propagation through adjoints for the identification of nonlinear dynamic systems using recurrent neural models," IEEE Trans. Neural Networks, vol. 5, no. 2, pp. 213-228, March 1994.
[10] S. W. Piche, "Steepest descent algorithms for neural network controllers and filters," IEEE Trans. Neural Networks, vol. 5, no. 2, pp. 198-212, March 1994.
[11] L. Ljung and T. Söderström, Theory and Practice of Recursive Identification. Cambridge, MA: MIT Press, 1983.
[12] P. Campolucci, A. Uncini, F. Piazza and B. D. Rao, "On-line learning algorithms for locally recurrent neural networks," IEEE Trans. Neural Networks, vol. 10, pp. 253-271, 1999.
