Transcript
2005 International Conference on Neural Networks and Brain, Beijing, China, 13-15 Oct. 2005

New Training Methods for RBF Neural Networks

Mehdi Fatemi
Dep. of Electrical Engineering, Shiraz University
Shiraz, Iran
E-mail: [email protected]

Mehdi Roopaei
Dep. of Electrical Engineering, Shiraz University
Shiraz, Iran
E-mail: [email protected]

Faridoon Shabaninia, Senior Member, IEEE
Dep. of Electrical Engineering, Shiraz University
Shiraz, Iran
E-mail: [email protected]

Abstract—Radial basis function neural networks have been proposed as powerful tools in many applications. Existing training algorithms suffer from restrictions such as slow convergence and/or bias in parameter convergence. This paper attempts to address these problems by proposing new parameter initializing, pre-training, and post-training methods that achieve better learning time and precision than previous RBF networks.

I. INTRODUCTION

Radial basis function (RBF) neural networks have wide-ranging applications in system modeling, and attaining fast training methods for such networks is of great importance. Nowadays, RBF networks are used in diverse fields such as pattern recognition, state estimation, and function approximation. In this paper, without loss of generality, we apply our newly proposed methods to function approximation.

In many applications it is important to fit a function to some given data. Numerical methods such as interpolation and curve fitting were introduced to do so. Recently, many types of neural network, such as the multilayer perceptron (MLP) and the radial basis function (RBF) network, have come to be considered powerful tools for function approximation [4][5][6]. RBF networks seem to be good candidates for this purpose because of their faster learning compared with other networks, but there are some major problems in the use of such networks:

first, slow convergence of the parameters; second, a permanent bias in the total error; and third, the case where the network encounters large errors called outliers [1].

In this paper, a new method of network establishment and parameter initialization, a Pre-training Method to speed up the training process (by about 100 times), and a Post-training Method to eliminate bias (permanent error) are introduced.

In section 2, a review of RBF and SRBF networks is given. Section 3 presents the outlier problem. Sections 4 through 7 discuss the new methods introduced for network parameter initializing, pre-training, Gibbs errors, and post-training, respectively. Section 8 contains the experimental results, and finally the conclusion is given in section 9.


II. RADIAL BASIS FUNCTION NETWORKS

The radial basis function (RBF) network consists of three layers. The input layer is made up of source nodes that connect the network to its environment. The second layer, the only hidden layer of the network, applies a non-linear transformation from the input space to a hidden space. The output layer is linear, supplying the response of the network to the activation pattern applied to the input layer.

Fig. 1. Typical RBF neural network architecture: input layer (x_1, ..., x_n), hidden layer, and output layer producing F(x).

The RBF network technique consists of choosing a function that has the following form:

F(x) = \sum_{i=1}^{N} w_i \, \varphi_i(\|x - \mu_i\|, \text{parameter set})    (1)

where \{\varphi_i(\|x - \mu_i\|),\ i = 1, 2, \ldots, N\} is a set of N arbitrary functions known as radial basis functions, and \|\cdot\| denotes a norm that is usually Euclidean [2]. If the radial basis function of the network is a Gaussian function, then

\varphi_i(\|x - \mu_i\|, \sigma_i) = \exp\{-[\|x - \mu_i\| / \sigma_i]^2\},    (2)

and if the radial basis function of the network is a sigmoidal function, then

\varphi_i(\|x - \mu_i\|, \theta_i, \beta_i) = \frac{1}{1 + \exp[-\beta_i(\|x - \mu_i\| + \theta_i)]} - \frac{1}{1 + \exp[-\beta_i(\|x - \mu_i\| - \theta_i)]}.    (3)
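The following is a minimal sketch (in Python/NumPy, not from the paper) of how a network of the form (1) with Gaussian (2) or sigmoidal (3) basis functions can be evaluated for scalar inputs; the function and variable names are illustrative assumptions.

```python
import numpy as np

def gaussian_rbf(dist, sigma):
    # Eq. (2): phi_i = exp(-(||x - mu_i|| / sigma_i)^2)
    return np.exp(-(dist / sigma) ** 2)

def sigmoidal_rbf(dist, theta, beta):
    # Eq. (3): difference of two shifted sigmoids, giving a localized "plateau"
    return (1.0 / (1.0 + np.exp(-beta * (dist + theta)))
            - 1.0 / (1.0 + np.exp(-beta * (dist - theta))))

def rbf_forward(x, centers, weights, sigma):
    # Eq. (1): F(x) = sum_i w_i * phi_i(||x - mu_i||), here for scalar inputs
    dists = np.abs(x[:, None] - centers[None, :])   # (P, N) matrix of ||x_p - mu_i||
    return gaussian_rbf(dists, sigma) @ weights

# Toy usage with N = 5 Gaussian nodes and random output weights
x = np.linspace(-3.0, 3.0, 50)
centers = np.linspace(-3.0, 3.0, 5)
weights = np.random.randn(5)
F = rbf_forward(x, centers, weights, sigma=1.0)
```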



III. OUTLIER PROBLEM

To assess the performance of the network, a criterion called the objective function is introduced. In each training process the goal is to reach the parameters that minimize this function by means of an update rule, which is often gradient descent.

In most networks the second norm of the error, called the least squares (LS) criterion, is used as the objective function:

J = \frac{1}{N} \sum_{n=1}^{N} \varepsilon_n^2,    (4)

where \varepsilon_n denotes the residual of the nth pattern. Neural networks often encounter bad data, perhaps caused by various reasons such as partial system failure. Several methods are used to avoid the effects of such data, which are called "outliers".

In one method, the LS criterion is replaced by a new objective function called the robust objective function. In fact, the robust objective function acts like a low-pass filter and tends to reject the outliers. This function can be written as follows:

\psi(r) = r\, e^{-r^2 / (2\sigma^2)}, \qquad E_R(r_p) = \int_0^{r_p} \psi(r)\, dr = \sigma^2 \left( 1 - e^{-r_p^2 / (2\sigma^2)} \right)

Fig. 2. (a) \psi(r), and (b) the corresponding robust objective function E_R(r).

where r_p is the residual error and \sigma adjusts the confidence interval of the objective function; it can be changed during the training process [1]. The other method is based on a new training algorithm which uses an adaptive growing technique [1]. This method can only be used in a node-increasing algorithm, which starts with a small number of nodes and grows the network gradually until it reaches the maximum number of nodes determined by the supervisor, and then starts to converge to the desired precision.
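Below is a small illustrative sketch (not from the paper) comparing the LS objective (4) with the robust objective above; averaging E_R over the patterns is an assumption made here to mirror (4).

```python
import numpy as np

def ls_objective(residuals):
    # Eq. (4): J = (1/N) * sum_n eps_n^2
    return np.mean(residuals ** 2)

def robust_objective(residuals, sigma):
    # E_R(r_p) = sigma^2 * (1 - exp(-r_p^2 / (2 sigma^2))), averaged over patterns.
    # Large residuals (outliers) contribute at most sigma^2 instead of growing quadratically.
    return np.mean(sigma ** 2 * (1.0 - np.exp(-residuals ** 2 / (2.0 * sigma ** 2))))

residuals = np.array([0.1, -0.2, 0.05, 8.0])   # the last value mimics an outlier
print(ls_objective(residuals))                 # dominated by the outlier (about 16)
print(robust_objective(residuals, sigma=1.0))  # the outlier is effectively clipped
```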

IV. NETWORK PARAMETER INITIALIZING

To initialize \mu, the random method is usually used when we start with a large number of nodes and then decrease them to reach the optimal network size. In this case, with random centers some areas of the input space may remain outside the range of any node, so a large number of training iterations may be needed to reach the desired precision. When, instead, we spread the nodes over the input space with the same spacing, all patterns have the same probability of being near a node; therefore, the desired precision will most likely be reached much faster.

On the other hand, it seems that using the adaptive growing technique may optimize the network size: we may use a small number of nodes and increase them to reach the desired precision. In practice, however, we must choose an exact number of nodes for such an algorithm, because adding a new node to the network makes the total error increase temporarily. After some iterations the total error starts to decrease, but before the desired precision (which would end the algorithm) is reached, another node is added to the network according to the learning algorithm, and again the total error increases; the network thus falls into a cycle that never ends and the total error just fluctuates. The only way out is to restrict the number of nodes. In other words, the adaptive growing technique must finally reach a number of nodes that has been determined beforehand by the supervisor, and the network cannot converge before reaching this determined number.

Now, consider using this number of nodes in the initializing step and using the sorting method to initialize the centers as follows:

\mu = \left\{ \mu_i = \min(X) + (i - 1)\,\frac{\max(X) - \min(X)}{N - 1},\ i = 1, \ldots, N \right\}    (5)

in which N is the number of nodes, X = \{x_p,\ p = 1, \ldots, P\} is the set of input patterns, and P is the total number of input patterns. Because, in this case, the probability of a node existing near any input pattern is higher than before (random selection), and a large number of iterations is not needed to grow the network to its final size, the training speed increases considerably (more than 10 times, as shown in the results).

The other initial parameters should all be given the same value. In the appendix it is proven why the same value should be used for initializing the non-center parameters and what that value should be.
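As a hedged illustration of the sorting method (5) and the constant non-center initialization, a minimal sketch might look as follows; the helper names and the 1-D setting are assumptions.

```python
import numpy as np

def sorted_centers(X, n_nodes):
    # Eq. (5): spread the N centers at equal distances over [min(X), max(X)]
    return np.linspace(np.min(X), np.max(X), n_nodes)

def init_network(X, n_nodes, non_center_value=1.0):
    # Centers via the sorting method; every non-center parameter gets the same
    # (non-saturating) value, as argued in the appendix.
    centers = sorted_centers(X, n_nodes)
    widths = np.full(n_nodes, non_center_value)   # sigma for GRBF (or beta/theta for SRBF)
    weights = np.zeros(n_nodes)
    return centers, widths, weights

# Hypothetical 1-D input patterns
X = np.random.uniform(-5.0, 5.0, size=200)
centers, widths, weights = init_network(X, n_nodes=10)
```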

V. PRE-TRAINING

As shown above, we should use the same value for initializing the non-center network parameters rather than random selection, and the best value to initialize them with is the mean of the goal values (see appendix); the main problem is that we cannot guess what the goal parameters are.

At this point a new method, called the Pre-training Algorithm, is suggested



to find the initializing parameters and to speed up the training process, as follows:

1. Determine the maximum total number of RBF nodes and the maximum total number of iterations.

2. Use a fraction of the total number of nodes (e.g. 1/4 of the total number), initialize the centers by the sorting method, and initialize the rest of the parameters with a typical non-random value such as 1. (The nodes must not be saturated, as in section 4.)

3. Train this small network incompletely (with a small number of iterations).

4. Build the main network, initialize the centers of all nodes by the sorting method, and initialize the rest of the parameters with the mean of the parameters obtained from step 3 (pre-training).

5. Start the final training of the main network until the desired precision is reached.

Moreover, this idea can be used not only for RBF networks but also for any type of neural network to make the training process faster.
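A possible reading of steps 1 through 5 is sketched below; the generic `train(...)` routine, its signature, and the choice of one quarter of the nodes for the small network are assumptions for illustration only.

```python
import numpy as np

def pretrain_then_train(X, T, n_nodes, n_iters, train, pre_iters):
    """Steps 1-5 of the pre-training algorithm (illustrative sketch).

    `train(centers, params, weights, X, T, iters)` is assumed to run the usual
    gradient-descent training and return the updated (centers, params, weights)."""
    # Step 2: small network (here 1/4 of the nodes), sorted centers,
    # non-center parameters set to a non-saturating constant such as 1
    n_small = max(n_nodes // 4, 1)
    c_small = np.linspace(X.min(), X.max(), n_small)
    p_small = np.full(n_small, 1.0)
    w_small = np.zeros(n_small)

    # Step 3: incomplete training of the small network
    c_small, p_small, w_small = train(c_small, p_small, w_small, X, T, pre_iters)

    # Step 4: main network, sorted centers, non-center parameters initialized
    # with the mean of the pre-trained values (the appendix's recommendation)
    centers = np.linspace(X.min(), X.max(), n_nodes)
    params = np.full(n_nodes, p_small.mean())
    weights = np.zeros(n_nodes)

    # Step 5: final training for the remaining iterations
    return train(centers, params, weights, X, T, n_iters - pre_iters)
```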

VI. GIBBS ERROR

In function approximation we often encounter a non-negligible error that we call the Gibbs Error. We use this name because it is very similar to the Gibbs phenomenon in Fourier series [3]. Consider the approximation of a function that has jumps in its derivative, e.g. the sawtooth wave (Fig. 3-1).

Fig. 3-2 a. Gibbs Error in Neural Network (without post-training)


Fig. 3-2 b. Gibbs Error in Neural Network (with post-training)

VII. POST-TRAINING

As mentioned in section 6, when we face the Gibbs error, increasing the nodes either in number or in complexity does not fully decrease the total error.

Fig. 3-1. Gibbs phenomenon in Fourier series.

In the given approximation, even when a large number of terms (or nodes, in neural networks) is used, there are unfavorable ripples at the end points (derivative-jump points) which do not disappear at all. The Gibbs error phenomenon remains even in neural networks with highly complex radial basis functions such as sigmoidal networks (SRBF). So increasing the size of the network or the number of network parameters does not really solve the problem. In this paper, a new method called post-training is suggested to get rid of Gibbs errors by means of some saturated nodes added after the main training (Fig. 3-2 a, b).

Saturated RBF Node: Consider a sigmoidal radial basis node with a very large \beta and a very small \theta (e.g. let \beta = 500 and \theta = 0.01). Correspondingly, for Gaussian nodes consider a very small \sigma (e.g. let \sigma = 0.001).


Fig. 4-1. GRBF saturated and non-saturated nodes.


Fig. 4-2. SRBF saturated and non-saturated nodes (curves for \beta = 500, \theta = 0.1; \beta = 3, \theta = 3; \beta = 5, \theta = 1).

As can be seen in Fig. 4, such a node has some special characteristics. First, it is completely localized in the region of the input space determined by its center, so it has no considerable effect on other areas. Second, it has finite amplitude, i.e. it is not an impulse, and the output of the node can be controlled by the output weight.

These saturated nodes seem to be good candidates for solving Gibbs errors. By placing some saturated nodes at and around the locations of Gibbs errors, or of any other unsolvable large errors, we can reach the goal.

Choose the following rules:

\mu_{\text{new node}} = x_{\text{Gibbs error}}, \qquad \beta_{\text{new node}} \gg 1,\ \theta_{\text{new node}} \ll 1 \ (\text{or } \sigma_{\text{new node}} \ll 1),    (6)

w_{\text{new node}} = \alpha \left[ t_p(x_{\text{Gibbs error}}) - F(x_{\text{Gibbs error}}) \right] / R_{\text{new node}}

where w and R denote the weight and output of the new node, and F and t_p denote the output of the network and the target corresponding to the input pattern x_Gibbs error, respectively. \alpha is an output gain included to make the process more flexible, and in practice \alpha \approx 1 (it can simply be 1). The supervisor can apply the above rule to any large-error points and add nodes until the best result is reached; however, as addressed in section 3, large errors may also be caused by outlier data, so post-training should be used in offline mode and with a supervisor, to avoid taking outliers into account.
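An illustrative sketch of this post-training rule for Gaussian nodes is given below; the `predict(...)` helper, the error threshold, and the single-pass loop are assumptions (since the added nodes are strongly localized, one pass over the large-error points is taken to be sufficient here).

```python
import numpy as np

def post_train(centers, sigmas, weights, X, T, predict, err_threshold, alpha=1.0):
    """Add saturated Gaussian nodes at the remaining large-error points, rule (6).

    `predict(centers, sigmas, weights, X)` is assumed to return the network
    output F(X). This is an offline step: outliers should be screened out of
    (X, T) by the supervisor beforehand."""
    residuals = T - predict(centers, sigmas, weights, X)
    for x_e, r_e in zip(X, residuals):
        if abs(r_e) > err_threshold:            # a Gibbs-like error that training did not remove
            centers = np.append(centers, x_e)   # mu_new = x at the error point
            sigmas = np.append(sigmas, 1e-3)    # sigma_new << 1: saturated (very localized) node
            peak = 1.0                          # a Gaussian node's output at its own center
            weights = np.append(weights, alpha * r_e / peak)  # cancel the residual at x_e
    return centers, sigmas, weights
```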

VIII. EXPERIMENTAL RESULTS

Clearly, when comparing two independent networks, the concepts of time, error, and network size are related, so the better the result of one, the better the results of the others will be.

In these experimental tests we compare two networks, run on an Intel Pentium IV, 2.66 GHz, 512 MB RAM computer.

Test 1: Network Parameter Initializing

1-1: Constant Parameter:

First network: random \mu + random \beta + random \theta.

Second network: \mu's the same as in the first network + equal \beta's + equal \theta's.

Fig. 5. Approximation of the function f(x) = 0.5 x sin(x) with (a) all parameters random (no. of iterations = 150, nodes = 10, ER = 0.99), and (b) constant parameters (no. of iterations = 150, centers the same as in (a), all \beta's = 4, \theta's = 2, nodes = 10, ER = 0.10).

1-2: Sorting Method:

First network: fixed number of nodes + random selection of centers.

Second network: fixed number of nodes + equally spaced sorted centers.




Fig. 6. Approximation of the function f(x) = 0.5 x sin(x) (a) without the sorting method (centers selected randomly, all \beta's = 4, \theta's = 2, no. of iterations = 150, nodes = 10, ER = 2.09), and (b) with the sorting method (no. of iterations = 150, centers sorted, all \beta's and \theta's the same as in (a), nodes = 10, ER = 0.10).

Test 2: Capability of the new suggested network

First network: node-increasing network with a memory queue [1] + robust objective function.

Second network: fixed number of nodes + equally sorted centers + robust objective function.


Fig. 7. Approximation of the function f(x) = 0.5 x sin(x) by (a) the node-increasing network (no. of iterations = 250, final nodes = 10, ER = 72.01), and (b) the new suggested network (no. of iterations = 250, nodes = 10, ER = 0.09).

Test 3: Pre-Training

First: without pre-training; total number of iterations = 500.

Second: with pre-training; the same initial parameters as in the first network are used, 150 iterations are used for pre-training and the rest (350) are used for the main training.

Fig. 8. Approximation of the function f(x) = 0.5 x sin^2(x) + cos^2(x) by (a) the network without the pre-training method (all parameters initialized to one, no. of iterations = 500, nodes = 10, ER = 0.62, learning time = 14.00 sec), and (b) the network with the pre-training method (in the pre-training step all parameters are initialized to one; for the main training step all \beta's are initialized to the mean from the pre-training step, which is 1.78, and all \theta's to the mean from the pre-training step, which is 1.26; total no. of iterations = 150 + 350 = 500, pre-training nodes = 3, total nodes = 10, ER = 0.0612, learning time for pre-training = 2.00 sec and for main training = 9.26 sec, so total learning time = 11.26 sec).

Test 4: Post-Training

First: the new suggested network with 20 nodes, but without post-training.

Second: the same as the first network but with 10 nodes, with the remaining 10 nodes used for post-training.


Fig. 9. Approximation of the function f(x) = {-2 for -4 < x < -2, 0 for -2 < x < 0, 2 for 0 < x < 2, 4 for 2 < x < 4} by (a) the network without post-training (no. of iterations = 500, nodes = 60, ER = 3.44), and (b) the network with post-training (no. of iterations = 500, nodes before post-training = 20, ER = 3.67, nodes used for post-training = 40, ER = 0.02).

IX. CONCLUSIONS

In this paper, a new radial basis network is introduced


which has much better accuracy and a faster learning time. Traditionally, random selection of parameters is considered suitable for initializing the network. Here it is shown that using the sorting method for the centers and a constant value for all other parameters leads to far better results.

In addition, two new methods are introduced: first, pre-training, used in online mode to reach a faster learning speed, and second, post-training, used in offline mode to reduce the total network error bias, which is usually caused by Gibbs errors.

APPENDIX

Proof of the new parameter-initializing method: Assume that \theta is a non-center parameter of a radial basis function network (e.g. for Gaussian nodes the non-center parameter set is \{\sigma\}, and for sigmoidal nodes it is \{\beta, \theta\}).

Suppose that \theta_0^j is the initial value of the parameter for the jth node and \theta_f^j is its final value (the value after complete training); if the network has n nodes, then j = 1, ..., n.

Consider the jth node and suppose that the variation range of \theta_f^j over different training pattern sets is

\theta_f^j \in [a_j, b_j].

We can then define the total span as the union of all the above spans:

[a, b] = \bigcup_{j=1}^{n} [a_j, b_j].

We define m as the mean of all \theta_f and the radius \rho of the above domain as

m = \operatorname{mean}(\theta_f), \qquad \rho = \max_j \left| \theta_f^j - m \right|,

so a typical node's \theta_f lies in the span [m - \rho, m + \rho]. The following criterion is defined for a typical node's initial parameter value \theta_k and its final value \theta_f:

J_k = \int_{m-\rho}^{m+\rho} (\theta_k - \theta_f)^2 \, d\theta_f
    = \left[ \theta_k^2 \theta_f - \theta_k \theta_f^2 + \tfrac{1}{3}\theta_f^3 \right]_{m-\rho}^{m+\rho}
    = 2\theta_k^2 \rho - 4\theta_k m \rho + 2m^2 \rho + \tfrac{2}{3}\rho^3.

This criterion shows how far the initial value is from the probable final values; clearly, the smaller this criterion, the faster the training. To minimize it with respect to \theta_k, its derivative with respect to \theta_k must be zero; therefore,

\frac{\partial}{\partial \theta_k}\left( 2\theta_k^2 \rho - 4\theta_k \rho m + 2m^2 \rho + \tfrac{2}{3}\rho^3 \right) = 4\rho\,\theta_k - 4\rho\, m = 0 \;\Rightarrow\; \theta_k = m.
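The integral and its minimizer can be checked symbolically, for instance with SymPy (a verification sketch, not part of the original proof):

```python
import sympy as sp

theta_k, theta_f, m = sp.symbols('theta_k theta_f m', real=True)
rho = sp.symbols('rho', positive=True)

# J_k = integral over [m - rho, m + rho] of (theta_k - theta_f)^2 d(theta_f)
J_k = sp.integrate((theta_k - theta_f) ** 2, (theta_f, m - rho, m + rho))
print(sp.expand(J_k))   # 2*rho*theta_k**2 - 4*m*rho*theta_k + 2*m**2*rho + 2*rho**3/3

# Setting dJ_k/d(theta_k) = 0 recovers theta_k = m
print(sp.solve(sp.diff(J_k, theta_k), theta_k))   # [m]
```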

This means that all the parameters should be initialized with the same value, namely the mean of all the final values.

The parameter \mu requires special attention. Because it has different characteristics from the other parameters, it should be initialized in a different way, and the above proof cannot be applied to \mu directly. If the centers of several nodes lie in a small interval, the network approximates the input space very accurately in that interval but behaves badly in other areas; therefore, we place the centers in non-overlapping intervals. Since at first no node is preferred over the others, these intervals should be of equal size, and all of the previous proof remains valid if the integral is computed over a single interval, say [a_k, b_k]. So

\mu_f^k = \operatorname{mean}([a_k, b_k]),

and the result is the mean of each interval for each node. That is, the centers should be placed at equal distances, which is what we call the sorting method.

REFERENCES

[1] C.-C. Lee, P.-C. Chung, J.-R. Tsai, and C.-I Chang, "Robust radial basis function neural networks," IEEE Trans. on Systems, Man, and Cybernetics, vol. 29, no. 6, December 1999.

[2] S. Haykin, Neural Networks: A Comprehensive Foundation, Prentice-Hall, 1994.

[3] C. Ray Wylie and Louis C. Barrett, Advanced Engineering Mathematics, Fifth Edition, International Student Edition, McGraw-Hill, 1982.

[4] T. Poggio and F. Girosi, "Networks for approximation and learning," Proc. IEEE, vol. 78, pp. 1481-1496, 1990.

[5] A. Saha, C. L. Wu, and D. S. Tang, "Approximation, dimension reduction and nonconvex optimization using linear superposition of Gaussians," IEEE Trans. Comput., vol. 42, pp. 1222-1233, 1993.

[6] S. Geva and J. Sitte, "A constructive method for multivariate function approximation by multilayer perceptrons," IEEE Trans. Neural Networks, vol. 3, pp. 621-624, 1991.


