Assignment # 3
Nonlinear Function Approximation
Author:
Ramin Mehran
Instructor:
Dr. Babak Araabi
K.N.Toosi U of Tech.
November 2004
Contents
Introduction
1-a- MLP Network: First Example
1-b- MLP Network: Second Example
1-c- MLP Network: Presence of Noise
1-d- MLP Network: Comparison with ANFIS
2-a- RBF Network: First Example
2-b- RBF Network: Second Example
2-c- RBF Network: Presence of Noise
2-d- RBF Network: ANFIS Comparison
3-a- LLNF: First Example
3-b- LLNF: Second Example
3-c- LLNF: Presence of Noise
3-d- LLNF: Comparison with ANFIS
4- Conclusion and Summary
References
Introduction
Neural networks and neuro-fuzzy systems are common tools for nonlinear regression
problems. The main feature of these models is their flexibility to absorb the complexity
of the function into their intrinsic nonlinear structure. We will investigate this
capability for some familiar types of these models by applying them to two well-known
regression problems [1]. Besides, we will try to obtain the optimum structure for each
model by using cross-validation techniques. Therefore, we will discuss model selection
and model assessment as the main points of this report. By model selection,
we mean "estimating the performance of different models in order to choose the
(approximate) best one." Meanwhile, model assessment means "having chosen a
final model, estimating its prediction error (generalization error) on new data" [2].
The neural networks and neuro-fuzzy systems to be investigated in this report are:
• MLP with one and two hidden layers
• RBF
• NRBF
• Locally Linear Neuro-Fuzzy Networks (LLNF)
We will consider MLP networks with tanh(.) activation functions, and in the RBF and
LLNF networks we will use Gaussian functions. The nonlinear regression problems
are the same as Examples 1 and 2 in the simulation results section of [1]. The
experimental setup is kept as close to that of [1] as possible; therefore, we
will have the chance to compare our results with the equivalent results reported there.
The organization of this report is as follows. The report consists of four main sections,
the first three of which have four subsections each. In section 1-a, we present
an MLP network with one hidden layer for estimation, along with introducing the first
example of a nonlinear function. The training algorithm in this section is back-propagation,
and we find the efficient number of neurons for this structure, which
preserves the generalization. This is followed by a similar section (section 1-b)
for the next nonlinear function. However, for the second function, the MLP network
will have two hidden layers instead of one. In section 1-c, we investigate the
effects of noise on generalization and model selection for the first example and
compare the results with the noise-free situation. Section 1-d is a summary of the
results obtained with the MLP network for these two examples, which entails a comparison
with the results in [1].
In addition, we will discuss the abilities of RBF networks for function approximation,
and this leads us to a series of experiments on the same examples with radial basis
networks. Section 2-a of this report presents results from an (approximately)
optimum RBF network with an efficient number of hidden neurons for approximating the
first example. Section 2-b uses an NRBF network for the second example in a
similar fashion. In section 2-c, the same experiments for monitoring the effects of
adding noise to the learning are repeated on RBF networks. This is followed by a
concluding discussion of RBF networks for approximation and a comparison with the
results in [1] (section 2-d).
Finally, the LOLIMOT algorithm for training LLNF networks is used for function
approximation of the two examples while preserving the model selection requirements
(sections 3-a and 3-b). Similar to the previous sections, the effect of noise on the
efficiency and generality of the results is investigated and a comparison with [1] is
included (sections 3-c and 3-d). In the end, we summarize the results of all the sections
in a unified framework and discuss the abilities of each model.
1-a- MLP Network: First Example
In this section, we will train an MLP neural network with one hidden layer and tanh
activation functions for approximating a two-input sinc function,
z = \mathrm{sinc}(x, y) = \frac{\sin(x)}{x} \cdot \frac{\sin(y)}{y}, \qquad (x, y) \in [-10, 10] \times [-10, 10].   (1)
We have generated 121 pairs of {(x,y), z} sample data points for training and 71 pairs
for the test/validation phase. In both cases, the sample points are generated
randomly with a uniform distribution over the domain defined in (1) (see Figure 1). It
should be noted that these sample points are stored, and the same points are used throughout
the rest of the report to keep the comparisons consistent. In addition, for training
the MLP it is recommended to scale the input/output values to the [-1, 1] range.
Thus, before any training phase the data are scaled down to that range.
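A minimal sketch of this data generation and scaling step is given below (assumptions: Python/NumPy is used only for illustration; the random seed and the helper names sinc2d and scale_to_unit are not from the original implementation):

    import numpy as np

    def sinc2d(x, y):
        # two-input sinc of Eq. (1); np.sinc(t) = sin(pi*t)/(pi*t), so divide the argument by pi
        return np.sinc(x / np.pi) * np.sinc(y / np.pi)

    rng = np.random.default_rng(0)                      # fixed seed so the stored samples stay reproducible
    X_train = rng.uniform(-10.0, 10.0, size=(121, 2))   # 121 training pairs, uniform over [-10, 10]^2
    X_test  = rng.uniform(-10.0, 10.0, size=(71, 2))    # 71 test/validation pairs
    z_train = sinc2d(X_train[:, 0], X_train[:, 1])
    z_test  = sinc2d(X_test[:, 0],  X_test[:, 1])

    def scale_to_unit(a, lo, hi):
        # linear scaling of [lo, hi] onto [-1, 1], applied before MLP training
        return 2.0 * (a - lo) / (hi - lo) - 1.0

    X_train_s = scale_to_unit(X_train, -10.0, 10.0)
    X_test_s  = scale_to_unit(X_test,  -10.0, 10.0)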
Figure 1. Distribution of data points over the function domain of the first example. (Blue) Training
points, (Red) testing points
For the MLP neural network in Figure 2, we choose the back-propagation algorithm,
which updates the parameters (weights) of the NN directly after each data point is
presented. This algorithm copes with local minima better than the full-propagation
(batch) algorithm, since it can change the parameters in directions other than the exact
opposite of the average steepest descent. Thus, it adds an exploration ability to the
back-propagation training phase. We continue the training epochs of each NN until
the mean square error on the test set starts to increase.
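As an illustration of this per-sample update rule, the following is a minimal sketch of incremental back-propagation for the 2-M-1 tanh network of Figure 2 (the function name, the fixed learning rate, and the omission of the early-stopping check are illustrative assumptions, not the original code):

    import numpy as np

    def train_incremental(X, z, M, lr=0.03, max_epochs=300, rng=None):
        # X: (P, 2) scaled inputs, z: (P,) scaled targets, M: number of hidden tanh neurons
        rng = rng if rng is not None else np.random.default_rng()
        n_in = X.shape[1]
        W1 = rng.uniform(-1, 1, (M, n_in)); b1 = rng.uniform(-1, 1, M)   # hidden layer weights/bias
        W2 = rng.uniform(-1, 1, M);         b2 = rng.uniform(-1, 1)      # output layer weights/bias
        for _ in range(max_epochs):
            for x, t in zip(X, z):                  # incremental: update after every sample
                h = np.tanh(W1 @ x + b1)            # hidden activations
                y = np.tanh(W2 @ h + b2)            # tanh output neuron (targets scaled to [-1, 1])
                e = y - t
                d_out = e * (1.0 - y ** 2)          # gradient at the output neuron
                d_hid = (1.0 - h ** 2) * (W2 * d_out)
                W2 -= lr * d_out * h;  b2 -= lr * d_out
                W1 -= lr * np.outer(d_hid, x);  b1 -= lr * d_hid
            # the early-stopping check on the test MSE is omitted here for brevity
        return W1, b1, W2, b2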
Figure 2. One-hidden-layer MLP with M neurons in the hidden layer, used for approximation of the first example
[Diagram: inputs x and y, M hidden tanh neurons plus one linear bias neuron, and a single output z]
In addition, we intend to find the efficient number of neurons in the hidden layer,
which is a burdensome task and involves many training and cross-validation
operations. Let M denote the number of hidden neurons in the MLP network. Starting
from M=1, we train the NN until the stopping criterion is reached and store the
resulting errors on the test and train sets. Then, we increment M and repeat the
procedure until a predefined maximum number of neurons is reached, which is 21 in
this case. Although it does not help the optimization itself, for the cases where the
stopping criterion is not met we impose a maximum number of training epochs for
practicality. In the end, the smallest M after which the test set error does not start to
increase is taken as the optimum number of neurons. The flowchart in Figure 3
describes this algorithm for finding the optimal number of hidden neurons.
Figure 3. Flowchart of finding optimum number of hidden neurons
Although this algorithm sounds reasonable, it has certain shortcomings that keep it
from being conclusive. First, considering the NN training, the initial values of the weights are chosen
randomly, and thus the behavior of the training error curve will differ between runs.
Meanwhile, the fixed maximum number of epochs may prevent the training from
reaching the minimum value in every run. Therefore, we need to implement a multi-start
algorithm to overcome this problem. To be comparable with [1], we decided to run the
algorithm for N=10 runs for each M and then decide the optimum number of neurons
on the basis of the average test-set MSE (mean square error) over the N runs.
Second, as we know, the nature of the back-propagation algorithm makes it
vulnerable to local minima and to the initial weights. In addition, the training behavior is
also influenced by the learning rate (and by the momentum value, if one is used). Thus, the
above heuristic algorithm cannot be perfectly conclusive for such a nonlinear
system and, in fact, there is no routine way to calculate the efficient number of neurons
in the hidden layer of an MLP. A number of similar heuristic algorithms exist for
finding the optimum number of hidden neurons, but there is no general solution yet
[3].
Our final algorithm for finding the optimum number of neurons in the hidden layer with a
multi-start approach is depicted in Figure 4. This algorithm yields an acceptable
suboptimal value of M, and we then use the selected structure to train the best
possible NN weights.
Figure 4. Finding the optimum number of neurons in hidden layer with a multi-start approach
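A compact sketch of this multi-start structure search is shown below (assumptions: the train_incremental routine sketched earlier, a hypothetical predict() forward pass, and taking the M with the smallest averaged test MSE as a proxy for the rule "the smallest M after which the test error no longer decreases"):

    import numpy as np

    def mse(y_pred, y_true):
        return np.mean((y_pred - y_true) ** 2)

    def select_hidden_size(X_tr, z_tr, X_te, z_te, max_M=21, n_starts=10):
        avg_test_mse = []
        for M in range(1, max_M + 1):
            run_errors = []
            for _ in range(n_starts):                      # multi-start: fresh random weights every run
                net = train_incremental(X_tr, z_tr, M)     # training as sketched above
                run_errors.append(mse(predict(net, X_te), z_te))   # predict(): hypothetical forward pass
            avg_test_mse.append(np.mean(run_errors))       # average test MSE over the N runs
        # proxy for the selection rule: take the M with the smallest averaged test MSE
        best_M = int(np.argmin(avg_test_mse)) + 1
        return best_M, avg_test_mse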
We implemented the above algorithm for training an MLP network to approximate
function (1), and Table 1 shows the training parameters. The learning rate is
fixed merely for simplicity of the algorithm, and the maximum number of
neurons in the hidden layer is selected close to the number of hidden neurons in the MLP of [1].
Figure 5 and Figure 6 show the changes in test/train MSE versus M for the two cases
where the maximum number of epochs is 300 and 700. As mentioned earlier in
this text, the MSE curve is influenced by the maximum number of epochs, the initial weights,
and the learning rate. Hence, we may conclude that M = 10 or M = 9 is the optimal
number of neurons in the hidden layer, as a compromise between
generality and error minimization.
Besides, these two MSE curves show a few abrupt increases in the MSE value at
different M values. This apparently contradicts our expectations, since adding new
neurons adds more degrees of freedom. However, these sudden increases in the plot are
the result of one or more of these factors:
• Bad initialization of weights
• Small value of maximum number of epochs
• Early stopping due to simple stopping criterion
Table 1. MLP parameters of estimation of the first function
Parameter Value
Number of inputs 2
Number of outputs 1
Number of hidden neurons (M) Calculated adaptively (see Figure 4)
Number of runs of multi-start (N) 10
Hidden layer neurons tanh
Output layer neurons tanh
Train Algorithm Back-propagation (incremental)
Learning rate 0.03
Maximum number of neurons 21
Maximum epochs 300 and 700
Figure 5. Train/test MSE plot with a maximum of 300 epochs; selected M = 10
Figure 6. Train/test MSE plot with a maximum of 700 epochs; selected M = 9
The MATLAB Neural Network Toolbox uses a special initialization algorithm for the weights
(the Nguyen-Widrow method [4]), which reduces the risk of falling into local minima
and speeds up the training. Since the toolbox does not support incremental back-propagation
learning (see the note below), we did not use it in our implementation, but we decided to
use its initialization function instead of our simple weights drawn uniformly
between -1 and 1. Moreover, to address the three factors above, we set up a
new experiment with Nguyen-Widrow initialization, a maximum of 800 epochs,
and a stopping criterion with memory. This memory-enabled stopping triggers
only if the test MSE increases for 15 epochs in a row; thus, it tolerates, to some extent,
sudden increases in the test error during the training phase. Figure 7 shows the results
of this enhanced experiment: the abrupt changes in the test/training
error have disappeared almost completely and the curve is much smoother. There are three points of
interest in the plot. First, at M=3 there is a slight increase in the test MSE, which can be
neglected since the test error for M>3 has a steady tendency to decrease. At M=9
there is another small increase in the test error, but we can choose the point M=11 by
the same rationale as before. Meanwhile, the test/train error becomes larger or stays
almost constant after M=11, which is a good indication of the generalization limit of the
approximation. Thus, we conclude that M=11 is a good choice for the optimum
structure.
Figure 7. Train/test MSE with maximum epochs = 800, Nguyen-Widrow initialization, and memory-enabled stopping criterion
Note: "incremental" learning in the MATLAB Neural Network Toolbox refers to online training, which means updating the parameters by considering all the data available so far, not only the data point at the present moment.
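A minimal sketch of the memory-enabled (patience-style) stopping check described above, with illustrative names and assuming the per-epoch test MSE values are collected in a list:

    def early_stop_with_memory(test_mse_history, patience=15):
        # fire only when the test MSE has worsened for `patience` consecutive epochs
        if len(test_mse_history) <= patience:
            return False
        recent = test_mse_history[-(patience + 1):]
        return all(recent[i + 1] > recent[i] for i in range(patience))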
Figure 8. Test output and the desired output when M=11
Figure 8 shows the output of the trained NN with the selected structure on the test set, and Table 2 shows the parameters of its training.
Table 2. Parameters of training the selected structure M=11.
Parameter Value
Epochs 700
Stopping criterion fired No
Train MSE 0.0054
Test MSE 0.0061006
Weight initialization Nguyen-Widrow
Stopping criterion With memory = 15
Learning rate 0.03
1-b- MLP Network: Second Example
In this section, we train an MLP network with two hidden layers and tanh activation
functions by using the full-propagation (batch training) algorithm for the function
z = \left(1 + x^{0.5} + y^{-1} + v^{-1.5}\right)^{2}, \qquad (x, y, v) \in [1, 6] \times [1, 6] \times [1, 6].   (2)
We generated 216 pairs of input/output data points for the training phase and 125 pairs
for the test/validation phase. The train and test sets are stored to keep the
comparisons consistent. The distribution of the data points is uniform over the
function domain, the same as in the last section. The test and train sets contain data
points whose values exceed the [-1, 1] interval; thus, we decided to scale all inputs and outputs
to [-1, 1] for better training.
Figure 9. MLP 3×5×M×1 used for approximation of the second example
The key difference between this section and the last one is the training method, which is
not incremental. In batch learning, or in other words full-propagation, the
parameters change in the direction that reduces the average error the most.
This means the parameter updates may head toward a point that is good on average
but is only a local minimum, and the method is more likely to miss the
global minimum because it never takes the risk of occasionally moving in locally bad directions
(i.e., it has weak exploration ability). However, this difference only affects the model
assessment procedures, and the discussion about model selection remains well grounded
for this method.
In this example, we will find the optimum number of hidden neurons for the second
hidden layer while the first hidden layer is kept at a constant 5 neurons (see Figure
9). The algorithm in Figure 4 holds for full-propagation with slight modifications
for this example: first, the training phase is changed from incremental to batch learning
and, second, M now denotes the number of neurons in the second hidden layer only.
The MATLAB Neural Network Toolbox fully supports batch learning with various
types of learning methods. Thus, we decided to use this rich toolbox for our
experiments on the second example (see the note below), and for the learning method we chose
TRAINCGB: conjugate gradient back-propagation with Powell-Beale restarts. The
number of neurons in the second hidden layer can be smaller than in the last example,
since the NN in this example has two hidden layers and the complexity of the function
allows it. Therefore, we decided to set the maximum number of neurons
to 15. The number of multi-start runs, N, is kept the same, equal to 10, and the maximum
numbers of epochs are the same, 300 and 700. In addition, the weight initialization is
done by the Nguyen-Widrow algorithm as the default option of the toolbox.
Table 3 presents the parameters of the experiment for finding the efficient structure of the
network. Figure 10 and Figure 11 show the train/test MSE versus the number of
neurons in the second hidden layer for a maximum number of epochs equal to 300 and
700, respectively. The first plot shows an increase of the test error at M=4, which is the
result of an outlier, and another increase at M=7, which indicates that M=6 is an
efficient choice for the hidden neurons, since the test error tends to be worse or the same
for larger M values. In Figure 11, the maximum number of epochs is increased to 700
and the curve is more regular. Ruling out M=1 as the efficient solution,
due to outliers, it is clear that M=6 would be a good choice. Hence, we concluded that
M=6 is a good approximation of the efficient number of neurons in the second hidden
layer. Figure 12 shows the plot of the trained NN with M=6 neurons in the second
hidden layer.
Table 3. MLP parameters of estimation of the second function
Parameter Value
Number of inputs 3
Number of outputs 1
Number of neurons in 1st hidden layer 5
Number of neurons in 2nd hidden layer Calculated adaptively (see Figure 4)
Number of runs of multi-start (N) 10
Note: the MATLAB implementation of the neural networks is not fast enough for numerous tests, since it creates a unique M-file at each call of train, which takes considerable time.
Neuron activation functions tanh (for all layers)
Train Algorithm Full-propagation (Batch)
Maximum value of M 15
Maximum epochs 300 and 700
Figure 10. Train/test MSE plot with a maximum of 300 epochs; selected M = 5
Figure 11. Train/test MSE plot with a maximum of 700 epochs; selected M = 6
Figure 12. Test output and the desired output when M=6 (3×5×6×1)
1-c- MLP Network: Presence of Noise
Adding neurons to the hidden layer of a neural network amounts to increasing its memory
and its ability to specialize. However, it also makes the network more vulnerable to
input noise and reduces its generality. In this section, we repeat our experiments
of section 1-a on noisy data to examine this fact. We anticipate that the
efficient number of neurons in the hidden layer will be, to some extent, inversely
proportional to the noise variance. In addition, by adding more noise, we expect the error
variance, and thus the MSE, to increase.
Table 4. Variance (σ²) values of the input noise
Label σ²
Case 1 Low 0.05
Case 2 Medium 0.1
Case 3 High 0.5
We add the three cases of low, medium, and high input noise to our train and test data
and repeat the procedure of section 1-a to calculate the optimum M. Table 4 presents the
variance of the input noise, and Figure 13 to Figure 15 show the test/train MSE versus
M for the different noise cases.
Figure 13. Train/test MSE when the noise variance = 0.05 (low noise)
Figure 14. Train/test MSE when the noise variance = 0.1 (medium noise)
Figure 15. Train/test MSE when the noise variance = 0.5 (high noise)
To obtain more comparable results, we decided to fit a polynomial of degree 3 to each
of these test MSE sets and to depict them in a single plot. Figure 16 shows the result of the
polynomial fit. The two apparent effects of adding noise, an increase of the error variance
and a decrease in the efficient number of neurons in the hidden layer, are demonstrated
very clearly by Figure 16.
Figure 16. Low-pass polynomial of degree 3 curve fitted on the test error data versus M.
Therefore, we conclude that the efficient value of M for the low, medium, and high
cases of input noise is 10, 9, and 5, respectively.
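The smoothing itself is a plain least-squares polynomial fit; a minimal sketch is given below (assuming the averaged test MSE values per M are already stored in an array, here named test_mse, which is an illustrative name):

    import numpy as np

    M_vals = np.arange(1, 22)                    # hidden-layer sizes that were tried
    coeffs = np.polyfit(M_vals, test_mse, 3)     # degree-3 least-squares fit to the test MSE curve
    smooth = np.polyval(coeffs, M_vals)          # smoothed curve of the kind plotted in Figure 16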
1-d- MLP Network: Comparison with ANFIS
ANFIS, originally introduced by Jang [1], is a powerful adaptive neuro-fuzzy
structure with a great ability for function approximation and nonlinear
identification. Our two examples are the same as the first two examples in [1], with the
same number of data points. Thus, a comparison of our results with those of [1] evaluates
the performance of our structures.
Table 5. Comparison of the SMSE (square root of the mean square error) of [1] and our MLP structure for the first function (1)
Model SMSE
ANFIS 0.001
2×18×1 MLP in [1] 0.108
Our 2×6×1 MLP (on the train set) 0.0735
Table 6. Comparison of our (3×5×6×1) MLP, trained by different methods, with Table 2 of [1].
Model APEtrain APEtest Parameters Training set size Test set size
ANFIS 0.043 1.066 50 216 125
GMDH model 4.7 5.7 - 20 20
Fuzzy model 1 1.5 2.1 22 20 20
Fuzzy model 2 0.59 3.4 32 20 20
Our MLP (traincgb, 300 epoch) 5.40 6.43 58 216 125
Our MLP (traincgb, 500 epoch) 2.33 3.11 58 216 125
Our MLP (trainlm, 300 epoch) 0.0036 0.0057 58 216 125
Our MLP (trainlm, 500 epoch) 0.03668 0.0745 58 216 125
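For reference, the error measures used in these tables are, as we apply them here (the APE follows the average percentage error of [1], over P data pairs with targets T(i) and model outputs O(i)):

    \mathrm{APE} = \frac{100}{P} \sum_{i=1}^{P} \frac{|T(i) - O(i)|}{|T(i)|}, \qquad
    \mathrm{SMSE} = \sqrt{\frac{1}{P} \sum_{i=1}^{P} \bigl(T(i) - O(i)\bigr)^{2}}.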
In Table 5, the SMSE of our trained structure is compared to the results of [1]. It is
clear that the ANFIS method has done a better job of function approximation, but the
shortcoming of the MLP could be the result of our simple back-propagation algorithm, which
is in danger of local minima. It is noteworthy that the 2×18×1 MLP
network in [1] was well trained, and yet its error is considerably worse than
our result. Although the test set error would be more conclusive, it is not reported
for this example in [1]. For reference, the SMSE of our 2×6×1 MLP on the test set is
0.0781.
Table 6 compares the results for the second function, for which we used a 3×5×6×1
MLP network with the structure of Figure 9. Besides, it contains a list of results of
different methods from [1]; thus, we may consider this table an updated version of
Table 2 in [1]. These results show that our MLP structure, when it is well trained, is a
powerful model for nonlinear function approximation. The results of training with the
conjugate gradient method are comparable with some of the results in Table 6,
and, on the other hand, the Levenberg-Marquardt back-propagation
training has achieved even better results than ANFIS. Since the test and train set sizes
are the same and the numbers of parameters are comparable (50 vs. 58), we may conclude
that the MLP with two hidden layers has done a better job of approximating this
function. Hence, the remaining problems of the MLP as a universal function
approximator are the training method, the danger of falling into local minima, and the
cumbersome task of model selection.
2-a- RBF Network: First Example
Similar to section 1-a, in this section we intend to build an efficient model for
approximating the first example; however, RBF networks are our tool here
instead of MLP networks. We decided to use an approach that combines
structure and parameter estimation, and thus we chose a subset selection strategy,
OLS, for training our RBF network. This algorithm is implemented in the MATLAB
Neural Network Toolbox, and we used it in our implementation.
The subset selection strategy is as follows [5]. First, a large number of potential basis
functions are specified; in this case, we place a Gaussian function with a fixed standard
deviation on each training data point. Second, a subset selection technique is
applied in order to determine the most important of all the potential basis functions
with respect to the available data. Since for the potential basis functions only the
linear output layer weights are unknown, an efficient linear subset selection technique
such as OLS can be applied [6]. At the end of the process, the optimal weights and
structure are obtained by the OLS algorithm.
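To make the idea concrete, the following is a simplified sketch of subset selection for such an RBF network: every training point is a candidate Gaussian centre with a fixed spread, centres are added greedily, and the linear output weights are refit by least squares at each step. It only mimics the spirit of OLS [6] and of the toolbox routine; the names and the brute-force selection loop are illustrative assumptions, not the actual implementation.

    import numpy as np

    def gaussian_design(X, centres, spread):
        # design matrix of Gaussian responses for all samples and all centres
        d2 = ((X[:, None, :] - centres[None, :, :]) ** 2).sum(axis=2)
        return np.exp(-d2 / (2.0 * spread ** 2))

    def greedy_rbf(X, z, M, spread):
        chosen, remaining = [], list(range(len(X)))
        for _ in range(M):
            best = None
            for j in remaining:                                   # try each unused point as the next centre
                Phi = gaussian_design(X, X[chosen + [j]], spread)
                w, *_ = np.linalg.lstsq(Phi, z, rcond=None)       # refit the linear output weights
                err = np.mean((Phi @ w - z) ** 2)
                if best is None or err < best[0]:
                    best = (err, j)
            chosen.append(best[1]); remaining.remove(best[1])
        centres = X[chosen]
        Phi = gaussian_design(X, centres, spread)
        w, *_ = np.linalg.lstsq(Phi, z, rcond=None)
        return centres, w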
Figure 17. RBF network structure
In comparison with MLP networks, roughly speaking, the number of radial basis
functions is equivalent to the number of neurons in the hidden layer of an MLP, and the role of
the centers is roughly similar to that of the input weights and structure of the MLP network.
To find the optimum structure of the RBF network, we decided to plot the train/test
error of the network versus the number of radial basis functions, denoted by M. For a
fixed variance, adding new neurons reduces the error on both the test and train sets, but
from some certain M onward the rates of error reduction of the two curves diverge. This
M can be considered the optimum number of neurons, since the test error does
not change significantly as we add new neurons (memory) to the network.
Besides, the OLS algorithm for RBF training needs another parameter, named
SPREAD (see the note below), which controls the smoothness and generalization of the approximation.
The larger the SPREAD is, the smoother the function approximation will be. Too
large a spread means a lot of neurons will be required to fit a fast-changing function.
Too small a spread means many neurons will be required to fit a smooth function, and
the network may not generalize well. To find the best value of SPREAD for the given
problem, we should train the RBF with different spreads.
Figure 18 shows the behavior of the test and train errors versus M when the SPREAD value is kept
constant. It is clear that from M=10 onward, adding new neurons to the network does not
reduce the test error as much as it reduces the error on the train set. Thus, using more than M
neurons would be a waste of resources.
Note: SPREAD is the distance an input vector must be from a neuron's weight vector to produce an output of 0.5. In other words, if a neuron's weight vector is at a distance of SPREAD from the input vector, its net input is sqrt(-log(0.5)) (or 0.8326), and therefore its output is 0.5.
Figure 18. Behavior of the test/train error versus M, SPREAD = 0.32
Hence, finding the optimum structure of our RBF network means finding the best
pair (M, SPREAD) that gives a low test error while keeping generalization. We
designed an algorithm to achieve this goal. For a series of SPREAD values from 0.1 to 3,
we start from M=1 and calculate the test and train MSE of the RBF; then we
increment M and calculate the errors again. We continue the procedure until the
criterion in (3) is satisfied. This condition heuristically marks the M at which the two
curves of train and test error start to diverge, i.e. the train error becomes less than
70% of the test error:
\frac{\mathrm{MSE}_{\mathrm{Train}}}{\mathrm{MSE}_{\mathrm{Test}}} < 0.70   (3)
Then we store the triple (M, SPREAD, test error) and continue in the same way with the next
SPREAD value. At the end, we plot the curve of the stored test errors, and the network
structure with the selected M and SPREAD values is chosen as the optimum structure.
Figure 19 shows the flowchart of our algorithm.
Figure 19. Proposed algorithm for finding the optimum value of SPREAD and the number of neurons (M)
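A compact sketch of this search is given below, assuming the greedy_rbf and gaussian_design helpers sketched in section 2-a; the loop bounds and names are illustrative.

    import numpy as np

    def search_m_spread(X_tr, z_tr, X_te, z_te, spreads, max_M=100):
        records = []
        for spread in spreads:                               # e.g. np.arange(0.1, 3.0, 0.1)
            for M in range(1, max_M + 1):
                centres, w = greedy_rbf(X_tr, z_tr, M, spread)
                mse_tr = np.mean((gaussian_design(X_tr, centres, spread) @ w - z_tr) ** 2)
                mse_te = np.mean((gaussian_design(X_te, centres, spread) @ w - z_te) ** 2)
                if mse_tr / mse_te < 0.70:                   # criterion (3): the two curves start to diverge
                    records.append((M, spread, mse_te))
                    break
        # the stored (M, SPREAD) pair with the lowest test MSE is taken as the structure
        return min(records, key=lambda r: r[2])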
Figure 20 shows the result of the algorithm of Figure 19, where the SPREAD value is
shown in red and M is indicated for each case. Hence, we conclude that M=14
with SPREAD = 0.32 gives satisfactory results. However, as Figure 18 shows, we
could obtain a better approximation of the test set with M=45 (or even M=100) at the cost
of overusing resources (see Figure 21).
Figure 20. MSE of Test set for different pairs of (M,SPREAD) with stopping rule available
Figure 21. Test set plot for SPREAD = 0.32. (Top Left) M=14 MSE = 0.0044, (Top Right) M=45
MSE = 0.0011, (Bottom) M= 100 MSE= 0.00040
2-b- RBF Network: Second Example
For the second function, we will use the same algorithm as in Figure 19 to find the
best pair (M, SPREAD), since the behavior of the train/test error during the training phase
is the same (see Figure 22). The selected pair is (31, 4.2).
Figure 22. (Left) Train/test error when SPREAD = 4.2. (Right) Curve of test MSE for selecting the best (M, SPREAD) pair
Figure 23. Output of RBF on test set when SPREAD = 4.2 (Top LEFT) M = 31, (Top Right) M = 50 (Bottom) M = 100
It should be noted that in Figure 22 (right) the best results come from pairs in which one
element of (M, SPREAD) is low and the other is high. This behavior reflects
the effort to preserve generality in the algorithm. However, as mentioned before, this
algorithm gives a greedy pair (M, SPREAD), and if we want to obtain the best
performance we can add more neurons. The change in (M, SPREAD) from the
last example is due to many factors, including the data samples, the change in the complexity of
the function, and the increase in the dimension of the problem.
2-c- RBF Network: Presence of Noise
RBF networks are generally believed to be good at handling noise, since they treat noise locally
and it does not affect the global performance considerably. A large amount
of noise does have a negative effect on RBF networks, but we will see that there is good robustness to
medium and low noise, and especially better robustness than in MLP networks.
The model is tested for the three cases of noise discussed in Table 4, and the results are
presented in Figure 24 to Figure 26.
Figure 24. High-noise case, σ² = 0.5. (Left) Test set results, MSE = 0.194. (Right) Curve of test set MSE for selecting (M, SPREAD); (15, 0.7) is selected.
Figure 25. Medium-noise case, σ² = 0.1. (Left) Test set results, MSE = 0.194. (Right) Curve of test set MSE for selecting (M, SPREAD); (37, 0.3) is selected.
Figure 26. Low-noise case, σ² = 0.05. (Left) Test set results, MSE = 0.194. (Right) Curve of test set MSE for selecting (M, SPREAD); (38, 0.2) is selected.
The interesting behavior of the pair (M, SPREAD) in the presence of noise is that the
best value of M decreases and the best SPREAD increases as the noise increases, which means
that with low noise we can use more neurons with a smaller spread. Figure 27
summarizes the results of our comparison for RBF networks.
Figure 27. The rough comparison of behavior of the best M and SPREAD values in presence of noise in RBF networks
2-d- RBF Network: ANFIS Comparison
In this section, we summarize our results for the RBF network and compare them with the ANFIS
results in [1]. For function (1), our results are far from the astonishing results of [1],
but they are reasonably better than our previous MLP results.
This was expected, since RBF networks are more efficient function approximators
than MLP networks trained with a simple gradient descent algorithm, and they interpolate
nicely (see Table 7).
Table 7. Comparison of the SMSE of [1] and our RBF structures for the first function (1)
Model SMSE
ANFIS 0.001
RBF with M=31, SPREAD=0.32 0.0664
RBF with M=45, SPREAD=0.32 0.0336
RBF with M=100, SPREAD=0.32 0.0202
For the second function, however, our NRBF network provides nearly the same result on
the test set with only slightly more parameters than ANFIS. Besides, by adding more radial basis
neurons, M=100, we obtain superior results that surpass the ANFIS results. By a rough
comparison, we note an interesting point in the small difference between the test and
train errors of our best NRBF and the large difference in ANFIS: the NRBF
networks do a better job of interpolation (and also extrapolation) than ANFIS.
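For reference, the normalized RBF output referred to here is of the standard form (this formulation, with Gaussian basis functions phi_i, centres c_i, and widths sigma_i, is stated as an assumption rather than taken from the original implementation):

    \hat{z}(x) = \frac{\sum_{i=1}^{M} w_i \, \phi_i(x)}{\sum_{j=1}^{M} \phi_j(x)}, \qquad
    \phi_i(x) = \exp\!\left(-\frac{\lVert x - c_i \rVert^2}{2\sigma_i^2}\right).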
Table 8. Comparison of our RBF for the function (2) with Table 2 of [1]
Model APEtrain APEtest Parameters Training set size Test set size
ANFIS 0.043 1.066 50 216 125
GMDH model 4.7 5.7 - 20 20
Fuzzy model 1 1.5 2.1 22 20 20
Fuzzy model 2 0.59 3.4 32 20 20
Our NRBF 1 (SPREAD=4.2) 5.6242 2.4362 M=31 216 125
Our NRBF 2 (SPREAD=4.2) 1.1236 2.3014 M=50 216 125
Our NRBF 3 (SPREAD=4.2) 0.6247 1.8514 M=60 216 125
Our NRBF 4 (SPREAD=4.2) 0.1911 0.0568 M=100 216 125
3-a- LLNF: First Example
The network structure of locally linear neuro-fuzzy (LLNF) networks is depicted in
Figure 28. Each neuron realizes a local linear model (LLM) and an associated validity
function that determines the region of validity of the LLM [5]. The validity functions,
which are similar to basis functions in RBF and could be Gaussians, are normalized
such that \sum_{i=1}^{M} \Phi_i(z) = 1 for any input z. The output of this model is calculated as

\hat{z} = \sum_{i=1}^{M} \left( w_{i0} + w_{i1} x_1 + \ldots + w_{i n_x} x_{n_x} \right) \Phi_i(z),   (4)

where the local linear models depend on x = [x_1 \; x_2 \; \ldots \; x_{n_x}]^T and the validity
functions depend on z = [z_1 \; z_2 \; \ldots \; z_{n_z}]^T. However, in this report, we simply
consider the case where the linear models and the validity functions use the same inputs x.
This network simply interpolates between the linear hyperplanes, which approximate the
function locally, by means of the nonlinear neurons.
Figure 28. Locally Linear Neuro-Fuzzy Network
We used the LLNF network for the approximation of the first and second functions, (1)
and (2), with the LOLIMOT incremental training algorithm [7]. The incremental nature of
the algorithm provides the ability to find the optimum structure for the model as well
as to minimize the error on the test set. At each step, the algorithm selects the worst
local model over the function domain, splits its region in half along each of the
axis-orthogonal directions, and fits the parameters of the resulting local models. The
best of these candidate splits is then kept, and the algorithm iterates until the stopping
criterion breaks it. The use of the SSE (sum of squared errors) in choosing the worst
region to split makes the algorithm sensitive to a high density of data points in the input
space; thus, the denser areas are approximated with more linear models, and this
increases the model accuracy. A detailed description of the algorithm can be found
in [5,7].
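The core numerical step inside each LOLIMOT iteration is the weighted least-squares fit of a local linear model, with the Gaussian validity values acting as weights. A minimal, self-contained sketch of this per-model step is given below; the helper name, the un-normalized validity values, and the direct matrix solve are illustrative simplifications, not the exact routine of [7].

    import numpy as np

    def fit_local_model(X, z, centre, widths):
        # Gaussian validity of each sample for this local model (normalization over all
        # models, which makes the validity functions sum to 1, is omitted for brevity)
        phi = np.exp(-0.5 * (((X - centre) / widths) ** 2).sum(axis=1))
        R = np.hstack([X, np.ones((len(X), 1))])          # regressors [x1 ... xn 1], cf. Eq. (5)
        W = np.diag(phi)
        # weighted least squares: theta = (R^T W R)^{-1} R^T W z
        theta = np.linalg.solve(R.T @ W @ R, R.T @ W @ z)
        return theta                                       # slope and offset parameters of one LLM in Eq. (4)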
We have implemented a MISO LOLIMOT algorithm, which successfully
approximates the 2-input and 3-input functions in (1) and (2). The stopping criterion is
the point where the test set error starts to increase or stays constant. Figure 29 (top-left)
shows the result of the algorithm, which has selected M=42 as the optimized
number of LLMs. Figure 29 (top-right) shows the good approximation
behavior of the model, which is inherited from its ability to interpolate very well. Figure
29 (bottom) depicts the placement of the Gaussian LLMs in the input space. As
expected, more LLMs are placed at the center, where the function changes more
rapidly and more linear models are needed for a good approximation.
Figure 29. Implementation of the LOLIMOT algorithm for the first function; best M=42. (Top-left) Train/test error and stopping criterion. (Top-right) Test set output, MSE = 0.00184. (Bottom) Gaussian basis function structure for the selected M=42
3-b- LLNF: Second Example
The design of the LLNF with the LOLIMOT learning algorithm is straightforward for the
second function (2). As Figure 30 shows, the train/test error tends to decrease as M
increases, and thus the best performance could be achieved with a large number of
LLMs, but we decided to limit the maximum number of neurons to 100 for simplicity.
Besides, the strange behavior of the test set error at M=20 is interpreted as the lack of
good data points with a similar distribution over a small area of the function input, which
may cause the error to decrease on the train set and increase on the test set.
Figure 30 (top-right) shows the network output for M = 23, for which the MSE and APE
are not small enough. Figure 30 (bottom-left) shows the acceptable results when
M=100, where the MSE is reduced to 0.00000069 and the APE is 3.05.
In addition, we decided to test a special case in which the regressors of the linear
models are not simply the raw inputs. By default, LOLIMOT takes the regressors to be
the plain input values, so the regression matrix is

X = [\, x \;\; y \;\; v \;\; 1 \,],   (5)

an N×4 matrix, where N denotes the number of data samples. However, in the second function, as we
know, there are some inverse terms, so we decided to use one different regressor,
which changes one column of the matrix (5). We replaced the v values defined in (2) in the
regression matrix with v^{-1} and obtained the new regression matrix X_i = [\, x \;\; y \;\; v^{-1} \;\; 1 \,].
The results in Figure 30 (bottom-right) show an increase in accuracy and a reduction in the
number of neurons necessary for achieving a low error rate. Thus, we conclude that
with well-defined regressors in the locally linear models we can obtain better
approximation results.
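A two-line sketch of this regressor change (assuming x, y, and v are 1-D arrays holding the sample values; the names are illustrative):

    import numpy as np

    X_lin  = np.column_stack([x, y, v, np.ones_like(x)])        # X  = [x  y  v  1], cf. Eq. (5)
    X_spec = np.column_stack([x, y, 1.0 / v, np.ones_like(x)])  # Xi = [x  y  v^-1  1], the special regressors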
Figure 30. Implementation of LOLIMOT for the second function. (Top-left) Train/test error. (Top-right) Network output, M = 23, MSE = 0.0000249. (Bottom-left) M = 100, MSE = 0.00000679. (Bottom-right) Special regressors X_i = [x y v^{-1} 1], M = 60
3-c- LLNF: Presence of noise
In this section, we investigate the results of the approximation and training of the LLNF
in the presence of noise. We expect the LOLIMOT algorithm to show good performance
in overcoming noise. In addition, as in the previous sections, we anticipate a reduction in
the number of optimal neurons as the noise increases, in order to preserve the generality of the
approximation. Figure 31 shows the results for the different types of noise in Table 4, and
Figure 32 depicts the placement of the LLMs in the input space for the different noise cases.
The obvious effects of increasing the noise are
• Reduction of optimal M,
• Increase of error,
• Misplacement of LLMs.
Figure 31. LOLIMOT train/test training curves and test-set outputs for the different cases of noise. (Top-left) Train/test MSE, selected M = 49, low noise. (Top-right) Test-set output for M = 49, low noise. (Middle-left) Train/test MSE, selected M = 49, medium noise. (Middle-right) Test-set output for M = 49, medium noise. (Bottom-left) Train/test MSE, M = 26, high noise. (Bottom-right) Test-set output for M = 26, high noise
Figure 32. Placement of Gaussians in the input space in presence of noise (Top-Left) Low noise (Top-Right) Medium noise (Bottom) High noise
3-d- LLNF: Comparison with ANFIS
In comparison with the ANFIS results, our LOLIMOT shows comparable performance on the
first function and worse performance on the second one. As Table 9 shows, the
approximation of the first function compares favorably with the ANFIS result.
Table 9. Comparison of the results of LOLIMOT with ANFIS for the first function (1)
Model SMSE
ANFIS 0.001
LOLIMOT with M = 42 0.00184
Meanwhile, the results for the second function show that the approximation is not
quite as good. We concluded that, since the LLNF is a far richer model than the NRBF and
the MLP, the higher error rates of LOLIMOT are due to a bad partitioning of the input space,
caused by bad data samples. More data points, or data with a different distribution,
would improve the performance.
Table 10. Comparison of our LOLIMOT with Table 2 of [1].
Model APEtrain APEtest Parameters Training set size Test set size
ANFIS 0.043 1.066 50 216 125
GMDH model 4.7 5.7 - 20 20
Fuzzy model 1 1.5 2.1 22 20 20
Fuzzy model 2 0.59 3.4 32 20 20
Our LOLIMOT 1 0.67 5.98 M=23 216 125
Our LOLIMOT 2 0.055 3.05 M=100 216 125
Our LOLIMOT 3 (with special regressors) 0.066 3.75 M=60 216 125
4- Conclusion and Summary
In this report, we have investigated the model selection and training of MLP, RBF,
NRBF, and LLNF networks. In addition to the methods and algorithms we
implemented extensively for model selection, we provide comparisons of our results
on the two well-known function approximation problems of [1]. This gives us the ability
to evaluate our implementations and algorithms. Besides, we have devised an ad-hoc
method for model selection in RBF networks in which the Gaussian
placement is done by OLS (Figure 19). In the implementation phase, we
developed a rich library for the LOLIMOT algorithm, which we have used in the final
project for a Swarm extension to LOLIMOT. The Swarm extension continuously
finds the optimized partitioning instead of using the heuristic bisection of the original
incremental LOLIMOT.
As a conclusion, we provide the following table, which compares the different
neural networks for function approximation on the basis of our implementation
experience.
Table 11. Comparison of different properties of the implemented networks
Properties MLP RBF NRBF LLNF
Interpolation + + + ++
Training speed - ++ ++ +
Model selection simplicity -- + + ++
Memory requirement + - - --
Accuracy ++ + + ++
Simplicity of usage ++ + + -
Noise rejection + ++ N/A +
(+ = model property is favorable, - = model property is not favorable)
References
[1] Jang, J.-S. R., "ANFIS: Adaptive-Network-based Fuzzy Inference Systems," IEEE Transactions on Systems, Man, and Cybernetics, Vol. 23, No. 3, pp. 665-685, May 1993.
[2] T. Hastie, R. Tibshirani, and J. Friedman, The Elements of Statistical Learning: Data Mining, Inference, and Prediction, pp. 195-196, Springer, 2001.
[3] V. Kecman, Learning and Soft Computing: Support Vector Machines, Neural Networks, and Fuzzy Logic Models, Cambridge, MA: The MIT Press, 2001.
[4] D. Nguyen and B. Widrow, "Improving the learning speed of 2-layer neural networks by choosing initial values of the adaptive weights," IJCNN International Joint Conference on Neural Networks, San Diego, CA, USA, 17-21 June 1990, vol. 3, pp. 21-26.
[5] O. Nelles, Nonlinear System Identification: From Classical Approaches to Neural Networks and Fuzzy Models. Berlin, Germany: Springer-Verlag, 2001.
[6] S. Chen, C. F. N. Cowan, and P. M. Grant, "Orthogonal least squares learning algorithm for radial basis function networks," IEEE Transactions on Neural Networks, vol. 2, no. 2, March 1991.
[7] O. Nelles, A. Fink, and R. Isermann, "Local Linear Model Trees (LOLIMOT) Toolbox for Nonlinear System Identification," 12th IFAC Symposium on System Identification (SYSID), Santa Barbara, USA, 2000.