Assignment # 3
Nonlinear Function Approximation
Author:
Ramin Mehran
Instructor:
Dr. Babak Araabi
K.N.Toosi U of Tech.
November 2004
Contents
Introduction
1-a- MLP Network: First Example
1-b- MLP Network: Second Example
1-c- MLP Network: Presence of Noise
1-d- MLP Network: Comparison with ANFIS
2-a- RBF Network: First Example
2-b- RBF Network: Second Example
2-c- RBF Network: Presence of Noise
2-d- RBF Network: ANFIS Comparison
3-a- LLNF: First Example
3-b- LLNF: Second Example
3-c- LLNF: Presence of Noise
3-d- LLNF: Comparison with ANFIS
4- Conclusion and Summary
References
Introduction
Neural networks and neuro-fuzzy systems are common tools for nonlinear regression
problems. The main feature of these models is their flexibility to absorb the complexity
of the function into their intrinsic nonlinear structure. We will investigate this
capability for some familiar types of these models by applying them to two well-known
regression problems [1]. Besides, we will try to obtain the optimum structure for each
model by using cross-validation techniques. Therefore, we will discuss model selection
and model assessment as the main points of this report. By model selection,
we mean "estimating the performance of different models in order to choose the
(approximate) best one." Meanwhile, model assessment means "having chosen a
final model, estimating its prediction error (generalization error) on new data" [2].
The neural networks and neuro-fuzzy systems to be investigated in this report are:
• MLP with one and two hidden layers
• RBF
• NRBF
• Locally Linear Neuro-Fuzzy Networks (LLNF)
We will consider MLP networks with tanh(.) activation functions, and in the RBF and
LLNF networks we will use Gaussian functions. The nonlinear regression problems
are the same as Examples 1 and 2 in the simulation results section of [1]. The
experimental setup is kept as close to that of [1] as possible; therefore, we
will have the chance to compare our results with the equivalent results reported there.
The organization of this report is as follows. The report consists of four main sections,
the first three of which have four subsections each. In section 1-a, we present
an MLP network with one hidden layer for estimation, along with introducing the first
example of a nonlinear function. The training algorithm in this section is back-propagation,
and we find the efficient number of neurons for this structure, which
preserves the generalization. This is followed by a similar section (section 1-b)
for the next nonlinear function. However, for the second function, the MLP network
will have two hidden layers instead of one. In section 1-c, we investigate the
effects of noise on generalization and model selection for the first example and
compare the results with the noise-free situation. Section 1-d is a summary of the
results obtained with the MLP network for these two examples, which entails a comparison
with the results in [1].
In addition, we will discuss the abilities of RBF networks for function approximation,
and this leads us to a series of experiments on the same examples with radial basis
networks. Section 2-a of this report presents results from an (approximately)
optimum RBF network with an efficient number of hidden neurons for approximating the
first example. Section 2-b uses an NRBF network for the second example in a
similar fashion. In section 2-c, the same experiments for monitoring the effects of
adding noise to the learning are repeated on RBF networks. This is followed by a
concluding discussion of RBF networks for approximation and a comparison with the
results in [1] (section 2-d).
Finally, the LOLIMOT algorithm for training LLNF networks is used for function
approximation of the two examples while preserving the model selection requirements
(sections 3-a and 3-b). Similar to the previous sections, the effect of noise on the
efficiency and generality of the results is investigated and a comparison with [1] is
included (sections 3-c and 3-d). In the end, we summarize the results of all the sections
in a unified framework and discuss the abilities of each model.
1-a- MLP Network: First Example
In this section, we will train an MLP neural network with one hidden layer and tanh
activation functions for approximating a two-input sinc function,
z = \mathrm{sinc}(x, y) = \frac{\sin(x)}{x} \cdot \frac{\sin(y)}{y}, \qquad (x, y) \in [-10, 10] \times [-10, 10].   (1)
We have generated 121 pairs of {(x,y), z} sample data points for training and 71 pairs
for the test/validation phase. In both cases, the sample points are generated
randomly with a uniform distribution over the domain defined in (1) (see Figure 1). It
should be noted that these sample points are stored, and the same points are used throughout
the rest of the report to keep the comparisons consistent. In addition, for training
the MLP it is recommended to scale the input/output values to the [-1, 1] range.
Thus, before any training phase the data are scaled down to that range.
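A minimal sketch of this data generation and scaling step is given below (assumptions: Python/NumPy is used only for illustration; the random seed and the helper names sinc2d and scale_to_unit are not from the original implementation):

    import numpy as np

    def sinc2d(x, y):
        # two-input sinc of Eq. (1); np.sinc(t) = sin(pi*t)/(pi*t), so divide the argument by pi
        return np.sinc(x / np.pi) * np.sinc(y / np.pi)

    rng = np.random.default_rng(0)                      # fixed seed so the stored samples stay reproducible
    X_train = rng.uniform(-10.0, 10.0, size=(121, 2))   # 121 training pairs, uniform over [-10, 10]^2
    X_test  = rng.uniform(-10.0, 10.0, size=(71, 2))    # 71 test/validation pairs
    z_train = sinc2d(X_train[:, 0], X_train[:, 1])
    z_test  = sinc2d(X_test[:, 0],  X_test[:, 1])

    def scale_to_unit(a, lo, hi):
        # linear scaling of [lo, hi] onto [-1, 1], applied before MLP training
        return 2.0 * (a - lo) / (hi - lo) - 1.0

    X_train_s = scale_to_unit(X_train, -10.0, 10.0)
    X_test_s  = scale_to_unit(X_test,  -10.0, 10.0)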
Figure 1. Distribution of data points over the function domain of the first example. (Blue) Training
points, (Red) testing points
For the MLP neural network in Figure 2, we choose the back-propagation algorithm,
which updates the parameters (weights) of the NN directly after each data point is
presented. This algorithm copes with local minima better than the full-propagation
(batch) algorithm, since it can change the parameters in directions other than the exact
opposite of the average steepest descent. Thus, it adds an exploration ability to the
back-propagation training phase. We continue the training epochs of each NN until
the mean square error on the test set starts to increase.
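As an illustration of this per-sample update rule, the following is a minimal sketch of incremental back-propagation for the 2-M-1 tanh network of Figure 2 (the function name, the fixed learning rate, and the omission of the early-stopping check are illustrative assumptions, not the original code):

    import numpy as np

    def train_incremental(X, z, M, lr=0.03, max_epochs=300, rng=None):
        # X: (P, 2) scaled inputs, z: (P,) scaled targets, M: number of hidden tanh neurons
        rng = rng if rng is not None else np.random.default_rng()
        n_in = X.shape[1]
        W1 = rng.uniform(-1, 1, (M, n_in)); b1 = rng.uniform(-1, 1, M)   # hidden layer weights/bias
        W2 = rng.uniform(-1, 1, M);         b2 = rng.uniform(-1, 1)      # output layer weights/bias
        for _ in range(max_epochs):
            for x, t in zip(X, z):                  # incremental: update after every sample
                h = np.tanh(W1 @ x + b1)            # hidden activations
                y = np.tanh(W2 @ h + b2)            # tanh output neuron (targets scaled to [-1, 1])
                e = y - t
                d_out = e * (1.0 - y ** 2)          # gradient at the output neuron
                d_hid = (1.0 - h ** 2) * (W2 * d_out)
                W2 -= lr * d_out * h;  b2 -= lr * d_out
                W1 -= lr * np.outer(d_hid, x);  b1 -= lr * d_hid
            # the early-stopping check on the test MSE is omitted here for brevity
        return W1, b1, W2, b2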
Figure 2. One-hidden-layer MLP with M neurons in the hidden layer, used for approximation of the first example
[Diagram: inputs x and y, M hidden tanh neurons plus one linear bias neuron, and a single output z]
In addition, we intend to find the efficient number of neurons in the hidden layer,
which is a burdensome task and involves many training and cross-validation
operations. Let M denote the number of hidden neurons in the MLP network. Starting
from M=1, we train the NN until the stopping criterion is reached and store the
resulting errors on the test and train sets. Then, we increment M and repeat the
procedure until a predefined maximum number of neurons is reached, which is 21 in
this case. Although it does not help the optimization itself, for the cases where the
stopping criterion is not met we impose a maximum number of training epochs for
practicality. In the end, the smallest M after which the test set error does not start to
increase is taken as the optimum number of neurons. The flowchart in Figure 3
describes this algorithm for finding the optimal number of hidden neurons.
Figure 3. Flowchart of finding optimum number of hidden neurons
Although this algorithm sounds reasonable, it has certain shortcomings that keep it
from being conclusive. First, considering the NN training, the initial values of the weights are chosen
randomly, and thus the behavior of the training error curve will differ between runs.
Meanwhile, the fixed maximum number of epochs may prevent the training from
reaching the minimum value in every run. Therefore, we need to implement a multi-start
algorithm to overcome this problem. To be comparable with [1], we decided to run the
algorithm for N=10 runs for each M and then decide the optimum number of neurons
on the basis of the average test-set MSE (mean square error) over the N runs.
Second, as we know, the nature of the back-propagation algorithm makes it
vulnerable to local minima and to the initial weights. In addition, the training behavior is
also influenced by the learning rate (and by the momentum value, if one is used). Thus, the
above heuristic algorithm cannot be perfectly conclusive for such a nonlinear
system and, in fact, there is no routine way to calculate the efficient number of neurons
in the hidden layer of an MLP. A number of similar heuristic algorithms exist for
finding the optimum number of hidden neurons, but there is no general solution yet
[3].
Our final algorithm for finding the optimum number of neurons in the hidden layer with a
multi-start approach is depicted in Figure 4. This algorithm yields an acceptable
suboptimal value of M, and we then use the selected structure to train the best
possible NN weights.
Figure 4. Finding the optimum number of neurons in hidden layer with a multi-start approach
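A compact sketch of this multi-start structure search is shown below (assumptions: the train_incremental routine sketched earlier, a hypothetical predict() forward pass, and taking the M with the smallest averaged test MSE as a proxy for the rule "the smallest M after which the test error no longer decreases"):

    import numpy as np

    def mse(y_pred, y_true):
        return np.mean((y_pred - y_true) ** 2)

    def select_hidden_size(X_tr, z_tr, X_te, z_te, max_M=21, n_starts=10):
        avg_test_mse = []
        for M in range(1, max_M + 1):
            run_errors = []
            for _ in range(n_starts):                      # multi-start: fresh random weights every run
                net = train_incremental(X_tr, z_tr, M)     # training as sketched above
                run_errors.append(mse(predict(net, X_te), z_te))   # predict(): hypothetical forward pass
            avg_test_mse.append(np.mean(run_errors))       # average test MSE over the N runs
        # proxy for the selection rule: take the M with the smallest averaged test MSE
        best_M = int(np.argmin(avg_test_mse)) + 1
        return best_M, avg_test_mse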
We implemented the above algorithm for training an MLP network to approximate
function (1), and Table 1 shows the training parameters. The learning rate is
fixed merely for simplicity of the algorithm, and the maximum number of
neurons in the hidden layer is selected close to the number of hidden neurons in the MLP of [1].
Figure 5 and Figure 6 show the changes in test/train MSE versus M for the two cases
where the maximum number of epochs is 300 and 700. As mentioned earlier in
this text, the MSE curve is influenced by the maximum number of epochs, the initial weights,
and the learning rate. Hence, we may conclude that M = 10 or M = 9 is the optimal
number of neurons in the hidden layer, as a compromise between
generality and error minimization.
Besides, these two MSE curves show a few abrupt increases in the MSE value at
different M values. This apparently contradicts our expectations, since adding new
neurons adds more degrees of freedom. However, these sudden increases in the plot are
the result of one or more of these factors:
• Bad initialization of weights
• Small value of maximum number of epochs
• Early stopping due to simple stopping criterion
Table 1. MLP parameters of estimation of the first function
Parameter Value
Number of inputs 2
Number of outputs 1
Number of hidden neurons (M) Calculated adaptively (see Figure 4)
Number of runs of multi-start (N) 10
Hidden layer neurons tanh
Output layer neurons tanh
Train Algorithm Back-propagation (incremental)
Learning rate 0.03
Maximum number of neurons 21
Maximum epochs 300 and 700
Figure 5. Train/test MSE plot with a maximum of 300 epochs; selected M = 10
Figure 6. Train/test MSE plot with a maximum of 700 epochs; selected M = 9
The MATLAB Neural Network Toolbox uses a special initialization algorithm for the weights
(the Nguyen-Widrow method [4]), which reduces the risk of falling into local minima
and speeds up the training. Since the toolbox does not support incremental back-propagation
learning (see the note below), we did not use it in our implementation, but we decided to
use its initialization function instead of our simple weights drawn uniformly
between -1 and 1. Moreover, to address the three factors above, we set up a
new experiment with Nguyen-Widrow initialization, a maximum of 800 epochs,
and a stopping criterion with memory. This memory-enabled stopping triggers
only if the test MSE increases for 15 epochs in a row; thus, it tolerates, to some extent,
sudden increases in the test error during the training phase. Figure 7 shows the results
of this enhanced experiment: the abrupt changes in the test/training
error have disappeared almost completely and the curve is much smoother. There are three points of
interest in the plot. First, at M=3 there is a slight increase in the test MSE, which can be
neglected since the test error for M>3 has a steady tendency to decrease. At M=9
there is another small increase in the test error, but we can choose the point M=11 by
the same rationale as before. Meanwhile, the test/train error becomes larger or stays
almost constant after M=11, which is a good indication of the generalization limit of the
approximation. Thus, we conclude that M=11 is a good choice for the optimum
structure.
Figure 7. Train/test MSE with maximum epochs = 800, Nguyen-Widrow initialization, and memory-enabled stopping criterion
Note: "incremental" learning in the MATLAB Neural Network Toolbox refers to online training, which means updating the parameters by considering all the data available so far, not only the data point at the present moment.
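A minimal sketch of the memory-enabled (patience-style) stopping check described above, with illustrative names and assuming the per-epoch test MSE values are collected in a list:

    def early_stop_with_memory(test_mse_history, patience=15):
        # fire only when the test MSE has worsened for `patience` consecutive epochs
        if len(test_mse_history) <= patience:
            return False
        recent = test_mse_history[-(patience + 1):]
        return all(recent[i + 1] > recent[i] for i in range(patience))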
Figure 8. Test output and the desired output when M=11
Figure 8 shows the output of the trained NN with the selected structure on the test set, and Table 2 shows the parameters of its training.
Table 2. Parameters of training the selected structure M=11.
Parameter Value
Epochs 700
Stopping criterion fired No
Train MSE 0.0054
Test MSE 0.0061006
Weight initialization Nguyen-Widrow
Stopping criterion With memory = 15
Learning rate 0.03
1-b- MLP Network: Second Example
In this section, we train an MLP network with two hidden layers and tanh activation
functions by using the full-propagation (batch training) algorithm for the function
z = \left(1 + x^{0.5} + y^{-1} + v^{-1.5}\right)^{2}, \qquad (x, y, v) \in [1, 6] \times [1, 6] \times [1, 6].   (2)
We generated 216 pairs of input/output data points for the training phase and 125 pairs
for the test/validation phase. The train and test sets are stored to keep the
comparisons consistent. The distribution of the data points is uniform over the
function domain, the same as in the last section. The test and train sets contain data
points whose values exceed the [-1, 1] interval; thus, we decided to scale all inputs and outputs
to [-1, 1] for better training.
Figure 9. MLP 3×5×M×1 used for approximation of the second example
The key difference between this section and the last one is the training method, which is
not incremental. In batch learning, or in other words full-propagation, the
parameters change in the direction that reduces the average error the most.
This means the parameter updates may head toward a point that is good on average
but is only a local minimum, and the method is more likely to miss the
global minimum because it never takes the risk of occasionally moving in locally bad directions
(i.e., it has weak exploration ability). However, this difference only affects the model
assessment procedures, and the discussion about model selection remains well grounded
for this method.
In this example, we will find the optimum number of hidden neurons for the second
hidden layer while the first hidden layer is kept at a constant 5 neurons (see Figure
9). The algorithm in Figure 4 holds for full-propagation with slight modifications
for this example: first, the training phase is changed from incremental to batch learning
and, second, M now denotes the number of neurons in the second hidden layer only.
The MATLAB Neural Network Toolbox fully supports batch learning with various
types of learning methods. Thus, we decided to use this rich toolbox for our
experiments on the second example (see the note below), and for the learning method we chose
TRAINCGB: conjugate gradient back-propagation with Powell-Beale restarts. The
number of neurons in the second hidden layer can be smaller than in the last example,
since the NN in this example has two hidden layers and the complexity of the function
allows it. Therefore, we decided to set the maximum number of neurons
to 15. The number of multi-start runs, N, is kept the same, equal to 10, and the maximum
numbers of epochs are the same, 300 and 700. In addition, the weight initialization is
done by the Nguyen-Widrow algorithm as the default option of the toolbox.
Table 3 presents the parameters of the experiment for finding the efficient structure of the
network. Figure 10 and Figure 11 show the train/test MSE versus the number of
neurons in the second hidden layer for a maximum number of epochs equal to 300 and
700, respectively. The first plot shows an increase of the test error at M=4, which is the
result of an outlier, and another increase at M=7, which indicates that M=6 is an
efficient choice for the hidden neurons, since the test error tends to be worse or the same
for larger M values. In Figure 11, the maximum number of epochs is increased to 700
and the curve is more regular. Ruling out M=1 as the efficient solution,
due to outliers, it is clear that M=6 would be a good choice. Hence, we concluded that
M=6 is a good approximation of the efficient number of neurons in the second hidden
layer. Figure 12 shows the plot of the trained NN with M=6 neurons in the second
hidden layer.
Table 3. MLP parameters of estimation of the second function
Parameter Value
Number of inputs 3
Number of outputs 1
Number of neurons in 1st hidden layer 5
Number of neurons in 2nd hidden layer Calculated adaptively (see Figure 4)
Number of runs of multi-start (N) 10
Note: the MATLAB implementation of the neural networks is not fast enough for numerous tests, since it creates a unique M-file at each call of train, which takes considerable time.
Neuron activation functions tanh (for all layers)
Train Algorithm Full-propagation (Batch)
Maximum value of M 15
Maximum epochs 300 and 700
Figure 10. Train/test MSE plot with a maximum of 300 epochs; selected M = 5
Figure 11. Train/test MSE plot with a maximum of 700 epochs; selected M = 6
Figure 12. Test output and the desired output when M=6 (3×5×6×1)
1-c- MLP Network: Presence of Noise
Adding neurons to the hidden layer of a neural network amounts to increasing its memory
and its ability to specialize. However, it also makes the network more vulnerable to
input noise and reduces its generality. In this section, we repeat our experiments
of section 1-a on noisy data to examine this fact. We anticipate that the
efficient number of neurons in the hidden layer will be, to some extent, inversely
proportional to the noise variance. In addition, by adding more noise, we expect the error
variance, and thus the MSE, to increase.
Table 4. Variance (σ²) values of the input noise
Label σ²
Case 1 Low 0.05
Case 2 Medium 0.1
Case 3 High 0.5
We add the three cases of low, medium, and high input noise to our train and test data
and repeat the procedure of section 1-a to calculate the optimum M. Table 4 presents the
variance of the input noise, and Figure 13 to Figure 15 show the test/train MSE versus
M for the different noise cases.
Figure 13. Train/test MSE when the noise variance = 0.05 (low noise)
Figure 14. Train/test MSE when the noise variance = 0.1 (medium noise)
Figure 15. Train/test MSE when the noise variance = 0.5 (high noise)
To obtain more comparable results, we decided to fit a polynomial of degree 3 to each
of these test MSE sets and to depict them in a single plot. Figure 16 shows the result of the
polynomial fit. The two apparent effects of adding noise, an increase of the error variance
and a decrease in the efficient number of neurons in the hidden layer, are demonstrated
very clearly by Figure 16.
Figure 16. Low-pass polynomial of degree 3 curve fitted on the test error data versus M.
Therefore, we conclude that the efficient value of M for the low, medium, and high
cases of input noise is 10, 9, and 5, respectively.
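The smoothing itself is a plain least-squares polynomial fit; a minimal sketch is given below (assuming the averaged test MSE values per M are already stored in an array, here named test_mse, which is an illustrative name):

    import numpy as np

    M_vals = np.arange(1, 22)                    # hidden-layer sizes that were tried
    coeffs = np.polyfit(M_vals, test_mse, 3)     # degree-3 least-squares fit to the test MSE curve
    smooth = np.polyval(coeffs, M_vals)          # smoothed curve of the kind plotted in Figure 16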
1-d- MLP Network: Comparison with ANFIS
ANFIS, originally introduced by Jang [1], is a powerful adaptive neuro-fuzzy
structure with a great ability for function approximation and nonlinear
identification. Our two examples are the same as the first two examples in [1], with the
same number of data points. Thus, a comparison of our results with those of [1] evaluates
the performance of our structures.
Table 5. Comparison of the SMSE (square root of the mean square error) of [1] and our MLP structure for the first function (1)
Model SMSE
ANFIS 0.001
2×18×1 MLP in [1] 0.108
Our 2×6×1 MLP (on the train set) 0.0735
Table 6. Comparison of our (3×5×6×1) MLP, trained by different methods, with Table 2 of [1].
Model APEtrain APEtest Parameters Training set size Test set size
ANFIS 0.043 1.066 50 216 125
GMDH model 4.7 5.7 - 20 20
Fuzzy model 1 1.5 2.1 22 20 20
Fuzzy model 2 0.59 3.4 32 20 20
Our MLP (traincgb, 300 epoch) 5.40 6.43 58 216 125
Our MLP (traincgb, 500 epoch) 2.33 3.11 58 216 125
Our MLP (trainlm, 300 epoch) 0.0036 0.0057 58 216 125
Our MLP (trainlm, 500 epoch) 0.03668 0.0745 58 216 125
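For reference, the error measures used in these tables are, as we apply them here (the APE follows the average percentage error of [1], over P data pairs with targets T(i) and model outputs O(i)):

    \mathrm{APE} = \frac{100}{P} \sum_{i=1}^{P} \frac{|T(i) - O(i)|}{|T(i)|}, \qquad
    \mathrm{SMSE} = \sqrt{\frac{1}{P} \sum_{i=1}^{P} \bigl(T(i) - O(i)\bigr)^{2}}.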
In Table 5, the SMSE of our trained structure is compared to the results of [1]. It is
clear that the ANFIS method has done a better job of function approximation, but the
shortcoming of the MLP could be the result of our simple back-propagation algorithm, which
is in danger of local minima. It is noteworthy that the 2×18×1 MLP
network in [1] was well trained, and yet its error is considerably worse than
our result. Although the test set error would be more conclusive, it is not reported
for this example in [1]. For reference, the SMSE of our 2×6×1 MLP on the test set is
0.0781.
Table 6 compares the results for the second function, for which we used a 3×5×6×1
MLP network with the structure of Figure 9. Besides, it contains a list of results of
different methods from [1]; thus, we may consider this table an updated version of
Table 2 in [1]. These results show that our MLP structure, when it is well trained, is a
powerful model for nonlinear function approximation. The results of training with the
conjugate gradient method are comparable with some of the results in Table 6,
and, on the other hand, the Levenberg-Marquardt back-propagation
training has achieved even better results than ANFIS. Since the test and train set sizes
are the same and the numbers of parameters are comparable (50 vs. 58), we may conclude
that the MLP with two hidden layers has done a better job of approximating this
function. Hence, the remaining problems of the MLP as a universal function
approximator are the training method, the danger of falling into local minima, and the
cumbersome task of model selection.
2-a- RBF Network: First Example
Similar to section 1-a, in this section we intend to build an efficient model for
approximating the first example; however, RBF networks are our tool here
instead of MLP networks. We decided to use an approach that combines
structure and parameter estimation, and thus we chose a subset selection strategy,
OLS, for training our RBF network. This algorithm is implemented in the MATLAB
Neural Network Toolbox, and we used it in our implementation.
The subset selection strategy is as follows [5]. First, a large number of potential basis
functions are specified; in this case, we place a Gaussian function with a fixed standard
deviation on each training data point. Second, a subset selection technique is
applied in order to determine the most important of all the potential basis functions
with respect to the available data. Since for the potential basis functions only the
linear output layer weights are unknown, an efficient linear subset selection technique
such as OLS can be applied [6]. At the end of the process, the optimal weights and
structure are obtained by the OLS algorithm.
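To make the idea concrete, the following is a simplified sketch of subset selection for such an RBF network: every training point is a candidate Gaussian centre with a fixed spread, centres are added greedily, and the linear output weights are refit by least squares at each step. It only mimics the spirit of OLS [6] and of the toolbox routine; the names and the brute-force selection loop are illustrative assumptions, not the actual implementation.

    import numpy as np

    def gaussian_design(X, centres, spread):
        # design matrix of Gaussian responses for all samples and all centres
        d2 = ((X[:, None, :] - centres[None, :, :]) ** 2).sum(axis=2)
        return np.exp(-d2 / (2.0 * spread ** 2))

    def greedy_rbf(X, z, M, spread):
        chosen, remaining = [], list(range(len(X)))
        for _ in range(M):
            best = None
            for j in remaining:                                   # try each unused point as the next centre
                Phi = gaussian_design(X, X[chosen + [j]], spread)
                w, *_ = np.linalg.lstsq(Phi, z, rcond=None)       # refit the linear output weights
                err = np.mean((Phi @ w - z) ** 2)
                if best is None or err < best[0]:
                    best = (err, j)
            chosen.append(best[1]); remaining.remove(best[1])
        centres = X[chosen]
        Phi = gaussian_design(X, centres, spread)
        w, *_ = np.linalg.lstsq(Phi, z, rcond=None)
        return centres, w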
Figure 17. RBF network structure
In comparison with MLP networks, roughly speaking, the number of radial basis
functions is equivalent to the number of neurons in the hidden layer of an MLP, and the role of
the centers is roughly similar to that of the input weights and structure of the MLP network.
To find the optimum structure of the RBF network, we decided to plot the train/test
error of the network versus the number of radial basis functions, denoted by M. For a
fixed variance, adding new neurons reduces the error on both the test and train sets, but
from some certain M onward the rates of error reduction of the two curves diverge. This
M can be considered the optimum number of neurons, since the test error does
not change significantly as we add new neurons (memory) to the network.
Besides, the OLS algorithm for RBF training needs another parameter, named
SPREAD (see the note below), which controls the smoothness and generalization of the approximation.
The larger the SPREAD is, the smoother the function approximation will be. Too
large a spread means a lot of neurons will be required to fit a fast-changing function.
Too small a spread means many neurons will be required to fit a smooth function, and
the network may not generalize well. To find the best value of SPREAD for the given
problem, we should train the RBF with different spreads.
Figure 18 shows the behavior of the test and train errors versus M when the SPREAD value is kept
constant. It is clear that from M=10 onward, adding new neurons to the network does not
reduce the test error as much as it reduces the error on the train set. Thus, using more than M
neurons would be a waste of resources.
Note: SPREAD is the distance an input vector must be from a neuron's weight vector to produce an output of 0.5. In other words, if a neuron's weight vector is at a distance of SPREAD from the input vector, its net input is sqrt(-log(0.5)) (or 0.8326), and therefore its output is 0.5.
Figure 18. Behavior of the test/train error versus M, SPREAD = 0.32
Hence, finding the optimum structure of our RBF network means finding the best
pair (M, SPREAD) that gives a low test error while keeping generalization. We
designed an algorithm to achieve this goal. For a series of SPREAD values from 0.1 to 3,
we start from M=1 and calculate the test and train MSE of the RBF; then we
increment M and calculate the errors again. We continue the procedure until the
criterion in (3) is satisfied. This condition heuristically marks the M at which the two
curves of train and test error start to diverge, i.e. the train error becomes less than
70% of the test error:
\frac{\mathrm{MSE}_{\mathrm{Train}}}{\mathrm{MSE}_{\mathrm{Test}}} < 0.70   (3)
Then we store the triple (M, SPREAD, test error) and continue in the same way with the next
SPREAD value. At the end, we plot the curve of the stored test errors, and the network
structure with the selected M and SPREAD values is chosen as the optimum structure.
Figure 19 shows the flowchart of our algorithm.
Figure 19. Proposed algorithm for finding the optimum value of SPREAD and the number of neurons (M)
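A compact sketch of this search is given below, assuming the greedy_rbf and gaussian_design helpers sketched in section 2-a; the loop bounds and names are illustrative.

    import numpy as np

    def search_m_spread(X_tr, z_tr, X_te, z_te, spreads, max_M=100):
        records = []
        for spread in spreads:                               # e.g. np.arange(0.1, 3.0, 0.1)
            for M in range(1, max_M + 1):
                centres, w = greedy_rbf(X_tr, z_tr, M, spread)
                mse_tr = np.mean((gaussian_design(X_tr, centres, spread) @ w - z_tr) ** 2)
                mse_te = np.mean((gaussian_design(X_te, centres, spread) @ w - z_te) ** 2)
                if mse_tr / mse_te < 0.70:                   # criterion (3): the two curves start to diverge
                    records.append((M, spread, mse_te))
                    break
        # the stored (M, SPREAD) pair with the lowest test MSE is taken as the structure
        return min(records, key=lambda r: r[2])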
Figure 20 shows the result of the algorithm of Figure 19, where the SPREAD value is
shown in red and M is indicated for each case. Hence, we conclude that M=14
with SPREAD = 0.32 gives satisfactory results. However, as Figure 18 shows, we
could obtain a better approximation of the test set with M=45 (or even M=100) at the cost
of overusing resources (see Figure 21).
Figure 20. MSE of Test set for different pairs of (M,SPREAD) with stopping rule available
Figure 21. Test set plot for SPREAD = 0.32. (Top Left) M=14 MSE = 0.0044, (Top Right) M=45
MSE = 0.0011, (Bottom) M= 100 MSE= 0.00040
2-b- RBF Network: Second Example
For the second function, we will use the same algorithm as in Figure 19 to find the
best pair (M, SPREAD), since the behavior of the train/test error during the training phase
is the same (see Figure 22). The selected pair is (31, 4.2).
Figure 22. (Left) Train/test error when SPREAD = 4.2. (Right) Curve of test MSE for selecting the best (M, SPREAD) pair
Figure 23. Output of RBF on test set when SPREAD = 4.2 (Top LEFT) M = 31, (Top Right) M = 50 (Bottom) M = 100
It should be noted that in Figure 22 (right) the best results come from pairs in which one
element of (M, SPREAD) is low and the other is high. This behavior reflects
the effort to preserve generality in the algorithm. However, as mentioned before, this
algorithm gives a greedy pair (M, SPREAD), and if we want to obtain the best
performance we can add more neurons. The change in (M, SPREAD) from the
last example is due to many factors, including the data samples, the change in the complexity of
the function, and the increase in the dimension of the problem.
2-c- RBF Network: Presence of Noise
RBF networks are generally believed to be good at handling noise, since they treat noise locally
and it does not affect the global performance considerably. A large amount
of noise does have a negative effect on RBF networks, but we will see that there is good robustness to
medium and low noise, and especially better robustness than in MLP networks.
The model is tested for the three cases of noise discussed in Table 4, and the results are
presented in Figure 24 to Figure 26.
Figure 24. High-noise case, σ² = 0.5. (Left) Test set results, MSE = 0.194. (Right) Curve of test set MSE for selecting (M, SPREAD); (15, 0.7) is selected.
Figure 25. Medium-noise case, σ² = 0.1. (Left) Test set results, MSE = 0.194. (Right) Curve of test set MSE for selecting (M, SPREAD); (37, 0.3) is selected.
Figure 26. Low-noise case, σ² = 0.05. (Left) Test set results, MSE = 0.194. (Right) Curve of test set MSE for selecting (M, SPREAD); (38, 0.2) is selected.
The interesting behavior of the pair (M, SPREAD) in the presence of noise is that the
best value of M decreases and the best SPREAD increases as the noise increases, which means
that with low noise we can use more neurons with a smaller spread. Figure 27
summarizes the results of our comparison for RBF networks.
Figure 27. The rough comparison of behavior of the best M and SPREAD values in presence of noise in RBF networks
2-d- RBF Network: ANFIS Comparison
In this section, we summarize our results for the RBF network and compare them with the ANFIS
results in [1]. For function (1), our results are far from the astonishing results of [1],
but they are reasonably better than our previous MLP results.
This was expected, since RBF networks are more efficient function approximators
than MLP networks trained with a simple gradient descent algorithm, and they interpolate
nicely (see Table 7).
Table 7. Comparison of the SMSE of [1] and our RBF structures for the first function (1)
Model SMSE
ANFIS 0.001
RBF with M=31, SPREAD=0.32 0.0664
RBF with M=45, SPREAD=0.32 0.0336
RBF with M=100, SPREAD=0.32 0.0202
For the second function, however, our NRBF network provides nearly the same result on
the test set with only slightly more parameters than ANFIS. Besides, by adding more radial basis
neurons, M=100, we obtain superior results that surpass the ANFIS results. By a rough
comparison, we note an interesting point in the small difference between the test and
train errors of our best NRBF and the large difference in ANFIS: the NRBF
networks do a better job of interpolation (and also extrapolation) than ANFIS.
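For reference, the normalized RBF output referred to here is of the standard form (this formulation, with Gaussian basis functions phi_i, centres c_i, and widths sigma_i, is stated as an assumption rather than taken from the original implementation):

    \hat{z}(x) = \frac{\sum_{i=1}^{M} w_i \, \phi_i(x)}{\sum_{j=1}^{M} \phi_j(x)}, \qquad
    \phi_i(x) = \exp\!\left(-\frac{\lVert x - c_i \rVert^2}{2\sigma_i^2}\right).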
Table 8. Comparison of our RBF for the function (2) with Table 2 of [1]
Model APEtrain APEtest Parameters Training set size Test set size
ANFIS 0.043 1.066 50 216 125
GMDH model 4.7 5.7 - 20 20
Fuzzy model 1 1.5 2.1 22 20 20
Fuzzy model 2 0.59 3.4 32 20 20
Our NRBF 1 (SPREAD=4.2) 5.6242 2.4362 M=31 216 125
Our NRBF 2 (SPREAD=4.2) 1.1236 2.3014 M=50 216 125
Our NRBF 3 (SPREAD=4.2) 0.6247 1.8514 M=60 216 125
Our NRBF 4 (SPREAD=4.2) 0.1911 0.0568 M=100 216 125
3-a- LLNF: First Example
The network structure of locally linear neuro-fuzzy (LLNF) networks is depicted in
Figure 28. Each neuron realizes a local linear model (LLM) and an associated validity
function that determines the region of validity of the LLM [5]. The validity functions,
which are similar to basis functions in RBF and could be Gaussians, are normalized
such that \sum_{i=1}^{M} \Phi_i(z) = 1 for any input z. The output of this model is calculated as

\hat{z} = \sum_{i=1}^{M} \left( w_{i0} + w_{i1} x_1 + \ldots + w_{i n_x} x_{n_x} \right) \Phi_i(z),   (4)

where the local linear models depend on x = [x_1 \; x_2 \; \ldots \; x_{n_x}]^T and the validity
functions depend on z = [z_1 \; z_2 \; \ldots \; z_{n_z}]^T. However, in this report, we simply
consider the case where the linear models and the validity functions use the same inputs x.
This network simply interpolates between the linear hyperplanes, which approximate the
function locally, by means of the nonlinear neurons.
Figure 28. Locally Linear Neuro-Fuzzy Network
We used the LLNF network for the approximation of the first and second functions, (1)
and (2), with the LOLIMOT incremental training algorithm [7]. The incremental nature of
the algorithm provides the ability to find the optimum structure for the model as well
as to minimize the error on the test set. At each step, the algorithm selects the worst
local model over the function domain, splits its region in half along each of the
axis-orthogonal directions, and fits the parameters of the resulting local models. The
best of these candidate splits is then kept, and the algorithm iterates until the stopping
criterion breaks it. The use of the SSE (sum of squared errors) in choosing the worst
region to split makes the algorithm sensitive to a high density of data points in the input
space; thus, the denser areas are approximated with more linear models, and this
increases the model accuracy. A detailed description of the algorithm can be found
in [5,7].
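The core numerical step inside each LOLIMOT iteration is the weighted least-squares fit of a local linear model, with the Gaussian validity values acting as weights. A minimal, self-contained sketch of this per-model step is given below; the helper name, the un-normalized validity values, and the direct matrix solve are illustrative simplifications, not the exact routine of [7].

    import numpy as np

    def fit_local_model(X, z, centre, widths):
        # Gaussian validity of each sample for this local model (normalization over all
        # models, which makes the validity functions sum to 1, is omitted for brevity)
        phi = np.exp(-0.5 * (((X - centre) / widths) ** 2).sum(axis=1))
        R = np.hstack([X, np.ones((len(X), 1))])          # regressors [x1 ... xn 1], cf. Eq. (5)
        W = np.diag(phi)
        # weighted least squares: theta = (R^T W R)^{-1} R^T W z
        theta = np.linalg.solve(R.T @ W @ R, R.T @ W @ z)
        return theta                                       # slope and offset parameters of one LLM in Eq. (4)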
We have implemented a MISO LOLIMOT algorithm, which successfully
approximates the 2-input and 3-input functions in (1) and (2). The stopping criterion is
the point where the test set error starts to increase or stays constant. Figure 29 (top-left)
shows the result of the algorithm, which has selected M=42 as the optimized
number of LLMs. Figure 29 (top-right) shows the good approximation
behavior of the model, which is inherited from its ability to interpolate very well. Figure
29 (bottom) depicts the placement of the Gaussian LLMs in the input space. As
expected, more LLMs are placed at the center, where the function changes more
rapidly and more linear models are needed for a good approximation.
Figure 29. Implementation of the LOLIMOT algorithm for the first function; best M=42. (Top-left) Train/test error and stopping criterion. (Top-right) Test set output, MSE = 0.00184. (Bottom) Gaussian basis function structure for the selected M=42
3-b- LLNF: Second Example
The design of the LLNF with the LOLIMOT learning algorithm is straightforward for the
second function (2). As Figure 30 shows, the train/test error tends to decrease as M
increases, and thus the best performance could be achieved with a large number of
LLMs, but we decided to limit the maximum number of neurons to 100 for simplicity.
Besides, the strange behavior of the test set error at M=20 is interpreted as the lack of
good data points with a similar distribution over a small area of the function input, which
may cause the error to decrease on the train set and increase on the test set.
Figure 30 (top-right) shows the network output for M = 23, for which the MSE and APE
are not small enough. Figure 30 (bottom-left) shows the acceptable results when
M=100, where the MSE is reduced to 0.00000069 and the APE is 3.05.
In addition, we decided to test a special case in which the regressors of the linear
models are not simply the raw inputs. By default, LOLIMOT takes the regressors to be
the plain input values, so the regression matrix is

X = [\, x \;\; y \;\; v \;\; 1 \,],   (5)

an N×4 matrix, where N denotes the number of data samples. However, in the second function, as we
know, there are some inverse terms, so we decided to use one different regressor,
which changes one column of the matrix (5). We replaced the v values defined in (2) in the
regression matrix with v^{-1} and obtained the new regression matrix X_i = [\, x \;\; y \;\; v^{-1} \;\; 1 \,].
The results in Figure 30 (bottom-right) show an increase in accuracy and a reduction in the
number of neurons necessary for achieving a low error rate. Thus, we conclude that
with well-defined regressors in the locally linear models we can obtain better
approximation results.
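A two-line sketch of this regressor change (assuming x, y, and v are 1-D arrays holding the sample values; the names are illustrative):

    import numpy as np

    X_lin  = np.column_stack([x, y, v, np.ones_like(x)])        # X  = [x  y  v  1], cf. Eq. (5)
    X_spec = np.column_stack([x, y, 1.0 / v, np.ones_like(x)])  # Xi = [x  y  v^-1  1], the special regressors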
Figure 30. Implementation of LOLIMOT for the second function. (Top-left) Train/test error. (Top-right) Network output, M = 23, MSE = 0.0000249. (Bottom-left) M = 100, MSE = 0.00000679. (Bottom-right) Special regressors X_i = [x y v^{-1} 1], M = 60
3-c- LLNF: Presence of noise
In this section, we investigate the results of the approximation and training of the LLNF
in the presence of noise. We expect the LOLIMOT algorithm to show good performance
in overcoming noise. In addition, as in the previous sections, we anticipate a reduction in
the number of optimal neurons as the noise increases, in order to preserve the generality of the
approximation. Figure 31 shows the results for the different types of noise in Table 4, and
Figure 32 depicts the placement of the LLMs in the input space for the different noise cases.
The obvious effects of increasing the noise are
• Reduction of optimal M,
• Increase of error,
• Misplacement of LLMs.
Figure 31. LOLIMOT train/test training curves and test-set outputs for the different cases of noise. (Top-left) Train/test MSE, selected M = 49, low noise. (Top-right) Test-set output for M = 49, low noise. (Middle-left) Train/test MSE, selected M = 49, medium noise. (Middle-right) Test-set output for M = 49, medium noise. (Bottom-left) Train/test MSE, M = 26, high noise. (Bottom-right) Test-set output for M = 26, high noise
Figure 32. Placement of Gaussians in the input space in presence of noise (Top-Left) Low noise (Top-Right) Medium noise (Bottom) High noise
3-d- LLNF: Comparison with ANFIS
In comparison with the ANFIS results, our LOLIMOT shows comparable performance on the
first function and worse performance on the second one. As Table 9 shows, the
approximation of the first function compares favorably with the ANFIS result.
Table 9. Comparison of the results of LOLIMOT with ANFIS for the first function (1)
Model SMSE
ANFIS 0.001
LOLIMOT with M = 42 0.00184
Meanwhile, the results for the second function show that the approximation is not
quite as good. We concluded that, since the LLNF is a far richer model than the NRBF and
the MLP, the higher error rates of LOLIMOT are due to a bad partitioning of the input space,
caused by bad data samples. More data points, or data with a different distribution,
would improve the performance.
Table 10. Comparison of our LOLIMOT with Table 2 of [1].
Model APEtrain APEtest Parameters Training set size Test set size
ANFIS 0.043 1.066 50 216 125
GMDH model 4.7 5.7 - 20 20
Fuzzy model 1 1.5 2.1 22 20 20
Fuzzy model 2 0.59 3.4 32 20 20
Our LOLIMOT 1 0.67 5.98 M=23 216 125
Our LOLIMOT 2 0.055 3.05 M=100 216 125
Our LOLIMOT 3 (with special regressors) 0.066 3.75 M=60 216 125
4- Conclusion and Summary
In this report, we have investigated the model selection and training of MLP, RBF,
NRBF, and LLNF networks. In addition to the methods and algorithms we
implemented extensively for model selection, we provide comparisons of our results
on the two well-known function approximation problems of [1]. This gives us the ability
to evaluate our implementations and algorithms. Besides, we have devised an ad-hoc
method for model selection in RBF networks in which the Gaussian
placement is done by OLS (Figure 19). In the implementation phase, we
developed a rich library for the LOLIMOT algorithm, which we have used in the final
project for a Swarm extension to LOLIMOT. The Swarm extension continuously
finds the optimized partitioning instead of using the heuristic bisection of the original
incremental LOLIMOT.
As a conclusion, we provide the following table, which compares the different
neural networks for function approximation on the basis of our implementation
experience.
Table 11. Comparison of different properties of the implemented networks
Properties MLP RBF NRBF LLNF
Interpolation + + + ++
Training speed - ++ ++ +
Model selection simplicity -- + + ++
Memory requirement + - - --
Accuracy ++ + + ++
Simplicity of usage ++ + + -
Noise rejection + ++ N/A +
(+ = model property is favorable, - = model property is not favorable)
References
[1] Jang, J.-S. R., "ANFIS: Adaptive-Network-based Fuzzy Inference Systems," IEEE Transactions on Systems, Man, and Cybernetics, Vol. 23, No. 3, pp. 665-685, May 1993.
[2] T. Hastie, R. Tibshirani, and J. Friedman, The Elements of Statistical Learning: Data Mining, Inference, and Prediction, pp. 195-196, Springer, 2001.
[3] V. Kecman, Learning and Soft Computing: Support Vector Machines, Neural Networks, and Fuzzy Logic Models, Cambridge, MA: The MIT Press, 2001.
[4] D. Nguyen and B. Widrow, "Improving the learning speed of 2-layer neural networks by choosing initial values of the adaptive weights," IJCNN International Joint Conference on Neural Networks, San Diego, CA, USA, 17-21 June 1990, vol. 3, pp. 21-26.
[5] O. Nelles, Nonlinear System Identification: From Classical Approaches to Neural Networks and Fuzzy Models. Berlin, Germany: Springer-Verlag, 2001.
[6] S. Chen, C. F. N. Cowan, and P. M. Grant, "Orthogonal least squares learning algorithm for radial basis function networks," IEEE Transactions on Neural Networks, vol. 2, no. 2, March 1991.
[7] O. Nelles, A. Fink, and R. Isermann, "Local Linear Model Trees (LOLIMOT) Toolbox for Nonlinear System Identification," 12th IFAC Symposium on System Identification (SYSID), Santa Barbara, USA, 2000.