Applied Mathematical Sciences, Vol. 8, 2014, no. 62, 3051 - 3062
HIKARI Ltd, www.m-hikari.com
http://dx.doi.org/10.12988/ams.2014.44270
A Hybrid GMDH and Box-Jenkins Models in
Time Series Forecasting
Ani Shabri
Department of Mathematics
Faculty of Science
University Technology of Malaysia
81310 Skudai, Johor, Malaysia
Ruhaidah Samsudin
Department of Software Engineering
Faculty of Computing
University Technology of Malaysia
81310 Skudai, Johor, Malaysia
Copyright © 2014 Ani Shabri and Ruhaidah Samsudin. This is an open access article distributed under the
Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any
medium, provided the original work is properly cited.
Abstract
The group method of data handling (GMDH) technique and the Box-Jenkins method are two well-known
mathematical approaches to time series forecasting. In this paper, we introduce a hybrid
model which combines the GMDH method with the Box-Jenkins method to model time series
data. The Box-Jenkins method is used to determine the useful input variables of the GMDH
method, and the GMDH method then serves as the time series forecaster. The lynx series,
which contains the number of lynx trapped per year, is used in this study to demonstrate the
effectiveness of the forecasting model. The results found by the proposed GMDH model were
compared with the results of the Box-Jenkins and artificial neural network (ANN) models. The
comparison of modeling results shows that the GMDH model performs better than the two other
models in terms of mean absolute error (MAE) and root mean square error (RMSE). It
also indicates that GMDH provides a promising technique for time series forecasting.
Keywords: Group method of data handling technique, time series forecasting, neural networks,
Box-Jenkins method
1. Introduction
The most comprehensive, popular and widely known statistical models that have been
utilized over the last four decades for time series forecasting are the Box-Jenkins models. However,
the Box-Jenkins model is only a class of linear models and thus can only capture the linear features
of a time series [1], whereas many time series are nonlinear and chaotic.
More advanced nonlinear methods such as neural networks have been frequently applied in
nonlinear time series modeling and chaotic time series modeling in recent years [2, 3, 4, 5, 6, 7].
ANNs provide an attractive alternative tool for forecasting researchers and practitioners and have shown their
nonlinear modeling capability in time series forecasting.
One sub-model of ANN is the group method of data handling (GMDH) algorithm, first
developed by [8]. This model has been successfully used to deal with uncertain, linear or
nonlinear systems in a wide range of disciplines such as engineering, science, economics,
medical diagnostics, signal processing and control systems [9, 10, 11].
Improving forecasting accuracy, especially time series forecasting accuracy, is an important yet often
difficult task facing many decision makers in a wide range of areas. Combining several models
or using hybrid models can be an effective way to improve forecasting performance. Several
studies have suggested hybrid models, such as combining the ARIMA and ANN models
[1, 12, 13, 14], the GMDH and ANN models [14], GMDH and differential evolution [15], and GMDH
and LSSVM [16]. More recently, a new class of neural network combining the ANN model with the
Box-Jenkins (BJ) approach was explored for modeling time series [17, 18, 19, 20]. The BJ
approach was used to determine the most important variables as input nodes in the input layer,
and ANNs were then used to model the time series data. Their results showed that the hybrid
model can be an effective way to improve predictions when the variables of the input
layer of the ANN are chosen based on the BJ approach rather than on traditional methods.
In this paper, a new hybrid GMDH-type algorithm is proposed by combining the GMDH
model with the BJ approach to model time series data. The BJ approach is used to generate the
most useful variables as input nodes in the input layer from the time series under study.
Then, a GMDH model is used to model the data generated by the Box-Jenkins model and to predict
the future of the time series. To verify the application of this approach, the Canadian lynx data set
is used in this study.
2. Forecasting Methodology
This section presents the ARIMA, ANN and GMDH models used for modeling time series.
These models were chosen because they have been widely and successfully used in time series forecasting.
2.1 Box-Jenkins Approach
The Box-Jenkins method, introduced by [21], has been one of the most popular
approaches to time series analysis and prediction. The general ARIMA model is
represented in the following way:
φ_p(B)(1 - B)^d y_t = θ_q(B) a_t  (1)

where φ_p(B) and θ_q(B) are polynomials of order p and q, respectively, and d is the number of regular
differences. The random errors a_t are assumed to be independently and identically distributed with
a mean of zero and a constant variance σ². The Box-Jenkins methodology is basically divided into four steps: identification, estimation, diagnostic checking and forecasting. In the
identification step, transformation is often needed to make time series stationary. The next step is
choosing a tentative model by matching both the autocorrelation (ACF) and partial
autocorrelation function (PACF) of the stationary series. Once a tentative model is identified, the
parameters of the model are estimated. The last step of model building is the diagnostic checking
of model adequacy, basically to check whether the model assumptions about the errors a_t are satisfied.
The process is repeated several times until a satisfactory model is finally selected. The
forecasting model was then used to compute the fitted and forecast values.
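The estimation step can be sketched numerically. The fragment below is a minimal illustration assuming NumPy is available (the paper itself uses an R package); `fit_ar` is our hypothetical helper, not from the paper. It fits an AR(2) model by ordinary least squares to a simulated series and checks that the known coefficients are approximately recovered.

```python
import numpy as np

def fit_ar(y, p):
    """Estimate AR(p) coefficients [c, phi_1, ..., phi_p] by least squares."""
    n = len(y)
    # Regressor matrix: intercept plus the p lagged values of the series
    X = np.column_stack([np.ones(n - p)] +
                        [y[p - i:n - i] for i in range(1, p + 1)])
    coef, *_ = np.linalg.lstsq(X, y[p:], rcond=None)
    return coef

# Simulate a stationary AR(2) process with known coefficients
rng = np.random.default_rng(0)
y = np.zeros(500)
for t in range(2, 500):
    y[t] = 1.3 * y[t - 1] - 0.7 * y[t - 2] + rng.normal()

coef = fit_ar(y, 2)
print(coef.round(2))  # roughly [0.0, 1.3, -0.7]
```

A full Box-Jenkins cycle would precede this with ACF/PACF-based identification and follow it with residual diagnostics, as described above.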
2.2 The Neural Network Forecasting Model
The artificial neural networks (ANN) that serve as flexible computational frameworks have
been extensively studied and have gained much popularity in many areas of application, including
science, psychology and engineering. The single-hidden-layer feed-forward ANN is
the most widely used and suitable for modeling and forecasting in time series. In a feed-forward
ANN, the neurons are usually arranged in layers. The first layer is the input layer where the data
are introduced to the network, the second layer is the hidden layer where data are processed and
the last layer is the output layer where the results of given input are produced. The structure of a
feed-forward ANN is shown in Fig. 1.
Fig. 1 Architecture of three layers feed-forward back-propagation ANN
The relationship between the input observations (y_{t-1}, y_{t-2}, …, y_{t-p}) and the output value y_t,
assuming a linear output neuron, is given by

y_t = g( b_0 + Σ_{j=1}^{q} w_{0j} f( b_j + Σ_{i=1}^{p} w_{ij} y_{t-i} ) )  (2)

where b_j (j = 0, 1, 2, …, q) is a bias on the jth unit, and w_{ij} (i = 0, 1, 2, …, p; j = 0, 1, 2, …, q)
are connection weights; f and g are the hidden and output layer activation functions, respectively
[22]. Several optimization algorithms can be used to train the ANN. Among the several training
algorithms available, back-propagation has been the most popular and most widely used [23]. In
a back-propagation network, the weights and bias values are initially chosen as random numbers
and then adjusted by the results of a training process. The goal of the training algorithm is to minimize
the global error.
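As a minimal sketch of back-propagation training (assuming NumPy; the toy target and network size are ours, not the paper's), the loop below trains a 1-5-1 network with a tangent sigmoid hidden layer and a linear output by full-batch gradient descent:

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.uniform(-1, 1, size=(200, 1))
y = np.sin(np.pi * X)                      # toy nonlinear target

# 1-5-1 network: tanh (tangent sigmoid) hidden layer, linear output
W1 = rng.normal(0.0, 0.5, (1, 5)); b1 = np.zeros(5)
W2 = rng.normal(0.0, 0.5, (5, 1)); b2 = np.zeros(1)

lr = 0.1
for _ in range(4000):
    H = np.tanh(X @ W1 + b1)               # hidden activations f(.)
    out = H @ W2 + b2                      # linear output g(.)
    err = out - y
    # Back-propagate the error to obtain weight and bias gradients
    gW2 = H.T @ err / len(X); gb2 = err.mean(axis=0)
    dH = (err @ W2.T) * (1.0 - H ** 2)     # tanh'(u) = 1 - tanh(u)^2
    gW1 = X.T @ dH / len(X); gb1 = dH.mean(axis=0)
    W2 -= lr * gW2; b2 -= lr * gb2
    W1 -= lr * gW1; b1 -= lr * gb1

mse = float(((np.tanh(X @ W1 + b1) @ W2 + b2 - y) ** 2).mean())
print(mse)    # should be much lower than always predicting the mean (~0.5)
```

The weights start as random numbers and are repeatedly adjusted against the global error, exactly the training cycle described in the text.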
2.3 The Group Method of Data Handling
The GMDH method was originally formulated to fit higher-order regression polynomials,
especially for solving modeling and classification problems. The general connection between the input
and output variables can be expressed by a complicated polynomial series in the form of the
Volterra series, known as the Kolmogorov-Gabor polynomial [8]:
y = a_0 + Σ_{i=1}^{M} a_i x_i + Σ_{i=1}^{M} Σ_{j=1}^{M} a_{ij} x_i x_j + Σ_{i=1}^{M} Σ_{j=1}^{M} Σ_{k=1}^{M} a_{ijk} x_i x_j x_k + …  (3)

where (x_i, x_j, x_k, …) is the vector of input variables, M is the number of inputs and (a_0, a_i, a_{ij}, a_{ijk}, …)
is the vector of summand coefficients. However, for most applications the quadratic form,
called a partial description (PD), in only two variables is used in the form
y = G(x_i, x_j) = a_0 + a_1 x_i + a_2 x_j + a_3 x_i x_j + a_4 x_i² + a_5 x_j²  (4)

to predict the output. The input variables are set to (x_1, x_2, x_3, …, x_M) and the output is set to {y}. The
coefficients a_i for i = 0, 1, …, 5 are determined using the least squares method. The framework of
the design procedure of the GMDH consists of the following steps.
Step 1: Select the input variables X = {x_1, x_2, …, x_M}, where M is the total number of inputs. The data
are separated into training and testing data sets. The training data set is used to construct
a GMDH model and the testing data set is used to evaluate the estimated GMDH model.
Step 2: Construct L = M(M - 1)/2 new variables Z = {z_1, z_2, …, z_L} in the training data set for
all pairs of independent variables and choose a PD of the GMDH. Conventional GMDH is
developed using the polynomial

z_l = G(x_i, x_j) = a_0 + a_1 x_i + a_2 x_j + a_3 x_i x_j + a_4 x_i² + a_5 x_j²  for l = 1, 2, …, L  (5)
as the PD. In this study, a PD structure based on a radial basis function (RBF) built from the
polynomial is proposed for constructing the GMDH. The radial basis function
(RBF) model is used in the form

z_l = e^{-g_l²},  where  g_l = a_0 + a_1 x_i + a_2 x_j + a_3 x_i x_j + a_4 x_i² + a_5 x_j²  (6)
Step 3: Estimate the coefficient of the PD. The vectors of coefficients of the PDs are determined
using the least squares method.
Step 4: Determine new input variables for the next layer. There are several selection
criteria for identifying the input variables for the next layer. In our study, we used two
criteria. In the first criterion, the single best neuron out of these L neurons, z′, is identified according to the value of the mean square error (MSE) on the testing data set, such that

MSE_k = (1/n_T) Σ_{i=1}^{n_T} (y_i - z_{k,i})²  for k = 1, 2, …, L  (7)

where n_T is the number of observations in the testing data set. If the smallest MSE of z′ is less than a
threshold, the process terminates; otherwise the new input variables are set to (x_1, x_2, x_3, …, x_M, z′).
In the second criterion, the least effective variables are eliminated: the columns of
X = {x_1, x_2, …, x_k} are replaced by those columns of Z = {z_1, z_2, …, z_k} that best estimate the dependent
variable y in the testing data set. This is captured by the expression

x_1 = z_1, x_2 = z_2, … , x_k = z_k  (8)

where k is the total number of retained new input variables.
Step 5: Check the stopping criterion. The lowest value of the selection criterion obtained by the GMDH model
at the current layer is compared with the smallest value obtained
at the previous one. If an improvement is achieved, steps 1 to 5 are repeated;
otherwise the iterations terminate and a realization of the network is
complete. Once the final layer has been determined, only the one node characterized by
the best performance is selected as the output node; the remaining nodes in that layer
are discarded. Finally, the GMDH model is obtained.
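A simplified, single-layer version of Steps 2-4 can be sketched as follows (assuming NumPy; the function names and the synthetic data are ours, not the paper's): every pair of inputs gets a quadratic PD fitted by least squares on the training set, and the candidate neurons are ranked by test-set MSE.

```python
import numpy as np
from itertools import combinations

def pd_features(xi, xj):
    """Design matrix of the quadratic partial description of Eq. (5)."""
    return np.column_stack([np.ones_like(xi), xi, xj, xi * xj, xi**2, xj**2])

def gmdh_layer(X_tr, y_tr, X_te, y_te, keep=3):
    """Fit a PD for every input pair (Step 2), estimate coefficients by
    least squares (Step 3), and rank candidates by test MSE (Step 4)."""
    results = []
    for i, j in combinations(range(X_tr.shape[1]), 2):
        A = pd_features(X_tr[:, i], X_tr[:, j])
        a, *_ = np.linalg.lstsq(A, y_tr, rcond=None)
        z_te = pd_features(X_te[:, i], X_te[:, j]) @ a
        results.append((float(((y_te - z_te) ** 2).mean()), (i, j), a))
    results.sort(key=lambda r: r[0])
    return results[:keep]          # the surviving neurons of this layer

rng = np.random.default_rng(2)
X = rng.normal(size=(120, 4))
y = 2.0 + X[:, 0] * X[:, 1] + 0.1 * X[:, 2] + 0.05 * rng.normal(size=120)
best = gmdh_layer(X[:80], y[:80], X[80:], y[80:])
print(best[0][1])   # the pair (0, 1) carries the dominant signal
```

Feeding the retained neurons back in as new inputs and repeating until the test MSE stops improving yields the multi-layer network of Step 5.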
3. Time series prediction by ANN and GMDH
In the classical method for time series forecasting problems, the number of input nodes of a nonlinear
model such as the ANN or GMDH model is equal to the number of lagged variables (y_{t-1}, y_{t-2}, …, y_{t-p}),
where p is the number of chosen lags. The output y_t, the predicted value of the time series,
is defined as

y_t = f(y_{t-1}, y_{t-2}, …, y_{t-p})  (9)
However, there is currently no systematic way to determine the optimum number of
lags p. The other method to determine the number of nodes in the input layer is based on the
Box-Jenkins model. Unlike the previous method, where the number of lags p is chosen either
on an ad hoc basis or by traditional methods, the lagged variables obtained from the Box-Jenkins
analysis are the most important variables to be used as input nodes in the input layer of
the ANN or GMDH model. In our proposed model, a time series model based on the Box-Jenkins
methodology is considered as a nonlinear function of several past observations and random errors
as follows:
y_t = f[(y_{t-1}, y_{t-2}, …, y_{t-p}), (a_{t-1}, a_{t-2}, …, a_{t-q})]  (10)
where f is a nonlinear function determined by the ANN and GMDH.
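Assembling the input matrix of Eq. (10) from the lags flagged by the Box-Jenkins analysis can be sketched as follows (assuming NumPy; `hybrid_inputs` and the toy data are illustrative only, not from the paper):

```python
import numpy as np

def hybrid_inputs(y, resid, ar_lags, ma_lags):
    """Stack the lagged observations and lagged residuals selected by the
    Box-Jenkins analysis into the input matrix of Eq. (10)."""
    start = max(ar_lags + ma_lags)
    X = [[y[t - k] for k in ar_lags] + [resid[t - k] for k in ma_lags]
         for t in range(start, len(y))]
    return np.array(X), y[start:]

# Toy series and (zero) residuals, with lags 1, 2 of y and lag 1 of a_t
y = np.arange(10.0)
resid = np.zeros(10)
X, target = hybrid_inputs(y, resid, ar_lags=[1, 2], ma_lags=[1])
print(X.shape, target[0])   # (8, 3) 2.0
```

Each row of X then feeds the input layer of the ANN or GMDH model, with the matching element of `target` as the value to predict.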
4. Empirical Results
In this section, we illustrate the hybrid GMDH-type algorithm and show its performance for
forecasting the Canadian lynx data. The lynx series contains the number of lynx trapped per year
in the Mackenzie River district of Northern Canada. The data set has 114 observations,
corresponding to the period of 1821 to 1934. It has also been extensively analyzed in the time
series literature with a focus on nonlinear modeling. The lynx data are among the most
frequently used time series. The data are plotted in Fig. 2, which shows a periodicity of
approximately 10 years. It indicates that the series is stationary in the mean but may not be stationary
in variance. The lynx series has been studied by many researchers; the first time series analysis
was carried out by [24] and, more recently, [25] fitted an AR(2) model
to the logged data. [26], [1], and [20] found that the best-fitted model is an AR(12) model.
Fig.2 Canadian Lynx data series (1821-1934)
The authors used the R package to implement the Box-Jenkins technique. Applying the Box-Jenkins
technique to the lynx time series, two models, AR(2) and AR(12), were considered, and their
statistical results are compared in Table 1 based on the mean squared error (MSE), the
Akaike Information Criterion (AIC) and the Schwarz Information Criterion (SIC). We used the logarithm
with base 10, which makes the lynx data more symmetrical.
Table 1: Comparison of ARIMA models' statistical results

Model    MSE     AIC      SIC
AR(2)    0.0179  -1.7072  -1.6550
AR(12)   0.0238  -1.3834  -1.0708

Note: Boldface indicates the best statistical results.
Table 1 shows that the lowest MSE, AIC and SIC statistics of 0.0179, -1.7072 and -1.6550,
respectively were observed for AR(2). Hence, according to their performances indices, AR(2) is
selected as the appropriate ARIMA model for the lynx series. The AR(2) model that we identified takes the
following form:

y_t = 1.0652 + 1.3693 y_{t-1} - 0.7385 y_{t-2} + a_t  (11)
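Taking the coefficients of Eq. (11) at face value, a one-step forecast on the log10 scale can be computed as below (a sketch; the two lagged values are hypothetical, not taken from the lynx data):

```python
# One-step forecast from the AR(2) of Eq. (11), which operates on
# log10-transformed lynx counts.
def ar2_forecast(y_lag1, y_lag2):
    return 1.0652 + 1.3693 * y_lag1 - 0.7385 * y_lag2

pred_log = ar2_forecast(3.20, 3.00)   # two hypothetical log10 counts
pred_count = 10 ** pred_log           # back-transform to animal numbers
print(round(pred_log, 4))             # 3.2315
```

Since the model is fitted on the base-10 logged series, forecasts must be back-transformed before being read as trapped-lynx counts.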
In designing the ANN and GMDH models, one must determine the following variables: the
number of input nodes and the number of layers. The selection of the number of inputs,
which corresponds to the number of variables, plays an important role in many successful applications of
ANN and GMDH models.
To make the ANN and GMDH models simple and to reduce the computational burden, only the
lagged variables obtained from the Box-Jenkins analysis are used as the input layer. In this study, based on the
Box-Jenkins linear model of Eq. (11), the time series is considered as a
nonlinear function of several past observations as follows:

y_t = f(y_{t-1}, y_{t-2})  (12)

where f is a nonlinear function determined by the ANN and GMDH models. The nodes in the input
layer consist of the lagged variables y_{t-1} and y_{t-2} obtained from the Box-Jenkins analysis.
The ANN model was implemented with the neural network toolbox in MATLAB. The hidden
nodes use the hyperbolic tangent sigmoid transfer function and the output layer uses the linear
function, because the prediction performance is best when these transfer functions are used. The
network was trained for 5000 epochs using the conjugate gradient descent back-propagation
algorithm with a learning rate of 0.001 and a momentum coefficient of 0.9. The optimal number of
neurons in the hidden layer was identified using several practical guidelines. These include using
I/2 [27], 2I [28] and 2I+1 [29], where I is the number of inputs. In this study, a trial-and-error
method was performed to optimize the number of neurons in the hidden layer. The best neural
network architecture found for the lynx series consists of 2 input, 5 hidden and 1 output neurons (2x5x1).
The GMDH model works by building successive layers with complex connections that are
created by using second-order polynomials and exponential functions. The first layer is
made by computing regressions of the input variables. The second layer is created by computing
regressions of the outputs of the first layer. Only the best neurons are chosen at each layer, and this
process continues until a pre-specified selection criterion is met. The optimal number of neurons in the
hidden layers of the GMDH model was identified using a trial-and-error procedure by varying the
number of hidden neurons from 1 to 10. The best-fit model structure is determined according to
the performance index (MSE).
Table 2 shows the performance results of the ARIMA, ANN and GMDH approach based on
mean squared error (MSE), mean absolute error (MAE) and correlation coefficients (R2).
Table 2: Comparison of ARIMA, ANN and GMDH
Model MSE MAE R2
AR(2) 0.0179 0.1166 0.8959
ANN (2x5x1) 0.0107 0.0684 0.9351
GMDH 0.0074 0.0624 0.9589
Note: Boldface indicates the best statistical results.
From Table 2, considering the MSE, MAE and R2 as performance
indicators, the experimental results clearly demonstrate that the GMDH model outperforms the other
models. Fig. 3 shows the actual and forecasted values.
Table 3 shows the performance of our proposed model and other models studied in the
previous literature. To measure forecasting performance, MSE and MAE are employed as
performance indicators. The experimental results show that our proposed model offers encouraging
advantages and has good performance.
Table 3: Comparison of the performance of the proposed model with those of other forecasting
models
Model MSE MAE
Zhang’s ARIMA model 0.02049 0.1123
Zhang’s ANN model 0.02046 0.1121
Zhang’s Hybrid model 0.01723 0.10397
Khashei &amp; Bijari's ANN model 0.01361 0.08963
Kajitani’s SETAR model 0.01400 -
Kajitani’s FNN model 0.0090 -
Aladag’s Hybrid model 0.0090 -
Proposed model 0.0074 0.0624
Note: Boldface indicates the best statistical results.
Fig. 3: Comparison between observed and predicted for GMDH, ARIMA and ANN models for
Lynx time series (testing phase)
5. Conclusion
One of the major developments in ANN research over the last decade is model combining, or hybrid
models. In this paper we proposed a hybrid GMDH model that combines the Box-Jenkins
time series model and the GMDH model to forecast time series data. The GMDH model in
conjunction with the Box-Jenkins approach has been demonstrated on the lynx data. The Box-Jenkins
approach is applied to propose a new hybrid method for improving the performance of
the GMDH in time series forecasting. The empirical results indicate that the proposed method yields
better results than the other methods. This approach presents a superior and reliable alternative to the
ANN, ARIMA and hybrid models studied by other researchers.
Acknowledgements. This research has been funded by the Ministry of Science, Technology and
Innovation (MOSTI), Malaysia under Vot 4F399. Lastly, thanks are given to the Universiti
Teknologi Malaysia.
References
[1] G.P. Zhang, Time Series Forecasting Using a Hybrid ARIMA and Neural Network Model, Neurocomputing, 50 (2003), 159 - 175.
[2] D.S.K. Karunasinghe and S.Y. Liong, Chaotic time series prediction with a global model: Artificial neural network, Journal of Hydrology, 323 (2006), 92 - 105.
[3] I. Rojas, O. Valenzuela, F. Rojas, A. Guillen, L. J. Herrera, H. Pomares, L. Marquez, M. Pasadas, Soft-computing techniques and ARMA model for time series prediction,
Neurocomputing, 71(4 – 6) (2008), 519 - 537.
[4] F. Camastra and A. Colla, Neural short-term prediction based on dynamics reconstruction, ACM, 9(1) (1999), 45-52.
[5] M. Han and M. Wang, Analysis and modeling of multivariate chaotic time series based on neural network, Expert Systems with Applications, 2(36) (2009), 1280-1290.
[6] A. Abraham and B. Nath, A neuro-fuzzy approach for modeling electricity demand in Victoria, Applied Soft Computing, 1(2) (2001), 127–138.
[7] K.A. Oliveira, I. Vannucci, E.C. Silva, Using artificial neural networks to forecast chaotic time series, Physica A: Statistical Mechanics and its Applications, 284(1-4) (2000), 393 -
404.
[8] A.G. Ivakhnenko, Polynomial theory of complex systems, IEEE Trans. Syst., Man Cybern., SMC-1, 1 (1971), 364-378.
[9] H. Tamura and T. Kondo, Heuristics free group method of data handling algorithm of generating optimal partial polynomials with application to air pollution prediction,
International Journal of Systems Science, 11 (1980), 1095–1111.
[10] A.G. Ivakhnenko and G.A. Ivakhnenko, A review of problems solved by algorithms of the GMDH, Pattern Recognition and Image Analysis, 5(4) (1995), 527-535.
[11] M.S. Voss and X. Feng, A new methodology for emergent system identification using particle swarm optimization (PSO) and the group method data handling (GMDH). GECCO
2002, (2002), 1227-1232.
[12] A. Jain and A. Kumar, An evaluation of artificial neural network technique for the determination of infiltration model parameters, Applied Soft Computing, 6 (2006), 272–282.
[13] C.T. Su, L.I. Tong, C.M. Leou, Combination of time series and neural network for reliability forecasting modeling, Journal of Chinese Industrial Engineering, 14 (1997), 419
– 429.
[14] W. Wang, P.V. Gelder, J.K. Vrijling, Improving daily stream flow forecasts by combining ARMA and ANN models, International Conference on Innovation Advances and
Implementation of Flood Forecasting Technology, 2005.
[15] G.C. Onwubolu, Design of hybrid differential evolution and group method of data handling networks for modeling and prediction, Information Sciences, 178 (2008), 3616-3634.
[16] R. Samsudin, P. Saad, A. Shabri, A hybrid GMDH and least squares support vector machines in time series forecasting, Neural Network World, 21(3) (2011), 251-268.
[17] K.Y. Chen and C.H. Wang, A hybrid ARIMA and support vector machines in forecasting the production values of the machinery industry in Taiwan, Expert Systems with
Applications, 32 (2007), 254 - 264.
[18] S. BuHamra, N. Smaoui, M. Gabr, The Box–Jenkins analysis and neural networks: prediction and time series modelling, Applied Mathematical Modelling, 27 (2003), 805-815.
[19] C. Hamzaçebi, Improving artificial neural networks' performance in seasonal time series forecasting, Information Sciences, 178(23) (2008), 4550-4559.
[20] M. Khashei and M. Bijari, An artificial neural network (p, d, q) model for timeseries forecasting, Expert Systems with Applications, 37(1) (2010), 479-489.
[21] G.E.P. Box and G. Jenkins, Time Series Analysis. Forecasting and Control, Holden-Day, San Francisco, CA, 1970.
[22] K.K. Lai, L. Yu, S. Wang, W. Huang, Hybridizing Exponential Smoothing and Neural Network for Financial Time Series Predication. ICCS 2006, Part IV, LNCS 3994 (2006),
493-500.
[23] H.F. Zou, G.P. Xia, F.T. Yang, H.Y. Wang, An investigation and comparison of artificial neural network and time series models for chinese food grain price forecasting,
Neurocomputing, 70 (2007), 2913-2923.
[24] P.A.P. Moran, The statistical analysis of the sunspot and lynx cycles. Journal of Animal Ecology, 18 (1953), 115–116.
[25] Y. Kajitani, K.W. Hipel, A.I. McLeod, Forecasting nonlinear time series with feedforward neural networks: A case study of Canadian lynx data, Journal of Forecasting, 24 (2005),
105–117.
[26] T. Subba Rao and M.M. Gabr, An Introduction to Bispectral Analysis and Bilinear Time Series Models, Springer, Berlin, 1984.
[27] S. Kang, An Investigation of the Use of Feed forward Neural Network for Forecasting. Ph.D. Thesis, Kent State University, 1991.
[28] F.S. Wong, Time series forecasting using back propagation neural network, Neurocomputing, 2 (1991), 147 - 159.
[29] R.P. Lippmann, An introduction to computing with neural nets. IEEE ASSP Magazine, April, 4-22, 1987.
Received: April 14, 2014