
An Algorithm for the Identification Stage in Temporal Series Analysis

MARCOS ANTONIO MASNIK FERREIRA Information Technology Department

Central Bank of Brazil Universidade Federal do Paraná

Programa de Pós-Graduação de Métodos Numéricos em Engenharia – PPGMNE Centro Politécnico - Jardim das Américas C. P. 19011

81531-990 - Curitiba – Paraná BRAZIL

JOEL MAURÍCIO CORRÊA DA ROSA Departamento de Estatística

Universidade Federal do Paraná BRAZIL

CELSO CARNIERI Departamento de Matemática

Universidade Federal do Paraná BRAZIL

Abstract: - Understanding the structure of a time series is essential for a financial engineer or a company manager who must make sound decisions. Beyond the insight it gives into the structure of a series, time series theory supports forecasting through statistical inference. The Autoregressive Integrated Moving Average model, ARIMA(p, d, q), is widely used in Finance and Economics. Nevertheless, its orders p, d and q are usually determined manually, a process that is prone to error. This paper proposes an algorithm, developed in the R statistical package, which tests all the possibilities, within limits defined by the analyst, for a Multiplicative Seasonal ARIMA model, SARIMA(p, d, q) x (P, D, Q). The algorithm ranks the alternatives according to different objective functions, such as the Akaike Information Criterion (AIC), the log likelihood, or the lowest quadratic prediction error.

Key-Words: - Computationally Intensive Algorithms, Temporal Series Analysis, Linear Models, SARIMA Model, Econometrics.

1 Introduction

The main purposes of Time Series Analysis are to describe data more compactly, to help interpret them, and to forecast. The decomposition of a time series traditionally separates it into three components [1]:

$$Z_t = T_t + S_t + a_t \qquad (1)$$

The time series $Z_t$ is decomposed into the trend ($T_t$), seasonal ($S_t$) and residual ($a_t$) components. $\{a_t\}$ is the random component of the process $Z_t$, which is assumed to have mean zero and variance $\sigma_a^2$. If $\{a_t\}$ is white noise, then $E[a_t a_s] = 0$ for $s \neq t$.

A widespread methodology in time series analysis is the Autoregressive Integrated Moving Average (ARIMA) model, first introduced by Box and Jenkins [2]. In the ARIMA(p, d, q) model, the time series $Z_t$ is described by the following equation:

$$Z_t = \phi_1 Z_{t-1} + \phi_2 Z_{t-2} + \cdots + \phi_p Z_{t-p} + a_t + \theta_1 a_{t-1} + \theta_2 a_{t-2} + \cdots + \theta_q a_{t-q} \qquad (2)$$

Therefore, the series consists of an Auto Regressive component AR ($\phi_1 Z_{t-1} + \phi_2 Z_{t-2} + \cdots + \phi_p Z_{t-p}$) and a Moving Average component MA ($\theta_1 a_{t-1} + \theta_2 a_{t-2} + \cdots + \theta_q a_{t-q}$). The parameters 'p' and 'q' define the orders of the AR(p) and MA(q) components. The parameter 'd' specifies how many times the series must be differenced to become stationary, a requirement for ARIMA modeling. Statistical theory helps the analyst to determine the values of p, d and q. However, as the next section shows, determining these orders is not always an easy task; it is time consuming and often prone to error.

Proceedings of the 6th WSEAS International Conference on Applied Informatics and Communications, Elounda, Greece, August 18-20, 2006 (pp274-279)

2 The ARIMA model

The construction of an ARIMA model is an iterative process consisting of 4 stages:

1. Identification, or the definition of the order
2. Estimation of the parameters $\phi_i$ and $\theta_j$
3. Diagnostic Checking of the model
4. Prediction or Forecast

The Identification stage consists of determining whether the series must be differenced, or whether another transformation, such as the log difference, is needed to make it stationary. During this stage the analyst should also determine the orders 'p' and 'q' that best fit the model, according to the objectives the analyst has in mind.

The Estimation stage determines the values of the parameters $\phi_i$ and $\theta_j$ so that the model best fits the time series $Z_t$.

After a model is fitted, the Diagnostic Checking stage verifies whether it is indeed appropriate. The common approach is to examine the residuals $\{a_t\}$ to verify that they behave as white noise. If the diagnostic checking fails, the Identification and Estimation stages are repeated until a satisfactory model is found.

The last stage, Prediction or Forecast, is the simplest and consists of predicting future values of the time series using the best model obtained in the previous stages. Sometimes the analyst is not interested in prediction but only in understanding the structure of the time series. For this reason there is no single best model; the choice depends on the purpose of the analysis.

To exemplify the stages of ARIMA modeling, brief examples are shown. Considering the time series in Figure 1, it is clear that the series is not stationary, because neither $E[Z_t]$ nor $Var[Z_t]$ is constant over the whole period. Therefore the series must be transformed to become stationary. Figure 2 shows the plot of the series after differencing it once, $\nabla Z_t = Z_t - Z_{t-1}$. It is clear from the plot that the differenced series is stationary in mean and variance. Although the analysis was simple in this didactic example, the reader should be aware that real data can be much harder to handle.

[Plot: Z versus Index (0 to 1000)]

Figure 1 – A Time Series Sample.

Other time series may need another type of transformation, such as taking the log difference. Most importantly, for the example above we determined that the 'd' parameter equals 1.
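The effect of a single difference can be checked on simulated data. The following is a minimal sketch, assuming a random walk as a stand-in for a non-stationary series; the seed, the length and the variable names are illustrative and are not taken from the paper's data.

```r
# Sketch: first differencing of a simulated random walk (illustrative data).
set.seed(1)                        # arbitrary seed, for reproducibility
a  <- rnorm(1000)                  # white noise a_t
Z  <- cumsum(a)                    # random walk: Z_t = Z_{t-1} + a_t
dZ <- diff(Z, differences = 1)     # first difference, i.e. d = 1

# The differenced series has one observation fewer than Z and
# recovers the white noise a_2, ..., a_n.
length(dZ)
isTRUE(all.equal(dZ, a[-1]))

# A log difference, for a strictly positive series x, would be diff(log(x)).
```

Because differencing the random walk returns the white noise itself, one difference is enough here, which corresponds to d = 1.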

[Plot: diff(Z) versus Index (0 to 1000)]

Figure 2 – The Time Series Differenced Once.

The next step is to find the orders p and q. The plots of the Auto Correlation Function (ACF) and the Partial Auto Correlation Function (PACF) can shed light on these values. The sample autocorrelation function $\hat{\rho}(h)$ is calculated by the following equations:

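$$\hat{\rho}(h) = \frac{\hat{\gamma}(h)}{\hat{\gamma}(0)} \qquad (3)$$

$$\hat{\gamma}(h) = \frac{1}{n} \sum_{t=1}^{n-|h|} (z_{t+|h|} - \bar{z})(z_t - \bar{z}), \quad \text{for } -n < h < n \qquad (4)$$

where $\bar{z}$ is the sample mean of the series.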
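Equations (3) and (4) can be checked numerically. The sketch below implements them directly and compares the result with R's built-in acf function; the helper names sample_acov and sample_acf and the simulated AR(1) series are assumptions made for illustration only.

```r
# Sketch: sample autocovariance and autocorrelation, equations (3) and (4).
sample_acov <- function(z, h) {
  n    <- length(z)
  h    <- abs(h)                     # gamma_hat(-h) = gamma_hat(h)
  zbar <- mean(z)
  sum((z[(1 + h):n] - zbar) * (z[1:(n - h)] - zbar)) / n
}
sample_acf <- function(z, h) sample_acov(z, h) / sample_acov(z, 0)

set.seed(42)                         # arbitrary seed
z <- as.numeric(arima.sim(model = list(ar = 0.7), n = 200))

rho_manual  <- sapply(0:5, function(h) sample_acf(z, h))
rho_builtin <- as.numeric(acf(z, lag.max = 5, plot = FALSE)$acf)

# Both vectors hold the autocorrelations at lags 0 to 5.
all.equal(rho_manual, rho_builtin)
```

Note that the divisor is n rather than n - h, which is also what acf uses by default.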

For the reader interested in how the PACF is calculated, we suggest [3] as a complete reference. The ACF and PACF plots for an AR(2) model are shown in Figure 3.


[Plots for series sim.arma: ACF versus Lag (0 to 40) and Partial ACF versus Lag (0 to 40)]

Figure 3 – The ACF and PACF for an AR(2) model.

It can be demonstrated that for an AR(p) model the ACF decays exponentially and the PACF is zero at lags above the order 'p'. Figure 4 shows the ACF and PACF plots for an MA(2) model.

[Plots for series sim.arma: ACF versus Lag (0 to 40) and Partial ACF versus Lag (0 to 40)]

Figure 4 – The ACF and PACF for an MA(2) model.

Theory shows that for an MA(q) model the PACF decays exponentially, possibly alternating between positive and negative values, while the ACF is zero after lag 'q'. Everything is fine when a time series presents AR or MA behavior separately. The problem arises when the series presents both Auto Regressive and Moving Average behavior. Figure 5 shows the ACF and PACF plots for an AR(2) MA(2) model.

[Plots for series sim.arma: ACF versus Lag (0 to 40) and Partial ACF versus Lag (0 to 40)]

Figure 5 – The ACF and PACF for an AR(2) MA(2) model.
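The cut-off patterns described for AR, MA and mixed models can be reproduced with the theoretical ACF and PACF returned by ARMAacf from R's stats package. This is only a sketch: the coefficients phi and theta are hypothetical values chosen to give a stationary, invertible process, and they are not the ones behind the figures.

```r
# Sketch: theoretical ACF/PACF of AR(2), MA(2) and mixed AR(2) MA(2) processes.
phi   <- c(0.5, 0.3)    # hypothetical AR coefficients (stationary)
theta <- c(0.5, 0.4)    # hypothetical MA coefficients (invertible)

ar2_pacf  <- ARMAacf(ar = phi, lag.max = 6, pacf = TRUE)   # lags 1 to 6
ma2_acf   <- ARMAacf(ma = theta, lag.max = 6)              # lags 0 to 6
arma_acf  <- ARMAacf(ar = phi, ma = theta, lag.max = 6)    # lags 0 to 6
arma_pacf <- ARMAacf(ar = phi, ma = theta, lag.max = 6, pacf = TRUE)

round(ar2_pacf, 4)    # vanishes after lag p = 2
round(ma2_acf, 4)     # vanishes after lag q = 2
round(arma_acf, 4)    # decays, with no clear cut-off
round(arma_pacf, 4)   # decays, with no clear cut-off
```

With sample estimates the picture is noisier than with these theoretical values, which makes the mixed case even harder to read from plots.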

When dealing with series like that, the ACF and PACF plots no longer help the analyst. From the ACF plot, the analyst could guess that the series has Auto Regressive behavior, but the PACF plot does not reveal the order 'p'. From the PACF plot, the analyst could infer MA behavior, but not the order 'q' that best fits the model. Only after correctly identifying the orders 'p', 'd' and 'q' should the analyst go on to the next stage, the estimation of the parameter values.

Given the importance of the order selected for ARIMA models and the rapid growth in computational speed, this paper focuses on the identification stage of ARIMA modeling. Powerful computers now make it possible to test a broad range of candidate orders, to estimate $\phi_i$ and $\theta_j$ for each one, to carry out the diagnostic checking of the alternatives, and to select the best ones among the whole range examined. The next section presents an algorithm, developed in the R statistical package, that implements these procedures.

3 ARIMA Identification Stage

For this stage an algorithm was developed in the R statistical package [4] for two reasons: (i) R is free software and (ii) it is a very powerful statistical language and environment for exploring data.

3.1 Algorithm for the ARIMA Identification Stage

Figure 6 presents the outline of the algorithm [5], developed in R, which tests all the possibilities defined by the analyst. We believe that the algorithm is clear enough, so only a brief description is given. The algorithm starts by reading the series from a text file called Products.txt, transfers it to the "Z" variable, and creates a time series stored in "Z.s". The file Products.txt stores the monthly amount exported from the Paranaguá Port, a public port in the city of Paranaguá, Brazil.

After that, the Z.serie vector is initialized with the content of Z.s, setting aside the last 12 values so that the predictions can be evaluated. All the possibilities are thus compared by their average quadratic prediction error over the last 12 months.
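The hold-out idea can be illustrated independently of the port data. The sketch below is an assumption-laden example, not the paper's procedure: it uses the AirPassengers series shipped with R as a stand-in, one arbitrary candidate order, and a plain mean over the 12 squared errors.

```r
# Sketch: hold out the last 12 months and score one candidate model by its
# mean squared forecast error. AirPassengers is only a stand-in series.
y <- AirPassengers                        # monthly series, frequency 12
n <- length(y)
train <- ts(y[1:(n - 12)], start = start(y), frequency = 12)
test  <- as.numeric(y[(n - 11):n])        # the 12 values kept aside

# One arbitrary candidate: SARIMA(0,1,1)x(0,1,1) with period 12.
fit  <- arima(train, order = c(0, 1, 1),
              seasonal = list(order = c(0, 1, 1)))
pred <- as.numeric(predict(fit, n.ahead = 12)$pred)

mse <- mean((pred - test)^2)              # average quadratic prediction error
mse
```

Repeating the last three statements for every candidate order gives one prediction score per model, which is what the search compares.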


    # Assumption: the source does not name the package that provides
    # skewness() and kurtosis(); "moments" returns a kurtosis of 3 for a normal.
    library(moments)

    Z <- read.table("C:\\IST\\Products.txt", header = TRUE)
    Z.s <- ts(Z[[3]], frequency = 12)
    seriesLength <- length(Z.s) - 12
    # separate the last 12 values for comparing the prediction
    Z.serie <- ts(Z.s[1:seriesLength], frequency = 12)

    p <- 2; d <- 1; q <- 2; P <- 1; D <- 1; Q <- 1   # the limits for the search

    options(show.error.messages = FALSE)
    result <- array(0, dim = c((p + 1) * (d + 1) * (q + 1) * (P + 1) * (D + 1) * (Q + 1), 10))
    i <- 1
    for (ar in 0:p) {
      for (dif in 0:d) {
        for (ma in 0:q) {
          for (ars in 0:P) {
            for (difs in 0:D) {
              for (mas in 0:Q) {
                Z.fit <- try(arima(Z.serie, order = c(ar, dif, ma),
                                   seasonal = list(order = c(ars, difs, mas))))
                if (length(Z.fit) > 1) {   # a failed fit is a length-1 "try-error" object
                  Z.pred <- predict(Z.fit, n.ahead = 12)
                  # lowest quadratic error
                  err <- 0
                  for (j in 1:12) {
                    err <- err + (Z.pred$pred[j] - Z.s[length(Z.s) - (12 - j)])^2
                  }
                  result[i, 1] <- Z.fit$aic
                  result[i, 2] <- Z.fit$loglik
                  # divisor kept as printed in the source; it is the same for every model
                  result[i, 3] <- err / (length(Z.s) - p - q - p - D)
                  e <- Z.fit$residuals
                  result[i, 4] <- abs(0 - skewness(e)) + abs(3 - kurtosis(e))
                } else {
                  # penalty values so that failed fits are ranked last
                  result[i, 1] <- 1e10
                  result[i, 2] <- -1e10
                  result[i, 3] <- 1e10
                  result[i, 4] <- 1e10
                }
                result[i, 5] <- ar
                result[i, 6] <- dif
                result[i, 7] <- ma
                result[i, 8] <- ars
                result[i, 9] <- difs
                result[i, 10] <- mas
                i <- i + 1
              }
            }
          }
        }
      }
    }
    options(show.error.messages = TRUE)

Figure 6. The outline of the algorithm that tests all the possibilities for a SARIMA (p,d,q)x(P,D,Q) model.

The variables p, d, q, P, D and Q define the upper limit of each order in the SARIMA search. For instance, if 'p' is set to 2, the algorithm tests orders 0, 1 and 2 for the auto regressive AR(p) component. For the example in Figure 6, the algorithm tests (2+1)(1+1)(2+1)(1+1)(1+1)(1+1) = 144 possibilities, whose results are stored temporarily in the array called result. This two-dimensional array is created with the array command from R. The number of rows depends on the number of possibilities being tested. Table 1 shows the values stored in each column of result. Columns 5 to 10 keep the current values of p, d, q, P, D and Q. The other columns are explained below.
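The size of the search space can be verified with a short enumeration. This sketch uses expand.grid as an alternative to the nested loops; the names limits and grid are illustrative, and the row order differs from the loop order because expand.grid varies its first column fastest.

```r
# Sketch: enumerate every candidate order within the search limits.
limits <- c(p = 2, d = 1, q = 2, P = 1, D = 1, Q = 1)

# One column per order, one row per candidate (p, d, q, P, D, Q).
grid <- expand.grid(lapply(limits, function(m) 0:m))

nrow(grid)            # number of candidates, equal to prod(limits + 1)
head(grid, 3)
```

Raising any single limit by one multiplies the number of candidates, which is why the limits should be chosen with the available computing time in mind.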


    Result Vector

    Column          1     2                3                 4                   5   6   7   8   9   10
    Values Stored   AIC   Log Likelihood   Best Prediction   Best Residual Fit   p   d   q   P   D   Q

Table 1. The values stored in each column of the result vector.

After the initialization phase, the algorithm fits a model for each possibility using the arima command. For some combinations of p, d, q, P, D and Q, and depending on the series, it is impossible to estimate $\phi_i$ and $\theta_j$ without producing a non-stationary model. In these cases the arima command raises an error, which would stop the execution of the loop. For this reason the arima call is wrapped in the try command, which handles such execution errors; a failed fit receives penalty values (1e10, or -1e10 for the log likelihood) so that it is ranked last.

The values for AIC and log likelihood come directly from the arima command. The other two criteria, Best Prediction and Best Residual Fit, are calculated as follows.

The Best Prediction criterion uses the predict command from R. We then take the average quadratic error between the predicted values and the original ones, which were set aside from the series in the initialization phase.

The Best Residual Fit criterion is calculated from the skewness and kurtosis of the residuals of the fitted model. In ARIMA models the residuals should behave as white noise. The best models are those whose residual skewness and kurtosis differ least from those of a normal distribution (0 and 3, respectively).

Finally, the result array must be sorted according to each chosen criterion. Figure 7 shows the sorting procedure for the Akaike criterion. The same procedure is repeated for the other criteria: log likelihood, best prediction and best residual fit.

    bestAIC <- result
    changed <- TRUE
    limit <- dim(bestAIC)[1] - 1
    # bubble sort of the rows by ascending AIC (column 1)
    while (changed && limit != 0) {
      changed <- FALSE
      for (i in 1:limit) {
        if (bestAIC[i, 1] > bestAIC[i + 1, 1]) {
          temp <- bestAIC[i, ]
          bestAIC[i, ] <- bestAIC[i + 1, ]
          bestAIC[i + 1, ] <- temp
          changed <- TRUE
        }
      }
      limit <- limit - 1
    }
    bestAIC

Figure 7. The vector bubble sort according to the best Akaike criterion found.
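In R the same ordering can also be obtained with the built-in order function instead of a hand-written bubble sort. The sketch below is an alternative, not the paper's listing; the small result matrix and its values are made up purely for illustration.

```r
# Sketch: sort the rows of a result matrix by its first column with order().
# Toy data: column 1 plays the role of the AIC, columns 2-3 of two orders.
result <- matrix(c(30,   1, 1,
                   10,   0, 1,
                   1e10, 2, 2,    # penalty value of a failed fit
                   20,   1, 0),
                 ncol = 3, byrow = TRUE)

bestAIC <- result[order(result[, 1]), ]   # ascending: lowest AIC first
bestAIC
```

For a criterion where larger is better, order(..., decreasing = TRUE) reverses the ranking.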

After this succinct explanation of the algorithm, the next section demonstrates its application to a real time series.

3.2 The Algorithm Results

To show the use of the proposed algorithm, we use a series that represents the total amount exported, in tons, by the Paranaguá Port (Figure 8). It is a monthly series from January 1999 to April 2006. The series plot shows two important characteristics: (i) the exported volumes grow over time, so the series is not stationary, and (ii) the series appears to be seasonal, with low exported volumes in December, January and February. The reason for this behavior is that the Paranaguá Port is one of the most prominent Brazilian grain ports, and the grain market is well known for being influenced by seasonal harvest conditions.

[Plot: Z.serie, exported tons (1,000,000 to 3,000,000), versus Time (1 to 7)]

Figure 8. The total amount exported (ton) by the Paranaguá Port (monthly).

Figure 9 shows the results of the proposed algorithm according to the four criteria we defined: Akaike, Log Likelihood, Best Prediction and Best Residual. As the series is not stationary, almost all the best alternatives have a nonzero value for the parameters 'd' and/or 'D', indicating the need to difference the series to make it stationary.


Best AIC

    rank   AIC       p  d  q  P  D  Q
    [1,]   1811.60   0  1  1  0  1  1
    [2,]   1812.52   1  1  1  0  1  1
    [3,]   1812.78   0  1  2  0  1  1
    [4,]   1813.40   0  1  1  1  1  1
    [5,]   1814.38   1  1  1  1  1  1
    [6,]   1814.39   2  1  1  0  1  1
    [7,]   1814.48   1  1  2  0  1  1
    [8,]   1814.64   0  1  2  1  1  1
    [9,]   1815.46   2  1  2  0  1  1
    [10,]  1815.74   0  1  1  1  1  0
    [11,]  1816.18   2  1  1  1  1  1
    [12,]  1816.28   1  1  2  1  1  1
    [13,]  1816.73   1  1  0  0  1  1
    [14,]  1817.09   2  1  2  1  1  1
    [15,]  1817.56   1  1  1  1  1  0
    [16,]  1817.62   0  1  2  1  1  0
    [17,]  1818.10   2  1  0  0  1  1
    [18,]  1818.17   1  1  0  1  1  1
    [19,]  1818.85   2  1  1  1  1  0
    [20,]  1819.50   1  1  2  1  1  0

Best Log Likelihood

    rank   Log-Lik   p  d  q  P  D  Q
    [1,]   1817.09   2  1  2  1  1  1
    [2,]   1815.46   2  1  2  0  1  1
    [3,]   1816.18   2  1  1  1  1  1
    [4,]   1816.28   1  1  2  1  1  1
    [5,]   1814.38   1  1  1  1  1  1
    [6,]   1814.39   2  1  1  0  1  1
    [7,]   1814.48   1  1  2  0  1  1
    [8,]   1812.52   1  1  1  0  1  1
    [9,]   1814.64   0  1  2  1  1  1
    [10,]  1812.78   0  1  2  0  1  1
    [11,]  1813.40   0  1  1  1  1  1
    [12,]  1811.60   0  1  1  0  1  1
    [13,]  1819.71   2  1  2  1  1  0
    [14,]  1818.85   2  1  1  1  1  0
    [15,]  1819.50   1  1  2  1  1  0
    [16,]  1817.56   1  1  1  1  1  0
    [17,]  1817.62   0  1  2  1  1  0
    [18,]  1819.72   2  1  0  1  1  1
    [19,]  1815.74   0  1  1  1  1  0
    [20,]  1818.10   2  1  0  0  1  1

Best Prediction

    rank   Prediction Error   p  d  q  P  D  Q
    [1,]    5757737970        1  1  1  0  1  1
    [2,]    5875271265        1  1  2  0  1  1
    [3,]    6023309420        1  1  1  1  1  1
    [4,]    6106775472        0  1  2  0  1  1
    [5,]    6180535475        2  1  1  0  1  1
    [6,]    6376595356        2  1  1  1  1  1
    [7,]    6406638803        0  1  2  1  1  1
    [8,]    6893831101        2  1  2  0  1  1
    [9,]    7145343249        2  1  2  1  1  1
    [10,]   8077384784        2  0  2  0  1  1
    [11,]   8144571828        2  0  1  0  1  1
    [12,]   8199900628        2  1  2  1  0  1
    [13,]   8344235377        2  1  1  1  0  1
    [14,]   8445785206        1  0  2  0  1  1
    [15,]   8514821558        1  1  2  1  1  1
    [16,]   8527706931        0  1  1  1  1  1
    [17,]   8861463845        0  1  1  0  1  1
    [18,]   9523768503        2  0  1  1  0  1
    [19,]  10000000000        0  1  1  1  0  1
    [20,]  10000000000        0  1  2  1  0  1

Best Residual

    rank   Residual   p  d  q  P  D  Q
    [1,]   0.0184     2  0  0  0  1  1
    [2,]   0.0246     2  0  0  1  0  0
    [3,]   0.0778     0  1  1  1  0  0
    [4,]   0.0920     0  0  0  0  1  0
    [5,]   0.1033     1  0  2  1  0  0
    [6,]   0.1157     2  1  2  1  1  0
    [7,]   0.1160     2  1  0  1  0  1
    [8,]   0.1205     2  1  1  1  0  1
    [9,]   0.1249     0  1  2  1  1  0
    [10,]  0.1267     0  1  2  1  0  0
    [11,]  0.1270     1  1  1  1  1  0
    [12,]  0.1294     0  1  1  1  1  0
    [13,]  0.1320     2  0  1  1  1  0
    [14,]  0.1321     2  0  0  1  1  1
    [15,]  0.1326     2  0  2  1  1  0
    [16,]  0.1338     1  1  0  1  0  0
    [17,]  0.1388     1  0  0  0  1  1
    [18,]  0.1448     1  0  2  1  1  0
    [19,]  0.1477     1  1  2  1  1  0
    [20,]  0.1518     1  0  1  1  1  0

Figure 9. The 20 best alternatives according to the four established criteria.

With the results provided by the algorithm, the analyst can go on to the next stage, Diagnostic Checking, with the great advantage of considering only the best options instead of testing all 144. If the limits for p, d, q, P, D and Q are increased, the number of possibilities grows even further.

4 Conclusion

For a time series analyst, understanding the structure of a series is essential for making better decisions. Nevertheless, doing so with SARIMA modeling requires determining the orders p, d, q, P, D and Q. As was shown, this can be very time consuming, depending on the series. We believe the proposed algorithm can save considerable time in the Identification stage. With its results, the analyst can go immediately to the next stage. Other criteria can be established by the analyst and implemented in the algorithm, as was demonstrated by defining what we called the Best Residual criterion. In addition, although the algorithm concentrates on the Identification stage of ARIMA modeling, it is not limited to this stage. The Best Prediction criterion skips the Diagnostic Checking stage and goes directly to the final stage, Prediction. In this case we are not seeking the model that produces white noise residuals $\{a_t\}$; we are looking for the fit that minimizes the quadratic prediction error. If forecasting is the analyst's goal, he or she needs the fit that produces the best predictions.

Considering that linear models like ARIMA, mainly in Economics, sometimes cannot capture economic behavior [6], we suggest as a natural extension of the proposed algorithm the integration of non-linear models, such as the Autoregressive Conditional Heteroskedasticity (ARCH) family [7], and the comparison of the best linear and non-linear fits.

References:

[1] Pedro A. Morettin, Clélia M. C. Toloi, Análise de Séries Temporais, ABE – Projeto Fisher, 2004.

[2] G. Box, G. Jenkins, Time Series Analysis: Forecasting and Control, Holden Day: San Francisco, 1976.

[3] James D. Hamilton, Time Series Analysis, Princeton, 1994.

[4] R Development Core Team, R: A Language and Environment for Statistical Computing, R Foundation for Statistical Computing, Vienna, Austria, 2005. ISBN 3-900051-07-0, URL http://www.R-project.org.

[5] E. Horowitz, S. Sahni, S. Rajasekaran, Computer Algorithms, Computer Science Press, New York, USA, 1998.

[6] M. P. Clements, P. H. Franses, N. Swanson, Forecasting economic and financial time-series with non-linear models, International Journal of Forecasting, v. 20, p. 169-183, 2004.

[7] R. S. Tsay, Analysis of Financial Time Series, John Wiley & Sons Inc, 2002.

