
MODELS ON ROAD TRAFFIC FORECASTING: IDENTIFICATION AND DISCUSSION OF DIFFERENT TIME SERIES MODELS

FERNANDO FERNANDES NETO

Instituto de Pesquisas Tecnológicas do Estado de São Paulo – IPT / Secretaria do Planejamento e Desenvolvimento Regional do Estado de São Paulo

Palácio dos Bandeirantes - Av. Morumbi, 4500, 1º Andar, Sala 142, Morumbi, São Paulo/SP

CLAUDIO GARCIA

Escola Politécnica da Universidade de São Paulo – Departamento de Engenharia de Telecomunicações e Controle

Avenida Prof. Luciano Gualberto, trav. 3, 158, Butantã, São Paulo/SP, Brasil - 05508900

Abstract⎯ This paper discusses and calibrates univariate models (scalar approach, SARIMA) and multivariate models (vector approach, VAR and VEC) for forecasting traffic, measured in equivalent axles, in the Anchieta-Imigrantes system. The models that performed best in the backtesting procedure were those of the second type (vector), with a mean absolute error of approximately 3% at a monthly frequency.

Keywords⎯ VAR, VEC, SARIMA, identification, time series, toll roads

Resumo⎯ Neste artigo são discutidos e calibrados modelos univariados (abordagem escalar, SARIMA) e multivariados (abordagem vetorial, VAR e VEC) para a previsão de tráfego em eixos equivalentes no sistema Anchieta-Imigrantes. Os modelos que tiveram melhor desempenho no backtesting foram os do segundo tipo (vetorial), tendo erro médio absoluto de aproximadamente 3% em uma frequência mensal.

Palavras-chave⎯ VAR, VEC, SARIMA, identificação, séries temporais, rodovias

1 Introduction

One of the main problems in the toll road sector is cash flow planning and forecasting, owing both to its idiosyncratic complexity (e.g. levels of service, seasonal effects and the inertial evolution of traffic) and to the impact of other variables such as the Gross Domestic Product (GDP).

A wide range of methods has been applied to traffic forecasting, from time series models, Kalman filter based models and neural networks to Markov chain models, (multi-agent based) simulation models and linear regression models, as shown by Bolshinsky and Friedman (2012), or a combination of them, according to Fillatre et al. (2005), with data ranging from high to low frequency.

It is also important to note that, despite the rich literature on traffic forecasting, little attention has been paid to the predictive ability of most of these methods, as can be seen in (Bain, 2009). In fact, there is a considerable error range in U.S. traffic forecasts, as pointed out by the same author:

“… actual traffic turned out to lie between 86% below forecast to 51% above forecast. This considerable error range illustrates the possible magnitude of uncertainty when traffic risk is passed to the private sector.”

Hence, planning and forecasting play a fundamental role in this field, since most of the necessary investments and, consequently, the associated decisions and cash outflows must take into account a very long timeline: conception, construction, maturation of the project until full capacity, etc.

Thus, the main goal of this paper is to discuss an alternative method for traffic forecasting on toll roads, in this case Vector Autoregressive models (namely VAR and VEC) and univariate time series models based on Seasonal ARIMA (SARIMA), discussed in the next section, illustrated with one of the most important highway systems in Brazil, the “Anchieta-Imigrantes” System.

This paper is divided into the following sections: introduction, methodology, presentation of the problem, results, analysis of the results and conclusion.

Anais do XX Congresso Brasileiro de Automática, Belo Horizonte, MG, 20 a 24 de Setembro de 2014

2 Methodology

2.1 Univariate Models

The univariate approach in the present paper is based on SARIMA models, a natural extension of the classical ARIMA models. A SARIMA model is the product of two ARIMA polynomials, one describing the regular structure of the time series and the other its seasonal structure, as can be seen in (Box and Jenkins, 1976; Hamilton, 1994; Morettin and Tolói, 2004).
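As an illustration of this product structure, the usual multiplicative SARIMA form can be written as below. The symbols are assumptions introduced here and are not the paper's own notation: $B$ is the backshift operator, $s$ is the seasonal period (12 for monthly data), $\phi_p$, $\theta_q$ are the regular polynomials and $\Phi_P$, $\Theta_Q$ the seasonal ones.

```latex
\phi_p(B)\,\Phi_P(B^{s})\,(1-B)^{d}\,(1-B^{s})^{D}\,Y_t
  \;=\; \theta_q(B)\,\Theta_Q(B^{s})\,a_t ,
\qquad B^{k} Y_t = Y_{t-k}
```

Under this notation, the regular part is governed by $(p, d, q)$ and the seasonal part by $(P, D, Q)$ at lag $s$.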

2.2 Multivariate Models

The multivariate models are mainly based on Vector Autoregression models. These are nothing more than a multivariable extension of the classical scalar autoregressive (AR) models, in the sense that the process is described in terms of matrices and vectors instead of scalars. Thus, there is a mutual causality relationship between all variables in this dynamic system. For example, a VAR(p) process can be written as:

$$Y_t = \phi_1 Y_{t-1} + \phi_2 Y_{t-2} + \dots + \phi_p Y_{t-p} + a_t \tag{1}$$

where the $\phi_i$ terms are square matrices of order $n$; the $Y_{t-i}$ are $n \times 1$ vectors of endogenous variables; $a_t$ is an $n \times 1$ vector of uncorrelated residuals; $n$ is the number of endogenous variables and $p$ is the number of lags.
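To make Equation (1) concrete, the following is a minimal sketch, using only the Python standard library, that simulates a bivariate VAR(1) and recovers its coefficient matrix by equation-by-equation OLS. The function names, the coefficient values and the unit-variance Gaussian shocks are illustrative assumptions; none of them come from the paper.

```python
import random


def simulate_var1(phi, n_obs, seed=0):
    """Simulate Y_t = phi * Y_{t-1} + a_t (2 variables, N(0, 1) shocks, hypothetical)."""
    rng = random.Random(seed)
    path = [[0.0, 0.0]]
    for _ in range(n_obs):
        prev = path[-1]
        path.append([
            phi[0][0] * prev[0] + phi[0][1] * prev[1] + rng.gauss(0.0, 1.0),
            phi[1][0] * prev[0] + phi[1][1] * prev[1] + rng.gauss(0.0, 1.0),
        ])
    return path[1:]


def fit_var1_ols(y):
    """Equation-by-equation OLS (no intercept) of Y_t on Y_{t-1}."""
    x, z = y[:-1], y[1:]  # lagged regressors and current targets
    # Cross-product matrices X'X and X'Z (both 2 x 2).
    sxx = [[sum(r[i] * r[j] for r in x) for j in range(2)] for i in range(2)]
    sxz = [[sum(xr[i] * zr[k] for xr, zr in zip(x, z)) for k in range(2)]
           for i in range(2)]
    det = sxx[0][0] * sxx[1][1] - sxx[0][1] * sxx[1][0]
    inv = [[sxx[1][1] / det, -sxx[0][1] / det],
           [-sxx[1][0] / det, sxx[0][0] / det]]
    # b = (X'X)^{-1} X'Z; column k of b holds the coefficients of equation k.
    b = [[sum(inv[i][m] * sxz[m][k] for m in range(2)) for k in range(2)]
         for i in range(2)]
    # Transpose so that row k of the result is equation k, as in phi.
    return [[b[i][k] for i in range(2)] for k in range(2)]
```

With a stationary coefficient matrix and a few thousand observations, `fit_var1_ols(simulate_var1(phi, 5000))` should return values close to `phi`, which is the sense in which OLS is valid for stationary VARs.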

In addition, as with the classical scalar autoregressive (AR) models, if all variables are stationary, this model can be estimated by Ordinary Least Squares (OLS). On the other hand, when one or more variables in a VAR model are non-stationary, the OLS results may no longer be valid. Consequently, the theory of cointegration was developed in order to analyze possible relationships between non-stationary time series.

Furthermore, Granger and Newbold (1974) discussed and exposed the problem of spurious regressions between non-stationary time series. They verified that, given two completely uncorrelated non-stationary series, a regression of one on the other may produce an apparently significant relationship.
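The spurious regression effect can be reproduced with a small Monte Carlo experiment. The sketch below (standard library only) regresses one independent random walk on another many times and counts how often the slope looks significant at the nominal 5% level. The sample sizes, the seed and the 1.96 threshold are illustrative choices, not values taken from the paper.

```python
import math
import random


def random_walk(n_obs, rng):
    """Cumulative sum of independent N(0, 1) shocks."""
    level, path = 0.0, []
    for _ in range(n_obs):
        level += rng.gauss(0.0, 1.0)
        path.append(level)
    return path


def slope_t_stat(x, y):
    """t-statistic of the slope in the OLS regression y = a + b * x + e."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    rss = sum((yi - intercept - slope * xi) ** 2 for xi, yi in zip(x, y))
    return slope / math.sqrt(rss / (n - 2) / sxx)


def spurious_rejection_rate(n_series=200, n_obs=200, seed=1):
    """Share of independent random-walk pairs whose slope has |t| > 1.96."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(n_series):
        x, y = random_walk(n_obs, rng), random_walk(n_obs, rng)
        if abs(slope_t_stat(x, y)) > 1.96:
            hits += 1
    return hits / n_series
```

If the usual t-test were valid here, the rejection rate would be close to 0.05; for independent random walks it is typically far higher, which is the problem that cointegration analysis addresses.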

Therefore, if two variables are non-stationary and have a long-run equilibrium relationship, they may be cointegrated; that is, each is non-stationary on its own, but a linear combination of them is stationary, as shown by Ashley and Granger (1979), Engle and Granger (1987) and Johansen (1988).

Thus, Vector Error Correction (VEC) models were developed. They can be seen as extensions of the VAR in which an error correction term is introduced, according to Hendry and Juselius (2000, 2001) and Lütkepohl (1991).

In order to verify the cointegration assumption, the approach taken in the current paper is first to check that all variables are non-stationary, using the Augmented Dickey-Fuller test (Dickey and Fuller, 1979) at the 95% confidence level. Then, if and only if the variables are non-stationary, following Engle and Granger (1987), the cointegration residuals are obtained by running a regression over the variables, and these residuals are tested for stationarity, again using the Augmented Dickey-Fuller test. If the residuals are stationary, the time series are cointegrated; otherwise, they are not.
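The two-step procedure described above can be sketched as follows, using only the standard library. This is a simplification under stated assumptions: it uses the plain Dickey-Fuller regression with no lagged differences (not the augmented version used in the paper), and the 5% critical values are approximate asymptotic MacKinnon values hard-coded as constants (about -2.86 for a single series with a constant, about -3.34 for the residuals of a two-variable cointegrating regression). All function and constant names are hypothetical.

```python
import math

DF_CRIT_5PCT = -2.86  # approx. asymptotic 5% value, single series with constant
EG_CRIT_5PCT = -3.34  # approx. asymptotic 5% value, Engle-Granger, two variables


def ols(x, y):
    """Intercept and slope of the regression y = a + b * x + e."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope


def df_t_stat(series, with_const=True):
    """t-statistic of rho in  dy_t = (c +) rho * y_{t-1} + e_t  (no lag augmentation)."""
    lagged = series[:-1]
    diffs = [cur - prev for prev, cur in zip(series[:-1], series[1:])]
    n = len(diffs)
    mean_l = sum(lagged) / n if with_const else 0.0
    mean_d = sum(diffs) / n if with_const else 0.0
    n_params = 2 if with_const else 1
    sxx = sum((l - mean_l) ** 2 for l in lagged)
    sxy = sum((l - mean_l) * (d - mean_d) for l, d in zip(lagged, diffs))
    rho = sxy / sxx
    rss = sum(((d - mean_d) - rho * (l - mean_l)) ** 2
              for l, d in zip(lagged, diffs))
    return rho / math.sqrt(rss / (n - n_params) / sxx)


def engle_granger(y1, y2):
    """Step 1: unit-root check on each series. Step 2: DF test on the residuals."""
    unit_roots = all(df_t_stat(s) > DF_CRIT_5PCT for s in (y1, y2))
    intercept, beta = ols(y2, y1)  # cointegrating regression y1 = a + beta * y2
    resid = [u - intercept - beta * v for u, v in zip(y1, y2)]
    resid_stat = df_t_stat(resid, with_const=False)
    return {
        "beta": beta,
        "unit_roots": unit_roots,
        "resid_stat": resid_stat,
        "cointegrated": unit_roots and resid_stat < EG_CRIT_5PCT,
    }
```

For real data, a library implementation with lag augmentation and exact critical values would be preferable; the sketch only shows the order of the steps and where the residuals enter.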

In order to explain how the VEC model structure is obtained, one can start from a two-variable dynamic system in which both variables are cointegrated (by hypothesis), following (Hendry and Juselius, 2000, 2001; Lütkepohl, 1991, 2004; Morettin, 2011).

Let $Y_{1,t}$ and $Y_{2,t}$ be two non-stationary cointegrated variables, and assume that there is an equilibrium relation between them given by:

$$Y_{1,t} - \beta Y_{2,t} = \mu_t \sim N(0, \sigma) \tag{2}$$

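If the variations in $Y_{1,t}$ and $Y_{2,t}$ are considered to depend on the deviations from this equilibrium at time $t-1$, it follows that:

$$\Delta Y_{1,t} = \alpha_1 (Y_{1,t-1} - \beta Y_{2,t-1}) + a_{1,t}, \quad a_{1,t} \sim N(0, \sigma_1) \tag{3.1}$$

$$\Delta Y_{2,t} = \alpha_2 (Y_{1,t-1} - \beta Y_{2,t-1}) + a_{2,t}, \quad a_{2,t} \sim N(0, \sigma_2) \tag{3.2}$$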

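One can generalize this error correction model into a more general form, where these corrections in the equilibrium may depend on previous changes in the equilibrium due to possible autocorrelations, like:

$$\Delta Y_{1,t} = \alpha_1 (Y_{1,t-1} - \beta Y_{2,t-1}) + \phi_{1,1} \Delta Y_{1,t-1} + \phi_{1,2} \Delta Y_{2,t-1} + a_{1,t}, \quad a_{1,t} \sim N(0, \sigma_1) \tag{4.1}$$

$$\Delta Y_{2,t} = \alpha_2 (Y_{1,t-1} - \beta Y_{2,t-1}) + \phi_{2,1} \Delta Y_{1,t-1} + \phi_{2,2} \Delta Y_{2,t-1} + a_{2,t}, \quad a_{2,t} \sim N(0, \sigma_2) \tag{4.2}$$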

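This model is in fact a VAR model (with two lags when written in levels). To verify that, one can simply put this pair of equations into matrix form, resulting in (5) and (6):

$$\Delta Y_t = \alpha \beta' Y_{t-1} + A \Delta Y_{t-1} + a_t \tag{5}$$

where:

$$\alpha = \begin{bmatrix} \alpha_1 \\ \alpha_2 \end{bmatrix}, \quad \beta' = \begin{bmatrix} 1 & -\beta \end{bmatrix}, \quad A = \begin{bmatrix} \phi_{1,1} & \phi_{1,2} \\ \phi_{2,1} & \phi_{2,2} \end{bmatrix} \tag{6}$$

or, rewriting in levels:

$$Y_t = (\alpha \beta' + A + I) Y_{t-1} - A Y_{t-2} + a_t \tag{7}$$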
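The step from the error correction form (5) to the levels form (7) uses only the definition $\Delta Y_t = Y_t - Y_{t-1}$. A short derivation, written with the same symbols as (5) and (7) and with $I$ denoting the identity matrix:

```latex
\begin{aligned}
Y_t - Y_{t-1} &= \alpha\beta' Y_{t-1} + A\,(Y_{t-1} - Y_{t-2}) + a_t \\
Y_t &= Y_{t-1} + \alpha\beta' Y_{t-1} + A\,Y_{t-1} - A\,Y_{t-2} + a_t \\
Y_t &= (\alpha\beta' + A + I)\,Y_{t-1} - A\,Y_{t-2} + a_t
\end{aligned}
```

The last line is a vector autoregression in the levels of $Y_t$ whose coefficient matrices are restricted by $\alpha$, $\beta$ and $A$.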

In fact, according to Gujarati and Porter (2011), this relationship can be generalized and is guaranteed by the Granger Representation Theorem, which shows that any VAR(p) can be written as a VEC(q) and vice versa.

Depending on the autocorrelation structure, it may be of interest to have both a VEC(q) model and its respective VAR(p). More details can be found in (Greene, 2002).

3 Presentation of the Problem

This paper considers a VAR and a VEC model with the following variables: traffic and Gross Domestic Product (GDP), both endogenous; and two kinds of univariate SARIMA models, one with a seasonal difference plus a stochastic seasonal shock, and another with an autoregressive seasonal term. The GDP series is available at the site of IPEA (“Instituto de Pesquisa Econômica Aplicada”, the Brazilian Institute of Applied Economic Research), while the other series are publicly available upon request to ARTESP (“Agência Reguladora de Transportes do Estado de São Paulo”, the Transportation Regulatory Agency of São Paulo State, Brazil). The time series encompasses monthly observations from March 31st, 1998 until July 31st, 2013. The last six observations are held out to test the forecast accuracy of the models.

In addition, a main concern is that treating the Gross Domestic Product as an endogenous variable may be counter-intuitive. However, it is known that traffic can act as a leading indicator of GDP behavior, and this assumption is in fact tested in this paper by verifying the cointegration between both variables.

The traffic was normalized to an equivalent-vehicle basis in order to convert different types of vehicles into cars, e.g. a heavy truck is equivalent to “n” cars, while a light truck is equivalent to “n-2” cars.
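As a small illustration of this normalization, the sketch below converts counts by vehicle class into car equivalents. The paper does not report the value of “n”, so `heavy_factor` is a free parameter here, and the class names and the example counts are hypothetical.

```python
def equivalent_vehicles(counts, heavy_factor):
    """Convert vehicle counts into car equivalents.

    A heavy truck counts as `heavy_factor` cars (the paper's "n") and a light
    truck as `heavy_factor - 2` cars; the value of "n" is not given in the paper.
    """
    factors = {
        "car": 1,
        "light_truck": heavy_factor - 2,
        "heavy_truck": heavy_factor,
    }
    return sum(factors[vehicle] * count for vehicle, count in counts.items())
```

For example, with a purely illustrative `heavy_factor` of 6, the call `equivalent_vehicles({"car": 100, "light_truck": 10, "heavy_truck": 5}, 6)` gives 100 + 10 × 4 + 5 × 6 = 170 equivalent vehicles.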

Seasonality in the vector models was accounted for by including a vector of dummy variables, since the data are monthly.

Then, with all the time series normalized and the seasonal effects accounted for, the rank of cointegration and the number of lags must be established.

In this case, the rank of cointegration is the number of cointegrating vectors, which is tested according to (Johansen, 1988), and the lowest value of the information criterion determines the number of lags in both univariate and multivariate models, as suggested in (Lütkepohl and Krätzig, 2004). For multivariate models, the Bayesian Information Criterion was chosen because it imposes stronger penalties on the inclusion of new parameters, and this kind of model naturally has a larger number of parameters. For univariate models, on the other hand, the Akaike Information Criterion was used, because these models generally have fewer parameters than the multivariate ones.
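The selection rule can be sketched in a few lines of Python. The `aic` and `bic` functions use the textbook definitions in terms of the log-likelihood, which may be scaled differently from the values printed by GRETL or R; the dictionary holds the BIC values reported in Table 1 of the Results section, and the function names are hypothetical.

```python
import math

# BIC values by number of lags, as reported in Table 1.
TABLE_1_BIC = {
    1: 46.740746,
    2: 46.811174,
    3: 46.868567,
    4: 46.958411,
    5: 46.970219,
    6: 47.066916,
}


def aic(log_lik, n_params):
    """Akaike Information Criterion (textbook form): -2 ln L + 2 k."""
    return -2.0 * log_lik + 2.0 * n_params


def bic(log_lik, n_params, n_obs):
    """Bayesian Information Criterion (textbook form): -2 ln L + k ln T."""
    return -2.0 * log_lik + n_params * math.log(n_obs)


def best_lag(criterion_by_lag):
    """Number of lags with the lowest criterion value."""
    return min(criterion_by_lag, key=criterion_by_lag.get)
```

The BIC penalty per parameter, ln T, exceeds the AIC penalty of 2 whenever T is greater than about 7.4 observations, which is why BIC favors smaller models on a sample of this size.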

The estimation of the parameters and all the tests mentioned were computed using GRETL (Gnu Regression, Econometrics and Time-series Library) for the multivariate models and R for the univariate models.

4 Results

Table 1 presents the results of the Bayesian Information Criterion lag search for the multivariate models.

Table 1. Bayesian Information Criterion of the Lag Search.

lags    BIC
1       46.740746*
2       46.811174
3       46.868567
4       46.958411
5       46.970219
6       47.066916

As can be seen in this table, the multivariate models must have only one lag.

For the univariate models, the most common combinations of lags over shocks and autoregressive terms were tested using the “auto.arima” function, provided in the “forecast” package of the R statistical software, to find the optimal regular ARIMA structure. The result was an ARIMA polynomial of the form ARIMA (p=1, d=1, q=4): in words, a first-order autoregressive part, a first-order difference of the original series, and four lags over the innovations (shocks).

Then, the two most usual seasonal polynomials were calibrated, SARIMA (p=1, d=0, q=0) and SARIMA (p=0, d=1, q=1), following the same notation as above.

The rank of cointegration was determined according to the Johansen test (1988); for a null rank matrix (null hypothesis), the p-value is 0.03. So, the statistical evidence points out that there is no cointegrating relationship between the variables. Despite that, the VEC model was still estimated in this paper for comparison purposes.

Thus, four different models were obtained, as follows.

Seasonal Model with Seasonal Difference:

$$\Delta Y_t - \Delta Y_{t-12} = -0.4864 \cdot \Delta Y_{t-1} + 0.0447 \cdot a_{t-1} - 0.2215 \cdot a_{t-2} - 0.0978 \cdot a_{t-3} - 0.5514 \cdot a_{t-4} - 0.6753 \cdot a_{t-12} + 34710.72 \tag{8}$$

Seasonal Model with Autoregressive Seasonal Components:

$$\Delta Y_t = -0.5280 \cdot \Delta Y_{t-1} - 0.0231 \cdot a_{t-1} - 0.2902 \cdot a_{t-2} - 0.1227 \cdot a_{t-3} - 0.5641 \cdot a_{t-4} + 0.8141 \cdot \Delta Y_{t-12} + 26791.039 \tag{9}$$

VAR Model with Seasonal Dummies:

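$$\begin{bmatrix} Y \\ GDP \end{bmatrix}_t = \begin{bmatrix} 0.2523 & 9.7520 \\ 0.0019 & 0.9735 \end{bmatrix} \cdot \begin{bmatrix} Y \\ GDP \end{bmatrix}_{t-1} + \begin{bmatrix} K_1 \\ K_2 \end{bmatrix} \tag{10}$$

where $K_1$ and $K_2$ are the seasonal dummies, as shown in Table 2.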

Table 2. Seasonal Parameters Estimates of the VAR Model.

            K1          K2
S1          181443      -7270.76
S2          -623254     -4400.54
S3          -214817     12191.9
S4          -460430     4863.24
S5          -560794     12545.8
S6          -653837     5743.71
S7          -296413     2613.97
S8          -481878     5324.9
S9          -451521     -1374.43
S10         -195290     13806.3
S11         -395400     6612.35
Constant    1259780     -5468.02

Thus, if the month to be predicted is January, one must add the coefficient S1 to the constant, and so on for the respective month being predicted.

Finally, the VEC model with seasonal dummies is presented as follows.

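$$\Delta \begin{bmatrix} Y \\ GDP \end{bmatrix}_t = \begin{bmatrix} -0.74791 & 9.769 \\ -0.0019 & -0.0247 \end{bmatrix} \cdot \begin{bmatrix} Y \\ GDP \end{bmatrix}_{t-1} + \begin{bmatrix} 9.769 \\ -0.0247 \end{bmatrix} \cdot \left[ GDP - 0.0765 \cdot Y \right]_{t-1} + \begin{bmatrix} K_1 \\ K_2 \end{bmatrix} \tag{11}$$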

where $K_1$ and $K_2$ are the seasonal dummies, as shown in Table 3:

Table 3. Seasonal Parameter Estimates of the VEC Model.

            K1          K2
S1          181590      -7256.59
S2          -622809     -4357.79
S3          -214438     12228.3
S4          -460173     4887.3
S5          -560589     12565.6
S6          -653788     5748.41
S7          -296368     2618.3
S8          -481723     5339.82
S9          -451418     -1364.57
S10         -195095     13825
S11         -395336     6618
Constant    911177      -1494.86
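The rule stated after Table 2 (add the dummy coefficient of the month being predicted to the constant) can be combined with the VAR coefficient matrix of Equation (10) into a one-step forecast. The sketch below uses the values of Equation (10) and Table 2. Two points are assumptions of this sketch: that S1 to S11 stand for January to November with December as the base month (only January is stated explicitly in the text), and the function names themselves.

```python
# VAR(1) coefficient matrix from Equation (10); variables ordered as (Y, GDP).
PHI = ((0.2523, 9.7520), (0.0019, 0.9735))

# Table 2: seasonal dummy coefficients (K1 for traffic Y, K2 for GDP).
SEASONAL = {
    1: (181443.0, -7270.76),
    2: (-623254.0, -4400.54),
    3: (-214817.0, 12191.9),
    4: (-460430.0, 4863.24),
    5: (-560794.0, 12545.8),
    6: (-653837.0, 5743.71),
    7: (-296413.0, 2613.97),
    8: (-481878.0, 5324.9),
    9: (-451521.0, -1374.43),
    10: (-195290.0, 13806.3),
    11: (-395400.0, 6612.35),
}
CONSTANT = (1259780.0, -5468.02)


def seasonal_term(month):
    """Deterministic term (K1, K2) for a calendar month (1 = January)."""
    if not 1 <= month <= 12:
        raise ValueError("month must be between 1 and 12")
    # Assumption: December has no dummy, so only the constant applies.
    s1, s2 = SEASONAL.get(month, (0.0, 0.0))
    return CONSTANT[0] + s1, CONSTANT[1] + s2


def forecast_one_step(y_prev, gdp_prev, month):
    """One-step-ahead VAR(1) forecast of (Y, GDP) for the given target month."""
    k1, k2 = seasonal_term(month)
    y_next = PHI[0][0] * y_prev + PHI[0][1] * gdp_prev + k1
    gdp_next = PHI[1][0] * y_prev + PHI[1][1] * gdp_prev + k2
    return y_next, gdp_next
```

For a January forecast, `seasonal_term(1)` gives 181443 + 1259780 = 1441223 for the traffic equation. The inputs `y_prev` and `gdp_prev` must be in the same units as the series used in the estimation, which the paper does not state.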

5 Analysis of the Results

To select the best model, the out-of-sample forecasting accuracy is measured in terms of the mean absolute error, as follows.

Table 4. Out-of-sample Errors of the Models.

Model                             Mean Absolute Error
ARIMA(1,1,4) - Seasonal IMA(1)    11.28%
ARIMA(1,1,4) - Seasonal AR(1)     10.70%
VAR(1)                            3.23%
VEC(1)                            3.14%
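Since the errors in Table 4 are reported as percentages, a plausible reading is that each absolute error is taken relative to the observed value and then averaged over the six held-out months. The sketch below implements that reading; the exact formula used in the paper is not stated, so this is an assumption, and the function name is hypothetical.

```python
def mean_absolute_pct_error(actual, forecast):
    """Mean of |forecast - actual| / |actual|, expressed in percent."""
    if len(actual) != len(forecast) or not actual:
        raise ValueError("inputs must be non-empty and of equal length")
    total = sum(abs(f - a) / abs(a) for a, f in zip(actual, forecast))
    return 100.0 * total / len(actual)
```

For instance, observed values of 100 and 200 with forecasts of 110 and 190 give relative errors of 10% and 5%, hence a mean of 7.5%.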

Thus, the very surprising result is that the VEC(1) model, which according to the existing literature should not even have been estimated, is the best model in terms of out-of-sample performance, although only six out-of-sample observations were used for validation due to sampling issues, which may influence these results. Nonetheless, it was already expected that a multivariate model would perform better than a univariate one, because more information is included. Another interesting fact is that the log-likelihood of the univariate models is far worse than that of the multivariate ones, as can be seen in Table 5; the model with the largest log-likelihood is the best one.

Table 5. Log-Likelihood of the Models.

Model                             Log-Likelihood
ARIMA(1,1,4) - Seasonal IMA(1)    -2272.78
ARIMA(1,1,4) - Seasonal AR(1)     -2451.21
VAR(1)                            -1268.54
VEC(1)                            -1268.54

Hence, based on these results, the backtesting procedure appears to be a very important part of the modeling process, since the log-likelihood estimate does not provide all the information needed to determine which model is the best.

When the fitted values of the models are compared with the observed values (“Obs” in Figures 1 and 2), it is possible to see that the SARIMA models (Figure 2) converge more slowly towards the observed values than the vector-based models. This can be explained by the fact that these univariate seasonal models rely on past observed values to forecast the seasonal factors. On the other hand, the vector-based models (Figure 1) rely on deterministic seasonal dummy variables. Thus, even though past values are unknown to the autoregressive part, values are already being fed into the model, providing estimates of the seasonal fluctuations.

Another interesting point is that, despite having a larger number of variables, the multivariate models had a poorer performance within the sample, so basically the models that were actually overfitted were the univariate ones.

Finally, the most important feature of vector models in terms of policy analysis is shown here: the impulse response structure that can be retrieved from the system, following (Sims, 1980). This method is based on the decomposition of the covariance matrix using a Cholesky algorithm, to obtain what is called a Structural VAR/VEC. It can be regarded as a VAR with contemporaneous relationships, as in the following expression:

$$\phi_0 \cdot Y_t = \phi_1 \cdot Y_{t-1} + \phi_2 \cdot Y_{t-2} + \dots + \phi_n \cdot Y_{t-n} + K + a_t \tag{12}$$

Multiplying the whole equation by the inverse of $\phi_0$, one gets a VAR as in Equation (1), which can be estimated using the traditional OLS algorithm.



Therefore, after decomposing the covariance matrix, it is possible to impose causal restrictions in order to retrieve the matrix of contemporaneous relationships. For example, if the economy (GDP) is expected to cause the traffic on the road, one may infer how the dynamics between the time series behave from the impulse response of the traffic to the GDP. This is a powerful tool that enables the researcher to verify dynamic effects instead of just applying a first-order (linear) approximation, as in the traditional simple linear regression over the logarithms of the variables (a procedure known as “elasticity calculation”).
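A minimal sketch of Cholesky-based impulse responses for a bivariate VAR(1) is given below, using only the standard library. The variables are ordered as (GDP, traffic), so that a GDP shock may affect traffic contemporaneously but not the reverse, matching the causal restriction described above. The coefficient matrix is Equation (10) reordered accordingly; the residual covariance matrix is not reported in the paper, so `SIGMA_EXAMPLE` is purely hypothetical, as are the function names.

```python
import math

# Equation (10) coefficients reordered so that the state vector is (GDP, Y).
PHI_GDP_FIRST = [[0.9735, 0.0019], [9.7520, 0.2523]]

# Hypothetical residual covariance matrix (not reported in the paper).
SIGMA_EXAMPLE = [[4.0, 2.0], [2.0, 10.0]]


def cholesky_2x2(sigma):
    """Lower-triangular P with P * P' = sigma, for a 2 x 2 covariance matrix."""
    l11 = math.sqrt(sigma[0][0])
    l21 = sigma[1][0] / l11
    l22 = math.sqrt(sigma[1][1] - l21 ** 2)
    return [[l11, 0.0], [l21, l22]]


def matmul(a, b):
    """Product of two 2 x 2 matrices."""
    return [[sum(a[i][k] * b[k][j] for k in range(2)) for j in range(2)]
            for i in range(2)]


def impulse_responses(phi, sigma, horizon):
    """Orthogonalised responses Theta_h = phi^h * P for h = 0, ..., horizon.

    Theta_h[i][j] is the response of variable i, h periods after a
    one-standard-deviation orthogonal shock to variable j.
    """
    p = cholesky_2x2(sigma)
    power = [[1.0, 0.0], [0.0, 1.0]]  # phi^0
    responses = []
    for _ in range(horizon + 1):
        responses.append(matmul(power, p))
        power = matmul(phi, power)
    return responses
```

With this ordering, the response of traffic to a GDP shock after h months is `impulse_responses(PHI_GDP_FIRST, SIGMA_EXAMPLE, 12)[h][1][0]`. The numbers obtained depend entirely on the assumed covariance matrix.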

Figure 3. Impulse-Response of Trafego to a Shock in GDP.

As can be seen in Figure 3, a standard shock (a unitary shock in terms of the covariance matrix retrieved in the VAR/VEC models) in the evolution of the GDP causes an increase of 50 thousand vehicles after 4 months, and the response stabilizes after 5 months.

6 Conclusion

This paper showed that it is possible to build an autoregressive multivariable model to describe the traffic data of one of the most important toll roads in Brazil, with significant seasonal effects and a large number of vehicles.

Four kinds of models were estimated: a VAR, a VEC and two kinds of Seasonal ARIMA models. Furthermore, methodologies were discussed for testing cointegration between the variables, testing for unit roots and selecting the optimal lag structure.

It is possible to observe that the two multivariate methodologies produced very similar forecasts, as did the two univariate models. Despite that, the two kinds of models differed significantly in both the long run and the short run, the first kind (multivariate) being the best, producing reasonable forecasts with a 3% mean absolute error.

It is also important to note that this paper shows the usefulness of impulse-response analysis, which seems to be far more reasonable than the traditional elasticity measures applied to simple linear regression based models in policy analysis.

As a perspective for future work, it is suggested to expand this analysis to other large road systems in Brazil and other countries; to continue updating the existing database and verifying possible structural and parameter changes in these models; and to include in this comparison the performance of NARX models (nonlinear autoregressive models) and standard neural-network based models, either using only autoregressive components of the dependent variable or evaluating the inclusion of other possible candidate independent variables (e.g. GDP).

7 References

ASHLEY, R.A., GRANGER, C.W.J. (1979). Time series analysis of residuals from St. Louis model. In Journal of Macroeconomics, 1, 373-394.

BAIN, R. (2009). Error and optimism bias in toll road traffic forecasts, Working Paper, RePEC.

BOLSHINSKY, E., FRIEDMAN, R. (2012). Traffic flow forecast survey. Tech. rep., Technion – Israel Institute of Technology.

BOX, G.E.P., JENKINS, G.M. (1976). Time Series Analysis: Forecasting and Control. 1st Edition, San Francisco, Holden-Day.

DICKEY, D.A., FULLER, W.A. (1979). Distribution of the estimators for autoregressive time series with a unit root. In Journal of the American Statistical Association, vol. 74, 427-431.

ENGLE, R.F., GRANGER, C.W.J. (1987). Cointegration and error correction: Representation, estimation and testing. In Econometrica, vol. 55, 251-276.

FILLATRE, L., MARAKOV, D., VATON, S. December (2005). Forecasting Seasonal Traffic Flows. Workshop EuroNGI, Paris.

GRANGER, C.W.J., NEWBOLD, P. (1974). Spurious Regressions in Econometrics, Journal of Econometrics, vol. 2, 111-120.

GREENE, W.H. (2002). Econometric Analysis, 5th Edition, Upper Saddle River, New Jersey, Prentice Hall.

GUJARATI, D.N., PORTER, D.C. (2011) Econometria Básica, Editora Bookman, São Paulo.

HAMILTON, J.D. (1994). Time Series Analysis, 1st Edition, Princeton, New Jersey, Princeton University Press.

HENDRY, D.F., JUSELIUS, K. (2000). Explaining Cointegration Analysis: Part 1. In The Energy Journal, International Association for Energy Economics, vol. 0 (Number 1), 1-42

HENDRY, D.F., JUSELIUS, K. (2001). Explaining Cointegration Analysis: Part 2. In The Energy Journal, International Association for Energy Economics, vol. 0 (Number 1), 75-120.

IPEADATA, available at http://www.ipeadata.gov.br, accessed on 01/11/2013.

JOHANSEN, S. (1988). Statistical Analysis of cointegration vectors. In Journal of Economic Dynamics and Control, vol. 12, 231-254.

LÜTKEPOHL, H., KRÄTZIG, M. (2004). Applied Time Series Econometrics, 1st Edition, New York, Cambridge University Press.

LÜTKEPOHL, H. (1991). Introduction to Multiple Time Series Analysis, Heidelberg, Springer Verlag.

MORETTIN, P.A. (2011). Econometria Financeira: Um Curso em Séries Temporais Financeiras, 1st Edition, São Paulo, Editora Edgar Blücher.

MORETTIN, P.A., TOLÓI, C. (2004). Análise de Séries Temporais, 1st Edition, São Paulo, Editora Edgar Blücher.

SCHWARZ, G. (1978). Estimating the dimension of a model. In The Annals of Statistics, vol. 6, 461-464.

SIMS, C. (1980). Macroeconomics and Reality. In Econometrica, vol. 48, no. 1, 1-48.

