
REGRESSION ANALYSIS OF TIME SERIES DATA

by

William S. Cleveland
Department of Statistics

University of North Carolina at Chapel Hill

Institute of Statistics Mimeo Series No. 753

April, 1971


ABSTRACT

In fitting regression-autoregression models to time series data, it is very convenient to be able to treat the model like a single large regression. The asymptotic sampling theory justification for doing this is reviewed. It is shown that from the likelihood inference and Bayesian inference points of view, using standard regression inferences is justifiable when the amount of information from the initial assumptions and initial observations is small compared with the additional information from the rest of the sample.

The use of various regression techniques such as transformations, subset regression calculations, and nonlinear least squares in fitting regression-autoregression models is discussed.


REGRESSION ANALYSIS OF TIME SERIES DATA

William S. Cleveland*

*William S. Cleveland is assistant professor, Department of Statistics, University of North Carolina, Chapel Hill.

1. INTRODUCTION

Let $t_1,\ldots,t_N$ be a time series; $a$, $b$, and $d_{n,k}$, for $n=a+1,\ldots,N$ and $k=1,\ldots,b$, a set of "known" numbers; $\alpha = (\alpha_1,\ldots,\alpha_a)'$, $\beta = (\beta_1,\ldots,\beta_b)'$, and $\sigma^2$, unknown parameters; and $z_n$, for $n=a+1,\ldots,N$, a sequence of independent normal random variables with mean 0 and variance $\sigma^2$. Suppose for $n=a+1,\ldots,N$, $t_n$ satisfies the equation

$$t_n = \sum_{j=1}^{a}\alpha_j t_{n-j} + \sum_{k=1}^{b}\beta_k d_{n,k} + z_n \qquad (1.1)$$

and that the joint distribution of $t_1,\ldots,t_a$ is normal with mean $\mu = (\mu_1,\ldots,\mu_a)'$ and covariance matrix $\Sigma$. This model will be referred to as a linear regression-autoregression model and will be abbreviated to linear RAR model. $\sum_{j=1}^{a}\alpha_j t_{n-j}$ is the autoregression part of the model and $\sum_{k=1}^{b}\beta_k d_{n,k}$, the regression part of the model.

We could allow nonlinearities in $\alpha$ and $\beta$ and will do so in a few places in the paper. $a$, $b$, and $d_{n,k}$ are "known" in the sense that when it comes time to make inferences about the unknown parameters $\mu$, $\Sigma$, $\alpha$, $\beta$, and $\sigma^2$, the former will be assumed fixed and known. But in any particular application $a$, $b$, and $d_{n,k}$ must somehow be chosen and their adequacy checked.

Usually, interest centers on making inferences about $\alpha$, $\beta$, and $\sigma^2$ with $\mu$ and $\Sigma$ regarded as nuisance parameters. Often, a priori restrictions are made on the values that the vector of parameters $(\mu,\Sigma,\alpha,\beta,\sigma^2)$


can take. For example, it is sometimes assumed from the start that $t_n$ is stationary; that is, the joint distribution of $t_{1+n},\ldots,t_{r+n}$, for any value of $r$, does not depend on $n$. This implies that $b=1$ and $d_{n,1}=1$, for $n=a+1,\ldots,N$; that the roots of the polynomial in $B$, $1-\alpha_1B-\cdots-\alpha_aB^a$, lie outside the unit circle; and that the $(r,s)$th element of $\Sigma^{-1}$ is $h\sum_{i}(\alpha_{i-r}\alpha_{i-s} - \alpha_{a+i-r}\alpha_{a+i-s})$, where $\alpha_0 = 1$ (cf. [15, p. 968]).

Let $Y$ be the $(N-a)\times 1$ vector $(t_{a+1},\ldots,t_N)'$; $Z$ be the $(N-a)\times 1$ vector $(z_{a+1},\ldots,z_N)'$; $\theta$ be the $(a+b)\times 1$ vector $(\alpha_1,\ldots,\alpha_a,\beta_1,\ldots,\beta_b)'$; and $X$ be the $(N-a)\times(a+b)$ matrix whose first $a$ columns are the $a$ vectors $(t_{a+1-j},\ldots,t_{N-j})'$, for $j=1,\ldots,a$, and whose last $b$ columns are the $b$ vectors $(d_{a+1,k},\ldots,d_{N,k})'$, for $k=1,\ldots,b$. Then the $N-a$ equations in (1.1) may be written in matrix form as

$$Y = X\theta + Z. \qquad (1.2)$$

Writing (1.1) in this way makes the model look like a single big regression and tempts us to make inferences about $\theta$ and $\sigma^2$ by using standard regression inferences. The meaning of this italicized phrase will be taken to be the following. From the sampling theory point of view it means estimating $\theta$ and $\sigma^2$ by least squares and applying the usual analysis of variance theory. From the likelihood inference (the approach to inference described in [2] and [12]) or the Bayesian inference point of view it means taking the likelihood or the posterior of $h = \sigma^{-2}$ and $\theta$ to be proportional to

$$h^{\frac{N-a}{2}}\exp\left[-\tfrac{1}{2}h\,(Y-X\theta)'(Y-X\theta)\right], \qquad (1.3)$$

where this function of $h$ and $\theta$ is considered to be defined for all $h > 0$ and all real $(a+b)$-tuples $\theta$. This normal-gamma density is discussed in [17, pp. 342-5]. The standard regression inferences for each of the three approaches are in essence the same (except for minor differences in degrees of freedom). Only the language is different.
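To make the single large regression concrete, here is a minimal sketch (in Python; the code and the simulated series are editorial illustrations, not part of the original paper) which forms $Y$ and $X$ as in (1.2) and computes the least squares estimates of $\theta$ and $\sigma^2$:

```python
import numpy as np

def rar_least_squares(t, d, a):
    """Fit the linear RAR model (1.1) as the single regression (1.2).

    t : series of length N;  d : (N, b) array of known numbers d_{n,k};
    a : order of the autoregression part.  Returns (theta_hat, sigma2_hat)."""
    N = len(t)
    Y = t[a:]                                          # (t_{a+1}, ..., t_N)'
    # First a columns of X: the lagged vectors (t_{a+1-j}, ..., t_{N-j})'.
    lags = np.column_stack([t[a - j:N - j] for j in range(1, a + 1)])
    X = np.hstack([lags, d[a:]])                       # last b columns: regressors
    theta_hat, *_ = np.linalg.lstsq(X, Y, rcond=None)  # least squares estimate
    resid = Y - X @ theta_hat
    sigma2_hat = resid @ resid / (N - a - X.shape[1])  # usual ANOVA estimate
    return theta_hat, sigma2_hat

# Hypothetical example: an AR(2) series with a constant regression term (b = 1).
rng = np.random.default_rng(0)
t = np.zeros(200)
for n in range(2, 200):
    t[n] = 5.0 + 0.5 * t[n - 1] - 0.3 * t[n - 2] + rng.normal()
theta_hat, sigma2_hat = rar_least_squares(t, np.ones((200, 1)), a=2)
```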

The situations where it is justifiable to make standard regression

inferences about $\theta$ and $\sigma^2$ are discussed in Section 2 from the sampling

theory point of view and in Section 3 from the likelihood and Bayesian points

of view. In Section 4 these different justifications are compared. In

Section 5 the regression with stationary autoregressive errors model is

discussed. Comparisons are made between this model and the linear RAR model.

In Sections 6-10 it is assumed that for the particular set of data

being analyzed it is justifiable to use standard regression inferences.

Under this assumption the use of various regression techniques in setting up

and analyzing a RAR model is discussed. The general philosophy of using

regression techniques in fitting a RAR model is discussed in Section 6.

Sections 7 and 8 treat the question of selecting some transformation of the

series $t_n$ in order to get a better fit. In Section 7 likelihood techniques are used to estimate the parameter of a power transformation of $t_n$. In Section 8 a procedure is suggested for selecting a transformation of $t_n$ in order to remove inhomogeneities in the variation of $t_n$. In Section 9 an

algorithm is given for calculating all subset regressions (really autoregressions) in the stationary autoregressive model with mean 0. In Section

10 a method of unconstrained optimization is given for finding the maximum of

the likelihood function for the particularly useful nonlinear seasonal RAR

model discussed in [4, p. 300].

The techniques suggested in this paper are illustrated by their appli-

cation to birth rate, air pollution, airline passenger, telephone installa-

tion, and solar flux data.


2. SAMPLING THEORY STANDARD REGRESSION INFERENCES

Sampling theory justifications for making standard regression inferences

about the parameters in a RAR model have been given by Mann and Wald [14] and

Durbin [9]. Mann and Wald deal with the stationary RAR model. It is shown that making standard regression inferences is justifiable if the sample size is sufficiently large. Durbin gives a similar asymptotic result for the more general linear RAR model but requires the assumption that $(E[X'X])^{-1}X'X$ converges stochastically to the identity matrix. No conditions, other than stationarity, are given which assure this convergence.

3. BAYESIAN AND LIKELIHOOD STANDARD REGRESSION INFERENCES

The following conventions will be useful in this section. First, if $\Gamma$ is a set, "$\Gamma$" will also be used to denote the indicator function of the set $\Gamma$. For example the indicator function $[y: |y| < 1]$ equals one if $|y| < 1$ and is zero otherwise. As in the previous section we will let $h = \sigma^{-2}$. Also $R^k$ will denote the set of $k$-tuples of real numbers and $R^+$ will denote the positive reals.

The important idea used in this section will be the principle of stable estimation [10, p. 201]. Suppose $F_1$ and $F_2$ are two functions of a vector of parameters $\phi$ such that inferences about $\phi$ are to be based solely on $F_3(\phi) = F_1(\phi)F_2(\phi)$. In the likelihood approach $F_3$ would be the likelihood function and in the Bayesian approach $F_3$ would be the posterior density of $\phi$. Now if $F_1$ is relatively flat in the region where most of the mass of $F_2$ is concentrated and is nowhere too big outside this region, then $F_3$ (for the purpose of making inferences about $\phi$) may be approximated by $F_2$. We shall now see how this notion of stable estimation can be used to decide when the RAR model can be treated like a standard regression. It will be


assumed that $\alpha$, $\beta$, and $h$ are the parameters of interest and $\mu$ and $\Sigma$ are nuisance parameters.

The likelihood function $L$ of a sample $t_1,\ldots,t_N$ from the linear RAR model (1.1) may be factored into two functions

$$L(t_1,\ldots,t_N\,|\,\mu,\Sigma,\alpha,\beta,h) = L_1(t_1,\ldots,t_a\,|\,\mu,\Sigma)\,L_2(t_{a+1},\ldots,t_N\,|\,t_1,\ldots,t_a,\alpha,\beta,h). \qquad (3.1)$$

As previously stated in Section 1 the vector of parameters $(\mu,\Sigma,\alpha,\beta,h)$ may be a priori restricted to lie in some region $\Gamma_1$. The stationary case was given as a particular example. (It would, however, be quite rare that one would restrict the range of $\beta$ to be less than all of $R^b$ or the range of $h$ to be less than all of $R^+$.) Let $\Gamma_2$ be the region in which the vector of parameters $(\alpha,\beta,h)$ must lie. $\Gamma_2$ is then a subset of $R^a\times R^b\times R^+$. The likelihood function in (3.1) and the two factors on the right side are, once the sample is taken, regarded as functions of the parameters; we will therefore write them as $L(\mu,\Sigma,\alpha,\beta,h)$, $L_1(\mu,\Sigma)$, and $L_2(\alpha,\beta,h)$. Now it is easy to see that $L_2(\alpha,\beta,h)$ is identical to the function in (1.3) (recall that $\theta = (\alpha,\beta)'$), except that $L_2(\alpha,\beta,h) = 0$ if $(\alpha,\beta,h)$ is not in $\Gamma_2$. This is a bit inconvenient for what is about to be said, so we will change the definition of $L_2$ slightly. $L_2$ will be taken as identical to (1.3) and (3.1) will then be rewritten as

$$L(\mu,\Sigma,\alpha,\beta,h) = L_1(\mu,\Sigma)\,\Gamma_2(\alpha,\beta,h)\,L_2(\alpha,\beta,h).$$

We will now show when it is justifiable from the likelihood and the Bayesian points of view to make standard regression inferences about $\alpha$, $\beta$, and $h$. The Bayesian case will be given first. Let $P(\mu,\Sigma,\alpha,\beta,h)$ be a prior distribution on the parameters. The posterior density of the parameters after observing the first $a$ sample values $t_1,\ldots,t_a$ is proportional to $P(\mu,\Sigma,\alpha,\beta,h)L_1(\mu,\Sigma)$. Let $F_1(\alpha,\beta,h)$ be the marginal posterior density of


$\alpha$, $\beta$, and $h$ after observing $t_1,\ldots,t_a$, which is proportional to the marginal function in $\alpha$, $\beta$, and $h$ of the product $PL_1$. Let $F_2(\alpha,\beta,h) = \Gamma_2L_2(\alpha,\beta,h)$ and let $F_3(\alpha,\beta,h) = F_1(\alpha,\beta,h)F_2(\alpha,\beta,h)$. Then $F_3$ is proportional to the marginal posterior of $\alpha$, $\beta$, and $h$ after observing the entire sample $t_1,\ldots,t_N$.

Standard regression inferences about $\alpha$, $\beta$, and $h$ may be made if:

(3.2) The principle of stable estimation as described earlier applies to the $F_3$, $F_2$, and $F_1$ just defined.

(3.3) Most of the mass of $L_2$ lies in $\Gamma_2$.

From (3.2), $F_3$ is well approximated by $F_2 = \Gamma_2L_2$ and from (3.3), $\Gamma_2L_2$ is well approximated by $L_2$; thus $F_3$ is well approximated by $L_2$. But as previously stated, $L_2$ is just the standard regression inference function given in (1.3).

Condition (3.2) says that if the amount of information about $\alpha$, $\beta$, and $h$ after observing $t_1,\ldots,t_a$ is small compared with the additional amount of information from $t_{a+1},\ldots,t_N$, then $\Gamma_2L_2 = F_2$ is a good approximation of $F_3$. Note that if the prior distribution on the parameters is such that $(\mu,\Sigma)$ is independent of $(\alpha,\beta,h)$ and the prior density of $(\alpha,\beta,h)$ is proportional to $\Gamma_2$, then there is no information about $(\alpha,\beta,h)$ from $t_1,\ldots,t_a$ and $F_3 = F_2$. In addition, if $\Gamma_2 = R^a\times R^b\times R^+$ then $L_2 = F_2$ and $F_3$ is exactly the standard regression inference function in (1.3).

The discussion of the likelihood case proceeds exactly as the Bayesian

case with the word "posterior" replaced by "likelihood" and the prior density

$P$ replaced by $\Gamma_1$.

One expects the approximation typically to be valid when N is large.

With a large $N$ we can justifiably expect the additional information about $\alpha$, $\beta$, and $h$ to be much greater than the information after observing $t_1,\ldots,t_a$, which remains fixed as $N$ increases. Of course this does not

happen with probability 1 so it is necessary to check the approximation.

As an example suppose $t_n$ is a stationary first order autoregression with mean 0. That is, $t_n = \alpha_1t_{n-1} + z_n$, for $n=2,\ldots,N$; $|\alpha_1| < 1$; and $t_1$ is normal with mean 0 and variance $h^{-1}(1-\alpha_1^2)^{-1}$. Suppose the prior density on $\alpha_1$ is uniform and the prior on $h$ is $h^{-1}$, independently of $\alpha_1$. Then

$$L_1(\alpha_1,h) = h^{\frac{1}{2}}(1-\alpha_1^2)^{\frac{1}{2}}\exp\left[-\tfrac{1}{2}h(1-\alpha_1^2)t_1^2\right]$$

and

$$L_2(\alpha_1,h) = h^{\frac{N-1}{2}}\exp\left[-\tfrac{1}{2}h\sum_{n=2}^{N}(t_n-\alpha_1t_{n-1})^2\right],$$

and the posterior of $\alpha_1$ and $h$ will be proportional to $h^{-1}L_1(\alpha_1,h)L_2(\alpha_1,h)$.

If (3.2) and (3.3) hold then $L_2(\alpha_1,h)$, normalized to integrate to one, is

a good approximation of the posterior.

This last paragraph will now be applied to the 199 daily changes from July 16, 1968 through January 31, 1969 in the level of the solar power flux density (defined in [20, Section 16.3]) at 2800 megahertz, which were recorded at the National Research Council, Ottawa, Canada. The observations are given by the following string of numbers; an italicized number has a negative value:

6 8 0 1 2 6 7 6 5 3 8 3 1 6 3 1 0 7 0 0 12 8 2 5 0 4 10 10 13 1 6 10 4 0 0
1 0 10 7 6 6 4 5 2 4 2 6 5 8 1 4 3 5 8 0 7 3 3 0 4 11 3 1 8 2 0 7 6 1 16 15
2 2 1 4 10 8 0 9 1 2 3 2 0 4 3 1 3 9 5 5 0 6 7 4 2 8 2 2 6 1 1 0 2 21 28 4
6 2 0 6 8 1 2 4 2 0 3 3 2 0 3 6 13 8 7 2 6 3 2 2 1 0 1 0 1 0 12 10 0 2 1 4
2 1 4 3 0 0 7 3 1 7 4 3 9 3 2 0 1 8 2 2 10 13 8 1 0 1 2 2 11 8 14 7 0 1 10
1 0 6 4 0 1 4 0 13 4 4 2 9 6 3 9 11 2 1 3 0. From the knowledge about the

physical mechanism generating the series it seems quite reasonable to assume the mean of $t_n$ is 0. This is substantiated by the sample since the sample mean is $-.08$ and the sample standard deviation is 6.20. From a histogram of


the data it seemed quite reasonable to assume normality, although one might

be a bit hesitant in view of the accuracy of the data. The first 5 sample

autocorrelations are .20, .12, .04, -.05, and -.06. The first 5 sample

partial autocorrelations [4, p. 64] are .20, .08, .00, -.07, and -.04.

The first 5 sample inverse autocorrelations [7] are $-.15$, $-.08$, $-.03$, .05, and .04. From these estimates it seemed quite reasonable to try fitting a first order stationary autoregression.

The amount of information in $L_2$ in this example appears to be large. The maximum of $L_2$ occurs at $\hat\alpha_1 = .19945$ with a standard error of .069. From this rough sort of analysis it would be quite reasonable to suppose $L_2$ serves as a good approximation. Let us, however, look closely at the approximation for making inferences about $\alpha_1$. The actual marginal posterior of $\alpha_1$ is

$$m_1(\alpha_1) \propto (1-\alpha_1^2)^{\frac{1}{2}}\left[(1-\alpha_1^2)t_1^2 + \sum_{n=2}^{199}(t_n-\alpha_1t_{n-1})^2\right]^{-99.5}.$$

This was got by integrating out $h$ and using numerical integration to find the normalizing constant. Now the posterior approximation is

$$m_2(\alpha_1) = \frac{1}{.174}\left[1 + \frac{208.266\,(\alpha_1-.19945)^2}{199}\right]^{-\frac{199+1}{2}},$$

which is a t-density and very nearly normal with mean .19945 and variance .0048. Very nearly all the mass of both $m_1$ and $m_2$ lies in the interval 0 to .4. The maximum of $m_1$ occurs at .19943 and the maximum of $m_2$, at .19945. Table 1 gives, for 11 equally spaced values of $\alpha_1$ from 0 to .4, the values of $m_1$, $m_2$, and their ratio. Clearly the approximation of $m_1$ by $m_2$ is quite good.
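The check just described can be carried out numerically along the following lines; this is a sketch only, with a simulated series standing in for the solar flux changes and a simple grid sum standing in for whatever quadrature was actually used:

```python
import numpy as np

def posterior_and_approximation(t, grid):
    """Exact marginal posterior m1 of alpha_1 versus the t-density
    approximation m2 obtained from L2 alone (first order autoregression,
    mean 0, priors as in the text)."""
    N = len(t)
    # m1(a) is proportional to
    # (1 - a^2)^(1/2) [ (1 - a^2) t_1^2 + sum_{n=2}^N (t_n - a t_{n-1})^2 ]^(-N/2),
    # obtained by integrating h out of h^(-1) L1 L2.
    Q = (1 - grid**2) * t[0]**2 + \
        np.array([np.sum((t[1:] - a * t[:-1])**2) for a in grid])
    log_m1 = 0.5 * np.log1p(-grid**2) - (N / 2) * np.log(Q)
    # m2(a) is proportional to [1 + (a - a_hat)^2 Sxx / s2]^(-(N+1)/2),
    # a t-density centered at the least squares estimate a_hat.
    Sxx = np.sum(t[:-1]**2)
    a_hat = np.sum(t[1:] * t[:-1]) / Sxx
    s2 = np.sum((t[1:] - a_hat * t[:-1])**2)
    log_m2 = -(N + 1) / 2 * np.log(1 + (grid - a_hat)**2 * Sxx / s2)
    step = grid[1] - grid[0]
    m1 = np.exp(log_m1 - log_m1.max()); m1 /= m1.sum() * step
    m2 = np.exp(log_m2 - log_m2.max()); m2 /= m2.sum() * step
    return m1, m2

# Hypothetical stand-in for the 199 solar flux changes.
rng = np.random.default_rng(1)
t = np.zeros(199)
for n in range(1, 199):
    t[n] = 0.2 * t[n - 1] + 6.0 * rng.normal()
grid = np.linspace(0.0, 0.4, 11)          # the 11 points of Table 1
m1, m2 = posterior_and_approximation(t, grid)
ratio = m1 / m2                           # near one when (3.2) and (3.3) hold
```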


4. COMPARISON OF SAMPLING THEORY, BAYESIAN, AND LIKELIHOOD APPROACHES

At present the sampling theory justification for making standard regression inferences in a RAR model is unfortunately quite incomplete. The results in Section 2 are asymptotic and restrict the class of RAR models considered. The main problem is that it is extremely difficult to derive finite sample distributions for the least squares estimates; even if these were calculated there would remain the problem of deciding when these distributions are well approximated by the standard regression inferences.

The Bayesian and likelihood justifications in Section 3 are (statistically)

complete. That is, a routine has been described for deciding with a particular

sample at hand when standard regression inferences may be made. Carrying out

the routine is a numerical analysis problem, albeit in some cases a difficult

one. The reason for attempting to make such an approximation is that if the

approximation is valid, the resulting inferences are more easily understood

and more easily communicated.

5. COMPARISON OF THE LINEAR RAR MODEL AND THE REGRESSION WITH STATIONARY AUTOREGRESSIVE ERRORS MODEL

A model often used in time series studies is the regression model with stationary autoregressive errors. That is, $t_n$ is assumed to satisfy, for $n=1,\ldots,N$, the equation

$$t_n = \sum_{k=1}^{b}\beta_kd_{n,k} + e_n$$

where $e_n$ is a stationary autoregression $e_n = \sum_{j=1}^{a}\alpha_je_{n-j} + z_n$. For example, economists frequently use the model $t_n = T_n + S_n + e_n$ where $T_n$ accounts for trend-cycle and $S_n$ accounts for seasonality.

We could proceed as in Section 3 by writing the likelihood function of a sample from a regression model with stationary autoregressive errors as the product

$$L = L_1(t_1,\ldots,t_a)\,L_2(t_{a+1},\ldots,t_N\,|\,t_1,\ldots,t_a)$$

and see if $L_2$ provides a good approximation of $L$ or is nearly proportional to the posterior density of $\alpha$, $\beta$, and $h$. But the problem is that this $L_2$ for the regression model with stationary autoregressive errors is not in the form of the standard regression inference function (1.3) and is not an easy function to work with because of the nonlinearities in $\alpha$ and $\beta$.

This added complication in the regression with stationary autoregressive errors model should, I think, encourage us to try and fit a linear RAR model when it is possible to do so. Of course, a situation might arise where the knowledge of the mechanism generating the series is such that one can write down the first model and a priori specify the $d_{n,k}$. But usually, choosing the $d_{n,k}$ involves both prior knowledge and a good hard look at the data; there tends to be a lot of guessing, fitting, and checking. In this sort of a situation the RAR model seems more attractive simply because it is easier to work with if it is justifiable to make standard regression inferences. Of course, if this approximation is not valid then each of the models has its own difficult nonlinearities.

6. THE USE OF REGRESSION TECHNIQUES

From this point on in the paper it will be assumed that a RAR model is

being fit to a particular set of data and that a preliminary analysis from

the likelihood or Bayesian point of view has indicated that making standard

regression inferences is justifiable. The RAR model will be treated as if it


is the single large regression in (1.2) with dependent variable $Y$ and independent variables $X_k$, for $k=1,\ldots,a+b$, which are the columns of $X$.

In fitting the RAR model we would like to make use of what will be

called regression techniques - procedures such as transformations of the data,

variable selection, analysis of residuals, graphical displays, and nonlinear

least squares. We could straightforwardly apply these techniques to the RAR

model as they are applied in the general regression situation. However, while

with the RAR model we are now faced with a regression, it is one that has a

special structure (indeed sometimes, as we shall see later, an extremely

special structure) that is not present in the general regression problem for

which these techniques were devised. There is a special relationship among

the dependent variable $Y = (t_{a+1},\ldots,t_N)'$ and the independent variables $X_k = (t_{a+1-k},\ldots,t_{N-k})'$, for $k=1,\ldots,a$, which are the first $a$ columns of $X$.

Thus it is fruitful to rethink the use of regression techniques in the

specially structured RAR model. Some techniques may possibly be modified

so they work much more effectively; some may not change at all; others may

be totally inappropriate; and completely new techniques may be developed.

The following sections are devoted to discussions of the use of a few

of the many regression techniques in fitting RAR models.

7. TRANSFORMATIONS: ESTIMATING THE BEST POWER TRANSFORMATION

In the standard regression theory with independent observations there

have been many discussions about transforming the dependent variable or

transforming the independent variables in order to better satisfy the assumptions of the model. For example, Anscombe and Tukey [1] have discussed

the use of power transformations, $t^{(\tau)} = (t^\tau-1)/\tau$ for $\tau \neq 0$ and $t^{(0)} = \log t$, for reducing inhomogeneities in variance and nonadditivity for positive data. If the data is not positive (or even if it is) then the two-parameter family of power transformations, say $t^{(\tau,c)} = ((t+c)^\tau-1)/\tau$, may be used. In this paper only the one-parameter family will be discussed. A discussion of the two-parameter family would involve the same general principles; the details would change in an obvious way.

Suppose the dependent variable in (1.2) is transformed by the power transformation $t^{(\tau)}$. This is equivalent to rewriting the equations (1.1) as

$$t_n^{(\tau)} = \sum_{j=1}^{a}\alpha_jt_{n-j} + \sum_{k=1}^{b}\beta_kd_{n,k} + z_n.$$

Such a model, however, involves the original series in both a transformed and untransformed way, which is not particularly congenial since it makes understanding the results of the analysis more difficult. Of course, if such a model is a very good approximation of the truth then one must be prepared to face up to it. But what one would like to get away with is to fit a RAR model to $t_n^{(\tau)}$, that is, to fit the model

$$t_n^{(\tau)} = \sum_{j=1}^{a}\alpha_jt_{n-j}^{(\tau)} + \sum_{k=1}^{b}\beta_kd_{n,k} + z_n. \qquad (7.1)$$

In the language of regression the dependent variable and the first $a$ independent variables are being transformed by the same power transformation. In some instances one might also want to transform some of the other independent variables, that is, to replace $d_{n,k}$ in the above equation by $d_{n,k}^{(\tau_k)}$ for certain values of $k$. The choice of $\tau$ will be discussed in this and the next sections.

There are a number of ways of choosing the value of $\tau$. If the transformation has been introduced to remedy a particular ill, then $\tau$ will be chosen to minimize some measure of that ill. A particular case of this will be discussed in the next section. If the power transformation is being incorporated into the model to improve its overall health then it would seem most reasonable to treat $\tau$ as a parameter in the model and utilize whatever


one believes to be a sound approach to estimation, to estimate $\tau$. This approach has been used for transforming the dependent variable in the general regression situation by Box and Cox [3]. The mechanics of estimating $\tau$ in the RAR model (7.1) will now be discussed from the likelihood and Bayesian points of view.

It was stated in Section 6 that we would assume the situation is such that the standard regression inference approximation is valid. But for the model in (7.1), since we have the added parameter $\tau$, this statement must be amended to say that the likelihood or the posterior is nearly proportional to $L_2(\alpha,\beta,h,\tau) = L_2(t_{a+1},\ldots,t_N\,|\,\alpha,\beta,h,\tau,t_1,\ldots,t_a)$. That is, for each fixed $\tau$, standard regression inferences may be made about $\alpha$, $\beta$, and $h$. Define the $(N-a)\times a$ matrix $T(\tau) = [t_{n-j}^{(\tau)}]$, $n=a+1,\ldots,N$, $j=1,\ldots,a$, and the $(N-a)\times b$ matrix $D = [d_{n,k}]$, $n=a+1,\ldots,N$, $k=1,\ldots,b$. As previously stated one might want to transform, in addition to $t_{n-j}$, for $j=1,\ldots,a$, some of the other independent variables $d_{n,k}$ for certain values of $k$. To allow for this possibility define $D(\tau)$ to be the $(N-a)\times b$ matrix obtained by transforming a subset of the columns of $D$. Define the $(N-a)\times(a+b)$ matrix $X(\tau) = [T(\tau)\;\;D(\tau)]$. We will assume that the columns of $X(\tau)$ are linearly independent. (The discussion would be similar if this restriction were removed.) Let $\theta = (\alpha,\beta)'$; $\hat\theta(\tau) = (X'(\tau)X(\tau))^{-1}X'(\tau)Y$, where $Y = (t_{a+1}^{(\tau)},\ldots,t_N^{(\tau)})'$; and $s(\tau) = (Y-X(\tau)\hat\theta(\tau))'(Y-X(\tau)\hat\theta(\tau))$. Then $\hat\theta(\tau)$ is the maximum likelihood estimate of $\theta$ for a fixed $\tau$ and in the Bayesian approach is the conditional mean of $\theta$ given $\tau$.


For fixed $\tau$ and $h$, $L_2$ as a function of $\theta$ is proportional to a multivariate normal density. Thus the marginal likelihood or the marginal posterior of $h$ and $\tau$ is proportional to

$$\left(\prod_{n=a+1}^{N}t_n^{\tau-1}\right)h^{\frac{N-2a-b}{2}}\,|X'(\tau)X(\tau)|^{-\frac{1}{2}}\exp\left[-\tfrac{1}{2}h\,s(\tau)\right].$$

If $h$ is integrated out and we let $\nu = N-2a-b+2$, the marginal of $\tau$ is proportional to

$$\left(\prod_{n=a+1}^{N}t_n^{\tau-1}\right)|X'(\tau)X(\tau)|^{-\frac{1}{2}}\,[s(\tau)]^{-\frac{\nu}{2}}. \qquad (7.2)$$

To help keep the calculations within reasonable bounds it is better to compute the function

$$m(\tau) = \left(\prod_{n=a+1}^{N}t_n\right)^{\tau-1}\frac{|X'(1)X(1)|^{\frac{1}{2}}}{|X'(\tau)X(\tau)|^{\frac{1}{2}}}\left[\frac{s(1)}{s(\tau)}\right]^{\frac{\nu}{2}},$$

which, as a function of $\tau$, is proportional to (7.2). $m(\tau)$ can be computed by first computing $\log m(\tau)$ and then exponentiating.

The estimation of $\tau$ will be governed by $m(\tau)$ and the physical mechanism generating the series. Two possible estimates of $\tau$ are $\hat\tau$, the value of $\tau$ which maximizes $m(\tau)$, and $\bar\tau = (\int\tau\,m(\tau)\,d\tau)/\int m(\tau)\,d\tau$, the mean likelihood estimate or the posterior mean of $\tau$. Often a value $\tau_0$ will be chosen which is close to an estimate that is believed reliable and which is such that $t_n^{(\tau_0)}$ has an easy to understand interpretation.
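A sketch of the $\log m(\tau)$ computation, under the simplifying assumptions that $a = 1$ and that no columns of $D$ are transformed; the series generated at the end is hypothetical and serves only to exercise the function:

```python
import numpy as np

def power(t, tau):
    """One-parameter power transformation t^{(tau)} of a positive series."""
    return np.log(t) if abs(tau) < 1e-12 else (t**tau - 1) / tau

def log_m(t, D, tau, a=1):
    """log m(tau) for the RAR model (7.1), computed relative to tau = 1.

    t : positive series of length N;  D : (N - a, b) untransformed regressors;
    a : autoregressive order.  nu = N - 2a - b + 2 as in the text."""
    N, b = len(t), D.shape[1]
    nu = N - 2 * a - b + 2

    def fit(tau):
        tt = power(t, tau)
        lags = np.column_stack([tt[a - j:N - j] for j in range(1, a + 1)])
        X = np.hstack([lags, D])
        Y = tt[a:]
        theta = np.linalg.solve(X.T @ X, X.T @ Y)   # theta-hat(tau)
        return X, float(np.sum((Y - X @ theta)**2))  # and s(tau)

    X1, s1 = fit(1.0)                               # reference point tau = 1
    Xt, st = fit(tau)
    ld = lambda X: np.linalg.slogdet(X.T @ X)[1]    # log |X'X|
    return ((tau - 1) * np.sum(np.log(t[a:]))       # Jacobian factor
            + 0.5 * (ld(X1) - ld(Xt))
            + (nu / 2) * (np.log(s1) - np.log(st)))

# Hypothetical usage: scan a grid of tau values as in Table 2.
rng = np.random.default_rng(2)
t = np.exp(rng.normal(3.0, 0.2, 234))               # hypothetical positive series
D = np.ones((233, 1))                               # constant regression term only
lm = np.array([log_m(t, D, tau) for tau in np.arange(-1.0, 1.01, 0.1)])
m = np.exp(lm - lm.max())     # relative likelihood, as in column 3 of Table 2
```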

This method of choosing the transformation will be applied to 234

measurements of the level of air pollution in nonurban areas in the United

States from 1958 through 1966. These measurements are reported in [18] and

resulted from averaging daily observations in two-week intervals. The

observations are 24 15 15 19 16 15 11 20 23 18 48 30 33 31 21 39 30 21 28 26

30 22 22 10 25 20 20 11 20 21 17 21 21 28 39 37 31 31 41 21 34 34 37 37 24 20

20 15 20 19 20 13 13 15 19 18 15 21 22 33 31 26 29 33 38 29 42 32 40 35 28 34


29 23 20 24 22 20 16 17 22 19 25 24 18 22 26 27 35 33 39 32 28 32 33 34 23 28

30 30 21 16 21 17 14 21 19 22 20 17 18 18 33 29 30 31 32 34 35 32 36 32 28 26-

26 28 25 18 19 20 21 29 17 19 17 29 44 36 23 35 35 39 39 35 34 29 28 26 33 30

37 31 20 17 18 18 18 25 21 15 29 27 19 34 29 42 50 41 39 44 39 30 28 33 28 23

29 36 30 18 19 26 17 18 20 16 29 31 26 22 23 37 28 37 48 39 24 37 45 34 25 29

23 34 38 16 23 24 20 17 18 20 21 28 23 33 28 34 42 31 57 44 42 33 38 39 30 29

32 39 27 22 23 20. There is a strong yearly seasonal effect (of order 26) and an increasing linear trend. The model fit to the data was of the form (7.1) with $a = 1$ and with a regression part ($b = 4$) consisting of a constant, a linear trend term, and sine and cosine terms of period 26.

The results of the fit are summarized in Table 2. Column 1 gives the values of $\tau$ at which $m(\tau)$ was calculated. Column 2 gives $m(\tau)/10^5$. The maximum of $m(\tau)$ occurs at $\hat\tau = .16$. $m$ is very nearly symmetric about $\hat\tau$ so that $\bar\tau \approx .16$. Column 3 gives the values $m(\tau)/m(.16)$, that is, the relative likelihood function. Since $m(0)/m(\hat\tau) = .50$ it is very tempting to use the logarithms of the observations since it makes sense to think in terms of percentage fluctuation. Columns 4-8 give the values of $(\hat\alpha_1(\tau),\hat\beta_1(\tau),\ldots,\hat\beta_4(\tau)) = (\hat\theta(\tau))'$. The dependence of $\hat\alpha_1(\tau)$ on $\tau$ is slight in the interval $[-.3, .6]$ where most of the mass of $m(\tau)$ is contained. Column 9 gives the values of $g_1(\tau)$, the coefficient of skewness of the fitted residuals

$$\hat z_n(\tau) = t_n^{(\tau)} - \hat\alpha_1(\tau)t_{n-1}^{(\tau)} - \sum_{k=1}^{4}\hat\beta_k(\tau)d_{n,k},$$

for $n=2,\ldots,234$. Thus $g_1(\tau) = \sum_{n=2}^{234}\hat z_n^3(\tau)/(\sum_{n=2}^{234}\hat z_n^2(\tau))^{1.5}$. The untransformed data (corresponding to $\tau=1$) yields residuals $\hat z_n(1)$ that are skewed to the right. The indication from $m(\tau)$ that some transformation is needed seems in large part due to the deviation of these residuals from normality. Indeed it can be seen that in the region of high likelihood the coefficient of skewness has been greatly reduced. At $\tau = \hat\tau = .16$ some skewness is still


present but is only roughly a third of the value at $\tau = 1$. A value of $g_1(\tau) = 0$ occurs between $-.2$ and $-.1$. At $\tau = 0$ the coefficient of

skewness is positive but slight. This perhaps makes taking the logarithms

even somewhat more attractive.

8. VARIANCE HOMOGENIZING TRANSFORMATIONS

Suppose that an initial inspection of a particular time series reveals

that the variation of the series is changing more or less continuously through

time. By variation is meant the magnitude of the up and down oscillations

that have a frequency greater than the frequency of what would be called

trend. Variation is used rather than variance since we want to avoid any

probabilistic notions; the following analysis will be purely structural.

The above initial inspection might involve plotting the series and making

a visual inspection. It also might involve, particularly if the series is a

long one, breaking the series into blocks, calculating a measure of the

variation in each block (perhaps by a sample variance), and inspecting these

measures. Or it might involve calculating a measure of the change in

variation of the series.

It is, of course, highly desirable to have a series whose variation does not exhibit any patterned change; that is, to have a series with homogeneous variation. If $t_n$ is a series with inhomogeneous variation then we can try transforming $t_n$ by a power transformation $t_n^{(\tau)}$ in order to homogenize the variation. Choosing a value of $\tau$ to do this will now be discussed.

One possibility would be to make plots of $t_n^{(\tau)}$ for several values of $\tau$ and select a value of $\tau$ by visual inspection which appears to homogenize the variation. An alternative to this graphical method is to develop some measure of the change in variation and select the $\tau$ which minimizes this

measure. One way of doing this will now be described.


First a value of $\tau$ is selected, then $t_n$ is transformed to $t_n^{(\tau)}$, and then the trend in $t_n^{(\tau)}$ is removed. This removal procedure will be discussed in more detail a bit later. Let $s_n^{(\tau)}$ be the square of the detrended series and let $\bar s^{(\tau)} = N^{-1}\sum_{n=1}^{N}s_n^{(\tau)}$. One would like the variation of the detrended series to be as uniform as possible. Another way of saying this is that one would like $s_n^{(\tau)}$ to be as near to constant as possible. A measure of how close $s_n^{(\tau)}$ is to a constant is the percentage reduction in sum of squares of the regression of $s_n^{(\tau)}$ on a constant. Percentage rather than absolute reduction is used since the latter is not scale invariant and would therefore not allow meaningful comparisons for different values of $\tau$. Thus the measure of the change in variation is

$$v(\tau) = \frac{\sum_{n=1}^{N}(s_n^{(\tau)}-\bar s^{(\tau)})^2}{\sum_{n=1}^{N}(s_n^{(\tau)})^2}.$$

The value of $\tau$ which minimizes $v(\tau)$ will be used in the transformation.

The trend in the series can be removed either by calculating a moving average of $t_n^{(\tau)}$ and subtracting it from $t_n^{(\tau)}$, or by fitting some function to the $t_n^{(\tau)}$ using least squares and subtracting the fitted function from $t_n^{(\tau)}$. In choosing the coefficients in the moving average or the particular function to be fit, sufficient latitude must be given to be able to account for the trend of $t_n^{(\tau)}$ for all $\tau$ in the range of interest. Sometimes there is slow variation in the series which has a higher frequency than the variation that would be called trend but which is homogeneous and has a lower frequency than the variation which shows a pattern. In this case it will sometimes be useful to remove this slow variation by a moving average before calculating $v(\tau)$.

Another possibility for removing inhomogeneities in the variation of $t_n$, which won't be described in detail here, would be to transform $t_n$ to $c_nt_n$ where $c_n$ is some positive, smooth function of $n$. The above measure $v$ can be used to assess the relative success of various $c_n$ sequences in removing inhomogeneities.
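A sketch of the $v(\tau)$ computation, using a quadratic polynomial fit by least squares to remove trend (one of the options mentioned above); the multiplicative seasonal series at the end is hypothetical, and its minimizing $\tau$ should fall near 0:

```python
import numpy as np

def power(t, tau):
    """One-parameter power transformation t^{(tau)} of a positive series."""
    return np.log(t) if abs(tau) < 1e-12 else (t**tau - 1) / tau

def v(t, tau):
    """Measure of change in variation of t^{(tau)}, with a quadratic
    polynomial in n, fit by least squares, used to remove trend."""
    tt = power(t, tau)
    n = np.arange(len(t))
    detrended = tt - np.polyval(np.polyfit(n, tt, 2), n)
    s = detrended**2                    # s_n^{(tau)}: squared detrended series
    return np.sum((s - s.mean())**2) / np.sum(s**2)

# Hypothetical series with multiplicative seasonality, so that logarithms
# (tau = 0) should homogenize the variation.
rng = np.random.default_rng(3)
n = np.arange(144)
t = np.exp(0.01 * n + 0.1 * np.sin(2 * np.pi * n / 12)
           + 0.05 * rng.normal(size=144))
taus = np.arange(-0.5, 0.51, 0.1)
tau_star = min(taus, key=lambda tau: v(t, tau))
```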

The technique of using a power transformation to homogenize the variation

was applied to two series. The first is the well-known monthly international

airline passenger series ([4, p. 531] or [5, p. 429]). The second is the

monthly inward station movement series reported in [19] which consists of the

number of telephones installed each month in Wisconsin from January 1951

through October 1966. Both series have an increasing trend and seasonal

oscillations (of order 12) whose magnitudes are increasing with time. For

both analyses a quadratic polynomial was used to eliminate trend.

For the airline series, $v(\tau)$ was calculated at the 11 equally spaced points $-.5, -.4, \ldots, .5$. The values of $v(\tau)$ corresponding to these points were .618, .596, .580, .572, .572, .568, .593, .611, .630, .651, and .672. The minimum of $v(\tau)$ occurs at $\tau = 0$. From a visual inspection of a graph of the logarithms of the series (cf. [4, p. 308]) it is seen that the increasing trend in the magnitudes of the seasonal oscillations of the series has been removed. Indeed, it makes sense to take logarithms since, as Box and Jenkins have written, it is percentage fluctuation which might be expected to be comparable at different sales volumes.

For the telephone data $v(\tau)$ was calculated at the points $-1.2, -1.1, -1.0, -.9, -.8, 0, .5$, and 1. The values of $v(\tau)$ corresponding to these values of $\tau$ were .568, .564, .562, .562, .565, .664, .747, and .815. The minimum of $v(\tau)$ occurs near $-1.0$. Again from a visual inspection of a graph of the inverses of the telephone data it is seen that the increase in the magnitudes of the seasonal oscillations has been removed. The inverses have the interpretation of being the average time between installations during each month.


9. CALCULATION OF ALL SUBSET REGRESSIONS UP TO A PARTICULAR SIZE FOR A STATIONARY AUTOREGRESSION WITH MEAN 0

Suppose $t_n$ is a stationary autoregression with mean 0 and the standard regression inference approximation of Section 3 is justified. That is, the likelihood function or the posterior density is approximately proportional to

$$h^{\frac{N-a}{2}}\exp\left[-\tfrac{1}{2}h\sum_{n=a+1}^{N}\left(t_n-\sum_{j=1}^{a}\alpha_jt_{n-j}\right)^2\right]. \qquad (9.1)$$

If the model is written in the form (1.2) then $X = [t_{n-j}]$, $n=a+1,\ldots,N$, $j=1,\ldots,a$, and $Y = (t_{a+1},\ldots,t_N)'$. Let $\hat\alpha = (X'X)^{-1}X'Y$; then (9.1) is

$$h^{\frac{N-a}{2}}\exp\left\{-\tfrac{1}{2}h\left[(Y-X\hat\alpha)'(Y-X\hat\alpha) + (\alpha-\hat\alpha)'X'X(\alpha-\hat\alpha)\right]\right\}. \qquad (9.2)$$

The $(i,j)$th element of $X'X$ is

$$\sum_{n=a+1}^{N}t_{n-i}t_{n-j} \qquad (9.3)$$

and the $j$th element of $X'Y$ is

$$\sum_{n=a+1}^{N}t_{n-j}t_n. \qquad (9.4)$$

Let $c_k = \sum_{n=1}^{N-k}t_nt_{n+k}$. If $N$ is sufficiently large, (9.3) may be approximated by $c_{|i-j|}$ and (9.4) by $c_j$. Let $\Gamma = [c_{|i-j|}]$, $i=1,\ldots,a$, $j=1,\ldots,a$; $\gamma = (c_1,\ldots,c_a)'$; $\tilde\alpha = (\tilde\alpha_1,\ldots,\tilde\alpha_a)' = \Gamma^{-1}\gamma$; and $\tilde h = (N-a)/(c_0-\sum_{j=1}^{a}\tilde\alpha_jc_j)$. Then (9.2) is approximately

$$h^{\frac{N-a}{2}}\exp\left\{-\tfrac{1}{2}h\left[(N-a)\tilde h^{-1} + (\alpha-\tilde\alpha)'\Gamma(\alpha-\tilde\alpha)\right]\right\}.$$

$\tilde\alpha$ and $\tilde h$, the values of $\alpha$ and $h$ which maximize this function, are the well-known Yule-Walker estimates (cf. [13, p. 476]) of $\alpha$ and $h$.
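A sketch of the Yule-Walker calculation from the approximate quantities $c_k$, $\Gamma$, and $\gamma$ just defined (the simulated series is hypothetical):

```python
import numpy as np

def yule_walker(t, a):
    """Yule-Walker estimates for a stationary autoregression with mean 0,
    using the approximation of (9.3) by c_{|i-j|} and of (9.4) by c_j."""
    N = len(t)
    c = np.array([np.sum(t[:N - k] * t[k:]) for k in range(a + 1)])
    Gamma = np.array([[c[abs(i - j)] for j in range(a)] for i in range(a)])
    gamma = c[1:]
    alpha = np.linalg.solve(Gamma, gamma)      # alpha~ = Gamma^{-1} gamma
    h = (N - a) / (c[0] - alpha @ gamma)       # h~ = (N-a)/(c_0 - sum alpha_j c_j)
    return alpha, h

# Hypothetical first order example of the kind fit in Section 3.
rng = np.random.default_rng(5)
t = np.zeros(199)
for n in range(1, 199):
    t[n] = 0.2 * t[n - 1] + rng.normal()
alpha_tilde, h_tilde = yule_walker(t, a=1)
```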

Page 22: REGRESSION mALYSIS OF TIrf: SERIES DATA · REGRESSION mALYSIS OF TIrf: SERIES DATA by William S. Cleveland Department ofStatistics ... I£GI6SION J1!'4ALYSIS OF TIPE SERIES DATA William

20

It is desirable, just as in the standard regression model, to investigate which of the columns of $X$ could be deleted and still maintain a good fit. That is, which of the $\alpha_j$, for $j=1,\ldots,a$, can be assumed to be zero. One way to carry out this investigation is to calculate all subset regressions up to a particular size. In the particular regression problem of this section $X'X \approx \Gamma$ and $X'Y \approx \gamma$ have a very special structure. $\Gamma$ is a Toeplitz matrix and its elements determine in a very special way those of $\gamma$. One would expect therefore to be able to devise an algorithm for calculating all subset regressions up to a particular size which makes efficient use of this special structure. One such algorithm will now be described.

Let $X_0$ denote the dependent variable $Y$. Let $X(i_0|i_1,\ldots,i_r)$ denote the regression of $X_{i_0}$ on $X_{i_1},\ldots,X_{i_r}$. Then $X(0|j) = (c_j/c_0)X_j$, for $j=1,\ldots,a$, which gives all subset regressions of size 1. Now suppose all subset regressions of size $r$ have been computed. From these calculations the regressions of size $r+1$ can be computed in the following manner. Let $i_1,\ldots,i_{r+1}$ be a subset of $1,\ldots,a$. Then $X(0|i_1,\ldots,i_{r+1})$ is equal to $X(0|i_1,\ldots,i_r)$ plus the regression of $X_0 - X(0|i_1,\ldots,i_r)$ on $X_{i_{r+1}} - X(i_{r+1}|i_1,\ldots,i_r)$. $X(0|i_1,\ldots,i_r)$ is known since all regressions of size $r$ have been computed. The saving in the algorithm comes from noting that $X(i_{r+1}|i_1,\ldots,i_r)$ essentially has already been calculated. Let $j_k = i_{r+1} - i_k$ for $k=1,\ldots,r$. Now $X(0|j_1,\ldots,j_r) = \xi_1X_{j_1}+\cdots+\xi_rX_{j_r}$ has already been calculated. Then it is easy to see, in view of the special structure of $\Gamma$ and $\gamma$, that $X(i_{r+1}|i_1,\ldots,i_r) = \xi_1X_{i_1}+\cdots+\xi_rX_{i_r}$, and that the sum of squares of $X_{i_{r+1}} - X(i_{r+1}|i_1,\ldots,i_r)$ equals the sum of squares of $X_0 - X(0|j_1,\ldots,j_r)$, which is $c_0 - \xi_1c_{j_1}-\cdots-\xi_rc_{j_r}$. Thus $X(0|i_1,\ldots,i_{r+1}) = \tau_1X_{i_1}+\cdots+\tau_{r+1}X_{i_{r+1}}$ may be formed from quantities already calculated, and the residual sum of squares of this regression of $X_0$ on $X_{i_1},\ldots,X_{i_{r+1}}$ is $c_0 - \tau_1c_{i_1}-\cdots-\tau_{r+1}c_{i_{r+1}}$.
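The following sketch computes the subset autoregressions from the $c_k$ alone; for brevity it solves each subset's normal equations directly rather than implementing the recursive updating scheme just described, but it illustrates that only the autocovariances enter the calculation:

```python
import numpy as np
from itertools import combinations

def subset_autoregressions(c, max_size):
    """All subset autoregressions up to max_size from the quantities
    c_0, ..., c_a of Section 9.

    For a subset (i_1, ..., i_r) the coefficients solve the normal equations
    [c_{|i_k - i_l|}] tau = (c_{i_1}, ..., c_{i_r})' and the residual sum of
    squares is c_0 - tau_1 c_{i_1} - ... - tau_r c_{i_r}."""
    a = len(c) - 1
    results = {}
    for r in range(1, max_size + 1):
        for subset in combinations(range(1, a + 1), r):
            G = np.array([[c[abs(i - j)] for j in subset] for i in subset])
            g = np.array([c[i] for i in subset])
            tau = np.linalg.solve(G, g)
            results[subset] = (tau, c[0] - tau @ g)  # (coefficients, residual SS)
    return results

# Hypothetical usage, with c computed from a mean-0 series t as above:
# c = np.array([np.sum(t[:len(t) - k] * t[k:]) for k in range(6)])
# regs = subset_autoregressions(c, max_size=3)
```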

The assumptions being made in this section are that $t_n$ has mean 0 and is stationary. Such assumptions will generally arise when the $t_n$ are the residuals after subtracting off an estimate of trend or the $t_n$ are the result of differencing an original time series (cf. [4, p. 85]). Such a stagewise procedure is not the most efficient with regard to estimation but does tend to make working with the series less complicated.

A note of skepticism will now be entered about calculating subset autoregressions. We have been able to exploit the special structure of $\Gamma$ and $\gamma$ in order to reduce the computations in calculating subset autoregressions. The question, however, is whether there is generally much marginal benefit from calculating subset autoregressions after the estimates of the autocorrelations and inverse autocorrelations [7] have been studied. In the general regression situation $X'X$ and $X'Y$ are usually too complicated to simply look at and decide which independent variables should be in the regression model. Consequently a crutch such as calculating subset regressions is needed. But for the stationary RAR model of this section we can do a fairly good job of looking at the values of $\Gamma$ and $\gamma$ (i.e., looking at the autocorrelation function) and making some statement about which of the $\alpha_j$ should be included.

10. NONLINEAR LEAST SQUARES FOR A SEASONAL RAR MODEL

The nonlinear seasonal RAR model of [4, p. 300] may be written

$$\alpha(B)\pi(B)t_n = \sum_{k=1}^{b}\beta_kd_{n,k} + z_n, \qquad (10.1)$$

where $B$ is the backward shift operator ($Bt_n = t_{n-1}$), $\alpha(B) = 1-\alpha_1B-\cdots-\alpha_aB^a$, and $\pi(B)$ is a second polynomial in $B$ of maximum lag $p$ (in the seasonal case, a polynomial in $B^s$, where $s$ is the seasonal period). Let

$$SS(\alpha,\pi,\beta) = \sum_{n=a+p+1}^{N}\left(\alpha(B)\pi(B)t_n - \sum_{k=1}^{b}\beta_kd_{n,k}\right)^2,$$

so that the likelihood function approximation is $F(h,\alpha,\pi,\beta) \propto h^{\frac{N-a-p}{2}}\exp[-\tfrac{1}{2}h\,SS(\alpha,\pi,\beta)]$. The maximum likelihood estimates of $\alpha$, $\pi$, and $\beta$ are $\hat\alpha$, $\hat\pi$, and $\hat\beta$, that is, the least squares estimates, and $\hat h = (N-a-p)/SS(\hat\alpha,\hat\pi,\hat\beta)$.

One possibility for iteratively finding the least squares solution of this nonlinear least squares problem is to use an algorithm such as the Gauss method (cf. [16, p. 148] or [8, p. 267]) which is capable of handling the general nonlinear least squares problem. But the special structure of $SS$ suggests an alternative algorithm. If $\alpha$ is fixed to be $\alpha^0$ then the minimum over $\pi$ and $\beta$ of $SS(\alpha^0,\pi,\beta)$ can be found by the usual linear least squares formula since $\alpha^0(B)\pi(B)t_n - \sum_{k=1}^{b}\beta_kd_{n,k}$ is linear in $\pi$ and $\beta$. A similar statement holds with $\pi$ fixed. A quick description of the algorithm is: minimize $SS(\alpha,\pi,\beta)$ over $\pi$ and $\beta$, for fixed $\alpha$; then minimize over $\alpha$ and $\beta$ for fixed $\pi$; and repeat this process until the points converge. A more detailed description will now be given. In reading it, it should be kept in mind that since $\alpha(B)$ and $\pi(B)$ commute, $\alpha(B)\pi(B)t_n = \pi(B)\alpha(B)t_n$. Also $0_j$ will be used to denote a vector of $j$ zeros.

1. The starting values $\alpha^0 = 0_a$, $\pi^0 = 0_p$, and $\beta^0 = 0_b$ are generally quite satisfactory.

2. Suppose at the completion of the $r$-th iteration the values of the parameters are $\alpha^r$, $\pi^r$, and $\beta^r$. The $(r+1)$-st iteration is given by steps 3 and 4.

3. Minimize $SS(\alpha^r,\pi,\beta)$ over $\pi$ and $\beta$, yielding $\pi^{r+1}$ as the value of $\pi$. This is done by calculating $\tilde t_n = \alpha^r(B)t_n$ for $n=a+1,\ldots,N$ and minimizing $\sum_{n=a+p+1}^{N}\big(\pi(B)\tilde t_n - \sum_{k=1}^{b}\beta_kd_{n,k}\big)^2$ over $\pi$ and $\beta$ using the linear least squares formula.

4. Minimize $SS(\alpha,\pi^{r+1},\beta)$ over $\alpha$ and $\beta$, yielding $\alpha^{r+1}$ and $\beta^{r+1}$. This is done by calculating $\tilde t_n = \pi^{r+1}(B)t_n$ for $n=p+1,\ldots,N$ and minimizing $\sum_{n=a+p+1}^{N}\big(\alpha(B)\tilde t_n - \sum_{k=1}^{b}\beta_kd_{n,k}\big)^2$ over $\alpha$ and $\beta$ using the linear least squares formula. This completes the $(r+1)$-st iteration and the algorithm is at the point $\alpha^{r+1}$, $\pi^{r+1}$, $\beta^{r+1}$.


5. A reasonable stopping point can be given in terms of the likelihood function approximation $F$. Define $h^r$ to be the value of $h$ which maximizes $F(h,\alpha^r,\pi^r,\beta^r)$. Then $h^r = (N-a-p)/SS(\alpha^r,\pi^r,\beta^r)$. The algorithm can be stopped after the $r$-th iteration if the likelihood ratio $L(h^r,\alpha^r,\pi^r,\beta^r)/L(h^{r-1},\alpha^{r-1},\pi^{r-1},\beta^{r-1}) = (h^r/h^{r-1})^{\frac{N-a-p}{2}}$ is nearly one, or equivalently, if $\frac{N-a-p}{2}\log\frac{h^r}{h^{r-1}}$ is nearly 0.
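A sketch of steps 1-5 for the special case $(1-\alpha_1B)(1-\pi_1B^{12})t_n = z_n$ fit in the example below ($a = 1$, maximum seasonal lag $p = 12$, no regression part); each half-step reduces to a one-variable linear least squares problem, and the series is simulated:

```python
import numpy as np

def relax(t, iters=50, tol=1e-10):
    """Relaxation algorithm for (1 - alpha_1 B)(1 - pi_1 B^12) t_n = z_n."""
    N, m = len(t), 13                        # m = a + p
    alpha, pi, h_old = 0.0, 0.0, None        # step 1: zero starting values
    for _ in range(iters):
        # Step 3: fix alpha, minimize SS over pi.
        u = t[1:] - alpha * t[:-1]           # u_n = alpha(B) t_n
        pi = (u[12:] @ u[:-12]) / (u[:-12] @ u[:-12])
        # Step 4: fix pi, minimize SS over alpha.
        w = t[12:] - pi * t[:-12]            # w_n = pi(B) t_n
        alpha = (w[1:] @ w[:-1]) / (w[:-1] @ w[:-1])
        # Step 5: stopping rule via h^r = (N - a - p)/SS(alpha, pi).
        z = t[m:] - alpha * t[m-1:-1] - pi * t[1:-12] + alpha * pi * t[:-m]
        h = (N - m) / (z @ z)
        if h_old is not None and 0.5 * (N - m) * abs(np.log(h / h_old)) < tol:
            break
        h_old = h
    return alpha, pi, h

# Simulated series from the model with alpha_1 = .56 and pi_1 = .35.
rng = np.random.default_rng(4)
t = rng.normal(size=300)
for n in range(13, 300):
    t[n] += 0.56 * t[n-1] + 0.35 * t[n-12] - 0.56 * 0.35 * t[n-13]
alpha, pi, h = relax(t)
```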

In the remainder of the paper this algorithm will be referred to as the relaxation algorithm. (Actually it is only one particular relaxation algorithm.) It will now be shown that the relaxation algorithm converges (mathematically, that is, ignoring round-off error) to a local minimum of $SS$. The method of proof easily extends to the case where there are more than two polynomials of $B$ in (10.1).

Let $\alpha^k$, for $k=1,2,\ldots$, be a sequence of points converging to a point $\alpha^0$ (using the usual Euclidean distance). Let $\pi^k$ and $\beta^k$, for $k=0,1,\ldots$, denote the unique values of $\pi$ and $\beta$ which minimize $SS(\alpha^k,\pi,\beta)$ over $\pi$ and $\beta$ for fixed $\alpha^k$. Let $\phi^k = (\alpha^k,\pi^k,\beta^k)$, for $k=0,1,2,\ldots$. We will show that $\phi^k$ converges to $\phi^0$. First note that $SS$ is a nonnegative polynomial in $\alpha$, $\pi$, and $\beta$. Since $SS(\phi^k) \leq SS(\alpha^k,0_p,0_b)$ and these latter numbers are uniformly bounded in $k$, the $\phi^k$ lie in a closed bounded region $R$. $SS$ is uniformly continuous on $R$; therefore, for $\varepsilon > 0$ we can construct a family of spheres $S_\delta(\phi^k)$ of radius $\delta$ about the points $\phi^k$ such that $|SS(\phi) - SS(\phi^k)| < \varepsilon$ for all $\phi$ in $S_\delta(\phi^k)$. If the Euclidean distance of $(\alpha^k,0_p,0_b)$ from $(\alpha^0,0_p,0_b)$ is less than $\delta$, then $(\alpha^k,\pi^0,\beta^0)$ lies in $S_\delta(\phi^0)$. Thus $SS(\phi^k) < SS(\phi^0) + \varepsilon$. By a similar argument $SS(\phi^0) < SS(\phi^k) + \varepsilon$. Thus $SS(\phi^k)$ converges to $SS(\phi^0)$. The $\phi^k$ lie in a bounded region and therefore have a point of accumulation, $(\alpha^0,\pi^*,\beta^*)$. Now $SS(\alpha^0,\pi^*,\beta^*) = SS(\phi^0)$. But since $\pi^0$ and $\beta^0$ are the unique values at which the minimum of $SS(\alpha^0,\pi,\beta)$ over $\pi$ and $\beta$ occurs, $\pi^* = \pi^0$


and $\beta^* = \beta^0$. Thus $\phi^k$ converges to $\phi^0$. By a completely analogous proof this same convergence occurs if the roles of $\alpha$ and $\pi$ are interchanged.

Now let $\theta^r = (\alpha^r,\pi^r,\beta^r)$ be a sequence of points of the relaxation algorithm. Since $SS(\theta^r) \leq SS(\theta^{r-1})$ and $SS$ is a polynomial, the $\theta^r$ lie in a bounded region and have a point of accumulation $\theta^* = (\alpha^*,\pi^*,\beta^*)$. Let $\gamma^r$ be a subsequence of $\theta^r$ which converges to $\theta^*$. At the point $\gamma^r$ the relaxation algorithm minimizes $SS$ with respect to $\alpha$ and $\beta$ or with respect to $\pi$ and $\beta$. Let $\phi^r$ be the subsequence of $\gamma^r$ which corresponds to minimization with respect to $\alpha$ and $\beta$. (If there are only a finite number of terms in this subsequence then consider the other minimization.) Let $\hat\phi^r$ be the point to which the algorithm moves as a result of this minimization. Let $\hat\phi$ be the minimum of $SS(\alpha,\pi^*,\beta)$ over $\alpha$ and $\beta$ with $\pi^*$ fixed. From the convergence fact proved in the previous paragraph, $\hat\phi^r$ converges to $\hat\phi$. Now suppose $\hat\phi \neq \theta^*$. Then since $\hat\phi$ is a unique point of minimization, $SS(\hat\phi) < SS(\theta^*)$. But since $SS(\hat\phi^r)$ converges to $SS(\hat\phi)$ we would have $SS(\hat\phi^r) < SS(\theta^*)$ for some $r$, which contradicts the assumption that $\theta^*$ is a point of accumulation. Thus $\hat\phi = \theta^*$ and $\hat\phi^r$ converges to $\theta^*$. Therefore minimization with respect to $\alpha$ and $\beta$ cannot take the algorithm infinitely often out of a small neighborhood of $\theta^*$. A similar statement holds with $\alpha$ replaced by $\pi$. Furthermore, from the fact that $\hat\phi = \theta^*$, we see that $\theta^*$ is a local minimum of $SS$. Thus the algorithm converges to a local minimum of $SS$.

The relaxation algorithm has the property that it tends very quickly to get in the general area of a minimum but has difficulty descending a long narrow valley to a minimum. This difficulty can often be easily overcome by using the elegant pattern move part of the Hooke and Jeeves [11] algorithm. This would be used after the $r$-th iteration in the following manner. Let $\phi^r = (\alpha^r,\pi^r,\beta^r)$ be the point to which the algorithm moves at the $r$-th iteration. The result of the $r$-th iteration is a move from the point $\phi^{r-1}$ to the point $\phi^r$. The direction of the move is $\phi^r - \phi^{r-1}$. The pattern move exploits the common sense notion that a little more of a good thing might even be better. The pattern move goes in the direction $\phi^r - \phi^{r-1}$ but goes twice as far. That is, the move is to $\tilde\phi^r = \phi^{r-1} + 2(\phi^r - \phi^{r-1})$. Now the relaxation algorithm is used to improve $\tilde\phi^r$. That is, steps 3 and 4 are carried out at the point $\tilde\phi^r$. If the resulting point is an improvement over $\phi^r$ then call it $\phi^{r+1}$ and now repeat the pattern move by going to $\phi^r + 2(\phi^{r+1} - \phi^r)$. If the point is not an improvement over $\phi^r$ then return to $\phi^r$ and carry out the relaxation algorithm as usual.
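A sketch of the pattern move wrapped around a generic improvement step (which would be one full iteration of steps 3 and 4); the quadratic long narrow valley used at the end is a hypothetical stand-in for $SS$:

```python
import numpy as np

def with_pattern_moves(step, phi, iters=100):
    """Insert Hooke-Jeeves pattern moves between iterations of `step`,
    where `step` maps a point to (improved point, SS at that point)."""
    phi, ss = step(phi)
    for _ in range(iters):
        phi_new, ss_new = step(phi)                # ordinary iteration
        trial, ss_trial = step(2 * phi_new - phi)  # pattern move, then improve it
        if ss_trial < ss_new:                      # keep the pattern move only
            phi_new, ss_new = trial, ss_trial      # if it improves on phi_new
        if ss_new >= ss:                           # no further improvement: stop
            return phi, ss
        phi, ss = phi_new, ss_new
    return phi, ss

# Hypothetical demonstration: coordinate descent on SS(x, y) = x^2 + 100(y - x)^2,
# a long narrow valley of the kind that slows the relaxation algorithm.
def cd_step(p):
    x, y = p
    x = 100.0 * y / 101.0                          # minimize over x with y fixed
    y = x                                          # minimize over y with x fixed
    return np.array([x, y]), x**2 + 100.0 * (y - x)**2

phi, ss = with_pattern_moves(cd_step, np.array([1.0, 1.0]))
```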

Just what strategy one uses in inserting pattern moves will depend on

the type of computing system used. If batch processing is used then one

strategy would be to allow the relaxation algorithm to run for a number of

iterations without any pattern moves and then use pattern moves at every

iteration. If an interactive system is used the strategy can be defined as

the algorithm proceeds. The programmer can insert the pattern move when it

seems advisable.

Another procedure would be to combine the relaxation algorithm with the

Gauss method. The relaxation algorithm tends to get in the neighborhood of

a minimum more quickly than the Gauss method and a single iteration of the

former requires less computation than a single iteration of the latter. But

the Gauss method seems to be a better valley descender. Thus the two might

be combined by alternating them or perhaps by letting the relaxation method

run for the first few iterations and the Gauss algorithm thereafter.

If time permits it is advantageous to use more than one algorithm. This

tends to guard against the possibility of thinking a minimum has been found when, in fact, the algorithm has merely slowed down.


27

As an example the model (l-alB) (l-'fl Bl2)tn • zn was fit to the monthly

birth rate data reported in [6, p. 240]. The results of using the relaxation

algorithm are shown in Ta15le 3. Convergence was quite rapid and it was

unnecessary to use a pattern move. The results of the Gauss method are

shown in Table 4. The convergence was also rapid, but not quite as fast as

the relaxation algorithm. It is reassuring to see convergence to the same

values.


Table 1. POSTERIOR AND APPROXIMATION

α₁        m₁(α₁)    m₂(α₁)    m₁(α₁)/m₂(α₁)

.00 .101 .097 1.036

.04 .425 .416 1.021

.08 1.320 1.308 1.010

.12 2.984 2.980 1.001

.16 4.874 4.893 .996

.20 5.724 5.757 .994

.24 4.825 4.848 .995

.28 2.923 2.926 .999

.32 1.279 1.273 1.005

.36 .407 .402 1.014

.40 .095 .093 1.024


Table 2. TRANSFORMATION OF AIR POLLUTION DATA

  τ      m(τ)/10^5   m(τ)/m(.16)   α̂₁(τ)   β̂₁(τ)   β̂₂(τ)   β̂₃(τ)   β̂₄(τ)   g₁(τ)
 -1.0       .60         .00         .17      .78     .02     -.01    -.00    -.98
  -.3       .13         .00         .12     1.65     .07     -.11    -.01    -.14
  -.2      1.27         .03         .12     1.90     .07     -.15    -.01    -.03
  -.1      6.98         .16         .11     2.21     .07     -.21    -.01     .05
   0      21.69         .50         .11     2.59     .07     -.29    -.01     .12
   .1     38.94         .90         .11     3.07     .07     -.41    -.02     .19
   .16    43.03        1.00         .10     3.41     .06     -.50    -.02     .23
   .2     41.60         .97         .10     3.66     .06     -.57    -.03     .25
   .3     27.27         .63         .10     4.41     .06     -.75    -.04     .31
   .4     11.26         .26         .10     5.35     .05    -1.10    -.05     .36
   .5      3.00         .07         .10     6.56     .04    -1.52    -.07     .41
   .6       .53         .01         .10     8.10     .04    -2.11    -.09     .46
  1.0      10^-5        .00         .11    20.26     .03    -7.85    -.25     .68


Table 3. RELAXATION ALGORITHM

  r     α₁^r      π₁^r      (h^r)^-1    [(N-a-p)/2] log(h^r/h^(r-1))
  0    0         0          601.91
  1    .55993    .99990     .35104      443.09
  2    .56168    .99844     .35079      .042801
  3    .56171    .99842     .35079      4.4239×10^-6
  4    .56171    .99842     .35079      1.2463×10^-9


Table 4. GAUSS METHOD

  r     α₁^r      π₁^r      (h^r)^-1    [(N-a-p)/2] log(h^r/h^(r-1))
  0    0         0          601.91
  1    .32183    .67829     28.963      180.53
  2    .64387    .84687     2.0996      156.14
  3    .70624    .97076     .39487      99.423
  4    .59214    1.0077     .36050      5.4183
  5    .55488    .99751     .35093      1.6007
  6    .56235    .99842     .35079      .02296
  7    .56171    .99842     .35079      3.5179×10^-5
  8    .56171    .99842     .35079      7.9949×10^-8


REFERENCES

[1] Anscombe, F. J. and Tukey, John W., "The examination and analysis of residuals," Technometrics, 5, 1963, 141-160.

[2] Barnard, G. A.; Jenkins, G. M.; and Winsten, C. B., "Likelihood inference and time series," Journal of the Royal Statistical Society, A-125, 1962, 321-372.

[3] Box, G. E. P. and Cox, D. R., "An analysis of transformations," Journal of the Royal Statistical Society, B-26, 1964, 211-252.

[4] Box, George E. P. and Jenkins, Gwilym M., Time Series Analysis, Holden-Day, San Francisco, 1970.

[5] Brown, Robert G., Smoothing, Forecasting and Prediction of Discrete Time Series, Prentice-Hall, New Jersey, 1962.

[6] Chakravarti, I. M.; Laha, R. G.; and Roy, J., Handbook of Methods of Applied Statistics, Vol. 1, Wiley, New York, 1967.

[7] Cleveland, William S., "The inverse autocorrelations of a time series," Institute of Statistics Mimeo Series No. 689, Department of Statistics, University of North Carolina, Chapel Hill, 1970 (to appear in Technometrics).

[8] Draper, N. R. and Smith, H., Applied Regression Analysis, Wiley, New York, 1966.

[9] Durbin, J., "Estimation of parameters in time series regression models," Journal of the Royal Statistical Society, B-22, 1960, 139-153.

[10] Edwards, Ward; Lindman, Harold; and Savage, Leonard J., "Bayesian statistical inference for psychological research," Psychological Review, 70, 1963, 193-242.

[11] Hooke, R. and Jeeves, T. A., "'Direct search' solution of numerical and statistical problems," Journal of the Association for Computing Machinery, 8, 1961, 212-229.

[12] Kalbfleisch, John D. and Sprott, D. A., "Application of likelihood methods to models involving large numbers of parameters," Journal of the Royal Statistical Society, B-32, 1970, 175-208.

[13] Kendall, Maurice G. and Stuart, Alan, The Advanced Theory of Statistics, Vol. 3, Griffin, London, 1966.

[14] Mann, H. B. and Wald, A., "On the statistical treatment of linear stochastic difference equations," Econometrica, 11, 1943, 173-220.

[15] Parzen, Emanuel, "An approach to time series analysis," Annals of Mathematical Statistics, 32, 1961, 951-989.

[16] Powell, M. J. D., "Minimization of functions of several variables," Numerical Analysis: An Introduction, J. Walsh, ed., Academic Press, London, 1966, 143-157.

[17] Raiffa, Howard and Schlaifer, Robert, Applied Statistical Decision Theory, Division of Research, Harvard Business School, Boston, 1961.

[18] Spirtas, Robert and Levin, Howard J., "Patterns and trends in levels of suspended particulate matter at 78 NASN sites from 1957 through 1966," presented at the Annual Meeting of the Air Pollution Control Association, New York, June 22-26, 1969.

[19] Tiao, G. C. and Thompson, H. E., "Analysis of telephone data," Technical Report No. 222, Department of Statistics, University of Wisconsin, 1970.

[20] Valley, Shea L., Handbook of Geophysics and Space Environments, McGraw-Hill, New York, 1965.