

Journal of Econometrics 23 (1983) 337-342. North-Holland

A REMARK ON SERIAL CORRELATION IN MAXIMUM LIKELIHOOD

David LEVINE*

University of California, Los Angeles, CA 90024, USA

Received September 1981, final version received April 1983

Maximum likelihood estimation can be consistent and asymptotically normal despite serial correlation in the residuals. The usual estimator of the asymptotic covariance of the parameter estimator is inconsistent, but an alternative consistent estimator is derived.

It is well known that OLS can be consistent and asymptotically normal despite serial correlation in the residuals. Although the usual estimator of the asymptotic covariance of the parameter estimator is inconsistent, there is an alternative covariance estimator which is consistent.¹ The purpose of this note is to sketch how and why these results extend to MLE.

Let $y^t$ be endogenous and $z^t$ predetermined at time $t$, with $x^t = (y^t, z^t)$. For notational simplicity suppose $\{x^t\}$ is stationary. Let $f(y^t \mid z^t, \theta)$ be a family of conditional density functions for $y^t$ and suppose $f(y^t \mid z^t, \theta_0)$ is the actual density of $y^t$ conditional on the predetermined variables $z^t$. Notice that this does not imply that $\exp TL^T(\theta) = \prod_{t=1}^T f(y^t \mid z^t, \theta)$ is the joint density of the $y^t$ (conditional, or otherwise), nor that $L^T(\theta)$ is the log-likelihood function for any model. If the $z^t$ are exogenous this is true only if the $y^t$ are independent. Define the partial MLE $\hat\theta^T$ to be the estimator that maximizes $L^T(\theta)$. Note that MLE under the assumption of independence is partial MLE if there is serial correlation. We shall extend the usual consistency argument to show that the consistency of partial MLE depends only on $f(y \mid z, \theta_0)$ being the actual density of $y$ conditional on $z$ and not on $\prod_{t=1}^T f(y^t \mid z^t, \theta_0)$ being a joint density for the $y^t$.

We make use of the following notation. The log-likelihood contribution is $\lambda^t(\theta) = \log f(x^t, \theta)$. Associated with $\lambda^t$ are $L(\theta) = E\lambda^t(\theta)$ and its empirical counterpart $L^T(\theta) = (1/T)\sum_{t=1}^T \lambda^t(\theta)$. Subscript $\theta$'s denote differentiation. Thus

*I would like to thank Jerry Hausman, Paul Ruud, Michael Veall, Halbert White, Sweder van Wijnbergen, and two anonymous referees.

¹See Hansen (1979) or White and Domowitz (1981).

0304-4076/83/$3.00 © 1983, Elsevier Science Publishers B.V. (North-Holland)


the score contributions are $\lambda^t_\theta(\theta)$. Associated with these are the autocorrelation functions

$$R_k(\theta) = E\,\lambda^t_\theta(\theta)\,\lambda^{t+k}_\theta(\theta)', \qquad R^T_k(\theta) = \frac{1}{T-k}\sum_{t=1}^{T-k} \lambda^t_\theta(\theta)\,\lambda^{t+k}_\theta(\theta)'.$$

For pedagogical purposes we make the following assumptions:

(1) $x^t$ is stationary and strong $\alpha$-mixing with exponentially declining weights $\alpha_k$.

(2) $\Theta$ is a compact convex set.

(3) $f(y^t \mid z^t, \theta_0)$ is the true conditional density of $y^t$ given $z^t$ and is not stochastically equivalent² to $f(y^t \mid z^t, \theta)$ unless $\theta = \theta_0$ (global identification).

(4) $\theta_0 \in \operatorname{interior}(\Theta)$.

(5) $\lambda^t$ is a twice continuously differentiable function of $\theta$.

(6) For some $\delta > 0$, …

(7) $L_{\theta\theta}(\theta_0)$ is non-singular³ (local identification).

The mixing condition (1) requires a word of explanation. A stochastic process $x^t$ is called strong $\alpha$-mixing, where $\alpha$ is an infinite sequence of scalars $\alpha = (\alpha_1, \alpha_2, \ldots)$ with $\lim \alpha_k = 0$, iff any event $F_t$ defined by $x^1, \ldots, x^t$ and any event $F^{t+k}$ defined by $x^{t+k}, x^{t+k+1}, \ldots$ satisfy

$$\left|\Pr(F^{t+k} \cap F_t) - \Pr(F^{t+k})\Pr(F_t)\right| \le \alpha_k.$$

Since in the independent case $\alpha_k \equiv 0$, this asserts that the distant future is largely independent of the past. Strong mixing is a weak condition in that most common processes, such as the normal ARMA processes, satisfy (1).

²Stochastic equivalence means that the two densities are almost everywhere equal. In practice, since we do not know $\theta_0$, we must check that, for all $\theta \ne \theta'$, $f(x^t, \theta) \ne f(x^t, \theta')$ a.e.

³In practice, we must check that $L_{\theta\theta}(\theta)$ is non-singular for all $\theta$.


Indeed, most processes which are observed can reasonably be argued to satisfy this condition. An extensive discussion of conditions (1) and (6), which can be weakened substantially, can be found in White and Domowitz (1981).

An important (and obvious) fact about mixing processes is that functions of mixing processes depending on a fixed finite number of lagged observations are also mixing. Thus, the fact that $x^t$ satisfies assumption (1) implies that $\lambda^t(\theta)$, $\lambda^t_\theta(\theta)$, etc. all satisfy (1) as well. This is very convenient in a non-linear context.

Theorem 1. If (1)-(3) and (6) hold, $\hat\theta^T$ is strongly consistent.⁴

Proof. By the uniform strong law of large numbers of White and Domowitz (1981),⁵ together with assumptions (1) and (6), $L^T(\theta)$ almost surely (a.s.) converges uniformly to $L(\theta)$. By a trivial modification of a classic argument, (3) implies $L(\theta)$ has a unique maximum at $\theta_0$.⁶ These facts imply, via an argument due to Frydman (1980), assumption (2) and the definition of $\hat\theta^T$, that $\hat\theta^T \xrightarrow{\text{a.s.}} \theta_0$. Q.E.D.

Turning to asymptotic normality: by assumptions (2), (4), (5) and the usual Taylor series expansion there exists an $\bar L^T_{\theta\theta}$ such that

(8) $\sqrt{T}\,(\hat\theta^T - \theta_0) = -\left[\bar L^T_{\theta\theta}\right]^{-1} (1/\sqrt{T}) \sum_{t=1}^T \lambda^t_\theta(\theta_0).$

Here the rows of $\bar L^T_{\theta\theta}$ are the rows of $L^T_{\theta\theta}$ evaluated at a point (which may depend upon the row) between $\theta_0$ and $\hat\theta^T$.⁷ By assumption (6) the uniform strong law of large numbers implies each row of $L^T_{\theta\theta}(\theta)$ converges uniformly to the corresponding row of $E\lambda^t_{\theta\theta}(\theta)$. Also by (6) $\lambda^t_\theta(\theta)$ and $\lambda^t_{\theta\theta}(\theta)$ are absolutely integrable⁸ so that

(9) $L_{\theta\theta}(\theta) = E\lambda^t_{\theta\theta}(\theta) = \partial^2\left[E\lambda^t(\theta)\right]/\partial\theta^2.$

Thus $\operatorname{plim} \bar L^T_{\theta\theta} = L_{\theta\theta}(\theta_0)$ and by assumption (7) $\operatorname{plim} \left[\bar L^T_{\theta\theta}\right]^{-1} = L^{-1}_{\theta\theta}(\theta_0)$.

As in (9), since assumption (6) permits the exchange of differentiation and integration, $L(\theta)$ is twice continuously differentiable by assumption (5) and

(10) $L_\theta(\theta) = E\lambda^t_\theta(\theta) = \partial\left[E\lambda^t(\theta)\right]/\partial\theta.$

⁴An alternative proof for a special class of time series models is in Kohn (1978).

⁵McLeish (1975) shows how to prove the strong law for mixing processes; Hoadley (1971) shows how to extend the strong law in the independent case to a uniform strong law; White and Domowitz (1981) show how to combine the two proofs to get a uniform strong law for mixing processes.

⁶See Wald (1949).

⁷The measurability of these random points is demonstrated in Lemma 3 of Jennrich (1969).

⁸Interchange of differentiation and expectation is discussed in Cramér (1946).


Since by assumption (4) $\theta_0$ (the unique maximizing parameter in $\Theta$) is in the interior of $\Theta$, $L_\theta(\theta_0) = 0$. Thus by (10) $E\lambda^t_\theta(\theta_0) = 0$. This and assumptions (1) and (6) show that $(1/\sqrt{T})\sum_{t=1}^T \lambda^t_\theta(\theta_0)$ satisfies the hypotheses of the Rosenblatt Central Limit Theorem⁹ for strong mixing processes, so that

(11) $(1/\sqrt{T}) \sum_{t=1}^T \lambda^t_\theta(\theta_0) \xrightarrow{d} N(0, V),$

(12) $V = R_0(\theta_0) + 2\sum_{k=1}^{\infty} R_k(\theta_0).$

From this follows:

Theorem 2. If (1)-(7) hold, $\sqrt{T}\,(\hat\theta^T - \theta_0) \xrightarrow{d} N\!\left(0,\; L^{-1}_{\theta\theta}(\theta_0)\, V\, L^{-1}_{\theta\theta}(\theta_0)\right)$.
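To illustrate Theorem 2, consider the simplest partial MLE: the Gaussian MLE of the mean of an AR(1) process under a (false) independence assumption, which is just the sample mean. The following is a minimal Python sketch; the simulation design, the truncation lag $k = 20$, and all variable names are mine, not the paper's. The $1/\sigma^2$ factor in the score cancels in the sandwich form, so it is dropped:

```python
import numpy as np

rng = np.random.default_rng(42)
T, rho = 5000, 0.8

# y_t = mu + u_t with AR(1) errors u_t = rho*u_{t-1} + e_t; true mu = 0.
e = rng.standard_normal(T)
u = np.zeros(T)
for t in range(1, T):
    u[t] = rho * u[t - 1] + e[t]
y = u

mu_hat = y.mean()   # the partial (independence-based) MLE of mu
s = y - mu_hat      # score contributions, up to a 1/sigma^2 factor
                    # that cancels in the sandwich below

R0 = np.mean(s ** 2)
naive_var = R0 / T  # usual MLE variance, valid only under independence

k = 20              # truncation lag, fixed here purely for illustration
V = R0 + 2.0 * sum(np.mean(s[: T - j] * s[j:]) for j in range(1, k + 1))
robust_var = V / T  # sandwich variance with the truncated estimator of V
```

With $\rho = 0.8$ the naive variance understates the true variance of $\hat\mu$ by roughly a factor of $(1+\rho)/(1-\rho) = 9$, which the sandwich form approximately recovers.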

It remains to provide a consistent estimator of $L^{-1}_{\theta\theta}(\theta_0) V L^{-1}_{\theta\theta}(\theta_0)$. The matrix $L^{-1}_{\theta\theta}(\theta_0)$ can be consistently replaced with $[L^T_{\theta\theta}(\hat\theta^T)]^{-1}$ for the same reason discussed above that $[\bar L^T_{\theta\theta}]^{-1}$ is consistent. Thus we must find a consistent estimator for $V$. Define

(13) $V^T_k(\theta) = R^T_0(\theta) + 2\sum_{j=1}^{k} R^T_j(\theta),$

$V_k(\theta) = R_0(\theta) + 2\sum_{j=1}^{k} R_j(\theta).$
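For a vector parameter, (13) can be computed directly from the $T \times p$ matrix of score contributions. A sketch in numpy follows (function names are mine); for matrix-valued $R^T_j$ the term $2R^T_j$ is implemented as the symmetrized sum $R^T_j + R^{T\prime}_j$, which coincides with $2R^T_j$ in the scalar case and keeps $V^T_k$ symmetric:

```python
import numpy as np

def score_autocov(scores, j):
    """R^T_j: (1/(T-j)) * sum_t lambda^t_theta lambda^{t+j}_theta',
    for a (T, p) array of score contributions evaluated at theta."""
    T = scores.shape[0]
    return scores[: T - j].T @ scores[j:] / (T - j)

def truncated_long_run_variance(scores, k):
    """V^T_k of eq. (13), with 2*R^T_j symmetrized as R^T_j + R^T_j'."""
    V = score_autocov(scores, 0)
    for j in range(1, k + 1):
        Rj = score_autocov(scores, j)
        V = V + Rj + Rj.T
    return V
```

For serially uncorrelated scores this simply adds noise around $R^T_0$; its value is precisely that it remains consistent for $V$ when the score contributions are serially correlated.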

From the mixing assumption (1) it follows that

(14) $V(\theta) = \lim_{k\to\infty} V_k(\theta)$

exists and that the convergence is uniform in $\theta$.

Furthermore, for fixed $k$ it follows from assumptions (1) and (6) and the uniform strong law of large numbers that $V^T_k(\theta) \xrightarrow{\text{a.s.}} V_k(\theta)$ uniformly.¹⁰ The difficult question is how to choose $k$, both as a function of $T$ and possibly of the sample as well. Note that if we can find $k(T)$ such that $V^T_{k(T)}(\theta) \xrightarrow{\text{a.s.}} V(\theta)$ uniformly, then $V^T_{k(T)}(\hat\theta^T)$ will almost surely converge to $V$. White and Domowitz (1981) show for a special class of distributions that, if $k(T) \le T^\gamma$ for $0 < \gamma \le \tfrac{1}{3}$, then $V^T_{k(T)}(\theta) \xrightarrow{\text{a.s.}} V(\theta)$ uniformly, and conjecture this result holds more generally.¹¹ If we are interested only in the weak consistency of $V^T_{k(T)}(\hat\theta^T)$,

⁹Blum and Rosenblatt (1956) show that in the case where $\lambda^t_\theta(\theta_0)$ is a scalar, $\sum_{t=1}^T c_t \lambda^t_\theta(\theta_0)$ converges to normality for some weights $c_t$. The Cramér-Wold device and computation of the variance of $\sum_{t=1}^T \lambda^t_\theta(\theta_0)$ yield the stated result.

¹⁰See White (1980) for the application of this estimator to provide robustness against heteroskedasticity in OLS. MLE in discrete choice models is ordinarily inconsistent if heteroskedasticity is present.

¹¹In a conversation with the author.


it is straightforward to show $0 < \gamma < \tfrac{1}{2}$ works in general. By Amemiya's Lemma and Chebyshev's Inequality it suffices to show that

$$E\left|V^T_{k(T)}(\theta) - V_{k(T)}(\theta)\right|^2 \to 0 \quad \text{uniformly in } \theta.$$

However, $V^T_k(\theta)$ is the sum of the $R^T_j(\theta)$; by McLeish's (1975) covariance bound and assumption (6),

$$E\left|R^T_j(\theta) - R_j(\theta)\right|^2 \le C/(T - T^\gamma) \quad \text{for a fixed constant } C.$$

Thus, regardless of how highly correlated the $R^T_j$ are,

$$E\left|V^T_{k(T)}(\theta) - V_{k(T)}(\theta)\right|^2 \le (2T^\gamma + 1)^2\, C/(T - T^\gamma) \to 0.$$

As a practical matter, knowledge of the rate at which $k$ should be increased with sample size isn't very helpful in dealing with a sample of fixed size. The issues in choosing $k$ are these: if $k$ is too large, then the estimates $R^T_j$ for $j$ near $k$ will not be very reliable, as they are based on only $T - j$ observations. On the other hand, if $k$ is too small, $V_k$ is not going to approximate $V$ very well, since $2\sum_{j=k+1}^{\infty} R_j$ is omitted. Indeed the error in estimating $V$ is at least $2\sum_{j=k+1}^{\infty} R_j$, which among other things is unobservable. A crude rule of thumb is this: under the mixing assumption (1), $|R_j|$ must decline exponentially.¹² Choose $k$ so that there are 'reasonably' many observations with which to estimate $R_k$. Then use the estimated $|R^T_0|, |R^T_1|, \ldots, |R^T_k|$ to find an exponentially declining upper bound; this can be extrapolated to give some idea of how large the error $2\sum_{j=k+1}^{\infty} |R_j|$ might be. I should note that the problem of choosing $k$ correctly is a fundamental problem with asymptotic theory: how large is large? Ultimately only finite sample theory can answer this question.
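The rule of thumb can be mechanized for a scalar score. The sketch below fits an exponentially declining curve $c\,r^j$ to the estimated $|R^T_j|$ by least squares on the logs (the fitting method is my choice; the text only calls for some exponentially declining upper bound) and extrapolates the geometric tail to gauge $2\sum_{j>k} |R_j|$:

```python
import numpy as np

def tail_error_bound(scores, k):
    """Extrapolated gauge of the omitted tail 2*sum_{j>k} |R_j| for a
    1-d array of score contributions, per the rule of thumb in the text."""
    T = scores.shape[0]
    absR = np.array([
        abs(np.mean(scores[: T - j] * scores[j:])) for j in range(1, k + 1)
    ])
    j = np.arange(1, k + 1)
    # Fit log|R^T_j| ~ log(c) + j*log(r) by least squares.
    slope, intercept = np.polyfit(j, np.log(absR + 1e-300), 1)
    r, c = np.exp(slope), np.exp(intercept)
    if r >= 1.0:
        return np.inf  # no estimated decline: the extrapolation is useless
    # Geometric tail: 2 * sum_{j>k} c*r^j = 2*c*r^(k+1) / (1 - r).
    return 2.0 * c * r ** (k + 1) / (1.0 - r)
```

If the fitted $r$ is near or above one, the sample shows no usable evidence of decline and the choice of $k$ (or the mixing assumption itself) should be reconsidered.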

¹²See McLeish (1975).

References

Blum, J. and M. Rosenblatt, 1956, A class of stationary processes and a central limit theorem, Proceedings of the National Academy of Sciences 42, 412-413.

Cramér, H., 1946, Mathematical methods of statistics (Princeton University Press, Princeton, NJ).

Frydman, R., 1980, A proof of the consistency of maximum likelihood estimators of non-linear regression models with autocorrelated errors, Econometrica 48, 853-860.

Hansen, L.P., 1979, Asymptotic distribution of least squares with endogenous regressors and dependent residuals, March (Carnegie-Mellon University, Pittsburgh, PA).

Hoadley, B., 1971, Asymptotic properties of maximum likelihood estimators for the independent not identically distributed case, Annals of Mathematical Statistics 42, 1977-1991.

Jennrich, R., 1969, Asymptotic properties of non-linear least squares estimators, Annals of Mathematical Statistics 40, 633-643.

Kohn, R., 1978, Local and global identification and strong consistency in time series models, Journal of Econometrics 8, 269-294.

McLeish, D., 1975, A maximal inequality and strong dependent laws, Annals of Probability 3, 826-836.

Wald, A., 1949, Note on the consistency of the maximum likelihood estimate, Annals of Mathematical Statistics 20, 595-601.

White, H., 1980, A heteroskedasticity-consistent covariance matrix estimator and a direct test for heteroskedasticity, Econometrica 48, 817-838.

White, H. and I. Domowitz, 1981, Nonlinear regression with dependent observations (University of California, San Diego, CA).