Hu - Time Series Analysis


    Lecture 1: Stationary Time Series∗

    1 Introduction

If a random variable $X$ is indexed by time, usually denoted by $t$, the observations $\{X_t, t \in \mathbb{T}\}$ are called a time series, where $\mathbb{T}$ is a time index set (for example, $\mathbb{T} = \mathbb{Z}$, the integer set).

Time series data are very common in empirical economic studies. Figure 1 plots some frequently used variables. The upper left figure plots quarterly GDP from 1947 to 2001; the upper right figure plots the residuals after linearly detrending the logarithm of GDP; the lower left figure plots the monthly S&P 500 index from 1990 to 2001; and the lower right figure plots the log difference of the monthly S&P 500. As you can see, these four series display quite different patterns over time. Investigating and modeling these different patterns is an important part of this course.

In this course, you will find that many of the techniques (estimation methods, inference procedures, etc.) you have learned in your general econometrics course are still applicable in time series analysis. However, time series data have some special features compared to cross-sectional data. For example, when working with cross-sectional data it usually makes sense to assume that the observations are independent of each other, whereas time series data are very likely to display some degree of dependence over time. More importantly, for time series data we can observe only one history of the realizations of the variable. For example, suppose you obtain a series of US weekly stock index data for the last 50 years. This sample can be said to be large in terms of sample size; however, it is still one data point, as it is only one of the many possible realizations.

    2 Autocovariance Functions

In modeling a finite number of random variables, a covariance matrix is usually computed to summarize the dependence between these variables. For a time series $\{X_t\}_{t=-\infty}^{\infty}$, we need to model the dependence over an infinite number of random variables. The autocovariance and autocorrelation functions provide us with a tool for this purpose.

Definition 1 (Autocovariance function). The autocovariance function of a time series $\{X_t\}$ with $\mathrm{Var}(X_t) < \infty$ is defined by
$$\gamma_X(s,t) = \mathrm{Cov}(X_s, X_t) = E[(X_s - EX_s)(X_t - EX_t)].$$

Example 1 (Moving average process): Let $\epsilon_t \sim$ i.i.d.$(0, 1)$, and
$$X_t = \epsilon_t + 0.5\epsilon_{t-1}.$$

    ∗Copyright 2002-2006 by Ling Hu.


[Figure: four panels plotted against time — GDP; detrended log(GDP); monthly S&P 500 index; monthly S&P 500 index returns.]

    Figure 1: Plots of some economic variables


Then $E(X_t) = 0$ and $\gamma_X(s,t) = E(X_s X_t)$. Let $s \le t$. When $s = t$,
$$\gamma_X(t,t) = E(X_t^2) = 1.25;$$
when $t = s + 1$,
$$\gamma_X(t, t+1) = E[(\epsilon_t + 0.5\epsilon_{t-1})(\epsilon_{t+1} + 0.5\epsilon_t)] = 0.5;$$
and when $t - s > 1$, $\gamma_X(s,t) = 0$.
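As a quick numerical check of Example 1, here is an illustrative sketch (Python with numpy assumed; not part of the original notes) that simulates the MA(1) process and compares the sample autocovariances with the theoretical values 1.25, 0.5 and 0:

```python
import numpy as np

rng = np.random.default_rng(0)
T = 200_000
e = rng.standard_normal(T + 1)
x = e[1:] + 0.5 * e[:-1]              # X_t = eps_t + 0.5 eps_{t-1}

def sample_autocov(x, h):
    # gamma_hat(h): average of (x_t - xbar)(x_{t-h} - xbar)
    xc = x - x.mean()
    return float(np.mean(xc[h:] * xc[:len(xc) - h]))

for h in range(4):
    print(h, round(sample_autocov(x, h), 3))   # approx 1.25, 0.5, 0.0, 0.0
```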

    3 Stationarity and Strict Stationarity

With autocovariance functions, we can define covariance stationarity, or weak stationarity. In the literature, stationarity usually means weak stationarity, unless otherwise specified.

Definition 2 (Stationarity or weak stationarity). The time series $\{X_t, t \in \mathbb{Z}\}$ (where $\mathbb{Z}$ is the integer set) is said to be stationary if

(I) $E(X_t^2) < \infty$ for all $t \in \mathbb{Z}$;

(II) $EX_t = \mu$ for all $t \in \mathbb{Z}$;

(III) $\gamma_X(s,t) = \gamma_X(s+h, t+h)$ for all $s, t, h \in \mathbb{Z}$.

In other words, a stationary time series $\{X_t\}$ must have three features: finite variance, a constant first moment, and a second moment $\gamma_X(s,t)$ that depends only on $(t-s)$ and not on $s$ or $t$. In light of the last point, we can rewrite the autocovariance function of a stationary process as
$$\gamma_X(h) = \mathrm{Cov}(X_t, X_{t+h}) \quad \text{for } t, h \in \mathbb{Z}.$$
Also, when $X_t$ is stationary, we must have
$$\gamma_X(h) = \gamma_X(-h).$$
When $h = 0$, $\gamma_X(0) = \mathrm{Cov}(X_t, X_t)$ is the variance of $X_t$, so the autocorrelation function for a stationary time series $\{X_t\}$ is defined to be
$$\rho_X(h) = \frac{\gamma_X(h)}{\gamma_X(0)}.$$

Example 1 (continued): In Example 1, we see that $E(X_t) = 0$, $E(X_t^2) = 1.25$, and the autocovariance function does not depend on $s$ or $t$. Indeed, we have $\gamma_X(0) = 1.25$, $\gamma_X(1) = 0.5$, and $\gamma_X(h) = 0$ for $h > 1$. Therefore, $\{X_t\}$ is a stationary process.

Example 2 (Random walk): Let $S_t$ be a random walk, $S_t = \sum_{s=1}^{t} X_s$ with $S_0 = 0$, where $X_t$ is independent and identically distributed with mean zero and variance $\sigma^2$. Then for $h > 0$,
$$\gamma_S(t, t+h) = \mathrm{Cov}(S_t, S_{t+h}) = \mathrm{Cov}\Big(\sum_{i=1}^{t} X_i, \sum_{j=1}^{t+h} X_j\Big) = \mathrm{Var}\Big(\sum_{i=1}^{t} X_i\Big) = t\sigma^2,$$
since $\mathrm{Cov}(X_i, X_j) = 0$ for $i \neq j$.


In this case, the autocovariance function depends on time $t$; therefore the random walk process $S_t$ is not stationary.
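A small simulation sketch (my own illustration, assuming numpy) makes the nonstationarity visible: across many independent random-walk paths, the variance at date $t$ grows linearly, matching $\gamma_S(t,t) = t\sigma^2$.

```python
import numpy as np

rng = np.random.default_rng(1)
n_paths, T, sigma = 10_000, 200, 1.0
X = rng.normal(0.0, sigma, size=(n_paths, T))
S = X.cumsum(axis=1)                               # S_t = X_1 + ... + X_t

for t in (10, 50, 200):
    print(t, round(float(S[:, t - 1].var()), 1))   # close to t * sigma^2
```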

Example 3 (Process with linear trend): Let $\epsilon_t \sim$ i.i.d.$(0, \sigma^2)$ and
$$X_t = \delta t + \epsilon_t.$$
Then $E(X_t) = \delta t$, which depends on $t$; therefore a process with a linear trend is not stationary.

Among stationary processes, there is a simple type of process that is widely used in constructing more complicated processes.

Example 4 (White noise): The time series $\epsilon_t$ is said to be a white noise with mean zero and variance $\sigma^2$, written as
$$\epsilon_t \sim WN(0, \sigma^2),$$
if and only if $\epsilon_t$ has zero mean and covariance function
$$\gamma_\epsilon(h) = \begin{cases}\sigma^2 & \text{if } h = 0, \\ 0 & \text{if } h \neq 0.\end{cases}$$
It is clear that a white noise process is stationary. Note that the white noise assumption is weaker than the assumption of an independently and identically distributed sequence.

To tell whether a process is covariance stationary, we compute the unconditional first two moments; therefore, processes with conditional heteroskedasticity may still be stationary.

Example 5 (ARCH model): Let $X_t = \epsilon_t$ with $E(\epsilon_t) = 0$, $E(\epsilon_t^2) = \sigma^2 > 0$, and $E(\epsilon_t\epsilon_s) = 0$ for $t \neq s$. Assume the following process for $\epsilon_t^2$:
$$\epsilon_t^2 = c + \rho\epsilon_{t-1}^2 + u_t,$$
where $0 < \rho < 1$ and $u_t \sim WN(0, 1)$. In this example, the conditional variance of $X_t$ is time varying, as
$$E_{t-1}(X_t^2) = E_{t-1}(\epsilon_t^2) = E_{t-1}(c + \rho\epsilon_{t-1}^2 + u_t) = c + \rho\epsilon_{t-1}^2.$$
However, the unconditional variance of $X_t$ is constant, $\sigma^2 = c/(1 - \rho)$. Therefore, this process is still stationary.
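The following sketch is an illustration of this point (not from the notes; numpy assumed). It uses the standard ARCH(1) parameterization $\epsilon_t = \sigma_t\eta_t$ with $\sigma_t^2 = c + \rho\epsilon_{t-1}^2$ and $\eta_t$ i.i.d. $N(0,1)$, which is a slightly different but related formulation: the conditional variance moves around over time while the sample (unconditional) variance settles near $c/(1-\rho)$.

```python
import numpy as np

rng = np.random.default_rng(2)
c, rho, T = 1.0, 0.5, 200_000
eps = np.zeros(T)
for t in range(1, T):
    sigma2_t = c + rho * eps[t - 1] ** 2      # conditional variance given time t-1
    eps[t] = np.sqrt(sigma2_t) * rng.standard_normal()

print(round(float(eps.var()), 3), c / (1.0 - rho))   # both close to 2.0
```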

Definition 3 (Strict stationarity). The time series $\{X_t, t \in \mathbb{Z}\}$ is said to be strictly stationary if the joint distribution of $(X_{t_1}, X_{t_2}, \ldots, X_{t_k})$ is the same as that of $(X_{t_1+h}, X_{t_2+h}, \ldots, X_{t_k+h})$.

In other words, strict stationarity means that the joint distribution only depends on the 'difference' $h$, not on the times $(t_1, \ldots, t_k)$.

Remarks: First note that finite variance is not assumed in the definition of strict stationarity; therefore, strict stationarity does not necessarily imply weak stationarity. For example, an i.i.d. Cauchy process is strictly stationary but not weakly stationary. Second, a nonlinear function of a strictly stationary variable is still strictly stationary, but this is not true for weak stationarity: for example, the square of a covariance stationary process may not have finite variance.


[Figure: four panels — S&P 500 index in 1999; S&P 500 returns in 1999; S&P 500 index in 2001; S&P 500 returns in 2001.]

    Figure 2: Plots of S&P index and returns in year 1999 and 2001

Finally, weak stationarity usually does not imply strict stationarity, as higher moments of the process may depend on time $t$. However, if the process $\{X_t\}$ is a Gaussian time series, which means that the distribution functions of $\{X_t\}$ are all multivariate Gaussian, i.e. the joint density
$$f_{X_t, X_{t+j_1}, \ldots, X_{t+j_k}}(x_t, x_{t+j_1}, \ldots, x_{t+j_k})$$
is Gaussian for any $j_1, j_2, \ldots, j_k$, then weak stationarity also implies strict stationarity. This is because a multivariate Gaussian distribution is fully characterized by its first two moments.

For example, a white noise is stationary but may not be strictly stationary, but a Gaussian white noise is strictly stationary. Also, general white noise only implies a lack of correlation, while Gaussian white noise also implies independence, because for a Gaussian process uncorrelatedness implies independence. Therefore, a Gaussian white noise is just i.i.d. $N(0, \sigma^2)$.

Stationary and nonstationary processes are very different in their properties, and they require different inference procedures. We will discuss this in much more detail throughout this course. At this point, note that a simple and useful way to tell whether a process is stationary in empirical studies is to plot the data. Loosely speaking, if a series does not seem to have a constant mean or variance, then very likely it is not stationary. For example, Figure 2 plots the daily S&P 500 index in 1999 and 2001. The upper left figure plots the index in 1999, the upper right figure plots the returns in 1999, the lower left figure plots the index in 2001, and the lower right figure plots the returns in 2001.

Note that the index levels are very different in 1999 and 2001. In 1999, the index wanders at a higher level and the market rises. In 2001, the level is much lower and the market drops.


In comparison, we do not see much difference in the returns in 1999 and 2001 (although the returns in 2001 seem to have thicker tails). In fact, judging only from the return data, it is very hard to tell which figure plots the market in a boom and which figure plots the market in a crash. Therefore, people usually treat stock price data as nonstationary and stock return data as stationary.

    4 Ergodicity

Recall that Kolmogorov's law of large numbers (LLN) tells us that if $X_i \sim$ i.i.d.$(\mu, \sigma^2)$ for $i = 1, \ldots, n$, then we have the following limit for the ensemble average:
$$\bar{X}_n = n^{-1}\sum_{i=1}^{n} X_i \to \mu.$$

In time series, we have the time series average, not the ensemble average. To explain the difference between the ensemble average and the time series average, consider the following experiment. Suppose we want to track the movements of some particles and draw inference about their expected position (suppose that these particles move on the real line). If we have a group of particles (group size $n$), then we can track the position of each particle and plot the distribution of their positions. The mean of this sample is called the ensemble average. If all these particles are i.i.d., the LLN tells us that this average converges to its expectation as $n \to \infty$. However, as we remarked earlier, with time series observations we only have one history. That means that in this experiment we only have one particle. Then, instead of collecting $n$ particles, we can only track this single particle and record its position, say $x_t$, for $t = 1, 2, \ldots, T$. The mean computed by averaging over time, $T^{-1}\sum_{t=1}^{T} x_t$, is called the time series average.

Does the time series average converge to the same limit as the ensemble average? The answer is yes if $X_t$ is stationary and ergodic. If $X_t$ is stationary and ergodic with $E(X_t) = \mu$, then the time series average has the same limit as the ensemble average,
$$\bar{X}_T = T^{-1}\sum_{t=1}^{T} X_t \to \mu.$$
This result is given by the ergodic theorem, and we will discuss it later in lecture 4 on asymptotic theory. Note that this result requires both stationarity and ergodicity. We have explained stationarity, and we have seen that stationarity allows time series dependence. Ergodicity requires 'average asymptotic independence'. Note that stationarity itself does not guarantee ergodicity (page 47 in Hamilton and lecture 4).
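A minimal sketch of the ergodic theorem at work (my own illustration, assuming numpy): for a stationary, ergodic AR(1) with mean $\mu$, the time average computed from a single realization approaches $\mu$ as the sample grows.

```python
import numpy as np

rng = np.random.default_rng(3)
mu, phi, T = 2.0, 0.8, 100_000
x = np.empty(T)
x[0] = mu
for t in range(1, T):
    x[t] = mu + phi * (x[t - 1] - mu) + rng.standard_normal()

for n in (100, 10_000, 100_000):
    print(n, round(float(x[:n].mean()), 3))   # time average drifts toward mu = 2.0
```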

Readings: Hamilton, Ch. 3.1; Brockwell and Davis, pages 1-29; Hayashi, pages 97-102.


    Lecture 2: ARMA Models∗

    1 ARMA Process

As we have remarked, dependence is very common in time series observations. To model this time series dependence, we start with univariate ARMA models. To motivate the model, we can basically follow two lines of thinking. First, for a series $x_t$, we can model the level of its current observation as depending on the level of its lagged observations. For example, if we observe a high GDP realization this quarter, we would expect that GDP in the next few quarters will be good as well. This way of thinking can be represented by an AR model. The AR(1) (autoregressive of order one) model can be written as
$$x_t = \phi x_{t-1} + \epsilon_t,$$
where $\epsilon_t \sim WN(0, \sigma^2)$, and we keep this assumption throughout this lecture. Similarly, the AR($p$) (autoregressive of order $p$) model can be written as
$$x_t = \phi_1 x_{t-1} + \phi_2 x_{t-2} + \ldots + \phi_p x_{t-p} + \epsilon_t.$$

In a second line of thinking, we can model the observation of a random variable at time $t$ as being affected not only by the shock at time $t$, but also by shocks that took place before time $t$. For example, if we observe a negative shock to the economy, say a catastrophic earthquake, then we would expect this negative effect to affect the economy not only at the time it takes place, but also in the near future. This kind of thinking can be represented by an MA model. The MA(1) (moving average of order one) and MA($q$) (moving average of order $q$) models can be written as
$$x_t = \epsilon_t + \theta\epsilon_{t-1}$$
and
$$x_t = \epsilon_t + \theta_1\epsilon_{t-1} + \ldots + \theta_q\epsilon_{t-q}.$$

If we combine these two models, we get a general ARMA($p, q$) model,
$$x_t = \phi_1 x_{t-1} + \phi_2 x_{t-2} + \ldots + \phi_p x_{t-p} + \epsilon_t + \theta_1\epsilon_{t-1} + \ldots + \theta_q\epsilon_{t-q}.$$
The ARMA model provides one of the basic tools of time series modeling. In the next few sections, we will discuss how to draw inferences using a univariate ARMA model.

    ∗Copyright 2002-2006 by Ling Hu.


    2 Lag Operators

Lag operators enable us to present an ARMA model in a much more concise way. Applying the lag operator (denoted $L$) once, we move the index back one time unit; applying it $k$ times, we move the index back $k$ units:
$$Lx_t = x_{t-1}, \quad L^2 x_t = x_{t-2}, \quad \ldots, \quad L^k x_t = x_{t-k}.$$
The lag operator is distributive over the addition operator, i.e.
$$L(x_t + y_t) = x_{t-1} + y_{t-1}.$$

Using lag operators, we can rewrite the ARMA models as:
$$\text{AR(1)}: \ (1 - \phi L)x_t = \epsilon_t$$
$$\text{AR}(p): \ (1 - \phi_1 L - \phi_2 L^2 - \ldots - \phi_p L^p)x_t = \epsilon_t$$
$$\text{MA(1)}: \ x_t = (1 + \theta L)\epsilon_t$$
$$\text{MA}(q): \ x_t = (1 + \theta_1 L + \theta_2 L^2 + \ldots + \theta_q L^q)\epsilon_t$$
Let $\phi_0 = 1$, $\theta_0 = 1$ and define the lag polynomials
$$\phi(L) = 1 - \phi_1 L - \phi_2 L^2 - \ldots - \phi_p L^p,$$
$$\theta(L) = 1 + \theta_1 L + \theta_2 L^2 + \ldots + \theta_q L^q.$$
With lag polynomials, we can rewrite an ARMA process in a more compact way:
$$\text{AR}: \ \phi(L)x_t = \epsilon_t$$
$$\text{MA}: \ x_t = \theta(L)\epsilon_t$$
$$\text{ARMA}: \ \phi(L)x_t = \theta(L)\epsilon_t$$

    3 Invertibility

Given a time series probability model, we can usually find multiple ways to represent it. Which representation to choose depends on our problem. For example, to study impulse-response functions (section 4), MA representations may be more convenient, while to estimate an ARMA model, AR representations may be more convenient, since usually $x_t$ is observable while $\epsilon_t$ is not. However, not all ARMA processes can be inverted. In this section, we consider under what conditions we can invert an AR model into an MA model and invert an MA model into an AR model. It turns out that invertibility, which means that the process can be inverted, is an important property of the model.

If we let 1 denote the identity operator, i.e. $1y_t = y_t$, then the inversion operator $(1 - \phi L)^{-1}$ is defined to be the operator such that
$$(1 - \phi L)^{-1}(1 - \phi L) = 1.$$


For the AR(1) process, if we premultiply both sides of the equation by $(1 - \phi L)^{-1}$, we get
$$x_t = (1 - \phi L)^{-1}\epsilon_t.$$
Is there any explicit way to rewrite $(1 - \phi L)^{-1}$? Yes, and the answer turns out to be $\theta(L)$ with $\theta_k = \phi^k$ for $|\phi| < 1$. To show this,
$$(1 - \phi L)\theta(L) = (1 - \phi L)(1 + \theta_1 L + \theta_2 L^2 + \ldots) = (1 - \phi L)(1 + \phi L + \phi^2 L^2 + \ldots)$$
$$= 1 - \phi L + \phi L - \phi^2 L^2 + \phi^2 L^2 - \phi^3 L^3 + \ldots = 1 - \lim_{k\to\infty}\phi^k L^k = 1 \quad \text{for } |\phi| < 1.$$

We can also verify this result by recursive substitution,
$$x_t = \phi x_{t-1} + \epsilon_t = \phi^2 x_{t-2} + \epsilon_t + \phi\epsilon_{t-1} = \ldots = \phi^k x_{t-k} + \epsilon_t + \phi\epsilon_{t-1} + \ldots + \phi^{k-1}\epsilon_{t-k+1} = \phi^k x_{t-k} + \sum_{j=0}^{k-1}\phi^j\epsilon_{t-j}.$$
With $|\phi| < 1$, we have $\lim_{k\to\infty}\phi^k x_{t-k} = 0$, so again we get the moving average representation with MA coefficients equal to $\phi^k$. So the condition $|\phi| < 1$ enables us to invert an AR(1) process into an MA($\infty$) process:
$$\text{AR(1)}: \ (1 - \phi L)x_t = \epsilon_t$$
$$\text{MA}(\infty): \ x_t = \theta(L)\epsilon_t \quad \text{with } \theta_k = \phi^k$$
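A short numerical sketch of this inversion (not from the notes; numpy and scipy assumed): the impulse response of the filter $(1 - \phi L)x_t = \epsilon_t$ equals the MA($\infty$) coefficients $\theta_k = \phi^k$.

```python
import numpy as np
from scipy.signal import lfilter

phi, K = 0.7, 8
impulse = np.zeros(K)
impulse[0] = 1.0
psi = lfilter([1.0], [1.0, -phi], impulse)     # response of x_t = phi x_{t-1} + e_t
print(np.allclose(psi, phi ** np.arange(K)))   # True
```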

We have obtained some nice results in inverting an AR(1) process into an MA($\infty$) process. Then, how do we invert a general AR($p$) process? We need to factorize the lag polynomial and then make use of the result that $(1 - \phi L)^{-1} = \theta(L)$. For example, let $p = 2$; we have
$$(1 - \phi_1 L - \phi_2 L^2)x_t = \epsilon_t. \qquad (1)$$
To factorize this polynomial, we need to find roots $\lambda_1$ and $\lambda_2$ such that
$$(1 - \phi_1 L - \phi_2 L^2) = (1 - \lambda_1 L)(1 - \lambda_2 L).$$
Given that both $|\lambda_1| < 1$ and $|\lambda_2| < 1$, each factor can be inverted as before,


and so to invert (1), we have
$$x_t = (1 - \lambda_1 L)^{-1}(1 - \lambda_2 L)^{-1}\epsilon_t = \theta_1(L)\theta_2(L)\epsilon_t.$$
Solving $\theta_1(L)\theta_2(L)$ is straightforward,
$$\theta_1(L)\theta_2(L) = (1 + \lambda_1 L + \lambda_1^2 L^2 + \ldots)(1 + \lambda_2 L + \lambda_2^2 L^2 + \ldots) = 1 + (\lambda_1 + \lambda_2)L + (\lambda_1^2 + \lambda_1\lambda_2 + \lambda_2^2)L^2 + \ldots = \sum_{k=0}^{\infty}\Big(\sum_{j=0}^{k}\lambda_1^j\lambda_2^{k-j}\Big)L^k = \psi(L), \text{ say},$$
with $\psi_k = \sum_{j=0}^{k}\lambda_1^j\lambda_2^{k-j}$. Similarly, we can also invert the general AR($p$) process given that all roots $\lambda_i$ have absolute value less than one. An alternative way to represent this MA process (to express $\psi$) is to make use of partial fractions. Let $c_1, c_2$ be two constants whose values are determined by
$$\frac{1}{(1-\lambda_1 L)(1-\lambda_2 L)} = \frac{c_1}{1-\lambda_1 L} + \frac{c_2}{1-\lambda_2 L} = \frac{c_1(1-\lambda_2 L) + c_2(1-\lambda_1 L)}{(1-\lambda_1 L)(1-\lambda_2 L)}.$$
We must have
$$1 = c_1(1-\lambda_2 L) + c_2(1-\lambda_1 L) = (c_1 + c_2) - (c_1\lambda_2 + c_2\lambda_1)L,$$
which gives $c_1 + c_2 = 1$ and $c_1\lambda_2 + c_2\lambda_1 = 0$. Solving these two equations we get
$$c_1 = \frac{\lambda_1}{\lambda_1 - \lambda_2}, \qquad c_2 = \frac{\lambda_2}{\lambda_2 - \lambda_1}.$$
Then we can express $x_t$ as
$$x_t = [(1-\lambda_1 L)(1-\lambda_2 L)]^{-1}\epsilon_t = c_1(1-\lambda_1 L)^{-1}\epsilon_t + c_2(1-\lambda_2 L)^{-1}\epsilon_t = c_1\sum_{k=0}^{\infty}\lambda_1^k\epsilon_{t-k} + c_2\sum_{k=0}^{\infty}\lambda_2^k\epsilon_{t-k} = \sum_{k=0}^{\infty}\psi_k\epsilon_{t-k},$$
where $\psi_k = c_1\lambda_1^k + c_2\lambda_2^k$.


Similarly, an MA process,
$$x_t = \theta(L)\epsilon_t,$$
is invertible if $\theta(L)^{-1}$ exists. An MA(1) process is invertible if $|\theta| < 1$, and an MA($q$) process is invertible if all roots of
$$1 + \theta_1 z + \theta_2 z^2 + \ldots + \theta_q z^q = 0$$
lie outside of the unit circle. Note that for any invertible MA process, we can find a noninvertible MA process which is the same as the invertible process up to the second moment. The converse is also true. We will give an example in section 5.

Finally, given an invertible ARMA($p, q$) process,
$$\phi(L)x_t = \theta(L)\epsilon_t, \qquad x_t = \phi^{-1}(L)\theta(L)\epsilon_t, \qquad x_t = \psi(L)\epsilon_t,$$
what is the series $\psi_k$? Note that since
$$\phi^{-1}(L)\theta(L)\epsilon_t = \psi(L)\epsilon_t,$$
we have $\theta(L) = \phi(L)\psi(L)$. So the elements of $\psi$ can be computed recursively by equating the coefficients of $L^k$.

Example 1: For an ARMA(1, 1) process, we have
$$1 + \theta L = (1 - \phi L)(\psi_0 + \psi_1 L + \psi_2 L^2 + \ldots) = \psi_0 + (\psi_1 - \phi\psi_0)L + (\psi_2 - \phi\psi_1)L^2 + \ldots$$
Matching coefficients on $L^k$, we get
$$1 = \psi_0, \qquad \theta = \psi_1 - \phi\psi_0, \qquad 0 = \psi_j - \phi\psi_{j-1} \quad \text{for } j \ge 2.$$
Solving these equations, we easily get
$$\psi_0 = 1, \qquad \psi_1 = \phi + \theta, \qquad \psi_j = \phi^{j-1}(\phi + \theta) \quad \text{for } j \ge 2.$$
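The recursion in Example 1 is easy to implement. The following sketch (an illustration assuming numpy; the values of $\phi$ and $\theta$ are arbitrary) computes $\psi_j$ by the coefficient-matching recursion and checks the closed form $\psi_j = \phi^{j-1}(\phi + \theta)$.

```python
import numpy as np

phi, theta, J = 0.5, 0.3, 10
psi = np.empty(J)
psi[0] = 1.0                          # 1 = psi_0
psi[1] = phi * psi[0] + theta         # theta = psi_1 - phi * psi_0
for j in range(2, J):
    psi[j] = phi * psi[j - 1]         # 0 = psi_j - phi * psi_{j-1}

closed = np.r_[1.0, phi ** (np.arange(1, J) - 1.0) * (phi + theta)]
print(np.allclose(psi, closed))       # True
```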

    4 Impulse-Response Functions

Given an ARMA model, $\phi(L)x_t = \theta(L)\epsilon_t$, it is natural to ask: what is the effect on $x_t$ of a unit shock at time $s$ (for $s < t$)?


    4.1 MA process

For an MA(1) process,
$$x_t = \epsilon_t + \theta\epsilon_{t-1},$$
the effects of $\epsilon$ on $x$ are:

$\epsilon$: 0 1 0 0 0
$x$: 0 1 $\theta$ 0 0

For an MA($q$) process,
$$x_t = \epsilon_t + \theta_1\epsilon_{t-1} + \theta_2\epsilon_{t-2} + \ldots + \theta_q\epsilon_{t-q},$$
the effects of $\epsilon$ on $x$ are:

$\epsilon$: 0 1 0 0 $\ldots$ 0 0
$x$: 0 1 $\theta_1$ $\theta_2$ $\ldots$ $\theta_q$ 0

The left panel of Figure 1 plots the impulse-response function of an MA(3) process. Similarly, we can write down the effects for an MA($\infty$) process. As you can see, we can read the impulse-response function immediately from an MA process.

    4.2 AR process

For an AR(1) process $x_t = \phi x_{t-1} + \epsilon_t$ with $|\phi| < 1$, we can invert it to an MA process, and the effects of $\epsilon$ on $x$ are:

$\epsilon$: 0 1 0 0 $\ldots$
$x$: 0 1 $\phi$ $\phi^2$ $\ldots$

As can be seen from above, the impulse-response dynamics are quite clear from an MA representation. For example, let $t > s > 0$; given a one unit increase in $\epsilon_s$, the effect on $x_t$ would be $\phi^{t-s}$, if there are no other shocks. If there are shocks that take place at times other than $s$ and have nonzero effects on $x_t$, then we can add these effects, since this is a linear model.

The dynamics are a bit more complicated for a higher order AR process, but after applying our old trick of inverting it to an MA process, the analysis is straightforward. Take an AR(2) process as an example.

Example 2:
$$x_t = 0.6x_{t-1} + 0.2x_{t-2} + \epsilon_t,$$
or
$$(1 - 0.6L - 0.2L^2)x_t = \epsilon_t.$$
We first solve the polynomial
$$y^2 + 3y - 5 = 0$$
and get the two roots¹ $y_1 \approx 1.1926$ and $y_2 \approx -4.1926$. Recall that $\lambda_1 = 1/y_1 \approx 0.84$ and $\lambda_2 = 1/y_2 \approx -0.24$. So we can factorize the lag polynomial as
$$(1 - 0.6L - 0.2L^2)x_t = (1 - 0.84L)(1 + 0.24L)x_t,$$
$$x_t = (1 - 0.84L)^{-1}(1 + 0.24L)^{-1}\epsilon_t = \psi(L)\epsilon_t,$$

¹Recall that the roots of the polynomial $ay^2 + by + c = 0$ are $\frac{-b \pm \sqrt{b^2 - 4ac}}{2a}$.


where $\psi_k = \sum_{j=0}^{k}\lambda_1^j\lambda_2^{k-j}$. In this example, the series of $\psi$ is $\{1, 0.6, 0.56, 0.456, 0.386, \ldots\}$. So the effects of $\epsilon$ on $x$ can be described as:

$\epsilon$: 0 1 0 0 0 $\ldots$
$x$: 0 1 0.6 0.56 0.456 $\ldots$

The right panel of Figure 1 plots this impulse-response function. So after we invert an AR($p$) process to an MA process, given $t > s > 0$, the effect of a one unit increase in $\epsilon_s$ on $x_t$ is just $\psi_{t-s}$.

We can see that given a linear process, AR or ARMA, if we can represent it as an MA process, we obtain the impulse-response dynamics immediately. In fact, the MA representation is the same thing as the impulse-response function.
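For the AR(2) in Example 2, the $\psi$ series can be read off as the impulse response of the corresponding filter. A sketch (assuming scipy; the coefficients are those of the example):

```python
import numpy as np
from scipy.signal import lfilter

impulse = np.zeros(6)
impulse[0] = 1.0
psi = lfilter([1.0], [1.0, -0.6, -0.2], impulse)   # x_t = 0.6 x_{t-1} + 0.2 x_{t-2} + e_t
print(np.round(psi, 4))   # [1. 0.6 0.56 0.456 0.3856 0.3226]
```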


Figure 1: The impulse-response functions of an MA(3) process (θ1 = 0.6, θ2 = −0.5, θ3 = 0.4) and an AR(2) process (φ1 = 0.6, φ2 = 0.2), with a unit shock at time zero

    5 Autocovariance Functions and Stationarity of ARMA models

    5.1 MA(1)

$$x_t = \epsilon_t + \theta\epsilon_{t-1},$$
where $\epsilon_t \sim WN(0, \sigma^2)$. It is easy to calculate the first two moments of $x_t$:
$$E(x_t) = E(\epsilon_t + \theta\epsilon_{t-1}) = 0, \qquad E(x_t^2) = (1 + \theta^2)\sigma^2,$$
and
$$\gamma_x(t, t+h) = E[(\epsilon_t + \theta\epsilon_{t-1})(\epsilon_{t+h} + \theta\epsilon_{t+h-1})] = \begin{cases}\theta\sigma^2 & \text{for } h = 1, \\ 0 & \text{for } h > 1.\end{cases}$$


So, for an MA(1) process, we have a fixed mean and a covariance function which does not depend on time $t$: $\gamma(0) = (1 + \theta^2)\sigma^2$, $\gamma(1) = \theta\sigma^2$, and $\gamma(h) = 0$ for $h > 1$. So we know that the MA(1) process is stationary for any finite value of $\theta$. The autocorrelation can be computed as $\rho_x(h) = \gamma_x(h)/\gamma_x(0)$, so
$$\rho_x(0) = 1, \qquad \rho_x(1) = \frac{\theta}{1 + \theta^2}, \qquad \rho_x(h) = 0 \ \text{for } h > 1.$$

We proposed in the section on invertibility that for an invertible (noninvertible) MA process, there always exists a noninvertible (invertible) process which is the same as the original process up to the second moment. We use the following MA(1) process as an example.

Example 3: The process
$$x_t = \epsilon_t + \theta\epsilon_{t-1}, \quad \epsilon_t \sim WN(0, \sigma^2), \quad |\theta| > 1,$$
is noninvertible. Consider an invertible MA process defined as
$$\tilde{x}_t = \tilde{\epsilon}_t + (1/\theta)\tilde{\epsilon}_{t-1}, \quad \tilde{\epsilon}_t \sim WN(0, \theta^2\sigma^2).$$
Then we can compute that $E(x_t) = E(\tilde{x}_t) = 0$, $E(x_t^2) = E(\tilde{x}_t^2) = (1 + \theta^2)\sigma^2$, $\gamma_x(1) = \gamma_{\tilde{x}}(1) = \theta\sigma^2$, and $\gamma_x(h) = \gamma_{\tilde{x}}(h) = 0$ for $h > 1$. Therefore, these two processes are equivalent up to the second moments. To be more concrete, we plug in some numbers. Let $\theta = 2$; we know that the process
$$x_t = \epsilon_t + 2\epsilon_{t-1}, \quad \epsilon_t \sim WN(0, 1),$$
is noninvertible. Consider the invertible process
$$\tilde{x}_t = \tilde{\epsilon}_t + (1/2)\tilde{\epsilon}_{t-1}, \quad \tilde{\epsilon}_t \sim WN(0, 4).$$
Note that $E(x_t) = E(\tilde{x}_t) = 0$, $E(x_t^2) = E(\tilde{x}_t^2) = 5$, $\gamma_x(1) = \gamma_{\tilde{x}}(1) = 2$, and $\gamma_x(h) = \gamma_{\tilde{x}}(h) = 0$ for $h > 1$.

Although these two representations, the noninvertible MA and the invertible MA, generate the same process up to the second moment, we prefer the invertible representation in practice, because if we can invert an MA process to an AR process, we can find the value of $\epsilon_t$ (non-observable) based on all past values of $x$ (observable). If a process is noninvertible, then, in order to find the value of $\epsilon_t$, we would have to know all future values of $x$.
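A small sketch (plain Python, my own illustration) that tabulates the second moments of the two representations in Example 3 and confirms that they coincide:

```python
def ma1_acov(theta, var_e, h):
    """Autocovariance of x_t = e_t + theta * e_{t-1} with Var(e_t) = var_e."""
    if h == 0:
        return (1 + theta ** 2) * var_e
    return theta * var_e if h == 1 else 0.0

for h in range(3):
    # noninvertible (theta=2, Var=1) vs invertible (theta=1/2, Var=4)
    print(h, ma1_acov(2.0, 1.0, h), ma1_acov(0.5, 4.0, h))   # 5 5, 2 2, 0 0
```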

    5.2 MA(q )

$$x_t = \theta(L)\epsilon_t = \sum_{k=0}^{q}(\theta_k L^k)\epsilon_t$$


The first two moments are
$$E(x_t) = 0, \qquad E(x_t^2) = \sum_{k=0}^{q}\theta_k^2\sigma^2,$$
and
$$\gamma_x(h) = \begin{cases}\sum_{k=0}^{q-h}\theta_k\theta_{k+h}\sigma^2 & \text{for } h = 1, 2, \ldots, q, \\ 0 & \text{for } h > q.\end{cases}$$
Again, an MA($q$) process is stationary for any finite values of $\theta_1, \ldots, \theta_q$.

    5.3 MA(∞)

$$x_t = \theta(L)\epsilon_t = \sum_{k=0}^{\infty}(\theta_k L^k)\epsilon_t$$
Before we compute moments and discuss the stationarity of $x_t$, we should first make sure that $\{x_t\}$ converges.

Proposition 1: If $\{\epsilon_t\}$ is a sequence of white noise with $\sigma^2 < \infty$ and $\{\theta_k\}$ is square summable, then $\sum_{k=0}^{\infty}\theta_k\epsilon_{t-k}$ converges in mean square. To see this, for $n > m > 0$, we want to show that
$$E\left[\sum_{k=1}^{n}\theta_k\epsilon_{t-k} - \sum_{k=1}^{m}\theta_k\epsilon_{t-k}\right]^2 = \sum_{m < k \le n}\theta_k^2\sigma^2 = \left[\sum_{k=0}^{n}\theta_k^2 - \sum_{k=0}^{m}\theta_k^2\right]\sigma^2 \to 0 \quad \text{as } m, n \to \infty.$$
The result holds since $\{\theta_k\}$ is square summable. It is often more convenient to work with a slightly stronger condition, absolute summability:
$$\sum_{k=0}^{\infty}|\theta_k| < \infty.$$


It is easy to show that absolute summability implies square summability. An MA($\infty$) process with absolutely summable coefficients is stationary, with moments
$$E(x_t) = 0, \qquad E(x_t^2) = \sum_{k=0}^{\infty}\theta_k^2\sigma^2, \qquad \gamma_x(h) = \sum_{k=0}^{\infty}\theta_k\theta_{k+h}\sigma^2.$$

    5.4 AR(1)

$$(1 - \phi L)x_t = \epsilon_t \qquad (2)$$
Recall that an AR(1) process with $|\phi| < 1$ can be inverted to an MA($\infty$) process
$$x_t = \theta(L)\epsilon_t \quad \text{with } \theta_k = \phi^k.$$
With $|\phi| < 1$, it is easy to check that absolute summability holds:
$$\sum_{k=0}^{\infty}|\theta_k| = \sum_{k=0}^{\infty}|\phi^k| < \infty.$$
Using the results for MA($\infty$), the moments for $x_t$ in (2) can be computed:
$$E(x_t) = 0, \qquad E(x_t^2) = \sum_{k=0}^{\infty}\phi^{2k}\sigma^2 = \frac{\sigma^2}{1 - \phi^2}, \qquad \gamma_x(h) = \sum_{k=0}^{\infty}\phi^{2k+h}\sigma^2 = \frac{\phi^h\sigma^2}{1 - \phi^2}.$$
So, an AR(1) process with $|\phi| < 1$ is stationary.

    5.5 AR(p)

Recall that an AR($p$) process
$$(1 - \phi_1 L - \phi_2 L^2 - \ldots - \phi_p L^p)x_t = \epsilon_t$$
can be inverted to an MA process $x_t = \theta(L)\epsilon_t$ if all $\lambda_i$ in
$$(1 - \phi_1 L - \phi_2 L^2 - \ldots - \phi_p L^p) = (1 - \lambda_1 L)(1 - \lambda_2 L)\cdots(1 - \lambda_p L) \qquad (3)$$
have absolute value less than one. It also turns out that with $|\lambda_i| < 1$, the absolute summability $\sum_{k=0}^{\infty}|\psi_k| < \infty$ is also satisfied. (The proof can be found on page 770 of Hamilton; it uses the result that $\psi_k = c_1\lambda_1^k + c_2\lambda_2^k$.)


When we solve the polynomial equation
$$(L - y_1)(L - y_2)\cdots(L - y_p) = 0, \qquad (4)$$
the requirement that $|\lambda_i| < 1$ is equivalent to all roots of (4) lying outside the unit circle, i.e. $|y_i| > 1$ for all $i$.

First calculate the expectation of $x_t$: $E(x_t) = 0$. To compute the second moments, one method is to invert the process into an MA process and use the formula for the autocovariance function of an MA($\infty$). This method requires finding the moving average coefficients $\psi$; an alternative method, known as the Yule-Walker method, may be more convenient for finding the autocovariance functions. To illustrate this method, take an AR(2) process as an example:

$$x_t = \phi_1 x_{t-1} + \phi_2 x_{t-2} + \epsilon_t$$
Multiply both sides of the equation by $x_t, x_{t-1}, x_{t-2}, \ldots$, take expectations, and then divide by $\gamma(0)$; we get the following equations:
$$1 = \phi_1\rho(1) + \phi_2\rho(2) + \sigma^2/\gamma(0)$$
$$\rho(1) = \phi_1 + \phi_2\rho(1)$$
$$\rho(2) = \phi_1\rho(1) + \phi_2$$
$$\rho(k) = \phi_1\rho(k-1) + \phi_2\rho(k-2) \quad \text{for } k \ge 3$$
$\rho(1)$ can first be solved from the second equation, $\rho(1) = \phi_1/(1 - \phi_2)$; $\rho(2)$ can then be solved from the third equation; $\rho(k)$ can be solved recursively using $\rho(1)$ and $\rho(2)$; and finally, $\gamma(0)$ can be solved from the first equation. Using $\gamma(0)$ and $\rho(k)$, $\gamma(k)$ can be computed as $\gamma(k) = \rho(k)\gamma(0)$. Figure 2 plots this autocorrelation for $k = 0, \ldots, 50$ with the parameters set to $\phi_1 = 0.5$ and $\phi_2 = 0.3$. As is clear from the graph, the autocorrelation is very close to zero when $k > 40$.
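The Yule-Walker recursion above is straightforward to code. A sketch (my own illustration, assuming numpy) for $\phi_1 = 0.5$, $\phi_2 = 0.3$ and $\sigma^2 = 1$:

```python
import numpy as np

phi1, phi2, sigma2, K = 0.5, 0.3, 1.0, 51
rho = np.empty(K)
rho[0] = 1.0
rho[1] = phi1 / (1.0 - phi2)                 # from rho(1) = phi1 + phi2 * rho(1)
rho[2] = phi1 * rho[1] + phi2                # from the third equation
for k in range(3, K):
    rho[k] = phi1 * rho[k - 1] + phi2 * rho[k - 2]

gamma0 = sigma2 / (1.0 - phi1 * rho[1] - phi2 * rho[2])  # from the first equation
gamma = rho * gamma0
print(round(rho[1], 3), round(gamma0, 3), round(rho[40], 4))
```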


    Figure 2: Plot of the autocorrelation of AR(2) process, with  φ1  = 0.5 and  φ2 = 0.3


    5.6 ARMA( p, q )

Given an invertible ARMA($p, q$) process, we have shown that
$$\phi(L)x_t = \theta(L)\epsilon_t;$$
inverting $\phi(L)$ we obtain
$$x_t = \phi(L)^{-1}\theta(L)\epsilon_t = \psi(L)\epsilon_t.$$
Therefore, an ARMA($p, q$) process is stationary as long as $\phi(L)$ is invertible. In other words, the stationarity of the ARMA process depends only on the autoregressive parameters, and not on the moving average parameters (assuming that all parameters are finite).

The expectation of this process is $E(x_t) = 0$. To find the autocovariance function, first we can invert the process to an MA process and find the MA coefficients $\psi(L) = \phi(L)^{-1}\theta(L)$. We have shown an example of finding $\psi$ for the ARMA(1, 1) process, where we have
$$(1 - \phi L)x_t = (1 + \theta L)\epsilon_t, \qquad x_t = \psi(L)\epsilon_t = \sum_{j=0}^{\infty}\psi_j\epsilon_{t-j},$$

    process we have

    γ x(0) =

    ∞Xk=0

    ψ2kσ2

    =

    1 +

    ∞Xk=1

    φ2(k−1)(φ + θ)2

    !σ2

    =

    1 +

     (φ + θ)2

    1− φ2σ2

    If we plug in some numbers, say,  φ  = 0.5 and  θ  = 0.5, so the original process is  xt = 0.5xt−1 + t +0.5t−1, then  γ x(0) = (7/3)σ

    2

    . For  h ≥ 1,

$$\gamma_x(h) = \sum_{k=0}^{\infty}\psi_k\psi_{k+h}\sigma^2 = \left(\phi^{h-1}(\phi + \theta) + \phi^{h-2}(\phi + \theta)^2\sum_{k=1}^{\infty}\phi^{2k}\right)\sigma^2 = \phi^{h-1}(\phi + \theta)\left(1 + \frac{(\phi + \theta)\phi}{1 - \phi^2}\right)\sigma^2.$$
Plugging in $\phi = \theta = 0.5$, we have, for $h \ge 1$,
$$\gamma_x(h) = \frac{5 \cdot 2^{1-h}}{3}\sigma^2.$$


An alternative way to compute the autocovariance function is to multiply each side of $\phi(L)x_t = \theta(L)\epsilon_t$ by $x_t, x_{t-1}, \ldots$ and take expectations. In our ARMA(1, 1) example, this gives
$$\gamma_x(0) - \phi\gamma_x(1) = [1 + \theta(\theta + \phi)]\sigma^2$$
$$\gamma_x(1) - \phi\gamma_x(0) = \theta\sigma^2$$
$$\gamma_x(2) - \phi\gamma_x(1) = 0$$
$$\vdots$$
$$\gamma_x(h) - \phi\gamma_x(h-1) = 0 \quad \text{for } h \ge 2,$$
where we use $x_t = \psi(L)\epsilon_t$ in taking expectations on the right side; for instance, $E(x_t\epsilon_t) = E((\epsilon_t + \psi_1\epsilon_{t-1} + \ldots)\epsilon_t) = \sigma^2$. Plugging in $\theta = \phi = 0.5$ and solving these equations, we have $\gamma_x(0) = (7/3)\sigma^2$, $\gamma_x(1) = (5/3)\sigma^2$, and $\gamma_x(h) = \gamma_x(h-1)/2$ for $h \ge 2$. This is the same result as we got using the first method.
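The first two equations above form a small linear system in $\gamma_x(0)$ and $\gamma_x(1)$; a sketch (my own illustration, assuming numpy) solves it for $\phi = \theta = 0.5$ and $\sigma^2 = 1$ and recovers $7/3$ and $5/3$:

```python
import numpy as np

phi, theta, sigma2 = 0.5, 0.5, 1.0
A = np.array([[1.0, -phi],
              [-phi, 1.0]])
b = np.array([(1.0 + theta * (theta + phi)) * sigma2,
              theta * sigma2])
gamma0, gamma1 = np.linalg.solve(A, b)
print(gamma0, 7 / 3)   # 2.333...
print(gamma1, 5 / 3)   # 1.666...
```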

Summary: An MA process is stationary if and only if its coefficients $\{\theta_k\}$ are square summable (absolutely summable), i.e. $\sum_{k=0}^{\infty}\theta_k^2 < \infty$ or $\sum_{k=0}^{\infty}|\theta_k| < \infty$. Therefore, an MA process with a finite number of MA coefficients is always stationary. Note that stationarity does not require the MA process to be invertible.

An AR process is stationary if it is invertible, i.e. $|\lambda_i| < 1$ or $|y_i| > 1$, as defined in (3) and (4) respectively. An ARMA($p, q$) process is stationary if its autoregressive lag polynomial is invertible.

    5.7 Autocovariance generating function of stationary ARMA process

For a covariance stationary process, we have seen that the autocovariance function is very useful in describing the process. One way to summarize absolutely summable autocovariance functions ($\sum_{h=-\infty}^{\infty}|\gamma(h)| < \infty$) is to use the autocovariance-generating function
$$g_x(z) = \sum_{h=-\infty}^{\infty}\gamma(h)z^h,$$
where $z$ could be a complex number.

For white noise, the autocovariance-generating function (AGF) is just a constant, i.e. for $\epsilon \sim WN(0, \sigma^2)$, $g_\epsilon(z) = \sigma^2$.

For an MA(1) process,
$$x_t = (1 + \theta L)\epsilon_t, \quad \epsilon \sim WN(0, \sigma^2),$$
we can compute
$$g_x(z) = \sigma^2[\theta z^{-1} + (1 + \theta^2) + \theta z] = \sigma^2(1 + \theta z)(1 + \theta z^{-1}).$$

For an MA($q$) process,
$$x_t = (1 + \theta_1 L + \ldots + \theta_q L^q)\epsilon_t,$$
we know that $\gamma_x(h) = \sum_{k=0}^{q-h}\theta_k\theta_{k+h}\sigma^2$ for $h = 1, \ldots, q$ and $\gamma_x(h) = 0$ for $h > q$. We have
$$g_x(z) = \sum_{h=-\infty}^{\infty}\gamma(h)z^h$$


$$= \sigma^2\left(\sum_{k=0}^{q}\theta_k^2 + \sum_{h=1}^{q}\sum_{k=0}^{q-h}\theta_k\theta_{k+h}(z^{-h} + z^{h})\right) = \sigma^2\left(\sum_{k=0}^{q}\theta_k z^k\right)\left(\sum_{k=0}^{q}\theta_k z^{-k}\right)$$

For an MA($\infty$) process $x_t = \theta(L)\epsilon_t$ where $\sum_{k=0}^{\infty}|\theta_k| < \infty$, we can naturally replace $q$ by $\infty$ in the AGF for MA($q$) to get the AGF for MA($\infty$),
$$g_x(z) = \sigma^2\left(\sum_{k=0}^{\infty}\theta_k z^k\right)\left(\sum_{k=0}^{\infty}\theta_k z^{-k}\right) = \sigma^2\theta(z)\theta(z^{-1}).$$

Next, for a stationary AR or ARMA process, we can invert it to an MA process. For instance, take an AR(1) process, $(1 - \phi L)x_t = \epsilon_t$, and invert it to
$$x_t = \frac{1}{1 - \phi L}\epsilon_t;$$
its AGF is
$$g_x(z) = \frac{\sigma^2}{(1 - \phi z)(1 - \phi z^{-1})},$$
which equals
$$\sigma^2\left(\sum_{k=0}^{\infty}\theta_k z^k\right)\left(\sum_{k=0}^{\infty}\theta_k z^{-k}\right) = \sigma^2\theta(z)\theta(z^{-1}),$$
where $\theta_k = \phi^k$. In general, the AGF for an ARMA($p, q$) process is
$$g_x(z) = \sigma^2\frac{(1 + \theta_1 z + \ldots + \theta_q z^q)(1 + \theta_1 z^{-1} + \ldots + \theta_q z^{-q})}{(1 - \phi_1 z - \ldots - \phi_p z^p)(1 - \phi_1 z^{-1} - \ldots - \phi_p z^{-p})} = \sigma^2\frac{\theta(z)\theta(z^{-1})}{\phi(z)\phi(z^{-1})}.$$

    6 Simulated ARMA process

In this section, we plot a few simulated ARMA processes. In the simulations, the errors are Gaussian white noise, i.i.d. $N(0, 1)$. As a comparison, we first plot a Gaussian white noise (or AR(1) with $\phi = 0$) in Figure 3. Then, we plot AR(1) processes with $\phi = 0.4$ and $\phi = 0.9$ in Figure 4 and Figure 5. As you can see, the white noise process is very choppy and patternless. When $\phi = 0.4$, the series becomes a bit smoother, and when $\phi = 0.9$, the departures from the mean (zero) are very prolonged. Figure 6 plots an AR(2) process whose coefficients are set to the numbers in our example in this lecture. Finally, Figure 7 plots an MA(3) process. Comparing this MA(3) process with the white noise, we can see an increase in volatility (the variance of the white noise is 1 and the variance of the MA(3) process is 1.77).
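A sketch of how such simulations can be produced (my own illustration; numpy, scipy and matplotlib are assumed, and the figures in the notes were not necessarily generated this way): each series is a Gaussian white noise of length 200 passed through the corresponding ARMA filter.

```python
import numpy as np
from scipy.signal import lfilter
import matplotlib.pyplot as plt

rng = np.random.default_rng(4)
e = rng.standard_normal(200)
series = {
    "white noise": e,
    "AR(1), phi=0.4": lfilter([1.0], [1.0, -0.4], e),
    "AR(1), phi=0.9": lfilter([1.0], [1.0, -0.9], e),
    "AR(2), phi=(0.6, 0.2)": lfilter([1.0], [1.0, -0.6, -0.2], e),
    "MA(3), theta=(0.6, -0.5, 0.4)": lfilter([1.0, 0.6, -0.5, 0.4], [1.0], e),
}
fig, axes = plt.subplots(len(series), 1, figsize=(7, 10), sharex=True)
for ax, (name, x) in zip(axes, series.items()):
    ax.plot(x)
    ax.set_title(name, fontsize=9)
plt.tight_layout()
plt.show()
```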



    Figure 3: A Gaussian white noise time series


    Figure 4: A simulated AR(1) process, with  φ = 0.4



Figure 5: A simulated AR(1) process, with φ = 0.9


    Figure 6: A simulated AR(2) process, with  φ1 = 0.6,  φ2 = 0.2



    Figure 7: A simulated MA(3) process, with  θ1 = 0.6,  θ2 = −0.5, and  θ3  = 0.4

7 Forecasting with ARMA Models

    7.1 Principles of forecasting

If we are interested in forecasting a random variable $y_{t+h}$ based on the observations of $x$ up to time $t$ (denoted by $X$), we can have different candidate forecasts, denoted by $g(X)$. If our criterion for picking the best forecast is to minimize the mean squared error (MSE), then the best forecast is the conditional expectation, $g(X) = E(y_{t+h}\mid X)$. The proof can be found on page 73 in Hamilton. In the following discussion, we assume that the data generating process is known (so the parameters are known), so we can compute the conditional moments.

    7.2 AR models

Let's start with an AR(1) process,
$$x_t = \phi x_{t-1} + \epsilon_t,$$
where we continue to assume that $\epsilon_t$ is white noise with mean zero and variance $\sigma^2$. Then we can compute
$$E_t(x_{t+1}) = E_t(\phi x_t + \epsilon_{t+1}) = \phi x_t$$
$$E_t(x_{t+2}) = E_t(\phi^2 x_t + \phi\epsilon_{t+1} + \epsilon_{t+2}) = \phi^2 x_t$$
$$\vdots$$
$$E_t(x_{t+k}) = E_t(\phi^k x_t + \phi^{k-1}\epsilon_{t+1} + \ldots + \epsilon_{t+k}) = \phi^k x_t$$
and the variances
$$\mathrm{Var}_t(x_{t+1}) = \mathrm{Var}_t(\phi x_t + \epsilon_{t+1}) = \sigma^2$$
$$\mathrm{Var}_t(x_{t+2}) = \mathrm{Var}_t(\phi^2 x_t + \phi\epsilon_{t+1} + \epsilon_{t+2}) = (1 + \phi^2)\sigma^2$$
$$\vdots$$
$$\mathrm{Var}_t(x_{t+k}) = \mathrm{Var}_t(\phi^k x_t + \phi^{k-1}\epsilon_{t+1} + \ldots + \epsilon_{t+k}) = \sum_{j=0}^{k-1}\phi^{2j}\sigma^2$$


Note that as $k \to \infty$, $E_t(x_{t+k}) \to 0$, which is the unconditional expectation of $x_t$, and $\mathrm{Var}_t(x_{t+k}) \to \sigma^2/(1 - \phi^2)$, which is the unconditional variance of $x_t$. Similarly, for an AR($p$) process, we can forecast recursively.
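A compact sketch of the AR(1) forecast formulas above (my own addition, assuming numpy; the values of $x_t$, $\phi$ and $\sigma^2$ are arbitrary): as $k$ grows, the point forecast decays to the unconditional mean 0 and the forecast variance rises to $\sigma^2/(1 - \phi^2)$.

```python
import numpy as np

def ar1_forecast(x_t, phi, sigma2, k):
    """k-step-ahead conditional mean and variance for an AR(1)."""
    mean = phi ** k * x_t
    var = sigma2 * float(np.sum(phi ** (2 * np.arange(k))))
    return mean, var

for k in (1, 2, 10, 100):
    m, v = ar1_forecast(x_t=1.5, phi=0.8, sigma2=1.0, k=k)
    print(k, round(m, 4), round(v, 4))  # mean -> 0, variance -> 1/(1 - 0.64) = 2.78
```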

    7.3 MA Models

For an MA(1) process,
$$x_t = \epsilon_t + \theta\epsilon_{t-1},$$
if we know $\epsilon_t$, then
$$E_t(x_{t+1}) = E_t(\epsilon_{t+1} + \theta\epsilon_t) = \theta\epsilon_t$$
$$E_t(x_{t+2}) = E_t(\epsilon_{t+2} + \theta\epsilon_{t+1}) = 0$$
$$\vdots$$
$$E_t(x_{t+k}) = E_t(\epsilon_{t+k} + \theta\epsilon_{t+k-1}) = 0$$
and
$$\mathrm{Var}_t(x_{t+1}) = \mathrm{Var}_t(\epsilon_{t+1} + \theta\epsilon_t) = \sigma^2$$
$$\mathrm{Var}_t(x_{t+2}) = \mathrm{Var}_t(\epsilon_{t+2} + \theta\epsilon_{t+1}) = (1 + \theta^2)\sigma^2$$
$$\vdots$$
$$\mathrm{Var}_t(x_{t+k}) = \mathrm{Var}_t(\epsilon_{t+k} + \theta\epsilon_{t+k-1}) = (1 + \theta^2)\sigma^2$$

It is easy to see that for an MA(1) process, the conditional expectation two steps ahead and beyond is the same as the unconditional expectation, and so is the variance. Next, for an MA($q$) model,
$$x_t = \epsilon_t + \theta_1\epsilon_{t-1} + \theta_2\epsilon_{t-2} + \ldots + \theta_q\epsilon_{t-q} = \sum_{j=0}^{q}\theta_j\epsilon_{t-j},$$
if we know $\epsilon_t, \epsilon_{t-1}, \ldots, \epsilon_{t-q}$, then

$$E_t(x_{t+1}) = E_t\Big(\sum_{j=0}^{q}\theta_j\epsilon_{t+1-j}\Big) = \sum_{j=1}^{q}\theta_j\epsilon_{t+1-j}$$
$$E_t(x_{t+2}) = E_t\Big(\sum_{j=0}^{q}\theta_j\epsilon_{t+2-j}\Big) = \sum_{j=2}^{q}\theta_j\epsilon_{t+2-j}$$
$$\vdots$$
$$E_t(x_{t+k}) = E_t\Big(\sum_{j=0}^{q}\theta_j\epsilon_{t+k-j}\Big) = \sum_{j=k}^{q}\theta_j\epsilon_{t+k-j} \quad \text{for } k \le q$$
$$E_t(x_{t+k}) = E_t\Big(\sum_{j=0}^{q}\theta_j\epsilon_{t+k-j}\Big) = 0 \quad \text{for } k > q$$


and
$$\mathrm{Var}_t(x_{t+1}) = \mathrm{Var}_t\Big(\sum_{j=0}^{q}\theta_j\epsilon_{t+1-j}\Big) = \sigma^2$$
$$\mathrm{Var}_t(x_{t+2}) = \mathrm{Var}_t\Big(\sum_{j=0}^{q}\theta_j\epsilon_{t+2-j}\Big) = (1 + \theta_1^2)\sigma^2$$
$$\vdots$$
$$\mathrm{Var}_t(x_{t+k}) = \mathrm{Var}_t\Big(\sum_{j=0}^{q}\theta_j\epsilon_{t+k-j}\Big) = \sum_{j=0}^{k-1}\theta_j^2\sigma^2 \quad \text{for all } k > 0 \ (\text{with } \theta_j = 0 \text{ for } j > q).$$
We can see that for an MA($q$) process, the conditional expectation and variance of the forecast for $q+1$ steps ahead and beyond are the same as the unconditional expectation and variance.

    8 Wold Decomposition

So far we have focused on ARMA models, which are linear time series models. Is there any relationship between a general covariance stationary process (possibly nonlinear) and linear representations? The answer is given by the Wold decomposition theorem:

Proposition 2 (Wold Decomposition). Any zero-mean covariance stationary process $x_t$ can be represented in the form
$$x_t = \sum_{j=0}^{\infty}\psi_j\epsilon_{t-j} + V_t$$
where

(i) $\psi_0 = 1$ and $\sum_{j=0}^{\infty}\psi_j^2 < \infty$;

(ii) $\epsilon_t \sim WN(0, \sigma^2)$;

(iii) $E(\epsilon_t V_s) = 0$ for all $s, t > 0$;

(iv) $\epsilon_t$ is the error in forecasting $x_t$ on the basis of a linear function of lagged $x$:
$$\epsilon_t = x_t - E(x_t \mid x_{t-1}, x_{t-2}, \ldots);$$

(v) $V_t$ is a deterministic process and it can be predicted from a linear function of lagged $x$.

Remarks: The Wold decomposition says that any covariance stationary process has a linear representation: a linear deterministic component ($V_t$) and a linear indeterministic component ($\sum_j\psi_j\epsilon_{t-j}$). If $V_t = 0$, then the process is said to be purely non-deterministic, and the process can be represented as an MA($\infty$) process. Basically, $\epsilon_t$ is the error from the projection of $x_t$ on lagged $x$; therefore it is uniquely determined, and it is orthogonal to lagged $x$ and lagged $\epsilon$. Since this error $\epsilon$ is the residual from the projection, it need not be the true error in the DGP of $x_t$. Also note that the error term ($\epsilon$) is a white noise process, and does not need to be i.i.d.

Readings: Hamilton, Ch. 1-4; Brockwell and Davis, Ch. 3; Hayashi, Ch. 6.1, 6.2.


    Lecture 3: Spectral Analysis∗

Any covariance stationary process has both a time domain representation and a spectrum domain representation. So far, our analysis has been in the time domain, as we represent a time series $\{x_t\}$ in terms of past values of innovations and investigate the dependence of $x$ at distinct times. In some cases, a spectrum-domain representation is more convenient for describing a process. To transform a time-domain representation into a spectrum-domain representation, we use the Fourier transform.

    1 Fourier Transforms

Let $\omega$ denote the frequency ($-\pi < \omega < \pi$), and let $T$ denote the period: the minimum time that it takes the wave to go through a whole cycle, so that $T = 2\pi/\omega$. Given any integer $z$, we have $x(t) = x(t + zT)$. Finally, we will let $\phi$ denote the phase: the amount by which a wave is shifted.

Given a time series $\{x_t\}$, its Fourier transform is
$$x(\omega) = \frac{1}{2\pi}\sum_{t=-\infty}^{\infty}e^{-it\omega}x(t) \qquad (1)$$
and the inverse Fourier transform is
$$x(t) = \int_{-\pi}^{\pi}e^{it\omega}x(\omega)\,d\omega. \qquad (2)$$

    2 Spectrum

Recall that the autocovariance function for a zero-mean stationary process $\{x_t\}$ is defined as
$$\gamma_x(h) = E(x_t x_{t-h}),$$
and it serves to characterize the time series $\{x_t\}$. The spectrum of $\{x_t\}$ is defined to be the Fourier transform of $\gamma_x(h)$,
$$S_x(\omega) = \frac{1}{2\pi}\sum_{h=-\infty}^{\infty}e^{-ih\omega}\gamma_x(h). \qquad (3)$$
Recall that the autocovariance generating function is $g_x(z) = \sum_{h=-\infty}^{\infty}\gamma_x(h)z^h$; if we let $z = e^{-i\omega}$, then the spectrum is just the autocovariance generating function divided by $2\pi$. In (3), if we take $\omega = 0$, we see that
$$\sum_{h=-\infty}^{\infty}\gamma_x(h) = 2\pi S_x(0),$$

    ∗Copyright 2002-2006 by Ling Hu.


which tells us that the sum of the autocovariances equals the spectrum at zero multiplied by $2\pi$. Using the identity
$$e^{i\phi} = \cos\phi + i\sin\phi,$$
we can also write (3) as
$$S_x(\omega) = \frac{1}{2\pi}\left[\gamma_x(0) + 2\sum_{h=1}^{\infty}\gamma_x(h)\cos(h\omega)\right]. \qquad (4)$$

Note that since $\cos(\omega) = \cos(-\omega)$ and $\gamma_x(h) = \gamma_x(-h)$, the spectrum is symmetric about zero. Also, the cosine function is periodic with period $2\pi$; therefore, for spectral analysis we only need to find the spectrum for $\omega \in [0, \pi]$. Now, if we know $\gamma_x(h)$, we can compute the spectrum using (4), and if we know the spectrum $S_x(\omega)$, we can compute $\gamma_x(h)$ using the inverse Fourier transform
$$\gamma_x(h) = \int_{-\pi}^{\pi}e^{i\omega h}S_x(\omega)\,d\omega. \qquad (5)$$
Let $h = 0$; then (5) gives the variance of $\{x_t\}$,
$$\gamma_x(0) = \int_{-\pi}^{\pi}S_x(\omega)\,d\omega.$$
So the variance of $\{x_t\}$ is just the integral of the spectrum over all frequencies $-\pi < \omega < \pi$. Therefore, we can see that the spectrum $S_x(\omega)$ decomposes the variance into components contributed by each frequency. In other words, we can use the spectrum to find the importance of cycles of different frequencies.

If we normalize the spectrum $S_x(\omega)$ by dividing by $\gamma_x(0)$, we get the Fourier transform of the autocorrelation function $\rho_x(h)$,
$$f_x(\omega) = \frac{1}{2\pi}\sum_{h=-\infty}^{\infty}e^{-ih\omega}\rho_x(h). \qquad (6)$$
The autocorrelation function can be recovered from $f_x(\omega)$ using the inverse transform
$$\rho_x(h) = \int_{-\pi}^{\pi}e^{i\omega h}f_x(\omega)\,d\omega. \qquad (7)$$
Again, letting $h = 0$, (7) gives
$$1 = \int_{-\pi}^{\pi}f_x(\omega)\,d\omega.$$
Note that $f_x(\omega)$ is positive and integrates to one, just like a probability density, so we call it the spectral density.

Example 1 (Spectral density of white noise): Let $\epsilon \sim WN(0, \sigma^2)$. We have $\gamma_\epsilon(0) = \sigma^2$ and $\gamma_\epsilon(h) = 0$ for $h \neq 0$. Using (3) and (6), we can compute
$$S_\epsilon(\omega) = \frac{1}{2\pi}\gamma_\epsilon(0) = \frac{\sigma^2}{2\pi}.$$
Dividing by $\gamma_\epsilon(0)$, we have
$$f_\epsilon(\omega) = \frac{1}{2\pi}.$$
So the spectral density is uniform over $[-\pi, \pi]$, i.e. every frequency contributes equally to the variance.


    3 Spectrum of Filtered Process

Considering that the spectrum of a white noise process is so simple, we may want to know if we can make use of it for a more complicated process, say,
$$x_t = \sum_{k=-\infty}^{\infty}\theta_k\epsilon_{t-k} = \theta(L)\epsilon_t.$$
We call this process a two-sided moving average process. Then what is the relationship between $S_x(\omega)$ and $S_\epsilon(\omega)$? The general solution is given in the following statement.

Proposition 1: If $\{x_t\}$ is a zero mean stationary process with spectrum $S_x(\omega)$, and $\{y_t\}$ is the process
$$y_t = \sum_{k=-\infty}^{\infty}\theta_k x_{t-k} = \theta(L)x_t,$$
where $\theta$ is absolutely summable, then
$$S_y(\omega) = \Big|\sum_{k=-\infty}^{\infty}\theta_k e^{-ik\omega}\Big|^2 S_x(\omega) = \big|\theta(e^{-i\omega})\big|^2 S_x(\omega).$$
Proof: We start from the autocovariance function of $y$,

$$\gamma_y(h) = E(y_t y_{t-h}) = E\Big(\sum_{j=-\infty}^{\infty}\theta_j x_{t-j}\sum_{k=-\infty}^{\infty}\theta_k x_{t-h-k}\Big) = \sum_{j,k=-\infty}^{\infty}\theta_j\theta_k E(x_{t-j}x_{t-h-k}) = \sum_{j,k=-\infty}^{\infty}\theta_j\theta_k\gamma_x(h + k - j).$$

Next, consider the spectrum of $y$,
$$S_y(\omega) = \frac{1}{2\pi}\sum_{h=-\infty}^{\infty}e^{-ih\omega}\gamma_y(h) = \frac{1}{2\pi}\sum_{h=-\infty}^{\infty}e^{-ih\omega}\sum_{j,k=-\infty}^{\infty}\theta_j\theta_k\gamma_x(h + k - j).$$
(Let $l = h + k - j$ and note that $S_x(\omega) = \frac{1}{2\pi}\sum_{l=-\infty}^{\infty}e^{-il\omega}\gamma_x(l)$, so we want to construct such a term and see what remains.)
$$S_y(\omega) = \Big(\sum_{j=-\infty}^{\infty}e^{-ij\omega}\theta_j\Big)\Big(\sum_{k=-\infty}^{\infty}e^{ik\omega}\theta_k\Big)\Big(\frac{1}{2\pi}\sum_{l=-\infty}^{\infty}e^{-il\omega}\gamma_x(l)\Big) = \theta(e^{-i\omega})\theta(e^{i\omega})S_x(\omega) = \theta(e^{-i\omega})\overline{\theta(e^{-i\omega})}S_x(\omega) = \big|\theta(e^{-i\omega})\big|^2 S_x(\omega).$$


Example 2: To apply this result, first consider the problem of computing the spectrum of an MA(1) process,
$$x_t = \epsilon_t + \theta\epsilon_{t-1} = (1 + \theta L)\epsilon_t.$$
In this problem, $\theta(e^{-i\omega}) = 1 + \theta e^{-i\omega}$, thus
$$\big|\theta(e^{-i\omega})\big|^2 = (1 + \theta e^{-i\omega})(1 + \theta e^{i\omega}) = 1 + \theta^2 + \theta(e^{-i\omega} + e^{i\omega}).$$
Therefore,
$$S_x(\omega) = \big|\theta(e^{-i\omega})\big|^2 S_\epsilon(\omega) = \frac{1}{2\pi}\big[1 + \theta^2 + \theta(e^{-i\omega} + e^{i\omega})\big]\sigma^2.$$

We can verify this result by using the spectrum to compute the autocovariance function, say $\gamma_x(1)$. Using (5),
$$\gamma_x(1) = \int_{-\pi}^{\pi}e^{i\omega}S_x(\omega)\,d\omega = \frac{\sigma^2}{2\pi}\int_{-\pi}^{\pi}e^{i\omega}\big[1 + \theta^2 + \theta(e^{-i\omega} + e^{i\omega})\big]\,d\omega = \frac{\sigma^2}{2\pi}\cdot 2\pi\theta = \theta\sigma^2,$$
which is the same as what we got from working in the time domain. In the computation we use the fact that $\int_{-\pi}^{\pi}e^{i\omega}\,d\omega = 0$, as the integral of sine or cosine functions all the way around a circle is zero.
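A numerical version of this check (my own sketch, assuming numpy; the values $\theta = 0.5$ and $\sigma^2 = 1$ are illustrative): integrate $e^{i\omega}S_x(\omega)$ over $[-\pi, \pi]$ on a fine grid and recover $\gamma_x(1) = \theta\sigma^2$.

```python
import numpy as np

theta, sigma2 = 0.5, 1.0
w = np.linspace(-np.pi, np.pi, 200_001)
S = (sigma2 / (2 * np.pi)) * (1 + theta ** 2 + 2 * theta * np.cos(w))
gamma1 = np.sum(np.exp(1j * w) * S) * (w[1] - w[0])   # Riemann-sum approximation of (5)
print(round(gamma1.real, 4), theta * sigma2)          # ~0.5; imaginary part ~ 0
```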

Figure 1 plots the spectrum of MA(1) processes with positive and negative coefficients. When $\theta > 0$, we see that the spectrum is high for low frequencies and low for high frequencies. When $\theta < 0$, we observe the opposite. This is because when $\theta$ is positive, we have positive one-lag correlation, which makes the series smooth, with only a small contribution from high frequency (say, day to day) components. When $\theta$ is negative, we have negative one-lag correlation, and therefore the series fluctuates rapidly about its mean value.

Above we have considered a moving average process; the next proposition gives the result for an ARMA model with white noise errors:

Proposition 2: Let $\{x_t\}$ be an ARMA($p, q$) process satisfying
$$\phi(L)x_t = \theta(L)\epsilon_t,$$



Figure 1: Plots of the spectrum of MA(1) processes (θ = 0.5 for the left figure and θ = −0.5 for the right figure)

where $\epsilon \sim WN(0, \sigma^2)$ and all roots of $\phi(L)$ lie outside the unit circle; then the spectrum of $x_t$ is
$$S_x(\omega) = \frac{|\theta(e^{-i\omega})|^2}{|\phi(e^{-i\omega})|^2}S_\epsilon(\omega) = \frac{1}{2\pi}\frac{|\theta(e^{-i\omega})|^2}{|\phi(e^{-i\omega})|^2}\sigma^2.$$

Example 3: Consider an AR(1) process,
$$x_t = \phi x_{t-1} + \epsilon_t.$$
Using the above proposition,
$$S_x(\omega) = \frac{\sigma^2}{2\pi}\big|1 - \phi e^{-i\omega}\big|^{-2} = \frac{\sigma^2}{2\pi}(1 + \phi^2 - 2\phi\cos\omega)^{-1}. \qquad (8)$$

Figure 2 plots the spectrum of AR(1) processes with positive and negative coefficients. We have similar observations here as for the MA processes. However, note that as $\phi \to 1$, $S_x(0) \to \infty$, which means that a random walk process has an infinite spectrum at frequency zero. This is similar to what happens when we work with summation and differencing. When we add up a white noise (say, $\phi = 1$, as in a random walk), the high frequencies are smoothed out (those spikes in the white noise disappear) and what is left is the long term stochastic trend. On the contrary, when we difference (say, take the first difference of a random walk, so that we are back to the white noise series), we get rid of the long term trend, and what is left is the high frequencies (lots of spikes with mean zero, say).
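A small sketch (my own illustration, assuming numpy; $\phi = 0.5$ and $\sigma^2 = 1$ are illustrative values) evaluates (8) on a grid and checks that the spectrum integrates over $[-\pi, \pi]$ to the variance $\sigma^2/(1 - \phi^2)$, illustrating the variance decomposition across frequencies.

```python
import numpy as np

phi, sigma2 = 0.5, 1.0
w = np.linspace(-np.pi, np.pi, 200_001)
S = (sigma2 / (2 * np.pi)) / (1 + phi ** 2 - 2 * phi * np.cos(w))
total = np.sum(S) * (w[1] - w[0])                        # integral of the spectrum
print(round(float(total), 4), sigma2 / (1 - phi ** 2))   # both ~ 1.3333
```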

Finally, we introduce a spectral representation theorem without proof. For a zero-mean stationary process with absolutely summable autocovariances, one can define random variables $\alpha(\omega)$ and $\delta(\omega)$ such that the series can be represented in the form
$$x_t = \int_{0}^{\pi}\big[\alpha(\omega)\cos(\omega t) + \delta(\omega)\sin(\omega t)\big]\,d\omega,$$



Figure 2: Plots of the spectrum of AR(1) processes (φ = 0.5 for the left figure and φ = −0.5 for the right figure)

where $\alpha(\omega)$ and $\delta(\omega)$ have zero mean and are mutually and serially uncorrelated. The representation theorem tells us that a stationary process with absolutely summable autocovariances can be written as a weighted sum of periodic functions.

    4 Cross Spectrum and Spectrum of a Sum

The spectrum is an autocovariance generating function, and we can use it to compute the autocovariances of a stationary process. Besides computing the autocovariance of a single time series, spectrum functions can also capture the covariance across two time series. We call such spectrum functions the cross spectrum.

For a single time series $\{x_t\}$, the spectrum function is the Fourier transform of the autocovariance function $\gamma_x(h) = E(x_t x_{t-h})$. Similarly, for two time series $\{x_t\}$ and $\{y_t\}$, the cross spectrum is the Fourier transform of the covariance function of $x_t$ and $y_{t-h}$, i.e.
$$S_{xy}(\omega) = \sum_{h=-\infty}^{\infty}e^{-ih\omega}E(x_t y_{t-h}).$$

In general,
$$S_{xy}(\omega) \neq S_{yx}(\omega) = \sum_{h=-\infty}^{\infty}e^{-ih\omega}E(y_t x_{t-h}),$$
but they have the following relationship:
$$S_{xy}(\omega) = \overline{S_{yx}(\omega)} = S_{yx}(-\omega),$$
which is easy to verify:
$$S_{xy}(\omega) = \sum_{h=-\infty}^{\infty}e^{-ih\omega}E(x_t y_{t-h})$$


$$= \sum_{h=-\infty}^{\infty}e^{-ih\omega}E(y_t x_{t+h}) = \sum_{k=-\infty}^{\infty}e^{ik\omega}E(y_t x_{t-k}) \quad (\text{let } k = -h) = \sum_{k=-\infty}^{\infty}e^{-(-ik\omega)}E(y_t x_{t-k}) = S_{yx}(-\omega).$$

Note that if $x_t$ and $y_s$ are uncorrelated for all $t, s$, then $E(x_t y_{t-h}) = 0$ for all $h$, and therefore $S_{xy}(\omega) = S_{yx}(\omega) = 0$. Knowing the cross spectrum, we can next compute the spectrum of a sum. For a process $z_t = x_t + y_t$, the spectrum of $z_t$ can be computed as follows:

$$S_z(\omega) = \sum_{h=-\infty}^{\infty}e^{-ih\omega}E(z_t z_{t-h}) = \sum_{h=-\infty}^{\infty}e^{-ih\omega}E[(x_t + y_t)(x_{t-h} + y_{t-h})]$$
$$= \sum_{h=-\infty}^{\infty}e^{-ih\omega}\big[E(x_t x_{t-h}) + E(x_t y_{t-h}) + E(y_t x_{t-h}) + E(y_t y_{t-h})\big] = S_x(\omega) + S_{xy}(\omega) + S_{yx}(\omega) + S_y(\omega).$$
We proposed before that for a time series $z_t$, its spectrum decomposes its variation into components contributed by each frequency $\omega$. Here, we see another form of decomposition: we can decompose the variation in $z$ into different sources. In particular, if $x_t$ and $y_s$ are uncorrelated for all $t, s$, i.e. $S_{xy}(\omega) = S_{yx}(\omega) = 0$, then we have
$$S_z(\omega) = S_x(\omega) + S_y(\omega).$$

    5 Estimation

In equation (3), we defined the spectrum as
$$S_x(\omega) = \frac{1}{2\pi}\sum_{h=-\infty}^{\infty}e^{-ih\omega}\gamma_x(h).$$
Given a stationary process, the sample autocovariance can be estimated as
$$\hat{\gamma}_x(h) = T^{-1}\sum_{t=h+1}^{T}\big[(x_t - \bar{x})(x_{t-h} - \bar{x})\big].$$
To estimate the spectrum, we may compute the sample analog of (3), which is known as the sample periodogram,
$$I_x(\omega) = \frac{1}{2\pi}\sum_{h=-T+1}^{T-1}e^{-ih\omega}\hat{\gamma}_x(h).$$


Or we can equivalently write it as
$$I_x(\omega) = \frac{1}{2\pi}\left(\hat{\gamma}(0) + 2\sum_{h=1}^{T-1}\hat{\gamma}(h)\cos(\omega h)\right). \qquad (9)$$
We have the following asymptotic distribution of the sample periodogram:
$$\frac{2I_x(\omega)}{S_x(\omega)} \sim \chi^2(2).$$

Since $E(\chi^2(2)) = 2$, the sample periodogram provides an asymptotically unbiased estimate of the spectrum, $\lim_{T\to\infty}E I_x(\omega) = S_x(\omega)$. However, the variance of $I_x(\omega)$ does not go to zero. In fact,
$$\mathrm{Var}(I_x(\omega)) \to \begin{cases}2S_x^2(0) & \text{for } \omega = 0, \\ S_x^2(\omega) & \text{for } \omega \neq 0.\end{cases}$$

Therefore, even when the sample size is very large, the sample periodogram still cannot provide an accurate estimate of the true spectrum. To estimate the spectrum, there are two better approaches. The first is a parametric approach. We can estimate the ARMA model using least squares or MLE to obtain consistent estimates of the parameters, and then plug in these estimates to obtain a consistent estimator for the spectrum. For instance, for an MA(1) process,
$$x_t = \epsilon_t + \theta\epsilon_{t-1}, \quad \epsilon_t \sim WN(0, 1),$$
if we can obtain a consistent estimate of $\theta$, denoted by $\hat{\theta}$, then for any $\omega$,
$$\hat{S}_x(\omega) = \frac{1}{2\pi}\big[1 + \hat{\theta}^2 + \hat{\theta}(e^{-i\omega} + e^{i\omega})\big].$$
A potential problem with parametric estimation is that we have to specify a parametric model for the process, say ARMA($p, q$), so we may have some errors due to misspecification. However, even if the model is incorrectly specified, if the autocovariances of the true process are close to those of our specification, then this procedure can still provide a useful estimate of the population spectrum.

    An alternative approach is to estimate the spectrum nonparametrically. Doing this could saveus from specifying a model for the process. We still make use of the sample periodogram, however,to estimate the spectrum  S x(ω), we use a weighted average of the sample periodogram over severalneighboring ωs. How much weight to put on each ω  in the neighborhood is determined by a functionwhich is known as the kernel, or kernel function. This means that the spectrum is estimated by

    Ŝ x(ω j) =m

    Xl=−m

    k(l, m) · I x(ω j+l).   (10)

The kernel function $k(l, m)$ must satisfy
\[
\sum_{l=-m}^{m} k(l, m) = 1.
\]
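A minimal sketch of (10) in Python (mine, for illustration), using equal weights $k(l, m) = 1/(2m+1)$ over $2m+1$ neighboring Fourier frequencies, which is one simple kernel satisfying the normalization; the periodogram here is computed through its DFT form, which is algebraically the same as the autocovariance form above.

```python
import numpy as np

def smoothed_spectrum(x, m):
    """Average the periodogram over 2m+1 neighboring Fourier frequencies with
    equal weights k(l, m) = 1/(2m+1), one simple kernel that sums to one."""
    T = len(x)
    xc = x - x.mean()
    # Periodogram at the Fourier frequencies 2*pi*j/T, j = 1, ..., T//2
    I = (np.abs(np.fft.fft(xc)) ** 2 / (2.0 * np.pi * T))[1:T // 2 + 1]
    freqs = 2.0 * np.pi * np.arange(1, T // 2 + 1) / T
    S_hat = np.full(len(I), np.nan)
    for j in range(m, len(I) - m):
        S_hat[j] = I[j - m:j + m + 1].mean()
    return freqs, S_hat

# Usage: freqs, S_hat = smoothed_spectrum(x, m=4)
```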


Here $m$ is the bandwidth or window, indicating how many different frequencies are viewed as useful in estimating $S_x(\omega_j)$. Averaging $I_x(\omega)$ over different frequencies can equivalently be represented as multiplying the $h$th autocovariance $\hat{\gamma}(h)$ in (9) by a weight function $w(h, q)$. A derivation can be found on page 166 of Hamilton.

These weight functions $w(h, q)$ satisfy $w(0, q) = 1$, $|w(h, q)| \leq 1$, and $w(h, q) = 0$ for $h > q$. The $q$ in the weight function plays a similar role to the $m$ in $k(l, m)$: it specifies the length of the window. Some commonly used weight functions are listed below; a short code transcription follows the list.

Truncated kernel: let $x = h/q$,
\[
w(x) = \begin{cases} 1 & \text{for } |x| \leq 1, \\ 0 & \text{otherwise.} \end{cases}
\]

Bartlett kernel: let $x = h/q$,
\[
w(x) = \begin{cases} 1 - |x| & \text{for } |x| \leq 1, \\ 0 & \text{otherwise.} \end{cases}
\]

Modified Bartlett kernel:
\[
w(h, q) = \begin{cases} 1 - \dfrac{h}{q+1} & \text{for } h = 1, 2, \ldots, q, \\ 0 & \text{otherwise.} \end{cases}
\]

Parzen kernel: let $x = h/q$,
\[
w(x) = \begin{cases} 1 - 6|x|^2 + 6|x|^3 & \text{for } |x| < 1/2, \\ 2(1 - |x|)^3 & \text{for } 1/2 \leq |x| \leq 1, \\ 0 & \text{otherwise.} \end{cases}
\]
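For reference, here is a direct transcription of these weight functions into Python (a sketch; the function names are mine). The value at $h = 0$ is set to 1, in line with the requirement $w(0, q) = 1$ stated above.

```python
def w_truncated(h, q):
    return 1.0 if abs(h) <= q else 0.0

def w_bartlett(h, q):
    x = abs(h) / q
    return 1.0 - x if x <= 1.0 else 0.0

def w_modified_bartlett(h, q):
    if h == 0:
        return 1.0                      # w(0, q) = 1, as required above
    return 1.0 - h / (q + 1.0) if 1 <= h <= q else 0.0

def w_parzen(h, q):
    x = abs(h) / q
    if x < 0.5:
        return 1.0 - 6.0 * x ** 2 + 6.0 * x ** 3
    if x <= 1.0:
        return 2.0 * (1.0 - x) ** 3
    return 0.0
```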

A typical problem in nonparametric estimation is the trade-off between variance and bias: a large bandwidth usually reduces the variance but induces bias. To reduce the variance without adding much bias, we need to choose a proper bandwidth. In practice, we may plot estimates of the spectrum using several different bandwidths and use subjective judgment to choose among them. Roughly, if the plot is too flat, it is hard to extract information such as which frequencies are more important than others; on the other hand, if the plot is too choppy (too many peaks and valleys mixed together), it is hard to draw convincing conclusions.

    Example 4  (Spectrum estimation of an AR(1) process). The data are generated from

\[
x_t = \phi x_{t-1} + \epsilon_t, \qquad \phi = 0.5, \quad \epsilon_t \sim i.i.d.\ N(0, 1).
\]

We simulated a sequence of length $n = 200$ from this DGP; the OLS estimate of $\phi$ is 0.59 (the OLS estimator is consistent in this problem). The upper-left panel of Figure 3 plots the population spectrum, i.e., (8) with $\phi = 0.5$. The upper-right panel plots the estimated spectrum, i.e., (8) with the OLS estimate 0.59. The lower-left panel plots the sample periodogram $I_x(\omega)$, which is very volatile. Finally, the lower-right panel plots the smoothed estimate of the spectrum using the Bartlett kernel, i.e.,

\[
\hat{S}_x(\omega) = (2\pi)^{-1}\left(\hat{\gamma}_x(0) + 2\sum_{j=1}^{q}\left(1 - \frac{j}{q+1}\right)\hat{\gamma}_x(j)\cos(\omega j)\right),
\]
where $q$ is set to 5.
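The following sketch reproduces the ingredients of this example in Python (my own code, not the author's). It assumes that equation (8) is the standard AR(1) spectrum $\sigma^2 / [2\pi(1 - 2\phi\cos\omega + \phi^2)]$.

```python
import numpy as np

rng = np.random.default_rng(1)

# Simulate x_t = 0.5 x_{t-1} + e_t, e_t ~ N(0, 1), n = 200
n, phi = 200, 0.5
x = np.zeros(n)
for t in range(1, n):
    x[t] = phi * x[t - 1] + rng.standard_normal()

phi_hat = np.sum(x[1:] * x[:-1]) / np.sum(x[:-1] ** 2)   # OLS estimate of phi

def acov(x, h):
    xc = x - x.mean()
    return np.sum(xc[h:] * xc[:len(x) - h]) / len(x)

def ar1_spectrum(phi, omega, sigma2=1.0):
    # Standard AR(1) spectrum, assumed here to be what equation (8) refers to.
    return sigma2 / (2.0 * np.pi * (1.0 - 2.0 * phi * np.cos(omega) + phi ** 2))

def bartlett_spectrum(x, omega, q=5):
    # Smoothed estimate with the modified Bartlett weights, as in the lower-right panel.
    s = sum((1.0 - j / (q + 1.0)) * acov(x, j) * np.cos(omega * j) for j in range(1, q + 1))
    return (acov(x, 0) + 2.0 * s) / (2.0 * np.pi)

freqs = np.linspace(0.01, np.pi, 100)
S_pop = ar1_spectrum(phi, freqs)                                 # population spectrum
S_par = ar1_spectrum(phi_hat, freqs)                             # parametric estimate
S_bart = np.array([bartlett_spectrum(x, w) for w in freqs])      # nonparametric estimate
```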


[Figure 3: Estimates for Spectrum. Four panels plot spectrum estimates against frequency: the population spectrum (upper left), the parametrically estimated spectrum (upper right), the sample periodogram (lower left), and the nonparametrically estimated spectrum (lower right).]


For empirical studies, Section 6.4 in Hamilton, on the spectrum of an industrial production series, provides a very good example. Without any detrending, the spectrum is concentrated in the low-frequency region, which means that the variance of the series comes largely from the long-term trend (here, economic growth). After detrending, we obtain the growth rate, which is stationary, and the variance now comes mostly from the business cycle and seasonal effects. After filtering out the seasonal effects, most of the variance is due to the business cycle.

Readings: Hamilton, Ch. 6; Brockwell and Davis, Ch. 4, Ch. 10


    Lecture 4: Asymptotic Distribution Theory∗

In time series analysis, we usually use asymptotic theory to derive the joint distributions of the estimators of the parameters in a model. An asymptotic distribution is the distribution we obtain by letting the time horizon (sample size) go to infinity. We can simplify the analysis by doing so (since we know that some terms converge to zero in the limit), but we may also incur a finite-sample error. Hopefully, when the sample size is large enough, this error becomes small and we have a satisfactory approximation to the true or exact distribution. The reason we use asymptotic distributions instead of exact distributions is that the exact finite-sample distribution is in many cases too complicated to derive, even for Gaussian processes. Therefore, we use asymptotic distributions as alternatives.

    1 Review

I think that this lecture may contain more propositions and definitions than any other lecture in this course. In summary, we are interested in two types of asymptotic results. The first type is about convergence to a constant: for example, we are interested in whether sample moments converge to population moments, and the law of large numbers (LLN) is the famous result on this. The second type is about convergence to a random variable, say $Z$, where in many cases $Z$ follows a standard normal distribution. The central limit theorem (CLT) provides a tool for establishing asymptotic normality.

    The confusing part in this lecture might be that we have several versions of LLN and CLT.The results may look similar, but the assumptions are diff erent. We will start from the strongestassumption,   i.i.d., then we will show how to obtain similar results when  i.i.d.   is violated. Beforewe come to the major part on LLN and CLT, we review some basic concepts first.

    1.1 Convergence in Probability and Convergence Almost Surely

Definition 1 (Convergence in probability). $X_n$ is said to converge in probability to $X$ if for every $\epsilon > 0$,
\[
P(|X_n - X| > \epsilon) \to 0 \quad \text{as } n \to \infty.
\]
If $X = 0$, we say that $X_n$ converges in probability to zero, written $X_n = o_p(1)$ or $X_n \to_p 0$.

Definition 2 (Boundedness in probability). $X_n$ is said to be bounded in probability, written $X_n = O_p(1)$, if for every $\epsilon > 0$ there exists $\delta(\epsilon) \in (0, \infty)$ such that
\[
P(|X_n| > \delta(\epsilon)) < \epsilon \quad \forall\, n.
\]

    ∗Copyright 2002-2006 by Ling Hu.


We can similarly define orders in probability: $X_n = o_p(n^{-r})$ if and only if $n^r X_n = o_p(1)$, and $X_n = O_p(n^{-r})$ if and only if $n^r X_n = O_p(1)$.

Proposition 1. If $X_n$ and $Y_n$ are random variables defined on the same probability space and $a_n > 0$, $b_n > 0$, then:

(i) If $X_n = o_p(a_n)$ and $Y_n = o_p(b_n)$, we have
\begin{align*}
X_n Y_n &= o_p(a_n b_n), \\
X_n + Y_n &= o_p(\max(a_n, b_n)), \\
|X_n|^r &= o_p(a_n^r) \quad \text{for } r > 0.
\end{align*}

(ii) If $X_n = o_p(a_n)$ and $Y_n = O_p(b_n)$, we have $X_n Y_n = o_p(a_n b_n)$.

Proof of (i): If $|X_n Y_n|/(a_n b_n) > \epsilon$, then either $|Y_n|/b_n \leq 1$ and $|X_n|/a_n > \epsilon$, or $|Y_n|/b_n > 1$ and $|X_n Y_n|/(a_n b_n) > \epsilon$. Hence
\[
P(|X_n Y_n|/(a_n b_n) > \epsilon) \leq P(|X_n|/a_n > \epsilon) + P(|Y_n|/b_n > 1) \to 0.
\]
If $|X_n + Y_n|/\max(a_n, b_n) > \epsilon$, then either $|X_n|/a_n > \epsilon/2$ or $|Y_n|/b_n > \epsilon/2$, so
\[
P(|X_n + Y_n|/\max(a_n, b_n) > \epsilon) \leq P(|X_n|/a_n > \epsilon/2) + P(|Y_n|/b_n > \epsilon/2) \to 0.
\]
Finally,
\[
P(|X_n|^r / a_n^r > \epsilon) = P(|X_n|/a_n > \epsilon^{1/r}) \to 0.
\]

Proof of (ii): If $|X_n Y_n|/(a_n b_n) > \epsilon$, then either $|Y_n|/b_n > \delta(\epsilon)$ and $|X_n Y_n|/(a_n b_n) > \epsilon$, or $|Y_n|/b_n \leq \delta(\epsilon)$ and $|X_n|/a_n > \epsilon/\delta(\epsilon)$. Then
\[
P(|X_n Y_n|/(a_n b_n) > \epsilon) \leq P(|X_n|/a_n > \epsilon/\delta(\epsilon)) + P(|Y_n|/b_n > \delta(\epsilon)) \to 0.
\]

This proposition is very useful. For example, if $X_n = o_p(n^{-1})$ and $Y_n = o_p(n^{-2})$, then $X_n + Y_n = o_p(n^{-1})$, which tells us that the slowest convergence rate "dominates". Later on we will encounter sums of several terms; to study the asymptotics of such a sum, we can start by judging the convergence rate of each term and picking out the terms that converge most slowly. In many cases the terms that converge faster can be omitted, such as $Y_n$ in this example.

The results also hold if we replace $o_p$ in (i) with $O_p$. The notation above can be naturally extended from sequences of scalars to sequences of vectors or matrices. In particular, $\mathbf{X}_n = o_p(n^{-r})$ if and only if all elements of $\mathbf{X}_n$ converge to zero at order $n^{-r}$. Using the Euclidean distance $|\mathbf{X}_n - \mathbf{X}| = \left(\sum_{j=1}^{k} (X_{nj} - X_j)^2\right)^{1/2}$, where $k$ is the dimension of $\mathbf{X}_n$, we also have

Proposition 2. $\mathbf{X}_n - \mathbf{X} = o_p(1)$ if and only if $|\mathbf{X}_n - \mathbf{X}| = o_p(1)$.


Proposition 3 (Preservation of convergence under continuous transformations). If $\{\mathbf{X}_n\}$ is a sequence of $k$-dimensional random vectors such that $\mathbf{X}_n \to_p \mathbf{X}$ and if $g : \mathbb{R}^k \to \mathbb{R}^m$ is a continuous mapping, then $g(\mathbf{X}_n) \to_p g(\mathbf{X})$.

Proof: Let $M$ be a positive real number. Then for all $\epsilon > 0$ we have
\begin{align*}
P(|g(\mathbf{X}_n) - g(\mathbf{X})| > \epsilon) \;\leq\; & P(|g(\mathbf{X}_n) - g(\mathbf{X})| > \epsilon,\ |\mathbf{X}_n| \leq M,\ |\mathbf{X}| \leq M) \\
& + P(\{|\mathbf{X}_n| > M\} \cup \{|\mathbf{X}| > M\}).
\end{align*}
(The above inequality uses $P(A \cup B) \leq P(A) + P(B)$ with $A = \{|g(\mathbf{X}_n) - g(\mathbf{X})| > \epsilon,\ |\mathbf{X}_n| \leq M,\ |\mathbf{X}| \leq M\}$ and $B = \{|\mathbf{X}_n| > M\} \cup \{|\mathbf{X}| > M\}$.) Recall that if the function $g$ is uniformly continuous on $\{x : |x| \leq M\}$, then for every $\epsilon > 0$ there exists $\eta(\epsilon)$ such that $|\mathbf{X}_n - \mathbf{X}| < \eta(\epsilon)$ implies $|g(\mathbf{X}_n) - g(\mathbf{X})| < \epsilon$. Hence
\[
\{|g(\mathbf{X}_n) - g(\mathbf{X})| > \epsilon,\ |\mathbf{X}_n| \leq M,\ |\mathbf{X}| \leq M\} \subseteq \{|\mathbf{X}_n - \mathbf{X}| \geq \eta(\epsilon)\}.
\]
Therefore,
\begin{align*}
P(|g(\mathbf{X}_n) - g(\mathbf{X})| > \epsilon) \;\leq\; & P(|\mathbf{X}_n - \mathbf{X}| \geq \eta(\epsilon)) + P(|\mathbf{X}_n| > M) + P(|\mathbf{X}| > M) \\
\leq\; & P(|\mathbf{X}_n - \mathbf{X}| \geq \eta(\epsilon)) + P(|\mathbf{X}| > M) \\
& + P(|\mathbf{X}| > M/2) + P(|\mathbf{X}_n - \mathbf{X}| > M/2).
\end{align*}
Given any $\delta > 0$, we can choose $M$ so that the second and third terms are each less than $\delta/4$. Since $\mathbf{X}_n \to_p \mathbf{X}$, the first and fourth terms are each less than $\delta/4$ for $n$ sufficiently large. Therefore,
\[
P(|g(\mathbf{X}_n) - g(\mathbf{X})| > \epsilon) \leq \delta,
\]
so $g(\mathbf{X}_n) \to_p g(\mathbf{X})$.

Definition 3 (Convergence almost surely). A sequence $\{X_n\}$ is said to converge to $X$ almost surely, or with probability one, if for every $\epsilon > 0$
\[
P\left(\lim_{n\to\infty} |X_n - X| > \epsilon\right) = 0,
\]
that is, if $P(\lim_{n\to\infty} X_n = X) = 1$.

If $X_n$ converges to $X$ almost surely, we write $X_n \to_{a.s.} X$. Almost sure convergence is stronger than convergence in probability. In fact, we have

Proposition 4. If $X_n \to_{a.s.} X$, then $X_n \to_p X$.

    However, the converse is not true. Below is an example.


Example 1 (Convergence in probability but not almost surely). Let the sample space be the closed interval $S = [0, 1]$ with the uniform probability measure. Define the sequence $\{X_n\}$ as
\begin{align*}
X_1(s) &= s + 1_{[0,1]}(s), & X_2(s) &= s + 1_{[0,1/2]}(s), & X_3(s) &= s + 1_{[1/2,1]}(s), \\
X_4(s) &= s + 1_{[0,1/3]}(s), & X_5(s) &= s + 1_{[1/3,2/3]}(s), & X_6(s) &= s + 1_{[2/3,1]}(s),
\end{align*}
etc., where $1$ is the indicator function, i.e., it equals 1 if the statement is true and 0 otherwise. Let $X(s) = s$. Then $X_n \to_p X$, since $P(|X_n - X| \geq \epsilon)$ equals the length of the interval of $s$ values on which the indicator is one, and this length goes to zero as $n \to \infty$. However, $X_n$ does not converge to $X$ almost surely: there is no $s \in S$ for which $X_n(s) \to s = X(s)$. For every $s$, the value of $X_n(s)$ alternates between $s$ and $s + 1$ infinitely often.
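Here is a small numerical illustration of this "moving interval" construction (my own sketch, not from the notes): for a fixed $s$, the probability $P(|X_n - X| \geq \epsilon)$ equals the length $1/k$ of the current interval, which shrinks to zero, yet $X_n(s)$ keeps returning to $s + 1$.

```python
def block_and_interval(n):
    """Map index n >= 1 to (k, j): block k consists of the k intervals
    [j/k, (j+1)/k], j = 0, ..., k-1, matching X_1, X_2, X_3, ... above."""
    k = 1
    while n > k:
        n -= k
        k += 1
    return k, n - 1

def X_n(n, s):
    k, j = block_and_interval(n)
    return s + (1.0 if j / k <= s <= (j + 1) / k else 0.0)

s = 0.3                                                    # a fixed sample point
hits = [n for n in range(1, 5000) if X_n(n, s) == s + 1.0]
k_last, _ = block_and_interval(4999)
print(1.0 / k_last)   # P(|X_n - X| >= eps) = 1/k -> 0, so X_n converges in probability
print(hits[-3:])      # yet X_n(s) = s + 1 recurs in every block: no a.s. convergence
```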

    1.2 Convergence in  L p   Norm

When $E(|X_n|^p) < \infty$ with $p > 0$, $X_n$ is said to be $L_p$-bounded. Define the $L_p$ norm of $X$ as $\|X\|_p = (E|X|^p)^{1/p}$. Before we define $L_p$ convergence, we first review some useful inequalities.

Proposition 5 (Markov's inequality). If $E|X|^p < \infty$, $p \geq 0$, and $\epsilon > 0$, then
\[
P(|X| \geq \epsilon) \leq \epsilon^{-p} E|X|^p.
\]

Proof:
\begin{align*}
P(|X| \geq \epsilon) &= P(|X|^p \epsilon^{-p} \geq 1) \\
&= E\, 1_{[1,\infty)}(|X|^p \epsilon^{-p}) \\
&\leq E\left[|X|^p \epsilon^{-p}\, 1_{[1,\infty)}(|X|^p \epsilon^{-p})\right] \\
&\leq \epsilon^{-p} E|X|^p.
\end{align*}

In Markov's inequality, we can also replace $|X|$ with $|X - c|$, where $c$ is any real number. When $p = 2$, the inequality is also known as Chebyshev's inequality. If $X$ is $L_p$-bounded, then Markov's inequality tells us that the tail probabilities converge to zero at rate $\epsilon^{-p}$ as $\epsilon \to \infty$. Therefore, the order of $L_p$-boundedness measures the tendency of a distribution to generate outliers.

Proposition 6 (Hölder's inequality). For any $p \geq 1$,
\[
E|XY| \leq \|X\|_p \|Y\|_q,
\]
where $q = p/(p-1)$ if $p > 1$ and $q = \infty$ if $p = 1$.

Proposition 7 (Liapunov's inequality). If $p > q > 0$, then $\|X\|_p \geq \|X\|_q$.

Proof: Let $Z = |X|^q$, $Y = 1$, and $s = p/q$. Then by Hölder's inequality, $E|ZY| \leq \|Z\|_s \|Y\|_{s/(s-1)}$, or
\[
E(|X|^q) \leq \left[E(|X|^{qs})\right]^{1/s} = \left[E(|X|^p)\right]^{q/p}.
\]

Definition 4 ($L_p$ convergence). If $\|X_n\|_p < \infty$ for all $n$ with $p > 0$, and $\lim_{n\to\infty} \|X_n - X\|_p = 0$, then $X_n$ is said to converge in $L_p$ norm to $X$, written $X_n \to_{L_p} X$. When $p = 2$, we say it converges in mean square, written $X_n \to_{m.s.} X$.


For any $p > q > 0$, $L_p$ convergence implies $L_q$ convergence by Liapunov's inequality. Convergence in probability can be viewed as $L_0$ convergence; therefore, $L_p$ convergence implies convergence in probability:

Proposition 8 ($L_p$ convergence implies convergence in probability). If $X_n \to_{L_p} X$, then $X_n \to_p X$.

Proof: By Markov's inequality,
\[
P(|X_n - X| > \epsilon) \leq \epsilon^{-p} E|X_n - X|^p \to 0.
\]

    1.3 Convergence in Distribution

Definition 5 (Convergence in distribution). The sequence $\{X_n\}_{n=0}^{\infty}$ of random variables with distribution functions $\{F_{X_n}(x)\}$ is said to converge in distribution to $X$, written $X_n \to_d X$, if there exists a distribution function $F_X(x)$ such that
\[
\lim_{n\to\infty} F_{X_n}(x) = F_X(x)
\]
at every point $x$ where $F_X$ is continuous.

Again, we can naturally extend the definition and related results from a scalar random variable $X$ to a vector-valued random variable $\mathbf{X}$. To verify convergence in distribution of a $k \times 1$ vector: if the scalar $\lambda_1 X_{1n} + \lambda_2 X_{2n} + \ldots + \lambda_k X_{kn}$ converges in distribution to $\lambda_1 X_1 + \lambda_2 X_2 + \ldots + \lambda_k X_k$ for any real values of $(\lambda_1, \lambda_2, \ldots, \lambda_k)$, then the vector $(X_{1n}, X_{2n}, \ldots, X_{kn})$ converges in distribution to the vector $(X_1, X_2, \ldots, X_k)$.

    We also have the continuous mapping theorem for convergence in distribution.

Proposition 9. If $\{\mathbf{X}_n\}$ is a sequence of random $k$-vectors with $\mathbf{X}_n \to_d \mathbf{X}$ and $g : \mathbb{R}^k \to \mathbb{R}^m$ is a continuous function, then $g(\mathbf{X}_n) \to_d g(\mathbf{X})$.

In the special case where the limit is a constant scalar or vector, convergence in distribution implies convergence in probability.

Proposition 10. If $X_n \to_d c$ where $c$ is a constant, then $X_n \to_p c$.

Proof: If $X_n \to_d c$, then $F_{X_n}(x) \to 1_{[c,\infty)}(x)$ for all $x \neq c$. For any $\epsilon > 0$,
\begin{align*}
P(|X_n - c| \leq \epsilon) &= P(c - \epsilon \leq X_n \leq c + \epsilon) \\
&\to 1_{[c,\infty)}(c + \epsilon) - 1_{[c,\infty)}(c - \epsilon) \\
&= 1.
\end{align*}

Conversely, for a sequence $\{X_n\}$, if the limit under convergence in probability or convergence almost surely is a random variable $X$, then the sequence also converges in distribution to $X$.


    1.4 Law of Large Numbers

Theorem 1 (Chebychev's Weak LLN). Let $\{X_t\}$ be a sequence of random variables with $E(X_t) = \mu$ and $\lim_{n\to\infty} Var(\bar{X}_n) = 0$; then
\[
\bar{X}_n = \frac{1}{n}\sum_{t=1}^{n} X_t \to_p \mu.
\]

The proof follows readily from Chebychev's inequality:
\[
P(|\bar{X}_n - \mu| > \epsilon) \leq \frac{Var(\bar{X}_n)}{\epsilon^2} \to 0.
\]

The WLLN tells us that the sample mean is a consistent estimator of the population mean, with its variance vanishing as $n \to \infty$. Since $E(\bar{X}_n - \mu)^2 = Var(\bar{X}_n) \to 0$, we also know that $\bar{X}_n$ converges to the population mean in mean square.
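A quick Monte Carlo sketch of the WLLN (illustrative code, not part of the original notes): the variance of the sample mean and the probability of a fixed-size deviation from $\mu$ both shrink as $n$ grows.

```python
import numpy as np

# Monte Carlo check of the WLLN: the variance of the sample mean shrinks and
# the sample mean concentrates around the population mean mu.
rng = np.random.default_rng(0)
mu, reps = 2.0, 2000
for n in (10, 100, 1000):
    xbar = rng.exponential(scale=mu, size=(reps, n)).mean(axis=1)   # E(X_t) = mu
    print(n, xbar.var(), np.mean(np.abs(xbar - mu) > 0.1))          # Var(Xbar_n), P(|Xbar_n - mu| > 0.1)
```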

Theorem 2 (Kolmogorov's Strong LLN). Let $X_t$ be i.i.d. with $E|X_t| < \infty$ and $E(X_t) = \mu$; then
\[
\bar{X}_n \to_{a.s.} \mu.
\]

Note that Kolmogorov's LLN does not require finite variance. Next we consider the LLN for a heterogeneous process without serial correlation, say $E(X_t) = \mu_t$ and $Var(X_t) = \sigma_t^2$, and assume that $\bar{\mu}_n = n^{-1}\sum_{t=1}^{n} \mu_t \to \mu$. Then we know that $E(\bar{X}_n) = \bar{\mu}_n \to \mu$, and

\[
Var(\bar{X}_n) = E\left(n^{-1}\sum_{t=1}^{n}(X_t - \mu_t)\right)^2 = n^{-2}\sum_{t=1}^{n}\sigma_t^2.
\]

To establish conditions under which $Var(\bar{X}_n) \to 0$, we need another fundamental tool in asymptotic theory, Kronecker's lemma.

Theorem 3 (Kronecker's lemma). Let $\{X_n\}$ be a sequence of real numbers and let $\{b_n\}$ be a monotone increasing sequence with $b_n \to \infty$. If $\sum_{t=1}^{\infty} X_t$ converges, then
\[
\frac{1}{b_n}\sum_{t=1}^{n} b_t X_t \to 0.
\]

Theorem 4. Let $\{X_t\}$ be a serially uncorrelated sequence with $\sum_{t=1}^{\infty} t^{-2}\sigma_t^2 < \infty$; then $\bar{X}_n - \bar{\mu}_n \to_p 0$. (By Kronecker's lemma with $b_t = t^2$, the summability condition implies $Var(\bar{X}_n) = n^{-2}\sum_{t=1}^{n}\sigma_t^2 \to 0$.)


    1.5 Classical Central Limit Theory

Finally, the central limit theorem (CLT) provides a tool for establishing the asymptotic normality of an estimator.

Definition 6 (Asymptotic normality). A sequence of random variables $\{X_n\}$ is said to be asymptotically normal with mean $\mu_n$ and standard deviation $\sigma_n$ if $\sigma_n > 0$ for $n$ sufficiently large and
\[
(X_n - \mu_n)/\sigma_n \to_d Z, \quad \text{where } Z \sim N(0, 1).
\]

Theorem 5 (Lindeberg-Levy Central Limit Theorem). If $\{X_n\} \sim i.i.d.(\mu, \sigma^2)$ and $\bar{X}_n = (X_1 + \ldots + X_n)/n$, then
\[
\sqrt{n}(\bar{X}_n - \mu)/\sigma \to_d N(0, 1).
\]

Note that in the CLT we obtain a normality result for $\bar{X}_n$ without assuming normality of the distribution of $X_n$; we only require that the $X_n$ follow some i.i.d. distribution. We will see in a moment that the central limit theorem also holds in more general cases. Another useful tool, which can be used together with the LLN and CLT, is Slutsky's theorem.
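A short Monte Carlo sketch of the Lindeberg-Levy CLT (my own illustration): standardized sample means of a skewed exponential distribution behave like a standard normal for large $n$.

```python
import numpy as np

# Monte Carlo illustration of the Lindeberg-Levy CLT: standardized sample means
# of a skewed exponential(1) distribution look standard normal for large n.
rng = np.random.default_rng(0)
mu = sigma = 1.0                       # exponential(1): mean 1, standard deviation 1
n, reps = 500, 10000
z = np.sqrt(n) * (rng.exponential(size=(reps, n)).mean(axis=1) - mu) / sigma
print(z.mean(), z.var())               # close to 0 and 1
print(np.mean(z <= 1.96))              # close to P(Z <= 1.96) = 0.975
```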

Theorem 6 (Slutsky's theorem). If $X_n \to_d X$ and $Y_n \to_p c$, a constant, then

(a) $Y_n X_n \to_d cX$;

(b) $X_n + Y_n \to_d X + c$.

If we know the asymptotic distribution of a random variable, we can derive the asymptotic distribution of a function of that random variable using the so-called $\delta$-method.

Proposition 11 ($\delta$-method). Let $\{X_n\}$ be a sequence of random variables such that $\sqrt{n}(X_n - \mu) \to_d N(0, \sigma^2)$. If $g$ is a function that is differentiable at $\mu$, then
\[
\sqrt{n}\left[g(X_n) - g(\mu)\right] \to_d N(0, g'(\mu)^2 \sigma^2).
\]

Proof: The Taylor expansion of $g(X_n)$ around $X_n = \mu$ is
\[
g(X_n) = g(\mu) + g'(\mu)(X_n - \mu) + o_p(n^{-1/2}),
\]
since $X_n \to_p \mu$. Applying Slutsky's theorem to
\[
\sqrt{n}\left[g(X_n) - g(\mu)\right] = g'(\mu)\sqrt{n}(X_n - \mu) + o_p(1),
\]
where we know that $\sqrt{n}(X_n - \mu) \to_d N(0, \sigma^2)$, we obtain
\[
\sqrt{n}\left[g(X_n) - g(\mu)\right] \to_d N(0, g'(\mu)^2 \sigma^2).
\]

For example, let $g(X_n) = 1/X_n$ with $\mu \neq 0$. If $\sqrt{n}(X_n - \mu) \to_d N(0, \sigma^2)$, then
\[
\sqrt{n}(1/X_n - 1/\mu) \to_d N(0, \sigma^2/\mu^4).
\]

The Lindeberg-Levy CLT assumes i.i.d. observations, which is too strong in practice. Now we retain the assumption of independence but allow heterogeneous distributions (i.ni.d.); in the next section, we will show versions of the CLT for serially dependent sequences.
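A quick simulation check of the $\delta$-method for $g(x) = 1/x$ (an illustrative sketch of my own): the variance of $\sqrt{n}(1/\bar{X}_n - 1/\mu)$ should be close to $\sigma^2/\mu^4$.

```python
import numpy as np

# Simulation check of the delta method for g(x) = 1/x: the variance of
# sqrt(n)*(1/Xbar_n - 1/mu) should be close to sigma^2 / mu^4.
rng = np.random.default_rng(0)
mu, sigma, n, reps = 2.0, 1.0, 500, 10000
xbar = rng.normal(mu, sigma, size=(reps, n)).mean(axis=1)
z = np.sqrt(n) * (1.0 / xbar - 1.0 / mu)
print(z.var(), sigma ** 2 / mu ** 4)   # the two numbers should be close
```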


In the following analysis, it is more convenient to work with normalized variables, and we also need to use triangular arrays. An array $X_{nt}$ is a double-indexed collection of numbers in which each sample size $n$ can be associated with a different sequence. We use $\{\{X_{nt}\}_{t=0}^{n}\}_{n=1}^{\infty}$, or just $\{X_{nt}\}$, to denote an array. Let $\{Y_t\}$ be the raw sequence with $E(Y_t) = \mu_t$. Define $s_n^2 = \sum_{t=1}^{n} E(Y_t - \mu_t)^2$, $\sigma_{nt}^2 = E(Y_t - \mu_t)^2 / s_n^2$, and
\[
X_{nt} = \frac{Y_t - \mu_t}{s_n}.
\]
Then $E(X_{nt}) = 0$ and $Var(X_{nt}) = \sigma_{nt}^2$. Define
\[
S_n = \sum_{t=1}^{n} X_{nt},
\]
then $E(S_n) = 0$ and
\[
E(S_n^2) = \sum_{t=1}^{n} \sigma_{nt}^2 = 1. \tag{1}
\]

Definition 7 (Lindeberg CLT). Let the array $\{X_{nt}\}$ be independent with zero mean and variance sequence $\{\sigma_{nt}^2\}$ satisfying (1). If the following condition holds,
\[
\lim_{n\to\infty} \sum_{t=1}^{n} \int_{\{|X_{nt}| > \epsilon\}} X_{nt}^2 \, dP = 0 \quad \text{for all } \epsilon > 0, \tag{2}
\]
then $S_n \to_d N(0, 1)$.

Equation (2) is known as the Lindeberg condition. What the Lindeberg condition rules out are cases where some observations exhibit behavior extreme enough to influence the distribution of the sum in the limit. Finite variances alone are not sufficient to rule out such situations with non-identically distributed observations. The following is a popular version of the CLT for independent processes.

Definition 8 (Liapunov CLT). A sufficient condition for the Lindeberg condition (2) is
\[
\lim_{n\to\infty} \sum_{t=1}^{n} E|X_{nt}|^{2+\delta} = 0, \quad \text{for some } \delta > 0. \tag{3}
\]

Condition (3) is known as the Liapunov condition. It is stronger than the Lindeberg condition, but it is more easily checked, and therefore it is more frequently used in practice.

    2 Limit Theorems for Serially Dependent Observations

We have seen that if the data $\{X_n\}$ are generated by an ARMA process, then the observations are not i.i.d. but serially correlated. In this section, we discuss how to derive asymptotic theory for stationary, serially dependent processes.


    2.1 LLN for a Covariance Stationary Process

Consider a covariance stationary process $\{X_t\}$. Without loss of generality, let $E(X_t) = 0$, so $E(X_t X_{t-h}) = \gamma(h)$, where we assume $\sum_{h=0}^{\infty} |\gamma(h)| < \infty$. Now we consider the properties of the sample mean $\bar{X}_n = (X_1 + \ldots + X_n)/n$. First, it is an unbiased estimate of the population mean, $E(\bar{X}_n) = E(X_t) = 0$. Next, the variance of this estimate is:

\begin{align*}
E(\bar{X}_n^2) &= E\left[(X_1 + \ldots + X_n)/n\right]^2 \\
&= (1/n^2)\, E(X_1 + \ldots + X_n)^2 \\
&= (1/n^2)\sum_{i,j=1}^{n} E(X_i X_j) \\
&= (1/n^2)\sum_{i,j=1}^{n} \gamma_x(i - j) \\
&= (1/n)\left(\gamma(0) + 2\sum_{h=1}^{n-1}\left(1 - \frac{h}{n}\right)\gamma(h)\right) \\
&= (1/n)\sum_{|h| < n}\left(1 - \frac{|h|}{n}\right)\gamma(h).
\end{align*}
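The formula just derived is easy to evaluate for a given autocovariance function. Here is a small sketch of my own for an AR(1) process, using its standard autocovariance $\gamma(h) = \phi^{|h|}/(1 - \phi^2)$ (unit innovation variance).

```python
import numpy as np

def var_sample_mean(gamma, n):
    """Var(Xbar_n) = (1/n) * sum_{|h| < n} (1 - |h|/n) * gamma(|h|)."""
    total = gamma(0) + 2.0 * sum((1.0 - h / n) * gamma(h) for h in range(1, n))
    return total / n

phi = 0.5
gamma_ar1 = lambda h: phi ** h / (1.0 - phi ** 2)   # AR(1), unit innovation variance
for n in (10, 100, 1000):
    print(n, var_sample_mean(gamma_ar1, n))          # shrinks roughly like 1/n
```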


A covariance stationary process is said to be ergodic for the mean if the time series average converges to the population mean. Similarly, if the sample average provides a consistent estimate of the second moment, then the process is said to be ergodic for the second moment. In this section, we see that a sufficient condition for a covariance stationary proce