Hu - Time Series Analysis


    Lecture 1: Stationary Time Series∗

    1 Introduction

If a random variable $X$ is indexed by time, usually denoted by $t$, the observations $\{X_t, t \in \mathbb{T}\}$ are called a time series, where $\mathbb{T}$ is a time index set (for example, $\mathbb{T} = \mathbb{Z}$, the integer set).

Time series data are very common in empirical economic studies. Figure 1 plots some frequently used variables. The upper left figure plots quarterly GDP from 1947 to 2001; the upper right figure plots the residuals after linearly detrending the logarithm of GDP; the lower left figure plots the monthly S&P 500 index from 1990 to 2001; and the lower right figure plots the log difference of the monthly S&P 500. As you can see, these four series display quite different patterns over time. Investigating and modeling these different patterns is an important part of this course.

In this course, you will find that many of the techniques (estimation methods, inference procedures, etc.) you have learned in your general econometrics course are still applicable in time series analysis. However, time series data have some special features compared to cross-sectional data. For example, when working with cross-sectional data it usually makes sense to assume that the observations are independent of each other, whereas time series data are very likely to display some degree of dependence over time. More importantly, for time series data we can observe only one history of the realizations of the variable. For example, suppose you obtain a series of US weekly stock index data for the last 50 years. This sample can be said to be large in terms of sample size; however, it is still one data point, as it is only one of the many possible realizations.

    2 Autocovariance Functions

In modeling a finite number of random variables, a covariance matrix is usually computed to summarize the dependence between these variables. For a time series $\{X_t\}_{t=-\infty}^{\infty}$, we need to model the dependence over an infinite number of random variables. The autocovariance and autocorrelation functions provide us with a tool for this purpose.

Definition 1 (Autocovariance function). The autocovariance function of a time series $\{X_t\}$ with $\mathrm{Var}(X_t) < \infty$ is defined by
$$\gamma_X(s,t) = \mathrm{Cov}(X_s, X_t) = E[(X_s - EX_s)(X_t - EX_t)].$$

Example 1 (Moving average process): Let $\epsilon_t \sim$ i.i.d.$(0, 1)$, and
$$X_t = \epsilon_t + 0.5\epsilon_{t-1}.$$

    ∗Copyright 2002-2006 by Ling Hu.


[Figure: four panels plotted against time — GDP; detrended log(GDP); monthly S&P 500 index; monthly S&P 500 index returns.]

    Figure 1: Plots of some economic variables


Then $E(X_t) = 0$ and $\gamma_X(s,t) = E(X_s X_t)$. Let $s \le t$. When $s = t$,
$$\gamma_X(t,t) = E(X_t^2) = 1.25;$$
when $t = s + 1$,
$$\gamma_X(t, t+1) = E[(\epsilon_t + 0.5\epsilon_{t-1})(\epsilon_{t+1} + 0.5\epsilon_t)] = 0.5;$$
and when $t - s > 1$, $\gamma_X(s,t) = 0$.
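As a quick numerical check of Example 1, here is an illustrative sketch (Python with numpy assumed; not part of the original notes) that simulates the MA(1) process and compares the sample autocovariances with the theoretical values 1.25, 0.5 and 0:

```python
import numpy as np

rng = np.random.default_rng(0)
T = 200_000
e = rng.standard_normal(T + 1)
x = e[1:] + 0.5 * e[:-1]              # X_t = eps_t + 0.5 eps_{t-1}

def sample_autocov(x, h):
    # gamma_hat(h): average of (x_t - xbar)(x_{t-h} - xbar)
    xc = x - x.mean()
    return float(np.mean(xc[h:] * xc[:len(xc) - h]))

for h in range(4):
    print(h, round(sample_autocov(x, h), 3))   # approx 1.25, 0.5, 0.0, 0.0
```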

    3 Stationarity and Strict Stationarity

With autocovariance functions, we can define covariance stationarity, or weak stationarity. In the literature, stationarity usually means weak stationarity, unless otherwise specified.

Definition 2 (Stationarity or weak stationarity). The time series $\{X_t, t \in \mathbb{Z}\}$ (where $\mathbb{Z}$ is the integer set) is said to be stationary if

(I) $E(X_t^2) < \infty$ for all $t \in \mathbb{Z}$;

(II) $EX_t = \mu$ for all $t \in \mathbb{Z}$;

(III) $\gamma_X(s,t) = \gamma_X(s+h, t+h)$ for all $s, t, h \in \mathbb{Z}$.

In other words, a stationary time series $\{X_t\}$ must have three features: finite variance, a constant first moment, and a second moment $\gamma_X(s,t)$ that depends only on $(t-s)$ and not on $s$ or $t$. In light of the last point, we can rewrite the autocovariance function of a stationary process as
$$\gamma_X(h) = \mathrm{Cov}(X_t, X_{t+h}) \quad \text{for } t, h \in \mathbb{Z}.$$
Also, when $X_t$ is stationary, we must have
$$\gamma_X(h) = \gamma_X(-h).$$
When $h = 0$, $\gamma_X(0) = \mathrm{Cov}(X_t, X_t)$ is the variance of $X_t$, so the autocorrelation function for a stationary time series $\{X_t\}$ is defined to be
$$\rho_X(h) = \frac{\gamma_X(h)}{\gamma_X(0)}.$$

Example 1 (continued): In Example 1, we see that $E(X_t) = 0$, $E(X_t^2) = 1.25$, and the autocovariance function does not depend on $s$ or $t$. Indeed, we have $\gamma_X(0) = 1.25$, $\gamma_X(1) = 0.5$, and $\gamma_X(h) = 0$ for $h > 1$. Therefore, $\{X_t\}$ is a stationary process.

Example 2 (Random walk): Let $S_t$ be a random walk, $S_t = \sum_{s=1}^{t} X_s$ with $S_0 = 0$, where $X_t$ is independent and identically distributed with mean zero and variance $\sigma^2$. Then for $h > 0$,
$$\gamma_S(t, t+h) = \mathrm{Cov}(S_t, S_{t+h}) = \mathrm{Cov}\Big(\sum_{i=1}^{t} X_i, \sum_{j=1}^{t+h} X_j\Big) = \mathrm{Var}\Big(\sum_{i=1}^{t} X_i\Big) = t\sigma^2,$$
since $\mathrm{Cov}(X_i, X_j) = 0$ for $i \neq j$.


In this case, the autocovariance function depends on time $t$; therefore the random walk process $S_t$ is not stationary.
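A small simulation sketch (my own illustration, assuming numpy) makes the nonstationarity visible: across many independent random-walk paths, the variance at date $t$ grows linearly, matching $\gamma_S(t,t) = t\sigma^2$.

```python
import numpy as np

rng = np.random.default_rng(1)
n_paths, T, sigma = 10_000, 200, 1.0
X = rng.normal(0.0, sigma, size=(n_paths, T))
S = X.cumsum(axis=1)                               # S_t = X_1 + ... + X_t

for t in (10, 50, 200):
    print(t, round(float(S[:, t - 1].var()), 1))   # close to t * sigma^2
```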

Example 3 (Process with linear trend): Let $\epsilon_t \sim$ i.i.d.$(0, \sigma^2)$ and
$$X_t = \delta t + \epsilon_t.$$
Then $E(X_t) = \delta t$, which depends on $t$; therefore a process with a linear trend is not stationary.

Among stationary processes, there is a simple type of process that is widely used in constructing more complicated processes.

Example 4 (White noise): The time series $\epsilon_t$ is said to be a white noise with mean zero and variance $\sigma^2$, written as
$$\epsilon_t \sim WN(0, \sigma^2),$$
if and only if $\epsilon_t$ has zero mean and covariance function
$$\gamma_\epsilon(h) = \begin{cases}\sigma^2 & \text{if } h = 0, \\ 0 & \text{if } h \neq 0.\end{cases}$$
It is clear that a white noise process is stationary. Note that the white noise assumption is weaker than the assumption of an independently and identically distributed sequence.

To tell whether a process is covariance stationary, we compute the unconditional first two moments; therefore, processes with conditional heteroskedasticity may still be stationary.

Example 5 (ARCH model): Let $X_t = \epsilon_t$ with $E(\epsilon_t) = 0$, $E(\epsilon_t^2) = \sigma^2 > 0$, and $E(\epsilon_t\epsilon_s) = 0$ for $t \neq s$. Assume the following process for $\epsilon_t^2$:
$$\epsilon_t^2 = c + \rho\epsilon_{t-1}^2 + u_t,$$
where $0 < \rho < 1$ and $u_t \sim WN(0, 1)$. In this example, the conditional variance of $X_t$ is time varying, as
$$E_{t-1}(X_t^2) = E_{t-1}(\epsilon_t^2) = E_{t-1}(c + \rho\epsilon_{t-1}^2 + u_t) = c + \rho\epsilon_{t-1}^2.$$
However, the unconditional variance of $X_t$ is constant, $\sigma^2 = c/(1 - \rho)$. Therefore, this process is still stationary.
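The following sketch is an illustration of this point (not from the notes; numpy assumed). It uses the standard ARCH(1) parameterization $\epsilon_t = \sigma_t\eta_t$ with $\sigma_t^2 = c + \rho\epsilon_{t-1}^2$ and $\eta_t$ i.i.d. $N(0,1)$, which is a slightly different but related formulation: the conditional variance moves around over time while the sample (unconditional) variance settles near $c/(1-\rho)$.

```python
import numpy as np

rng = np.random.default_rng(2)
c, rho, T = 1.0, 0.5, 200_000
eps = np.zeros(T)
for t in range(1, T):
    sigma2_t = c + rho * eps[t - 1] ** 2      # conditional variance given time t-1
    eps[t] = np.sqrt(sigma2_t) * rng.standard_normal()

print(round(float(eps.var()), 3), c / (1.0 - rho))   # both close to 2.0
```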

Definition 3 (Strict stationarity). The time series $\{X_t, t \in \mathbb{Z}\}$ is said to be strictly stationary if the joint distribution of $(X_{t_1}, X_{t_2}, \ldots, X_{t_k})$ is the same as that of $(X_{t_1+h}, X_{t_2+h}, \ldots, X_{t_k+h})$.

In other words, strict stationarity means that the joint distribution only depends on the 'difference' $h$, not on the times $(t_1, \ldots, t_k)$.

Remarks: First note that finite variance is not assumed in the definition of strict stationarity; therefore, strict stationarity does not necessarily imply weak stationarity. For example, an i.i.d. Cauchy process is strictly stationary but not weakly stationary. Second, a nonlinear function of a strictly stationary variable is still strictly stationary, but this is not true for weak stationarity: for example, the square of a covariance stationary process may not have finite variance.


[Figure: four panels — S&P 500 index in 1999; S&P 500 returns in 1999; S&P 500 index in 2001; S&P 500 returns in 2001.]

    Figure 2: Plots of S&P index and returns in year 1999 and 2001

Finally, weak stationarity usually does not imply strict stationarity, as higher moments of the process may depend on time $t$. However, if the process $\{X_t\}$ is a Gaussian time series, which means that the distribution functions of $\{X_t\}$ are all multivariate Gaussian, i.e. the joint density
$$f_{X_t, X_{t+j_1}, \ldots, X_{t+j_k}}(x_t, x_{t+j_1}, \ldots, x_{t+j_k})$$
is Gaussian for any $j_1, j_2, \ldots, j_k$, then weak stationarity also implies strict stationarity. This is because a multivariate Gaussian distribution is fully characterized by its first two moments.

For example, a white noise is stationary but may not be strictly stationary, but a Gaussian white noise is strictly stationary. Also, general white noise only implies a lack of correlation, while Gaussian white noise also implies independence, because for a Gaussian process uncorrelatedness implies independence. Therefore, a Gaussian white noise is just i.i.d. $N(0, \sigma^2)$.

Stationary and nonstationary processes are very different in their properties, and they require different inference procedures. We will discuss this in much more detail throughout this course. At this point, note that a simple and useful way to tell whether a process is stationary in empirical studies is to plot the data. Loosely speaking, if a series does not seem to have a constant mean or variance, then very likely it is not stationary. For example, Figure 2 plots the daily S&P 500 index in 1999 and 2001. The upper left figure plots the index in 1999, the upper right figure plots the returns in 1999, the lower left figure plots the index in 2001, and the lower right figure plots the returns in 2001.

Note that the index levels are very different in 1999 and 2001. In 1999, the index wanders at a higher level and the market rises. In 2001, the level is much lower and the market drops.


In comparison, we do not see much difference in the returns in 1999 and 2001 (although the returns in 2001 seem to have thicker tails). In fact, judging only from the return data, it is very hard to tell which figure plots the market in a boom and which figure plots the market in a crash. Therefore, people usually treat stock price data as nonstationary and stock return data as stationary.

    4 Ergodicity

Recall that Kolmogorov's law of large numbers (LLN) tells us that if $X_i \sim$ i.i.d.$(\mu, \sigma^2)$ for $i = 1, \ldots, n$, then we have the following limit for the ensemble average:
$$\bar{X}_n = n^{-1}\sum_{i=1}^{n} X_i \to \mu.$$

In time series, we have the time series average, not the ensemble average. To explain the difference between the ensemble average and the time series average, consider the following experiment. Suppose we want to track the movements of some particles and draw inference about their expected position (suppose that these particles move on the real line). If we have a group of particles (group size $n$), then we can track the position of each particle and plot the distribution of their positions. The mean of this sample is called the ensemble average. If all these particles are i.i.d., the LLN tells us that this average converges to its expectation as $n \to \infty$. However, as we remarked earlier, with time series observations we only have one history. That means that in this experiment we only have one particle. Then, instead of collecting $n$ particles, we can only track this single particle and record its position, say $x_t$, for $t = 1, 2, \ldots, T$. The mean computed by averaging over time, $T^{-1}\sum_{t=1}^{T} x_t$, is called the time series average.

Does the time series average converge to the same limit as the ensemble average? The answer is yes if $X_t$ is stationary and ergodic. If $X_t$ is stationary and ergodic with $E(X_t) = \mu$, then the time series average has the same limit as the ensemble average,
$$\bar{X}_T = T^{-1}\sum_{t=1}^{T} X_t \to \mu.$$
This result is given by the ergodic theorem, and we will discuss it later in lecture 4 on asymptotic theory. Note that this result requires both stationarity and ergodicity. We have explained stationarity, and we have seen that stationarity allows time series dependence. Ergodicity requires 'average asymptotic independence'. Note that stationarity itself does not guarantee ergodicity (page 47 in Hamilton and lecture 4).
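A minimal sketch of the ergodic theorem at work (my own illustration, assuming numpy): for a stationary, ergodic AR(1) with mean $\mu$, the time average computed from a single realization approaches $\mu$ as the sample grows.

```python
import numpy as np

rng = np.random.default_rng(3)
mu, phi, T = 2.0, 0.8, 100_000
x = np.empty(T)
x[0] = mu
for t in range(1, T):
    x[t] = mu + phi * (x[t - 1] - mu) + rng.standard_normal()

for n in (100, 10_000, 100_000):
    print(n, round(float(x[:n].mean()), 3))   # time average drifts toward mu = 2.0
```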

Readings: Hamilton, Ch. 3.1; Brockwell and Davis, pages 1-29; Hayashi, pages 97-102.


    Lecture 2: ARMA Models∗

    1 ARMA Process

As we have remarked, dependence is very common in time series observations. To model this time series dependence, we start with univariate ARMA models. To motivate the model, we can basically follow two lines of thinking. First, for a series $x_t$, we can model the level of its current observation as depending on the level of its lagged observations. For example, if we observe a high GDP realization this quarter, we would expect that GDP in the next few quarters will be good as well. This way of thinking can be represented by an AR model. The AR(1) (autoregressive of order one) model can be written as
$$x_t = \phi x_{t-1} + \epsilon_t,$$
where $\epsilon_t \sim WN(0, \sigma^2)$, and we keep this assumption throughout this lecture. Similarly, the AR($p$) (autoregressive of order $p$) model can be written as
$$x_t = \phi_1 x_{t-1} + \phi_2 x_{t-2} + \ldots + \phi_p x_{t-p} + \epsilon_t.$$

In a second line of thinking, we can model the observation of a random variable at time $t$ as being affected not only by the shock at time $t$, but also by shocks that took place before time $t$. For example, if we observe a negative shock to the economy, say a catastrophic earthquake, then we would expect this negative effect to affect the economy not only at the time it takes place, but also in the near future. This kind of thinking can be represented by an MA model. The MA(1) (moving average of order one) and MA($q$) (moving average of order $q$) models can be written as
$$x_t = \epsilon_t + \theta\epsilon_{t-1}$$
and
$$x_t = \epsilon_t + \theta_1\epsilon_{t-1} + \ldots + \theta_q\epsilon_{t-q}.$$

If we combine these two models, we get a general ARMA($p, q$) model,
$$x_t = \phi_1 x_{t-1} + \phi_2 x_{t-2} + \ldots + \phi_p x_{t-p} + \epsilon_t + \theta_1\epsilon_{t-1} + \ldots + \theta_q\epsilon_{t-q}.$$
The ARMA model provides one of the basic tools of time series modeling. In the next few sections, we will discuss how to draw inferences using a univariate ARMA model.

    ∗Copyright 2002-2006 by Ling Hu.


    2 Lag Operators

Lag operators enable us to present an ARMA model in a much more concise way. Applying the lag operator (denoted $L$) once, we move the index back one time unit; applying it $k$ times, we move the index back $k$ units:
$$Lx_t = x_{t-1}, \quad L^2 x_t = x_{t-2}, \quad \ldots, \quad L^k x_t = x_{t-k}.$$
The lag operator is distributive over the addition operator, i.e.
$$L(x_t + y_t) = x_{t-1} + y_{t-1}.$$

Using lag operators, we can rewrite the ARMA models as:
$$\text{AR(1)}: \ (1 - \phi L)x_t = \epsilon_t$$
$$\text{AR}(p): \ (1 - \phi_1 L - \phi_2 L^2 - \ldots - \phi_p L^p)x_t = \epsilon_t$$
$$\text{MA(1)}: \ x_t = (1 + \theta L)\epsilon_t$$
$$\text{MA}(q): \ x_t = (1 + \theta_1 L + \theta_2 L^2 + \ldots + \theta_q L^q)\epsilon_t$$
Let $\phi_0 = 1$, $\theta_0 = 1$ and define the lag polynomials
$$\phi(L) = 1 - \phi_1 L - \phi_2 L^2 - \ldots - \phi_p L^p,$$
$$\theta(L) = 1 + \theta_1 L + \theta_2 L^2 + \ldots + \theta_q L^q.$$
With lag polynomials, we can rewrite an ARMA process in a more compact way:
$$\text{AR}: \ \phi(L)x_t = \epsilon_t$$
$$\text{MA}: \ x_t = \theta(L)\epsilon_t$$
$$\text{ARMA}: \ \phi(L)x_t = \theta(L)\epsilon_t$$

    3 Invertibility

Given a time series probability model, we can usually find multiple ways to represent it. Which representation to choose depends on our problem. For example, to study impulse-response functions (section 4), MA representations may be more convenient, while to estimate an ARMA model, AR representations may be more convenient, since usually $x_t$ is observable while $\epsilon_t$ is not. However, not all ARMA processes can be inverted. In this section, we consider under what conditions we can invert an AR model into an MA model and invert an MA model into an AR model. It turns out that invertibility, which means that the process can be inverted, is an important property of the model.

If we let 1 denote the identity operator, i.e. $1y_t = y_t$, then the inversion operator $(1 - \phi L)^{-1}$ is defined to be the operator such that
$$(1 - \phi L)^{-1}(1 - \phi L) = 1.$$


For the AR(1) process, if we premultiply both sides of the equation by $(1 - \phi L)^{-1}$, we get
$$x_t = (1 - \phi L)^{-1}\epsilon_t.$$
Is there any explicit way to rewrite $(1 - \phi L)^{-1}$? Yes, and the answer turns out to be $\theta(L)$ with $\theta_k = \phi^k$ for $|\phi| < 1$. To show this,
$$(1 - \phi L)\theta(L) = (1 - \phi L)(1 + \theta_1 L + \theta_2 L^2 + \ldots) = (1 - \phi L)(1 + \phi L + \phi^2 L^2 + \ldots)$$
$$= 1 - \phi L + \phi L - \phi^2 L^2 + \phi^2 L^2 - \phi^3 L^3 + \ldots = 1 - \lim_{k\to\infty}\phi^k L^k = 1 \quad \text{for } |\phi| < 1.$$

We can also verify this result by recursive substitution,
$$x_t = \phi x_{t-1} + \epsilon_t = \phi^2 x_{t-2} + \epsilon_t + \phi\epsilon_{t-1} = \ldots = \phi^k x_{t-k} + \epsilon_t + \phi\epsilon_{t-1} + \ldots + \phi^{k-1}\epsilon_{t-k+1} = \phi^k x_{t-k} + \sum_{j=0}^{k-1}\phi^j\epsilon_{t-j}.$$
With $|\phi| < 1$, we have $\lim_{k\to\infty}\phi^k x_{t-k} = 0$, so again we get the moving average representation with MA coefficients equal to $\phi^k$. So the condition $|\phi| < 1$ enables us to invert an AR(1) process into an MA($\infty$) process:
$$\text{AR(1)}: \ (1 - \phi L)x_t = \epsilon_t$$
$$\text{MA}(\infty): \ x_t = \theta(L)\epsilon_t \quad \text{with } \theta_k = \phi^k$$
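A short numerical sketch of this inversion (not from the notes; numpy and scipy assumed): the impulse response of the filter $(1 - \phi L)x_t = \epsilon_t$ equals the MA($\infty$) coefficients $\theta_k = \phi^k$.

```python
import numpy as np
from scipy.signal import lfilter

phi, K = 0.7, 8
impulse = np.zeros(K)
impulse[0] = 1.0
psi = lfilter([1.0], [1.0, -phi], impulse)     # response of x_t = phi x_{t-1} + e_t
print(np.allclose(psi, phi ** np.arange(K)))   # True
```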

We have obtained some nice results in inverting an AR(1) process into an MA($\infty$) process. Then, how do we invert a general AR($p$) process? We need to factorize the lag polynomial and then make use of the result that $(1 - \phi L)^{-1} = \theta(L)$. For example, let $p = 2$; we have
$$(1 - \phi_1 L - \phi_2 L^2)x_t = \epsilon_t. \qquad (1)$$
To factorize this polynomial, we need to find roots $\lambda_1$ and $\lambda_2$ such that
$$(1 - \phi_1 L - \phi_2 L^2) = (1 - \lambda_1 L)(1 - \lambda_2 L).$$
Given that both $|\lambda_1| < 1$ and $|\lambda_2| < 1$, each factor can be inverted as before,


and so to invert (1), we have
$$x_t = (1 - \lambda_1 L)^{-1}(1 - \lambda_2 L)^{-1}\epsilon_t = \theta_1(L)\theta_2(L)\epsilon_t.$$
Solving $\theta_1(L)\theta_2(L)$ is straightforward,
$$\theta_1(L)\theta_2(L) = (1 + \lambda_1 L + \lambda_1^2 L^2 + \ldots)(1 + \lambda_2 L + \lambda_2^2 L^2 + \ldots) = 1 + (\lambda_1 + \lambda_2)L + (\lambda_1^2 + \lambda_1\lambda_2 + \lambda_2^2)L^2 + \ldots = \sum_{k=0}^{\infty}\Big(\sum_{j=0}^{k}\lambda_1^j\lambda_2^{k-j}\Big)L^k = \psi(L), \text{ say},$$
with $\psi_k = \sum_{j=0}^{k}\lambda_1^j\lambda_2^{k-j}$. Similarly, we can also invert the general AR($p$) process given that all roots $\lambda_i$ have absolute value less than one. An alternative way to represent this MA process (to express $\psi$) is to make use of partial fractions. Let $c_1, c_2$ be two constants whose values are determined by
$$\frac{1}{(1-\lambda_1 L)(1-\lambda_2 L)} = \frac{c_1}{1-\lambda_1 L} + \frac{c_2}{1-\lambda_2 L} = \frac{c_1(1-\lambda_2 L) + c_2(1-\lambda_1 L)}{(1-\lambda_1 L)(1-\lambda_2 L)}.$$
We must have
$$1 = c_1(1-\lambda_2 L) + c_2(1-\lambda_1 L) = (c_1 + c_2) - (c_1\lambda_2 + c_2\lambda_1)L,$$
which gives $c_1 + c_2 = 1$ and $c_1\lambda_2 + c_2\lambda_1 = 0$. Solving these two equations we get
$$c_1 = \frac{\lambda_1}{\lambda_1 - \lambda_2}, \qquad c_2 = \frac{\lambda_2}{\lambda_2 - \lambda_1}.$$
Then we can express $x_t$ as
$$x_t = [(1-\lambda_1 L)(1-\lambda_2 L)]^{-1}\epsilon_t = c_1(1-\lambda_1 L)^{-1}\epsilon_t + c_2(1-\lambda_2 L)^{-1}\epsilon_t = c_1\sum_{k=0}^{\infty}\lambda_1^k\epsilon_{t-k} + c_2\sum_{k=0}^{\infty}\lambda_2^k\epsilon_{t-k} = \sum_{k=0}^{\infty}\psi_k\epsilon_{t-k},$$
where $\psi_k = c_1\lambda_1^k + c_2\lambda_2^k$.


Similarly, an MA process,
$$x_t = \theta(L)\epsilon_t,$$
is invertible if $\theta(L)^{-1}$ exists. An MA(1) process is invertible if $|\theta| < 1$, and an MA($q$) process is invertible if all roots of
$$1 + \theta_1 z + \theta_2 z^2 + \ldots + \theta_q z^q = 0$$
lie outside of the unit circle. Note that for any invertible MA process, we can find a noninvertible MA process which is the same as the invertible process up to the second moment. The converse is also true. We will give an example in section 5.

Finally, given an invertible ARMA($p, q$) process,
$$\phi(L)x_t = \theta(L)\epsilon_t, \qquad x_t = \phi^{-1}(L)\theta(L)\epsilon_t, \qquad x_t = \psi(L)\epsilon_t,$$
what is the series $\psi_k$? Note that since
$$\phi^{-1}(L)\theta(L)\epsilon_t = \psi(L)\epsilon_t,$$
we have $\theta(L) = \phi(L)\psi(L)$. So the elements of $\psi$ can be computed recursively by equating the coefficients of $L^k$.

Example 1: For an ARMA(1, 1) process, we have
$$1 + \theta L = (1 - \phi L)(\psi_0 + \psi_1 L + \psi_2 L^2 + \ldots) = \psi_0 + (\psi_1 - \phi\psi_0)L + (\psi_2 - \phi\psi_1)L^2 + \ldots$$
Matching coefficients on $L^k$, we get
$$1 = \psi_0, \qquad \theta = \psi_1 - \phi\psi_0, \qquad 0 = \psi_j - \phi\psi_{j-1} \quad \text{for } j \ge 2.$$
Solving these equations, we easily get
$$\psi_0 = 1, \qquad \psi_1 = \phi + \theta, \qquad \psi_j = \phi^{j-1}(\phi + \theta) \quad \text{for } j \ge 2.$$
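The recursion in Example 1 is easy to implement. The following sketch (an illustration assuming numpy; the values of $\phi$ and $\theta$ are arbitrary) computes $\psi_j$ by the coefficient-matching recursion and checks the closed form $\psi_j = \phi^{j-1}(\phi + \theta)$.

```python
import numpy as np

phi, theta, J = 0.5, 0.3, 10
psi = np.empty(J)
psi[0] = 1.0                          # 1 = psi_0
psi[1] = phi * psi[0] + theta         # theta = psi_1 - phi * psi_0
for j in range(2, J):
    psi[j] = phi * psi[j - 1]         # 0 = psi_j - phi * psi_{j-1}

closed = np.r_[1.0, phi ** (np.arange(1, J) - 1.0) * (phi + theta)]
print(np.allclose(psi, closed))       # True
```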

    4 Impulse-Response Functions

Given an ARMA model, $\phi(L)x_t = \theta(L)\epsilon_t$, it is natural to ask: what is the effect on $x_t$ of a unit shock at time $s$ (for $s < t$)?


    4.1 MA process

For an MA(1) process,
$$x_t = \epsilon_t + \theta\epsilon_{t-1},$$
the effects of $\epsilon$ on $x$ are:

$\epsilon$: 0 1 0 0 0
$x$: 0 1 $\theta$ 0 0

For an MA($q$) process,
$$x_t = \epsilon_t + \theta_1\epsilon_{t-1} + \theta_2\epsilon_{t-2} + \ldots + \theta_q\epsilon_{t-q},$$
the effects of $\epsilon$ on $x$ are:

$\epsilon$: 0 1 0 0 $\ldots$ 0 0
$x$: 0 1 $\theta_1$ $\theta_2$ $\ldots$ $\theta_q$ 0

The left panel of Figure 1 plots the impulse-response function of an MA(3) process. Similarly, we can write down the effects for an MA($\infty$) process. As you can see, we can read the impulse-response function immediately from an MA process.

    4.2 AR process

For an AR(1) process $x_t = \phi x_{t-1} + \epsilon_t$ with $|\phi| < 1$, we can invert it to an MA process, and the effects of $\epsilon$ on $x$ are:

$\epsilon$: 0 1 0 0 $\ldots$
$x$: 0 1 $\phi$ $\phi^2$ $\ldots$

As can be seen from above, the impulse-response dynamics are quite clear from an MA representation. For example, let $t > s > 0$; given a one unit increase in $\epsilon_s$, the effect on $x_t$ would be $\phi^{t-s}$, if there are no other shocks. If there are shocks that take place at times other than $s$ and have nonzero effects on $x_t$, then we can add these effects, since this is a linear model.

The dynamics are a bit more complicated for a higher order AR process, but after applying our old trick of inverting it to an MA process, the analysis is straightforward. Take an AR(2) process as an example.

Example 2:
$$x_t = 0.6x_{t-1} + 0.2x_{t-2} + \epsilon_t,$$
or
$$(1 - 0.6L - 0.2L^2)x_t = \epsilon_t.$$
We first solve the polynomial
$$y^2 + 3y - 5 = 0$$
and get the two roots¹ $y_1 \approx 1.1926$ and $y_2 \approx -4.1926$. Recall that $\lambda_1 = 1/y_1 \approx 0.84$ and $\lambda_2 = 1/y_2 \approx -0.24$. So we can factorize the lag polynomial as
$$(1 - 0.6L - 0.2L^2)x_t = (1 - 0.84L)(1 + 0.24L)x_t,$$
$$x_t = (1 - 0.84L)^{-1}(1 + 0.24L)^{-1}\epsilon_t = \psi(L)\epsilon_t,$$

¹Recall that the roots of the polynomial $ay^2 + by + c = 0$ are $\frac{-b \pm \sqrt{b^2 - 4ac}}{2a}$.


where $\psi_k = \sum_{j=0}^{k}\lambda_1^j\lambda_2^{k-j}$. In this example, the series of $\psi$ is $\{1, 0.6, 0.56, 0.456, 0.386, \ldots\}$. So the effects of $\epsilon$ on $x$ can be described as:

$\epsilon$: 0 1 0 0 0 $\ldots$
$x$: 0 1 0.6 0.56 0.456 $\ldots$

The right panel of Figure 1 plots this impulse-response function. So after we invert an AR($p$) process to an MA process, given $t > s > 0$, the effect of a one unit increase in $\epsilon_s$ on $x_t$ is just $\psi_{t-s}$.

We can see that given a linear process, AR or ARMA, if we can represent it as an MA process, we obtain the impulse-response dynamics immediately. In fact, the MA representation is the same thing as the impulse-response function.
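For the AR(2) in Example 2, the $\psi$ series can be read off as the impulse response of the corresponding filter. A sketch (assuming scipy; the coefficients are those of the example):

```python
import numpy as np
from scipy.signal import lfilter

impulse = np.zeros(6)
impulse[0] = 1.0
psi = lfilter([1.0], [1.0, -0.6, -0.2], impulse)   # x_t = 0.6 x_{t-1} + 0.2 x_{t-2} + e_t
print(np.round(psi, 4))   # [1. 0.6 0.56 0.456 0.3856 0.3226]
```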


Figure 1: The impulse-response functions of an MA(3) process (θ1 = 0.6, θ2 = −0.5, θ3 = 0.4) and an AR(2) process (φ1 = 0.6, φ2 = 0.2), with a unit shock at time zero

    5 Autocovariance Functions and Stationarity of ARMA models

    5.1 MA(1)

$$x_t = \epsilon_t + \theta\epsilon_{t-1},$$
where $\epsilon_t \sim WN(0, \sigma^2)$. It is easy to calculate the first two moments of $x_t$:
$$E(x_t) = E(\epsilon_t + \theta\epsilon_{t-1}) = 0, \qquad E(x_t^2) = (1 + \theta^2)\sigma^2,$$
and
$$\gamma_x(t, t+h) = E[(\epsilon_t + \theta\epsilon_{t-1})(\epsilon_{t+h} + \theta\epsilon_{t+h-1})] = \begin{cases}\theta\sigma^2 & \text{for } h = 1, \\ 0 & \text{for } h > 1.\end{cases}$$


So, for an MA(1) process, we have a fixed mean and a covariance function which does not depend on time $t$: $\gamma(0) = (1 + \theta^2)\sigma^2$, $\gamma(1) = \theta\sigma^2$, and $\gamma(h) = 0$ for $h > 1$. So we know that the MA(1) process is stationary for any finite value of $\theta$. The autocorrelation can be computed as $\rho_x(h) = \gamma_x(h)/\gamma_x(0)$, so
$$\rho_x(0) = 1, \qquad \rho_x(1) = \frac{\theta}{1 + \theta^2}, \qquad \rho_x(h) = 0 \ \text{for } h > 1.$$

We proposed in the section on invertibility that for an invertible (noninvertible) MA process, there always exists a noninvertible (invertible) process which is the same as the original process up to the second moment. We use the following MA(1) process as an example.

Example 3: The process
$$x_t = \epsilon_t + \theta\epsilon_{t-1}, \quad \epsilon_t \sim WN(0, \sigma^2), \quad |\theta| > 1,$$
is noninvertible. Consider an invertible MA process defined as
$$\tilde{x}_t = \tilde{\epsilon}_t + (1/\theta)\tilde{\epsilon}_{t-1}, \quad \tilde{\epsilon}_t \sim WN(0, \theta^2\sigma^2).$$
Then we can compute that $E(x_t) = E(\tilde{x}_t) = 0$, $E(x_t^2) = E(\tilde{x}_t^2) = (1 + \theta^2)\sigma^2$, $\gamma_x(1) = \gamma_{\tilde{x}}(1) = \theta\sigma^2$, and $\gamma_x(h) = \gamma_{\tilde{x}}(h) = 0$ for $h > 1$. Therefore, these two processes are equivalent up to the second moments. To be more concrete, we plug in some numbers. Let $\theta = 2$; we know that the process
$$x_t = \epsilon_t + 2\epsilon_{t-1}, \quad \epsilon_t \sim WN(0, 1),$$
is noninvertible. Consider the invertible process
$$\tilde{x}_t = \tilde{\epsilon}_t + (1/2)\tilde{\epsilon}_{t-1}, \quad \tilde{\epsilon}_t \sim WN(0, 4).$$
Note that $E(x_t) = E(\tilde{x}_t) = 0$, $E(x_t^2) = E(\tilde{x}_t^2) = 5$, $\gamma_x(1) = \gamma_{\tilde{x}}(1) = 2$, and $\gamma_x(h) = \gamma_{\tilde{x}}(h) = 0$ for $h > 1$.

Although these two representations, the noninvertible MA and the invertible MA, generate the same process up to the second moment, we prefer the invertible representation in practice, because if we can invert an MA process to an AR process, we can find the value of $\epsilon_t$ (non-observable) based on all past values of $x$ (observable). If a process is noninvertible, then, in order to find the value of $\epsilon_t$, we would have to know all future values of $x$.
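A small sketch (plain Python, my own illustration) that tabulates the second moments of the two representations in Example 3 and confirms that they coincide:

```python
def ma1_acov(theta, var_e, h):
    """Autocovariance of x_t = e_t + theta * e_{t-1} with Var(e_t) = var_e."""
    if h == 0:
        return (1 + theta ** 2) * var_e
    return theta * var_e if h == 1 else 0.0

for h in range(3):
    # noninvertible (theta=2, Var=1) vs invertible (theta=1/2, Var=4)
    print(h, ma1_acov(2.0, 1.0, h), ma1_acov(0.5, 4.0, h))   # 5 5, 2 2, 0 0
```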

    5.2 MA(q )

$$x_t = \theta(L)\epsilon_t = \sum_{k=0}^{q}(\theta_k L^k)\epsilon_t$$


The first two moments are
$$E(x_t) = 0, \qquad E(x_t^2) = \sum_{k=0}^{q}\theta_k^2\sigma^2,$$
and
$$\gamma_x(h) = \begin{cases}\sum_{k=0}^{q-h}\theta_k\theta_{k+h}\sigma^2 & \text{for } h = 1, 2, \ldots, q, \\ 0 & \text{for } h > q.\end{cases}$$
Again, an MA($q$) process is stationary for any finite values of $\theta_1, \ldots, \theta_q$.

    5.3 MA(∞)

$$x_t = \theta(L)\epsilon_t = \sum_{k=0}^{\infty}(\theta_k L^k)\epsilon_t$$
Before we compute moments and discuss the stationarity of $x_t$, we should first make sure that $\{x_t\}$ converges.

Proposition 1: If $\{\epsilon_t\}$ is a sequence of white noise with $\sigma^2 < \infty$ and $\{\theta_k\}$ is square summable, then $\sum_{k=0}^{\infty}\theta_k\epsilon_{t-k}$ converges in mean square. To see this, for $n > m > 0$, we want to show that
$$E\left[\sum_{k=1}^{n}\theta_k\epsilon_{t-k} - \sum_{k=1}^{m}\theta_k\epsilon_{t-k}\right]^2 = \sum_{m < k \le n}\theta_k^2\sigma^2 = \left[\sum_{k=0}^{n}\theta_k^2 - \sum_{k=0}^{m}\theta_k^2\right]\sigma^2 \to 0 \quad \text{as } m, n \to \infty.$$
The result holds since $\{\theta_k\}$ is square summable. It is often more convenient to work with a slightly stronger condition, absolute summability:
$$\sum_{k=0}^{\infty}|\theta_k| < \infty.$$


It is easy to show that absolute summability implies square summability. An MA($\infty$) process with absolutely summable coefficients is stationary, with moments
$$E(x_t) = 0, \qquad E(x_t^2) = \sum_{k=0}^{\infty}\theta_k^2\sigma^2, \qquad \gamma_x(h) = \sum_{k=0}^{\infty}\theta_k\theta_{k+h}\sigma^2.$$

    5.4 AR(1)

$$(1 - \phi L)x_t = \epsilon_t \qquad (2)$$
Recall that an AR(1) process with $|\phi| < 1$ can be inverted to an MA($\infty$) process
$$x_t = \theta(L)\epsilon_t \quad \text{with } \theta_k = \phi^k.$$
With $|\phi| < 1$, it is easy to check that absolute summability holds:
$$\sum_{k=0}^{\infty}|\theta_k| = \sum_{k=0}^{\infty}|\phi^k| < \infty.$$
Using the results for MA($\infty$), the moments for $x_t$ in (2) can be computed:
$$E(x_t) = 0, \qquad E(x_t^2) = \sum_{k=0}^{\infty}\phi^{2k}\sigma^2 = \frac{\sigma^2}{1 - \phi^2}, \qquad \gamma_x(h) = \sum_{k=0}^{\infty}\phi^{2k+h}\sigma^2 = \frac{\phi^h\sigma^2}{1 - \phi^2}.$$
So, an AR(1) process with $|\phi| < 1$ is stationary.

    5.5 AR(p)

Recall that an AR($p$) process
$$(1 - \phi_1 L - \phi_2 L^2 - \ldots - \phi_p L^p)x_t = \epsilon_t$$
can be inverted to an MA process $x_t = \theta(L)\epsilon_t$ if all $\lambda_i$ in
$$(1 - \phi_1 L - \phi_2 L^2 - \ldots - \phi_p L^p) = (1 - \lambda_1 L)(1 - \lambda_2 L)\cdots(1 - \lambda_p L) \qquad (3)$$
have absolute value less than one. It also turns out that with $|\lambda_i| < 1$, the absolute summability $\sum_{k=0}^{\infty}|\psi_k| < \infty$ is also satisfied. (The proof can be found on page 770 of Hamilton; it uses the result that $\psi_k = c_1\lambda_1^k + c_2\lambda_2^k$.)


When we solve the polynomial equation
$$(L - y_1)(L - y_2)\cdots(L - y_p) = 0, \qquad (4)$$
the requirement that $|\lambda_i| < 1$ is equivalent to all roots of (4) lying outside the unit circle, i.e. $|y_i| > 1$ for all $i$.

First calculate the expectation of $x_t$: $E(x_t) = 0$. To compute the second moments, one method is to invert the process into an MA process and use the formula for the autocovariance function of an MA($\infty$). This method requires finding the moving average coefficients $\psi$; an alternative method, known as the Yule-Walker method, may be more convenient for finding the autocovariance functions. To illustrate this method, take an AR(2) process as an example:

$$x_t = \phi_1 x_{t-1} + \phi_2 x_{t-2} + \epsilon_t$$
Multiply both sides of the equation by $x_t, x_{t-1}, x_{t-2}, \ldots$, take expectations, and then divide by $\gamma(0)$; we get the following equations:
$$1 = \phi_1\rho(1) + \phi_2\rho(2) + \sigma^2/\gamma(0)$$
$$\rho(1) = \phi_1 + \phi_2\rho(1)$$
$$\rho(2) = \phi_1\rho(1) + \phi_2$$
$$\rho(k) = \phi_1\rho(k-1) + \phi_2\rho(k-2) \quad \text{for } k \ge 3$$
$\rho(1)$ can first be solved from the second equation, $\rho(1) = \phi_1/(1 - \phi_2)$; $\rho(2)$ can then be solved from the third equation; $\rho(k)$ can be solved recursively using $\rho(1)$ and $\rho(2)$; and finally, $\gamma(0)$ can be solved from the first equation. Using $\gamma(0)$ and $\rho(k)$, $\gamma(k)$ can be computed as $\gamma(k) = \rho(k)\gamma(0)$. Figure 2 plots this autocorrelation for $k = 0, \ldots, 50$ with the parameters set to $\phi_1 = 0.5$ and $\phi_2 = 0.3$. As is clear from the graph, the autocorrelation is very close to zero when $k > 40$.
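The Yule-Walker recursion above is straightforward to code. A sketch (my own illustration, assuming numpy) for $\phi_1 = 0.5$, $\phi_2 = 0.3$ and $\sigma^2 = 1$:

```python
import numpy as np

phi1, phi2, sigma2, K = 0.5, 0.3, 1.0, 51
rho = np.empty(K)
rho[0] = 1.0
rho[1] = phi1 / (1.0 - phi2)                 # from rho(1) = phi1 + phi2 * rho(1)
rho[2] = phi1 * rho[1] + phi2                # from the third equation
for k in range(3, K):
    rho[k] = phi1 * rho[k - 1] + phi2 * rho[k - 2]

gamma0 = sigma2 / (1.0 - phi1 * rho[1] - phi2 * rho[2])  # from the first equation
gamma = rho * gamma0
print(round(rho[1], 3), round(gamma0, 3), round(rho[40], 4))
```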


    Figure 2: Plot of the autocorrelation of AR(2) process, with  φ1  = 0.5 and  φ2 = 0.3


    5.6 ARMA( p, q )

Given an invertible ARMA($p, q$) process, we have shown that
$$\phi(L)x_t = \theta(L)\epsilon_t;$$
inverting $\phi(L)$ we obtain
$$x_t = \phi(L)^{-1}\theta(L)\epsilon_t = \psi(L)\epsilon_t.$$
Therefore, an ARMA($p, q$) process is stationary as long as $\phi(L)$ is invertible. In other words, the stationarity of the ARMA process depends only on the autoregressive parameters, and not on the moving average parameters (assuming that all parameters are finite).

The expectation of this process is $E(x_t) = 0$. To find the autocovariance function, first we can invert the process to an MA process and find the MA coefficients $\psi(L) = \phi(L)^{-1}\theta(L)$. We have shown an example of finding $\psi$ for the ARMA(1, 1) process, where we have
$$(1 - \phi L)x_t = (1 + \theta L)\epsilon_t, \qquad x_t = \psi(L)\epsilon_t = \sum_{j=0}^{\infty}\psi_j\epsilon_{t-j},$$

    process we have

    γ x(0) =

    ∞Xk=0

    ψ2kσ2

    =

    1 +

    ∞Xk=1

    φ2(k−1)(φ + θ)2

    !σ2

    =

    1 +

     (φ + θ)2

    1− φ2σ2

    If we plug in some numbers, say,  φ  = 0.5 and  θ  = 0.5, so the original process is  xt = 0.5xt−1 + t +0.5t−1, then  γ x(0) = (7/3)σ

    2

    . For  h ≥ 1,

$$\gamma_x(h) = \sum_{k=0}^{\infty}\psi_k\psi_{k+h}\sigma^2 = \left(\phi^{h-1}(\phi + \theta) + \phi^{h-2}(\phi + \theta)^2\sum_{k=1}^{\infty}\phi^{2k}\right)\sigma^2 = \phi^{h-1}(\phi + \theta)\left(1 + \frac{(\phi + \theta)\phi}{1 - \phi^2}\right)\sigma^2.$$
Plugging in $\phi = \theta = 0.5$, we have, for $h \ge 1$,
$$\gamma_x(h) = \frac{5 \cdot 2^{1-h}}{3}\sigma^2.$$


An alternative way to compute the autocovariance function is to multiply each side of $\phi(L)x_t = \theta(L)\epsilon_t$ by $x_t, x_{t-1}, \ldots$ and take expectations. In our ARMA(1, 1) example, this gives
$$\gamma_x(0) - \phi\gamma_x(1) = [1 + \theta(\theta + \phi)]\sigma^2$$
$$\gamma_x(1) - \phi\gamma_x(0) = \theta\sigma^2$$
$$\gamma_x(2) - \phi\gamma_x(1) = 0$$
$$\vdots$$
$$\gamma_x(h) - \phi\gamma_x(h-1) = 0 \quad \text{for } h \ge 2,$$
where we use $x_t = \psi(L)\epsilon_t$ in taking expectations on the right side; for instance, $E(x_t\epsilon_t) = E((\epsilon_t + \psi_1\epsilon_{t-1} + \ldots)\epsilon_t) = \sigma^2$. Plugging in $\theta = \phi = 0.5$ and solving these equations, we have $\gamma_x(0) = (7/3)\sigma^2$, $\gamma_x(1) = (5/3)\sigma^2$, and $\gamma_x(h) = \gamma_x(h-1)/2$ for $h \ge 2$. This is the same result as we got using the first method.
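The first two equations above form a small linear system in $\gamma_x(0)$ and $\gamma_x(1)$; a sketch (my own illustration, assuming numpy) solves it for $\phi = \theta = 0.5$ and $\sigma^2 = 1$ and recovers $7/3$ and $5/3$:

```python
import numpy as np

phi, theta, sigma2 = 0.5, 0.5, 1.0
A = np.array([[1.0, -phi],
              [-phi, 1.0]])
b = np.array([(1.0 + theta * (theta + phi)) * sigma2,
              theta * sigma2])
gamma0, gamma1 = np.linalg.solve(A, b)
print(gamma0, 7 / 3)   # 2.333...
print(gamma1, 5 / 3)   # 1.666...
```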

Summary: An MA process is stationary if and only if its coefficients $\{\theta_k\}$ are square summable (absolutely summable), i.e. $\sum_{k=0}^{\infty}\theta_k^2 < \infty$ or $\sum_{k=0}^{\infty}|\theta_k| < \infty$. Therefore, an MA process with a finite number of MA coefficients is always stationary. Note that stationarity does not require the MA process to be invertible.

An AR process is stationary if it is invertible, i.e. $|\lambda_i| < 1$ or $|y_i| > 1$, as defined in (3) and (4) respectively. An ARMA($p, q$) process is stationary if its autoregressive lag polynomial is invertible.

    5.7 Autocovariance generating function of stationary ARMA process

For a covariance stationary process, we have seen that the autocovariance function is very useful in describing the process. One way to summarize absolutely summable autocovariance functions ($\sum_{h=-\infty}^{\infty}|\gamma(h)| < \infty$) is to use the autocovariance-generating function
$$g_x(z) = \sum_{h=-\infty}^{\infty}\gamma(h)z^h,$$
where $z$ could be a complex number.

For white noise, the autocovariance-generating function (AGF) is just a constant, i.e. for $\epsilon \sim WN(0, \sigma^2)$, $g_\epsilon(z) = \sigma^2$.

For an MA(1) process,
$$x_t = (1 + \theta L)\epsilon_t, \quad \epsilon \sim WN(0, \sigma^2),$$
we can compute
$$g_x(z) = \sigma^2[\theta z^{-1} + (1 + \theta^2) + \theta z] = \sigma^2(1 + \theta z)(1 + \theta z^{-1}).$$

For an MA($q$) process,
$$x_t = (1 + \theta_1 L + \ldots + \theta_q L^q)\epsilon_t,$$
we know that $\gamma_x(h) = \sum_{k=0}^{q-h}\theta_k\theta_{k+h}\sigma^2$ for $h = 1, \ldots, q$ and $\gamma_x(h) = 0$ for $h > q$. We have
$$g_x(z) = \sum_{h=-\infty}^{\infty}\gamma(h)z^h$$


$$= \sigma^2\left(\sum_{k=0}^{q}\theta_k^2 + \sum_{h=1}^{q}\sum_{k=0}^{q-h}\theta_k\theta_{k+h}(z^{-h} + z^{h})\right) = \sigma^2\left(\sum_{k=0}^{q}\theta_k z^k\right)\left(\sum_{k=0}^{q}\theta_k z^{-k}\right)$$

For an MA($\infty$) process $x_t = \theta(L)\epsilon_t$ where $\sum_{k=0}^{\infty}|\theta_k| < \infty$, we can naturally replace $q$ by $\infty$ in the AGF for MA($q$) to get the AGF for MA($\infty$),
$$g_x(z) = \sigma^2\left(\sum_{k=0}^{\infty}\theta_k z^k\right)\left(\sum_{k=0}^{\infty}\theta_k z^{-k}\right) = \sigma^2\theta(z)\theta(z^{-1}).$$

Next, for a stationary AR or ARMA process, we can invert it to an MA process. For instance, take an AR(1) process, $(1 - \phi L)x_t = \epsilon_t$, and invert it to
$$x_t = \frac{1}{1 - \phi L}\epsilon_t;$$
its AGF is
$$g_x(z) = \frac{\sigma^2}{(1 - \phi z)(1 - \phi z^{-1})},$$
which equals
$$\sigma^2\left(\sum_{k=0}^{\infty}\theta_k z^k\right)\left(\sum_{k=0}^{\infty}\theta_k z^{-k}\right) = \sigma^2\theta(z)\theta(z^{-1}),$$
where $\theta_k = \phi^k$. In general, the AGF for an ARMA($p, q$) process is
$$g_x(z) = \sigma^2\frac{(1 + \theta_1 z + \ldots + \theta_q z^q)(1 + \theta_1 z^{-1} + \ldots + \theta_q z^{-q})}{(1 - \phi_1 z - \ldots - \phi_p z^p)(1 - \phi_1 z^{-1} - \ldots - \phi_p z^{-p})} = \sigma^2\frac{\theta(z)\theta(z^{-1})}{\phi(z)\phi(z^{-1})}.$$

    6 Simulated ARMA process

In this section, we plot a few simulated ARMA processes. In the simulations, the errors are Gaussian white noise, i.i.d. $N(0, 1)$. As a comparison, we first plot a Gaussian white noise (or AR(1) with $\phi = 0$) in Figure 3. Then, we plot AR(1) processes with $\phi = 0.4$ and $\phi = 0.9$ in Figure 4 and Figure 5. As you can see, the white noise process is very choppy and patternless. When $\phi = 0.4$, the series becomes a bit smoother, and when $\phi = 0.9$, the departures from the mean (zero) are very prolonged. Figure 6 plots an AR(2) process whose coefficients are set to the numbers in our example in this lecture. Finally, Figure 7 plots an MA(3) process. Comparing this MA(3) process with the white noise, we can see an increase in volatility (the variance of the white noise is 1 and the variance of the MA(3) process is 1.77).
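A sketch of how such simulations can be produced (my own illustration; numpy, scipy and matplotlib are assumed, and the figures in the notes were not necessarily generated this way): each series is a Gaussian white noise of length 200 passed through the corresponding ARMA filter.

```python
import numpy as np
from scipy.signal import lfilter
import matplotlib.pyplot as plt

rng = np.random.default_rng(4)
e = rng.standard_normal(200)
series = {
    "white noise": e,
    "AR(1), phi=0.4": lfilter([1.0], [1.0, -0.4], e),
    "AR(1), phi=0.9": lfilter([1.0], [1.0, -0.9], e),
    "AR(2), phi=(0.6, 0.2)": lfilter([1.0], [1.0, -0.6, -0.2], e),
    "MA(3), theta=(0.6, -0.5, 0.4)": lfilter([1.0, 0.6, -0.5, 0.4], [1.0], e),
}
fig, axes = plt.subplots(len(series), 1, figsize=(7, 10), sharex=True)
for ax, (name, x) in zip(axes, series.items()):
    ax.plot(x)
    ax.set_title(name, fontsize=9)
plt.tight_layout()
plt.show()
```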



    Figure 3: A Gaussian white noise time series


    Figure 4: A simulated AR(1) process, with  φ = 0.4



Figure 5: A simulated AR(1) process, with φ = 0.9


    Figure 6: A simulated AR(2) process, with  φ1 = 0.6,  φ2 = 0.2



    Figure 7: A simulated MA(3) process, with  θ1 = 0.6,  θ2 = −0.5, and  θ3  = 0.4

7 Forecasting with ARMA Models

    7.1 Principles of forecasting

If we are interested in forecasting a random variable $y_{t+h}$ based on the observations of $x$ up to time $t$ (denoted by $X$), we can have different candidate forecasts, denoted by $g(X)$. If our criterion for picking the best forecast is to minimize the mean squared error (MSE), then the best forecast is the conditional expectation, $g(X) = E(y_{t+h}\mid X)$. The proof can be found on page 73 in Hamilton. In the following discussion, we assume that the data generating process is known (so the parameters are known), so we can compute the conditional moments.

    7.2 AR models

Let's start with an AR(1) process,
$$x_t = \phi x_{t-1} + \epsilon_t,$$
where we continue to assume that $\epsilon_t$ is white noise with mean zero and variance $\sigma^2$. Then we can compute
$$E_t(x_{t+1}) = E_t(\phi x_t + \epsilon_{t+1}) = \phi x_t$$
$$E_t(x_{t+2}) = E_t(\phi^2 x_t + \phi\epsilon_{t+1} + \epsilon_{t+2}) = \phi^2 x_t$$
$$\vdots$$
$$E_t(x_{t+k}) = E_t(\phi^k x_t + \phi^{k-1}\epsilon_{t+1} + \ldots + \epsilon_{t+k}) = \phi^k x_t$$
and the variances
$$\mathrm{Var}_t(x_{t+1}) = \mathrm{Var}_t(\phi x_t + \epsilon_{t+1}) = \sigma^2$$
$$\mathrm{Var}_t(x_{t+2}) = \mathrm{Var}_t(\phi^2 x_t + \phi\epsilon_{t+1} + \epsilon_{t+2}) = (1 + \phi^2)\sigma^2$$
$$\vdots$$
$$\mathrm{Var}_t(x_{t+k}) = \mathrm{Var}_t(\phi^k x_t + \phi^{k-1}\epsilon_{t+1} + \ldots + \epsilon_{t+k}) = \sum_{j=0}^{k-1}\phi^{2j}\sigma^2$$


Note that as $k \to \infty$, $E_t(x_{t+k}) \to 0$, which is the unconditional expectation of $x_t$, and $\mathrm{Var}_t(x_{t+k}) \to \sigma^2/(1 - \phi^2)$, which is the unconditional variance of $x_t$. Similarly, for an AR($p$) process, we can forecast recursively.
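A compact sketch of the AR(1) forecast formulas above (my own addition, assuming numpy; the values of $x_t$, $\phi$ and $\sigma^2$ are arbitrary): as $k$ grows, the point forecast decays to the unconditional mean 0 and the forecast variance rises to $\sigma^2/(1 - \phi^2)$.

```python
import numpy as np

def ar1_forecast(x_t, phi, sigma2, k):
    """k-step-ahead conditional mean and variance for an AR(1)."""
    mean = phi ** k * x_t
    var = sigma2 * float(np.sum(phi ** (2 * np.arange(k))))
    return mean, var

for k in (1, 2, 10, 100):
    m, v = ar1_forecast(x_t=1.5, phi=0.8, sigma2=1.0, k=k)
    print(k, round(m, 4), round(v, 4))  # mean -> 0, variance -> 1/(1 - 0.64) = 2.78
```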

    7.3 MA Models

For an MA(1) process,
$$x_t = \epsilon_t + \theta\epsilon_{t-1},$$
if we know $\epsilon_t$, then
$$E_t(x_{t+1}) = E_t(\epsilon_{t+1} + \theta\epsilon_t) = \theta\epsilon_t$$
$$E_t(x_{t+2}) = E_t(\epsilon_{t+2} + \theta\epsilon_{t+1}) = 0$$
$$\vdots$$
$$E_t(x_{t+k}) = E_t(\epsilon_{t+k} + \theta\epsilon_{t+k-1}) = 0$$
and
$$\mathrm{Var}_t(x_{t+1}) = \mathrm{Var}_t(\epsilon_{t+1} + \theta\epsilon_t) = \sigma^2$$
$$\mathrm{Var}_t(x_{t+2}) = \mathrm{Var}_t(\epsilon_{t+2} + \theta\epsilon_{t+1}) = (1 + \theta^2)\sigma^2$$
$$\vdots$$
$$\mathrm{Var}_t(x_{t+k}) = \mathrm{Var}_t(\epsilon_{t+k} + \theta\epsilon_{t+k-1}) = (1 + \theta^2)\sigma^2$$

It is easy to see that for an MA(1) process, the conditional expectation two steps ahead and beyond is the same as the unconditional expectation, and so is the variance. Next, for an MA($q$) model,
$$x_t = \epsilon_t + \theta_1\epsilon_{t-1} + \theta_2\epsilon_{t-2} + \ldots + \theta_q\epsilon_{t-q} = \sum_{j=0}^{q}\theta_j\epsilon_{t-j},$$
if we know $\epsilon_t, \epsilon_{t-1}, \ldots, \epsilon_{t-q}$, then

$$E_t(x_{t+1}) = E_t\Big(\sum_{j=0}^{q}\theta_j\epsilon_{t+1-j}\Big) = \sum_{j=1}^{q}\theta_j\epsilon_{t+1-j}$$
$$E_t(x_{t+2}) = E_t\Big(\sum_{j=0}^{q}\theta_j\epsilon_{t+2-j}\Big) = \sum_{j=2}^{q}\theta_j\epsilon_{t+2-j}$$
$$\vdots$$
$$E_t(x_{t+k}) = E_t\Big(\sum_{j=0}^{q}\theta_j\epsilon_{t+k-j}\Big) = \sum_{j=k}^{q}\theta_j\epsilon_{t+k-j} \quad \text{for } k \le q$$
$$E_t(x_{t+k}) = E_t\Big(\sum_{j=0}^{q}\theta_j\epsilon_{t+k-j}\Big) = 0 \quad \text{for } k > q$$


and
$$\mathrm{Var}_t(x_{t+1}) = \mathrm{Var}_t\Big(\sum_{j=0}^{q}\theta_j\epsilon_{t+1-j}\Big) = \sigma^2$$
$$\mathrm{Var}_t(x_{t+2}) = \mathrm{Var}_t\Big(\sum_{j=0}^{q}\theta_j\epsilon_{t+2-j}\Big) = (1 + \theta_1^2)\sigma^2$$
$$\vdots$$
$$\mathrm{Var}_t(x_{t+k}) = \mathrm{Var}_t\Big(\sum_{j=0}^{q}\theta_j\epsilon_{t+k-j}\Big) = \sum_{j=0}^{k-1}\theta_j^2\sigma^2 \quad \text{for all } k > 0 \ (\text{with } \theta_j = 0 \text{ for } j > q).$$
We can see that for an MA($q$) process, the conditional expectation and variance of the forecast for $q+1$ steps ahead and beyond are the same as the unconditional expectation and variance.

    8 Wold Decomposition

So far we have focused on ARMA models, which are linear time series models. Is there any relationship between a general covariance stationary process (possibly nonlinear) and linear representations? The answer is given by the Wold decomposition theorem:

Proposition 2 (Wold Decomposition). Any zero-mean covariance stationary process $x_t$ can be represented in the form
$$x_t = \sum_{j=0}^{\infty}\psi_j\epsilon_{t-j} + V_t$$
where

(i) $\psi_0 = 1$ and $\sum_{j=0}^{\infty}\psi_j^2 < \infty$;

(ii) $\epsilon_t \sim WN(0, \sigma^2)$;

(iii) $E(\epsilon_t V_s) = 0$ for all $s, t > 0$;

(iv) $\epsilon_t$ is the error in forecasting $x_t$ on the basis of a linear function of lagged $x$:
$$\epsilon_t = x_t - E(x_t \mid x_{t-1}, x_{t-2}, \ldots);$$

(v) $V_t$ is a deterministic process and it can be predicted from a linear function of lagged $x$.

Remarks: The Wold decomposition says that any covariance stationary process has a linear representation: a linear deterministic component ($V_t$) and a linear indeterministic component ($\sum_j\psi_j\epsilon_{t-j}$). If $V_t = 0$, then the process is said to be purely non-deterministic, and the process can be represented as an MA($\infty$) process. Basically, $\epsilon_t$ is the error from the projection of $x_t$ on lagged $x$; therefore it is uniquely determined, and it is orthogonal to lagged $x$ and lagged $\epsilon$. Since this error $\epsilon$ is the residual from the projection, it need not be the true error in the DGP of $x_t$. Also note that the error term ($\epsilon$) is a white noise process, and does not need to be i.i.d.

Readings: Hamilton, Ch. 1-4; Brockwell and Davis, Ch. 3; Hayashi, Ch. 6.1, 6.2.


    Lecture 3: Spectral Analysis∗

Any covariance stationary process has both a time domain representation and a spectrum domain representation. So far, our analysis has been in the time domain, as we represent a time series $\{x_t\}$ in terms of past values of innovations and investigate the dependence of $x$ at distinct times. In some cases, a spectrum-domain representation is more convenient for describing a process. To transform a time-domain representation into a spectrum-domain representation, we use the Fourier transform.

    1 Fourier Transforms

Let $\omega$ denote the frequency ($-\pi < \omega < \pi$), and let $T$ denote the period: the minimum time that it takes the wave to go through a whole cycle, so that $T = 2\pi/\omega$. Given any integer $z$, we have $x(t) = x(t + zT)$. Finally, we will let $\phi$ denote the phase: the amount by which a wave is shifted.

Given a time series $\{x_t\}$, its Fourier transform is
$$x(\omega) = \frac{1}{2\pi}\sum_{t=-\infty}^{\infty}e^{-it\omega}x(t) \qquad (1)$$
and the inverse Fourier transform is
$$x(t) = \int_{-\pi}^{\pi}e^{it\omega}x(\omega)\,d\omega. \qquad (2)$$

    2 Spectrum

Recall that the autocovariance function for a zero-mean stationary process $\{x_t\}$ is defined as
$$\gamma_x(h) = E(x_t x_{t-h}),$$
and it serves to characterize the time series $\{x_t\}$. The spectrum of $\{x_t\}$ is defined to be the Fourier transform of $\gamma_x(h)$,
$$S_x(\omega) = \frac{1}{2\pi}\sum_{h=-\infty}^{\infty}e^{-ih\omega}\gamma_x(h). \qquad (3)$$
Recall that the autocovariance generating function is $g_x(z) = \sum_{h=-\infty}^{\infty}\gamma_x(h)z^h$; if we let $z = e^{-i\omega}$, then the spectrum is just the autocovariance generating function divided by $2\pi$. In (3), if we take $\omega = 0$, we see that
$$\sum_{h=-\infty}^{\infty}\gamma_x(h) = 2\pi S_x(0),$$

    ∗Copyright 2002-2006 by Ling Hu.


which tells us that the sum of the autocovariances equals the spectrum at zero multiplied by $2\pi$. Using the identity
$$e^{i\phi} = \cos\phi + i\sin\phi,$$
we can also write (3) as
$$S_x(\omega) = \frac{1}{2\pi}\left[\gamma_x(0) + 2\sum_{h=1}^{\infty}\gamma_x(h)\cos(h\omega)\right]. \qquad (4)$$

Note that since $\cos(\omega) = \cos(-\omega)$ and $\gamma_x(h) = \gamma_x(-h)$, the spectrum is symmetric about zero. Also, the cosine function is periodic with period $2\pi$; therefore, for spectral analysis we only need to find the spectrum for $\omega \in [0, \pi]$. Now, if we know $\gamma_x(h)$, we can compute the spectrum using (4), and if we know the spectrum $S_x(\omega)$, we can compute $\gamma_x(h)$ using the inverse Fourier transform
$$\gamma_x(h) = \int_{-\pi}^{\pi}e^{i\omega h}S_x(\omega)\,d\omega. \qquad (5)$$
Let $h = 0$; then (5) gives the variance of $\{x_t\}$,
$$\gamma_x(0) = \int_{-\pi}^{\pi}S_x(\omega)\,d\omega.$$
So the variance of $\{x_t\}$ is just the integral of the spectrum over all frequencies $-\pi < \omega < \pi$. Therefore, we can see that the spectrum $S_x(\omega)$ decomposes the variance into components contributed by each frequency. In other words, we can use the spectrum to find the importance of cycles of different frequencies.

If we normalize the spectrum $S_x(\omega)$ by dividing by $\gamma_x(0)$, we get the Fourier transform of the autocorrelation function $\rho_x(h)$,
$$f_x(\omega) = \frac{1}{2\pi}\sum_{h=-\infty}^{\infty}e^{-ih\omega}\rho_x(h). \qquad (6)$$
The autocorrelation function can be recovered from $f_x(\omega)$ using the inverse transform
$$\rho_x(h) = \int_{-\pi}^{\pi}e^{i\omega h}f_x(\omega)\,d\omega. \qquad (7)$$
Again, letting $h = 0$, (7) gives
$$1 = \int_{-\pi}^{\pi}f_x(\omega)\,d\omega.$$
Note that $f_x(\omega)$ is positive and integrates to one, just like a probability density, so we call it the spectral density.

Example 1 (Spectral density of white noise): Let $\epsilon \sim WN(0, \sigma^2)$. We have $\gamma_\epsilon(0) = \sigma^2$ and $\gamma_\epsilon(h) = 0$ for $h \neq 0$. Using (3) and (6), we can compute
$$S_\epsilon(\omega) = \frac{1}{2\pi}\gamma_\epsilon(0) = \frac{\sigma^2}{2\pi}.$$
Dividing by $\gamma_\epsilon(0)$, we have
$$f_\epsilon(\omega) = \frac{1}{2\pi}.$$
So the spectral density is uniform over $[-\pi, \pi]$, i.e. every frequency contributes equally to the variance.


    3 Spectrum of Filtered Process

Considering that the spectrum of a white noise process is so simple, we may want to know if we can make use of it for a more complicated process, say,
$$x_t = \sum_{k=-\infty}^{\infty}\theta_k\epsilon_{t-k} = \theta(L)\epsilon_t.$$
We call this process a two-sided moving average process. Then what is the relationship between $S_x(\omega)$ and $S_\epsilon(\omega)$? The general solution is given in the following statement.

Proposition 1: If $\{x_t\}$ is a zero mean stationary process with spectrum $S_x(\omega)$, and $\{y_t\}$ is the process
$$y_t = \sum_{k=-\infty}^{\infty}\theta_k x_{t-k} = \theta(L)x_t,$$
where $\theta$ is absolutely summable, then
$$S_y(\omega) = \Big|\sum_{k=-\infty}^{\infty}\theta_k e^{-ik\omega}\Big|^2 S_x(\omega) = \big|\theta(e^{-i\omega})\big|^2 S_x(\omega).$$
Proof: We start from the autocovariance function of $y$,

$$\gamma_y(h) = E(y_t y_{t-h}) = E\Big(\sum_{j=-\infty}^{\infty}\theta_j x_{t-j}\sum_{k=-\infty}^{\infty}\theta_k x_{t-h-k}\Big) = \sum_{j,k=-\infty}^{\infty}\theta_j\theta_k E(x_{t-j}x_{t-h-k}) = \sum_{j,k=-\infty}^{\infty}\theta_j\theta_k\gamma_x(h + k - j).$$

Next, consider the spectrum of $y$,
$$S_y(\omega) = \frac{1}{2\pi}\sum_{h=-\infty}^{\infty}e^{-ih\omega}\gamma_y(h) = \frac{1}{2\pi}\sum_{h=-\infty}^{\infty}e^{-ih\omega}\sum_{j,k=-\infty}^{\infty}\theta_j\theta_k\gamma_x(h + k - j).$$
(Let $l = h + k - j$ and note that $S_x(\omega) = \frac{1}{2\pi}\sum_{l=-\infty}^{\infty}e^{-il\omega}\gamma_x(l)$, so we want to construct such a term and see what remains.)
$$S_y(\omega) = \Big(\sum_{j=-\infty}^{\infty}e^{-ij\omega}\theta_j\Big)\Big(\sum_{k=-\infty}^{\infty}e^{ik\omega}\theta_k\Big)\Big(\frac{1}{2\pi}\sum_{l=-\infty}^{\infty}e^{-il\omega}\gamma_x(l)\Big) = \theta(e^{-i\omega})\theta(e^{i\omega})S_x(\omega) = \theta(e^{-i\omega})\overline{\theta(e^{-i\omega})}S_x(\omega) = \big|\theta(e^{-i\omega})\big|^2 S_x(\omega).$$


Example 2: To apply this result, first consider the problem of computing the spectrum of an MA(1) process,
$$x_t = \epsilon_t + \theta\epsilon_{t-1} = (1 + \theta L)\epsilon_t.$$
In this problem, $\theta(e^{-i\omega}) = 1 + \theta e^{-i\omega}$, thus
$$\big|\theta(e^{-i\omega})\big|^2 = (1 + \theta e^{-i\omega})(1 + \theta e^{i\omega}) = 1 + \theta^2 + \theta(e^{-i\omega} + e^{i\omega}).$$
Therefore,
$$S_x(\omega) = \big|\theta(e^{-i\omega})\big|^2 S_\epsilon(\omega) = \frac{1}{2\pi}\big[1 + \theta^2 + \theta(e^{-i\omega} + e^{i\omega})\big]\sigma^2.$$

We can verify this result by using the spectrum to compute the autocovariance function, say $\gamma_x(1)$. Using (5),
$$\gamma_x(1) = \int_{-\pi}^{\pi}e^{i\omega}S_x(\omega)\,d\omega = \frac{\sigma^2}{2\pi}\int_{-\pi}^{\pi}e^{i\omega}\big[1 + \theta^2 + \theta(e^{-i\omega} + e^{i\omega})\big]\,d\omega = \frac{\sigma^2}{2\pi}\cdot 2\pi\theta = \theta\sigma^2,$$
which is the same as what we got from working in the time domain. In the computation we use the fact that $\int_{-\pi}^{\pi}e^{i\omega}\,d\omega = 0$, as the integral of sine or cosine functions all the way around a circle is zero.
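A numerical version of this check (my own sketch, assuming numpy; the values $\theta = 0.5$ and $\sigma^2 = 1$ are illustrative): integrate $e^{i\omega}S_x(\omega)$ over $[-\pi, \pi]$ on a fine grid and recover $\gamma_x(1) = \theta\sigma^2$.

```python
import numpy as np

theta, sigma2 = 0.5, 1.0
w = np.linspace(-np.pi, np.pi, 200_001)
S = (sigma2 / (2 * np.pi)) * (1 + theta ** 2 + 2 * theta * np.cos(w))
gamma1 = np.sum(np.exp(1j * w) * S) * (w[1] - w[0])   # Riemann-sum approximation of (5)
print(round(gamma1.real, 4), theta * sigma2)          # ~0.5; imaginary part ~ 0
```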

Figure 1 plots the spectrum of MA(1) processes with positive and negative coefficients. When $\theta > 0$, we see that the spectrum is high for low frequencies and low for high frequencies. When $\theta < 0$, we observe the opposite. This is because when $\theta$ is positive, we have positive one-lag correlation, which makes the series smooth, with only a small contribution from high frequency (say, day to day) components. When $\theta$ is negative, we have negative one-lag correlation, and therefore the series fluctuates rapidly about its mean value.

Above we have considered a moving average process; the next proposition gives the result for an ARMA model with white noise errors:

Proposition 2: Let $\{x_t\}$ be an ARMA($p, q$) process satisfying
$$\phi(L)x_t = \theta(L)\epsilon_t,$$



Figure 1: Plots of the spectrum of MA(1) processes (θ = 0.5 for the left figure and θ = −0.5 for the right figure)

where $\epsilon \sim WN(0, \sigma^2)$ and all roots of $\phi(L)$ lie outside the unit circle; then the spectrum of $x_t$ is
$$S_x(\omega) = \frac{|\theta(e^{-i\omega})|^2}{|\phi(e^{-i\omega})|^2}S_\epsilon(\omega) = \frac{1}{2\pi}\frac{|\theta(e^{-i\omega})|^2}{|\phi(e^{-i\omega})|^2}\sigma^2.$$

Example 3: Consider an AR(1) process,
$$x_t = \phi x_{t-1} + \epsilon_t.$$
Using the above proposition,
$$S_x(\omega) = \frac{\sigma^2}{2\pi}\big|1 - \phi e^{-i\omega}\big|^{-2} = \frac{\sigma^2}{2\pi}(1 + \phi^2 - 2\phi\cos\omega)^{-1}. \qquad (8)$$

Figure 2 plots the spectrum of AR(1) processes with positive and negative coefficients. We have similar observations here as for the MA processes. However, note that as $\phi \to 1$, $S_x(0) \to \infty$, which means that a random walk process has an infinite spectrum at frequency zero. This is similar to what happens when we work with summation and differencing. When we add up a white noise (say, $\phi = 1$, as in a random walk), the high frequencies are smoothed out (those spikes in the white noise disappear) and what is left is the long term stochastic trend. On the contrary, when we difference (say, take the first difference of a random walk, so that we are back to the white noise series), we get rid of the long term trend, and what is left is the high frequencies (lots of spikes with mean zero, say).
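A small sketch (my own illustration, assuming numpy; $\phi = 0.5$ and $\sigma^2 = 1$ are illustrative values) evaluates (8) on a grid and checks that the spectrum integrates over $[-\pi, \pi]$ to the variance $\sigma^2/(1 - \phi^2)$, illustrating the variance decomposition across frequencies.

```python
import numpy as np

phi, sigma2 = 0.5, 1.0
w = np.linspace(-np.pi, np.pi, 200_001)
S = (sigma2 / (2 * np.pi)) / (1 + phi ** 2 - 2 * phi * np.cos(w))
total = np.sum(S) * (w[1] - w[0])                        # integral of the spectrum
print(round(float(total), 4), sigma2 / (1 - phi ** 2))   # both ~ 1.3333
```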

Finally, we introduce a spectral representation theorem without proof. For a zero-mean stationary process with absolutely summable autocovariances, one can define random variables $\alpha(\omega)$ and $\delta(\omega)$ such that the series can be represented in the form
$$x_t = \int_{0}^{\pi}\big[\alpha(\omega)\cos(\omega t) + \delta(\omega)\sin(\omega t)\big]\,d\omega,$$



Figure 2: Plots of the spectrum of AR(1) processes (φ = 0.5 for the left figure and φ = −0.5 for the right figure)

where $\alpha(\omega)$ and $\delta(\omega)$ have zero mean and are mutually and serially uncorrelated. The representation theorem tells us that a stationary process with absolutely summable autocovariances can be written as a weighted sum of periodic functions.

    4 Cross Spectrum and Spectrum of a Sum

The spectrum is an autocovariance generating function, and we can use it to compute the autocovariances of a stationary process. Besides computing the autocovariance of a single time series, spectrum functions can also capture the covariance across two time series. We call such spectrum functions the cross spectrum.

For a single time series $\{x_t\}$, the spectrum function is the Fourier transform of the autocovariance function $\gamma_x(h) = E(x_t x_{t-h})$. Similarly, for two time series $\{x_t\}$ and $\{y_t\}$, the cross spectrum is the Fourier transform of the covariance function of $x_t$ and $y_{t-h}$, i.e.
$$S_{xy}(\omega) = \sum_{h=-\infty}^{\infty}e^{-ih\omega}E(x_t y_{t-h}).$$

In general,
$$S_{xy}(\omega) \neq S_{yx}(\omega) = \sum_{h=-\infty}^{\infty}e^{-ih\omega}E(y_t x_{t-h}),$$
but they have the following relationship:
$$S_{xy}(\omega) = \overline{S_{yx}(\omega)} = S_{yx}(-\omega),$$
which is easy to verify:
$$S_{xy}(\omega) = \sum_{h=-\infty}^{\infty}e^{-ih\omega}E(x_t y_{t-h})$$


$$= \sum_{h=-\infty}^{\infty}e^{-ih\omega}E(y_t x_{t+h}) = \sum_{k=-\infty}^{\infty}e^{ik\omega}E(y_t x_{t-k}) \quad (\text{let } k = -h) = \sum_{k=-\infty}^{\infty}e^{-(-ik\omega)}E(y_t x_{t-k}) = S_{yx}(-\omega).$$

Note that if $x_t$ and $y_s$ are uncorrelated for all $t, s$, then $E(x_t y_{t-h}) = 0$ for all $h$, and therefore $S_{xy}(\omega) = S_{yx}(\omega) = 0$. Knowing the cross spectrum, we can next compute the spectrum of a sum. For a process $z_t = x_t + y_t$, the spectrum of $z_t$ can be computed as follows:

$$S_z(\omega) = \sum_{h=-\infty}^{\infty}e^{-ih\omega}E(z_t z_{t-h}) = \sum_{h=-\infty}^{\infty}e^{-ih\omega}E[(x_t + y_t)(x_{t-h} + y_{t-h})]$$
$$= \sum_{h=-\infty}^{\infty}e^{-ih\omega}\big[E(x_t x_{t-h}) + E(x_t y_{t-h}) + E(y_t x_{t-h}) + E(y_t y_{t-h})\big] = S_x(\omega) + S_{xy}(\omega) + S_{yx}(\omega) + S_y(\omega).$$
We proposed before that for a time series $z_t$, its spectrum decomposes its variation into components contributed by each frequency $\omega$. Here, we see another form of decomposition: we can decompose the variation in $z$ into different sources. In particular, if $x_t$ and $y_s$ are uncorrelated for all $t, s$, i.e. $S_{xy}(\omega) = S_{yx}(\omega) = 0$, then we have
$$S_z(\omega) = S_x(\omega) + S_y(\omega).$$

    5 Estimation

In equation (3), we defined the spectrum as
$$S_x(\omega) = \frac{1}{2\pi}\sum_{h=-\infty}^{\infty}e^{-ih\omega}\gamma_x(h).$$
Given a stationary process, the sample autocovariance can be estimated as
$$\hat{\gamma}_x(h) = T^{-1}\sum_{t=h+1}^{T}\big[(x_t - \bar{x})(x_{t-h} - \bar{x})\big].$$
To estimate the spectrum, we may compute the sample analog of (3), which is known as the sample periodogram,
$$I_x(\omega) = \frac{1}{2\pi}\sum_{h=-T+1}^{T-1}e^{-ih\omega}\hat{\gamma}_x(h).$$


Or we can equivalently write it as
$$I_x(\omega) = \frac{1}{2\pi}\left(\hat{\gamma}(0) + 2\sum_{h=1}^{T-1}\hat{\gamma}(h)\cos(\omega h)\right). \qquad (9)$$
We have the following asymptotic distribution of the sample periodogram:
$$\frac{2I_x(\omega)}{S_x(\omega)} \sim \chi^2(2).$$

Since $E(\chi^2(2)) = 2$, the sample periodogram provides an asymptotically unbiased estimate of the spectrum, $\lim_{T\to\infty}E I_x(\omega) = S_x(\omega)$. However, the variance of $I_x(\omega)$ does not go to zero. In fact,
$$\mathrm{Var}(I_x(\omega)) \to \begin{cases}2S_x^2(0) & \text{for } \omega = 0, \\ S_x^2(\omega) & \text{for } \omega \neq 0.\end{cases}$$

Therefore, even when the sample size is very large, the sample periodogram still cannot provide an accurate estimate of the true spectrum. To estimate the spectrum, there are two better approaches. The first is a parametric approach. We can estimate the ARMA model using least squares or MLE to obtain consistent estimates of the parameters, and then plug in these estimates to obtain a consistent estimator for the spectrum. For instance, for an MA(1) process,
$$x_t = \epsilon_t + \theta\epsilon_{t-1}, \quad \epsilon_t \sim WN(0, 1),$$
if we can obtain a consistent estimate of $\theta$, denoted by $\hat{\theta}$, then for any $\omega$,
$$\hat{S}_x(\omega) = \frac{1}{2\pi}\big[1 + \hat{\theta}^2 + \hat{\theta}(e^{-i\omega} + e^{i\omega})\big].$$
A potential problem with parametric estimation is that we have to specify a parametric model for the process, say ARMA($p, q$), so we may have some errors due to misspecification. However, even if the model is incorrectly specified, if the autocovariances of the true process are close to those of our specification, then this procedure can still provide a useful estimate of the population spectrum.

    An alternative approach is to estimate the spectrum nonparametrically. Doing this could saveus from specifying a model for the process. We still make use of the sample periodogram, however,to estimate the spectrum  S x(ω), we use a weighted average of the sample periodogram over severalneighboring ωs. How much weight to put on each ω  in the neighborhood is determined by a functionwhich is known as the kernel, or kernel function. This means that the spectrum is estimated by

    Ŝ x(ω j) =m

    Xl=−m

    k(l, m) · I x(ω j+l).   (10)

The kernel function $k(l, m)$ must satisfy
\[
\sum_{l=-m}^{m} k(l, m) = 1.
\]
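A minimal sketch of (10) in Python (mine, for illustration), using equal weights $k(l, m) = 1/(2m+1)$ over $2m+1$ neighboring Fourier frequencies, which is one simple kernel satisfying the normalization; the periodogram here is computed through its DFT form, which is algebraically the same as the autocovariance form above.

```python
import numpy as np

def smoothed_spectrum(x, m):
    """Average the periodogram over 2m+1 neighboring Fourier frequencies with
    equal weights k(l, m) = 1/(2m+1), one simple kernel that sums to one."""
    T = len(x)
    xc = x - x.mean()
    # Periodogram at the Fourier frequencies 2*pi*j/T, j = 1, ..., T//2
    I = (np.abs(np.fft.fft(xc)) ** 2 / (2.0 * np.pi * T))[1:T // 2 + 1]
    freqs = 2.0 * np.pi * np.arange(1, T // 2 + 1) / T
    S_hat = np.full(len(I), np.nan)
    for j in range(m, len(I) - m):
        S_hat[j] = I[j - m:j + m + 1].mean()
    return freqs, S_hat

# Usage: freqs, S_hat = smoothed_spectrum(x, m=4)
```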


Here $m$ is the bandwidth or window, indicating how many different frequencies are viewed as useful in estimating $S_x(\omega_j)$. Averaging $I_x(\omega)$ over different frequencies can equivalently be represented as multiplying the $h$th autocovariance $\hat{\gamma}(h)$ in (9) by a weight function $w(h, q)$. A derivation can be found on page 166 of Hamilton.

These weight functions $w(h, q)$ satisfy $w(0, q) = 1$, $|w(h, q)| \leq 1$, and $w(h, q) = 0$ for $h > q$. The $q$ in the weight function plays a similar role to the $m$ in $k(l, m)$: it specifies the length of the window. Some commonly used weight functions are listed below; a short code transcription follows the list.

Truncated kernel: let $x = h/q$,
\[
w(x) = \begin{cases} 1 & \text{for } |x| \leq 1, \\ 0 & \text{otherwise.} \end{cases}
\]

Bartlett kernel: let $x = h/q$,
\[
w(x) = \begin{cases} 1 - |x| & \text{for } |x| \leq 1, \\ 0 & \text{otherwise.} \end{cases}
\]

Modified Bartlett kernel:
\[
w(h, q) = \begin{cases} 1 - \dfrac{h}{q+1} & \text{for } h = 1, 2, \ldots, q, \\ 0 & \text{otherwise.} \end{cases}
\]

Parzen kernel: let $x = h/q$,
\[
w(x) = \begin{cases} 1 - 6|x|^2 + 6|x|^3 & \text{for } |x| < 1/2, \\ 2(1 - |x|)^3 & \text{for } 1/2 \leq |x| \leq 1, \\ 0 & \text{otherwise.} \end{cases}
\]
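For reference, here is a direct transcription of these weight functions into Python (a sketch; the function names are mine). The value at $h = 0$ is set to 1, in line with the requirement $w(0, q) = 1$ stated above.

```python
def w_truncated(h, q):
    return 1.0 if abs(h) <= q else 0.0

def w_bartlett(h, q):
    x = abs(h) / q
    return 1.0 - x if x <= 1.0 else 0.0

def w_modified_bartlett(h, q):
    if h == 0:
        return 1.0                      # w(0, q) = 1, as required above
    return 1.0 - h / (q + 1.0) if 1 <= h <= q else 0.0

def w_parzen(h, q):
    x = abs(h) / q
    if x < 0.5:
        return 1.0 - 6.0 * x ** 2 + 6.0 * x ** 3
    if x <= 1.0:
        return 2.0 * (1.0 - x) ** 3
    return 0.0
```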

A typical problem in nonparametric estimation is the trade-off between variance and bias: a large bandwidth usually reduces the variance but induces bias. To reduce the variance without adding much bias, we need to choose a proper bandwidth. In practice, we may plot estimates of the spectrum using several different bandwidths and use subjective judgment to choose among them. Roughly, if the plot is too flat, it is hard to extract information such as which frequencies are more important than others; on the other hand, if the plot is too choppy (too many peaks and valleys mixed together), it is hard to draw convincing conclusions.

    Example 4  (Spectrum estimation of an AR(1) process). The data are generated from

\[
x_t = \phi x_{t-1} + \epsilon_t, \qquad \phi = 0.5, \quad \epsilon_t \sim i.i.d.\ N(0, 1).
\]

We simulated a sequence of length $n = 200$ from this DGP; the OLS estimate of $\phi$ is 0.59 (the OLS estimator is consistent in this problem). The upper-left panel of Figure 3 plots the population spectrum, i.e., (8) with $\phi = 0.5$. The upper-right panel plots the estimated spectrum, i.e., (8) with the OLS estimate 0.59. The lower-left panel plots the sample periodogram $I_x(\omega)$, which is very volatile. Finally, the lower-right panel plots the smoothed estimate of the spectrum using the Bartlett kernel, i.e.,

\[
\hat{S}_x(\omega) = (2\pi)^{-1}\left(\hat{\gamma}_x(0) + 2\sum_{j=1}^{q}\left(1 - \frac{j}{q+1}\right)\hat{\gamma}_x(j)\cos(\omega j)\right),
\]
where $q$ is set to 5.
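The following sketch reproduces the ingredients of this example in Python (my own code, not the author's). It assumes that equation (8) is the standard AR(1) spectrum $\sigma^2 / [2\pi(1 - 2\phi\cos\omega + \phi^2)]$.

```python
import numpy as np

rng = np.random.default_rng(1)

# Simulate x_t = 0.5 x_{t-1} + e_t, e_t ~ N(0, 1), n = 200
n, phi = 200, 0.5
x = np.zeros(n)
for t in range(1, n):
    x[t] = phi * x[t - 1] + rng.standard_normal()

phi_hat = np.sum(x[1:] * x[:-1]) / np.sum(x[:-1] ** 2)   # OLS estimate of phi

def acov(x, h):
    xc = x - x.mean()
    return np.sum(xc[h:] * xc[:len(x) - h]) / len(x)

def ar1_spectrum(phi, omega, sigma2=1.0):
    # Standard AR(1) spectrum, assumed here to be what equation (8) refers to.
    return sigma2 / (2.0 * np.pi * (1.0 - 2.0 * phi * np.cos(omega) + phi ** 2))

def bartlett_spectrum(x, omega, q=5):
    # Smoothed estimate with the modified Bartlett weights, as in the lower-right panel.
    s = sum((1.0 - j / (q + 1.0)) * acov(x, j) * np.cos(omega * j) for j in range(1, q + 1))
    return (acov(x, 0) + 2.0 * s) / (2.0 * np.pi)

freqs = np.linspace(0.01, np.pi, 100)
S_pop = ar1_spectrum(phi, freqs)                                 # population spectrum
S_par = ar1_spectrum(phi_hat, freqs)                             # parametric estimate
S_bart = np.array([bartlett_spectrum(x, w) for w in freqs])      # nonparametric estimate
```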


[Figure 3: Estimates for Spectrum. Four panels plot spectrum estimates against frequency: the population spectrum (upper left), the parametrically estimated spectrum (upper right), the sample periodogram (lower left), and the nonparametrically estimated spectrum (lower right).]


For empirical studies, Section 6.4 in Hamilton, on the spectrum of an industrial production series, provides a very good example. Without any detrending, the spectrum is concentrated in the low-frequency region, which means that the variance of the series comes largely from the long-term trend (here, economic growth). After detrending, we obtain the growth rate, which is stationary, and the variance now comes mostly from the business cycle and seasonal effects. After filtering out the seasonal effects, most of the variance is due to the business cycle.

Readings: Hamilton, Ch. 6; Brockwell and Davis, Ch. 4, Ch. 10


    Lecture 4: Asymptotic Distribution Theory∗

In time series analysis, we usually use asymptotic theory to derive the joint distributions of the estimators of the parameters in a model. An asymptotic distribution is the distribution we obtain by letting the time horizon (sample size) go to infinity. We can simplify the analysis by doing so (since we know that some terms converge to zero in the limit), but we may also incur a finite-sample error. Hopefully, when the sample size is large enough, this error becomes small and we have a satisfactory approximation to the true or exact distribution. The reason we use asymptotic distributions instead of exact distributions is that the exact finite-sample distribution is in many cases too complicated to derive, even for Gaussian processes. Therefore, we use asymptotic distributions as alternatives.

    1 Review

I think that this lecture may contain more propositions and definitions than any other lecture in this course. In summary, we are interested in two types of asymptotic results. The first type is about convergence to a constant: for example, we are interested in whether sample moments converge to population moments, and the law of large numbers (LLN) is the famous result on this. The second type is about convergence to a random variable, say $Z$, where in many cases $Z$ follows a standard normal distribution. The central limit theorem (CLT) provides a tool for establishing asymptotic normality.

    The confusing part in this lecture might be that we have several versions of LLN and CLT.The results may look similar, but the assumptions are diff erent. We will start from the strongestassumption,   i.i.d., then we will show how to obtain similar results when  i.i.d.   is violated. Beforewe come to the major part on LLN and CLT, we review some basic concepts first.

    1.1 Convergence in Probability and Convergence Almost Surely

Definition 1 (Convergence in probability). $X_n$ is said to converge in probability to $X$ if for every $\epsilon > 0$,
\[
P(|X_n - X| > \epsilon) \to 0 \quad \text{as } n \to \infty.
\]
If $X = 0$, we say that $X_n$ converges in probability to zero, written $X_n = o_p(1)$ or $X_n \to_p 0$.

Definition 2 (Boundedness in probability). $X_n$ is said to be bounded in probability, written $X_n = O_p(1)$, if for every $\epsilon > 0$ there exists $\delta(\epsilon) \in (0, \infty)$ such that
\[
P(|X_n| > \delta(\epsilon)) < \epsilon \quad \forall\, n.
\]

    ∗Copyright 2002-2006 by Ling Hu.


We can similarly define orders in probability: $X_n = o_p(n^{-r})$ if and only if $n^r X_n = o_p(1)$, and $X_n = O_p(n^{-r})$ if and only if $n^r X_n = O_p(1)$.

Proposition 1. If $X_n$ and $Y_n$ are random variables defined on the same probability space and $a_n > 0$, $b_n > 0$, then:

(i) If $X_n = o_p(a_n)$ and $Y_n = o_p(b_n)$, we have
\begin{align*}
X_n Y_n &= o_p(a_n b_n), \\
X_n + Y_n &= o_p(\max(a_n, b_n)), \\
|X_n|^r &= o_p(a_n^r) \quad \text{for } r > 0.
\end{align*}

(ii) If $X_n = o_p(a_n)$ and $Y_n = O_p(b_n)$, we have $X_n Y_n = o_p(a_n b_n)$.

Proof of (i): If $|X_n Y_n|/(a_n b_n) > \epsilon$, then either $|Y_n|/b_n \leq 1$ and $|X_n|/a_n > \epsilon$, or $|Y_n|/b_n > 1$ and $|X_n Y_n|/(a_n b_n) > \epsilon$. Hence
\[
P(|X_n Y_n|/(a_n b_n) > \epsilon) \leq P(|X_n|/a_n > \epsilon) + P(|Y_n|/b_n > 1) \to 0.
\]
If $|X_n + Y_n|/\max(a_n, b_n) > \epsilon$, then either $|X_n|/a_n > \epsilon/2$ or $|Y_n|/b_n > \epsilon/2$, so
\[
P(|X_n + Y_n|/\max(a_n, b_n) > \epsilon) \leq P(|X_n|/a_n > \epsilon/2) + P(|Y_n|/b_n > \epsilon/2) \to 0.
\]
Finally,
\[
P(|X_n|^r / a_n^r > \epsilon) = P(|X_n|/a_n > \epsilon^{1/r}) \to 0.
\]

Proof of (ii): If $|X_n Y_n|/(a_n b_n) > \epsilon$, then either $|Y_n|/b_n > \delta(\epsilon)$ and $|X_n Y_n|/(a_n b_n) > \epsilon$, or $|Y_n|/b_n \leq \delta(\epsilon)$ and $|X_n|/a_n > \epsilon/\delta(\epsilon)$. Then
\[
P(|X_n Y_n|/(a_n b_n) > \epsilon) \leq P(|X_n|/a_n > \epsilon/\delta(\epsilon)) + P(|Y_n|/b_n > \delta(\epsilon)) \to 0.
\]

This proposition is very useful. For example, if $X_n = o_p(n^{-1})$ and $Y_n = o_p(n^{-2})$, then $X_n + Y_n = o_p(n^{-1})$, which tells us that the slowest convergence rate "dominates". Later on we will encounter sums of several terms; to study the asymptotics of such a sum, we can start by judging the convergence rate of each term and picking out the terms that converge most slowly. In many cases the terms that converge faster can be omitted, such as $Y_n$ in this example.

The results also hold if we replace $o_p$ in (i) with $O_p$. The notation above can be naturally extended from sequences of scalars to sequences of vectors or matrices. In particular, $\mathbf{X}_n = o_p(n^{-r})$ if and only if all elements of $\mathbf{X}_n$ converge to zero at order $n^{-r}$. Using the Euclidean distance $|\mathbf{X}_n - \mathbf{X}| = \left(\sum_{j=1}^{k} (X_{nj} - X_j)^2\right)^{1/2}$, where $k$ is the dimension of $\mathbf{X}_n$, we also have

Proposition 2. $\mathbf{X}_n - \mathbf{X} = o_p(1)$ if and only if $|\mathbf{X}_n - \mathbf{X}| = o_p(1)$.


Proposition 3 (Preservation of convergence under continuous transformations). If $\{\mathbf{X}_n\}$ is a sequence of $k$-dimensional random vectors such that $\mathbf{X}_n \to_p \mathbf{X}$ and if $g : \mathbb{R}^k \to \mathbb{R}^m$ is a continuous mapping, then $g(\mathbf{X}_n) \to_p g(\mathbf{X})$.

Proof: Let $M$ be a positive real number. Then for all $\epsilon > 0$ we have
\begin{align*}
P(|g(\mathbf{X}_n) - g(\mathbf{X})| > \epsilon) \;\leq\; & P(|g(\mathbf{X}_n) - g(\mathbf{X})| > \epsilon,\ |\mathbf{X}_n| \leq M,\ |\mathbf{X}| \leq M) \\
& + P(\{|\mathbf{X}_n| > M\} \cup \{|\mathbf{X}| > M\}).
\end{align*}
(The above inequality uses $P(A \cup B) \leq P(A) + P(B)$ with $A = \{|g(\mathbf{X}_n) - g(\mathbf{X})| > \epsilon,\ |\mathbf{X}_n| \leq M,\ |\mathbf{X}| \leq M\}$ and $B = \{|\mathbf{X}_n| > M\} \cup \{|\mathbf{X}| > M\}$.) Recall that if the function $g$ is uniformly continuous on $\{x : |x| \leq M\}$, then for every $\epsilon > 0$ there exists $\eta(\epsilon)$ such that $|\mathbf{X}_n - \mathbf{X}| < \eta(\epsilon)$ implies $|g(\mathbf{X}_n) - g(\mathbf{X})| < \epsilon$. Hence
\[
\{|g(\mathbf{X}_n) - g(\mathbf{X})| > \epsilon,\ |\mathbf{X}_n| \leq M,\ |\mathbf{X}| \leq M\} \subseteq \{|\mathbf{X}_n - \mathbf{X}| \geq \eta(\epsilon)\}.
\]
Therefore,
\begin{align*}
P(|g(\mathbf{X}_n) - g(\mathbf{X})| > \epsilon) \;\leq\; & P(|\mathbf{X}_n - \mathbf{X}| \geq \eta(\epsilon)) + P(|\mathbf{X}_n| > M) + P(|\mathbf{X}| > M) \\
\leq\; & P(|\mathbf{X}_n - \mathbf{X}| \geq \eta(\epsilon)) + P(|\mathbf{X}| > M) \\
& + P(|\mathbf{X}| > M/2) + P(|\mathbf{X}_n - \mathbf{X}| > M/2).
\end{align*}
Given any $\delta > 0$, we can choose $M$ so that the second and third terms are each less than $\delta/4$. Since $\mathbf{X}_n \to_p \mathbf{X}$, the first and fourth terms are each less than $\delta/4$ for $n$ sufficiently large. Therefore,
\[
P(|g(\mathbf{X}_n) - g(\mathbf{X})| > \epsilon) \leq \delta,
\]
so $g(\mathbf{X}_n) \to_p g(\mathbf{X})$.

Definition 3 (Convergence almost surely). A sequence $\{X_n\}$ is said to converge to $X$ almost surely, or with probability one, if for every $\epsilon > 0$
\[
P\left(\lim_{n\to\infty} |X_n - X| > \epsilon\right) = 0,
\]
that is, if $P(\lim_{n\to\infty} X_n = X) = 1$.

If $X_n$ converges to $X$ almost surely, we write $X_n \to_{a.s.} X$. Almost sure convergence is stronger than convergence in probability. In fact, we have

Proposition 4. If $X_n \to_{a.s.} X$, then $X_n \to_p X$.

    However, the converse is not true. Below is an example.


Example 1 (Convergence in probability but not almost surely). Let the sample space be the closed interval $S = [0, 1]$ with the uniform probability measure. Define the sequence $\{X_n\}$ as
\begin{align*}
X_1(s) &= s + 1_{[0,1]}(s), & X_2(s) &= s + 1_{[0,1/2]}(s), & X_3(s) &= s + 1_{[1/2,1]}(s), \\
X_4(s) &= s + 1_{[0,1/3]}(s), & X_5(s) &= s + 1_{[1/3,2/3]}(s), & X_6(s) &= s + 1_{[2/3,1]}(s),
\end{align*}
etc., where $1$ is the indicator function, i.e., it equals 1 if the statement is true and 0 otherwise. Let $X(s) = s$. Then $X_n \to_p X$, since $P(|X_n - X| \geq \epsilon)$ equals the length of the interval of $s$ values on which the indicator is one, and this length goes to zero as $n \to \infty$. However, $X_n$ does not converge to $X$ almost surely: there is no $s \in S$ for which $X_n(s) \to s = X(s)$. For every $s$, the value of $X_n(s)$ alternates between $s$ and $s + 1$ infinitely often.
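Here is a small numerical illustration of this "moving interval" construction (my own sketch, not from the notes): for a fixed $s$, the probability $P(|X_n - X| \geq \epsilon)$ equals the length $1/k$ of the current interval, which shrinks to zero, yet $X_n(s)$ keeps returning to $s + 1$.

```python
def block_and_interval(n):
    """Map index n >= 1 to (k, j): block k consists of the k intervals
    [j/k, (j+1)/k], j = 0, ..., k-1, matching X_1, X_2, X_3, ... above."""
    k = 1
    while n > k:
        n -= k
        k += 1
    return k, n - 1

def X_n(n, s):
    k, j = block_and_interval(n)
    return s + (1.0 if j / k <= s <= (j + 1) / k else 0.0)

s = 0.3                                                    # a fixed sample point
hits = [n for n in range(1, 5000) if X_n(n, s) == s + 1.0]
k_last, _ = block_and_interval(4999)
print(1.0 / k_last)   # P(|X_n - X| >= eps) = 1/k -> 0, so X_n converges in probability
print(hits[-3:])      # yet X_n(s) = s + 1 recurs in every block: no a.s. convergence
```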

    1.2 Convergence in  L p   Norm

When $E(|X_n|^p) < \infty$ with $p > 0$, $X_n$ is said to be $L_p$-bounded. Define the $L_p$ norm of $X$ as $\|X\|_p = (E|X|^p)^{1/p}$. Before we define $L_p$ convergence, we first review some useful inequalities.

Proposition 5 (Markov's inequality). If $E|X|^p < \infty$, $p \geq 0$, and $\epsilon > 0$, then
\[
P(|X| \geq \epsilon) \leq \epsilon^{-p} E|X|^p.
\]

Proof:
\begin{align*}
P(|X| \geq \epsilon) &= P(|X|^p \epsilon^{-p} \geq 1) \\
&= E\, 1_{[1,\infty)}(|X|^p \epsilon^{-p}) \\
&\leq E\left[|X|^p \epsilon^{-p}\, 1_{[1,\infty)}(|X|^p \epsilon^{-p})\right] \\
&\leq \epsilon^{-p} E|X|^p.
\end{align*}

In Markov's inequality, we can also replace $|X|$ with $|X - c|$, where $c$ is any real number. When $p = 2$, the inequality is also known as Chebyshev's inequality. If $X$ is $L_p$-bounded, then Markov's inequality tells us that the tail probabilities converge to zero at rate $\epsilon^{-p}$ as $\epsilon \to \infty$. Therefore, the order of $L_p$-boundedness measures the tendency of a distribution to generate outliers.

Proposition 6 (Hölder's inequality). For any $p \geq 1$,
\[
E|XY| \leq \|X\|_p \|Y\|_q,
\]
where $q = p/(p-1)$ if $p > 1$ and $q = \infty$ if $p = 1$.

Proposition 7 (Liapunov's inequality). If $p > q > 0$, then $\|X\|_p \geq \|X\|_q$.

Proof: Let $Z = |X|^q$, $Y = 1$, and $s = p/q$. Then by Hölder's inequality, $E|ZY| \leq \|Z\|_s \|Y\|_{s/(s-1)}$, or
\[
E(|X|^q) \leq \left[E(|X|^{qs})\right]^{1/s} = \left[E(|X|^p)\right]^{q/p}.
\]

Definition 4 ($L_p$ convergence). If $\|X_n\|_p < \infty$ for all $n$ with $p > 0$, and $\lim_{n\to\infty} \|X_n - X\|_p = 0$, then $X_n$ is said to converge in $L_p$ norm to $X$, written $X_n \to_{L_p} X$. When $p = 2$, we say it converges in mean square, written $X_n \to_{m.s.} X$.


For any $p > q > 0$, $L_p$ convergence implies $L_q$ convergence by Liapunov's inequality. Convergence in probability can be viewed as $L_0$ convergence; therefore, $L_p$ convergence implies convergence in probability:

Proposition 8 ($L_p$ convergence implies convergence in probability). If $X_n \to_{L_p} X$, then $X_n \to_p X$.

Proof: By Markov's inequality,
\[
P(|X_n - X| > \epsilon) \leq \epsilon^{-p} E|X_n - X|^p \to 0.
\]

    1.3 Convergence in Distribution

Definition 5 (Convergence in distribution). The sequence $\{X_n\}_{n=0}^{\infty}$ of random variables with distribution functions $\{F_{X_n}(x)\}$ is said to converge in distribution to $X$, written $X_n \to_d X$, if there exists a distribution function $F_X(x)$ such that
\[
\lim_{n\to\infty} F_{X_n}(x) = F_X(x)
\]
at every point $x$ where $F_X$ is continuous.

Again, we can naturally extend the definition and related results from a scalar random variable $X$ to a vector-valued random variable $\mathbf{X}$. To verify convergence in distribution of a $k \times 1$ vector: if the scalar $\lambda_1 X_{1n} + \lambda_2 X_{2n} + \ldots + \lambda_k X_{kn}$ converges in distribution to $\lambda_1 X_1 + \lambda_2 X_2 + \ldots + \lambda_k X_k$ for any real values of $(\lambda_1, \lambda_2, \ldots, \lambda_k)$, then the vector $(X_{1n}, X_{2n}, \ldots, X_{kn})$ converges in distribution to the vector $(X_1, X_2, \ldots, X_k)$.

    We also have the continuous mapping theorem for convergence in distribution.

Proposition 9. If $\{\mathbf{X}_n\}$ is a sequence of random $k$-vectors with $\mathbf{X}_n \to_d \mathbf{X}$ and $g : \mathbb{R}^k \to \mathbb{R}^m$ is a continuous function, then $g(\mathbf{X}_n) \to_d g(\mathbf{X})$.

In the special case where the limit is a constant scalar or vector, convergence in distribution implies convergence in probability.

Proposition 10. If $X_n \to_d c$ where $c$ is a constant, then $X_n \to_p c$.

Proof: If $X_n \to_d c$, then $F_{X_n}(x) \to 1_{[c,\infty)}(x)$ for all $x \neq c$. For any $\epsilon > 0$,
\begin{align*}
P(|X_n - c| \leq \epsilon) &= P(c - \epsilon \leq X_n \leq c + \epsilon) \\
&\to 1_{[c,\infty)}(c + \epsilon) - 1_{[c,\infty)}(c - \epsilon) \\
&= 1.
\end{align*}

Conversely, for a sequence $\{X_n\}$, if the limit under convergence in probability or convergence almost surely is a random variable $X$, then the sequence also converges in distribution to $X$.


    1.4 Law of Large Numbers

Theorem 1 (Chebychev's Weak LLN). Let $\{X_t\}$ be a sequence of random variables with $E(X_t) = \mu$ and $\lim_{n\to\infty} Var(\bar{X}_n) = 0$; then
\[
\bar{X}_n = \frac{1}{n}\sum_{t=1}^{n} X_t \to_p \mu.
\]

The proof follows readily from Chebychev's inequality:
\[
P(|\bar{X}_n - \mu| > \epsilon) \leq \frac{Var(\bar{X}_n)}{\epsilon^2} \to 0.
\]

The WLLN tells us that the sample mean is a consistent estimator of the population mean, with its variance vanishing as $n \to \infty$. Since $E(\bar{X}_n - \mu)^2 = Var(\bar{X}_n) \to 0$, we also know that $\bar{X}_n$ converges to the population mean in mean square.
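A quick Monte Carlo sketch of the WLLN (illustrative code, not part of the original notes): the variance of the sample mean and the probability of a fixed-size deviation from $\mu$ both shrink as $n$ grows.

```python
import numpy as np

# Monte Carlo check of the WLLN: the variance of the sample mean shrinks and
# the sample mean concentrates around the population mean mu.
rng = np.random.default_rng(0)
mu, reps = 2.0, 2000
for n in (10, 100, 1000):
    xbar = rng.exponential(scale=mu, size=(reps, n)).mean(axis=1)   # E(X_t) = mu
    print(n, xbar.var(), np.mean(np.abs(xbar - mu) > 0.1))          # Var(Xbar_n), P(|Xbar_n - mu| > 0.1)
```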

Theorem 2 (Kolmogorov's Strong LLN). Let $X_t$ be i.i.d. with $E|X_t| < \infty$ and $E(X_t) = \mu$; then
\[
\bar{X}_n \to_{a.s.} \mu.
\]

Note that Kolmogorov's LLN does not require finite variance. Next we consider the LLN for a heterogeneous process without serial correlation, say $E(X_t) = \mu_t$ and $Var(X_t) = \sigma_t^2$, and assume that $\bar{\mu}_n = n^{-1}\sum_{t=1}^{n} \mu_t \to \mu$. Then we know that $E(\bar{X}_n) = \bar{\mu}_n \to \mu$, and

\[
Var(\bar{X}_n) = E\left(n^{-1}\sum_{t=1}^{n}(X_t - \mu_t)\right)^2 = n^{-2}\sum_{t=1}^{n}\sigma_t^2.
\]

To establish conditions under which $Var(\bar{X}_n) \to 0$, we need another fundamental tool in asymptotic theory, Kronecker's lemma.

Theorem 3 (Kronecker's lemma). Let $\{X_n\}$ be a sequence of real numbers and let $\{b_n\}$ be a monotone increasing sequence with $b_n \to \infty$. If $\sum_{t=1}^{\infty} X_t$ converges, then
\[
\frac{1}{b_n}\sum_{t=1}^{n} b_t X_t \to 0.
\]

Theorem 4. Let $\{X_t\}$ be a serially uncorrelated sequence with $\sum_{t=1}^{\infty} t^{-2}\sigma_t^2 < \infty$; then $\bar{X}_n - \bar{\mu}_n \to_p 0$. (By Kronecker's lemma with $b_t = t^2$, the summability condition implies $Var(\bar{X}_n) = n^{-2}\sum_{t=1}^{n}\sigma_t^2 \to 0$.)


    1.5 Classical Central Limit Theory

Finally, the central limit theorem (CLT) provides a tool for establishing the asymptotic normality of an estimator.

Definition 6 (Asymptotic normality). A sequence of random variables $\{X_n\}$ is said to be asymptotically normal with mean $\mu_n$ and standard deviation $\sigma_n$ if $\sigma_n > 0$ for $n$ sufficiently large and
\[
(X_n - \mu_n)/\sigma_n \to_d Z, \quad \text{where } Z \sim N(0, 1).
\]

Theorem 5 (Lindeberg-Levy Central Limit Theorem). If $\{X_n\} \sim i.i.d.(\mu, \sigma^2)$ and $\bar{X}_n = (X_1 + \ldots + X_n)/n$, then
\[
\sqrt{n}(\bar{X}_n - \mu)/\sigma \to_d N(0, 1).
\]

Note that in the CLT we obtain a normality result for $\bar{X}_n$ without assuming normality of the distribution of $X_n$; we only require that the $X_n$ follow some i.i.d. distribution. We will see in a moment that the central limit theorem also holds in more general cases. Another useful tool, which can be used together with the LLN and CLT, is Slutsky's theorem.
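A short Monte Carlo sketch of the Lindeberg-Levy CLT (my own illustration): standardized sample means of a skewed exponential distribution behave like a standard normal for large $n$.

```python
import numpy as np

# Monte Carlo illustration of the Lindeberg-Levy CLT: standardized sample means
# of a skewed exponential(1) distribution look standard normal for large n.
rng = np.random.default_rng(0)
mu = sigma = 1.0                       # exponential(1): mean 1, standard deviation 1
n, reps = 500, 10000
z = np.sqrt(n) * (rng.exponential(size=(reps, n)).mean(axis=1) - mu) / sigma
print(z.mean(), z.var())               # close to 0 and 1
print(np.mean(z <= 1.96))              # close to P(Z <= 1.96) = 0.975
```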

Theorem 6 (Slutsky's theorem). If $X_n \to_d X$ and $Y_n \to_p c$, a constant, then

(a) $Y_n X_n \to_d cX$;

(b) $X_n + Y_n \to_d X + c$.

If we know the asymptotic distribution of a random variable, we can derive the asymptotic distribution of a function of that random variable using the so-called $\delta$-method.

Proposition 11 ($\delta$-method). Let $\{X_n\}$ be a sequence of random variables such that $\sqrt{n}(X_n - \mu) \to_d N(0, \sigma^2)$. If $g$ is a function that is differentiable at $\mu$, then
\[
\sqrt{n}\left[g(X_n) - g(\mu)\right] \to_d N(0, g'(\mu)^2 \sigma^2).
\]

Proof: The Taylor expansion of $g(X_n)$ around $X_n = \mu$ is
\[
g(X_n) = g(\mu) + g'(\mu)(X_n - \mu) + o_p(n^{-1/2}),
\]
since $X_n \to_p \mu$. Applying Slutsky's theorem to
\[
\sqrt{n}\left[g(X_n) - g(\mu)\right] = g'(\mu)\sqrt{n}(X_n - \mu) + o_p(1),
\]
where we know that $\sqrt{n}(X_n - \mu) \to_d N(0, \sigma^2)$, we obtain
\[
\sqrt{n}\left[g(X_n) - g(\mu)\right] \to_d N(0, g'(\mu)^2 \sigma^2).
\]

For example, let $g(X_n) = 1/X_n$ with $\mu \neq 0$. If $\sqrt{n}(X_n - \mu) \to_d N(0, \sigma^2)$, then
\[
\sqrt{n}(1/X_n - 1/\mu) \to_d N(0, \sigma^2/\mu^4).
\]

The Lindeberg-Levy CLT assumes i.i.d. observations, which is too strong in practice. Now we retain the assumption of independence but allow heterogeneous distributions (i.ni.d.); in the next section, we will show versions of the CLT for serially dependent sequences.
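A quick simulation check of the $\delta$-method for $g(x) = 1/x$ (an illustrative sketch of my own): the variance of $\sqrt{n}(1/\bar{X}_n - 1/\mu)$ should be close to $\sigma^2/\mu^4$.

```python
import numpy as np

# Simulation check of the delta method for g(x) = 1/x: the variance of
# sqrt(n)*(1/Xbar_n - 1/mu) should be close to sigma^2 / mu^4.
rng = np.random.default_rng(0)
mu, sigma, n, reps = 2.0, 1.0, 500, 10000
xbar = rng.normal(mu, sigma, size=(reps, n)).mean(axis=1)
z = np.sqrt(n) * (1.0 / xbar - 1.0 / mu)
print(z.var(), sigma ** 2 / mu ** 4)   # the two numbers should be close
```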


In the following analysis, it is more convenient to work with normalized variables, and we also need to use triangular arrays. An array $X_{nt}$ is a double-indexed collection of numbers in which each sample size $n$ can be associated with a different sequence. We use $\{\{X_{nt}\}_{t=0}^{n}\}_{n=1}^{\infty}$, or just $\{X_{nt}\}$, to denote an array. Let $\{Y_t\}$ be the raw sequence with $E(Y_t) = \mu_t$. Define $s_n^2 = \sum_{t=1}^{n} E(Y_t - \mu_t)^2$, $\sigma_{nt}^2 = E(Y_t - \mu_t)^2 / s_n^2$, and
\[
X_{nt} = \frac{Y_t - \mu_t}{s_n}.
\]
Then $E(X_{nt}) = 0$ and $Var(X_{nt}) = \sigma_{nt}^2$. Define
\[
S_n = \sum_{t=1}^{n} X_{nt},
\]
then $E(S_n) = 0$ and
\[
E(S_n^2) = \sum_{t=1}^{n} \sigma_{nt}^2 = 1. \tag{1}
\]

Definition 7 (Lindeberg CLT). Let the array $\{X_{nt}\}$ be independent with zero mean and variance sequence $\{\sigma_{nt}^2\}$ satisfying (1). If the following condition holds,
\[
\lim_{n\to\infty} \sum_{t=1}^{n} \int_{\{|X_{nt}| > \epsilon\}} X_{nt}^2 \, dP = 0 \quad \text{for all } \epsilon > 0, \tag{2}
\]
then $S_n \to_d N(0, 1)$.

Equation (2) is known as the Lindeberg condition. What the Lindeberg condition rules out are cases where some observations exhibit behavior extreme enough to influence the distribution of the sum in the limit. Finite variances alone are not sufficient to rule out such situations with non-identically distributed observations. The following is a popular version of the CLT for independent processes.

Definition 8 (Liapunov CLT). A sufficient condition for the Lindeberg condition (2) is
\[
\lim_{n\to\infty} \sum_{t=1}^{n} E|X_{nt}|^{2+\delta} = 0, \quad \text{for some } \delta > 0. \tag{3}
\]

Condition (3) is known as the Liapunov condition. It is stronger than the Lindeberg condition, but it is more easily checked, and therefore it is more frequently used in practice.

    2 Limit Theorems for Serially Dependent Observations

We have seen that if the data $\{X_n\}$ are generated by an ARMA process, then the observations are not i.i.d. but serially correlated. In this section, we discuss how to derive asymptotic theory for stationary, serially dependent processes.


    2.1 LLN for a Covariance Stationary Process

Consider a covariance stationary process $\{X_t\}$. Without loss of generality, let $E(X_t) = 0$, so $E(X_t X_{t-h}) = \gamma(h)$, where we assume $\sum_{h=0}^{\infty} |\gamma(h)| < \infty$. Now we consider the properties of the sample mean $\bar{X}_n = (X_1 + \ldots + X_n)/n$. First, it is an unbiased estimate of the population mean, $E(\bar{X}_n) = E(X_t) = 0$. Next, the variance of this estimate is:

\begin{align*}
E(\bar{X}_n^2) &= E\left[(X_1 + \ldots + X_n)/n\right]^2 \\
&= (1/n^2)\, E(X_1 + \ldots + X_n)^2 \\
&= (1/n^2)\sum_{i,j=1}^{n} E(X_i X_j) \\
&= (1/n^2)\sum_{i,j=1}^{n} \gamma_x(i - j) \\
&= (1/n)\left(\gamma(0) + 2\sum_{h=1}^{n-1}\left(1 - \frac{h}{n}\right)\gamma(h)\right) \\
&= (1/n)\sum_{|h| < n}\left(1 - \frac{|h|}{n}\right)\gamma(h).
\end{align*}
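The formula just derived is easy to evaluate for a given autocovariance function. Here is a small sketch of my own for an AR(1) process, using its standard autocovariance $\gamma(h) = \phi^{|h|}/(1 - \phi^2)$ (unit innovation variance).

```python
import numpy as np

def var_sample_mean(gamma, n):
    """Var(Xbar_n) = (1/n) * sum_{|h| < n} (1 - |h|/n) * gamma(|h|)."""
    total = gamma(0) + 2.0 * sum((1.0 - h / n) * gamma(h) for h in range(1, n))
    return total / n

phi = 0.5
gamma_ar1 = lambda h: phi ** h / (1.0 - phi ** 2)   # AR(1), unit innovation variance
for n in (10, 100, 1000):
    print(n, var_sample_mean(gamma_ar1, n))          # shrinks roughly like 1/n
```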


A covariance stationary process is said to be ergodic for the mean if the time series average converges to the population mean. Similarly, if the sample average provides a consistent estimate of the second moment, then the process is said to be ergodic for the second moment. In this section, we see that a sufficient condition for a covariance stationary proce