
Time Series - Lecture Notes

April 6, 2020

Contents

1 Stationary Models, Autocovariance and Autocorrelation
  1.1 Examples
2 Linear processes and ARMA models
  2.1 Examples
  2.2 Difference and Backward Operator
    2.2.1 Examples
    2.2.2 Linear representation for ARMA models
    2.2.3 Autocovariance function for ARMA models
  2.3 Partial autocovariance
3 Fitting ARMA model to data
  3.1 Elimination of Trend and Seasonal Component
  3.2 Sample Autocorrelation
4 Forecasting (prediction) of stationary time series
  4.1 Yule-Walker procedure
    4.1.1 Mean square prediction error (MSPE)
  4.2 Durbin-Levinson algorithm
    4.2.1 Partial Autocovariance function (PACF)
  4.3 Forecast limits (prediction intervals)
  4.4 Summary of forecasting methods
5 Estimation in ARMA models
  5.1 Sample Autocovariance and Sample Autocorrelation
  5.2 Yule-Walker estimator
  5.3 Maximum Likelihood Estimation (MLE)
    5.3.1 MLE for i.i.d. random variables
    5.3.2 MLE for time series models


    5.3.3 Easy approach for AR(p) models
    5.3.4 Order Selection
6 Appendix
  6.1 A primer on Maximum Likelihood Estimation (MLE)
    6.1.1 Asymptotic normality
  6.2 Innovation algorithm
    6.2.1 Innovations are uncorrelated

1 Stationary Models, Autocovariance and Autocorrelation

Throughout this course, {Xt} is a sequence of random variables, called a time series.

Definition 1.1 Let {Xt} be a time series such that E[Xt²] < ∞ for each t.

• The mean function is µX(t) = E[Xt];

• The (auto)covariance function is

γX(t, s) = Cov(Xt, Xs).

Note that γX(t, t) = Var(Xt).

Definition 1.2 A sequence {Xt} is stationary if µX(t) does not depend on t and γX(t, s) depends only on t − s. That is, γX can be treated as a function of one variable h = t − s only.

Definition 1.3 Let {Xt} be a stationary time series such that E[Xt²] < ∞ for each t. The autocorrelation function is defined as

ρX(h) = γX(h) / γX(0).

Note that ρX(0) = 1.

1.1 Examples

1. If {Zt} is a sequence of independent random variables with common mean µZ and common variance σ²Z, then µZ(t) = E(Zt) = µZ. Furthermore, γZ(t, s) = 0 when t ≠ s and γZ(t, t) = Var(Zt) = σ²Z. Hence, the sequence is stationary.


2. Let {Zt} be a sequence of independent random variables with mean zero and variance 1. Then µZ(t) = E(Zt) = 0 for all t and γZ(t, s) = 0 for all t ≠ s. Sometimes such a sequence is called white noise.

3. MA(1): Assume that {Zt} is a sequence of independent random variables with mean zero and variance σ²Z = Var(Zt). (Note that in class we assumed that σ²Z = 1.) Define

Xt = Zt + θZt−1,    (1)

with θ ∈ R. We can calculate E[Xt] = 0,

Var(Xt) = E[Xt²] = σ²Z(1 + θ²),

and the covariance of MA(1) is

γX(t, t+h) = σ²Z(1 + θ²) if h = 0;  σ²Zθ if h = ±1;  0 if |h| > 1.

Note that γX(t, t+h) = γX(h) depends on one variable only. Also,

ρX(t, t+h) = 1 if h = 0;  θ/(1 + θ²) if h = ±1;  0 if |h| > 1.

Again, ρX(t, t+h) = ρX(h). For example,

Cov(Xt, Xt+1) = E[XtXt+1] = E[(Zt + θZt−1)(Zt+1 + θZt)]
= E[ZtZt+1] + θE[Zt²] + θE[Zt−1Zt+1] + θ²E[Zt−1Zt]
= 0 + θσ²Z + 0 + 0 = θσ²Z.

(A numerical check of these formulas appears at the end of this subsection.)

4. Let {Xt} be a sequence of independent random variables with mean zero and variance σ². Define St = Σ_{i=1}^t Xi. Then E[St] = 0 but, for h ≥ 0,

γS(t, t+h) = Cov(St, St+h) = Cov(St, St + Xt+1 + · · · + Xt+h)
= Cov(St, St) + Cov(St, Xt+1 + · · · + Xt+h) = Cov(St, St) = Var(St).

Now, Var(St) = tσ². Hence, the covariance function depends on t and the sequence is not stationary.

5. Let {Xt} and {Yt} be two stationary sequences with means µX, µY and covariance functions γX and γY. Assume also that the two sequences are independent of each other. Define Wt = Xt + Yt. Then

µW(t) = E[Wt] = E[Xt + Yt] = E[Xt] + E[Yt] = µX + µY =: µW

and

γW(t, t+h) = E[WtWt+h] − E[Wt]E[Wt+h]
= E[(Xt + Yt)(Xt+h + Yt+h)] − {E[Xt] + E[Yt]}{E[Xt+h] + E[Yt+h]}
= E[XtXt+h] + E[YtYt+h] + E[XtYt+h] + E[YtXt+h] − (µX + µY)²
= γX(h) + µX² + γY(h) + µY² + 2µXµY − (µX + µY)²   (by independence, E[XtYt+h] = E[YtXt+h] = µXµY)
= γX(h) + γY(h).

Hence {Wt} is stationary.
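The covariance formulas in these examples are easy to check by simulation. The sketch below is only an illustration and not part of the notes (the helper names sample_autocov and simulate_ma1 and the chosen parameter values are ours): it simulates a long MA(1) path and compares the sample autocovariances at lags 0, 1 and 2 with the theoretical values σ²Z(1 + θ²), σ²Zθ and 0.

import numpy as np

def sample_autocov(x, h):
    # sample autocovariance (1/n) * sum_{t=1}^{n-h} (x_t - xbar)(x_{t+h} - xbar)
    n = len(x)
    xbar = x.mean()
    return np.sum((x[:n - h] - xbar) * (x[h:] - xbar)) / n

def simulate_ma1(n, theta, sigma_z, rng):
    # X_t = Z_t + theta * Z_{t-1} with independent N(0, sigma_z^2) noise
    z = rng.normal(0.0, sigma_z, size=n + 1)
    return z[1:] + theta * z[:-1]

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    theta, sigma_z, n = 0.6, 1.0, 100_000   # illustrative values only
    x = simulate_ma1(n, theta, sigma_z, rng)
    print("lag 0:", sample_autocov(x, 0), "theory:", sigma_z**2 * (1 + theta**2))
    print("lag 1:", sample_autocov(x, 1), "theory:", sigma_z**2 * theta)
    print("lag 2:", sample_autocov(x, 2), "theory:", 0.0)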

2 Linear processes and ARMA models

QUESTION: How to construct stationary time series?

Definition 2.1 Let {Zt} be a sequence of independent random variables with meanzero and variance Var(Zt) = E[Z2

t ] = σ2Z . Let ψj, j ≥ 0, be a sequence of constantssuch that

∑∞j=0 |ψj | <∞. Then

Xt =∞∑j=0

ψjZt−j

is called a linear process, or moving average, or causal moving average, orone-sided moving average.

Notes:

• In general, we can define a non-causal linear process Xt =∑∞j=−∞ ψjZt−j ;

• Condition∑∞j=0 |ψj | <∞ ensures that the infinite series converges:

E[|Xt|] ≤∞∑j=0

|ψj |E[|Zt−j |] = E[|Z0|]∞∑j=0

|ψj | <∞.

However, the condition is not necessary. We will come back to this later (longmemory).

4

Page 5: Time Series - Lecture Notes - University of Ottawa...2.Let fZ tgbe a sequence of independent random variables with mean zero and variance 1. Then Z(t) = E(Z t) = 0 for all t, Z(t;s)

• ∑∞j=0 |ψj | <∞ implies also∑∞j=0 ψ

2j <∞.

Lemma 2.2 The linear process is stationary with E[Xt] = 0 and

γX(h) = σ2Z

∞∑j=0

ψjψj+h.

Proof: Since the mean is zero, we have

γX(h) = E[XtXt+h] = E

∞∑j=0

ψjZt−j

∞∑i=0

ψiZt+h−i

=∞∑j=0

∞∑i=0

ψjψiE[Zt−jZt+h−j ].

Since the noise variables Zt are independent with mean zero, the only terms thatcontributes are those when j = i−h. Hence, the double sum becomes a single sum.

2.1 Examples

1. Let |φ| < 1 and consider

Xt = Σ_{j=0}^∞ φ^j Zt−j.

This is the linear process with ψj = φ^j. Applying Lemma 2.2 we have

γX(h) = σ²Z Σ_{j=0}^∞ ψj ψj+h = σ²Z Σ_{j=0}^∞ φ^j φ^{j+h} = σ²Z φ^h / (1 − φ²).

2. Let

Xt = Zt + θ1Zt−1 + · · · + θqZt−q.

This is the linear process with ψ0 = 1, ψ1 = θ1, . . . , ψq = θq and ψj = 0 for j > q. Then, using Lemma 2.2,

γX(h) = σ²Z Σ_{j=0}^{q−h} θj θj+h (with θ0 = 1) for h = 0, . . . , q, and γX(h) = 0 for h > q.

2.2 Difference and Backward Operator

Define the difference and backward shift operators as

∇Xt = Xt − Xt−1,  BXt = Xt−1.

By induction, B²Xt = Xt−2 and B^k Xt = Xt−k.


2.2.1 Examples

1. AR(1): Let {Zt} be a sequence of independent random variables with mean zero and variance Var(Zt) = E[Zt²] = σ²Z. Define

Xt = φXt−1 + Zt.

Equivalently,

Xt − φBXt = Zt,
(1 − φB)Xt = Zt,
φ(B)Xt = Zt,

where φ(z) = 1 − φz.

2. MA(1): Re-write (1) as

Xt = Zt + θZt−1,
Xt = Zt + θBZt,
Xt = (1 + θB)Zt,
Xt = θ(B)Zt,

where θ(z) = 1 + θz.

3. ARMA(1,1): The model is defined as

Xt − φXt−1 = Zt + θZt−1.

Equivalently, φ(B)Xt = θ(B)Zt, where φ(z) = 1 − φz and θ(z) = 1 + θz.

Definition 2.3 Let {Zt} be a sequence of independent random variables with mean zero and variance Var(Zt) = E[Zt²] = σ²Z. A time series {Xt} is called ARMA(p, q) (AutoRegressive Moving Average of order (p, q)) if it solves the equation

Xt − φ1Xt−1 − · · · − φpXt−p = Zt + θ1Zt−1 + · · · + θqZt−q.

Equivalently, φ(B)Xt = θ(B)Zt, where

φ(z) = 1 − φ1z − · · · − φpz^p

and

θ(z) = 1 + θ1z + · · · + θqz^q

are the autoregressive and moving average polynomials, respectively.


Notes:

1. A stationary solution of ARMA(p, q) exists whenever the autoregressive polynomial φ(z) = 1 − φ1z − · · · − φpz^p is nonzero for all |z| = 1. For example, for the model Xt − 1.1Xt−1 = Zt the autoregressive polynomial is φ(z) = 1 − 1.1z. The only solution of φ(z) = 0 is z = 1/1.1, which of course does not lie on the unit circle. That is, there exists a stationary solution.

2. A stationary and causal solution of ARMA(p, q) exists whenever the autoregressive polynomial φ(z) = 1 − φ1z − · · · − φpz^p is nonzero for all |z| ≤ 1. For example, for the model Xt − 1.1Xt−1 = Zt the only solution of φ(z) = 0 is z = 1/1.1. Since |z| < 1, there is no causal solution.

Example 2.4 Consider the AR(2) process Xt − 0.1Xt−1 − 0.4Xt−2 = Zt. The autoregressive polynomial is

φ(z) = 1 − 0.1z − 0.4z².

Its roots are approximately 1.46 and −1.71. Both roots have modulus bigger than one, so there is a causal and stationary solution.

Example 2.5 Consider the AR(2) process Xt − 0.1Xt−1 + 0.4Xt−2 = Zt. The autoregressive polynomial is

φ(z) = 1 − 0.1z + 0.4z².

There are only complex roots:

z = (0.1 ± i√1.59) / 0.8.

Both roots have the same modulus, equal to 1.58. It is bigger than one, so there is a causal and stationary solution.

Example 2.6 Consider the AR(2) process Xt − φXt−1 − φXt−2 = Zt. The autoregressive polynomial is

φ(z) = 1 − φz − φz²,

with roots

z = (−φ ± √(φ² + 4φ)) / (2φ).

The discriminant is ∆ = φ² + 4φ = φ(φ + 4), so ∆ > 0 when φ < −4 or φ > 0. Thus we need to consider two cases:

• φ ∈ (−4, 0) (complex roots). The product of the roots of φz² + φz − 1 = 0 equals −1/φ, so both roots have the same squared modulus −1/φ. This is bigger than one exactly when −1/φ > 1, that is, when φ ∈ (−1, 0). Hence for complex roots there is a causal and stationary solution if φ ∈ (−1, 0).

• φ < −4 or φ > 0 (real roots). For φ < −4 the product of the roots has modulus 1/|φ| < 1, so at least one root lies inside the unit circle and there is no causal solution. For φ > 0, write the roots as z± = (−φ ± √(φ² + 4φ))/(2φ). The negative root satisfies |z−| > 1 if and only if φ + √(φ² + 4φ) > 2φ, i.e. √(φ² + 4φ) > φ, which always holds for φ > 0. The positive root satisfies z+ > 1 if and only if √(φ² + 4φ) > 3φ, i.e. φ² + 4φ > 9φ², i.e. φ < 1/2. Hence for real roots both roots have modulus bigger than one whenever φ ∈ (0, 1/2).

Putting the two cases together, there is a causal and stationary solution if and only if φ ∈ (−1, 1/2). (This agrees with the usual causality conditions for AR(2), φ1 + φ2 < 1, φ2 − φ1 < 1, |φ2| < 1, applied with φ1 = φ2 = φ.)


2.2.2 Linear representation for ARMA models

Problem: Given an ARMA(p, q) model, how to write a linear representation as in Definition 2.1? In general, it is not very easy. We will focus on some basic models.

Example 2.7 (MA(q)) If p = 0, then ARMA(0, q) becomes MA(q). Its linear representation is trivial.

Example 2.8 (AR(1)) AR(1) is obtained by setting p = 1 and q = 0. That is,

φ(B)Xt = Zt,    (2)

where φ(z) = 1 − φz; see the first example in Section 2.2.1. Define

χ(z) = 1/φ(z).

Now, the function 1/(1 − φz) has the power series expansion

χ(z) = 1/φ(z) = Σ_{j=0}^∞ φ^j z^j.

This expansion makes sense whenever |φ| < 1. Take equation (2) and multiply both sides by χ(B):

χ(B)φ(B)Xt = χ(B)Zt,
Xt = χ(B)Zt,

since χ(z)φ(z) = 1 for all z. That is, the linear representation of AR(1) is

Xt = χ(B)Zt = Σ_{j=0}^∞ φ^j B^j Zt = Σ_{j=0}^∞ φ^j Zt−j.

The above formula gives a linear representation for AR(1) with ψj = φ^j. In fact we have seen this formula before; see the first example in Section 2.1.

We note that the above computation makes sense whenever |φ| < 1. That is, under this condition there exists a causal linear representation.

If |φ| > 1, then one can still write a linear representation, but it is not causal.


Example 2.9 (ARMA(1,1)) We repeat the computation from the previous example for ARMA(1,1), which is obtained by setting p = 1 and q = 1. That is,

φ(B)Xt = θ(B)Zt,    (3)

where φ(z) = 1 − φz and θ(z) = 1 + θz; see the third example in Section 2.2.1. Define again

χ(z) = 1/φ(z) = Σ_{j=0}^∞ φ^j z^j.

Take equation (3) and multiply both sides by χ(B):

χ(B)φ(B)Xt = χ(B)θ(B)Zt,
Xt = χ(B)θ(B)Zt,

since χ(z)φ(z) = 1 for all z. That is,

Xt = χ(B)θ(B)Zt = Σ_{j=0}^∞ φ^j B^j (1 + θB)Zt = Σ_{j=0}^∞ φ^j Zt−j + θ Σ_{j=0}^∞ φ^j Zt−j−1.

Until now everything was almost the same as for AR(1). Now we want Xt to have the form Σ_{j=0}^∞ ψj Zt−j. That is,

Σ_{j=0}^∞ ψj Zt−j = Σ_{j=0}^∞ φ^j Zt−j + θ Σ_{j=0}^∞ φ^j Zt−j−1.

Re-write this as

ψ0 Zt + Σ_{j=1}^∞ ψj Zt−j = φ^0 Zt + Σ_{j=1}^∞ (φ^j + θφ^{j−1}) Zt−j.

Matching coefficients, the linear representation of ARMA(1,1) is

ψ0 = 1,  ψj = φ^{j−1}(θ + φ),  j ≥ 1.

The above formula gives a linear representation for ARMA(1,1). The formula is obtained under the condition that |φ| < 1. Furthermore, it is also assumed that θ + φ ≠ 0, otherwise Xt = Zt.

Example 2.10 (ARMA(1,q)) ARMA(1,2) or in general ARMA(1,q) works in the same way as ARMA(1,1).

Example 2.11 (AR(p)) A procedure for AR(p), p ≥ 2, is much more involved and we will not discuss it.


2.2.3 Autocovariance function for ARMA models

Problem: Given an ARMA(p, q) model, how to write its covariance function?

• Use a linear representation; or

• Apply a recursive method.

Example 2.12 (MA(q)) For MA(q) a linear representation is trivial and can be used to evaluate the covariance function. See Section 2.1.

Example 2.13 (AR(1)) For AR(1) use the linear representation from Example 2.8 with Lemma 2.2. See the first example in Section 2.1. The covariance of AR(1) is

γX(h) = σ²Z φ^h / (1 − φ²).    (4)

Example 2.14 (ARMA(1,1)) For ARMA(1,1) (and in general for ARMA(1,q)) use the linear representation from Example 2.9 with Lemma 2.2. Specifically, since ψ0 = 1 and ψj = φ^{j−1}(φ + θ) for j ≥ 1, we have

γX(0) = σ²Z Σ_{j=0}^∞ ψj² = σ²Z ψ0² + σ²Z Σ_{j=1}^∞ ψj² = σ²Z [1 + (φ + θ)² Σ_{j=1}^∞ φ^{2(j−1)}] = σ²Z [1 + (θ + φ)²/(1 − φ²)].

Similarly,

γX(1) = σ²Z [(θ + φ) + φ(θ + φ)²/(1 − φ²)].

You can obtain similar formulas for γX(h), and you can notice that γX(h) = φ^{h−1} γX(1). That is, the covariance function of ARMA(1,1) is

γX(h) = σ²Z φ^{h−1} [(θ + φ) + φ(θ + φ)²/(1 − φ²)],  h ≥ 1.

You can also notice that if you set θ = 0 (that is, you consider the AR(1) model), then you recover formula (4).


Example 2.15 (AR(1)) For AR(p) models, or general ARMA(p, q) with p ≥ 2, one applies a recursive method. We illustrate this procedure for AR(1). In the assignment you will repeat this for AR(2).

Take the AR(1) equation Xt = φXt−1 + Zt. Multiply both sides by Xt−h and apply the expected value to get

E[XtXt−h] = φE[Xt−1Xt−h] + E[ZtXt−h].

Since E[Xt] = 0, we have E[XtXt−h] = γX(h) and E[Xt−1Xt−h] = γX(h − 1). Also, for all h ≥ 1, Zt is independent of Xt−h = Σ_{j=0}^∞ φ^j Zt−h−j. Hence,

E[ZtXt−h] = E[Zt]E[Xt−h] = 0.

(This is the whole trick; if you multiply by Xt+h it will not work.) Hence, we obtain the recursive formula for AR(1):

γX(h) = φγX(h − 1),

or by induction

γX(h) = φ^h γX(0),  h ≥ 1.

We need to start the recursion by computing γX(0) = Var(Xt) = σ²X. We have

Var(Xt) = φ²Var(Xt−1) + Var(Zt)

(again, Xt−1 and Zt are independent). Since Xt is stationary we get

σ²X = φ²σ²X + σ²Z.

Solving for σ²X:

σ²X = σ²Z / (1 − φ²).

Finally,

γX(h) = φ^h γX(0) = φ^h σ²Z / (1 − φ²),

which agrees with (4).

Example 2.16 Take the AR(2) equation Xt = φ1Xt−1 + φ2Xt−2 + Zt. Multiply both sides by Xt−h and apply the expected value to get

E[XtXt−h] = φ1E[Xt−1Xt−h] + φ2E[Xt−2Xt−h] + E[ZtXt−h].

Since E[Xt] = 0, we have E[XtXt−h] = γX(h), E[Xt−1Xt−h] = γX(h − 1) and E[Xt−2Xt−h] = γX(h − 2).

Also, for all h ≥ 1, Zt is independent of Xt−h. Hence, for h ≥ 1,

E[ZtXt−h] = E[Zt]E[Xt−h] = 0.

Hence, we obtain the recursive formula for the covariance of AR(2):

γX(h) = φ1γX(h − 1) + φ2γX(h − 2).    (5)

We need to start the recursion by computing γX(0) = Var(Xt) = σ²X and γX(1). To get γX(1) we use the AR(2) equation again, multiply by Xt−1 and apply the expectation to get

E[XtXt−1] = φ1E[Xt−1²] + φ2E[Xt−2Xt−1] + E[ZtXt−1],  where E[ZtXt−1] = 0,

so that

γX(1) = φ1γX(0) + φ2γX(1),    (6)

and

γX(1)(1 − φ2)/φ1 = γX(0).    (7)

Now we need to get γX(0). Take the AR(2) equation, multiply by Xt and apply the expectation to get

E[Xt²] = φ1E[Xt−1Xt] + φ2E[Xt−2Xt] + E[ZtXt].

Now,

E[ZtXt] = E[Zt(φ1Xt−1 + φ2Xt−2 + Zt)] = σ²Z.

Hence,

γX(0) = φ1γX(1) + φ2γX(2) + σ²Z.    (8)

We already know (equation (5) with h = 2) that

γX(2) = φ1γX(1) + φ2γX(0).

We plug this expression into (8) to get

γX(0) = φ1γX(1) + φ2{φ1γX(1) + φ2γX(0)} + σ²Z.    (9)

Solving (6)-(9) we obtain

γX(h) = φ1γX(h − 1) + φ2γX(h − 2),  h ≥ 2,

γX(1) = σ²Z φ1 / [(1 + φ2){(1 − φ2)² − φ1²}],

γX(0) = σ²Z (1 − φ2) / [(1 + φ2){(1 − φ2)² − φ1²}].

Note: to check that the last two equations make sense, take φ2 = 0, φ1 = φ. Then AR(2) reduces to AR(1) and the last two formulas reduce to γX(0) and γX(1) for AR(1).

Example 2.17 The recursive formula for the covariance of AR(3) is

γX(h) = φ1γX(h − 1) + φ2γX(h − 2) + φ3γX(h − 3).    (10)
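The AR(2) recursion is easy to code. The sketch below is an illustration only (the function names and the parameter values are ours, not part of the notes): it starts recursion (5) from the closed-form γX(0) and γX(1) of Example 2.16 and compares the result with sample autocovariances of a simulated path.

import numpy as np

def ar2_acvf(phi1, phi2, sigma2, max_lag):
    # start of the recursion: closed-form gamma(0), gamma(1) from Example 2.16
    denom = (1 + phi2) * ((1 - phi2) ** 2 - phi1 ** 2)
    g = [sigma2 * (1 - phi2) / denom, sigma2 * phi1 / denom]
    for h in range(2, max_lag + 1):
        g.append(phi1 * g[h - 1] + phi2 * g[h - 2])   # recursion (5)
    return g

def simulate_ar2(phi1, phi2, sigma_z, n, rng, burn=1000):
    # simulate X_t = phi1 X_{t-1} + phi2 X_{t-2} + Z_t and drop a burn-in period
    x = np.zeros(n + burn)
    z = rng.normal(0.0, sigma_z, size=n + burn)
    for t in range(2, n + burn):
        x[t] = phi1 * x[t - 1] + phi2 * x[t - 2] + z[t]
    return x[burn:]

if __name__ == "__main__":
    phi1, phi2, sigma_z = 0.5, 0.2, 1.0   # causal: both roots of 1 - 0.5z - 0.2z^2 lie outside the unit circle
    rng = np.random.default_rng(1)
    x = simulate_ar2(phi1, phi2, sigma_z, 200_000, rng)
    xbar = x.mean()
    sample = [np.mean((x[:len(x) - h] - xbar) * (x[h:] - xbar)) for h in range(4)]   # approximate sample ACVF
    print(ar2_acvf(phi1, phi2, sigma_z ** 2, 3))
    print(sample)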

2.3 Partial autocovariance

Suppose X = (X1, . . . , Xd) is a zero-mean random vector. The partial covariance/correlation between Xi and Xj is the covariance/correlation between Xi and Xj conditioned on the other elements in the vector.

The partial covariance/correlation between Xt and Xt+k+1 is defined as the partial covariance/correlation between Xt and Xt+k+1 after conditioning out the in-between time series Xt+1, . . . , Xt+k.

We will come back to this later on. At this moment, we just provide some formulas.

Example 2.18 Consider the AR(1) model Xt = φXt−1 + Zt. Then PACF(0) = 1, PACF(1) = φ and PACF(k) = 0 for k ≥ 2. Hence, the PACF vanishes after lag 1. In general, for AR(p), the partial autocovariances vanish after lag p.

Example 2.19 For the MA(1) model the PACF has the form

PACF(h) = −(−θ)^h / (1 + θ² + · · · + θ^{2h}).

In general, PACFs for MA models have an oscillating behaviour.

3 Fitting ARMA model to data

3.1 Elimination of Trend and Seasonal Component

A typical representation for a time series is

Yt = mt + st + Xt,

where mt is a deterministic trend component, st is a deterministic seasonal component and Xt is a random stationary part. Steps in model fitting:

• estimate and remove mt and st;

• fit a model to the stationary part Xt.

Assume first that st ≡ 0.

1. Moving Average Smoothing (note: it has nothing to do with MA models). Let Q be a positive integer and set

m̂t = (2Q + 1)^{−1} Σ_{j=−Q}^{Q} Yt+j,  Q + 1 ≤ t ≤ n − Q.

2. Exponential Smoothing. Let a ∈ (0, 1) and set

m̂1 = Y1,  m̂t = aYt + (1 − a)m̂t−1,  t = 2, . . . , n.

3. Polynomial Fitting. Assume that mt is a polynomial, for example a linear function mt = a0 + a1t. We estimate a0 and a1 using the least squares method.

Then X̂t = Yt − m̂t will be analysed using methods appropriate for stationary time series.

Another option is differencing: compute X̂t = Yt − Yt−1, t = 2, . . . , n.

Now suppose st ≠ 0. We assume that there exists an integer d such that st+d = st and Σ_{j=1}^d sj = 0.

• If d is even (d = 2q), then

m̂t = (0.5Yt−q + Yt−q+1 + · · · + Yt+q−1 + 0.5Yt+q)/d,  q < t ≤ n − q.

• If d is odd (d = 2q + 1), then

m̂t = (2q + 1)^{−1} Σ_{j=−q}^{q} Yt+j,  q + 1 ≤ t ≤ n − q.

• In either case, for each k = 1, . . . , d, compute wk as the average of

(Yk+jd − m̂k+jd),  q < k + jd ≤ n − q.

• Then for k = 1, . . . , d,

ŝk = wk − (1/d) Σ_{i=1}^d wi,

and ŝk = ŝk−d for k > d.

• Finally, X̂t = Yt − m̂t − ŝt, t = 1, . . . , n.

Note: after removing trend and seasonality we make an approximation: instead of having the true time series Xt, we have its approximation X̂t.
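A minimal sketch of the smoothing and differencing steps described above (the function names are ours and this is an illustration, not a full seasonal-adjustment routine; for the seasonal case one would additionally compute the averages wk and the estimates ŝk as in the bullet list above):

import numpy as np

def ma_smooth(y, Q):
    # moving-average trend estimate: average of the 2Q+1 surrounding values,
    # defined for Q+1 <= t <= n-Q (NaN padding at the ends)
    y = np.asarray(y, dtype=float)
    n = len(y)
    m = np.full(n, np.nan)
    for t in range(Q, n - Q):
        m[t] = y[t - Q:t + Q + 1].mean()
    return m

def exp_smooth(y, a):
    # exponential smoothing: m_1 = Y_1, m_t = a*Y_t + (1-a)*m_{t-1}
    y = np.asarray(y, dtype=float)
    m = np.empty_like(y)
    m[0] = y[0]
    for t in range(1, len(y)):
        m[t] = a * y[t] + (1 - a) * m[t - 1]
    return m

def difference(y):
    # X_t = Y_t - Y_{t-1}
    y = np.asarray(y, dtype=float)
    return y[1:] - y[:-1]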

3.2 Sample Autocorrelation

Let Xt be a stationary time series. Recall that the covariance at lag h is

γX(h) = E[X0Xh] − E[X0]E[Xh] = E[X0Xh] − µX² = E[(X0 − µX)(Xh − µX)],

and the correlation is

ρX(h) = γX(h)/γX(0).

The sample covariance and the sample correlation are, respectively,

γ̂X(h) = (1/n) Σ_{t=1}^{n−h} (Xt − X̄)(Xt+h − X̄),

ρ̂X(h) = γ̂X(h)/γ̂X(0),  where  X̄ = (1/n) Σ_{t=1}^{n} Xt.
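For completeness, a direct implementation of the sample autocovariance and sample autocorrelation with the 1/n convention used above (the function names are ours, not part of the notes):

import numpy as np

def sample_acvf(x, h):
    # hat gamma_X(h) = (1/n) * sum_{t=1}^{n-h} (X_t - Xbar)(X_{t+h} - Xbar)
    x = np.asarray(x, dtype=float)
    n = len(x)
    xbar = x.mean()
    return np.sum((x[:n - h] - xbar) * (x[h:] - xbar)) / n

def sample_acf(x, max_lag):
    # hat rho_X(h) = hat gamma_X(h) / hat gamma_X(0) for h = 0, ..., max_lag
    g0 = sample_acvf(x, 0)
    return [sample_acvf(x, h) / g0 for h in range(max_lag + 1)]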


4 Forecasting (prediction) of stationary time series

GOAL: Having observed {X1, . . . , Xn} from a time series with a known mean µ and known autocovariance function γX(k), predict Xn+k for some k ≥ 1.

Note: I will use forecasting and prediction interchangeably.

Consider a stationary sequence with mean µ = E[Xt] and covariance γX(h). Denote by PnXn+k the predicted value of Xn+k given that we have n observations X1, . . . , Xn. We will consider linear predictors, that is, predictors of the form

PnXn+k = a0 + a1Xn + · · · + anX1 = a0 + Σ_{i=1}^n ai Xn+1−i,    (11)

where a0, a1, . . . , an are constants. We have to find the values a0, . . . , an. We will employ the mean squared error criterion, that is, we will minimize

E[(Xn+k − PnXn+k)²],

the expected squared distance between the future observation Xn+k and our predictor PnXn+k. Of course, we cannot minimize (Xn+k − PnXn+k)² itself, since we do not know Xn+k (otherwise, if we knew Xn+k, there would be no point in predicting it!).

4.1 Yule-Walker procedure

Define

S(a0, a1, . . . , an) = E[(Xn+k − PnXn+k)²] = E[(Xn+k − a0 − Σ_{i=1}^n ai Xn+1−i)²].

We have to take derivatives with respect to a0, . . . , an, set the resulting functions equal to 0 and solve for a0, . . . , an.

Taking the derivative with respect to a0 gives

E[Xn+k − a0 − Σ_{i=1}^n ai Xn+1−i] = 0,

hence

a0 = µ(1 − Σ_{i=1}^n ai).

Hence, if we assume that µ = 0, then a0 = 0. From now on we assume that µ = 0. Taking derivatives with respect to a1, . . . , an gives

E[(Xn+k − Σ_{i=1}^n ai Xn+1−i) Xn+1−j] = 0,  j = 1, . . . , n.

Equivalently,

E[Xn+k Xn+1−j] − Σ_{i=1}^n ai E[Xn+1−i Xn+1−j] = 0,  j = 1, . . . , n.

Since the random variables are centered, the above expectations are covariances at lags n + k − (n + 1 − j) = k − 1 + j and i − j, respectively. Hence, we can write the system of equations

γX(k − 1 + j) = Σ_{i=1}^n ai γX(i − j),  j = 1, . . . , n.    (12)

Define the matrix

Γn = [γX(i − j)]_{i,j=1}^n    (13)

and the column vectors

γ(n; k) = (γX(k), . . . , γX(k + n − 1))^t,  an = (a1, . . . , an)^t.

Γn is called the variance-covariance matrix of (X1, . . . , Xn); for example, the entries on its diagonal are γX(0) = Var(Xt).

For example, for n = 1, Γ1 = γX(0), and for n = 2 the matrix Γ2 becomes

Γ2 = [ γX(0)   γX(1) ]   [ γX(0)  γX(1) ]
     [ γX(−1)  γX(0) ] = [ γX(1)  γX(0) ].

We can write the system of equations (12) in matrix-vector notation as

Γn an = γ(n; k).    (14)

Hence, the Yule-Walker formula for forecasting is

an = Γn^{−1} γ(n; k)    (15)

(of course, we have to assume that Γn is invertible). This equation defines prediction using the Yule-Walker procedure.

Note: The formula for an does not use anything about the time series model except stationarity. It depends on the covariances only. Of course, the question is how to compute the covariances when we only have data. If we can verify that our data come from a particular model, then the covariances have a specific form that can be used. Otherwise, we will use sample covariances.


4.1.1 Mean square prediction error (MSPE)

The above procedure guarantees that the mean squared prediction error

MSPEn(k) = E[(Xn+k − Σ_{i=1}^n ai Xn+1−i)²]

is minimal when a1, . . . , an are chosen according to (15). Can we calculate this error? The answer is affirmative. Recall that the random variables are centered. Hence,

E[(Xn+k − Σ_{i=1}^n ai Xn+1−i)²]
= E[Xn+k²] − 2 Σ_{i=1}^n ai E[Xn+k Xn+1−i] + E[(Σ_{i=1}^n ai Xn+1−i)²]
= γX(0) − 2 Σ_{i=1}^n ai γX(k + i − 1) + Σ_{i,j=1}^n ai γX(i − j) aj
= γX(0) − 2 an^t γ(n; k) + an^t Γn an
= γX(0) − an^t γ(n; k),

where in the last equation we used (14).

Note: The formula for the MSPE depends on k. That is, having observed X1, . . . , Xn, predicting further into the future (larger k) may give a larger prediction error.

Example 4.1 Consider the AR(1) model Xt = φXt−1 + Zt, where the Zt are i.i.d. with mean zero and variance σ², and |φ| < 1. Hence µ = E[Xt] = 0. Recall that

γX(h) = φ^h σ²/(1 − φ²),  h ≥ 0.

Take k = 1. Then

γ(n; k) = γ(n; 1) = (γX(1), . . . , γX(n))^t = σ²/(1 − φ²) (φ, . . . , φ^n)^t.

Equation (14) becomes

σ²/(1 − φ²) [ 1        φ        φ²       · · ·  φ^{n−1} ] [ a1 ]                 [ φ   ]
            [ φ        1        φ        · · ·  φ^{n−2} ] [ ·  ]  = σ²/(1 − φ²)  [ ·   ]
            [ ·        ·        ·        · · ·  ·       ] [ ·  ]                 [ ·   ]
            [ φ^{n−1}  φ^{n−2}  φ^{n−3}  · · ·  1       ] [ an ]                 [ φ^n ].

Now, either you invert the matrix on the left-hand side or you "guess" the solution an = (φ, 0, . . . , 0)^t. Hence, in the AR(1) case the prediction is

PnXn+1 = φXn.

The MSPE for AR(1) is

MSPEn(1) = γX(0) − an^t γ(n; 1) = γX(0) − φγX(1) = σ²/(1 − φ²) − φ² σ²/(1 − φ²) = σ².

Note that the formulas for the prediction and the MSPE are not practical yet, since they involve the unknown parameters φ and σ².

4.2 Durbin-Levinson algorithm

The problem is again to find the best linear predictor as in (11). However, we will avoid computing the inverse of the matrix Γn defined in (13). The price we pay is that we will only be able to write the one-step prediction PnXn+1.

Assume that µ = E[Xt] = 0, so that a0 = 0 in (11). Re-write (11) as

PnXn+1 = φn1Xn + · · · + φnnX1.    (16)

That is, a1 = φn1, . . . , an = φnn.

• Take n = 1. We can still use (15) to get

φ11 = γX(0)^{−1} γX(1) = ρX(1).

• Take n = 2. We want to find the coefficients φ21 and φ22 in

P2X3 = φ21X2 + φ22X1.

As in the Yule-Walker procedure we minimize

E[(X3 − φ21X2 − φ22X1)²].

Taking derivatives with respect to φ21 and φ22 leads to

E[X2(X3 − φ21X2 − φ22X1)] = 0,
E[X1(X3 − φ21X2 − φ22X1)] = 0.

Recalling that, e.g., E[X2X3] = γX(1), we can write equivalently

γX(1) − φ21γX(0) − φ22γX(1) = 0,
γX(2) − φ21γX(1) − φ22γX(0) = 0.

Divide everything by γX(0) and re-organize the terms to get

φ21 = ρX(1) − φ22ρX(1) = ρX(1) − φ22φ11   (by step n = 1),
ρX(2) − φ21ρX(1) − φ22 = 0.

Solving for φ21 and φ22 we arrive at

φ22 = (ρX(2) − φ11ρX(1)) / (1 − φ11ρX(1)),
φ21 = ρX(1) − φ22φ11.

Note that sometimes φ11 or ρX(1) is used, whichever is convenient. You can observe in the last system of equations that the coefficients φ22 and φ21 are computed from the correlations and from φ11 obtained at step n = 1.

• Such a recursive procedure leads to the following theorem.

Theorem 4.2 (Durbin-Levinson algorithm) The coefficients φn1, . . . , φnn can be computed recursively as

φnn = [γX(n) − Σ_{j=1}^{n−1} φn−1,j γX(n − j)] vn−1^{−1},

(φn1, . . . , φn,n−1)^t = (φn−1,1, . . . , φn−1,n−1)^t − φnn (φn−1,n−1, . . . , φn−1,1)^t,

and

vn = vn−1[1 − φnn²],  v0 = γX(0),  φ11 = ρX(1).


Note that the Durbin-Levinson and Yule-Walker procedures should lead to the same results. Indeed, in both cases we compute the coefficients of the linear prediction PnXn+1 using the mean squared error criterion. The only difference is the way in which the coefficients are computed.
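A direct implementation of the recursion in Theorem 4.2 is sketched below (the function name durbin_levinson is ours, not part of the notes); for AR(1) covariances it reproduces the coefficients of Example 4.3 below.

import numpy as np

def durbin_levinson(gamma, n):
    # gamma: sequence with gamma[h] = gamma_X(h) for h = 0, ..., n
    # returns (phi, v) with phi[m] = [phi_{m1}, ..., phi_{mm}] and v[m] = v_m
    phi = {1: np.array([gamma[1] / gamma[0]])}
    v = {0: gamma[0]}
    v[1] = v[0] * (1 - phi[1][0] ** 2)
    for m in range(2, n + 1):
        prev = phi[m - 1]
        num = gamma[m] - np.dot(prev, gamma[m - 1:0:-1])   # gamma(m) - sum_j phi_{m-1,j} gamma(m-j)
        phi_mm = num / v[m - 1]
        head = prev - phi_mm * prev[::-1]
        phi[m] = np.append(head, phi_mm)
        v[m] = v[m - 1] * (1 - phi_mm ** 2)
    return phi, v

if __name__ == "__main__":
    phi0 = 0.7
    gamma = np.array([phi0 ** h / (1 - phi0 ** 2) for h in range(6)])   # AR(1) autocovariances
    phi, v = durbin_levinson(gamma, 5)
    print(phi[5])   # approximately (0.7, 0, 0, 0, 0), so P_n X_{n+1} = 0.7 X_n
    print(v[5])     # approximately 1 = sigma_Z^2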

Example 4.3 Consider the AR(1) model Xt = φXt−1 + Zt, where the Zt are i.i.d. with mean zero and variance σ². Its covariance function is γX(h) = φ^h σ²/(1 − φ²), so the correlation is ρX(h) = γX(h)/γX(0) = φ^h. Using the Durbin-Levinson algorithm in Theorem 4.2 we find the coefficients and predictors:

φ11 = φ,  P1X2 = φX1;
φ22 = 0,  φ21 = φ,  P2X3 = φX2.

In general,

φn1 = φ,  φn2 = · · · = φnn = 0,  PnXn+1 = φXn.

Note that we obtain the same formulas as with the Yule-Walker procedure (see Example 4.1). As mentioned above, it must be like that.

4.2.1 Partial Autocovariance function (PACF)

As a by-product of the Durbin-Levinson algorithm we obtain the PACF, defined as

α(0) = 1;  α(k) = φkk, k ≥ 1.

Example 4.4 Consider the AR(1) model Xt = φXt−1 + Zt as in Example 4.3. We can conclude that α(0) = 1, α(1) = φ and α(k) = 0 for k ≥ 2. Hence, the PACF vanishes after lag 1. In general, for AR(p), the partial autocovariances vanish after lag p.

Example 4.5 For the MA(1) model the PACF has the form

α(h) = −(−θ)^h / (1 + θ² + · · · + θ^{2h}).

4.3 Forecast limits (prediction intervals)

Above, we obtained formulas for predicted values. We noted that the formulas for the predicted values do not depend on the model; they depend on the covariances only. However, we have not yet discussed the accuracy of this prediction in much detail. For this, we will need to impose some model assumptions.


Assume that Xt = Σ_{j=0}^∞ ψj Zt−j is a linear process with E[Zt] = 0 and Var(Zt) = σ². From the general theory we get that the mean squared prediction error is

MSPEn(k) := E[(Xn+k − PnXn+k)²] = σ² Σ_{j=0}^{k−1} ψj².    (17)

Hence, the theoretical forecast limits are

PnXn+k ± zα/2 √MSPEn(k) = PnXn+k ± zα/2 σ √(Σ_{j=0}^{k−1} ψj²),

where zα/2 is a standard normal quantile. You need to know the model to compute these forecast limits !!!

Example 4.6 Recall Example 4.1. For the AR(1) model Xt = Σ_{j=0}^∞ φ^j Zt−j we evaluated MSPEn(k) for k = 1 and obtained σ². This agrees with (17) since

Σ_{j=0}^{k−1} ψj² = ψ0² = 1.

4.4 Summary of forecasting methods

Above, we derived formulas for forecasting:

• Yule-Walker:

PnXn+k = a1Xn + · · · + anX1,  where an = Γn^{−1} γ(n; k);

• Durbin-Levinson algorithm:

PnXn+1 = φn1Xn + · · · + φnnX1,  where the coefficients are evaluated recursively.

The Yule-Walker and the Durbin-Levinson procedures should lead to the same results, that is, the coefficients should be the same; only the way in which the coefficients are calculated differs.

Furthermore, Yule-Walker gives a k-step prediction, whereas Durbin-Levinson gives a one-step prediction only.

The formulas for prediction depend on the covariances only, and in practice we will replace the covariances with sample covariances. In particular, you do not need the model. However, if you want to write formulas for prediction intervals, you must have a model.


5 Estimation in ARMA models

GOAL: Assume that we have observations {X1, . . . , Xn} from a time series. Assume also that we have already identified a model, say ARMA(p, q); hence p and q have already been chosen. We want to estimate the parameters φ1, . . . , φp and/or θ1, . . . , θq in the ARMA(p, q) model.

5.1 Sample Autocovariance and Sample Autocorrelation

Assume that we have a stationary sequence {Xt} with mean µ and autocovariance γX(h). Recall:

• Population ordinary moments: µk = E[X^k], k ≥ 1. For example, µ1 = µ = E[X];

• Population central moments: E[(X − E[X])^k], k ≥ 1. In particular, for k = 2 we get Var(X).

• For a random sample X1, . . . , Xn, the ordinary sample moment is

µ̂k = (1/n) Σ_{i=1}^n Xi^k.

For example, when k = 1 we get the sample mean X̄ = (1/n) Σ_{i=1}^n Xi.

• For a random sample X1, . . . , Xn, the sample central moment is

(1/n) Σ_{i=1}^n (Xi − X̄)^k.

For example, when k = 2 we get the sample variance. (Sometimes the sample variance is defined as (1/(n−1)) Σ_{i=1}^n (Xi − X̄)².)

Method of moments: Solve µ̂k = µk. For example, the population mean is estimated by the sample mean, and the population variance is estimated by the sample variance.

Definition 5.1 Recall that γX(h) = E[(Xt − µ)(Xt+h − µ)]. Its method-of-moments estimator is the sample autocovariance

γ̂X(h) = (1/n) Σ_{t=1}^{n−h} (Xt − X̄)(Xt+h − X̄).

Note that γ̂X(0) is just the sample variance. The autocorrelation is estimated by the sample autocorrelation

ρ̂X(h) = γ̂X(h)/γ̂X(0).


5.2 Yule-Walker estimator

• Especially for AR(p) models.

Let us start with AR(1): Xt = φXt−1 + Zt, where |φ| < 1, E[Zt] = 0 and Var(Zt) = σ²Z. Hence the model is stationary and causal.

Multiply both sides of the equation by Xt−1 and by Xt to get

XtXt−1 = φXt−1Xt−1 + ZtXt−1,
Xt² = φXtXt−1 + XtZt.

Apply the expected value (recall that E[Xt] = 0 and that Xt−1 is independent of Zt, because the time series is causal):

γX(1) = φγX(0) + 0,
γX(0) = φγX(1) + E[XtZt].

We evaluate

E[XtZt] = E[(φXt−1 + Zt)Zt] = φE[Xt−1Zt] + E[Zt²] = σ²Z.

Hence, we get the system of equations

γX(0)φ = γX(1),
σ²Z = γX(0) − φγX(1).    (18)

Now consider p = 2: Xt = φ1Xt−1 + φ2Xt−2 + Zt. Multiply by Xt−2, Xt−1 and Xt to get

XtXt−2 − φ1Xt−1Xt−2 − φ2Xt−2² = ZtXt−2,
XtXt−1 − φ1Xt−1² − φ2Xt−2Xt−1 = ZtXt−1,
Xt² − φ1Xt−1Xt − φ2Xt−2Xt = XtZt.

Apply the expected value to get

γX(0)φ1 + γX(1)φ2 = γX(1),
γX(1)φ1 + γX(0)φ2 = γX(2),
σ²Z = γX(0) − φ1γX(1) − φ2γX(2).    (19)

As in Section 4.1 we consider the variance-covariance matrix

Γp = [γX(i − j)]_{i,j=1}^p

and the vectors

φp = (φ1, . . . , φp)^t,  γX,p = (γX(1), . . . , γX(p))^t.

Recall that for p = 1, Γ1 = γX(0), and for p = 2 the matrix Γ2 becomes

Γ2 = [ γX(0)   γX(1) ]   [ γX(0)  γX(1) ]
     [ γX(−1)  γX(0) ] = [ γX(1)  γX(0) ].

We can write (18) or (19) as

Γp φp = γX,p,  σ²Z = γX(0) − φp^t γX,p.    (20)

Equivalently, we obtain the Yule-Walker equations

φp = Γp^{−1} γX,p,  σ²Z = γX(0) − φp^t γX,p,    (21)

very similar to the prediction equations in Section 4.1. The above formulas involve autocovariances, which are unknown. We combine these equations with the method of moments. The Yule-Walker estimators are

φ̂p = Γ̂p^{−1} γ̂X,p,  σ̂²Z = γ̂X(0) − φ̂p^t γ̂X,p,    (22)

where Γ̂p and γ̂X,p are obtained by replacing γX with γ̂X.
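Combining (22) with the sample autocovariances gives a short estimation routine. This is only a sketch (the function names are ours, not part of the notes); for an observed series x, calling yule_walker_ar(x, p) returns the estimates φ̂p and σ̂²Z.

import numpy as np

def sample_acvf(x, h):
    # hat gamma_X(h) = (1/n) * sum_{t=1}^{n-h} (X_t - Xbar)(X_{t+h} - Xbar)
    x = np.asarray(x, dtype=float)
    n = len(x)
    xbar = x.mean()
    return np.sum((x[:n - h] - xbar) * (x[h:] - xbar)) / n

def yule_walker_ar(x, p):
    # Yule-Walker estimators (22) for an AR(p) model
    g = np.array([sample_acvf(x, h) for h in range(p + 1)])
    Gamma = np.array([[g[abs(i - j)] for j in range(p)] for i in range(p)])
    gp = g[1:p + 1]
    phi_hat = np.linalg.solve(Gamma, gp)
    sigma2_hat = g[0] - phi_hat @ gp
    return phi_hat, sigma2_hat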

Theorem 5.2 For a large sample size n, the Yule-Walker estimators are approximately normal:

φ̂p ∼ N(φp, (1/n) σ²Z Γp^{−1}).

In particular, for p = 1,

φ̂ ∼ N(φ, (1/n) σ²Z γX(0)^{−1}).

That is, Var(φ̂) ≈ (1/n) σ²Z γX(0)^{−1}.

Example 5.3 (Confidence interval for AR(1)) The above theorem allows us to write a formula for a confidence interval. When p = 1, the theoretical confidence interval for the parameter φ of AR(1) is

φ̂ ± zα/2 (1/√n) σZ √(γX(0)^{−1}),

where zα/2 is the standard normal quantile. Since σ²Z and γX(0) are unknown, we replace them with estimators, using (22) with p = 1. The practical confidence interval for the parameter φ of AR(1) is

φ̂ ± zα/2 (1/√n) σ̂Z √(γ̂X(0)^{−1}),

with

φ̂ = γ̂X(1)/γ̂X(0),  σ̂² = γ̂X(0) − φ̂ γ̂X(1).

Example 5.4 For AR(2), Xt = φ1Xt−1 + φ2Xt−2 + Zt, the limiting variance-covariance matrix of the Yule-Walker estimators φ̂1, φ̂2 is

σ²Z Γ2^{−1} = [ 1 − φ2²         −φ1(1 + φ2) ]
              [ −φ1(1 + φ2)     1 − φ2²     ].

Indeed, we have

Γ2 = [ γX(0)  γX(1) ]
     [ γX(1)  γX(0) ],

and therefore

Γ2^{−1} = 1/(γX(0)² − γX(1)²) [ γX(0)   −γX(1) ]
                              [ −γX(1)   γX(0) ].

In Example 2.16 we found that

γX(1) = σ²Z φ1 / [(1 + φ2){(1 − φ2)² − φ1²}],
γX(0) = σ²Z (1 − φ2) / [(1 + φ2){(1 − φ2)² − φ1²}].

From this you can obtain the form of σ²Z Γ2^{−1}. In particular, Var(φ̂1) ≈ (1/n)(1 − φ2²) and Var(φ̂2) ≈ (1/n)(1 − φ2²). Consequently, the practical confidence intervals for the parameters φ1, φ2 of AR(2) are

φ̂1 ± zα/2 (1/√n) √(1 − φ̂2²),  φ̂2 ± zα/2 (1/√n) √(1 − φ̂2²),    (23)

with φ̂1 and φ̂2 obtained from (22).


5.3 Maximum Likelihood Estimation (MLE)

Note: You may read some notes on MLE for i.i.d. random variables - see Appendix.

5.3.1 MLE for i.i.d. random variables

Assume that the random variables X1, . . . , Xn are i.i.d. from a density f(x; θ). We assume that the form of f is known, and our goal is to estimate θ. Since the random variables are independent, their joint density is fn(x1, . . . , xn) = Π_{i=1}^n f(xi; θ).

Definition 5.5 The likelihood function is defined as

L(θ) = L(θ; X1, . . . , Xn) = Π_{i=1}^n f(Xi; θ).

The log-likelihood function is defined as ℓ(θ) = log L(θ).

Note: in statistics log x usually refers to the logarithm with base e, not 10.

Definition 5.6 Let

θ̂MLE = arg max_θ L(θ) = arg max_θ ℓ(θ).

Then θ̂MLE is called the maximum likelihood estimator.

Example 5.7 Assume that X1, . . . , Xn is a random sample from an exponential distribution. Recall that X ∼ Exp(β) (β > 0) if

fX(x) = β^{−1} exp(−x/β) for x > 0, and fX(x) = 0 for x ≤ 0.

Here θ = β. The likelihood function is

L(β) = β^{−n} Π_{i=1}^n exp(−Xi/β) = β^{−n} exp(−β^{−1} Σ_{i=1}^n Xi),

and hence the log-likelihood is

ℓ(β) = −n log(β) − (1/β) Σ_{i=1}^n Xi.

Take the derivative with respect to β to get

∂ℓ(β)/∂β = −nβ^{−1} + β^{−2} Σ_{i=1}^n Xi.

Then solve

∂ℓ(β)/∂β = 0.

The solution, X̄, is the required MLE. Hence,

β̂MLE = X̄.

Example 5.8 Assume that Z1, . . . , Zn is a random sample from a normal distribution with mean µ and variance σ²Z. The likelihood function is

L(µ, σZ) = 1/((√(2π))^n σZ^n) exp(−(1/(2σ²Z)) Σ_{i=1}^n (Zi − µ)²),    (24)

and hence the log-likelihood is

ℓ(µ, σZ) = −(n/2) log(2π) − n log σZ − (1/(2σ²Z)) Σ_{i=1}^n (Zi − µ)².

Take the derivative with respect to µ and solve

∂ℓ(µ, σZ)/∂µ = (1/σ²Z) Σ_{i=1}^n (Zi − µ) = 0.

We obtain

µ̂MLE = Z̄ = (1/n) Σ_{i=1}^n Zi.

5.3.2 MLE for time series models

Assume that X1, . . . , Xn are observations from a stationary time series. Let fn(x1, . . . , xn) be their joint density (which is no longer of product form).

We shall assume that the time series is Gaussian and centered. This is a very important assumption that has to be checked.

Introduce the following notation:

Xn = (X1, . . . , Xn)^t,  X̂n = (X̂1, . . . , X̂n)^t,  Un = (U1, . . . , Un)^t,

where Ui = Xi − X̂i, i = 1, . . . , n, are the innovations. Recall that Γn = E[Xn Xn^t] = [γX(i − j)]_{i,j=1}^n is the covariance matrix of (X1, . . . , Xn). The likelihood (the joint density of X1, . . . , Xn) is

L = (2π)^{−n/2} |Γn|^{−1/2} exp(−(1/2) Xn^t Γn^{−1} Xn),

where |Γn| is the determinant. Note that the covariance function (and hence the covariance matrix Γn) depends on some model parameters. For example, if the model is AR(1), then γX(h) = σ²Z φ^h/(1 − φ²). Thus the covariance matrix, and hence the likelihood, depends on the model parameters σZ, φ, so that we can write L(σZ, φ). In this particular case the maximum likelihood estimators are obtained by maximizing L(σZ, φ) with respect to σZ, φ. Usually there are no explicit formulas and everything is done numerically.

It turns out that

Xn^t Γn^{−1} Xn = Un^t D^{−1} Un,    (25)

where D is the diagonal matrix diag(v0, . . . , vn−1). Thus we have

Xn^t Γn^{−1} Xn = Σ_{i=1}^n (Xi − X̂i)²/vi−1.

NOTE: The details for the few lines above can be found in Section 6.2. Hence, the complicated expression on the left-hand side of (25) is replaced with a simple expression on the right-hand side that can be calculated using the innovation algorithm.

Furthermore,

det(Γn) = v0 · · · vn−1.

Hence, the likelihood function takes the form

(2π)^{−n/2} (v0 · · · vn−1)^{−1/2} exp(−(1/2) Σ_{i=1}^n (Xi − X̂i)²/vi−1).    (26)

How do we use it? Look at (34), the formula that describes X̂i in terms of the coefficients θij and the innovations. The coefficients θij are computed using the innovation algorithm; in particular, they depend on the parameters θ1, . . . , θq. Once the coefficients and the predictions are computed, we plug them into the likelihood function. Hence, the likelihood function depends on

β = (φ1, . . . , φp, θ1, . . . , θq, σ²Z).

This function can be maximized to obtain the MLE of β. We note that, although the final formula (26) was developed using the innovation algorithm, we can use it as long as we have formulas for X̂i.

Theorem 5.9 The maximum likelihood estimator is asymptotically normal with mean β and variance n^{−1}V(β), where V(β) is a matrix.


Example 5.10 Consider the AR(1) model Xt = φXt−1 + Zt, where the Zt are i.i.d. normal random variables with mean zero and variance σ²Z. Then X̂i+1 = φXi and vi = E[(Xi+1 − X̂i+1)²] = σ²Z for all i = 1, . . . , n − 1. Hence, the likelihood function is

(2π)^{−n/2} σZ^{−n} exp(−(1/2) Σ_{i=2}^n (Xi − φXi−1)²/σ²Z).

The summation starts from 2 because we do not have X0 in our sample. The log-likelihood is (ignoring the term (2π)^{−n/2})

−n log σZ − (1/(2σ²Z)) Σ_{i=2}^n (Xi − φXi−1)².

Hence,

φ̂MLE = Σ_{i=2}^n Xi−1Xi / Σ_{i=2}^n Xi−1²

and

σ̂²MLE = (1/n) Σ_{i=2}^n (Xi − φ̂MLE Xi−1)².

Assume for a moment that σZ is known. Then the vector β = φ and V(β) becomes

V(β) = V(φ) = 1 − φ².

We note that the MLE of φ, as well as its asymptotic variance, is the same as for Yule-Walker.

In general, for AR(p) models, the Yule-Walker estimator and the MLE of (φ1, . . . , φp) agree, and in both cases the asymptotic variance is

V(φp) = σ²Z Γp^{−1}.

For AR(2) we have (cf. Example 5.4)

V(φ1, φ2) = [ 1 − φ2²         −φ1(1 + φ2) ]
            [ −φ1(1 + φ2)     1 − φ2²     ].

However, the MLE and Yule-Walker estimators of the variance σ²Z need not agree!!

Example 5.11 For the MA(1) and MA(2) models we have, respectively,

V(θ) = 1 − θ²,  V(θ1, θ2) = [ 1 − θ2²        θ1(1 + θ2) ]
                            [ θ1(1 + θ2)    1 − θ2²     ].


5.3.3 Easy approach for AR(p) models

Consider for simplicity the AR(1) model, Xt = φXt−1 + Zt, where the Zt are i.i.d. normal with mean zero and variance σ²Z. Re-write (24) with µ = 0:

L = 1/((√(2π))^n σZ^n) exp(−(1/(2σ²Z)) Σ_{i=1}^n Zi²).    (27)

For the moment we do not write the parameters of the likelihood function. Now, in the AR(1) model we have Zt = Xt − φXt−1. Plug this into (27) to obtain

L(σZ, φ) = 1/((√(2π))^n σZ^n) exp(−(1/(2σ²Z)) Σ_{i=1}^n (Xi − φXi−1)²).    (28)

Now the likelihood function depends explicitly on σZ and φ. Note that this is the same likelihood function as in Example 5.10.

This easy approach works for an arbitrary AR(p) model, but it does not work for MA(q) or general ARMA(p, q) models.
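For AR(1) this conditional likelihood can be maximized in closed form, as in Example 5.10. A sketch follows (the function name is ours; the standard error uses V(φ) = 1 − φ² from the discussion after Theorem 5.9): it returns φ̂MLE, σ̂²MLE and an approximate standard error for φ̂.

import numpy as np

def ar1_conditional_mle(x):
    # phi_hat = sum X_{i-1} X_i / sum X_{i-1}^2, sigma2_hat = (1/n) sum (X_i - phi_hat X_{i-1})^2
    x = np.asarray(x, dtype=float)
    phi_hat = np.sum(x[:-1] * x[1:]) / np.sum(x[:-1] ** 2)
    resid = x[1:] - phi_hat * x[:-1]
    sigma2_hat = np.sum(resid ** 2) / len(x)
    se = np.sqrt((1 - phi_hat ** 2) / len(x))   # asymptotic standard error from V(phi) = 1 - phi^2
    return phi_hat, sigma2_hat, se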

5.3.4 Order Selection

A classical order selection is based on AIC (the Akaike Information Criterion). We consider several ARMA(p, q) models that depend on the parameters φ = (φ1, . . . , φp) and θ = (θ1, . . . , θq), for several choices of p and q. For example, the ar function in R deals with AR models and considers q = 0 and p = 1, . . . , 12.

For each model we calculate the expression

AIC = 2 log L(φ̂, θ̂, σ̂Z) − 2(p + q + 1) n/(n − p − q − 2).

Note that sometimes one considers −AIC. The method chooses the model with the highest likelihood but penalizes models with too many parameters (that is, too large p and q).

6 Appendix

6.1 A primer on Maximum Likelihood Estimation (MLE)

Assume that the random variables X1, . . . , Xn are i.i.d. from a density f(x; θ). We assume that the form of f is known, and our goal is to estimate θ.

Definition 6.1 The likelihood function is defined as

L(θ) = L(θ; X1, . . . , Xn) = Π_{i=1}^n f(Xi; θ).

The log-likelihood function is defined as ℓ(θ) = log L(θ).

Note: in statistics log x usually refers to the logarithm with base e, not 10.

Definition 6.2 Let

θ̂MLE = arg max_θ L(θ) = arg max_θ ℓ(θ).

Then θ̂MLE is called the maximum likelihood estimator.

Example 6.3 Assume that X1, . . . , Xn is a random sample from an exponential distribution. Recall that X ∼ Exp(β) (β > 0) if

fX(x) = β^{−1} exp(−x/β) for x > 0, and fX(x) = 0 for x ≤ 0.

Here θ = β. The likelihood function is

L(β) = β^{−n} Π_{i=1}^n exp(−Xi/β) = β^{−n} exp(−β^{−1} Σ_{i=1}^n Xi),

and hence the log-likelihood is

ℓ(β) = −n log(β) − (1/β) Σ_{i=1}^n Xi.

Take the derivative with respect to β and solve

∂ℓ(β)/∂β = 0.

The solution, X̄, is the required MLE. Hence,

β̂MLE = X̄.

6.1.1 Asymptotic normality

The score function is defined as

s(x; θ) = ∂ log f(x; θ)/∂θ.

The Fisher information is defined as

In(θ) := Var(Σ_{i=1}^n s(Xi, θ)).

If the random variables are i.i.d., then the Fisher information becomes

In(θ) = n Var(s(X1, θ)) = n I1(θ) =: n I(θ).

Lemma 6.4 The score function satisfies E(s(X, θ)) = 0.

Proof.

E(s(X, θ)) = ∫ s(x, θ) f(x, θ) dx = ∫ (∂ log f(x, θ)/∂θ) f(x, θ) dx
= ∫ (1/f(x, θ)) (∂f(x, θ)/∂θ) f(x, θ) dx = ∫ ∂f(x, θ)/∂θ dx
= (∂/∂θ) ∫ f(x, θ) dx = 0.

Using the above lemma,

I(θ) = Var(s(X, θ)) = E(s²(X, θ)).

Next:

Lemma 6.5 We have

I(θ) = E(s²(X, θ)) = −E(∂s(X, θ)/∂θ) = −E(∂² log f(X, θ)/∂θ²).

Proof.

E(∂² log f(X, θ)/∂θ²) = ∫ (∂² log f(x, θ)/∂θ²) f(x, θ) dx = ∫ (∂/∂θ)(∂ log f(x, θ)/∂θ) f(x, θ) dx
= ∫ (∂/∂θ)((1/f(x, θ)) ∂f(x, θ)/∂θ) f(x, θ) dx . . .

Example 6.6 If X ∼ Exp(β), then log f(x, β) = −log(β) − β^{−1}x. The score function is

s(x, β) = −1/β + x/β²,

and minus its derivative (w.r.t. β) is

−∂s(x, β)/∂β = −1/β² + 2x/β³.

Hence,

I(β) = E(−∂s(X, β)/∂β) = −1/β² + (2/β³)E(X) = −1/β² + 2/β² = 1/β².

Thus,

In(β) = n/β².


Note that for X̄n = (X1 + · · · + Xn)/n,

Var(X̄n) = Var(X)/n = β²/n.

Hence, Var(X̄n) = In^{−1}(β).

Theorem 6.7 Under appropriate regularity conditions,

(θ̂MLE − θ) / √(Var(θ̂MLE)) →d N(0, 1),

where

Var(θ̂MLE) ≈ In^{−1}(θ).

Proof: The MLE estimator θ̂MLE solves

(∂/∂θ) ℓ(θ̂MLE) = 0.

Apply a Taylor expansion to get

(∂/∂θ) ℓ(θ) + (θ̂MLE − θ)(∂²/∂θ²) ℓ(θ) ≈ 0.

Reorganize this to get

√n (θ̂MLE − θ) = [ (1/√n)(∂/∂θ) ℓ(θ) ] / [ −(1/n)(∂²/∂θ²) ℓ(θ) ].

Now we show that the numerator converges to a normal distribution, whereas the denominator converges in probability to a constant.

Recall that

ℓ(θ) = log f(X1; θ) + · · · + log f(Xn; θ),

and hence

(∂/∂θ) ℓ(θ) = Σ_{i=1}^n (∂/∂θ) log f(Xi; θ) = Σ_{i=1}^n s(Xi; θ).

We proved that E(s(Xi; θ)) = 0. Hence, the numerator can be written as

(1/√n) Σ_{i=1}^n Yi,

where the Yi = s(Xi; θ) are i.i.d. random variables with mean zero and variance

Var(s(Xi; θ)) = E(s²(Xi, θ)) = I(θ).

Hence,

(1/√n)(∂/∂θ) ℓ(θ) = (1/√n) Σ_{i=1}^n Yi →d N(0, E(s²(X1, θ))) = N(0, I(θ)).

Similarly, the denominator can be written as

−(1/n) Σ_{i=1}^n Ui,

where the

Ui = (∂²/∂θ²) log f(Xi; θ),  i = 1, . . . , n,

are i.i.d. random variables. From Lemma 6.5 we note that

E(Ui) = E((∂²/∂θ²) log f(Xi; θ)) = −I(θ).

Hence, the Law of Large Numbers yields

−(1/n) Σ_{i=1}^n Ui → I(θ).

We conclude the result. □

Example 6.8 (Example 6.6 continued) Theorem 6.7 yields

√n (X̄n − β) →d N(0, β²).

6.2 Innovation algorithm

Again, as in the case of the Durbin-Levinson algorithm, our goal is to find PnXn+1. In the innovation algorithm, as a by-product we also "predict" X1, . . . , Xn. Prediction of these values is not necessary since we observe them; however, we will use the innovations (the differences between the observed values Xi and the "predicted" values) for model choice and estimation purposes (see Section 5.3.2).

Define

X̂i+1 = PiXi+1 = ai1Xi + · · · + aiiX1,  i = 0, . . . , n.    (29)


That is, X̂n+1 is the predicted value for Xn+1, whereas X̂1, . . . , X̂n are the "predicted" values for X1, . . . , Xn.

Define the column vectors

Xn = (X1, . . . , Xn)^t,  X̂n = (X̂1, . . . , X̂n)^t,  Un = (U1, . . . , Un)^t,

where Ui = Xi − X̂i, i = 1, . . . , n, are the innovations, that is, the differences between the observed and the "predicted" values. A "good" prediction is such that these errors are small. We also set X̂1 = 0.

We re-write (29) as

X̂1 = 0                                     (i = 0),
X̂2 = a11X1                                 (i = 1),
X̂3 = a21X2 + a22X1                          (i = 2),
· · ·
X̂n = an−1,1Xn−1 + · · · + an−1,n−1X1         (i = n − 1)

(we omit X̂n+1), or in matrix notation as

X̂n = A* Xn,

where

A* = [ 0           0           0       · · ·       0   ]
     [ a11         0           0       · · ·       0   ]
     [ a22         a21         0       · · ·       0   ]
     [ · · ·                                            ]
     [ an−1,n−1    an−1,n−2    · · ·    an−1,1      0   ].

Note that the matrix is lower triangular. We write

Un = Xn − X̂n = Xn − A* Xn = A Xn,

where A = I − A*. If we denote C = A^{−1} and B = C − I, then we can write

Xn = C Un,    (30)
X̂n = (C − I) Un = B Un.    (31)

OK, what do we get from this mess? We have represented the "predicted" values in terms of the matrix B and the innovations Un. In fact, the matrix B is lower triangular. We can write (31) as

X̂n = B Un, where

B = [ 0             0             0             · · ·   0 ]
    [ θ11           0             0             · · ·   0 ]
    [ θ22           θ21           0             · · ·   0 ]
    [ · · ·                                               ]
    [ θn−1,n−1      θn−1,n−2      θn−1,n−3      · · ·   0 ],    (32)

and we can also write (30) as Xn = C Un, where

C = [ 1             0             0             · · ·   0 ]
    [ θ11           1             0             · · ·   0 ]
    [ θ22           θ21           1             · · ·   0 ]
    [ · · ·                                               ]
    [ θn−1,n−1      θn−1,n−2      θn−1,n−3      · · ·   1 ].    (33)

Note that the coefficients θkj have nothing to do with the Durbin-Levinson algorithm.

For example, from the above matrix equations we have

X̂1 = 0,
X̂2 = θ11(X1 − X̂1) = θ11X1,
X̂3 = θ21(X2 − X̂2) + θ22(X1 − X̂1).

That is, the prediction of, say, X3 is based on the first and the second innovations (X1 − X̂1), (X2 − X̂2). We can write

X̂i+1 = 0 for i = 0, and X̂i+1 = Σ_{j=1}^i θij(Xi+1−j − X̂i+1−j) for i ≥ 1.    (34)

This is valid for the MA(q) model. For ARMA(p, q) we have

X̂i+1 = 0 for i = 0, and X̂i+1 = φ1Xi + · · · + φpXi+1−p + Σ_{j=1}^i θij(Xi+1−j − X̂i+1−j) for i ≥ 1.    (35)

The question is: how do we evaluate the coefficients θij?

Theorem 6.9 (Innovation algorithm) Assume that Xt is a stationary time series with mean zero. Let vi = E[(Xi+1 − X̂i+1)²], i ≥ 0, so that v0 = E[X1²] = γX(0). Then

θn,n−i = vi^{−1} (γX(n − i) − Σ_{j=0}^{i−1} θi,i−j θn,n−j vj),  0 ≤ i < n,

vn = γX(0) − Σ_{j=0}^{n−1} θn,n−j² vj.


Example 6.10 Consider the MA(1) model Xt = Zt + θZt−1, where E[Zt] = 0 and Var(Zt) = σ²Z. Recall that γX(0) = σ²Z(1 + θ²), γX(1) = θσ²Z and γX(h) = 0 for h > 1. We have:

• n = 1, i = 0: v0 = γX(0) = σ²Z(1 + θ²), θ11 = v0^{−1} γX(1) = γX(1)/γX(0) = ρX(1) and v1 = γX(0) − θ11² v0.

• n = 2:
  – i = 0: θ22 = v0^{−1} γX(2) = 0;
  – i = 1: θ21 = v1^{−1} γX(1).

• In general, θn,1 = vn−1^{−1} γX(1), θn,j = 0 for 2 ≤ j ≤ n, and vn = [1 + θ² − vn−1^{−1} θ² σ²Z] σ²Z.
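A direct implementation of Theorem 6.9 is sketched below (the function name innovations is ours, not part of the notes). Applied to the MA(1) autocovariances it reproduces the behaviour above: θn,1 → θ and vn → σ²Z as n grows.

import numpy as np

def innovations(gamma, n):
    # Innovation algorithm (Theorem 6.9) for a zero-mean stationary series:
    # gamma[h] = gamma_X(h); returns theta[m][j] = theta_{m,j} (j = 1..m) and v = [v_0, ..., v_n]
    v = [gamma[0]]
    theta = {m: np.zeros(m + 1) for m in range(1, n + 1)}   # index 0 of each row is unused
    for m in range(1, n + 1):
        for i in range(m):                                   # 0 <= i < m
            s = sum(theta[i][i - j] * theta[m][m - j] * v[j] for j in range(i))
            theta[m][m - i] = (gamma[m - i] - s) / v[i]
        v.append(gamma[0] - sum(theta[m][m - j] ** 2 * v[j] for j in range(m)))
    return theta, v

if __name__ == "__main__":
    theta0, sigma2 = 0.5, 1.0
    gamma = [sigma2 * (1 + theta0 ** 2), sigma2 * theta0] + [0.0] * 8   # MA(1) autocovariances
    th, v = innovations(gamma, 9)
    print(th[9][1], v[9])   # theta_{n,1} close to 0.5 and v_n close to 1.0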

6.2.1 Innovations are uncorrelated

A very important property is that the innovations U1, . . . , Un are uncorrelated, that is, Cov(Ui, Uj) = 0 for i ≠ j. This is not trivial !!! Hence, using (30) we have (recall that the sequence is centered)

Γn = E[Xn Xn^t] = E[C Un Un^t C^t] = C E[Un Un^t] C^t = C D C^t,    (36)

where D is the diagonal matrix with entries v0, . . . , vn−1, and vi = E[Ui+1²] = E[(Xi+1 − X̂i+1)²] are the same quantities as in the Innovation Algorithm theorem. This is used in Section 5.3.2 on estimation.
