ECON-UB 233 Dave Backus @ NYU

Math Tools: Stochastic Processes
Revised: November 30, 2015

All of modern macroeconomics, and most of modern finance, is concerned with how randomness unfolds through time. Such combinations of randomness and dynamics go by the name of stochastic processes. Thus we might say, as Lars Hansen and Tom Sargent often do, that a macroeconomic model — or an asset pricing model — is a stochastic process.

Our goal here is to review the theory of stochastic processes in a relatively low-tech way. We start where we left off, with an event tree, which we extend to more than two periods. This isn't difficult, but we need new notation to track how uncertainty unfolds with time.

We then proceed to impose more structure. Typically we'd like uncertainty to have some regularity to it, so that behavior at one date is similar to the behavior at another. Without something like this, it's hard to know how we'd compare models with data. Think, for example, about estimating the mean from time series data. If every period has its own distribution, what does this even mean? We focus on linear normal processes, where the analysis is relatively transparent, but the same ideas extend to other environments.

There are some optional sections that use linear algebra. Keep in mind: they're optional. But also keep in mind: linear algebra is one of the great things in life, something you should think about learning if you want to have more fun in life. The applications to Markov chains are a good example.

1 Event trees

We can expand the concept of an event tree to more than the two dates we've used so far. An example is Figure 1, but we can imagine extending it further.

An event tree starts with a single event that we'll label z0: the event z that occurred at date t = 0. At date t = 1 another event z1 occurs. In the figure, there are two possible events, z1 = 1 and z1 = 2, each represented by a branch stemming from the initial event z0. The probability of each branch, conditional on starting at z0, is Prob(z1|z0). These probabilities are all greater than or equal to zero, and the sum across events z1 is one.

We can extend the idea to as many periods as we like, with the node at any date t giving rise to further branches at t + 1. It's a little cumbersome, but nonetheless helpful, to have notation for the various nodes we generate this way. We refer to the history of events through date t as the sequence (z0, z1, . . . , zt), which we denote by z^t. Each such history defines a specific node in the tree. For example, the node labeled (A) in the figure comes from the history z^1 = (z0, z1 = 1). The node in the upper right labeled (C) comes from the history z^2 = (z0, z1 = 1, z2 = 1). We're defining nodes by listing every step we followed to get there, which is sufficient but seems a little excessive.

Page 2: Math Tools: Stochastic Processes - NYUpages.stern.nyu.edu/~dbackus/233/notes_math_stochprocesses.pdf · Math Tools: Stochastic Processes Revised: November 30, 2015 All of modern macroeconomics,

You may notice that we have two kinds of probabilities here for a node like (C): the probability conditional on being at (A) and the probability from the starting point z0. Typically we construct the latter from the former; that is, from probabilities from one period to the next. Suppose, for example, that the probability of taking branch z1 = 1 from z0 is Prob(z1 = 1|z0), and the probability of continuing with z2 = 1 is Prob[z2 = 1|z^1 = (z0, z1 = 1)]. Then the probability at z0 of getting to (C) is the product:

Prob[z^2 = (z0, z1 = 1, z2 = 1)|z0] = Prob(z1 = 1|z0) Prob[z2 = 1|z^1 = (z0, z1 = 1)].

In words: the probability of getting from z0 to (C) is the probability of getting from z0 to (A) times the probability of getting from (A) to (C).

That gives us two ways of constructing probabilities over the tree. We can start with one-period probabilities and multiply. Or we can start with multi-period probabilities — the probabilities of every node in the tree as of date zero — and divide.
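
To make the multiplication concrete, here is a minimal Matlab sketch for the two-period tree in Figure 1; the branch probability 0.6 is an assumption for illustration, not a number from the notes.

% one-period branch probabilities (illustrative values)
p_up = 0.6;  p_down = 1 - p_up;
% probabilities of the four date-2 nodes, top to bottom, as products of branch probabilities
p_nodes = [p_up*p_up, p_up*p_down, p_down*p_up, p_down*p_down]
sum(p_nodes)   % the node probabilities sum to one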

Given a tree, we can then specify things like cash flows at each node and compute their prices at earlier nodes. Prices and cash flows are random here because the nodes are random: we don't know which ones will occur. It's analogous to what we did earlier with random variables. In our two-period settings, cash flows (and other things) were random because they were functions of the state z. Here the same idea applies but with state z replaced by history z^t. In this respect, the appropriate definition of a "state" at date t is the history z^t.

All of this works reasonably well, but it's way too complicated. And without some regularity in how uncertainty works at different dates, it's hard to see how we'd be able to connect properties of the tree to analogous properties of data. What kind of tree, for example, corresponds to the behavior of the short-term interest rate? Or the growth rate of consumption?

2 Well-behaved stochastic processes

We will use stochastic processes with a lot more structure than this. Why? Lots of reasons, but mainly because they're simpler — remember, simple is good — and they give us a clear link between models and evidence.

We impose three kinds of structure:

• Markov. One kind of structure is named after the Russian mathematician Andrey Markov: the probabilities over states next period depend on this period's state, but not past states. Formally, we would write Prob(zt+1|z^t) = Prob(zt+1|zt). That means we don't need the complete history z^t; the event zt tells us everything we need to know about the situation at date t. We will refer to it as the state for that reason. We'll see that clever definitions of the state allow us to make this more flexible than it looks.

• Stationary. The next example of structure is that the model is the same at all dates. Specifically, the form of the probabilities Prob(zt+1|zt) = p(zt+1|zt) doesn't vary with time t. The word stationary here means the function p doesn't change. It doesn't mean that the distribution over future events is always the same. It does mean that any variation in their distribution works through the current state zt. Terminology varies. Some call this time-homogeneous: the same structure at all times.


• Stable. Our final example of structure has to do with the behavior of probabilities over long time horizons. The Markov property refers to one-period conditional distributions Prob(zt+1|zt), but we can extend these one-period distributions to multiple periods. The logic is similar to what we did with event trees: by stringing together one-period conditional distributions we can construct multi-period conditional distributions Prob(zt+k|zt) for any k ≥ 1. We'll save the details until we have specific candidates to work with, but you can probably imagine doing something of this sort.

What happens as we extend k to longer time horizons? The structure we've imposed so far allows a lot of different kinds of behavior, but in all the examples we'll use, the conditional distribution will "settle down." By that we mean that the distribution Prob(zt+k = z|zt) converges as we increase k to a unique distribution Prob(z) that doesn't depend on the current state zt. If it does, we say the process is stable and refer to the resulting Prob(z) as the equilibrium distribution. There are two conditions here: that it converges, and that it converges to the same thing for every zt. The concept of convergence here involves functions, something we'll do our best to finesse.

Terminology varies here, too. Some use the exotic term ergodic. Others, in a successful attempt to spread confusion, refer to such processes as stationary.

Example. I got this one from mathematician Martin Hairer. We have a light switch that can be on or off, which we represent with a state variable z that equals one if the light is on, zero if it's off. Between any two dates, the state stays the same with probability π = (1 + ϕ)/2 and changes with probability 1 − π = (1 − ϕ)/2, with −1 < ϕ < 1. [Draw diagram.] This strange-looking notation turns out to be fortuitous. If ϕ = 0 this gives us 50-50 draws every period. If ϕ = 1, the light stays in the same state forever. If 0 < ϕ < 1, the light is more likely to stay in the same state than to change.

Does this stochastic process have the properties we mentioned? Let’s check:

• Markov. Yes, the probabilities of (on, off) next period depend on the current state, but not on the past.

• Stationary. Yes, the probabilities are the same at all dates. (Hairer uses the term “time-homogeneous.”)

• Stable. This one we have to think about. What is the distribution over the light being on or off in the future? If the light is on at date t, the probability of it being on at t + 1 is (1 + ϕ)/2.

Over longer periods, we can compute probabilities recursively, meaning one period at a time. Suppose we have already computed the probability

pk = Prob(zt+k = 1|zt = 1).

That is: the probability that the light is still on k periods from now given that it's on now. Then the probability that it's on the following period is

pk+1 = pk(1 + ϕ)/2 + (1− pk)(1− ϕ)/2 = 1/2 + ϕ(pk − 1/2).

3

Page 4: Math Tools: Stochastic Processes - NYUpages.stern.nyu.edu/~dbackus/233/notes_math_stochprocesses.pdf · Math Tools: Stochastic Processes Revised: November 30, 2015 All of modern macroeconomics,

Stringing these together, we have

pk − 1/2 = ϕ^k (p0 − 1/2).

As we increase k, the probability pk approaches one-half. That's true whatever initial probability p0 we start with.

So yes, the model is stable.

One more thing. Our approach to this example illustrates something we'll see more of: we computed the conditional distribution one period at a time. Recursively, so to speak.
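
Here is a minimal Matlab sketch of that recursion; the value ϕ = 0.8 is an assumption for illustration.

phi = 0.8;                          % illustrative persistence parameter
p = 1;                              % p_0 = 1: the light is on today
for k = 1:25
    p = 1/2 + phi*(p - 1/2);        % p_{k+1} = 1/2 + phi*(p_k - 1/2)
    fprintf('k = %2d  p_k = %.4f\n', k, p);
end
% p_k approaches 1/2 whatever p_0 we start with, as long as |phi| < 1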

3 Digression: matrix multiplication (skip)

Here's a quick summary of some linear algebra we'll find helpful. A matrix is a table of numbers. The matrix A, for example, has (say) m rows and n columns, so we would say it is "m by n." The element in the ith row and jth column is usually written aij. A (column) vector is a matrix with one column (n = 1).

If the dimensions line up, we can multiply one matrix times another. Suppose we multiply A times B and call the result C (a new matrix). The rule for doing this is

cij = Σ_k aik bkj,

where cij is the ijth element of C. We apply the rule for all possible i and j. This works only if the second dimension of A equals the first dimension of B: if, for example, A is m by n and B is n by p. Then the subscript k in the sum runs from 1 to n.

Matlab is set up to work directly with vectors and matrices. When we define a matrix in Matlab, we separate rows with semi-colons. The command A = [1 2 5; 6 4 2], for example, defines the 2 (rows) by 3 (columns) matrix

A = [ 1  2  5
      6  4  2 ].

Similarly, B = [1 4; 3 2] defines the matrix

B = [ 1  4
      3  2 ].

You might verify by hand that

BA = [ 25  18  13
       15  14  19 ].

You can do the same calculation in Matlab with the command C = B*A.

As a test of your knowledge, what is AB? What happens if you type D = A*B in Matlab?


4 Markov chains (skip)

Markov chains are wonderful examples of Markov processes for illustrating the concepts of stationarity and stability. Both are relatively simple if you're comfortable with linear algebra and matrix multiplication.

Here's the idea. The state zt takes on a finite set of values, the same at all dates, which we label with the integers i and j. The probability of state j next period depends only on today's state i (Markov) and is the same at all dates (stationary):

pij = Prob(zt+1 = j|zt = i).

We can collect them in a matrix P, called the transition matrix, where pij is the element in the ith row and jth column. Thus row i gives us the probabilities of moving to any state j next period from state i now.

Examples. The example in Section 2, with the light on and off, has this form with

P = [ (1 + ϕ)/2   (1 − ϕ)/2
      (1 − ϕ)/2   (1 + ϕ)/2 ].

The rows here represent the current state, the columns next period’s state.

Here’s another one:

P = [ 0.8  0.2
      0.3  0.7 ].

Row i is the probability distribution over states next period if the current state is i. If we're in state 1 now, there's an 80% chance we'll stay there and a 20% chance we'll move to state 2. These probabilities are nonnegative and sum to one. Ditto for row 2. Put another way: any matrix P is a legitimate transition matrix if (i) pij ≥ 0 and (ii) Σ_j pij = 1. By choosing different numbers, we can generate a wide range of behavior.

The matrix P describes probabilities of one-period changes in the state. But what are the probabilities of moving from state 1 to state 2 in 2 periods? In 10 periods? As it happens, the probability of moving from state i at date t to state j at date t + k is the ijth element of P^k. So we can compute the whole set of probabilities by computing powers of P. In our example, we get

P^2 = [ 0.70  0.30
        0.45  0.55 ],

P^3 = [ 0.650  0.350
        0.525  0.475 ].

Try it yourself in Matlab.
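
For example, here is a minimal Matlab sketch of that calculation, using the transition matrix above; the horizon of 20 periods is an arbitrary choice.

P = [0.8 0.2; 0.3 0.7];   % transition matrix from the example
P^2                       % two-period transition probabilities
P^3                       % three-period transition probabilities
P^20                      % rows are (nearly) identical: each is the equilibrium distribution (0.6, 0.4)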

We're typically interested in examples in which the probabilities are stable. If we increase the time horizon k, what happens to the probabilities? With the right conditions, P^k converges to a matrix whose rows are all the same. This says that the probability distribution across states doesn't depend on the current state if we get far enough in the future. Each row represents the equilibrium distribution.


Sometimes we converge to a unique equilibrium distribution, and sometimes we don't. Here are some examples to help us think about how this works. (i) In the example above, show that the equilibrium distribution is (0.6, 0.4). [How? Keep taking higher powers of P. You should see that every row converges to (0.6, 0.4).] (ii) For each of these choices of P, describe what happens as we compute P^k for larger values of k.

P = [ 1  0
      0  1 ],

P = [ 0.8  0.1  0.1
      0    1    0
      0    0    1 ],

P = [ 0  1
      1  0 ].

In each case, verify that P is a legitimate transition matrix, then compute powers of P and see how they behave.

We're interested, for the most part, in P's that have unique equilibrium distributions. One way to guarantee that is the condition: pij > 0. It's sufficient, but stronger than we need. The main point is that we need some structure to make it work. We'll see the same thing in other contexts.

5 Linear models 1: autoregressions

Next up are linear stochastic processes. Linear processes are incredibly useful because they illustrate the concepts in relatively simple form and are capable of approximating a lot of what we observe about the economy and asset returns. It's possible nonlinearities are important in the real world, but there's enough noise in most real-world situations that it's typically hard to say whether that's true or not. In any case, linear processes are the predominant tool of modern macroeconomics and finance and a good starting point even if we're ultimately interested in something else.

One of the simplest linear processes is a first-order autoregression or AR(1). We'll use it over and over again. A random variable xt evolves through time according to

xt = ϕxt−1 + σwt. (1)

Here {wt} is a sequence of random variables, all of them standard normal (normal with mean zero and variance one) and mutually independent. The w's are commonly referred to as "errors," "disturbances," or "innovations." Equation (1) is referred to as an autoregression because xt is regressed on a lag of itself ("auto" = "same") and first-order because one lag of xt appears on the right.

We'll deal extensively with the w's, so let's be clear about their properties. First, the mean is zero: E(wt) = 0 for all t. Second, the variance is one. That means

Var(wt) = E{[wt − E(wt)]²} = E(wt²) = 1.

Third, they're independent. For dates t and s ≠ t we have

E(wtws) = E(wt)E(ws) = 0.


The first equality follows because they're independent, so the expectation of the product is the product of the expectations.

One more thing: the timing of information. We'll say that at any date t, we know the current and past values of x and w, but not the future values.

Let's go back to the autoregression, equation (1), and ask ourselves: (i) Is it Markov? If so, for what definition of the state? (ii) Is it stationary? (iii) Does it have a unique equilibrium distribution? We'll consider each in turn.

On question (i): Let's guess it's Markov with state zt = xt and see how that works. If we take equation (1) updated one period, we see that the conditional distribution of xt+1 depends only on the state xt. What is the distribution? Well, since it comes from wt+1, which is normal, it's also normal. And since it's normal, once we know the mean and variance we're done. The conditional mean is

E(xt+1|zt) = E(ϕxt + σwt+1|zt) = ϕxt.

Why? Because we know xt (that's what we meant by conditional on xt) but we don't know wt+1 (it hasn't happened yet). So we take the mean of wt+1, which is zero. What about the variance of the conditional distribution — what we might call the conditional variance? The variance, as we know, is the expected squared difference between the random variable and its mean. The conditional variance is therefore

Var(xt+1|zt) = E[(xt+1 − ϕxt)²|zt] = E[(σwt+1)²|zt] = σ².

Thus we have: xt+1|zt ∼ N(ϕxt, σ²). Since the distribution depends only on xt, that's evidently sufficient to define the state.

On question (ii): Yes, it's stationary. Nothing in (1) depends on the date, including the distribution of the disturbance wt.

Now about question (iii): Is the process stable? What is the equilibrium distribution? Is it unique? The answer is yes if |ϕ| < 1, but let's show that for ourselves. One approach is to compute the distribution directly from (1). The logic here is to presume an equilibrium distribution exists and compute its properties. It's the same distribution for xt and xt−1, so it has the same mean at both dates:

E(x) = ϕE(x) ⇒ E(x) = 0/(1− ϕ) = 0.

What about the variance? Again, the distributions of xt and xt+1 are the same, so we have

Var(x) = E(xt²) = E[(ϕxt−1 + σwt)²] = ϕ²Var(x) + σ².

That gives us Var(x) = σ²/(1 − ϕ²). This works if ϕ² < 1, but if not the variance is negative, which we know can't be. So if ϕ² < 1, the equilibrium distribution is normal with mean zero and variance σ²/(1 − ϕ²). If ϕ² ≥ 1, we don't converge: there is no equilibrium distribution.


The stability property is more obvious if we go through the effort of deriving the equilibrium distribution as a limit. Suppose we think of xt+k from the perspective of date t. If the current state is xt, what is the distribution of xt+k? This is the conditional distribution of xt+k: conditional on the current state xt. Repeated application of (1) gives us

xt+k = ϕ^k xt + σ[wt+k + ϕwt+k−1 + · · · + ϕ^(k−1) wt+1].

The conditional mean and variance are

E(xt+k|xt) = ϕ^k xt
Var(xt+k|xt) = σ²[1 + ϕ² + · · · + ϕ^(2(k−1))] = σ²(1 − ϕ^(2k))/(1 − ϕ²).

The second one follows because E(wt²) = 1 and E(wtwt+j) = 0 for j ≠ 0. Does this settle down as we increase k? Yes, as long as ϕ² < 1. The conditional mean converges to zero and the conditional variance converges to σ²/(1 − ϕ²), the same answer we had before.
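
Here is a minimal Matlab sketch that checks the equilibrium variance by simulation; the parameter values (ϕ = 0.9, σ = 1) and the sample size are assumptions for illustration.

phi = 0.9;  sigma = 1;  T = 1e5;     % illustrative values
w = randn(T,1);                      % standard normal disturbances
x = zeros(T,1);
for t = 2:T
    x(t) = phi*x(t-1) + sigma*w(t);  % equation (1)
end
var(x)                               % sample variance of the simulated series
sigma^2/(1 - phi^2)                  % equilibrium variance: the two should be close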

We can do the same for higher-order autoregressions: versions of (1) with additional lags of x on the right. Examples include

AR(1) : xt = ϕxt−1 + σwt

AR(2) : xt = ϕ1xt−1 + ϕ2xt−2 + σwt

AR(p) : xt = ϕ1xt−1 + ϕ2xt−2 + · · ·+ ϕpxt−p + σwt.

These models are stationary by construction: the equations don't change with t. They're Markov if we define the state at t − 1 by zt−1 = (xt−1, . . . , xt−p). And they're stable under some conditions on the ϕ's that we'll touch on briefly in Section 9.

6 Digression: the value of information

Generally our ability to forecast gets worse as the time horizon grows. Forecasting two years into the future is harder than forecasting two months into the future. Weather forecasting, for example, only works out to about ten days. Beyond ten days, we're simply taking a draw from the equilibrium distribution. If on average it rains 20% of the time, then our best forecast for ten days from now is that the probability of rain is 20%.

The issue here is the amount of variation in, respectively, the conditional mean and the forecast error. The conditional mean is our forecast. Its variance represents our ability to forecast; if the variance is zero, we're forecasting the same thing in all states, which is pretty much useless. The conditional variance is the variance of our forecast error. It represents the error of our forecast.

Let's look more closely at these two components of variance. Consider the behavior of xt+k from the perspective of date t. We can write

xt+k = E(xt+k|xt) + [xt+k − E(xt+k|xt)],

the sum of two components. Since we’re lazy, our first step is to use more compact notation:

xt+k = Et(xt+k) + [xt+k − Et(xt+k)],


where Et(xt+k) = E(xt+k|xt) is shorthand for "the mean of xt+k conditional on the state at date t." In both versions, the first component is the conditional mean, which we might think of as a forecast. The second component is the forecast error. The components are independent, because the first one depends on w's that happened at t or before and the second one depends on w's that happened after. [Use the expressions in the previous section to remind yourself of this.]

The variance of xt+k can be broken into similar components. Since the two terms are independent, we have

Var(xt+k) = E{Et(xt+k) + [xt+k − Et(xt+k)]}² = E[Et(xt+k)]² + E[xt+k − Et(xt+k)]².

Thus we have that the variance of the random variable x consists of two components: the variance of the forecast and the variance of the forecast error.

This seems a little mysterious, so let's substitute the expressions for the AR(1). That gives us

Var(xt+k) = E(ϕ^k xt)² + E[σ²(1 − ϕ^(2k))/(1 − ϕ²)] = ϕ^(2k) σ²/(1 − ϕ²) + σ²(1 − ϕ^(2k))/(1 − ϕ²).

The first term is the part of the variance of xt+k that's predictable — it comes from the forecast after all. The second is the part that's unpredictable. You might verify that they sum to the variance.

Now think about how the components change as we increase the forecast horizon. As we increase k, our ability to forecast declines, and the forecast E(xt+k|xt) = ϕ^k xt approaches zero, a constant. The variance of this term evidently also approaches zero. The other term grows to make up the difference: as we increase the time horizon k, the variance of the forecast error approaches the equilibrium variance of x. This takes a simple form in the AR(1) case, but the idea is more general.
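
A minimal Matlab sketch of the decomposition for the AR(1), again with illustrative values ϕ = 0.9 and σ = 1:

phi = 0.9;  sigma = 1;                          % illustrative values
k = (1:40)';                                    % forecast horizons
var_x        = sigma^2/(1 - phi^2);             % equilibrium variance
var_forecast = phi.^(2*k)*var_x;                % variance of the forecast phi^k * x_t
var_error    = sigma^2*(1 - phi.^(2*k))/(1 - phi^2);   % variance of the forecast error
max(abs(var_forecast + var_error - var_x))      % zero (up to rounding): the pieces sum to the variance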

7 Linear models 2: moving averages

Our next example is one we’ll put to work repeatedly: what we call a moving average.

The simplest example is a first-order moving average or MA(1):

xt = σ (θ0wt + θ1wt−1) . (2)

Here, as in the previous section, wt is a sequence of independent standard normal random variables. The state zt−1 at date t − 1 is the lagged disturbance wt−1.

Does equation (2) satisfy our conditions? Well, it's stationary (it holds for all t) and Markov (the state variable wt summarizes the situation at t). This is essential, so let's go through it slowly. How do we know what the state is? The state is whatever we need to say what the distribution of xt+1 is as of date t. In this example, we need to know wt, but wt+1 is simply part of the distribution of xt+1. So the state at date t is wt.

Is equation (2) stable? Let’s see. The conditional mean and variance of xt+1 are

Et(xt+1) = σθ1wt
Vart(xt+1) = σ²θ0².

As above, the subscript t on E and Var means "conditional on the state z at date t": Et(xt+1) = E(xt+1|zt) and Vart(xt+1) = Var(xt+1|zt).

What about xt+2? From (2) we have

xt+2 = σ (θ0wt+2 + θ1wt+1) .

Viewed from the perspective of the state wt at t, none of this is known. The conditional mean and variance are therefore

Et(xt+2) = 0
Vart(xt+2) = σ²(θ0² + θ1²).

Therefore, conditional on the state at t, xt+2 is normal with mean zero and variance σ²(θ0² + θ1²). What about xt+3? It should be clear that it has the same conditional distribution as xt+2. An MA(1) has a one-period memory, so once we get beyond one period the conditional distribution is the same. Evidently we've converged to the equilibrium distribution: normal with mean zero and variance σ²(θ0² + θ1²).
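
A minimal Matlab sketch of the MA(1); the values θ0 = 1, θ1 = 0.5, σ = 1 and the sample size are assumptions for illustration.

theta0 = 1;  theta1 = 0.5;  sigma = 1;  T = 1e5;   % illustrative values
w = randn(T,1);                                    % standard normal disturbances
x = sigma*(theta0*w(2:T) + theta1*w(1:T-1));       % equation (2), for t = 2,...,T
var(x)                                             % sample variance
sigma^2*(theta0^2 + theta1^2)                      % equilibrium variance: the two should be close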

How about an MA(2)? It has the form

xt = σ (θ0wt + θ1wt−1 + θ2wt−2) .

It's Markov with state zt = (wt, wt−1). If we computed conditional distributions, we'd find that the effect of the state disappears, in this case, after two periods. We'll skip that for now, but consider a more general case shortly.

Moving averages in general have the form

MA(1) : xt = σ (θ0wt + θ1wt−1)

MA(2) : xt = σ (θ0wt + θ1wt−1 + θ2wt−2)

MA(q) : xt = σ (θ0wt + θ1wt−1 + θ2wt−2 + · · ·+ θqwt−q)

MA(∞) : xt = σ (θ0wt + θ1wt−1 + θ2wt−2 + θ3wt−3 + · · · ) .

Since the infinite MA includes the others as special cases, let's consider its properties. Is it stationary? Yes, the equation holds at all dates t. Is it Markov? Yes, if we allow ourselves an infinite-dimensional state zt = (wt, wt−1, . . .). This goes against our goal of simplicity, but we'll see that it's convenient for other reasons.

Is the infinite MA stable? Let's see how the conditional distribution of xt+k changes as we increase k. The random variable xt+k is, for any positive k,

xt+k = σ (θ0wt+k + θ1wt+k−1 + · · ·+ θk−1wt+1) + σ (θkwt + θk+1wt−1 + · · · ) .


Conditional on zt = (wt, wt−1, . . .), the second collection of wt's is known, but the first is not. The conditional mean and variance are therefore

Et(xt+k) = σ(θk wt + θk+1 wt−1 + · · ·)
Vart(xt+k) = σ²(θ0² + θ1² + · · · + θk−1²).

Do they converge as we increase k? Evidently we need

lim_{k→∞} θk = 0

for the mean to converge. [This isn't a tight argument, but it'll do for us.] And we need

lim_{k→∞} (θ0² + θ1² + · · · + θk−1²) = Σ_{k=0}^∞ θk² < ∞

for the variance to converge. (The notation "< ∞" means here that the sum converges.) The distribution of xt+k for k sufficiently far in the future is then normal with mean zero and variance σ² Σ_{k=0}^∞ θk². The second condition is sometimes phrased as "the moving average coefficients are square summable." The second implies the first, so we'll stick with that. Roughly speaking, we need the squared moving average coefficients to go to zero at a fast enough rate.

One of the things that makes moving averages so useful is that even autoregressions can be written this way. Consider the AR(1), equation (1). If we substitute backwards, we find

xt = ϕxt−1 + σwt
   = σwt + ϕ(ϕxt−2 + σwt−1)
   = σ(wt + ϕwt−1 + ϕ²wt−2 + · · ·).

We refer to this as the "moving average representation" of the AR(1). The AR(1) simply applies some structure to the moving average coefficients, namely that they decline geometrically.

This representation is sometimes useful in finding the equilibrium distribution. For example, the mean here is zero: take expectations of both sides. The variance is

Var(xt) = σ²E(wt + ϕwt−1 + ϕ²wt−2 + · · ·)²,

which converges, as we've seen, if ϕ² < 1.
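
A minimal Matlab sketch that checks this: sum the squared moving average coefficients, truncated at a long but finite lag, and compare with σ²/(1 − ϕ²). The values ϕ = 0.9 and σ = 1 are assumptions for illustration.

phi = 0.9;  sigma = 1;               % illustrative values
j = 0:200;                           % truncate the infinite moving average
a = sigma*phi.^j;                    % coefficients sigma, sigma*phi, sigma*phi^2, ...
sum(a.^2)                            % variance from the moving average representation
sigma^2/(1 - phi^2)                  % direct formula: essentially the same number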

8 Linear models 3: ARMA(1,1)’s

We now have two linear models at our disposal: autoregressions and moving averages. Combining them gives us a wide range of behavior with a small number of parameters. One of my favorites is the ARMA(1,1):

xt = ϕ1xt−1 + σ (θ0wt + θ1wt−1) .


This is Markov with date t state zt = (xt, wt).

Its properties are evident from its moving average representation. Repeated substitution gives us

xt = σθ0wt + σ(ϕ1θ0 + θ1)wt−1 + σ(ϕ1θ0 + θ1)ϕ1wt−2 + σ(ϕ1θ0 + θ1)ϕ1²wt−3 + · · · .

That is: the first two moving average coefficients are arbitrary, then they decline at rate ϕ1. Variants of this model are the basis of the most popular models of bond pricing. One useful feature is that its conditional mean is autoregressive:

Et(xt+1) = σ(ϕ1θ0 + θ1)wt + σ(ϕ1θ0 + θ1)ϕ1wt−1 + σ(ϕ1θ0 + θ1)ϕ1²wt−2 + · · · .

How do we know it's autoregressive? Because the moving average coefficients decline at a constant rate.

Comment: This is simpler if we set θ0 = 1, a convenient normalization as long as θ0 ≠ 0.
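
A minimal Matlab sketch of the ARMA(1,1) moving average coefficients; the values ϕ1 = 0.9, θ0 = 1, θ1 = 0.5, σ = 1 are assumptions for illustration.

phi1 = 0.9;  theta0 = 1;  theta1 = 0.5;  sigma = 1;   % illustrative values
a = zeros(1,10);
a(1) = sigma*theta0;                    % coefficient on w_t
a(2) = sigma*(phi1*theta0 + theta1);    % coefficient on w_{t-1}
for j = 3:10
    a(j) = phi1*a(j-1);                 % later coefficients decline at rate phi1
end
a                                       % display the first ten coefficients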

9 Matrix representations (skip)

We can extend all of this to vector processes and, moreover, express univariate ARMA models in compact matrix form. The standard form is an AR(1) in vector form:

xt = Axt−1 +Bwt, (3)

where xt and wt are both vectors. This has the same structure as equation (1) with the matrix A playing the role of ϕ and B the role of σ. It's a stationary Markov process in the state xt. It's stable if the eigenvalues of A are all less than one in absolute value: if A^k approaches zero as we increase k. (This last statement will make sense if you have taken a linear algebra course, but not if you haven't. See, for example, Steven Pinker's comment.)

Finite ARMA models — ARMA(p,q) for finite p and q — are conveniently expressed in this form with the appropriate definition of the state. An ARMA(1,1), for example, can be expressed

[ xt ]   [ ϕ1  θ1 ] [ xt−1 ]   [ θ0 ]
[ wt ] = [ 0   0  ] [ wt−1 ] + [ 1  ] [wt].

In practice, then, if we can handle the vector process (3) we can handle finite ARMA models.
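
Here is a minimal Matlab sketch of the ARMA(1,1) in the form of equation (3); the parameter values are illustrative (the same ones used in the sketch above), and the eigenvalue check is the stability condition just mentioned.

phi1 = 0.9;  theta0 = 1;  theta1 = 0.5;       % illustrative values
A = [phi1 theta1; 0 0];                       % transition matrix for the state (x_t, w_t)
B = [theta0; 1];
abs(eig(A))                                   % both less than one, so the process is stable
T = 1000;  z = zeros(2,T);
for t = 2:T
    z(:,t) = A*z(:,t-1) + B*randn;            % simulate equation (3); z(1,:) is the ARMA(1,1) series
end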

10 Autocorrelation functions

We don't observe moving average coefficients, but we do observe one consequence of them: the autocovariance and autocorrelation functions.

We can compute both from the moving average representation. Suppose we have a time series with moving average representation

xt = µ + Σ_{j=0}^∞ aj wt−j.


(The reason for the change in notation should be apparent soon.) The mean is evidently µ. What is the covariance between xt and xt−k for k ≥ 0? We've seen that we can easily compute sample analogs. The covariance is

γx(k) = Cov(xt, xt−k) = E[(xt − µ)(xt−k − µ)]
      = E(Σ_{j=0}^∞ aj wt−j)(Σ_{j=0}^∞ aj wt−k−j)
      = Σ_{j=k}^∞ aj aj−k = Σ_{j=0}^∞ aj aj+k.

We eliminated the cross terms because their expectation is zero [E(wtwt−j) = 0 for j ≠ 0]. The notation γx(k) is standard. The variance is the value for k = 0,

γx(0) = Cov(xt, xt) = Σ_{j=0}^∞ aj²,     (4)

which we’ve seen before.

We refer to γx(k), plotted as a function of k, as the autocovariance function. The autocorrelation function or acf is the same thing scaled by the variance:

ρx(k) = γx(k)/γx(0).

The scaling ensures that |ρx(k)| ≤ 1. By construction ρx(0) = 1. Both are symmetric: we compute them for k ≥ 0, but you get the same for positive and negative values of k. (Try it, you'll see.)

Example 1. Consider the MA(1),

xt = θ0wt + θ1wt−1.

We’ve set σ = 1 here to keep things simple. It has autocovariances

γx(k) = θ0² + θ1²   k = 0
        θ0θ1        k = 1
        0           k > 1.

The autocorrelations are

ρx(k) = γx(k)/γx(0) = 1                  k = 0
                      θ0θ1/(θ0² + θ1²)   k = 1
                      0                  k > 1.

That's the defining property of an MA(1): it has only a one-period memory. You might verify for yourself that an MA(q) has a q-period memory.

Example 2. We've seen that an AR(1) has an infinite moving average representation with MA coefficients aj = ϕ^j. Therefore the autocovariances (4) are

γx(k) = Σ_{j=0}^∞ ϕ^j ϕ^(j+k) = ϕ^k/(1 − ϕ²).


The autocorrelations are therefore

ρx(k) = γx(k)/γx(0) = ϕ^k.

In words: the acf declines at the geometric rate ϕ. In contrast to an MA, it approaches zero gradually.

11 Sample autocorrelations

We can do the same thing with data. You may recall that the sample mean is

x̄ = T⁻¹ Σ_{t=1}^T xt

and the sample variance is

γx(0) = T⁻¹ Σ_{t=1}^T (xt − x̄)².

The notation follows that of the previous section.

We're typically interested in some aspects of the dynamics of x, summarized in the sample autocovariance and autocorrelation functions. The sample covariance is

γx(k) = T⁻¹ Σ_{t=k+1}^T (xt − x̄)(xt−k − x̄).

Since we only have the observations xt for t = 1, ..., T, we need to start the sum at t = k + 1. By longstanding convention, we nevertheless divide the sum by T rather than T − k. We could also consider negative values of k and adjust the range in the sum appropriately.

The shape of γx(k) is useful in telling us about the dynamics of x, but it's more common to scale it by γx(0) and convert it to a correlation. The autocorrelation function ρx(k) is defined by

ρx(k) = γx(k)/γx(0).

Obviously ρx(0) = 1: xt is perfectly correlated with xt. But for other values of k it can take a variety of forms.

When we compute autocorrelations with financial data we see, for example, that autocorrelations for equity returns are very small: returns are virtually uncorrelated over time. Interest rates, however, are very persistent: the autocorrelations decline slowly with k.
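
Here is a minimal Matlab sketch that computes sample autocorrelations for a simulated AR(1) and compares them with the theoretical values ϕ^k; the parameter value and sample size are assumptions for illustration.

phi = 0.9;  T = 1e5;                          % illustrative values
w = randn(T,1);  x = zeros(T,1);
for t = 2:T
    x(t) = phi*x(t-1) + w(t);                 % simulate an AR(1)
end
xbar = mean(x);
gamma0 = mean((x - xbar).^2);                 % sample variance
for k = 1:5
    gammak = sum((x(k+1:T) - xbar).*(x(1:T-k) - xbar))/T;   % sample autocovariance
    fprintf('k = %d  sample acf = %.3f  phi^k = %.3f\n', k, gammak/gamma0, phi^k);
end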


Bottom line

There's a lot here, most of it essential to modeling economies and asset returns over time. We'll use stochastic processes with these three properties:

• Markov. The distribution over next period outcomes depends only on this period’s state.

• Stationary. The conditional distribution is the same at all dates.

• Stable. The distribution over future outcomes settles down as we increase the time horizon.

We’ll put all of them to work.

Practice problems

1. Probabilities in an event tree. Consider the event tree in Figure 1. Suppose that at each node the probability of taking the up branch is ω and the probability of the down branch is 1 − ω. What are the probabilities of the four nodes at t = 2?

Answer. We simply multiply the probabilities of the branches taken to get to each node. Starting at the top and working our way down, they are ω², ω(1 − ω), (1 − ω)ω, and (1 − ω)².

2. Some linear models. Consider the models

(a) xt = 0.9xt−1 + wt

(b) yt = xt + 2, with xt defined in (a)

(c) xt = 0 · wt + 1 · wt−1

(d) xt = ϕxt−1 + wt + θwt−1

(e) xt = 1.2xt−1 + 0.1xt−2 + wt

The disturbances wt are independent standard normals.

For each model, answer the questions:

(i) Is the model Markov? For what definition of the state?

(ii) What is the conditional distribution of xt+1 given the state at date t?

(iii) Is the model stable?

(iv) If it’s stable, what is the equilibrium distribution?

Answer. All of these are Markov. The state (zt, say) is whatever you need to know at date t − 1 to know the conditional distribution of xt.

(a) This is an AR(1). (i) It's Markov with state xt−1. (ii) Conditional distribution: normal with mean 0.9xt−1 and variance one. (iii) Yes, stable, because 0.9 is less than one in absolute value. (iv) The equilibrium distribution is normal with mean zero and variance 1/(1 − 0.9²) = 5.2632. The autocorrelation function is

ρ(k) = 0.9^k.

This includes ρ(1) = 0.9, ρ(2) = 0.9² = 0.81, and so on.


(b) Still an AR(1). (i) Doing the substitution xt = yt − 2 gives us

yt = (1− 0.9) · 2 + 0.9yt−1 + wt.

So it's Markov with state yt−1. (ii) Conditional distribution: normal with mean 0.2 + 0.9yt−1 and variance one. (iii) Yes, stable, because 0.9 is less than one in absolute value. (iv) The equilibrium distribution is normal with mean two and variance 1/(1 − 0.9²) = 5.2632. All we've done here is shift the mean up by two. The autocorrelation function doesn't depend on the mean, so it's the same as before.

(c) This is an MA(1). (i) It's Markov with state wt−1. (ii) Conditional distribution: normal with mean wt−1 and variance zero. (This is an unusual setup: since the coefficient of wt is zero, we learn xt one period ahead of time.) (iii) Yes, stable. For a moving average, all we need is that the coefficients are square summable. That's always true if there's a finite number of terms. (iv) The equilibrium distribution is normal with mean zero and variance one.

(d) This is an ARMA(1,1). (i) It's Markov with state (xt−1, wt−1). (ii) Conditional distribution: normal with mean ϕxt−1 + θwt−1 and variance one. (iii) It's stable if |ϕ| < 1. You can see this from the moving average representation, outlined in the notes:

xt = wt + (ϕ + θ)wt−1 + (ϕ + θ)ϕwt−2 + (ϕ + θ)ϕ²wt−3 + · · · .

The first two moving average coefficients are arbitrary, then they decline at rate ϕ. (iv) The equilibrium distribution is normal with mean zero and variance equal to the sum of squared moving average coefficients:

γ(0) = 1 + (ϕ + θ)²/(1 − ϕ²).

The autocovariances are

γ(k) = ϕ^(k−1)(ϕ + θ)[1 + (ϕ + θ)ϕ/(1 − ϕ²)].

The autocorrelations are ρ(k) = γ(k)/γ(0). They decline at rate ϕ after the first one.

(e) This is an AR(2). (i) It's Markov with state (xt−1, xt−2). (ii) The conditional distribution is normal with mean 1.2xt−1 + 0.1xt−2 and variance one. (iii,iv) It's not stable. You can see this by substituting for a few periods and seeing how the impact of lagged x's works. So there's no equilibrium distribution, autocorrelation function, and so on.

3. Odd and even days. Consider a process that differs on odd and even days. Specifically, let xt = wt + θwt−1 with

wt ∼ N(a, b) if t is even
     N(c, d) if t is odd.

Is xt Markov? Stationary? Stable?

Answer. It's Markov, but not stationary (odd and even days are different) or stable (the conditional distribution of xt+1 depends on whether t is even or odd). This kind of thing comes up all the time. We either adjust the data (seasonal adjustment) or somehow include it in our model.


4. Two-state Markov chain (skip). We can get a sense of how Markov chains work with a two-state example. A two-state chain is characterized by a 2 by 2 transition matrix P. Because the rows sum to one, P has (essentially) two parameters. A convenient parameterization is

P = (1 − ϕ) [ ω  1−ω ]  +  ϕ [ 1  0 ]
            [ ω  1−ω ]       [ 0  1 ],     (5)

where the two parameters are ω and ϕ.

(a) Under what conditions on (ω, ϕ) is P a legitimate transition matrix?

(b) What are the two-period transitions P²? You can either do this by hand or get Matlab to do it. Either way, the key is to arrange the terms into a form similar to (5).

(c) What about the k-period transitions?

(d) What happens as we continue to increase k? What is the equilibrium distribution?

(e) (extra credit) What are the eigenvalues of P?

Answer. This question is useful if you're comfortable with linear algebra, but skip it if you're not. This is the simplest possible Markov chain, but it illustrates a number of possibilities.

(a) The probabilities have to be between zero and one, which gives us these inequalities:

0 ≤ (1− ϕ)ω + ϕ ≤ 1

0 ≤ (1− ϕ)(1− ω) ≤ 1

0 ≤ (1− ϕ)ω ≤ 1

0 ≤ (1− ϕ)(1− ω) + ϕ ≤ 1.

That's sufficient for your answer. If you'd like to go further, here's how it works. The second and third inequalities imply (add them together) −1 ≤ ϕ ≤ 1. The first and fourth imply 0 ≤ ω ≤ 1. That's not quite sufficient though. The second and third imply (directly, divide by 1 − ϕ)

−ϕ/(1− ϕ) ≤ ω ≤ 1/(1− ϕ),

a joint restriction on ω and ϕ. If ϕ ≥ 0 this is irrelevant, it's less restrictive than our earlier condition on ω. But if ϕ < 0, it limits the range of ω. For example, if ϕ = −1/2, then 1/3 ≤ ω ≤ 2/3.

(b,c) The k-period transitions have the form

P^k = (1 − ϕ^k) [ ω  1−ω ]  +  ϕ^k [ 1  0 ]
                [ ω  1−ω ]         [ 0  1 ].

(d) If |ϕ| < 1, then as we increase k, ϕ^k → 0 and P^k converges to the first matrix. The equilibrium distribution is evidently (ω, 1 − ω).

(e) If you're comfortable with linear algebra, you might notice that P has eigenvalues of 1 and ϕ. The first is a feature of all Markov chains: since the rows sum to one, there's an eigenvalue of one. The second tells us how fast we converge to the equilibrium distribution.


5. Properties of an MA(2). An MA(2) can be written

xt = δ + wt + θ1wt−1 + θ2wt−2

with {wt} ∼ NID(0, 1) (the w's are independent normals with mean zero and variance one).

(a) What is the equilibrium distribution of x?

(b) What are the conditional means, Et(xt+1), Et(xt+2), and Et(xt+3)?

(c) What are the conditional variances, Vart(xt+1), Vart(xt+2), and Vart(xt+3)?

(d) What is the autocovariance function,

γ(k) = Cov(xt, xt−k),

for k = 0, 1, 2, 3?

(e) What is the autocorrelation function? Under what conditions are ρ(1) and ρ(2) positive?

Answer.

(a) Since x is a linear combination of normals, it's normal as well. It's therefore sufficient to say what its mean and variance are. Its mean is

E(xt) = E(δ + wt + θ1wt−1 + θ2wt−2) = δ.

Its variance is

Var(xt) = E(xt − δ)² = 1 + θ1² + θ2².

(b) The conditional means are

Et(xt+1) = Et(δ + wt+1 + θ1wt + θ2wt−1) = δ + θ1wt + θ2wt−1

Et(xt+2) = Et(δ + wt+2 + θ1wt+1 + θ2wt) = δ + θ2wt

Et(xt+3) = Et(δ + wt+3 + θ1wt+2 + θ2wt+1) = δ.

You can see that as we increase the forecast horizon, the conditional mean approaches the mean.

(c) The conditional variances are

Vart(xt+1) = Et[(wt+1)²] = 1
Vart(xt+2) = Et[(wt+2 + θ1wt+1)²] = 1 + θ1²
Vart(xt+3) = Et[(wt+3 + θ1wt+2 + θ2wt+1)²] = 1 + θ1² + θ2².

You see here that as we increase the time horizon, the conditional variance approaches the variance.

(d) The autocovariance function is

Cov(xt, xt−k) = 1 + θ1² + θ2²   k = 0
                θ1 + θ1θ2       k = 1
                θ2              k = 2
                0               k ≥ 3.


(e) Autocorrelations are scaled autocovariances: ρ(k) = γ(k)/γ(0). ρ(2) is positive if θ2 is. ρ(1) is positive if θ1(1 + θ2) is. Both are therefore positive if θ1 and θ2 are positive.

6. Combination models. Consider the linear time series model

xt = ϕxt−1 + wt,

with {wt} independent normal random variables with mean zero and variance one. Now consider a second random variable yt built from xt and the same disturbance wt by

yt = xt + θwt.

The question is how this combination behaves.

(a) Is there a state variable for which xt is Markov? What is the distribution of xt+1 conditional on the state at date t?

(b) Express xt as a moving average. What are its coefficients?

(c) Is there a state variable for which yt is Markov? What is the distribution of yt+1 conditional on the state at date t?

(d) Express yt as a moving average. What are its coefficients?

(e) Under what conditions is yt stable? That is: under what conditions does the distribution of yt+k, conditional on the state at t, converge as k gets large?

(f) What is the equilibrium distribution of yt?

(g) What is the first autocorrelation of yt?

Answer.

(a) It’s Markov with state xt. The conditional distribution of xt+1,

xt+1 = ϕxt + wt+1,

is normal with mean ϕxt and variance one.

(b) The moving average representation is

xt = wt + ϕwt−1 + ϕ²wt−2 + · · · .

The coefficients are (1, ϕ, ϕ², . . .).

(c) Two answers, both work: the state can be xt or (more commonly) the vector (yt, wt). The distribution of yt+1:

yt+1 = ϕxt + (1 + θ)wt+1 = ϕ(yt − θwt) + (1 + θ)wt+1

is (conditionally) normal with mean ϕxt = ϕ(yt − θwt) and variance (1 + θ)².

(d) If we add θwt to the expression for xt above, we get

yt = (1 + θ)wt + ϕwt−1 + ϕ²wt−2 + · · · .

(e) It’s stable if |ϕ| < 1: we need the moving average coefficients to approach zero.


(f) Equilibrium distribution: yt is normal with mean zero and variance

Var(yt) = (1 + θ)² + ϕ² + ϕ⁴ + · · · = (1 + θ)² + ϕ²/(1 − ϕ²).

7. Vector autoregressions (skip). We can write many linear models in the form

xt+1 = Axt +Bwt+1. (6)

Here x is a vector, w ∼ N(0, I) is also a vector (of possibly different dimension), and (A,B) are matrices.

(a) Consider the ARMA(1,1):

yt = ϕ1yt−1 + θ0wt + θ1wt−1.

Show that this can be expressed in the same form as (6).

(b) Ditto for the ARMA(2,1):

yt = ϕ1yt−1 + ϕ2yt−2 + θ0wt + θ1wt−1.

(c) For the general model (6), what is the distribution of xt+2 given xt?

(d) Ditto for xt+k. Under what conditions does this converge as k gets large?

Answer. This course is designed to avoid linear algebra. Nevertheless, in this case you might want to know that seemingly complex models can often be written simply in matrix form. In that case, we're dealing with essentially higher-dimensional analogs of an AR(1), which makes programming and insight much easier. If this line of thought doesn't work for you, just ignore it.

(a) The ARMA(1,1) can be written

[ yt+1 ]   [ ϕ1  θ1 ] [ yt ]   [ θ0 ]
[ wt+1 ] = [ 0   0  ] [ wt ] + [ 1  ] [wt+1],

which you should recognize from the notes on stochastic processes.

(b) The ARMA(2,1) becomes

[ yt+1 ]   [ ϕ1  ϕ2  θ1 ] [ yt   ]   [ θ0 ]
[ yt   ] = [ 1   0   0  ] [ yt−1 ] + [ 0  ] [wt+1].
[ wt+1 ]   [ 0   0   0  ] [ wt   ]   [ 1  ]

(c) Now we can put the AR(1) structure to work. After substituting,

xt+2 = A²xt + Bwt+2 + ABwt+1.

The first term (A²xt) is the conditional mean. The others generate variance. It goes beyond this course, but for a vector like wt the variance is a matrix, with variances on the diagonal and covariances otherwise. If that's confusing, just skip it.

If not, we write the variance matrix as E(wt wt⊤), where ⊤ is a brute force way to write the transpose. In our case, we assumed the variance matrix for wt is I, with ones on the diagonal and zeros off.


What's the variance matrix for xt+2? It's defined by

Vart(xt+2) = Et[(xt+2 − A²xt)(xt+2 − A²xt)⊤] = Et[(Bwt+2 + ABwt+1)(Bwt+2 + ABwt+1)⊤] = BB⊤ + ABB⊤A⊤.

(d) Same idea repeated:

Vart(xt+k) = BB⊤ + ABB⊤A⊤ + A²BB⊤(A²)⊤ + · · · + A^(k−1)BB⊤(A^(k−1))⊤.

As k gets large, we get more terms. It converges, though, if the terms shrink fast enough, which requires powers of A to shrink. If you're familiar with linear algebra, you need the eigenvalues of A to be less than one in absolute value.

© 2015 David Backus | NYU Stern School of Business


Figure 1
Representative Event Tree

[Tree diagram: the initial node z0 branches into z1 = 1 and z1 = 2, leading to the date-1 nodes z^1 = (z0, 1) (labeled (A)) and z^1 = (z0, 2) (labeled (B)); each of those branches into z2 = 1 and z2 = 2, leading to the date-2 nodes z^2 = (z0, 1, 1) (labeled (C)), (z0, 1, 2), (z0, 2, 1), and (z0, 2, 2).]

The figure illustrates how uncertainty unfolds over time. Time moves from left to right, starting at date t = 0. At each date t, an event zt occurs. In this example, zt is drawn from the set Z = {1, 2}. Each node is associated with a box and can be identified from the path of events that leads to it, which we refer to as a history and denote by z^t ≡ (z0, ..., zt), starting with an arbitrary initial node z0. Thus the upper right node follows two up branches, z1 = 1 and z2 = 1, and is denoted z^2 = (z0, 1, 1). The set Z2 of all possible 2-period histories is therefore {(z0, 1, 1), (z0, 1, 2), (z0, 2, 1), (z0, 2, 2)}, illustrated by the "terminal nodes" on the right. Not shown are conditional probabilities of particular branches, from which we can construct probabilities for each node/history.
