



Lecture Notes in Financial Econometrics (MBF, MSc course at UNISG)

Paul Söderlind¹

24 June 2005

¹ University of St. Gallen and CEPR. Address: s/bf-HSG, Rosenbergstrasse 52, CH-9000 St. Gallen, Switzerland. E-mail: [email protected]. Document name: FinEcmtAll.TeX

Contents

1 Review of Statistics
  1.1 Random Variables and Distributions
  1.2 Hypothesis Testing
  1.3 Normal Distribution of the Sample Mean as an Approximation

2 Least Squares and Maximum Likelihood Estimation
  2.1 Least Squares
  2.2 Maximum Likelihood

A Some Matrix Algebra

3 Testing CAPM
  3.1 Market Model
  3.2 Several Factors
  3.3 Fama-MacBeth∗

4 Event Studies
  4.1 Basic Structure of Event Studies
  4.2 Models of Normal Returns
  4.3 Testing the Abnormal Return
  4.4 Quantitative Events

A Derivation of (4.8)

5 Time Series Analysis
  5.1 Descriptive Statistics
  5.2 White Noise
  5.3 Autoregression (AR)
  5.4 Moving Average (MA)
  5.5 ARMA(p,q)
  5.6 VAR(p)
  5.7 Non-stationary Processes

6 Predicting Asset Returns
  6.1 Asset Prices, Random Walks, and the Efficient Market Hypothesis
  6.2 Autocorrelations
  6.3 Other Predictors and Methods
  6.4 Security Analysts
  6.5 Technical Analysis
  6.6 Empirical U.S. Evidence on Stock Return Predictability

7 ARCH and GARCH
  7.1 Heteroskedasticity
  7.2 ARCH Models
  7.3 GARCH Models
  7.4 Non-Linear Extensions
  7.5 (G)ARCH-M
  7.6 Multivariate (G)ARCH

8 Option Pricing and Estimation of Continuous Time Processes
  8.1 The Black-Scholes Model
  8.2 Estimation of the Volatility of a Random Walk Process

9 Kernel Density Estimation and Regression
  9.1 Non-Parametric Regression


1 Review of Statistics

Reference: Bodie, Kane, and Marcus (2002) (statistical review in appendix) or any textbook in statistics.

More advanced material is denoted by a star (∗). It is not required reading.

1.1 Random Variables and Distributions

1.1.1 Distributions

A univariate distribution of a random variable x describes the probability of different values. If f(x) is the probability density function, then the probability that x is between A and B is calculated as the area under the density function from A to B

$$\Pr(A \le x < B) = \int_A^B f(x)\,dx. \tag{1.1}$$

See Figure 1.1 for an example. The distribution can often be described in terms of the mean and the variance. For instance, a normal (Gaussian) distribution is fully described by these two numbers. See Figure 1.4 for an illustration.

Remark 1 If x ∼ N(µ, σ²), then the probability density function is

$$f(x) = \frac{1}{\sqrt{2\pi\sigma^2}}\, e^{-\frac{1}{2}\left(\frac{x-\mu}{\sigma}\right)^2}.$$

This is a bell-shaped curve centered on µ and where σ determines the "width" of the curve.

A bivariate distribution of the random variables x and y contains the same information as the two respective univariate distributions, but also information on how x and y are related. Let h(x, y) be the joint density function; then the probability that x is between A and B and y is between C and D is calculated as the volume under the surface of the density function

$$\Pr(A \le x < B \text{ and } C \le y < D) = \int_A^B \int_C^D h(x,y)\,dy\,dx. \tag{1.2}$$

A joint normal distribution is completely described by the means and the covariance matrix

$$\begin{bmatrix} x \\ y \end{bmatrix} \sim N\left(\begin{bmatrix} \mu_x \\ \mu_y \end{bmatrix}, \begin{bmatrix} \sigma_x^2 & \sigma_{xy} \\ \sigma_{xy} & \sigma_y^2 \end{bmatrix}\right), \tag{1.3}$$

where σ²_x and σ²_y denote the variances of x and y respectively, and σ_xy denotes their covariance.

Clearly, if the covariance σ_xy is zero, then the variables are unrelated to each other. Otherwise, information about x can help us to make a better guess of y. See Figure 1.1 for an example. The correlation of x and y is defined as

$$\operatorname{Corr}(x, y) = \frac{\sigma_{xy}}{\sigma_x \sigma_y}. \tag{1.4}$$

If two random variables happen to be independent of each other, then the joint density function is just the product of the two univariate densities (here denoted f(x) and k(y))

$$h(x, y) = f(x)\,k(y) \quad\text{if } x \text{ and } y \text{ are independent.} \tag{1.5}$$

This is useful in many cases, for instance, when we construct likelihood functions for maximum likelihood estimation.

1.1.2 Conditional Distributions∗

If h(x, y) is the joint density function and f(x) the (marginal) density function of x, then the conditional density function is

$$g(y|x) = h(x, y)/f(x). \tag{1.6}$$

For the bivariate normal distribution (1.3) we have the distribution of y conditional on a given value of x as

$$y|x \sim N\left[\mu_y + \frac{\sigma_{xy}}{\sigma_x^2}(x - \mu_x),\; \sigma_y^2 - \frac{\sigma_{xy}\sigma_{xy}}{\sigma_x^2}\right]. \tag{1.7}$$


Figure 1.1: Density functions of univariate and bivariate normal distributions. (Panels: pdf of N(0,1); pdf of bivariate normal, corr = 0.1; pdf of bivariate normal, corr = 0.8.)

Notice that the conditional mean can be interpreted as the best guess of y given that we know x. Similarly, the conditional variance can be interpreted as the variance of the forecast error (using the conditional mean as the forecast). The conditional and marginal distribution coincide if y is uncorrelated with x. (This follows directly from combining (1.5) and (1.6).) Otherwise, the mean of the conditional distribution depends on x, and the variance is smaller than in the marginal distribution (we have more information). See Figure 1.2 for an illustration.

1.1.3 Mean and Standard Deviation

The mean and variance of a series are estimated as

$$\bar{x} = \sum_{t=1}^{T} x_t / T \quad\text{and}\quad \hat{\sigma}^2 = \sum_{t=1}^{T} (x_t - \bar{x})^2 / T. \tag{1.8}$$



Figure 1.2: Density functions of normal distributions. (Panels: pdf of bivariate normal and conditional pdf of y for corr = 0.1 and corr = 0.8; the conditional pdfs are shown for x = −0.8 and x = 0.)

The standard deviation (here denoted Std(x_t)), the square root of the variance, is the most common measure of volatility. (Sometimes we use T − 1 in the denominator of the sample variance instead of T.)
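To make the conventions concrete, here is a minimal sketch in Python of the estimators in (1.8); the data vector is hypothetical.

```python
import numpy as np

x = np.array([2.1, -0.6, 1.3, 0.8, -1.5])  # hypothetical sample

T = len(x)
xbar = x.sum() / T                    # sample mean, (1.8)
sigma2 = ((x - xbar) ** 2).sum() / T  # sample variance with 1/T, (1.8)
std = np.sqrt(sigma2)                 # standard deviation, Std(x_t)

print(xbar, sigma2, std)
# np.var(x) uses the same 1/T convention; np.var(x, ddof=1) uses 1/(T-1)
```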

A sample mean is normally distributed if x_t is normally distributed, x_t ∼ N(µ, σ²). The basic reason is that a linear combination of normally distributed variables is also normally distributed. However, a sample average is typically approximately normally distributed even if the variable is not (discussed below).

Remark 2 If x ∼ N(µ_x, σ²_x) and y ∼ N(µ_y, σ²_y), then

$$ax + by \sim N\left[a\mu_x + b\mu_y,\; a^2\sigma_x^2 + b^2\sigma_y^2 + 2ab\operatorname{Cov}(x, y)\right].$$

If x_t is iid (independently and identically distributed), then it is straightforward to find the variance of the sample average. Then we have that

$$\operatorname{Var}(\bar{x}) = \operatorname{Var}\left(\sum_{t=1}^{T} x_t / T\right) = \sum_{t=1}^{T} \operatorname{Var}(x_t / T) = T \operatorname{Var}(x_t)/T^2 = \sigma^2 / T. \tag{1.9}$$

The first equality is just a definition and the second equality follows from the assumption that x_t and x_s are independently distributed. This means, for instance, that Var(x₂ + x₃) = Var(x₂) + Var(x₃) since the covariance is zero. The third equality follows from the assumption that x_t and x_s are identically distributed (so their variances are the same). The fourth equality is a trivial simplification.

A sample average is (typically) unbiased, that is, the expected value of the sample average equals the population mean. To illustrate that, consider the expected value of the sample average of the iid x_t

$$\operatorname{E}\bar{x} = \operatorname{E}\sum_{t=1}^{T} x_t / T = \sum_{t=1}^{T} \operatorname{E} x_t / T = \operatorname{E} x_t. \tag{1.10}$$

The first equality is just a definition and the second equality is always true (the expectation of a sum is the sum of expectations), and the third equality follows from the assumption of identical distributions which implies identical expectations.

1.1.4 Covariance and Correlation

The covariance of two variables (here x and y) is typically estimated as

$$\operatorname{Cov}(x_t, y_t) = \sum_{t=1}^{T} (x_t - \bar{x})(y_t - \bar{y}) / T. \tag{1.11}$$

(Sometimes we use T − 1 in the denominator of the sample covariance instead of T.) The correlation of two variables is then estimated as

$$\operatorname{Corr}(x_t, y_t) = \frac{\operatorname{Cov}(x_t, y_t)}{\operatorname{Std}(x_t)\operatorname{Std}(y_t)}, \tag{1.12}$$



Figure 1.3: Example of correlations on an artificial sample. Both subfigures use the same sample of y. (Left panel: y = x + 0.2·N(0,1), Corr 0.98; right panel: z = y², Corr −0.02.)

where Std(x_t) is an estimated standard deviation. A correlation must be between −1 and 1 (try to show it). Note that covariance and correlation measure the degree of linear relation only. This is illustrated in Figure 1.3.
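A small simulation sketch in the spirit of Figure 1.3 (the random draws and the seed are arbitrary, not the original sample):

```python
import numpy as np

rng = np.random.default_rng(seed=42)   # seed is an arbitrary choice
x = rng.normal(size=500)
y = x + 0.2 * rng.normal(size=500)     # strong linear relation
z = y ** 2                             # strong relation, but not linear

print(np.corrcoef(x, y)[0, 1])  # close to 1, as in Figure 1.3
print(np.corrcoef(x, z)[0, 1])  # close to 0: correlation misses it
```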

1.2 Hypothesis Testing

The basic approach in testing a "null hypothesis" is to compare the test statistic (the sample average, say) with what its distribution would look like if the null hypothesis were true. If the test statistic would be very unusual, then the null hypothesis is rejected—we are not willing to believe in a null hypothesis that looks very different from what we see in data.

For instance, suppose the null hypothesis (denoted H₀) is that the true value of some parameter is β. Suppose also that we know that the distribution of the parameter estimator, β̂, is normal (discussed in some detail later on) with a known variance of s². For instance, β̂ could be a sample mean. Construct the test statistic by "standardizing" the estimate

$$t = \frac{\hat{\beta} - \beta}{s} \sim N(0, 1) \text{ if } H_0 \text{ is true.} \tag{1.13}$$

Notice that t has a standard normal distribution if the null hypothesis is true. The test is to see if the test statistic would be very unusual—but then we need to define unusual (see below).


Remark 3 The logic of using the standardized t statistic in (1.13) is easily seen by an example. Suppose a random variable (here denoted x, but think of any test statistic) is distributed as N(0.5, 2); then the following probabilities are all 5%

$$\Pr(x \le -1.83) = \Pr(x - 0.5 \le -1.83 - 0.5) = \Pr\left(\frac{x - 0.5}{\sqrt{2}} \le \frac{-1.83 - 0.5}{\sqrt{2}}\right).$$

Notice that (x − 0.5)/√2 ∼ N(0, 1). See Figure 1.4 for an illustration.

1.2.1 Two-Sided Test

As an example of a two-sided test we could have the null hypothesis and the alternative hypothesis

H₀: β = 4
H₁: β ≠ 4.  (1.14)

To test this, we follow these steps:

1. Construct the distribution under H₀: from (1.13) it is such that t = (β̂ − 4)/s ∼ N(0, 1).

2. Would the test statistic (t) be very unusual under the H₀ distribution (N(0, 1))? Since the alternative hypothesis is β ≠ 4, a value of t far from zero (β̂ far from 4) must be considered unusual.

3. Put a value on what you mean by unusual. For instance, suppose you regard something that would happen with 10% probability to be unusual. (This is called the "size" of the test.)

4. In a N(0, 1) distribution, t < −1.65 has a 5% probability, and so does t > 1.65. These are your 10% critical values.

5. Reject H₀ if |t| > 1.65.

The idea is that, if the hypothesis is true, then this decision rule gives the wrong decision in 10% of the cases. That is, 10% of all possible random samples will make us reject a true hypothesis. If we prefer a 5% significance level (which makes the risk of a false rejection smaller), then we should use the critical values of −1.96 and 1.96.



Example 4 Let s = 1.5, β̂ = 6 and β = 4 (under H₀). Then, t = (6 − 4)/1.5 ≈ 1.33, so we cannot reject H₀ at the 10% significance level.

Example 5 If instead, β̂ = 7, then t = 2, so we can reject H₀ at the 10% (and also the 5%) level.

See Figure 1.4 for some examples of normal distributions.
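A sketch of the two-sided decision rule of Examples 4 and 5, using scipy only for the normal critical value:

```python
from scipy.stats import norm

def two_sided_test(beta_hat, beta0, s, size=0.10):
    """Reject H0: beta = beta0 if |t| exceeds the critical value."""
    t = (beta_hat - beta0) / s
    crit = norm.ppf(1 - size / 2)  # about 1.645 for a 10% test, 1.96 for 5%
    return t, crit, abs(t) > crit

print(two_sided_test(6, 4, 1.5))  # t = 1.33: cannot reject (Example 4)
print(two_sided_test(7, 4, 1.5))  # t = 2.00: reject (Example 5)
```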

1.2.2 One-Sided Test∗

A one-sided test is a bit different—since it has a different alternative hypothesis (and therefore a different definition of "unusual"). As an example, suppose the alternative hypothesis is that the mean is larger than 4

H₀: β ≤ 4
H₁: β > 4.  (1.15)

To test this, we follow these steps:

1. Construct the distribution at the boundary of H₀: set β = 4 in (1.13) to get the same test statistic as in the two-sided test: t = (β̂ − 4)/s ∼ N(0, 1).

2. A value of t a lot higher than zero (β̂ much higher than 4) must be considered unusual. Notice that t < 0 (β̂ < 4) isn't unusual at all under H₀.

3. In a N(0, 1) distribution, t > 1.28 has a 10% probability. This is your 10% critical value.

4. Reject H₀ if t > 1.28.

Example 6 Let s = 1.5, β̂ = 6 and β ≤ 4 (under H₀). Then, t = (6 − 4)/1.5 ≈ 1.33, so we can reject H₀ at the 10% significance level (but not at the 5% level).

Example 7 Let s = 1.5, β̂ = 3 and β ≤ 4 (under H₀). Then, t = (3 − 4)/1.5 ≈ −0.67, so we cannot reject H₀ at the 10% significance level.


Figure 1.4: Density functions of normal distributions with shaded 5% tails. (Panels: N(0.5, 2) with Pr(x ≤ −1.83) = 0.05; N(0, 2) for y = x − 0.5 with Pr(y ≤ −2.33) = 0.05; N(0, 1) for z = (x − 0.5)/√2 with Pr(z ≤ −1.65) = 0.05.)

1.2.3 A Joint Test of Several Parameters∗

Suppose we have estimated both β_x and β_y (the estimates are denoted β̂_x and β̂_y) and that we know that they have a joint normal distribution with covariance matrix Σ. We now want to test the null hypothesis

H₀: β_x = 4 and β_y = 2,  (1.16)
H₁: β_x ≠ 4 and/or β_y ≠ 2 (H₀ is not true).

To test two parameters at the same time, we somehow need to combine them into one test statistic. The most straightforward way is to form a chi-square distributed test statistic.

Remark 8 If v₁ ∼ N(0, σ₁²) and v₂ ∼ N(0, σ₂²) and they are uncorrelated, then (v₁/σ₁)² + (v₂/σ₂)² ∼ χ²₂.



Figure 1.5: Probability density functions. (Panel a: pdfs of N(0,1) and t(50); panel b: pdf of χ²(n) for n = 2, 5; panel c: pdf of F(n,50) for n = 2, 5. 10% critical values in parentheses: |1.64| and |1.68|; 4.61 and 9.24; 2.41 and 1.97.)

Remark 9 If the J × 1 vector v ∼ N(0, Σ), then v′Σ⁻¹v ∼ χ²_J. See Figure 1.5 for the pdf.

We calculate the test statistic for (1.16) as

$$c = \begin{bmatrix} \hat{\beta}_x - 4 \\ \hat{\beta}_y - 2 \end{bmatrix}' \begin{bmatrix} \operatorname{Var}(\hat{\beta}_x) & \operatorname{Cov}(\hat{\beta}_x, \hat{\beta}_y) \\ \operatorname{Cov}(\hat{\beta}_x, \hat{\beta}_y) & \operatorname{Var}(\hat{\beta}_y) \end{bmatrix}^{-1} \begin{bmatrix} \hat{\beta}_x - 4 \\ \hat{\beta}_y - 2 \end{bmatrix}, \tag{1.17}$$

and then compare with a 10% (say) critical value from a χ²₂ distribution.

Example 10 Suppose

$$\begin{bmatrix} \hat{\beta}_x \\ \hat{\beta}_y \end{bmatrix} = \begin{bmatrix} 3 \\ 5 \end{bmatrix} \quad\text{and}\quad \begin{bmatrix} \operatorname{Var}(\hat{\beta}_x) & \operatorname{Cov}(\hat{\beta}_x, \hat{\beta}_y) \\ \operatorname{Cov}(\hat{\beta}_x, \hat{\beta}_y) & \operatorname{Var}(\hat{\beta}_y) \end{bmatrix} = \begin{bmatrix} 5 & 3 \\ 3 & 4 \end{bmatrix};$$

then (1.17) is

$$c = \begin{bmatrix} -1 \\ 3 \end{bmatrix}' \begin{bmatrix} 4/11 & -3/11 \\ -3/11 & 5/11 \end{bmatrix} \begin{bmatrix} -1 \\ 3 \end{bmatrix} \approx 6.1,$$

which is higher than the 10% critical value of the χ²₂ distribution (which is 4.61).
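The arithmetic in Example 10 is easy to verify numerically; a sketch:

```python
import numpy as np
from scipy.stats import chi2

beta_hat = np.array([3.0, 5.0])   # estimates from Example 10
beta0 = np.array([4.0, 2.0])      # H0 values from (1.16)
V = np.array([[5.0, 3.0],
              [3.0, 4.0]])        # covariance matrix of the estimates

d = beta_hat - beta0
c = d @ np.linalg.solve(V, d)     # test statistic (1.17)
print(c)                          # about 6.09
print(chi2.ppf(0.90, df=2))       # 10% critical value, about 4.61
print(c > chi2.ppf(0.90, df=2))   # True: reject H0
```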

1.3 Normal Distribution of the Sample Mean as an Approximation

In many cases, it is unreasonable to just assume that the variable is normally distributed. The nice thing with a sample mean (or sample average) is that it will still be normally distributed—at least approximately (in a reasonably large sample). This section gives a short summary of what happens to sample means as the sample size increases (often called "asymptotic theory").

Figure 1.6: Sampling distributions. (Panel a: distribution of the sample average for T = 5, 25, 100; panel b: distribution of √T × the sample average. The variable is the sample average of z_t − 1 where z_t has a χ²(1) distribution.)

The law of large numbers (LLN) says that the sample mean converges to the true population mean as the sample size goes to infinity. This holds for a very large class of random variables, but there are exceptions. A sufficient (but not necessary) condition for this convergence is that the sample average is unbiased (as in (1.10)) and that the variance goes to zero as the sample size goes to infinity (as in (1.9)). (This is also called convergence in mean square.) To see the LLN in action, see Figure 1.6.

The central limit theorem (CLT) says that √T x̄ converges in distribution to a normal distribution as the sample size increases. See Figure 1.6 for an illustration. This also holds for a large class of random variables—and it is a very useful result since it allows us to test hypotheses. Most estimators (including least squares and other methods) are effectively some kind of sample average, so the CLT can be applied.
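A simulation sketch of the LLN and CLT in the setting of Figure 1.6 (the number of replications, 10,000, is an arbitrary choice):

```python
import numpy as np

rng = np.random.default_rng(0)
for T in (5, 25, 100):
    z = rng.chisquare(df=1, size=(10_000, T))  # 10,000 samples of size T
    avg = (z - 1).mean(axis=1)                 # sample average of z_t - 1
    scaled = np.sqrt(T) * avg                  # sqrt(T) x sample average
    # LLN: avg collapses toward 0 as T grows;
    # CLT: scaled keeps a stable spread (Var(z_t) = 2, so std near 1.41)
    print(T, avg.std(), scaled.std())
```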

Bibliography

Bodie, Z., A. Kane, and A. J. Marcus, 2002, Investments, McGraw-Hill/Irwin, Boston, 5th edn.


2 Least Squares and Maximum Likelihood Estimation

More advanced material is denoted by a star (∗). It is not required reading.

2.1 Least Squares

2.1.1 Simple Regression: Constant and One Regressor

The simplest regression model is

yt = β0 + β1xt + ut , where E ut = 0 and Cov(xt , ut) = 0, (2.1)

where we can observe (have data on) the dependent variable y_t and the regressor x_t but not the residual u_t. In principle, the residual should account for all the movements in y_t that we cannot explain (by x_t).

Note the two very important assumptions: (i) the mean of the residual is zero; and (ii) the residual is not correlated with the regressor, x_t. If the regressor summarizes all the useful information we have in order to describe y_t, then the assumptions imply that we have no way of making a more intelligent guess of u_t (even after having observed x_t) than that it will be zero.

Suppose you do not know β₀ or β₁, and that you have a sample of data: y_t and x_t for t = 1, ..., T. The LS estimator of β₀ and β₁ minimizes the loss function

$$\sum_{t=1}^{T}(y_t - b_0 - b_1 x_t)^2 = (y_1 - b_0 - b_1 x_1)^2 + (y_2 - b_0 - b_1 x_2)^2 + \ldots \tag{2.2}$$

by choosing b₀ and b₁ to make the loss function value as small as possible. The objective is thus to pick values of b₀ and b₁ in order to make the model fit the data as closely as possible—where close is taken to be a small variance of the unexplained part (the residual).

Remark 11 (First order condition for minimizing a differentiable function). We want to find the value of b in the interval b_low ≤ b ≤ b_high which makes the value of the differentiable function f(b) as small as possible. The answer is b_low, b_high, or the value of b where df(b)/db = 0. See Figure 2.1.

Figure 2.1: Quadratic loss function. (Subfigure a: 1 coefficient, the function 2b²; subfigure b: 2 coefficients, the function 2b² + (c − 4)².)

The first order conditions for a minimum are that the derivatives of this loss function with respect to b₀ and b₁ should be zero. Let (β̂₀, β̂₁) be the values of (b₀, b₁) where that is true

$$\frac{\partial}{\partial \beta_0}\sum_{t=1}^{T}(y_t - \hat{\beta}_0 - \hat{\beta}_1 x_t)^2 = -\sum_{t=1}^{T}(y_t - \hat{\beta}_0 - \hat{\beta}_1 x_t)\,1 = 0, \tag{2.3}$$

$$\frac{\partial}{\partial \beta_1}\sum_{t=1}^{T}(y_t - \hat{\beta}_0 - \hat{\beta}_1 x_t)^2 = -\sum_{t=1}^{T}(y_t - \hat{\beta}_0 - \hat{\beta}_1 x_t)\,x_t = 0, \tag{2.4}$$

which are two equations in two unknowns (β̂₀ and β̂₁), which must be solved simultaneously. These equations show that both the constant and x_t should be orthogonal to the fitted residuals, û_t = y_t − β̂₀ − β̂₁x_t. This is indeed a defining feature of LS and can be seen as the sample analogues of the assumptions in (2.1) that E u_t = 0 and Cov(x_t, u_t) = 0. To see this, note that (2.3) says that the sample average of û_t should be zero. Similarly, (2.4) says that the sample cross moment of û_t and x_t should also be zero, which implies that the sample covariance is zero as well since û_t has a zero sample mean.

Remark 12 Note that β_i is the true (unobservable) value which we estimate to be β̂_i. Whereas β_i is an unknown (deterministic) number, β̂_i is a random variable since it is calculated as a function of the random sample of y_t and x_t.


Figure 2.2: Example of OLS estimation of y = bx + u. (First sample: y = (−1.5, −0.6, 2.1), x = (−1, 0, 1), b = 1.8 (OLS), R² = 0.92; second sample: y = (−1.3, −1.0, 2.3), x = (−1, 0, 1), b = 1.8 (OLS), R² = 0.81. The right panels show the sum of squared errors as a function of b.)

Remark 13 Least squares is only one of many possible ways to estimate regression coefficients. We will discuss other methods later on.

Remark 14 (Cross moments and covariance). A covariance is defined as

$$\operatorname{Cov}(x, y) = \operatorname{E}[(x - \operatorname{E}x)(y - \operatorname{E}y)] = \operatorname{E}(xy - x\operatorname{E}y - y\operatorname{E}x + \operatorname{E}x\operatorname{E}y) = \operatorname{E}xy - \operatorname{E}x\operatorname{E}y.$$

When x = y, then we get Var(x) = E x² − (E x)². These results hold for sample moments too.

When the means of y and x are zero, then we can disregard the constant. In this case,



(2.4) with β̂₀ = 0 immediately gives

$$\sum_{t=1}^{T} y_t x_t = \hat{\beta}_1 \sum_{t=1}^{T} x_t x_t, \quad\text{or}\quad \hat{\beta}_1 = \frac{\sum_{t=1}^{T} y_t x_t / T}{\sum_{t=1}^{T} x_t x_t / T}. \tag{2.5}$$

In this case, the coefficient estimator is the sample covariance (recall: means are zero) of y_t and x_t, divided by the sample variance of the regressor x_t (this statement is actually true even if the means are not zero and a constant is included on the right hand side—it is just more tedious to show it).

See Figure 2.2 for an illustration.
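A sketch computing the slope in (2.5) for the first sample in Figure 2.2 (the R² of (2.6) in the next section is also computed):

```python
import numpy as np

x = np.array([-1.0, 0.0, 1.0])
y = np.array([-1.5, -0.6, 2.1])

b = (y * x).sum() / (x * x).sum()      # slope from (2.5), zero-mean case
yhat = b * x                           # fitted values
R2 = np.corrcoef(y, yhat)[0, 1] ** 2   # goodness of fit, (2.6)

print(b, R2)   # 1.8 and about 0.92, as in Figure 2.2
```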

2.1.2 Least Squares: Goodness of Fit

The quality of a regression model is often measured in terms of its ability to explain the movements of the dependent variable.

Let ŷ_t be the fitted (predicted) value of y_t. For instance, with (2.1) it would be ŷ_t = β̂₀ + β̂₁x_t. If a constant is included in the regression (or the means of y and x are zero), then a check of the goodness of fit of the model is given by

$$R^2 = \operatorname{Corr}(y_t, \hat{y}_t)^2. \tag{2.6}$$

This is the squared correlation of the actual and predicted value of y_t.

To understand this result, suppose that x_t has no explanatory power, so R² should be zero. How does that happen? Well, if x_t is uncorrelated with y_t, then the numerator in (2.5) is zero so β̂₁ = 0. As a consequence ŷ_t = β̂₀, which is a constant—and a constant is always uncorrelated with anything else (as correlations measure comovements around the means).

To get a bit more intuition for what R² represents, suppose the estimated coefficients equal the true coefficients, so ŷ_t = β₀ + β₁x_t. In this case,

$$R^2 = \operatorname{Corr}(\beta_0 + \beta_1 x_t + u_t,\; \beta_0 + \beta_1 x_t)^2,$$

that is, the squared correlation of y_t with the systematic part of y_t. Clearly, if the model is perfect so u_t = 0, then R² = 1. In contrast, when there is no movement in the systematic part (β₁ = 0), then R² = 0.


See Figure 2.3 for an example.

Figure 2.3: Predicting US stock returns (various investment horizons, 1926–2003). (Panels show the slope, with a 90% confidence band based on Newey-West standard errors with MA(horizon−1), and the R², both against the return horizon in months, for the regressions Return = c + b·lagged Return and Return = c + b·D/P.)

2.1.3 Least Squares: Outliers

Since the loss function in (2.2) is quadratic, a few outliers can easily have a very large influence on the estimated coefficients. For instance, suppose the true model is y_t = 0.75x_t + u_t, and that the residual is very large for some time period s. If the regression coefficient happened to be 0.75 (the true value, actually), the loss function value would be large due to the u²_s term. The loss function value will probably be lower if the coefficient is changed to pick up the y_s observation—even if this means that the errors for the other observations become larger (the sum of the square of many small errors can very well be less than the square of a single large error).



Figure 2.4: Data and regression line from OLS and LAD for y = 0.75x + u. (Data: y = (−1.125, −0.750, 1.750, 1.125), x = (−1.5, −1.0, 1.0, 1.5); OLS estimates (0.25, 0.90), LAD estimates (0.00, 0.75).)

There is of course nothing sacred about the quadratic loss function. Instead of (2.2) one could, for instance, use a loss function in terms of the absolute value of the error, Σᵀₜ₌₁ |y_t − β₀ − β₁x_t|. This would produce the Least Absolute Deviation (LAD) estimator. It is typically less sensitive to outliers. This is illustrated in Figure 2.4. However, LS is by far the most popular choice. There are two main reasons: LS is very easy to compute and it is fairly straightforward to construct standard errors and confidence intervals for the estimator. (From an econometric point of view you may want to add that LS coincides with maximum likelihood when the errors are normally distributed.)

2.1.4 The Distribution of β̂

Note that the estimated coefficients are random variables since they depend on which particular sample that has been "drawn." This means that we cannot be sure that the estimated coefficients are equal to the true coefficients (β₀ and β₁ in (2.1)). We can calculate an estimate of this uncertainty in the form of variances and covariances of β̂₀ and β̂₁. These can be used for testing hypotheses about the coefficients, for instance, that β₁ = 0.

To see where the uncertainty comes from consider the simple case in (2.5). Use (2.1) to substitute for y_t (recall β₀ = 0)

$$\hat{\beta}_1 = \frac{\sum_{t=1}^{T} x_t(\beta_1 x_t + u_t)/T}{\sum_{t=1}^{T} x_t x_t / T} = \beta_1 + \frac{\sum_{t=1}^{T} x_t u_t / T}{\sum_{t=1}^{T} x_t x_t / T}, \tag{2.7}$$

so the OLS estimate, β̂₁, equals the true value, β₁, plus the sample covariance of x_t and u_t divided by the sample variance of x_t. One of the basic assumptions in (2.1) is that the covariance of the regressor and the residual is zero. This should hold in a very large sample (or else OLS cannot be used to estimate β₁), but in a small sample it may be different from zero. Since u_t is a random variable, β̂₁ is too. Only as the sample gets very large can we be (almost) sure that the second term in (2.7) vanishes.

Equation (2.7) will give different values of β̂ when we use different samples, that is, different draws of the random variables u_t, x_t, and y_t. Since the true value, β, is a fixed constant, this distribution describes the uncertainty we should have about the true value after having obtained a specific estimated value.

The first conclusion from (2.7) is that, with u_t = 0, the estimate would always be perfect—and with large movements in u_t we will see large movements in β̂. The second conclusion is that a small sample (small T) will also lead to large random movements in β̂₁—in contrast to a large sample where the randomness in Σᵀₜ₌₁ x_t u_t/T is averaged out more effectively (it should be zero in a large sample).

There are three main routes to learn more about the distribution of β̂: (i) set up a small "experiment" in the computer and simulate the distribution (Monte Carlo or bootstrap simulation); (ii) pretend that the regressors can be treated as fixed numbers and then assume something about the distribution of the residuals; or (iii) use the asymptotic (large sample) distribution as an approximation. The asymptotic distribution can often be derived, in contrast to the exact distribution in a sample of a given size. If the actual sample is large, then the asymptotic distribution may be a good approximation.

The simulation approach has the advantage of giving a precise answer—but the disadvantage of requiring a very precise question (we must write computer code that is tailor-made for the particular model we are looking at, including the specific parameter values). See Figure 2.5 for an example.
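A minimal Monte Carlo sketch of route (i), with arbitrary parameter values:

```python
import numpy as np

rng = np.random.default_rng(1)
beta_true, sigma = 0.75, 1.0   # arbitrary true values

for T in (25, 100):
    b = np.empty(5_000)
    for i in range(5_000):                     # 5,000 simulated samples
        x = rng.normal(size=T)
        u = sigma * rng.normal(size=T)
        y = beta_true * x + u                  # true model
        b[i] = (x * y).sum() / (x * x).sum()   # OLS estimate, (2.5)
    print(T, b.mean(), b.std())  # mean near 0.75; spread shrinks with T
```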



2.1.5 Fixed Regressors

The assumption of fixed regressors makes a lot of sense in controlled experiments, where we actually can generate different samples with the same values of the regressors (the heat or whatever). It makes much less sense in econometrics. However, it is easy to derive results for this case—and those results happen to be very similar to what asymptotic theory gives.

Remark 15 (Linear combination of normally distributed variables.) If the random variables z_t and v_t are normally distributed, then a + bz_t + cv_t is too. To be precise,

$$a + bz_t + cv_t \sim N\left(a + b\mu_z + c\mu_v,\; b^2\sigma_z^2 + c^2\sigma_v^2 + 2bc\sigma_{zv}\right).$$

Suppose u_t ∼ N(0, σ²); then (2.7) shows that β̂₁ is normally distributed. The reason is that β̂₁ is just a constant (β₁) plus a linear combination of normally distributed residuals (with fixed regressors, x_t/Σᵀₜ₌₁ x_t x_t can be treated as constant). It is straightforward to see that the mean of this normal distribution is β₁ (the true value), since the rest is a linear combination of the residuals—and they all have a zero mean. Finding the variance of β̂₁ is just slightly more complicated. First, write (2.7) as

$$\hat{\beta}_1 = \beta_1 + \frac{1}{\sum_{t=1}^{T} x_t x_t}\left(x_1 u_1 + x_2 u_2 + \ldots + x_T u_T\right). \tag{2.8}$$

Second, remember that we treat x_t as fixed numbers ("constants"). Third, assume that the residuals are iid: they are uncorrelated with each other (independently distributed) and have the same variances (identically distributed). The variance of (2.8) is therefore

$$\operatorname{Var}(\hat{\beta}_1) = \left(\frac{1}{\sum_{t=1}^{T} x_t x_t}\right)^2 \left[x_1^2\operatorname{Var}(u_1) + x_2^2\operatorname{Var}(u_2) + \ldots + x_T^2\operatorname{Var}(u_T)\right]$$
$$= \left(\frac{1}{\sum_{t=1}^{T} x_t x_t}\right)^2 \left[x_1^2\sigma^2 + x_2^2\sigma^2 + \ldots + x_T^2\sigma^2\right]$$
$$= \left(\frac{1}{\sum_{t=1}^{T} x_t x_t}\right)^2 \left(\sum_{t=1}^{T} x_t x_t\right)\sigma^2$$
$$= \frac{\sigma^2}{\sum_{t=1}^{T} x_t x_t}. \tag{2.9}$$


Notice that the denominator increases with the sample size while the numerator stays constant: a larger sample gives a smaller uncertainty about the estimate. Similarly, a lower volatility of the residuals (lower σ²) also gives a lower uncertainty about the estimate.

Example 16 When the regressor is just a constant (equal to one), x_t = 1, then we have

$$\sum_{t=1}^{T} x_t x_t' = \sum_{t=1}^{T} 1 \times 1' = T, \quad\text{so}\quad \operatorname{Var}(\hat{\beta}) = \sigma^2 / T.$$

(This is the classical expression for the variance of a sample mean.)

Example 17 When the regressor is a zero mean variable, then we have

$$\sum_{t=1}^{T} x_t x_t' = \operatorname{Var}(x_t)\,T, \quad\text{so}\quad \operatorname{Var}(\hat{\beta}) = \frac{\sigma^2}{\operatorname{Var}(x_t)\,T}.$$

The variance is increasing in σ², but decreasing in both T and Var(x_t). Why?

2.1.6 A Bit of Asymptotic Theory

A law of large numbers would (in most cases) say that both Σᵀₜ₌₁ x²_t/T and Σᵀₜ₌₁ x_t u_t/T in (2.7) converge to their expected values as T → ∞. The reason is that both are sample averages of random variables (clearly, both x²_t and x_t u_t are random variables). These expected values are Var(x_t) and Cov(x_t, u_t), respectively (recall that both x_t and u_t have zero means). The key to showing that β̂ is consistent is that Cov(x_t, u_t) = 0. This highlights the importance of using good theory to derive not only the systematic part of (2.1), but also in understanding the properties of the errors. For instance, when economic theory tells us that y_t and x_t affect each other (as prices and quantities typically do), then the errors are likely to be correlated with the regressors—and LS is inconsistent. One common way to get around that is to use an instrumental variables technique. Consistency is a feature we want from most estimators, since it says that we would at least get it right if we had enough data.

Suppose that β̂ is consistent. Can we say anything more about the asymptotic distribution? Well, the distribution of β̂ converges to a spike with all the mass at β, but the distribution of √T(β̂ − β) will typically converge to a non-trivial normal distribution. To see why, note from (2.7) that we can write

$$\sqrt{T}(\hat{\beta} - \beta) = \left(\sum_{t=1}^{T} x_t^2 / T\right)^{-1} \sqrt{T}\sum_{t=1}^{T} x_t u_t / T. \tag{2.10}$$



Figure 2.5: Distribution of the t-statistic of the LS estimator with non-normal residuals. (Model: R_t = 0.9f_t + ε_t, where ε_t = v_t − 2 and v_t has a χ²(2) distribution. Results for T = 5 and T = 100: kurtosis of the t-stat 27.7 and 3.1; frequency of |t-stat| > 1.645: 0.25 and 0.11; frequency of |t-stat| > 1.96: 0.19 and 0.05. The third panel compares the pdfs of N(0,1) and χ²(2) − 2.)

The first term on the right hand side will typically converge to the inverse of Var(x_t), as discussed earlier. The second term is √T times a sample average (of the random variable x_t u_t) with a zero expected value, since we assumed that β̂ is consistent. Under weak conditions, a central limit theorem applies so √T times a sample average converges to a normal distribution. This shows that √T(β̂ − β) has an asymptotic normal distribution. It turns out that this is a property of many estimators, basically because most estimators are some kind of sample average. The properties of this distribution are quite similar to those that we derived by assuming that the regressors were fixed numbers.

2.1.7 Diagnostic Tests

Exactly what the variance of β̂ is, and how it should be estimated, depends mostly on the properties of the errors. This is one of the main reasons for diagnostic tests. The most common tests are for homoskedastic errors (equal variances of u_t and u_{t−s}) and no autocorrelation (no correlation of u_t and u_{t−s}).

When ML is used, it is common to investigate if the fitted errors satisfy the basic assumptions, for instance, of normality.

2.1.8 Multiple Regression

All of the previous results still hold in a multiple regression—with suitable reinterpretations of the notation.

2.1.9 Multiple Regression: Details∗

Consider the linear model

$$y_t = x_{1t}\beta_1 + x_{2t}\beta_2 + \cdots + x_{kt}\beta_k + u_t = x_t'\beta + u_t, \tag{2.11}$$

where y_t and u_t are scalars, x_t is a k × 1 vector, and β is a k × 1 vector of the true coefficients (see Appendix A for a summary of matrix algebra). Least squares minimizes the sum of the squared fitted residuals

$$\sum_{t=1}^{T} \hat{u}_t^2 = \sum_{t=1}^{T} (y_t - x_t'\hat{\beta})^2, \tag{2.12}$$

by choosing the vector β̂. The first order conditions are

$$0_{k\times 1} = \sum_{t=1}^{T} x_t(y_t - x_t'\hat{\beta}) \quad\text{or}\quad \sum_{t=1}^{T} x_t y_t = \sum_{t=1}^{T} x_t x_t'\hat{\beta}, \tag{2.13}$$

which can be solved as

$$\hat{\beta} = \left(\sum_{t=1}^{T} x_t x_t'\right)^{-1} \sum_{t=1}^{T} x_t y_t. \tag{2.14}$$

Example 18 With 2 regressors (k = 2), (2.13) is

$$\begin{bmatrix} 0 \\ 0 \end{bmatrix} = \sum_{t=1}^{T} \begin{bmatrix} x_{1t}(y_t - x_{1t}\hat{\beta}_1 - x_{2t}\hat{\beta}_2) \\ x_{2t}(y_t - x_{1t}\hat{\beta}_1 - x_{2t}\hat{\beta}_2) \end{bmatrix}$$

and (2.14) is

$$\begin{bmatrix} \hat{\beta}_1 \\ \hat{\beta}_2 \end{bmatrix} = \left(\sum_{t=1}^{T} \begin{bmatrix} x_{1t}x_{1t} & x_{1t}x_{2t} \\ x_{2t}x_{1t} & x_{2t}x_{2t} \end{bmatrix}\right)^{-1} \sum_{t=1}^{T} \begin{bmatrix} x_{1t}y_t \\ x_{2t}y_t \end{bmatrix}.$$



2.2 Maximum Likelihood

A different route to arrive at an estimator is to maximize the likelihood function.

To understand the principle of maximum likelihood estimation, consider the following example. Suppose we know x_t ∼ N(µ, 1), but we don't know the value of µ. Since x_t is a random variable, there is a probability of every observation and the density function of x_t is

$$L = \operatorname{pdf}(x_t) = \frac{1}{\sqrt{2\pi}}\exp\left[-\frac{1}{2}(x_t - \mu)^2\right], \tag{2.15}$$

where L stands for "likelihood." The basic idea of maximum likelihood estimation (MLE) is to pick model parameters to make the observed data have the highest possible probability. Here this gives µ̂ = x_t. This is the maximum likelihood estimator in this example.

What if there are two observations, x₁ and x₂? In the simplest case where x₁ and x₂ are independent, then pdf(x₁, x₂) = pdf(x₁)pdf(x₂), so

$$L = \operatorname{pdf}(x_1, x_2) = \frac{1}{2\pi}\exp\left[-\frac{1}{2}(x_1 - \mu)^2 - \frac{1}{2}(x_2 - \mu)^2\right]. \tag{2.16}$$

Take logs (log likelihood)

$$\ln L = -\ln(2\pi) - \frac{1}{2}\left[(x_1 - \mu)^2 + (x_2 - \mu)^2\right], \tag{2.17}$$

which is maximized by setting µ̂ = (x₁ + x₂)/2.

To apply this idea to a regression model

$$y_t = \beta x_t + u_t, \tag{2.18}$$

we could assume that u_t is iid N(0, σ²). The probability density function of u_t is

$$\operatorname{pdf}(u_t) = \frac{1}{\sqrt{2\pi\sigma^2}}\exp\left(-\frac{1}{2}\,\frac{u_t^2}{\sigma^2}\right). \tag{2.19}$$

Since the errors are independent, we get the joint pdf of u₁, u₂, ..., u_T by multiplying the marginal pdfs of each of the errors

$$L = \operatorname{pdf}(u_1) \times \operatorname{pdf}(u_2) \times \ldots \times \operatorname{pdf}(u_T) = (2\pi\sigma^2)^{-T/2}\exp\left[-\frac{1}{2}\left(\frac{u_1^2}{\sigma^2} + \frac{u_2^2}{\sigma^2} + \ldots + \frac{u_T^2}{\sigma^2}\right)\right]. \tag{2.20}$$

Substitute y_t − x_tβ for u_t and take logs to get the log likelihood function of the sample

$$\ln L = -\frac{T}{2}\ln(2\pi) - \frac{T}{2}\ln(\sigma^2) - \frac{1}{2}\sum_{t=1}^{T}(y_t - x_t\beta)^2/\sigma^2. \tag{2.21}$$

Suppose (for simplicity) that we happen to know the value of σ². It is then clear that this likelihood function is maximized by minimizing the last term, which is proportional to the sum of squared errors—just like in (2.2): LS is ML when the errors are iid normally distributed (but only then). (This holds also when we do not know the value of σ²—it is just slightly messier to show it.) See Figure 2.6.

Maximum likelihood estimators have very nice properties, provided the basic distributional assumptions are correct, that is, if we maximize the right likelihood function. In that case, MLE is typically the most efficient/precise estimator (at least in very large samples). ML also provides a coherent framework for testing hypotheses (including the Wald, LM, and LR tests).

Example 19 Consider the simple regression where we happen to know that the intercept is zero, y_t = β₁x_t + u_t. Suppose we have the following data

$$\begin{bmatrix} y_1 & y_2 & y_3 \end{bmatrix} = \begin{bmatrix} -1.5 & -0.6 & 2.1 \end{bmatrix} \quad\text{and}\quad \begin{bmatrix} x_1 & x_2 & x_3 \end{bmatrix} = \begin{bmatrix} -1 & 0 & 1 \end{bmatrix}.$$

Suppose β₁ = 2; then we get the following values for u_t = y_t − 2x_t and its square

$$\begin{bmatrix} -1.5 - 2\times(-1) \\ -0.6 - 2\times 0 \\ 2.1 - 2\times 1 \end{bmatrix} = \begin{bmatrix} 0.5 \\ -0.6 \\ 0.1 \end{bmatrix} \ \text{with the squares} \ \begin{bmatrix} 0.25 \\ 0.36 \\ 0.01 \end{bmatrix} \ \text{with sum } 0.62.$$

Now, suppose instead that β₁ = 1.8; then we get

$$\begin{bmatrix} -1.5 - 1.8\times(-1) \\ -0.6 - 1.8\times 0 \\ 2.1 - 1.8\times 1 \end{bmatrix} = \begin{bmatrix} 0.3 \\ -0.6 \\ 0.3 \end{bmatrix} \ \text{with the squares} \ \begin{bmatrix} 0.09 \\ 0.36 \\ 0.09 \end{bmatrix} \ \text{with sum } 0.54.$$



The latter choice of β₁ will certainly give a larger value of the likelihood function (it is actually the optimum). See Figure 2.6.

Figure 2.6: Example of OLS and MLE estimation. (Data: y = (−1.5, −0.6, 2.1), x = (−1, 0, 1), b = 1.8 (OLS); panels show the data and fitted line, the sum of squared errors as a function of b, and the log likelihood as a function of b.)

A Some Matrix Algebra

Let

$$x = \begin{bmatrix} x_1 \\ x_2 \end{bmatrix}, \quad c = \begin{bmatrix} c_1 \\ c_2 \end{bmatrix}, \quad A = \begin{bmatrix} A_{11} & A_{12} \\ A_{21} & A_{22} \end{bmatrix}, \quad\text{and}\quad B = \begin{bmatrix} B_{11} & B_{12} \\ B_{21} & B_{22} \end{bmatrix}.$$

Matrix addition (or subtraction) is element by element

$$A + B = \begin{bmatrix} A_{11} & A_{12} \\ A_{21} & A_{22} \end{bmatrix} + \begin{bmatrix} B_{11} & B_{12} \\ B_{21} & B_{22} \end{bmatrix} = \begin{bmatrix} A_{11}+B_{11} & A_{12}+B_{12} \\ A_{21}+B_{21} & A_{22}+B_{22} \end{bmatrix}.$$


To turn a column into a row vector, use the transpose operator as in x′

$$x' = \begin{bmatrix} x_1 \\ x_2 \end{bmatrix}' = \begin{bmatrix} x_1 & x_2 \end{bmatrix}.$$

To do matrix multiplication, the two matrices need to be conformable: the first matrix has as many columns as the second matrix has rows. For instance, xc does not work, but x′c does

$$x'c = \begin{bmatrix} x_1 & x_2 \end{bmatrix}\begin{bmatrix} c_1 \\ c_2 \end{bmatrix} = x_1 c_1 + x_2 c_2.$$

Some further examples:

$$xx' = \begin{bmatrix} x_1 \\ x_2 \end{bmatrix}\begin{bmatrix} x_1 & x_2 \end{bmatrix} = \begin{bmatrix} x_1^2 & x_1 x_2 \\ x_2 x_1 & x_2^2 \end{bmatrix},$$

$$Ac = \begin{bmatrix} A_{11} & A_{12} \\ A_{21} & A_{22} \end{bmatrix}\begin{bmatrix} c_1 \\ c_2 \end{bmatrix} = \begin{bmatrix} A_{11}c_1 + A_{12}c_2 \\ A_{21}c_1 + A_{22}c_2 \end{bmatrix}.$$

A matrix inverse is the closest we get to "dividing" by a matrix. The inverse of a matrix D, denoted D⁻¹, is such that

$$DD^{-1} = I \quad\text{and}\quad D^{-1}D = I,$$

where I is the identity matrix (ones along the diagonal, and zeroes elsewhere). For instance, the A⁻¹ matrix has the same dimensions as A and the elements (here denoted Q_ij) are such that the following holds

$$A^{-1}A = \begin{bmatrix} Q_{11} & Q_{12} \\ Q_{21} & Q_{22} \end{bmatrix}\begin{bmatrix} A_{11} & A_{12} \\ A_{21} & A_{22} \end{bmatrix} = \begin{bmatrix} 1 & 0 \\ 0 & 1 \end{bmatrix}.$$

Example 20 We have

$$\begin{bmatrix} -1.5 & 1.25 \\ 0.5 & -0.25 \end{bmatrix}\begin{bmatrix} 1 & 5 \\ 2 & 6 \end{bmatrix} = \begin{bmatrix} 1 & 0 \\ 0 & 1 \end{bmatrix}, \quad\text{so}\quad \begin{bmatrix} 1 & 5 \\ 2 & 6 \end{bmatrix}^{-1} = \begin{bmatrix} -1.5 & 1.25 \\ 0.5 & -0.25 \end{bmatrix}.$$



3 Testing CAPM

Reference: Elton, Gruber, Brown, and Goetzmann (2003) 15

More advanced material is denoted by a star (∗). It is not required reading.

3.1 Market Model

The basic implication of CAPM is that the expected excess return of an asset (µ^e_i) is linearly related to the expected excess return on the market portfolio (µ^e_m) according to

$$\mu_i^e = \beta_i \mu_m^e, \quad\text{where}\quad \beta_i = \frac{\operatorname{Cov}(R_i, R_m)}{\operatorname{Var}(R_m)}. \tag{3.1}$$

Let R^e_it = R_it − R_ft be the return on asset i in excess of the riskfree asset, and let R^e_mt be the excess return on the market portfolio. CAPM with a riskfree return says that α_i = 0 in

$$R_{it}^e = \alpha_i + b_i R_{mt}^e + \varepsilon_{it}, \quad\text{where } \operatorname{E}\varepsilon_{it} = 0 \text{ and } \operatorname{Cov}(R_{mt}^e, \varepsilon_{it}) = 0. \tag{3.2}$$

The two last conditions are automatically imposed by LS. Take expectations to get

$$\operatorname{E}(R_{it}^e) = \alpha_i + b_i \operatorname{E}(R_{mt}^e). \tag{3.3}$$

Notice that the LS estimate of b_i is the sample analogue to β_i in (3.1). It is then clear that CAPM implies that α_i = 0, which is also what empirical tests of CAPM focus on.

This test of CAPM can be given two interpretations. If we assume that R_mt is the correct benchmark (the tangency portfolio for which (3.1) is true by definition), then it is a test of whether asset R_it is correctly priced. This is typically the perspective in performance analysis of mutual funds. Alternatively, if we assume that R_it is correctly priced, then it is a test of the mean-variance efficiency of R_mt. This is the perspective of CAPM tests.

The t-test of the null hypothesis that α = 0 uses the fact that, under fairly mild conditions, the t-statistic has an asymptotically normal distribution, that is,

$$\frac{\hat{\alpha}}{\operatorname{Std}(\hat{\alpha})} \overset{d}{\to} N(0, 1) \text{ under } H_0\!: \alpha = 0. \tag{3.4}$$

Note that this is the distribution under the null hypothesis that the true value of the intercept is zero, that is, that CAPM is correct (in this respect, at least).

The test assets are typically portfolios of firms with similar characteristics, for instance, small size or having their main operations in the retail industry. There are two main reasons for testing the model on such portfolios: individual stocks are extremely volatile and firms can change substantially over time (so the beta changes). Moreover, it is of interest to see how the deviations from CAPM are related to firm characteristics (size, industry, etc.), since that can possibly suggest how the model needs to be changed.

The results from such tests vary with the test assets used. For US portfolios, CAPM seems to work reasonably well for some types of portfolios (for instance, portfolios based on firm size or industry), but much worse for other types of portfolios (for instance, portfolios based on firm dividend yield or book value/market value ratio).

Figure 3.1 shows some results for US industry portfolios.

3.1.1 Interpretation of the CAPM Test

Instead of a t-test, we can use the equivalent chi-square test

$$\frac{\hat{\alpha}^2}{\operatorname{Var}(\hat{\alpha})} \overset{d}{\to} \chi_1^2 \text{ under } H_0\!: \alpha = 0. \tag{3.5}$$

It is quite straightforward to use the properties of minimum-variance frontiers (see Gibbons, Ross, and Shanken (1989), and also MacKinlay (1995)) to show that the test statistic in (3.5) can be written

$$\frac{\hat{\alpha}^2}{\operatorname{Var}(\hat{\alpha})} = \frac{(SR_q)^2 - (SR_m)^2}{[1 + (SR_m)^2]/T}, \tag{3.6}$$

where SR_m is the Sharpe ratio of the market portfolio (as before) and SR_q is the Sharpe ratio of the tangency portfolio when investment in both the market return and asset i is possible. (Recall that the tangency portfolio is the portfolio with the highest possible Sharpe ratio.) If the market portfolio has the same (squared) Sharpe ratio as the tangency portfolio of the mean-variance frontier of R_it and R_mt (so the market portfolio is mean-variance efficient also when we take R_it into account), then the test statistic, α̂²/Var(α̂), is zero—and CAPM is not rejected.

Figure 3.1: CAPM regressions on US industry indices, 1947–2004. (Panels: mean excess return against β, and against the predicted mean excess return (with α = 0), for industry portfolios A–J; excess market return: 7.5%. Factor: US market. Test statistics use Newey-West standard errors and are χ²(n), n = number of test assets.)

Asset  alpha   WaldStat  pval
all    NaN     22.66     0.01
A      1.97    2.42      0.12
B      1.50    1.16      0.28
C      −0.50   0.40      0.53
D      3.09    3.22      0.07
E      −0.40   0.07      0.79
F      0.46    0.09      0.77
G      0.56    0.19      0.66
H      2.75    3.20      0.07
I      2.11    2.16      0.14
J      0.09    0.01      0.92

This is illustrated in Figure 3.2 which shows the effect of adding an asset to the investment opportunity set. In this case, the new asset has a zero beta (since it is uncorrelated with all original assets), but the same type of result holds for any new asset. The basic point is that the market model tests if the new asset moves the location of the tangency portfolio. In general, we would expect that adding an asset to the investment opportunity set would expand the mean-variance frontier (and it does) and that the tangency portfolio changes accordingly. However, the tangency portfolio is not changed by adding an asset with a zero intercept. The intuition is that such an asset has neutral performance compared to the market portfolio (obeys the beta representation), so investors should stick to the market portfolio.


Figure 3.2: Effect on the MV frontier of adding assets. (Panels: MV frontiers before (solid, 2 assets) and after (dashed, 3 assets) adding a new asset with abnormal return α = 0, 0.05, and −0.04 compared to the market of the 2 original assets. Means: 0.08, 0.05, α; covariance matrix: diag(0.0256, 0.0144, 0.0144). Tangency portfolio weights—N = 2: (0.47, 0.53); α = 0: (0.47, 0.53, 0.00); α > 0: (0.31, 0.34, 0.34); α < 0: (0.82, 0.91, −0.73).)

3.1.2 Econometric Properties of the CAPM Test

A common finding from Monte Carlo simulations is that these tests tend to reject a true null hypothesis too often when the critical values from the asymptotic distribution are used: the actual small sample size of the test is thus larger than the asymptotic (or "nominal") size (see Campbell, Lo, and MacKinlay (1997) Table 5.1). The practical consequence is that we should either use adjusted critical values (from Monte Carlo or bootstrap simulations)—or, more pragmatically, only believe in strong rejections of the null hypothesis.

To study the power of the test (the frequency of rejections of a false null hypothesis) we have to specify an alternative data generating process (for instance, how much extra return in excess of that motivated by CAPM) and the size of the test (the critical value to use). Once that is done, it is typically found that these tests require a substantial deviation from CAPM and/or a long sample to get good power.

3.1.3 Several Assets

In most cases there are several (n) test assets, and we actually want to test if all the α_i (for i = 1, 2, ..., n) are zero. Ideally we then want to take into account the correlation of the different alphas.

While it is straightforward to construct such a test, it is also a bit messy. As a quick way out, the following will work fairly well. First, test each asset individually. Second, form a few different portfolios of the test assets (equally weighted, value weighted) and test these portfolios. Although this does not deliver one single test statistic, it provides plenty of information to base a judgement on. For a more formal approach, see Section 3.1.4.

A quite different approach to study a cross-section of assets is to first perform a CAPM regression (3.2) and then the following cross-sectional regression

$$\bar{R}_i^e = \gamma + \lambda\hat{\beta}_i + u_i, \tag{3.7}$$

where R̄^e_i is the (sample) average excess return on asset i. Notice that the estimated betas are used as regressors and that there are as many data points as there are assets (n).

There are severe econometric problems with this regression equation since the regressor contains measurement errors (it is only an uncertain estimate), which typically tend to bias the slope coefficient towards zero. To get the intuition for this bias, consider an extremely noisy measurement of the regressor: it would be virtually uncorrelated with the dependent variable (noise isn't correlated with anything), so the estimated slope coefficient would be close to zero.

If we could overcome this bias (and we can by being careful), then the testable implication of CAPM is that γ = 0 and that λ equals the average market excess return. We also want (3.7) to have a high R²—since it should be unity in a very large sample (if CAPM holds).

3.1.4 Several Assets: SURE Approach∗

This section outlines how we can set up a formal test of CAPM when there are several test assets.


For simplicity, suppose we have two test assets. Stacking (3.2) for the two assets gives

$$R_{1t}^e = \alpha_1 + b_1 R_{mt}^e + \varepsilon_{1t}, \tag{3.8}$$
$$R_{2t}^e = \alpha_2 + b_2 R_{mt}^e + \varepsilon_{2t}, \tag{3.9}$$

where E ε_it = 0 and Cov(R^e_mt, ε_it) = 0. This is a system of seemingly unrelated regressions (SURE)—with the same regressor (see, for instance, Greene (2003) 14). In this case, the efficient estimator (GLS) is LS on each equation separately. Moreover, the covariance matrix of the coefficients is particularly simple.

To see what the covariances of the coefficients are, write the regression equation for asset 1 (3.8) on a traditional form

$$R_{1t}^e = x_t'\beta_1 + \varepsilon_{1t}, \quad\text{where}\quad x_t = \begin{bmatrix} 1 \\ R_{mt}^e \end{bmatrix}, \quad \beta_1 = \begin{bmatrix} \alpha_1 \\ b_1 \end{bmatrix}, \tag{3.10}$$

and similarly for the second asset (and any further assets). Define

$$\hat{\Sigma}_{xx} = \sum_{t=1}^{T} x_t x_t' / T, \quad\text{and}\quad \hat{\sigma}_{ij} = \sum_{t=1}^{T} \hat{\varepsilon}_{it}\hat{\varepsilon}_{jt} / T, \tag{3.11}$$

where ε̂_it is the fitted residual of asset i. The key result is then that the (estimated) asymptotic covariance matrix of the vectors β̂_i and β̂_j (for assets i and j) is

$$\text{(estimated) Asy. Cov}(\hat{\beta}_i, \hat{\beta}_j) = \hat{\sigma}_{ij}\hat{\Sigma}_{xx}^{-1}/T. \tag{3.12}$$

(In many textbooks, this is written as σ̂_ij(X′X)⁻¹.) The null hypothesis in our two-asset case is

$$H_0\!: \alpha_1 = 0 \text{ and } \alpha_2 = 0. \tag{3.13}$$

In a large sample, the estimator is normally distributed (this follows from the fact that the LS estimator is a form of sample average, so we can apply a central limit theorem). Therefore, under the null hypothesis we have the following result. Let A be the upper left element of Σ̂⁻¹_xx/T. Then

$$\begin{bmatrix} \hat{\alpha}_1 \\ \hat{\alpha}_2 \end{bmatrix} \sim N\left(\begin{bmatrix} 0 \\ 0 \end{bmatrix}, \begin{bmatrix} \sigma_{11}A & \sigma_{12}A \\ \sigma_{12}A & \sigma_{22}A \end{bmatrix}\right) \text{ (asymptotically).} \tag{3.14}$$



In practice we use the sample moments for the covariance matrix. Notice that the zero means in (3.14) come from the null hypothesis: the distribution is (as usual) constructed by pretending that the null hypothesis is true.

We can now construct a chi-square test by using the following fact.

Remark 21 If the n × 1 vector y ∼ N(0, Ω), then y′Ω⁻¹y ∼ χ²_n.

To apply this, let Ω be the covariance matrix in (3.14) and form the test statistic

$$\begin{bmatrix} \hat{\alpha}_1 \\ \hat{\alpha}_2 \end{bmatrix}' \Omega^{-1} \begin{bmatrix} \hat{\alpha}_1 \\ \hat{\alpha}_2 \end{bmatrix} \sim \chi_2^2. \tag{3.15}$$

This can also be transformed into an F test, which might have better small sample properties.

3.1.5 Representative Results of the CAPM Test

One of the more interesting studies is Fama and French (1993) (see also Fama and French (1996)). They construct 25 stock portfolios according to two characteristics of the firm: the size (by market capitalization) and the book-value-to-market-value ratio (BE/ME). In June each year, they sort the stocks according to size and BE/ME. They then form a 5 × 5 matrix of portfolios, where portfolio ij belongs to the ith size quintile and the jth BE/ME quintile.

They run a traditional CAPM regression on each of the 25 portfolios (monthly data 1963–1991)—and then study if the expected excess returns are related to the betas as they should be according to CAPM (recall that CAPM implies E R^e_it = β_iλ where λ is the risk premium (excess return) on the market portfolio).

However, it is found that there is almost no relation between E R^e_it and β_i (there is a cloud in the β_i × E R^e_it space, see Cochrane (2001) 20.2, Figure 20.9). This is due to the combination of two features of the data. First, within a BE/ME quintile, there is a positive relation (across size quintiles) between E R^e_it and β_i—as predicted by CAPM (see Cochrane (2001) 20.2, Figure 20.10). Second, within a size quintile there is a negative relation (across BE/ME quintiles) between E R^e_it and β_i—in stark contrast to CAPM (see Cochrane (2001) 20.2, Figure 20.11).


Figure 3.3: Fama-French regressions on US industry indices, 1947–2004. (Mean excess return against the predicted mean excess return for industry portfolios A–J. Factors: US market, SMB (size), and HML (book-to-market). Test statistics use Newey-West standard errors and are χ²(n), n = number of test assets.)

Asset  alpha   WaldStat  pval
all    NaN     42.24     0.00
A      0.74    0.38      0.54
B      0.34    0.06      0.81
C      −1.63   4.77      0.03
D      1.10    0.45      0.50
E      3.29    6.29      0.01
F      1.24    0.65      0.42
G      0.27    0.05      0.83
H      4.56    9.80      0.00
I      −0.46   0.12      0.73
J      −2.25   8.11      0.00

3.2 Several Factors

In multifactor models, (3.2) is still valid—provided we reinterpret b and R^e_mt as vectors, so

b R^e_mt stands for b_o R^e_ot + b_p R^e_pt + ... (3.16)

In this case, (3.2) is a multiple regression, but the test (3.4) still has the same form. Figure 3.3 shows some results for the Fama-French model on US industry portfolios.

3.3 Fama-MacBeth∗

Reference: Cochrane (2001) 12.3; Campbell, Lo, and MacKinlay (1997) 5.8; Fama and MacBeth (1973)

The Fama and MacBeth (1973) approach is a bit different from the regression approaches discussed so far. The method has three steps, described below (a small code sketch of the full procedure follows after (3.22)).

• First, estimate the betas β_i (i = 1, ..., n) from (3.2) (this is a time-series regression). This is often done on the whole sample—assuming the betas are constant.



Sometimes, the betas are estimated separately for different subsamples (so we could let β_i carry a time subscript in the equations below).

• Second, run a cross-sectional regression for every t. That is, for period t, estimate λ_t from the cross section (across the assets i = 1, ..., n) regression

R^e_it = λ_t' β_i + ε_it, (3.17)

where β_i are the regressors. (Note the difference to the traditional cross-sectional approach discussed in (3.7), where the second stage regression regressed E R^e_it on β_i, while the Fama-MacBeth approach runs one regression for every time period.)

• Third, estimate the time averages

ε̄_i = (1/T) Σ_{t=1}^T ε_it for i = 1, ..., n (for every asset), (3.18)
λ̄ = (1/T) Σ_{t=1}^T λ_t. (3.19)

The second step, using β_i as regressors, creates an errors-in-variables problem since β_i are estimated, that is, measured with an error. The effect of this is typically to bias the estimator of λ_t towards zero (and any intercept, or mean of the residual, is biased upward). One way to minimize this problem, used by Fama and MacBeth (1973), is to let the assets be portfolios of assets, for which we can expect some of the individual noise in the first-step regressions to average out—and thereby make the measurement error in β_i smaller. If CAPM is true, then the return of an asset is a linear function of the market return and an error which should be uncorrelated with the errors of other assets—otherwise some factor is missing. If the portfolio consists of 20 assets with equal error variance in a CAPM regression, then we should expect the portfolio to have an error variance which is 1/20th as large.

We clearly want portfolios which have different betas, or else the second step regression (3.17) does not work. Fama and MacBeth (1973) choose to construct portfolios according to some initial estimate of asset specific betas. Another way to deal with the errors-in-variables problem is to adjust the tests.

We can test the model by studying if ε̄_i = 0 (recall from (3.18) that ε̄_i is the time average of the residual for asset i, ε_it), by forming a t-test ε̄_i/Std(ε̄_i). Fama and MacBeth (1973) suggest that the standard deviation should be found by studying the time-variation in ε_it. In particular, they suggest that the variance of ε_it (not ε̄_i) can be estimated by the (average) squared variation around its mean

Var(ε_it) = (1/T) Σ_{t=1}^T (ε_it − ε̄_i)². (3.20)

Since ε̄_i is the sample average of ε_it, the variance of the former is the variance of the latter divided by T (the sample size)—provided ε_it is iid. That is,

Var(ε̄_i) = (1/T) Var(ε_it) = (1/T²) Σ_{t=1}^T (ε_it − ε̄_i)². (3.21)

A similar argument leads to the variance of λ̄

Var(λ̄) = (1/T²) Σ_{t=1}^T (λ_t − λ̄)². (3.22)
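The following minimal sketch (Python/numpy) runs through the three steps and computes the Fama-MacBeth variances from (3.21)-(3.22); the simulated panel and all names are assumptions for illustration:

import numpy as np

np.random.seed(1)
T, n = 600, 10
Rm = 0.5 + np.random.randn(T)                        # market excess return (hypothetical)
beta_true = np.linspace(0.5, 1.5, n)
R = np.outer(Rm, beta_true) + np.random.randn(T, n)  # panel of excess returns (hypothetical)

# Step 1: time-series regressions on the whole sample to estimate the betas
X = np.column_stack([np.ones(T), Rm])
beta = np.linalg.lstsq(X, R, rcond=None)[0][1, :]

# Step 2: one cross-sectional regression (3.17) for every period t
Z = beta[:, None]                                    # n x 1 matrix of regressors
lam = np.empty(T)
eps = np.empty((T, n))
for t in range(T):
    lam[t] = np.linalg.lstsq(Z, R[t, :], rcond=None)[0][0]
    eps[t, :] = R[t, :] - lam[t] * beta

# Step 3: time averages (3.18)-(3.19) and their variances (3.21)-(3.22)
lam_bar = lam.mean()
eps_bar = eps.mean(axis=0)
var_lam = np.sum((lam - lam_bar) ** 2) / T**2
var_eps = np.sum((eps - eps_bar) ** 2, axis=0) / T**2
print(lam_bar, lam_bar / np.sqrt(var_lam))           # lambda estimate and its t-stat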

Fama and MacBeth (1973) found, among other things, that the squared beta is not significant in the second step regression, nor is a measure of non-systematic risk.

Bibliography

Campbell, J. Y., A. W. Lo, and A. C. MacKinlay, 1997, The Econometrics of Financial Markets, Princeton University Press, Princeton, New Jersey.

Cochrane, J. H., 2001, Asset Pricing, Princeton University Press, Princeton, New Jersey.

Elton, E. J., M. J. Gruber, S. J. Brown, and W. N. Goetzmann, 2003, Modern Portfolio Theory and Investment Analysis, John Wiley and Sons, 6th edn.

Fama, E., and J. MacBeth, 1973, "Risk, Return, and Equilibrium: Empirical Tests," Journal of Political Economy, 81, 607–636.

Fama, E. F., and K. R. French, 1993, "Common Risk Factors in the Returns on Stocks and Bonds," Journal of Financial Economics, 33, 3–56.

Fama, E. F., and K. R. French, 1996, "Multifactor Explanations of Asset Pricing Anomalies," Journal of Finance, 51, 55–84.

Gibbons, M., S. Ross, and J. Shanken, 1989, "A Test of the Efficiency of a Given Portfolio," Econometrica, 57, 1121–1152.

Greene, W. H., 2003, Econometric Analysis, Prentice-Hall, Upper Saddle River, New Jersey, 5th edn.

MacKinlay, C., 1995, "Multifactor Models Do Not Explain Deviations from the CAPM," Journal of Financial Economics, 38, 3–28.


4 Event Studies

Reference (medium): Bodie, Kane, and Marcus (2002) 12.3 or Elton, Gruber, Brown, and Goetzmann (2003) 17 (parts of)
Reference (advanced): Campbell, Lo, and MacKinlay (1997) 4

More advanced material is denoted by a star (∗). It is not required reading.

4.1 Basic Structure of Event Studies

The idea of an event study is to study the effect (on stock prices or returns) of a special event by using a cross-section of such events. For instance, what is the effect of a stock split announcement on the share price? Other events could be debt issues, mergers and acquisitions, earnings announcements, or monetary policy moves.

The event is typically assumed to be a discrete variable. For instance, it could be whether or not there is a merger, or whether or not the monetary policy surprise was positive (a lower interest rate than expected). The basic approach is then to study what happens to the returns of those assets that have such an event.

Only news should move the asset price, so it is often necessary to explicitly model the previous expectations to define the event. For earnings, the event is typically taken to be the earnings announcement minus (some average of) analysts' forecasts. Similarly, for monetary policy moves, the event could be specified as the interest rate decision minus previous forward rates (as a measure of previous expectations).

The abnormal return of asset i in period t is

u_{i,t} = R_{i,t} − R^{normal}_{i,t}, (4.1)

where R_{i,t} is the actual return and the last term is the normal return (which may differ across assets and time). The definition of the normal return is discussed in detail in Section 4.2. These returns could be nominal returns, but more likely (at least for slightly longer horizons) real returns or excess returns.

Suppose we have a sample of n such events ("assets"). To keep the notation (reasonably) simple, we "normalize" the time so period 0 is the time of the event. Clearly, the actual calendar times of the events for assets i and j are likely to differ, but we shift the time line for each asset individually so the time of the event is normalized to zero for every asset.

[Figure 4.1: Event study with an event window of ±2 days; the time line marks days −3 to 3 around the event day 0.]

The (cross-sectional) average abnormal return at the event time (time 0) is

ū_0 = (u_{1,0} + u_{2,0} + ... + u_{n,0})/n = Σ_{i=1}^n u_{i,0}/n. (4.2)

To control for information leakage and slow price adjustment, the abnormal return is often calculated for some time before and after the event: the "event window" (often ±20 days or so). For lead s (that is, s periods after the event time 0), the cross-sectional average abnormal return is

ū_s = (u_{1,s} + u_{2,s} + ... + u_{n,s})/n = Σ_{i=1}^n u_{i,s}/n. (4.3)

For instance, ū_2 is the average abnormal return two days after the event, and ū_{−1} is for one day before the event.

The cumulative abnormal return (CAR) of asset i is simply the sum of the abnormal return in (4.1) over some period around the event. It is often calculated from the beginning of the event window. For instance, if the event window starts at −10, then the 3-period car for asset i is

car_{i,3} = u_{i,−10} + u_{i,−9} + u_{i,−8}. (4.4)

The cross-sectional average of the q-day car is

c̄ar_q = (car_{1,q} + car_{2,q} + ... + car_{n,q})/n = Σ_{i=1}^n car_{i,q}/n. (4.5)

Example 22 Suppose there are two firms and the event window contains ±1 day around the event day, and that the abnormal returns (in percent) are

Time   Firm 1   Firm 2   Cross-sectional Average
−1     0.2      −0.1     0.05
0      1.0      2.0      1.5
1      0.1      0.3      0.2

We then have the following cumulative returns

Time   Firm 1   Firm 2   Cross-sectional Average
−1     0.2      −0.1     0.05
0      1.2      1.9      1.55
1      1.3      2.2      1.75
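A small sketch (Python/numpy) that reproduces the computations in Example 22, that is, equations (4.2)-(4.5); the layout of the input matrix (event time in rows, firms in columns) is an assumption:

import numpy as np

# abnormal returns in percent; rows are event times -1, 0, 1 and columns are the two firms
u = np.array([[0.2, -0.1],
              [1.0,  2.0],
              [0.1,  0.3]])

u_bar = u.mean(axis=1)       # cross-sectional averages, (4.2)-(4.3)
car = u.cumsum(axis=0)       # cumulative abnormal returns per firm, (4.4)
car_bar = car.mean(axis=1)   # cross-sectional average car, (4.5)
print(u_bar)                 # [0.05 1.5  0.2 ]
print(car_bar)               # [0.05 1.55 1.75]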

4.2 Models of Normal Returns

This section summarizes the most common ways of calculating the normal return in (4.1). The parameters in these models are typically estimated on a recent sample, the "estimation window," that ends before the event window. In this way, the estimated behaviour of the normal return should be unaffected by the event. It is almost always assumed that the event is exogenous in the sense that it is not due to the movements in the asset price during either the estimation window or the event window. This allows us to get a clean estimate of the normal return.

The constant mean return model assumes that the return of asset i fluctuates randomly around some mean µ_i

R_{i,t} = µ_i + ε_{i,t} with E(ε_{i,t}) = Cov(ε_{i,t}, ε_{i,t−s}) = 0. (4.6)

This mean is estimated by the sample average (during the estimation window). The normal return in (4.1) is then the estimated mean, µ̂_i, so the abnormal return becomes ε̂_{i,t}.

The market model is a linear regression of the return of asset i on the market return

R_{i,t} = α_i + β_i R_{m,t} + ε_{i,t} with E(ε_{i,t}) = Cov(ε_{i,t}, ε_{i,t−s}) = Cov(ε_{i,t}, R_{m,t}) = 0. (4.7)

Notice that we typically do not impose the CAPM restrictions on the intercept in (4.7). The normal return in (4.1) is then calculated by combining the regression coefficients with the actual market return as α̂_i + β̂_i R_{m,t}, so the abnormal return becomes ε̂_{i,t}.

Recently, the market model has increasingly been replaced by a multi-factor model which uses several regressors instead of only the market return. For instance, Fama and French (1993) argue that (4.7) needs to be augmented by a portfolio that captures the different returns of small and large firms and also by a portfolio that captures the different returns of firms with high and low book-to-market ratios.

Finally, yet another approach is to construct a normal return as the actual return on assets which are very similar to the asset with an event. For instance, if asset i is a small manufacturing firm (with an event), then the normal return could be calculated as the actual return for other small manufacturing firms (without events). In this case, the abnormal return becomes the difference between the actual return and the return on the matching portfolio. This type of matching portfolio is becoming increasingly popular.

All the methods discussed here try to take into account the risk premium on the asset. It is captured by the mean in the constant mean model, the beta in the market model, and by the way the matching portfolio is constructed. However, sometimes there is no data in the estimation window, for instance at IPOs, since there is then no return data before the event date. The typical approach is then to use the actual market return as the normal return—that is, to use (4.7) but assuming that α_i = 0 and β_i = 1. Clearly, this does not account for the risk premium on asset i, and is therefore a fairly rough guide.

Apart from accounting for the risk premium, does the choice of the model of the normal return matter a lot? Yes, but only if the model produces a higher coefficient of determination (R²) than competing models. In that case, the variance of the abnormal return is smaller for the market model, which makes the test more precise (see Section 4.3 for a discussion of how the variance of the abnormal return affects the variance of the test statistic). To illustrate this, consider the market model (4.7). Under the null hypothesis that the event has no effect on the return, the abnormal return would be just the residual in the regression (4.7). It has the variance (assuming we know the model parameters)

Var(u_{i,t}) = Var(ε_{i,t}) = (1 − R²) Var(R_{i,t}), (4.8)

where R² is the coefficient of determination of the regression (4.7). (See the Appendix for a proof.)

This variance is crucial for testing the hypothesis of no abnormal returns: the smaller the variance, the easier it is to reject a false null hypothesis (see Section 4.3). The constant mean model has R² = 0, so the market model could potentially give a much smaller variance. If the market model has R² = 0.75, then the standard deviation of the abnormal return is only half that of the constant mean model. More realistically, R² might be 0.43 (or less), so the market model gives a 25% decrease in the standard deviation, which is not a whole lot. Experience with multi-factor models also suggests that they give relatively small improvements of the R² compared to the market model. For these reasons, and for reasons of convenience, the market model is still the dominating model of normal returns.
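A minimal sketch (Python/numpy) of the market-model approach on hypothetical simulated daily returns: estimate (4.7) on the estimation window, compute abnormal returns in the event window, and verify the link (4.8) between the residual variance and R²:

import numpy as np

np.random.seed(0)
Test, Tevt = 250, 41                        # estimation window, then a +-20 day event window
Rm = 0.01 * np.random.randn(Test + Tevt)    # market returns (hypothetical data)
Ri = 0.0002 + 1.2 * Rm + 0.012 * np.random.randn(Test + Tevt)

# estimate alpha and beta in (4.7) on the estimation window only
X = np.column_stack([np.ones(Test), Rm[:Test]])
a, b = np.linalg.lstsq(X, Ri[:Test], rcond=None)[0]

u = Ri[Test:] - (a + b * Rm[Test:])         # abnormal returns in the event window

# check (4.8): Var(residual) = (1 - R2) Var(Ri)
resid = Ri[:Test] - (a + b * Rm[:Test])
R2 = 1 - resid.var() / Ri[:Test].var()
print(resid.var(), (1 - R2) * Ri[:Test].var())   # identical, by construction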

High frequency data can be very helpful, provided the time of the event is known. High frequency data effectively allows us to decrease the volatility of the abnormal return since it filters out irrelevant (for the event study) shocks to the return while still capturing the effect of the event.

4.3 Testing the Abnormal Return

In testing if the abnormal return is different from zero, there are two sources of sampling uncertainty. First, the parameters of the normal return are uncertain. Second, even if we knew the normal return for sure, the actual returns are random variables—and they will always deviate from their population mean in any finite sample. The first source of uncertainty is likely to be much smaller than the second—provided the estimation window is much longer than the event window. This is the typical situation, so the rest of the discussion will focus on the second source of uncertainty.

It is typically assumed that the abnormal returns are uncorrelated across time and across assets. The first assumption is motivated by the very low autocorrelation of returns. The second assumption makes a lot of sense if the events are not overlapping in time, so that the events of assets i and j happen at different (calendar) times. In contrast, if the events happen at the same time, the cross-correlation must be handled somehow. This is, for instance, the case if the events are macroeconomic announcements or monetary policy moves. An easy way to handle such synchronized events is to form portfolios of those assets that share the event time—and then only use portfolios with non-overlapping events in the cross-sectional study. For the rest of this section we assume no autocorrelation or cross-correlation.

Let σ²_i = Var(u_{i,t}) be the variance of the abnormal return of asset i. The variance of the cross-sectional (across the n assets) average, ū_s in (4.3), is then

Var(ū_s) = (σ²_1 + σ²_2 + ... + σ²_n)/n² = Σ_{i=1}^n σ²_i/n², (4.9)

since all covariances are assumed to be zero. In a large sample (where the asymptotic normality of a sample average starts to kick in), we can therefore use a t-test since

ū_s/Std(ū_s) →_d N(0, 1). (4.10)

The cumulative abnormal return over q periods, car_{i,q}, can also be tested with a t-test. Since the returns are assumed to have no autocorrelation, the variance of car_{i,q} is

Var(car_{i,q}) = q σ²_i. (4.11)

This variance is increasing in q since we are considering cumulative returns (not the time average of returns).

The variance of the cross-sectional average, c̄ar_q, is then (similarly to (4.9))

Var(c̄ar_q) = (qσ²_1 + qσ²_2 + ... + qσ²_n)/n² = Σ_{i=1}^n qσ²_i/n². (4.12)

Figures 4.2a-b in Campbell, Lo, and MacKinlay (1997) provide a nice example of an event study (based on the effect of earnings announcements).
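A sketch (Python/numpy) of the t-tests in (4.9)-(4.12), with a hypothetical matrix u of abnormal returns (event time in rows, events in columns) and asset variances σ²_i that would in practice come from the estimation window:

import numpy as np

np.random.seed(3)
n, q = 100, 3                                 # number of events and car horizon
sig2 = np.random.uniform(1.0, 2.0, n)         # sigma_i^2 (hypothetical values)
u = np.sqrt(sig2) * np.random.randn(q, n)     # abnormal returns in the event window

u_bar = u[0, :].mean()                        # average abnormal return at one lead
var_u_bar = sig2.sum() / n**2                 # (4.9)
print(u_bar / np.sqrt(var_u_bar))             # t-stat, (4.10)

car_bar = u.sum(axis=0).mean()                # average q-period car
var_car_bar = q * sig2.sum() / n**2           # (4.12)
print(car_bar / np.sqrt(var_car_bar))         # t-stat for the average car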

4.4 Quantitative Events

Some events are not easily classified as discrete variables. For instance, the effect of a positive earnings surprise is likely to depend on how large the surprise is, not just on whether there was a positive surprise. This can be studied by regressing the abnormal return (typically the cumulative abnormal return) on the value of the event (x_i)

car_{i,q} = a + b x_i + ζ_i. (4.13)

The slope coefficient is then a measure of how much the cumulative abnormal return reacts to a change of one unit of x_i.
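A minimal sketch of (4.13) on hypothetical data, where x is the size of the event (say, an earnings surprise):

import numpy as np

np.random.seed(7)
n = 200
x = np.random.randn(n)                      # size of the event (hypothetical)
car = 0.5 + 2.0 * x + np.random.randn(n)    # cumulative abnormal returns (hypothetical)

X = np.column_stack([np.ones(n), x])
a, b = np.linalg.lstsq(X, car, rcond=None)[0]
print(b)                                    # reaction of the car to a one-unit change in x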


A Derivation of (4.8)

From (4.7), the derivation of (4.8) is as follows:

Var(R_{i,t}) = β²_i Var(R_{m,t}) + Var(ε_{i,t}).

We therefore get

Var(ε_{i,t}) = Var(R_{i,t}) − β²_i Var(R_{m,t})
             = Var(R_{i,t}) − Cov(R_{i,t}, R_{m,t})²/Var(R_{m,t})
             = Var(R_{i,t}) − Corr(R_{i,t}, R_{m,t})² Var(R_{i,t})
             = (1 − R²) Var(R_{i,t}).

The second equality follows from the fact that β_i equals Cov(R_{i,t}, R_{m,t})/Var(R_{m,t}), the third equality from multiplying and dividing the last term by Var(R_{i,t}) and using the definition of the correlation, and the fourth equality from the fact that the coefficient of determination in a simple regression equals the squared correlation of the dependent variable and the regressor.

Bibliography

Bodie, Z., A. Kane, and A. J. Marcus, 2002, Investments, McGraw-Hill/Irwin, Boston, 5th edn.

Campbell, J. Y., A. W. Lo, and A. C. MacKinlay, 1997, The Econometrics of Financial Markets, Princeton University Press, Princeton, New Jersey.

Elton, E. J., M. J. Gruber, S. J. Brown, and W. N. Goetzmann, 2003, Modern Portfolio Theory and Investment Analysis, John Wiley and Sons, 6th edn.

Fama, E. F., and K. R. French, 1993, "Common Risk Factors in the Returns on Stocks and Bonds," Journal of Financial Economics, 33, 3–56.



5 Time Series Analysis

More advanced material is denoted by a star (∗). It is not required reading.
Main reference: Newbold (1995) 17 or Pindyck and Rubinfeld (1998) 13.5, 16.1-2, and 17.2

Time series analysis has proved to be a fairly efficient way of producing forecasts. Its main drawback is that it is typically not conducive to structural or economic analysis of the forecast. Still, small VAR systems (see below) have been found to forecast as well as most other forecasting models (including large structural macroeconometric models).

5.1 Descriptive Statistics

The pth autocovariance of y_t is estimated by

Cov(y_t, y_{t−p}) = Σ_{t=1}^T (y_t − ȳ)(y_{t−p} − ȳ)/T, where ȳ = Σ_{t=1}^T y_t/T. (5.1)

The conventions in time series analysis are that we use the same estimated (using all data) mean in both places and that we divide by T.

The pth autocorrelation is estimated as

Corr(y_t, y_{t−p}) = Cov(y_t, y_{t−p})/Std(y_t)². (5.2)

Compared with a traditional estimate of a correlation, we here impose that the standard deviations of y_t and y_{t−p} are the same (which typically does not make much of a difference).

The pth partial autocorrelation is discussed in Section 5.3.7.


5.2 White Noise

The white noise process is the basic building block used in most other time series models. It is characterized by a zero mean, a constant variance, and no autocorrelation

E ε_t = 0,
Var(ε_t) = σ², and
Cov(ε_{t−s}, ε_t) = 0 if s ≠ 0. (5.3)

If, in addition, ε_t is normally distributed, then it is said to be Gaussian white noise. This process can clearly not be forecasted.

To construct a variable that has a non-zero mean, we can form

yt = µ + εt , (5.4)

where µ is constant. This process is most easily estimated by estimating the sample mean and variance in the usual way (as in (5.1) with p = 0).

5.3 Autoregression (AR)

5.3.1 AR(1)

In this section we study the first-order autoregressive process, AR(1), in some detail in order to understand the basic concepts of autoregressive processes. The process is assumed to have a zero mean (or is demeaned, that is, an original variable minus its mean, for instance y_t = x_t − x̄)—but it is straightforward to put in any mean or trend.

An AR(1) is

y_t = a y_{t−1} + ε_t, with Var(ε_t) = σ², (5.5)

where ε_t is the white noise process in (5.3) which is uncorrelated with y_{t−1}. If −1 < a < 1, then the effect of a shock eventually dies out: y_t is stationary.

The AR(1) model can be estimated with OLS (since ε_t and y_{t−1} are uncorrelated) and the usual tools for testing significance of coefficients and estimating the variance of the residual all apply.



The basic properties of an AR(1) process are (provided |a| < 1)

Var(y_t) = σ²/(1 − a²), (5.6)
Corr(y_t, y_{t−s}) = a^s, (5.7)

so the variance and autocorrelation are increasing in a.

Remark 23 (Autocorrelation and autoregression). Notice that the OLS estimate of a in (5.5) is essentially the same as the sample autocorrelation coefficient in (5.2). This follows from the fact that the slope coefficient is Cov(y_t, y_{t−p})/Var(y_{t−p}). The denominator can be a bit different since a few data points are left out in the OLS estimation, but the difference is likely to be small.

Example 24 With a = 0.85 and σ² = 0.5², we have Var(y_t) = 0.25/(1 − 0.85²) ≈ 0.9, which is much larger than the variance of the residual. (Why?)
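The answer can be checked with a small simulation sketch (Python/numpy) of (5.5)-(5.7); the variance is large because each shock keeps affecting y for many periods:

import numpy as np

np.random.seed(2)
a, sigma, T = 0.85, 0.5, 100_000
eps = sigma * np.random.randn(T)
y = np.zeros(T)
for t in range(1, T):
    y[t] = a * y[t - 1] + eps[t]           # simulate the AR(1) in (5.5)

print(y.var())                              # ~ sigma^2/(1 - a^2) = 0.90, per (5.6)
print(np.corrcoef(y[1:], y[:-1])[0, 1])     # ~ a = 0.85, per (5.7)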

If a = 1 in (5.5), then we get a random walk. It is clear from the previous analysis that a random walk is non-stationary—that is, the effect of a shock never dies out. This implies that the variance is infinite and that the standard tools for testing coefficients etc. are invalid. The solution is to study changes in y instead: y_t − y_{t−1}. In general, processes with the property that the effect of a shock never dies out are called non-stationary or unit root or integrated processes. Try to avoid them.

See Figure 5.1 for an example of an AR(1).

[Figure 5.1: Predicting US stock returns (various investment horizons) with lagged returns. Four panels (US stock returns 1926–2003) plot, against the return horizon in months: the slope and R² from regressing returns on lagged returns, and the slope and R² from regressing returns on the dividend/price ratio; slopes are shown with 90% confidence bands (Newey-West standard errors, MA(horizon−1)).]


5.3.2 More on the Properties of an AR(1) Process∗

Solve (5.5) backwards by repeated substitution

y_t = a(a y_{t−2} + ε_{t−1}) + ε_t (the term in parentheses is y_{t−1}) (5.8)
    = a² y_{t−2} + a ε_{t−1} + ε_t (5.9)
    ⋮ (5.10)
    = a^{K+1} y_{t−K−1} + Σ_{s=0}^K a^s ε_{t−s}. (5.11)

The factor a^{K+1} y_{t−K−1} declines monotonically to zero if 0 < a < 1 as K increases, and declines in an oscillating fashion if −1 < a < 0. In either case, the AR(1) process is covariance stationary and we can then take the limit as K → ∞ to get

y_t = ε_t + a ε_{t−1} + a² ε_{t−2} + ... = Σ_{s=0}^∞ a^s ε_{t−s}. (5.12)

Since ε_t is uncorrelated over time, y_{t−1} and ε_t are uncorrelated. We can therefore calculate the variance of y_t in (5.5) as the sum of the variances of the two components on the right hand side

Var(y_t) = Var(a y_{t−1}) + Var(ε_t)
         = a² Var(y_{t−1}) + Var(ε_t)
         = Var(ε_t)/(1 − a²), since Var(y_{t−1}) = Var(y_t). (5.13)

In this calculation, we use the fact that Var(y_{t−1}) and Var(y_t) are equal. Formally, this follows from the fact that they are both linear functions of current and past ε_s terms (see (5.12)), which have the same variance over time (ε_t is assumed to be white noise).

Note from (5.13) that the variance of y_t is increasing in the absolute value of a, which is illustrated in Figure 5.2. The intuition is that a large |a| implies that a shock has an effect over many time periods and thereby creates movements (volatility) in y.



[Figure 5.2: Properties of AR(1) process. Two panels show forecasts with 90% confidence bands against the forecasting horizon, for initial values 3 and 0, based on the AR(1) model y_{t+1} = 0.85 y_t + ε_{t+1}, σ = 0.5.]

Similarly, the covariance of y_t and y_{t−1} is

Cov(y_t, y_{t−1}) = Cov(a y_{t−1} + ε_t, y_{t−1}) = a Cov(y_{t−1}, y_{t−1}) = a Var(y_t). (5.14)

We can then calculate the first-order autocorrelation as

Corr(y_t, y_{t−1}) = Cov(y_t, y_{t−1})/(Std(y_t) Std(y_{t−1})) = a. (5.15)

It is straightforward to show that

Corr(y_t, y_{t−s}) = Corr(y_{t+s}, y_t) = a^s. (5.16)

5.3.3 Forecasting with an AR(1)

Suppose we have estimated an AR(1). To simplify the exposition, we assume that we actually know a and Var(ε_t), which might be a reasonable approximation if they were estimated on a long sample.


We want to forecast y_{t+1} using information available in t. From (5.5) we get

y_{t+1} = a y_t + ε_{t+1}. (5.17)

Since the best guess of ε_{t+1} is that it is zero, the best forecast and the associated forecast error are

E_t y_{t+1} = a y_t, and (5.18)
y_{t+1} − E_t y_{t+1} = ε_{t+1} with variance σ². (5.19)

We may also want to forecast y_{t+2} using the information in t. To do that, note that (5.5) gives

y_{t+2} = a y_{t+1} + ε_{t+2}
        = a(a y_t + ε_{t+1}) + ε_{t+2} (the term in parentheses is y_{t+1})
        = a² y_t + a ε_{t+1} + ε_{t+2}. (5.20)

Since E_t ε_{t+1} and E_t ε_{t+2} are both zero, we get that

E_t y_{t+2} = a² y_t, and (5.21)
y_{t+2} − E_t y_{t+2} = a ε_{t+1} + ε_{t+2} with variance a²σ² + σ². (5.22)

More generally, we have

E_t y_{t+s} = a^s y_t, (5.23)
Var(y_{t+s} − E_t y_{t+s}) = (1 + a² + a⁴ + ... + a^{2(s−1)}) σ² (5.24)
                           = (a^{2s} − 1)/(a² − 1) σ². (5.25)

Example 25 If y_t = 3, a = 0.85 and σ = 0.5, then the forecasts and the forecast error variances become

Horizon s   E_t y_{t+s}          Var(y_{t+s} − E_t y_{t+s})
1           0.85 × 3 = 2.55      0.25
2           0.85² × 3 = 2.17     (0.85² + 1) × 0.5² = 0.43
25          0.85²⁵ × 3 = 0.05    (0.85⁵⁰ − 1)/(0.85² − 1) × 0.5² = 0.90



Notice that the point forecasts converge towards zero and the forecast error variance converges to the unconditional variance (see Example 24).

If the shocks ε_t are normally distributed, then we can calculate 90% confidence intervals around the point forecasts in (5.18) and (5.21) as

90% confidence band of E_t y_{t+1}: a y_t ± 1.65 × σ (5.26)
90% confidence band of E_t y_{t+2}: a² y_t ± 1.65 × √(a²σ² + σ²). (5.27)

(Recall that 90% of the probability mass is within the interval −1.65 to 1.65 in the N(0,1) distribution.) To get 95% confidence bands, replace 1.65 by 1.96. Figure 5.2 gives an example.

Example 26 Continuing Example 25, we get the following 90% confidence bands

Horizon s
1     2.55 ± 1.65 × √0.25 ≈ [1.7, 3.4]
2     2.17 ± 1.65 × √0.43 ≈ [1.1, 3.2]
25    0.05 ± 1.65 × √0.90 ≈ [−1.5, 1.6]
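A small sketch (Python/numpy) that reproduces Examples 25 and 26 from (5.23), (5.25), and the 90% bands:

import numpy as np

a, sigma, y0 = 0.85, 0.5, 3.0
for s in (1, 2, 25):
    point = a**s * y0                                 # point forecast, (5.23)
    var = (a**(2 * s) - 1) / (a**2 - 1) * sigma**2    # forecast error variance, (5.25)
    band = 1.65 * np.sqrt(var)                        # 90% confidence band
    print(s, round(point, 2), round(var, 2),
          (round(point - band, 1), round(point + band, 1)))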

Remark 27 (White noise as special case of AR(1).) When a = 0 in (5.5), then the AR(1) collapses to a white noise process. The forecast is then a constant (zero) for all forecasting horizons, see (5.23), and the forecast error variance is also the same for all horizons, see (5.25).

5.3.4 Adding a Constant to the AR(1)

The discussion of the AR(1) worked with a zero mean variable, but that was just for convenience (to make the equations shorter). One way to work with a variable x_t with a non-zero mean is to first estimate its sample mean x̄ and then let the y_t in the AR(1) model (5.5) be a demeaned variable, y_t = x_t − x̄.

To include a constant µ in the theoretical expressions, we just need to substitute x_t − µ for y_t everywhere. For instance, in (5.5) we would get

x_t − µ = a(x_{t−1} − µ) + ε_t, or
x_t = (1 − a)µ + a x_{t−1} + ε_t. (5.28)


Estimation by LS will therefore give an intercept that equals (1 − a)µ and a slope coefficient that equals a.

5.3.5 AR(p)

The pth-order autoregressive process, AR(p), is a straightforward extension of the AR(1)

y_t = a_1 y_{t−1} + a_2 y_{t−2} + ... + a_p y_{t−p} + ε_t. (5.29)

All the previous calculations can be made on this process as well—it is just a bit messier. This process can also be estimated with OLS since ε_t is uncorrelated with lags of y_t. Adding a constant is straightforward by substituting x_t − µ for y_t everywhere.

5.3.6 Forecasting with an AR(2)

As an example, consider making a forecast of y_{t+1} based on the information in t by using an AR(2)

y_{t+1} = a_1 y_t + a_2 y_{t−1} + ε_{t+1}. (5.30)

This immediately gives the one-period point forecast

E_t y_{t+1} = a_1 y_t + a_2 y_{t−1}. (5.31)

We can use (5.30) to write y_{t+2} as

y_{t+2} = a_1 y_{t+1} + a_2 y_t + ε_{t+2}
        = a_1(a_1 y_t + a_2 y_{t−1} + ε_{t+1}) + a_2 y_t + ε_{t+2} (the term in parentheses is y_{t+1})
        = (a_1² + a_2) y_t + a_1 a_2 y_{t−1} + a_1 ε_{t+1} + ε_{t+2}. (5.32)

Figure 5.3 gives an empirical example.

The expressions for the forecasts and forecast error variances quickly get somewhat messy—and even more so with an AR of higher order than two. There is a simple, and approximately correct, shortcut that can be taken. Note that both the one-period and two-period forecasts are linear functions of y_t and y_{t−1}.



[Figure 5.3: Forecasting with an AR(2). Three panels plot forecasts made in t−1, t−2, and t−3 against actual values (in %), 1995–2005, with R² = 0.77, 0.67, and 0.46, respectively. Notes: AR(2) model of US GDP growth (since same quarter last year); estimated on 1947–1994; slope coefficients 1.26 and −0.51; R² is corr(forecast, actual)² for 1995–; the panels show y(t) and E(t−s)y(t) in quarter t.]

We could therefore estimate the following two equations with OLS

y_{t+1} = a_1 y_t + a_2 y_{t−1} + ε_{t+1} (5.33)
y_{t+2} = b_1 y_t + b_2 y_{t−1} + v_{t+2}. (5.34)

Clearly, (5.33) is the same as (5.30) and the estimated coefficients can therefore be used to make one-period forecasts, and the variance of ε_{t+1} is a good estimator of the variance of the one-period forecast error. The coefficients in (5.34) will be very similar to what we get by combining the a_1 and a_2 coefficients as in (5.32): b_1 will be similar to a_1² + a_2 and b_2 to a_1 a_2 (in an infinite sample they should be identical). Equation (5.34) can therefore be used to make two-period forecasts, and the variance of v_{t+2} can be taken to be the forecast error variance for this forecast.
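A sketch (Python/numpy) of this shortcut on hypothetical simulated AR(2) data (the coefficients are borrowed from Figure 5.3 for concreteness):

import numpy as np

np.random.seed(4)
a1, a2, T = 1.26, -0.51, 5_000
eps = np.random.randn(T)
y = np.zeros(T)
for t in range(2, T):
    y[t] = a1 * y[t - 1] + a2 * y[t - 2] + eps[t]        # simulate the AR(2)

X = np.column_stack([y[1:T - 2], y[0:T - 3]])            # regressors y_t, y_{t-1}
b1h = np.linalg.lstsq(X, y[2:T - 1], rcond=None)[0]      # one-period: ~ (a1, a2), (5.33)
b2h = np.linalg.lstsq(X, y[3:T], rcond=None)[0]          # two-period: ~ (a1^2 + a2, a1*a2), (5.34)
print(b1h, b2h, (a1**2 + a2, a1 * a2))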


5.3.7 Partial Autocorrelations

The pth partial autocorrelation tries to measure the direct relation between y_t and y_{t−p}, where the indirect effects of y_{t−1}, ..., y_{t−p+1} are eliminated. For instance, if y_t is generated by an AR(1) model, then the 2nd autocorrelation is a², whereas the 2nd partial autocorrelation is zero. The partial autocorrelation is therefore a way to gauge how many lags are needed in an AR(p) model.

In practice, the first partial autocorrelation is estimated by a in an AR(1)

y_t = a y_{t−1} + ε_t. (5.35)

The second partial autocorrelation is estimated by the second slope coefficient (a_2) in an AR(2)

y_t = a_1 y_{t−1} + a_2 y_{t−2} + ε_t, (5.36)

and so forth. The general pattern is that the pth partial autocorrelation is estimated by the slope coefficient of the pth lag in an AR(p), where we let p go from 1, 2, 3, ...
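A sketch (Python/numpy) of this estimation scheme on hypothetical AR(1) data, where the first partial autocorrelation should be close to a and the higher ones close to zero:

import numpy as np

np.random.seed(5)
T = 10_000
eps = np.random.randn(T)
y = np.zeros(T)
for t in range(1, T):
    y[t] = 0.8 * y[t - 1] + eps[t]        # hypothetical AR(1) data

for p in (1, 2, 3):
    # regress y_t on its first p lags; the last slope is the pth partial autocorrelation
    X = np.column_stack([y[p - 1 - k : T - 1 - k] for k in range(p)])
    coefs = np.linalg.lstsq(X, y[p:], rcond=None)[0]
    print(p, coefs[-1])                   # ~0.8 for p = 1, ~0 for p = 2, 3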

5.4 Moving Average (MA)

A qth-order moving average process is

yt = εt + θ1εt−1 + ... + θqεt−q, (5.37)

where the innovation ε_t is white noise (usually Gaussian).

Estimation of MA processes is typically done by setting up the likelihood function and then using some numerical method to maximize it; LS does not work at all since the right hand side variables are unobservable. This is one reason why MA models play a limited role in applied work. Moreover, most MA models can be well approximated by an AR model of low order.

Remark 28 The autocorrelations and partial autocorrelations (for different lags) can help us gauge if the time series looks more like an AR or an MA. In an AR(p) model, the autocorrelations decay to zero for long lags, while the (p + 1)th partial autocorrelation (and beyond) goes abruptly to zero. The reverse is true for an MA model.



5.5 ARMA(p,q)

Autoregressive-moving average models add a moving average structure to an AR model. For instance, an ARMA(2,1) could be

y_t = a_1 y_{t−1} + a_2 y_{t−2} + ε_t + θ_1 ε_{t−1},

where ε_t is white noise. This type of model is much harder to estimate than the autoregressive model (LS cannot be used). The appropriate specification of the model (number of lags of y_t and ε_t) is often unknown. The Box-Jenkins methodology is a set of guidelines for arriving at the correct specification: start with some model, study the autocorrelation structure of the fitted residuals, and then change the model accordingly.

Most ARMA models can be well approximated by an AR model—provided we add some extra lags. Since AR models are so simple to estimate, this approximation approach is often used.

Remark 29 In an ARMA model, both the autocorrelations and partial autocorrelations decay to zero for long lags.

5.6 VAR(p)

The vector autoregression is a multivariate version of the AR process: we can think of y_t and ε_t in (5.29) as vectors and the a_i as matrices.

For instance, the VAR(1) of two variables (x_t and z_t) is (in matrix form)

[x_{t+1}; z_{t+1}] = [a_11, a_12; a_21, a_22][x_t; z_t] + [ε_{x,t+1}; ε_{z,t+1}], (5.38)

or equivalently

x_{t+1} = a_11 x_t + a_12 z_t + ε_{x,t+1}, and (5.39)
z_{t+1} = a_21 x_t + a_22 z_t + ε_{z,t+1}. (5.40)

Both (5.39) and (5.40) are regression equations, which can be estimated with OLS (since ε_{x,t+1} and ε_{z,t+1} are uncorrelated with x_t and z_t).


With the information available in t, that is, information about x_t and z_t, (5.39) and (5.40) can be used to forecast one step ahead as

E_t x_{t+1} = a_11 x_t + a_12 z_t (5.41)
E_t z_{t+1} = a_21 x_t + a_22 z_t. (5.42)

We also want to make a forecast of x_{t+2} based on the information in t. Clearly, it must be the case that

E_t x_{t+2} = a_11 E_t x_{t+1} + a_12 E_t z_{t+1} (5.43)
E_t z_{t+2} = a_21 E_t x_{t+1} + a_22 E_t z_{t+1}. (5.44)

We already have values for E_t x_{t+1} and E_t z_{t+1} from (5.41) and (5.42) which we can use. For instance, for E_t x_{t+2} we get

E_t x_{t+2} = a_11(a_11 x_t + a_12 z_t) + a_12(a_21 x_t + a_22 z_t)
            = (a_11² + a_12 a_21) x_t + (a_12 a_22 + a_11 a_12) z_t. (5.45)

This has the same form as the one-period forecast in (5.41), but with other coefficients. Note that all we need to make the forecasts (for both t + 1 and t + 2) are the values in period t (x_t and z_t). This follows from the fact that (5.38) is a first-order system where the values of x_t and z_t summarize all relevant information about the future that is available in t.

The forecast uncertainty about the one-period forecast is simple: the forecast error is x_{t+1} − E_t x_{t+1} = ε_{x,t+1}. The two-period forecast error, x_{t+2} − E_t x_{t+2}, is a linear combination of ε_{x,t+1}, ε_{z,t+1}, and ε_{x,t+2}. The calculations of the forecast error variance (as well as of the forecasts themselves) quickly get messy. This is even more true when the VAR system is of a higher order.
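A minimal sketch (Python/numpy) of the recursive forecasts (5.41)-(5.45), with a hypothetical coefficient matrix A:

import numpy as np

A = np.array([[0.5, 0.2],
              [0.1, 0.7]])                # hypothetical VAR(1) coefficient matrix
y_t = np.array([1.0, -0.5])               # current values (x_t, z_t)

f1 = A @ y_t                              # one-step forecasts, (5.41)-(5.42)
f2 = A @ f1                               # two-step forecasts, (5.43)-(5.44)
print(f1, f2, np.allclose(f2, A @ A @ y_t))  # (5.45): two-step forecast = A^2 times (x_t, z_t)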

As for the AR(p) model, a practical way to get around the problem with messy calculations is to estimate a separate model for each forecasting horizon. In a large sample, the difference between the two ways is trivial. For instance, suppose the correct model is the VAR(1) in (5.38) and that we want to forecast x one and two periods ahead. From (5.41) and (5.45) we see that the regression equations should be of the form

x_{t+1} = δ_1 x_t + δ_2 z_t + u_{t+1}, and (5.46)
x_{t+2} = γ_1 x_t + γ_2 z_t + w_{t+2}. (5.47)

With estimated coefficients (OLS can be used), it is straightforward to calculate forecasts and forecast error variances.

In a more general VAR(p) model we need to include p lags of both x and z in the regression equations (p = 1 in (5.46) and (5.47)).

5.6.1 Granger Causality

If z_t can help predict future x, over and above what lags of x itself can, then z is said to Granger cause x. This is a statistical notion of causality, and may not necessarily have much to do with economic causality (Christmas cards may Granger cause Christmas). In (5.46), z does Granger cause x if δ_2 ≠ 0, which can be tested with an F-test. More generally, there may be more lags of both x and z in the equation, so we need to test if all coefficients on different lags of z are zero.

5.7 Non-stationary Processes

5.7.1 Introduction

A trend-stationary process can be made stationary by subtracting a linear trend. The simplest example is

y_t = µ + βt + ε_t, (5.48)

where ε_t is white noise.

A unit root process can be made stationary only by taking a difference. The simplest example is the random walk with drift

y_t = µ + y_{t−1} + ε_t, (5.49)

where ε_t is white noise. The name "unit root process" comes from the fact that the largest eigenvalue of the canonical form (the VAR(1) form of the AR(p)) is one. Such a process is said to be integrated of order one (often denoted I(1)) and can be made stationary by taking first differences.

Example 30 (Non-stationary AR(2).) The process y_t = 1.5 y_{t−1} − 0.5 y_{t−2} + ε_t can be written

[y_t; y_{t−1}] = [1.5, −0.5; 1, 0][y_{t−1}; y_{t−2}] + [ε_t; 0],

where the matrix has the eigenvalues 1 and 0.5 and is therefore non-stationary. Note that subtracting y_{t−1} from both sides gives y_t − y_{t−1} = 0.5(y_{t−1} − y_{t−2}) + ε_t, so the variable x_t = y_t − y_{t−1} is stationary.

The distinguishing feature of unit root processes is that the effect of a shock never vanishes. This is most easily seen for the random walk. Substitute repeatedly in (5.49) to get

y_t = µ + (µ + y_{t−2} + ε_{t−1}) + ε_t
    ⋮
    = tµ + y_0 + Σ_{s=1}^t ε_s. (5.50)

The effect of ε_t never dies out: a non-zero value of ε_t gives a permanent shift of the level of y_t. This process is clearly non-stationary. A consequence of the permanent effect of a shock is that the variance of the conditional distribution grows without bound as the forecasting horizon is extended. For instance, for the random walk with drift, (5.50), the distribution conditional on the information in t = 0 is N(y_0 + tµ, tσ²) if the innovations are Gaussian. This means that the expected change is tµ and that the conditional variance grows linearly with the forecasting horizon. The unconditional variance is therefore infinite and the standard results on inference are not applicable.

In contrast, the conditional distribution from the trend-stationary model, (5.48), is N(µ + βt, σ²).

A process could have two unit roots (integrated of order 2: I(2)). In this case, we need to difference twice to make it stationary. Alternatively, a process can also be explosive, that is, have eigenvalues outside the unit circle. In this case, the impulse response function diverges.

Example 31 (Two unit roots.) Suppose y_t in Example 30 is actually the first difference of some other series, y_t = z_t − z_{t−1}. We then have

z_t − z_{t−1} = 1.5(z_{t−1} − z_{t−2}) − 0.5(z_{t−2} − z_{t−3}) + ε_t, or
z_t = 2.5 z_{t−1} − 2 z_{t−2} + 0.5 z_{t−3} + ε_t,

which is an AR(3) with the following canonical form

[z_t; z_{t−1}; z_{t−2}] = [2.5, −2, 0.5; 1, 0, 0; 0, 1, 0][z_{t−1}; z_{t−2}; z_{t−3}] + [ε_t; 0; 0].

The eigenvalues are 1, 1, and 0.5, so z_t has two unit roots (integrated of order 2: I(2), and needs to be differenced twice to become stationary).

Example 32 (Explosive AR(1).) Consider the process y_t = 1.5 y_{t−1} + ε_t. The eigenvalue is then outside the unit circle, so the process is explosive. This means that the impulse response to a shock to ε_t diverges (it is 1.5^s for s periods ahead).

5.7.2 Spurious Regressions

Strong trends often cause problems in econometric models where y_t is regressed on x_t. In essence, if no trend is included in the regression, then x_t will appear to be significant, just because it is a proxy for a trend. The same holds for unit root processes, even if they have no deterministic trends: the innovations accumulate, and the series therefore tend to be trending in small samples. A warning sign of a spurious regression is when R² > DW statistic. See Figure 5.4 for an example.
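A simulation sketch (Python/numpy) in the spirit of Figure 5.4: regress one random walk on an independent one and record how often a standard t-test (wrongly) signals significance; the sample size and number of replications are assumptions for illustration:

import numpy as np

np.random.seed(6)
T, nsim = 200, 2_000
tstat = np.empty(nsim)
for i in range(nsim):
    y = np.cumsum(np.random.randn(T))      # two independent random walks
    x = np.cumsum(np.random.randn(T))
    X = np.column_stack([np.ones(T), x])
    coef = np.linalg.lstsq(X, y, rcond=None)[0]
    e = y - X @ coef
    s2 = e @ e / (T - 2)
    se = np.sqrt(s2 * np.linalg.inv(X.T @ X)[1, 1])
    tstat[i] = coef[1] / se                # t-stat of the (truly zero) slope

print((np.abs(tstat) > 1.96).mean())       # far above the nominal 5% rejection rate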

For trend-stationary data, this problem is easily solved by detrending with a linear trend (before estimating, or by just adding a trend to the regression).

However, this is usually a poor method for unit root processes. What is needed is a first difference. For instance, a first difference of the random walk is

Δy_t = y_t − y_{t−1} = ε_t, (5.51)

which is white noise (any finite difference, like y_t − y_{t−s}, will give a stationary series), so we could proceed by applying standard econometric tools to Δy_t.


[Figure 5.4: Distribution of LS estimator when y_t and x_t are independent AR(1) processes. Two panels show the distribution of b_LS for ρ = 0.2 and ρ = 1. Model: y_t = 1 + ρ y_{t−1} + ε_t and x_t = 2 + ρ x_{t−1} + η_t, where ε_t and η_t are uncorrelated; b_LS is the LS estimate of b in y_t = a + b x_t + u_t.]

One may then be tempted to try first-differencing all non-stationary series, since it may be hard to tell if they are unit root processes or just trend-stationary. For instance, a first difference of the trend-stationary process, (5.48), gives

y_t − y_{t−1} = β + ε_t − ε_{t−1}. (5.52)

It is unclear if this is an improvement: the trend is gone, but the errors are now of MA(1) type (in fact, non-invertible, and therefore tricky, in particular for estimation).

5.7.3 Testing for a Unit Root∗

Suppose we run an OLS regression of

y_t = a y_{t−1} + ε_t, (5.53)

where the true value of |a| < 1. The asymptotic distribution of the LS estimator is

√T(â − a) ∼ N(0, 1 − a²). (5.54)



(The variance follows from the standard OLS formula, where the variance of the estimator is σ²(X'X/T)⁻¹. Here plim X'X/T = Var(y_t), which we know is σ²/(1 − a²).)

It is well known (but not easy to show) that when a = 1, then â is biased towards zero in small samples. In addition, the asymptotic distribution is no longer (5.54). In fact, there is a discontinuity in the limiting distribution as we move from a stationary to a non-stationary variable. This, together with the small sample bias, means that we have to use simulated critical values for testing the null hypothesis of a = 1 based on the OLS estimate from (5.53).

The approach is to calculate the test statistic

t = (â − 1)/Std(â),

and reject the null of non-stationarity if t is less than the critical values published by Dickey and Fuller (typically more negative than the standard values, to compensate for the small sample bias) or obtained from your own simulations.
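A sketch (Python/numpy) of such a simulation for the simplest Dickey-Fuller setup (no constant or trend); the sample size and number of replications are assumptions for illustration:

import numpy as np

np.random.seed(8)
T, nsim = 200, 5_000
tstats = np.empty(nsim)
for i in range(nsim):
    y = np.cumsum(np.random.randn(T))             # random walk, so a = 1 is true
    x, z = y[:-1], y[1:]
    ahat = (x @ z) / (x @ x)                      # OLS estimate of a in (5.53)
    e = z - ahat * x
    se = np.sqrt(e @ e / (len(x) - 1) / (x @ x))  # standard error of ahat
    tstats[i] = (ahat - 1) / se

print(np.percentile(tstats, 5))                   # ~ -1.95, the simulated 5% critical value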

In principle, distinguishing between a stationary and a non-stationary series is very difficult (and impossible unless we restrict the class of processes, for instance, to an AR(2)), since any sample of a non-stationary process can be arbitrarily well approximated by some stationary process, and vice versa. The lesson to be learned, from a practical point of view, is that strong persistence in the data generating process (stationary or not) invalidates the usual results on inference. We are usually on safer ground to apply the unit root results in this case, even if the process is actually stationary.

Bibliography

Newbold, P., 1995, Statistics for Business and Economics, Prentice-Hall, 4th edn.

Pindyck, R. S., and D. L. Rubinfeld, 1998, Econometric Models and Economic Forecasts, Irwin McGraw-Hill, Boston, Massachusetts, 4th edn.


6 Predicting Asset Returns

Reference (medium): Elton, Gruber, Brown, and Goetzmann (2003) 17 (efficient markets) and 19 (earnings estimation); Bodie, Kane, and Marcus (1999) 12 (efficient markets); Cuthbertson (1996) 5 and 6
Reference (advanced): Campbell, Lo, and MacKinlay (1997) 2 and 7; Cochrane (2001) 20.1

More advanced material is denoted by a star (∗). It is not required reading.

6.1 Asset Prices, Random Walks, and the Efficient Market Hypothesis

Let P_t be the price of an asset at the end of period t, after any dividends in t have been paid (an ex-dividend price). The gross return (1 + R_{t+1}, like 1.05) of holding an asset with dividends (per current share), D_{t+1}, between t and t + 1 is then defined as

1 + R_{t+1} = (P_{t+1} + D_{t+1})/P_t. (6.1)

The dividend can, of course, be zero in a particular period, so this formulation encompasses the case of daily stock prices with annual dividend payments.

Remark 33 (Conditional expectations) The expected value of the random variable y_{t+1} conditional on the information set in t, E_t y_{t+1}, is the best guess of y_{t+1} using the information in t. Example: suppose y_{t+1} equals x_t + ε_{t+1}, where x_t is known in t, but all we know about ε_{t+1} in t is that it is a random variable with a zero mean and some (finite) variance. In this case, the best guess of y_{t+1} based on what we know in t is equal to x_t.

Take expectations of (6.1) based on the information set in t

1 + E_t R_{t+1} = (E_t P_{t+1} + E_t D_{t+1})/P_t, or (6.2)
P_t = (E_t P_{t+1} + E_t D_{t+1})/(1 + E_t R_{t+1}). (6.3)



This formulation is only a definition, but it will help us organize the discussion of how asset prices are determined.

This expected return, E_t R_{t+1}, is likely to be greater than a riskfree interest rate if the asset has positive systematic (non-diversifiable) risk. For instance, in a CAPM model this would manifest itself in a positive "beta." In an equilibrium setting, we can think of this as a "required return" needed for investors to hold this asset.

6.1.1 Different Versions of the Efficient Market Hypothesis

The efficient market hypothesis casts a long shadow on every attempt to forecast asset prices. In its simplest form it says that it is not possible to forecast asset prices, but there are several other forms with different implications. Before attempting to forecast financial markets, it is useful to take a look at the logic of the efficient market hypothesis. This will help us organize the effort and to interpret the results.

A modern interpretation of the efficient market hypothesis (EMH) is that the information set used in forming the market expectations in (6.2) includes all public information. (This is the semi-strong form of the EMH since it says all public information; the strong form says all public and private information; and the weak form says all information in price and trading volume data.) The implication is that simple stock picking techniques are not likely to improve portfolio performance, that is, to generate abnormal returns. Instead, advanced (costly?) techniques are called for in order to gather more detailed information than that used in the market's assessment of the asset. Clearly, with a better forecast of the future return than the market's, there is plenty of scope for dynamic trading strategies. Note that this modern interpretation of the efficient market hypothesis does not rule out the possibility of forecastable prices or returns—just that abnormal returns cannot be achieved by stock picking techniques which rely on public information.

There are several different traditional interpretations of the EMH. Like the modern interpretation, they do not rule out the possibility of achieving abnormal returns by using better information than the rest of the market. However, they make stronger assumptions about whether prices or returns are forecastable. Typically one of the following is assumed to be unforecastable: price changes, returns, or returns in excess of a riskfree rate (interest rate). By unforecastable, it is meant that the best forecast (expected value conditional on available information) is a constant. Conversely, if it is found that there is some information in t that can predict returns R_{t+1}, then the market cannot price the asset as if E_t R_{t+1} is a constant—at least not if the market forms expectations rationally. We will now analyze the logic of each of the traditional interpretations.

If price changes are unforecastable, then E_t P_{t+1} − P_t equals a constant. Typically, this constant is taken to be zero, so P_t is a martingale. Using E_t P_{t+1} = P_t in (6.2) gives

E_t R_{t+1} = E_t D_{t+1}/P_t. (6.4)

This says that the expected net return on the asset is the expected dividend divided by the current price. This is clearly implausible for daily data since it means that the expected return is zero for all days except those days when the asset pays a dividend (or rather, the day the asset goes ex dividend)—and then an enormous expected return on the one day when E_t D_{t+1} ≠ 0. As a first step, we should probably refine the interpretation of the efficient market hypothesis to include the dividend, so that E_t(P_{t+1} + D_{t+1}) = P_t. Using that in (6.2) gives 1 + E_t R_{t+1} = 1, which can only be satisfied if E_t R_{t+1} = 0, which seems very implausible for longer investment horizons—although it is probably a reasonable approximation for short horizons (a week or less).

If returns are unforecastable, so E_t R_{t+1} = R (a constant), then (6.3) gives

P_t = (E_t P_{t+1} + E_t D_{t+1})/(1 + R). (6.5)

The main problem with this interpretation is that it looks at every asset separately and that outside options are not taken into account. For instance, if the nominal interest rate changes from 5% to 10%, why should the expected (required) return on a stock be unchanged? In fact, most asset pricing models suggest that the expected return E_t R_{t+1} equals the riskfree rate plus compensation for risk.

If excess returns are unforecastable, then the compensation (over the riskfree rate) for risk is constant. The risk compensation is, of course, already reflected in the current price P_t, so the issue is then if there is some information in t which is correlated with the risk compensation in P_{t+1}. Note that such forecastability does not necessarily imply an inefficient market or presence of uninformed traders—it could equally well be due to movements in risk compensation driven by movements in uncertainty (option prices suggest that there are plenty of movements in uncertainty). If so, the forecastability cannot be used to generate abnormal returns (over the riskfree rate plus risk compensation). However, it could also be due to exploitable market inefficiencies. Alternatively, you may argue that the market compensates for risk which you happen to be immune to—so you are interested in the return rather than the risk-adjusted return.

This discussion of the traditional efficient market hypothesis suggests that the most interesting hypotheses to test are if returns or excess returns are forecastable. In practice, the results for them are fairly similar since the movements in most asset returns are much greater than the movements in interest rates.

6.1.2 Martingales and Random Walks∗

Further reading: Cuthbertson (1996) 5.3

The accumulated wealth in a sequence of fair bets is expected to be unchanged. It is then said to be a martingale.

The time series x is a martingale with respect to an information set Ω_t if the expected value of x_{t+s} (s ≥ 1) conditional on the information set Ω_t equals x_t. (The information set Ω_t is often taken to be just the history of x: x_t, x_{t−1}, ...)

The time series x is a random walk if x_{t+1} = x_t + ε_{t+1}, where ε_t and ε_{t+s} are uncorrelated for all s ≠ 0, and E ε_t = 0. (There are other definitions which require that ε_t and ε_{t+s} have the same distribution.) A random walk is a martingale; the converse is not necessarily true.

Remark 34 (A martingale, but not a random walk). Suppose y_{t+1} = y_t u_{t+1}, where u_t and u_{t+s} are uncorrelated for all s ≠ 0, and E_t u_{t+1} = 1. This is a martingale, but not a random walk.

In any case, the martingale property implies that x_{t+s} = x_t + ε_{t+s}, where the expected value of ε_{t+s} based on Ω_t is zero. This is close enough to the random walk to motivate the random walk idea in most cases.

6.1.3 Application: Mean-Variance Portfolio Choice with Predictable Returns∗

If there are non-trivial market imperfections, then predictability can be used to generate economic profits. If there are no important market imperfections, then predictability of excess returns should be thought of as predictable movements in risk premia. We will typically focus on this second interpretation.


As a simple example, consider a small mean-variance investor whose preferences differ from the average investor's: he is not affected by the risk that creates the time variation in expected returns. Suppose he can invest in two assets: a risky asset with return R_{t+1} and a riskfree asset with return R_f and zero variance. The return on the portfolio is R_{p,t+1} = α R_{t+1} + (1 − α) R_f. The utility is quadratic in terms of wealth: E R_{p,t+1} − Var(R_{p,t+1}) k/2. Substituting gives the maximization problem

max_α  α E R_{t+1} + (1 − α) R_f − (k/2) α² Var(R_{t+1}).

The first order condition is

0 = E R_{t+1} − R_f − k α Var(R_{t+1}), or
α = (1/k) (E R_{t+1} − R_f)/Var(R_{t+1}).

The weight on the risky asset is clearly increasing in the excess return and decreasing in the variance. If we compare two investors of this type with the same k, but with different investment horizons, then the portfolio weight α is the same if the ratio of the mean and variance is unchanged by the horizon. This is the case if returns are iid.

To demonstrate the last point, note that the two-period excess return is approximately equal to the sum of two one-period returns, R_{t+1} − R_f + R_{t+2} − R_f. With iid returns the mean of this two-period excess return is 2 E(R_{t+1} − R_f) and the variance is 2 Var(R_{t+1}), since the covariance of the returns is zero. We therefore get

α for 1-period horizon = (1/k) (E R_{t+1} − R_f)/Var(R_{t+1}),
α for 2-period horizon = (1/k) 2 E(R_{t+1} − R_f)/(2 Var(R_{t+1})),

which are the same.

With correlated returns we still have the same two-period mean, but the variance is now Var(R_{t+1} + R_{t+2}) = 2 Var(R_{t+1}) + 2 Cov(R_{t+1}, R_{t+2}). This gives the portfolio weight on the risky asset

α for 2-period horizon = (1/k) 2 E(R_{t+1} − R_f)/(2 Var(R_{t+1}) + 2 Cov(R_{t+1}, R_{t+2})).

With mean reversion in prices the covariance is negative, so the weight on the risky asset is larger for the two-period horizon than for the one-period horizon.

6.2 Autocorrelations

6.2.1 Autocorrelation Coefficients and the Box-Pierce Test

The autocovariances of the y_t process can be estimated as

γ_s = (1/T) Σ_{t=1+s}^T (y_t − ȳ)(y_{t−s} − ȳ), (6.6)
with ȳ = (1/T) Σ_{t=1}^T y_t. (6.7)

(We typically divide by T even if we have only T − s full observations to estimate γ_s from.) Autocorrelations are then estimated as

ρ_s = γ_s/γ_0. (6.8)

The sampling properties of ρ_s are complicated, but there are several useful large sample results for Gaussian processes (these results typically carry over to processes which are similar to the Gaussian—a homoskedastic process with finite 6th moment is typically enough, see Priestley (1981) 5.3 or Brockwell and Davis (1991) 7.2-7.3). When the true autocorrelations are all zero (not ρ_0, of course), then for any i and j different from zero

√T [ρ_i; ρ_j] →_d N([0; 0], [1, 0; 0, 1]). (6.9)

This result can be used to construct tests for both single autocorrelations (t-test or χ² test) and several autocorrelations at once (χ² test).

Example 35 (t-test) We want to test the hypothesis that ρ_1 = 0. Since the N(0, 1) distribution has 5% of the probability mass below −1.65 and another 5% above 1.65, we can reject the null hypothesis at the 10% level if √T |ρ_1| > 1.65. With T = 100, we therefore need |ρ_1| > 1.65/√100 = 0.165 for rejection, and with T = 1000 we need |ρ_1| > 1.65/√1000 ≈ 0.052.

70

The Box-Pierce test follows directly from the result in (6.9), since it shows that√

T ρi

and√

T ρ j are iid N(0,1) variables. Therefore, the sum of the square of them is distributedas an χ2 variable. The test statistics typically used is

QL = TL∑

s=1

ρ2s →

d χ2L . (6.10)

Example 36 (Box-Pierce) Let ρ1 = 0.165, and T = 100, so Q1 = 100 × 0.1652=

2.72. The 10% critical value of the χ21 distribution is 2.71, so the null hypothesis of no

autocorrelation is rejected.

The choice of lag order in (6.10), L , should be guided by theoretical considerations,but it may also be wise to try different values. There is clearly a trade off: too few lags maymiss a significant high-order autocorrelation, but too many lags can destroy the power ofthe test (as the test statistics is not affected much by increasing L , but the critical valuesincrease).

6.2.2 Autoregressions

An alternative way of testing autocorrelations is to estimate an AR model

yt = c + a1yt−1 + a2yt−2 + ... + ap yt−p + εt , (6.11)

and then test if all slope coefficients (a1, a2, ..., ap) are zero with a χ2 or F test. Thisapproach is somewhat less general than the Box-Pierce test, but most stationary timeseries processes can be well approximated by an AR of relatively low order.

6.2.3 Long-Run Autoregressions

Consider an AR(1) of two-period sums (returns)

yt+1 + yt+2 = a + b2 (yt−1 + yt) + εt+2. (6.12)

This can be estimated LS on non-overlapping returns and b2 = 0 can be tested usingstandard methods. Clearly, b2 equals the autocorrelation of two-period returns. This typeof autoregression can be done also on 3-period returns (non-overlapping) and longer.

71

Page 37: S oderlind¨ - Dokuz Eylül UniversityT =25 T =100 Sample average-5 0 5 0 0.2 0.4 b. Distribution of Ö T ´ sample avg. Ö T ´ sample average Sample average of z t-1 where z t has

Overlapping Returns∗

Inference of the slope coefficient in long-run autoregressions like (6.12) must be donewith care. If only non-overlapping returns are used, the standard LS expression for thestandard deviation of the autoregressive parameter is likely to be reasonable. This is notthe case, if overlapping returns are used.

As an example, consider the two-period return, yt−1+yt . Two successive observationswith non-overlapping returns are then

yt+1 + yt+2 = a + b2 (yt−1 + yt) + εt+2 (6.13)

yt+3 + yt+4 = a + b2 (yt+1 + yt+2) + εt+4. (6.14)

Suppose that yt has no autocorrelation , so the slope coefficient b2 = 0. We can then writethe residuals as

εt+2 = −a + yt+1 + yt+2 (6.15)

εt+4 = −a + yt+3 + yt+4, (6.16)

which are uncorrelated.Compare this to the case where we use overlapping data. Two successive observations

are then

yt+1 + yt+2 = a + b2 (yt−1 + yt) + εt+2 (6.17)

yt+2 + yt+3 = a + b2 (yt + yt+1) + εt+3. (6.18)

As before, b2 = 0 is yt has no autocorrelation, so the residuals become

εt+2 = −a + yt+1 + yt+2 (6.19)

εt+3 = −a + yt+2 + yt+3, (6.20)

which are correlated since yt+2 shows up in both. This demonstrates that overlappingreturn data introduces autocorrelation of the residuals—which has to be handled in orderto make correct inference.

72

1990 1995 2000 20050

2

4

6

8SMI

Year

SMI

bill portfolio

1990 1995 2000 2005−10

0

10SMI daily excess returns, %

Year

Autocorr 0.04

Daily SMI data, 1988−2004

Autocorr of returns (daily, weekly, monthly): 0.04 −0.05 0.04

Autocorr of absolute returns (daily, weekly, monthly): 0.26 0.26 0.18

Figure 6.1: Time series properties of SMI

6.2.4 Autoregressions versus Autocorrelations∗

If is straightforward to see the relation between autocorrelations and the AR model whenthe AR model is the true process. This relation is given by the Yule-Walker equations.

For an AR(1), the autoregression coefficient is simply the first autocorrelation coeffi-cient. For an AR(2), yt = a1yt−1 + a2yt−2 + εt , we have Cov(yt , yt)

Cov(yt−1, yt)

Cov(yt−2, yt)

=

Cov(yt , a1yt−1 + a2yt−2 + εt)

Cov(yt−1, a1yt−1 + a2yt−2 + εt)

Cov(yt−2, a1yt−1 + a2yt−2 + εt)

=

a1 Cov(yt , yt−1) + a2 Cov(yt , yt−2) + Cov(yt , εt)

a1 Cov(yt−1, yt−1) + a2 Cov(yt−1, yt−2)

a1 Cov(yt−2, yt−1) + a2 Cov(yt−2, yt−2)

, or

γ0

γ1

γ2

=

a1γ1 + a2γ2 + Var(εt)

a1γ0 + a2γ1

a1γ1 + a2γ0

. (6.21)

73

Page 38: S oderlind¨ - Dokuz Eylül UniversityT =25 T =100 Sample average-5 0 5 0 0.2 0.4 b. Distribution of Ö T ´ sample avg. Ö T ´ sample average Sample average of z t-1 where z t has

0 5 10−0.1

0

0.1

0.2

Autocorr, daily returns

Days

S&P 500 excess returns, 1979−2004

Autocorr with 90% conf band

0 5 10−0.1

0

0.1

0.2

Autocorr, daily abs(returns)

Days

0 5 10−0.1

0

0.1

0.2

Autocorr, weekly returns

Weeks

0 5 10−0.1

0

0.1

0.2

Autocorr, weekly abs(returns)

Weeks

Figure 6.2: Predictability of US stock returns

To transform to autocorrelation, divide through by γ0. The last two equations are then[ρ1

ρ2

]=

[a1 + a2ρ1

a1ρ1 + a2

]or

[ρ1

ρ2

]=

[a1/ (1 − a2)

a21/ (1 − a2) + a2

]. (6.22)

If we know the parameters of the AR(2) model (a1, a2, and Var(εt)). then we can solvefor the autocorrelations. Alternatively, if we know the autocorrelations, then we can solvefor the autoregression coefficients. This demonstrates that testing that all the autocorre-lations are zero is essentially the same as testing if all the autoregressive coefficients arezero. Note, however, that the transformation is non-linear, which may make a differencein small samples.

74

0 20 40 60−0.5

0

0.5Return = c + b*lagged Return, slope

Return horizon (months)

Slope with 90% conf band,Newey−West std, MA(horizon−1)

US stock returns 1926−2003

0 20 40 600

0.05

0.1

Return = c + b*lagged Return, R2

Return horizon (months)

0 20 40 600

0.2

0.4

Return = c + b*D/P, slope

Return horizon (months)

Slope with 90% conf band

0 20 40 600

0.1

0.2

Return = c + b*D/P, R2

Return horizon (months)

Figure 6.3: Predictability of US stock returns

6.2.5 Variance Ratios

A variance ratio is another way to measure predictability. It is defined as the variance ofa q-period return divided by q times the variance of a 1-period return

V Rq =

Var(∑q−1

s=0 yt−s

)q Var(yt)

. (6.23)

75

Page 39: S oderlind¨ - Dokuz Eylül UniversityT =25 T =100 Sample average-5 0 5 0 0.2 0.4 b. Distribution of Ö T ´ sample avg. Ö T ´ sample average Sample average of z t-1 where z t has

0 20 40 600.5

1

1.5Variance Ratio, 1926−

Return horizon (months)

VR with 90% conf band

0 20 40 600.5

1

1.5Variance Ratio, 1952−

Return horizon (months)

US stock returns 1926−2003

Confidence bands use asymptotic sampling distribution of VR

Figure 6.4: Variance ratios, US excess stock returns

To see that this is related to predictability, consider the 2-period variance ratio.

V R2 =Var(yt + yt−1)

2 Var(yt)(6.24)

=Var (yt) + Var (yt−1) + 2 Cov (yt , yt−1)

2 Var (yt)

= 1 +Cov (yt , yt−1)

Var (yt)

= 1 + ρ1. (6.25)

It is clear from (6.25) that if yt is not serially correlated, then the variance ratio is unity;a value above one indicates positive serial correlation and a value below one indicatesnegative serial correlation. The same applies to longer horizons.

The estimation of V Rq is typically not done by replacing the population variancesin (6.23) with the sample variances, since this would require using non-overlapping longreturns—which wastes a lot of data points. For instance, if we have 24 years of data andwe want to study the variance ratio for the 5-year horizon, then 4 years of data are wasted.

76

Instead, we typically rely on a transformation of (6.23)

V Rq =

Var(∑q−1

s=0 yt−s

)q Var(yt)

=

q−1∑s=−(q−1)

(1 −

|s|q

)ρs or

= 1 + 2q−1∑s=1

(1 −

sq

)ρs . (6.26)

To estimate V Rq , we first estimate the autocorrelation coefficients (using all availabledata points for each estimation) and then calculate (6.26).

Remark 37 (*V Rq and the sampling distribution of V Rq) Under the null hypothesis that

there is no autocorrelation, (6.9) and (6.26) give

√T(V Rq − 1

)→

d N

0,

q−1∑s=1

4(

1 −sq

)2 . (6.27)

For instance, we have

√T(V R2 − 1

)→

d N (0, 1) and√

T(V R3 − 1

)→

d N (0, 20/9) . (6.28)

The results in CLM Table 2.5 and 2.6 (weekly CRSP stock index returns, early 1960sto mid 1990s) show variance ratios above one and increasing with the number of lags, q.The results for individual stocks in CLM Table 2.7 show variance ratios close to, or evenbelow, unity. Cochrane 20 Table 20.5–6 report weak evidence for more mean reversion inmulti-year returns (annual NYSE stock index,1926 to mid 1990s).

6.3 Other Predictors and Methods

There are many other possible predictors of future stock returns. For instance, both thedividend-price ratio and nominal interest rates have been used to predict long-run returns,and lagged short-run returns on other assets have been used to predict short-run returns.

77

Page 40: S oderlind¨ - Dokuz Eylül UniversityT =25 T =100 Sample average-5 0 5 0 0.2 0.4 b. Distribution of Ö T ´ sample avg. Ö T ´ sample average Sample average of z t-1 where z t has

6.3.1 Lead-Lags

Stock indices have more positive autocorrelation than (most) the individual stocks: thereshould therefore be fairly strong cross-autocorrelations across individual stocks. (SeeCampbell, Lo, and MacKinlay (1997) Tables 2.7 and 2.8.) Indeed, this is also what isfound in US data where weekly returns of large size stocks forecast weekly returns ofsmall size stocks.

6.3.2 Dividend-Price Ratio as a Predictor

One of the most successful attempts to forecast long-run return is by using the dividend-price ratio (here in logs)

q∑s=1

rt+s = α + βq(dt − pt) + εt+q . (6.29)

For instance, CLM Table 7.1, report R2 values from this regression which are close tozero for monthly returns, but they increase to 0.4 for 4-year returns (US, value weightedindex, mid 1920s to mid 1990s).

6.3.3 Predictability but No Autocorrelation

The evidence for US stock returns is that long-run returns may perhaps be predicted byusing dividend-price ratio or interest rates, but that the long-run autocorrelations are weak(long run US stock returns appear to be “weak-form efficient” but not “semi-strong ef-ficient”). This should remind us of the fact that predictability and autocorrelation neednot be the same thing: although autocorrelation implies predictability, we can have pre-dictability without autocorrelation.

6.3.4 Trading Strategies

Another way to measure predictability and to illustrate its economic importance is tocalculate the return of a dynamic trading strategy, and then measure the “performance”of this strategy in relation to some benchmark portfolios. The trading strategy should, ofcourse, be based on the variable that is supposed to forecast returns.

78

A common way (since Jensen, updated in Huberman and Kandel (1987)) is to studythe performance of a portfolio by running the following regression

R1t − R f t = α + β(Rmt − R f t) + εt , E εt = 0 and Cov(R1t − R f t , εt) = 0, (6.30)

where R1t − R f t is the excess return on the portfolio being studied and Rmt − R f t theexcess returns of a vector of benchmark portfolios (for instance, only the market portfolioif we want to rely on CAPM; returns times conditional information if we want to allowfor time-variation in expected benchmark returns). Neutral performance (mean varianceintersection) requires α = 0, which can be tested with a t or F test.

6.4 Security Analysts

Reference: Makridakis, Wheelwright, and Hyndman (1998) 10.1 and Elton, Gruber,Brown, and Goetzmann (2003) 19

6.4.1 Evidence on Analysts’ Performance

Makridakis, Wheelwright, and Hyndman (1998) 10.1 shows that there is little evidencethat the average stock analyst beats (on average) the market (a passive index portfolio).In fact, less than half of the analysts beat the market. However, there are analysts whichseem to outperform the market for some time, but the autocorrelation in over performanceis weak. The evidence from mutual funds is similar. For them it is typically also foundthat their portfolio weights do not anticipate price movements.

It should be remembered that many analysts also are sales persons: either of a stock(for instance, since the bank is underwriting an offering) or of trading services. It couldwell be that their objective function is quite different from minimizing the squared forecasterrors—or whatever we typically use in order to evaluate their performance. (The numberof litigations in the US after the technology boom/bust should serve as a strong reminderof this.)

6.4.2 Do Security Analysts Overreact?

The paper by Bondt and Thaler (1990) compares the (semi-annual) forecasts (one- andtwo-year time horizons) with actual changes in earnings per share (1976-1984) for several

79

Page 41: S oderlind¨ - Dokuz Eylül UniversityT =25 T =100 Sample average-5 0 5 0 0.2 0.4 b. Distribution of Ö T ´ sample avg. Ö T ´ sample average Sample average of z t-1 where z t has

hundred companies. The study is done by running regressions like

Actual change = α + β(forecasted change),

and the study the estimates of the α and β coefficients. With rational expectations (anda long enough sample), we should have α = 0 (no constant bias in forecasts) and β = 1(proportionality, for instance no exaggeration).

The main findings are as follows. The main result is that 0 < β < 1, so that theforecasted change tends to be too wild in a systematic way: a forecasted change of 1% is(on average) followed by less than 1% actual change in the same direction. This meansthat analysts in this sample tended to be too extreme—to exaggerate both positive andnegative news.

6.4.3 High-Frequency Trading Based on Recommendations from Stock Analysts

Barber, Lehavy, McNichols, and Trueman (2001) give a somewhat different picture.They focus on the profitability of a trading strategy based on analyst’s recommenda-tions. They use a huge data set (some 360,000 recommendation, US stocks) for the period1985-1996. They sort stocks in to five portfolios depending on the consensus (average)recommendation—and redo the sorting every day (if a new recommendation is published).They find that such a daily trading strategy gives a annual 4% abnormal return on the port-folio of the most highly recommended stocks, and an annual -5% abnormal return on theleast favourably recommended stocks.

This strategy requires a lot of trading (a turnover of 400% annually), so trading costswould typically reduce the abnormal return on the best portfolio to almost zero. A lessfrequent rebalancing (weekly, monthly) gives a very small abnormal return for the beststocks, but still a negative abnormal return for the worst stocks. Chance and Hemler(2001) obtain similar results when studying the investment advise by 30 professional“market timers.”

6.4.4 The Characteristics of Individual Analysts’ Forecasts in Europe

Bolliger (2001) study the forecast accuracy (earnings per share) of European (13 coun-tries) analysts for the period 1988–1999. In all, some 100,000 forecasts are studied. Itis found that the forecast accuracy is positively related to how many times an analyst has

80

forecasted that firm and also (surprisingly) to how many firms he/she forecasts. The ac-curacy is negatively related to the number of countries an analyst forecasts and also to thesize of the brokerage house he/she works for.

6.4.5 Bond Rating Agencies versus Stock Analysts

Ederington and Goh (1998) use data on all corporate bond rating changes by Moody’sbetween 1984 and 1990 and the corresponding earnings forecasts (by various stock ana-lysts).

The idea of the paper by Ederington and Goh (1998) is to see if bond ratings driveearnings forecasts (or vice versa), and if they affect stock returns (prices).

1. To see if stock returns are affected by rating changes, they first construct a “normal”return by a market model:

normal stock returnt = α + β × return on stock indext ,

where α and β are estimated on a normal time period (not including the ratingchange). The abnormal return is then calculated as the actual return minus thenormal return. They then study how such abnormal returns behave, on average,around the dates of rating changes. Note that “time” is then measured, individuallyfor each stock, as a distance from the day of rating change. The result is that thereare significant negative abnormal returns following downgrades, but zero abnormalreturns following upgrades.

2. They next turn to question of whether bond ratings drive earnings forecasts or viceversa. To do that they first note that there are some predictable pattern in revisionsof earnings forecasts. They therefore fit a simple autoregressive model of earningsforecasts, and construct a measure of earnings forecast revisions (surprises) fromthe model. They then relate this surprise variable to the bond ratings. In short, theresults are the following:

(a) both earnings forecasts and ratings react to the same information, but there isalso a direct effect of rating changes, which differs between downgrades andupgrades.

81

Page 42: S oderlind¨ - Dokuz Eylül UniversityT =25 T =100 Sample average-5 0 5 0 0.2 0.4 b. Distribution of Ö T ´ sample avg. Ö T ´ sample average Sample average of z t-1 where z t has

(b) downgrades: the ratings have a strong negative direct effect on the earningsforecasts; the returns react ever quicker than analysts

(c) upgrades: the ratings have a small positive direct effect on the earnings fore-casts; there is no effect on the returns

A possible reason for why bond ratings could drive earnings forecasts and prices isthat bond rating firms typically have access to more inside information about firms thanstock analysts and investors.

A possible reason for the observed asymmetric response of returns to ratings is thatfirms are quite happy to release positive news, but perhaps more reluctant to release badnews. If so, then the information advantage of bond rating firms may be particularly largeafter bad news. A downgrading would then reveal more new information than an upgrade.

The different reactions of the earning analysts and the returns are hard to reconcile.

6.4.6 International Differences in Analyst Forecast Properties

Ang and Ciccone (2001) study earnings forecasts for many firms in 42 countries over theperiod 1988 to 1997. Some differences are found across countries: forecasters disagreemore and the forecast errors are larger in countries with low GDP growth, less accountingdisclosure, and less transparent family ownership structure.

However, the most robust finding is that forecasts for firms with losses are special:forecasters disagree more, are more uncertain, and are more overoptimistic about suchfirms.

6.5 Technical Analysis

Main reference: Bodie, Kane, and Marcus (1999) 12.2; Neely (1997) (overview, foreignexchange market)Further reading: Murphy (1999) (practical, a believer’s view); The Economist (1993)(overview, the perspective of the early 1990s); Brock, Lakonishok, and LeBaron (1992)(empirical, stock market); Lo, Mamaysky, and Wang (2000) (academic article on returndistributions for “technical portfolios”)

82

6.5.1 General Idea of Technical Analysis

Technical analysis is typically a data mining exercise which looks for local trends orsystematic non-linear patterns. The basic idea is that markets are not instantaneouslyefficient: prices react somewhat slowly and predictably to news. The logic is essentiallythat an observed price move must be due some news (exactly which is not very important)and that old patterns can tell us where the price will move in the near future. This is anattempt to gather more detailed information than that used by the market as a whole. Inpractice, the technical analysis amounts plotting different transformations (for instance, amoving average) of prices—and to spot known patterns. This section summarizes somesimple trading rules that are used.

6.5.2 Technical Analysis and Local Trends

Many trading rules rely on some kind of local trend which can be thought of as positiveautocorrelation in price movements (also called momentum1).

A filter rule like “buy after an increase of x% and sell after a decrease of y%” isclearly based on the perception that the current price movement will continue.

A moving average rule is to buy if a short moving average (equally weighted or expo-nentially weighted) goes above a long moving average. The idea is that this event signalsa new upward trend. The difference between the two moving averages is called an oscil-

lator (or sometimes, moving average convergence divergence2). A version of the movingaverage oscillator is the relative strength index3, which is the ratio of average price levelon “up” days to the average price on “down” days—during the last z (14 perhaps) days.

The trading range break-out rule typically amounts to buying when the price risesabove a previous peak (local maximum). The idea is that a previous peak is a resistance

level in the sense that some investors are willing to sell when the price reaches that value(perhaps because they believe that prices cannot pass this level; clear risk of circularreasoning or self-fulfilling prophecies; round numbers often play the role as resistancelevels). Once this artificial resistance level has been broken, the price can possibly risesubstantially. On the downside, a support level plays the same role: some investors are

1In physics, momentum equals the mass times speed.2Yes, the rumour us true: the tribe of chartists is on the verge of developing their very own language.3Not to be confused with relative strength, which typically refers to the ratio of two different asset prices

(for instance, an equity compared to the market).

83

Page 43: S oderlind¨ - Dokuz Eylül UniversityT =25 T =100 Sample average-5 0 5 0 0.2 0.4 b. Distribution of Ö T ´ sample avg. Ö T ´ sample average Sample average of z t-1 where z t has

willing to buy when the price reaches that value.When the price is already trending up, then the trading range break-out rule may be

replaced by a channel rule, which works as follows. First, draw a trend line throughprevious lows and a channel line through previous peaks. Extend these lines. If the pricemoves above the channel (band) defined by these lines, then buy. A version of this is todefine the channel by a Bollinger band, which is ±2 standard deviations from a movingdata window around a moving average.

A head and shoulder pattern is a sequence of three peaks (left shoulder, head, rightshoulder), where the middle one (the head) is the highest, with two local lows in betweenon approximately the same level (neck line). (Easier to draw than to explain in a thousandwords.) If the price subsequently goes below the neckline, then it is thought that a negativetrend has been initiated. (An inverse head and shoulder has the inverse pattern.)

Clearly, we can replace “buy” in the previous rules with something more aggressive,for instance, replace a short position with a long.

The trading volume is also often taken into account. If the trading volume of assetswith declining prices is high relative to the trading volume of assets with increasing pricesis high, then this is interpreted as a market with selling pressure. (The basic problem withthis interpretation is that there is a buyer for every seller, so we could equally well interpretthe situations as if there is a buying pressure.)

6.5.3 Technical Analysis and Mean Reversion

If we instead believe in mean reversion of the prices, then we can essentially reverse theprevious trading rules: we would typically sell when the price is high.

Some investors argue that markets show periods of mean reversion and then periodswith trends—an that both can be exploited. Clearly, the concept of a support and re-sistance levels (or more generally, a channel) is based on mean reversion between thesepoints. A new trend is then supposed to be initiated when the price breaks out of thisband.

6.6 Empirical U.S. Evidence on Stock Return Predictability

The two most common methods for investigating the predictability of stock returns areto calculate autocorrelations and to construct simple dynamic portfolios and see if they

84

1990 1995 2000 20051

2

3

4

5

Hold index if MA(3)>MA(25)

Year

SMI

Rule

1990 1995 2000 20051

2

3

4

5

Hold index if Pt > max(P

t−1,...,P

t−5)

Year

SMI

Rule

1990 1995 2000 20051

2

3

4

5

Hold index if Pt/P

t−7 > 1

Year

SMI

Rule

Daily SMI data, 1988−2004

Weekly rebalancing:

hold index or riskfree

Figure 6.5: Examples of trading rules applied to SMI. The rule portfolios are rebalancedevery Wednesday: if condition (see figure titles) is satisfied, then the index is held for thenext week, otherwise a government bill is held. The figures plot the portfolio values.

outperform passive portfolios. The dynamic portfolio could, for instance, be a simplefilter rule that calls for rebalancing once a month by buying (selling) assets which haveincreased (decreased) more than x% the last month. If this portfolio outperforms a passiveportfolio, then this is evidence of some positive autocorrelation (“momentum”) on a one-month horizon. The following points summarize some evidence which seems to hold forboth returns and returns in excess of a riskfree rate (an interest rate).

1. The empirical evidence suggests some, but weak, positive autocorrelation in short

horizon returns (one day up to a month) — probably too little to trade on. The auto-correlation is stronger for small than for large firms (perhaps no autocorrelation at

85

Page 44: S oderlind¨ - Dokuz Eylül UniversityT =25 T =100 Sample average-5 0 5 0 0.2 0.4 b. Distribution of Ö T ´ sample avg. Ö T ´ sample average Sample average of z t-1 where z t has

all for weekly or longer returns in large firms). This implies that equally weightedstock indices have larger autocorrelation than value-weighted indices. (See Camp-bell, Lo, and MacKinlay (1997) Table 2.4.)

2. Stock indices have more positive autocorrelation than (most) individual stocks:there must be fairly strong cross-autocorrelations across individual stocks. (SeeCampbell, Lo, and MacKinlay (1997) Tables 2.7 and 2.8.)

3. There seems to be negative autocorrelation of multi-year stock returns, for instancein 5-year US returns for 1926-1985. It is unclear what drives this result, how-ever. It could well be an artifact of just a few extreme episodes (Great Depression).Moreover, the estimates are very uncertain as there are very few (non-overlapping)multi-year returns even in a long sample—the results could be just a fluke.

4. The aggregate stock market returns, that is, a return on a value-weighted stock in-dex, seems to be forecastable on the medium horizon by various information vari-

ables. In particular, future stock returns seems to be predictable by the currentdividend-price ratio and earnings-price ratios (positively, one to several years), orby the interest rate changes (negatively, up to a year). For instance, the coefficientof determination (usually denoted R2, but should not to be confused with the returnused above) for predicted the two-year return on the US stock market on currentdividend-price ratio is around 0.3 for the 1952-1994 sample. (See Campbell, Lo,and MacKinlay (1997) Tables 7.1-2.) This evidence suggests that expected returnsmay very well be time varying and correlated with the business cycle.

5. Even if short-run returns, Rt+1, are fairly hard to forecast, it is often fairly easyto forecast volatility as measured by |Rt+1| or R2

t+1 (for instance, using ARCH orGARCH models). For an example, see Bodie, Kane, and Marcus (1999) Figure13.7. This could possibly be used for dynamic trading strategies on options (whichdirectly price volatility). For instance, buying both a call and a put option (a “strad-dle” or a “strangle”), is a bet on a large price movement (in any direction).

6. It is sometimes found that stock prices behave differently in periods with highvolatility than in more normal periods. Granger (1992) reports that the forecast-ing performance is sometimes improved by using different forecasting models for

86

these two regimes. A simple and straightforward way to estimate a model for peri-ods of normal volatility is to simply throw out data for volatile periods (and otherexceptional events).

7. It is important to assess forecasting models in terms of their out-of-sample forecastperformance. Too many models seems to fit data in-sample, but most of them failin out-of-sample tests. Forecasting models are of no use of they cannot forecast!

8. There are also a number of strange patterns (“anomalies”) like the small-firms-in-January effect (high returns on these in the first part of January) and the book-to-market (high returns on firms with high book/market value of the firm’s equity).

Bibliography

Ang, J. S., and S. J. Ciccone, 2001, “International Differences in Analyst Forecast Prop-erties,” mimeo, Florida State University.

Barber, B., R. Lehavy, M. McNichols, and B. Trueman, 2001, “Can Investors Profit fromthe Prophets? Security Analyst Recommendations and Stock Returns,” Journal of Fi-

nance, 56, 531–563.

Bodie, Z., A. Kane, and A. J. Marcus, 1999, Investments, Irwin/McGraw-Hill, Boston,4th edn.

Bolliger, G., 2001, “The Characteristics of Individual Analysts’ Forecasts in Europe,”mimeo, University of Neuchatel.

Bondt, W. F. M. D., and R. H. Thaler, 1990, “Do Security Analysts Overreact?,” American

Economic Review, 80, 52–57.

Brock, W., J. Lakonishok, and B. LeBaron, 1992, “Simple Technical Trading Rules andthe Stochastic Properties of Stock Returns,” Journal of Finance, 47, 1731–1764.

Brockwell, P. J., and R. A. Davis, 1991, Time Series: Theory and Methods, SpringerVerlag, New York, second edn.

Campbell, J. Y., A. W. Lo, and A. C. MacKinlay, 1997, The Econometrics of Financial

Markets, Princeton University Press, Princeton, New Jersey.

87

Page 45: S oderlind¨ - Dokuz Eylül UniversityT =25 T =100 Sample average-5 0 5 0 0.2 0.4 b. Distribution of Ö T ´ sample avg. Ö T ´ sample average Sample average of z t-1 where z t has

Chance, D. M., and M. L. Hemler, 2001, “The Performance of Professional MarketTimers: Daily Evidence from Executed Strategies,” Journal of Financial Economics,62, 377–411.

Cochrane, J. H., 2001, Asset Pricing, Princeton University Press, Princeton, New Jersey.

Cuthbertson, K., 1996, Quantitative Financial Economics, Wiley, Chichester, England.

Ederington, L. H., and J. C. Goh, 1998, “Bond Rating Agencies and Stock Analysts: WhoKnows What When?,” Journal of Financial and Quantitative Analysis, 33.

Elton, E. J., M. J. Gruber, S. J. Brown, and W. N. Goetzmann, 2003, Modern Portfolio

Theory and Investment Analysis, John Wiley and Sons, 6th edn.

Granger, C. W. J., 1992, “Forecasting Stock Market Prices: Lessons for Forecasters,”International Journal of Forecasting, 8, 3–13.

Huberman, G., and S. Kandel, 1987, “Mean-Variance Spanning,” Journal of Finance, 42,873–888.

Lo, A. W., H. Mamaysky, and J. Wang, 2000, “Foundations of Technical Analysis: Com-putational Algorithms, Statistical Inference, and Empirical Implementation,” Journal

of Finance, 55, 1705–1765.

Makridakis, S., S. C. Wheelwright, and R. J. Hyndman, 1998, Forecasting: Methods and

Applications, Wiley, New York, 3rd edn.

Murphy, J. J., 1999, Technical Analysis of the Financial Markets, New York Institute ofFinance.

Neely, C. J., 1997, “Technical Analysis in the Foreign Exchange Market: A Layman’sGuide,” Federal Reserve Bank of St. Louis Review.

Priestley, M. B., 1981, Spectral Analysis and Time Series, Academic Press.

The Economist, 1993, “Frontiers of Finance,” pp. 5–20.

88

7 ARCH and GARCH

Reference (easy):Pindyck and Rubinfeld (1998) 10.3; Bodie, Kane, and Marcus (2002)13.4; Hull (2000) 15Reference (medium): Verbeek (2004) 8; Enders (2004) 3Reference (advanced): Campbell, Lo, and MacKinlay (1997) 12; Hamilton (1994) 21;Greene (2003) 11.8; Hentschel (1995); Franses and van Dijk (2000)

7.1 Heteroskedasticity

7.1.1 Heteroskedasticity in General

Time-variation in volatility (heteroskedasticity) is a common feature of macroeconomicand financial data.

The perhaps most straightforward way to gauge it is to estimate a time-series of vari-ances on “rolling samples.” For a zero-mean variable, ut , this could mean

σ 2t = (u2

t−1 + u2t−2 + . . . + u2

t−q)/q, (7.1)

where the latest q observations are used. Notice that σ 2t depends on lagged information,

and could therefore be thought of as the prediction (made in t − 1) of the volatility in t .Unfortunately, this method can produce quite abrupt changes in the estimate. An al-

ternative is therefore to use an exponential moving average (EMA) estimator of volatility,which uses all data points since the beginning of the sample—but where recent observa-tions carry larger weights. The weight for lag s be (1 − λ)λs where 0 < λ < 1, so

σ 2t = (1 − λ)(u2

t−1 + λu2t−2 + λ2u2

t−3 + . . .), (7.2)

which can also be calculated in a recursive fashion as

σ 2t = (1 − λ)u2

t−1 + λσ 2t−1. (7.3)

The initial value (before the sample) could be assumed to be zero or (perhaps better) the

89

Page 46: S oderlind¨ - Dokuz Eylül UniversityT =25 T =100 Sample average-5 0 5 0 0.2 0.4 b. Distribution of Ö T ´ sample avg. Ö T ´ sample average Sample average of z t-1 where z t has

1980 1990 20000

20

40

S&P 500, daily dataG

AR

CH

std

, a

nn

ua

lized

1980 1990 20000

20

40

S&P 500, EMA estimate, λ = 0.99

EM

A s

td,

an

nu

ali

zed

Sample (daily) 1979−2004

AR(1) of excess returns

with GARCH(1,1) errors

AR(1) coef: 0.04

ARCH&GARCH coefs: 0.06 0.93

1980 1990 20000

20

40

S&P 500, EMA estimate, λ = 0.9

EM

A s

td,

an

nu

ali

zed

Figure 7.1: Conditional standard deviation, estimated by GARCH(1,1) model

unconditional variance in a historical sample. Notice that σ 2t depends on lagged informa-

tion, and could therefore be thought of as the prediction (made in t − 1) of the volatilityin t .

This methods is commonly used by practitioners. For instance, the RISK Metrics(formerly part of JP Morgan) uses this method with λ = 0.94 for use on daily data.Alternatively, λ can be chosen to minimize some criterion function like 6T

t=1(u2t − σ 2

t )2.See Figure 7.1 for an example.

7.1.2 Heteroskedastic Residuals in a Regression

Suppose we have a regression model

yt = b0 + x1tb1 + x2tb2 +· · ·+ xktbk + εt , where E εt = 0 and Cov(xi t , εt) = 0. (7.4)

90

In the standard case we assume that εt is iid (independently and identically distributed),which rules out heteroskedasticity.

In case the residuals actually had heteroskedasticity, least squares (LS) is still a usefulestimator: it is still consistent (we get the correct values as the sample becomes reallylarge)—and it is reasonably efficient (in terms of the variance of the estimates). However,the standard expression for the standard errors (of the coefficients) is (except in a specialcase, see below) not correct. This is illustrated in Figure 7.2.

There are two ways to handle this problem. First, we could use some other estimationmethod than LS that incorporates the structure of the heteroskedasticity. For instance,combining the regression model (7.4) with an ARCH structure of the residuals—and es-timate the whole thing with maximum likelihood (MLE) is one way. Second, we couldstick to OLS, but use another expression for the variance of the coefficients (usually calleda “heteroskedasticity consistent covariance matrix,” among which “White’s covariancematrix” is the most common).

To test for heteroskedasticity, we can use White’s test of heteroskedasticity. The nullhypothesis is homoskedasticity, and the alternative hypothesis is the kind of heteroskedas-ticity which can be explained by the levels, squares, and cross products of the regressors—clearly a special form of heteroskedasticity. The reason for this specification is that if thesquared residual is uncorrelated with wt , then the usual LS covariance matrix applies—even if the residuals have some other sort of heteroskedasticity (this is the special casementioned before).

To implement White’s test, let wi be the squares and cross products of the regressors.For instance, if the regressors include (1, x1t , x2t) then wt is the vector (1, x1t , x2t , x2

1t , x1t x2t , x22t ).

The test is then to run a regression of squared fitted residuals on wt

ε2t = w′

tγ + vi , (7.5)

and to test if all the slope coefficients (not the intercept) in γ are zero. (This can be donebe using the fact that T R2

∼ χ2P , P = dim(wi ) − 1.)

7.1.3 Autoregressive Conditional Heteroskedasticity (ARCH)

Autoregressive heteroskedasticity is another special form of heteroskedasticity—and it isoften found in financial data which shows volatility clustering (calm spells, followed by

91

Page 47: S oderlind¨ - Dokuz Eylül UniversityT =25 T =100 Sample average-5 0 5 0 0.2 0.4 b. Distribution of Ö T ´ sample avg. Ö T ´ sample average Sample average of z t-1 where z t has

−0.5 −0.4 −0.3 −0.2 −0.1 0 0.1 0.20

0.02

0.04

0.06

0.08

0.1Std of LS estimator

α

Model: yt=0.9x

t+ε

t,

where εt ∼ N(0,h

t), with h

t = 0.5exp(αx

t

2)

σ2(X’X)

−1

White’s

Simulated

Figure 7.2: Variance of OLS estimator, heteroskedastic errors

volatile spells, followed by...).To test for ARCH features, Engle’s test of ARCH is perhaps the most straightforward.

It amounts to running an AR(q) regression of the squared zero-mean variable (here de-noted ut )

u2t = a0 + a1u2

t−1 + . . . + aqu2t−q + vt , (7.6)

Under the null hypothesis of no ARCH effects, all slope coefficients are zero and the R2

of the regression is zero. (This can be tested by noting that, under the null hypothesis,T R2

∼ χ2q .) This test can also be applied to the fitted residuals from the a regression

like (7.4). However, in this case, it is not obvious that ARCH effects makes the standardexpression for the LS covariance matrix invalid—this is tested by White’s test as in (7.5).

7.2 ARCH Models

This section discusses the Autoregressive Conditional Heteroskedasticity (ARCH) model.It is a model of how volatility depends on recent volatility.

92

There are two basic reasons for being interested in an ARCH model. First, if resid-uals of the regression model (7.4) has ARCH features, then an ARCH model (that is, aspecification of exactly how the ARCH features are generated) can help us estimate the re-gression model by maximum likelihood. Second, we may be interested in understandingthe ARCH features more carefully, for instance, as an input in a portfolio choice processor option pricing.

7.2.1 Properties of ARCH(1)

In the ARCH(1) model the residual in the regression equation (7.4), or some other zero-mean variable, can be written

ut = vtσt with vt ∼ iid with Et−1 vt = 0 and Et−1 v2t = 1, (7.7)

and the conditional variance is generated by

σ 2t = α0 + α1u2

t−1, with α0 > 0 and 0 ≤ α1 < 1. (7.8)

The non-negativity restrictions on α0 and α1 are needed in order to guarantee σ 2t > 0.

The upper bound α1 < 1 is needed in order to make the conditional variance stationary(more later).

It is clear that the unconditional distribution of ut is non-normal. Even if we assumethat vt is iid N (0, 1), we get that the conditional distribution of ut is N (0, σ 2

t ), so the un-conditional distribution of ut is a mixture of normal distributions with different variances.It can be shown that the result is a distribution which has fatter tails than a normal distri-bution with the same variance (excess kurtosis)—which is a common feature of financialdata.

It is straightforward to show that the ARCH(1) model implies that we in period t canforecast the future conditional variance (σ 2

t+s) as

Et σ 2t+s =

α0

1 − α1+ αs−1

1

(σ 2

t+1 −α0

1 − α1

)for s = 1, 2, . . . (7.9)

Notice that σ 2t+1 is known in t . The conditional volatility behaves like an AR(1), and

0 ≤ α1 < 1 is necessary to keep it positive and and stationary.

93

Page 48: S oderlind¨ - Dokuz Eylül UniversityT =25 T =100 Sample average-5 0 5 0 0.2 0.4 b. Distribution of Ö T ´ sample avg. Ö T ´ sample average Sample average of z t-1 where z t has

7.2.2 Estimation of the ARCH(1) Model

The most common way to estimate the model is to assume that vt ∼iid N (0, 1) and toset up the likelihood function. The log likelihood is easily found, since the model isconditionally Gaussian. It is

ln L = −T2

ln (2π) −12

T∑t=1

ln σ 2t −

12

T∑t=1

u2t

σ 2t

if vt is iid N (0, 1). (7.10)

The estimates are found by maximizing the likelihood function (by choosing the parame-ters). This is done by a numerical optimization rootine, which should preferably imposethe constraints in (7.8).

If ut is just a zero-mean variable (no regression equation), then this just amounts tochoosing the parameters (α0 and α1) in (7.8). Instead, if ut is a residual from a regressionequation (7.4), then we instead need to choose both the regression coefficients (b0, ..., bk)in (7.4) and the parameters (α0 and α1) in (7.8). In either case, we need a starting value ofσ 2

1 = α0 +α1u20. This most common approach is to use the first observation as a “starting

point,” that is, we actually have a sample from (t =) 0 to T , but observation 0 is only usedto construct a starting value of σ 2

1 , and only observations 1 to T are used in the calculationof the likelihood function value.

Remark 38 (Regression with ARCH(1) residuals) To estimate the full model (7.4) and

(7.8) by ML, we can do as follows.

First, guess values of the parameters b0, ..., bk , and α0, and α1. The guess of b0, ..., bk can

be taken from an LS estimation of (7.4), and the guess of α0 and α1 from an LS estimation

of ε2t = α0 + α1ε

2t−1 + εt where εt are the fitted residuals from the LS estimation of (7.4).

Second, loop over the sample (first t = 1, then t = 2, etc.) and calculate ut = εt from

(7.4) and σ 2t from (7.8). Plug in these numbers in (7.10) to find the likelihood value.

Third, make better guesses of the parameters and do the second step again. Repeat until

the likelihood value converges (at a maximum).

Remark 39 (Imposing parameter constraints on ARCH(1).) To impose the restrictions

in (7.8) when the previous remark is implemented, iterate over values of (b, α0, α1) and

let α0 = α20 and α1 = exp(α1)/[1 + exp(α11)].

If is sometimes found that the standardized values of ut , ut/σt , still have too fat tailscompared with N (0, 1). This would violate the assumption about a normal distribution in

94

(7.10). Estimation using other likelihood functions, for instance, for a t-distribution canthen be used. Or the estimation can be interpreted as a quasi-ML (is typically consistent,but requires different calculation of the covariance matrix of the parameters).

It is straightforward to add more lags to (7.8). For instance, an ARCH(p) would be

σ 2t = α0 + α1u2

t−1 + . . . + αpu2t−p. (7.11)

The form of the likelihood function is the same except that we now need p starting valuesand that the upper boundary constraint should now be 6

pj=1α j ≤ 1.

7.3 GARCH Models

Instead of specifying an ARCH model with many lags, it is typically more convenient tospecify a low-order GARCH (Generalized ARCH) model. The GARCH(1,1) is a simpleand surprisingly general model, where the volatility follows

σ 2t = α0 + α1u2

t−1 + β1σ2t−1,with α0 > 0; α1, β1 ≥ 0; and α1 + β1 < 1. (7.12)

The non-negativity restrictions are needed in order to guarantee that σ 2t > 0 in all

periods. The upper bound α1 + β1 < 1 is needed in order to make the σ 2t stationary and

therefore the unconditional variance finite.

Remark 40 The GARCH(1,1) has many similarities with the exponential moving average

estimator of volatility (7.3). The main differences are that the exponential moving average

does not have a constant and volatility is non-stationary (the coefficients sum to unity).

The GARCH(1,1) corresponds to an ARCH(∞) with geometrically declining weights,which suggests that a GARCH(1,1) might be a reasonable approximation of a high-orderARCH. Similarly, the GARCH(1,1) model implies that we in period t can forecast thefuture conditional variance (σ 2

t+s) as

Et σ 2t+s =

α0

1 − α1 − β1+ (α1 + β1)

s−1(

σ 2t+1 −

α0

1 − α1 − β1

), (7.13)

which is of the same form as for the ARCH model (7.9), but where the sum of α1 and β1

is like an AR(1) parameter.

95

Page 49: S oderlind¨ - Dokuz Eylül UniversityT =25 T =100 Sample average-5 0 5 0 0.2 0.4 b. Distribution of Ö T ´ sample avg. Ö T ´ sample average Sample average of z t-1 where z t has

To estimate the model consisting of (7.4) and (7.12) we can still use the likelihoodfunction (7.10) and do a MLE (but we now have to choose a value of β1 as well). Wetypically create the starting value of u2

0 as in the ARCH(1) model, but this time we alsoneed a starting value of σ 2

0 . It is often recommended to use σ 20 = Var(ut).

Remark 41 (Imposing parameter constraints on GARCH(1,1).) To impose the restric-

tions in (7.12), iterate over values of (b, α0, α1, β1) and let α0 = α20 , α1 = exp(α1)/[1 +

exp(α1) + exp(β1)], and β1 = exp(β1)/[1 + exp(α1) + exp(β1)].

7.4 Non-Linear Extensions

A very large number of extensions have been suggested. I summarize a few of them,which can be estimated by using the likelihood function (7.10) to do a MLE.

An asymmetric GARCH (Glosten, Jagannathan, and Runkle (1993)) can be con-structed as

σ 2t = α0 +α1u2

t−1 +β1σ2t−1 +γ δ(ut−1 > 0)u2

t−1, where δ(q) =

{1 if q is true0 else.

(7.14)

This means that the effect of the shock u2t−1 is α1 if the shock was negative and α1 + γ if

if the shock was positive.The EGARCH (exponential GARCH, Nelson (1991)) sets

ln σ 2t = α0 + α1

|ut−1|

σt−1+ β1 ln σ 2

t−1 + γut−1

σt−1(7.15)

Apart from being written in terms of the log (which is a smart trick to make σ 2t > 0 hold

without any restrictions on the parameters), this is an asymmetric model. he |ut−1| termis symmtric: both negative and positive values of ut−1 affects the volatility in the sameway. The linear term in ut−1 modifies this to make the effect asymmetric. In particular,if γ < 0, then the volatility increases more in response to a negative ut−1 (“bad news”)than to a positive ut−1.

Hentschel (1995) estimates several models of this type, as well as a very generalformulation on daily stock index data for 1926 to 1990 (some 17,000 observations). Moststandard models are rejected in favour of a model where σt depends on σt−1 and |ut−1 −

b|3/2.

96

7.5 (G)ARCH-M

It can make sense to let the conditional volatility enter the mean equation—for instance,as a proxy for risk which may influence the expected return.

Example 42 A mean variance investor solves

maxα E Rp − σ 2pk/2, subject to

Rp = αRm + (1 − α)R f ,

where Rm is the return on the risky asset (the market index) and R f is the riskfree return.

The solution is

α =1k

E(Rm − R f )

σ 2m

.

In equilibrium, this weight is one (since the net supply of bonds is zero), so we get

E(Rm − R f ) = kσ 2m,

which says that the expected excess return is increasing in both the market volatility and

risk aversion (k).

We modify the “mean equation” (7.4) to include the conditional variance σ 2t (taken

from any of the models for heteroskedasticity) as a regressor

yt = x ′

tb + ϕσ 2t + ut . (7.16)

Note that σ 2t is predetermined, since it is a function of information in t − 1. This model

can be estimated by using the likelihood function (7.10) to do MLE.

Remark 43 (Coding of (G)ARCH-M) We can use the same approach as in Remark 38,

except that we use (7.16) instead of (7.4) to calculate the residuals (and that we obviously

also need a guess of ϕ).

97

Page 50: S oderlind¨ - Dokuz Eylül UniversityT =25 T =100 Sample average-5 0 5 0 0.2 0.4 b. Distribution of Ö T ´ sample avg. Ö T ´ sample average Sample average of z t-1 where z t has

7.6 Multivariate (G)ARCH

7.6.1 Different Multivariate Models

This section gives a brief summary of some multivariate models of heteroskedasticity.Suppose ut is an n × 1 vector. For instance, ut could be the residuals from n differentregressions or just n different demeaned return series.

We define the conditional (on the information set in t − 1) covariance matrix of ut as

6t = Et−1 utu′

t . (7.17)

Remark 44 (The vech operator) vech(A) of a matrix A gives a vector with the elements

on and below the principal diagonal A stacked on top of each other (column wise). For

instance, vech

[a11 a12

a21 a22

]=

a11

a21

a22

.

It may seem as if a multivariate (matrix) version of the GARCH(1,1) model wouldbe simple, but it is not. The reason is that it would contain far too many parameters.Although we only need to care about the unique elements of 6t , that is, vech(6t), thisstill gives very many parameters

vech(6t) = C + Avech(ut−1u′

t−1) + Bvech(6t−1). (7.18)

For instance, with n = 2 we have σ11,t

σ21,t

σ22,t

= C + A

u21,t−1

u1,t−1u2,t−1

u22,t−1

+ B

σ11,t−1

σ21,t−1

σ21,t−1

, (7.19)

where C is 3 × 1, A is 3 × 3, and B is 3 × 3. This gives 21 parameters, which is alreadyhard to manage. We have to limit the number of parameters. We also have to find away to impose restrictions so 6t is positive definite (compare the restrictions of positivecoefficients in (7.12)).

98

The Diagonal Model

One model that achieves both these aims is the diagonal model, which assumes that A

and B are diagonal. This means that every element of 6t follows a univariate process.With n = 2 we have σ11,t

σ21,t

σ22,t

=

c1

c2

c3

+

a1 0 00 a2 00 0 a3

u2

1,t−1

u1,t−1u2,t−1

u22,t−1

+

b1 0 00 b2 00 0 b3

σ11,t−1

σ21,t−1

σ21,t−1

,

(7.20)which gives 3 + 3 + 3 = 9 parameters (in C , A, and B, respectively). To make sure that6t is positive definite we have to impose further restrictions. The obvious drawback ofthis model is that there is no spillover of volatility from one variable to another.

The Constant Correlation Model

The constant correlation model assumes that every variance follows a univariate GARCHprocess and that the conditional correlations are constant. With n = 2 the covariancematrix is[

σ11,t σ12,t

σ12,t σ22,t

]=

[√

σ11,t 00 √

σ22,t

][1 ρ12

ρ12 1

][√

σ11,t 00 √

σ22,t

](7.21)

and each of σ11t and σ22t follows a GARCH process. Assuming a GARCH(1,1) as in(7.12) gives 7 parameters (2 × 3 GARCH parameters and one correlation), which is con-venient. The price is, of course, the assumption of no movements in the correlations. Toget a positive definite 6t , each individual GARCH model must generate a positive vari-ance (same restrictions as before), and that all the estimated (constant) correlations arebetween −1 and 1.

Bibliography

Bodie, Z., A. Kane, and A. J. Marcus, 2002, Investments, McGraw-Hill/Irwin, Boston,5th edn.

99

Page 51: S oderlind¨ - Dokuz Eylül UniversityT =25 T =100 Sample average-5 0 5 0 0.2 0.4 b. Distribution of Ö T ´ sample avg. Ö T ´ sample average Sample average of z t-1 where z t has

1990 1995 2000 20050

20

40

60Std of FTSE 100, GARCH(1,1)

Std

1990 1995 2000 20050

20

40

60Std of DAX 30, GARCH(1,1)

Std

1990 1995 2000 20050

0.5

1Correlation

Corr

DCC

CC

Sample (daily) 1990−2004

GARCH(1,1) of demeaned log index changes

The Std are annualized by multiplying by 250

Figure 7.3: Results for multivariate GARCH models

Campbell, J. Y., A. W. Lo, and A. C. MacKinlay, 1997, The Econometrics of Financial

Markets, Princeton University Press, Princeton, New Jersey.

Enders, W., 2004, Applied Econometric Time Series, John Wiley and Sons, New York,2nd edn.

Franses, P. H., and D. van Dijk, 2000, Non-linear Time Series Models in Empirical Fi-

nance, Cambridge University Press.

Glosten, L. R., R. Jagannathan, and D. Runkle, 1993, “On the Relation Between theExpected Value and the Volatility of the Nominal Excess Return on Stocks,” Journal of

Finance, 48, 1779–1801.

Greene, W. H., 2003, Econometric Analysis, Prentice-Hall, Upper Saddle River, NewJersey, 5th edn.

100

Hamilton, J. D., 1994, Time Series Analysis, Princeton University Press, Princeton.

Harvey, A. C., 1989, Forecasting, Structural Time Series Models and the Kalman Filter,Cambridge University Press.

Hentschel, L., 1995, “All in the Family: Nesting Symmetric and Asymmetric GARCHModels,” Journal of Financial Economics, 39, 71–104.

Hull, J. C., 2000, Options, Futures, and Other Derivatives, Prentice-Hall, Upper SaddleRiver, NJ.

Nelson, D. B., 1991, “Conditional Heteroskedasticity in Asset Returns,” Econometrica,59, 347–370.

Pindyck, R. S., and D. L. Rubinfeld, 1998, Econometric Models and Economic Forecasts,Irwin McGraw-Hill, Boston, Massachusetts, 4ed edn.

Verbeek, M., 2004, A Guide to Modern Econometrics, Wiley, Chichester, 2nd edn.

101

Page 52: S oderlind¨ - Dokuz Eylül UniversityT =25 T =100 Sample average-5 0 5 0 0.2 0.4 b. Distribution of Ö T ´ sample avg. Ö T ´ sample average Sample average of z t-1 where z t has

8 Option Pricing and Estimation of Continuous Time Processes

Reference (medium): Hull (2000) 15Reference (advanced): Campbell, Lo, and MacKinlay (1997) 9; Harvey (1989) 9; Gourier-oux and Jasiak (2001) 12–13

More advanced material is denoted by a star (∗). It is not required reading.

8.1 The Black-Scholes Model

8.1.1 The Black-Scholes Option Price Model

A European call option contract traded (contracted and paid) in t may stipulate that thebuyer of the contract has the right (not the obligation) to buy one unit of the underlyingasset (from the issuer of the option) in t + m at the strike price X . The option payoff (int + m) is clearly max (0, St+m − X) ,where St+m is the asset price, and X and the strikeprice. See Figure 8.1 for the timing convention.

The Black-Scholes formula for a European call option is

Ct = St8(d1) − Xe−rm8(d1 − σ√

m), where d1 =ln(St/X) +

(r + σ 2/2

)m

σ√

m. (8.1)

where 8() is the cumulative distribution function for a variable distributed as N(0,1).For instance, 8(2) is the probability that y ≤ 2. In this equation, S0 is the price of the

-t t + m

Europeancall option:

buy option,agree on X , pay C

if S > X : payX and get asset

Figure 8.1: Timing convention of option contract

102

0 0.1 0.2 0.3 0.4 0.50

1

2

3

4

5

6

7BS call price as a function of volatility

Standard deviation, σ

The base case has

S, K, m, y, and σ:

42 42 0.5 0.05 0.2

Figure 8.2: Call option price, Black-Scholes model

underlying asset in period t , and r is the continuously compounded interest rate.The B-S formula can be derived from several stochastic processes of the underlying

asset price (discussed below), but they all imply that the distribution of log asset price int + m (conditional on the information in t) is normal with some mean α (not importantfor the option price) and the variance mσ 2

ln St+m ∼ N (α, mσ 2). (8.2)

8.1.2 Implied Volatility

The pricing formula (8.1) contains only one unknown parameter: the standard deviationσ in the distribution of ln St+m , see (8.2). With data on the option price, spot price,the interest rate, and the strike price, we can solve for standard deviation: the implied

volatility. This should not be thought of as an estimation of an unknown parameter—rather as just a transformation of the option price. Notice that we can solve (by trial-and-error or some numerical routine) for one implied volatility for each available strike price.See Figure 8.2 for an illustration and Figure 8.3 for a time series of implied volatilities.

If the Black-Scholes formula is correct, that is, if the assumption in (8.2) is correct,then these volatilities should be the same across strike prices. It is often found that the

103

Page 53: S oderlind¨ - Dokuz Eylül UniversityT =25 T =100 Sample average-5 0 5 0 0.2 0.4 b. Distribution of Ö T ´ sample avg. Ö T ´ sample average Sample average of z t-1 where z t has

1990 1995 2000 20050

10

20

30

40

50CBOE volatility index (VIX)

Figure 8.3: CBOE VIX, summary measure of implied volatities (30 days) on US stockmarkets

implied volatility is a U-shaped function of the strike price. One possible explanation isthat the (perceived) distribution of the future asset price has relatively more probabilitymass in the tails (“fat tails”) than a normal distribution has.

8.1.3 Brownian Motion without Mean Reversion: The Random Walk

The basic assumption behind the B-S formula (8.1) is that the log price of the underlyingasset, ln St , follows a geometric Brownian motion—with or without mean reversion.

This section discusses the standard geometric Brownian motion without mean rever-sion

d ln St = µdt + σdWt , (8.3)

where d ln St is the change in the log price over a very short time interval. On the right hand side, µ is the drift (typically expressed on an annual basis), dt just indicates the change in time, σ is the standard deviation (per year), and dWt is a random component (Wiener process) that has an N(0,1) distribution if we cumulate dWt over a year (∫₀¹ dWt ∼ N(0,1)). By comparing (8.1) and (8.3) we notice that only the volatility (σ), not the drift (µ), shows up in the option pricing formula. In essence, the drift is already accounted for by the current spot price in the option pricing formula (as the spot price certainly depends on the expected drift of the asset price).

Remark 45 (Alternative stock price process.) If we instead of (8.3) assume the process dSt = µ̃St dt + σSt dWt, then we get the same option price. The reason is that Itô's lemma tells us that (8.3) implies this second process with µ̃ = µ + σ²/2. The difference is only in terms of the drift, which does not show up (directly, at least) in the B-S formula.

Remark 46 ((8.3) as a limit of a discrete time process.) (8.3) can be thought of as the limit of the discrete time process ln St − ln St−h = µh + σεt (where εt is white noise) as the time interval h becomes very small.

We can only observe the value of the asset price at a limited number of times, so we need to understand what (8.3) implies for discrete time intervals. It is straightforward to show that (8.3) implies that we have normally distributed changes and that the changes are uncorrelated (for non-overlapping data)

ln(St/St−h) ∼ N(µh, σ²h), and Corr[ln(St/St−h), ln(St+h/St)] = 0. (8.4)

Notice that both the drift and the variance scale linearly with the horizon h. The reason is, of course, that the growth rates (even for the infinitesimal horizon) are iid.
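A small simulation sketch (with parameter values of my own choosing) illustrates the scaling in (8.4):

```python
import numpy as np

rng = np.random.default_rng(0)
mu, sigma, h = 0.08, 0.16, 1 / 250    # annual drift and std, daily steps

# iid daily growth rates, in rows of five (one "week" per row)
y = rng.normal(mu * h, sigma * np.sqrt(h), size=(100_000, 5))
daily, weekly = y[:, 0], y.sum(axis=1)

print(daily.mean(), daily.var())      # close to mu*h and sigma^2*h
print(weekly.mean(), weekly.var())    # close to 5*mu*h and 5*sigma^2*h
```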

Remark 47 (iid random variable in discrete time.) Suppose xt has the constant mean µ and a variance σ². Then E(xt + xt−1) = 2µ and Var(xt + xt−1) = 2σ² + 2 Cov(xt, xt−1). If xt is iid, then the covariance is zero, so Var(xt + xt−1) = 2σ². Both the mean and the variance scale linearly with the horizon.

8.1.4 Brownian Motion with Mean Reversion∗

The mean reverting Ornstein-Uhlenbeck process is

d ln St = λ(µ − ln St)dt + σdWt , with λ > 0. (8.5)

This process makes ln St revert back to the mean µ, and the mean reversion is faster if λ is large. It is used in, for instance, the Vasicek model of interest rates.

To estimate the parameters in (8.5) on real-life data, we (once again) have to understand what the model implies for discretely sampled data. It can be shown that it implies a discrete time AR(1)

ln St = α + ρ ln St−h + εt, with (8.6)

ρ = e−λh, α = µh(1 − ρ), and εt ∼ N[0, σ²(1 − ρ²)/(2λh)]. (8.7)

We know that the maximum likelihood estimator (MLE) of the discrete AR(1) is least squares combined with the traditional estimator of the residual variance. MLE has the further advantage of being invariant to parameter transformations, which here means that the MLE of λ, µ and σ² can be backed out from the LS estimates of ρ, α and Var(εt) by using (8.7).

Example 48 Suppose λ, µ and σ² are 2, 0, and 0.25 respectively, and the periods are years (so one unit of time corresponds to a year). Equations (8.6)–(8.7) then give the following AR(1) for weekly (h = 1/52) data

ln St = 0.96 ln St−h + εt with Var(εt) ≈ 0.24.
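As a sketch of this back-transformation, the following lines apply (8.7) to the (rounded) LS estimates of Example 48; the variable names are mine:

```python
import numpy as np

h = 1 / 52                               # weekly data
rho, alpha, var_eps = 0.96, 0.0, 0.24    # LS estimates of (8.6)

lam = -np.log(rho) / h                   # from rho = exp(-lam*h)
mu = alpha / (h * (1 - rho)) if alpha != 0 else 0.0
sigma2 = var_eps * 2 * lam * h / (1 - rho**2)

# close to (2, 0, 0.25); lam comes out near 2.1 because rho is rounded
print(lam, mu, sigma2)
```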

8.2 Estimation of the Volatility of a Random Walk Process

This section discusses different ways of estimating the volatility in the random walk model (8.3). We will assume that we have data for observations in τ = 1, 2, ..., n and that the time between τ and τ + 1 is h. The sample therefore stretches over T = nh periods (years). Compare this with the notation in (8.1)–(8.4), where the time from t to t + 1 typically is thought of as a year, so the interest rate is a per-year rate.

For instance, for daily data h = 1/365 (or possibly something like 1/250 if only trading days are counted). Instead, with weekly data, h = 1/52. See Figure 8.4 for an illustration.

8.2.1 Standard Approach

We first estimate the variance for the sampling frequency we have, and then convert to the annual frequency.

According to (8.4) the growth rates, ln(St/St−h), are iid over any sampling frequency. To simplify the notation, let yτ = ln(Sτ/Sτ−1) be the observed growth rates.


Figure 8.4: Two different samplings with same time span T (observations at 0, h, 2h, ..., T = nh)

The classical estimator of the variance of an iid data series is

s² = ∑nτ=1 (yτ − ȳ)²/n, where (8.8)

ȳ = ∑nτ=1 yτ/n. (8.9)

(This is also the maximum likelihood estimator.) To annualize these numbers, notice that

σ²h = s², and µh = ȳ, or (8.10)

σ² = s²/h, and µ = ȳ/h. (8.11)

Example 49 If (ȳ, s²) = (0.001, 0.03) on daily data (with h = 1/250), then the annualized values are (µ, σ²) = (0.25, 7.5).

Notice that it can be quite important to subtract the mean drift, ȳ. Recall that for any random variable, we have

σ² = E(x²) − µ², (8.12)

so a non-zero mean drives a wedge between the variance (which we want) and the second moment (which we estimate if we assume ȳ = 0).

Example 50 For the US stock market index excess return since WWII we have approximately a variance of 0.16² and a mean of 0.08. In this case, (8.12) becomes

0.16² = E(x²) − 0.08², so E(x²) ≈ 0.18².

Assuming that the drift is zero gives an estimate of the variance equal to 0.18², which is 25% too high.



Remark 51 (Variance vs. second moment, the effect of the maturity) Suppose we are interested in the variance over an m-period horizon, for instance, because we want to price an option that matures in t + m. How important is it then to use the variance (mσ²) rather than the second moment? The relative error is

(second moment − variance)/variance = m²µ²/(mσ²) = mµ²/σ²,

where we have used the fact that the second moment equals the variance plus the squared mean (cf. (8.12)). Clearly, this relative exaggeration is zero if the mean is zero. The relative exaggeration is small if the maturity is small.

If we have high frequency data on the asset price or the return, then we can choose which sampling frequency to use in (8.8)–(8.9). Recall that with a sample of n observations where the length of time between the observations is h, the sample stretches over T = nh periods (years). It can be shown that the asymptotic variances (that is, the variances in a very large sample) of the estimators of µ and σ² in (8.8)–(8.11) are

Var(µ) = σ²/T and Var(σ²) = 2σ⁴/n. (8.13)

To get a precise estimator of the mean drift, µ, a sample that stretches over a long period is crucial: it does not help to just sample more frequently. However, the sampling frequency is crucial for getting a precise estimator of σ², while a sample that stretches over a long period is unimportant. For estimating the volatility for the B-S model, we should therefore use high frequency data.

8.2.2 Exponential Moving Average

The traditional estimator is based on the assumption that volatility is constant, which is consistent with the assumptions of the B-S model. In reality, volatility is time varying.

A practical ad hoc approach to estimate time varying volatility is to modify (8.8)–(8.9) so that recent observations carry larger weights. The exponential moving average (EMA) model lets the weight for lag s be (1 − λ)λ^s where 0 < λ < 1. The estimator is

s²τ = (1 − λ)[(yτ−1 − ȳτ)² + λ(yτ−2 − ȳτ)² + λ²(yτ−3 − ȳτ)² + ...] (8.14)

ȳτ = [yτ−1 + yτ−2 + yτ−3 + ...]/(τ − 1). (8.15)


Figure 8.5: Different estimates of US equity market volatility (daily S&P 500 data 1979–2004; panels show the GARCH std, the EMA estimate with λ = 0.99, and the EMA estimate with λ = 0.9, all annualized; based on an AR(1) of excess returns with GARCH(1,1) errors, AR(1) coef 0.04, ARCH and GARCH coefs 0.06 and 0.93)

Notice that the mean is estimated as a traditional sample mean. Clearly, a higher λ means that old data plays a larger role, and in the limit as λ goes towards one, we have the traditional estimator. See Figure 8.6 for a comparison using daily US equity returns.

If we assume that ȳτ = 0 in all periods, then the calculation in (8.14) can be sped up by using the following equivalent expression

s²τ = λs²τ−1 + (1 − λ)(yτ−1 − ȳτ)². (8.16)

This method is commonly used by practitioners. For instance, the RiskMetrics approach is based on λ = 0.94 for daily data. Alternatively, λ can be chosen to minimize some criterion function like ∑nτ=1 [(yτ − ȳ)² − s²τ]².

It should be noted, however, that the B-S formula does not allow for random volatility.



Figure 8.6: Different estimates of US equity market volatility (annualized EMA std with λ = 0.9 vs. the CBOE volatility index (VIX), 1990–2005)

8.2.3 Autoregressive Conditional Heteroskedasticity

The model with Autoregressive Conditional Heteroskedasticity (ARCH) is a useful tool for estimating the properties of volatility clustering. The first-order ARCH model expresses volatility as a function of the latest squared shock

s²τ = α0 + α1u²τ−1, (8.17)

where uτ is a zero-mean variable. The model requires α0 > 0 and 0 ≤ α1 < 1 to guarantee that the volatility stays positive and finite. The variance reverts back to an average variance (α0/(1 − α1)). The rate of mean reversion is α1, that is, the variance behaves much like an AR(1) model with an autocorrelation parameter of α1. The model parameters are typically estimated by maximum likelihood. Higher-order ARCH models include further lags of the squared shocks (for instance, u²τ−2).

Instead of using a high-order ARCH model, it is often convenient to use a first-order generalized ARCH model, the GARCH(1,1) model. It adds a term that captures direct autoregression of the volatility

s²τ = α0 + α1u²τ−1 + β1s²τ−1. (8.18)

We require that α0 > 0, α1 ≥ 0, β1 ≥ 0, and α1 + β1 < 1 to guarantee that the volatility stays positive and finite. This is very similar to the EMA in (8.16), except that the variance reverts back to the mean (α0/(1 − α1 − β1)). The rate of mean reversion is α1 + β1, that is, the variance behaves much like an AR(1) model with an autocorrelation parameter of α1 + β1.
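A sketch that simulates (8.18) with coefficients roughly like the daily estimates reported in Figure 8.5 (the value of α0 is my own choice) and checks the unconditional variance α0/(1 − α1 − β1):

```python
import numpy as np

rng = np.random.default_rng(0)
a0, a1, b1 = 1e-6, 0.06, 0.93            # alpha_0, alpha_1, beta_1
n = 100_000

u = np.zeros(n)
s2 = np.full(n, a0 / (1 - a1 - b1))      # start at the unconditional variance
for tau in range(1, n):
    s2[tau] = a0 + a1 * u[tau - 1] ** 2 + b1 * s2[tau - 1]   # (8.18)
    u[tau] = np.sqrt(s2[tau]) * rng.standard_normal()

print(u.var(), a0 / (1 - a1 - b1))       # both close to 1e-4
```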

8.2.4 Time-Variation in Volatility and the B-S Formula

The ARCH and GARCH models imply that volatility is random, so they are (strictly speaking) not consistent with the B-S model. However, they are often combined with the B-S model to provide an approximate option price. See Figure 8.7 for a comparison of the actual distribution of the log asset price at different horizons (10 and 100 days) when the daily returns are generated by a GARCH model, and a normal distribution with the same mean and variance. To be specific, the figure with a 10-day horizon shows the distribution of ln St+m, as seen from period t, where ln St+m is calculated as

ln St+m = ln St + uτ + uτ+1 + ... + uτ+10, (8.19)

where each of the growth rates (uτ) is drawn from an N(0, s²τ) distribution and the variance follows the GARCH(1,1) process in (8.18).

It is clear that the normal distribution is a good approximation unless the horizon is short and the ARCH component (α1u²τ−1) dominates the GARCH component (β1s²τ−1). Intuitively, the summing of the uncorrelated growth rates in (8.19) gives the same effect as taking an average: if we average sufficiently many components, we get a distribution that looks more and more similar to a normal distribution (this is the central limit theorem).

Remark 52 A time-varying, but non-random, volatility could be consistent with (8.2): if ln St+m is the sum (integral) of normally distributed changes with known (but time-varying) variances, then this sum has a normal distribution (recall: if the random variables x and y are normally distributed, so is x + y). A random variance does not fit this case, since a variable with a random variance is not normally distributed.

Bibliography

Campbell, J. Y., A. W. Lo, and A. C. MacKinlay, 1997, The Econometrics of Financial Markets, Princeton University Press, Princeton, New Jersey.

Gourieroux, C., and J. Jasiak, 2001, Financial Econometrics: Problems, Models, and Methods, Princeton University Press.



Figure 8.7: Comparison of normal and simulated distributions of m-period returns (panels show the normal and simulated densities for (α1, β1) = (0.8, 0.09) and (0.09, 0.8), each at horizons T = 10 and 100 days)

Harvey, A. C., 1989, Forecasting, Structural Time Series Models and the Kalman Filter, Cambridge University Press.

Hull, J. C., 2000, Options, Futures, and Other Derivatives, Prentice-Hall, Upper Saddle River, NJ.


9 Kernel Density Estimation and Regression

9.1 Non-Parametric Regression

Reference: Campbell, Lo, and MacKinlay (1997) 12.3; Hardle (1990); Pagan and Ullah (1999); Mittelhammer, Judge, and Miller (2000) 21

9.1.1 Simple Kernel Regression

Non-parametric regressions are used when we are unwilling to impose a parametric form on the regression equation, and we have a lot of data.

Let the scalars yt and xt be related as

yt = m(xt) + εt , εt is iid and E εt = Cov [m(xt), εt ] = 0, (9.1)

where m() is a, possibly non-linear, function.

Suppose the sample had 3 observations (say, t = 3, 27, and 99) with exactly the same value of xt, say 1.9. A natural way of estimating m(x) at x = 1.9 would then be to average over these 3 observations, since we can expect the average of the error terms to be close to zero (iid and zero mean).

Unfortunately, we seldom have repeated observations of this type. Instead, we may try to approximate the value of m(x) (x is a single value, 1.9, say) by averaging over observations where xt is close to x. The general form of this type of estimator is

m̂(x) = ∑Tt=1 w(xt − x)yt / ∑Tt=1 w(xt − x), (9.2)

where w(xt − x)/∑Tt=1 w(xt − x) is the weight given to observation t. Note that the denominator makes the weights sum to unity. The basic assumption behind (9.2) is that the m(x) function is smooth, so local (around x) averaging makes sense.

As an example of a w(.) function, it could give equal weight to the k values of xt which are closest to x and zero weight to all other observations (this is the "k-nearest neighbor" estimator, see Hardle (1990) 3.2).



As another example, the weight function could be defined so that it trades off the expected squared errors, E[yt − m(x)]², and the expected squared acceleration, E[d²m(x)/dx²]². This defines a cubic spline (and is often used in macroeconomics, where xt = t; it is then called the Hodrick-Prescott filter).

A kernel regression uses a pdf as the weight function w(.). The pdf of N(0, h²) is commonly used, where the choice of h allows us to easily vary the relative weights of different observations. This weighting function is positive, so all observations get a positive weight, but the weights are highest for observations close to x and then taper off in a bell-shaped way. See Figure 9.1 for an illustration.

A low value of h means that the weights taper off fast: the weight function is then a normal pdf with a low variance. With this particular kernel, we get the following estimator of m(x) at a point x

m̂(x) = ∑Tt=1 Kh(xt − x)yt / ∑Tt=1 Kh(xt − x), where Kh(xt − x) = exp[−(xt − x)²/(2h²)]/(h√2π). (9.3)

Note that Kh(xt − x) corresponds to the weighting function w(xt − x) in (9.2).

In practice, we have to estimate m(x) at a finite number of points x. This could, for instance, be 100 evenly spread points in the interval between the minimum and maximum values observed in the sample. See Figure 9.2 for an illustration. Special corrections might be needed if there are a lot of observations stacked close to the boundary of the support of x (see Hardle (1990) 4.4).

Example 53 Suppose the sample has three data points [x1, x2, x3] = [1.5, 2, 2.5] and [y1, y2, y3] = [5, 4, 3.5]. Consider the estimation of m(x) at x = 1.9. With h = 1, the numerator in (9.3) is

∑Tt=1 Kh(xt − x)yt = [e−(1.5−1.9)²/2 × 5 + e−(2−1.9)²/2 × 4 + e−(2.5−1.9)²/2 × 3.5]/√2π
≈ (0.92 × 5 + 1.0 × 4 + 0.84 × 3.5)/√2π
= 11.52/√2π.

The denominator is

∑Tt=1 Kh(xt − x) = [e−(1.5−1.9)²/2 + e−(2−1.9)²/2 + e−(2.5−1.9)²/2]/√2π ≈ 2.75/√2π.


Figure 9.1: Example of kernel regression with three data points (data on x: 1.5, 2.0, 2.5; data on y: 5.0, 4.0, 3.5; N(0, 0.25²) kernel; panels show the data and the weights used for m̂(1.7), m̂(1.9), and m̂(2.1); ⊗ denotes the fitted m̂(x))

The estimate at x = 1.9 is therefore

m̂(1.9) ≈ 11.52/2.75 ≈ 4.19.
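A minimal sketch of the estimator (9.3) that reproduces Example 53; the function name is mine, and the normalizing constant h√2π is dropped because it cancels between numerator and denominator:

```python
import numpy as np

def kernel_reg(x0, x, y, h):
    """Kernel regression estimate (9.3) of m(x) at the point x0."""
    w = np.exp(-(x - x0) ** 2 / (2 * h**2))   # h*sqrt(2*pi) cancels in the ratio
    return w @ y / w.sum()

x = np.array([1.5, 2.0, 2.5])
y = np.array([5.0, 4.0, 3.5])
print(kernel_reg(1.9, x, y, h=1.0))           # about 4.19, as in Example 53
```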

Kernel regressions are typically consistent, provided longer samples are accompanied by smaller values of h, so the weighting function becomes more and more local as the sample size increases. It can be shown (see Hardle (1990) 3.1) that the mean squared error of the estimator at the value x is approximately (for general kernel functions) the sum of a squared bias term, which grows with h, and a variance term, which shrinks as the effective number of local observations grows, so the choice of h trades off bias against variance.



Figure 9.2: Example of kernel regression with three data points (fitted m̂(x) for h = 0.25 and h = 0.2, plotted against x together with the data)

9.1.2 Multivariate Kernel Regression

Suppose that yt depends on two variables (xt and zt )

yt = m(xt , zt) + εt , εt is iid and E εt = 0. (9.4)

This makes the estimation problem much harder, since there are typically few observations in every bivariate bin (rectangle) of x and z. For instance, with as few as 20 intervals of each of x and z, we get 400 bins, so we need a large sample to have a reasonable number of observations in every bin.

In any case, the most common way to implement the kernel regressor is to let

m̂(x, z) = ∑Tt=1 Khx(xt − x)Khz(zt − z)yt / ∑Tt=1 Khx(xt − x)Khz(zt − z), (9.5)

where Khx(xt − x) and Khz(zt − z) are two kernels like in (9.3) and where we may allow hx and hz to be different (and to depend on the variances of xt and zt). In this case, the weight of the observation (xt, zt) is proportional to Khx(xt − x)Khz(zt − z), which is high only if both xt and zt are close to x and z respectively.

See Figure 9.4 for an example.
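A sketch of the product-kernel estimator (9.5), reusing the Gaussian kernel; the illustration data (y = xz plus noise) and all parameter values are made up for this example:

```python
import numpy as np

def kernel_reg_2d(x0, z0, x, z, y, hx, hz):
    """Bivariate kernel regression (9.5) with a product of Gaussian kernels."""
    w = np.exp(-(x - x0) ** 2 / (2 * hx**2)) \
      * np.exp(-(z - z0) ** 2 / (2 * hz**2))   # high only if both are close
    return w @ y / w.sum()

rng = np.random.default_rng(0)
x, z = rng.normal(size=2000), rng.normal(size=2000)
y = x * z + 0.1 * rng.standard_normal(2000)    # made-up data, m(x,z) = x*z
print(kernel_reg_2d(1.0, 1.0, x, z, y, hx=0.3, hz=0.3))   # roughly 1
```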


Figure 9.3: Federal funds rate (daily data 1954–2004; panel a: drift vs level, in bins; panel b: volatility vs level, in bins; Drift t+1 = yt+1 − yt; Volatility t+1 = (Drift t+1 − fitted Drift t+1)²)

9.1.3 A Primer in Interest Rate Modelling

Interest rate models are typically designed to describe the movements of the entire yield curve in terms of a small number of factors. For instance, the model assumes that the short interest rate, rt, is a mean-reverting AR(1) process

rt = ρrt−1 + εt, where εt ∼ N(0, σ²), so (9.6)

rt − rt−1 = (ρ − 1)rt−1 + εt , (9.7)

and that all term premia are constant. This means that the drift is decreasing in the interest rate, but that the volatility is constant.

(The usual assumption is that the short interest rate follows an Ornstein-Uhlenbeck diffusion process, which implies the discrete time model in (9.6).) It can then be shown that all interest rates (for different maturities) are linear functions of the short interest rate.

To capture more movements in the yield curve, models with richer dynamics are used. For instance, Cox, Ingersoll, and Ross (1985) construct a model which implies that the short interest rate follows an AR(1) as in (9.6), except that the variance is proportional to the interest rate level, so εt ∼ N(0, rt−1σ²).
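As a sketch (with made-up parameter values), the following simulates the constant-variance AR(1) in (9.6) next to a Cox-Ingersoll-Ross style variant where Var(εt) = rt−1σ²; note that (9.6) as written has no intercept, so the horizon is kept short to avoid a large downward drift:

```python
import numpy as np

rng = np.random.default_rng(0)
rho, sigma, r0, n = 0.999, 0.001, 0.05, 250    # daily steps over a year

r_ar1 = np.empty(n)
r_cir = np.empty(n)
r_ar1[0] = r_cir[0] = r0
for t in range(1, n):
    e1, e2 = rng.standard_normal(2)
    r_ar1[t] = rho * r_ar1[t - 1] + sigma * e1                # (9.6)
    # CIR-style: Var(eps_t) = r_{t-1}*sigma^2, so the volatility
    # of the shock rises with the level of the interest rate
    r_cir[t] = rho * r_cir[t - 1] + sigma * np.sqrt(max(r_cir[t - 1], 0.0)) * e2

print(r_ar1[-1], r_cir[-1])
```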



Figure 9.4: Federal funds rate (daily data 1954–2004; panel a: drift vs level, kernel regression with small and large h; panel b: volatility vs level, kernel regression with small and large h; panel c: drift vs level, kernel regression with a 90% confidence band; Drift t+1 = yt+1 − yt; Volatility t+1 = (Drift t+1 − fitted Drift t+1)²)

Recently, non-parametric methods have been used to estimate how the drift and volatility are related to the interest rate level (see, for instance, Ait-Sahalia (1996)). Figures 9.3–9.4 give an example. Note that the volatility is defined as the squared difference between the drift and the expected drift (from the same estimation method).

9.1.4 Non-Parametric Option Pricing

There seem to be systematic deviations from the Black-Scholes model. For instance, implied volatilities are often higher for options far from the current spot (or forward) price: the volatility smile. This is sometimes interpreted as if the beliefs about the future log asset price put larger probabilities on very large movements than what is compatible with the normal distribution ("fat tails").

This has spurred many efforts to both describe the distribution of the underlying asset price and to amend the Black-Scholes formula by adding various adjustment terms. One strand of this literature uses non-parametric regressions to fit observed option prices to the variables that also show up in the Black-Scholes formula (spot price of the underlying asset, strike price, time to expiry, interest rate, and dividends). For instance, Ait-Sahalia and Lo (1998) apply this to daily data for Jan 1993 to Dec 1993 on S&P 500 index options (14,000 observations). They find interesting patterns of the implied moments (mean, volatility, skewness, and kurtosis) as the time to expiry changes. In particular, the non-parametric estimates suggest that distributions for longer horizons have increasingly larger skewness and kurtosis. Whereas the distributions for short horizons are not too different from normal distributions, this is not true for longer horizons.

Bibliography

Ait-Sahalia, Y., 1996, "Testing Continuous-Time Models of the Spot Interest Rate," Review of Financial Studies, 9, 385–426.

Ait-Sahalia, Y., and A. W. Lo, 1998, "Nonparametric Estimation of State-Price Densities Implicit in Financial Asset Prices," Journal of Finance, 53, 499–547.

Campbell, J. Y., A. W. Lo, and A. C. MacKinlay, 1997, The Econometrics of Financial Markets, Princeton University Press, Princeton, New Jersey.

Cox, J. C., J. E. Ingersoll, and S. A. Ross, 1985, "A Theory of the Term Structure of Interest Rates," Econometrica, 53, 385–407.

Hardle, W., 1990, Applied Nonparametric Regression, Cambridge University Press, Cambridge.

Mittelhammer, R. C., G. J. Judge, and D. J. Miller, 2000, Econometric Foundations, Cambridge University Press, Cambridge.

Pagan, A., and A. Ullah, 1999, Nonparametric Econometrics, Cambridge University Press.
