MFE Financial Econometrics
Michaelmas Term 2011
Weeks 1 - 4
Jeremy Large and Neil Shephard
Oxford-Man Institute of Quantitative Finance, University of Oxford
September 27, 2011
Contents
1 Probability
  1.1 Basic probability
    1.1.1 Reading
    1.1.2 Sample spaces, events and axioms
    1.1.3 Independence
    1.1.4 Conditional Probability
  1.2 Random variables
    1.2.1 Basics
    1.2.2 Examples
    1.2.3 Random walk
    1.2.4 Distribution functions
    1.2.5 Quantile functions
    1.2.6 Some common random variables
    1.2.7 Multivariate random variables
    1.2.8 Moments
    1.2.9 Covariance matrices
    1.2.10 Back to distributions
    1.2.11 Conditional distributions
    1.2.12 Bayes theorem
  1.3 Estimators
    1.3.1 Introduction
    1.3.2 Bias and mean square error of estimators
    1.3.3 Histogram
    1.3.4 Bayes theorem*
  1.4 Simulating random variables
    1.4.1 Pseudo random numbers
    1.4.2 Discrete random variables
    1.4.3 Inverting distribution functions
  1.5 Asymptotic approximation
    1.5.1 Motivation
    1.5.2 Definitions
    1.5.3 Some payback
    1.5.4 Some more theory (first part examinable)
  1.6 Conclusion
  1.7 Solutions to some exercises
  1.8 Addition and multiplication of matrices

2 Estimating models and testing hypotheses
  2.1 Generic inference
  2.2 Motivation
    2.2.1 Notation
    2.2.2 Choice of model
    2.2.3 Nuisance parameters
  2.3 Likelihood based estimation
    2.3.1 ML estimator
    2.3.2 The model and DGP
    2.3.3 Properties of ML estimator
    2.3.4 Properties of ML estimator if model is true
    2.3.5 Confidence intervals
  2.4 Moment based estimation *
    2.4.1 Method of moment estimator *
    2.4.2 Behaviour of MM estimator *
    2.4.3 MLE as a MM estimator *
    2.4.4 Moments for dynamic models *
    2.4.5 Generalised method of moments (GMM) *

3 Appendix
  3.1 Appendix: the Newton-Raphson method
  3.2 Simple matrix inverses and determinants
NOTE: some sections are marked * to indicate that they describe non-essential (and non-examinable) material for this part of the lecture course (i.e. for the first four weeks of Michaelmas Term).
Chapter 1
Probability
1.1 Basic probability
1.1.1 Reading
Probability theory is the basis of all modern econometrics. The treatment developed here will focus on the results we need in order to progress in the econometrics courses; however, you will find that much of the other core courses also discuss events using probability theory.
Remark 1.1 Textbook treatments include Grimmett and Stirzaker (2001). Discussions of probability theory can be given at a number of different levels. The most formal is based on measure theory; see Billingsley (1995). An econometric text which uses this in its introduction to probability is Gallant (1997). I would recommend this book to students who have a strong maths/stats background. We will not use this material very formally, although we will discuss sigma algebras briefly. Our treatment will be based at the same level as Casella and Berger (2001), which is an excellent book. Alternative solid texts on this subject include Hendry (1995, pp. 639-676). All of Gallant (1997), Hendry (1995) and Casella and Berger (2001) will be useful for our later treatment of inference. A modern book on statistical inference is Davison (2003).
In the rest of these notes we will discuss the initial construction of the probability calculus. The emphasis will be on sample spaces, events and probability functions.
1.1.2 Sample spaces, events and axioms
Basic probability is built around set theory. We need some definitions before we start. To illustrate these ideas we will use a simple example throughout.

Write Yi as the price of a very simple asset at time i. E.g. normal Vodafone trades are only priced to the nearest 0.25p, so here 0.25p represents a tick. Figure 1.1 shows an example of this.

We suppose the price starts at zero and it can move 1 tick up or down each time period, or stay the same! Then the following table shows the possible prices an abstract asset could have at different times:

time    Yi
i = 0   0
i = 1   -1, 0, 1
i = 2   -2, -1, 0, 1, 2
i = 3   -3, -2, -1, 0, 1, 2, 3
i = 4   -4, -3, -2, -1, 0, 1, 2, 3, 4
Thus, for example, Y4 can take on 9 different values.

Sample space. The set Ω is called the sample space if it contains all possible (primitive) outcomes that we are considering. E.g. if we think about Y4 then its sample space is

Ω = {-4, -3, -2, -1, 0, 1, 2, 3, 4}.

Event. An event is a subset of Ω (including Ω itself). E.g. let

A = {1},
Figure 1.1: Sample path of the best bid price (best available marginal price to a seller) for Vodafone on the LSE's electronic order book SETS for the first working day in 2004.
i.e. Y4 = 1. Further let B be the event that Y4 is strictly positive, so

B = {1, 2, 3, 4}.

Example 1.1 Value at risk. An important concept is downside risk: how much you can lose, how quickly and how often. In this case the event of a large loss might be defined as the event

{-4, -3},

a rapid fall of 3 ticks or more. In practice value at risk tends to be computed over a day or more, rather than over tiny time periods.
We now recall some notation for A and B which are events in Ω.

Union: A ∪ B, e.g. A ∪ B = {1, 2, 3, 4}.
Intersection: A ∩ B, e.g. A ∩ B = {1}.
Complementation: Ac, e.g. Ac = {-4, -3, -2, -1, 0, 2, 3, 4}.

Modern probability theory is built on an axiomatic development based upon the theory of sets.

Probability axioms based on the triple (Ω, F, Pr)
For a sample space Ω, a probability function is a function Pr that satisfies, on all possible events A which are subsets of Ω:

1. Pr(A) ≥ 0, for all A ∈ F (A is an element of F).
2. Pr(Ω) = 1.
3. If {Ai ∈ F} are disjoint then

Pr(∪_{i=1}^∞ Ai) = Σ_{i=1}^∞ Pr(Ai).

E.g.

Pr(Y4 > 0) = Σ_{i=1}^4 Pr(Y4 = i).
Comments:
Only events have probabilities. Events, E, are subsets of Ω, not elements, so E ⊆ Ω or, equivalently, E ∈ F. Probabilities are always ≥ zero. A realization is when a single ω is picked (happens). However, strictly speaking this realization has no probability (giving it a probability makes no sense).
If events are disjoint (non-overlapping) then the probability of exactly one of them occurring is the sum of the probabilities that each one occurs. An example is of two disjoint events {A1, A2}. Then

Pr(A1 ∪ A2) = Pr(A1) + Pr(A2).
Example 1.2 Divide Ω into {A, Ac} (e.g. Y4 is strictly positive: A = {1, 2, 3, 4}, Ac = {-4, -3, -2, -1, 0}). We know Ω = A ∪ Ac and that these events are disjoint. Consequently

Pr(Ac) = 1 - Pr(A).

As Pr(Ac) ≥ 0 we have that

Pr(A) ≤ 1.

Finally, the sigma algebra allows us to write ∅ = Ωc and so

Pr(∅) = 0.
Example 1.3 Consider two events A, B which are in F. These events are not necessarily disjoint and so we cannot immediately use axiom 3. E.g. for Y4

A = {1, 2, 3}, B = {-1, 1}.
The result is that

Pr(A ∪ B) = Pr(A) + Pr(B) - Pr(A ∩ B). (1.1)

In our example this is

Pr(Y4 ∈ {-1, 1, 2, 3}) = Pr(Y4 ∈ {1, 2, 3}) + Pr(Y4 ∈ {-1, 1}) - Pr(Y4 = 1).

That is, the probability of at least one of A and B occurring is the probability of A occurring plus the probability of B occurring minus the probability that they both happen.
1.1.3 Independence
Consider two events A, B which are in F. We are interested in the concept that the occurrence of one event does not affect the probability of another event also happening. When this is true we say that the two events are independent. Mathematically we write that the events A, B are independent (in F) iff

Pr(A ∩ B) = Pr(A) Pr(B).

Note that these events cannot be independent if the events are disjoint (and have non-zero probability), for then Pr(A ∩ B) = 0.

It is often helpful to have a shorthand for independence. We write

A ⊥⊥ B.

Example 1.4 Let S and T be any subsets of {-1, 0, 1} (e.g. suppose that S = {-1, 1} and T = {1}). Define A and B by:

A is [(Y4 - Y3) ∈ S]

and

B is [(Y3 - Y2) ∈ T].

Many models assume that for any S and T,¹

A ⊥⊥ B.
1.1.4 Conditional Probability
Basics
Sometimes we wish to change the sample space on which we compute probabilities. E.g. suppose we are at time 3; then we know the value of Y3 = 3, say. Then

Y4 ∈ {2, 3, 4},

which is a new sample space. On this sample space we can define new events and new probabilities, e.g.

Pr(Y4 > 2) = 1 - Pr(Y4 = 2),

¹Empirically this is rejected, for although A, B are typically almost uncorrelated they are not independent. High levels of volatility tend to follow high levels of volatility.
but it is confusing to have two meanings for Pr, so it is better to be explicit that we know Y3 = 3. We are thus conditioning on this and write

Pr(Y4 > 2 | Y3 = 3) = 1 - Pr(Y4 = 2 | Y3 = 3).

This is called a conditional probability: conditional probabilities are normal probabilities but where we are changing the sample space.

More abstractly, let us condition on some event B (e.g. Y2 = 3); then the probability axioms transfer across to become

Pr(A | B) ≥ 0,
Pr(B | B) = 1,
Pr(∪_{i=1}^∞ Ai | B) = Σ_{i=1}^∞ Pr(Ai | B) if {(Ai ∩ B)} are disjoint.

Example 1.5 We may be interested in the forecast distributions

Pr(Y4 | Y3), Pr(Y4 | Y2), Pr(Y4 | Y1), Pr(Y4 | Y0, Y1, Y2, Y3),

the distribution of Y4 given we know the price at time 3, 2 or 1. The last conditional probability is a one-step ahead forecast distribution given the path of the process.
Definition
It is often useful to deduce conditional probability statements from the behaviour of the joint distribution. That is, if we know the joint distribution of employment and wages, it would be nice to automatically deduce the conditional probability statements about wages given employment. We can do this in the following way.
Consider a world with two events, A and B. We might be interested in Pr(A) or Pr(B) or Pr(A ∩ B). Alternatively, we could want to know Pr(A|B), assuming Pr(B) > 0. This means I constrain my world so that B happens and I ask if A then happens. This can only happen if both A and B occur, so we define

Pr(A|B) = Pr(A ∩ B) / Pr(B).

It is easy to show that this obeys the probability axioms. This is a vital concept in econometrics.
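The definition can be made concrete by brute-force enumeration. The sketch below (not part of the original notes) uses the up/down-only binomial tree of Section 1.2.2 with p = 1/2 for simplicity, and computes Pr(A|B) = Pr(A ∩ B)/Pr(B) for the events A: Y4 > 2 and B: Y3 = 3 by listing every equally likely path.

```python
from itertools import product
from fractions import Fraction

# All 2^4 equally likely 4-step paths: each step is an up (+1) or down (-1) tick.
paths = list(product([-1, 1], repeat=4))
Y3 = [sum(p[:3]) for p in paths]   # price at time 3
Y4 = [sum(p) for p in paths]       # price at time 4

def pr(event):
    """Probability of an event (a set of path indices) under equal weighting."""
    return Fraction(len(event), len(paths))

A = {i for i in range(len(paths)) if Y4[i] > 2}    # event A: Y4 > 2
B = {i for i in range(len(paths)) if Y3[i] == 3}   # event B: Y3 = 3

# Conditional probability from the definition Pr(A|B) = Pr(A ∩ B)/Pr(B).
pr_A_given_B = pr(A & B) / pr(B)
print(pr_A_given_B)  # 1/2: given Y3 = 3, the event Y4 > 2 needs one more up-tick
```

Note the answer agrees with directly restricting the sample space: given Y3 = 3, the tree allows only Y4 ∈ {2, 4}, each with probability 1/2.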
We are often interested in joint conditional probabilities

Pr(A ∩ B | C).

If

Pr(A ∩ B | C) = Pr(A | C) Pr(B | C),

we say that, conditionally on C, A and B are independent. This is often written as

(A ⊥⊥ B) | C.
Example 1.6 In many models in financial econometrics

Pr(Yi | Y_{i-1}, Y_{i-2}, ..., Y0) = Pr(Yi | Y_{i-1}),

which is called the Markov assumption. This means the probability distribution at time i only depends upon the value of the process one period earlier, not on any further history of the process. This means that

(Y2 ⊥⊥ Y4) | Y3:

the price of the asset at time 4 is independent of its value at time 2, given we know its value at time 3.
Bayes theorem
Of course we have the corresponding result that

Pr(B) Pr(A|B) = Pr(A ∩ B),

and

Pr(A) Pr(B|A) = Pr(A ∩ B).

Rearranging, we get one of the two most famous results in probability theory (the other being the central limit theorem):

Pr(B|A) = Pr(A ∩ B) / Pr(A) = Pr(B) Pr(A|B) / Pr(A), (1.2)

Bayes theorem. This says that I can go from knowing Pr(A|B) to knowing Pr(B|A) by simply multiplying by the ratio Pr(B)/Pr(A). We will return to it later.
1.2 Random variables
1.2.1 Basics
We have written to denote each of the events associated with the triple (,F ,Pr). Thatis F is generated from , F and Pr is the function which associates a probability.
The events do not need to refer to numerical events. However, if we take a function ofthese events X() which leads to a (possibly a vector or matrix valued) numerical valuethen X() is called a random variable. Most of econometrics is about random variables.In most probability theory when we refer to random variables we drop reference to , sowe will write X as the random variable.
1.2.2 Examples
A classic model for asset prices is the so-called binomial tree. This simplifies the above model to allow prices to go up or down one tick, but not stay the same. The corresponding sample space for the price at time i is given below:

time    Ωi
i = 0   Ω0 = {0}
i = 1   Ω1 = {-1, 1}
i = 2   Ω2 = {-2, 0, 2}
i = 3   Ω3 = {-3, -1, 1, 3}
i = 4   Ω4 = {-4, -2, 0, 2, 4}

Bernoulli random variable. Write the event of an up tick at time 1 as (ω = U) and a down tick as (ω = D). Let X(ω = U) = 1 and X(ω = D) = 0. Then X is a binary random variable with two points of support, 0 and 1, and is called a Bernoulli random variable. Clearly we can think of

Y1 = 2X - 1 ∈ {-1, 1}.

Write Pr(X = 1) = p and Pr(X = 0) = 1 - p. Then

Pr(Y1 = 1) = p, Pr(Y1 = -1) = 1 - p.

More generally, write Xi as the same as above but for time i and assume that the Xi are independent and identically distributed (i.i.d.). We model prices as

Yi = Y_{i-1} + 2Xi - 1, i = 1, 2, 3, ..., Y0 = 0, (1.3)

which is called the binomial tree process and is one of the most famous in asset pricing. The definition of a process Y0, Y1, Y2, ... is that we record the value of an item at a sequence of times.
Binomial. Suppose we carry out n independent Bernoulli trials with Pr(Xi = 1) = p; then

X = Σ_{i=1}^n Xi

is called a binomial random variable. Simple combinatorial arguments give

Pr(X = x) = [n! / (x!(n - x)!)] p^x (1 - p)^{n-x}, x = 0, 1, ..., n. (1.4)

The permissible range of X, that is 0, 1, 2, ..., n, is called the support of X.
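The pmf (1.4) is easy to put into code; the sketch below (the helper name `binom_pmf` is ours, not from the notes) also checks that the pmf sums to one over the support, which is just the binomial theorem applied to (p + (1-p))^n.

```python
from math import comb

def binom_pmf(x, n, p):
    """Pr(X = x) for X ~ Bin(n, p), equation (1.4)."""
    return comb(n, x) * p**x * (1 - p)**(n - x)

# The pmf sums to one over the support 0, 1, ..., n.
n, p = 4, 0.5
total = sum(binom_pmf(x, n, p) for x in range(n + 1))
print(total)                  # 1.0
print(binom_pmf(2, 4, 0.5))   # 6/16 = 0.375
```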
1.2.3 Random walk
The binomial tree (1.3) can be written as

Yi = 2 Σ_{j=1}^i Xj - i, i = 0, 1, 2, ...,

using the fact that Y0 = 0. This is a rather nice process for we know that

Pr(Yi = x) = Pr(X = (x + i)/2),

where X has the form (1.4).
Exercise 1.1 Suppose Yi obeys a binomial tree process. Derive the form of

Pr(Y_{i+3} = x | Yi = y).

The binomial tree model is a special case of the random walk process, which can be written as

Yi = Y_{i-1} + Xi,

where the shocks Xi are i.i.d. This is a much discussed process in asset pricing, although we will see later that the i.i.d. assumption in the model is overly strong. Clearly if we think of Yi as log-prices then

Xi = Yi - Y_{i-1}

are returns. Hence the price process can be transformed into an i.i.d. sample by taking first differences.
Example 1.7 Figure 1.2 shows 8 random sample paths of Yi for i = 0, 1, ..., 100 from a binomial tree process using p = 0.5. The way we carried out this simulation will be discussed in Section 1.4. In the lower right hand corner of the Figure we also show a histogram of 10,000 draws of Y100. The histogram will be defined in Section 1.3.3; for now it suffices that it approximates Pr(Y100 = x) for a range of x.
Figure 1.2: 8 sample paths from a binomial tree with p = 0.5. Also shown is the histogram of the process at time 100, that is Y100. It is based on drawing 10,000 sample paths of the process. Code: bintree
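The original bintree code is not reproduced in these notes; the following Python sketch shows how such a simulation might look, drawing Bernoulli steps and accumulating them as in (1.3). The function name and seed handling are our own choices.

```python
import random

def binomial_tree_path(n_steps, p=0.5, seed=None):
    """Simulate Y_0, ..., Y_n from the binomial tree (1.3):
    Y_i = Y_{i-1} + 2 X_i - 1 with X_i ~ i.i.d. Bernoulli(p)."""
    rng = random.Random(seed)
    y, path = 0, [0]
    for _ in range(n_steps):
        x = 1 if rng.random() < p else 0   # Bernoulli(p) draw
        y += 2 * x - 1                      # move up or down one tick
        path.append(y)
    return path

# Approximate Pr(Y_100 = x) with a histogram of 10,000 terminal values.
draws = [binomial_tree_path(100, seed=s)[-1] for s in range(10_000)]
mean_y100 = sum(draws) / len(draws)
print(round(mean_y100, 2))  # near E(Y_100) = 100(2p - 1) = 0 for p = 0.5
```

Collecting the `draws` into bins would reproduce the histogram in the lower right corner of Figure 1.2.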
1.2.4 Distribution functions
The distribution function of a random variable X is

FX(x) = Pr(X ≤ x).

Here x is the argument of the function, not a random quantity. The density function, for continuous X, is

fX(x) = ∂FX(x)/∂x.

Note that for continuous variables

Pr(X = x) = 0,

for every x. One can go from the density function to the distribution function via

FX(x) = ∫_{-∞}^x fX(y) dy.

For X with atoms of support we often write fX(x) = Pr(X = x).
1.2.5 Quantile functions
An important function in econometrics is obtained by inverting the distribution function. I.e. we ask: for a given u ∈ [0, 1], find x such that

u = FX(x).

We call

x = FX^{-1}(u)

the quantile function of X. The 0.1 quantile tells us the value of X such that only 10% of the population fall below that value. The most well known quantile is

x = FX^{-1}(0.5),

which is called the median.
Example 1.8 Quantiles are central in simple value at risk (VaR) calculations, which measure the degree of risk taken by banks. In simple VaR calculations one looks at the marginal distribution of the returns over a day, written Yi - Y_{i-1}, and calculates

F^{-1}_{Yi - Y_{i-1}}(0.05),

the 5% quantile of the return distribution.
If X has atoms of non-zero probability then the quantile function is not unique.
Example 1.9 An exponential random variable X ∼ Exp(θ) has density

fX(x) = (1/θ) exp(-x/θ), x ∈ R+, θ ∈ R+. (1.5)

The corresponding distribution function is

FX(x) = 1 - exp(-x/θ),

while the quantile function is

FX^{-1}(u) = -θ log(1 - u).
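Example 1.9 is the basis of the inversion method of Section 1.4.3: if U is standard uniform then F^{-1}(U) has distribution F. A sketch (the value θ = 2 and the helper name are illustrative):

```python
import random
import math

def exp_quantile(u, theta):
    """Quantile function of Exp(theta), with density (1/theta) exp(-x/theta):
    F^{-1}(u) = -theta * log(1 - u)."""
    return -theta * math.log(1.0 - u)

# Inverse-transform sampling: feed uniform draws through the quantile function.
rng = random.Random(0)
theta = 2.0
draws = [exp_quantile(rng.random(), theta) for _ in range(100_000)]
sample_mean = sum(draws) / len(draws)
print(round(sample_mean, 2))  # should be near E(X) = theta = 2
```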
1.2.6 Some common random variables
Normal
The normal or Gaussian distribution is probably the most important we will come across. It naturally arises out of the distribution of an average (mean) and has convenient properties under linear translations. Its form does not look immediately attractive. The density is

fX(x) = (1/√(2πσ²)) exp{-(x - μ)²/(2σ²)}, x, μ ∈ R, σ² ∈ R+.

We can see this density peaks at μ and is symmetric around μ.

Mathematically we can think of this density in the following way:

log fX(x) = c - (1/(2σ²))(x - μ)².

Thus the log-density is quadratic in x. The constant c is determined to make the density integrate to one. Its exact form is not very interesting; it just turns out to be rather simple in this case.
Example 1.10 Think of the distribution of daily changes in UK Sterling against the US Dollar. We will work with the rate which is recorded daily by Datastream from 26 July 1985 to 28 July 2000. Throughout, for simplicity of exposition, we report 100 times the returns (daily changes in the log of the exchange rate, which are very close to percentage changes). The Gaussian fit to the density is given in Figure 1.3. The Gaussian fit is in line with the Brownian motion model behind the Black-Scholes option pricing model. The x-axis is marked off in terms of returns, not standard deviations, which are 0.623. Figure 1.4 shows the log of the densities, which shows the characteristic quadratic decay of the Gaussian density. We can see that for exchange rate data a better density is one where the decay is linear. This suggests we might become interested in densities of the type

log fX(x) = c - (1/δ)|x - μ|.

When we calculate the normalisation² this delivers the Laplace density

fX(x) = (1/(2δ)) exp{-(1/δ)|x - μ|}, x, μ ∈ R, δ ∈ R+,

which is exponentially distributed to the left and right of the location μ.
²This looks non-trivial, but set μ = 0; then

1 = d ∫_{-∞}^{∞} exp{-(1/δ)|y|} dy = 2d ∫_0^∞ exp{-(1/δ)y} dy = 2dδ,

which determines d = 1/(2δ). Allowing μ ≠ 0 just shifts the density, so it makes no difference to the solution.
Figure 1.3: Estimates of the unconditional density of 100 times the returns for Sterling/Dollar. Also the density for the ML fit of the normal distribution.
A normal density has support on the real line. Centred at μ, σ² determines its scale (spread). The notation for a normal is X ∼ N(μ, σ²). This will be discussed at more length later. FX(x) is only available via numerical methods.

A vital property of the normal distribution is that if X ∼ N(μ, σ²) and α and β are non-random then

α + βX ∼ N(α + βμ, β²σ²).

That is, under linear (affine) transformations a normal variable is normal. This result, as well as the others given in this paragraph, are most easily proved using cumulant functions, which will be introduced in a moment.

Notice also that we can write

X =(law) μ + σu,

where u ∼ N(0, 1). Equality in law means the left and right hand side quantities have the same law or distribution. Finally, if X and Y are independent normal variables with means μx and μy and variances σ²x and σ²y, then

X + Y ∼ N(μx + μy, σ²x + σ²y).

That is, the means and variances sum while normality is maintained. This is a very convenient result for asset pricing, as we will see later.
Figure 1.4: Estimates of the unconditional log-density of 100 times the returns for Sterling/Dollar. Also the log-density for the ML fit of the normal distribution.
Example 1.11 Suppose that the Xi are i.i.d. N(μ, σ²); then the random walk

Yi = Y_{i-1} + Xi, Y0 = 0,

has the feature that

Yi ∼ N(iμ, iσ²),

or

Y_{i+s} | Yi ∼ N(Yi + sμ, sσ²).

Figure 1.5 replicates Figure 1.2 but replaces the scaled and recentred Bernoulli variable with a normal draw. We selected μ = 0 and σ² = 4 × 0.5 × 0.5 so that it matches the mean and variance of the previous binomial tree.
Gamma
Normal densities have support on R. Often it is convenient to have a density on R+. A commonly used density is the gamma

fX(x) = (β^α / Γ(α)) x^{α-1} exp(-βx), x ∈ R+, α, β ∈ R+,

which we write as Ga(α, β). An important special case of the gamma density is the exponential, where we set α = 1; note that Ga(1, β) is the Exp(θ) density of (1.5) with scale θ = 1/β.
Figure 1.5: 8 sample paths from a Gaussian random walk with μ = 0 and σ² = 4 × 0.5 × 0.5. Also shown is the histogram of the process at time 100, that is Y100. It is based on drawing 10,000 sample paths of the process. Code: bintree
Chi-squared
Suppose Xi ∼ i.i.d. N(0, 1) (often written NID(0, 1), which means we have independent and identically distributed copies of the random variable); then

Y = Σ_{i=1}^ν Xi² ∼ χ²_ν,

a chi-squared random variable with ν degrees of freedom. It can be shown that χ²_ν = Ga(ν/2, 1/2).
Uniform
Sometimes variables are constrained to live on small intervals. The leading example of this is the standard uniform

fX(x) = 1, x ∈ [0, 1].

Hence this variable only has support on [0, 1]. This random variable is often used in economic theory as a stylised way of introducing uncertainty into a model. It also plays a crucial role in the theory and practice of simulation.

A more general uniform is defined by

fX(x) = 1/(θ2 - θ1), x ∈ [θ1, θ2].
Poisson
Counts are often modelled in economics, e.g. the number of trades in a fixed interval of time. The standard model for this is the Poisson

fX(x) = e^{-λ} λ^x / x!, x = 0, 1, 2, ..., (1.6)

which lives on the non-negative integers. This is written as X ∼ Po(λ). It can be shown that if the Xi ∼ Po(λi) are independent over i, then

Σ_{i=1}^n Xi ∼ Po(Σ_{i=1}^n λi).
Student t
A Student t random variable is generated by a ratio of random variables

t_ν = N(0, 1) / √(χ²_ν / ν),

where the N(0, 1) ⊥⊥ χ²_ν. This is symmetrically distributed about 0. When ν is not an integer it can be defined by

t_ν = N(0, 1) / √(Ga(ν/2, 1/2)/ν), ν > 0.

Its density is quite complicated and is uninteresting from our viewpoint. An important feature of Student t random variables is that

E|t_ν|^r < ∞ iff r < ν.

The special case of ν = 1 is called a Cauchy random variable, and it has the feature that even the mean does not exist!
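The ratio construction translates directly into a simulator; a sketch (the helper name is ours), drawing the χ²_ν as a sum of ν squared standard normals exactly as in the chi-squared subsection above:

```python
import random
import math

def student_t_draw(nu, rng):
    """Draw t_nu as N(0,1) / sqrt(chi-squared_nu / nu), with the chi-squared
    built as a sum of nu squared independent N(0,1) draws."""
    z = rng.gauss(0.0, 1.0)
    chi2 = sum(rng.gauss(0.0, 1.0) ** 2 for _ in range(nu))
    return z / math.sqrt(chi2 / nu)

rng = random.Random(7)
draws = [student_t_draw(5, rng) for _ in range(50_000)]
# t_5 is symmetric about 0, so the sample mean should be near zero.
print(round(sum(draws) / len(draws), 1))
```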
1.2.7 Multivariate random variables
All of the above holds for a multivariate p × 1 vector

X = (X1, ..., Xp)′.

The elements of this vector need not be independent; hence they could, for example, represent a vector of returns on a collection of assets. Hence multivariate random variables play a central role in portfolio allocation and risk assessment, as well as really all aspects of econometrics.
Sustained example: returns from a portfolio
Consider the bivariate case where p = 2. We might think of

X = (X1, X2)′ = (Y, Z)′,

where X1 = Y is the return over the next day on IBM and X2 = Z is the return over the next day on the S&P composite index.

Consider the case of measuring the outperformance of the index by IBM. This is

Y - Z.

We can write this as

(1, -1)(Y, Z)′ = b′X,

where

b = (1, -1)′.

Thus the outperformance can be measured using linear algebra. This outperformance can be thought of as a simple portfolio, buying IBM and selling the index.

Consider, slightly more abstractly, a portfolio made up of c shares in Y and d in Z. Then the portfolio returns

cY + dZ.

This can be written in terms of vectors as

(c, d)(Y, Z)′ = f′X, f = (c, d)′.

More generally, we might write p portfolios, each with different portfolio weights, as

(B11 Y + B12 Z, B21 Y + B22 Z, B31 Y + B32 Z, ..., Bp1 Y + Bp2 Z)′ = BX,

where B is the p × 2 matrix whose i-th row is (Bi1, Bi2). This is quite a powerful way of compactly writing out portfolios, here depending upon two assets. But one can extend it to q assets

X = (X1, X2, X3, ..., Xq)′
and a p × q weight matrix

B =
( B11 B12 B13 ... B1q )
( B21 B22 B23 ... B2q )
( B31 B32 B33 ... B3q )
( ...             ... )
( Bp1 Bp2 Bp3 ... Bpq ).

Now the p portfolios, depending upon the q assets, have returns

BX = ( Σ_{j=1}^q B1j Xj, Σ_{j=1}^q B2j Xj, Σ_{j=1}^q B3j Xj, ..., Σ_{j=1}^q Bpj Xj )′.

Again this is quite a simple representation of quite a complicated situation.
Back on track
In particular, if X is a 2 × 1 vector

X = (X1, X2)′ and x = (x1, x2)′,

then

FX(x) = Pr(X1 ≤ x1, X2 ≤ x2),

which in the continuous case becomes

FX(x) = ∫_{-∞}^{x2} ∫_{-∞}^{x1} fX(y1, y2) dy1 dy2.

Likewise

fX(x1, x2) = ∂²FX(x1, x2) / (∂x1 ∂x2).

When X1 ⊥⊥ X2 then this simplifies to

fX(x1, x2) = ∂²{FX1(x1) FX2(x2)} / (∂x1 ∂x2) = {∂FX1(x1)/∂x1}{∂FX2(x2)/∂x2} = fX1(x1) fX2(x2).

An important point is that

∫_{-∞}^{∞} fX(y1, x2) dy1 = ∂FX(∞, x2)/∂x2 = ∂Pr(X1 ≤ ∞, X2 ≤ x2)/∂x2 = ∂FX2(x2)/∂x2 = fX2(x2).
Figure 1.6: Graph of the densities and log-densities of N(0, I) and NIG(1,0,0,1,I) variables. (a) Density of N(0, I). (b) Density of NIG variables. (c) Log-density of N(0, I). (d) Log-density of bivariate NIG variables. Code: levy graphs
Hence if we integrate out a variable from a density function we produce the marginal density of the random variable.
The conditional distribution function takes on the form
FX1|X2=x2(x1) = Pr(X1 ≤ x1 | X2 = x2),

while we define

fX1|X2=x2(x1) = ∂Pr(X1 ≤ x1 | X2 = x2) / ∂x1,

which must have the properties of a density. When X is continuous then it can be shown that

fX1|X2=x2(x1) = fX(x1, x2) / fX2(x2).
Although intuitively obvious, the proof of this last result is not straightforward, to our knowledge.
The following example is taken from WRDS and thinks of the daily returns on IBM and the S&P composite index from 2000 until the end of 2004 as bivariate random variables. These are shown, as a time series, in Figure 2.3.1. We ignore the time series structure of these returns for now and regard them as being bivariate i.i.d. through time.

Figure 2.3.1 shows a plot of the IBM returns graphed against the S&P. It shows that the S&P variable has much less variability than the IBM one, and that when the S&P index goes up then so, on average, does IBM.
[Figure: time series of daily returns (Pt - P_{t-1})/P_{t-1} for IBM and the S&P composite, 2000-2005.]
1.2.8 Moments
General case
Suppose X is a random variable. We will often use the expectation of a function of a random variable. We define³, for a continuous X, if it exists⁴,

E{g(X)} = ∫ g(x) fX(x) dx. (1.7)

The expectation obeys some important rules. For example, if a, b are constants then E{a + b g(X)} = a + b E{g(X)}. This follows from the definition of expectations as solutions to integrals (1.7).

Exercise 1.2 Prove that E{a + b g(X)} = a + b E{g(X)}.

³If X is discrete then E{g(X)} = Σ g(xi) fX(xi). In the more theoretical literature we often use a notation which allows us to deal with both the discrete and the continuous cases. There we write E{g(X)} = ∫ g(x) dFX(x).

⁴The expectation of g(X) is said to exist iff ∫ |g(x)| fX(x) dx < ∞.
[Figure 2.3.1: scatter plot of IBM daily returns against S&P composite daily returns.]
Mean
The most important special case of the expectation operator is the mean. This is defined as (when it exists)

μ1 = E(X) = ∫ x fX(x) dx.

It is often used as a measure of the average value of a random variable (alternatives include the mode and median).
Example 1.12 Suppose X is a Bernoulli trial with Pr(X = 1) = p and Pr(X = 0) = 1 - p. Then

E(X) = 1 × Pr(X = 1) + 0 × Pr(X = 0) = p. (1.8)

Likewise, if X ∼ Bin(n, p) then

E(X) = np.

Likewise, if Yi follows a binomial tree process (1.3) then

E(Yi) = 2 Σ_{j=1}^i E(Xj) - i = i(2p - 1).
Example 1.13 If X ∼ N(μ, σ²), then

E(X) = ∫ x (1/√(2πσ²)) exp{-(x - μ)²/(2σ²)} dx
     = μ + ∫ (x - μ) (1/√(2πσ²)) exp{-(x - μ)²/(2σ²)} dx
     = μ,

using the fact that a density integrates to one and that the second integrand is an odd function of (x - μ), so the second integral is zero.
Multivariate mean
Recall we write

X = (X1, X2, X3, ..., Xq)′.

Now each Xj has a mean, E(Xj), so it would be nice to collect these together. The following notation does this. We define

E(X) = (E(X1), E(X2), E(X3), ..., E(Xq))′.

This is the mean of the vector.
Recall from Section 1.2.7 that we wrote the return on p portfolios as

BX,

where B is a p × q weight matrix. An interesting question is: what is the mean of each element of this vector, that is, of each portfolio? It turns out this has a simple answer:

E(BX) = B E(X).
Why? Recall the mean of a vector is the vector of the means of its elements:

E(BX) = ( E(Σ_{j=1}^q B1j Xj), E(Σ_{j=1}^q B2j Xj), E(Σ_{j=1}^q B3j Xj), ..., E(Σ_{j=1}^q Bpj Xj) )′.

But, for i = 1, 2, ..., p,

E(Σ_{j=1}^q Bij Xj) = Σ_{j=1}^q E(Bij Xj) = Σ_{j=1}^q Bij E(Xj).

Hence

E(BX) = ( Σ_{j=1}^q B1j E(Xj), Σ_{j=1}^q B2j E(Xj), Σ_{j=1}^q B3j E(Xj), ..., Σ_{j=1}^q Bpj E(Xj) )′ = B E(X),

as stated. This is an important result for econometrics.
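The identity also holds exactly for sample averages, since averaging is itself a linear operation; the NumPy sketch below (illustrative numbers) compares the sample mean of BX with B applied to the sample mean of X.

```python
import numpy as np

rng = np.random.default_rng(0)

# q = 3 assets with a known mean vector; p = 2 portfolios via weight matrix B.
mu = np.array([0.05, 0.02, -0.01])
B = np.array([[0.5, 0.5, 0.0],
              [1.0, -1.0, 2.0]])

# Simulate returns; compare the sample mean of BX with B times the sample mean of X.
X = rng.normal(loc=mu, scale=0.1, size=(100_000, 3))
lhs = (X @ B.T).mean(axis=0)   # sample mean of each portfolio's return
rhs = B @ X.mean(axis=0)       # B applied to the sample mean vector
print(np.round(lhs, 3), np.round(rhs, 3))
```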
r-th moment

More generally we introduce the notation
\[
\mu_r = E(X^r) = \int x^r f_X(x)\,dx,
\]
to denote the r-th moment (about the origin).

Example 1.14 Suppose X is exponentially distributed (1.5); then
\[
\mu_r = \int_0^\infty x^r \theta^{-1} \exp(-x/\theta)\,dx.
\]
Recall that the gamma function is defined as
\[
\Gamma(r) = \int_0^\infty x^{r-1} \exp(-x)\,dx, \quad r > 0,
\]
noting that if r is an integer then \(\Gamma(r) = (r-1)!\). So
\[
\mu_r = \theta^{r-1} \int_0^\infty (x/\theta)^r \exp(-x/\theta)\,dx
= \theta^r \int_0^\infty y^r \exp(-y)\,dy
= \theta^r \Gamma(r+1)
= \theta^r r!.
\]
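The moment formula \(\mu_r = \theta^r r!\) is easy to check by simulation; θ = 2 below is an arbitrary illustrative scale.

```python
# Hedged sketch: Monte Carlo check of E(X^r) = theta^r * r! for an
# exponential variable with density theta^{-1} exp(-x/theta).
import math
import numpy as np

rng = np.random.default_rng(2)
theta = 2.0  # illustrative scale
x = rng.exponential(scale=theta, size=1_000_000)

m1, m2, m3 = (np.mean(x ** r) for r in (1, 2, 3))
e1, e2, e3 = (theta ** r * math.factorial(r) for r in (1, 2, 3))
# e1, e2, e3 = 2, 8, 48
```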
Variance
Likewise if X is univariate, the variance is defined as
\[
\sigma^2 = \mathrm{Var}(X) = E\{X - E(X)\}^2 = \int \{x - E(X)\}^2 f_X(x)\,dx = E(X^2) - \{E(X)\}^2.
\]
It is the second moment of the variable X - E(X). The standard deviation is defined as \(\sqrt{\mathrm{Var}(X)}\). The variance is the basic measure used in asset pricing of the risk of holding an asset. It is often assumed that investors try to avoid being exposed to such risk and are only willing to bear it in exchange for a higher mean.
Example 1.15 Suppose X is a Bernoulli trial with Pr(X = 1) = p; then
\[
E(X^2) = 1^2 \cdot \Pr(X = 1) + 0^2 \cdot \Pr(X = 0) = p,
\]
so using (1.8)
\[
\mathrm{Var}(X) = p - p^2 = p(1 - p).
\]
Likewise if \(X \sim \mathrm{Bin}(n, p)\) then \(\mathrm{Var}(X) = np(1-p)\). Notice this variance is maximised at p = 1/2, when it is n/4.
Example 1.16 Suppose \(X \sim N(0, 1)\); then
\[
\mathrm{Var}(X) = \int x^2 \frac{1}{\sqrt{2\pi}} \exp\left(-\frac{x^2}{2}\right) dx
= 2\int_0^\infty x^2 \frac{1}{\sqrt{2\pi}} \exp\left(-\frac{x^2}{2}\right) dx.
\]
Letting \(y = x^2/2\), so that \(dy = x\,dx\) and \(x = \sqrt{2y}\), this becomes
\[
\mathrm{Var}(X) = 2\int_0^\infty \sqrt{2y}\,\frac{1}{\sqrt{2\pi}} \exp(-y)\,dy
= \frac{2}{\sqrt{\pi}}\int_0^\infty y^{1/2} \exp(-y)\,dy
= \frac{2}{\sqrt{\pi}}\,\Gamma(3/2)
= \frac{1}{\sqrt{\pi}}\,\Gamma(1/2)
= 1,
\]
using \(\Gamma(3/2) = \frac{1}{2}\Gamma(1/2)\) and the well known result that \(\Gamma(1/2) = \sqrt{\pi}\). This implies that if \(X \sim N(\mu, \sigma^2)\) then \(\mathrm{Var}(X) = \sigma^2\).
Example 1.17 Suppose X is exponentially distributed (1.5); then
\[
\mu_1 = \theta, \quad \mu_2 = 2\theta^2,
\]
which implies \(\mathrm{Var}(X) = \theta^2\).
Exercise 1.3 Prove that \(\mathrm{Var}(a + bX) = b^2\mathrm{Var}(X)\).
Example 1.18 Consider holding $1 of an asset giving a risk-free return of r (an interest rate), and $b of an asset with return X which is random (i.e. risky). Then the portfolio return is
\[
r + bX.
\]
The mean of this is \(r + bE(X)\). Exercise 1.3 implies the variance of the portfolio is
\[
b^2\mathrm{Var}(X).
\]
So we get a key feature for asset pricing: the mean and the standard deviation of the portfolio increase proportionally with the weight.
Exercise 1.4 Show the mean and variance of the Laplace random variable
\[
f_X(x) = \frac{1}{2\theta}\exp\left\{-\frac{1}{\theta}|x - \mu|\right\}
\]
are \(\mu\) and \(2\theta^2\), respectively. HINT: the Laplace density is a two-sided exponential density.

Exercise 1.5 What are the mean and variance of a standard uniform?

Exercise 1.6 Prove that the mean and variance of a Poisson random variable (1.6) are both equal to its intensity parameter.
Covariance
The covariance of X and Y is defined (when it exists) as
\[
\mathrm{Cov}(X, Y) = E[\{X - E(X)\}\{Y - E(Y)\}] = \int\!\!\int \{x - E(X)\}\{y - E(Y)\} f_{X,Y}(x, y)\,dx\,dy = E(XY) - E(X)E(Y).
\]

Exercise 1.7 Prove that \(\mathrm{Cov}(a + bX, c + dY) = bd\,\mathrm{Cov}(X, Y)\). Hence covariances are location invariant.
Exercise 1.8 Prove that Var(X + Y ) = Var(X) + Var(Y ) + 2Cov(X, Y ).
Exercise 1.9 Prove that
\[
\mathrm{Var}(aX + bY) = a^2\mathrm{Var}(X) + b^2\mathrm{Var}(Y) + 2ab\,\mathrm{Cov}(X, Y).
\]
Exercise 1.10 Prove that if \(X_1, \ldots, X_n\) are independent then
\[
\mathrm{Var}\Bigl(\frac{1}{n}\sum_{i=1}^n X_i\Bigr) = \frac{1}{n^2}\sum_{i=1}^n \mathrm{Var}(X_i).
\]
If the random variables are also identically distributed, then show this simplifies to
\[
\mathrm{Var}\Bigl(\frac{1}{n}\sum_{i=1}^n X_i\Bigr) = \frac{1}{n}\mathrm{Var}(X_1).
\]
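The \(\mathrm{Var}(X_1)/n\) result in Exercise 1.10 can be seen directly by simulating many independent sample means; n = 20 and \(\sigma^2 = 4\) are illustrative choices.

```python
# Hedged sketch: check Var((1/n) sum X_i) = Var(X_1)/n for i.i.d. draws
# by simulating many independent sample means.
import numpy as np

rng = np.random.default_rng(3)
n, reps = 20, 200_000
sigma2 = 4.0                                    # Var(X_i), illustrative
X = rng.normal(0.0, np.sqrt(sigma2), size=(reps, n))
sample_means = X.mean(axis=1)

var_of_mean = sample_means.var()                # close to sigma2 / n = 0.2
```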
1.2. RANDOM VARIABLES 29
Independence implies uncorrelatedness (nearly)
Recall
\[
\mathrm{Cov}(X, Y) = E(XY) - E(X)E(Y).
\]
So if X and Y are independent then
\[
\mathrm{Cov}(X, Y) = E(X)E(Y) - E(X)E(Y) = 0.
\]
Hence independence implies uncorrelatedness, so long as the covariance exists. If the covariance does not exist then this result obviously does not hold; e.g. two independent Cauchy random variables do not have zero covariance, as the covariance does not exist.

If the covariance between X and Y is zero we say they are uncorrelated, which we write as X ⊥ Y. So
\[
(X \text{ independent of } Y) \Rightarrow (X \perp Y).
\]
The reverse is not true: uncorrelatedness does not imply independence (though in the Gaussian case it does!).
Example 1.19 Suppose
\[
X \sim N(0, 1), \quad Y = X^2.
\]
Then
\[
\mathrm{Cov}(X, Y) = E(XY) - E(X)E(Y) = E(X^3) - 0 = 0,
\]
yet X and Y are clearly dependent.
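Example 1.19 is easy to reproduce numerically: the sample correlation between X and X² is close to zero, while the dependence shows up as soon as we correlate |X| with X².

```python
# Hedged sketch: X ~ N(0,1) and Y = X^2 are uncorrelated but dependent.
import numpy as np

rng = np.random.default_rng(4)
x = rng.standard_normal(500_000)
y = x ** 2

corr_xy = np.corrcoef(x, y)[0, 1]              # near 0: uncorrelated
corr_absx_y = np.corrcoef(np.abs(x), y)[0, 1]  # far from 0: dependence
```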
Cumulant function
The cumulant function (which, subject to some mild and relaxable regularity assumptions⁵, uniquely characterises a distribution) of X is
\[
K(\theta) = \log E(e^{\theta X}). \tag{1.9}
\]
The exponent of this function is the moment generating function (mgf)
\[
E(e^{\theta X}).
\]
It has the property that, so long as all the moments exist,
\[
\mu_r = \left.\frac{\partial^r E(e^{\theta X})}{\partial \theta^r}\right|_{\theta = 0}.
\]

Example 1.20 In the case where \(X \sim N(\mu, \sigma^2)\) it can be shown that
\[
K(\theta) = \theta\mu + \frac{1}{2}\theta^2\sigma^2.
\]

Footnote 5: Formally, the characteristic function \(E\exp(\sqrt{-1}\,\theta X)\) uniquely characterises the distribution. It always exists, while cumulant functions do not. Here I am being more informal, ignoring the complex arguments.
Exercise 1.11 Use the cumulant function to show that when \(X \sim N(\mu, \sigma^2)\),
\[
E(X) = \mu, \quad \mathrm{Var}(X) = \sigma^2, \quad E\{(X-\mu)^3\} = 0.
\]
In the case where \(\mu = 0\), show that \(E(X^4) = 3\sigma^4\). Use these results to show that
\[
E(\chi^2_1) = 1 \quad \text{and} \quad \mathrm{Var}(\chi^2_1) = 2,
\]
while
\[
E(\chi^2_\nu) = \nu \quad \text{and} \quad \mathrm{Var}(\chi^2_\nu) = 2\nu.
\]
Exercise 1.12 Use the property that densities integrate to one to derive the mgf of the inverse Gaussian variable (??). Use this result to show that the sum of two i.i.d. inverse Gaussian variables is itself distributed as inverse Gaussian.
Correlation
The correlation of X and Y is defined (when it exists) as
\[
\mathrm{Cor}(X, Y) = \frac{\mathrm{Cov}(X, Y)}{\sqrt{\mathrm{Var}(X)\,\mathrm{Var}(Y)}}.
\]
Now
\[
\mathrm{Cor}(X, Y) \in [-1, 1],
\]
which follows from the Cauchy-Schwarz inequality⁶.

Exercise 1.13 Prove that \(\mathrm{Cor}(a + bX, c + dY) = \mathrm{Cor}(X, Y)\) for b, d > 0. Hence correlations are scale and location invariant.

Example 1.21 In the IBM and S&P example: Mean IBM: 0.000206, Mean S&P: -0.0000721, S.D. IBM: 0.0225, S.D. S&P: 0.0127, Cor: 0.625. If we scale all returns by 100, then Mean IBM: 0.0206, Mean S&P: -0.00721, S.D. IBM: 2.25, S.D. S&P: 1.27, Cor: 0.625.
1.2.9 Covariance matrices
Think of X as a p-dimensional⁷ vector of random variables. Thus
\[
X = (X_1, X_2, X_3, \ldots, X_p)'.
\]

Footnote 6: Which says that for all reals \(x_1, x_2, y_1, y_2\),
\[
(x_1^2 + x_2^2)(y_1^2 + y_2^2) \geq (x_1y_1 + x_2y_2)^2.
\]
Footnote 7: It is somewhat inelegant that I have changed the dimension of the X vector from q to p. But it is of no consequence and it is more familiar to work with p-dimensional objects.
Then we define the covariance matrix of X as
\[
\mathrm{Cov}(X) = \begin{pmatrix}
\mathrm{Var}(X_1) & \mathrm{Cov}(X_1, X_2) & \cdots & \mathrm{Cov}(X_1, X_p) \\
\mathrm{Cov}(X_2, X_1) & \mathrm{Var}(X_2) & \cdots & \mathrm{Cov}(X_2, X_p) \\
\vdots & \vdots & \ddots & \vdots \\
\mathrm{Cov}(X_p, X_1) & \mathrm{Cov}(X_p, X_2) & \cdots & \mathrm{Var}(X_p)
\end{pmatrix}.
\]
This is a symmetric p × p matrix. Notice this notation is slightly confusing, for we write Cov(X) of a vector X as a matrix and Cov(A, B) of scalars A and B as a scalar. Hopefully the meaning will always be clear from the context.
The covariance matrix can be calculated as
\[
\mathrm{Cov}(X) = E[\{X - E(X)\}\{X - E(X)\}'].
\]
If you do not see this result immediately, think of the p = 2 case. Then
\[
\{X - E(X)\}\{X - E(X)\}' = \begin{pmatrix} X_1 - E(X_1) \\ X_2 - E(X_2) \end{pmatrix}\begin{pmatrix} X_1 - E(X_1) & X_2 - E(X_2) \end{pmatrix}
= \begin{pmatrix} \{X_1 - E(X_1)\}^2 & \{X_1 - E(X_1)\}\{X_2 - E(X_2)\} \\ \{X_2 - E(X_2)\}\{X_1 - E(X_1)\} & \{X_2 - E(X_2)\}^2 \end{pmatrix}.
\]
Taking expectations of this delivers the required result, for
\[
E[\{X - E(X)\}\{X - E(X)\}'] = \begin{pmatrix} \mathrm{Var}(X_1) & \mathrm{Cov}(X_1, X_2) \\ \mathrm{Cov}(X_2, X_1) & \mathrm{Var}(X_2) \end{pmatrix}.
\]
Example 1.22 In the IBM and S&P example we have approximately that
\[
E(X) = \begin{pmatrix} 0.0206 \\ -0.00721 \end{pmatrix}, \quad
\mathrm{Cov}(X) = \begin{pmatrix} 5.07 & 1.79 \\ 1.79 & 1.62 \end{pmatrix}.
\]
A very important result is that if a is q × 1 and B is a q × p matrix of constants, then
\[
E(a + BX) = a + BE(X), \quad \mathrm{Cov}(a + BX) = B\,\mathrm{Cov}(X)\,B'.
\]

Example 1.23 Suppose X is a return vector on p assets. Then if we construct a portfolio with weights \(b = (b_1, b_2, \ldots, b_p)'\), the return on the portfolio will be \(b'X\). Thus the expected return will be
\[
E(b'X) = b'E(X) \quad \text{and} \quad \mathrm{Var}(b'X) = b'\,\mathrm{Cov}(X)\,b.
\]
Example 1.24 Think of IBM returns minus S&P returns,
\[
Y = X_1 - X_2.
\]
Then
\[
E(Y) = (1, -1)\begin{pmatrix} 0.0206 \\ -0.00721 \end{pmatrix} = 0.0278,
\]
\[
\mathrm{Var}(Y) = (1, -1)\begin{pmatrix} 5.07 & 1.79 \\ 1.79 & 1.62 \end{pmatrix}\begin{pmatrix} 1 \\ -1 \end{pmatrix}
= (1, -1)\begin{pmatrix} 3.28 \\ 0.17 \end{pmatrix} = 3.11.
\]
The second of the results above implies that Cov(X) ≥ 0, i.e. the covariance matrix is positive semi-definite, for if we think of an arbitrary non-stochastic vector z, then
\[
z'\mathrm{Cov}(X)z = \mathrm{Cov}(z'X) = \mathrm{Var}(z'X) \geq 0.
\]
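The arithmetic of Example 1.24 can be reproduced directly with numpy, using the approximate IBM/S&P moments from Example 1.22.

```python
# Hedged sketch: reproduce Example 1.24 (long IBM, short S&P) with numpy.
import numpy as np

mu = np.array([0.0206, -0.00721])        # approximate E(X), Example 1.22
Sigma = np.array([[5.07, 1.79],
                  [1.79, 1.62]])         # approximate Cov(X)
b = np.array([1.0, -1.0])                # weights for Y = X1 - X2

mean_Y = b @ mu                          # about 0.0278
var_Y = b @ Sigma @ b                    # 5.07 - 2*1.79 + 1.62 = 3.11
```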
Exercise 1.14 Show that
\[
\mathrm{Var}\Bigl(\frac{1}{n}\sum_{i=1}^n Y_i\Bigr)
= E\Bigl\{\frac{1}{n}\sum_{i=1}^n Y_i - \frac{1}{n}\sum_{i=1}^n E(Y_i)\Bigr\}^2
= E\Bigl[\frac{1}{n}\sum_{i=1}^n \{Y_i - E(Y_i)\}\Bigr]^2
= \frac{1}{n^2}\sum_{i=1}^n\sum_{j=1}^n \mathrm{Cov}(Y_i, Y_j).
\]
What happens to this result if the \(Y_i\) are independent? Further, what happens to it if the \(Y_i\) are i.i.d. (independent and identically distributed) with variance \(\sigma^2\)? On the other hand, what happens when
\[
\mathrm{Cov}(Y_i, Y_{i+s}) = \begin{cases} \sigma^2 & s = 0, \\ \rho\sigma^2 & |s| = 1, \\ 0 & |s| > 1? \end{cases}
\]
Comment on this.
We also introduce the notation
\[
\mu_r' = E[\{X - E(X)\}^r], \quad r = 2, 3, \ldots
\]
to denote the r-th central moment (about the mean).
Block covariance matrices
Think about the block structure
\[
X = \begin{pmatrix} Y \\ Z \end{pmatrix},
\]
where Y and Z are vectors. Then often the following block covariance is defined:
\[
\mathrm{Cov}(Y, Z) = E[\{Y - E(Y)\}\{Z - E(Z)\}'].
\]
The dimension of this matrix is dim(Y) × dim(Z). Clearly \(\mathrm{Cov}(Y, Z) = \mathrm{Cov}(Z, Y)'\). Confusingly this gives us yet another meaning for the Cov notation! You just have to use the context to work out which is meant.
This structure means that
\[
\mathrm{Cov}(X) = E\left[\left\{\begin{pmatrix} Y \\ Z \end{pmatrix} - E\begin{pmatrix} Y \\ Z \end{pmatrix}\right\}\left\{\begin{pmatrix} Y \\ Z \end{pmatrix} - E\begin{pmatrix} Y \\ Z \end{pmatrix}\right\}'\right]
= \begin{pmatrix} \mathrm{Cov}(Y, Y) & \mathrm{Cov}(Y, Z) \\ \mathrm{Cov}(Z, Y) & \mathrm{Cov}(Z, Z) \end{pmatrix}
= \begin{pmatrix} \mathrm{Cov}(Y) & \mathrm{Cov}(Y, Z) \\ \mathrm{Cov}(Z, Y) & \mathrm{Cov}(Z) \end{pmatrix}.
\]
Then we say Y ⊥ Z, that is the vectors are uncorrelated, if Cov(Y, Z) is a matrix of zeros.
Correlation matrices
Corresponding to the covariance matrix is the correlation matrix, which is (when it exists)
\[
\mathrm{Cor}(X) = \begin{pmatrix}
1 & \mathrm{Cor}(X_1, X_2) & \cdots & \mathrm{Cor}(X_1, X_p) \\
\mathrm{Cor}(X_2, X_1) & 1 & \cdots & \mathrm{Cor}(X_2, X_p) \\
\vdots & \vdots & \ddots & \vdots \\
\mathrm{Cor}(X_p, X_1) & \mathrm{Cor}(X_p, X_2) & \cdots & 1
\end{pmatrix} = \mathrm{Cor}(X)'.
\]
This matrix is invariant to location and scale changes, but obviously not to general linear transformations. Clearly Cor(X) has to be positive semi-definite.
1.2.10 Back to distributions
General discussion
Figure 1.6 gives, in the upper plate, two examples of bivariate densities. In fact here the two variables are independent and so
\[
f_X(x_1, x_2) = f_{X_1}(x_1) f_{X_2}(x_2),
\]
the product of the two marginal densities. The left graph is the normal case, the second a non-normal case with heavier tails. It is interesting to look at the log-densities. The normal one is quadratic. The non-normal one is closer to linear in the tails.

Figure 1.7 is an example of the above in practice. It displays the estimated joint density of daily changes in the DM and FF against the US Dollar in the last 6 years before the advent of the Euro. The left plates are the normal fit, the right plates a more flexible non-normal model. The picture shows a tight correlation between these two currency movements (expected given they were about to be merged), while the flexible distribution suggests the tails of the fit are much more linear than the quadratic assumed in the normal model.
Figure 1.7: Fit of bivariate Gaussian and GH models for the DM and FF against the Dollar.(a) ML fit of bivariate Gaussian density, (c) gives the log of this density. (b) ML fit of bivariateGH density, (d) gives the log of this density. Code: em gh mult.
Multivariate normal
The p-dimensional \(X \sim N(\mu, \Sigma)\), with \(E(X) = \mu\) and \(\mathrm{Cov}(X) = \Sigma\). Assume \(\Sigma > 0\) (positive definite⁸); \(\Sigma\) is always symmetric of course. Then
\[
f_X(x) = |2\pi\Sigma|^{-1/2} \exp\left\{-\frac{1}{2}(x - \mu)'\Sigma^{-1}(x - \mu)\right\}, \quad x \in \mathbb{R}^p.
\]
Here \(\Sigma^{-1}\) is a matrix inverse, which exists due to the \(\Sigma > 0\) assumption. Further
\[
|2\pi\Sigma|^{-1/2} = (2\pi)^{-p/2}|\Sigma|^{-1/2}.
\]

Footnote 8: Recall \(\Sigma > 0\) iff for all \(z \neq 0\) we have that \(z'\Sigma z > 0\).

Example 1.25 Suppose
\[
\Sigma = \sigma^2 I_p,
\]
so the elements of X are independent and homoskedastic. Then
\[
f_X(x) = (2\pi\sigma^2)^{-p/2}\exp\left\{-\frac{1}{2\sigma^2}(x - \mu)'(x - \mu)\right\}
= (2\pi\sigma^2)^{-p/2}\exp\left\{-\frac{1}{2\sigma^2}\sum_{i=1}^p (x_i - \mu_i)^2\right\}.
\]
The density has a single peak, at \(\mu\). The cumulant function of a multivariate normal is
\[
K(\theta) = \log E(e^{\theta'X}) = \theta'\mu + \frac{1}{2}\theta'\Sigma\theta. \tag{1.10}
\]
Here \(\theta\) is a vector. If a is q × 1 and B is q × p and both are constant, then
\[
Y = a + BX \sim N(a + B\mu, B\Sigma B'), \tag{1.11}
\]
a q-dimensional normal. That is, linear combinations of normals are normal.
Exercise 1.15 Prove this result using (1.10).
Example 1.26 In particular if p = 2, then for \(X = (X_1, X_2)'\),
\[
\Sigma = \mathrm{Cov}(X) = \begin{pmatrix} \mathrm{Var}(X_1) & \mathrm{Cov}(X_1, X_2) \\ \mathrm{Cov}(X_1, X_2) & \mathrm{Var}(X_2) \end{pmatrix}
= \begin{pmatrix} \sigma_1^2 & \rho\sigma_1\sigma_2 \\ \rho\sigma_1\sigma_2 & \sigma_2^2 \end{pmatrix},
\]
where \(\rho = \mathrm{Cor}(X_1, X_2)\). This is an important model, often written as
\[
\begin{pmatrix} X \\ Y \end{pmatrix} \sim N\left\{\begin{pmatrix} \mu_x \\ \mu_y \end{pmatrix}, \begin{pmatrix} \sigma_x^2 & \rho\sigma_x\sigma_y \\ \rho\sigma_x\sigma_y & \sigma_y^2 \end{pmatrix}\right\}.
\]
Quadratic forms in normals
SupposeX N(0, Ip),
where Ip is a p-dimensional identity matrix, then
X X =pi=1
X2i 2p,
a chi-squared variable with p degrees of freedom.Suppose
X N(0, Ip),
36 CHAPTER 1. PROBABILITY
where Ip is a p-dimensional identity matrix. Suppose is a p p idempotent matrix,which means that
= ,
then9
X X 2tr().
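The first of these quadratic-form results can be checked by simulation: the mean and variance of \(X'X\) should match those of a \(\chi^2_p\) variable (p and 2p). The choice p = 5 below is illustrative.

```python
# Hedged sketch: for X ~ N(0, I_p), X'X behaves like chi-squared with p
# degrees of freedom (mean p, variance 2p).
import numpy as np

rng = np.random.default_rng(5)
p, reps = 5, 400_000
X = rng.standard_normal((reps, p))
q = (X ** 2).sum(axis=1)     # X'X for each replication

mean_q = q.mean()            # close to p = 5
var_q = q.var()              # close to 2p = 10
```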
1.2.11 Conditional distributions
Basics
Consider two (multivariate) random variables X, Y; then
\[
F_{X|Y=y}(x) = \Pr(X \le x \mid Y = y) = \frac{\Pr(X \le x, Y = y)}{\Pr(Y = y)}.
\]
This is somewhat tricky for continuous random variables, as then \(\Pr(Y = y) = 0\). Likewise the conditional density is
\[
f_{X|Y=y}(x) = \frac{\partial F_{X|Y=y}(x)}{\partial x} = \frac{f_{X,Y}(x, y)}{f_Y(y)}.
\]
Footnote 9: Recall that for a symmetric matrix A the eigenvalue decomposition means we can write \(A = P\Lambda P'\), where \(\Lambda\) is a diagonal matrix of the eigenvalues of A and the matrix of eigenvectors P satisfies \(P'P = I\). The idempotency of A means
\[
AA = P\Lambda P' P\Lambda P' = P\Lambda\Lambda P' = P\Lambda P' = A,
\]
which can only hold if \(\Lambda\) is a matrix of ones and zeros. Now
\[
Y \overset{law}{=} P'X \sim N(0, P'P) = N(0, I_p),
\]
so
\[
X'AX \overset{law}{=} Y'\Lambda Y = \sum_{j=1}^p \chi^2_1\, I(\Lambda_{jj} = 1) \sim \chi^2_{\mathrm{tr}(A)},
\]
as required.

Remark 1.2 I do not know of a trivial proof of this result. The following informal argument is relatively easy to follow though:
\[
f_{X|Y=y}(x) = \frac{\partial F_{X|Y=y}(x)}{\partial x}
= \lim_{\Delta y \downarrow 0} \frac{\partial}{\partial x}\Pr(X \le x \mid y \le Y \le y + \Delta y)
= \lim_{\Delta y \downarrow 0} \frac{\partial}{\partial x}\frac{\Pr(X \le x, \ y \le Y \le y + \Delta y)}{\Pr(y \le Y \le y + \Delta y)}
\]
\[
= \lim_{\Delta y \downarrow 0} \frac{\partial}{\partial x}\frac{F_{X,Y}(x, y + \Delta y) - F_{X,Y}(x, y)}{F_Y(y + \Delta y) - F_Y(y)}
= \frac{\partial}{\partial x}\frac{\partial F_{X,Y}(x, y)/\partial y}{\partial F_Y(y)/\partial y}
= \frac{\partial^2 F_{X,Y}(x, y)/\partial y\,\partial x}{f_Y(y)}
= \frac{f_{X,Y}(x, y)}{f_Y(y)}.
\]
If X, Y are independent then
\[
\Pr(X \le x, Y \le y) = \Pr(X \le x)\Pr(Y \le y)
\]
and
\[
f_{X,Y}(x, y) = f_X(x)f_Y(y).
\]
Independence implies Cov(X, Y) = 0, as E(XY) = E(X)E(Y), but Cov(X, Y) = 0 does not imply independence (e.g. stock returns are basically uncorrelated through time, but not independent: big movements usually follow big movements).
Example 1.27 Bivariate normal. In this case
\[
\begin{pmatrix} X \\ Y \end{pmatrix} \sim N\left\{\begin{pmatrix} \mu_X \\ \mu_Y \end{pmatrix}, \begin{pmatrix} \sigma_X^2 & \sigma_{XY} \\ \sigma_{XY} & \sigma_Y^2 \end{pmatrix}\right\};
\]
then
\[
Y \mid X = x \sim N\left\{\mu_Y + \frac{\sigma_{YX}}{\sigma_X^2}(x - \mu_X),\ \sigma_Y^2 - \frac{\sigma_{YX}^2}{\sigma_X^2}\right\}
= N\left\{\mu_Y + \rho\frac{\sigma_Y}{\sigma_X}(x - \mu_X),\ \sigma_Y^2(1 - \rho^2)\right\}.
\]
An illustration of this is where Y is the return on an asset and X is the return on the market portfolio (i.e. a broadly based market index). Then
\[
\beta_{Y|X} = \rho\frac{\sigma_Y}{\sigma_X}
\]
is often called the beta of Y and is a measure of how Y moves with the market. So if \(\beta_{Y|X}\) is bigger than one, then Y is regarded as being sensitive to overall market moves, while when \(\beta_{Y|X}\) is small then Y behaves pretty differently from the market index.
BIG picture. The conditional variance does not depend upon x. The change in the conditional mean is
\[
\rho\frac{\sigma_Y}{\sigma_X}(x - \mu_X),
\]
so it is linear in x. The effect is measured relative to the mean, i.e. through \(x - \mu_X\). Dividing by \(\sigma_X\) removes the scale of x; multiplying by \(\sigma_Y\) puts the variable onto the y scale.

Obviously
\[
X \mid Y = y \sim N\left\{\mu_X + \rho\frac{\sigma_X}{\sigma_Y}(y - \mu_Y),\ \sigma_X^2(1 - \rho^2)\right\}.
\]
Notice the symmetry of this. We can condition on X or Y. Just because we write down X|Y or Y|X does not mean that X causes Y or vice versa. It says nothing about causality.
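The conditional-mean slope \(\rho\sigma_Y/\sigma_X\) (the "beta") can be recovered by least squares on simulated bivariate normal data; the parameter values below are illustrative.

```python
# Hedged sketch: simulate a bivariate normal pair and check that the
# least squares slope of Y on X is close to rho * sigma_Y / sigma_X.
import numpy as np

rng = np.random.default_rng(6)
rho, sig_x, sig_y = 0.6, 2.0, 1.5   # illustrative parameters
n = 400_000
x = sig_x * rng.standard_normal(n)
# build Y so that (X, Y) has the required correlation structure
y = rho * (sig_y / sig_x) * x + sig_y * np.sqrt(1 - rho ** 2) * rng.standard_normal(n)

beta_hat = np.cov(x, y, ddof=0)[0, 1] / np.var(x)  # sample beta of Y on X
beta_true = rho * sig_y / sig_x                    # 0.45
```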
More generally, consider two (multivariate) random variables X, Y. Suppose
\[
\begin{pmatrix} X \\ Y \end{pmatrix} \sim N\left\{\begin{pmatrix} \mu_X \\ \mu_Y \end{pmatrix}, \begin{pmatrix} \Sigma_{XX} & \Sigma_{XY} \\ \Sigma_{YX} & \Sigma_{YY} \end{pmatrix}\right\}, \quad \Sigma_{XX} > 0, \ \Sigma_{YY} > 0, \ \Sigma_{XY} = \Sigma_{YX}'.
\]
Then
\[
X \sim N(\mu_X, \Sigma_{XX}), \quad Y \sim N(\mu_Y, \Sigma_{YY}),
\]
and
\[
X \mid Y = y \sim N\{\mu_X + \Sigma_{XY}\Sigma_{YY}^{-1}(y - \mu_Y),\ \Sigma_{XX} - \Sigma_{XY}\Sigma_{YY}^{-1}\Sigma_{YX}\} = N\{\mu_{X|Y}, \Sigma_{X|Y}\}.
\]
That is, if variables are jointly normal, then marginally and conditionally they are normal.

Exercise 1.16 Prove this result using the fact that
\[
f_{X|Y=y}(x) = \frac{f_{X,Y}(x, y)}{f_Y(y)},
\]
and the partitioned inverse
\[
\begin{pmatrix} \Sigma_{XX} & \Sigma_{XY} \\ \Sigma_{YX} & \Sigma_{YY} \end{pmatrix}^{-1} = \begin{pmatrix} \Sigma^{XX} & \Sigma^{XY} \\ \Sigma^{YX} & \Sigma^{YY} \end{pmatrix},
\]
where \(\Sigma^{XX} = (\Sigma_{XX} - \Sigma_{XY}\Sigma_{YY}^{-1}\Sigma_{YX})^{-1}\), \(\Sigma^{XY} = -\Sigma^{XX}\Sigma_{XY}\Sigma_{YY}^{-1}\), \(\Sigma^{YX} = (\Sigma^{XY})'\) and \(\Sigma^{YY} = (\Sigma_{YY} - \Sigma_{YX}\Sigma_{XX}^{-1}\Sigma_{XY})^{-1}\).

Exercise 1.17 Prove that X ⊥ Y if and only if \(\Sigma_{XY}\) is a matrix of zeros.
Regression and conditional densities

Later we will discuss in some detail so-called regression methods, which relate the variation of one variable, Y, to the values of other variables, written X. In the case where X, Y are jointly Gaussian this amounts to considering the conditional density
\[
Y \mid X = x \sim N\left\{\mu_Y + \frac{\sigma_{YX}}{\sigma_X^2}(x - \mu_X),\ \sigma_Y^2 - \frac{\sigma_{YX}^2}{\sigma_X^2}\right\}.
\]
If our scientific interest centres on this conditional density, to see how Y may respond to variation in X, then it is often useful to use a different parameterisation. In particular, typically one writes
\[
Y \mid X = x \sim N\{\alpha_{Y|X} + \beta_{Y|X}x,\ \sigma_{Y|X}^2\},
\]
or for simplicity
\[
Y \mid X = x \sim N(\alpha + \beta x, \sigma^2).
\]
Here \(\alpha\) is usually referred to as the intercept, while \(\beta\) is the slope. In the case where X is multivariate, this takes on the more abstract form of the linear regression model
\[
Y \mid X = x \sim N(\beta'x, \sigma^2).
\]
Of course if the X, Y variables are not Gaussian then the above analysis looks rather inappropriate. In later lectures we will see that some aspects of this modelling framework are robust to departures from Gaussianity.

In corporate finance there is considerable interest in conditional regression models for binary variables, with
\[
\Pr(Y_i = 1 \mid X = x) = g(x).
\]
The best known model of this type is the logistic regression, where
\[
\Pr(Y_i = 1 \mid X = x) = \frac{\exp(\beta'x)}{1 + \exp(\beta'x)}.
\]
Notice in this case no direct attempt is made to model the joint distribution of X, Y; instead we have just decided that our interest focuses on the parameters which index the conditional probability. More details of this analysis will be given in the third part of these lectures.
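The logistic link above maps any index \(\beta'x\) into a probability in (0, 1). A minimal sketch, with a made-up coefficient vector and design (the function name `logit_prob` is ours, not from the notes):

```python
# Hedged sketch: evaluating Pr(Y = 1 | X = x) under a logistic regression.
import numpy as np

def logit_prob(x, beta):
    """Logistic link: exp(x'beta) / (1 + exp(x'beta))."""
    z = x @ beta
    return np.exp(z) / (1.0 + np.exp(z))

beta = np.array([0.5, -1.0])   # illustrative coefficients
x = np.array([[1.0, 0.0],      # intercept-style regressor plus one covariate
              [1.0, 1.0],
              [1.0, 2.0]])
probs = logit_prob(x, beta)    # probabilities fall as the covariate rises here
```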
Conditional moments
Recall
\[
f_{X|Y=y}(x) = \frac{f_{X,Y}(x, y)}{f_Y(y)}.
\]
Importantly,
\[
f_{X,Y}(x, y) = f_{X|Y=y}(x) f_Y(y).
\]
The definition of a conditional expectation is
\[
E_{X|Y=y}\{g(X)\} = \int g(x) f_{X|Y=y}(x)\,dx.
\]
Exercise 1.18 Prove that \(E_{X|Y=y}\{a + b\,g(X)\} = a + b\,E_{X|Y=y}\{g(X)\}\).

Now recall
\[
f_{X,Y}(x, y) = f_{X|Y=y}(x) f_Y(y).
\]
As a result we have that
\[
E_X\{g(X)\} = E_Y\{E_{X|Y=y}(g(X))\},
\]
the law of iterated expectations. Proof:
\[
E_X\{g(X)\} = \int g(x) f_X(x)\,dx
= \int\!\!\int g(x) f_{X,Y}(x, y)\,dx\,dy
= \int\!\!\int g(x) f_{X|Y=y}(x) f_Y(y)\,dx\,dy
\]
\[
= \int \left\{\int g(x) f_{X|Y=y}(x)\,dx\right\} f_Y(y)\,dy
= \int E_{X|Y=y}(g(X))\, f_Y(y)\,dy
= E_Y\{E_{X|Y}(g(X)|Y)\}.
\]

Exercise 1.19 Suppose \(E(X|Y = y) = a + by\) and \(E(Y) = c\); then \(E(X) = E_Y E_{X|Y}(X) = E_Y(a + bY) = a + bc\).
Likewise
\[
\mathrm{Var}_X(X) = E_Y\{\mathrm{Var}_{X|Y}(X|Y)\} + \mathrm{Var}_Y\{E_{X|Y}(X|Y)\}.
\]
Proof:
\[
\mathrm{Var}_X(X) = E_X(X^2) - \{E_X(X)\}^2 = E_Y E_{X|Y}(X^2) - \{E_Y E_{X|Y}(X)\}^2
\]
\[
= \bigl[E_Y E_{X|Y}(X^2) - E_Y\{E_{X|Y}(X)\}^2\bigr] + \bigl[E_Y\{E_{X|Y}(X)\}^2 - \{E_Y E_{X|Y}(X)\}^2\bigr]
\]
\[
= E_Y\bigl[E_{X|Y}(X^2) - \{E_{X|Y}(X)\}^2\bigr] + \mathrm{Var}_Y\{E_{X|Y}(X)\}
= E_Y\{\mathrm{Var}_{X|Y}(X|Y)\} + \mathrm{Var}_Y\{E_{X|Y}(X|Y)\}.
\]

Exercise 1.20 Suppose \(\mathrm{Var}(X|Y = y) = a + by^2\) and \(E(X|Y = y) = c + dy\). Then
\[
\mathrm{Var}_X(X) = E_Y(a + bY^2) + \mathrm{Var}_Y(c + dY) = a + bE_Y(Y^2) + d^2\mathrm{Var}_Y(Y).
\]
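The variance decomposition can be checked by Monte Carlo using the setup of Exercise 1.20; the values of a, b, c, d and the choice Y ~ N(0, 1) are ours, chosen for illustration.

```python
# Hedged sketch: Monte Carlo check of
# Var(X) = E{Var(X|Y)} + Var{E(X|Y)} with Y ~ N(0,1),
# E(X|Y) = c + dY and Var(X|Y) = a + bY^2.
import numpy as np

rng = np.random.default_rng(7)
a, b, c, d = 1.0, 0.5, 0.0, 2.0   # illustrative values
n = 1_000_000
y = rng.standard_normal(n)
x = c + d * y + np.sqrt(a + b * y ** 2) * rng.standard_normal(n)

var_x = x.var()
# a + b*E(Y^2) + d^2*Var(Y) = 1 + 0.5 + 4 = 5.5
predicted = a + b * 1.0 + d ** 2
```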
Example 1.28 The simple value at risk (VaR) calculation based on the quantile
\[
VaR_\alpha = F_X^{-1}(\alpha),
\]
where X is the return, has been heavily criticised in the literature as being incoherent. A widely used alternative is to calculate the so-called expected shortfall measure
\[
ES_\alpha = E\{X \mid X < F_X^{-1}(\alpha)\},
\]
which calculates the expected loss given that the move is in the bottom \(\alpha\) quantile. A relatively simple introduction to this is given in Acerbi and Tasche (2002).
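For a normal return the expected shortfall has a well known closed form, \(ES_\alpha = \mu - \sigma\,\phi(z_\alpha)/\alpha\), which a Monte Carlo version should match; the standard-normal case below is illustrative.

```python
# Hedged sketch: Monte Carlo expected shortfall for a N(0,1) return,
# compared with the closed form mu - sigma * phi(z_alpha) / alpha.
import math
import numpy as np

rng = np.random.default_rng(8)
mu, sigma, alpha = 0.0, 1.0, 0.05
x = mu + sigma * rng.standard_normal(2_000_000)

var_alpha = np.quantile(x, alpha)     # the alpha-quantile of returns
es_mc = x[x < var_alpha].mean()       # mean loss beyond the quantile

z = -1.6448536269514722               # standard normal 5% quantile
phi = math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)
es_exact = mu - sigma * phi / alpha   # about -2.063
```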
Martingale
In modelling dynamics in financial economics, martingales play a large role. They form the mathematical formulation of a risk-neutral fair game in financial economics. They are central to stochastic analysis and allow us to generalise central limit theories away from the i.i.d. case.

Consider a sequence of asset prices recorded through time,
\[
Y_1, Y_2, Y_3, \ldots,
\]
where the subscript reflects time. A natural object to study is
\[
E(Y_i \mid Y_1, \ldots, Y_{i-1}),
\]
the conditional expectation (which we assume exists) of the future given the past. Then if
\[
E(Y_i \mid Y_1, \ldots, Y_{i-1}) = Y_{i-1},
\]
the sequence is said to be a martingale with respect to its own past history¹⁰.

Example 1.29 A special case of the martingale is a random walk
\[
Y_i = Y_{i-1} + \varepsilon_i,
\]
where \(\varepsilon_i\) is i.i.d. with zero mean.

Example 1.30 Suppose
\[
Y_i = Y_{i-1} + \varepsilon_i, \quad \varepsilon_i \mid Y_1, \ldots, Y_{i-1} \sim N(0, \sigma_i^2),
\]
where
\[
\sigma_i^2 = \alpha_0 + \alpha_1\varepsilon_{i-1}^2 + \cdots + \alpha_p\varepsilon_{i-p}^2.
\]
Clearly, so long as \(E|Y_i| < \infty\), this is a martingale.
Copula
Consider a bivariate density for X, Y. Write, for all permissible points of support¹¹,
\[
G_{X,Y}(x, y) = \frac{F_{X,Y}(x, y)}{F_X(x) F_Y(y)}.
\]
Thus we can always write
\[
F_{X,Y}(x, y) = F_X(x) F_Y(y) G_{X,Y}(x, y).
\]
Then G solely controls the stochastic dependence between X and Y. If we assume that X and Y are continuous random variables, then knowing x gives us \(F_X(x)\) and vice versa. Thus we can find a function H such that
\[
H_{X,Y}(F_X(x), F_Y(y)) = G_{X,Y}(x, y),
\]
so
\[
F_{X,Y}(x, y) = F_X(x) F_Y(y) H_{X,Y}(F_X(x), F_Y(y)) = C(F_X(x), F_Y(y)),
\]
where C(·,·) is a copula function. Notice that \(F_X(x)\) and \(F_Y(y)\) live on [0, 1]. A copula is simply a bivariate distribution function for variables defined on the unit interval!

This looks really boring and contrived. However, what it has achieved is the separation of the modelling of the marginal distributions of X and Y, given through \(F_X(x)\) and \(F_Y(y)\), from their dependence, given by C(·,·).

Footnote 11: That is, we only work with x and y for which \(F_X(x) > 0\) and \(F_Y(y) > 0\).
A nice feature of the above setup is that, by the chain rule,
\[
\frac{\partial F_{X,Y}(x, y)}{\partial x} = \frac{\partial F_X(x)}{\partial x}\,\frac{\partial C(F_X(x), F_Y(y))}{\partial F_X(x)},
\]
so
\[
\frac{\partial^2 F_{X,Y}(x, y)}{\partial x\,\partial y} = \frac{\partial F_X(x)}{\partial x}\,\frac{\partial F_Y(y)}{\partial y}\,\frac{\partial^2 C(F_X(x), F_Y(y))}{\partial F_X(x)\,\partial F_Y(y)},
\]
or in other words
\[
f_{X,Y}(x, y) = f_X(x) f_Y(y)\, c(F_X(x), F_Y(y)),
\]
the product of the marginal densities times the copula density of two unit-interval variables. Our favourite paper on this topic is Patton (2004).
1.2.12 Bayes theorem
Bayes theorem (1.2) is important for random variables. It plays a large role in economic theory and some parts of econometrics. Hence we spell it out in some detail.

Clearly
\[
f_{X|Y=y}(x) = \frac{f_{X,Y}(x, y)}{f_Y(y)} = \frac{f_X(x) f_{Y|X=x}(y)}{f_Y(y)}.
\]
However, it is often easier to think of this in a more cut-down form as
\[
f_{X|Y=y}(x) = c\, f_X(x) f_{Y|X=x}(y),
\]
where c is a constant such that
\[
\int c\, f_X(x) f_{Y|X=x}(y)\,dx = 1.
\]
This allows one to do some calculations in cases where \(f_Y(y)\) is not known. This is a convenient trick.

Often \(f_X(x)\) is called a PRIOR and \(f_{Y|X=x}(y)\) the LIKELIHOOD. In such cases it is standard to call \(f_{X|Y=y}(x)\) the POSTERIOR density. Thus we have
\[
\text{posterior} \propto \text{prior} \times \text{likelihood}.
\]
Example 1.31 Suppose
\[
X \sim N(\mu_X, \sigma_X^2)
\]
and
\[
Y \mid X = x \sim N(x, \sigma^2);
\]
then
\[
X \mid Y = y \sim N(\mu_p, \sigma_p^2),
\]
where
\[
\mu_p = \sigma_p^2\left(\frac{1}{\sigma_X^2}\mu_X + \frac{1}{\sigma^2}y\right), \quad \sigma_p^2 = \left(\frac{1}{\sigma_X^2} + \frac{1}{\sigma^2}\right)^{-1}. \tag{1.12}
\]
To show this, write
\[
\log f_{X|Y=y}(x) = \log c + \log f_X(x) + \log f_{Y|X}(y|x) \tag{1.13}
\]
\[
= \log c - \frac{1}{2\sigma_X^2}(x - \mu_X)^2 - \frac{1}{2\sigma^2}(y - x)^2. \tag{1.14}
\]
This is quadratic in x, so X|Y must be Gaussian. The only task is to compute \(\mu_p\) and \(\sigma_p^2\). Of course, as it is a Gaussian density,
\[
\log f_{X|Y=y}(x) = \log c - \frac{1}{2\sigma_p^2}(x - \mu_p)^2. \tag{1.15}
\]
The quadratic terms in x in (1.15) and (1.14) are, respectively,
\[
-\frac{1}{2\sigma_p^2} \quad \text{and} \quad -\frac{1}{2\sigma_X^2} - \frac{1}{2\sigma^2}.
\]
This means that
\[
\sigma_p^2 = \left(\frac{1}{\sigma_X^2} + \frac{1}{\sigma^2}\right)^{-1}. \tag{1.16}
\]
Likewise, the linear terms in x in (1.15) and (1.14) are, respectively,
\[
\frac{1}{\sigma_p^2}\mu_p \quad \text{and} \quad \frac{1}{\sigma_X^2}\mu_X + \frac{1}{\sigma^2}y.
\]
This implies that
\[
\mu_p = \sigma_p^2\left(\frac{1}{\sigma_X^2}\mu_X + \frac{1}{\sigma^2}y\right). \tag{1.17}
\]
This means that the posterior mean weights the prior mean and the new observation y according to their variability. Special cases are where \(\sigma_X^2 \to \infty\) (the prior tells us nothing), in which case \(\mu_p \to y\), and where \(\sigma_X^2 \to 0\) (the prior tells us everything), in which case \(\mu_p \to \mu_X\).
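Formulas (1.16) and (1.17) are precision-weighted averages and are trivial to compute; the prior and data values below are made up for illustration.

```python
# Hedged sketch: the normal-normal posterior of (1.16)-(1.17).
mu_X, sig2_X = 1.0, 4.0   # prior mean and variance (illustrative)
y, sig2 = 3.0, 1.0        # observation and its variance (illustrative)

sig2_p = 1.0 / (1.0 / sig2_X + 1.0 / sig2)   # posterior variance = 0.8
mu_p = sig2_p * (mu_X / sig2_X + y / sig2)   # posterior mean = 2.6

# mu_p lies between the prior mean and the observation, closer to the
# more precise of the two (here the data, since sig2 < sig2_X).
```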
1.3 Estimators
1.3.1 Introduction
A statistic S(X) is a function of a (vector) random variable X. When we use the statistic to learn about a feature of the probability model, we say we are estimating the model. We call the random version of this function, S(X), an estimator, while if we take this function of an observed vector we call it an estimate, S(x).
Example 1.32 A simple example of this is
\[
S(X) = \frac{1}{n}\sum_{i=1}^n X_i.
\]
If \(X_i \sim NID(\mu, \sigma^2)\) then, using the fact that S(X) is a linear combination of normals, we have that
\[
S(X) \sim N\left(\mu, \frac{\sigma^2}{n}\right).
\]
If n is very large the estimator is very close to \(\mu\), the average value of the normal distribution.
1.3.2 Bias and mean square error of estimators
Suppose we design an estimator to estimate some quantity \(\theta\). Then we might wish for S(X) to be close to \(\theta\) on average. One way of thinking about this is to record the
\[
\text{bias:} \quad E\{S(X)\} - \theta.
\]

Example 1.33 If \(X_i \sim NID(\mu, \sigma^2)\) then
\[
\bar{X} = \frac{1}{n}\sum_{i=1}^n X_i,
\]
the sample mean (sample average), has zero bias.

When the bias is zero, the estimator is said to be unbiased. Unbiased estimators may be very imprecise, for they can have a very large dispersion. One way of measuring the imprecision of an estimator is through the mean square error criterion
\[
\text{mse:} \quad E\{S(X) - \theta\}^2 = \mathrm{Var}\{S(X)\} + [E\{S(X)\} - \theta]^2.
\]
A more precise estimator may in fact be biased.
Exercise 1.21 Estimate \(\sigma^2\) using a random sample from \(NID(\mu, \sigma^2)\) by
\[
S(X) = \frac{1}{n - k}\sum_{i=1}^n (X_i - \bar{X})^2.
\]
Show this has minimum mse when k = -1, while it is unbiased when k = 1. HINT: Note that
\[
\sum_{i=1}^n (X_i - \mu)^2 = \sum_{i=1}^n (X_i - \bar{X})^2 + n(\bar{X} - \mu)^2,
\]
and that \(\sum_{i=1}^n (X_i - \mu)^2/\sigma^2 \sim \chi^2_n\) while \(\sqrt{n}(\bar{X} - \mu)/\sigma \sim N(0, 1)\).
1.3.3 Histogram
Histograms are a popular way of getting a simple impression of the density of a random sample of some random variable X. This is based on constructing bins of the support of X:
\[
c_0 < c_1 < c_2 < \cdots < c_J.
\]
The histogram tries to estimate
\[
f_i = \frac{1}{c_i - c_{i-1}}\Pr(X \in (c_{i-1}, c_i]) = \frac{1}{c_i - c_{i-1}} E\{I(X \in (c_{i-1}, c_i])\},
\]
which for bins of short width approximates the density of X. The histogram estimator of this, based on the random variables \(X_1, X_2, \ldots, X_n\), is
\[
\widehat{f}_i = \frac{1}{n(c_i - c_{i-1})}\sum_{j=1}^n I(X_j \in (c_{i-1}, c_i]).
\]
Examples of histograms are given in Figures 1.2 and 1.5.
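The indicator formula for the histogram estimator can be computed directly with numpy; the standard normal sample and bin grid below are illustrative choices.

```python
# Hedged sketch: the histogram density estimator for N(0,1) draws,
# computed as (1/n) * count in bin / bin width.
import numpy as np

rng = np.random.default_rng(9)
x = rng.standard_normal(100_000)
bins = np.linspace(-4.0, 4.0, 41)     # c_0 < c_1 < ... < c_J (40 equal bins)

counts, edges = np.histogram(x, bins=bins)
width = edges[1] - edges[0]
f_hat = counts / (len(x) * width)     # density estimate per bin

# near zero this should be close to the N(0,1) density 1/sqrt(2*pi) ~ 0.399
centre_bin = f_hat[len(f_hat) // 2]
```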
1.3.4 Bayes theorem*
We saw that when \(X_i \sim NID(\mu, \sigma^2)\), \(\bar{X}\) estimates \(\mu\), with
\[
\bar{X} \sim N(\mu, \sigma^2/n).
\]
Thus the statistic is centred around a fixed, unknown constant \(\mu\). An alternative approach to carrying out statistical calculations is to regard \(\mu\) as unknown, with uncertainty expressed via a prior \(\mu \sim N(\mu_0, \sigma_0^2)\). For simplicity of exposition suppose \(\sigma^2\) is known. Then we can calculate
\[
f(\mu \mid X_1, X_2, \ldots, X_n) \propto f(\mu) f(X_1, X_2, \ldots, X_n \mid \mu).
\]
The calculations in 1.2.12 then show that
\[
\mu \mid X_1, X_2, \ldots, X_n \sim N(\mu_p, \sigma_p^2),
\]
with
\[
\mu_p = \sigma_p^2\left(\frac{1}{\sigma_0^2}\mu_0 + \frac{1}{\sigma^2}\sum_{i=1}^n X_i\right), \quad \sigma_p^2 = \left(\frac{1}{\sigma_0^2} + \frac{n}{\sigma^2}\right)^{-1}.
\]
Here the statistic \(\sum_{i=1}^n X_i\) appears in the posterior mean.
1.4 Simulating random variables
Cameron and Trivedi (2005, Ch. 12.7)

A crucial feature of modern econometrics is that much of it depends upon simulation, that is, producing random variables from known distribution functions. Further, simulation is increasingly used in asset pricing theory to study analytically intractable models, e.g. complicated options. The advantage of simulation is that it allows us to easily understand quite complicated transformations of random variables, which are sometimes extremely hard to handle using the change of variable method.
In this section we will discuss the basic probability theory associated with simulation. Textbook expositions of this material include Ripley (1987) and Devroye (1986).
1.4.1 Pseudo random numbers
All of the simulation methods we will discuss here will be built out of draws based on a sequence of independent and identically distributed (standard) uniform (pseudo¹²) random numbers \(\{U_i \in [0, 1]\}\). The problem of producing such numbers will be regarded as solved in these notes. An example is given below:

U_i
.734
.452
.234
.123
.987
1.4.2 Discrete random variables
Suppose we wish to simulate a Bernoulli trial \(X_i\) on a computer with \(\Pr(X_i = 1) = p\). We set \(X_i = 0\) if \(U_i \le 1 - p\), and \(X_i = 1\) if \(U_i > 1 - p\). Then \(\Pr(X_i = 0) = \Pr(U_i \le 1 - p) = 1 - p\) and \(\Pr(X_i = 1) = \Pr(U_i > 1 - p) = p\).

Another way of writing this is through the distribution function. Let
\[
F_X(x) = \Pr(X \le x) = \begin{cases} 0 & x < 0, \\ 1 - p & 0 \le x < 1, \\ 1 & 1 \le x. \end{cases}
\]
It would be useful if we could invert \(F_X(x)\), i.e. for a given \(u \in [0, 1]\) find an x such that \(u = F_X(x)\). But the function has jumps and so a straightforward inverse does not exist. We can, however, define a unique generalised inverse (the smallest value of x such that \(u \le F_X(x)\)):
\[
F_X^-(u) = \begin{cases} 0 & u \le 1 - p, \\ 1 & u > 1 - p. \end{cases}
\]
Then
\[
X_i = F_X^-(U_i)
\]
produces the Bernoulli trial as required. Using the same underlying uniforms as in the previous subsection produces, when p = 1/2,

U_i    X_i
.734   1
.452   0
.234   0
.093   0
.987   1
When p = 0.9 the same underlying uniforms produce

U_i    X_i
.734   1
.452   1
.234   1
.093   0
.987   1

Footnote 12: In the simulation literature these uniforms are sometimes generated using natural random numbers (e.g. depending upon atomic decay), but mostly the uniforms are based on approximations (see Ripley (1987)). These approximations are typically deterministic sequences which are sufficiently chaotic that it is very hard to discern patterns in them, even with massively large samples of draws.
Example 1.34 Suppose that the support of a random variable X, with density \(f_X(x) = \Pr(X = x)\), is finite. For simplicity suppose it is the points 0, 1, 2. Then
\[
F_X^-(u) = \begin{cases} 0 & u \in [0, \Pr(X_i = 0)], \\ 1 & u \in (\Pr(X_i = 0), \Pr(X_i \le 1)], \\ 2 & u \in (\Pr(X_i \le 1), 1]. \end{cases}
\]
So we can sample from this using \(X_i = F_X^-(U_i)\). Such an algorithm will be very fast. E.g. with \(\Pr(X_i = 0) = 0.2\), \(\Pr(X_i = 1) = 0.4\), \(\Pr(X_i = 2) = 0.4\):

U_i    X_i
.734   2
.452   1
.234   1
.093   0
.987   2

When the support of X has a lot of points this algorithm may become quite expensive and more involved methods may become necessary.
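Example 1.34 amounts to locating each uniform within the cumulative probabilities, which `numpy.searchsorted` does directly; the code below reproduces the table above.

```python
# Hedged sketch: inverse-cdf sampling for the three-point distribution
# of Example 1.34, using the uniforms stated in the notes.
import numpy as np

probs = np.array([0.2, 0.4, 0.4])   # Pr(X=0), Pr(X=1), Pr(X=2)
cdf = np.cumsum(probs)              # 0.2, 0.6, 1.0
u = np.array([0.734, 0.452, 0.234, 0.093, 0.987])

# smallest x with u <= F(x): searchsorted with side='left' on the cdf
draws = np.searchsorted(cdf, u, side='left')
# draws reproduce the table: 2, 1, 1, 0, 2
```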
1.4.3 Inverting distribution functions
Given a sequence of i.i.d. uniforms we can produce i.i.d. draws from any continuous distribution \(F_X(x)\). The argument goes like this. As \(U_i\) is uniform,
\[
\Pr(U_i \le F_X(x)) = F_X(x).
\]
Thus
\[
\Pr(U_i \le F_X(x)) = \Pr(F_X^{-1}(U_i) \le x) = \Pr(X_i \le x),
\]
if we take
\[
X_i = F_X^{-1}(U_i). \tag{1.18}
\]
Thus we can produce random numbers from any continuous distribution by plugging the uniforms into the quantile function (1.18).

Example 1.35 The exponential distribution. Recall \(F_X(x) = 1 - \exp(-\theta^{-1}x)\), and so the quantile function is
\[
F_X^{-1}(p) = -\theta\log(1 - p).
\]
Hence
\[
-\theta\log(1 - U_i)
\]
are i.i.d. exponential draws. E.g. with \(\theta = 1\):
U_i    X_i
.734   1.324
.452   0.601
.234   0.266
.093   0.097
.987   4.343

This method is convenient for a large number of problems where we can evaluate the quantile function inexpensively. However, for many problems this is not true. An example is the normal distribution, where the distribution function is very expensive to evaluate (and the quantile function even more so).
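The exponential table above is easy to reproduce; the code below applies the quantile transform to the stated uniforms with θ = 1 (the table in the notes appears to truncate rather than round the third decimal).

```python
# Hedged sketch: exponential draws via the quantile function (1.18).
import numpy as np

theta = 1.0
u = np.array([0.734, 0.452, 0.234, 0.093, 0.987])
x = -theta * np.log(1.0 - u)   # F^{-1}(u) for the exponential
# approximately: 1.3243, 0.6015, 0.2666, 0.0976, 4.3428
```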
1.5 Asymptotic approximation
Cameron and Trivedi (2005, Ch. A)
1.5.1 Motivation
Example 1.36 Classical convergence:
\[
X_n = 3 + \frac{1}{n} \to 3
\]
as \(n \to \infty\).

Example 1.37 Things are a little more fuzzy when we think of
\[
X_n = 3 + \frac{Y}{n} \overset{?}{\to} 3,
\]
where Y is a random variable. There are different measures of convergence; some need moments, others don't.
Distribution theory can become very complicated and sometimes impossible. As a result we often employ approximations. Many different types of approximation are possible. The dominant way of constructing approximations is to think of the error induced by using the approximation as the sample size gets large; such approximations are called asymptotic approximations (asymptotic in n). This is a particularly attractive idea if we are estimating a parameter and we wish our estimator to become more precise as we increase the sample size. Two basic results are used in this literature: the law of large numbers and the central limit theorem. Such approximations are examples of the more general concepts of convergence in probability and convergence in distribution, which will be presented in some detail. Additional references are Gallant (1997, Ch. 4), Grimmett and Stirzaker (2001), Newey and McFadden (1994) and McCabe and Tremayne (1993).
Formally we will think of a sequence of random variables \(X_1, X_2, \ldots, X_n\) which, as n gets large, will be such that \(X_n\) will behave like some other random variable or constant X.
Example 1.38 We are interested in
\[
X_n = \frac{1}{n}\sum_{j=1}^n Y_j.
\]
This forms a sequence:
\[
X_1 = Y_1, \quad X_2 = \frac{1}{2}(Y_1 + Y_2), \quad X_3 = \frac{1}{3}(Y_1 + Y_2 + Y_3).
\]
What does \(\frac{1}{n}\sum_{j=1}^n Y_j\) behave like for large n? What does \(X_n\) converge to for large n?
1.5.2 Definitions
We think of a sequence of random variables \(\{X_n\}\) and we ask if the distance between \(\{X_n\}\) and some other variable X (in this context X is almost always a constant in econometrics; guide your thinking by making that assumption) gets small as n goes to infinity. We can measure smallness in many ways and so there are lots of different notions of convergence. We discuss three, the second of which will be the most important for us.
Definition 1.1 (Convergence in mean square) Let X and \(X_1, X_2, \ldots\) be random variables. If
\[
\lim_{n\to\infty} E(X_n - X)^2 = 0,
\]
then the sequence \(X_1, X_2, \ldots\) is said to converge in mean square to the random variable X. A shorthand notation is
\[
X_n \overset{m.s.}{\to} X. \tag{1.19}
\]
Exercise 1.22 Show that necessary and sufficient conditions for \(X_n \overset{m.s.}{\to} X\) are that
\[
\lim_{n\to\infty} E(X_n - X) = 0 \ \text{[asymptotic unbiasedness]} \quad \text{and} \quad \lim_{n\to\infty} \mathrm{Var}(X_n - X) = 0.
\]
Example 1.39 Suppose \(Y_1, \ldots, Y_n\) are i.i.d. with mean \(\mu\) and variance \(\sigma^2\). Then define
\[
X_n = \frac{1}{n}\sum_{i=1}^n Y_i,
\]
which has
\[
E(X_n) = \frac{1}{n}\sum_{i=1}^n E(Y_i) = \mu,
\]
and
\[
\mathrm{Var}(X_n) = \frac{1}{n^2}\mathrm{Var}\Bigl(\sum_{i=1}^n Y_i\Bigr) = \frac{1}{n^2}\sum_{i=1}^n \mathrm{Var}(Y_i) = \frac{1}{n}\sigma^2.
\]
Hence \(X_n\) is unbiased and its variance goes to zero. Hence
\[
X_n \overset{m.s.}{\to} \mu.
\]
Definition 1.2 (Convergence in probability) If for all \(\varepsilon, \delta > 0\) there exists an \(n_0\) such that
\[
\Pr(|X_n - X| < \varepsilon) > 1 - \delta, \quad \forall n > n_0,
\]
then the sequence \(X_1, X_2, \ldots\) is said to converge in probability to the random variable X. A shorthand notation is
\[
X_n \overset{p}{\to} X. \tag{1.20}
\]

Exercise 1.23 Prove that mean square convergence is stronger than convergence in probability by using Chebyshev's inequality:
\[
\Pr(|X_n - X| > \varepsilon) \le \frac{1}{\varepsilon^r} E|X_n - X|^r, \quad \text{for all } \varepsilon > 0.
\]
Definition 1.3 (Convergence almost surely) Let $X$ and $X_1, X_2, \ldots$ be random variables. If, for all $\epsilon, \delta > 0$, there exists an $n_0$ s.t.
$$\Pr(|X_n - X| < \epsilon, \ \forall n > n_0) > 1 - \delta,$$
then we say that $\{X_n\}$ converges almost surely to $X$, which we write as $X_n \overset{a.s.}{\to} X$.
Almost sure convergence is about ensuring that the joint behaviour of all events for $n > n_0$ is well behaved. Convergence in probability just looks at the marginal probabilities for each $n$.
Exercise 1.24 Prove that
$$X_n \overset{a.s.}{\to} X \implies X_n \overset{p}{\to} X.$$
Further note that $X_n \overset{a.s.}{\to} X$ neither implies nor is implied by $X_n \overset{m.s.}{\to} X$.
Theorem 1.1 Weak Law of Large Numbers (WLLN). Let $X_i$ be i.i.d., and let $\mathrm{E}(X_i)$ and $\mathrm{Var}(X_i)$ exist. Then
$$\frac{1}{n}\sum_{i=1}^{n} X_i \overset{p}{\to} \mathrm{E}(X_i),$$
as $n \to \infty$.
Proof. Straightforward using Chebyshev's inequality, or using the generic result that $m.s. \implies p$.
Figure 1.8 gives some examples of this. The experiment has varying sample sizes and uses normal and Student t draws for $X_i$. As we go down the panels the sample size increases and the scatter of the points becomes narrower, indicating convergence. This is the expected result, as the mean and variance exist.
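The experiment behind Figure 1.8 can be sketched in a few lines. This is a Python sketch rather than the Ox code the notes use, and the centre $\mu$, scale, and tolerance $\epsilon$ are illustrative: for both normal and Student t draws, the probability that the sample mean lands more than $\epsilon$ away from $\mu$ shrinks as $n$ grows.

```python
import numpy as np

rng = np.random.default_rng(1)
mu, eps = 3.0, 0.5  # illustrative centre and tolerance

def tail_prob(draw, n, reps=5_000):
    # Fraction of replications with |sample mean - mu| > eps;
    # the WLLN says this tends to 0 as n grows.
    x = draw((reps, n))
    return np.mean(np.abs(x.mean(axis=1) - mu) > eps)

normal = lambda size: rng.normal(mu, 2.0, size)
student = lambda size: mu + rng.standard_t(4, size)  # shifted t(4)

for n in (3, 25, 250):
    print(n, tail_prob(normal, n), tail_prob(student, n))
```

Both columns of probabilities fall towards zero down the rows, mirroring the narrowing histograms in the figure.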
Sometimes it will be helpful to strengthen this to a result which is much harder toprove!
Theorem 1.2 (Kolmogorov's) Strong Law of Large Numbers (SLLN). Let $X_i$ be i.i.d., and let $\mathrm{E}(X_i)$ exist. Then
$$\bar{X} \overset{a.s.}{\to} \mathrm{E}(X_i).$$
Figure 1.8: Histogram of 10,000 simulations of a sample average of $n$ shifted normal (left panels) or Student t with 4 degrees of freedom (right panels), together with fitted normal curve. Top to bottom: $n = 3$, $n = 25$ and $n = 250$. Code: sim mean.
Proof. Difficult. See, for example, Gallant (1997, p. 132).
The above theorem is remarkable: even if the variance does not exist, the sample average converges to the expected value. Notice it implies
$$\frac{1}{n}\sum_{i=1}^{n} g(X_i) \overset{p}{\to} \mathrm{E}\{g(X_i)\},$$
so long as $\mathrm{E}\{g(X_i)\}$ exists.
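This result is the basis of Monte Carlo integration: an expectation $\mathrm{E}\{g(X)\}$ can be approximated by averaging $g$ over simulated draws. A Python sketch with an illustrative choice of $g$ (the notes' own code is in Ox):

```python
import numpy as np

rng = np.random.default_rng(2)

# LLN applied to g(X): (1/n) sum g(X_i) -> E{g(X_i)}.
# Here X_i ~ N(0,1) and g(x) = exp(x), for which E{g(X_i)} = exp(1/2)
# (the mean of a standard lognormal), so the answer is known exactly.
n = 200_000
x = rng.standard_normal(n)
estimate = np.exp(x).mean()
print(estimate, np.exp(0.5))
```

The sample average sits very close to the exact value $e^{1/2} \approx 1.6487$, with an error that shrinks like $1/\sqrt{n}$.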
Figure 1.9: Estimated density, using 5,000 simulations, of the sample average of $n$ shifted and scaled Student t with 1.5 degrees of freedom, for $n = 10$, $100$, $1{,}000$, $10{,}000$, $100{,}000$ and $1{,}000{,}000$. Code: sim var infinity.
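The experiment in Figure 1.9 can be sketched as follows (a Python sketch, not the notes' sim var infinity code; settings are illustrative). A t with 1.5 degrees of freedom has a finite mean but an infinite variance, so the SLLN applies even though the usual CLT does not, and the spread of the sample average still shrinks with $n$:

```python
import numpy as np

rng = np.random.default_rng(3)

def spread_of_mean(n, reps=2_000):
    # Interdecile range of the sample mean of n i.i.d. t(1.5) draws.
    # t(1.5) has a finite mean (zero) but an infinite variance.
    m = rng.standard_t(1.5, size=(reps, n)).mean(axis=1)
    return np.quantile(m, 0.9) - np.quantile(m, 0.1)

for n in (10, 100, 10_000):
    print(n, spread_of_mean(n))
```

The interdecile range falls with $n$, but noticeably more slowly than the $1/\sqrt{n}$ rate of the finite-variance case.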
1.5.3 Some payback
The most important rules are:

If $A \overset{p}{\to} a$, then $g(A) \overset{p}{\to} g(a)$, where $g(\cdot)$ is a continuous function.

Example 1.40 Suppose $X_i$ are i.i.d. with $\mathrm{E}(X_i) \neq 0$ and $\mathrm{Var}(X_i)$ existing. Then
$$\frac{1}{n}\sum_{i=1}^{n} X_i \overset{p}{\to} \mathrm{E}(X_i),$$
which implies
$$\frac{1}{\frac{1}{n}\sum_{i=1}^{n} X_i} \overset{p}{\to} \frac{1}{\mathrm{E}(X_i)}.$$
If $A \overset{p}{\to} a$ and $B \overset{p}{\to} b$, then
$$g(A)h(B) \overset{p}{\to} g(a)h(b).$$

Example 1.41 Suppose $X_i$ and $Y_i$ are i.i.d. with $\mathrm{E}(X_i)$, $\mathrm{E}(Y_i)$, $\mathrm{Var}(X_i)$ and $\mathrm{Var}(Y_i)$ existing. Then
$$\left(\frac{1}{n}\sum_{i=1}^{n} X_i\right)\left(\frac{1}{n}\sum_{i=1}^{n} Y_i\right) \overset{p}{\to} \mathrm{E}(X_i)\,\mathrm{E}(Y_i).$$
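Both rules are easy to verify numerically. A Python sketch with illustrative means and variances: the reciprocal of the sample mean settles near $1/\mathrm{E}(X_i)$, and the product of two sample means near $\mathrm{E}(X_i)\mathrm{E}(Y_i)$.

```python
import numpy as np

rng = np.random.default_rng(4)

# Continuous-mapping checks with E(X_i) = 2 and E(Y_i) = 5 (illustrative):
# 1 / Xbar -> 1/2 and Xbar * Ybar -> 10, both in probability.
n = 100_000
x = rng.normal(2.0, 1.0, n)
y = rng.normal(5.0, 3.0, n)
xbar, ybar = x.mean(), y.mean()
print(1 / xbar, xbar * ybar)
```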
1.5.4 Some more theory (first part examinable)
Refined measure of convergence (examinable)
Convergence almost surely or in probability is quite a rough measure, for it says only that $X_n - X$ collapses to zero for large $n$. It does not indicate at what rate this convergence will be achieved, nor give any distributional shape to $X_n - X$. In order to improve our understanding we need the concept of convergence in distribution.
Definition 1.4 (Convergence in distribution) The sequence $X_1, X_2, \ldots$ of random variables is said to converge in distribution to the random variable $X$ if
$$F_{X_n}(x) \to F_X(x), \quad (1.21)$$
at all points $x$ where $F_X$ is continuous. A shorthand notation is
$$X_n \overset{d}{\to} X. \quad (1.22)$$
An important result in this context is Slutsky's theorem:

Suppose $X_n \overset{d}{\to} X$ and $Y_n \overset{p}{\to} c$, a constant. Then $X_n Y_n \overset{d}{\to} cX$, and $X_n / Y_n \overset{d}{\to} X/c$ if $c \neq 0$.

Suppose $X_n \overset{d}{\to} X$ and $Y_n \overset{p}{\to} c$. Let $\phi$ be a continuous mapping. Then $\phi(X_n, Y_n) \overset{d}{\to} \phi(X, c)$.

A proof of this result can be found in Gallant (1997).
Example 1.42 Suppose $X_1, \ldots, X_n$ are i.i.d. normal with mean $\mu$ and variance $\sigma^2$. Then
$$\frac{\sqrt{n}\left(\bar{X} - \mu\right)}{\sigma} \sim N(0, 1).$$
Likewise
$$\hat{\sigma}^2 = \frac{1}{n}\sum_{i=1}^{n}\left(X_i - \bar{X}\right)^2 = \frac{1}{n}\sum_{i=1}^{n} X_i^2 - \bar{X}^2 \overset{a.s.}{\to} \sigma^2.$$
Then by Slutsky's theorem
$$\frac{\sqrt{n}\left(\bar{X} - \mu\right)}{\hat{\sigma}} \overset{d}{\to} N(0, 1).$$
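Example 1.42 is the classical t-statistic, and it can be checked by simulation. A Python sketch with illustrative $\mu$, $\sigma$ and $n$: replacing $\sigma$ by $\hat{\sigma}$ leaves the studentised mean approximately $N(0,1)$, exactly as Slutsky's theorem promises.

```python
import numpy as np

rng = np.random.default_rng(5)
mu, sigma, n, reps = 3.0, 2.0, 200, 20_000  # illustrative settings

# Studentised sample mean sqrt(n)(Xbar - mu)/sigma_hat; by Slutsky it is
# asymptotically N(0,1) even though sigma is replaced by an estimate.
x = rng.normal(mu, sigma, size=(reps, n))
xbar = x.mean(axis=1)
sigma_hat = x.std(axis=1)        # 1/n divisor, as in the notes
t = np.sqrt(n) * (xbar - mu) / sigma_hat
print(t.mean(), t.std())         # close to 0 and 1
```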
In order to use asymptotic theory we have some generic tools: central limit theorems. The most famous of these is the Lindeberg-Levy central limit theorem. This is a very important result.
Theorem 1.3 (Lindeberg-Levy) Let $X_1, X_2, \ldots$ be independent, identically distributed random variables with $\mathrm{E}(X_i) = \mu$ and $\mathrm{Var}(X_i) = \sigma^2$. Set
$$\bar{X} = \left(X_1 + \cdots + X_n\right)/n.$$
Then
$$\sqrt{n}\left(\bar{X} - \mu\right) \overset{d}{\to} N\left(0, \sigma^2\right).$$
Figure 1.10: Estimated density, using 10,000 simulations, of $\sqrt{n}(\bar{X} - 1)$ from a sample of i.i.d. $\chi^2_1$ variables. Left panels, as above. Right panels look at $\sqrt{n}\{\log(\bar{X}) - \log(1)\}$. Top to bottom: $n = 3$, $10$ and $50$. Code: clt chi.ox.
Example 1.43 Suppose $X_i$ are i.i.d. $\chi^2_1$, that is $N(0,1)^2$. Then $X_i$ has a mean of 1 and a variance of 2. The CLT shows that
$$\sqrt{n}\left(\bar{X} - 1\right) \overset{d}{\to} N(0, 2).$$
Figure 1.10 shows results from a simulation of this asymptotic result. The simulations in the left panels show that as $n$ increases from 3 to 10 to 50 the CLT becomes an increasingly accurate approximation to the actual distribution. Notice the x-axis: as the sample size increases there is no implosion, for we are scaling by $\sqrt{n}$, which balances the fact that $\bar{X}$ is becoming like 1.
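Example 1.43 and the left panels of Figure 1.10 can be reproduced with a short simulation (a Python sketch; the notes' version is the Ox program clt chi.ox). Note that $\mathrm{Var}\{\sqrt{n}(\bar{X} - 1)\} = 2$ exactly for every $n$, so the CLT claim is about shape: we watch the skewness shrink towards zero as $n$ grows.

```python
import numpy as np

rng = np.random.default_rng(6)
reps = 20_000

def skew(z):
    # Sample skewness; zero for a normal distribution
    zc = (z - z.mean()) / z.std()
    return (zc ** 3).mean()

# chi^2_1 = N(0,1)^2 has mean 1 and variance 2, so sqrt(n)(Xbar - 1)
# has standard deviation sqrt(2) at every n; normality is what improves.
skews = []
for n in (3, 10, 50):
    x = rng.standard_normal((reps, n)) ** 2
    z = np.sqrt(n) * (x.mean(axis=1) - 1.0)
    skews.append(skew(z))
    print(n, z.std(), skews[-1])
```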
Transformations: the delta method
In the right panels we have drawn another asymptotic approximation. This is based on
$$\sqrt{n}\left\{\log(\bar{X}) - \log(1)\right\}.$$
The idea is as follows. For any continuously differentiable function $g$ we can use Taylor's theorem of the mean¹³
$$g(\bar{X}) = g(\mu) + \left(\bar{X} - \mu\right) g'(\bar{\mu}).$$
Here $\bar{\mu}$ lives between $\bar{X}$ and $\mu$. As $n$ gets large, so $\bar{\mu} \to \mu$. Hence this linearisation becomes more precise. Then, rewriting,
$$\sqrt{n}\left\{g(\bar{X}) - g(\mu)\right\} \simeq \sqrt{n}\left(\bar{X} - \mu\right) g'(\mu) \overset{d}{\to} N\left\{0, \sigma^2 \left(g'(\mu)\right)^2\right\}.$$
In econometrics this approach to looking at the asymptotics of a transformation is usuallycalled the delta method. It can be used for any continuously differentiable function g(.).
In our case we have used a log transformation, so $\partial \log(\mu)/\partial \mu = 1/\mu$, which equals 1 at $\mu = 1$, and so
$$\sqrt{n}\left\{\log(\bar{X}) - \log(1)\right\} \overset{d}{\to} N(0, 2).$$
Hence we have derived another asymptotic approximation. The simulations in Figure 1.10 show that this log-based asymptotic approximation is more accurate. That is, the finite sample behaviour of this asymptotic theory is preferable.
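The delta method claim can be checked numerically. A Python sketch with illustrative settings: both $\sqrt{n}(\bar{X} - 1)$ and $\sqrt{n}\log\bar{X}$ share the same $N(0, 2)$ limit, and at moderate $n$ the log version is the less skewed of the two.

```python
import numpy as np

rng = np.random.default_rng(7)
n, reps = 50, 20_000  # illustrative settings

def skew(z):
    # Sample skewness; zero for a normal distribution
    zc = (z - z.mean()) / z.std()
    return (zc ** 3).mean()

# chi^2_1 draws as in Example 1.43: mu = 1, sigma^2 = 2, g(x) = log(x),
# g'(1) = 1, so both statistics have the same N(0, 2) limit.
x = rng.standard_normal((reps, n)) ** 2
xbar = x.mean(axis=1)
z_raw = np.sqrt(n) * (xbar - 1.0)
z_log = np.sqrt(n) * np.log(xbar)
print(z_raw.std(), z_log.std())   # both near sqrt(2)
print(skew(z_raw), skew(z_log))   # the log version is closer to symmetric
```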
Heterogeneous CLT
In applications, the assumptions of the Lindeberg-Levy theorem are often found to be very restrictive. Fortunately, the assumptions that the observations are independent and identically distributed can be relaxed to assuming that the observations are approximately independent and approximately identically distributed. This can be formalised in many ways, and therefore many versions of the central limit theorem can be found in the literature. A prominent example is the following:
Theorem 1.4 (Lindeberg) Let $X_1, X_2, \ldots$ be independent random variables with $\mathrm{E}(X_i) = \mu_i$ and $\mathrm{Var}(X_i) = \sigma_i^2$. Define:
$$\bar{X} = \left(X_1 + \cdots + X_n\right)/n, \quad \bar{\mu} = \left(\mu_1 + \cdots + \mu_n\right)/n, \quad \bar{\sigma}^2 = \left(\sigma_1^2 + \cdots + \sigma_n^2\right)/n.$$
If the variables satisfy the Lindeberg condition, that is, for all $\epsilon > 0$,
$$\frac{1}{n\bar{\sigma}^2}\sum_{i=1}^{n} \mathrm{E}\left\{\left(X_i - \mu_i\right)^2 1\left(|X_i - \mu_i| > \epsilon\sqrt{n}\,\bar{\sigma}\right)\right\} \to 0 \text{ as } n \to \infty,$$
¹³Recall Taylor's theorem of the mean: for all $x, y$ there exists a $\bar{y}$ such that
$$g(y) = g(x) + (y - x)\left.\frac{\partial g(x)}{\partial x}\right|_{x = \bar{y}},$$
where $\bar{y}$ is between $x$ and $y$. More informally,
$$g(y) = g(x) + (y - x)\frac{\partial g(x)}{\partial x} + \frac{(y - x)^2}{2}\frac{\partial^2 g(x)}{\partial x^2} + \cdots,$$
then for small $y - x$ we can neglect the higher order terms and produce $g(y) \approx g(x) + (y - x)\,\partial g(x)/\partial x$.
then:
$$\frac{\sqrt{n}\left(\bar{X} - \bar{\mu}\right)}{\bar{\sigma}} \overset{d}{\to} N(0, 1).$$
Proof. Can be found in Feller (1971, pp. 518-21).
The interpretation of the Lindeberg condition is that each observation should be asymptotically negligible: the probability that any observation $X_i$ is of the same order of magnitude as $\sqrt{n}\,\bar{\sigma}$ must tend to zero. See also the discussion by Hendry (1995, p. 717).
Later in the econometrics course time series will be studied. In time series the obser-vations are typically not independent. The Central Limit Theorems for so-called mixingprocesses and martingale difference sequences are tools to deal with such dependency.See, for example, Hendry (1995, Appendix A4.10+11).
Order notation
Asymptotic theory is usually discussed as $n \to \infty$. It may be of interest to discuss the order of magnitude of remainder terms. The following notation is used in calculus.

Notation. Let $f(x)$ and $g(x)$ be two functions. If
$$f(x)/g(x) \to 0 \text{ as } x \to \infty,$$
then $f$ is of smaller order than $g$ and we write
$$f(x) = o\{g(x)\} \quad \text{small } o.$$
If
$$\lim_{x \to \infty} |f(x)/g(x)| \to \text{constant},$$
then $f$ is of the same order as $g$ and we write
$$f(x) = O\{g(x)\} \quad \text{big } O.$$

Example 1.44 $0 < a < b \implies n^a = o\left(n^b\right)$.

Example 1.45 $0 < a \implies \log n = o\left(n^a\right)$. Recall the inequality $\log x \leq x - 1 \leq x$ (draw, or use a convexity argument). Thus $\log n \leq n$ and $a \log n = \log n^a \leq n^a$. Thus $\log n \leq n^a / a$.
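Example 1.45 is easy to illustrate numerically (a Python sketch; the exponent $a = 0.1$ is illustrative). The ratio $\log n / n^a$ rises until $\log n = 1/a$ and falls thereafter, so $\log n = o(n^a)$ even though for small $a$ the decline only sets in at very large $n$:

```python
import math

a = 0.1  # illustrative exponent
# log(n) / n**a peaks around n = exp(1/a) and then decays to zero,
# demonstrating log n = o(n**a) despite the very slow convergence.
for n in (10**2, 10**6, 10**12, 10**24):
    print(n, math.log(n) / n ** a)
```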
The corresponding notation in probability is as follows.

Notation. Let $X_1, X_2, \ldots$ be a sequence of random variables and $f$ a real function. If
$$X_n / f(n) \overset{p}{\to} 0,$$
then we write
$$X_n = o_p\{f(n)\} \quad \text{small } o_p.$$
If
$$X_n / f(n) \overset{d}{\to} X,$$
then we write
$$X_n = O_p\{f(n)\} \quad \text{big } O_p.$$
Example 1.46 By the Law of Large Numbers $\sum_{j=1}^{n} X_j / n \overset{p}{\to} \mu$. Therefore:
$$\frac{1}{n}\sum_{j=1}^{n} X_j = O_p(1), \qquad \frac{1}{n}\sum_{j=1}^{n} X_j = \mu + o_p(1).$$
Example 1.47 By the Central Limit Theorem, with $\mu = 0$, $\sigma^2 = 1$,
$$\sum_{j=1}^{n} X_j / \sqrt{n} \overset{d}{\to} N(0, 1).$$
Therefore:
$$\frac{1}{\sqrt{n}}\sum_{j=1}^{n} X_j = O_p(1), \qquad F_{\sum X_j / \sqrt{n}}(x) = \Phi(x) + o(1).$$
In fact:
$$F_{\sum X_j / \sqrt{n}}(x) = \Phi(x) + O\left(\frac{1}{\sqrt{n}}\right).$$
Functional central limit theory
So far we have looked at limits of terms like
$$\sqrt{n}\,\frac{1}{n}\sum_{i=1}^{n}\left\{X_i - \mathrm{E}(X_i)\right\},$$
where the $X_i$ are assumed to be i.i.d. with a constant variance $\sigma^2 = \mathrm{Var}(X_i)$. The Lindeberg-Levy CLT means that
$$\sqrt{n}\,\frac{1}{n}\sum_{i=1}^{n}\left\{X_i - \mathrm{E}(X_i)\right\} \overset{L}{\to} N\left(0, \sigma^2\right).$$
In more advanced econometrics it is helpful to extend this result. At first sight it willlook odd.
For a given $n > 0$, define the partial sum
$$Y_t^{(n)} = \frac{1}{\sqrt{n}}\sum_{i=1}^{\lfloor tn \rfloor}\left\{X_i - \mathrm{E}(X_i)\right\}, \quad t \in [0, 1], \quad (1.23)$$
where $t$ represents time and $\lfloor a \rfloor$ denotes the integer part of $a$. It means that over any fixed interval for $t$, that is time, the process is made up of centred and normalised sums of i.i.d. events. We then allow $n$, the number of these events in any fixed interval of time of unit length, to go off to infinity. This is often labelled in-fill asymptotics. As a result,
for a specific $t$, $Y_t^{(n)}$ obeys a central limit theory and becomes Gaussian. Further, this idea can be extended to show that the whole partial sum, as a random function, converges to a scaled version of Brownian motion as $n$ goes to infinity. This is often written as
$$\frac{1}{\sqrt{n}}\sum_{i=1}^{\lfloor tn \rfloor}\left\{X_i - \mathrm{E}(X_i)\right\} \to \sigma W_t, \quad t \in [0, 1],$$
where $W$ is standard Brownian motion and the convergence is for the whole process, not for a single $t$. This type of result is called a functional central limit theory. The proof of this result is somewhat advanced. Although it is easy to see it holds for a specific $t$, showing that the convergence holds as a process is difficult, as the objects we are considering are infinite dimensional, i.e. functions of $t$ which live on $[0, 1]$. The nice thing about functional CLTs is that they have been proved under some rather general conditions on the $X_i$, allowing limited amounts of time dependence and heteroskedasticity.
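The partial-sum process can be sketched by simulation (a Python sketch with an illustrative $\sigma$): the variance of $Y_t^{(n)}$ grows linearly in $t$, matching the $\sigma^2 t$ variance of the scaled Brownian motion limit.

```python
import numpy as np

rng = np.random.default_rng(8)
n, reps, sigma = 1_000, 5_000, 2.0  # illustrative settings

# Each row is one path of Y_t^(n) = (1/sqrt(n)) sum_{i <= tn} (X_i - E X_i)
# evaluated at t = i/n; the limit sigma * W_t has variance sigma^2 * t.
x = rng.normal(0.0, sigma, size=(reps, n))
paths = np.cumsum(x, axis=1) / np.sqrt(n)
print(paths[:, n // 2 - 1].var(), sigma**2 * 0.5)  # t = 1/2
print(paths[:, -1].var(), sigma**2 * 1.0)          # t = 1
```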
In the econometrics of asset pricing this type of argument has been used to support the use of Brownian motion models of asset prices. The argument goes along the lines that the $X_i$ are shocks to the price, so if the price is moved by adding up lots of small shocks then the process as a whole should become Brownian in the limit, so long as the stock is very liquid. Although somewhat appealing, this is rather artificial in our view and not a good basis on which to build good models of asset prices.
At first sight this suggests the only reasonable continuous time version of a randomwalk, which will sum up many small events, is Brownian motion. This insight is, however,incorrect.
1.6 Conclusion
In these lecture notes we have given a review of random variables and their limits. Thefocus has been on work which will be helpful in econometrics where we need a goodunderstanding of distributions, moments, estimators and models built from randomness.Two tools for developing approximations were also introduced: simulation and asymp-totics. Both are very important when we look at models outside simple classic linearmodels.
1.7 Solutions to some exercises
These were originally drafted by Christian Dambolena, Rustom Irani, Eric Engler, first year students in
the 2004/2005 M.Phil. in Economics at the University of Oxford. Edited by Neil Shephard. I thank Sri
Thirumaijah (MFE, 2005) for finding an error in an earlier draft.
Proof of Exercise 1.2
$$\mathrm{E}\{a + b\,g(X)\} = \int \{a + b\,g(x)\} f_X(x)\,dx = a\int f_X(x)\,dx + b\int g(x) f_X(x)\,dx = a + b\,\mathrm{E}\{g(X)\}.$$
Proof of Exercise 1.3
$$\mathrm{Var}(a + bX) = \mathrm{E}\{(a + bX) - \mathrm{E}(a + bX)\}^2 = \mathrm{E}\{(a + bX) - (a + b\,\mathrm{E}(X))\}^2 = \mathrm{E}\{b(X - \mathrm{E}(X))\}^2 = b^2\,\mathrm{E}\{X - \mathrm{E}(X)\}^2 = b^2\,\mathrm{Var}(X).$$
Proof of Exercise 1.4
$$\mathrm{E}(X) = \int_{-\infty}^{\infty} x f_X(x)\,dx = \int_{-\infty}^{\infty} x \frac{1}{2\psi} e^{-\frac{1}{\psi}|x - \mu|}\,dx$$
$$= \int_{-\infty}^{\mu} x \frac{1}{2\psi} e^{-\frac{1}{\psi}(\mu - x)}\,dx + \int_{\mu}^{\infty} x \frac{1}{2\psi} e^{-\frac{1}{\psi}(x - \mu)}\,dx$$
$$= \mu + \int_{-\infty}^{\mu} (x - \mu)\frac{1}{2\psi} e^{-\frac{1}{\psi}(\mu - x)}\,dx + \int_{\mu}^{\infty} (x - \mu)\frac{1}{2\psi} e^{-\frac{1}{\psi}(x - \mu)}\,dx$$
$$= \mu.$$
Likewise
$$\mathrm{Var}(X) = \int_{-\infty}^{\infty} (x - \mu)^2 f_X(x)\,dx$$
$$= \int_{-\infty}^{\mu} (x - \mu)^2 \frac{1}{2\psi} e^{-\frac{1}{\psi}(\mu - x)}\,dx + \int_{\mu}^{\infty} (x - \mu)^2 \frac{1}{2\psi} e^{-\frac{1}{\psi}(x - \mu)}\,dx$$
$$= 2\psi^2,$$
using the properties of exponential random variables.
Proof of Exercise 1.5 Recall the density of a general uniform
$$f_X(x) = \frac{1}{b - a}, \quad x \in [a, b],$$
where a special case is the standard uniform $f_X(x) = 1$, $x \in [0, 1]$. We derive the mean and variance of a standard uniform. First
$$\mathrm{E}(X) = \int_0^1 x f_X(x)\,dx = \int_0^1 x\,dx = \left[\frac{1}{2}x^2\right]_0^1 = \frac{1}{2},$$
while
$$\mathrm{Var}(X) = \mathrm{E}\{X - \mathrm{E}(X)\}^2 = \int_0^1 \left(x - \frac{1}{2}\right)^2 f_X(x)\,dx = \int_0^1 \left(x^2 - x + \frac{1}{4}\right)dx$$
$$= \left[\frac{1}{3}x^3 - \frac{1}{2}x^2 + \frac{1}{4}x\right]_0^1 = \frac{1}{3} - \frac{1}{2} + \frac{1}{4} = \frac{1}{12}.$$