
ICICI Centre for Mathematical Sciences

Mathematical Finance 2004-5

A Probability Primer

Dr Amber Habib

We will sum up the most essential aspects of Probability, and the transition to Statistical Inference. We avoid technical distractions. The gaps we leave can be filled in by consulting one of the references provided at the end.

The figures have been produced using Mathematica.

Contents

1 Sample Space, Random Variables

2 Probability

3 Probability Distributions

4 Binomial Distribution

5 Normal Distribution

6 Expectation

7 Variance

8 Lognormal Distribution

9 Bivariate Distributions

10 Conditional Probability and Distributions

11 Independence

12 Chi Square Distribution

13 Random Samples

14 Sample Mean and Variance

15 Large Sample Approximations

16 Point Estimation

17 Method of Moments

18 Maximum Likelihood Estimators

19 Sampling Distributions

20 Confidence Intervals

1 Sample Space, Random Variables

The sample space is the collection of objects whose statistical properties are to be studied. Each such object is called an outcome, and a collection of outcomes is called an event.

• Mathematically, the sample space is a set, an outcome is a member of this set, and an event is a subset of it.

For example, our sample space may be the collection of stocks listed on the National Stock Exchange. We may ask statistical questions about this space, such as: What is the average change in the value of a listed stock (an outcome) over the past year? Alternately, our sample space might be the history of a single stock: all its daily closing prices, say, over the last 5 years. Our question might be: What is the average weekly change of the value over the last year? Or, how large is the variation in the prices over its 5 year history? We could ask more subtle questions: How likely is it that the stock will have a sudden fall in value over the next month?

• Typical notation is to use S for the sample space, lower case letters such as x for individual outcomes, and capital letters such as E for events.

In a particular context, we would be interested in some specific numerical property of the outcomes: such as the closing price on October 25, 2004, of all the stocks listed on the NSE. This property allots a number to each outcome, so it can be viewed as a function whose domain is the sample space S and whose range is in the real numbers R.

• A function X : S → R is called a random variable.


Our interest is in taking a particular random variable X and studying how its values are distributed. What is the average? How much variation is there in its values? Are very large values unlikely enough to be ignored?

2 Probability

We present two viewpoints on the meaning of probability. Both are relevant to Finance and, interestingly, they lead to the same mathematical definition of probability!

Viewpoint 1: The probability of an event should predict the relative frequency of its occurrence. That is, suppose we say the probability of a random stock having increased in value over the last month is 0.6. Then, if we look at 100 different stocks, about 60 of them should have increased in value. The prediction should be more accurate if we look at larger numbers of stocks.

Viewpoint 2: The probability of an event reflects our (subjective) opinion about how likely it is to occur, in comparison to other events. Thus, if we allot probability 0.4 to event A and 0.2 to event B, we are expressing the opinion that A is twice as likely as B.

Viewpoint 1 is appropriate when we are analyzing historical data to predict the future. Viewpoint 2 is useful in analyzing how an individual may act when faced with certain information. Both viewpoints are captured by the following mathematical formulation:

• Let A be the collection of all subsets of the sample space S: we call it the event algebra.

• A probability function is a function P : A → [0, 1] such that:

1. P(S) = 1.

2. P(⋃_{i=1}^∞ Ei) = ∑_{i=1}^∞ P(Ei), if the Ei are pairwise disjoint events (i.e., i ≠ j implies Ei ∩ Ej = ∅).

Let P : A → [0, 1] be a probability function. Then it automatically has the following properties:


1. P(∅) = 0.

2. P(⋃_{i=1}^n Ei) = ∑_{i=1}^n P(Ei), if the Ei are pairwise disjoint events.

3. P(A^c) = 1 − P(A).

4. P(⋃_i Ei) ≤ ∑_i P(Ei), for any collection of events Ei.

3 Probability Distributions

We return to our main question: How likely are different values (or ranges of values) of a random variable X : S → R?

If we just plug in the definition of a random variable, we realize that our question can be phrased as follows: What is the probability of the event whose outcomes correspond to a given value (or range of values) of X?

Thus, suppose I ask, what is the probability that X takes on a value greater than 100? This is to be interpreted as: what is the probability of the event whose outcomes t all satisfy X(t) > 100? That is,

P(X > 100) = P({t : X(t) > 100}).

It is convenient to consider two types of random variables, ones whose values vary discretely (discrete random variables) and those whose values vary continuously (continuous random variables).[1]

Examples:

1. Let us allot +1 to the stocks whose value rose on a given day, and −1 to those whose value fell. Then we have created a random variable whose possible values are ±1. This is a discrete random variable.

2. Let us allot the whole number +n to the stocks whose value rose by between n and n + 1 on a given day, and −n to those whose value fell by between n and n + 1. Then we have created a random variable whose possible values are all the integers. This is also a discrete random variable.

[1] These are not the only types. But these types contain all the ones we need.


3. If, in the previous example, we let X be the actual change in value, it is still discrete (since all changes are in multiples of Rs 0.01). However, now the values are so close that it is simpler to ignore the discreteness and model X as a continuous random variable.

Discrete Random Variables

Let S be the sample space, A the event algebra, and P : A → [0, 1] a probability function. Let X : S → R be a discrete random variable, with range x1, x2, . . . (the range can be finite or infinite). The probability of any particular value x is

P(X = x) = P({t ∈ S : X(t) = x}).

• We call these values the probability distribution or probability density of X and denote them by fX : R → [0, 1],

fX(x) = P(X = x).

One can find the probability of a range of values of X by just summing up the probabilities of all the individual values in that range. For instance,

P(a < X < b) = ∑_{xi ∈ (a,b)} fX(xi).

In particular, summing over the entire range gives

∑_i fX(xi) = 1,

since the total probability must be 1.

Discrete Uniform Distribution: Consider an X whose range is {0, 1, 2, . . . , n} and each value is equally likely. Then

fX(x) = 1/(n + 1) for x ∈ {0, 1, 2, . . . , n}, and fX(x) = 0 otherwise.
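As a quick computational illustration (this sketch is not part of the original notes, whose figures were produced with Mathematica; it assumes Python is available), the code below stores such a probability distribution as a table of values and recovers a range probability by summation, exactly as described above. The names n, f, a, b are illustrative only.

    # A minimal sketch: the discrete uniform distribution on {0, 1, ..., n}.
    n = 10
    f = {x: 1.0 / (n + 1) for x in range(n + 1)}   # f_X(x) = 1/(n+1)

    # Total probability is 1 (up to rounding error).
    print(sum(f.values()))

    # P(a < X < b) is the sum of f_X over the values lying strictly inside (a, b).
    a, b = 2, 7
    print(sum(p for x, p in f.items() if a < x < b))   # 4/11, from the values 3, 4, 5, 6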


Continuous Random Variables

Suppose the values of a random variable X vary continuously over some range, such as [0, 1]. Then, it is not particularly useful to ask for the likelihood of X taking on any individual value such as 1/2 – since there are infinitely many choices, each individual choice is essentially impossible and has probability zero. From a real life viewpoint also, since exact measurements of a continuously varying quantity are impossible, it is only reasonable to ask for the probability of an observation lying in a range, such as (0.49, 0.51), rather than its having the exact value 0.5.

The notion of a probability distribution of a continuous random variable is developed with this in mind. Recall that in the discrete context, the probability of a range was obtained by summing over that range. So, in the continuous case, we seek to obtain the probability of a range by integrating over it.

• Given a continuous random variable X, we define its probability density to be a function fX : R → [0, ∞) such that for any a, b with a ≤ b,

P(a ≤ X ≤ b) = ∫_a^b fX(x) dx.

In particular,

∫_{−∞}^{∞} fX(x) dx = 1.

Important. The number fX(x) does not represent the probability that X = x. Individual values of fX have no significance, only the integrals of fX do! (Contrast this with the discrete case.)

Continuous Uniform Distribution:[2] Suppose we want a random variable X that represents a quantity which varies over [0, 1] without any bias: this is taken to mean that P(X ∈ [a, b]) should not depend on the location of [a, b] but only on its length.

[Diagram: three intervals I1, I2, I3 inside [0, 1], with P(X ∈ I1) = P(X ∈ I2) = (1/2) P(X ∈ I3).]

[2] Usually, this is just called the uniform distribution.


This is achieved by taking the following probability density:

fX(x) = 1 for 0 ≤ x ≤ 1, and fX(x) = 0 otherwise.

For then, with 0 ≤ a ≤ b ≤ 1,

P(a ≤ X ≤ b) = ∫_a^b 1 dx = b − a.

[Graph: the density fX, equal to 1 on [0, 1] and 0 elsewhere.]

Cumulative Probability Distribution

• The cumulative probability distribution of a random variable X, denoted FX, is defined by:

FX(x) = P(X ≤ x).

It is easy to see that

FX(x) = ∑_{xi ≤ x} fX(xi) if X is discrete with range x1, x2, . . ., and

FX(x) = ∫_{−∞}^{x} fX(t) dt if X is continuous.

Now we look at two distributions, one discrete and one continuous, which are especially important.

4 Binomial Distribution

Consider a random variable X which can take on only two values, say 0 and 1 (the choice of values is not important). Suppose the probability of the value


0 is q and of 1 is p. Then we have:

1. 0 ≤ p, q ≤ 1,

2. p + q = 1

Question: Suppose we observe X n times. What are the likely distributions of 0's and 1's? Specifically, we ask: What is the probability of observing 1 k times?

We calculate as follows. Let us consider all possible combinations of n 0's and 1's:

Number of combinations with k 1's = ways of picking k slots out of n = C(n, k).

Probability of each individual combination with k 1's = p^k (1 − p)^(n−k).

Therefore, P(1 occurs k times) = C(n, k) p^k (1 − p)^(n−k).

• A random variable Y has a binomial distribution with parameters n and p if it has range 0, 1, . . . , n and its probability distribution is:

fY(k) = C(n, k) p^k (1 − p)^(n−k),   k = 0, 1, . . . , n.

We call Y a binomial random variable and write Y ∼ B(n, p).
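To make the formula concrete, here is a small Python sketch (an added illustration, not taken from the notes) that evaluates fY(k) directly and checks that the probabilities sum to 1. The helper name binom_pmf is illustrative.

    import math

    def binom_pmf(k, n, p):
        # f_Y(k) = C(n, k) p^k (1 - p)^(n - k)
        return math.comb(n, k) * p**k * (1 - p)**(n - k)

    n, p = 10, 0.2
    pmf = [binom_pmf(k, n, p) for k in range(n + 1)]
    print(sum(pmf))       # total probability: 1 (up to rounding)
    print(pmf[2])         # P(Y = 2) for B(10, 0.2), about 0.302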

As illustrated above, binomial distributions arise naturally wherever we are faced with a sequence of choices. In Finance, the binomial distribution is part of the Binomial Tree Model for pricing options.

Figure 1 illustrates the binomial distributions with p = 0.2 and p = 0.5, using different types of squares for each. Both have n = 10.

Exercise. In Figure 1, identify which points correspond to p = 0.2 and which to p = 0.5.


Figure 1: Two binomial distributions.

5 Normal Distribution

This kind of probability distribution is at once the most common in nature, among the easiest to work with mathematically, and theoretically at the heart of Probability and Statistical Inference. Among its remarkable properties is that any phenomenon occurring on a large scale tends to be governed by it. When in doubt about the nature of a distribution, assume it is (nearly) normal, and you will usually get good results!

We first define the standard normal distribution. This is a probability density of the form

fX(x) = (1/√(2π)) e^(−x²/2).

It has the following 'bell-shaped' graph:

[Graph of the standard normal density.]


Exercise. Can you explain the factor 1/√(2π)? (Hint: Think about the requirement that total probability should be 1.)

Note that the graph is symmetric about the y-axis. The axis of symmetry can be moved to another position m, by replacing x by x − m. In the following diagram, the dashed line represents the standard normal distribution:

[Graph: the standard normal density (dashed) together with the density centred at m,]

fX(x) = (1/√(2π)) e^(−(x−m)²/2).

Also, starting from the standard normal distribution, we can create one with a similar shape but bunched more tightly around the y-axis. We achieve this by replacing x with x/s:

[Graph: a normal density bunched more tightly around 0 than the standard one,]

fX(x) = (1/(√(2π) s)) e^(−(x/s)²/2).

By combining both kinds of changes, we reach the definition of a general normal distribution:

• A random variable X has a normal distribution with parameters µ, σ, if its density function has the form:

fX(x) = (1/(√(2π) σ)) e^(−(1/2)((x−µ)/σ)²).

We call X a normal random variable and write X ∼ N(µ, σ).

The axis of symmetry of this distribution is determined by µ and its clustering about the axis of symmetry is controlled by σ.

Exercise. Will increasing σ make the graph more tightly bunched around the axis of symmetry? What will it do to the peak height of the graph?

In the empirical sciences, errors in observation tend to be normally distributed: they are clustered around zero, small errors are common, and very large errors are very rare. Regarding this, observe from the graph that by ±3 the density of the standard normal distribution has essentially become zero: in fact the probability of a standard normal variable taking on a value outside [−3, 3] is just 0.0027. In theoretical work, the normal distribution is the main tool in determining whether the gap between theoretical predictions and observed reality can be attributed solely to errors in observation.
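The 0.0027 figure quoted above is easy to check numerically. The following sketch is an added illustration (it assumes SciPy is installed) and uses the standard normal cumulative distribution function together with the symmetry of the density.

    from scipy.stats import norm

    # P(|Z| > 3) for a standard normal Z.
    p_outside = 2 * (1 - norm.cdf(3.0))
    print(round(p_outside, 4))    # approximately 0.0027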

6 Expectation

If we have some data consisting of numbers xi, each occurring fi times, then the average of this data is defined to be:

x̄ = (sum of all the data)/(number of data points) = (∑_i fi xi)/(∑_i fi) = ∑_i (fi/∑_j fj) xi.

Now, if we have a discrete random variable X, then fX(xi) predicts the relative frequency with which xi will occur in a large number of observations of X, i.e., we view fX(xi) as a prediction of fi/∑_j fj. And then,

∑_i fX(xi) xi

becomes a predictor for the average x̄ of the observations of X.


• The expectation of a discrete random variable is defined to be

E[X] = ∑_i fX(xi) xi.

On replacing the sum by an integral we arrive at the notion of expectation of a continuous random variable:

• The expectation of a continuous random variable is defined to be

E[X] = ∫_{−∞}^{∞} x fX(x) dx.

Expectation is also called mean and denoted by µX or just µ.

Exercise. Make the following calculations:

1. X has the discrete uniform distribution with range 0, . . . , n. Then E[X] = n/2.

2. X has the uniform distribution on [0, 1]. Then E[X] = 1/2.

3. X ∼ B(n, p) implies E[X] = np.

4. X ∼ N(µ, σ) implies E[X] = µ.

Some elementary properties of expectation:

1. E[c] = c, for any constant c.[3]

2. E[cX] = c E[X], for any constant c.

Suppose X : S → R is a random variable and g : R → R is any function. Then their composition g ◦ X : S → R, defined by (g ◦ X)(w) = g(X(w)), is a new random variable which we will call g(X).

Example. Let g(x) = x^r. Then g ◦ X is denoted X^r. □

Suppose X is discrete with range {xi}. Then, the range of g(X) is {g(xi)}. Therefore we can calculate the expectation of g(X) as follows:[4]

E[g(X)] = ∑_i g(xi) P(g(X) = g(xi)) = ∑_i g(xi) fX(xi).

[3] A constant c can be viewed as a random variable whose range consists of the single value c.

[4] Our calculation is valid if g is one-one. With slightly more effort we can make it valid for any g.


If X is continuous, one has the analogous formula:

E[g(X)] = ∫_{−∞}^{∞} g(x) fX(x) dx.

Example. Let g(x) = x². Then

E[X²] = ∑_i xi² fX(xi) if X is discrete, and E[X²] = ∫_{−∞}^{∞} x² fX(x) dx if X is continuous. □

With these facts in hand, the following result is easy to prove.

• Let X be any random variable, and g, h two real functions. Then

E[g(X) + h(X)] = E[g(X)] + E[h(X)].
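As an added illustration of these formulas (not from the original text; the names f and expectation are ours), the sketch below computes E[X], E[X²] and E[X + X²] for a small discrete distribution, confirming numerically that expectation distributes over sums.

    # A discrete random variable with range {-1, 0, 2} and its distribution f_X.
    f = {-1: 0.2, 0: 0.5, 2: 0.3}

    def expectation(g, f):
        # E[g(X)] = sum over i of g(x_i) f_X(x_i)
        return sum(g(x) * p for x, p in f.items())

    EX   = expectation(lambda x: x, f)            # E[X]    = 0.4
    EX2  = expectation(lambda x: x * x, f)        # E[X^2]  = 1.4
    Esum = expectation(lambda x: x + x * x, f)    # E[X + X^2]
    print(EX, EX2, Esum)                          # Esum equals EX + EX2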

7 Variance

Given some data {xi}, i = 1, . . . , n, its average x̄ is seen as a central value about which the data is clustered. The significance of the average is greater if the clustering is tight, less otherwise. To measure the tightness of the clustering, we use the variance of the data:

s² = (1/n) ∑_i (xi − x̄)².

Variance is just the average of the squared distance from each data point to the average (of the data).

Therefore, in the analogous situation where we have a random variable X, if we wish to know how close to its expectation its values are likely to be, we again define a quantity called the variance of X:

var[X] = E[(X − E[X])²].

Alternate notation for variance is σX² or just σ². The quantity σX or σ, the square root of the variance, is called the standard deviation of X. Its advantage is that it is in the same units as X.


Exercise. Will a larger value of variance indicate tighter clustering around the mean?

Sometimes, it is convenient to use the following alternative formula for variance:

• var[X] = E[X2] − E[X]2.

This is obtained as follows.

var[X] = E[(X − E[X])2]

= E[X2 − 2E[X]X + E[X]2]

= E[X2] − 2E[ (E[X]X) ] + E[(E[X]2)]

= E[X2] − 2E[X]2 + E[X]2

= E[X2] − E[X]2.

Elementary properties of variance:

1. var[X + a] = var[X], if a is any constant.

2. var[aX] = a2 var[X], if a is any constant.

Exercise. Will it be correct to say that σaX = a σX for any constant a?

Exercise. Let X be a random variable with expectation µ and standard deviation σ. Then Z = (X − µ)/σ has expectation 0 and standard deviation 1.

Exercise. Suppose X has the discrete uniform distribution with range {0, . . . , n}. We have seen that E[X] = n/2. Show that its variance is n(n + 2)/12.

Exercise. Suppose X has the continuous uniform distribution with range [0, 1]. We have seen that E[X] = 1/2. Show that its variance is 1/12.
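Both exercises can be checked numerically. The sketch below is an added illustration (it assumes NumPy); it computes the discrete uniform variance by direct averaging over the equally likely values, and estimates the continuous uniform variance by simulation.

    import numpy as np

    # Discrete uniform on {0, ..., n}: var[X] = E[X^2] - E[X]^2.
    n = 10
    xs = np.arange(n + 1)
    var_discrete = np.mean(xs**2) - np.mean(xs)**2
    print(var_discrete, n * (n + 2) / 12)          # both 10.0 here

    # Continuous uniform on [0, 1]: estimate var[X] by simulation.
    rng = np.random.default_rng(0)
    sample = rng.uniform(0.0, 1.0, size=1_000_000)
    print(sample.var(), 1 / 12)                    # both close to 0.0833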


Example. Suppose X ∼ B(n, p). We know E[X] = np. Therefore,

E[X(X − 1)] = ∑_{k=0}^{n} k(k − 1) C(n, k) p^k (1 − p)^(n−k)
            = ∑_{k=2}^{n} [n!/((k − 2)!(n − k)!)] p^k (1 − p)^(n−k)
            = ∑_{k=2}^{n} n(n − 1) C(n − 2, k − 2) p^k (1 − p)^(n−k)
            = n(n − 1)p² ∑_{i=0}^{n−2} C(n − 2, i) p^i (1 − p)^((n−2)−i)
            = n(n − 1)p².

And so,

var[X] = E[X²] − E[X]²
       = E[X(X − 1)] + E[X] − E[X]²
       = n(n − 1)p² + np(1 − np)
       = np(1 − p).   □

Example. Suppose X ∼ N(µ, σ). We know E[X] = µ. Therefore,

var[X] = (1/(σ√(2π))) ∫_{−∞}^{∞} (x − µ)² e^(−(1/2)((x−µ)/σ)²) dx
       = (σ²/√(2π)) ∫_{−∞}^{∞} z² e^(−z²/2) dz   (substituting z = (x − µ)/σ).

Now we integrate by parts:

∫_{−∞}^{∞} z² e^(−z²/2) dz = ∫_{−∞}^{∞} z (z e^(−z²/2)) dz
                           = [−z e^(−z²/2)]_{−∞}^{∞} + ∫_{−∞}^{∞} e^(−z²/2) dz
                           = 0 + √(2π).

Therefore var[X] = σ². □


Figure 2: Normal approximation to a binomial distribution.


Thus, if X ∼ N(µ, σ), then Z = (X − µ)/σ has expectation 0 and standard deviation 1. In fact, Z is again normally distributed and hence has the standard normal distribution (it has parameters 0 and 1). Through this link, all questions about normal distributions can be converted to questions about the standard normal distribution. For instance, let X, Z be as above. Then:

P(X ≤ a) = P((X − µ)/σ ≤ (a − µ)/σ) = P(Z ≤ (a − µ)/σ).

Now we can also illustrate our earlier statement about how the normal distribution serves as a substitute for other distributions. Figure 2 compares a binomial distribution (n = 100 and p = 0.2) with the normal distribution with the same mean and variance (µ = np = 20 and σ² = np(1 − p) = 16).

Generally, the normal distribution is a good approximation to the binomial distribution if n is large. One criterion that is often used is that, for a reasonable approximation, we should have both np and n(1 − p) greater than 5.
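Here is a small numerical check of this approximation (an added sketch assuming SciPy; the choice of the range 15 to 25 is ours). It compares a binomial probability with the corresponding normal one for the parameters used in Figure 2.

    from scipy.stats import binom, norm

    n, p = 100, 0.2
    mu, sigma = n * p, (n * p * (1 - p)) ** 0.5        # mu = 20, sigma = 4

    # P(15 <= Y <= 25) exactly, and via the normal approximation
    # (with a continuity correction of 0.5 on each side).
    exact = binom.cdf(25, n, p) - binom.cdf(14, n, p)
    approx = norm.cdf(25.5, mu, sigma) - norm.cdf(14.5, mu, sigma)
    print(exact, approx)    # both roughly 0.83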


Figure 3: Lognormal density functions: (A) µ = 0, σ = 1, (B) µ = 1, σ = 1, (C) µ = 0, σ = 0.4.

8 Lognormal Distribution

• If Z ∼ N(µ, σ) then X = e^Z is called a lognormal random variable with parameters µ and σ.[5]

Exercise. Let Z ∼ N(µ, σ). Then

E[e^(tZ)] = e^(µt + σ²t²/2).

The t = 1, 2 cases of the Exercise immediately give the following:

• Let X be a lognormal variable with parameters µ and σ. Then

E[X] = e^(µ + σ²/2),   var[X] = e^(2µ + 2σ²) − e^(2µ + σ²).
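The following sketch (an added illustration using NumPy; the parameter values are ours) simulates X = e^Z for Z ∼ N(µ, σ) and compares the sample mean and variance with the formulas above.

    import numpy as np

    mu, sigma = 0.1, 0.4
    rng = np.random.default_rng(1)
    Z = rng.normal(mu, sigma, size=1_000_000)
    X = np.exp(Z)                                  # lognormal with parameters mu, sigma

    mean_formula = np.exp(mu + sigma**2 / 2)
    var_formula = np.exp(2 * mu + 2 * sigma**2) - np.exp(2 * mu + sigma**2)
    print(X.mean(), mean_formula)                  # both about 1.20
    print(X.var(), var_formula)                    # both about 0.25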

The lognormal distribution is used in Finance to model the variation of stock prices with time. Without explaining how, we show in Figure 4 an example of the kind of behaviour predicted by this model.

[5] The name comes from "The log of X is normal."


Figure 4: A simulation of the lognormal model for variation of stock prices with time.

9 Bivariate Distributions

So far we have dealt with individual random variables, i.e. with models for particular features of a population. The next step is to study the relationships that exist between different features of a population. For instance, an investor might like to know the nature and strength of the connection between her portfolio and a stock market index such as the NIFTY. If the NIFTY goes up, is her portfolio likely to do the same? How much of a rise is it reasonable to expect?

This leads us to the study of pairs of random variables and the probabilities associated with their joint values. Are high values of one associated with high or low values of the other? Is there a significant connection at all?

• Let S be a sample space and X, Y : S → R two random variables. Then we say that X, Y are jointly distributed.

• Let X, Y be jointly distributed discrete random variables. The joint distribution fX,Y of X and Y is defined by

fX,Y (x, y) = P(X = x, Y = y).

Since fX,Y is a function of two variables, we call it a bivariate distribution. It can be used to find any probability associated with X and Y . Let X have range {xi} and Y have range {yj}. Then:

1. P(a ≤ X ≤ b, c ≤ Y ≤ d) = ∑_{xi ∈ [a,b], yj ∈ [c,d]} fX,Y (xi, yj).

2. fX(x) = ∑_j fX,Y (x, yj) and fY (y) = ∑_i fX,Y (xi, y).

3. ∑_{i,j} fX,Y (xi, yj) = 1.

We will be interested in various combinations of X and Y . Therefore, consider a function g : R² → R. We use it to define a new random variable g(X, Y ) : S → R by

g(X, Y )(w) = g(X(w), Y (w)).

The expectation of this new random variable can be obtained, as usual, by multiplying its values with their probabilities:

E[g(X, Y )] = ∑_{i,j} g(xi, yj) fX,Y (xi, yj).

We create analogous definitions when X, Y are jointly distributed continuous random variables.

• Let X, Y be jointly distributed continuous random variables. Their joint probability density fX,Y is a function of two variables whose integrals give the probability of X, Y lying in any range:

P(a ≤ X ≤ b, c ≤ Y ≤ d) = ∫_a^b ∫_c^d fX,Y (x, y) dy dx.

Then the following are easy to prove:

1. ∫_{−∞}^{∞} fX,Y (x, y) dy = fX(x) and ∫_{−∞}^{∞} fX,Y (x, y) dx = fY (y).

2. ∫_{−∞}^{∞} ∫_{−∞}^{∞} fX,Y (x, y) dx dy = 1.

• Let X, Y be jointly distributed continuous random variables and g : R² → R. Then the expectation of g(X, Y ) is given by

E[g(X, Y )] = ∫_{−∞}^{∞} ∫_{−∞}^{∞} g(x, y) fX,Y (x, y) dx dy.

Some important special cases:

1. Suppose X, Y are jointly distributed discrete random variables. Then

E[X + Y ] = ∑_{i,j} (xi + yj) fX,Y (xi, yj)
          = ∑_i xi (∑_j fX,Y (xi, yj)) + ∑_j yj (∑_i fX,Y (xi, yj))
          = ∑_i xi fX(xi) + ∑_j yj fY (yj) = E[X] + E[Y ].

There is a similar proof when X, Y are continuous. Thus expectation always distributes over sums.

2. E[(X − µX)(Y − µY )] is called the covariance of X and Y and is denoted by cov[X, Y ] or σXY . If large values of X tend to go with large values of Y then covariance is positive. If they go with small values of Y , covariance is negative. A zero covariance indicates that there is no linear relationship between X and Y . (See Figure 5.)

3. We have the following identity:[6]

var[X + Y ] = var[X] + var[Y ] + 2 cov[X, Y ].

Exercise. Show that cov[X, Y ] = E[XY ] − E[X]E[Y ].

• The correlation coefficient of X, Y is defined to be

ρ = ρX,Y = cov[X, Y ]/(σX σY ).

[6] Compare this with the identity connecting the dot product and length of vectors: ||u + v||² = ||u||² + ||v||² + 2u · v. This motivates us to think of covariance as a kind of dot product between different random variables, with variance as squared length. Geometric analogies then lead to useful statistical insights and even proofs, such as the statement below that |ρ| ≤ 1.


Figure 5: Observed values of two jointly distributed standard normal variables with covariances 0.95, −0.6 and 0 respectively.

The advantage of the correlation coefficient is that it is not affected by the units used in the measurement. If we replace X by X′ = aX + b and Y by Y′ = cY + d (with a, c > 0), we will have

ρX′,Y′ = ρX,Y .

An interesting fact is that |cov[X, Y ]| ≤ σX σY .[7] This immediately implies:

• −1 ≤ ρX,Y ≤ 1.
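As an added illustration (not from the original notes; it assumes NumPy, and the particular dependence Y = 0.7X + noise is ours), the sketch below simulates a pair of jointly distributed variables and computes their covariance and correlation coefficient; the correlation stays between −1 and 1 and is unchanged by positive linear changes of units.

    import numpy as np

    rng = np.random.default_rng(2)
    X = rng.standard_normal(100_000)
    Y = 0.7 * X + rng.standard_normal(100_000)     # Y depends partly on X

    cov_XY = np.cov(X, Y, ddof=0)[0, 1]            # sample covariance, about 0.7
    rho_XY = np.corrcoef(X, Y)[0, 1]               # sample correlation, about 0.57
    print(cov_XY, rho_XY)

    # Changing units (X' = aX + b, Y' = cY + d with a, c > 0) leaves rho unchanged.
    print(np.corrcoef(2 * X + 3, 10 * Y - 4)[0, 1])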

Exercise. Suppose a die is tossed once. Let X take on the value 1 if the result is ≤ 4 and the value 0 otherwise. Similarly, let Y take on the value 1 if the result is even and the value 0 otherwise.

1. Show that the values of fX,Y are given by:

   X\Y    0      1
   0      1/6    1/6
   1      1/3    1/3

2. Show that µX = 2/3 and µY = 1/2.

3. Show that cov[X, Y ] = 0.

10 Conditional Probability and Distributions

Consider a sample space S with event algebra A and a probability function P : A → [0, 1]. Let A, B ∈ A be events. If we know B has occurred,

[7] This is the analogue of the geometric fact that |u · v| ≤ ||u|| ||v||.


what is the probability that A has also occurred? We reason that since we know B has occurred, in effect B has become our sample space. Therefore all probabilities of events inside B should be scaled by 1/P(B), to keep the total probability at 1. As for the occurrence of A, the points outside B are irrelevant, so our answer should be P(A ∩ B) times the correcting factor 1/P(B).

• The conditional probability of A, given B, is defined to be

P(A|B) = P(A ∩ B)/P(B).

We apply this idea to random variables:

• Let X, Y be jointly distributed random variables (both discrete or both continuous). Then the conditional probability distribution of Y , given X = x, is defined to be

fY |X=x(y) = fX,Y (x, y)/fX(x).

In the discrete case, we have fY |X=x(y) = P(Y = y|X = x).

Note that the conditional probability distribution is a valid probability distribution in its own right. For example, in the discrete case, we have

1. 0 ≤ fY |X=x(y) ≤ 1,

2. ∑_i fY |X=x(yi) = ∑_i fX,Y (x, yi)/fX(x) = fX(x)/fX(x) = 1.

Since fY |X=x is a probability distribution, we can use it to define expectations.

• The conditional expectation of Y , given X = x, is defined to be

E[Y |X = x] = ∑_i yi fY |X=x(yi) if X, Y are discrete, and

E[Y |X = x] = ∫_{−∞}^{∞} y fY |X=x(y) dy if X, Y are continuous.


Figure 6: Regression curve for two normal variables with ρ = 0.7.

• Note that E[Y |X = x] is a function of x. It is also denoted by µY |X=x or µY |x, and is called the curve of regression of Y on X.

The function E[Y |X = x] creates a new random variable E[Y |X]. Below, we calculate the expectation of this new random variable in the continuous case:

E[E[Y |X]] = ∫_{−∞}^{∞} E[Y |X = x] fX(x) dx
           = ∫_{−∞}^{∞} ∫_{−∞}^{∞} y fY |X=x(y) fX(x) dy dx
           = ∫_{−∞}^{∞} ∫_{−∞}^{∞} y fX,Y (x, y) dy dx
           = E[Y ].

Similar calculations can be carried out in the discrete case, so that we have the general result:

E[Y ] = E[E[Y |X]].

This result is useful when we deal with experiments carried out in stages, and we have information on how the results of one stage depend on those of the previous ones.
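A small simulation may make this concrete (an added sketch, not part of the original notes; the two-stage experiment is our own choice). First roll a fair die to get X, then toss X fair coins and let Y be the number of heads; then E[Y |X = x] = x/2, so E[Y ] = E[E[Y |X]] = E[X]/2 = 1.75.

    import numpy as np

    rng = np.random.default_rng(3)
    trials = 200_000

    X = rng.integers(1, 7, size=trials)     # stage 1: a fair die (values 1 to 6)
    Y = rng.binomial(X, 0.5)                # stage 2: toss X coins, count heads

    print(Y.mean())                         # close to E[E[Y|X]] = E[X]/2 = 1.75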


11 Independence

Let X, Y be jointly distributed random variables. We consider Y to be independent of X, if knowledge of the value taken by X tells us nothing about the value taken by Y . Mathematically, this means:

fY |X=x(y) = fY (y).

This is easily rearranged to:

fX,Y (x, y) = fX(x)fY (y).

Note that the last expression is symmetric in X, Y .

• Jointly distributed X, Y are independent if we have the identity:

fX,Y (x, y) = fX(x)fY (y).

Exercise. If X, Y are independent random variables and g : R² → R is any function of the form g(x, y) = m(x)n(y), then

E[g(X, Y )] = E[m(X)] E[n(Y )].

Exercise. If X, Y are independent, then cov[X, Y ] = 0.

A common error is to think zero covariance implies independence – in fact it only indicates the possibility of independence.

Exercise. If X, Y are independent, then var[X + Y ] = var[X] + var[Y ].

We return again to the normal distribution. The following facts about it are the key to its wide usability:

1. If X ∼ N(µ, σ) and a ≠ 0, then aX ∼ N(aµ, |a|σ).

2. If Xi ∼ N(µi, σi), i = 1, . . . , n, are pairwise independent, then ∑_i Xi ∼ N(µ, σ), where

µ = ∑_i µi,   σ² = ∑_i σi².

Of course, the mean and variance would add up like this for any collection of independent random variables. The important feature here is the preservation of normality.


Figure 7: Chi square distributions with various degrees of freedom n.

12 Chi Square Distribution

We have remarked that measurements of continuously varying quantities tend to follow normal distributions. Therefore, a consistent method of measurement will then amount to looking at a sequence of normal variables. The errors in the measurements can be treated as a sequence of independent normal variables with mean zero. By scaling units, we can take them to be standard normal.

Suppose we consider independent standard normal variables X1, . . . , Xn. If we view them as representing a sequence of errors, it is natural to ask if, in total, the errors are large or small. The sum ∑_i Xi won't do as a measure of this, because individual Xi could take on large values yet cancel out to give a small value of the sum. This problem can be avoided by summing either |Xi| or Xi². The latter choice is more tractable.

• Let X1, . . . , Xn be independent standard normal variables. The variable

X = ∑_{i=1}^{n} Xi²

is called a chi square random variable with n degrees of freedom. We write X ∼ χ²(n), and say it has the chi square distribution.

Exercise. Let X be a standard normal variable. Show that:

1. E[X2] = 1.

2. E[X4] = 3.


Figure 8: Density functions of chi square and normal distributions

3. var[X2] = 2.

Exercise. Let X ∼ χ2(n). Show that E[X] = n and var[X] = 2n.
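Both statements can be checked by simulation. The sketch below is an added illustration (it assumes NumPy and SciPy): it builds chi square values as sums of squared standard normals and compares the sample mean and variance with n and 2n.

    import numpy as np
    from scipy.stats import chi2

    n, reps = 5, 200_000
    rng = np.random.default_rng(4)

    # Each row is (X_1, ..., X_n); summing the squares gives one chi square value.
    Z = rng.standard_normal((reps, n))
    X = (Z**2).sum(axis=1)

    print(X.mean(), n)                   # about 5
    print(X.var(), 2 * n)                # about 10
    print(chi2.mean(n), chi2.var(n))     # exact values from SciPy: 5.0 and 10.0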

Figure 7 shows chi square distributions with different degrees of freedom. Note that as the degrees of freedom increase, the distributions look more and more normal. Figure 8 compares the χ²(30) and N(30, √60) distributions: they are very close to each other, and so a chi square distribution χ²(n) with n ≥ 30 is usually replaced by a normal approximation N(n, √(2n)).

• If X ∼ χ²(m) and Y ∼ χ²(n) are independent, then X + Y is a sum of m + n squared independent standard normal variables. Therefore X + Y ∼ χ²(m + n).

• The converse is also true. If X, Y are independent, X ∼ χ²(m) and X + Y ∼ χ²(m + n), then Y ∼ χ²(n).

13 Random Samples

The subject of statistical inference is concerned with the task of using observed data to draw conclusions about the distribution of properties in a population. The main obstruction is that it may not be possible (or even desirable) to observe all the members of the population. We are forced to draw conclusions by observing only a fraction of the members, and these conclusions are necessarily probabilistic rather than certain.

In such situations, probability enters at two levels. Let us illustrate this by a familiar example – the use of opinion polls. An opinion poll consists of a rather small number of interviews, and from the opinions observed in these, the pollster extrapolates the likely distribution of opinions in the whole population. Thus, one may come across a poll which, after mailing forms to 10,000 people and getting responses from about half of them, concludes that 42% (of a population of 200 million) support candidate A, while 40% support B, and the remainder are undecided. These conclusions are not certain and come with error limits, say of 3%. Typically, this is done by reviewing all the possible distributions of opinions, and finding out which one is most likely to lead to the observations. This is one level at which probability is used. The other level is in evaluating the confidence with which we can declare our conclusions – the error limit of 3% is not certain either, but has a probability attached, say 90%. If we wish to have an error bar we are more certain about (say 95%), we would have to raise the bar (say, to 5%).

Sampling is the act of choosing certain members of a population, and then measuring their properties. Since the results depend on the choices and are not known beforehand, each measurement is naturally represented by a random variable. We will also make the usually reasonable assumptions that the measurements do not disturb the population, and that each choice is independent of the previous ones.

• A random sample is a finite sequence of jointly distributed random variables X1, . . . , Xn such that

1. Each Xi has the same probability density: fXi = fXj for all i, j.

2. The Xi are independent: fX1,...,Xn(x1, . . . , xn) = ∏_i fXi(xi).

The common density function fX for all the Xi is called the density function of the population. We also say that we are sampling from a population of type X.


14 Sample Mean and Variance

Broadly, the task of Statistics is to estimate the type of a population, as well as the associated parameters. Thus we might first try to establish that a population can be reasonably described by a binomial variable, and then estimate n and p. The parameters of a population can be estimated in various ways, but are most commonly approached through the mean and variance. For instance, if we have estimates µ̂ and σ̂² for the mean and variance of a binomial population, we could then use µ = np and σ² = np(1 − p) to estimate n and p.

• Let X1, . . . , Xn be a random sample. Its sample mean is a random variable X̄ defined by

X̄ = (1/n) ∑_{i=1}^{n} Xi.

Observed values of the sample mean are used as estimates of the population mean µ. Therefore, we need to be reassured that, on average at least, we will see the right value:

E[X̄] = (1/n) ∑_{i=1}^{n} E[Xi] = (1/n) ∑_{i=1}^{n} µ = µ.

Moreover, we would like the variance of X̄ to be small so that its values are more tightly clustered around µ. We have

var[X̄] = (1/n²) ∑_{i=1}^{n} var[Xi] = σ²/n,

where σ² is the population variance. Thus the variance of the sample mean goes to zero as the sample size increases: the sample mean becomes a more reliable estimator of the population mean.

E[X̄] = µ,   var[X̄] = σ²/n.

• Let X1, . . . , Xn be a random sample. Its sample variance is a random variable S² defined by

S² = (1/(n − 1)) ∑_{i=1}^{n} (Xi − X̄)².

∑_i (Xi − X̄)² = ∑_i (Xi² + X̄² − 2 Xi X̄) = ∑_i Xi² − n X̄².

E[∑_i (Xi − X̄)²] = ∑_i E[Xi²] − n E[X̄²]
                 = ∑_i (σ² + µ²) − n (var[X̄] + E[X̄]²)
                 = n(σ² + µ²) − σ² − nµ² = (n − 1)σ².

Hence E[S²] = σ².

A rather longer calculation, which we do not include, shows that:

var[S²] = 2σ⁴/(n − 1).

Again, we see that sample variance has the right average (σ²) and clusters more tightly around it if we use larger samples.
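The following simulation is an added sketch (not part of the original notes; it assumes NumPy, and the population parameters are ours). It draws many random samples from a known population and shows that the averages of X̄ and S² land on µ and σ², with S² computed using the n − 1 divisor.

    import numpy as np

    mu, sigma, n, reps = 2.0, 1.5, 10, 100_000
    rng = np.random.default_rng(5)

    samples = rng.normal(mu, sigma, size=(reps, n))
    xbar = samples.mean(axis=1)                 # one sample mean per sample
    s2 = samples.var(axis=1, ddof=1)            # sample variance with the n-1 divisor

    print(xbar.mean(), mu)                      # E[X bar] = mu
    print(xbar.var(), sigma**2 / n)             # var[X bar] = sigma^2 / n = 0.225
    print(s2.mean(), sigma**2)                  # E[S^2] = sigma^2 = 2.25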

15 Large Sample Approximations

Chebyshev's Inequality. Let X be any random variable, with mean µ and standard deviation σ. Then for any r > 0,

P(|X − µ| ≥ rσ) ≤ 1/r².

Proof. We give the proof for a discrete random variable X with range {xi}.

σ² = ∑_i (xi − µ)² fX(xi)
   ≥ ∑_{i: |xi−µ| ≥ rσ} (xi − µ)² fX(xi)
   ≥ r²σ² ∑_{i: |xi−µ| ≥ rσ} fX(xi)
   = r²σ² P(|X − µ| ≥ rσ).

Rearranging gives the desired inequality. □

Weak Law of Large Numbers. Let X1, . . . , Xn be a random sample from a population with mean µ and standard deviation σ. Then for any c > 0:

P(|X̄ − µ| ≥ c) ≤ σ²/(nc²).

In particular, the probability diminishes to zero as n → ∞.

Proof. Apply Chebyshev's inequality to X̄, noting that it has mean µ and standard deviation σ/√n:

P(|X̄ − µ| ≥ rσ/√n) ≤ 1/r².

Substituting c = rσ/√n gives the Weak Law. □

The Weak Law allows us to make estimates of the mean of arbitrary accuracy and certainty, by just increasing the sample size.

A remarkable feature of the sample mean is that as the sample size increases, its distribution looks more and more like a normal distribution, regardless of the population distribution! We give an illustration of this in Figure 9.

Central Limit Theorem. Let X1, . . . , Xn be a random sample from a population with mean µ and variance σ². Then the sample mean X̄ is more normally distributed as n increases: as n → ∞, the distribution of

(X̄ − µ)/(σ/√n)

tends to N(0, 1).

When we work with large samples,[8] the Central Limit Theorem allows us to use the normal approximation to make sharper error estimates (relative to the Weak Law). We will not prove this result, but we shall use it frequently.

[8] A commonly used definition of "large" is n ≥ 30.

Figure 9: The histogram represents the means of 5000 samples of size 100, from a binomial population with parameters n = 10 and p = 0.2. It is matched against a normal distribution with µ = 2 and σ = √0.016.

Example. Suppose we are trying to estimate the mean of a population whose variance is known to be σ² = 1.6 (as in Figure 9). Using samples of size 100, we want error bars that are 90% certain. Then the Weak Law suggests we find a c such that

1.6/(100c²) = 0.1, or c = 0.4.

Thus we are 90% sure that any particular observed mean is within 0.4 of the actual population mean.

If we consider the particular data that led to the histogram of Figure 9, we find that in fact 90% of the observed sample means are within 0.21 of the actual mean, which is 2. Additionally, 99.9% of the observed means are within 0.5 of 2. So the Weak Law estimates are correct, but inefficient.

On the other hand, taking 100 as a large sample size, we see from the Central Limit Theorem that Z = 7.905(X̄ − µ) should be approximately a standard normal variable. Therefore we are 90% sure that its observed values are within 1.65 of 0. So we are 90% sure that any observed sample mean is within 1.65/7.905 = 0.209 of the actual population mean. Similarly, we are 99.9% sure that any observed mean is within 0.5 of the actual. Thus, the Central Limit Theorem provides estimates which are close to reality (in this case, identical). □
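The two 90% error bars in this example can be reproduced with a few lines of code (an added sketch; it assumes SciPy). The Weak Law bar comes from the bound σ²/(nc²), while the CLT bar uses the standard normal point z with P(|Z| < z) = 0.9.

    from scipy.stats import norm

    var, n, alpha = 1.6, 100, 0.10

    # Weak Law: choose c with var/(n c^2) = alpha.
    c_weak = (var / (n * alpha)) ** 0.5
    print(c_weak)                          # 0.4

    # Central Limit Theorem: |Xbar - mu| < z * sigma/sqrt(n) with probability 1 - alpha.
    z = norm.ppf(1 - alpha / 2)            # about 1.645
    c_clt = z * (var / n) ** 0.5
    print(c_clt)                           # about 0.208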

16 Point Estimation

We now start our enquiry into the general process of estimating statistical parameters – so far we have become familiar with estimating mean and variance via sample mean and sample variance.

• A statistic of a random sample X1, . . . , Xn is any function

Θ = f(X1, . . . , Xn).

Thus a statistic is itself a random variable. X̄ and S² are examples of statistics. The probability distribution of a statistic is called its sampling distribution.

• A statistic Θ is called an estimator of a population parameter θ if values of the statistic are used as estimates of θ.

X̄ is an estimator of the population mean, and S² is an estimator of the population variance.

• If a statistic Θ is an estimator of a population parameter θ, then any observed value of Θ is denoted θ̂ and called a point estimate of θ.

According to this convention, observed values of X̄ should be denoted by µ̂, and of S² by σ̂². We will also denote them by x̄ and s², respectively.

• A statistic Θ is an unbiased estimator of a population parameter θ if

E[Θ] = θ.

X̄ is an unbiased estimator of the population mean, and S² is an unbiased estimator of the population variance.

Exercise. Suppose we use S = √(S²) as an estimator of the standard deviation of the population. Is it an unbiased estimator?

In general, if Θ is an estimator of θ, the gap E[Θ] − θ is called its bias.

• Θ is an asymptotically unbiased estimator of θ if the bias goes to 0 as the sample size goes to infinity.

Exercise. Show the following statistics are asymptotically unbiased estimators:

1. (1/n) ∑_i (Xi − X̄)², of σ².

2. S = √(S²), of σ.

• Θ is a consistent estimator of θ if for any c > 0,

P(|Θ − θ| ≥ c) → 0 as n → ∞,

where n is the sample size.

The Weak Law of Large Numbers states that X̄ is a consistent estimator of µ. The proof of the Weak Law can be trivially generalized to yield the following result:

Theorem. Let Θ be an unbiased estimator for θ, such that var[Θ] → 0 as the sample size n → ∞. Then Θ is a consistent estimator of θ. □

It follows that S2 is a consistent estimator of σ2.

17 Method of Moments

Having created the abstract notion of an estimator of a population parameter, we now face the problem of creating such estimators. The Method of Moments is based on the idea that parameters of a population can be recovered from its moments, and for the moments there are certain obvious estimators.

• Given a random variable X, its rth moment is defined to be

µr = E[X^r].

Note that µX = µ1 and σX² = µ2 − µ1².

• Consider a population with probability distribution fX. Then the statistic

Mr = (1/n) ∑_{i=1}^{n} Xi^r

is the Method of Moments estimator for the population moment µr.

We have M1 = X̄. Further, the identity σX² = µ2 − µ1² leads to the use of M2 − M1² as an estimator of σ².

Exercise. M2 − M1² = (1/n) ∑_i (Xi − X̄)².

The Method of Moments is easy to implement. Its drawback is that it comes with no guarantees about the performance of the estimators that it creates.

18 Maximum Likelihood Estimators

Maximum Likelihood Estimation (MLE) will be our main tool for generating estimators. The reason lies in the following guarantee:

Fact. Maximum Likelihood Estimators are asymptotically minimum variance and unbiased. Thus, for large samples, they are essentially the best estimators.

Before describing the method, we ask the reader to consider the following graph and question:

[Graph: two conjectured probability densities for the population.]

Question. The graphs above represent two conjectured distributions for a population. If we pick a random member of the population and it turns out to be 2, which of these conjectures appears more plausible? What if the random member turns out to be −1? 1?

• Consider a population whose probability density f( · ; t) depends on a parameter t. We draw a random sample X1, . . . , Xn from this population. The corresponding likelihood function is defined to be

L(x1, . . . , xn; t) = ∏_i f(xi; t),

where xi is a value of Xi. Thus L is the joint probability density of the random sample.

• Suppose we observe the values x1, . . . , xn of a random sample X1, . . . , Xn. The corresponding maximum likelihood estimate of t is the value t̂ that maximizes L(x1, . . . , xn; t).

Example. Suppose we have a binomial population with n = 1 and we wish to estimate p using a sample of size m. The corresponding likelihood function is:

L(x1, . . . , xm; p) = ∏_i C(1, xi) p^(xi) (1 − p)^(1−xi)
                    = (∏_{i: xi=1} p) (∏_{i: xi=0} (1 − p))
                    = p^(∑_i xi) (1 − p)^(m − ∑_i xi)
                    = p^(m x̄) (1 − p)^(m(1 − x̄)).

To maximize L, we use Calculus:

dL/dp = 0 ⟹ p = x̄.

So the sample mean X̄ is the ML Estimator of p. □

Example. Suppose we have a normal population whose σ is known, and we want the maximum likelihood estimator of µ. The likelihood function is

L(x1, . . . , xn; µ) = ∏_i (1/(σ√(2π))) e^(−(1/2)((xi−µ)/σ)²)
                    = (1/(σ√(2π))^n) e^(−(1/2) ∑_i ((xi−µ)/σ)²).

Matters simplify if we take the logarithms:

ln L(x1, . . . , xn; µ) = −n ln(σ√(2π)) − (1/2) ∑_i ((xi − µ)/σ)².

We differentiate with respect to µ:

0 = d(ln L)/dµ = (1/σ²) ∑_i (xi − µ)  ⟹  µ = x̄.

So the sample mean is the ML Estimator of µ. □

Exercise. Suppose we have a normal population whose µ is known. Show that the ML Estimator of σ² is ((n − 1)/n) S².

The MLE technique can also be applied when many parameters t1, . . . , tk have to be estimated. Then the likelihood function has the form

L(x1, . . . , xn; t1, . . . , tk)

and we again seek the values t̂1, . . . , t̂k which maximize it.

Exercise. Suppose we have a normal population and neither parameter is known. Find their ML Estimators.

19 Sampling Distributions

In earlier sections, we have explored the characteristics of the sample mean and variance. We could give their expectation and variance, and for large samples the Central Limit Theorem gives a close approximation to the actual distribution of X̄. Other than this, we do not have an explicit description of the sampling distributions of X̄ and S². However, if the population is known to be normal, we can do better.

Theorem. Let X1, . . . , Xn be a random sample from a normal population. Then

1. X̄ and S2 are independent.

2. X̄ ∼ N(µ, σ/√n).

3. (n − 1)S²/σ² ∼ χ²(n − 1).

Proof. We skip the proof of the first item – it is hard and no elementary Statistics book contains it! The next item is trivial, since linear combinations of independent normal variables are again normal. For the last claim, we start with the identity

(n − 1)S² = ∑_i Xi² − n X̄² = ∑_i (Xi − µ)² − n(X̄ − µ)².

Hence,

∑_i ((Xi − µ)/σ)² = (n − 1)S²/σ² + ((X̄ − µ)/(σ/√n))².

Let us recall the following fact: If Z, Y are independent, Z ∼ χ²(k) and Y + Z ∼ χ²(k + m), then Y ∼ χ²(m). We can apply this to the last equation with k = 1 and k + m = n, to get (n − 1)S²/σ² ∼ χ²(n − 1). □

Exercise. Let S² be the sample variance of a sample of size n from an N(µ, σ) population. Show that

var[S²] = 2σ⁴/(n − 1).


Figure 10: This diagram illustrates the independence of X̄ and S² for a normal population. It shows (x̄, s²) pairs for a thousand samples (each of size 100) from a standard normal population.


Figure 11: The graphs compare t distributions with the standard normal distribution. We have A: t(1), B: t(10), C: N(0, 1).


t Distribution

• Consider independent random variables Y, Z, where Y ∼ χ²(n) and Z ∼ N(0, 1). Then the random variable

T = Z/√(Y/n)

is said to have a t distribution with n degrees of freedom. We write T ∼ t(n). This definition is motivated by the following example:

Example. Suppose we have a random sample X1, . . . , Xn from a normal population, and we seek to estimate µ. Then we have:

Z = (X̄ − µ)/(σ/√n) ∼ N(0, 1).

We can't use Z directly because σ is also unknown. One option is to replace σ by its estimator S, but then the distribution is no longer normal. However, we recall that

Y = (n − 1)S²/σ² ∼ χ²(n − 1).

Further, since X̄, S² are independent, so are Y and Z. Therefore

Z/√(Y/(n − 1)) ∼ t(n − 1).

To round off the example, note that

Z/√(Y/(n − 1)) = (X̄ − µ)/(S/√n). □

Figure 11 shows that the t distribution is only needed for small sample sizes. Beyond n = 30, the t distributions are essentially indistinguishable from the standard normal one, and so we can directly use the observed value s (of S) for σ and work with the standard normal distribution.


Figure 12: The first graph shows various F distributions (A: F(3, 10), B: F(10, 10), C: F(50, 50)). The second graph compares the F(200, 200) distribution (A) and the normal distribution with the same mean and standard deviation (B).

F Distribution

• Suppose X1, X2 are independent random variables, and X1 ∼ χ²(n1), X2 ∼ χ²(n2). Then the random variable

F = (X1/n1)/(X2/n2)

is said to have an F distribution with n1, n2 degrees of freedom. We write F ∼ F(n1, n2).

Example. Suppose we wish to compare the variances of two normal populations. We could do this by estimating their ratio. Therefore, for i = 1, 2, let Si² be the sample variance of a random sample of size ni from the ith population. Then

(ni − 1)Si²/σi² ∼ χ²(ni − 1).

Therefore, on taking the ratio and simplifying,

(S1²/σ1²)/(S2²/σ2²) ∼ F(n1 − 1, n2 − 1).

Given observations of Si, we can use this relation to estimate σ1/σ2. □

From Figure 12 we see that F distributions also become more normal with larger sample sizes. However, they do so much more slowly than χ² or t distributions.

You may have noted that we have not provided the density functions for the χ², t and F distributions. Their density functions are not very complicated, but they are rarely used directly. What we really need are their direct integrals (to get probabilities) and these do not have closed form formulae! Instead, we have to look up the values from tables (provided at the end of every Statistics book) or use computational software such as Mathematica.

Remark. From the definition of the F distribution, it is clear that F ∼ F(m1, m2) implies 1/F ∼ F(m2, m1). So tables for the F distribution are printed only for the case m1 ≤ m2.

20 Confidence Intervals

An interval estimate for a population parameter t is an interval [t̂1, t̂2] where t is predicted to lie. The numbers t̂1, t̂2 are obtained from a random sample: so they are values of statistics T1 and T2. Let

P(T1 < t < T2) = 1 − α.

Then 1 − α is called the degree of confidence and t̂1, t̂2 are the lower and upper confidence limits. The interval (t̂1, t̂2) is called a (1 − α)100% confidence interval.

To obtain a confidence interval for a parameter t we need to create a statistic T in which t is the only unknown. If T is of one of the standard types (normal, chi square, etc.) we can use tables or statistical software to find t1, t2 such that

P(t1 < T < t2) = 1 − α.

By rearranging t1 < T < t2 in the form T1 < t < T2 we obtain a (1 − α)100% confidence interval for t.

Confidence Interval For Mean

Example. Suppose we need to estimate the mean µ of a normal population whose standard deviation σ is known. If we have a random sample of size n, then

X̄ ∼ N(µ, σ/√n).


Figure 13: The point z such that P(Z > z) = α/2, also satisfies P(|Z| < z) = 1−α.

If we desire a degree of confidence 1 − α, we need a c such that

P(|X̄ − µ| < c) = 1 − α.

Then (x̄ − c, x̄ + c) will be a confidence interval for µ with degree of confidence 1 − α.

Define Z = (X̄ − µ)/(σ/√n). This is a standard normal variable. Also,

P(|X̄ − µ| < c) = P(|Z| < c√n/σ).

The problem can be finished off by locating z = c√n/σ such that (see Figure 13):

P(Z > z) = α/2.

For example, suppose we have n = 150, σ = 6.2 and α = 0.01. Then we have z = 2.575, since

(1/√(2π)) ∫_{2.575}^{∞} e^(−x²/2) dx = 0.005.

From z = c√n/σ we obtain

c = (2.575 × 6.2)/√150 = 1.30.


Figure 14: A one-sided confidence interval.

Hence (x̄ − 1.30, x̄ + 1.30) is a 99% confidence interval for µ. □
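The numbers in this example can be reproduced as follows (an added sketch assuming SciPy); norm.ppf gives the point z with P(Z > z) = α/2.

    from scipy.stats import norm

    n, sigma, alpha = 150, 6.2, 0.01

    z = norm.ppf(1 - alpha / 2)        # about 2.575
    c = z * sigma / n**0.5             # about 1.30
    print(z, c)

    # With an observed sample mean xbar, the 99% confidence interval is (xbar - c, xbar + c).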

Let us note some features of this example:

1. We found a confidence interval which was symmetric about the point estimate x̄. We can also find asymmetric ones. In particular, we have the one-sided confidence interval (−∞, x̄ + zσ/√n), where z is chosen to satisfy P(Z > z) = α. (See Figure 14.)

2. We assumed a normal population. This can be bypassed with the help of the Central Limit Theorem and a large sample size n, for then X̄ will be normally distributed.

3. We assumed σ was somehow known. If not, we can work with the t distribution T = (X̄ − µ)/(S/√n). Further, if n ≥ 30, we can just use the observed value s of S for σ.

The next example illustrates the last item.

Example. Suppose we have a small random sample of size n from a normal population, N(µ, σ). Then

T = (X̄ − µ)/(S/√n) ∼ t(n − 1).

To obtain a confidence interval for µ, with degree of confidence 1 − α, we seek a value t such that

P( |T | ≤ t ) = 1 − α.

For then, the interval

{ µ : −t ≤ (x̄ − µ)/(s/√n) ≤ t } = ( x̄ − t s/√n, x̄ + t s/√n )

is the required confidence interval for µ.

For instance, suppose the random sample consists of the following numbers: 2.3, 1.9, 2.1, 2.8, 2.3, 3.6, 1.4, 1.8, 2.1, 3.2, 2.0 and 1.9. Then we have

x̄ = 2.28 and s² = 0.39.

Also, T ∼ t(11). If we want a confidence level of 95%, then we can take t = 2.201 since P(T ≥ 2.201) = 0.025 = α/2 (and T is symmetric about the y-axis). So the 95% confidence interval is (1.89, 2.68). □
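For the data above, the whole calculation can be done in a few lines (an added sketch assuming NumPy and SciPy); it reproduces x̄, s² and the 95% interval.

    import numpy as np
    from scipy.stats import t

    data = np.array([2.3, 1.9, 2.1, 2.8, 2.3, 3.6, 1.4, 1.8, 2.1, 3.2, 2.0, 1.9])
    n = len(data)

    xbar = data.mean()                    # about 2.28
    s2 = data.var(ddof=1)                 # about 0.39
    tval = t.ppf(0.975, df=n - 1)         # about 2.201

    half_width = tval * (s2 / n) ** 0.5
    print(xbar - half_width, xbar + half_width)    # about (1.89, 2.68)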

Confidence Interval For Variance

Consider the task of estimating the variance σ² of a normal population. If we have a random sample of size n, then

(n − 1)S²/σ² ∼ χ²(n − 1).

We take x1, x2 such that

P((n − 1)S²/σ² ≥ x1) = 1 − α/2   and   P((n − 1)S²/σ² ≥ x2) = α/2,

so that P(x1 < (n − 1)S²/σ² < x2) = 1 − α. Therefore,

P( ((n − 1)/x2) S² < σ² < ((n − 1)/x1) S² ) = 1 − α.

Thus,

( ((n − 1)/x2) s², ((n − 1)/x1) s² )

is a (1 − α)100% confidence interval for σ².

Figure 15: Choosing a confidence interval for variance, using a chi square distribution.

Confidence Interval For Difference of Means

We now consider the problem of comparing two different populations: We do this by estimating either the difference or the ratio of their population parameters such as mean and variance.

Example. Consider independent normal populations with parameters (µi, σi), i = 1, 2, respectively. We will estimate µ1 − µ2, assuming the variances are known. Now,

X̄i ∼ N(µi, σi/√ni).

Assuming the random samples to be independent, we have

X̄1 − X̄2 ∼ N(µ1 − µ2, σ), where σ = √(σ1²/n1 + σ2²/n2).

Therefore,

Z = ((X̄1 − X̄2) − (µ1 − µ2))/σ ∼ N(0, 1).

We are now in a familiar situation. We choose z so that

P(Z > z) = α/2.

Then, P(|Z| < z) = 1 − α.

Hence, P(|(X̄1 − X̄2) − (µ1 − µ2)| < σz) = 1 − α.

And so, (x̄1 − x̄2 − σz, x̄1 − x̄2 + σz) is a (1 − α)100% confidence interval for the difference of means. □

If the variances are not known, we can take large samples (ni ≥ 30) and use si² for σi². If even that is not possible, we need to at least assume that σ1 = σ2 = σ. Then

Z = ((X̄1 − X̄2) − (µ1 − µ2)) / (σ √(1/n1 + 1/n2)) ∼ N(0, 1).

For σ², we have the pooled estimator

Sp² = ((n1 − 1)S1² + (n2 − 1)S2²)/(n1 + n2 − 2),

which is just the overall sample variance of the pooled samples. We have

(ni − 1)Si²/σ² ∼ χ²(ni − 1),   i = 1, 2.

Hence

Y = ∑_{i=1}^{2} (ni − 1)Si²/σ² ∼ χ²(n1 + n2 − 2).

Therefore

T = Z/√(Y/(n1 + n2 − 2)) ∼ t(n1 + n2 − 2).

Now T can also be expressed as:

T = ((X̄1 − X̄2) − (µ1 − µ2)) / (Sp √(1/n1 + 1/n2)).

So we see that the t distribution can be used to find confidence intervals for µ1 − µ2. For example, suppose we find t such that

P(T ≥ t) = α/2.

Then P(|T| ≤ t) = 1 − α.

This can be rearranged into

P(|(X̄1 − X̄2) − (µ1 − µ2)| ≤ t Sp √(1/n1 + 1/n2)) = 1 − α.

Therefore,

( x̄1 − x̄2 − t sp √(1/n1 + 1/n2),  x̄1 − x̄2 + t sp √(1/n1 + 1/n2) )

is a (1 − α)100% confidence interval for µ1 − µ2.

Confidence Interval For Ratio Of Variances

For comparing the variances of two normal populations, it is convenient to look at their ratio through the F distribution.

Example. Suppose we have independent random samples from N(µi, σi) populations, with i = 1, 2. Let ni be the sample size and Si² the sample variance for the sample from the ith population. Then

(ni − 1)Si²/σi² ∼ χ²(ni − 1),

and hence

F = (S1²/σ1²)/(S2²/σ2²) ∼ F(n1 − 1, n2 − 1).

Therefore, we look for f1, f2 such that

P(F > f1) = 1 − α/2   and   P(F > f2) = α/2,

so that P(f1 < F < f2) = 1 − α.

This yields the (1 − α)100% confidence interval

( (s1²/s2²)(1/f2), (s1²/s2²)(1/f1) )

for σ1²/σ2². □

References

1. A.M. Mood, F.A. Graybill and D.C. Boes, Introduction to the Theory of Statistics, Third Edition, Tata McGraw-Hill, New Delhi.

2. J.E. Freund, Mathematical Statistics, Fifth Edition, Prentice-Hall India.