
M2S1 Lecture Notes

G. A. Young
http://www2.imperial.ac.uk/~ayoung

September 2011


Contents

1 DEFINITIONS, TERMINOLOGY, NOTATION
  1.1 EVENTS AND THE SAMPLE SPACE
    1.1.1 OPERATIONS IN SET THEORY
    1.1.2 MUTUALLY EXCLUSIVE EVENTS AND PARTITIONS
  1.2 THE σ-FIELD
  1.3 THE PROBABILITY FUNCTION
  1.4 PROPERTIES OF P(.): THE AXIOMS OF PROBABILITY
  1.5 CONDITIONAL PROBABILITY
  1.6 THE THEOREM OF TOTAL PROBABILITY
  1.7 BAYES' THEOREM
  1.8 COUNTING TECHNIQUES
    1.8.1 THE MULTIPLICATION PRINCIPLE
    1.8.2 SAMPLING FROM A FINITE POPULATION
    1.8.3 PERMUTATIONS AND COMBINATIONS
    1.8.4 PROBABILITY CALCULATIONS

2 RANDOM VARIABLES & PROBABILITY DISTRIBUTIONS
  2.1 RANDOM VARIABLES & PROBABILITY MODELS
  2.2 DISCRETE RANDOM VARIABLES
    2.2.1 PROPERTIES OF MASS FUNCTION fX
    2.2.2 CONNECTION BETWEEN FX AND fX
    2.2.3 PROPERTIES OF DISCRETE CDF FX
  2.3 CONTINUOUS RANDOM VARIABLES
    2.3.1 PROPERTIES OF CONTINUOUS FX AND fX
  2.4 EXPECTATIONS AND THEIR PROPERTIES
  2.5 INDICATOR VARIABLES
  2.6 TRANSFORMATIONS OF RANDOM VARIABLES
    2.6.1 GENERAL TRANSFORMATIONS
    2.6.2 1-1 TRANSFORMATIONS
  2.7 GENERATING FUNCTIONS
    2.7.1 MOMENT GENERATING FUNCTIONS
    2.7.2 KEY PROPERTIES OF MGFS
    2.7.3 OTHER GENERATING FUNCTIONS
  2.8 JOINT PROBABILITY DISTRIBUTIONS
    2.8.1 THE CHAIN RULE FOR RANDOM VARIABLES
    2.8.2 CONDITIONAL EXPECTATION AND ITERATED EXPECTATION
  2.9 MULTIVARIATE TRANSFORMATIONS
  2.10 MULTIVARIATE EXPECTATIONS AND COVARIANCE
    2.10.1 EXPECTATION WITH RESPECT TO JOINT DISTRIBUTIONS
    2.10.2 COVARIANCE AND CORRELATION
    2.10.3 JOINT MOMENT GENERATING FUNCTION
    2.10.4 FURTHER RESULTS ON INDEPENDENCE
  2.11 ORDER STATISTICS

3 DISCRETE PROBABILITY DISTRIBUTIONS

4 CONTINUOUS PROBABILITY DISTRIBUTIONS

5 MULTIVARIATE PROBABILITY DISTRIBUTIONS
  5.1 THE MULTINOMIAL DISTRIBUTION
  5.2 THE DIRICHLET DISTRIBUTION
  5.3 THE MULTIVARIATE NORMAL DISTRIBUTION

6 PROBABILITY RESULTS & LIMIT THEOREMS
  6.1 BOUNDS ON PROBABILITIES BASED ON MOMENTS
  6.2 THE CENTRAL LIMIT THEOREM
  6.3 MODES OF STOCHASTIC CONVERGENCE
    6.3.1 CONVERGENCE IN DISTRIBUTION
    6.3.2 CONVERGENCE IN PROBABILITY
    6.3.3 CONVERGENCE IN QUADRATIC MEAN

7 STATISTICAL ANALYSIS
  7.1 STATISTICAL SUMMARIES
  7.2 SAMPLING DISTRIBUTIONS
  7.3 HYPOTHESIS TESTING
    7.3.1 TESTING FOR NORMAL SAMPLES - THE Z-TEST
    7.3.2 HYPOTHESIS TESTING TERMINOLOGY
    7.3.3 THE t-TEST
    7.3.4 TEST FOR σ
    7.3.5 TWO SAMPLE TESTS
  7.4 POINT ESTIMATION
    7.4.1 ESTIMATION TECHNIQUES I: METHOD OF MOMENTS
    7.4.2 ESTIMATION TECHNIQUES II: MAXIMUM LIKELIHOOD
  7.5 INTERVAL ESTIMATION
    7.5.1 PIVOTAL QUANTITY
    7.5.2 INVERTING A TEST STATISTIC


CHAPTER 1

DEFINITIONS, TERMINOLOGY, NOTATION

1.1 EVENTS AND THE SAMPLE SPACE

Definition 1.1.1 An experiment is a one-off or repeatable process or procedure for which
(a) there is a well-defined set of possible outcomes;
(b) the actual outcome is not known with certainty.

Definition 1.1.2 A sample outcome, ω, is precisely one of the possible outcomes of an experiment.

Definition 1.1.3 The sample space, Ω, of an experiment is the set of all possible outcomes.

NOTE: Ω is a set in the mathematical sense, so set theory notation can be used. For example, if the sample outcomes are denoted ω1, ..., ωk, say, then

Ω = {ω1, ..., ωk} = {ωi : i = 1, ..., k} ,

and ωi ∈ Ω for i = 1, ..., k.

The sample space of an experiment can be

- a FINITE list of sample outcomes, {ω1, ..., ωk}

- a (countably) INFINITE list of sample outcomes, {ω1, ω2, ...}

- an INTERVAL or REGION of a real space, {ω : ω ∈ A ⊆ R^d}.

Definition 1.1.4 An event, E, is a designated collection of sample outcomes. Event E occurs if the actual outcome of the experiment is one of this collection. An event is, therefore, a subset of the sample space Ω.

Special Cases of Events

The event corresponding to the collection of all sample outcomes is Ω.

The event corresponding to a collection of none of the sample outcomes is denoted ∅.

i.e. The sets ∅ and Ω are also events, termed the impossible and the certain event respectively, and for any event E, E ⊆ Ω.


1.1.1 OPERATIONS IN SET THEORY

Since events are subsets of Ω, set theory operations are used to manipulate events in probability theory. Consider events E, F ⊆ Ω. Then we can reasonably concern ourselves also with events obtained from the three basic set operations:

UNION         E ∪ F    "E or F or both occur"
INTERSECTION  E ∩ F    "both E and F occur"
COMPLEMENT    E′       "E does not occur"

Properties of Union/Intersection operators

Consider events E,F,G ⊆ Ω.

COMMUTATIVITY      E ∪ F = F ∪ E;  E ∩ F = F ∩ E

ASSOCIATIVITY      E ∪ (F ∪ G) = (E ∪ F) ∪ G;  E ∩ (F ∩ G) = (E ∩ F) ∩ G

DISTRIBUTIVITY     E ∪ (F ∩ G) = (E ∪ F) ∩ (E ∪ G);  E ∩ (F ∪ G) = (E ∩ F) ∪ (E ∩ G)

DE MORGAN'S LAWS   (E ∪ F)′ = E′ ∩ F′;  (E ∩ F)′ = E′ ∪ F′

1.1.2 MUTUALLY EXCLUSIVE EVENTS AND PARTITIONS

Definition 1.1.5 Events E and F are mutually exclusive if E ∩ F = ∅, that is, if events E and F cannot both occur. If the sets of sample outcomes represented by E and F are disjoint (have no common element), then E and F are mutually exclusive.

Definition 1.1.6 Events E1, ..., Ek ⊆ Ω form a partition of event F ⊆ Ω if

(a) Ei ∩ Ej = ∅ for i ≠ j, i, j = 1, ..., k,

(b) ⋃_{i=1}^{k} Ei = F,

so that each element of the collection of sample outcomes corresponding to event F is in one and only one of the collections corresponding to events E1, ..., Ek.


Figure 1.1: Partition of Ω

In Figure 1.1, we have Ω = ⋃_{i=1}^{6} Ei.

Figure 1.2: Partition of F ⊂ Ω

In Figure 1.2, we have F = ⋃_{i=1}^{6} (F ∩ Ei), but, for example, F ∩ E6 = ∅.

1.2 THE σ-FIELD

Events are subsets of Ω, but need all subsets of Ω be events? The answer is negative. But it suffices to think of the collection of events as a subcollection A of the set of all subsets of Ω. This subcollection should have the following properties:

(a) if A,B ∈ A then A ∪B ∈ A and A ∩B ∈ A;

(b) if A ∈ A then A′ ∈ A;


(c) ∅ ∈ A.

A collection A of subsets of Ω which satisfies these three conditions is called a field. It follows from the properties of a field that if A1, A2, ..., Ak ∈ A, then

⋃_{i=1}^{k} Ai ∈ A.

So, A is closed under finite unions and hence under finite intersections also. To see this note that if A1, A2 ∈ A, then

A′1, A′2 ∈ A  =⇒  A′1 ∪ A′2 ∈ A  =⇒  (A′1 ∪ A′2)′ ∈ A  =⇒  A1 ∩ A2 ∈ A.

This is fine when Ω is a finite set, but we require slightly more to deal with the common situation when Ω is infinite. We require the collection of events to be closed under the operation of taking countable unions, not just finite unions.

Definition 1.2.1 A collection A of subsets of Ω is called a σ-field if it satisfies the following conditions:

(I) ∅ ∈ A;

(II) if A1, A2, ... ∈ A then ⋃_{i=1}^{∞} Ai ∈ A;

(III) if A ∈ A then A′ ∈ A.

To recap, with any experiment we may associate a pair (Ω, A), where Ω is the set of all possible outcomes (or elementary events) and A is a σ-field of subsets of Ω, which contains all the events in whose occurrences we may be interested. So, from now on, to call a set A an event is equivalent to asserting that A belongs to the σ-field in question.

1.3 THE PROBABILITY FUNCTION

Definition 1.3.1 For an event E ⊆ Ω, the probability that E occurs will be written P (E).

Interpretation: P(.) is a set-function that assigns "weight" to collections of possible outcomes of an experiment. There are many ways to think about precisely how this assignment is achieved:

CLASSICAL : “Consider equally likely sample outcomes ...”

FREQUENTIST : “Consider long-run relative frequencies ...”

SUBJECTIVE : “Consider personal degree of belief ...”

or merely think of P (.) as a set-function.

Formally, we have the following definition.

Definition 1.3.2 A probability function P (.) on (Ω,A) is a function P : A → [0, 1] satisfying:


(a) P (∅) = 0, P (Ω) = 1;

(b) if A1, A2, ... is a collection of disjoint members of A, so that Ai ∩ Aj = ∅ for all pairs i, j with i ≠ j, then

P(⋃_{i=1}^{∞} Ai) = Σ_{i=1}^{∞} P(Ai).

The triple (Ω, A, P(.)), consisting of a set Ω, a σ-field A of subsets of Ω and a probability function P(.) on (Ω, A), is called a probability space.

1.4 PROPERTIES OF P(.): THE AXIOMS OF PROBABILITY

For events E,F ⊆ Ω

1. P (E′) = 1− P (E).

2. If E ⊆ F , then P (E) ≤ P (F ).

3. In general, P (E ∪ F ) = P (E) + P (F )− P (E ∩ F ).

4. P (E ∩ F ′) = P (E)− P (E ∩ F )

5. P (E ∪ F ) ≤ P (E) + P (F ).

6. P (E ∩ F ) ≥ P (E) + P (F )− 1.

NOTE: The general addition rule 3 for probabilities and Boole's Inequalities 5 and 6 extend to more than two events. Let E1, ..., En be events in Ω. Then

P(⋃_{i=1}^{n} Ei) = Σ_i P(Ei) − Σ_{i<j} P(Ei ∩ Ej) + Σ_{i<j<k} P(Ei ∩ Ej ∩ Ek) − ... + (−1)^{n+1} P(⋂_{i=1}^{n} Ei)

and

P(⋃_{i=1}^{n} Ei) ≤ Σ_{i=1}^{n} P(Ei).

To prove these results, construct the events F1 = E1 and

Fi = Ei ∩ (⋃_{k=1}^{i−1} Ek)′

for i = 2, 3, ..., n. Then F1, F2, ..., Fn are disjoint, and ⋃_{i=1}^{n} Ei = ⋃_{i=1}^{n} Fi, so

P(⋃_{i=1}^{n} Ei) = P(⋃_{i=1}^{n} Fi) = Σ_{i=1}^{n} P(Fi).

Now, by property 4 above,

P(Fi) = P(Ei) − P(Ei ∩ (⋃_{k=1}^{i−1} Ek)) = P(Ei) − P(⋃_{k=1}^{i−1} (Ei ∩ Ek)), i = 2, 3, ..., n,

and the result follows by recursive expansion of the second term for i = 2, 3, ..., n.
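These identities are easy to sanity-check by direct enumeration. The sketch below (the sample space and events are invented for the illustration, not taken from the notes) compares both sides of the inclusion-exclusion identity, and the Boole upper bound, for three events in a small equally likely sample space.

```python
from itertools import combinations

# Equally likely sample space {0, ..., 9}; three overlapping events (hypothetical).
omega = set(range(10))
events = [{0, 1, 2, 3}, {2, 3, 4, 5}, {5, 6, 7, 0}]

def prob(A):
    # P(A) = n(A)/N for equally likely outcomes.
    return len(A) / len(omega)

# Left-hand side: P(E1 u E2 u E3) computed directly.
lhs = prob(set().union(*events))

# Right-hand side: inclusion-exclusion, alternating sign (-1)**(r+1)
# over intersections of r events at a time.
rhs = 0.0
n = len(events)
for r in range(1, n + 1):
    for subset in combinations(events, r):
        inter = set.intersection(*subset)
        rhs += (-1) ** (r + 1) * prob(inter)

boole = sum(prob(E) for E in events)  # Boole's inequality upper bound
print(lhs, rhs, boole)
```

Here lhs and rhs agree exactly, and both are at most the Boole bound.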

NOTE: We will often deal with both probabilities of single events, and also probabilities for intersection events. For convenience, and to reflect connections with distribution theory that will be presented in Chapter 2, we will use the following terminology: for events E and F

P (E) is the marginal probability of E

P (E ∩ F ) is the joint probability of E and F

1.5 CONDITIONAL PROBABILITY

Definition 1.5.1 For events E, F ⊆ Ω the conditional probability that F occurs given that E occurs is written P(F|E), and is defined by

P(F|E) = P(E ∩ F) / P(E),

if P(E) > 0.

NOTE: P(E ∩ F) = P(E)P(F|E), and in general, for events E1, ..., Ek,

P(E1 ∩ E2 ∩ ... ∩ Ek) = P(E1) P(E2|E1) P(E3|E1 ∩ E2) ... P(Ek|E1 ∩ E2 ∩ ... ∩ E_{k−1}).

This result is known as the CHAIN or MULTIPLICATION RULE.
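As a small illustration of the chain rule (the card-drawing setting here is ours, not from the notes): the probability of drawing two aces in two cards from a standard deck without replacement is P(E1)P(E2|E1).

```python
from fractions import Fraction

# Chain rule: P(E1 ∩ E2) = P(E1) P(E2|E1).
# E1 = "first card is an ace", E2 = "second card is an ace".
p_e1 = Fraction(4, 52)           # 4 aces among 52 cards
p_e2_given_e1 = Fraction(3, 51)  # 3 aces among the 51 remaining cards
p_both = p_e1 * p_e2_given_e1
print(p_both)  # 1/221
```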

Definition 1.5.2 Events E and F are independent if

P (E|F ) = P (E), so that P (E ∩ F ) = P (E)P (F ).

Extension: Events E1, ..., Ek are independent if, for every subset of events of size l ≤ k, indexed by {i1, ..., il}, say,

P(E_{i1} ∩ E_{i2} ∩ ... ∩ E_{il}) = P(E_{i1}) P(E_{i2}) ... P(E_{il}).


1.6 THE THEOREM OF TOTAL PROBABILITY

THEOREM

Let E1, ..., Ek be a (finite) partition of Ω, and let F ⊆ Ω. Then

P(F) = Σ_{i=1}^{k} P(F|Ei) P(Ei).

PROOF

E1, ..., Ek form a partition of Ω, and F ⊆ Ω, so

F = (F ∩ E1) ∪ ... ∪ (F ∩ Ek)

=⇒  P(F) = Σ_{i=1}^{k} P(F ∩ Ei) = Σ_{i=1}^{k} P(F|Ei) P(Ei),

writing F as a disjoint union and using the definition of a probability function.

Extension: The theorem still holds if E1, E2, ... is a (countably) infinite partition of Ω, and F ⊆ Ω, so that

P(F) = Σ_{i=1}^{∞} P(F ∩ Ei) = Σ_{i=1}^{∞} P(F|Ei) P(Ei),

if P(Ei) > 0 for all i.

1.7 BAYES’ THEOREM

THEOREM

Suppose E, F ⊆ Ω, with P(E), P(F) > 0. Then

P(E|F) = P(F|E) P(E) / P(F).

PROOF

P(E|F)P(F) = P(E ∩ F) = P(F|E)P(E), so P(E|F) = P(F|E)P(E)/P(F).

Extension: If E1, ..., Ek are disjoint, with P(Ei) > 0 for i = 1, ..., k, and form a partition of F ⊆ Ω, then

P(Ei|F) = P(F|Ei) P(Ei) / Σ_{j=1}^{k} P(F|Ej) P(Ej).

NOTE: in general, P(E|F) ≠ P(F|E).
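The Theorem of Total Probability and Bayes' Theorem can be combined in one short numerical sketch. The partition and the probabilities below are made up purely for illustration.

```python
# Hypothetical partition E1, E2, E3 of Ω with prior probabilities P(Ei),
# and conditional probabilities P(F|Ei); all numbers invented for the example.
prior = [0.5, 0.3, 0.2]   # P(Ei), summing to 1
lik = [0.1, 0.4, 0.8]     # P(F|Ei)

# Theorem of Total Probability: P(F) = sum_i P(F|Ei) P(Ei).
p_f = sum(l * p for l, p in zip(lik, prior))

# Bayes' Theorem: P(Ei|F) = P(F|Ei) P(Ei) / P(F).
posterior = [l * p / p_f for l, p in zip(lik, prior)]

print(p_f)        # 0.33
print(posterior)  # posteriors sum to 1
```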


1.8 COUNTING TECHNIQUES

This section is included for completeness, but is not examinable.

Suppose that an experiment has N equally likely sample outcomes. If event E corresponds to a collection of sample outcomes of size n(E), then

P(E) = n(E)/N,

so it is necessary to be able to evaluate n(E) and N in practice.

1.8.1 THE MULTIPLICATION PRINCIPLE

If operations labelled 1, ..., r can be carried out in n1, ..., nr ways respectively, then there are

n1 × n2 × ... × nr

ways of carrying out the r operations in total.

Example 1.1 If each of r trials of an experiment has N possible outcomes, then there are N^r possible sequences of outcomes in total. For example:
(i) If a multiple choice exam has 20 questions, each of which has 5 possible answers, then there are 5^20 different ways of completing the exam.
(ii) There are 2^m subsets of a set of m elements (as each element is either in the subset, or not in the subset, which is equivalent to m trials each with two outcomes).
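Claim (ii) can be checked by brute force for a small set (m = 4 is chosen arbitrarily for the check):

```python
from itertools import combinations

# A set of m elements has 2**m subsets: count them by enumerating
# all subsets of each size r = 0, 1, ..., m.
m = 4
items = list(range(m))
n_subsets = sum(1 for r in range(m + 1) for _ in combinations(items, r))
print(n_subsets, 2 ** m)  # both 16
```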

1.8.2 SAMPLING FROM A FINITE POPULATION

Consider a collection of N items, and a sequence of operations labelled 1, ..., r such that the ith operation involves selecting one of the items remaining after the first i − 1 operations have been carried out. Let ni denote the number of ways of carrying out the ith operation, for i = 1, ..., r. Then there are two distinct cases:
(a) Sampling with replacement: an item is returned to the collection after selection. Then ni = N for all i, and there are N^r ways of carrying out the r operations.
(b) Sampling without replacement: an item is not returned to the collection after selection. Then ni = N − i + 1, and there are N(N − 1)...(N − r + 1) ways of carrying out the r operations.

e.g. Consider selecting 5 cards from 52. Then
(a) leads to 52^5 possible selections, whereas
(b) leads to 52 × 51 × 50 × 49 × 48 possible selections.

NOTE: The order in which the operations are carried out may be important; e.g. in a raffle with three prizes and 100 tickets, the draw {45, 19, 76} is different from {19, 76, 45}.


NOTE: The items may be distinct (unique in the collection), or indistinct (of a unique type in the collection, but not unique individually). e.g. The numbered balls in the National Lottery, or individual playing cards, are distinct. However, when balls in the lottery are regarded as "WINNING" or "NOT WINNING", or playing cards are regarded in terms of their suit only, they are indistinct.

1.8.3 PERMUTATIONS AND COMBINATIONS

Definition 1.8.1 A permutation is an ordered arrangement of a set of items. A combination is an unordered arrangement of a set of items.

RESULT 1 The number of permutations of n distinct items is n! = n(n− 1)...1.

RESULT 2 The number of permutations of r from n distinct items is

P^n_r = n!/(n − r)! = n(n − 1)...(n − r + 1) (by the Multiplication Principle).

If the order in which items are selected is not important, then

RESULT 3 The number of combinations of r from n distinct items is the binomial coefficient

C^n_r = (n choose r) = n! / (r!(n − r)!) (as P^n_r = r! C^n_r).

Recall the Binomial Theorem, namely

(a + b)^n = Σ_{i=0}^{n} (n choose i) a^i b^{n−i}.

Then the number of subsets of m items can be calculated as follows: for each 0 ≤ j ≤ m, choose a subset of j items from m. Then

Total number of subsets = Σ_{j=0}^{m} (m choose j) = (1 + 1)^m = 2^m.

If the items are indistinct, but each is of a unique type, say Type I, ..., Type κ (the so-called Urn Model), then

RESULT 4 The number of distinguishable permutations of n indistinct objects, comprising n_i items of type i for i = 1, ..., κ, is

n! / (n_1! n_2! ... n_κ!).

Special Case: if κ = 2, then the number of distinguishable permutations of the n_1 objects of Type I, and n_2 = n − n_1 objects of Type II, is

C^n_{n_2} = n! / (n_1!(n − n_1)!).


RESULT 5 There are C^n_r ways of partitioning n distinct items into two "cells", with r in one cell and n − r in the other.
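Results 2 and 3, the identity P^n_r = r! C^n_r, and the 2^m subset count via the Binomial Theorem can all be checked with Python's standard library (the values n = 10, r = 4, m = 6 are arbitrary test cases):

```python
import math

n, r = 10, 4

# RESULT 2: permutations of r from n distinct items, n!/(n-r)!.
p_nr = math.perm(n, r)
# RESULT 3: combinations of r from n distinct items, n!/(r!(n-r)!).
c_nr = math.comb(n, r)

# Binomial Theorem consequence: sum_j C(m, j) = 2**m.
m = 6
total_subsets = sum(math.comb(m, j) for j in range(m + 1))

print(p_nr, c_nr, total_subsets)
```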

1.8.4 PROBABILITY CALCULATIONS

Recall that if an experiment has N equally likely sample outcomes, and event E corresponds to a collection of sample outcomes of size n(E), then

P(E) = n(E)/N.

Example 1.2 A True/False exam has 20 questions. Let E = "16 answers correct at random". Then

P(E) = (Number of ways of getting 16 out of 20 correct) / (Total number of ways of answering 20 questions) = C^20_16 / 2^20 ≈ 0.0046.
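This calculation is a one-liner with the standard library:

```python
import math

# Example 1.2: P(exactly 16 of 20 True/False answers correct at random)
# = C(20, 16) / 2**20.
p = math.comb(20, 16) / 2 ** 20
print(round(p, 4))  # 0.0046
```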

Example 1.3 Sampling without replacement. Consider an Urn Model with 10 Type I objects and 20 Type II objects, and an experiment involving sampling five objects without replacement. Let E = "precisely 2 Type I objects selected". We need to calculate N and n(E) in order to calculate P(E). In this case N is the number of ways of choosing 5 from 30 items, and hence

N = C^30_5.

To calculate n(E), we think of E occurring by first choosing 2 Type I objects from 10, and then choosing 3 Type II objects from 20, and hence, by the multiplication rule,

n(E) = C^10_2 × C^20_3.

Therefore

P(E) = C^10_2 C^20_3 / C^30_5 ≈ 0.360.

This result can be checked using a conditional probability argument; consider event F ⊆ E, where F = "sequence of objects 11222 obtained". Then

F = F11 ∩ F21 ∩ F32 ∩ F42 ∩ F52,

where Fij = "type j object obtained on draw i", i = 1, ..., 5, j = 1, 2. Then

P(F) = P(F11) P(F21|F11) ... P(F52|F11 ∩ F21 ∩ F32 ∩ F42) = (10/30)(9/29)(20/28)(19/27)(18/26).


Now consider event G where G = "sequence of objects 12122 obtained". Then

P(G) = (10/30)(20/29)(9/28)(19/27)(18/26),

i.e. P(G) = P(F). In fact, any sequence containing two Type I and three Type II objects has this probability, and there are C^5_2 such sequences. Thus, as all such sequences are mutually exclusive,

P(E) = C^5_2 (10/30)(9/29)(20/28)(19/27)(18/26) = C^10_2 C^20_3 / C^30_5

as before.
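Both routes to the Example 1.3 answer can be verified exactly with rational arithmetic:

```python
import math
from fractions import Fraction

# Example 1.3: urn with 10 Type I and 20 Type II objects, draw 5 without
# replacement; E = "exactly 2 Type I objects selected".

# Counting route: C(10,2) C(20,3) / C(30,5).
p_count = Fraction(math.comb(10, 2) * math.comb(20, 3), math.comb(30, 5))

# Conditional probability route: probability of the sequence 11222,
# times the C(5,2) mutually exclusive orderings.
p_seq = math.comb(5, 2) * Fraction(10, 30) * Fraction(9, 29) \
        * Fraction(20, 28) * Fraction(19, 27) * Fraction(18, 26)

print(float(p_count))  # ~0.360
```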

Example 1.4 Sampling with replacement. Consider an Urn Model with 10 Type I objects and 20 Type II objects, and an experiment involving sampling five objects with replacement. Let E = "precisely 2 Type I objects selected". Again, we need to calculate N and n(E) in order to calculate P(E). In this case N is the number of ways of choosing 5 from 30 items with replacement, and hence

N = 30^5.

To calculate n(E), we think of E occurring by choosing 2 Type I objects from 10, and 3 Type II objects from 20, in any order. Consider such sequences of selection:

Sequence   Number of ways
11222      10 · 10 · 20 · 20 · 20
12122      10 · 20 · 10 · 20 · 20

etc., and thus a sequence with 2 Type I objects and 3 Type II objects can be obtained in 10^2 20^3 ways. As before there are C^5_2 such sequences, and thus

P(E) = C^5_2 × 10^2 20^3 / 30^5 ≈ 0.329.

Again, this result can be verified using a conditional probability argument; consider event F ⊆ E, where F = "sequence of objects 11222 obtained". Then

P(F) = (10/30)^2 (20/30)^3

as the results of the draws are independent. This result is true for any sequence containing two Type I and three Type II objects, and there are C^5_2 such sequences that are mutually exclusive, so

P(E) = C^5_2 (10/30)^2 (20/30)^3.
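The with-replacement case can be checked the same way; the two routes again agree exactly:

```python
import math
from fractions import Fraction

# Example 1.4: same urn, sampling with replacement; E = "exactly 2 Type I".

# Counting route: C(5,2) * 10**2 * 20**3 ordered sequences out of 30**5.
p_count = Fraction(math.comb(5, 2) * 10 ** 2 * 20 ** 3, 30 ** 5)

# Independence route: C(5,2) (10/30)**2 (20/30)**3.
p_ind = math.comb(5, 2) * Fraction(10, 30) ** 2 * Fraction(20, 30) ** 3

print(float(p_count))  # ~0.329
```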


CHAPTER 2

RANDOM VARIABLES & PROBABILITY DISTRIBUTIONS

This chapter introduces random variables as a technical device enabling the general specification of probability distributions in one and many dimensions. The key topics and techniques introduced in this chapter include the following:

• EXPECTATION

• TRANSFORMATION

• STANDARDIZATION

• GENERATING FUNCTIONS

• JOINT MODELLING

• MARGINALIZATION

• MULTIVARIATE TRANSFORMATION

• MULTIVARIATE EXPECTATION & COVARIANCE

• SUMS OF VARIABLES

Of key importance is the moment generating function, which is a standard device for identification of probability distributions. Transformations are often used to transform a random variable or statistic of interest to one of simpler form, whose probability distribution is more convenient to work with. Standardization is often a key part of such simplification.

2.1 RANDOM VARIABLES & PROBABILITY MODELS

We are not always interested in an experiment itself, but rather in some consequence of its random outcome. Such consequences, when real valued, may be thought of as functions which map Ω to R, and these functions are called random variables.

Definition 2.1.1 A random variable (r.v.) X is a function X : Ω → R with the property that {ω ∈ Ω : X(ω) ≤ x} ∈ A for each x ∈ R.

The point is that we defined the probability function P(.) on the σ-field A, so if A(x) = {ω ∈ Ω : X(ω) ≤ x}, we cannot discuss P(A(x)) unless A(x) belongs to A. We generally pay no attention to the technical condition in the definition, and just think of random variables as functions mapping Ω to R.


So, we regard a set B ⊆ R as an event, associated with event A ⊆ Ω if

A = {ω : X(ω) = x for some x ∈ B}.

A and B are events in different spaces, but are equivalent in the sense that

P (X ∈ B) = P (A),

where, formally, it is the latter quantity that is defined by the probability function. Attention switches to assigning the probability P(X ∈ B) for appropriate sets B ⊆ R.

If Ω is a list of discrete elements Ω = {ω1, ω2, ...}, then the definition indicates that the events of interest will be of the form [X = b], or equivalently of the form [X ≤ b] for b ∈ R. For more general sample spaces, we will concentrate on events of the form [X ≤ b] for b ∈ R.

2.2 DISCRETE RANDOM VARIABLES

Definition 2.2.1 A random variable X is discrete if the set of all possible values of X (that is, the range of the function represented by X), denoted X, is countable, that is

X = {x1, x2, ..., xn} [FINITE] or X = {x1, x2, ...} [INFINITE].

Definition 2.2.2 PROBABILITY MASS FUNCTION

The function fX defined on X by

fX(x) = P [X = x], x ∈ X

that assigns probability to each x ∈ X is the (discrete) probability mass function, or pmf.

NOTE: For completeness, we define

fX(x) = 0, x ∉ X,

so that fX is defined for all x ∈ R. Furthermore, we will refer to X as the support of the random variable X, that is, the set of x ∈ R such that fX(x) > 0.

2.2.1 PROPERTIES OF MASS FUNCTION fX

Elementary properties of the mass function are straightforward to establish using properties of the probability function. A function fX is a probability mass function for a discrete random variable X with range X of the form {x1, x2, ...} if and only if

(i) fX(xi) ≥ 0,  (ii) Σ_i fX(xi) = 1.

These results follow as the events [X = x1], [X = x2], etc. are equivalent to events that partition Ω; that is, [X = xi] is equivalent to an event Ai, hence P[X = xi] = P(Ai), and the two parts of the result follow immediately.

Definition 2.2.3 DISCRETE CUMULATIVE DISTRIBUTION FUNCTION

The cumulative distribution function, or cdf, FX of a discrete r.v. X is defined by

FX(x) = P [X ≤ x], x ∈ R.


2.2.2 CONNECTION BETWEEN FX AND fX

Let X be a discrete random variable with range X = {x1, x2, ...}, where x1 < x2 < ..., and probability mass function fX and cdf FX. Then for any real value x, if x < x1, then FX(x) = 0, and for x ≥ x1,

FX(x) = Σ_{xi ≤ x} fX(xi)  ⇐⇒  fX(xi) = FX(xi) − FX(x_{i−1}), i = 2, 3, ...,

with, for completeness, fX(x1) = FX(x1). These relationships follow as events of the form [X ≤ xi] can be represented as countable unions of the events Ai. The first result therefore follows from properties of the probability function. The second result follows immediately.

2.2.3 PROPERTIES OF DISCRETE CDF FX

(i) In the limiting cases,

lim_{x→−∞} FX(x) = 0,  lim_{x→∞} FX(x) = 1.

(ii) FX is continuous from the right (but not continuous) on R; that is, for x ∈ R,

lim_{h→0+} FX(x + h) = FX(x).

(iii) FX is non-decreasing; that is,

a < b =⇒ FX(a) ≤ FX(b).

(iv) For a < b,

P[a < X ≤ b] = FX(b) − FX(a).

The key idea is that the functions fX and/or FX can be used to describe the probability distribution of the random variable X. A graph of the function fX is non-zero only at the elements of X. A graph of the function FX is a step-function which takes the value zero at minus infinity, the value one at infinity, and is non-decreasing with points of discontinuity at the elements of X.

2.3 CONTINUOUS RANDOM VARIABLES

Definition 2.3.1 A random variable X is continuous if the function FX defined on R by

FX(x) = P[X ≤ x]

for x ∈ R is a continuous function on R; that is, for x ∈ R,

lim_{h→0} FX(x + h) = FX(x).

Definition 2.3.2 CONTINUOUS CUMULATIVE DISTRIBUTION FUNCTION

The cumulative distribution function, or cdf, FX of a continuous r.v. X is defined by

FX(x) = P[X ≤ x], x ∈ R.


Definition 2.3.3 PROBABILITY DENSITY FUNCTION

A random variable is absolutely continuous if the cumulative distribution function FX can be written

FX(x) = ∫_{−∞}^{x} fX(t) dt

for some function fX, termed the probability density function, or pdf, of X.

From now on when we speak of a continuous random variable, we will implicitly assume the absolutely continuous case, where a pdf exists.

2.3.1 PROPERTIES OF CONTINUOUS FX AND fX

By analogy with the discrete case, let X be the range of X, so that X = {x : fX(x) > 0}.

(i) The pdf fX need not exist, but as indicated above, continuous r.v.'s where a pdf fX cannot be defined in this way will be ignored. The function fX can be defined piecewise on intervals of R.

(ii) For the cdf of a continuous r.v.,

lim_{x→−∞} FX(x) = 0,  lim_{x→∞} FX(x) = 1.

(iii) Directly from the definition, at values of x where FX is differentiable,

fX(x) = d/dt {FX(t)}|_{t=x}.

(iv) If X is continuous, then fX(x) ≠ P[X = x]; indeed

P[X = x] = lim_{h→0+} [P(X ≤ x) − P(X ≤ x − h)] = lim_{h→0+} [FX(x) − FX(x − h)] = 0.

(v) For a < b,

P [a < X ≤ b] = P [a ≤ X < b] = P [a ≤ X ≤ b] = P [a < X < b] = FX(b)− FX(a).

It follows that a function fX is a pdf for a continuous random variable X if and only if

(i) fX(x) ≥ 0,  (ii) ∫_{−∞}^{∞} fX(x) dx = 1.

This result follows directly from definitions and properties of FX.

Example 2.1 Consider a coin tossing experiment where a fair coin is tossed repeatedly under identical experimental conditions, with the sequence of tosses independent, until a Head is obtained. For this experiment, the sample space Ω is then the set of sequences ({H}, {TH}, {TTH}, {TTTH}, ...) with associated probabilities 1/2, 1/4, 1/8, 1/16, ... .

Define the discrete random variable X : Ω −→ R by X(ω) = x ⇐⇒ first H on toss x. Then

fX(x) = P[X = x] = (1/2)^x, x = 1, 2, 3, ...,


and zero otherwise. For x ≥ 1, let k(x) be the largest integer not greater than x; then

FX(x) = Σ_{xi ≤ x} fX(xi) = Σ_{i=1}^{k(x)} fX(i) = 1 − (1/2)^{k(x)},

and FX(x) = 0 for x < 1.

Graphs of the probability mass function (left) and cumulative distribution function (right) are shown in Figure 2.1. Note that the mass function is only non-zero at points that are elements of X, and that the cdf is defined for all real values of x, but is only continuous from the right. FX is therefore a step-function.

Figure 2.1: PMF fX(x) = (1/2)^x, x = 1, 2, ..., and CDF FX(x) = 1 − (1/2)^{k(x)}.
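The pmf/cdf relationship of Example 2.1 can be checked numerically (a sketch; the function names f and F are ours):

```python
import math

# Example 2.1: f(x) = (1/2)**x for x = 1, 2, ..., and
# F(x) = 1 - (1/2)**k(x), where k(x) = floor(x), with F(x) = 0 for x < 1.
def f(x):
    return 0.5 ** x

def F(x):
    if x < 1:
        return 0.0
    return 1.0 - 0.5 ** math.floor(x)

# The cdf agrees with the partial sums of the pmf.
total = sum(f(i) for i in range(1, 11))
print(total, F(10.7))  # both 1 - (1/2)**10
```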

Example 2.2 Consider an experiment to measure the length of time that an electrical component functions before failure. The sample space of outcomes of the experiment, Ω, is R+, and if Ax is the event that the component functions for longer than x > 0 time units, suppose that P(Ax) = exp{−x^2}.

Define the continuous random variable X : Ω −→ R+ by X(ω) = x ⇐⇒ component fails at time x. Then, if x > 0,

FX(x) = P[X ≤ x] = 1 − P(Ax) = 1 − exp{−x^2},


and FX(x) = 0 if x ≤ 0. Hence if x > 0,

fX(x) = d/dt {FX(t)}|_{t=x} = 2x exp{−x^2},

and zero otherwise.

Graphs of the probability density function (left) and cumulative distribution function (right) are shown in Figure 2.2. Note that both the pdf and cdf are defined for all real values of x, and that both are continuous functions.

Figure 2.2: PDF fX(x) = 2x exp{−x^2}, x > 0, and CDF FX(x) = 1 − exp{−x^2}, x > 0.

Note that here

FX(x) = ∫_{−∞}^{x} fX(t) dt = ∫_{0}^{x} fX(t) dt,

as fX(x) = 0 for x ≤ 0, and also that

∫_{−∞}^{∞} fX(x) dx = ∫_{0}^{∞} fX(x) dx = 1.
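These integral identities for Example 2.2 can be confirmed numerically; the sketch below uses a simple midpoint rule (the function names and the truncation of the infinite upper limit at 10 are our choices for the illustration):

```python
import math

# Example 2.2: pdf f(x) = 2x exp(-x^2) for x > 0, cdf F(x) = 1 - exp(-x^2).
def f(x):
    return 2.0 * x * math.exp(-x * x)

def F(x):
    return 1.0 - math.exp(-x * x) if x > 0 else 0.0

def integral(a, b, n=100000):
    # Midpoint-rule approximation of the integral of f over [a, b].
    h = (b - a) / n
    return sum(f(a + (i + 0.5) * h) for i in range(n)) * h

# Total probability ~ 1 (upper limit 10 stands in for infinity here),
# and the integral over [1, 2] matches F(2) - F(1).
approx_total = integral(0.0, 10.0)
print(approx_total, integral(1.0, 2.0), F(2.0) - F(1.0))
```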


2.4 EXPECTATIONS AND THEIR PROPERTIES

Definition 2.4.1 For a discrete random variable X with range X and with probability mass function fX, the expectation or expected value of X with respect to fX is defined by

E_{fX}[X] = Σ_x x fX(x) = Σ_{x∈X} x fX(x).

For a continuous random variable X with range X and pdf fX, the expectation or expected value of X with respect to fX is defined by

E_{fX}[X] = ∫_{−∞}^{∞} x fX(x) dx = ∫_X x fX(x) dx.

NOTE: The sum/integral may not be convergent, and hence the expected value may be infinite. It is important always to check that the sum/integral is finite: a sufficient condition is the absolute integrability of the summand/integrand, that is,

Σ_x |x| fX(x) < ∞  =⇒  Σ_x x fX(x) = E_{fX}[X] < ∞,

or in the continuous case

∫_{−∞}^{∞} |x| fX(x) dx < ∞  =⇒  ∫_{−∞}^{∞} x fX(x) dx = E_{fX}[X] < ∞.

Extension: Let $g$ be a real-valued function whose domain includes $\mathbb{X}$. Then
$$E_{f_X}[g(X)] = \begin{cases} \displaystyle\sum_{x \in \mathbb{X}} g(x) f_X(x), & \text{if } X \text{ is discrete}, \\[2mm] \displaystyle\int_{\mathbb{X}} g(x) f_X(x)\,dx, & \text{if } X \text{ is continuous}. \end{cases}$$

PROPERTIES OF EXPECTATIONS
Let $X$ be a random variable with mass function/pdf $f_X$. Let $g$ and $h$ be real-valued functions whose domains include $\mathbb{X}$, and let $a$ and $b$ be constants. Then
$$E_{f_X}[ag(X) + bh(X)] = a E_{f_X}[g(X)] + b E_{f_X}[h(X)],$$
as (in the continuous case)
$$E_{f_X}[ag(X) + bh(X)] = \int_{\mathbb{X}} [ag(x) + bh(x)] f_X(x)\,dx = a\int_{\mathbb{X}} g(x) f_X(x)\,dx + b\int_{\mathbb{X}} h(x) f_X(x)\,dx = a E_{f_X}[g(X)] + b E_{f_X}[h(X)].$$


SPECIAL CASES :

(i) For a simple linear function,
$$E_{f_X}[aX + b] = a E_{f_X}[X] + b.$$

(ii) Consider $g(x) = (x - E_{f_X}[X])^2$. Write $\mu = E_{f_X}[X]$ (a constant that does not depend on $x$). Then, expanding the integrand,
$$E_{f_X}[g(X)] = \int (x-\mu)^2 f_X(x)\,dx = \int x^2 f_X(x)\,dx - 2\mu\int x f_X(x)\,dx + \mu^2 \int f_X(x)\,dx$$
$$= \int x^2 f_X(x)\,dx - 2\mu^2 + \mu^2 = \int x^2 f_X(x)\,dx - \mu^2 = E_{f_X}[X^2] - \{E_{f_X}[X]\}^2.$$
Then
$$Var_{f_X}[X] = E_{f_X}[X^2] - \{E_{f_X}[X]\}^2$$
is the variance of the distribution. Similarly, $\sqrt{Var_{f_X}[X]}$ is the standard deviation of the distribution.

(iii) Consider $g(x) = x^k$ for $k = 1, 2, \ldots$. Then in the continuous case
$$E_{f_X}[g(X)] = E_{f_X}[X^k] = \int_{\mathbb{X}} x^k f_X(x)\,dx,$$
and $E_{f_X}[X^k]$ is the $k$th moment of the distribution.

(iv) Consider $g(x) = (x-\mu)^k$ for $k = 1, 2, \ldots$. Then
$$E_{f_X}[g(X)] = E_{f_X}[(X-\mu)^k] = \int_{\mathbb{X}} (x-\mu)^k f_X(x)\,dx,$$
and $E_{f_X}[(X-\mu)^k]$ is the $k$th central moment of the distribution.

(v) Consider $g(x) = ax + b$. Then $Var_{f_X}[aX+b] = a^2 Var_{f_X}[X]$, as
$$Var_{f_X}[g(X)] = E_{f_X}[(aX + b - E_{f_X}[aX+b])^2] = E_{f_X}[(aX + b - aE_{f_X}[X] - b)^2] = E_{f_X}[a^2(X - E_{f_X}[X])^2] = a^2 Var_{f_X}[X].$$
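The special cases above can be verified directly on a small discrete distribution. A sketch, in which the pmf and the constants $a, b$ are illustrative choices:

```python
# Hypothetical pmf chosen for illustration: P(X=0)=0.2, P(X=1)=0.5, P(X=2)=0.3.
pmf = {0: 0.2, 1: 0.5, 2: 0.3}

def E(g, pmf):
    """Expectation of g(X) with respect to a discrete pmf."""
    return sum(g(x) * p for x, p in pmf.items())

mean = E(lambda x: x, pmf)                  # E[X]
var = E(lambda x: x**2, pmf) - mean**2      # E[X^2] - {E[X]}^2

a, b = 3.0, 2.0
mean_ab = E(lambda x: a * x + b, pmf)       # E[aX + b], computed directly
var_ab = E(lambda x: (a * x + b - mean_ab)**2, pmf)

print(mean, var)              # close to 1.1 and 0.49
print(mean_ab, a * mean + b)  # both close to 5.3: E[aX+b] = aE[X] + b
print(var_ab, a**2 * var)     # both close to 4.41: Var[aX+b] = a^2 Var[X]
```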

2.5 INDICATOR VARIABLES

A particular class of random variables, called indicator variables, is especially useful. Let $A$ be an event and let $I_A : \Omega \to \mathbb{R}$ be the indicator function of $A$, so that
$$I_A(\omega) = \begin{cases} 1, & \text{if } \omega \in A, \\ 0, & \text{if } \omega \in A'. \end{cases}$$


Then $I_A$ is a random variable taking values 1 and 0 with probabilities $P(A)$ and $P(A')$ respectively. Also, $I_A$ has expectation $P(A)$ and variance $P(A)\{1 - P(A)\}$. The usefulness lies in the fact that any discrete random variable $X$ can be written as a linear combination of indicator random variables:
$$X = \sum_i a_i I_{A_i},$$
for some collection of events $(A_i, i \ge 1)$ and real numbers $(a_i, i \ge 1)$. Sometimes we can obtain the expectation and variance of a random variable $X$ easily by expressing it in this way, then using knowledge of the expectation and variance of the indicator variables $I_{A_i}$, rather than by direct calculation.
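As a sketch of this idea, assume $n$ independent trials, each succeeding with probability $p$, and let $X = \sum_{i=1}^n I_{A_i}$ count the successes (anticipating the Binomial distribution of Chapter 3; the values of $n$ and $p$ are illustrative). Linearity of expectation gives $E[X] = np$ without touching the pmf of $X$:

```python
from math import comb

n, p = 10, 0.3   # illustrative choices

# Direct calculation: X has pmf P(X=x) = C(n,x) p^x (1-p)^(n-x),
# so E[X] = sum over x of x P(X = x).
direct_mean = sum(x * comb(n, x) * p**x * (1 - p)**(n - x) for x in range(n + 1))

# Indicator route: X = I_{A_1} + ... + I_{A_n}, with E[I_{A_i}] = P(A_i) = p,
# so E[X] = n p by linearity -- no pmf calculation needed.
indicator_mean = n * p

print(direct_mean, indicator_mean)   # both close to 3.0
```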

2.6 TRANSFORMATIONS OF RANDOM VARIABLES

2.6.1 GENERAL TRANSFORMATIONS

Consider a discrete/continuous r.v. $X$ with range $\mathbb{X}$ and probability distribution described by mass/pdf $f_X$, or cdf $F_X$. Suppose $g$ is a real-valued function defined on $\mathbb{X}$. Then $Y = g(X)$ is also an r.v. ($Y$ is also a function from $\Omega$ to $\mathbb{R}$). Denote the range of $Y$ by $\mathbb{Y}$. For $A \subseteq \mathbb{R}$, the event $[Y \in A]$ is an event in terms of the transformed variable $Y$. If $f_Y$ is the mass/density function for $Y$, then
$$P[Y \in A] = \begin{cases} \displaystyle\sum_{y \in A} f_Y(y), & Y \text{ discrete}, \\[2mm] \displaystyle\int_{A} f_Y(y)\,dy, & Y \text{ continuous}. \end{cases}$$

We wish to derive the probability distribution of the random variable $Y$; in order to do this, we first consider the inverse transformation $g^{-1}$ from $\mathbb{Y}$ to $\mathbb{X}$, defined for a set $A \subseteq \mathbb{Y}$ (and for $y \in \mathbb{Y}$) by
$$g^{-1}(A) = \{x \in \mathbb{X} : g(x) \in A\}, \qquad g^{-1}(y) = \{x \in \mathbb{X} : g(x) = y\},$$
that is, $g^{-1}(A)$ is the set of points in $\mathbb{X}$ that map into $A$, and $g^{-1}(y)$ is the set of points in $\mathbb{X}$ that map to $y$, under the transformation $g$. By construction, we have
$$P[Y \in A] = P[X \in g^{-1}(A)].$$

Then, for $y \in \mathbb{R}$, we have
$$F_Y(y) = P[Y \le y] = P[g(X) \le y] = \begin{cases} \displaystyle\sum_{x \in A_y} f_X(x), & X \text{ discrete}, \\[2mm] \displaystyle\int_{A_y} f_X(x)\,dx, & X \text{ continuous}, \end{cases}$$
where $A_y = \{x \in \mathbb{X} : g(x) \le y\}$. This result gives the "first principles" approach to computing the distribution of the new variable. The approach can be summarized as follows:

• consider the range $\mathbb{Y}$ of the new variable;

• consider the cdf $F_Y(y)$, and step through the argument as follows:
$$F_Y(y) = P[Y \le y] = P[g(X) \le y] = P[X \in A_y].$$


Figure 2.3: Computation of Ay for Y = sinX.

Note that it is usually a good idea to start with the cdf, not the pmf or pdf. Our main objective is therefore to identify the set
$$A_y = \{x \in \mathbb{X} : g(x) \le y\}.$$

Example 2.3 Suppose that $X$ is a continuous r.v. with range $\mathbb{X} \equiv (0, 2\pi)$ whose pdf $f_X$ is constant,
$$f_X(x) = \frac{1}{2\pi}, \quad 0 < x < 2\pi,$$
and zero otherwise. This pdf has corresponding continuous cdf
$$F_X(x) = \frac{x}{2\pi}, \quad 0 < x < 2\pi.$$
Consider the transformed r.v. $Y = \sin X$. Then the range of $Y$, $\mathbb{Y}$, is $[-1, 1]$, but the transformation is not 1-1. However, from first principles, we have
$$F_Y(y) = P[Y \le y] = P[\sin X \le y].$$
Now, by inspection of Figure 2.3, we can easily identify the required set $A_y$ for $y > 0$: it is the union of two disjoint intervals
$$A_y = [0, x_1] \cup [x_2, 2\pi] = \left[0, \sin^{-1} y\right] \cup \left[\pi - \sin^{-1} y,\ 2\pi\right].$$

Hence

Figure 2.4: Computation of Ay for T = tanX.

$$F_Y(y) = P[\sin X \le y] = P[X \le x_1] + P[X \ge x_2] = \{P[X \le x_1]\} + \{1 - P[X < x_2]\}$$
$$= \left\{\frac{1}{2\pi}\sin^{-1} y\right\} + \left\{1 - \frac{1}{2\pi}\left(\pi - \sin^{-1} y\right)\right\} = \frac{1}{2} + \frac{1}{\pi}\sin^{-1} y,$$
and hence, by differentiation,
$$f_Y(y) = \frac{1}{\pi}\frac{1}{\sqrt{1 - y^2}}.$$
[A symmetry argument verifies this for $y < 0$.]
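The first-principles calculation of Example 2.3 can be checked numerically by measuring the proportion of the range $(0, 2\pi)$ on which $\sin x \le y$. A sketch (grid size is an illustrative choice):

```python
import math

def F_Y_numeric(y, n=200_000):
    """P[sin X <= y] for X uniform on (0, 2*pi): measure the set A_y on a midpoint grid."""
    count = sum(1 for i in range(n) if math.sin((i + 0.5) * 2 * math.pi / n) <= y)
    return count / n

def F_Y_exact(y):
    """Closed form from Example 2.3: 1/2 + arcsin(y)/pi."""
    return 0.5 + math.asin(y) / math.pi

for y in (-0.5, 0.0, 0.5):
    print(round(F_Y_numeric(y), 4), round(F_Y_exact(y), 4))  # pairs agree
```

Note the closed form covers $y < 0$ as well, consistent with the symmetry argument in the notes.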

Example 2.4 Consider the transformed r.v. $T = \tan X$. Then the range of $T$, $\mathbb{T}$, is $\mathbb{R}$, but the transformation is not 1-1. However, from first principles, we have, for $t > 0$,
$$F_T(t) = P[T \le t] = P[\tan X \le t].$$
Figure 2.4 helps identify the required set $A_t$: in this case it is the union of three disjoint intervals
$$A_t = [0, x_1] \cup \left(\frac{\pi}{2}, x_2\right] \cup \left(\frac{3\pi}{2}, 2\pi\right] = \left[0, \tan^{-1} t\right] \cup \left(\frac{\pi}{2}, \pi + \tan^{-1} t\right] \cup \left(\frac{3\pi}{2}, 2\pi\right]$$
(note, for values of $t < 0$, the union will be of only two intervals, but the calculation proceeds identically). Therefore,

$$F_T(t) = P[\tan X \le t] = P[X \le x_1] + P\left[\frac{\pi}{2} < X \le x_2\right] + P\left[\frac{3\pi}{2} < X \le 2\pi\right]$$
$$= \left\{\frac{1}{2\pi}\tan^{-1} t\right\} + \frac{1}{2\pi}\left\{\pi + \tan^{-1} t - \frac{\pi}{2}\right\} + \frac{1}{2\pi}\left\{2\pi - \frac{3\pi}{2}\right\} = \frac{1}{\pi}\tan^{-1} t + \frac{1}{2},$$


and hence, by differentiation,
$$f_T(t) = \frac{1}{\pi}\frac{1}{1 + t^2}.$$

2.6.2 1-1 TRANSFORMATIONS

The mapping $g$, a function of $X$ defined on $\mathbb{X}$, is 1-1 and onto $\mathbb{Y}$ if for each $y \in \mathbb{Y}$, there exists one and only one $x \in \mathbb{X}$ such that $y = g(x)$.

The following theorem gives the distribution for random variable Y = g(X) when g is 1-1.

Theorem 2.6.1 THE UNIVARIATE TRANSFORMATION THEOREM

Let X be a random variable with mass/density function fX and support X. Let g be a 1-1 functionfrom X onto Y with inverse g−1. Then Y = g(X) is a random variable with support Y and

Discrete Case: The mass function of random variable $Y$ is given by
$$f_Y(y) = f_X(g^{-1}(y)), \quad y \in \mathbb{Y} = \{y \,|\, f_Y(y) > 0\},$$
where $x$ is the unique solution of $y = g(x)$ (so that $x = g^{-1}(y)$).

Continuous Case: The pdf of random variable $Y$ is given by
$$f_Y(y) = f_X(g^{-1}(y)) \left| \frac{d}{dt}\left\{g^{-1}(t)\right\}_{t=y} \right|, \quad y \in \mathbb{Y} = \{y \,|\, f_Y(y) > 0\},$$
where $y = g(x)$, provided that the derivative
$$\frac{d}{dt}\left\{g^{-1}(t)\right\}$$
is continuous and non-zero on $\mathbb{Y}$.

Proof. Discrete case: by direct calculation,
$$f_Y(y) = P[Y = y] = P[g(X) = y] = P[X = g^{-1}(y)] = f_X(x),$$
where $x = g^{-1}(y)$, and hence $f_Y(y) > 0 \Longleftrightarrow f_X(x) > 0$.

Continuous case: the function $g$ is either (I) monotonic increasing, or (II) monotonic decreasing.

Case (I): If $g$ is increasing, then for $x \in \mathbb{X}$ and $y \in \mathbb{Y}$, we have that
$$g(x) \le y \Longleftrightarrow x \le g^{-1}(y).$$
Therefore, for $y \in \mathbb{Y}$,
$$F_Y(y) = P[Y \le y] = P[g(X) \le y] = P[X \le g^{-1}(y)] = F_X(g^{-1}(y)),$$


and, by differentiation, because $g$ is monotonic increasing,
$$f_Y(y) = f_X(g^{-1}(y)) \frac{d}{dt}\left\{g^{-1}(t)\right\}_{t=y} = f_X(g^{-1}(y)) \left|\frac{d}{dt}\left\{g^{-1}(t)\right\}_{t=y}\right|, \quad \text{as } \frac{d}{dt}\left\{g^{-1}(t)\right\} > 0.$$

Case (II): If $g$ is decreasing, then for $x \in \mathbb{X}$ and $y \in \mathbb{Y}$ we have
$$g(x) \le y \Longleftrightarrow x \ge g^{-1}(y).$$
Therefore, for $y \in \mathbb{Y}$,
$$F_Y(y) = P[Y \le y] = P[g(X) \le y] = P[X \ge g^{-1}(y)] = 1 - F_X(g^{-1}(y)),$$
so
$$f_Y(y) = -f_X(g^{-1}(y)) \frac{d}{dt}\left\{g^{-1}(t)\right\}_{t=y} = f_X(g^{-1}(y)) \left|\frac{d}{dt}\left\{g^{-1}(t)\right\}_{t=y}\right|, \quad \text{as } \frac{d}{dt}\left\{g^{-1}(t)\right\} < 0.$$

Definition 2.6.1 Suppose the transformation $g : \mathbb{X} \longrightarrow \mathbb{Y}$ is 1-1, and is defined by $g(x) = y$ for $x \in \mathbb{X}$. Then the Jacobian of the transformation, denoted $J(y)$, is given by
$$J(y) = \frac{d}{dt}\left\{g^{-1}(t)\right\}_{t=y},$$
that is, the first derivative of $g^{-1}$ evaluated at $y = g(x)$. Note that the inverse transformation $g^{-1} : \mathbb{Y} \longrightarrow \mathbb{X}$ has Jacobian $1/J(y)$.

NOTE :

(i) The Jacobian is precisely the same term that appears as a change-of-variable term in an integration.

(ii) In the Univariate Transformation Theorem, in the continuous case, we take the modulus of the Jacobian.

(iii) To compute the expectation of $Y = g(X)$, we now have two alternative methods of computation: we either compute the expectation of $g(X)$ with respect to the distribution of $X$, or compute the distribution of $Y$, and then its expectation. It is straightforward to demonstrate that the two methods are equivalent, that is,
$$E_{f_X}[g(X)] = E_{f_Y}[Y].$$
This result is sometimes known as the Law of the Unconscious Statistician.


IMPORTANT NOTE: The apparently appealing "plug-in" approach that sets $f_Y(y) = f_X(g^{-1}(y))$ will almost always fail, as the Jacobian term must be included. For example, if $Y = e^X$, so that $X = \log Y$, then merely setting $f_Y(y) = f_X(\log y)$ is insufficient; you must have
$$f_Y(y) = f_X(\log y) \times \frac{1}{y}.$$

2.7 GENERATING FUNCTIONS

2.7.1 MOMENT GENERATING FUNCTIONS

Definition 2.7.1 For a random variable $X$ with mass/density function $f_X$, the moment generating function, or mgf, of $X$, $M_X$, is defined by
$$M_X(t) = E_{f_X}[e^{tX}],$$
if this expectation exists for all values of $t \in (-h, h)$ for some $h > 0$; that is,
$$\text{DISCRETE CASE:} \quad M_X(t) = \sum e^{tx} f_X(x), \qquad \text{CONTINUOUS CASE:} \quad M_X(t) = \int e^{tx} f_X(x)\,dx,$$
where the sum/integral is over $\mathbb{X}$.

NOTE: It can be shown that if $X_1$ and $X_2$ are random variables taking values on $\mathbb{X}$ with mass/density functions $f_{X_1}$ and $f_{X_2}$, and mgfs $M_{X_1}$ and $M_{X_2}$ respectively, then
$$f_{X_1}(x) \equiv f_{X_2}(x),\ x \in \mathbb{X} \Longleftrightarrow M_{X_1}(t) \equiv M_{X_2}(t),\ t \in (-h, h).$$
Hence there is a 1-1 correspondence between generating functions and distributions: this provides a key technique for the identification of probability distributions.

2.7.2 KEY PROPERTIES OF MGFS

(i) If $X$ is a discrete random variable, the $r$th derivative of $M_X$ evaluated at $t$, $M_X^{(r)}(t)$, is given by
$$M_X^{(r)}(t) = \frac{d^r}{ds^r}\{M_X(s)\}_{s=t} = \frac{d^r}{ds^r}\left\{\sum e^{sx} f_X(x)\right\}_{s=t} = \sum x^r e^{tx} f_X(x),$$
and hence
$$M_X^{(r)}(0) = \sum x^r f_X(x) = E_{f_X}[X^r].$$


If $X$ is a continuous random variable, the $r$th derivative of $M_X$ is given by
$$M_X^{(r)}(t) = \frac{d^r}{ds^r}\left\{\int e^{sx} f_X(x)\,dx\right\}_{s=t} = \int x^r e^{tx} f_X(x)\,dx,$$
and hence
$$M_X^{(r)}(0) = \int x^r f_X(x)\,dx = E_{f_X}[X^r].$$

(ii) If $X$ is a discrete random variable, then
$$M_X(t) = \sum e^{tx} f_X(x) = \sum \left\{\sum_{r=0}^{\infty} \frac{(tx)^r}{r!}\right\} f_X(x) = 1 + \sum_{r=1}^{\infty} \frac{t^r}{r!}\left\{\sum x^r f_X(x)\right\} = 1 + \sum_{r=1}^{\infty} \frac{t^r}{r!} E_{f_X}[X^r].$$

The identical result holds for the continuous case.
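Property (i) can be illustrated numerically: for a small pmf (an illustrative choice), finite-difference estimates of $M_X'(0)$ and $M_X''(0)$ recover $E[X]$ and $E[X^2]$.

```python
import math

pmf = {0: 0.2, 1: 0.5, 2: 0.3}   # illustrative pmf

def M(t):
    """mgf M_X(t) = sum over x of exp(t x) f_X(x)."""
    return sum(math.exp(t * x) * p for x, p in pmf.items())

mean = sum(x * p for x, p in pmf.items())        # E[X] directly from the pmf
second = sum(x * x * p for x, p in pmf.items())  # E[X^2] directly from the pmf

h = 1e-5
M1 = (M(h) - M(-h)) / (2 * h)            # central-difference M'(0)
M2 = (M(h) - 2 * M(0.0) + M(-h)) / h**2  # central-difference M''(0)

print(M1, mean)     # both close to 1.1
print(M2, second)   # both close to 1.7
```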

(iii) From the general result for expectations of functions of random variables,
$$E_{f_Y}[e^{tY}] \equiv E_{f_X}[e^{t(aX+b)}] \implies M_Y(t) = E_{f_X}[e^{t(aX+b)}] = e^{bt} E_{f_X}[e^{atX}] = e^{bt} M_X(at).$$
Therefore, if $Y = aX + b$, then $M_Y(t) = e^{bt} M_X(at)$.

Theorem 2.7.1 Let $X_1, \ldots, X_k$ be independent random variables with mgfs $M_{X_1}, \ldots, M_{X_k}$ respectively. Then if the random variable $Y$ is defined by $Y = X_1 + \cdots + X_k$,
$$M_Y(t) = \prod_{i=1}^{k} M_{X_i}(t).$$

Proof. For $k = 2$, if $X_1$ and $X_2$ are independent, integer-valued, discrete r.v.s, then if $Y = X_1 + X_2$, by the Theorem of Total Probability,
$$f_Y(y) = P[Y = y] = \sum_{x_1} P[Y = y \,|\, X_1 = x_1] P[X_1 = x_1] = \sum_{x_1} f_{X_2}(y - x_1)\, f_{X_1}(x_1).$$
Hence
$$M_Y(t) = E_{f_Y}[e^{tY}] = \sum_{y} e^{ty} f_Y(y) = \sum_{y} e^{ty} \left\{\sum_{x_1} f_{X_2}(y - x_1)\, f_{X_1}(x_1)\right\}$$
$$= \sum_{x_2} \sum_{x_1} e^{t(x_1 + x_2)} f_{X_2}(x_2)\, f_{X_1}(x_1) \qquad \text{(changing variables in the summation, } x_2 = y - x_1\text{)}$$
$$= \left\{\sum_{x_1} e^{tx_1} f_{X_1}(x_1)\right\} \left\{\sum_{x_2} e^{tx_2} f_{X_2}(x_2)\right\} = M_{X_1}(t)\, M_{X_2}(t),$$


and the result follows for general k by recursion.

The result for continuous random variables follows in the obvious way.

Special Case: If $X_1, \ldots, X_k$ are identically distributed, then $M_{X_i}(t) \equiv M_X(t)$, say, for all $i$, so
$$M_Y(t) = \prod_{i=1}^{k} M_X(t) = \{M_X(t)\}^k.$$

2.7.3 OTHER GENERATING FUNCTIONS

Definition 2.7.2 For a random variable $X$ with mass/density function $f_X$, the factorial moment generating function or probability generating function, fmgf or pgf, of $X$, $G_X$, is defined by
$$G_X(t) = E_{f_X}[t^X] = E_{f_X}[e^{X \log t}] = M_X(\log t),$$
if this expectation exists for all values of $t \in (1 - h, 1 + h)$ for some $h > 0$.

Properties :

(i) Using similar techniques to those used for the mgf, it can be shown that
$$G_X^{(r)}(t) = \frac{d^r}{ds^r}\{G_X(s)\}_{s=t} = E_{f_X}\left[X(X-1)\cdots(X-r+1)\,t^{X-r}\right] \implies G_X^{(r)}(1) = E_{f_X}[X(X-1)\cdots(X-r+1)],$$
where $E_{f_X}[X(X-1)\cdots(X-r+1)]$ is the $r$th factorial moment.

(ii) For discrete random variables, it can be shown by using a Taylor series expansion of $G_X$ that, for $r = 1, 2, \ldots$,
$$\frac{G_X^{(r)}(0)}{r!} = P[X = r].$$

Definition 2.7.3 For a random variable $X$ with mass/density function $f_X$, the cumulant generating function of $X$, $K_X$, is defined by
$$K_X(t) = \log[M_X(t)],$$
for $t \in (-h, h)$ for some $h > 0$.

Moment generating functions provide a very useful technique for identifying distributions, but suffer from the disadvantage that the integrals which define them may not always be finite. Another class of functions, which are equally useful and whose finiteness is guaranteed, is described next.

Definition 2.7.4 The characteristic function, or cf, of $X$, $C_X$, is defined by
$$C_X(t) = E_{f_X}\left[e^{itX}\right].$$
By definition (in the continuous case),
$$C_X(t) = \int_{x \in \mathbb{X}} e^{itx} f_X(x)\,dx = \int_{x \in \mathbb{X}} [\cos tx + i \sin tx]\, f_X(x)\,dx = E_{f_X}[\cos tX] + i\, E_{f_X}[\sin tX].$$

We will be concerned primarily with cases where the moment generating function exists, and the use of moment generating functions will be a key tool for the identification of distributions.

2.8 JOINT PROBABILITY DISTRIBUTIONS

Suppose $X$ and $Y$ are random variables on the probability space $(\Omega, \mathcal{A}, P(\cdot))$. Their distribution functions $F_X$ and $F_Y$ contain information about their associated probabilities. But how do we describe information about their properties relative to each other? We think of $X$ and $Y$ as components of a random vector $(X, Y)$ taking values in $\mathbb{R}^2$, rather than as unrelated random variables each taking values in $\mathbb{R}$.

Example 2.5 Toss a coin $n$ times and let $X_i = 0$ or $1$, depending on whether the $i$th toss is a tail or a head. The random vector $X = (X_1, \ldots, X_n)$ describes the whole experiment. The total number of heads is $\sum_{i=1}^{n} X_i$.

The joint distribution function of a random vector $(X_1, \ldots, X_n)$ is $P(X_1 \le x_1, \ldots, X_n \le x_n)$, a function of $n$ real variables $x_1, \ldots, x_n$.

For vectors x = (x1, . . . , xn) and y = (y1, . . . , yn), write x ≤ y if xi ≤ yi for each i = 1, . . . , n.

Definition 2.8.1 The joint distribution function of a random vector $X = (X_1, \ldots, X_n)$ on $(\Omega, \mathcal{A}, P(\cdot))$ is given by $F_X : \mathbb{R}^n \longrightarrow [0, 1]$, defined by $F_X(x) = P(X \le x)$, $x \in \mathbb{R}^n$. [Remember, formally, $\{X \le x\}$ means $\{\omega \in \Omega : X(\omega) \le x\}$.]

We will consider, for simplicity, the case $n = 2$, without any loss of generality: the case $n > 2$ is just notationally more cumbersome.

Properties of the joint distribution function.

The joint distribution function FX,Y of the random vector (X,Y ) satisfies:

(i)
$$\lim_{x,y \to -\infty} F_{X,Y}(x, y) = 0, \qquad \lim_{x,y \to \infty} F_{X,Y}(x, y) = 1.$$


(ii) If $(x_1, y_1) \le (x_2, y_2)$, then $F_{X,Y}(x_1, y_1) \le F_{X,Y}(x_2, y_2)$.

(iii) $F_{X,Y}$ is continuous from above:
$$F_{X,Y}(x + u, y + v) \longrightarrow F_{X,Y}(x, y) \quad \text{as } u, v \longrightarrow 0^+.$$

(iv)
$$\lim_{y \to \infty} F_{X,Y}(x, y) = F_X(x) \equiv P(X \le x), \qquad \lim_{x \to \infty} F_{X,Y}(x, y) = F_Y(y) \equiv P(Y \le y).$$
$F_X$ and $F_Y$ are the marginal distribution functions of the joint distribution $F_{X,Y}$.

Definition 2.8.2 The random variables $X$ and $Y$ on $(\Omega, \mathcal{A}, P(\cdot))$ are (jointly) discrete if $(X, Y)$ takes values in a countable subset of $\mathbb{R}^2$ only.

Definition 2.8.3 Discrete variables $X$ and $Y$ are independent if the events $\{X = x\}$ and $\{Y = y\}$ are independent for all $x$ and $y$.

Definition 2.8.4 The joint probability mass function $f_{X,Y} : \mathbb{R}^2 \longrightarrow [0, 1]$ of $X$ and $Y$ is given by
$$f_{X,Y}(x, y) = P(X = x, Y = y).$$
The marginal pmf of $X$, $f_X(x)$, is found from
$$f_X(x) = P(X = x) = \sum_{y} P(X = x, Y = y) = \sum_{y} f_{X,Y}(x, y),$$
and similarly for $f_Y(y)$.

The definition of independence can be reformulated as: $X$ and $Y$ are independent iff $f_{X,Y}(x, y) = f_X(x) f_Y(y)$, for all $x, y \in \mathbb{R}$.

More generally, $X$ and $Y$ are independent iff $f_{X,Y}(x, y)$ can be factorized as the product $g(x)h(y)$ of a function of $x$ alone and a function of $y$ alone.

Let $\mathbb{X}$ be the support of $X$ and $\mathbb{Y}$ be the support of $Y$. Then $Z = (X, Y)$ has support $\mathbb{Z} = \{(x, y) : f_{X,Y}(x, y) > 0\}$. In nice cases $\mathbb{Z} = \mathbb{X} \times \mathbb{Y}$, but we need to be alert to cases with $\mathbb{Z} \subset \mathbb{X} \times \mathbb{Y}$. In general, given a random vector $(X_1, \ldots, X_k)$ we will denote its range or support by $\mathbb{X}^{(k)}$.
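A small numerical sketch of these definitions (the joint pmf below is an illustrative choice): compute the marginal pmfs of a joint pmf and test the factorization criterion for independence.

```python
# Illustrative joint pmf on {0,1} x {0,1,2}, stored as {(x, y): probability}.
joint = {(0, 0): 0.08, (0, 1): 0.20, (0, 2): 0.12,
         (1, 0): 0.12, (1, 1): 0.30, (1, 2): 0.18}

xs = sorted({x for x, _ in joint})
ys = sorted({y for _, y in joint})

# Marginal pmfs: f_X(x) = sum over y of f_{X,Y}(x, y), and similarly f_Y(y).
f_X = {x: sum(joint[(x, y)] for y in ys) for x in xs}
f_Y = {y: sum(joint[(x, y)] for x in xs) for y in ys}

# Independence criterion: f_{X,Y}(x, y) = f_X(x) f_Y(y) for all x, y.
independent = all(abs(joint[(x, y)] - f_X[x] * f_Y[y]) < 1e-12
                  for x in xs for y in ys)

print(f_X)          # marginal of X: close to {0: 0.4, 1: 0.6}
print(f_Y)          # marginal of Y: close to {0: 0.2, 1: 0.5, 2: 0.3}
print(independent)  # this particular joint pmf factorizes
```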


Definition 2.8.5 The conditional distribution function of $Y$ given $X = x$, $F_{Y|X}(y|x)$, is defined by
$$F_{Y|X}(y|x) = P(Y \le y \,|\, X = x),$$
for any $x$ such that $P(X = x) > 0$. The conditional probability mass function of $Y$ given $X = x$, $f_{Y|X}(y|x)$, is defined by
$$f_{Y|X}(y|x) = P(Y = y \,|\, X = x),$$
for any $x$ such that $P(X = x) > 0$.

Turning now to the continuous case, we define:

Definition 2.8.6 The random variables $X$ and $Y$ on $(\Omega, \mathcal{A}, P(\cdot))$ are called jointly continuous if their joint distribution function can be expressed as
$$F_{X,Y}(x, y) = \int_{u=-\infty}^{x} \int_{v=-\infty}^{y} f_{X,Y}(u, v)\,dv\,du, \quad x, y \in \mathbb{R},$$
for some $f_{X,Y} : \mathbb{R}^2 \longrightarrow [0, \infty)$.

Then $f_{X,Y}$ is the joint probability density function of $X, Y$.

If $F_{X,Y}$ is 'sufficiently differentiable' at $(x, y)$, we have
$$f_{X,Y}(x, y) = \frac{\partial^2}{\partial x \partial y} F_{X,Y}(x, y).$$

This is the usual case, which we will assume from now on.

Then:

(i)
$$P(a \le X \le b, c \le Y \le d) = F_{X,Y}(b, d) - F_{X,Y}(a, d) - F_{X,Y}(b, c) + F_{X,Y}(a, c) = \int_{c}^{d} \int_{a}^{b} f_{X,Y}(x, y)\,dx\,dy.$$
If $B$ is a 'nice' subset of $\mathbb{R}^2$, such as a union of rectangles,
$$P((X, Y) \in B) = \iint_{B} f_{X,Y}(x, y)\,dx\,dy.$$

(ii) The marginal distribution functions of $X$ and $Y$ are
$$F_X(x) = P(X \le x) = F_{X,Y}(x, \infty), \qquad F_Y(y) = P(Y \le y) = F_{X,Y}(\infty, y).$$


Since
$$F_X(x) = \int_{-\infty}^{x} \left\{\int_{-\infty}^{\infty} f_{X,Y}(u, y)\,dy\right\} du,$$
we see, differentiating with respect to $x$, that the marginal pdf of $X$ is
$$f_X(x) = \int_{-\infty}^{\infty} f_{X,Y}(x, y)\,dy.$$
Similarly, the marginal pdf of $Y$ is
$$f_Y(y) = \int_{-\infty}^{\infty} f_{X,Y}(x, y)\,dx.$$

We cannot, as we did in the discrete case, define independence of $X$ and $Y$ in terms of the events $\{X = x\}$ and $\{Y = y\}$, as these have zero probability and are trivially independent.

So:

Definition 2.8.7 $X$ and $Y$ are independent if $\{X \le x\}$ and $\{Y \le y\}$ are independent events, for all $x, y \in \mathbb{R}$.

Hence $X$ and $Y$ are independent iff
$$F_{X,Y}(x, y) = F_X(x) F_Y(y), \quad \forall x, y \in \mathbb{R},$$
or (equivalently) iff
$$f_{X,Y}(x, y) = f_X(x) f_Y(y),$$
whenever $F_{X,Y}$ is differentiable at $(x, y)$.

(iv) Definition 2.8.8 The conditional distribution function of $Y$ given $X = x$, $F_{Y|X}(y|x)$ or $P(Y \le y \,|\, X = x)$, is defined as
$$F_{Y|X}(y|x) = \int_{v=-\infty}^{y} \frac{f_{X,Y}(x, v)}{f_X(x)}\,dv,$$
for any $x$ such that $f_X(x) > 0$.

Definition 2.8.9 The conditional density function of $Y$, given $X = x$, $f_{Y|X}(y|x)$, is defined by
$$f_{Y|X}(y|x) = \frac{f_{X,Y}(x, y)}{f_X(x)},$$
for any $x$ such that $f_X(x) > 0$.

This is an appropriate point to remark that not all random variables are either continuous or discrete, and not all distribution functions are either absolutely continuous or discrete. Many practical examples exist of distribution functions that are partly discrete and partly continuous.


Example 2.6 We record the delay that a motorist encounters at a one-way traffic stop sign. Let $X$ be the random variable representing the delay the motorist experiences. There is a certain probability that there will be no opposing traffic, so she will be able to proceed without delay. However, if she has to wait, she could (in principle) have to wait for any positive amount of time. The experiment could be described by assuming that $X$ has distribution function $F_X(x) = (1 - pe^{-\lambda x}) I_{[0,\infty)}(x)$. This has a jump of $1 - p$ at $x = 0$, but is continuous for $x > 0$: there is a probability $1 - p$ of no wait at all.

We shall see later cases of random vectors, $(X, Y)$ say, where one component is discrete and the other continuous: there is no essential complication in the manipulation of the marginal distributions etc. for such a case.

2.8.1 THE CHAIN RULE FOR RANDOM VARIABLES

As with the chain rule for manipulation of probabilities, there is an explicit relationship between joint, marginal, and conditional mass/density functions. For example, consider three continuous random variables $X_1, X_2, X_3$, with joint pdf $f_{X_1,X_2,X_3}$. Then
$$f_{X_1,X_2,X_3}(x_1, x_2, x_3) = f_{X_1}(x_1)\, f_{X_2|X_1}(x_2|x_1)\, f_{X_3|X_1,X_2}(x_3|x_1, x_2),$$
so that, for example,
$$f_{X_1}(x_1) = \int_{\mathbb{X}_2} \int_{\mathbb{X}_3} f_{X_1,X_2,X_3}(x_1, x_2, x_3)\,dx_2\,dx_3$$
$$= \int_{\mathbb{X}_2} \int_{\mathbb{X}_3} f_{X_1|X_2,X_3}(x_1|x_2, x_3)\, f_{X_2,X_3}(x_2, x_3)\,dx_2\,dx_3$$
$$= \int_{\mathbb{X}_2} \int_{\mathbb{X}_3} f_{X_1|X_2,X_3}(x_1|x_2, x_3)\, f_{X_2|X_3}(x_2|x_3)\, f_{X_3}(x_3)\,dx_2\,dx_3.$$
Equivalent relationships hold in the discrete case, and can be extended to determine the explicit relationship between joint, marginal, and conditional mass/density functions for any number of random variables.

Equivalent relationships hold in the discrete case and can be extended to determine the explicitrelationship between joint, marginal, and conditional mass/density functions for any number ofrandom variables.

NOTE: the discrete equivalent of this result is a DIRECT consequence of the Theorem of Total Probability; the event $[X_1 = x_1]$ is partitioned into sub-events $[(X_1 = x_1) \cap (X_2 = x_2) \cap (X_3 = x_3)]$ for all possible values of the pair $(x_2, x_3)$.

2.8.2 CONDITIONAL EXPECTATION AND ITERATED EXPECTATION

Consider two discrete/continuous random variables $X_1$ and $X_2$ with joint mass function/pdf $f_{X_1,X_2}$, and the conditional mass function/pdf of $X_1$ given $X_2 = x_2$, defined in the usual way by
$$f_{X_1|X_2}(x_1|x_2) = \frac{f_{X_1,X_2}(x_1, x_2)}{f_{X_2}(x_2)}.$$
Then the conditional expectation of $g(X_1)$ given $X_2 = x_2$ is defined by
$$E_{f_{X_1|X_2}}[g(X_1)|X_2 = x_2] = \begin{cases} \displaystyle\sum_{x_1 \in \mathbb{X}_1} g(x_1)\, f_{X_1|X_2}(x_1|x_2), & X_1 \text{ DISCRETE}, \\[2mm] \displaystyle\int_{\mathbb{X}_1} g(x_1)\, f_{X_1|X_2}(x_1|x_2)\,dx_1, & X_1 \text{ CONTINUOUS}, \end{cases}$$
i.e. the expectation of $g(X_1)$ with respect to the conditional density of $X_1$ given $X_2 = x_2$ (in general a function of $x_2$). The case $g(x) \equiv x$ is an important special case.

Theorem 2.8.1 THE LAW OF ITERATED EXPECTATION
For two continuous random variables $X_1$ and $X_2$ with joint pdf $f_{X_1,X_2}$,
$$E_{f_{X_1}}[g(X_1)] = E_{f_{X_2}}\left[E_{f_{X_1|X_2}}[g(X_1)|X_2 = x_2]\right].$$

Proof.
$$E_{f_{X_1}}[g(X_1)] = \int_{\mathbb{X}_1} g(x_1)\, f_{X_1}(x_1)\,dx_1 = \int_{\mathbb{X}_1} g(x_1) \left\{\int_{\mathbb{X}_2} f_{X_1,X_2}(x_1, x_2)\,dx_2\right\} dx_1$$
$$= \int_{\mathbb{X}_1} g(x_1) \left\{\int_{\mathbb{X}_2} f_{X_1|X_2}(x_1|x_2)\, f_{X_2}(x_2)\,dx_2\right\} dx_1 = \int_{\mathbb{X}_2} \left\{\int_{\mathbb{X}_1} g(x_1)\, f_{X_1|X_2}(x_1|x_2)\,dx_1\right\} f_{X_2}(x_2)\,dx_2$$
$$= \int_{\mathbb{X}_2} \left\{E_{f_{X_1|X_2}}[g(X_1)|X_2 = x_2]\right\} f_{X_2}(x_2)\,dx_2 = E_{f_{X_2}}\left[E_{f_{X_1|X_2}}[g(X_1)|X_2 = x_2]\right],$$
so the expectation of $g(X_1)$ can be calculated by finding the conditional expectation of $g(X_1)$ given $X_2 = x_2$, giving a function of $x_2$, and then taking the expectation of this function with respect to the marginal density for $X_2$. Note that this proof only works if the conditional expectation and the marginal expectation are finite. This result extends naturally to $k$ variables.
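A discrete numerical sketch of the law of iterated expectation (the joint pmf is an illustrative choice): compute $E[X_1]$ directly from the marginal pmf, and again via $E_{X_2}[\,E[X_1 | X_2 = x_2]\,]$.

```python
# Illustrative joint pmf of (X1, X2) on {0,1,2} x {0,1}.
joint = {(0, 0): 0.10, (1, 0): 0.25, (2, 0): 0.05,
         (0, 1): 0.15, (1, 1): 0.15, (2, 1): 0.30}

x1s, x2s = (0, 1, 2), (0, 1)

# Direct route: expectation of X1 from its marginal pmf.
f_X1 = {x1: sum(joint[(x1, x2)] for x2 in x2s) for x1 in x1s}
direct = sum(x1 * p for x1, p in f_X1.items())

# Iterated route: E[X1] = E_{X2}[ E[X1 | X2 = x2] ].
f_X2 = {x2: sum(joint[(x1, x2)] for x1 in x1s) for x2 in x2s}
cond_mean = {x2: sum(x1 * joint[(x1, x2)] / f_X2[x2] for x1 in x1s)
             for x2 in x2s}
iterated = sum(cond_mean[x2] * p for x2, p in f_X2.items())

print(direct, iterated)   # both close to 1.1
```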

2.9 MULTIVARIATE TRANSFORMATIONS

Theorem 2.9.1 THE MULTIVARIATE TRANSFORMATION THEOREM
Let $X = (X_1, \ldots, X_k)$ be a vector of random variables, with joint mass/density function $f_{X_1,\ldots,X_k}$. Let $Y = (Y_1, \ldots, Y_k)$ be a vector of random variables defined by $Y_i = g_i(X_1, \ldots, X_k)$ for some functions $g_i$, $i = 1, \ldots, k$, where the vector function $g$ mapping $(X_1, \ldots, X_k)$ to $(Y_1, \ldots, Y_k)$ is a 1-1 transformation. Then the joint mass/density function of $(Y_1, \ldots, Y_k)$ is given by
$$\text{DISCRETE:} \quad f_{Y_1,\ldots,Y_k}(y_1, \ldots, y_k) = f_{X_1,\ldots,X_k}(x_1, \ldots, x_k),$$
$$\text{CONTINUOUS:} \quad f_{Y_1,\ldots,Y_k}(y_1, \ldots, y_k) = f_{X_1,\ldots,X_k}(x_1, \ldots, x_k)\, |J(y_1, \ldots, y_k)|,$$
where $x = (x_1, \ldots, x_k)$ is the unique solution of the system $y = g(x)$, so that $x = g^{-1}(y)$, and where $J(y_1, \ldots, y_k)$ is the Jacobian of the transformation, that is, the determinant of the $k \times k$ matrix whose $(i, j)$th element is
$$\frac{\partial}{\partial t_j}\left\{g_i^{-1}(t)\right\}_{t_1=y_1,\ldots,t_k=y_k},$$
where $g_i^{-1}$ is the inverse function uniquely defined by $X_i = g_i^{-1}(Y_1, \ldots, Y_k)$. Note again the modulus.

Proof. The discrete case proof follows the univariate case precisely. For the continuous case, consider the equivalent events $[X \in C]$ and $[Y \in D]$, where $D$ is the image of $C$ under $g$. Clearly, $P[X \in C] = P[Y \in D]$. Now, $P[X \in C]$ is the $k$-dimensional integral of the joint density $f_{X_1,\ldots,X_k}$ over the set $C$, and $P[Y \in D]$ is the $k$-dimensional integral of the joint density $f_{Y_1,\ldots,Y_k}$ over the set $D$. The result follows by changing variables in the first integral from $x$ to $y = g(x)$, and equating the two integrands.

Note: As for single-variable transformations, the ranges of the transformed variables must be considered carefully.

Example 2.7 The multivariate transformation theorem provides a simple proof of the convolution formula: if $X$ and $Y$ are independent continuous random variables with pdfs $f_X(x)$ and $f_Y(y)$, then the pdf of $Z = X + Y$ is
$$f_Z(z) = \int_{-\infty}^{\infty} f_X(w)\, f_Y(z - w)\,dw.$$
Let $W = X$. The Jacobian of the transformation from $(X, Y)$ to $(Z, W)$ is 1. So, the joint pdf of $(Z, W)$ is
$$f_{Z,W}(z, w) = f_{X,Y}(w, z - w) = f_X(w)\, f_Y(z - w).$$
Then integrate out $W$ to obtain the marginal pdf of $Z$.
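The convolution formula can be evaluated numerically. A sketch in which $X$ and $Y$ are both Uniform(0, 1) (an illustrative choice); the result should be the triangular density $f_Z(z) = z$ on $[0, 1]$ and $2 - z$ on $[1, 2]$:

```python
def f_U(x):
    """pdf of a Uniform(0,1) variable (illustrative choice)."""
    return 1.0 if 0.0 < x < 1.0 else 0.0

def f_Z(z, n=20_000):
    """Convolution integral f_Z(z) = integral of f_X(w) f_Y(z - w) dw,
    evaluated by the midpoint rule over w in (0, 1), outside which f_X = 0."""
    h = 1.0 / n
    return h * sum(f_U((i + 0.5) * h) * f_U(z - (i + 0.5) * h) for i in range(n))

# Exact triangular values: f_Z(0.25)=0.25, f_Z(0.5)=0.5, f_Z(1.5)=0.5.
for z in (0.25, 0.5, 1.5):
    print(round(f_Z(z), 4))
```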

Example 2.8 Consider the case $k = 2$, and suppose that $X_1$ and $X_2$ are independent continuous random variables with ranges $\mathbb{X}_1 = \mathbb{X}_2 = [0, 1]$ and pdfs given respectively by
$$f_{X_1}(x_1) = 6x_1(1 - x_1), \quad 0 \le x_1 \le 1, \qquad f_{X_2}(x_2) = 3x_2^2, \quad 0 \le x_2 \le 1,$$
and zero elsewhere. In order to calculate the pdf of the random variable $Y_1$ defined by $Y_1 = X_1 X_2$ using the transformation result, consider the additional random variable $Y_2 = X_1$ (note, as $X_1$ and $X_2$ take values on $[0, 1]$, $X_1 \ge X_1 X_2$, so $Y_1 \le Y_2$).

The transformation $Y = g(X)$ is then specified by the two functions
$$g_1(t_1, t_2) = t_1 t_2, \qquad g_2(t_1, t_2) = t_1,$$
and the inverse transformation $X = g^{-1}(Y)$ (i.e. $X$ in terms of $Y$) is
$$X_1 = Y_2, \qquad X_2 = Y_1 / Y_2,$$
giving
$$g_1^{-1}(t_1, t_2) = t_2, \qquad g_2^{-1}(t_1, t_2) = t_1 / t_2.$$

Hence
$$\frac{\partial}{\partial t_1}\{g_1^{-1}(t_1, t_2)\} = 0, \quad \frac{\partial}{\partial t_2}\{g_1^{-1}(t_1, t_2)\} = 1, \quad \frac{\partial}{\partial t_1}\{g_2^{-1}(t_1, t_2)\} = 1/t_2, \quad \frac{\partial}{\partial t_2}\{g_2^{-1}(t_1, t_2)\} = -t_1/t_2^2,$$
and so the Jacobian $J(y_1, y_2)$ of the transformation is given by the determinant
$$\begin{vmatrix} 0 & 1 \\ 1/y_2 & -y_1/y_2^2 \end{vmatrix},$$
so that $J(y_1, y_2) = -1/y_2$. Hence, using the theorem,
$$f_{Y_1,Y_2}(y_1, y_2) = f_{X_1,X_2}(y_2, y_1/y_2) \times |J(y_1, y_2)| = 6y_2(1 - y_2) \times 3(y_1/y_2)^2 \times 1/y_2 = 18y_1^2(1 - y_2)/y_2^2,$$
on the set $\mathbb{Y}^{(2)} = \{(y_1, y_2) : 0 \le y_1 \le y_2 \le 1\}$, and zero otherwise. Hence
$$f_{Y_1}(y_1) = \int_{y_1}^{1} 18y_1^2(1 - y_2)/y_2^2\,dy_2 = 18y_1^2\left[-1/y_2 - \log y_2\right]_{y_1}^{1} = 18y_1^2(-1 + 1/y_1 + \log y_1) = 18y_1(1 - y_1 + y_1 \log y_1),$$
for $0 \le y_1 \le 1$, and zero otherwise.
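A quick numerical sanity check on Example 2.8: the marginal density $f_{Y_1}(y_1) = 18y_1(1 - y_1 + y_1 \log y_1)$ must integrate to 1 over $[0, 1]$. A midpoint-rule sketch (the grid size is an illustrative choice):

```python
import math

def f_Y1(y):
    """Marginal pdf of Y1 = X1 X2 from Example 2.8, for 0 < y < 1."""
    return 18.0 * y * (1.0 - y + y * math.log(y))

# Midpoint rule avoids evaluating log at y = 0 (where y log y -> 0 anyway).
n = 200_000
h = 1.0 / n
total = h * sum(f_Y1((i + 0.5) * h) for i in range(n))
print(total)   # close to 1, as a density must be
```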


2.10 MULTIVARIATE EXPECTATIONS AND COVARIANCE

2.10.1 EXPECTATION WITH RESPECT TO JOINT DISTRIBUTIONS

Definition 2.10.1 For random variables $X_1, \ldots, X_k$ with range $\mathbb{X}^{(k)}$ and mass/density function $f_{X_1,\ldots,X_k}$, the expectation of $g(X_1, \ldots, X_k)$ is defined in the discrete and continuous cases, respectively, by
$$E_{f_{X_1,\ldots,X_k}}[g(X_1, \ldots, X_k)] = \sum_{\mathbb{X}_1} \cdots \sum_{\mathbb{X}_k} g(x_1, \ldots, x_k)\, f_{X_1,\ldots,X_k}(x_1, \ldots, x_k),$$
$$E_{f_{X_1,\ldots,X_k}}[g(X_1, \ldots, X_k)] = \int_{\mathbb{X}_1} \cdots \int_{\mathbb{X}_k} g(x_1, \ldots, x_k)\, f_{X_1,\ldots,X_k}(x_1, \ldots, x_k)\,dx_1 \cdots dx_k.$$

PROPERTIES

(i) Let $g$ and $h$ be real-valued functions and let $a$ and $b$ be constants. Then, if $f_X \equiv f_{X_1,\ldots,X_k}$,
$$E_{f_X}[ag(X_1, \ldots, X_k) + bh(X_1, \ldots, X_k)] = a E_{f_X}[g(X_1, \ldots, X_k)] + b E_{f_X}[h(X_1, \ldots, X_k)].$$

(ii) Let $X_1, \ldots, X_k$ be independent random variables with mass functions/pdfs $f_{X_1}, \ldots, f_{X_k}$ respectively. Let $g_1, \ldots, g_k$ be scalar functions of $X_1, \ldots, X_k$ respectively (that is, $g_i$ is a function of $X_i$ only, for $i = 1, \ldots, k$). If $g(X_1, \ldots, X_k) = g_1(X_1) \cdots g_k(X_k)$, then
$$E_{f_X}[g(X_1, \ldots, X_k)] = \prod_{i=1}^{k} E_{f_{X_i}}[g_i(X_i)],$$
where $E_{f_{X_i}}[g_i(X_i)]$ is the marginal expectation of $g_i(X_i)$ with respect to $f_{X_i}$.

(iii) Generally,
$$E_{f_X}[g(X_1)] \equiv E_{f_{X_1}}[g(X_1)],$$
so that the expectation over the joint distribution is the same as the expectation over the marginal distribution. The proof is an immediate consequence of the fact that the marginal pdf $f_{X_1}$ is obtained by integrating the joint density with respect to $x_2, \ldots, x_k$. So, whenever we wish, it is reasonable to denote the expectation as, say, $E[g(X_1)]$, rather than $E_{f_{X_1}}[g(X_1)]$ or $E_{f_X}[g(X_1)]$: we can 'drop subscripts'.

2.10.2 COVARIANCE AND CORRELATION

Definition 2.10.2 The covariance of two random variables $X_1$ and $X_2$ is denoted $Cov_{f_{X_1,X_2}}[X_1, X_2]$, and is defined by
$$Cov_{f_{X_1,X_2}}[X_1, X_2] = E_{f_{X_1,X_2}}[(X_1 - \mu_1)(X_2 - \mu_2)] = E_{f_{X_1,X_2}}[X_1 X_2] - \mu_1 \mu_2,$$
where $\mu_i = E_{f_{X_i}}[X_i]$ is the marginal expectation of $X_i$, for $i = 1, 2$, and where
$$E_{f_{X_1,X_2}}[X_1 X_2] = \int\!\!\int x_1 x_2\, f_{X_1,X_2}(x_1, x_2)\,dx_1\,dx_2,$$
that is, the expectation of the function $g(x_1, x_2) = x_1 x_2$ with respect to the joint distribution $f_{X_1,X_2}$.

Page 42: M2S1 Lecture Notes

38 CHAPTER 2. RANDOM VARIABLES & PROBABILITY DISTRIBUTIONS

Definition 2.10.3 The correlation of $X_1$ and $X_2$ is denoted $Corr_{f_{X_1,X_2}}[X_1, X_2]$, and is defined by
$$Corr_{f_{X_1,X_2}}[X_1, X_2] = \frac{Cov_{f_{X_1,X_2}}[X_1, X_2]}{\sqrt{Var_{f_{X_1}}[X_1]\, Var_{f_{X_2}}[X_2]}}.$$
If $Cov_{f_{X_1,X_2}}[X_1, X_2] = Corr_{f_{X_1,X_2}}[X_1, X_2] = 0$, then the variables $X_1$ and $X_2$ are uncorrelated.

Note that if random variables $X_1$ and $X_2$ are independent, then
$$Cov_{f_{X_1,X_2}}[X_1, X_2] = E_{f_{X_1,X_2}}[X_1 X_2] - E_{f_{X_1}}[X_1] E_{f_{X_2}}[X_2] = E_{f_{X_1}}[X_1] E_{f_{X_2}}[X_2] - E_{f_{X_1}}[X_1] E_{f_{X_2}}[X_2] = 0,$$
and so $X_1$ and $X_2$ are also uncorrelated (the converse does not hold).

NOTES:

(i) For random variables $X_1$ and $X_2$, with (marginal) expectations $\mu_1$ and $\mu_2$ respectively, and (marginal) variances $\sigma_1^2$ and $\sigma_2^2$ respectively, if random variables $Z_1$ and $Z_2$ are defined by
$$Z_1 = (X_1 - \mu_1)/\sigma_1, \qquad Z_2 = (X_2 - \mu_2)/\sigma_2,$$
then $Z_1$ and $Z_2$ are standardized variables, with $E_{f_{Z_i}}[Z_i] = 0$, $Var_{f_{Z_i}}[Z_i] = 1$ and
$$Corr_{f_{X_1,X_2}}[X_1, X_2] = Cov_{f_{Z_1,Z_2}}[Z_1, Z_2].$$

(ii) Extension to $k$ variables: covariances can only be calculated for pairs of random variables, but if $k$ variables have a joint probability structure it is possible to construct a $k \times k$ matrix, $C$ say, of covariance values, whose $(i, j)$th element is
$$Cov_{f_{X_i,X_j}}[X_i, X_j],$$
for $i, j = 1, \ldots, k$, that captures the complete covariance structure of the joint distribution. If $i \neq j$, then
$$Cov_{f_{X_j,X_i}}[X_j, X_i] = Cov_{f_{X_i,X_j}}[X_i, X_j],$$
so $C$ is symmetric, and if $i = j$,
$$Cov_{f_{X_i,X_i}}[X_i, X_i] \equiv Var_{f_{X_i}}[X_i].$$
The matrix $C$ is referred to as the variance-covariance matrix.

(iii) If the random variable $X$ is defined by $X = a_1 X_1 + a_2 X_2 + \cdots + a_k X_k$, for random variables $X_1, \ldots, X_k$ and constants $a_1, \ldots, a_k$, then
$$E_{f_X}[X] = \sum_{i=1}^{k} a_i E_{f_{X_i}}[X_i],$$
$$Var_{f_X}[X] = \sum_{i=1}^{k} a_i^2\, Var_{f_{X_i}}[X_i] + 2\sum_{i=1}^{k}\sum_{j=1}^{i-1} a_i a_j\, Cov_{f_{X_i,X_j}}[X_i, X_j] = a^T C a, \quad a = (a_1, \ldots, a_k)^T.$$
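The mean and variance formulas for a linear combination can be cross-checked on a small discrete joint distribution (the pmf and coefficients are illustrative choices). A sketch with $k = 2$:

```python
# Illustrative joint pmf of (X1, X2) on {0,1,2} x {0,1}.
joint = {(0, 0): 0.10, (1, 0): 0.25, (2, 0): 0.05,
         (0, 1): 0.15, (1, 1): 0.15, (2, 1): 0.30}

def E(g):
    """Expectation of g(X1, X2) with respect to the joint pmf."""
    return sum(g(x1, x2) * p for (x1, x2), p in joint.items())

a1, a2 = 2.0, 3.0
X = lambda x1, x2: a1 * x1 + a2 * x2

# Direct: mean and variance of X = a1 X1 + a2 X2 straight from the joint pmf.
mean_direct = E(X)
var_direct = E(lambda x1, x2: (X(x1, x2) - mean_direct)**2)

# Formula: a1 E[X1] + a2 E[X2]; a1^2 Var[X1] + a2^2 Var[X2] + 2 a1 a2 Cov[X1, X2].
m1, m2 = E(lambda x1, x2: x1), E(lambda x1, x2: x2)
v1 = E(lambda x1, x2: x1**2) - m1**2
v2 = E(lambda x1, x2: x2**2) - m2**2
cov = E(lambda x1, x2: x1 * x2) - m1 * m2

mean_formula = a1 * m1 + a2 * m2
var_formula = a1**2 * v1 + a2**2 * v2 + 2 * a1 * a2 * cov

print(mean_direct, mean_formula)   # both close to 4.0
print(var_direct, var_formula)     # both close to 5.6
```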


(iv) Combining (i) and (iii) when $k = 2$, and defining standardized variables $Z_1$ and $Z_2$,
$$0 \le Var_{f_{Z_1,Z_2}}[Z_1 \pm Z_2] = Var_{f_{Z_1}}[Z_1] + Var_{f_{Z_2}}[Z_2] \pm 2Cov_{f_{Z_1,Z_2}}[Z_1, Z_2]$$
$$= 1 + 1 \pm 2Corr_{f_{X_1,X_2}}[X_1, X_2] = 2(1 \pm Corr_{f_{X_1,X_2}}[X_1, X_2]),$$
and hence
$$-1 \le Corr_{f_{X_1,X_2}}[X_1, X_2] \le 1.$$

2.10.3 JOINT MOMENT GENERATING FUNCTION

Definition 2.10.4 Let $X$ and $Y$ be jointly distributed. The joint moment generating function of $X$ and $Y$ is
$$M_{X,Y}(s, t) = E(e^{sX + tY}).$$
If this exists in a neighbourhood of the origin $(0, 0)$, then it has the same attractive properties as the ordinary moment generating function. It determines the joint distribution of $X$ and $Y$ uniquely, and it also yields the moments:
$$\left.\frac{\partial^{m+n}}{\partial s^m \partial t^n} M_{X,Y}(s, t)\right|_{s=t=0} = E(X^m Y^n).$$

Joint moment generating functions factorize for independent random variables: we have
$$M_{X,Y}(s, t) = M_X(s) M_Y(t)$$
if and only if $X$ and $Y$ are independent. Note, $M_X(s) = M_{X,Y}(s, 0)$, etc.

The definition of the joint moment generating function extends in an obvious way to three or more random variables, with the corresponding result for independence. For instance, the random variable $X$ is independent of $(Y, Z)$ if and only if the joint moment generating function $M_{X,Y,Z}(s, t, u) = E(e^{sX + tY + uZ})$ factorizes as
$$M_{X,Y,Z}(s, t, u) = M_X(s) M_{Y,Z}(t, u).$$

2.10.4 FURTHER RESULTS ON INDEPENDENCE

The following results are useful:

(I) Let X and Y be independent random variables. Let g(x) be a function only of x and h(y) be a function only of y. Then the random variables U = g(X) and V = h(Y) are independent.

(II) Let X1, ..., Xn be independent random vectors. Let gi(xi) be a function only of xi, for i = 1, ..., n. Then the random variables Ui = gi(Xi), i = 1, ..., n, are independent.


2.11 ORDER STATISTICS

Order statistics, like sample moments, play a very important role in statistical inference. Let X1, ..., Xn be independent, identically distributed continuous random variables, with cdf FX and pdf fX. Then order the Xi: let Y1 be the smallest of {X1, ..., Xn}, Y2 the second smallest of {X1, ..., Xn}, ..., Yn the largest of {X1, ..., Xn}. Note that since we assume continuity, the chance of ties is zero. It is customary to use the notation

Yk = X(k),

and then X(1), X(2), ..., X(n) are known as the order statistics of X1, ..., Xn. Two key results are:

Result A The order statistics have joint density

n! ∏_{i=1}^{n} fX(yi),   y1 < y2 < ... < yn.

Result B The order statistic X(k) has density

f_(k)(y) = k (n choose k) fX(y) {FX(y)}^{k−1} {1 − FX(y)}^{n−k}.

An informal proof of Result A is straightforward, using a symmetry argument based on the independent, identically distributed nature of X1, ..., Xn. Result B follows on noting that the event X(k) ≤ y occurs if and only if at least k of the Xi lie in (−∞, y]. Recalling the binomial distribution, this means that X(k) has distribution function

F_(k)(y) = Σ_{j=k}^{n} (n choose j) {FX(y)}^j {1 − FX(y)}^{n−j}.

The pdf follows on differentiating this cdf.


CHAPTER 3

DISCRETE PROBABILITY DISTRIBUTIONS

Definition 3.1.1 DISCRETE UNIFORM DISTRIBUTION

X ∼ Uniform(n)

fX(x) = 1/n,   x ∈ X = {1, 2, ..., n},

and zero otherwise.

Definition 3.1.2 BERNOULLI DISTRIBUTION

X ∼ Bernoulli(θ)

fX(x) = θ^x (1 − θ)^{1−x},   x ∈ X = {0, 1},

and zero otherwise.

NOTE The Bernoulli distribution is used for modelling when the outcome of an experiment is either a “success” or a “failure”, where the probability of getting a success is equal to θ. Such an experiment is a ‘Bernoulli trial’. The mgf is

MX(t) = (1− θ) + θet.

Definition 3.1.3 BINOMIAL DISTRIBUTION

X ∼ Bin(n, θ)

fX(x) = (n choose x) θ^x (1 − θ)^{n−x},   x ∈ X = {0, 1, 2, ..., n},   n ≥ 1, 0 ≤ θ ≤ 1.

NOTES
1. If X1, ..., Xk are independent and identically distributed (IID) Bernoulli(θ) random variables, and Y = X1 + ... + Xk, then by the standard result for mgfs,

MY(t) = {MX(t)}^k = (1 − θ + θe^t)^k,

so therefore Y ∼ Bin(k, θ) because of the uniqueness of mgfs. Thus the binomial distribution is used to model the total number of successes in a series of independent and identical experiments.

2. Alternatively, consider sampling without replacement from an infinite collection, or sampling with replacement from a finite collection of objects, a proportion θ of which are of Type I, and the remainder of which are of Type II. If X is the number of Type I objects in a sample of n, then X ∼ Bin(n, θ).


Definition 3.1.4 POISSON DISTRIBUTION

X ∼ Poisson(λ)

fX(x) = e^{−λ} λ^x / x!,   x ∈ X = {0, 1, 2, ...},   λ > 0,

and zero otherwise.

NOTES
1. If X ∼ Bin(n, θ), let λ = nθ. Then

MX(t) = (1 − θ + θe^t)^n = (1 + λ(e^t − 1)/n)^n −→ exp{λ(e^t − 1)},

as n −→ ∞, which is the mgf of a Poisson random variable. Therefore, the Poisson distribution arises as the limiting case of the binomial distribution, when n −→ ∞ and θ −→ 0 with nθ = λ constant (that is, for “large” n and “small” θ). So, if n is large and θ is small, we can reasonably approximate Bin(n, θ) by Poisson(λ).

2. Suppose that X1 and X2 are independent, with X1 ∼ Poisson(λ1), X2 ∼ Poisson(λ2). Then if Y = X1 + X2, using the general mgf result for independent random variables,

MY(t) = MX1(t) MX2(t) = exp{λ1(e^t − 1)} exp{λ2(e^t − 1)} = exp{(λ1 + λ2)(e^t − 1)},

so that Y ∼ Poisson(λ1 + λ2). Therefore, the sum of two independent Poisson random variables also has a Poisson distribution. This result can be extended easily; if X1, ..., Xk are independent random variables with Xi ∼ Poisson(λi) for i = 1, ..., k, then

Y = Σ_{i=1}^{k} Xi =⇒ Y ∼ Poisson(Σ_{i=1}^{k} λi).

3. THE POISSON PROCESS (This material is not examinable.)

The Poisson distribution arises as part of a larger modelling framework. Consider an experiment involving events (such as radioactive emissions) that occur repeatedly and randomly in time. Let X(t) be the random variable representing the number of events that occur in the interval [0, t], so that X(t) takes values 0, 1, 2, .... Informally, in a Poisson process we have the following properties. In the interval (t, t + h) there may or may not be events. If h is small, the probability of an event in (t, t + h) is roughly proportional to h; it is not very likely that two or more events occur in a small interval.

Formally, a Poisson process with intensity λ is a process X(t), t ≥ 0, taking values in {0, 1, 2, ...} such that

(a) X(0) = 0 and if s < t then X(s) ≤ X(t),

(b)

P(X(t + h) = n + m | X(t) = n) = λh + o(h), if m = 1;   o(h), if m > 1;   1 − λh + o(h), if m = 0.


(c) If s < t, the number of events X(t)−X(s) in (s, t] is independent of the number of events in[0, s].

Here o(h) denotes any function satisfying

lim_{h→0} o(h)/h = 0.

Then, if

Pn(t) = P[X(t) = n] = P[n events occur in [0, t]],

it can be shown that

Pn(t) = e^{−λt} (λt)^n / n!

(that is, the random variable corresponding to the number of events that occur in the interval [0, t] has a Poisson distribution with parameter λt).

Definition 3.1.5 GEOMETRIC DISTRIBUTION

X ∼ Geometric(θ)

fX(x) = (1 − θ)^{x−1} θ,   x ∈ X = {1, 2, ...},   0 ≤ θ ≤ 1.

NOTES
1. The cdf is available analytically as

FX(x) = 1 − (1 − θ)^x,   x = 1, 2, 3, ....

2. If X ∼ Geometric(θ), then for x, j ≥ 1,

P[X = x + j | X > j] = P[X = x + j, X > j] / P[X > j] = P[X = x + j] / P[X > j] = (1 − θ)^{x+j−1} θ / (1 − θ)^j = (1 − θ)^{x−1} θ = P[X = x].

So P[X = x + j | X > j] = P[X = x]. This property is unique (among discrete distributions) to the geometric distribution, and is called the lack of memory property.
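The lack of memory property can be confirmed numerically over a grid of (x, j) values; a minimal Python check (the value of θ is arbitrary):

```python
theta = 0.3

def geom_pmf(x, theta):
    # Support {1, 2, ...}: x - 1 failures followed by a success.
    return (1 - theta)**(x - 1) * theta

def geom_tail(j, theta):
    # P[X > j] = (1 - theta)^j, from the cdf FX(x) = 1 - (1 - theta)^x.
    return (1 - theta)**j

# P[X = x + j | X > j] should equal P[X = x] for every x, j >= 1.
max_err = max(abs(geom_pmf(x + j, theta) / geom_tail(j, theta) - geom_pmf(x, theta))
              for x in range(1, 8) for j in range(1, 8))
```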

3. Alternative representations are sometimes useful:

fX(x) = φ^{x−1} (1 − φ),   x = 1, 2, 3, ... (that is, φ = 1 − θ),

fX(x) = φ^x (1 − φ),   x = 0, 1, 2, ....

4. The geometric distribution is used to model the number, X, of independent, identical Bernoulli trials until the first success is obtained. It is a discrete waiting time distribution.

Definition 3.1.6 NEGATIVE BINOMIAL DISTRIBUTION

X ∼ NegBin(n, θ)

fX(x) = (x − 1 choose n − 1) θ^n (1 − θ)^{x−n},   x ∈ X = {n, n + 1, ...},   n ≥ 1, 0 ≤ θ ≤ 1.


NOTES
1. If X ∼ Bin(n, θ) and Y ∼ NegBin(r, θ), then for r ≤ n, P[X ≥ r] = P[Y ≤ n].

2. The negative binomial distribution is used to model the number, X, of independent, identical Bernoulli trials needed to obtain exactly n successes, that is, the number of trials up to and including the nth success.

3. Alternative representation: let Y be the number of failures in a sequence of independent, identical Bernoulli trials that contains exactly n successes. Then Y = X − n, and hence

fY(y) = (n + y − 1 choose n − 1) θ^n (1 − θ)^y,   y ∈ {0, 1, ...}.

4. If Xi ∼ Geometric(θ), for i = 1, ..., n, are i.i.d. random variables, and Y = X1 + ... + Xn, then Y ∼ NegBin(n, θ) (the result follows immediately using mgfs).

5. If X ∼ NegBin(n, θ), let λ = n(1 − θ)/θ and Y = X − n. Then

MY(t) = e^{−nt} MX(t) = {θ / (1 − (1 − θ)e^t)}^n = {1 / (1 − (λ/n)(e^t − 1))}^n −→ exp{λ(e^t − 1)},

as n −→ ∞. Hence the alternative form of the negative binomial distribution tends to the Poisson distribution as n −→ ∞ with n(1 − θ)/θ = λ held constant.
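This limit, like the binomial one, can be checked numerically. A Python sketch (n and λ chosen for illustration) compares the shifted negative binomial mass function with the Poisson limit:

```python
import math

lam = 2.0
n = 500
theta = n / (n + lam)   # chosen so that n * (1 - theta) / theta = lam

def negbin_shifted_pmf(y):
    # P[Y = y] for Y = X - n, the number-of-failures form.
    return math.comb(n + y - 1, n - 1) * theta**n * (1 - theta)**y

def pois_pmf(y):
    return math.exp(-lam) * lam**y / math.factorial(y)

max_err = max(abs(negbin_shifted_pmf(y) - pois_pmf(y)) for y in range(10))
```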

Definition 3.1.7 HYPERGEOMETRIC DISTRIBUTION

X ∼ HypGeom(N, R, n), for N ≥ R ≥ n,

fX(x) = (R choose x)(N − R choose n − x) / (N choose n),   x ∈ X = {max(0, n − N + R), ..., min(n, R)},

and zero otherwise.

NOTES
1. The hypergeometric distribution is used as a model for experiments involving sampling without replacement from a finite population. Specifically, consider a finite population of size N, consisting of R items of Type I and N − R of Type II: take a sample of size n without replacement, and let X be the number of Type I objects in the sample. The mass function for the hypergeometric distribution can be obtained by using combinatorial counting techniques. However, the form of the mass function does not lend itself readily to calculation of moments etc.

2. As N, R −→ ∞ with R/N = θ (constant), then

P[X = x] −→ (n choose x) θ^x (1 − θ)^{n−x},

so the distribution tends to a binomial distribution.
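This binomial limit is easy to verify with exact combinatorics; a Python sketch (population sizes chosen arbitrarily):

```python
import math

def hypergeom_pmf(x, N, R, n):
    return math.comb(R, x) * math.comb(N - R, n - x) / math.comb(N, n)

def binom_pmf(x, n, theta):
    return math.comb(n, x) * theta**x * (1 - theta)**(n - x)

n, theta = 10, 0.3
N = 100_000
R = int(theta * N)   # R/N = theta held fixed as N grows
max_err = max(abs(hypergeom_pmf(x, N, R, n) - binom_pmf(x, n, theta))
              for x in range(n + 1))
```

Intuitively, for N much larger than n, removing sampled items barely changes the proportion of Type I objects, so sampling without replacement behaves like sampling with replacement.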


CHAPTER 4

CONTINUOUS PROBABILITY DISTRIBUTIONS

Definition 4.1.1 CONTINUOUS UNIFORM DISTRIBUTION

X ∼ Uniform(a, b)

fX(x) = 1/(b − a),   a ≤ x ≤ b.

NOTES
1. The cdf is

FX(x) = (x − a)/(b − a),   a ≤ x ≤ b.

2. The case a = 0 and b = 1 gives the standard uniform distribution.

Definition 4.1.2 EXPONENTIAL DISTRIBUTION

X ∼ Exp(λ)

fX(x) = λ e^{−λx},   x > 0, λ > 0.

NOTES
1. The cdf is

FX(x) = 1 − e^{−λx},   x > 0.

2. An alternative representation uses θ = 1/λ as the parameter of the distribution. This is sometimes used because the expectation and variance of the exponential distribution are

E_{fX}[X] = 1/λ = θ,   Var_{fX}[X] = 1/λ².

3. If X ∼ Exp(λ), then, for all x, t > 0,

P[X > x + t | X > t] = P[X > x + t, X > t] / P[X > t] = P[X > x + t] / P[X > t] = e^{−λ(x+t)} / e^{−λt} = e^{−λx} = P[X > x].

Thus, for all x, t > 0, P[X > x + t | X > t] = P[X > x]; this is known as the Lack of Memory Property, and is unique to the exponential distribution amongst continuous distributions.

4. Suppose that X(t) is a Poisson process with rate parameter λ > 0, so that

P[X(t) = n] = e^{−λt} (λt)^n / n!.

Let X1, ..., Xn be random variables defined by X1 = “time that first event occurs”, and, for i = 2, ..., n, Xi = “time interval between occurrence of (i − 1)st and ith events”. Then X1, ..., Xn are IID because of the assumptions underlying the Poisson process. So consider the distribution of X1; in particular, consider the probability P[X1 > x] for x > 0. The event [X1 > x] is equivalent to the event “no events occur in the interval (0, x]”, which has probability e^{−λx}. But

FX1(x) = P[X1 ≤ x] = 1 − P[X1 > x] = 1 − e^{−λx} =⇒ X1 ∼ Exp(λ).

5. The exponential distribution is used to model failure times in continuous time. It is a continuous waiting time distribution, the continuous analogue of the geometric distribution.

6. If X ∼ Uniform(0, 1), and

Y = −(1/λ) log(1 − X),

then Y ∼ Exp(λ).
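This is the inverse-transform method for simulating exponential variables. A Python sketch (sample size, seed, and λ chosen for illustration) transforms uniform draws and checks the sample mean and empirical cdf against the theoretical values:

```python
import math
import random

random.seed(0)
lam = 2.0
n = 100_000
# Transform Uniform(0,1) draws through Y = -log(1 - X)/lam.
sample = [-math.log(1.0 - random.random()) / lam for _ in range(n)]

mean_est = sum(sample) / n   # should be close to 1/lam = 0.5
# Empirical cdf at x = 1 should be close to FX(1) = 1 - exp(-2).
ecdf_1 = sum(1 for y in sample if y <= 1.0) / n
```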

7. If X ∼ Exp(λ), and

Y = X^{1/α},

for α > 0, then Y has a (two-parameter) Weibull distribution, and

fY(y) = αλ y^{α−1} e^{−λy^α},   y > 0.

Definition 4.1.3 GAMMA DISTRIBUTION

X ∼ Ga(α, β)

fX(x) = (β^α / Γ(α)) x^{α−1} e^{−βx},   x > 0, α, β > 0,

where, for any real number α > 0, the gamma function, Γ(·), is defined by

Γ(α) = ∫_0^∞ t^{α−1} e^{−t} dt.

NOTES
1. If X1 ∼ Ga(α1, β) and X2 ∼ Ga(α2, β) are independent random variables, and

Y = X1 + X2,

then Y ∼ Ga(α1 + α2, β) (directly from properties of mgfs).

2. Ga(1, β) ≡ Exp(β).

3. If X1, ..., Xn ∼ Exp(λ) are independent random variables, and

Y = X1 + ... + Xn,

then Y ∼ Ga(n, λ) (directly from 1. and 2.).

4. For α > 1, integrating by parts, we have that

Γ(α) = (α − 1)Γ(α − 1),

and hence if α = 1, 2, ..., then Γ(α) = (α − 1)!. A useful fact is that Γ(1/2) = √π.
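These identities can be confirmed numerically via the standard library's gamma function; a short Python check (test points chosen arbitrarily):

```python
import math

# Gamma(alpha) = (alpha - 1) * Gamma(alpha - 1), checked at a few points
# via the relative error of the ratio.
recursion_err = max(abs(math.gamma(a) / ((a - 1) * math.gamma(a - 1)) - 1)
                    for a in (1.5, 2.0, 3.7, 5.25))

# Gamma(1/2) = sqrt(pi), and Gamma(n) = (n - 1)! for integer n.
half_err = abs(math.gamma(0.5) - math.sqrt(math.pi))
fact_err = abs(math.gamma(6) - math.factorial(5))
```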

Page 51: M2S1 Lecture Notes

47

5. Special case: if α = 1, 2, ..., the Ga(α/2, 1/2) distribution is also known as the chi-squared distribution with α degrees of freedom, denoted by χ²_α.

6. If X1 ∼ χ²_{n1} and X2 ∼ χ²_{n2} are independent chi-squared random variables with n1 and n2 degrees of freedom respectively, then the random variable F defined as the ratio

F = (X1/n1) / (X2/n2)

has an F-distribution with (n1, n2) degrees of freedom.

Definition 4.1.4 BETA DISTRIBUTION

X ∼ Be(α, β)

fX(x) = (Γ(α + β) / (Γ(α)Γ(β))) x^{α−1} (1 − x)^{β−1},   0 < x < 1, α, β > 0.

NOTES
1. If α = β = 1, Be(α, β) ≡ Uniform(0, 1).

2. If X1 ∼ Ga(α1, β) and X2 ∼ Ga(α2, β) are independent random variables, and

Y = X1 / (X1 + X2),

then Y ∼ Be(α1, α2) (using standard multivariate transformation techniques).

3. Suppose that random variables X and Y have a joint probability distribution such that the conditional distribution of X, given Y = y for 0 < y < 1, is binomial, Bin(n, y), and the marginal distribution of Y is beta, Be(α, β), so that

fX|Y(x|y) = (n choose x) y^x (1 − y)^{n−x},   x = 0, 1, ..., n,

fY(y) = (Γ(α + β) / (Γ(α)Γ(β))) y^{α−1} (1 − y)^{β−1},   0 < y < 1.

Then the marginal distribution of X is given by

fX(x) = ∫_0^1 fX|Y(x|y) fY(y) dy = (n choose x) (Γ(α + β) / (Γ(α)Γ(β))) (Γ(x + α)Γ(n − x + β) / Γ(n + α + β)),   x = 0, 1, 2, ..., n.

Note that this provides an example of a joint distribution of continuous Y and discrete X.
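This marginal is the beta-binomial distribution. A Python sketch (parameter values chosen for illustration) evaluates the mass function above and checks two consequences: it sums to 1, and by iterated expectation E[X] = E[E[X|Y]] = n E[Y] = nα/(α + β):

```python
import math

def beta_binomial_pmf(x, n, alpha, beta):
    # Marginal mass function obtained by integrating Bin(n, y) against Be(alpha, beta).
    g = math.gamma
    return (math.comb(n, x) * g(alpha + beta) / (g(alpha) * g(beta))
            * g(x + alpha) * g(n - x + beta) / g(n + alpha + beta))

n, alpha, beta = 10, 2.0, 3.0
pmf = [beta_binomial_pmf(x, n, alpha, beta) for x in range(n + 1)]
total = sum(pmf)                                    # should be 1
mean = sum(x * p for x, p in zip(range(n + 1), pmf))  # should be n*alpha/(alpha+beta) = 4
```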


Definition 4.1.5 NORMAL DISTRIBUTION

X ∼ N(μ, σ²)

fX(x) = (1/√(2πσ²)) exp{−(x − μ)²/(2σ²)},   x ∈ R, μ ∈ R, σ > 0.

NOTES
1. Special case: if μ = 0 and σ² = 1, then X has a standard or unit normal distribution. Usually, the pdf of the standard normal is written φ(x), and the cdf is written Φ(x).

2. If X ∼ N(0, 1), and

Y = σX + μ,

then Y ∼ N(μ, σ²). Re-expressing this result, if X ∼ N(μ, σ²), and Y = (X − μ)/σ, then Y ∼ N(0, 1) (using transformation or mgf techniques).

3. The Central Limit Theorem. Suppose X1, ..., Xn are IID random variables with some mgf MX, with E_{fX}[Xi] = μ and Var_{fX}[Xi] = σ², that is, the mgf and the expectation and variance of the Xi’s are specified, but the pdf is not. Let the standardized random variable Zn be defined by

Zn = (Σ_{i=1}^{n} Xi − nμ) / √(nσ²),

and let Zn have mgf MZn. Then, as n −→ ∞,

MZn(t) −→ exp{t²/2},

irrespective of the distribution of the Xi’s, that is, the distribution of Zn tends to a standard normal distribution as n tends to infinity. This theorem will be proved and explained in Chapter 6.

4. If X ∼ N(0, 1), and Y = X², then Y ∼ χ²₁, so that the square of a unit normal random variable has a chi-squared distribution with 1 degree of freedom.

5. If X ∼ N(0, 1) and Y ∼ N(0, 1) are independent random variables, and Z is defined by Z = X/Y, then Z has a Cauchy distribution,

fZ(z) = 1 / (π(1 + z²)),   z ∈ R.

6. If X ∼ N(0, 1), and Y ∼ Ga(n/2, 1/2) for n = 1, 2, ... (so that Y ∼ χ²_n), are independent random variables, and T is defined by

T = X / √(Y/n),

then T has a Student-t distribution with n degrees of freedom, T ∼ St(n),

fT(t) = (Γ((n + 1)/2) / (Γ(n/2) (nπ)^{1/2})) {1 + t²/n}^{−(n+1)/2},   t ∈ R.


Taking limiting cases of the Student-t distribution:

n −→ ∞ : St(n) −→ N(0, 1);   n = 1 : St(1) ≡ Cauchy.

7. If X1 ∼ N(μ1, σ1²) and X2 ∼ N(μ2, σ2²) are independent and a, b are constants, then

T = aX1 + bX2 ∼ N(aμ1 + bμ2, a²σ1² + b²σ2²).


CHAPTER 5

MULTIVARIATE PROBABILITY DISTRIBUTIONS

For purely notational reasons, it is convenient in this chapter to consider a random vector X as a column vector, X = (X1, ..., Xk)^T, say.

5.1 THE MULTINOMIAL DISTRIBUTION

The multinomial distribution is a multivariate generalization of the binomial distribution. Recall that the binomial distribution arose from an infinite urn model with two types of objects being sampled with replacement. Suppose that the proportion of “Type 1” objects in the urn is θ (so 0 ≤ θ ≤ 1) and hence the proportion of “Type 2” objects in the urn is 1 − θ. Suppose that n objects are sampled, and X is the random variable corresponding to the number of “Type 1” objects in the sample. Then X ∼ Bin(n, θ), and

fX(x) = (n choose x) θ^x (1 − θ)^{n−x},   x ∈ {0, 1, 2, ..., n}.

Now consider a generalization; suppose that the urn contains k + 1 types of objects (k = 1, 2, ...), with θi being the proportion of Type i objects, for i = 1, ..., k + 1. Let Xi be the random variable corresponding to the number of Type i objects in a sample of size n, for i = 1, ..., k. Then the joint distribution of the vector X = (X1, ..., Xk)^T is given by

fX1,...,Xk(x1, ..., xk) = (n! / (x1! ... xk! xk+1!)) θ1^{x1} ... θk^{xk} θ_{k+1}^{x_{k+1}} = (n! / (x1! ... xk! xk+1!)) ∏_{i=1}^{k+1} θi^{xi},

where 0 ≤ θi ≤ 1 for all i, θ1 + ... + θk + θk+1 = 1, and where xk+1 is defined by xk+1 = n − (x1 + ... + xk). This is the mass function for the multinomial distribution, which reduces to the binomial if k = 1. It can also be shown that the marginal distribution of Xi is Bin(n, θi).
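Both claims can be checked by direct enumeration; a Python sketch for k + 1 = 3 types (proportions chosen arbitrarily):

```python
import math

thetas = (0.5, 0.3, 0.2)   # proportions of the k + 1 = 3 types
n = 8

def multinomial_pmf(counts, thetas):
    # n!/(x1! ... x_{k+1}!) * theta_1^{x1} ... theta_{k+1}^{x_{k+1}}
    coef = math.factorial(sum(counts))
    for x in counts:
        coef //= math.factorial(x)
    p = float(coef)
    for x, t in zip(counts, thetas):
        p *= t**x
    return p

# The pmf sums to 1 over all (x1, x2, x3) with x1 + x2 + x3 = n ...
total = sum(multinomial_pmf((x1, x2, n - x1 - x2), thetas)
            for x1 in range(n + 1) for x2 in range(n + 1 - x1))

# ... and the marginal of X1 is Bin(n, theta_1).
marginal = [sum(multinomial_pmf((x1, x2, n - x1 - x2), thetas)
                for x2 in range(n + 1 - x1)) for x1 in range(n + 1)]
binom = [math.comb(n, x) * 0.5**x * 0.5**(n - x) for x in range(n + 1)]
max_err = max(abs(a - b) for a, b in zip(marginal, binom))
```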

5.2 THE DIRICHLET DISTRIBUTION

The Dirichlet distribution is a multivariate generalization of the beta distribution. Recall that the beta distribution arose as follows: suppose that V1 and V2 are independent Gamma random variables with V1 ∼ Ga(α1, β), V2 ∼ Ga(α2, β). Then if X is defined by X = V1/(V1 + V2), we have that X ∼ Be(α1, α2). Now consider a generalization; suppose that V1, ..., Vk+1 are independent Gamma random variables with Vi ∼ Ga(αi, β), for i = 1, ..., k + 1. Define

Xi = Vi / (V1 + ... + Vk+1),

for i = 1, ..., k. Then the joint distribution of the vector X = (X1, ..., Xk)^T is given by the density

fX1,...,Xk(x1, ..., xk) = (Γ(α) / (Γ(α1) ... Γ(αk)Γ(αk+1))) x1^{α1−1} ... xk^{αk−1} x_{k+1}^{α_{k+1}−1},


for 0 ≤ xi ≤ 1 for all i, with x1 + ... + xk + xk+1 = 1, where α = α1 + ... + αk+1 and where xk+1 is defined by xk+1 = 1 − (x1 + ... + xk). This is the density function which reduces to the beta distribution if k = 1. It can also be shown that the marginal distribution of Xi is Be(αi, α − αi).

5.3 THE MULTIVARIATE NORMAL DISTRIBUTION

The random vector X = (X1, ..., Xk)^T has a multivariate normal distribution if the joint pdf is of the form:

fX(x1, ..., xk) = (1/(2π))^{k/2} (1/|Σ|^{1/2}) exp{−(1/2)(x − μ)^T Σ^{−1} (x − μ)}.

Here x is the (column) vector of length k formed by x1, ..., xk, μ is a (column) vector of length k, Σ is a k × k symmetric, positive definite matrix [Σ = Σ^T, x^T Σ x > 0 for all x ≠ 0], and |Σ| denotes the determinant of Σ.

We write X ∼ Nk(μ,Σ).

Properties

1. E[X] = μ: μ is the mean vector of X. If μ = (μ1, ..., μk)^T, we have E[Xi] = μi. Further, Σ is the variance-covariance matrix of X, Σ = [σij], where σij = cov[Xi, Xj].

2. Since Σ is symmetric and positive definite, there exists a matrix Σ^{1/2} [the ‘square root of Σ’] such that: (i) Σ^{1/2} is symmetric; (ii) Σ = Σ^{1/2}Σ^{1/2}; (iii) Σ^{1/2}Σ^{−1/2} = Σ^{−1/2}Σ^{1/2} = I, the k × k identity matrix, with Σ^{−1/2} = (Σ^{1/2})^{−1}.

Then, if Z ∼ Nk(0, I), so that Z1, ..., Zk are IID N(0, 1), and X = μ + Σ^{1/2}Z, then X ∼ Nk(μ, Σ). Conversely, if X ∼ Nk(μ, Σ), then Σ^{−1/2}(X − μ) ∼ Nk(0, I).

3. A useful result is the following: if X ∼ Nk(μ, Σ) and D is an m × k matrix of rank m ≤ k, then Y ≡ DX ∼ Nm(Dμ, DΣD^T). A special case is where X ∼ Nk(0, I) and D is a k × k matrix of full rank k, so that D is invertible: then Y = DX ∼ Nk(0, DD^T).

4. Suppose we partition X as

X = (Xa; Xb),

with

Xa = (X1, ..., Xm)^T,   Xb = (Xm+1, ..., Xk)^T.

We can similarly partition μ and Σ:

μ = (μa; μb),   Σ = [ Σaa  Σab ; Σba  Σbb ].

Then if X ∼ Nk(μ,Σ) we have:


(I) The marginal distribution of Xa is Nm(μa,Σaa).

(II) The conditional distribution of Xb, given Xa = xa, is

Xb | Xa = xa ∼ N_{k−m}(μb + Σba Σaa^{−1}(xa − μa), Σbb − Σba Σaa^{−1} Σab).

(III) If a = (a1, ..., ak)^T, then

a^T X ≡ Σ_{i=1}^{k} ai Xi ∼ N(a^T μ, a^T Σ a).

(IV) V = (X − μ)^T Σ^{−1} (X − μ) ∼ χ²_k.
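The construction in Property 2 can be illustrated by simulation for k = 2. The sketch below (mean, covariance, seed, and sample size all chosen arbitrarily) uses the Cholesky factor L with Σ = LL^T in place of the symmetric square root — any factor with Σ = LL^T yields X = μ + LZ ∼ N₂(μ, Σ) — and checks the sample moments:

```python
import math
import random

random.seed(1)
mu = (1.0, -2.0)
# Target covariance matrix (symmetric, positive definite).
Sigma = ((2.0, 0.6),
         (0.6, 1.0))
# 2x2 Cholesky factor L with Sigma = L L^T.
l11 = math.sqrt(Sigma[0][0])
l21 = Sigma[1][0] / l11
l22 = math.sqrt(Sigma[1][1] - l21**2)

n = 50_000
xs, ys = [], []
for _ in range(n):
    z1, z2 = random.gauss(0, 1), random.gauss(0, 1)   # IID N(0,1)
    xs.append(mu[0] + l11 * z1)
    ys.append(mu[1] + l21 * z1 + l22 * z2)

mx = sum(xs) / n            # should be close to mu[0]
my = sum(ys) / n            # should be close to mu[1]
cov_xy = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / n  # close to 0.6
```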


CHAPTER 6

PROBABILITY RESULTS & LIMIT THEOREMS

6.1 BOUNDS ON PROBABILITIES BASED ON MOMENTS

Theorem 6.1.1 If X is a random variable, then for non-negative function h, and c > 0,

P[h(X) ≥ c] ≤ E_{fX}[h(X)] / c.

Proof. (continuous case): Suppose that X has density function fX which is positive for x ∈ X. Let A = {x ∈ X : h(x) ≥ c} ⊆ X. Then, as h(x) ≥ c on A,

E_{fX}[h(X)] = ∫ h(x) fX(x) dx = ∫_A h(x) fX(x) dx + ∫_{A′} h(x) fX(x) dx
  ≥ ∫_A h(x) fX(x) dx ≥ ∫_A c fX(x) dx = c P[X ∈ A] = c P[h(X) ≥ c].

SPECIAL CASE I - THE MARKOV INEQUALITY: If h(x) = |x|^r for r > 0, then

P[|X|^r ≥ c] ≤ E_{fX}[|X|^r] / c.

SPECIAL CASE II - THE CHEBYCHEV INEQUALITY: Suppose that X is a random variable with expectation μ and variance σ². Then taking h(x) = (x − μ)² and c = k²σ², for k > 0, gives

P[|X − μ| ≥ kσ] ≤ 1/k²,

and setting ε = kσ gives

P[|X − μ| ≥ ε] ≤ σ²/ε²,   P[|X − μ| < ε] ≥ 1 − σ²/ε².
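The Chebychev bound is often far from tight; a simulation in Python (distribution, seed, and k chosen for illustration) makes this concrete for a centred Exp(1) variable, which has μ = 0 and σ² = 1:

```python
import random

random.seed(2)
mu, sigma = 0.0, 1.0
k = 2.0
n = 100_000
# X - 1 with X ~ Exp(1): expectation 0, variance 1.
sample = [random.expovariate(1.0) - 1.0 for _ in range(n)]
prob = sum(1 for x in sample if abs(x - mu) >= k * sigma) / n
bound = 1 / k**2   # Chebychev: 0.25
```

Here the true probability is e⁻³ ≈ 0.05, comfortably inside the bound 0.25: the inequality holds for every distribution with finite variance, at the price of being conservative for any particular one.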

Theorem 6.1.2 JENSEN’S INEQUALITY

Suppose that X is a random variable, and function g is convex, so that

d²/dt² {g(t)}_{t=x} = g′′(x) > 0, for all x,

with Taylor expansion around the expectation μ of the form

g(x) = g(μ) + (x − μ)g′(μ) + (1/2)(x − μ)² g′′(x0),   (6.1)

for some x0 between x and μ. Then

E_{fX}[g(X)] ≥ g(E_{fX}[X]).

Proof. Taking expectations in (6.1), and noting that E_{fX}[(X − μ)] = 0, E_{fX}[(X − μ)²] = σ², and g′′(x0) ≥ 0, we have that

E_{fX}[g(X)] = g(μ) + (0 × g′(μ)) + (1/2)(σ² × g′′(x0)) ≥ g(μ) = g(E_{fX}[X]),

as σ², g′′(x0) > 0.


6.2 THE CENTRAL LIMIT THEOREM

Theorem 6.2.1 Suppose X1, ..., Xn are i.i.d. random variables with mgf MX, with

E_{fX}[Xi] = μ,   Var_{fX}[Xi] = σ²,

both finite. Let the random variable Zn be defined by

Zn = (Σ_{i=1}^{n} Xi − nμ) / √(nσ²),

and let Zn have mgf MZn. Then, as n −→ ∞,

MZn(t) −→ exp{t²/2},

irrespective of the form of MX .

Proof. First, let Yi = (Xi − μ)/σ for i = 1, ..., n. Then Y1, ..., Yn are i.i.d. with mgf MY, say, and by the elementary properties of expectation,

E_{fY}[Yi] = 0,   Var_{fY}[Yi] = 1,

for each i. Using the power series expansion result for mgfs, we have that

MY(t) = 1 + t E_{fY}[Y] + (t²/2!) E_{fY}[Y²] + (t³/3!) E_{fY}[Y³] + (t⁴/4!) E_{fY}[Y⁴] + ...
  = 1 + t²/2! + (t³/3!) E_{fY}[Y³] + (t⁴/4!) E_{fY}[Y⁴] + ....

Now, the random variable Zn can be rewritten

Zn = (1/√n) Σ_{i=1}^{n} (Xi − μ)/σ,

and thus, again by a standard mgf result, as Y1, ..., Yn are independent, we have that

MZn(t) = {MY(t/√n)}^n = {1 + t²/(2n) + (t³/(6n^{3/2})) E_{fY}[Y³] + (t⁴/(24n²)) E_{fY}[Y⁴] + ...}^n.

Thus, as n −→ ∞, using the properties of the exponential function, which give that if an → a, then

(1 + an/n)^n → e^a,

we have

MZn(t) −→ exp{t²/2}.

INTERPRETATION: Sums of independent and identically distributed random variables have a limiting distribution that is normal, irrespective of the distribution of the variables.
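The theorem is easy to see in simulation. The Python sketch below (choice of Uniform(0, 1) summands, n, and repetition count all illustrative) standardizes sums of uniforms, for which μ = 1/2 and σ² = 1/12, and checks that P[Zn ≤ 1.96] is close to Φ(1.96) = 0.975:

```python
import math
import random

random.seed(3)
# Xi ~ Uniform(0, 1): mu = 1/2, sigma^2 = 1/12.
mu, sigma2 = 0.5, 1.0 / 12.0
n, reps = 50, 20_000

zs = []
for _ in range(reps):
    s = sum(random.random() for _ in range(n))
    # Standardize: Zn = (sum - n*mu) / sqrt(n*sigma^2).
    zs.append((s - n * mu) / math.sqrt(n * sigma2))

# Fraction of standardized sums at or below 1.96: close to Phi(1.96) = 0.975.
frac = sum(1 for z in zs if z <= 1.96) / reps
```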


6.3 MODES OF STOCHASTIC CONVERGENCE

6.3.1 CONVERGENCE IN DISTRIBUTION

Definition 6.3.1 Consider a sequence {Xn}, n = 1, 2, ..., of random variables and a corresponding sequence of cdfs, FX1, FX2, ..., so that for n = 1, 2, ..., FXn(x) = P[Xn ≤ x]. Suppose that there exists a cdf, FX, such that for all x at which FX is continuous,

lim_{n→∞} FXn(x) = FX(x).

Then the sequence {Xn} converges in distribution to the random variable X with cdf FX. This is denoted

Xn →d X,

and FX is the limiting distribution.

Convergence of a sequence of mgfs also indicates convergence in distribution. That is, if for all t at which MX(t) is defined, as n −→ ∞, we have

MXn(t) −→ MX(t),

then Xn →d X.

Definition 6.3.2 The sequence {Xn} of random variables converges in distribution to the constant c if the limiting distribution of Xn is degenerate at c, that is, Xn →d X with P[X = c] = 1, so that

FX(x) = 0 for x < c,   FX(x) = 1 for x ≥ c.

This special type of convergence in distribution occurs when the limiting distribution is discrete, with the probability mass function only being non-zero at a single value. That is, if the limiting random variable is X, then

fX(x) = 1, x = c, and zero otherwise.

Theorem 6.3.1 The sequence of random variables {Xn} converges in distribution to c if and only if, for all ε > 0,

lim_{n→∞} P[|Xn − c| < ε] = 1.

This theorem indicates that convergence in distribution to a constant c occurs if and only if the probability becomes increasingly concentrated around c as n −→ ∞.


6.3.2 CONVERGENCE IN PROBABILITY

Definition 6.3.3 CONVERGENCE IN PROBABILITY TO A CONSTANT
The sequence of random variables {Xn} converges in probability to the constant c, denoted

Xn →P c,

if for all ε > 0,

lim_{n→∞} P[|Xn − c| < ε] = 1, or, equivalently, lim_{n→∞} P[|Xn − c| ≥ ε] = 0,

that is, if the limiting distribution of Xn is degenerate at c.

Interpretation. Convergence in probability to a constant is precisely equivalent to convergence in distribution to a constant.

A very useful result is Slutsky’s Theorem, which states that if Xn →d X and Yn →P c, where c is a finite constant, then: (i) Xn + Yn →d X + c; (ii) XnYn →d cX; (iii) Xn/Yn →d X/c, if c ≠ 0.

Theorem 6.3.2 WEAK LAW OF LARGE NUMBERS
Suppose that {Xn} is a sequence of i.i.d. random variables with expectation μ and variance σ². Let Yn be defined by

Yn = (1/n) Σ_{i=1}^{n} Xi.

Then, for all ε > 0,

lim_{n→∞} P[|Yn − μ| < ε] = 1,

that is, Yn →P μ, and thus the mean of X1, ..., Xn converges in probability to μ.

Proof. Using the properties of expectation, it can be shown that Yn has expectation μ and variance σ²/n, and hence by the Chebychev Inequality,

P[|Yn − μ| ≥ ε] ≤ σ²/(nε²) −→ 0, as n −→ ∞,

for all ε > 0. Hence

P[|Yn − μ| < ε] −→ 1, as n −→ ∞,

and Yn →P μ.
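A quick simulation makes the theorem tangible. The Python sketch below (fair die rolls, with μ = 3.5; all tuning values illustrative) estimates P[|Yn − μ| < ε] for a small and a large n:

```python
import random

random.seed(4)
mu, eps = 3.5, 0.1   # expectation of a fair die roll

def prob_within(n, reps=1000):
    # Estimate P[|Y_n - mu| < eps] for the mean Y_n of n die rolls.
    hits = 0
    for _ in range(reps):
        y = sum(random.randint(1, 6) for _ in range(n)) / n
        if abs(y - mu) < eps:
            hits += 1
    return hits / reps

p_small, p_large = prob_within(10), prob_within(2000)
```

As the theorem predicts, the probability of the mean lying within ε of μ increases towards 1 as n grows.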

Definition 6.3.4 CONVERGENCE TO A RANDOM VARIABLE
The sequence of random variables {Xn} converges in probability to the random variable X, denoted Xn →P X, if, for all ε > 0,

lim_{n→∞} P[|Xn − X| < ε] = 1, or, equivalently, lim_{n→∞} P[|Xn − X| ≥ ε] = 0.


6.3.3 CONVERGENCE IN QUADRATIC MEAN

Definition 6.3.5 CONVERGENCE IN QUADRATIC MEAN
The sequence of random variables {Xn} converges in quadratic mean (also called L2 convergence) to the random variable X, denoted Xn →qm X, if

E[(Xn − X)²] → 0,

as n −→ ∞.

Theorem 6.3.3 For the sequence {Xn} of random variables,

(a) convergence in quadratic mean to a random variable implies convergence in probability:

Xn →qm X =⇒ Xn →P X;

(b) convergence in probability to a random variable implies convergence in distribution:

Xn →P X =⇒ Xn →d X.

Proof. The proof of (a) is simple. Suppose that Xn →qm X. Fix ε > 0. Then, by Markov’s inequality,

P(|Xn − X| > ε) = P((Xn − X)² > ε²) ≤ E[(Xn − X)²]/ε² −→ 0.

Proof of (b). Fix ε > 0 and let x be a continuity point of FX. Then

FXn(x) = P(Xn ≤ x, X ≤ x + ε) + P(Xn ≤ x, X > x + ε)
  ≤ P(X ≤ x + ε) + P(|Xn − X| > ε)
  = FX(x + ε) + P(|Xn − X| > ε).

Also,

FX(x − ε) = P(X ≤ x − ε) = P(X ≤ x − ε, Xn ≤ x) + P(X ≤ x − ε, Xn > x)
  ≤ FXn(x) + P(|Xn − X| > ε).

Hence,

FX(x − ε) − P(|Xn − X| > ε) ≤ FXn(x) ≤ FX(x + ε) + P(|Xn − X| > ε).

Take the limit as n −→ ∞ to conclude that

FX(x − ε) ≤ lim inf_{n→∞} FXn(x) ≤ lim sup_{n→∞} FXn(x) ≤ FX(x + ε).

This holds for all ε > 0. Take the limit as ε −→ 0 and use the fact that FX is continuous at x to conclude that

lim_{n→∞} FXn(x) = FX(x).

Note that the reverse implications do not hold. Convergence in probability does not imply convergence in quadratic mean. Also, convergence in distribution does not imply convergence in probability.


CHAPTER 7

STATISTICAL ANALYSIS

7.1 STATISTICAL SUMMARIES

Definition 7.1.1 A collection of independent, identically distributed random variables X1, ..., Xn, each of which has distribution defined by cdf FX (or mass/density function fX), is a random sample of size n from FX (or fX).

Definition 7.1.2 A function, T, of a random sample, X1, ..., Xn, that is, T = t(X1, ..., Xn), that depends only on X1, ..., Xn, is a statistic. A statistic is a random variable. For example, the sample mean

X̄ = (X1 + X2 + ... + Xn)/n

is a statistic. A statistic T need not necessarily be constructed from a random sample (the random variables need not be independent, identically distributed), but that is the case encountered most often. In many circumstances it is necessary to consider statistics constructed from a collection of independent, but not identically distributed, random variables.

7.2 SAMPLING DISTRIBUTIONS

Definition 7.2.1 If X1, ..., Xn is a random sample from FX, say, and T = t(X1, ..., Xn) is a statistic, then FT (or fT), the cdf (or mass/density function) of the random variable T, is the sampling distribution of T. This notion extends immediately to the case of a statistic T constructed from a general collection of random variables X1, ..., Xn.

EXAMPLE: If X1, ..., Xn are independent random variables, with Xi ∼ N(μi, σi²) for i = 1, ..., n, and a1, ..., an are constants, consider the distribution of the random variable Y defined by

Y = Σ_{i=1}^{n} ai Xi.

Using standard mgf results, the distribution of Y is derived to be normal with parameters

μY = Σ_{i=1}^{n} ai μi,   σY² = Σ_{i=1}^{n} ai² σi².

Now consider the special case of this result when X1, ..., Xn are independent, identically distributed with μi = μ and σi² = σ², and where ai = 1/n for i = 1, ..., n. Then

Y = Σ_{i=1}^{n} (1/n) Xi = X̄ ∼ N(μ, σ²/n).


Definition 7.2.2 For a random sample X1, ..., Xn from a probability distribution, the sample variance, S², is the statistic defined by

S² = (1/(n − 1)) Σ_{i=1}^{n} (Xi − X̄)².

Theorem 7.2.1 SAMPLING DISTRIBUTION FOR NORMAL SAMPLES
If X1, ..., Xn is a random sample from a normal distribution, say Xi ∼ N(μ, σ²), then:

(a) X̄ is independent of {Xi − X̄, i = 1, ..., n};
(b) X̄ and S² are independent random variables;
(c) the random variable

(n − 1)S²/σ² = Σ_{i=1}^{n} (Xi − X̄)²/σ²

has a chi-squared distribution with n − 1 degrees of freedom.

Proof. Omitted here.

Theorem 7.2.2 Suppose that X1, ..., Xn is a random sample from a normal distribution, say Xi ∼ N(μ, σ²). Then the random variable

T = (X̄ − μ) / (S/√n)

has a Student-t distribution with n − 1 degrees of freedom.

Proof. Consider the random variables

Z = √n(X̄ − μ)/σ ∼ N(0, 1),

V = (n − 1)S²/σ² ∼ χ²_{n−1},

and

T = Z / √(V/(n − 1)),

and use the properties of the normal distribution and related random variables (NOTE 6, following Definition 4.1.5). Also, see EXERCISES 5, Q4 (b).


7.3 HYPOTHESIS TESTING

7.3.1 TESTING FOR NORMAL SAMPLES - THE Z-TEST

We concentrate initially on random data samples that we can assume to have a normal distribution, and utilize the Theorem from the previous section. We will look at two situations, namely one sample and two sample experiments. So, we suppose that X1, ..., Xn ∼ N(μ, σ²) (one sample), and X1, ..., Xn ∼ N(μX, σX²), Y1, ..., Yn ∼ N(μY, σY²) (two sample): in the latter case we assume also independence of the two samples.

• ONE SAMPLE Possible tests of interest are: μ = μ0, σ = σ0, for some specified constants μ0 and σ0.

• TWO SAMPLE Possible tests of interest are: μX = μY , σX = σY .

Recall from Theorem 7.2.1 that, if X1, ..., Xn ∼ N(μ, σ²) are the i.i.d. outcome random variables of n experimental trials, then

X̄ ∼ N(μ, σ²/n)   and   (n − 1)S²/σ² ∼ χ²_{n−1},

with X̄ and S² statistically independent. Suppose we want to test the hypothesis that μ = μ0, for some specified constant μ0 (for example, μ0 = 20.0), is a plausible model. More specifically, we want to test

H0 : μ = μ0, the NULL hypothesis, against

H1 : μ ≠ μ0, the ALTERNATIVE hypothesis.

So, we want to test whether H0 is true, or whether H1 is true. In the case of a normal sample, the distribution of X̄ is normal, and

X̄ ∼ N(μ, σ²/n) =⇒ Z = (X̄ − μ)/(σ/√n) ∼ N(0, 1),

where Z is a random variable. Now, when we have observed the data sample, we can calculate x̄, and therefore we have a way of testing whether μ = μ0 is a plausible model; we calculate x̄ from x1, ..., xn, and then calculate

z = (x̄ − μ0)/(σ/√n).

If H0 is true, and μ = μ0, then the observed z should be an observation from an N(0, 1) distribution (as Z ∼ N(0, 1)), that is, it should be near zero with high probability. In fact, z should lie between −1.96 and 1.96 with probability 1 − α = 0.95, say, as

P[−1.96 ≤ Z < 1.96] = Φ(1.96) − Φ(−1.96) = 0.975 − 0.025 = 0.95.

If we observe z to be outside of this range, then there is evidence that H0 is not true.
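The one-sample Z-test can be sketched in code. This is a minimal illustration, not part of the original notes: the data, μ0 = 20.0 and σ = 0.5 are hypothetical, and only the Python standard library is assumed.

```python
from statistics import NormalDist, mean

def z_test_one_sample(xs, mu0, sigma, alpha=0.05):
    """Two-sided one-sample Z-test of H0: mu = mu0, sigma known.

    Returns the observed statistic z = (xbar - mu0)/(sigma/sqrt(n))
    and whether z falls in the critical region |z| > CR, where
    Phi(CR) = 1 - alpha/2 (CR = 1.96 for alpha = 0.05)."""
    n = len(xs)
    z = (mean(xs) - mu0) / (sigma / n ** 0.5)
    cr = NormalDist().inv_cdf(1 - alpha / 2)
    return z, abs(z) > cr

# Hypothetical sample; H0: mu = 20.0 with sigma = 0.5 assumed known.
xs = [19.2, 20.5, 20.1, 19.8, 20.9, 19.5, 20.3, 19.9]
z, reject = z_test_one_sample(xs, mu0=20.0, sigma=0.5)
```

For this sample z ≈ 0.14, well inside (−1.96, 1.96), so H0 is not rejected.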

So, basically, if we observe an extreme value of z, either H0 is true but we have observed a rare event, or we prefer to disbelieve H0 and conclude that the data contain evidence against H0. Notice the asymmetry between H0 and H1. The null hypothesis is ‘conservative’, reflecting perhaps a current state of belief, and we are testing whether the data are consistent with that hypothesis, only rejecting H0 in favour of the alternative H1 if the evidence is clear, i.e. when the data represent a rare event under H0.

Figure 7.1: CRITICAL REGION IN A Z-TEST (taken from Schaum’s ELEMENTS OF STATISTICS II, Bernstein & Bernstein).

As an alternative approach, we could calculate the probability p of observing a z value more extreme than the z we did observe; this probability is given by

p = 2Φ(z) if z < 0,   p = 2(1 − Φ(z)) if z ≥ 0.

If this p is very small, say p ≤ α = 0.05, then again there is evidence that H0 is not true. This approach is called significance testing.
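The two-sided p-value formula above can be sketched directly, using the standard normal CDF Φ from the Python standard library (an illustrative sketch, not part of the original notes):

```python
from statistics import NormalDist

def two_sided_p(z):
    """p = 2*Phi(z) for z < 0, p = 2*(1 - Phi(z)) for z >= 0,
    where Phi is the standard normal CDF."""
    phi = NormalDist().cdf(z)
    return 2 * phi if z < 0 else 2 * (1 - phi)

# By construction, z on the 0.975 quantile gives p close to 0.05.
p = two_sided_p(1.96)
```

By symmetry of the normal density, two_sided_p(−z) equals two_sided_p(z).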

In summary, we need to assess whether z is a surprising observation from an N(0, 1) distribution - if it is, then we reject H0. Figure 7.1 depicts the “critical region” in a Z-test.

7.3.2 HYPOTHESIS TESTING TERMINOLOGY

There are five crucial components to a hypothesis test, namely:

• the TEST STATISTIC;

• the NULL DISTRIBUTION of the statistic;

• the SIGNIFICANCE LEVEL of the test, usually denoted by α;

• the P-VALUE, denoted p;

• CRITICAL VALUE(S) of the test.


In the Normal example given above, we have that:

• z is the test statistic;

• The distribution of random variable Z if H0 is true is the null distribution;

• α = 0.05 is the significance level of the test (choosing α = 0.01 gives a “stronger” test);

• p is the p-value of the test statistic under the null distribution;

• the solution CR of Φ(CR) = 1− α/2 gives the critical values of the test ±CR. These critical values define the boundary of a critical region: if the value z is in the critical region we reject H0.

7.3.3 THE t-TEST

In practice, we will often want to test hypotheses about μ when σ is unknown. We cannot perform the Z-test, as this requires knowledge of σ to calculate the z statistic. Recall that we know the sampling distributions of X̄ and S², and that the two estimators are statistically independent. Now, from the properties of the Normal distribution, if we have independent random variables Z ∼ N(0, 1) and Y ∼ χ²_ν, then we know that the random variable T defined by

T = Z/√(Y/ν)

has a Student-t distribution with ν degrees of freedom. Using this result, and recalling the sampling distributions of X̄ and S², we see that

T = ((X̄ − μ)/(σ/√n)) / √( ((n−1)S²/σ²) / (n−1) ) = (X̄ − μ)/(S/√n) ∼ t_{n−1} :

T has a Student-t distribution with n − 1 degrees of freedom, denoted St(n − 1), which does not depend on σ². Thus we can repeat the procedure used in the σ known case, but use the sampling distribution of T rather than that of Z to assess whether the value of the test statistic is “surprising” or not. Specifically, we calculate the observed value

t = (x̄ − μ)/(s/√n)

and find the critical values for an α = 0.05 test by finding the ordinates corresponding to the 0.025 and 0.975 percentiles of a Student-t distribution, St(n − 1) (rather than an N(0, 1) distribution).
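As a minimal sketch of the computation (hypothetical data; `stdev` is the sample standard deviation s, with divisor n − 1), the observed t statistic for testing H0 : μ = μ0 is:

```python
from statistics import mean, stdev

def t_statistic(xs, mu0):
    """Observed t = (xbar - mu0)/(s/sqrt(n)); under H0: mu = mu0 it is
    an observation from St(n-1).  Critical values come from t tables."""
    return (mean(xs) - mu0) / (stdev(xs) / len(xs) ** 0.5)

# Same hypothetical sample as before, sigma now treated as unknown.
t = t_statistic([19.2, 20.5, 20.1, 19.8, 20.9, 19.5, 20.3, 19.9], 20.0)
```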

7.3.4 TEST FOR σ

The Z-test and t-test are both tests for the parameter μ. To perform a test about σ, say

H0 : σ = σ0,

H1 : σ ≠ σ0,


we construct a test based on the estimate of variance, S². In particular, we saw from Theorem 7.2.1 that the random variable Q, defined by

Q = (n−1)S²/σ² ∼ χ²_{n−1},

if the data have an N(μ, σ²) distribution. Hence if we define the test statistic value q by

q = (n−1)s²/σ0²

then we can compare q with the critical values derived from a χ²_{n−1} distribution; we look for the 0.025 and 0.975 quantiles - note that the chi-squared distribution is not symmetric, so we need two distinct critical values.
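The statistic q can be sketched as follows (hypothetical data and σ0; `variance` is the sample variance s², with divisor n − 1):

```python
from statistics import variance

def chi2_statistic(xs, sigma0):
    """Observed q = (n-1)*s^2/sigma0^2, to be compared with the 0.025
    and 0.975 quantiles of chi-squared with n-1 degrees of freedom."""
    n = len(xs)
    return (n - 1) * variance(xs) / sigma0 ** 2

# Hypothetical sample of size 8, testing H0: sigma = 0.5.
q = chi2_statistic([19.2, 20.5, 20.1, 19.8, 20.9, 19.5, 20.3, 19.9], 0.5)
```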

7.3.5 TWO SAMPLE TESTS

It is straightforward to extend the ideas from the previous sections to two sample situations where we wish to compare the distributions underlying two data samples. Typically, we consider sample one, x1, ..., x_{nX}, from a N(μX, σX²) distribution, and sample two, y1, ..., y_{nY}, independently from a N(μY, σY²) distribution, and test the equality of the parameters in the two models. Suppose that the sample mean and sample variance for samples one and two are denoted (x̄, sX²) and (ȳ, sY²) respectively.

1. First, consider the hypothesis testing problem defined by

H0 : μX = μY,   H1 : μX ≠ μY,

when σX = σY = σ is known, so the two samples come from normal distributions with the same, known, variance. Now, from the sampling distributions theorem we have, under H0,

X̄ ∼ N(μX, σ²/nX),   Ȳ ∼ N(μY, σ²/nY)   =⇒   X̄ − Ȳ ∼ N(0, σ²/nX + σ²/nY),

since X̄ and Ȳ are independent. Hence by the properties of normal random variables

Z = (X̄ − Ȳ) / ( σ√(1/nX + 1/nY) ) ∼ N(0, 1),

if H0 is true, giving us a test statistic z defined by

z = (x̄ − ȳ) / ( σ√(1/nX + 1/nY) ),

which we can compare with the standard normal distribution. If z is a surprising observation from N(0, 1), and lies in the critical region, then we reject H0. This procedure is the Two Sample Z-Test.
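A minimal sketch of the two sample Z statistic (hypothetical samples; the common σ is assumed known):

```python
from statistics import mean

def z_two_sample(xs, ys, sigma):
    """Two-sample Z statistic z = (xbar - ybar)/(sigma*sqrt(1/nX + 1/nY))
    for independent normal samples with common known sigma."""
    return (mean(xs) - mean(ys)) / (sigma * (1 / len(xs) + 1 / len(ys)) ** 0.5)

# Hypothetical samples with known common sigma = 1.
z = z_two_sample([1.0, 2.0, 3.0], [2.0, 3.0, 4.0], sigma=1.0)
```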


2. If we can assume that σX = σY, but the common value, σ, say, is unknown, we parallel the one sample t-test by replacing σ by an estimate in the two sample Z-test. First, we obtain an estimate of σ by “pooling” the two samples; our estimate is the pooled estimate, sP², defined by

sP² = ( (nX − 1)sX² + (nY − 1)sY² ) / (nX + nY − 2),

which we then use to form the test statistic t defined by

t = (x̄ − ȳ) / ( sP√(1/nX + 1/nY) ).

It can be shown that if H0 is true then t should be an observation from a Student-t distribution with nX + nY − 2 degrees of freedom. Hence we can derive the critical values from the tables of the Student-t distribution.
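The pooled two sample t statistic can be sketched as follows (hypothetical samples; `variance` is the sample variance with divisor n − 1):

```python
from statistics import mean, variance

def pooled_t(xs, ys):
    """Two-sample t statistic using the pooled variance estimate
    s_P^2 = ((nX-1)s_X^2 + (nY-1)s_Y^2)/(nX + nY - 2); under
    H0: mu_X = mu_Y it follows St(nX + nY - 2)."""
    nx, ny = len(xs), len(ys)
    sp2 = ((nx - 1) * variance(xs) + (ny - 1) * variance(ys)) / (nx + ny - 2)
    return (mean(xs) - mean(ys)) / (sp2 * (1 / nx + 1 / ny)) ** 0.5

t = pooled_t([1.0, 2.0, 3.0], [2.0, 3.0, 4.0])
```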

3. If σX ≠ σY, but both parameters are known, we can use a similar approach to the one above to derive a test statistic z defined by

z = (x̄ − ȳ) / √( σX²/nX + σY²/nY ),

which has an N(0, 1) distribution if H0 is true.

4. If σX ≠ σY, but both parameters are unknown, we can use a similar approach to the one above to derive a test statistic t defined by

t = (x̄ − ȳ) / √( sX²/nX + sY²/nY ).

The distribution of this statistic when H0 is true is not analytically available, but can be adequately approximated by a Student St(m) distribution, where

m = (wX + wY)² / ( wX²/(nX − 1) + wY²/(nY − 1) ),

with

wX = sX²/nX,   wY = sY²/nY.
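The approximate degrees of freedom m can be sketched as follows (hypothetical samples; with equal sample sizes and equal sample variances the formula recovers m = nX + nY − 2):

```python
from statistics import variance

def welch_df(xs, ys):
    """Approximate degrees of freedom m for the unequal-variance
    two-sample t statistic, with w_X = s_X^2/nX, w_Y = s_Y^2/nY."""
    wx = variance(xs) / len(xs)
    wy = variance(ys) / len(ys)
    return (wx + wy) ** 2 / (wx ** 2 / (len(xs) - 1) + wy ** 2 / (len(ys) - 1))

# Equal sample sizes and variances: m = nX + nY - 2 = 4 here.
m = welch_df([1.0, 2.0, 3.0], [2.0, 3.0, 4.0])
```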

Clearly, the choice of test we use depends on whether σX = σY or not. We may test this hypothesis formally; to test

H0 : σX = σY,   H1 : σX ≠ σY,

we compute the test statistic

Q = SX²/SY²,


which has as null distribution the F distribution with (nX − 1, nY − 1) degrees of freedom. This distribution can be denoted F(nX − 1, nY − 1), and its quantiles are tabulated. Hence we can look up the 0.025 and 0.975 quantiles of this distribution (the F distribution is not symmetric), and hence define the critical region. Informally, if the test statistic value q is very small or very large, then it is a surprising observation from the F distribution and hence we reject the hypothesis of equal variances.
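The observed variance ratio can be sketched as follows (hypothetical samples; the second sample is constructed with twice the spread of the first, so q = 1/4):

```python
from statistics import variance

def f_statistic(xs, ys):
    """Observed q = s_X^2/s_Y^2, compared with the 0.025 and 0.975
    quantiles of F(nX - 1, nY - 1) from tables."""
    return variance(xs) / variance(ys)

# ys = 2*xs, so s_Y^2 = 4*s_X^2 and q = 0.25 (hypothetical data).
q = f_statistic([1.0, 2.0, 3.0, 4.0], [2.0, 4.0, 6.0, 8.0])
```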

HYPOTHESIS TESTING SUMMARY In general, to test a hypothesis H0, we consider a statistic calculated from the sample data. We derive mathematically the probability distribution of the statistic, considered as a random variable, when the hypothesis H0 is true, and compare the actual observed value of the statistic computed from the data sample with the hypothetical probability distribution. We ask the question “Is the value a likely observation from this probability distribution?” If the answer is “No”, then we reject the hypothesis; otherwise we accept it.

7.4 POINT ESTIMATION

Definition 7.4.1 Let X1, ..., Xn be a random sample from a distribution with mass/density function fX that depends on a (possibly vector) parameter θ. Then fX1(x1) = fX(x1; θ), so that

fX1,...,Xk(x1, ..., xk) = ∏_{i=1}^{k} fX(xi; θ).

A statistic T = t(X1, ..., Xn) that is used to represent or estimate a function τ(θ) of θ based on an observed sample x1, ..., xn of the random variables is an estimator, and t = t(x1, ..., xn) is an estimate, τ̂(θ), of τ(θ). The estimator T = t(X1, . . . , Xn) is said to be unbiased if E(T) = τ(θ); otherwise T is biased.

7.4.1 ESTIMATION TECHNIQUES I: METHOD OF MOMENTS

Suppose that X1, ..., Xn is a random sample from a probability distribution with mass/density function fX that depends on a vector parameter θ of dimension k, and suppose that a sample x1, ..., xn has been observed. Let the rth moment of fX be denoted μr, and let the rth sample moment, denoted mr, be defined for r = 1, 2, . . . by

mr = (1/n) ∑_{i=1}^{n} xi^r.

Then mr is an estimate of μr, and

Mr = (1/n) ∑_{i=1}^{n} Xi^r

is an estimator of μr.

PROCEDURE: The method of moments technique of estimation involves matching the theoretical moments μr ≡ μr(θ) to the sample moments mr, r = 1, 2, . . . , l, for suitable l, and solving for θ. In most situations taking l = k, the dimension of θ, suffices: we obtain k equations in the k elements of vector θ which may be solved simultaneously to find the parameter estimates.


We may, however, need l > k. Intuitively, and recalling the Weak Law of Large Numbers, it is reasonable to suppose that there is a close relationship between the theoretical properties of a probability distribution and estimates derived from a large sample. For example, we know that, for large n, the sample mean converges in probability to the theoretical expectation.
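As a worked sketch of the procedure (an illustrative example, not from the notes): for N(μ, σ²) the first two theoretical moments are μ1 = μ and μ2 = σ² + μ², so matching them to m1 and m2 gives μ̂ = m1 and σ̂² = m2 − m1².

```python
def mom_normal(xs):
    """Method-of-moments estimates for N(mu, sigma^2): matching
    mu_1 = mu and mu_2 = sigma^2 + mu^2 to the sample moments gives
    mu_hat = m1 and sigma2_hat = m2 - m1^2."""
    n = len(xs)
    m1 = sum(xs) / n
    m2 = sum(x * x for x in xs) / n
    return m1, m2 - m1 * m1

# Tiny hypothetical sample.
mu_hat, sigma2_hat = mom_normal([1.0, 2.0, 3.0])
```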

7.4.2 ESTIMATION TECHNIQUES II: MAXIMUM LIKELIHOOD

Definition 7.4.2 Let random variables X1, ..., Xn have joint mass or density function, denoted fX1,...,Xn, that depends on a vector parameter θ = (θ1, ..., θk). Then the joint mass/density function considered as a function of θ for the (fixed) observed values x1, ..., xn of the variables is the likelihood function, L(θ):

L(θ) = fX1,...,Xn(x1, ..., xn; θ).

If X1, ..., Xn represents a random sample from mass/density function fX, then

L(θ) = ∏_{i=1}^{n} fX(xi; θ).

Definition 7.4.3 Let L(θ) be the likelihood function derived from the joint mass/density function of random variables X1, ..., Xn, where θ ∈ Θ ⊆ Rk, say, and Θ is termed the parameter space. Then for a fixed set of observed values x1, ..., xn of the variables, the estimate of θ termed the maximum likelihood estimate (MLE) of θ, θ̂, is defined by

θ̂ = arg max_{θ∈Θ} L(θ).

That is, the maximum likelihood estimate is the value of θ for which L(θ) is maximized in the parameter space Θ.

DISCUSSION: The method of estimation involves finding the value of θ for which L(θ) is maximized. This is generally done by setting the first partial derivatives of L(θ) with respect to θj equal to zero, for j = 1, ..., k, and solving the resulting k simultaneous equations. But we must be alert to cases where the likelihood function L(θ) is not differentiable, or where the maximum occurs on the boundary of Θ! Typically, it is easier to obtain the MLE by maximising the (natural) logarithm of L(θ): we maximise l(θ) = logL(θ), the log-likelihood.

THE FOUR STEP ESTIMATION PROCEDURE: Suppose a sample x1, ..., xn has been obtained from a probability model specified by mass or density function fX(x; θ) depending on parameter(s) θ lying in parameter space Θ. The maximum likelihood estimate is produced as follows:

1. Write down the likelihood function, L(θ).

2. Take the natural log of the likelihood, collect terms involving θ.

3. Find the value of θ ∈ Θ, θ̂, for which logL(θ) is maximized, for example by differentiation. Note that, if the parameter space Θ is a bounded interval, then the maximum likelihood estimate may lie on the boundary of Θ. If the parameter is a k vector, the maximization involves evaluation of partial derivatives.


4. Check that the estimate θ̂ obtained in STEP 3 truly corresponds to a maximum in the (log) likelihood function by inspecting the second derivative of logL(θ) with respect to θ. In the single parameter case, if the second derivative of the log-likelihood is negative at θ = θ̂, then θ̂ is confirmed as the MLE of θ (other techniques may be used to verify that the likelihood is maximized at θ̂).
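The four steps can be sketched for a standard example not taken from these notes: an Exponential(λ) sample, with l(λ) = n log λ − λ∑xi, so l′(λ) = 0 gives λ̂ = n/∑xi = 1/x̄, and l″(λ) = −n/λ² < 0 confirms a maximum. A numerical check on hypothetical data:

```python
import math

def exp_loglik(lam, xs):
    """Log-likelihood l(lambda) = n*log(lambda) - lambda*sum(x_i)
    for a sample from an Exponential(lambda) density."""
    return len(xs) * math.log(lam) - lam * sum(xs)

xs = [0.5, 1.2, 0.8, 2.1, 0.4]      # hypothetical data
lam_hat = len(xs) / sum(xs)         # STEP 3: solve l'(lambda) = 0
# STEP 4 check: l is lower at neighbouring values, so lam_hat is a maximum
assert exp_loglik(lam_hat, xs) > exp_loglik(0.9 * lam_hat, xs)
assert exp_loglik(lam_hat, xs) > exp_loglik(1.1 * lam_hat, xs)
```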

7.5 INTERVAL ESTIMATION

The techniques of the previous section provide just a single point estimate of the value of an unknown parameter. Instead, we can attempt to provide a set or interval of values which expresses our uncertainty over the unknown value.

Definition 7.5.1 Let X = (X1, . . . , Xn) be a random sample from a distribution depending on an unknown scalar parameter θ. Let T1 = l1(X1, . . . , Xn) and T2 = l2(X1, . . . , Xn) be two statistics satisfying T1 ≤ T2 for which P (T1 < θ < T2) = 1− α, where P denotes probability when X has distribution specified by parameter value θ, whatever the true value of θ, and where 1− α does not depend on θ. Then the random interval (T1, T2) is called a 1− α confidence interval for θ, and T1 and T2 are called lower and upper confidence limits, respectively.

Given data, x = (x1, . . . , xn), the realised value of X, we calculate the interval (t1, t2), where t1 = l1(x1, . . . , xn) and t2 = l2(x1, . . . , xn). NOTE: θ here is fixed (it has some true, fixed but unknown value in Θ), and it is T1, T2 that are random. The calculated interval (t1, t2) either does, or does not, contain the true value of θ. Under repeated sampling of X, the random interval (T1, T2) contains the true value a proportion 1− α of the time. Typically, we will take α = 0.05 or α = 0.01, corresponding to a 95% or 99% confidence interval.

If θ is a vector, then we use a confidence set, such as a sphere or ellipse, instead of an interval. So, C(X) is a 1− α confidence set for θ if P (θ ∈ C(X)) = 1− α, for all possible θ.

The key problem is to develop methods for constructing a confidence interval. There are twogeneral procedures.

7.5.1 PIVOTAL QUANTITY

Definition 7.5.2 Let X = (X1, . . . , Xn) be a random sample from a distribution specified by (scalar) parameter θ. Let Q = q(X; θ) be a function of X and θ. If Q has a distribution that does not depend on θ, then Q is defined to be a pivotal quantity.

If Q = q(X; θ) is a pivotal quantity and has a probability density function, then for any fixed 0 < α < 1, there will exist q1 and q2, depending on α, such that P (q1 < Q < q2) = 1− α. Then, if for each possible sample value x = (x1, . . . , xn), q1 < q(x1, . . . , xn; θ) < q2 if and only if l1(x1, . . . , xn) < θ < l2(x1, . . . , xn), for functions l1 and l2 not depending on θ, then (T1, T2) is a 1− α confidence interval for θ, where Ti = li(X1, . . . , Xn), i = 1, 2.


7.5.2 INVERTING A TEST STATISTIC

A second method utilises a correspondence between hypothesis testing and interval estimation.

Definition 7.5.3 For each possible value θ0, let A(θ0) be the acceptance region of a test of H0 : θ = θ0, of significance level α, so that A(θ0) is the set of data values x such that H0 is accepted in a test of significance level α. For each x, define a set C(x) by

C(x) = {θ0 : x ∈ A(θ0)}.

Then the random set C(X) is a 1− α confidence set.

Example. Let X = (X1, . . . , Xn), with the Xi IID N(μ, σ²), with σ² known. Both the above procedures yield a 1− α confidence interval for μ of the form

(X̄ − z_{α/2}σ/√n, X̄ + z_{α/2}σ/√n),

where Φ(z_{α/2}) = 1− α/2.

We know that

(X̄ − μ)/(σ/√n) ∼ N(0, 1),

and so this is a pivotal quantity. We have

P(−z_{α/2} < (X̄ − μ)/(σ/√n) < z_{α/2}) = 1− α.

Rearranging gives

P(X̄ − z_{α/2}σ/√n < μ < X̄ + z_{α/2}σ/√n) = 1− α,

which defines a 1− α confidence interval.
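The pivotal-quantity interval can be sketched numerically (hypothetical sample; σ assumed known; `inv_cdf` returns z_{α/2} with Φ(z_{α/2}) = 1 − α/2):

```python
from statistics import NormalDist, mean

def normal_ci(xs, sigma, alpha=0.05):
    """1 - alpha confidence interval for mu, sigma known:
    (xbar - z*sigma/sqrt(n), xbar + z*sigma/sqrt(n)),
    where z solves Phi(z) = 1 - alpha/2."""
    z = NormalDist().inv_cdf(1 - alpha / 2)
    half = z * sigma / len(xs) ** 0.5
    xbar = mean(xs)
    return xbar - half, xbar + half

# Hypothetical sample of size 4 with sigma = 1 assumed known.
lo, hi = normal_ci([1.0, 2.0, 2.0, 3.0], sigma=1.0)
```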

If we test H0 : μ = μ0 against H1 : μ ≠ μ0, a reasonable test has rejection region of the form {x : |x̄ − μ0| > z_{α/2}σ/√n}. So, H0 is accepted for sample points with |x̄ − μ0| < z_{α/2}σ/√n, or,

x̄ − z_{α/2}σ/√n < μ0 < x̄ + z_{α/2}σ/√n.

The test is constructed to have size (significance level) α, so P(H0 is accepted | μ = μ0) = 1− α. So we can write

P(X̄ − z_{α/2}σ/√n < μ0 < X̄ + z_{α/2}σ/√n | μ = μ0) = 1− α.

This is true for every μ0, so

P(X̄ − z_{α/2}σ/√n < μ < X̄ + z_{α/2}σ/√n) = 1− α

is true for every μ, so the confidence interval obtained by inverting the test is the same as that derived by the pivotal quantity method.