Semester 1, 2012 (Last adjustments: September 5, 2012)
Lecture Notes
STAT3914 – Applied Statistics
Lecturer
Dr. John T. Ormerod
School of Mathematics & Statistics F07
University of Sydney
(w) 02 9351 5883
(e) john.ormerod (at) sydney.edu.au
STAT3914 – Outline
• Multivariate Normal Distribution
• Point Estimates and Confidence Intervals for MVN
• Wishart Distribution
• Derivation of Hotelling's T²
• The Expectation Maximization Algorithm
• Missing Data Analysis
STAT3914: Lecture 3 2
The standard univariate normal distribution
• A random variable (RV) Z has a standard normal distribution, N(0, 1), if Z has density

    ϕ(z) = (2π)^{-1/2} e^{-z²/2},   −∞ < z < ∞.

• Z has moment generating function

    E(e^{tZ}) = ∫_{−∞}^{∞} e^{tz} (2π)^{-1/2} e^{-z²/2} dz
              = ∫_{−∞}^{∞} (2π)^{-1/2} e^{-(z−t)²/2} · e^{t²/2} dz
              = e^{t²/2}.

• It follows that E[Z] = 0 and Var[Z] = 1.
STAT3914: Lecture 3 3
The standard multivariate normal distribution
• Let U = (U_1, U_2, …, U_p)^T be a p-vector of NID(0, 1) RVs.

• The joint density of U is given by

    ∏_{i=1}^{p} (2π)^{-1/2} e^{-u_i²/2} = (2π)^{-p/2} exp(−½ u^T u).

• Clearly,

    E[U] = 0,

  where 0 is the zero vector in R^p.

• Similarly,

    Cov[U] = E[UU^T] = I,

  where I = I_p is the p × p identity matrix.

• We say U has a standard multivariate normal distribution, which we denote U ∼ N_p(0, I).
STAT3914: Lecture 3 4
Non-standard univariate normal distribution
• Given a RV Z ∼ N(0, 1) we can generate a non-standard normal with mean µ ∈ R and standard deviation σ > 0 by defining X = µ + σZ.

• The density of X can be readily recovered from the formula for the density of transformed RVs.

• Let ψ(z) = µ + σz; then X = ψ(Z) and ψ^{-1}(x) = (x − µ)/σ, so

    f_X(x) = f_Z(ψ^{-1}(x)) |d/dx ψ^{-1}(x)|
           = f_Z((x − µ)/σ) · |1/σ|
           = (2πσ²)^{-1/2} exp(−(x − µ)²/(2σ²)).
STAT3914: Lecture 3 5
Non-standard univariate normal distribution
• The MGF of X is given by

    E[e^{tX}] = e^{tµ} E[e^{tσZ}] = exp(tµ + ½ σ²t²).

• An analogous transformation creates the general multivariate normal random vector; in the univariate case, if Z ∼ N(0, 1) then

    X = µ + σZ ∼ N(µ, σ²).
STAT3914: Lecture 3 6
Non-standard multivariate normal distribution
• Let U ∼ N_p(0, I) and define

    X = ψ(U) ≡ µ + AU,

  where µ ∈ R^p and A is a p × p non-singular matrix.

• Clearly,

    E[X] = µ,

  and its covariance is given by

    Cov[X] = E[(X − µ)(X − µ)^T]
           = E[AUU^T A^T]
           = A E[UU^T] A^T
           = AA^T
           ≡ Σ.
STAT3914: Lecture 3 7
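The construction X = µ + AU above is exactly how multivariate normal draws are generated in practice, with A taken to be the Cholesky factor of Σ. A minimal numpy sketch (the dimensions and values below are illustrative, not from the notes):

```python
import numpy as np

rng = np.random.default_rng(0)
mu = np.array([1.0, -2.0, 0.5])
Sigma = np.array([[2.0, 0.5, 0.3],
                  [0.5, 1.0, 0.2],
                  [0.3, 0.2, 1.5]])

A = np.linalg.cholesky(Sigma)          # A satisfies A A^T = Sigma
U = rng.standard_normal((100_000, 3))  # rows are N_3(0, I) draws
X = mu + U @ A.T                       # rows are N_3(mu, Sigma) draws

print(X.mean(axis=0))                  # approximately mu
print(np.cov(X, rowvar=False))         # approximately Sigma
```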
Non-standard multivariate normal distribution
Suppose AA^T = Σ; then:

• A is a kind of “square root” of Σ (one valid choice is the symmetric square root A = Σ^{1/2}) and

    |Σ| = |AA^T| = |A| · |A^T| = |A|² > 0.

• Claim: If U ∼ N_p(0, I) and X = ψ(U) ≡ µ + AU then X has a non-singular multivariate normal distribution, which we will denote

    X ∼ N_p(µ, Σ),

  and X has density

    f_X(x) = (2π)^{-p/2} |Σ|^{-1/2} exp(−½ (x − µ)^T Σ^{-1} (x − µ)).

• Proof: Similarly to the univariate case, with ψ^{-1}(x) = A^{-1}(x − µ),

    f_X(x) = f_U(ψ^{-1}(x)) |J_{ψ^{-1}}|
           = (2π)^{-p/2} |J_ψ|^{-1} exp(−½ [A^{-1}(x − µ)]^T [A^{-1}(x − µ)]).

STAT3914: Lecture 3 8
• The Jacobian of a linear transformation is the determinant of its matrix, so

    |J_ψ| = | ∂ψ_i/∂u_j |_{i,j=1,…,p} = |A| = |Σ|^{1/2}.

• The exponent simplifies as follows:

    [A^{-1}(x − µ)]^T [A^{-1}(x − µ)] = (x − µ)^T (A^{-1})^T (A^{-1}) (x − µ)
                                      = (x − µ)^T Σ^{-1} (x − µ),

  as (A^{-1})^T (A^{-1}) = (AA^T)^{-1} = Σ^{-1}.

• Thus, X has density

    f_X(x) = (2π)^{-p/2} |Σ|^{-1/2} exp(−½ (x − µ)^T Σ^{-1} (x − µ)).
STAT3914: Lecture 3 9
The MGF of the multivariate normal distribution
• Claim: If X ∼ N_p(µ, Σ) then the moment generating function (MGF) of X is given by

    M_X(s) = exp(s^T µ + ½ s^T Σ s).

• Proof: For s ∈ R^p we have

    M_X(s) = E[e^{s^T X}]
           = E[exp(s^T µ + s^T AU)]
           = exp(s^T µ) · E[e^{a^T U}],   where a^T = s^T A,
           = exp(s^T µ) · exp(½ a^T a),   since a^T U ∼ N(0, a^T a),
           = exp(s^T µ + ½ s^T Σ s).

• This expression for the MGF is well defined even when Σ is singular. While it would be awkward to define the multivariate normal distribution via its MGF, it does suggest the following definition...
STAT3914: Lecture 3 10
The (general) multivariate normal distribution
• Definition. A p-dimensional random vector X has a multivariate normal distribution if for every a ∈ R^p the linear combination a^T X is univariate normal.

• Claim. X is a p-dimensional multivariate normal random vector if and only if its MGF can be expressed as

    M_X(s) = exp(s^T µ + ½ s^T Σ s).

• Aside: if M_X is as above then, by differentiating the MGF, it follows that E[X] = µ and Cov(X) = Σ.
STAT3914: Lecture 3 11
The (general) multivariate normal distribution
• Proof. Suppose the above expression for the MGF holds. Then the MGF of Y = a^T X is given by

    M_Y(t) = E[e^{t a^T X}] = M_X(ta) = exp(t a^T µ + ½ t² a^T Σ a),

  so that

    a^T X ∼ N(a^T µ, a^T Σ a)

  by the uniqueness theorem for MGFs.
STAT3914: Lecture 3 12
Conversely, if Y_s = s^T X is univariate normal for all s ∈ R^p then

    E[e^{Y_s}] = exp(E[Y_s] + ½ Var[Y_s]).

However, E[Y_s] = s^T E[X] = s^T µ and

    Var[Y_s] = E[(s^T X − s^T µ)(s^T X − s^T µ)^T]
             = s^T [E(X − µ)(X − µ)^T] s
             = s^T Cov(X) s.

Therefore, with Σ = Cov(X),

    M_X(s) = E[e^{Y_s}] = exp(s^T µ + ½ s^T Σ s).
STAT3914: Lecture 3 13
Marginal Distributions
• If X ∼ N_p(µ, Σ) then for all a ∈ R^p,

    a^T X ∼ N(a^T µ, a^T Σ a).

• So by choosing e_i = (0, …, 0, 1, 0, …, 0)^T, with the 1 in the ith position, we have

    X_i ∼ N(µ_i, Σ_ii).

• Thus all marginal distributions are normal.

• The converse is not true (see exercises).

• Moreover, if X1 = (X_1, …, X_r)^T consists of the first r coordinates of X, then X1 is an r-dimensional normal RV (why?):

    X1 ∼ N_r(µ1, Σ11),

  where [µ1]_i = [µ]_i and [Σ11]_{i,j} = [Σ]_{i,j} for i, j = 1, …, r.
STAT3914: Lecture 3 14
• Claim. If Σ is diagonal then the X_i are independent.

• Proof. (Sketch: when Σ is diagonal, both the MGF and the joint density factorize into a product over the coordinates, so the X_i are independent.)
STAT3914: Lecture 3 15
Block/partitioned matrices
• Suppose X is a p-dimensional random vector.

• Let X1 = (X_1, …, X_r)^T be the first r coordinates of X and let X2 = (X_{r+1}, …, X_p)^T be its last q = p − r coordinates, so that

    X = ( X1 )
        ( X2 ).

• Let µi = E[Xi] (i = 1, 2); then

    µ ≡ E[X] = ( µ1 )
               ( µ2 ).

• Similarly, with

    [Σ]_{ij} ≡ Cov(Xi, Xj) ≡ E[Xi Xj^T] − E[Xi] E[Xj]^T
STAT3914: Lecture 3 16
(so that Σ11 is r × r and Σ12 is r × q, etc.),

    Var(X) ≡ Σ = ( Σ11  Σ12 )
                 ( Σ21  Σ22 ).
STAT3914: Lecture 3 17
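The block notation above is easy to mirror with array slicing. A minimal numpy sketch (dimensions and values are illustrative, not from the notes), partitioning µ and Σ with r = 2 leading coordinates out of p = 4:

```python
import numpy as np

rng = np.random.default_rng(1)
p, r = 4, 2
mu = np.arange(1.0, p + 1)               # hypothetical mean vector
B = rng.standard_normal((p, p))
Sigma = B @ B.T + p * np.eye(p)          # a positive definite covariance

mu1, mu2 = mu[:r], mu[r:]
S11, S12 = Sigma[:r, :r], Sigma[:r, r:]  # Sigma11 (r x r), Sigma12 (r x q)
S21, S22 = Sigma[r:, :r], Sigma[r:, r:]  # Sigma21 (q x r), Sigma22 (q x q)
assert np.allclose(S12, S21.T)           # Sigma is symmetric
```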
Degenerate multivariate normal distribution
• Suppose X is multivariate normal with mean µ and a singular covariance matrix Σ of rank r < p.

• As Σ is symmetric it has (p − r) eigenvalues equal to 0, and there exists an orthogonal matrix P such that

    P^T Σ P = ( D  0 )
              ( 0  0 ) ≡ Λ,

  where

    D = diag(λ1, …, λr), with λ1 ≥ λ2 ≥ … ≥ λr > 0,

  and the columns of P are the eigenvectors of Σ.
STAT3914: Lecture 3 18
• Let Y = P^T X and denote Y1 = (Y_1, Y_2, …, Y_r)^T and Y2 = (Y_{r+1}, …, Y_p)^T, and similarly s = (s1, s2) and γ = P^T µ = (γ1, γ2).

• Then Y has MGF

    E[e^{s1^T Y1 + s2^T Y2}] = E[e^{s^T Y}]
                             = E[e^{s^T P^T X}]
                             = exp(s^T P^T µ + ½ s^T P^T Σ P s)
                             = exp(s^T γ + ½ s^T Λ s)
                             = exp(s1^T γ1 + ½ s1^T D s1) · exp(s2^T γ2).

• The RHS is a product of two MGFs: one of a distribution that is the constant γ2, and the other of N_r(γ1, D).

• By the uniqueness of the MGF, Y2 ≡ γ2 and Y1 ∼ N_r(γ1, D).

• Note that Y_1, …, Y_r are independent. Does this ring a bell? Principal Component Analysis...
STAT3914: Lecture 3 19
• It also follows that if X ∼ N(µ, Σ) with Σ of rank r then there exists an invertible linear transformation G such that X = GW + µ, where

    W_{r+1} = … = W_p = 0 and (W_1, …, W_r)^T ∼ N_r(0, I);

  see exercises.
STAT3914: Lecture 3 20
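The eigen-rotation used above is easy to check numerically. A minimal numpy sketch (dimensions, seed and mean are illustrative, not from the notes) with a rank-2 covariance in p = 3 dimensions; after rotating by the eigenvectors, the last coordinate of Y = P^T X is the constant γ_3:

```python
import numpy as np

rng = np.random.default_rng(2)
A = rng.standard_normal((3, 2))
Sigma = A @ A.T                              # rank 2, so singular
mu = np.array([1.0, 0.0, -1.0])

lam, P = np.linalg.eigh(Sigma)               # columns of P are eigenvectors
order = np.argsort(lam)[::-1]                # decreasing eigenvalue order
lam, P = lam[order], P[:, order]

X = mu + rng.standard_normal((5, 2)) @ A.T   # five draws from N_3(mu, Sigma)
Y = X @ P                                    # rows are (P^T x_i)^T
print(Y[:, 2] - (P.T @ mu)[2])               # ~ 0: the last coordinate is constant
```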
Block independence
• Theorem 1. X1 and X2 are independent if and only if Σ12 = 0.

• Proof. If X1 and X2 are independent then X1 and X2 are element-wise uncorrelated and so Σ12 = 0.

  Conversely, if Σ12 = 0 then Σ21 = 0, and if s = (s1, s2)^T (with s1 ∈ R^r) we have

    s^T Σ s = s1^T Σ11 s1 + s2^T Σ22 s2.

  Therefore, X has MGF

    M_X(s) = exp(s^T µ + ½ s^T Σ s)
           = exp(s1^T µ1 + ½ s1^T Σ11 s1) · exp(s2^T µ2 + ½ s2^T Σ22 s2),

  and so (why?) X1 and X2 are independent with

    X1 ∼ N_r(µ1, Σ11) and X2 ∼ N_s(µ2, Σ22).
STAT3914: Lecture 3 21
STAT3914: Lecture 3 22
Detour into the linear algebra of block matrices
• Let

    A = ( A11  A12 )
        ( A21  A22 )

  be an invertible p × p matrix (where A11 is r × r and A22 is s × s with r + s = p).

• It is common to denote the inverse of A by

    A^{-1} = ( A^{11}  A^{12} )
             ( A^{21}  A^{22} )

  (where A^{11} is r × r and A^{22} is s × s).

• By the definition of a matrix inverse we have the relations

    ( A^{11}  A^{12} ) ( A11  A12 )   ( I_r  0   )
    ( A^{21}  A^{22} ) ( A21  A22 ) = ( 0    I_s ).
STAT3914: Lecture 3 23
• In particular,

    A^{21} A11 + A^{22} A21 = 0,
    A^{21} A12 + A^{22} A22 = I_s.

• Multiply the first equation from the right by A11^{-1} A12 and subtract it from the second equation to find that

    A^{22} (A22 − A21 A11^{-1} A12) = I_s  ⟹  A^{22} = (A22 − A21 A11^{-1} A12)^{-1}.

• The same kind of algebra yields (exercise)

    A^{12} = −A11^{-1} A12 A^{22},
    A^{21} = −A^{22} A21 A11^{-1},
    −A^{12} A21 = A^{11} A11 − I_r,
    A^{22} = (A22 − A21 A11^{-1} A12)^{-1},
    A^{11} = (A11 − A12 A22^{-1} A21)^{-1}.
STAT3914: Lecture 3 24
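These identities are easy to sanity-check on a random matrix. A minimal numpy sketch (the matrix and block sizes are illustrative, not from the notes); the variables B.. stand for the superscripted blocks A^{..} of the inverse:

```python
import numpy as np

rng = np.random.default_rng(3)
p, r = 5, 2
A = rng.standard_normal((p, p)) + p * np.eye(p)   # invertible with high probability
A11, A12 = A[:r, :r], A[:r, r:]
A21, A22 = A[r:, :r], A[r:, r:]

Ainv = np.linalg.inv(A)
B11, B12 = Ainv[:r, :r], Ainv[:r, r:]
B21, B22 = Ainv[r:, :r], Ainv[r:, r:]

inv = np.linalg.inv
print(np.allclose(B22, inv(A22 - A21 @ inv(A11) @ A12)))   # True
print(np.allclose(B11, inv(A11 - A12 @ inv(A22) @ A21)))   # True
print(np.allclose(B12, -inv(A11) @ A12 @ B22))             # True
```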
Linear Predictor (Estimator)
• X = (X_1, X_2, X3^T)^T is a RV.

• A linear predictor (estimator) of X_1 given X3 is of the form b^T X3, where b ∈ R^{p−2} (ignore X_2 for now).

• We seek the best such linear predictor:

    b_0 ≡ argmin_b E[(X_1 − b^T X3)²].

• Assume µ = 0 (the results hold for any µ) and let

    f(b) ≡ E[(X_1 − b^T X3)²]
         = E[(X_1 − b^T X3)(X_1 − b^T X3)^T]
         = Σ11 − b^T Σ31 − Σ13 b + b^T Σ33 b
         = Σ11 − 2 b^T Σ31 + b^T Σ33 b.

• Use calculus or algebra to minimize f.
STAT3914: Lecture 3 25
• We want to minimize (with respect to b)

    f(b) = Σ11 − 2 b^T Σ31 + b^T Σ33 b.

• Clearly, ∇_b(−2 b^T Σ31) = −2 Σ31 ∈ R^{p−2}.

• Since b^T A b = ∑_{i,j} a_{ij} b_i b_j, it follows for a symmetric A that

    ∇_b(b^T A b) = 2 A b.

• Therefore,

    ∇f(b) = −2 Σ31 + 2 Σ33 b.

• Thus, the unique stationary point is attained at

    b_0 = Σ33^{-1} Σ31,

  which is a minimum (as Σ33 is positive definite).
STAT3914: Lecture 3 26
• Algebra: complete the quadratic form

    f(b) = Σ11 − b^T Σ31 − Σ13 b + b^T Σ33 b
         = Σ11 − Σ13 Σ33^{-1} Σ31 + (b − Σ33^{-1} Σ31)^T Σ33 (b − Σ33^{-1} Σ31)
         ≥ Σ11 − Σ13 Σ33^{-1} Σ31,

  with equality if and only if b = Σ33^{-1} Σ31.

• Thus, the best linear estimator (predictor) of X_1 given X3 is

    P_{X3}(X_1) = Σ13 Σ33^{-1} X3.

  Note that this is a normal RV. Similarly, P_{X3}(X_2) = Σ23 Σ33^{-1} X3.

• This is also known as the projection of X_1 (or X_2) on the subspace spanned by X3.
STAT3914: Lecture 3 27
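A minimal numpy sketch (the covariance below is randomly generated, purely illustrative) of the best linear predictor coefficients b_0 = Σ33^{-1} Σ31 and the resulting minimal mean squared error:

```python
import numpy as np

rng = np.random.default_rng(4)
p = 4
B = rng.standard_normal((p, p))
Sigma = B @ B.T + np.eye(p)            # hypothetical covariance, mu = 0

S11 = Sigma[0, 0]                      # Var(X_1)
S13 = Sigma[0, 2:]                     # Cov(X_1, X3)
S31 = Sigma[2:, 0]                     # Cov(X3, X_1)
S33 = Sigma[2:, 2:]                    # Var(X3)

b0 = np.linalg.solve(S33, S31)                 # Sigma33^{-1} Sigma31
mse = S11 - S13 @ np.linalg.solve(S33, S31)    # Sigma11 - Sigma13 Sigma33^{-1} Sigma31
print(b0, mse)
```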
Conditional Distributions
• Let

    X = ( X1 )
        ( X2 ) ∼ N_p(µ, Σ),

  where Σ is non-singular, X1 = (X_1, …, X_r)^T and X2 = (X_{r+1}, …, X_p)^T. Then

    X1 ∼ N_r(µ1, Σ11) and X2 ∼ N_s(µ2, Σ22),

  where s = p − r.

• What is the conditional distribution of X2 given that X1 = x1?
STAT3914: Lecture 3 28
• Recall that for the bivariate normal (X_1, X_2) we defined

    f_{X_2|X_1}(x_2 | x_1) = f_{X_1,X_2}(x_1, x_2) / f_{X_1}(x_1).

• We saw that for X_i ∼ N(0, 1) with Cov(X_1, X_2) = ρ,

    X_2 | X_1 = x_1 ∼ N(ρ x_1, 1 − ρ²).
STAT3914: Lecture 3 29
• We can pursue the obvious generalization here (still assuming µ = 0):

    log f_{X2|X1}(X2 | X1)
      = log [ f_X(X) / f_{X1}(X1) ]
      = C − ½ [ (X1^T, X2^T) Σ^{-1} (X1^T, X2^T)^T − X1^T Σ11^{-1} X1 ]
      = C − ½ [ X1^T (Σ^{11} − Σ11^{-1}) X1 + X2^T Σ^{21} X1 + X1^T Σ^{12} X2 + X2^T Σ^{22} X2 ],

  where C = C(Σ, r) is a constant and

    Σ^{-1} = ( Σ^{11}  Σ^{12} )
             ( Σ^{21}  Σ^{22} ).
STAT3914: Lecture 3 30
From the previous slide,

    log f_{X2|X1}(X2 | X1)
      = C − ½ [ X1^T (Σ^{11} − Σ11^{-1}) X1 + X2^T Σ^{21} X1 + X1^T Σ^{12} X2 + X2^T Σ^{22} X2 ].

• Using the block inverse identities we obtain

    Σ^{11} − Σ11^{-1} = (Σ^{11} Σ11 − I) Σ11^{-1} = −Σ^{12} Σ21 Σ11^{-1} = Σ11^{-1} Σ12 Σ^{22} Σ21 Σ11^{-1},
    Σ^{21} = −Σ^{22} Σ21 Σ11^{-1},
    Σ^{12} = −Σ11^{-1} Σ12 Σ^{22}.

  Substituting these into the above expression we have

    X1^T (Σ^{11} − Σ11^{-1}) X1 + X2^T Σ^{21} X1 + X1^T Σ^{12} X2 + X2^T Σ^{22} X2
      = X1^T Σ11^{-1} Σ12 Σ^{22} Σ21 Σ11^{-1} X1 − X2^T Σ^{22} Σ21 Σ11^{-1} X1
        − X1^T Σ11^{-1} Σ12 Σ^{22} X2 + X2^T Σ^{22} X2
      = (X2 − Σ21 Σ11^{-1} X1)^T Σ^{22} (X2 − Σ21 Σ11^{-1} X1).

• Therefore,

    log f_{X2|X1}(X2 | X1) = C − ½ (X2 − Σ21 Σ11^{-1} X1)^T Σ^{22} (X2 − Σ21 Σ11^{-1} X1).
STAT3914: Lecture 3 31
• As a function of X2 this is the density of a multivariate normal with mean Σ21 Σ11^{-1} X1 and covariance matrix

    (Σ^{22})^{-1} = Σ22 − Σ21 Σ11^{-1} Σ12.

• We have essentially proved:

  Theorem 2. If X ∼ N_p(µ, Σ) then the conditional distribution of X2 given X1 = x is

    N(µ2 + Σ21 Σ11^{-1} (x − µ1), Σ22 − Σ21 Σ11^{-1} Σ12).
STAT3914: Lecture 3 32
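A minimal numpy sketch (random covariance and an assumed conditioning value, both purely illustrative) of the conditional mean and covariance given by Theorem 2:

```python
import numpy as np

rng = np.random.default_rng(5)
p, r = 4, 2
B = rng.standard_normal((p, p))
Sigma = B @ B.T + np.eye(p)
mu = np.zeros(p)

S11, S12 = Sigma[:r, :r], Sigma[:r, r:]
S21, S22 = Sigma[r:, :r], Sigma[r:, r:]
x1 = np.array([1.0, -0.5])                            # hypothetical value of X1

cond_mean = mu[r:] + S21 @ np.linalg.solve(S11, x1 - mu[:r])
cond_cov = S22 - S21 @ np.linalg.solve(S11, S12)      # Sigma_{2.1}
print(cond_mean)
print(cond_cov)
```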
• Rather than routinely generalizing the result to include the case µ ≠ 0, we give a different proof for the general case.

• Recall that if (X_1, X_2) is bivariate normal with X_i ∼ N(0, 1) and correlation ρ, then we can de-correlate X_2 from X_1 as follows. Let Y_2 = X_2 − ρX_1; then

    Cov(Y_2, X_1) = Cov(X_2, X_1) − ρ Cov(X_1, X_1) = 0.

• So X_2 = Y_2 + ρX_1, where Y_2 ∼ N(0, 1 − ρ²) is independent of X_1 (why?).

• Therefore, conditioning on X_1 = x_1,

    X_2 ∼ N(ρ x_1, 1 − ρ²).

• Next we generalize this to the multivariate normal case.
STAT3914: Lecture 3 33
• Let Q = Σ21 Σ11^{-1} and let

    Y2 = X2 − Q X1.

  Note that Q X1 is the best linear predictor of X2 given X1.

• Y2 and X1 are uncorrelated, since

    Cov(Y2, X1) = Cov(X2, X1) − Cov(Q X1, X1)
                = Σ21 − Q Σ11
                = 0.

• Then

    ( X1 )
    ( Y2 )

  is a multivariate normal RV (why?).
STAT3914: Lecture 3 34
• It follows that Y2 and X1 are independent and

    Y2 ∼ N(µ2 − Qµ1, Σ22 − Σ21 Σ11^{-1} Σ12),

  as Y2 = X2 − Q X1 with Q = Σ21 Σ11^{-1}, so

    Cov(Y2) = Cov(X2 − Q X1, X2 − Q X1)
            = Σ22 − Σ21 Q^T − Q Σ12 + Q Σ11 Q^T
            = Σ22 − Σ21 Σ11^{-1} Σ12.

• Finally, X2 = Y2 + Q X1, so the conditional distribution of X2 given X1 = x is

    N(µ2 − Qµ1 + Qx, Σ22 − Σ21 Σ11^{-1} Σ12).

• Notation: Σ2·1 = Σ22 − Σ21 Σ11^{-1} Σ12 = (Σ^{22})^{-1}.
STAT3914: Lecture 3 35
Partial Correlation
• The best linear estimator of X_1 given X3 is

    P_{X3}(X_1) = Σ13 Σ33^{-1} X3.

• The partial correlation between X_1 and X_2 given X3 is

    ρ_{12·34…p} ≡ Cor[X_1 − P_{X3}(X_1), X_2 − P_{X3}(X_2)].

• Theorem: Let

    Σ^{-1} = ( Σ^{11}  Σ^{12}  Σ^{13} )
             ( Σ^{21}  Σ^{22}  Σ^{23} )
             ( Σ^{31}  Σ^{32}  Σ^{33} )

  and let d = Σ^{11} Σ^{22} − Σ^{12} Σ^{21}; then

    ρ_{12·34…p} = −Σ^{12} / √(Σ^{11} Σ^{22}).
STAT3914: Lecture 3 36
• Proof: Covariance is bilinear, so

    Cov[X_1 − P_{X3}(X_1), X_2 − P_{X3}(X_2)]
      = Cov(X_1, X_2) − Cov(X_1, P_{X3}(X_2)) − Cov(P_{X3}(X_1), X_2) + Cov(P_{X3}(X_1), P_{X3}(X_2))
      = Σ12 − Cov(X_1, X3) Σ33^{-1} Σ23^T − Σ13 Σ33^{-1} Cov(X3, X_2) + Σ13 Σ33^{-1} Cov(X3, X3) Σ33^{-1} Σ23^T
      = Σ12 − Σ13 Σ33^{-1} Σ32.

• Similarly,

    Cov[X_1 − P_{X3}(X_1)]
      = Σ11 − Σ13 Σ33^{-1} Σ13^T − Σ13 Σ33^{-1} Σ31 + Σ13 Σ33^{-1} Σ33 Σ33^{-1} Σ13^T
      = Σ11 − Σ13 Σ33^{-1} Σ31.

• Hence,

    ρ_{12·34…p} = (Σ12 − Σ13 Σ33^{-1} Σ32) / √[(Σ11 − Σ13 Σ33^{-1} Σ31)(Σ22 − Σ23 Σ33^{-1} Σ32)].
STAT3914: Lecture 3 37
• By our block inverse identities,

    ( Σ^{11}  Σ^{12} )^{-1}         (  Σ^{22}  −Σ^{12} )
    ( Σ^{21}  Σ^{22} )      = (1/d) ( −Σ^{21}   Σ^{11} )

  and also

    ( Σ^{11}  Σ^{12} )^{-1}   ( Σ11  Σ12 )   ( Σ13 )
    ( Σ^{21}  Σ^{22} )      = ( Σ21  Σ22 ) − ( Σ23 ) Σ33^{-1} ( Σ31  Σ32 ).

• Therefore,

    ρ_{12·34…p} = (Σ12 − Σ13 Σ33^{-1} Σ32) / √[(Σ11 − Σ13 Σ33^{-1} Σ31)(Σ22 − Σ23 Σ33^{-1} Σ32)]
                = −(Σ^{12}/d) / √[(Σ^{22}/d)(Σ^{11}/d)]
                = −Σ^{12} / √(Σ^{11} Σ^{22}).
STAT3914: Lecture 3 38
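A minimal numpy sketch (random covariance, purely illustrative) computing ρ_{12·34…p} two ways, from the residual formula of the previous slides and from the precision matrix; the two values coincide:

```python
import numpy as np

rng = np.random.default_rng(6)
p = 5
B = rng.standard_normal((p, p))
Sigma = B @ B.T + np.eye(p)

# residual-based formula
num = Sigma[0, 1] - Sigma[0, 2:] @ np.linalg.solve(Sigma[2:, 2:], Sigma[2:, 1])
v1 = Sigma[0, 0] - Sigma[0, 2:] @ np.linalg.solve(Sigma[2:, 2:], Sigma[2:, 0])
v2 = Sigma[1, 1] - Sigma[1, 2:] @ np.linalg.solve(Sigma[2:, 2:], Sigma[2:, 1])
rho_resid = num / np.sqrt(v1 * v2)

# precision-matrix formula
K = np.linalg.inv(Sigma)
rho_prec = -K[0, 1] / np.sqrt(K[0, 0] * K[1, 1])

print(rho_resid, rho_prec)
```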
• Our definition of partial correlation and the analysis above do not require X to be multivariate normal.

• However, if X is multivariate normal then

    ρ_{12·3…p} = Cor_F(X_1, X_2),

  where the correlation is taken with respect to F = F_{(X_1,X_2)|X3}, the joint conditional distribution of X_1 and X_2 given X3.
STAT3914: Lecture 3 39
Maximum Likelihood Estimates for µ and Σ
• X1, …, Xn are independent N_p(µ0, Σ0) random vectors (here µ0 and Σ0 are the true mean and covariance respectively).

• Theorem: The maximum likelihood estimators of µ and Σ are

    µ̂ = X̄ and Σ̂ = (1/n) S,

  where

    S = ∑_{i=1}^{n} (Xi − X̄)(Xi − X̄)^T.
STAT3914: Lecture 3 40
• Proof: Assuming |Σ| > 0, the likelihood of (µ, Σ) is

    L(µ, Σ) = ∏_{i=1}^{n} (2π)^{-p/2} |Σ|^{-1/2} exp(−½ (Xi − µ)^T Σ^{-1} (Xi − µ)).

• Recall: if A is m × n and B is n × m then tr(AB) = tr(BA). Hence,

    (Xi − µ)^T Σ^{-1} (Xi − µ) = tr[(Xi − µ)^T Σ^{-1} (Xi − µ)]
                               = tr[Σ^{-1} (Xi − µ)(Xi − µ)^T].

  Thus,

    L(µ, Σ) = (2π)^{-np/2} |Σ|^{-n/2} exp(−½ tr[Σ^{-1} ∑_{i=1}^{n} (Xi − µ)(Xi − µ)^T]).
STAT3914: Lecture 3 41
From the previous slide,

    L(µ, Σ) = (2π)^{-np/2} |Σ|^{-n/2} exp(−½ tr[Σ^{-1} ∑_{i=1}^{n} (Xi − µ)(Xi − µ)^T]).

• Relying on the identity

    ∑_{i=1}^{n} (Xi − µ)(Xi − µ)^T = ∑_{i=1}^{n} (Xi − X̄)(Xi − X̄)^T + n (X̄ − µ)(X̄ − µ)^T,

  where X̄ = n^{-1} ∑_{i=1}^{n} Xi, we have

    L(µ, Σ) = (2π)^{-np/2} |Σ|^{-n/2} exp(−½ tr[Σ^{-1} (n (X̄ − µ)(X̄ − µ)^T + ∑_{i=1}^{n} (Xi − X̄)(Xi − X̄)^T)]).
STAT3914: Lecture 3 42
• From the previous slide,

    L(µ, Σ) = (2π)^{-np/2} |Σ|^{-n/2} exp(−½ tr[Σ^{-1} (n (X̄ − µ)(X̄ − µ)^T + ∑_{i=1}^{n} (Xi − X̄)(Xi − X̄)^T)]).

• Maximizing L(µ, Σ) with respect to µ is easy:

    −½ tr[Σ^{-1} (X̄ − µ)(X̄ − µ)^T] = −½ (X̄ − µ)^T Σ^{-1} (X̄ − µ) ≤ 0,

  with equality if and only if µ = X̄.

• Then

    max_µ L(µ, Σ) = L(X̄, Σ) = (2π)^{-np/2} |Σ|^{-n/2} exp(−½ tr(Σ^{-1} S)) ≡ L_P(Σ),

  where L_P(Σ) is sometimes referred to as the profile likelihood.
STAT3914: Lecture 3 43
• Next note that

    argmax_Σ L_P(Σ) = argmax_Σ log L_P(Σ)
                    = argmax_Σ { −(np/2) log(2π) − (n/2) log|Σ| − ½ tr(Σ^{-1} S) },

  since log(·) is monotonic.

• In order to maximize the log profile likelihood we note that it can be shown that

    ∂ log|Σ| / ∂Σ_ij = tr[Σ^{-1} ∂Σ/∂Σ_ij] = tr[Σ^{-1} E_ij],

  where E_ij is a matrix of zeros except for a 1 in the (i, j)th entry. Next, it can be shown that

    ∂Σ^{-1} / ∂Σ_ij = −Σ^{-1} (∂Σ/∂Σ_ij) Σ^{-1} = −Σ^{-1} E_ij Σ^{-1}.
STAT3914: Lecture 3 44
Hence,

    ∂ log L_P(Σ) / ∂Σ_ij = −(n/2) tr[Σ^{-1} E_ij] + ½ tr(Σ^{-1} E_ij Σ^{-1} S)
                         = ½ tr[Σ^{-1} (S Σ^{-1} − nI) E_ij].

Setting the above to 0 for all (i, j) we obtain the solution

    Σ̂ = (1/n) S.
STAT3914: Lecture 3 45
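A minimal numpy sketch (simulated data; the true parameters are illustrative) computing the MLEs just derived:

```python
import numpy as np

rng = np.random.default_rng(7)
n, p = 500, 3
mu0 = np.array([1.0, 0.0, -1.0])
A = rng.standard_normal((p, p))
Sigma0 = A @ A.T + np.eye(p)

X = rng.multivariate_normal(mu0, Sigma0, size=n)   # n x p data matrix

mu_hat = X.mean(axis=0)                            # X-bar
centred = X - mu_hat
S = centred.T @ centred                            # sum_i (X_i - X-bar)(X_i - X-bar)^T
Sigma_hat = S / n                                  # MLE (biased, see next slides)
Sigma_unbiased = S / (n - 1)                       # unbiased alternative
print(mu_hat)
print(Sigma_hat)
```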
Bias of the MLEs

Firstly, it is easy to show that the MLE for µ is unbiased, since

    E[X̄] = (1/n) ∑_{i=1}^{n} E[Xi] = (1/n) ∑_{i=1}^{n} µ = µ,

and it has covariance

    Cov(X̄) = (1/n²) ∑_{i=1}^{n} Cov[Xi] = (1/n²) ∑_{i=1}^{n} Σ = (1/n) Σ.
STAT3914: Lecture 3 46
However, the MLE for Σ is biased, since

    E[n^{-1} S] = n^{-1} E[∑_{i=1}^{n} (Xi − X̄)(Xi − X̄)^T]
                = n^{-1} E[∑_{i=1}^{n} Xi Xi^T − n X̄ X̄^T]
                = n^{-1} [n (Σ + µµ^T) − n ((1/n) Σ + µµ^T)]
                = ((n − 1)/n) Σ.

Hence,

    (1/(n − 1)) S

is an unbiased estimator of Σ.
STAT3914: Lecture 3 47
Sampling Distributions
Theorem: If Σ̂ is a consistent estimator of Σ then, by virtue of the central limit theorem,

    √n Σ̂^{-1/2} (X̄ − µ)

converges to N_p(0, I) in distribution.

The sampling distribution of S is much more complicated and involves the Wishart distribution (which we will cover later on).
STAT3914: Lecture 3 48
Results on Quadratic Forms
• Definition: A square matrix A is called idempotent if A² = A.

• Theorem: A symmetric matrix A is idempotent if and only if all its eigenvalues are in {0, 1}.

• Proof: Suppose the eigenvalue decomposition of A is UΛU^T; then

    A² = UΛU^T UΛU^T = UΛ²U^T.

  If A is idempotent then

    UΛ²U^T = UΛU^T  ⟹  λ_i² = λ_i for 1 ≤ i ≤ p.

  Hence, if A is idempotent then λ_i ∈ {0, 1} for 1 ≤ i ≤ p.

  Conversely, if λ_i ∈ {0, 1} for 1 ≤ i ≤ p then λ_i² = λ_i for 1 ≤ i ≤ p and

    A² = UΛ²U^T = UΛU^T = A.
STAT3914: Lecture 3 49
• Let X ∼ N_p(0, I) and let C be a symmetric square matrix of rank r > 0.

• Theorem 3: The random variable

    X^T C X ∼ χ²_r

  if and only if C is idempotent.
STAT3914: Lecture 3 50
• Proof: We can diagonalize C = UDU^T, where U is an orthogonal matrix and

    D = diag(λ1, …, λr, 0, …, 0)   (with p − r trailing zeros),

  where λ_i ≥ λ_{i+1} for i = 1, …, r − 1 (and λ_i ≠ 0).

• Let Y = U^T X, so that Y ∼ N_p(0, I), and note that

    W ≡ X^T C X = X^T U D U^T X = Y^T D Y = ∑_{i=1}^{r} λ_i Y_i².

• If C is idempotent then λ_i = 1 for 1 ≤ i ≤ r, and W ∼ χ²_r since the Y_i² are i.i.d. χ²_1.
STAT3914: Lecture 3 51
• Conversely, assume W ∼ χ²_r; then its MGF is given by

    M_W(t) = (1 − 2t)^{-r/2}   for t < ½.

  Since the Y_i² are i.i.d. χ²_1, for λ_i t < 1/2 for all i we also have

    M_W(t) = ∏_{i=1}^{r} M_{λ_i Y_i²}(t) = ∏_{i=1}^{r} M_{Y_i²}(λ_i t) = ∏_{i=1}^{r} (1 − 2λ_i t)^{-1/2}.

• These two domains agree if and only if λ_1 = 1 and λ_r > 0, so this has to be the case.
STAT3914: Lecture 3 52
• Therefore, M_W(t) = M_{χ²_r}(t) for t < ½ only if

    ∏_{i=1}^{r} (1 − 2λ_i t) = [M_W(t)]^{-2} = [M_{χ²_r}(t)]^{-2} = ∏_{i=1}^{r} (1 − 2t).

• By the uniqueness of polynomial factorization, the equality holds if and only if λ_i = 1 for i = 1, …, r.
STAT3914: Lecture 3 53
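A quick numerical check of Theorem 3 (a minimal numpy/scipy sketch; the projection matrix and sample size are assumptions for illustration): for an idempotent C of rank r, the quadratic form X^T C X behaves like a χ²_r variable.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(8)
p, r = 6, 3
V = np.linalg.qr(rng.standard_normal((p, r)))[0]   # p x r with orthonormal columns
C = V @ V.T                                        # idempotent projection of rank r

X = rng.standard_normal((50_000, p))               # rows are N_p(0, I) draws
W = np.einsum("ij,jk,ik->i", X, C, X)              # x_i^T C x_i for each row

print(W.mean(), r)                                 # the chi2_r mean is r
print(stats.kstest(W, stats.chi2(df=r).cdf).statistic)   # small KS distance
```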
• Theorem 4 [Craig's Theorem]: If A and B are symmetric, non-negative definite matrices and X ∼ N_p(0, I), then

    X^T A X and X^T B X

  are independent if and only if AB = 0.
STAT3914: Lecture 3 54
• Proof: Assume AB = 0. Then BA = (AB)^T = 0, so A and B commute.

  Simultaneous diagonalization: there exist an orthogonal U and diagonal matrices D_A and D_B such that U^T A U = D_A and U^T B U = D_B.

  Next, AB = U D_A U^T U D_B U^T = U D_A D_B U^T, so AB = 0 implies (WLOG) that

    D_A = diag(λ1, …, λr, 0, …, 0) and D_B = diag(0, …, 0, λ_{r+1}, …, λ_p)

  (for some r, with r leading zeros in D_B).

  It follows that, with Y ≡ U^T X,

    X^T A X = Y^T D_A Y = ∑_{i=1}^{r} λ_i Y_i² and X^T B X = ∑_{i=r+1}^{p} λ_i Y_i².

  As Y ∼ N_p(0, I), the latter two are obviously independent random variables.
STAT3914: Lecture 3 55
• Conversely, suppose the quadratic forms are independent.

  Let U be an orthogonal matrix such that

    U^T A U = D_A = diag(λ1, …, λr, 0, …, 0)   (with p − r trailing zeros).

  Let B* = U^T B U and let Y = U^T X; then

    X^T A X = Y^T D_A Y = ∑_{i=1}^{r} λ_i Y_i²,
    X^T B X = ∑_i ∑_j b*_{ij} Y_i Y_j.

  Independence implies that

    E[X^T A X] E[X^T B X] = E[X^T A X · X^T B X].

  Therefore,

    E[∑_{i=1}^{r} λ_i Y_i²] · E[∑_i ∑_j b*_{ij} Y_i Y_j] = E[(∑_{i=1}^{r} λ_i Y_i²)(∑_j ∑_k b*_{kj} Y_k Y_j)].
STAT3914: Lecture 3 56
But Y_1, …, Y_p are NID(0, 1) with E[Y_i²] = 1, E[Y_i³] = 0 and E[Y_i⁴] = 3. Therefore,

    (∑_{i=1}^{r} λ_i)(∑_j b*_{jj}) = 3 ∑_{i=1}^{r} λ_i b*_{ii} + ∑_{i≠j} λ_i b*_{jj}.

Thus, ∑_{i=1}^{r} λ_i b*_{ii} = 0.

But λ_i b*_{ii} ≥ 0 for all i (why?) and λ_i > 0 for i = 1, …, r, so b*_{ii} = 0 for i = 1, …, r.

Since B* ≥ 0, it follows that any principal submatrix is non-negative definite as well; hence b*_{ij} = 0 if i or j is in {1, …, r}. Thus

    B* = ( 0  0    )
         ( 0  B*22 ),

hence D_A B* = 0 and therefore

    AB = A U U^T B = U D_A B* U^T = 0.
STAT3914: Lecture 3 57
The Wishart Distribution
• Let X = (X1, …, Xn)^T be an n × p matrix with rows Xi ∼ NID_p(0, Σ).

• The p × p matrix M ≡ X^T X has a Wishart distribution W_p(Σ, n), whose density is given by

    C_{p,n}^{-1} |Σ|^{-n/2} |M|^{(n−p−1)/2} exp(−½ tr[Σ^{-1} M]),

  where

    C_{p,n} ≡ 2^{np/2} π^{p(p−1)/4} ∏_{i=1}^{p} Γ((n + 1 − i)/2).

  We also need n (called the degrees of freedom) to satisfy n > p − 1, and the Wishart scale matrix Σ is assumed to be positive definite.
STAT3914: Lecture 3 58
• Note that X = ∑_{i=1}^{n} e_i Xi^T, where e_i is a vector of length n whose elements are 0 except for the ith element, which is equal to 1. Therefore,

    M = (∑_{j=1}^{n} Xj e_j^T)(∑_{i=1}^{n} e_i Xi^T) = ∑_{i,j} Xj δ_{ij} Xi^T = ∑_{i=1}^{n} Xi Xi^T,

  where δ_{ij} is a scalar equal to 1 if i = j and 0 if i ≠ j.

• Hence, for known µ = 0 the MLE/sample covariance satisfies nΣ̂ ∼ W_p(Σ, n).

• If p = 1 then W_1(Σ, n) = W_1(σ², n) = σ² χ²_n.

• E[M] = E(∑_{i=1}^{n} Xi Xi^T) = nΣ.

• Var(M_ij) = n(Σ_ij² + Σ_ii Σ_jj).
STAT3914: Lecture 3 59
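A minimal numpy sketch (dimensions and scale matrix are illustrative) forming M = X^T X from n rows of N_p(0, Σ) draws; averaging over many replicates checks the moment formula E[M] = nΣ from the previous slide:

```python
import numpy as np

rng = np.random.default_rng(9)
n, p = 10, 3
A = rng.standard_normal((p, p))
Sigma = A @ A.T + np.eye(p)

reps = 20_000
M_sum = np.zeros((p, p))
for _ in range(reps):
    X = rng.multivariate_normal(np.zeros(p), Sigma, size=n)   # n x p data matrix
    M_sum += X.T @ X                                          # one W_p(Sigma, n) draw
print(M_sum / reps)   # Monte Carlo estimate of E[M]
print(n * Sigma)      # theoretical value
```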
• Theorem 5: If M ∼ W_p(Σ, n) and B is a p × q matrix, then

    B^T M B ∼ W_q(B^T Σ B, n).

• Proof: Let Y = XB; then Y = (Y1, …, Yn)^T is an n × q matrix with independent rows Yi^T = Xi^T B, i.e. Yi = B^T Xi ∼ N_q(0, B^T Σ B). It follows that

    B^T M B = B^T X^T X B = Y^T Y ∼ W_q(B^T Σ B, n).

• Corollary: The principal submatrices of M have Wishart distributions.
STAT3914: Lecture 3 60
• Theorem 6: If M ∼ W_p(Σ, n) and a is any fixed p-vector such that a^T Σ a > 0, then

    a^T M a / (a^T Σ a) ∼ χ²_n.

• Proof: Firstly, from Theorem 5 we have

    a^T M a ∼ W_1(a^T Σ a, n),

  and we have already argued that

    W_1(a^T Σ a, n) ∼ (a^T Σ a) χ²_n.

• Corollary:

    M_ii ∼ Σ_ii χ²_n.

• The converse to Theorem 6 is not true.
STAT3914: Lecture 3 61
• Theorem 7: If

    M1 ∼ W_p(Σ, n1) and M2 ∼ W_p(Σ, n2),

  with M2 independent of M1, then

    M1 + M2 ∼ W_p(Σ, n1 + n2).

• Proof: Write Mi = Xi^T Xi, where Xi has ni independent rows drawn from a N_p(0, Σ) distribution. Then

    M1 + M2 = X1^T X1 + X2^T X2 = X^T X,   where X = ( X1 )
                                                    ( X2 ).
STAT3914: Lecture 3 62
• Recall that the sample covariance matrix is defined as (1/n) S, where

    S = ∑_{i=1}^{n} (Xi − X̄)(Xi − X̄)^T.

• The Wishart distribution describes the distribution of S.

• Theorem 8: If X is an n × p matrix with rows independent N_p(µ, Σ), then S ∼ W_p(Σ, n − 1), independently of X̄ ∼ N_p(µ, n^{-1} Σ).

• Proof: Recall that

    S = ∑_{i=1}^{n} (Xi − X̄)(Xi − X̄)^T = ∑_{i=1}^{n} Xi Xi^T − n X̄ X̄^T = X^T X − n X̄ X̄^T.
STAT3914: Lecture 3 63
Let P be an n × n orthogonal matrix with first row (1/√n, …, 1/√n).

Let Y = PX, so that Y1 = √n X̄, where Y = (Y1, …, Yn)^T. Thus,

    S = Y^T P P^T Y − Y1 Y1^T = Y^T Y − Y1 Y1^T = ∑_{i=2}^{n} Yi Yi^T.

• Exercise: WLOG µ = 0.

• It follows from the following lemma that Yi ∼ NID_p(0, Σ), and therefore

    S ∼ W_p(Σ, n − 1), independently of (1/√n) Y1 = X̄.
STAT3914: Lecture 3 64
• Lemma: Let X be an n × p matrix with rows NID_p(0, Σ) and let U be an n × n orthogonal matrix. Define Y = UX, and let Yi^T be the rows of Y (i = 1, …, n). Then Yi ∼ NID_p(0, Σ).
STAT3914: Lecture 3 65
• Proof: We have E(Y) = U E(X) = 0_{n×p}. Note that Yi = Y^T e_i and therefore

    E[Yi Yj^T] = E[X^T U^T e_i e_j^T U X]
               = E[X^T u_i u_j^T X]
               = E[(∑_{k=1}^{n} Xk e_k^T) u_i u_j^T (∑_{l=1}^{n} e_l Xl^T)]
               = E[∑_{k,l} Xk u_{ik} u_{jl} Xl^T]
               = ∑_{k,l} u_{ik} u_{jl} E[Xk Xl^T]
               = ∑_{k} u_{ik} u_{jk} Σ
               = δ_{ij} Σ,

  where u_i^T is the ith row of U.
STAT3914: Lecture 3 66
Wishart Distribution (cont.)
• Theorem 9: Suppose X has n rows which are NID_p(0, Σ). Then X^T A X ∼ W_p(Σ, r) if and only if A is idempotent and of rank r.

• Proof: If X^T A X ∼ W_p(Σ, r) then, by Theorem 6, for any a ∈ R^p with a^T Σ a > 0 we have

    a^T X^T A X a / (a^T Σ a) ∼ χ²_r.

  Let

    Y = Xa / √(a^T Σ a);

  then Y ∼ N(0_n, I_n) and Y^T A Y ∼ χ²_r, so A is idempotent of rank r by Theorem 3.
STAT3914: Lecture 3 67
• Conversely, if A is idempotent of rank r then A = UDU^T, where U is an n × n orthogonal matrix and D = diag(1, …, 1, 0, …, 0) with r ones and n − r zeros.

• Let Y = U^T X, so by the lemma Y has n rows Yi^T with Yi ∼ NID_p(0, Σ), and

    X^T A X = Y^T D Y
            = (∑_{j=1}^{n} Yj e_j^T) D (∑_{i=1}^{n} e_i Yi^T)
            = ∑_{i,j=1}^{n} Yj (e_j^T D e_i) Yi^T
            = ∑_{i,j=1}^{r} Yj δ_{ij} Yi^T
            = ∑_{i=1}^{r} Yi Yi^T ∼ W_p(Σ, r).
STAT3914: Lecture 3 68
• Theorem 10: If X has n rows which are NID_p(0, Σ), then for any symmetric, non-negative definite n × n matrices A and B, the random matrices X^T A X and X^T B X are independent if and only if AB = 0.

• Proof: Assume X^T A X and X^T B X are independent. Then choose a such that a^T Σ a > 0 and let

    Y = Xa / √(a^T Σ a);

  then:

    ∗ Y ∼ N_n(0, I), and
    ∗ Y^T A Y and Y^T B Y are independent.

  Thus, by Craig's Theorem, we have AB = 0.
STAT3914: Lecture 3 69
• Conversely, assume AB = 0. Then there exists an n × n orthogonal matrix P such that

    P^T A P = D_A ≡ diag(α1, …, αr, 0, …, 0)

  and

    P^T B P = D_B ≡ diag(0, …, 0, β_{r+1}, …, β_n),

  with r non-zero leading entries in D_A and r leading zeros in D_B.
STAT3914: Lecture 3 70
Let Y = P^T X; then, by the last lemma again, Y has n rows Yi^T with Yi ∼ NID_p(0, Σ), and

    X^T A X = Y^T D_A Y
            = (∑_{j=1}^{n} Yj e_j^T) D_A (∑_{i=1}^{n} e_i Yi^T)
            = ∑_{i,j=1}^{n} Yj (e_j^T D_A e_i) Yi^T
            = ∑_{i=1}^{r} α_i Yi Yi^T.

Similarly, X^T B X = ∑_{i=r+1}^{n} β_i Yi Yi^T.

The latter two are obviously independent.
STAT3914: Lecture 3 71
• Theorem 11: Suppose X has n rows which are NID_p(0, Σ) with Σ positive definite. Let

    M = X^T X = ( M11  M12 )
                ( M21  M22 ),

  where M11 is an r × r matrix (with r < n). Let

    M2·1 ≡ M22 − M21 M11^{-1} M12.

  Then

    M2·1 ∼ W_q(Σ2·1, n − r),

  independently of (M11, M12), where q = p − r and

    Σ2·1 = Σ22 − Σ21 Σ11^{-1} Σ12.
STAT3914: Lecture 3 72
• Recall that Σ2·1 is the conditional covariance of

    Xi^(2) ≡ (X_{i,r+1}, …, X_{ip})^T given Xi^(1) ≡ (X_{i1}, …, X_{ir})^T.

  This is also the covariance of

    Yi ≡ Xi^(2) − Q Xi^(1),

  where

    Q Xi^(1) ≡ (Σ21 Σ11^{-1}) Xi^(1)

  is the best linear predictor of Xi^(2) given Xi^(1):

    Yi ∼ NID_q(0_q, Σ2·1), independently of Xi^(1).

• Hence, had we known this decomposition we could have estimated Σ2·1 via Y^T Y ∼ W_q(Σ2·1, n) (note the degrees of freedom are n rather than n − r).

• But we typically don't know Σ...
STAT3914: Lecture 3 73
• Proof (of Theorem 11).

  Write X = (X1, X2), where X1 is n × r and X2 is n × q (with r + q = p); then

    M = X^T X = ( X1^T ) (X1, X2).
                ( X2^T )

  It can be shown that, as Σ11 is positive definite and n > r, M11 is positive definite almost surely.

  Thus, we can define

    M2·1 = X2^T X2 − X2^T X1 M11^{-1} X1^T X2 = X2^T P X2,

  where P = I − X1 M11^{-1} X1^T.
STAT3914: Lecture 3 74
• With P = I − X1 M11^{-1} X1^T it is easy to verify directly that P is idempotent (a projection onto Span(X1)^⊥) and that its rank is n − r, as

    tr(P) = n − tr[X1 M11^{-1} X1^T] = n − tr[M11^{-1} X1^T X1] = n − r.

• Also, X1^T P = P X1 = 0, so with

    X2·1 ≡ X2 − X1 Σ11^{-1} Σ12

  we have

    M2·1 = X2^T P X2 = X2·1^T P X2·1.

• But X2·1 is the component of X2 that is independent of X1, and in particular we saw that the n rows of X2·1 are NID_q(0_q, Σ2·1).

• It follows from Theorem 9 that, conditional on X1, M2·1 ∼ W_q(Σ2·1, n − r).
STAT3914: Lecture 3 75
• This conditional distribution does not depend on X1; therefore it is also the unconditional distribution of M2·1.

• Moreover, it follows that M2·1 is independent of X1 and hence of M11 = X1^T X1.

• Furthermore, P(I − P) = P − P = 0, so using the same rationale as in the proof of Theorem 10 we see that, given X1,

    M2·1 = X2·1^T P X2·1 and X1^T (I − P) X2·1

  are independent.
STAT3914: Lecture 3 76
• Now,

    X1^T (I − P) X2·1 = X1^T (X1 M11^{-1} X1^T)(X2 − X1 Σ11^{-1} Σ12)
                      = M12 − M11 Σ11^{-1} Σ12.

• Therefore, given X1, M2·1 is independent of

    M12 = X1^T (I − P) X2·1 + M11 Σ11^{-1} Σ12.

• Since M2·1 is also independent of X1, it is independent of (X1, M12) (why?).

• It follows that M2·1 is independent of (M11, M12) (why?).
STAT3914: Lecture 3 77
Linear Independence Lemma
• Linear Independence Lemma: Suppose that Xi ∼ NID_p(0, Σ) for i = 1, …, n ≤ p and that Σ is positive definite. Then X1, …, Xn are almost surely linearly independent.

• Proof of lemma.

  Firstly, without loss of generality we can assume that Σ = I_p (exercise).

  The proof is by induction, where the case n = 1 is trivial since X1 ∼ N_p(0, I) implies that P(X1 = 0) = 0.
STAT3914: Lecture 3 78
• Now assume that the lemma holds for some n < p. In this case,

    ∗ either X_{n+1} ∉ Span⟨X1, …, Xn⟩ almost surely and we are done,
    ∗ or there exists a set A ⊂ Ω with P(A) > 0 and random variables α_i such that

        X_{n+1}(ω) = ∑_{i=1}^{n} α_i(ω) Xi(ω) and X1(ω), …, Xn(ω) are linearly independent for all ω ∈ A.

• The latter identity can be thought of as p equations in the n unknowns α_i(ω). Since X1(ω), …, Xn(ω) are linearly independent, without loss of generality the α_i(ω) are uniquely determined from X1(ω), …, Xn(ω) and X_{n+1,k}(ω) for k = 1, …, n.
STAT3914: Lecture 3 79
• But

    X_{n+1,n+1}(ω) = ∑_{i=1}^{n} α_i(ω) X_{i,n+1}(ω),

  and it follows that on the set A, X_{n+1,n+1} can be determined from X1(ω), …, Xn(ω) and X_{n+1,k}(ω) for k = 1, …, n.

• This contradicts the independence assumption.

• Corollary: If M ∼ W_p(Σ, n) with Σ positive definite and n ≥ p then |M| > 0 almost surely (exercise).
STAT3914: Lecture 3 80
• Theorem 12. If M ∼ W_p(Σ, n) with Σ positive definite and n ≥ p then:

  (a) For any fixed a ≠ 0 ∈ R^p,

        (a^T Σ^{-1} a) / (a^T M^{-1} a) ∼ χ²_{n−p+1}.

      In particular, Σ^{ii} / M^{ii} ∼ χ²_{n−p+1}.

  (b) M^{ii} is independent of all elements of M except M_ii.

• Proof:

  Recall that M^{22} = M2·1^{-1} and, similarly, Σ^{22} = Σ2·1^{-1}.

  Therefore, from Theorem 11 with r = p − 1 we have

    (M^{pp})^{-1} = M2·1 ∼ W_{p−r}(Σ2·1, n − r) ∼ (Σ^{pp})^{-1} χ²_{n−p+1}.

  Moreover, M^{pp} is independent of (M11, M12), i.e., M^{pp} is independent of all elements of M except M_pp.
STAT3914: Lecture 3 81
• This proves the theorem for a = e_p, and hence for a = e_i (why?).

• For a general a ≠ 0 ∈ R^p, let A be a non-singular matrix with last column a, that is, Ae_p = a. Then

    (a^T Σ^{-1} a) / (a^T M^{-1} a) = (e_p^T A^T Σ^{-1} A e_p) / (e_p^T A^T M^{-1} A e_p)
                                    = (e_p^T Σ_A^{-1} e_p) / (e_p^T M_A^{-1} e_p),

  where Σ_A = A^{-1} Σ (A^{-1})^T and M_A = A^{-1} M (A^{-1})^T.

• The proof is complete since, by Theorem 5, M_A ∼ W_p(Σ_A, n).
STAT3914: Lecture 3 82
• Theorem 13. If M ∼ W_p(Σ, n) where n ≥ p, then

    |M| = |Σ| · ∏_{i=1}^{p} U_i,

  where the U_i are independent random variables with U_i ∼ χ²_{n−i+1}.

• Proof: Without loss of generality we may assume that Σ is positive definite (exercise). We use induction on p.

  For the case p = 1 we have M ∼ σ² χ²_n.

  Write

    M = ( M11  M12 )
        ( M21  M22 ),

  where M11 is a (p − 1) × (p − 1) matrix.
STAT3914: Lecture 3 83
• Recall that

    M^{22} = [M^{-1}]_{pp} = |M11| / |M|

  (similarly, Σ^{22} = |Σ11| / |Σ|).

• Therefore,

    |M| = |M11| / M^{22} = (Σ^{22} / M^{22}) · (|Σ| / |Σ11|) · |M11|.

• But from Theorem 12,

    U_p ≡ Σ^{22} / M^{22} = [Σ^{-1}]_{pp} / [M^{-1}]_{pp} ∼ χ²_{n−p+1},

  and it is independent of M11.

• We complete the proof by noting that M11 ∼ W_{p−1}(Σ11, n).
STAT3914: Lecture 3 84
• Hence, by the inductive hypothesis,

    |M11| = |Σ11| ∏_{i=1}^{p−1} U_i,

  where the U_i ∼ χ²_{n−i+1} are independent of one another; they depend only on M11 and hence are independent of U_p.
STAT3914: Lecture 3 85
• Corollary: If X has n rows which are NID_p(µ, Σ) with Σ positive definite and

    S = X^T X − n X̄ X̄^T,

  then

    Σ^{kk} / S^{kk} ∼ χ²_{n−p},

    (a^T Σ^{-1} a) / (a^T S^{-1} a) ∼ χ²_{n−p}   for any fixed a ≠ 0.

  If S11 is an r × r submatrix of S, then

    S11 ∼ W_r(Σ11, n − 1),

  independently of

    S2·1 = S22 − S21 S11^{-1} S12 ∼ W_{p−r}(Σ2·1, n − r − 1).
STAT3914: Lecture 3 86
Hotelling’s T²

• Definition. Let Z ∼ N_p(0, I) independently of M ∼ W_p(I, n) with n ≥ p. Then

    n Z^T M^{-1} Z ∼ T²(p, n),

  where T²(p, n) denotes Hotelling's T² distribution.

• Theorem 14: If Y ∼ N_p(µ, Σ) and M ∼ W_p(Σ, n) independently of Y, with Σ positive definite and n ≥ p, then

    n (Y − µ)^T M^{-1} (Y − µ) ∼ T²(p, n).

• Proof:

  Let Z = Σ^{-1/2}(Y − µ) and let M_Σ = Σ^{-1/2} M Σ^{-1/2}.

  Then Z ∼ N_p(0, I), M_Σ ∼ W_p(I, n) and

    n (Y − µ)^T M^{-1} (Y − µ) = n Z^T M_Σ^{-1} Z.
STAT3914: Lecture 3 87
• Theorem 15:

    T²(p, n) ∼ (np / (n − p + 1)) F_{p, n−p+1}.

• Proof:

  Let Z ∼ N_p(0, I) independently of M ∼ W_p(I, n).

  By Theorem 12, for a ≠ 0, conditioned on Z = a,

    D ≡ Z^T Z / (Z^T M^{-1} Z) ∼ χ²_{n−p+1}.

  This conditional distribution of D does not depend on a, hence it is also the unconditional distribution of D.

  For the same reason, D is independent of Z, and therefore also of R ≡ Z^T Z ∼ χ²_p.

  It follows that

    Z^T M^{-1} Z = R / D ∼ (p / (n − p + 1)) F_{p, n−p+1},

  and hence n Z^T M^{-1} Z ∼ (np / (n − p + 1)) F_{p, n−p+1}.
STAT3914: Lecture 3 88
• Corollary 1: If X̄ is the mean of a sample of size n ≥ p drawn from a N_p(µ, Σ) population with Σ positive definite, and if (n − 1)^{-1} S is the unbiased sample covariance matrix, then

    n (X̄ − µ)^T [(1/(n − 1)) S]^{-1} (X̄ − µ) ∼ ((n − 1)p / (n − p)) F_{p, n−p}.

• Proof:

  It suffices to show that the LHS has a T²(p, n − 1) distribution.

  Let Y = √n X̄ and let S = X^T X − n X̄ X̄^T.

  Theorem 8 states that S ∼ W_p(Σ, n − 1), independently of Y ∼ N_p(√n µ, Σ).

  Therefore, by Theorem 14,

    (n − 1)(Y − √n µ)^T S^{-1} (Y − √n µ) ∼ T²(p, n − 1).

  Finally,

    n (X̄ − µ)^T [(1/(n − 1)) S]^{-1} (X̄ − µ) = (n − 1)(Y − √n µ)^T S^{-1} (Y − √n µ).
STAT3914: Lecture 3 89
STAT3914: Lecture 3 90
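As a closing illustration, a minimal numpy/scipy sketch (simulated data under the null; sample size, dimension and covariance are assumptions for illustration) of the one-sample Hotelling T² statistic from Corollary 1 and its F calibration:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(10)
n, p = 30, 3
mu0 = np.zeros(p)                                  # hypothesised mean
A = rng.standard_normal((p, p))
Sigma = A @ A.T + np.eye(p)

X = rng.multivariate_normal(mu0, Sigma, size=n)    # sample drawn under H0
xbar = X.mean(axis=0)
S_unb = np.cov(X, rowvar=False)                    # S / (n - 1)

T2 = n * (xbar - mu0) @ np.linalg.solve(S_unb, xbar - mu0)
F_stat = T2 * (n - p) / ((n - 1) * p)              # F_stat ~ F_{p, n-p}
p_value = 1 - stats.f(p, n - p).cdf(F_stat)
print(T2, F_stat, p_value)
```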