Semester 1, 2012 (Last adjustments: September 5, 2012)
Lecture Notes
STAT3914 – Applied Statistics
Lecturer
Dr. John T. Ormerod
School of Mathematics & Statistics F07
University of Sydney
(w) 02 9351 5883
(e) john.ormerod (at) sydney.edu.au
STAT3914 – Outline
• Multivariate Normal Distribution
• Point Estimates and Confidence Intervals for MVN
• Wishart Distribution
• Derivation of Hotelling's T²
• The Expectation Maximization Algorithm
• Missing Data Analysis
STAT3914: Lecture 3 2
The standard univariate normal distribution
• A random variable (RV) Z has a standard normal distribution, N(0, 1), if Z has density

    ϕ(z) = (2π)^{-1/2} e^{-z²/2},   −∞ < z < ∞.

• Z has moment generating function

    E(e^{tZ}) = ∫_{−∞}^{∞} e^{tz} (2π)^{-1/2} e^{-z²/2} dz
              = ∫_{−∞}^{∞} (2π)^{-1/2} e^{-(z−t)²/2} · e^{t²/2} dz
              = e^{t²/2}.

• It follows that E[Z] = 0 and Var[Z] = 1.
STAT3914: Lecture 3 3
The standard multivariate normal distribution
• Let U = (U_1, U_2, …, U_p)^T be a p-vector of NID(0, 1) RVs.

• The joint density of U is given by

    ∏_{i=1}^{p} (2π)^{-1/2} e^{-u_i²/2} = (2π)^{-p/2} exp(−½ u^T u).

• Clearly,

    E[U] = 0,

  where 0 is the zero vector in R^p.

• Similarly,

    Cov[U] = E[UU^T] = I,

  where I = I_p is the p × p identity matrix.

• We say U has a standard multivariate normal distribution, which we denote U ∼ N_p(0, I).
STAT3914: Lecture 3 4
Non-standard univariate normal distribution
• Given a RV Z ∼ N(0, 1) we can generate a non-standard normal with mean µ ∈ R and standard deviation σ > 0 by defining X = µ + σZ.

• The density of X can be readily recovered from the formula for the density of transformed RVs.

• Let ψ(z) = µ + σz; then X = ψ(Z) and ψ^{-1}(x) = (x − µ)/σ, so

    f_X(x) = f_Z(ψ^{-1}(x)) |d/dx ψ^{-1}(x)|
           = f_Z((x − µ)/σ) · |1/σ|
           = (2πσ²)^{-1/2} exp(−(x − µ)²/(2σ²)).
STAT3914: Lecture 3 5
Non-standard univariate normal distribution
• The MGF of X is given by

    E[e^{tX}] = e^{tµ} E[e^{tσZ}] = exp(tµ + ½ σ²t²).

• An analogous transformation creates the general multivariate normal random vector; in the univariate case, if Z ∼ N(0, 1) then

    X = µ + σZ ∼ N(µ, σ²).
STAT3914: Lecture 3 6
Non-standard multivariate normal distribution
• Let U ∼ N_p(0, I) and define

    X = ψ(U) ≡ µ + AU,

  where µ ∈ R^p and A is a p × p non-singular matrix.

• Clearly,

    E[X] = µ,

  and its covariance is given by

    Cov[X] = E[(X − µ)(X − µ)^T]
           = E[AUU^T A^T]
           = A E[UU^T] A^T
           = AA^T
           ≡ Σ.
STAT3914: Lecture 3 7
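The construction X = µ + AU above is exactly how multivariate normal draws are generated in practice, with A taken to be the Cholesky factor of Σ. A minimal numpy sketch (the dimensions and values below are illustrative, not from the notes):

```python
import numpy as np

rng = np.random.default_rng(0)
mu = np.array([1.0, -2.0, 0.5])
Sigma = np.array([[2.0, 0.5, 0.3],
                  [0.5, 1.0, 0.2],
                  [0.3, 0.2, 1.5]])

A = np.linalg.cholesky(Sigma)          # A satisfies A A^T = Sigma
U = rng.standard_normal((100_000, 3))  # rows are N_3(0, I) draws
X = mu + U @ A.T                       # rows are N_3(mu, Sigma) draws

print(X.mean(axis=0))                  # approximately mu
print(np.cov(X, rowvar=False))         # approximately Sigma
```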
Non-standard multivariate normal distribution
Suppose AA^T = Σ; then:

• A is a kind of “square root” of Σ (one valid choice is the symmetric square root A = Σ^{1/2}) and

    |Σ| = |AA^T| = |A| · |A^T| = |A|² > 0.

• Claim: If U ∼ N_p(0, I) and X = ψ(U) ≡ µ + AU then X has a non-singular multivariate normal distribution, which we will denote

    X ∼ N_p(µ, Σ),

  and X has density

    f_X(x) = (2π)^{-p/2} |Σ|^{-1/2} exp(−½ (x − µ)^T Σ^{-1} (x − µ)).

• Proof: Similarly to the univariate case, with ψ^{-1}(x) = A^{-1}(x − µ),

    f_X(x) = f_U(ψ^{-1}(x)) |J_{ψ^{-1}}|
           = (2π)^{-p/2} |J_ψ|^{-1} exp(−½ [A^{-1}(x − µ)]^T [A^{-1}(x − µ)]).

STAT3914: Lecture 3 8
• The Jacobian of a linear transformation is the determinant of its matrix, so

    |J_ψ| = | ∂ψ_i/∂u_j |_{i,j=1,…,p} = |A| = |Σ|^{1/2}.

• The exponent simplifies as follows:

    [A^{-1}(x − µ)]^T [A^{-1}(x − µ)] = (x − µ)^T (A^{-1})^T (A^{-1}) (x − µ)
                                      = (x − µ)^T Σ^{-1} (x − µ),

  as (A^{-1})^T (A^{-1}) = (AA^T)^{-1} = Σ^{-1}.

• Thus, X has density

    f_X(x) = (2π)^{-p/2} |Σ|^{-1/2} exp(−½ (x − µ)^T Σ^{-1} (x − µ)).
STAT3914: Lecture 3 9
The MGF of the multivariate normal distribution
• Claim: If X ∼ N_p(µ, Σ) then the moment generating function (MGF) of X is given by

    M_X(s) = exp(s^T µ + ½ s^T Σ s).

• Proof: For s ∈ R^p we have

    M_X(s) = E[e^{s^T X}]
           = E[exp(s^T µ + s^T AU)]
           = exp(s^T µ) · E[e^{a^T U}],   where a^T = s^T A,
           = exp(s^T µ) · exp(½ a^T a),   since a^T U ∼ N(0, a^T a),
           = exp(s^T µ + ½ s^T Σ s).

• This expression for the MGF is well defined even when Σ is singular. While it would be awkward to define the multivariate normal distribution via its MGF, it does suggest the following definition...
STAT3914: Lecture 3 10
The (general) multivariate normal distribution
• Definition. A p-dimensional random vector X has a multivariate normal distribution if for every a ∈ R^p the linear combination a^T X is univariate normal.

• Claim. X is a p-dimensional multivariate normal random vector if and only if its MGF can be expressed as

    M_X(s) = exp(s^T µ + ½ s^T Σ s).

• Aside: if M_X is as above then, by differentiating the MGF, it follows that E[X] = µ and Cov(X) = Σ.
STAT3914: Lecture 3 11
The (general) multivariate normal distribution
• Proof. Suppose the above expression for the MGF holds. Then the MGF of Y = a^T X is given by

    M_Y(t) = E[e^{t a^T X}] = M_X(ta) = exp(t a^T µ + ½ t² a^T Σ a),

  so that

    a^T X ∼ N(a^T µ, a^T Σ a)

  by the uniqueness theorem for MGFs.
STAT3914: Lecture 3 12
Conversely, if Y_s = s^T X is univariate normal for all s ∈ R^p then

    E[e^{Y_s}] = exp(E[Y_s] + ½ Var[Y_s]).

However, E[Y_s] = s^T E[X] = s^T µ and

    Var[Y_s] = E[(s^T X − s^T µ)(s^T X − s^T µ)^T]
             = s^T [E(X − µ)(X − µ)^T] s
             = s^T Cov(X) s.

Therefore, with Σ = Cov(X),

    M_X(s) = E[e^{Y_s}] = exp(s^T µ + ½ s^T Σ s).
STAT3914: Lecture 3 13
Marginal Distributions
• If X ∼ N_p(µ, Σ) then for all a ∈ R^p,

    a^T X ∼ N(a^T µ, a^T Σ a).

• So by choosing e_i = (0, …, 0, 1, 0, …, 0)^T, with the 1 in the ith position, we have

    X_i ∼ N(µ_i, Σ_ii).

• Thus all marginal distributions are normal.

• The converse is not true (see exercises).

• Moreover, if X1 = (X_1, …, X_r)^T consists of the first r coordinates of X, then X1 is an r-dimensional normal RV (why?):

    X1 ∼ N_r(µ1, Σ11),

  where [µ1]_i = [µ]_i and [Σ11]_{i,j} = [Σ]_{i,j} for i, j = 1, …, r.
STAT3914: Lecture 3 14
• Claim. If Σ is diagonal then the X_i are independent.

• Proof. (Sketch: when Σ is diagonal, both the MGF and the joint density factorize into a product over the coordinates, so the X_i are independent.)
STAT3914: Lecture 3 15
Block/partitioned matrices
• Suppose X is a p-dimensional random vector.

• Let X1 = (X_1, …, X_r)^T be the first r coordinates of X and let X2 = (X_{r+1}, …, X_p)^T be its last q = p − r coordinates, so that

    X = ( X1 )
        ( X2 ).

• Let µi = E[Xi] (i = 1, 2); then

    µ ≡ E[X] = ( µ1 )
               ( µ2 ).

• Similarly, with

    [Σ]_{ij} ≡ Cov(Xi, Xj) ≡ E[Xi Xj^T] − E[Xi] E[Xj]^T
STAT3914: Lecture 3 16
(so that Σ11 is r × r and Σ12 is r × q, etc.),

    Var(X) ≡ Σ = ( Σ11  Σ12 )
                 ( Σ21  Σ22 ).
STAT3914: Lecture 3 17
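The block notation above is easy to mirror with array slicing. A minimal numpy sketch (dimensions and values are illustrative, not from the notes), partitioning µ and Σ with r = 2 leading coordinates out of p = 4:

```python
import numpy as np

rng = np.random.default_rng(1)
p, r = 4, 2
mu = np.arange(1.0, p + 1)               # hypothetical mean vector
B = rng.standard_normal((p, p))
Sigma = B @ B.T + p * np.eye(p)          # a positive definite covariance

mu1, mu2 = mu[:r], mu[r:]
S11, S12 = Sigma[:r, :r], Sigma[:r, r:]  # Sigma11 (r x r), Sigma12 (r x q)
S21, S22 = Sigma[r:, :r], Sigma[r:, r:]  # Sigma21 (q x r), Sigma22 (q x q)
assert np.allclose(S12, S21.T)           # Sigma is symmetric
```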
Degenerate multivariate normal distribution
• Suppose X is multivariate normal with mean µ and a singular covariance matrix Σ of rank r < p.

• As Σ is symmetric it has (p − r) eigenvalues equal to 0, and there exists an orthogonal matrix P such that

    P^T Σ P = ( D  0 )
              ( 0  0 ) ≡ Λ,

  where

    D = diag(λ1, …, λr), with λ1 ≥ λ2 ≥ … ≥ λr > 0,

  and the columns of P are the eigenvectors of Σ.
STAT3914: Lecture 3 18
• Let Y = P^T X and denote Y1 = (Y_1, Y_2, …, Y_r)^T and Y2 = (Y_{r+1}, …, Y_p)^T, and similarly s = (s1, s2) and γ = P^T µ = (γ1, γ2).

• Then Y has MGF

    E[e^{s1^T Y1 + s2^T Y2}] = E[e^{s^T Y}]
                             = E[e^{s^T P^T X}]
                             = exp(s^T P^T µ + ½ s^T P^T Σ P s)
                             = exp(s^T γ + ½ s^T Λ s)
                             = exp(s1^T γ1 + ½ s1^T D s1) · exp(s2^T γ2).

• The RHS is a product of two MGFs: one of a distribution that is the constant γ2, and the other of N_r(γ1, D).

• By the uniqueness of the MGF, Y2 ≡ γ2 and Y1 ∼ N_r(γ1, D).

• Note that Y_1, …, Y_r are independent. Does this ring a bell? Principal Component Analysis...
STAT3914: Lecture 3 19
• It also follows that if X ∼ N(µ, Σ) with Σ of rank r then there exists an invertible linear transformation G such that X = GW + µ, where

    W_{r+1} = … = W_p = 0 and (W_1, …, W_r)^T ∼ N_r(0, I);

  see exercises.
STAT3914: Lecture 3 20
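The eigen-rotation used above is easy to check numerically. A minimal numpy sketch (dimensions, seed and mean are illustrative, not from the notes) with a rank-2 covariance in p = 3 dimensions; after rotating by the eigenvectors, the last coordinate of Y = P^T X is the constant γ_3:

```python
import numpy as np

rng = np.random.default_rng(2)
A = rng.standard_normal((3, 2))
Sigma = A @ A.T                              # rank 2, so singular
mu = np.array([1.0, 0.0, -1.0])

lam, P = np.linalg.eigh(Sigma)               # columns of P are eigenvectors
order = np.argsort(lam)[::-1]                # decreasing eigenvalue order
lam, P = lam[order], P[:, order]

X = mu + rng.standard_normal((5, 2)) @ A.T   # five draws from N_3(mu, Sigma)
Y = X @ P                                    # rows are (P^T x_i)^T
print(Y[:, 2] - (P.T @ mu)[2])               # ~ 0: the last coordinate is constant
```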
Block independence
• Theorem 1. X1 and X2 are independent if and only if Σ12 = 0.

• Proof. If X1 and X2 are independent then X1 and X2 are element-wise uncorrelated and so Σ12 = 0.

  Conversely, if Σ12 = 0 then Σ21 = 0, and if s = (s1, s2)^T (with s1 ∈ R^r) we have

    s^T Σ s = s1^T Σ11 s1 + s2^T Σ22 s2.

  Therefore, X has MGF

    M_X(s) = exp(s^T µ + ½ s^T Σ s)
           = exp(s1^T µ1 + ½ s1^T Σ11 s1) · exp(s2^T µ2 + ½ s2^T Σ22 s2),

  and so (why?) X1 and X2 are independent with

    X1 ∼ N_r(µ1, Σ11) and X2 ∼ N_s(µ2, Σ22).
STAT3914: Lecture 3 21
STAT3914: Lecture 3 22
Detour into the linear algebra of block matrices
• Let

    A = ( A11  A12 )
        ( A21  A22 )

  be an invertible p × p matrix (where A11 is r × r and A22 is s × s with r + s = p).

• It is common to denote the inverse of A by

    A^{-1} = ( A^{11}  A^{12} )
             ( A^{21}  A^{22} )

  (where A^{11} is r × r and A^{22} is s × s).

• By the definition of a matrix inverse we have the relations

    ( A^{11}  A^{12} ) ( A11  A12 )   ( I_r  0   )
    ( A^{21}  A^{22} ) ( A21  A22 ) = ( 0    I_s ).
STAT3914: Lecture 3 23
• In particular,

    A^{21} A11 + A^{22} A21 = 0,
    A^{21} A12 + A^{22} A22 = I_s.

• Multiply the first equation from the right by A11^{-1} A12 and subtract it from the second equation to find that

    A^{22} (A22 − A21 A11^{-1} A12) = I_s  ⟹  A^{22} = (A22 − A21 A11^{-1} A12)^{-1}.

• The same kind of algebra yields (exercise)

    A^{12} = −A11^{-1} A12 A^{22},
    A^{21} = −A^{22} A21 A11^{-1},
    −A^{12} A21 = A^{11} A11 − I_r,
    A^{22} = (A22 − A21 A11^{-1} A12)^{-1},
    A^{11} = (A11 − A12 A22^{-1} A21)^{-1}.
STAT3914: Lecture 3 24
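These identities are easy to sanity-check on a random matrix. A minimal numpy sketch (the matrix and block sizes are illustrative, not from the notes); the variables B.. stand for the superscripted blocks A^{..} of the inverse:

```python
import numpy as np

rng = np.random.default_rng(3)
p, r = 5, 2
A = rng.standard_normal((p, p)) + p * np.eye(p)   # invertible with high probability
A11, A12 = A[:r, :r], A[:r, r:]
A21, A22 = A[r:, :r], A[r:, r:]

Ainv = np.linalg.inv(A)
B11, B12 = Ainv[:r, :r], Ainv[:r, r:]
B21, B22 = Ainv[r:, :r], Ainv[r:, r:]

inv = np.linalg.inv
print(np.allclose(B22, inv(A22 - A21 @ inv(A11) @ A12)))   # True
print(np.allclose(B11, inv(A11 - A12 @ inv(A22) @ A21)))   # True
print(np.allclose(B12, -inv(A11) @ A12 @ B22))             # True
```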
Linear Predictor (Estimator)
• X = (X_1, X_2, X3^T)^T is a RV.

• A linear predictor (estimator) of X_1 given X3 is of the form b^T X3, where b ∈ R^{p−2} (ignore X_2 for now).

• We seek the best such linear predictor:

    b_0 ≡ argmin_b E[(X_1 − b^T X3)²].

• Assume µ = 0 (the results hold for any µ) and let

    f(b) ≡ E[(X_1 − b^T X3)²]
         = E[(X_1 − b^T X3)(X_1 − b^T X3)^T]
         = Σ11 − b^T Σ31 − Σ13 b + b^T Σ33 b
         = Σ11 − 2 b^T Σ31 + b^T Σ33 b.

• Use calculus or algebra to minimize f.
STAT3914: Lecture 3 25
• We want to minimize (with respect to b)

    f(b) = Σ11 − 2 b^T Σ31 + b^T Σ33 b.

• Clearly, ∇_b(−2 b^T Σ31) = −2 Σ31 ∈ R^{p−2}.

• Since b^T A b = ∑_{i,j} a_{ij} b_i b_j, it follows for a symmetric A that

    ∇_b(b^T A b) = 2 A b.

• Therefore,

    ∇f(b) = −2 Σ31 + 2 Σ33 b.

• Thus, the unique stationary point is attained at

    b_0 = Σ33^{-1} Σ31,

  which is a minimum (as Σ33 is positive definite).
STAT3914: Lecture 3 26
• Algebra: complete the quadratic form

    f(b) = Σ11 − b^T Σ31 − Σ13 b + b^T Σ33 b
         = Σ11 − Σ13 Σ33^{-1} Σ31 + (b − Σ33^{-1} Σ31)^T Σ33 (b − Σ33^{-1} Σ31)
         ≥ Σ11 − Σ13 Σ33^{-1} Σ31,

  with equality if and only if b = Σ33^{-1} Σ31.

• Thus, the best linear estimator (predictor) of X_1 given X3 is

    P_{X3}(X_1) = Σ13 Σ33^{-1} X3.

  Note that this is a normal RV. Similarly, P_{X3}(X_2) = Σ23 Σ33^{-1} X3.

• This is also known as the projection of X_1 (or X_2) on the subspace spanned by X3.
STAT3914: Lecture 3 27
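A minimal numpy sketch (the covariance below is randomly generated, purely illustrative) of the best linear predictor coefficients b_0 = Σ33^{-1} Σ31 and the resulting minimal mean squared error:

```python
import numpy as np

rng = np.random.default_rng(4)
p = 4
B = rng.standard_normal((p, p))
Sigma = B @ B.T + np.eye(p)            # hypothetical covariance, mu = 0

S11 = Sigma[0, 0]                      # Var(X_1)
S13 = Sigma[0, 2:]                     # Cov(X_1, X3)
S31 = Sigma[2:, 0]                     # Cov(X3, X_1)
S33 = Sigma[2:, 2:]                    # Var(X3)

b0 = np.linalg.solve(S33, S31)                 # Sigma33^{-1} Sigma31
mse = S11 - S13 @ np.linalg.solve(S33, S31)    # Sigma11 - Sigma13 Sigma33^{-1} Sigma31
print(b0, mse)
```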
Conditional Distributions
• Let

    X = ( X1 )
        ( X2 ) ∼ N_p(µ, Σ),

  where Σ is non-singular, X1 = (X_1, …, X_r)^T and X2 = (X_{r+1}, …, X_p)^T. Then

    X1 ∼ N_r(µ1, Σ11) and X2 ∼ N_s(µ2, Σ22),

  where s = p − r.

• What is the conditional distribution of X2 given that X1 = x1?
STAT3914: Lecture 3 28
• Recall that for the bivariate normal (X_1, X_2) we defined

    f_{X_2|X_1}(x_2 | x_1) = f_{X_1,X_2}(x_1, x_2) / f_{X_1}(x_1).

• We saw that for X_i ∼ N(0, 1) with Cov(X_1, X_2) = ρ,

    X_2 | X_1 = x_1 ∼ N(ρ x_1, 1 − ρ²).
STAT3914: Lecture 3 29
• We can pursue the obvious generalization here (still assuming µ = 0):

    log f_{X2|X1}(X2 | X1)
      = log [ f_X(X) / f_{X1}(X1) ]
      = C − ½ [ (X1^T, X2^T) Σ^{-1} (X1^T, X2^T)^T − X1^T Σ11^{-1} X1 ]
      = C − ½ [ X1^T (Σ^{11} − Σ11^{-1}) X1 + X2^T Σ^{21} X1 + X1^T Σ^{12} X2 + X2^T Σ^{22} X2 ],

  where C = C(Σ, r) is a constant and

    Σ^{-1} = ( Σ^{11}  Σ^{12} )
             ( Σ^{21}  Σ^{22} ).
STAT3914: Lecture 3 30
From the previous slide,

    log f_{X2|X1}(X2 | X1)
      = C − ½ [ X1^T (Σ^{11} − Σ11^{-1}) X1 + X2^T Σ^{21} X1 + X1^T Σ^{12} X2 + X2^T Σ^{22} X2 ].

• Using the block inverse identities we obtain

    Σ^{11} − Σ11^{-1} = (Σ^{11} Σ11 − I) Σ11^{-1} = −Σ^{12} Σ21 Σ11^{-1} = Σ11^{-1} Σ12 Σ^{22} Σ21 Σ11^{-1},
    Σ^{21} = −Σ^{22} Σ21 Σ11^{-1},
    Σ^{12} = −Σ11^{-1} Σ12 Σ^{22}.

  Substituting these into the above expression we have

    X1^T (Σ^{11} − Σ11^{-1}) X1 + X2^T Σ^{21} X1 + X1^T Σ^{12} X2 + X2^T Σ^{22} X2
      = X1^T Σ11^{-1} Σ12 Σ^{22} Σ21 Σ11^{-1} X1 − X2^T Σ^{22} Σ21 Σ11^{-1} X1
        − X1^T Σ11^{-1} Σ12 Σ^{22} X2 + X2^T Σ^{22} X2
      = (X2 − Σ21 Σ11^{-1} X1)^T Σ^{22} (X2 − Σ21 Σ11^{-1} X1).

• Therefore,

    log f_{X2|X1}(X2 | X1) = C − ½ (X2 − Σ21 Σ11^{-1} X1)^T Σ^{22} (X2 − Σ21 Σ11^{-1} X1).
STAT3914: Lecture 3 31
• As a function of X2 this is the density of a multivariate normal with mean Σ21 Σ11^{-1} X1 and covariance matrix

    (Σ^{22})^{-1} = Σ22 − Σ21 Σ11^{-1} Σ12.

• We have essentially proved:

  Theorem 2. If X ∼ N_p(µ, Σ) then the conditional distribution of X2 given X1 = x is

    N(µ2 + Σ21 Σ11^{-1} (x − µ1), Σ22 − Σ21 Σ11^{-1} Σ12).
STAT3914: Lecture 3 32
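A minimal numpy sketch (random covariance and an assumed conditioning value, both purely illustrative) of the conditional mean and covariance given by Theorem 2:

```python
import numpy as np

rng = np.random.default_rng(5)
p, r = 4, 2
B = rng.standard_normal((p, p))
Sigma = B @ B.T + np.eye(p)
mu = np.zeros(p)

S11, S12 = Sigma[:r, :r], Sigma[:r, r:]
S21, S22 = Sigma[r:, :r], Sigma[r:, r:]
x1 = np.array([1.0, -0.5])                            # hypothetical value of X1

cond_mean = mu[r:] + S21 @ np.linalg.solve(S11, x1 - mu[:r])
cond_cov = S22 - S21 @ np.linalg.solve(S11, S12)      # Sigma_{2.1}
print(cond_mean)
print(cond_cov)
```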
• Rather than routinely generalizing the result to include the case µ ≠ 0, we give a different proof for the general case.

• Recall that if (X_1, X_2) is bivariate normal with X_i ∼ N(0, 1) and correlation ρ, then we can de-correlate X_2 from X_1 as follows. Let Y_2 = X_2 − ρX_1; then

    Cov(Y_2, X_1) = Cov(X_2, X_1) − ρ Cov(X_1, X_1) = 0.

• So X_2 = Y_2 + ρX_1, where Y_2 ∼ N(0, 1 − ρ²) is independent of X_1 (why?).

• Therefore, conditioning on X_1 = x_1,

    X_2 ∼ N(ρ x_1, 1 − ρ²).

• Next we generalize this to the multivariate normal case.
STAT3914: Lecture 3 33
• Let Q = Σ21 Σ11^{-1} and let

    Y2 = X2 − Q X1.

  Note that Q X1 is the best linear predictor of X2 given X1.

• Y2 and X1 are uncorrelated, since

    Cov(Y2, X1) = Cov(X2, X1) − Cov(Q X1, X1)
                = Σ21 − Q Σ11
                = 0.

• Then

    ( X1 )
    ( Y2 )

  is a multivariate normal RV (why?).
STAT3914: Lecture 3 34
• It follows that Y2 and X1 are independent and

    Y2 ∼ N(µ2 − Qµ1, Σ22 − Σ21 Σ11^{-1} Σ12),

  as Y2 = X2 − Q X1 with Q = Σ21 Σ11^{-1}, so

    Cov(Y2) = Cov(X2 − Q X1, X2 − Q X1)
            = Σ22 − Σ21 Q^T − Q Σ12 + Q Σ11 Q^T
            = Σ22 − Σ21 Σ11^{-1} Σ12.

• Finally, X2 = Y2 + Q X1, so the conditional distribution of X2 given X1 = x is

    N(µ2 − Qµ1 + Qx, Σ22 − Σ21 Σ11^{-1} Σ12).

• Notation: Σ2·1 = Σ22 − Σ21 Σ11^{-1} Σ12 = (Σ^{22})^{-1}.
STAT3914: Lecture 3 35
Partial Correlation
• The best linear estimator of X_1 given X3 is

    P_{X3}(X_1) = Σ13 Σ33^{-1} X3.

• The partial correlation between X_1 and X_2 given X3 is

    ρ_{12·34…p} ≡ Cor[X_1 − P_{X3}(X_1), X_2 − P_{X3}(X_2)].

• Theorem: Let

    Σ^{-1} = ( Σ^{11}  Σ^{12}  Σ^{13} )
             ( Σ^{21}  Σ^{22}  Σ^{23} )
             ( Σ^{31}  Σ^{32}  Σ^{33} )

  and let d = Σ^{11} Σ^{22} − Σ^{12} Σ^{21}; then

    ρ_{12·34…p} = −Σ^{12} / √(Σ^{11} Σ^{22}).
STAT3914: Lecture 3 36
• Proof: Covariance is bilinear, so

    Cov[X_1 − P_{X3}(X_1), X_2 − P_{X3}(X_2)]
      = Cov(X_1, X_2) − Cov(X_1, P_{X3}(X_2)) − Cov(P_{X3}(X_1), X_2) + Cov(P_{X3}(X_1), P_{X3}(X_2))
      = Σ12 − Cov(X_1, X3) Σ33^{-1} Σ23^T − Σ13 Σ33^{-1} Cov(X3, X_2) + Σ13 Σ33^{-1} Cov(X3, X3) Σ33^{-1} Σ23^T
      = Σ12 − Σ13 Σ33^{-1} Σ32.

• Similarly,

    Cov[X_1 − P_{X3}(X_1)]
      = Σ11 − Σ13 Σ33^{-1} Σ13^T − Σ13 Σ33^{-1} Σ31 + Σ13 Σ33^{-1} Σ33 Σ33^{-1} Σ13^T
      = Σ11 − Σ13 Σ33^{-1} Σ31.

• Hence,

    ρ_{12·34…p} = (Σ12 − Σ13 Σ33^{-1} Σ32) / √[(Σ11 − Σ13 Σ33^{-1} Σ31)(Σ22 − Σ23 Σ33^{-1} Σ32)].
STAT3914: Lecture 3 37
• By our block inverse identities,

    ( Σ^{11}  Σ^{12} )^{-1}         (  Σ^{22}  −Σ^{12} )
    ( Σ^{21}  Σ^{22} )      = (1/d) ( −Σ^{21}   Σ^{11} )

  and also

    ( Σ^{11}  Σ^{12} )^{-1}   ( Σ11  Σ12 )   ( Σ13 )
    ( Σ^{21}  Σ^{22} )      = ( Σ21  Σ22 ) − ( Σ23 ) Σ33^{-1} ( Σ31  Σ32 ).

• Therefore,

    ρ_{12·34…p} = (Σ12 − Σ13 Σ33^{-1} Σ32) / √[(Σ11 − Σ13 Σ33^{-1} Σ31)(Σ22 − Σ23 Σ33^{-1} Σ32)]
                = −(Σ^{12}/d) / √[(Σ^{22}/d)(Σ^{11}/d)]
                = −Σ^{12} / √(Σ^{11} Σ^{22}).
STAT3914: Lecture 3 38
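A minimal numpy sketch (random covariance, purely illustrative) computing ρ_{12·34…p} two ways, from the residual formula of the previous slides and from the precision matrix; the two values coincide:

```python
import numpy as np

rng = np.random.default_rng(6)
p = 5
B = rng.standard_normal((p, p))
Sigma = B @ B.T + np.eye(p)

# residual-based formula
num = Sigma[0, 1] - Sigma[0, 2:] @ np.linalg.solve(Sigma[2:, 2:], Sigma[2:, 1])
v1 = Sigma[0, 0] - Sigma[0, 2:] @ np.linalg.solve(Sigma[2:, 2:], Sigma[2:, 0])
v2 = Sigma[1, 1] - Sigma[1, 2:] @ np.linalg.solve(Sigma[2:, 2:], Sigma[2:, 1])
rho_resid = num / np.sqrt(v1 * v2)

# precision-matrix formula
K = np.linalg.inv(Sigma)
rho_prec = -K[0, 1] / np.sqrt(K[0, 0] * K[1, 1])

print(rho_resid, rho_prec)
```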
• Our definition of partial correlation and the analysis above do not require X to be multivariate normal.

• However, if X is multivariate normal then

    ρ_{12·3…p} = Cor_F(X_1, X_2),

  where the correlation is taken with respect to F = F_{(X_1,X_2)|X3}, the joint conditional distribution of X_1 and X_2 given X3.
STAT3914: Lecture 3 39
Maximum Likelihood Estimates for µ and Σ
• X1, …, Xn are independent N_p(µ0, Σ0) random vectors (here µ0 and Σ0 are the true mean and covariance respectively).

• Theorem: The maximum likelihood estimators of µ and Σ are

    µ̂ = X̄ and Σ̂ = (1/n) S,

  where

    S = ∑_{i=1}^{n} (Xi − X̄)(Xi − X̄)^T.
STAT3914: Lecture 3 40
• Proof: Assuming |Σ| > 0, the likelihood of (µ, Σ) is

    L(µ, Σ) = ∏_{i=1}^{n} (2π)^{-p/2} |Σ|^{-1/2} exp(−½ (Xi − µ)^T Σ^{-1} (Xi − µ)).

• Recall: if A is m × n and B is n × m then tr(AB) = tr(BA). Hence,

    (Xi − µ)^T Σ^{-1} (Xi − µ) = tr[(Xi − µ)^T Σ^{-1} (Xi − µ)]
                               = tr[Σ^{-1} (Xi − µ)(Xi − µ)^T].

  Thus,

    L(µ, Σ) = (2π)^{-np/2} |Σ|^{-n/2} exp(−½ tr[Σ^{-1} ∑_{i=1}^{n} (Xi − µ)(Xi − µ)^T]).
STAT3914: Lecture 3 41
From the previous slide,

    L(µ, Σ) = (2π)^{-np/2} |Σ|^{-n/2} exp(−½ tr[Σ^{-1} ∑_{i=1}^{n} (Xi − µ)(Xi − µ)^T]).

• Relying on the identity

    ∑_{i=1}^{n} (Xi − µ)(Xi − µ)^T = ∑_{i=1}^{n} (Xi − X̄)(Xi − X̄)^T + n (X̄ − µ)(X̄ − µ)^T,

  where X̄ = n^{-1} ∑_{i=1}^{n} Xi, we have

    L(µ, Σ) = (2π)^{-np/2} |Σ|^{-n/2} exp(−½ tr[Σ^{-1} (n (X̄ − µ)(X̄ − µ)^T + ∑_{i=1}^{n} (Xi − X̄)(Xi − X̄)^T)]).
STAT3914: Lecture 3 42
• From the previous slide,

    L(µ, Σ) = (2π)^{-np/2} |Σ|^{-n/2} exp(−½ tr[Σ^{-1} (n (X̄ − µ)(X̄ − µ)^T + ∑_{i=1}^{n} (Xi − X̄)(Xi − X̄)^T)]).

• Maximizing L(µ, Σ) with respect to µ is easy:

    −½ tr[Σ^{-1} (X̄ − µ)(X̄ − µ)^T] = −½ (X̄ − µ)^T Σ^{-1} (X̄ − µ) ≤ 0,

  with equality if and only if µ = X̄.

• Then

    max_µ L(µ, Σ) = L(X̄, Σ) = (2π)^{-np/2} |Σ|^{-n/2} exp(−½ tr(Σ^{-1} S)) ≡ L_P(Σ),

  where L_P(Σ) is sometimes referred to as the profile likelihood.
STAT3914: Lecture 3 43
• Next note that

    argmax_Σ L_P(Σ) = argmax_Σ log L_P(Σ)
                    = argmax_Σ { −(np/2) log(2π) − (n/2) log|Σ| − ½ tr(Σ^{-1} S) },

  since log(·) is monotonic.

• In order to maximize the log profile likelihood we note that it can be shown that

    ∂ log|Σ| / ∂Σ_ij = tr[Σ^{-1} ∂Σ/∂Σ_ij] = tr[Σ^{-1} E_ij],

  where E_ij is a matrix of zeros except for a 1 in the (i, j)th entry. Next, it can be shown that

    ∂Σ^{-1} / ∂Σ_ij = −Σ^{-1} (∂Σ/∂Σ_ij) Σ^{-1} = −Σ^{-1} E_ij Σ^{-1}.
STAT3914: Lecture 3 44
Hence,

    ∂ log L_P(Σ) / ∂Σ_ij = −(n/2) tr[Σ^{-1} E_ij] + ½ tr(Σ^{-1} E_ij Σ^{-1} S)
                         = ½ tr[Σ^{-1} (S Σ^{-1} − nI) E_ij].

Setting the above to 0 for all (i, j) we obtain the solution

    Σ̂ = (1/n) S.
STAT3914: Lecture 3 45
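A minimal numpy sketch (simulated data; the true parameters are illustrative) computing the MLEs just derived:

```python
import numpy as np

rng = np.random.default_rng(7)
n, p = 500, 3
mu0 = np.array([1.0, 0.0, -1.0])
A = rng.standard_normal((p, p))
Sigma0 = A @ A.T + np.eye(p)

X = rng.multivariate_normal(mu0, Sigma0, size=n)   # n x p data matrix

mu_hat = X.mean(axis=0)                            # X-bar
centred = X - mu_hat
S = centred.T @ centred                            # sum_i (X_i - X-bar)(X_i - X-bar)^T
Sigma_hat = S / n                                  # MLE (biased, see next slides)
Sigma_unbiased = S / (n - 1)                       # unbiased alternative
print(mu_hat)
print(Sigma_hat)
```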
Bias of the MLEs

Firstly, it is easy to show that the MLE for µ is unbiased, since

    E[X̄] = (1/n) ∑_{i=1}^{n} E[Xi] = (1/n) ∑_{i=1}^{n} µ = µ,

and it has covariance

    Cov(X̄) = (1/n²) ∑_{i=1}^{n} Cov[Xi] = (1/n²) ∑_{i=1}^{n} Σ = (1/n) Σ.
STAT3914: Lecture 3 46
However, the MLE for Σ is biased, since

    E[n^{-1} S] = n^{-1} E[∑_{i=1}^{n} (Xi − X̄)(Xi − X̄)^T]
                = n^{-1} E[∑_{i=1}^{n} Xi Xi^T − n X̄ X̄^T]
                = n^{-1} [n (Σ + µµ^T) − n ((1/n) Σ + µµ^T)]
                = ((n − 1)/n) Σ.

Hence,

    (1/(n − 1)) S

is an unbiased estimator of Σ.
STAT3914: Lecture 3 47
Sampling Distributions
Theorem: If Σ̂ is a consistent estimator of Σ then, by virtue of the central limit theorem,

    √n Σ̂^{-1/2} (X̄ − µ)

converges to N_p(0, I) in distribution.

The sampling distribution of S is much more complicated and involves the Wishart distribution (which we will cover later on).
STAT3914: Lecture 3 48
Results on Quadratic Forms
• Definition: A square matrix A is called idempotent if A² = A.

• Theorem: A symmetric matrix A is idempotent if and only if all its eigenvalues are in {0, 1}.

• Proof: Suppose the eigenvalue decomposition of A is UΛU^T; then

    A² = UΛU^T UΛU^T = UΛ²U^T.

  If A is idempotent then

    UΛ²U^T = UΛU^T  ⟹  λ_i² = λ_i for 1 ≤ i ≤ p.

  Hence, if A is idempotent then λ_i ∈ {0, 1} for 1 ≤ i ≤ p.

  Conversely, if λ_i ∈ {0, 1} for 1 ≤ i ≤ p then λ_i² = λ_i for 1 ≤ i ≤ p and

    A² = UΛ²U^T = UΛU^T = A.
STAT3914: Lecture 3 49
• Let X ∼ N_p(0, I) and let C be a symmetric square matrix of rank r > 0.

• Theorem 3: The random variable

    X^T C X ∼ χ²_r

  if and only if C is idempotent.
STAT3914: Lecture 3 50
• Proof: We can diagonalize C = UDU^T, where U is an orthogonal matrix and

    D = diag(λ1, …, λr, 0, …, 0)   (with p − r trailing zeros),

  where λ_i ≥ λ_{i+1} for i = 1, …, r − 1 (and λ_i ≠ 0).

• Let Y = U^T X, so that Y ∼ N_p(0, I), and note that

    W ≡ X^T C X = X^T U D U^T X = Y^T D Y = ∑_{i=1}^{r} λ_i Y_i².

• If C is idempotent then λ_i = 1 for 1 ≤ i ≤ r, and W ∼ χ²_r since the Y_i² are i.i.d. χ²_1.
STAT3914: Lecture 3 51
• Conversely, assume W ∼ χ²_r; then its MGF is given by

    M_W(t) = (1 − 2t)^{-r/2}   for t < ½.

  Since the Y_i² are i.i.d. χ²_1, for λ_i t < 1/2 for all i we also have

    M_W(t) = ∏_{i=1}^{r} M_{λ_i Y_i²}(t) = ∏_{i=1}^{r} M_{Y_i²}(λ_i t) = ∏_{i=1}^{r} (1 − 2λ_i t)^{-1/2}.

• These two domains agree if and only if λ_1 = 1 and λ_r > 0, so this has to be the case.
STAT3914: Lecture 3 52
• Therefore, M_W(t) = M_{χ²_r}(t) for t < ½ only if

    ∏_{i=1}^{r} (1 − 2λ_i t) = [M_W(t)]^{-2} = [M_{χ²_r}(t)]^{-2} = ∏_{i=1}^{r} (1 − 2t).

• By the uniqueness of polynomial factorization, the equality holds if and only if λ_i = 1 for i = 1, …, r.
STAT3914: Lecture 3 53
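A quick numerical check of Theorem 3 (a minimal numpy/scipy sketch; the projection matrix and sample size are assumptions for illustration): for an idempotent C of rank r, the quadratic form X^T C X behaves like a χ²_r variable.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(8)
p, r = 6, 3
V = np.linalg.qr(rng.standard_normal((p, r)))[0]   # p x r with orthonormal columns
C = V @ V.T                                        # idempotent projection of rank r

X = rng.standard_normal((50_000, p))               # rows are N_p(0, I) draws
W = np.einsum("ij,jk,ik->i", X, C, X)              # x_i^T C x_i for each row

print(W.mean(), r)                                 # the chi2_r mean is r
print(stats.kstest(W, stats.chi2(df=r).cdf).statistic)   # small KS distance
```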
• Theorem 4 [Craig's Theorem]: If A and B are symmetric, non-negative definite matrices and X ∼ N_p(0, I), then

    X^T A X and X^T B X

  are independent if and only if AB = 0.
STAT3914: Lecture 3 54
• Proof: Assume AB = 0. Then BA = (AB)^T = 0, so A and B commute.

  Simultaneous diagonalization: there exist an orthogonal U and diagonal matrices D_A and D_B such that U^T A U = D_A and U^T B U = D_B.

  Next, AB = U D_A U^T U D_B U^T = U D_A D_B U^T, so AB = 0 implies (WLOG) that

    D_A = diag(λ1, …, λr, 0, …, 0) and D_B = diag(0, …, 0, λ_{r+1}, …, λ_p)

  (for some r, with r leading zeros in D_B).

  It follows that, with Y ≡ U^T X,

    X^T A X = Y^T D_A Y = ∑_{i=1}^{r} λ_i Y_i² and X^T B X = ∑_{i=r+1}^{p} λ_i Y_i².

  As Y ∼ N_p(0, I), the latter two are obviously independent random variables.
STAT3914: Lecture 3 55
• Conversely, suppose the quadratic forms are independent.

  Let U be an orthogonal matrix such that

    U^T A U = D_A = diag(λ1, …, λr, 0, …, 0)   (with p − r trailing zeros).

  Let B* = U^T B U and let Y = U^T X; then

    X^T A X = Y^T D_A Y = ∑_{i=1}^{r} λ_i Y_i²,
    X^T B X = ∑_i ∑_j b*_{ij} Y_i Y_j.

  Independence implies that

    E[X^T A X] E[X^T B X] = E[X^T A X · X^T B X].

  Therefore,

    E[∑_{i=1}^{r} λ_i Y_i²] · E[∑_i ∑_j b*_{ij} Y_i Y_j] = E[(∑_{i=1}^{r} λ_i Y_i²)(∑_j ∑_k b*_{kj} Y_k Y_j)].
STAT3914: Lecture 3 56
But Y_1, …, Y_p are NID(0, 1) with E[Y_i²] = 1, E[Y_i³] = 0 and E[Y_i⁴] = 3. Therefore,

    (∑_{i=1}^{r} λ_i)(∑_j b*_{jj}) = 3 ∑_{i=1}^{r} λ_i b*_{ii} + ∑_{i≠j} λ_i b*_{jj}.

Thus, ∑_{i=1}^{r} λ_i b*_{ii} = 0.

But λ_i b*_{ii} ≥ 0 for all i (why?) and λ_i > 0 for i = 1, …, r, so b*_{ii} = 0 for i = 1, …, r.

Since B* ≥ 0, it follows that any principal submatrix is non-negative definite as well; hence b*_{ij} = 0 if i or j is in {1, …, r}. Thus

    B* = ( 0  0    )
         ( 0  B*22 ),

hence D_A B* = 0 and therefore

    AB = A U U^T B = U D_A B* U^T = 0.
STAT3914: Lecture 3 57
The Wishart Distribution
• Let X = (X1, …, Xn)^T be an n × p matrix with rows Xi ∼ NID_p(0, Σ).

• The p × p matrix M ≡ X^T X has a Wishart distribution W_p(Σ, n), whose density is given by

    C_{p,n}^{-1} |Σ|^{-n/2} |M|^{(n−p−1)/2} exp(−½ tr[Σ^{-1} M]),

  where

    C_{p,n} ≡ 2^{np/2} π^{p(p−1)/4} ∏_{i=1}^{p} Γ((n + 1 − i)/2).

  We also need n (called the degrees of freedom) to satisfy n > p − 1, and the Wishart scale matrix Σ is assumed to be positive definite.
STAT3914: Lecture 3 58
• Note that X = ∑_{i=1}^{n} e_i Xi^T, where e_i is a vector of length n whose elements are 0 except for the ith element, which is equal to 1. Therefore,

    M = (∑_{j=1}^{n} Xj e_j^T)(∑_{i=1}^{n} e_i Xi^T) = ∑_{i,j} Xj δ_{ij} Xi^T = ∑_{i=1}^{n} Xi Xi^T,

  where δ_{ij} is a scalar equal to 1 if i = j and 0 if i ≠ j.

• Hence, for known µ = 0 the MLE/sample covariance satisfies nΣ̂ ∼ W_p(Σ, n).

• If p = 1 then W_1(Σ, n) = W_1(σ², n) = σ² χ²_n.

• E[M] = E(∑_{i=1}^{n} Xi Xi^T) = nΣ.

• Var(M_ij) = n(Σ_ij² + Σ_ii Σ_jj).
STAT3914: Lecture 3 59
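A minimal numpy sketch (dimensions and scale matrix are illustrative) forming M = X^T X from n rows of N_p(0, Σ) draws; averaging over many replicates checks the moment formula E[M] = nΣ from the previous slide:

```python
import numpy as np

rng = np.random.default_rng(9)
n, p = 10, 3
A = rng.standard_normal((p, p))
Sigma = A @ A.T + np.eye(p)

reps = 20_000
M_sum = np.zeros((p, p))
for _ in range(reps):
    X = rng.multivariate_normal(np.zeros(p), Sigma, size=n)   # n x p data matrix
    M_sum += X.T @ X                                          # one W_p(Sigma, n) draw
print(M_sum / reps)   # Monte Carlo estimate of E[M]
print(n * Sigma)      # theoretical value
```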
• Theorem 5: If M ∼ W_p(Σ, n) and B is a p × q matrix, then

    B^T M B ∼ W_q(B^T Σ B, n).

• Proof: Let Y = XB; then Y = (Y1, …, Yn)^T is an n × q matrix with independent rows Yi^T = Xi^T B, i.e. Yi = B^T Xi ∼ N_q(0, B^T Σ B). It follows that

    B^T M B = B^T X^T X B = Y^T Y ∼ W_q(B^T Σ B, n).

• Corollary: The principal submatrices of M have Wishart distributions.
STAT3914: Lecture 3 60
• Theorem 6: If M ∼ W_p(Σ, n) and a is any fixed p-vector such that a^T Σ a > 0, then

    a^T M a / (a^T Σ a) ∼ χ²_n.

• Proof: Firstly, from Theorem 5 we have

    a^T M a ∼ W_1(a^T Σ a, n),

  and we have already argued that

    W_1(a^T Σ a, n) ∼ (a^T Σ a) χ²_n.

• Corollary:

    M_ii ∼ Σ_ii χ²_n.

• The converse to Theorem 6 is not true.
STAT3914: Lecture 3 61
• Theorem 7: If

    M1 ∼ W_p(Σ, n1) and M2 ∼ W_p(Σ, n2),

  with M2 independent of M1, then

    M1 + M2 ∼ W_p(Σ, n1 + n2).

• Proof: Write Mi = Xi^T Xi, where Xi has ni independent rows drawn from a N_p(0, Σ) distribution. Then

    M1 + M2 = X1^T X1 + X2^T X2 = X^T X,   where X = ( X1 )
                                                    ( X2 ).
STAT3914: Lecture 3 62
• Recall that the sample covariance matrix is defined as (1/n) S, where

    S = ∑_{i=1}^{n} (Xi − X̄)(Xi − X̄)^T.

• The Wishart distribution describes the distribution of S.

• Theorem 8: If X is an n × p matrix with rows independent N_p(µ, Σ), then S ∼ W_p(Σ, n − 1), independently of X̄ ∼ N_p(µ, n^{-1} Σ).

• Proof: Recall that

    S = ∑_{i=1}^{n} (Xi − X̄)(Xi − X̄)^T = ∑_{i=1}^{n} Xi Xi^T − n X̄ X̄^T = X^T X − n X̄ X̄^T.
STAT3914: Lecture 3 63
Let P be an n × n orthogonal matrix with first row (1/√n, …, 1/√n).

Let Y = PX, so that Y1 = √n X̄, where Y = (Y1, …, Yn)^T. Thus,

    S = Y^T P P^T Y − Y1 Y1^T = Y^T Y − Y1 Y1^T = ∑_{i=2}^{n} Yi Yi^T.

• Exercise: WLOG µ = 0.

• It follows from the following lemma that Yi ∼ NID_p(0, Σ), and therefore

    S ∼ W_p(Σ, n − 1), independently of (1/√n) Y1 = X̄.
STAT3914: Lecture 3 64
• Lemma: Let X be an n × p matrix with rows NID_p(0, Σ) and let U be an n × n orthogonal matrix. Define Y = UX, and let Yi^T be the rows of Y (i = 1, …, n). Then Yi ∼ NID_p(0, Σ).
STAT3914: Lecture 3 65
• Proof: We have E(Y) = U E(X) = 0_{n×p}. Note that Yi = Y^T e_i and therefore

    E[Yi Yj^T] = E[X^T U^T e_i e_j^T U X]
               = E[X^T u_i u_j^T X]
               = E[(∑_{k=1}^{n} Xk e_k^T) u_i u_j^T (∑_{l=1}^{n} e_l Xl^T)]
               = E[∑_{k,l} Xk u_{ik} u_{jl} Xl^T]
               = ∑_{k,l} u_{ik} u_{jl} E[Xk Xl^T]
               = ∑_{k} u_{ik} u_{jk} Σ
               = δ_{ij} Σ,

  where u_i^T is the ith row of U.
STAT3914: Lecture 3 66
Wishart Distribution (cont.)
• Theorem 9: Suppose X has n rows which are NID_p(0, Σ). Then X^T A X ∼ W_p(Σ, r) if and only if A is idempotent and of rank r.

• Proof: If X^T A X ∼ W_p(Σ, r) then, by Theorem 6, for any a ∈ R^p with a^T Σ a > 0 we have

    a^T X^T A X a / (a^T Σ a) ∼ χ²_r.

  Let

    Y = Xa / √(a^T Σ a);

  then Y ∼ N(0_n, I_n) and Y^T A Y ∼ χ²_r, so A is idempotent of rank r by Theorem 3.
STAT3914: Lecture 3 67
• Conversely, if A is idempotent of rank r then A = UDU^T, where U is an n × n orthogonal matrix and D = diag(1, …, 1, 0, …, 0) with r ones and n − r zeros.

• Let Y = U^T X, so by the lemma Y has n rows Yi^T with Yi ∼ NID_p(0, Σ), and

    X^T A X = Y^T D Y
            = (∑_{j=1}^{n} Yj e_j^T) D (∑_{i=1}^{n} e_i Yi^T)
            = ∑_{i,j=1}^{n} Yj (e_j^T D e_i) Yi^T
            = ∑_{i,j=1}^{r} Yj δ_{ij} Yi^T
            = ∑_{i=1}^{r} Yi Yi^T ∼ W_p(Σ, r).
STAT3914: Lecture 3 68
• Theorem 10: If X has n rows which are NID_p(0, Σ), then for any symmetric, non-negative definite n × n matrices A and B, the random matrices X^T A X and X^T B X are independent if and only if AB = 0.

• Proof: Assume X^T A X and X^T B X are independent. Then choose a such that a^T Σ a > 0 and let

    Y = Xa / √(a^T Σ a);

  then:

    ∗ Y ∼ N_n(0, I), and
    ∗ Y^T A Y and Y^T B Y are independent.

  Thus, by Craig's Theorem, we have AB = 0.
STAT3914: Lecture 3 69
• Conversely, assume AB = 0. Then there exists an n × n orthogonal matrix P such that

    P^T A P = D_A ≡ diag(α1, …, αr, 0, …, 0)

  and

    P^T B P = D_B ≡ diag(0, …, 0, β_{r+1}, …, β_n),

  with r non-zero leading entries in D_A and r leading zeros in D_B.
STAT3914: Lecture 3 70
Let Y = P^T X; then, by the last lemma again, Y has n rows Yi^T with Yi ∼ NID_p(0, Σ), and

    X^T A X = Y^T D_A Y
            = (∑_{j=1}^{n} Yj e_j^T) D_A (∑_{i=1}^{n} e_i Yi^T)
            = ∑_{i,j=1}^{n} Yj (e_j^T D_A e_i) Yi^T
            = ∑_{i=1}^{r} α_i Yi Yi^T.

Similarly, X^T B X = ∑_{i=r+1}^{n} β_i Yi Yi^T.

The latter two are obviously independent.
STAT3914: Lecture 3 71
• Theorem 11: Suppose X has n rows which are NID_p(0, Σ) with Σ positive definite. Let

    M = X^T X = ( M11  M12 )
                ( M21  M22 ),

  where M11 is an r × r matrix (with r < n). Let

    M2·1 ≡ M22 − M21 M11^{-1} M12.

  Then

    M2·1 ∼ W_q(Σ2·1, n − r),

  independently of (M11, M12), where q = p − r and

    Σ2·1 = Σ22 − Σ21 Σ11^{-1} Σ12.
STAT3914: Lecture 3 72
• Recall that Σ2·1 is the conditional covariance of

    Xi^(2) ≡ (X_{i,r+1}, …, X_{ip})^T given Xi^(1) ≡ (X_{i1}, …, X_{ir})^T.

  This is also the covariance of

    Yi ≡ Xi^(2) − Q Xi^(1),

  where

    Q Xi^(1) ≡ (Σ21 Σ11^{-1}) Xi^(1)

  is the best linear predictor of Xi^(2) given Xi^(1):

    Yi ∼ NID_q(0_q, Σ2·1), independently of Xi^(1).

• Hence, had we known this decomposition we could have estimated Σ2·1 via Y^T Y ∼ W_q(Σ2·1, n) (note the degrees of freedom are n rather than n − r).

• But we typically don't know Σ...
STAT3914: Lecture 3 73
• Proof (of Theorem 11).

  Write X = (X1, X2), where X1 is n × r and X2 is n × q (with r + q = p); then

    M = X^T X = ( X1^T ) (X1, X2).
                ( X2^T )

  It can be shown that, as Σ11 is positive definite and n > r, M11 is positive definite almost surely.

  Thus, we can define

    M2·1 = X2^T X2 − X2^T X1 M11^{-1} X1^T X2 = X2^T P X2,

  where P = I − X1 M11^{-1} X1^T.
STAT3914: Lecture 3 74
• With P = I − X1 M11^{-1} X1^T it is easy to verify directly that P is idempotent (a projection onto Span(X1)^⊥) and that its rank is n − r, as

    tr(P) = n − tr[X1 M11^{-1} X1^T] = n − tr[M11^{-1} X1^T X1] = n − r.

• Also, X1^T P = P X1 = 0, so with

    X2·1 ≡ X2 − X1 Σ11^{-1} Σ12

  we have

    M2·1 = X2^T P X2 = X2·1^T P X2·1.

• But X2·1 is the component of X2 that is independent of X1, and in particular we saw that the n rows of X2·1 are NID_q(0_q, Σ2·1).

• It follows from Theorem 9 that, conditional on X1, M2·1 ∼ W_q(Σ2·1, n − r).
STAT3914: Lecture 3 75
• This conditional distribution does not depend on X1; therefore it is also the unconditional distribution of M2·1.

• Moreover, it follows that M2·1 is independent of X1 and hence of M11 = X1^T X1.

• Furthermore, P(I − P) = P − P = 0, so using the same rationale as in the proof of Theorem 10 we see that, given X1,

    M2·1 = X2·1^T P X2·1 and X1^T (I − P) X2·1

  are independent.
STAT3914: Lecture 3 76
• Now,

    X1^T (I − P) X2·1 = X1^T (X1 M11^{-1} X1^T)(X2 − X1 Σ11^{-1} Σ12)
                      = M12 − M11 Σ11^{-1} Σ12.

• Therefore, given X1, M2·1 is independent of

    M12 = X1^T (I − P) X2·1 + M11 Σ11^{-1} Σ12.

• Since M2·1 is also independent of X1, it is independent of (X1, M12) (why?).

• It follows that M2·1 is independent of (M11, M12) (why?).
STAT3914: Lecture 3 77
Linear Independence Lemma
• Linear Independence Lemma: Suppose that Xi ∼ NID_p(0, Σ) for i = 1, …, n ≤ p and that Σ is positive definite. Then X1, …, Xn are almost surely linearly independent.

• Proof of lemma.

  Firstly, without loss of generality we can assume that Σ = I_p (exercise).

  The proof is by induction, where the case n = 1 is trivial since X1 ∼ N_p(0, I) implies that P(X1 = 0) = 0.
STAT3914: Lecture 3 78
• Now assume that the lemma holds for some n < p. In this case,

    ∗ either X_{n+1} ∉ Span⟨X1, …, Xn⟩ almost surely and we are done,
    ∗ or there exists a set A ⊂ Ω with P(A) > 0 and random variables α_i such that

        X_{n+1}(ω) = ∑_{i=1}^{n} α_i(ω) Xi(ω) and X1(ω), …, Xn(ω) are linearly independent for all ω ∈ A.

• The latter identity can be thought of as p equations in the n unknowns α_i(ω). Since X1(ω), …, Xn(ω) are linearly independent, without loss of generality the α_i(ω) are uniquely determined from X1(ω), …, Xn(ω) and X_{n+1,k}(ω) for k = 1, …, n.
STAT3914: Lecture 3 79
• But

    X_{n+1,n+1}(ω) = ∑_{i=1}^{n} α_i(ω) X_{i,n+1}(ω),

  and it follows that on the set A, X_{n+1,n+1} can be determined from X1(ω), …, Xn(ω) and X_{n+1,k}(ω) for k = 1, …, n.

• This contradicts the independence assumption.

• Corollary: If M ∼ W_p(Σ, n) with Σ positive definite and n ≥ p then |M| > 0 almost surely (exercise).
STAT3914: Lecture 3 80
• Theorem 12. If M ∼ W_p(Σ, n) with Σ positive definite and n ≥ p then:

  (a) For any fixed a ≠ 0 ∈ R^p,

        (a^T Σ^{-1} a) / (a^T M^{-1} a) ∼ χ²_{n−p+1}.

      In particular, Σ^{ii} / M^{ii} ∼ χ²_{n−p+1}.

  (b) M^{ii} is independent of all elements of M except M_ii.

• Proof:

  Recall that M^{22} = M2·1^{-1} and, similarly, Σ^{22} = Σ2·1^{-1}.

  Therefore, from Theorem 11 with r = p − 1 we have

    (M^{pp})^{-1} = M2·1 ∼ W_{p−r}(Σ2·1, n − r) ∼ (Σ^{pp})^{-1} χ²_{n−p+1}.

  Moreover, M^{pp} is independent of (M11, M12), i.e., M^{pp} is independent of all elements of M except M_pp.
STAT3914: Lecture 3 81
• This proves the theorem for a = e_p, and hence for a = e_i (why?).

• For a general a ≠ 0 ∈ R^p, let A be a non-singular matrix with last column a, that is, Ae_p = a. Then

    (a^T Σ^{-1} a) / (a^T M^{-1} a) = (e_p^T A^T Σ^{-1} A e_p) / (e_p^T A^T M^{-1} A e_p)
                                    = (e_p^T Σ_A^{-1} e_p) / (e_p^T M_A^{-1} e_p),

  where Σ_A = A^{-1} Σ (A^{-1})^T and M_A = A^{-1} M (A^{-1})^T.

• The proof is complete since, by Theorem 5, M_A ∼ W_p(Σ_A, n).
STAT3914: Lecture 3 82
• Theorem 13. If M ∼ W_p(Σ, n) where n ≥ p, then

    |M| = |Σ| · ∏_{i=1}^{p} U_i,

  where the U_i are independent random variables with U_i ∼ χ²_{n−i+1}.

• Proof: Without loss of generality we may assume that Σ is positive definite (exercise). We use induction on p.

  For the case p = 1 we have M ∼ σ² χ²_n.

  Write

    M = ( M11  M12 )
        ( M21  M22 ),

  where M11 is a (p − 1) × (p − 1) matrix.
STAT3914: Lecture 3 83
• Recall that

    M^{22} = [M^{-1}]_{pp} = |M11| / |M|

  (similarly, Σ^{22} = |Σ11| / |Σ|).

• Therefore,

    |M| = |M11| / M^{22} = (Σ^{22} / M^{22}) · (|Σ| / |Σ11|) · |M11|.

• But from Theorem 12,

    U_p ≡ Σ^{22} / M^{22} = [Σ^{-1}]_{pp} / [M^{-1}]_{pp} ∼ χ²_{n−p+1},

  and it is independent of M11.

• We complete the proof by noting that M11 ∼ W_{p−1}(Σ11, n).
STAT3914: Lecture 3 84
• Hence, by the inductive hypothesis,

    |M11| = |Σ11| ∏_{i=1}^{p−1} U_i,

  where the U_i ∼ χ²_{n−i+1} are independent of one another; they depend only on M11 and hence are independent of U_p.
STAT3914: Lecture 3 85
• Corollary: If X has n rows which are NID_p(µ, Σ) with Σ positive definite and

    S = X^T X − n X̄ X̄^T,

  then

    Σ^{kk} / S^{kk} ∼ χ²_{n−p},

    (a^T Σ^{-1} a) / (a^T S^{-1} a) ∼ χ²_{n−p}   for any fixed a ≠ 0.

  If S11 is an r × r submatrix of S, then

    S11 ∼ W_r(Σ11, n − 1),

  independently of

    S2·1 = S22 − S21 S11^{-1} S12 ∼ W_{p−r}(Σ2·1, n − r − 1).
STAT3914: Lecture 3 86
Hotelling’s T²

• Definition. Let Z ∼ N_p(0, I) independently of M ∼ W_p(I, n) with n ≥ p. Then

    n Z^T M^{-1} Z ∼ T²(p, n),

  where T²(p, n) denotes Hotelling's T² distribution.

• Theorem 14: If Y ∼ N_p(µ, Σ) and M ∼ W_p(Σ, n) independently of Y, with Σ positive definite and n ≥ p, then

    n (Y − µ)^T M^{-1} (Y − µ) ∼ T²(p, n).

• Proof:

  Let Z = Σ^{-1/2}(Y − µ) and let M_Σ = Σ^{-1/2} M Σ^{-1/2}.

  Then Z ∼ N_p(0, I), M_Σ ∼ W_p(I, n) and

    n (Y − µ)^T M^{-1} (Y − µ) = n Z^T M_Σ^{-1} Z.
STAT3914: Lecture 3 87
• Theorem 15:

    T²(p, n) ∼ (np / (n − p + 1)) F_{p, n−p+1}.

• Proof:

  Let Z ∼ N_p(0, I) independently of M ∼ W_p(I, n).

  By Theorem 12, for a ≠ 0, conditioned on Z = a,

    D ≡ Z^T Z / (Z^T M^{-1} Z) ∼ χ²_{n−p+1}.

  This conditional distribution of D does not depend on a, hence it is also the unconditional distribution of D.

  For the same reason, D is independent of Z, and therefore also of R ≡ Z^T Z ∼ χ²_p.

  It follows that

    Z^T M^{-1} Z = R / D ∼ (p / (n − p + 1)) F_{p, n−p+1},

  and hence n Z^T M^{-1} Z ∼ (np / (n − p + 1)) F_{p, n−p+1}.
STAT3914: Lecture 3 88
• Corollary 1: If X̄ is the mean of a sample of size n ≥ p drawn from a N_p(µ, Σ) population with Σ positive definite, and if (n − 1)^{-1} S is the unbiased sample covariance matrix, then

    n (X̄ − µ)^T [(1/(n − 1)) S]^{-1} (X̄ − µ) ∼ ((n − 1)p / (n − p)) F_{p, n−p}.

• Proof:

  It suffices to show that the LHS has a T²(p, n − 1) distribution.

  Let Y = √n X̄ and let S = X^T X − n X̄ X̄^T.

  Theorem 8 states that S ∼ W_p(Σ, n − 1), independently of Y ∼ N_p(√n µ, Σ).

  Therefore, by Theorem 14,

    (n − 1)(Y − √n µ)^T S^{-1} (Y − √n µ) ∼ T²(p, n − 1).

  Finally,

    n (X̄ − µ)^T [(1/(n − 1)) S]^{-1} (X̄ − µ) = (n − 1)(Y − √n µ)^T S^{-1} (Y − √n µ).
STAT3914: Lecture 3 89
STAT3914: Lecture 3 90
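As a closing illustration, a minimal numpy/scipy sketch (simulated data under the null; sample size, dimension and covariance are assumptions for illustration) of the one-sample Hotelling T² statistic from Corollary 1 and its F calibration:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(10)
n, p = 30, 3
mu0 = np.zeros(p)                                  # hypothesised mean
A = rng.standard_normal((p, p))
Sigma = A @ A.T + np.eye(p)

X = rng.multivariate_normal(mu0, Sigma, size=n)    # sample drawn under H0
xbar = X.mean(axis=0)
S_unb = np.cov(X, rowvar=False)                    # S / (n - 1)

T2 = n * (xbar - mu0) @ np.linalg.solve(S_unb, xbar - mu0)
F_stat = T2 * (n - p) / ((n - 1) * p)              # F_stat ~ F_{p, n-p}
p_value = 1 - stats.f(p, n - p).cdf(F_stat)
print(T2, F_stat, p_value)
```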