Linear algebra, multivariate distributions, and all that jazz

Rebecca C. Steorts
Predictive Modeling: STA 521

September 8, 2015
We review matrix algebra:

- Random vectors
- Independence
- Expectations and Covariances
- Quadratic Forms
- Multivariate Normal Distribution
- Using R
Announcements:

- No late labs or homeworks.
- The lowest lab/homework grade will be dropped.
- New homework is coming.
- What if I miss class or lab?
- If there is a grade question, send an email to the TAs and myself, outlining the question and why you believe points should be returned.
Random Vectors
Definition
We define X to be a p-variate random vector,

X = (X_1, X_2, ..., X_p)^T,

where its entries X_1, ..., X_p are random variables.

Remark
A random variable can be considered a univariate random vector.
Independence
If X_1, ..., X_m are continuous, independence implies that the joint density factorizes:

f_{X_1,...,X_m}(x_1, ..., x_m) = ∏_{i=1}^m f_i(x_i).

- Non-random vectors are constant or deterministic.
- They can also be considered random vectors, with probability 1 of being equal to a constant (there's nothing random going on here).
- They are trivially independent of all other random vectors.
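A quick numerical check (an assumed illustration, not from the slides; it uses the mvtnorm package): for independent standard normal components, the joint density is the product of the marginals.

library(mvtnorm)  # for dmvnorm

x <- c(0.3, -1.2)
dmvnorm(x)      # joint density under N(0, I)
prod(dnorm(x))  # product of the univariate marginals; the same value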
Expected Value
- The expected value of a random vector (or random matrix) is defined to be the vector of expected values of its univariate components.
- Properties of expectation carry over from the univariate case.
Expected Value
Definition
Let X be a p-variate random vector. Then the expected value of X is

µ_X = E(X) = E[(X_1, ..., X_p)^T],

and if X is continuous, then E(X) = ∫ x f(x) dx.
MSE
The mean is the best constant predictor of X in terms of the mean squared error (MSE):

E(X) = argmin_{c ∈ R^p} E‖X − c‖².

Proof:

∂/∂c [(X − c)^T(X − c)] = ∂/∂c [X^T X − 2c^T X + c^T c]   (1)
                        = −2(X − c).                       (2)

Then E[X − c] = 0 ⇒ E[X] = c. Since the second derivative (the Hessian, 2I) is positive definite, the solution is unique.
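As a sanity check (an assumed example, not from the slides), we can minimize the empirical MSE numerically and confirm the minimizer is the sample mean:

# Minimize the empirical MSE over c and compare with the sample mean
set.seed(1)
X <- matrix(rnorm(2000, mean = c(1, -2)), ncol = 2, byrow = TRUE)
mse <- function(c) mean(rowSums(sweep(X, 2, c)^2))
optim(c(0, 0), mse)$par  # numerical minimizer
colMeans(X)              # sample mean; essentially the same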
Let A be an m × p matrix, X a p-variate random vector, and Y an m-variate random vector. Then

E(AX + Y) = A E(X) + E(Y).

Let b be a constant p-vector. Then

E(b^T X) = b^T E(X).
Covariance Matrix
Let X be a p-variate random vector with mean µ. The covariance matrix of X is

Σ_XX = Var(X) = E{(X − µ)(X − µ)^T}

     = [ Var(X_1)       Cov(X_1, X_2)  ...  Cov(X_1, X_p)
         Cov(X_2, X_1)  Var(X_2)       ...  Cov(X_2, X_p)
         ...            ...            ...  ...
         Cov(X_p, X_1)  Cov(X_p, X_2)  ...  Var(X_p)      ].
Let A be an m × p matrix, and let Y be a p-variate random vector. Then

- Var(AX) = A Var(X) A^T.
- Var(X + Y) = Var(X) + Var(Y) + Cov(X, Y) + Cov(Y, X).
- If X and Y are independent, then Cov(X, Y) = 0.
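A simulation sketch (assumed, not from the slides; it uses MASS::mvrnorm) verifying the first identity:

# Empirically check Var(AX) = A Var(X) A^T
library(MASS)
set.seed(42)
Sigma <- matrix(c(2, 1, 1, 3), 2, 2)
X <- mvrnorm(1e5, mu = c(0, 0), Sigma = Sigma)         # rows are draws of X
A <- matrix(c(1, 0, 2, -1, 1, 1), nrow = 3, ncol = 2)  # a 3 x 2 matrix
cov(X %*% t(A))        # sample covariance of AX
A %*% Sigma %*% t(A)   # theoretical A Var(X) A^T; nearly identical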
Cross-covariance matrix
We define the covariance matrix (cross-covariance) between X and Y to be

Σ_XY = Cov(X, Y) = E{(X − µ_X)(Y − µ_Y)^T}

     = [ Cov(X_1, Y_1)  Cov(X_1, Y_2)  ...  Cov(X_1, Y_m)
         Cov(X_2, Y_1)  Cov(X_2, Y_2)  ...  Cov(X_2, Y_m)
         ...            ...            ...  ...
         Cov(X_p, Y_1)  Cov(X_p, Y_2)  ...  Cov(X_p, Y_m) ].
Other properties: Let A and B be constant matrices, and let a and b be constant vectors.

- Cov(AX, BY) = A Cov(X, Y) B^T.
- Cov(X + a, Y + b) = Cov(X, Y).
Neuroimaging example

- X is the intensity of light at every pixel in an image.
- Y is the magnitude of the fMRI signal at every voxel in the brain.
- a is the average intensity over every image shown in the experiment.
- b is the average fMRI signal over the experiment.

Cov(X − a, Y − b) = Cov(X, Y).

We can center and shift observations, and the covariance is unchanged!
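A one-line check of this shift-invariance in R (assumed, not from the slides):

set.seed(5)
x <- rnorm(1000); y <- rnorm(1000)
all.equal(cov(x - 3, y + 7), cov(x, y))  # TRUE: constants shift means, not covariance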
Let X and Y be p- and m-variate random vectors. Then

Var[(X, Y)^T] = [ Var(X)     Cov(X, Y)
                  Cov(Y, X)  Var(Y)    ].

Remark
The off-diagonal blocks of the covariance matrix are cross-covariances.
Trace
Let A = (a_ij) be a square matrix of dimension d × d. The trace of A is the sum of its diagonal elements:

tr(A) = ∑_i a_ii.
Recall that the mean is the best constant predictor of X in terms of the MSE:

E(X) = argmin_{c ∈ R^p} E‖X − c‖².

The total variance of X is the MSE of the mean:

E‖X − E(X)‖² = tr(Var(X)).

The total variance of X measures the overall variability of the components of X around the mean E(X).
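A simulation sketch (assumed, not from the slides) confirming that the total variance is the trace of the covariance matrix:

# Empirical E||X - E(X)||^2 versus tr(Var(X))
library(MASS)
set.seed(7)
X <- mvrnorm(1e5, mu = c(1, 2, 3), Sigma = diag(c(1, 4, 9)))
Xc <- sweep(X, 2, colMeans(X))   # center each column
mean(rowSums(Xc^2))              # empirical total variance
sum(diag(cov(X)))                # tr(Var(X)); both are near 1 + 4 + 9 = 14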
Let a be a constant p-vector. Then a^T X = ∑_{i=1}^p a_i X_i and Var(a^T X) = a^T Var(X) a.

- This helps us measure the variation along some direction a.
- You can make a connection with this in other classes and PCA.
Quadratic Forms
Let A be a symmetric matrix and x a vector.

Definition
A quadratic form is written as

x^T A x = ∑_i ∑_j a_ij x_i x_j.

Note: it's a quadratic function of x.
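Evaluating a quadratic form in R (an assumed illustration, not from the slides):

A <- matrix(c(3, 2, 2, 4), 2, 2)  # symmetric
x <- c(1, -1)
drop(t(x) %*% A %*% x)  # x^T A x = 3
sum(A * outer(x, x))    # the double sum of a_ij x_i x_j; same value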
- As a function of a, Var(a^T X) = a^T Var(X) a = a^T Σ_X a, which is a quadratic form in a.
- Quadratic forms are very common in multivariate analysis.
- Example: the chi-squared test statistic is a quadratic form.
Suppose Z_1, ..., Z_p are iid N(0, 1). Then

‖Z‖² = ∑_i Z_i² ∼ χ²_p.

Now suppose Y_i are independent N(µ_i, 1) for all i. Then

Y (p × 1) ∼ N_p(µ, I)  ⇒  Y^T Y ∼ χ²_p(½ µ^T µ).

This is called a non-central chi-squared distribution.
Define Y = µ + Z, where Z ∼ N(0, I). Then

Y^T Y = (Z + µ)^T (Z + µ)
      = Z^T Z + 2µ^T Z + µ^T µ ∼ χ²_p(½ µ^T µ).

Hence, the non-central χ²_p is a quadratic form.
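A simulation sketch (assumed, not from the slides). Note that R parameterizes the non-central chi-squared by ncp = µ^T µ, whereas the slides write the parameter as ½ µ^T µ:

# Compare quantiles of Y^T Y with the non-central chi-squared
set.seed(3)
p <- 4
mu <- c(1, -1, 2, 0)
Y <- matrix(rnorm(1e5 * p, mean = mu), ncol = p, byrow = TRUE)  # rows ~ N_p(mu, I)
quantile(rowSums(Y^2), c(.25, .5, .75))           # empirical quantiles of Y^T Y
qchisq(c(.25, .5, .75), df = p, ncp = sum(mu^2))  # theoretical quantiles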
Positive Semi-Definite and Positive Definite
1. A square matrix A is called positive semi-definite if A is symmetric and x^T A x ≥ 0 for all x ≠ 0.
2. The matrix A is called positive definite if x^T A x > 0 for all x ≠ 0.
Eigenvalue and eigenvector
Let v ≠ 0 and let A be d × d. Then v is an eigenvector of A with eigenvalue λ when

Av = λv.
More about eigenvalues and eigenvectors
Let v ≠ 0 be an eigenvector of a d × d matrix A.

1. It's typical to normalize the eigenvector to have length 1 (or have its entries sum to 1).
2. A has at most d distinct eigenvalues (think about why).
3. For symmetric A, eigenvectors with distinct eigenvalues are orthogonal (we will soon define orthogonal).
If A is positive definite, then:

- All of its eigenvalues are real-valued and positive.
- Its inverse is also positive definite.

Note that covariance matrices have the following properties:

- Every covariance matrix is a positive semi-definite matrix.
- Every positive semi-definite matrix is a covariance matrix.

The following result from linear algebra is also highly useful.
Spectral Decomposition Theorem
Theorem (Spectral Decomposition Theorem)
Let A (p × p) be symmetric with orthonormal eigenvectors v_1, ..., v_p and corresponding eigenvalues λ_1, ..., λ_p. Then A = PΛP^T, where P = (v_1 · · · v_p) and Λ = Diag(λ_1, ..., λ_p).

The spectral decomposition theorem allows some operations with positive definite matrices to be computed more easily:

- A^{−1} = PΛ^{−1}P^T.
- A^{1/2} = PΛ^{1/2}P^T.
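These identities are easy to check in R (an assumed illustration, not from the slides):

# Inverse and square root of a positive definite matrix via eigen()
A <- matrix(c(3, 2, 2, 4), 2, 2)
e <- eigen(A)
P <- e$vectors; Lambda <- e$values
Ainv  <- P %*% diag(1 / Lambda)   %*% t(P)
Ahalf <- P %*% diag(sqrt(Lambda)) %*% t(P)
all.equal(Ainv, solve(A))         # TRUE
all.equal(Ahalf %*% Ahalf, A)     # TRUE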
Alternative Spectral Decomposition
P is orthogonal if P^T P = I and P P^T = I.

Theorem (Alternative Spectral Decomposition)
Let A be symmetric n × n. Then we can write

A = PDP^T,

where D = diag(λ_1, ..., λ_n) and P is orthogonal. The λ's are the eigenvalues of A, and the ith column of P is an eigenvector corresponding to λ_i.
Theorem
If Y ∼ N_n(µ, I) and P is an orthogonal projection with r(P) = k, then

Y^T P Y ∼ χ²_k(½ µ^T P µ).

Here P is a square matrix.

Take as a fact (theorem): P is an orthogonal projection if and only if P is symmetric and idempotent (i.e., P² = P).
Proof: Since P has rank k, there exists an orthogonal matrix Γ such that

- P = ΓDΓ^T (by the spectral decomposition theorem), where
- D = diag{1, 1, ..., 1, 0, ..., 0}, with k ones.¹

Also, recall that a rank-k matrix has k nonzero eigenvalues. Now

Y^T P Y = Y^T Γ D Γ^T Y = Z^T D Z,

where Z (n × 1) = Γ^T Y ∼ N(ξ = Γ^T µ, I).

¹To show on your own and think about: Why do the eigenvalues of an orthogonal projection have to be either 0 or 1?
We now partition the vector Z into components Z^(1) (k × 1) and Z^(2) ((n − k) × 1) that are uncorrelated, with means ξ_1 and ξ_2:

Z = (Z^(1), Z^(2))^T ∼ N( (ξ_1, ξ_2)^T, [ I_k  0
                                          0    I_{n−k} ] ).

You should be able to verify the distribution of Z on your own.

Since DZ = (Z^(1), 0)^T, we have ‖DZ‖² = ‖Z^(1)‖².

Since Z^(1) ∼ N_k(ξ_1, I_k), we know

‖Z^(1)‖² ∼ χ²_k(½ ‖ξ_1‖²).
Recall that ξ = Γ^T µ and P = ΓDΓ^T. Multiplying by Γ^T and Γ on the left and right sides, we obtain Γ^T P Γ = D. From the above, we find

‖ξ_1‖² = ‖Dξ‖² = ‖Γ^T P Γ Γ^T µ‖² = ‖Γ^T P µ‖² = ‖Pµ‖² = µ^T P µ.

(Since Γ is an orthogonal matrix, ‖Γ^T P µ‖² = ‖Pµ‖².) We have shown that ‖ξ_1‖² = µ^T P µ; thus,

‖Z^(1)‖² ∼ χ²_k(½ µ^T P µ).
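A simulation sketch of the theorem (assumed, not from the slides; again, R's ncp convention is µ^T P µ rather than ½ µ^T P µ):

# Build a rank-k orthogonal projection and check Y^T P Y ~ chi^2_k(ncp)
set.seed(9)
n <- 5; k <- 2
B <- matrix(rnorm(n * k), n, k)                  # arbitrary full-rank basis
P <- B %*% solve(crossprod(B)) %*% t(B)          # projection onto col(B)
mu <- rnorm(n)
Y <- sweep(matrix(rnorm(1e5 * n), ncol = n), 2, mu, "+")  # rows ~ N_n(mu, I)
q <- rowSums((Y %*% P) * Y)                      # Y^T P Y for each row
quantile(q, c(.5, .9))                                     # empirical
qchisq(c(.5, .9), df = k, ncp = drop(t(mu) %*% P %*% mu))  # theoretical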
The Multivariate Normal Distribution

- The multivariate normal (MVN) distribution.
- How do we standardize MVN distributions?
- Spectral and singular value decompositions.
- Computing eigenvalues and eigenvectors in R.
- Some important properties of the MVN.
- Next time: how to visualize MVN distributions.
We assume that the population mean is µ = E(X) and Σ = Var(X) = E[(X − µ)(X − µ)^T], where

µ = (µ_1, µ_2, ..., µ_p)^T

and

Σ = [ σ_1²  σ_12  ...  σ_1p
      σ_21  σ_2²  ...  σ_2p
      ...   ...   ...  ...
      σ_p1  σ_p2  ...  σ_p² ].
- The MVN is a generalization of the univariate normal.
- For the MVN, we write X ∼ MVN(µ, Σ).
- The (i, j)th component of Σ is the covariance between X_i and X_j (so the diagonal of Σ gives the component variances).
Just as the probability density of a scalar normal is

p(x) = (2πσ²)^{−1/2} exp{ −(x − µ)² / (2σ²) },   (3)

the probability density of the multivariate normal is

p(x) = (2π)^{−p/2} det(Σ)^{−1/2} exp{ −½ (x − µ)^T Σ^{−1} (x − µ) }.   (4)

The univariate normal is a special case of the multivariate normal, with a one-dimensional mean "vector" and a one-by-one variance "matrix."
Calculations are easy for the "standard" MVN(0, I). (Every coordinate is an independent N(0, 1).)

Multivariate central limit theorem: if x_1, x_2, ..., x_n are iid with mean 0 and variance I, then n^{−1/2} ∑_{i=1}^n x_i tends to MVN(0, I) as n → ∞.

How do we do calculations for non-standard MVNs?
Recall that the parameters of a normal change along with linear transformations:

X ∼ N(µ, σ²) ⇔ aX + b ∼ N(aµ + b, a²σ²).   (5)

- Use this to "standardize" any normal to have mean 0 and variance 1 (by looking at (X − µ)/σ).
- We standardize MVNs in a very analogous way.
- We need some general results about matrices first: decomposition theorems.
P is orthogonal if P^T P = I and P P^T = I.

Theorem (Spectral Decomposition)
Let A be symmetric n × n. Then we can write

A = PDP^T,

where D = diag(λ_1, ..., λ_n) and P is orthogonal. The λ's are the eigenvalues of A, and the ith column of P is an eigenvector corresponding to λ_i.

Orthogonal matrices represent rotations of the coordinates.

Diagonal matrices represent stretchings/shrinkings of coordinates.
Definition
Let A = (a_ij), B = (b_ij) be square matrices, both of dimension d × d.

1. The trace of A is the sum of its diagonal elements: tr(A) = ∑_i a_ii.
2. The matrices A, B are similar if there exists an invertible matrix E such that

A = EBE^{−1}.
Theorem
Suppose A, B are similar matrices of dimension d × d. Assume A is invertible and has d distinct eigenvalues λ_1, ..., λ_d. Let det(A) be the determinant of A. The following hold:

1. tr(A) = ∑_i a_ii = ∑_i λ_i.
2. The matrices A, B have the same eigenvalues and trace.
3. For any d × d matrix C, the trace and determinant of AC satisfy tr(AC) = tr(CA) and det(AC) = det(A) det(C).
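A two-line check of property 1 in R (assumed, not from the slides), using the same matrix as the eigen() demo below:

A <- matrix(c(3, 2, 2, 4), 2, 2)
sum(diag(A))          # tr(A) = 7
sum(eigen(A)$values)  # sum of eigenvalues; also 7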
Definition (Singular Value Decomposition)
Suppose that X is a data matrix of size d × n. The singular value decomposition (SVD) of X is X = UDV^T, where D is an r × r diagonal matrix. The matrices U and V have sizes d × r and n × r, respectively; their columns are the left and right singular vectors of X. The left singular vectors u_j and the right singular vectors v_j of X are unit vectors such that, for all j,

X^T u_j = (u_j^T X)^T = d_j v_j and X v_j = d_j u_j.

The SVD is used in PCA and dimension reduction methods.
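R computes the SVD directly (an assumed illustration, not from the slides):

# svd() returns the singular values d and the matrices U and V
X <- matrix(rnorm(18), nrow = 6, ncol = 3)
s <- svd(X)
s$d                                         # singular values
all.equal(X, s$u %*% diag(s$d) %*% t(s$v))  # TRUE: X = U D V^T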
Let Σ = [ 3 2
          2 4 ]. What are the eigenvalues and eigenvectors?
> eigen(matrix(c(3,2,2,4),nrow=2))
$values
[1] 5.561553 1.438447
$vectors
[,1] [,2]
[1,] 0.6154122 -0.7882054
[2,] 0.7882054 0.6154122
Did this work?
> Sigma <- matrix(c(3,2,2,4),nrow=2)
> V <- eigen(Sigma)$vectors
> Lambda <- eigen(Sigma)$values
> V %*% diag(Lambda) %*% t(V)
[,1] [,2]
[1,] 3 2
[2,] 2 4
Yes, we find that Σ = VΛV^T.
> plot(t(V),xlab="",ylab="",xlim=c(-1,1),ylim=c(-1,1))
> arrows(0,0,V[1,1],V[2,1])
> arrows(0,0,V[1,2],V[2,2])
[Figure: the two eigenvectors of Σ plotted as arrows from the origin, with both axes running from −1 to 1.]
Remark: When the covariances are all positive, the first eigenvector only has positive entries. We see that here.
Let χ²_p denote the chi-squared distribution with p degrees of freedom.

Theorem
Let X ∼ MVN(µ, Σ) be a p-variate random vector, and assume that Σ is positive definite.

1. Then Σ^{−1/2}(X − µ) ∼ MVN(0, I_{p×p}).
2. Let X² = (X − µ)^T Σ^{−1} (X − µ). Then X² ∼ χ²_p.

The transformation Σ^{−1/2}(X − µ) is called the Mahalanobis transformation.
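A simulation sketch of part 1 (assumed, not from the slides): the Mahalanobis transformation whitens the draws, so their sample covariance is close to the identity.

library(MASS)
Sigma <- matrix(c(3, 2, 2, 4), 2, 2)
mu <- c(1, -1)
e <- eigen(Sigma)
SigInvHalf <- e$vectors %*% diag(1 / sqrt(e$values)) %*% t(e$vectors)  # Sigma^{-1/2}
X <- mvrnorm(1e5, mu = mu, Sigma = Sigma)
Z <- sweep(X, 2, mu) %*% SigInvHalf  # Sigma^{-1/2}(X - mu), row by row
round(cov(Z), 2)                     # approximately the identity matrix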
[Figure: the empirical cumulative distribution of X² compared with the χ²₂ CDF, produced by the code below.]
library(MASS)  # provides mvrnorm

Sigma <- matrix(c(1, .5, .5, 1), 2)
mu <- c(0, 0)
# Creating a bivariate normal with 10,000 draws
bivn <- mvrnorm(10000, mu = mu, Sigma = Sigma)
# Calculate the chi-squared random variable for each draw; rowSums
# avoids forming a full 10,000 x 10,000 cross-product matrix
centered <- sweep(bivn, 2, mu)
x.2 <- rowSums((centered %*% solve(Sigma)) * centered)
plot(ecdf(x.2), xlab = expression(X^2),
     ylab = "Cumulative distribution", col = "blue", main = "", lwd = 2)
curve(pchisq(x, df = 2), add = TRUE, col = "red", lty = "dashed", lwd = 2)
legend("bottomright", legend = c(expression(X^2), expression(chi[2]^2)),
       col = c("blue", "red"), lty = c("solid", "dashed"))
Let A be a positive definite matrix with spectral decomposition A = PDP^T = ∑_i λ_i v_i v_i^T.

Definition
We define the square root of A as the matrix

A^{1/2} = PD^{1/2}P^T = ∑_i λ_i^{1/2} v_i v_i^T.

Then A = A^{1/2}A^{1/2} = A^{1/2}(A^{1/2})^T.
The MVN density can be written in terms of the square root of the inverse covariance matrix. Note that

f(x) = 1 / ((2π)^{p/2} det(Σ)^{1/2}) · exp{ −½ (x − µ)^T Σ^{−1} (x − µ) }
     = 1 / ((2π)^{p/2} det(Σ)^{1/2}) · exp{ −½ [Σ^{−1/2}(x − µ)]^T [Σ^{−1/2}(x − µ)] }
     = 1 / ((2π)^{p/2} det(Σ)^{1/2}) · exp{ −½ ‖Σ^{−1/2}(x − µ)‖² }.
- The covariance matrix Σ is symmetric and positive definite, so we know from the spectral decomposition theorem that it can be written as Σ = PΛP^T.
- Λ is the diagonal matrix of the eigenvalues of Σ.
- P is the matrix whose columns are the orthonormal eigenvectors of Σ (hence P is an orthogonal matrix).
- Geometrically, orthogonal matrices represent rotations.
- Multiplying by P rotates the coordinate axes so that they are parallel to the eigenvectors of Σ.
- Probabilistically, this tells us that the axes of the probability-contour ellipse are parallel to those eigenvectors.
- The radii of those axes are proportional to the square roots of the eigenvalues.