Lecture 25: The Covariance Matrix and Principal Component Analysis
Anup Rao
May 31, 2019
The covariance of two variables X, Y is a number, but it is often very convenient to view this number as an entry of a bigger matrix, called the covariance matrix. Suppose X = (X1, X2, . . . , Xk) is a random vector. Then the covariance matrix K is the matrix with Ki,j = Cov[Xi, Xj]. It is a k × k matrix, and the entries on the diagonal are the variances of all the variables. If X − E[X] is viewed as a column vector, the covariance matrix is

K = E[(X − E[X])(X − E[X])ᵀ].

(If X is a random matrix, then its expectation is the matrix whose entries are the expectations of the corresponding random entries of X. So, E[X] = ∑x p(x) · x, where here x is a matrix.)
The matrix representation is particularly nice for working with data. Suppose X = (X1, . . . , Xk) is a random vector sampled by setting X to be a uniformly random vector from the set {v1, v2, . . . , vn}. Let v be the average of these vectors, and let V be the k × n matrix whose columns are v1 − v, v2 − v, . . . , vn − v. Then it is easy to check that the covariance matrix of X is

K = (1/n) · VVᵀ.
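This identity is easy to verify numerically. Here is a minimal sketch using numpy with made-up data; note that numpy's built-in `np.cov` uses the same 1/n normalization when called with `bias=True`:

```python
import numpy as np

# Made-up data: n = 5 vectors in R^3, stored as the columns of a k x n matrix.
rng = np.random.default_rng(0)
vecs = rng.standard_normal((3, 5))

# Center the data: subtract the average vector from every column.
V = vecs - vecs.mean(axis=1, keepdims=True)

# Covariance matrix of a uniformly random column: K = (1/n) * V V^T.
n = vecs.shape[1]
K = (V @ V.T) / n

# np.cov treats rows as variables and columns as samples; bias=True divides by n.
assert np.allclose(K, np.cov(vecs, bias=True))
```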
The matrix representation allows us to easily generalize linear regression to the case that the inputs themselves are vectors. Suppose we are given data in the form of pairs (v1, y1), (v2, y2), . . . , (vn, yn), where v1, . . . , vn are each k-dimensional column vectors, and y1, . . . , yn are numbers. For simplicity, let us assume again that the mean of the v's is 0, and the mean of the y's is 0. Let X be a uniformly random one of the n vectors, and Y be the corresponding number. Then we seek the vector a such that E[(aᵀX − Y)²] is minimized. We have:

∇E[(aᵀX − Y)²] = ∇E[Y² − 2YaᵀX + aᵀXXᵀa] = 2aᵀE[XXᵀ] − 2E[YX]ᵀ,

but here E[XXᵀ] is exactly the covariance matrix, since the mean is 0. So, setting the gradient to zero, the solution is aᵀ = E[YX]ᵀ · K⁻¹. (K may not be invertible in general, but you can still solve for a using the pseudoinverse of K.)
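As a sanity check, here is a small sketch in numpy. The data is made up: the responses are generated exactly by a hypothetical vector `true_a`, so the formula should recover it. `np.linalg.pinv` computes the pseudoinverse, covering the singular case as well:

```python
import numpy as np

rng = np.random.default_rng(1)
k, n = 3, 100

# Made-up centered inputs v_1, ..., v_n (columns of V) and responses y_i
# generated exactly by a hypothetical true vector, so we can check the answer.
V = rng.standard_normal((k, n))
V -= V.mean(axis=1, keepdims=True)
true_a = np.array([1.0, -2.0, 0.5])
y = true_a @ V              # has zero mean, since each row of V has zero mean

# K = E[X X^T] and E[YX], with (X, Y) a uniformly random data pair.
K = (V @ V.T) / n
EYX = (V @ y) / n

# a = K^+ E[YX]; the pseudoinverse also handles singular K.
a = np.linalg.pinv(K) @ EYX
assert np.allclose(a, true_a)
```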
Principal Component Analysis
The covariance matrix has an extremely important application in machine learning, where we seek to find the key features of large sets of data: the features that are most informative about the underlying data set.
[Figure 1: PCA applied to two-dimensional data. Image credit: Nicoguaro, from Wikipedia.]
For example, if I were to try and figure out what kinds of movies a particular student in 312 likes, I might ask the student to tell me how much they like movies from the following categories: action, romantic comedies, horror, and so on. But are these the right questions to ask? Should I instead be asking whether they like animated movies, or whether they like slapstick humor? One of the key developments in machine learning is the use of mathematics to eliminate the subjectivity here: the data itself is used to generate the most informative questions.
How can we bring the math to bear in this scenario? Let us associate every student in the class with a column vector, say a k-dimensional vector x, where k is the number of movies on Netflix, and each coordinate indicates whether or not the student likes the corresponding movie. Let X be the vector corresponding to a uniformly random student in the class. If I want to boil down a student's preferences to a single number, it is natural to consider a number that is a linear function of their vector. So, I want a unit-length vector a such that the inner product

aᵀx = a1x1 + a2x2 + · · · + akxk

is the most informative. A natural approach is to pick a to maximize Var[aᵀX]. Geometrically, this corresponds to projecting all the k-dimensional vectors onto a single direction, in such a way that the variance of the projections is maximized. There is a very nice web demo here: http://setosa.io/ev/principal-component-analysis/.
To understand how to find this direction a, we need the concepts of eigenvalues and eigenvectors from linear algebra. Given a
square matrix M, a number λ ∈ R is an eigenvalue of M if there is a nonzero vector v such that Mv = λv. In general, eigenvalues can be complex numbers, even if the original matrix is real valued. However, we have:
Fact 1. If M is a k × k symmetric matrix, then it has exactly k real-valued eigenvalues λ1 ≥ λ2 ≥ · · · ≥ λk, and the corresponding eigenvectors form an orthonormal basis v1, . . . , vk for Rᵏ.
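Numerical libraries exploit this fact directly. For instance, numpy's `np.linalg.eigh` is specialized to symmetric matrices and returns exactly the structure Fact 1 promises (a quick sketch with a made-up matrix):

```python
import numpy as np

rng = np.random.default_rng(2)
A = rng.standard_normal((4, 4))
M = (A + A.T) / 2                   # symmetrize: M is a real symmetric matrix

# eigh returns real eigenvalues lam (in ascending order) and an orthonormal
# set of eigenvectors, the columns of Q.
lam, Q = np.linalg.eigh(M)

assert np.allclose(Q.T @ Q, np.eye(4))      # orthonormal basis for R^4
assert np.allclose(M @ Q, Q * lam)          # M v_i = lambda_i v_i, column by column
```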
Returning to our problem above, we seek the vector a that maximizes Var[aᵀX]. For simplicity, let us as usual assume that E[X] = 0; if this is not the case, we can consider the random variable X′ = X − E[X] instead, and this does not change the choice of a. The assumption that the mean is 0 ensures that E[aᵀX] = a1E[X1] + · · · + akE[Xk] = 0. Then we have

Var[aᵀX] = E[(aᵀX)²] = E[aᵀXXᵀa] = aᵀE[XXᵀ]a = aᵀKa,
where K is the covariance matrix of X. Since K is a symmetric matrix, Fact 1 applies. Let λ1, . . . , λk and v1, . . . , vk be as in the fact. Then we can express a = α1v1 + α2v2 + · · · + αkvk. The fact that a has length 1 corresponds to

√(α1² + · · · + αk²) = 1 ⇒ α1² + · · · + αk² = 1.
We have:

aᵀKa = (α1v1 + · · · + αkvk)ᵀK(α1v1 + · · · + αkvk)
     = (α1v1 + · · · + αkvk)ᵀ(λ1α1v1 + · · · + λkαkvk)
     = λ1α1² + · · · + λkαk²,

since the vi are orthonormal: for i ≠ j we have vᵢᵀvj = 0, and vᵢᵀvi = 1.
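This identity aᵀKa = λ1α1² + · · · + λkαk² can be checked numerically as well. A small sketch (the matrix here is made up, a symmetric stand-in for a covariance matrix):

```python
import numpy as np

rng = np.random.default_rng(4)
B = rng.standard_normal((4, 4))
K = (B @ B.T) / 4                   # made-up symmetric matrix standing in for K

lam, Q = np.linalg.eigh(K)          # eigenvalues and orthonormal eigenvectors
a = rng.standard_normal(4)
a /= np.linalg.norm(a)              # a unit-length vector

alpha = Q.T @ a                     # coordinates alpha_i of a in the eigenbasis
assert np.isclose(a @ K @ a, np.sum(lam * alpha**2))
assert np.isclose(np.sum(alpha**2), 1.0)    # unit length in the eigenbasis too
```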
Since λ1 is the largest eigenvalue, and α1² + · · · + αk² = 1, this quantity is maximized when α1 = 1 and α2 = α3 = · · · = αk = 0. So, the largest variance is achieved when a = v1, the eigenvector with the largest eigenvalue! This eigenvector is called the principal component. (In general, if you want the 5 most informative numbers, you should pick the eigenvectors corresponding to the top 5 eigenvalues, and ask for the corresponding linear combinations.)
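Putting the whole argument together, here is a short end-to-end sketch on made-up two-dimensional data: compute K, take the top eigenvector as the principal component, and confirm that no other unit direction achieves more variance:

```python
import numpy as np

rng = np.random.default_rng(3)
# Made-up two-dimensional data with more spread along the first axis.
V = rng.standard_normal((2, 200)) * np.array([[3.0], [1.0]])
V -= V.mean(axis=1, keepdims=True)

K = (V @ V.T) / V.shape[1]
lam, Q = np.linalg.eigh(K)          # eigenvalues in ascending order
a = Q[:, -1]                        # principal component: top eigenvector

# Var[a^T X] = a^T K a = lambda_max, and no other unit vector does better.
var_pc = a @ K @ a
assert np.isclose(var_pc, lam[-1])
for _ in range(100):
    b = rng.standard_normal(2)
    b /= np.linalg.norm(b)          # a random competing unit direction
    assert b @ K @ b <= var_pc + 1e-12
```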