Principal Component Analysis and Linear Discriminant Analysis
Ying Wu
Electrical Engineering and Computer Science, Northwestern University
Evanston, IL 60208
http://www.eecs.northwestern.edu/~yingwu
Decomposition and Components
◮ Decomposition is a great idea.
◮ Linear decomposition and linear bases, e.g., the Fourier transform
◮ The bases
  ◮ construct the feature space
  ◮ may or may not be orthogonal
  ◮ give the directions along which to find the components
  ◮ specified vs. learnt?
◮ The features
  ◮ are the “image” (or projection) of the original signal in the feature space
  ◮ e.g., the orthogonal projection of the original signal onto the feature space
  ◮ the projection does not have to be orthogonal
◮ Feature extraction
Outline
Principal Component Analysis
Linear Discriminant Analysis
Comparison between PCA and LDA
Principal Components and Subspaces
◮ Subspaces preserve part of the information (and energy, or uncertainty)
◮ Principal components
  ◮ are orthogonal bases
  ◮ preserve a large portion of the information in the data
  ◮ capture the major uncertainties (or variations) of the data
◮ Two views
  ◮ Deterministic: minimizing the distortion of the projection of the data
  ◮ Statistical: maximizing the uncertainty captured in the data
  ◮ are they the same?
  ◮ under what condition are they the same?
View 1: Minimizing the MSE
[Figure: a point x and its orthogonal projection (W W^T) x onto the subspace]
◮ x ∈ R^n, and assume centering: E{x} = 0
◮ m is the dimension of the subspace, m < n
◮ orthonormal bases W = [w_1, w_2, . . . , w_m]
  ◮ where W^T W = I, i.e., a rotation
◮ orthogonal projection of x:

  Px = Σ_{i=1}^m (w_i^T x) w_i = (W W^T) x

◮ it achieves the minimum mean-square error (prove it!)

  e_MSE^min = E{||x − Px||²} = E{||P^⊥ x||²}
PCA can be posed as finding a subspace that minimizes the MSE:

  argmin_W J_MSE(W) = E{||x − Px||²}, s.t. W^T W = I
Let's do it... It is easy to see:
  J_MSE(W) = E{x^T x} − E{x^T P x}

So, minimizing J_MSE(W) ⟶ maximizing E{x^T P x}
Then we have the following constrained optimization problem
  max_W E{x^T W W^T x} s.t. W^T W = I
The Lagrangian is
  L(W, Λ) = E{x^T W W^T x} + tr(Λ (I − W^T W)), with Λ = diag(λ_1, . . . , λ_m)
The set of KKT conditions gives:

  ∂L(W, Λ)/∂w_i = 2 E{x x^T} w_i − 2 λ_i w_i = 0, ∀i
What is it?
Let's denote S = E{x x^T} (note: E{x} = 0). The KKT conditions give:

  S w_i = λ_i w_i, ∀i
or in a more concise matrix form:
  S W = W Λ, where Λ = diag(λ_1, . . . , λ_m)
What is this? Then, the value of the minimum MSE is

  e_MSE^min = Σ_{i=m+1}^n λ_i

i.e., the sum of the eigenvalues of the subspace orthogonal to the PCA subspace.
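To fill in the step: with w_i = e_i, we have E{x^T x} = tr(S) and E{x^T P x} = Σ_{i=1}^m e_i^T S e_i, so

  e_MSE^min = E{x^T x} − E{x^T P x}
            = tr(S) − Σ_{i=1}^m e_i^T S e_i
            = Σ_{i=1}^n λ_i − Σ_{i=1}^m λ_i
            = Σ_{i=m+1}^n λ_i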
View 2: Maximizing the Variation
Let's look at it from another perspective:

◮ We have a linear projection of x onto a 1-d subspace: y = w^T x
◮ an important note: E{y} = 0 as E{x} = 0
◮ The first principal component of x is the one for which the variance of the projection y is maximized
◮ of course, we need to constrain w to be a unit vector
◮ so we have the following optimization problem

  max_w J(w) = E{y²} = E{(w^T x)²}, s.t. w^T w = 1

◮ what is it?

  max_w J(w) = w^T S w, s.t. w^T w = 1
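The same Lagrangian argument as before, now in 1-d, makes the answer explicit:

  L(w, λ) = w^T S w − λ (w^T w − 1)
  ∂L/∂w = 2 S w − 2 λ w = 0  ⟹  S w = λ w

so w must be an eigenvector of S, and then J(w) = w^T S w = λ, which is maximized by the eigenvector with the largest eigenvalue.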
Maximizing the Variation (cont.)
◮ The sorted eigenvalues of S are λ_1 ≥ λ_2 ≥ . . . ≥ λ_n, and the corresponding eigenvectors are {e_1, . . . , e_n}
◮ It is clear that the first PC is y_1 = e_1^T x
◮ This can be generalized to m PCs (where m < n) with one more constraint

  E{y_m y_k} = 0, k < m

  i.e., each PC is uncorrelated with all previously found PCs
◮ The solution is: w_k = e_k
◮ Sounds familiar?
The Two Views Converge
The two views lead to the same result!
◮ You should prove:
uncorrelated components ⇐⇒ orthonormal projection bases
◮ What if we are more greedy, say needing independent components?
◮ Should we still expect orthonormal bases?
◮ In which cases do we still have orthonormal bases?
◮ We'll see in the next lecture.
The Closed-Form Solution

Learning the principal components from {x_1, . . . , x_N}:

1. calculate the mean m = (1/N) Σ_{k=1}^N x_k
2. center the data: A = [x_1 − m, . . . , x_N − m]
3. calculate S = Σ_{k=1}^N (x_k − m)(x_k − m)^T = A A^T
4. eigenvalue decomposition S = U Σ U^T
5. sort the λ_i (in descending order) and the corresponding e_i
6. find the bases W = [e_1, e_2, . . . , e_m]

Note: the components for x are y = W^T (x − m), where x ∈ R^n and y ∈ R^m
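As a concrete illustration, here is a minimal NumPy sketch of steps 1-6 (the function name pca and the data layout, one sample per column, are our choices, not from the slides):

    import numpy as np

    def pca(X, m):
        # X: (n, N) data matrix, one sample per column; returns mean and W = [e_1..e_m]
        mean = X.mean(axis=1, keepdims=True)     # step 1: m = (1/N) sum_k x_k
        A = X - mean                             # step 2: centering
        S = A @ A.T                              # step 3: S = A A^T
        lam, U = np.linalg.eigh(S)               # step 4: EVD (eigenvalues ascending)
        idx = np.argsort(lam)[::-1]              # step 5: sort in descending order
        return mean, U[:, idx[:m]]               # step 6: W = [e_1, ..., e_m]

    # usage: the components of a sample x are y = W^T (x - mean), y in R^m
    X = np.random.randn(5, 100)
    mean, W = pca(X, m=2)
    y = W.T @ (X[:, [0]] - mean)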
A First Issue
◮ n is the dimension of the input data, N is the size of the training set
◮ In practice, n ≫ N
◮ E.g., in image-based face recognition, if the resolution of a face image is 100 × 100, stacking all the pixels gives n = 10,000
◮ Note that S is an n × n matrix
◮ Difficulties:
  ◮ S is ill-conditioned, as in general rank(S) ≪ n
  ◮ eigenvalue decomposition of S is too demanding
◮ So, what should we do?
Solution I: First Trick
◮ A is an n × N matrix, so S = A A^T is n × n,
◮ but A^T A is N × N
◮ Trick
  ◮ let's do the eigenvalue decomposition on A^T A
  ◮ A^T A e = λ e ⟶ A A^T (A e) = λ (A e)
  ◮ i.e., if e is an eigenvector of A^T A, then A e is an eigenvector of A A^T
  ◮ and the corresponding eigenvalues are the same
◮ Don't forget to normalize A e

Note: This trick does not fully solve the problem, as we still need to do an eigenvalue decomposition on an N × N matrix, which can be fairly large in practice.
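A NumPy sketch of the trick (sizes and names are ours); note the normalization of A e at the end, and the sanity check that never forms the n × n matrix:

    import numpy as np

    n, N, m = 10_000, 200, 20
    A = np.random.randn(n, N)              # centered data, one sample per column

    lam, E = np.linalg.eigh(A.T @ A)       # EVD of the small N x N matrix
    idx = np.argsort(lam)[::-1][:m]        # keep the m largest eigenvalues
    W = A @ E[:, idx]                      # A e is an eigenvector of A A^T ...
    W /= np.linalg.norm(W, axis=0)         # ... but don't forget to normalize it

    # sanity check: (A A^T) w = lambda w, computed as A (A^T w)
    w, l = W[:, 0], lam[idx[0]]
    print(np.allclose(A @ (A.T @ w), l * w))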
Solution II: Using SVD
◮ Instead of doing EVD, doing SVD (singular value decomposition) is easier
◮ A ∈ R^{n×N}
◮ A = U Σ V^T
  ◮ U ∈ R^{n×N}, and U^T U = I
  ◮ Σ ∈ R^{N×N} is diagonal
  ◮ V ∈ R^{N×N}, and V^T V = V V^T = I
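Why this works: S = A A^T = U Σ V^T V Σ U^T = U Σ² U^T, so the left singular vectors of A are the eigenvectors of S, with eigenvalues σ_i². A minimal sketch (sizes and names are ours):

    import numpy as np

    A = np.random.randn(10_000, 200)                   # centered n x N data matrix
    U, s, Vt = np.linalg.svd(A, full_matrices=False)   # thin SVD: U is n x N
    m = 20
    W = U[:, :m]                   # PCA bases = leading left singular vectors
    eigvals = s[:m] ** 2           # eigenvalues of S = A A^T are sigma_i^2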
Solution III: Iterative Solution
◮ We can design an iterative procedure for finding W, i.e., W ← W + ΔW
◮ looking at the MSE-minimization view, our cost function is

  ||x − Σ_{i=1}^m (w_i^T x) w_i||² = ||x − Σ_{i=1}^m y_i w_i||² = ||x − (W W^T) x||²

◮ we can stop updating once the KKT conditions are met

  Δw_i = γ y_i [x − Σ_{j=1}^m y_j w_j]

◮ Its matrix form (the subspace learning algorithm) is:

  ΔW = γ (x x^T W − W W^T x x^T W)

◮ Two issues:
  ◮ the orthogonality is not enforced
  ◮ slow convergence
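A toy sketch of this update rule (the learning rate γ, the epoch count, and the re-orthonormalization that patches the first issue are our choices, not part of the rule itself):

    import numpy as np

    def subspace_learning(X, m, gamma=0.01, epochs=50):
        # X: (n, N) centered data; iterates dW = gamma (x x^T W - W W^T x x^T W)
        W = np.linalg.qr(np.random.randn(X.shape[0], m))[0]   # random orthonormal start
        for _ in range(epochs):
            for x in X.T:
                x = x[:, None]
                y = W.T @ x                             # y = W^T x
                W += gamma * (x @ y.T - W @ (y @ y.T))  # = gamma (x x^T W - W W^T x x^T W)
            W = np.linalg.qr(W)[0]   # optional: re-orthonormalize, since the rule doesn't enforce it
        return W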
Solution IV: PAST
To speed up the iteration, we can use recursive least squares (RLS). Consider the following cost function

  J(t) = Σ_{i=1}^t β^{t−i} ||x(i) − W(t) y(i)||²

where β is the forgetting factor. W can be solved recursively by the following PAST algorithm:

1. y(t) = W^T(t−1) x(t)
2. h(t) = P(t−1) y(t)
3. m(t) = h(t) / (β + y^T(t) h(t))
4. P(t) = (1/β) Tri[P(t−1) − m(t) h^T(t)]
5. e(t) = x(t) − W(t−1) y(t)
6. W(t) = W(t−1) + e(t) m^T(t)
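A direct transcription of the six steps into NumPy (the initialization is our choice, the gain vector is renamed g to avoid clashing with the subspace dimension m, and we read Tri[·] as a step that keeps P symmetric):

    import numpy as np

    def past(X, m, beta=0.99):
        # PAST subspace tracking; X: (n, T) data stream, one sample per column
        n, T = X.shape
        W = np.linalg.qr(np.random.randn(n, m))[0]
        P = np.eye(m)
        for t in range(T):
            x = X[:, [t]]
            y = W.T @ x                          # 1. y(t) = W^T(t-1) x(t)
            h = P @ y                            # 2. h(t) = P(t-1) y(t)
            g = h / (beta + (y.T @ h).item())    # 3. m(t) = h(t) / (beta + y^T(t) h(t))
            P = (P - g @ h.T) / beta             # 4. P(t) = (1/beta) Tri[P(t-1) - m(t) h^T(t)]
            P = (P + P.T) / 2                    #    ... keep P symmetric
            e = x - W @ y                        # 5. e(t) = x(t) - W(t-1) y(t)
            W = W + e @ g.T                      # 6. W(t) = W(t-1) + e(t) m^T(t)
        return W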
Outline
Principal Component Analysis
Linear Discriminant Analysis
Comparison between PCA and LDA
Face Recognition: Does PCA work well?
◮ The same face under different illumination conditions
◮ What does PCA capture?
◮ Is this what we really want?
From Descriptive to Discriminative
◮ PCA extracts features (or components) that describe the pattern well
◮ Are they necessarily good for distinguishing between classes and separating patterns?
◮ Examples?
◮ We need discriminative features.
◮ Supervision (or labeled training data) is needed.
◮ The issues are:
  ◮ how do we define the discriminant and the separability between classes?
  ◮ how many features do we need?
  ◮ how do we maximize the separability?
◮ Here, we give an example of linear discriminant analysis.
Linear Discriminant Analysis
◮ Finding an optimal linear projection W
◮ Captures the major differences between classes and discounts irrelevant factors
◮ In the projected discriminative subspace, the data are clustered
Within-class and Between-class Scatters
We have two sets of labeled data: D_1 = {x_1, . . . , x_{n_1}} and D_2 = {x_1, . . . , x_{n_2}}. Let's define some terms:

◮ The centers of the two classes: m_i = (1/n_i) Σ_{x∈D_i} x
◮ Data scatter, by definition:

  S = Σ_{x∈D} (x − m)(x − m)^T

◮ Within-class scatter:

  S_w = S_1 + S_2

◮ Between-class scatter:

  S_b = (m_1 − m_2)(m_1 − m_2)^T
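These definitions translate directly into NumPy; a small sketch (names and data layout, one sample per column, are ours):

    import numpy as np

    def scatters(D1, D2):
        # D1, D2: (n, n_i) data matrices; returns S_w and S_b
        m1 = D1.mean(axis=1, keepdims=True)
        m2 = D2.mean(axis=1, keepdims=True)
        S1 = (D1 - m1) @ (D1 - m1).T       # per-class scatter around m_1
        S2 = (D2 - m2) @ (D2 - m2).T       # per-class scatter around m_2
        Sw = S1 + S2                       # within-class scatter
        Sb = (m1 - m2) @ (m1 - m2).T       # between-class scatter (rank 1)
        return Sw, Sb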
Fisher Linear Discriminant

Input: two sets of labeled data, D_1 = {x_1, . . . , x_{n_1}} and D_2 = {x_1, . . . , x_{n_2}}.
Output: a 1-d linear projection w that maximizes the separability between the two classes.
◮ Projected data: Y_1 = w^T D_1 and Y_2 = w^T D_2
◮ Projected class centers: m̃_i = w^T m_i
◮ Projected within-class scatter (a scalar in this case):

  S̃_w = w^T S_w w (prove it!)

◮ Projected between-class scatter (a scalar in this case):

  S̃_b = w^T S_b w (prove it!)

◮ Fisher Linear Discriminant:

  J(w) = |m̃_1 − m̃_2|² / (S̃_1 + S̃_2) = S̃_b / S̃_w = (w^T S_b w) / (w^T S_w w)
Rayleigh Quotient
Theorem. f(λ) = ||Ax − λBx||_B, where ||z||_B ≜ z^T B^{−1} z (with B symmetric positive definite), is minimized by the Rayleigh quotient

  λ = (x^T A x) / (x^T B x)

Proof. Let z = Ax − λBx, so f(λ) = z^T B^{−1} z. Then

  ∂f(λ)/∂λ = −2 (Bx)^T B^{−1} z = −2 x^T (Ax − λBx) = −2 (x^T A x − λ x^T B x)

Setting it to zero gives the result.
Optimizing the Fisher Discriminant

Theorem. J(w) = (w^T S_b w) / (w^T S_w w) is maximized when

  S_b w = λ S_w w

Proof. Let w^T S_w w = c ≠ 0. We can construct the Lagrangian

  L(w, λ) = w^T S_b w − λ (w^T S_w w − c)

Then the KKT condition is

  ∂L(w, λ)/∂w = 2 (S_b w − λ S_w w) = 0

It is clear that S_b w* = λ S_w w*.
An Efficient Solution
◮ A naive solution: S_w^{−1} S_b w = λ w
◮ Then we can do EVD on S_w^{−1} S_b, which takes some computation
◮ Is there a more efficient way?
◮ Facts:
  ◮ S_b w is along the direction of m_1 − m_2 (why?)
  ◮ we don't care about the scalar factor λ
◮ So we can directly read off the direction of w:

  w = S_w^{−1} (m_1 − m_2)
◮ Note: rank(Sb) = 1
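Putting it together, the two-class Fisher discriminant reduces to a single linear solve; a minimal sketch on synthetic data (all names and numbers are ours):

    import numpy as np

    rng = np.random.default_rng(0)
    D1 = rng.normal([[0.0], [0.0]], 1.0, size=(2, 50))   # class 1, n = 2
    D2 = rng.normal([[3.0], [1.0]], 1.0, size=(2, 60))   # class 2

    m1 = D1.mean(axis=1, keepdims=True)
    m2 = D2.mean(axis=1, keepdims=True)
    Sw = (D1 - m1) @ (D1 - m1).T + (D2 - m2) @ (D2 - m2).T

    w = np.linalg.solve(Sw, m1 - m2)   # w = Sw^{-1}(m1 - m2), no EVD needed
    w /= np.linalg.norm(w)             # the scale of w is irrelevant
    y1, y2 = w.T @ D1, w.T @ D2        # projected 1-d data for the two classes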
Multiple Discriminant Analysis

Now we have c classes:
◮ within-class scatter S_w = Σ_{i=1}^c S_i, as before
◮ the between-class scatter is a bit different from the 2-class case:

  S_b ≜ Σ_{i=1}^c n_i (m_i − m)(m_i − m)^T

◮ total scatter:

  S_t ≜ Σ_x (x − m)(x − m)^T = S_w + S_b

◮ MDA is to find a subspace with bases W that maximizes

  J(W) = |S̃_b| / |S̃_w| = |W^T S_b W| / |W^T S_w W|
The Solution to MDA
◮ The solution is obtained by generalized EVD (G-EVD)

  S_b w_i = λ_i S_w w_i

  where each w_i is a generalized eigenvector
◮ In practice, what we can do is the following:
  ◮ find the eigenvalues as the roots of the characteristic polynomial |S_b − λ S_w| = 0
  ◮ for each λ_i, solve w_i from (S_b − λ_i S_w) w_i = 0
◮ Note: W is not unique (it is determined only up to rotation and scaling)
◮ Note: rank(Sb) ≤ (c − 1) (why?)
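In practice one calls a generalized symmetric eigensolver instead of working with the characteristic polynomial; a sketch using SciPy (the function mda and the data layout are ours, and S_w is assumed positive definite):

    import numpy as np
    from scipy.linalg import eigh

    def mda(classes, d):
        # classes: list of (n, n_i) arrays, one per class; returns W with d <= c - 1 columns
        m = np.hstack(classes).mean(axis=1, keepdims=True)   # total mean
        n = classes[0].shape[0]
        Sw, Sb = np.zeros((n, n)), np.zeros((n, n))
        for D in classes:
            mi = D.mean(axis=1, keepdims=True)
            Sw += (D - mi) @ (D - mi).T                      # within-class scatter
            Sb += D.shape[1] * (mi - m) @ (mi - m).T         # between-class scatter
        lam, V = eigh(Sb, Sw)            # generalized EVD: Sb v = lambda Sw v
        return V[:, np.argsort(lam)[::-1][:d]]               # top-d generalized eigenvectors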
Outline
Principal Component Analysis
Linear Discriminant Analysis
Comparison between PCA and LDA
The Relation between PCA and LDA
[Figure: the PCA and LDA projection directions compared on the same labeled data]