Principal Component Analysis and Linear Discriminant Analysis
Ying Wu
Electrical Engineering and Computer Science, Northwestern University
Evanston, IL 60208
http://www.eecs.northwestern.edu/~yingwu
Decomposition and Components
◮ Decomposition is a great idea.
◮ Linear decomposition and linear bases, e.g., the Fourier transform
◮ The bases
  ◮ construct the feature space
  ◮ may or may not be orthogonal
  ◮ give the directions along which to find the components
  ◮ specified vs. learnt?
◮ The features
  ◮ are the “image” (or projection) of the original signal in the feature space
  ◮ e.g., the orthogonal projection of the original signal onto the feature space
  ◮ the projection does not have to be orthogonal
◮ Feature extraction
Outline
Principal Component Analysis
Linear Discriminant Analysis
Comparison between PCA and LDA
Principal Components and Subspaces
◮ Subspaces preserve part of the information (and energy, or uncertainty)
◮ Principal components
  ◮ are orthogonal bases
  ◮ preserve a large portion of the information in the data
  ◮ capture the major uncertainties (or variations) of the data
◮ Two views
  ◮ Deterministic: minimizing the distortion of the projection of the data
  ◮ Statistical: maximizing the uncertainty captured in the data
  ◮ are they the same?
  ◮ under what condition are they the same?
View 1: Minimizing the MSE
[Figure: a point x and its orthogonal projection (W W^T) x onto the subspace]
◮ x ∈ R^n, and assume centering: E{x} = 0
◮ m is the dimension of the subspace, m < n
◮ orthonormal bases W = [w_1, w_2, . . . , w_m]
  ◮ where W^T W = I, i.e., a rotation
◮ orthogonal projection of x:

  Px = Σ_{i=1}^m (w_i^T x) w_i = (W W^T) x

◮ it achieves the minimum mean-square error (prove it!)

  e_MSE^min = E{||x − Px||²} = E{||P^⊥ x||²}
PCA can be posed as finding a subspace that minimizes the MSE:

  argmin_W J_MSE(W) = E{||x − Px||²}, s.t. W^T W = I
Let's do it... It is easy to see:
  J_MSE(W) = E{x^T x} − E{x^T P x}

So, minimizing J_MSE(W) ⟶ maximizing E{x^T P x}
Then we have the following constrained optimization problem
  max_W E{x^T W W^T x} s.t. W^T W = I
The Lagrangian is
  L(W, Λ) = E{x^T W W^T x} + tr(Λ (I − W^T W)), with Λ = diag(λ_1, . . . , λ_m)
The set of KKT conditions gives:

  ∂L(W, Λ)/∂w_i = 2 E{x x^T} w_i − 2 λ_i w_i = 0, ∀i
What is it?
Let's denote S = E{x x^T} (note: E{x} = 0). The KKT conditions give:

  S w_i = λ_i w_i, ∀i
or in a more concise matrix form:
  S W = W Λ, where Λ = diag(λ_1, . . . , λ_m)
What is this? Then, the value of the minimum MSE is

  e_MSE^min = Σ_{i=m+1}^n λ_i

i.e., the sum of the eigenvalues of the subspace orthogonal to the PCA subspace.
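To fill in the step: with w_i = e_i, we have E{x^T x} = tr(S) and E{x^T P x} = Σ_{i=1}^m e_i^T S e_i, so

  e_MSE^min = E{x^T x} − E{x^T P x}
            = tr(S) − Σ_{i=1}^m e_i^T S e_i
            = Σ_{i=1}^n λ_i − Σ_{i=1}^m λ_i
            = Σ_{i=m+1}^n λ_i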
View 2: Maximizing the Variation
Let's look at it from another perspective:

◮ We have a linear projection of x onto a 1-d subspace: y = w^T x
◮ an important note: E{y} = 0 as E{x} = 0
◮ The first principal component of x is the one for which the variance of the projection y is maximized
◮ of course, we need to constrain w to be a unit vector
◮ so we have the following optimization problem

  max_w J(w) = E{y²} = E{(w^T x)²}, s.t. w^T w = 1

◮ what is it?

  max_w J(w) = w^T S w, s.t. w^T w = 1
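The same Lagrangian argument as before, now in 1-d, makes the answer explicit:

  L(w, λ) = w^T S w − λ (w^T w − 1)
  ∂L/∂w = 2 S w − 2 λ w = 0  ⟹  S w = λ w

so w must be an eigenvector of S, and then J(w) = w^T S w = λ, which is maximized by the eigenvector with the largest eigenvalue.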
Maximizing the Variation (cont.)
◮ The sorted eigenvalues of S are λ_1 ≥ λ_2 ≥ . . . ≥ λ_n, and the corresponding eigenvectors are {e_1, . . . , e_n}
◮ It is clear that the first PC is y_1 = e_1^T x
◮ This can be generalized to m PCs (where m < n) with one more constraint

  E{y_m y_k} = 0, k < m

  i.e., each PC is uncorrelated with all previously found PCs
◮ The solution is: w_k = e_k
◮ Sounds familiar?
The Two Views Converge
The two views lead to the same result!
◮ You should prove:
uncorrelated components ⇐⇒ orthonormal projection bases
◮ What if we are more greedy, say needing independent components?
◮ Should we still expect orthonormal bases?
◮ In which cases do we still have orthonormal bases?
◮ We'll see in the next lecture.
The Closed-Form Solution

Learning the principal components from {x_1, . . . , x_N}:

1. calculate the mean m = (1/N) Σ_{k=1}^N x_k
2. center the data: A = [x_1 − m, . . . , x_N − m]
3. calculate S = Σ_{k=1}^N (x_k − m)(x_k − m)^T = A A^T
4. eigenvalue decomposition S = U Σ U^T
5. sort the λ_i (in descending order) and the corresponding e_i
6. find the bases W = [e_1, e_2, . . . , e_m]

Note: the components for x are y = W^T (x − m), where x ∈ R^n and y ∈ R^m
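As a concrete illustration, here is a minimal NumPy sketch of steps 1-6 (the function name pca and the data layout, one sample per column, are our choices, not from the slides):

    import numpy as np

    def pca(X, m):
        # X: (n, N) data matrix, one sample per column; returns mean and W = [e_1..e_m]
        mean = X.mean(axis=1, keepdims=True)     # step 1: m = (1/N) sum_k x_k
        A = X - mean                             # step 2: centering
        S = A @ A.T                              # step 3: S = A A^T
        lam, U = np.linalg.eigh(S)               # step 4: EVD (eigenvalues ascending)
        idx = np.argsort(lam)[::-1]              # step 5: sort in descending order
        return mean, U[:, idx[:m]]               # step 6: W = [e_1, ..., e_m]

    # usage: the components of a sample x are y = W^T (x - mean), y in R^m
    X = np.random.randn(5, 100)
    mean, W = pca(X, m=2)
    y = W.T @ (X[:, [0]] - mean)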
A First Issue
◮ n is the dimension of the input data, N is the size of the training set
◮ In practice, n ≫ N
◮ E.g., in image-based face recognition, if the resolution of a face image is 100 × 100, stacking all the pixels gives n = 10,000
◮ Note that S is an n × n matrix
◮ Difficulties:
  ◮ S is ill-conditioned, as in general rank(S) ≪ n
  ◮ eigenvalue decomposition of S is too demanding
◮ So, what should we do?
Solution I: First Trick
◮ A is an n × N matrix, so S = A A^T is n × n,
◮ but A^T A is N × N
◮ Trick
  ◮ let's do the eigenvalue decomposition on A^T A
  ◮ A^T A e = λ e ⟶ A A^T (A e) = λ (A e)
  ◮ i.e., if e is an eigenvector of A^T A, then A e is an eigenvector of A A^T
  ◮ and the corresponding eigenvalues are the same
◮ Don't forget to normalize A e

Note: This trick does not fully solve the problem, as we still need to do an eigenvalue decomposition on an N × N matrix, which can be fairly large in practice.
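A NumPy sketch of the trick (sizes and names are ours); note the normalization of A e at the end, and the sanity check that never forms the n × n matrix:

    import numpy as np

    n, N, m = 10_000, 200, 20
    A = np.random.randn(n, N)              # centered data, one sample per column

    lam, E = np.linalg.eigh(A.T @ A)       # EVD of the small N x N matrix
    idx = np.argsort(lam)[::-1][:m]        # keep the m largest eigenvalues
    W = A @ E[:, idx]                      # A e is an eigenvector of A A^T ...
    W /= np.linalg.norm(W, axis=0)         # ... but don't forget to normalize it

    # sanity check: (A A^T) w = lambda w, computed as A (A^T w)
    w, l = W[:, 0], lam[idx[0]]
    print(np.allclose(A @ (A.T @ w), l * w))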
Solution II: Using SVD
◮ Instead of doing EVD, doing SVD (singular value decomposition) is easier
◮ A ∈ R^{n×N}
◮ A = U Σ V^T
  ◮ U ∈ R^{n×N}, and U^T U = I
  ◮ Σ ∈ R^{N×N} is diagonal
  ◮ V ∈ R^{N×N}, and V^T V = V V^T = I
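Why this works: S = A A^T = U Σ V^T V Σ U^T = U Σ² U^T, so the left singular vectors of A are the eigenvectors of S, with eigenvalues σ_i². A minimal sketch (sizes and names are ours):

    import numpy as np

    A = np.random.randn(10_000, 200)                   # centered n x N data matrix
    U, s, Vt = np.linalg.svd(A, full_matrices=False)   # thin SVD: U is n x N
    m = 20
    W = U[:, :m]                   # PCA bases = leading left singular vectors
    eigvals = s[:m] ** 2           # eigenvalues of S = A A^T are sigma_i^2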
Solution III: Iterative Solution
◮ We can design an iterative procedure for finding W, i.e., W ← W + ΔW
◮ looking at the MSE-minimization view, our cost function is

  ||x − Σ_{i=1}^m (w_i^T x) w_i||² = ||x − Σ_{i=1}^m y_i w_i||² = ||x − (W W^T) x||²

◮ we can stop updating once the KKT conditions are met

  Δw_i = γ y_i [x − Σ_{j=1}^m y_j w_j]

◮ Its matrix form (the subspace learning algorithm) is:

  ΔW = γ (x x^T W − W W^T x x^T W)

◮ Two issues:
  ◮ the orthogonality is not enforced
  ◮ slow convergence
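A toy sketch of this update rule (the learning rate γ, the epoch count, and the re-orthonormalization that patches the first issue are our choices, not part of the rule itself):

    import numpy as np

    def subspace_learning(X, m, gamma=0.01, epochs=50):
        # X: (n, N) centered data; iterates dW = gamma (x x^T W - W W^T x x^T W)
        W = np.linalg.qr(np.random.randn(X.shape[0], m))[0]   # random orthonormal start
        for _ in range(epochs):
            for x in X.T:
                x = x[:, None]
                y = W.T @ x                             # y = W^T x
                W += gamma * (x @ y.T - W @ (y @ y.T))  # = gamma (x x^T W - W W^T x x^T W)
            W = np.linalg.qr(W)[0]   # optional: re-orthonormalize, since the rule doesn't enforce it
        return W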
Solution IV: PAST
To speed up the iteration, we can use recursive least squares (RLS). Consider the following cost function

  J(t) = Σ_{i=1}^t β^{t−i} ||x(i) − W(t) y(i)||²

where β is the forgetting factor. W can be solved recursively by the following PAST algorithm:

1. y(t) = W^T(t−1) x(t)
2. h(t) = P(t−1) y(t)
3. m(t) = h(t) / (β + y^T(t) h(t))
4. P(t) = (1/β) Tri[P(t−1) − m(t) h^T(t)]
5. e(t) = x(t) − W(t−1) y(t)
6. W(t) = W(t−1) + e(t) m^T(t)
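A direct transcription of the six steps into NumPy (the initialization is our choice, the gain vector is renamed g to avoid clashing with the subspace dimension m, and we read Tri[·] as a step that keeps P symmetric):

    import numpy as np

    def past(X, m, beta=0.99):
        # PAST subspace tracking; X: (n, T) data stream, one sample per column
        n, T = X.shape
        W = np.linalg.qr(np.random.randn(n, m))[0]
        P = np.eye(m)
        for t in range(T):
            x = X[:, [t]]
            y = W.T @ x                          # 1. y(t) = W^T(t-1) x(t)
            h = P @ y                            # 2. h(t) = P(t-1) y(t)
            g = h / (beta + (y.T @ h).item())    # 3. m(t) = h(t) / (beta + y^T(t) h(t))
            P = (P - g @ h.T) / beta             # 4. P(t) = (1/beta) Tri[P(t-1) - m(t) h^T(t)]
            P = (P + P.T) / 2                    #    ... keep P symmetric
            e = x - W @ y                        # 5. e(t) = x(t) - W(t-1) y(t)
            W = W + e @ g.T                      # 6. W(t) = W(t-1) + e(t) m^T(t)
        return W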
Outline
Principal Component Analysis
Linear Discriminant Analysis
Comparison between PCA and LDA
Face Recognition: Does PCA work well?
◮ The same face under different illumination conditions
◮ What does PCA capture?
◮ Is this what we really want?
From Descriptive to Discriminative
◮ PCA extracts features (or components) that describe the pattern well
◮ Are they necessarily good for distinguishing between classes and separating patterns?
◮ Examples?
◮ We need discriminative features.
◮ Supervision (or labeled training data) is needed.
◮ The issues are:
  ◮ how do we define the discriminant and the separability between classes?
  ◮ how many features do we need?
  ◮ how do we maximize the separability?
◮ Here, we give an example of linear discriminant analysis.
Linear Discriminant Analysis
◮ Finding an optimal linear projection W
◮ Captures the major differences between classes and discounts irrelevant factors
◮ In the projected discriminative subspace, the data are clustered
Within-class and Between-class Scatters
We have two sets of labeled data: D_1 = {x_1, . . . , x_{n_1}} and D_2 = {x_1, . . . , x_{n_2}}. Let's define some terms:

◮ The centers of the two classes: m_i = (1/n_i) Σ_{x∈D_i} x
◮ Data scatter, by definition:

  S = Σ_{x∈D} (x − m)(x − m)^T

◮ Within-class scatter:

  S_w = S_1 + S_2

◮ Between-class scatter:

  S_b = (m_1 − m_2)(m_1 − m_2)^T
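These definitions translate directly into NumPy; a small sketch (names and data layout, one sample per column, are ours):

    import numpy as np

    def scatters(D1, D2):
        # D1, D2: (n, n_i) data matrices; returns S_w and S_b
        m1 = D1.mean(axis=1, keepdims=True)
        m2 = D2.mean(axis=1, keepdims=True)
        S1 = (D1 - m1) @ (D1 - m1).T       # per-class scatter around m_1
        S2 = (D2 - m2) @ (D2 - m2).T       # per-class scatter around m_2
        Sw = S1 + S2                       # within-class scatter
        Sb = (m1 - m2) @ (m1 - m2).T       # between-class scatter (rank 1)
        return Sw, Sb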
Fisher Linear Discriminant

Input: two sets of labeled data, D_1 = {x_1, . . . , x_{n_1}} and D_2 = {x_1, . . . , x_{n_2}}.
Output: a 1-d linear projection w that maximizes the separability between the two classes.
◮ Projected data: Y_1 = w^T D_1 and Y_2 = w^T D_2
◮ Projected class centers: m̃_i = w^T m_i
◮ Projected within-class scatter (a scalar in this case):

  S̃_w = w^T S_w w (prove it!)

◮ Projected between-class scatter (a scalar in this case):

  S̃_b = w^T S_b w (prove it!)

◮ Fisher Linear Discriminant:

  J(w) = |m̃_1 − m̃_2|² / (S̃_1 + S̃_2) = S̃_b / S̃_w = (w^T S_b w) / (w^T S_w w)
Rayleigh Quotient
Theorem. f(λ) = ||Ax − λBx||_B, where ||z||_B ≜ z^T B^{−1} z (with B symmetric positive definite), is minimized by the Rayleigh quotient

  λ = (x^T A x) / (x^T B x)

Proof. Let z = Ax − λBx, so f(λ) = z^T B^{−1} z. Then

  ∂f(λ)/∂λ = −2 (Bx)^T B^{−1} z = −2 x^T (Ax − λBx) = −2 (x^T A x − λ x^T B x)

Setting it to zero gives the result.
Optimizing the Fisher Discriminant

Theorem. J(w) = (w^T S_b w) / (w^T S_w w) is maximized when

  S_b w = λ S_w w

Proof. Let w^T S_w w = c ≠ 0. We can construct the Lagrangian

  L(w, λ) = w^T S_b w − λ (w^T S_w w − c)

Then the KKT condition is

  ∂L(w, λ)/∂w = 2 (S_b w − λ S_w w) = 0

It is clear that S_b w* = λ S_w w*.
An Efficient Solution
◮ A naive solution: S_w^{−1} S_b w = λ w
◮ Then we can do EVD on S_w^{−1} S_b, which takes some computation
◮ Is there a more efficient way?
◮ Facts:
  ◮ S_b w is along the direction of m_1 − m_2 (why?)
  ◮ we don't care about the scalar factor λ
◮ So we can directly read off the direction of w:

  w = S_w^{−1} (m_1 − m_2)
◮ Note: rank(Sb) = 1
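Putting it together, the two-class Fisher discriminant reduces to a single linear solve; a minimal sketch on synthetic data (all names and numbers are ours):

    import numpy as np

    rng = np.random.default_rng(0)
    D1 = rng.normal([[0.0], [0.0]], 1.0, size=(2, 50))   # class 1, n = 2
    D2 = rng.normal([[3.0], [1.0]], 1.0, size=(2, 60))   # class 2

    m1 = D1.mean(axis=1, keepdims=True)
    m2 = D2.mean(axis=1, keepdims=True)
    Sw = (D1 - m1) @ (D1 - m1).T + (D2 - m2) @ (D2 - m2).T

    w = np.linalg.solve(Sw, m1 - m2)   # w = Sw^{-1}(m1 - m2), no EVD needed
    w /= np.linalg.norm(w)             # the scale of w is irrelevant
    y1, y2 = w.T @ D1, w.T @ D2        # projected 1-d data for the two classes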
Multiple Discriminant Analysis

Now we have c classes:
◮ within-class scatter S_w = Σ_{i=1}^c S_i, as before
◮ the between-class scatter is a bit different from the 2-class case:

  S_b ≜ Σ_{i=1}^c n_i (m_i − m)(m_i − m)^T

◮ total scatter:

  S_t ≜ Σ_x (x − m)(x − m)^T = S_w + S_b

◮ MDA is to find a subspace with bases W that maximizes

  J(W) = |S̃_b| / |S̃_w| = |W^T S_b W| / |W^T S_w W|
The Solution to MDA
◮ The solution is obtained by generalized EVD (G-EVD)

  S_b w_i = λ_i S_w w_i

  where each w_i is a generalized eigenvector
◮ In practice, what we can do is the following:
  ◮ find the eigenvalues as the roots of the characteristic polynomial |S_b − λ S_w| = 0
  ◮ for each λ_i, solve w_i from (S_b − λ_i S_w) w_i = 0
◮ Note: W is not unique (it is determined only up to rotation and scaling)
◮ Note: rank(Sb) ≤ (c − 1) (why?)
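In practice one calls a generalized symmetric eigensolver instead of working with the characteristic polynomial; a sketch using SciPy (the function mda and the data layout are ours, and S_w is assumed positive definite):

    import numpy as np
    from scipy.linalg import eigh

    def mda(classes, d):
        # classes: list of (n, n_i) arrays, one per class; returns W with d <= c - 1 columns
        m = np.hstack(classes).mean(axis=1, keepdims=True)   # total mean
        n = classes[0].shape[0]
        Sw, Sb = np.zeros((n, n)), np.zeros((n, n))
        for D in classes:
            mi = D.mean(axis=1, keepdims=True)
            Sw += (D - mi) @ (D - mi).T                      # within-class scatter
            Sb += D.shape[1] * (mi - m) @ (mi - m).T         # between-class scatter
        lam, V = eigh(Sb, Sw)            # generalized EVD: Sb v = lambda Sw v
        return V[:, np.argsort(lam)[::-1][:d]]               # top-d generalized eigenvectors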
Outline
Principal Component Analysis
Linear Discriminant Analysis
Comparison between PCA and LDA
The Relation between PCA and LDA
[Figure: the PCA and LDA projection directions compared on the same labeled data]