The Rayleigh Quotient
Nuno Vasconcelos, ECE Department, UCSD
Principal component analysis
basic idea:
• if the data lives in a subspace, it is going to look very flat when viewed from the full space, e.g.

(figure: a 1D subspace in 2D; a 2D subspace in 3D)

• this means that if we fit a Gaussian to the data, the equiprobability contours are going to be highly skewed ellipsoids
The role of the mean
note that the mean of the entire data is a function of the coordinate system
• if X has mean µ, then X − µ has mean 0
• we can always make the data have zero mean by centering: if

    X = [ x_1 … x_n ]   (one example per column)

and

    X_cᵀ = (I − (1/n) 1 1ᵀ) Xᵀ

• then X_c has zero mean
hence we can assume that X is zero mean without loss of generality
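As a sanity check, the centering step can be sketched in numpy; the data matrix and sizes below are made up for illustration:

```python
import numpy as np

# toy data: one example per column (d = 2 features, n = 5 points)
X = np.array([[1., 2., 3., 4., 5.],
              [2., 4., 6., 8., 10.]])
n = X.shape[1]

# center with the matrix I - (1/n) 1 1^T, as on the slide
C = np.eye(n) - np.ones((n, n)) / n
Xc = X @ C

# equivalent, and cheaper: subtract the mean of each feature
Xc2 = X - X.mean(axis=1, keepdims=True)

assert np.allclose(Xc, Xc2)
assert np.allclose(Xc.mean(axis=1), 0.0)  # Xc has zero mean
```

In practice one subtracts the mean directly rather than forming the n × n centering matrix.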
Principal component analysis
if y is Gaussian with covariance Σ, the equiprobability contours

    yᵀ Σ⁻¹ y = K

are the ellipses whose
• principal components φ_i are the eigenvectors of Σ
• principal lengths λ_i are the eigenvalues of Σ

(figure: ellipse in the (y_1, y_2) plane with axes φ_1, φ_2 and lengths λ_1, λ_2)

by detecting small eigenvalues we can eliminate dimensions that have little variance
this is PCA
PCA by SVD
computation of PCA by SVD: given X with one example per column
• 1) create the centered data-matrix

    X_cᵀ = (I − (1/n) 1 1ᵀ) Xᵀ

• 2) compute its SVD

    X_cᵀ = M Π Nᵀ

• 3) principal components are the columns of N, eigenvalues are

    λ_i = π_i² / n
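The three steps above can be sketched in numpy; the random data and variable names are illustrative only:

```python
import numpy as np

rng = np.random.default_rng(0)
d, n = 3, 200
# toy correlated Gaussian data, one example per column
X = rng.standard_normal((d, n)) * np.array([[3.0], [1.0], [0.1]])

# 1) centered data matrix (one point per row, as on the slide)
XcT = X.T - X.T.mean(axis=0)

# 2) SVD: Xc^T = M Pi N^T
M, Pi, NT = np.linalg.svd(XcT, full_matrices=False)

# 3) principal components = columns of N, eigenvalues lambda_i = pi_i^2 / n
N = NT.T
lam = Pi**2 / n

# check against the eigendecomposition of the sample covariance
Sigma = (XcT.T @ XcT) / n
eigvals = np.sort(np.linalg.eigvalsh(Sigma))[::-1]
assert np.allclose(lam, eigvals)
```

The SVD route avoids ever forming the covariance matrix, which matters when d is large.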
Principal component analysis
a natural measure is to pick the eigenvectors that explain p% of the data variability
• can be done by plotting the ratio r_k as a function of k

    r_k = Σ_{i=1}^{k} λ_i / Σ_{i=1}^{n} λ_i

• e.g. we need 3 eigenvectors to cover 70% of the variability of this dataset
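A minimal sketch of the ratio r_k in numpy, using a made-up eigenvalue spectrum:

```python
import numpy as np

# hypothetical eigenvalue spectrum, sorted in decreasing order
lam = np.array([5.0, 3.0, 1.5, 0.3, 0.2])

# r_k = sum of the first k eigenvalues over the sum of all of them
r = np.cumsum(lam) / lam.sum()

# smallest k whose r_k reaches p% of the variability
p = 0.90
k = int(np.searchsorted(r, p) + 1)
assert r[k - 1] >= p and r[k - 2] < p
```

For this spectrum r = [0.5, 0.8, 0.95, 0.98, 1.0], so k = 3 eigenvectors cover 90% of the variability.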
Limitations of PCA
PCA is not optimal for classification
• note that there is no mention of the class label in the definition of PCA
• keeping the dimensions of largest energy (variance) is a good idea, but not always enough
• it certainly improves the density estimation, since the space has smaller dimension
• but it could be unwise from a classification point of view
• the discriminant dimensions could be thrown out
it is not hard to construct examples where PCA is the worst possible thing we could do
Fisher's linear discriminant
find the line z = wᵀx that best separates the two classes

(figure: a bad projection vs. a good projection)

    w* = argmax_w ( E[Z|Y=1] − E[Z|Y=0] )² / ( var[Z|Y=1] + var[Z|Y=0] )
Linear discriminant analysis
this can be written as

    J(w) = wᵀ S_B w / wᵀ S_W w

with the between-class scatter

    S_B = (µ_1 − µ_0)(µ_1 − µ_0)ᵀ

and the within-class scatter

    S_W = Σ_1 + Σ_0

the optimal solution is

    w* = S_W⁻¹ (µ_1 − µ_0) = (Σ_1 + Σ_0)⁻¹ (µ_1 − µ_0)

the BDR after projection on z is equivalent to the BDR on x if
• the two classes are Gaussian and have equal covariance
otherwise, LDA leads to a sub-optimal classifier
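The closed-form solution w* = S_W⁻¹(µ_1 − µ_0) can be sketched in numpy on synthetic two-class Gaussian data; all names and parameters below are illustrative:

```python
import numpy as np

rng = np.random.default_rng(1)
# two Gaussian classes with shared covariance (the case where LDA is optimal)
mu0, mu1 = np.array([0., 0.]), np.array([3., 1.])
cov = np.array([[2.0, 0.5], [0.5, 1.0]])
X0 = rng.multivariate_normal(mu0, cov, size=500)
X1 = rng.multivariate_normal(mu1, cov, size=500)

# sample class means and within-class scatter S_W = Sigma_1 + Sigma_0
m0, m1 = X0.mean(axis=0), X1.mean(axis=0)
S_W = np.cov(X0.T, bias=True) + np.cov(X1.T, bias=True)

# w* = S_W^{-1} (mu1 - mu0), solved rather than inverted explicitly
w = np.linalg.solve(S_W, m1 - m0)

# projections z = w^T x: the class means should be well separated
z0, z1 = X0 @ w, X1 @ w
assert z1.mean() > z0.mean()
```

The gap between the projected means equals (m1 − m0)ᵀ S_W⁻¹ (m1 − m0), which is positive whenever S_W is positive definite.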
The Rayleigh quotient
it turns out that the maximization of the Rayleigh quotient

    J(w) = wᵀ S_B w / wᵀ S_W w,   S_B, S_W symmetric positive semidefinite

appears in many problems in engineering and pattern recognition
we have already seen that this is equivalent to

    max_w wᵀ S_B w   subject to   wᵀ S_W w = K

and can be solved using Lagrange multipliers
The Rayleigh quotient
define the Lagrangian

    L = wᵀ S_B w − λ (wᵀ S_W w − K)

maximize with respect to w

    ∇_w L = 2 (S_B − λ S_W) w = 0

to obtain the solution

    S_B w = λ S_W w

this is a generalized eigenvalue problem that you can solve using any eigenvalue routine
which eigenvalue?
The Rayleigh quotient
recall that we want

    max_w wᵀ S_B w   subject to   wᵀ S_W w = K

and the optimal w satisfies

    S_B w = λ S_W w

hence

    (w*)ᵀ S_B w* = λ (w*)ᵀ S_W w* = λK

which is maximum for the largest eigenvalue
in summary, we need the generalized eigenvector of largest eigenvalue of

    S_B w = λ S_W w
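The generalized eigenvalue problem can be handed to a standard routine; below is a sketch with scipy.linalg.eigh, which solves S_B w = λ S_W w when given two matrices (the scatter matrices are made up):

```python
import numpy as np
from scipy.linalg import eigh

# hypothetical scatter matrices: S_B rank-one, S_W positive definite
mu_diff = np.array([2.0, 1.0])
S_B = np.outer(mu_diff, mu_diff)
S_W = np.array([[3.0, 0.5], [0.5, 1.0]])

# generalized eigenvalue problem S_B w = lambda S_W w;
# eigh returns eigenvalues in ascending order, so take the last one
vals, vecs = eigh(S_B, S_W)
lam, w = vals[-1], vecs[:, -1]

# the Rayleigh quotient at w equals the largest generalized eigenvalue
J = (w @ S_B @ w) / (w @ S_W @ w)
assert np.isclose(J, lam)

# and w is parallel to the closed-form LDA solution S_W^{-1} mu_diff
w_lda = np.linalg.solve(S_W, mu_diff)
cos = abs(w @ w_lda) / (np.linalg.norm(w) * np.linalg.norm(w_lda))
assert np.isclose(cos, 1.0)
```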
The Rayleigh quotient
case 1: S_W invertible
• simplifies to a standard eigenvalue problem

    S_W⁻¹ S_B w = λ w

• w* is the eigenvector of largest eigenvalue of S_W⁻¹ S_B
case 2: S_W not invertible
• this case is more problematic
• in fact, the cost can be unbounded
• consider w = w_r + w_n, with w_r in the row space of S_W and w_n in its null space; then

    wᵀ S_W w = (w_r + w_n)ᵀ S_W (w_r + w_n)
             = w_rᵀ S_W w_r + w_rᵀ S_W w_n + w_nᵀ S_W w_r + w_nᵀ S_W w_n
             = w_rᵀ S_W w_r
The Rayleigh quotient
and

    wᵀ S_B w = (w_r + w_n)ᵀ S_B (w_r + w_n)
             = w_rᵀ S_B w_r + 2 w_rᵀ S_B w_n + w_nᵀ S_B w_n

hence, if there is a (w_r, w_n) such that w_rᵀ S_B w_n ≥ 0,
• we can make the cost arbitrarily large
• by simply scaling up the null space component w_n
this can also be seen geometrically
The Rayleigh quotient
recall that the contours

    wᵀ S_W w = K,   with S_W = Φ Λ Φᵀ

are the ellipses whose
• principal components φ_i are the eigenvectors of S_W
• principal lengths are 1/λ_i

(figure: ellipse in the (w_1, w_2) plane with axes φ_1, φ_2 and lengths 1/λ_1, 1/λ_2)

when the eigenvalues go to zero, the ellipses blow up
consider the picture of the optimization problem

    max_w wᵀ S_B w   subject to   wᵀ S_W w = K
The Rayleigh quotient

    max_w wᵀ S_B w   subject to   wᵀ S_W w = K

(figure: cost contours wᵀ S_B w and constraint ellipse wᵀ S_W w = K, touching at w*)

the optimal solution is where the outer red ellipse (cost) touches the blue ellipse (constraint)
• in this example, as λ_1 goes to 0, ||w*|| and the cost go to infinity
The Rayleigh quotient
how do we avoid this problem?
• we introduce another constraint

    max_w wᵀ S_B w   subject to   wᵀ S_W w = K,  wᵀ w = L

(figure: constraint ellipse wᵀ S_W w = K and circle wᵀ w = L, intersecting at a few points, including w*)

• this restricts the set of possible solutions to these points (surfaces, in the high dimensional case)
The Rayleigh quotient
the Lagrangian is now

    L = wᵀ S_B w − λ (wᵀ S_W w − K) − β (wᵀ w − L)

and the solution satisfies

    ∇_w L = 2 (S_B − λ S_W − β I) w = 0

or

    (S_B − λ [S_W + γ I]) w = 0,   γ = β/λ

but this is exactly the solution of the original problem with S_W + γI instead of S_W

    max_w wᵀ S_B w   subject to   wᵀ [S_W + γI] w = K
The Rayleigh quotient
adding the constraint is equivalent to maximizing the regularized Rayleigh quotient

    J(w) = wᵀ S_B w / wᵀ [S_W + γI] w,   S_B, S_W symmetric positive semidefinite

what does this accomplish?
• note that

    S_W = Φ Λ Φᵀ  ⇒  S_W + γI = Φ Λ Φᵀ + γ Φ Φᵀ = Φ [Λ + γI] Φᵀ

• this makes all eigenvalues positive
• the matrix is now invertible
The Rayleigh quotient
in summary

    max_w wᵀ S_B w / wᵀ S_W w,   S_B, S_W symmetric positive semidefinite

1) S_W invertible
• w* is the eigenvector of largest eigenvalue of S_W⁻¹ S_B
• the max value is λK, where λ is the largest eigenvalue
2) S_W not invertible
• regularize: S_W → S_W + γI
• w* is the eigenvector of largest eigenvalue of [S_W + γI]⁻¹ S_B
• the max value is λK, where λ is the largest eigenvalue
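Case 2 can be sketched by regularizing a deliberately singular S_W; the matrices and the value of γ below are a toy example:

```python
import numpy as np
from scipy.linalg import eigh

# hypothetical example where S_W is singular (a zero-variance direction)
S_W = np.diag([2.0, 0.0])                 # not invertible
S_B = np.outer([1.0, 1.0], [1.0, 1.0])
gamma = 1e-3                              # arbitrary regularization strength

# regularize: S_W -> S_W + gamma I, then solve as before
S_W_reg = S_W + gamma * np.eye(2)
vals, vecs = eigh(S_B, S_W_reg)
w = vecs[:, -1]

# all eigenvalues of the regularized within-class scatter are positive
assert np.all(np.linalg.eigvalsh(S_W_reg) > 0)
# and the regularized quotient at w is finite and equals vals[-1]
J = (w @ S_B @ w) / (w @ S_W_reg @ w)
assert np.isfinite(J) and np.isclose(J, vals[-1])
```

Without the γI term the quotient is unbounded along the null-space direction; with it, the maximum is finite (though large when γ is small).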
Regularized discriminant analysis
back to LDA:
• when the within-class scatter matrix is non-invertible, instead of

    J(w) = wᵀ S_B w / wᵀ S_W w,   S_B = (µ_1 − µ_0)(µ_1 − µ_0)ᵀ (between-class scatter),   S_W = Σ_1 + Σ_0 (within-class scatter)

• we use

    J(w) = wᵀ S_B w / wᵀ S_W w,   S_B = (µ_1 − µ_0)(µ_1 − µ_0)ᵀ,   S_W = Σ_1 + Σ_0 + γI (regularized within-class scatter)
Regularized discriminant analysis
this is called regularized discriminant analysis (RDA)
noting that

    S_W = Σ_1 + Σ_0 + γI = (Σ_1 + γ_1 I) + (Σ_0 + γ_0 I),   γ = γ_0 + γ_1

this can also be seen as regularizing each covariance matrix individually
the regularization parameters γ_i are determined by cross-validation
• more on this later
• basically, it means that we try several possibilities and keep the best
Principal component analysis
back to PCA: given X with one example per column
• 1) create the centered data-matrix

    X_cᵀ = (I − (1/n) 1 1ᵀ) Xᵀ = [ (x_1 − µ)ᵀ ; … ; (x_n − µ)ᵀ ]

this has one point per row
• note that the projection of all points on principal component φ is

    z = X_cᵀ φ = [ z_1 ; … ; z_n ] = [ (x_1 − µ)ᵀ φ ; … ; (x_n − µ)ᵀ φ ]
Principal component analysis
and, since

    (1/n) Σ_i z_i = (1/n) Σ_i (x_i − µ)ᵀ φ = ( (1/n) Σ_i x_i − µ )ᵀ φ = 0

the sample variance of Z is given by its norm

    var(z) = (1/n) Σ_i z_i² = (1/n) ||z||²

recall that PCA looks for the largest variance component

    max_φ ||z||² = max_φ (X_cᵀ φ)ᵀ (X_cᵀ φ) = max_φ φᵀ X_c X_cᵀ φ
Principal component analysis
recall that the sample covariance is

    Σ = (1/n) Σ_i (x_i − µ)(x_i − µ)ᵀ = (1/n) Σ_i x_i^c (x_i^c)ᵀ

where x_i^c is the ith column of X_c
this can be written as

    Σ = (1/n) [ x_1^c … x_n^c ] [ (x_1^c)ᵀ ; … ; (x_n^c)ᵀ ] = (1/n) X_c X_cᵀ
Principal component analysis
hence the PCA problem is

    max_φ φᵀ X_c X_cᵀ φ = max_φ φᵀ Σ φ   (up to the factor 1/n)

as in LDA, this can be made arbitrarily large by simply scaling φ
to normalize, we constrain φ to have unit norm

    max_φ φᵀ Σ φ   subject to   φᵀφ = 1

which is equivalent to

    max_φ φᵀ Σ φ / φᵀ φ

this shows that PCA = maximization of a Rayleigh quotient
Principal component analysis
in this case

    max_w wᵀ S_B w / wᵀ S_W w,   S_B, S_W symmetric positive semidefinite

with S_B = Σ and S_W = I
S_W is clearly invertible
• no regularization problems
• w* is the eigenvector of largest eigenvalue of S_W⁻¹ S_B
• this is just the largest eigenvector of the covariance Σ
• the max value is λ, where λ is the largest eigenvalue
The Rayleigh quotient dual
let's assume, for a moment, that the solution is of the form

    w = X_c α

• i.e. a linear combination of the centered datapoints
hence the problem is equivalent to

    max_α αᵀ X_cᵀ S_B X_c α / αᵀ X_cᵀ S_W X_c α

this does not change its form; the solution is
• α* is the eigenvector of largest eigenvalue of (X_cᵀ S_W X_c)⁻¹ X_cᵀ S_B X_c
• the max value is λK, where λ is the largest eigenvalue
The Rayleigh quotient dual
for PCA
• S_W = I and S_B = Σ = (1/n) X_c X_cᵀ
• the solution satisfies

    S_B w = λ S_W w  ⇔  (1/n) X_c X_cᵀ w = λ w  ⇔  w = X_c α,   with α = (1/(nλ)) X_cᵀ w

• and, therefore, we have
• w*: eigenvector of S_B = X_c X_cᵀ
• α*: eigenvector of (X_cᵀ S_W X_c)⁻¹ X_cᵀ S_B X_c = (X_cᵀ X_c)⁻¹ X_cᵀ X_c X_cᵀ X_c = X_cᵀ X_c
• i.e. we have two alternative manners in which to compute PCA
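The two routes can be checked against each other in numpy; the sizes below are chosen so that the dual problem is the smaller one (illustrative only):

```python
import numpy as np

rng = np.random.default_rng(2)
d, n = 50, 10          # high-dimensional data, few points: the dual is cheaper
X = rng.standard_normal((d, n))
Xc = X - X.mean(axis=1, keepdims=True)

# primal: top eigenvector of Xc Xc^T (a d x d problem)
_, phi = np.linalg.eigh(Xc @ Xc.T)
phi = phi[:, -1]

# dual: top eigenvector alpha of Xc^T Xc (an n x n problem), then phi = Xc alpha
_, A = np.linalg.eigh(Xc.T @ Xc)
phi_dual = Xc @ A[:, -1]
phi_dual /= np.linalg.norm(phi_dual)

# the two routes give the same principal component (up to sign)
assert np.isclose(abs(phi @ phi_dual), 1.0)
```

When n ≪ d, solving the n × n dual problem and mapping back with φ = X_c α is much cheaper than the d × d primal problem.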
Principal component analysis
primal
• assemble the matrix Σ = X_c X_cᵀ
• compute its eigenvectors φ_i
• these are the principal components

dual
• assemble the matrix Κ = X_cᵀ X_c
• compute its eigenvectors α_i
• the principal components are φ_i = X_c α_i

in both cases we have an eigenvalue problem
• primal on the sum of outer products: Σ = Σ_i x_i^c (x_i^c)ᵀ
• dual on the matrix of inner products: K_ij = (x_i^c)ᵀ x_j^c
The Rayleigh quotient
this is a property that holds for many Rayleigh quotient problems
• the primal solution is a linear combination of datapoints
• the dual solution only depends on dot-products of the datapoints
whenever both of these hold
• the problem can be kernelized
• this has various interesting properties
• we will talk about them
many examples
• kernel PCA, kernel LDA, manifold learning, etc.