The Rayleigh Quotient
Nuno Vasconcelos, ECE Department, UCSD
Principal component analysis
basic idea:
• if the data lives in a subspace, it is going to look very flat when viewed from the full space, e.g.

(figure: a 1D subspace in 2D; a 2D subspace in 3D)

• this means that if we fit a Gaussian to the data, the equiprobability contours are going to be highly skewed ellipsoids
The role of the mean
note that the mean of the entire data is a function of the coordinate system
• if X has mean µ, then X − µ has mean 0
• we can always make the data have zero mean by centering: if

    X = [ x_1 … x_n ]   (one example per column)

and

    X_cᵀ = (I − (1/n) 1 1ᵀ) Xᵀ

• then X_c has zero mean
hence we can assume that X is zero mean without loss of generality
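As a sanity check, the centering step can be sketched in numpy; the data matrix and sizes below are made up for illustration:

```python
import numpy as np

# toy data: one example per column (d = 2 features, n = 5 points)
X = np.array([[1., 2., 3., 4., 5.],
              [2., 4., 6., 8., 10.]])
n = X.shape[1]

# center with the matrix I - (1/n) 1 1^T, as on the slide
C = np.eye(n) - np.ones((n, n)) / n
Xc = X @ C

# equivalent, and cheaper: subtract the mean of each feature
Xc2 = X - X.mean(axis=1, keepdims=True)

assert np.allclose(Xc, Xc2)
assert np.allclose(Xc.mean(axis=1), 0.0)  # Xc has zero mean
```

In practice one subtracts the mean directly rather than forming the n × n centering matrix.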
Principal component analysis
if y is Gaussian with covariance Σ, the equiprobability contours

    yᵀ Σ⁻¹ y = K

are the ellipses whose
• principal components φ_i are the eigenvectors of Σ
• principal lengths λ_i are the eigenvalues of Σ

(figure: ellipse in the (y_1, y_2) plane with axes φ_1, φ_2 and lengths λ_1, λ_2)

by detecting small eigenvalues we can eliminate dimensions that have little variance
this is PCA
PCA by SVD
computation of PCA by SVD: given X with one example per column
• 1) create the centered data-matrix

    X_cᵀ = (I − (1/n) 1 1ᵀ) Xᵀ

• 2) compute its SVD

    X_cᵀ = M Π Nᵀ

• 3) principal components are the columns of N, eigenvalues are

    λ_i = π_i² / n
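The three steps above can be sketched in numpy; the random data and variable names are illustrative only:

```python
import numpy as np

rng = np.random.default_rng(0)
d, n = 3, 200
# toy correlated Gaussian data, one example per column
X = rng.standard_normal((d, n)) * np.array([[3.0], [1.0], [0.1]])

# 1) centered data matrix (one point per row, as on the slide)
XcT = X.T - X.T.mean(axis=0)

# 2) SVD: Xc^T = M Pi N^T
M, Pi, NT = np.linalg.svd(XcT, full_matrices=False)

# 3) principal components = columns of N, eigenvalues lambda_i = pi_i^2 / n
N = NT.T
lam = Pi**2 / n

# check against the eigendecomposition of the sample covariance
Sigma = (XcT.T @ XcT) / n
eigvals = np.sort(np.linalg.eigvalsh(Sigma))[::-1]
assert np.allclose(lam, eigvals)
```

The SVD route avoids ever forming the covariance matrix, which matters when d is large.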
Principal component analysis
a natural measure is to pick the eigenvectors that explain p% of the data variability
• can be done by plotting the ratio r_k as a function of k

    r_k = Σ_{i=1}^{k} λ_i / Σ_{i=1}^{n} λ_i

• e.g. we need 3 eigenvectors to cover 70% of the variability of this dataset
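A minimal sketch of the ratio r_k in numpy, using a made-up eigenvalue spectrum:

```python
import numpy as np

# hypothetical eigenvalue spectrum, sorted in decreasing order
lam = np.array([5.0, 3.0, 1.5, 0.3, 0.2])

# r_k = sum of the first k eigenvalues over the sum of all of them
r = np.cumsum(lam) / lam.sum()

# smallest k whose r_k reaches p% of the variability
p = 0.90
k = int(np.searchsorted(r, p) + 1)
assert r[k - 1] >= p and r[k - 2] < p
```

For this spectrum r = [0.5, 0.8, 0.95, 0.98, 1.0], so k = 3 eigenvectors cover 90% of the variability.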
Limitations of PCA
PCA is not optimal for classification
• note that there is no mention of the class label in the definition of PCA
• keeping the dimensions of largest energy (variance) is a good idea, but not always enough
• it certainly improves the density estimation, since the space has smaller dimension
• but it could be unwise from a classification point of view
• the discriminant dimensions could be thrown out
it is not hard to construct examples where PCA is the worst possible thing we could do
Fisher's linear discriminant
find the line z = wᵀx that best separates the two classes

(figure: a bad projection vs. a good projection)

    w* = argmax_w ( E[Z|Y=1] − E[Z|Y=0] )² / ( var[Z|Y=1] + var[Z|Y=0] )
Linear discriminant analysis
this can be written as

    J(w) = wᵀ S_B w / wᵀ S_W w

with the between-class scatter

    S_B = (µ_1 − µ_0)(µ_1 − µ_0)ᵀ

and the within-class scatter

    S_W = Σ_1 + Σ_0

the optimal solution is

    w* = S_W⁻¹ (µ_1 − µ_0) = (Σ_1 + Σ_0)⁻¹ (µ_1 − µ_0)

the BDR after projection on z is equivalent to the BDR on x if
• the two classes are Gaussian and have equal covariance
otherwise, LDA leads to a sub-optimal classifier
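The closed-form solution w* = S_W⁻¹(µ_1 − µ_0) can be sketched in numpy on synthetic two-class Gaussian data; all names and parameters below are illustrative:

```python
import numpy as np

rng = np.random.default_rng(1)
# two Gaussian classes with shared covariance (the case where LDA is optimal)
mu0, mu1 = np.array([0., 0.]), np.array([3., 1.])
cov = np.array([[2.0, 0.5], [0.5, 1.0]])
X0 = rng.multivariate_normal(mu0, cov, size=500)
X1 = rng.multivariate_normal(mu1, cov, size=500)

# sample class means and within-class scatter S_W = Sigma_1 + Sigma_0
m0, m1 = X0.mean(axis=0), X1.mean(axis=0)
S_W = np.cov(X0.T, bias=True) + np.cov(X1.T, bias=True)

# w* = S_W^{-1} (mu1 - mu0), solved rather than inverted explicitly
w = np.linalg.solve(S_W, m1 - m0)

# projections z = w^T x: the class means should be well separated
z0, z1 = X0 @ w, X1 @ w
assert z1.mean() > z0.mean()
```

The gap between the projected means equals (m1 − m0)ᵀ S_W⁻¹ (m1 − m0), which is positive whenever S_W is positive definite.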
The Rayleigh quotient
it turns out that the maximization of the Rayleigh quotient

    J(w) = wᵀ S_B w / wᵀ S_W w,   S_B, S_W symmetric positive semidefinite

appears in many problems in engineering and pattern recognition
we have already seen that this is equivalent to

    max_w wᵀ S_B w   subject to   wᵀ S_W w = K

and can be solved using Lagrange multipliers
The Rayleigh quotient
define the Lagrangian

    L = wᵀ S_B w − λ (wᵀ S_W w − K)

maximize with respect to w

    ∇_w L = 2 (S_B − λ S_W) w = 0

to obtain the solution

    S_B w = λ S_W w

this is a generalized eigenvalue problem that you can solve using any eigenvalue routine
which eigenvalue?
The Rayleigh quotient
recall that we want

    max_w wᵀ S_B w   subject to   wᵀ S_W w = K

and the optimal w satisfies

    S_B w = λ S_W w

hence

    (w*)ᵀ S_B w* = λ (w*)ᵀ S_W w* = λK

which is maximum for the largest eigenvalue
in summary, we need the generalized eigenvector of largest eigenvalue of

    S_B w = λ S_W w
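The generalized eigenvalue problem can be handed to a standard routine; below is a sketch with scipy.linalg.eigh, which solves S_B w = λ S_W w when given two matrices (the scatter matrices are made up):

```python
import numpy as np
from scipy.linalg import eigh

# hypothetical scatter matrices: S_B rank-one, S_W positive definite
mu_diff = np.array([2.0, 1.0])
S_B = np.outer(mu_diff, mu_diff)
S_W = np.array([[3.0, 0.5], [0.5, 1.0]])

# generalized eigenvalue problem S_B w = lambda S_W w;
# eigh returns eigenvalues in ascending order, so take the last one
vals, vecs = eigh(S_B, S_W)
lam, w = vals[-1], vecs[:, -1]

# the Rayleigh quotient at w equals the largest generalized eigenvalue
J = (w @ S_B @ w) / (w @ S_W @ w)
assert np.isclose(J, lam)

# and w is parallel to the closed-form LDA solution S_W^{-1} mu_diff
w_lda = np.linalg.solve(S_W, mu_diff)
cos = abs(w @ w_lda) / (np.linalg.norm(w) * np.linalg.norm(w_lda))
assert np.isclose(cos, 1.0)
```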
The Rayleigh quotient
case 1: S_W invertible
• simplifies to a standard eigenvalue problem

    S_W⁻¹ S_B w = λ w

• w* is the eigenvector of largest eigenvalue of S_W⁻¹ S_B
case 2: S_W not invertible
• this case is more problematic
• in fact, the cost can be unbounded
• consider w = w_r + w_n, with w_r in the row space of S_W and w_n in its null space; then

    wᵀ S_W w = (w_r + w_n)ᵀ S_W (w_r + w_n)
             = w_rᵀ S_W w_r + w_rᵀ S_W w_n + w_nᵀ S_W w_r + w_nᵀ S_W w_n
             = w_rᵀ S_W w_r
The Rayleigh quotient
and

    wᵀ S_B w = (w_r + w_n)ᵀ S_B (w_r + w_n)
             = w_rᵀ S_B w_r + 2 w_rᵀ S_B w_n + w_nᵀ S_B w_n

hence, if there is a (w_r, w_n) such that w_rᵀ S_B w_n ≥ 0,
• we can make the cost arbitrarily large
• by simply scaling up the null space component w_n
this can also be seen geometrically
The Rayleigh quotient
recall that the contours

    wᵀ S_W w = K,   with S_W = Φ Λ Φᵀ

are the ellipses whose
• principal components φ_i are the eigenvectors of S_W
• principal lengths are 1/λ_i

(figure: ellipse in the (w_1, w_2) plane with axes φ_1, φ_2 and lengths 1/λ_1, 1/λ_2)

when the eigenvalues go to zero, the ellipses blow up
consider the picture of the optimization problem

    max_w wᵀ S_B w   subject to   wᵀ S_W w = K
The Rayleigh quotient

    max_w wᵀ S_B w   subject to   wᵀ S_W w = K

(figure: cost contours wᵀ S_B w and constraint ellipse wᵀ S_W w = K, touching at w*)

the optimal solution is where the outer red ellipse (cost) touches the blue ellipse (constraint)
• in this example, as λ_1 goes to 0, ||w*|| and the cost go to infinity
The Rayleigh quotient
how do we avoid this problem?
• we introduce another constraint

    max_w wᵀ S_B w   subject to   wᵀ S_W w = K,  wᵀ w = L

(figure: constraint ellipse wᵀ S_W w = K and circle wᵀ w = L, intersecting at a few points, including w*)

• this restricts the set of possible solutions to these points (surfaces, in the high dimensional case)
The Rayleigh quotient
the Lagrangian is now

    L = wᵀ S_B w − λ (wᵀ S_W w − K) − β (wᵀ w − L)

and the solution satisfies

    ∇_w L = 2 (S_B − λ S_W − β I) w = 0

or

    (S_B − λ [S_W + γ I]) w = 0,   γ = β/λ

but this is exactly the solution of the original problem with S_W + γI instead of S_W

    max_w wᵀ S_B w   subject to   wᵀ [S_W + γI] w = K
The Rayleigh quotient
adding the constraint is equivalent to maximizing the regularized Rayleigh quotient

    J(w) = wᵀ S_B w / wᵀ [S_W + γI] w,   S_B, S_W symmetric positive semidefinite

what does this accomplish?
• note that

    S_W = Φ Λ Φᵀ  ⇒  S_W + γI = Φ Λ Φᵀ + γ Φ Φᵀ = Φ [Λ + γI] Φᵀ

• this makes all eigenvalues positive
• the matrix is now invertible
The Rayleigh quotient
in summary

    max_w wᵀ S_B w / wᵀ S_W w,   S_B, S_W symmetric positive semidefinite

1) S_W invertible
• w* is the eigenvector of largest eigenvalue of S_W⁻¹ S_B
• the max value is λK, where λ is the largest eigenvalue
2) S_W not invertible
• regularize: S_W → S_W + γI
• w* is the eigenvector of largest eigenvalue of [S_W + γI]⁻¹ S_B
• the max value is λK, where λ is the largest eigenvalue
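Case 2 can be sketched by regularizing a deliberately singular S_W; the matrices and the value of γ below are a toy example:

```python
import numpy as np
from scipy.linalg import eigh

# hypothetical example where S_W is singular (a zero-variance direction)
S_W = np.diag([2.0, 0.0])                 # not invertible
S_B = np.outer([1.0, 1.0], [1.0, 1.0])
gamma = 1e-3                              # arbitrary regularization strength

# regularize: S_W -> S_W + gamma I, then solve as before
S_W_reg = S_W + gamma * np.eye(2)
vals, vecs = eigh(S_B, S_W_reg)
w = vecs[:, -1]

# all eigenvalues of the regularized within-class scatter are positive
assert np.all(np.linalg.eigvalsh(S_W_reg) > 0)
# and the regularized quotient at w is finite and equals vals[-1]
J = (w @ S_B @ w) / (w @ S_W_reg @ w)
assert np.isfinite(J) and np.isclose(J, vals[-1])
```

Without the γI term the quotient is unbounded along the null-space direction; with it, the maximum is finite (though large when γ is small).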
Regularized discriminant analysis
back to LDA:
• when the within-class scatter matrix is non-invertible, instead of

    J(w) = wᵀ S_B w / wᵀ S_W w,   S_B = (µ_1 − µ_0)(µ_1 − µ_0)ᵀ (between-class scatter),   S_W = Σ_1 + Σ_0 (within-class scatter)

• we use

    J(w) = wᵀ S_B w / wᵀ S_W w,   S_B = (µ_1 − µ_0)(µ_1 − µ_0)ᵀ,   S_W = Σ_1 + Σ_0 + γI (regularized within-class scatter)
Regularized discriminant analysis
this is called regularized discriminant analysis (RDA)
noting that

    S_W = Σ_1 + Σ_0 + γI = (Σ_1 + γ_1 I) + (Σ_0 + γ_0 I),   γ = γ_0 + γ_1

this can also be seen as regularizing each covariance matrix individually
the regularization parameters γ_i are determined by cross-validation
• more on this later
• basically, it means that we try several possibilities and keep the best
Principal component analysis
back to PCA: given X with one example per column
• 1) create the centered data-matrix

    X_cᵀ = (I − (1/n) 1 1ᵀ) Xᵀ = [ (x_1 − µ)ᵀ ; … ; (x_n − µ)ᵀ ]

this has one point per row
• note that the projection of all points on principal component φ is

    z = X_cᵀ φ = [ z_1 ; … ; z_n ] = [ (x_1 − µ)ᵀ φ ; … ; (x_n − µ)ᵀ φ ]
Principal component analysis
and, since

    (1/n) Σ_i z_i = (1/n) Σ_i (x_i − µ)ᵀ φ = ( (1/n) Σ_i x_i − µ )ᵀ φ = 0

the sample variance of Z is given by its norm

    var(z) = (1/n) Σ_i z_i² = (1/n) ||z||²

recall that PCA looks for the largest variance component

    max_φ ||z||² = max_φ (X_cᵀ φ)ᵀ (X_cᵀ φ) = max_φ φᵀ X_c X_cᵀ φ
Principal component analysis
recall that the sample covariance is

    Σ = (1/n) Σ_i (x_i − µ)(x_i − µ)ᵀ = (1/n) Σ_i x_i^c (x_i^c)ᵀ

where x_i^c is the ith column of X_c
this can be written as

    Σ = (1/n) [ x_1^c … x_n^c ] [ (x_1^c)ᵀ ; … ; (x_n^c)ᵀ ] = (1/n) X_c X_cᵀ
Principal component analysis
hence the PCA problem is

    max_φ φᵀ X_c X_cᵀ φ = max_φ φᵀ Σ φ   (up to the factor 1/n)

as in LDA, this can be made arbitrarily large by simply scaling φ
to normalize, we constrain φ to have unit norm

    max_φ φᵀ Σ φ   subject to   φᵀφ = 1

which is equivalent to

    max_φ φᵀ Σ φ / φᵀ φ

this shows that PCA = maximization of a Rayleigh quotient
Principal component analysis
in this case

    max_w wᵀ S_B w / wᵀ S_W w,   S_B, S_W symmetric positive semidefinite

with S_B = Σ and S_W = I
S_W is clearly invertible
• no regularization problems
• w* is the eigenvector of largest eigenvalue of S_W⁻¹ S_B
• this is just the largest eigenvector of the covariance Σ
• the max value is λ, where λ is the largest eigenvalue
The Rayleigh quotient dual
let's assume, for a moment, that the solution is of the form

    w = X_c α

• i.e. a linear combination of the centered datapoints
hence the problem is equivalent to

    max_α αᵀ X_cᵀ S_B X_c α / αᵀ X_cᵀ S_W X_c α

this does not change its form; the solution is
• α* is the eigenvector of largest eigenvalue of (X_cᵀ S_W X_c)⁻¹ X_cᵀ S_B X_c
• the max value is λK, where λ is the largest eigenvalue
The Rayleigh quotient dual
for PCA
• S_W = I and S_B = Σ = (1/n) X_c X_cᵀ
• the solution satisfies

    S_B w = λ S_W w  ⇔  (1/n) X_c X_cᵀ w = λ w  ⇔  w = X_c α,   with α = (1/(nλ)) X_cᵀ w

• and, therefore, we have
• w*: eigenvector of S_B = X_c X_cᵀ
• α*: eigenvector of (X_cᵀ S_W X_c)⁻¹ X_cᵀ S_B X_c = (X_cᵀ X_c)⁻¹ X_cᵀ X_c X_cᵀ X_c = X_cᵀ X_c
• i.e. we have two alternative manners in which to compute PCA
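The two routes can be checked against each other in numpy; the sizes below are chosen so that the dual problem is the smaller one (illustrative only):

```python
import numpy as np

rng = np.random.default_rng(2)
d, n = 50, 10          # high-dimensional data, few points: the dual is cheaper
X = rng.standard_normal((d, n))
Xc = X - X.mean(axis=1, keepdims=True)

# primal: top eigenvector of Xc Xc^T (a d x d problem)
_, phi = np.linalg.eigh(Xc @ Xc.T)
phi = phi[:, -1]

# dual: top eigenvector alpha of Xc^T Xc (an n x n problem), then phi = Xc alpha
_, A = np.linalg.eigh(Xc.T @ Xc)
phi_dual = Xc @ A[:, -1]
phi_dual /= np.linalg.norm(phi_dual)

# the two routes give the same principal component (up to sign)
assert np.isclose(abs(phi @ phi_dual), 1.0)
```

When n ≪ d, solving the n × n dual problem and mapping back with φ = X_c α is much cheaper than the d × d primal problem.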
Principal component analysis
primal
• assemble the matrix Σ = X_c X_cᵀ
• compute its eigenvectors φ_i
• these are the principal components

dual
• assemble the matrix Κ = X_cᵀ X_c
• compute its eigenvectors α_i
• the principal components are φ_i = X_c α_i

in both cases we have an eigenvalue problem
• primal on the sum of outer products: Σ = Σ_i x_i^c (x_i^c)ᵀ
• dual on the matrix of inner products: K_ij = (x_i^c)ᵀ x_j^c
The Rayleigh quotient
this is a property that holds for many Rayleigh quotient problems
• the primal solution is a linear combination of datapoints
• the dual solution only depends on dot-products of the datapoints
whenever both of these hold
• the problem can be kernelized
• this has various interesting properties
• we will talk about them
many examples
• kernel PCA, kernel LDA, manifold learning, etc.