Principal Component Analysis
Jieping Ye
Department of Computer Science and Engineering
Arizona State University
http://www.public.asu.edu/~jye02
Outline of lecture
• What is feature reduction?
• Why feature reduction?
• Feature reduction algorithms
• Principal Component Analysis (PCA)
• Nonlinear PCA using Kernels
What is feature reduction?
• Feature reduction refers to the mapping of the original high-dimensional data onto a lower-dimensional space.
  – The criterion for feature reduction differs across problem settings:
    • Unsupervised setting: minimize the information loss.
    • Supervised setting: maximize the class discrimination.
• Given a set of data points of $p$ variables

  $$x_1, x_2, \ldots, x_n \in \mathbb{R}^p,$$

  compute the linear transformation (projection)

  $$G \in \mathbb{R}^{p \times d}: \quad x \in \mathbb{R}^p \;\mapsto\; y = G^T x \in \mathbb{R}^d \qquad (d \ll p).$$
What is feature reduction?
Linear transformation $G^T \in \mathbb{R}^{d \times p}$:

$$\text{original data } X \in \mathbb{R}^{p} \;\longrightarrow\; \text{reduced data } Y = G^T X \in \mathbb{R}^{d}.$$
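For concreteness, a minimal NumPy sketch (not part of the original slides) of applying such a projection, with an arbitrary orthonormal $G$ standing in for the one PCA will later produce:

```python
import numpy as np

p, d, n = 10, 2, 100                            # original dim, reduced dim, sample size
rng = np.random.default_rng(0)

X = rng.normal(size=(p, n))                     # original data, one point per column
G, _ = np.linalg.qr(rng.normal(size=(p, d)))    # some p x d matrix with orthonormal columns

Y = G.T @ X                                     # reduced data, shape (d, n)
print(Y.shape)                                  # (2, 100)
```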
High-dimensional data
Examples: gene expression data, face images, handwritten digits.
Outline of lecture
• What is feature reduction?
• Why feature reduction?
• Feature reduction algorithms
• Principal Component Analysis
• Nonlinear PCA using Kernels
Why feature reduction?
• Most machine learning and data mining techniques may not be effective for high-dimensional data.
  – Curse of dimensionality.
  – Query accuracy and efficiency degrade rapidly as the dimension increases.

• The intrinsic dimension may be small.
  – For example, the number of genes responsible for a certain type of disease may be small.
Why feature reduction?
• Visualization: projection of high-dimensional data onto 2D or 3D.
• Data compression: efficient storage and retrieval.
• Noise removal: positive effect on query accuracy.
Applications of feature reduction
• Face recognition
• Handwritten digit recognition
• Text mining
• Image retrieval
• Microarray data analysis
• Protein classification
Outline of lecture
• What is feature reduction?
• Why feature reduction?
• Feature reduction algorithms
• Principal Component Analysis
• Nonlinear PCA using Kernels
Feature reduction algorithms
• Unsupervised
  – Latent Semantic Indexing (LSI): truncated SVD
  – Independent Component Analysis (ICA)
  – Principal Component Analysis (PCA)
  – Canonical Correlation Analysis (CCA)

• Supervised
  – Linear Discriminant Analysis (LDA)

• Semi-supervised
  – Research topic
Outline of lecture
• What is feature reduction?
• Why feature reduction?
• Feature reduction algorithms
• Principal Component Analysis
• Nonlinear PCA using Kernels
What is Principal Component Analysis?
• Principal component analysis (PCA)
  – Reduces the dimensionality of a data set by finding a new set of variables, smaller than the original set of variables.
  – Retains most of the sample's information.
  – Useful for the compression and classification of data.

• By information we mean the variation present in the sample, given by the correlations between the original variables.
  – The new variables, called principal components (PCs), are uncorrelated, and are ordered by the fraction of the total information each retains.
Geometric picture of principal components (PCs)
[Figure: 2D data cloud with the first and second principal component directions $z_1$ and $z_2$.]
• the 1st PC is a minimum distance fit to a line in X space
• the 2nd PC is a minimum distance fit to a line in the plane perpendicular to the 1st PC
PCs are a series of linear least-squares fits to a sample, each orthogonal to all the previous ones.
Algebraic definition of PCs
Given a sample of $n$ observations on a vector of $p$ variables

$$x_1, x_2, \ldots, x_n \in \mathbb{R}^p,$$

define the first principal component of the sample by the linear transformation

$$z_1 = a_1^T x_j = \sum_{i=1}^{p} a_{i1} x_{ij}, \qquad j = 1, 2, \ldots, n,$$

where the vector $a_1 = (a_{11}, a_{21}, \ldots, a_{p1})^T$ is chosen such that $\mathrm{var}[z_1]$ is maximal, and $x_j = (x_{1j}, x_{2j}, \ldots, x_{pj})^T$.
Algebraic derivation of PCs
To find $a_1$, first note that

$$\mathrm{var}[z_1] = E\!\left[(z_1 - \bar{z}_1)^2\right]
 = \frac{1}{n}\sum_{i=1}^{n}\left(a_1^T x_i - a_1^T \bar{x}\right)^2
 = \frac{1}{n}\sum_{i=1}^{n} a_1^T (x_i - \bar{x})(x_i - \bar{x})^T a_1
 = a_1^T S a_1,$$

where

$$S = \frac{1}{n}\sum_{i=1}^{n} (x_i - \bar{x})(x_i - \bar{x})^T$$

is the covariance matrix and $\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i$ is the mean.

In the following, we assume the data is centered: $\bar{x} = 0$.
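A quick numerical check of this identity (a sketch, not part of the slides): for centered data, the sample variance of the projections $a_1^T x_i$ equals $a_1^T S a_1$ for any unit direction $a_1$.

```python
import numpy as np

rng = np.random.default_rng(1)
p, n = 5, 1000
X = rng.normal(size=(p, n))
X = X - X.mean(axis=1, keepdims=True)      # center the data so that x_bar = 0

S = (X @ X.T) / n                          # covariance matrix S = (1/n) sum_i x_i x_i^T

a1 = rng.normal(size=p)
a1 /= np.linalg.norm(a1)                   # unit-norm direction

z1 = a1 @ X                                # projections z_1 = a_1^T x_i
print(np.allclose(z1.var(), a1 @ S @ a1))  # True: var[z_1] = a_1^T S a_1
```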
Algebraic derivation of PCs
Assume $\bar{x} = 0$. Form the matrix

$$X = [x_1, x_2, \ldots, x_n] \in \mathbb{R}^{p \times n},$$

then

$$S = \frac{1}{n} X X^T.$$

Obtain the eigenvectors of $S$ by computing the SVD of $X$:

$$X = U \Sigma V^T.$$
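A sketch of this SVD route in NumPy (illustrative, not from the slides): the left singular vectors of the centered $X$ are the eigenvectors of $S$, with eigenvalues $\sigma_i^2 / n$.

```python
import numpy as np

rng = np.random.default_rng(2)
p, n = 6, 200
X = rng.normal(size=(p, n))
X = X - X.mean(axis=1, keepdims=True)       # assume centered data

S = (X @ X.T) / n

# Eigen-decomposition of S directly
evals, evecs = np.linalg.eigh(S)
order = np.argsort(evals)[::-1]             # sort eigenvalues in decreasing order
evals, evecs = evals[order], evecs[:, order]

# Same eigenvectors via the SVD of X
U, sigma, Vt = np.linalg.svd(X, full_matrices=False)
print(np.allclose(evals, sigma**2 / n))     # eigenvalues of S equal sigma_i^2 / n
# Columns of U match the eigenvectors of S up to sign
print(np.allclose(np.abs(U.T @ evecs), np.eye(p)))
```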
Algebraic derivation of PCs

To find the $a_1$ that maximizes $\mathrm{var}[z_1]$ subject to $a_1^T a_1 = 1$, let $\lambda$ be a Lagrange multiplier and maximize

$$L = a_1^T S a_1 - \lambda\,(a_1^T a_1 - 1).$$

Setting the derivative to zero,

$$\frac{\partial L}{\partial a_1} = 2 S a_1 - 2\lambda a_1 = 0
 \quad\Longrightarrow\quad (S - \lambda I_p)\, a_1 = 0,$$

therefore $a_1$ is an eigenvector of $S$ corresponding to the largest eigenvalue $\lambda = \lambda_1$.
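As a quick sanity check of this result (an illustrative sketch, not part of the slides): the leading eigenvector of $S$ attains the largest value of $a^T S a$ among unit vectors, so no randomly drawn unit direction should beat it.

```python
import numpy as np

rng = np.random.default_rng(8)
p, n = 6, 2000
X = rng.normal(size=(p, n)) * np.linspace(2.5, 0.5, p)[:, None]   # unequal variances
X = X - X.mean(axis=1, keepdims=True)
S = (X @ X.T) / n

evals, evecs = np.linalg.eigh(S)
a1 = evecs[:, -1]                               # eigenvector of the largest eigenvalue

# No random unit direction should achieve a larger variance a^T S a than a1
trials = rng.normal(size=(1000, p))
trials /= np.linalg.norm(trials, axis=1, keepdims=True)
variances = np.einsum('ij,jk,ik->i', trials, S, trials)
print((variances <= a1 @ S @ a1 + 1e-12).all())  # True
```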
Algebraic derivation of PCs

To find the next coefficient vector $a_2$ maximizing $\mathrm{var}[z_2]$ subject to $\mathrm{cov}[z_2, z_1] = 0$ (the PCs are uncorrelated) and to $a_2^T a_2 = 1$, first note that

$$\mathrm{cov}[z_2, z_1] = a_1^T S a_2 = \lambda_1\, a_1^T a_2,$$

then let $\lambda$ and $\phi$ be Lagrange multipliers, and maximize

$$L = a_2^T S a_2 - \lambda\,(a_2^T a_2 - 1) - \phi\, a_2^T a_1.$$
Algebraic derivation of PCs
$$L = a_2^T S a_2 - \lambda\,(a_2^T a_2 - 1) - \phi\, a_2^T a_1$$

$$\frac{\partial L}{\partial a_2} = 2 S a_2 - 2\lambda a_2 - \phi a_1 = 0.$$

Left-multiplying by $a_1^T$ and using $a_1^T a_2 = 0$ gives $\phi = 0$, hence

$$S a_2 = \lambda a_2 \quad\text{and}\quad a_2^T S a_2 = \lambda.$$

We find that $a_2$ is also an eigenvector of $S$, whose eigenvalue $\lambda_2$ is the second largest.

Algebraic derivation of PCs

In general:

• The $k$th largest eigenvalue $\lambda_k$ of $S$ is the variance of the $k$th PC: $\mathrm{var}[z_k] = a_k^T S a_k = \lambda_k$.

• The $k$th PC $z_k$ retains the $k$th greatest fraction of the variation in the sample.
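A small numerical check of these two facts (a sketch, not from the slides): the sample variance of each PC score equals the corresponding eigenvalue.

```python
import numpy as np

rng = np.random.default_rng(3)
p, n = 4, 5000
X = rng.normal(size=(p, n)) * np.array([[3.0], [2.0], [1.0], [0.5]])  # unequal variances
X = X - X.mean(axis=1, keepdims=True)

S = (X @ X.T) / n
evals, A = np.linalg.eigh(S)
evals, A = evals[::-1], A[:, ::-1]           # eigenvalues in decreasing order

Z = A.T @ X                                  # rows are the PC scores z_1, ..., z_p
print(np.allclose(Z.var(axis=1), evals))     # True: var[z_k] = lambda_k for every k
```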
Algebraic derivation of PCs
• Main steps for computing PCs:
  – Form the covariance matrix $S$.
  – Compute its eigenvectors $\{a_i\}_{i=1}^{p}$.
  – Use the first $d$ eigenvectors $\{a_i\}_{i=1}^{d}$ to form the $d$ PCs.
  – The transformation $G$ is given by $G = [a_1, a_2, \ldots, a_d]$.

A test point $x \in \mathbb{R}^p \;\mapsto\; G^T x \in \mathbb{R}^d$.
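These steps can be collected into a short function; the following is a minimal sketch following the slides' convention of one data point per column (the function name and interface are my own, not from the lecture):

```python
import numpy as np

def pca(X, d):
    """PCA following the steps above: X is p x n (one data point per column)."""
    x_bar = X.mean(axis=1, keepdims=True)
    Xc = X - x_bar                            # center the data
    S = (Xc @ Xc.T) / X.shape[1]              # covariance matrix
    evals, evecs = np.linalg.eigh(S)          # eigenvectors of S
    order = np.argsort(evals)[::-1]
    G = evecs[:, order[:d]]                   # first d eigenvectors: G = [a_1, ..., a_d]
    return G, x_bar

# Usage: map a test point x in R^p down to R^d
rng = np.random.default_rng(4)
X = rng.normal(size=(10, 300))
G, x_bar = pca(X, d=2)
x_test = rng.normal(size=(10, 1))
y = G.T @ (x_test - x_bar)                    # test point mapped to R^d
print(y.shape)                                # (2, 1)
```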
Optimality property of PCA
Dimension reduction:

$$Y = G^T X \in \mathbb{R}^{d \times n}, \qquad G \in \mathbb{R}^{p \times d}, \quad X \in \mathbb{R}^{p \times n} \ \text{(original data)}.$$

Reconstruction:

$$\tilde{X} = G\,Y = G\,G^T X \in \mathbb{R}^{p \times n}.$$
Optimality property of PCA
Main theoretical result: the matrix $G$ consisting of the first $d$ eigenvectors of the covariance matrix $S$ solves the following minimization problem:

$$\min_{G \in \mathbb{R}^{p \times d}} \ \big\| X - G\,G^T X \big\|_F^2 \qquad \text{subject to} \quad G^T G = I_d,$$

where $\|X - G\,G^T X\|_F^2$ is the reconstruction error.

PCA projection minimizes the reconstruction error among all linear projections of size $d$.
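A numerical illustration of this optimality (a sketch, not from the slides): the reconstruction error with the PCA basis is never larger than with a random orthonormal basis of the same size $d$.

```python
import numpy as np

rng = np.random.default_rng(5)
p, n, d = 8, 500, 3
X = rng.normal(size=(p, n)) * np.linspace(3.0, 0.3, p)[:, None]
X = X - X.mean(axis=1, keepdims=True)

S = (X @ X.T) / n
evals, evecs = np.linalg.eigh(S)
G_pca = evecs[:, np.argsort(evals)[::-1][:d]]         # first d eigenvectors of S

G_rand, _ = np.linalg.qr(rng.normal(size=(p, d)))     # random orthonormal G of the same size

def recon_error(G):
    return np.linalg.norm(X - G @ (G.T @ X), 'fro')**2

print(recon_error(G_pca) <= recon_error(G_rand))      # True
```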
Applications of PCA
• Eigenfaces for recognition. Turk and Pentland. 1991.
• Principal Component Analysis for clustering gene expression data. Yeung and Ruzzo. 2001.
• Probabilistic Disease Classification of Expression-Dependent Proteomic Data from Mass Spectrometry of Human Serum. Lilien. 2003.
PCA for image compression
[Figure: reconstructions of a sample image using d = 1, 2, 4, 8, 16, 32, 64, and 100 principal components, shown alongside the original image.]
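A sketch of the kind of experiment behind this figure, using a synthetic grayscale array in place of the actual image (which is not available here); columns are treated as observations, and the reconstruction error shrinks as $d$ grows:

```python
import numpy as np

rng = np.random.default_rng(6)
# Synthetic stand-in for a grayscale image: rows are treated as p variables,
# columns as n observations (a real image array would work the same way).
img = (np.outer(np.sin(np.linspace(0, 3, 64)), np.cos(np.linspace(0, 5, 96)))
       + 0.1 * rng.normal(size=(64, 96)))

mean = img.mean(axis=1, keepdims=True)
U, sigma, Vt = np.linalg.svd(img - mean, full_matrices=False)

for d in (1, 2, 4, 8, 16, 32, 64):
    G = U[:, :d]                                  # first d principal directions
    recon = G @ (G.T @ (img - mean)) + mean       # compressed reconstruction
    err = np.linalg.norm(img - recon) / np.linalg.norm(img)
    print(f"d={d:3d}  relative error={err:.4f}")
```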
Outline of lecture
• What is feature reduction?
• Why feature reduction?
• Feature reduction algorithms
• Principal Component Analysis
• Nonlinear PCA using Kernels
Motivation
Linear projections will not detect the pattern.
Nonlinear PCA using Kernels
• Traditional PCA applies a linear transformation.
  – May not be effective for nonlinear data.

• Solution: apply a nonlinear transformation to a potentially very high-dimensional feature space:

  $$\Phi: x \mapsto \Phi(x).$$

• Computational efficiency: apply the kernel trick.
  – Requires that PCA can be rewritten in terms of dot products:

  $$K(x_i, x_j) = \Phi(x_i)^T \Phi(x_j).$$

  (More on kernels later.)
Nonlinear PCA using Kernels
Rewrite PCA in terms of dot product
Assume the data has been centered, i.e., $\sum_i x_i = 0$.

The covariance matrix $S$ can be written as

$$S = \frac{1}{n} \sum_i x_i x_i^T.$$

Let $v$ be an eigenvector of $S$ corresponding to a nonzero eigenvalue $\lambda$:

$$S v = \frac{1}{n} \sum_i x_i x_i^T v = \lambda v
 \quad\Longrightarrow\quad
 v = \frac{1}{n\lambda} \sum_i \left(x_i^T v\right) x_i.$$

Eigenvectors of $S$ lie in the space spanned by all data points.
Nonlinear PCA using Kernels
The covariance matrix can be written in matrix form:

$$S = \frac{1}{n} X X^T, \qquad \text{where } X = [x_1, x_2, \ldots, x_n].$$

Since the eigenvectors lie in the span of the data, write $v = \sum_i \alpha_i x_i = X\alpha$. Then

$$S v = \frac{1}{n} X X^T X \alpha = \lambda X \alpha,$$

and left-multiplying by $X^T$ gives

$$\frac{1}{n} (X^T X)(X^T X)\,\alpha = \lambda\,(X^T X)\,\alpha
 \quad\Longrightarrow\quad
 \frac{1}{n} (X^T X)\,\alpha = \lambda\,\alpha.$$

Any benefits?
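One immediate benefit (an illustrative sketch, not stated on the slide): when $p \gg n$, the $n \times n$ matrix $X^T X$ is far smaller than the $p \times p$ covariance matrix, yet each eigenvector of $S$ is recovered as $v = X\alpha$. The same identity is what makes the kernel trick on the next slides possible.

```python
import numpy as np

rng = np.random.default_rng(9)
p, n = 2000, 30                                # many more variables than observations
X = rng.normal(size=(p, n))
X = X - X.mean(axis=1, keepdims=True)          # centered data

# Solve the small n x n eigenproblem (1/n) X^T X alpha = lambda alpha ...
lam, alpha = np.linalg.eigh((X.T @ X) / n)
lam, alpha = lam[::-1], alpha[:, ::-1]

# ... and recover an eigenvector of S = (1/n) X X^T via v = X alpha
v = X @ alpha[:, 0]
v /= np.linalg.norm(v)

S = (X @ X.T) / n                              # formed here only to verify the claim
print(np.allclose(S @ v, lam[0] * v))          # True
```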
Nonlinear PCA using Kernels
Next consider the feature space $\Phi: x \mapsto \Phi(x)$:

$$S = \frac{1}{n} X X^T, \qquad \text{where } X = [\Phi(x_1), \Phi(x_2), \ldots, \Phi(x_n)], \qquad v = \sum_i \alpha_i \Phi(x_i) = X\alpha.$$

The $(i,j)$-th entry of $X^T X$ is $\Phi(x_i)^T \Phi(x_j)$.

Apply the kernel trick, $K(x_i, x_j) = \Phi(x_i)^T \Phi(x_j)$, and the eigenproblem becomes

$$\frac{1}{n} K \alpha = \lambda \alpha,$$

where $K$ is called the kernel matrix.
Nonlinear PCA using Kernels
• Projection of a test point x onto v:
$$v^T \Phi(x) = \sum_i \alpha_i\, \Phi(x_i)^T \Phi(x) = \sum_i \alpha_i\, K(x_i, x)$$
Explicit mapping is not required here.
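A minimal sketch of the whole kernel PCA pipeline described above, assuming an RBF kernel (the kernel choice, the hypothetical `rbf_kernel` helper, and the omission of feature-space centering are assumptions, not part of the slides):

```python
import numpy as np

def rbf_kernel(A, B, gamma=0.5):
    """Hypothetical helper: K(a, b) = exp(-gamma * ||a - b||^2); A is m x p, B is l x p."""
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(axis=-1)
    return np.exp(-gamma * d2)

rng = np.random.default_rng(7)
n, p = 50, 2
Xpts = rng.normal(size=(n, p))                 # training points, one per row

K = rbf_kernel(Xpts, Xpts)                     # kernel matrix K_ij = Phi(x_i)^T Phi(x_j)
# (The slides assume the mapped data is centered; in practice K is also double-centered.)

lam, alpha = np.linalg.eigh(K / n)             # solve (1/n) K alpha = lambda alpha
lam, alpha = lam[::-1], alpha[:, ::-1]         # largest eigenvalue first

# Projection of a test point x onto the leading eigenvector v = sum_i alpha_i Phi(x_i),
# using only kernel evaluations (no explicit mapping Phi):
x_new = rng.normal(size=(1, p))
k_new = rbf_kernel(Xpts, x_new).ravel()        # K(x_i, x) for all training points
a1 = alpha[:, 0] / np.sqrt(n * lam[0])         # rescale so that ||v|| = 1
projection = a1 @ k_new                        # v^T Phi(x) = sum_i a1_i K(x_i, x)
print(projection)
```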
Reference
• Principal Component Analysis. I.T. Jolliffe.
• Kernel Principal Component Analysis. Schölkopf, et al.
• Geometric Methods for Feature Extraction and Dimensional Reduction. Burges.