Principal Component Analysis
Jieping Ye
Department of Computer Science and Engineering
Arizona State University
http://www.public.asu.edu/~jye02
Outline of lecture
• What is feature reduction?
• Why feature reduction?
• Feature reduction algorithms
• Principal Component Analysis (PCA)
• Nonlinear PCA using Kernels
What is feature reduction?
• Feature reduction refers to the mapping of the original high-dimensional data onto a lower-dimensional space.
  – The criterion for feature reduction differs across problem settings:
    • Unsupervised setting: minimize the information loss.
    • Supervised setting: maximize the class discrimination.
• Given a set of data points of $p$ variables

  $$x_1, x_2, \ldots, x_n \in \mathbb{R}^p,$$

  compute the linear transformation (projection)

  $$G \in \mathbb{R}^{p \times d}: \quad x \in \mathbb{R}^p \;\mapsto\; y = G^T x \in \mathbb{R}^d \qquad (d \ll p).$$
What is feature reduction?
Linear transformation $G^T \in \mathbb{R}^{d \times p}$:

$$\text{original data } X \in \mathbb{R}^{p} \;\longrightarrow\; \text{reduced data } Y = G^T X \in \mathbb{R}^{d}.$$
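For concreteness, a minimal NumPy sketch (not part of the original slides) of applying such a projection, with an arbitrary orthonormal $G$ standing in for the one PCA will later produce:

```python
import numpy as np

p, d, n = 10, 2, 100                            # original dim, reduced dim, sample size
rng = np.random.default_rng(0)

X = rng.normal(size=(p, n))                     # original data, one point per column
G, _ = np.linalg.qr(rng.normal(size=(p, d)))    # some p x d matrix with orthonormal columns

Y = G.T @ X                                     # reduced data, shape (d, n)
print(Y.shape)                                  # (2, 100)
```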
High-dimensional data
Examples: gene expression data, face images, handwritten digits.
Outline of lecture
• What is feature reduction?
• Why feature reduction?
• Feature reduction algorithms
• Principal Component Analysis
• Nonlinear PCA using Kernels
Why feature reduction?
• Most machine learning and data mining techniques may not be effective for high-dimensional data.
  – Curse of dimensionality.
  – Query accuracy and efficiency degrade rapidly as the dimension increases.

• The intrinsic dimension may be small.
  – For example, the number of genes responsible for a certain type of disease may be small.
Why feature reduction?
• Visualization: projection of high-dimensional data onto 2D or 3D.
• Data compression: efficient storage and retrieval.
• Noise removal: positive effect on query accuracy.
Applications of feature reduction
• Face recognition
• Handwritten digit recognition
• Text mining
• Image retrieval
• Microarray data analysis
• Protein classification
Outline of lecture
• What is feature reduction?
• Why feature reduction?
• Feature reduction algorithms
• Principal Component Analysis
• Nonlinear PCA using Kernels
Feature reduction algorithms
• Unsupervised
  – Latent Semantic Indexing (LSI): truncated SVD
  – Independent Component Analysis (ICA)
  – Principal Component Analysis (PCA)
  – Canonical Correlation Analysis (CCA)

• Supervised
  – Linear Discriminant Analysis (LDA)

• Semi-supervised
  – Research topic
Outline of lecture
• What is feature reduction?
• Why feature reduction?
• Feature reduction algorithms
• Principal Component Analysis
• Nonlinear PCA using Kernels
What is Principal Component Analysis?
• Principal component analysis (PCA)
  – Reduces the dimensionality of a data set by finding a new set of variables, smaller than the original set of variables.
  – Retains most of the sample's information.
  – Useful for the compression and classification of data.

• By information we mean the variation present in the sample, given by the correlations between the original variables.
  – The new variables, called principal components (PCs), are uncorrelated, and are ordered by the fraction of the total information each retains.
Geometric picture of principal components (PCs)
[Figure: 2D data cloud with the first and second principal component directions $z_1$ and $z_2$.]
• the 1st PC is a minimum distance fit to a line in X space
• the 2nd PC is a minimum distance fit to a line in the plane perpendicular to the 1st PC
PCs are a series of linear least-squares fits to a sample, each orthogonal to all the previous ones.
Algebraic definition of PCs
Given a sample of $n$ observations on a vector of $p$ variables

$$x_1, x_2, \ldots, x_n \in \mathbb{R}^p,$$

define the first principal component of the sample by the linear transformation

$$z_1 = a_1^T x_j = \sum_{i=1}^{p} a_{i1} x_{ij}, \qquad j = 1, 2, \ldots, n,$$

where the vector $a_1 = (a_{11}, a_{21}, \ldots, a_{p1})^T$ is chosen such that $\mathrm{var}[z_1]$ is maximal, and $x_j = (x_{1j}, x_{2j}, \ldots, x_{pj})^T$.
Algebraic derivation of PCs
To find $a_1$, first note that

$$\mathrm{var}[z_1] = E\!\left[(z_1 - \bar{z}_1)^2\right]
 = \frac{1}{n}\sum_{i=1}^{n}\left(a_1^T x_i - a_1^T \bar{x}\right)^2
 = \frac{1}{n}\sum_{i=1}^{n} a_1^T (x_i - \bar{x})(x_i - \bar{x})^T a_1
 = a_1^T S a_1,$$

where

$$S = \frac{1}{n}\sum_{i=1}^{n} (x_i - \bar{x})(x_i - \bar{x})^T$$

is the covariance matrix and $\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i$ is the mean.

In the following, we assume the data is centered: $\bar{x} = 0$.
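A quick numerical check of this identity (a sketch, not part of the slides): for centered data, the sample variance of the projections $a_1^T x_i$ equals $a_1^T S a_1$ for any unit direction $a_1$.

```python
import numpy as np

rng = np.random.default_rng(1)
p, n = 5, 1000
X = rng.normal(size=(p, n))
X = X - X.mean(axis=1, keepdims=True)      # center the data so that x_bar = 0

S = (X @ X.T) / n                          # covariance matrix S = (1/n) sum_i x_i x_i^T

a1 = rng.normal(size=p)
a1 /= np.linalg.norm(a1)                   # unit-norm direction

z1 = a1 @ X                                # projections z_1 = a_1^T x_i
print(np.allclose(z1.var(), a1 @ S @ a1))  # True: var[z_1] = a_1^T S a_1
```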
Algebraic derivation of PCs
Assume $\bar{x} = 0$. Form the matrix

$$X = [x_1, x_2, \ldots, x_n] \in \mathbb{R}^{p \times n},$$

then

$$S = \frac{1}{n} X X^T.$$

Obtain the eigenvectors of $S$ by computing the SVD of $X$:

$$X = U \Sigma V^T.$$
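A sketch of this SVD route in NumPy (illustrative, not from the slides): the left singular vectors of the centered $X$ are the eigenvectors of $S$, with eigenvalues $\sigma_i^2 / n$.

```python
import numpy as np

rng = np.random.default_rng(2)
p, n = 6, 200
X = rng.normal(size=(p, n))
X = X - X.mean(axis=1, keepdims=True)       # assume centered data

S = (X @ X.T) / n

# Eigen-decomposition of S directly
evals, evecs = np.linalg.eigh(S)
order = np.argsort(evals)[::-1]             # sort eigenvalues in decreasing order
evals, evecs = evals[order], evecs[:, order]

# Same eigenvectors via the SVD of X
U, sigma, Vt = np.linalg.svd(X, full_matrices=False)
print(np.allclose(evals, sigma**2 / n))     # eigenvalues of S equal sigma_i^2 / n
# Columns of U match the eigenvectors of S up to sign
print(np.allclose(np.abs(U.T @ evecs), np.eye(p)))
```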
Algebraic derivation of PCs

To find the $a_1$ that maximizes $\mathrm{var}[z_1]$ subject to $a_1^T a_1 = 1$, let $\lambda$ be a Lagrange multiplier and maximize

$$L = a_1^T S a_1 - \lambda\,(a_1^T a_1 - 1).$$

Setting the derivative to zero,

$$\frac{\partial L}{\partial a_1} = 2 S a_1 - 2\lambda a_1 = 0
 \quad\Longrightarrow\quad (S - \lambda I_p)\, a_1 = 0,$$

therefore $a_1$ is an eigenvector of $S$ corresponding to the largest eigenvalue $\lambda = \lambda_1$.
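As a quick sanity check of this result (an illustrative sketch, not part of the slides): the leading eigenvector of $S$ attains the largest value of $a^T S a$ among unit vectors, so no randomly drawn unit direction should beat it.

```python
import numpy as np

rng = np.random.default_rng(8)
p, n = 6, 2000
X = rng.normal(size=(p, n)) * np.linspace(2.5, 0.5, p)[:, None]   # unequal variances
X = X - X.mean(axis=1, keepdims=True)
S = (X @ X.T) / n

evals, evecs = np.linalg.eigh(S)
a1 = evecs[:, -1]                               # eigenvector of the largest eigenvalue

# No random unit direction should achieve a larger variance a^T S a than a1
trials = rng.normal(size=(1000, p))
trials /= np.linalg.norm(trials, axis=1, keepdims=True)
variances = np.einsum('ij,jk,ik->i', trials, S, trials)
print((variances <= a1 @ S @ a1 + 1e-12).all())  # True
```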
Algebraic derivation of PCs

To find the next coefficient vector $a_2$ maximizing $\mathrm{var}[z_2]$ subject to $\mathrm{cov}[z_2, z_1] = 0$ (the PCs are uncorrelated) and to $a_2^T a_2 = 1$, first note that

$$\mathrm{cov}[z_2, z_1] = a_1^T S a_2 = \lambda_1\, a_1^T a_2,$$

then let $\lambda$ and $\phi$ be Lagrange multipliers, and maximize

$$L = a_2^T S a_2 - \lambda\,(a_2^T a_2 - 1) - \phi\, a_2^T a_1.$$
Algebraic derivation of PCs
$$L = a_2^T S a_2 - \lambda\,(a_2^T a_2 - 1) - \phi\, a_2^T a_1$$

$$\frac{\partial L}{\partial a_2} = 2 S a_2 - 2\lambda a_2 - \phi a_1 = 0.$$

Left-multiplying by $a_1^T$ and using $a_1^T a_2 = 0$ gives $\phi = 0$, hence

$$S a_2 = \lambda a_2 \quad\text{and}\quad a_2^T S a_2 = \lambda.$$

We find that $a_2$ is also an eigenvector of $S$, whose eigenvalue $\lambda_2$ is the second largest.

Algebraic derivation of PCs

In general:

• The $k$th largest eigenvalue $\lambda_k$ of $S$ is the variance of the $k$th PC: $\mathrm{var}[z_k] = a_k^T S a_k = \lambda_k$.

• The $k$th PC $z_k$ retains the $k$th greatest fraction of the variation in the sample.
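A small numerical check of these two facts (a sketch, not from the slides): the sample variance of each PC score equals the corresponding eigenvalue.

```python
import numpy as np

rng = np.random.default_rng(3)
p, n = 4, 5000
X = rng.normal(size=(p, n)) * np.array([[3.0], [2.0], [1.0], [0.5]])  # unequal variances
X = X - X.mean(axis=1, keepdims=True)

S = (X @ X.T) / n
evals, A = np.linalg.eigh(S)
evals, A = evals[::-1], A[:, ::-1]           # eigenvalues in decreasing order

Z = A.T @ X                                  # rows are the PC scores z_1, ..., z_p
print(np.allclose(Z.var(axis=1), evals))     # True: var[z_k] = lambda_k for every k
```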
Algebraic derivation of PCs
• Main steps for computing PCs:
  – Form the covariance matrix $S$.
  – Compute its eigenvectors $\{a_i\}_{i=1}^{p}$.
  – Use the first $d$ eigenvectors $\{a_i\}_{i=1}^{d}$ to form the $d$ PCs.
  – The transformation $G$ is given by $G = [a_1, a_2, \ldots, a_d]$.

A test point $x \in \mathbb{R}^p \;\mapsto\; G^T x \in \mathbb{R}^d$.
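These steps can be collected into a short function; the following is a minimal sketch following the slides' convention of one data point per column (the function name and interface are my own, not from the lecture):

```python
import numpy as np

def pca(X, d):
    """PCA following the steps above: X is p x n (one data point per column)."""
    x_bar = X.mean(axis=1, keepdims=True)
    Xc = X - x_bar                            # center the data
    S = (Xc @ Xc.T) / X.shape[1]              # covariance matrix
    evals, evecs = np.linalg.eigh(S)          # eigenvectors of S
    order = np.argsort(evals)[::-1]
    G = evecs[:, order[:d]]                   # first d eigenvectors: G = [a_1, ..., a_d]
    return G, x_bar

# Usage: map a test point x in R^p down to R^d
rng = np.random.default_rng(4)
X = rng.normal(size=(10, 300))
G, x_bar = pca(X, d=2)
x_test = rng.normal(size=(10, 1))
y = G.T @ (x_test - x_bar)                    # test point mapped to R^d
print(y.shape)                                # (2, 1)
```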
Optimality property of PCA
Dimension reduction:

$$Y = G^T X \in \mathbb{R}^{d \times n}, \qquad G \in \mathbb{R}^{p \times d}, \quad X \in \mathbb{R}^{p \times n} \ \text{(original data)}.$$

Reconstruction:

$$\tilde{X} = G\,Y = G\,G^T X \in \mathbb{R}^{p \times n}.$$
Optimality property of PCA
Main theoretical result: the matrix $G$ consisting of the first $d$ eigenvectors of the covariance matrix $S$ solves the following minimization problem:

$$\min_{G \in \mathbb{R}^{p \times d}} \ \big\| X - G\,G^T X \big\|_F^2 \qquad \text{subject to} \quad G^T G = I_d,$$

where $\|X - G\,G^T X\|_F^2$ is the reconstruction error.

PCA projection minimizes the reconstruction error among all linear projections of size $d$.
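A numerical illustration of this optimality (a sketch, not from the slides): the reconstruction error with the PCA basis is never larger than with a random orthonormal basis of the same size $d$.

```python
import numpy as np

rng = np.random.default_rng(5)
p, n, d = 8, 500, 3
X = rng.normal(size=(p, n)) * np.linspace(3.0, 0.3, p)[:, None]
X = X - X.mean(axis=1, keepdims=True)

S = (X @ X.T) / n
evals, evecs = np.linalg.eigh(S)
G_pca = evecs[:, np.argsort(evals)[::-1][:d]]         # first d eigenvectors of S

G_rand, _ = np.linalg.qr(rng.normal(size=(p, d)))     # random orthonormal G of the same size

def recon_error(G):
    return np.linalg.norm(X - G @ (G.T @ X), 'fro')**2

print(recon_error(G_pca) <= recon_error(G_rand))      # True
```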
Applications of PCA
• Eigenfaces for recognition. Turk and Pentland. 1991.
• Principal Component Analysis for clustering gene expression data. Yeung and Ruzzo. 2001.
• Probabilistic Disease Classification of Expression-Dependent Proteomic Data from Mass Spectrometry of Human Serum. Lilien. 2003.
PCA for image compression
[Figure: reconstructions of a sample image using d = 1, 2, 4, 8, 16, 32, 64, and 100 principal components, shown alongside the original image.]
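A sketch of the kind of experiment behind this figure, using a synthetic grayscale array in place of the actual image (which is not available here); columns are treated as observations, and the reconstruction error shrinks as $d$ grows:

```python
import numpy as np

rng = np.random.default_rng(6)
# Synthetic stand-in for a grayscale image: rows are treated as p variables,
# columns as n observations (a real image array would work the same way).
img = (np.outer(np.sin(np.linspace(0, 3, 64)), np.cos(np.linspace(0, 5, 96)))
       + 0.1 * rng.normal(size=(64, 96)))

mean = img.mean(axis=1, keepdims=True)
U, sigma, Vt = np.linalg.svd(img - mean, full_matrices=False)

for d in (1, 2, 4, 8, 16, 32, 64):
    G = U[:, :d]                                  # first d principal directions
    recon = G @ (G.T @ (img - mean)) + mean       # compressed reconstruction
    err = np.linalg.norm(img - recon) / np.linalg.norm(img)
    print(f"d={d:3d}  relative error={err:.4f}")
```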
Outline of lecture
• What is feature reduction?
• Why feature reduction?
• Feature reduction algorithms
• Principal Component Analysis
• Nonlinear PCA using Kernels
Motivation
Linear projections will not detect the pattern.
Nonlinear PCA using Kernels
• Traditional PCA applies a linear transformation.
  – May not be effective for nonlinear data.

• Solution: apply a nonlinear transformation to a potentially very high-dimensional feature space:

  $$\Phi: x \mapsto \Phi(x).$$

• Computational efficiency: apply the kernel trick.
  – Requires that PCA can be rewritten in terms of dot products:

  $$K(x_i, x_j) = \Phi(x_i)^T \Phi(x_j).$$

  (More on kernels later.)
Nonlinear PCA using Kernels
Rewrite PCA in terms of dot product
Assume the data has been centered, i.e., $\sum_i x_i = 0$.

The covariance matrix $S$ can be written as

$$S = \frac{1}{n} \sum_i x_i x_i^T.$$

Let $v$ be an eigenvector of $S$ corresponding to a nonzero eigenvalue $\lambda$:

$$S v = \frac{1}{n} \sum_i x_i x_i^T v = \lambda v
 \quad\Longrightarrow\quad
 v = \frac{1}{n\lambda} \sum_i \left(x_i^T v\right) x_i.$$

Eigenvectors of $S$ lie in the space spanned by all data points.
Nonlinear PCA using Kernels
The covariance matrix can be written in matrix form:

$$S = \frac{1}{n} X X^T, \qquad \text{where } X = [x_1, x_2, \ldots, x_n].$$

Since the eigenvectors lie in the span of the data, write $v = \sum_i \alpha_i x_i = X\alpha$. Then

$$S v = \frac{1}{n} X X^T X \alpha = \lambda X \alpha,$$

and left-multiplying by $X^T$ gives

$$\frac{1}{n} (X^T X)(X^T X)\,\alpha = \lambda\,(X^T X)\,\alpha
 \quad\Longrightarrow\quad
 \frac{1}{n} (X^T X)\,\alpha = \lambda\,\alpha.$$

Any benefits?
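One immediate benefit (an illustrative sketch, not stated on the slide): when $p \gg n$, the $n \times n$ matrix $X^T X$ is far smaller than the $p \times p$ covariance matrix, yet each eigenvector of $S$ is recovered as $v = X\alpha$. The same identity is what makes the kernel trick on the next slides possible.

```python
import numpy as np

rng = np.random.default_rng(9)
p, n = 2000, 30                                # many more variables than observations
X = rng.normal(size=(p, n))
X = X - X.mean(axis=1, keepdims=True)          # centered data

# Solve the small n x n eigenproblem (1/n) X^T X alpha = lambda alpha ...
lam, alpha = np.linalg.eigh((X.T @ X) / n)
lam, alpha = lam[::-1], alpha[:, ::-1]

# ... and recover an eigenvector of S = (1/n) X X^T via v = X alpha
v = X @ alpha[:, 0]
v /= np.linalg.norm(v)

S = (X @ X.T) / n                              # formed here only to verify the claim
print(np.allclose(S @ v, lam[0] * v))          # True
```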
Nonlinear PCA using Kernels
Next consider the feature space $\Phi: x \mapsto \Phi(x)$:

$$S = \frac{1}{n} X X^T, \qquad \text{where } X = [\Phi(x_1), \Phi(x_2), \ldots, \Phi(x_n)], \qquad v = \sum_i \alpha_i \Phi(x_i) = X\alpha.$$

The $(i,j)$-th entry of $X^T X$ is $\Phi(x_i)^T \Phi(x_j)$.

Apply the kernel trick, $K(x_i, x_j) = \Phi(x_i)^T \Phi(x_j)$, and the eigenproblem becomes

$$\frac{1}{n} K \alpha = \lambda \alpha,$$

where $K$ is called the kernel matrix.
Nonlinear PCA using Kernels
• Projection of a test point x onto v:
$$v^T \Phi(x) = \sum_i \alpha_i\, \Phi(x_i)^T \Phi(x) = \sum_i \alpha_i\, K(x_i, x)$$
Explicit mapping is not required here.
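A minimal sketch of the whole kernel PCA pipeline described above, assuming an RBF kernel (the kernel choice, the hypothetical `rbf_kernel` helper, and the omission of feature-space centering are assumptions, not part of the slides):

```python
import numpy as np

def rbf_kernel(A, B, gamma=0.5):
    """Hypothetical helper: K(a, b) = exp(-gamma * ||a - b||^2); A is m x p, B is l x p."""
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(axis=-1)
    return np.exp(-gamma * d2)

rng = np.random.default_rng(7)
n, p = 50, 2
Xpts = rng.normal(size=(n, p))                 # training points, one per row

K = rbf_kernel(Xpts, Xpts)                     # kernel matrix K_ij = Phi(x_i)^T Phi(x_j)
# (The slides assume the mapped data is centered; in practice K is also double-centered.)

lam, alpha = np.linalg.eigh(K / n)             # solve (1/n) K alpha = lambda alpha
lam, alpha = lam[::-1], alpha[:, ::-1]         # largest eigenvalue first

# Projection of a test point x onto the leading eigenvector v = sum_i alpha_i Phi(x_i),
# using only kernel evaluations (no explicit mapping Phi):
x_new = rng.normal(size=(1, p))
k_new = rbf_kernel(Xpts, x_new).ravel()        # K(x_i, x) for all training points
a1 = alpha[:, 0] / np.sqrt(n * lam[0])         # rescale so that ||v|| = 1
projection = a1 @ k_new                        # v^T Phi(x) = sum_i a1_i K(x_i, x)
print(projection)
```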
Reference
• Principal Component Analysis. I.T. Jolliffe.
• Kernel Principal Component Analysis. Schölkopf, et al.
• Geometric Methods for Feature Extraction and Dimensional Reduction. Burges.