Principal Component Analysis Based on L1-Norm Maximization. Nojun Kwak, IEEE Transactions on Pattern Analysis and Machine Intelligence, 2008

Page 1: Principal Component Analysis Based on L1-Norm Maximization Nojun Kwak IEEE Transactions on Pattern Analysis and Machine Intelligence, 2008

Principal Component Analysis Based on L1-Norm Maximization

Nojun Kwak
IEEE Transactions on Pattern Analysis and Machine Intelligence, 2008

Page 2: Principal Component Analysis Based on L1-Norm Maximization Nojun Kwak IEEE Transactions on Pattern Analysis and Machine Intelligence, 2008

Outline

• Introduction
• Background Knowledge
• Problem Description
• Algorithms
• Experiments
• Conclusion

2

Page 3: Principal Component Analysis Based on L1-Norm Maximization Nojun Kwak IEEE Transactions on Pattern Analysis and Machine Intelligence, 2008

Introduction

• In data analysis problems, why do we need dimensionality reduction?

• Principal Component Analysis (PCA)
• PCA based on the L2-Norm is sensitive to the presence of outliers.

3

Page 4: Principal Component Analysis Based on L1-Norm Maximization Nojun Kwak IEEE Transactions on Pattern Analysis and Machine Intelligence, 2008

Introduction

• Some algorithms for this problem:
– L1-PCA
• Weighted median method
• Convex programming method
• Maximum likelihood estimation method

– R1-PCA

4

Page 5: Principal Component Analysis Based on L1-Norm Maximization Nojun Kwak IEEE Transactions on Pattern Analysis and Machine Intelligence, 2008

Background Knowledge

• L1-Norm, L2-Norm
• Principal Component Analysis (PCA)

5

Page 6: Principal Component Analysis Based on L1-Norm Maximization Nojun Kwak IEEE Transactions on Pattern Analysis and Machine Intelligence, 2008

Lp-Norm

• Consider an n-dimensional vector x = [x_1, x_2, ..., x_n].
• Define the p-Norm:

||x||_p = ( \sum_{i=1}^{n} |x_i|^p )^{1/p}

• L1-Norm: ||x||_1 = \sum_{i=1}^{n} |x_i|

• L2-Norm: ||x||_2 = ( \sum_{i=1}^{n} |x_i|^2 )^{1/2}

6

Page 7: Principal Component Analysis Based on L1-Norm Maximization Nojun Kwak IEEE Transactions on Pattern Analysis and Machine Intelligence, 2008

Lp-Norm

• For example, x = [1, 2, 3]:

name      symbol    value      approximation
L1-Norm   ||x||_1   6          6.000
L2-Norm   ||x||_2   \sqrt{14}  3.742
L3-Norm   ||x||_3   36^{1/3}   3.302
L4-Norm   ||x||_4   98^{1/4}   3.146
L∞-Norm   ||x||_∞   3          3.000

• Special case: ||x||_∞ = \max_i |x_i|

7
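A quick numerical check of the table above, as a minimal sketch assuming NumPy (the helper name lp_norm is ours, not from the paper):

```python
import numpy as np

def lp_norm(x, p):
    """Lp-norm of a vector x: (sum_i |x_i|^p)^(1/p)."""
    return np.sum(np.abs(x) ** p) ** (1.0 / p)

x = np.array([1.0, 2.0, 3.0])
for p in (1, 2, 3, 4):
    print(f"L{p}-Norm of {x}: {lp_norm(x, p):.3f}")   # 6.000, 3.742, 3.302, 3.146
print(f"L-inf Norm of {x}: {np.max(np.abs(x)):.3f}")  # special case: max_i |x_i| = 3.000
```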

Page 8: Principal Component Analysis Based on L1-Norm Maximization Nojun Kwak IEEE Transactions on Pattern Analysis and Machine Intelligence, 2008

Principal Component Analysis

• Principal component analysis (PCA) is a technique to seek projections that best preserve the data in a least-squares sense.

• The projections constitute a low-dimensional linear subspace.

8

Page 9: Principal Component Analysis Based on L1-Norm Maximization Nojun Kwak IEEE Transactions on Pattern Analysis and Machine Intelligence, 2008

Principal Component Analysis

• The projection vectors w_1, ..., w_m are the eigenvectors of the scatter matrix S_x having the m largest eigenvalues; a minimal computational sketch follows.

Scatter matrix: S_x = \sum_{i=1}^{n} ( x_i - \bar{x} )( x_i - \bar{x} )^T, where \bar{x} is the sample mean.

9
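For reference, a minimal sketch of standard L2-PCA via the scatter matrix (assuming NumPy; function and variable names are ours, not from the paper):

```python
import numpy as np

def l2_pca(X, m_dim):
    """L2-PCA: the m_dim eigenvectors of the scatter matrix with the largest eigenvalues.

    X is a (d, n) array holding n samples of dimension d as columns.
    """
    mean = X.mean(axis=1, keepdims=True)
    Xc = X - mean                               # center the data
    S = Xc @ Xc.T                               # scatter matrix: sum_i (x_i - mean)(x_i - mean)^T
    eigvals, eigvecs = np.linalg.eigh(S)        # eigh: S is symmetric
    order = np.argsort(eigvals)[::-1]           # sort eigenvalues in decreasing order
    W = eigvecs[:, order[:m_dim]]               # projection matrix: top-m eigenvectors as columns
    return W, eigvals[order]
```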

Page 10: Principal Component Analysis Based on L1-Norm Maximization Nojun Kwak IEEE Transactions on Pattern Analysis and Machine Intelligence, 2008

Principal Component Analysis

• The rotational invariance property is a fundamental property of Euclidean space with the L2-Norm.

• So, PCA has the rotational invariance property.

10

Page 11: Principal Component Analysis Based on L1-Norm Maximization Nojun Kwak IEEE Transactions on Pattern Analysis and Machine Intelligence, 2008

Problem Description

• Traditional PCA is sensitive to the presence of outliers.
• The effect of outliers with a large norm is exaggerated by the use of the L2-Norm.
• Is there another method?

11

Page 12: Principal Component Analysis Based on L1-Norm Maximization Nojun Kwak IEEE Transactions on Pattern Analysis and Machine Intelligence, 2008

Problem Description

• If we use the L1-Norm instead of the L2-Norm, the error function to minimize becomes

E(W, V) = ||X - WV||_1 = \sum_{i=1}^{n} \sum_{j=1}^{d} | x_{ji} - \sum_{k=1}^{m} w_{jk} v_{ki} |

where X = [x_1, ..., x_n] ∈ R^{d×n} is the dataset.

12

W ∈ R^{d×m} is the projection matrix. V ∈ R^{m×n} is the coefficient matrix.

Page 13: Principal Component Analysis Based on L1-Norm Maximization Nojun Kwak IEEE Transactions on Pattern Analysis and Machine Intelligence, 2008

Problem Description

• However, it is very hard to obtain the exact solution of this problem.

• To resolve this, Ding et al. proposed the R1-Norm and an approximate solution.

13

We call it R1-PCA.

Page 14: Principal Component Analysis Based on L1-Norm Maximization Nojun Kwak IEEE Transactions on Pattern Analysis and Machine Intelligence, 2008

Problem Description

• The solution of R1-PCA depends on the dimension m of the subspace being found.

• The optimal solution when m = m_1 is not necessarily a subspace of the optimal solution when m = m_2 > m_1.

• The proposed method: PCA-L1

14

Page 15: Principal Component Analysis Based on L1-Norm Maximization Nojun Kwak IEEE Transactions on Pattern Analysis and Machine Intelligence, 2008

Algorithms

• We consider the following problem instead:

W* = \arg\max_{W} ||W^T X||_1 = \arg\max_{W} \sum_{i=1}^{n} \sum_{k=1}^{m} | w_k^T x_i |,  subject to  W^T W = I_m

• The maximization is done in the projected feature space.

The constraint W^T W = I_m ensures the orthonormality of the projection matrix.

15

Page 16: Principal Component Analysis Based on L1-Norm Maximization Nojun Kwak IEEE Transactions on Pattern Analysis and Machine Intelligence, 2008

Algorithms

• However, it is difficult to find a global solution of this problem for m > 1.

• The optimal i-th projection vector varies with the total number of extracted features m, as in R1-PCA.

• How to solve it?

16

Page 17: Principal Component Analysis Based on L1-Norm Maximization Nojun Kwak IEEE Transactions on Pattern Analysis and Machine Intelligence, 2008

Algorithms

• We simplify it into a series of m = 1 problems using a greedy search method.

• Then, if we set m = 1, the problem becomes:

w* = \arg\max_{||w||_2 = 1} ||w^T X||_1 = \arg\max_{||w||_2 = 1} \sum_{i=1}^{n} | w^T x_i |

17

Although the successive greedy solutions may differ from the optimal solution, they are expected to provide a good approximation.

Page 18: Principal Component Analysis Based on L1-Norm Maximization Nojun Kwak IEEE Transactions on Pattern Analysis and Machine Intelligence, 2008

Algorithms

• The optimization is still difficult because it contains the absolute value operation, which is nonlinear.

18

Page 19: Principal Component Analysis Based on L1-Norm Maximization Nojun Kwak IEEE Transactions on Pattern Analysis and Machine Intelligence, 2008

Algorithms

19
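The algorithm box on this slide did not survive extraction. As described in the paper, the single-vector PCA-L1 procedure consists of (1) initialization of w(0) with unit norm, (2) a polarity check p_i = sign(w(t)^T x_i), (3) a flipping-and-maximization step w(t+1) ∝ \sum_i p_i(t) x_i, and (4) a convergence check, with a small perturbation (Step 4b) when some w^T x_i = 0. A minimal Python sketch under these assumptions (NumPy assumed; names are ours):

```python
import numpy as np

def pca_l1_single(X, w0=None, max_iter=1000, rng=None):
    """Find one projection vector w maximizing sum_i |w^T x_i| (PCA-L1 with m = 1).

    X : (d, n) array of n mean-subtracted samples as columns.
    """
    rng = np.random.default_rng() if rng is None else rng
    d, n = X.shape
    w = X[:, 0].copy() if w0 is None else np.asarray(w0, dtype=float)
    w = w / np.linalg.norm(w)                       # Step 1: initialization, ||w(0)|| = 1
    for _ in range(max_iter):
        p = np.where(w @ X < 0, -1.0, 1.0)          # Step 2: polarity check, p_i = sign(w^T x_i)
        w_new = X @ p                               # Step 3: flipping and maximization
        w_new = w_new / np.linalg.norm(w_new)
        if np.allclose(w_new, w):                   # Step 4a: no change -> possibly converged
            if np.any(np.isclose(w_new @ X, 0.0)):  # Step 4b: some w^T x_i = 0 -> perturb, retry
                w = w_new + 1e-6 * rng.standard_normal(d)
                w = w / np.linalg.norm(w)
                continue
            return w_new                            # Step 4c: local maximum found
        w = w_new
    return w
```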

Page 20: Principal Component Analysis Based on L1-Norm Maximization Nojun Kwak IEEE Transactions on Pattern Analysis and Machine Intelligence, 2008

Algorithms

• However, does the PCA-L1 procedure find a local maximum point w*?

• We should prove it.

20

Page 21: Principal Component Analysis Based on L1-Norm Maximization Nojun Kwak IEEE Transactions on Pattern Analysis and Machine Intelligence, 2008

Theorem

• Theorem: With the PCA-L1 procedure, the vector w(t) converges to w*, which is a local maximum point of \sum_{i=1}^{n} | w^T x_i |.

• The proof includes two parts:
– \sum_{i=1}^{n} | w(t)^T x_i | is a non-decreasing function of t.
– The objective function has a local maximum value at w*.

21

Page 22: Principal Component Analysis Based on L1-Norm Maximization Nojun Kwak IEEE Transactions on Pattern Analysis and Machine Intelligence, 2008

Proof

• \sum_{i=1}^{n} | w(t)^T x_i | is a non-decreasing function of t.

{ p_i(t) }_{i=1}^{n} is the set of optimal polarities corresponding to w(t), i.e., p_i(t) = 1 if w(t)^T x_i ≥ 0 and p_i(t) = -1 otherwise. For all i, p_i(t) w(t)^T x_i ≥ 0, and

\sum_{i} | w(t+1)^T x_i | ≥ w(t+1)^T ( \sum_{i} p_i(t) x_i ) ≥ w(t)^T ( \sum_{i} p_i(t) x_i ) = \sum_{i} | w(t)^T x_i |

22

Page 23: Principal Component Analysis Based on L1-Norm Maximization Nojun Kwak IEEE Transactions on Pattern Analysis and Machine Intelligence, 2008

Proof

• The second inequality holds because w(t+1) and \sum_{i} p_i(t) x_i are parallel: among unit vectors, the inner product of two vectors is maximized when they are parallel.

23

Page 24: Principal Component Analysis Based on L1-Norm Maximization Nojun Kwak IEEE Transactions on Pattern Analysis and Machine Intelligence, 2008

Proof

• So, the objective function is non-decreasing, and there are a finite number of data points (hence a finite number of possible polarity patterns).

⇒ The PCA-L1 procedure converges to a projection vector w*.

24

Page 25: Principal Component Analysis Based on L1-Norm Maximization Nojun Kwak IEEE Transactions on Pattern Analysis and Machine Intelligence, 2008

Proof

• The objective function has a local maximum value at w*.

• Because w(t) converges to w* by the PCA-L1 procedure, p_i w*^T x_i ≥ 0 for all i.

• By Step 4b, w*^T x_i ≠ 0 for all i.

25

Page 26: Principal Component Analysis Based on L1-Norm Maximization Nojun Kwak IEEE Transactions on Pattern Analysis and Machine Intelligence, 2008

Proof

• There exists a small neighborhood N(w*) of w*, such that if w ∈ N(w*), then p_i w^T x_i ≥ 0 for all i.

• Then, since w* is parallel to \sum_{i} p_i x_i, the inequality \sum_{i} | w*^T x_i | ≥ \sum_{i} | w^T x_i | holds for all w ∈ N(w*).

⇒ w* is a local maximum point.

26

Page 27: Principal Component Analysis Based on L1-Norm Maximization Nojun Kwak IEEE Transactions on Pattern Analysis and Machine Intelligence, 2008

Algorithms

• So, the PCA-L1 procedure finds a local maximum point w*.

• Because w* is a linear combination of the data points x_i, i.e., w* ∝ \sum_{i} p_i x_i, it is invariant to rotations.

Under a rotational transformation R : X → RX, we have W → RW (a small numerical check is sketched below).

27
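A small numerical check of this rotational invariance, assuming the pca_l1_single sketch given earlier: rotating the data rotates the learned direction accordingly.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((2, 100))                  # toy 2-D data, 100 samples
theta = 0.7
R = np.array([[np.cos(theta), -np.sin(theta)],
              [np.sin(theta),  np.cos(theta)]])    # a rotation matrix
w = pca_l1_single(X, w0=X[:, 0])
w_rot = pca_l1_single(R @ X, w0=R @ X[:, 0])       # same procedure on rotated data
print(np.allclose(w_rot, R @ w))                   # expected: True, every step commutes with R
```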

Page 28: Principal Component Analysis Based on L1-Norm Maximization Nojun Kwak IEEE Transactions on Pattern Analysis and Machine Intelligence, 2008

Algorithms

• Computational complexity: O(n d n_it).
• n_it is the number of iterations until convergence.
• n_it does not depend on the dimension d of the input space.

28

Page 29: Principal Component Analysis Based on L1-Norm Maximization Nojun Kwak IEEE Transactions on Pattern Analysis and Machine Intelligence, 2008

Algorithms

• The PCA-L1 procedure only finds a local maximum solution. It may not be the global solution.

• We can set the initial vector w(0) appropriately, or run PCA-L1 with several different initial vectors w(0) and keep the best solution.

29

Page 30: Principal Component Analysis Based on L1-Norm Maximization Nojun Kwak IEEE Transactions on Pattern Analysis and Machine Intelligence, 2008

Algorithms

• Extracting multiple features (m > 1): following the idea of the original PCA, the previously found directions are removed from the data (deflation), and the PCA-L1 procedure is run once for each feature dimension, as sketched below.

30
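A sketch of this greedy extraction, assuming the pca_l1_single routine from the earlier sketch: each new direction is found on data from which the contributions of the previous directions have been removed.

```python
import numpy as np

def pca_l1(X, m_dim):
    """Greedy PCA-L1: extract m_dim projection vectors one at a time."""
    d, n = X.shape
    Xj = X - X.mean(axis=1, keepdims=True)     # work on mean-subtracted data
    W = np.zeros((d, m_dim))
    for j in range(m_dim):
        w = pca_l1_single(Xj)                  # one PCA-L1 run per feature dimension
        W[:, j] = w
        Xj = Xj - np.outer(w, w @ Xj)          # deflation: x_i <- x_i - w (w^T x_i)
    return W                                   # columns are (orthonormal) projection vectors
```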

Page 31: Principal Component Analysis Based on L1-Norm Maximization Nojun Kwak IEEE Transactions on Pattern Analysis and Machine Intelligence, 2008

Algorithms

• How to guarantee the orthonormality of the projection vectors?

• We should show that the j-th projection vector w_j is orthogonal to w_1, ..., w_{j-1}.

31

Page 32: Principal Component Analysis Based on L1-Norm Maximization Nojun Kwak IEEE Transactions on Pattern Analysis and Machine Intelligence, 2008

Proof

• The j-th projection vector w_j is a linear combination of the deflated samples x_i^j.

It lies in the subspace spanned by { x_i^j }. • Then, we consider w_{j-1}^T x_i^j:

w_{j-1}^T x_i^j = w_{j-1}^T ( x_i^{j-1} - w_{j-1} w_{j-1}^T x_i^{j-1} ) = w_{j-1}^T x_i^{j-1} - ( w_{j-1}^T w_{j-1} )( w_{j-1}^T x_i^{j-1} ) = 0

32

From the greedy search algorithm: x_i^j = x_i^{j-1} - w_{j-1} ( w_{j-1}^T x_i^{j-1} ).

w_{j-1} is a unit vector ( w_{j-1}^T w_{j-1} = 1 ).

Page 33: Principal Component Analysis Based on L1-Norm Maximization Nojun Kwak IEEE Transactions on Pattern Analysis and Machine Intelligence, 2008

Proof

• Because w_{j-1}^T x_i^j = 0 for all i, w_{j-1} is orthogonal to the subspace spanned by { x_i^j }.

⇒ w_{j-1} is orthogonal to w_j.

33

The orthonormality of the projection vectors is guaranteed.

Page 34: Principal Component Analysis Based on L1-Norm Maximization Nojun Kwak IEEE Transactions on Pattern Analysis and Machine Intelligence, 2008

Algorithms

• Even if the greedy search algorithm does not provide the optimal solution, it provides a set of good projections that maximize L1 dispersion.

34

Page 35: Principal Component Analysis Based on L1-Norm Maximization Nojun Kwak IEEE Transactions on Pattern Analysis and Machine Intelligence, 2008

Algorithms

• For data analysis, we need to decide how much of the data's variation should be captured, i.e., how many features m to extract.

• In PCA, we can compute the eigenvalues λ_1 ≥ λ_2 ≥ ... ≥ λ_d of the scatter matrix.

35

The i-th eigenvalue is equivalent to the variance of the i-th feature.

We can compute the ratio of the captured variance to the total variance: ( \sum_{i=1}^{m} λ_i ) / ( \sum_{i=1}^{d} λ_i ).

If this ratio exceeds, e.g., 95% of the total variance, m is set to that number of eigenvalues.

Page 36: Principal Component Analysis Based on L1-Norm Maximization Nojun Kwak IEEE Transactions on Pattern Analysis and Machine Intelligence, 2008

Algorithms

• In PCA-L1, once w_j is obtained, we can compute the variance of the j-th feature: s_j^2 = (1/n) \sum_{i=1}^{n} ( w_j^T x_i )^2 (for mean-subtracted data).

• The sum of the variances of the m extracted features: \sum_{j=1}^{m} s_j^2.

• The total variance: (1/n) \sum_{i=1}^{n} ||x_i||^2.

36

We can set the appropriate number of extracted features as in the original PCA, e.g., as sketched below.
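A sketch of this selection rule, assuming mean-subtracted data X (d x n) and k extracted PCA-L1 directions W; the 95% threshold follows the PCA convention above, and the helper name is ours:

```python
import numpy as np

def choose_num_features(X, W, threshold=0.95):
    """Smallest m whose cumulative feature variance exceeds `threshold` of the total variance."""
    n = X.shape[1]
    feat_var = np.var(W.T @ X, axis=1)          # variance of each extracted feature w_j^T x
    total_var = np.sum(X ** 2) / n              # total variance of mean-subtracted data
    ratio = np.cumsum(feat_var) / total_var     # captured-variance ratio for m = 1, 2, ...
    m = int(np.searchsorted(ratio, threshold)) + 1
    return min(m, W.shape[1]), ratio
```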

Page 37: Principal Component Analysis Based on L1-Norm Maximization Nojun Kwak IEEE Transactions on Pattern Analysis and Machine Intelligence, 2008

Experiments

• In the experiments, we apply the PCA-L1 algorithm and compare it with R1-PCA and the original PCA (L2-PCA).

• Three experiments:
– A toy problem with an outlier
– UCI data sets
– Face reconstruction

37

Page 38: Principal Component Analysis Based on L1-Norm Maximization Nojun Kwak IEEE Transactions on Pattern Analysis and Machine Intelligence, 2008

A Toy Problem with an Outlier

• Consider the data points in a 2D space shown in the figure; one of them is an outlier.

• If we discard the outlier, the projection vector should follow the direction of the remaining data points.

38

Page 39: Principal Component Analysis Based on L1-Norm Maximization Nojun Kwak IEEE Transactions on Pattern Analysis and Machine Intelligence, 2008

A Toy Problem with an Outlier

• The projection vectors found by each method (figure; the outlier is marked).

39

Page 40: Principal Component Analysis Based on L1-Norm Maximization Nojun Kwak IEEE Transactions on Pattern Analysis and Machine Intelligence, 2008

A Toy Problem with an Outlier

• The residual error of each data point (figure; the outlier is marked).

Average residual error:

PCA-L1   L2-PCA   R1-PCA
1.200    1.401    1.206

L2-PCA is much influenced by the outlier.

40

Page 41: Principal Component Analysis Based on L1-Norm Maximization Nojun Kwak IEEE Transactions on Pattern Analysis and Machine Intelligence, 2008

UCI Data Sets

• Data sets from the UCI machine learning repository.
• Compare the classification performances.
• A 1-NN classifier was used, with 10-fold cross-validation to obtain the average classification rate; see the sketch after this slide.
• For PCA-L1, we set the initial projection vector w(0) as discussed earlier.

41
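A sketch of this evaluation protocol (not the paper's code), assuming scikit-learn and a projection matrix W already computed by one of the methods; for simplicity the projection is computed once on all data, whereas a stricter protocol would recompute it inside each fold.

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import cross_val_score

def evaluate(X, y, W, m):
    """X: (d, n) mean-subtracted data, y: (n,) labels, W: (d, k) projection vectors."""
    features = (W[:, :m].T @ X).T                        # (n, m) projected samples
    clf = KNeighborsClassifier(n_neighbors=1)            # 1-NN classifier
    scores = cross_val_score(clf, features, y, cv=10)    # 10-fold cross-validation
    return scores.mean()                                 # average correct classification rate
```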

Page 42: Principal Component Analysis Based on L1-Norm Maximization Nojun Kwak IEEE Transactions on Pattern Analysis and Machine Intelligence, 2008

UCI Data Sets

• The data sets:

42

Page 43: Principal Component Analysis Based on L1-Norm Maximization Nojun Kwak IEEE Transactions on Pattern Analysis and Machine Intelligence, 2008

UCI Data Sets

• The average correct classification rates:

43

Page 44: Principal Component Analysis Based on L1-Norm Maximization Nojun Kwak IEEE Transactions on Pattern Analysis and Machine Intelligence, 2008

UCI Data Sets

• The average correct classification rates:

44

Page 45: Principal Component Analysis Based on L1-Norm Maximization Nojun Kwak IEEE Transactions on Pattern Analysis and Machine Intelligence, 2008

UCI Data Sets

• The average correct classification rates:

45

In many cases, PCA-L1 outperformed L2-PCA and R1-PCA when the number of extracted features was small.

Page 46: Principal Component Analysis Based on L1-Norm Maximization Nojun Kwak IEEE Transactions on Pattern Analysis and Machine Intelligence, 2008

UCI Data Sets

• Average classification rate on the UCI data sets:

46

PCA-L1 outperformed the other methods by 1% on average.

Page 47: Principal Component Analysis Based on L1-Norm Maximization Nojun Kwak IEEE Transactions on Pattern Analysis and Machine Intelligence, 2008

UCI Data Sets

• Computation cost:

47

Page 48: Principal Component Analysis Based on L1-Norm Maximization Nojun Kwak IEEE Transactions on Pattern Analysis and Machine Intelligence, 2008

Face Reconstruction

• The Yale face database:
– 15 individuals
– 11 face images per person

• Among the 165 images, 20% were selected randomly and occluded with a noise block.

48

Page 49: Principal Component Analysis Based on L1-Norm Maximization Nojun Kwak IEEE Transactions on Pattern Analysis and Machine Intelligence, 2008

Face Reconstruction

• For these image sets, we applied:
– L2-PCA (eigenfaces)
– R1-PCA
– PCA-L1

• Then, we used the extracted features to reconstruct the images, as sketched below.

49
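A sketch of the reconstruction step, under the usual linear-projection assumption: each image is projected onto the first m directions and mapped back, x_hat = mean + W_m W_m^T (x - mean). Names are ours, not from the paper.

```python
import numpy as np

def reconstruct(X, W, m_feats):
    """Reconstruct samples from their first m_feats extracted features.

    X: (d, n) images as columns; W: (d, k) orthonormal projection vectors.
    """
    mean = X.mean(axis=1, keepdims=True)
    Wm = W[:, :m_feats]
    return mean + Wm @ (Wm.T @ (X - mean))     # x_hat = mean + W_m W_m^T (x - mean)
```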

Page 50: Principal Component Analysis Based on L1-Norm Maximization Nojun Kwak IEEE Transactions on Pattern Analysis and Machine Intelligence, 2008

Face Reconstruction

50

• Experimental results:

Page 51: Principal Component Analysis Based on L1-Norm Maximization Nojun Kwak IEEE Transactions on Pattern Analysis and Machine Intelligence, 2008

Face Reconstruction

• The average reconstruction error, measured between each original (unoccluded) image and its reconstructed image, is shown in the figure.

51

From 10 to 20 features, the difference became apparent and PCA-L1 outperformed the other methods.

Page 52: Principal Component Analysis Based on L1-Norm Maximization Nojun Kwak IEEE Transactions on Pattern Analysis and Machine Intelligence, 2008

Face Reconstruction

• We added 30 dummy images consisting of random black and white dots to the original 165 Yale images.

• We applied:
– L2-PCA (eigenfaces)
– R1-PCA
– PCA-L1

• We reconstructed the images with the extracted features.

52

Page 53: Principal Component Analysis Based on L1-Norm Maximization Nojun Kwak IEEE Transactions on Pattern Analysis and Machine Intelligence, 2008

Face Reconstruction

• Experimental results:

53

Page 54: Principal Component Analysis Based on L1-Norm Maximization Nojun Kwak IEEE Transactions on Pattern Analysis and Machine Intelligence, 2008

Face Reconstruction

• The average reconstruction error:

54

From 6 to 36 features, the error of L2-PCA stays constant: the dummy images seriously affect its projection vectors.

From 14 to 36 features, the error of R1-PCA increases: the dummy images seriously affect its projection vectors.

Page 55: Principal Component Analysis Based on L1-Norm Maximization Nojun Kwak IEEE Transactions on Pattern Analysis and Machine Intelligence, 2008

Conclusion

• PCA-L1 was proven to find a local maximum point.

• The computational complexity is proportional to:
– the number of samples
– the dimension of the input space
– the number of iterations

• The method is usually faster than the other methods and robust to outliers.

55

Page 56: Principal Component Analysis Based on L1-Norm Maximization Nojun Kwak IEEE Transactions on Pattern Analysis and Machine Intelligence, 2008

Principal Component Analysis

• Given a dataset of l samples: D = { x_i ∈ R^d }_{i=1}^{l}.
• We represent D by projecting the data onto a line running through the sample mean m, i.e., each sample is represented as x = m + a e, where e is a unit direction vector (||e|| = 1) and a is a scalar coefficient.

56

Page 57: Principal Component Analysis Based on L1-Norm Maximization Nojun Kwak IEEE Transactions on Pattern Analysis and Machine Intelligence, 2008

Principal Component Analysis

• Then, writing each sample as x_k ≈ m + a_k e, we minimize the squared-error criterion

J(a_1, ..., a_l, e) = \sum_{k=1}^{l} || ( m + a_k e ) - x_k ||^2,

which, for a fixed direction e, is minimized by a_k = e^T ( x_k - m ).

57

Page 58: Principal Component Analysis Based on L1-Norm Maximization Nojun Kwak IEEE Transactions on Pattern Analysis and Machine Intelligence, 2008

Principal Component Analysis

• To look for the best direction e, substitute a_k = e^T ( x_k - m ) back into the criterion:

J_1(e) = - e^T S e + \sum_{k=1}^{l} || x_k - m ||^2,  where  S = \sum_{k=1}^{l} ( x_k - m )( x_k - m )^T  is the scatter matrix.

58

Page 59: Principal Component Analysis Based on L1-Norm Maximization Nojun Kwak IEEE Transactions on Pattern Analysis and Machine Intelligence, 2008

Principal Component Analysis

• We want to minimize J_1(e):

Maximize e^T S e, subject to ||e|| = 1. • We use Lagrange multipliers:

L(e, λ) = e^T S e - λ ( e^T e - 1 ),   ∂L/∂e = 2 S e - 2 λ e = 0   ⇒   S e = λ e

59

Page 60: Principal Component Analysis Based on L1-Norm Maximization Nojun Kwak IEEE Transactions on Pattern Analysis and Machine Intelligence, 2008

Principal Component Analysis

• Since e^T S e = λ, minimizing J_1(e) can be achieved by choosing e as the eigenvector of S with the largest eigenvalue.

• Similarly, we can extend the 1-dimensional projection to an m-dimensional projection by taking the eigenvectors with the m largest eigenvalues.

60