Presentation of paper #7:
"Nonlinear component analysis as a kernel eigenvalue problem"
Schölkopf, Smola, Müller. Neural Computation 10, 1299-1319, MIT Press (1998)
Group C:
M. Filannino, G. Rates, U. Sandouk
COMP61021: Modelling and Visualization of high-dimensional data
Introduction
● Kernel Principal Component Analysis (KPCA)
○ KPCA is an extension of Principal Component Analysis
○ It computes PCA in a new feature space
○ Useful for feature extraction and dimensionality reduction
Motivation: possible solutions
Principal Curves
Trevor Hastie; Werner Stuetzle, "Principal Curves," Journal of the American Statistical Association, Vol. 84, No. 406 (Jun. 1989), pp. 502-516.
● Optimization (including the quality of data approximation)
● Natural geometric meaning
● Natural projection
http://pisuerga.inf.ubu.es/cgosorio/Visualization/imgs/review3_html_m20a05243.png
Motivation: possible solutions
Autoencoders
Hinton, G. E., & Salakhutdinov, R. R. (2006). Reducing the dimensionality of data with neural networks. Science, 313, 504-507.
● Feed-forward neural network
● Approximates the identity function
http://www.nlpca.de/fig_NLPCA_bottleneck_autoassociative_autoencoder_neural_network.png
Motivation: some new problems
● Low input dimensions
● Problem dependent
● Hard optimization problems
Motivation: kernel trick
KPCA captures the overall variance of patterns
Video
Principle
"We are not interested in PCs in the input space, we are interested in PCs of features that are nonlinearly related to the original ones"
[Figure: data points in the input space mapped into a new, nonlinearly related feature space]
Principle
Given a data set of N centered observations in a d-dimensional space:
● PCA diagonalizes the covariance matrix:
● It is necessary to solve the following system of equations:
● We can define the same computation in another dot product space F:
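The equations referenced above appeared as images in the original slides; a sketch of the standard forms from Schölkopf et al. (1998):

$$ C = \frac{1}{N}\sum_{j=1}^{N} x_j x_j^{\top}, \qquad \lambda v = C v, \qquad \Phi : \mathbb{R}^d \to F, \; x \mapsto \Phi(x) $$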
Principle
Given a data set of N centered observations in a high-dimensional space:
● Covariance matrix in the new space:
● Again, it is necessary to solve the following system of equations:
● This means that:
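With the slide images missing, the corresponding equations from the paper are:

$$ \bar{C} = \frac{1}{N}\sum_{j=1}^{N} \Phi(x_j)\Phi(x_j)^{\top}, \qquad \lambda V = \bar{C}\, V $$

and "this means that" every eigenvector V with nonzero eigenvalue lies in the span of the mapped data:

$$ V = \sum_{i=1}^{N} \alpha_i\, \Phi(x_i) $$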
Principle
● Combining the last three equations, we obtain:
● we define a new function
● and a new N x N matrix:
● our equation becomes:
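Written out (again following the paper, since the formulas were shown as images):

$$ \lambda\, (\Phi(x_k) \cdot V) = (\Phi(x_k) \cdot \bar{C}\, V) \quad \text{for } k = 1, \dots, N $$

$$ k(x, y) := (\Phi(x) \cdot \Phi(y)), \qquad K_{ij} := k(x_i, x_j) $$

so that the eigenvalue problem becomes

$$ N \lambda\, \alpha = K \alpha $$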
Principle
● Let λ1 ≤ λ2 ≤ ... ≤ λN denote the eigenvalues of K, and α1, ..., αN the corresponding eigenvectors, with λp being the first nonzero eigenvalue; we then require that they are normalized in F:
● Encoding a data point y means computing:
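The normalization and encoding formulas, reconstructed from the paper:

$$ \lambda_p\, (\alpha^p \cdot \alpha^p) = 1 \quad \text{(equivalent to } (V^p \cdot V^p) = 1 \text{ in } F\text{)} $$

$$ (V^p \cdot \Phi(y)) = \sum_{i=1}^{N} \alpha_i^p\, k(x_i, y) $$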
Algorithm
● Centralization
For a given data set, subtract the mean from all the observations to obtain the centered data in R^N.
● Finding principal components
Compute the kernel matrix K using the kernel function, then find its eigenvectors and eigenvalues.
● Encoding training/testing data
where x is a vector that encodes the training data. This can be done since we have calculated the eigenvalues and eigenvectors.
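A minimal sketch of these steps in Python with NumPy. The RBF kernel choice, the toy two-circles dataset, and all function names here are illustrative assumptions, not taken from the slides or the paper:

import numpy as np

def rbf_kernel(a, b, sigma=1.0):
    # Gaussian (RBF) kernel: k(a, b) = exp(-||a - b||^2 / (2 * sigma^2))
    return np.exp(-np.sum((a - b) ** 2) / (2 * sigma ** 2))

def kernel_pca(X, n_components=2, sigma=1.0):
    N = X.shape[0]
    # 1. Kernel matrix K_ij = k(x_i, x_j)
    K = np.array([[rbf_kernel(xi, xj, sigma) for xj in X] for xi in X])
    # 2. Centre K in feature space: K~ = K - 1N K - K 1N + 1N K 1N, with (1N)_ij = 1/N
    one_n = np.full((N, N), 1.0 / N)
    Kc = K - one_n @ K - K @ one_n + one_n @ K @ one_n
    # 3. Eigendecomposition (eigh returns ascending eigenvalues), then sort descending
    eigvals, eigvecs = np.linalg.eigh(Kc)
    eigvals, eigvecs = eigvals[::-1], eigvecs[:, ::-1]
    # 4. Normalise the expansion coefficients so that (eigenvalue of K) * ||alpha^p||^2 = 1
    #    (the paper's normalisation condition, with K's eigenvalues equal to N * lambda)
    alphas = eigvecs[:, :n_components] / np.sqrt(eigvals[:n_components])
    # 5. Encode the training data: projections onto the leading kernel components
    return Kc @ alphas

# Toy usage (assumed data): two noisy concentric circles in 2D
rng = np.random.default_rng(0)
angles = rng.uniform(0, 2 * np.pi, 200)
radii = np.repeat([1.0, 3.0], 100) + rng.normal(0.0, 0.1, 200)
X = np.c_[radii * np.cos(angles), radii * np.sin(angles)]
Z = kernel_pca(X, n_components=2, sigma=1.0)
print(Z.shape)  # (200, 2)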
Algorithm
● Reconstructing training data
The operation cannot be done directly, because the eigenvectors do not have pre-images in the original space.
● Reconstructing a test data point
The operation cannot be done directly, because the eigenvectors do not have pre-images in the original space.
Disadvantages
● Centering in the original space does not mean centering in F; we need to adjust the K matrix accordingly (see the sketch after this list):
● KPCA is now a parametric technique:
○ choice of a proper kernel function
■ Gaussian, sigmoid, polynomial
○ Mercer's theorem
■ k(x,y) must be continuous, symmetric, and positive semi-definite (xTAx ≥ 0)
■ it guarantees that the eigenvalues are non-negative
● Data reconstruction is not possible unless an approximation formula is used:
● Time complexity
○ we will return to this point later
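The K-matrix adjustment referred to above appeared as an image in the slides; the standard centering formula from the paper is

$$ \tilde{K} = K - \mathbf{1}_N K - K \mathbf{1}_N + \mathbf{1}_N K \mathbf{1}_N, \qquad (\mathbf{1}_N)_{ij} := \tfrac{1}{N} $$

The reconstruction approximation is the pre-image method of [9]: one looks for a point z whose map Φ(z) is as close as possible to the projection of Φ(x) onto the leading components (a concrete sketch is given in the Discussions section below).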
Advantages
● Handles non-linearly separable problems
● Extraction of more principal components than PCA
○ Feature extraction vs. dimensionality reduction
Experiments
● Applications
● Data Sets
● Methods compared
● Assessment
● Experiments
● Results
Applications
● Clustering
○ Density estimation (e.g. high correlation between features)
○ De-noising (e.g. removing lighting effects from bright images)
○ Compression (e.g. image compression)
● Classification (e.g. categorisation)
Datasets
Experiment name / Created by / Representation:
● USPS Character Recognition / Handwritten digits / Labelled, 256 dimensions, 9298 digits
● Simple example 1 / 1+2 = 3, three Gaussians, sd = 0.1, dist. [-1, 1] x [-0.5, 1] / Three clusters, unlabelled, 2 dimensions
● Simple example 2 (Kernels) / 1+2 = 3, uniform distribution, dist. [-1, 1], y = x^2 / Unlabelled, 2 dimensions
● De-noising / The eleven Gaussians, dist. [-1, 1] with zero mean; y = x^2 + C, C noise sd 0.1 / A circle and a square, unlabelled, 10 dimensions
Experiments
1. Simple Example: Dataset: 1+2 = 3, the uniform distribution, sd = 0.2; Kernel: polynomial, degree 1-4
2. USPS Character Recognition: Dataset: USPS; Methods: five-layer neural networks, kernel SVM, PCA + SVM; Parameters: Kernel PCA with a polynomial kernel of degree 1-7 and 32-2048 components (x2); the best parameters for the task for the neural networks and the SVM
3. De-noising: Dataset: de-noising, 11 Gaussians, sd = 0.1; Methods: kernel autoencoders, principal curves, kernel PCA, linear PCA; Parameters: the best parameters for the task
4. Kernels: radial basis function, sigmoid; Parameters: the best parameters for the task
Methods
These are the methods used in the experiments:
● Linear, unsupervised (dimensionality reduction): Linear PCA
● Non-linear, supervised (classification, face recognition): Neural Networks, SVM, Kernel LDA
● Non-linear, unsupervised (dimensionality reduction): Kernel PCA, Kernel Autoencoders, Principal Curves
Assessment
1. Accuracy: Classification: exact classification; Clustering: comparable to other clusters
2. Time complexity: the time to compute
3. Storage complexity: the storage of the data
4. Interpretability: how easy it is to understand
Simple Example
● Nonlinear PCA paper example: Dataset: 1+2 = 3, the uniform distribution with sd 0.2; Classifier: the polynomial kernel of degree 1-4; PC: 1-3
○ Kernel: polynomial, degree 1-4
○ The eigenvectors 1-3 of highest eigenvalue
● Recreated example: Dataset: the USPS handwritten digits; Training set: 3000; Classifier: the SVM (dot product kernel, degree 1-7); PC: 32-2048 (x2)
The function y = x^2 + B, with noise B of sd 0.2 from a uniform distribution over [-1, 1], is mapped to 3D by a kernel and reduced to 2D by PCA.
Accurate clustering of non-linear features.
Character recognition
Dataset: the USPS handwritten digits; Training set: 3000; Classifier: the SVM (dot product kernel, degree 1-7); PC: 32-2048 (x2)
● The performance of a linear classifier trained on non-linear components is better than one trained on linear components
● The performance improves over linear PCA as the number of components is increased
Fig: the result of the character recognition experiment
De-noising
De-noising of the non-linear features of the distribution.
Dataset: the de-noising eleven Gaussians; Training set: 100; Classifier: the Gaussian kernel (sd parameter); PC: 2
Fig: the result of the de-noising experiment
Kernels
The choice of kernel regulates the accuracy of the algorithm and depends on the application. The Gram matrices of Mercer kernels are positive semi-definite.
Experiments:
● Radial Basis Function: Dataset: three Gaussians, sd 0.1; Kernel: k(x, y) = exp(-||x - y||^2 / 0.1); PC: 1-8
● Sigmoid: Dataset: three Gaussians, sd 0.1; Kernel: sigmoid; PC: 1-3
[Figures: RBF kernel, PC 1-8; sigmoid kernel, PC 1-3]
RBF:
● PCs 1-2 separate the 3 clusters
● PCs 3-5 halve the clusters
● PCs 6-8 split them orthogonally; the clusters are split into 12 regions
Sigmoid:
● PCs 1-2 separate the 3 clusters
● PC 3 halves the 3 clusters
● The same number of PCs separates the clusters; the sigmoid needs fewer PCs to halve them
Results
                     Experiment 1      Experiment 2   Experiment 3   Experiment 4
Kernel               Polynomial 4      Polynomial 4   Gaussian 0.2   Sigmoid
Components           8 (split to 12)   512            2              3 (split to 6)
1 Accuracy           -                 4.4            -              -
2 Time               -                 -              -              -
3 Space              -                 -              -              -
4 Interpretability   Very good         Very good      Complicated    Very good
Discussions: KDA
Kernel Fisher Discriminant (KDA)
Sebastian Mika, Gunnar Rätsch, Jason Weston, Bernhard Schölkopf, Klaus-Robert Müller
● Best discriminant projection
http://lh3.ggpht.com/_qIDcOEX659I/S14l1wmtv6I/AAAAAAAAAxE/3G9kOsTt0VM/s1600-h/kda62.png
Discussions
Doing PCA in F rather than in R^d
● The first k principal components carry more variance than any
other k directions
● The mean squared approximation error of the first k principal components is minimal
● The principal components are uncorrelated
Discussions
Going into a higher-dimensional space in order to reach a lower-dimensional representation
● Pick the right high-dimensional space
Need for a proper kernel
● What kernel to use?
○ Gaussian, sigmoidal, polynomial
● Problem dependent
Discussions
Time Complexity
● A lot of features (a lot of dimensions)
● KPCA works!
○ Subspace of F (only the observed x's)
○ No explicit dot product calculation in F (only kernel evaluations)
● Computational complexity is hardly changed by the fact that we need to evaluate a kernel function rather than just dot products
○ (if the kernel is easy to compute)
○ e.g. Polynomial Kernels
Payback: we can then use a linear classifier.
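As a worked illustration of the polynomial-kernel case (the p = 256, d = 4 numbers below are my own example, chosen to match the USPS input dimension):

$$ k(x, y) = (x \cdot y)^d = \Big( \sum_{i=1}^{p} x_i y_i \Big)^d $$

is a dot product in the space of all degree-d monomials of the input coordinates, whose dimension is $\binom{p+d-1}{d}$ (about $1.8 \times 10^8$ for p = 256, d = 4), yet evaluating the kernel costs only one p-dimensional dot product and one exponentiation.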
Discussions
Pre-image reconstruction may be impossible
Approximation can be done in F
Need an explicit ϕ
● Regression learning problem
● Non-linear optimization problem
● Algebraic solution (rarely)
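For a concrete case, with the Gaussian kernel the approximate pre-image z of a feature-space point $P\Phi(x) = \sum_i \gamma_i \Phi(x_i)$ can be found by the fixed-point iteration of [9]:

$$ z_{t+1} = \frac{\sum_{i=1}^{N} \gamma_i \exp(-\|z_t - x_i\|^2 / (2\sigma^2))\, x_i}{\sum_{i=1}^{N} \gamma_i \exp(-\|z_t - x_i\|^2 / (2\sigma^2))} $$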
Discussions
Interpretability
● Cross-features
○ Dependent on the kernel
● Reduced Space Features
○ Preserves the highest variance
among data in F.
Conclusions
Applications
● Feature Extraction (Classification)
● Clustering
● Denoising
● Novelty detection
● Dimensionality Reduction (Compression)
References
[1] J.T. Kwok and I.W. Tsang, "The Pre-Image Problem in Kernel Methods," IEEE Trans. Neural Networks, vol. 15, no. 6, pp. 1517-1525, 2004.
[2] G. E. Hinton and R. R. Salakhutdinov, "Reducing the dimensionality of data with neural networks," Science, 313, 504-507, 2006.
[3] S. Mika, G. Rätsch, J. Weston, B. Schölkopf, and K.-R. Müller, "Fisher discriminant analysis with kernels," 1999.
[4] T. Hastie and W. Stuetzle, "Principal Curves," Journal of the American Statistical Association, vol. 84, no. 406, pp. 502-516, Jun. 1989.
[5] G. Moser, "Analisi delle componenti principali" (Principal component analysis), Tecniche di trasformazione di spazi vettoriali per analisi statistica multi-dimensionale.
[6] I.T. Jolliffe, "Principal Component Analysis," Springer-Verlag, 2002.
[7] Wikipedia, "Kernel Principal Component Analysis," 2011.
[8] A. Ghodsi, "Data Visualization," 2006.
[9] B. Schölkopf, S. Mika, A. Smola, G. Rätsch, and K.-R. Müller, "Kernel PCA pattern reconstruction via approximate pre-images," in Proceedings of the 8th International Conference on Artificial Neural Networks, pp. 147-152, 1998.
[10] J.T. Kwok and I.W. Tsang, "The pre-image problem in kernel methods," in Proceedings of the Twentieth International Conference on Machine Learning (ICML-2003), 2003.
[11] K.-R. Müller, S. Mika, G. Rätsch, K. Tsuda, and B. Schölkopf, "An Introduction to Kernel-Based Learning Algorithms," IEEE Transactions on Neural Networks, vol. 12, no. 2, March 2001.
[12] S. Mika, B. Schölkopf, A. Smola, K.-R. Müller, M. Scholz, and G. Rätsch, "Kernel PCA and De-Noising in Feature Spaces."
Thank you