
Page 1: Nonlinear component analysis as a kernel eigenvalue problem

Presentation of paper #7:

Nonlinear component analysis as a kernel eigenvalue problem
Schölkopf, Smola, Müller
Neural Computation 10, 1299-1319, MIT Press (1998)

Group C:

M. Filannino, G. Rates, U. Sandouk

COMP61021: Modelling and Visualization of high-dimensional data

Page 2: Nonlinear component analysis as a kernel eigenvalue problem

Introduction

● Kernel Principal Component Analysis (KPCA)
○ KPCA is an extension of Principal Component Analysis
○ It computes PCA in a new feature space
○ Useful for feature extraction and dimensionality reduction

Page 3: Nonlinear component analysis as a kernel eigenvalue problem

Introduction

● Kernel Principal Component Analysis (KPCA)
○ KPCA is an extension of Principal Component Analysis
○ It computes PCA in a new feature space
○ Useful for feature extraction and dimensionality reduction

Page 4: Nonlinear component analysis as a kernel eigenvalue problem

Motivation: possible solutions

Principal Curves
Trevor Hastie and Werner Stuetzle, “Principal Curves,” Journal of the American Statistical Association, Vol. 84, No. 406 (Jun. 1989), pp. 502-516.
● Optimization (including the quality of data approximation)

● Natural geometric meaning

● Natural projection

http://pisuerga.inf.ubu.es/cgosorio/Visualization/imgs/review3_html_m20a05243.png

Page 5: Nonlinear component analysis as a kernel eigenvalue problem

Autoencoders
Hinton, G. E., & Salakhutdinov, R. R. (2006). Reducing the dimensionality of data with neural networks. Science, 313, 504-507.
● Feed-forward neural network
● Approximates the identity function

Motivation: possible solutions

http://www.nlpca.de/fig_NLPCA_bottleneck_autoassociative_autoencoder_neural_network.png

Page 6: Nonlinear component analysis as a kernel eigenvalue problem

Motivation: some new problems

● Low input dimensions

● Problem dependent

● Hard optimization problems

Page 7: Nonlinear component analysis as a kernel eigenvalue problem

Motivation: kernel trick

KPCA captures the overall variance of patterns

Page 8: Nonlinear component analysis as a kernel eigenvalue problem

Motivation: kernel trick

Page 9: Nonlinear component analysis as a kernel eigenvalue problem

Motivation: kernel trick

Page 10: Nonlinear component analysis as a kernel eigenvalue problem

Motivation: kernel trick

Page 11: Nonlinear component analysis as a kernel eigenvalue problem

Motivation: kernel trick

Video

Page 12: Nonlinear component analysis as a kernel eigenvalue problem

Principle

"We are not interested in PCs in the input space, we are interested in PCs of features that are nonlinearly related to the original ones"

[Diagram: Data mapped to Features]

Page 13: Nonlinear component analysis as a kernel eigenvalue problem

Principle

"We are not interested in PCs in the input space, we are interested in PCs of features that are nonlinearly related to the original ones"

[Diagram: Data mapped to New features]

Page 14: Nonlinear component analysis as a kernel eigenvalue problem

Principle
Given a data set of N centered observations in a d-dimensional space:
● PCA diagonalizes the covariance matrix:

● It is necessary to solve the following system of equations:

● We can define the same computation in another dot product space F:
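The equations referred to by the colons above were images in the original slides; in the paper's notation, with observations x_1, ..., x_N in R^d, they read

\[ C = \frac{1}{N}\sum_{j=1}^{N} x_j x_j^{\top}, \qquad \lambda v = C v, \quad v \in \mathbb{R}^d \setminus \{0\}, \]

and the same computation is carried into another dot product space F via a (possibly nonlinear) map \( \Phi : \mathbb{R}^d \to F,\ x \mapsto \Phi(x) \).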

Page 15: Nonlinear component analysis as a kernel eigenvalue problem

Principle
Given a data set of N centered observations in a high-dimensional space:
● Covariance matrix in the new space:

● Again, it is necessary to solve the following system of equations:

● This means that:
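Reconstructing the missing equations from the paper: the covariance matrix in F, its eigenvalue problem, and the consequence that every solution with nonzero eigenvalue lies in the span of the mapped data are

\[ \bar{C} = \frac{1}{N}\sum_{j=1}^{N} \Phi(x_j)\Phi(x_j)^{\top}, \qquad \lambda V = \bar{C} V, \qquad V = \sum_{i=1}^{N} \alpha_i\, \Phi(x_i). \]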

Page 16: Nonlinear component analysis as a kernel eigenvalue problem

Principle
● Combining the last three equations, we obtain:

● we define a new function

● and a new N x N matrix:

● our equation becomes:
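The new function, the new N x N matrix, and the resulting eigenvalue problem are, in the paper's notation:

\[ k(x_i, x_j) = (\Phi(x_i) \cdot \Phi(x_j)), \qquad K_{ij} = k(x_i, x_j), \qquad N\lambda\,\alpha = K\alpha. \]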

Page 17: Nonlinear component analysis as a kernel eigenvalue problem

Principle
● Let λ1 ≤ λ2 ≤ ... ≤ λN denote the eigenvalues of K, and α1, ..., αN the corresponding eigenvectors, with λp being the first nonzero eigenvalue; we then require that they are normalized in F:

● Encoding a data point y means computing:
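The normalization condition and the projection formula (shown as images in the original slides) are

\[ \lambda_k\, (\alpha^k \cdot \alpha^k) = 1, \qquad (V^k \cdot \Phi(y)) = \sum_{i=1}^{N} \alpha_i^k\, k(x_i, y), \]

i.e. encoding a point y onto the k-th nonlinear component only requires kernel evaluations against the training data.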

Page 18: Nonlinear component analysis as a kernel eigenvalue problem

Algorithm

● Centralization
For a given data set, subtract the mean from every observation to obtain centered data.

● Finding principal components
Compute the kernel matrix K using the kernel function, then find its eigenvalues and eigenvectors.

● Encoding training/testing data
Project a data point x onto the eigenvectors. This can be done since the eigenvalues and eigenvectors have been calculated (a sketch of all three steps follows).
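A minimal sketch of these three steps in Python with NumPy. The RBF kernel, the parameter sigma, and all variable names are our own illustrative choices rather than the paper's; this is a sketch under those assumptions, not a definitive implementation.

import numpy as np

def rbf_kernel(X, Y, sigma=1.0):
    # k(x, y) = exp(-||x - y||^2 / (2 * sigma^2))
    sq_dists = ((X[:, None, :] - Y[None, :, :]) ** 2).sum(axis=-1)
    return np.exp(-sq_dists / (2.0 * sigma ** 2))

def kernel_pca(X, n_components=2, sigma=1.0):
    # Step 1: build the kernel matrix and centre it in feature space F
    N = X.shape[0]
    K = rbf_kernel(X, X, sigma)
    one_n = np.full((N, N), 1.0 / N)
    K_centred = K - one_n @ K - K @ one_n + one_n @ K @ one_n
    # Step 2: eigen-decompose K and keep the leading components
    eigvals, eigvecs = np.linalg.eigh(K_centred)     # ascending order
    idx = np.argsort(eigvals)[::-1][:n_components]   # largest first
    lambdas, alphas = eigvals[idx], eigvecs[:, idx]
    # Normalise so that lambda_k * (alpha^k . alpha^k) = 1 (assumes lambdas > 0)
    alphas = alphas / np.sqrt(lambdas)
    # Step 3: encode the training data; row j, column k is the projection
    # of point j onto nonlinear component k
    projections = K_centred @ alphas
    return projections, alphas, lambdas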

Page 19: Nonlinear component analysis as a kernel eigenvalue problem

Algorithm
● Reconstructing training data
The operation cannot be done exactly, because the eigenvectors need not have pre-images in the original input space.

● Reconstructing a test data point
The operation cannot be done exactly for the same reason: the eigenvectors need not have pre-images in the original input space.

Page 20: Nonlinear component analysis as a kernel eigenvalue problem

Disadvantages
● Centering in the original space does not imply centering in F; we need to adjust the K matrix as shown below.
● KPCA is now a parametric technique:
○ choice of a proper kernel function
■ Gaussian, sigmoid, polynomial
○ Mercer's theorem
■ k(x, y) must be continuous, symmetric, and positive semi-definite (x^T K x ≥ 0)
■ this guarantees that K has non-negative eigenvalues
● Data reconstruction is not possible, unless an approximation formula is used.
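The centering adjustment mentioned in the first bullet is, in the paper's notation (with (1_N)_{ij} := 1/N):

\[ \tilde{K} = K - 1_N K - K 1_N + 1_N K 1_N . \]

The approximation formula for data reconstruction (the pre-image problem) is sketched in the Discussions section below.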

Page 21: Nonlinear component analysis as a kernel eigenvalue problem

● Time complexity

○ we will return to this point later

● Handles non-linearly separable problems

● Extraction of more principal components than PCA

○ Feature extraction vs. dimensionality reduction

Advantages

Page 22: Nonlinear component analysis as a kernel eigenvalue problem

● Applications

● Data Sets

● Methods compared

● Assessment

● Experiments

● Results

Experiments

Page 23: Nonlinear component analysis as a kernel eigenvalue problem

● Clustering
○ Density estimation
■ e.g. high correlation between features
○ De-noising
■ e.g. removing lighting effects from bright images
○ Compression
■ e.g. image compression
● Classification
○ e.g. categorisation

Applications

Page 24: Nonlinear component analysis as a kernel eigenvalue problem

(Table columns in the original slide: experiment name, created by, representation.)

● USPS Character Recognition
Created by: hand-written digits. Representation: labelled, 256 dimensions, 9298 digits.

● Simple example 1
Created by: "1 + 2 = 3", three Gaussians with sd = 0.1, distributed over [1, 1] x [0.5, 1]. Representation: three clusters, unlabelled, 2 dimensions.

● Simple example 2 (Kernels)
Created by: "1 + 2 = 3", uniform distribution over [-1, 1], y = x^2. Representation: unlabelled, 2 dimensions.

● De-noising
Created by: the eleven Gaussians over [-1, 1] with zero mean, y = x^2 plus noise with sd 0.1. Representation: a circle and a square, unlabelled, 10 dimensions.

Datasets

Page 25: Nonlinear component analysis as a kernel eigenvalue problem

1 Simple Example 1
Dataset: "1 + 2 = 3", the uniform distribution with sd = 0.2. Kernel: polynomial, degree 1-4.

2 USPS Character Recognition
Dataset: USPS. Methods: five-layer neural networks, kernel SVM, PCA + SVM. Parameters: for kernel PCA, polynomial kernels of degree 1-7 and 32-2048 components (doubling); the best parameters for the task for the neural networks and SVM.

3 De-noising
Dataset: de-noising, 11 Gaussians, sd = 0.1. Methods: kernel autoencoders, principal curves, kernel PCA, linear PCA. Parameters: the best parameters for the task.

4 Kernels
Radial basis function and sigmoid kernels. Parameters: the best parameters for the task.

Experiments

Page 26: Nonlinear component analysis as a kernel eigenvalue problem

These are the methods used in the experiments (supervised vs. unsupervised, linear vs. non-linear):

● Supervised: Neural Networks and SVM (non-linear, classification), Kernel LDA (non-linear, face recognition)
● Unsupervised: Linear PCA (linear, dimensionality reduction), Kernel PCA, Kernel Autoencoders, Principal Curves (non-linear, dimensionality reduction)

Methods

Page 27: Nonlinear component analysis as a kernel eigenvalue problem

● 1 Accuracy
Classification: exact classification. Clustering: comparable to other clustering methods.
● 2 Time complexity
The time to compute.
● 3 Storage complexity
The storage required for the data.
● 4 Interpretability
How easy the result is to understand.

Assessment

Page 28: Nonlinear component analysis as a kernel eigenvalue problem

● Nonlinear PCA paper example
Dataset: "1 + 2 = 3", the uniform distribution with sd 0.2. Classifier: polynomial kernel, degree 1-4. PCs: 1-3.
[Figure: polynomial kernels of degree 1-4; eigenvectors 1-3 of highest eigenvalue]

● Recreated example
Dataset: the USPS handwritten digits, training set of 3000. Classifier: SVM with dot-product kernel, degree 1-7. PCs: 32-2048 (doubling).

The function y = x^2 + B, with noise B of sd 0.2 from the uniform distribution over [-1, 1], is mapped to 3D by a kernel and PCA is applied.
Result: accurate clustering of the non-linear features.

Simple Example

Page 29: Nonlinear component analysis as a kernel eigenvalue problem

Dataset: the USPS handwritten digits, training set of 3000. Classifier: SVM with dot-product kernel, degree 1-7. PCs: 32-2048 (doubling).

● A linear classifier trained on non-linear components performs better than one trained on linear components.

● Performance improves over linear PCA as the number of components is increased.

[Figure: results of the character recognition experiment]

Character recognition

Page 30: Nonlinear component analysis as a kernel eigenvalue problem

De-noising of the non-linear features of the distribution.

Dataset: the de-noising eleven Gaussians, training set of 100. Classifier: Gaussian kernel with sd parameter. PCs: 2.

[Figure: results of the de-noising experiment]

De-noising

Page 31: Nonlinear component analysis as a kernel eigenvalue problem

The choice of kernel regulates the accuracy of the algorithm and depends on the application. The Mercer kernels give positive semi-definite Gram matrices.

Experiments:
● Radial basis function. Dataset: three Gaussians, sd 0.1. Kernel: k(x, y) = exp(-||x - y||^2 / 0.1). PCs: 1-8.
● Sigmoid. Dataset: three Gaussians, sd 0.1. Kernel: sigmoid. PCs: 1-3.

Kernels

Page 32: Nonlinear component analysis as a kernel eigenvalue problem

[Figure: RBF kernel, PCs 1-8; sigmoid kernel, PCs 1-3]

RBF:
- PCs 1-2 separate the 3 clusters.
- PCs 3-5 halve the clusters.
- PCs 6-8 split them orthogonally; the clusters are split into 12 regions.

Sigmoid:
- PCs 1-2 separate the 3 clusters.
- PC 3 halves the 3 clusters.
- The same number of PCs is needed to separate the clusters, but the sigmoid needs fewer PCs to halve them.

Results

Page 33: Nonlinear component analysis as a kernel eigenvalue problem

Experiment 1 / Experiment 2 / Experiment 3 / Experiment 4

1 Accuracy
Kernel: Polynomial 4 / Polynomial 4 / Gaussian 0.2 / Sigmoid
Components: 8, split to 12 / 512 / 2 / 3, split to 6
Accuracy: 4.4

2 Time

3 Space

4 Interpretability: Very good / Very good / Complicated / Very good

Results

Page 34: Nonlinear component analysis as a kernel eigenvalue problem

Kernel Fisher Discriminant (KDA)
Sebastian Mika, Gunnar Rätsch, Jason Weston, Bernhard Schölkopf, Klaus-Robert Müller

● Best discriminant projection

Discussions: KDA

http://lh3.ggpht.com/_qIDcOEX659I/S14l1wmtv6I/AAAAAAAAAxE/3G9kOsTt0VM/s1600-h/kda62.png

Page 35: Nonlinear component analysis as a kernel eigenvalue problem

Discussions

Doing PCA in F rather than in R^d

● The first k principal components carry more variance than any other k directions

● The mean squared approximation error from using the first k principal components is minimal

● The principal components are uncorrelated

Page 36: Nonlinear component analysis as a kernel eigenvalue problem

Discussions

Going into a higher-dimensional space in order to obtain a lower-dimensional representation
● Pick the right high-dimensional space

Need of a proper kernel

● What kernel to use?

○ Gaussian, sigmoidal, polynomial

● Problem dependent
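For reference, the kernel families named above are commonly written as follows (the degree d, width σ, gain κ and offset Θ are application-dependent parameters):

\[ k(x, y) = (x \cdot y)^d, \qquad k(x, y) = \exp\!\left(-\frac{\|x - y\|^2}{2\sigma^2}\right), \qquad k(x, y) = \tanh\bigl(\kappa\,(x \cdot y) + \Theta\bigr). \]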

Page 37: Nonlinear component analysis as a kernel eigenvalue problem

Discussions

Time Complexity

● A lot of features (a lot of dimensions).

● KPCA works!

○ Work only in the subspace of F spanned by the observed x's

○ No explicit dot-product calculation in F

● Computational complexity is hardly changed by the fact that we need to evaluate kernel functions rather than just dot products

○ (if the kernel is easy to compute)

○ e.g. Polynomial Kernels

Payback: a linear classifier can then be used.
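As an illustration of why kernel evaluation is cheap in the polynomial case, take the degree-2 kernel on R^2; with one choice of feature map,

\[ \Phi(x) = (x_1^2,\ \sqrt{2}\,x_1 x_2,\ x_2^2), \qquad (\Phi(x) \cdot \Phi(y)) = (x_1 y_1 + x_2 y_2)^2 = (x \cdot y)^2, \]

so the dot product in the higher-dimensional space is obtained for the cost of a single dot product in the input space.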

Page 38: Nonlinear component analysis as a kernel eigenvalue problem

Discussions

Pre-image reconstruction may be impossible

Approximation can be done in F (a sketch follows this list)

Needs an explicit ϕ

● Regression learning problem

● Non-linear optimization problem

● Algebraic solution (rarely)
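As a sketch of the approximation route, following the approximate pre-image work cited in the references [9, 10]: for a Gaussian kernel, one looks for a z minimizing \( \|\Phi(z) - \sum_i \gamma_i \Phi(x_i)\|^2 \), where the γ_i are the expansion coefficients of the projection in F, which leads to the fixed-point iteration

\[ z_{t+1} = \frac{\sum_{i=1}^{N} \gamma_i \exp\!\bigl(-\|z_t - x_i\|^2 / (2\sigma^2)\bigr)\, x_i}{\sum_{i=1}^{N} \gamma_i \exp\!\bigl(-\|z_t - x_i\|^2 / (2\sigma^2)\bigr)} . \]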

Page 39: Nonlinear component analysis as a kernel eigenvalue problem

Discussions

Interpretability

● Cross-features

○ Dependent on the kernel

● Reduced Space Features

○ Preserve the highest variance among the data in F.

Page 40: Nonlinear component analysis as a kernel eigenvalue problem

Conclusions

Applications

● Feature Extraction (Classification)

● Clustering

● Denoising

● Novelty detection

● Dimensionality Reduction (Compression)

Page 41: Nonlinear component analysis as a kernel eigenvalue problem

[1] J.T. Kwok and I.W. Tsang, "The Pre-Image Problem in Kernel Methods," IEEE Trans. Neural Networks, vol. 15, no. 6, pp. 1517-1525, 2004.
[2] G.E. Hinton and R.R. Salakhutdinov, "Reducing the dimensionality of data with neural networks," Science, 313, 504-507, 2006.
[3] Sebastian Mika, Gunnar Rätsch, Jason Weston, Bernhard Schölkopf, Klaus-Robert Müller.
[4] Trevor Hastie and Werner Stuetzle, "Principal Curves," Journal of the American Statistical Association, Vol. 84, No. 406 (Jun. 1989), pp. 502-516.
[5] G. Moser, "Analisi delle componenti principali," Tecniche di trasformazione di spazi vettoriali per analisi statistica multi-dimensionale.
[6] I.T. Jolliffe, "Principal Component Analysis," Springer-Verlag, 2002.
[7] Wikipedia, "Kernel Principal Component Analysis," 2011.
[8] A. Ghodsi, "Data Visualization," 2006.
[9] B. Schölkopf, S. Mika, A. Smola, G. Rätsch, and K.-R. Müller, "Kernel PCA pattern reconstruction via approximate pre-images," in Proceedings of the 8th International Conference on Artificial Neural Networks, pp. 147-152, 1998.

References

Page 42: Nonlinear component analysis as a kernel eigenvalue problem

[10] J.T. Kwok and I.W. Tsang, "The pre-image problem in kernel methods," Proceedings of the Twentieth International Conference on Machine Learning (ICML-2003), 2003.
[11] K.-R. Müller, S. Mika, G. Rätsch, K. Tsuda, and B. Schölkopf, "An Introduction to Kernel-Based Learning Algorithms," IEEE Transactions on Neural Networks, vol. 12, no. 2, March 2001.
[12] S. Mika, B. Schölkopf, A. Smola, K.-R. Müller, M. Scholz, and G. Rätsch, "Kernel PCA and De-Noising in Feature Spaces."

References

Page 43: Nonlinear component analysis as a kernel eigenvalue problem

Thank you