Presentation of paper #7:
"Nonlinear component analysis as a kernel eigenvalue problem"
Schölkopf, Smola, Müller. Neural Computation 10, 1299-1319, MIT Press (1998)
Group C:
M. Filannino, G. Rates, U. Sandouk
COMP61021: Modelling and Visualization of high-dimensional data
Introduction
● Kernel Principal Component Analysis (KPCA)
○ KPCA is an extension of Principal Component Analysis
○ It computes PCA in a new feature space
○ Useful for feature extraction and dimensionality reduction
Motivation: possible solutions
Principal Curves
Trevor Hastie; Werner Stuetzle, "Principal Curves," Journal of the American Statistical Association, Vol. 84, No. 406 (Jun. 1989), pp. 502-516.
● Optimization (including the quality of data approximation)
● Natural geometric meaning
● Natural projection
http://pisuerga.inf.ubu.es/cgosorio/Visualization/imgs/review3_html_m20a05243.png
Motivation: possible solutions
Autoencoders
Hinton, G. E., & Salakhutdinov, R. R. (2006). Reducing the dimensionality of data with neural networks. Science, 313, 504-507.
● Feed-forward neural network
● Approximates the identity function
http://www.nlpca.de/fig_NLPCA_bottleneck_autoassociative_autoencoder_neural_network.png
Motivation: some new problems
● Low input dimensions
● Problem dependent
● Hard optimization problems
Motivation: kernel trick
KPCA captures the overall variance of patterns
Video
Principle
"We are not interested in PCs in the input space, we are interested in PCs of features that are nonlinearly related to the original ones"
[Figure: data points in the input space mapped into a new, nonlinearly related feature space]
Principle
Given a data set of N centered observations in a d-dimensional space:
● PCA diagonalizes the covariance matrix:
● It is necessary to solve the following system of equations:
● We can define the same computation in another dot product space F:
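The equations referenced above appeared as images in the original slides; a sketch of the standard forms from Schölkopf et al. (1998):

$$ C = \frac{1}{N}\sum_{j=1}^{N} x_j x_j^{\top}, \qquad \lambda v = C v, \qquad \Phi : \mathbb{R}^d \to F, \; x \mapsto \Phi(x) $$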
Principle
Given a data set of N centered observations in a high-dimensional space:
● Covariance matrix in the new space:
● Again, it is necessary to solve the following system of equations:
● This means that:
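With the slide images missing, the corresponding equations from the paper are:

$$ \bar{C} = \frac{1}{N}\sum_{j=1}^{N} \Phi(x_j)\Phi(x_j)^{\top}, \qquad \lambda V = \bar{C}\, V $$

and "this means that" every eigenvector V with nonzero eigenvalue lies in the span of the mapped data:

$$ V = \sum_{i=1}^{N} \alpha_i\, \Phi(x_i) $$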
Principle
● Combining the last three equations, we obtain:
● we define a new function
● and a new N x N matrix:
● our equation becomes:
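Written out (again following the paper, since the formulas were shown as images):

$$ \lambda\, (\Phi(x_k) \cdot V) = (\Phi(x_k) \cdot \bar{C}\, V) \quad \text{for } k = 1, \dots, N $$

$$ k(x, y) := (\Phi(x) \cdot \Phi(y)), \qquad K_{ij} := k(x_i, x_j) $$

so that the eigenvalue problem becomes

$$ N \lambda\, \alpha = K \alpha $$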
Principle
● Let λ1 ≤ λ2 ≤ ... ≤ λN denote the eigenvalues of K, and α1, ..., αN the corresponding eigenvectors, with λp being the first nonzero eigenvalue; we then require that they are normalized in F:
● Encoding a data point y means computing:
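The normalization and encoding formulas, reconstructed from the paper:

$$ \lambda_p\, (\alpha^p \cdot \alpha^p) = 1 \quad \text{(equivalent to } (V^p \cdot V^p) = 1 \text{ in } F\text{)} $$

$$ (V^p \cdot \Phi(y)) = \sum_{i=1}^{N} \alpha_i^p\, k(x_i, y) $$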
Algorithm
● Centralization
For a given data set, subtract the mean from all the observations to obtain the centered data in R^N.
● Finding principal components
Compute the kernel matrix K using the kernel function, then find its eigenvectors and eigenvalues.
● Encoding training/testing data
where x is a vector that encodes the training data. This can be done since we have calculated the eigenvalues and eigenvectors.
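A minimal sketch of these steps in Python with NumPy. The RBF kernel choice, the toy two-circles dataset, and all function names here are illustrative assumptions, not taken from the slides or the paper:

import numpy as np

def rbf_kernel(a, b, sigma=1.0):
    # Gaussian (RBF) kernel: k(a, b) = exp(-||a - b||^2 / (2 * sigma^2))
    return np.exp(-np.sum((a - b) ** 2) / (2 * sigma ** 2))

def kernel_pca(X, n_components=2, sigma=1.0):
    N = X.shape[0]
    # 1. Kernel matrix K_ij = k(x_i, x_j)
    K = np.array([[rbf_kernel(xi, xj, sigma) for xj in X] for xi in X])
    # 2. Centre K in feature space: K~ = K - 1N K - K 1N + 1N K 1N, with (1N)_ij = 1/N
    one_n = np.full((N, N), 1.0 / N)
    Kc = K - one_n @ K - K @ one_n + one_n @ K @ one_n
    # 3. Eigendecomposition (eigh returns ascending eigenvalues), then sort descending
    eigvals, eigvecs = np.linalg.eigh(Kc)
    eigvals, eigvecs = eigvals[::-1], eigvecs[:, ::-1]
    # 4. Normalise the expansion coefficients so that (eigenvalue of K) * ||alpha^p||^2 = 1
    #    (the paper's normalisation condition, with K's eigenvalues equal to N * lambda)
    alphas = eigvecs[:, :n_components] / np.sqrt(eigvals[:n_components])
    # 5. Encode the training data: projections onto the leading kernel components
    return Kc @ alphas

# Toy usage (assumed data): two noisy concentric circles in 2D
rng = np.random.default_rng(0)
angles = rng.uniform(0, 2 * np.pi, 200)
radii = np.repeat([1.0, 3.0], 100) + rng.normal(0.0, 0.1, 200)
X = np.c_[radii * np.cos(angles), radii * np.sin(angles)]
Z = kernel_pca(X, n_components=2, sigma=1.0)
print(Z.shape)  # (200, 2)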
Algorithm
● Reconstructing training data
The operation cannot be done directly, because the eigenvectors do not have pre-images in the original space.
● Reconstructing a test data point
The operation cannot be done directly, because the eigenvectors do not have pre-images in the original space.
Disadvantages
● Centering in the original space does not mean centering in F; we need to adjust the K matrix accordingly (see the sketch after this list):
● KPCA is now a parametric technique:
○ choice of a proper kernel function
■ Gaussian, sigmoid, polynomial
○ Mercer's theorem
■ k(x,y) must be continuous, symmetric, and positive semi-definite (xTAx ≥ 0)
■ it guarantees that the eigenvalues are non-negative
● Data reconstruction is not possible unless an approximation formula is used:
● Time complexity
○ we will return to this point later
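The K-matrix adjustment referred to above appeared as an image in the slides; the standard centering formula from the paper is

$$ \tilde{K} = K - \mathbf{1}_N K - K \mathbf{1}_N + \mathbf{1}_N K \mathbf{1}_N, \qquad (\mathbf{1}_N)_{ij} := \tfrac{1}{N} $$

The reconstruction approximation is the pre-image method of [9]: one looks for a point z whose map Φ(z) is as close as possible to the projection of Φ(x) onto the leading components (a concrete sketch is given in the Discussions section below).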
Advantages
● Handles non-linearly separable problems
● Extraction of more principal components than PCA
○ Feature extraction vs. dimensionality reduction
Experiments
● Applications
● Data Sets
● Methods compared
● Assessment
● Experiments
● Results
Applications
● Clustering
○ Density estimation (e.g. high correlation between features)
○ De-noising (e.g. removing lighting effects from bright images)
○ Compression (e.g. image compression)
● Classification (e.g. categorisation)
Datasets
Experiment name / Created by / Representation:
● USPS Character Recognition / Handwritten digits / Labelled, 256 dimensions, 9298 digits
● Simple example 1 / 1+2 = 3, three Gaussians, sd = 0.1, dist. [-1, 1] x [-0.5, 1] / Three clusters, unlabelled, 2 dimensions
● Simple example 2 (Kernels) / 1+2 = 3, uniform distribution, dist. [-1, 1], y = x^2 / Unlabelled, 2 dimensions
● De-noising / The eleven Gaussians, dist. [-1, 1] with zero mean; y = x^2 + C, C noise sd 0.1 / A circle and a square, unlabelled, 10 dimensions
Experiments
1. Simple Example: Dataset: 1+2 = 3, the uniform distribution, sd = 0.2; Kernel: polynomial, degree 1-4
2. USPS Character Recognition: Dataset: USPS; Methods: five-layer neural networks, kernel SVM, PCA + SVM; Parameters: Kernel PCA with a polynomial kernel of degree 1-7 and 32-2048 components (x2); the best parameters for the task for the neural networks and the SVM
3. De-noising: Dataset: de-noising, 11 Gaussians, sd = 0.1; Methods: kernel autoencoders, principal curves, kernel PCA, linear PCA; Parameters: the best parameters for the task
4. Kernels: radial basis function, sigmoid; Parameters: the best parameters for the task
Methods
These are the methods used in the experiments:
● Linear, unsupervised (dimensionality reduction): Linear PCA
● Non-linear, supervised (classification, face recognition): Neural Networks, SVM, Kernel LDA
● Non-linear, unsupervised (dimensionality reduction): Kernel PCA, Kernel Autoencoders, Principal Curves
Assessment
1. Accuracy: Classification: exact classification; Clustering: comparable to other clusters
2. Time complexity: the time to compute
3. Storage complexity: the storage of the data
4. Interpretability: how easy it is to understand
Simple Example
● Nonlinear PCA paper example: Dataset: 1+2 = 3, the uniform distribution with sd 0.2; Classifier: the polynomial kernel of degree 1-4; PC: 1-3
○ Kernel: polynomial, degree 1-4
○ The eigenvectors 1-3 of highest eigenvalue
● Recreated example: Dataset: the USPS handwritten digits; Training set: 3000; Classifier: the SVM (dot product kernel, degree 1-7); PC: 32-2048 (x2)
The function y = x^2 + B, with noise B of sd 0.2 from a uniform distribution over [-1, 1], is mapped to 3D by a kernel and reduced to 2D by PCA.
Accurate clustering of non-linear features.
Character recognition
Dataset: the USPS handwritten digits; Training set: 3000; Classifier: the SVM (dot product kernel, degree 1-7); PC: 32-2048 (x2)
● The performance of a linear classifier trained on non-linear components is better than one trained on linear components
● The performance improves over linear PCA as the number of components is increased
Fig: the result of the character recognition experiment
De-noising
De-noising of the non-linear features of the distribution.
Dataset: the de-noising eleven Gaussians; Training set: 100; Classifier: the Gaussian kernel (sd parameter); PC: 2
Fig: the result of the de-noising experiment
Kernels
The choice of kernel regulates the accuracy of the algorithm and depends on the application. The Gram matrices of Mercer kernels are positive semi-definite.
Experiments:
● Radial Basis Function: Dataset: three Gaussians, sd 0.1; Kernel: k(x, y) = exp(-||x - y||^2 / 0.1); PC: 1-8
● Sigmoid: Dataset: three Gaussians, sd 0.1; Kernel: sigmoid; PC: 1-3
[Figures: RBF kernel, PC 1-8; sigmoid kernel, PC 1-3]
RBF:
● PCs 1-2 separate the 3 clusters
● PCs 3-5 halve the clusters
● PCs 6-8 split them orthogonally; the clusters are split into 12 regions
Sigmoid:
● PCs 1-2 separate the 3 clusters
● PC 3 halves the 3 clusters
● The same number of PCs separates the clusters; the sigmoid needs fewer PCs to halve them
Results
                     Experiment 1      Experiment 2   Experiment 3   Experiment 4
Kernel               Polynomial 4      Polynomial 4   Gaussian 0.2   Sigmoid
Components           8 (split to 12)   512            2              3 (split to 6)
1 Accuracy           -                 4.4            -              -
2 Time               -                 -              -              -
3 Space              -                 -              -              -
4 Interpretability   Very good         Very good      Complicated    Very good
Discussions: KDA
Kernel Fisher Discriminant (KDA)
Sebastian Mika, Gunnar Rätsch, Jason Weston, Bernhard Schölkopf, Klaus-Robert Müller
● Best discriminant projection
http://lh3.ggpht.com/_qIDcOEX659I/S14l1wmtv6I/AAAAAAAAAxE/3G9kOsTt0VM/s1600-h/kda62.png
Discussions
Doing PCA in F rather than in R^d
● The first k principal components carry more variance than any
other k directions
● The mean squared approximation error of the first k principal components is minimal
● The principal components are uncorrelated
Discussions
Going into a higher-dimensional space in order to reach a lower-dimensional representation
● Pick the right high-dimensional space
Need for a proper kernel
● What kernel to use?
○ Gaussian, sigmoidal, polynomial
● Problem dependent
Discussions
Time Complexity
● A lot of features (a lot of dimensions)
● KPCA works!
○ Subspace of F (only the observed x's)
○ No explicit dot product calculation in F (only kernel evaluations)
● Computational complexity is hardly changed by the fact that we need to evaluate a kernel function rather than just dot products
○ (if the kernel is easy to compute)
○ e.g. Polynomial Kernels
Payback: we can then use a linear classifier.
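As a worked illustration of the polynomial-kernel case (the p = 256, d = 4 numbers below are my own example, chosen to match the USPS input dimension):

$$ k(x, y) = (x \cdot y)^d = \Big( \sum_{i=1}^{p} x_i y_i \Big)^d $$

is a dot product in the space of all degree-d monomials of the input coordinates, whose dimension is $\binom{p+d-1}{d}$ (about $1.8 \times 10^8$ for p = 256, d = 4), yet evaluating the kernel costs only one p-dimensional dot product and one exponentiation.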
Discussions
Pre-image reconstruction may be impossible
Approximation can be done in F
Need an explicit ϕ
● Regression learning problem
● Non-linear optimization problem
● Algebraic solution (rarely)
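For a concrete case, with the Gaussian kernel the approximate pre-image z of a feature-space point $P\Phi(x) = \sum_i \gamma_i \Phi(x_i)$ can be found by the fixed-point iteration of [9]:

$$ z_{t+1} = \frac{\sum_{i=1}^{N} \gamma_i \exp(-\|z_t - x_i\|^2 / (2\sigma^2))\, x_i}{\sum_{i=1}^{N} \gamma_i \exp(-\|z_t - x_i\|^2 / (2\sigma^2))} $$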
Discussions
Interpretability
● Cross-features
○ Dependent on the kernel
● Reduced Space Features
○ Preserves the highest variance
among data in F.
Conclusions
Applications
● Feature Extraction (Classification)
● Clustering
● Denoising
● Novelty detection
● Dimensionality Reduction (Compression)
References
[1] J.T. Kwok and I.W. Tsang, "The Pre-Image Problem in Kernel Methods," IEEE Trans. Neural Networks, vol. 15, no. 6, pp. 1517-1525, 2004.
[2] G. E. Hinton and R. R. Salakhutdinov, "Reducing the dimensionality of data with neural networks," Science, 313, 504-507, 2006.
[3] S. Mika, G. Rätsch, J. Weston, B. Schölkopf, and K.-R. Müller, "Fisher discriminant analysis with kernels," 1999.
[4] T. Hastie and W. Stuetzle, "Principal Curves," Journal of the American Statistical Association, vol. 84, no. 406, pp. 502-516, Jun. 1989.
[5] G. Moser, "Analisi delle componenti principali" (Principal component analysis), Tecniche di trasformazione di spazi vettoriali per analisi statistica multi-dimensionale.
[6] I.T. Jolliffe, "Principal Component Analysis," Springer-Verlag, 2002.
[7] Wikipedia, "Kernel Principal Component Analysis," 2011.
[8] A. Ghodsi, "Data Visualization," 2006.
[9] B. Schölkopf, S. Mika, A. Smola, G. Rätsch, and K.-R. Müller, "Kernel PCA pattern reconstruction via approximate pre-images," in Proceedings of the 8th International Conference on Artificial Neural Networks, pp. 147-152, 1998.
[10] J.T. Kwok and I.W. Tsang, "The pre-image problem in kernel methods," in Proceedings of the Twentieth International Conference on Machine Learning (ICML-2003), 2003.
[11] K.-R. Müller, S. Mika, G. Rätsch, K. Tsuda, and B. Schölkopf, "An Introduction to Kernel-Based Learning Algorithms," IEEE Transactions on Neural Networks, vol. 12, no. 2, March 2001.
[12] S. Mika, B. Schölkopf, A. Smola, K.-R. Müller, M. Scholz, and G. Rätsch, "Kernel PCA and De-Noising in Feature Spaces."
Thank you