
Input Space versus Feature Space in Kernel-Based Methods

Scholkopf, Mika, Burges, Knirsch, Muller, Ratsch, Smola

presented by:

Joe Drish

Department of Computer Science and Engineering

University of California, San Diego

Goals

1) Introduce and illustrate the kernel trick

2) Discuss the kernel mapping from input space to feature space F

3) Review kernel algorithms: SVMs and kernel PCA

4) Discuss the interpretation of the return from F to input space after the dot product computation

5) Discuss how to construct sparse approximations of feature space expansions

6) Evaluate and discuss the performance of SVMs and PCA

Objectives of the paper

Applications of kernel methods

1) Handwritten digit recognition

2) Face recognition

3) De-noising: this paper

Definition

A reproducing kernel k is a function k: X × X → R.
• The domain of k consists of the data patterns {x1, …, xl} in X
• X is a compact set in which the data lives
• X is typically a subset of R^N

Computing k is equivalent to mapping the data patterns into a higher-dimensional space F, and then taking the dot product there.

A feature map Φ: R^N → F is a function that maps the input data patterns into a higher-dimensional space F.


Illustration

Using a feature map Φ to map the data from input space into a higher-dimensional feature space F:

[Figure: input patterns X and O in input space, and their images Φ(X) and Φ(O) in the feature space F]

Kernel Trick

We would like to compute the dot product in the higher-dimensional space, or

Φ(x) · Φ(y).

To do this we only need to compute

k(x, y), since

k(x, y) = Φ(x) · Φ(y).

Note that the feature map Φ is never explicitly computed. We avoid this, and therefore avoid a burdensome computational task.
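As a small numerical sketch (not part of the original slides), the identity k(x, y) = Φ(x) · Φ(y) can be verified for the homogeneous degree-2 polynomial kernel k(x, y) = (x · y)^2, whose feature map is known in closed form; the values below are illustrative.

import numpy as np

def phi(v):
    # Explicit feature map for k(x, y) = (x . y)^2 on R^2:
    # Phi(v) = (v1^2, v2^2, sqrt(2) * v1 * v2)
    return np.array([v[0]**2, v[1]**2, np.sqrt(2) * v[0] * v[1]])

x = np.array([1.0, 2.0])
y = np.array([3.0, 4.0])

dot_in_F = phi(x) @ phi(y)      # dot product computed explicitly in feature space F
kernel_value = (x @ y) ** 2     # the same quantity computed in input space via the kernel

print(dot_in_F, kernel_value)   # both 121.0
assert np.isclose(dot_in_F, kernel_value)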

Example kernels

Gaussian:    k(x, y) = exp(−||x − y||^2 / (2σ^2))

Polynomial:  k(x, y) = ((x · y) + c)^d,  c ≥ 0

Sigmoid:     k(x, y) = tanh(κ(x · y) + Θ),  κ > 0, Θ ∈ R

Nonlinear separation can be achieved.
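For concreteness, the three kernels above can be coded directly; this is a sketch, with parameter names (sigma, c, d, kappa, theta) chosen here for illustration.

import numpy as np

def gaussian(x, y, sigma=1.0):
    # k(x, y) = exp(-||x - y||^2 / (2 sigma^2))
    return np.exp(-np.sum((x - y) ** 2) / (2 * sigma ** 2))

def polynomial(x, y, c=1.0, d=3):
    # k(x, y) = ((x . y) + c)^d, with c >= 0
    return (x @ y + c) ** d

def sigmoid(x, y, kappa=1.0, theta=0.0):
    # k(x, y) = tanh(kappa * (x . y) + theta)
    return np.tanh(kappa * (x @ y) + theta)

x, y = np.array([1.0, 0.5]), np.array([0.2, -0.3])
print(gaussian(x, y), polynomial(x, y), sigmoid(x, y))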

Nonlinear Separation

Mercer Theory

A = Σ_i λ_i u_i u_i^T

k(x, y) = Σ_{i=1}^{N_F} λ_i ψ_i(x) ψ_i(y)

Necessary condition for the kernel (Mercer) trick: the eigenvalues λ_i are non-negative.

• N_F corresponds to the rank of A = Σ_i λ_i u_i u_i^T (u_i u_i^T is the outer product)
• ψ_i is the normalized eigenfunction, analogous to a normalized eigenvector
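The non-negativity of the eigenvalues can be checked numerically: for a valid Mercer kernel such as the Gaussian, the Gram matrix of any finite sample is positive semi-definite. A minimal sketch, assuming a Gaussian kernel with sigma = 1:

import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(30, 2))                       # 30 sample patterns in R^2

# Gram matrix K_ij = k(x_i, x_j) for the Gaussian kernel
sq_dists = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
K = np.exp(-sq_dists / 2.0)

eigvals = np.linalg.eigvalsh(K)                    # eigenvalues of the symmetric Gram matrix
print(eigvals.min())                               # non-negative, up to numerical precision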

Input Space to Feature Space

Mercer :: Linear Algebra

Linear algebra analogy:

Eigenvector problem:    A u = λ u
Eigenfunction problem:  ∫ k(x, y) f(y) dy = λ f(x)

• x and y are vectors
• u is the normalized eigenvector; f is the normalized eigenfunction
• λ is the eigenvalue
• A corresponds to k(x, y)
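The analogy can also be made concrete numerically: discretizing the integral on a grid turns the eigenfunction problem into an ordinary matrix eigenvector problem. A sketch, assuming a Gaussian kernel on the interval [0, 1]:

import numpy as np

# Discretize  ∫ k(x, y) f(y) dy = λ f(x)  on n grid points:
# the integral operator becomes the matrix (K * dy), and f becomes an eigenvector.
n = 200
grid = np.linspace(0.0, 1.0, n)
dy = grid[1] - grid[0]

K = np.exp(-(grid[:, None] - grid[None, :]) ** 2 / (2 * 0.1 ** 2))   # Gaussian kernel, sigma = 0.1

lam, U = np.linalg.eigh(K * dy)        # eigenvalues and discretized eigenfunctions
lam, U = lam[::-1], U[:, ::-1]         # sort by decreasing eigenvalue
print(lam[:5])                         # leading eigenvalues of the integral operator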

RKHS, Capacity, Metric

Reproducing kernel Hilbert space (RKHS)

• Hilbert space of functions f on some set X such that all evaluation functionals are continuous, and the functions can be reproduced by the kernel

Capacity of the kernel map

• Bound on how many training examples are required for learning, measured by the VC-dimension h

Metric of the kernel map

• Intrinsic shape of the manifold to which the data is mapped

Support Vector Machines

The decision function takes the form:

f(x) = sgn( Σ_i α_i y_i k(x, x_i) + b )

• Similar to a single-layer perceptron
• Training examples x_i with non-zero coefficients α_i are support vectors
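A sketch of evaluating such a decision function, assuming a Gaussian kernel and toy values for the support vectors, coefficients, labels, and bias (none of these come from the paper):

import numpy as np

def svm_decision(x, support_vectors, alphas, labels, b, sigma=1.0):
    # f(x) = sgn( sum_i alpha_i * y_i * k(x, x_i) + b ) with a Gaussian kernel
    k = np.exp(-np.sum((support_vectors - x) ** 2, axis=1) / (2 * sigma ** 2))
    return np.sign(np.dot(alphas * labels, k) + b)

# Toy values, for illustration only
sv = np.array([[0.0, 1.0], [1.0, 0.0]])     # support vectors x_i
alphas = np.array([0.7, 0.7])               # non-zero coefficients alpha_i
labels = np.array([+1.0, -1.0])             # class labels y_i

print(svm_decision(np.array([0.1, 0.9]), sv, alphas, labels, b=0.0))   # +1.0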

Kernel Principal Component Analysis

KPCA carries out a linear PCA in the feature space F

The extracted features take the nonlinear form

f_k(x) = Σ_{i=1}^{l} α_i^k k(x, x_i),

where the α_i^k are the components of the k-th eigenvector of the matrix (k(x_i, x_j))_{ij}.
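A condensed sketch of this feature extraction, assuming a Gaussian kernel; the centering of the Gram matrix follows the standard kernel PCA formulation, and the function name and parameters are illustrative:

import numpy as np

def kernel_pca_features(X, n_components=2, sigma=1.0):
    # Project the l training patterns onto the leading nonlinear principal components.
    l = X.shape[0]
    sq = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
    K = np.exp(-sq / (2 * sigma ** 2))            # Gram matrix K_ij = k(x_i, x_j)

    # Center the kernel matrix in feature space
    one = np.full((l, l), 1.0 / l)
    Kc = K - one @ K - K @ one + one @ K @ one

    lam, A = np.linalg.eigh(Kc)                   # eigenvalues in ascending order
    lam = lam[::-1][:n_components]
    A = A[:, ::-1][:, :n_components]
    A = A / np.sqrt(lam)                          # normalize so that lam_k * (a^k . a^k) = 1

    return Kc @ A                                 # feature k of pattern i: sum_j a_j^k k_c(x_i, x_j)

X = np.random.default_rng(1).normal(size=(100, 2))
print(kernel_pca_features(X).shape)               # (100, 2)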

KPCA and Dot Products

Wish to find eigenvectors V and eigenvalues λ of the covariance matrix

C = (1/l) Σ_{i=1}^{l} Φ(x_i) Φ(x_i)^T.

Again, replace

Φ(x) · Φ(y)

with

k(x, y).

From Feature Space to Input Space

Pre-image problem: given a vector Ψ in F, find a point z in input space with Φ(z) = Ψ.

Here, Ψ is generally not in the image of Φ.

Projection Distance Illustration

Approximate the vector Ψ ∈ F:

Minimizing Projection Distance

Maximize:

z is an approximate pre-image for Ψ if:

For kernels where k(z,z) = 1 (Gaussian), this reduces to:

Fixed-point Iteration

Requiring no step size, we can iterate. Assuming a Gaussian kernel:

z_{t+1} = Σ_i γ_i exp(−||z_t − x_i||^2 / (2σ^2)) x_i  /  Σ_i γ_i exp(−||z_t − x_i||^2 / (2σ^2))

• γ_i are the expansion coefficients, obtained from the eigenvectors of the centered Gram matrix
• x_i are the input patterns
• σ is the kernel width
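A sketch of this iteration, assuming a Gaussian kernel; here gammas stand for the expansion coefficients of the feature-space vector being approximated, and the data and values are illustrative:

import numpy as np

def preimage_fixed_point(X, gammas, sigma=1.0, n_iter=50):
    # z <- sum_i gamma_i * exp(-||z - x_i||^2 / (2 sigma^2)) * x_i, normalized by the same weights
    z = X[np.argmax(gammas)].copy()               # start from the pattern with the largest coefficient
    for _ in range(n_iter):
        w = gammas * np.exp(-np.sum((X - z) ** 2, axis=1) / (2 * sigma ** 2))
        z = (w @ X) / np.sum(w)
    return z

rng = np.random.default_rng(2)
X = rng.normal(size=(50, 2))                      # input patterns x_i
gammas = np.full(50, 1.0 / 50)                    # illustrative expansion coefficients
print(preimage_fixed_point(X, gammas))            # approximate pre-image in input space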

Kernel PCA Toy Example

Generated an artificial data set from three point sources, 100 points each.

De-noising by Reconstruction, Part One

• Reconstruction from projections onto the eigenvectors from the previous example
• Generated 20 new points from each Gaussian
• Represented by their first n = 1, 2, …, 8 nonlinear principal components

De-noising by Reconstruction, Part Two

• Original points move in the direction of de-noising

De-noising in 2 Dimensions

• A half circle and a square in the plane
• De-noised versions are the solid lines

De-noising USPS Data Patterns

Patterns: 7291 train, 2007 test
Size: 16 × 16

Linear PCA

Kernel PCA

Questions