
All you want to know about GPs: Gaussian Process Latent Variable Model

Raquel Urtasun and Neil Lawrence

TTI Chicago, University of Sheffield

June 16, 2012


Motivation for Non-Linear Dimensionality Reduction

USPS Data Set: Handwritten Digit

3648 Dimensions

- 64 rows by 57 columns
- Space contains more than just this digit.
- Even if we sample every nanosecond from now until the end of the universe, you won't see the original six!


Simple Model of Digit

Rotate a ’Prototype’


MATLAB Demo

demDigitsManifold([1 2], ’all’)

[Figure: output of the demo, the digit data plotted in the space of the first two principal components (PC no 1 vs PC no 2).]


demDigitsManifold([1 2], ’sixnine’)

[Figure: output of the 'sixnine' variant of the demo, again plotted against PC no 1 and PC no 2.]


Low Dimensional Manifolds

Pure Rotation is too Simple

In practice the data may undergo several distortions.

- e.g. digits undergo 'thinning', translation and rotation.

For data with ‘structure’:

- we expect fewer distortions than dimensions;
- we therefore expect the data to live on a lower dimensional manifold.

Conclusion: deal with high dimensional data by looking for a lower dimensional non-linear embedding.


Feature Selection

Figure: demRotationDist. Feature selection via distance preservation.


Feature Extraction

Figure: demRotationDist. Rotation preserves interpoint distances. Residuals are much reduced.


Which Rotation?

We need the rotation that will minimise residual error.

Retain features/directions with maximum variance.

Error is then given by the sum of residual variances.

E(X) = \frac{2}{p} \sum_{k=q+1}^{p} \sigma_k^2

Rotations of the data matrix do not affect this analysis.

Rotate data so that largest variance directions are retained.
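A minimal MATLAB sketch of this computation on synthetic data (the sizes and the toy data are our own illustration, not from the slides):

n = 200; p = 10; q = 2;
Y = randn(n, 3) * randn(3, p);                % toy data with low-dimensional structure
Y = Y - repmat(mean(Y, 1), n, 1);             % centre the data
[U, D] = eig(Y' * Y / n);                     % directions and variances
[v, idx] = sort(diag(D), 'descend');
Uq = U(:, idx(1:q));                          % retained maximum-variance directions
E = (2 / p) * sum(v(q + 1 : p))               % residual error from the discarded variances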


Reminder: Principal Component Analysis

How do we find these directions?

Find directions in data with maximal variance.

- That’s what PCA does!

PCA: rotate data to extract these directions.

PCA: work on the sample covariance matrix S = n^{-1} Y^\top Y.


Principal Coordinates Analysis

The rotation which finds the directions of maximum variance is given by the eigenvectors of the covariance matrix.

The variance in each direction is given by the eigenvalues.

Problem: working directly with the sample covariance, S, may be impossible.


Equivalent Eigenvalue Problems

Principal Coordinate Analysis operates on Y Y^\top.

The two eigenvalue problems are equivalent. One solves for the rotation, the other solves for the location of the rotated points.

When p < n it is easier to solve for the rotation, R_q. But when p > n we solve for the embedding (principal coordinate analysis) from the distance matrix.

Can we compute Y Y^\top instead?


The Covariance Interpretation

n^{-1} Y^\top Y is the data covariance.

Y Y^\top is a centred inner product matrix.

- Also has an interpretation as a covariance matrix (Gaussian processes).
- It expresses correlation and anti-correlation between data points.
- Standard covariance expresses correlation and anti-correlation between data dimensions.
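A two-line MATLAB illustration of the two matrices (toy sizes, our own example):

Y = randn(8, 3);          % n = 8 points in p = 3 dimensions (assume already centred)
S = Y' * Y / 8;           % 3-by-3: covariance between data *dimensions*
K = Y * Y';               % 8-by-8: inner products (covariance) between data *points*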


Summary up to now on dimensionality reduction

Distributions can behave very non-intuitively in high dimensions.

Fortunately, most data is not really high dimensional.

Probabilistic PCA exploits linear low dimensional structure in the data.

- The probabilistic interpretation brings with it many advantages: extensibility, Bayesian approaches, missing data.

Didn’t deal with the non-linearities highlighted by the six example!

Let’s look at non-linear dimensionality reduction.


Spectral methods

LLE (Roweis & Saul, 00), ISOMAP (Tenenbaum et al., 00), Laplacian Eigenmaps (Belkin & Niyogi, 01)

Based on local distance preservation

Figure: LLE embeddings from densely sampled data


Tangled String

Sometimes local distance preservation in data space is wrong.

The pink and blue balls should be separated.

But the assumption makes the problem simpler (for spectral methods it is convex).


Generative Models

Directly model the generating process.

Map from string to position in space.

How to model observation “generation”?


Example of data generation

y_1 = f_1(x), \qquad y_2 = f_2(x)

Figure: A string in two dimensions, formed by mapping from a one dimensional line, x, to a two dimensional space, [y_1, y_2], using nonlinear functions f_1(\cdot) and f_2(\cdot).
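A toy MATLAB illustration of this kind of generation (the particular sinusoidal choices of f_1 and f_2 are ours, not from the slides):

x  = linspace(-1, 1, 200)';                         % one-dimensional latent variable
y1 = sin(3 * x) + 0.02 * randn(size(x));            % first nonlinear coordinate, y1 = f1(x) + noise
y2 = cos(2 * x) + x .^ 2 + 0.02 * randn(size(x));   % second nonlinear coordinate, y2 = f2(x) + noise
plot(y1, y2, '.');                                  % a one-dimensional 'string' embedded in two dimensions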


Difficulty for Probabilistic Approaches

Propagate a probability distribution through a non-linear mapping.

Normalisation of the distribution becomes intractable.

Figure: A three dimensional manifold formed by mapping from a two dimensional space, x, to a three dimensional space via y_j = f_j(x).


Figure: A Gaussian distribution p(x) propagated through the non-linear mapping y = f(x) + \epsilon to give p(y). Here y_i = f(x_i) + \epsilon_i with \epsilon \sim N(0, 0.2^2), and f(\cdot) uses an RBF basis with 100 centres between -4 and 4 and \ell = 0.1. The new distribution over y (right) is multimodal and difficult to normalise.
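A small MATLAB sketch of this effect (the random RBF weights are an assumption of ours; the centres, lengthscale and noise level follow the figure caption):

centres = linspace(-4, 4, 100);                               % 100 RBF centres between -4 and 4
ell = 0.1;                                                    % basis-function lengthscale
w = randn(100, 1);                                            % random weights defining f(.) (our assumption)
x = randn(20000, 1);                                          % samples from p(x) = N(0, 1)
Phi = exp(-bsxfun(@minus, x, centres) .^ 2 / (2 * ell ^ 2));  % RBF basis activations
y = Phi * w + 0.2 * randn(size(x));                           % y = f(x) + eps, eps ~ N(0, 0.2^2)
hist(y, 100);                                                 % the resulting p(y) is typically multimodal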


Mapping of Points

Mapping points to higher dimensions is easy.

Figure: One dimensional Gaussian mapped to two dimensions.



Figure: Two dimensional Gaussian mapped to three dimensions.


Linear Dimensionality Reduction

Linear Latent Variable Model

Represent data, Y, with a lower dimensional set of latent variables X.

Assume a linear relationship of the form

y_{i,:} = W x_{i,:} + \epsilon_{i,:},

where \epsilon_{i,:} \sim N(0, \sigma^2 I).
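A toy MATLAB sketch of sampling from this model (the sizes and the random W are illustrative assumptions):

n = 100; p = 5; q = 2; sigma = 0.1;        % toy sizes and noise level
W = randn(p, q);                           % linear mapping from latent to data space
X = randn(n, q);                           % latent variables, one row per data point
Y = X * W' + sigma * randn(n, p);          % rows of Y are y_i = W x_i + eps_i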


Linear Latent Variable Model

Probabilistic PCA

Define a linear-Gaussian relationship between latent variables and data.

Standard latent variable approach:

- Define a Gaussian prior over the latent space, X.
- Integrate out the latent variables.

[Graphical model: the latent variables X and the parameters W both feed into the data Y.]

p(Y | X, W) = \prod_{i=1}^{n} N(y_{i,:} | W x_{i,:}, \sigma^2 I)

p(X) = \prod_{i=1}^{n} N(x_{i,:} | 0, I)

p(Y | W) = \prod_{i=1}^{n} N(y_{i,:} | 0, W W^\top + \sigma^2 I)


Linear Latent Variable Model II

Probabilistic PCA Max. Likelihood Soln (Tipping 99)

p(Y | W) = \prod_{i=1}^{n} N(y_{i,:} | 0, C), \qquad C = W W^\top + \sigma^2 I

\log p(Y | W) = -\frac{n}{2} \log |C| - \frac{1}{2} \mathrm{tr}\left(C^{-1} Y^\top Y\right) + \text{const.}

If U_q are the first q principal eigenvectors of n^{-1} Y^\top Y and the corresponding eigenvalues are \Lambda_q,

W = U_q L R^\top, \qquad L = \left(\Lambda_q - \sigma^2 I\right)^{\frac{1}{2}}

where R is an arbitrary rotation matrix.
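A MATLAB sketch of this closed-form solution on toy data (our own illustration; sigma^2 is set to the average of the discarded eigenvalues, its maximum likelihood value as noted on the final slide, and R is taken to be the identity):

n = 200; p = 10; q = 3;
Y = randn(n, 3) * randn(3, p) + 0.1 * randn(n, p);     % toy data: rank-3 signal plus noise
Y = Y - repmat(mean(Y, 1), n, 1);                      % centre
[U, D] = eig(Y' * Y / n);                              % eigen-decomposition of n^{-1} Y' Y
[lambda, idx] = sort(diag(D), 'descend');
Uq = U(:, idx(1:q));                                   % first q principal eigenvectors
sigma2 = mean(lambda(q + 1 : p));                      % ML sigma^2: average of discarded eigenvalues
W = Uq * diag(sqrt(lambda(1:q) - sigma2));             % W = Uq * L, with R = I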


Linear Latent Variable Model III

Dual Probabilistic PCA

Define a linear-Gaussian relationship between latent variables and data.

Novel latent variable approach:

- Define a Gaussian prior over the parameters, W.
- Integrate out the parameters.

p(Y | X, W) = \prod_{i=1}^{n} N(y_{i,:} | W x_{i,:}, \sigma^2 I)

p(W) = \prod_{i=1}^{p} N(w_{i,:} | 0, I)

p(Y | X) = \prod_{j=1}^{p} N(y_{:,j} | 0, X X^\top + \sigma^2 I)


Linear Latent Variable Model IV

Dual Probabilistic PCA Max. Likelihood Soln (Lawrence 03, Lawrence 05)

p(Y | X) = \prod_{j=1}^{p} N(y_{:,j} | 0, K), \qquad K = X X^\top + \sigma^2 I

\log p(Y | X) = -\frac{p}{2} \log |K| - \frac{1}{2} \mathrm{tr}\left(K^{-1} Y Y^\top\right) + \text{const.}

If U'_q are the first q principal eigenvectors of p^{-1} Y Y^\top and the corresponding eigenvalues are \Lambda_q,

X = U'_q L R^\top, \qquad L = \left(\Lambda_q - \sigma^2 I\right)^{\frac{1}{2}}

where R is an arbitrary rotation matrix.


Equivalence of Formulations

The eigenvalue problems are equivalent.

Solution for Probabilistic PCA (solves for the mapping):

Y^\top Y U_q = U_q \Lambda_q, \qquad W = U_q L R^\top

Solution for Dual Probabilistic PCA (solves for the latent positions):

Y Y^\top U'_q = U'_q \Lambda_q, \qquad X = U'_q L R^\top

Equivalence is from

U_q = Y^\top U'_q \Lambda_q^{-\frac{1}{2}}

You have probably used this trick to compute PCA efficiently when the number of dimensions is much higher than the number of points.
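A quick MATLAB check of this equivalence on toy data (a sketch of ours; individual eigenvectors are only defined up to sign):

n = 5; p = 100; q = 3;                                     % many more dimensions than points
Y = randn(n, p);
[U1, D1] = eig(Y' * Y);   [d1, i1] = sort(diag(D1), 'descend');
[U2, D2] = eig(Y * Y');   [d2, i2] = sort(diag(D2), 'descend');
Uq       = U1(:, i1(1:q));                                 % eigenvectors of the large p-by-p problem
Uq_trick = Y' * U2(:, i2(1:q)) * diag(d2(1:q) .^ (-0.5));  % U_q = Y' * U'_q * Lambda_q^{-1/2}
max(abs(abs(Uq) - abs(Uq_trick)))                          % near zero: the columns agree up to sign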


Non-Linear Latent Variable Model

Dual Probabilistic PCA

Inspection of the marginal likelihood shows ...

- The covariance matrix is a covariance function.
- We recognise it as the ‘linear kernel’.
- We call this the Gaussian Process Latent Variable Model (GP-LVM).

p(Y | X) = \prod_{j=1}^{p} N(y_{:,j} | 0, K), \qquad K = X X^\top + \sigma^2 I

This is a product of Gaussian processes with linear kernels.


Non-Linear Latent Variable Model

Replace the linear kernel with a non-linear kernel for a non-linear model:

p(Y | X) = \prod_{j=1}^{p} N(y_{:,j} | 0, K), \qquad K = ?


Non-linear Latent Variable Models

Exponentiated Quadratic (EQ) Covariance

The EQ covariance has the form k_{i,j} = k(x_{i,:}, x_{j,:}), where

k(x_{i,:}, x_{j,:}) = \alpha \exp\left(-\frac{\|x_{i,:} - x_{j,:}\|_2^2}{2 \ell^2}\right).

No longer possible to optimise wrt X via an eigenvalue problem.

Instead find gradients with respect to X, \alpha, \ell and \sigma^2 and optimise using conjugate gradients.
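A minimal MATLAB sketch of the objective being optimised (our own illustrative code, not the toolbox behind the demos; the function name gplvmNll and its signature are assumptions). It evaluates the negative log marginal likelihood for given latent positions and kernel parameters; in practice this, together with its gradients, is handed to a conjugate gradient optimiser.

function nll = gplvmNll(X, Y, alpha, ell, sigma2)
% Negative log marginal likelihood of the GP-LVM with an EQ kernel (up to a constant).
% X: n-by-q latent positions, Y: n-by-p data, alpha/ell/sigma2: kernel and noise parameters.
D2 = bsxfun(@plus, sum(X .^ 2, 2), sum(X .^ 2, 2)') - 2 * (X * X');   % squared distances
K  = alpha * exp(-D2 / (2 * ell ^ 2)) + sigma2 * eye(size(X, 1));     % EQ kernel plus noise
L  = chol(K, 'lower');                                                % Cholesky for stability
invKY = L' \ (L \ Y);                                                 % K^{-1} Y
nll = size(Y, 2) * sum(log(diag(L))) + 0.5 * sum(sum(Y .* invKY));    % p/2 log|K| + 1/2 tr(K^{-1} Y Y')
end

Optimising nll jointly over X, \alpha, \ell and \sigma^2 then recovers a non-linear embedding.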


Stick Man

Generalization with less Data than Dimensions

Powerful uncertainty handling of GPs leads to surprising properties.

Non-linear models can be used where there are fewer data points than dimensions without overfitting.

Example: modelling a stick man in 102 dimensions with 55 data points!


Stick Man II

demStick1


Figure: The latent space for the stick man motion capture data.


Let’s look at some applications and extensions of the GP-LVM


Maximum Likelihood Solution

Probabilistic PCA Max. Likelihood Soln (Tipping 99)

p(Y | W, \mu) = \prod_{i=1}^{n} N(y_{i,:} | \mu, W W^\top + \sigma^2 I)

Centring the data (equivalently, setting \mu to its maximum likelihood value, the sample mean) gives

p(Y | W) = \prod_{i=1}^{n} N(y_{i,:} | 0, C), \qquad C = W W^\top + \sigma^2 I

\log p(Y | W) = -\frac{n}{2} \log |C| - \frac{1}{2} \mathrm{tr}\left(C^{-1} Y^\top Y\right) + \text{const.}

Gradient of log likelihood:

\frac{d}{dW} \log p(Y | W) = -n C^{-1} W + C^{-1} Y^\top Y C^{-1} W
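A quick numerical sanity check of this gradient on a tiny example (our own sketch: forward finite differences against the analytic expression):

n = 50; p = 4; q = 2; sigma2 = 0.1;
Y = randn(n, p);  Y = Y - repmat(mean(Y, 1), n, 1);           % toy centred data
W = randn(p, q);
logL = @(W) -n / 2 * log(det(W * W' + sigma2 * eye(p))) ...
       - 0.5 * trace((W * W' + sigma2 * eye(p)) \ (Y' * Y));  % log p(Y|W) up to a constant
C = W * W' + sigma2 * eye(p);
G = -n * (C \ W) + (C \ (Y' * Y)) * (C \ W);                  % analytic gradient from the slide
e = 1e-6;  Wp = W;  Wp(1, 1) = Wp(1, 1) + e;
[(logL(Wp) - logL(W)) / e, G(1, 1)]                           % the two values should agree closely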


Optimization

Seek fixed points:

0 = -n C^{-1} W + C^{-1} Y^\top Y C^{-1} W

Pre-multiply by C:

0 = -n W + Y^\top Y C^{-1} W

\frac{1}{n} Y^\top Y C^{-1} W = W

Substitute W with its singular value decomposition

W = U L R^\top,

which implies

C = W W^\top + \sigma^2 I = U L^2 U^\top + \sigma^2 I.

Using the matrix inversion lemma,

C^{-1} W = U L \left(\sigma^2 + L^2\right)^{-1} R^\top


Solution given by

\frac{1}{n} Y^\top Y U = U \left(\sigma^2 + L^2\right),

which is recognised as an eigenvalue problem.

This implies that the columns of U are the eigenvectors of \frac{1}{n} Y^\top Y and that \sigma^2 + L^2 are the eigenvalues of \frac{1}{n} Y^\top Y.

l_i = \sqrt{\lambda_i - \sigma^2}, where \lambda_i is the ith eigenvalue of \frac{1}{n} Y^\top Y.

Further manipulation shows that if we constrain W \in \mathbb{R}^{p \times q} then the solution is given by the largest q eigenvalues.


Probabilistic PCA Solution

If U_q are the first q principal eigenvectors of n^{-1} Y^\top Y and the corresponding eigenvalues are \Lambda_q,

W = U_q L R^\top, \qquad L = \left(\Lambda_q - \sigma^2 I\right)^{\frac{1}{2}},

where R is an arbitrary rotation matrix.

Some further work shows that it is the principal eigenvectors that need to be retained.

The maximum likelihood value for \sigma^2 is given by the average of the discarded eigenvalues.
