Gaussian Process Latent Variable Models & applications in single-cell genomics

Kieran Campbell
University of Oxford
November 19, 2015


Introduction to Gaussian Processes

Gaussian Process Latent Variable Models

Applications in single-cell genomics

References


Introduction

In (Bayesian) supervised learning, some (non-)linear function $f(x; w)$, parametrized by $w$, is assumed to generate data $\{x_n, y_n\}$.

- $f$ may take any parametric form, e.g. linear: $f(x) = w_0 + w_1 x$
- Posterior inference can be performed on

$$p(w \mid y, X) = \frac{p(y \mid w, X)\, p(w)}{p(y \mid X)} \tag{1}$$

- Predictions of a new point $\{y_*, x_*\}$ can be made by marginalising over $w$:

$$p(y_* \mid y, X, x_*) = \int dw\, p(y_* \mid w, X, x_*)\, p(w \mid y, X) \tag{2}$$
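As an illustration of (1) and (2), here is a minimal sketch for the linear case $f(x) = w_0 + w_1 x$. It assumes a Gaussian prior $w \sim \mathcal{N}(0, \alpha^{-1} I)$ and Gaussian noise with precision $\beta$; neither choice comes from the slides, but together they make both integrals available in closed form. The function names and the simulated data are hypothetical.

```python
import numpy as np

def blr_posterior(x, y, alpha=1.0, beta=25.0):
    """Posterior N(m, S) over w = (w0, w1), i.e. eq. (1) in closed form.

    Assumes prior w ~ N(0, I / alpha) and noise precision beta
    (illustrative choices, not taken from the slides).
    """
    Phi = np.column_stack([np.ones_like(x), x])      # design matrix [1, x]
    S_inv = alpha * np.eye(2) + beta * Phi.T @ Phi   # posterior precision
    S = np.linalg.inv(S_inv)
    m = beta * S @ Phi.T @ y                         # posterior mean
    return m, S

def blr_predict(x_star, m, S, beta=25.0):
    """Predictive mean and variance at x_star, i.e. eq. (2) with w integrated out."""
    phi = np.array([1.0, x_star])
    return phi @ m, 1.0 / beta + phi @ S @ phi

# Simulated data from f(x) = 0.5 + 2x with noise sd 0.2 (precision 25)
rng = np.random.default_rng(0)
x = np.linspace(-1.0, 1.0, 50)
y = 0.5 + 2.0 * x + rng.normal(scale=0.2, size=x.size)

m, S = blr_posterior(x, y)
pred_mean, pred_var = blr_predict(0.3, m, S)
```

The predictive variance is the noise variance $1/\beta$ plus a term from the remaining uncertainty in $w$, which is what marginalising over $w$ buys over plugging in a point estimate.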


Gaussian Process Regression

Gaussian Processes place a non-parametric prior over the functions $f(x)$.

- $f$ is always indexed by an 'input variable' $x$
- Any finite set of function values $\{f_i\}_{i=1}^N$ is jointly drawn from a multivariate Gaussian distribution with zero mean and covariance matrix $K$:

$$p(f_1, \ldots, f_N) = \mathcal{N}(0, K) \tag{3}$$

- In other words, the process is entirely defined by its second-order statistics $K$


Choice of Kernel

The behaviour of the GP is defined by the choice of kernel and its parameters.

- The kernel function $K(x, x')$ becomes the covariance matrix once a set of points is 'realised'
- A typical choice is the squared exponential

$$K(x, x') = \exp(-\lambda \|x - x'\|^2) \tag{4}$$

- Intuition: if $x$ and $x'$ are similar, the covariance will be larger, so $f$ and $f'$ will, on average, be closer together
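The kernel in (4) and the prior in (3) can be tried out numerically. The sketch below assumes one-dimensional inputs and an arbitrary $\lambda = 10$, and adds a small diagonal jitter purely for numerical stability. It builds the covariance matrix on a grid, compares the covariance of a nearby and a distant pair of inputs, and draws three functions from the prior.

```python
import numpy as np

def sq_exp_kernel(a, b, lam=10.0):
    """Eq. (4) for 1-D inputs: K(x, x') = exp(-lam * (x - x')^2)."""
    return np.exp(-lam * (a[:, None] - b[None, :]) ** 2)

x = np.linspace(0.0, 1.0, 100)
K = sq_exp_kernel(x, x)          # kernel 'realised' on a set of points

near = K[0, 1]                   # covariance between neighbouring inputs
far = K[0, -1]                   # covariance between x = 0 and x = 1

# Draw from p(f_1, ..., f_N) = N(0, K) via a Cholesky factor.
# The 1e-6 jitter is an assumption added for numerical stability only.
rng = np.random.default_rng(1)
L = np.linalg.cholesky(K + 1e-6 * np.eye(x.size))
prior_draws = L @ rng.standard_normal((x.size, 3))   # one function per column
```

Neighbouring inputs have covariance close to 1 while the two ends of the interval are almost uncorrelated, which is why each draw is smooth locally but free to vary across the range.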


GPs with noisy observations

So far we have assumed observations of $f$ are noise free, so the GP becomes a function interpolator.

- Instead, observations $y(x)$ are corrupted by noise, so $y \sim \mathcal{N}(f(x), \sigma^2)$
- Because everything is Gaussian, we can marginalise over the (latent) functions $f$ and find

$$p(y_1, \ldots, y_N) = \mathcal{N}(0, K + \sigma^2 I) \tag{5}$$
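Equation (5) is also the marginal likelihood that is typically maximised to choose kernel and noise parameters. A minimal sketch of evaluating its logarithm follows; the function name, the toy data and the candidate noise variances are all hypothetical.

```python
import numpy as np

def log_marginal_likelihood(y, K, sigma2):
    """log N(y | 0, K + sigma2 * I), evaluated with a Cholesky factor."""
    n = y.size
    L = np.linalg.cholesky(K + sigma2 * np.eye(n))
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, y))   # (K + sigma2 I)^{-1} y
    half_log_det = np.log(np.diag(L)).sum()               # 0.5 * log|K + sigma2 I|
    return -0.5 * y @ alpha - half_log_det - 0.5 * n * np.log(2.0 * np.pi)

# Toy data: noisy sine curve, squared-exponential kernel with lambda = 10
rng = np.random.default_rng(0)
x = np.linspace(0.0, 1.0, 30)
K = np.exp(-10.0 * (x[:, None] - x[None, :]) ** 2)
y = np.sin(2.0 * np.pi * x) + rng.normal(scale=0.1, size=x.size)

# Compare a few candidate noise variances
lml = {s2: log_marginal_likelihood(y, K, s2) for s2 in (1e-4, 1e-2, 1.0)}
```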


Predictions with noisy observations

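To make predictions with GPs we only need the covariance between the 'old' inputs $X$ and the 'new' input $x_*$:

- Let $k_* = K(X, x_*)$ and $k_{**} = K(x_*, x_*)$
- Then

$$p(f_* \mid x_*, X, y) = \mathcal{N}(f_* \mid k_*^T K^{-1} y,\; k_{**} - k_*^T K^{-1} k_*) \tag{6}$$

where, for noisy observations, $K$ stands for the covariance $K + \sigma^2 I$ of (5).

This highlights the major disadvantage of GPs: to make predictions we need to invert an $n \times n$ matrix, which costs $O(n^3)$.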
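The predictive distribution (6) can be sketched in a few lines. The version below assumes a squared-exponential kernel with $\lambda = 10$ and noise variance $\sigma^2 = 0.01$ (both arbitrary), uses the noisy covariance $K + \sigma^2 I$, and replaces the explicit inverse with a Cholesky factorisation, which is where the $O(n^3)$ cost arises. Function names and data are hypothetical.

```python
import numpy as np

def sq_exp(a, b, lam=10.0):
    """Squared-exponential kernel for 1-D inputs."""
    return np.exp(-lam * (a[:, None] - b[None, :]) ** 2)

def gp_predict(X, y, x_star, sigma2=0.01, lam=10.0):
    """Mean and covariance of p(f_* | x_*, X, y) as in eq. (6)."""
    K_y = sq_exp(X, X, lam) + sigma2 * np.eye(X.size)   # noisy training covariance
    k_star = sq_exp(X, x_star, lam)                     # shape (n, m)
    k_ss = sq_exp(x_star, x_star, lam)                  # shape (m, m)
    L = np.linalg.cholesky(K_y)                         # the O(n^3) step
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, y)) # K_y^{-1} y
    v = np.linalg.solve(L, k_star)                      # L^{-1} k_*
    mean = k_star.T @ alpha                             # k_*^T K_y^{-1} y
    cov = k_ss - v.T @ v                                # k_** - k_*^T K_y^{-1} k_*
    return mean, cov

rng = np.random.default_rng(0)
X = np.linspace(0.0, 1.0, 20)
y = np.sin(2.0 * np.pi * X) + rng.normal(scale=0.1, size=X.size)
x_star = np.array([0.25, 0.5])
mean, cov = gp_predict(X, y, x_star)
```

Solving against the Cholesky factor rather than forming $K^{-1}$ explicitly is a common choice for numerical stability; the cubic scaling in $n$ is unchanged.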


Effect of RBF kernel parameters

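Kernel

$$\kappa(x_p, x_q) = \sigma_f^2 \exp\left(-\frac{1}{2l^2}(x_p - x_q)^2\right) + \sigma_y^2 \delta_{qp}$$

Parameters
- $l$ controls the horizontal length scale
- $\sigma_f$ controls the vertical length scale
- $\sigma_y$ controls the noise level ($\sigma_y^2$ is the noise variance)

In the figure, $(l, \sigma_f, \sigma_y)$ have values

(a) (1, 1, 0.1)

(b) (0.3, 1.08, 0.00005)

(c) (3.0, 1.16, 0.89)

Figure: Rasmussen and Williams (2006)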
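To see how the three settings differ numerically, the sketch below builds the covariance matrix for each triple $(l, \sigma_f, \sigma_y)$ on an arbitrary grid. It assumes distinct one-dimensional inputs, so that the Kronecker delta can be implemented as an equality test on the inputs. With $\sigma_f = 1$ and $\sigma_y = 0$ this kernel matches (4) with $\lambda = 1/(2l^2)$.

```python
import numpy as np

def rbf_kernel(xp, xq, l, sigma_f, sigma_y):
    """sigma_f^2 * exp(-(xp - xq)^2 / (2 l^2)) + sigma_y^2 * delta_pq."""
    sq_dist = (xp[:, None] - xq[None, :]) ** 2
    signal = sigma_f**2 * np.exp(-sq_dist / (2.0 * l**2))
    # Delta term: noise is added only where the two inputs coincide
    # (valid as a stand-in for delta_pq when the inputs are distinct).
    noise = sigma_y**2 * (xp[:, None] == xq[None, :])
    return signal + noise

x = np.linspace(-5.0, 5.0, 50)   # hypothetical input grid
settings = {
    "a": (1.0, 1.0, 0.1),
    "b": (0.3, 1.08, 0.00005),
    "c": (3.0, 1.16, 0.89),
}
covs = {name: rbf_kernel(x, x, *params) for name, params in settings.items()}
```

The diagonal of each matrix equals $\sigma_f^2 + \sigma_y^2$, and the off-diagonal entries decay faster for smaller $l$, so setting (b) treats inputs a couple of units apart as nearly independent while setting (c) keeps them strongly correlated.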


Dimensionality reduction & unsupervised learning

Dimensionality reduction

We want to reduce some observed data $Y \in \mathbb{R}^{N \times D}$ to a set of latent variables $X \in \mathbb{R}^{N \times Q}$, where $Q \ll D$.

Methods

- Linear: PCA, ICA
- Non-linear: Laplacian eigenmaps, MDS, etc.
- Probabilistic: PPCA, GPLVM


Probabilistic PCA (Tipping and Bishop, 1999)

Recall that $Y$ is the observed data matrix and $X$ the latent matrix. Then assume

$$y_n = W x_n + \eta_n$$

where

- $W$ is the linear relationship between latent space and data space
- $\eta_n$ is Gaussian noise with mean 0 and precision $\beta$

Then marginalise out $X$ to find

$$p(y_n \mid W, \beta) = \mathcal{N}(y_n \mid 0, W W^T + \beta^{-1} I)$$

There is an analytic solution in which $W$ spans the principal subspace: probabilistic PCA.
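The analytic solution can be written down directly from the eigendecomposition of the sample covariance, following Tipping and Bishop (1999): the noise variance $\beta^{-1}$ is the mean of the discarded eigenvalues, and $W$ is the top-$q$ eigenvectors scaled by $\sqrt{\lambda_i - \beta^{-1}}$ (taking the arbitrary rotation to be the identity). The sketch below is a minimal version on simulated data; the function name and the simulation settings are hypothetical.

```python
import numpy as np

def ppca_ml(Y, q):
    """Closed-form maximum-likelihood PPCA for an (N, D) data matrix Y."""
    Yc = Y - Y.mean(axis=0)
    S = Yc.T @ Yc / Y.shape[0]                  # sample covariance (D x D)
    evals, evecs = np.linalg.eigh(S)            # eigenvalues in ascending order
    evals, evecs = evals[::-1], evecs[:, ::-1]  # reorder to descending
    sigma2 = evals[q:].mean()                   # noise variance, i.e. 1 / beta
    W = evecs[:, :q] * np.sqrt(evals[:q] - sigma2)
    return W, sigma2

# Simulate from the PPCA model with noise sd 0.1 (variance 0.01)
rng = np.random.default_rng(0)
N, D, q = 500, 5, 2
X_true = rng.standard_normal((N, q))
W_true = rng.standard_normal((D, q))
Y = X_true @ W_true.T + 0.1 * rng.standard_normal((N, D))

W, sigma2 = ppca_ml(Y, q)
C = W @ W.T + sigma2 * np.eye(D)                # fitted model covariance
```

By construction the leading $q$ eigenvalues of the fitted covariance `C` equal those of the sample covariance, and the columns of `W` span the principal subspace.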


GPLVM (Lawrence 2005)

Alternative representation (dual probabilistic PCA)

Instead of marginalising the latent factors $X$, marginalise the mapping $W$. Let $p(W) = \prod_i \mathcal{N}(w_i \mid 0, I)$, then

$$p(y_{:,d} \mid X, \beta) = \mathcal{N}(y_{:,d} \mid 0, X X^T + \beta^{-1} I)$$

GPLVM

Lawrence's breakthrough was to realise that the covariance matrix

$$K = X X^T + \beta^{-1} I$$

can be replaced by any similarity (kernel) matrix $S$, as in the GP framework. The GP-LVM defines a mapping from the latent space to the observed space.
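The swap from a linear to a non-linear kernel amounts to changing one function in the likelihood. The sketch below evaluates the negative log-likelihood $-\sum_d \log \mathcal{N}(y_{:,d} \mid 0, K)$ for a given latent matrix $X$, once with the linear kernel of dual probabilistic PCA and once with a squared-exponential kernel. In a GPLVM this quantity would be minimised over $X$ by gradient-based optimisation, which is omitted here. The kernel parameter, $\beta$ and the random data are hypothetical.

```python
import numpy as np

def gplvm_neg_log_lik(X, Y, kernel, beta=100.0):
    """Negative log-likelihood of Y (N, D) given latent X (N, Q)."""
    N, D = Y.shape
    K = kernel(X) + np.eye(N) / beta            # kernel matrix plus noise
    L = np.linalg.cholesky(K)
    log_det = 2.0 * np.log(np.diag(L)).sum()    # log|K|
    Kinv_Y = np.linalg.solve(L.T, np.linalg.solve(L, Y))
    return (0.5 * D * N * np.log(2.0 * np.pi)
            + 0.5 * D * log_det
            + 0.5 * np.sum(Y * Kinv_Y))         # 0.5 * tr(K^{-1} Y Y^T)

def linear_kernel(X):
    """X X^T: recovers dual probabilistic PCA."""
    return X @ X.T

def rbf_kernel(X, lam=1.0):
    """exp(-lam * ||x - x'||^2): one possible non-linear choice."""
    sq = np.sum(X**2, axis=1)
    d2 = sq[:, None] + sq[None, :] - 2.0 * X @ X.T
    return np.exp(-lam * np.maximum(d2, 0.0))   # clip tiny negative round-off

rng = np.random.default_rng(0)
N, D, Q = 30, 5, 2
Y = rng.standard_normal((N, D))
X = rng.standard_normal((N, Q))                 # e.g. a random initialisation

nll_linear = gplvm_neg_log_lik(X, Y, linear_kernel)
nll_rbf = gplvm_neg_log_lik(X, Y, rbf_kernel)
```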


GPLVM example - oil flow data

Figure: PCA (left) and GPLVM (right) on multi-phase oil flow data (Lawrence 2006)

- GPLVM shows better separation between oil flow classes (shape) compared to PCA
- GPLVM gives uncertainty in the data space. Since this is shared across all features, it can be visualised in the latent space (pixel intensity)
- If we want true uncertainty in the latent space, we need a Bayesian approach to find p(latent|data)


Bayesian GPLVM

Ideally we want to know the uncertainty in the latent factors, p(latent|data). Approaches to inference:

- Metropolis-Hastings: requires lots of tweaking but 'guaranteed' for any model
- HMC with Stan: fast, requires less tweaking but has less support for arbitrary priors
- Variational inference¹

¹ Titsias, M., & Lawrence, N. (2010). Bayesian Gaussian Process Latent Variable Model. Artificial Intelligence, 9, 844-851.


Buettner 2012

Introduces a 'structure preserving' GPLVM for clustering of single-cell qPCR data from zygote to blastocyst development.

- Includes a 'prior' that preserves local structure by modifying the likelihood (previously studied²)
- Finds that the modified GPLVM gives better separation between different developmental stages

² van der Maaten, L. (2009). Preserving Local Structure in Gaussian Process Latent Variable Models.


Buettner 2015

Uses a (GP?)-LVM to construct a low-rank cell-to-cell covariance based on the expression of a specific gene pathway.

Model

$$y_g \sim \mathcal{N}(\mu_g,\; X X^T + \sigma_\nu^2 C C^T + \nu_g^2 I)$$

where
- $X$ is a hidden factor such as cell cycle
- $C$ is an observed covariate

Can then assess gene-gene correlation controlling for hidden factors.

Figure: Non-linear PCA of genes not annotated as cell-cycle. Left: before scLVM, right: after.


Bayesian Gaussian Process Latent Variable Models for pseudotime inference

Pseudotime: an artificial measure of a cell's progression through some process (differentiation, apoptosis, cell cycle)

Cell ordering problem: order high-dimensional transcriptomes through the process


Current approaches

Monocle

- ICA for dimensionality reduction, longest path through minimum spanning tree to assign pseudotime
- Uses cubic smoothing splines & likelihood ratio test for differential expression analysis

Standard analysis is to examine differential expression across pseudotime.

Questions: What is the uncertainty in pseudotime? How does this impact the false discovery rate of differential expression analysis?


Bayesian GPLVM for pseudotime inference

1. Reduce dimensionality of gene expression data (LE, t-SNE, PCA or all at once!)
2. Fit Bayesian GPLVM in reduced space (this is essentially a probabilistic curve)
3. Quantify posterior samples, uncertainty etc.


Model

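$$
\begin{aligned}
\gamma &\sim \mathrm{Gamma}(\gamma_\alpha, \gamma_\beta) \\
\lambda_j &\sim \mathrm{Exp}(\gamma) \\
\sigma_j &\sim \mathrm{InvGamma}(\alpha, \beta) \\
t_i &\sim \pi_t, \quad i = 1, \ldots, N \\
\Sigma &= \mathrm{diag}(\sigma_1^2, \ldots, \sigma_P^2) \\
K^{(j)}(t, t') &= \exp(-\lambda_j (t - t')^2) \\
\mu_j &\sim \mathrm{GP}(0, K^{(j)}), \quad j = 1, \ldots, P \\
x_i &\sim \mathcal{N}(\mu(t_i), \Sigma), \quad i = 1, \ldots, N
\end{aligned}
\tag{7}
$$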
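One way to read model (7) is as a recipe for simulating data. The sketch below draws once from each level of the hierarchy. Several choices are assumptions rather than part of the slides: the hyperparameter values are invented, Gamma and Exp are taken to be rate-parametrised, InvGamma is taken to have shape $\alpha$ and scale $\beta$, a uniform distribution stands in for $\pi_t$, and a small jitter is added before the Cholesky factorisation.

```python
import numpy as np

rng = np.random.default_rng(42)
N, P = 50, 2                          # cells, dimensions of the reduced space
gamma_alpha, gamma_beta = 2.0, 1.0    # hypothetical hyperparameters
alpha, beta = 3.0, 1.0                # hypothetical hyperparameters

# gamma ~ Gamma(shape, rate); NumPy uses scale = 1 / rate
gamma = rng.gamma(shape=gamma_alpha, scale=1.0 / gamma_beta)
# lambda_j ~ Exp(rate = gamma)
lam = rng.exponential(scale=1.0 / gamma, size=P)
# sigma_j ~ InvGamma(alpha, beta): reciprocal of a Gamma(alpha, rate = beta) draw
sigma = 1.0 / rng.gamma(shape=alpha, scale=1.0 / beta, size=P)
# t_i ~ pi_t, with a uniform on [0, 1] as a placeholder prior
t = rng.uniform(0.0, 1.0, size=N)

mu = np.empty((N, P))
for j in range(P):
    # K^(j)(t, t') = exp(-lambda_j (t - t')^2), realised at the pseudotimes
    K_j = np.exp(-lam[j] * (t[:, None] - t[None, :]) ** 2)
    L = np.linalg.cholesky(K_j + 1e-6 * np.eye(N))   # jitter for stability
    mu[:, j] = L @ rng.standard_normal(N)            # mu_j ~ GP(0, K^(j))

# x_i ~ N(mu(t_i), Sigma) with Sigma = diag(sigma_1^2, ..., sigma_P^2)
X = mu + sigma * rng.standard_normal((N, P))
```

Plotting the columns of `X` against each other, coloured by `t`, gives a noisy curve through the reduced space, which is the "probabilistic curve" reading of the model.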


Prior issues

How do we define the prior on $t$, $\pi_t$?

- Typically want $t = (t_1, \ldots, t_n)$ to sit uniformly on $[0, 1]$
- $t$ only appears in the likelihood via $\lambda_j (t - t')^2$
- This means we can arbitrarily rescale $\lambda \to \lambda / \epsilon$ and $t \to \sqrt{\epsilon}\, t$ and get the same likelihood
- $t$ is equivalent on any subset of $[0, 1]$: an ill-defined problem
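The rescaling argument is easy to check numerically: under $\lambda \to \lambda/\epsilon$ and $t \to \sqrt{\epsilon}\, t$ the product $\lambda (t - t')^2$ is unchanged, so the kernel matrix, and therefore the likelihood, is identical. A small check with arbitrary values of $\lambda$ and $\epsilon$:

```python
import numpy as np

def kernel(t, lam):
    """K(t, t') = exp(-lam * (t - t')^2) on a vector of pseudotimes."""
    return np.exp(-lam * (t[:, None] - t[None, :]) ** 2)

rng = np.random.default_rng(0)
t = rng.uniform(0.0, 1.0, size=5)
lam, eps = 3.0, 0.25                  # arbitrary illustrative values

K_original = kernel(t, lam)
K_rescaled = kernel(np.sqrt(eps) * t, lam / eps)   # t squeezed into [0, 0.5]

same_likelihood = np.allclose(K_original, K_rescaled)
```

Here the pseudotimes are compressed into half of the interval and the kernel matrix does not change, which is the sense in which the data alone cannot pin down the scale of $t$.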


Solutions

Corp prior

- Want $t$ to 'fill out' over $[0, 1]$
- Introduce repulsive prior

$$\pi_t(t) \propto \prod_{i=1}^{N} \prod_{j=i+1}^{N} \sin(\pi |t_i - t_j|) \tag{8}$$

- Non-conjugate & difficult to evaluate gradient, so Metropolis-Hastings is needed

Constrained random walk inference

- If we constrain $t$ to be on $[0, 1]$ and use random walk sampling (MH, HMC), pseudotimes naturally 'wander' towards the boundary
- Once there, the covariance structure settles them into a steady state
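The repulsive effect of the Corp prior (8) can be seen by evaluating its unnormalised log density for a spread-out and a clumped configuration of pseudotimes. The sketch below is only the density evaluation that a Metropolis-Hastings acceptance ratio would need; the function name and the two test configurations are hypothetical.

```python
import numpy as np

def log_corp_prior(t):
    """Unnormalised log of eq. (8): sum over pairs i < j of log sin(pi |t_i - t_j|)."""
    i, j = np.triu_indices(t.size, k=1)          # all pairs with i < j
    return np.sum(np.log(np.sin(np.pi * np.abs(t[i] - t[j]))))

spread = np.linspace(0.05, 0.95, 10)    # pseudotimes filling out [0, 1]
clumped = np.linspace(0.45, 0.55, 10)   # pseudotimes bunched in the middle

log_p_spread = log_corp_prior(spread)
log_p_clumped = log_corp_prior(clumped)
```

The spread-out configuration receives the higher log density. As two pseudotimes approach each other the sine term tends to zero and the log density tends to minus infinity, which is what pushes the points apart.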


Applications to biological datasets

Applied Bayesian GPLVM to three datasets:

1. Monocle: differentiating human myoblasts (time series), 155 cells once contamination removed
2. Ear: differentiating cells from mouse cochlear & utricular sensory epithelia. Pseudotime shows supporting cells (SC) differentiating into hair cells (HC)
3. Waterfall: adult neurogenesis (PCA captures pseudotime variation)


Sampling posterior curves

Figure panels:

A: Monocle dataset, Laplacian eigenmaps representation

B: Ear dataset, Laplacian eigenmaps representation

C: Waterfall dataset, PCA representation


What does the posterior uncertainty look like? (I)

The 95% HPD credible interval typically spans ∼ 1/4 of pseudotime


What does the posterior uncertainty look like? (II)


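Effect of hyperparameters (Monocle dataset)

Recall

$$K(t, t') \propto \exp(-\lambda_j (t - t')^2), \qquad \lambda_j \sim \mathrm{Exp}(\gamma), \qquad \gamma \sim \mathrm{Gamma}(\gamma_\alpha, \gamma_\beta)$$

$|\lambda|$ roughly corresponds to arc-length. So what are the effects of changing $\gamma_\alpha$, $\gamma_\beta$?

$$\mathrm{E}[\gamma] = \frac{\gamma_\alpha}{\gamma_\beta}, \qquad \mathrm{Var}[\gamma] = \frac{\gamma_\alpha}{\gamma_\beta^2}$$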
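The moment formulas correspond to the shape-rate parametrisation of the Gamma distribution, which a quick Monte Carlo check makes concrete. The hyperparameter values below are invented, and $\lambda_j \sim \mathrm{Exp}(\gamma)$ is assumed to mean rate $\gamma$.

```python
import numpy as np

rng = np.random.default_rng(0)
gamma_alpha, gamma_beta = 2.0, 4.0    # hypothetical hyperparameter values

# Shape-rate Gamma: NumPy takes scale = 1 / rate
gamma = rng.gamma(shape=gamma_alpha, scale=1.0 / gamma_beta, size=200_000)
mc_mean, mc_var = gamma.mean(), gamma.var()

exact_mean = gamma_alpha / gamma_beta        # E[gamma]
exact_var = gamma_alpha / gamma_beta**2      # Var[gamma]

# Implied draws of lambda_j ~ Exp(rate = gamma): small gamma gives large lambda
lam = rng.exponential(scale=1.0 / gamma)
lam_median = np.median(lam)
```

Under these assumptions, increasing $\gamma_\beta$ at fixed $\gamma_\alpha$ shrinks $\gamma$ and therefore favours larger $\lambda_j$.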


Approximate false discovery rate

How to approximate false discovery rate?

- Refit differential expression for each gene across posterior samples of pseudotime
- Compute p- and q-values for each sample for each gene
- Statistic is the proportion significant at 5% FDR
- Differential gene expression is a false positive if proportion significant < 0.95 and q-value < 0.05
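The procedure above can be sketched as follows, given p-values that have already been computed. Two assumptions are made that the slides do not state: q-values are obtained with the Benjamini-Hochberg adjustment, and the "q-value < 0.05" condition refers to a single point-estimate pseudotime fit. Function names and the toy p-values are hypothetical.

```python
import numpy as np

def bh_qvalues(p):
    """Benjamini-Hochberg adjusted p-values for a 1-D array."""
    n = p.size
    order = np.argsort(p)
    ranked = p[order] * n / np.arange(1, n + 1)
    ranked = np.minimum.accumulate(ranked[::-1])[::-1]   # enforce monotonicity
    q = np.empty(n)
    q[order] = np.minimum(ranked, 1.0)
    return q

def approximate_fdr(p_point, p_samples, level=0.05, threshold=0.95):
    """p_point: (G,) p-values from the point-estimate pseudotime fit.
    p_samples: (S, G) p-values refit on S posterior pseudotime samples."""
    q_point = bh_qvalues(p_point)
    q_samples = np.vstack([bh_qvalues(row) for row in p_samples])
    prop_sig = (q_samples < level).mean(axis=0)   # proportion significant per gene
    called = q_point < level                      # genes called by the point estimate
    false_pos = called & (prop_sig < threshold)   # called but not robust to uncertainty
    return false_pos.sum() / max(called.sum(), 1), prop_sig

# Toy example: 3 genes, 4 posterior samples
p_point = np.array([0.001, 0.001, 0.9])
p_samples = np.array([
    [0.001, 0.001, 0.9],
    [0.001, 0.8, 0.9],
    [0.001, 0.001, 0.9],
    [0.001, 0.8, 0.9],
])
fdr, prop_sig = approximate_fdr(p_point, p_samples)
```

In this toy input the second gene is called significant by the point estimate but is significant in only half of the posterior samples, so it counts as a false positive and the approximate FDR is one of two calls.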


Approximate false discovery rates

Approximate false discovery rate can be very high (∼ 3× larger than it should be) but is also variable


Integrating multiple dimensionality reduction algorithms

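Can very easily integrate multiple sources of data from different dimensionality reduction algorithms:

$$p(t, \{X\}) \propto \pi_t(t)\, p(X_{\mathrm{LE}} \mid t)\, p(X_{\mathrm{PCA}} \mid t)\, p(X_{\mathrm{tSNE}} \mid t) \tag{9}$$

Natural extension to integrate multiple heterogeneous sources of data, e.g.

$$p(t, \{X\}) \propto \pi_t(t)\, p(\text{imaging} \mid t)\, p(\text{ATAC} \mid t)\, p(\text{transcriptomics} \mid t) \tag{10}$$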
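On the log scale the factorisation in (9) becomes a sum: the log prior plus one log-likelihood term per representation, all sharing the same pseudotimes $t$. The sketch below is a simplified version under stated assumptions: each column of each representation is an independent GP over $t$ with fixed, invented values of $\lambda$ and noise variance, the default prior is flat, and random matrices stand in for the LE, PCA and t-SNE embeddings.

```python
import numpy as np

def log_lik_representation(X, t, lam=5.0, sigma2=0.1):
    """log p(X | t) with each column of X (N, P) an independent GP over t.

    lam and sigma2 are fixed for illustration; in the full model they
    would have priors and be sampled.
    """
    N, P = X.shape
    K = np.exp(-lam * (t[:, None] - t[None, :]) ** 2) + sigma2 * np.eye(N)
    L = np.linalg.cholesky(K)
    A = np.linalg.solve(L, X)                    # L^{-1} X, so sum(A^2) = tr(X^T K^{-1} X)
    log_det = 2.0 * np.log(np.diag(L)).sum()
    return -0.5 * (np.sum(A**2) + P * log_det + P * N * np.log(2.0 * np.pi))

def log_joint(t, representations, log_prior=lambda t: 0.0):
    """Unnormalised log p(t, {X}): log prior plus one term per representation."""
    return log_prior(t) + sum(log_lik_representation(X, t) for X in representations)

rng = np.random.default_rng(0)
N = 40
t = rng.uniform(0.0, 1.0, size=N)
reps = {                                         # placeholder embeddings
    "LE": rng.standard_normal((N, 2)),
    "PCA": rng.standard_normal((N, 2)),
    "tSNE": rng.standard_normal((N, 2)),
}
joint = log_joint(t, reps.values())
```

Adding a further data source, as in (10), would mean adding another term to the sum with a likelihood suited to that data type.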


Example: Monocle with LE, PCA & t-SNE

Learning curves for each representation separately:

Joint learning of all representations:


FDR from multiple representation learning


Some good references (I)

Gaussian Processes

- Rasmussen, Carl Edward, and Christopher K. I. Williams. "Gaussian processes for machine learning." (2006).

GPLVM

- Lawrence, Neil D. "Gaussian process latent variable models for visualisation of high dimensional data." Advances in neural information processing systems 16.3 (2004): 329-336.
- Titsias, Michalis K., and Neil D. Lawrence. "Bayesian Gaussian process latent variable model." International Conference on Artificial Intelligence and Statistics. 2010.
- van der Maaten, Laurens. "Preserving local structure in Gaussian process latent variable models." Proceedings of the 18th Annual Belgian-Dutch Conference on Machine Learning. 2009.
- Wang, Ye, and David B. Dunson. "Probabilistic Curve Learning: Coulomb Repulsion and the Electrostatic Gaussian Process." arXiv preprint arXiv:1506.03768 (2015).


Some good references (II)

Latent variable models in single-cell genomics

- Buettner, Florian, and Fabian J. Theis. "A novel approach for resolving differences in single-cell gene expression patterns from zygote to blastocyst." Bioinformatics 28.18 (2012): i626-i632.
- Buettner, Florian, et al. "Computational analysis of cell-to-cell heterogeneity in single-cell RNA-sequencing data reveals hidden subpopulations of cells." Nature biotechnology 33.2 (2015): 155-160.

Pseudotime

- Trapnell, Cole, et al. "The dynamics and regulators of cell fate decisions are revealed by pseudotemporal ordering of single cells." Nature biotechnology 32.4 (2014): 381-386.
- Bendall, Sean C., et al. "Single-cell trajectory detection uncovers progression and regulatory coordination in human B cell development." Cell 157.3 (2014): 714-725.
- Marco, Eugenio, et al. "Bifurcation analysis of single-cell gene expression data reveals epigenetic landscape." Proceedings of the National Academy of Sciences 111.52 (2014): E5643-E5650.
- Shin, Jaehoon, et al. "Single-Cell RNA-Seq with Waterfall Reveals Molecular Cascades underlying Adult Neurogenesis." Cell stem cell 17.3 (2015): 360-372.
- Leng, Ning, et al. "Oscope identifies oscillatory genes in unsynchronized single-cell RNA-seq experiments." Nature methods 12.10 (2015): 947-950.
