Gaussian Process Latent Variable Models & applications in single-cell genomics
Kieran Campbell
University of Oxford
November 19, 2015
Introduction to Gaussian Processes
Gaussian Process Latent Variable Models
Applications in single-cell genomics
References
Introduction
In (Bayesian) supervised learning, some (non-)linear function $f(x; w)$ parametrized by $w$ is assumed to generate data $\{x_n, y_n\}$.
- $f$ may take any parametric form, e.g. linear $f(x) = w_0 + w_1 x$
- Posterior inference can be performed on
$$p(w \mid y, X) = \frac{p(y \mid w, X)\, p(w)}{p(y \mid X)} \tag{1}$$
- Predictions of a new point $\{y_*, x_*\}$ can be made by marginalising over $w$:
$$p(y_* \mid y, X, x_*) = \int dw\, p(y_* \mid w, X, x_*)\, p(w \mid y, X) \tag{2}$$
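For the linear case $f(x) = w_0 + w_1 x$, both the posterior (1) and the predictive (2) have closed forms when the prior and the noise are Gaussian. The sketch below assumes a prior $w \sim \mathcal{N}(0, \alpha^{-1} I)$ and a known noise variance. The names `blr_posterior`, `blr_predict`, `alpha` and `noise_var` are illustrative choices, not taken from the slides.

```python
import numpy as np

def blr_posterior(X, y, alpha=1.0, noise_var=0.1):
    """Gaussian posterior N(m, S) over w = (w0, w1), assuming prior w ~ N(0, I/alpha)."""
    Phi = np.column_stack([np.ones(len(X)), X])        # design matrix with rows [1, x_n]
    S_inv = alpha * np.eye(Phi.shape[1]) + Phi.T @ Phi / noise_var
    S = np.linalg.inv(S_inv)                           # posterior covariance
    m = S @ Phi.T @ y / noise_var                      # posterior mean
    return m, S

def blr_predict(x_star, m, S, noise_var=0.1):
    """Predictive mean and variance of y_* after marginalising over w."""
    phi = np.array([1.0, x_star])
    return phi @ m, noise_var + phi @ S @ phi

# Toy data generated from y = 0.5 + 2x + noise (an assumed example)
rng = np.random.default_rng(0)
X = np.linspace(-1, 1, 50)
y = 0.5 + 2.0 * X + rng.normal(scale=np.sqrt(0.1), size=X.size)
m, S = blr_posterior(X, y)
mean, var = blr_predict(0.3, m, S)
```

The predictive variance is the noise variance plus a term from the remaining uncertainty in $w$, which is what the integral in (2) contributes.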
Gaussian Process Regression
Gaussian Processes place a non-parametric prior over the functions $f(x)$
- $f$ is always indexed by an 'input variable' $x$
- Any finite set of function values $\{f_i\}_{i=1}^N$ is jointly drawn from a multivariate Gaussian distribution with zero mean and covariance matrix $K$:
$$p(f_1, \ldots, f_N) = \mathcal{N}(0, K) \tag{3}$$
- In other words, the process is entirely defined by its second-order statistics $K$
Choice of Kernel
Behaviour of the GP is defined by the choice of kernel and its parameters
- The kernel function $K(x, x')$ becomes the covariance matrix once a set of points is 'realised'
- A typical choice is the squared exponential (RBF) kernel
$$K(x, x') = \exp(-\lambda \|x - x'\|^2) \tag{4}$$
- Intuition: if $x$ and $x'$ are similar, the covariance will be larger and so $f$ and $f'$ will, on average, be closer together
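To make the prior concrete, the following sketch evaluates the kernel in (4) on a grid of one-dimensional inputs and draws a few functions from $\mathcal{N}(0, K)$ as in (3). The function name `sq_exp_kernel`, the grid and the value of `lam` are assumptions made for illustration. A small jitter is added to the diagonal for numerical stability.

```python
import numpy as np

def sq_exp_kernel(x1, x2, lam=1.0):
    """K(x, x') = exp(-lam * (x - x')^2) for one-dimensional inputs."""
    d2 = (x1[:, None] - x2[None, :]) ** 2
    return np.exp(-lam * d2)

x = np.linspace(0, 5, 100)                  # input grid
K = sq_exp_kernel(x, x, lam=2.0)            # covariance once the points are 'realised'
jitter = 1e-8 * np.eye(len(x))              # numerical stabiliser, not part of the model

rng = np.random.default_rng(1)
samples = rng.multivariate_normal(np.zeros(len(x)), K + jitter, size=3)
```

Each row of `samples` is one draw of $(f_1, \ldots, f_N)$. Nearby inputs have covariance close to 1, so neighbouring function values are similar and the draws look smooth.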
GPs with noisy observations
So far we have assumed observations of $f$ are noise free, in which case the GP becomes a function interpolator
- Instead, observations $y(x)$ are corrupted by noise, so $y \sim \mathcal{N}(f(x), \sigma^2)$
- Because everything is Gaussian, we can marginalise over the (latent) functions $f$ and find
$$p(y_1, \ldots, y_N) = \mathcal{N}(0, K + \sigma^2 I) \tag{5}$$
Predictions with noisy observations
To make predictions with GPs we only need the covariance between the 'old' inputs $X$ and the 'new' input $x_*$:
- Let $k_* = K(X, x_*)$ and $k_{**} = K(x_*, x_*)$
- Then
$$p(f_* \mid x_*, X, y) = \mathcal{N}\left(f_* \mid k_*^T (K + \sigma^2 I)^{-1} y,\; k_{**} - k_*^T (K + \sigma^2 I)^{-1} k_*\right) \tag{6}$$
This highlights the major disadvantage of GPs: to make predictions we need to invert an $n \times n$ matrix, which costs $O(n^3)$
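A minimal sketch of GP prediction with noisy observations follows. It uses a Cholesky factorisation in place of an explicit inverse, which is still $O(n^3)$ but numerically safer. The helper names, the toy data and the values of `lam` and `noise_var` are assumptions for illustration only.

```python
import numpy as np

def sq_exp_kernel(x1, x2, lam=1.0):
    return np.exp(-lam * (x1[:, None] - x2[None, :]) ** 2)

def gp_predict(X, y, x_star, lam=1.0, noise_var=0.01):
    """Posterior mean and covariance of f_* at x_star given noisy observations y."""
    K_y = sq_exp_kernel(X, X, lam) + noise_var * np.eye(len(X))   # K + sigma^2 I
    k_star = sq_exp_kernel(X, x_star, lam)                        # shape (n, m)
    k_ss = sq_exp_kernel(x_star, x_star, lam)                     # shape (m, m)
    L = np.linalg.cholesky(K_y)                                   # the O(n^3) step
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, y))           # (K + sigma^2 I)^-1 y
    v = np.linalg.solve(L, k_star)
    mean = k_star.T @ alpha
    cov = k_ss - v.T @ v
    return mean, cov

X = np.linspace(0, 5, 20)
y = np.sin(X)                                # toy observations
x_star = np.array([2.5, 10.0])               # one point inside the data, one far away
mean, cov = gp_predict(X, y, x_star)
```

Inside the range of the data the predictive variance is small. Far from the data the mean reverts to the prior mean of zero and the variance returns to the prior value $k_{**}$.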
Effect of RBF kernel parameters
Kernel
$$\kappa(x_p, x_q) = \sigma_f^2 \exp\left(-\frac{1}{2l^2}(x_p - x_q)^2\right) + \sigma_y^2 \delta_{qp}$$
Parameters
- $l$ controls the horizontal length scale
- $\sigma_f$ controls the vertical length scale
- $\sigma_y^2$ is the noise variance
In the figure, $(l, \sigma_f, \sigma_y)$ have values
(a) (1, 1, 0.1)
(b) (0.3, 1.08, 0.00005)
(c) (3.0, 1.16, 0.89)
Figure: Rasmussen and Williams (2006)
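As a rough numerical companion to the figure, the sketch below builds the covariance for two inputs one unit apart under each of the three parameter settings, then reports the implied correlation between the two observations. The function name `rbf_cov` and the choice of a unit separation are assumptions. The Kronecker delta is implemented as a diagonal term, which assumes the inputs are distinct points.

```python
import numpy as np

def rbf_cov(x, l, sigma_f, sigma_y):
    """Covariance of noisy observations at distinct inputs x under the kernel above."""
    d2 = (x[:, None] - x[None, :]) ** 2
    return sigma_f**2 * np.exp(-d2 / (2 * l**2)) + sigma_y**2 * np.eye(len(x))

x = np.array([0.0, 1.0])                      # two inputs one unit apart
settings = {"a": (1, 1, 0.1), "b": (0.3, 1.08, 0.00005), "c": (3.0, 1.16, 0.89)}

corr = {}
for name, (l, sigma_f, sigma_y) in settings.items():
    K = rbf_cov(x, l, sigma_f, sigma_y)
    corr[name] = K[0, 1] / K[0, 0]            # correlation between the two observations
```

With the short length scale in (b) the two observations are almost uncorrelated. In (c) the long length scale keeps the latent function values similar, but the large noise term pulls the observed correlation back down.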
Dimensionality reduction & unsupervised learning
Dimensionality reduction
Want to reduce some observed data $Y \in \mathbb{R}^{N \times D}$ to a set of latent variables $X \in \mathbb{R}^{N \times Q}$ where $Q \ll D$.
Methods
- Linear: PCA, ICA
- Non-linear: Laplacian eigenmaps, MDS, etc.
- Probabilistic: PPCA, GPLVM
Probabilistic PCA (Tipping and Bishop, 1999)
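Recall $Y$ is the observed data matrix and $X$ the latent matrix. Then assume
$$y_n = W x_n + \eta_n$$
where
- $W$ is the linear relationship between latent space and data space
- $\eta_n$ is Gaussian noise with mean 0 and precision $\beta$
Then, with a standard Gaussian prior $x_n \sim \mathcal{N}(0, I)$, marginalise out $X$ to find
$$p(y_n \mid W, \beta) = \mathcal{N}(y_n \mid 0, W W^T + \beta^{-1} I)$$
There is an analytic solution in which $W$ spans the principal subspace: probabilistic PCA.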
GPLVM (Lawrence 2005)
Alternative representation (dual probabilistic PCA)
Instead of marginalising the latent factors $X$, marginalise the mapping $W$. Let $p(W) = \prod_i \mathcal{N}(w_i \mid 0, I)$, then
$$p(y_{:,d} \mid X, \beta) = \mathcal{N}(y_{:,d} \mid 0, X X^T + \beta^{-1} I)$$
GPLVM
Lawrence's breakthrough was to realise that the covariance matrix
$$K = X X^T + \beta^{-1} I$$
can be replaced by any similarity (kernel) matrix $S$, as in the GP framework.
The GP-LVM defines a mapping from the latent space to the observed space.
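The objective implied by this slide can be sketched as a log-likelihood that is a product of $D$ independent Gaussians over the columns of $Y$, all sharing one $N \times N$ covariance built from the latent positions $X$. In the sketch below the linear kernel gives dual probabilistic PCA and the RBF kernel gives a non-linear GPLVM. The function names and the default `beta` and `lam` are illustrative assumptions, and the optimisation over $X$ is not shown.

```python
import numpy as np

def gplvm_log_lik(X, Y, kernel, beta=100.0):
    """log p(Y | X) = sum_d log N(y_{:,d} | 0, kernel(X) + I / beta)."""
    N, D = Y.shape
    K = kernel(X) + np.eye(N) / beta
    L = np.linalg.cholesky(K)
    log_det = 2.0 * np.sum(np.log(np.diag(L)))      # log |K|
    A = np.linalg.solve(L, Y)                       # L^-1 Y
    quad = np.sum(A ** 2)                           # tr(Y^T K^-1 Y)
    return -0.5 * (D * N * np.log(2 * np.pi) + D * log_det + quad)

def linear_kernel(X):
    return X @ X.T                                  # dual probabilistic PCA

def rbf_kernel(X, lam=1.0):
    sq = np.sum(X ** 2, axis=1)
    d2 = sq[:, None] + sq[None, :] - 2 * X @ X.T    # pairwise squared distances
    return np.exp(-lam * np.maximum(d2, 0.0))

rng = np.random.default_rng(2)
X = rng.normal(size=(30, 2))                        # latent positions, Q = 2
Y = rng.normal(size=(30, 5))                        # toy data, D = 5
ll_linear = gplvm_log_lik(X, Y, linear_kernel)
ll_rbf = gplvm_log_lik(X, Y, rbf_kernel)
```

In practice the latent positions and kernel parameters would be found by maximising this quantity, for example with a gradient-based optimiser.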
GPLVM example - oil flow data
Figure: PCA (left) and GPLVM (right) on multi-phase oil flow data (Lawrence 2006)
- GPLVM shows better separation between oil flow classes (shape) compared to PCA
- GPLVM gives uncertainty in the data space. Since this is shared across all features, it can be visualised in the latent space (pixel intensity)
- If we want true uncertainty in the latent space we need a Bayesian approach to find p(latent|data)
Bayesian GPLVM
Ideally we want to know the uncertainty in the latent factors, p(latent|data). Approaches to inference:
- Metropolis-Hastings: requires lots of tweaking but 'guaranteed' for any model
- HMC with Stan: fast, requires less tweaking but less support for arbitrary priors
- Variational inference [1]
[1] Titsias, M., & Lawrence, N. (2010). Bayesian Gaussian Process Latent Variable Model. Artificial Intelligence, 9, 844-851.
Buettner 2012
Introduce 'structure preserving' GPLVM for clustering of single-cell qPCR from zygote to blastocyst development
- Includes a 'prior' that preserves local structure by modifying the likelihood (previously studied [2])
- Find that the modified GPLVM gives better separation between different developmental stages
[2] Maaten, L. van der. (2009). Preserving Local Structure in Gaussian Process Latent Variable Models
Buettner 2015
Use (GP?)-LVM to construct low-rank cell-to-cell covariance based on expression of a specific gene pathway
Model
$$y_g \sim \mathcal{N}(\mu_g, X X^T + \sigma_\nu^2 C C^T + \nu_g^2 I)$$
where
- $X$ is a hidden factor such as cell cycle
- $C$ is an observed covariate
Can then assess gene-gene correlation controlling for hidden factors
Figure: Non-linear PCA of genes not annotated as cell-cycle. Left: before scLVM, right: after.
Bayesian Gaussian Process Latent Variable Models for pseudotime inference
Pseudotime: artificial measure of a cell's progression through some process (differentiation, apoptosis, cell cycle)
Cell ordering problem: order high-dimensional transcriptomes through the process
Current approaches
Monocle
- ICA for dimensionality reduction, longest path through minimum spanning tree to assign pseudotime
- Uses cubic smoothing splines & likelihood ratio test for differential expression analysis
Standard analysis is to examine differential expression across pseudotime
Questions: What is the uncertainty in pseudotime? How does this impact the false discovery rate of differential expression analysis?
Bayesian GPLVM for pseudotime inference
1. Reduce dimensionality of gene expression data (LE, t-SNE, PCA or all at once!)
2. Fit Bayesian GPLVM in reduced space (this is essentially a probabilistic curve)
3. Quantify posterior samples, uncertainty etc.
Model
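$$\begin{aligned}
\gamma &\sim \mathrm{Gamma}(\gamma_\alpha, \gamma_\beta) \\
\lambda_j &\sim \mathrm{Exp}(\gamma) \\
\sigma_j &\sim \mathrm{InvGamma}(\alpha, \beta) \\
t_i &\sim \pi_t, \quad i = 1, \ldots, N \\
\Sigma &= \mathrm{diag}(\sigma_1^2, \ldots, \sigma_P^2) \\
K^{(j)}(t, t') &= \exp(-\lambda_j (t - t')^2) \\
\mu_j &\sim \mathrm{GP}(0, K^{(j)}), \quad j = 1, \ldots, P \\
x_i &\sim \mathcal{N}(\mu(t_i), \Sigma), \quad i = 1, \ldots, N
\end{aligned} \tag{7}$$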
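Reading the model generatively, one draw from (7) can be simulated as below. This is a sketch under stated assumptions: Gamma and Exp are taken in the shape/rate and rate parameterisations (NumPy uses scale, so rates are inverted), a uniform distribution on $[0, 1]$ stands in for $\pi_t$, and all default hyperparameter values and the name `sample_model` are made up for illustration.

```python
import numpy as np

def sample_model(N=50, P=2, gamma_alpha=1.0, gamma_beta=1.0,
                 alpha=3.0, beta=0.1, seed=0):
    """Draw pseudotimes t, latent curves mu and observations x from the model."""
    rng = np.random.default_rng(seed)
    gamma = rng.gamma(shape=gamma_alpha, scale=1.0 / gamma_beta)      # gamma ~ Gamma
    lam = rng.exponential(scale=1.0 / gamma, size=P)                  # lambda_j ~ Exp(gamma)
    sigma = 1.0 / rng.gamma(shape=alpha, scale=1.0 / beta, size=P)    # sigma_j ~ InvGamma
    t = rng.uniform(0.0, 1.0, size=N)         # assumption: uniform stand-in for pi_t
    d2 = (t[:, None] - t[None, :]) ** 2
    mu = np.empty((N, P))
    for j in range(P):
        K_j = np.exp(-lam[j] * d2) + 1e-8 * np.eye(N)                 # jitter for stability
        mu[:, j] = rng.multivariate_normal(np.zeros(N), K_j)          # mu_j ~ GP(0, K^(j))
    x = mu + sigma * rng.normal(size=(N, P))  # x_i ~ N(mu(t_i), diag(sigma^2))
    return t, mu, x

t, mu, x = sample_model()
```

Plotting `x[:, 0]` against `x[:, 1]`, coloured by `t`, shows the 'probabilistic curve' view of the model: cells scatter around a smooth trajectory traced out by pseudotime.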
Prior issues
How do we define the prior on $t$, $\pi_t$?
- Typically want $t = (t_1, \ldots, t_n)$ to sit uniformly on $[0, 1]$
- $t$ only appears in the likelihood via $\lambda_j (t - t')^2$
- This means we can arbitrarily rescale $\lambda \to \lambda / \varepsilon$ and $t \to \sqrt{\varepsilon}\, t$ and get the same likelihood
- $t$ is equivalent on any subset of $[0, 1]$, so the problem is ill-defined
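The non-identifiability can be checked numerically. The sketch below reads the rescaling as $\lambda \to \lambda / \varepsilon$ and $t \to \sqrt{\varepsilon}\, t$, which is the form that leaves $\lambda (t - t')^2$ unchanged. The specific values of `t`, `lam` and `eps` are arbitrary.

```python
import numpy as np

def kernel_matrix(t, lam):
    """K(t, t') = exp(-lam * (t - t')^2) evaluated on all pairs."""
    return np.exp(-lam * (t[:, None] - t[None, :]) ** 2)

t = np.array([0.1, 0.4, 0.9])
lam, eps = 3.0, 0.25

K_original = kernel_matrix(t, lam)
K_rescaled = kernel_matrix(np.sqrt(eps) * t, lam / eps)   # pseudotimes squeezed into [0, 0.5]

same_likelihood = np.allclose(K_original, K_rescaled)
```

Because the covariance matrices agree, the likelihood cannot distinguish pseudotimes spread over $[0, 1]$ from the same ordering compressed into a sub-interval, so the prior has to do that work.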
Solutions
Corp prior
- Want $t$ to 'fill out' over $[0, 1]$
- Introduce repulsive prior
$$\pi_t(t) \propto \prod_{i=1}^{N} \prod_{j=i+1}^{N} \sin(\pi |t_i - t_j|) \tag{8}$$
- Non-conjugate & difficult to evaluate gradient, so need Metropolis-Hastings
Constrained random walk inference
- If we constrain $t$ to be on $[0, 1]$ and use random walk sampling (MH, HMC), pseudotimes naturally 'wander' towards the boundary
- Once there, covariance structure settles them into a steady state
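The repulsive prior in (8) is straightforward to evaluate up to its normalising constant, which is all a Metropolis-Hastings acceptance ratio needs. The sketch below computes the unnormalised log-density. The function name `corp_log_prior` is an illustrative choice, and returning negative infinity for coincident pseudotimes is an implementation assumption.

```python
import numpy as np

def corp_log_prior(t):
    """Unnormalised log pi_t(t) = sum over pairs i < j of log sin(pi * |t_i - t_j|)."""
    t = np.asarray(t, dtype=float)
    i, j = np.triu_indices(len(t), k=1)            # all pairs with i < j
    s = np.sin(np.pi * np.abs(t[i] - t[j]))
    if np.any(s <= 0.0):                           # coincident points have zero density
        return -np.inf
    return float(np.sum(np.log(s)))

spread = corp_log_prior([0.1, 0.5, 0.9])           # pseudotimes filling out [0, 1]
clumped = corp_log_prior([0.45, 0.5, 0.55])        # pseudotimes bunched together
```

Here `spread` is larger than `clumped`, so the prior favours configurations in which the pseudotimes repel one another.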
Applications to biological datasets
Applied Bayesian GPLVM to three datasets:
1. Monocle: Differentiating human myoblasts (time series), 155 cells once contamination removed
2. Ear: Differentiating cells from mouse cochlear & utricular sensory epithelia. Pseudotime shows supporting cells (SC) differentiating into hair cells (HC)
3. Waterfall: Adult neurogenesis (PCA captures pseudotime variation)
Sampling posterior curves
A Monocle dataset, Laplacian eigenmaps representation
B Ear dataset, Laplacian eigenmaps representation
C Waterfall dataset, PCA representation
What does the posterior uncertainty look like? (I)
95% HPD credible interval typically spans ∼ 1/4 of pseudotime
What does the posterior uncertainty look like? (II)
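Effect of hyperparameters (Monocle dataset)
Recall
$$K(t, t') \propto \exp(-\lambda_j (t - t')^2), \quad \lambda_j \sim \mathrm{Exp}(\gamma), \quad \gamma \sim \mathrm{Gamma}(\gamma_\alpha, \gamma_\beta)$$
$|\lambda|$ roughly corresponds to arc-length. So what are the effects of changing $\gamma_\alpha, \gamma_\beta$?
$$\mathrm{E}[\gamma] = \frac{\gamma_\alpha}{\gamma_\beta}, \quad \mathrm{Var}[\gamma] = \frac{\gamma_\alpha}{\gamma_\beta^2}$$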
Approximate false discovery rate
How to approximate false discovery rate?
- Refit differential expression for each gene across posterior samples of pseudotime
- Compute p- and q-values for each sample for each gene
- Statistic is proportion significant at 5% FDR
- Differential gene expression is a false positive if proportion significant < 0.95 and q-value < 0.05
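The procedure above can be sketched as follows, taking the per-gene p-values as given rather than refitting the differential expression model. Several choices here are assumptions rather than anything stated on the slide: q-values are computed with the Benjamini-Hochberg adjustment, the 'q-value < 0.05' condition is applied to a point-estimate pseudotime fit, and the approximate FDR is reported as the fraction of point-estimate discoveries flagged as false positives. The function names are illustrative.

```python
import numpy as np

def bh_qvalues(p):
    """Benjamini-Hochberg adjusted p-values (assumed choice of q-value)."""
    p = np.asarray(p, dtype=float)
    n = len(p)
    order = np.argsort(p)
    ranked = p[order] * n / np.arange(1, n + 1)
    q_sorted = np.minimum.accumulate(ranked[::-1])[::-1]   # enforce monotonicity
    q = np.empty(n)
    q[order] = np.minimum(q_sorted, 1.0)
    return q

def approx_fdr(p_point, p_samples, level=0.05, threshold=0.95):
    """p_point: p-values from a point-estimate fit, shape (genes,).
    p_samples: p-values per posterior pseudotime sample, shape (samples, genes)."""
    q_point = bh_qvalues(p_point)
    q_samples = np.vstack([bh_qvalues(p) for p in p_samples])
    prop_sig = (q_samples < level).mean(axis=0)            # proportion significant at 5% FDR
    called = q_point < level                               # discoveries from the point estimate
    false_pos = called & (prop_sig < threshold)
    fdr = false_pos.sum() / max(called.sum(), 1)
    return prop_sig, false_pos, fdr

# Toy input: 3 genes, 4 posterior samples (made-up numbers)
p_point = np.array([0.001, 0.001, 0.9])
p_samples = np.array([[0.001, 0.001, 0.9],
                      [0.001, 0.001, 0.9],
                      [0.001, 0.8, 0.9],
                      [0.001, 0.8, 0.9]])
prop_sig, false_pos, fdr = approx_fdr(p_point, p_samples)
```

In this toy input the second gene is called significant from the point estimate but is significant in only half of the posterior samples, so it is flagged as a false positive.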
Approximate false discovery rates
Approximate false discovery rate can be very high (∼ 3× larger than it should be) but is also variable
Integrating multiple dimensionality reduction algorithms
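Can very easily integrate multiple sources of data from different dimensionality reduction algorithms:
$$p(t, \{X\}) \propto \pi_t(t)\, p(X_{\mathrm{LE}} \mid t)\, p(X_{\mathrm{PCA}} \mid t)\, p(X_{\mathrm{tSNE}} \mid t) \tag{9}$$
Natural extension to integrate multiple heterogeneous sources of data, e.g.
$$p(t, \{X\}) \propto \pi_t(t)\, p(\text{imaging} \mid t)\, p(\text{ATAC} \mid t)\, p(\text{transcriptomics} \mid t) \tag{10}$$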
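On the log scale, the factorisation in (9) is a sum of one likelihood term per representation, all sharing the same pseudotimes $t$. The sketch below assumes each representation follows the GP model with the latent curves $\mu_j$ marginalised out, so that each column is distributed as $\mathcal{N}(0, K^{(j)} + \sigma_j^2 I)$. The function names, the flat default prior and the toy inputs are illustrative assumptions.

```python
import numpy as np

def gp_marginal_log_lik(t, X, lam, sigma):
    """log p(X | t) with each column j of X ~ N(0, K^(j) + sigma_j^2 I)."""
    N, P = X.shape
    d2 = (t[:, None] - t[None, :]) ** 2
    total = 0.0
    for j in range(P):
        K = np.exp(-lam[j] * d2) + sigma[j] ** 2 * np.eye(N)
        L = np.linalg.cholesky(K)
        a = np.linalg.solve(L, X[:, j])
        total += -0.5 * (N * np.log(2 * np.pi) + a @ a) - np.sum(np.log(np.diag(L)))
    return total

def joint_log_post(t, reps, log_prior=lambda t: 0.0):
    """log pi_t(t) plus the sum of log p(X_r | t) over representations r."""
    return log_prior(t) + sum(gp_marginal_log_lik(t, X, lam, sigma)
                              for X, lam, sigma in reps)

rng = np.random.default_rng(3)
t = rng.uniform(size=20)
X_le, X_pca = rng.normal(size=(20, 2)), rng.normal(size=(20, 2))   # toy stand-ins
reps = [(X_le, [1.0, 1.0], [0.5, 0.5]), (X_pca, [2.0, 2.0], [0.5, 0.5])]
log_post = joint_log_post(t, reps)
```

Adding a further data source, as in (10), only requires appending another term with its own likelihood.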
Example: Monocle with LE, PCA & t-SNE
Learning curves for each representation separately:
Joint learning of all representations:
FDR from multiple representation learning
Some good references (I)
Gaussian Processes
- Rasmussen, Carl Edward. "Gaussian processes for machine learning." (2006).
GPLVM
- Lawrence, Neil D. "Gaussian process latent variable models for visualisation of high dimensional data." Advances in neural information processing systems 16.3 (2004): 329-336.
- Titsias, Michalis K., and Neil D. Lawrence. "Bayesian Gaussian process latent variable model." International Conference on Artificial Intelligence and Statistics. 2010.
- van der Maaten, Laurens. "Preserving local structure in Gaussian process latent variable models." Proceedings of the 18th Annual Belgian-Dutch Conference on Machine Learning. 2009.
- Wang, Ye, and David B. Dunson. "Probabilistic Curve Learning: Coulomb Repulsion and the Electrostatic Gaussian Process." arXiv preprint arXiv:1506.03768 (2015).
Some good references (II)
Latent variable models in single-cell genomics
- Buettner, Florian, and Fabian J. Theis. "A novel approach for resolving differences in single-cell gene expression patterns from zygote to blastocyst." Bioinformatics 28.18 (2012): i626-i632.
- Buettner, Florian, et al. "Computational analysis of cell-to-cell heterogeneity in single-cell RNA-sequencing data reveals hidden subpopulations of cells." Nature biotechnology 33.2 (2015): 155-160.
Pseudotime
- Trapnell, Cole, et al. "The dynamics and regulators of cell fate decisions are revealed by pseudotemporal ordering of single cells." Nature biotechnology 32.4 (2014): 381-386.
- Bendall, Sean C., et al. "Single-cell trajectory detection uncovers progression and regulatory coordination in human B cell development." Cell 157.3 (2014): 714-725.
- Marco, Eugenio, et al. "Bifurcation analysis of single-cell gene expression data reveals epigenetic landscape." Proceedings of the National Academy of Sciences 111.52 (2014): E5643-E5650.
- Shin, Jaehoon, et al. "Single-Cell RNA-Seq with Waterfall Reveals Molecular Cascades underlying Adult Neurogenesis." Cell stem cell 17.3 (2015): 360-372.
- Leng, Ning, et al. "Oscope identifies oscillatory genes in unsynchronized single-cell RNA-seq experiments." Nature methods 12.10 (2015): 947-950.