Upload
others
View
1
Download
0
Embed Size (px)
Citation preview
Hurdle Models for Single Cell Gene Expression
Andrew McDavid
Department of Statistics, University of Washingtonand
Vaccine and Infectious Disease Division, Fred Hutchinson Cancer Research [email protected]
@EquivMeasures
July 1, 2016
1 / 18
Why single cells?
Yi vector of expression values.Bulk gene expression:
∑i Yi .
But what about:
The cell-to-cell variance ofeach gene (VarYj)?
Clusters of cells or latentstructure (E[Y|Z ])?Cellular coexpression(Cov Y)? or probabilisticindependences?
Biological averaging hasconvolved over variables ofinterest.
2 / 18
Bimodality and single cell gene expression
A defining characteristic isbimodality in expression (Flatz2011, Powell 2012, McDavid2013, Marinov 2014).
Some (gene dependent)fraction of the time, little or noexpression is detected.
Given detection, expression issymmetric and bounded awayfrom zero.
Fluidigm qPCR
40 - Cycle threshold (Ct)
3 / 18
Bimodality and single cell gene expression
A defining characteristic isbimodality in expression (Flatz2011, Powell 2012, McDavid2013, Marinov 2014).
Some (gene dependent)fraction of the time, little or noexpression is detected.
Given detection, expression issymmetric and bounded awayfrom zero.
RNAseq
WA
RS
CC
L5IR
G1
WARS CCL5 IRG1
0
2
4
6
0
3
6
9
0.0
2.5
5.0
7.5
0 2 4 6 0 3 6 9 0.0 2.5 5.0 7.5
log2(transcripts per million +1).
3 / 18
Bimodality and single cell gene expression
A defining characteristic isbimodality in expression (Flatz2011, Powell 2012, McDavid2013, Marinov 2014).
Some (gene dependent)fraction of the time, little or noexpression is detected.
Given detection, expression issymmetric and bounded awayfrom zero.
RNAseq, thresholded
WA
RS
CC
L5IR
G1
WARS CCL5 IRG1
0
2
4
6
0
3
6
9
0.0
2.5
5.0
7.5
0 2 4 6 0 3 6 9 0.0 2.5 5.0 7.5
log2(transcripts per million +1)
3 / 18
Cause of bimodality
Fluorescent in-situ hybridizationexperiments: mRNA oftenzero-inflated, log-normaldistributed.
Transcription occurs in bursts whileDNA is uncoiled and accessible,followed by stochastic decay.
Consistent with zero-inflation ofsingle-cell qPCR and sequencing.
Shalek, et al, 2013
4 / 18
Are zeros limits of detection or censoring?
N 10-cell equivalents ⇒ 10N theexpression of a single cellequivalent
Single molecule capture efficiencyvaries from 90% to 20%
●
●
●
●
●
●
●
●
●●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
● ●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●●
●
●
●
●
●
●
●
● ●
●
●
●
●
● ●
●
●
●
●
●
●●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●●
●
●
● ●
●
●
●
●
●
●
●
●
●
●
●
●
●●
●
●
●
●
●
●
●●
●
●
●
●
●
●
●
●
●●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
10.0
12.5
15.0
17.5
20.0
10.0 12.5 15.0 17.5 20.0log(1 cell average)
log(
10 c
ell a
vera
ge)
E(y(1))
= E
(1
10N
N∑i
2y(10),i
)Similar relationships for the frequency of expression
5 / 18
Hurdle models
Both rate of zeros and meanof log-normal vary accordingto biological treatments,generally in tandem.
Phenomenological model:accommodate, rather thanexplain.
HJURP KIF23 TOP2A
0
5
10
15
g0/g1
sg2
/mg0
/g1sg2
/mg0
/g1sg2
/m
Nor
mal
ized
log 2
(Cou
nt+1
)
0.50.60.70.80.9
ProportionExpressing
Hurdle model
Yi is log-expression in cell i . Then
Yi = UiVi and Ui ⊥ Vi ,Ui ∼ Normal(µi , τ2),Vi ∼ Bernoulli(pi ).
6 / 18
Hurdle models
Both rate of zeros and meanof log-normal vary accordingto biological treatments,generally in tandem.
Phenomenological model:accommodate, rather thanexplain.
HJURP KIF23 TOP2A
0
5
10
15
g0/g1
sg2
/mg0
/g1sg2
/mg0
/g1sg2
/m
Nor
mal
ized
log 2
(Cou
nt+1
)
0.50.60.70.80.9
ProportionExpressing
Hurdle model
Yi is log-expression in cell i . Then
Yi = UiVi and Ui ⊥ Vi ,Ui ∼ Normal(µi , τ2),Vi ∼ Bernoulli(pi ).
6 / 18
Hurdle linear model
Let
µi = XTi β,
logit pi = XTi β′
be linear functions of covariates. Then we can do ANOVA andlinear regression using the Hurdle model.The log-likelihood of a sample of n cells, given µi (β) and pi (β
′) is
L(µi , pi ; y) =n∑
i=1
[1[yi 6=0] logit pi + log(1− pi )
]︸ ︷︷ ︸Bernoulli
+
∑i :yi 6=0
−1/2 log(τ22π
)− 1/2
[yi − µiτ
]2︸ ︷︷ ︸
Normal
7 / 18
Cell cycle experiment (Dennis, et al [2014])
333 genes, 930 cells, sorted by cell cycle (G0/G1, S, G2/M)
119 known, ranked genes associated with cell cycle from abulk expression data base (cyclebase.org)
Compare number of ranked and unranked genes discovered ata given P-values using:
Binomial: logistic regression on 1y ≡ 1[y 6=0]Gaussian: linear regression on y
Hurdle: joint regressions on 1y and y
8 / 18
Performance
Binomial logistic regressionon 1y ≡ 1[y 6=0]
Gaussian linear regressionon y
Hurdle joint regressionson 1y and y
0
25
50
75
100
0 10 20 30 40Number of unranked discoveries
Num
ber o
f ran
ked
disc
over
ies
Test Type
Binomial
Hurdle
Gaussian
9 / 18
Hurdle model extensions and applications
1 Empirical Bayesian regularization to borrow strength acrossgenes.
Uij ∼ Normal(µij , τ2j ),τ2j ∼ Inverse-Gamma(a, b).
2 Stablity under linear separation with Cauchy prior onlogistic coefficients.
3 Mixed models, in which the between-individual andwithin-individual variability is parametrized.
4 Parametric graphical modeling on zero-inflated data toestimate gene-gene interactions
5 Competitive gene set enrichment analysis.
10 / 18
Tfh and HIV (Swiss Institute for Vaccine Research)
Scientific question: how does HIV alterthe expression profile of Tfh-maturationand signaling genes?
16 donors, recent HIV naive toanti-retroviral therapy, and healthycontrols. Lymph biopsies.
Two cell populations: CXCR5−PD1+,CXCR5+PD1+ (Tfh)
Statistical question: do Tfh genes differon average compared to non-Tfh genesin HIV+ vs healthy controls?
11 / 18
Tfh and HIV (Swiss Institute for Vaccine Research)
Scientific question: how does HIV alterthe expression profile of Tfh-maturationand signaling genes?
16 donors, recent HIV naive toanti-retroviral therapy, and healthycontrols. Lymph biopsies.
Two cell populations: CXCR5−PD1+,CXCR5+PD1+ (Tfh)
Statistical question: do Tfh genes differon average compared to non-Tfh genesin HIV+ vs healthy controls?
11 / 18
Tfh and HIV (Swiss Institute for Vaccine Research)
Scientific question: how does HIV alterthe expression profile of Tfh-maturationand signaling genes?
16 donors, recent HIV naive toanti-retroviral therapy, and healthycontrols. Lymph biopsies.
Two cell populations: CXCR5−PD1+,CXCR5+PD1+ (Tfh)
Statistical question: do Tfh genes differon average compared to non-Tfh genesin HIV+ vs healthy controls?
11 / 18
Competitive Gene Set Enrichment
1 Vector of expression estimates β̂g and β̂′g for genes
g = 1, . . . ,NG .
2 Geneset C
[Cg ] =
{1 Gene g is in set
0 Else
and its complement D = 1− C.3 Expression in the set vs expression outside the set:
δ =CT β̂
‖C‖1− D
T β̂
‖D‖1
by comparing δ to Normal(0,Var(δ)).
4 Need an estimate of Var(δ).
12 / 18
Var(δ) and non-independence
Expression between genes dependent, so Cov(β̂i , β̂j) 6= 0 in general.Estimate covariance matrix Λ = [λij ] = Cov(β̂i , β̂j), then
Var
(CT β̂
‖C‖1
)=
CTΛC
‖C‖21.
13 / 18
Var(δ) and non-independence
Expression between genes dependent, so Cov(β̂i , β̂j) 6= 0 in general.Estimate covariance matrix Λ = [λij ] = Cov(β̂i , β̂j), then
Var
(CT β̂
‖C‖1
)=
CTΛC
‖C‖21.
13 / 18
Proximate and future work
Power/sample size calculations
Gene expression matrices Yik in donor k for condition i overgenes j = 1, . . . , J.
Test condition effect βi 6= 0 over the donorsuper-population.
Super-population variability Var(βij) might be similar betweengenes, like shrinkage models for dispersions: limma, deseq2,edgeR, etc.
More useful decompositions of parameters for Hurdle models
Clustering on zero-inflated data
14 / 18
More Reading
Finak G., McDavid A., Yajima M, et al (2015). MAST: aflexible statistical framework for assessing transcriptionalchanges and characterizing heterogeneity in single-cell RNAsequencing data. Genome Biology.
Dennis, L., McDavid, A., Danaher, P., et al (2014). Modelingbi-modality improves characterization of cell cycle on geneexpression in single cells. PLoS Computational Biology.
McDavid, A., Finak, G., Chattopadyay, P. K., et al (2013).Data Exploration, Quality Control and Testing in Single-CellqPCR-Based Gene Expression Experiments. Bioinformatics.
http://github.com/RGLab/MAST – use branchsummarizedExpt
15 / 18
http://github.com/RGLab/MAST
Goal
Learn how to filter, explore and test for differential expression.Join me in eating this delicious dog food.Package to be submitted to Bioconductor for fall release, ongithub in the meantime.
MAITAnalysis Vignette
devtools::install_github(’RGLab/MAST@summarizedExpt’)
library(MAST)
vignette(’MAITAnalysis’)
file.edit(system.file(’doc/MAITAnalysis.R’,
package=’MAST’))
Infelicities
Want a unique key for rows (ENSEMBLE ids vs entrez ids vsUCSC transcript ids)
Also want a human-readable default key for plots
Hard to tell what your contrasts are with model.matrix.Easy to get the wrong answer. Important for powercalculations.
heatmap, heatmap2, pheatmap, aheatmap,complexheatmap, heatmapTheAwakening
17 / 18
Infelicities
Want a unique key for rows (ENSEMBLE ids vs entrez ids vsUCSC transcript ids)
Also want a human-readable default key for plots
Hard to tell what your contrasts are with model.matrix.Easy to get the wrong answer. Important for powercalculations.
heatmap, heatmap2, pheatmap, aheatmap,complexheatmap, heatmapTheAwakening
17 / 18
Infelicities
Want a unique key for rows (ENSEMBLE ids vs entrez ids vsUCSC transcript ids)
Also want a human-readable default key for plots
Hard to tell what your contrasts are with model.matrix.Easy to get the wrong answer. Important for powercalculations.
heatmap, heatmap2
, pheatmap, aheatmap,complexheatmap, heatmapTheAwakening
17 / 18
Infelicities
Want a unique key for rows (ENSEMBLE ids vs entrez ids vsUCSC transcript ids)
Also want a human-readable default key for plots
Hard to tell what your contrasts are with model.matrix.Easy to get the wrong answer. Important for powercalculations.
heatmap, heatmap2, pheatmap
, aheatmap,complexheatmap, heatmapTheAwakening
17 / 18
Infelicities
Want a unique key for rows (ENSEMBLE ids vs entrez ids vsUCSC transcript ids)
Also want a human-readable default key for plots
Hard to tell what your contrasts are with model.matrix.Easy to get the wrong answer. Important for powercalculations.
heatmap, heatmap2, pheatmap, aheatmap
,complexheatmap, heatmapTheAwakening
17 / 18
Infelicities
Want a unique key for rows (ENSEMBLE ids vs entrez ids vsUCSC transcript ids)
Also want a human-readable default key for plots
Hard to tell what your contrasts are with model.matrix.Easy to get the wrong answer. Important for powercalculations.
heatmap, heatmap2, pheatmap, aheatmap,complexheatmap
, heatmapTheAwakening
17 / 18
Infelicities
Want a unique key for rows (ENSEMBLE ids vs entrez ids vsUCSC transcript ids)
Also want a human-readable default key for plots
Hard to tell what your contrasts are with model.matrix.Easy to get the wrong answer. Important for powercalculations.
heatmap, heatmap2, pheatmap, aheatmap,complexheatmap, heatmapTheAwakening
17 / 18
Acknowledgments
External Collaborators
Mario RoedererLucas DennisMartin PrlicGiuseppe PantaleoDan Lu and BillRobinson
Fred Hutch and UWStatistics
Raphael GottardoGreg FinakMasanao Yajima
R01 EB008400 from theNational Institute ofBiomedical Imaging andBioengineering
18 / 18