  • Hurdle Models for Single Cell Gene Expression

    Andrew McDavid

    Department of Statistics, University of Washington

    and

    Vaccine and Infectious Disease Division, Fred Hutchinson Cancer Research Center

    [email protected]

    @EquivMeasures

    July 1, 2016

    1 / 18

  • Why single cells?

    $Y_i$: vector of expression values.

    Bulk gene expression: $\sum_i Y_i$.

    But what about:

    The cell-to-cell variance of each gene ($\text{Var}\,Y_j$)?

    Clusters of cells or latent structure ($E[Y \mid Z]$)?

    Cellular coexpression ($\text{Cov}\,Y$)? Or probabilistic independences?

    Biological averaging has convolved over variables of interest.

    2 / 18

  • Bimodality and single cell gene expression

    A defining characteristic is bimodality in expression (Flatz 2011, Powell 2012, McDavid 2013, Marinov 2014).

    Some (gene-dependent) fraction of the time, little or no expression is detected.

    Given detection, expression is symmetric and bounded away from zero.

    [Figure: Fluidigm qPCR. Axis: 40 - Cycle threshold (Ct).]

    3 / 18

  • Bimodality and single cell gene expression

    [Figure: RNAseq. Panels: WARS, CCL5, IRG1. Axes: log2(transcripts per million + 1).]

    3 / 18

  • Bimodality and single cell gene expression

    [Figure: RNAseq, thresholded. Panels: WARS, CCL5, IRG1. Axes: log2(transcripts per million + 1).]

    3 / 18

  • Cause of bimodality

    Fluorescent in-situ hybridization experiments: mRNA is often zero-inflated and log-normally distributed.

    Transcription occurs in bursts while DNA is uncoiled and accessible, followed by stochastic decay.

    Consistent with zero-inflation of single-cell qPCR and sequencing.

    Shalek et al., 2013

    4 / 18

  • Are zeros limits of detection or censoring?

    N 10-cell equivalents ⇒ 10N times the expression of a single-cell equivalent.

    Single-molecule capture efficiency varies from 90% to 20%.

    [Figure: scatter plot of log(10 cell average) against log(1 cell average); both axes run from 10.0 to 20.0.]

    $$E\left(y_{(1)}\right) = E\left(\frac{1}{10N}\sum_{i=1}^{N} 2^{y_{(10),i}}\right)$$

    Similar relationships hold for the frequency of expression.

    5 / 18

  • Hurdle models

    Both the rate of zeros and the mean of the log-normal vary according to biological treatments, generally in tandem.

    Phenomenological model: accommodate, rather than explain.

    [Figure: panels HJURP, KIF23, TOP2A by cell cycle phase (g0/g1, s, g2/m). Y axis: Normalized log2(Count + 1), 0 to 15. Legend: Proportion Expressing, 0.5 to 0.9.]

    Hurdle model

    $Y_i$ is log-expression in cell $i$. Then

    $$Y_i = U_i V_i \text{ and } U_i \perp V_i, \qquad U_i \sim \text{Normal}(\mu_i, \tau^2), \qquad V_i \sim \text{Bernoulli}(p_i).$$
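To make the generative reading of the hurdle model concrete, here is a minimal simulation sketch using only the Python standard library. The function name `simulate_hurdle` and the parameter values (mu = 8, tau = 1, p = 0.7) are illustrative assumptions, not values from the talk. It draws an independent Bernoulli indicator and a normal magnitude for each cell and multiplies them.

```python
import random


def simulate_hurdle(n, mu, tau, p, seed=0):
    """Draw n values Y = U * V with U ~ Normal(mu, tau^2) and
    V ~ Bernoulli(p), U independent of V (hypothetical helper)."""
    rng = random.Random(seed)
    ys = []
    for _ in range(n):
        v = 1 if rng.random() < p else 0   # detection indicator V_i
        u = rng.gauss(mu, tau)             # log-expression given detection U_i
        ys.append(u * v)                   # Y_i is zero when not detected
    return ys


# Illustrative parameters only.
ys = simulate_hurdle(n=10000, mu=8.0, tau=1.0, p=0.7)
positive = [y for y in ys if y != 0]
frac_expressed = len(positive) / len(ys)        # estimates p
mean_expressed = sum(positive) / len(positive)  # estimates mu
```

The two summaries recover the two parameters separately, which is the sense in which the model treats the rate of zeros and the conditional mean as distinct quantities.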

    6 / 18

  • Hurdle linear model

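    Let

    $$\mu_i = X_i^T \beta, \qquad \operatorname{logit} p_i = X_i^T \beta'$$

    be linear functions of covariates. Then we can do ANOVA and linear regression using the Hurdle model.

    The log-likelihood of a sample of $n$ cells, given $\mu_i(\beta)$ and $p_i(\beta')$, is

    $$L(\mu_i, p_i; y) = \underbrace{\sum_{i=1}^{n} \left[ 1_{[y_i \neq 0]} \operatorname{logit} p_i + \log(1 - p_i) \right]}_{\text{Bernoulli}} + \underbrace{\sum_{i:\, y_i \neq 0} \left( -\tfrac{1}{2} \log(2\pi\tau^2) - \tfrac{1}{2} \left[ \frac{y_i - \mu_i}{\tau} \right]^2 \right)}_{\text{Normal}}$$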
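The log-likelihood above can be evaluated directly, term by term. The following is a small sketch in standard-library Python. The function name `hurdle_loglik` and the toy data are assumptions made for illustration, and the fitted values `mu` and `p` are passed in rather than computed from covariates.

```python
import math


def hurdle_loglik(y, mu, p, tau):
    """Hurdle log-likelihood: a Bernoulli term over all cells plus a
    Normal term over cells with y_i != 0 (hypothetical helper).
    y, mu and p are equal-length sequences; tau is the common normal SD."""
    bernoulli = 0.0
    normal = 0.0
    for yi, mui, pi in zip(y, mu, p):
        expressed = yi != 0
        logit_pi = math.log(pi / (1.0 - pi))
        # 1[y_i != 0] * logit(p_i) + log(1 - p_i)
        bernoulli += (logit_pi if expressed else 0.0) + math.log(1.0 - pi)
        if expressed:
            # -1/2 log(2 pi tau^2) - 1/2 ((y_i - mu_i) / tau)^2
            normal += (-0.5 * math.log(2.0 * math.pi * tau ** 2)
                       - 0.5 * ((yi - mui) / tau) ** 2)
    return bernoulli + normal


# Toy data: two undetected cells and two detected cells.
y = [0.0, 0.0, 7.5, 8.5]
ll = hurdle_loglik(y, mu=[8.0] * 4, p=[0.5] * 4, tau=1.0)
```

Because the two sums share no parameters, the Bernoulli part can be maximized by logistic regression on the indicator and the Normal part by least squares on the nonzero values.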

    7 / 18

  • Cell cycle experiment (Dennis, et al [2014])

    333 genes, 930 cells, sorted by cell cycle (G0/G1, S, G2/M)

    119 known, ranked genes associated with cell cycle from a bulk expression database (cyclebase.org)

    Compare the number of ranked and unranked genes discovered at a given P-value using:

    Binomial: logistic regression on $1_y \equiv 1_{[y \neq 0]}$

    Gaussian: linear regression on $y$

    Hurdle: joint regressions on $1_y$ and $y$

    8 / 18

  • Performance

    [Figure: number of ranked discoveries (0 to 100) against number of unranked discoveries (0 to 40), one curve per test type. Binomial: logistic regression on $1_y \equiv 1_{[y \neq 0]}$. Gaussian: linear regression on $y$. Hurdle: joint regressions on $1_y$ and $y$.]

    9 / 18

  • Hurdle model extensions and applications

    1. Empirical Bayesian regularization to borrow strength across genes.

       $$U_{ij} \sim \text{Normal}(\mu_{ij}, \tau_j^2), \qquad \tau_j^2 \sim \text{Inverse-Gamma}(a, b).$$

    2. Stability under linear separation with a Cauchy prior on logistic coefficients.

    3. Mixed models, in which the between-individual and within-individual variability is parametrized.

    4. Parametric graphical modeling on zero-inflated data to estimate gene-gene interactions.

    5. Competitive gene set enrichment analysis.
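For the first extension, a short worked equation may help show how the Inverse-Gamma prior borrows strength. This is a sketch under a simplifying assumption not made on the slide: the means $\mu_{ij}$ are treated as known, and $n_j$ (number of expressing cells for gene $j$) and $S_j$ (their residual sum of squares) are notation introduced here. Standard normal/Inverse-Gamma conjugacy then gives:

```latex
S_j = \sum_{i=1}^{n_j} \left(u_{ij} - \mu_{ij}\right)^2,
\qquad
\tau_j^2 \mid u_{1j}, \ldots, u_{n_j j} \sim
  \text{Inverse-Gamma}\!\left(a + \tfrac{n_j}{2},\; b + \tfrac{S_j}{2}\right),
\qquad
E\!\left[\tau_j^2 \mid u\right] = \frac{b + S_j/2}{a + n_j/2 - 1}
  \quad \text{for } a + \tfrac{n_j}{2} > 1 .
```

When $n_j$ is small the posterior mean is dominated by the shared hyperparameters $a$ and $b$. When $n_j$ is large it approaches the gene's own estimate $S_j / n_j$.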

    10 / 18

  • Tfh and HIV (Swiss Institute for Vaccine Research)

    Scientific question: how does HIV alter the expression profile of Tfh-maturation and signaling genes?

    16 donors: recent HIV, naive to anti-retroviral therapy, and healthy controls. Lymph biopsies.

    Two cell populations: CXCR5−PD1+, CXCR5+PD1+ (Tfh)

    Statistical question: do Tfh genes differ on average compared to non-Tfh genes in HIV+ vs healthy controls?

    11 / 18

  • Competitive Gene Set Enrichment

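    1. Vector of expression estimates $\hat\beta_g$ and $\hat\beta'_g$ for genes $g = 1, \ldots, N_G$.

    2. Gene set $C$, with indicator entries

       $$[C_g] = \begin{cases} 1 & \text{gene } g \text{ is in the set} \\ 0 & \text{else} \end{cases}$$

       and its complement $D = 1 - C$.

    3. Test expression in the set vs expression outside the set with

       $$\delta = \frac{C^T \hat\beta}{\|C\|_1} - \frac{D^T \hat\beta}{\|D\|_1}$$

       by comparing $\delta$ to $\text{Normal}(0, \text{Var}(\delta))$.

    4. Need an estimate of $\text{Var}(\delta)$.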

    12 / 18

  • Var(δ) and non-independence

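    Expression between genes is dependent, so $\text{Cov}(\hat\beta_i, \hat\beta_j) \neq 0$ in general.

    Estimate the covariance matrix $\Lambda = [\lambda_{ij}] = \text{Cov}(\hat\beta_i, \hat\beta_j)$, then

    $$\text{Var}\left(\frac{C^T \hat\beta}{\|C\|_1}\right) = \frac{C^T \Lambda C}{\|C\|_1^2}.$$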
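The slide gives the variance of the in-set term only. As a hedged sketch, the same identity for linear combinations can be applied to the whole contrast by writing $\delta = w^T \hat\beta$ with $w = C/\|C\|_1 - D/\|D\|_1$, so that $\text{Var}(\delta) = w^T \Lambda w$. The function name `gene_set_z` and the toy numbers below are illustrative assumptions, and this is not presented as the MAST implementation.

```python
import math


def gene_set_z(beta, cov, in_set):
    """Competitive gene-set statistic (hypothetical helper).
    beta: per-gene coefficient estimates; cov: their covariance matrix
    Lambda as a list of lists; in_set: 0/1 indicators C.
    Returns (delta, var_delta, z)."""
    n_in = sum(in_set)
    n_out = len(in_set) - n_in
    # Contrast weights w = C/||C||_1 - D/||D||_1, so delta = w^T beta.
    w = [c / n_in - (1 - c) / n_out for c in in_set]
    delta = sum(wi * bi for wi, bi in zip(w, beta))
    # Var(delta) = w^T Lambda w, which accounts for gene-gene covariance.
    k = len(w)
    var_delta = sum(w[i] * cov[i][j] * w[j] for i in range(k) for j in range(k))
    return delta, var_delta, delta / math.sqrt(var_delta)


# Toy example: genes 1 and 2 form the set and are positively correlated.
beta = [2.0, 1.0, 0.5, -0.5]
in_set = [1, 1, 0, 0]
cov = [[1.0, 0.5, 0.0, 0.0],
       [0.5, 1.0, 0.0, 0.0],
       [0.0, 0.0, 1.0, 0.0],
       [0.0, 0.0, 0.0, 1.0]]
delta, var_delta, z = gene_set_z(beta, cov, in_set)
```

In this toy case the variance is 1.25, against 1.0 if the off-diagonal entries were ignored, so treating genes as independent would overstate the Z statistic.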

    13 / 18

  • Proximate and future work

    Power/sample size calculations

    Gene expression matrices $Y_{ik}$ in donor $k$ for condition $i$ over genes $j = 1, \ldots, J$.

    Test condition effect $\beta_i \neq 0$ over the donor super-population.

    Super-population variability $\text{Var}(\beta_{ij})$ might be similar between genes, like shrinkage models for dispersions: limma, DESeq2, edgeR, etc.

    More useful decompositions of parameters for Hurdle models

    Clustering on zero-inflated data

    14 / 18

  • More Reading

    Finak, G., McDavid, A., Yajima, M., et al. (2015). MAST: a flexible statistical framework for assessing transcriptional changes and characterizing heterogeneity in single-cell RNA sequencing data. Genome Biology.

    Dennis, L., McDavid, A., Danaher, P., et al. (2014). Modeling bi-modality improves characterization of cell cycle on gene expression in single cells. PLoS Computational Biology.

    McDavid, A., Finak, G., Chattopadhyay, P. K., et al. (2013). Data Exploration, Quality Control and Testing in Single-Cell qPCR-Based Gene Expression Experiments. Bioinformatics.

    http://github.com/RGLab/MAST – use branch summarizedExpt

    15 / 18

  • Goal

    Learn how to filter, explore and test for differential expression.

    Join me in eating this delicious dog food.

    Package to be submitted to Bioconductor for fall release, on GitHub in the meantime.

    MAITAnalysis Vignette

    # Install the summarizedExpt branch of MAST from GitHub
    devtools::install_github('RGLab/MAST@summarizedExpt')

    library(MAST)

    # Open the vignette, then the R code extracted from it
    vignette('MAITAnalysis')

    file.edit(system.file('doc/MAITAnalysis.R', package='MAST'))

  • Infelicities

    Want a unique key for rows (Ensembl ids vs Entrez ids vs UCSC transcript ids)

    Also want a human-readable default key for plots

    Hard to tell what your contrasts are with model.matrix. Easy to get the wrong answer. Important for power calculations.

    heatmap, heatmap2, pheatmap, aheatmap, complexheatmap, heatmapTheAwakening

    17 / 18

  • Acknowledgments

    External Collaborators

    Mario Roederer
    Lucas Dennis
    Martin Prlic
    Giuseppe Pantaleo
    Dan Lu and Bill Robinson

    Fred Hutch and UW Statistics

    Raphael Gottardo
    Greg Finak
    Masanao Yajima

    R01 EB008400 from the National Institute of Biomedical Imaging and Bioengineering

    18 / 18