

  • 1

    Rethinking Experimental Design in High-Dimensional Biology

    John D. Storey, University of Washington

    Experimental Design

    Process of detailing the manner in which data should be collected to maximize the amount of information and minimize the uncertainty about some phenomenon

    Randomization

    In a well-designed randomized study…

    – Independent factors average out over many studies to have no overall effect

    – Sometimes we can even infer causation from association

    High-Dimensional Biology

    • Rather than studying a few biological variables at a time, instead focus on systems-level behavior according to many variables

    • Example: Measure RNA and protein expression levels from many genes to build a predictive regulatory network

    Gene Regulatory Networks

    • “Wiring Diagram” -- the way in which the genes regulate one another and interact

    • Most existing methods are based on correlation:
      – Clustering with correlation-based distance
      – Bayes networks from partial-correlation-based model selection

    Existing Literature

    • Weaver (1999) Proc Pac Symp Biocomp

    • D'haeseleer et al. (2000) Bioinformatics

    • Milo et al. (2002) Science

    • Friedman (2004) Science

    • Kelly and Ideker (2005) Nat Biotech

    • Lee et al. (2006) PNAS

    • Many more…

  • 2

    Causal Modeling

    • Randomization is the gold standard for inferring causality

    • Rich literature on this topic

    • Partial correlation approaches are too vulnerable to confounding

    • Model selection doesn’t directly test each individual causal relationship

    • Neither offers a natural measure of significance for causality

    Statistical Challenges

    • High-throughput molecular profiling data are essentially observational

    • Small sample size, large number of genes

    • Factors subjected to one randomization (if any) may affect many variables

    • Highly complex regulation and interaction of genes

    Randomization in Systems Biology

    • What do we get out of classical definitions of a well-designed randomized study?

    • Is association likely to imply causation in high-throughput biology?

    • Is there a better way of thinking about randomization and inferring causality?

    Expression in Humans

    • Age

    • Weight

    • Health state

    • Environment

    • Genetics

    • Batch

    • Clinic

    • Many others…

    Illustrative Simulation

    • 1,000 genes

    • 20 arrays

    • Gaussian noise

    [Figure: simulated expression data, Genes by Arrays]

  • 3

    Illustrative Simulation

    • 1,000 genes

    • 20 arrays

    • Gaussian noise

    • Genes 1-300 differentially expressed

    • Genes 201-500 affected by an independent hidden variable

    [Figure: simulated expression data, Genes by Arrays]

    [Figure panels: All P-values; Null P-values]
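A minimal sketch of a simulation with the same layout as the slide (1,000 genes, 20 arrays, Gaussian noise, genes 1-300 differentially expressed, genes 201-500 affected by a hidden variable). The effect size of 2.0, the two-group design with 10 arrays per group, the binary hidden variable, and the use of a pooled t-statistic with a fixed 5% cutoff are all assumptions made for illustration; the slides do not state them.

```python
import random
import statistics

T_CRIT = 2.101  # two-sided 5% critical value, Student's t with 18 degrees of freedom


def pooled_t(x, y):
    """Two-sample t-statistic with pooled variance."""
    nx, ny = len(x), len(y)
    pooled = ((nx - 1) * statistics.variance(x)
              + (ny - 1) * statistics.variance(y)) / (nx + ny - 2)
    return (statistics.mean(x) - statistics.mean(y)) / (pooled * (1 / nx + 1 / ny)) ** 0.5


def simulate(seed=0, n_genes=1000, n_arrays=20, effect=2.0):
    """Return one t-statistic per gene for a single simulated study."""
    rng = random.Random(seed)
    half = n_arrays // 2
    group = [0] * half + [1] * (n_arrays - half)           # primary variable (assumed two groups)
    hidden = [rng.randint(0, 1) for _ in range(n_arrays)]  # hidden variable, drawn independently of group
    t_stats = []
    for gene in range(n_genes):
        row = []
        for array in range(n_arrays):
            value = rng.gauss(0.0, 1.0)                    # Gaussian noise
            if gene < 300:                                 # genes 1-300: differentially expressed
                value += effect * group[array]
            if 200 <= gene < 500:                          # genes 201-500: hidden variable
                value += effect * hidden[array]
            row.append(value)
        t_stats.append(pooled_t(row[half:], row[:half]))
    return t_stats


def rejection_rate(t_values):
    """Fraction of tests called significant at the nominal 5% level."""
    return sum(abs(t) > T_CRIT for t in t_values) / len(t_values)


t_stats = simulate()
eh_null_rate = rejection_rate(t_stats[300:500])   # true nulls touched by the hidden variable
clean_null_rate = rejection_rate(t_stats[500:])   # true nulls with noise only
print("null genes with hidden variable:", eh_null_rate)
print("null genes without hidden variable:", clean_null_rate)
```

Rerunning `simulate` with different seeds shows how the rejection rate among the hidden-variable nulls moves from study to study, while the noise-only nulls stay close to the nominal level.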

    Expression Heterogeneity

    Impact of EH on Testing

    • Can massively distort signal
      – Add spurious signal
      – Remove real signal

    • FDR estimates become highly variable

    • Significance ranking of tests becomes highly variable

  • 4

    P-values for a Single True Null

    Paradox?

    • Randomization works for interrogating a single variable when considering lots of repeated experiments

    • Randomization does not always work for interrogating many variables at once, even when considering lots of repeated experiments
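The first bullet can be illustrated with a small sketch: one true-null gene, also affected by a hidden variable that is independent of the randomized treatment, is tested in many repeated studies. The group size of 10, the hidden effect of 2.0, the binary hidden variable, and the 4,000 repetitions are assumptions chosen for illustration only.

```python
import random
import statistics

T_CRIT = 2.101  # two-sided 5% critical value, Student's t with 18 degrees of freedom


def pooled_t(x, y):
    """Two-sample t-statistic with pooled variance."""
    nx, ny = len(x), len(y)
    pooled = ((nx - 1) * statistics.variance(x)
              + (ny - 1) * statistics.variance(y)) / (nx + ny - 2)
    return (statistics.mean(x) - statistics.mean(y)) / (pooled * (1 / nx + 1 / ny)) ** 0.5


def one_study(rng, n_per_group=10, hidden_effect=2.0):
    """One randomized study of a single null gene that a hidden variable also affects."""
    def draw():
        # Noise plus a hidden binary factor assigned independently of treatment.
        return rng.gauss(0.0, 1.0) + hidden_effect * rng.randint(0, 1)

    control = [draw() for _ in range(n_per_group)]
    treated = [draw() for _ in range(n_per_group)]
    return pooled_t(treated, control)


rng = random.Random(1)
n_studies = 4000
t_values = [one_study(rng) for _ in range(n_studies)]
false_positive_rate = sum(abs(t) > T_CRIT for t in t_values) / n_studies
print("false positive rate over repeated studies:", false_positive_rate)
```

Averaged over repeated studies, the single test behaves roughly as advertised. The second bullet concerns what happens inside one study, where the same hidden-variable draw is shared by many genes at once.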

    Main Point

    • In a traditional well-designed study a single randomization is applied to thousands of variables

    • Unmodeled or unmeasured factors should be taken into account

    • We have a lot of data… so this is feasible

    • A mechanism for performing many independent randomizations is needed

    Overcoming Dependence

    Goal: Estimate factors causing EH and incorporate them into analysis

    Problem: Many causes of EH are unknown or unmeasured

    Solution: Estimate signatures of EH directly from expression data… “Surrogate Variable Analysis”

    SVA Algorithm

    [Flowchart; data panels: Observed Data, Residual Data, The primary variable]

    1. Regress out the primary variable

    2. Calculate residual eigengenes

    3. Identify the genes that are significantly associated with the residual eigengene

    4. Return to the original data for that subset

    5. Calculate eigengenes of the subset
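A rough sketch of only the first two operations named on this slide, regressing out the primary variable and calculating a residual eigengene, on simulated data. It is not the published SVA procedure: the gene-subset and significance steps are omitted, power iteration stands in for a full singular value decomposition, and the simulated design (two groups, an alternating binary hidden variable, effect size 2.0) is an assumption.

```python
import random


def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5


def regress_out(row, group):
    """Residuals after removing each group's mean (a two-level primary variable)."""
    means = {}
    for level in set(group):
        values = [v for v, g in zip(row, group) if g == level]
        means[level] = sum(values) / len(values)
    return [v - means[g] for v, g in zip(row, group)]


def top_eigengene(matrix, rng, n_iter=100):
    """Leading right singular vector of a genes-by-arrays matrix (power iteration)."""
    n = len(matrix[0])
    # Arrays-by-arrays Gram matrix R^T R of the residual matrix R.
    gram = [[sum(row[i] * row[j] for row in matrix) for j in range(n)] for i in range(n)]
    vec = [rng.gauss(0.0, 1.0) for _ in range(n)]
    for _ in range(n_iter):
        vec = [sum(gram[i][j] * vec[j] for j in range(n)) for i in range(n)]
        norm = sum(v * v for v in vec) ** 0.5
        vec = [v / norm for v in vec]
    return vec


rng = random.Random(0)
n_genes, n_arrays = 1000, 20
group = [0] * 10 + [1] * 10                  # primary variable (assumed design)
hidden = [a % 2 for a in range(n_arrays)]    # hidden variable, balanced within each group
data = [
    [rng.gauss(0.0, 1.0)
     + (2.0 * group[a] if g < 300 else 0.0)            # genes 1-300: primary effect
     + (2.0 * hidden[a] if 200 <= g < 500 else 0.0)    # genes 201-500: hidden effect
     for a in range(n_arrays)]
    for g in range(n_genes)
]

residuals = [regress_out(row, group) for row in data]
eigengene = top_eigengene(residuals, rng)
corr_with_hidden = pearson(eigengene, hidden)
print("correlation of residual eigengene with hidden variable:", corr_with_hidden)
```

In this simulated setting the leading residual eigengene tracks the hidden variable closely (up to sign), which is the intuition behind estimating signatures of EH directly from the expression data.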

  • 5

    [Figure panels: Unadjusted P-values; SV-adjusted P-values; Unadjusted Null P-values; SV-adjusted Null P-values]

  • 6

    Experimental Crosses

    • Randomized genetic backgrounds -- they provide a natural randomization of the “hardwired” component of the cell

    • If DNA variation at a locus causes transcriptional variation for some gene, then randomizing the genotype of the locus in turn randomizes the transcriptional levels

    RIL Experiments

    Linkage ⇒ Randomized Expression

    [Figure: fold change (log2) for all Seg, Seg inherit BY marker, and Seg inherit RM marker]

    Genetic Analysis of Gene Expression

    Seg 1, Seg 2, ..., Seg 112

    mRNA Expression (6,000 phenotypes) × Oligo Genotyping (3,000 markers)

    Brem & Kruglyak (2005)

    [Figure: Trait Position vs. Marker Position; annotation: “trans-linkage”]

  • 7

    Large-Scale Perturbations

    [Figure: Trait Position vs. Marker Position; annotation: “trans-linkage”]

    After Surrogate Variable Analysis

    [Figure: Trait Position vs. Marker Position]

    ~55% show cis-linkage

    Local Perturbations

    L = a locus, TA = transcript A, TB = transcript B

    Causality Equivalence Theorem: L → TA → TB with no hidden causal relationships if and only if:

    1. L → TA

    2. L → TB

    3. L ⊥ TB | TA
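The three conditions can be checked on simulated data with a small sketch. Everything here is an illustrative assumption: the linear effects, the sample size of 2,000, and in particular the use of Pearson and partial correlation as a linear stand-in for the association and conditional-independence statements. It is not the testing procedure behind the slides.

```python
import random


def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5


def partial_corr(x, y, z):
    """Correlation of x and y after linearly adjusting both for z."""
    r_xy, r_xz, r_yz = pearson(x, y), pearson(x, z), pearson(y, z)
    return (r_xy - r_xz * r_yz) / ((1 - r_xz ** 2) * (1 - r_yz ** 2)) ** 0.5


rng = random.Random(7)
n = 2000
locus = [rng.randint(0, 1) for _ in range(n)]   # randomized genotype: 0 = BY, 1 = RM

# Chain model: L -> TA -> TB, with no other path from L to TB.
t_a = [1.5 * l + rng.gauss(0.0, 1.0) for l in locus]
t_b = [1.0 * a + rng.gauss(0.0, 1.0) for a in t_a]

# Alternative model: L also acts on TB directly, so condition 3 should fail.
t_b_direct = [1.0 * a + 1.5 * l + rng.gauss(0.0, 1.0) for a, l in zip(t_a, locus)]

r_l_ta = pearson(locus, t_a)                             # condition 1: L -> TA
r_l_tb = pearson(locus, t_b)                             # condition 2: L -> TB
pc_chain = partial_corr(locus, t_b, t_a)                 # condition 3: L independent of TB given TA
pc_direct = partial_corr(locus, t_b_direct, t_a)

print("corr(L, TA):", r_l_ta)
print("corr(L, TB):", r_l_tb)
print("partial corr(L, TB | TA), chain model:", pc_chain)
print("partial corr(L, TB | TA), direct-effect model:", pc_direct)
```

Under the chain model the locus is associated with both transcripts but carries no further information about TB once TA is known. Adding a direct path from L to TB leaves a clear residual association.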

  • 8

    [Figure panels: TA Expression; TB Expression; TB | TA Expression. Y-axis: Relative Expression Levels of TA or TB. X-axis: All Segregants, L=BY, L=RM]

    Regulatory Probabilities

    $$P_{ij} = \Pr(T_i \to T_j) \ge \Pr(L_i \to T_i \to T_j) = \Pr(L_i \to T_i) \times \Pr(L_i \to T_j \mid L_i \to T_i) \times \Pr\left([L_i \perp T_j \mid T_i] \mid L_i \to T_i,\ L_i \to T_j\right)$$

    For each pair of genes i and j… [i, j = 1, 2, …, 6216]

    • At 90% probability threshold (FDR=6%):
      – CNS1 significantly regulates 144 genes
      – ILV6 significantly regulates 51 genes
      – SAL1 significantly regulates 36 genes

    • At 80% probability threshold (FDR=11%):
      – NAM9 significantly regulates 14 genes

    Probability   No. Regulators   Total No. Genes   No. Edges   FDR
    95%           76               1075              1499        2.7%
    90%           127              2145              4394        6.0%
    80%           255              4044              15,448      13%
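A small sketch of how the three-factor product in the Regulatory Probabilities formula and a probability threshold fit together. The input probabilities and gene names are made up, and estimating the FDR of a selected set as the average of one minus the probability is a common convention for posterior probabilities that is assumed here, not taken from the slides.

```python
def regulatory_probability(p_link_i, p_link_j_given_i, p_cond_indep):
    """Product of the three factors: Pr(Li -> Ti), Pr(Li -> Tj | Li -> Ti),
    and Pr(Li independent of Tj given Ti | both linkages)."""
    return p_link_i * p_link_j_given_i * p_cond_indep


def select_edges(probabilities, threshold):
    """Keep edges at or above the threshold; estimate FDR as the mean of (1 - P)."""
    selected = {edge: p for edge, p in probabilities.items() if p >= threshold}
    fdr = sum(1.0 - p for p in selected.values()) / len(selected) if selected else 0.0
    return selected, fdr


# Hypothetical factor estimates for three regulator -> target pairs.
factors = {
    ("gene_a", "gene_b"): (0.99, 0.98, 0.97),
    ("gene_a", "gene_c"): (0.99, 0.95, 0.90),
    ("gene_d", "gene_e"): (0.80, 0.70, 0.60),
}
probabilities = {edge: regulatory_probability(*f) for edge, f in factors.items()}
selected_90, fdr_90 = select_edges(probabilities, 0.90)
selected_80, fdr_80 = select_edges(probabilities, 0.80)

for edge, p in sorted(probabilities.items()):
    print(edge, round(p, 4))
print("edges at 90%:", sorted(selected_90), "estimated FDR:", round(fdr_90, 4))
print("edges at 80%:", sorted(selected_80), "estimated FDR:", round(fdr_80, 4))
```

Under this convention the estimated FDR can never exceed one minus the threshold, which matches the pattern in the table above (for example, 6.0% at the 90% threshold).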

    [Figure panels: CNS1; ILV6; SAL1; NAM9]

    NAM9 GO Analysis

    [Figure labels: ribosome, translation, mitochondrion, mitochondrial ribosome]

    FDR=6%, All Prob > 90%

  • 9

    Concluding Remarks

    • Benefits of classical well-designed randomized experiments are limited in the high-throughput setting

    • Surrogate Variable Analysis captures and models large-scale signatures of unmodeled factors

    • RIL experiments offer a natural randomization of the hardwired components of the cell

    • This randomization can be used to infer causal regulatory relationships among genes -- Trigger

    Acknowledgements

    Surrogate Variables:

    • Jeffrey Leek

    Causal Networks:

    • Lin Chen

    NIH grants: R01 HG002913, U54 GM2119

    Collaborators: Josh Akey, Ron Davis, Leonid Kruglyak, Eric Schadt, Ronald Tompkins, Wenzhong Xiao

    Glue grant

    Manuscripts

    • Leek JT and Storey JD (2007) Capturing heterogeneity in gene expression studies by surrogate variable analysis. PLoS Genetics.

    • Chen L, Emmert-Streib F, and Storey JD (2007) Harnessing naturally randomized transcription to infer regulatory relationships among genes. Genome Biology.