Causal Inference & Genetic Regulatory Networks Peter Spirtes Carnegie Mellon University With slides from Lizzie Silver

Outline

Biology

Data and Background Knowledge

Problems

Algorithms

Causal Graph

[Figure: causal graph over the nodes gene, mRNA, and protein]

Sources: http://www.ornl.gov/sci/techresources/Human_Genome/graphics/slides/images/REGNET.jpg

Protein States

Folding

Location (nucleus, membrane, etc.)

Phosphorylation at different sites

Ubiquitination

etc.

Levels of Description

Differential equation models of rates of reaction

Which transcription factors bind to which sites, and can in turn be prevented or aided in binding to sites

Which gene products affect the rate at which other genes produce proteins

Example Graph

Knockout Experiments

Insert DNA construct into cell

DNA construct recombines with target gene

Target gene then is either not translated at all or is translated into a nonfunctional protein

Data: Hughes lab yeast data

Hughes lab data on S. cerevisiae:

63 wild-type strains

267 gene-deletion mutants

No information on direct effects, only total effects

But does include information on non-effects

Must normalize the data to account for differential manipulation - what is the right normalization method?

Data: M3D

Many Microbes Microarrays Database (M3D)

Collected from multiple sources of published data: different labs, strains, experimental conditions, genetic manipulations

E. coli (907), S. oneidensis (245), S. cerevisiae (904)

Experiment descriptions/features standardized

Affymetrix arrays uniformly normalized using Robust Multi-array Average (RMA); raw data also available

Data: RegulonDB

RegulonDB: The E. coli regulatory network database

Curated database of expert knowledge, constantly updated

List of known TF → gene effects, annotated with valence (+;-;±), type(s) of evidence supporting, publications supporting

No information on size of effects

Some information about direct effects, from Chromatin ImmunoPrecipitation (ChIP) assays, recognized binding sites, etc.

If an effect is not in RegulonDB, that doesn't mean it doesn't exist!

Search for Genetic Regulatory Networks

Goal: Search for a Directed Acyclic Graph (DAG) representing the “direct” regulatory effects (relative to the set of genes).

This is a hard problem!

# of DAGs is super-exponential in # of genes (4,300 genes in E. coli)

Unobserved confounders: environmental conditions, excluded genes, latent TFs that influence multiple genes

Cycles

Unfaithfulness

Non-linearity, non-Gaussianity

Density not strictly positive

Small sample size

Aggregation

How can we evaluate performance?
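To give a feel for the super-exponential growth of the search space, the number of labeled DAGs on n nodes can be computed with Robinson's recurrence. The sketch below is illustrative only; the function name `count_dags` is ours and is not part of any search algorithm discussed here.

```python
from functools import lru_cache
from math import comb


@lru_cache(maxsize=None)
def count_dags(n: int) -> int:
    """Number of labeled DAGs on n nodes, via Robinson's recurrence."""
    if n == 0:
        return 1
    # Inclusion-exclusion over the k nodes that have no incoming edges.
    return sum(
        (-1) ** (k + 1) * comb(n, k) * 2 ** (k * (n - k)) * count_dags(n - k)
        for k in range(1, n + 1)
    )


for n in range(1, 8):
    print(n, count_dags(n))
```

Even for ten nodes the count already has 19 digits, so exhaustive enumeration over thousands of genes is out of the question.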

Li and Biggin

Feedback

The equilibrium state of a feedback system can be represented by a cyclic graph.

If the joint distribution is Gaussian or multinomial, then the natural extension of d-separation to cyclic graphs entails the corresponding conditional independence.

If the joint distribution is not Gaussian or multinomial, then the natural extension of d-separation to cyclic graphs does not in general entail the corresponding conditional independence.
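A small numerical check of the Gaussian case, as a sketch under stated assumptions: a hypothetical linear cyclic model in which X1 → X3, X2 → X4, and X3 and X4 cause each other. In this graph X1 and X2 are d-separated given {X3, X4}, so the corresponding partial correlation at equilibrium should vanish. The coefficients and variable names are made up for illustration.

```python
import numpy as np

# Hypothetical coefficients. B[i, j] is the coefficient of X_{j+1}
# in the structural equation for X_{i+1}.
B = np.zeros((4, 4))
B[2, 0] = 0.8  # X1 -> X3
B[2, 3] = 0.5  # X4 -> X3 (feedback)
B[3, 1] = 0.7  # X2 -> X4
B[3, 2] = 0.4  # X3 -> X4 (feedback)

# Equilibrium of X = B X + e is X = (I - B)^{-1} e, with e ~ N(0, I).
A = np.linalg.inv(np.eye(4) - B)
Sigma = A @ A.T               # population covariance at equilibrium
P = np.linalg.inv(Sigma)      # precision matrix


def partial_corr(precision, i, j):
    """Partial correlation of variables i and j given all other variables."""
    return -precision[i, j] / np.sqrt(precision[i, i] * precision[j, j])


pc_x1_x2 = partial_corr(P, 0, 1)  # d-separated given {X3, X4}
pc_x1_x3 = partial_corr(P, 0, 2)  # adjacent, hence dependent
print(f"rho(X1, X2 | X3, X4) = {pc_x1_x2:.2e}")
print(f"rho(X1, X3 | X2, X4) = {pc_x1_x3:.3f}")
```

The first value is zero up to rounding error, the second is clearly not.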

Feedback

There is an extended sense of “pattern” to represent the set of all Markov equivalent graphs, cyclic or acyclic.

However, it is much more complicated than a pattern – not all Markov equivalent cyclic graphs share the same set of adjacencies, and there are dependencies among which edges.

There is an algorithm that is an extension of the PC algorithm for searching for cyclic graphs.

We do not have a graphical representation of the set of all Markov equivalent graphs with cycles and latent variables.

Local Markov Theorem (Chu)

Let G be an acyclic graph representing the causal relations among a set V of random variables. Let Y, X1, . . . , Xk ∈ V, and let X = {X1, . . . , Xk} be the set of parents of Y in G. If Y = c^T X + ε, where c^T = (c1, . . . , ck) and ε is a noise term independent of all non-descendants of Y, then Y is independent of all its non-parents, non-descendants conditional on its parents X, and this relation holds under aggregation.
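A simulation sketch of the linear case, assuming a hypothetical per-cell chain Z → X → Y and measurements that are sums over many cells (as with microarrays). The sample sizes and coefficients are arbitrary; the point is only that the aggregated Z and Y remain (linearly) independent given the aggregated parent X.

```python
import numpy as np

rng = np.random.default_rng(0)
n_samples, n_cells = 20000, 50  # hypothetical sizes

# Per-cell linear chain Z -> X -> Y with independent Gaussian noise.
Z = rng.normal(size=(n_samples, n_cells))
X = 0.9 * Z + rng.normal(size=(n_samples, n_cells))
Y = -0.7 * X + rng.normal(size=(n_samples, n_cells))

# Aggregated measurements: one sum over cells per sample.
Zs, Xs, Ys = Z.sum(axis=1), X.sum(axis=1), Y.sum(axis=1)


def partial_corr(a, b, c):
    """Correlation of a and b after regressing c (with intercept) out of each."""
    design = np.column_stack([np.ones_like(c), c])
    res_a = a - design @ np.linalg.lstsq(design, a, rcond=None)[0]
    res_b = b - design @ np.linalg.lstsq(design, b, rcond=None)[0]
    return float(np.corrcoef(res_a, res_b)[0, 1])


agg_marginal = float(np.corrcoef(Zs, Ys)[0, 1])
agg_partial = partial_corr(Zs, Ys, Xs)
print(f"corr(Z, Y) after aggregation:     {agg_marginal:.3f}")
print(f"corr(Z, Y | X) after aggregation: {agg_partial:.3f}")
```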

Direct versus Indirect Effects

What is a manipulation?

Setting the value of a variable means you break the influence of whatever normally influences it, e.g. setting arcA := 0.

Direct v. total effects:

Total effects: If I manipulate fnr and let everything else vary as usual, what happens to sodA?

Direct effects: If I manipulate fnr and also clamp arcA to its current value, what happens to sodA?

(What if I don't know whether to clamp narK?)
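The distinction can be made concrete with a toy linear model. Everything in the sketch below is an assumption for illustration: the edges fnr → arcA, arcA → sodA, fnr → sodA and the three coefficients are hypothetical, not estimates from data.

```python
# Hypothetical structural coefficients (assumptions, not measured values).
A_FNR_TO_ARCA = 0.5
B_ARCA_TO_SODA = -1.0
C_FNR_TO_SODA = -0.8


def simulate(fnr, arcA_clamped=None):
    """Return (arcA, sodA) when fnr is set by manipulation.

    If arcA_clamped is given, arcA is also manipulated and no longer
    responds to fnr.
    """
    arcA = A_FNR_TO_ARCA * fnr if arcA_clamped is None else arcA_clamped
    sodA = B_ARCA_TO_SODA * arcA + C_FNR_TO_SODA * fnr
    return arcA, sodA


baseline_arcA, baseline_sodA = simulate(fnr=0.0)

# Total effect: change fnr, let arcA respond as usual.
_, sodA_total = simulate(fnr=1.0)
total_effect = sodA_total - baseline_sodA

# Direct effect: change fnr while clamping arcA at its baseline value.
_, sodA_direct = simulate(fnr=1.0, arcA_clamped=baseline_arcA)
direct_effect = sodA_direct - baseline_sodA

print("total effect of fnr on sodA: ", total_effect)
print("direct effect of fnr on sodA:", direct_effect)
```

In this linear toy model the total effect equals the direct coefficient plus the product of the coefficients along the path through arcA.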

Gold Standard: Direct v. Total Effects

Direct effects:

For known true graph: evaluate using graph similarity metrics (true positive and false positive rates for adjacencies and orientations; structural Hamming distance)

For unknown true graph: experimental control of all potential back-door paths; mechanistic approach: protein binding arrays

Total effects:

For known true graph: is there a path from cause to effect in both graphs? (Path length? Size of path coefficients?); structural intervention distance

For unknown true graph: the truth comes from gene knock-out experiments, which estimate the true total causal effect; the prediction comes from IDA
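For the known-graph case, adjacency counts and the structural Hamming distance are easy to compute from adjacency matrices. The sketch below assumes the convention `adj[i, j] == 1` for an edge i → j and counts a reversed edge as one error; other conventions exist, and the function names are ours.

```python
import numpy as np


def adjacency_counts(true_adj, est_adj):
    """True positives, false positives, false negatives for adjacencies."""
    t = (true_adj + true_adj.T) > 0
    e = (est_adj + est_adj.T) > 0
    upper = np.triu_indices_from(t, k=1)   # each unordered pair once
    t, e = t[upper], e[upper]
    tp = int(np.sum(t & e))
    fp = int(np.sum(~t & e))
    fn = int(np.sum(t & ~e))
    return tp, fp, fn


def structural_hamming_distance(true_adj, est_adj):
    """Number of node pairs whose edge status (absent, i->j, j->i) differs."""
    n = true_adj.shape[0]
    dist = 0
    for i in range(n):
        for j in range(i + 1, n):
            if (true_adj[i, j], true_adj[j, i]) != (est_adj[i, j], est_adj[j, i]):
                dist += 1
    return dist


# True graph: 0 -> 1 -> 2.  Estimate: 0 -> 1, 2 -> 1, 0 -> 2.
true_adj = np.array([[0, 1, 0], [0, 0, 1], [0, 0, 0]])
est_adj = np.array([[0, 1, 1], [0, 0, 0], [0, 1, 0]])

counts = adjacency_counts(true_adj, est_adj)
shd = structural_hamming_distance(true_adj, est_adj)
print("adjacency (TP, FP, FN):", counts)
print("structural Hamming distance:", shd)
```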

Patterns

A Markov Equivalence Class (MEC) contains all DAGs consistent with a set of conditional independences

All DAGs in the MEC share the same adjacencies

All share the same “unshielded colliders”

Can be represented with a “pattern”:

common “skeleton” of adjacencies

edges directed when all the DAGs within the MEC agree on their orientation

edges left undirected otherwise
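The two shared properties give a direct test for Markov equivalence of two DAGs: same skeleton and same unshielded colliders. A minimal sketch, with DAGs given as lists of (parent, child) pairs and helper names of our own choosing:

```python
from itertools import combinations


def skeleton(edges):
    """Set of adjacencies, ignoring edge direction."""
    return {frozenset(edge) for edge in edges}


def unshielded_colliders(edges):
    """Triples a -> c <- b in which a and b are not adjacent."""
    skel = skeleton(edges)
    parents = {}
    for parent, child in edges:
        parents.setdefault(child, set()).add(parent)
    colliders = set()
    for child, pas in parents.items():
        for a, b in combinations(sorted(pas), 2):
            if frozenset((a, b)) not in skel:
                colliders.add((frozenset((a, b)), child))
    return colliders


def markov_equivalent(edges_1, edges_2):
    return (skeleton(edges_1) == skeleton(edges_2)
            and unshielded_colliders(edges_1) == unshielded_colliders(edges_2))


chain = [("X", "Y"), ("Y", "Z")]           # X -> Y -> Z
reversed_chain = [("Z", "Y"), ("Y", "X")]  # X <- Y <- Z
collider = [("X", "Y"), ("Z", "Y")]        # X -> Y <- Z

print(markov_equivalent(chain, reversed_chain))  # same MEC
print(markov_equivalent(chain, collider))        # different MECs
```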

Intervention calculus when the DAG is Absent (IDA)

[Figure: graphs over fnr, arcA, and sodA, each labeled with the effect of arcA on sodA: 0, 0, 0, q1, q2, q3]

• Run PC.

• Calculate the effect q of manipulating arcA on sodA in each graph.

• Count how many times each value of q occurs.

{0,0,0,q1,q2,q3}
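A sketch of the enumeration step, under strong simplifying assumptions: the PC step is skipped and the pattern is taken to be the complete undirected graph on the three genes, so that every variable ordering gives one DAG in the equivalence class. The "true" model used to build the covariance matrix is hypothetical, and the resulting multiset need not match the one in the figure.

```python
import itertools

import numpy as np

# Hypothetical linear model: fnr -> arcA, arcA -> sodA, fnr -> sodA.
# Variable indices: 0 = fnr, 1 = arcA, 2 = sodA.
B = np.zeros((3, 3))
B[1, 0] = 0.5
B[2, 1] = -1.0
B[2, 0] = -0.8
A = np.linalg.inv(np.eye(3) - B)
Sigma = A @ A.T  # population covariance with unit-variance noise


def regression_coef(cov, y, x, adjust):
    """Coefficient of x in the linear regression of y on x and `adjust`."""
    idx = [x] + list(adjust)
    beta = np.linalg.solve(cov[np.ix_(idx, idx)], cov[idx, y])
    return beta[0]


def ida_multiset(cov, x, y, n_vars):
    """Effect of x on y in every DAG of a complete-graph equivalence class."""
    effects = []
    for order in itertools.permutations(range(n_vars)):
        pos = {v: i for i, v in enumerate(order)}
        if pos[y] < pos[x]:
            effects.append(0.0)  # y is a parent of x, so x cannot affect y
        else:
            parents_of_x = [v for v in order if pos[v] < pos[x]]
            effects.append(regression_coef(cov, y, x, parents_of_x))
    return effects


effects = ida_multiset(Sigma, x=1, y=2, n_vars=3)
print(sorted(round(float(q), 3) for q in effects))
```

The true effect of arcA on sodA in the hypothetical model (-1.0) is one member of the multiset; the observational data alone cannot say which member is correct.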

IDA

[Figure: graphs over fnr, arcA, and sodA, labeled with the effects q1, q2, q1, q2]

• Too expensive if there are many variables.

• By considering only the local structure around sodA and arcA, one can calculate all possible effects, but not how many times they occur.

{0,q1,q2,q3}
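The local variant can be sketched as an enumeration of candidate parent sets for the manipulated variable: every subset of its undirected neighbours is tried, and a subset is kept only if orienting it into the variable creates no new unshielded collider there. The pattern used below (fnr, arcA, sodA in an undirected chain) and the function name are assumptions for illustration.

```python
from itertools import combinations


def locally_valid_parent_sets(definite_parents, siblings, adjacent):
    """Subsets S of the undirected neighbours of X that may serve as
    additional parents of X without creating a new unshielded collider at X.

    `adjacent(a, b)` reports whether a and b are adjacent in the pattern.
    """
    valid = []
    sibs = sorted(siblings)
    for size in range(len(sibs) + 1):
        for subset in combinations(sibs, size):
            candidates = set(subset) | set(definite_parents)
            no_new_collider = all(
                adjacent(a, b)
                for a in subset
                for b in candidates
                if a != b
            )
            if no_new_collider:
                valid.append(set(subset))
    return valid


# Hypothetical pattern: fnr - arcA - sodA, with fnr and sodA not adjacent.
pattern_skeleton = {frozenset(("fnr", "arcA")), frozenset(("arcA", "sodA"))}


def adjacent(a, b):
    return frozenset((a, b)) in pattern_skeleton


parent_sets = locally_valid_parent_sets([], {"fnr", "sodA"}, adjacent)
print(parent_sets)
```

Each surviving parent set yields one possible effect (zero whenever the target itself is in the set), which gives the set of distinct values but not their multiplicities.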

IDA application

p = 5360 genes (expression of genes)

231 gene knock-downs; 1.2 × 10^6 intervention effects

the truth is “known in good approximation” (thanks to intervention experiments)

goal: prediction of the true large intervention effects based on observational data with no knock-downs

n = 63 observational samples

IDA

Maathuis

CStaR

Runs IDA multiple times in order to choose the genes that are most stably among the ones selected as the strongest.

Sample 50% of the original data set (with replacement) one hundred times.

For each subsample, run the IDA algorithm.

Take the output of the IDA algorithm for that subsample and order the variables by the size of the estimated (by IDA) lower bound of the total effect of the variable on the target.

Stability Selection Step

Record the frequency with which a given variable appears in the top q of the total effect sizes, for a user-selected value q.

Select the variables that appear with the highest frequency (that is, those judged most often to have large lower bounds of total effects on the target).
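The subsample, rank, and count loop can be sketched as follows. Note the loud assumption: `placeholder_effect_bounds` is a stand-in that scores each covariate by the absolute marginal regression slope; it is NOT the IDA lower bound, which would require running PC and IDA on each subsample. The data are simulated and all names are hypothetical.

```python
import numpy as np


def placeholder_effect_bounds(X, y):
    """Stand-in score (NOT IDA): absolute marginal regression slope per column."""
    Xc = X - X.mean(axis=0)
    yc = y - y.mean()
    return np.abs(Xc.T @ yc / (Xc ** 2).sum(axis=0))


def selection_frequencies(X, y, q, n_subsamples=100, seed=0,
                          score=placeholder_effect_bounds):
    """Fraction of subsamples in which each variable ranks in the top q."""
    rng = np.random.default_rng(seed)
    n, p = X.shape
    counts = np.zeros(p)
    for _ in range(n_subsamples):
        # 50% of the data, drawn with replacement as stated above.
        idx = rng.choice(n, size=n // 2, replace=True)
        scores = score(X[idx], y[idx])
        top_q = np.argsort(scores)[::-1][:q]
        counts[top_q] += 1
    return counts / n_subsamples


# Simulated example: only variable 0 truly affects the target.
rng = np.random.default_rng(1)
X = rng.normal(size=(60, 20))
y = 2.0 * X[:, 0] + 0.5 * rng.normal(size=60)

q = 3
freq = selection_frequencies(X, y, q=q, seed=2)
print("selection frequency of variable 0:", freq[0])
print("most frequently selected:", np.argsort(freq)[::-1][:q])
```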

Stability Selection

Order the variables by selection frequency, from most often to least often selected: Π1 ≥ Π2 ≥ . . . ≥ Πp.

Define the stably selected genes (covariates) as Ŝstable = {j : Πj ≥ πthr} for some threshold 0.5 < πthr ≤ 1.

Denote the number of wrongly selected genes (false positives) by V = |Ŝstable ∩ Sfalse|, where Sfalse is the set of genes (covariates) whose true lower bound βj is 0.

Stability Selection

For a given threshold πthr and a given value of q,

if
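As a reference point, and assuming the result intended here is the standard stability-selection error bound of Meinshausen and Bühlmann, the statement can be written as follows, with V the number of false positives and p the number of covariates:

```latex
\mathbb{E}[V] \;\le\; \frac{1}{2\pi_{\mathrm{thr}} - 1}\,\frac{q^{2}}{p}
```

The bound holds if the underlying selection procedure is no worse than random guessing and an exchangeability condition on the selection of the false covariates is satisfied.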

Stability Selection

CStaR is relatively insensitive to the choice of the range of qs.

Down to a certain lower bound, small values of q lead to higher sensitivity.

For q-values below the lower bound, the ranking becomes unstable again.

Stability Selection

All genes are ranked according to the median rank with respect to the different q-values.

Ties in the final ranking are sorted according to median total causal effect size.
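The final ordering can be sketched with a two-key sort: median rank across q-values first, larger median effect size first among ties. The input layout (one row per gene, one column per q-value) and the toy numbers are assumptions for illustration.

```python
import numpy as np


def final_ranking(ranks_by_q, median_effect_size):
    """Order genes by median rank; break ties by larger median effect size.

    ranks_by_q: array (n_genes, n_q_values), rank 1 = best.
    median_effect_size: array (n_genes,), non-negative.
    Returns gene indices from best to worst.
    """
    median_rank = np.median(ranks_by_q, axis=1)
    # np.lexsort uses the LAST key as the primary sort key.
    return np.lexsort((-np.asarray(median_effect_size), median_rank))


ranks_by_q = np.array([[1, 2],    # gene 0: median rank 1.5
                       [2, 1],    # gene 1: median rank 1.5 (tie with gene 0)
                       [3, 3]])   # gene 2: median rank 3
median_effect_size = np.array([0.3, 0.9, 0.1])

order = final_ranking(ranks_by_q, median_effect_size).tolist()
print(order)
```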

CStaR Results

Choosing Experiments: Stekhoven et al.

Mouse-ear cress (Arabidopsis thaliana)

Response Y: days to bolting (flowering) of the plant

Covariates X: gene-expression profile

Observational data with n = 47 and p = 21,326

Experimental Confirmation: Stekhoven et al.

PC + IDA + stability selection

Performed experiments on 14 of the top 20 genes (not previously known, with an easily available mutant)

Results: Arabidopsis thaliana

9 among the 14 mutants survived

4 among the 9 mutants (genes) showed a significant effect on Y relative to the wild type (non-mutated plant)

Chen and Storey

“For an individual organism, DNA has the useful feature that it is usually a static variable, meaning that it is fixed and will not change with changing RNA levels, protein levels, phenotypes, or environmental conditions. By performing designed crosses of genetically distinct inbred or isogenic lines, one can randomize the genotypes of an organism from two or more genetic backgrounds, thereby producing independent realizations of DNA content from offspring to offspring.”

Chen and Storey

Identify causal relations of the form L → Ti → Tj, where L is known to be exogenous and prior to Ti and Tj

L is the genotype at a fixed locus, generated through crossing two haploid parental strains to produce 112 recombinant haploid segregant strains

Ti and Tj are expression levels of genes

Given this background knowledge, we just need to determine that L and Ti are dependent, Ti and Tj are dependent, and Tj is independent of L given Ti.
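A simulated check of the pattern implied by the chain L → Ti → Tj: the locus and Ti are correlated, Ti and Tj are correlated, and the locus carries no further (linear) information about Tj once Ti is known. The model, coefficients, and sample size are assumptions for illustration; the sample is made much larger than the 112 segregants only to keep sampling noise small, and partial correlation detects linear dependence only.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 20000  # hypothetical; far more than 112, to reduce sampling noise

L = rng.integers(0, 2, size=n).astype(float)  # randomized genotype (0 or 1)
Ti = 1.0 * L + rng.normal(size=n)             # L -> Ti
Tj = 0.8 * Ti + rng.normal(size=n)            # Ti -> Tj


def corr(a, b):
    return float(np.corrcoef(a, b)[0, 1])


def partial_corr(a, b, c):
    """First-order partial correlation of a and b given c."""
    r_ab, r_ac, r_bc = corr(a, b), corr(a, c), corr(b, c)
    return (r_ab - r_ac * r_bc) / np.sqrt((1 - r_ac ** 2) * (1 - r_bc ** 2))


r_L_Ti = corr(L, Ti)
r_Ti_Tj = corr(Ti, Tj)
r_L_Tj_given_Ti = partial_corr(L, Tj, Ti)
print(f"corr(L, Ti)      = {r_L_Ti:.3f}")
print(f"corr(Ti, Tj)     = {r_Ti_Tj:.3f}")
print(f"corr(L, Tj | Ti) = {r_L_Tj_given_Ti:.3f}")
```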

Faith et al.

Used Many Microbes Microarrays Database (M3D)

Evaluated using RegulonDB

Restricted search space: only allowed edges out of genes coding for TFs

Compared several search algorithms (but not a fair comparison for the Bayes net learning algorithm)

References

L. S. Chen, F. Emmert-Streib, and J. D. Storey. Harnessing naturally randomized transcription to infer regulatory relationships among genes. Genome Biology, 2007.

T. Chu, C. Glymour, R. Scheines, and P. Spirtes. A statistical problem for inference to regulatory structure from associations of gene expression measurement with microarrays. Bioinformatics, 19:1147–1152, 2003. PMID: 12801876.

J. J. Faith, B. Hayete, J. T. Thaden, I. Mogno, J. Wierzbowski, G. Cottarel, S. Kasif, J. J. Collins, and T. S. Gardner. Large-scale mapping and validation of Escherichia coli transcriptional regulation from a compendium of expression profiles. PLOS Biology, 5(1):0054–0066, 2007.

J. Li and M. Biggin. Statistics requantitates the central dogma. Science, 347(6226):1066–1067, 2015.

M. H. Maathuis, D. Colombo, M. Kalisch, and P. Bühlmann. Predicting causal effects in large-scale systems from observational data. Nature Methods, 7(4):247–248, 2010.

K. Sachs, O. Perez, D. Pe’er, D. A. Lauffenburger, and G. P. Nolan. Causal protein-signaling networks derived from multiparameter single-cell data. Science, 308:523–529, 2005.

D. J. Stekhoven, I. Moraes, G. Sveinbjornsson, L. Hennig, M. H. Maathuis, and P. Bühlmann. Causal stability ranking. Bioinformatics, 28(21):2819–2823, 2012.