Causal Inference & Genetic Regulatory Networks Peter Spirtes Carnegie Mellon University With slides from Lizzie Silver


Outline

Biology

Data and Background Knowledge

Problems

Algorithms

Causal Graph

[Figure: regulatory network diagram with nodes labeled gene, mRNA, and protein]

Sources: http://www.ornl.gov/sci/techresources/Human_Genome/graphics/slides/images/REGNET.jpg

Protein States

Folding

Location (nucleus, membrane, etc.)

Phosphorylation at different sites

Ubiquitination

etc.


Levels of Description

Differential equation models of rates of reaction

Which transcription factors bind to which sites, and can in turn be prevented or aided in binding to sites

Which gene products affect the rate at which other genes produce proteins

Example Graph

Outline

Biology

Data and Background Knowledge

Problems

Algorithms

Knockout Experiments

Insert a DNA construct into the cell

The DNA construct recombines with the target gene

The target gene is then either not expressed at all or produces a nonfunctional protein

Data: Hughes lab yeast data

Hughes lab data on S. cerevisiae: 63 wild-type strains, 267 gene-deletion mutants

No information on direct effects, only total effects

But does include information on non-effects

Must normalize the data to account for differential manipulation - what is the right normalization method?

Data: M3D

Many Microbes Microarrays Database (M3D)

Collected from multiple sources of published data: different labs, strains, experimental conditions, genetic manipulations

E. coli (907), S. oneidensis (245), S. cerevisiae (904)

Experiment descriptions/features standardized

Affymetrix arrays uniformly normalized using Robust Multi-array Average (RMA); raw data also available

Data: RegulonDB

RegulonDB: The E. coli regulatory network database

Curated database of expert knowledge, constantly updated

List of known TF → gene effects, annotated with valence (+;-;±), type(s) of evidence supporting, publications supporting

No information on size of effects

Some information about direct effects, from Chromatin ImmunoPrecipitation (ChIP) assays, recognized binding sites, etc.

If an effect is not in RegulonDB, that doesn't mean it doesn't exist!

Outline

Biology

Data and Background Knowledge

Problems

Algorithms

Search for Genetic Regulatory Networks

Goal: Search for a Directed Acyclic Graph (DAG) representing the “direct” regulatory effects (relative to the set of genes).

This is a hard problem!

The number of DAGs is super-exponential in the number of genes (4,300 genes in E. coli)

Unobserved confounders: environmental conditions, excluded genes, latent TFs that influence multiple genes

Cycles

Unfaithfulness

Non-linearity, non-Gaussianity

Density not strictly positive

Small sample size

Aggregation
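To give a feel for how fast the number of DAGs grows, here is a minimal sketch that counts labeled DAGs on n nodes using Robinson's recurrence. The function name `count_dags` is an illustrative choice, not something from the slides.

```python
from functools import lru_cache
from math import comb


@lru_cache(maxsize=None)
def count_dags(n: int) -> int:
    """Number of labeled DAGs on n nodes (Robinson's recurrence)."""
    if n == 0:
        return 1
    # Inclusion-exclusion over the k nodes that have no incoming edges.
    return sum(
        (-1) ** (k + 1) * comb(n, k) * 2 ** (k * (n - k)) * count_dags(n - k)
        for k in range(1, n + 1)
    )


for n in range(1, 8):
    print(n, count_dags(n))
```

Already at 5 nodes there are 29,281 DAGs, so exhaustive search over thousands of genes is out of the question.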

How can we evaluate performance?

Li and Biggin

Feedback

The equilibrium state of a feedback system can be represented by a cyclic graph.

If the joint distribution is Gaussian or multinomial, then the natural extension of d-separation to cyclic graphs entails the corresponding conditional independence.

If the joint distribution is not Gaussian or multinomial, then the natural extension of d-separation to cyclic graphs does not entail the corresponding conditional independence.

Feedback

There is an extended sense of “pattern” to represent the set of all Markov equivalent graphs, cyclic or acyclic.

However, it is much more complicated than a pattern: not all Markov equivalent cyclic graphs share the same set of adjacencies, and there are dependencies among which edges are present.

There is an algorithm that is an extension of the PC algorithm for searching for cyclic graphs.

We do not have a graphical representation of the set of all Markov equivalent graphs with cycles and latent variables.

Local Markov Theorem (Chu)

Let G be an acyclic graph representing the causal relations among a set V of random variables. Let Y, X1, . . . , Xk ∈ V, and let X = {X1, . . . , Xk} be the set of parents of Y in G. If Y = c^T X + ε, where c^T = (c1, . . . , ck) and ε is a noise term independent of all non-descendants of Y, then Y is independent of all its non-parent non-descendants conditional on its parents X, and this relation holds under aggregation.
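A small simulation can illustrate the linear case of the theorem. The sketch below is only an illustration under assumed values: the variables W, X1, X2, the coefficients, the Gaussian noise, and the aggregation size of 10 are all hypothetical. It checks that the partial correlation between Y and a non-descendant W, given Y's parents, is close to zero both per cell and after summing over groups of cells.

```python
import numpy as np


def partial_corr(a, b, given):
    """Correlation of a and b after regressing out `given` (with intercept)."""
    Z = np.column_stack([np.ones(len(a)), given])
    res_a = a - Z @ np.linalg.lstsq(Z, a, rcond=None)[0]
    res_b = b - Z @ np.linalg.lstsq(Z, b, rcond=None)[0]
    return float(np.corrcoef(res_a, res_b)[0, 1])


def aggregate(v, size=10):
    """Sum measurements over consecutive groups of `size` cells."""
    return v.reshape(-1, size).sum(axis=1)


rng = np.random.default_rng(0)
n_cells = 200_000

# Hypothetical linear model: W -> X1 -> Y <- X2, so the parents of Y are X1, X2.
W = rng.normal(size=n_cells)               # non-parent, non-descendant of Y
X1 = 0.8 * W + rng.normal(size=n_cells)
X2 = rng.normal(size=n_cells)
Y = 1.5 * X1 - 0.7 * X2 + rng.normal(size=n_cells)

# Per-cell data: Y and W are dependent, but not once we condition on the parents.
pc_cells = partial_corr(Y, W, np.column_stack([X1, X2]))

# Aggregated data: the same conditional independence is preserved.
Wa, X1a, X2a, Ya = (aggregate(v) for v in (W, X1, X2, Y))
pc_agg = partial_corr(Ya, Wa, np.column_stack([X1a, X2a]))

print("marginal corr(Y, W), aggregated:", float(np.corrcoef(Ya, Wa)[0, 1]))
print("partial corr per cell:", pc_cells)
print("partial corr aggregated:", pc_agg)
```

The check relies on linearity: a sum of linear relations is again linear with the same coefficients, which is why aggregation does not disturb the independence here.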


Outline

Biology

Data and Background Knowledge

Problems

Algorithms

Direct versus Indirect Effects

What is a manipulation? Setting the value of a variable means you break the influence of whatever normally influences it, e.g. setting arcA := 0

Direct v. total effects:

Total effects: If I manipulate fnr and let everything else vary as usual, what happens to sodA?

Direct effects: If I manipulate fnr and also clamp arcA to its current value, what happens to sodA?

(What if I don't know whether to clamp narK?)
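In a linear model the distinction can be made concrete. The sketch below assumes a hypothetical structure fnr → arcA → sodA plus fnr → sodA, with made-up coefficients (not real E. coli measurements). The direct effect is the single edge coefficient, while the total effect sums over all directed paths.

```python
import numpy as np

genes = ["fnr", "arcA", "sodA"]
idx = {g: i for i, g in enumerate(genes)}

# B[i, j] = direct effect of gene j on gene i (hypothetical values).
B = np.zeros((3, 3))
B[idx["arcA"], idx["fnr"]] = 0.5
B[idx["sodA"], idx["arcA"]] = -0.8
B[idx["sodA"], idx["fnr"]] = 0.3

# In an acyclic linear model, total effects are (I - B)^-1 - I,
# i.e. the sum over all directed paths of the product of coefficients.
total = np.linalg.inv(np.eye(3) - B) - np.eye(3)

direct_fnr_sodA = B[idx["sodA"], idx["fnr"]]
total_fnr_sodA = total[idx["sodA"], idx["fnr"]]

print("direct effect of fnr on sodA:", direct_fnr_sodA)  # clamp arcA
print("total effect of fnr on sodA:", total_fnr_sodA)    # let arcA vary
```

With these numbers the direct effect is 0.3 but the total effect is 0.3 + 0.5 × (−0.8) = −0.1, so the two can even differ in sign.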

Gold Standard: Direct v. Total Effects

Direct effects:

For known true graph: evaluate using graph similarity metrics (true positive and false positive rates for adjacencies and orientations; structural Hamming distance)

For unknown true graph: experimental control of all potential back-door paths; mechanistic approach (protein binding arrays)

Total effects:

For known true graph: is there a path from cause to effect in both graphs? (Path length? Size of path coefficients?) Structural intervention distance

For unknown true graph: truth from gene knock-out experiments, which estimate the true total causal effect; prediction from IDA
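As a rough illustration of the graph similarity metrics for a known true graph, the sketch below computes the structural Hamming distance and the adjacency true/false positive rates between two DAGs. The two example graphs are hypothetical, and the convention used here (a reversed edge counts as one difference) is one common choice among several.

```python
from itertools import combinations


def edge_state(edges, a, b):
    """Return '->', '<-' or None for the pair (a, b) in a set of directed edges."""
    if (a, b) in edges:
        return "->"
    if (b, a) in edges:
        return "<-"
    return None


def shd(nodes, true_edges, est_edges):
    """Structural Hamming distance: pairs whose edge status differs."""
    return sum(
        edge_state(true_edges, a, b) != edge_state(est_edges, a, b)
        for a, b in combinations(nodes, 2)
    )


def adjacency_rates(nodes, true_edges, est_edges):
    """True positive rate and false positive rate for adjacencies."""
    tp = fp = pos = neg = 0
    for a, b in combinations(nodes, 2):
        in_true = edge_state(true_edges, a, b) is not None
        in_est = edge_state(est_edges, a, b) is not None
        pos += in_true
        neg += not in_true
        tp += in_true and in_est
        fp += (not in_true) and in_est
    return tp / pos, fp / neg


# Hypothetical "true" and estimated graphs over four genes.
nodes = ["fnr", "arcA", "sodA", "narK"]
true_edges = {("fnr", "arcA"), ("arcA", "sodA"), ("fnr", "narK")}
est_edges = {("arcA", "fnr"), ("arcA", "sodA"), ("sodA", "narK")}

distance = shd(nodes, true_edges, est_edges)
tpr, fpr = adjacency_rates(nodes, true_edges, est_edges)
print("SHD:", distance, "adjacency TPR:", tpr, "adjacency FPR:", fpr)
```

Here one edge is reversed, one is missing, and one is spurious, giving a distance of 3.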

Patterns

Markov Equivalence Class (MEC) contains all DAGs consistent with a set of conditional independences

All DAGs in the MEC share the same adjacencies

All share the same “unshielded colliders”

Can be represented with a “pattern”: a common “skeleton” of adjacencies, with edges directed when all the DAGs within the MEC agree on their orientation, and edges left undirected otherwise
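The two criteria above (same adjacencies, same unshielded colliders) give a direct test for Markov equivalence of two DAGs. The following is a minimal sketch; the helper names and the three-gene example graphs are illustrative only, and the graphs are assumed to share the same node set.

```python
from itertools import combinations


def skeleton(edges):
    """Set of adjacencies, ignoring edge direction."""
    return {frozenset(e) for e in edges}


def unshielded_colliders(edges):
    """Triples (a, child, b) with a -> child <- b and a, b not adjacent."""
    adjacent = skeleton(edges)
    parents = {}
    for a, b in edges:
        parents.setdefault(b, set()).add(a)
    found = set()
    for child, ps in parents.items():
        for a, b in combinations(sorted(ps), 2):
            if frozenset((a, b)) not in adjacent:
                found.add((a, child, b))
    return found


def markov_equivalent(edges1, edges2):
    """Two DAGs are Markov equivalent iff both criteria agree."""
    return (skeleton(edges1) == skeleton(edges2)
            and unshielded_colliders(edges1) == unshielded_colliders(edges2))


chain = {("fnr", "arcA"), ("arcA", "sodA")}
reversed_chain = {("sodA", "arcA"), ("arcA", "fnr")}
fork = {("arcA", "fnr"), ("arcA", "sodA")}
collider = {("fnr", "arcA"), ("sodA", "arcA")}

print(markov_equivalent(chain, reversed_chain))  # same skeleton, no colliders
print(markov_equivalent(chain, fork))
print(markov_equivalent(chain, collider))        # collider at arcA differs
```

The chain, the reversed chain, and the fork are mutually equivalent, so their pattern is the undirected path fnr - arcA - sodA; the collider stands alone in its class.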

Intervention Effects When the DAG is Absent (IDA)

[Figure: candidate graphs over fnr, arcA, and sodA, labeled with the effects 0, 0, 0, q1, q2, q3]

• Run PC.
• Calculate the effect q of manipulating arcA on sodA in each graph.
• Count how many times each value q occurs.

{0,0,0,q1,q2,q3}
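The global procedure can be sketched for a tiny example. The code below does not run PC; as a stated simplification it starts from a hypothetical reference DAG, enumerates its Markov equivalence class by brute force, and computes the effect of arcA on sodA in each member by regressing sodA on arcA and the parents of arcA. The coefficients are made up, and the population covariance is used in place of data.

```python
from itertools import combinations, product
from collections import Counter
import numpy as np

NODES = ["fnr", "arcA", "sodA"]
IDX = {v: i for i, v in enumerate(NODES)}


def unshielded_colliders(edges):
    adjacent = {frozenset(e) for e in edges}
    found = set()
    for child in NODES:
        parents = sorted(a for a, b in edges if b == child)
        for a, b in combinations(parents, 2):
            if frozenset((a, b)) not in adjacent:
                found.add((a, child, b))
    return found


def is_acyclic(edges):
    remaining = set(NODES)
    while remaining:
        # Nodes with no incoming edge from the remaining nodes.
        sources = {v for v in remaining
                   if not any(b == v and a in remaining for a, b in edges)}
        if not sources:
            return False
        remaining -= sources
    return True


def markov_equivalence_class(reference_edges):
    """Brute force: every acyclic orientation with the same unshielded colliders."""
    pairs = sorted({tuple(sorted(e)) for e in reference_edges})
    target = unshielded_colliders(reference_edges)
    for choice in product(*[((a, b), (b, a)) for a, b in pairs]):
        edges = set(choice)
        if is_acyclic(edges) and unshielded_colliders(edges) == target:
            yield edges


def effect(cov, x, y, parents_of_x):
    """Coefficient of x when regressing y on x and the parents of x (0 if y is a parent)."""
    if y in parents_of_x:
        return 0.0
    cols = [IDX[x]] + [IDX[p] for p in sorted(parents_of_x)]
    beta = np.linalg.solve(cov[np.ix_(cols, cols)], cov[cols, IDX[y]])
    return float(beta[0])


# Hypothetical linear model with unit-variance noise; B[i, j] = effect of j on i.
B = np.zeros((3, 3))
B[IDX["arcA"], IDX["fnr"]] = 0.5
B[IDX["sodA"], IDX["arcA"]] = -0.8
B[IDX["sodA"], IDX["fnr"]] = 0.3
A = np.linalg.inv(np.eye(3) - B)
cov = A @ A.T

reference = {("fnr", "arcA"), ("arcA", "sodA"), ("fnr", "sodA")}
effects = []
for dag in markov_equivalence_class(reference):
    parents = {a for a, b in dag if b == "arcA"}
    effects.append(round(effect(cov, "arcA", "sodA", parents), 6))

print(Counter(effects))
```

For this fully connected example the class has six DAGs, giving three zeros, −0.68 twice, and −0.8 once; only the last is the true effect, which is why IDA reports the whole multiset rather than a single number.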

IDA

[Figure: candidate graphs over fnr, arcA, and sodA, with effect labels q1, q2, q1, q2]

• Too expensive if many variables.
• By considering only the local structure around sodA and arcA, one can calculate all possible effects, but not how many times they occur.

{0,q1,q2,q3}
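The local variant can be sketched as follows, under stated assumptions: the pattern is a hypothetical fully undirected triangle over the three genes, the coefficients are made up, and the population covariance stands in for data. Instead of enumerating whole DAGs, only the possible parent sets of arcA are enumerated, skipping any that would create a new unshielded collider at arcA.

```python
from itertools import combinations
import numpy as np

NODES = ["fnr", "arcA", "sodA"]
IDX = {v: i for i, v in enumerate(NODES)}


def candidate_parent_sets(definite_parents, undirected_neighbours, is_adjacent):
    """Definite parents plus each subset of undirected neighbours that does not
    create a new unshielded collider a -> x <- b at the manipulated variable."""
    nbrs = sorted(undirected_neighbours)
    for size in range(len(nbrs) + 1):
        for extra in combinations(nbrs, size):
            parents = set(definite_parents) | set(extra)
            if all(is_adjacent(a, b) for a in extra for b in parents if a != b):
                yield parents


def effect(cov, x, y, parents_of_x):
    """Coefficient of x when regressing y on x and the parents of x (0 if y is a parent)."""
    if y in parents_of_x:
        return 0.0
    cols = [IDX[x]] + [IDX[p] for p in sorted(parents_of_x)]
    beta = np.linalg.solve(cov[np.ix_(cols, cols)], cov[cols, IDX[y]])
    return float(beta[0])


# Hypothetical linear model with unit-variance noise; B[i, j] = effect of j on i.
B = np.zeros((3, 3))
B[IDX["arcA"], IDX["fnr"]] = 0.5
B[IDX["sodA"], IDX["arcA"]] = -0.8
B[IDX["sodA"], IDX["fnr"]] = 0.3
A = np.linalg.inv(np.eye(3) - B)
cov = A @ A.T

# Hypothetical pattern: all three edges undirected.
skeleton = {frozenset(p) for p in combinations(NODES, 2)}


def is_adjacent(a, b):
    return frozenset((a, b)) in skeleton


possible = {
    round(effect(cov, "arcA", "sodA", parents), 6)
    for parents in candidate_parent_sets(set(), {"fnr", "sodA"}, is_adjacent)
}
print(sorted(possible))
```

The result is a set of distinct possible effects (here 0, −0.68, and −0.8) without multiplicities, which matches the trade-off stated above: the local method is cheap but cannot say how often each value occurs across the equivalence class.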

IDA application

p = 5360 genes (expression of genes)

231 gene knock-downs; 1.2 × 10^6 intervention effects

the truth is “known in good approximation” (thanks to intervention experiments)

goal: prediction of the true large intervention effects based on observational data with no knock-downs

n = 63 (observational data)

[Figure: IDA results, Maathuis et al.]

CStaR

Runs IDA multiple times in order to choose the genes that are most stably among the ones selected as the strongest.

Sample 50% of the original data set (with replacement) one hundred times.

For each subsample, run the IDA algorithm.

Take the output of the IDA algorithm for that subsample and order the variables by the size of the estimated (by IDA) lower bound of the total effect of the variable on the target.


Stability Selection Step

Record the frequency with which a given variable appears in the top q of the total effect sizes, for a user-selected value q.

Select the variables that appear with the highest frequency (that is, those most often judged to have large lower bounds of total effects on the target).
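The subsampling and top-q bookkeeping can be sketched as below. This is not an implementation of IDA: as a loudly labeled stand-in, `effect_lower_bounds` simply uses the absolute slope of the target on each variable alone. The data, the value q = 5, and the 0.8 cut-off are likewise hypothetical.

```python
import numpy as np


def effect_lower_bounds(X, y):
    """PLACEHOLDER for IDA: |slope| of y regressed on each column by itself."""
    Xc = X - X.mean(axis=0)
    yc = y - y.mean()
    return np.abs(Xc.T @ yc / (Xc ** 2).sum(axis=0))


def selection_frequencies(X, y, q, n_subsamples=100, rng=None):
    """Fraction of subsamples in which each variable is among the top q."""
    rng = np.random.default_rng(rng)
    n, p = X.shape
    counts = np.zeros(p)
    for _ in range(n_subsamples):
        # 50% of the data, drawn with replacement as described above.
        rows = rng.choice(n, size=n // 2, replace=True)
        scores = effect_lower_bounds(X[rows], y[rows])
        top_q = np.argsort(scores)[::-1][:q]
        counts[top_q] += 1
    return counts / n_subsamples


# Simulated data: only columns 3 and 7 actually affect the target.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 30))
y = 2.0 * X[:, 3] - 1.5 * X[:, 7] + rng.normal(size=200)

freq = selection_frequencies(X, y, q=5, rng=1)
stable = np.flatnonzero(freq >= 0.8)
print("selection frequencies:", np.round(freq, 2))
print("stably selected columns:", stable)
```

Swapping the placeholder for a real IDA lower-bound estimator would leave the rest of the loop unchanged.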


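Stability Selection

Order the variables by selection frequency, $\Pi_1 > \Pi_2 > \dots > \Pi_p$, from most often to least often selected.

Define the stably selected genes (covariates) as $\hat{S}_{\mathrm{stable}} = \{ j : \Pi_j \ge \pi_{\mathrm{thr}} \}$ for some threshold $0.5 < \pi_{\mathrm{thr}} \le 1$.

Denote the number of wrongly selected genes (false positives) by $V = |\hat{S}_{\mathrm{stable}} \cap S_{\mathrm{false}}|$, where $S_{\mathrm{false}}$ is the set of genes (covariates) whose true lower bound $\beta_j$ is 0.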


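Stability Selection

For a given threshold $\pi_{\mathrm{thr}}$ and a given value of q, if the assumptions of stability selection hold, the expected number of false positives is bounded:

$E[V] \le \dfrac{1}{2\pi_{\mathrm{thr}} - 1} \cdot \dfrac{q^2}{p}$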


Stability Selection

CStaR is relatively insensitive to the choice of the range of qs.

Down to a certain lower bound, small values of q lead to higher sensitivity.

For q-values below the lower bound, the ranking becomes unstable again.


Stability Selection

All genes are ranked according to the median rank with respect to the different q-values.

Ties in the final ranking are sorted according to median total causal effect size.
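The final ranking step can be sketched with plain NumPy. Everything here is illustrative: the function name, the toy frequencies, and the toy effect sizes are assumptions, and ties within a single q are broken by column order for simplicity.

```python
import numpy as np


def final_ranking(freq_by_q, median_effect):
    """Rank genes by median rank across q-values; break ties by larger effect.

    freq_by_q: array of shape (number of q-values, number of genes) holding
    the selection frequency of each gene for each q.
    """
    n_q, p = freq_by_q.shape
    # Per q-value, rank 1 goes to the most frequently selected gene.
    order = np.argsort(-freq_by_q, axis=1, kind="stable")
    ranks = np.empty_like(order)
    ranks[np.arange(n_q)[:, None], order] = np.arange(1, p + 1)
    median_rank = np.median(ranks, axis=0)
    # lexsort uses the LAST key as primary: median rank, then larger effect first.
    return np.lexsort((-median_effect, median_rank))


# Toy input: 2 q-values, 4 genes (hypothetical numbers).
freq_by_q = np.array([
    [1.0, 0.6, 0.2, 0.5],
    [0.9, 0.5, 0.1, 0.7],
])
median_effect = np.array([0.5, 0.4, 0.1, 0.9])

ranking = final_ranking(freq_by_q, median_effect)
print(ranking)
```

In this toy input genes 1 and 3 tie on median rank (2.5 each), and gene 3 is placed first because its median effect is larger.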


CStaR Results

Choosing Experiments: Stekhoven et al.

Mouse-ear cress (Arabidopsis thaliana)

Response Y: days to bolting (flowering) of the plant

Covariates X: gene-expression profile

Observational data with n = 47 and p = 21,326


Experimental Confirmation: Stekhoven et al.

PC + IDA + stability selection

Performed experiments on 14 of the top 20 genes (those not previously known and with an easily available mutant)


Results: Arabidopsis thaliana

9 among the 14 mutants survived

4 among the 9 mutants (genes) showed a significant effect for Y relative to the wildtype (non-mutated plant)


Chen and Storey

“For an individual organism, DNA has the useful feature that it is usually a static variable, meaning that it is fixed and will not change with changing RNA levels, protein levels, phenotypes, or environmental conditions. By performing designed crosses of genetically distinct inbred or isogenic lines, one can randomize the genotypes of an organism from two or more genetic backgrounds, thereby producing independent realizations of DNA content from offspring to offspring.”

Chen and Storey

Identify causal relations of the form L → Ti → Tj, where L is known to be exogenous and prior to Ti and Tj

L is the genotype at a fixed locus, generated through crossing two haploid parental strains to produce 112 recombinant haploid segregant strains

Ti and Tj are expression levels of genes

Given this background knowledge, we just need to determine that L and Ti are dependent, Ti and Tj are dependent, and Tj is independent of L given Ti.
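In a chain L → Ti → Tj, the genotype L carries no further information about Tj once Ti is known. The simulation below illustrates the corresponding checks with correlations and partial correlations. It is only a sketch: the coefficients and noise levels are made up, the linear-Gaussian tests are a simplification, and a sample size far larger than the 112 segregants is used purely so the pattern is easy to see.

```python
import numpy as np


def residual(v, given):
    """Residual of v after linear regression on `given` (with intercept)."""
    Z = np.column_stack([np.ones(len(v)), given])
    return v - Z @ np.linalg.lstsq(Z, v, rcond=None)[0]


def partial_corr(a, b, given):
    return float(np.corrcoef(residual(a, given), residual(b, given))[0, 1])


rng = np.random.default_rng(1)
n = 5000  # illustrative; much larger than the 112 strains in the study

# Hypothetical chain L -> Ti -> Tj.
L = rng.integers(0, 2, size=n).astype(float)   # randomized genotype at a locus
Ti = 1.0 * L + 0.5 * rng.normal(size=n)        # expression of gene i
Tj = 0.9 * Ti + 0.5 * rng.normal(size=n)       # expression of gene j

r_L_Ti = float(np.corrcoef(L, Ti)[0, 1])
r_Ti_Tj = float(np.corrcoef(Ti, Tj)[0, 1])
pc_L_Tj_given_Ti = partial_corr(L, Tj, Ti)
pc_Ti_Tj_given_L = partial_corr(Ti, Tj, L)

print("corr(L, Ti):", r_L_Ti)                        # clearly non-zero
print("corr(Ti, Tj):", r_Ti_Tj)                      # clearly non-zero
print("partial corr(L, Tj | Ti):", pc_L_Tj_given_Ti)  # near zero
print("partial corr(Ti, Tj | L):", pc_Ti_Tj_given_L)  # stays non-zero
```

Note that Ti and Tj remain dependent after conditioning on L; it is L and Tj that become independent given Ti.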

Faith et al.

Used the Many Microbes Microarrays Database (M3D)

Evaluated using RegulonDB

Restricted search space: only allowed edges out of genes coding for TFs

Compared several search algorithms (but not a fair comparison for the Bayes net learning algorithm)

References

L. S. Chen, F. Emmert-Streib, and J. D. Storey. Harnessing naturally randomized transcription to infer regulatory relationships among genes. Genome Biology, 2007.

T. Chu, C. Glymour, R. Scheines, and P. Spirtes. A statistical problem for inference to regulatory structure from associations of gene expression measurement with microarrays. Bioinformatics, 19:1147–1152, 2003. PMID: 12801876.

Jeremiah J. Faith, Boris Hayete, Joshua T. Thaden, Ilaria Mogno, Jamey Wierzbowski, Guillaume Cottarel, Simon Kasif, James J. Collins, and Timothy S. Gardner. Large-scale mapping and validation of Escherichia coli transcriptional regulation from a compendium of expression profiles. PLOS Biology, 5(1):0054–0066, 2007.

J. Li and M. Biggin. Statistics requantitates the central dogma. Science, 347(6226):1066–1067, 2015.

Marloes H. Maathuis, Diego Colombo, Markus Kalisch, and Peter Bühlmann. Predicting causal effects in large-scale systems from observational data. Nature Methods, 7(4):247–248, 2010.

K. Sachs, O. Perez, D. Pe’er, D. A. Lauffenburger, and G. P. Nolan. Causal protein-signaling networks derived from multiparameter single-cell data. Science, 308:523–529, 2005.

Daniel J. Stekhoven, Izabel Moraes, Gardar Sveinbjörnsson, Lars Hennig, Marloes H. Maathuis, and Peter Bühlmann. Causal stability ranking. Bioinformatics, 28(21):2819–2823, 2012.
