Causal Inference & Genetic Regulatory Networks
Peter Spirtes
Carnegie Mellon University
With slides from Lizzie Silver
Outline
Biology
Data and Background Knowledge
Problems
Algorithms
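Causal Graph
[Figure: regulatory network in which genes are transcribed to mRNA, mRNA is translated to protein, and proteins in turn regulate other genes]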
Sources: http://www.ornl.gov/sci/techresources/Human_Genome/graphics/slides/images/REGNET.jpg
Protein States
Folding
Location (nucleus, membrane, etc.)
Phosphorylation at different sites
Ubiquitination
etc.
Levels of Description
Differential equation models of rates of reaction
Which transcription factors bind to which sites, and can in turn be prevented or aided in binding to sites
Which gene products affect the rate at which other genes produce proteins
Example Graph
Outline
Biology
Data and Background Knowledge
Problems
Algorithms
Knockout Experiments
Insert a DNA construct into the cell
The DNA construct recombines with the target gene
The target gene is then either not translated at all or is translated into a nonfunctional protein
Data: Hughes lab yeast data
Hughes lab data on S. cerevisiae:
63 wild-type strains
267 gene-deletion mutants
No information on direct effects, only total effects
But does include information on non-effects
Must normalize the data to account for differential manipulation - what is the right normalization method?
Data: M3D
Many Microbes Microarrays Database (M3D)
Collected from multiple sources of published data: different labs, strains, experimental conditions, genetic manipulations
E. coli (907), S. oneidensis (245), S. cerevisiae (904)
Experiment descriptions/features standardized
Affymetrix arrays uniformly normalized using Robust Multi-array Average (RMA); raw data also available
Data: RegulonDB
RegulonDB: The E. coli regulatory network database
Curated database of expert knowledge, constantly updated
List of known TF → gene effects, annotated with valence (+;-;±), type(s) of evidence supporting, publications supporting
No information on size of effects
Some information about direct effects, from Chromatin ImmunoPrecipitation (ChIP) assays, recognized binding sites, etc.
If an effect is not in RegulonDB, that doesn't mean it doesn't exist!
Outline
Biology
Data and Background Knowledge
Problems
Algorithms
Search for Genetic Regulatory Networks
Goal: Search for a Directed Acyclic Graph (DAG) representing the “direct” regulatory effects (relative to the set of genes).
This is a hard problem!
# of DAGs is super-exponential in # of genes (4,300 genes in E. coli)
Unobserved confounders: environmental conditions, excluded genes, latent TFs that influence multiple genes
Cycles
Unfaithfulness
Non-linearity, non-Gaussianity
Density not strictly positive
Small sample size
Aggregation
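To give a feel for the super-exponential growth, here is a minimal sketch that counts labeled DAGs with Robinson's recurrence, a(n) = Σ_{k=1..n} (−1)^(k+1) C(n,k) 2^(k(n−k)) a(n−k), with a(0) = 1. The function name `count_dags` is chosen for illustration and is not from the slides.

```python
from functools import lru_cache
from math import comb


@lru_cache(maxsize=None)
def count_dags(n: int) -> int:
    """Number of DAGs on n labeled nodes (Robinson's recurrence)."""
    if n == 0:
        return 1
    return sum(
        (-1) ** (k + 1) * comb(n, k) * 2 ** (k * (n - k)) * count_dags(n - k)
        for k in range(1, n + 1)
    )


# Print the first few counts; the growth is already dramatic at n = 10.
for n in range(1, 11):
    print(n, count_dags(n))
```

Three nodes already admit 25 DAGs and four nodes 543, so exhaustive search over thousands of genes is out of the question.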
How can we evaluate performance?
Li and Biggin
Feedback
The equilibrium state of a feedback system can be represented by a cyclic graph.
If the joint distribution is Gaussian or multinomial, then the natural extension of d-separation to cyclic graphs entails the corresponding conditional independence.
If the joint distribution is not Gaussian or multinomial, then the natural extension of d-separation to cyclic graphs does not entail the corresponding conditional independence.
Feedback
There is an extended sense of “pattern” to represent the set of all Markov equivalent graphs, cyclic or acyclic.
However, it is much more complicated than a pattern – not all Markov equivalent cyclic graphs share the same set of adjacencies, and there are dependencies among which edges.
There is an algorithm that is an extension of the PC algorithm for searching for cyclic graphs.
We do not have a graphical representation of the set of all Markov equivalent graphs with cycles and latent variables.
Local Markov Theorem (Chu)
Let G be an acyclic graph representing the causal relations among a set V of random variables. Let Y, X1, . . . , Xk ∈ V, and let X = {X1, . . . , Xk} be the set of parents of Y in G. If Y = c^T X + ε, where c^T = (c1, . . . , ck) and ε is a noise term independent of all non-descendants of Y, then Y is independent of all its non-parent non-descendants conditional on its parents X, and this relation holds under aggregation.
Outline
Biology
Data and Background Knowledge
Problems
Algorithms
Direct versus Indirect Effects
What is a manipulation?
Setting the value of a variable means you break the influence of whatever normally influences it, e.g. setting arcA := 0.
Direct v. total effects:
Total effects: If I manipulate fnr and let everything else vary as usual, what happens to sodA?
Direct effects: If I manipulate fnr and also clamp arcA to its current value, what happens to sodA?
(What if I don't know whether to clamp narK?)
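The distinction can be illustrated with a small simulation. This is a sketch under loud assumptions: a linear model with edges fnr → arcA → sodA and fnr → sodA, and made-up coefficients; neither the structure nor the numbers come from the slides. In such a model the total effect of fnr on sodA is the direct coefficient plus the product of the coefficients along the path through arcA.

```python
import numpy as np

# Hypothetical coefficients for illustration only.
A_FNR_ARCA = 0.8    # fnr -> arcA
B_ARCA_SODA = 0.5   # arcA -> sodA
C_FNR_SODA = -0.7   # fnr -> sodA (the direct effect)


def mean_sodA(n, rng, do_fnr, clamp_arcA=None):
    """Mean of sodA when fnr is set to do_fnr; optionally clamp arcA too."""
    fnr = np.full(n, float(do_fnr))  # manipulation: fnr ignores its usual causes
    if clamp_arcA is None:
        arcA = A_FNR_ARCA * fnr + rng.normal(size=n)  # arcA varies as usual
    else:
        arcA = np.full(n, float(clamp_arcA))          # arcA held fixed
    sodA = B_ARCA_SODA * arcA + C_FNR_SODA * fnr + rng.normal(size=n)
    return float(sodA.mean())


rng = np.random.default_rng(0)
n = 100_000

# Total effect: change fnr by one unit, let arcA respond.
total_effect = mean_sodA(n, rng, do_fnr=1) - mean_sodA(n, rng, do_fnr=0)

# Direct effect: change fnr by one unit while arcA is clamped.
direct_effect = (
    mean_sodA(n, rng, do_fnr=1, clamp_arcA=0)
    - mean_sodA(n, rng, do_fnr=0, clamp_arcA=0)
)

print("total effect  ~", round(total_effect, 2))
print("direct effect ~", round(direct_effect, 2))
```

Under these assumed coefficients the total effect should come out near C + A·B = −0.3 and the direct effect near C = −0.7.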
Gold Standard: Direct v. Total Effects
Direct effects:
  For a known true graph: evaluate using graph similarity metrics
    True positive and false positive rates for adjacencies and orientations
    Structural Hamming distance
  For an unknown true graph:
    Experimental control of all potential back-door paths
    Mechanistic approach: protein binding arrays
Total effects:
  For a known true graph:
    Path from cause to effect in both graphs? (Path length? Size of path coefficients?)
    Structural intervention distance
  For an unknown true graph:
    Truth: gene knock-out experiments estimate the true total causal effect
    Prediction: IDA
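As one concrete instance of a graph similarity metric, the following sketch computes true positive and false positive rates for adjacencies, ignoring orientation. The function name and the example edges are hypothetical and chosen only to make the arithmetic easy to follow.

```python
from itertools import combinations


def adjacency_rates(true_edges, est_edges, nodes):
    """Return (TPR, FPR) for adjacencies, treating edges as unordered pairs."""
    true_adj = {frozenset(e) for e in true_edges}
    est_adj = {frozenset(e) for e in est_edges}
    all_pairs = {frozenset(p) for p in combinations(nodes, 2)}

    true_pos = len(true_adj & est_adj)   # adjacencies found and really present
    false_pos = len(est_adj - true_adj)  # adjacencies found but not present
    negatives = len(all_pairs - true_adj)

    tpr = true_pos / len(true_adj) if true_adj else 0.0
    fpr = false_pos / negatives if negatives else 0.0
    return tpr, fpr


# Made-up example graphs over four genes (not real regulatory knowledge).
nodes = ["fnr", "arcA", "sodA", "narK"]
true_edges = [("fnr", "arcA"), ("arcA", "sodA"), ("fnr", "narK")]
est_edges = [("fnr", "arcA"), ("fnr", "sodA")]

tpr, fpr = adjacency_rates(true_edges, est_edges, nodes)
print(tpr, fpr)
```

Here one of three true adjacencies is recovered and one of three absent adjacencies is wrongly included, so both rates are 1/3.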
Patterns
A Markov Equivalence Class (MEC) contains all DAGs that entail the same set of conditional independences
All DAGs in the MEC share the same adjacencies
All share the same “unshielded colliders”
Can be represented with a “pattern”:
  common “skeleton” of adjacencies
  edges directed when all the DAGs within the MEC agree on their orientation
  edges left undirected otherwise
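The two shared properties give a simple test for Markov equivalence of DAGs: same skeleton and same unshielded colliders. A minimal sketch follows, assuming each DAG is given as a dictionary mapping every node to its set of parents (a representation chosen here for illustration) and that both DAGs are over the same nodes.

```python
from itertools import combinations


def skeleton(dag):
    """Set of adjacencies, as unordered pairs."""
    return {frozenset((p, child)) for child, parents in dag.items() for p in parents}


def unshielded_colliders(dag):
    """Triples (a, c, b) with a -> c <- b where a and b are not adjacent."""
    skel = skeleton(dag)
    found = set()
    for child, parents in dag.items():
        for a, b in combinations(sorted(parents), 2):
            if frozenset((a, b)) not in skel:
                found.add((a, child, b))
    return found


def markov_equivalent(dag1, dag2):
    return (
        skeleton(dag1) == skeleton(dag2)
        and unshielded_colliders(dag1) == unshielded_colliders(dag2)
    )


# Three illustrative DAGs over the genes used in the slides.
chain = {"fnr": set(), "arcA": {"fnr"}, "sodA": {"arcA"}}           # fnr -> arcA -> sodA
reversed_chain = {"sodA": set(), "arcA": {"sodA"}, "fnr": {"arcA"}}  # sodA -> arcA -> fnr
collider = {"fnr": set(), "sodA": set(), "arcA": {"fnr", "sodA"}}    # fnr -> arcA <- sodA

print(markov_equivalent(chain, reversed_chain))  # same MEC
print(markov_equivalent(chain, collider))        # different MECs
```

The chain and its reversal share a skeleton and have no unshielded colliders, so they fall in the same MEC; the collider graph has the same skeleton but does not.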
Intervention calculus when the DAG is Absent (IDA)
[Figure: the candidate graphs over fnr, arcA, and sodA, each labeled with its effect of arcA on sodA: 0, 0, 0, q1, q2, q3]
• Run PC.
• Calculate the effect q of manipulating arcA on sodA in each graph.
• Count how many times each value of q occurs.
Result: the multiset {0, 0, 0, q1, q2, q3}
IDA
[Figure: candidate graphs over fnr, arcA, and sodA, with the effects q1 and q2 each occurring in more than one graph]
• Too expensive if there are many variables.
• By considering only the local structure around sodA and arcA, one can calculate all possible effects, but not how many times they occur.
Result: the set {0, q1, q2, q3}
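The local idea can be sketched as follows: for each candidate parent set of arcA, regress sodA on arcA together with those parents and read off the coefficient of arcA, with the effect being 0 whenever sodA itself is among the parents. This is only an illustration, assuming a linear model, simulated data with made-up coefficients, and a hand-written list of candidate parent sets; in IDA proper the candidate sets come from the pattern that PC returns.

```python
import numpy as np


def effect_given_parents(data, cols, x, y, parents_of_x):
    """Effect of x on y if parents_of_x were the true parent set of x."""
    if y in parents_of_x:
        return 0.0  # y causes x, so manipulating x cannot change y
    regressors = [x] + sorted(parents_of_x)
    design = np.column_stack(
        [data[:, cols[v]] for v in regressors] + [np.ones(len(data))]
    )
    coef, *_ = np.linalg.lstsq(design, data[:, cols[y]], rcond=None)
    return float(coef[0])  # coefficient of x, adjusted for its parents


# Simulated data from a hypothetical linear model (coefficients are made up).
rng = np.random.default_rng(0)
n = 5000
fnr = rng.normal(size=n)
arcA = 0.8 * fnr + rng.normal(size=n)
sodA = 0.5 * arcA - 0.7 * fnr + rng.normal(size=n)
data = np.column_stack([fnr, arcA, sodA])
cols = {"fnr": 0, "arcA": 1, "sodA": 2}

# Candidate parent sets of arcA if every edge were left undirected.
candidate_parent_sets = [set(), {"fnr"}, {"sodA"}, {"fnr", "sodA"}]
effects = [
    effect_given_parents(data, cols, "arcA", "sodA", ps)
    for ps in candidate_parent_sets
]
print(effects)
```

Only the parent set {fnr} matches the simulated model, and its estimate should land near the assumed coefficient 0.5; the other candidates give different values, which is why IDA reports a set of possible effects rather than a single number.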
IDA application
p = 5360 genes (expression of genes)
231 gene knock-downs; 1.2 × 10^6 intervention effects
the truth is “known in good approximation” (thanks to intervention experiments)
goal: prediction of the true large intervention effects based on observational data with no knock-downs
n = 63 observational data points
[Figure: IDA (Maathuis et al.)]
CStaR
Runs IDA multiple times in order to choose the genes that are most stably among the ones selected as the strongest.
Sample 50% of the original data set (with replacement) one hundred times.
For each subsample, run the IDA algorithm.
Take the output of the IDA algorithm for that subsample and order the variables by the size of the estimated (by IDA) lower bound of the total effect of the variable on the target.
Stability Selection Step
Record the frequency with which a given variable appears in the top q of the total effect sizes, for a user-selected value q.
Select the variables that appear with the highest frequency (that is, those judged most often to have large lower bounds of total effects on the target).
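The resampling and frequency-counting steps can be sketched as below. The scoring function here is a placeholder (absolute covariance with the target), standing in for the IDA lower bound, which is not implemented; the function names, the simulated data, and the choice q = 1 are all assumptions made for illustration. The resampling follows the description above (half-size samples drawn with replacement).

```python
import numpy as np


def selection_frequencies(data, score_fn, q, n_subsamples=100, seed=0):
    """Fraction of subsamples in which each variable ranks in the top q."""
    rng = np.random.default_rng(seed)
    n = data.shape[0]
    counts = None
    for _ in range(n_subsamples):
        idx = rng.choice(n, size=n // 2, replace=True)  # 50% of the data
        scores = np.asarray(score_fn(data[idx]))
        if counts is None:
            counts = np.zeros(scores.shape[0])
        top_q = np.argsort(-scores)[:q]  # indices of the q largest scores
        counts[top_q] += 1
    return counts / n_subsamples


def abs_cov_with_target(sub):
    """Placeholder score, NOT IDA: |cov| of each covariate with the last column."""
    centered = sub - sub.mean(axis=0)
    return np.abs(centered[:, :-1].T @ centered[:, -1]) / len(sub)


# Simulated data: only covariate 0 affects the target (made-up example).
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
y = 2.0 * X[:, 0] + rng.normal(size=200)
data = np.column_stack([X, y])

freqs = selection_frequencies(data, abs_cov_with_target, q=1)
print(freqs)
```

Variables would then be ranked by these frequencies, and those above a chosen threshold kept as stably selected.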
Stability Selection
Order the variables by selection frequency: Π1 > Π2 > . . . > Πp (from most often to least often selected).
Define the stably selected genes (covariates) as Ŝstable = {j : Πj ≥ πthr} for some threshold 0.5 < πthr ≤ 1.
Denote the number of wrongly selected genes (false positives) by V = |Ŝstable ∩ Sfalse|, where Sfalse is the set of genes (covariates) whose true lower bound βj is 0.
Stability Selection
For a given threshold πthr and a given value of q, if the assumptions of Stekhoven et al. hold, then E[V] ≤ q² / ((2πthr − 1) p).
Stability Selection
CStaR is relatively insensitive to the choice of the range of q-values.
Down to a certain lower bound, small values of q lead to higher sensitivity.
For q-values below the lower bound, the ranking becomes unstable again.
Stability Selection
All genes are ranked according to the median rank with respect to the different q-values.
Ties in the final ranking are sorted according to median total causal effect size.
CStaR Results
Choosing Experiments: Stekhoven et al.
Mouse-ear cress (Arabidopsis thaliana)
Response Y: days to bolting (flowering) of the plant
Covariates X: gene-expression profile
Observational data with n = 47 and p = 21,326
Experimental Confirmation: Stekhoven et al.
PC + IDA + stability selection
Performed experiments on 14 of the top 20 genes (those not previously known and with an easily available mutant)
Results: Arabidopsis thaliana
9 among the 14 mutants survived
4 among the 9 mutants (genes) showed a significant effect for Y relative to the wildtype (non-mutated plant)
Chen and Storey
“For an individual organism, DNA has the useful feature that it is usually a static variable, meaning that it is fixed and will not change with changing RNA levels, protein levels, phenotypes, or environmental conditions. By performing designed crosses of genetically distinct inbred or isogenic lines, one can randomize the genotypes of an organism from two or more genetic backgrounds, thereby producing independent realizations of DNA content from offspring to offspring.”
Chen and Storey
Identify causal relations of the form L → Ti → Tj, where L is known to be exogenous and prior to Ti and Tj
L is the genotype at a fixed locus, generated through crossing two haploid parental strains to produce 112 recombinant haploid segregant strains
Ti and Tj are expression levels of genes
Given this background knowledge, we just need to determine that L and Ti are dependent, Ti and Tj are dependent, and Tj is independent of L given Ti.
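The three checks can be sketched on simulated data from a chain L → Ti → Tj. This is an illustration only: it uses plain correlations and a linear partial correlation as stand-ins for the tests Chen and Storey actually use, the coefficients are made up, and the sample size is far larger than the 112 segregants so that the estimates are stable.

```python
import numpy as np


def partial_corr(x, y, z):
    """Correlation of x and y after linearly regressing z out of both."""
    design = np.column_stack([z, np.ones(len(z))])
    resid_x = x - design @ np.linalg.lstsq(design, x, rcond=None)[0]
    resid_y = y - design @ np.linalg.lstsq(design, y, rcond=None)[0]
    return float(np.corrcoef(resid_x, resid_y)[0, 1])


# Simulated chain L -> Ti -> Tj with hypothetical coefficients.
rng = np.random.default_rng(1)
n = 5000
L = rng.integers(0, 2, size=n).astype(float)  # genotype at the locus (0 or 1)
Ti = 1.5 * L + rng.normal(size=n)
Tj = 1.0 * Ti + rng.normal(size=n)

r_L_Ti = float(np.corrcoef(L, Ti)[0, 1])    # should be clearly nonzero
r_Ti_Tj = float(np.corrcoef(Ti, Tj)[0, 1])  # should be clearly nonzero
r_L_Tj_given_Ti = partial_corr(L, Tj, Ti)   # should be near zero

print(r_L_Ti, r_Ti_Tj, r_L_Tj_given_Ti)
```

If Ti did not screen off L from Tj (for example, if L also affected Tj through another route), the partial correlation would stay away from zero and the chain L → Ti → Tj would not be supported.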
Faith et al.
Used Many Microbes Microarrays Database (M3D)
Evaluated using RegulonDB
Restricted search space: only allowed edges out of genes coding for TFs
Compared several search algorithms (but not a fair comparison for Bayes Net learning algorithm)
References
L.S. Chen, F. Emmert-Streib, and J.D. Storey. Harnessing naturally randomized transcription to infer regulatory relationships among genes. Genome Biology, 2007.
T. Chu, C. Glymour, R. Scheines, and P. Spirtes. A statistical problem for inference to regulatory structure from associations of gene expression measurements with microarrays. Bioinformatics, 19:1147–1152, 2003. PMID: 12801876.
J.J. Faith, B. Hayete, J.T. Thaden, I. Mogno, J. Wierzbowski, G. Cottarel, S. Kasif, J.J. Collins, and T.S. Gardner. Large-scale mapping and validation of Escherichia coli transcriptional regulation from a compendium of expression profiles. PLoS Biology, 5(1):0054–0066, 2007.
J. Li and M. Biggin. Statistics requantitates the central dogma. Science, 347(6226):1066–1067, 2015.
M.H. Maathuis, D. Colombo, M. Kalisch, and P. Bühlmann. Predicting causal effects in large-scale systems from observational data. Nature Methods, 7(4):247–248, 2010.
K. Sachs, O. Perez, D. Pe’er, D.A. Lauffenburger, and G.P. Nolan. Causal protein-signaling networks derived from multiparameter single-cell data. Science, 308:523–529, 2005.
D.J. Stekhoven, I. Moraes, G. Sveinbjörnsson, L. Hennig, M.H. Maathuis, and P. Bühlmann. Causal stability ranking. Bioinformatics, 28(21):2819–2823, 2012.