Upload
others
View
1
Download
0
Embed Size (px)
Citation preview
Causal Graphical Models in Systems Genetics
2013 Network Analysis Short Course - UCLA Human Genetics
Elias Chaibub Neto and Brian S Yandell
July 17, 2013
1
Motivation and basic concepts
2
Motivation
Suppose the expression of gene G is associated with a clinical phenotype C.
We want to know whether: G → C or if G ← C or if C ↔ G .
We cannot distinguish between these models using data since
f (G ) f (C | G ) = f (G ,C ) = f (C ) f (G | C ) ,
and their likelihood scores are identical.
3
Schadt et al. (2005)
However, if G and C map to the same QTL, we can use genetics to inferthe causal ordering among the phenotypes.
0 20 40 60 80 100
0
5
10
15
20
Chromosome
lod
Q
Q
G
C
causal
Q
G
C
reactive
Q
G
C
indepen
4
Schadt et al. (2005)
For a drug company, it is important to determine which genes are causal,and which genes are reactive, since:
I Causal genes have the potential to become drug targets.
I Whereas reactive genes are of lesser interest.
5
Genetics and causal inference
The integration of genetics and phenotype data allows us to infer causalrelations between phenotypes for two reasons:
1. In experimental crosses, the association of a QTL and a phenotype iscausal.
2. A causal QTL can be used to determine the causal order betweenphenotypes using the concept of conditional independence.
6
Causal relations between QTLs and phenotypes
In experim. crosses, the association of a QTL and a phenotype is causal.
Why is it so ?
QTL mapping is analogous to a randomized experiment (Li et al. 2006).
Randomization is considered the “gold standard” for causal inference.
Causality can be inferred from a randomized experiment since:
1. Application of a treatment to an experimental unit precedes theobservation of the outcome (genotype precedes phenotype).
2. Because the treatment levels are randomized across the experimentalunits, the effects of confounding variables get averaged out (themendelian randomization of alleles during meiosis average out theeffects of other unlinked loci on the phenotype).
7
Causal relations between QTLs and phenotypes
X
A
chr1
X
B
Phenotype
chr2
0
5
10
15
20
Chromosome
lod
1 2
Aa−− AA−−−2
02
46
8
Effect of QTL A, after the effect of B is averaged out
phen
otyp
e
Bb−− BB−−−2
02
46
8
Effect of QTL B, after theeffect of A is averaged out
phen
otyp
e
8
Conditional independence as the key to causal orderingModel: Q → G → C
Marginal dependence:
Aa AA
02
46
GAa AA
−2
26
C
Conditional independence:
0 2 4 6
−2
02
46
8
G
C
Aa AA
−3
−1
13
res(
C|G
)
9
Causal ordering between phenotypes
Q
G
C
causal
Q
G
C
reactive
Q
G
C
indepen
Aa AA
−10
−5
05
10re
s(C
|G)
Aa AA
−10
010
20re
s(C
|G)
Aa AA
−15
−5
05
10re
s(C
|G)
Aa AA
−10
05
15re
s(G
|C)
Aa AA
−5
05
res(
G|C
)
Aa AA
05
1020
res(
G|C
)
10
Causal ordering between phenotypes
In general (although it is not always true):
Models that share the same set of conditional independence relations(Markov equivalent models) cannot be distinguished using the data (theyhave equivalent likelihood functions).
Whereas, models with distinct sets of conditional independence relations,can be distinguished.
11
Causality tests for pairs ofphenotypes
Q
G
C
causal
Q
G
C
reactive
Q
G
C
indepen
12
Pairwise models as collapsed versions of more complexnetworks
(a) Q��
(b) Q��
(c) Q��
(d) Q��~~ ��@
@(e) Q
�� @@
Y1
}}{{ !!CC Y
��Y
~~}} AA Y
��Y��
Y~~}}
Y��
Y!!CC
C Y}}{{{
Y1
��
Y1
��
AAY
��
Y1 Y��
Y1 Y��
Y
��
!!CCC Y
=={{
!!CC Y
��������
Y2 Y2
Y}}{{
Y2 Y��
Y
vvnnnnnn Y
OO
66nnnnnn
Y2 Y2
Q}}{{
Q}}{{
Q~~|| BB
Q~~}} AA
Q~~|| BB
Y1// Y2 Y1
//<<Y2bb Y1
// Y2 Y1 Y2 Y1 ==Y2aa
A causal relation might be direct or mediated by other phenotypes.
Pairwise models are misspecified.
13
Schadt et al. 2005
By using this approach, Schadt et al. 2005, has been able to identify, andexperimentally validate genes related to obesity in a mouse cross.
So, what is the issue then?
I Model selection via AIC or BIC scores do not provide a measure ofuncertainty associated with the model selection call.
I With noisy data, model selection can lead to a large number of falsepositives.
14
The issue, and illustrationFor each one of the 1,000 simulations we:
I Generate noisy data from model: Q → Y1 → Y2.
I Fit models M1 : Q → Y1 → Y2 and M2 : Q → Y2 → Y1.
I Compute log-likelihood ratio LR12.
I If LR12 > 0, select M1. If LR12 < 0, select M2.
0.0 0.1 0.2 0.3 0.4 0.5
0.0
0.1
0.2
0.3
0.4
0.5
R2(Y1 = Q + ε)
R2 (Y
2=
Q+
ε)
false positives (318)true positives (682)
15
Issue: no measure of uncertainty for a model selection call
We want a statistical procedure that attaches a measure of uncertainty toa model selection call.
However, given the characteristics of our application problem, it:
1. Needs to handle misspecified models.
2. Needs to handle non-nested models:
M1 Q��~~
M2 Q��@
@
Y1// Y2 Y1 Y2
oo
3. Should, ideally, be fully analytical for the sake of computationalefficiency.
16
Assessing the significance of a model selection call
Vuong’s model selection test (Vuong 1989) satisfies these three criteria.
0.0 0.1 0.2 0.3 0.4 0.5
0.0
0.1
0.2
0.3
0.4
0.5
R2(Y1 = Q + ε)
R2 (Y
2=
Q+
ε)false positives (1)true positives (65)no calls (934)
17
Vuong’s model selection test (Vuong 1989)
Consider 2 competing models M1 ×M2.
Vuong’s test the hypothesis:
H0 : M1 is not closer to the true model than M2,
H1 : M1 is closer to the true model than M2.
where, under H0, the scaled log-likelihood-ratio test statistic
Z12 =LR̂12√n σ̂12.12
−→d N(0, 1) ,
with LR̂12 =∑n
i=1(log f̂1,i − log f̂2,i ), and σ̂12.12 is the sample variance ofthe log-likelihood ratio scores.
18
Causal Model Selection Tests (CMST)
Vuong’s test handles model selection for 2 models only.
However, we want to use data from experimental crosses to distinguishamong 4 models:
M1 Q��~~
M2 Q��@
@ M3 Q��~~ ��@
@ M4 Q��~~ ��@
@
Y1// Y2 Y1 Y2
oo Y1 Y2 Y1 ;; Y2cc
Likelihood equivalent models:
Ma4
Q����� ��>
>> Mb4
Q����� ��>
>> Mc4
Q����� ��>
>>
Y1// Y2 Y1 ;; Y2cc Y1 Y2
oo
19
Causal Model Selection Tests (CMST)
Combine several separate Vuong’s tests into a single one. 3 versions:
1. Parametric CMST: intersection-union test of 3 Vuong’s tests,M1 ×M2, M1 ×M3, M1 ×M4, testing:
H0 : M1 is not closer to the true model than M2, M3, or M4.
H1 : M1 is closer to the true model than M2, M3, and M4.
2. Non-parametric CMST: intersection-union test of 3 paired signtests (Clark’s test).
3. Joint-parametric CMST: extension of the parametric CMST testwhich accounts for the correlation among the test statistics of theVuong’s tests.
20
Yeast data analysis
Budding yeast genetical genomics data set (Brem and Kruglyak 2005).
Data on 112 strains with:
I Expression measurements on 5,740 transcripts.
I Dense genotype data on 2,956 markers.
Most importantly:
We evaluated the precision of the causal predictions using validated causalrelationships extracted from a data-base of 247 knock-out experiments inyeast (Hughes et al. 2000, Zhu et al. 2008).
21
Knockout signatures
I In each experiment, one gene was knocked-out, and the expressionlevels of the remainder genes in control and knocked-out strains wereinterrogated for differential expression.
I The set of differentially expressed genes form the knock-out signature(ko-signature) of the knocked-out gene (ko-gene).
I The ko-signature represents a validated set of causal relations.
22
Validation using yeast knockout signatures
To leverage the ko information, we:
I Determined which of the 247 ko-genes also showed a significant QTLin our data-set.
I For each ko-gene showing significant linkages, we determined whichother genes co-mapped to the ko-gene’s QTL, generating, in this way,a list of putative targets of the ko-gene.
I For each ko-gene/putative targets list, we applied all methods usingthe ko-gene as the Y1 phenotype, the putative target genes as the Y2
phenotypes and the ko-gene’s QTL as the causal anchor.
23
Validation using yeast knockout signatures
I In total, 135 ko-genes showed significant linkages (both cis- andtrans-).
I The number of genes in the target lists varied from ko-gene toko-gene, but, in total, there were 31,936 targets.
24
Validation using yeast knockout signatures
Performance in terms of “biologically validated” TP, FP and precision:
I TP: a statistically significant causal relation between a ko-gene and aputative target gene when the putative target gene belongs to theko-signature of the ko-gene.
I FP: a statistically significant causal relation between a ko-gene and aputative target gene when the target gene doesn’t belong to theko-signature.
I The “validated precision”, is computed as the ratio of true positivesby the sum of true and false positives.
25
Results: cis and trans ko-genes
0.02 0.06 0.10
050
100
150
200
250
300
350
True Positives
Nominal significance level
Num
ber
of tr
ue p
ositi
ves
0.02 0.06 0.10
010
0020
0030
0040
0050
00
False Positives
Nominal significance level
Num
ber
of fa
lse
posi
tives
0.02 0.06 0.10
0.00
0.05
0.10
0.15
0.20
0.25
Precision
Nominal significance level
Pre
cisi
onblack: BIC, blue: joint CMST BIC, green: par CMST BIC, red: non par CMST BIC
26
Results: cis ko-genes only
27 out of the 135 candidate regulator ko-genes mapped in cis.
0.02 0.06 0.10
020
4060
8010
012
014
0
True Positives
Nominal significance level
Num
ber
of tr
ue p
ositi
ves
0.02 0.06 0.10
010
020
030
040
050
0
False Positives
Nominal significance level
Num
ber
of fa
lse
posi
tives
0.02 0.06 0.10
0.0
0.1
0.2
0.3
0.4
Precision
Nominal significance levelP
reci
sion
black: BIC, blue: joint CMST BIC, green: par CMST BIC, red: non par CMST BIC
27
Precision side by side
0.02 0.06 0.10
0.0
0.1
0.2
0.3
0.4
0.5
Cis and trans
Nominal significance level
Pre
cisi
on
0.02 0.06 0.100.
00.
10.
20.
30.
40.
5
Cis only
Nominal significance level
Pre
cisi
on
black: BIC, blue: joint CMST BIC, green: par CMST BIC, red: non par CMST BIC
28
Cis-vs-trans caseWhy is the cis-vs-trans case easier than the trans-vs-trans case ?
In general, the cis-linkages tend to be stronger than trans-linkages.
0.0 0.1 0.2 0.3 0.4 0.5
0.0
0.1
0.2
0.3
0.4
0.5
R2(Y1 = Q + ε)
R2 (Y
2=
Q+
ε)
29
Conclusions
I CMST tests trade a reduction in the rate of false positives by adecrease in statistical power.
I Whether a more powerful and less precise, or a less powerful and moreprecise method is more adequate, depends on the biologist’s researchgoals and resources.
I If the biologist can easily validate several genes, a larger list generatedby more powered and less precise methods might be more appealing.
I If follow up studies are time consuming and expensive, and only a fewcandidates can be studied in detail, a more precise method thatconservatively identifies candidates with high confidence can be moreappealing.
30
Causal Bayesian networks and theQTLnet algorithm
YOL084W
YAL061W
YDR032C
YEL011W
YFR043C
YHR016C
YHR104W
YIL113W
YIR016W
YJL111W
YJL161W
YJL210W
YJR096W
YKL085W
YKL091C
YLR178C
YMR170C
YNL160W
YNL195C YNR014W
YOL097C
YOR028C
YPL154C
YPR160W
31
Standard Bayesian networksA graphical model is a multivariate probabilistic model whose conditionalindependence relations are represented by a graph.
Bayesian networks are directed acyclic graph (DAG) models,/.-,()*+1��=
=== /.-,()*+3 ///.-,()*+5 ///.-,()*+6/.-,()*+2 @@���� /.-,()*+4OO
Assuming the Markov property, the joint distribution factors according tothe conditional independence relations:
P(1, 2, 3, 4, 5, 6) = P(6 | 5)P(5 | 3, 4)P(4)P(3 | 1, 2)P(2)P(1)
6 ⊥⊥ {1, 2, 3, 4} | 5, 5 ⊥⊥ {1, 2} | 3, 4, and so on
i.e., each node is independent of its non-descendants given its parents.32
Standard Bayesian networks and causality
Even though the direct edges in a Bayes net are often interpreted as causalrelations, in reality they only represent conditional dependencies.
Different phenotype networks, for instance,
Y1 → Y2 → Y3 , Y1 ← Y2 → Y3 , Y1 ← Y2 ← Y3 ,
can represent the same set of conditional independence relations(Y1 ⊥⊥ Y3 | Y2, in this example).
When this is the case, we say the networks are Markov equivalent.
33
Standard Bayesian networks and causality
In general:
Markov equivalence ⇔ Distribution equivalence
(equivalence of likelihood functions)
Hence, model selection criteria cannot distinguish between Markovequivalent networks.
The best we can do is to learn equivalence classes of likelihood equivalentphenotype networks from the data.
34
Genetics as a way to reduce the size of equivalence classes
The incorporation of genetic information can help distinguish betweenlikelihood equivalent networks in two distinct ways:
1. By creating priors for the network structures, using the results ofcausality tests (Zhu et al. 2007).
2. By augmenting the phenotype network with QTL nodes, creating newsets of conditional independence relations (Chaibub Neto et al. 2010).
35
Genetic priors
Consider the networks
M1 : Y1 → Y2 → Y3 , M2 : Y1 ← Y2 ← Y3 .
These Markov equivalent networks have the same likelihood, i.e.,
P(D | M1) = P(D | M2) .
If the phenotypes are associated with QTLs, we can use the results of thecausality tests to compute prior probabilities for the network structures. If
P(M1)
P(M2)̸= 1 , then
P(M1 | D)
P(M2 | D)=
P(D | M1)P(M1)
P(D | M2)P(M2)̸= 1 ,
and we can use the posterior probability ratio to distinguish between thenetworks.
36
Augmenting the phenotype network with QTLs
Consider the Markov equivalent networks:
M1 : Y1 → Y2 → Y3 , M2 : Y1 ← Y2 ← Y3 .
By augmenting the phenotype network with a QTL node,
M1 : Q → Y1 → Y2 → Y3 , M2 : Q → Y1 ← Y2 ← Y3 ,
we have thatM1 andM2 have distinct sets of conditional independencerelations:
Y2 ⊥⊥ Q | Y1 , Y1 ⊥⊥ Y3 | Y2 , on M1
Y2 ⊥̸⊥ Q | Y1 , Y1 ⊥⊥ Y3 | Y2 , on M2
Hence,M1 andM2 are no longer likelihood equivalent.
37
Learning Bayesian Networks from Data
I Posterior prob of networkMk given the observed data, D,
P(Mk | D) =P(D | Mk)P(Mk)∑Mk
P(D | Mk)P(Mk).
I Prior predictive distribution of D givenMk
P(D | Mk) =
∫P(D | θ,Mk)P(θ) dθ
I Prior distribution of networkMk , P(Mk).
I Marginal distribution of the data
P(D) =∑Mk
P(D | Mk)P(Mk) ,
cannot, generally, be computed analytically because the number ofnetworks is too large.
38
Learning Bayesian Networks from Data
Complexity of the learning task:
# of nodes # of networks
1 12 33 254 5435 29,2816 3,781,50310 4.175099e+1820 2.344880e+7230 2.714854e+158
Hence, heuristic search algorithms are essential to traverse the networkspace efficiently.
39
QTLnet algorithm
I Perform joint inference of the causal phenotype network and theassociated genetic architecture.
I The genetic architecture is inferred conditional on the phenotypenetwork.
I Because the phenotype network structure is itself unknown, thealgorithm iterates between updating the network structure and geneticarchitecture using a Markov chain Monte Carlo (MCMC) approach.
I QTLnet corresponds to a mixed Bayesian network with continuousand discrete nodes representing phenotypes and QTLs, respectively.
40
QTLnet algorithm - standard structure sampler
41
Bayesian model averaging
M_1 M_2 M_3 M_4 M_5 M_6 M_7 M_8 M_9 M_10
Model
Pos
terio
r pr
ob0.
000.
100.
20
12
34
M1 12
34
M2 12
34
M3 12
34
M4 12
34
M5
12
34
M6 12
34
M7 12
34
M8 12
34
M9 12
34
M10
Pr(Y1 → Y2) = Pr(M1) + Pr(M3) + Pr(M4) = 0.54Pr(Y1 . . .Y2) = Pr(M2) + Pr(M5) + Pr(M7) = 0.34Pr(Y1 ← Y2) = Pr(M6) + Pr(M8) + Pr(M9) + Pr(M10) = 0.12
42
Yeast data analysis
We build a causal phenotype network around PHM7.
PHM7 is physically located close to the hotspot QTL on chr 15.
0 100 200 300 400
0
20
40
60
80
100
120
chr 15 hotspot
Map position (cM)
coun
ts
PHM7 is the cis-gene with the largest number of significant causal callsacross all hotspots (23 significant calls at α = 0.001 for joint CMST).
43
Yeast data analysisPHM7 (yellow) shows up at the topof the transcriptional network.
YOL084W
YAL061W
YDR032C
YEL011W
YFR043C
YHR016C
YHR104W
YIL113W
YIR016W
YJL111W
YJL161W
YJL210W
YJR096W
YKL085W
YKL091C
YLR178C
YMR170C
YNL160W
YNL195C YNR014W
YOL097C
YOR028C
YPL154C
YPR160W
44
References
1. Chaibub Neto et al. (2013) Modeling causality for pairs of phenotypes in systemsgenetics. Genetics 193: 1003-1013.
2. Chaibub Neto et al. (2010). Causal graphical models in systems genetics: aunified framework for joint inference of causal network and genetic architecture forcorrelated phenotypes. Annals of Applied Statistics 4: 320-339.
Software: R/qtlhot and R/qtlnet packages.
Further references:
1. Brem and Kruglyak (2005) PNAS 102: 1572-1577.
2. Clarke (2007) Political Analysis 15: 347-363.
3. Hughes et al. (2000) Cell 102: 109-116.
4. Kullback (1959) Information theory and statistics. John Wiley. New York.
5. Li et al. (2006) Plos Genetics 2: e114.
6. Schadt et al. (2005) Nature Genetics 37: 710-717.
7. Vuong (1989) Econometrica 57: 307-333.
8. Zhu et al. (2008) Nature Genetics 40: 854-861.
45
Acknowledgments
Co-authors:
I Brian S Yandell (Statistics - UW-Madison)
I Mark P Keller (Biochemistry - UW-Madison)
I Alan D Attie (Biochemistry - UW-Madison)
I Bin Zhang (Genetics and Genomic Sciences - MSSM)
I Jun Zhu (Genetics and Genomic Sciences - MSSM)
I Aimee T Broman (Biochemistry - UW-Madison)
46
Thank you !
47