Using Biological Knowledge To Discover Higher Order Interactions In Genetic Association Studies
Gary K. Chen, Duncan C. Thomas
Department of Preventive Medicine, USC
May 19, 2010
Outline
1. Motivation
2. The algorithm: Incorporating biological priors into an MCMC sampler
3. Simulation 1: Performance of the method
4. Simulation 2: Detecting interactions in a known pathway
5. Application to data from a GWAS
6. Future Extensions
Common diseases have complex etiology
- GWAS have had great success in searching for genetic variants for common diseases
- Recent successes: AMD, BMI/obesity, Type 2 diabetes, breast cancer, prostate cancer
- Marginal effects from single SNP analyses do not explain all heritability. Can we move beyond the low-hanging fruit (e.g. CNVs, rare variants, epistatic interactions)?
- Ideally we would fit a model for all SNPs (and interactions too)
Analyzing all SNPs simultaneously
- Difficult for GWAS: predictors far exceed observations
- Shrinkage methods: LASSO, ridge regression, elastic net, ...
- LASSO method (Tibshirani, J Royal Stat. Soc. 96)
  - penalizes likelihood based on tuning parameter λ
  - produces sparse (interpretable) models
- In GWAS settings:
  - Double exponential (Laplace) prior on β (Wu and Lange, Bioinf. 2009)
  - Normal exponential gamma prior on β (Hoggart et al., PLOS Genet 2008)
  - Fast! Provides the maximum a posteriori (MAP) estimates
Fully Bayesian methods for variable selection
- Bayesian model averaging assesses uncertainty
  - Probabilistically proposes sub-models from a posterior distribution
  - Summarize statistics of parameters averaged across all proposed models
  - Controls for multiple comparisons
- Disadvantage: Computationally expensive
  - P(β) has normal distribution for conjugacy
  - "Spike and slab" ensures parsimony
  - Example: Stochastic Search Variable Selection via Gibbs sampling (George and McCulloch, JASA 93)
    - $\beta_j \mid \gamma_j \sim (1 - \gamma_j)\, N(0, \tau_j^2) + \gamma_j\, N(0, c_j^2 \tau_j^2)$
    - e.g., $f(\gamma) = \prod_j p_j^{\gamma_j} (1 - p_j)^{(1 - \gamma_j)}$
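A minimal sketch of the spike-and-slab mixture prior above, drawing each inclusion indicator and coefficient and evaluating log f(γ). The helper names and the numeric values for p_j, τ_j and c_j are illustrative assumptions; the slides do not give them.

```python
import math
import random

def sample_ssvs_prior(p, tau, c, rng):
    """Draw (gamma_j, beta_j) pairs from the two-component normal mixture prior."""
    draws = []
    for p_j, tau_j, c_j in zip(p, tau, c):
        gamma_j = 1 if rng.random() < p_j else 0
        # gamma_j = 0: narrow "spike" N(0, tau_j^2)
        # gamma_j = 1: wide "slab" N(0, c_j^2 tau_j^2)
        sd = c_j * tau_j if gamma_j else tau_j
        draws.append((gamma_j, rng.gauss(0.0, sd)))
    return draws

def log_prior_gamma(gamma, p):
    """log f(gamma) = sum_j [gamma_j log p_j + (1 - gamma_j) log(1 - p_j)]."""
    return sum(math.log(p_j) if g_j else math.log(1.0 - p_j)
               for g_j, p_j in zip(gamma, p))

rng = random.Random(0)  # fixed seed so the sketch is reproducible
# Hypothetical settings: 10% prior inclusion, slab 20x wider than the spike.
print(sample_ssvs_prior([0.1] * 3, [0.05] * 3, [20.0] * 3, rng))
print(log_prior_gamma([1, 0, 0], [0.1] * 3))
```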
Searching for interactions
- SSVS via Gibbs sampling
  - For 1000 SNPs, length of γ: 1000 + (1000)(999)/2 = 500,500
  - Iterating through each parameter is slow
- Reversible jump MCMC
  - In contrast to SSVS, the "model" is $M = \{j : \gamma_j \neq 0\}$
  - Model size changes at each iteration (similar to stepwise regression)
- Informative priors
  - Incorporating biological information at the level of each variable
  - These priors can be used towards a proposal function in a Metropolis-Hastings algorithm
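The size of the search space quoted above can be checked directly. This small sketch counts main effects plus all pairwise interaction terms; `n_model_terms` is a hypothetical helper name.

```python
from math import comb

def n_model_terms(n_snps):
    """Number of main effects plus pairwise interaction terms."""
    return n_snps + comb(n_snps, 2)

print(n_model_terms(1000))  # 500500
```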
Posterior density as a two-level hierarchical model
- Posterior density:
  - $L(Y \mid \beta, X, M)\, P(\beta \mid \pi, \tau, \sigma, M, Z, A)$
- First level as likelihood: a GLM at the subject level
  - $\mathrm{logit}(P(Y = 1 \mid \beta, X)) = \beta_0 + \sum_{k=1}^{K} \beta_k X_k$
  - X can be G, E, GxG, GxE, etc.
- Second level as prior: β_k as mixed model
  - $\beta_k \sim \pi^T Z_k + \phi_k + \theta_k$
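As an illustration of the first level, the following sketch evaluates the logistic log-likelihood for a tiny design matrix holding two SNP main effects and their GxG product. The data and coefficient values are made up for the example and are not from the talk.

```python
import math

def log_likelihood(beta0, beta, X, y):
    """Bernoulli log-likelihood under logit(P(Y=1)) = beta0 + sum_k beta_k X_k."""
    total = 0.0
    for x_i, y_i in zip(X, y):
        eta = beta0 + sum(b_k * x_ik for b_k, x_ik in zip(beta, x_i))
        # log P(Y = y_i) = y_i * eta - log(1 + exp(eta)), computed stably
        total += y_i * eta - (max(eta, 0.0) + math.log1p(math.exp(-abs(eta))))
    return total

# Hypothetical genotypes (0/1/2 minor-allele counts) for two SNPs.
genotypes = [(0, 1), (1, 2), (2, 2), (1, 0)]
X = [(g1, g2, g1 * g2) for g1, g2 in genotypes]  # G, G, and GxG columns
y = [0, 1, 1, 0]
print(log_likelihood(-1.0, [0.1, 0.1, 0.3], X, y))
```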
Prior mean on variable in Z
Table: The Z matrix
Intercept  Conservation  Missense  eQTL
1          20            0         5
1          10            1         0.01
1          5             0         1
1          10            1         4.1
1          5             0         1.4
- $\beta_k \sim \pi^T Z_k + \phi_k + \theta_k$
- $\hat{\pi}$: regress $\hat{\beta}$ on Z, $\pi \sim N(\hat{\pi}, \Sigma_\pi)$
Variable connectivity in A matrix
Table: Example A matrix for SNP variables
Variable  1  2  3
1         0  1  0
2         1  0  1
3         0  1  0
One approach for populating the A matrix
Table: The Z matrix
   Intercept  Conservation  Missense  eQTL
→  1          20            0         5
   1          10            1         0.01
→  1          5             0         1
   1          10            1         4.1
   1          5             0         1.4
- Define entry $A_{1,3}$ as $\mathrm{corr}(Z_{1,-}, Z_{3,-})$, then dichotomize A
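The row-correlation approach can be sketched with the example Z matrix. The 0.9 cutoff used to dichotomize is an arbitrary assumption for illustration, since the slides do not state a threshold, and `adjacency_from_z` is a hypothetical helper.

```python
import math

def pearson(u, v):
    """Pearson correlation of two equal-length sequences (0.0 if either is constant)."""
    n = len(u)
    mu, mv = sum(u) / n, sum(v) / n
    cov = sum((a - mu) * (b - mv) for a, b in zip(u, v))
    su = math.sqrt(sum((a - mu) ** 2 for a in u))
    sv = math.sqrt(sum((b - mv) ** 2 for b in v))
    return cov / (su * sv) if su > 0 and sv > 0 else 0.0

def adjacency_from_z(Z, threshold):
    """Set A[j][k] = 1 when rows j and k of Z correlate above the threshold."""
    m = len(Z)
    A = [[0] * m for _ in range(m)]
    for j in range(m):
        for k in range(j + 1, m):
            if pearson(Z[j], Z[k]) > threshold:
                A[j][k] = A[k][j] = 1
    return A

Z = [[1, 20, 0, 5], [1, 10, 1, 0.01], [1, 5, 0, 1], [1, 10, 1, 4.1], [1, 5, 0, 1.4]]
A = adjacency_from_z(Z, threshold=0.9)  # assumed cutoff
for row in A:
    print(row)
```

With this cutoff, variables 1 and 3 (the rows marked with arrows) come out as neighbors, since their annotation profiles correlate at about 0.98.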
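φ_k as mean across k's neighbors
Table: Example A matrix for SNP variables
Variable  1  2  3
1         0  1  0
2         1  0  1
3         0  1  0
- $\beta_k \sim \pi^T Z_k + \phi_k + \theta_k$
- $\phi_k \sim N(\bar{\phi}_{-k}, \tau^2 / \nu_k)$
- $\bar{\phi}_{-k} = \frac{\sum_{j=1}^{m} \phi_j A_{jk}}{\sum_{j=1}^{m} A_{jk}}$, where $\nu_k$ is the number of neighbors of variable k
- We set $\phi_j = \hat{\beta}_j$
- Example: If $\hat{\beta} = (0.2, 0.5, 0.4)$, the neighbor mean for variable 2 is $\bar{\phi}_{-2} = (0.2 + 0.4)/2 = 0.3$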
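The worked example can be reproduced with a short sketch of the neighbor mean, using the example A matrix and the β̂ values from the slide. `neighbor_mean` is a hypothetical helper name.

```python
def neighbor_mean(phi, A, k):
    """Mean of phi_j over the neighbors of variable k (those with A[j][k] == 1)."""
    nu_k = sum(A[j][k] for j in range(len(phi)))
    if nu_k == 0:
        raise ValueError("variable has no neighbors")
    return sum(phi[j] * A[j][k] for j in range(len(phi))) / nu_k

A = [[0, 1, 0],
     [1, 0, 1],
     [0, 1, 0]]
beta_hat = [0.2, 0.5, 0.4]  # phi_j is set to beta_hat_j

# Variable 2 is index 1; its neighbors are variables 1 and 3.
print(neighbor_mean(beta_hat, A, 1))  # approximately 0.3
```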
How the parameters fit together
- $L(Y \mid \beta, X, M)\, P(\beta \mid Z, \pi, A, \tau, \sigma, M)$
A reversible jump MCMC algorithm
- Propose a swap, addition or deletion of a variable
- Perform reversible jump Metropolis-Hastings step comparing posterior probabilities
  - $r = \frac{L(Y \mid \beta', X, M')\, P(\beta' \mid Z, \pi, A, \tau, \sigma, M')\, P(M' \to M)}{L(Y \mid \beta, X, M)\, P(\beta \mid Z, \pi, A, \tau, \sigma, M)\, P(M \to M')}$
- Accept move with probability min(1, r)
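A minimal sketch of the accept/reject step on the log scale, assuming the log posterior (log-likelihood plus log prior) and the log proposal probabilities have already been computed elsewhere. It uses the textbook Metropolis-Hastings form, in which the reverse-move probability P(M′ → M) multiplies the proposed model's posterior. The function and argument names are illustrative.

```python
import math
import random

def accept_move(log_post_new, log_post_old, log_q_forward, log_q_reverse, rng):
    """Return True if the proposed model M' is accepted.

    log_post_new / log_post_old: log likelihood + log prior for M' and M
    log_q_forward: log P(M -> M'); log_q_reverse: log P(M' -> M)
    """
    log_r = (log_post_new + log_q_reverse) - (log_post_old + log_q_forward)
    if log_r >= 0.0:
        return True  # r >= 1, so min(1, r) = 1
    return rng.random() < math.exp(log_r)  # accept with probability r

rng = random.Random(42)
print(accept_move(-100.0, -101.5, math.log(0.2), math.log(0.1), rng))
```

Working with logs avoids underflow when the likelihoods are products over thousands of subjects.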
Model transition proposal density
- Suppose model M′ has 1 newly proposed variable:
  - $P(M \to M') = \Phi^{-1}(z_k)$
  - $z_k \sim N(\mu_k - \mu_{\text{baseline}}, 1)$
- The variable-specific tuning parameter $\mu_k$
  - A function of the components of β's prior standardized by their residual variances
  - $\mu_k = \frac{|\pi^T Z_k + \bar{\phi}_{-k}|}{\sigma^2 + \tau^2 / \nu_k}$
- Weak empirical support for priors leads to small numerator, large denominator
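The variable-specific tuning parameter can be sketched as below, following the formula as written on the slide. All numeric inputs are invented for illustration and `mu_k` is a hypothetical helper name.

```python
def mu_k(pi, z_k, phi_bar_k, sigma2, tau2, nu_k):
    """|pi^T Z_k + phi_bar_{-k}| divided by (sigma^2 + tau^2 / nu_k)."""
    prior_mean = sum(p * z for p, z in zip(pi, z_k)) + phi_bar_k
    return abs(prior_mean) / (sigma2 + tau2 / nu_k)

# Hypothetical second-level coefficients and one row of Z.
pi = [0.1, 0.0, 0.2, 0.0]
z_row = [1, 20, 0, 5]
print(mu_k(pi, z_row, phi_bar_k=0.3, sigma2=1.0, tau2=1.0, nu_k=2))
```

A variable with few neighbors (small ν_k) or noisy second-level estimates gets a larger denominator and therefore a smaller μ_k.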
Model transition proposal density
- The global penalty tuning parameter
  - Emulate the BIC
  - $\mathrm{BIC}(M') - \mathrm{BIC}(M) = \chi_1(\ln(n))$
  - Probability of accepting M′ is $F_\chi^{-1}(\ln(n))$
  - $\mu_{\text{baseline}} = \Phi(F_\chi^{-1}(\ln(n)))$
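A rough sketch of the BIC idea behind the global penalty, under stated assumptions. Adding one parameter raises the BIC penalty by ln(n), and under the null a 1-df chi-square likelihood-ratio statistic exceeds that penalty with probability P(χ²₁ > ln n). Converting the complementary probability to a z-score with the normal quantile function is one assumed reading of μ_baseline, not a confirmed detail of the method; `bic_baseline` is a hypothetical helper.

```python
import math
from statistics import NormalDist

def chi2_1df_sf(x):
    """Upper-tail probability P(chi2_1 > x) for a chi-square with 1 df."""
    return math.erfc(math.sqrt(x / 2.0))

def bic_baseline(n):
    """Return (null probability of beating the BIC penalty, assumed z-scale baseline)."""
    penalty = math.log(n)              # BIC cost of one extra parameter
    p_accept = chi2_1df_sf(penalty)    # chance a null variable clears the penalty
    z = NormalDist().inv_cdf(1.0 - p_accept)  # assumption: baseline on the z scale
    return p_accept, z

p_accept, baseline = bic_baseline(8000)
print(p_accept, baseline)
```

For n = 8000 the null acceptance probability is a fraction of a percent, so a variable needs substantial prior support (large μ_k) before it is proposed often.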
Using external information to enhance power and specificity
- Disease model: 4 GxG interactions jointly cause disease through 4 endophenotypes
  - Genotypes simulated for 14 independent SNPs
  - $y_{ik} = (1 - b)\, N(s_{ia} \cdot s_{ib}, 1) + b\, U(0, 1)$
  - $b \sim \mathrm{Bernoulli}(p)$, p is proportion of noise
  - 24 endophenotypes y used only in the prior
- Disease status determined using a logistic model
  - $\mathrm{logit}(P(Y_i = 1)) = \beta_0 + \beta_1 y_{i,01} + \beta_2 y_{i,02} + \beta_3 y_{i,34} + \beta_4 y_{i,35}$
- First 8000 persons reserved as case-control dataset, remaining 2000 for constructing priors
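The endophenotype model above can be sketched as follows. The 0/1/2 genotype coding and the minor allele frequency of 0.3 are assumptions made for the example; the slides do not specify them.

```python
import random

def simulate_genotype(maf, rng):
    """Minor-allele count (0, 1 or 2) from two independent allele draws."""
    return sum(1 for _ in range(2) if rng.random() < maf)

def simulate_endophenotype(s_a, s_b, p_noise, rng):
    """y = (1 - b) * N(s_a * s_b, 1) + b * U(0, 1), with b ~ Bernoulli(p_noise)."""
    b = 1 if rng.random() < p_noise else 0
    signal = rng.gauss(s_a * s_b, 1.0)   # driven by the GxG product
    noise = rng.uniform(0.0, 1.0)        # pure noise component
    return (1 - b) * signal + b * noise

rng = random.Random(2010)
for _ in range(3):
    s_a, s_b = simulate_genotype(0.3, rng), simulate_genotype(0.3, rng)
    print(s_a, s_b, simulate_endophenotype(s_a, s_b, p_noise=0.2, rng=rng))
```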
Constructing the Z and the A matrices
- Z matrix
  - Measures correlation between a model variable and each endophenotype among 2000 individuals in the prior
  - $Z_{kq} = \mathrm{corr}(g_k, y_q)$
- A matrix
  - Measures similarity between two variables by comparing correlation profiles in Z
  - $A_{jk} = \mathrm{corr}(Z_{jq}, Z_{kq})$, taken across endophenotypes q
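A sketch of filling Z from the prior sample, where each entry correlates one model variable with one endophenotype. The toy vectors stand in for the 2000 prior individuals and `build_z` is a hypothetical helper name.

```python
import math

def pearson(u, v):
    """Pearson correlation of two equal-length sequences (0.0 if either is constant)."""
    n = len(u)
    mu, mv = sum(u) / n, sum(v) / n
    cov = sum((a - mu) * (b - mv) for a, b in zip(u, v))
    su = math.sqrt(sum((a - mu) ** 2 for a in u))
    sv = math.sqrt(sum((b - mv) ** 2 for b in v))
    return cov / (su * sv) if su > 0 and sv > 0 else 0.0

def build_z(variables, endophenotypes):
    """Z[k][q] = corr(g_k, y_q): one row per model variable, one column per endophenotype."""
    return [[pearson(g_k, y_q) for y_q in endophenotypes] for g_k in variables]

# Toy data: 2 model variables and 2 endophenotypes observed on 4 individuals.
variables = [[0, 1, 2, 1], [2, 1, 0, 1]]
endophenotypes = [[0.1, 1.1, 2.1, 1.1], [5.0, 3.0, 4.0, 6.0]]
for row in build_z(variables, endophenotypes):
    print(row)
```

The rows of the resulting Z are the correlation profiles that the A matrix compares.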
Question 1: How do the priors affect power and specificity?
- The A matrix contains information across all 24 endophenotypes
- Set up 3 variants of the original Z matrix
  - 4 causal endophenotypes only (noise parameter p = 0)
  - 4 intermediate endophenotypes only (noise parameter p = 0.2)
  - 4 weakly correlated endophenotypes only (noise parameter p = 0.8)
- Models tested: both A and Z, no A or Z, A only, Z only (with 3 variants)
Question 1: How do the priors affect power and specificity?
- At RR=1.5, all prior models perform very well
- At RR=1.4, prior models with A, Z, or both outperform others
- At RR=1.3, prior models with A, Z, or both have > 5% power
- At RR=1.2, fully informative prior still retains 80% power
- At RR=1.1, all prior models perform poorly (∼ 55% power)
Question 2: How do the priors affect posterior estimates (shrinkage)?
- Posterior estimates of β vs MLE
- Posterior estimates of SE of β vs MLE
Question 3: How do the priors improve rankings?
- 6,441 interactions tested. 4 causal.
- 513,591 interactions tested. 4 causal.
Summary of simulation
- Sensitivity analysis
  - All methods perform well at high RRs
  - Informative priors improve power at lower RRs but not at extremely low RRs
- Like LASSO, shrinkage improves interpretability
- Model averaging can improve robustness of rankings
Discovering interactions in a known pathway: Folate
Simulated data set
- 14 genes, 2 environmental variables
- 8000 individuals in case-control data, remaining 2000 for constructing priors
- Used a pathway simulation program to generate steady-state concentrations
  - Reed et al., J Nutr. 2006 Oct;136(10):2653-61
  - Enzyme kinetics parameters (Km, Vmax) genotype specific
- 3 mechanisms believed to be related to disease etiology
  - Homocysteine concentration
  - Pyrimidine synthesis
  - Purine synthesis
Estimates of π
- Construct Z and A in same manner as previous simulation:
  - Z stores genotype-metabolite correlations
  - A stores dichotomized correlations between rows of Z
- True log relative risk: .18 (RR=1.2)
                     Second-level coefficients π
Simulated mechanism  homocysteine  pyrimidine    purine
homocysteine         0.18(0.13)    -0.09(0.536)  0.002(0.38)
pyrimidine           -0.04(0.22)   0.22(0.066)   -0.01(0.06)
purine               -0.01(0.36)   0.16(0.327)   0.19(0.07)
Comparison of BMA results to stepwise regression
Table: Pyrimidine synthesis
   Interaction  BF    MLE p-value
   FTD*MAT-II   15    0.038
   FTD*MTHFR    20    0.046
   MTCH*MS      534   0.006
   PGT*MS       14    0.018
→  SHMT*CBS     1254  0.133
→  SHMT*Fol     2324  0.036
   TS*MTHFR     227   0.022
→  TS*SHMT      1091  N/S
Pyrimidine synthesis
- SHMT*CBS, SHMT*Fol, SHMT*TS
Comparison of BMA results to stepwise regression
Table: Purine synthesis
   Interaction  BF    MLE p-value
→  MTCH*MS      1130  0.008
→  MTCH*PGT     1416  0.026
→  PGT*CBS      1022  0.069
→  PGT*MS       2851  0.007
→  SHMT*Fol     1398  0.022
   SHMT*MAT-II  646   0.012
   TS*MTHFR     57    0.024
Purine synthesis
- MTCH*MS, MTCH*PGT, PGT*CBS, PGT*MS, SHMT*Fol
Comparison of BMA results to stepwise regression
Table: Homocysteine
   Interaction  BF    MLE p-value
   CBS*MAT-II   77    0.045
→  CBS*Met      1072  N/S
   FTD*MAT-II   38    0.045
   FTD*MTHFR    213   0.015
→  MS*Met       1129  N/S
   MTCH*MS      978   0.006
   PGT*MS       75    0.044
   TS*MTHFR     41    0.022
Homocysteine levels
- CBS*Met, MS*Met
Summary of folate pathway simulation
- Pathway knowledge can inform model search
- Simulated three plausible disease mechanisms
- Effect of causal metabolite on disease revealed in corresponding element of π
- Revealed plausible interactions not found through a stepwise regression
Using gene annotations to inform a search for interactions
- Proof of concept: GWAS of breast cancer
- Publicly available data from NCI (https://caintegrator.nci.nih.gov/cgems/)
- 1,145 cases and 1,142 controls of European ancestry
- The 22 Gene Ontology terms from Biological Process used to define priors in A and Z
- Included 6,078 SNPs, where each SNP had GO annotation and had lowest p-value in gene
Top 10 interactions found
               Non-informative prior  Informative prior
Interaction    β(SE)        BF        β(SE)        BF
PARK2*SORCS1   0.22(0.06)   1e4       0.27(0.06)   5e4
AK5*ARHGAP26   0.16(0.05)   427       0.17(0.05)   903
FGFR2*MAML2    -0.11(0.04)  1         -0.16(0.05)  686
SHC3*KIF13B    N/A          N/A       0.17(0.05)   621
PCLO*ME3       N/A          N/A       0.18(0.05)   528
CNGA3*CNN1     -0.16(0.05)  41        -0.17(0.05)  462
FGFR2*CDT1     N/A          N/A       -0.16(0.05)  445
SHC3*CXCL16    N/A          N/A       -0.18(0.05)  403
FGFR2*ABCA1    -0.1(0.05)   158       -0.11(0.05)  268
CYP2J2*SORCS1  -0.11(0.05)  74        -0.14(0.05)  266
FGFR2*SCG5     N/A          N/A       0.21(0.05)   235
Enrichment analysis
- Are the top interactions (BF > 100) enriched for certain GO terms?
- Compute empirical p-value for enrichment
  - For each iteration, permute within bins representative of non-independence in observed interactions
  - Pool bins, compute frequency of a GO term in the pool
  - p-value: number of iterations in which the permuted frequency exceeded the observed frequency, divided by 1 million
- biological regulation (p=.008), growth (p=1e-6), metabolic process (p=.008), and regulation of biological process (p=.003).
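A simplified sketch of the empirical p-value calculation. It replaces the bin-wise permutation described above with plain random sampling from a background pool, which is an assumption made to keep the example short. The gene names and GO annotations are invented, and far fewer than 1 million iterations are used.

```python
import random

def empirical_p_value(observed_freq, background, term, sample_size, n_iter, rng):
    """Fraction of random draws whose term frequency exceeds the observed frequency."""
    hits = 0
    for _ in range(n_iter):
        draw = rng.sample(background, sample_size)
        freq = sum(1 for terms in draw if term in terms) / sample_size
        if freq > observed_freq:
            hits += 1
    return hits / n_iter

# Hypothetical background: GO term sets for 6 genes.
background = [
    {"growth"}, {"metabolic process"}, {"growth", "biological regulation"},
    {"metabolic process"}, {"biological regulation"}, {"metabolic process"},
]
rng = random.Random(7)
print(empirical_p_value(0.5, background, "growth", sample_size=3, n_iter=10000, rng=rng))
```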
Incorporate gene-expression data into GWAS analyses
- Developing priors
  - Should be more informative (e.g. empirical) and granular (e.g. SNP level) than GO
  - Obtain genotype-expression paired data: HapMap?
  - Apply WGCNA to infer pathway modules
  - Genotype-module correlations used in Z matrix
- Incorporate more advanced MCMC techniques
  - Evolutionary Monte Carlo
  - Multiple-try Metropolis
  - Brute-force search for MAP. Use MAP for initial values?
Acknowledgements
- James Baurley
- David Conti
- Angela Presson (thanks in advance!)
- Funding: R01 ES016813 and R01 ES015090.