Integration of biological annotations using hierarchical modeling

Using Biological Knowledge To Discover Higher Order Interactions in Genetic Association Studies

Gary K. Chen, Duncan C. Thomas

Department of Preventive Medicine, USC

May 19, 2010

Outline

1. Motivation

2. The algorithm: Incorporating biological priors into an MCMC sampler

3. Simulation 1: Performance of the method

4. Simulation 2: Detecting interactions in a known pathway

5. Application to data from a GWAS

6. Future Extensions

Common diseases have complex etiology

- GWAS have had great success in searching for genetic variants for common diseases

- Recent successes: AMD, BMI/obesity, type 2 diabetes, breast cancer, prostate cancer

- Marginal effects from single-SNP analyses do not explain all heritability. Can we move beyond the low-hanging fruit (e.g. CNVs, rare variants, epistatic interactions)?

- Ideally we would fit a model for all SNPs (and interactions too)

Analyzing all SNPs simultaneously

- Difficult for GWAS: predictors far exceed observations

- Shrinkage methods: LASSO, ridge regression, elastic net, ...

- LASSO method (Tibshirani, J Royal Stat. Soc. 96)
  - penalizes the likelihood based on a tuning parameter λ
  - produces sparse (interpretable) models

- In GWAS settings:
  - Double exponential (Laplace) prior on β (Wu and Lange, Bioinf. 2009)
  - Normal exponential gamma prior on β (Hoggart et al., PLoS Genet 2008)
  - Fast! Provides the maximum a posteriori (MAP) estimates
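To make the sparsity point concrete: for the penalized least-squares objective ½‖y − Xβ‖² + λ‖β‖₁ with orthonormal predictors, the LASSO solution is the least-squares estimate passed through a soft-thresholding operator. The Python sketch below illustrates only that special case; the function name and the example estimates are hypothetical and are not taken from the cited papers.

```python
def soft_threshold(estimates, lam):
    """Shrink each estimate toward zero by lam; estimates with |b| <= lam become 0."""
    shrunk = []
    for b in estimates:
        if abs(b) <= lam:
            shrunk.append(0.0)          # small effects are removed from the model
        elif b > 0:
            shrunk.append(b - lam)      # large positive effects are shrunk
        else:
            shrunk.append(b + lam)      # large negative effects are shrunk
    return shrunk


# Hypothetical least-squares effect estimates for four SNPs.
ols = [0.9, -0.05, 0.3, -0.6]
lasso = soft_threshold(ols, lam=0.25)
print(lasso)
```

A larger λ zeroes more coefficients, which is the sense in which the tuning parameter controls model size.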

Fully Bayesian methods for variable selection

- Bayesian model averaging assesses uncertainty
  - Probabilistically proposes sub-models from a posterior distribution
  - Summarizes statistics of parameters averaged across all proposed models
  - Controls for multiple comparisons

- Disadvantage: computationally expensive
  - P(β) has a normal distribution for conjugacy
  - "Spike and slab" ensures parsimony
  - Example: Stochastic Search Variable Selection (SSVS) via Gibbs sampling (George and McCulloch, JASA 93)
    - $\beta_j \mid \gamma_j \sim (1-\gamma_j)\,N(0, \tau_j^2) + \gamma_j\,N(0, c_j^2 \tau_j^2)$
    - e.g., $f(\gamma) = \prod_j p_j^{\gamma_j} (1-p_j)^{(1-\gamma_j)}$
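A minimal Python sketch of drawing from the spike-and-slab prior above: each inclusion indicator γj is Bernoulli(pj), and βj then comes from the narrow "spike" N(0, τj²) or the wide "slab" N(0, cj²τj²). The function name and the numeric settings are illustrative assumptions, not values from the talk.

```python
import random


def sample_ssvs_prior(p, tau, c, rng):
    """Draw (gamma_j, beta_j) pairs from the spike-and-slab mixture prior."""
    draws = []
    for p_j, tau_j, c_j in zip(p, tau, c):
        gamma_j = 1 if rng.random() < p_j else 0     # inclusion indicator
        sd = c_j * tau_j if gamma_j else tau_j       # slab if included, spike otherwise
        draws.append((gamma_j, rng.gauss(0.0, sd)))
    return draws


rng = random.Random(1)
# Hypothetical settings: 5 variables, narrow spike (tau = 0.01), wide slab (c = 100).
draws = sample_ssvs_prior(p=[0.2] * 5, tau=[0.01] * 5, c=[100.0] * 5, rng=rng)
for gamma_j, beta_j in draws:
    print(gamma_j, round(beta_j, 4))
```

Variables with γj = 0 get coefficients very close to zero, which is how the prior keeps the selected model small.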

Searching for interactions

- SSVS via Gibbs sampling
  - For 1000 SNPs, length of γ: $1000 + \frac{(1000)(999)}{2} = 500{,}500$
  - Iterating through each parameter is slow

- Reversible jump MCMC
  - In contrast to SSVS, the "model" is $M = \{j : \gamma_j \neq 0\}$
  - Model size changes at each iteration (similar to stepwise regression)

- Informative priors
  - Incorporating biological information at the level of each variable
  - These priors can be used towards a proposal function in a Metropolis-Hastings algorithm
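The length of γ quoted above is the number of main effects plus the number of pairwise interactions. A short Python check (the function name is illustrative):

```python
from math import comb


def n_candidate_terms(n_snps):
    """Main effects plus all pairwise interactions."""
    return n_snps + comb(n_snps, 2)


print(n_candidate_terms(1000))  # 500500
```

The count grows quadratically in the number of SNPs, which is why visiting every element of γ in each Gibbs sweep becomes slow.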


Posterior density as a two-level hierarchical model

- Posterior density:
  - $L(Y \mid \beta, X, M)\,P(\beta \mid \pi, \tau, \sigma, M, Z, A)$

- First level as likelihood: a GLM at the subject level
  - $\mathrm{logit}(P(Y = 1 \mid \beta, X)) = \beta_0 + \sum_{k=1}^{K} \beta_k X_k$
  - X can be G, E, GxG, GxE, etc.

- Second level as prior: $\beta_k$ as a mixed model
  - $\beta_k \sim \pi^T Z_k + \phi_k + \theta_k$

Prior mean on variable in Z

Table: The Z matrix

Intercept  Conservation  Missense  eQTL
1          20            0         5
1          10            1         0.01
1          5             0         1
1          10            1         4.1
1          5             0         1.4

- $\beta_k \sim \pi^T Z_k + \phi_k + \theta_k$

- $\hat{\pi}$: regress $\hat{\beta}$ on $Z$, $\pi \sim N(\hat{\pi}, \Sigma_\pi)$
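The regression of β̂ on Z can be sketched with ordinary least squares. The Z matrix below is the one from the table; the β̂ values are hypothetical, and the sketch stops at the point estimate π̂ (it does not compute Σπ or draw π from its normal distribution).

```python
import numpy as np

# Z matrix from the slide: intercept, conservation, missense, eQTL.
Z = np.array([
    [1, 20, 0, 5.0],
    [1, 10, 1, 0.01],
    [1, 5, 0, 1.0],
    [1, 10, 1, 4.1],
    [1, 5, 0, 1.4],
])

# Hypothetical first-level estimates (log odds ratios) for the five variables.
beta_hat = np.array([0.30, 0.10, 0.05, 0.25, 0.08])

# Least-squares estimate of the second-level coefficients.
pi_hat, _, rank, _ = np.linalg.lstsq(Z, beta_hat, rcond=None)

# Prior mean contributed by the annotations: pi^T Z_k for each variable k.
prior_mean = Z @ pi_hat
print(pi_hat)
print(prior_mean)
```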

Variable connectivity in A matrix

Table: Example A matrix for SNP variables

Variable  1  2  3
1         0  1  0
2         1  0  1
3         0  1  0

One approach for populating the A matrix

Table: The Z matrix (arrows mark rows 1 and 3, the pair being compared)

     Intercept  Conservation  Missense  eQTL
→    1          20            0         5
     1          10            1         0.01
→    1          5             0         1
     1          10            1         4.1
     1          5             0         1.4

- Define entry $A_{1,3}$ as $\mathrm{corr}(Z_{1,\cdot}, Z_{3,\cdot})$, then dichotomize A
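A sketch of this construction in Python: correlate every pair of rows of Z, then dichotomize. The slide does not state the cutoff, so the threshold of 0.9 is an assumption, as is the choice to zero the diagonal so that a variable is not its own neighbor.

```python
import numpy as np

# Z matrix from the slide: intercept, conservation, missense, eQTL.
Z = np.array([
    [1, 20, 0, 5.0],
    [1, 10, 1, 0.01],
    [1, 5, 0, 1.0],
    [1, 10, 1, 4.1],
    [1, 5, 0, 1.4],
])


def adjacency_from_profiles(Z, threshold):
    """Dichotomize the row-by-row correlation matrix of Z into a 0/1 adjacency matrix."""
    corr = np.corrcoef(Z)                  # entry (j, k) = corr(Z[j, :], Z[k, :])
    A = (corr > threshold).astype(int)     # 1 if the annotation profiles are similar
    np.fill_diagonal(A, 0)                 # assumption: no self-neighbors
    return A


A = adjacency_from_profiles(Z, threshold=0.9)  # hypothetical cutoff
print(A)
```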

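φk as mean across k's neighbors

Table: Example A matrix for SNP variables

Variable  1  2  3
1         0  1  0
2         1  0  1
3         0  1  0

- $\beta_k \sim \pi^T Z_k + \phi_k + \theta_k$

- $\phi_k \sim N\left(\bar{\phi}_{-k}, \frac{\tau^2}{\nu_k}\right)$

- $\bar{\phi}_{-k} = \frac{\sum_{j=1}^{m} \phi_j A_{jk}}{\sum_{j=1}^{m} A_{jk}}$, where $\nu_k$ is the number of neighbors of variable k

- We set $\phi_j = \hat{\beta}_j$

- Example: if $\hat{\beta} = (0.2, 0.5, 0.4)$, then $\bar{\phi}_{-2} = (0.2 + 0.4)/2 = 0.3$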
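The worked example can be reproduced in a few lines of Python using the A matrix and β̂ from the slide. The function name is illustrative, and returning 0.0 for a variable with no neighbors is an assumption the slide does not address.

```python
def neighbor_means(A, phi):
    """Mean of phi over each variable's neighbors, as defined by the 0/1 matrix A."""
    m = len(phi)
    means = []
    for k in range(m):
        n_neighbors = sum(A[j][k] for j in range(m))
        total = sum(phi[j] * A[j][k] for j in range(m))
        # Assumption: a variable without neighbors gets a neutral mean of 0.
        means.append(total / n_neighbors if n_neighbors else 0.0)
    return means


A = [[0, 1, 0],
     [1, 0, 1],
     [0, 1, 0]]
beta_hat = [0.2, 0.5, 0.4]   # phi_j is set to beta_hat_j
phi_bar = neighbor_means(A, beta_hat)
print(phi_bar)
```

Variable 2 has neighbors 1 and 3, so its entry is the average of 0.2 and 0.4; variables 1 and 3 each have variable 2 as their only neighbor.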

How the parameters fit together

- $L(Y \mid \beta, X, M)\,P(\beta \mid Z, \pi, A, \tau, \sigma, M)$

A reversible jump MCMC algorithm

- Propose a swap, addition, or deletion of a variable

- Perform a reversible jump Metropolis-Hastings step comparing posterior probabilities

- $r = \frac{L(Y \mid \beta', X, M')\,P(\beta' \mid Z, \pi, A, \tau, \sigma, M')\,P(M \to M')}{L(Y \mid \beta, X, M)\,P(\beta \mid Z, \pi, A, \tau, \sigma, M)\,P(M' \to M)}$

- Accept move with probability min(1, r)
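A generic accept/reject step, sketched in Python on the log scale to avoid numerical underflow. The function and argument names are hypothetical. The sketch uses the textbook Hastings correction, in which the reverse-move proposal probability is divided by the forward-move proposal probability; how those two terms map onto the P(M → M′) notation above is an assumption, and the sketch does not compute the likelihood or prior itself.

```python
import math
import random


def acceptance_probability(log_post_new, log_post_old, log_q_forward, log_q_reverse):
    """min(1, r), where log r = posterior log-ratio plus proposal log-ratio."""
    log_r = (log_post_new - log_post_old) + (log_q_reverse - log_q_forward)
    return 1.0 if log_r >= 0 else math.exp(log_r)


def accept_move(prob, rng):
    """Accept the proposed model with the given probability."""
    return rng.random() < prob


rng = random.Random(0)
# Hypothetical log posterior values for the current and proposed models.
prob = acceptance_probability(log_post_new=-12.0, log_post_old=-10.0,
                              log_q_forward=math.log(0.5), log_q_reverse=math.log(0.5))
print(prob, accept_move(prob, rng))
```

With symmetric proposals the decision depends only on the posterior ratio; an informative proposal shifts it through the second term.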

Model transition proposal density

- Suppose model M′ has 1 newly proposed variable:
  - $P(M \to M') = \Phi^{-1}(z_k)$
  - $z_k \sim N(\mu_k - \mu_{\text{baseline}}, 1)$

- The variable-specific tuning parameter $\mu_k$
  - A function of the components of β's prior, standardized by their residual variances
  - $\mu_k = \frac{|\pi^T Z_k + \bar{\phi}_{-k}|}{\sigma^2 + \tau^2/\nu_k}$

- Weak empirical support for the priors leads to a small numerator and a large denominator
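The formula for µk can be evaluated directly. In the Python sketch below all inputs (π, Zk, φ̄₋k, σ², τ², νk) are hypothetical numbers chosen only to show how the pieces combine.

```python
def mu_k(pi, z_k, phi_bar_k, sigma2, tau2, n_neighbors):
    """|pi^T Z_k + phi_bar_{-k}| divided by (sigma^2 + tau^2 / nu_k)."""
    prior_mean = sum(p * z for p, z in zip(pi, z_k)) + phi_bar_k
    return abs(prior_mean) / (sigma2 + tau2 / n_neighbors)


# Hypothetical second-level coefficients and one row of Z.
pi = [0.0, 0.01, 0.1, 0.02]
z_k = [1, 10, 1, 4.1]

few_neighbors = mu_k(pi, z_k, phi_bar_k=0.3, sigma2=1.0, tau2=1.0, n_neighbors=2)
many_neighbors = mu_k(pi, z_k, phi_bar_k=0.3, sigma2=1.0, tau2=1.0, n_neighbors=10)
print(few_neighbors, many_neighbors)
```

A variable with more neighbors has a smaller τ²/νk term, so the same prior mean yields a larger µk.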

Model transition proposal density

- Suppose model M′ has 1 newly proposed variable:
  - $P(M \to M') = \Phi^{-1}(z_k)$
  - $z_k \sim N(\mu_k - \mu_{\text{baseline}}, 1)$

- The global penalty tuning parameter
  - Emulate the BIC
  - $BIC(M') - BIC(M) = \chi_1(\ln(n))$
  - Probability of accepting M′ is $F_\chi^{-1}(\ln(n))$
  - $\mu_{\text{baseline}} = \Phi(F_\chi^{-1}(\ln(n)))$
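One possible reading of the BIC emulation, sketched in Python: adding one parameter lowers the BIC only if the likelihood-ratio statistic exceeds ln(n), and under the null that statistic is chi-square with 1 degree of freedom. The sketch computes that tail probability and then places it on the standard-normal scale. Both function names are hypothetical, and the mapping to µbaseline is an interpretation of the slide, not a confirmed implementation.

```python
import math
from statistics import NormalDist


def bic_acceptance_probability(n):
    """P(chi-square with 1 df > ln n), using P(X > x) = erfc(sqrt(x / 2))."""
    return math.erfc(math.sqrt(math.log(n) / 2.0))


def baseline_on_normal_scale(n):
    """Standard-normal quantile whose upper-tail area equals that probability."""
    return NormalDist().inv_cdf(1.0 - bic_acceptance_probability(n))


for n in (2287, 8000):
    print(n, bic_acceptance_probability(n), baseline_on_normal_scale(n))
```

Larger samples make ln(n) larger, so the penalty becomes stricter and the baseline rises.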


Using external information to enhance power and specificity

- Disease model: 4 GxG interactions jointly cause disease through 4 endophenotypes
  - Genotypes simulated for 14 independent SNPs
  - $y_{ik} = (1 - b)\,N(s_{ia} \cdot s_{ib}, 1) + b\,U(0, 1)$
  - $b \sim \mathrm{Bernoulli}(p)$, where p is the proportion of noise
  - 24 endophenotypes y used only in the prior

- Disease status determined using a logistic model
  - $\mathrm{logit}(P(Y_i = 1)) = \beta_0 + \beta_1 y_{i,01} + \beta_2 y_{i,02} + \beta_3 y_{i,34} + \beta_4 y_{i,35}$

- First 8000 persons reserved as the case-control dataset, remaining 2000 for constructing priors
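A rough Python sketch of this simulation design. Several details are assumptions not stated on the slide: genotypes are coded as minor-allele counts with allele frequency 0.3, the causal SNP pairs are read off the endophenotype subscripts as (0,1), (0,2), (3,4), (3,5), and the intercept and effect sizes are made up. Only the four causal endophenotypes are generated, and the split is a simple 8000/2000 partition without case-control sampling.

```python
import numpy as np

rng = np.random.default_rng(0)
n_people, n_snps = 10_000, 14

# Assumption: genotypes coded as minor-allele counts (0/1/2), allele frequency 0.3.
genotypes = rng.binomial(2, 0.3, size=(n_people, n_snps))


def endophenotype(s_a, s_b, p, rng):
    """y = (1 - b) * N(s_a * s_b, 1) + b * U(0, 1), with b ~ Bernoulli(p)."""
    b = rng.binomial(1, p, size=s_a.shape)
    signal = rng.normal(s_a * s_b, 1.0)
    noise = rng.uniform(0.0, 1.0, size=s_a.shape)
    return (1 - b) * signal + b * noise


# Assumed SNP pairs behind the four causal endophenotypes.
pairs = [(0, 1), (0, 2), (3, 4), (3, 5)]
y = np.column_stack([
    endophenotype(genotypes[:, a], genotypes[:, b], 0.0, rng) for a, b in pairs
])

# Hypothetical coefficients: intercept -2, log(1.5) per unit of each endophenotype.
beta0, beta = -2.0, np.full(4, np.log(1.5))
prob = 1.0 / (1.0 + np.exp(-(beta0 + y @ beta)))
disease = rng.binomial(1, prob)

# First 8000 people form the study sample; the last 2000 are kept for building priors.
study, prior_sample = slice(0, 8000), slice(8000, None)
print(disease[study].mean(), disease[prior_sample].mean())
```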

Constructing the Z and the A matrices

- Z matrix
  - Measures correlation between a model variable and each endophenotype among the 2000 individuals reserved for the prior
  - $Z_{kq} = \mathrm{corr}(g_k, y_q)$

- A matrix
  - Measures similarity between two variables by comparing correlation profiles in Z
  - $A_{jk} = \mathrm{corr}(Z_{j,\cdot}, Z_{k,\cdot})$
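A sketch of how the Z entries could be computed: standardize each model variable and each endophenotype, then average their products over individuals. The data below are random stand-ins (hypothetical), with every endophenotype made to depend on the first variable so that one row of Z is clearly non-zero.

```python
import numpy as np


def cross_correlation(G, Y):
    """Z[k, q] = Pearson correlation between column k of G and column q of Y."""
    Gs = (G - G.mean(axis=0)) / G.std(axis=0)
    Ys = (Y - Y.mean(axis=0)) / Y.std(axis=0)
    return Gs.T @ Ys / G.shape[0]


rng = np.random.default_rng(0)
G = rng.normal(size=(2000, 3))                # hypothetical model variables
Y = rng.normal(size=(2000, 4)) + G[:, [0]]    # hypothetical endophenotypes driven by variable 0

Z = cross_correlation(G, Y)                   # 3 variables x 4 endophenotypes
profile_similarity = np.corrcoef(Z)           # row-by-row correlation, the basis for A
print(np.round(Z, 2))
```

Rows of Z are the "correlation profiles"; variables whose profiles correlate strongly would be linked in A.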

Question 1: How do the priors affect power and specificity?

- The A matrix contains information across all 24 endophenotypes

- Set up 3 variants of the original Z matrix
  - 4 causal endophenotypes only (noise parameter p = 0)
  - 4 intermediate endophenotypes only (noise parameter p = 0.2)
  - 4 weakly correlated endophenotypes only (noise parameter p = 0.8)

- Models tested: both A and Z, no A or Z, A only, Z only (with 3 variants)


At RR=1.5, all prior models perform very well

At RR=1.4, prior models with A, Z, or both outperform others

At RR=1.3, prior models with A, Z, or both have > 5% power

At RR=1.2, fully informative prior still retains 80% power

At RR=1.1, all prior models perform poorly (∼ 55% power)

Question 2: How do the priors affect posterior estimates (shrinkage)?

Posterior estimates of β vs MLE

Question 2: How do the priors affect posterior estimates (shrinkage)?

Posterior estimates of SE of β vs MLE

Question 3: How do the priors improve rankings?

6,441 interactions tested. 4 causal.

Question 3: How do the priors improve rankings?

513,591 interactions tested. 4 causal.

Summary of simulation

- Sensitivity analysis
  - All methods perform well at high RRs
  - Informative priors improve power at lower RRs but not at extremely low RRs

- Like LASSO, shrinkage improves interpretability

- Model averaging can improve robustness of rankings


Discovering interactions in a known pathway: Folate

Simulated data set

- 14 genes, 2 environmental variables

- 8000 individuals in case-control data, remaining 2000 for constructing priors

- Used a pathway simulation program to generate steady-state concentrations
  - Reed et al., J Nutr. 2006 Oct;136(10):2653-61
  - Enzyme kinetics parameters (Km, Vmax) are genotype specific

- 3 mechanisms believed to be related to disease etiology
  - Homocysteine concentration
  - Pyrimidine synthesis
  - Purine synthesis

Estimates of π

- Construct Z and A in same manner as previous simulation:
  - Z stores genotype-metabolite correlations
  - A stores dichotomized correlations between rows of Z

- True log relative risk: 0.18 (RR = 1.2)

Simulated       Second-level coefficients π
mechanism       homocysteine   pyrimidine     purine
homocysteine    0.18(0.13)     -0.09(0.536)   0.002(0.38)
pyrimidine      -0.04(0.22)    0.22(0.066)    -0.01(0.06)
purine          -0.01(0.36)    0.16(0.327)    0.19(0.07)

Comparison of BMA results to stepwise regression

                   Pyrimidine synthesis
     Interaction   BF      MLE p-value
     FTD*MAT-II    15      0.038
     FTD*MTHFR     20      0.046
     MTCH*MS       534     0.006
     PGT*MS        14      0.018
→    SHMT*CBS      1254    0.133
→    SHMT*Fol      2324    0.036
     TS*MTHFR      227     0.022
→    TS*SHMT       1091    N/S

Pyrimidine synthesis

- SHMT*CBS, SHMT*Fol, SHMT*TS

                   Purine synthesis
     Interaction   BF      MLE p-value
→    MTCH*MS       1130    0.008
→    MTCH*PGT      1416    0.026
→    PGT*CBS       1022    0.069
→    PGT*MS        2851    0.007
→    SHMT*Fol      1398    0.022
     SHMT*MAT-II   646     0.012
     TS*MTHFR      57      0.024

Purine synthesis

- MTCH*MS, MTCH*PGT, PGT*CBS, PGT*MS, SHMT*Fol

                   Homocysteine
     Interaction   BF      MLE p-value
     CBS*MAT-II    77      0.045
→    CBS*Met       1072    N/S
     FTD*MAT-II    38      0.045
     FTD*MTHFR     213     0.015
→    MS*Met        1129    N/S
     MTCH*MS       978     0.006
     PGT*MS        75      0.044
     TS*MTHFR      41      0.022

Homocysteine levels

- CBS*Met, MS*Met

Summary of folate pathway simulation

- Pathway knowledge can inform model search

- Simulated three plausible disease mechanisms

- Effect of causal metabolite on disease revealed in corresponding element of π

- Revealed plausible interactions not found through a stepwise regression


Using gene annotations to inform a search for interactions

- Proof of concept: GWAS of breast cancer

- Publicly available data from NCI (https://caintegrator.nci.nih.gov/cgems/)

- 1,145 cases and 1,142 controls of European ancestry

- The 22 Gene Ontology terms from Biological Process used to define priors in A and Z

- Included 6,078 SNPs, where each SNP had a GO annotation and had the lowest p-value in its gene

Top 10 interactions found

                 Non-inf prior         Inf prior
Interaction      β(SE)          BF     β(SE)          BF
PARK2*SORCS1     0.22(0.06)     1e4    0.27(0.06)     5e4
AK5*ARHGAP26     0.16(0.05)     427    0.17(0.05)     903
FGFR2*MAML2      -0.11(0.04)    1      -0.16(0.05)    686
SHC3*KIF13B      N/A            N/A    0.17(0.05)     621
PCLO*ME3         N/A            N/A    0.18(0.05)     528
CNGA3*CNN1       -0.16(0.05)    41     -0.17(0.05)    462
FGFR2*CDT1       N/A            N/A    -0.16(0.05)    445
SHC3*CXCL16      N/A            N/A    -0.18(0.05)    403
FGFR2*ABCA1      -0.1(0.05)     158    -0.11(0.05)    268
CYP2J2*SORCS1    -0.11(0.05)    74     -0.14(0.05)    266
FGFR2*SCG5       N/A            N/A    0.21(0.05)     235

Enrichment analysis

- Are the top interactions (BF > 100) enriched for certain GO terms?

- Compute empirical p-value for enrichment
  - For each iteration, permute within bins representative of non-independence in observed interactions
  - Pool bins, compute frequency of a GO term in the pool
  - p-value: number of iterations in which the permuted frequency exceeded the observed frequency, divided by 1 million

- Enriched terms: biological regulation (p = .008), growth (p = 1e−6), metabolic process (p = .008), and regulation of biological process (p = .003).
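A generic within-bin permutation test, sketched in Python. This is one interpretation of the procedure, not the authors' code: the slide does not say exactly what is permuted or how bins are formed, so the inputs (`has_term`, `is_top`, `bins`) and the function name are hypothetical. The p-value follows the slide's definition, the share of iterations in which the permuted frequency strictly exceeds the observed one.

```python
import numpy as np


def empirical_enrichment_pvalue(has_term, is_top, bins, n_iter=10_000, seed=0):
    """Share of within-bin permutations whose GO-term frequency among the
    top interactions exceeds the observed frequency."""
    rng = np.random.default_rng(seed)
    has_term = np.asarray(has_term, dtype=bool)   # interaction carries the GO term
    is_top = np.asarray(is_top, dtype=bool)       # interaction is in the top set
    bins = np.asarray(bins)                       # bin label of each interaction

    observed = has_term[is_top].mean()
    bin_members = [np.flatnonzero(bins == b) for b in np.unique(bins)]

    exceed = 0
    for _ in range(n_iter):
        permuted = has_term.copy()
        for idx in bin_members:
            permuted[idx] = rng.permutation(has_term[idx])  # shuffle labels inside a bin
        if permuted[is_top].mean() > observed:
            exceed += 1
    return exceed / n_iter


# Hypothetical toy input: 8 interactions in 2 bins, 3 of them in the top set.
p_value = empirical_enrichment_pvalue(
    has_term=[1, 0, 0, 1, 1, 0, 0, 0],
    is_top=[1, 0, 0, 1, 1, 0, 0, 0],
    bins=[0, 0, 0, 0, 1, 1, 1, 1],
    n_iter=2000,
)
print(p_value)
```

With 1 million iterations, as on the slide, the smallest non-zero p-value that can be reported is 1e−6.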


Incorporate gene-expression data into GWAS analyses

- Developing priors
  - Should be more informative (e.g. empirical) and granular (e.g. SNP level) than GO
  - Obtain genotype-expression paired data: HapMap?
  - Apply WGCNA to infer pathway modules
  - Genotype-module correlations used in Z matrix

- Incorporate more advanced MCMC techniques
  - Evolutionary Monte Carlo
  - Multiple-try Metropolis
  - Brute-force search for MAP. Use MAP for initial values?

Acknowledgements

- James Baurley

- David Conti

- Angela Presson (thanks in advance!)

- Funding: R01 ES016813 and R01 ES015090.
