Using Biological Knowledge To Discover Higher Order Interactions In Genetic Association Studies Gary K. Chen Duncan C. Thomas Department of Preventive Medicine USC May 19, 2010

Integration of biological annotations using hierarchical modeling


Page 1: Integration of biological annotations using hierarchical modeling

Using Biological Knowledge To Discover Higher Order Interactions In Genetic Association Studies

Gary K. Chen, Duncan C. Thomas

Department of Preventive Medicine, USC

May 19, 2010

Page 2: Integration of biological annotations using hierarchical modeling

Outline

1. Motivation

2. The algorithm: Incorporating biological priors into an MCMC sampler

3. Simulation 1: Performance of the method

4. Simulation 2: Detecting interactions in a known pathway

5. Application to data from a GWAS

6. Future Extensions

Page 3: Integration of biological annotations using hierarchical modeling

Common diseases have complex etiology

• GWAS have had great success in finding genetic variants for common diseases

• Recent successes: AMD, BMI/obesity, type 2 diabetes, breast cancer, prostate cancer

• Marginal effects from single-SNP analyses do not explain all heritability. Can we move beyond the low-hanging fruit (e.g. CNVs, rare variants, epistatic interactions)?

• Ideally we would fit a model for all SNPs (and interactions too)

Page 4: Integration of biological annotations using hierarchical modeling

Analyzing all SNPs simultaneously

• Difficult for GWAS: predictors far exceed observations

• Shrinkage methods: LASSO, ridge regression, elastic net, ...

• LASSO method (Tibshirani, J Royal Stat. Soc. 96)
  - penalizes likelihood based on tuning parameter $\lambda$
  - produces sparse (interpretable) models

• In GWAS settings:
  - Double exponential (Laplace) prior on $\beta$ (Wu and Lange, Bioinf. 2009)
  - Normal exponential gamma prior on $\beta$ (Hoggart et al., PLoS Genet 2008)
  - Fast! Provides the maximum a posteriori (MAP) estimates
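The connection between the LASSO penalty and a Laplace-prior MAP estimate can be illustrated with the soft-thresholding update used in cyclic coordinate descent. The following is a minimal sketch, assuming a linear (not logistic) model for brevity; the function names and the toy data in the usage note are hypothetical and are not taken from the cited papers.

```python
def soft_threshold(z, lam):
    """Shrink z toward zero by lam and set it to exactly 0 inside [-lam, lam]."""
    if z > lam:
        return z - lam
    if z < -lam:
        return z + lam
    return 0.0


def lasso_coordinate_descent(X, y, lam, n_iter=100):
    """Minimise (1/2n) * ||y - X b||^2 + lam * ||b||_1 by cyclic coordinate descent.

    X is a list of rows, y a list of responses. Assumption: a linear model
    stands in for the penalised logistic likelihood discussed on the slide.
    """
    n, p = len(X), len(X[0])
    beta = [0.0] * p
    for _ in range(n_iter):
        for j in range(p):
            rho = 0.0   # correlation of predictor j with the partial residual
            norm = 0.0  # squared norm of predictor j
            for i in range(n):
                pred_others = sum(X[i][k] * beta[k] for k in range(p) if k != j)
                rho += X[i][j] * (y[i] - pred_others)
                norm += X[i][j] ** 2
            beta[j] = soft_threshold(rho / n, lam) / (norm / n) if norm > 0 else 0.0
    return beta
```

With a small penalty, weak effects are set exactly to zero while strong effects are shrunk but retained, which is the sparsity property the slide refers to.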

Page 5: Integration of biological annotations using hierarchical modeling

Fully Bayesian methods for variable selection

• Bayesian model averaging assesses uncertainty
  - Probabilistically proposes sub-models from a posterior distribution
  - Summarize statistics of parameters averaged across all proposed models
  - Controls for multiple comparisons

• Disadvantage: Computationally expensive
  - $P(\beta)$ has normal distribution for conjugacy
  - "Spike and slab" ensures parsimony
  - Example: Stochastic Search Variable Selection via Gibbs sampling (George and McCulloch, JASA 93)
    - $\beta_j \mid \gamma_j \sim (1 - \gamma_j)\, N(0, \tau_j^2) + \gamma_j\, N(0, c_j^2 \tau_j^2)$
    - e.g., $f(\gamma) = \prod_j p_j^{\gamma_j} (1 - p_j)^{(1 - \gamma_j)}$
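To make the spike-and-slab mixture concrete, the sketch below computes the conditional probability that $\gamma_j = 1$ given a current value of $\beta_j$, which is the Bernoulli update a Gibbs sampler would draw from under the independent prior $f(\gamma)$ above. The function names and the numeric values in the usage note are illustrative assumptions.

```python
import math


def normal_pdf(x, sd):
    """Density of N(0, sd^2) evaluated at x."""
    return math.exp(-0.5 * (x / sd) ** 2) / (sd * math.sqrt(2.0 * math.pi))


def inclusion_probability(beta_j, tau_j, c_j, p_j):
    """P(gamma_j = 1 | beta_j) under the two-component normal mixture prior.

    Spike: N(0, tau_j^2) with prior weight (1 - p_j).
    Slab:  N(0, c_j^2 * tau_j^2) with prior weight p_j.
    """
    slab = p_j * normal_pdf(beta_j, c_j * tau_j)
    spike = (1.0 - p_j) * normal_pdf(beta_j, tau_j)
    return slab / (slab + spike)
```

For example, with `tau_j = 0.1`, `c_j = 10` and `p_j = 0.5`, a coefficient near zero is assigned mostly to the spike, while a coefficient near 1 is assigned almost entirely to the slab.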

Page 6: Integration of biological annotations using hierarchical modeling

Searching for interactions

• SSVS via Gibbs sampling
  - For 1000 SNPs, length of $\gamma$: $500{,}500 = 1000 + \frac{(1000)(999)}{2}$
  - Iterating through each parameter is slow

• Reversible jump MCMC
  - In contrast to SSVS, the "model" is $M = \{j : \gamma_j \neq 0\}$
  - Model size changes at each iteration (similar to stepwise regression)

• Informative priors
  - Incorporating biological information at the level of each variable
  - These priors can be used towards a proposal function in a Metropolis-Hastings algorithm
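The length of $\gamma$ quoted above is simply the number of main effects plus the number of unordered SNP pairs. A short check, with a hypothetical helper name:

```python
import math


def gamma_length(n_snps):
    """Number of candidate terms: main effects plus all pairwise interactions."""
    return n_snps + math.comb(n_snps, 2)
```

Because this count grows quadratically in the number of SNPs, a sampler that visits every indicator in turn becomes slow quickly, which motivates moving between models of varying size instead.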

Page 7: Integration of biological annotations using hierarchical modeling

Outline

1. Motivation

2. The algorithm: Incorporating biological priors into an MCMC sampler

3. Simulation 1: Performance of the method

4. Simulation 2: Detecting interactions in a known pathway

5. Application to data from a GWAS

6. Future Extensions

Page 8: Integration of biological annotations using hierarchical modeling

Posterior density as a two-level hierarchical model

• Posterior density:
  - $L(Y \mid \beta, X, M)\, P(\beta \mid \pi, \tau, \sigma, M, Z, A)$

• First level as likelihood: a GLM at the subject level
  - $\mathrm{logit}(P(Y = 1 \mid \beta, X)) = \beta_0 + \sum_{k=1}^{K} \beta_k X_k$
  - $X$ can be G, E, GxG, GxE, etc.

• Second level as prior: $\beta_k$ as mixed model
  - $\beta_k \sim \pi^T Z_k + \phi_k + \theta_k$

Page 9: Integration of biological annotations using hierarchical modeling

Prior mean on variable in Z

Table: The Z matrix

Intercept  Conservation  Missense  eQTL
1          20            0         5
1          10            1         0.01
1          5             0         1
1          10            1         4.1
1          5             0         1.4

• $\beta_k \sim \pi^T Z_k + \phi_k + \theta_k$

• $\hat{\pi}$: regress $\hat{\beta}$ on $Z$, $\pi \sim N(\hat{\pi}, \Sigma_\pi)$
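The regression of $\hat{\beta}$ on $Z$ can be sketched with ordinary least squares via the normal equations. The Z matrix below is the one from the table; the helper names, and any $\hat{\beta}$ values passed in, are hypothetical. Whether the talk used plain or weighted least squares is not stated, so plain least squares is an assumption here.

```python
def solve_linear(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [bi] for row, bi in zip(A, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            factor = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= factor * M[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        tail = sum(M[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (M[r][n] - tail) / M[r][r]
    return x


def regress_beta_on_z(beta_hat, Z):
    """Least-squares estimate pi_hat = (Z^T Z)^{-1} Z^T beta_hat."""
    p = len(Z[0])
    ztz = [[sum(row[a] * row[b] for row in Z) for b in range(p)] for a in range(p)]
    zty = [sum(row[a] * bh for row, bh in zip(Z, beta_hat)) for a in range(p)]
    return solve_linear(ztz, zty)


# Z matrix from the table: intercept, conservation, missense, eQTL.
Z = [
    [1, 20, 0, 5],
    [1, 10, 1, 0.01],
    [1, 5, 0, 1],
    [1, 10, 1, 4.1],
    [1, 5, 0, 1.4],
]
```

Each element of the resulting vector estimates how strongly one annotation column predicts the size of the first-level coefficients.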

Page 10: Integration of biological annotations using hierarchical modeling

Variable connectivity in A matrix

Table: Example A matrix for SNP variables

Variable  1  2  3
1         0  1  0
2         1  0  1
3         0  1  0

Page 11: Integration of biological annotations using hierarchical modeling

One approach for populating the A matrix

Table: The Z matrix

    Intercept  Conservation  Missense  eQTL
→   1          20            0         5
    1          10            1         0.01
→   1          5             0         1
    1          10            1         4.1
    1          5             0         1.4

• Define entry $A_{1,3}$ as $\mathrm{corr}(Z_{1,-}, Z_{3,-})$, then dichotomize $A$
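The row-correlation construction can be sketched as follows, using the Z matrix from the table. The cutoff used to dichotomize is not given on the slide, so the `threshold` argument (0.5 in the example) is purely an assumption, as are the function names.

```python
import math


def pearson(u, v):
    """Pearson correlation of two equal-length sequences (0.0 if either is constant)."""
    n = len(u)
    mean_u, mean_v = sum(u) / n, sum(v) / n
    cov = sum((a - mean_u) * (b - mean_v) for a, b in zip(u, v))
    sd_u = math.sqrt(sum((a - mean_u) ** 2 for a in u))
    sd_v = math.sqrt(sum((b - mean_v) ** 2 for b in v))
    return cov / (sd_u * sd_v) if sd_u > 0 and sd_v > 0 else 0.0


def adjacency_from_z(Z, threshold):
    """A[j][k] = 1 if the annotation profiles of variables j and k correlate above threshold."""
    m = len(Z)
    A = [[0] * m for _ in range(m)]
    for j in range(m):
        for k in range(j + 1, m):
            if pearson(Z[j], Z[k]) > threshold:
                A[j][k] = A[k][j] = 1
    return A


Z = [
    [1, 20, 0, 5],
    [1, 10, 1, 0.01],
    [1, 5, 0, 1],
    [1, 10, 1, 4.1],
    [1, 5, 0, 1.4],
]
A = adjacency_from_z(Z, threshold=0.5)  # threshold is a hypothetical choice
```

The diagonal is left at zero so that a variable is never counted as its own neighbor.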

Page 12: Integration of biological annotations using hierarchical modeling

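$\phi_k$ as mean across $k$'s neighbors

Table: Example A matrix for SNP variables

Variable  1  2  3
1         0  1  0
2         1  0  1
3         0  1  0

• $\beta_k \sim \pi^T Z_k + \phi_k + \theta_k$

• $\phi_k \sim N(\bar{\phi}_{-k}, \tau^2 / \nu_k)$

• $\bar{\phi}_{-k} = \dfrac{\sum_{j=1}^{m} \phi_j A_{jk}}{\sum_{j=1}^{m} A_{jk}}$, where $\nu_k$ is the number of neighbors of variable $k$

• We set $\phi_j = \hat{\beta}_j$

• Example: if $\hat{\beta} = (0.2, 0.5, 0.4)$, then $\bar{\phi}_{-2} = (0.2 + 0.4)/2 = 0.3$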
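The neighbor mean in the example can be reproduced directly from the A matrix. This is a small sketch with a hypothetical function name; indices are zero-based in the code, so variable 2 on the slide is index 1.

```python
def neighbor_mean(phi, A, k):
    """Mean of phi over the neighbors of variable k, as defined by column k of A."""
    numerator = sum(phi[j] * A[j][k] for j in range(len(phi)))
    n_neighbors = sum(A[j][k] for j in range(len(phi)))
    if n_neighbors == 0:
        raise ValueError("variable has no neighbors in A")
    return numerator / n_neighbors


A = [
    [0, 1, 0],
    [1, 0, 1],
    [0, 1, 0],
]
beta_hat = [0.2, 0.5, 0.4]  # phi_j is set to beta_hat_j, as on the slide
phi_bar_2 = neighbor_mean(beta_hat, A, 1)
```

Variable 2 is connected to variables 1 and 3, so its neighbor mean averages their two estimates and ignores its own.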

Page 13: Integration of biological annotations using hierarchical modeling

How the parameters fit together

• $L(Y \mid \beta, X, M)\, P(\beta \mid Z, \pi, A, \tau, \sigma, M)$

Page 14: Integration of biological annotations using hierarchical modeling

A reversible jump MCMC algorithm

• Propose a swap, addition or deletion of a variable

• Perform reversible jump Metropolis-Hastings step comparing posterior probabilities

• $r = \dfrac{L(Y \mid \beta', X, M')\, P(\beta' \mid Z, \pi, A, \tau, \sigma, M')\, P(M \to M')}{L(Y \mid \beta, X, M)\, P(\beta \mid Z, \pi, A, \tau, \sigma, M)\, P(M' \to M)}$

• Accept move with probability $\min(1, r)$
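The propose-then-accept loop can be sketched in a few lines. Two assumptions are worth flagging. First, moves are drawn uniformly here, whereas the talk weights proposals by prior information. Second, the acceptance step uses the textbook Hastings form, in which the reverse-move proposal probability multiplies the proposed model's posterior; how that maps onto the slide's $P(M \to M')$ notation is not certain. All names are hypothetical.

```python
import math
import random


def propose_move(model, n_vars, rng):
    """Propose adding, deleting, or swapping one variable (uniformly; an assumption)."""
    outside = [j for j in range(n_vars) if j not in model]
    moves = []
    if outside:
        moves.append("add")
    if model:
        moves.append("delete")
    if model and outside:
        moves.append("swap")
    move = rng.choice(moves)
    new_model = set(model)
    if move in ("delete", "swap"):
        new_model.remove(rng.choice(sorted(model)))
    if move in ("add", "swap"):
        new_model.add(rng.choice(outside))
    return move, new_model


def accept_move(log_post_new, log_post_old, log_q_forward, log_q_reverse, rng):
    """Metropolis-Hastings acceptance on the log scale: accept with prob min(1, r)."""
    log_r = (log_post_new + log_q_reverse) - (log_post_old + log_q_forward)
    if log_r >= 0.0:
        return True
    return rng.random() < math.exp(log_r)
```

Here `log_post_*` stands for the log of likelihood times prior for each model, and `log_q_*` for the log proposal probabilities of the forward and reverse moves.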

Page 15: Integration of biological annotations using hierarchical modeling

Model transition proposal density

• Suppose model $M'$ has 1 newly proposed variable:
  - $P(M \to M') = \Phi^{-1}(z_k)$
  - $z_k \sim N(\mu_k - \mu_{baseline}, 1)$

• The variable-specific tuning parameter $\mu_k$
  - A function of the components of $\beta$'s prior standardized by their residual variances
  - $\mu_k = \dfrac{|\pi^T Z_k + \bar{\phi}_{-k}|}{\sigma^2 + \tau^2 / \nu_k}$

• Weak empirical support for priors leads to small numerator, large denominator
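The variable-specific parameter $\mu_k$ is a direct ratio of the prior mean's magnitude to its residual variances. A minimal sketch of that arithmetic, with a hypothetical function name and illustrative inputs:

```python
def variable_tuning_parameter(pi, z_k, phi_bar_k, sigma2, tau2, nu_k):
    """mu_k = |pi^T Z_k + phi_bar_{-k}| / (sigma^2 + tau^2 / nu_k)."""
    prior_mean = sum(p * z for p, z in zip(pi, z_k)) + phi_bar_k
    return abs(prior_mean) / (sigma2 + tau2 / nu_k)
```

A variable with a prior mean near zero, or with large residual variances $\sigma^2$ and $\tau^2$, receives a small $\mu_k$ and is therefore proposed less often.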

Page 16: Integration of biological annotations using hierarchical modeling

Model transition proposal density

• Suppose model $M'$ has 1 newly proposed variable:
  - $P(M \to M') = \Phi^{-1}(z_k)$
  - $z_k \sim N(\mu_k - \mu_{baseline}, 1)$

• The global penalty tuning parameter
  - Emulate the BIC
  - $\mathrm{BIC}(M') - \mathrm{BIC}(M) = \chi_1(\ln(n))$
  - Probability of accepting $M'$ is $F^{-1}_{\chi}(\ln(n))$
  - $\mu_{baseline} = \Phi(F^{-1}_{\chi}(\ln(n)))$
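One way to read the BIC-style penalty is sketched below. It rests on several interpretive assumptions: that $F^{-1}_{\chi}(\ln(n))$ denotes the upper-tail probability of a 1-df chi-square at $\ln(n)$, that $\Phi$ in the baseline formula maps this probability to a z-score, that $\Phi^{-1}(z_k)$ maps a z-score back to a probability, and that the baseline is signed so that a variable with $\mu_k = 0$ is proposed rarely. None of these conventions is confirmed by the slides, and the function names are hypothetical.

```python
import math
from statistics import NormalDist


def chi2_1_upper_tail(x):
    """P(chi-square with 1 df > x), using the identity with the normal tail."""
    return math.erfc(math.sqrt(x / 2.0))


def baseline_penalty(n):
    """z-score whose upper-tail probability matches the BIC-style tail (needs n > 1)."""
    p = chi2_1_upper_tail(math.log(n))
    return NormalDist().inv_cdf(1.0 - p)  # sign convention is an assumption


def proposal_probability(mu_k, n, rng):
    """Draw z_k ~ N(mu_k - mu_baseline, 1) and map it to a probability."""
    z_k = rng.gauss(mu_k - baseline_penalty(n), 1.0)
    return NormalDist().cdf(z_k)
```

Under this reading the penalty grows with sample size, mirroring the $\ln(n)$ cost per parameter in the BIC.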

Page 17: Integration of biological annotations using hierarchical modeling

Outline

1. Motivation

2. The algorithm: Incorporating biological priors into an MCMC sampler

3. Simulation 1: Performance of the method

4. Simulation 2: Detecting interactions in a known pathway

5. Application to data from a GWAS

6. Future Extensions

Page 18: Integration of biological annotations using hierarchical modeling

Using external information to enhance power and specificity

• Disease model: 4 GxG interactions jointly cause disease through 4 endophenotypes
  - Genotypes simulated for 14 independent SNPs
  - $y_{ik} = (1 - b)\, N(s_{ia} \cdot s_{ib}, 1) + b\, U(0, 1)$
  - $b \sim \mathrm{Bernoulli}(p)$, where $p$ is the proportion of noise
  - 24 endophenotypes $y$ used only in the prior

• Disease status determined using a logistic model
  - $\mathrm{logit}(P(Y_i = 1)) = \beta_0 + \beta_1 y_{i,01} + \beta_2 y_{i,02} + \beta_3 y_{i,34} + \beta_4 y_{i,35}$

• First 8000 persons reserved as case-control dataset, remaining 2000 for constructing priors
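The endophenotype mixture can be simulated with one draw per subject. The sketch below follows the formula for $y_{ik}$ literally; the function name is hypothetical, and any genotype coding passed in as `s_a` and `s_b` (for example 0/1/2 allele counts) is an assumption, since the slide does not state it.

```python
import random


def simulate_endophenotype(s_a, s_b, p_noise, rng):
    """Draw y = (1 - b) * N(s_a * s_b, 1) + b * U(0, 1), with b ~ Bernoulli(p_noise)."""
    is_noise = rng.random() < p_noise
    if is_noise:
        return rng.uniform(0.0, 1.0)   # pure noise, unrelated to genotype
    return rng.gauss(s_a * s_b, 1.0)   # signal centred on the genotype product
```

With `p_noise = 0` every subject's endophenotype tracks the genotype product, while with `p_noise` near 1 the endophenotype carries almost no information about the interaction.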

Page 19: Integration of biological annotations using hierarchical modeling

Constructing the Z and the A matrices

• Z matrix
  - Measures correlation between a model variable and each endophenotype among 2000 individuals in the prior
  - $Z_{kq} = \mathrm{corr}(g_k, y_q)$

• A matrix
  - Measures similarity between two variables by comparing correlation profiles in Z
  - $A_{jk} = \mathrm{corr}(Z_{jq}, Z_{kq})$
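Both matrices follow from repeated Pearson correlations: first across individuals (to fill Z), then across endophenotypes (to fill A). A minimal sketch, assuming `G` and `Y` are lists of per-individual rows; the function names are hypothetical and the dichotomization step is omitted.

```python
import math


def pearson(u, v):
    """Pearson correlation of two equal-length sequences (0.0 if either is constant)."""
    n = len(u)
    mean_u, mean_v = sum(u) / n, sum(v) / n
    cov = sum((a - mean_u) * (b - mean_v) for a, b in zip(u, v))
    sd_u = math.sqrt(sum((a - mean_u) ** 2 for a in u))
    sd_v = math.sqrt(sum((b - mean_v) ** 2 for b in v))
    return cov / (sd_u * sd_v) if sd_u > 0 and sd_v > 0 else 0.0


def build_z(G, Y):
    """Z[k][q] = corr(variable k, endophenotype q) across individuals."""
    n_vars, n_endo = len(G[0]), len(Y[0])
    return [
        [pearson([row[k] for row in G], [row[q] for row in Y]) for q in range(n_endo)]
        for k in range(n_vars)
    ]


def build_a(Z):
    """A[j][k] = corr(row j of Z, row k of Z); the diagonal is set to 0."""
    m = len(Z)
    return [[0.0 if j == k else pearson(Z[j], Z[k]) for k in range(m)] for j in range(m)]
```

Two variables end up strongly connected in A when they correlate with the same endophenotypes in the same direction.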

Page 20: Integration of biological annotations using hierarchical modeling

Question 1: How do the priors affect power and specificity?

• The A matrix contains information across all 24 endophenotypes

• Set up 3 variants of the original Z matrix
  - 4 causal endophenotypes only (noise parameter $p = 0$)
  - 4 intermediate endophenotypes only (noise parameter $p = 0.2$)
  - 4 weakly correlated endophenotypes only (noise parameter $p = 0.8$)

• Models tested: both A and Z, no A or Z, A only, Z only (with 3 variants)

Page 21: Integration of biological annotations using hierarchical modeling

Question 1: How do the priors affect power and specificity?

At RR=1.5, all prior models perform very well

Page 22: Integration of biological annotations using hierarchical modeling

Question 1: How do the priors affect power and specificity?

At RR=1.4, prior models with A, Z, or both outperform others

Page 23: Integration of biological annotations using hierarchical modeling

Question 1: How do the priors affect power and specificity?

At RR=1.3, prior models with A, Z, or both have > 5% power

Page 24: Integration of biological annotations using hierarchical modeling

Question 1: How do the priors affect power and specificity?

At RR=1.2, fully informative prior still retains 80% power

Page 25: Integration of biological annotations using hierarchical modeling

Question 1: How do the priors affect power and specificity?

At RR=1.1, all prior models perform poorly (∼55% power)

Page 26: Integration of biological annotations using hierarchical modeling

Question 2: How do the priors affect posterior estimates (shrinkage)?

Posterior estimates of β vs MLE

Page 27: Integration of biological annotations using hierarchical modeling

Question 2: How do the priors affect posterior estimates (shrinkage)?

Posterior estimates of SE of β vs MLE

Page 28: Integration of biological annotations using hierarchical modeling

Question 3: How do the priors improve rankings?

6,441 interactions tested. 4 causal.

Page 29: Integration of biological annotations using hierarchical modeling

Question 3: How do the priors improve rankings?

513,591 interactions tested. 4 causal.

Page 30: Integration of biological annotations using hierarchical modeling

Summary of simulation

• Sensitivity analysis
  - All methods perform well at high RRs
  - Informative priors improve power at lower RRs but not at extremely low RRs

• Like LASSO, shrinkage improves interpretability

• Model averaging can improve robustness of rankings

Page 31: Integration of biological annotations using hierarchical modeling

Outline

1. Motivation

2. The algorithm: Incorporating biological priors into an MCMC sampler

3. Simulation 1: Performance of the method

4. Simulation 2: Detecting interactions in a known pathway

5. Application to data from a GWAS

6. Future Extensions

Page 32: Integration of biological annotations using hierarchical modeling

Discovering interactions in a known pathway: Folate

Page 33: Integration of biological annotations using hierarchical modeling

Simulated data set

• 14 genes, 2 environmental variables

• 8000 individuals in case-control data, remaining 2000 for constructing priors

• Used a pathway simulation program to generate steady-state concentrations
  - Reed et al., J Nutr. 2006 Oct;136(10):2653-61
  - Enzyme kinetics parameters (Km, Vmax) genotype specific

• 3 mechanisms believed to be related to disease etiology
  - Homocysteine concentration
  - Pyrimidine synthesis
  - Purine synthesis

Page 34: Integration of biological annotations using hierarchical modeling

Estimates of π

• Construct Z and A in same manner as previous simulation:
  - Z stores genotype-metabolite correlations
  - A stores dichotomized correlations between rows of Z

• True log relative risk: 0.18 (RR=1.2)

Table: Second-level coefficients π by simulated mechanism

Simulated mechanism  homocysteine  pyrimidine    purine
homocysteine         0.18(0.13)    -0.09(0.536)  0.002(0.38)
pyrimidine           -0.04(0.22)   0.22(0.066)   -0.01(0.06)
purine               -0.01(0.36)   0.16(0.327)   0.19(0.07)

Page 35: Integration of biological annotations using hierarchical modeling

Comparison of BMA results to stepwise regression

Pyrimidine synthesis

    Interaction  BF    MLE p-value
    FTD*MAT-II   15    0.038
    FTD*MTHFR    20    0.046
    MTCH*MS      534   0.006
    PGT*MS       14    0.018
→   SHMT*CBS     1254  0.133
→   SHMT*Fol     2324  0.036
    TS*MTHFR     227   0.022
→   TS*SHMT      1091  N/S

Page 36: Integration of biological annotations using hierarchical modeling

Pyrimidine synthesis

• SHMT*CBS, SHMT*Fol, SHMT*TS

Page 37: Integration of biological annotations using hierarchical modeling

Comparison of BMA results to stepwise regression

Purine synthesis

    Interaction  BF    MLE p-value
→   MTCH*MS      1130  0.008
→   MTCH*PGT     1416  0.026
→   PGT*CBS      1022  0.069
→   PGT*MS       2851  0.007
→   SHMT*Fol     1398  0.022
    SHMT*MAT-II  646   0.012
    TS*MTHFR     57    0.024

Page 38: Integration of biological annotations using hierarchical modeling

Purine synthesis

• MTCH*MS, MTCH*PGT, PGT*CBS, PGT*MS, SHMT*Fol

Page 39: Integration of biological annotations using hierarchical modeling

Comparison of BMA results to stepwise regression

Homocysteine

    Interaction  BF    MLE p-value
    CBS*MAT-II   77    0.045
→   CBS*Met      1072  N/S
    FTD*MAT-II   38    0.045
    FTD*MTHFR    213   0.015
→   MS*Met       1129  N/S
    MTCH*MS      978   0.006
    PGT*MS       75    0.044
    TS*MTHFR     41    0.022

Page 40: Integration of biological annotations using hierarchical modeling

Homocysteine levels

• CBS*Met, MS*Met

Page 41: Integration of biological annotations using hierarchical modeling

Summary of folate pathway simulation

• Pathway knowledge can inform model search

• Simulated three plausible disease mechanisms

• Effect of causal metabolite on disease revealed in corresponding element of $\pi$

• Revealed plausible interactions not found through a stepwise regression

Page 42: Integration of biological annotations using hierarchical modeling

Outline

1. Motivation

2. The algorithm: Incorporating biological priors into an MCMC sampler

3. Simulation 1: Performance of the method

4. Simulation 2: Detecting interactions in a known pathway

5. Application to data from a GWAS

6. Future Extensions

Page 43: Integration of biological annotations using hierarchical modeling

Using gene annotations to inform a search for interactions

• Proof of concept: GWAS of breast cancer

• Publicly available data from NCI (https://caintegrator.nci.nih.gov/cgems/)

• 1,145 cases and 1,142 controls of European ancestry

• The 22 Gene Ontology terms from Biological Process used to define priors in A and Z

• Included 6,078 SNPs, where each SNP had GO annotation and had lowest p-value in gene

Page 44: Integration of biological annotations using hierarchical modeling

Top 10 interactions found

               Non-inf prior        Inf prior
Interaction    β(SE)        BF      β(SE)        BF
PARK2*SORCS1   0.22(0.06)   1e4     0.27(0.06)   5e4
AK5*ARHGAP26   0.16(0.05)   427     0.17(0.05)   903
FGFR2*MAML2    -0.11(0.04)  1       -0.16(0.05)  686
SHC3*KIF13B    N/A          N/A     0.17(0.05)   621
PCLO*ME3       N/A          N/A     0.18(0.05)   528
CNGA3*CNN1     -0.16(0.05)  41      -0.17(0.05)  462
FGFR2*CDT1     N/A          N/A     -0.16(0.05)  445
SHC3*CXCL16    N/A          N/A     -0.18(0.05)  403
FGFR2*ABCA1    -0.1(0.05)   158     -0.11(0.05)  268
CYP2J2*SORCS1  -0.11(0.05)  74      -0.14(0.05)  266
FGFR2*SCG5     N/A          N/A     0.21(0.05)   235

Page 45: Integration of biological annotations using hierarchical modeling

Enrichment analysis

• Are the top interactions (BF > 100) enriched for certain GO terms?

• Compute empiric p-value for enrichment
  - For each iteration, permute within bins representative of non-independence in observed interactions
  - Pool bins, compute frequency of a GO term in the pool
  - p-value: number of iterations in which the frequency exceeded the observed frequency, divided by 1 million

• biological regulation (p=.008), growth (p=1e−6), metabolic process (p=.008), and regulation of biological process (p=.003).
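The empirical p-value procedure can be sketched as a within-bin permutation test. The data layout below is a simplifying assumption rather than the authors' implementation: each bin is a list of `(is_top, has_term)` pairs, annotation labels are shuffled within bins only, and the pooled count of top interactions carrying the term is compared with the observed count. The function name is hypothetical.

```python
import random


def empirical_p_value(bins, n_iter, rng):
    """Fraction of permutations whose pooled term count exceeds the observed count.

    bins: list of bins; each bin is a list of (is_top, has_term) pairs.
    Labels are shuffled within each bin so that between-bin structure is preserved.
    """
    observed = sum(
        1 for members in bins for is_top, has_term in members if is_top and has_term
    )
    n_exceed = 0
    for _ in range(n_iter):
        pooled = 0
        for members in bins:
            labels = [has_term for (_is_top, has_term) in members]
            rng.shuffle(labels)
            pooled += sum(
                1 for (is_top, _old), label in zip(members, labels) if is_top and label
            )
        if pooled > observed:
            n_exceed += 1
    return n_exceed / n_iter
```

The slide divides by 1 million iterations; a smaller `n_iter` runs faster but limits the smallest p-value that can be resolved.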

Page 46: Integration of biological annotations using hierarchical modeling

Outline

1. Motivation

2. The algorithm: Incorporating biological priors into an MCMC sampler

3. Simulation 1: Performance of the method

4. Simulation 2: Detecting interactions in a known pathway

5. Application to data from a GWAS

6. Future Extensions

Page 47: Integration of biological annotations using hierarchical modeling

Incorporate gene-expression data into GWAS analyses

• Developing priors
  - Should be more informative (e.g. empirical) and granular (e.g. SNP level) than GO
  - Obtain genotype-expression paired data: HapMap?
  - Apply WGCNA to infer pathway modules
  - Genotype-module correlations used in Z matrix

• Incorporate more advanced MCMC techniques
  - Evolutionary Monte Carlo
  - Multiple-try Metropolis
  - Brute-force search for MAP. Use MAP for initial values?

Page 48: Integration of biological annotations using hierarchical modeling

Acknowledgements

• James Baurley

• David Conti

• Angela Presson (thanks in advance!)

• Funding: R01 ES016813 and R01 ES015090.