Gene Mapping with Bayesian Variable Selection and MCMC
Michael Swartz, mswartz@stat.tamu.edu

Page 1: Gene Mapping with Bayesian Variable Selection and MCMC Michael Swartz mswartz@stat.tamu.edu

Outline

• Intro to Genetics
• Intro to Gene mapping, Association studies
• The Conditional logistic regression model for Gene mapping
• Bayesian Model Selection
  • Stochastic Search Variable Selection
  • Stochastic Search Gene Suggestion (SSGS)
• Performance on Simulated Data
  • SSGS vs the MLE

Intro to Genetics

Picture book of Genetics

Chromosomes: Line up genes

Gene: A specific coding region of DNA

Locus: a gene’s position

Alleles:

Genotype: Both

Molecular Marker: A polymorphic locus with a known position on the chromosome

Haplotype: One

Linkage

• Violates Mendel’s Second law: Genes segregate independently
• Genes that co-segregate in the recombinant gametes are linked.
• Biological source of linkage: Meiosis, the process of cell division that produces haploid gametes.
• Allows us to measure genetic distance

Linkage Disequilibrium

Association of alleles in a population

Gene Mapping: Association Studies

Data: The Case-Parent Triad

Collect haplotype information on the parents (G) as well as the case (g), so that we know which haplotypes were transmitted and which were not. Model the probability of transmission.

Gene Mapping By Association

• Transmission Disequilibrium Test (TDT)
  • Uses transmitted and non-transmitted alleles in case-parent triads to jointly test for linkage and linkage disequilibrium
  • Based on McNemar’s test for case-control data
  • Tests for association between two loci at a time
• Log-linear models
  • Also used for case-control data
  • TDT triads can be modeled with Conditional Logistic Regression for case-control data (Self et al., 1991; Thomas et al., 1995)
  • Extends the TDT to multiple loci
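As a rough illustration of the McNemar form behind the TDT, the sketch below computes the one-degree-of-freedom chi-square statistic from counts of heterozygous parents who did and did not transmit a given allele. The function name `tdt_statistic` and the counts are made up for illustration; they are not from the slides.

```python
def tdt_statistic(transmitted: int, not_transmitted: int) -> float:
    """McNemar-type TDT chi-square (1 df) from heterozygous-parent counts.

    transmitted:     heterozygous parents who passed the allele to the case
    not_transmitted: heterozygous parents who passed the other allele
    """
    total = transmitted + not_transmitted
    if total == 0:
        raise ValueError("no informative (heterozygous) parents")
    return (transmitted - not_transmitted) ** 2 / total


# Hypothetical counts: 60 transmissions vs 40 non-transmissions.
stat = tdt_statistic(60, 40)
print(stat)  # (60 - 40)^2 / 100 = 4.0
```

Only heterozygous parents enter the statistic, because a homozygous parent transmits the same allele either way and carries no information about preferential transmission.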

Advantages to a log-linear model

Using a Bayesian model we can incorporate genetic association between the markers.

• Easy to analyze multiple loci
• Easy to consider Gene X Gene interactions
• Easy to consider haplotypes
• Easy to consider environmental effects
• Easy to consider Gene X Environment effects

Coding the Triads (Thomas et al., 1995; Schaid 1996)

Ex: 3 diallelic loci. Recall $g_{ip}$ and $G^T_{ip}$ from the case-parent triad.

[Example: indicator-coded haplotypes $g_{im}$ and $g_{if}$; entries as extracted: 10, 01, 01, 01, 10, 01]

For the Logistic Regression model we use $Z_i = g_{im} + g_{if}$.

This is known as the GTDT coding scheme (Schaid 1996).

Using Haplotypes in Conditional Logistic Regression is one way to examine Complex Diseases using Triads.
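The coding step can be sketched as follows, assuming each haplotype is stored as a flat list of 0/1 allele indicators (two entries per diallelic locus). The helper name `gtdt_code` and the example haplotypes are hypothetical.

```python
def gtdt_code(g_mother, g_father):
    """Return Z = g_m + g_f, the elementwise sum of two indicator-coded haplotypes."""
    if len(g_mother) != len(g_father):
        raise ValueError("haplotypes must have the same length")
    return [m + f for m, f in zip(g_mother, g_father)]


# Hypothetical haplotypes at 3 diallelic loci; each locus uses two
# indicator entries (allele 1, allele 2).
g_im = [1, 0,  0, 1,  0, 1]
g_if = [0, 1,  1, 0,  0, 1]
z_i = gtdt_code(g_im, g_if)
print(z_i)  # [1, 1, 1, 1, 0, 2]
```

Each entry of the result counts how many copies of that allele the child carries (0, 1, or 2).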

Sampling Distribution for Triads

$$
\begin{aligned}
P(D \mid g) &= P(D \mid g, G_m, G_f) \\
&= \frac{P(g \mid D, G_m, G_f)\, P(D \mid G_m, G_f)}{P(g \mid G_m, G_f)} \\
&\propto P(g \mid D, G_m, G_f)
\end{aligned}
$$

The Sampling distribution: a Conditional Logistic Function (Thomas et al., 1995; Self et al., 1991)

$$
P(g_i \mid G_{im}, G_{if}, D_i) = \frac{P(D_i \mid g_i)}{\sum_{g^* \in G^*} P(D_i \mid g^*)} = \frac{RR(g_{im}, g_{if})}{\sum_{(g^*_{im}, g^*_{if}) \in G^*} RR(g^*_{im}, g^*_{if})}
$$

where $G^*$ is the set of all possible transmitted genotypes given the parents’ genotypes (“Pseudo-Controls”):

$$
G^* = \{ (G^0_{im}, G^0_{if}),\ (G^1_{im}, G^0_{if}),\ (G^0_{im}, G^1_{if}),\ (G^1_{im}, G^1_{if}) \}
$$

and

$$
RR(g_m, g_f) = \exp\left( \sum_{l=1}^{L} \sum_{a=1}^{A_l - 1} \beta_{la}\, g_{la} \right)
$$
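A minimal sketch of the conditional logistic function, assuming each parent is given as a pair of indicator-coded haplotypes and each possible child is coded as the sum of one maternal and one paternal haplotype. The function names, the single-locus example, and the coefficient value are illustrative assumptions.

```python
import math
from itertools import product


def relative_risk(z, beta):
    """RR = exp(sum_k beta_k * z_k) for a coded genotype z."""
    return math.exp(sum(b * x for b, x in zip(beta, z)))


def transmission_probabilities(mother_haps, father_haps, beta):
    """Probability of each of the 4 pseudo-control genotypes given the parents.

    mother_haps, father_haps: pairs of indicator-coded haplotypes.
    Each child genotype is coded as Z = g_m + g_f.
    """
    children = [
        [m + f for m, f in zip(gm, gf)]
        for gm, gf in product(mother_haps, father_haps)
    ]
    risks = [relative_risk(z, beta) for z in children]
    total = sum(risks)
    return children, [r / total for r in risks]


# Hypothetical single diallelic locus, one coefficient for the non-reference allele.
mother = ([1], [0])   # carries one copy of the risk allele
father = ([0], [0])   # carries none
beta = [math.log(2.0)]
children, probs = transmission_probabilities(mother, father, beta)
print(children)  # [[1], [1], [0], [0]]
print(probs)     # approximately [1/3, 1/3, 1/6, 1/6]
```

The denominator sums over all four children the parents could have produced, so the probabilities always sum to one within a family.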

Identifiability for Conditional Logistic Regression Parameters

Gene Mapping with Conditional Logistic Regression (CLR) uses categorical covariates (genotype or haplotype)

For identifiability, we must define a reference category for each locus

Choose the most prevalent allele at each locus as its reference allele.

Calculating Prevalence from Triads (Thomas et al., 1995)

Let $C_{la}$ denote the number of haplotypes in the cases that carry allele a at locus l.

Likewise, let $P_{la}$ denote the number of haplotypes in the parents that carry allele a at locus l.

If N denotes the total number of triads, then the prevalence of allele a at locus l can be calculated as $(P_{la} - C_{la}) / 2N$, i.e. its frequency among the 2N non-transmitted parental haplotypes.
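The prevalence formula and the reference-allele rule can be sketched together, assuming per-locus allele counts are kept in dictionaries. The function names and all counts below are hypothetical.

```python
def allele_prevalence(parent_counts, case_counts, n_triads):
    """Prevalence of each allele at one locus: (P_la - C_la) / (2N)."""
    return {
        allele: (parent_counts[allele] - case_counts.get(allele, 0)) / (2 * n_triads)
        for allele in parent_counts
    }


def reference_allele(prevalence):
    """Pick the most prevalent allele as the reference category."""
    return max(prevalence, key=prevalence.get)


# Hypothetical counts at one locus from N = 100 triads:
# 400 parental haplotypes, 200 case (transmitted) haplotypes.
parents = {"a1": 240, "a2": 120, "a3": 40}
cases = {"a1": 110, "a2": 70, "a3": 20}
prev = allele_prevalence(parents, cases, 100)
print(prev)                    # {'a1': 0.65, 'a2': 0.25, 'a3': 0.1}
print(reference_allele(prev))  # a1
```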

Using CLR to infer genes

Frequentist
Make inference on the Maximum Likelihood Estimates for the parameters in the CLR model.

• Requires numerical optimization

• Prepackaged in STATA clogit command.

Bayesian
Calculate the Posterior Distribution and make inference from the appropriate summaries

• Requires Markov Chain Monte Carlo posterior simulation

• Implemented in Stochastic Search Gene Suggestion (SSGS)

Bayesian Model Selection

Hierarchical Bayesian setup for Variable Selection

Use a Hierarchical Bayesian method:

$$
P(\gamma, \beta(\gamma) \mid \text{Data}) \propto P(\gamma)\, P(\beta(\gamma) \mid \gamma)\, f(\text{Data} \mid \beta(\gamma), \gamma)
$$

$\gamma$ is an indicator vector of the variables, and $\beta(\gamma)$ is the vector of coefficients for model $\gamma$.

Make inferences from the variable posterior:

$$
P(\gamma \mid \text{Data}) = \int P(\gamma, \beta(\gamma) \mid \text{Data})\, d\beta(\gamma)
$$

Advantages to Bayesian Hierarchical Modeling

• Account for prior information
• Allow for Bayesian Variable Selection Techniques
• Make inference from model posterior
• No multiple testing because discussing pure probabilities

Stochastic Search Variable Selection (George and McCulloch, 1993)

Linear Regression: Introduce a latent variable $\gamma$ to indicate a covariate’s importance.

Hierarchy: allows prior information to enter the model and be updated by the data
• Likelihood: $Y \mid \beta, \sigma^2 \sim N_n(X\beta, \sigma^2 I)$
• Model Prior: $\gamma \sim \text{Binomial}(p)$
• Parameter Priors:
  • $\beta \mid \gamma \sim N_p(0, D_\gamma R D_\gamma)$
  • $\sigma^2 \mid \gamma \sim IG(\nu/2, \nu\lambda/2)$, i.e. $\nu\lambda/\sigma^2 \sim \chi^2_\nu$

Stochastic Search Variable Selection (Continued)

$$
P(\gamma, \beta(\gamma), \sigma^2 \mid \text{Data}) \propto P(\gamma)\, P(\beta(\gamma) \mid \gamma)\, P(\sigma^2 \mid \gamma)\, f(\text{Data} \mid \beta(\gamma), \sigma^2)
$$

• Full conditionals for $\beta$, $\gamma$, and $\sigma^2$ are recognizable, so Gibbs sampling can be used
• Generalized to various GLMs (George, McCulloch, and Tsay, 1996; Ntzoufras, Forster, and Dellaportas, 2000; and a few others).

Stochastic Search Gene Suggestion

• Extends Stochastic Search Variable Selection (George and McCulloch, 1993)
• Introduces two latent variables to indicate a gene’s importance in the model: one for loci and one for alleles.
• Induces a hierarchy that allows prior information about genes to enter the model
  • Genetic structure
  • Genetic correlation
• The hierarchical nature allows the data to update the probability of including a particular gene

Priors for Gene Suggestion

Use two priors for gene suggestion

• One indicator vector for locus selection: $\gamma = (\gamma_1, \ldots, \gamma_L)$,

$$
\gamma \sim \prod_{l=1}^{L} \text{Bernoulli}(p_l)
$$

  where $p_l$ = P(Locus l is associated with the disease)

• One indicator vector for allele selection given each locus: $\lambda$. Each element $\lambda_{[la]}$ pertains to a particular allele at locus l.

$$
\lambda \sim \prod_{l=1}^{L} \prod_{a=1}^{A_l - 1} \text{Bernoulli}(q_{la})
$$

  where $q_{la}$ = P(Allele a at locus l causes disease)

Prior for allele main effects ($\beta \mid \gamma, \lambda$): Allelic dependence in model selection

Prior for main effects models the genetic dependencies between loci and alleles

$$
\beta \mid \gamma, \lambda \sim \text{MVN}(0, DRD)
$$

where

$$
D = \text{Diag}(k_{11}, \ldots, k_{1A_1}, \ldots, k_{L1}, \ldots, k_{LA_L})
$$

with each $k_{la}$ defined as

$$
k_{la} = \begin{cases} \tau_{la} & \text{if } \gamma_l \lambda_{la} = 0 \\ c_{la}\tau_{la} & \text{if } \gamma_l \lambda_{la} = 1 \end{cases}
$$

How SSGS works

• Exploits the MVN covariance matrix DRD (George and McCulloch, 1993)
  • If $\gamma_l \lambda_{la} = 0$, then $\tau_{la}$ focuses the probability of $\beta_{la}$ around 0
  • If $\gamma_l \lambda_{la} = 1$, then $\tau_{la} c_{la}$ expands the probability of $\beta_{la}$ to cover reasonable values
• Automatic methods for choosing $\tau$ and c in paper
• Subjectively
  • choose $\tau_{la}$ such that $-3\tau_{la} < \beta_{la} < 3\tau_{la}$ implies $\beta_{la} = 0$
  • choose $c_{la}$ such that $3\tau_{la} c_{la}$ covers reasonable values for $\beta_{la}$
• Model information contained in $P(\gamma, \lambda \mid \text{Data})$
• R based on Linkage Disequilibrium can be helpful for gene mapping

The Prior Covariance Matrix

• Define the diagonal blocks $\{\Sigma_{l_i l_i}\}$ of $\Sigma_L$ using the covariance for a multinomial distribution with the allele frequencies, assuming they are constant across generations.
• Determine the off-diagonal blocks $\{\Sigma_{l_i l_j}\}_{i \neq j}$ using the allelic disequilibrium between the alleles at locus i and locus j: $D_{a_i b_j} = p_{a_i b_j} - p_{a_i} p_{b_j}$.
• Define $R = \Sigma_L^{-1}$
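The two kinds of blocks can be sketched as below, using the standard multinomial covariance for the within-locus block and the disequilibrium coefficient (haplotype frequency minus the product of allele frequencies) for the between-locus block. The function names and all frequencies are hypothetical, and all alleles are kept for simplicity rather than dropping a reference allele.

```python
def multinomial_block(freqs):
    """Within-locus covariance of allele indicators: p_a(1 - p_a) on the
    diagonal and -p_a * p_b off the diagonal."""
    n = len(freqs)
    return [
        [freqs[a] * (1 - freqs[a]) if a == b else -freqs[a] * freqs[b]
         for b in range(n)]
        for a in range(n)
    ]


def disequilibrium_block(joint, freqs_i, freqs_j):
    """Between-locus block with entries D_ab = p_ab - p_a * p_b."""
    return [
        [joint[a][b] - freqs_i[a] * freqs_j[b] for b in range(len(freqs_j))]
        for a in range(len(freqs_i))
    ]


# Hypothetical allele and haplotype frequencies for two diallelic loci.
p_i = [0.6, 0.4]
p_j = [0.7, 0.3]
p_ij = [[0.5, 0.1],
        [0.2, 0.2]]   # rows: alleles at locus i, columns: alleles at locus j
diag_block = multinomial_block(p_i)
off_block = disequilibrium_block(p_ij, p_i, p_j)
print(diag_block)
print(off_block)
```

When the two loci are in linkage equilibrium, every haplotype frequency equals the product of its allele frequencies and the off-diagonal block is all zeros.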

Sampling from the Posterior

$$
P(\gamma, \lambda, \beta \mid g, G_m, G_f, D) \propto p(g \mid \beta, G_m, G_f, D)\, \pi(\beta \mid \gamma, \lambda)\, \pi(\gamma)\, \pi(\lambda)
$$

• No full conditional for updating $\beta$
• Use Hybrid Gibbs sampling and Metropolis-Hastings Algorithm to construct a Markov Chain.
  • Full conditionals for updating $\gamma$ and $\lambda$
  • Metropolis-Hastings acceptance ratio for updating $\beta$ by locus
• For a given model, sample $\beta$ repeatedly from Metropolis-Hastings before proposing a new model
  • Even model iterations generated by independence MLE proposal
  • Odd model iterations generated by random walk proposal

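Gibbs Sampling Component

$$
\begin{aligned}
P(\gamma_i = 1 \mid \gamma_{(-i)}, \lambda, \beta, g, G_m, G_f) &= P(\gamma_i = 1 \mid \gamma_{(-i)}, \lambda, \beta) = \frac{a_1}{a_0 + a_1} \\
a_1 &= f(\beta \mid \gamma_i = 1, \gamma_{(-i)}, \lambda)\, f(\gamma_{(-i)}, \gamma_i = 1) \\
a_0 &= f(\beta \mid \gamma_i = 0, \gamma_{(-i)}, \lambda)\, f(\gamma_{(-i)}, \gamma_i = 0)
\end{aligned}
$$

$$
\begin{aligned}
P(\lambda_i = 1 \mid \lambda_{(-i)}, \gamma, \beta, g, G_m, G_f) &= P(\lambda_i = 1 \mid \lambda_{(-i)}, \gamma, \beta) = \frac{b_1}{b_0 + b_1} \\
b_1 &= f(\beta \mid \lambda_i = 1, \lambda_{(-i)}, \gamma)\, f(\lambda_{(-i)}, \lambda_i = 1) \\
b_0 &= f(\beta \mid \lambda_i = 0, \lambda_{(-i)}, \gamma)\, f(\lambda_{(-i)}, \lambda_i = 0)
\end{aligned}
$$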
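To make the ratio form of the indicator update concrete, here is a simplified sketch for a single coefficient with a single indicator under an independent prior (R = I), where the two densities reduce to univariate normals with a narrow and a wide standard deviation. This is a simplification of the two-indicator SSGS update, not the author's implementation; the function names and coefficient values are assumptions.

```python
import math
import random


def normal_pdf(x, sd):
    """Density of N(0, sd^2) at x."""
    return math.exp(-0.5 * (x / sd) ** 2) / (sd * math.sqrt(2 * math.pi))


def inclusion_probability(beta_i, tau, c, p):
    """P(indicator = 1 | beta_i) = a1 / (a0 + a1) under an independent prior."""
    a1 = normal_pdf(beta_i, c * tau) * p          # wide component: sd = c * tau
    a0 = normal_pdf(beta_i, tau) * (1 - p)        # narrow component: sd = tau
    return a1 / (a0 + a1)


def gibbs_update_indicator(beta_i, tau, c, p, rng=random):
    """Draw the indicator from its Bernoulli full conditional."""
    return 1 if rng.random() < inclusion_probability(beta_i, tau, c, p) else 0


# tau, c, and p are set to 0.2, 10, and 0.5; the beta values are made up.
print(inclusion_probability(0.0, 0.2, 10, 0.5))  # small: beta sits near 0
print(inclusion_probability(2.0, 0.2, 10, 0.5))  # near 1: beta is far from 0
```

A coefficient close to zero is far more likely under the narrow component, so its indicator is usually drawn as 0; a coefficient many narrow standard deviations from zero is drawn as 1 almost surely.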

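Metropolis Hastings Component (by locus)

MH Ratio:

$$
\frac{p(\beta^*_l \mid \beta_{(-l)}, \gamma, \lambda)\, L(\beta^*_l, \beta_{(-l)})\, q(\beta^{t-1}_l \mid \beta^*_l)}{p(\beta^{t-1}_l \mid \beta_{(-l)}, \gamma, \lambda)\, L(\beta^{t-1}_l, \beta_{(-l)})\, q(\beta^*_l \mid \beta^{t-1}_l)}
$$

Two different proposal Distributions:

• MLE independence proposal conditional on other loci:

$$
q(\beta^*_l \mid \beta^{t-1}_l) = \text{MVN}(\hat\beta_{l \mid (-l)},\ D_l H^{-1}_{l \mid (-l)} D_l)
$$

• Random Walk symmetric proposal conditional on other loci:

$$
q(\beta^*_l \mid \beta^{t-1}_l) = \text{MVN}(\beta^{t-1}_l,\ D_l H^{-1}_{l \mid (-l)} D_l)
$$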
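A generic random-walk Metropolis-Hastings update for one block of coefficients might look like the sketch below. It uses an isotropic normal proposal and a toy log target in place of the likelihood-times-prior and the information-based proposal covariance, so it illustrates the accept/reject mechanics only; all names and settings are assumptions.

```python
import math
import random


def metropolis_hastings_step(current, log_target, proposal_sd, rng=random):
    """One random-walk Metropolis-Hastings update of a coefficient block.

    With a symmetric proposal the q terms cancel, so the acceptance ratio
    is target(proposed) / target(current).
    """
    proposed = [b + rng.gauss(0.0, proposal_sd) for b in current]
    log_ratio = log_target(proposed) - log_target(current)
    if rng.random() < math.exp(min(0.0, log_ratio)):
        return proposed, True
    return current, False


# Toy target standing in for likelihood x prior: independent N(1, 1) per coefficient.
def log_target(beta):
    return -0.5 * sum((b - 1.0) ** 2 for b in beta)


rng = random.Random(0)
beta = [0.0, 0.0]
draws = []
for _ in range(5000):
    beta, _accepted = metropolis_hastings_step(beta, log_target, 0.8, rng)
    draws.append(beta[0])
print(sum(draws) / len(draws))  # should be close to 1
```

For an independence proposal the q terms no longer cancel and must be kept in the ratio.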

SSGS Flow Chart

Finding Genes

• Using a Bayesian Model, we simply summarize the posterior in a meaningful way
• The MCMC sample is a large sample from our posterior
• Thus we can summarize a gene’s importance by using the marginal posterior probability of inclusion for each gene:

$$
p(\beta_{la} \text{ is important}) = \frac{1}{\#\,\text{iterations}} \sum_{\text{all iterations}} \lambda_{la}
$$

• Use the median model threshold: $P(\lambda_{la}) > 0.5$
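The posterior summary amounts to averaging the sampled 0/1 indicators over iterations and keeping the genes above 0.5. A minimal sketch, with made-up indicator draws and hypothetical function names:

```python
def inclusion_probabilities(indicator_draws):
    """Fraction of MCMC iterations in which each indicator equals 1."""
    n_iter = len(indicator_draws)
    n_genes = len(indicator_draws[0])
    return [sum(draw[k] for draw in indicator_draws) / n_iter for k in range(n_genes)]


def median_model(probabilities, threshold=0.5):
    """Indices of genes whose inclusion probability exceeds the threshold."""
    return [k for k, p in enumerate(probabilities) if p > threshold]


# Hypothetical indicator draws for 3 genes over 4 iterations.
draws = [
    [1, 0, 1],
    [1, 0, 0],
    [1, 1, 0],
    [0, 0, 1],
]
probs = inclusion_probabilities(draws)
print(probs)                # [0.75, 0.25, 0.5]
print(median_model(probs))  # [0]
```

The threshold is strict, so a gene included in exactly half of the iterations is not suggested.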

Simulating Data

Simulated Data

• Used genetic data simulated for Genetic Analysis Workshop 12 (GAW12)
• Used Chromosome 1 from isolated population
• Microsatellite markers simulated 1 cM apart, with 4-16 alleles
• Simulated without influence from selection
• Reference: Wijsman, E.M., Almasy, L., Amos, C.I., Borecki, I., Falk, C.T., King, T.M., Martinez, M.M., Meyers, D., Neuman, R., Olson, J.M., Rich, S., Spence, M.A., Thomas, D.C., Vieland, V.J., Witte, J.S., MacCluer, J.W. (2001) Genetic Analysis Workshop 12: Analysis of Complex Genetic Traits: Applications to Asthma and Simulated Data. Genet Epidemiol 21(suppl 1):S1-S853

Using GAW 12 Data: Model Simulation

• Simulate directly from model
  • Use the conditional logistic regression function to determine probability of transmission of the genes
    • The parents determine the 4 possible children
    • Treat each child as a category in a multinomial distribution
    • Calculate the probability of each child using a conditional logistic regression function with specified $\beta$’s
    • Draw 1 sample from the corresponding multinomial distribution to determine the affected genotype for the triad.
• Know the right answers for $\beta$
• Analyze the data twice
  • Independent: R = I
  • Dependent: R based on HWE & LD
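The simulation recipe (enumerate the four possible children, weight each by its relative risk, draw one) can be sketched as follows. The haplotype coding, the function name `simulate_case`, and the coefficient value are illustrative assumptions rather than the settings used in the talk.

```python
import math
import random
from itertools import product


def simulate_case(mother_haps, father_haps, beta, rng=random):
    """Draw the affected child's haplotype pair from the 4 possible children."""
    children = list(product(mother_haps, father_haps))
    weights = [
        math.exp(sum(b * (m + f) for b, m, f in zip(beta, gm, gf)))
        for gm, gf in children
    ]
    # random.choices normalizes the weights, i.e. one multinomial draw.
    return rng.choices(children, weights=weights, k=1)[0]


# Hypothetical parents at one diallelic locus; beta is a made-up log relative risk.
mother = ((1,), (0,))
father = ((1,), (0,))
rng = random.Random(1)
cases = [simulate_case(mother, father, [1.5], rng) for _ in range(2000)]
carriers = sum(1 for gm, gf in cases if gm[0] + gf[0] > 0)
print(carriers / len(cases))  # most simulated cases carry the risk allele
```

Because the coefficients are chosen by the simulator, the analysis results can later be checked against known true values.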

Simulation 1: Model Simulation

• 3 loci with a total of 20 alleles, close together
  • A1 = 4; A2 = 11; A3 = 5
  • GAW 12 Chromosome 1 Loci 9, 11, and 12
• Genetic Covariance Present
  • Average D’ for 3 loci span from 0.133 to 0.256
  • 90% of || ∈ [0.005, 0.386]; median = 0.012
• True Model: g2, g14, g16
• True Betas: {$\beta_2 = 2.74$, $\beta_{14} = 3.63$, $\beta_{16} = 4.39$; $\beta_{-(2,14,16)} = 0$}
• 200,000 iterations

Running STATA

• Data was collected in triads
• STATA needs pseudocontrols enumerated
• Assuming no recombination, construct each Z vector (sum of the haplotypes) of the possible children given the parents
• Obtain MLE and confidence intervals: Run clogit on the data stratified by family (only the 4 children are present in each stratum)

Preparing for SSGS

• Label the haplotypes in the parents as transmitted or non-transmitted
• Calculate the MLEs and Fisher’s information using STATA to define the proposal distribution for even iterations
• Define the initial values for $\beta$ (MLE), $\gamma$ (= 1), and $\lambda$ (= 1)

Simulation 1: Model Simulation

Independent Prior
• p = q = 0.5
• $\tau$ = 0.2, c = 10
• None of the $\beta$’s failed the Heidelberger and Welch test for stationarity
• Total models visited: 302

Dependent Prior
• p = q = 0.5
• $\tau$ = 0.2, c = 10
• None of the $\beta$’s failed the Heidelberger and Welch test for stationarity
• Total models visited: 6046

Simulation 1: Suggested Genes

[Figure: bar chart on a 0 to 1 axis for genes g1 to g17, with series DP and IP; g2, g14, and g16 are marked with *]

Method              Suggested Genes
Dependent Prior     g2, g14, g16
Independent Prior   g2, g14, g16
MLE (95% CI)        g2, g4, g5, g10, g12, g14, g16

Simulation 1: Estimation Intervals

Using GAW 12 Data: Disease Simulation

• Simulate a disease
  • Pick alleles at a marker that cause the disease
  • Simulate disease based on a determined penetrance (P(D|genes)), sporadic risk (P(D|normal)), and dominance
• Know which alleles should be suggested by SSGS, but not the true $\beta$’s
• Analyze the data twice
  • Dependent: R based on HWE & LD
  • Independent: R = I

Simulation 2: Simulated Disease

• 3 loci from GAW 12 chromosome 1: Locus 1 (A1 = 6), Locus 2 (A2 = 8), Locus 8 (A8 = 4)
• Genetic Correlation:
  • Average D’ values span from 0.084 to 0.29
  • 90% of || ∈ [0.0003, 0.259]; median = 0.005
• Penetrances:
  • P(D|L1a3,L1a3) = 0.4
  • P(D|L8a2,L8a2) = 0.6
  • P(D|L8a4,L8a4) = 0.4
  • P(D|L8a2,L8a4) = 0.5
  • P(D|any other genes) = 0.05
• True model: g3, g14, g15
• 200,000 iterations

Simulation 2: Suggested Genes

Method              Suggested Genes
Dependent Prior     g3, g14, g15
Independent Prior   g13, g14, g15 (missed g3)
MLE (95% CI)        g1, g3, g13, g14, g15

[Figure: bar chart on a 0 to 1 axis for genes g1 to g15, with series DP and IP; g3, g14, and g15 are marked with *]

Sensitivity Analysis

What we learned Today

• Extending the TDT to a conditional logistic regression model has many advantages:
  • analyze multiple loci
  • Bayesian setting can incorporate genetic association
  • and more!
• We can find genes using Maximum likelihood estimation and inference for the parameters of the CLR model using STATA
• We can improve the estimates of MLE by using SSGS with a prior that accounts for genetic association
• SSGS has some sensitivity to prior: lower prior, fewer genes

References

Barbieri, M.M., and Berger, J.O. (2004) Optimal Predictive Model Selection. Annals of Statistics 32, to appear.

Schaid, D. (1996) General Score Tests for Associations of Genetic Markers with Disease Using Cases and Their Parents. Genetic Epidemiology, pp. 423-449.

Self, S.G., et al. (1991) On estimating HLA/disease association with applications to a study of Aplastic Anemia. Biometrics, pp. 53-61.

Thomas, D.C., et al. (1995) Variation in HLA-associated risks of Childhood Insulin Dependent Diabetes in the Finnish population: II. Haplotype Effects. Genetic Epidemiology, pp. 455-466.

SSGS dissertation: https://epi.mdanderson.org/~mswartz/

Papers Extending SSVS

Chipman, H. (1996) Bayesian variable selection with related predictors. The Canadian Journal of Statistics, pp. 17-36.

George, E.I., McCulloch, R.E., and Tsay, R.S. (1996) Two approaches to Bayesian model selection with applications. Bayesian Analysis in Econometrics and Statistics: Essays in Honor of Arnold Zellner (Eds. D.A. Berry, K.A. Chaloner, and J.K. Geweke). New York: Wiley, pp. 339-348.

Ntzoufras, I., Forster, J.J., and Dellaportas, P. (2000) Stochastic Search Variable Selection for Log-Linear Models. Journal of Statistical Computation and Simulation, pp. 23-37.