Fei Zou Department of Biostatistics Email: fzou@bios.unc

Preview:

DESCRIPTION

Bayesian Variable Selection in Semiparametric Regression Modeling with Applications to Genetic Mappping. Fei Zou Department of Biostatistics Email: fzou@bios.unc.edu. Outline. Introduction Experimental crosses Existing QTL Mapping Methods Bayesian semi-parametric QTL Mapping Results - PowerPoint PPT Presentation

Citation preview

Bayesian Variable Selection in Semiparametric Regression

Modeling with Applications to Genetic Mappping

Fei ZouDepartment of BiostatisticsEmail: fzou@bios.unc.edu

Outline

• Introduction– Experimental crosses– Existing QTL Mapping Methods

• Bayesian semi-parametric QTL Mapping• Results• Remarks and Conclusions

http://www.cs.unc.edu/Courses/comp590-090-f06/Slides/CSclass_Threadgill.ppt

Overview

• One gene one trait: very unlikely• The vast majority of biological traits are

caused by complex polygenes– Potentially interacting with each other

• Most traits have significant environmental exposure components– Potentially interacting with polygenes

ParentsP1

P2

Experimental Crosses: F2

Experimental Crosses

• F2 Backcross(BC)

F1 F1

P1 P2

F2: BC:

P1 P2

P1 F1

AA BB

AB AB

BB AB AB

AA BB

AA AB

AB AA AB

QTL Data Format

0: homozygous AA, 2: homozygous BB, 1: heterozygote AB.Marker positions:

Linkage Analysis

• Data structure: – Marker data (genotypes plus positions)– Phenotypic trait(s)– Other nongenetic covariates, such as age,

gender, environmental conditions etc• Quantitative trait loci (QTL): a particular

region of the genome containing one or more genes that are associated with the trait being assayed or measured

QTL Mapping of Experimental Crosses

• Single QTL Mapping• Single marker analysis (Sax, 1923 Genetics)• Interval mapping: Lander & Botstein (1989,

Genetics)• Multiple QTL mapping

• Composite interval mapping (Zeng 1993 PNAS, 1994 Genetics; Jansen & Stam, 1994 Genetics)

• Multiple interval mapping (Kao et al., 1999 Genetics)

• Bayesian analysis (Satagopan et al., 1997 Genetics)

Single QTL Interval Mapping

• For backcross, the model assumes

• QTL analysis:

• If QTL genotypes are observed, the analysis is trivial: simple t-test!

• However, QTL position is unknown and therefore QTL genotypes are unobserved

2(QTL genotype is AA) ~ ( , )i AAy N 2(QTL genotype is Aa) ~ ( , )i Aay N

0 : vs :AA Aa A AA AaH H

Interval Mapping

• For QTL between markers– QTL genotypes missing: can use marker genotypes to

infer the conditional probabilities of the QTL genotypes for a given QTL position

– Profile likelihood (LOD score) calculated across the whole genome or candidate regions using EM algorithm

– In any region where the profile exceeds a (genome-wide) significance threshold, a QTL declared at the position with the highest LOD score.

0

2

4

6

8

Chromosome

lod

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16171819 X

Profile LOD

Multiple QTL Mapping

• Most complicated traits are caused by multiple (potentially interacting) genes, which also interact with environment stimuli – Single QTL interval mapping

• Ghost QTL (Lander & Botstein 1989)• Low power

Multiple QTL Mapping• Composite interval mapping (Zeng 1993, 1994;

Jansen & Stam1993): searching for a putative QTL in a given region while simultaneously fitting partial regression coefficients for "background markers" to adjust the effects of other QTLs outside the region

• which background markers to include; window size etc

• Multiple interval mapping (Kao et al 1999): fitting multiple QTLs simultaneously

• Computationally intensive; how many QTLs to include?

Multiple QTL Mapping• Bayesian methods (Stephens and Fisch 1998

Biometrics; Sillanpaa and Arjas 1998 Genetics; Yi and Xu 2002 Genetic Research, and Yi et al. 2003 Genetics): treat the number of QTLs as a parameter by using reversible jump Markov chain Monte Carlo (MCMC) of Green (1995 Biometrika)

• change of dimensionality, the acceptance probability for such dimension change, which in practice, may not be handled correctly (Ven 2004 Genetics)

Multiple QTL Mapping• Alternative, multiple QTL mapping can be

viewed as a variable selection problem– Forward and step-wise selection procedures (Broman

and Speed 2002 JRSSB) – LASSO, etc– Bayesian QTL mapping

• Xu (2003 Genetics), Wang et al (2005 Genetics) Huang et al (2007 Genetics): Bayesian shrinkage

• Yi et al (2003 Genetics): stochastic search variable selection (SSVS) of George and McCulloch (1993 JASA)

• Yi (2004 Genetics): composite model space of Godsill (2001 J. Comp. Graph. Stat)

• Software: R/qtlbim by Yi’s group

Multiple QTL Mapping

• Limitations of existing QTL mapping methods – do not model covariates at all or only

model covariate effect linearly – do not model interactions at all or

model only lower order interactions, such as two way interactions

• The multiple QTL mapping is a very large variable selection problem: for p potential genes, with p being in the hundreds or thousands, there are possible main effect models, possible two-way interactions and possible higher order (k > 2) interactions.

2 p 22p

2pk

Semiparmetric Multiple (Potentially Interacting) QTL Mapping

• Goal: map multiple potentially interacting QTLs without specifically model all potential main and higher order interaction effects

• Semiparametric model:

where function is unspecified, QTL genotypes and represent all non- genetics factors/covariates.

• When equals : non-explicitly modeling the two way interaction between genes 1 and 2 and the gene-environmental interaction between gene 3 and covariate 1.

21 2 1( , ,... , ,..., ) , 1, , with (0, )~

iid

i i i ip i iq i iy x x x t t e i n e N

1 2, ,...i i ipx x x1,...,i iqt t

1 2 3 1i i i ix x x t

Bayesian Semi/non-parametric Methods

– Dirichlet process (Muller et al. 1996)– Splines (Smith and Kohn 1996; Denison et al. 1998

and DiMatteo et al. 2001)– Wavelets (Abramovich et al. 1998 JRSSB)– Kernel models (Liang et al 2007) – Gaussian process (Neal 1997; 1996)

• Gaussian process priors have a large support in the space of all smooth functions through an appropriate choice of covariance kernel.

• Gaussian process is flexible for curve estimation because of their flexible sample path shapes

• Gaussian process related to smoothing spline somehow (Wahba 1978 JRSSB)

Prior Specification on

• A Gaussian process such that all possible finite dimensional distributions follow multivariate normal with mean 0 and covariance function

where , s and s are hyperparameters and

1( ,..., )Tn

2 2' ' '

1 1

1cov( , ) exp[ ( ) ( ) ]p q

i i xk ik i k tj ij i jk j

x x t t

xk

1 1( ,..., , ,..., )i i ip i iqx x t t tj

• Hyperparameter defines the vertical scale of variations, i.e., controls the magnitude of the exponential part. Hyperparameters related to length scales which characterize the distance in that particular direction over which y is expected to vary significantly– controls the smoothness of : when

the posterior mean of almost interpolates the data while centered around the prior mean function if

– When = 0, y is expected to be an essentially constant function of that input variable xj, which is therefore deemed irrelevant (Mackay 1998).

( )xk tj 1/ xk

xk

0

Priors on • The original papers on the Gaussian process

(Mackay 1998; Neal 1997) did not view this method as an approach for variable selection and imposed a Gamma prior on the parameters. However, does provide information about the relevance of any QTL with value near zero indicating an irrelevant QTL.

• For variable selection purpose, we can impose the following Gamma mixture priors on

1/xk xk

s

Prior Specifications

• Inverse Gamma distributions are used for the priors of and . 2

Simulations

• Set ups:– backcross population – 200 or 500 individuals– 151 evenly spaced markers at 5cM intervals– Four QTLs with varying heritabilities:

• Main effect model: all four QTL act additively• Main plus two way interactions• Four way interactions only

n=500 and pure 4 way-interaction model

n=500 and pure additive model

Real Data Analysis

• A mouse study– # samples: 187 backcross samples– # markers: 85 with average marker distance

20 cM – Phenotypes: inguinal, gonadal, retroperitoneal

and mesenteric fat pad weights

Remarks

• For studies with large # of samples and/or large # of markers, MCMC converges very slowly– We employed the hybrid Monte Carlo method, which

merges the Metropolis-Hastings algorithm with sampling techniques based on dynamics simulation.

– We also estimated the maximum a posteriori (MAP) via conjugate gradient method (Hestenes et al 1952 J. Research of National Bureau of Standards)

• point estimate

Real Study: Cardiovascular Disease

• 2655 tag SNPs from roughly 200 selected candidate genes for cardiovascular disease

• 820 individuals• Non-genetic covariates: gender, smoking

status, age

Remarks

• Semiparemetric mapping is powerful in mapping multiple (potentially interacting with higher orders) QTL – Picks up genes related to the trait regardless

of their marginal main effects or joint epistasis effects

– Cannot readily differentiates genetic contributions

• main effect? interaction? or both?• Fine tuned parametric model with selected genes

Remarks and Future Research• How to extend the methodologies to human

genome-wide association (GWA) studies, where hundreds of thousands of markers are available– Is it possible?– potential solutions: pathway analysis; data reduction

techniques• How to extend the method to human pedigree

analysis where mixed effect model is used for correlated family members?– Use inheritance vector: so far results are very

promising

Acknowledgement

• Joint work with – Hanwen Huang – Haibo Zhou– Fuxia Cheng– Ina Hoeschele

• Funding support– NIH R01 GM074175

Recommended