
BAYESIAN METHODOLOGIES FOR GENOMIC DATA WITH MISSING COVARIATES

By

ZHEN LI

A DISSERTATION PRESENTED TO THE GRADUATE SCHOOL
OF THE UNIVERSITY OF FLORIDA IN PARTIAL FULFILLMENT
OF THE REQUIREMENTS FOR THE DEGREE OF
DOCTOR OF PHILOSOPHY

UNIVERSITY OF FLORIDA

2008


© 2008 Zhen Li


To my great parents, Ping Li and Xuemei Tang, and my beloved husband, Jian


ACKNOWLEDGMENTS

First of all, I want to express my deepest gratitude to Dr. George Casella, not only for his patient advisement on numerous academic problems, his time, and the pearls of wisdom he shared with me, but also for supporting me, encouraging me, and inspiring me to always be better. I also want to thank Dr. Hani Doss, Dr. John Davis, Dr. Gary Peter and Dr. Rongling Wu for serving on my committee.

I would like to thank my parents, Ping Li and Xuemei Tang, for their endless love, their constant emotional support, and their belief in me. I thank my sister for always being there for me and loving me.

I could never thank my husband Jian enough for his love and his emotional support. He has always kept me calm and at peace during those days; without that, I could not have finished this journey.


TABLE OF CONTENTS

ACKNOWLEDGMENTS

LIST OF TABLES

LIST OF FIGURES

ABSTRACT

CHAPTER

1 LITERATURE REVIEW AND PROJECT INTRODUCTION
  1.1 Introduction to the Project
  1.2 Introduction to Missing Data
      1.2.1 Patterns of Missing Data
      1.2.2 Mechanism of Missing Data
  1.3 Existing Methods for Missing Data
      1.3.1 Maximum Likelihood
      1.3.2 EM Algorithm
      1.3.3 Multiple Imputation
  1.4 Bayesian Variable Selection
      1.4.1 Semiautomatic Bayesian Variable Selection Method
      1.4.2 Automatic Bayesian Variable Selection Method
      1.4.3 Stochastic Search Algorithm
  1.5 Outline of the Dissertation

2 A HIERARCHICAL BAYESIAN MODEL FOR GENOME-WIDE ASSOCIATION STUDIES OF SNPS WITH MISSING VALUES
  2.1 Introduction
  2.2 Proposed Method
      2.2.1 The Model without Ramet Random Effect
      2.2.2 The Model with Ramet Random Effect
      2.2.3 Increasing Computation Speed
  2.3 Results for Simulated Data
  2.4 Results for Loblolly Pine
  2.5 Quantifying the Covariance and Variance
  2.6 Discussion

3 BAYESIAN VARIABLE SELECTION FOR GENOMIC DATA WITH MISSING COVARIATES
  3.1 Introduction
  3.2 Bridge Sampling Extension
      3.2.1 General Formula
      3.2.2 How to Choose g(β, γ, Z_mis, σ², φ²)?
      3.2.3 Comparison with the Simplest Model
      3.2.4 Marginal Likelihood for m_δ(Y)
  3.3 Markov Chain Monte Carlo Property
      3.3.1 Candidate Distribution
      3.3.2 Convergence of Bayes Factors
      3.3.3 Ergodicity Property of This M-H Chain
            3.3.3.1 Fixed n, uniformly ergodic convergence to the distribution B(n)
            3.3.3.2 Ergodic convergence to B
  3.4 Computation Speed
      3.4.1 Matrix Inversion
            3.4.1.1 Two columns of parameters for one column of SNPs
            3.4.1.2 Determinant calculation
      3.4.2 Replace Z with an Average
  3.5 Simulation and Real Data Analysis
      3.5.1 Simulation
      3.5.2 Real Data Analysis

4 SUMMARY AND FUTURE WORK
  4.1 Summary
  4.2 Future Work

APPENDIX

A ERGODICITY OF GIBBS SAMPLING WHEN UPDATING Z MATRIX BY COLUMNS

B AN ALGORITHM FOR CALCULATING THE NUMERATOR RELATIONSHIP MATRIX R

REFERENCES

BIOGRAPHICAL SKETCH


LIST OF TABLES

1-1 Illustration of univariate non-response
1-2 Illustration of multivariate missing
1-3 Illustration of monotone missing
1-4 Illustration of general missing data pattern
1-5 File matching missing data pattern
2-1 Illustration of combinations of missing data for one observation
2-2 The percentages of SNP categories in the generated data sets
2-3 The percentages of correctly imputed SNPs for different probabilities of SNP categories, with 10% missing values
2-4 The estimated means of family effects for data sets with different percentages of missing values; the methodology gives accurate estimates for up to 20% missing values
2-5 The estimated means of SNP effects for the data set without missing values and for data sets with different percentages of missing values
3-1 Comparison of time spent on inverse calculation using standard software and Miller's method with 2 columns and 2 rows updated
3-2 Comparison of time spent on matrix inversion for different methods
3-3 Records of subset indicators with actual values of γ for Tables 3-4 and 3-5
3-4 Bayes factor calculation approximation, using the first 20000 iterations as burn-in and the following 400 samples
3-5 Bayes factor calculation comparisons, using the first 40000 iterations as burn-in and the following 400 samples
3-6 Simulation results of Bayesian variable selection for 15 SNPs and 450 observations with 10% random missing values, using the Bayes factor estimation formula
3-7 Bayesian variable selection for the lesion length data set, using the average of the imputed missing SNPs as if observed, with 20000 burn-in steps and another 20000 sampling steps


LIST OF FIGURES

2-1 Trace plots: the first for the first 2 family effect parameters for the lesion length data, the second for one of the SNP parameters for the carbon isotope data from Palatka, Florida; samples taken after an initial 40000 steps of burn-in
2-2 95% confidence intervals for the additive and dominant SNP effects of the 44 SNPs for lesion length; the 22nd SNP has a significant dominant effect at the 95% level, while the other SNPs are not significant; SNPs such as the 2nd and the 23rd are approximately significant and are good candidates for further biological exploration
2-3 95% confidence intervals for the additive and dominant SNP effects of the 44 SNPs for the carbon isotope data from Palatka, Florida; the 16th SNP has a significant dominant effect at the 95% level, while the other SNPs are not significant; SNPs such as the 6th and the 40th are approximately significant and are good candidates for further biological exploration
2-4 95% confidence intervals for the additive and dominant SNP effects of the 44 SNPs for the carbon isotope data from Cuthbert, Georgia; the 8th, 35th, and 36th SNPs have significant dominant effects at the 95% level, while the other SNPs are not significant; SNPs such as the 13th and the 28th are approximately significant and are good candidates for further biological exploration


Abstract of Dissertation Presented to the Graduate School of the University of Florida in Partial Fulfillment of the Requirements for the Degree of Doctor of Philosophy

BAYESIAN METHODOLOGIES FOR GENOMIC DATA WITH MISSING COVARIATES

By

Zhen Li

December 2008

Chair: George Casella
Major: Statistics

With advancing technology, large single nucleotide polymorphism (SNP) datasets

are easily available. For the ADEPT 2 project, we have candidate SNPs and interesting

phenotypic trait values available, while about 10% of the SNPs are missing.

Standard software packages cannot deal adequately with missing SNP data. For

example, SAS either uses an available case analysis (which employs all the complete cases

for the inference of target parameters) or the procedure MI (or MIANALYZE) where SAS

assumes multivariate normal distributions for all the variables. Some software deletes

the incomplete observations, which is generally unacceptable for datasets with many

SNPs, because it can give biased estimates, or possibly delete all the data. More recently,

single SNP association, linkage disequilibrium based imputation, and haplotype based

imputation have been proposed.

I describe a Bayesian hierarchical model to explain the SNP effects on the phenotypic traits that incorporates family structure information for the observations. For this association test, information about the degree of linkage disequilibrium is not required, and missing SNPs are imputed based on all the available information. We use a Gibbs sampler to estimate the parameters and prove that updating one SNP at each iteration still preserves the ergodicity of the Markov chain while improving computation speed.


We also run a stochastic search algorithm to find good subsets of variables, or SNPs. The Bayes factor is used as the model comparison criterion, and a new Bayes factor approximation formula is proposed. A hybrid Metropolis-Hastings algorithm is used to search for good models in the model space and is proven to have the ergodic convergence property. To improve the computation speed, a matrix identity is first applied to avoid direct calculation of matrix inversions and determinants; we then replace the imputed missing SNPs with the average of the imputed SNPs, which substantially increases the computation speed.


CHAPTER 1
LITERATURE REVIEW AND PROJECT INTRODUCTION

1.1 Introduction to the Project

Loblolly pine is an economically and ecologically important tree species in the United States. It grows in 14 states, from New Jersey in the north to Florida in the south and Texas in the west, and its annual harvest value is approximately 19 billion dollars (McKeever and Howard, 1996). The pine species of the southern states produce 58% of the timber in the US and 15.8% of the world's timber (Wear and Greis, 2002).

Scientists are interested in discovering the relationship between the phenotypic traits

and the complex biological roles and functions of genes in loblolly pine, which could help

to explain the evolution of adaptive traits for other land plants. Right now, researchers

from several universities in the ADEPT 2 project are collaborating to identify alleles

which control wood properties and disease resistance. Another goal of the ADEPT 2

project is to associate allelic variation with phenotypic variation for some subset of genes.

More specifically, there are 4 objectives: 1) to identify 5000 target candidate genes for

wood property and disease resistance traits; 2) to discover alleles for 5000 candidate

genes using a high throughput resequencing pipeline; 3) to estimate the extent of linkage

disequilibrium in regions of a small number of candidate genes; 4) to detect and verify

associations between SNPs in ∼ 1000 candidate genes and a suite of wood property and disease resistance phenotypes.

In the previous ADEPT project, SNP (single nucleotide polymorphism) discovery was done for 50 genes involved in resistance to disease and response to water deficit. A single nucleotide polymorphism is a DNA sequence variation occurring when a nucleotide (A, T, C, or G) differs among individuals of the same species at the same locus. SNPs may occur in coding or non-coding genes, so researchers try to detect, from a large number of potential SNPs, those that have a significant influence on the quantitative traits.


From the ADEPT project, there were 61 loblolly pine families from a circular design with some off-diagonal crossings. The 32 parents of these families came from the University of Florida and North Carolina State University. Each family contains a certain number of clones, each clone has a maximum of 5 ramets, and the numbers of clones in the families are not necessarily equal. The details of the experiment can be found in Kayihan et al. (2005). Following the terminology used in this project, two ramets within any clone share exactly the same genetic information, while two clones within any family only share the same parents and can be considered siblings.

Loblolly pine is the host of two fungal pathogens, which cause fusiform rust disease and pitch canker disease. Fusiform rust, caused by a fungus, produces spindle-shaped swellings on the branches and stems of loblolly pine trees. It forms stem galls, which lead to short survival times, poor wood properties, and slow growth; this disease causes losses of hundreds of millions of dollars in the southern United States. Pitch canker is also a very important disease; it is caused by a fungus, produces resinous lesions, and leads to seedling mortality, decreased growth rates, and crown dieback.

The large pitch canker data were recorded at the USDA Forest Service Resistance Screening Center in Bent Creek, North Carolina, and the smaller pitch canker screen was conducted at the University of Florida. Both ten-gall and one-gall rust screens were inoculated in North Carolina with different densities of spores/ml. This data set is called CCLONES and will be referred to by that name later. We will have another data set later from NCSU, which consists of individual trees not recently mated to each other; it can serve as a naturally occurring pine population and will be used for SNP verification at the end of the ADEPT project.

For this dissertation, we have 46 genotyped SNPs in the CCLONES population for association discovery, and the methods developed in this study will be applied to the ∼ 7600 SNPs later. The responses are mainly measurements of lesion length for fusiform rust and pitch canker, where a lesion is an infected or wounded patch of tissue.


About 10% of the values of the 46 SNPs are missing in CCLONES, and on average, each observation has more than one SNP missing. If we employed the conventional listwise deletion method to deal with the missing data, almost all of the observations would be deleted and the analysis would be meaningless. Even worse, in the later data set of 3000 SNPs per observation, roughly 10% missing values means each observation would be expected to have about 300 missing SNPs, and the listwise method would delete the entire data set. So effective methods that treat the incomplete data properly are urgently needed; compared with conventional remedies such as re-genotyping, they would dramatically save expense, labor, and energy.

The goals of this dissertation are:

• To model the relationship between the phenotypic disease responses of loblolly pine, such as fusiform rust and pitch canker lesion length, and the genotyped SNPs, while utilizing the family structure information of the loblolly CCLONES. Methodologies for dealing with missing data in modeling and association tests are needed to make use of the available incomplete data sets. As missing data are typical in many data sets these days, including genetic data sets, this research potentially has wide applications.

• Furthermore, subsets of SNPs may exist that are interactively responsible for the phenotypic traits. Another goal is to find the "good subsets" of SNPs and let the biologists investigate them further.

• The last goal relates to computation. According to a preliminary analysis, the CCLONES data contains 888 clones, and if we utilize the family structure by means of a covariance, we are dealing with an 888 × 888 matrix and many inversions of that large matrix. To select the important SNPs and impute the missing ones (that is, using the available information, generate each missing SNP according to the probability of its possible genotypes), Markov chain Monte Carlo is the way to go, and it carries a heavy computational burden; we need to find ways to run the Markov chains and to speed them up. Computation is also an issue for the project of selecting subsets of variables. Solving the computation problem is a critical step in applying the methods to thousands of SNPs later.

1.2 Introduction to Missing Data

1.2.1 Patterns of Missing Data

The data sets discussed here are always arranged in a rectangular shape: the rows correspond to the records for the observations, and the columns record the variables or responses. In this section, we introduce the patterns of missing data.

a) Univariate Non-response

Table 1-1. Illustration of univariate non-response

Age   Weight (lb)   Height (feet)   Race
28    160           5.8             white
34    115           6.1             black
55    216           5.4             white
45    230           6.3             *

For data sets with the univariate missing pattern, only one column has missing data and all the other columns have complete information. I use "∗" to denote a missing value. Suppose we want to know the relationship of Height versus Age, Weight, and Race from the data in this table. If the data set were complete, the most natural method to employ would be linear regression. For our situation, people might suggest discarding the last observation and continuing with linear regression. But are the data missing completely at random? Will deleting the incomplete case lead to a biased estimator?
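To see how the answer depends on the missing-data mechanism, here is a minimal simulation sketch (my own illustration of the point above, with hypothetical numbers, not part of the original analysis). Height is more likely to be missing for heavier subjects, so the missingness depends only on the observed Weight, yet the complete-case mean of Height is biased:

```python
import numpy as np

rng = np.random.default_rng(1)
n = 100_000

# Hypothetical data: Height depends on Weight, and Height is more likely
# to be missing for heavier subjects -- the missingness depends only on
# the observed Weight.
weight = rng.normal(180, 30, n)
height = 4.0 + 0.01 * weight + rng.normal(0, 0.2, n)
p_miss = 1 / (1 + np.exp(-(weight - 180) / 15))
observed = rng.random(n) > p_miss

print(f"true mean height:          {height.mean():.3f}")
print(f"complete-case mean height: {height[observed].mean():.3f}  (biased low)")
```

A complete-case regression of Height on Weight would still be unbiased here, because the missingness depends only on an observed covariate; it is marginal summaries such as the mean that are distorted.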

b) Multivariate missing

In surveys, it is usual that the design variables are known before the study, but some variables will not always be filled in by the people being surveyed. The following is an example of the multivariate missing pattern.


Table 1-2. Illustration of multivariate missing

Location   Immigration   Income   Household number   Working years
Alachua    Yes           53000    5                  10
Alachua    Yes           65000    5                  4
Lee        No            70000    6                  2
Lee        No            *        *                  *
Alachua    No            *        *                  *

In Table 1-2, the county name and immigration status are design variables and can be filled in before the survey. The remaining columns are questions to be answered in a questionnaire, and sometimes people do not answer. Again, before deleting the incomplete cases, we need to ask ourselves: are the data missing completely at random? Are there any biases in the missing data? If the data are missing completely at random, there is no harm in simply deleting the incomplete cases.

c) Monotone missing

This kind of missing pattern often happens in longitudinal studies, where subjects drop out over time and the columns recorded at later times tend to have more missing data. For example, Table 1-3 shows a medical experiment in which researchers want to test the effect of a certain drug on blood pressure:

Table 1-3. Illustration of monotone missing

Weight (lb)   BP in 1 M   BP in 3 M   BP in 6 M   BP in 9 M   BP in year 1
150           145         150         150         156         160
167           111         132         140         145         150
200           167         150         156         160         170
230           170         *           *           *           *
180           210         200         *           *           *

In the above table, "BP in 1 M" means the blood pressure record at one month. If an observation is missing at 3 months, then the records for the later months tend to be missing too. The missing values might occur because the subject moved to a new city, was too sick to continue, or could not continue the experiment for any other reason. Simply deleting the incomplete cases is likely to introduce bias; here the mechanism of the missing values needs to be investigated further.

d) General missing pattern

The general missing pattern is a pattern that cannot be put in any special category. For example, in genetic data analysis the data in Table 1-4 are typical:

Table 1-4. Illustration of general missing data pattern

Phenotypic value   Female parent   Male parent   SNP1   SNP2   SNP3
15.33              11345           13453         gg     tt     *
16.235             11345           13451         *      tc     gc
17.89              11354           11542         gc     *      *
19.31              11453           11671         *      *      cc
20.32              11345           11651         cc     cc     *

In this example, the first column records the phenotypic value of the trait in which the researcher is interested, the second and third columns record the parent information, and the remaining columns hold the SNP information. For microarray data, we typically have many more columns of SNP information than in the table above, so listwise deletion is totally inappropriate: it tends to delete the majority of the data and waste a great deal of effort. More appropriate methods should be applied.

e) File matching

In this category, only one of the covariates to be filled in is observed for each observation, so in the worst case there might be no complete cases at all. Better methods than listwise deletion definitely need to be employed. Table 1-5 is an illustration:

Table 1-5. File matching missing data pattern

Age   Job title   Income (dollars)
55    manager     *
40    banker      *
31    *           70000
23    *           35000
49    CEO         *


In the above examples, we see directly that deleting the incomplete cases is quite dangerous, and understanding the mechanism of the missing data is very important for conducting valid statistical inference.

1.2.2 Mechanism of Missing Data

The following categorization is due to Little and Rubin (1987). According to their definition, missing data mechanisms can be divided into 3 categories. Suppose there are missing data on a variable Y, and let the variable M denote whether Y is observed or not: M = 1 means the value is missing and M = 0 otherwise.

MCAR If whether or not Y is observed is independent of the value of Y itself, and is also independent of any other variables in the data set, we say Y is missing completely at random (MCAR). When this assumption is satisfied, the complete cases can be regarded as a sub-sample from the population, and statistical inference based on the complete cases is entirely legitimate; the only cost is that the sample size is decreased and the standard error is larger than it could be. MCAR is a rather strong assumption and in most cases it is not satisfied, but when it holds, listwise deletion is straightforward to apply and gives valid statistical inference.

If we denote the observed part of Y by Y_obs and the missing part by Y_mis, then Y = (Y_obs, Y_mis). Let ξ be the parameter of the missing-data mechanism. Missing completely at random can then be expressed as
\[ P(M \mid Y_{obs}, Y_{mis}, \xi) = P(M \mid \xi). \]

MAR If the probability that an observation is missing depends on Y_obs but not on Y_mis, then the data are missing at random (MAR). This is a more general situation than missing completely at random, since under MAR the probability of missingness may depend on the observed data, while under MCAR it must be independent of both Y_obs and Y_mis. The precise formula for MAR is
\[ P(M \mid Y_{obs}, Y_{mis}, \xi) = P(M \mid Y_{obs}, \xi). \]


NMAR The last category is not missing at random (NMAR). When the data are not missing at random, the probability that a value is missing depends not only on the observed values but also on the missing value itself. In this situation, we typically need to model the missing-data mechanism explicitly.

One thing to note is that, strictly speaking, we cannot verify that missing data are MAR: since the missing values are not observed, it is impossible to verify that the probability of missingness is independent of the unobserved values. However, MAR is a reasonable assumption in many cases, and it is an easier assumption for statistical inference than NMAR; unless there is evidence of not missing at random, we generally assume MAR.

When the data set is complete, direct procedures like maximum likelihood can be employed for statistical inference. We would have
\[ \hat{\zeta} = \arg\max_{\zeta} P(Y \mid \zeta). \]

Further, we assume that the parameter ζ of the data model and the parameter ξ of the missing-value mechanism are distinct. From the frequentist perspective, this means the joint parameter space of (ζ, ξ) is the Cartesian product of the spaces of ζ and ξ; from the Bayesian perspective, the joint prior of (ζ, ξ) factors into the product of independent priors for ζ and ξ. If both MAR and distinctness hold, the missing-data mechanism is said to be ignorable (Little and Rubin, 1987). When distinctness holds, we have
\[ P(M, Y_{obs}, Y_{mis} \mid \zeta, \xi) = P(M \mid Y_{obs}, Y_{mis}, \xi)\, P(Y_{obs}, Y_{mis} \mid \zeta). \]
When the data are MAR, P(M | Y_obs, Y_mis, ξ) = P(M | Y_obs, ξ), and the above formula simplifies to
\[ P(M, Y_{obs}, Y_{mis} \mid \zeta, \xi) = P(M \mid Y_{obs}, \xi)\, P(Y_{obs}, Y_{mis} \mid \zeta). \]


The probability distribution of the observed data (Y_obs, M) is then
\begin{align*}
P(M, Y_{obs} \mid \zeta, \xi) &= \int P(M, Y_{obs}, Y_{mis} \mid \zeta, \xi)\, dY_{mis} \qquad (1\text{-}1)\\
&= \int P(M \mid Y_{obs}, \xi)\, P(Y_{obs}, Y_{mis} \mid \zeta)\, dY_{mis}\\
&= P(M \mid Y_{obs}, \xi) \int P(Y_{obs}, Y_{mis} \mid \zeta)\, dY_{mis}\\
&= P(M \mid Y_{obs}, \xi)\, P(Y_{obs} \mid \zeta).
\end{align*}

Typically, maximum likelihood inference gives
\[ (\hat{\zeta}, \hat{\xi}) = \arg\max_{\zeta, \xi} P(M, Y_{obs} \mid \zeta, \xi), \]
and under the assumption of ignorable non-response,
\[ \hat{\zeta} = \arg\max_{\zeta} P(Y_{obs} \mid \zeta) \quad \text{and} \quad \hat{\xi} = \arg\max_{\xi} P(M \mid Y_{obs}, \xi). \]

If the data are NMAR, the joint distribution of the observed and missing data conditioned on the parameters is
\[ P(M, Y_{obs}, Y_{mis} \mid \xi, \zeta) = P(M \mid Y_{obs}, Y_{mis}, \xi)\, P(Y_{obs}, Y_{mis} \mid \zeta). \]
Based on that, the observed likelihood is
\[ P(M, Y_{obs} \mid \xi, \zeta) = \int P(M \mid Y_{obs}, Y_{mis}, \xi)\, P(Y_{obs}, Y_{mis} \mid \zeta)\, dY_{mis}, \]
which cannot be simplified further. In this situation, we cannot ignore the missing-data mechanism, and we need to investigate it to obtain valid statistical inference for the parameter ζ.

1.3 Existing Methods for Missing Data

Although many methods have been proposed to handle missing data, only a few have gained widespread popularity. I will briefly review the listwise deletion method and the pairwise deletion method in this section, and pay more attention to the maximum likelihood method, the Expectation-Maximization (EM) algorithm, and multiple imputation, since these last three are very general approaches that have been widely used to handle otherwise difficult missing data problems.

The listwise deletion method basically deletes the incomplete cases and carries out the statistical analysis as if there were no missing data. It is the most straightforward method and is the default option in a lot of popular software. When the data are MCAR, listwise deletion is equivalent to sub-sampling the population, and the resulting statistical inference is legitimate, except that we get larger standard errors because of the reduced number of observations. When the MCAR assumption is violated, listwise deletion gives biased estimates, since the remaining data are weighted more heavily than they should be. Another disadvantage of this method is that it tends to lose substantial information when missing data occur in multiple covariates.

Another method is pairwise deletion, which can be used in linear regression to estimate the means or the covariance matrix. The idea of pairwise deletion is to use all the cases that are available for each summary statistic being computed. For example, suppose we have a bivariate response (Y1, Y2) and each variable has missing data. When we calculate the mean of Y1, we use all the observed data for Y1, even though the corresponding observation of Y2 might be missing; when we calculate the covariance of Y1 and Y2, the cases in which both Y1 and Y2 are observed are used. The biggest problem with pairwise deletion is that the estimates and standard errors in most software are biased when the missing values are not MCAR, because the principle of pairwise deletion is ambiguous in its software implementation: when computing the covariance of Y1 and Y2, there is no clear direction about which observed values should be used to calculate the mean of Y1, the complete cases for Y1 or the complete cases for both Y1 and Y2.
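As a concrete illustration of this ambiguity (my own sketch, not from the cited literature), the following computes a pairwise-deletion covariance using one of the two conventions: pairwise-complete cases for the cross-products, but all available cases for each mean:

```python
import numpy as np

def pairwise_cov(y1, y2):
    """Pairwise-deletion covariance of y1 and y2 (np.nan marks missing values).

    Cross-products use cases where both variables are observed, while each
    mean uses all cases observed for that variable -- one of the ambiguous
    conventions; another would compute the means from the pairwise-complete
    cases only.
    """
    both = ~np.isnan(y1) & ~np.isnan(y2)
    m1, m2 = np.nanmean(y1), np.nanmean(y2)
    return np.mean((y1[both] - m1) * (y2[both] - m2))
```

Covariance matrices assembled entry by entry from different case subsets in this way need not even be positive semi-definite, which is another practical drawback of pairwise deletion.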


Mean substitution is another choice in the literature: it imputes each missing value with the mean of the completely observed cases. It requires strong assumptions and gives downward-biased standard errors.

1.3.1 Maximum Likelihood

The maximum likelihood method is a general approach that handles MAR quite well. Maximum likelihood estimates have a number of desirable properties, such as consistency, asymptotic efficiency, and asymptotic normality. Consistency means that the estimators converge to the true values under general conditions; asymptotic efficiency means that the true standard errors are the smallest among the consistent estimators; and asymptotic normality implies that the repeated estimates have an asymptotic normal distribution, with the approximation improving for larger sample sizes.

Maximum likelihood is especially easy when the missing data pattern is monotone. Consider the two-variable case (Y_1, Y_2) where Y_2 has missing data: Y_1 has n observations, Y_2 is observed for only m observations, and the remaining n − m observations of Y_2 are missing. Obviously this is a monotone missing pattern. We can write the observed likelihood as
\[ L(\lambda, \mu \mid Y) = \prod_{j=1}^{m} h(Y_{2j} \mid Y_{1j}, \lambda) \prod_{j=1}^{n} g(Y_{1j} \mid \mu), \]
where h(Y_2 | Y_1, λ) is the conditional distribution of Y_2 given Y_1 with parameter λ, and g(Y_1 | µ) is the marginal distribution of Y_1 with parameter µ. If we can achieve this factorization, we can maximize the two likelihood factors separately.
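As a concrete sketch of this factorization (my own illustration under an assumed bivariate normal model, not from the dissertation), h(Y_2 | Y_1, λ) can be fit by regressing Y_2 on Y_1 using the m complete pairs, while g(Y_1 | µ) is fit from all n observations of Y_1; the two maximizations are completely separate:

```python
import numpy as np

rng = np.random.default_rng(0)
n, m = 1000, 600

# Simulate (Y1, Y2); only the first m values of Y2 are observed (monotone pattern).
y1 = rng.normal(5.0, 2.0, n)
y2 = 1.0 + 0.8 * y1 + rng.normal(0, 1.0, n)

# g(Y1 | mu): fit the marginal of Y1 from all n observations.
mu_hat = y1.mean()

# h(Y2 | Y1, lambda): fit the regression of Y2 on Y1 from the m complete pairs.
slope, intercept = np.polyfit(y1[:m], y2[:m], 1)

# The two factors combine, for example, to give the MLE of E[Y2].
print(f"MLE of E[Y2]: {intercept + slope * mu_hat:.3f}")
```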

1.3.2 EM Algorithm

Direct factorization of the likelihood is not always possible, which means that the direct maximum likelihood method has limited application for missing data. The Expectation-Maximization (EM) algorithm is a very general method for obtaining maximum likelihood (ML) estimates, and it has gained great popularity since its introduction by Dempster et al. (1977). The basic idea of EM is to maximize the likelihood of a difficult incomplete data set by repeatedly maximizing a "complete" data problem. The EM algorithm requires the MAR assumption and is often used to find the ML estimates of the parameters.

In any incomplete-data problem, the density of the complete data can be factored as
\[ p(Y \mid \theta) = p(Y_{obs} \mid \theta)\, p(Y_{mis} \mid Y_{obs}, \theta). \]
Taking logs of both sides, we have
\[ l(\theta \mid Y) = l(\theta \mid Y_{obs}) + \log p(Y_{mis} \mid Y_{obs}, \theta) + c, \qquad (1\text{-}2) \]
where l(θ | Y) is the log likelihood of p(Y | θ), l(θ | Y_obs) is the log likelihood of p(Y_obs | θ), and c is a constant; p(Y_mis | Y_obs, θ) is called the predictive distribution of the missing data given θ. Because the second term in the above formula contains Y_mis, we cannot maximize Equation 1-2 directly; instead we integrate out Y_mis over the predictive distribution. If the current realization of θ is θ^(t) and the predictive distribution is p(Y_mis | Y_obs, θ^(t)), the integration yields
\[ l(\theta \mid Y_{obs}) = Q(\theta \mid \theta^{(t)}) - H(\theta \mid \theta^{(t)}) - c, \]
where
\[ Q(\theta \mid \theta^{(t)}) = \int l(\theta \mid Y)\, p(Y_{mis} \mid Y_{obs}, \theta^{(t)})\, dY_{mis} \]
and
\[ H(\theta \mid \theta^{(t)}) = \int \log p(Y_{mis} \mid Y_{obs}, \theta)\, p(Y_{mis} \mid Y_{obs}, \theta^{(t)})\, dY_{mis}. \qquad (1\text{-}3) \]
(Note the minus signs: integrating Equation 1-2 against the predictive distribution gives Q(θ | θ^(t)) = l(θ | Y_obs) + H(θ | θ^(t)) + c.)

So the EM algorithm consists of two steps per iteration:

• The expectation step, in which the function Q(θ | θ^(t)) is calculated with respect to the predictive distribution of the missing data p(Y_mis | Y_obs, θ^(t)).

• The maximization step, which finds θ^(t+1) by maximizing Q(θ | θ^(t)).

The two steps are repeated until the ML estimates converge. Wu (1983) was the first to prove that θ^(t) → θ̂. Dempster et al. (1977) proved that if we let θ^(t+1) be the value that maximizes Q(θ | θ^(t)), then θ^(t+1) is always at least as good as θ^(t), in the sense that the observed log likelihood cannot decrease:
\[ l(\theta^{(t+1)} \mid Y_{obs}) \ge l(\theta^{(t)} \mid Y_{obs}). \qquad (1\text{-}4) \]
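To make the two steps concrete, here is a minimal EM sketch (my own illustration under an assumed bivariate normal model) for the two-variable setting of Section 1.3.1, where Y_1 is fully observed and some entries of Y_2 are missing. The E-step fills in the expected sufficient statistics for the missing Y_2 values, and the M-step re-estimates the moments:

```python
import numpy as np

def em_bivariate_normal(y1, y2, n_iter=100):
    """EM for a bivariate normal (Y1, Y2): y1 fully observed, np.nan marks
    missing entries of y2 (assumed MAR)."""
    miss = np.isnan(y2)
    # Start from complete-case moments.
    mu = np.array([y1.mean(), y2[~miss].mean()])
    cov = np.cov(y1[~miss], y2[~miss])
    for _ in range(n_iter):
        # E-step: expected sufficient statistics for the missing y2 entries.
        b = cov[0, 1] / cov[0, 0]              # slope of the regression of Y2 on Y1
        resid_var = cov[1, 1] - b * cov[0, 1]  # conditional variance of Y2 given Y1
        e_y2 = np.where(miss, mu[1] + b * (y1 - mu[0]), y2)
        e_y2sq = np.where(miss, e_y2 ** 2 + resid_var, y2 ** 2)
        # M-step: re-estimate the moments from the completed statistics.
        mu = np.array([y1.mean(), e_y2.mean()])
        s11 = np.mean(y1 ** 2) - mu[0] ** 2
        s12 = np.mean(y1 * e_y2) - mu[0] * mu[1]
        s22 = np.mean(e_y2sq) - mu[1] ** 2
        cov = np.array([[s11, s12], [s12, s22]])
    return mu, cov
```

Each iteration provably does not decrease the observed-data log likelihood, in accordance with Equation 1-4.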

Data augmentation (Tanner and Wong, 1987) is another widely used method, often employed to explore the posterior distribution in the presence of missing data. It strongly resembles the EM algorithm and is also composed of two steps. In its first step, data augmentation simulates the complete-data sufficient statistics, whereas the first step of the EM algorithm calculates their expectation; in its second step, it draws a random simulation of the parameters from the posterior distribution given the completed sufficient statistics from the first step, whereas the corresponding step of the EM algorithm maximizes over the parameter. Data augmentation also assumes MAR.

1.3.3 Multiple Imputation

Multiple imputation has been a popular method for handling general missing data patterns since it was introduced by Rubin (1978). A considerable number of books have been devoted to implementing the framework of multiple imputation, such as Analysis of Incomplete Multivariate Data by Schafer (1997) and Missing Data by Allison (2002). They generally assume MAR if the missing-data mechanism is not modeled.

The idea of multiple imputation is to complete the data so as to obtain a usable likelihood, based on which statistical inference is much easier to conduct. To obtain complete data, random imputations take the place of the missing values. Since the imputed values are not the true values of the missing data, we impute the data multiple times so that the random effect of the imputation centers around the unobserved true values. The multiple complete data sets are then combined to give estimates.

Suppose the data Y contain missing values and we create K complete data sets Y_(C1), Y_(C2), ..., Y_(CK). Suppose we perform a regression analysis on each complete data set Y_(Cj) and obtain the estimate β̂_(Cj); we then average these estimates to get a better estimate:

\[ \bar{\beta} = \frac{1}{K} \sum_{j=1}^{K} \hat{\beta}_{(Cj)}. \qquad (1\text{-}5) \]

The within-imputation variance is calculated as
\[ \bar{U} = \frac{1}{K} \sum_{j=1}^{K} \widehat{Var}(\hat{\beta}_{(Cj)}), \]
and the between-imputation variance is calculated as
\[ B = \frac{1}{K-1} \sum_{j=1}^{K} (\hat{\beta}_{(Cj)} - \bar{\beta})(\hat{\beta}_{(Cj)} - \bar{\beta})'. \]
Combining the within and between variances produces the total variance
\[ Var(\bar{\beta}) = \bar{U} + \Big(1 + \frac{1}{K}\Big) B, \qquad (1\text{-}6) \]
and the corresponding degrees of freedom are
\[ V_K = (K-1)\Big[1 + \frac{\bar{U}}{(1 + 1/K)\,B}\Big]^2. \]
Inference based on the combined variance uses a t distribution with V_K degrees of freedom. Barnard and Rubin (1999) suggested using V_K^* = [1/V_K + 1/\hat{V}_{obs}]^{-1} for small sample sizes, where \hat{V}_{obs} denotes the estimated observed-data degrees of freedom.

To perform likelihood ratio tests, Li et al. (1991) and Meng and Rubin (1992) suggest the following as an approximate likelihood ratio test:
\[ D_K = \frac{(\bar{\beta} - \beta_0)'\, Var(\bar{\beta})^{-1}\, (\bar{\beta} - \beta_0)}{p\,(1 + r_K)}, \]
where p is the dimension of β, β_0 is the value of β under the null hypothesis,
\[ r_K = \Big(1 + \frac{1}{K}\Big)\, \mathrm{tr}\big(B_K\, Var(\bar{\beta})^{-1}\big)/p, \]
and
\[ B_K = \frac{1}{K-1} \sum_{j=1}^{K} (\hat{\beta}_{(Cj)} - \bar{\beta})(\hat{\beta}_{(Cj)} - \bar{\beta})'. \]
This test has an associated F distribution F_{p,w} with p and w degrees of freedom, where
\[ w = 4 + \big(p(K-1) - 4\big)\Big[1 + \frac{1 - 2/(p(K-1))}{r_K}\Big]^2. \]

This test behaves much like a likelihood ratio test. It carries the strong assumption that the proportion of missing data is equal for each variable; however, simulations by Li et al. (1991) show that the test is quite robust to violation of this assumption when K ≥ 3.

One question that needs to be answered is how big K should be. Rubin (1987) gives the following formula for the efficiency under a proportion γ of missing data and K created complete data sets:
\[ Ef = \Big(1 + \frac{\gamma}{K}\Big)^{-1}. \]
From this formula it is easy to calculate the number of complete data sets needed to achieve a given efficiency. It turns out that for low proportions of missing data, say γ ≤ 0.3, K = 5 gives reasonably high efficiency, while for large amounts of missing data, for example γ = 0.7, K = 10 achieves 93.5% efficiency.
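The combining rules and the efficiency formula are easy to implement. The following scalar-parameter sketch (my own illustration, with hypothetical numbers) applies Equations 1-5 and 1-6 and checks the 93.5% figure quoted above:

```python
import numpy as np

def pool_rubin(estimates, variances):
    """Pool K completed-data estimates of a scalar parameter by Rubin's rules."""
    K = len(estimates)
    beta_bar = np.mean(estimates)                        # Equation 1-5
    u_bar = np.mean(variances)                           # within-imputation variance
    b = np.var(estimates, ddof=1)                        # between-imputation variance
    total_var = u_bar + (1 + 1 / K) * b                  # Equation 1-6
    df = (K - 1) * (1 + u_bar / ((1 + 1 / K) * b)) ** 2  # degrees of freedom V_K
    return beta_bar, total_var, df

# Hypothetical estimates and variances from K = 5 completed data sets.
print(pool_rubin([1.02, 0.95, 1.10, 0.99, 1.05], [0.04, 0.05, 0.04, 0.05, 0.04]))

# Efficiency check: gamma = 0.7 missing, K = 10 imputations.
print(1 / (1 + 0.7 / 10))   # 0.9346..., i.e., about 93.5%
```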

In practice, multiple imputation requires a mechanism for imputing the missing data, and there are a number of ways to do that. Reilley (1993) considers hot deck imputation, which draws random samples, with equal weights, from the actually observed values. This is an easily employed method, and Efron (1994) calls the hot deck a non-parametric bootstrap method. Xie and Paik (1997) proposed a Bayesian version of the hot deck: instead of drawing the random sample with equal weights from the observed values, they put a Dirichlet prior on the weights. Another imputation method assumes all of the covariates are normally distributed, even the categorical variables; each category is then rounded to the nearest category of the cumulative distribution function.
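A minimal sketch of the equal-weight hot deck draw described above (my own illustration, not code from the cited papers):

```python
import numpy as np

def hot_deck_impute(y, rng):
    """Replace each missing value (np.nan) with a random draw, with equal
    weights, from the observed values of the same variable."""
    y = y.copy()
    miss = np.isnan(y)
    y[miss] = rng.choice(y[~miss], size=miss.sum(), replace=True)
    return y

rng = np.random.default_rng(0)
print(hot_deck_impute(np.array([1.0, np.nan, 3.0, np.nan, 5.0]), rng))
```

Repeating the draw K times yields the multiple completed data sets that Rubin's rules then combine.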

1.4 Bayesian Variable Selection

Variable selection is a frequently used method in statistical data analysis, and Bayesian variable selection has been the subject of substantial research in recent years. Most of the theoretical developments have been based on normal linear regression. Smith and Spiegelhalter (1980) proposed a unified approach to prior specification for model choice and showed that different prior approaches could lead to a Schwarz-type criterion or the Akaike Information Criterion. Mitchell and Beauchamp (1988) used a "spike and slab" type of prior for the candidate variables under consideration; their approach is not fully Bayesian, as an important parameter (the height of the spike over the height of the slab) is estimated from the data and then used as a known parameter. George and McCulloch (1993, 1995, 1997) advocated the Markov chain Monte Carlo approach for the posterior calculation, which made variable selection among large numbers of candidates possible. Brown et al. (1998) extended the application of Bayesian variable selection to multivariate responses, employing a Markov chain Monte Carlo algorithm to speed up the computation. Berger and Pericchi (2001) proposed objective intrinsic Bayes factors for model selection. An objective, fully automatic Bayesian procedure was proposed by Casella and Moreno (2006); they used intrinsic priors to calculate the posterior probabilities and employed a stochastic search algorithm. Only a few related papers are listed above; many others exist in the literature.

In normal linear regression, we have a response Y_{n×1} and a set (X_1, ..., X_s) of s potential explanatory predictors, and we assume Y = Xβ + ε with ε ∼ N(0, σ²I). The actual values of some of the β_i, i = 1, ..., s, may not be significantly different from 0, or some predictors may be correlated with one another. The goal of variable selection is to find a set of predictors X*_1, ..., X*_r, a subset of X_1, ..., X_s, such that each corresponding β_i has a practically significant effect on Y and the full model is simplified to a certain degree. I will employ an index vector γ of length s and let each element of γ indicate whether the corresponding predictor X_i is included in the selected model: γ_i = 1 means X_i is in the selected model and γ_i = 0 otherwise. So the variable selection problem is to find a vector γ according to some criterion.

Many variable selection methods have been based on the Akaike Information Criterion (AIC; Akaike, 1973), Mallows' Cp (Mallows, 1973), and the Bayesian Information Criterion (BIC; Schwarz, 1978), and they have been applied to many problems when s is reasonably small.


These methods basically try to maximize a penalized sum of squares, though each takes a different penalty term in the target function.

Suppose X_γ represents the selected predictor matrix indexed by γ, and let
\[ \hat{\beta}_\gamma = (X_\gamma' X_\gamma)^{-1} X_\gamma' Y \]
denote the least squares estimate of β_γ. I use r to represent the number of variables in the selected subset. The sum of squares for regression is
\[ SS_\gamma = \hat{\beta}_\gamma' X_\gamma' X_\gamma \hat{\beta}_\gamma. \]

AIC, Cp, and BIC are criteria that try to maximize the function
\[ SS_\gamma / \hat{\sigma}^2 - C \cdot r, \qquad (1\text{-}7) \]
where σ̂² is an estimate of σ² and C is a penalty term chosen by the criterion. Taking C = 2 gives Cp, which coincides with AIC; for BIC, take C = log n. George and Foster (1994) proposed taking C = 2 log p, the risk inflation criterion (RIC). These methods have different motivations: an unbiased estimate of predictive risk for Cp, an expected information distance for AIC, and an asymptotic Bayes factor for BIC. When the number of available predictors is small, the above methods have been widely applied. For problems with hundreds of predictors, however, each potential predictor can either be included in the selected model or not, so there are 2^s models, a number that can easily go beyond computational capacity, and the above criteria become impossible to apply. New methodologies to handle hundreds of predictors are needed.
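To see the combinatorial barrier directly, here is a sketch (my own illustration) that scores every nonempty subset by the criterion in Equation 1-7; the loop body is cheap, but the number of subsets doubles with every added predictor:

```python
import numpy as np
from itertools import combinations

def best_subset(X, y, C=2.0):
    """Score every nonempty subset gamma by SS_gamma / sigma2_hat - C * r
    (Equation 1-7); C = 2 is a Cp/AIC-type penalty, C = log(n) gives BIC."""
    n, s = X.shape
    beta_full = np.linalg.lstsq(X, y, rcond=None)[0]
    sigma2_hat = np.sum((y - X @ beta_full) ** 2) / (n - s)   # from the full model
    best, best_score = None, -np.inf
    for r in range(1, s + 1):
        for subset in combinations(range(s), r):              # 2^s - 1 subsets in total
            Xg = X[:, subset]
            bg = np.linalg.lstsq(Xg, y, rcond=None)[0]
            score = bg @ (Xg.T @ Xg) @ bg / sigma2_hat - C * r
            if score > best_score:
                best, best_score = subset, score
    return best, best_score
```

Even at s = 46 predictors, 2^46 ≈ 7 × 10^13 subsets would need scoring, which is why the stochastic search of Section 1.4.3 is used instead of exhaustive enumeration.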

1.4.1 Semiautomatic Bayesian Variable Selection Method

A popular hierarchical Bayesian formulation for variable selection is the following. Take the prior for (β, γ) to be
\[ P(\gamma_i = 1) = 1 - P(\gamma_i = 0) = p_i, \]


and
\[ \pi(\beta \mid \sigma, \gamma) = \pi(\beta \mid \gamma) = N_p(0, Q_\gamma R Q_\gamma), \]
where R is the prior correlation matrix and
\[ Q_\gamma = \mathrm{Diag}[b_1 \tau_1, \ldots, b_p \tau_p], \]
with b_i = 1 when γ_i = 0 and b_i = c when γ_i = 1, where c is a hyperparameter. The parameter τ_i² is the prior variance for the corresponding element of β. Whether the pre-specified hyperparameter c is big or not represents the statistician's preference for a saturated model or a parsimonious model, while p_i reflects the prior belief about whether a certain predictor should be included in the model. The values c and p_i are typically set to fixed numbers according to the statistician's experience; c = 100 and p_i = 1/2 for i = 1, ..., p is a popular choice. Cui and George (2008) also proposed putting priors on c and p_i and integrating them out.

In the hierarchical model, the prior for σ² needs to be specified too; normally it is taken to be the inverse Gamma Ga(µ_γ/2, λ_γ), where µ_γ and λ_γ are hyperparameters. More detailed discussion about how to choose the hyperparameters c, p_i, µ_γ, and λ_γ is given in George and McCulloch (1993, 1997).

With the above specification, the posterior distribution f(γ | Y) can be calculated, and the idea is then to rank the posterior probabilities to decide on the ideal sets of predictors. The actual posterior calculation is a challenge, and methods to address this will be discussed in a later subsection.
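As a small sketch of this prior (my own illustration; the correlation matrix R defaults to the identity here), drawing β given γ just assembles the diagonal scale matrix and samples the multivariate normal:

```python
import numpy as np

def draw_beta_prior(gamma, tau, c=100.0, R=None, rng=None):
    """Draw beta from N_p(0, Q_gamma R Q_gamma), where Q_gamma = Diag[b_i tau_i]
    with b_i = 1 (spike) when gamma_i = 0 and b_i = c (slab) when gamma_i = 1."""
    rng = rng if rng is not None else np.random.default_rng()
    gamma, tau = np.asarray(gamma), np.asarray(tau)
    R = np.eye(len(gamma)) if R is None else R
    Q = np.diag(np.where(gamma == 1, c, 1.0) * tau)
    return rng.multivariate_normal(np.zeros(len(gamma)), Q @ R @ Q)

# Predictors 1 and 3 included (slab); tau_i = 0.1 for all i.
print(draw_beta_prior([1, 0, 1], tau=np.full(3, 0.1)))
```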

1.4.2 Automatic Bayesian Variable Selection Method

Casella and Moreno (2006) proposed a fully automatic Bayesian variable selection procedure using intrinsic priors. Let Γ represent the set of realizations of γ, and let γ = 1 correspond to the full model with all potential predictors. The Bayes factor is defined as
\[ B_{\gamma 1} = \frac{m_\gamma(Y, X)}{m_1(Y, X)}, \]


where m_γ(Y, X) is the marginal distribution for the model indexed by the vector γ, while m_1(Y, X) is the marginal distribution with all the predictors in the model. It can be shown that
\[ P(\gamma \mid Y, X) = \frac{B_{\gamma 1}(Y, X)}{1 + \sum_{\gamma \in \Gamma,\, \gamma \neq 1} B_{\gamma 1}(Y, X)}, \qquad \gamma \in \Gamma, \]
where P(γ | Y, X) is the posterior probability of the model with predictors indexed by γ, conditioned on the observed Y and X. Suppose that the standard default prior is π^D(β_γ, σ_γ) = C_γ/σ_γ², where C_γ is a constant. If there are two models M_1 and M_2 with default priors π_1^D(β_γ, σ_γ) and π_2^D(β_γ, σ_γ), the proposed intrinsic priors are
\[ \pi_1^{Ins}(\beta_{\gamma 1}, \sigma_{\gamma 1}) = \pi_1^D(\beta_{\gamma 1}, \sigma_{\gamma 1}) \]
and
\[ \pi^{Ins}(\beta_{\gamma 2}, \sigma_{\gamma 2} \mid \beta_{\gamma 1}, \sigma_{\gamma 1}) = \pi^D(\beta_{\gamma 2}, \sigma_{\gamma 2})\; E^{M_2}_{x(l) \mid \beta_{\gamma 2}, \sigma_{\gamma 2}}\!\left[ \frac{f_1(x(l) \mid \beta_{\gamma 1}, \sigma_{\gamma 1})}{\int f_2(x(l) \mid \beta_{\gamma 2}, \sigma_{\gamma 2})\, \pi^D(\beta_{\gamma 2}, \sigma_{\gamma 2})\, d\beta_{\gamma 2}\, d\sigma_{\gamma 2}} \right], \]
where f_1(x(l) | β_{γ1}, σ_{γ1}) is normally distributed and x(l) is a training sample. It was proved that the intrinsic prior for (β, σ) conditioned on any point (β_γ, σ_γ) is
\[ \pi^I(\beta, \sigma \mid \beta_\gamma, \sigma_\gamma) = N_p\!\left(\beta \,\middle|\, \beta_\gamma,\; (\sigma^2 + \sigma_\gamma^2)\, U^{-1}\right) \frac{1}{\sigma_\gamma}\left(1 + \frac{\sigma^2}{\sigma_\gamma^2}\right)^{-3/2}, \]
with U = Z'Z, where Z is a theoretical design matrix of dimension (p+1) × p. Furthermore, the unconditional intrinsic prior for (β, σ) can be calculated directly. By using the intrinsic priors, hyperparameters are avoided and automatic priors are achieved. One step further, the Bayes factors can be calculated using the intrinsic priors, and then the posterior model probability for M_γ, γ ∈ Γ, can be computed.

The intrinsic prior approach comes from the model structure and is free of hyperparameters (there is no need to consider a range of hyperparameters), so it can be used as a default prior; it is currently one of the only objective procedures.
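Once the Bayes factors B_{γ1} are available (from intrinsic priors or any other prior choice), turning them into posterior model probabilities by the formula above is a one-liner; a small sketch with hypothetical values:

```python
def model_posteriors(bf):
    """Posterior model probabilities from Bayes factors against the full model.

    `bf` maps each reduced model gamma (as a tuple of included predictors) to
    B_{gamma,1}; the full model itself contributes the 1 in the denominator.
    """
    denom = 1.0 + sum(bf.values())
    post = {gamma: b / denom for gamma, b in bf.items()}
    post["full"] = 1.0 / denom
    return post

# Hypothetical Bayes factors for three reduced models against the full model.
print(model_posteriors({(1,): 0.5, (1, 2): 4.0, (2,): 0.25}))
```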


1.4.3 Stochastic Search Algorithm

Whether we use the intrinsic priors or set the hyperparameters for the priors, the posterior model probabilities are troublesome to calculate; furthermore, it is often unrealistic to calculate all 2^p posterior probabilities. Fortunately, a Markov chain Monte Carlo stochastic search algorithm has been developed and successfully applied.

A general way to implement an MCMC procedure is to run a Gibbs sampler
\[ (\beta^0, \sigma^0, \gamma^0),\; (\beta^1, \sigma^1, \gamma^1),\; \ldots,\; (\beta^t, \sigma^t, \gamma^t),\; \ldots \]
with
\begin{align*}
\beta^{t+1} &\sim f(\beta^{t+1} \mid \sigma^t, \gamma^t, X, Y), \\
\sigma^{t+1} &\sim f(\sigma^{t+1} \mid \beta^{t+1}, \gamma^t, X, Y), \\
\gamma^{t+1} &\sim f(\gamma^{t+1} \mid \beta^{t+1}, \sigma^{t+1}, X, Y).
\end{align*}
After enough burn-in, we can rank the posterior frequencies by counting the visited models, and the highest ranked (or almost equally highest ranked) models are selected. Although the Gibbs sequence does not necessarily visit the entire model space, it normally has enough chance to visit the best models; Casella and Moreno (2006) discuss this in more detail.

Another way to explore the posterior distribution P(γ | Y, X) is to construct a Metropolis-Hastings Markov chain. Normally the chain will not only visit all the models but also visit the better models more often. It works as follows: at iteration t, draw a candidate γ' from a candidate distribution V, and accept the candidate with probability
\[ \min\left(1, \; \frac{P(\gamma' \mid Y, X)\, V(\gamma^t)}{P(\gamma^t \mid Y, X)\, V(\gamma')}\right), \]
staying at γ^t otherwise. Normally the draws from V are independent, and the generated Markov chain has stationary distribution P(γ | Y, X).
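A minimal sketch of this Metropolis-Hastings search (my own illustration): with an independence candidate distribution V that is uniform over {0,1}^s, the V terms cancel and the acceptance probability reduces to a ratio of posteriors. Here `log_post` is assumed to be supplied by the model and to return the unnormalized log posterior log P(γ | Y, X):

```python
import numpy as np

def mh_model_search(log_post, s, n_iter=20_000, seed=0):
    """Independence Metropolis-Hastings search over inclusion vectors gamma."""
    rng = np.random.default_rng(seed)
    gamma = rng.integers(0, 2, s)
    lp = log_post(gamma)
    visits = {}
    for _ in range(n_iter):
        cand = rng.integers(0, 2, s)              # independent uniform candidate
        lp_cand = log_post(cand)
        if np.log(rng.random()) < lp_cand - lp:   # accept with prob min(1, ratio)
            gamma, lp = cand, lp_cand
        key = tuple(int(g) for g in gamma)
        visits[key] = visits.get(key, 0) + 1
    return sorted(visits.items(), key=lambda kv: -kv[1])  # rank by visit frequency
```

Counting visits gives the posterior frequencies used to rank models, just as in the Gibbs version above.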

Carlin and Chib (1995) used the Gibbs sampler to generate samples from the joint distribution of model and parameters conditioned on the response. This method is computationally demanding and works well for a small number of candidate variables. Dellaportas et al. (2002) proposed a hybrid Gibbs-Metropolis strategy based on Carlin and Chib's method. Green (1995) suggested the reversible jump MCMC method, which can be used for model selection across different dimensions. Chib (1995) also proposed estimating the marginal likelihood by using blocks of parameters. These methods generally work well for moderate numbers of candidate variables, and some of them perform well in special situations; for example, reversible jump is good for mixture modeling with an unknown number of components. However, there has been almost no investigation of variable selection when a certain percentage of the variable values are missing.

1.5 Outline of the Dissertation

In the previous sections, I introduced the ADEPT 2 project and described the design of the loblolly pine population. The patterns and categories of missing data and the different methods for handling missing data were reviewed, as were variable selection methods based on linear regression models, especially Bayesian variable selection methods.

In Chapter 2, an association test is proposed to detect the significant SNPs for

the target phenotypic traits. As the population structure contains substantial genetic information about the population and indexes the closeness of two individuals within it, we use a numerator relationship matrix to quantify the closeness between any two individuals within the population. A Bayesian hierarchical model is proposed and the

Gibbs sampler is employed to sample the parameters of the joint likelihood. In order to

be able to handle thousands of SNPs towards the end of the ADEPT 2 project, methods to speed up the computation are proposed: a Gibbs sampling procedure which samples just one column of SNPs in each cycle instead of all the columns; and a matrix identity which takes advantage of the updating scheme and avoids direct matrix inversion. Simulation studies and real data analyses will be detailed.

In Chapter 3, a Bayesian model selection method is proposed to select “good”

subsets of variables, meanwhile missing data are being properly taken care of. We use the


Bayes factor as the criterion of model selection and a novel Bayes factor approximation

formula is proposed. A stochastic search algorithm is employed to search over the subsets, and its convergence properties are proved. The Bayes factor approximation formula

potentially has wide application for comparing subsets of different dimensions. The

method is applied to simulated data sets as well as the loblolly data.

In Chapter 4, a summary of the dissertation is given, its contributions are discussed, and some future work is outlined.


CHAPTER 2
A HIERARCHICAL BAYESIAN MODEL FOR GENOME-WIDE ASSOCIATION STUDIES OF SNPS WITH MISSING VALUES

2.1 Introduction

All species are made up of DNA sequences. Within any species, all individuals share most of the DNA, and only a very small percentage of the DNA sequence differs. These differing nucleotides in the DNA sequence are called single nucleotide polymorphisms, SNPs. Human beings share about 99.9% of their DNA sequences, and the remaining 0.1% makes us different individuals. These 0.1% of DNA sequences are also genetically responsible for differences in disease development and drug responses between individuals. Loblolly pine has 0.5% nucleotide diversity, which is slightly larger than that of soybeans and human beings. Among these 0.5% of SNPs, we are interested in detecting the significant SNPs which affect the disease response of loblolly pine, especially to pitch canker.

As technology develops so rapidly, SNP data are getting cheaper and more easily available. At the same time, scientists have more opportunities to use high-throughput SNP data sets that were not possible before. Although microarray technology can provide high-throughput SNP data, at current capability typically 5% to 10% of genotypes are missing, as pointed out by Dai et al. (2006). One challenge scientists face is to use the available SNP and phenotypic trait information to detect the SNPs which have strong association with the phenotypic traits; at the same time, a statistical method is needed to properly address the missing SNPs.

Population association involving missing genotypic markers has made a lot of progress in human genetics, especially in the case of tightly linked genotypes; that is, most attention has been focused on fine-scale molecular regions. The “Phase” package of Stephens et al. (2001) aims at haplotype reconstruction of genotyped data and uses an EM algorithm to maximize the likelihood of haplotype frequencies, with an approximate type of prior for the conditional haplotype distribution. Scheet and Stephens (2006) proposed


“Fastphase” for missing genotype imputation and haplotype phasing. They used a hidden Markov model for the cluster origins of alleles in the haplotypes and the origins of clusters for genotypes. Imputation was taken as the “best guess” under the likelihood, and parameters were again estimated using the EM algorithm. These methods were reported to give

accurate estimates for tightly linked markers. Chen and Abecasis (2007) proposed family-based tests for genome-wide association. They use an identity-by-descent parameter to measure correlation for the test SNPs, and a kinship coefficient to model the correlation between siblings. Their method is used for one family and one SNP at a time, and it seems not fully applicable to complicated population pedigrees or to simultaneous SNP testing. Servin and Stephens (2007) proposed a Bayesian regression approach for association testing. The missing genotypes were imputed by “Fastphase” beforehand, and a mixture prior is used for the SNP effect. The prior on the number of significant SNPs is set to be small, and has some influence on the results. Dai et al. (2006) used EM, weighted

EM, and a nonparametric method (CART) for association studies. They used multiple

imputation samples from the tree based algorithms. Roberts et al. (2007) proposed a fast

nearest-neighbor based algorithm to infer the missing genotypes. Marchini et al. (2007)

proposed a unifying framework of missing genotype imputation and association testing

based on haplotypes and other available human genomic data sets. Sun and Kardia (2008)

proposed a neural network model for the missing SNPs and used the BIC to choose the

predictors in the model and further predict the missing SNPs according to the chosen

model. Balding (2006) gives a detailed review of association studies and some missing

genotype methods based on human genetics. There is much more literature on this topic,

the above being just a sample. Almost all the literature we found, however, focuses on human genetic association where the markers are tightly linked and haplotype-based inference is dominant. Some papers are devoted to situations where parental information is missing and propose likelihood-based methods; see Weinberg (1999), Martin et al. (2003), and Boyles et al. (2005).


The Expectation-Maximization (EM) algorithm was originally developed to handle missing data, and some authors have been using EM for missing SNP imputation too. But in our situation, with many potentially correlated SNPs, EM is not computationally feasible, and we will show why here. For the EM algorithm, we start with the following model, as in Section 2.2.

Y = Xβ + Zγ + ε (2–1)

Each row of the matrix Z, $Z_i$, $i = 1, \ldots, n$, corresponds to the SNP genotype information of one individual, and for one individual we write

$Z_i = (Z_i^o, Z_i^m).$

The EM algorithm begins by building the complete data likelihood, which is the likelihood

function that would be used if the missing data are filled in.

When we fill in the missing data we write

$Z_i^* = (Z_i^o, Z_i^m),$

and the complete data are $(Y, Z^*)$ with likelihood function

$L_C = \frac{1}{(2\pi\sigma^2)^{n/2}} \prod_{i \in I_O} \exp\left(-\frac{1}{2\sigma^2}(Y_i - X_i\beta - Z_i\gamma)^2\right) \times \prod_{i \in I_M} \exp\left(-\frac{1}{2\sigma^2}(Y_i - X_i\beta - Z_i^*\gamma)^2\right)$   (2–2)

where $I_O$ indexes those individuals with complete SNP data, and $I_M$ indexes those individuals with missing SNP information.

The observed data likelihood, which is the function that we eventually use to estimate

the parameters, is based on the observed data only. Where there is missing data, the

complete data likelihood must be summed over all possible values of the missing data. So


we have

$L_O = \frac{1}{(2\pi\sigma^2)^{n/2}} \prod_{i \in I_O} \exp\left(-\frac{1}{2\sigma^2}(Y_i - X_i\beta - Z_i\gamma)^2\right) \times \prod_{i \in I_M} \sum_{Z_i^*} \exp\left(-\frac{1}{2\sigma^2}(Y_i - X_i\beta - Z_i^*\gamma)^2\right),$

Here we do not have any information about the weight of each term and we will use equal

weights. The distribution of the missing data $Z_i^*$ is given by the ratio $L_C/L_O$:

$P(Z_i^*) = \frac{\exp\left(-\frac{1}{2\sigma^2}(Y_i - X_i\beta - Z_i^*\gamma)^2\right)}{\sum_{Z_i^*} \exp\left(-\frac{1}{2\sigma^2}(Y_i - X_i\beta - Z_i^*\gamma)^2\right)},$   (2–3)

where the sum in the denominator is over all possible realizations of $Z_i^*$. This is a discrete

distribution on the missing SNP data for each individual. To understand it, look at one

individual.

Suppose that there are g possible genotypes (typically g = 2 or 3) and individual i has missing data on k SNPs. So the data for individual i are $Z_i = (Z_i^o, Z_i^m)$, where $Z_i^m$ has k elements, each of which could be one of g classes. For example, if g = 3 and k = 7, then $Z_i^m$ can take values as in Table 2-1, where the ∗ shows one possible value of $Z_i^m$. For this example, there are $3^7 = 2187$ possible values for $Z_i^m$. In a real data set this could grow out of hand: if there were 12 missing SNPs, there would be $3^{12} = 531{,}441$ possible values for $Z_i^m$; with 20 missing SNPs the number grows to $3^{20} = 3{,}486{,}784{,}401$ (3.5 billion). If we employ the EM algorithm, then in every iteration step, for each observation, we would need to sum over a gigantic number of terms to calculate the distribution of the missing data, and this is computationally infeasible. So we thought we might run a Gibbs sampler inside each EM step to generate the means for the missing data and avoid summing over billions of terms.

The Gibbs sampler for the missing data simulates samples of $Z_i^m$ according to the distribution of each missing element conditioned on the rest of the vector. More of this will be discussed in Section 2.2; here we just give a flavor of why the Gibbs sampler works.


Table 2-1. Illustration of combinations of missing data for one observation: a grid of k = 7 SNPs by g = 3 genotypes, in which the ∗ entries mark one possible realization of $Z_i^m$.

For a particular element $Z_{ij}^m$, the conditional distribution given the rest of the vector $Z_{i(-j)}^m$ is

$P(Z_{ij}^m = c \mid Z_{i(-j)}^m) = \frac{\exp\left(-\frac{1}{2\sigma^2}\,(Y_i - X_i\beta - Z_i^o\gamma_i^o - Z_{i(-j)}^m\gamma_{i(-j)}^m - c\,\gamma_{ij}^m)^2\right)}{\sum_{\ell=1}^{3} \exp\left(-\frac{1}{2\sigma^2}\,(Y_i - X_i\beta - Z_i^o\gamma_i^o - Z_{i(-j)}^m\gamma_{i(-j)}^m - c_\ell\,\gamma_{ij}^m)^2\right)},$   (2–4)

where there are only g = 3 terms in the denominator sum; this is what makes the Gibbs sampler computationally feasible for us.
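For one missing element, the draw from (2–4) reduces to a three-way multinomial, as in this minimal Python sketch (the residual r_i and the effect γ^m_ij are assumed already computed from the current parameter values; the genotype codes are illustrative):

import numpy as np

rng = np.random.default_rng(2)
codes = np.array([-1.0, 0.0, 1.0])   # the g = 3 candidate genotype codes c_l
r_i = 0.7          # Y_i - X_i beta - all SNP contributions except SNP j
gamma_mj = 0.4     # effect gamma^m_ij of the missing SNP
sigma2 = 1.0

log_w = -(r_i - codes * gamma_mj) ** 2 / (2.0 * sigma2)
prob = np.exp(log_w - log_w.max())   # stabilize before normalizing
prob /= prob.sum()                   # only 3 terms in the denominator
z_new = rng.choice(codes, p=prob)    # the imputed genotype draw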

For plant association or other agricultural genome association studies, the whole

genome sequence library might not be complete and the degree of linkage disequilibrium

information might not be available. Thus new methodologies are needed.

To impute missing data, multiple imputation has received much attention since its proposal by Little and Rubin (1987). It acknowledges the uncertainty due to missing data and bases inference on multiply imputed data sets, thereby addressing the variability in the missing data. Most methods for large genetic data sets in the literature, except Dai et al. (2006), use single imputation and base the imputation on the largest probability among the candidate categories. Although single imputation tends to give fast calculation, it generally loses power and can give biased parameter estimates. Unlike single imputation, we treat the missing SNPs as parameters and impute them at each iteration. This is less biased compared with single SNP imputation and more accurately captures the variation in the data.

In an association study across the entire population, the family pedigree is typically very important, since it explains the genetic information shared by relatives, which is not necessarily genotyped. Yu et al. (2005) used a kinship matrix and population structure


to describe the quantitative inheritance. Chen and Abecasis (2007) proposed family-based association testing, but their method accounts for the relationships within families and not the relationships between families. We propose to use a numerator relationship matrix to explain the relationships of individuals both within and between families. The idea behind the calculation of a numerator relationship matrix is originally due to Henderson (1976); see also Quaas (1976).

With the motivating CCLONES data set, we propose a Bayesian methodology for

association studies. We model the missing SNPs as parameters and use Gibbs sampling

to sample the parameters, including the missing SNPs. For missing SNP imputation, this

is essentially multiple imputation. We take advantage of the family information from the

available family pedigree and parent pedigree. It is not a haplotype-based method: it does not require the SNPs to be in linkage disequilibrium, but allows the model to find the correlation (if it exists). This property is quite appealing for plant association or any other association study for which we do not have a sequenced genome library or detailed information about the candidate SNPs. It is especially useful for genome scans and can point out some

significant SNPs for further study. Finally, we prove that just updating one missing SNP

for each observation in each iteration will still achieve the target stationary distribution.

This will substantially increase the calculation speed if there are large numbers of SNPs

for each observation, and gives us the ability to deal with large high-throughput data sets.

This chapter is organized as follows. In Section 2.2 we explain two proposed models,

and also propose a method to increase the computation speed. In Section 2.3, simulation

results are given and we also compare with other results. In Section 2.4, the real data analysis results are reported. In Section 2.5, we explain a method to categorize the covariance relationships for any given pedigree situation. Finally, in Section 2.6, we give a discussion.


2.2 Proposed Method

In this chapter, the response is assumed to be continuous with a normal distribution,

although the following method can be easily adapted to a discrete situation (using a latent

variable probit model), which will be explained in the discussion section. The data set

has fully observed family covariates for all the observations and missing values only exist

among SNPs. Interest is focused on developing methods to test the relationship between

the response and the SNPs. To quantify the effect of the SNPs, we decompose each SNP into an additive and a dominant effect: a homozygous genotype is assigned the additive effect, a heterozygous genotype the dominant effect, and a mutant homozygous genotype the negative of the additive effect.

In this section, we discuss two models for this data situation. For simplicity of analysis, we employ one of them throughout the simulation and real data analysis.

2.2.1 The Model without Ramet Random Effect

First of all, a ramet is the smallest observed individual in the data set, and any two individuals are called ramets of each other when they share exactly the same genetic information, not only the same parents. We later use “clone” to mean a group of ramets in one family, which is different from the common understanding of “clone”. For this subsection and throughout the dissertation, the phenotypic responses are averaged over ramets for each clone except where specifically noted.

The model is

Y = Xβ + Zγ + ε (2–5)


where

$Y_{n\times 1}$ = phenotypic trait,
$X_{n\times p}$ = design matrix for family covariates,
$\beta_{p\times 1}$ = coefficients for the family effects,
$Z_{n\times s}$ = design matrix for SNPs (genotypes),
$\gamma_{s\times 1}$ = coefficients of the additive and dominant effects of the SNPs,
$\varepsilon_{n\times 1} \sim N(0, \sigma^2 R)$.

The subscripts of these variables denote the dimensions. The variance-covariance matrix R is the numerator relationship matrix, which describes the degree of kinship

between different individuals. Details about how to calculate the numerator relationship

matrix R will be given later.

Each row of the matrix Z, $Z_i$, $i = 1, \ldots, n$, corresponds to the SNP genotype information of one individual. Some of this information may be missing, and we write

$Z_i = (Z_i^o, Z_i^m)$

where $Z_i^o$ are the observed genotypes for the ith individual, and $Z_i^m$ are the missing genotypes. Note two things:

1. The values of $Z_i^m$ are not observed. Thus, if ∗ denotes one missing SNP, a possible $Z_i$ is $Z_i = (1, \ast, 0, 0, \ast, \ast, 1)$.

2. Individuals may have different missing SNPs. So for 2 different individuals, we might have

$Z_i = (1, \ast, 0, 0, \ast, \ast, 1)$
$Z_{i'} = (\ast, \ast, 1, 0, 0, 1, 1),$

which can impose a heavy computational burden.


For a Bayesian model, we want to put a prior distribution on the parameters. We

put a noninformative uniform prior for β, which essentially leads us to least squares

estimation. For γ, we use the normal prior γ ∼ N(0, σ2φ2I). Here φ2 is a scale parameter

for the variance and σ² is the variance parameter. For σ² and φ², we use inverted gamma priors: σ² ∼ IG(a, b) and φ² ∼ IG(c, d), where IG stands for the inverted gamma distribution, and a, b, c, and d are constants used in the priors. We say σ² has an IG(a, b) distribution if

$f(\sigma^2) = \frac{b^a}{\Gamma(a)}\,\frac{\exp(-b/\sigma^2)}{(\sigma^2)^{a+1}}.$

According to Hobert and Casella (1996), we take a, b, c, d to be specified values, which

ensures a proper posterior distribution.

Now let us specify the values of a, b, c, d and make sure that the posterior distributions are proper. First of all, we can write the model specification as follows:

p(Y |β, γ, Z, σ2, φ2) ∼ N(Xβ + Zγ, σ2R), (2–6)

π(β) ∝ 1,

π(γ) ∼ N(0, σ2φ2I),

π(σ2) ∼ IG(a, b),

π(φ2) ∼ IG(c, d).

Let $P_x = I - X(X'X)^{-1}X'$ and $t = \mathrm{rank}(P_x Z)$. According to Theorem 1 of Hobert and Casella (1996), the following inequalities must be satisfied:

$c < 0,$
$q > q - t - 2c,$
$n + 2c + 2a - p > 0,$
$c > -s,$
$2b > -n,$

where s is the number of SNPs in the data, q = 2s, p is the dimension of β, and n is the number of observations.

Further simplification leads to the following inequalities:

$0 > c > -s,$
$a > -n/2,$
$n + 2c + 2a - p > 0.$

As the number of observations in the data set is about one thousand, the number of SNP parameters is about one hundred, and the dimension of β is 61, we take c = −1, a = 3, b = 1, and d = 1. This specification ensures that the posterior distributions are proper and that the priors have wide, flat ranges.
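These choices are quick to check against the simplified inequalities above; a minimal sketch, using the approximate sizes just quoted (n = 1000, s = 100, p = 61):

def priors_proper(n, s, p, a, c):
    # The simplified propriety conditions derived above.
    return (0 > c > -s) and (a > -n / 2) and (n + 2 * c + 2 * a - p > 0)

assert priors_proper(n=1000, s=100, p=61, a=3, c=-1)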

As our data set is from the loblolly pine genome, which, like many other agricultural genomes, is not fully sequenced, there is not enough prior information in terms of

the frequency of genotypes for each SNP. Also, the physical positions of the SNPs are

not fully recorded and the SNPs may not be tightly linked as clusters of haplotypes. So

noninformative priors are used for the missing SNPs. As the SNPs considered here are biallelic, each missing SNP has 3 possible genotypes: homozygous, heterozygous, and mutant homozygous. For example, if the two alleles at the locus are A and G, the missing SNP could be AA, AG, or GG, and the prior assumes the SNP has an equal chance of being each of AA, AG, or GG.

For the missing SNPs in the data, we assume that they are missing at random (MAR). In other words, we assume that the probability that a SNP is missing may depend on the observed data, such as the phenotypic trait or other observed SNPs, but is independent of the unobserved information. Conditional on this assumption, we impute the missing SNPs based on the correlation between SNPs within and between individuals, and use the phenotypic trait information to improve the power of imputation. MAR is a reasonable assumption and not as strict as missing completely at random (MCAR).

In this model, the covariance matrix R models the covariance between individuals within the same family and the covariance between individuals across families. Phenotypic traits of related individuals are alike because they share a large fraction of genetic material. Genotypes of relatives are similar because they share the same parents or grandparents (to some degree). Genotyped SNPs may explain part of the phenotypic traits, but the un-typed genetic information contributes to the phenotypic traits as well, as do other factors such as environmental effects. Simply using the SNP marker information without considering the family pedigree or history is not a wise approach. We expect that incorporating the family structure information will increase the power of capturing the underlying nature of the data set and help to detect the significant SNPs.

In the literature, there are different methods to calculate the relationship matrix, such as using a co-ancestry matrix, a kinship matrix, etc. The basic idea is to calculate the probability that 2 individuals share one gene or SNP passed down from the same ancestor. Some of the methods employ pairwise calculation and thus do not guarantee a positive definite relationship matrix, which is usually not satisfactory when the relationship matrix is used as a covariance matrix. We use the recursive calculation method due to Henderson (1976). This method gives a numerator relationship matrix which quantifies the


probability of sharing one gene from the same ancestor, based on the known family pedigree and parent pedigree in the population. Calculating this relationship matrix, we obtain a probability of 0.5 for the case of two siblings within the same family sharing the same copy of a gene from one ancestor, and a probability of 0.25 if two individuals share one parent. For the loblolly pine data set, there are a total of 9 categories of relatedness, and some details can be seen in Appendix B. Notice that even if the family pedigree or parent pedigree is not available, we can still use the proposed method to calculate the relationship matrix. Section 2.5 below is dedicated to categorizing the relationships for any pedigree history.
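A minimal sketch of the recursive (tabular) construction follows, assuming individuals are ordered so that parents precede offspring and that pedigree[i] gives the two parent indices (None when unknown); this is the textbook Henderson (1976) recursion, not the program used for the loblolly data.

import numpy as np

def numerator_relationship(pedigree):
    n = len(pedigree)
    A = np.zeros((n, n))
    for i, (s, d) in enumerate(pedigree):
        for j in range(i):
            a_js = A[j, s] if s is not None else 0.0
            a_jd = A[j, d] if d is not None else 0.0
            A[i, j] = A[j, i] = 0.5 * (a_js + a_jd)
        A[i, i] = 1.0
        if s is not None and d is not None:
            A[i, i] += 0.5 * A[s, d]
    return A

# Two unrelated founders and two of their full-sib offspring:
A = numerator_relationship([(None, None), (None, None), (0, 1), (0, 1)])
print(A[2, 3])   # 0.5, the full-sibling value quoted above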

As in Equation 2–6, we have the following model specification

p(Y |β, γ, Z, σ2, φ2) ∼ N(Xβ + Zγ, σ2R),

π(β) ∝ 1,

π(γ) ∼ N(0, σ2φ2I),

π(σ2) ∼ IG(a, b),

π(φ2) ∼ IG(c, d).

So the conditionals are

$\beta \sim N\left((X'R^{-1}X)^{-1}X'R^{-1}(Y - Z\gamma),\; \sigma^2(X'R^{-1}X)^{-1}\right)$   (2–7)

$\gamma \sim N\left(\left(Z'R^{-1}Z + \frac{I}{\phi^2}\right)^{-1}Z'R^{-1}(Y - X\beta),\; \sigma^2\left(Z'R^{-1}Z + \frac{I}{\phi^2}\right)^{-1}\right)$

$\sigma^2 \sim \frac{1}{(\sigma^2)^{n/2+s/2+a+1}}\exp\left(-\frac{(Y - X\beta - Z\gamma)'R^{-1}(Y - X\beta - Z\gamma) + |\gamma|^2/\phi^2 + 2b}{2\sigma^2}\right)$

$\phi^2 \sim \frac{1}{(\phi^2)^{s/2+c+1}}\exp\left(-\frac{|\gamma|^2/\sigma^2 + 2d}{2\phi^2}\right).$

In the above conditionals, the Z matrix includes both the observed and the missing SNPs. For the missing SNPs, we use a Gibbs sampler to impute the values. The Gibbs sampler for the missing data


simulates the samples of $Z_i^m$ according to the distribution of each missing SNP conditional on the rest of the observed SNPs and the sampled missing SNPs. For a particular SNP $Z_{ij}^m$, the jth missing SNP of the ith individual, the conditional distribution given the rest of the vector $Z_{i(-j)}^m$ and all other parameters in the model is

$P(Z_{ij}^m = c \mid Z_{i(-j)}^m) = \frac{\exp\left(-K_c'\,\Sigma^{-1} K_c / 2\sigma^2\right)}{\sum_{\ell=1}^{3} \exp\left(-K_\ell'\,\Sigma^{-1} K_\ell / 2\sigma^2\right)},$   (2–8)

where

$K_c = Y_i - X_i\beta - Z_i^o\gamma_i^o - Z_{i(-j)}^m\gamma_{i(-j)}^m - c\,\gamma_{ij}^m$

and

$K_\ell = Y_i - X_i\beta - Z_i^o\gamma_i^o - Z_{i(-j)}^m\gamma_{i(-j)}^m - c_\ell\,\gamma_{ij}^m.$

The value c is the genotype currently being considered for that missing SNP, and $c_\ell$ represents any one of the possible genotypes for the SNP. Notice there are only 3 terms in the denominator sum for each SNP; this is the key reason why Gibbs sampling is feasible in our situation, with many SNPs per observation.
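Putting the pieces together, here is a compact sketch of the parameter draws for one Gibbs cycle of model (2–5), using the conditionals in (2–7), with toy simulated inputs and with R taken as the identity for brevity (the R⁻¹ weighting enters exactly as written in (2–7)); the missing-SNP draw of (2–8) would be inserted into this cycle.

import numpy as np

rng = np.random.default_rng(3)
n, p, s = 60, 3, 5                        # toy sizes for illustration
X = rng.normal(size=(n, p))
Z = rng.choice([-1.0, 0.0, 1.0], size=(n, s))
Y = X @ np.ones(p) + Z @ rng.normal(size=s) + rng.normal(size=n)
a, b, c, d = 3.0, 1.0, -1.0, 1.0          # the prior constants chosen above
beta, gamma, sigma2, phi2 = np.zeros(p), np.zeros(s), 1.0, 1.0

for t in range(5000):
    # beta | rest: normal, as in (2-7) with R = I
    V = np.linalg.inv(X.T @ X)
    beta = rng.multivariate_normal(V @ X.T @ (Y - Z @ gamma), sigma2 * V)
    # gamma | rest: normal with the ridge term I / phi^2
    U = np.linalg.inv(Z.T @ Z + np.eye(s) / phi2)
    gamma = rng.multivariate_normal(U @ Z.T @ (Y - X @ beta), sigma2 * U)
    # sigma^2 | rest: inverted gamma IG(n/2 + s/2 + a, scale)
    resid = Y - X @ beta - Z @ gamma
    scale = (resid @ resid + gamma @ gamma / phi2 + 2.0 * b) / 2.0
    sigma2 = scale / rng.gamma(n / 2.0 + s / 2.0 + a)
    # phi^2 | rest: inverted gamma IG(s/2 + c, scale)
    phi2 = ((gamma @ gamma / sigma2 + 2.0 * d) / 2.0) / rng.gamma(s / 2.0 + c)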

2.2.2 The Model with Ramet Random Effect

Although the ramets within a clone share exactly the same genetic information, they are still subject to environmental influences and to environment-by-genotype interaction. The originally recorded phenotypic responses are traits per ramet. We are interested in the environmental variance and hope to find out how big its influence is. In this subsection, we propose a model that further describes the random effect of ramets.

The model is

Y = Xβ + Zγ + Wu + ε (2–9)


where

$Y_{n\times 1}$ = phenotypic trait,
$X_{n\times p}$ = design matrix for family structure,
$\beta_{p\times 1}$ = coefficients for parents (fixed effects),
$Z_{n\times s}$ = design matrix for SNPs (genotypes),
$\gamma_{s\times 1}$ = parameters of the SNP effects,
$W_{n\times r}$ = design matrix for ramets,
$u_{r\times 1}$ = random effects of ramets,
$\varepsilon_{n\times 1} \sim N(0, \sigma^2 I)$.

The variance term is specified as the identity matrix for now, for simplicity. The full specification of model 2–9 includes the following prior distributions:

$\varepsilon \sim N(0, \sigma^2 I),$
$\gamma \sim N(0, \sigma^2\phi^2 I),$
$u \sim N(0, \sigma^2\tau^2 I),$   (2–10)
$\sigma^2 \sim \pi(\sigma^2),$
$\phi^2 \sim \pi(\phi^2),$
$\tau^2 \sim \pi(\tau^2).$


One thing to note is that for the design matrix W we do not have missing values, since the ramet information is available. The joint distribution of all of the parameters is

$\pi(\beta, \gamma, u, \sigma^2, \phi^2, \tau^2) \propto \frac{1}{(\sigma^2)^{(n+s+r)/2}(\phi^2)^{s/2}(\tau^2)^{r/2}} \exp\left(-\frac{1}{2\sigma^2}|Y - X\beta - Z\gamma - Wu|^2\right) \exp\left(-\frac{1}{2\sigma^2\phi^2}|\gamma|^2\right) \exp\left(-\frac{1}{2\sigma^2\tau^2}|u|^2\right) \pi(\sigma^2)\,\pi(\phi^2)\,\pi(\tau^2).$   (2–11)

According to Hobert and Casella (1996), the following inequalities need to be satisfied to make sure that the posterior distributions are proper:

$2c > -\mathrm{rank}(Z),$   (2–12)
$2e > -\mathrm{rank}(W),$
$2a > -n,$
$c < 0,$
$e < 0,$
$\mathrm{rank}(Z) > \mathrm{rank}(K) - t - 2c,$
$\mathrm{rank}(W) > \mathrm{rank}(K) - t - 2e,$
$n + 2e + 2c + 2a - p > 0,$

where rank(·) is the rank function, K is the concatenated matrix [Z|W], σ² has the prior σ² ∼ IG(a, b), φ² has the prior φ² ∼ IG(c, d), and τ² has the prior τ² ∼ IG(e, f). Also, $t = \mathrm{rank}(P_x K)$ where $P_x = I - X(X'X)^{-1}X'$, and p is the number of families in the data. So we take a = −1, c = −1, e = −1, b = 0, d = 0, f = 0, and these satisfy the inequalities 2–12.


Based on Equation 2–11, with flat priors on σ², φ², and τ², which as shown above ensure proper posterior distributions, we have the following full conditional distributions:

$\beta \sim N\left((X'X)^{-1}X'(Y - Z\gamma - Wu),\; \sigma^2(X'X)^{-1}\right)$

$\gamma \sim N\left(\left(\tfrac{1}{\phi^2}I + Z'Z\right)^{-1}Z'(Y - X\beta - Wu),\; \sigma^2\left(\tfrac{1}{\phi^2}I + Z'Z\right)^{-1}\right)$

$u \sim N\left(\left(\tfrac{1}{\tau^2}I + W'W\right)^{-1}W'(Y - X\beta - Z\gamma),\; \sigma^2\left(\tfrac{1}{\tau^2}I + W'W\right)^{-1}\right)$

$\sigma^2 \propto \frac{1}{(\sigma^2)^{(n+s+r)/2}}\exp\left(-\frac{1}{2\sigma^2}\left(|Y - X\beta - Z\gamma - Wu|^2 + \frac{1}{\phi^2}|\gamma|^2 + \frac{1}{\tau^2}|u|^2\right)\right)$   (2–13)

$\tau^2 \propto \frac{1}{(\tau^2)^{r/2}}\exp\left(-\frac{1}{2\tau^2}\,\frac{|u|^2}{\sigma^2}\right)$

$\phi^2 \propto \frac{1}{(\phi^2)^{s/2}}\exp\left(-\frac{1}{2\phi^2}\,\frac{|\gamma|^2}{\sigma^2}\right)$

We generate the missing SNPs by using

$P(Z_{ij}^m = c \mid Z_i^o, Z_{i(-j)}^m) = \frac{\omega_{ij}(Z_i^o, Z_{i(-j)}^m, c)\, \exp\left(-\frac{1}{2\sigma^2}(Y_i - X_i\beta - W_iu - Z_i^o\gamma_i^o - Z_{i(-j)}^m\gamma_{i(-j)}^m - c\,\gamma_{ij}^m)^2\right)}{\sum_{\ell=1}^{g} \omega_{i\ell}(Z_i^o, Z_{i(-j)}^m, c_\ell)\, \exp\left(-\frac{1}{2\sigma^2}(Y_i - X_i\beta - W_iu - Z_i^o\gamma_i^o - Z_{i(-j)}^m\gamma_{i(-j)}^m - c_\ell\,\gamma_{ij}^m)^2\right)},$   (2–14)

where $\omega_{i\ell}$ is the prior probability assigned to the missing SNP being genotype $c_\ell$. We take the priors of the missing SNPs to be uniform; in other words, all $\omega_{i\ell} = 1/3$.

Using the above method, we would need to sample u in each iteration, and this is theoretically feasible. However, the dimension of u is large and the samples of u are not of direct interest. We therefore integrate u out and work with the marginal distribution.


When we do this we get

$\pi(\beta, \gamma, \sigma^2, \phi^2, \tau^2) \propto \frac{|A|^{-1/2}}{(\sigma^2)^{(n+s)/2}(\phi^2)^{s/2}(\tau^2)^{r/2}} \exp\left(-\frac{1}{2\sigma^2}(Y - X\beta - Z\gamma)'B(Y - X\beta - Z\gamma)\right) \exp\left(-\frac{1}{2\sigma^2\phi^2}|\gamma|^2\right) \pi(\sigma^2)\,\pi(\phi^2)\,\pi(\tau^2),$   (2–15)

with

$A = \frac{1}{\tau^2}I + W'W, \qquad B = I - W A^{-1} W'.$

As we showed above, flat priors on σ², φ², and τ² ensure that the posterior distributions from the joint distribution 2–11 are proper. With the same flat priors, the posterior distributions from Equation 2–15 will still be proper, because the only difference here is that u has been integrated out, and this does not change the propriety of the other posterior distributions.

Based on Equation 2–15, we have the following full conditionals:

$\beta \sim N\left((X'BX)^{-1}X'B(Y - Z\gamma),\; \sigma^2(X'BX)^{-1}\right)$

$\gamma \sim N\left(\left(Z'BZ + \tfrac{1}{\phi^2}I\right)^{-1}Z'B(Y - X\beta),\; \sigma^2\left(Z'BZ + \tfrac{1}{\phi^2}I\right)^{-1}\right)$   (2–16)

$\sigma^2 \propto \frac{1}{(\sigma^2)^{(n+s)/2}}\exp\left(-\frac{1}{2\sigma^2}\left((Y - X\beta - Z\gamma)'B(Y - X\beta - Z\gamma) + \frac{1}{\phi^2}|\gamma|^2\right)\right)$

$\phi^2 \propto \frac{1}{(\phi^2)^{s/2}}\exp\left(-\frac{1}{2\phi^2}\,\frac{|\gamma|^2}{\sigma^2}\right),$

again giving normals for β and γ and inverted gammas for σ² and φ².
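The matrices A and B are cheap to form because A is only r × r; a minimal sketch with a hypothetical ramet design matrix W:

import numpy as np

rng = np.random.default_rng(4)
n, r, tau2 = 12, 4, 0.8                      # toy sizes; tau2 is the current tau^2
labels = rng.integers(0, r, size=n)          # which ramet each observation belongs to
W = (labels[:, None] == np.arange(r)).astype(float)

A = np.eye(r) / tau2 + W.T @ W               # the r x r matrix A in (2-15)
B = np.eye(n) - W @ np.linalg.inv(A) @ W.T   # the n x n matrix B in (2-15)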

We ran some simulations for this model and found that the computation time is especially long. The reason is that when we consider observations based on ramets instead of averages over ramets, the number of observations is about 4 times larger. Consequently, the computation time is much longer, in terms of matrix inversion and matrix determinant calculations. We decided to use the first model for the analysis and not to consider the variation within the ramets. Notice that this does not change the inference on the SNPs.

Table 2-2. The percentages of SNP categories in the generated data sets.

                      SNP1   SNP2   SNP3   SNP4   SNP5
Double homozygous:     13%    30%    91%    77%    39%
Heterozygous:          53%    38%     7%    19%    54%
Mutant homozygous:     33%    32%     2%     4%     7%

2.2.3 Increasing Computation Speed

For a data set containing hundreds of observations and hundreds, or even thousands, of SNPs, the computation speed of the Gibbs sampler can be a big issue. Furthermore, if the number of SNPs is increased, then for each iteration the number of missing SNPs to be updated also increases. To speed up calculation, we show that instead of updating all the SNPs at each iteration, updating only one column of SNPs each cycle still preserves the target stationary distribution and ergodicity.

Theorem 2.2.1. For the Gibbs sampler corresponding to (2–7) and (2–8), if instead of updating all the parameters $(\beta^{(t)}, \gamma^{(t)}, Z^{(t)}_{n\times p}, \sigma^{2(t)}, \phi^{2(t)})$ in each cycle, we just update $(\beta^{(t)}, \gamma^{(t)}, Z^{(t)}_j, \sigma^{2(t)}, \phi^{2(t)})$ in each iteration, the Markov chain achieves the same target stationary distribution and ergodicity also holds. Here $Z_j$ denotes the jth column of the $n\times p$ matrix Z of SNPs (the jth SNP for all observations), and j changes according to the iteration index.

The proof is given in Appendix A. The practical meaning of this theorem is that instead of updating tens or hundreds of SNPs in one cycle, we need to update only one SNP column in each cycle. This dramatically speeds up computation, especially when there are large numbers of SNPs in the data.
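In code, the theorem licenses a schedule like the following minimal sketch, where cycle t refreshes only SNP column j = t mod s (s = 44 as in the loblolly analysis below; the draws themselves are sketched in the earlier code and are elided here):

s, n_iter = 44, 42000          # 44 SNP columns; burn-in plus recorded cycles
for t in range(n_iter):
    j = t % s                  # the single SNP column refreshed this cycle
    # draw beta, gamma, sigma^2, phi^2 from the conditionals in (2-7), then
    # re-impute only the missing entries of column j via (2-8)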


2.3 Results for Simulated Data

Before the actual application of the methodology to a real data set, we apply it to

simulated data. We simulated a data set with 6 families, 20 observations in each family, and 5 SNPs per observation. The 5 SNPs are independent of each other. The six families are also independent, so the parents of the 6 families are not related and individuals across families are independent. On the other hand, the individuals within each family share the same parents and are related; this relationship is detailed in the numerator relationship matrix. The genotypes of the SNPs were generated according to the percentages of SNP categories of the first 5 observed SNPs in the loblolly pine data set (Table 2-2); the probabilities of double homozygous, heterozygous, and mutant homozygous genotypes for each SNP sum to 1. From this data set, four data sets with different percentages of missing values (5%, 10%, 15%, and 20%) were randomly generated. The family effects, β, which were used to simulate the data, are listed in Table 2-4 as the actual values. The SNP effects (additive and dominant effects), which were used to simulate the data, are listed as actual values in Table 2-5. We let the variance

parameter σ2 be 1. The proposed methodology was applied to analyze the data without

missing values and was also applied to data with different percentages of missing values.

We want to see whether it could identify the significant SNPs and check its performance

against different percentages of missing values, as well as its performance for different

probabilities of SNP genotype category.

Our ultimate goal is to find the significant SNPs from the candidate SNPs. Since

we believe that imputation is a tool to obtain better estimates of the parameters, we are

not particularly interested in recovering the actual imputed values for the missing SNPs.

With that being said, the simulation results in Table 2-3 show that when the probability

of one genotype for a certain SNP is dominantly high, the imputed SNPs are correctly

identified with substantially high probability. This could be seen in Table 2-3, where


Table 2-3. The percentages of correctly imputed SNPs for different probabilities of SNP categories; 10% missing values exist.

                                   SNP1       SNP2       SNP3        SNP4         SNP5
SNP values                       a=-2, d=1  a=1, d=-1  a=3, d=0.5  a=2.5, d=0.1  a=0.3, d=3
Probability to be “gg”            0.1309     0.3012     0.8181      0.7719       0.3983
Probability to be “gc”            0.5307     0.3875     0.0796      0.1950       0.5425
Probability to be “cc”            0.3384     0.3113     0.1023      0.0331       0.0592
Correctly imputed probability     0.55004    0.54788    0.63372     0.85075      0.65159
Probability to be “gg”            0.0309     0.3012     0.3181      0.3719       0.0983
Probability to be “gc”            0.8307     0.3875     0.0796      0.1950       0.8425
Probability to be “cc”            0.1384     0.3113     0.6023      0.4331       0.0592
Correctly imputed probability     0.7589     0.35052    0.97621     0.64829      0.85301

SNPs have high probabilities (in boldface) for one of the genotype categories and correspondingly higher probabilities of correct imputation (in boldface) than those that do not. If the probabilities of the missing SNP being each of the candidate genotypes are very close, the imputation tends to put the imputed SNP in any one of the candidate genotype categories. This result is consistent with the simulated SNPs being independent, where we expect to get the most information from the marginal SNP itself. Tables 2-4 and 2-5 list the parameter estimates for the family effects and SNP effects.

Table 2-4. The estimated means of family effects for data sets with different percentages of missing values. The methodology gives accurate estimates as the percentage of missing values goes up to 20%.

Estimated means      family 1: β1   family 2: β2   family 3: β3
Actual values:            15             20             25
No missing SNPs          15.45          20.65          25.48
5% missing SNPs          15.16          20.74          25.46
10% missing SNPs         16.18          21.38          25.65
15% missing SNPs         15.45          19.63          24.59
20% missing SNPs         14.87          20.18          24.68

Estimated means      family 4: β4   family 5: β5   family 6: β6
Actual values:            30             35             40
No missing SNPs          29.84          34.76          40.40
5% missing SNPs          28.29          33.43          38.62
10% missing SNPs         30.71          35.86          40.81
15% missing SNPs         30.18          35.38          40.18
20% missing SNPs         30.08          34.88          40.13


Table 2-5. The estimated means of SNP effects for the data set without missing values and for data sets with different percentages of missing values.

                            SNP1:a   SNP1:d   SNP2:a   SNP2:d   SNP3:a
Actual SNP values            -2.00     1.00     1.00    -1.00     3.00
Means for no missing SNPs    -2.16     1.00     0.82    -0.75     2.59
Means for 5% missing         -1.86     1.14     1.16    -1.05     3.00
Means for 10% missing        -1.95     0.77     1.18    -1.52     2.74
Means for 15% missing        -1.80     0.78     0.99    -0.96     2.48
Means for 20% missing        -2.08     1.29     1.21    -0.76     3.10

                            SNP3:d   SNP4:a   SNP4:d   SNP5:a   SNP5:d
Actual SNP values             0.00     2.50     0.10     0.30     3.00
Means for no missing SNPs     0.30     2.43     0.60    -0.04     2.38
Means for 5% missing          0.05     2.21    -0.20     0.48     2.88
Means for 10% missing         0.18     2.51     0.13     0.00     2.53
Means for 15% missing         0.67     2.43     0.47     0.73     3.20
Means for 20% missing         1.32     1.87    -0.20     0.47     3.30

All the calculations were based on samples obtained after an initial 20000 steps of burn-in. The results in Tables 2-4 and 2-5 show that when the percentage of missing values is not too high (less than 15%), the proposed methodology gives good estimates for the parameters of interest. When the percentage of missing values goes beyond 15%, we need to be very careful in interpreting the results. Take SNP3 as an example: the dominant effect for SNP3 is actually 0, and the estimate was 1.32 when the percentage of missing values is 20%, although the estimates are accurate when the percentage of missing values is less than 10%. The reason, we believe, is that one genotype category for SNP3 has a substantially higher probability and dominates the other two categories. When the percentage of missing values goes up, the dominated genotype categories have only a small chance of being well represented, and the corresponding effects may have unreliable estimates. Generally, as most microarray data have less than 10% missing values, the methodology performs well.

2.4 Results for Loblolly Pine

In a previous loblolly pine project, SNP discovery was done for about 50 genes

which are involved in disease resistance and water-deficit response. SNPs for these genes


were genotyped by microarray, and they are scattered over the genome. Also, as loblolly pine is a tree species with rapid linkage disequilibrium decay, the genotypes are not closely linked. This is a totally different situation from human association genetics, where linkage disequilibrium information is heavily relied upon by haplotype clustering modeling. The goal of the research presented in this dissertation is to detect the significant SNPs which have a strong influence on the quantitative traits, from a large number of potential SNPs, using valid statistical procedures and without assuming the SNP markers are clustered.

For this project, we are specifically interested in detecting the relationship between

lesion length and the genotyped SNPs, as lesion length is one of the most important

quantitative traits of pitch canker disease. We also have phenotypic data of carbon isotope discrimination values from loblolly pines grown in Paltaka, Florida and Cuthbert, Georgia. The carbon isotope trait is related to water use efficiency, and thus plays an important role in the growth of loblolly pine and further has substantial economic value. Genetically speaking, the loblolly pines in Paltaka, Florida are replications of the loblolly pines in Cuthbert, Georgia, except that they have different environments, which might lead to different genotype-environment interactions. As for lesion length, with the replication of genetic information, it is a different phenotypic trait from the carbon isotope discrimination trait. So we have three phenotypic trait data sets: carbon isotope from Paltaka, carbon isotope from Cuthbert, and lesion length. The three share one set of genetic SNP data.

For the design of the experiment, there are 61 loblolly pine families from a circular

design with some off-diagonal crossings. For example, family 00 is generated from parent

24 and parent 23, while family 01 is from parent 24 and parent 40. Family 00 and family

01 are not independent since they share one parent. Originally, this circular design had 70

families from 44 parents, although our data sets contain just part of the experimental design. More details of the experimental design can be found in Kayihan et al. (2005). There also


is a family pedigree file recording the family pedigree and a parent file which records the

parent pedigree. These family histories provide information for constructing the numerator

relationship matrix. In each family there are a certain number of clones, ranging from 10 to 18. Note that the term “clone” is borrowed from the geneticists in the loblolly pine project; it does not imply that two clones share exactly the same genetic DNA sequences. Instead, two clones from the same family are just like two siblings.

We have 46 genotyped SNPs in the loblolly population for association discovery. About

10% of the 46 SNPs have missing values and, on average, each observation has more than

one missing SNP value. If we employ conventional listwise deletion to delete the missing

data, almost all observations would be deleted. The last 2 of the 46 SNPs have substantially higher percentages of missing values than the other SNPs, 69% and 54% specifically. We are not sure of the reason for this high percentage of missing values; it might be due to microarray measurement error, experimental error, or some other underlying genetic reason. At this stage, we decided not to use the last 2 SNPs

in this analysis, and by doing that, overall, the percentage of missingness is decreased from

10.07% to 7.74%.

For the 61 families, we have family pedigree and parent pedigree information. By

using the Henderson (1976) method, we calculated the covariance numerator relationship

matrix. The details of the calculation are in Appendix B. We use a uniform noninfor-

mative prior for the family effect and a normal prior for the SNP effects which we are

interested in. As for the variance parameters, we use inverted gamma distributions and

make sure that the posterior distributions are proper according to Hobert and Casella

(1996). Each SNP is parameterized with additive and dominant effects. For example, if a SNP is composed of “A” and “T” nucleotides, the additive effect is one half of the difference in SNP effect between the homozygous “AA” and the mutant homozygous “TT”. If we use $(\gamma_a, \gamma_d)^T$ to parameterize the SNP effect, SNP “AA” has effect $(1, 0)(\gamma_a, \gamma_d)^T$ and SNP “TT” has effect $(-1, 0)(\gamma_a, \gamma_d)^T$. The dominant effect is the difference between the heterozygous effect and the average of the homozygous effects; we use $(0, 1)(\gamma_a, \gamma_d)^T$ to parameterize the dominant effect for SNP “AT”. We updated the missing SNPs by columns as detailed in Section 2.2.3, so the computation is fast (about 3 hours for 30000 iterations). We use 40000 iterations as initial burn-in and record the samples of a further 2000 iterations for the 3 data analyses.

Figure 2-1. The first trace plot is for the first 2 family effect parameters for the lesion length data. The second plot is for one of the SNP parameters for the carbon isotope data from Paltaka, Florida. The samples are taken after an initial 40000 steps of burn-in.
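A minimal sketch of this additive/dominant coding (the genotype labels are illustrative; each SNP contributes the row (z_a, z_d) that multiplies (γ_a, γ_d)^T):

def snp_design(genotype, homozygous="AA", heterozygous="AT", mutant="TT"):
    # Additive/dominant coding described above.
    if genotype == homozygous:
        return (1, 0)    # additive effect
    if genotype == heterozygous:
        return (0, 1)    # dominant effect
    if genotype == mutant:
        return (-1, 0)   # negative additive effect
    raise ValueError("unknown genotype")

print(snp_design("AT"))  # (0, 1)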

Two trace plots are shown in Figure 2-1. The first is for one of the family effect parameters for the carbon isotope discrimination data from Paltaka, Florida. The second is the trace plot of the variance parameter for the carbon isotope discrimination data from Cuthbert, Georgia. Both plots suggest that the Gibbs sampling simulation converges to the stationary distribution after burn-in.

Figures 2-2, 2-3, and 2-4 show the confidence intervals for the 44 SNP effects for the different data sets. We found that the 22nd SNP for the lesion length data, the 16th SNP for the Paltaka data, and the 8th, 35th, and 36th SNPs for the Cuthbert data are significant with 95% confidence intervals. We constructed these confidence intervals in Figures 2-2, 2-3, and 2-4 based on the samples from the Gibbs sampling. We did not employ a standard multiple-test correction procedure, as there are not many significant SNPs, and our goal is to find some statistically significant SNPs and let the biologists further explore the biological pathway. These confidence interval plots point out the specific SNPs to investigate further, and we could have more SNPs to follow up if we loosened the confidence limits a little bit. In Figures 2-2, 2-3, and 2-4, the x axis gives the index of the SNPs: as each SNP has two parameters, with 44 SNPs the index ranges from 1 to 88. The y axis gives the value of the SNP effect. For each parameter of each SNP, there is a small red line at the top of a blue line and a small red line at the bottom of the blue line; the red line at the top represents the upper bound of the 95% confidence interval, and the red line at the bottom represents the lower bound. The small green line in the middle represents the mean of the estimated parameter. Each SNP has two parameters, additive and dominant, and correspondingly has two such lines in all the plots.

2.5 Quantifying the Covariance and Variance

In order to impute the missing SNPs more accurately, we want to incorporate the

family structure into the model as we mentioned before. For our loblolly pine project,

there are two data sets containing the family information: family pedigree and parent

pedigree. The family pedigree has 70 rows, which means there are 70 families in the data,

and 3 columns. The first column is the family ID, the second column denotes the female

parent ID and the third column denotes the male parent ID. The parent ID varies from 1

to 44 and we do not distinguish between female and male parent. In the data file of parent

pedigrees there are 44 rows which correspond to the 44 parents in the family pedigrees and

3 columns. The first column is the parent ID and the second and the third columns are

the IDs of the grandparents, which are the parents of the parents in the family pedigree. One thing to notice: for parents whose IDs are less than 35, the grandparents' information is assumed to be independent of any other grandparents, without further specification;


Figure 2-2. 95% confidence intervals for the additive and dominant SNP effects of the 44 SNPs for lesion length. The 22nd SNP has a significant dominant effect with 95% confidence, while the other SNPs are not significant. Other SNPs, such as the 2nd and the 23rd SNPs, are approximately significant with 95% confidence. These are good candidates for further biological exploration.

on the other hand if the parent’s ID is bigger or equal to 35 we know the grandparents’

ID. So 35 is an important number and we need to pay attention to it when trying to fill

in the covariance matrix of families. Another thing is for the parent ID 38, originally its

grandparents were recorded as 13 and 0, out of convenience for programming, I changed

them into 13 and −1 without changing the nature of the covariance. According to the

design of the experiment, the female and male parent are not the same for all the families.

When calculating the covariance between individuals from different families, one way

is to divide the combinations of two families into 11 big categories and further partition

inside the categories. In the following I will explain each of the 11 categories.

Category 1: All Parents’ IDs Less Than 35, None Equal

Suppose I use “a” and “b” to denote the parent IDs of the first family, and “c” and “d” to denote the parent IDs of the second family. In this category we must have a < 35, b < 35, c < 35, d < 35, with a, b, c, d all distinct; no parents' IDs are equal. This actually


Figure 2-3. 95% confidence intervals for the additive and dominant SNP effects of the 44 SNPs for the carbon isotope data from Paltaka, Florida. The 16th SNP has a significant dominant effect with 95% confidence, while the other SNPs are not significant. Other SNPs, such as the 6th and the 40th SNPs, are approximately significant with 95% confidence. These are good candidates for further biological exploration.

is the simplest case, and when it holds the two families have zero covariance. In our data set, we need to compute a total of 2415 covariance cells (2415 = 1 + 2 + · · · + 69), and it turns out that 507 cases belong to this category.

Category 2: All IDs Less Than 35, One Equality

In this category, we still have a < 35, b < 35, c < 35, and d < 35; further, one of the following equalities must hold: a = c, a = d, b = c, or b = d. When this is satisfied, the two families share one parent and they are half siblings. The program results show there are 88 cases in our data set.
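A minimal sketch of the Category 1/2 bookkeeping (family_parents is a hypothetical dictionary from family ID to its two parent IDs, and the IDs shown are illustrative, all below 35):

def shared_parents(fam1, fam2, family_parents):
    a, b = family_parents[fam1]
    c, d = family_parents[fam2]
    assert max(a, b, c, d) < 35   # all four parents lack recorded pedigrees
    return len({a, b} & {c, d})   # 0: zero covariance; 1: half siblings

family_parents = {10: (24, 23), 11: (24, 31)}   # illustrative IDs
print(shared_parents(10, 11, family_parents))   # 1 -> half siblings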

Category 3: Three IDs Less Than 35, One Equality

When a parent's ID is ≥ 35, we can trace its grandparents' IDs. For example, if a ≥ 35, we can trace its grandparents and use (a1, a2) to denote them. In this category, it


Figure 2-4. 95% confidence intervals for the additive and dominant SNP effects of the 44 SNPs for the carbon isotope data from Cuthbert, Georgia. The 8th, 35th, and 36th SNPs have significant dominant effects with 95% confidence, while the other SNPs are not significant. Other SNPs, such as the 13th and the 28th SNPs, are approximately significant with 95% confidence. These are good candidates for further biological exploration.

can be partitioned into 8 subcategories, as follows:

Family 1: (a1, a2), b    Family 2: b, d
Family 1: (a1, a2), b    Family 2: c, b
Family 1: a, (b1, b2)    Family 2: a, d
Family 1: a, (b1, b2)    Family 2: c, a
Family 1: a, d           Family 2: (c1, c2), d
Family 1: d, b           Family 2: (c1, c2), d
Family 1: a, c           Family 2: c, (d1, d2)
Family 1: c, b           Family 2: c, (d1, d2)


For simplicity, I will just use

(a1, a2), b
b, d

to denote

Family 1: (a1, a2), b
Family 2: b, d.

By default, in all later illustrations, the first row represents the first family and the second row represents the second family.

Now let us take a closer look at

(a1, a2), b
b, d.

In this situation, we know that the two families share one parent, b, but we need to be a little careful, since some of the following equalities might hold, and if so we want to know how many hold:

a1 = b,  a2 = b,  a1 = d,  a2 = d.

We found that out of the 84 cases of this category in our data set, 83 cases do not satisfy any one of the above equalities; that is, all 83 pairs of families are just half siblings. The last case in this category, the covariance between family 38 and family 70, falls into the following situation:

a, d
(a, c2), d.

Category 4: Three Parents Less Than 35, None Equal


We can further partition this category into four subcategories, denoted as follows:

(a1, a2), b
c, d

a, (b1, b2)
c, d

a, b
(c1, c2), d

a, b
c, (d1, d2)

Take the case of

(a1, a2), b
c, d

as an example. We know the parents are distinct from each other, but we still need to consider whether some of the following equalities hold:

a1 = c,  a2 = c,  a1 = d,  a2 = d.

Note we do not need to consider whether a1 = b or a2 = b, since right now we are just considering the covariance between the 2 families. The results show that for all 896 cases of this category, none of the above equalities is satisfied, so the covariance for these 896 cases is 0.

Category 5: Two Parents Less Than 35 and From One Family, No Equalities

In this category, the first thing to notice is that when the two parents with IDs greater than or equal to 35 compose one family, and the other two with IDs less than 35 compose the other family, the number of equal parent IDs is zero. This category can be further partitioned into 2 subcategories:

(a1, a2), (b1, b2)
c, d

a, b
(c1, c2), (d1, d2)


Take the subcategory

(a1, a2), (b1, b2)
c, d

as an example; we need to verify whether some of the following equalities hold:

a1 = c,  a2 = c,  b1 = c,  b2 = c,
a1 = d,  a2 = d,  b1 = d,  b2 = d.

Note we do not need to worry about whether any of the equalities a1 = b1, a1 = b2, a2 = b1, or a2 = b2 hold, since that part matters for the variance calculation for individuals within the first family; we are considering the covariance between individuals from the first family and the second family, so they do not need to be considered right now. We found that of the 245 cases of this category, 223 cases have no equality satisfied and are independent. In the remaining 22 cases, exactly one equality holds per case, which means that in these 22 cases one tree is the parent of one family and a grandparent of the other family.

Category 6: Each Family Has One Parent Less Than 35, One Equality

Similar to the above, we first partition this category into 8 subcategories, illustrated as follows:

(a1, a2), b
(a1, a2), d

(a1, a2), b
c, (a1, a2)

(a1, a2), b
(c1, c2), b

(a1, a2), b
b, (d1, d2)

a, (b1, b2)
a, (d1, d2)

a, (b1, b2)
(b1, b2), d

a, (b1, b2)
(c1, c2), a

a, (b1, b2)
c, (b1, b2)

These 8 subcategories can be represented by 2 situations. The first one is
\[
\begin{array}{c} (a_1, a_2),\; b \\ (a_1, a_2),\; d \end{array},
\]
for which we need to verify the following equations:
\[
a_1 = b, \quad a_1 = d, \quad a_2 = b, \quad a_2 = d.
\]
The second representative situation is
\[
\begin{array}{c} (a_1, a_2),\; b \\ (c_1, c_2),\; b \end{array},
\]
for which we need to verify the following 2 groups of equations.

Group 1:
\[
a_1 = b, \quad a_2 = b, \quad c_1 = b, \quad c_2 = b.
\]
Group 2:
\[
a_1 = c_1, \quad a_1 = c_2, \quad a_2 = c_1, \quad a_2 = c_2.
\]
Note that the two groups of equations have slightly different meanings, and it is reasonable to believe that the covariance is bigger when one equation holds in the first group than when one holds in the second group. The results show that there are 49 cases in this category; all of them except one satisfy none of the above equations and are just half siblings. For the case of family 1 and family 16, one equation from group 2 holds, so the two families not only share one parent but also share a grandparent.

Category 7: Each Family Has One Parent Less Than 35, No Equalities

In this category there are four subcategories, using the same notation as above:
\[
\begin{array}{c} (a_1, a_2),\; b \\ (c_1, c_2),\; d \end{array}, \quad
\begin{array}{c} (a_1, a_2),\; b \\ c,\; (d_1, d_2) \end{array}, \quad
\begin{array}{c} a,\; (b_1, b_2) \\ (c_1, c_2),\; d \end{array}, \quad
\begin{array}{c} a,\; (b_1, b_2) \\ c,\; (d_1, d_2) \end{array}.
\]
We need to verify some equations for each of the four subcategories; we take the first subcategory
\[
\begin{array}{c} (a_1, a_2),\; b \\ (c_1, c_2),\; d \end{array}
\]
as an example and check whether the following equations from the two groups hold:

Group 1:
\[
c_1 = b, \quad c_2 = b, \quad a_1 = d, \quad a_2 = d.
\]
Group 2:
\[
a_1 = c_1, \quad a_1 = c_2, \quad a_2 = c_1, \quad a_2 = c_2.
\]

Note that an equation from group 1 is not the same as one from group 2, since they have different weights in terms of covariance. The results show there are 329 cases in this category: 5 of them satisfy one equation from group 1, 24 of them satisfy one equation from group 2, and 3 of them satisfy one equation from each group. The remaining cases do not satisfy any equation from either group, and their covariance is 0.

Category 8: One Parent Less Than 35, One Equality

Same as above, there are 8 subcategories here:

(a1, a2) , (b1, b2)

(a1, a2) , d

,

(a1, a2) , (b1, b2)

(b1, b2) , d

,

(a1, a2) , (b1, b2)

c, (a1, a2)

,

(a1, a2) , (b1, b2)

c, (b1, b2)

,

(a1, a2) , b

(a1, a2) , (d1, d2)

,

(a1, a2) , b

(c1, c2) , (a1, a2)

,

a, (b1, b2)

(b1, b2) , (d1, d2)

,

a, (b1, b2)

(c1, c2) , (b1, b2)

.

We will just take one subcategory to show the equations that need to be verified. For example, for the subcategory
\[
\begin{array}{c} (a_1, a_2),\; (b_1, b_2) \\ (a_1, a_2),\; d \end{array},
\]
the following equations need to be checked:

Group 1:
\[
a_1 = d, \quad a_2 = d, \quad b_1 = d, \quad b_2 = d.
\]
Group 2:
\[
a_1 = b_1, \quad a_1 = b_2, \quad a_2 = b_1, \quad a_2 = b_2.
\]
As above, we note that the equations from these 2 groups have different weights when calculating the covariance.

The program results show that in our data set 27 out of 28 cases in this category have no equation holding in either group; that is, these cases are half siblings. Only for the case of family 65 and family 68 does one additional equation hold, from group 1.

Category 9: Only One Parent Less Than 35, No Equality

As above, we partition this category into 4 subcategories:
\[
\begin{array}{c} (a_1, a_2),\; (b_1, b_2) \\ (c_1, c_2),\; d \end{array}, \quad
\begin{array}{c} (a_1, a_2),\; (b_1, b_2) \\ c,\; (d_1, d_2) \end{array}, \quad
\begin{array}{c} (a_1, a_2),\; b \\ (c_1, c_2),\; (d_1, d_2) \end{array}, \quad
\begin{array}{c} a,\; (b_1, b_2) \\ (c_1, c_2),\; (d_1, d_2) \end{array}.
\]
For these 4 subcategories we need to verify some equations. Take the subcategory
\[
\begin{array}{c} (a_1, a_2),\; (b_1, b_2) \\ (c_1, c_2),\; d \end{array}
\]
as an example. The following 2 groups need to be checked, where the equations in different groups have different weights for covariance:

Group 1:
\[
a_1 = d, \quad a_2 = d, \quad b_1 = d, \quad b_2 = d.
\]
Group 2:
\[
a_1 = c_1, \quad a_1 = c_2, \quad a_2 = c_1, \quad a_2 = c_2, \quad
b_1 = c_1, \quad b_1 = c_2, \quad b_2 = c_1, \quad b_2 = c_2.
\]
In our data set, 168 cases belong to this category. Of these, 126 cases have no equation satisfied in either group; that is, these cases have 0 covariance. Five cases have one equation holding from group 1; that is, one parent of a family is a grandparent of the other family. Thirty-six cases have one equation holding from group 2; that is, they have one tree being a grandparent of both families. One case has one equation holding from each group.

Category 10: All Parents Greater Than or Equal to 35, One Equality

As usual, to clearly illustrate the flow of reasoning, we partition this category into 4 subcategories:
\[
\begin{array}{c} (a_1, a_2),\; (b_1, b_2) \\ (a_1, a_2),\; (d_1, d_2) \end{array}, \quad
\begin{array}{c} (a_1, a_2),\; (b_1, b_2) \\ (c_1, c_2),\; (a_1, a_2) \end{array}, \quad
\begin{array}{c} (a_1, a_2),\; (b_1, b_2) \\ (b_1, b_2),\; (d_1, d_2) \end{array}, \quad
\begin{array}{c} (a_1, a_2),\; (b_1, b_2) \\ (c_1, c_2),\; (b_1, b_2) \end{array}.
\]

Certainly we need to verify some equations. We take
\[
\begin{array}{c} (a_1, a_2),\; (b_1, b_2) \\ (a_1, a_2),\; (d_1, d_2) \end{array}
\]
as an example and check the following equations:
\[
\begin{array}{ll}
a_1 = d_1, & a_1 = d_2, \\
a_2 = d_1, & a_2 = d_2, \\
a_1 = b_1, & a_1 = b_2, \\
a_2 = b_1, & a_2 = b_2, \\
b_1 = d_1, & b_1 = d_2, \\
b_2 = d_1, & b_2 = d_2.
\end{array}
\]
In our data set, 9 cases belong to this category; 7 of them have no equation holding and are just half siblings. The remaining 2 cases not only share one parent but also share one grandparent.

Category 11: All Parents Greater Than or Equal to 35, No Equalities

In this last category we cannot partition further, and we can trace down all the grandparents' IDs. The following is an illustration:
\[
\begin{array}{c} (a_1, a_2),\; (b_1, b_2) \\ (c_1, c_2),\; (d_1, d_2) \end{array}.
\]

For this, we still need to verify the following equations:
\[
\begin{array}{ll}
a_1 = c_1, & a_1 = c_2, \\
a_2 = c_1, & a_2 = c_2, \\
a_1 = d_1, & a_1 = d_2, \\
a_2 = d_1, & a_2 = d_2, \\
b_1 = c_1, & b_1 = c_2, \\
b_2 = c_1, & b_2 = c_2, \\
b_1 = d_1, & b_1 = d_2, \\
b_2 = d_1, & b_2 = d_2.
\end{array}
\]
The data set has 12 cases in this category; 4 of them have no equation holding, that is, they have 0 covariance. The other 8 cases each have 1 equation holding, which means the two families share one grandparent.

Variance

Now I will explain how the variance for the 70 families is calculated. We know the parents in each family are distinct from each other, but we need to check whether the parents are still related to some degree. The following are the possible situations for the parent IDs:
\[
(a,\; b), \qquad ((a_1, a_2),\; b), \qquad (a,\; (b_1, b_2)), \qquad ((a_1, a_2),\; (b_1, b_2)).
\]
The first bracket means both parents' IDs are less than 35, and thus they are automatically independent. For the second bracket, the first parent's ID is greater than or equal to 35, and thus we are able to trace down its grandparents' IDs; with that, we can check whether the other parent is the same as one of the grandparents. We need to check the same relationship for the third bracket. For the last bracket, both parents' IDs are greater than or equal to 35, so we know the grandparents' IDs for both of them; with this information, we check whether they share any grandparents. The program results show that in all 70 families the parents are independent and do not share any grandparent.

Summary of Variance Covariance Calculation

In the above, for our loblolly data, we partitioned all cases into 11 categories and further detailed the relationships inside each category. As a check, for our loblolly pine data, the sum of cases in the 11 categories is
\[
507 + 88 + 84 + 896 + 245 + 49 + 329 + 28 + 168 + 9 + 12 = 2415,
\]
which equals the number of covariances we need to calculate, $\binom{70}{2} = 2415$ for the 70 families.
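As a quick check of this arithmetic, a minimal sketch in Python (the counts are exactly those reported above):

```python
from math import comb

# Case counts reported above for categories 1 through 11.
counts = [507, 88, 84, 896, 245, 49, 329, 28, 168, 9, 12]

# 70 families yield C(70, 2) = 2415 pairwise covariances.
assert sum(counts) == comb(70, 2) == 2415
```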

In the above, we found that in every family the two parents are unrelated, so the standard procedure can be used to calculate the variance for each family. In terms of covariance: in category 1, all 507 cases have zero covariance. In category 2, all 88 cases are half siblings. Eighty-three out of 84 cases in category 3 are half siblings, and the covariance between family 38 and family 70 has the representation
\[
\begin{array}{c} a,\; d \\ (a, c_2),\; d \end{array}.
\]
All 896 cases in category 4 have 0 covariance. In category 5, 223 out of 245 cases have 0 covariance, and the remaining 22 cases have the representation
\[
\begin{array}{c} (a_1, a_2),\; (b_1, b_2) \\ a_1,\; d \end{array}.
\]
In category 6, 48 out of 49 cases are half siblings. The covariance between family 1 and family 16 has the representation
\[
\begin{array}{c} a,\; (b_1, b_2) \\ a,\; (d_1, b_2) \end{array}.
\]


In category 7 there are a total of 329 cases, and 297 cases have 0 covariance. Five of them have the representation
\[
\begin{array}{c} (d, a_2),\; b \\ (c_1, c_2),\; d \end{array},
\]
while 24 cases have the representation
\[
\begin{array}{c} (a_1, a_2),\; b \\ (a_1, c_2),\; d \end{array}.
\]
The remaining 3 cases have the representation
\[
\begin{array}{c} (a_1, a_2),\; b \\ (a_2, b),\; d \end{array}.
\]
In category 8, 27 out of 28 cases are half siblings; only the covariance of family 65 and family 68 is a little stronger, with the representation
\[
\begin{array}{c} (a_1, a_2),\; (b_1, b_2) \\ (a_1, a_2),\; a_1 \end{array}.
\]
In category 9, 126 out of 168 cases have 0 covariance. In another 5 cases, one tree is the parent of one family and also a grandparent of the other family; the representation is
\[
\begin{array}{c} (a_1, a_2),\; (b_1, b_2) \\ (c_1, c_2),\; a_1 \end{array}.
\]
The other 36 cases have the representation
\[
\begin{array}{c} (a_1, a_2),\; (b_1, b_2) \\ (a_1, c_2),\; d \end{array}.
\]


The last case in this category has the representation
\[
\begin{array}{c} (a_1, a_2),\; (b_1, b_2) \\ (a_1, c_2),\; b_1 \end{array}.
\]
Nine cases belong to category 10. Seven out of nine are half siblings, and the remaining two have the representation
\[
\begin{array}{c} (a_1, a_2),\; (b_1, b_2) \\ (b_1, c_2),\; (a_1, a_2) \end{array}.
\]
In the last category, 4 cases have 0 covariance, and the remaining 8 cases share one grandparent between the two families, with the representation
\[
\begin{array}{c} (a_1, a_2),\; (b_1, b_2) \\ (c_1, c_2),\; (a_1, d_2) \end{array}.
\]
One thing needs to be mentioned: the method discussed above for categorizing the variances and covariances of individuals within a pedigree can easily be adapted to other pedigree files, and we have the programming code. As for the values assigned to each category, we received the help of a geneticist after we quantified the categories.
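The categorization itself reduces to counting ID equalities between two families once each parent with ID greater than or equal to 35 is expanded to its recorded grandparent pair. A minimal sketch of that check in Python (the data structures and names here are hypothetical, not the dissertation's actual code):

```python
# Hypothetical pedigree lookup: a parent's own parents are recorded only for IDs >= 35.
grandparents = {40: (1, 2)}   # illustrative entry, not the real pedigree

def ancestors(parent_id):
    """IDs a parent can match on: its recorded parent pair if traceable, else itself."""
    return set(grandparents[parent_id]) if parent_id >= 35 else {parent_id}

def relatedness(fam1, fam2):
    """Given two families as (parent, parent) tuples, return the number of shared
    parents and the number of matches among the expanded ancestor sets."""
    shared_parents = len(set(fam1) & set(fam2))
    cross_matches = sum(len(ancestors(p) & ancestors(q))
                        for p in fam1 for q in fam2)
    return shared_parents, cross_matches

# Category 3 example: families ((a1, a2), b) and (b, d) with tree 40 = (a1, a2),
# b = 3, d = 4: one shared parent, and the only ancestor match is that same parent.
print(relatedness((40, 3), (3, 4)))   # -> (1, 1)
```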

2.6 Discussion

The Gibbs sampler simulates samples of $Z^m_i$ according to the distribution of each missing element, conditional on the remaining vector parameters and the other imputed SNPs. For a particular element $Z^m_{ij}$, the conditional distribution given the rest of the vector $Z^m_{i(-j)}$ is
\[
P(Z^m_{ij}=c \mid Z^m_{i(-j)})
= \frac{\exp\!\left(-\frac{1}{2\sigma^2}\left(Y_i - X_i\beta - Z^o_i\gamma^o_i - Z^m_{i(-j)}\gamma^m_{i(-j)} - c\,\gamma^m_{ij}\right)^2\right)}
{\sum_{\ell=1}^{3}\exp\!\left(-\frac{1}{2\sigma^2}\left(Y_i - X_i\beta - Z^o_i\gamma^o_i - Z^m_{i(-j)}\gamma^m_{i(-j)} - c_\ell\,\gamma^m_{ij}\right)^2\right)},
\tag{2-17}
\]
where there are only 3 terms in the denominator sum, which makes the Gibbs sampler computationally feasible.
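For concreteness, a minimal sketch of this imputation step in Python (variable names are illustrative; for brevity the three genotype categories are represented here by a single numeric code per candidate, whereas the chapter's actual coding uses an additive and a dominant component):

```python
import numpy as np

def gibbs_impute_snp(y_i, mean_wo_j, gamma_j, sigma2, codes=(-1.0, 0.0, 1.0)):
    """One Gibbs update for a single missing SNP Z^m_{ij}, following Equation 2-17:
    score each candidate genotype code by the normal likelihood of the residual
    and draw the new value with the normalized probabilities."""
    # mean_wo_j = X_i beta + Z^o_i gamma^o_i + Z^m_{i(-j)} gamma^m_{i(-j)}
    resid = y_i - mean_wo_j - np.asarray(codes) * gamma_j
    logw = -0.5 * resid ** 2 / sigma2
    w = np.exp(logw - logw.max())          # subtract max for numerical stability
    return np.random.choice(codes, p=w / w.sum())
```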


In the area of human genetic association, there are some articles using the EM algorithm to impute missing SNPs, but these methods typically rely on linkage disequilibrium and assume that either haplotypes or alleles are clustered. For our situation, without the assumption of clustering or dependence, an EM algorithm needs to calculate all possible combinations of the missing SNPs, and the number of combinations can easily grow beyond the realm of computational feasibility. Using a Monte Carlo EM approach would lead to the Gibbs sampler of 2-8, but rather than the one random variable generation per iteration that is used here, Monte Carlo EM would need thousands of Z generations per iteration, again precluding computational feasibility.

As for the prior distribution for the SNP genotype categories, one alternative is to use available genome information or the observed genotype frequencies as prior information. For our situation, we do not have a sequenced genome library for the loblolly pine, and we do not want to use the observed SNP information as the prior, so we use a uniform prior for the missing SNPs.

In this chapter the responses are assumed to be continuous, but the method can be adapted to discrete cases. For example, in a case-control study the response would be the case or control status. By employing a probit model, we add a truncated latent variable to the Gibbs sampling cycle, and the latent variable acts as the response did in our previous data set examples. This method can also be modified to handle a response with multiple categories.

Assume that the response y is a vector whose elements are either 0 or 1 in the case-control study. We employ a continuous latent variable l and assume that l has a multivariate normal distribution. The full model specification is then as follows:


\[
\begin{aligned}
Y &= I(l > 0) \\
l &\sim N\!\left(X\beta + Z\gamma,\; \sigma^2 I\right) \\
\beta &\sim \pi(\beta) \\
\gamma &\sim N\!\left(0,\; \sigma^2\phi^2 I\right) \\
\sigma^2 &\sim \pi(\sigma^2) \\
\phi^2 &\sim \pi(\phi^2).
\end{aligned}
\tag{2-18}
\]
Then we can write the joint likelihood as
\[
L(l, \beta, \gamma, \sigma^2, \phi^2 \mid y) \propto
\prod_{i=1}^{n}\Big[ I(l_i > 0)\,I(y_i = 1) + I(l_i \le 0)\,I(y_i = 0) \Big]
\exp\!\left(-\frac{|l - X\beta - Z\gamma|^2}{2\sigma^2}\right)
\exp\!\left(-\frac{|\gamma|^2}{2\sigma^2\phi^2}\right)
\pi(\beta)\,\pi(\sigma^2)\,\pi(\phi^2).
\tag{2-19}
\]
With the above joint likelihood, we can write the conditional distributions as before, except that now we also have the conditional distribution of l given the other parameters and y. The conditional of $l_i$ is
\[
l_i \sim N(X_i\beta + Z_i\gamma,\; \sigma^2) \ \text{truncated on the left at 0, if } y_i = 1,
\]
\[
l_i \sim N(X_i\beta + Z_i\gamma,\; \sigma^2) \ \text{truncated on the right at 0, if } y_i = 0.
\]

With all the conditionals we can run a Gibbs sampler to move through the parameter space and carry out statistical inference. When y is a multi-category response, we can generate a latent variable and adapt the method similarly.
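A minimal sketch of the added latent-variable update (inverse-CDF sampling from the truncated normal; the function name and setup are ours, with scipy used for the normal CDF and quantile):

```python
import numpy as np
from scipy.stats import norm

def sample_latent(y, mean, sigma):
    """Draw l_i from N(mean_i, sigma^2) truncated to (0, inf) when y_i = 1
    and to (-inf, 0] when y_i = 0, using the inverse-CDF method."""
    cut = norm.cdf(0.0, loc=mean, scale=sigma)      # P(l_i <= 0) for each i
    u = np.random.uniform(size=len(y))
    lo = np.where(y == 1, cut, 0.0)                 # CDF-scale lower bound
    hi = np.where(y == 1, 1.0, cut)                 # CDF-scale upper bound
    return norm.ppf(lo + u * (hi - lo), loc=mean, scale=sigma)
```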

For the SNP effect, we use $[1, 0][\gamma_a, \gamma_d]^T$ to denote the effect of SNP genotype "AA", $[0, 1][\gamma_a, \gamma_d]^T$ to denote the effect of "AC", and $[-1, 0][\gamma_a, \gamma_d]^T$ to denote the effect of "CC". The parameter $\gamma_a$ is the additive SNP effect and $\gamma_d$ is the dominant SNP effect.

When the values of $\gamma_a$ and $\gamma_d$ are not close, the algorithm gives different probabilities to the three possible candidates and imputes the missing SNP accordingly. When the value of $\gamma_d$ is close to $\gamma_a$, the algorithm gives close probabilities to two of the candidate genotypes, as it cannot distinguish between "AA" and "AC" in the missing SNP; when the value of $\gamma_d$ is close to $-\gamma_a$, the algorithm cannot distinguish between "CC" and "AC". However, our interest is not focused on recovering the actual genotype; we are more interested in estimating the additive and dominant effects of each SNP. When one of the above situations occurs, we know that either "AA" and "AC" have almost equal effects on the response (full dominance), or "CC" and "AC" have equal effects.
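A minimal sketch of this design-row coding (the dictionary and function names are ours):

```python
# Design-matrix row for one SNP: the genotype effect is [z_a, z_d] @ [gamma_a, gamma_d].
SNP_CODING = {"AA": (1.0, 0.0), "AC": (0.0, 1.0), "CC": (-1.0, 0.0)}

def snp_row(genotype):
    """Map a genotype label to its (additive, dominant) design-row pair."""
    return SNP_CODING[genotype]

print(snp_row("AC"))   # -> (0.0, 1.0), so the "AC" effect is gamma_d alone
```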

For population association, much research has focused on using haplotypes. However, in plant or other agricultural genome association studies, it is often the case that the whole genome is not fully sequenced and little prior linkage disequilibrium information is available, so block-wise haplotypes or clusters of haplotypes are impossible to construct. The proposed method is widely applicable in plant and agricultural association studies without constructing haplotypes, and it has reasonable computation speed. Based on our simulations, the proposed method can adequately identify the significant SNPs.


CHAPTER 3
BAYESIAN VARIABLE SELECTION FOR GENOMIC DATA WITH MISSING COVARIATES

3.1 Introduction

In Chapter 2, we employed a Bayesian hierarchical model and discovered some SNPs according to the Bayesian credible intervals. Furthermore, it is biologically plausible that some subsets of SNPs interact with each other and are responsible for the phenotypic traits we are interested in. So in this chapter, we are interested in selecting the "good" subsets, or the "good" models, which have high posterior probabilities given the observed data. Our model choice is restricted to the class of linear mixed models whose subsets of SNP variables are chosen from the s total candidate SNP variables.

To be more general, suppose there are s variables to be considered in the data set. When s is finite and the values of the covariates are fully observed, much progress has been made; a review of this progress is given in Chapter 1. However, the situation where the candidate variables have a certain percentage of missing values is a novel research field, and not much work has been done.

For each candidate variable there are two choices: it is either included in the "good" model or not. So with s candidate variables, there are $2^s$ model choices in the model sample space. When s is moderately large, $2^s$ can be huge; typically, when s is bigger than 15, it is unrealistic to calculate and compare all $2^s$ posterior model probabilities. In this case, a feasible method is a stochastic search. In theory, a stochastic search algorithm has the chance to explore the entire sample space. The stochastic search chain is expected to stay longer at models with high posterior probabilities while occasionally visiting models with low posterior probabilities; by doing so, it avoids getting stuck in local modes and has the chance to traverse the model sample space globally.

The following is our plan for the model selection:

• Run 2 parallel chains together.

• The first chain is a Gibbs sampler for the full model. This full model includes all the parameters: the family effects, the variance parameters, the SNP effects, and the missing SNPs.

• Simultaneously, run a hybrid Metropolis-Hastings (M-H) chain to search for good subsets of variables (in our case, SNPs) according to the posterior probabilities of the models.
  a) We estimate the Bayes factors using samples from the parallel Gibbs sampler.
  b) The stochastic search chain is driven by the estimated Bayes factors.

• According to the frequency of the model visits, report the good models.

Now let us introduce the Bayes factor. For two models $M_\delta$ and $M_1$ with parameters $\theta_\delta$ and $\theta_1$, the Bayes factor is defined as
\[
BF_{\delta,\,\delta=1} = \frac{m_\delta(Y)}{m_{\delta=1}(Y)}
= \frac{\int p(\theta_\delta \mid M_\delta)\, p(Y \mid \theta_\delta, M_\delta)\, d\theta_\delta}
       {\int p(\theta_1 \mid M_1)\, p(Y \mid \theta_1, M_1)\, d\theta_1},
\tag{3-1}
\]
where $p(\theta_\delta \mid M_\delta)$ is the parameter distribution for model $M_\delta$ and $p(\theta_1 \mid M_1)$ is the parameter distribution for model $M_1$; likewise, $p(Y \mid \theta_\delta, M_\delta)$ and $p(Y \mid \theta_1, M_1)$ are the probability distributions under models $M_\delta$ and $M_1$. Here $\theta_\delta$ and $\theta_1$ are written as scalars, but they extend to parameter vectors. The constant $m_\delta(Y)$ is the marginal likelihood for model $M_\delta$ and $m_{\delta=1}(Y)$ is the marginal likelihood for model $M_1$.

In this chapter, we first discuss how to estimate the Bayes factors, since no closed form exists. We then propose a hybrid stochastic search algorithm to explore the sample space of models, and we establish the ergodic properties of the derived Markov chain. Computation speed is always a concern in variable selection, so finally we scale the procedures up to handle large data sets.


3.2 Bridge Sampling Extension

We first declare our notation. We are doing a Bayesian analysis, and all the models involve prior specifications for their parameters. The full model is:
\[
\begin{aligned}
Y_{n\times 1} &\sim N\!\left(X\beta + Z\gamma,\; \sigma^2 I\right) \\
\beta_{p\times 1} &\sim \text{noninformative uniform on } (-\infty, \infty) \\
\gamma_{s\times 1} &\sim N\!\left(0,\; \sigma^2\phi^2 I\right) \\
\sigma^2 &\sim IG(a, b) \\
\phi^2 &\sim IG(c, d) \\
Z_{mis} &\sim \text{uniform distribution},
\end{aligned}
\tag{3-2}
\]
where $IG(a, b)$ represents the inverted gamma distribution with the parameterization
\[
f(\sigma^2) = \frac{b^a}{\Gamma(a)} \frac{\exp\!\left(-\frac{b}{\sigma^2}\right)}{(\sigma^2)^{a+1}},
\]
and a, b, c, d are pre-specified constants which ensure that the posterior distributions are proper.
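As a sanity check on this specification, one can forward-simulate a data set from Model 3-2. A minimal sketch with illustrative dimensions and hyperparameters follows; since the prior on $\beta$ is improper, the sketch draws $\beta$ from a vague proper surrogate:

```python
import numpy as np

rng = np.random.default_rng(1)
n, p, s = 100, 5, 20                        # illustrative sizes
a, b, c, d = 2.0, 1.0, 2.0, 1.0             # illustrative IG hyperparameters

X = np.zeros((n, p)); X[np.arange(n), rng.integers(p, size=n)] = 1.0   # incidence matrix
Z = rng.choice([-1.0, 0.0, 1.0], size=(n, s))                          # coded SNPs

sigma2 = 1.0 / rng.gamma(a, 1.0 / b)        # sigma^2 ~ IG(a, b)
phi2 = 1.0 / rng.gamma(c, 1.0 / d)          # phi^2 ~ IG(c, d)
beta = rng.normal(0.0, 10.0, size=p)        # vague proper stand-in for the flat prior
gamma = rng.normal(0.0, np.sqrt(sigma2 * phi2), size=s)
Y = rng.normal(X @ beta + Z @ gamma, np.sqrt(sigma2))
```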

If we use $M_1, \ldots, M_{2^s}$ to represent all the models in the model sample space, the goal of this project is to select good models from these candidates. Each model in the sample space has a corresponding indicator vector $\delta$, and each model is defined by a family of distributions with parameters $\beta, \delta, \gamma_\delta, \sigma^2, \phi^2$. Any two different models $M_j$ and $M_{j'}$ share the same parameterizations of $\beta, \sigma^2, \phi^2$ but have different indicator vectors $\delta$, and thus different $\gamma_\delta$.

Several things should be noted:

• The matrix $Z_{n\times s}$ is the design matrix of the SNPs in the full model, and it contains missing values, as about 10% of the SNPs are missing in the loblolly data. Corresponding to the $\delta$ of each model, the design matrix of SNPs $Z_\delta$ changes as well.

• The parameter $\gamma_\delta$ parameterizes the SNP effects in each model, and we are trying to select subsets of SNPs which have significant effects on the response. In the full model, $\gamma$ implicitly has the corresponding vector $\delta = (1, 1, \ldots, 1)^T$. Since the $\delta$ vector is all ones for the full model, we suppress it and hope the notation is still clear.

• The vector $\beta$ parameterizes the family effects in the data set. For all candidate models, the family effect parameter $\beta$ is always included. X is an incidence matrix which records which family each observation belongs to.

• There are $2^s$ candidate models in the model sample space, and the sample space is finite.

As we are doing Bayesian variable selection, we use the Bayes factor as the criterion for selecting models. The difficult part of this project is that we are selecting subsets with different dimensions, and missing values exist in the SNP covariates. To overcome this, we propose a Bayes factor approximation formula which accommodates both the different dimensions and the missing values. We plan to employ a stochastic search algorithm to find good subsets according to the Bayes factors. For each candidate subset, we calculate the Bayes factor of the candidate subset versus the full model; here the full model acts as the reference model. Depending on the values of the current and previous Bayes factors, we decide whether to take the candidate subset as the current subset. Alternatively, we could use the simplest model, which includes no variables, as the reference model when calculating the Bayes factors. We will show that using the simplest model gains us nothing in terms of computation; more discussion of the simplest model as a reference is given later.

Before going to the calculation of Bayes factor, we will introduce bridge sampling

proposed by Meng and Wong (1996).


Let $p_i$, $i = 1, 2$, be two probability densities and $q_i$, $i = 1, 2$, the corresponding unnormalized densities with normalizing constants $c_i$, $i = 1, 2$, so that
\[
p_i(\omega) = \frac{q_i(\omega)}{c_i}, \qquad \omega \in \Omega_i \subset \mathbb{R}^d,
\]
where $\Omega_i$ is the support of $p_i(\omega)$.

Under some general conditions, for any $\alpha(\omega)$, Meng and Wong utilized the equation
\[
\frac{\int_{\Omega_2} q_1(\omega)\,\alpha(\omega)\,p_2(\omega)\, d\omega}
     {\int_{\Omega_1} q_2(\omega)\,\alpha(\omega)\,p_1(\omega)\, d\omega}
= \frac{c_1}{c_2} \times
\frac{\int_{\Omega_1\cap\Omega_2} \alpha(\omega)\,p_1(\omega)\,p_2(\omega)\, d\omega}
     {\int_{\Omega_1\cap\Omega_2} \alpha(\omega)\,p_1(\omega)\,p_2(\omega)\, d\omega},
\]
which yields their key identity
\[
\frac{c_1}{c_2} = \frac{E_2\left[q_1(\omega)\,\alpha(\omega)\right]}{E_1\left[q_2(\omega)\,\alpha(\omega)\right]},
\]
where $E_1$ and $E_2$ are expectations taken with respect to the probability densities $p_1$ and $p_2$. To apply this to Bayes factors with the definition in Equation 3-1, we have
\[
BF_{\delta,\,\delta=1} = \frac{m_\delta(Y)}{m_{\delta=1}(Y)}.
\]

We know that
\[
p_\delta(\theta \mid M_\delta, Y) = \frac{p(Y \mid \theta, M_\delta)\, p(\theta \mid M_\delta)}{m_\delta(Y)}
\]
and
\[
p_{(\delta=1)}(\theta \mid M_{(\delta=1)}, Y) = \frac{p(Y \mid \theta, M_{(\delta=1)})\, p(\theta \mid M_{(\delta=1)})}{m_{\delta=1}(Y)}.
\]
Using Meng and Wong's terminology, $c_1 = m_\delta(Y)$, $c_2 = m_{\delta=1}(Y)$,
\[
q_1 = p(Y \mid \theta, M_\delta)\, p(\theta \mid M_\delta), \qquad
q_2 = p(Y \mid \theta, M_{(\delta=1)})\, p(\theta \mid M_{(\delta=1)}),
\]
and
\[
p_1 = p_\delta(\theta \mid M_\delta, Y), \qquad
p_2 = p_{(\delta=1)}(\theta \mid M_{(\delta=1)}, Y).
\]
Taking
\[
\alpha = \frac{1}{q_2(\theta)},
\]
we have
\[
BF_{\delta,\,\delta=1} = \frac{m_\delta(Y)}{m_{\delta=1}(Y)} = \frac{c_1}{c_2}
= \frac{\int q_1\, \alpha\, p_2\, d\theta}{\int q_2\, \alpha\, p_1\, d\theta}
= \int \frac{q_1}{q_2}\, p_2(\theta)\, d\theta.
\tag{3-3}
\]
When we have samples $\theta^{(i)}$ of the parameter $\theta$ from the distribution $p_2(\theta)$, we can estimate the Bayes factor by
\[
\frac{1}{n}\sum_i \frac{q_1(\theta^{(i)})}{q_2(\theta^{(i)})},
\]
which Meng and Wong call the bridge sampling estimate of the Bayes factor.
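The mechanics are easy to check on a toy problem where the constants are known (a minimal sketch, unrelated to the genomic model): with $q_1$ a standard normal kernel and $q_2$ an N(0, 4) kernel, $c_1/c_2 = \sqrt{2\pi}/\sqrt{8\pi} = 0.5$.

```python
import numpy as np

rng = np.random.default_rng(0)

q1 = lambda x: np.exp(-0.5 * x ** 2)       # c1 = sqrt(2*pi)
q2 = lambda x: np.exp(-x ** 2 / 8.0)       # N(0, 4) kernel, c2 = sqrt(8*pi)

x = rng.normal(0.0, 2.0, size=200_000)     # samples from p2 = q2 / c2
print(np.mean(q1(x) / q2(x)))              # (1/n) sum q1/q2, approximately 0.5
```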

For our situation, we generally consider two models with different dimensions, and we cannot directly apply this formula, as it assumes the same parameterization for the two models. We will propose a bridge-sampling type of Bayes factor estimator which accommodates models with different dimensions, and missing values as well. Next we detail the Bayes factor calculation for our situation. First we show that a special function g is needed for the proper calculation of the Bayes factor; then we show what this function can be in our situation.

3.2.1 General Formula

First we give a general theorem, which will later be extended and used to estimate the Bayes factor. In this theorem we show why a g function is needed and what condition it has to satisfy.

Suppose $\theta_a$ is the parameter for model $M_a$ and $(\theta_a, \theta_b)$ is the parameter for model $M_b$. The likelihood for model $M_a$ is $f_a(Y \mid \theta_a)$ and the likelihood for model $M_b$ is $f_b(Y \mid \theta_a, \theta_b)$. Let $\pi_a(\theta_a)$ and $\pi_b(\theta_a, \theta_b)$ denote the priors for models $M_a$ and $M_b$. Using Meng and Wong's bridge sampling method with samples $(\theta_a^{(i)}, \theta_b^{(i)})$ from the posterior distribution $\pi_b(\theta_a, \theta_b \mid Y)$, the estimator for $BF_{a,b}$ would be
\[
\frac{1}{n}\sum_i \frac{\pi_a(\theta_a^{(i)})}{\pi_b(\theta_a^{(i)}, \theta_b^{(i)})}.
\]
However, this is not a consistent estimator; a function g is needed to obtain the following consistent estimator of the Bayes factor:
\[
\frac{1}{n}\sum_i \frac{\pi_a(\theta_a^{(i)})\, g(\theta_a^{(i)}, \theta_b^{(i)})}{\pi_b(\theta_a^{(i)}, \theta_b^{(i)})}.
\tag{3-4}
\]

Theorem 3.2.1. With the above notation, a function $g(\theta_a, \theta_b)$ is needed for the following equation to hold:
\[
\int \frac{\pi_a(\theta_a)}{\pi_b(\theta_a, \theta_b)}\, g(\theta_a, \theta_b)\, \pi_b(\theta_a, \theta_b \mid Y)\, d\theta_a\, d\theta_b
= \frac{m_a(Y)}{m_b(Y)},
\tag{3-5}
\]
where $\pi_b(\theta_a, \theta_b \mid Y)$ is the posterior distribution of $(\theta_a, \theta_b)$ given Y, $m_a(Y)$ is the marginal likelihood of model $M_a$, and $m_b(Y)$ is the marginal likelihood of model $M_b$. Furthermore, the function $g(\theta_a, \theta_b)$ must satisfy the condition
\[
\int f_b(Y \mid \theta_a, \theta_b)\, g(\theta_a, \theta_b)\, d\theta_b = f_a(Y \mid \theta_a).
\tag{3-6}
\]

Now let us prove Theorem 3.2.1.

Proof. As mentioned before, with samples $(\theta_a^{(i)}, \theta_b^{(i)})$ from the posterior distribution $\pi_b(\theta_a, \theta_b \mid Y)$, we will be able to use
\[
\frac{1}{n}\sum_i \frac{\pi_a(\theta_a^{(i)})\, g(\theta_a^{(i)}, \theta_b^{(i)})}{\pi_b(\theta_a^{(i)}, \theta_b^{(i)})}
\]
to consistently estimate the Bayes factor. The following derivation shows why this estimator is consistent.


\[
\begin{aligned}
\int\!\!\int \frac{\pi_a(\theta_a)\, g(\theta_a, \theta_b)}{\pi_b(\theta_a, \theta_b)}\, \pi_b(\theta_a, \theta_b \mid Y)\, d\theta_a\, d\theta_b
&= \int\!\!\int \frac{\pi_a(\theta_a)\, g(\theta_a, \theta_b)}{\pi_b(\theta_a, \theta_b)}\,
\frac{f_b(Y \mid \theta_a, \theta_b)\, \pi_b(\theta_a, \theta_b)}{m_b(Y)}\, d\theta_a\, d\theta_b \\
&= \frac{1}{m_b(Y)} \int\!\!\int \pi_a(\theta_a)\, g(\theta_a, \theta_b)\, f_b(Y \mid \theta_a, \theta_b)\, d\theta_a\, d\theta_b.
\end{aligned}
\tag{3-7}
\]
So if Equation 3-6 holds, then
\[
\frac{1}{m_b(Y)} \int\!\!\int \pi_a(\theta_a)\, g(\theta_a, \theta_b)\, f_b(Y \mid \theta_a, \theta_b)\, d\theta_a\, d\theta_b
= \frac{\int \pi_a(\theta_a)\, f_a(Y \mid \theta_a)\, d\theta_a}{m_b(Y)}
= \frac{m_a(Y)}{m_b(Y)}.
\]

For our situation, we are considering two models with different dimensions. Model $M_\delta$ has parameters $(\beta, \gamma_\delta, \sigma^2, \phi^2)$ and an associated model indicator vector $\delta$, and the SNP design matrix for model $M_\delta$ is $Z_\delta$. Let $M_1$ be the full model containing all the SNPs. Both $M_\delta$ and $M_1$ may have a certain percentage of missing SNPs. With the observed data $Y_{n\times 1}$, the likelihood for model $M_\delta$ is $f_\delta(Y \mid \beta, \gamma_\delta, \sigma^2, Z_{\delta_{mis}})$ and the likelihood for model $M_1$ is $f_1(Y \mid \beta, \gamma, \sigma^2, Z_{mis})$, where $(\beta, \gamma_\delta, \sigma^2, \phi^2, Z_{\delta_{mis}})$ are the parameters of model $M_\delta$ and $(\beta, \gamma, \sigma^2, \phi^2, Z_{mis})$ are the parameters of model $M_1$. $Z_{\delta_{mis}}$ denotes the missing values in model $M_\delta$ and $Z_{mis}$ the missing values in model $M_1$; since $M_1$ includes all SNPs and $M_\delta$ includes only some of them, $Z_{\delta_{mis}}$ represents part of the missing values and $Z_{mis}$ all of them. The marginal likelihood of Y under model $M_\delta$ is denoted $m_\delta(Y)$, and $m_{\delta=1}(Y)$ is the marginal likelihood under model $M_1$. We use the prior $\pi_\delta(\beta, \gamma_\delta, \sigma^2, Z_{\delta_{mis}})$ for $M_\delta$ and $\pi_1(\beta, \gamma, \sigma^2, Z_{mis})$ for $M_1$. Suppose we have samples $(\beta^{(i)}, \gamma^{(i)}, \sigma^{2(i)}, \phi^{2(i)})$, $i = 1, 2, \ldots$, from $M_1$, and no closed form of the Bayes factor of model $M_\delta$ against $M_1$ is available. As in Equation 3-4, a g function is needed, and the modified bridge sampling estimator is
\[
\frac{1}{n}\sum_{i=1}^{n}
\frac{\pi_\delta\!\left(\beta^{(i)}, \sigma^{2(i)}, \gamma_\delta^{(i)}, Z_{\delta_{mis}}^{(i)}\right)\,
      g\!\left(\beta^{(i)}, \sigma^{2(i)}, \phi^{2(i)}, \gamma_\delta^{(i)}, Z_{\delta_{mis}}^{(i)}\right)}
     {\pi_1\!\left(\beta^{(i)}, \gamma^{(i)}, \sigma^{2(i)}, Z_{mis}^{(i)}\right)}
\;\longrightarrow\; \frac{m_\delta(Y)}{m_{\delta=1}(Y)} = BF_{\delta,\,\delta=1}.
\tag{3-8}
\]

Corollary 3.2.1. Suppose we have models $M_\delta$ and $M_1$ with priors defined as above, and a closed form of the Bayes factor of model $M_\delta$ versus $M_1$ is not available. To get a consistent bridge-sampling type Bayes factor estimator as in Equation 3-8, we must find a g function that satisfies
\[
\sum_{Z_{l_{mis}}} \int f_1(Y \mid \beta, \gamma, \sigma^2, \phi^2, Z_{mis})\; g(\beta, \gamma, \sigma^2, Z_{mis})\; d\gamma_l
= f_\delta(Y \mid \beta, \gamma_\delta, \sigma^2, Z_{\delta_{mis}}),
\tag{3-9}
\]
where the g function contains a factor $P(Z_{l_{mis}} = z_{l_{mis}})$. Here $\gamma$ is composed of $\gamma_\delta$ and $\gamma_l$: the vector $\gamma_\delta$ parameterizes the SNPs included in model $M_\delta$, and $\gamma_l$ parameterizes the SNPs excluded from model $M_\delta$. The matrix Z is composed of $Z_\delta$ and $Z_l$, where $Z_l$ is the design matrix of the SNPs not included in model $M_\delta$, and $Z_{l_{mis}}$ represents the missing SNPs within $Z_l$. The distribution $P(Z_{l_{mis}} = z_{l_{mis}})$ of the discrete random vector $Z_{l_{mis}}$ could be any legitimate discrete distribution; we take it to be the uniform distribution, the same as the prior for $Z_{l_{mis}}$ in model $M_1$.

The parameter $\phi^2$ is a variance parameter used to specify the prior of the SNP effect $\gamma$. It does not appear in the likelihood $f_1(Y \mid \beta, \gamma, \sigma^2, Z_{mis})$, but from the Bayesian perspective it appears in the joint likelihood of all the parameters. As we are doing a Bayesian analysis, we let the g function include the parameter $\phi^2$.


Next we prove Corollary 3.2.1. Applying Theorem 3.2.1, $\theta_a$ is $(\beta, \gamma_\delta, \sigma^2, \phi^2, Z_{\delta_{mis}})$ and $\theta_b$ is $(\gamma_l, Z_{l_{mis}})$. The likelihood for $(\theta_a, \theta_b)$ is
\[
f_b(\theta_a, \theta_b) = f_1(Y \mid \beta, \gamma, \sigma^2, Z_{mis}),
\]
where $f_1(Y \mid \beta, \gamma, \sigma^2, Z_{mis})$ is the continuous density function of Y given all the parameters, including the missing values; in our case it is the normal density. We may regard $f_1$ as constant with respect to $\phi^2$ here. Notice that $Z_{mis}$ is composed of $(Z_{l_{mis}}, Z_{\delta_{mis}})$, and both $Z_{l_{mis}}$ and $Z_{\delta_{mis}}$ are vectors of missing values. Similarly, the likelihood $f_a(\theta_a)$ is
\[
f_a = f_\delta(Y \mid \beta, \gamma_\delta, \sigma^2, Z_{\delta_{mis}}),
\]
where $f_\delta(Y \mid \beta, \gamma_\delta, \sigma^2, Z_{\delta_{mis}})$ is, in our case, the continuous normal density of Y given the parameters $(\beta, \gamma_\delta, \sigma^2, Z_{\delta_{mis}})$. As the g function contains a factor $P(Z_{l_{mis}} = z_{l_{mis}})$,
\[
\int f_b(Y \mid \theta_a, \theta_b)\, g(\theta_a, \theta_b)\, d\theta_b
= \sum_{Z_{l_{mis}}} \int f_1(Y \mid \beta, \gamma, \sigma^2, \phi^2, Z_{mis})\,
g(\beta, \gamma, \sigma^2, \phi^2, Z_{mis})\, d\gamma_l,
\]
since $\theta_b = (\gamma_l, Z_{l_{mis}})$ and $Z_{l_{mis}}$ is a vector of discrete missing values. According to Theorem 3.2.1, the following equation must hold:
\[
\sum_{Z_{l_{mis}}} \int f_1(Y \mid \beta, \gamma, \sigma^2, \phi^2, Z_{mis})\,
g(\beta, \gamma, \sigma^2, \phi^2, Z_{mis})\, d\gamma_l
= f_\delta(Y \mid \beta, \gamma_\delta, \sigma^2, \phi^2, Z_{\delta_{mis}}).
\]
So with samples $(\beta^{(i)}, \gamma^{(i)}, \sigma^{2(i)}, \phi^{2(i)}, Z_{mis}^{(i)})$, $i = 1, 2, \ldots$, from $M_1$, as in Equation 3-8,
\[
\frac{1}{n}\sum_{i=1}^{n}
\frac{\pi_\delta\!\left(\beta^{(i)}, \sigma^{2(i)}, \gamma_\delta^{(i)}, Z_{\delta_{mis}}^{(i)}\right)\,
      g\!\left(\beta^{(i)}, \sigma^{2(i)}, \phi^{2(i)}, \gamma_\delta^{(i)}, Z_{\delta_{mis}}^{(i)}\right)}
     {\pi_1\!\left(\beta^{(i)}, \gamma^{(i)}, \sigma^{2(i)}, Z_{mis}^{(i)}\right)}
\tag{3-10}
\]
is a consistent estimator.

Theorem 3.2.2. With models $M_\delta$ and $M_1$ defined as above, the following g function satisfies the condition of Equation 3-9, and consequently we have a consistent Bayes factor estimator. The g function is
\[
g(\beta, \gamma, \sigma^2, \phi^2, Z_{mis})
= (2\pi\sigma^2)^{-\frac{s_l}{2}}\, |Z_l' Z_l|^{\frac{1}{2}}
\exp\!\left(-\frac{(Y - X\beta - Z_\delta\gamma_\delta)' P_l (Y - X\beta - Z_\delta\gamma_\delta)}{2\sigma^2}\right)
\times P(Z_{l_{mis}} = z_{l_{mis}}),
\tag{3-11}
\]
where
\[
P_l = Z_l (Z_l' Z_l)^{-1} Z_l'.
\]
With that, the Bayes factor estimator is
\[
\frac{1}{n}\sum_{i=1}^{n}
\frac{(\phi^{2(i)})^{\frac{s_l}{2}}\,
\left|Z_l^{(i)\prime} Z_l^{(i)}\right|^{\frac{1}{2}}
\exp\!\left(-\frac{(Y - X\beta^{(i)} - Z_\delta^{(i)}\gamma_\delta^{(i)})' P_l^{(i)} (Y - X\beta^{(i)} - Z_\delta^{(i)}\gamma_\delta^{(i)})}{2\sigma^{2(i)}}\right)}
{\exp\!\left(-\frac{|\gamma_l^{(i)}|^2}{2\sigma^{2(i)}\phi^{2(i)}}\right)}.
\tag{3-12}
\]
Notice that $P(Z_{l_{mis}} = z_{l_{mis}})$ could be any legitimate distribution; we take it to be the uniform prior on its sample space, the same prior used for $Z_{l_{mis}}$ in $\pi_1(\beta, \gamma, \sigma^2, \phi^2, Z_{\delta_{mis}}, Z_{l_{mis}})$. During the calculation, the factor $P(Z_{l_{mis}} = z_{l_{mis}})$ in the g function therefore cancels with the prior for $Z_{l_{mis}}$ in $\pi_1(\beta, \gamma, \sigma^2, \phi^2, Z_{\delta_{mis}}, Z_{l_{mis}})$.
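A minimal sketch of Equation 3-12 computed from stored Gibbs output follows. All names are illustrative: each draw is assumed to carry $\beta$, $\gamma$, $\sigma^2$, $\phi^2$, and the imputed Z as a dict, from which $Z_\delta$, $Z_l$, and $\gamma_l$ are extracted via the 0/1 indicator vector $\delta$; terms are averaged on the log scale for numerical stability.

```python
import numpy as np

def bf_estimate(draws, delta, X, Y):
    """Bridge-sampling estimate of BF(M_delta vs full model), Equation 3-12.
    `draws` is an iterable of dicts with keys 'beta', 'gamma', 'sigma2', 'phi2',
    and 'Z' (the imputed design matrix for that draw); `delta` is a 0/1 NumPy vector."""
    keep, drop = delta == 1, delta == 0
    sl = int(drop.sum())
    terms = []
    for dr in draws:
        Zd, Zl = dr["Z"][:, keep], dr["Z"][:, drop]
        resid = Y - X @ dr["beta"] - Zd @ dr["gamma"][keep]
        # P_l (Y - X beta - Z_delta gamma_delta) without forming P_l explicitly
        Pl_resid = Zl @ np.linalg.solve(Zl.T @ Zl, Zl.T @ resid)
        log_num = (0.5 * sl * np.log(dr["phi2"])
                   + 0.5 * np.linalg.slogdet(Zl.T @ Zl)[1]
                   - resid @ Pl_resid / (2 * dr["sigma2"]))
        log_den = -np.sum(dr["gamma"][drop] ** 2) / (2 * dr["sigma2"] * dr["phi2"])
        terms.append(log_num - log_den)
    terms = np.array(terms)
    m = terms.max()
    return float(np.exp(m) * np.mean(np.exp(terms - m)))   # stabilized average
```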

Now we want to show why Equation 3-12 is a consistent estimator. We have two models, $M_\delta$ and $M_1$; the following are the details of the Bayesian specification for these two models.


For $M_\delta$:
\[
\begin{aligned}
Y_{n\times 1} &\sim N\!\left(X\beta + Z_\delta\gamma_\delta,\; \sigma^2 I\right) \\
\beta_{p\times 1} &\sim \text{noninformative uniform on } (-\infty, \infty) \\
\gamma_{\delta\,(s_\delta\times 1)} &\sim N\!\left(0,\; \sigma^2\phi^2 I\right) \\
\sigma^2 &\sim IG(a, b) \\
\phi^2 &\sim IG(c, d) \\
Z_{\delta_{mis}} &\sim \text{uniform distribution},
\end{aligned}
\tag{3-13}
\]
where $Z_\delta$ has dimension $n \times s_\delta$ with $s_\delta = \mathrm{sum}(\delta)$, and $s_l$ denotes $s - \mathrm{sum}(\delta)$. We use $\gamma_\delta$, of dimension $s_\delta \times 1$, to denote the parameters of the SNPs in model $M_\delta$, and $\gamma_l$ to denote the parameters of the SNPs excluded from model $M_\delta$. $Z_{\delta_{mis}}$ denotes the missing values in $Z_\delta$.

For $M_1$, the model specification is given in 3-2; it is almost the same as 3-13, except that the design matrix for the SNPs changes from $Z_\delta$ to Z as the model indicator vector changes from $\delta$ to $\delta = 1$, and correspondingly $\gamma_\delta$ changes to $\gamma$.

For both models, the SNP design matrices $Z_\delta$ and Z contain missing SNPs, which we do not mention explicitly for now. Z is the matrix formed by concatenating the columns of $Z_\delta$ and $Z_l$; the formula $Z = Z_\delta + Z_l$ does not hold (one obvious reason is that the number of columns of Z equals the sum of the numbers of columns of $Z_\delta$ and $Z_l$). The vectors $\gamma$, $\gamma_\delta$, and $\gamma_l$ have a similar relationship.

As before, we use $\pi_\delta(\beta, \gamma_\delta, \phi^2, \sigma^2, Z_{\delta_{mis}})$ as the prior for model $M_\delta$ and $\pi_1(\beta, \gamma, \sigma^2, \phi^2, Z_{mis})$ as the prior for model $M_1$.


Proof. Next we show how to find our g function. Applying Equation 3-9 of Corollary 3.2.1, we have
\[
\begin{aligned}
&\sum_{Z_{l_{mis}}} \int g(\beta, \gamma, \sigma^2, \phi^2, Z_{mis})\, f_1(Y \mid \beta, \gamma, \sigma^2, Z_{mis})\, d\gamma_l \\
&\quad= \sum_{Z_{l_{mis}}} \int g(\beta, \gamma, \sigma^2, \phi^2, Z_{mis})\,
\frac{1}{(2\pi\sigma^2)^{\frac{n}{2}}}
\exp\!\left(-\frac{|Y - X\beta - Z_\delta\gamma_\delta - Z_l\gamma_l|^2}{2\sigma^2}\right) d\gamma_l \\
&\quad= \sum_{Z_{l_{mis}}} \int g(\beta, \gamma, \sigma^2, \phi^2, Z_{mis})\,
\frac{1}{(2\pi\sigma^2)^{\frac{n}{2}}}
\exp\!\left(-\frac{|Y - X\beta - Z_\delta\gamma_\delta|^2}{2\sigma^2}\right) \\
&\qquad\qquad \times
\exp\!\left(-\frac{\gamma_l' Z_l' Z_l \gamma_l}{2\sigma^2}
+ \frac{(Y - X\beta - Z_\delta\gamma_\delta)' Z_l \gamma_l}{\sigma^2}\right) d\gamma_l.
\end{aligned}
\tag{3-14}
\]
If we restrict $g(\beta, \gamma, \sigma^2, \phi^2, Z_{mis})$ to be constant with respect to $\phi^2$ and with respect to $\gamma_l$, and integrate $\gamma_l$ out of Equation 3-14, we have
\[
\begin{aligned}
&\sum_{Z_{l_{mis}}} \int g(\beta, \gamma, \sigma^2, \phi^2, Z_{mis})\, f_1(Y \mid \beta, \gamma, \sigma^2, Z_{mis})\, d\gamma_l \\
&\quad= \sum_{Z_{l_{mis}}}
\frac{\exp\!\left(-\frac{|Y - X\beta - Z_\delta\gamma_\delta|^2}{2\sigma^2}\right)}{(2\pi\sigma^2)^{\frac{n}{2}}}\,
g(\beta, \gamma, \sigma^2, \phi^2, Z_{mis})\,
(2\pi\sigma^2)^{\frac{s_l}{2}} |Z_l' Z_l|^{-\frac{1}{2}} \\
&\qquad\qquad \times
\exp\!\left(\frac{(Y - X\beta - Z_\delta\gamma_\delta)' Z_l (Z_l' Z_l)^{-1} Z_l' (Y - X\beta - Z_\delta\gamma_\delta)}{2\sigma^2}\right),
\end{aligned}
\tag{3-15}
\]
where $s_l = \mathrm{sum}(1) - \mathrm{sum}(\delta)$.


So if we take g to be Equation 3-11, that is,
\[
g(\beta, \gamma, \sigma^2, \phi^2, Z_{mis})
= (2\pi\sigma^2)^{-\frac{s_l}{2}}\, |Z_l' Z_l|^{\frac{1}{2}}
\exp\!\left(-\frac{(Y - X\beta - Z_\delta\gamma_\delta)' P_l (Y - X\beta - Z_\delta\gamma_\delta)}{2\sigma^2}\right)
P(Z_{l_{mis}} = z_{l_{mis}}),
\]
with $P_l = Z_l (Z_l' Z_l)^{-1} Z_l'$, then we have
\[
\begin{aligned}
\sum_{Z_{l_{mis}}} \int g(\beta, \gamma, \sigma^2, \phi^2, Z_{mis})\, f_1(Y \mid \beta, \gamma, \sigma^2, Z_{mis})\, d\gamma_l
&= \sum_{Z_{l_{mis}}}
\frac{\exp\!\left(-\frac{|Y - X\beta - Z_\delta\gamma_\delta|^2}{2\sigma^2}\right)}{(2\pi\sigma^2)^{\frac{n}{2}}}\,
P(Z_{l_{mis}} = z_{l_{mis}}) \\
&= \frac{\exp\!\left(-\frac{|Y - X\beta - Z_\delta\gamma_\delta|^2}{2\sigma^2}\right)}{(2\pi\sigma^2)^{\frac{n}{2}}}
\sum_{Z_{l_{mis}}} P(Z_{l_{mis}} = z_{l_{mis}}) \\
&= \frac{\exp\!\left(-\frac{|Y - X\beta - Z_\delta\gamma_\delta|^2}{2\sigma^2}\right)}{(2\pi\sigma^2)^{\frac{n}{2}}}
= f_\delta(Y \mid \beta, \gamma_\delta, \sigma^2, Z_{\delta_{mis}}).
\end{aligned}
\tag{3-16}
\]
Above we used the fact that
\[
\sum_{Z_{l_{mis}}} P(Z_{l_{mis}} = z_{l_{mis}}) = 1,
\]
which holds when $Z_{l_{mis}}$ is a legitimate discrete random vector; in our later calculation we take $Z_{l_{mis}}$ to be uniformly distributed.

So we have shown that, with the chosen g function,
\[
\sum_{Z_{l_{mis}}} \int g(\beta, \gamma, \sigma^2, \phi^2, Z_{mis})\, f_1(Y \mid \beta, \gamma, \sigma^2, Z_{mis})\, d\gamma_l
= \frac{\exp\!\left(-\frac{|Y - X\beta - Z_\delta\gamma_\delta|^2}{2\sigma^2}\right)}{(2\pi\sigma^2)^{\frac{n}{2}}}
= f_\delta(Y \mid \beta, \gamma_\delta, \sigma^2, Z_{\delta_{mis}}),
\]
and the condition of Equation 3-9 is satisfied.

We take the prior for $M_\delta$ to be
\[
\pi_\delta(\beta, \gamma_\delta, \sigma^2, \phi^2, Z_{\delta_{mis}})
= 1 \times \frac{\exp\!\left(-\frac{|\gamma_\delta|^2}{2\sigma^2\phi^2}\right)}{(2\pi\sigma^2\phi^2)^{\frac{s_\delta}{2}}}
\times \frac{b^a}{\Gamma(a)} \frac{\exp\!\left(-\frac{b}{\sigma^2}\right)}{(\sigma^2)^{a+1}}
\times \frac{d^c}{\Gamma(c)} \frac{\exp\!\left(-\frac{d}{\phi^2}\right)}{(\phi^2)^{c+1}}
\times P(Z_{\delta_{mis}} = z_{\delta_{mis}}),
\]
where the prior for $\beta$ is taken to be 1 and $P(Z_{\delta_{mis}})$ is the uniform distribution on $Z_{\delta_{mis}}$. The prior for $M_1$ is taken to be
\[
\pi_1(\beta, \gamma, \sigma^2, \phi^2, Z_{\delta_{mis}}, Z_{l_{mis}})
= 1 \times \frac{\exp\!\left(-\frac{|\gamma|^2}{2\sigma^2\phi^2}\right)}{(2\pi\sigma^2\phi^2)^{\frac{s}{2}}}
\times \frac{b^a}{\Gamma(a)} \frac{\exp\!\left(-\frac{b}{\sigma^2}\right)}{(\sigma^2)^{a+1}}
\times \frac{d^c}{\Gamma(c)} \frac{\exp\!\left(-\frac{d}{\phi^2}\right)}{(\phi^2)^{c+1}}
\times P(Z_{\delta_{mis}} = z_{\delta_{mis}}) \times P(Z_{l_{mis}} = z_{l_{mis}}),
\]
where the prior for $\beta$ is again taken to be 1, and both $P(Z_{\delta_{mis}})$ and $P(Z_{l_{mis}})$ are discrete uniform distributions on their sample spaces.

With the above specification, if we have samples $(\beta^{(i)}, \gamma^{(i)}, \sigma^{2(i)}, \phi^{2(i)}, Z_{mis}^{(i)})$ from model $M_1$ and follow Equation 3-10, we can consistently estimate the Bayes factor as
\[
\frac{1}{n}\sum_{i=1}^{n}
\frac{(2\pi\sigma^{2(i)})^{-\frac{s_l}{2}}
\left|Z_l^{(i)\prime} Z_l^{(i)}\right|^{\frac{1}{2}}
\exp\!\left(-\frac{(Y - X\beta^{(i)} - Z_\delta^{(i)}\gamma_\delta^{(i)})' P_l^{(i)} (Y - X\beta^{(i)} - Z_\delta^{(i)}\gamma_\delta^{(i)})}{2\sigma^{2(i)}}\right)}
{(2\pi\sigma^{2(i)}\phi^{2(i)})^{-\frac{s_l}{2}}
\exp\!\left(-\frac{|\gamma_l^{(i)}|^2}{2\sigma^{2(i)}\phi^{2(i)}}\right)}
\;\longrightarrow\; BF_{\delta,\,1},
\]
where the factor $P(Z_{l_{mis}}^{(i)} = z_{l_{mis}}^{(i)})$ in the g function cancels with the prior for $Z_{l_{mis}}$ in $\pi_1(\beta^{(i)}, \gamma^{(i)}, \sigma^{2(i)}, \phi^{2(i)}, Z_{\delta_{mis}}^{(i)}, Z_{l_{mis}}^{(i)})$, and the priors for $Z_{\delta_{mis}}$ in models $M_1$ and $M_\delta$ cancel with each other.


3.2.2 How to Choose $g(\beta, \gamma, Z_{mis}, \sigma^2, \phi^2)$?

We showed above that, as long as we find a g function satisfying Equation 3-9, we can use
\[
\frac{1}{n}\sum_{i=1}^{n}
\frac{\pi_\delta\!\left(\beta^{(i)}, \gamma_\delta^{(i)}, \sigma^{2(i)}, \phi^{2(i)}, Z_{\delta_{mis}}^{(i)}\right)\,
      g\!\left(\beta^{(i)}, \gamma^{(i)}, \sigma^{2(i)}, \phi^{2(i)}, Z_{mis}^{(i)}\right)}
     {\pi_1\!\left(\beta^{(i)}, \gamma^{(i)}, \sigma^{2(i)}, \phi^{2(i)}, Z_{mis}^{(i)}\right)}
\]
to consistently approximate the Bayes factor $BF_{\delta,\,\delta=1}$ with samples $(\beta^{(i)}, \gamma^{(i)}, \sigma^{2(i)}, \phi^{2(i)}, Z_{mis}^{(i)})$ from the posterior distribution $\pi_1(\beta, \gamma, \sigma^2, \phi^2, Z_{mis} \mid Y)$. So the g function is not unique. In the following we give another g function and also some direction on finding a g.

We will show that the following $g(\beta, \gamma, Z_{mis}, \sigma^2, \phi^2)$ also satisfies Equation 3-9:
\[
g(\beta, \gamma, Z_{mis}, \sigma^2, \phi^2)
= |R|^{1/2} (2\pi\sigma^2)^{-s_l/2}
\exp\!\left(-\frac{|\gamma_l|^2}{2\sigma^2\phi^2}\right)
\exp\!\left(-\frac{(Y - X\beta - Z_\delta\gamma_\delta)' Z_l R^{-1} Z_l' (Y - X\beta - Z_\delta\gamma_\delta)}{2\sigma^2}\right)
\times P(Z_{l_{mis}} = z_{l_{mis}}),
\tag{3-17}
\]
where
\[
R = \frac{I}{\phi^2} + Z_l' Z_l.
\]

Applying Corollary 3.2.1, we need to show that Equation 3-9 is satisfied. The left side of Equation 3-9 equals
\[
\begin{aligned}
&\sum_{Z_{l_{mis}}} \int g(\beta, \gamma, Z_{mis}, \sigma^2, \phi^2)\, f_1(Y \mid \beta, \gamma, Z_{mis}, \sigma^2)\, d\gamma_l \\
&\quad= \sum_{Z_{l_{mis}}} \int |R|^{1/2} (2\pi\sigma^2)^{-s_l/2}
\exp\!\left(-\frac{|\gamma_l|^2}{2\sigma^2\phi^2}\right)
\exp\!\left(-\frac{(Y - X\beta - Z_\delta\gamma_\delta)' Z_l R^{-1} Z_l' (Y - X\beta - Z_\delta\gamma_\delta)}{2\sigma^2}\right) \\
&\qquad\qquad \times
\frac{\exp\!\left(-\frac{|Y - X\beta - Z\gamma|^2}{2\sigma^2}\right)}{(2\pi\sigma^2)^{n/2}}\, d\gamma_l\;
P(Z_{l_{mis}} = z_{l_{mis}}).
\end{aligned}
\tag{3-18}
\]


In the above calculation, we simply substituted the g function back into the integral. Next we integrate $\gamma_l$ out of Equation 3-18:
\[
\begin{aligned}
&\sum_{Z_{l_{mis}}} \int g(\beta, \gamma, Z_{mis}, \sigma^2, \phi^2)\, f_1(Y \mid \beta, \gamma, Z_{mis}, \sigma^2)\, d\gamma_l \\
&\quad= \sum_{Z_{l_{mis}}}
(2\pi\sigma^2)^{-s_l/2} |R|^{1/2}
\exp\!\left(-\frac{(Y - X\beta - Z_\delta\gamma_\delta)' Z_l R^{-1} Z_l' (Y - X\beta - Z_\delta\gamma_\delta)}{2\sigma^2}\right) \\
&\qquad\quad \times
(2\pi\sigma^2)^{s_l/2} |R|^{-1/2}
\exp\!\left(\frac{(Y - X\beta - Z_\delta\gamma_\delta)' Z_l R^{-1} Z_l' (Y - X\beta - Z_\delta\gamma_\delta)}{2\sigma^2}\right) \\
&\qquad\quad \times
\frac{\exp\!\left(-\frac{|Y - X\beta - Z_\delta\gamma_\delta|^2}{2\sigma^2}\right)}{(2\pi\sigma^2)^{n/2}}\,
P(Z_{l_{mis}} = z_{l_{mis}}).
\end{aligned}
\tag{3-19}
\]
After the cancellation, we have
\[
\begin{aligned}
\sum_{Z_{l_{mis}}} \int g(\beta, \gamma, Z_{mis}, \sigma^2, \phi^2)\, f_1(Y \mid \beta, \gamma, Z_{mis}, \sigma^2)\, d\gamma_l
&= \sum_{Z_{l_{mis}}}
\frac{\exp\!\left(-\frac{|Y - X\beta - Z_\delta\gamma_\delta|^2}{2\sigma^2}\right)}{(2\pi\sigma^2)^{n/2}}\,
P(Z_{l_{mis}} = z_{l_{mis}}) \\
&= \frac{\exp\!\left(-\frac{|Y - X\beta - Z_\delta\gamma_\delta|^2}{2\sigma^2}\right)}{(2\pi\sigma^2)^{n/2}}
\sum_{Z_{l_{mis}}} P(Z_{l_{mis}} = z_{l_{mis}}) \\
&= f_\delta(Y \mid \beta, \gamma_\delta, Z_{\delta_{mis}}, \sigma^2).
\end{aligned}
\tag{3-20}
\]

So Equation 3-17 and Equation 3-11 are two different g functions, and both satisfy Equation 3-9; this shows the g function is not unique. In general, we can find a g function by following these directions:

• Take any candidate function $g_{cand}(\beta, \gamma, Z_{mis}, \sigma^2, \phi^2)$; this $g_{cand}$ may be chosen for computational convenience or any other reason.

• First calculate
\[
h(\beta, \gamma_\delta, Z_{\delta_{mis}}, \sigma^2, \phi^2)
= \frac{\sum_{Z_{l_{mis}}} \int f_1(\beta, \gamma, Z_{mis}, \sigma^2, \phi^2)\, g_{cand}(\beta, \gamma, Z_{mis}, \sigma^2, \phi^2)\, d\gamma_l}
       {f_\delta(\beta, \gamma_\delta, Z_{\delta_{mis}}, \sigma^2, \phi^2)},
\]
and then take
\[
g = \frac{g_{cand}(\beta, \gamma, Z_{mis}, \sigma^2, \phi^2)}{h(\beta, \gamma_\delta, Z_{\delta_{mis}}, \sigma^2, \phi^2)}
\times P(Z_{l_{mis}} = z_{l_{mis}}),
\]
and this g satisfies the condition of Equation 3-9.

Looking back, the g function in Equation 3-17 corresponds to the candidate function $g_{cand} = \exp\!\left(-\frac{|\gamma|^2}{2\sigma^2\phi^2}\right)$: using this $g_{cand}$ we can calculate the h function and arrive at the same g function as in Equation 3-17. As for Equation 3-11, we simply let $g_{cand} = 1$ and follow the same procedure: calculate the h function, in this case
\[
h = \frac{\sum_{Z_{l_{mis}}} \int f_1(\beta, \gamma, Z_{mis}, \sigma^2, \phi^2)\, d\gamma_l}
         {f_\delta(\beta, \gamma_\delta, Z_{\delta_{mis}}, \sigma^2, \phi^2)},
\]
and then obtain
\[
g = \frac{f_\delta(\beta, \gamma_\delta, Z_{\delta_{mis}}, \sigma^2, \phi^2)}
         {\sum_{Z_{l_{mis}}} \int f_1(\beta, \gamma, Z_{mis}, \sigma^2, \phi^2)\, d\gamma_l}
\times P(Z_{l_{mis}} = z_{l_{mis}}),
\]
which is the same function as Equation 3-11.

3.2.3 Comparison with the Simplest Model

In Section 3.1, we stated our plan to run a hybrid M-H chain to search for good subsets of variables according to the posterior probabilities of models; later we will show that this is equivalent to searching according to the Bayes factors. In the subsections above, we discussed the Bayes factor estimation for a submodel versus the full model, so when any two submodels are compared, they are implicitly compared with each other through the ratio of their Bayes factors against the full model. For example, for two submodels with SNP indicators $\delta_1$ and $\delta_2$, the ratio of Bayes factors
\[
\frac{BF_{\delta_1,\,\delta=1}}{BF_{\delta_2,\,\delta=1}}
\]
acts as the criterion for deciding whether the stochastic search chain stays at the current submodel or moves to the candidate submodel. An alternative approach is to consider the ratio of Bayes factors
\[
\frac{BF_{\delta_1,\,\delta=0}}{BF_{\delta_2,\,\delta=0}},
\]

where $\delta = 0$ is the indicator vector for the simplest model, which contains no SNPs. An intuitive explanation is that when two subsets are compared, it does not matter which model is used as the reference: either $\delta = 0$ or $\delta = 1$ can be used.

Casella, Giron, Martinez, and Moreno (Casella et al.) discussed two ways to do Bayesian model selection when using intrinsic priors: encompassing the models from above, which means comparing all models to the full model, and encompassing the models from below, that is, comparing all models to the simplest model. They showed that when the number of subsets is finite, these two methods give essentially equivalent results.

In the following, we give the details of the Bayes factor approximation when the simplest model is used as the reference, along with some explanation of why we choose the full model as the reference. For the Bayes factor of a submodel versus the simplest model, we are interested in calculating $BF_{\delta,\,\delta=0}$, where $\delta = 0$ denotes the model with no variables in it, i.e., the simplest model. By similar approximation methods as above, we can find the Bayes factor approximation for a small model versus a big model, and since
\[
BF_{\delta,\,\delta=0} = \frac{1}{BF_{\delta=0,\,\delta}},
\]
we discuss the calculation of $BF_{\delta=0,\,\delta}$ instead of $BF_{\delta,\,\delta=0}$.

The model $M_0$ has likelihood $f_0(Y \mid \beta, \sigma^2)$ and the model $M_\delta$ has likelihood $f_\delta(Y \mid \beta, \gamma_\delta, \sigma^2, \phi^2, Z_{\delta_{mis}})$.

Next we show how to find the g function for this situation. Applying Theorem 3.2.1, $\theta_a$ is $(\beta, \sigma^2, \phi^2)$ and $\theta_b$ is $(\gamma_\delta, Z_{\delta_{mis}})$. For Equation 3-6 to be satisfied, we need
\[
\int f_b(Y \mid \theta_a, \theta_b)\, g(\theta_a, \theta_b)\, d\theta_b = f_a(Y \mid \theta_a).
\]


Then the following equation should hold:
\[
\sum_{Z_{\delta_{mis}}} \int
\frac{\exp\!\left(-\frac{|Y - X\beta - Z_\delta\gamma_\delta|^2}{2\sigma^2}\right)}{(2\pi\sigma^2)^{\frac{n}{2}}}\,
g(\beta, \sigma^2, \gamma_\delta, \phi^2, Z_{\delta_{mis}})\, d\gamma_\delta
= (2\pi\sigma^2)^{-\frac{n}{2}} \exp\!\left(-\frac{|Y - X\beta|^2}{2\sigma^2}\right).
\tag{3-21}
\]
We find the g function from Equation 3-21. Expanding the square,
\[
\begin{aligned}
&\sum_{Z_{\delta_{mis}}} \int
\frac{\exp\!\left(-\frac{|Y - X\beta - Z_\delta\gamma_\delta|^2}{2\sigma^2}\right)}{(2\pi\sigma^2)^{\frac{n}{2}}}\,
g(\beta, \sigma^2, \gamma_\delta, \phi^2, Z_{\delta_{mis}})\, d\gamma_\delta \\
&\quad= \sum_{Z_{\delta_{mis}}} \int
\frac{\exp\!\left(-\frac{|Y - X\beta|^2}{2\sigma^2}
- \frac{\gamma_\delta' Z_\delta' Z_\delta \gamma_\delta}{2\sigma^2}
+ \frac{(Y - X\beta)' Z_\delta \gamma_\delta}{\sigma^2}\right)}{(2\pi\sigma^2)^{\frac{n}{2}}}\,
g(\beta, \sigma^2, \gamma_\delta, \phi^2, Z_{\delta_{mis}})\, d\gamma_\delta.
\end{aligned}
\]

If we restrict $g(\beta, \gamma_\delta, \sigma^2, \phi^2, Z_{\delta_{mis}})$ to be constant with respect to $\gamma_\delta$ and integrate $\gamma_\delta$ out of the above expression, we have
\[
\begin{aligned}
&\sum_{Z_{\delta_{mis}}} \int
\frac{\exp\!\left(-\frac{|Y - X\beta - Z_\delta\gamma_\delta|^2}{2\sigma^2}\right)}{(2\pi\sigma^2)^{\frac{n}{2}}}\,
g(\beta, \sigma^2, \gamma_\delta, \phi^2, Z_{\delta_{mis}})\, d\gamma_\delta \\
&\quad= \sum_{Z_{\delta_{mis}}}
(2\pi\sigma^2)^{-\frac{(n - s_\delta)}{2}} |Z_\delta' Z_\delta|^{-\frac{1}{2}}
\exp\!\left(\frac{(Y - X\beta)' Z_\delta (Z_\delta' Z_\delta)^{-1} Z_\delta' (Y - X\beta)}{2\sigma^2}\right) \\
&\qquad\qquad \times
g(\beta, \sigma^2, \gamma_\delta, \phi^2, Z_{\delta_{mis}})
\exp\!\left(-\frac{|Y - X\beta|^2}{2\sigma^2}\right).
\end{aligned}
\]
So if we take
\[
g(\beta, \gamma_\delta, \sigma^2, \phi^2, Z_{\delta_{mis}})
= (2\pi\sigma^2)^{-\frac{s_\delta}{2}} |Z_\delta' Z_\delta|^{\frac{1}{2}}
\exp\!\left(-\frac{(Y - X\beta)' Z_\delta (Z_\delta' Z_\delta)^{-1} Z_\delta' (Y - X\beta)}{2\sigma^2}\right)
\times P(Z_{\delta_{mis}} = z_{\delta_{mis}}),
\tag{3-22}
\]


we will have
\[
\begin{aligned}
\sum_{Z_{\delta_{mis}}} \int
\frac{\exp\!\left(-\frac{|Y - X\beta - Z_\delta\gamma_\delta|^2}{2\sigma^2}\right)}{(2\pi\sigma^2)^{\frac{n}{2}}}\,
g(\beta, \sigma^2, \gamma_\delta, \phi^2, Z_{\delta_{mis}})\, d\gamma_\delta
&= \sum_{Z_{\delta_{mis}}}
(2\pi\sigma^2)^{-\frac{n}{2}}
\exp\!\left(-\frac{|Y - X\beta|^2}{2\sigma^2}\right)
P(Z_{\delta_{mis}} = z_{\delta_{mis}}) \\
&= (2\pi\sigma^2)^{-\frac{n}{2}}
\exp\!\left(-\frac{|Y - X\beta|^2}{2\sigma^2}\right)
\sum_{Z_{\delta_{mis}}} P(Z_{\delta_{mis}} = z_{\delta_{mis}}) \\
&= (2\pi\sigma^2)^{-\frac{n}{2}}
\exp\!\left(-\frac{|Y - X\beta|^2}{2\sigma^2}\right)
= f_0(Y \mid \beta, \sigma^2).
\end{aligned}
\]

We have thus shown that Equation 3-21 holds with the g function of Equation 3-22. To make this method work, we need samples $(\beta^{(i)}, \gamma_\delta^{(i)}, \sigma^{2(i)}, \phi^{2(i)}, Z_{\delta_{mis}}^{(i)})$ from the full model $M_1$, which has likelihood $f_1(Y \mid \beta, \gamma, \sigma^2, \phi^2, Z_{mis})$. As we plan to run a Gibbs sampler for the full model containing all the SNPs, we obtain from it samples of $(\beta^{(i)}, \gamma_\delta^{(i)}, \sigma^{2(i)}, \phi^{2(i)}, Z_{\delta_{mis}}^{(i)})$, and these have the correct marginalized joint distribution according to Theorem 10.6 of Robert and Casella (2004).

So we can approximate the Bayes factor $BF_{\delta=0,\,\delta}$ by
\[
\frac{1}{n}\sum_{i=1}^{n}
\frac{\pi_0\!\left(\beta^{(i)}, \sigma^{2(i)}\right)\,
      g\!\left(\beta^{(i)}, \sigma^{2(i)}, \gamma_\delta^{(i)}, \phi^{2(i)}, Z_{\delta_{mis}}^{(i)}\right)}
     {\pi_\delta\!\left(\beta^{(i)}, \sigma^{2(i)}, \gamma_\delta^{(i)}, \phi^{2(i)}, Z_{\delta_{mis}}^{(i)}\right)}
\;\longrightarrow\; BF_{\delta=0,\,\delta},
\]
and, following a calculation similar to that in the last subsection, we can use the following to estimate $BF_{\delta=0,\,\delta}$:

\[
\frac{1}{n}\sum_{i=1}^{n}
\frac{(\phi^{2(i)})^{\frac{s_\delta}{2}}\,
\left|Z_\delta^{(i)\prime} Z_\delta^{(i)}\right|^{\frac{1}{2}}
\exp\!\left(-\frac{(Y - X\beta^{(i)})' P_\delta^{(i)} (Y - X\beta^{(i)})}{2\sigma^{2(i)}}\right)}
{\dfrac{d^c}{\Gamma(c)} \dfrac{\exp\!\left(-\frac{d}{\phi^{2(i)}}\right)}{(\phi^{2(i)})^{c+1}}
\exp\!\left(-\frac{|\gamma_\delta^{(i)}|^2}{2\sigma^{2(i)}\phi^{2(i)}}\right)},
\tag{3-23}
\]
where n is the number of samples from the Gibbs sampler and
\[
P_\delta^{(i)} = Z_\delta^{(i)} \left(Z_\delta^{(i)\prime} Z_\delta^{(i)}\right)^{-1} Z_\delta^{(i)\prime}.
\]

In terms of computation, for the formula of Equation 3-23, which uses the simplest model as the reference model, and Equation 3-12, which uses the full model as the reference model, most of the time is spent calculating $P_\delta^{(i)}$ or $P_l^{(i)}$. In a simulation study we found no obvious advantage for either method. From now on, we will use only the full model as the reference model.

3.2.4 Marginal Likelihood $m_\delta(Y)$

According to the definition of the Bayes factor in Equation 3-1, for two models $M_\delta$ and $M_1$ with parameters $\theta_\delta$ and $\theta_1$, the Bayes factor is defined as
\[
BF_{\delta,\,\delta=1} = \frac{m_\delta(Y)}{m_1(Y)}
= \frac{\int p(\theta_\delta \mid M_\delta)\, p(Y \mid \theta_\delta, M_\delta)\, d\theta_\delta}
       {\int p(\theta_1 \mid M_1)\, p(Y \mid \theta_1, M_1)\, d\theta_1}.
\]
So at first we were interested in calculating the Bayes factors directly for our situation. In this subsection we detail the calculation of the marginal likelihood and show why there is no closed form.

Let $M_\delta$ denote the model
\[
Y = X\beta + Z_\delta\gamma_\delta + \varepsilon,
\]
where $\gamma_\delta$ is a vector of length $\mathrm{sum}(\delta)$ whose elements are the elements of $\gamma$ whose corresponding indicator element in $\delta$ is 1, and $Z_\delta$ is the matrix whose columns are taken from the Z matrix according to the indicator vector $\delta$.

In the above model specification, $Z_\delta = (Z_{\delta(obs)}, Z_{\delta(mis)})$; for simplicity we just write $Z_\delta$. We let $s_\delta$ be the total number of elements of $\delta$ equal to 1. The joint likelihood under the model specification of $M_\delta$ is
\[
L_\delta(Y) \propto
(2\pi\sigma^2)^{-\frac{n}{2}} (2\pi\sigma^2\phi^2)^{-\frac{s_\delta}{2}}
\exp\!\left(-\frac{|Y - X\beta - Z_\delta\gamma_\delta|^2}{2\sigma^2}\right)
\exp\!\left(-\frac{|\gamma_\delta|^2}{2\sigma^2\phi^2}\right)
\times \pi(\sigma^2) \times \pi(\phi^2).
\tag{3-24}
\]


With the joint likelihood, the marginal likelihood $m_\delta(Y)$, with the parameters $\beta, \gamma_\delta, \sigma^2, \phi^2$, and $Z_{\delta_{mis}}$ integrated out of $L_\delta(Y)$, is
\[
m_\delta(Y) = \int \cdots \int
(2\pi\sigma^2)^{-\frac{n}{2}} (2\pi\sigma^2\phi^2)^{-\frac{s_\delta}{2}}
\exp\!\left(-\frac{|Y - X\beta - Z_\delta\gamma_\delta|^2}{2\sigma^2}\right)
\exp\!\left(-\frac{|\gamma_\delta|^2}{2\sigma^2\phi^2}\right)
\pi(\sigma^2)\, \pi(\phi^2)\;
d\beta\, d\gamma_\delta\, d\sigma^2\, d\phi^2\, dZ_{\delta_{mis}}.
\tag{3-25}
\]

The first step is to integrate $\beta$ out. Looking only at the exponential term involving $\beta$ in the above equality, we can write
\[
\exp\!\left(-\frac{|Y - X\beta - Z_\delta\gamma_\delta|^2}{2\sigma^2}\right)
= \exp\!\left(-\frac{|X\beta - (Y - Z_\delta\gamma_\delta)|^2}{2\sigma^2}\right)
= \exp\!\left(-\frac{\beta' X' X \beta}{2\sigma^2}\right)
\exp\!\left(\frac{(Y - Z_\delta\gamma_\delta)' X \beta}{\sigma^2}\right)
\exp\!\left(-\frac{|Y - Z_\delta\gamma_\delta|^2}{2\sigma^2}\right).
\tag{3-26}
\]
Using this organization, we can integrate $\beta$ out of Equation 3-25 and get
\[
\begin{aligned}
m_\delta(Y) = \int \cdots \int\;
&(2\pi\sigma^2)^{-\frac{n}{2}} (2\pi\sigma^2\phi^2)^{-\frac{s_\delta}{2}}
(2\pi\sigma^2)^{\frac{p}{2}} |X' X|^{-\frac{1}{2}} \\
&\times \exp\!\left(-\frac{(Y - Z_\delta\gamma_\delta)' P (Y - Z_\delta\gamma_\delta)}{2\sigma^2}\right)
\exp\!\left(-\frac{|\gamma_\delta|^2}{2\sigma^2\phi^2}\right)
\pi(\sigma^2)\, \pi(\phi^2)\;
d\gamma_\delta\, d\sigma^2\, d\phi^2\, dZ_{\delta_{mis}},
\end{aligned}
\tag{3-27}
\]
where $P = I - X(X'X)^{-1}X'$ and p is the dimension of $\beta$. The next step is to integrate $\gamma_\delta$ out. We can write the exponential terms involving $\gamma_\delta$ as
\[
\exp\!\left(-\frac{(Y - Z_\delta\gamma_\delta)' P (Y - Z_\delta\gamma_\delta)}{2\sigma^2}\right)
\exp\!\left(-\frac{|\gamma_\delta|^2}{2\sigma^2\phi^2}\right)
= \exp\!\left(-\frac{\gamma_\delta' \left(Z_\delta' P Z_\delta + \frac{I}{\phi^2}\right) \gamma_\delta}{2\sigma^2}
+ \frac{Y' P Z_\delta \gamma_\delta}{\sigma^2}\right)
\exp\!\left(-\frac{Y' P Y}{2\sigma^2}\right).
\tag{3-28}
\]
Now we can integrate $\gamma_\delta$ out of Equation 3-27 and get


\[
\begin{aligned}
m_\delta(Y) = \int \cdots \int\;
&(2\pi\sigma^2)^{-\frac{n}{2}} (2\pi\sigma^2\phi^2)^{-\frac{s_\delta}{2}}
(2\pi\sigma^2)^{\frac{p}{2}} |X' X|^{-\frac{1}{2}}
(2\pi\sigma^2)^{\frac{s_\delta}{2}}
\left|Z_\delta' P Z_\delta + \frac{I}{\phi^2}\right|^{-\frac{1}{2}} \\
&\times \exp\!\left(\frac{Y' P Z_\delta \left(Z_\delta' P Z_\delta + \frac{I}{\phi^2}\right)^{-1} Z_\delta' P Y}{2\sigma^2}\right)
\exp\!\left(-\frac{Y' P Y}{2\sigma^2}\right)
\pi(\sigma^2)\, \pi(\phi^2)\;
d\sigma^2\, d\phi^2\, dZ_{\delta_{mis}}.
\end{aligned}
\tag{3-29}
\]

Reorganizing Equation 3-29 a little, we have
\[
\begin{aligned}
m_\delta(Y) = \int \cdots \int\;
&(2\pi)^{-\frac{n-p}{2}} (\sigma^2)^{-\frac{n-p}{2}} (\phi^2)^{-\frac{s_\delta}{2}}
|X' X|^{-\frac{1}{2}} |P_{\delta,\phi^2}|^{-\frac{1}{2}} \\
&\times \exp\!\left(\frac{Y' P Z_\delta (P_{\delta,\phi^2})^{-1} Z_\delta' P Y}{2\sigma^2}\right)
\exp\!\left(-\frac{Y' P Y}{2\sigma^2}\right)
\pi(\sigma^2)\, \pi(\phi^2)\;
d\sigma^2\, d\phi^2\, dZ_{\delta_{mis}},
\end{aligned}
\tag{3-30}
\]
where $P_{\delta,\phi^2} = Z_\delta' P Z_\delta + \frac{I}{\phi^2}$.

The next step is to integrate $\sigma^2$ out; recall that the prior is
\[
\pi(\sigma^2) = \frac{b^a}{\Gamma(a)} \frac{\exp\!\left(-\frac{b}{\sigma^2}\right)}{(\sigma^2)^{a+1}}.
\]
The integration then becomes
\[
\begin{aligned}
m_\delta(Y) = \int\!\!\int\;
&(2\pi)^{-\frac{n-p}{2}} (\phi^2)^{-\frac{s_\delta}{2}}
|X' X|^{-\frac{1}{2}} |P_{\delta,\phi^2}|^{-\frac{1}{2}} \\
&\times \frac{b^a\, \Gamma\!\left(\frac{n-p}{2} + a\right)}
{\Gamma(a) \left(\frac{Y' P Y}{2} - \frac{Y' P Z_\delta (P_{\delta,\phi^2})^{-1} Z_\delta' P Y}{2} + b\right)^{\frac{n-p}{2} + a}}\,
\pi(\phi^2)\;
d\phi^2\, dZ_{\delta_{mis}}.
\end{aligned}
\tag{3-31}
\]
The next step would be to integrate the missing data out of the above marginal distribution; however, the number of missing-SNP combinations is huge, and there is no explicit form for it. As for the integration over $\phi^2$, we cannot integrate it out directly either, so we leave the integral as it stands. This derivation shows that we cannot directly calculate the marginal likelihood, and consequently we do not have a closed form for the Bayes factor.


3.3 Markov Chain Monte Carlo Property

We now want to apply a variable selection method to detect good subsets of SNPs. Suppose that in the data set each individual observation has s SNPs. With each subset of variables we associate an index vector $\delta$, each element of which is either 1 or 0: if it is 1, the corresponding variable is included in the model, and otherwise it is not. An example of a $\delta$ vector is
\[
\delta = \begin{pmatrix} 1 \\ 0 \\ \vdots \\ 1 \end{pmatrix}_{s\times 1},
\]
where, in this example, the first variable is included in the model, the second is not, and the last is included.

Our plan is to set up a Markov chain driven by Bayes factors such that the stationary distribution of the Markov chain is proportional to the distribution of $\delta$, in other words, the distribution of models over the model sample space.

There are s variables in the data, in our case s SNPs, so there are $2^s$ models in the model space. Each of the $2^s$ models has an associated index vector $\delta_i$, $i = 1, \ldots, 2^s$. The model associated with $\delta_i$ has marginal likelihood
\[
m_{\delta_i}(X) = \int f(X \mid \theta_{\delta_i})\, \pi(\theta_{\delta_i})\, d\theta_{\delta_i},
\]
where $\theta_{\delta_i}$ is the parameter vector of model $\delta_i$. Let $B(\delta_i)$ denote the probability of model $\delta_i$ in the model sample space; then
\[
B(\delta_i) = \frac{m_{\delta_i}(X)}{\sum_{j=1}^{2^s} m_{\delta_j}(X)}
= \frac{\frac{m_{\delta_i}(X)}{m_1(X)}}{\sum_{j=1}^{2^s} \frac{m_{\delta_j}(X)}{m_1(X)}}
= \frac{BF_{\delta_i,\,\delta=1}}{1 + \sum_{j\neq 1} BF_{\delta_j,\,\delta=1}}.
\]
So $B(\delta_i)$ is the target distribution; it has a finite sample space with the above probability distribution.

101

Page 102: BAYESIAN METHODOLOGIES FOR GENOMIC DATA WITH …ufdcimages.uflib.ufl.edu/UF/E0/02/24/60/00001/li_z.pdfChair: George Casella Major: Statistics With advancing technology, large single

The following is the plan to sample the target distribution.

• Run two parallel Markov chains. One chain is a Gibbs sampler for the full model

and samples all the parameters in the full model, including all the missing SNPs.

• Concurrently, run a hybrid M-H chain to select good models, in other words, good

subsets of SNPs. When running this chain, we will employ samples from the parallel

Gibbs chain to estimate the Bayes factors. The embedded list details the M-H steps.

1. Suppose the current state is δ1, and we will use a mixture distribution togenerate the candidate δ2. The mixture distribution is: with probability p,draw a sample using a random walk algorithm; with probability (1− p), draw asample from the uniform distribution on the sample space. The draws from theuniform sample space are i.i.d. and the tuning parameter p will be set as 0.75to insure random walk most of the time and also having the ability to jump outof local modes.

2. Using the samples from the previous Gibbs sampler, calculate the approximateBayes factors for the candidate state and calculate the acceptance probabilityρ (δ1, δ2).

3. Draw u from uniform distribution U (0, 1), if u < ρ (δ1, δ2) take δ2 as the nextstate, otherwise stay at δ1.

4. Repeat the steps, until the chain arrives its stationary distribution.

The above is the chain we designed. But in actual computation, the steps are a little

bit different. We run the Gibbs sampler first. Then with the samples from the Gibbs

chain, we run the M-H chain to find good subsets.

We plan to use the following candidate random walk distribution. Suppose the

current chain is at state δ1. Randomly choose one number r from the uniform discrete

distribution (1, 2, ..., s), and flip the rth element in δ1 from 1 to 0 or from 0 to 1, and at

the same time keep all other elements in δ1 untouched. Denote the new candidate sample

from random walk as δ′2. Let δ′′2 denote the candidate state from the other part of the

mixture distribution, the uniform distribution in the sample space. Each element of δ′′2 is

independent and randomly taken as either 0 or 1.

102

Page 103: BAYESIAN METHODOLOGIES FOR GENOMIC DATA WITH …ufdcimages.uflib.ufl.edu/UF/E0/02/24/60/00001/li_z.pdfChair: George Casella Major: Statistics With advancing technology, large single

3.3.1 Candidate Distribution

Now the conditional distribution can be written as :

q (δ2|δ1) = p∑

j=1,...,s

Iδ1

(δ2 = δ′2j

) 1

s+ (1− p)

k=1,...,2s

I (δ2 = δ′′2k)1

2s.

Also we have

q (δ1|δ2) = p∑

j=1,...,s

Iδ2

(δ1 = δ′1j

) 1

s+ (1− p)

k=1,...,2s

I (δ1 = δ′′1k)1

2s.

Obviously, δ′′2k and δ′′1k have same uniform i.i.d distribution which does not depend on

the previous states. So for any state δ1 and δ2, we have

(1− p)∑

k=1,...,2s

I (δ2 = δ′′2k)1

2s= (1− p)

k=1,...,2s

I (δ1 = δ′′1k)1

2s=

(1− p)

2s

Now take any states δ1 and δ2 for consideration. If δ1 and δ2 have more than one element

different or they are exactly the same, then δ1 and δ2 can not directly transfer to each

other from the random walk distribution. On the other hand, if δ1 and δ2 are only

different by one element, the random walk part probability is p/s. It can be summarized

as the follows:

p∑

j=1,...,s

Iδ1

(δ2 = δ′2j

) 1

s= p

∑j=1,...,s

Iδ2

(δ1 = δ′1j

) 1

s

=

0 : more than one element different or the same

ps

: just one element difference

So the above argument says, for any state of δ1 and δ2, the following always holds

q (δ2|δ1) = q (δ1|δ2) .

Then with the actual Bayes factors, the acceptance probability is:

ρ (δ1, δ2) = min

BFδ2, δ=1

BFδ1, δ=1

q (δ1|δ2)

q (δ2|δ1), 1

= min

BFδ2, δ=1

BFδ1, δ=1

, 1

,

103

Page 104: BAYESIAN METHODOLOGIES FOR GENOMIC DATA WITH …ufdcimages.uflib.ufl.edu/UF/E0/02/24/60/00001/li_z.pdfChair: George Casella Major: Statistics With advancing technology, large single

where δ = 1 is the indicator vector for the full model.

Using samples from the Gibbs sampler and applying Theorem 3.2.1, we will have

BF

(n)δ2, δ=1 as a consistent Bayes factor estimate for BF(δ2, δ=1) and n is the number of pairs

of samplers from the Gibbs sampler. We further denote the ratio of estimated Bayes factor

as

BF(n)δ2, δ=1

BF

(n)δ1, δ=1

,

then the estimated acceptance probability would be

ρn (δ1, δ2) = min

BF

(n)δ2, δ=1

BF

(n)δ1, δ=1

q (δ1|δ2)

q (δ2|δ1), 1

= min

BF

(n)δ2, δ=1

BF

(n)δ1, δ=1

, 1

. (3–32)

3.3.2 Convergence of Bayes Factors

To run the Metropolis-Hastings Markov chain, we need to calculate BF(δ, δ=1), which

normally involves parts of the total SNPs. From the previous sections, we know that the

Bayes factors are intractable and we plan to approximate them. In the previous chapter,

we already devised a Gibbs sampler chain for the full model, which has all SNPs and all

parameters in the model. With that, we can use samples from the full Gibbs sampler to

approximate the Bayes factor. The approximation can be justified by the Theorem 10.6

from Robert and Casella (2004) The theorem states:

For the Gibbs Sampler of with the algorithm of [A.40] in Robert and Casella (2004),

if(Y (t)

)is ergodic, then the distribution g is a stationary distribution for the chain

(Y (t)

)and f is the limiting distribution of the subchain

(X(t)

). Here g is the full joint

distribution for X.

[A.40] in Robert and Casella (2004) is the following algorithm:

104

Page 105: BAYESIAN METHODOLOGIES FOR GENOMIC DATA WITH …ufdcimages.uflib.ufl.edu/UF/E0/02/24/60/00001/li_z.pdfChair: George Casella Major: Statistics With advancing technology, large single

Given (y(t)1 , · · · , y

(t)p ),

Y(t+1)1 ∼ g1(y1|y(t)

2 , · · · , y(t)p )

Y(t+1)2 ∼ g2(y2|y(t+1)

1 , y(t)3 , · · · , y(t)

p )

...

Y (t+1)p ∼ gp(yp|y(t+1)

1 , · · · , y(t+1)p−1 )

where g1(y1|y(t)2 , · · · , y

(t)p ), · · · , gp(yp|y(t+1)

1 , · · · , y(t+1)p−1 ) are the conditional distributions.

Our sub-chain from the Gibbs sampler satisfies the regular conditions and is Harris-

recurrent, aperiodic, so it is ergodic. Further more, by the ergodic theorem, we will have

BF

(n)(δ2, δ=1)

BF

(n)(δ1, δ=1)

−→ BFδ2, δ=1

BFδ1, δ=1

, as n →∞,

where

BF(n)δ2, δ=1 is the estimator of BFδ2, δ=1 from the Gibbs sampler with sample size n.

3.3.3 Ergodicity Property of This M-H Chain

Ideally, when having the exact Bayes factors, we will run a M-H chain on the model

sample space to search for good subsets of variables. But with no closed form for the

Bayes factors, we approximate the Bayes factors using the samples from the Gibbs

sampler. We plan to run the M-H chain on the model sample space and the Gibbs sampler

for the full model concurrently. As we use estimated Bayes factors in the M-H chain, we

call this M-H chain a empirical chain. In practice, we use n pairs of samples from the

Gibbs sampler to estimate the Bayes factors and run the M-H chain t steps and n and

t are big numbers. In this section, our goal is to prove the ergodicity of the empirical

chain when both n and t go to infinity. Our proof has two steps. First we will consider

the situation where n is a fixed large number, and then we will consider the situation that

both n and t go to infinity.

105

Page 106: BAYESIAN METHODOLOGIES FOR GENOMIC DATA WITH …ufdcimages.uflib.ufl.edu/UF/E0/02/24/60/00001/li_z.pdfChair: George Casella Major: Statistics With advancing technology, large single

3.3.3.1 Fixed n, uniformly ergodic converges to the distribution B(n)

With n fixed, the empirical M-H algorithm with estimated Bayes factors is a regular

M-H chain with finite sample space. We will show that it satisfies the detailed balance

condition with a stationary distribution,

Bn(δi) =mδi

(X)∑2s

j=1 mδj(X)

=

BF

(n)δi, δ=1

1 +∑

j3j 6=1

BF

(n)δj , δ=1

, n is fixed.

Also since the sample space is finite, this M-H chain is uniformly ergodic.

Now, let us show it satisfies the detailed balance condition. As the transition kernel

associated with this Markov chain is

K (δ1, δ2) = ρ (δ1, δ2) q (δ2|δ1) I (δ2 6= δ1) + (1− r (δ1)) I (δ1 = δ2) ,

We need to verify

ρ(δ1, δ2)q(δ2|δ1)I (δ2 6= δ1) P (δ = δ1) = ρ(δ2, δ1)q(δ1|δ2)I (δ2 6= δ1) P (δ = δ2),

and

(1− r(δ1))Iδ1(δ2)I (δ1 = δ2) P (δ = δ1) = (1− r(δ2))Iδ2(δ1)I (δ1 = δ2) P (δ = δ2).

In the above equations, ρ(δ1, δ2) is the acceptance probabilities, q(δ2|δ1) is the candidate

distribution given currently at the state of δ1. P (δ) stands for the target distribution of

the M-H chain with the above acceptance probabilities and candidate distribution. In our

situation, with n fixed, the target distribution is Bn(δi).

The above second equation is obvious, when δ1 = δ2, the left side is equal to the right

side. When δ1 6= δ2, both sides equal to 0. As we showed before that q(δ2|δ1) = q(δ1|δ2),

for the first equation, by applying the equation 3–32, we need to verify that

min

BF

(n)δ2, δ=1

BF

(n)δ1, δ=1

, 1

I (δ2 6= δ1) Bn(δ = δ1) = min

BF

(n)δ1, δ=1

BF

(n)δ2, δ=1

, 1

I (δ2 6= δ1) Bn(δ = δ2).

106

Page 107: BAYESIAN METHODOLOGIES FOR GENOMIC DATA WITH …ufdcimages.uflib.ufl.edu/UF/E0/02/24/60/00001/li_z.pdfChair: George Casella Major: Statistics With advancing technology, large single

Here we assume δ1 6= δ2, otherwise, the above equation automatically holds. That is, we

need verify

min

BF

(n)δ2, δ=1

BF

(n)δ1, δ=1

, 1

BF

(n)δ1, δ=1

1 +∑

j3j 6=1

BF

(n)δj , δ=1

= min

BF

(n)δ1, δ=1

BF

(n)δ2, δ=1

, 1

BF

(n)δ2, δ=1

1 +∑

j3j 6=1

BF

(n)δj , δ=1

.

(3–33)

If

BF(n)δ2, δ=1 <

BF

(n)δ1, δ=1, the left side of Equation 3–33 is equal to

BF

(n)δ2, δ=1

1 +∑

j3j 6=1

BF

(n)δj , δ=1

.

In this situation,

min

BF

(n)δ1, δ=1

BF

(n)δ2, δ=1

, 1

= 1

and the right side of Equation 3–33 is equal to

BF

(n)δ2, δ=1

1 +∑

j3j 6=1

BF

(n)δj , δ=1

.

So the right side of Equation 3–33 is equal to the left side of (3–33). Follow the same

rational, Equation 3–33 holds when

BF

(n)δ2, δ=1 >

BF

(n)δ1, δ=1.

So the Equation 3–33 is satisfied. Then by using the Theorem 6.46 of Robert and Casella

(2004), B(n) is the stationary distribution of this chain.

In the above we showed that the empirical M-H Markov chain converges uniformly

to the distribution B(n) when n is fixed. However, our ultimate goal is to show this chain

converges to the distribution B(δi). That is,

B(δi) =mδi

(X)∑2s

j=1 mδj(X)

=BFδi, δ=1

1 +∑

j3δj 6=1 BFδj , δ=1

. (3–34)

107

Page 108: BAYESIAN METHODOLOGIES FOR GENOMIC DATA WITH …ufdcimages.uflib.ufl.edu/UF/E0/02/24/60/00001/li_z.pdfChair: George Casella Major: Statistics With advancing technology, large single

3.3.3.2 Ergodic convergence to B

We will use KBnto denote the kernel density for our empirical M-H chain with the

estimated Bayes factor. We use B to denote the distribution of models with exact Bayes

factors and use Bn to denote the distribution of the models with estimated Bayes factors.

Theorem 3.3.1. Consider two parallel chains, a Gibbs sampler on the full model with all

parameters, and a M-H chain in the model space as defined in Section 3.3. The empirical

chain converges to the target stationary distribution with ergodic property.

‖ K(t)

Bn(δ0, ·)−B ‖TV−→ 0, (3–35)

where ‖ · ‖TV is the notation for the total variation norm, δ0 is the initial state of the

chain, and B is defined in Equation 3–34.

Now we will prove the above claim.

Proof.

‖ K(t)

Bn(δ0, ·)−B ‖TV = ‖ K

(t)

Bn(δ0, ·)− Bn + Bn −B ‖TV (3–36)

≤ ‖ K(t)

Bn(δ0, ·)− Bn ‖TV + ‖ Bn −B ‖TV

= sup∆‖K(t)

Bn(δ0, ∆)− Bn(∆)‖+ sup

∆‖Bn(∆)−B(∆)‖

=1

2

δ

‖K(t)

Bn(δ0, δ)− Bn(δ)‖+

1

2

δ

‖Bn(δ)−B(δ)‖.

Notice in the above equations, the first inequality is by the Triangle inequality of total

variation norm. The second equation is by the definition of total variation norm and the

third equation is by Scheffe’s lemma. By the ergodic property of the Gibbs sampler, we

know that

1

2

δ

‖Bn(δ)−B(δ)‖ −→ 0, as n →∞. (3–37)

So to show Equation 3–35 we must show

1

2

δ

‖K(t)

Bn(δ0, δ)− Bn(δ)‖ −→ 0, as n, t →∞.

108

Page 109: BAYESIAN METHODOLOGIES FOR GENOMIC DATA WITH …ufdcimages.uflib.ufl.edu/UF/E0/02/24/60/00001/li_z.pdfChair: George Casella Major: Statistics With advancing technology, large single

For any ε > 0, we always can find a pair (n, tn) such that:

‖K(tn)

Bn(δ0, δ)− Bn(δ)‖ < ξ,

where ξ = 2ε2s > 0, and s is the number of variables in the data set. The above is true since

from the previous proof for any n,

‖K(t)

Bn(δ0, δ)− Bn(δ)‖ −→ 0, as t →∞.

Thus we can always find a tn, which satisfies the condition ‖K(tn)

Bn(δ0, δ) − Bn(δ)‖ < ξ. By

the Theorem 13.3.2 of Meyn and Tweedie (2008), we know that with n fixed, we have

‖K(t)

Bn(δ0, δ)− Bn(δ)‖ > ‖K(t+1)

Bn(δ0, δ)− Bn(δ)‖.

So for any pair of (n, t) 3 t > tn and any δ, we always have

‖K(t)

Bn(δ0, δ)− Bn(δ)‖ < ξ.

Furthermore, from the definition of ξ, for any n, we can always find (n, t), such that when

t > tn,

1

2

δ

‖K(t)

Bn(δ0, δ)− Bn(δ)‖ < ε.

So when n →∞ and tn < t →∞,

1

2

δ

‖K(t)

Bn(δ0, δ)− Bn(δ)‖ −→ 0. (3–38)

Combining Equation 3–37 and Equation 3–38 we have

‖ K(t)

Bn(δ0, ·)−B ‖TV−→ 0, as n, t →∞. (3–39)

The above proof shows that our M-H chain will converge to the target stationary

distribution with the ergodic property.

109

Page 110: BAYESIAN METHODOLOGIES FOR GENOMIC DATA WITH …ufdcimages.uflib.ufl.edu/UF/E0/02/24/60/00001/li_z.pdfChair: George Casella Major: Statistics With advancing technology, large single

3.4 Computation Speed

Computation speed is always a problem in variable selection. For our problem, in

every step of Markov chain, for one candidate model with index δ, we need to calculate an

estimated Bayes factor for the model δ verses the reference model.

From the above sections, we know that if we have samples (β(i), γ(i), σ2(i), φ2(i)

, Z(i)mis)

generated by Gibbs sampling from model M1, the reference model, also the full model, we

can approximate the Bayes factor by:

1

n

n∑i=1

(2πσ2(i))−

sl2 d

∣∣∣Z(i)l

′Z

(i)l

∣∣∣12exp

(− (Y−Xβ(i)−Z

(i)δ γ

(i)δ )′P (i)

l (Y−Xβ(i)−Z(i)δ γ

(i)δ )

2σ2(i)

)

(2πσ2(i)φ2(i))−sl2 exp

(− |γ(i)

l |22σ2(i)φ2(i)

) −→ BFδ,δ=1,

where Z(i)δ is a matrix with its columns coming from the corresponding columns of the

updated matrix Z(i) and the corresponding columns are those with δ elements equal to

1. On the other hand, Z(i)l is composed of columns of Z(i), which do not compose the Z

(i)δ

matrix. Also Pl = Zl(Z′lZl)

−1Z ′l .

In the above formula, for one Bayes factor estimation, we need to repeat n times

of∣∣∣Z(i)

l

′Z

(i)l

∣∣∣ and n times of (Z(i)l

′Z

(i)l )−1. As we know the determinant calculation is

closely related to the inverse calculation and if the inverse is available, the determinant

calculation is much easier. So we want to take advantages of this property without directly

calculating the determinants.

In this section, we will talk about two methods to speed up the computation. First we

will introduce a method to update the inverse based on inverse calculation from the last

step without calculating the inverse from the beginning. Second, we will talk about direct

replacement of Z(i) with Z, and the justification of doing this.

3.4.1 Matrix Inversion

3.4.1.1 Two columns of parameters for one column of SNPs

Suppose we have i = 1, ..., n, samples from the Gibbs sampler. In the Gibbs sampling

program, originally, we use two parameters to code one SNP. So for the design matrix

110

Page 111: BAYESIAN METHODOLOGIES FOR GENOMIC DATA WITH …ufdcimages.uflib.ufl.edu/UF/E0/02/24/60/00001/li_z.pdfChair: George Casella Major: Statistics With advancing technology, large single

of the SNPs, we use two columns to code one column of SNPs. Previously in Chapter 2,

we showed that updating one column of SNPs in each Gibbs sampler cycle still preserves

the target stationary distribution, meanwhile dramatically speed up the computation.

So in the Gibbs sampler, at the ith iteration step, we update 2 columns of matrix Z(i−1)

from the previous iteration. Note the update is just with respect to the missing SNPs

and the parts of the design matrix with the observed SNPs are always untouched. In each

iteration of the Gibbs sampler, we only stored the updated SNPs out of the consideration

of the storage space. For the Bayesian variable selection problem, during each step of the

Markov chain, we need to recover 2 columns of the design matrix Z(i) first, since in the

Gibbs sampler, we only saved 2 column of SNP imputation, and not the whole matrix

Z(i). This is an enormous step because for each one of the n pairs of samples, we need to

first find the corresponding index of the updated SNPs, then find the missing positions

of the indexed SNPs, and finally update the corresponding missing SNPs. Considering

the size of the problem, this is the only feasible method and the time spent on this step is

unavoidable and we do not have any better solution.

Another place where most calculation time is spent is to find the |Z(i)l

′Z

(i)l | and

(Z(i)l

′Z

(i)l )−1. Right now we are dealing with a dataset with 44 SNPs and if we have

sum(1 − δ) = 30, then there are 30 SNPs not included in the submodel, to calculate

one Bayes factor, we need to do n times matrix inversion with dimension 60 × 60, as each

column of SNPs uses 2 columns for coding.

Now, we will discuss the difference between the just updated matrix Z(i) and the

previous design matrix Z(i−1). Then one method of calculating the inverse of matrix Z(i)

based on the inverse of matrix Z(i−1) will be given. These discussion will give us a flavor of

the final method we employ to calculate the inverse matrix.

Suppose in last step we have matrix Z(0) = [Z1, ..., Zi, Zi+1, ..., Z2s], and in the

current step the jth and the (j + 1)th column of Z(0) will be updated to Zj and Zj+1,

as for each iteration only one column of SNPs are updated, and we use 2 columns of the

111

Page 112: BAYESIAN METHODOLOGIES FOR GENOMIC DATA WITH …ufdcimages.uflib.ufl.edu/UF/E0/02/24/60/00001/li_z.pdfChair: George Casella Major: Statistics With advancing technology, large single

design matrix to record one column of SNPs. That is, the current updated SNP matrix

is Z(1) = [Z1, ..., Zj, Zj+1, ..., Z2s]. As showed in above, we are interested in calculating(Z

(i)l

′Z

(i)l

)−1

and∣∣∣Z(i)

l

′Z

(i)l

∣∣∣. Although Zl is not exactly the same as Z, here we will just

discuss how to obtain(Z ′

(1)Z(1)

)−1

and we make the adjustments when programming. So

adapt to the discussion here, we are interested in the finding(Z ′

(1)Z(1)

)−1

. Let us first

write out Z ′(1)Z(1) here.

We can write

Z ′(1)Z(1) =

Z1′

...

Z ′j

Z ′j+1

...

Z2s′

(Z1 · · · Zj Zj+1 · · · Z2s

), (3–40)

and doing the multiplication we have

Z ′(1)Z(1) =

Z1′Z1 · · · Z1

′Zj Z1′Zj+1 · · · Z1

′Z2s

......

Z ′jZ1 · · · Z ′

jZj Z ′jZj+1 · · · Z ′

jZ2s

Z ′j+1Z1 · · · Z ′

j+1Zj Z ′j+1Zj+1 · · · Z ′

j+1Z2s

......

Z2s′Z1 · · · Z2s

′Zj Z2s′Zj+1 · · · Z2s

′Z2s

.

Next we will use A to denote Z ′(0)Z(0) and use A1 to denote Z ′

(1)Z(1). Further we can

write

A1 = (A + Ha + Hb),

112

Page 113: BAYESIAN METHODOLOGIES FOR GENOMIC DATA WITH …ufdcimages.uflib.ufl.edu/UF/E0/02/24/60/00001/li_z.pdfChair: George Casella Major: Statistics With advancing technology, large single

where

(Ha)2s×2s =

0 · · · 0 (Z1′Zj − Z1

′Zj) (Z1′Zj+1 − Z1

′Zj+1) 0 · · · 0

......

0 · · · 0 (Z ′jZj − Zj

′Zj) (Z ′jZj+1 − Zj

′Zj+1) 0 · · · 0

0 · · · 0 (Z ′j+1Zj − Zj+1

′Zj) (Z ′j+1Zj+1 − Zj+1

′Zj+1) 0 · · · 0

......

0 · · · 0 (Z2s′Zj − Z2s

′Zj) (Z2s′Zj+1 − Z2s

′Zj+1) 0 · · · 0

,

and

(Hb)2s×2s =

0 · · · 0 0 · · · 0

......

0 · · · 0 0 · · · 0

(Zj′Z1 − Zj

′Z1) · · · 0 0 · · · (Zj′Za − Zj

′Z2s)

( ˙Zj+1′Z1 − Zj+1

′Z1) · · · 0 0 · · · ( ˙Zj+1′Z2s − Zj+1

′Z2s)

0 · · · 0 0 · · · 0

......

= 0 · · · 0 0 · · · 0

.

The above equations just show that the difference between the previous matrix A and the

updated matrix A1 is just 2 rows and 2 columns. This is a special structure and we want

to take advantage of this structure.

We already have(Z ′

(0)Z(0)

)−1

, that is, have A−1 from last step and notice rank(Ha) =

2 and rank(Hb) = 2. We can write the following

A1 = (A + Ha + Hb) = AB, (3–41)

where

B = I + A−1H1 + A−1H2. (3–42)

113

Page 114: BAYESIAN METHODOLOGIES FOR GENOMIC DATA WITH …ufdcimages.uflib.ufl.edu/UF/E0/02/24/60/00001/li_z.pdfChair: George Casella Major: Statistics With advancing technology, large single

According to Miller (1981), if H is a rank 2 matrix, we have the following formula

(I + H)−1 = I − 1

a + b(aH −H2), (3–43)

where

a = 1 + tr(H)

and

2b = tr(H)2 − tr(H2).

For our situation, if we are able to calculate B−1, then A1−1 = B−1A−1. We can write

B = (I + A−1Ha + A−1Hb) = (I + A−1Ha)(I +

(I + A−1Ha

)−1A−1Hb

). (3–44)

Then the problem becomes to find

(I + A−1Ha)−1

and(I +

(I + A−1Ha

)−1A−1Hb

)−1

.

Suppose we are able to calculate

C = (I + A−1Ha)−1,

then the problem is to find (I + CA−1Hb)−1. Then we will have

B−1 = (I + CA−1H2)−1C. (3–45)

The question boils down to calculating C and (I + CA−1Hb)−1. For both problems, we can

apply Equation 3–43 to solve it.

As for calculating A−11 , we will have the following sub-steps:

1. Calculate Ha and Hb

2. • calculate ac = 1 + tr(A−1Ha),

114

Page 115: BAYESIAN METHODOLOGIES FOR GENOMIC DATA WITH …ufdcimages.uflib.ufl.edu/UF/E0/02/24/60/00001/li_z.pdfChair: George Casella Major: Statistics With advancing technology, large single

Table 3-1. Comparison of time spent on inverse calculation using standard software andMiller’s method with 2 columns and 2 rows updated.

Matrix dimension: 100× 100 500× 500 1000× 1000 2000× 2000

Matlab’s direct calculation 0.000s 0.235s 1.609 11.985sMiller’s method 0.016s 1.454s 11.937s 99.703s

• calculate bc =(tr(A−1Ha))

2−tr((A−1Ha)

2)

2,

• and C = I − 1ac+bc

(ac (A−1Ha)− (A−1Ha)

2)

3. • calculate D = CA−1Hb,• calculate ad = 1 + tr(D),

• bd = (tr(D))2−tr(D2)2

,• calculate (I + D)−1 = I − 1

ad+bd(adD −D2),

4. calculate B−1 = (I + D)−1C,

5. finally, A1−1 = B−1A−1.

The above steps are needed to get one matrix inversion A−11 given that we have A−1

from last step. To get a Bayes factor estimation, we need repeat these steps n times. We

did a simulation for the above method and it shows that when two columns of matrix

are updated, the actual calculation time is not shortened, compared with the standard

software such as R and Matlab. The following Table 3-1 records the time spent on matrix

inversion with different methods.

To speed up the computation, we change the coding of SNPs from 2 parameters per

SNP to 1 parameter per SNP. Now we assume that the effect of a SNP is linear to the

number of copies of certain allele within that SNP, instead of parameterizing by additive

and dominant effect. By changing the coding method, we decrease the steps needed for

one Bayes factor to one half, since now there is only one column update in Z(i) instead of

2 columns.

As mentioned in the above, we decide to code the SNPs in another way: using one

column of parameters for one column of SNPs. In that way, the design matrix for the

SNPs records the number of copy of allele in that SNP. For example, we use γ to represent

the allele “C”’s effect. Then for an individual with SNP genotype “CC”, the SNP effect

115

Page 116: BAYESIAN METHODOLOGIES FOR GENOMIC DATA WITH …ufdcimages.uflib.ufl.edu/UF/E0/02/24/60/00001/li_z.pdfChair: George Casella Major: Statistics With advancing technology, large single

would be 2γ, 1γ for SNP with genotype “CG”, and 0γ for SNP with genotype “GG”.

With the above coding, we went back to re-run the Gibbs sampling for the full model. we

used “1”, “0” and “− 1” instead of “2”, “1” and,“0”, which is equivalent.

The following method is directly due to Sadighi and Kalra (1988) and it is an

application of the Sherman-Morrison-Woodbury formula; see Woodbury (1950) and

Bartlett (1951). A review paper about the Sherman-Morrison-Woodbury formula is

written by Hager (1989).

Suppose we have the inverse of a matrix A, A−1, and we update the jth column

of A to be v. Let (A)j denote the jth column of A. Let ej to be a vector of all zeros,

except the jth element to be 1. Also let A1 to be the updated matrix. To find A−11 =

(A + (v − (A)j)eTj )−1, the following steps are needed:

• calculate b = A−1(v − (A)j)

• calculate b′ = − 11+(b)j

b, where (b)j is the jth element of vector b.

• then A−11 = (I + b′eT

j )A−1 = A−1 + b′eTj A−1

Obviously b′eTj is equivalent to the product of a column vector and a row vector. The

calculation for b is the product of a matrix and a vector. The calculation of b′ is a vector

multiplying a scale. All these calculations are not matrix production and they decrease the

computation a lot.

Next we will show why the above method works. First write

A1 = AB,

then

(A1)−1 = B−1A−1.

As we already have A−1, we need to find B and B−1, such that A1 = AB. If we let

B = I + beTj ,

116

Page 117: BAYESIAN METHODOLOGIES FOR GENOMIC DATA WITH …ufdcimages.uflib.ufl.edu/UF/E0/02/24/60/00001/li_z.pdfChair: George Casella Major: Statistics With advancing technology, large single

⇒ A + (v − (A)j) eTj = A · (I + beT

j ) (3–46)

⇒ (v − (A)j) eTj = A · beT

j

⇒ v − (A)j = Ab

⇒ b = A−1(v − (A)j)

Furthermore,

B−1 = (I + beTj )−1 =

1 · · · 0 − b11+bj

0 · · · 0

.... . .

......

......

0 · · · 1 − bj−1

1+bj0 · · · 0

0 · · · 0 11+bj

0 · · · 0

0 · · · 0 − bj+1

1+bj1 · · · 0

......

......

. . ....

0 · · · 0 − bn

1+bj0 · · · 1

= I − b

(b)j

· eTj (3–47)

Similarly if the jth row of matrix A1, A(j)1 , is updated to be µ, and A2 is the updated

matrix. Then

c = (µ− A(j)1 )A−1

1 ,

C = (I + ej cT ),

and

A2 = CA1.

⇒ C−1 = I − 1

(c)j

ej c,

where (c)j is the jth element of vector c and

⇒ (A2)−1 = A−1

1 C−1 = A−11 (I − 1

(c)j

ej c) = A−11 − A1

1 + (cj)ej c. (3–48)

117

Page 118: BAYESIAN METHODOLOGIES FOR GENOMIC DATA WITH …ufdcimages.uflib.ufl.edu/UF/E0/02/24/60/00001/li_z.pdfChair: George Casella Major: Statistics With advancing technology, large single

Table 3-2. Comparison of time spent in calculation of matrix inversion for differentmethods.

Matrix dimension: 500× 500 1000× 1000 1500× 1500 2000× 2000

Matlab’s direct calculation 0.266s 1.547s 5.016 11.703sMiller’s method 0.031s 0.188s 0.32s 0.719s

Sanighi & Kalra’s direct calculation 0.375s 2.735s 9.453s 21.657sSanighi & Kalra’s simplification 0.032s 0.11s 0.235s 0.531s

For our situation, when we use one parameter for one SNP, in each Gibbs sampler

iteration, we just update the jth column and jth row of the matrix Z ′Z and keep the

other parts untouched. Next we want to show a time table comparison for the calculation

of matrix inversion with different dimensions. For Sanighi & Kalra’s direct calculation

is I just follow their formula and let Matlab calculate the matrix multiplication directly.

Sanighi & Kalra’s simplification means to directly specify the cells and vectors without

matrix multiplication. Obviously, we will take the last method, and it is 20 times faster

than the default calculation of Matlab. We did the test in R and the results are similar.

3.4.1.2 Determinant calculation

Next question is to calculate the determinant of matrix A when the jth row and jth

column of A is updated to v and uT . We assume that we have A−1 and |A| at hand. Let

us first just try to get the determinant of the matrix with one column updated. After

that, we can apply the same method to get the determinant with both row and column

updated.

We write

A1 = A + veTj , (3–49)

where ej is the vector with only the jth element to be 1 and all others to be 0. As shown

before, we can write

A1 = AB,

with

B = I + beTj ,

118

Page 119: BAYESIAN METHODOLOGIES FOR GENOMIC DATA WITH …ufdcimages.uflib.ufl.edu/UF/E0/02/24/60/00001/li_z.pdfChair: George Casella Major: Statistics With advancing technology, large single

and

b = A−1(v − Aj),

where Aj is the jth column of A. According to the theory of determinants of product of

matrices, when two square matrices G and H have the same dimensions, we have

|GH| = |G| × |H|. (3–50)

So for our situation,

|A1| = |A| × |B| = |A| × |I + beTj |.

Next we will show

|I + beTj | = (1 + bj),

where bj is the jth element of vector b. First we write

|I + beTj | =

∣∣∣∣∣∣∣∣∣∣∣∣∣∣∣∣∣∣∣∣∣∣∣

1 · · · 0 b1 0 · · · 0

.... . .

......

......

0 · · · 1 bj−1 0 · · · 0

0 · · · 0 1 + bj 0 · · · 0

0 · · · 0 bj+1 1 · · · 0

......

......

. . ....

0 · · · 0 bn 0 · · · 1

∣∣∣∣∣∣∣∣∣∣∣∣∣∣∣∣∣∣∣∣∣∣∣

(3–51)

119

Page 120: BAYESIAN METHODOLOGIES FOR GENOMIC DATA WITH …ufdcimages.uflib.ufl.edu/UF/E0/02/24/60/00001/li_z.pdfChair: George Casella Major: Statistics With advancing technology, large single

If we multiply the jth row with (− bn

1+bj) and add it to the nth row for the above matrix,

we will get

|I + beTj | =

∣∣∣∣∣∣∣∣∣∣∣∣∣∣∣∣∣∣∣∣∣∣∣

1 · · · 0 b1 0 · · · 0

.... . .

......

......

0 · · · 1 bj−1 0 · · · 0

0 · · · 0 1 + bj 0 · · · 0

0 · · · 0 bj+1 1 · · · 0

......

......

. . ....

0 · · · 0 bn 0 · · · 1

∣∣∣∣∣∣∣∣∣∣∣∣∣∣∣∣∣∣∣∣∣∣∣

=

∣∣∣∣∣∣∣∣∣∣∣∣∣∣∣∣∣∣∣∣∣∣∣

1 · · · 0 b1 0 · · · 0

.... . .

......

......

0 · · · 1 bj−1 0 · · · 0

0 · · · 0 1 + bj 0 · · · 0

0 · · · 0 bj+1 1 · · · 0

......

......

. . ....

0 · · · 0 0 0 · · · 1

∣∣∣∣∣∣∣∣∣∣∣∣∣∣∣∣∣∣∣∣∣∣∣

.

(3–52)

Similarly, for all other rows in the above matrix we can do the similar operations. For

example, we can multiply (− bj

1+bj) to the jth row and add it to the jth row. Then the jth

row will become eTj .

120

Page 121: BAYESIAN METHODOLOGIES FOR GENOMIC DATA WITH …ufdcimages.uflib.ufl.edu/UF/E0/02/24/60/00001/li_z.pdfChair: George Casella Major: Statistics With advancing technology, large single

So

|I + beTj | =

∣∣∣∣∣∣∣∣∣∣∣∣∣∣∣∣∣∣∣∣∣∣∣

1 · · · 0 b1 0 · · · 0

.... . .

......

......

0 · · · 1 bj−1 0 · · · 0

0 · · · 0 1 + bj 0 · · · 0

0 · · · 0 bj+1 1 · · · 0

......

......

. . ....

0 · · · 0 bn 0 · · · 1

∣∣∣∣∣∣∣∣∣∣∣∣∣∣∣∣∣∣∣∣∣∣∣

=

∣∣∣∣∣∣∣∣∣∣∣∣∣∣∣∣∣∣∣∣∣∣∣

1 · · · 0 b1 0 · · · 0

.... . .

......

......

0 · · · 1 bj−1 0 · · · 0

0 · · · 0 1 + bj 0 · · · 0

0 · · · 0 bj+1 1 · · · 0

......

......

. . ....

0 · · · 0 0 0 · · · 1

∣∣∣∣∣∣∣∣∣∣∣∣∣∣∣∣∣∣∣∣∣∣∣

= · · · =

∣∣∣∣∣∣∣∣∣∣∣∣∣∣∣∣∣∣∣∣∣∣∣

1 · · · 0 0 0 · · · 0

.... . .

......

......

0 · · · 1 0 0 · · · 0

0 · · · 0 1 + bj 0 · · · 0

0 · · · 0 0 1 · · · 0

......

......

. . ....

0 · · · 0 0 0 · · · 1

∣∣∣∣∣∣∣∣∣∣∣∣∣∣∣∣∣∣∣∣∣∣∣

= (1 + bi). (3–53)

So we have

|A1| = |A| |I + beTj | = |A|(1 + bj), (3–54)

where bj = (A−1)(j)(v − Aj) and (A−1)(j) is the jth row of A−1 .

When we update the kth row of matrix A1 to be uT , we write

A2 = A1 + ek(u− A(k)1 )T = CA1, (3–55)

with

C = (I + ekcT ),

and

c = (u− A(k)1 )A−1

1 .

121

Page 122: BAYESIAN METHODOLOGIES FOR GENOMIC DATA WITH …ufdcimages.uflib.ufl.edu/UF/E0/02/24/60/00001/li_z.pdfChair: George Casella Major: Statistics With advancing technology, large single

So doing the same calculation as above, we will see

|A2| = (1 + ck)|A1|,

where

ck = (u− A(k)1 )(A−1

1 )k,

with A(k) is the kth row of A and (A−11 )k is the kth column of A−1

1 .

Finally

|A2| = |A|(1 + bj)(1 + ck), (3–56)

with ck and bj defined as above.

3.4.2 Replace Z with an Average

The above matrix inversion method did speed up the computation. However, we

would like to see some further improvement of the computation. In this subsection, we try

to avoid all the n times matrix inversion and determinant calculation by substituting the

Z(i) with Z.

In the previous work we showed that the following formula can be used for the Bayes

factor approximation.

1

n

n∑i=1

(2πσ2(i))−

sl2 |Z(i)

l

′Z

(i)l |

12 exp

(− (Y−Xβ(i)−Z

(i)δ γ

(i)δ )′P (i)

l (Y−Xβ(i)−Z(i)δ γ

(i)δ )

2σ2(i)

)

(2πσ2(i)φ2(i))−sl2 exp

(− |γ(t)

l |22σ2(i)φ2(i)

) → BFδ, δ=1,

(3–57)

where all the samples are from the posterior distribution

π1(β, γ, σ2, φ2, Zmis|Y ).

Theoretically the above method is applicable, however when the method is applied to large

data set with many SNPs, the computation speed is slow. Suppose we want to calculate a

122

Page 123: BAYESIAN METHODOLOGIES FOR GENOMIC DATA WITH …ufdcimages.uflib.ufl.edu/UF/E0/02/24/60/00001/li_z.pdfChair: George Casella Major: Statistics With advancing technology, large single

Bayes factor BFδ, δ=1 with n samples of

(β(i), γ(i), σ2(i), φ2(i)

, Z(i)mis),

we need to calculate n times the determinant for |Z(i)l

′Z

(i)l | and n times the matrix inverse

(Z(i)l

′Z

(i)l )−1. In the above notation, Zl and Zδ contain both observed SNPs and missing

SNPs. When we use the notation Z(i)l , we implicitly mean the ith sample of missing SNPs

in the matrix of Zl and the observed SNPs are automatically known in matrix Zl. The

same thing is implied in the notation of Zδ and we hope the meanings of the notation are

clear in different situations. Although in the last subsection we showed some methods

which avoid direct calculation of (Z(i)l

′Z

(i)l )−1 based on the previous samples, still the

computation time is a concern and in this subsection we want explore some other method

to improve it.

If the Z(i) = (Z(i)δ , Z

(i)l ) matrix does not change with i, then for each Bayes factor

estimation, instead of calculating the determinant and matrix inverse n times, we just

need to do one time matrix determinant and matrix inverse. The method proposed here

is to replace all the samples of (Z(i)δ , Z

(i)l ) with the expectation (EZδ, EZl), EZ. In the

following we justify the replacement with EZ.

Originally we want to calculate the Bayes factor BFδ, δ=1 and we showed that it is

equal to

∫· · ·

∫ (2πσ2)−sl2 |Zl

′Zl| 12 exp(− (Y−Xβ−Zδγδ)′Pl(Y−Xβ−Zδγδ)

2σ2

)

(2πσ2φ2)−sl2 exp

(− |γl|2

2σ2φ2

) (3–58)

×π1(β, γ, σ2, φ2, Zmiss|Y ) dθ dγ dσ2 dφ2 dZmis.

Now simplify the notation. We will use θ to represent the parameters (β, γ, σ2, φ2) and let

h (θ, Z) to denote the function

h (θ, Z) =(2πσ2)−

sl2 |Zl

′Zl| 12 exp(− (Y−Xβ−Zδγδ)′Pl(Y−Xβ−Zδγδ)

2σ2

)

(2πσ2φ2)−sl2 exp

(− |γl|2

2σ2φ2

) . (3–59)

123

Page 124: BAYESIAN METHODOLOGIES FOR GENOMIC DATA WITH …ufdcimages.uflib.ufl.edu/UF/E0/02/24/60/00001/li_z.pdfChair: George Casella Major: Statistics With advancing technology, large single

Also use f(θ, Z) to denote the posterior distribution π1(β, γ, σ2, φ2, Zmis|Y ). So our target

integral becomes ∫ ∫h(θ, Z)f(θ, Z) dθ dZ, (3–60)

here the integral is implicitly with respect to the missing data in Z. As mentioned in the

above, we want to approximate Equation 3–60 by

∫ ∫h(θ, EZ)f(θ, Z) dθ dZ. (3–61)

Rewrite the above integrals as

∫ ∫h(θ, Z)f(θ, Z) dθ dZ =

∫ [∫h(θ, Z)f(Z|θ) dZ

]f(θ) dθ,

and ∫ ∫h(θ, EZ)f(θ, Z) dθ dZ =

∫[h(θ, EZ)] f(θ) dθ.

To fully justify the action of replacing Z with EZ in Equation 3–61, we need to check the

following:

∫ ∫h(θ, Z)f(θ, Z) dθ dZ −

∫ ∫h(θ, EZ)f(θ, Z) dθ dZ

=

∫ [∫h(θ, Z)f(Z|θ) dZ

]f(θ) dθ −

∫ ∫[h(θ, EZ)f(Z|θ) dZ] f(θ) dθ

=

∫ [∫h(θ, Z)f(Z|θ) dZ −

∫h(θ, EZ)f(Z|θ) dZ

]f(θ) dθ. (3–62)

Further, if we can show that∫

(h(θ, Z)f(Z|θ) dZ − ∫h(θ, EZ)f(Z|θ) dZ is close to zero, it

is sufficient to say that Equation 3–62 is close to zero under very general conditions. Now

124

Page 125: BAYESIAN METHODOLOGIES FOR GENOMIC DATA WITH …ufdcimages.uflib.ufl.edu/UF/E0/02/24/60/00001/li_z.pdfChair: George Casella Major: Statistics With advancing technology, large single

expand (h(θ, Z)− h(θ, EZ)) in Taylor series at EZ

h(θ, Z)− h(θ, EZ) (3–63)

≈ (h(θ, EZ)− h(θ, EZ)) + (Z − EZ)

(∂h(θ, Z)

∂Z|Z=EZ

)

+1

2(Z − EZ)′

∂2[h(θ, Z)]

∂Z2|Z=EZ (Z − EZ)

= (Z − EZ)

(∂h(θ, Z)

∂Z|Z=EZ

)+

1

2(Z − EZ)′

∂2 [h(θ, Z)]

∂Z2|Z=EZ (Z − EZ)

So we can write

∫h(θ, Z)f(Z|θ) dZ −

∫h(θ, EZ)f(Z|θ) dZ (3–64)

=

(∂h(θ, Z)

∂Z|Z=EZ

) ∫(Z − EZ)f(Z|θ) dZ

+

∫1

2(Z − EZ)T ∂2[h(θ, Z)]

∂Z2|Z=EZ (Z − EZ)f(Z|θ) dZ

If we can argue that under certain conditions, either

EZ ≈ EZ|θ(Z) ⇐⇒∫

(Z − EZ)f(Z|θ) dZ ≈ 0

or (∂h(θ, Z)

∂Z|Z=EZ

)≈ 0,

then (∂h(θ, Z)

∂Z|Z=EZ

) ∫(Z − EZ)f(Z|θ) dZ ≈ 0.

Further if we neglect the second term in Equation 3–64, then we can say

∫h(θ, Z)f(Z|θ) dZ −

∫h(θ, EZ)f(Z|θ) dZ ≈ 0.

One step further,

∫ ∫h(θ, Z)f(Z|θ) dZ −

∫h(θ, EZ)f(Z|θ) dZ

f(θ) dθ ≈ 0.

125

Page 126: BAYESIAN METHODOLOGIES FOR GENOMIC DATA WITH …ufdcimages.uflib.ufl.edu/UF/E0/02/24/60/00001/li_z.pdfChair: George Casella Major: Statistics With advancing technology, large single

As we know that the Z matrix represents the design matrix for the SNPs, and about

90% SNPs are observed, that is, the expectation of the observed SNPs of Z matrix is

always fixed as the actual observed values. The actual expectation calculation would occur

only in the 10% of missing SNPs. So we may say that EZ is very close to EZ|θ(Z) and EZ

is calculated by the Z.

We did some numerical simulation to compare the Bayes factor calculations. In this

simulated data set, there are 15 SNPs 20 families and 800 observations.

Table 3-3 records the different model indicators used for the simulation studies in

Table 3-4 and 3-5. We let the δs vary so that they could represent the possible different

model indicators that might occur.

Table 3-3. Records of subsets indicators with actual values of γ for Table 3-4 and Table3-5.

Indicator vector δActual values of γ: 2 0 1 −1 −3 0.1 0 0 0 0 −1 3 0 0 −3

δ1: 1 0 1 1 1 0 0 0 0 0 1 1 0 0 1δ2: 1 0 1 1 1 0 0 1 0 0 0 0 0 0 0δ3: 1 0 1 1 0 1 0 0 0 0 1 1 0 0 0δ4: 1 0 1 1 1 0 0 0 0 0 0 1 0 0 0δ5: 1 0 0 0 0 1 0 0 0 0 1 1 0 0 1δ6: 1 0 1 1 1 0 0 0 0 0 0 1 0 0 1δ7: 1 0 1 1 1 1 0 0 0 0 1 1 0 0 1δ8: 1 0 1 1 1 1 0 1 1 0 1 1 0 0 1δ9: 0 0 0 0 1 0 0 0 0 0 0 1 0 0 0δ10: 1 1 1 1 1 1 0 0 1 0 1 1 0 0 1δ11: 1 0 1 1 1 1 0 1 0 1 1 1 1 0 1δ12: 1 1 1 1 1 1 0 1 0 0 1 1 0 0 1δ13: 1 0 1 1 1 1 1 0 0 1 1 1 0 0 1δ14: 1 0 1 1 1 1 0 1 0 0 1 1 1 0 1δ15: 1 0 1 1 1 1 1 1 0 1 1 1 0 1 1

Table 3-4 shows the calculation results of Bayes factors by using different formula

of Bayes factor approximation. The upper part of the table uses samples from the Gibbs

sampler where only one column of missing SNPs are updated per cycle while the bottom

part of the table uses samples from the Gibbs sampler where all the missing SNPs are

updated in one cycle. The first column of Table 3-4 is the indicator vector and the actual

126

Page 127: BAYESIAN METHODOLOGIES FOR GENOMIC DATA WITH …ufdcimages.uflib.ufl.edu/UF/E0/02/24/60/00001/li_z.pdfChair: George Casella Major: Statistics With advancing technology, large single

Table 3-4. Bayes factor calculation approximation. Use first 20000 iteration as burnin andtake the following 400 samples for the calculation.

sample one column of SNP per iterationdifferent δ applying formula (3–57) Zbar original Z

δ1 4.0057e + 009 7.6668e + 009 5.6653e + 009δ2 6.055e− 036 2.1169e− 052 4.0221e− 062δ3 1.7917e− 042 2.1869e− 057 2.3773e− 065δ4 8.2841e− 014 2.558e− 019 2.4601e− 021δ5 6.2531e− 016 1.9721e− 021 1.5186e− 031δ6 6.0943e + 006 1.6279e + 007 9.3191e + 006δ7 2.3314e + 008 2.8145e + 008 3.8463e + 008δ8 7.4575e + 005 1.1799e + 006 1.1003e + 006δ9 4.8807e− 032 2.0811e− 043 1.3484e− 052δ10 7.6769e + 005 1.2058e + 006 1.3248e + 006δ11 2.9626e + 004 2.5816e + 004 3.8642e + 004δ12 8.2784e + 005 1.1381e + 006 9.546e + 005δ13 8.2753e + 005 6.0808e + 005 9.0081e + 005δ14 5.8732e + 005 5.8733e + 005 6.8705e + 005δ15 4.2755e + 003 4.3482e + 003 3.1384e + 003

sample all SNPs together in one cycledifferent δ applying formula (3–57) Zbar original Z

δ1 2.2281e + 008 1.6784e + 008 3.253e + 008δ2 2.6563e− 047 2.5197e− 059 2.6204e− 066δ3 6.7539e− 047 1.8281e− 063 1.2993e− 068δ4 1.6489e− 017 1.0475e− 024 2.7792e− 028δ5 5.4427e− 024 1.6637e− 033 2.0014e− 042δ6 3.411e + 009 3.5043e + 008 1.9252e + 007δ7 4.8621e + 006 4.6095e + 006 1.2051e + 007δ8 2.7196e + 004 1.2130e + 004 5.5845e + 004δ9 9.0771e− 027 1.3545e− 043 2.1829e− 049δ10 1.1878e + 005 5.2192e + 004 1.2118e + 005δ11 6.8893e + 003 4.6568e + 003 1.4376e + 004δ12 6.6379e + 004 3.1702e + 004 7.7065e + 004δ13 2.3339e + 004 1.7211e + 004 5.5039e + 004δ14 1.028e + 005 1.0241e + 005 3.2299e + 005δ15 228.4 175.74 562.88

values of δ are from Table 3-3. The second column is the Bayes factor estimates using

the formula we originally planned to use and it ensures the Bayes factor estimates are

consistent. The third column is the Bayes factor estimates, where the missing SNPs are

replaced with the averages of imputed values Z. The fourth column is the Bayes factor

127

Page 128: BAYESIAN METHODOLOGIES FOR GENOMIC DATA WITH …ufdcimages.uflib.ufl.edu/UF/E0/02/24/60/00001/li_z.pdfChair: George Casella Major: Statistics With advancing technology, large single

estimates while using the true genotypes for the missing SNPs. In real data analysis, it is

impossible to know the actual genotypes of the missing SNPs. In simulation study, we are

able to take the advantage of the simulated data set and compare the influence of different

methods. We could see that, the estimates using the Zbar are accurate when the Bayes

factors are big. When the Bayes factors are very small, the difference of estimates are due

to the random simulation errors. As long as they are very small, they have almost similar

effects when compared with big Bayes factors and it almost does not matter how small the

actual values are.

Table 3-5 again are the Bayes factor estimates using different methods for the missing

values. The second column assumes there is no missing values in the data set. that is,

I did not generate the missing data for this analysis. In the third and fourth columns, I

assign the missing values either all 1 or all −1 to test the effect of the extreme handling of

missing values. The upper part of the table using samples from the Gibbs sampler where

the missing SNPs were imputed 1 column per cycle while the bottom part of the table

are the estimates of the Bayes factors where all the missing SNPs were imputed in each

cycle. It shows, these extreme handling of missing SNPs do not give proper estimates of

Bayes factors. Combining Table 3-4 and Table 3-5, we conclude that using Z gives proper

estimates and yet it provides fast calculation.

3.5 Simulation and Real Data Analysis

3.5.1 Simulation

In the above sections, we talked about the Bayes factor approximation, the stochastic

variable search algorithm, the egodic property of the search algorithm, and the problem

of computation. In this section, we will show some simulation results when the above

techniques are applied.

In these simulation studies, we have 15 SNPs and the values of the SNPs are listed

in Table 3-6. 10% of missing SNPs were generated at random. γ is the parameter for the

SNP effect and γ1 is for SNP1, γ2 is for SNP2, etc.

128

Page 129: BAYESIAN METHODOLOGIES FOR GENOMIC DATA WITH …ufdcimages.uflib.ufl.edu/UF/E0/02/24/60/00001/li_z.pdfChair: George Casella Major: Statistics With advancing technology, large single

The rows of the Table 3-6 represent the different models that the stochastic chain

visited. Any model that has less than 1% frequency of visits is not listed in the table.

From the first column of Table 3-6, we could see that two rows have higher frequencies

of visits than all other rows. Obviously, the stochastic chain spends 25% of its time on a

model which captures all the significant variables, and the chain spends about one half

of its time on a model which captures all the significant variables plus SNP7, SNP8, and

SNP8. The effects of SNP7, SNP8, and SNP8 are not as big as other variables in the real

model, but their effects are not insignificant according the criterion of Bayes factor. One

thing that needs to be mentioned is that the stochastic chain only spent about 4% of its

time on the model which includes all the variables, which is a desirable property since our

goal is to discover simple subsets which represent the essential combinations of variables

instead of taking the complicated models such as the full model.

3.5.2 Real Data Analysis

After some simulation studies, we applied the methodology to the loblolly data. The

response here is the lesion length. We use the average of imputed missing data to calculate

the Bayes factor estimates. The first column of Table 3-7 are the frequencies of visits the

chain spent on different models. As the model sample space is huge, 244, it is very likely

that once the chain leaves the visited model, it never comes back to the previous model.

In the table are listed several models that the chain spent more time on than other models

that are not listed. Although none of the frequencies of visits are substantially higher than

others, the analysis provides some subsets of variables to be further investigated at. One

thing to notice is that when we average the visits per SNP, several SNPs have more than

50% visit frequencies, and these SNPs appear often in the selected models too. If we are

interested in investigate the SNPs one by one, these SNPs definitely are good candidates.

129

Page 130: BAYESIAN METHODOLOGIES FOR GENOMIC DATA WITH …ufdcimages.uflib.ufl.edu/UF/E0/02/24/60/00001/li_z.pdfChair: George Casella Major: Statistics With advancing technology, large single

Table 3-5. Bayes factor calculation comparisons. Use first 40000 iteration as burnin andtake the following 400 samples for the calculation.

sample one column of SNP per iterationdifferent without assign all assign all missing

δ missing SNP missing values one values minus one

δ1 8.5596e + 006 8.3541e + 006 5.9273e + 006δ2 0 4.1587e + 009 5.7762e + 009δ3 0 8.0566e + 007 8.3811e + 007δ4 0 9.5483e + 008 7.0391e + 008δ5 0 9.2959e + 008 1.073e + 009δ6 3.1286e− 090 1.0311e + 008 8.463e + 007δ7 1.0003e + 008 1.2594e + 006 9.3706e + 005δ8 4.1181e + 005 46386 30518δ9 0 2.0545e + 012 1.7827e + 012δ10 2.4772e + 005 20359 13880δ11 22953 3.6538e + 003 4.0996e + 003δ12 3.2352e + 005 2.9033e + 004 1.6675e + 004δ13 1.0803e + 006 1.2841e + 004 1.1535e + 004δ14 3.8579e + 005 4.2515e + 004 3.8305e + 004δ15 8331.8 3.1579e + 002 1.6113e + 002

sample all SNPs together in one cycledifferent without assign all assign all missing

δ missing SNP missing values one values minus one

δ1 4.2922e + 008 1.7333e + 006 1.49e + 006δ2 0 2.9384e + 008 3.7949e + 008δ3 0 4.2674e + 006 5.7925e + 006δ4 0 1.0262e + 008 1.4015e + 008δ5 0 5.8757e + 006 1.2619e + 007δ6 2.5633e− 102 1.483e + 007 1.5062e + 007δ7 6.8181e + 007 1.6452e + 005 1.2833e + 005δ8 9.6396e + 005 4.3144e + 003 4.392e + 003δ9 0 1.4913e + 010 1.7881e + 010δ10 2.3864e + 006 1.5517e + 003 1.8624e + 003δ11 11910 1.6863e + 003 1.0575e + 003δ12 3.2788e + 005 4.467e + 004 5.9156e + 003δ13 1.8485e + 005 2.1826e + 004 1.6741e + 004δ14 2.6493e + 005 11770 6394.3δ15 371.79 326.27 396.23

130

Page 131: BAYESIAN METHODOLOGIES FOR GENOMIC DATA WITH …ufdcimages.uflib.ufl.edu/UF/E0/02/24/60/00001/li_z.pdfChair: George Casella Major: Statistics With advancing technology, large single

Table 3-6. Simulation results of Bayesian variable selection for 15 SNPs and 450observations, 10% random missing values. Using the Bayes factor estimationformula

posterior Actual values of the γmodel γ1 = 2, γ2 = 0, γ3 = 1, γ4 = −1, γ5 = −3,γ6 = 0, γ7 = 0.5, γ8 = 0.1

probability γ9 = 0.3, γ10 = 0, γ11 = −1, γ12 = 3, γ13 = 0, γ14 = 0, γ15 = −3

0.0036 no variables0.0001 γ15

0.001 γ11, γ15

0.0001 γ7, γ11, γ15

0.0008 γ1, γ7, γ11, γ15

0.002 γ1, γ4, γ7, γ11, γ15

0.0001 γ1, γ4, γ5, γ7, γ11, γ15

0.0001 γ1, γ4, γ5, γ7, γ11, γ12, γ15

0.2533 γ1, γ3, γ4, γ5, γ7, γ11, γ12, γ15

0.0608 γ1, γ3, γ4, γ5, γ7, γ8, γ11, γ12, γ15

0.5496 γ1, γ3, γ4, γ5, γ7, γ8, γ9, γ11, γ12, γ15

0.0045 γ1, γ3, γ4, γ5, γ7, γ8, γ9, γ11, γ12, γ13, γ15

0.0506 γ1, γ3, γ4, γ5, γ6, γ7, γ8, γ9, γ11, γ12, γ13, γ15

0.0116 γ1, γ2, γ3, γ4, γ5, γ6, γ7, γ8, γ9, γ11, γ12, γ13, γ15

0.0428 γ1, γ2, γ3, γ4, γ5, γ6, γ7, γ8, γ9, γ11, γ12, γ13, γ15

0.0428 γ1, γ2, γ3, γ4, γ5, γ6, γ7, γ8, γ9, γ10, γ11, γ12, γ13, γ14, γ15

Table 3-7. Bayes variable selection for lesion length data set. Using the average ofimputed missing SNPs as if observed. 20000 steps of burnin and another 20000steps as samples.

posterior modelSubsets of variablesprobability

>= 0.29%

0.4% γ9, γ10, γ12, γ14, γ15, γ16, γ22, γ30, γ31, γ40, γ43

0.39% γ8, γ10, γ12γ15, γ19, γ21, γ22, γ23, γ24, γ27, γ28, γ29, γ31, γ34, γ36, γ39, γ42

0.35% γ3, γ9, γ12γ13, γ14, γ15, γ21, γ27, γ31, γ34, γ38, γ40, γ42, γ43

0.32% γ12, γ15, γ16γ18, γ22, γ29, γ31, γ32, γ41, γ42, γ43, γ43

0.31% γ1, γ9, γ13γ18, γ19, γ22, γ23, γ27, γ28, γ31, γ34, γ36, γ38, γ39, γ42, γ43

0.30% γ1, γ8, γ12, γ13γ18, γ22, γ23, γ27, γ28, γ31, γ32, γ33, γ34, γ36, γ39, γ42,0.29% γ1, γ8, γ12, γ13γ18, γ21, γ22, γ23, γ27, γ28, γ31, γ32, γ33, γ34, γ36, γ39, γ42,0.29% γ15γ19, γ22, γ23, γ24, γ31, γ39, γ42,

variables selected more than 50% times on the average for all model

γ12, γ13, γ15, γ18, γ22, γ23, γ27, γ28, γ31, γ34, γ39, γ42

131

Page 132: BAYESIAN METHODOLOGIES FOR GENOMIC DATA WITH …ufdcimages.uflib.ufl.edu/UF/E0/02/24/60/00001/li_z.pdfChair: George Casella Major: Statistics With advancing technology, large single

CHAPTER 4SUMMARY AND FUTURE WORK

4.1 Summary

Studies of the association between the candidate SNPs and phenotypic traits have

been a rapid developing research area. Most research has been focused on the human

genetics association by taking advantage of the sequenced human genome. For a not

fully sequenced genomes, not much progress has been made and suitable methods are

developing. In this dissertation, we proposed methods to test the SNP effects and also

select the groups of SNPs which are interactively responsible for the phenotypic traits,

and meanwhile we proposed methods to properly handle the missing values. This method

is better than other methods as it is a simultaneous solution, handles missing values,

provides unbiased estimation, has the flexibility of multiple imputations.

In Chapter 2, we proposed a Bayesian hierarchical model for the data structure. We

took advantage of the population structure by calculating a numerator relationship matrix,

which quantifies the covariance between two individual loblolly pines in this population.

For missing data, we imputed the possible values according to the posterior distribution

of the missing values. That is, after adjusting the observed values, we used the posterior

distribution of the missing values instead of the one mean value substitution or multiple

value averages.

One novel contribution is we proposed a nonstandard Gibbs sampler procedure and

proved that this Gibbs procedure ensures the target stationary distribution. By employing

this proposed procedure, the computation speed is dramatically increased by decreasing

the number of imputed columns in each updating cycle. The power of this procedure

will be more obvious when there are more observed SNPs in the data set and so this is

a desirable property since we will have many more SNPs toward the end of ADEPT 2

project.

132

Page 133: BAYESIAN METHODOLOGIES FOR GENOMIC DATA WITH …ufdcimages.uflib.ufl.edu/UF/E0/02/24/60/00001/li_z.pdfChair: George Casella Major: Statistics With advancing technology, large single

In Chapter 3, we proposed a Bayesian variable selection method to select the good

subsets of variables, SNPs. The Bayes factor was used as a model comparison criterion

and we employed a hybrid stochastic search algorithm to search the good subsets in the

model sample space. We made a novel contribution by proposing a consistent Bayes factor

approximation for models of different dimensions, while properly handled the missing

values. We also proved that the proposed Metropolis-Hastings search algorithm with

approximated Bayes factors still has the ergodic property.

In terms of computation, we took advantage of the Bayes factor approximation

formula and the special Gibbs sampler updating procedure by applying the Sherman-

Morrison-Woodbury formula for our situation. The method decreases the matrix inversion

20 times for a matrix with dimension 2000 × 2000. In the end of the dissertation, we

also proposed to use Z to replace the imputed values for the calculation of Bayes factors.

We showed that it gives accurate estimates and meanwhile fundamentally scales up the

procedure.

4.2 Future Work

We tested the significance of SNP effect in Chapter 2. As we only had 44 SNPs

as candidates in the data set and we were hoping to discover some SNPs to further

investigate, we did not adjust the multiple tests. If we want, we could do some standard

adjustment, such as Bonferroni’s adjustment. We anticipate that our method will be

applied to thousands of SNPs when the data set is available. So one area we need to

further investigate is the adjustment for multiple tests, and meanwhile attention need

to be paid that these tests are not necessary independent. The multiple test adjustment

should be able to balance the power and false discovery rates. One possible direction

might be permutation studies.

In chapter 3, we used a stochastic search algorithm to search the good group of

subsets and we spend much time in speeding up the computation. The bottleneck of

the computation is due to the Bayes factor estimates, especially the many inversions

133

Page 134: BAYESIAN METHODOLOGIES FOR GENOMIC DATA WITH …ufdcimages.uflib.ufl.edu/UF/E0/02/24/60/00001/li_z.pdfChair: George Casella Major: Statistics With advancing technology, large single

and determinant calculation which involve missing SNPs. We proposed to replace the

imputed SNPs with the average of the imputed SNPs and we gave some justification for

it. Other improvements are possible here and certainly it would be an interesting topic as

computation is always a problem for variable selection.

Another future work topic is to investigate other prior specifications for our model in

Chapter 2 and Chapter 3. We used the conjugate priors and that ensures the conditionals

to be standard distributions. We would like to see the difference of the theoretical results

and the difference of real data analysis when different priors are applied, such as g priors,

or intrinsic priors.

134

Page 135: BAYESIAN METHODOLOGIES FOR GENOMIC DATA WITH …ufdcimages.uflib.ufl.edu/UF/E0/02/24/60/00001/li_z.pdfChair: George Casella Major: Statistics With advancing technology, large single

APPENDIX AERGODICITY OF GIBBS SAMPLING WHEN UPDATING Z MATRIX BY COLUMNS

Suppose we are running a Gibbs sampler in which one cycle consists of updating

(X,Y, Z), and the Zn×p matrix can be decomposed as (Z1, Z2, . . . , Zp). Standard Gibbs

sampling updates Zn×p all together in one cycle, in contrast, we can just systematically

update one column of Z at each cycle. By updating one column of Z in each iteration, the

computation time can decrease dramatically, especially when the number of columns of the

Z matrix is in the order of hundreds or thousands.

First let us write down our updating scheme. Suppose we have starting value

(X(0), Y (0), Z(0)1 , ..., Z(0)

p ).

In the first cycle, first, conditional on

(Y (0), Z(0)1 , ..., Z(0)

p ) update X(0) and get (X(1), Y (0), Z(0)1 , ..., Z(0)

p ).

Then conditioned on

(X(1), Z(0)1 , ..., Z(0)

p ), update Y (0) and get (X(1), Y (1), Z(0)1 , ..., Z(0)

p ).

Finally conditioned on

(X(1), Y (1), Z(0)2 , ..., Z(0)

p ), update Z(0)1 and get (X(1), Y (1), Z

(1)1 , Z

(0)2 , ..., Z(0)

p ).

The above is one cycle of the Gibbs sampling. Follow the same fashion and we will update

(X(1), Y (1), Z(0)2 ) in the second cycle. Continue doing that, after pth cycle, we will get

(X(p), Y (p), Z(1)1 , Z

(1)2 , ..., Z(1)

p ).

135

Page 136: BAYESIAN METHODOLOGIES FOR GENOMIC DATA WITH …ufdcimages.uflib.ufl.edu/UF/E0/02/24/60/00001/li_z.pdfChair: George Casella Major: Statistics With advancing technology, large single

To prove that the above updating scheme conserves the ergodicity property, we will

show first that the kernel of the Gibbs sampler satisfies the target stationary distribu-

tion condition and, furthermore, we will show that with any initial condition the chain

converges to the target stationary distribution.

To prove that the Gibbs sampler has the stationary distribution, we need to show

that the following equation holds:

\[
\begin{aligned}
f\bigl(X^{(p)}, Y^{(p)}, Z_1^{(1)}, \ldots, Z_p^{(1)}\bigr)
={}& \int \cdots \int f\bigl(X^{(1)} \mid Y^{(0)}, Z_1^{(0)}, \ldots, Z_p^{(0)}\bigr)\,
f\bigl(Y^{(1)} \mid X^{(1)}, Z_1^{(0)}, \ldots, Z_p^{(0)}\bigr) \\
&\times f\bigl(Z_1^{(1)} \mid X^{(1)}, Y^{(1)}, Z_2^{(0)}, \ldots, Z_p^{(0)}\bigr)\,
f\bigl(X^{(2)} \mid Y^{(1)}, Z_1^{(1)}, Z_2^{(0)}, \ldots, Z_p^{(0)}\bigr) \\
&\times \cdots \times f\bigl(Y^{(p)} \mid X^{(p)}, Z_1^{(1)}, \ldots, Z_{p-1}^{(1)}, Z_p^{(0)}\bigr)\,
f\bigl(Z_p^{(1)} \mid X^{(p)}, Y^{(p)}, Z_1^{(1)}, \ldots, Z_{p-1}^{(1)}\bigr) \\
&\times f\bigl(X^{(0)}, Y^{(0)}, Z_1^{(0)}, \ldots, Z_p^{(0)}\bigr)\,
dX^{(0)}\, dY^{(0)}\, dZ_1^{(0)} \cdots dZ_p^{(0)}\, dX^{(1)}\, dY^{(1)} \cdots dX^{(p-1)}\, dY^{(p-1)}.
\end{aligned}
\tag{A--1}
\]

First we integrate $X^{(0)}$ and $Y^{(0)}$ out; the right-hand side of (A--1) then becomes:

\[
\begin{aligned}
f\bigl(X^{(p)}, Y^{(p)}, Z_1^{(1)}, \ldots, Z_p^{(1)}\bigr)
={}& \int \cdots \int f\bigl(X^{(1)}, Z_1^{(0)}, \ldots, Z_p^{(0)}\bigr)\,
f\bigl(Y^{(1)} \mid X^{(1)}, Z_1^{(0)}, \ldots, Z_p^{(0)}\bigr) \\
&\times f\bigl(Z_1^{(1)} \mid X^{(1)}, Y^{(1)}, Z_2^{(0)}, \ldots, Z_p^{(0)}\bigr)\,
f\bigl(X^{(2)} \mid Y^{(1)}, Z_1^{(1)}, Z_2^{(0)}, \ldots, Z_p^{(0)}\bigr) \\
&\times \cdots \times f\bigl(Y^{(p)} \mid X^{(p)}, Z_1^{(1)}, \ldots, Z_{p-1}^{(1)}, Z_p^{(0)}\bigr)\,
f\bigl(Z_p^{(1)} \mid X^{(p)}, Y^{(p)}, Z_1^{(1)}, \ldots, Z_{p-1}^{(1)}\bigr) \\
&\quad dZ_1^{(0)}\, dZ_2^{(0)} \cdots dZ_p^{(0)}\, dX^{(1)}\, dY^{(1)} \cdots dX^{(p-1)}\, dY^{(p-1)}.
\end{aligned}
\tag{A--2}
\]


Next we integrate $Z_1^{(0)}$ out and have:

\[
\begin{aligned}
f\bigl(X^{(p)}, Y^{(p)}, Z_1^{(1)}, \ldots, Z_p^{(1)}\bigr)
={}& \int \cdots \int f\bigl(X^{(1)}, Y^{(1)}, Z_2^{(0)}, \ldots, Z_p^{(0)}\bigr)\,
f\bigl(Z_1^{(1)} \mid X^{(1)}, Y^{(1)}, Z_2^{(0)}, \ldots, Z_p^{(0)}\bigr) \\
&\times f\bigl(X^{(2)} \mid Y^{(1)}, Z_1^{(1)}, Z_2^{(0)}, \ldots, Z_p^{(0)}\bigr)
\cdots f\bigl(Y^{(p)} \mid X^{(p)}, Z_1^{(1)}, \ldots, Z_{p-1}^{(1)}, Z_p^{(0)}\bigr) \\
&\times f\bigl(Z_p^{(1)} \mid X^{(p)}, Y^{(p)}, Z_1^{(1)}, \ldots, Z_{p-1}^{(1)}\bigr)\,
dZ_2^{(0)} \cdots dZ_p^{(0)}\, dX^{(1)}\, dY^{(1)} \cdots dX^{(p-1)}\, dY^{(p-1)}.
\end{aligned}
\tag{A--3}
\]

Continuing by integrating $X^{(1)}$ and $Y^{(1)}$ out, we have:

\[
\begin{aligned}
f\bigl(X^{(p)}, Y^{(p)}, Z_1^{(1)}, \ldots, Z_p^{(1)}\bigr)
={}& \int \cdots \int f\bigl(X^{(2)}, Z_1^{(1)}, Z_2^{(0)}, \ldots, Z_p^{(0)}\bigr)
\cdots f\bigl(Y^{(p)} \mid X^{(p)}, Z_1^{(1)}, \ldots, Z_{p-1}^{(1)}, Z_p^{(0)}\bigr) \\
&\times f\bigl(Z_p^{(1)} \mid X^{(p)}, Y^{(p)}, Z_1^{(1)}, \ldots, Z_{p-1}^{(1)}\bigr)\,
dZ_2^{(0)} \cdots dZ_p^{(0)}\, dX^{(2)}\, dY^{(2)} \cdots dX^{(p-1)}\, dY^{(p-1)}.
\end{aligned}
\tag{A--4}
\]

Repeating similar integrations, we further get:

\[
\begin{aligned}
f\bigl(X^{(p)}, Y^{(p)}, Z_1^{(1)}, \ldots, Z_p^{(1)}\bigr)
={}& \int\!\!\int\!\!\int f\bigl(X^{(p-1)}, Y^{(p-1)}, Z_1^{(1)}, \ldots, Z_{p-1}^{(1)}, Z_p^{(0)}\bigr)\,
f\bigl(X^{(p)} \mid Y^{(p-1)}, Z_1^{(1)}, \ldots, Z_{p-1}^{(1)}, Z_p^{(0)}\bigr) \\
&\times f\bigl(Y^{(p)} \mid X^{(p)}, Z_1^{(1)}, \ldots, Z_{p-1}^{(1)}, Z_p^{(0)}\bigr)\,
f\bigl(Z_p^{(1)} \mid X^{(p)}, Y^{(p)}, Z_1^{(1)}, \ldots, Z_{p-1}^{(1)}\bigr)\,
dX^{(p-1)}\, dY^{(p-1)}\, dZ_p^{(0)}.
\end{aligned}
\tag{A--5}
\]

Now, carrying out the final integrations, we have

\[
f\bigl(X^{(p)}, Y^{(p)}, Z_1^{(1)}, \ldots, Z_p^{(1)}\bigr)
= f\bigl(X^{(p)}, Y^{(p)}, Z_1^{(1)}, Z_2^{(1)}, \ldots, Z_p^{(1)}\bigr).
\]

The above proof shows that the chain $(X^{(pt)}, Y^{(pt)}, Z_1^{(t)}, Z_2^{(t)}, \ldots, Z_p^{(t)})$, $t = 1, 2, 3, \ldots$, has the target distribution as a stationary distribution.

Above, we showed that the Gibbs chain updating one column of $Z$ per cycle preserves the target stationary distribution. To prove that the chain is ergodic, we can cite Theorem 10.10 of Robert and Casella (2004), which states:

If the transition kernel of the Gibbs chain $(X^{(pt)}, Y^{(pt)}, Z_1^{(t)}, \ldots, Z_p^{(t)})$ is absolutely continuous with respect to the dominating measure $\mu$ and, in addition, the chain is aperiodic, then for every initial distribution the Gibbs chain converges to the target stationary distribution.

In our situation the dominating measure $\mu$ is clearly the product of Lebesgue measure with counting measures, and all the conditional distributions are absolutely continuous with respect to it. All our conditionals are aperiodic. So the chain $(X^{(pt)}, Y^{(pt)}, Z_1^{(t)}, \ldots, Z_p^{(t)})$ is ergodic.

Now consider the marginal distribution of the subvector $(X^{(t)}, Y^{(t)})$. We will show that this marginal distribution satisfies the stationarity condition; that is, we want to show that the following equation holds:

\[
\begin{aligned}
f\bigl(X^{(1)}, Y^{(1)}\bigr)
={}& \int\!\!\int \Bigl[\, \int f\bigl(X^{(0)}, Y^{(0)}, Z_1^{(0)}, \ldots, Z_p^{(0)}\bigr)\,
f\bigl(X^{(1)} \mid Y^{(0)}, Z_1^{(0)}, \ldots, Z_p^{(0)}\bigr) \\
&\times f\bigl(Y^{(1)} \mid X^{(1)}, Z_1^{(0)}, \ldots, Z_p^{(0)}\bigr)\,
f\bigl(Z_1^{(1)} \mid X^{(1)}, Y^{(1)}, Z_2^{(0)}, \ldots, Z_p^{(0)}\bigr)\,
dZ_1^{(0)} \cdots dZ_p^{(0)} \Bigr]\, dZ_1^{(1)}\, dX^{(0)}\, dY^{(0)}.
\end{aligned}
\tag{A--6}
\]

To show this, we evaluate the right-hand side of (A--6) and verify that it equals the left-hand side. The calculations are as follows:


\[
\begin{aligned}
&\int\!\!\int \Bigl[\, \int f\bigl(X^{(0)}, Y^{(0)}, Z_1^{(0)}, \ldots, Z_p^{(0)}\bigr)\,
f\bigl(X^{(1)} \mid Y^{(0)}, Z_1^{(0)}, \ldots, Z_p^{(0)}\bigr)\,
f\bigl(Y^{(1)} \mid X^{(1)}, Z_1^{(0)}, \ldots, Z_p^{(0)}\bigr) \\
&\qquad\times f\bigl(Z_1^{(1)} \mid X^{(1)}, Y^{(1)}, Z_2^{(0)}, \ldots, Z_p^{(0)}\bigr)\,
dZ_1^{(0)} \cdots dZ_p^{(0)} \Bigr]\, dZ_1^{(1)}\, dX^{(0)}\, dY^{(0)} \\
={}& \int \cdots \int f\bigl(Y^{(0)}, Z_1^{(0)}, \ldots, Z_p^{(0)}\bigr)\,
f\bigl(X^{(1)} \mid Y^{(0)}, Z_1^{(0)}, \ldots, Z_p^{(0)}\bigr)\,
f\bigl(Y^{(1)} \mid X^{(1)}, Z_1^{(0)}, \ldots, Z_p^{(0)}\bigr) \\
&\qquad\times f\bigl(Z_1^{(1)} \mid X^{(1)}, Y^{(1)}, Z_2^{(0)}, \ldots, Z_p^{(0)}\bigr)\,
dZ_1^{(0)} \cdots dZ_p^{(0)}\, dZ_1^{(1)}\, dY^{(0)}
\quad \text{(integrating $X^{(0)}$ out)} \\
={}& \int \cdots \int f\bigl(X^{(1)}, Z_1^{(0)}, \ldots, Z_p^{(0)}\bigr)\,
f\bigl(Y^{(1)} \mid X^{(1)}, Z_1^{(0)}, \ldots, Z_p^{(0)}\bigr)\,
f\bigl(Z_1^{(1)} \mid X^{(1)}, Y^{(1)}, Z_2^{(0)}, \ldots, Z_p^{(0)}\bigr)\,
dZ_1^{(0)} \cdots dZ_p^{(0)}\, dZ_1^{(1)}
\quad \text{(integrating $Y^{(0)}$ out)} \\
={}& \int \cdots \int f\bigl(X^{(1)}, Y^{(1)}, Z_2^{(0)}, \ldots, Z_p^{(0)}\bigr)\,
f\bigl(Z_1^{(1)} \mid X^{(1)}, Y^{(1)}, Z_2^{(0)}, \ldots, Z_p^{(0)}\bigr)\,
dZ_2^{(0)} \cdots dZ_p^{(0)}\, dZ_1^{(1)}
\quad \text{(integrating $Z_1^{(0)}$ out)} \\
={}& f\bigl(X^{(1)}, Y^{(1)}\bigr).
\end{aligned}
\tag{A--7}
\]

The above derivation shows that the chain

$(X^{(t)}, Y^{(t)}),\ t = 1, 2, \cdots$

converges to the marginal distribution $f(X, Y)$. So if we are interested in estimates of $X$ and $Y$ only, it is legitimate to use the samples $(X^{(t)}, Y^{(t)})$, $t = 1, 2, \cdots$, instead of $(X^{(pt)}, Y^{(pt)})$, $t = 1, 2, \cdots$.

In all, compared with the standard Gibbs updating algorithm, we are able to update the missing SNPs, $Z$, less frequently and still retain the desired properties.
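In code, this means the per-cycle draws of $(X, Y)$ from the sketch given earlier can be used directly. A hypothetical usage example, where the burn-in length is an arbitrary illustrative choice:

\begin{verbatim}
# Continuing the earlier sketch: (X, Y) draws may be used from every cycle;
# there is no need to keep only every p-th cycle (a full sweep of Z).
draws = column_wise_gibbs(x0, y0, Z0, draw_x, draw_y, draw_z_col, n_cycles=5000)
burn_in = 500  # arbitrary illustrative choice
x_hat = np.mean([x for x, y, Z in draws[burn_in:]], axis=0)  # estimate of E[X]
\end{verbatim}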


APPENDIX B
AN ALGORITHM FOR CALCULATING THE NUMERATOR RELATIONSHIP MATRIX R

The calculation algorithm is due to Henderson (1976) and Quaas (1976). The individuals within the 61 families and the parents of the 61 families are ordered together such that the first $1, \ldots, a$ subjects are unrelated and serve as a "base" population. Let the total number of subjects within families, plus parents of the 61 families, be $n$; we then obtain a numerator relationship matrix of dimension $n \times n$. As the first $a$ subjects (a subset of the parents of the 61 families) are unrelated, the upper-left $a \times a$ submatrix of the numerator relationship matrix is the identity matrix $I$. This identity submatrix is expanded iteratively until the matrix reaches dimension $n \times n$.

Since the submatrix of the numerator relationship matrix for the first $a$ unrelated subjects is the identity, we next give the details of how to calculate the remaining entries for the related subjects. Consider the $j$th and the $i$th subjects in the above ordering.

1. If both parents of the $j$th individual are known, say $g$ and $h$, then
\[
R_{ji} = R_{ij} = 0.5\,(R_{ig} + R_{ih}), \quad i = 1, \ldots, j-1;
\qquad R_{jj} = 1 + 0.5\,R_{gh},
\]
where $R_{ji}$ is the entry of the numerator relationship matrix in the $j$th row and $i$th column.

2. If only one parent of the $j$th subject is known, say $g$, then
\[
R_{ji} = R_{ij} = 0.5\,R_{ig}, \quad i = 1, \ldots, j-1;
\qquad R_{jj} = 1.
\]

3. If neither parent of the $j$th subject is known, then
\[
R_{ji} = R_{ij} = 0, \quad i = 1, \ldots, j-1;
\qquad R_{jj} = 1.
\]

A code sketch of this recursion is given below.
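The following is a minimal sketch of the recursion, assuming the pedigree is supplied as two parent-index arrays ordered so that parents precede offspring; the function name and interface are illustrative, not taken from the original analysis code.

\begin{verbatim}
import numpy as np

def numerator_relationship(sire, dam):
    """Henderson's recursion for the numerator relationship matrix R.

    sire[j] and dam[j] are 0-based indices of the parents of subject j,
    or None when a parent is unknown; subjects must be ordered so that
    parents precede their offspring (the unrelated base subjects first).
    """
    n = len(sire)
    R = np.zeros((n, n))
    for j in range(n):
        g, h = sire[j], dam[j]
        for i in range(j):
            if g is not None and h is not None:    # case 1: both parents known
                R[j, i] = R[i, j] = 0.5 * (R[i, g] + R[i, h])
            elif g is not None:                    # case 2: only one parent known
                R[j, i] = R[i, j] = 0.5 * R[i, g]
            elif h is not None:
                R[j, i] = R[i, j] = 0.5 * R[i, h]
            # case 3: neither parent known -> R[j, i] stays 0
        R[j, j] = 1.0
        if g is not None and h is not None:
            R[j, j] += 0.5 * R[g, h]
    return R

# Two unrelated founders (0, 1) and their full-sib offspring (2, 3):
R = numerator_relationship([None, None, 0, 0], [None, None, 1, 1])
# R[2, 3] == 0.5 (full sibs); R[2, 2] == 1 since the founders are unrelated.
\end{verbatim}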

For the loblolly pine data, we have 44 pines acting as grandparents, and they produce 61 pine families. The 61 families contain 888 individual pine trees altogether, also called clones. The phenotypic responses are taken from the individual clones, so our interest is in calculating the relationship matrix for the 888 clones, which has dimension 888 × 888. Following Henderson's method, we ordered the 44 grandparent pines and the 888 individual pines together such that the first $a$ pines are unrelated. Starting from the $(a + 1)$th pine, we applied the above recursive calculation, and in the end obtained a relationship matrix of dimension 932 × 932 for all the grandparent pines and all the individual clones. We then took the bottom-right 888 × 888 submatrix of this matrix; it is the numerator relationship matrix we used in the loblolly pine data analysis.
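In terms of the sketch above, and with hypothetical sire and dam arrays for the 932 ordered pines (the 44 grandparents first), the extraction of the clones-only block would look like:

\begin{verbatim}
R_full = numerator_relationship(sire, dam)  # 932 x 932, grandparents ordered first
R_clones = R_full[44:, 44:]                 # bottom-right 888 x 888 block for the clones
\end{verbatim}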


REFERENCES

Akaike, H. (1973). Information theory and an extension of the maximum likelihood principle. In B. N. Petrov and F. Csaki (Eds.), Second International Symposium on Information Theory, pp. 267–281. Budapest: Akademiai Kiado.

Allison, P. D. (2002). Missing Data. Thousand Oaks, CA: Sage.

Balding, D. (2006). A tutorial on statistical methods for population association studies.Nature Reviews Genetics 7, 781.

Barnard, J. and D. B. Rubin (1999). Small sample degrees of freedom with multiple imputation. Biometrika, 949–955.

Bartlett, M. S. (1951). An inverse matrix adjustment arising in discriminant analysis. Annals of Mathematical Statistics 22, 107–111.

Berger, J. and L. Pericchi (2001). Objective Bayesian methods for model selection: Introduction and comparison (with discussion). Model Selection, 135–207.

Boyles, A. L., W. Scott, E. Martin, S. Schmidt, Y. J. Li, A. Ashley-Koch, M. P. Bass, M. Schmidt, M. A. Pericak-Vance, M. C. Speer, and E. R. Hauser (2005). Linkage disequilibrium inflates type I error rates in multipoint linkage analysis when parental genotypes are missing. Human Heredity 59, 220–227.

Brown, P. J., M. Vannucci, and T. Fearn (1998). Multivariate Bayesian variable selection and prediction. Journal of the Royal Statistical Society, Series B, 627–641.

Carlin, B. P. and S. Chib (1995). Bayesian model choice via Markov chain Monte Carlo methods. Journal of the Royal Statistical Society, Series B, 473–484.

Casella, G., F. J. Giron, M. L. Martinez, and E. Moreno. Consistency of Bayesian procedures for variable selection. The Annals of Statistics (to appear).

Casella, G. and E. Moreno (2006). Objective Bayesian variable selection. Journal of the American Statistical Association, 157–167.

Chen, W. M. and G. R. Abecasis (2007). Family-based association tests for genomewide association scans. The American Journal of Human Genetics 81, 913–926.

Chib, S. (1995). Marginal likelihood from the Gibbs output. Journal of the American Statistical Association, 1313–1321.

Cui, W. and E. I. George (2008). Empirical Bayes versus fully Bayes variable selection. Journal of Statistical Planning and Inference, 888–900.

Dai, J. Y., I. Ruczinski, M. LeBlanc, and C. Kooperberg (2006). Imputation methods to improve inference in SNP association studies. Genetic Epidemiology 30, 690–702.


Dellaportas, P., J. J. Forster, and I. Ntzoufras (2002). Bayesian model choice via Markov chain Monte Carlo methods. Statistics and Computing, 27–36.

Dempster, A. P., N. Laird, and D. B. Rubin (1977). Maximum likelihood estimation from incomplete data via the EM algorithm (with discussion). Journal of the Royal Statistical Society, Series B 39, 1–38.

Efron, B. (1994). Missing data imputation and the bootstrap. Journal of the AmericanStatistical Association, 463–474.

George, E. I. and D. Foster (1994). The risk inflation criterion for multiple regression.Annals of Statistics , 1947–1975.

George, E. I. and R. E. McCulloch (1993). Variable selection via Gibbs sampling. Journal of the American Statistical Association, 881–889.

George, E. I. and R. E. McCulloch (1995). Stochastic search variable selection. Markov Chain Monte Carlo in Practice, 203–214.

George, E. I. and R. E. McCulloch (1997). Approaches to Bayesian variable selection. Statistica Sinica, 339–379.

Green, P. J. (1995). Reversible jump Markov chain Monte Carlo computation and Bayesian model determination. Biometrika, 711–732.

Hager, W. W. (1989). Updating the inverse of a matrix. SIAM Review 31, 221–239.

Henderson, C. R. (1976). A simple method for computing the inverse of a numeratorrelationship matrix used in prediction of breeding values. Biometrics 39, 69–83.

Hobert, J. and G. Casella (1996). The effect of improper priors on Gibbs sampling in hierarchical linear mixed models. Journal of the American Statistical Association 91, 1461–1473.

Kayihan, G. C., D. A. Huber, A. M. Morse, T. T. White, and J. M. Davis (2005). Genetic dissection of fusiform rust and pitch canker disease traits in loblolly pine. Theoretical and Applied Genetics 110, 948–958.

Li, K. H., T. E. Raghunathan, and D. B. Rubin (1991). Large-sample significance levels from multiply imputed data using moment-based statistics and an F reference distribution. Journal of the American Statistical Association, 1065–1073.

Little, R. J. A. and D. B. Rubin (1987). Statistical Analysis with Missing Data. New York: Wiley & Sons.

Mallows, C. (1973). Some comments on Cp. Technometrics 15, 661–675.

Marchini, J., B. Howie, S. Myers, G. McVean, and P. Donnelly (2007). A new multipointmethod for genome-wide association studies by imputation of genotypes. NatureGenetics 39, doi:10.1038/ng2088.


Martin, E. R., M. P. Bass, E. Hauser, and N. L. Kaplan (2003). Accounting for linkage infamily-based tests of association with missing parental genotypes. American Journal ofHuman Genetics 73, 1016–1026.

McKeever, D. B. and J. L. Howard (1996). Value of timber and agricultural products in the United States, 1991. Forest Products Journal 46, 45–50.

Meng, X. L. and D. B. Rubin (1992). Performing likelihood ratio tests with multiply-imputed data sets. Biometrika, 103–111.

Meng, X. L. and W. H. Wong (1996). Simulating ratios of normalizing constants via a simple identity: A theoretical exploration. Statistica Sinica 6, 831–860.

Meyn, S. P. and R. Tweedie (2008). Markov Chains and Stochastic Stability. New York:Springer-Verlag.

Miller, K. S. (1981). On the inverse of the sum of matrices. Mathematics Magazine 54,67–72.

Mitchell, T. J. and J. J. Beauchamp (1988). Bayesian variable selection in linear regression. Journal of the American Statistical Association, 1023–1032.

Quaas, R. L. (1976). Computing the diagonal elements and inverse of a large numerator relationship matrix. Biometrics 32, 949–953.

Reilly, M. (1993). Data analysis using hot deck multiple imputation. The Statistician, 307–313.

Robert, C. P. and G. Casella (2004). Monte Carlo Statistical Methods. New York: Springer.

Roberts, A., L. McMillan, W. Wang, J. Parker, I. Rusyn, and D. Threadgill (2007). Inferring missing genotypes in large SNP panels using fast nearest-neighbor searches over sliding windows. Bioinformatics 23, i401–i407.

Rubin, D. B. (1978). Multiple imputations in sample surveys: a phenomenological Bayesian approach to nonresponse. Journal of the American Statistical Association, 20–34.

Rubin, D. B. (1987). Multiple Imputation for Nonresponse in Surveys. New York: Wiley &Sons.

Sadighi, I. and P. K. Kalra (1988). Approaches for updating the matrix inverse for controlsystem problems with special reference to row or column perturbation. Electric PowerSystems Research 14, 137–147.

Schafer, J. L. (1997). Analysis of Incomplete Multivariate Data. London: Chapman &Hall.


Scheet, P. and M. Stephens (2006). A fast and flexible statistical model for large-scale population genotype data: applications to inferring missing genotypes and haplotypic phase. The American Journal of Human Genetics 78, 629–644.

Schwarz, G. (1978). Estimating the dimension of a model. Annals of Statistics 6, 461–464.

Servin, B. and M. Stephens (2007). Imputation-based analysis of association studies: candidate regions and quantitative traits. PLoS Genetics 3, e114.

Smith, A. F. M. and D. J. Spiegelhalter (1980). Bayes factors and choice criteria for linear models. Journal of the Royal Statistical Society, Series B, 213–220.

Stephens, M., N. J. Smith, and P. Donnelly (2001). A new statistical method for haplotype reconstruction from population data. The American Journal of Human Genetics 68, 978–989.

Sun, Y. V. and S. L. Kardia (2008). Imputing missing genotypic data of single-nucleotidepolymorphisms using neural networks. European Journal of Human Genetics 16,487–495.

Tanner, M. A. and W. H. Wong (1987). The calculation of posterior distributions by dataaugmentation. Journal of the American Statistical Association, 528–540.

Wear, D. N. and J. G. Greis (2002). Southern forest resource assessment: Summary offindings. Journal of Forestry 100, 6–14.

Weinberg, C. R. (1999). Allowing for missing parents in genetic studies of case-parenttriads. American Journal of Human Genetics 64, 1186–1193.

Woodbury, M. (1950). Inverting modified matrices. Memorandum Rept. 42, Statistical Research Group, Princeton University.

Wu, C. F. J. (1983). On the convergence properties of the em algorithm. The Annals ofStatistics , 95–103.

Xie, F. and M. Paik (1997). Multiple imputation methods for the missing covariates in generalized estimating equations. Biometrics, 1538–1546.

Yu, J. M., G. Pressoir, W. H. Briggs, I. V. Bi, M. Yamasaki, J. Doebley, M. D. McMullen, B. S. Gaut, D. M. Nielsen, J. B. Holland, S. Kresovich, and E. S. Buckler (2006). A unified mixed-model method for association mapping that accounts for multiple levels of relatedness. Nature Genetics 38, 203–208.


BIOGRAPHICAL SKETCH

Zhen Li was born in Jiangsu, China, the second of the two daughters of Ping Li and Xuemei Tang. She received a bachelor's degree in mechanical engineering from Nanjing University of Aeronautics and Astronautics in 2001. After that, she enrolled in a master's program at Shanghai Jiaotong University. In 2004, she was admitted to the Statistics Department of the University of Florida. She received a master's degree in statistics in 2006 and expects to receive a Ph.D. degree in statistics in 2008.
